February 23, 2017 at 9:09 am
Hi all
To keep the example simple, let's say for the sake of argument I have a large CSV file of, say, 500,000 rows and 80 columns of mostly text data.
I have been working on a process to create an XLSX file using SpreadsheetML. This works fine until the amount of data ramps up and I run out of memory.
The main problem is creating a unique array of all of the text data for sharedStrings.xml.
I just wondered if there was an efficient way to use SQL to dedupe those 40 million values.
I can't simply use COM automation and get Excel to do the work for me, as this process has to run unattended on a server, and the Microsoft recommendation for this sort of thing is to use OpenXML.
Thanks in advance for any ideas
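For illustration, here is a minimal T-SQL sketch of the dedupe step being asked about, assuming the CSV has already been bulk-loaded into a staging table (dbo.Staging and Col01..Col80 are hypothetical names): unpivot the 80 columns into one, keep the distinct values, and number them for use as sharedStrings indexes.

-- Minimal sketch, assuming the CSV has already been bulk-loaded into a
-- staging table. dbo.Staging and Col01..Col80 are hypothetical names.
-- Unpivot the 80 text columns into a single column, keep one row per
-- distinct value, and number the values as 0-based sharedStrings indexes.
SELECT  ROW_NUMBER() OVER (ORDER BY v.Value) - 1 AS StringIndex,
        v.Value
INTO    dbo.SharedStrings
FROM    dbo.Staging AS s
CROSS APPLY (VALUES (s.Col01), (s.Col02), (s.Col03)   -- ...continue through (s.Col80)
            ) AS v(Value)
WHERE   v.Value IS NOT NULL
GROUP BY v.Value;

Doing the collapse with CROSS APPLY (VALUES ...) keeps everything set-based in the engine, so the working memory problem moves from the client process to SQL Server's tempdb and sort machinery.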
February 23, 2017 at 9:12 am
Oh, just to add: I have tried using the Microsoft Scripting.Dictionary class, but it is so sloooow and of course still consumes a large amount of memory.
February 23, 2017 at 9:38 am
You say "dedupe these 40 million values"; does this mean that the 500,000 rows of data / 80 columns are to be deduped horizontally as well as vertically? Or that each of those 40M values is a distinct item to be considered for deduplication? Or are you just (hah!) looking at it by unique rows?
Thomas Rushton
blog: https://thelonedba.wordpress.com
February 23, 2017 at 9:42 am
ThomasRushton - Thursday, February 23, 2017 9:38 AM
You say "dedupe these 40 million values"; does this mean that the 500,000 rows of data / 80 columns are to be deduped horizontally as well as vertically? Or that each of those 40M values is a distinct item to be considered for deduplication? Or are you just (hah!) looking at it by unique rows?
The former, as I need to build the sharedStrings.xml file from it.
https://msdn.microsoft.com/en-us/library/office/gg278314.aspx
February 23, 2017 at 11:14 am
andrew 67979 - Thursday, February 23, 2017 9:42 AM
ThomasRushton - Thursday, February 23, 2017 9:38 AM
You say "dedupe these 40 million values"; does this mean that the 500,000 rows of data / 80 columns are to be deduped horizontally as well as vertically? Or that each of those 40M values is a distinct item to be considered for deduplication? Or are you just (hah!) looking at it by unique rows?
The former, as I need to build the sharedStrings.xml file from it.
https://msdn.microsoft.com/en-us/library/office/gg278314.aspx
Okay, ... so WHY are you "re-inventing the wheel" when you could import the CSV file using SSIS and write it out in spreadsheet form, without having to programmatically write out all the actual bytes on your own?
When you said "the former", I interpret that to mean you were hoping for a way to remove duplicate values from a set composed of all values in all columns. As SSIS would have to do this in order to create the spreadsheet, why do you think you need to do it manually? I'd be seeking to understand the larger objective... What's the purpose of doing this on your own? What have you already tried, and why?
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
February 23, 2017 at 11:16 am
If it were me, I'd just import the file into a table, hit it with the SQL hammer, convert the deduplicated data to XML, and export it.
--Jeff Moden
Change is inevitable... Change for the better is not.
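To illustrate the kind of thing Jeff is describing, here is a hedged sketch of the "convert the deduplicated data to XML" step, reusing the hypothetical dbo.SharedStrings table from the earlier sketch. FOR XML PATH gives the sst/si/t shape that sharedStrings.xml uses, although the real file also needs the spreadsheetml namespace and the count/uniqueCount attributes before it goes into the .xlsx package.

-- Sketch only: emit <sst><si><t>value</t></si>...</sst> from the
-- deduplicated values. dbo.SharedStrings is the hypothetical table
-- built earlier; the spreadsheetml namespace and the count/uniqueCount
-- attributes still have to be added when the file is written out.
SELECT  Value AS [t]
FROM    dbo.SharedStrings
ORDER BY StringIndex
FOR XML PATH('si'), ROOT('sst');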