Parsing an unstructured CSV file

  • I have a CSV file with roughly 6 million rows. The file is unstructured; that is, some rows have 5 fields, others have 15, and some have as many as 50 fields.

    I am using bulk insert to read the entire file into a table in the database, with each file row becoming one database record. That leaves me with one column that contains an entire row of comma-delimited fields. All fields are character strings, and I want a fast way of parsing each row and placing each comma-delimited value in its own column. For example:

    CREATE TABLE MyTable
    (
        CSVString varchar(1000),
        C1 varchar(20),
        C2 varchar(20),
        ...
        C50 varchar(20)
    );

    Column CSVString contains a CSV row. I don't know how many fields (number of commas + 1) are in a given row, but if the row contains 10 fields, I need to populate columns C1-C10; if it has 15 fields, I populate C1-C15.

    How can I do this in a very efficient way? I tried a CTE-based approach, but performance was not good.
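    For reference, the staging load itself is just a plain BULK INSERT where each whole line lands in the CSVString column; a rough sketch, simplified here to a one-column staging table, with a placeholder path and a field terminator that never occurs in the data:

    -- Rough sketch of the staging load; table name, path, and terminators are placeholders.
    -- The field terminator is a character that never appears in the file, so the whole
    -- line ends up in CSVString and only gets parsed into C1-C50 afterwards.
    CREATE TABLE dbo.MyStaging (CSVString varchar(1000));

    BULK INSERT dbo.MyStaging
    FROM 'C:\import\myfile.csv'
    WITH (
        FIELDTERMINATOR = '|',   -- assumed: '|' never occurs in the data
        ROWTERMINATOR   = '\n',
        TABLOCK
    );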

    Thanks

  • Sounds like you need a really fast splitter. Check out DelimitedSplit8K.
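    A minimal sketch of how it's typically wired up for this kind of problem, assuming DelimitedSplit8K is already installed and showing only the first few target columns:

    -- Sketch only: split each CSVString on commas with DelimitedSplit8K, then fold
    -- the ordinal positions (ItemNumber) back into the fixed columns C1, C2, C3, ...
    UPDATE t
    SET C1 = x.C1, C2 = x.C2, C3 = x.C3          -- ...repeat the pattern through C50
    FROM dbo.MyTable AS t
    CROSS APPLY (
        SELECT
            MAX(CASE WHEN s.ItemNumber = 1 THEN s.Item END) AS C1,
            MAX(CASE WHEN s.ItemNumber = 2 THEN s.Item END) AS C2,
            MAX(CASE WHEN s.ItemNumber = 3 THEN s.Item END) AS C3
            -- ...continue up to ItemNumber = 50
        FROM dbo.DelimitedSplit8K(t.CSVString, ',') AS s
    ) AS x;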

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Forgot to mention DelimitedSplit8K. I tried that as well, and it still doesn't meet my performance needs. Any other suggestions are welcome.

  • If that's too slow, the next step is a CLR (Common Language Runtime) splitter. That's the fastest I've encountered.

    I've personally grabbed this specific code from Simple-Talk and used it to parse the Webster dictionary from Project Gutenberg, as well as a couple of thesaurus files I've imported into a database.
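    For what it's worth, the T-SQL side of wiring up a CLR splitter looks roughly like this; the assembly, class, method, and function names below are placeholders, not the actual names from the Simple-Talk code:

    -- Sketch of registering and calling a CLR table-valued splitter (names are placeholders).
    CREATE ASSEMBLY ClrSplitter
    FROM 'C:\clr\ClrSplitter.dll'
    WITH PERMISSION_SET = SAFE;
    GO
    CREATE FUNCTION dbo.SplitClr (@pString nvarchar(max), @pDelimiter nchar(1))
    RETURNS TABLE (ItemNumber int, Item nvarchar(4000))
    AS EXTERNAL NAME ClrSplitter.[Splitter].SplitString;
    GO
    -- Usage then mirrors DelimitedSplit8K:
    SELECT t.CSVString, s.ItemNumber, s.Item
    FROM dbo.MyTable AS t
    CROSS APPLY dbo.SplitClr(t.CSVString, N',') AS s;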

    Lowell


    --Help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statements to give the volunteers here representative data. With that and your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • SQL Server is the wrong tool for the job if you want compiled speed. I suggest using a compiled console app that pre-processes your file so every row has 50 tab- or comma-delimited columns. Perhaps some Unix utility could do the job. Then bcp it in as you would a standard table.
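    Once the file is normalized so every row has exactly 50 delimited fields, the load itself is routine; a sketch using BULK INSERT rather than the bcp utility, with a placeholder path and a hypothetical 50-column target table:

    -- Sketch: after pre-processing, every row has 50 tab-delimited fields, so it loads
    -- straight into a plain 50-column table (no CSVString staging column needed).
    BULK INSERT dbo.MyTableFixed
    FROM 'C:\import\myfile_normalized.tsv'
    WITH (
        FIELDTERMINATOR = '\t',
        ROWTERMINATOR   = '\n',
        TABLOCK
    );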

  • It sounds like you're dealing with a file that has multiple record types; I can't think of any better explanation for a CSV file with a variable number of columns. Consider using some type of file-splitter utility, or perhaps a simple C# or C++ program, to first split the original file into multiple files based on the record type indicator, which should be contained in one of the non-variable columns. Once that's done, it's simply a matter of creating a table for each record type and then loading from the appropriate file. If ingesting this type of file is a regular part of your ETL schedule, you could do the whole thing, splitting and loading, in an SSIS package.
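    As a sketch of the loading side once the file has been split, with record types, column layouts, and file paths invented purely for illustration:

    -- Illustration only: one table per record type, each loaded from its own pre-split file.
    CREATE TABLE dbo.RecordTypeA (C1 varchar(20), C2 varchar(20), C3 varchar(20), C4 varchar(20), C5 varchar(20));
    CREATE TABLE dbo.RecordTypeB (C1 varchar(20), C2 varchar(20), C3 varchar(20)); -- ...through C15

    BULK INSERT dbo.RecordTypeA FROM 'C:\import\typeA.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

    BULK INSERT dbo.RecordTypeB FROM 'C:\import\typeB.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);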

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Can we see the actual code you used to do the split and column assignment? Maybe there is something there.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • https://www.simple-talk.com/sql/sql-tools/sqlxml-bulk-loader-basics/

    I came across the article above today, and the author says he deals with importing tens of gigs at a time. I think XML might be ideal for this situation, since it can handle "missing" fields or a variable number of columns. Of course, you'll need to preprocess your flat file into an XML file. The author claims it is the quickest and most versatile way of bulk loading huge amounts of data.
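    The SQLXML bulk loader itself is driven from outside T-SQL, but just to illustrate why XML copes well with missing fields, here is a small nodes()/value() sketch (not the bulk loader; element names and data are invented): elements that are absent simply come back as NULL instead of shifting the remaining values into the wrong columns.

    -- Illustration only (not the SQLXML Bulk Loader): each row carries any subset of
    -- elements, and the missing ones shred to NULL.
    DECLARE @x xml = N'
    <rows>
      <row><C1>a</C1><C2>b</C2><C3>c</C3><C4>d</C4><C5>e</C5></row>
      <row><C1>a</C1><C4>d</C4><C5>e</C5></row>
    </rows>';

    SELECT
        r.value('(C1/text())[1]', 'varchar(20)') AS C1,
        r.value('(C2/text())[1]', 'varchar(20)') AS C2,
        r.value('(C3/text())[1]', 'varchar(20)') AS C3,
        r.value('(C4/text())[1]', 'varchar(20)') AS C4,
        r.value('(C5/text())[1]', 'varchar(20)') AS C5
    FROM @x.nodes('/rows/row') AS n(r);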

  • N_Muller (10/1/2015)


    I have a CSV file with roughly 6 million rows. The file is unstructured; that is, some rows have 5 fields, others have 15, and some have as many as 50 fields.

    ...

    A standard CSV file contains the column headers in the first record, and all records have the same number of comma-separated fields. However, if each record in this particular file contains anywhere from 5 to 50 fields, then it's not clear how you would know which specific fields are populated or how they line up with the columns.

    For example, do the records look similar to this, with some records simply having empty fields?

    a,b,c,d,e,,, ...

    a,,,d,e,,, ...

    Or do they look more like this with some records having missing fields?

    a,b,c,d,e,,, ...

    a,d,e ...
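    The distinction matters to any splitter-based approach: with DelimitedSplit8K, for example, empty fields keep their ordinal position, while genuinely missing fields shift everything after them into the wrong column. A quick sketch:

    -- Empty fields keep their ItemNumber, so 'd' still lands in slot 4...
    SELECT s.ItemNumber, s.Item
    FROM dbo.DelimitedSplit8K('a,,,d,e', ',') AS s;
    -- ItemNumber: 1='a', 2='', 3='', 4='d', 5='e'

    -- ...but with missing fields, 'd' arrives at slot 2 and would line up with C2, not C4.
    SELECT s.ItemNumber, s.Item
    FROM dbo.DelimitedSplit8K('a,d,e', ',') AS s;
    -- ItemNumber: 1='a', 2='d', 3='e'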

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
