Parsing an unstructured CSV file

  • I have a CSV file with roughly 6 million rows. The file is unstructured; that is, some rows have 5 fields, others have 15, and some have as many as 50 fields.

    I am using bulk insert to read the entire file into a table in the database, with each file row becoming one database record. That leaves me with one column that contains an entire row of comma-delimited fields. All fields are character strings, and I want a fast way of parsing each row and placing each comma-delimited value in its own column. For example:

    CREATE TABLE MyTable
    (
        CSVString varchar(1000),
        C1 varchar(20),
        C2 varchar(20),
        ...
        C50 varchar(20)
    );

    Column CSVString contains a CSV row. I don't know how many fields (number of commas + 1) are in a given row, but if the row contains 10 fields, I need to populate columns C1-C10; if it has 15 fields, I populate C1-C15.

    How can I do this in a very efficient way? I tried a CTE-based approach, but performance was not good.
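    For reference, the staging load itself is just a plain BULK INSERT where each whole line lands in the CSVString column; a rough sketch, simplified here to a one-column staging table, with a placeholder path and a field terminator that never occurs in the data:

    -- Rough sketch of the staging load; table name, path, and terminators are placeholders.
    -- The field terminator is a character that never appears in the file, so the whole
    -- line ends up in CSVString and only gets parsed into C1-C50 afterwards.
    CREATE TABLE dbo.MyStaging (CSVString varchar(1000));

    BULK INSERT dbo.MyStaging
    FROM 'C:\import\myfile.csv'
    WITH (
        FIELDTERMINATOR = '|',   -- assumed: '|' never occurs in the data
        ROWTERMINATOR   = '\n',
        TABLOCK
    );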

    Thanks

  • Sounds like you need a really fast splitter. Check out DelimitedSplit8K.
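    A minimal sketch of how it's typically wired up for this kind of problem, assuming DelimitedSplit8K is already installed and showing only the first few target columns:

    -- Sketch only: split each CSVString on commas with DelimitedSplit8K, then fold
    -- the ordinal positions (ItemNumber) back into the fixed columns C1, C2, C3, ...
    UPDATE t
    SET C1 = x.C1, C2 = x.C2, C3 = x.C3          -- ...repeat the pattern through C50
    FROM dbo.MyTable AS t
    CROSS APPLY (
        SELECT
            MAX(CASE WHEN s.ItemNumber = 1 THEN s.Item END) AS C1,
            MAX(CASE WHEN s.ItemNumber = 2 THEN s.Item END) AS C2,
            MAX(CASE WHEN s.ItemNumber = 3 THEN s.Item END) AS C3
            -- ...continue up to ItemNumber = 50
        FROM dbo.DelimitedSplit8K(t.CSVString, ',') AS s
    ) AS x;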

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Forgot to mention DelimitedSplit8K. I tried that as well, and it still doesn't meet my performance needs. Any other suggestions are welcome.

  • If that's too slow, the next step is a CLR (Common Language Runtime) splitter. That's the fastest I've encountered.

    I've personally grabbed this specific code from Simple-Talk and used it to parse the Webster dictionary from Project Gutenberg, as well as a couple of thesaurus files I've imported into a database.
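    For what it's worth, the T-SQL side of wiring up a CLR splitter looks roughly like this; the assembly, class, method, and function names below are placeholders, not the actual names from the Simple-Talk code:

    -- Sketch of registering and calling a CLR table-valued splitter (names are placeholders).
    CREATE ASSEMBLY ClrSplitter
    FROM 'C:\clr\ClrSplitter.dll'
    WITH PERMISSION_SET = SAFE;
    GO
    CREATE FUNCTION dbo.SplitClr (@pString nvarchar(max), @pDelimiter nchar(1))
    RETURNS TABLE (ItemNumber int, Item nvarchar(4000))
    AS EXTERNAL NAME ClrSplitter.[Splitter].SplitString;
    GO
    -- Usage then mirrors DelimitedSplit8K:
    SELECT t.CSVString, s.ItemNumber, s.Item
    FROM dbo.MyTable AS t
    CROSS APPLY dbo.SplitClr(t.CSVString, N',') AS s;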

    Lowell


    --Help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statements to give the volunteers here representative data. With that and your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • SQL Server is the wrong tool for the job if you want compiled speed. I suggest using a compiled console app that pre-processes your file so every row has 50 tab- or comma-delimited columns. Perhaps some Unix utility could do the job. Then bcp it in as you would a standard table.
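    Once the file is normalized so every row has exactly 50 delimited fields, the load itself is routine; a sketch using BULK INSERT rather than the bcp utility, with a placeholder path and a hypothetical 50-column target table:

    -- Sketch: after pre-processing, every row has 50 tab-delimited fields, so it loads
    -- straight into a plain 50-column table (no CSVString staging column needed).
    BULK INSERT dbo.MyTableFixed
    FROM 'C:\import\myfile_normalized.tsv'
    WITH (
        FIELDTERMINATOR = '\t',
        ROWTERMINATOR   = '\n',
        TABLOCK
    );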

  • It sounds like you're dealing with a file that has multiple record types; I can't think of any better explanation for a CSV file with a variable number of columns. Consider using some type of file-splitter utility, or perhaps a simple C# or C++ program, to first split the original file into multiple files based on the record type indicator, which should be contained in one of the non-variable columns. Once that's done, it's simply a matter of creating a table for each record type and then loading from the appropriate file. If ingesting this type of file is a regular part of your ETL schedule, you could do the whole thing, splitting and loading, in an SSIS package.
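    As a sketch of the loading side once the file has been split, with record types, column layouts, and file paths invented purely for illustration:

    -- Illustration only: one table per record type, each loaded from its own pre-split file.
    CREATE TABLE dbo.RecordTypeA (C1 varchar(20), C2 varchar(20), C3 varchar(20), C4 varchar(20), C5 varchar(20));
    CREATE TABLE dbo.RecordTypeB (C1 varchar(20), C2 varchar(20), C3 varchar(20)); -- ...through C15

    BULK INSERT dbo.RecordTypeA FROM 'C:\import\typeA.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

    BULK INSERT dbo.RecordTypeB FROM 'C:\import\typeB.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);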

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Can we see the actual code you used to do the split and column assignment? Maybe there is something there.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • https://www.simple-talk.com/sql/sql-tools/sqlxml-bulk-loader-basics/

    I came across the article above today, and the author says he deals with importing tens of gigs at a time. I think XML might be ideal for this situation, since it can handle "missing" fields or a variable number of columns. Of course, you'll need to preprocess your flat file into an XML file. The author claims it is the quickest and most versatile way of bulk loading huge amounts of data.
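    The SQLXML bulk loader itself is driven from outside T-SQL, but just to illustrate why XML copes well with missing fields, here is a small nodes()/value() sketch (not the bulk loader; element names and data are invented): elements that are absent simply come back as NULL instead of shifting the remaining values into the wrong columns.

    -- Illustration only (not the SQLXML Bulk Loader): each row carries any subset of
    -- elements, and the missing ones shred to NULL.
    DECLARE @x xml = N'
    <rows>
      <row><C1>a</C1><C2>b</C2><C3>c</C3><C4>d</C4><C5>e</C5></row>
      <row><C1>a</C1><C4>d</C4><C5>e</C5></row>
    </rows>';

    SELECT
        r.value('(C1/text())[1]', 'varchar(20)') AS C1,
        r.value('(C2/text())[1]', 'varchar(20)') AS C2,
        r.value('(C3/text())[1]', 'varchar(20)') AS C3,
        r.value('(C4/text())[1]', 'varchar(20)') AS C4,
        r.value('(C5/text())[1]', 'varchar(20)') AS C5
    FROM @x.nodes('/rows/row') AS n(r);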

  • N_Muller (10/1/2015)


    I have a CSV file with roughly 6 million rows. The file is unstructured; that is, some rows have 5 fields, others have 15, and some have as many as 50 fields.

    ...

    A standard CSV file contains the column headers in the first record, and all records have the same number of comma-separated fields. However, if each record in this particular file contains anywhere from 5 to 50 fields, then it's not clear how you would know which specific fields are populated or how they line up with the columns.

    For example, do the records look similar to this, with some records simply having empty fields?

    a,b,c,d,e,,, ...

    a,,,d,e,,, ...

    Or do they look more like this with some records having missing fields?

    a,b,c,d,e,,, ...

    a,d,e ...
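    The distinction matters to any splitter-based approach: with DelimitedSplit8K, for example, empty fields keep their ordinal position, while genuinely missing fields shift everything after them into the wrong column. A quick sketch:

    -- Empty fields keep their ItemNumber, so 'd' still lands in slot 4...
    SELECT s.ItemNumber, s.Item
    FROM dbo.DelimitedSplit8K('a,,,d,e', ',') AS s;
    -- ItemNumber: 1='a', 2='', 3='', 4='d', 5='e'

    -- ...but with missing fields, 'd' arrives at slot 2 and would line up with C2, not C4.
    SELECT s.ItemNumber, s.Item
    FROM dbo.DelimitedSplit8K('a,d,e', ',') AS s;
    -- ItemNumber: 1='a', 2='d', 3='e'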

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
