Fast CSV Reader

  • I wonder whether anyone here fancies the challenge of trying to make this work in SSIS?

    Of course, implicit in that is that you make it work faster than the native flat file source, or there is little point.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • It's all .NET, so you could just try a script task or transform in SSIS. 😀



    Alvin Ramard
    Memphis PASS Chapter

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help

  • Alvin Ramard (1/14/2016)


    It's all .NET, so could just try a script task or transform in SSIS. 😀

    Script Component Source is where I would start, yes ...

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Phil Parkin (1/14/2016)


    I wonder whether anyone here fancies the challenge of trying to make this work in SSIS?

    Of course, implicit in that is that you make it work faster than the native flat file source, or there is little point.

    Saw that. As the author implied, few people get it correct, never mind fast or easy on memory. I'd really have to run it through a knothole to make sure it doesn't have any bugs in it, such as the typical one where people forget that a trailing delimiter means an empty string should be returned after that delimiter. Also, will it return nulls, empty strings, spaces, or nothing at all for adjacent delimiters?
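The edge cases raised in this post can be written down as a small check list. The following is only a sketch: it uses Python's standard `csv` module as a stand-in parser, not the Code Project reader under discussion, and the names `parse_line`, `CASES` and `check` are hypothetical helpers made up for illustration. The expected values assume that every delimiter separates two fields, so a trailing or doubled delimiter yields an empty string.

```python
import csv
import io


def parse_line(line, delimiter=","):
    """Parse a single CSV record using the standard-library csv module."""
    return next(csv.reader(io.StringIO(line), delimiter=delimiter))


# (input line, fields expected if every delimiter separates two fields)
CASES = [
    ("a,b,", ["a", "b", ""]),                      # trailing delimiter: final empty field
    ("a,,c", ["a", "", "c"]),                      # adjacent delimiters: empty middle field
    (",,", ["", "", ""]),                          # delimiters only: three empty fields
    ("a, ,c", ["a", " ", "c"]),                    # a lone space is data, not "empty"
    ('a,"b,c",d', ["a", "b,c", "d"]),              # delimiter inside a quoted field
    ('a,"say ""hi""",c', ["a", 'say "hi"', "c"]),  # doubled quote inside a quoted field
]


def check(parser):
    """Run every case through parser; return (input, expected, actual) for failures."""
    failures = []
    for text, expected in CASES:
        actual = parser(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


if __name__ == "__main__":
    for name, parser in [("csv module", parse_line),
                         ("naive split", lambda s: s.split(","))]:
        print(name, "failures:", check(parser))
```

Note that the `csv` module returns empty strings for both unquoted-empty and quoted-empty fields, so it cannot answer the null-versus-empty-string question on its own; a reader that needs that distinction would have to track whether the field was quoted.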

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The fact that the author does not cite RFC 4180 is troubling. This is exactly where many implementations get off on the wrong foot: they make up their own CSV file specification.

    RFC4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files

    The algorithm for processing CSVs is not magical and it surprises me how many vendors (including Microsoft in some of their tools) have gotten it wrong over the years.

    Back to the Code Project article... I have written code similar to what is needed to parse a CSV, in order to strip "special" characters from incoming text files. "Special" in this case meant anything not typeable on a US-101 keyboard, plus any control characters (e.g. tabs and line breaks inside text fields). The destination system did not tolerate "special" characters, and it was more efficient to strip them out in a pre-processing step before they hit the database. The algorithm processed the file one character at a time and maintained a stack so that line breaks and tabs could be stripped when inside text delimiters (i.e. part of the text headed for the database) but kept when outside a text field, i.e. when they terminate a field or line.
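A character-at-a-time pre-processing pass of the kind described in this post might look roughly like the sketch below. It is not the poster's code: it swaps the stack for a single in-quotes flag, assumes a double quote as the text delimiter with RFC 4180-style doubled quotes as the escape, and only drops tabs and line breaks (filtering out non-keyboard characters is left out). The function name `strip_inside_quotes` and its parameters are hypothetical.

```python
def strip_inside_quotes(text, quote='"', strip_chars="\t\r\n"):
    """Drop strip_chars that occur inside quoted fields; keep them elsewhere.

    Assumes quote is the text delimiter and that a doubled quote inside a
    quoted field is an escaped literal quote.
    """
    out = []
    in_quotes = False
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == quote:
            if in_quotes and i + 1 < n and text[i + 1] == quote:
                # Escaped quote: copy both characters and stay inside the field.
                out.append(quote + quote)
                i += 2
                continue
            # Opening or closing quote: flip the state and keep the character.
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes and ch in strip_chars:
            # Tab or line break inside a text field: drop it.
            pass
        else:
            # Everything else, including field and line terminators, is kept.
            out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    raw = 'id,note\r\n1,"line one\nline\ttwo"\r\n2,"say ""hi"""\r\n'
    print(repr(strip_inside_quotes(raw)))
```

Because the pass only tracks whether it is inside a quoted field, the line breaks that terminate records survive untouched, which is what lets the cleaned file be loaded one physical line per record afterwards.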

    I wonder how it compares to the project geared towards processing CSV files in SSIS that is posted on CodePlex:

    CodePlex: Delimited File Source

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo (1/15/2016)


    The fact that the author is not citing RFC4180 is troubling. This is exactly where many implementations start out on a bad foot, namely they are making up their own CSV file specification

    The algorithm for processing CSVs is not magical and it surprises me how many vendors (including Microsoft in some of their tools) have gotten it wrong over the years

    Heh... finally. Something we won't have to do a deep dive with each other on. 😀 I absolutely agree!

    --Jeff Moden


