Fast CSV Reader

Phil Parkin

SSC Guru

Points: 246963
January 14, 2016 at 11:05 am

#307738

I wonder whether anyone here fancies the challenge of trying to make this work in SSIS?
Of course, implicit in that is that you make it work faster than the native flat file source, or there is little point.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin


Alvin Ramard SSC-Forever Points: 41190 · Answer 1

It's all .NET, so could just try a script task or transform in SSIS. 😀

Alvin Ramard
Memphis PASS Chapter

All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help

Phil Parkin SSC Guru Points: 246963 · Answer 2

Alvin Ramard (1/14/2016)
It's all .NET, so could just try a script task or transform in SSIS. 😀

Script Component Source is where I would start, yes ...


Jeff Moden SSC Guru Points: 1004362 · Answer 3

Phil Parkin (1/14/2016)
I wonder whether anyone here fancies the challenge of trying to make this work in SSIS?
Of course, implicit in that is that you make it work faster than the native flat file source, or there is little point.

Saw that. As the author implied, few people get it correct, never mind making it fast or easy on memory. I'd really have to run it through a knothole to make sure it didn't have any bugs in it, such as the typical mistake of forgetting that an ending delimiter means an empty string should be returned after that delimiter. Also, will it return nulls, empty strings, spaces, or nothing at all for adjacent delimiters?
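
Those edge cases are easy to pin down as quick checks. The following is only a sketch: Python's standard csv module stands in for whichever reader is actually under test, and parse_line and the sample lines are made up for illustration.

```python
import csv


def parse_line(line: str) -> list[str]:
    """Parse one CSV line with the standard-library reader
    (a stand-in for the reader under test)."""
    return next(csv.reader([line]))


# Ending delimiter: an empty string comes back after the last comma.
print(parse_line("a,b,"))    # ['a', 'b', '']

# Adjacent delimiters: this reader returns an empty string, not None.
print(parse_line("a,,c"))    # ['a', '', 'c']

# A lone space between delimiters is kept as a space.
print(parse_line("a, ,c"))   # ['a', ' ', 'c']
```

Any other reader can be dropped into parse_line and run against the same three lines to see which of the four behaviours (nulls, empty strings, spaces, or nothing) it picks.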

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Orlando Colamatteo SSC Guru Points: 182279 · Answer 4

The fact that the author is not citing RFC4180 is troubling. This is exactly where many implementations start out on a bad foot, namely they are making up their own CSV file specification.

RFC4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files

The algorithm for processing CSVs is not magical and it surprises me how many vendors (including Microsoft in some of their tools) have gotten it wrong over the years.
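
To show how small the core of it is, here is a minimal sketch of a character-at-a-time parser that follows the RFC 4180 quoting rules. The name parse_csv and its parameters are made up for illustration, it accepts a bare CR or LF as a line break as well as CRLF (more lenient than the RFC), and it has nothing to do with the Code Project reader or any vendor's code.

```python
def parse_csv(text: str, delimiter: str = ",", quote: str = '"') -> list[list[str]]:
    """Split CSV text into records of fields (hypothetical helper)."""
    rows: list[list[str]] = []
    row: list[str] = []
    field: list[str] = []
    in_quotes = False
    started = False  # True once the current record has seen any content
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < n and text[i + 1] == quote:
                    field.append(quote)  # doubled quote = one literal quote
                    i += 1
                else:
                    in_quotes = False    # closing quote
            else:
                field.append(ch)         # delimiters and line breaks are data here
        elif ch == quote:
            in_quotes = True
            started = True
        elif ch == delimiter:
            row.append("".join(field))   # end of field, possibly empty
            field = []
            started = True
        elif ch in "\r\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1                   # treat CRLF as a single line break
            row.append("".join(field))   # end of the last field in the record
            field = []
            rows.append(row)
            row = []
            started = False
        else:
            field.append(ch)
            started = True
        i += 1
    if started:                          # last record without a trailing line break
        row.append("".join(field))
        rows.append(row)
    return rows


print(parse_csv('a,b,\r\n"x, ""y""",z\r\n'))
# [['a', 'b', ''], ['x, "y"', 'z']]
```

Note that the ending delimiter on the first line produces a trailing empty string, and the quoted comma and doubled quotes on the second line stay inside a single field.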

Back to the Code Project article...I have written code similar to what is needed to parse a CSV to strip "special" characters from incoming text files. Special in this case was anything not type-able on a US-101 keyboard and any control characters (e.g. tabs and line breaks inside text fields). The destination system was not tolerant of "special" characters and it was more efficient to strip them out in a pre-processing step prior to them hitting the database. The algorithm processed the file one character at a time and maintained a stack to allow for line breaks and tabs to be stripped out if inside text-delimiters (i.e. part of the text headed for the database) but maintained when outside a text-field, i.e. when they terminate a field or line.
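
A rough sketch of that kind of pre-processing pass, written in Python rather than the original code (which is not shown here). Assumptions: the text qualifier is a double quote, "special" is taken to mean anything outside printable ASCII, and a simple in-quotes flag stands in for the stack described above. The name strip_special is made up for illustration.

```python
def strip_special(text: str, quote: str = '"') -> str:
    """Drop 'special' characters one character at a time (hypothetical helper).

    Tabs and line breaks are removed inside quoted text but kept outside it,
    where they terminate a field or a line.
    """
    out: list[str] = []
    in_quotes = False
    for ch in text:
        if ch == quote:
            # Toggling on every quote also copes with doubled quotes:
            # the pair flips the flag off and straight back on.
            in_quotes = not in_quotes
            out.append(ch)
        elif ch in "\t\r\n":
            if not in_quotes:
                out.append(ch)   # structural: ends a field or a line
            # inside quotes: part of the text, so strip it
        elif " " <= ch <= "~":
            out.append(ch)       # printable ASCII is kept
        # anything else (other control characters, non-ASCII) is dropped
    return "".join(out)


raw = '1,"line one\nline two",x\n2,"caf\u00e9\ttab",y\n'
print(repr(strip_special(raw)))
# '1,"line oneline two",x\n2,"caftab",y\n'
```

The line breaks that end each record survive, while the embedded line break, the embedded tab, and the non-ASCII character are gone before the file gets anywhere near the database.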

I wonder how it compares to the project geared towards processing CSV files in SSIS that is posted on CodePlex:

CodePlex: Delimited File Source

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Jeff Moden SSC Guru Points: 1004362 · Answer 5

Orlando Colamatteo (1/15/2016)
The fact that the author is not citing RFC4180 is troubling. This is exactly where many implementations start out on a bad foot, namely they are making up their own CSV file specification.
The algorithm for processing CSVs is not magical and it surprises me how many vendors (including Microsoft in some of their tools) have gotten it wrong over the years.

Heh... finally, something we won't have to do a deep dive with each other over. 😀 I absolutely agree!

--Jeff Moden
