How do I parse a CSV file with extra commas in a field?

  • I've got a print log that is saved daily to a CSV file, and I'd like to use SSIS to import it into SQL Server 2012. The log contains 14 fields of information, and the 5th field is the document name. Occasionally a user will include a comma in the title of their document, which shifts the columns and causes the package not to import the last field. That last field is the document size, which is what my boss is interested in capturing.

    I think the answer would be to count the commas in each row before it's parsed. If there are more than 13 (a 14-field row should contain exactly 13 commas), parse out the first 4 fields and the last 9 fields, which should leave just the document name field; delete all the commas in that field and put the row back together for downstream processing.

    I don't know how to look at the data before it's parsed by the SSIS source, so I'd appreciate a couple of hints from anybody to get me headed in the right direction, or a better way to do it than the one described above, if there is one.

    Thanks

  • Rick Lang (1/3/2013)

    The very best way of dealing with this is to fix the problem at source.

    1) Produce the file using a different column delimiter (e.g. a pipe: |)

    2) Add a string delimiter (text qualifier) to the document name output, so that any embedded commas are protected (e.g. "Column1",2,3,"Document, name","Col5")

    Assuming neither of the above is possible, a script component is the obvious choice (if you don't mind doing some C#). It would let you take the entire input row and then parse it out into the required columns - that should only require a handful of lines of code.
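
    For example, the parsing logic itself might look something like this (untested, and the 14-column layout with the name in field 5 is taken from your description):

        using System;

        static class RowParser
        {
            // Rebuild a 14-field row where field 5 (the document name)
            // may contain unqualified commas. The field positions are
            // assumptions based on the layout described above.
            public static string[] Parse(string rawLine)
            {
                string[] parts = rawLine.Split(',');
                if (parts.Length <= 14)
                    return parts;   // well-formed row - nothing to fix

                string[] fields = new string[14];
                Array.Copy(parts, 0, fields, 0, 4);                 // first 4 fields
                Array.Copy(parts, parts.Length - 9, fields, 5, 9);  // last 9 fields
                // Everything left in the middle is the document name;
                // re-join it, deleting the stray commas.
                fields[4] = string.Join("", parts, 4, parts.Length - 13);
                return fields;
            }
        }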

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Thanks for the quick reply. Unfortunately, the first two options aren't possible due to a limitation of the logging software, so I'm down to your third suggestion. I'm assuming you mean using the script component as a source. I've never done that before, but in trying to figure it out, I found it works well if the malformed record is the first one. However, if it's somewhere in the middle of the file, it looks like SSIS builds a "template" of the fields from the first row and then discards the excess fields when it gets to a record with more than the usual number. Is that correct? Or am I missing something - is there a way to specify extra columns at the start (which are usually empty)?

  • No - not as a source, but as a transformation.

    Essentially, the component takes each row as a single string and then, using parsing code within the script component, you break the string into separate 'output' columns which the script component provides to the destination.
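
    Inside the script component, the method you override looks something like this (a rough sketch - Input0Buffer and the column names are generated from whatever you define in the component's metadata, so treat the names here as placeholders):

        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {
            // 'WholeRow' is assumed to be a single input column holding
            // the entire unparsed line from the flat file source.
            string[] parts = Row.WholeRow.Split(',');
            if (parts.Length < 14)
                return;   // or redirect the row to an error output

            Row.DocumentName = string.Join("", parts, 4, parts.Length - 13);
            Row.DocumentSize = parts[parts.Length - 1];
            // ... assign the remaining output columns the same way ...
        }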

    I haven't done it for a while - I'll look for some useful links & post back if I find any.

    By the way, there are no doubt ways to do this with the script component as a source too - there are always numerous ways to skin a cat in SSIS.
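
    In that case the script reads the file itself and builds the output rows, along these lines (again just a sketch - the path and column names below are placeholders):

        public override void CreateNewOutputRows()
        {
            foreach (string line in System.IO.File.ReadLines(@"C:\logs\print.csv"))
            {
                string[] parts = line.Split(',');
                if (parts.Length < 14)
                    continue;   // skip (or log) short rows

                Output0Buffer.AddRow();
                Output0Buffer.DocumentName = string.Join("", parts, 4, parts.Length - 13);
                Output0Buffer.DocumentSize = parts[parts.Length - 1];
                // ... assign the remaining output columns ...
            }
        }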


  • Have a look here and see whether it gives you enough to get started with this.


  • Phil,

    Think I may have it. I used a Flat File Connection Manager, selected the Advanced page, and added 5 extra columns to the connection. I then dragged a Flat File Source onto the data flow, hooked it up to the connection manager, and the preview now shows the data preserved in those columns for my nonstandard rows. I'm pretty sure I can handle the script portion from here (as long as a user doesn't put more than 5 commas in their document title), but I appreciate your help.
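
    For the record, the recombination logic I have in mind for the script looks something like this (the column names are placeholders for the 14 real columns plus the 5 extras defined in the connection manager):

        using System;

        static class Recombine
        {
            // 'raw' holds the 19 columns read by the connection manager,
            // with null/empty values in the extras a well-formed row
            // doesn't use.
            public static string[] Fix(string[] raw)
            {
                // The highest populated extra column tells us how many
                // stray commas the document title contained.
                int last = raw.Length - 1;
                while (last >= 14 && string.IsNullOrEmpty(raw[last]))
                    last--;
                int strays = last - 13;   // 0 for a well-formed row

                string[] fields = new string[14];
                Array.Copy(raw, 0, fields, 0, 4);   // first 4 fields are safe
                // Document name = column 5 plus whatever spilled to its
                // right, with the stray commas deleted.
                fields[4] = string.Join("", raw, 4, strays + 1);
                // The last 9 fields shifted right by 'strays' columns.
                Array.Copy(raw, 5 + strays, fields, 5, 9);
                return fields;
            }
        }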

    Rick

  • Rick Lang (1/3/2013)

    Cool, if this works it's easier than what I was suggesting!

