CSV Import Problem

  • Hi,

This is driving me nuts!

I'm being sent a Unicode CSV file that has some text qualifiers and is partly fixed-width. The text qualifiers are only used when necessary. For example:

    XXXX ,"1,400.00",GBN01 ,*XXXXXXXXX ,20,70.00,"1,400.00",17.50,300

    I've XXXXX'd out sensitive data.

The problem is that I'm getting an error when trying to import this line, and it's the field "1,400.00" that is causing it. When I set up the Flat File Connection Manager, I specified " as the text qualifier and everything works fine. I preview the file and everything is just as I would expect.

    However, when I come to run the SSIS package I get the following error:

    "The data value cannot be converted for reasons other than sign mismatch or data overflow"

    If I manually change the field from "1,400.00" to 1400.00 with no text qualifier it works fine.

    Please, someone, give me some clues as to why this isn't working.

    Cheers

    George

  • George

    I think the problem is that 1,400.00 is actually a number, which is being sent as text because it has a "," as a thousands separator. I would import it into a text field and then attempt to convert it into a number.
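For the conversion step, something like this might do once the value has landed in a text column (a minimal sketch: @raw stands in for your staging column, and it leans on the fact that a conversion to MONEY tolerates a thousands separator):

-- @raw stands in for the value as loaded into a text column
DECLARE @raw VARCHAR(20)
SET @raw = '1,400.00'

-- Converting via MONEY accepts the embedded comma;
-- the outer CAST moves it to the final numeric type
SELECT CAST(CONVERT(money, @raw) AS decimal(10, 2)) -- returns 1400.00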

    John

  • Hi John,

    Thanks for the reply.

I've tried that. It just pushes the problem along. It seems that if I do that, when it gets to the next numeric field it's trying to load text into it. In other words, it's as if the comma in "1,400.00" is being treated as a field separator and the text qualifiers are being ignored.

Could any of this be related to the file being Unicode? I tried processing it as non-Unicode, but that obviously created problems with the file not being read correctly.

    Cheers

    George

  • George

Another way round it might be to open the file in Excel, format the columns appropriately, and then save it and import from that. You could write a VB script in the package that would do that for you.

    John

  • Hi John,

Yeah, I have thought about doing some pre-processing to convert the file to | or TAB delimited. I just can't believe it doesn't work straight out of the tin.

    I've fired off an email to the supplier who's sending the data to ask if they can change the format of the file. I hope they can as this is driving me nuts 🙂

    In the meantime though, if you, or anyone, knows why this is happening I would really appreciate some help.

    Cheers

    George

  • Hi George,

It's not working because your file is quite non-standard. Your best bet is to get the file sent as straight CSV or straight fixed width, with consistent use of delimiters.

Otherwise you may need to import the file into a staging table with one column and then parse that data out into meaningful columns in your final destination table.
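
Something along these lines might serve as a starting point (a rough sketch: the table name, file path and terminators are assumptions, and only the first field is parsed out):

-- One wide column to hold each raw line of the file
CREATE TABLE dbo.CsvStaging (RawLine NVARCHAR(1000))

-- Pull each whole line into the single column (hypothetical path;
-- 'widechar' is for the Unicode file, and the field terminator is a
-- character assumed not to appear anywhere in the data)
BULK INSERT dbo.CsvStaging
FROM 'C:\import\feed.csv'
WITH (DATAFILETYPE = 'widechar', FIELDTERMINATOR = '|', ROWTERMINATOR = '\n')

-- Then parse the columns out, e.g. the first comma-delimited field
SELECT LEFT(RawLine, NULLIF(CHARINDEX(',', RawLine), 0) - 1) AS Field1
FROM dbo.CsvStaging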

  • GN-340682 (11/26/2010)


    Hi John,

Yeah, I have thought about doing some pre-processing to convert the file to | or TAB delimited. I just can't believe it doesn't work straight out of the tin.

    I've fired off an email to the supplier who's sending the data to ask if they can change the format of the file. I hope they can as this is driving me nuts 🙂

    In the meantime though, if you, or anyone, knows why this is happening I would really appreciate some help.

    Cheers

    George

    You're taking the correct steps. The first step is always to ask the vendor to meet certain requirements. The problem is that they're probably exporting from a formatted spreadsheet or table or, worse, from a reporting system that does formatting.

Please do post back if you have no success with the vendor. There are a couple of ways to get around this problem "auto-magically" without having to import and re-export from a spreadsheet on your end.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • GN-340682 (11/26/2010)


    Hi John,

    Thanks for the reply.

I've tried that. It just pushes the problem along. It seems that if I do that, when it gets to the next numeric field it's trying to load text into it. In other words, it's as if the comma in "1,400.00" is being treated as a field separator and the text qualifiers are being ignored.

Could any of this be related to the file being Unicode? I tried processing it as non-Unicode, but that obviously created problems with the file not being read correctly.

    Cheers

    George

It's easy to understand why SSIS is getting confused by this format, which is most unusual. Including a thousands separator in a numeric field is a hindrance, and it is this which I would address first. Based on the sample data provided, I can see no straightforward way of pre-processing the data to make it import correctly ('delete all commas between matching pairs of double quotes' is easy enough to state but more difficult to code).

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Thanks to everyone who posted. It's nice to know I'm not going mad, and it is an odd file format 🙂

Hopefully they'll be able to change the format and I won't have to code around it.

    Cheers

    George

Just to add a little more detail to this. After much fiddling I managed to get the data to import into the load table. This is how it's being imported (I've turned each column into a row to make it visually easier to see):

    "1

    400.00"

    XXXXX

    XXXXX

    20.00

    70.00

    "1

    400.0

As you can see, SSIS has completely ignored the fact that I've specified a text qualifier of " and has effectively pushed all the columns along. Why does this work fine in the Flat File Connection Manager Editor (under the preview tab), but then fail so spectacularly when it comes to running it?

I'm still waiting to hear whether the file format can be changed, so in the meantime any ideas on this would be useful. Thanks

    Cheers

    George

I had the same issue where we used to import CSV files. The best option I would recommend is to first clean up your CSV by removing the commas embedded within quoted columns.

    Good luck

  • SQLSmasher (11/30/2010)


I had the same issue where we used to import CSV files. The best option I would recommend is to first clean up your CSV by removing the commas embedded within quoted columns.

    Good luck

    Any suggestions on how you would do that? Maybe some code?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

The method I can suggest straight away is to use SSIS:

1. Create an in-memory single-column table with each CSV row as a table row

2. Find the commas between "xxxx,yyyy"

3. Replace with "xxxxyyyy"

4. Continue with your particular SSIS job by retrieving the data from this new table
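
For steps 2 and 3, a minimal sketch along these lines might do for a single qualified field (the sample row is made up, and it only deals with the first quoted field in the row, so the general case is still open):

-- Strip the commas from the first text-qualified field only
DECLARE @row NVARCHAR(200)
SET @row = N'XXXX ,"1,400.00",GBN01 ,20'

DECLARE @q1 INT, @q2 INT
SET @q1 = CHARINDEX('"', @row)          -- opening qualifier
SET @q2 = CHARINDEX('"', @row, @q1 + 1) -- closing qualifier

IF @q1 > 0 AND @q2 > @q1
    SET @row = STUFF(@row, @q1, @q2 - @q1 + 1,
               REPLACE(SUBSTRING(@row, @q1, @q2 - @q1 + 1), ',', ''))

SELECT @row -- XXXX ,"1400.00",GBN01 ,20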

    Thanks

  • Part 2 is the tricky bit though. If you've done this before, I think it would be useful if you could provide a code snippet.

    Tricky because

    "1,400.00",17.50,300

    needs to become

    "1400.00",17.50,300

    so the rules about which commas to remove are not straightforward ...

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Here's a *cough* loop version I just tried to solve that little puzzle. I'm more than confident there are much better solutions available...

    Edit: a recursive CTE would be possible, too. But at least this one is easily identified as a loop solution... 😉

DECLARE
    @t VARCHAR(30),
    @repl CHAR(1),
    @identifier1Pos INT,
    @identifier2Pos INT,
    @commaPos INT

SET @repl = '!' -- intermediate stand-in for a comma, so the string doesn't change length mid-loop
SET @t = '"1400.00,",",17.50",300,",",'

SELECT
    @identifier1Pos = CHARINDEX('"', @t),                      -- position of the first qualifier
    @commaPos       = CHARINDEX(',', @t, @identifier1Pos + 1), -- first comma after the qualifier just found
    @identifier2Pos = CHARINDEX('"', @t, @identifier1Pos + 1)  -- position of the second qualifier

WHILE @identifier1Pos > 0 AND @identifier2Pos > 0 AND @commaPos > 0 -- as long as there are two qualifiers and a comma left
BEGIN
    IF @identifier1Pos < @commaPos AND @commaPos < @identifier2Pos
    BEGIN
        -- the comma sits between the two qualifiers: mask it, then look for
        -- another comma inside the same pair (handles fields like "1,400,000.00")
        SET @t = STUFF(@t, @commaPos, 1, @repl)
        SET @commaPos = CHARINDEX(',', @t, @commaPos + 1)
    END
    ELSE
    BEGIN
        -- no (more) commas inside this pair: grab the next pair
        SELECT
            @identifier1Pos = CHARINDEX('"', @t, @identifier2Pos + 1),
            @commaPos       = CHARINDEX(',', @t, @identifier1Pos + 1),
            @identifier2Pos = CHARINDEX('"', @t, @identifier1Pos + 1)
    END
END

SELECT REPLACE(@t, @repl, '') -- final output: remove the intermediate character



    Lutz
    A pessimist is an optimist with experience.

How to get fast answers to your question
How to post performance related questions
Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function
