Handling a malformed CSV file

  • Hi there.  This is my first post here so please excuse me if I am posting in the wrong place or missing an established post that solves my problem.  I have looked for one to no avail.

    I am in the process of writing a rather rough and ready sequence of stored procedures to import large (sometimes 1M records and 800 columns) csv files into an SQL Express database using bulk insert.  I have had very encouraging results so far, despite having to use a roughshod workaround to get bulk insert to deal with double quotes in the csv files.  In a nutshell, the main stored procedure does the following:

    1)  Bulk inserts the csv file into a temp table using "," as the field terminator and "\n" as the row terminator.  This removes all double quotes apart from those at the beginning of the first column of each row.

    2)  The following code is used to strip the last spurious double quote:

    update [temp table]
    set [column one] = replace([column one], '"', '')

    This works surprisingly well.  2m records can be processed in under a minute on my system.

    3)  In a nutshell, the data in the temp table relates to a date range and represents an update.  The corresponding date range is deleted from the main table, the contents of the temp table are transferred to the main table, and then the temp table is cleared.

    Now, finally, my problem.  Due to a bug beyond my control in the system that generates the csv file (little chance of getting it fixed), the last row of each csv file is not followed by a new line and is thus not terminated correctly.  Bulk insert returns, for example, the following error:

    "Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 126268, column 649 ([column name])"

    My stored procedure is designed so that it picks up the input csv file with a set name from a set directory at a given time period. 

    Is there a way that I could incorporate something into my stored procedure that would insert a carriage return or new line into the input csv file and thus terminate the last row correctly? A batch file or vbs script perhaps?  If so, could somebody help me do it?

    I am aware of several methods using third party software and text editors but I am eager to integrate this step within the automated procedure that exists.

    Many thanks for your time,

    Dan

  • The quick method of doing this from the command prompt is:

    echo.>>YourFile.csv

    You could put this in a batch file.
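If you would rather not add a blank line to files that already end correctly, the append can be guarded.  A rough sketch in POSIX shell (e.g. via Cygwin on Windows; `YourFile.csv` is just the placeholder name from above):

```shell
#!/bin/sh
# Append a newline to the CSV only if its last byte is not already one.
# "YourFile.csv" is a placeholder; point this at your import directory.
f="YourFile.csv"
if [ -s "$f" ] && [ -n "$(tail -c 1 "$f")" ]; then
    printf '\n' >> "$f"
fi
```

Command substitution strips a trailing newline, so the test is non-empty exactly when the file ends mid-row.  Running it twice is safe: the second pass sees the newline and does nothing.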

     

  • Another solution would be to use a job to start the whole process.

     

    The first job step would be in VBS to open the file using the FileSystemObject.  Add a line, close the file.  Call the sp which imports the data.

     

    Message back if you need more help in solving this problem.

  • And there was me worrying that it was going to be a complex solution! 

    echo.>>YourFile.csv works perfectly.

    Thank you both for your help, I appreciate it.  I think I'm going to hang around here, lots to learn.

     

  • Hi,

    I know my reply is just too late for this thread, but I have almost the same problem.  I am importing 15 million rows (around 3.4 GB), and after 1M rows the import failed because, according to the error description, there are additional columns/spaces after the last column.  I'm getting frustrated; I have been working on this for almost 3 days and nothing I have tried will import it.

    Can someone help me with this?  Thank you very much.

    Some workarounds I have tried:

      --used a DTS package: same error

      --used bulk insert, but it takes a lot of time and produces a connection failure afterwards

      --tried to open the text file in a text editor (edit, TextPad, EditPlus), but it cannot be opened because of memory limitations

    Any help is very much appreciated.  Thanks again.

    By the way, how is echo.>>yourfilename used?


    Echo.>>yourfilename simply appends a blank line to your file.  Echo. creates the blank line, and ">>" is the old DOS notation for "take the output of the command on the left and append it to the file on the right."

    So far as your errors go, I need to know, are the rows that have the extra columns valid or not?  Is the file a CSV file or a fixed field file?  Do the rows with the extra columns have a totally different record layout than the others (file of mixed records)?  And, do you know how to use BCP with the option to save "bad" rows and how to use a BCP format file?
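If you can get at the file from a shell (a generic sketch, not BCP; it assumes a POSIX shell such as Cygwin on Windows, tab-delimited data, and `bigfile.txt` as a placeholder name), awk can tally how many rows carry each field count without ever loading the whole file into memory, which answers the question of whether the extra columns sit on otherwise-valid rows:

```shell
#!/bin/sh
# Tally the number of tab-separated fields per row across the whole file.
# A clean file prints a single line; any stray field count is a bad row.
test -f bigfile.txt || exit 0   # placeholder file; point this at your data
awk -F'\t' '{ counts[NF]++ } END { for (n in counts) print n " fields: " counts[n] " rows" }' bigfile.txt
```

awk streams the input one row at a time, so a 3.4 GB file is fine; it just takes a single pass.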

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for the reply,

    The file is tab delimited, and the import reads the first 950,000+ records before the error occurs.  I cannot open the file in TextPad or the EditPlus text viewer, so I cannot say whether the lines where the error occurred have a valid format.
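Since the editors choke on a 3.4 GB file, a streaming tool can pull out just the rows around the failure for inspection.  A sketch in POSIX shell (the row numbers and the name `bigfile.txt` are assumptions; substitute the row number from your error message):

```shell
#!/bin/sh
# Print a window of rows around the suspected failure point, with the
# line number and tab-field count for each, then stop reading the file.
test -f bigfile.txt || exit 0   # placeholder file; point this at your data
awk -F'\t' 'NR >= 949998 { print NR ": " NF " fields: " $0 }
            NR >= 950002 { exit }' bigfile.txt
```

The `exit` in the second pattern stops awk as soon as the window has printed, so the remaining gigabytes are never read.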

    For bcp, can you please give me some sample scripts?

    thanks again..

     
