BULK INSERT and CSVs

  • I've been looking around for a while, and it seems there is no way to import a CSV that surrounds its values with quotes (even when every value is always quoted) without using a format file. Additionally, data fields can never contain the field terminator. So essentially, if you have a pipe-delimited CSV with double quotes, you'd use a format file with (e.g.) "|" to terminate the fields, and you'd have to make sure the data itself never contains "|".

    Is this really true? MySQL, which is free, has had LOAD DATA INFILE syntax to handle this for some time. It seems kind of funny that a commercial product like SQL Server wouldn't have an equivalent.
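    For concreteness, this is the kind of plain BULK INSERT I'd like to be able to run (the table and file names are just examples); with a bare pipe terminator, the surrounding quotes end up stored as part of the data:

    BULK INSERT dbo.People
    FROM 'C:\SampleFile.csv'
    WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n');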

  • Try:

    SELECT *
    FROM OPENROWSET('MSDASQL',
        'Driver={Microsoft Text Driver (*.txt; *.csv)};DEFAULTDIR=C:\;Extensions=CSV;',
        'SELECT * FROM [SampleFile.csv]')
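    If the server refuses ad hoc OPENROWSET queries, they may need to be enabled first (this needs elevated permissions; just a sketch):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
    RECONFIGURE;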

  • Tried it, although it's not ideal because it requires me to change security settings so I probably can't deploy it to my client's computer. Anyway, I keep getting errors like:

    OLE DB provider "MSDASQL" for linked server "(null)" returned message "[Microsoft][ODBC Text Driver] The Microsoft Jet database engine could not find the object 'C:\inputData.csv'. Make sure the object exists and that you spell its name and the path name correctly.".

    Incidentally, I'm also trying to press ahead with the format files, but I get an "unexpected end of file" message whenever I try to use one of them. I'm pretty sure I have my field terminators right, though.

  • If you could illustrate with an example of the format file, it would be easier to see what the problem might be.

  • Sure, sorry. Here is what I am working with right now:

    10.0
    4
    1 SQLCHAR 0 0 '"|"' 1 FIRST_NAME "SQL_Latin1_General_CP1_CI_AS"
    2 SQLCHAR 0 0 '"|"' 2 LAST_NAME "SQL_Latin1_General_CP1_CI_AS"
    3 SQLCHAR 0 0 '"|"' 3 STATE "SQL_Latin1_General_CP1_CI_AS"
    4 SQLCHAR 0 0 "\r\n" 4 COUNTRY "SQL_Latin1_General_CP1_CI_AS"

    (That last terminator is backslash-r backslash-n, i.e. "\r\n".)
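    In case it matters, I'm invoking it roughly like this (the table name and format file path are placeholders):

    BULK INSERT dbo.People
    FROM 'C:\inputData.csv'
    WITH (FORMATFILE = 'C:\PeopleFormat.fmt');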

  • OK, I actually figured out the problem with the EOF. It looks like the format files don't allow you to surround terminators with single quotes, only double quotes. So a "|" terminator can be expressed in the field as "\"|\"", but not as '"|"'.

    My one remaining problem is, how to exclude the first quote on each line (i.e. the opening enclosing quote for the first field), because it is not a field terminator. I could add it to the row-terminator, i.e. "\"\r\"", but then I'd still have the opening quote for the first row, and the closing quote for the last row, to worry about.
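    To make it concrete, each line of my file looks roughly like this (values made up):

    "John"|"Smith"|"NY"|"USA"
    "Jane"|"Doe"|"CA"|"USA"

    The leading " on each line is the one that no field terminator accounts for.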

  • Try this (the terminator on the last field is a double quote followed by backslash-r backslash-n):

    10.0
    5
    1 SQLCHAR 0 0 "\"" 0 Quote1 "SQL_Latin1_General_CP1_CI_AS"
    2 SQLCHAR 0 0 "\"|\"" 1 FIRST_NAME "SQL_Latin1_General_CP1_CI_AS"
    3 SQLCHAR 0 0 "\"|\"" 2 LAST_NAME "SQL_Latin1_General_CP1_CI_AS"
    4 SQLCHAR 0 0 "\"|\"" 3 STATE "SQL_Latin1_General_CP1_CI_AS"
    5 SQLCHAR 0 0 "\"\r\n" 4 COUNTRY "SQL_Latin1_General_CP1_CI_AS"
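    The dummy Quote1 field has server column order 0, so it isn't imported; it just consumes the opening quote of each line. Each quote-pipe-quote terminator ("\"|\"") then swallows the closing quote of one value, the pipe, and the opening quote of the next, and the quote-CRLF row terminator ("\"\r\n") takes the final closing quote plus the line break. Against a made-up line like

    "John"|"Smith"|"NY"|"USA"

    the five fields line up exactly.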

  • That worked!! Great idea, I hadn't thought about creating a dummy column.

    It seems to all be working now, and crazy fast too.

    The one thing is, right now I have found it convenient to specify FIRSTROW=2 to bypass my CSV header (first row that specifies column names). MSDN, however, states:

    The FIRSTROW attribute is not intended to skip column headers. Skipping headers is not supported by the BULK INSERT statement. When skipping rows, the SQL Server Database Engine looks only at the field terminators, and does not validate the data in the fields of skipped rows.

    http://msdn.microsoft.com/en-us/library/ms188365.aspx

    I could program my script to strip out the header, but it doesn't seem to be worth the trouble as long as it's terminated in a line break. Despite their warning, it seems fine/convenient...
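    With the header skip in place, the call looks roughly like this (same placeholder names as before):

    BULK INSERT dbo.People
    FROM 'C:\inputData.csv'
    WITH (FORMATFILE = 'C:\PeopleFormat.fmt', FIRSTROW = 2);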

  • From what you are telling me, it is absolutely fine to use FIRSTROW = 2. Yes, it is crazy fast! Format files can be a bit tricky, but they're worth it in the end.

  • One other issue I did encounter: these format files ask us to provide both a column order and a column name. I had always thought that part of the point of SQL is that the physical order of the columns doesn't matter so much. It might be difficult for my script to know the column orders, but it will be easy for it to know the column names. Is there a way I can ascertain the former from the latter?

    EDIT: And sure enough, this is causing my script to fail in some cases, because the order of columns of my data files doesn't always match the internal order of my SQL tables as they get created.
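    (For what it's worth, I can see the physical order from the names with a catalog query like the one below, using a made-up table name, but I'd rather not have to build the format file from that if there's a cleaner way.)

    SELECT COLUMN_NAME, ORDINAL_POSITION
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'People'
    ORDER BY ORDINAL_POSITION;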

  • Using OPENROWSET with the BULK option, you can specify the columns in any order you like, as if you were selecting from a table, e.g.

    SELECT STATE, COUNTRY, LAST_NAME, FIRST_NAME FROM OPENROWSET (BULK 'C:\SampleFile.csv', FORMATFILE = 'C:\SampleFormat.fmt') AS Z

  • Thanks, that worked like a charm. I just did something like:

    INSERT INTO [table] (columns in order)
    SELECT (columns in order)
    FROM OPENROWSET (BULK [path], FORMATFILE = [path], FIRSTROW = 2) AS Z
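    Spelled out with the sample column names from earlier in the thread (the table name is made up), that's something like:

    INSERT INTO dbo.People (FIRST_NAME, LAST_NAME, STATE, COUNTRY)
    SELECT FIRST_NAME, LAST_NAME, STATE, COUNTRY
    FROM OPENROWSET (BULK 'C:\SampleFile.csv', FORMATFILE = 'C:\SampleFormat.fmt', FIRSTROW = 2) AS Z;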

    I thought it would be slower since it's reordering the columns, but if anything it's actually slightly faster now.

    Sweet!

  • Excellent! Looks like you nailed it.

  • Hey, I have run into one more issue, which is the question of escaping. Since the pipe is the only character that is guaranteed to be a field separator in my case (quotes don't surround all the fields), I will be in trouble if I ever have a pipe within field content. Is there any way to specify a literal pipe that shouldn't be counted by BULK INSERT as a separator, i.e. a way to escape it like this:

    A sentence followed by a name|Here is the name
    Sentence One|Amanda
    Sentence Two|Kathleen
    This is a pipe: \||Andrew

  • Replying to e1785 (9/9/2010):


    Could you supply the first few lines of such a file, including column headers, aliasing any sensitive data?
