December 8, 2009 at 12:39 pm
Hello,
I have a simple query (see below) to import a comma-delimited CSV file into a SQL Server 2005 database table. The BULK INSERT command works fine for File 1 (see below), which has no double quotes around its string fields. However, when I import File 2, the fields are not stored correctly because the import splits the data on the commas inside the double quotes.
Please note that this is a small test to reflect a real case scenario I need to implement. I cannot change the delimiter and the string fields in the CSV file may or may not be surrounded by double quotes.
I suspect I need to adjust the field terminator, but I am not sure how.
Thanks in advance.
--------------------------------------------------------------------
File 1: test-no-quotes.csv
--------------------------
ID,Name,Address
1,John Doe,1 Sunny Street
2,Fred Bloggs,24 High Street
File 2: test-quotes.csv
-----------------------
ID,Name,Address
1,John Doe,"1, Sunny Street"
2,"Fred, Bloggs",24 High Street
Bulk Insert SQL statement
-------------------------
BULK INSERT xImportTable
FROM 'C:\Files\CSV\test-no-quotes.csv'
WITH
( FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
December 9, 2009 at 3:04 am
Since my last post I have figured out that I could use a format file.
Here is another example CSV file which better illustrates what I am referring to:
---------------------------
ID,TextField1,TextField2
23,This is some text for this field,This is some other text
34,"This field, contains a comma and double quotes","So does this field, as you can see"
99,Sometimes it may not have double quotes,"Sometimes, it suddenly does"
---------------------------
As you can see from above, this is a perfectly acceptable CSV file but unfortunately I cannot control the consistency of the double quotes. The supplier of the data feed only appears to surround a field in double quotes if it contains the comma delimiter.
I have tried using the SQL Server BULK INSERT statement.
Here is an example:
BULK INSERT TempTable
FROM 'C:\Files\CSV\example.csv'
WITH
(
FORMATFILE = 'C:\Files\CSV\example-format.txt'
)
The problem I have here is the format file. I cannot get it to work when some rows have double quotes around a string field and other rows do not.
I managed to find an example on the internet here:
http://www.tek-tips.com/faqs.cfm?fid=1602
Although this works fine for fields surrounded by double quotes, as soon as you remove the double quotes from one of the fields the bulk insert fails.
So perhaps if anyone familiar with format files can help then I may have a solution.
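For reference, the format file from that link boils down to something like the sketch below (my own reconstruction, using the column names from the example above; the lengths are just placeholders). Because the double quote characters are baked into the field terminators, it only works when every text field is quoted, which is exactly where it falls over for me:
9.0
3
1   SQLCHAR   0   12    ",\""      1   ID           ""
2   SQLCHAR   0   200   "\",\""    2   TextField1   SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR   0   200   "\"\r\n"   3   TextField2   SQL_Latin1_General_CP1_CI_AS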
December 10, 2009 at 12:47 am
This should do it... use the SELECT as if it were a table...
SELECT *
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'Text;Database=C:\Files\CSV;HDR=YES;FMT=Delimited',
'SELECT * FROM example.csv')
... unless you're using 64 bit SQL Server or someone has changed the registry setting for the Jet driver. I've got a fix for the registry setting if someone changed it. Not sure if there's a fix for 64 bit.
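One other thing that can trip this up: ad hoc OPENROWSET/OPENDATASOURCE queries are disabled by default on SQL Server 2005, so you may need to enable them first. It's an instance-wide setting, so check with whoever owns the server before running this:
-- Enable ad hoc distributed queries (off by default on SQL Server 2005)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;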
--Jeff Moden
Change is inevitable... Change for the better is not.
December 10, 2009 at 1:53 am
Hi Jeff,
Thanks for this. I have never come across this method before and this does look promising.
I have a few questions, if you would kindly respond:
1. I am using SQL Server, so why have you suggested the Jet driver?
2. How do I get it to copy the results into a database table?
I came across the following syntax on the internet. Would this be closer to what I need?
SELECT * FROM OPENROWSET('MSDASQL','Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir={Directory Path of the CSV File}','SELECT * FROM file.csv')
So for example:
SELECT * FROM OPENROWSET('MSDASQL','Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=Z:','SELECT * FROM file.csv')
December 10, 2009 at 3:59 am
Hi Jeff,
I managed to get the OPENROWSET to work using the following
SELECT *
FROM OPENROWSET
('MSDASQL','Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=C:\Files\CSV','SELECT * FROM example3.csv')
How do I adjust the above script to copy the contents of the CSV file to a database table in which all the fields in the CSV file map directly to the fields in the database table?
I actually intend to apply this to very large CSV files, so performance is very important. I understand there is a way to bulk import using OPENROWSET, but I am not sure how.
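From what I have read so far, the BULK option of OPENROWSET would look something like the untested sketch below, although it appears to need a format file (the format file name here is made up), which brings me back to my earlier problem:
-- untested sketch: the BULK option of OPENROWSET requires a format file describing the columns
INSERT INTO TempTable WITH (TABLOCK)
SELECT *
FROM OPENROWSET(BULK 'C:\Files\CSV\example3.csv',
                FORMATFILE = 'C:\Files\CSV\example3-format.txt',
                FIRSTROW = 2) AS src;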
December 10, 2009 at 4:38 am
Ok I got a solution:
INSERT INTO
xTest2
(Numeric1,Field1,Field2,Field3,Field4,Field5,Field6,Field7,Field8)
SELECT
Numeric1,Field1,Field2,Field3,Field4,Field5,Field6,Field7,Field8
FROM OPENROWSET ('MSDASQL','Driver={Microsoft Text Driver (*.txt;*.csv)};DefaultDir=C:\Files\CSV','SELECT * FROM example3.csv')
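A side note in case the Text driver guesses the column types wrongly: a schema.ini file placed in the same folder as the CSV can pin the column names and types explicitly. This is only a sketch using the column names above; the widths are guesses:
[example3.csv]
ColNameHeader=True
Format=CSVDelimited
Col1=Numeric1 Long
Col2=Field1 Text Width 255
Col3=Field2 Text Width 255
Col4=Field3 Text Width 255
Col5=Field4 Text Width 255
Col6=Field5 Text Width 255
Col7=Field6 Text Width 255
Col8=Field7 Text Width 255
Col9=Field8 Text Width 255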
December 10, 2009 at 6:57 am
Yep... you've got it. Select/Into FROM OpenRowSet also works.
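Something like this, for example (the #Staging name is just a placeholder):
-- builds the staging table on the fly from whatever columns the driver returns
SELECT *
  INTO #Staging
  FROM OPENROWSET('MSDASQL',
       'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=C:\Files\CSV',
       'SELECT * FROM example3.csv');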
You mentioned this was for big files. I'd first see if I could get the vendor to make the delimiters in the header and the data consistent so you could use the very high speed bulk insert. Of course, that is usually a futile effort.
Anyway, if you don't mind, I'd be interested in some of your performance numbers. If they're too low, there are other slightly more complicated methods to do this from T-SQL that I've used/developed.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 10, 2009 at 7:58 am
Hi Jeff,
Thanks for your feedback.
I am at the point where I need to make a decision. I have 2 choices:
1. Use BULK INSERT - however, I would first need to run a regular expression over the CSV file to convert the comma delimiter to a pipe delimiter. I am assuming the data would then map to the table quite fast.
2. Use OPENROWSET - no need to change the CSV file but looking at the query it appears that it may work slower.
In both cases I would dump the data into a temporary table, then read from the temporary table and copy the data into the final tables. Initially I need to load about 250,000 records.
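For the second step (copying from the temporary table into the final tables) I have something like the sketch below in mind; the table and column names here are made up:
-- copy rows from the staging table into the final table, skipping keys that are already there
INSERT INTO dbo.FinalTable (KeyField, Field1, Field2)
SELECT s.KeyField, s.Field1, s.Field2
FROM   dbo.StagingTable AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.FinalTable AS f WHERE f.KeyField = s.KeyField);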
When I originally did this I used the ASP FileSystemObject to read each file into a local string variable and then processed each record line by line, creating the various table records as I went along. Initially I was processing 1,000 records every 10 minutes, but it became progressively slower as more records were added. After about 2 days I gave up when it got to around 170,000 records.
I am hoping one of the above solutions will speed things up. I also need to look at adding indexes to some table fields.
Any advice would be appreciated.
December 10, 2009 at 7:40 pm
support 23368 (12/10/2009)
no need to change the CSV file but looking at the query it appears that it may work slower.
Heh... slower than what? The 2 days it took you to only load 170k rows?
I guess my question would be, did you try the OpenRowSet method to load all 250k rows into a staging table? If so, how long did it take?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 11, 2009 at 3:50 am
Hi Jeff,
Thanks for your continued interest. I am reassured by the term "staging table". It suggests that this technique is used quite often when importing data.
To give you more information, I am in fact importing book records from a book data supplier. Each book is identified by its unique ISBN. Normally when I read a row, I first check whether the ISBN for that book already exists. I have since discovered that this check was indeed the bottleneck. During benchmark tests I found that it could take around 5 seconds to check all records for the existence of a single ISBN... ouch!
To fix this, I made the ISBN field an indexed field in SQL Server and now the check happens very rapidly (less than 1 second).
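For reference, the index was along these lines (the table and index names here are just examples):
-- a unique index on ISBN turns the existence check into a quick index seek
CREATE UNIQUE NONCLUSTERED INDEX IX_Books_ISBN
    ON dbo.Books (ISBN);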
I will be trying out a new import with the OPENROWSET this weekend and will let you know how I get on. However it appears that the transfer of the book data from the staging table to the destination tables would have been the issue. Hopefully I can report significant improvements now that the ISBN field is indexed.
Watch this space.
Oh, and one other thing. I also discovered that when you run a SQL query, SQL Server does in fact cache it. I have a benchmark SQL test to list books in a category that normally takes 9 seconds; when I run the same query again it takes 1 second. I am now thinking that after I import the books I will run a query to list all books in a category, in the hope that it is cached for future visitors. Is this a known technique?
December 11, 2009 at 8:21 am
Thanks for the feedback. I'd still be interested in knowing how long it took to import the 250k rows of data into a staging table using the OpenRowSet technique in your environment.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 11, 2009 at 8:29 am
Hi Jeff,
Yes, I will most definitely let you know how long it takes. I am going to start on it tonight.
March 22, 2012 at 12:15 pm
This might have a bit of workaround code in it, but it does work and it runs very quickly:
TRUNCATE TABLE dbo.ActivityLogs
BULK INSERT dbo.ActivityLogs
FROM 'C:\Logs\ActivityLog.txt' WITH (FIELDTERMINATOR = '","',DATAFILETYPE = 'widechar')
-- This gets rid of the leading and trailing double quotes in the 1st and last columns
UPDATE dbo.ActivityLogs
SET ActivityDate = REPLACE(ActivityDate,'"',''),
    Status = REPLACE(Status,'"','')
-- This gets rid of any rows containing only the column header names
DELETE FROM dbo.ActivityLogs
WHERE ActivityDate = 'ActivityDate'
SELECT * FROM dbo.ActivityLogs
March 22, 2012 at 7:01 pm
support 23368 (12/11/2009)
Hi Jeff, Yes, I will most definitely let you know how long it takes. I am going to start on it tonight.
Wow! It's taking a very long time! 😛
--Jeff Moden
Change is inevitable... Change for the better is not.