April 23, 2009 at 11:12 am
I have data that looks like this (the first row contains headers):
id, company, rep, employees
729216,INGRAM MICRO INC.,"Stuart, Becky",523
729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114
721177,GEORGE WESTON BAKERIES INC,"Hogan, Meg",253
In other words, only values that contain a comma are wrapped in quotes.
Because the quoting isn't consistent, I can't use '","' as a delimiter.
I tried using ',' as a delimiter and loading the data into a temporary table where every column is a varchar, then using some post-processing to strip out the quotes, but that doesn't work either, because the fields that contain ',' are split into multiple columns.
Unfortunately, I don't have the ability to manipulate the CSV file beforehand. And I can't use DTS, because this needs to be part of an automated process.
In short, it's part of an application where users can upload CSV files to the server, and have those files loaded into a database.
Is this hopeless?
Many thanks in advance for any advice.
April 23, 2009 at 11:17 am
It looks like the quotes are around a consistent set of columns in your example. If that's so, then OpenRowSet with an XML-type format file should be able to deal with that correctly. You can define the column terminator for each column individually in that case. Might do exactly what you need.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 23, 2009 at 12:45 pm
GSquared,
Many thanks for your reply.
Actually, the quotes aren't that consistent.
Some columns will ALWAYS be wrapped in quotes, since their values will always contain a comma.
However, a few columns won't be consistent - their values will only occasionally contain commas, so they'll only occasionally have quotes.
Does that mean I can't use OpenRowSet?
Cheers,
Matt
April 23, 2009 at 1:33 pm
In that case, you're almost going to have to handle it row-by-row. You might be able to build an algorithm for splitting it that would work on the whole set at once, but it's probably going to be more efficient to use .NET regex functions or something of that sort.
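(A sketch of that splitting idea - shown in Python purely for brevity, though the same pattern works with .NET's Regex.Split; the function name is my own. The rule: split only on commas that are followed by an even number of quote characters, i.e. commas sitting outside any quoted field.)

```python
import re

# Split on commas that are NOT inside double quotes: the lookahead
# asserts that an even number of quote characters remains after the comma.
SPLIT_RE = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_csv_line(line):
    """Split one CSV line into fields, honoring optional double quotes."""
    fields = SPLIT_RE.split(line.rstrip('\r\n'))
    return [f.strip('"') for f in fields]

print(split_csv_line('729216,INGRAM MICRO INC.,"Stuart, Becky",523'))
# ['729216', 'INGRAM MICRO INC.', 'Stuart, Becky', '523']
```

Note this simple version assumes no field contains an embedded (escaped) quote character, which matches the sample data shown above.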
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 23, 2009 at 1:52 pm
Try this (you'll need to change the hard-coded path to your text file to import):
-- the final results end up in this table variable
declare @Import TABLE (
    ID int,
    Company varchar(50),
    Rep varchar(50),
    Employees int)
-- create a temporary table to do all the dirty work
if OBJECT_ID('tempdb..#BulkImport') is not null DROP TABLE #BulkImport
CREATE TABLE #BulkImport(
LineText varchar(100)
)
-- perform a bulk insert of the text file.
--******* CHANGE THE HARD-CODED PATH\FILENAME FOR YOUR SYSTEM
BULK INSERT #BulkImport FROM 'C:\SQL\Test.txt'
-- get rid of the header line
delete from #BulkImport where LineText like 'id%'
-- add necessary columns to the temporary table
-- if these columns were part of the table definition, then the bulk insert would fail
ALTER TABLE #BulkImport ADD FirstOne int
ALTER TABLE #BulkImport ADD SecondOne int
ALTER TABLE #BulkImport ADD ThirdOne int
ALTER TABLE #BulkImport ADD FourthOne int
-- declare some working variables
declare @first int, @second int, @third int, @fourth int
-- perform Phil Factor's "quirky update"
-- see http://www.simple-talk.com/content/print.aspx?article=446
update #BulkImport
set @first  = FirstOne  = CHARINDEX(',', LineText),
    @second = SecondOne = case when SUBSTRING(LineText, @first+1, 1) = '"'
                               then CHARINDEX('"', LineText, @first+2)
                               else CHARINDEX(',', LineText, @first+2) end,
    @third  = ThirdOne  = CHARINDEX('",', LineText, @second+1),
    @fourth = FourthOne = CHARINDEX(',', LineText, @third+1)
-- put this all into the table variable
insert into @Import
select
LEFT(LineText, FirstOne-1),
CASE when SUBSTRING(LineText, FirstOne+1,1) = '"' then SUBSTRING(LineText, FirstOne+2, SecondOne-FirstOne-2)
else SUBSTRING(LineText, FirstOne+1, SecondOne-FirstOne-1) end,
SUBSTRING(LineText, SecondOne + case when SUBSTRING(LineText, SecondOne, 1) = '"' then 3 else 2 end, ThirdOne - SecondOne - case when SUBSTRING(LineText, SecondOne, 1) = '"' then 4 else 3 end),
SUBSTRING(LineText, FourthOne+1, LEN(LineText))
from #BulkImport
-- show the results
select * from @Import
-- drop the temp table
DROP TABLE #BulkImport
Thank you Phil and Jeff for teaching this to me! 🙂
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
April 23, 2009 at 7:01 pm
Actually.... let's just say the data from the original post was stored on the server in C:\Temp and the file was named SomeFile.txt.
Then, you can do a little SQL prestidigitation using Linked Servers...
--===== Create a text-based linked server.
EXEC dbo.sp_AddLinkedServer TempTextServer,
'Jet 4.0',
'Microsoft.Jet.OLEDB.4.0',
'C:\Temp', --LOOK!!! THIS IS THE DIRECTORY YOU WANT TO POINT AT!!!!
NULL,
'Text'
GO
--===== Set up the login (change the first null to a valid login name if you don't want SA)
EXEC dbo.sp_AddLinkedSrvLogin TempTextServer, FALSE, NULL, NULL
GO
--===== List the file names available in the new text based linked server.
     -- Notice that the "dot" in file names has been replaced by a # sign
EXEC dbo.sp_Tables_Ex TempTextServer
GO
--===== Query the file as if it were a table. Notice that the quotes and commas
     -- are handled automatically.
SELECT *
FROM TempTextServer...[SomeFile#txt]
Here's what the output of the original file looks like from the last SELECT above....
[font="Courier New"]id company rep employees
----------- ---------------------------- -------------- -----------
729216 INGRAM MICRO INC. Stuart, Becky 523
729235 GREAT PLAINS ENERGY, INC. Nelson, Beena 114
721177 GEORGE WESTON BAKERIES INC Hogan, Meg 253
(3 row(s) affected)
[/font]
Of course, since it works like a table, you can do SELECT/INTO, INSERT/SELECT, or whatever you need to do.
There's no doubt about it... it WILL be slower than Bulk Insert... but it will work without having to write special parsing code.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 23, 2009 at 7:45 pm
Leave it to Jeff to find another way with less code. :rolleyes:
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
April 23, 2009 at 9:33 pm
Jeff and WayneS,
Just wanted to thank you both for these fantastic solutions.
They both work great (but you knew that already).
The other option I've been messing with is using RegExp to pre-process the file - loading the entire file into memory, replacing the comma delimiters with tabs, removing unnecessary quotes, and rewriting it. This allows me to use the standard "Bulk Insert".
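(A sketch of that pre-processing idea - illustrative only; the function name, use of Python, and choice of tab as the new delimiter are mine, not part of the upload code. A CSV-aware reader absorbs the optional quotes, then the rows are re-emitted tab-delimited so a plain BULK INSERT can load them.)

```python
import csv
import io

def rewrite_as_tsv(src_text):
    """Re-read quote-and-comma CSV and re-emit it tab-delimited.

    Assumes no field ever contains a tab; with QUOTE_NONE and no
    escapechar, csv.writer raises an error rather than emitting an
    ambiguous line if one does.
    """
    out = io.StringIO()
    reader = csv.reader(io.StringIO(src_text))
    writer = csv.writer(out, delimiter='\t',
                        quoting=csv.QUOTE_NONE, quotechar=None,
                        lineterminator='\n')
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

print(rewrite_as_tsv('729216,INGRAM MICRO INC.,"Stuart, Becky",523\n'))
# 729216	INGRAM MICRO INC.	Stuart, Becky	523
```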
So, I guess the question is - which is the most efficient, and places the least stress on the server?
FWIW - I expect the file will typically be between 5,000 and 20,000 rows, about 20 fields, and typically between 500 KB and 5 MB.
Cheers,
Matt
April 24, 2009 at 9:06 am
stuehler (4/23/2009)
Jeff and WayneS, Just wanted to thank you both for these fantastic solutions.
You're welcome
So, I guess the question is - which is the most efficient, and places the least stress on the server?
FWIW - I expect the file will typically be between 5,000 and 20,000 rows, about 20 fields, and typically between 500 KB and 5 MB.
Cheers,
Matt
Matt,
20 fields... my way would require significant changes (though they should be straightforward). Jeff's way should handle the file without changes.
5k-20k rows... Jeff said that my way would be faster and more efficient (to answer your first question). But I'd set up a file with 20k entries and try his way out first. If that speed is acceptable, it's a lot easier. Plus, my way uses a feature of the UPDATE statement that isn't seen frequently, so it would be confusing to many people (which is why I included the link that explains it).
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
April 24, 2009 at 10:36 am
WayneS (4/23/2009)
Leave it to Jeff to find another way with less code.:rolleyes:
Heh... I'm just lazy... I hate typing lots of code.
I learned how to do this just for the occasional ad hoc import of data for test purposes - I needed a way to make it repeatable without writing a "system" to handle it. For production systems, I typically launch pork chops at the folks providing the data until they can provide it in a correct and easily consumable format.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 10, 2016 at 2:46 pm
Jeff Moden (4/23/2009)
Actually.... let's just say the data from the original post was stored on the server in C:\Temp and the file was named SomeFile.txt. Then, you can do a little SQL prestidigitation using Linked Servers... [quoted code and results snipped - see Jeff's post above]
Sorry to necrobump this thread, but Jeff, this is a lifesaver. It has been working well for me, but I just ran into a file where this approach loads one particular value ("1,237") as NULL, and I can't figure out why. Another comma-containing number ("1,559") loads just fine. Any thoughts on how to troubleshoot what's happening?
Also, since Jet is deprecated, I was using the following command to create the server, but I don't know if that makes a difference:
EXEC dbo.sp_AddLinkedServer TempTextServer,
'MSDASQL',
'Microsoft.ACE.OLEDB.12.0',
'C:\inetpub\uploads',
NULL,
'Text'
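(One avenue worth checking, offered with the caveat that it may not be the cause here: the Jet/ACE text driver guesses each column's type by sampling the first rows of the file, and a comma-formatted value in a column it guessed as numeric can come back NULL. A schema.ini file in the same directory pins the types explicitly. The entries below match the four-column sample file from this thread, not the actual 20-field upload:)

```ini
[SomeFile.txt]
Format=CSVDelimited
ColNameHeader=True
MaxScanRows=0
Col1=id Long
Col2=company Text
Col3=rep Text
Col4=employees Text
```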
Thanks!
-- Dave