April 23, 2009 at 11:12 am
I have data that looks like this (the first row contains headers):
id, company, rep, employees
729216,INGRAM MICRO INC.,"Stuart, Becky",523
729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114
721177,GEORGE WESTON BAKERIES INC,"Hogan, Meg",253
In other words, only values that contain a comma are wrapped in quotes.
Because the quoting isn't consistent, I can't use '","' as a delimiter.
I tried using ',' as a delimiter and loading the data into a temporary table where every column is a varchar, then using some post-processing to strip out the quotes, but that doesn't work either, because the fields that contain ',' are split into multiple columns.
Unfortunately, I don't have the ability to manipulate the CSV file beforehand. And I can't use DTS, because this needs to be part of an automated process.
In short, it's part of an application where users can upload CSV files to the server, and have those files loaded into a database.
Is this hopeless?
Many thanks in advance for any advice.
April 23, 2009 at 11:17 am
It looks like the quotes are around a consistent set of columns in your example. If that's so, then OpenRowSet with an XML-type format file should be able to deal with that correctly. You can define the column terminator for each column individually in that case. Might do exactly what you need.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 23, 2009 at 12:45 pm
GSquared,
Many thanks for your reply.
Actually, the quotes aren't that consistent.
Some columns will ALWAYS be wrapped in quotes, since their values will always contain a comma.
However, a few columns won't be consistent - their values will only occasionally contain commas, so they'll only occasionally have quotes.
Does that mean I can't use OpenRowSet?
Cheers,
Matt
April 23, 2009 at 1:33 pm
In that case, you're almost going to have to handle it row-by-row. You might be able to build an algorithm for splitting it that would work on the whole set at once, but it's probably going to be more efficient to use .NET regex functions or something of that sort.
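(A sketch of that splitting idea - shown in Python purely for brevity, though the same pattern works with .NET's Regex.Split; the function name is my own. The rule: split only on commas that are followed by an even number of quote characters, i.e. commas sitting outside any quoted field.)

```python
import re

# Split on commas that are NOT inside double quotes: the lookahead
# asserts that an even number of quote characters remains after the comma.
SPLIT_RE = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_csv_line(line):
    """Split one CSV line into fields, honoring optional double quotes."""
    fields = SPLIT_RE.split(line.rstrip('\r\n'))
    return [f.strip('"') for f in fields]

print(split_csv_line('729216,INGRAM MICRO INC.,"Stuart, Becky",523'))
# ['729216', 'INGRAM MICRO INC.', 'Stuart, Becky', '523']
```

Note this simple version assumes no field contains an embedded (escaped) quote character, which matches the sample data shown above.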
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 23, 2009 at 1:52 pm
Try this (you'll need to change the hard-coded path to your text file to import):
-- the final results end up in this table variable
declare @Import TABLE (
    ID int,
    Company varchar(50),
    Rep varchar(50),
    Employees int)
-- create a temporary table to do all the dirty work
if OBJECT_ID('tempdb..#BulkImport') is not null DROP TABLE #BulkImport
CREATE TABLE #BulkImport(
LineText varchar(100)
)
-- perform a bulk insert of the text file.
--******* CHANGE THE HARD-CODED PATH\FILENAME FOR YOUR SYSTEM
BULK INSERT #BulkImport FROM 'C:\SQL\Test.txt'
-- get rid of the header line
delete from #BulkImport where LineText like 'id%'
-- add necessary columns to the temporary table
-- if these columns were part of the table definition, then the bulk insert would fail
ALTER TABLE #BulkImport ADD FirstOne int
ALTER TABLE #BulkImport ADD SecondOne int
ALTER TABLE #BulkImport ADD ThirdOne int
ALTER TABLE #BulkImport ADD FourthOne int
-- declare some working variables
declare @first int, @second int, @third int, @fourth int
-- perform Phil Factor's "quirky update"
-- see http://www.simple-talk.com/content/print.aspx?article=446
update #BulkImport
set @first  = FirstOne  = CHARINDEX(',', LineText),
    @second = SecondOne = case when SUBSTRING(LineText, @first+1, 1) = '"'
                               then CHARINDEX('"', LineText, @first+2)
                               else CHARINDEX(',', LineText, @first+2) end,
    @third  = ThirdOne  = CHARINDEX('",', LineText, @second+1),
    @fourth = FourthOne = CHARINDEX(',', LineText, @third+1)
-- put this all into the table variable
insert into @Import
select
LEFT(LineText, FirstOne-1),
CASE when SUBSTRING(LineText, FirstOne+1,1) = '"' then SUBSTRING(LineText, FirstOne+2, SecondOne-FirstOne-2)
else SUBSTRING(LineText, FirstOne+1, SecondOne-FirstOne-1) end,
SUBSTRING(LineText, SecondOne + case when SUBSTRING(LineText, SecondOne, 1) = '"' then 3 else 2 end, ThirdOne - SecondOne - case when SUBSTRING(LineText, SecondOne, 1) = '"' then 4 else 3 end),
SUBSTRING(LineText, FourthOne+1, LEN(LineText))
from #BulkImport
-- show the results
select * from @Import
-- drop the temp table
DROP TABLE #BulkImport
Thank you Phil and Jeff for teaching this to me! 🙂
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
April 23, 2009 at 7:01 pm
Actually.... let's just say the data from the original post was stored on the server in C:\Temp and the file was named SomeFile.txt.
Then, you can do a little SQL prestidigitation using Linked Servers...
--===== Create a text-based linked server.
EXEC dbo.sp_AddLinkedServer TempTextServer,
'Jet 4.0',
'Microsoft.Jet.OLEDB.4.0',
'C:\Temp', --LOOK!!! THIS IS THE DIRECTORY YOU WANT TO POINT AT!!!!
NULL,
'Text'
GO
--===== Set up the login (change the first null to a valid login name if you don't want SA)
EXEC dbo.sp_AddLinkedSrvLogin TempTextServer, FALSE, NULL, NULL
GO
--===== List the file names available in the new text based linked server.
     -- Notice that the "dot" in file names has been replaced by a # sign
EXEC dbo.sp_Tables_Ex TempTextServer
GO
--===== Query the file as if it were a table. Notice that the quotes and commas
     -- are handled automatically.
SELECT *
FROM TempTextServer...[SomeFile#txt]
Here's what the output of the original file looks like from the last SELECT above....
[font="Courier New"]id company rep employees
----------- ---------------------------- -------------- -----------
729216 INGRAM MICRO INC. Stuart, Becky 523
729235 GREAT PLAINS ENERGY, INC. Nelson, Beena 114
721177 GEORGE WESTON BAKERIES INC Hogan, Meg 253
(3 row(s) affected)
[/font]
Of course, since it works like a table, you can do SELECT/INTO, INSERT/SELECT, or whatever you need to do.
There's no doubt about it... it WILL be slower than Bulk Insert... but it will work without having to write special parsing code.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 23, 2009 at 7:45 pm
Leave it to Jeff to find another way with less code. :rolleyes:
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
April 23, 2009 at 9:33 pm
Jeff and WayneS,
Just wanted to thank you both for these fantastic solutions.
They both work great (but you knew that already).
The other option I've been messing with is using RegExp to pre-process the file - loading the entire file into memory, replacing the comma delimiters with tabs, removing unnecessary quotes, and rewriting it. This allows me to use the standard "Bulk Insert".
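(A sketch of that pre-processing idea - illustrative only; the function name, use of Python, and choice of tab as the new delimiter are mine, not part of the upload code. A CSV-aware reader absorbs the optional quotes, then the rows are re-emitted tab-delimited so a plain BULK INSERT can load them.)

```python
import csv
import io

def rewrite_as_tsv(src_text):
    """Re-read quote-and-comma CSV and re-emit it tab-delimited.

    Assumes no field ever contains a tab; with QUOTE_NONE and no
    escapechar, csv.writer raises an error rather than emitting an
    ambiguous line if one does.
    """
    out = io.StringIO()
    reader = csv.reader(io.StringIO(src_text))
    writer = csv.writer(out, delimiter='\t',
                        quoting=csv.QUOTE_NONE, quotechar=None,
                        lineterminator='\n')
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

print(rewrite_as_tsv('729216,INGRAM MICRO INC.,"Stuart, Becky",523\n'))
# 729216	INGRAM MICRO INC.	Stuart, Becky	523
```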
So, I guess the question is - which is the most efficient, and places the least stress on the server?
FWIW - I expect the file will typically be between 5,000 and 20,000 rows, about 20 fields, and typically between 500 KB and 5 MB.
Cheers,
Matt
April 24, 2009 at 9:06 am
stuehler (4/23/2009)
Jeff and WayneS, Just wanted to thank you both for these fantastic solutions.
You're welcome
So, I guess the question is - which is the most efficient, and places the least stress on the server?
FWIW - I expect the file will typically be between 5,000 and 20,000 rows, about 20 fields, and typically between 500 KB and 5 MB.
Cheers,
Matt
Matt,
20 fields... my way would require significant changes (though they should be straightforward). Jeff's way should handle the file without changes.
5k-20k rows... Jeff said that my way would be faster and more efficient (to answer your first question). But I'd set up a file with 20k entries and try his way out first. If that speed is acceptable, it's a lot easier. Plus, my way uses a feature of the UPDATE statement that isn't seen frequently, so it would be confusing to many people (which is why I included the link that explains it).
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
April 24, 2009 at 10:36 am
WayneS (4/23/2009)
Leave it to Jeff to find another way with less code.:rolleyes:
Heh... I'm just lazy... I hate typing lots of code.
I learned how to do this just for the occasional ad hoc import of data for test purposes - I needed a way to make it repeatable without writing a "system" to handle it. For production systems, I typically launch pork chops at the folks providing the data until they can provide it in a correct and easily consumable format.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 10, 2016 at 2:46 pm
Jeff Moden (4/23/2009)
Actually.... let's just say the data from the original post was stored on the server in C:\Temp and the file was named SomeFile.txt. Then, you can do a little SQL prestidigitation using Linked Servers... [quoted code and results snipped - see Jeff's post above]
Sorry to necrobump this thread, but Jeff, this is a lifesaver. It has been working well for me, but I just ran into a file where this approach loads one particular value ("1,237") as NULL, and I can't figure out why. Another comma-containing number ("1,559") loads just fine. Any thoughts on how to troubleshoot what's happening?
Also, since Jet is deprecated, I was using the following command to create the server, but I don't know if that makes a difference:
EXEC dbo.sp_AddLinkedServer TempTextServer,
'MSDASQL',
'Microsoft.ACE.OLEDB.12.0',
'C:\inetpub\uploads',
NULL,
'Text'
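(One avenue worth checking, offered with the caveat that it may not be the cause here: the Jet/ACE text driver guesses each column's type by sampling the first rows of the file, and a comma-formatted value in a column it guessed as numeric can come back NULL. A schema.ini file in the same directory pins the types explicitly. The entries below match the four-column sample file from this thread, not the actual 20-field upload:)

```ini
[SomeFile.txt]
Format=CSVDelimited
ColNameHeader=True
MaxScanRows=0
Col1=id Long
Col2=company Text
Col3=rep Text
Col4=employees Text
```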
Thanks!
-- Dave