February 22, 2011 at 12:57 pm
I thought I'd share a unique problem that I'm dealing with and see if we can get a discussion going as to how to best handle it.
I'm dealing with legacy clients, and we're in a position where we can't ask them to change their file format.
Here's the issue: we need to load a comma-delimited, text-qualified file into a SQL table.
As an example, when clientA came on board they only needed to send 2 columns of data, clientB sent 4 columns, and clientC sends 6 columns.
We can put in NULLs for the column data that is not sent.
i.e.
clientA's rec.txt
"Alfred","Ames"
"Aaron","Abbot"
clientB's rec.txt
"Bob","Baines",123,"Blue"
"Bill","Bathgate",456,"Yellow"
clientC's rec.txt
"charlie","Cass",789,"green","123 anywhere","USA"
"cathy","cow",444,"red",234 nowhere","Germany"
I want to be able to load records from the three clients above into a single SQL table:
Col1     Col2      Col3  Col4    Col5          Col6
Alfred   Ames      NULL  NULL    NULL          NULL
Aaron    Abbot     NULL  NULL    NULL          NULL
Bob      Baines    123   Blue    NULL          NULL
Bill     Bathgate  456   Yellow  NULL          NULL
charlie  Cass      789   green   123 anywhere  USA
cathy    cow       444   red     234 nowhere   Germany
SSIS requires that I map each input column to an output column, and this leads to difficulties:
1) I can have multiple data flow tasks and only execute a single one based on the number of columns in that particular txt file.
---This doesn't seem like a very clean solution to me, especially given the number of times I'd have to duplicate the data flow.
2) Using a script transform, I parse the file and programmatically create the NULL column values for output.
---I've actually done this for our smaller txt files, but this method kills performance on larger ones.
3) Pre-flood the text files with column delimiters at the end of each line.
---This solution is actually similar to #2 above in that you're manufacturing columns. The transform loads quickly, but I haven't found a way to quickly add the commas to the end of each line of a text file (a rough sketch of the padding arithmetic follows this list).
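For illustration, here is the padding arithmetic sketched in T-SQL against a single-column staging table rather than against the file itself; the table and column names are made up, and it assumes no commas hide inside the quoted values:

--===== Pad every line out to 5 commas (6 columns) so that all rows end up
--      with the same number of delimiters. REPLICATE returns an empty
--      string when a line is already full. (Naive: assumes no line has
--      more than 6 columns and no commas embedded in the quoted values.)
SELECT stg.RawLine
       + REPLICATE(',', 5 - (LEN(stg.RawLine) - LEN(REPLACE(stg.RawLine, ',', ''))))
         AS PaddedLine
  FROM dbo.CsvStaging stg
;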
Have any of you faced this challenge? How did you handle it? What suggestions do you have that I might have missed?
February 22, 2011 at 1:01 pm
I would use SSIS to load the CSV file into a staging table with a single column.
Then I would call a sproc with a fast split-string function (e.g., DelimitedSplit8K) and a CrossTab query to get the data back into rows, filling the missing columns with NULL at the same time. Finally, the sproc would insert the data into the final table.
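A rough sketch of what I mean, assuming Jeff Moden's DelimitedSplit8K is already installed and using made-up table names (dbo.CsvStaging, dbo.FinalTable); note the REPLACE is a naive way to strip the text qualifiers and would mangle values containing embedded quotes or commas:

--===== Staging table: SSIS lands each raw line of the file in one wide column.
CREATE TABLE dbo.CsvStaging
        (
        RowID   INT IDENTITY(1,1) PRIMARY KEY,
        RawLine VARCHAR(8000)
        )
;
--===== Split each line on the commas and CrossTab the pieces back into
--      columns. Rows with fewer pieces simply get NULL in the missing spots.
 INSERT INTO dbo.FinalTable
        (Col1, Col2, Col3, Col4, Col5, Col6)
 SELECT MAX(CASE WHEN split.ItemNumber = 1 THEN split.Item END),
        MAX(CASE WHEN split.ItemNumber = 2 THEN split.Item END),
        MAX(CASE WHEN split.ItemNumber = 3 THEN split.Item END),
        MAX(CASE WHEN split.ItemNumber = 4 THEN split.Item END),
        MAX(CASE WHEN split.ItemNumber = 5 THEN split.Item END),
        MAX(CASE WHEN split.ItemNumber = 6 THEN split.Item END)
   FROM dbo.CsvStaging stg
  CROSS APPLY dbo.DelimitedSplit8K(REPLACE(stg.RawLine, '"', ''), ',') split
  GROUP BY stg.RowID
;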
February 22, 2011 at 1:13 pm
There are a number of posts by Jeff Moden on doing this via BCP; he refers to the issue as "ragged right".
Here's one example he posted to address the issue... it's not SSIS, but it's worth looking at:
--Originally posted by Jeff Moden somewhere in the sands of time:
BCP file import headless ragged right
Heh... that's known as a "Headerless Ragged Right" file and they're a bugger to import correctly. BCP and BULK INSERT require that all the rows have the same number of delimiters, so you can't import such a file with either.
However... with a little help from a couple of DOS commands and a little prestidigitation with a text based linked server, it can be done with very little pain.
First, I assume you have a file, like in your original post, in C:\Temp called Test01.txt and it looks like this...
col1,col2,col3
col1,col2
col1,col2,col3,col4
col1,col2,col3,col4,col5
Next, let's set up a linked server and give it the necessary login privs...
--===== Create a linked server to the drive and path you desire.
EXEC dbo.sp_AddLinkedServer TxtSvr,
'Jet 4.0',
'Microsoft.Jet.OLEDB.4.0',
'C:\Temp',
NULL,
'Text'
--===== Set up login mappings.
EXEC dbo.sp_AddLinkedSrvLogin TxtSvr, FALSE, NULL, Admin, NULL
Here comes the fun part... if we just read the file directly...
--===== Query one of the files by using a four-part name.
SELECT *
FROM TxtSvr...[Test01#txt]
... we get an awful mess that looks like this...
F4    F5    col1  col2  col3
NULL  NULL  col1  col2  NULL
col4  NULL  col1  col2  col3
col4  col5  col1  col2  col3

(3 row(s) affected)
Notice that the first row was used as a header, that the data has been rearranged, AND that two of the headers have been made up! We can fix that with a couple of simple DOS commands...
--===== Create a header that identifies the columns we want... (any col names will do)
EXEC Master.dbo.xp_CmdShell 'ECHO HdrCol1,HdrCol2,HdrCol3,HdrCol4,HdrCol5 > C:\Temp\Header.txt'
--===== Create a new working file that has the header and the original file as one
EXEC Master.dbo.xp_CmdShell 'COPY C:\Temp\Header.txt+C:\Temp\Test01.txt C:\Temp\MyWork.txt'
Then, accessing the data in the file is a breeze... do with it what you want because it can now be referenced as a table with the correct column names...
--===== Read the csv text file as if it were a table
SELECT *
FROM TxtSvr...[MyWork#txt]
HdrCol1  HdrCol2  HdrCol3  HdrCol4  HdrCol5
col1     col2     col3     NULL     NULL
col1     col2     NULL     NULL     NULL
col1     col2     col3     col4     NULL
col1     col2     col3     col4     col5

(4 row(s) affected)
If you need to drop the linked server after that, the following command will do nicely...
EXEC dbo.sp_DropServer 'TxtSvr', 'DropLogins'
Lowell
February 22, 2011 at 1:16 pm
I probably would use a flat file source to read the data as one column. I would then feed that one column to a script component where I would use VB to parse the data, inserting the default NULL value for the missing columns. The output from the script component would include all the required columns for a "complete" input data row.
For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help
February 22, 2011 at 1:32 pm
Alvin, that is what I did in my option number two above; I just didn't go into the details. I imported the data as a single column and then parsed through the file. I found this solution to take, on average, 4x as long as directly loading the data.