September 13, 2008 at 11:28 pm
I count 8 columns in the first line and only 7 columns in the second line. Is that correct? And, the single quotes on each end are actually in the file as well, correct?
--Jeff Moden
Change is inevitable... Change for the better is not.
September 14, 2008 at 3:18 am
Single quotes are for the test strings.
They are not in the file, of course. 🙂
And, I believe, there are 7 columns in both strings.
1st one should be split into:
1: Pick-up
2:
3:
4: 1234
5: 2008-08-24
6: "2008-08-24"
7: 17", unknown brand
_____________
Code for TallyGenerator
September 14, 2008 at 6:48 am
From what I can tell so far, you are trying to parse a whole row of information into columns to form a single record, where some of the columns themselves need sub-parsing.
As I mentioned earlier, a Tally or numbers-table approach would help, but I can see it getting complex. The UDFs that I have created are designed for an RBAR approach.
With my approach, you pass in the whole line, determine the number of delimiters, use that count as the domain for a loop, do a string parse, and return a table fully hydrated with your desired output. In some cases, where you need to sub-parse a column, recursion is needed.
I use a combination of three UDFs to do this (a sketch of one follows the list):
fn_Delim_To_Table
fn_PatCount
fn_Repetitive_Str_Parse
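The delimiter-counting piece boils down to something like this sketch - the idea, not necessarily the exact code posted on the site:

CREATE FUNCTION dbo.fn_PatCount
        (@String VARCHAR(8000), @Pattern VARCHAR(100))
RETURNS INT
AS
BEGIN
        -- Remove every occurrence of the pattern and compare lengths:
        -- the difference divided by the pattern length is the count,
        -- which then sets the upper bound of the parsing loop.
        RETURN (LEN(@String) - LEN(REPLACE(@String, @Pattern, ''))) / LEN(@Pattern)
END

For example, SELECT dbo.fn_PatCount('"a","b","c"', ',') returns 2, so the loop would extract 3 columns.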
Some of the code I found right here on SQL Server Central. Efficiency is OBVIOUSLY lacking in this approach, where brute force reigns supreme.
I will continue to monitor this thread because, if there is a better solution (and I believe there is), I am wide open to suggestions.
Good luck,
Mike
September 14, 2008 at 10:04 am
Sergiy (9/14/2008)
Single quotes are for the test strings. They are not in the file, of course. 🙂
And, I believe, there are 7 columns in both strings.
1st one should be split into:
1: Pick-up
2:
3:
4: 1234
5: 2008-08-24
6: "2008-08-24"
7: 17", unknown brand
Heh... just making sure... I had to ask because I just got done splitting a similar "properly formed" CSV that actually had the single quotes in it, like this one.
I've got a couple of "obligations" in the real world that I have to take care of today, Sergiy... I'll try to get back to this soon because these types of things are so much fun 🙂
--Jeff Moden
Change is inevitable... Change for the better is not.
September 15, 2008 at 8:23 pm
Sorry for the delay...
Absolutely the easiest way to read a "properly formed" CSV file is using OpenRowSet. If the file is called "ImportTest.csv" and it's stored in "C:\Temp", and the file has no headers, then the following does a dandy job...
SELECT *
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'Text;Database=C:\Temp;HDR=NO;FMT=Delimited',
'SELECT * FROM ImportTest.csv')
This is just one way to import the file. Of course, it can easily be turned into an INSERT/SELECT or a SELECT/INTO.
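For example, the SELECT/INTO form, where "CsvStaging" is just an illustrative table name:

SELECT *
  INTO dbo.CsvStaging  -- hypothetical target table, created on the fly
  FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
       'Text;Database=C:\Temp;HDR=NO;FMT=Delimited',
       'SELECT * FROM ImportTest.csv')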
--Jeff Moden
Change is inevitable... Change for the better is not.
September 15, 2008 at 9:10 pm
Jeff Moden (9/15/2008)
Sorry for the delay... Absolutely the easiest way to read a "properly formed" CSV file is using OpenRowSet. If the file is called "ImportTest.csv" and it's stored in "C:\Temp", and the file has no headers, then the following does a dandy job...
SELECT *
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'Text;Database=C:\Temp;HDR=NO;FMT=Delimited',
'SELECT * FROM ImportTest.csv')
This is just one way to import the file. Of course, it can easily be turned into an INSERT/SELECT or a SELECT/INTO.
Well, I've heard about this approach 🙂
But there is a little problem.
We need to process about 40 different file formats, all with different numbers of columns; some have a different number of columns in the "Header" and "Body" rows.
And we expect the number of formats to grow.
Files happen to have all sorts of cra.. (shut up, Sergiy!) inside, including broken lines (e.g. CR/LF in "Description" fields), badly formatted dates, missing quotes around values with delimiters inside, missing columns in some lines, etc.
That's why I BULK INSERT the file into a temp table with 2 columns, an INT row identity and a TEXT row contents, and then parse it into a "vertical" table having the columns RowID, ColID, Value.
Then I can count columns, make sure there are none missing or unexpected, validate the data in each column without a dynamic SQL mess, check the file structure against the expected schema and, if everything is right, load the values from the specified set of ColIDs into the destination tables (pivoting them back, of course).
And, if anything is wrong, I can automatically report exactly what went wrong and where, and which file cannot be processed.
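A rough sketch of that staging setup, with names and sizes that are placeholders (the real table uses a TEXT column and an identity, which BULK INSERT needs a format file or a post-load numbering step to populate):

CREATE TABLE #RawLines (RowContents VARCHAR(8000) NULL)

BULK INSERT #RawLines
  FROM 'C:\Temp\ImportTest.csv'  -- hypothetical path and file name
  WITH (FIELDTERMINATOR = '\0',  -- never matches, so the whole line lands in one column
        ROWTERMINATOR   = '\n')

-- Target of the parse: one row per (line, column) pair
CREATE TABLE #Vertical
       (RowID INT NOT NULL,
        ColID INT NOT NULL,
        Value VARCHAR(8000) NULL,
        PRIMARY KEY (RowID, ColID))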
I'm not sure I can achieve that with the Microsoft.Jet.OLEDB.4.0 driver. It will not provide me with specific enough error messages.
And because some of our customers are on the other side of the globe, I'm trying to avoid answering urgent calls about missing crucial files at 2am. :w00t:
I have this thing working, and working OK.
But because I cannot figure out a way to do it with a Tally table, I loop through every row.
Not bad when the file is up to 100 rows, but some of them have more than 50k - and it hurts.
:sick:
_____________
Code for TallyGenerator
September 15, 2008 at 9:22 pm
Heh... understood... and I frequently do the same thing if the file is crap...
... but that's not what you asked.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 15, 2008 at 9:29 pm
Jeff Moden (9/15/2008)
Heh... understood... and I frequently do the same thing if the file is crap... ... but that's not what you asked.
I asked about splitting a bunch of quoted, delimited strings without a loop.
So far I've seen only an option that doesn't consider the quotes.
I tried to make it myself - and abandoned it when it became overcomplicated.
_____________
Code for TallyGenerator
September 15, 2008 at 10:31 pm
You want the rows you posted to be split instead of imported?
--Jeff Moden
Change is inevitable... Change for the better is not.
September 15, 2008 at 10:37 pm
BTW... as you already know, the rows you originally posted import just fine because they are, despite appearances, properly formatted CSVs. The delimiters also line up and could be imported using a BCP format file with \" to mark the quotes in the delimiters. Are these files typical for you, or are you having to put up with cannon fodder from people who aren't qualified to provide data? 🙂
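The general shape of such a non-XML format file would be something like this sketch, assuming every field is quoted (the column names and lengths are placeholders, and the mixed quoting in the sample rows would force per-column terminator tweaks):

8.0
8
1  SQLCHAR  0  0     "\""      0  LeadingQuote  ""
2  SQLCHAR  0  8000  "\",\""   1  Col1          ""
3  SQLCHAR  0  8000  "\",\""   2  Col2          ""
4  SQLCHAR  0  8000  "\",\""   3  Col3          ""
5  SQLCHAR  0  8000  "\",\""   4  Col4          ""
6  SQLCHAR  0  8000  "\",\""   5  Col5          ""
7  SQLCHAR  0  8000  "\",\""   6  Col6          ""
8  SQLCHAR  0  8000  "\"\r\n"  7  Col7          ""

The first dummy field consumes the leading quote; each later terminator eats the closing quote, the comma, and the next opening quote.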
--Jeff Moden
Change is inevitable... Change for the better is not.
September 15, 2008 at 10:47 pm
If I try to import a file with wrong formatting (broken rows, unquoted delimiters in strings, a missing header, or a header row where it's not supposed to be, etc.) I'll end up with some error message about an "ODBC driver error" or "Binary string truncated" or something like that.
If my import process reports such an error to a customer 11 hours away from me, I'll get a call at 2am: "What the heck is that?"
It's not gonna make me happy, even if I'm not asleep and am sitting with friends drinking beer or whatever is left by that time. :hehe:
That's why I import the file rows as a whole (bulletproof, especially when they're imported into a TEXT column) and then split them in order to process them further.
Of course, doing that split I need to consider the quotation rules used for delimited files.
Can you suggest another way?
_____________
Code for TallyGenerator
September 15, 2008 at 10:56 pm
The delimiters also line up and could be imported using a BCP format file with \" to mark the quotes in the delimiters.
BCP is too vulnerable to format errors.
And it does not let you know what exactly was wrong with the file.
Are these files typical for you, or are you having to put up with cannon fodder from people who aren't qualified to provide data? 🙂
I put up with manual extracts from Excel, sometimes manually "corrected" by users before being sent to us.
Is that enough to get the picture?
_____________
Code for TallyGenerator
September 15, 2008 at 11:00 pm
Heh... yeah, I wouldn't want to take a call in the middle of a cold one, either.
Split functions always work, especially on a single delimiter. Obviously, your stuff has several delimiters that look like some of the text. So, going back to rule one of a single delimiter... you could split on the results of this...
DECLARE @String1 VARCHAR(8000)
--SET @String1 = '"Pick-up",,"",1234,"2008-08-24", "2008-08-24","17"", unknown brand"'
SET @String1 = '"""Next Day"" delivery",5.00,"9,5"" box",10,"2008-09-01", "2008-09-11","9"" nails"'
SELECT
-- Normalize every quote/comma delimiter variant to a single þ delimiter,
-- then strip the leftover quotes and collapse the doubled quotes.
        REPLACE(
         REPLACE(
          REPLACE(
           REPLACE(
            REPLACE(
             REPLACE(
              REPLACE(
               REPLACE(
                SUBSTRING(@String1,2,LEN(@String1)-2)
               ,'" ,"','þ')
              ,'", "','þ')
             ,'","' ,'þ')
            ,'",'  ,'þ')
           ,',"'  ,'þ')
          ,'"þ'  ,'"')
         ,'þ"'  ,'"')
        ,'""'  ,'"')
Heh, yeah, I know... that code is a real thorn in the side... 🙂
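The single-delimiter result would then feed a standard Tally-table split. A rough sketch, assuming the usual dbo.Tally table of integers 1 through 8000 and using the expected output of the REPLACEs above as a literal:

DECLARE @Split VARCHAR(8000)
SET @Split = 'þ' + '"Next Day" deliveryþ5.00þ9,5" boxþ10þ2008-09-01þ2008-09-11þ9" nails' + 'þ'

SELECT t.N - LEN(REPLACE(LEFT(@Split, t.N), 'þ', '')) AS ColID,  -- þ's so far = column number
       SUBSTRING(@Split, t.N + 1,
                 CHARINDEX('þ', @Split, t.N + 1) - t.N - 1) AS Value
  FROM dbo.Tally t
 WHERE t.N < LEN(@Split)
   AND SUBSTRING(@Split, t.N, 1) = 'þ'  -- each þ marks the start of the next column
 ORDER BY t.N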
--Jeff Moden
Change is inevitable... Change for the better is not.
September 15, 2008 at 11:21 pm
Ummm... all those bloody REPLACEs are gonna be slow. I just thought of another way using our friends the Tally table and the "quirky update". It'll take me a bit to put together if you think it'll be worth it. Not sure I can make it "read minds" and do the necessary in-flight corrections, but it might be a lot faster.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 15, 2008 at 11:49 pm
Even worse - those REPLACEs don't work.
For another string:
'"Pick-up",,"",1234,"2008-08-24", "2008-08-24","17"", unknown brand"'
they fail.
_____________
Code for TallyGenerator