July 25, 2009 at 9:11 pm
Ghanta...
I believe I have your solution. I'll post it both here and on SQL Team (your other post, I believe) but I wonder if you'd do me a favor first... if the information is not proprietary, contains no sensitive information, and violates no privacy laws, would you 1) post the first 20 rows and 2) tell me how many rows there will likely be on each load?
If the information is proprietary, contains sensitive information, or it would violate privacy laws to post it, would you make up the first 3 rows of the file and post it? I need a model to test against before I post my findings.
Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 25, 2009 at 9:44 pm
Hey Jeff,
This will help so many folks... My data is comma delimited, with quotes only around those fields that have an embedded comma (each file can have different fields, and a field will have quotes only if there is a comma in it). The first row has the column names, which I will use to create the staging table. Let me know if you need more info... Thank you so muchhhh Jeff...
ID,ADD_1,ADD_CD,MEM,FullName
6000014623,"1310 FEARRINGTON POST",00,this is a sample,"tiger, scott"
6000014646,"532 BURNETT CIR",00,this is a sample,"tiger, scott"
6000014649,"2106 NC HIGHWAY 87 N",03,this is a sample,"tiger, scott"
6000014663,"1864 MANNS CHAPEL RD",00,this is a sample,"tiger, scott"
6000014668,"20 SILVERSTONE DR",00,this is a sample,"tiger, scott"
6000014669,"159 LAURA JOHNSON RD",00,this is a sample,"tiger, scott"
6000014670,"RR 1",02,this is a sample,"tiger, scott"
6000014686,"PO BOX 132",00,this is a sample,"tiger, scott"
6000014695,"137 FEARRINGTON POST",00,this is a sample,"tiger, scott"
6000014704,"339 TAYLOR RD",03,this is a sample,"tiger, scott"
6000014707,"PO BOX 880",00,this is a sample,"tiger, scott"
6000014714,"904 FEARRINGTON POST",00,this is a sample,"tiger, scott"
6000014747,"|2622 US HIGHWAY 64 W",00,this is a sample,"tiger, scott"
6000014748,"PO BOX 518",00,this is a sample,"tiger, scott"
6000014756,"866 FEARRINGTON POST",00,this is a sample,"tiger, scott"
6000015365,"1825 EAST ST",00,this is a sample,"tiger, scott"
6000015493,"407 BRAMPTON CLOSE W",02,this is a sample,"tiger, scott"
6000015499,"134 HUDSON HILLS RD",00,this is a sample,"tiger, scott"
6000015500,"321 RECTORY ST",00,this is a sample,"tiger, scott"
6000015501,"624 HAMLET CHAPEL RD",00,this is a sample,"tiger, scott"
6050860067,"P,O, BOX 92, WEST END",05,this is a sample,"tiger, scott"
6050864122,"CALLE RAMIRO 1, 20, 3B",02,this is a sample,"tiger, scott"
6050864480,"EUROVILLA 1, 803A",02,this is a sample,"tiger, scott"
6050864782,"WAH CHI COURT, FLAT D 30TH FLOOR",05,this is a sample,"tiger, scott"
6053339447,"KINGOSGADE 13, 3TV",05,this is a sample,"tiger, scott"
6053339541,"11 LONG CIRCULAR ROAD ,ST,JAMES,",05,this is a sample,"tiger, scott"
6053339542,"11 LONG CIRCULAR ROAD ,ST,JAMES,",05,this is a sample,"tiger, scott"
6053448211,"L'ARGENTIERE BAT A, 637 BD TAVERNIERE",02,this is a sample,"tiger, scott"
6053450414,"FLAT 43, FREDERICK COURT",02,this is a sample,"tiger, scott"
6053451508,"AVENIDA FRANCISCO BHERING, 17/201",02,this is a sample,"tiger, scott"
6053451663,"FLAT 5A, CHENYU COURT",02,this is a sample,"tiger, scott"
6053451664,"FLAT 5A, CHENYU COURT",02,this is a sample,"tiger, scott"
6053522326,"CASA VERDE, MANO IZQUIERDA, 175 M NORTE",02,this is a sample,"tiger, scott"
6053563849,"1 CASTLE PEEK RD BLOCK 6 10-F, A",05,this is a sample,"tiger, scott"
6055150555,"BUJI TWON,SHENZHEN, GUANGDONG",02,this is a sample,"tiger, scott"
6057020037,"LINCEY HOUSE, WHITTINGTON ROAD",02,this is a sample,"tiger, scott"
6057020038,"LINCEY HOUSE, WHITTINGTON ROAD",02,this is a sample,"tiger, scott"
6057020076,"3186 COUNTY RD |26, RR 2",02,this is a sample,"tiger, scott"
6057020077,"3186 COUNTY RD |26, RR 2",02,this is a sample,"tiger, scott"
6057020127,"AV, SANTO TORIBIO 298, APT, 401",02,this is a sample,"tiger, scott"
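As an aside on the header row above: since the first row carries the column names, one minimal way to turn it into staging-table DDL is a simple REPLACE wrapped in dynamic SQL. This is only a sketch under a couple of assumptions: every column is staged as plain VARCHAR(8000), the column names contain no embedded commas or brackets, and the table name FinalStaging is made up for illustration.
--===== Sketch: build a staging table from the header row (assumptions noted above)
DECLARE @Header    VARCHAR(8000),
        @TableName SYSNAME,
        @Sql       VARCHAR(8000)

 SELECT @Header    = 'ID,ADD_1,ADD_CD,MEM,FullName', --First row of the file
        @TableName = 'FinalStaging'                  --Hypothetical target name

--===== Turn "A,B,C" into "[A] VARCHAR(8000) NULL,[B] VARCHAR(8000) NULL,[C] VARCHAR(8000) NULL"
 SELECT @Sql = 'CREATE TABLE dbo.' + QUOTENAME(@TableName) + ' (['
             + REPLACE(@Header,',','] VARCHAR(8000) NULL,[')
             + '] VARCHAR(8000) NULL)'

   EXEC (@Sql)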
July 25, 2009 at 9:46 pm
As far as the number of rows is concerned, most of the files will be small, around 100,000 to 200,000 rows with around 10 fields, but a small number (1 a day) might be around a few million records with 30 fields.
July 25, 2009 at 10:46 pm
Ok.... a couple of questions...
First, some data from your post...
[font="Courier New"]6053339541,"11 LONG CIRCULAR ROAD ,ST,JAMES,",05,this is a sample,"tiger, scott"
__________ ________________________________ __ ________________ ____________
6053339542,"11 LONG CIRCULAR ROAD ,ST,JAMES,",05,this is a sample,"tiger, scott"
__________ _____________________________________ __ ________________ ____________
6053448211,"L'ARGENTIERE BAT A, 637 BD TAVERNIERE",02,this is a sample,"tiger, scott"[/font]
1. Except for the first column (ID), the first 2 rows above appear to be duplicates. Should one of them be eliminated? If so, which one do you prefer to be eliminated?
2. How do you want these split? I put underscores over the second 2 rows where a true CSV splitter would consider fields to be...
--Jeff Moden
Change is inevitable... Change for the better is not.
July 25, 2009 at 10:49 pm
Also, I don't see a single example of conditional quoting. It would appear to be a stable format as follows...
1st field - never quoted
2nd field - always quoted
3rd field - never quoted
4th field - never quoted
5th field - always quoted
And, none of the quoted fields have embedded quotes.
Do you have any rows that deviate from that other than the header itself?
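For what it's worth, if the format really were that stable (2nd and 5th fields always quoted, the others never), a plain non-XML BCP format file could treat the quotes as part of the field terminators and no splitter would be needed. The following is only a sketch, not the format file posted earlier in the thread: it assumes SQL Server 2005 (the 9.0 on the first line), a five-column target table matching the sample header, and made-up lengths, and it falls apart as soon as the quoting becomes conditional (which the later posts show it does).
9.0
5
1   SQLCHAR   0   20    ",\""      1   ID        ""
2   SQLCHAR   0   100   "\","      2   ADD_1     SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR   0   10    ","        3   ADD_CD    SQL_Latin1_General_CP1_CI_AS
4   SQLCHAR   0   100   ",\""      4   MEM       SQL_Latin1_General_CP1_CI_AS
5   SQLCHAR   0   100   "\"\r\n"   5   FullName  SQL_Latin1_General_CP1_CI_AS
BULK INSERT (with the FORMATFILE option) or bcp with -f would then point at that file.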
--Jeff Moden
Change is inevitable... Change for the better is not.
July 25, 2009 at 10:58 pm
First, some data from your post...
6053339541,"11 LONG CIRCULAR ROAD ,ST,JAMES,",05,this is a sample,"tiger, scott"
__________ ________________________________ __ ________________ ____________
6053339542,"11 LONG CIRCULAR ROAD ,ST,JAMES,",05,this is a sample,"tiger, scott"
__________ _____________________________________ __ ________________ ____________
6053448211,"L'ARGENTIERE BAT A, 637 BD TAVERNIERE",02,this is a sample,"tiger, scott"
1. Except for the first column (ID), the first 2 rows above appear to be duplicates. Should one of them be eliminated? If so, which one do you prefer to be eliminated? --- Duplicates are fine; I need to stage the entire data from the file.
2. How do you want these split? I put underscores over the second 2 rows where a true CSV splitter would consider fields to be... --- The split looks good to me; each field is separated by a comma, and the ones with embedded commas have quotes... we don't need the quotes in the table, otherwise we'd have to clean the data.
Jeff, I can't thank you enough for this help.
July 25, 2009 at 11:00 pm
BobMcC (2/18/2008)
I need to import (using DTS) some comma delimited text files into SQL Server 2000. I have discovered that the quotation mark is getting dual use. Not only is some of the column data surrounded by quotes, but sometimes the quote marks are part of the data itself, and that's causing problems.
i.e.
1234, "some data here", "John "jack" Smith", 1234, 1234
What's the conventional wisdom on how to handle that? It looks like something I could key on using regular expressions.
Any suggestions?
Thanks,
E3
Sorry, Bob... I almost forgot about you.
The row you presented creates no problems for a BCP format file (except if the header row is different). It's not correctly formatted CSV, where embedded quotes should actually be doubled up, but, like I said, a BCP format file will easily handle this to the tune of well over 5 million rows per minute (I did 5.1 million 20-column rows in 60 seconds flat on an "old" machine).
Can you provide the first 20 rows of data from the file? The same rules apply that I gave Ghanta... nothing private, nothing proprietary. Please be sure to include the actual header.
Otherwise, the BCP format file I previously posted should work for you.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 25, 2009 at 11:04 pm
Jeff Moden (7/25/2009)
Also, I don't see a single example of conditional quoting. It would appear to be a stable format as follows...1st field - never quoted
2nd field - always quoted
3rd field - never quoted
4th field - never quoted
5th field - always quoted
And, none of the quoted fields have embedded quotes.
Do you have any rows that deviate from that other than the header itself?
If any of the text in a column has an embedded comma, then the entire field will have quotes.
The 2nd field was an address, and I used that as the example because addresses will most likely have embedded commas; I don't expect commas in the ID, so I used that as the example of a field without quotes.
The 3rd can be anything and the quotes will depend on the data; same with the 4th. The 5th was just a sample and I used a full name with a comma... but it could be a last name, and we see many records that have two last names with a comma...
July 25, 2009 at 11:07 pm
Jeff Moden (7/25/2009)
Also, I don't see a single example of conditional quoting. It would appear to be a stable format as follows...1st field - never quoted
2nd field - always quoted
3rd field - never quoted
4th field - never quoted
5th field - always quoted
And, none of the quoted fields have embedded quotes.
Do you have any rows that deviate from that other than the header itself?
Then, the only problem is that the header isn't formatted the same way as the rest of the rows and BCP counts delimiters when trying to skip rows (BIG complaint about BCP for me). We can fix that pretty easily... I'll be back.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 25, 2009 at 11:11 pm
Ghanta (7/25/2009)
Jeff Moden (7/25/2009)
Also, I don't see a single example of conditional quoting. It would appear to be a stable format as follows...1st field - never quoted
2nd field - always quoted
3rd field - never quoted
4th field - never quoted
5th field - always quoted
And, none of the quoted fields have embedded quotes.
Do you have any rows that deviate from that other than the header itself?
If any of the text in a column has an embedded comma, then the entire field will have quotes.
The 2nd field was an address, and I used that as the example because addresses will most likely have embedded commas; I don't expect commas in the ID, so I used that as the example of a field without quotes.
The 3rd can be anything and the quotes will depend on the data; same with the 4th. The 5th was just a sample and I used a full name with a comma... but it could be a last name, and we see many records that have two last names with a comma...
Ah... got it... basically a mixed bag and totally data dependent for everything except the ID field. Provided they don't embed a stray double quote anywhere in the data (I'm not talking about the apostrophe), I've got that solved... Lemme put together the answer.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 25, 2009 at 11:17 pm
Ah... sorry, Ghanta... I've got just one more question...
Do you know how to BULK INSERT a file into a single column table or not?
--Jeff Moden
Change is inevitable... Change for the better is not.
July 25, 2009 at 11:18 pm
I know that, Jeff. Thanks for all your help... hope I am not ruining your weekend.
July 25, 2009 at 11:21 pm
Ummmm.... "Weekend".... new word for me... guess I'll have to look it up. 😛
Since you know how to do the BULK INSERT, we'll start with a staging table... I'll be back soon.
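Just for completeness, that BULK INSERT into a single-column staging table (like the #Staging table in the next post) would be along these lines. It's only a sketch: the path and file name are made up, and the ROWTERMINATOR may need adjusting to match how the files are produced.
BULK INSERT #Staging
  FROM 'C:\Temp\YourFile.csv'     --Hypothetical path and file name
  WITH (CODEPAGE      = 'RAW',    --Load the characters as-is
        DATAFILETYPE  = 'char',
        ROWTERMINATOR = '\r\n',   --Use '\n' if the files come from a Unix-style source
        TABLOCK)                  --Helps minimal logging on big loads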
--Jeff Moden
Change is inevitable... Change for the better is not.
July 26, 2009 at 12:37 am
Ok... here's a full up working example with the data you provided. As always, the detail is in the comments in the code...
--===========================================================================================================
-- We need a function that will convert the pain-in-the-hiney conditionally text qualified CSV
-- into something with consistent delimiters so we can split the column out of each row easily
-- even when the presence of quotes may change from row to row.
--===========================================================================================================
CREATE FUNCTION dbo.TrueCsvSplit
        (@pString VARCHAR(7998), @pDelimiter VARCHAR(1))
/**********************************************************************************************************************
 Purpose:
 This function prepares a "TRUE CSV" string for splitting by correctly finding and replacing all possible valid
 delimiters with the desired character.  It also adds the delimiter to the very beginning and end of the converted
 string.  Further, any spaces wrapped up between delimiters are eliminated.

 Revision History:
 Rev 00 - 17 Sep 2008 - Jeff Moden - Proof of concept for Sergiy on SQLServerCentral.com
 Rev 01 - 26 Jul 2009 - Jeff Moden - Converted delimiter to use for final split to an input and formalized code.
**********************************************************************************************************************/
RETURNS VARCHAR(8000)
AS
BEGIN
--===== Define and preset the local variables.  Most are self explanatory.  The "QuoteFlop" variables keep track if a
     -- given quote is "labeled" as a 1 or a 0.  (1st quote is a 1, 2nd is a 0, 3rd is a 1, etc, etc)
DECLARE @ReturnString  VARCHAR(8000),
        @PrevChar      VARCHAR(1), --MUST be VarChar or unwanted spaces appear
        @CurChar       VARCHAR(1), --MUST be VarChar or unwanted spaces appear
        @PrevQuoteFlop TINYINT,
        @CurQuoteFlop  TINYINT,
        @Keep          TINYINT,
        @KillSpace     TINYINT

 SELECT @PrevChar      = ' ',
        @PrevQuoteFlop = 0,
        @ReturnString  = '',
        @KillSpace     = 1

--===== Find and conditionally replace all valid "true" CSV delimiters with the selected character using a set based
     -- loop formed by the Tally table.  This is to make the actual splitting of a "true" CSV easier and will beat
     -- the tar out of the equivalent Cursor or While Loop.
 SELECT @CurChar = SUBSTRING(@pString,t.N,1),
        @CurQuoteFlop = CASE WHEN @CurChar = '"' THEN ABS(@PrevQuoteFlop-1) ELSE @PrevQuoteFlop END,
        @KillSpace = CASE --== If current character is a trailing quote or delimiter outside of quotes, start ignoring spaces.
                          WHEN @CurQuoteFlop = 0 AND @CurChar IN (',','"') THEN 1
                          --== If current character is inside quotes, stop ignoring spaces.
                          WHEN @CurQuoteFlop = 1 THEN 0
                          --== If current character is outside quotes, spaces are being ignored, and it isn't a space,
                          --   stop ignoring spaces because it's the start of data outside quotes.
                          WHEN @CurQuoteFlop = 0 AND @KillSpace = 1 AND @CurChar <> ' ' THEN 0
                          --== Otherwise, make no change as to whether to ignore spaces or not.
                          ELSE @KillSpace
                      END,
        @Keep = CASE --Keep the current character if ...
                     --== It's the second quote of two quotes that are together and it's inside quotes.
                     WHEN @CurQuoteFlop = 1 AND @PrevQuoteFlop = 0 AND @PrevChar = '"' THEN 1
                     --== It's not the first quote of two quotes that are together and it's inside quotes.
                     WHEN @CurQuoteFlop = 1 AND @CurChar <> '"' THEN 1
                     --== It's outside of quotes and either isn't a space or is an embedded/trailing space.
                     WHEN @CurQuoteFlop = 0 AND @KillSpace = 0 THEN 1
                     --== It's outside of quotes and not a leading space, keep it.
                     --   In other words, keep everything but leading spaces outside of quotes.
                     WHEN @CurQuoteFlop = 0 AND @CurChar <> ' ' THEN 1
                     ELSE 0
                 END,
        --== Ignore first quote of quote pairs and last quote of quote runs
        @Keep = CASE WHEN @CurQuoteFlop = 0 AND @CurChar = '"' THEN 0 ELSE @Keep END,
        --== If it's a delimiter comma, change it to the desired delimiter.
        @CurChar = CASE WHEN @PrevQuoteFlop = 0 AND @CurChar = ',' THEN @pDelimiter ELSE @CurChar END,
        @PrevChar = @CurChar,
        @PrevQuoteFlop = @CurQuoteFlop,
        --== A bit slow to assemble the string, but it's faster than anything except XML
        @ReturnString = @ReturnString + CASE WHEN @Keep = 1 THEN @CurChar ELSE '' END
   FROM dbo.Tally t
  WHERE t.N <= LEN(@pString)

--===== Encapsulate the return string in delimiters and exit
 RETURN @pDelimiter + @ReturnString + @pDelimiter
END
GO
--===========================================================================================================
-- This would be the staging table after you loaded it from BULK INSERT as whole rows only.
-- Notice that it contains a header of a different format than the rest of the rows.  Also,
-- you said the occurrence of quotes around the "fields" would vary depending on whether or not the
-- string data contained any commas.  I changed fields 4 and 5 of the first data row so that it's
-- different just to prove this oddball method of mine works.
-- As a side bar, this is probably one of the few places where a properly written CLR would do better.
--===========================================================================================================
CREATE TABLE #Staging (WholeRow VARCHAR(8000))

INSERT INTO #Staging (WholeRow)
SELECT 'ID,ADD_1,ADD_CD,MEM,FullName' UNION ALL
SELECT '6000014623,"1310 FEARRINGTON POST",00,"this is a sample, first row changed",tiger' UNION ALL
SELECT '6000014646,"532 BURNETT CIR",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014649,"2106 NC HIGHWAY 87 N",03,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014663,"1864 MANNS CHAPEL RD",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014668,"20 SILVERSTONE DR",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014669,"159 LAURA JOHNSON RD",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014670,"RR 1",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014686,"PO BOX 132",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014695,"137 FEARRINGTON POST",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014704,"339 TAYLOR RD",03,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014707,"PO BOX 880",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014714,"904 FEARRINGTON POST",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014747,"|2622 US HIGHWAY 64 W",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014748,"PO BOX 518",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000014756,"866 FEARRINGTON POST",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000015365,"1825 EAST ST",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000015493,"407 BRAMPTON CLOSE W",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000015499,"134 HUDSON HILLS RD",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000015500,"321 RECTORY ST",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6000015501,"624 HAMLET CHAPEL RD",00,this is a sample,"tiger, scott"' UNION ALL
SELECT '6050860067,"P,O, BOX 92, WEST END",05,this is a sample,"tiger, scott"' UNION ALL
SELECT '6050864122,"CALLE RAMIRO 1, 20, 3B",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6050864480,"EUROVILLA 1, 803A",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6050864782,"WAH CHI COURT, FLAT D 30TH FLOOR",05,this is a sample,"tiger, scott"' UNION ALL
SELECT '6053339447,"KINGOSGADE 13, 3TV",05,this is a sample,"tiger, scott"' UNION ALL
SELECT '6053339541,"11 LONG CIRCULAR ROAD ,ST,JAMES,",05,this is a sample,"tiger, scott"' UNION ALL
SELECT '6053339542,"11 LONG CIRCULAR ROAD ,ST,JAMES,",05,this is a sample,"tiger, scott"' UNION ALL
SELECT '6053448211,"L''ARGENTIERE BAT A, 637 BD TAVERNIERE",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6053450414,"FLAT 43, FREDERICK COURT",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6053451508,"AVENIDA FRANCISCO BHERING, 17/201",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6053451663,"FLAT 5A, CHENYU COURT",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6053451664,"FLAT 5A, CHENYU COURT",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6053522326,"CASA VERDE, MANO IZQUIERDA, 175 M NORTE",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6053563849,"1 CASTLE PEEK RD BLOCK 6 10-F, A",05,this is a sample,"tiger, scott"' UNION ALL
SELECT '6055150555,"BUJI TWON,SHENZHEN, GUANGDONG",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6057020037,"LINCEY HOUSE, WHITTINGTON ROAD",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6057020038,"LINCEY HOUSE, WHITTINGTON ROAD",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6057020076,"3186 COUNTY RD |26, RR 2",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6057020077,"3186 COUNTY RD |26, RR 2",02,this is a sample,"tiger, scott"' UNION ALL
SELECT '6057020127,"AV, SANTO TORIBIO 298, APT, 401",02,this is a sample,"tiger, scott"'
GO
--===========================================================================================================
-- We need to reformat all that stuff so we can actually find all the delimiters before we do a split.
-- You can either update the #Staging table directly or create another table on the fly.  I prefer to
-- create another table on the fly.  Of course, we'll use our new function... and we'll use TABs for
-- the delimiters.
--===========================================================================================================
 SELECT RowNum   = IDENTITY(INT,1,1),
        WholeRow = dbo.TrueCsvSplit(s.WholeRow,CHAR(9))
   INTO #Staging2
   FROM #Staging s

--===========================================================================================================
-- Then, it's just a matter of splitting the data and reassembling it as columns in a row.  It takes
-- a Tally table to do that the fastest.  I'll post the URL for the Tally table and what it does in the
-- next post down.  CHAR(9) is the TAB character.  I'm assuming that you have SQL Server 2005 here.
-- Let me know if you only have 2k so we can do the work around.
--===========================================================================================================
;WITH cteSplit AS
(
 SELECT s.RowNum,
        Element = ROW_NUMBER() OVER (PARTITION BY s.RowNum ORDER BY s.RowNum, t.N),
        RowText = SUBSTRING(s.WholeRow,t.N+1,CHARINDEX(CHAR(9),s.WholeRow,t.N+1)-t.N-1)
   FROM dbo.Tally t
  CROSS JOIN #Staging2 s
  WHERE t.N < LEN(s.WholeRow)
    AND SUBSTRING(s.WholeRow,t.N,1) = CHAR(9)
    AND s.RowNum > 1
)
 SELECT MAX(CASE WHEN Element = 1 THEN RowText END) AS ID,
        MAX(CASE WHEN Element = 2 THEN RowText END) AS Add_1,
        MAX(CASE WHEN Element = 3 THEN RowText END) AS Add_CD,
        MAX(CASE WHEN Element = 4 THEN RowText END) AS Mem,
        MAX(CASE WHEN Element = 5 THEN RowText END) AS FullName
   FROM cteSplit
  GROUP BY RowNum
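One note on prerequisites: both the function and the final split lean on a dbo.Tally table of sequential integers (the code comments say the Tally write-up is linked in the next post down). If you don't already have one, a common way to build it, shown here only as a sketch, is:
--===== Build a Tally table of sequential integers (only needed if you don't already have one)
 SELECT TOP 11000
        IDENTITY(INT,1,1) AS N
   INTO dbo.Tally
   FROM master.sys.all_columns ac1
  CROSS JOIN master.sys.all_columns ac2

--===== The clustered PK keeps the scans fast
  ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N)
11,000 rows is more than enough to cover the VARCHAR(8000) strings the function walks through.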
--Jeff Moden
Change is inevitable... Change for the better is not.
July 31, 2009 at 8:56 pm
Jeff,
First of all, thank you so much again for assisting me on this. Since we cannot use the UDF in dynamic SQL, what do you suggest? CLR might be too much for me right now. Do you think creating a temp staging table and then doing a SELECT INTO (after parsing the temp Staging2 and using the CTE) into the table name from a variable might work? Thanks a lot.
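For illustration only, here is one way the idea in this question might be sketched: parse into a fixed temp table first (call it #Parsed, filled by the CTE/crosstab above), then let dynamic SQL handle only the piece that needs the table name from a variable, so the UDF itself never runs inside the dynamic SQL. All the names here are made up.
--===== Sketch: move parsed rows into a table whose name arrives in a variable
DECLARE @TargetTable SYSNAME,
        @Sql         NVARCHAR(MAX)

 SELECT @TargetTable = 'FinalStaging'  --Would really come from your metadata or file name

 SELECT @Sql = N'SELECT ID, Add_1, Add_CD, Mem, FullName'
             + N'   INTO dbo.' + QUOTENAME(@TargetTable)
             + N'   FROM #Parsed'      --Temp table from the outer scope is visible here

   EXEC sp_executesql @Sql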