February 27, 2011 at 10:32 pm
{edit} I found a bug in my code after I posted it, so I took it down. I'm trying to fix the bug right now.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 27, 2011 at 10:53 pm
JJH82 (2/27/2011)
yes, I can create or use stored procedures on the server.:crazy:but i have not .Net programming experience
OK. I was suggesting a .NET stored procedure or function, so that may not fly for you.
Until Jeff comes up with something to significantly improve the TSQL processing, one other option would be to get a .NET programmer to write a little bit of code to pre-process the files into an easier form for SQL Server to bulk load.
February 27, 2011 at 11:12 pm
Ok... I've not tested every eventuality, but this works with the test data provided by other posters.
First, here's that test data.
CREATE TABLE dbo.CLOB
(
Row_ID INT IDENTITY(1,1),
Parse_Clob NVARCHAR(4000)
)
INSERT INTO dbo.CLOB
SELECT ',2 Column_Name "Col1" "Col2" , 2 New_Column_Value "A" "B", 2 Old_Column_Value "a" "b"'
UNION ALL
SELECT ',2 Field_Name "F1" "F2", 2 New_Field_Name "F1" "F2", 2 Old_Field_Name "f1" "f2" ,1 Column_Name "Col3" ,1 New_Column_Value "C", 1 Old_Column_Value "c"'
UNION ALL
SELECT ',1 Field_Name "F3", 1 New_Field_Name "F3", 1 Old_Field_Name "f3" ,1 Column_Name "Col6" ,1 New_Column_Value "1<", 1 Old_Column_Value "2&"'
;
Next, the solution. It uses a thing called a "Tally Table". If you don't know what that is, how to build one, or how it replaces certain types of While Loops, please read the following article...
http://www.sqlservercentral.com/articles/T-SQL/62867/
WITH
cteFirstSplit AS
(
SELECT Row_ID,
Section = ROW_NUMBER() OVER (PARTITION BY Row_ID ORDER BY t.N)-1,
WorkingString = RTRIM(SUBSTRING(Parse_Clob,t.N+1,ISNULL(NULLIF(CHARINDEX(',',Parse_Clob,t.N+1),0),DATALENGTH(Parse_Clob)+1)-t.N-1))
FROM dbo.Tally t
CROSS JOIN dbo.Clob c
WHERE t.N BETWEEN 1 AND LEN(Parse_Clob)
AND SUBSTRING(Parse_Clob, t.N, 1) = ','
),
cteFirstClean AS
(
SELECT Row_ID,
Section,
Field_Type = CASE WHEN CHARINDEX('Field',WorkingString)>0 THEN 'Field' ELSE 'Column' END,
WorkingString = REPLACE(SUBSTRING(WorkingString, CHARINDEX('"',WorkingString), 4000),'" "','"')
FROM cteFirstSplit
),
cteFinalSplit AS
(
SELECT Row_ID,
Section = Section,
Element = ROW_NUMBER() OVER (PARTITION BY Row_ID, Field_Type, Section ORDER BY Field_Type, Section, t.N)-1,
Field_Type,
SplitString = RTRIM(SUBSTRING(WorkingString,t.N+1,ISNULL(NULLIF(CHARINDEX('"',WorkingString,t.N+1),0),DATALENGTH(WorkingString)+1)-t.N-1))
FROM dbo.Tally t
CROSS JOIN cteFirstClean
WHERE t.N BETWEEN 1 AND LEN(WorkingString)-1
AND SUBSTRING(WorkingString, t.N, 1) = '"'
)
--==== Now, reassemble
SELECT Row_ID,
Field_Type = MAX(CASE WHEN Section % 3 = 0 THEN Field_Type ELSE '' END),
Field_Name = MAX(CASE WHEN Section % 3 = 0 THEN SplitString ELSE '' END),
New_Value = MAX(CASE WHEN Section % 3 = 1 THEN SplitString ELSE '' END),
Old_Value = MAX(CASE WHEN Section % 3 = 2 THEN SplitString ELSE '' END)
FROM cteFinalSplit
GROUP BY Row_ID, Field_Type, Element
ORDER BY Row_ID, Field_Type, Element
;
--Jeff Moden
Change is inevitable... Change for the better is not.
February 28, 2011 at 12:21 am
A couple of tweaks to Jeff's code, mostly for correctness, not speed. It will be interesting to see how the Tally code performs.
;
WITH cteFirstSplit
AS (
SELECT Row_ID,
Section =
ROW_NUMBER() OVER
(
PARTITION BY C.Row_ID
ORDER BY T.N
) - 1,
WorkingString =
LTRIM(RTRIM(SUBSTRING(C.Parse_Clob, T.N+1,
ISNULL(NULLIF(CHARINDEX(N',', C.Parse_Clob, T.N + 1), 0),
DATALENGTH(C.Parse_Clob) + 1) - T.N-1)))
FROM dbo.Tally AS T
JOIN dbo.Clob AS C
ON T.N >= 1
AND T.N <= DATALENGTH(C.Parse_Clob) / DATALENGTH(LEFT(C.Parse_Clob, 1))
WHERE SUBSTRING(C.Parse_Clob, T.N, 1) = N','
),
cteFirstClean
AS (
SELECT S1.Row_ID,
S1.Section,
Field_Type =
CASE SUBSTRING(S1.WorkingString, 1 + CHARINDEX(SPACE(1), S1.WorkingString), 10)
WHEN N'New_Field_' THEN N'Field'
WHEN N'Old_Field_' THEN N'Field'
WHEN N'Field_Name' THEN N'Field'
WHEN N'Old_Column' THEN N'Column'
WHEN N'New_Column' THEN N'Column'
WHEN N'Column_Nam' THEN N'Column'
END,
WorkingString =
LTRIM(REPLACE(SUBSTRING(S1.WorkingString,
CHARINDEX(NCHAR(34), S1.WorkingString), 2147483647),
NCHAR(34) + SPACE(1) + NCHAR(34), NCHAR(34)))
FROM cteFirstSplit AS S1
), cteFinalSplit
AS (
SELECT Row_ID,
Section = Section,
Element = ROW_NUMBER() OVER
(
PARTITION BY FC.Row_ID, FC.Field_Type, FC.Section
ORDER BY FC.Field_Type, FC.Section, T.N
) - 1,
Field_Type,
SplitString =
RTRIM(SUBSTRING(FC.WorkingString, T.N + 1,
ISNULL(NULLIF(CHARINDEX(NCHAR(34), FC.WorkingString, T.N + 1), 0),
DATALENGTH(FC.WorkingString) + 1) - T.N - 1))
FROM dbo.Tally AS T
JOIN cteFirstClean AS FC
ON T.N >= 1
AND T.N <= (DATALENGTH(FC.WorkingString) / DATALENGTH(LEFT(FC.WorkingString, 1))) - 1
WHERE SUBSTRING(FC.WorkingString, T.N, 1) = NCHAR(34)
)
SELECT FS.Row_ID,
Field_Type = MAX(CASE WHEN FS.Section % 3 = 0 THEN FS.Field_Type ELSE SPACE(0) END),
Field_Name = MAX(CASE WHEN Section % 3 = 0 THEN FS.SplitString ELSE SPACE(0) END),
New_Value = MAX(CASE WHEN Section % 3 = 1 THEN FS.SplitString ELSE SPACE(0) END),
Old_Value = MAX(CASE WHEN Section % 3 = 2 THEN FS.SplitString ELSE SPACE(0) END)
FROM cteFinalSplit AS FS
GROUP BY
FS.Row_ID,
FS.Field_Type,
FS.Element
ORDER BY
FS.Row_ID,
FS.Field_Type,
FS.Element
;
Paul
February 28, 2011 at 5:51 am
SQLkiwi (2/28/2011)
A couple of tweaks to Jeff's code, mostly for correctness
I was in a bit of a hurry... what tweaks did you make?
--Jeff Moden
Change is inevitable... Change for the better is not.
February 28, 2011 at 5:54 am
Jeff Moden (2/28/2011)
SQLkiwi (2/28/2011)
A couple of tweaks to Jeff's code, mostly for correctnessI was in a bit of a hurry... what tweaks did you make?
Never mind. I see. My code worked but wouldn't suffer many data changes.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 28, 2011 at 5:59 am
Jeff Moden (2/28/2011)
Never mind. I see. My code worked but wouldn't suffer many data changes.
Yes, only minor changes to some very nice code. My main worry is that the sorts and string splitting operations won't scale well. We'll see if the real data does have 100,000 groups in a single line as was mentioned earlier...
February 28, 2011 at 6:55 am
Thanks, Paul.
I agree... in particular, I saw that 70% sort on the second use of the Tally Table and believe it would go away if we simply wrote the results of the cteFirstClean out to an interim table to do the final split on.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 28, 2011 at 7:08 am
Jeff Moden (2/28/2011)
I agree... in particular, I saw that 70% sort on the second use of the Tally Table and believe it would go away if we simply wrote the results of the cteFirstClean out to an interim table to do the final split on.
I doubt it would go away - the sorting for the window aggregates has to be performed at some point (creating or inserting into an index on the intermediate table would involve a sort too).
The intermediate result set might also be quite large, so the memory grant required by the sort could cause problems too. To over-simplify a little, as the sort grows, so does the memory grant. As soon as the server is unable to meet the memory grant request, the sort will either physically spill to tempdb disk, or the query will wait on RESOURCE_SEMAPHORE. Either way, not good.
March 1, 2011 at 2:20 pm
thanks guys, sorry for replying so late, i am trying them out today
🙂
March 8, 2011 at 3:01 pm
Great jobs guys, and the execution time is excellent, it reduced processing time from 3 hr45min to 2 mins. Thank you for all your input!:-):-D
Viewing 11 posts - 31 through 40 (of 40 total)
You must be logged in to reply to this topic. Login to reply