Split input string into multicolumn - multirows

Question

Post reply

Split input string into multicolumn - multirows

Forum Etiquette: How to post Reporting Services problems

[/url]

Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
[/url]

How to Post Performance Problems - by Gail Shaw
[/url]

Dwain Camps SSC Guru Points: 86978 More actions · Answer 1

Let's try a few of these solutions in a small test harness to see what we can see, shall we?

CREATE TABLE #Strings (ID INT IDENTITY, MyString VARCHAR(8000))

;WITH Tally (n) AS (

SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns a CROSS JOIN sys.all_columns b)

INSERT INTO #Strings

SELECT CAST((

SELECT CAST(n AS VARCHAR(5)) + ';' + CAST(40000+n AS VARCHAR(6)) + '|'

FROM Tally

FOR XML PATH('')) AS VARCHAR(8000))

FROM Tally

DECLARE @Holder1 VARCHAR(8000), @Holder2 VARCHAR(8000)

PRINT 'DelimitedSplit8K (Lynn Pettis)'

SET STATISTICS TIME ON

select

@Holder1=max(case ds2.ItemNumber when 1 then ds2.Item else '' end), --as field1,

@Holder2=max(case ds2.ItemNumber when 2 then ds2.Item else '' end) --as field2

from

#Strings

cross apply dbo.DelimitedSplit8K(MyString,'|') ds1

cross apply dbo.DelimitedSplit8K(ds1.Item,';') ds2

where

ds1.Item <> ''

group by

ID, ds1.ItemNumber;

SET STATISTICS TIME OFF

PRINT 'PatternSplitCM (Dwain.C)'

SET STATISTICS TIME ON

SELECT @Holder1=MAX(CASE ItemNumber%4 WHEN 1 THEN Item END)

,@Holder2=MAX(CASE ItemNumber%4 WHEN 3 THEN Item END)

FROM #Strings

CROSS APPLY PatternSplitCM(MyString, '[0-9]')

WHERE [Matched] = 1

GROUP BY ID, ItemNumber/4

--OPTION (MAXDOP 1)

SET STATISTICS TIME OFF

PRINT 'DelimitedSplit8K (Steven Willis)'

SET STATISTICS TIME ON

SELECT

@Holder1=(SELECT Item FROM dbo.DelimitedSplit8K(r1.Item,';') a WHERE ItemNumber = 1) --AS Col1

,@Holder2=(SELECT Item FROM dbo.DelimitedSplit8K(r1.Item,';') a WHERE ItemNumber = 2) --AS Col2

FROM

(

SELECT

ID, dsk1.Item

FROM #Strings a

CROSS APPLY

dbo.DelimitedSplit8K(MyString,'|') AS dsk1

) r1

SET STATISTICS TIME OFF

DROP TABLE #Strings

Results:

DelimitedSplit8K (Lynn Pettis)

SQL Server Execution Times:

CPU time = 280 ms, elapsed time = 164 ms.

PatternSplitCM (Dwain.C)

SQL Server Execution Times:

CPU time = 141 ms, elapsed time = 13946 ms.

Warning: Null value is eliminated by an aggregate or other SET operation.

DelimitedSplit8K (Steven Willis)

SQL Server Execution Times:

CPU time = 531 ms, elapsed time = 703 ms.

Lynn clearly wins on elapsed time, but PatternSplitCM wins on CPU time. Not sure why its elapsed time is trailing the pack by such a wide margin though.

I didn't include the XML solution because it looked a bit complicated to fit into the harness but please feel free to give it a try.

Also I'm not 100% sure I adapted Steven's solution to the table correctly as it was returning some rows with NULL values.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

murthyvs SSCommitted Points: 1799 More actions · Answer 2

mtassin (12/17/2012)
murthyvs (12/16/2012)
Hi dwain.c - I tried changing VARCHAR(8000) to VARCHAR(MAX). Ran the query with new pattern [0-9A-Za-z]. My input string is a very long string.
sample:
3038469
3138470
The results were as expected BUT the query returns fewer than expected rows. I believe somewhere in the function its hitting the 8000 limit. Any ideas? Thanks!
The Delimittedsplit8k only counts up to E4's max which is 10,000. If you have more than 10,000 characters in the string you'll need to increase the size of the tally based CTE by creating an E8 which would cross 10,000 by 10,000 and handle
100,000,000 columns of data for searching for delimitters.

Yes, I noticed that. I created a E5 and changed 10,000 to 100,000.

murthyvs SSCommitted Points: 1799 More actions · Answer 3

Results:
DelimitedSplit8K (Lynn Pettis)
SQL Server Execution Times:
CPU time = 280 ms, elapsed time = 133 ms.
PatternSplitCM (Dwain.C)
SQL Server Execution Times:
CPU time = 94 ms, elapsed time = 11653 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.
DelimitedSplit8K (Steven Willis)
SQL Server Execution Times:
CPU time = 468 ms, elapsed time = 717 ms.
Lynn clearly wins on elapsed time, but PatternSplitCM smokes all contenders on CPU time. Not sure why its elapsed time is trailing the pack by such a wide margin though.
I didn't include the XML solution because it looked a bit complicated to fit into the harness but please feel free to give it a try.

Thanks for all your help and suggestions. I finally chose DelimitedSplit8K - Lynn Pettis method. I tried all methods but for some magical reason my stored proc really runs fast/goes well with DelimitedSplit8K- Lynn Pettis method. For now, I am constraining the user to 250 records (each of the fields can have up to 10 characters) until I figure a way to import the excel file into sql server from my web application (right now the user just copy-pastes the two columns from excel file into a multiline text box).

Dwain Camps SSC Guru Points: 86978 More actions · Answer 4

dwain.c (12/18/2012)
Also I'm not 100% sure I adapted Steven's solution to the table correctly as it was returning some rows with NULL values.

The answer to this is the delimiter appearing at the end of the test strings. Remove it and I think Steven's code will work exactly like the others. I wouldn't expect the timings to be changed much though.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

mister.magoo SSC-Forever Points: 47174 More actions · Answer 5

Hi, I had a feeling that I had forgotten something important about the XML method...then I remembered...

My revised code (suitable for the test harness):

PRINT 'XML (Mister Magoo)'

SET STATISTICS TIME ON

SELECT -- XML (Mister Magoo)

@Holder1= nd.value('./@a','varchar(200)') ,

@Holder2= nd.value('./@b','varchar(200)')

FROM (

SELECT CAST(

'<c a="'+

REPLACE(

MyString,

'|',

'"/><c a="'

),

';',

'" b="'

)

+'"/>' AS XML

).query('.') FROM #Strings

FOR XML PATH(''),TYPE

) AS src(nds)

CROSS APPLY nds.nodes('c') AS x(nd)

WHERE nd.exist('./@b')=1

SET STATISTICS TIME OFF

edit:removed unnecessary comment :doze:

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

Dwain Camps SSC Guru Points: 86978 More actions · Answer 6

"Oh, Magoo! You've done it again!"

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

mister.magoo SSC-Forever Points: 47174 More actions · Answer 7

dwain.c (12/19/2012)
"Oh, Magoo! You've done it again!"

Maybe 😉 I am just testing other data patterns...

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

Dwain Camps SSC Guru Points: 86978 More actions · Answer 8

mister.magoo (12/19/2012)
dwain.c (12/19/2012)
"Oh, Magoo! You've done it again!"
Maybe 😉 I am just testing other data patterns...

You know the really funny thing. I made the discovery of that approach independently about a year ago while starting out learning about and playing with XML. Found that it wasn't unique after a Google search. But the thing was, I never figured it was going to be very good in the performance department for some reason. So I never actually used it for anything.

Guess I'll need to rethink that.

How many people do you think will actually remember Mr. Magoo's catch phrase?

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

mister.magoo SSC-Forever Points: 47174 More actions · Answer 9

Dwain,

Performance wise, the main "speed ups" in this version are the use of our old friend FOR XML PATH('') to bolt everything into one XML fragment (means only handling one XML fragment rather than one per row) and the seemingly innocuous ".query('.')"

See This Brad Schulz Blog Post for more on this technique (not a lot more, but extra validation of it's efficacy)...

I also chose to use a single node per pair of values instead of the row/col approach I had previously - although I haven't tested whether this gave a speed boost or not - however this would not work so well if the data had more than two values per row.

edit:spelling

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

mister.magoo SSC-Forever Points: 47174 More actions · Answer 10

And the nice thing is this method works nicely on varchar(MAX)...

I used 2000 rows of 2000 pairs of values to test :

Duration_AvgCpu_AvgReads_AvgWrites_AvgRowCounts_Avg

106994.000000985614196918 36806 4000000

Of course, I couldn't compare this to the other methods because they don't handle MAX in their current forms, but I did test them at just 200 rows/200 pairs per row and they were getting very slow.

I think 107 seconds to split out 4 million rows is quite good though 😀

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

Dwain Camps SSC Guru Points: 86978 More actions · Answer 11

mister.magoo (12/19/2012)
And the nice thing is this method works nicely on varchar(MAX)...
I used 2000 rows of 2000 pairs of values to test :
Duration_AvgCpu_AvgReads_AvgWrites_AvgRowCounts_Avg
106994.000000985614196918 36806 4000000
Of course, I couldn't compare this to the other methods because they don't handle MAX in their current forms, but I did test them at just 200 rows/200 pairs per row and they were getting very slow.
I think 107 seconds to split out 4 million rows is quite good though 😀

Very interesting and thanks for the Brad Schultz link.

I have a vague recollection of some recent discussion in Jeff's DelimitedSplit8K article that may have also touched on this approach. That discussion thread is pretty huge but it would be in the last 2-3 pages of it I think. Can't recall if it was suggesting to use this for VARCHAR(MAX) or not though. Maybe something else to investigate or maybe even something you'd like to add to the discussion.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Jeff Moden SSC Guru Points: 1004448 More actions · Answer 12

mister.magoo (12/19/2012)
Hi, I had a feeling that I had forgotten something important about the XML method...then I remembered...

N-i-i-i-i-i-c-c-c-c-e-e-e ! ! ! ! !

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004448 More actions · Answer 13

I'm getting some conflicting results on the performance testing. A 10 year old single cpu desktop computer and a laptop probably aren't the best places for me to test on. I'll try to reserve a little time to test these on the server at work today (it's just after midnight here).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Dwain Camps SSC Guru Points: 86978 More actions · Answer 14

Jeff Moden (12/19/2012)
I'm getting some conflicting results on the performance testing. A 10 year old single cpu desktop computer and a laptop probably aren't the best places for me to test on. I'll try to reserve a little time to test these on the server at work today (it's just after midnight here).

I would be really interested to hear if you've got any ideas why PatternSplitCM is doing reasonably well on the CPU side but sucking so badly on elapsed time.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

mister.magoo SSC-Forever Points: 47174 More actions · Answer 15

Jeff Moden (12/19/2012)
I'm getting some conflicting results on the performance testing. A 10 year old single cpu desktop computer and a laptop probably aren't the best places for me to test on. I'll try to reserve a little time to test these on the server at work today (it's just after midnight here).

Having read recently about the problems with using set statistics time on, I used a server side trace to measure when I was testing.

I also saw some strange results, mostly from Lynn's solution which, for some reason would sometimes have terrible elapsed time, but I took that to be environmental as it was seemingly random.

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);