July 6, 2010 at 9:24 pm
Thank you for the test harness and your good code, Oleg. I'll check it out.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 9, 2010 at 7:29 pm
Jeff Moden (7/6/2010)
Thank you for the test harness and your good code, Oleg. I'll check it out.
Sorry, Oleg... I've not been able to put in the time to do a deep dive on the code and it certainly deserves it. I'm not going to be able to get to it for at least another 10 days because I'm going on vacation where I won't have access to a computer with SQL Server on it.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 26, 2010 at 5:27 pm
The initial testing I just got done doing looks pretty darned good on a parameter of 100,000 elements, Oleg. I'm still playing with it.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 27, 2010 at 1:33 am
Jeff Moden (6/29/2010)
WayneS (6/28/2010)
lmu92 (6/28/2010)
This thread includes a post by WayneS showing the DelimitedSplit8K function I'm using at the moment. I believe I saw a follow-up post from Jeff where he found that the UNPIVOT (as demonstrated in the above link) turns out to be slower when he ran it on his work prod server several times. Jeff, please let us know if I remember this correctly.
You did, indeed. I didn't take the time to analyze "why" but on certain machines with multiple processors, the UNPIVOT method sometimes runs substantially slower. I also didn't understand that the function you good folks were talking about was the function that I posted. Here's the latest and greatest with all the documentation and optimizations that I currently use for production code... the documentation in the header is quite substantial. {EDIT} Updated the code below to include the latest performance change (from yesterday) thanks to Paul White.
CREATE FUNCTION dbo.DelimitedSplit8K
/***************************************************************************************************
Purpose:
Split a given string at a given delimiter and return a list of the split elements (items).
Returns:
iTVF containing the following:
ItemNumber = Element position of Item as a BIGINT (not converted to INT to eliminate a CAST)
Item = Element value as a VARCHAR(8000)
CROSS APPLY Usage Example:
---------------------------------------------------------------------------------------------------
--===== Conditionally drop the test tables to make reruns easier for testing.
-- (this is NOT a part of the solution)
IF OBJECT_ID('tempdb..#JBMTest') IS NOT NULL
DROP TABLE #JBMTest
;
--===== Create and populate a test table on the fly (this is NOT a part of the solution).
SELECT *
INTO #JBMTest
FROM (
SELECT 1,'1,10,100,1000,10000,100000,1000000' UNION ALL
SELECT 2,'2000000,200000,20000,2000,200,20,2' UNION ALL
SELECT 3, 'This,is,a,test' UNION ALL
SELECT 4, 'and so is this' UNION ALL
SELECT 5, 'This, too (no pun intended)'
) d (SomeID,SomeValue)
;
GO
--===== Split the CSV column for the whole table using CROSS APPLY (this is the solution)
SELECT test.SomeID, split.ItemNumber, split.Item
FROM #JBMTest test
CROSS APPLY
(
SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8k(test.SomeValue,',')
) split
;
---------------------------------------------------------------------------------------------------
Notes:
1. Optimized for VARCHAR(7999) or less. No testing or error reporting for truncation at 7999
characters is done.
2. Optimized for single character delimiter. Multi-character delimiters should be resolved
externally from this function.
3. Optimized for use with CROSS APPLY.
4. Does not "trim" elements just in case leading or trailing blanks are intended.
5. If you don't know how a Tally table can be used to replace loops, please see the following...
http://www.sqlservercentral.com/articles/T-SQL/62867/
6. Changing this function to use VARCHAR(MAX) will cause it to run twice as slow. It's just the
nature of VARCHAR(MAX) whether it fits in-row or not.
7. Multi-machine testing for the method of using UNPIVOT instead of 10 SELECT/UNION ALLs shows
that the UNPIVOT method is quite machine dependent and can slow things down quite a bit.
8. Performance testing shows using "TOP" for the limiting criteria of "N" is actually
slower and slightly more CPU intensive than the traditional WHERE N < LEN(@pString) + 2.
9. Performance testing shows using ORDER BY (SELECT x) where "x" is anything is actually
slower and slightly more CPU intensive than the traditional ORDER BY (SELECT N).
Credits:
This code is the product of many people's efforts including but not limited to the following:
cteTally concept originally by Itzik Ben-Gan and "decimalized" by Lynn Pettis (and others) for a
bit of extra speed and finally redacted by Jeff Moden for a different slant on readability and
compactness. Hats off to Paul White for his simple explanations of CROSS APPLY. Finally,
special thanks to Erland Sommarskog for his tireless efforts to help people understand
what you can actually do with T-SQL. I also thank whoever wrote the first article I ever saw
on "numbers tables" which is located at the following URL ...
Revision History:
Rev 00 - 20 Jan 2010 - Concept: Lynn Pettis and others.
Redaction/Implementation: Jeff Moden
- Base 10 redaction and reduction for CTE. (Total rewrite)
Rev 01 - 13 Mar 2010 - Jeff Moden
- Removed one additional concatenation and one subtraction from the SUBSTRING in the
SELECT List for that tiny bit of extra speed.
Rev 02 - 14 Apr 2010 - Jeff Moden
- No code changes. Added CROSS APPLY usage example to the header, some additional credits,
and extra documentation.
Rev 03 - 18 Apr 2010 - Jeff Moden
- No code changes. Added notes 7, 8, and 9 about certain "optimizations" that don't
actually work for this type of function.
Rev 04 - 29 Jun 2010 - Jeff Moden
- Added WITH SCHEMABINDING thanks to a note by Paul White. This prevents an unnecessary
"Table Spool" when the function is used in an UPDATE statement even though the function
makes no external references.
***************************************************************************************************/
--===== Define I/O parameters
(
@pString VARCHAR(7999),
@pDelimiter CHAR(1)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
--===== "Inline" CTE Driven "Tally Table” produces values up to
-- 10,000... enough to cover VARCHAR(8000)
WITH
E1(N) AS ( --=== Create Ten 1's
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 --10
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --100
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10,000
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT N)) FROM E4)
--===== Do the split
SELECT ROW_NUMBER() OVER (ORDER BY N) AS ItemNumber,
SUBSTRING(@pString, N, CHARINDEX(@pDelimiter, @pString + @pDelimiter, N) - N) AS Item
FROM cteTally
WHERE N < LEN(@pString) + 2
AND SUBSTRING(@pDelimiter + @pString, N, 1) = @pDelimiter
;
GO
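If you just want to see what the function returns without building the temp table from the header example, it can also be called directly against a literal; a minimal sketch (not part of the original post):
--===== Quick standalone check of the splitter (illustrative only)
SELECT ItemNumber, Item
  FROM dbo.DelimitedSplit8K('1,10,100,1000', ',')
;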
Cool! thanks, I needed this... uhmmm, I have a question about this but in a different forum/section (don't wanna double post)...
http://www.sqlservercentral.com/Forums/Topic975437-392-1.aspx?Update=1
hope you can look at it.
Thanks in advance! 😀
_____________________________________________
[font="Comic Sans MS"]Quatrei Quorizawa[/font]
MABUHAY PHILIPPINES!
August 27, 2010 at 6:53 pm
Quatrei.X (8/27/2010)
Cool! thanks, I needed this... uhmmm, I have a question about this but in a different forum/section (don't wanna double post)...
http://www.sqlservercentral.com/Forums/Topic975437-392-1.aspx?Update=1
hope you can look at it.
Thanks in advance! 😀
So how'd I do over there? 🙂 Did I satisfy your questions?
--Jeff Moden
Change is inevitable... Change for the better is not.
September 17, 2010 at 7:57 pm
Ok... here we go...
First, here are 4 functions. I believe Oleg will recognize the one called "XML-1 (Split8KXML1 mlTVF)"....
--=====================================================================================================================
-- Create the various functions to test
--=====================================================================================================================
--===== Do this in a nice safe place that everyone has
USE TempDB;
--===== Tally Table (Split8KTally iTVF) ===============================================================================
DROP FUNCTION dbo.Split8KTally
GO
CREATE FUNCTION dbo.Split8KTally
(@Parameter VARCHAR(8000), @Delimiter VARCHAR(1))
RETURNS TABLE AS
RETURN
SELECT CAST(ROW_NUMBER() OVER (ORDER BY N) AS INT) AS ItemNumber,
SUBSTRING(@Parameter,N,CHARINDEX(@Delimiter,@Parameter+@Delimiter,N)-N) AS ItemValue
FROM dbo.Tally
WHERE N BETWEEN 1 AND LEN(@Parameter)
AND SUBSTRING(@Delimiter+@Parameter,N,1) = @Delimiter; --Notice how we find the comma
GO
--===== XML-1 (Split8KXML1 mlTVF) =====================================================================================
DROP FUNCTION dbo.Split8KXML1
GO
CREATE FUNCTION dbo.Split8KXML1
(@Parameter VARCHAR(MAX), @Delimiter VARCHAR(1))
RETURNS @Result TABLE
(ItemNumber INT, ItemValue VARCHAR(8000))AS
BEGIN
DECLARE @XML XML;
SELECT @XML = '<r>'+REPLACE(@Parameter, @Delimiter, '</r><r>')+'</r>'; --Use the passed @Delimiter rather than a hardcoded comma
INSERT INTO @Result (ItemNumber, ItemValue)
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ItemNumber,
Item.value('text()[1]', 'varchar(max)') AS ItemValue
FROM @XML.nodes('//r') R(Item);
RETURN;
END;
GO
--===== XML-3 (Split8KXML3 iTVF) ======================================================================================
DROP FUNCTION dbo.Split8KXML3
GO
CREATE FUNCTION dbo.Split8KXML3
(@Parameter VARCHAR(8000), @Delimiter VARCHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
SELECT CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS INT) AS ItemNumber,
R.Item.value('text()[1]', 'varchar(max)') AS ItemValue
FROM (SELECT CAST('<r>'+REPLACE(@Parameter, @Delimiter, '</r><r>')+'</r>' AS XML)) X(N)
CROSS APPLY N.nodes('//r') R(Item)
;
GO
--===== Loop Method 1 (Split8KL1 mlTVF) ===============================================================================
DROP FUNCTION dbo.Split8KL1
GO
CREATE FUNCTION dbo.Split8KL1
(@Parameter VARCHAR(8000), @Delimiter VARCHAR(1))
RETURNS @Result TABLE (ItemNumber INT IDENTITY(1,1), ItemValue VARCHAR(8000))
AS
BEGIN
--===== Declare a variable to remember the position of the current comma
DECLARE @N INT;
--===== Add start and end commas to the Parameter so we can handle
-- single elements
SELECT @Parameter = @Delimiter + @Parameter + @Delimiter,
--===== Preassign the current comma as the first character
@N = 1;
--===== Loop through and find each comma, then insert the string value
-- found between the current comma and the next comma. @N is
-- the position of the current comma.
WHILE @N < LEN(@Parameter) --Don't include the last comma
BEGIN
--==== Do the insert using the value between the commas
INSERT INTO @Result (ItemValue)
SELECT SUBSTRING(@Parameter, @N+1, CHARINDEX(@Delimiter, @Parameter, @N+1)-@N-1);
--==== Find the next comma
SELECT @N = CHARINDEX(@Delimiter, @Parameter, @N+1);
END; --END While
RETURN;
END; --END Function
GO
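Before building the full test table, a quick smoke test against a single literal confirms all four splitters return the same elements; a sketch, assuming dbo.Tally already exists (the setup script for it appears later in this thread):
--===== Smoke test: all four splitters against the same literal (illustrative only)
SELECT 'Tally' AS Method, ItemNumber, ItemValue FROM dbo.Split8KTally('1,2,3,4',',') UNION ALL
SELECT 'XML1'           , ItemNumber, ItemValue FROM dbo.Split8KXML1 ('1,2,3,4',',') UNION ALL
SELECT 'XML3'           , ItemNumber, ItemValue FROM dbo.Split8KXML3 ('1,2,3,4',',') UNION ALL
SELECT 'Loop1'          , ItemNumber, ItemValue FROM dbo.Split8KL1   ('1,2,3,4',',')
 ORDER BY Method, ItemNumber;
GO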
Next, some test data. Read the comments for where to make changes to get the data to vary, please...
--=====================================================================================================================
-- Build the test data
--=====================================================================================================================
--===== Do this in a nice safe place that everyone has
USE TempDB;
--===== Conditionally drop the test table to make reruns easier
IF OBJECT_ID('TempDB..CsvTest','U') IS NOT NULL
DROP TABLE CsvTest;
--===== This creates and populates a test table on the fly containing a
-- sequential column and a randomly generated CSV Parameter column.
SELECT TOP (10000) --Controls the number of rows in the test table
ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
(
SELECT CAST(STUFF( --=== STUFF gets rid of the leading comma
( --=== This builds CSV row with a leading comma
SELECT TOP (16) --Controls the number of CSV elements in each row
','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
FROM dbo.Tally t3 --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t4 --can produce row sets up to 121 million.
WHERE t1.N <> t3.N --Without this line, all rows would be the same
FOR XML PATH('')
)
,1,1,'') AS NVARCHAR(MAX))
) AS CsvParameter
INTO CsvTest
FROM dbo.Tally t1 --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2; --can produce row sets up to 121 million.
--===== Let's add a PK just for grins. Since it's a temp table, we won't name it.
ALTER TABLE CsvTest
ADD PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100;
GO
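As a sanity check on the generator, it helps to eyeball a few rows and their element counts before timing anything; a small sketch (not part of the harness):
--===== Sanity check the generated test data (illustrative only)
SELECT TOP (5)
       RowNum,
       LEN(CsvParameter) AS CsvLength,
       LEN(CsvParameter) - LEN(REPLACE(CsvParameter,',','')) + 1 AS ElementCount,
       CsvParameter
  FROM CsvTest
 ORDER BY RowNum;
GO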
Last but not least, let's test the functions. I ran the following for 4 different sets of test data with SQL Profiler running. You can tell what's what just by reading the following SQL Profiler output.
--=====================================================================================================================
-- Run the functions (Profiler turned on for this given SPID)
--=====================================================================================================================
--===== Do this in a nice safe place that everyone has
USE TempDB;
GO
--===== Tally Table (Split8KTally iTVF) ===============================================================================
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT @RowNum = csv.RowNum, @ItemNumber = split.ItemNumber, @ItemValue = CAST(split.ItemValue AS INT)
FROM CsvTest csv
CROSS APPLY dbo.Split8KTally(csv.CsvParameter,',') AS split;
GO
--===== XML-1 (Split8KXML1 mlTVF) =====================================================================================
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT @RowNum = csv.RowNum, @ItemNumber = split.ItemNumber, @ItemValue = CAST(split.ItemValue AS INT)
FROM CsvTest csv
CROSS APPLY dbo.Split8KXML1(csv.CsvParameter,',') AS split;
GO
--===== XML-3 (Split8KXML3 iTVF) ======================================================================================
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT @RowNum = csv.RowNum, @ItemNumber = split.ItemNumber, @ItemValue = CAST(split.ItemValue AS INT)
FROM CsvTest csv
CROSS APPLY dbo.Split8KXML3(csv.CsvParameter,',') AS split;
GO
--===== Loop Method 1 (Split8KL1 mlTVF) ===============================================================================
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT @RowNum = csv.RowNum, @ItemNumber = split.ItemNumber, @ItemValue = CAST(split.ItemValue AS INT)
FROM CsvTest csv
CROSS APPLY dbo.Split8KL1(csv.CsvParameter,',') AS split;
GO
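If Profiler isn't handy, SET STATISTICS TIME gives rough CPU and elapsed numbers per batch instead; a sketch for the Tally run only (the other three follow the same pattern):
--===== Alternative timing without Profiler (illustrative only)
SET STATISTICS TIME ON;
GO
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT @RowNum = csv.RowNum, @ItemNumber = split.ItemNumber, @ItemValue = CAST(split.ItemValue AS INT)
  FROM CsvTest csv
 CROSS APPLY dbo.Split8KTally(csv.CsvParameter,',') AS split;
GO
SET STATISTICS TIME OFF;
GO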
Here's the profiler run. Even the WHILE Loop beat both versions of the XML. You don't really want to see the 25 minutes it takes the XML to do 100 elements for 10K Rows.
The reason why I initially said the XML was looking good is because I'd made a mistake in the creation of my Tally table... I used BIGINT instead of INT and the implicit conversions made a mess of the Tally table solution. Obviously, I fixed that for this experiment.
My machine is an 8 year old single P4 1.8GHz with 1GB Ram running Windows XP Pro SP3 and SQL Server Developer's Edition 2005 SP3 with no CU's.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 17, 2010 at 7:59 pm
Long story short... I wouldn't use XML for splitting if I were you. 🙂
--Jeff Moden
Change is inevitable... Change for the better is not.
September 18, 2010 at 12:06 am
Jeff,
I am terribly sorry, but there is no way that your test reflects the whole picture. I ran only the Split8K and xml tests, and the results I get are totally different. First, I would like to clarify that you did mean 25 seconds, not 25 minutes, to process 10,000 records with 100 items per record, right? The biggest problem with Split8KTally was pointed out a while ago by Lynn Pettis, who correctly said that it becomes pretty slow when the datalength of the items in the record gets close to 8K. This assessment is absolutely correct, yet you tested a rather small number of items to split in each record (4, 16, etc.). When I ran your tests I saw that the smaller the number of items to split, the better Split8KTally looks. So, for 16 items per record I saw that Split8KTally takes 1 second while xml takes 4. However, when I increased the records to hold 100 items each, I saw that Split8KTally takes 15 seconds while xml takes 20, still faster but not by as much. When I increased the number to 1,000 per record, which basically pushed Split8KTally to about as much as it can handle, I saw that
Split8KTally took 4 minutes 39 seconds
xml still took 20 seconds.
Then I could no longer test Split8KTally (it does not handle more than 8K worth of csv data per record), but I continued with xml. I increased the number of items to split in each record to 10,000, which made the longest record 117,840 characters and the number of records resulting from the cross apply 100 million (10,000 records with 10,000 csv values in each). This took 45 seconds.
This clearly shows that xml handily beats the other methods (outside of a CLR procedure, which can be created to complete the whole thing in a heartbeat) once the data to split becomes big enough in the number of csv values per record.
Honestly, I would not even have gone for all this re-testing, but after I saw the 25 minutes instead of the expected 25 seconds, coupled with the fact that a WHILE loop (which should not even exist in T-SQL) can beat my xml solution, I had to do it.
Please let me know if my assumptions are incorrect in some way. I used your harness without making any changes.
Thanks,
Oleg
September 18, 2010 at 8:05 am
My tests DO reflect the total picture... just not on your machine. 😉 They show the test results on my machine. It's obvious that my machine is different than yours (mine cannot do parallelism... it only has 1 CPU, for example) so I'm setting up to do the tests on a more modern machine than my 8 year old desktop. We'll see soon.
Please post the CREATE TABLE statement for your Tally Table. Be sure to include any indexes it may have. It would also be helpful to know what your default COLLATION is, please. We need to make sure we're comparing apples to apples. Thanks.
And to be sure, the only thing I care about right now is splitting things that are 8k or less. We already know that JOINing to a MAX column is terrible for performance and that a While Loop is one of the ways to go there.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 18, 2010 at 10:24 am
Ok, Oleg... not sure what the differences are between your machine and the two that I'm running but the Tally table beats the two XML methods pretty badly on both.
First, here's my Tally table setup just so we're talking apples and apples. Also, make sure you're using my test generator so you're testing for "real life" very random data with commas in different positions in each row.
--===== Do this in a nice safe place that everyone has
USE TempDB;
--===================================================================
-- Create a Tally table from 1 to 11000
--===================================================================
--===== Create and populate the Tally table on the fly.
-- The IDENTITY function produces a NOT NULL column
-- so we can put a Primary Key on it
SELECT TOP 11000
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.sys.ALL_Columns ac1
CROSS JOIN Master.sys.ALL_Columns ac2
;
--===== Add a CLUSTERED Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
;
--===== Allow the general public to use it
GRANT SELECT ON dbo.Tally TO PUBLIC
;
GO
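A quick check that the Tally table came out as intended (assuming the script above ran cleanly):
--===== Verify the Tally table (illustrative only)
SELECT COUNT(*) AS TotalRows, MIN(N) AS MinN, MAX(N) AS MaxN
  FROM dbo.Tally;
GO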
Here are the run results from my desktop machine (8 year old single P4 1.8GHz 1GB Ram). Do notice the 100 element test, please. And, hell no... with speeds like that, there's no way I'm running a 1,000 element test.
Here are the exact same runs from a real server running 4 32-bit Xeons at 3GHz on Windows 2003 and SQL Server 2005 sp3 with 2GB memory allocated to SQL Server. Again... notice the 100 element runs.
I just might try the 1000 element runs on the server (well... probably NOT the XML3 version).
--Jeff Moden
Change is inevitable... Change for the better is not.
September 18, 2010 at 10:29 am
Oh yeah... I almost forgot. Here's the default collation on my desktop box...
SQL_Latin1_General_CP1_CI_AS
... and on my server box...
SQL_Latin1_General_CP1_CI_AS
What we need to find out now is why XML seems to run so nice and fast on your box. Please let me know what the configuration is so I can try to find one similar to test on. Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 18, 2010 at 10:54 am
Jeff,
So far, I can see that I use the same setup as far as the tally creation is concerned. The collation is of course the same as well. You are correct in not using xml3; it is designed to fail due to the cross apply inside of it. This is what prompted me to modify my answer on ask after I saw Kevan's. My original answer included the same dreaded cross apply, but once I removed it, it became faster than Kevan's tally for sizeable strings. I still deleted the answer from ask altogether because I respect your opinion (though I cannot agree with it yet). I will try to play with the settings and also check whether it is possible to tweak the xml a bit. One thing I saw for sure, and I hope you will agree, is that with huge strings the xml execution time does not increase much, but the tally split sure does.
I think that the most relevant test we can have will be with the string size somewhere in the middle, say 4k per record, meaning about 400 items to split. More than 800 items makes the string greater than 8k in size.
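To make that concrete, here is a sketch of that middle-ground data set using Jeff's generator from earlier in the thread; the only meaningful change is the inner TOP, and the table name CsvTest4K is purely illustrative:
--===== Jeff's generator with ~400 elements per row (roughly 4K characters); illustrative only
IF OBJECT_ID('TempDB..CsvTest4K','U') IS NOT NULL
   DROP TABLE CsvTest4K;
SELECT TOP (10000) --Controls the number of rows in the test table
       ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
       (
        SELECT CAST(STUFF(
               (
                SELECT TOP (400) --Controls the number of CSV elements in each row
                       ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                  FROM dbo.Tally t3
                 CROSS JOIN dbo.Tally t4
                 WHERE t1.N <> t3.N
                   FOR XML PATH('')
               )
               ,1,1,'') AS NVARCHAR(MAX))
       ) AS CsvParameter
  INTO CsvTest4K
  FROM dbo.Tally t1
 CROSS JOIN dbo.Tally t2;
GO
The timing batches would then simply point at CsvTest4K instead of CsvTest.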
I would like to point out that while you do mention an apples-to-apples comparison, the xml function you wrote does use varchar(max). I will leave it as is though, because xml-ifying the input bloats it somewhat, so let varchar(max) stay in the xml function.
Oleg
September 19, 2010 at 12:57 pm
Oleg Netchaev (9/18/2010)
Jeff, so far I can see that I use the same setup as far as the tally creation is concerned. The collation is of course the same as well. You are correct in not using xml3; it is designed to fail due to the cross apply inside of it. This is what prompted me to modify my answer on ask after I saw Kevan's. My original answer included the same dreaded cross apply, but once I removed it, it became faster than Kevan's tally for sizeable strings. I still deleted the answer from ask altogether because I respect your opinion (though I cannot agree with it yet). I will try to play with the settings and also check whether it is possible to tweak the xml a bit. One thing I saw for sure, and I hope you will agree, is that with huge strings the xml execution time does not increase much, but the tally split sure does.
I think that the most relevant test we can have will be with the string size somewhere in the middle, say 4k per record, meaning about 400 items to split. More than 800 items makes the string greater than 8k in size.
I would like to point out that while you do mention an apples-to-apples comparison, the xml function you wrote does use varchar(max). I will leave it as is though, because xml-ifying the input bloats it somewhat, so let varchar(max) stay in the xml function.
Oleg
Hi Oleg,
I appreciate the respect and I do wish you'd put your good post back up on Ask.
I definitely agree that the Tally table stumbles pretty badly on things above 8k and I'm setting up to test that.
On the apples-to-apples thing, I left VARCHAR(MAX) in the XML code because in order to get anything close to 8k of data, you have to use VARCHAR(MAX) to account for the addition of the 7 character tag information at each delimiter. Just to be complete, though, I'll try it with just a VARCHAR(8000) at the same 100 elements I've been testing with, so far.
I've got some folks participating in the 100 element test right now. If it works out ok, I'll try to figure out a way to do a bit more automated and extensive testing across a wider range. I didn't think the Statistics reporting was going to work out as well as it is currently working out and didn't want to bog people's machines down so much.
You WILL be pleased to know that it's YOUR method in XML-1 in that testing and even though it's a mlTVF, it blows the other XML methods away for performance. Under 8k, the Tally table still beats up on it pretty good. Like I said, I'm setting up for >8k testing... it just takes some time to do it all so people don't have to mess with things when they test it.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 26, 2010 at 6:21 am
Jeff,
I've also done some testing on this on my machine, but instead of the XML version I tested the CLR RegEx solution.
Here is the source for the RegEx CLR Function
using System;
using System.Collections;
using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;
public class SQLRegEx
{
private class RegExRow
{
/// <summary>
/// Private class for passing matches of the RegExMatches to the FillRow method
/// </summary>
/// <param name="rowId">ID of the Row</param>
/// <param name="matchId">ID of the Match</param>
/// <param name="groupID">ID of the Group within the Match</param>
/// <param name="value">Value of the particular group</param>
public RegExRow(int rowId, int matchId, int groupID, string value)
{
RowId = rowId;
MatchId = matchId;
GroupID = groupID;
Value = value;
}
public int RowId;
public int MatchId;
public int GroupID;
public string Value;
}
/// <summary>
/// Applies Regular Expression to the Source strings and return all matches and groups
/// </summary>
/// <param name="sourceString">Source string on which the regular expression should be applied</param>
/// <param name="pattern">Regular Expression pattern</param>
/// <returns>Returns list of RegExRows representing the group value</returns>
[SqlFunction(FillRowMethodName = "FillRegExRow")]
public static IEnumerable RegExMatches(string sourceString, string pattern)
{
Regex r = new Regex(pattern, RegexOptions.Compiled);
int rowId = 0;
int matchId = 0;
foreach (Match m in r.Matches(sourceString))
{
matchId++;
for (int i = 0; i < m.Groups.Count; i++)
{
yield return new RegExRow(++rowId, matchId, i, m.Groups[i].Value);
}
}
}
/// <summary>
/// FillRow method to populate the output table
/// </summary>
/// <param name="obj">RegExRow passed as object</param>
/// <param name="rowId">ID or the returned row</param>
/// <param name="matchId">ID of returned Match</param>
/// <param name="groupID">ID of group in the Match</param>
/// <param name="value">Value of the Group</param>
public static void FillRegExRow(Object obj, out int rowId, out int matchId, out int groupID, out SqlChars value)
{
RegExRow r = (RegExRow)obj;
rowId = r.RowId;
matchId = r.MatchId;
groupID = r.GroupID;
value = new SqlChars(r.Value);
}
}
CREATE ASSEMBLY [SQLRegEx]
AUTHORIZATION [dbo]
FROM 'C:\CLR\SQLRegEx.dll'
WITH PERMISSION_SET = SAFE
GO
CREATE FUNCTION [dbo].[fn_RegExMatches](
@sourceString nvarchar(max), --Source string to be processed by regular expression
@pattern nvarchar(4000)) --Regular expression (pattern) to be applied on the source string
RETURNS TABLE (
[rowId] int, --RowId of each row as its ID
[matchId] int, --ID of particular match (starts from 1)
[groupId] int, --ID of particular group in the RegEx match (GroupID = 0 represents a complete match)
[value] [nvarchar](4000) --value of the group
)
WITH EXECUTE AS CALLER
AS EXTERNAL NAME [SQLRegEx].[SQLRegEx].[RegExMatches]
GO
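A quick way to verify the deployment and the pattern before pointing it at the test table; the literal below is just illustrative:
--===== Quick check of the CLR splitter with the same pattern used in the timing test below
SELECT rowId, matchId, value
  FROM dbo.fn_RegExMatches(N'1,10,100', '(\d+?)(?:,|$)')
 WHERE groupId = 1;
GO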
I've tested it on the test tables you provided and it's true that for a small number of elements the Tally table solution is unbeatable. But as Oleg said, for a larger element count the Tally is not usable.
For the 16-element test on my machine the Tally runs about 1 sec. and the CLR RegEx about 20 sec.
For 100 elements the Tally runs 20550 ms and the CLR RegEx 26773 ms.
But the situation differs with an increased number of elements. I tested it on 1333 elements (nearly 8k). I stopped the Tally table version after 19 minutes of execution time.
Instead, the CLR RegEx solution took 46 seconds to finish.
Test query for the CLR RegEx:
DBCC DROPCLEANBUFFERS
GO
DBCC FREEPROCCACHE
GO
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
@RowNum = D.RowNum,
@ItemNumber = V.matchId,
@ItemValue = Cast(V.value AS int)
FROM dbo.CsvTest2 D
CROSS APPLY dbo.fn_RegExMatches(D.CsvParameter, '(\d+?)(?:,|$)') V
WHERE V.groupId = 1
GO
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO
Here are results for the CLR RegEx:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 33 ms.
Table 'CsvTest2'. Scan count 5, logical reads 217, physical reads 5, read-ahead reads 68, lob logical reads 320872, lob physical reads 2500, lob read-ahead reads 170000.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 140775 ms, elapsed time = 46082 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
Profiler output:
Run on my work machine: Core 2 Quad 2.5 GHz, * GB RAM
So from my tests the Tally solution is unusable for splitting large arrays. Instead, the CLR works perfectly even with very large arrays. When I tested it on a single array of 100 000 elements, the execution time was 745 ms, and for an array with 1 000 000 elements it was 7.8 sec.
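One way to reproduce that kind of single-array test is to build one big CSV with the same Tally/FOR XML PATH trick used for the test tables and feed it straight to the function; a sketch (the element count is the only knob):
--===== Build one large CSV and split it with the CLR function (illustrative sketch)
DECLARE @BigCsv NVARCHAR(MAX);
SELECT @BigCsv = STUFF(
       (
        SELECT TOP (100000) --Number of elements in the single array
               ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
          FROM dbo.Tally t1
         CROSS JOIN dbo.Tally t2
           FOR XML PATH('')
       )
       ,1,1,'');
SELECT COUNT(*) AS ElementCount
  FROM dbo.fn_RegExMatches(@BigCsv, '(\d+?)(?:,|$)')
 WHERE groupId = 1;
GO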
Also, using the RegEx there are no AppDomain reloads/restarts such as Phil Factor mentioned somewhere when using the String.Split method for splitting arrays. The RegEx doesn't have this kind of problem.
October 2, 2010 at 12:36 am
But as Oleg said, for a larger element count the Tally is not usable.
For the 16-element test on my machine the Tally runs about 1 sec. and the CLR RegEx about 20 sec.
For 100 elements the Tally runs 20550 ms and the CLR RegEx 26773 ms.
But the situation differs with an increased number of elements. I tested it on 1333 elements (nearly 8k). I stopped the Tally table version after 19 minutes of execution time.
Thanks for the feedback... I'll run your code and see if I come up with the same results.
--Jeff Moden
Change is inevitable... Change for the better is not.