September 12, 2010 at 1:43 pm
I'm working on a modernization of the Tally Table article which, of course, will include some "CSV Splitter" code. I wanted to do a comparison against XML splitters and, not being a Ninja at XML, I searched the web and found quite a few that looked pretty hopeful for speed. Unfortunately, I'm really disappointed in the XML splitters because {drum roll please}, even a UDF with a While Loop in it beats the XML methods for splitting a table's worth of CSVs according to the testing I'm doing.
So, at the risk of starting another "Mega Thread" on "How to Split" (heh... let's PLEASE not go there :-P), [font="Arial Black"]can someone point me to an XML CSV Splitter that runs faster than the following[/font] so that XML splitters have at least a fighting chance in the rewrite of the article?
CREATE FUNCTION dbo.Split8KX
(@Parameter VARCHAR(MAX), @Delimiter VARCHAR(1))
RETURNS @Result TABLE
(ItemNumber INT, ItemValue VARCHAR(8000))
AS
BEGIN
DECLARE @XML XML;
--===== Convert the CSV to an XML fragment. Note the use of @Delimiter here;
     -- hardcoding ',' would silently ignore the delimiter parameter.
SELECT @XML = '<r>'+REPLACE(@Parameter, @Delimiter, '</r><r>')+'</r>';
INSERT INTO @Result (ItemNumber, ItemValue)
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ItemNumber,
       Item.value('text()[1]', 'varchar(max)') AS ItemValue
FROM @XML.nodes('//r') R(Item);
RETURN;
END;
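A quick sanity check of the function (an illustrative call of my own, not part of the benchmark):

```sql
--===== Should return 3 rows: (1,'a'), (2,'b'), (3,'c')
SELECT ItemNumber, ItemValue
FROM dbo.Split8KX('a,b,c', ',');
```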
Like I said... I'm no Ninja at XML in SQL, so I could really use the help here. Thanks for your time, folks.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 12, 2010 at 4:13 pm
The XML splitter that I used until you blasted it away with code did not use the ROW_NUMBER() function to return the position.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
September 12, 2010 at 5:41 pm
I'm amazed... to split 10,000 rows with 10 random elements each, this XML method takes more than 14 seconds on my humble desktop. A While Loop driven function only takes 6. I knew XML was pretty bad but I didn't figure it for THAT bad... which is why I thought I was doing something wrong with the function.
I still could be... does anyone else have a nice, fast XML CSV-Splitter function they'd like to share, or is the one I posted (cobbled together from pieces of multiple sources) faster? And, no... this isn't for bragging rights on my part because it's all "leveraged" code. I'm just trying to find a "fast" XML CSV-Splitter and there just doesn't seem to be one, including the one I posted.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 12, 2010 at 10:29 pm
If you will post your sample data creation, I'll give it a test...
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
September 12, 2010 at 10:52 pm
WayneS (9/12/2010)
If you will post your sample data creation, I'll give it a test...
Coming right up...
--Jeff Moden
Change is inevitable... Change for the better is not.
September 12, 2010 at 11:01 pm
There are 3 places that control what the table looks like. Of course, two of those are SELECT TOP (X) and the comments next to each of those explain the impact on the result.
The other place follows the % operator. The number that follows that operator is 1 larger than the largest number that will be generated. I believe you'll recognize the technique. 🙂
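In isolation, that technique looks like this (illustration only):

```sql
--===== Each call returns a pseudo-random INT from 0 to 99999;
-- the % operand is 1 larger than the largest value generated.
SELECT ABS(CHECKSUM(NEWID())) % 100000 AS RandomValue;
```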
The code below is currently set to create a paltry 10k rows of 100 elements each where each element will be from 1 to 5 digits wide (care of the %100000).
--===== Conditionally drop the test table to make reruns easier
IF OBJECT_ID('TempDB..#CsvTest','U') IS NOT NULL
DROP TABLE #CsvTest;
--===== This creates and populates a test table on the fly containing a
-- sequential column and a randomly generated CSV Parameter column.
SELECT TOP (10000) --Controls the number of rows in the test table
ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
(
SELECT CAST(STUFF( --=== STUFF gets rid of the leading comma
( --=== This builds CSV row with a leading comma
SELECT TOP (100) --Controls the number of CSV elements in each row
','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
FROM dbo.Tally t3 --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t4 --can produce row sets up to 121 million.
WHERE t1.N <> t3.N --Without this line, all rows would be the same
FOR XML PATH('')
)
,1,1,'') AS VARCHAR(8000))
) AS CsvParameter
INTO #CsvTest
FROM dbo.Tally t1 --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2; --can produce row sets up to 121 million.
--===== Let's add a PK just for grins. Since it's a temp table, we won't name it.
ALTER TABLE #CsvTest
ADD PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100;
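For anyone running this, a quick peek at the generated data (my addition, just a sanity check) might look like:

```sql
--===== Sanity check only: preview the first few generated CSV rows
SELECT TOP (3) RowNum, LEFT(CsvParameter, 60) AS CsvSample
FROM #CsvTest
ORDER BY RowNum;
```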
--Jeff Moden
Change is inevitable... Change for the better is not.
September 12, 2010 at 11:06 pm
For your convenience, here's a copy of the Tally table I used. You can use it in Temp DB without it interfering with your own...
--===== Do this in a nice safe place that everyone has
USE TempDB;
--===================================================================
-- Create a Tally table from 1 to 11000
--===================================================================
--===== Create and populate the Tally table on the fly.
-- This ISNULL function makes the column NOT NULL
-- so we can put a Primary Key on it
SELECT TOP 11000
IDENTITY(INT,1,1) AS N
--ISNULL(CAST(ROW_NUMBER() OVER (ORDER BY ac1.Object_ID) AS SMALLINT),0) AS N
INTO dbo.Tally
FROM Master.sys.ALL_Columns ac1
CROSS JOIN Master.sys.ALL_Columns ac2
;
--===== Add a CLUSTERED Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
;
--===== Allow the general public to use it
GRANT SELECT ON dbo.Tally TO PUBLIC
;
GO
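A quick check that the Tally table built as expected (illustrative, not from the original post):

```sql
--===== Should return RowCnt = 11000, MinN = 1, MaxN = 11000
SELECT COUNT(*) AS RowCnt, MIN(N) AS MinN, MAX(N) AS MaxN
FROM dbo.Tally;
```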
--Jeff Moden
Change is inevitable... Change for the better is not.
September 12, 2010 at 11:08 pm
p.s. Sorry that I don't document my code or anything and that I don't follow any formatting rules... I'm new at this... :-P:-D:hehe:
--Jeff Moden
Change is inevitable... Change for the better is not.
September 13, 2010 at 12:29 am
I am not an XML Ninja either, but you can try an inline table-valued function
CREATE FUNCTION dbo.Split8KX_Inline(@Parameter VARCHAR(MAX), @Delimiter VARCHAR(1))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ItemNumber,
XC.R.query('.').value('/r[1]', 'varchar(8000)') ItemValue
FROM
(
SELECT CAST('<r>'+REPLACE(@Parameter, @Delimiter, '</r><r>')+'</r>' AS XML)
) X(P)
CROSS APPLY
P.nodes('/r') XC(R)
;
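An illustrative call (my own example, assuming the function compiles as posted):

```sql
--===== Splits one string into numbered items; should return 3 rows
SELECT ItemNumber, ItemValue
FROM dbo.Split8KX_Inline('10,20,30', ',');
```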
I have run the following code against your test data
set statistics io on
set statistics time on
-- Print Jeff's
select
count(*)
from
CsvTest
cross apply
dbo.Split8KX(CsvParameter, ',')
set statistics io off
set statistics time off
and
set statistics io on
set statistics time on
print 'Inline table valued function'
select
count(*)
from
CsvTest
cross apply
dbo.Split8KX_Inline(CsvParameter, ',')
set statistics io off
set statistics time off
Here are the test results
(1 row(s) affected)
Table '#060DEAE8'. Scan count 10000, logical reads 10000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CsvTest'. Scan count 1, logical reads 774, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 40966 ms, elapsed time = 42803 ms.
and
(1 row(s) affected)
Table 'CsvTest'. Scan count 5, logical reads 849, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 8846 ms, elapsed time = 2268 ms.
Edit: as you can see from the execution times, the inline table valued function also heavily benefits from the 4 core CPU in my PC.
Peter
September 13, 2010 at 1:17 am
Brad Schulz has done a nice round up
http://bradsruminations.blogspot.com/2010/08/integer-list-splitting-sql-fable.html
September 13, 2010 at 4:28 am
Peter Brinkhaus (9/13/2010)
I am not an XML Ninja either, but you can try an inline table-valued function
CREATE FUNCTION dbo.Split8KX_Inline(@Parameter VARCHAR(MAX), @Delimiter VARCHAR(1))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ItemNumber,
XC.R.query('.').value('/r[1]', 'varchar(8000)') ItemValue
FROM
(
SELECT CAST('<r>'+REPLACE(@Parameter, @Delimiter, '</r><r>')+'</r>' AS XML)
) X(P)
CROSS APPLY
P.nodes('/r') XC(R)
;
I have run the following code against your test data
set statistics time on
-- Print Jeff's
select
count(*)
from
CsvTest
cross apply
dbo.Split8KX(CsvParameter, ',')
set statistics io off
set statistics time off
and
set statistics io on
set statistics time on
print 'Inline table valued function'
select
count(*)
from
CsvTest
cross apply
dbo.Split8KX_Inline(CsvParameter, ',')
set statistics io off
set statistics time off
Here are the test results
(1 row(s) affected)
Table '#060DEAE8'. Scan count 10000, logical reads 10000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CsvTest'. Scan count 1, logical reads 774, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 40966 ms, elapsed time = 42803 ms.
and
(1 row(s) affected)
Table 'CsvTest'. Scan count 5, logical reads 849, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 8846 ms, elapsed time = 2268 ms.
Edit: as you can see from the execution times, the inline table valued function also heavily benefits from the 4 core CPU in my PC.
Peter
Thanks, Peter. I actually had tried an INLINE version and it was horrible. The one you posted is quite a bit different, though. It has a CROSS APPLY built into IT. I'll give it a whirl tonight. Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 13, 2010 at 4:28 am
Peter Brinkhaus (9/13/2010)
I am not an XML Ninja either, but you can try an inline table-valued function
set statistics io on
set statistics time on
print 'Inline table valued function'
select
count(*)
from
CsvTest
cross apply
dbo.Split8KX_Inline(CsvParameter, ',')
set statistics io off
set statistics time off
Sure the COUNT(*) may be quite quick, but what happens if you use the XML function in earnest to shift some data?
Takes rather a long time I think
SELECT C.RowNum, Z.ItemNumber, Z.ItemValue
INTO #split
FROM #CsvTest AS C
CROSS APPLY dbo.Split8KX_Inline(C.CsvParameter, ',') AS Z
September 13, 2010 at 4:36 am
Dave Ballantyne (9/13/2010)
Brad Schulz has done a nice round up: http://bradsruminations.blogspot.com/2010/08/integer-list-splitting-sql-fable.html
I'll have to revisit that one sometime in the near future because a proper Tally table solution is between 5 and 16 times faster than the tightest of RBAR solutions, including his. He's used way too many CROSS APPLYs for my liking (none are actually required in the Tally Table solution to return the columns he wants), although I don't know (yet) if those are the cause of its demise.
I do know that an on-the-fly CTE can raise hell with performance IF it's not used correctly. The modernized Tally Table article will show that.
There's also a huge truth about VARCHAR(MAX)... it takes at least twice as long to join to it. Also, if the datatype of the Tally Table isn't BIGINT, then CHARINDEX and SUBSTRING both cause an implicit conversion that forces SQL Server to do a full cartesian join between the string and itself at the character level. It's really not a pretty sight... I think that's a part of what Brad is running into.
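For context, the kind of Tally-based splitter being compared against looks roughly like this (a sketch of the classic technique, assuming a dbo.Tally table with column N; note the explicit CAST away from VARCHAR(MAX)):

```sql
--===== Classic Tally-table splitter (sketch): split '123,456,789'
SELECT t.N,
       SUBSTRING(v.Csv, t.N, CHARINDEX(',', v.Csv + ',', t.N) - t.N) AS Item
FROM (SELECT CAST('123,456,789' AS VARCHAR(8000)) AS Csv) v
JOIN dbo.Tally t
  ON t.N <= LEN(v.Csv)
 AND SUBSTRING(',' + v.Csv, t.N, 1) = ',';
```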
But, this thread is about XML splitter solutions. I'm still open to running some tests using the test data I posted and comparing them all. My machine is a bit slow, though... 8 years old, single CPU running at 1.8GHz (I'd have killed for a computer like this 30 years ago).
Thanks for the link, though, Dave.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 13, 2010 at 5:30 am
It is worth pointing out that Brad's blog does not test the raw performance of each splitting method; it is more about what SQL Server will then do to build a query plan.
Brad does state that "The optimizer made reasonable estimates of rows when it was incorporated into a query", and that really is the key to the differences seen, and perhaps should have been signposted more.
Consider these two similar functions and their use in the given statements.
create function dbo.ufn_SplitIntArray
(
@List varchar(max)
,@Delimiter char(1)
)
returns @Items table (Item int)
as
begin
declare @Item varchar(12)
,@Pos int
while len(@List)>0
begin
set @Pos=charindex(@Delimiter,@List)
if @Pos=0 set @Pos=len(@List)+1
set @Item=left(@List,@Pos-1)
insert @Items select convert(int,ltrim(rtrim(@Item)))
set @List=substring(@List,@Pos+len(@Delimiter),len(@List))
end
return
end
go
drop function ufn_SplitIntArrayInline
go
create function dbo.ufn_SplitIntArrayInline
(
@List varchar(max)
,@Delimiter char(1)
)
returns table
as
return
select Item=convert(int,(substring(@Delimiter+@List+@Delimiter
,w.n+1
,charindex(@Delimiter
,@Delimiter+@List+@Delimiter
,w.n+1) - w.n - 1
)))
from (select number as n from master..spt_values where type='P' and number <len(@Delimiter+@List)
) w
where w.n=charindex(@Delimiter,@Delimiter+@List+@Delimiter,w.n)
go
declare @ProductList varchar(max) = '897,911,942'
select d.ProductID
,ProductName=p.Name
,h.SalesOrderID
,h.OrderDate
,TerritoryName=t.Name
from dbo.ufn_SplitIntArrayInline(@ProductList,',') a
join Sales.SalesOrderDetail d on a.Item=d.ProductID
join Production.Product p on d.ProductID=p.ProductID
join Sales.SalesOrderHeader h on d.SalesOrderID=h.SalesOrderID
join Sales.SalesTerritory t on h.TerritoryID=t.TerritoryID
order by d.ProductID,h.SalesOrderID
go
declare @ProductList varchar(max) = '897,911,942'
select d.ProductID
,ProductName=p.Name
,h.SalesOrderID
,h.OrderDate
,TerritoryName=t.Name
from dbo.ufn_SplitIntArray(@ProductList,',') a
join Sales.SalesOrderDetail d on a.Item=d.ProductID
join Production.Product p on d.ProductID=p.ProductID
join Sales.SalesOrderHeader h on d.SalesOrderID=h.SalesOrderID
join Sales.SalesTerritory t on h.TerritoryID=t.TerritoryID
order by d.ProductID,h.SalesOrderID
Now, looking at the execution plans of both, the difference in the execution speed of the splitter is irrelevant; it's all to do with the plan built based on the different estimates.
Indeed if we now...
select number as n ,a.*
from master..spt_values
cross apply dbo.ufn_SplitIntArray('897,911,942',',') a
where type='P' and number <512
go
select number as n ,a.*
from master..spt_values
cross apply dbo.ufn_SplitIntArrayInline('897,911,942',',') a
where type='P' and number <512
Although I've only run it through Profiler and not looked at it in as much depth as Brad has, the inline version has a very slight edge (YMMV). But the raw speed of the splitter is still irrelevant here because both results are effectively executed once and the constant results are reused.
So to test the functions themselves, we need to:
create table #VarSplit
(
toSplit varchar(30)
)
go
insert into #varsplit
select convert(char(3),abs(checksum(NEWID()))%256)+','+
convert(char(3),abs(checksum(NEWID()))%256)+','+
convert(char(3),abs(checksum(NEWID()))%256)
from master..spt_values
where type='P' and number <512
go
select a.* from #VarSplit
cross apply dbo.ufn_SplitIntArray(#VarSplit.toSplit,',') a
go
select a.* from #VarSplit
cross apply dbo.ufn_SplitIntArrayInline(#VarSplit.toSplit,',') a
In this case the inline version is now using half the resources (CPU/IO) of the non-inline one.
I'm not drawing any conclusions from this, more "just saying" 🙂
September 13, 2010 at 6:33 am
I ran Jeff's CSV builder, and Peter's test code on my system. For good measure, I threw in the DelimitedSplit8K routine also (running against both a "real" and a virtual (with _v suffix) tally table).
Oh: system: Dual-core processor, 64-bit
Test code:
SET NOCOUNT ON;
print replicate('*',30);
Print 'Jeff''s function';
print replicate('*',30);
SET STATISTICS IO,TIME ON;
select
count(*)
from
CsvTest
cross apply
dbo.Split8KX(CsvParameter, ',');
SET STATISTICS IO,TIME OFF;
print replicate('*',30);
print 'Inline table valued function';
print replicate('*',30);
SET STATISTICS IO,TIME ON;
select
count(*)
from
CsvTest
cross apply
dbo.Split8KX_Inline(CsvParameter, ',');
SET STATISTICS IO,TIME OFF;
print replicate('*',30);
print 'DelimitedSplit8k table valued function';
print replicate('*',30);
SET STATISTICS IO,TIME ON;
select
count(*)
from
CsvTest
cross apply
dbo.DelimitedSplit8K(CsvParameter, ',');
SET STATISTICS IO,TIME OFF;
print replicate('*',30);
print 'DelimitedSplit8k_v table valued function';
print replicate('*',30);
SET STATISTICS IO,TIME ON;
select
count(*)
from
CsvTest
cross apply
dbo.DelimitedSplit8K_v(CsvParameter, ',');
SET STATISTICS IO,TIME OFF;
Results:
******************************
Jeff's function
******************************
Table '#481BA567'. Scan count 10000, logical reads 10000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CsvTest'. Scan count 1, logical reads 774, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 26832 ms, elapsed time = 28560 ms.
******************************
Inline table valued function
******************************
Table 'CsvTest'. Scan count 3, logical reads 849, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 4603 ms, elapsed time = 2390 ms.
******************************
DelimitedSplit8k table valued function
******************************
Table 'Tally'. Scan count 10000, logical reads 50000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CsvTest'. Scan count 3, logical reads 849, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 1278 ms, elapsed time = 675 ms.
******************************
DelimitedSplit8k_v table valued function
******************************
Table 'CsvTest'. Scan count 3, logical reads 849, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 4446 ms, elapsed time = 2440 ms.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes