June 25, 2009 at 11:20 am
I'd seen XML string splitting and concatenation before, but ran into issues with characters that aren't valid in XML. I found that just escaping all the invalid characters prior to splitting frequently took more time than the entire split operation using a tally table.
That said, if you can qualify your data enough to ensure you won't ever have those sorts of issues, then it is certainly easier to code than some other solutions.
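For illustration, a minimal sketch of the pre-escaping step in question (the X element name and the sample string are placeholders, not code from the article):

```sql
-- Sketch: the XML-reserved characters must be entity-escaped (ampersand first!)
-- before the delimited string can be cast to XML, which adds three full passes
-- over the data on top of the split itself.
DECLARE @String VARCHAR(8000)
SET @String = 'a&b,c<d,e>f'
SELECT CAST('<X>'
          + REPLACE(REPLACE(REPLACE(REPLACE(@String, '&', '&amp;'),
                                    '<', '&lt;'),
                            '>', '&gt;'),
                    ',', '</X><X>')
          + '</X>' AS XML) AS EscapedAndTagged
```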
June 25, 2009 at 12:37 pm
Senthilnathan.Karunakaran (6/25/2009)
We did performance and scaling tests to split comma-separated string values using an XML query and a SQL function. The plan was better with the XML split than with the SQL function; however, XML query performance degraded as the number of concurrent users increased. The SQL function did better in the scalability test.
Cool. Let's see the test code and the data generator.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 25, 2009 at 12:53 pm
We are converting SS2000 DTS packages to SSIS 2005.
I need help using a User Defined Function within a query on a table.
The function performs correctly.
select * from fn_Split('13.0,13.1,14.0,14.1' ,',')
idx   value
0     13.0
1     13.1
2     14.0
3     14.1
This is the table to run the function against, with a few sample record results.
Select versions from versiontable
13.0,13.1,14.0,14.1
13.0,13.1,14.0,14.1
13.0,13.1,14.0,14.1
13.0,13.1,14.0,14.1
11.0,11.1,12.0,12.1,13.0,13.1,14.0
I tried this. Now it no longer recognises the function. Is the syntax wrong?
select fn_Split(versions, ',') as SingleVersion from versiontable where versions is not null
Server: Msg 195, Level 15, State 10, Line 1
'fn_Split' is not a recognized function name.
Tried this also.
select dbo.fn_Split(versions, ',') as Version from versiontable where versions is not null
Server: Msg 208, Level 16, State 1, Line 1
Invalid object name 'dbo.fn_Split'.
Any suggestions to get the query to work or, even better, a nifty method to accomplish this within my data flow task using one of the Data Flow Transformations? Seems like the Pivot might be useful for this?
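For what it's worth, assuming fn_Split is a table-valued function created in the same database (dbo schema) and returning the idx/value columns shown above, one sketch of applying it per row on SQL Server 2005 is CROSS APPLY:

```sql
-- Sketch only: assumes dbo.fn_Split exists in the current database and
-- returns a table with idx and value columns, as in the sample output above.
SELECT vt.versions,
       s.value AS SingleVersion
  FROM versiontable vt
 CROSS APPLY dbo.fn_Split(vt.versions, ',') s
 WHERE vt.versions IS NOT NULL
```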
June 25, 2009 at 1:17 pm
jo stovall (6/25/2009)
We are converting SS2000 DTS packages to SSIS 2005. I need help using a User Defined Function within a query on a table.
The function performs correctly.
select * from fn_Split('13.0,13.1,14.0,14.1' ,',')
idx   value
0     13.0
1     13.1
2     14.0
3     14.1
This is the table to run the function against, with a few sample record results.
Select versions from versiontable
13.0,13.1,14.0,14.1
13.0,13.1,14.0,14.1
13.0,13.1,14.0,14.1
13.0,13.1,14.0,14.1
11.0,11.1,12.0,12.1,13.0,13.1,14.0
I tried this. Now it no longer recognises the function. Is the syntax wrong?
select fn_Split(versions, ',') as SingleVersion from versiontable where versions is not null
Server: Msg 195, Level 15, State 10, Line 1
'fn_Split' is not a recognized function name.
Tried this also.
select dbo.fn_Split(versions, ',') as Version from versiontable where versions is not null
Server: Msg 208, Level 16, State 1, Line 1
Invalid object name 'dbo.fn_Split'.
Any suggestions to get the query to work or, even better, a nifty method to accomplish this within my data flow task using one of the Data Flow Transformations? Seems like the Pivot might be useful for this?
What do you get when you run the following?
SELECT *
FROM sys.Objects
WHERE Name = 'fn_Split'
If the answer is nothing, then you're either in the wrong database or the CREATE FUNCTION code didn't actually work.
As a sidebar, you'd probably get a lot more "hits" on your question if you asked it in the proper forum instead of a thread dedicated to the discussion of an article.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 25, 2009 at 2:09 pm
Thanks for responding.
I reposted on the SSIS forum.
June 26, 2009 at 8:48 am
I have error:
Msg 170, Level 15, State 1, Line 12
Line 12: Incorrect syntax near '('.
Msg 170, Level 15, State 1, Line 13
Line 13: Incorrect syntax near '('.
DECLARE @t TABLE( ID INT IDENTITY, data VARCHAR(50))
INSERT INTO @t(data) SELECT 'AA,AB,AC,AD'
INSERT INTO @t(data) SELECT 'BA,BB,BC'
SELECT F1.id,
F1.data,
O.splitdata
FROM
(
SELECT *, cast(''+replace(F.data,',','')+'' as XML) as xmlfilter from @t F
)F1
CROSS
(
SELECT fdata.D.value('.','varchar(50)') as splitdata FROM f1.xmlfilter.nodes('X') as fdata(D)) as O
GT
June 26, 2009 at 9:11 am
I should probably add to my previously submitted method.
On our server a split of 131,072 is taking 160 milliseconds.
The server is a fairly standard db server spec.
Sam
select getdate()
declare @STR varchar(max)
set @STR = 'AAAA,BBBB,CCCC,DDDD'
select count(*) from dbSystem..FnSplitQuick(@str,0,',')
select getdate()
June 27, 2009 at 4:30 pm
I just ran across one of J. Sebastian's articles on how to do a string split using XML: http://www.sqlservercentral.com/articles/XML/61618/. Looks comparable...
June 28, 2009 at 7:47 pm
The article is well written and my hat is off to the author for that. I have to agree with the others, though... using XML to split delimited data is a wee bit slow compared to many other methods.
Also, I see lots of claims in the area of performance, but I don't see much in the way of test data generation code or test harnesses so other people can substantiate those claims. Like Sergiy says, "A Developer must not guess... a Developer must KNOW!" With that little pearl of wisdom still ringing in all our ears, let's do some testing...
First, if you don't have a Tally table yet, you need to build one. For this exercise, we'll write and execute all of the code, including the code for the Tally table, in a nice safe place where we likely won't mess with anyone's real tables... TempDB...
USE TempDB
--===== Create and populate the Tally table on the fly
 SELECT TOP 1000000
        IDENTITY(INT,1,1) AS N
   INTO dbo.Tally
   FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance
  ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally_N
        PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
--===== Allow the general public to use it
  GRANT SELECT ON dbo.Tally TO PUBLIC
--=============================================================================
-- Setup
--=============================================================================
--===== Do this in a nice safe place
USE TempDB
--===== Declare some obviously named variables
DECLARE @NumberOfElements INT,
        @MaxElementSize   INT,
        @NumberOfRows     INT
--===== Presets
 SELECT @NumberOfElements = 6,
        @MaxElementSize   = 6,
        @NumberOfRows     = 10000
--=============================================================================
-- Create the test table on the fly
--=============================================================================
--===== Conditionally drop the test table
     IF OBJECT_ID('TempDB..TestTable','U') IS NOT NULL
        DROP TABLE TestTable
--===== Create and populate the test table on the fly.
     -- Number of elements, max size of elements, and number of rows are all
     -- programmable based on the variables at the beginning of this script.
 SELECT t.N AS ID,
        STUFF((SELECT ',' + LEFT(NEWID(),ABS(CHECKSUM(NEWID()))%@MaxElementSize+1)
                 FROM dbo.Tally t1
                WHERE t1.N <= @NumberOfElements
                  AND t.N > 0 --External reference forces NEWID() to be recalculated
                  FOR XML PATH(''))
              ,1,1,'') AS Data
   INTO TestTable
   FROM dbo.Tally t
  WHERE t.N <= @NumberOfRows
Next, we need to build some functions that folks included in some of their responses either directly or indirectly. Not all the functions that were mentioned are included because some of them are the same...
USE TempDB
--===== Create Eric Sommarskog's inline function (identical to the code I use outside a function)
CREATE FUNCTION InLineSplit(@String VARCHAR(8000))
RETURNS TABLE AS
RETURN
 (
 SELECT SUBSTRING(',' + @String + ',', N+1, CHARINDEX(',', @String + ',', N)-N) AS SplitData
   FROM dbo.Tally t
  WHERE N <= LEN(',' + @String)
    AND SUBSTRING(',' + @String, N, 1) = ','
 )
GO
--===== Create Rafidheen's split function
CREATE FUNCTION Split_fn(@split_string varchar(max),
                         @deli_char    varchar(3)
                        )
RETURNS @list TABLE (SeqNo int, SplitString varchar(max))
AS
BEGIN
    DECLARE @from_loc int
    DECLARE @to_loc   int

    IF CHARINDEX(@deli_char, @split_string, 0) <= 0
    BEGIN
        INSERT INTO @list (seqno, SplitString) VALUES (1, @split_string)
        RETURN
    END

    IF CHARINDEX(@deli_char, @split_string, 0) > 0
    BEGIN
        SELECT @from_loc = 0
        SELECT @to_loc = CHARINDEX(@deli_char, @split_string, 0)
    END

    IF CHARINDEX(@deli_char, @split_string, 0) <= 0
    BEGIN
        SELECT @to_loc = NULL
    END

    WHILE @to_loc IS NOT NULL
    BEGIN
        IF SUBSTRING(@split_string, @from_loc, @to_loc - @from_loc) <> ''
        BEGIN
            INSERT INTO @list (seqno, SplitString)
            SELECT ISNULL(MAX(seqno), 0) + 1,
                   SUBSTRING(@split_string, @from_loc, @to_loc - @from_loc)
              FROM @list
        END

        SELECT @from_loc = CHARINDEX(@deli_char, @split_string, @from_loc + LEN(@deli_char)) + LEN(@deli_char)
        SELECT @to_loc   = CHARINDEX(@deli_char, @split_string, @from_loc)

        IF @to_loc = 0
        BEGIN
            IF SUBSTRING(@split_string, @from_loc, (LEN(@split_string) - @from_loc) + LEN(@deli_char)) <> ''
            BEGIN
                INSERT INTO @list (seqno, SplitString)
                SELECT ISNULL(MAX(seqno), 0) + 1,
                       SUBSTRING(@split_string, @from_loc, (LEN(@split_string) - @from_loc) + LEN(@deli_char))
                  FROM @list
            END
            SELECT @to_loc = NULL
        END
    END
    RETURN
END
GO
--=============================================================================
-- Setup
--=============================================================================
--===== Do this in a nice safe place
USE TempDB
--===== Declare some obviously named variables
DECLARE @NumberOfElements INT,
        @MaxElementSize   INT,
        @NumberOfRows     INT
--===== Presets
 SELECT @NumberOfElements = 6,
        @MaxElementSize   = 6,
        @NumberOfRows     = 10000
--=============================================================================
-- Create the test table on the fly
--=============================================================================
--===== Conditionally drop the test table
     IF OBJECT_ID('TempDB..TestTable','U') IS NOT NULL
        DROP TABLE TestTable
--===== Create and populate the test table on the fly.
     -- Number of elements, max size of elements, and number of rows are all
     -- programmable based on the variables at the beginning of this script.
 SELECT t.N AS ID,
        STUFF((SELECT ',' + LEFT(NEWID(),ABS(CHECKSUM(NEWID()))%@MaxElementSize+1)
                 FROM dbo.Tally t1
                WHERE t1.N <= @NumberOfElements
                  AND t.N > 0 --External reference forces NEWID() to be recalculated
                  FOR XML PATH(''))
              ,1,1,'') AS Data
   INTO TestTable
   FROM dbo.Tally t
  WHERE t.N <= @NumberOfRows
GO
--=============================================================================
-- Test the various methods from this thread including the method from
-- the article.
--=============================================================================
--========== XML from article =================================================
PRINT '========== XML from article =========================================='
SET STATISTICS TIME ON
 SELECT F1.id,
        F1.data,
        O.splitdata
   FROM (
         SELECT *,
                CAST('<X>'+REPLACE(F.data,',','</X><X>')+'</X>' AS XML) AS xmlfilter
           FROM TestTable F
        ) F1
  CROSS APPLY
        (
         SELECT fdata.D.value('.','varchar(50)') AS splitdata
           FROM F1.xmlfilter.nodes('X') AS fdata(D)
        ) O
SET STATISTICS TIME OFF
--========== rafidheenm's function ============================================
PRINT '========== rafidheenm''s function ===================================='
SET STATISTICS TIME ON
 SELECT tt.ID,
        tt.Data,
        ca.SplitString
   FROM TestTable tt
  CROSS APPLY
        (SELECT SplitString
           FROM dbo.Split_fn(tt.Data,',')
        ) ca
SET STATISTICS TIME OFF
--========== Tally Table Split ================================================
PRINT '========== Tally Table Split ========================================='
SET STATISTICS TIME ON
 SELECT tt.ID,
        tt.Data,
        SUBSTRING(',' + tt.Data + ',', N+1, CHARINDEX(',', tt.Data + ',', N)-N) AS SplitData
   FROM dbo.Tally t
  CROSS JOIN TestTable tt
  WHERE N <= LEN(',' + tt.Data)
    AND SUBSTRING(',' + tt.Data, N, 1) = ','
SET STATISTICS TIME OFF
--========== Tally Table Cross Apply ==========================================
PRINT '========== Tally Table Cross Apply ==================================='
SET STATISTICS TIME ON
 SELECT tt.ID,
        tt.Data,
        ca.SplitData
   FROM TestTable tt
  CROSS APPLY
        (
         SELECT SUBSTRING(',' + tt1.Data + ',', N+1, CHARINDEX(',', tt1.Data + ',', N)-N) AS SplitData
           FROM dbo.Tally t
          CROSS JOIN TestTable tt1
          WHERE N <= LEN(',' + tt1.Data)
            AND SUBSTRING(',' + tt1.Data, N, 1) = ','
            AND tt.ID = tt1.ID
        ) ca
SET STATISTICS TIME OFF
--========== Inline Table Function ============================================
PRINT '========== Inline Table Function ====================================='
SET STATISTICS TIME ON
 SELECT tt.ID,
        tt.Data,
        ca.SplitData
   FROM TestTable tt
  CROSS APPLY
        (SELECT SplitData
           FROM InLineSplit(tt.Data)
        ) ca
SET STATISTICS TIME OFF
Here are the results. I should point out that my computer is a seven-year-old P4 1.8GHz single-CPU box with old IDE drives and only a gig of RAM, running SQL Server 2005 Dev Edition at SP2 (haven't had the chance to install SP3 yet).
(1000 row(s) affected)
========== XML from article ==========================================
(6000 row(s) affected)
SQL Server Execution Times:
CPU time = 3000 ms, elapsed time = 3573 ms.
========== rafidheenm's function ====================================
(6000 row(s) affected)
SQL Server Execution Times:
CPU time = 3656 ms, elapsed time = 5958 ms.
========== Tally Table Split =========================================
(6000 row(s) affected)
SQL Server Execution Times:
CPU time = 422 ms, elapsed time = 749 ms.
========== Tally Table Cross Apply ===================================
(6000 row(s) affected)
SQL Server Execution Times:
CPU time = 859 ms, elapsed time = 1208 ms.
========== Inline Table Function =====================================
(6000 row(s) affected)
SQL Server Execution Times:
CPU time = 156 ms, elapsed time = 406 ms.
CLRs can be written to beat the Tally table, but only if you're careful. Florian Reischl has a lead on a great CLR that I can just about guarantee will beat everything else for performance and simplicity, including other CLRs for the same task.
Just a couple of end notes... the Cross-Apply Tally Table method works very quickly if all of the elements are exactly the same size. It seems that the optimizer can figure that out and will produce speeds nearly as fast as the Inline function method. I haven't tried the Inline function method in 2k yet mostly because I'd have to convert the Cross-Apply to a correlated subquery and simply haven't taken the time to do it.
I'm also very pro "T-SQL only" solutions. If a CLR runs at the same speed or even in half the duration of the Inline function or the straight Tally table, I can just about guarantee that a CLR will not find its way into my code. 😉
Also, the Tally Table (and several other methods) has a bit of a difficult time with "out of row" variables such as VARCHAR(MAX), especially if the distance between delimiters is large. This is because of the JOIN the Tally Table does with the data... and it doesn't like joining with "out of row" stuff. Depending on the distance between delimiters and the actual size of each row, you may have to shift to a very tight WHILE loop to get superior performance in the presence of VARCHAR(MAX). I do know that the XML method will absolutely croak not only on VARCHAR(MAX), but any time there are a large number of elements in the same row to be split. Of course, you don't have to believe me because you now have code to test with. Remember that "One experiment is worth a thousand expert opinions".
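For illustration, a rough sketch of the kind of tight WHILE loop being described (names are made up; this is a sketch, not tuned production code):

```sql
-- Sketch: walk the string delimiter-to-delimiter with CHARINDEX instead of
-- joining a Tally table against an out-of-row VARCHAR(MAX) value.
DECLARE @Data   VARCHAR(MAX),
        @Start  INT,
        @End    INT
DECLARE @Result TABLE (ItemNumber INT IDENTITY(1,1), Item VARCHAR(MAX))

SET @Data = 'aaa,bb,cccc'
SELECT @Start = 1, @End = CHARINDEX(',', @Data)

WHILE @End > 0
BEGIN
    INSERT INTO @Result (Item) VALUES (SUBSTRING(@Data, @Start, @End - @Start))
    SELECT @Start = @End + 1, @End = CHARINDEX(',', @Data, @Start)
END
-- Pick up the final element after the last delimiter
INSERT INTO @Result (Item) VALUES (SUBSTRING(@Data, @Start, LEN(@Data) - @Start + 1))

SELECT ItemNumber, Item FROM @Result
```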
If you'd like to read more about what a Tally Table is and how it replaces certain kinds of While Loops, please see the following article...
http://www.sqlservercentral.com/articles/T-SQL/62867/
--Jeff Moden
Change is inevitable... Change for the better is not.
June 29, 2009 at 2:17 am
Jeff, is there any chance you could try the partial CLR method as well?
I submitted the code in previous article.
use CLR to convert the items to being a fixed width string
There's some additional code in there to determine the required fixed width length, if the size limit is known.
ie
aa,gggggggg,dd,ccc
becomes
aa______ggggggggdd______ccc______
and then split the result with a tally table, and substring function.
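If I follow, the fixed-width half of that approach can be sketched in pure T-SQL like this (a hedged sketch, not Sam's actual code; the padding character is assumed not to occur in the data, and the CLR half that produces the padded string and its width is omitted):

```sql
-- Once every element is padded to a known fixed width, the Tally table only
-- has to visit one N per element instead of one N per character.
DECLARE @Fixed VARCHAR(8000), @Width INT
SELECT @Width = 8,
       @Fixed = 'aa______ggggggggdd______ccc_____'

SELECT (N - 1) / @Width + 1                           AS ItemNumber,
       REPLACE(SUBSTRING(@Fixed, N, @Width), '_', '') AS Item
  FROM dbo.Tally
 WHERE N <= LEN(@Fixed)
   AND N % @Width = 1   -- only the first character position of each element
```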
June 29, 2009 at 2:47 am
Ignore that last post.
I've just simply compared on my server.
Rows: 4096, based on huge repetition of AAAA,BB,CCCCCCCCC,D
Inline function
126 ms
123 ms
130 ms
Partial CLR plus tally table
3 ms
3 ms
3 ms
I used COUNT(*); what did you use for your measurements?
I'm also very pro "T-SQL only" solutions. If a CLR runs at the same speed or even in half the duration of the Inline function or the straight Tally table, I can just about guarantee that a CLR will not find its way into my code.
I personally don't mind a little CLR, where it is non specific string ops, and the performance is that much better.
Sam
June 29, 2009 at 4:59 am
There are only five characters ( <, >, ', ", & ) that need an escape character in XML.
We can use the REPLACE function to handle them.
DECLARE @test xml
DECLARE @chars nvarchar(500)
set @chars = '1&4,4,5,6'
set @chars = '<X>' + REPLACE(REPLACE(REPLACE(REPLACE(@chars,'&','&amp;'),'>','&gt;'),'<','&lt;'),',','</X><X>') + '</X>'
select @test = @chars
select c.value('.','nvarchar(10)')
from @test.nodes('X') as t(c)
June 29, 2009 at 5:39 am
sam.walker (6/29/2009)
Ignore that last post. I've just simply compared on my server.
Rows: 4096, based on huge repetition of AAAA,BB,CCCCCCCCC,D
Inline function
126 ms
123 ms
130 ms
Partial CLR plus tally table
3 ms
3 ms
3 ms
I used COUNT(*); what did you use for your measurements?
I'm also very pro "T-SQL only" solutions. If a CLR runs at the same speed or even in half the duration of the Inline function or the straight Tally table, I can just about guarantee that a CLR will not find its way into my code.
I personally don't mind a little CLR, where it is non specific string ops, and the performance is that much better.
Sam
Very cool, Sam. To answer your question, I just let the natural rowcount put out the 6000 count.
The data you said you built as a "huge repetition" may fall into the category where the "fixed field" anomaly does trick SQL Server into evaluating the split just once. That's why I built the test table so the length of each element is random. Then again, considering what the CLR does, maybe not.
So far as running a CLR, no can do... I don't have C# or VB on my box and, since I'm a bit over-cautious, won't run someone else's DLL.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 29, 2009 at 5:42 am
Chandra Sekhar Manchukonda (6/29/2009)
There are only five characters ( <, >, ', ", & ) that need an escape character in XML. We can use the REPLACE function to handle them.
DECLARE @test xml
DECLARE @chars nvarchar(500)
set @chars = '1&4,4,5,6'
set @chars = '<X>' + REPLACE(REPLACE(REPLACE(REPLACE(@chars,'&','&amp;'),'>','&gt;'),'<','&lt;'),',','</X><X>') + '</X>'
select @test = @chars
select c.value('.','nvarchar(10)')
from @test.nodes('X') as t(c)
Just curious... are you suggesting that you're still going to use XML as a splitter even though it's just been proven how slow it is and that it requires such special character replacement?
--Jeff Moden
Change is inevitable... Change for the better is not.
June 29, 2009 at 5:16 pm
I have a set of comma-delimited strings which have between 1 and hundreds of invoice records per string (with each invoice record having the same 10 data fields). Your code turns the data into rows with admirable efficiency. However, I can't work out how to transform the results into columns with one row per invoice record. Can anyone help?
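One common pattern for that, assuming the splitter returns an element number (idx, starting at 0 as in fn_Split above) and exactly 10 fields per invoice record, is a crosstab over the split results. The table and column names here are made up for illustration:

```sql
-- Hypothetical sketch: SplitResults holds the splitter's output (one row per
-- element, idx numbering elements from 0) joined back to the source row's key (ID).
SELECT s.ID,
       s.idx / 10 + 1 AS InvoiceNumber,
       MAX(CASE WHEN s.idx % 10 = 0 THEN s.value END) AS Field01,
       MAX(CASE WHEN s.idx % 10 = 1 THEN s.value END) AS Field02,
       -- ...repeat the pattern for Field03 through Field09...
       MAX(CASE WHEN s.idx % 10 = 9 THEN s.value END) AS Field10
  FROM SplitResults s
 GROUP BY s.ID, s.idx / 10
```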