June 25, 2009 at 11:20 am
I'd seen XML string splitting and concatenation before, but ran into issues with characters that aren't valid in XML. I found that just escaping all the invalid characters prior to splitting frequently took more time than the entire split operation using a tally table.
That said, if you can qualify your data enough to ensure you won't ever have those sorts of issues, then it is certainly easier to code than some other solutions.
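For illustration, a minimal sketch of the pre-escaping step in question (the X element name and the sample string are placeholders, not code from the article):

```sql
-- Sketch: the XML-reserved characters must be entity-escaped (ampersand first!)
-- before the delimited string can be cast to XML, which adds three full passes
-- over the data on top of the split itself.
DECLARE @String VARCHAR(8000)
SET @String = 'a&b,c<d,e>f'
SELECT CAST('<X>'
          + REPLACE(REPLACE(REPLACE(REPLACE(@String, '&', '&amp;'),
                                    '<', '&lt;'),
                            '>', '&gt;'),
                    ',', '</X><X>')
          + '</X>' AS XML) AS EscapedAndTagged
```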
June 25, 2009 at 12:37 pm
Senthilnathan.Karunakaran (6/25/2009)
We did performance and scaling tests to split comma-separated string values using an XML query and a SQL function. The plan was better with the XML split than with the SQL function; however, XML query performance degraded as the number of concurrent users increased. The SQL function did better in the scalability test.
Cool. Let's see the test code and the data generator.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 25, 2009 at 12:53 pm
We are converting SS2000 DTS packages to SSIS 2005.
I need help using a User Defined Function within a query on a table.
The function performs correctly.
select * from fn_Split('13.0,13.1,14.0,14.1' ,',')
idx   value
0     13.0
1     13.1
2     14.0
3     14.1
This is the table to run the function against, with a few sample record results.
Select versions from versiontable
13.0,13.1,14.0,14.1
13.0,13.1,14.0,14.1
13.0,13.1,14.0,14.1
13.0,13.1,14.0,14.1
11.0,11.1,12.0,12.1,13.0,13.1,14.0
I tried this. Now it no longer recognises the function. Is the syntax wrong?
select fn_Split(versions, ',') as SingleVersion from versiontable where versions is not null
Server: Msg 195, Level 15, State 10, Line 1
'fn_Split' is not a recognized function name.
Tried this also.
select dbo.fn_Split(versions, ',') as Version from versiontable where versions is not null
Server: Msg 208, Level 16, State 1, Line 1
Invalid object name 'dbo.fn_Split'.
Any suggestions to get the query to work or, even better, a nifty method to accomplish this within my data flow task using one of the Data Flow Transformations? Seems like the Pivot might be useful for this?
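For what it's worth, assuming fn_Split is a table-valued function created in the same database (dbo schema) and returning the idx/value columns shown above, one sketch of applying it per row on SQL Server 2005 is CROSS APPLY:

```sql
-- Sketch only: assumes dbo.fn_Split exists in the current database and
-- returns a table with idx and value columns, as in the sample output above.
SELECT vt.versions,
       s.value AS SingleVersion
  FROM versiontable vt
 CROSS APPLY dbo.fn_Split(vt.versions, ',') s
 WHERE vt.versions IS NOT NULL
```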
June 25, 2009 at 1:17 pm
jo stovall (6/25/2009)
We are converting SS2000 DTS packages to SSIS 2005. I need help using a User Defined Function within a query on a table.
The function performs correctly.
select * from fn_Split('13.0,13.1,14.0,14.1' ,',')
idx   value
0     13.0
1     13.1
2     14.0
3     14.1
This is the table to run the function against, with a few sample record results.
Select versions from versiontable
13.0,13.1,14.0,14.1
13.0,13.1,14.0,14.1
13.0,13.1,14.0,14.1
13.0,13.1,14.0,14.1
11.0,11.1,12.0,12.1,13.0,13.1,14.0
I tried this. Now it no longer recognises the function. Is the syntax wrong?
select fn_Split(versions, ',') as SingleVersion from versiontable where versions is not null
Server: Msg 195, Level 15, State 10, Line 1
'fn_Split' is not a recognized function name.
Tried this also.
select dbo.fn_Split(versions, ',') as Version from versiontable where versions is not null
Server: Msg 208, Level 16, State 1, Line 1
Invalid object name 'dbo.fn_Split'.
Any suggestions to get the query to work or, even better, a nifty method to accomplish this within my data flow task using one of the Data Flow Transformations? Seems like the Pivot might be useful for this?
What do you get when you run the following?
SELECT *
FROM sys.Objects
WHERE Name = 'fn_Split'
If the answer is nothing, then you're either in the wrong database or the CREATE FUNCTION code didn't actually work.
As a sidebar, you'd probably get a lot more "hits" on your question if you asked it in the proper forum instead of a thread dedicated to the discussion of an article.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 25, 2009 at 2:09 pm
Thanks for responding.
I reposted on the SSIS forum.
June 26, 2009 at 8:48 am
I have error:
Msg 170, Level 15, State 1, Line 12
Line 12: Incorrect syntax near '('.
Msg 170, Level 15, State 1, Line 13
Line 13: Incorrect syntax near '('.
DECLARE @t TABLE( ID INT IDENTITY, data VARCHAR(50))
INSERT INTO @t(data) SELECT 'AA,AB,AC,AD'
INSERT INTO @t(data) SELECT 'BA,BB,BC'
SELECT F1.id,
F1.data,
O.splitdata
FROM
(
SELECT *, cast(''+replace(F.data,',','')+'' as XML) as xmlfilter from @t F
)F1
CROSS
(
SELECT fdata.D.value('.','varchar(50)') as splitdata FROM f1.xmlfilter.nodes('X') as fdata(D)) as O
GT
June 26, 2009 at 9:11 am
I should probably add to my previously submitted method.
On our server a split of 131,072 is taking 160 milliseconds.
The server is a fairly standard db server spec.
Sam
select getdate()
declare @STR varchar(max)
set @STR = 'AAAA,BBBB,CCCC,DDDD'
select count(*) from dbSystem..FnSplitQuick(@str,0,',')
select getdate()
June 27, 2009 at 4:30 pm
I just ran across one of J. Sebastian's articles on how to do a string split using XML: http://www.sqlservercentral.com/articles/XML/61618/. Looks comparable...
June 28, 2009 at 7:47 pm
The article is well written and my hat is off to the author for that. I have to agree with the others, though... using XML to split delimited data is a wee bit slow compared to many other methods.
Also, I see lots of claims in the area of performance, but I don't see much in the way of test data generation code or test harnesses so other people can substantiate those claims. Like Sergiy says, "A Developer must not guess... a Developer must KNOW!" With that little pearl of wisdom still ringing in all our ears, let's do some testing...
First, if you don't have a Tally table yet, you need to build one. For this exercise, we'll write and execute all of the code, including the code for the Tally table, in a nice safe place where we likely won't mess with anyone's real tables... TempDB...
USE TempDB
--===== Create and populate the Tally table on the fly
 SELECT TOP 1000000
        IDENTITY(INT,1,1) AS N
   INTO dbo.Tally
   FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance
  ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally_N
        PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
--===== Allow the general public to use it
  GRANT SELECT ON dbo.Tally TO PUBLIC
--=============================================================================
-- Setup
--=============================================================================
--===== Do this in a nice safe place
USE TempDB
--===== Declare some obviously named variables
DECLARE @NumberOfElements INT,
        @MaxElementSize   INT,
        @NumberOfRows     INT
--===== Presets
 SELECT @NumberOfElements = 6,
        @MaxElementSize   = 6,
        @NumberOfRows     = 10000
--=============================================================================
-- Create the test table on the fly
--=============================================================================
--===== Conditionally drop the test table
     IF OBJECT_ID('TempDB..TestTable','U') IS NOT NULL
        DROP TABLE TestTable
--===== Create and populate the test table on the fly.
     -- Number of elements, max size of elements, and number of rows are all
     -- programmable based on the variables at the beginning of this script.
 SELECT t.N AS ID,
        STUFF((SELECT ',' + LEFT(NEWID(),ABS(CHECKSUM(NEWID()))%@MaxElementSize+1)
                 FROM dbo.Tally t1
                WHERE t1.N <= @NumberOfElements
                  AND t.N > 0 --External reference forces NEWID() to be recalculated
                  FOR XML PATH(''))
              ,1,1,'') AS Data
   INTO TestTable
   FROM dbo.Tally t
  WHERE t.N <= @NumberOfRows
Next, we need to build some functions that folks included in some of their responses either directly or indirectly. Not all the functions that were mentioned are included because some of them are the same...
USE TempDB
--===== Create Eric Sommarskog's inline function (identical to the code I use outside a function)
CREATE FUNCTION InLineSplit(@String VARCHAR(8000))
RETURNS TABLE AS
RETURN
 (
 SELECT SUBSTRING(',' + @String + ',', N+1, CHARINDEX(',', @String + ',', N)-N) AS SplitData
   FROM dbo.Tally t
  WHERE N <= LEN(',' + @String)
    AND SUBSTRING(',' + @String, N, 1) = ','
 )
GO
--===== Create Rafidheen's split function
CREATE FUNCTION Split_fn(@split_string varchar(max),
                         @deli_char    varchar(3)
                        )
RETURNS @list TABLE (SeqNo int, SplitString varchar(max))
AS
BEGIN
    DECLARE @from_loc int
    DECLARE @to_loc   int

    IF CHARINDEX(@deli_char, @split_string, 0) <= 0
    BEGIN
        INSERT INTO @list (seqno, SplitString) VALUES (1, @split_string)
        RETURN
    END

    IF CHARINDEX(@deli_char, @split_string, 0) > 0
    BEGIN
        SELECT @from_loc = 0
        SELECT @to_loc = CHARINDEX(@deli_char, @split_string, 0)
    END

    IF CHARINDEX(@deli_char, @split_string, 0) <= 0
    BEGIN
        SELECT @to_loc = NULL
    END

    WHILE @to_loc IS NOT NULL
    BEGIN
        IF SUBSTRING(@split_string, @from_loc, @to_loc - @from_loc) <> ''
        BEGIN
            INSERT INTO @list (seqno, SplitString)
            SELECT ISNULL(MAX(seqno), 0) + 1,
                   SUBSTRING(@split_string, @from_loc, @to_loc - @from_loc)
              FROM @list
        END

        SELECT @from_loc = CHARINDEX(@deli_char, @split_string, @from_loc + LEN(@deli_char)) + LEN(@deli_char)
        SELECT @to_loc   = CHARINDEX(@deli_char, @split_string, @from_loc)

        IF @to_loc = 0
        BEGIN
            IF SUBSTRING(@split_string, @from_loc, (LEN(@split_string) - @from_loc) + LEN(@deli_char)) <> ''
            BEGIN
                INSERT INTO @list (seqno, SplitString)
                SELECT ISNULL(MAX(seqno), 0) + 1,
                       SUBSTRING(@split_string, @from_loc, (LEN(@split_string) - @from_loc) + LEN(@deli_char))
                  FROM @list
            END
            SELECT @to_loc = NULL
        END
    END
    RETURN
END
GO
--=============================================================================
-- Setup
--=============================================================================
--===== Do this in a nice safe place
USE TempDB
--===== Declare some obviously named variables
DECLARE @NumberOfElements INT,
        @MaxElementSize   INT,
        @NumberOfRows     INT
--===== Presets
 SELECT @NumberOfElements = 6,
        @MaxElementSize   = 6,
        @NumberOfRows     = 10000
--=============================================================================
-- Create the test table on the fly
--=============================================================================
--===== Conditionally drop the test table
     IF OBJECT_ID('TempDB..TestTable','U') IS NOT NULL
        DROP TABLE TestTable
--===== Create and populate the test table on the fly.
     -- Number of elements, max size of elements, and number of rows are all
     -- programmable based on the variables at the beginning of this script.
 SELECT t.N AS ID,
        STUFF((SELECT ',' + LEFT(NEWID(),ABS(CHECKSUM(NEWID()))%@MaxElementSize+1)
                 FROM dbo.Tally t1
                WHERE t1.N <= @NumberOfElements
                  AND t.N > 0 --External reference forces NEWID() to be recalculated
                  FOR XML PATH(''))
              ,1,1,'') AS Data
   INTO TestTable
   FROM dbo.Tally t
  WHERE t.N <= @NumberOfRows
GO
--=============================================================================
-- Test the various methods from this thread including the method from
-- the article.
--=============================================================================
--========== XML from article =================================================
PRINT '========== XML from article =========================================='
SET STATISTICS TIME ON
 SELECT F1.id,
        F1.data,
        O.splitdata
   FROM (
         SELECT *,
                CAST('<X>'+REPLACE(F.data,',','</X><X>')+'</X>' AS XML) AS xmlfilter
           FROM TestTable F
        ) F1
  CROSS APPLY
        (
         SELECT fdata.D.value('.','varchar(50)') AS splitdata
           FROM F1.xmlfilter.nodes('X') AS fdata(D)
        ) O
SET STATISTICS TIME OFF
--========== rafidheenm's function ============================================
PRINT '========== rafidheenm''s function ===================================='
SET STATISTICS TIME ON
 SELECT tt.ID,
        tt.Data,
        ca.SplitString
   FROM TestTable tt
  CROSS APPLY
        (SELECT SplitString
           FROM dbo.Split_fn(tt.Data,',')
        ) ca
SET STATISTICS TIME OFF
--========== Tally Table Split ================================================
PRINT '========== Tally Table Split ========================================='
SET STATISTICS TIME ON
 SELECT tt.ID,
        tt.Data,
        SUBSTRING(',' + tt.Data + ',', N+1, CHARINDEX(',', tt.Data + ',', N)-N) AS SplitData
   FROM dbo.Tally t
  CROSS JOIN TestTable tt
  WHERE N <= LEN(',' + tt.Data)
    AND SUBSTRING(',' + tt.Data, N, 1) = ','
SET STATISTICS TIME OFF
--========== Tally Table Cross Apply ==========================================
PRINT '========== Tally Table Cross Apply ==================================='
SET STATISTICS TIME ON
 SELECT tt.ID,
        tt.Data,
        ca.SplitData
   FROM TestTable tt
  CROSS APPLY
        (
         SELECT SUBSTRING(',' + tt1.Data + ',', N+1, CHARINDEX(',', tt1.Data + ',', N)-N) AS SplitData
           FROM dbo.Tally t
          CROSS JOIN TestTable tt1
          WHERE N <= LEN(',' + tt1.Data)
            AND SUBSTRING(',' + tt1.Data, N, 1) = ','
            AND tt.ID = tt1.ID
        ) ca
SET STATISTICS TIME OFF
--========== Inline Table Function ============================================
PRINT '========== Inline Table Function ====================================='
SET STATISTICS TIME ON
 SELECT tt.ID,
        tt.Data,
        ca.SplitData
   FROM TestTable tt
  CROSS APPLY
        (SELECT SplitData
           FROM InLineSplit(tt.Data)
        ) ca
SET STATISTICS TIME OFF
Here are the results. I should point out that my computer is a seven-year-old P4 1.8GHz single-CPU box with old IDE drives and only a gig of RAM, running SQL Server 2005 Dev Edition at SP2 (haven't had the chance to install SP3 yet).
(1000 row(s) affected)
========== XML from article ==========================================
(6000 row(s) affected)
SQL Server Execution Times:
CPU time = 3000 ms, elapsed time = 3573 ms.
========== rafidheenm's function ====================================
(6000 row(s) affected)
SQL Server Execution Times:
CPU time = 3656 ms, elapsed time = 5958 ms.
========== Tally Table Split =========================================
(6000 row(s) affected)
SQL Server Execution Times:
CPU time = 422 ms, elapsed time = 749 ms.
========== Tally Table Cross Apply ===================================
(6000 row(s) affected)
SQL Server Execution Times:
CPU time = 859 ms, elapsed time = 1208 ms.
========== Inline Table Function =====================================
(6000 row(s) affected)
SQL Server Execution Times:
CPU time = 156 ms, elapsed time = 406 ms.
CLRs can be written to beat the Tally table, but only if you're careful. Florian Reischl has a lead on a great CLR that I can just about guarantee will beat everything else for performance and simplicity, including other CLRs for the same task.
Just a couple of end notes... the Cross-Apply Tally Table method works very quickly if all of the elements are exactly the same size. It seems that the optimizer can figure that out and will produce speeds nearly as fast as the Inline function method. I haven't tried the Inline function method in 2k yet mostly because I'd have to convert the Cross-Apply to a correlated subquery and simply haven't taken the time to do it.
I'm also very pro "T-SQL only" solutions. If a CLR runs at the same speed or even in half the duration of the Inline function or the straight Tally table, I can just about guarantee that a CLR will not find its way into my code. 😉
Also, the Tally Table (and several other methods) has a bit of a difficult time with "out of row" variables such as VARCHAR(MAX), especially if the distance between delimiters is large. This is because of the JOIN the Tally Table does with the data... and it doesn't like joining with "out of row" stuff. Depending on the distance between delimiters and the actual size of each row, you may have to shift to a very tight WHILE loop to get superior performance in the presence of VARCHAR(MAX). I do know that the XML method will absolutely croak not only on VARCHAR(MAX), but any time there are a large number of elements in the same row to be split. Of course, you don't have to believe me because you now have code to test with. Remember that "One experiment is worth a thousand expert opinions".
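For illustration, a rough sketch of the kind of tight WHILE loop being described (names are made up; this is a sketch, not tuned production code):

```sql
-- Sketch: walk the string delimiter-to-delimiter with CHARINDEX instead of
-- joining a Tally table against an out-of-row VARCHAR(MAX) value.
DECLARE @Data   VARCHAR(MAX),
        @Start  INT,
        @End    INT
DECLARE @Result TABLE (ItemNumber INT IDENTITY(1,1), Item VARCHAR(MAX))

SET @Data = 'aaa,bb,cccc'
SELECT @Start = 1, @End = CHARINDEX(',', @Data)

WHILE @End > 0
BEGIN
    INSERT INTO @Result (Item) VALUES (SUBSTRING(@Data, @Start, @End - @Start))
    SELECT @Start = @End + 1, @End = CHARINDEX(',', @Data, @Start)
END
-- Pick up the final element after the last delimiter
INSERT INTO @Result (Item) VALUES (SUBSTRING(@Data, @Start, LEN(@Data) - @Start + 1))

SELECT ItemNumber, Item FROM @Result
```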
If you'd like to read more about what a Tally Table is and how it replaces certain kinds of While Loops, please see the following article...
http://www.sqlservercentral.com/articles/T-SQL/62867/
--Jeff Moden
Change is inevitable... Change for the better is not.
June 29, 2009 at 2:17 am
Jeff, is there any chance you could try the partial CLR method as well?
I submitted the code in previous article.
use CLR to convert the items to being a fixed width string
There's some additional code in there to determine the required fixed width length, if the size limit is known.
ie
aa,gggggggg,dd,ccc
becomes
aa______ggggggggdd______ccc______
and then split the result with a tally table, and substring function.
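If I follow, the fixed-width half of that approach can be sketched in pure T-SQL like this (a hedged sketch, not Sam's actual code; the padding character is assumed not to occur in the data, and the CLR half that produces the padded string and its width is omitted):

```sql
-- Once every element is padded to a known fixed width, the Tally table only
-- has to visit one N per element instead of one N per character.
DECLARE @Fixed VARCHAR(8000), @Width INT
SELECT @Width = 8,
       @Fixed = 'aa______ggggggggdd______ccc_____'

SELECT (N - 1) / @Width + 1                           AS ItemNumber,
       REPLACE(SUBSTRING(@Fixed, N, @Width), '_', '') AS Item
  FROM dbo.Tally
 WHERE N <= LEN(@Fixed)
   AND N % @Width = 1   -- only the first character position of each element
```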
June 29, 2009 at 2:47 am
Ignore that last post.
I've just simply compared on my server.
Rows: 4096, based on huge repetition of AAAA,BB,CCCCCCCCC,D
Inline function
126 ms
123 ms
130 ms
Partial CLR plus tally table
3 ms
3 ms
3 ms
I used COUNT(*); what did you use for your measurements?
I'm also very pro "T-SQL only" solutions. If a CLR runs at the same speed or even in half the duration of the Inline function or the straight Tally table, I can just about guarantee that a CLR will not find its way into my code.
I personally don't mind a little CLR, where it is non specific string ops, and the performance is that much better.
Sam
June 29, 2009 at 4:59 am
There are only five characters ( <, >, ', ", & ) that need an escape character in XML.
We can use the REPLACE function to handle them.
DECLARE @test xml
DECLARE @chars nvarchar(500)
set @chars = '1&4,4,5,6'
set @chars = '<X>' + REPLACE(REPLACE(REPLACE(REPLACE(@chars,'&','&amp;'),'>','&gt;'),'<','&lt;'),',','</X><X>') + '</X>'
select @test = @chars
select c.value('.','nvarchar(10)')
from @test.nodes('X') as t(c)
June 29, 2009 at 5:39 am
sam.walker (6/29/2009)
Ignore that last post. I've just simply compared on my server.
Rows: 4096, based on huge repetition of AAAA,BB,CCCCCCCCC,D
Inline function
126 ms
123 ms
130 ms
Partial CLR plus tally table
3 ms
3 ms
3 ms
I used COUNT(*); what did you use for your measurements?
I'm also very pro "T-SQL only" solutions. If a CLR runs at the same speed or even in half the duration of the Inline function or the straight Tally table, I can just about guarantee that a CLR will not find its way into my code.
I personally don't mind a little CLR, where it is non specific string ops, and the performance is that much better.
Sam
Very cool, Sam. To answer your question, I just let the natural rowcount put out the 6000 count.
The data you said you built as a "huge repetition" may fall into the category where the "fixed field" anomaly does trick SQL Server into evaluating the split just once. That's why I built the test table so the length of each element is random. Then again, considering what the CLR does, maybe not.
So far as running a CLR, no can do... I don't have C# or VB on my box and, since I'm a bit over-cautious, won't run someone else's DLL.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 29, 2009 at 5:42 am
Chandra Sekhar Manchukonda (6/29/2009)
There are only five characters ( <, >, ', ", & ) that need an escape character in XML. We can use the REPLACE function to handle them.
DECLARE @test xml
DECLARE @chars nvarchar(500)
set @chars = '1&4,4,5,6'
set @chars = '<X>' + REPLACE(REPLACE(REPLACE(REPLACE(@chars,'&','&amp;'),'>','&gt;'),'<','&lt;'),',','</X><X>') + '</X>'
select @test = @chars
select c.value('.','nvarchar(10)')
from @test.nodes('X') as t(c)
Just curious... are you suggesting that you're still going to use XML as a splitter even though it's just been proven how slow it is and that it requires such special character replacement?
--Jeff Moden
Change is inevitable... Change for the better is not.
June 29, 2009 at 5:16 pm
I have a set of comma-delimited strings which have between 1 and hundreds of invoice records per string (with each invoice record having the same 10 data fields). Your code turns the data into rows with admirable efficiency. However, I can't work out how to transform the results into columns with one row per invoice record. Can anyone help?
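One common pattern for that, assuming the splitter returns an element number (idx, starting at 0 as in fn_Split above) and exactly 10 fields per invoice record, is a crosstab over the split results. The table and column names here are made up for illustration:

```sql
-- Hypothetical sketch: SplitResults holds the splitter's output (one row per
-- element, idx numbering elements from 0) joined back to the source row's key (ID).
SELECT s.ID,
       s.idx / 10 + 1 AS InvoiceNumber,
       MAX(CASE WHEN s.idx % 10 = 0 THEN s.value END) AS Field01,
       MAX(CASE WHEN s.idx % 10 = 1 THEN s.value END) AS Field02,
       -- ...repeat the pattern for Field03 through Field09...
       MAX(CASE WHEN s.idx % 10 = 9 THEN s.value END) AS Field10
  FROM SplitResults s
 GROUP BY s.ID, s.idx / 10
```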