December 7, 2009 at 3:24 pm
I recently had to import data from a third-party system that had (ugh) comma-separated fields in a table. I came up with a recursive CTE to do the parsing. I thought I'd post the query here and ask for some critiques and criticisms.
CREATE TABLE #tmp (id int, string varchar(1000))
INSERT INTO #tmp (id, string)
SELECT 1, 'abcd, efgh, ijkl, mnop, qrst, uvwx, yz' UNION ALL
SELECT 2, 'zyxw, vuts, rqpo, nmlk, jihg, fedc, ba' UNION ALL
SELECT 3, 'the, quick, brown, fox, jumped, over, the, lazy, dog'
;WITH test (id, lft, rght, idx)
AS
(
    -- Anchor member: peel the first item off into lft; the remainder of the string goes into rght
    SELECT t.id
          ,LEFT(t.string, CHARINDEX(', ', t.string) - 1)
          ,SUBSTRING(t.string, CHARINDEX(', ', t.string) + 2, DATALENGTH(t.string))
          ,0
    FROM #tmp t
    UNION ALL
    -- Recursive member: keep peeling items off rght until nothing is left
    SELECT c.id
          ,CASE WHEN CHARINDEX(', ', c.rght) = 0 THEN c.rght ELSE LEFT(c.rght, CHARINDEX(', ', c.rght) - 1) END
          ,CASE WHEN CHARINDEX(', ', c.rght) > 0 THEN SUBSTRING(c.rght, CHARINDEX(', ', c.rght) + 2, DATALENGTH(c.rght))
                ELSE '' END
          ,c.idx + 1
    FROM test c
    WHERE DATALENGTH(c.rght) > 0
)
SELECT * FROM test ORDER BY id, idx
DROP TABLE #tmp
December 7, 2009 at 10:36 pm
Nice, I like it better than some of the comma split functions I have seen.
What have you seen this scale out to?
For curiosity's sake, I decided to see what kind of performance I would see if I changed this from a 3-row temp table to a 10,000-row and then a 100,000-row temp table. I also changed your id to an identity column to make inserting the bigger row counts easier.
I saw the processing time jump from instantaneous to 6 seconds to 1m01s across those row-count levels.
If I filter on a specific ID after the CTE is done, the processing time drops dramatically. Moving that ID filter inside the CTE itself gives yet another significant gain.
The following is only for the CTE and the SELECT statement immediately following it (scan count 2 and physical reads 0 in every case):
Rows      Logical reads   Processing time
10        603             125 ms
1,000     52,262          437 ms
10,000    522,008         4,390 ms
100,000   7,796,635       43,406 ms
Now if I query a single ID from the 100,000-row temp table, the processing time drops to 15 ms, with scan count 2, logical reads 43, physical reads 0.
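A sketch of that kind of ID-filtered run (the #big table name, the sys.all_columns cross join used to pad out 100,000 rows, and the id value 42 are all just illustrative, not the exact harness I used):
-- Build a 100,000-row copy of the sample data with an identity id
CREATE TABLE #big (id int IDENTITY(1,1), string varchar(1000));
INSERT INTO #big (string)
SELECT TOP (100000) 'the, quick, brown, fox, jumped, over, the, lazy, dog'
FROM sys.all_columns a CROSS JOIN sys.all_columns b;

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Filtering inside the anchor member means only one row ever feeds the recursion
;WITH test (id, lft, rght, idx) AS
(
    SELECT t.id
          ,LEFT(t.string, CHARINDEX(', ', t.string) - 1)
          ,SUBSTRING(t.string, CHARINDEX(', ', t.string) + 2, DATALENGTH(t.string))
          ,0
    FROM #big t
    WHERE t.id = 42          -- push the ID filter into the CTE
    UNION ALL
    SELECT c.id
          ,CASE WHEN CHARINDEX(', ', c.rght) = 0 THEN c.rght ELSE LEFT(c.rght, CHARINDEX(', ', c.rght) - 1) END
          ,CASE WHEN CHARINDEX(', ', c.rght) > 0 THEN SUBSTRING(c.rght, CHARINDEX(', ', c.rght) + 2, DATALENGTH(c.rght))
                ELSE '' END
          ,c.idx + 1
    FROM test c
    WHERE DATALENGTH(c.rght) > 0
)
SELECT * FROM test ORDER BY id, idx;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
DROP TABLE #big;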
I think this query can be very useful - depending on implementation and scale.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
December 8, 2009 at 7:16 am
What have you seen this scale out to?
It's funny you should ask. I'm using a modified version of it in testing right now. I've run it against about 180,000 records so far. The performance has been (mostly) acceptable, but the list I'm splitting uses a more complicated delimiter, so I have to use PATINDEX('%[A-Z]_[A-Z]_****%') as my delimiter match instead of CHARINDEX. I'm about to bump the data set up to around a million (in dev).
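The change is essentially swapping CHARINDEX for PATINDEX and offsetting by the delimiter's (fixed) length. A rough sketch with a made-up letter-digit-letter delimiter rather than the real pattern:
-- Sketch only: a hypothetical fixed-length delimiter (e.g. 'A1B') matched with PATINDEX.
-- @dlen holds its fixed length so the remainder can be offset correctly.
CREATE TABLE #tmp2 (id int, string varchar(1000));
INSERT INTO #tmp2 (id, string)
SELECT 1, 'alphaA1BbravoC2Dcharlie';

DECLARE @dlen int;
SET @dlen = 3;   -- the made-up pattern below always matches 3 characters

;WITH test (id, lft, rght, idx) AS
(
    SELECT t.id
          ,LEFT(t.string, PATINDEX('%[A-Z][0-9][A-Z]%', t.string) - 1)
          ,SUBSTRING(t.string, PATINDEX('%[A-Z][0-9][A-Z]%', t.string) + @dlen, DATALENGTH(t.string))
          ,0
    FROM #tmp2 t
    UNION ALL
    SELECT c.id
          ,CASE WHEN PATINDEX('%[A-Z][0-9][A-Z]%', c.rght) = 0 THEN c.rght
                ELSE LEFT(c.rght, PATINDEX('%[A-Z][0-9][A-Z]%', c.rght) - 1) END
          ,CASE WHEN PATINDEX('%[A-Z][0-9][A-Z]%', c.rght) > 0
                THEN SUBSTRING(c.rght, PATINDEX('%[A-Z][0-9][A-Z]%', c.rght) + @dlen, DATALENGTH(c.rght))
                ELSE '' END
          ,c.idx + 1
    FROM test c
    WHERE DATALENGTH(c.rght) > 0
)
SELECT * FROM test ORDER BY id, idx;   -- returns alpha, bravo, charlie

DROP TABLE #tmp2;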
I'll update later with some more info.
December 8, 2009 at 7:41 am
First run on a set with 1.3 M rows; the result set expanded to 3.8 M rows.
I had 167,323,868 logical reads against my temp table, 0 physical.
Unfortunately, that took 772,563 ms (just under 13 minutes). On the plus side, the other options I've tried for splitting that dataset used a WHILE loop and took much longer.
UPDATE
I broke the dataset up into 100,000-row chunks, and I'm seeing performance comparable to CirqueDeSQLeil's. Running it against the whole dataset takes about 10x the logical reads of running it against 1/10th of the dataset.
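A chunk driver for this can be as simple as a WHILE loop over id ranges. Here's a sketch using the plain comma version from the first post (the #source/#results names, the throwaway sample data, and the window math are illustrative, not my actual harness):
-- Run the recursive split in 100,000-row id windows and collect the pieces,
-- instead of splitting everything in one pass. #source stands in for the real table.
CREATE TABLE #source (id int IDENTITY(1,1), string varchar(1000));
INSERT INTO #source (string)
SELECT TOP (250000) 'abcd, efgh, ijkl, mnop, qrst, uvwx, yz'
FROM sys.all_columns a CROSS JOIN sys.all_columns b;

CREATE TABLE #results (id int, item varchar(1000), idx int);

DECLARE @lo int, @hi int, @maxid int;
SELECT @lo = MIN(id), @maxid = MAX(id) FROM #source;

WHILE @lo <= @maxid
BEGIN
    SET @hi = @lo + 99999;   -- 100,000-row window

    ;WITH test (id, lft, rght, idx) AS
    (
        SELECT t.id
              ,LEFT(t.string, CHARINDEX(', ', t.string) - 1)
              ,SUBSTRING(t.string, CHARINDEX(', ', t.string) + 2, DATALENGTH(t.string))
              ,0
        FROM #source t
        WHERE t.id BETWEEN @lo AND @hi   -- only this chunk feeds the recursion
        UNION ALL
        SELECT c.id
              ,CASE WHEN CHARINDEX(', ', c.rght) = 0 THEN c.rght ELSE LEFT(c.rght, CHARINDEX(', ', c.rght) - 1) END
              ,CASE WHEN CHARINDEX(', ', c.rght) > 0 THEN SUBSTRING(c.rght, CHARINDEX(', ', c.rght) + 2, DATALENGTH(c.rght))
                    ELSE '' END
              ,c.idx + 1
        FROM test c
        WHERE DATALENGTH(c.rght) > 0
    )
    INSERT INTO #results (id, item, idx)
    SELECT id, lft, idx FROM test;

    SET @lo = @hi + 1;
END

SELECT COUNT(*) AS pieces FROM #results;
DROP TABLE #source;
DROP TABLE #results;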
December 8, 2009 at 9:39 am
That seems to be scaling about the same as my tests.
I do like the performance gains when a specific ID can be passed, though. This script could prove very handy.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
December 8, 2009 at 11:19 am
Give this one a try. You may need to modify it a bit, but if you give me some sample data I can do some of the work.
USE [SandBox]
GO
/****** Object: UserDefinedFunction [dbo].[DelimitedSplit] Script Date: 12/08/2009 11:17:39 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE function [dbo].[DelimitedSplit] (
@pString varchar(max),
@pDelimiter char(1)
)
returns table
as
return
with
a1 as (select 1 as N union all select 1 union all select 1 union all
       select 1 union all select 1 union all select 1 union all
       select 1 union all select 1 union all select 1 union all
       select 1),                                         -- 10 rows
a2 as (select 1 as N from a1 as a cross join a1 as b),    -- 100 rows
a3 as (select 1 as N from a2 as a cross join a2 as b),    -- 10,000 rows
--a4 as (select 1 as N from a3 as a cross join a2 as b),  -- 1,000,000 rows
Tally as (select top (len(@pString))
              row_number() over (order by a.N) as N       -- numbers 1 through len(@pString)
          from
              a3 as a
              cross join a2 as b),
ItemSplit(
ItemOrder,
Item
) as (
SELECT
N,
SUBSTRING(@pDelimiter + @pString + @pDelimiter,N + 1,CHARINDEX(@pDelimiter,@pDelimiter + @pString + @pDelimiter,N + 1) - N - 1)
FROM
Tally
WHERE
N < LEN(@pDelimiter + @pString + @pDelimiter)
AND SUBSTRING(@pDelimiter + @pString + @pDelimiter,N,1) = @pDelimiter --Notice how we find the delimiter
)
select
row_number() over (order by ItemOrder) as ItemID,
Item
from
ItemSplit
;
GO
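For example, against the sample #tmp table from the first post you'd call it with CROSS APPLY, something like this (note it takes a single-character delimiter, so that sample's ', ' separator leaves a leading space to trim):
-- Split the sample rows with the tally-based function
SELECT  t.id,
        ds.ItemID,
        LTRIM(ds.Item) AS Item
FROM    #tmp t
CROSS APPLY dbo.DelimitedSplit(t.string, ',') AS ds
ORDER BY t.id, ds.ItemID;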
Also, the reason yours doesn't scale well is the recursion: it's basically RBAR (row-by-agonizing-row), just like the splitters that use a WHILE loop.