December 7, 2009 at 3:24 pm
I recently had to import data from a third-party system that had (ugh) comma-separated fields in a table. I came up with a recursive CTE to do the parsing. I thought I'd post the query here and ask for some critiques and criticisms.
CREATE TABLE #tmp (id int, string varchar(1000))
INSERT INTO #tmp (id, string)
SELECT 1, 'abcd, efgh, ijkl, mnop, qrst, uvwx, yz' UNION ALL
SELECT 2, 'zyxw, vuts, rqpo, nmlk, jihg, fedc, ba' UNION ALL
SELECT 3, 'the, quick, brown, fox, jumped, over, the, lazy, dog'
;WITH test (id, lft, rght, idx)
AS
(
    -- Anchor member: peel the first item off into lft; the remainder of the string goes into rght
    SELECT t.id
          ,LEFT(t.string, CHARINDEX(', ', t.string) - 1)
          ,SUBSTRING(t.string, CHARINDEX(', ', t.string) + 2, DATALENGTH(t.string))
          ,0
    FROM #tmp t
    UNION ALL
    -- Recursive member: keep peeling items off rght until nothing is left
    SELECT c.id
          ,CASE WHEN CHARINDEX(', ', c.rght) = 0 THEN c.rght ELSE LEFT(c.rght, CHARINDEX(', ', c.rght) - 1) END
          ,CASE WHEN CHARINDEX(', ', c.rght) > 0 THEN SUBSTRING(c.rght, CHARINDEX(', ', c.rght) + 2, DATALENGTH(c.rght))
                ELSE '' END
          ,c.idx + 1
    FROM test c
    WHERE DATALENGTH(c.rght) > 0
)
SELECT * FROM test ORDER BY id, idx
DROP TABLE #tmp
December 7, 2009 at 10:36 pm
Nice, I like it better than some of the comma split functions I have seen.
What have you seen this scale out to?
For curiosity's sake, I decided to see what kind of performance I would see if I changed this from a 3-row temp table to a 10,000-row and then a 100,000-row temp table. I also changed your id to an identity column to make inserting the bigger row counts easier.
I saw the processing time jump from instantaneous to 6 seconds to 1m01s across those row-count levels.
If I filter on a specific ID after the CTE is done, the processing time drops dramatically. Moving that ID filter inside the CTE itself gives yet another significant gain.
The following is only for the CTE and the SELECT statement immediately following it (scan count 2 and physical reads 0 in every case):
Rows      Logical reads   Processing time
10        603             125 ms
1,000     52,262          437 ms
10,000    522,008         4,390 ms
100,000   7,796,635       43,406 ms
Now if I query a single ID from the 100,000-row temp table, the processing time drops to 15 ms, with scan count 2, logical reads 43, physical reads 0.
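A sketch of that kind of ID-filtered run (the #big table name, the sys.all_columns cross join used to pad out 100,000 rows, and the id value 42 are all just illustrative, not the exact harness I used):
-- Build a 100,000-row copy of the sample data with an identity id
CREATE TABLE #big (id int IDENTITY(1,1), string varchar(1000));
INSERT INTO #big (string)
SELECT TOP (100000) 'the, quick, brown, fox, jumped, over, the, lazy, dog'
FROM sys.all_columns a CROSS JOIN sys.all_columns b;

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Filtering inside the anchor member means only one row ever feeds the recursion
;WITH test (id, lft, rght, idx) AS
(
    SELECT t.id
          ,LEFT(t.string, CHARINDEX(', ', t.string) - 1)
          ,SUBSTRING(t.string, CHARINDEX(', ', t.string) + 2, DATALENGTH(t.string))
          ,0
    FROM #big t
    WHERE t.id = 42          -- push the ID filter into the CTE
    UNION ALL
    SELECT c.id
          ,CASE WHEN CHARINDEX(', ', c.rght) = 0 THEN c.rght ELSE LEFT(c.rght, CHARINDEX(', ', c.rght) - 1) END
          ,CASE WHEN CHARINDEX(', ', c.rght) > 0 THEN SUBSTRING(c.rght, CHARINDEX(', ', c.rght) + 2, DATALENGTH(c.rght))
                ELSE '' END
          ,c.idx + 1
    FROM test c
    WHERE DATALENGTH(c.rght) > 0
)
SELECT * FROM test ORDER BY id, idx;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
DROP TABLE #big;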
I think this query can be very useful - depending on implementation and scale.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
December 8, 2009 at 7:16 am
What have you seen this scale out to?
It's funny you should ask. I'm using a modified version of it in testing right now. I've run it against about 180,000 records so far. The performance has been (mostly) acceptable, but the list I'm splitting uses a more complicated delimiter, so I have to use PATINDEX('%[A-Z]_[A-Z]_****%') as my delimiter match instead of CHARINDEX. I'm about to bump the data set up to around a million (in dev).
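The change is essentially swapping CHARINDEX for PATINDEX and offsetting by the delimiter's (fixed) length. A rough sketch with a made-up letter-digit-letter delimiter rather than the real pattern:
-- Sketch only: a hypothetical fixed-length delimiter (e.g. 'A1B') matched with PATINDEX.
-- @dlen holds its fixed length so the remainder can be offset correctly.
CREATE TABLE #tmp2 (id int, string varchar(1000));
INSERT INTO #tmp2 (id, string)
SELECT 1, 'alphaA1BbravoC2Dcharlie';

DECLARE @dlen int;
SET @dlen = 3;   -- the made-up pattern below always matches 3 characters

;WITH test (id, lft, rght, idx) AS
(
    SELECT t.id
          ,LEFT(t.string, PATINDEX('%[A-Z][0-9][A-Z]%', t.string) - 1)
          ,SUBSTRING(t.string, PATINDEX('%[A-Z][0-9][A-Z]%', t.string) + @dlen, DATALENGTH(t.string))
          ,0
    FROM #tmp2 t
    UNION ALL
    SELECT c.id
          ,CASE WHEN PATINDEX('%[A-Z][0-9][A-Z]%', c.rght) = 0 THEN c.rght
                ELSE LEFT(c.rght, PATINDEX('%[A-Z][0-9][A-Z]%', c.rght) - 1) END
          ,CASE WHEN PATINDEX('%[A-Z][0-9][A-Z]%', c.rght) > 0
                THEN SUBSTRING(c.rght, PATINDEX('%[A-Z][0-9][A-Z]%', c.rght) + @dlen, DATALENGTH(c.rght))
                ELSE '' END
          ,c.idx + 1
    FROM test c
    WHERE DATALENGTH(c.rght) > 0
)
SELECT * FROM test ORDER BY id, idx;   -- returns alpha, bravo, charlie

DROP TABLE #tmp2;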
I'll update later with some more info.
December 8, 2009 at 7:41 am
First run on a set with 1.3 M rows; the result set expanded to 3.8 M rows.
I had 167,323,868 logical reads against my temp table, 0 physical.
Unfortunately, that took 772,563 ms (just under 13 minutes). On the plus side, the other options I've tried for splitting that dataset used a WHILE loop and took much longer.
UPDATE
I broke the dataset up into 100,000-row chunks, and I'm seeing performance comparable to CirqueDeSQLeil's. Running it against the whole dataset takes about 10x the logical reads of running it against 1/10th of the dataset.
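A chunk driver for this can be as simple as a WHILE loop over id ranges. Here's a sketch using the plain comma version from the first post (the #source/#results names, the throwaway sample data, and the window math are illustrative, not my actual harness):
-- Run the recursive split in 100,000-row id windows and collect the pieces,
-- instead of splitting everything in one pass. #source stands in for the real table.
CREATE TABLE #source (id int IDENTITY(1,1), string varchar(1000));
INSERT INTO #source (string)
SELECT TOP (250000) 'abcd, efgh, ijkl, mnop, qrst, uvwx, yz'
FROM sys.all_columns a CROSS JOIN sys.all_columns b;

CREATE TABLE #results (id int, item varchar(1000), idx int);

DECLARE @lo int, @hi int, @maxid int;
SELECT @lo = MIN(id), @maxid = MAX(id) FROM #source;

WHILE @lo <= @maxid
BEGIN
    SET @hi = @lo + 99999;   -- 100,000-row window

    ;WITH test (id, lft, rght, idx) AS
    (
        SELECT t.id
              ,LEFT(t.string, CHARINDEX(', ', t.string) - 1)
              ,SUBSTRING(t.string, CHARINDEX(', ', t.string) + 2, DATALENGTH(t.string))
              ,0
        FROM #source t
        WHERE t.id BETWEEN @lo AND @hi   -- only this chunk feeds the recursion
        UNION ALL
        SELECT c.id
              ,CASE WHEN CHARINDEX(', ', c.rght) = 0 THEN c.rght ELSE LEFT(c.rght, CHARINDEX(', ', c.rght) - 1) END
              ,CASE WHEN CHARINDEX(', ', c.rght) > 0 THEN SUBSTRING(c.rght, CHARINDEX(', ', c.rght) + 2, DATALENGTH(c.rght))
                    ELSE '' END
              ,c.idx + 1
        FROM test c
        WHERE DATALENGTH(c.rght) > 0
    )
    INSERT INTO #results (id, item, idx)
    SELECT id, lft, idx FROM test;

    SET @lo = @hi + 1;
END

SELECT COUNT(*) AS pieces FROM #results;
DROP TABLE #source;
DROP TABLE #results;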
December 8, 2009 at 9:39 am
That seems to be scaling about the same as my tests.
I do like the performance gains when a specific ID can be passed, though. This script could prove very handy.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
December 8, 2009 at 11:19 am
Give this one a try. You may need to modify it a bit, but if you give me some sample data I can do some of the work.
USE [SandBox]
GO
/****** Object: UserDefinedFunction [dbo].[DelimitedSplit] Script Date: 12/08/2009 11:17:39 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE function [dbo].[DelimitedSplit] (
@pString varchar(max),
@pDelimiter char(1)
)
returns table
as
return
with
a1 as (select 1 as N union all select 1 union all select 1 union all
       select 1 union all select 1 union all select 1 union all
       select 1 union all select 1 union all select 1 union all
       select 1),                                         -- 10 rows
a2 as (select 1 as N from a1 as a cross join a1 as b),    -- 100 rows
a3 as (select 1 as N from a2 as a cross join a2 as b),    -- 10,000 rows
--a4 as (select 1 as N from a3 as a cross join a2 as b),  -- 1,000,000 rows
Tally as (select top (len(@pString))
              row_number() over (order by a.N) as N       -- numbers 1 through len(@pString)
          from
              a3 as a
              cross join a2 as b),
ItemSplit(
ItemOrder,
Item
) as (
SELECT
N,
SUBSTRING(@pDelimiter + @pString + @pDelimiter,N + 1,CHARINDEX(@pDelimiter,@pDelimiter + @pString + @pDelimiter,N + 1) - N - 1)
FROM
Tally
WHERE
N < LEN(@pDelimiter + @pString + @pDelimiter)
AND SUBSTRING(@pDelimiter + @pString + @pDelimiter,N,1) = @pDelimiter --Notice how we find the delimiter
)
select
row_number() over (order by ItemOrder) as ItemID,
Item
from
ItemSplit
;
GO
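For example, against the sample #tmp table from the first post you'd call it with CROSS APPLY, something like this (note it takes a single-character delimiter, so that sample's ', ' separator leaves a leading space to trim):
-- Split the sample rows with the tally-based function
SELECT  t.id,
        ds.ItemID,
        LTRIM(ds.Item) AS Item
FROM    #tmp t
CROSS APPLY dbo.DelimitedSplit(t.string, ',') AS ds
ORDER BY t.id, ds.ItemID;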
Also, the reason yours doesn't scale well is the recursion: it's basically RBAR (row-by-agonizing-row), just like the splitters that use a WHILE loop.