May 25, 2011 at 9:01 am
Thanks for the great article and the code that goes with it.
Is there a reason why you chose a VARCHAR(8000) instead of a VARCHAR(MAX)? I modified the provided function to take a VARCHAR(MAX) parameter and it seems to work. To account for the increase in size, my modified version uses an existing Tally table rather than the CTE, even though the tested CTE version outperforms the table-based version. That seemed simpler than adding more layers to the compounding CTEs, and my benchmarking was based on generating 10K rows rather than 100K or 1,000K rows.
Do you see any reason why a VARCHAR(MAX) would not work? I'm still testing it, but your thoughts would be appreciated.
May 25, 2011 at 10:02 am
fahey.jonathan (5/25/2011)
Is there a reason why you chose a VARCHAR(8000) instead of a VARCHAR(MAX)?
Performance. It goes wacko as soon as you start sending in > 8000 characters to the function. (However, in my limited testing, the varchar(max) seems to work fine with <= 8000 characters being sent to it.)
A DelimitedSplitMax version is in the works, but it's not ready for prime time yet.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
May 25, 2011 at 10:24 am
fahey.jonathan (5/25/2011)
Thanks for the great article and the code that goes with it. Is there a reason why you chose a VARCHAR(8000) instead of a VARCHAR(MAX)? I modified the provided function to take a VARCHAR(MAX) parameter and it seems to work. To account for the increase in size, my modified version uses an existing Tally table rather than the CTE, even though the tested CTE version outperforms the table-based version. That seemed simpler than adding more layers to the compounding CTEs, and my benchmarking was based on generating 10K rows rather than 100K or 1,000K rows.
Do you see any reason why a VARCHAR(MAX) would not work? I'm still testing it, but your thoughts would be appreciated.
Yes. VARCHAR(MAX) doesn't like to be joined to. In the testing that I did, just changing the input to VARCHAR(MAX) and still using it for something less than VARCHAR(8000) immediately caused a 2:1 slowdown. Like Wayne said, a VARCHAR(MAX) splitter is in the works using a similar method but it's just not ready yet. Of course, now that you've said what you've said, some retesting for VARCHAR(MAX) appears to be in order across different machines.
If you can use a CLR like the one in the test attachment to the article, that would be the best thing to do for splitting. Understood if you cannot.
--Jeff Moden
Change is inevitable... Change for the better is not.
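For anyone who wants to see the effect for themselves, here is a minimal timing sketch. It assumes a hypothetical dbo.DelimitedSplitMax that is identical to dbo.DelimitedSplit8K except for a VARCHAR(MAX) parameter; the name and the REPLICATE counts are illustrative only:

SET STATISTICS TIME ON;

-- Same logical workload through the VARCHAR(8000) version (7,996 characters, just under the limit)
DECLARE @s VARCHAR(8000) = REPLICATE('abc,', 1999);
SELECT COUNT(*) FROM dbo.DelimitedSplit8K(@s, ',');

-- Same workload through the hypothetical VARCHAR(MAX) variant
DECLARE @m VARCHAR(MAX) = REPLICATE(CAST('abc,' AS VARCHAR(MAX)), 1999);
SELECT COUNT(*) FROM dbo.DelimitedSplitMax(@m, ',');

SET STATISTICS TIME OFF;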
May 25, 2011 at 11:53 am
I've tested it using 36014 characters (list of email addresses) and the response time is under 1 second:
-----------------------
2011-05-25 12:46:54.640
2011-05-25 12:46:55.253
For the few times that I will be using this, that speed is acceptable. I wonder if using a table-based Tally table makes any difference to the speed given the larger number of records that need to be generated using the CTE version. I did not try using the CTE version scaled up to (MAX) size.
I'm eager to see your revised version when it is complete.
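For reference, the gap between those two timestamps works out to 613 ms, which a quick DATEDIFF confirms (the literals below are just the values printed above):

SELECT DATEDIFF(ms, '2011-05-25 12:46:54.640', '2011-05-25 12:46:55.253') AS ElapsedMs; -- returns 613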
May 25, 2011 at 12:11 pm
Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.
I have no idea whether it can be implemented or how, but it was an out-of-the-box idea that just struck me. Thought I would share the idea in case I get struck by a bus.
Mark
May 25, 2011 at 12:35 pm
mark hutchinson (5/25/2011)
@Jeff Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.
I have no idea whether it can be implemented or how, but it was an out-of-the-box idea that just struck me. Thought I would share the idea in case I get struck by a bus.
Mark
Good idea, but you just can't split it every 8000 characters. You have to split that varchar(max) into chunks <= 8000 chars, AND split at a delimiter. Any ideas?
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
May 25, 2011 at 12:56 pm
It depends on the ability to reliably split the MAX string at a delimiter into chunks smaller than 8000 characters. Otherwise, the ends of the parsed string sets would need to be concatenated.
If I had the delimiter positions, I might be able to split first on the greatest delimiter position less than 8000, and then on the greatest delimiter position whose difference from the prior split position was less than 8000.
I have no idea how to write the T-SQL nor even if it can be written. It might perform like a dog. I just don't know.
Mark
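For what it's worth, here is one way Mark's chunking idea might look in T-SQL. This is only a sketch of the chunking step, not the code Jeff and Wayne have in the works; it assumes every element is shorter than 8000 characters, and the variable and table names are made up for illustration:

DECLARE @pString VARCHAR(MAX) = 'a,b,c', -- sample input
        @pDelimiter CHAR(1) = ',';
DECLARE @Chunks TABLE (ChunkNumber INT IDENTITY(1,1), Chunk VARCHAR(8000));
DECLARE @Pos BIGINT = 1,
        @Len BIGINT = DATALENGTH(@pString),
        @Cut BIGINT;
WHILE @Pos <= @Len
BEGIN
    IF @Pos + 7999 >= @Len
        SET @Cut = 8000; -- final chunk: just take whatever is left
    ELSE
        -- Cut at the LAST delimiter inside the next 8000 characters so no element is torn in half
        -- (REVERSE/CHARINDEX finds the last delimiter, even one sitting exactly at position 8000).
        SET @Cut = 8000 - CHARINDEX(@pDelimiter, REVERSE(SUBSTRING(@pString, @Pos, 8000)));
    INSERT INTO @Chunks (Chunk) SELECT SUBSTRING(@pString, @Pos, @Cut);
    SET @Pos = @Pos + @Cut + 1; -- the +1 skips the delimiter between chunks
END;
-- Each chunk can now be fed to the VARCHAR(8000) splitter and the results renumbered.
SELECT ChunkNumber, Chunk FROM @Chunks;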
May 25, 2011 at 1:49 pm
mark hutchinson (5/25/2011)
@Jeff Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.
I have no idea whether it can be implemented or how, but it was an out-of-the-box idea that just struck me. Thought I would share the idea in case I get struck by a bus.
Mark
That's the basis of the code I wrote. It runs very, very fast. Wayne and I have done some back and forth "what ifs"... I just haven't had the time to bring it up to releasable code, yet. By "Releasable Code", I mean code that's been fully documented and tested for functionality, performance, and scalability on at least a half dozen different types of machines across at least 2 revs of SQL. I've just gotta get some time to make the right kind of test harness, test data, and package to send to a very willing and helpful group of external testers. Considering the number of fires I currently have sticks in, it's not going to happen in the next week or two. 😉
Besides... everyone knows to use a CLR for these types of things, right? :-P:-P:-P:hehe:
--Jeff Moden
Change is inevitable... Change for the better is not.
May 25, 2011 at 1:58 pm
Jeff Moden (5/25/2011)
Besides... everyone knows to use a CLR for these types of things, right? :-P:-P:-P:hehe:
Yes, I know. But, when you're selling software solutions, you can't mandate their use to your customers.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
May 25, 2011 at 2:02 pm
mark hutchinson (5/25/2011)
@Wayne It depends on the ability to reliably split the MAX string at a delimiter into chunks smaller than 8000 characters. Otherwise, the ends of the parsed string sets would need to be concatenated.
If I had the delimiter positions, I might be able to split first on the greatest delimiter position less than 8000, and then on the greatest delimiter position whose difference from the prior split position was less than 8000.
I have no idea how to write the T-SQL nor even if it can be written. It might perform like a dog. I just don't know.
Mark
And what happens if the delimiter is at position 8000?
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
May 25, 2011 at 8:07 pm
WayneS (5/25/2011)
Jeff Moden (5/25/2011)
Besides... everyone knows to use a CLR for these types of things, right? :-P:-P:-P:hehe:
Yes, I know. But, when you're selling software solutions, you can't mandate their use to your customers.
Heh... absolutely true. That's why I had all the smiley faces on my last post. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
June 21, 2011 at 1:32 pm
How about this to get rid of the CHARINDEX() string operation:
CREATE FUNCTION dbo.DelimitedSplit8K
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT 0 UNION ALL
SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT t.N+1
FROM cteTally t
WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
),
cteStartEnd(Nb, Ne) AS ( -- Associate starting and ending positions
SELECT N1, ISNULL((SELECT MIN(N1) FROM cteStart ce WHERE ce.N1 > cs.N1), 8002) -- 8002, not 8001, so the final element of a full 8000-character string keeps its last character
FROM cteStart cs
)
--===== Do the actual split.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY se.Nb),
Item = SUBSTRING(@pString,se.Nb,se.Ne - se.Nb - 1)
FROM cteStartEnd se
;
...how does that affect performance?
edited to add: Actually, performance goes down a bit because the correlated subquery that builds cteStartEnd has no index to work against. This version does exhibit performance gains in my very modest testing:
CREATE FUNCTION dbo.DelimitedSplit8K
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
-- A multi-statement TVF this time: an inline TVF cannot contain DECLARE/INSERT statements.
RETURNS @Result TABLE (ItemNumber BIGINT, Item VARCHAR(8000))
WITH SCHEMABINDING AS
BEGIN
DECLARE @t_delim TABLE(offset smallint PRIMARY KEY); -- for the index
-- Get the offsets into the in-memory indexed table
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT 0 UNION ALL
SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
INSERT INTO @t_delim(offset) --==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT t.N+1
FROM cteTally t
WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
;
WITH
cteStartEnd(Nb, Ne) AS ( -- Associate starting and ending positions
SELECT cs.offset, ISNULL((SELECT MIN(ce.offset) FROM @t_delim ce WHERE ce.offset > cs.offset), 8002) -- 8002 for the same end-of-string reason as above
FROM @t_delim cs
)
--===== Do the actual split.
INSERT INTO @Result (ItemNumber, Item)
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY se.Nb),
Item = SUBSTRING(@pString,se.Nb,se.Ne - se.Nb - 1)
FROM cteStartEnd se
;
RETURN;
END;
(Note: I'm actually implementing my version as a stored procedure, so the exact code above hasn't been tested by me...)
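For anyone following along, here is a quick smoke test of either version above (the input values are arbitrary; the expected output follows from the start/end logic):

SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K('ab,cd,,ef', ',');
-- Expected: (1,'ab'), (2,'cd'), (3,''), (4,'ef')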
August 8, 2011 at 2:11 pm
mark hutchinson (5/25/2011)
@Jeff Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.
Mark, we used Jeff's function and expanded it out to varchar(max) with a fair amount of linearity, though I'm looking forward to seeing Jeff's version with MAX handled.
String Length : Elements : CPU Time in ms
74 : 24 : 2
148 : 48 : 2
296 : 96 : 2
592 : 192 : 3
1184 : 384 : 32
2368 : 768 : 7
4736 : 1536 : 13
9472 : 3072 : 56
18944 : 6144 : 49
37888 : 12288 : 96
75776 : 24576 : 193
151552 : 49152 : 385
303104 : 98304 : 5006
606208 : 196608 : 10085
1212416 : 393216 : 19989
UDF
/*
See the following sometime for improvements to this: http://www.sqlservercentral.com/articles/Tally+Table/72993/
Jeff Moden 2011-05-02
*/
CREATE FUNCTION udf_StrList2Table (
@List NVARCHAR(MAX) = N'',
@Delimiter NCHAR(1) = N','
)
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
/*
The following inline delimited splitter was written by Jeff Moden.
http://www.sqlservercentral.com/articles/Tally+Table/72993/
*/
WITH E1(N) AS
( --=== Create Ten 1's
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 --10
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --100
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10,000
E8(N) AS
(
--==== Limits the number of rows right up front for both a performance gain and prevention of accidental "overruns"
SELECT TOP (DATALENGTH(ISNULL(@List,1))) 1 FROM E4 a, E4 b -- 100,000,000
),
cteTally(N) AS
( --==== This provides the "zero base"
SELECT 0 UNION ALL
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E8
),
cteStart(N1) AS
( --==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT t.N+1 FROM cteTally t WHERE (SUBSTRING(@List,t.N,1) = @Delimiter OR t.N = 0)
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT
ROW_NUMBER() OVER(ORDER BY s.N1) as ListPos,
SUBSTRING(@List,s.N1,ISNULL(NULLIF(CHARINDEX(@Delimiter,@List,s.N1),0)-s.N1,DATALENGTH(ISNULL(@List,1)))) as Value
FROM cteStart s
GO
Testing
DECLARE
@List NVARCHAR(MAX),
@Delimiter NCHAR(1) = N',',
@ListCnt int,
@StartTime datetime2,
@EndTime datetime2,
@ReplicationSeed int = 2,
@ReplicationMultiplier int = 2
DECLARE @Results TABLE (ListLength int, ElementCount int, MSTime int)
WHILE @ReplicationSeed <= 50000
BEGIN
SELECT @List = REPLICATE(CAST(N'ab,a,aeae,3,3,a3,23,4,asa,,434,q4,345' as nvarchar(max)), @ReplicationSeed)
SELECT @StartTime = SYSDATETIME()
SELECT @ListCnt = COUNT(*) FROM udf_StrList2Table(@List, @Delimiter)
SELECT @EndTime = SYSDATETIME()
INSERT INTO @Results (ListLength, ElementCount, MSTime)
SELECT LEN(@List), LEN(@List) - LEN(REPLACE(@List, ',', '')), DATEDIFF(MS, @StartTime, @EndTime)
SELECT
@ReplicationSeed = @ReplicationSeed * @ReplicationMultiplier
END
SELECT * FROM @Results
/* Anything is possible but is it worth it? */
August 12, 2011 at 3:01 am
Gatekeeper,
Be careful testing sample data this way, where you have the constants in the same batch as the code that you are benchmarking. The REPLICATE function does not "hide" the final string to be benchmarked with. There are reasons Jeff's benchmarking involves a table of random strings; one of them is to avoid tainted results by giving the optimizer chances to skip logic it would normally have to perform at runtime.
I am not saying your test code necessarily falls victim to this, but make sure it does not!
And thank you for your contribution. I will have to test how that E8 addition affects performance on input that never needs those high-end numbers, and if it doesn't affect it, I am going to incorporate it into the fastest version thus far and use it in my own projects.
Cheers!
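To make the caveat concrete, here is a minimal sketch of the table-driven approach (the temp table and column names are made up, and it assumes udf_StrList2Table landed in dbo; the variable-assignment trick to keep display overhead out of the timings is borrowed from Jeff's test harnesses):

-- Build test rows whose contents the optimizer cannot know at compile time
IF OBJECT_ID('tempdb..#TestStrings') IS NOT NULL DROP TABLE #TestStrings;
SELECT TOP (1000)
       CsvString = CAST(REPLICATE('x,', ABS(CHECKSUM(NEWID())) % 500 + 1) AS VARCHAR(8000))
  INTO #TestStrings
  FROM sys.all_columns ac1 CROSS JOIN sys.all_columns ac2;

-- Time the split across all rows, dumping output into a variable instead of the client
DECLARE @Bitbucket NVARCHAR(4000);
SET STATISTICS TIME ON;
SELECT @Bitbucket = split.Value
  FROM #TestStrings ts
 CROSS APPLY dbo.udf_StrList2Table(ts.CsvString, N',') split;
SET STATISTICS TIME OFF;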