Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Jeff Moden (4/14/2014)


    I agree. A CLR would be the best way and you made it easy enough for everyone to use. Thanks. The "Split" function in the original article was much faster than any of the T-SQL methods.

    I don't know for sure that the "max" version with multi-character delimiters I posted will beat Miller's method, but it's there if (s)he wants to try it.

    The only thing I don't like about the one that Adam posted is that it ignores leading/trailing and consecutive delimiters. For the kind of work I do, those are important especially for "ragged right" splits.

I think we had this conversation a long time ago, which led me to revise it. The version I presented works as you would wish, I think.

    Works for me, and I haven't edited it? No worries either way.

  • FYI: Ran the function through the test script with the same result aggregation as I used for the article.

    😎

SplitterName                 TOTAL_D      AVG_D
---------------------------- ------------ -----------
DelimitedSplit8K_T1_LEAD_V2  18302.00000  373.510204
DelimitedSplit8K_T1_LEAD     18587.00000  379.326530
BetterSplit                  20307.00000  414.428571

  • Paul White (4/14/2014)


    At the risk of making a predictable contribution at this point, based on Miller's stated requirements for long Unicode strings and multi-character delimiters:

...

    Based on http://sqlblog.com/blogs/adam_machanic/archive/2009/04/28/sqlclr-string-splitting-part-2-even-faster-even-more-scalable.aspx

This version uses .NET to supply the substrings, an approach that so far I have seen all splitters take.

A while back I wrote one that just delivers the locations of the substrings back to SQL Server. The idea is that the .NET-to-SQL interface is leaner, and it would allow SQL Server to be smarter about its memory usage, for example by referencing partial strings internally instead of making fresh copies, which would hardly consume any additional memory. It also guarantees that the resulting datatype and collation are in line with the original input, something no other version I have seen does, and it means no conversions from the static Unicode result that .NET delivers are needed when parsing varchar input (saving both memory and time).

I never got to run benchmark tests for it, as it is part of a larger, more complete library which I haven't fully designed and finished yet. Too bad I ran out of time and had to focus on other things, but it will be done at some point. I think the standard way of splitting has room to be improved upon, efficiency-wise.
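As a rough illustration of the positions-only idea, here is a sketch of how a caller might consume it; dbo.SplitPositions is a hypothetical function assumed for this example only (it is not code from this thread). Because the caller rebuilds each element from the original string, the result keeps the input's exact datatype and collation:

-- Hypothetical sketch: dbo.SplitPositions is assumed to return (ItemNumber, Start, Length)
-- instead of the substrings themselves.
DECLARE @s varchar(8000) = 'A,B,C';

SELECT p.ItemNumber,
       Item = SUBSTRING(@s, p.Start, p.Length)  -- stays varchar, original collation
FROM dbo.SplitPositions(@s, ',') AS p;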

It is unfortunate that there are some seemingly inexplicable limitations in what you can do with T-SQL.

For example, I wondered what would happen if we removed all usage of SUBSTRING and simply converted the delimited string into scripted data with a single execution of the REPLACE function, then executed the dynamic SQL to generate the resultant table. This works great and is incredibly fast, but for several reasons it will only work as a stored procedure, not as a function.

Below is a procedure that does the split with no tally table, no SUBSTRING, and no CHARINDEX.

CREATE PROCEDURE [dbo].[TallyNo]
    (@L NVARCHAR(MAX), @D NVARCHAR(100))
AS
DECLARE @E NVARCHAR(MAX);  -- NVARCHAR rather than VARCHAR so Unicode input is not mangled
-- Note: items containing single quotes will break the dynamic SQL as written.
SET @E = 'SELECT ItemNumber=ROW_NUMBER()OVER(ORDER BY(SELECT 1)),Item FROM(VALUES(''' + REPLACE(@L, @D, '''),(''') + '''))V(Item)';
EXEC(@E);
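A quick usage sketch of the procedure above (my example, not from the post):

EXEC dbo.TallyNo @L = N'A,B,C', @D = N',';
-- Expected result: three rows, ItemNumber 1 to 3 with Items 'A', 'B', 'C'
-- (the ROW_NUMBER ordering relies on the VALUES constructor being read in order).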

  • Paul White (4/14/2014)


    Jeff Moden (4/14/2014)


    I agree. A CLR would be the best way and you made it easy enough for everyone to use. Thanks. The "Split" function in the original article was much faster than any of the T-SQL methods.

    I don't know for sure that the "max" version with multi-character delimiters I posted will beat Miller's method, but it's there if (s)he wants to try it.

    The only thing I don't like about the one that Adam posted is that it ignores leading/trailing and consecutive delimiters. For the kind of work I do, those are important especially for "ragged right" splits.

I think we had this conversation a long time ago, which led me to revise it. The version I presented works as you would wish, I think.

    Works for me, and I haven't edited it? No worries either way.

    Apologies for the late reply.

We did indeed, and thank you very much for not only tweaking the CLR, Paul, but for offering it up to begin with and making it easy to instantiate. I don't even know how to spell C# and couldn't have done it on my own. You really came through on that and made an essential comparison possible. Thank you very much, again!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SQL 2014 is kinda cool...

    I just tried out using a memory optimized table for the Tally / Numbers table in DelimitedSplit8K and two things stand out.

    1. It seems to be much quicker than SQL 2012 generally.

2. The memory optimised Tally splitter apparently wipes the floor with the stacked CTE / IBG-style computed tally.

    For those interested, attached results from my desktop PC.

    DelimitedSplit8K_M_O is this:

CREATE FUNCTION [dbo].[DelimitedSplit8K_M_O]
--===== Define I/O parameters
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== SQL 2014 Memory Optimised "Tally Table" produces values from 1 up to 1,000,000...
--      enough to cover VARCHAR(8000)
WITH
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
        SELECT 1 UNION ALL
        SELECT t.N + 1
          FROM dbo.moTally t
         WHERE SUBSTRING(@pString, t.N, 1) = @pDelimiter
           AND t.N <= ISNULL(DATALENGTH(@pString), 0)
        ),
cteLen(N1, L1) AS (--==== Return start and length (for use in SUBSTRING)
        SELECT s.N1,
               ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, s.N1), 0) - s.N1, 8000)
          FROM cteStart s
        )
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY l.N1),
       Item       = SUBSTRING(@pString, l.N1, l.L1)
  FROM cteLen l
;

    The MO Tally is this:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[moTally]
(
    [N] [int] NOT NULL,
    CONSTRAINT [pk_moTally] PRIMARY KEY NONCLUSTERED ([N] ASC)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
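The post does not show how moTally gets its rows; here is a minimal population sketch (my assumption, not the poster's script), followed by a quick call of the splitter above:

-- Sketch (assumed): load 1,000,000 sequential integers into the memory optimized tally table.
WITH E1(n) AS (SELECT 1 FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)), -- 10 rows
     E3(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c),                 -- 1,000 rows
     E6(n) AS (SELECT 1 FROM E3 a CROSS JOIN E3 b)                                  -- 1,000,000 rows
INSERT INTO dbo.moTally (N)
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E6;

-- The splitter can then be called like any inline table-valued function:
SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K_M_O('A,B,C', ',');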

    MM




  • You could also play around with a native-compiled procedure, something like:

-- Memory optimized table type
CREATE TYPE dbo.SplitReturn AS TABLE
(
    ItemNumber integer IDENTITY PRIMARY KEY NONCLUSTERED,
    Item varchar(8000) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON);
GO

CREATE PROCEDURE dbo.Split
    @pString varchar(8000),
    @pDelimiter char(1)
WITH
    NATIVE_COMPILATION,
    SCHEMABINDING,
    EXECUTE AS OWNER
AS
BEGIN ATOMIC
WITH
(
    TRANSACTION ISOLATION LEVEL = SNAPSHOT,
    LANGUAGE = 'us_english'
)
    DECLARE @return AS dbo.SplitReturn;
    DECLARE @pos AS integer = 1;
    DECLARE @Last AS integer = 1;

    WHILE @pos <= LEN(@pString)
    BEGIN
        IF SUBSTRING(@pString, @pos, 1) = @pDelimiter
        BEGIN
            IF @pos = 1
                INSERT @return (Item)
                VALUES (SUBSTRING(@pString, @Last, @pos - @Last + 1));
            ELSE
                INSERT @return (Item)
                VALUES (SUBSTRING(@pString, @Last, @pos - @Last));

            SET @Last = @pos + 1;
        END;

        SET @pos += 1;
    END;

    INSERT @return (Item)
    VALUES (SUBSTRING(@pString, @Last, @pos - @Last + 1));

    SELECT
        R.ItemNumber,
        R.Item
    FROM @return AS R
    ORDER BY
        R.ItemNumber;
END;
GO

-- Example usage
EXECUTE dbo.Split
    @pString = 'A**B*C*',
    @pDelimiter = '*';

    Not very useful given that it's limited to splitting one string per procedure call, I admit 🙂

    It's a shame functions don't support native compilation, and crazy that basics like CHARINDEX aren't supported for native, but maybe we'll see these things in a future version.

I was really looking forward to that type of functionality. It IS a step in the right direction, though. Thanks, Paul.

    --Jeff Moden



FYI: ran the test on the CLR Split, MM's memory optimized version, and the LEAD window function version. Hardly any benefit from using the window function: 0.001 on average and 0.033 in total runtime. Still 25% faster than the CLR.

    😎

SplitterName               TOTAL_D  AVG_D
-------------------------- -------- ---------
DelimitedSplit8K_M_O_LEAD  2.96400  0.060489
DelimitedSplit8K_M_O       2.99700  0.061163
Split                      4.06600  0.082979
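For readers who have not seen the LEAD variant referenced in these results, here is a rough sketch of the idea (my reconstruction; the exact DelimitedSplit8K_M_O_LEAD code tested above is not posted in this exchange). The CHARINDEX probe for each element's length is replaced by a LEAD window function over the start positions:

-- Sketch (assumed): a LEAD-based variant of the DelimitedSplit8K_M_O shape shown earlier.
-- Element length = next start position - 1 - current start; ISNULL handles the final element.
CREATE FUNCTION dbo.DelimitedSplit8K_M_O_LEAD_sketch
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH cteStart(N1) AS
(
    SELECT 1 UNION ALL
    SELECT t.N + 1
      FROM dbo.moTally t
     WHERE SUBSTRING(@pString, t.N, 1) = @pDelimiter
       AND t.N <= ISNULL(DATALENGTH(@pString), 0)
)
SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY s.N1),
       Item       = SUBSTRING(@pString, s.N1,
                        ISNULL(LEAD(s.N1) OVER (ORDER BY s.N1) - s.N1 - 1, 8000))
  FROM cteStart s;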

  • Eirikur Eiriksson (4/30/2014)


FYI: ran the test on the CLR Split, MM's memory optimized version, and the LEAD window function version. Hardly any benefit from using the window function: 0.001 on average and 0.033 in total runtime. Still 25% faster than the CLR.

The memory optimized version of the tally splitter is certainly an improvement, and it is very competitive with the CLR splitter, especially on small strings, where it can be faster. My tests show it is still 50% slower for long strings with many elements.

    Bear in mind that the memory optimized improvement requires SQL Server 2014 Enterprise Edition. Also, as currently written, it does not handle Unicode, and is limited to 8000 non-Unicode characters.

    None of these limitations apply to the CLR splitter; it works on all versions and editions of SQL Server from 2005 onward, and handles Unicode and max-length strings all in one.

    😎

  • Paul White (4/30/2014)


    Eirikur Eiriksson (4/30/2014)


FYI: ran the test on the CLR Split, MM's memory optimized version, and the LEAD window function version. Hardly any benefit from using the window function: 0.001 on average and 0.033 in total runtime. Still 25% faster than the CLR.

The memory optimized version of the tally splitter is certainly an improvement, and it is very competitive with the CLR splitter, especially on small strings, where it can be faster. My tests show it is still 50% slower for long strings with many elements.

    Bear in mind that the memory optimized improvement requires SQL Server 2014 Enterprise Edition. Also, as currently written, it does not handle Unicode, and is limited to 8000 non-Unicode characters.

    None of these limitations apply to the CLR splitter; it works on all versions and editions of SQL Server from 2005 onward, and handles Unicode and max-length strings all in one.

    😎

No doubt about the CLR splitter being generally better; my point is that when using the window function in the normal DelimitedSplit8K splitter, it is up to 50% faster than the CHARINDEX one. With the memory optimized tally table, the difference only starts to show on longer strings; in fact, the window function version was faster than the CLR all the way through (see the attachment).

    😎

  • Eirikur Eiriksson (4/30/2014)


No doubt about the CLR splitter being generally better; my point is that when using the window function in the normal DelimitedSplit8K splitter, it is up to 50% faster than the CHARINDEX one.

    Forgive me, but I didn't get any sense of that point from the post I replied to. Never mind.

With the memory optimized tally table, the difference only starts to show on longer strings; in fact, the window function version was faster than the CLR all the way through (see the attachment).

    The LEAD version is indeed impressive (albeit with limitations).

  • Eirikur Eiriksson (4/30/2014)


FYI: ran the test on the CLR Split, MM's memory optimized version, and the LEAD window function version. Hardly any benefit from using the window function: 0.001 on average and 0.033 in total runtime. Still 25% faster than the CLR.

    😎

SplitterName               TOTAL_D  AVG_D
-------------------------- -------- ---------
DelimitedSplit8K_M_O_LEAD  2.96400  0.060489
DelimitedSplit8K_M_O       2.99700  0.061163
Split                      4.06600  0.082979

    Well, that saved me the effort - that was going to be my next test 😀

Paul - absolutely agree that the CLR implementation is the nuts; what we are seeking is incremental improvements to the non-CLR version (despite its limitations in this form) for those systems that will not or cannot use CLR.

    MM




Have you tried switching the memory optimized tally table's durability to just schema? I read that a table will perform best in this mode, and it's pretty easy to add a startup script for the database to populate the tally table with rows.
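For reference, a minimal sketch (my assumption; not posted in the thread) of what the SCHEMA_ONLY variant being discussed would look like. Rows are lost on restart, so a startup script would have to repopulate it:

-- Sketch (assumed): the same tally table declared with SCHEMA_ONLY durability.
CREATE TABLE [dbo].[moTally_SchemaOnly]
(
    [N] [int] NOT NULL,
    CONSTRAINT [pk_moTally_SchemaOnly] PRIMARY KEY NONCLUSTERED ([N] ASC)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);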

  • mburbea (5/9/2014)


Have you tried switching the memory optimized tally table's durability to just schema? I read that a table will perform best in this mode, and it's pretty easy to add a startup script for the database to populate the tally table with rows.

    Yes, the performance was almost identical to the millisecond, which seems unlikely but that's what happened.

    MM



