Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • mister.magoo (5/9/2014)


    mburbea (5/9/2014)


    Have you tried switching the memory optimized tally table durability to just schema? I read that a table will perform best in this mode and its pretty easy to add a start up script for a db to populate the tally table with rows.

    Yes, the performance was almost identical to the millisecond, which seems unlikely but that's what happened.

    Creating a Hekaton table with schema-only durability makes data modification faster (no logging/checkpointing), but it has no effect on reads. Native compilation (rather than T-SQL interop) would speed up reads, but as mentioned before this is not yet available for functions.

  • That makes sense.

    After playing with the new Lead variant, my conclusion is that it seems to be slower than the mister.magoo collate latin1_general_bin version. That seems to be the best performing in ITF.

    However, I can offer one very minor optimization that can slightly improve the performance depending on the size of the string. Basically reducing the size of the intermediate tables. I was studying the plan and it seemed that it was creating much larger rows then the optimizer needed for the intermediate derived tables. Since we don't care at all about the rows in E1,E2,E4 we can convert those to bit. For CteTally we can use a smallint as we need up to 8000. The other intermediate results conversion wasn't so successful. For very short strings it performs about the same, but for the longer strings the smaller rows for the intermediate results seems to be very beneficial.

    Here is the modified function:

    if object_id('delimitedSplit8kbsr') is not null drop function delimitedSplit8kbsr

    GO

    CREATE FUNCTION [dbo].delimitedSplit8kbsr

    --===== Define I/O parameters

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    WITH E1(N) AS (

    SELECT convert(bit,1) UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT convert(bit,1) FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT convert(bit,1) FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- we use conversion to bit and smallint to reduce the row size of these results.

    -- sql seems smart enough to avoid doing unnecessary work.

    SELECT top (ISNULL(DATALENGTH(@pString),0)) convert(smallint,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) FROM E4

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT 1

    UNION ALL

    SELECT t.N+1 FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    ),

    cteLen(N1,L1) AS(

    SELECT s.N1,

    ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN,s.N1) ,0)-s.N1,8000)

    FROM cteStart s

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

    Item = SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l

    ;

    And attached is the comparison of 4 functions,clrsplit,delimitedSplit8kB,delimitedSplit8k_lead, and this function BSR. The results suggest this might be the best for users without access to SQL server 2014 enterprise and an inexplicable restriction on CLR.

    Split wins, as always but the BSR function is no slouch, beating the other 2 TSQL functions with a sizable lead, performing at about halfway between the clr and the other fast t-sql functions.

    This was performed on a sql server 2014 Standard (sigh 🙁 ) Edition instance.

  • Brilliant:-)

    Ran a short disco version of the test on 2014 Developers edition, also added a Lead version which came in second!

    SplitterName TOTAL_D AVG_D

    -------------------------- -------- ---------

    Split 3.97600 0.081142

    delimitedSplit8kbsr 4.12700 0.084224

    delimitedSplit8kbsrL 4.21200 0.085959

    DelimitedSplit8K_T1_Lead 4.34300 0.088632

    DelimitedSplit8K_Lead 5.47300 0.111693

    DelimitedSplit8K_T0_Lead 5.52800 0.112816

    Good stuff!

    😎

  • mburbea (5/13/2014)


    However, I can offer one very minor optimization that can slightly improve the performance depending on the size of the string. Basically reducing the size of the intermediate tables. I was studying the plan and it seemed that it was creating much larger rows then the optimizer needed for the intermediate derived tables. Since we don't care at all about the rows in E1,E2,E4 we can convert those to bit. For CteTally we can use a smallint as we need up to 8000.

    Well done, this seems like a very useful improvement. I hope other people will be able to reproduce the speed up on lower versions.

  • The collate binary, short row optimization seems to really only help on 2014. My guess is that the new estimator is much more clever (and faster). It seems to be worse on SQL server 2008R2. (I don't have access to a sql server 2012 box to try it on)

    SplitterName total_davg_d

    -----------------------------------------------

    DelimitedSplit8KB 35.1645160.717643183673469

    delimitedSplit8kbsr 37.3257270.761749530612245

    On older versions of sql server, it seems best to just use the binary variant.

    Eirikur Eiriksson (5/13/2014)


    Brilliant:-)

    Ran a short disco version of the test on 2014 Developers edition, also added a Lead version which came in second!

    SplitterName TOTAL_D AVG_D

    -------------------------- -------- ---------

    Split 3.97600 0.081142

    delimitedSplit8kbsr 4.12700 0.084224

    delimitedSplit8kbsrL 4.21200 0.085959

    DelimitedSplit8K_T1_Lead 4.34300 0.088632

    DelimitedSplit8K_Lead 5.47300 0.111693

    DelimitedSplit8K_T0_Lead 5.52800 0.112816

    Good stuff!

    😎

    Good to hear, do you mind posting your lead version? T1 seems quite impressive if its not using the "SR" (small row) variant. I also used this slight optimization of lead's math which I think is better but might be worse.

    isnull(lead(n1,1,null) over (order by n1) - 1 -n1,8000)

  • My findings differ, in that the "bit trick" makes no significant difference in Jeff's test harness.

    I am not saying it doesn't work for some, but not for me on 2014 Dev Edition on a somewhat old PC.

    The binary collation mod obviously helps a lot but only if using binary collation works for your use case and in my case there is no discernible improvement from using bit.

    On those systems where you do see an improvement, I wonder if you also would see one from explicitly casting the 1 in cteStart as a smallint to make it match datatype with cteTally to avoid implicit conversions?

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT cast(1 as smallint)

    UNION ALL

    SELECT t.N+cast(1 as smallint) FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    )

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • That's rather odd. I found it best not to mess with the CteStart derived table. Changing that seems to slow things down. For my use case which is comma delimited list of codes or integers it seems to be pretty nice.

    I tried casting both and it only hurts.

  • Looks like the performance improvements have more to do with the platform (2014). I have been looking at results from the same (or identical) hardware (E6220, i5, 256GB SSD, 8Gb RAM, Power Savings disabled!!!) for 2008 and 2012. The findings are that those are similar, but significantly slower than 2014. Mind you, this is not a scientific measurement, more eyeballing 😉

    Then, looking at the bit type tally, another thing emerged; unless all seeds in the initial CTE are explicitly cast to a bit, it still returns int (more in the code). This makes me believe that this bit (excuse the pun 😛 ) is not a contributing factor in terms of performance.

    😎

    SET NOCOUNT ON;

    DECLARE @NUMROWS INT = 1;

    /* All seeds explicitly cast to bit, returns bit */

    ;WITH TN(N) AS (SELECT N FROM (VALUES (CONVERT(BIT,1)),(CONVERT(BIT,1))

    ,(CONVERT(BIT,1)),(CONVERT(BIT,1)),(CONVERT(BIT,1))

    ,(CONVERT(BIT,1)),(CONVERT(BIT,1)),(CONVERT(BIT,1))

    ,(CONVERT(BIT,1)),(CONVERT(BIT,1))) AS X(N))

    SELECT TOP(@NUMROWS)

    DATALENGTH(NM.N) AS DLEN

    ,'All seeds explicitly cast to bit, returns bit' AS COMMENT

    FROM TN NM;

    /* First seed explicitly cast to bit, returns int */

    ;WITH E1(N) AS (

    SELECT convert(bit,1) UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    )

    SELECT TOP(@NUMROWS)

    DATALENGTH(NM.N) AS DLEN

    ,'First seed explicitly cast to bit, returns int' AS COMMENT

    FROM E1 NM;

    /* No conversion, returns int */

    ;WITH TN(N) AS (SELECT N FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1)) AS X(N))

    SELECT TOP(@NUMROWS)

    DATALENGTH(NM.N) AS DLEN

    ,'No conversion, returns int' AS COMMENT

    FROM TN NM;

    /* All but last seed cast to bit, returns int */

    ;WITH TN(N) AS (SELECT N FROM (VALUES (CONVERT(BIT,1)),(CONVERT(BIT,1))

    ,(CONVERT(BIT,1)),(CONVERT(BIT,1)),(CONVERT(BIT,1))

    ,(CONVERT(BIT,1)),(CONVERT(BIT,1)),(CONVERT(BIT,1))

    ,(CONVERT(BIT,1)),(1)) AS X(N))

    SELECT TOP(@NUMROWS)

    DATALENGTH(NM.N) AS DLEN

    ,'All but last seed cast to bit, returns int' AS COMMENT

    FROM TN NM;

    Contracted results

    DLEN COMMENT

    ----------- ---------------------------------------------

    1 All seeds explicitly cast to bit, returns bit

    4 First seed explicitly cast to bit, returns int

    4 No conversion, returns int

    4 All but last seed cast to bit, returns int

  • OK, I am dragged in again by so many notifications received for this excellent never ending thread :-). After looking at the recent comments it is really fascinating to know the use of new technology methods. I will look into more details when I will get more time.

    However, as far as the speed comparison tests are concerned, I am not sure that LEAD function was faster (Although, I may prefer it for some other reasons depdending upon the requirement). I already tried it way earlier (Absolutely no intention to take any credit away from Eirikur Eiriksson. It is an excellent excellent work from him) but I was able to beat that with CHARINDEX + BINARY COLLATION + FIRST ITEM HANDLED SEPERATELY combination on SQL 2012 (Not sure of SQL 2014). However, the most important reason for neglecting it was that I didn’t have SQL 2012 in production 😉

    The fastest method (after CLR) for me was AFOREMENTIONED COMBINATION + PHYSICAL TALLY TABLE + (N+1) PHYSICAL COLUMN + NO IMPLICIT CONVERSIONS (TALLY TABLE WITH INT DATATYPE). And as Paul White has already pointed out for LEAD and 8K functions, this function is not much susceptible to larger strings.

    Since I do not have the time to run the speed tests again, Following are my three functions (copied from old web source). It would be better if someone who already did the comparisons could run the tests again after including the following three functions.

    1.IF OBJECT_ID(N'dbo.DelimitedSplit8K_T1_First_Item_Handling_Seperately ', N'IF') IS NOT NULL

    2.DROP FUNCTION dbo. DelimitedSplit8K_T1_First_Item_Handling_Seperately;

    3. /*===== USES BINARY COLLATION TO SPEED UP THE STRING COMPARISON=======*/

    4./*===== FIRST DELIMITED ITEM HANDLED SEPERATELY=======================*/

    5.CREATE FUNCTION [dbo].[DelimitedSplit8K_T1_First_Item_Handling_Seperately]

    6.--===== Define I/O parameters

    7.(@pString VARCHAR(8000), @pDelimiter VARCHAR(1)/*===MADE IT VARCHAR TO AVOID IMPLICIT CONVERSION WHILE COMPARING====*/

    8.)

    9.RETURNS TABLE WITH SCHEMABINDING AS

    10.RETURN

    11.--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

    12. -- enough to cover VARCHAR(8000)

    13. WITH E1(N) AS (

    14. SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    15. SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    16. SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    17. ), --10E+1 or 10 rows

    18. E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    19. E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    20. cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    21. -- for both a performance gain and prevention of accidental "overruns"

    22. SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    23. ),

    24.cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    25. /*====FIRST ITEM HANDLED AT THE END======*/

    26. --SELECT 1 UNION ALL -- does away with 0 base CTE, and the OR condition in one go!

    27. /*=======================================*/

    28. SELECT N+1

    29. FROM cteTally

    30. WHERE SUBSTRING(@pString ,N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    31. ),

    32.cteLen(N1,L1) AS(--==== Return start and length (for use in substring)

    33. SELECT N1,

    34. ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN,N1),0) - N1,8000)

    35. FROM cteStart

    36. )

    37.--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    38./*====FIRST ITEM HANDLING===============*/

    39.SELECT ItemNumber = 0, Item = SUBSTRING(@pString , 1, ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN),0)-1, 8000))

    40.UNION ALL

    41.SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY N1),

    42. Item = SUBSTRING(@pString , N1, L1)

    43. FROM cteLen

    44.;

    IF OBJECT_ID(N'dbo.Split8KTally_With_Column_N1', N'IF') IS NOT NULL

    DROP FUNCTION dbo.Split8KTally_With_Column_N1;

    GO

    --===== Tally Table (Split8KTally_With_Column_N1 ITVF) ===============================================================================

    /*===== THIS FUNCTION USES COLUMN "N1" OF TALLY TABLE HAVING PRE-CALCULATED VALUES FOR N+1 ===========*/

    CREATE FUNCTION dbo.Split8KTally_With_Column_N1 (

    @pString VARCHAR(8000), @pDelimiter VARCHAR(1)

    )

    RETURNS TABLE

    AS

    RETURN

    /*======= FIND THE FIRST DELIMITED ITEM EXPLICITLY, AS WE SKIPPED THE FIRST ITEM TO AVOID "OR" CONDITION IN THE WHERE CLAUSE=====*/

    SELECT ItemNumber = CAST(0 AS BIGINT) --STARTING WITH ZERO FOR SORT AND TO AVOID ROW_NUMBER + 1 LATER

    , ItemValue = SUBSTRING(@pString, 1,

    ISNULL(NULLIF(CHARINDEX(@pDelimiter,

    @pString COLLATE Latin1_General_BIN), 0)

    - 1, 8000 ))

    UNION ALL -- CONCATENATE THE OUTPUTS

    /*======= FIND REST OF THE DELIMITED ITEMS USING TALLY TABLE ======================================*/

    SELECT ItemNumber = ROW_NUMBER() OVER ( ORDER BY [N] )

    , ItemValue = SUBSTRING(@pString, [N1],

    ISNULL(NULLIF(CHARINDEX(@pDelimiter ,

    @pString COLLATE Latin1_General_BIN, [N1]), 0)

    - [N1], 8000) )

    FROM [dbo].[Tally] WITH ( NOLOCK ) --NOLOCK HINT IS NOT NECESSARY

    WHERE [N] BETWEEN 1 AND LEN(@pString)

    AND SUBSTRING(@pString, [N], 1) = @pDelimiter

    GO

    --=====================================================================================================================

    -- Conditionally drop and recreate the Split8KTally_With_N_PLUS_1 splitter function.

    --=====================================================================================================================

    IF OBJECT_ID('dbo.Split8KTally_With_N_PLUS_1','IF') IS NOT NULL

    DROP FUNCTION [dbo].Split8KTally_With_N_PLUS_1

    GO

    --===== Tally Table (Split8KTally_With_N_PLUS_1 ITVF) ===============================================================================

    /*===== THIS FUNCTION USES RUNTIME CALCULATION "N + 1" INSTEAD OF USING THE PRE-CALCULATED COLUMN "N1" OF THE TALLY TABLE===========*/

    CREATE FUNCTION dbo.Split8KTally_With_N_PLUS_1 (

    @pString VARCHAR(8000), @pDelimiter VARCHAR(1)

    )

    RETURNS TABLE

    AS

    RETURN

    /*======= FIND THE FIRST DELIMITED ITEM EXPLICITLY, AS WE SKIPPED THE FIRST ITEM TO AVOID "OR" CONDITION IN THE WHERE CLAUSE=====*/

    SELECT ItemNumber = CAST(0 AS BIGINT) --STARTING WITH ZERO FOR SORT AND TO AVOID ROW_NUMBER + 1 LATER

    , ItemValue = SUBSTRING(@pString, 1,

    ISNULL(NULLIF(CHARINDEX(@pDelimiter,

    @pString COLLATE Latin1_General_BIN), 0)

    - 1, 8000 ))

    UNION ALL -- CONCATENATE THE OUTPUTS

    /*======= FIND REST OF THE DELIMITED ITEMS USING TALLY TABLE ======================================*/

    SELECT ItemNumber = ROW_NUMBER() OVER ( ORDER BY [N] )

    , ItemValue = SUBSTRING(@pString, (N+1),

    ISNULL(NULLIF(CHARINDEX(@pDelimiter ,

    @pString COLLATE Latin1_General_BIN, (N+1)), 0)

    - (N+1), 8000) )

    FROM [dbo].[Tally] WITH ( NOLOCK )--NOLOCK HINT IS NOT NECESSARY

    WHERE [N] BETWEEN 1 AND LEN(@pString)

    AND SUBSTRING(@pString, [N], 1) = @pDelimiter

    GO

    One more thing about LEAD function, IIRC, I used the LEAD somewhat as

    SUBSTRING(@pString, N,

    LEAD(N,1,16000)—DOUBLE THE MAX STRING LENGTH

    OVER (ORDER BY N) - N-1 )

    No need for ISNULL(NULLIF(

    Also, with physical tally table LEAD performed much better as generally Window functions perform better on indexed columns especially CLUSTERED INDEX.

    Last but not least, even though, I saw the almost the same behavior in article’s test harness, my preferred test harness was different as I wanted to test the larger strings i.e. near to 8000 characters.

  • mburbea (5/13/2014)


    That makes sense.

    After playing with the new Lead variant, my conclusion is that it seems to be slower than the mister.magoo collate latin1_general_bin version. That seems to be the best performing in ITF.

    However, I can offer one very minor optimization that can slightly improve the performance depending on the size of the string. Basically reducing the size of the intermediate tables. I was studying the plan and it seemed that it was creating much larger rows then the optimizer needed for the intermediate derived tables. Since we don't care at all about the rows in E1,E2,E4 we can convert those to bit. For CteTally we can use a smallint as we need up to 8000. The other intermediate results conversion wasn't so successful. For very short strings it performs about the same, but for the longer strings the smaller rows for the intermediate results seems to be very beneficial.

    Here is the modified function:

    if object_id('delimitedSplit8kbsr') is not null drop function delimitedSplit8kbsr

    GO

    CREATE FUNCTION [dbo].delimitedSplit8kbsr

    --===== Define I/O parameters

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    WITH E1(N) AS (

    SELECT convert(bit,1) UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT convert(bit,1) FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT convert(bit,1) FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- we use conversion to bit and smallint to reduce the row size of these results.

    -- sql seems smart enough to avoid doing unnecessary work.

    SELECT top (ISNULL(DATALENGTH(@pString),0)) convert(smallint,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) FROM E4

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT 1

    UNION ALL

    SELECT t.N+1 FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    ),

    cteLen(N1,L1) AS(

    SELECT s.N1,

    ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN,s.N1) ,0)-s.N1,8000)

    FROM cteStart s

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

    Item = SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l

    ;

    And attached is the comparison of 4 functions,clrsplit,delimitedSplit8kB,delimitedSplit8k_lead, and this function BSR. The results suggest this might be the best for users without access to SQL server 2014 enterprise and an inexplicable restriction on CLR.

    Split wins, as always but the BSR function is no slouch, beating the other 2 TSQL functions with a sizable lead, performing at about halfway between the clr and the other fast t-sql functions.

    This was performed on a sql server 2014 Standard (sigh 🙁 ) Edition instance.

    Very Nice thing. I guess converting it to bit could mean a smaller footprint in memory and hence may perform better (but I have been proven wrong with 1+1 = 2 logic as lots of things depends upon many other variables). I would try some alteration in your function (different combinations as well)

    1. Converting each value in starting CTE to bit

    2. Use INT conversion instead of SMALLINT

    Somewhat like

    if object_id('delimitedSplit8kbsr') is not null drop function delimitedSplit8kbsr

    GO

    ALTER FUNCTION [dbo].delimitedSplit8kbsr

    --===== Define I/O parameters

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    WITH E1(N) AS (

    SELECT convert(bit,1) UNION ALL SELECT convert(bit,1) UNION ALL SELECT convert(bit,1) UNION ALL

    SELECT convert(bit,1) UNION ALL SELECT convert(bit,1) UNION ALL SELECT convert(bit,1) UNION ALL

    SELECT convert(bit,1) UNION ALL SELECT convert(bit,1) UNION ALL SELECT convert(bit,1) UNION ALL SELECT convert(bit,1)

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT convert(bit,1) FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT convert(bit,1) FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- we use conversion to bit and smallint to reduce the row size of these results.

    -- sql seems smart enough to avoid doing unnecessary work.

    SELECT top (ISNULL(DATALENGTH(@pString),0)) convert(INT,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) FROM E4

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT 1

    UNION ALL

    SELECT t.N+1 FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    ),

    cteLen(N1,L1) AS(

    SELECT s.N1,

    ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN,s.N1) ,0)-s.N1,8000)

    FROM cteStart s

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

    Item = SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l

    ;

    Also, instead of UNION ALL, VALUES could be used since you are testing it on SQL 2012+.

    Last but not least. IIRC, Chris Morris shared different variations of cteTALLY (some were better) , I guess they are never tried here yet ?

  • Usman actually seems to be right. The avg row size of the intermediate CTE tables seems to have no real bearing on the plan. THe conversion to smallint/int of row_number seems to be where all the gains are coming from. Smallint being upgraded is 'free', where as the conversion from int->bigint is non trivial. When you use "set statistics profile on" on the "8kB" version of the function you will see the following in the argument field (I've removed all the null fields):

    DEFINE:([Expr1055]=substring('a,b,c,d',CONVERT_IMPLICIT(int,[Union1052],0),CONVERT_IMPLICIT(int,[Expr1053],0)))

    DEFINE:([Expr1054]=row_number)

    DEFINE:([Expr1053]=isnull(CONVERT_IMPLICIT(bigint,CASE WHEN charindex(',','a,b,c,d',CONVERT_IMPLICIT(int,[Union1052],0))=(0) THEN NULL ELSE charindex(',','a,b,c,d',CONVERT_IMPLICIT(int,[Union1052],0)) END,0)-[Union1052],(8000)))

    VALUES:(((1)))

    DEFINE:([Expr1051]=[Expr1050]+(1))

    WHERE:(CONVERT(varchar(1),substring('a,b,c,d',CONVERT_IMPLICIT(int,[Expr1050],0),(1)),0)=',')

    TOP EXPRESSION:((7))

    DEFINE:([Expr1050]=row_number)

    The worst offender is the line where it has to convert the expression from big int to int to do work on it then convert the result back to bigint. Now let's look at the same plan for "8kbsr"

    DEFINE:([Expr1056]=substring('a,b,c,d',[Union1053],[Expr1054]))

    DEFINE:([Expr1055]=row_number)

    DEFINE:([Expr1054]=isnull(CASE WHEN charindex(',','a,b,c,d',[Union1053])=(0) THEN NULL ELSE charindex(',','a,b,c,d',[Union1053]) END-[Union1053],(8000)))

    ORDER BY:([Union1053] ASC)

    VALUES:(((1)))

    DEFINE:([Expr1052]=CONVERT(int,[Expr1050],0)+(1))

    WHERE:(CONVERT(varchar(1),substring('a,b,c,d',CONVERT(int,[Expr1050],0),(1)),0)=',')

    TOP EXPRESSION:((7))

    DEFINE:([Expr1050]=row_number)

    The plan look similar but the cost of converting back and forth is removed. The rowsize optimization had nothing or very little to do with it. I think that might have been part of the benefit that the physical table offered is that it wasn't having to convert numbers. I've tried reforming with different tally table shapes and in practice it made no decernible difference in performance. In Sql Server 2014 though, the convert to int on the row number is a huge gain.

    Edit:

    I can't figure out how to turn off the stupid emoticons which kind of make my post look ridiculous.

    Also Usman's reformation to have CTEStart not have teh union but to have the union on line 31 seems to provide a minor boost in performance, enough so that it's overall time is now beating Split 🙂

  • mburbea (5/14/2014)


    Usman actually seems to be right. The avg row size of the intermediate CTE tables seems to have no real bearing on the plan. THe conversion to smallint/int of row_number seems to be where all the gains are coming from. Smallint being upgraded is 'free', where as the conversion from int->bigint is non trivial. When you use "set statistics profile on" on the "8kB" version of the function you will see the following in the argument field (I've removed all the null fields):


    DEFINE:([Expr1055]=substring('a,b,c,d',CONVERT_IMPLICIT(int,[Union1052],0),CONVERT_IMPLICIT(int,[Expr1053],0)))
    DEFINE:([Expr1054]=row_number)
    DEFINE:([Expr1053]=isnull(CONVERT_IMPLICIT(bigint,CASE WHEN charindex(',','a,b,c,d',CONVERT_IMPLICIT(int,[Union1052],0))=(0) THEN NULL ELSE charindex(',','a,b,c,d',CONVERT_IMPLICIT(int,[Union1052],0)) END,0)-[Union1052],(8000)))
    VALUES:(((1)))
    DEFINE:([Expr1051]=[Expr1050]+(1))
    WHERE:(CONVERT(varchar(1),substring('a,b,c,d',CONVERT_IMPLICIT(int,[Expr1050],0),(1)),0)=',')
    TOP EXPRESSION:((7))
    DEFINE:([Expr1050]=row_number)

    The worst offender is the line where it has to convert the expression from big int to int to do work on it then convert the result back to bigint. Now let's look at the same plan for "8kbsr"


    DEFINE:([Expr1056]=substring('a,b,c,d',[Union1053],[Expr1054]))
    DEFINE:([Expr1055]=row_number)
    DEFINE:([Expr1054]=isnull(CASE WHEN charindex(',','a,b,c,d',[Union1053])=(0) THEN NULL ELSE charindex(',','a,b,c,d',[Union1053]) END-[Union1053],(8000)))
    ORDER BY:([Union1053] ASC)
    VALUES:(((1)))
    DEFINE:([Expr1052]=CONVERT(int,[Expr1050],0)+(1))
    WHERE:(CONVERT(varchar(1),substring('a,b,c,d',CONVERT(int,[Expr1050],0),(1)),0)=',')
    TOP EXPRESSION:((7))
    DEFINE:([Expr1050]=row_number)

    The plan look similar but the cost of converting back and forth is removed. The rowsize optimization had nothing or very little to do with it. I think that might have been part of the benefit that the physical table offered is that it wasn't having to convert numbers. I've tried reforming with different tally table shapes and in practice it made no decernible difference in performance. In Sql Server 2014 though, the convert to int on the row number is a huge gain.

    This was one of the main reasons I chose INT datatype for physical Tally table so that I can get rid of implicit conversions as for larger strings this could have been more costly 🙂

  • Usman's reformation of the query to remove the union from "cteStart" and add it to the final result as has a nice appreciable gain in performance. It now seems to beat the CLR splitter. This is ONLY true on SQL Server 2014 unfortunately.

    Here are the optimizations applied from the base "8K splitter" in the updated script file.

    1. Collate binary before doing any string comparison.

    2. Convert row_number to int.

    3. Try a base 20 tally table and doing 3 cross joins instead of 4 (doesn't really seem to change anything).

    4. Perform Usman's reformation. (for Lead that still means charindex no choice here).

    SplitterNametotal_davg_d

    DelimitedSplit8k_LEAD2.2430.0457755102040816

    DelimitedSplit8KB2.5460.0519591836734694

    delimitedSplit8kbsr2.2240.0453877551020408

    split2.2750.0464285714285714

    Both versions seem to be faster than the CLR function 😀

    Since the slowdown seems to be the virtual tally table I'd wonder if a CLR function that just spits out a streaming number table might be able to improve the performance. Unfortunately, I haven't a clue how to write such a thing.

    Here are the functions as they stand:

    BSR

    CREATE FUNCTION [dbo].delimitedSplit8kbsr

    --===== Define I/O parameters

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1

    ),

    cteTally(N) AS (

    SELECT top (ISNULL(DATALENGTH(@pString),0)) CONVERT(int,ROW_NUMBER() OVER (ORDER BY (select null))) FROM E1,E1 b,E1 c

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT t.N+1 FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    ),

    cteLen(N1,L1) AS(

    SELECT s.N1,

    ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN,s.N1) ,0)-s.N1,8000)

    FROM cteStart s

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    select itemNumber = 1,

    item = substring(@pstring,1,ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN),0)-1,8000))

    union all

    SELECT ROW_NUMBER() OVER(ORDER BY N1)+1,

    SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l

    ;

    LEAD

    CREATE FUNCTION dbo.DelimitedSplit8K_LEAD

    --===== Define I/O parameters

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ),

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) CONVERT(int,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) FROM E1,E1 b,E1 c

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT t.N+1 FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    ),

    cteLen(N1,L1) AS(

    SELECT s.N1,

    lead(n1,1,16000) over (order by n1) - 1 -n1

    FROM cteStart s

    )

    --===== Do the actual split.

    SELECT ItemNumber = 1,

    Item =substring(@pstring,1,ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN),0)-1,8000))

    UNION ALL

    SELECT ROW_NUMBER() OVER(ORDER BY l.N1)+1,

    SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l

    ;

  • mburbea (5/14/2014)


    Usman's reformation of the query to remove the union from "cteStart" and add it to the final result as has a nice appreciable gain in performance. It now seems to beat the CLR splitter. This is ONLY true on SQL Server 2014 unfortunately.

    Here are the optimizations applied from the base "8K splitter" in the updated script file.

    1. Collate binary before doing any string comparison.

    2. Convert row_number to int.

    3. Try a base 20 tally table and doing 3 cross joins instead of 4 (doesn't really seem to change anything).

    4. Perform Usman's reformation. (for Lead that still means charindex no choice here).

    SplitterNametotal_davg_d

    DelimitedSplit8k_LEAD2.2430.0457755102040816

    DelimitedSplit8KB2.5460.0519591836734694

    delimitedSplit8kbsr2.2240.0453877551020408

    split2.2750.0464285714285714

    Both versions seem to be faster than the CLR function 😀

    Since the slowdown seems to be the virtual tally table I'd wonder if a CLR function that just spits out a streaming number table might be able to improve the performance. Unfortunately, I haven't a clue how to write such a thing.

    Here are the functions as they stand:

    BSR

    CREATE FUNCTION [dbo].delimitedSplit8kbsr

    --===== Define I/O parameters

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1

    ),

    cteTally(N) AS (

    SELECT top (ISNULL(DATALENGTH(@pString),0)) CONVERT(int,ROW_NUMBER() OVER (ORDER BY (select null))) FROM E1,E1 b,E1 c

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT t.N+1 FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    ),

    cteLen(N1,L1) AS(

    SELECT s.N1,

    ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN,s.N1) ,0)-s.N1,8000)

    FROM cteStart s

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    select itemNumber = 1,

    item = substring(@pstring,1,ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN),0)-1,8000))

    union all

    SELECT ROW_NUMBER() OVER(ORDER BY N1)+1,

    SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l

    ;

    LEAD

    CREATE FUNCTION dbo.DelimitedSplit8K_LEAD

    --===== Define I/O parameters

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ),

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) CONVERT(int,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) FROM E1,E1 b,E1 c

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT t.N+1 FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    ),

    cteLen(N1,L1) AS(

    SELECT s.N1,

    lead(n1,1,16000) over (order by n1) - 1 -n1

    FROM cteStart s

    )

    --===== Do the actual split.

    SELECT ItemNumber = 1,

    Item =substring(@pstring,1,ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN),0)-1,8000))

    UNION ALL

    SELECT ROW_NUMBER() OVER(ORDER BY l.N1)+1,

    SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l

    ;

    What machine are you on :hehe: I dont have SQL 2014 yet, so cannot test it myself. But I guess one extra bit of performance may still be possible.

    CREATE FUNCTION [dbo].delimitedSplit8kbsr

    --===== Define I/O parameters

    (@pString VARCHAR(8000), @pDelimiter VARCHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1

    ),

    cteTally(N) AS (

    SELECT top (ISNULL(DATALENGTH(@pString),0)) CONVERT(int,ROW_NUMBER() OVER (ORDER BY (select null))) FROM E1,E1 b,E1 c

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT t.N+1 FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    ),

    cteLen(N1,L1) AS(

    SELECT s.N1,

    ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN,s.N1) ,0)-s.N1,8000)

    FROM cteStart s

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    select itemNumber = convert(bigint,0),

    item = substring(@pstring,1,ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN),0)-1,8000))

    union all

    SELECT ROW_NUMBER() OVER(ORDER BY N1),--+1 not needed if we start from zero

    SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l

    ;

    LEAD

    CREATE FUNCTION dbo.DelimitedSplit8K_LEAD

    --===== Define I/O parameters

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ),

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) CONVERT(int,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) FROM E1,E1 b,E1 c

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT t.N+1 FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    ),

    cteLen(N1,L1) AS(

    SELECT s.N1,

    lead(n1,1,16000) over (order by n1) - 1 -n1

    FROM cteStart s

    )

    --===== Do the actual split.

    SELECT ItemNumber = convert(bigint,0),

    Item =substring(@pstring,1,ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN),0)-1,8000))

    UNION ALL

    SELECT ROW_NUMBER() OVER(ORDER BY l.N1),--+1 not needed if we start from zero

    SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l

    ;

    Although I am sure on larger strings CLR can easily beat these solutions. So it all depends upon the requirements.

  • Is everyone running Jeff's original/updated test rig from the article to get these results?

    If so, you might not be testing what you think you are. There is a weakness in the test script that can result in the Csv8K table containing duplicates (i.e. every row is the same). This is related to hiding the NEWID call in the view. In (at least) SQL Server 2012 and 2014, this NEWID value may be cached, resulting in every row being the same.

    It is obviously not a very good test to simply split the string once, cache the results in a table spool, and replay that for each duplicate row from the Csv8K table. This particularly (and unfairly) benefits the non-CLR splitters by the way.

    As a quick hack fix, you can use undocumented trace flag 8690 to prevent caching the NEWID value in this part of the script:

    SELECT @sql = ISNULL(@SQL,'')+

    '

    IF OBJECT_ID(''dbo.Csv8K'',''U'') IS NOT NULL DROP TABLE dbo.Csv8K;

    SELECT *

    INTO dbo.Csv8K

    FROM dbo.CreateCsv8K

    ('+CAST(NumberOfRows AS VARCHAR(10))+', '

    +CAST(NumberOfElements AS VARCHAR(10))+', '

    +CAST(MinElementLength AS VARCHAR(10))+', '

    +CAST(MaxElementLength AS VARCHAR(10))+') OPTION (QUERYTRACEON 8690); --# of Rows, # of Elements, MIN element length, MAX element length

    EXEC dbo.TestEachFunction '+CAST(NumberOfRows AS VARCHAR(10)) +', '

    +CAST(NumberOfElements AS VARCHAR(10))+', '

    +CAST(MinElementLength AS VARCHAR(10))+', '

    +CAST(MaxElementLength AS VARCHAR(10))+';

    '

    FROM cteControl

    Nevertheless, you need to be careful to check the Csv8K table to check the rows are actually different, as intended.

    One more thing. The CLR code isn't fully compiled to native code on creation. The first call to the function may take a full second or more as just-in-time compilation takes place. This also seems unfair, so I add an initial do-nothing split just after the CLR function creation to the test script. This ensures at least most of the CLR is compiled before the main tests start:

    SELECT * FROM dbo.Split(N'A,B,C,D', N',');

    FWIW a test I just ran (on SQL Server 2014 RTM CU1) with the fixes in place produced:

    +-----------------------------------------------------+

    ¦ SplitterName ¦ TotalDuration ¦ AvgDuration ¦

    ¦-----------------------+---------------+-------------¦

    ¦ DelimitedSplit8K_LEAD ¦ 19.07600 ¦ 0.389306 ¦

    ¦ Split ¦ 12.35800 ¦ 0.252204 ¦

    +-----------------------------------------------------+

    BTW, that's CLR Split vs Usman's latest version of DelimitedSplit8K_LEAD.

  • Viewing 15 posts - 601 through 615 (of 990 total)

    You must be logged in to reply to this topic. Login to reply