Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Jeff Moden (4/14/2014)


    I agree. A CLR would be the best way and you made it easy enough for everyone to use. Thanks. The "Split" function in the original article was much faster than any of the T-SQL methods.

    I don't know for sure that the "max" version with multi-character delimiters I posted will beat Miller's method, but it's there if (s)he wants to try it.

    The only thing I don't like about the one that Adam posted is that it ignores leading/trailing and consecutive delimiters. For the kind of work I do, those are important especially for "ragged right" splits.

I think we had this conversation a long time ago, which led me to revise it. The version I presented works as you would wish, I think.

    Works for me, and I haven't edited it? No worries either way.

  • FYI: Ran the function through the test script with the same result aggregation as I used for the article.

    😎

SplitterName                 TOTAL_D      AVG_D
---------------------------- ------------ -----------
DelimitedSplit8K_T1_LEAD_V2  18302.00000  373.510204
DelimitedSplit8K_T1_LEAD     18587.00000  379.326530
BetterSplit                  20307.00000  414.428571

  • Paul White (4/14/2014)


    At the risk of making a predictable contribution at this point, based on Miller's stated requirements for long Unicode strings and multi-character delimiters:

...

    Based on http://sqlblog.com/blogs/adam_machanic/archive/2009/04/28/sqlclr-string-splitting-part-2-even-faster-even-more-scalable.aspx

This version uses .NET to supply the substrings, an approach that so far I have seen all splitters take.

A while back I wrote one that just delivers the locations of the substrings back to SQL Server. The idea is that the .NET-to-SQL interface is leaner, and it would allow SQL Server to be smarter about its memory usage, for example by referencing partial strings internally instead of making fresh copies, which would hardly consume any additional memory. It also guarantees that the resulting datatype and collation are in line with the original input, something no other version I have seen does, and it means no conversions from the static Unicode result that .NET delivers are needed when parsing varchar input (saving both memory and time).

I never got to run benchmark tests for it, as it is part of a larger, more complete library which I haven't fully designed and finished yet. Too bad I ran out of time and had to focus on other things, but it will be done at some point. I think the standard way of splitting has room to be improved upon, efficiency-wise.
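As a rough illustration of the positions-only idea, here is a sketch of how a caller might consume it; dbo.SplitPositions is a hypothetical function assumed for this example only (it is not code from this thread). Because the caller rebuilds each element from the original string, the result keeps the input's exact datatype and collation:

-- Hypothetical sketch: dbo.SplitPositions is assumed to return (ItemNumber, Start, Length)
-- instead of the substrings themselves.
DECLARE @s varchar(8000) = 'A,B,C';

SELECT p.ItemNumber,
       Item = SUBSTRING(@s, p.Start, p.Length)  -- stays varchar, original collation
FROM dbo.SplitPositions(@s, ',') AS p;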

It is unfortunate that there are some seemingly inexplicable limitations in what you can do with T-SQL.

For example, I wondered what would happen if we removed all usage of SUBSTRING and simply converted the delimited string into scripted data with a single execution of the REPLACE function, then executed the dynamic SQL to generate the resultant table. This works great and is incredibly fast, but for several reasons it will only work as a stored procedure, not as a function.

Below is a procedure that does the split with no tally table, no SUBSTRING, and no CHARINDEX.

CREATE PROCEDURE [dbo].[TallyNo]
    (@L NVARCHAR(MAX), @D NVARCHAR(100))
AS
DECLARE @E NVARCHAR(MAX);  -- NVARCHAR rather than VARCHAR so Unicode input is not mangled
-- Note: items containing single quotes will break the dynamic SQL as written.
SET @E = 'SELECT ItemNumber=ROW_NUMBER()OVER(ORDER BY(SELECT 1)),Item FROM(VALUES(''' + REPLACE(@L, @D, '''),(''') + '''))V(Item)';
EXEC(@E);
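A quick usage sketch of the procedure above (my example, not from the post):

EXEC dbo.TallyNo @L = N'A,B,C', @D = N',';
-- Expected result: three rows, ItemNumber 1 to 3 with Items 'A', 'B', 'C'
-- (the ROW_NUMBER ordering relies on the VALUES constructor being read in order).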

  • Paul White (4/14/2014)


    Jeff Moden (4/14/2014)


    I agree. A CLR would be the best way and you made it easy enough for everyone to use. Thanks. The "Split" function in the original article was much faster than any of the T-SQL methods.

    I don't know for sure that the "max" version with multi-character delimiters I posted will beat Miller's method, but it's there if (s)he wants to try it.

    The only thing I don't like about the one that Adam posted is that it ignores leading/trailing and consecutive delimiters. For the kind of work I do, those are important especially for "ragged right" splits.

I think we had this conversation a long time ago, which led me to revise it. The version I presented works as you would wish, I think.

    Works for me, and I haven't edited it? No worries either way.

    Apologies for the late reply.

We did indeed, and thank you very much for not only tweaking the CLR, Paul, but for offering it up to begin with and making it easy to instantiate. I don't even know how to spell C# and couldn't have done it on my own. You really came through on that and made an essential comparison possible. Thank you very much, again!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SQL 2014 is kinda cool...

    I just tried out using a memory optimized table for the Tally / Numbers table in DelimitedSplit8K and two things stand out.

    1. It seems to be much quicker than SQL 2012 generally.

2. The memory optimised Tally splitter apparently wipes the floor with the stacked CTE / IBG-style computed tally.

    For those interested, attached results from my desktop PC.

    DelimitedSplit8K_M_O is this:

CREATE FUNCTION [dbo].[DelimitedSplit8K_M_O]
--===== Define I/O parameters
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== SQL 2014 Memory Optimised "Tally Table" produces values from 1 up to 1,000,000...
--      enough to cover VARCHAR(8000)
WITH
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
        SELECT 1 UNION ALL
        SELECT t.N + 1
          FROM dbo.moTally t
         WHERE SUBSTRING(@pString, t.N, 1) = @pDelimiter
           AND t.N <= ISNULL(DATALENGTH(@pString), 0)
        ),
cteLen(N1, L1) AS (--==== Return start and length (for use in SUBSTRING)
        SELECT s.N1,
               ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, s.N1), 0) - s.N1, 8000)
          FROM cteStart s
        )
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY l.N1),
       Item       = SUBSTRING(@pString, l.N1, l.L1)
  FROM cteLen l
;

    The MO Tally is this:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[moTally]
(
    [N] [int] NOT NULL,
    CONSTRAINT [pk_moTally] PRIMARY KEY NONCLUSTERED ([N] ASC)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
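The post does not show how moTally gets its rows; here is a minimal population sketch (my assumption, not the poster's script), followed by a quick call of the splitter above:

-- Sketch (assumed): load 1,000,000 sequential integers into the memory optimized tally table.
WITH E1(n) AS (SELECT 1 FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)), -- 10 rows
     E3(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c),                 -- 1,000 rows
     E6(n) AS (SELECT 1 FROM E3 a CROSS JOIN E3 b)                                  -- 1,000,000 rows
INSERT INTO dbo.moTally (N)
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E6;

-- The splitter can then be called like any inline table-valued function:
SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K_M_O('A,B,C', ',');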

    MM




  • You could also play around with a native-compiled procedure, something like:

-- Memory optimized table type
CREATE TYPE dbo.SplitReturn AS TABLE
(
    ItemNumber integer IDENTITY PRIMARY KEY NONCLUSTERED,
    Item varchar(8000) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON);
GO

CREATE PROCEDURE dbo.Split
    @pString varchar(8000),
    @pDelimiter char(1)
WITH
    NATIVE_COMPILATION,
    SCHEMABINDING,
    EXECUTE AS OWNER
AS
BEGIN ATOMIC
WITH
(
    TRANSACTION ISOLATION LEVEL = SNAPSHOT,
    LANGUAGE = 'us_english'
)
    DECLARE @return AS dbo.SplitReturn;
    DECLARE @pos AS integer = 1;
    DECLARE @Last AS integer = 1;

    WHILE @pos <= LEN(@pString)
    BEGIN
        IF SUBSTRING(@pString, @pos, 1) = @pDelimiter
        BEGIN
            IF @pos = 1
                INSERT @return (Item)
                VALUES (SUBSTRING(@pString, @Last, @pos - @Last + 1));
            ELSE
                INSERT @return (Item)
                VALUES (SUBSTRING(@pString, @Last, @pos - @Last));

            SET @Last = @pos + 1;
        END;

        SET @pos += 1;
    END;

    INSERT @return (Item)
    VALUES (SUBSTRING(@pString, @Last, @pos - @Last + 1));

    SELECT
        R.ItemNumber,
        R.Item
    FROM @return AS R
    ORDER BY
        R.ItemNumber;
END;
GO

-- Example usage
EXECUTE dbo.Split
    @pString = 'A**B*C*',
    @pDelimiter = '*';

    Not very useful given that it's limited to splitting one string per procedure call, I admit 🙂

    It's a shame functions don't support native compilation, and crazy that basics like CHARINDEX aren't supported for native, but maybe we'll see these things in a future version.

I was really looking forward to that type of functionality. It IS a step in the right direction, though. Thanks, Paul.

    --Jeff Moden



FYI: ran the test on the CLR Split, MM's memory optimized version, and the LEAD window function version. Hardly any benefit from using the window function: 0.001 on average and 0.033 in total runtime. Still 25% faster than the CLR.

    😎

SplitterName               TOTAL_D  AVG_D
-------------------------- -------- ---------
DelimitedSplit8K_M_O_LEAD  2.96400  0.060489
DelimitedSplit8K_M_O       2.99700  0.061163
Split                      4.06600  0.082979
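For readers who have not seen the LEAD variant referenced in these results, here is a rough sketch of the idea (my reconstruction; the exact DelimitedSplit8K_M_O_LEAD code tested above is not posted in this exchange). The CHARINDEX probe for each element's length is replaced by a LEAD window function over the start positions:

-- Sketch (assumed): a LEAD-based variant of the DelimitedSplit8K_M_O shape shown earlier.
-- Element length = next start position - 1 - current start; ISNULL handles the final element.
CREATE FUNCTION dbo.DelimitedSplit8K_M_O_LEAD_sketch
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH cteStart(N1) AS
(
    SELECT 1 UNION ALL
    SELECT t.N + 1
      FROM dbo.moTally t
     WHERE SUBSTRING(@pString, t.N, 1) = @pDelimiter
       AND t.N <= ISNULL(DATALENGTH(@pString), 0)
)
SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY s.N1),
       Item       = SUBSTRING(@pString, s.N1,
                        ISNULL(LEAD(s.N1) OVER (ORDER BY s.N1) - s.N1 - 1, 8000))
  FROM cteStart s;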

  • Eirikur Eiriksson (4/30/2014)


FYI: ran the test on the CLR Split, MM's memory optimized version, and the LEAD window function version. Hardly any benefit from using the window function: 0.001 on average and 0.033 in total runtime. Still 25% faster than the CLR.

The memory optimized version of the tally splitter is certainly an improvement, and it is very competitive with the CLR splitter, especially on small strings, where it can be faster. My tests show it is still 50% slower for long strings with many elements.

    Bear in mind that the memory optimized improvement requires SQL Server 2014 Enterprise Edition. Also, as currently written, it does not handle Unicode, and is limited to 8000 non-Unicode characters.

    None of these limitations apply to the CLR splitter; it works on all versions and editions of SQL Server from 2005 onward, and handles Unicode and max-length strings all in one.

    😎

  • Paul White (4/30/2014)


    Eirikur Eiriksson (4/30/2014)


FYI: ran the test on the CLR Split, MM's memory optimized version, and the LEAD window function version. Hardly any benefit from using the window function: 0.001 on average and 0.033 in total runtime. Still 25% faster than the CLR.

The memory optimized version of the tally splitter is certainly an improvement, and it is very competitive with the CLR splitter, especially on small strings, where it can be faster. My tests show it is still 50% slower for long strings with many elements.

    Bear in mind that the memory optimized improvement requires SQL Server 2014 Enterprise Edition. Also, as currently written, it does not handle Unicode, and is limited to 8000 non-Unicode characters.

    None of these limitations apply to the CLR splitter; it works on all versions and editions of SQL Server from 2005 onward, and handles Unicode and max-length strings all in one.

    😎

No doubt about the CLR splitter being generally better; my point is that when using the window function in the normal DelimitedSplit8K splitter, it is up to 50% faster than the CHARINDEX one. With the memory optimized tally table, the difference only starts to show on longer strings; in fact, the window function version was faster than the CLR all the way through (see the attachment).

    😎

  • Eirikur Eiriksson (4/30/2014)


No doubt about the CLR splitter being generally better; my point is that when using the window function in the normal DelimitedSplit8K splitter, it is up to 50% faster than the CHARINDEX one.

    Forgive me, but I didn't get any sense of that point from the post I replied to. Never mind.

With the memory optimized tally table, the difference only starts to show on longer strings; in fact, the window function version was faster than the CLR all the way through (see the attachment).

    The LEAD version is indeed impressive (albeit with limitations).

  • Eirikur Eiriksson (4/30/2014)


FYI: ran the test on the CLR Split, MM's memory optimized version, and the LEAD window function version. Hardly any benefit from using the window function: 0.001 on average and 0.033 in total runtime. Still 25% faster than the CLR.

    😎

SplitterName               TOTAL_D  AVG_D
-------------------------- -------- ---------
DelimitedSplit8K_M_O_LEAD  2.96400  0.060489
DelimitedSplit8K_M_O       2.99700  0.061163
Split                      4.06600  0.082979

    Well, that saved me the effort - that was going to be my next test 😀

Paul - absolutely agree that the CLR implementation is the nuts; what we are seeking is incremental improvements to the non-CLR version (despite its limitations in this form) for those systems that will not or cannot use CLR.

    MM




Have you tried switching the memory optimized tally table's durability to just schema? I read that a table will perform best in this mode, and it's pretty easy to add a startup script for the database to populate the tally table with rows.
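For reference, a minimal sketch (my assumption; not posted in the thread) of what the SCHEMA_ONLY variant being discussed would look like. Rows are lost on restart, so a startup script would have to repopulate it:

-- Sketch (assumed): the same tally table declared with SCHEMA_ONLY durability.
CREATE TABLE [dbo].[moTally_SchemaOnly]
(
    [N] [int] NOT NULL,
    CONSTRAINT [pk_moTally_SchemaOnly] PRIMARY KEY NONCLUSTERED ([N] ASC)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);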

  • mburbea (5/9/2014)


Have you tried switching the memory optimized tally table's durability to just schema? I read that a table will perform best in this mode, and it's pretty easy to add a startup script for the database to populate the tally table with rows.

    Yes, the performance was almost identical to the millisecond, which seems unlikely but that's what happened.

    MM



