Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Thanks for the great article and the code that goes with it.

    Is there a reason why you chose a VARCHAR(8000) instead of a VARCHAR(MAX)? I modified the provided function to take a (MAX) as a parameter and it seems to work. To account for the increase in size, it uses an existing Tally table rather than the CTE, even though the tested CTE version outperforms the table-based version. That seemed simpler than adding more layers to the compounding CTEs, and the benchmarking was based on generating 10K rows rather than 100K or 1,000K rows.

    Do you see any reason why a VARCHAR(MAX) would not work? I'm still testing it, but your thoughts would be appreciated.
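
    (For reference, a minimal sketch of the modification described above — untested, and assuming a permanent dbo.Tally table of sequential integers starting at 1, big enough to cover the longest input; the function name is made up:)

    CREATE FUNCTION dbo.DelimitedSplitMax_Sketch
    --===== Same I/O as the article's function, but the string parameter is (MAX)
            (@pString VARCHAR(MAX), @pDelimiter CHAR(1))
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    WITH cteTally(N) AS
    ( --==== The permanent dbo.Tally table replaces the cascading CTEs; the 0 provides the "zero base"
     SELECT 0 UNION ALL
     SELECT t.N FROM dbo.Tally t WHERE t.N <= DATALENGTH(ISNULL(@pString,1))
    ),
    cteStart(N1) AS
    ( --==== N+1 = starting position of each "element"
     SELECT t.N+1 FROM cteTally t
      WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0
    )
    SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY s.N1),
           Item       = SUBSTRING(@pString, s.N1,
                          ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,
                                 DATALENGTH(ISNULL(@pString,1))))
      FROM cteStart s;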

  • fahey.jonathan (5/25/2011)


    Is there a reason why you chose a VARCHAR(8000) instead of a VARCHAR(MAX)?

    Performance. It goes wacko as soon as you start sending in > 8000 characters to the function. (However, in my limited testing, the varchar(max) seems to work fine with <= 8000 characters being sent to it.)

    A DelimitedSplitMax version is in the works, but it's not ready for prime time yet.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • fahey.jonathan (5/25/2011)


    Thanks for the great article and the code that goes with it.

    Is there a reason why you chose a VARCHAR(8000) instead of a VARCHAR(MAX)? I modified the provided function to take a (MAX) as a parameter and it seems to work. To account for the increase in size, it uses an existing Tally table rather than the CTE, even though the tested CTE version outperforms the table-based version. That seemed simpler than adding more layers to the compounding CTEs, and the benchmarking was based on generating 10K rows rather than 100K or 1,000K rows.

    Do you see any reason why a VARCHAR(MAX) would not work? I'm still testing it, but your thoughts would be appreciated.

    Yes. VARCHAR(MAX) doesn't like to be joined to. In the testing that I did, just changing the input to VARCHAR(MAX) and still using it for something less than VARCHAR(8000) immediately caused a 2:1 slowdown. Like Wayne said, a VARCHAR(MAX) splitter is in the works using a similar method but it's just not ready yet. Of course, now that you've said what you've said, some retesting for VARCHAR(MAX) appears to be in order across different machines.

    If you can use a CLR like the one in the test attachment to the article, that would be the best thing to do for splitting. Understood if you cannot.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I've tested it using 36014 characters (list of email addresses) and the response time is under 1 second:

    -----------------------
    2011-05-25 12:46:54.640
    2011-05-25 12:46:55.253

    For the few times that I will be using this, that speed is acceptable. I wonder whether using a permanent Tally table makes any difference to the speed, given the larger number of rows that the CTE version has to generate. I did not try the CTE version scaled up to (MAX) size.

    I'm eager to see your revised version when it is complete.
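
    (A minimal sketch of that sort of timing check — untested; the variable name, the ~36K sample string, and the DelimitedSplitMax name for the modified function are all stand-ins:)

    DECLARE @EmailList VARCHAR(MAX) = REPLICATE(CAST('someone@example.com;' AS VARCHAR(MAX)), 1800); -- ~36,000 characters
    SELECT GETDATE(); -- start time
    SELECT COUNT(*) FROM dbo.DelimitedSplitMax(@EmailList, ';'); -- the (MAX)-parameter version under test
    SELECT GETDATE(); -- end time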

  • @Jeff

    Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.

    I have no idea if it can or how it would be implemented, but it was an out-of-the-box idea that just struck me. Thought I would share the idea in case I get struck by a bus.

    Mark

  • mark hutchinson (5/25/2011)


    @Jeff

    Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.

    I have no idea if it can or how it would be implemented, but it was an out-of-the-box idea that just struck me. Thought I would share the idea in case I get struck by a bus.

    Mark

    Good idea, but you just can't split it every 8000 characters. You have to split that varchar(max) into chunks <= 8000 chars, AND split at a delimiter. Any ideas?

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • @Wayne

    It depends on the ability to reliably split the max string at a delimiter into <8000-character chunks. Otherwise, the ends of the parsed string sets would need to be concatenated.

    If I had the delimiter positions, I might be able to split first on the greatest position less than 8000 and then on the greatest delimiter position whose difference from the prior delimiter position was <8000.

    I have no idea how to write the T-SQL nor even if it can be written. It might perform like a dog. I just don't know.

    Mark
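
    (A rough sketch of that idea in T-SQL, for anyone who wants to kick it around — untested, every name in it is made up, and a single element longer than 8000 characters would still defeat it. It walks the string in bites of up to 8000 characters and always cuts each non-final bite at its last delimiter:)

    DECLARE @pString VARCHAR(MAX) = REPLICATE(CAST('abc,de,fghij,' AS VARCHAR(MAX)), 10000), -- sample input
            @pDelim  CHAR(1)      = ',',
            @Pos     BIGINT       = 1, -- current read position in @pString
            @Bite    VARCHAR(8000),
            @Cut     INT;
    DECLARE @Chunks TABLE (ChunkNumber INT IDENTITY(1,1), Chunk VARCHAR(8000));

    WHILE @Pos <= DATALENGTH(@pString)
    BEGIN
        SET @Bite = SUBSTRING(@pString, @Pos, 8000); -- next bite of up to 8000 characters
        IF @Pos + 8000 <= DATALENGTH(@pString) AND CHARINDEX(@pDelim, @Bite) > 0
        BEGIN
            -- Not the final bite: back up to the last delimiter so no element straddles a boundary.
            SET @Cut  = 8000 - CHARINDEX(@pDelim, REVERSE(@Bite)); -- length up to, but excluding, the last delimiter
            SET @Bite = LEFT(@Bite, @Cut);
            SET @Pos  = @Pos + @Cut + 1; -- the +1 skips the delimiter we cut at
        END
        ELSE
            SET @Pos = @Pos + DATALENGTH(@Bite); -- final bite (or one oversized element, which can't be cut cleanly)
        INSERT @Chunks (Chunk) VALUES (@Bite);
    END;

    -- Every chunk is now <= 8000 characters and safe to feed to the article's splitter
    -- (renumbering ItemNumber across chunks is left out of this sketch):
    SELECT c.ChunkNumber, s.ItemNumber, s.Item
      FROM @Chunks c
     CROSS APPLY dbo.DelimitedSplit8K(c.Chunk, @pDelim) s;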

  • mark hutchinson (5/25/2011)


    @Jeff

    Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.

    I have no idea if it can or how it would be implemented, but it was an out-of-the-box idea that just struck me. Thought I would share the idea in case I get struck by a bus.

    Mark

    That's the basis of the code I wrote. It runs very, very fast. Wayne and I have done some back and forth "what ifs"... I just haven't had the time to bring it up to releasable code, yet. By "Releasable Code", I mean code that's been fully documented and tested for functionality, performance, and scalability on at least a half dozen different types of machines across at least 2 revs of SQL. I've just gotta get some time to make the right kind of test harness, test data, and package to send to a very willing and helpful group of external testers. Considering the number of fires I currently have sticks in, it's not going to happen in the next week or two. 😉

    Besides... everyone knows to use a CLR for these types of things, right? :-P:-P:-P:hehe:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/25/2011)


    Besides... everyone knows to use a CLR for these types of things, right? :-P:-P:-P:hehe:

    Yes, I know. But, when you're selling software solutions, you can't mandate their use to your customers.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • mark hutchinson (5/25/2011)


    @Wayne

    It depends on the ability to reliably split the max string at a delimiter into <8000-character chunks. Otherwise, the ends of the parsed string sets would need to be concatenated.

    If I had the delimiter positions, I might be able to split first on the greatest position less than 8000 and then on the greatest delimiter position whose difference from the prior delimiter position was <8000.

    I have no idea how to write the T-SQL nor even if it can be written. It might perform like a dog. I just don't know.

    Mark

    And what happens if the delimiter is at position 8000?

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (5/25/2011)


    Jeff Moden (5/25/2011)


    Besides... everyone knows to use a CLR for these types of things, right? :-P:-P:-P:hehe:

    Yes, I know. But, when you're selling software solutions, you can't mandate their use to your customers.

    Heh... absolutely true. That's why I had all the smiley faces on my last. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • How about this to get rid of the CHARINDEX() string operation:

    CREATE FUNCTION dbo.DelimitedSplit8K
    --===== Define I/O parameters
            (@pString VARCHAR(8000), @pDelimiter CHAR(1))
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
         -- enough to cover VARCHAR(8000)
    WITH E1(N) AS (
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                  ), --10E+1 or 10 rows
         E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
         E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
                        -- for both a performance gain and prevention of accidental "overruns"
                    SELECT 0 UNION ALL
                    SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                   ),
    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
                     SELECT t.N+1
                       FROM cteTally t
                      WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
                    ),
    cteStartEnd(Nb, Ne) AS (-- Associate starting and ending positions
                            -- (8002, not 8001, so a full 8000-character final element keeps its last character)
                            SELECT cs.N1, ISNULL((SELECT MIN(ce.N1) FROM cteStart ce WHERE ce.N1 > cs.N1), 8002)
                              FROM cteStart cs
                           )
    --===== Do the actual split.
    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY se.Nb),
           Item       = SUBSTRING(@pString,se.Nb,se.Ne - se.Nb - 1)
      FROM cteStartEnd se
    ;

    ...how does that affect performance?

    edited to add: Actually, performance goes down a bit because the subquery building cteStartEnd is not indexed. This version does exhibit performance gains in my very modest testing:

    CREATE FUNCTION dbo.DelimitedSplit8K
    --===== Define I/O parameters
            (@pString VARCHAR(8000), @pDelimiter CHAR(1))
    -- Multi-statement form: the indexed table variable below needs DECLARE/INSERT,
    -- which an inline (RETURN-only) TVF cannot contain.
    RETURNS @Split TABLE (ItemNumber BIGINT, Item VARCHAR(8000))
    WITH SCHEMABINDING AS
    BEGIN
     DECLARE @t_delim TABLE(offset SMALLINT PRIMARY KEY); -- the PRIMARY KEY supplies the index

     -- Get the offsets into the in-memory indexed table
     --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
          -- enough to cover VARCHAR(8000)
     WITH E1(N) AS (
                    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                   ), --10E+1 or 10 rows
          E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
          E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
     cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
                         -- for both a performance gain and prevention of accidental "overruns"
                     SELECT 0 UNION ALL
                     SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                    )
     INSERT INTO @t_delim(offset) --==== This stores N+1 (starting position of each "element") once for each delimiter
     SELECT t.N+1
       FROM cteTally t
      WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0);

     WITH cteStartEnd(Nb, Ne) AS ( -- Associate starting and ending positions
          -- (8002, not 8001, so a full 8000-character final element keeps its last character)
          SELECT cs.offset, ISNULL((SELECT MIN(ce.offset) FROM @t_delim ce WHERE ce.offset > cs.offset), 8002)
            FROM @t_delim cs
          )
     --===== Do the actual split.
     INSERT INTO @Split (ItemNumber, Item)
     SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY se.Nb),
            Item       = SUBSTRING(@pString,se.Nb,se.Ne - se.Nb - 1)
       FROM cteStartEnd se;

     RETURN;
    END
    ;

    (Note: I'm doing my version as a SP so the above exact code hasn't actually been tested by me...)
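
    (For anyone following along, a quick usage sketch — the table and column names here are hypothetical — showing how either version of the function above gets applied per row:)

    SELECT src.SomeID, split.ItemNumber, split.Item
      FROM dbo.SomeTable src
     CROSS APPLY dbo.DelimitedSplit8K(src.CsvColumn, ',') split;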

  • mark hutchinson (5/25/2011)


    @Jeff

    Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.

    Mark, we took Jeff's function and expanded it out to varchar(max), and it scales with a fair amount of linearity, though I'm looking forward to seeing Jeff's version with max handled.

    String Length : Elements : CPU Time in ms
    74 : 24 : 2
    148 : 48 : 2
    296 : 96 : 2
    592 : 192 : 3
    1184 : 384 : 32
    2368 : 768 : 7
    4736 : 1536 : 13
    9472 : 3072 : 56
    18944 : 6144 : 49
    37888 : 12288 : 96
    75776 : 24576 : 193
    151552 : 49152 : 385
    303104 : 98304 : 5006
    606208 : 196608 : 10085
    1212416 : 393216 : 19989

    UDF

    /*
    See the following for improvements to this: http://www.sqlservercentral.com/articles/Tally+Table/72993/
    Jeff Moden 2011-05-02
    */
    CREATE FUNCTION udf_StrList2Table (
        @List NVARCHAR(MAX) = N'',
        @Delimiter NCHAR(1) = N','
    )
    RETURNS TABLE WITH SCHEMABINDING
    AS
    RETURN
    /*
    The following inline delimited splitter was written by Jeff Moden.
    http://www.sqlservercentral.com/articles/Tally+Table/72993/
    */
    WITH E1(N) AS
    ( --=== Create Ten 1's
     SELECT 1 UNION ALL SELECT 1 UNION ALL
     SELECT 1 UNION ALL SELECT 1 UNION ALL
     SELECT 1 UNION ALL SELECT 1 UNION ALL
     SELECT 1 UNION ALL SELECT 1 UNION ALL
     SELECT 1 UNION ALL SELECT 1 --10
    ),
    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --100
    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10,000
    E8(N) AS
    (
     --==== Limits the number of rows right up front for both a performance gain and prevention of accidental "overruns"
     SELECT TOP (DATALENGTH(ISNULL(@List,1))) 1 FROM E4 a, E4 b -- 100,000,000
    ),
    cteTally(N) AS
    ( --==== This provides the "zero base"
     SELECT 0 UNION ALL
     SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E8
    ),
    cteStart(N1) AS
    ( --==== This returns N+1 (starting position of each "element" just once for each delimiter)
     SELECT t.N+1 FROM cteTally t WHERE (SUBSTRING(@List,t.N,1) = @Delimiter OR t.N = 0)
    )
    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
    SELECT
        ROW_NUMBER() OVER(ORDER BY s.N1) as ListPos,
        SUBSTRING(@List,s.N1,ISNULL(NULLIF(CHARINDEX(@Delimiter,@List,s.N1),0)-s.N1,DATALENGTH(ISNULL(@List,1)))) as Value
    FROM cteStart s
    GO

    Testing

    DECLARE
        @List NVARCHAR(MAX),
        @Delimiter NCHAR(1) = N',',
        @ListCnt int,
        @StartTime datetime2,
        @EndTime datetime2,
        @ReplicationSeed int = 2,
        @ReplicationMultiplier int = 2

    DECLARE @Results TABLE (ListLength int, ElementCount int, MSTime int)

    WHILE @ReplicationSeed <= 50000
    BEGIN
        SELECT @List = REPLICATE(CAST(N'ab,a,aeae,3,3,a3,23,4,asa,,434,q4,345' as nvarchar(max)), @ReplicationSeed)
        SELECT @StartTime = SYSDATETIME()
        SELECT @ListCnt = COUNT(*) FROM udf_StrList2Table(@List, @Delimiter)
        SELECT @EndTime = SYSDATETIME()
        INSERT INTO @Results (ListLength, ElementCount, MSTime)
        SELECT LEN(@List), LEN(@List) - LEN(REPLACE(@List, ',', '')), DATEDIFF(MS, @StartTime, @EndTime)
        SELECT @ReplicationSeed = @ReplicationSeed * @ReplicationMultiplier
    END

    SELECT * FROM @Results

    /* Anything is possible but is it worth it? */

  • Gatekeeper,

    Be careful testing sample data this way, where you have the constants in the same batch as the code you are benchmarking. The REPLICATE function does not "hide" the final string being benchmarked. There is a reason Jeff's benchmarking involves a table of random strings; one of those reasons is to avoid tainted results by giving the optimizer chances to skip logic it would normally have to perform at runtime.

    I am not saying your test code necessarily falls victim to this, but make sure it does not!

    And thank you for your contribution. I will have to test how that E8 addition affects performance on input that never needs those high-end numbers, and if it doesn't affect it, I am going to incorporate it into the thus-far fastest version and use it in my own projects.

    Cheers!

  • @peter-2

    Jeff's new article describes this method and states that when the number of rows needed is less than what would be produced at or above one of the joining expressions, the optimizer doesn't evaluate that expression.

    Mark
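
    (A quick way to see that behavior for yourself — this just reuses the cascading-CTE pattern from the code above with a small TOP; the assumption is that you run it with the actual execution plan on, where the row counts show the joins stopping early rather than producing E8's hundred million rows:)

    WITH E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1), -- 10 rows
         E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- up to 100
         E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- up to 10,000
         E8(N) AS (SELECT 1 FROM E4 a, E4 b)  -- up to 100,000,000
    SELECT TOP (10) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
      FROM E8; -- the TOP sets a "row goal", so only ~10 rows ever flow through the joins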
