Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • ChrisM@Work (2/27/2013)


    ...

    About once a week, mate. Now I'm an old git I can blame it on declining memory. When I was a sprightly youngster it was more embarrassing.

    I know what you mean, at least you are in a "safe place" here 🙂

    It's worse when you make the silly mistakes in front of a client, insisting something is wrong or right when it patently isn't :ermm:

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • mister.magoo (2/27/2013)


    ChrisM@Work (2/27/2013)


    ...

    About once a week, mate. Now I'm an old git I can blame it on declining memory. When I was a sprightly youngster it was more embarrassing.

    I know what you mean, at least you are in a "safe place" here 🙂

    It's worse when you make the silly mistakes in front of a client, insisting something is wrong or right when it patently isn't :ermm:

    I know of at least one other old git who will be cackling his face off reading these posts :laugh:

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (2/27/2013)


    I know of at least one other old git who will be cackling his face off reading these posts :laugh:

    Oi!

  • Paul White (2/27/2013)


    ChrisM@Work (2/27/2013)


    I know of at least one other old git who will be cackling his face off reading these posts :laugh:

    Oi!

    Heh I didn't mean you specifically Paul, though you do have some history of spotting my embarrassing coding faux pas 😛

    Anyway, isn't it past your bedtime?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (2/27/2013)


    Anyway, isn't it past your bedtime?

    We can no longer be friends :laugh:

  • Paul White (2/27/2013)


    ChrisM@Work (2/27/2013)


    Anyway, isn't it past your bedtime?

    We can no longer be friends :laugh:

    Slippers are in the post 😀

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (2/27/2013)


    mister.magoo (2/27/2013)


    ChrisM@Work (2/27/2013)


    ...

    About once a week, mate. Now I'm an old git I can blame it on declining memory. When I was a sprightly youngster it was more embarrassing.

    I know what you mean, at least you are in a "safe place" here 🙂

    It's worse when you make the silly mistakes in front of a client, insisting something is wrong or right when it patently isn't :ermm:

    I know of at least one other old git who will be cackling his face off reading these posts :laugh:

    Heh... nah... I've BIN there myself. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Is there a reason that you mention subtracting the length of the final string from 8000 like so

    ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, s.N1), 0), 8000) - s.N1

    and then in the final code you have ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000) I just wanted to make sure I'm understanding the code correctly. It looks like it runs the same either way...

  • Did I miss something? I didn't have the patience to go over all 54 pages of replies. Initially when the code for the cteTally was changed to be faster it used a

    SELECT 0 UNION ALL

    ...

    However in the final function it unions with 1 instead of 0. I tried both versions in the final function and they bot work actually.

    Don't just give the hungry man a fish, teach him how to catch it as well.

    the sqlist

  • Late entry to the discussion. I ran into an issue with a similar string splitter, and tried to break down my findings on this.

    The following code causes an error in SQL Server 2008R2:

    DECLARE @VALUES VARCHAR(8000)

    SET @VALUES = 'E62|E48|E47'

    SELECT

    Item

    FROM dbo.[DelimitedSplit8K](@VALUES, '|')

    WHERE ISNUMERIC(RIGHT(Item, LEN(Item) - 1)) > 0

    The following works

    DECLARE @VALUES VARCHAR(MAX)

    SET @VALUES = 'E62|E48|E47'

    SELECT

    Item, ItemNumber

    FROM dbo.[DelimitedSplit8K](@VALUES, '|')

    WHERE ISNUMERIC(RIGHT(Item, LEN(Item) - 1)) > 0

    Select * also works, as does only ItemNumber.

    The odd thing is that none of the possible result sets give an Item with a length of 0 (that would result in the error). I think something gets optimized so that an error is prematurely raised.

    If I add one more CTE level to the end of the splitter function, specifying that ItemNumber > 0, this error does not occur.

  • michael 77302 (6/26/2013)


    The odd thing is that none of the possible result sets give an Item with a length of 0 (that would result in the error). I think something gets optimized so that an error is prematurely raised.

    If I add one more CTE level to the end of the splitter function, specifying that ItemNumber > 0, this error does not occur.

    It is a little known fact that SQL Server is extremely liberal as to the order in which expressions within the query, including those within the function, are executed. An expression in the select clause that you logically expect to be filtered away CAN BE executed BEFORE the expression(s) that would filter it away and thus generate an exception.

    Your case is a nice example of just how easy it is to overlook such cases, especially when wrapped inside a function. In this case I suspect the separated items, including an empty tail item is fed to your expression in the where clause, before that tail item is removed by an appropiate condition in the function.

    The reason the varchar(max) stops this is most likely a type conversion that changes the order of expression execution into something we humans instinctively expect.

    This version of your code will work without error as the incorrect value -1 for the right function is turned into a null, which is harmless to the right function and causes no side effects for the rest of your query.

    DECLARE @VALUES VARCHAR(8000)

    SET @VALUES = 'E62|E48|E47'

    SELECT

    Item

    FROM dbo.[DelimitedSplit8K](@VALUES, '|')

    WHERE ISNUMERIC(RIGHT(Item, nullif(LEN(Item) - 1, -1))) > 0

    Another tip i can hand you is not to use Len in this instance, as it also right-trims the input.

    This makes it generally speaking a more expensive function then datalength, which does what you expect Len to do here.

  • mister.magoo (2/22/2013)


    What bothered me was that feeling that "someone should have thought of this ages ago, shouldn't they"?

    Always makes me nervous when that is the case...

    Damn. I've been bitten by and fixed the collation problem using a binary collation in other code often enough that I can't think of a good reason why I forgot it for this over the years. Even Paul has reminded me on occasion.

    I'll post a correction to the code in the article and in the attachments when I get a chance. Thank you very miuch for the help. That's why I call this the "community function". Lots of good folks have had a hand in it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • the sqlist (3/18/2013)


    Did I miss something? I didn't have the patience to go over all 54 pages of replies. Initially when the code for the cteTally was changed to be faster it used a

    SELECT 0 UNION ALL

    ...

    However in the final function it unions with 1 instead of 0. I tried both versions in the final function and they bot work actually.

    The change occurred during a couple of those 54 pages of discussion. A couple of folks found another enhancement to make it run even faster than I did (compared to the old function). Then I added a note to the beginning of the article to let folks know where the newer code could be found. Unfortunately, people in a hurry weren't reading the update so I updated the code in the article. I didn't go back to correct the rest of the article, though.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • bboyd019 (3/13/2013)


    Is there a reason that you mention subtracting the length of the final string from 8000 like so

    ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, s.N1), 0), 8000) - s.N1

    and then in the final code you have ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000) I just wanted to make sure I'm understanding the code correctly. It looks like it runs the same either way...

    I believe that's also due to the same reason I mentioned in the post just above this one.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Forgive me if the code to do this has already been posted but I had a requirement for a greater-than-8k splitter. I made the following changes:

    1) Changed @pString to varchar(max)

    2) Added more rows to the tally table (E8(N))

    3) Changed

    ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1, 8000)

    To

    ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1, DATALENGTH(@pString))

    This is the finished product with my comments/changes quoted like this: /** ... **/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    ALTER FUNCTION [dbo].[DelimitedSplitVcMax]

    --===== Define I/O parameters

    (@pString VARCHAR(/**8000**/max), @pDelimiter CHAR(1))

    --Sorry

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 1,000,000...

    -- enough to cover VARCHAR(<a lot>)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows

    /** NEEDED MORE ROWS, ADDED E8; **/

    E8(N) AS (SELECT 1 FROM E4 a, E4 b), --10E+8 or 100,000,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM /**E4**/ E8

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT 1 UNION ALL

    SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter

    ),

    cteLen(N1,L1) AS(--==== Return start and length (for use in substring)

    SELECT s.N1,

    ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1, /**8000**/ /**Addaded: **/ DATALENGTH(@pString))

    FROM cteStart s

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

    Item = SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l;

    This seems correct and produces the right result. It runs twice as slow (which is expected and acceptable [since it's still much, much faster than what I am replacing]). Other than the BIN COLLATION changes that MM mentioned a few pages back - does the code above look good?

    Second Question:

    I have a requirement where I need to split a value on a table that may or may not have columns with 8k characters. If there aren't rows with >8K characters I want it to use the 8K version, otherwise it would require the varchar(max) version (dbo.DelimitedSplitVcMax). Is this a good approach?

    --DDL

    CREATE TABLE dbo.tbl (id int primary key, val varchar(max) not null);

    GO

    IF NOT EXISTS(SELECT * FROM tbl WHERE DATALENGTH(val)>=8000)

    select s.Item

    FROM tbl v

    CROSS APPLY dbo.DelimitedSplit8K(v.val,',') s

    ELSE

    select s.Item

    FROM tbl v

    CROSS APPLY dbo.DelimitedSplitVcMax(v.val,',') s

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Viewing 15 posts - 541 through 555 (of 990 total)

    You must be logged in to reply to this topic. Login to reply