Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Jeff Moden (5/2/2011)


    ...

    I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore).

    ...

    Thank you for the excellent reminder; I looked at my ASCII chart for a refresher:

    CHAR(28) --File Separator

    CHAR(29) --Group Separator

    CHAR(30) --Record Separator

    CHAR(31) --Unit Separator

    I move we start trying to use these separator characters again! 30 and 31 are particularly appropriate for the usual two-dimensional split into table techniques.

    On another note, which tricks did I miss for a permanent, 0-based tally table adaptation?

    CREATE FUNCTION [dbo].[YourFunction](@pString [varchar](8000), @pDelimiter [char](1))

    RETURNS TABLE

    RETURN

    WITH cteStart(N1) AS (--==== This limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP ((DATALENGTH(ISNULL(@pString,1)))+1) N+1

    FROM YourDatabase.dbo.YourZeroBasedTallyTable t

    WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)

    ORDER BY N

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))

    FROM cteStart s

    ;

    And Record Separator test:

    select * from YourFunction('first'+CHAR(30)+'second'+CHAR(30)+REPLICATE('last',60),CHAR(30));

    And thank you for giving me a solid reason to finally have a good old-fashioned 0-based set as well.

    ETA: The above single CTE version is very slow for large numbers; see my later posts for double CTE versions that have real speed.

  • Thanks for the excellent article that was both entertaining and instructional. I will be making use of your new splitter soon!

  • I'd be interested in seeing the data you used for benchmarking.

    I tried a method that used a cte to generate a table of offsets for the delimiters. Then I joined that on itself to find the starts and ends and used that to get the substrings from the original string.

    I was curious how that would perform compared to the other methods.

    declare @text varchar(max)

    declare @delimiter char(1)

    declare @len int

    set @text = 'now is the time for all good men to come to the aid of the party'

    set @delimiter = ' '

    set @len = datalength(@text) + 1

    ;with cte1 as (

    select 0 as number, 0 as row

    union

    select number, row_number() over (order by number) as row from Numbers

    where number < @len and substring(@text, number, 1) = @delimiter

    )

    select substring(@text, c1.number + 1, coalesce(c2.number - 1, @len) - c1.number) from cte1 c1

    left join cte1 c2 on c1.row = c2.row - 1

    order by c1.row

    14090 SW TENNESSEE LN

  • gary.rumble (5/2/2011)


    I'd be interested in seeing the data you used for benchmarking.

    I tried a method that used a cte to generate a table of offsets for the delimiters. Then I joined that on itself to find the starts and ends and used that to get the substrings from the original string.

    I was curious how that would perform compared to the other methods.

    Did you miss that Jeff mentioned that his entire test suite was attached to the article? You can just download it and add your version to the test to compare it.

  • mtassin (5/2/2011)


    On another note though, I notice something about the final graph of performance around the 270 mark on the X axis. The pure SQL set based method seems to continue on a straight line path, while the CLR methods seems to be starting to curve upwards similarly to what Jeff was complaining about the original Tally splitter doing, just later in the X series than the original Tally did.

    I too noticed that, and would be interested in seeing the graph compared a few little further out to see if the CLR solution continued it surge to the top.

  • WayneS (5/2/2011)


    mtassin (5/2/2011)


    WayneS (5/2/2011)


    BTW, since the new code was not compared to a physical, permanent tally table, I thought that I would mention that while testing this code out for you, I decided to see how the new splitter with a permanent tally table (as compared to the virtual cte tally table in the article) would compare with all of the others - it's worse than the ctetally splitter, but better than all of the others.

    See, here's something I still don't get... I'm sure somebody can point me to another article that explains it, but how does a CTE tally outperform a table based tally? Wouldn't all the calculations and the lack of a tight clustered index mean that the table based tally should outperform? The memory footprint of the cte tally vs the table tally is about the same, and once the table tally is in cache, that's not an issue, so I don't get it.

    Mark, the only possible explanation that I can give is the elimination of the disk IO. When I ran this test, I fully expected the physical tally table to eek out the ctetally. I was pretty surprised by the results, and ran it multiple times on several VMs to test it.

    I wan't going to post the results of a real Tally Table against the cteTally, but it looks like I'll have to just for the sake of being thorough.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Robert Livermore (5/2/2011)


    Irony.codeplex.com LRLA(1) parser allows the definition for any BNF based grammar. Provided the grammar specifies the literals for double quote and CRLF characters then parser should handle the double quotes including the tricky cases like CRLF inside the double quotes. I pointed you to Comma Separated Value Spec http://www.ietf.org/rfc/rfc4180.txt because the IETF defines CSV in BNF in there specification. It will provide a good tempate to start from to configure the grammar.

    Outstanding information, Robert. I'll spend some time in the links you've provided. I might even get someone to compile the code for me because, although I can read it, I can't spell "C". 🙂

    Thank you again for your thoughtful posts on the subject.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • mtassin (5/2/2011)


    What outcry?

    It was the dust bunnies... 😛 They insisted on an iTVF which required no external considerations. They threatened to replace my beard if I didn't comply. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • sdorris-1150234 (5/2/2011)


    Like I said before, when Jeff speaks - I listen.

    Excellent article, Jeff! If this article was a "tsql how-to" in Barnes & Noble, I would buy it.

    And like I said the first time you said such a thing, I am humbled by your more than generous words. :-):blush: I'm just trying to help where I can and have a little fun doing so. Thank you for your very kind thoughts.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Geoff A (5/2/2011)


    Awesome article. Thanks for the effort, Jeff.

    Thanks for stopping by at work, today, Geoff. It means a lot to me when someone like yourself takes the time to stop by and say that in person. I really appreciate your time.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Lynn Pettis (5/2/2011)


    Just wanted to add my 2 cents as well. Very well done. I will be adding these routines to my toolbox as well.

    I'm curious, I'm wondering if I could make them work in Oracle as well. Maybe I'll give it a shot when I have some time.

    Usage is one of the highest forms of compliments. Thanks, Lynn.

    Shifting gears... you probably already know this so this is for the other folks that may be reading this... Oracle has the same thing as CTE's except they call it "Subquery Refactoring".

    But, for you, you might want to skip even than. Just before I left the world of Oracle, a couple of us got together to find different ways of quickly building Tally-Table-like objects in Oracle. I seem to remember a method using the word CUBE which turned out to be the fastest in our tests back then. I might still have it somewhere and I'll see if I can find it for you. It was only about 2 or 3 lines long, too.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Lynn Pettis (5/2/2011)


    Just wanted to add my 2 cents as well. Very well done. I will be adding these routines to my toolbox as well.

    I'm curious, I'm wondering if I could make them work in Oracle as well. Maybe I'll give it a shot when I have some time.

    "Ah, says I. Oracle, for you good code I found."

    WITH numbers AS (

    SELECT 1 just_a_column

    FROM DUAL

    GROUP BY CUBE(1,2,3,4,5,6,7,8,9,10)

    )

    SELECT ROWNUM

    FROM NUMBERS N1, NUMBERS N2

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • virtualjosh (5/2/2011)


    Good article. I come from an application environment where I had more flexibility to use outside coding. I'm happy to see we can do interesting things while still in the T-SQL domain.

    Thanks, Josh. And that being your first post on this fine site, let me be the first to say "Welcome Aboard!"

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • JJ B (5/2/2011)


    Thanks for the article.

    While not the focus of the article, I wanted to thank you for the Nibbler and Inch Worm sections. I don't know if you came up with those terms, but they are perfect imagery. I didn't even have to read the sections to know exactly what you were talking about. Very nice.

    Congrats on solving the problem.

    Hi JJ,

    I couldn't think of what to call them. So I asked the dust bunnies. 😛

    Thanks for the kind words. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • mark hutchinson (5/2/2011)


    Jeff,

    Thanks for writing both articles. I still don't understand why you use the CharIndex() when you could use a cursor to iterate through the delimiter positions. I think your repeated use of CharIndex() was the unnecessary perf-killer. Please enlighten me.

    I was tempted to suggest joining the delimiter-position table to itself to produce pairs of delimiter positions, but it seems like a less efficient method.

    ====

    Addendum:

    Upon some reflection (after posting the above), I would think that the minimization of all function calls would optimize the process. To that end, I think the fastest process would be to use CharIndex() in a loop to determine the delimited positions, inserting those into a temp table. (maybe adding the zero value along the way). Then use the cursor to iterate the delimiter positions, invoking the SubString() functions.

    ====

    Mark Hutchinson

    😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 46 through 60 (of 990 total)

You must be logged in to reply to this topic. Login to reply