Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • ChrisM@Work (8/23/2012)


    mtassin (8/23/2012)


    I'm not seeing the XML trouncing the Delimittedsplit8k tvf though.

    ....

    Nor me. I've plugged it into the splitter test harness from the article attachent and on some permutations of the csv file it's dog slow. In fact, the dog appears to have missing limbs ("Take me out for a drag").

    It does have a couple of plus points though - as the OP pointed out, it appears to be pretty good at super-long strings, and if you want to split a string without using a function i.e. inlined with the rest of the query, then it's really compact.

    Any one done any testing yet on it with the varchar/nvarchar(max) yet?

  • Lynn Pettis (8/23/2012)


    ChrisM@Work (8/23/2012)


    mtassin (8/23/2012)


    I'm not seeing the XML trouncing the Delimittedsplit8k tvf though.

    ....

    Nor me. I've plugged it into the splitter test harness from the article attachent and on some permutations of the csv file it's dog slow. In fact, the dog appears to have missing limbs ("Take me out for a drag").

    It does have a couple of plus points though - as the OP pointed out, it appears to be pretty good at super-long strings, and if you want to split a string without using a function i.e. inlined with the rest of the query, then it's really compact.

    Any one done any testing yet on it with the varchar/nvarchar(max) yet?

    Not yet Lynn, but it has to be done.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Lynn Pettis (8/23/2012)


    ChrisM@Work (8/23/2012)


    mtassin (8/23/2012)


    I'm not seeing the XML trouncing the Delimittedsplit8k tvf though.

    ....

    Nor me. I've plugged it into the splitter test harness from the article attachent and on some permutations of the csv file it's dog slow. In fact, the dog appears to have missing limbs ("Take me out for a drag").

    It does have a couple of plus points though - as the OP pointed out, it appears to be pretty good at super-long strings, and if you want to split a string without using a function i.e. inlined with the rest of the query, then it's really compact.

    Any one done any testing yet on it with the varchar/nvarchar(max) yet?

    I did some testing and shared it a while ago on this same thread but with some different code :hehe:

    I also tested this new inline code then but with the check needed not to handle special characters like "<" ">" etc. by using XML PATH. Otherwise, it may result in error OR false results.

    At that time it seemed slower in my testing so I stuck to Oleg's XML solution.

  • ChrisM@Work (8/23/2012)


    mtassin (8/23/2012)


    I'm not seeing the XML trouncing the Delimittedsplit8k tvf though.

    ....

    Nor me. I've plugged it into the splitter test harness from the article attachent and on some permutations of the csv file it's dog slow. In fact, the dog appears to have missing limbs ("Take me out for a drag").

    It does have a couple of plus points though - as the OP pointed out, it appears to be pretty good at super-long strings, and if you want to split a string without using a function i.e. inlined with the rest of the query, then it's really compact.

    Till VARCHAR(8000), the original splitter (Not to forget modified versions perform better), beats all other solutions. I did come up with a physical tally table solution here on this same thread. But the whole game changes when we are talking of VARCHAR(MAX) strings.

    But I must state that this article was clearly meant for max 8000 character strings, so comparing the same for larger strings does not make any point at all.

  • Wow! You folks have been busy!

    I've recently come across the very iTVF XML example you golks are working with and was going to setup to test it but all of you have beaten me to the punch.

    Cursory testing on my part shows that the XML will beat the DelimitedSplit8K function but ONLY if you remove the ROW_NUMBER() function to get rid of the ItemNumber column.

    The DelimitedSplit8K function is so finely tuned for what it does that if you remove the ItemNumber column from it or simply don't use it in the output, the DelimitedSplit8K function will lose to the XML (but not by much).

    And, yes... the DelimitedSplit8K function was written to be as fast as possible up to 8K. It was never intended to be used to split MAX datatypes because just joining a Tally CTE to a blob will cause the code to run at almost half speed even if the blob is less than 8K. It's just one of those facts of life.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • mburbea (8/22/2012)


    While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

    The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

    Discovered? It bears an amazing similarity to Brad Shultz's method (where he also talks of XML streaming being the problem - which he is wrong about, incidentally) at http://bradsruminations.blogspot.co.nz/2010/01/delimited-string-tennis-again-final.html

    Only one of the functions is inline, the xmlify function is not an iTVF (i = inline) it is a scalar function.

    The biggest problem with XML methods (aside from being a little intellectually-offensive) is that they don't always work. Try any of these:

    SET @STR = '<a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

    SET @STR = 'a&b;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

    SET @STR = 'a;;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

  • ChrisM@Work (8/23/2012)


    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    The NULL is not eliminated by the CROSS APPLY. You can increase efficiency a little by excluding the NULL with a final:

    WHERE cte.xmlstring IS NOT NULL

  • SQL Kiwi (8/23/2012)


    mburbea (8/22/2012)


    While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

    The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

    Discovered? It bears an amazing similarity to Brad Shultz's method (where he also talks of XML streaming being the problem - which he is wrong about, incidentally) at http://bradsruminations.blogspot.co.nz/2010/01/delimited-string-tennis-again-final.html

    Only one of the functions is inline, the xmlify function is not an iTVF (i = inline) it is a scalar function.

    The biggest problem with XML methods (aside from being a little intellectually-offensive) is that they don't always work. Try any of these:

    SET @STR = '<a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

    SET @STR = 'a&b;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

    SET @STR = 'a;;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

    Nice. Even entitization would not make it work as "<" would be replaced by lt; and then if we replace semicolon it will give error of missing semicolon. I guess the same case would be if the delimiter is & (which is more dangerous as no error would be prompted)? Is it only these two characters which can cause the problem?

    It also make me think the same would fail if the delimited string contains &lt (EDIT: its "&"+"l"+"t") OR & g t 🙁 So would you recommend staying away from xml splitter if it has to be a generic function? Of-course, people with prior knowledge of their data patterns may always live with that?

    Agghhh...Now more I think of the hazards, more I tilt towards the use of RCTE solution instead of the xml solution (in case of N/VARCHAR(MAX) strings only. Your CLR is already recommended by me. But I do not see the Shop allowing CLR :crazy:)

  • SQL Kiwi (8/23/2012)


    ChrisM@Work (8/23/2012)


    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    The NULL is not eliminated by the CROSS APPLY.

    I will second that CROSS APPLY will eliminate the NULL xml row as we are shredding the xml specifically for <r> node OR am I missing something?

  • Usman Butt (8/24/2012)


    SQL Kiwi (8/23/2012)


    ChrisM@Work (8/23/2012)


    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    The NULL is not eliminated by the CROSS APPLY.

    I will second that CROSS APPLY will eliminate the NULL xml row as we are shredding the xml specifically for <r> node OR am I missing something?

    This is not a democracy. The NULL is on the driving side of the APPLY. Check the execution plan, two rows are produced and the joins are executed twice:

    With the WHERE clause I suggested, only one row drives the joins:

  • Usman Butt (8/24/2012)


    Nice. Even entitization would not make it work as "<" would be replaced by lt; and then if we replace semicolon it will give error of missing semicolon. I guess the same case would be if the delimiter is & (which is more dangerous as no error would be prompted)? Is it only these two characters which can cause the problem?

    Any of the entitized characters could cause problems; there are a couple of others too (I think CHAR(31), but not sure I'm remembering that correctly). Anyway, failing with < or > is enough for me to avoid this technique.

    So would you recommend staying away from xml splitter if it has to be a generic function? Of-course, people with prior knowledge of their data patterns may always live with that?

    I would not use it myself. Why choose a method that might fail in future over another that will always work?

    Agghhh...Now more I think of the hazards, more I tilt towards the use of RCTE solution instead of the xml solution (in case of N/VARCHAR(MAX) strings only. Your CLR is already recommended by me. But I do not see the Shop allowing CLR :crazy:)

    I rarely have the need to split strings. My first choice is always the CLR solution. If I were forced to use a T-SQL solution, it would depend on the circumstances.

  • SQL Kiwi (8/23/2012)


    mburbea (8/22/2012)


    While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

    The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

    Discovered? It bears an amazing similarity to Brad Shultz's method (where he also talks of XML streaming being the problem - which he is wrong about, incidentally) at http://bradsruminations.blogspot.co.nz/2010/01/delimited-string-tennis-again-final.html

    "Discovered" is fine in this context Paul - the OP isn't claiming to have discovered it first, only that he figured it out. Many folks here will have found themselves in this situation.

    SQL Kiwi (8/23/2012)


    ChrisM@Work (8/23/2012)


    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    The NULL is not eliminated by the CROSS APPLY. You can increase efficiency a little by excluding the NULL with a final:

    WHERE cte.xmlstring IS NOT NULL

    Heh – that’s what happens when I make an assumption. Thanks for pointing it out.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (8/24/2012)


    "Discovered" is fine in this context Paul - the OP isn't claiming to have discovered it first, only that he figured it out. Many folks here will have found themselves in this situation.

    I won't argue with your opinion. I would liked to have seen a link to Brad's original work in mburbea's post though, if that was the source. It is a good habit to get into to credit people.

  • SQL Kiwi (8/24/2012)


    ...This is not a democracy....

    Absolutely. The weight of opinion won't magically change the way SQL Server works, or anything else for that matter - and yet it's depressingly common. A recent thread gives some indication of how democratised functionality can go horribly wrong.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (8/24/2012)


    SQL Kiwi (8/24/2012)


    ...This is not a democracy....

    Absolutely. The weight of opinion won't magically change the way SQL Server works, or anything else for that matter - and yet it's depressingly common. A recent thread gives some indication of how democratised functionality can go horribly wrong.

    That thread looks more like an inverse meritocracy.

    What's that called? "Idiocracy" perhaps?


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 15 posts - 361 through 375 (of 990 total)

You must be logged in to reply to this topic. Login to reply