Tally OH! An Improved SQL 8K “CSV Splitter” Function

Question

Tally OH! An Improved SQL 8K “CSV Splitter” Function

Viewing 15 posts - 361 through 375 (of 991 total)

You must be logged in to reply to this topic. Login to reply

Lynn Pettis SSC Guru Points: 442470 More actions · Answer 1

ChrisM@Work (8/23/2012)
mtassin (8/23/2012)
I'm not seeing the XML trouncing the Delimittedsplit8k tvf though.
....
Nor me. I've plugged it into the splitter test harness from the article attachent and on some permutations of the csv file it's dog slow. In fact, the dog appears to have missing limbs ("Take me out for a drag").
It does have a couple of plus points though - as the OP pointed out, it appears to be pretty good at super-long strings, and if you want to split a string without using a function i.e. inlined with the rest of the query, then it's really compact.

Any one done any testing yet on it with the varchar/nvarchar(max) yet?

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 2

Lynn Pettis (8/23/2012)
ChrisM@Work (8/23/2012)
mtassin (8/23/2012)
I'm not seeing the XML trouncing the Delimittedsplit8k tvf though.
....
Nor me. I've plugged it into the splitter test harness from the article attachent and on some permutations of the csv file it's dog slow. In fact, the dog appears to have missing limbs ("Take me out for a drag").
It does have a couple of plus points though - as the OP pointed out, it appears to be pretty good at super-long strings, and if you want to split a string without using a function i.e. inlined with the rest of the query, then it's really compact.
Any one done any testing yet on it with the varchar/nvarchar(max) yet?

Not yet Lynn, but it has to be done.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Usman Butt Hall of Fame Points: 3186 More actions · Answer 3

Lynn Pettis (8/23/2012)
ChrisM@Work (8/23/2012)
mtassin (8/23/2012)
I'm not seeing the XML trouncing the Delimittedsplit8k tvf though.
....
Nor me. I've plugged it into the splitter test harness from the article attachent and on some permutations of the csv file it's dog slow. In fact, the dog appears to have missing limbs ("Take me out for a drag").
It does have a couple of plus points though - as the OP pointed out, it appears to be pretty good at super-long strings, and if you want to split a string without using a function i.e. inlined with the rest of the query, then it's really compact.
Any one done any testing yet on it with the varchar/nvarchar(max) yet?

I did some testing and shared it a while ago on this same thread but with some different code :hehe:

I also tested this new inline code then but with the check needed not to handle special characters like "<" ">" etc. by using XML PATH. Otherwise, it may result in error OR false results.

At that time it seemed slower in my testing so I stuck to Oleg's XML solution.

Usman Butt Hall of Fame Points: 3186 More actions · Answer 4

ChrisM@Work (8/23/2012)
mtassin (8/23/2012)
I'm not seeing the XML trouncing the Delimittedsplit8k tvf though.
....
Nor me. I've plugged it into the splitter test harness from the article attachent and on some permutations of the csv file it's dog slow. In fact, the dog appears to have missing limbs ("Take me out for a drag").
It does have a couple of plus points though - as the OP pointed out, it appears to be pretty good at super-long strings, and if you want to split a string without using a function i.e. inlined with the rest of the query, then it's really compact.

Till VARCHAR(8000), the original splitter (Not to forget modified versions perform better), beats all other solutions. I did come up with a physical tally table solution here on this same thread. But the whole game changes when we are talking of VARCHAR(MAX) strings.

But I must state that this article was clearly meant for max 8000 character strings, so comparing the same for larger strings does not make any point at all.

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 5

Wow! You folks have been busy!

I've recently come across the very iTVF XML example you golks are working with and was going to setup to test it but all of you have beaten me to the punch.

Cursory testing on my part shows that the XML will beat the DelimitedSplit8K function but ONLY if you remove the ROW_NUMBER() function to get rid of the ItemNumber column.

The DelimitedSplit8K function is so finely tuned for what it does that if you remove the ItemNumber column from it or simply don't use it in the output, the DelimitedSplit8K function will lose to the XML (but not by much).

And, yes... the DelimitedSplit8K function was written to be as fast as possible up to 8K. It was never intended to be used to split MAX datatypes because just joining a Tally CTE to a blob will cause the code to run at almost half speed even if the blob is less than 8K. It's just one of those facts of life.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150468 More actions · Answer 6

mburbea (8/22/2012)
While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.
The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

Discovered? It bears an amazing similarity to Brad Shultz's method (where he also talks of XML streaming being the problem - which he is wrong about, incidentally) at http://bradsruminations.blogspot.co.nz/2010/01/delimited-string-tennis-again-final.html

Only one of the functions is inline, the xmlify function is not an iTVF (i = inline) it is a scalar function.

The biggest problem with XML methods (aside from being a little intellectually-offensive) is that they don't always work. Try any of these:

SET @str = '<a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

SET @str = 'a&b;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

SET @str = 'a;;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Paul White SSC Guru Points: 150468 More actions · Answer 7

ChrisM@Work (8/23/2012)
UNION ALL SELECT NULL -- eliminated by CROSS APPLY

The NULL is not eliminated by the CROSS APPLY. You can increase efficiency a little by excluding the NULL with a final:

WHERE cte.xmlstring IS NOT NULL

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Usman Butt Hall of Fame Points: 3186 More actions · Answer 8

SQL Kiwi (8/23/2012)
mburbea (8/22/2012)
While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.
The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).
Discovered? It bears an amazing similarity to Brad Shultz's method (where he also talks of XML streaming being the problem - which he is wrong about, incidentally) at http://bradsruminations.blogspot.co.nz/2010/01/delimited-string-tennis-again-final.html
Only one of the functions is inline, the xmlify function is not an iTVF (i = inline) it is a scalar function.
The biggest problem with XML methods (aside from being a little intellectually-offensive) is that they don't always work. Try any of these:
SET @str = '<a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'
SET @str = 'a&b;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'
SET @str = 'a;;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

Nice. Even entitization would not make it work as "<" would be replaced by lt; and then if we replace semicolon it will give error of missing semicolon. I guess the same case would be if the delimiter is & (which is more dangerous as no error would be prompted)? Is it only these two characters which can cause the problem?

It also make me think the same would fail if the delimited string contains &lt (EDIT: its "&"+"l"+"t") OR & g t 🙁 So would you recommend staying away from xml splitter if it has to be a generic function? Of-course, people with prior knowledge of their data patterns may always live with that?

Agghhh...Now more I think of the hazards, more I tilt towards the use of RCTE solution instead of the xml solution (in case of N/VARCHAR(MAX) strings only. Your CLR is already recommended by me. But I do not see the Shop allowing CLR :crazy:)

Usman Butt Hall of Fame Points: 3186 More actions · Answer 9

SQL Kiwi (8/23/2012)
ChrisM@Work (8/23/2012)
UNION ALL SELECT NULL -- eliminated by CROSS APPLY
The NULL is not eliminated by the CROSS APPLY.

I will second that CROSS APPLY will eliminate the NULL xml row as we are shredding the xml specifically for <r> node OR am I missing something?

Paul White SSC Guru Points: 150468 More actions · Answer 10

Usman Butt (8/24/2012)
SQL Kiwi (8/23/2012)
ChrisM@Work (8/23/2012)
UNION ALL SELECT NULL -- eliminated by CROSS APPLY
The NULL is not eliminated by the CROSS APPLY.
I will second that CROSS APPLY will eliminate the NULL xml row as we are shredding the xml specifically for <r> node OR am I missing something?

This is not a democracy. The NULL is on the driving side of the APPLY. Check the execution plan, two rows are produced and the joins are executed twice:

With the WHERE clause I suggested, only one row drives the joins:

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Paul White SSC Guru Points: 150468 More actions · Answer 11

Usman Butt (8/24/2012)
Nice. Even entitization would not make it work as "<" would be replaced by lt; and then if we replace semicolon it will give error of missing semicolon. I guess the same case would be if the delimiter is & (which is more dangerous as no error would be prompted)? Is it only these two characters which can cause the problem?

Any of the entitized characters could cause problems; there are a couple of others too (I think CHAR(31), but not sure I'm remembering that correctly). Anyway, failing with < or > is enough for me to avoid this technique.

So would you recommend staying away from xml splitter if it has to be a generic function? Of-course, people with prior knowledge of their data patterns may always live with that?

I would not use it myself. Why choose a method that might fail in future over another that will always work?

Agghhh...Now more I think of the hazards, more I tilt towards the use of RCTE solution instead of the xml solution (in case of N/VARCHAR(MAX) strings only. Your CLR is already recommended by me. But I do not see the Shop allowing CLR :crazy:)

I rarely have the need to split strings. My first choice is always the CLR solution. If I were forced to use a T-SQL solution, it would depend on the circumstances.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 12

SQL Kiwi (8/23/2012)
mburbea (8/22/2012)
While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.
The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).
Discovered? It bears an amazing similarity to Brad Shultz's method (where he also talks of XML streaming being the problem - which he is wrong about, incidentally) at http://bradsruminations.blogspot.co.nz/2010/01/delimited-string-tennis-again-final.html

"Discovered" is fine in this context Paul - the OP isn't claiming to have discovered it first, only that he figured it out. Many folks here will have found themselves in this situation.

SQL Kiwi (8/23/2012)
ChrisM@Work (8/23/2012)
UNION ALL SELECT NULL -- eliminated by CROSS APPLY
The NULL is not eliminated by the CROSS APPLY. You can increase efficiency a little by excluding the NULL with a final:
WHERE cte.xmlstring IS NOT NULL

Heh – that’s what happens when I make an assumption. Thanks for pointing it out.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Paul White SSC Guru Points: 150468 More actions · Answer 13

ChrisM@Work (8/24/2012)
"Discovered" is fine in this context Paul - the OP isn't claiming to have discovered it first, only that he figured it out. Many folks here will have found themselves in this situation.

I won't argue with your opinion. I would liked to have seen a link to Brad's original work in mburbea's post though, if that was the source. It is a good habit to get into to credit people.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 14

SQL Kiwi (8/24/2012)
...This is not a democracy....

Absolutely. The weight of opinion won't magically change the way SQL Server works, or anything else for that matter - and yet it's depressingly common. A recent thread gives some indication of how democratised functionality can go horribly wrong.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Dwain Camps SSC Guru Points: 86978 More actions · Answer 15

ChrisM@Work (8/24/2012)
SQL Kiwi (8/24/2012)
...This is not a democracy....
Absolutely. The weight of opinion won't magically change the way SQL Server works, or anything else for that matter - and yet it's depressingly common. A recent thread gives some indication of how democratised functionality can go horribly wrong.

That thread looks more like an inverse meritocracy.

What's that called? "Idiocracy" perhaps?

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St