December 28, 2012 at 1:52 pm
Sean Lange (12/28/2012)
telcogod (12/28/2012)
what about xml?
Did you read the whole article and look at the performance comparisons? This style of xml splitter was on the list and it was not as fast.
As Jeff said above, there is a full test harness posted. Try out your xml splitter next to the DelimitedSplit8K. You might be surprised.
I ran Jeff's test script and DelimitedSplit8K is the clear winner for splitting one-dimensional delimited string arrays. (Unless your system can be configured to allow the use of CLRs--in which case the CLR splitter is the fastest BY FAR.) I even changed his test script slightly to use random alphanumerics in the delimited string and the results were the same.
However, if a second dimension is added to the array ('abc,xyz|def,uvw') things change. I've been doing some testing (using a modified version of Jeff's test methodology) for that form of data, using methods proposed for splitting such strings in another thread: "Split input string into multicolumn - multirows".
The CLR Split is still far superior to any other method. But if the use of a CLR is not possible, the winner of the 2-dimensional array split contest on my machine was an XML splitter offered by "Mr Magoo". DelimitedSplit8K runs a distant second as the size, number, and randomness of the element pairs increases. Most methods attack the 2-dim strings by using joins or a cross apply with regular split functions to do the second level split.
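Just to show what I mean by that cross apply style, here's a rough sketch (it assumes the article's DelimitedSplit8K function is installed and isn't taken verbatim from any of the tested code):
-- rough sketch only: split on the outer delimiter, then cross apply a second
-- split on the inner delimiter and pivot the two elements back into columns
DECLARE @Input VARCHAR(8000) = 'abc,xyz|def,uvw';
SELECT pair.ItemNumber,
       MAX(CASE WHEN elem.ItemNumber = 1 THEN elem.Item END) AS Col1,
       MAX(CASE WHEN elem.ItemNumber = 2 THEN elem.Item END) AS Col2
  FROM dbo.DelimitedSplit8K(@Input, '|') AS pair
 CROSS APPLY dbo.DelimitedSplit8K(pair.Item, ',') AS elem
 GROUP BY pair.ItemNumber;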
Here's Mr. Magoo's method for splitting element pairs:
-- set up some sample input data
declare @input varchar(max);
set @input = '30;38469|31;38470|32;38471|33;38472|34;38473|35;38474|36;38475|37;38476|38;38477|'
-- convert to xml
-- of course, this could be done in your web app and may be quicker.
declare @xml xml;
SET @xml =
CONVERT(xml,
'<row><col>'+
REPLACE(
REPLACE(
REPLACE(
REPLACE(
@input,
CHAR(13),
''
),
CHAR(10),
''
),
'|',
'</col></row><row><col>'
),
';',
'</col><col>'
)
+'</col></row>'
);
-- and select the values out as rows/columns
select
nd.value('(col/text())[1]','varchar(200)') as field1,
nd.value('(col/text())[2]','varchar(200)') as field2
from @xml.nodes('row') as x(nd)
where not nd.value('(col/text())[1]','varchar(200)') is null
December 28, 2012 at 3:06 pm
Hi Jeff, I managed (I think) to read my way through not only the update to an excellent article but the subsidiary commentary and the code fallout. Now I have some serious reading to do from Paul (always fun, although brains leaking out my ears is not a good look.)
I had to go back through the code though because there were a couple of things I wondered about.
The first is that you do the isnull(nullif()) thing. Yes, I get that it's faster than a case statement (I've used the same construct myself in the past). What I wondered was why you bother doing that at all; why not just add another boundary to the end, in much the same way as you union on the zero at the beginning?
The second is why do you do the second charindex()? The list you construct has the positions of all of the delimiters in the string, why not just reuse the list to give you the next position as well? The big if here is whether a join to reuse the list is quicker than doing the second charindex() scan; so it may well be you tried this approach and rejected it (although I couldn't see any examples).
I put together a little code just to illustrate the changes. Please note, I am pretty sure that this code didn't originate with me (and obviously it's also based on your own code)! I remember a discussion a few years back on this very site from which I got an earlier version of a string splitter, so all I've done is try to re-cast that code back into a similar shape as your own code.
Anyway, here's the example (just enough code to illustrate the changes, not a full splitter function). I haven't tested the code since I figure you probably already looked at similar changes and rejected them, but if you think it's worthwhile I can go ahead and create a full splitter function from it and run it through your test frame.
declare @pString varchar(8000) = '1,22,333',
        @pDelimiter char(1) = ',';
with
T1(N) as (
    select 1 union all select 1 union all select 1 union all
    select 1 union all select 1 union all select 1 union all
    select 1 union all select 1 union all select 1 union all select 1
),
T2(N) as (
    select 1 from T1 as TA cross join T1 as TB
),
T4(N) as (
    select 1 from T2 as TA cross join T2 as TB
),
Tally(N) as (
    select top (datalength(isnull(@pString, '1')))
           row_number() over (order by (select null))
    from T4
),
--
-- generate the list of "N+1" positions that mark the start of each
-- item in the string being split (N being the position of the delimiters)
--
BaseItem as (
    select 1 as ItemPosition
    union all
    select N + 1 as ItemPosition
    from Tally
    where substring(@pString, N, 1) = @pDelimiter
    --
    -- add terminating boundary to the list of positions
    --
    union all
    select datalength(isnull(@pString, '1')) + 2 -- need N+1 past imaginary final delimiter
),
--
-- order the list of item start positions so we can match each
-- start to the next starting position
--
ItemStart as (
    select row_number() over (order by ItemPosition) as ItemNumber,
           ItemPosition as ItemStart
    from BaseItem as ItemStart
)
select ItemStart.ItemNumber,
       ItemStart.ItemStart,
       ItemEnd.ItemStart - ItemStart.ItemStart - 1 as ItemLength
from ItemStart
--
-- join back to the same list to get the next item starting position (and
-- avoiding the second charindex())
--
inner join ItemStart as ItemEnd
    on ItemEnd.ItemNumber = ItemStart.ItemNumber + 1;
Edited to add: from the little testing I did do, this appears to be (potentially) much more expensive as the join includes a sort, and the code I posted has an estimated cost of 98% versus 2% for equivalent code from Jeff. So what I'm trying to understand is why a sort which I would think is fairly optimal in SQL Server is so much more expensive than repeated string scans using charindex().
December 28, 2012 at 7:10 pm
Bruce W Cassidy (12/28/2012)
Hi Jeff, I managed (I think) to read my way through not only the update to an excellent article but the subsidiary commentary and the code fallout. Now I have some serious reading to do from Paul (always fun, although brains leaking out my ears is not a good look.)
I had to go back through the code though because there were a couple of things I wondered about.
The first is that you do the isnull(nullif()) thing. Yes, I get that it's faster than a case statement (I've used the same construct myself in the past). What I wondered was why you bother doing that at all; why not just add another boundary to the end, in much the same way as you union on the zero at the beginning?
Hi Bruce,
Thank you for the excellent feedback and suggestions/questions. I hope I can answer them.
I was very interested in your suggestion above. I've not even completed full functional testing on it, but it looks very promising. I guess the final proof of the pudding will be to find out if a UNION ALL and an extra DATALENGTH is faster than a number of ISNULL/NULLIFs.
The second is why do you do the second charindex()? The list you construct has the positions of all of the delimiters in the string, why not just reuse the list to give you the next position as well? The big if here is whether a join to reuse the list is quicker than doing the second charindex() scan; so it may well be you tried this approach and rejected it (although I couldn't see any examples).
I put together a little code just to illustrate the changes. Please note, I am pretty sure that this code didn't originate with me (and obviously it's also based on your own code)! I remember a discussion a few years back on this very site from which I got an earlier version of a string splitter, so all I've done is try to re-cast that code back into a similar shape as your own code.
Anyway, here's the example (just enough code to illustrate the changes, not a full splitter function). I haven't tested the code since I figure you probably already looked at similar changes and rejected them, but if you think it's worthwhile I can go ahead and create a full splitter function from it and run it through your test frame.
declare @pString varchar(8000) = '1,22,333',
        @pDelimiter char(1) = ',';
with
T1(N) as (
    select 1 union all select 1 union all select 1 union all
    select 1 union all select 1 union all select 1 union all
    select 1 union all select 1 union all select 1 union all select 1
),
T2(N) as (
    select 1 from T1 as TA cross join T1 as TB
),
T4(N) as (
    select 1 from T2 as TA cross join T2 as TB
),
Tally(N) as (
    select top (datalength(isnull(@pString, '1')))
           row_number() over (order by (select null))
    from T4
),
--
-- generate the list of "N+1" positions that mark the start of each
-- item in the string being split (N being the position of the delimiters)
--
BaseItem as (
    select 1 as ItemPosition
    union all
    select N + 1 as ItemPosition
    from Tally
    where substring(@pString, N, 1) = @pDelimiter
    --
    -- add terminating boundary to the list of positions
    --
    union all
    select datalength(isnull(@pString, '1')) + 2 -- need N+1 past imaginary final delimiter
),
--
-- order the list of item start positions so we can match each
-- start to the next starting position
--
ItemStart as (
    select row_number() over (order by ItemPosition) as ItemNumber,
           ItemPosition as ItemStart
    from BaseItem as ItemStart
)
select ItemStart.ItemNumber,
       ItemStart.ItemStart,
       ItemEnd.ItemStart - ItemStart.ItemStart - 1 as ItemLength
from ItemStart
--
-- join back to the same list to get the next item starting position (and
-- avoiding the second charindex())
--
inner join ItemStart as ItemEnd
    on ItemEnd.ItemNumber = ItemStart.ItemNumber + 1;
Edited to add: from the little testing I did do, this appears to be (potentially) much more expensive as the join includes a sort, and the code I posted has an estimated cost of 98% versus 2% for equivalent code from Jeff. So what I'm trying to understand is why a sort which I would think is fairly optimal in SQL Server is so much more expensive than repeated string scans using charindex().
I guess the answer to that question can actually be found in a comment you made in your good code...
-- join back to the same list to get the next item starting position (and
-- avoiding the second charindex()
If you look at the execution plan, it doesn't actually use the "same list". CTEs have some wonderful advantages but they also have a hidden fault. Instead of using the same internal result set from a CTE when things like self joins are used, the entire CTE is re-executed. That immediately makes generation of the cteTally twice as slow. Then comes the join. On my 2005 box, that manifests itself as a Hash Join that adds 60% more overhead to something that already takes twice as long to execute. Forcing a Merge Join makes the Execution Plan look better but it actually takes more CPU and duration. Forcing a Loop Join just absolutely kills the performance.
On the other hand, CHARINDEX(), used to find the next delimiter after the current one, is a machine-language-level function that operates about as quickly as any CLR you could write to do the same thing.
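Just to illustrate the pattern I mean, here's a trimmed-down sketch (not the published function, just the item-extraction expression on its own):
-- sketch only: from a known start position, one CHARINDEX() finds the next delimiter;
-- NULLIF/ISNULL fall back to 8000 when no delimiter remains
DECLARE @pString    VARCHAR(8000) = '1,22,333',
        @pDelimiter CHAR(1)       = ',',
        @N1         INT           = 3;  -- start position of the second item
SELECT SUBSTRING(@pString, @N1,
       ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, @N1), 0) - @N1, 8000)) AS Item;
-- returns '22'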
I probably should have mentioned that I had already tried all that in the article but the article was already almost too long as it was.
Again, thank you for the thoughtful feedback, Bruce. I really appreciate it. I will try that first suggestion and let folks know. It may take me a bit because I'm up to my eyes in it at my job. Gotta love year end. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
December 28, 2012 at 7:41 pm
Steven Willis (12/28/2012)
Sean Lange (12/28/2012)
telcogod (12/28/2012)
what about xml?
Did you read the whole article and look at the performance comparisons? This style of xml splitter was on the list and it was not as fast.
As Jeff said above, there is a full test harness posted. Try out your xml splitter next to the DelimitedSplit8K. You might be surprised.
I ran Jeff's test script and DelimitedSplit8K is the clear winner for splitting one-dimensional delimited string arrays. (Unless your system can be configured to allow the use of CLRs--in which case the CLR splitter is the fastest BY FAR.) I even changed his test script slightly to use random alphanumerics in the delimited string and the results were the same.
However, if a second dimension is added to the array ('abc,xyz|def,uvw') things change. I've been doing some testing (using a modified version of Jeff's test methodology) for that form of data, using methods proposed for splitting such strings in another thread: "Split input string into multicolumn - multirows".
The CLR Split is still far superior to any other method. But if the use of a CLR is not possible, the winner of the 2-dimensional array split contest on my machine was an XML splitter offered by "Mr Magoo". DelimitedSplit8K runs a distant second as the size, number, and randomness of the element pairs increases. Most methods attack the 2-dim strings by using joins or a cross apply with regular split functions to do the second level split.
Here's Mr. Magoo's method for splitting element pairs:
-- set up some sample input data
declare @input varchar(max);
set @input = '30;38469|31;38470|32;38471|33;38472|34;38473|35;38474|36;38475|37;38476|38;38477|'
-- convert to xml
-- of course, this could be done in your web app and may be quicker.
declare @xml xml;
SET @xml =
CONVERT(xml,
'<row><col>'+
REPLACE(
REPLACE(
REPLACE(
REPLACE(
@input,
CHAR(13),
''
),
CHAR(10),
''
),
'|',
'</col></row><row><col>'
),
';',
'</col><col>'
)
+'</col></row>'
);
-- and select the values out as rows/columns
select
nd.value('(col/text())[1]','varchar(200)') as field1,
nd.value('(col/text())[2]','varchar(200)') as field2
from @xml.nodes('row') as x(nd)
where not nd.value('(col/text())[1]','varchar(200)') is null
You've actually answered your own question. As you pointed out, DelimitedSplit8K is nasty fast at doing what it was designed for, which is splitting non-blob varchars with an unknown number of delimiters/elements on a single delimiter. Only a well written CLR (like Paul White's) is faster at solving the same problem directly. If you "cascade" the function using more than one Cross Apply, you have to remember that ALL of the code becomes a single gigantic query, which doesn't run as efficiently as a single usage of the function.
Shifting gears, Magoo's code, as fine as it is for the specific task given, can also only do one thing well. If you do something to the data even as simple as adding one extra element to the semi-colon delimited data, you will need to modify the code to accommodate the change. That's NOT a bad thing if the data isn't expected to change on a regular basis. It just proves that well written code dedicated to a specific problem will be faster than generic code applied to the same data.
I've not had a lot of "deep dive" time available recently, but the problem of multiple-dimension splitting (usually 2 dimensions as in your example, but with unknown numbers of elements in each dimension) has come up often enough that I should probably spend some time on it. Still, and as you've pointed out, something like Paul White's CLR works just fine for such a thing because it's faster to begin with and treats the "work load" as two separate queries instead of one massive one.
Heh... now that I've said that, I wonder if dumping the first split into a Temp Table and then resplitting that would be faster than the XML method?
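Something like this is what I have in mind (a quick sketch only, assuming the DelimitedSplit8K function from the article; I haven't tested it for performance):
-- first pass: split on the "outer" delimiter into a Temp Table
IF OBJECT_ID('tempdb..#FirstPass') IS NOT NULL DROP TABLE #FirstPass;
DECLARE @Input VARCHAR(8000) = '30;38469|31;38470|32;38471';
SELECT ItemNumber, Item
  INTO #FirstPass
  FROM dbo.DelimitedSplit8K(@Input, '|');
-- second pass: resplit the temp table rows on the "inner" delimiter
SELECT fp.ItemNumber,
       MAX(CASE WHEN sp.ItemNumber = 1 THEN sp.Item END) AS Col1,
       MAX(CASE WHEN sp.ItemNumber = 2 THEN sp.Item END) AS Col2
  FROM #FirstPass fp
 CROSS APPLY dbo.DelimitedSplit8K(fp.Item, ';') sp
 GROUP BY fp.ItemNumber;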
--Jeff Moden
Change is inevitable... Change for the better is not.
December 28, 2012 at 9:49 pm
Jeff Moden (12/28/2012)
I guess the final proof of the pudding will be to find out if a UNION ALL and an extra DATALENGTH is faster than a number of ISNULL/NULLIFs.
Yeah, I wondered about that myself. I wondered whether having a CTE that evaluates the datalength once and then cross-joining it where it's needed would help, but then you mentioned that the CTE can get evaluated more than once, so in effect that wouldn't resolve it. Does SQL Server do common factor elimination? If it does, it might only evaluate the datalength once anyway.
By the way, an alternative to isnull(nullif()) is to use the sign of an expression (which in your case would evaluate as either 1 or 0) and multiply the sub-expressions by either the sign or one minus the sign, then add the two together. I think your isnull(nullif()) combination is way clearer (and probably a lot quicker as well).
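Just so it's clear what I mean by the sign trick, here's a quick sketch (illustration only, not something I'd suggest using over your version):
DECLARE @pString    VARCHAR(8000) = '1,22,333',
        @pDelimiter CHAR(1)       = ',',
        @N1         INT           = 3;  -- start position of an item
-- your ISNULL/NULLIF form: distance to the next delimiter, or 8000 if none is left
SELECT ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, @N1), 0) - @N1, 8000);
-- the SIGN form: weight the two sub-expressions by sign and (1 - sign), then add them
SELECT (CHARINDEX(@pDelimiter, @pString, @N1) - @N1) * SIGN(CHARINDEX(@pDelimiter, @pString, @N1))
     + 8000 * (1 - SIGN(CHARINDEX(@pDelimiter, @pString, @N1)));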
Jeff Moden (12/28/2012)
If you look at the execution plan, it doesn't actually use the "same list". CTEs have some wonderful advantages but they also have a hidden fault. Instead of using the same internal result set from a CTE when things like self joins are used, the entire CTE is re-executed.
That's something I didn't know, and it explains a lot! Thanks Jeff!
I guess in some cases, if SQL Server does apply common factor elimination, then that may drop out, but there will be a limit on what it will factor out.
Jeff Moden (12/28/2012)
I probably should have mentioned that I had already tried all that in the article but the article was already almost too long as it was.
As I mentioned earlier, I was pretty sure the code I had been using (which did the self join thing) came from an earlier article on the same topic, so I suspected you had looked at it and ruled it out. That's why I was hoping to find out why that solution wasn't what I would call "optimal".
Cheers, and thanks again Jeff. Job hunting over here (in between contracts) so I've got time to look at stuff like this. 😀
December 29, 2012 at 10:02 am
Bruce W Cassidy (12/28/2012)
Jeff Moden (12/28/2012)
If you look at the execution plan, it doesn't actually use the "same list". CTEs have some wonderful advantages but they also have a hidden fault. Instead of using the same internal result set from a CTE when things like self joins are used, the entire CTE is re-executed.
That's something I didn't know, and it explains a lot! Thanks Jeff!
I guess in some cases, if SQL Server does apply common factor elimination, then that may drop out, but there will be a limit on what it will factor out.
I've not done a deep dive on it but I believe that any common factor eliminations would only be within each "instance" of the CTE. Each "call" to the CTE causes everything inside the CTE to be recalculated. I do wish that MS had been a bit smarter about that. They shot a really great feature right in the mouth. Same goes for multiple calls to a view within the same query... each "call" is evaluated separately and the view code is executed once for each alias of the view.
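It's easy enough to see with a throw-away example (a sketch; run it with the actual execution plan turned on and you'll find the base table accessed once for each reference to the CTE):
-- sketch only: the plan shows two separate scans of sys.objects, one per CTE reference
WITH cte AS (SELECT TOP (5) name FROM sys.objects ORDER BY name)
SELECT a.name, b.name
  FROM cte AS a
  JOIN cte AS b
    ON b.name = a.name;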
--Jeff Moden
Change is inevitable... Change for the better is not.
December 29, 2012 at 10:05 am
buyme43 (12/28/2012)
input this URL:***Url removed ***
you can find many cheap and high stuff
Believe you will love it.
WE ACCEPT CREDIT CARD /WESTERN UNION PAYMENT
YOU MUST NOT MISS IT!!!
Speaking of "cheap", SPAM reported. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
December 29, 2012 at 11:48 am
Jeff Moden (12/28/2012)
Heh... now that I've said that, I wonder if dumping the first split into a Temp Table and then resplitting that would be faster than the XML method?
I modified DelimitedSplit8K by simply splitting on the "outer" delimiter then using a simple replace to get the second half of each pair.
I wasn't expecting much, but my test results show that it beat the XML Split method and even the CLR split until the size and number of the value pairs got very large. The CLR Split must still use a cross apply when splitting pairs, so if the CLR function were re-written to parse both elements of each pair within the function itself, it would probably still be the fastest method.
All of the testing code (modified from Jeff Moden's code in this article) is posted in the thread "Split input string into multicolumn - multirows".
The code for using DelimitedSplit8K to split value pairs (i.e., 2-dimensional arrays):
CREATE FUNCTION [dbo].[DelimitedSplit8K_2DIM]
(
@pString VARCHAR(8000)
,@pDelimiter1 CHAR(1)
,@pDelimiter2 CHAR(1) = NULL
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (
SELECT 0 UNION ALL
SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (
SELECT t.N+1
FROM cteTally t
WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter1 OR t.N = 0)
)
SELECT
ItemNumber
,Item1
,Item2 = REPLACE(Item2,Item1+@pDelimiter2,'')
FROM
(
SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1)
,Item1 = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter2,@pString,s.N1),0)-s.N1,8000))
,Item2 = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter1,@pString,s.N1),0)-s.N1,8000))
FROM cteStart s
) i1
/*
Example usage:
DECLARE @STR VARCHAR(8000)
SET @STR = '123,456|789,000'
SELECT * FROM [dbo].[DelimitedSplit8K_2DIM](@str,'|',',')
DECLARE @STR VARCHAR(8000)
SET @STR = '0WQDNw|aXxbzu,durPP|7yaFpK,UMERA5|FLN2G,HUdZv|QUQy5,3MbdqS|JWUgPp,F23jqp|kWbSBn,nSWunU|uh1zR,pqBJ4U|eNnZzE,jbu7R|cwd4E,1hNMC|Ru7ar'
SELECT * FROM [dbo].[DelimitedSplit8K_2DIM](@str,',','|')
*/
December 29, 2012 at 2:33 pm
Steven Willis (12/28/2012)
Here's Mr. Magoo's method for splitting element pairs:
Gosh, I hope that was a mistake?
That code was not the best version of the XML split for that problem by a long way
I won't clutter up this thread with the better version - it is in the other thread you linked to, and as Jeff has pointed out, it was for a specific problem, not a generic splitter.
Otherwise, thanks for the nod 😀
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
December 29, 2012 at 6:57 pm
Hi
unbelievable - so good. I have been doing these puter things for 30 years now and I am still amazed at how little I know, or probably at just how clever other people are. Really well explained and totally useful.
keep up the good work.
APL. My initials not the language.
December 30, 2012 at 10:02 am
Jeff Moden (12/28/2012)
If you look at the execution plan, it doesn't actually use the "same list". CTEs have some wonderful advantages but they also have a hidden fault. Instead of using the same internal result set from a CTE when things like self joins are used, the entire CTE is re-executed.
Now, if only one could force the optimiser to spool the CTE and reuse it, CTEs would be far more useful. Would they be useful enough to deliver a performance gain here? Maybe, and maybe not.
Because there are cases where a CTE might be so big that spooling it would damage performance rather than improve it, MS decided never to spool them - at least in the current implementation. Perhaps they spool when spooling makes a semantic difference, not just a performance one, but I'm not sure that they spool even then; as far as I can tell it isn't documented anywhere. Why can't the optimiser look and see if spooling would enhance performance (only in the cases where it makes no semantic difference, of course) and use it when it does? Or if that's too difficult a task for the optimiser (it shouldn't be, but they haven't done it so I can imagine them claiming it is), why can't we have a query hint that tells them to do it?
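In the meantime the only way I know to get that behaviour is to materialise the intermediate result yourself, something like this (a sketch with made-up names, nothing more):
-- sketch only: evaluate the "CTE body" once into a temp table, then reuse it
SELECT o.name
  INTO #spooled                 -- whatever the CTE body would have been
  FROM sys.objects AS o;
SELECT a.name, b.name
  FROM #spooled AS a
  JOIN #spooled AS b
    ON b.name = a.name;
DROP TABLE #spooled;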
I hope Paul or someone will jump in and explain all this - maybe tell me I've got it all wrong, but that's OK too, I like learning.
Tom
December 30, 2012 at 12:48 pm
mister.magoo (12/29/2012)
Steven Willis (12/28/2012)
Here's Mr. Magoo's method for splitting element pairs:
Gosh, I hope that was a mistake?
That code was not the best version of the XML split for that problem by a long way
I won't clutter up this thread with the better version - it is in the other thread you linked to, and as Jeff has pointed out, it was for a specific problem, not a generic splitter.
Otherwise, thanks for the nod 😀
Probably a mistake on my part, then... the code for the XML Split version I tested is in the code I attached in the other thread referenced above. But as a splitter of delimited value pairs, your XML version performed very well when compared to other non-CLR methods.
December 30, 2012 at 1:07 pm
L' Eomot Inversé (12/30/2012)
Jeff Moden (12/28/2012)
If you look at the execution plan, it doesn't actually use the "same list". CTEs have some wonderful advantages but they also have a hidden fault. Instead of using the same internal result set from a CTE when things like self joins are used, the entire CTE is re-executed.
Now, if only one could force the optimiser to spool the CTE and reuse it, CTEs would be far more useful.
Don't become so blinded by CTEs that you forget about temporary tables or table-type variables; another problem with CTEs is they are not indexed.
On page 27 I posted a suggested version of this that avoided CHARINDEX() and used an indexed table-type variable and it showed performance gains over the pure-CTE version, but I don't think anyone has seen fit to comment on that suggested version... 🙁
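For anyone who doesn't want to page back, the general shape of the idea runs along these lines (just a sketch of the approach, not the exact code from that post):
-- sketch only: collect the delimiter positions into a table variable with a primary
-- key, then self-join on the indexed ItemNumber instead of a second CHARINDEX()
DECLARE @pString    VARCHAR(8000) = '1,22,333',
        @pDelimiter CHAR(1)       = ',';
DECLARE @Pos TABLE (ItemNumber INT IDENTITY(1,1) PRIMARY KEY, N INT NOT NULL);
INSERT INTO @Pos (N)
SELECT d.N
  FROM (SELECT 0 AS N                                  -- boundary before the first item
        UNION ALL
        SELECT v.number                                -- every delimiter position
          FROM master.dbo.spt_values v                 -- any tally source works; this one only covers 0-2047
         WHERE v.type = 'P'
           AND v.number BETWEEN 1 AND DATALENGTH(@pString)
           AND SUBSTRING(@pString, v.number, 1) = @pDelimiter
        UNION ALL
        SELECT DATALENGTH(@pString) + 1) d             -- boundary after the last item
 ORDER BY d.N;                                         -- identity values assigned in position order
SELECT s.ItemNumber,
       SUBSTRING(@pString, s.N + 1, e.N - s.N - 1) AS Item
  FROM @Pos s
  JOIN @Pos e
    ON e.ItemNumber = s.ItemNumber + 1;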
December 30, 2012 at 1:13 pm
John Hardin (12/30/2012)
L' Eomot Inversé (12/30/2012)
Jeff Moden (12/28/2012)
If you look at the execution plan, it doesn't actually use the "same list". CTEs have some wonderful advantages but they also have a hidden fault. Instead of using the same internal result set from a CTE when things like self joins are used, the entire CTE is re-executed.
Now, if only one could force the optimiser to spool the CTE and reuse it, CTEs would be far more useful.
Don't become so blinded by CTEs that you forget about temporary tables or table-type variables; another problem with CTEs is they are not indexed.
On page 27 I posted a suggested version of this that avoided CHARINDEX() and used an indexed table-type variable and it showed performance gains over the pure-CTE version, but I don't think anyone has seen fit to comment on that suggested version... 🙁
Page 27 based on how many posts per page? I ask because I display 50 posts per page. It would help if you posted the URL of your post so we could go directly to it.