May 9, 2009 at 10:34 am
Florian Reischl (5/9/2009)
@Paul: Starting with a string.Join or a StringBuilder? 😉
A StringBuilder. A CLR aggregate based on a StringBuilder.
Such an implementation gets plugged directly into the plan (no reflection!) as a single Stream Aggregate operator. In 2008 it can return a MAX datatype; 2005 is limited to 8000 bytes. No T-SQL method will get close to that single logical operator in terms of performance!
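For illustration, a minimal sketch of how such a CLR aggregate might be registered and called from T-SQL once the C# assembly is compiled. The assembly name StringUtils, the aggregate name dbo.Concat, and the table are invented for this example, and the MAX return type assumes 2008:
[code]
-- Hypothetical assembly containing a StringBuilder-based string aggregate.
CREATE ASSEMBLY StringUtils
FROM 'C:\CLR\StringUtils.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Register the aggregate (2008 shown: nvarchar(max) return; 2005 would be capped at 8000 bytes).
CREATE AGGREGATE dbo.Concat (@value nvarchar(4000))
RETURNS nvarchar(max)
EXTERNAL NAME StringUtils.Concat;
GO

-- Usage: the whole aggregation appears in the plan as a single Stream Aggregate.
SELECT   t.GroupID,
         dbo.Concat(t.ItemName) AS ItemList
FROM     dbo.SomeTable AS t
GROUP BY t.GroupID;
[/code]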
Paul
May 9, 2009 at 12:11 pm
Paul White (5/9/2009)
Florian Reischl (5/9/2009)
@Paul: Starting with a string.Join or a StringBuilder? 😉
A StringBuilder. A CLR aggregate based on a StringBuilder.
Such an implementation gets plugged directly into the plan (no reflection!) as a single Stream Aggregate operator. In 2008 it can return a MAX datatype; 2005 is limited to 8000 bytes. No T-SQL method will get close to that single logical operator in terms of performance!
Actually, as far as I can tell, the FOR XML PATH('') trick in fact uses this (or a very similar) method internally and consequently it scales the same: linearly. However, FOR XML has all kinds of overhead of its own and therefore should be easily outperformed by any good direct CLR method. My own guesstimate is that FOR XML (using the extra techniques to correct the entitization) has about 4x overhead.
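For anyone who hasn't seen it, the basic pattern being referred to looks something like this (the table dbo.Tags and its column are made up purely for illustration):
[code]
DECLARE @list varchar(8000);

-- FOR XML PATH('') concatenates the rows into one string; STUFF removes the leading comma.
-- Note: without the TYPE / .value() correction discussed elsewhere in this thread,
-- characters such as & come back entitized (& becomes &amp;).
SELECT @list = STUFF(
    (
        SELECT ',' + t.TagName
        FROM   dbo.Tags AS t
        ORDER BY t.TagName
        FOR XML PATH('')
    ), 1, 1, '');

SELECT @list AS ConcatenatedTags;
[/code]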
I haven't seen a really good CLR string aggregate method yet, probably because of the 8k limit in 2005. Adam Machanic had some trick to get around it, but every time I have tried to implement it, it just crashes in any parallel execution plan (I have dual cores). Of course SQL 2008 is out now, so maybe it's a good time for you to do this, Florian (or Paul)?
At the small end, the pseudocursor tricks are pretty darn fast, but they're very clumsy and difficult to apply in many situations.
(edit: too many "actually"s)
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
May 9, 2009 at 12:38 pm
RBarryYoung (5/9/2009)
Actually, as far as I can tell, the FOR XML PATH('') trick in fact uses this (or a very similar) method internally and consequently it scales the same: linearly.
I personally dislike the FOR XML PATH('') trick intensely. Can converting to XML and back really be the best way to concatenate in T-SQL?! The empty string hack is just awful.
Once one has added TYPE and .value('(.)[1]', 'varchar(x)'), the plan has gained an XML Reader, two UDX operations, a stream aggregate, and two loop joins - just to get the result into a local varchar variable.
Also, if I embed certain characters in the string (e.g. CHAR(0), CHAR(11), or CHAR(29)) the XML serialization barfs. Any robust concatenation method must be able to handle all input gracefully.
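To make both points concrete, a small sketch (hypothetical table name, not from the original post) of the TYPE / .value() variant, and of an input character that breaks serialization:
[code]
-- The TYPE / .value() variant: the result is built as XML and converted back,
-- which undoes the entitization but adds the extra plan operators listed above.
DECLARE @list varchar(8000);

SELECT @list =
    (
        SELECT ',' + t.TagName
        FROM   dbo.Tags AS t              -- hypothetical table
        FOR XML PATH(''), TYPE
    ).value('.', 'varchar(8000)');        -- leading-comma handling omitted for brevity

-- A character that is illegal in XML 1.0, such as CHAR(11), makes the
-- serialization itself fail rather than producing a concatenated string.
SELECT (
           SELECT 'a' + CHAR(11) + 'c' AS [text()]
           FOR XML PATH(''), TYPE
       ).value('.', 'varchar(100)');      -- raises a FOR XML serialization error
[/code]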
RBarryYoung (5/9/2009)
However, FOR XML has all kinds of overhead of its own and therefore should be easily outperformed by any good direct CLR method. My own guesstimate is that FOR XML (using the extra techniques to correct the entitization) has about 4x overhead.
We should test this at some stage.
RBarryYoung (5/9/2009)
I haven't seen a really good CLR string aggregate method yet, probably because of the 8k limit in 2005.
The basic example in BOL is neat and swift for most practical uses. The change required to allow up to 2GB of concatenated data to be retrieved is trivial, though for extremely large operations one would have to think again because the memory allocation would be prohibitive. I'm not sure SQL Server is the right place to do massive string concatenation anyway.
RBarryYoung (5/9/2009)
At the small end, the pseudocursor tricks are pretty darn fast, but they're very clumsy and difficult to apply in many situations.
RBAR?! :shocked:
All of these designs will require more than a single stream aggregate operator to implement. With enough iterations, the difference should be appreciable.
Paul
May 9, 2009 at 5:29 pm
RBarryYoung (5/9/2009)
... but every time I have tried to implement it, it just crashes in any parallel execution plan (I have dual cores).
Hi Barry
Those were my first test results, too. Parallelism appeared to be a huge problem with CLR, but in the end that turned out not to be the case (in my tests). The trick is not to call the CLR function(s) from a plain script but from a procedure. As long as I called them from ordinary scripts in SSMS, performance went down the toilet with more than ten connections. After putting the calls into procedures, performance was linear and they performed much better than all the other approaches. Sure, these are only my test results, but I tried on SSE2k8 and on SSE2k5, on a smaller AMD dual core and on my current Intel quad core; same behavior.
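A sketch of the difference described here, with invented object names (dbo.clrConcat standing in for the CLR aggregate under test):
[code]
-- Ad-hoc batch run directly from SSMS (the variant that scaled badly in these tests):
SELECT dbo.clrConcat(t.SomeColumn) AS ConcatResult
FROM   dbo.SomeTable AS t;
GO

-- The same call wrapped in a procedure (the variant that scaled linearly):
CREATE PROCEDURE dbo.usp_ConcatTest
AS
BEGIN
    SET NOCOUNT ON;

    SELECT dbo.clrConcat(t.SomeColumn) AS ConcatResult
    FROM   dbo.SomeTable AS t;
END;
GO

EXEC dbo.usp_ConcatTest;
[/code]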
Greets
Flo
May 9, 2009 at 6:45 pm
Paul White (5/9/2009)
RBarryYoung (5/9/2009)
Actually, as far as I can tell, the FOR XML PATH('') trick in fact uses this (or a very similar) method internally and consequently it scales the same: linearly.
I personally dislike the FOR XML PATH('') trick intensely. Can converting to XML and back really be the best way to concatenate in T-SQL?! The empty string hack is just awful.
I used to dislike it as well, until I looked into it a little and realized some things (more below). And yes, my sad conclusion is that it really is the best way to do string concatenation in T-SQL 2005 (there are two new features of 2008 that open the door to something better). As for the "empty string hack", I have come to the realization that it is not that at all (at least not in my lexicon).
Why? Although there are about as many different definitions of "hack" as there are programmers in the world, my meaning of it has always been rooted in the obscurity, fragility and/or unsupported-ness of the code in question. However, FOR XML does not really qualify on any of these grounds - even the empty string bit - because it is clearly documented and absolutely supported by the vendor. Also, FOR XML actually IS a string concatenator and formatter; that is what it does. The XML datatype part wasn't added until SQL 2005.
Moreover, I am now completely convinced that the empty string feature is no accidental side-effect, cleverly exploited by Microsoft's customers, but rather intentional functionality explicitly added to the PATH feature for SQL 2005 by Microsoft. All of these things take it out of the realm of a true "hack" (at least for our use of it, Microsoft's creation of it could still be called a hack).
Sadly, the converse implication of this is that, having given us this back-door method to do string concatenation, it seems unlikely that they will accede to our requests for a more direct and satisfactory facility anytime soon.
One other note on this: you don't actually have to "convert to XML and back" to use it. That is how it is frequently done (with "PATH..., TYPE") in order to handle the entitization problem. But if you don't need that (or have another way of handling it), then you can skip it and revert to the SQL 2000 behavior of just returning the text (there was no XML datatype in SQL 2000).
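A tiny (made-up) illustration of the difference:
[code]
-- Without TYPE: plain text comes straight back, but entitized.
SELECT (SELECT 'fish & chips' AS [text()] FOR XML PATH(''));
-- returns: fish &amp; chips

-- With TYPE and .value(): converted to XML and back, entities decoded.
SELECT (SELECT 'fish & chips' AS [text()] FOR XML PATH(''), TYPE
       ).value('.', 'varchar(100)');
-- returns: fish & chips
[/code]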
more in a bit...
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
May 9, 2009 at 7:56 pm
Paul White (5/9/2009)
Once one has added TYPE and .value('(.)[1]', 'varchar(x)'), the plan has gained an XML Reader, two UDX operations, a stream aggregate, and two loop joins - just to get the result into a local varchar variable.
Yes, clearly lots of overhead that we shouldn't need.
Also, if I embed certain characters in the string (e.g. CHAR(0), CHAR(11), or CHAR(29)) the XML serialization barfs. Any robust concatenation method must be able to handle all input gracefully.
Yes, I agree with you here. In fact I would say the same about ALL string-handling routines. As far as I am concerned, any general string-handling routine built on the assumption that certain characters won't be in a string is broken at the outset (I excuse application- and case-specific instances). Unfortunately the standards committees are full of ancient Unix wonks who think that "text string" means they don't have to handle most non-printable characters.
That said, it is possible to handle this, but the (additional) overhead is substantial.
RBarryYoung (5/9/2009)
However, FOR XML has all kinds of overhead of its own and therefore should be easily outperformed by any good direct CLR method. My own guesstimate is that FOR XML (using the extra techniques to correct the entitization) has about 4x overhead.
We should test this at some stage.
Agreed.
RBarryYoung (5/9/2009)
I haven't seen a really good CLR string aggregate method yet, probably because of the 8k limit in 2005.
The basic example in BOL is neat and swift for most practical uses. The change required to allow up to 2GB of concatenated data to be retrieved is trivial, though for extremely large operations one would have to think again because the memory allocation would be prohibitive. I'm not sure SQL Server is the right place to do massive string concatenation anyway.
It may not be, but IMHO a database server should be the appropriate place to handle things like bulk string aggregation, reformatting, etc.
Now, having bashed FOR XML a lot, in all fairness we should examine SQL CLR's shortcomings in these areas as well. First and foremost, of course, is the 8K string limit in 2005. And as noted previously, although Adam Machanic's book does claim to have a workaround for this, I have never seen an actual working (non-crashing) example that used it. Of course 2008 does address this, but it also highlights that any true test of such a function's performance must include the serialization overhead.
The second, and even more important, thing that FOR XML has over CLR is the ORDER BY clause. Ordering is usually a very important functional requirement of string aggregation, and FOR XML gets it in the most natural way possible in SQL: from the ORDER BY clause, which is explicitly supported for this purpose, probably because XML itself is order-sensitive and requires it.
A SQLCLR aggregate function, on the other hand, has no such built-in mechanism, because although there is the IsInvariantToOrder property, it is still "reserved for future use". In 2005 there is virtually no reliable, supported way to address this. And of course, relying on the calling query's ORDER BY to ensure the calling order of the aggregate function is neither reliable nor supported.
SQL Server 2008 does provide one or two ways around this, however; the most obvious is that it allows more than one parameter for an aggregate function. This would allow ordering information to be passed in the additional parameters. The aggregate could then use that information to sort the strings just before concatenating and returning them in the termination routine. Of course, passing the extra parameter(s), storing them along with the strings, and then sorting on them all add more overhead.
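From the T-SQL side, the multi-parameter idea might look something like this in 2008 (the aggregate dbo.ConcatOrdered, its assembly, and the table are invented; the C# body would have to store value/sort-key pairs and sort them in its termination routine):
[code]
-- Hypothetical 2008-only aggregate: the second parameter carries the sort key.
CREATE AGGREGATE dbo.ConcatOrdered (@value nvarchar(4000), @sortKey int)
RETURNS nvarchar(max)
EXTERNAL NAME StringUtils.ConcatOrdered;
GO

SELECT   o.CustomerID,
         dbo.ConcatOrdered(o.ProductName, o.LineNumber) AS OrderedList
FROM     dbo.OrderLines AS o             -- hypothetical table
GROUP BY o.CustomerID;
[/code]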
RBarryYoung (5/9/2009)
At the small end, the pseudocursor tricks are pretty darn fast, but they're very clumsy and difficult to apply in many situations.
RBAR?! :shocked:
All of these designs will require more than a single stream aggregate operator to implement. With enough iterations, the difference should be appreciable.
I'm afraid that I don't follow this last comment. Could you expand on it?
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
May 9, 2009 at 8:03 pm
Florian Reischl (5/9/2009)
RBarryYoung (5/9/2009)
... but every time I have tried to implement it, it just crashes in any parallel execution plan (I have dual cores).
Hi Barry
Those were my first test results, too. Parallelism appeared to be a huge problem with CLR, but in the end that turned out not to be the case (in my tests). The trick is not to call the CLR function(s) from a plain script but from a procedure. As long as I called them from ordinary scripts in SSMS, performance went down the toilet with more than ten connections. After putting the calls into procedures, performance was linear and they performed much better than all the other approaches. Sure, these are only my test results, but I tried on SSE2k8 and on SSE2k5, on a smaller AMD dual core and on my current Intel quad core; same behavior.
Greets
Flo
Not sure that we're talking about the same thing, Flo. I haven't had any trouble with SQL parallelism in most of my SQL CLR work (granted, I haven't pushed it real hard either). I am talking only about Adam Machanic's trick to get around the 8K local context serialization limit for SQL CLR aggregate functions in SQL Server 2005. That would crash ANY time I allowed it to be used in an execution plan with any parallelism. Adam's "trick" always looked very suspect to me and I am convinced that it just isn't safe.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
May 10, 2009 at 1:37 am
Barry,
[Brief comments on the first of your two replies]
Let me avoid the various meanings of 'hack' by calling the use of FOR XML for concatenation an unsatisfactory and ad-hoc work-around.
The purpose of FOR XML is to produce XML from a result set. Using it to concatenate strings by using a blank row element name is an abuse - the operation does not even produce valid XML! A blank row element name ought to be illegal (or produce empty row tags?). MSFT's decision to allow this and omit the tags entirely is completely ad-hoc. It makes no difference whether MSFT added it deliberately - a deliberate crock is still a crock.
It is true that you don't have to convert back from XML if you don't want to, but then it might not work properly! :pinch:
Using something which will break if you ever include rare and unusual characters like '&' in your strings is brittle in the extreme. To modify the FOR XML approach so that it handles these properly kills performance. Modifying it still further to handle all possible ANSI or Unicode characters results in a complete mess.
Given the limitations of FOR XML, I even prefer Paul Nielsen's approach! At least it doesn't die horribly if the wrong character is found in the input - whatever its other faults.
Cheers,
Paul
May 10, 2009 at 1:48 am
RBarryYoung (5/9/2009)
I am talking only about Adam Machanic's trick to get around the 8K local context serialization limit for SQL CLR aggregate functions in SQL Server 2005. That would crash ANY time I allowed it to be used in an execution plan with any parallelism. Adam's "trick" always looked very suspect to me and I am convinced that it just isn't safe.
Hey Barry,
Any chance of a link to that code? I would love to take a look!
Paul
May 10, 2009 at 3:10 am
Barry,
All string handling routines should handle any valid input - agreed.
The local-variable and CLR aggregate approaches both do this.
It is only the FOR XML method which fails on odd characters, as far as I am aware...?
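For completeness, the local-variable approach referred to above looks something like this (table name invented; the order of concatenation is not guaranteed or officially supported):
[code]
DECLARE @list varchar(8000);
SET @list = '';

-- Variable-assignment concatenation: each row appends to @list.
-- Handles any characters, but assumes non-NULL values and an undefined row order.
SELECT @list = @list + ',' + t.TagName
FROM   dbo.Tags AS t;                    -- hypothetical table

SELECT STUFF(@list, 1, 1, '') AS ConcatenatedTags;
[/code]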
You are right that, even in 2008, the CLR string aggregate has a problem with ordering. While it may be possible to force the order of the rows entering the stream aggregate, the problem would still occur in a parallel plan, where the merge method must preserve the ordering that went into the partial aggregates.
The solution you outline using multiple parameters is the way I was thinking of going. The CLR includes efficient methods for sorting a collection of items in place, so I am confident that a good solution can be found.
I don't think that the serialization adds much of an overhead - but again it would be interesting to test.
To clarify my point about the pseudo-cursors: the query plan for those solutions would be considerably more complex than the single stream aggregate required for the CLR aggregate, and so would be noticeably slower for larger inputs. That was all - aside from a wry smile that you should even consider a RBAR solution :laugh:
Paul
May 10, 2009 at 3:18 am
RBarryYoung (5/9/2009)
Florian Reischl (5/9/2009)
RBarryYoung (5/9/2009)
... but every time I have tried to implement it, it just crashes in any parallel execution plan (I have dual cores).
Hi Barry
Those were my first test results, too. Parallelism appeared to be a huge problem with CLR, but in the end that turned out not to be the case (in my tests). The trick is not to call the CLR function(s) from a plain script but from a procedure. As long as I called them from ordinary scripts in SSMS, performance went down the toilet with more than ten connections. After putting the calls into procedures, performance was linear and they performed much better than all the other approaches. Sure, these are only my test results, but I tried on SSE2k8 and on SSE2k5, on a smaller AMD dual core and on my current Intel quad core; same behavior.
Greets
Flo
Not sure that we're talking about the same thing, Flo. I haven't had any trouble with SQL parallelism in most of my SQL CLR work (granted, I haven't pushed it real hard either). I am talking only about Adam Machanic's trick to get around the 8K local context serialization limit for SQL CLR aggregate functions in SQL Server 2005. That would crash ANY time I allowed it to be used in an execution plan with any parallelism. Adam's "trick" always looked very suspect to me and I am convinced that it just isn't safe.
Good morning Barry
I just thought you wanted to point to parallelism problems because you alluded to Adam. He wrote in one of his posts in the "Tally Performance" thread that CLR and parallelism might be "interesting".
Maybe I shouldn't try to read/write English at 1:30 AM...
Greets
Flo
May 10, 2009 at 3:30 am
Paul White (5/10/2009)
Using it to concatenate strings by using a blank row element name is an abuse - the operation does not even produce valid XML! A blank row element name ought to be illegal (or produce empty row tags?). MSFT's decision to allow this and omit the tags entirely is completely ad-hoc. It makes no difference whether MSFT added it deliberately - a deliberate crock is still a crock.
Good evening Paul
Confirmed, FOR XML PATH('') is a hack which does not create valid XML. Just notice that this is a built-in feature. FOR XML PATH is by definition a simpler alternative to FOR XML EXPLICIT, and the description for FOR XML EXPLICIT says (BOL):
"...you must ensure that the generated XML is well formed and valid."
It is a hack with many limitations (as you wrote), but sometimes it is a nice ad-hoc feature. I use it in some of my scripts to resolve 1:n relations into one result row. (Scripts, not procedures.)
Greets
Flo
May 10, 2009 at 3:33 am
RBarryYoung (5/9/2009)
Moreover, I am now completely convinced that the empty string feature is no accidental side-effect, cleverly exploited by Microsoft's customers, but rather intentional functionality explicitly added to the PATH feature for SQL 2005 by Microsoft.
You are correct in that conclusion.
Sadly, the converse implication of this is that, having given us this back-door method to do string concatenation, it seems unlikely that they will accede to our requests for a more direct and satisfactory facility anytime soon.
Most likely the answer would be either "build your own", as that's part of the reason behind CLR aggregates, or, if MS did provide one in a future version, it would simply be based on a CLR assembly, much the same way as most of the spatial stuff is in 2008.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
May 10, 2009 at 4:23 am
GilaMonster (5/10/2009)
RBarryYoung (5/9/2009)
Moreover, I am now completely convinced that the empty string feature is no accidental side-effect, cleverly exploited by Microsoft's customers, but rather intentional functionality explicitly added to the PATH feature for SQL 2005 by Microsoft.
You are correct in that conclusion.
Can you provide anything authoritative to support that?
I am intrigued.
May 10, 2009 at 4:36 am
Florian Reischl (5/10/2009)
Confirmed, FOR XML PATH('') is a hack which does not create valid XML. Just notice that this is a built-in feature. FOR XML PATH is by definition a simpler alternative to FOR XML EXPLICIT, and the description for FOR XML EXPLICIT says (BOL): "...you must ensure that the generated XML is well formed and valid."
It is a hack with many limitations (as you wrote), but sometimes it is a nice ad-hoc feature. I use it in some of my scripts to resolve 1:n relations into one result row. (Scripts, not procedures.)
Evening Flo!
Yes I have had to use it too - and I hated myself for it 🙂
I have a feeling the whole FOR XML debate is going to become like Marmite* - you either love it or hate it. :laugh:
Paul