April 17, 2009 at 6:38 am
Hi Phil
I'm with Paul. RegEx is a powerful tool for complex searches, replacements and validations in text. For a simple text a standard string.Split or a char loop (outstanding test 😉 ) seems to be the fastest solution.
Greets
Flo
April 17, 2009 at 7:17 am
Derek, not me, Florian. All I'm doing is showing you guys how to test the idea! Let's get some detail on the relative performance of all these functions.
I decided I'd test it myself. The CLR RegEx split routine is forty times slower than the Peso routine to split the entire Moby Dick novel into lines. My routine, and your While loop took about the same time, but you missed a line. Peso's is a third faster. I'm still experimenting.
Best wishes,
Phil Factor
April 17, 2009 at 7:24 am
Hi Phil
Phil Factor (4/17/2009)
Derek, not me, Florian.
Oops... Sorry!
All I'm doing is showing you guys how to test the idea! Let's get some detail on the relative performance of all these functions.
I decided I'd test it myself. The CLR RegEx split routine is forty times slower than the Peso routine to split the entire Moby Dick novel into lines. My routine, and your While loop took about the same time, but you missed a line. Peso's is a third faster. I'm still experimenting.
I'm also still experimenting 😉 - in addition some other CLR approaches.
Greets
Flo
April 17, 2009 at 7:28 am
Hey Flo!
I'm surprised no-one has suggested sending the entire text to split to a CLR function in an XML parameter, generated from a SELECT with some kind of FOR XML stuff on the end. One round trip...interesting?!
A completely mad idea, of course...?
Paul
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 17, 2009 at 7:41 am
Phil Factor (4/17/2009)
Of course, splitting a string is a variant of RegEx, which seems to be the main reason for using a CLR solution for something which could, in theory, be done purely in T-SQL.
See the RegExSplit function in 'CLR Assembly RegEx Functions for SQL Server by Example [/url]' over on Simple-Talk. The article took me blooming ages.
Exactly my point! CLR is very good at RegEx (and beats T-SQL) but is usually beaten by a well-written T-SQL solution in other areas. Hence it's not much of a surprise when CLR wins at the simpler pattern matching problem. Just slightly disappointing...
Obviously, if the CLR is being used to do something which T-SQL can't actually do...
Derek
April 17, 2009 at 7:43 am
Paul White (4/17/2009)
Derek Dongray (4/17/2009)
Of course, splitting a string is a variant of RegEx, which seems to be the main reason for using a CLR solution for something which could, in theory, be done purely in T-SQL.Are you sure?
Splitting a string at a delimiter certainly involes a search, but is that "a variant" of a regular expression? Or is a regular expression "a variant" of a search?
I could equally say that splitting a string is "a variant" of a linear array search. If I was in uber-nitpick mode, I might even say that splitting a string into fixed-length chunks doesn't even go that far.
I usually write regular expressions in Perl where the syntax to split a string at a CRLF (which was the original problem) would need an expression something like "(.*)\n(.*)". Hence I tend to group pattern matching and RegEx together.
Splitting into fixed length chunks is obviously a different problem.
The second part of your statement seems to miss the point somewhat: the usual solution (done purely in T-SQL) is less efficient that calling string.Split() in a CLR routine.
I obviously didn't express myself well.
The point I was trying to make was that, from what I've read and my own experiments with CLR, the only area where CLR functions can beat an equivalent process written entirely in T-SQL is when using RegEx and hence the usual advice is to use the T-SQl solution except when regular expressions are involved.
Since I view the search for CRLF to be a special case of a regular expression search, it's not really surprising a CLR solution would win out.
Note that the CLR solutions posted did not use RegEx. There was a hint of XQuery in the XML version...? Perhaps that was it?
Paul
Since the problem was to locate CRLF within a string, this is pattern matching (whether you use String.split, charindex, patindex or full regular expression processing) and, in my mind as I said above, these all fall into the same area, which I referred to a RegEx rather than pattern matching.
Although this problem doesn't need a full regular expression since the pattern is fixed, it's easy to see where the problem might be extended into, for example, breaking up a block of text at numbered paragraphs where the pattern/expression would need to be something like "(.*)\n(\d+.*)".
Clearly, if CLR wins on the simpler problem, then it's going to also win on more complex pattern matches and/or regular expression searches. It's just slightly disappointing that it wins out on even the simpler case.
Derek
April 17, 2009 at 7:49 am
Phil Factor (4/17/2009)
Derek, not me, Florian. All I'm doing is showing you guys how to test the idea! Let's get some detail on the relative performance of all these functions.I decided I'd test it myself. The CLR RegEx split routine is forty times slower than the Peso routine to split the entire Moby Dick novel into lines. My routine, and your While loop took about the same time, but you missed a line. Peso's is a third faster. I'm still experimenting.
Hey, Phil! Are you saying that CLR RegEx can be beaten by T-SQL?!?:w00t:
I recall reading several posts and running a whole slew of tests a few years ago which concluded that CLR RegEx would beat T-SQL every time. Is this another assumption that's wrong?
Or should I go and dig out my test scripts (if I still have them)?
Derek
April 17, 2009 at 8:04 am
Of course, when I thought about it, the Perl to split a string would simply be.
@lines = split('\ n',$string)
Since I usually need to do more complex matches I immediately thought of coding it as a full RegEx, which is, of course, unnecessary. :blush:
Now all I need is the Perl.Net CLR interface... 😀
P.S.: I gave up trying to get a Backslash-N sequence into the code and stuck a space in!
Derek
April 17, 2009 at 8:13 am
Paul White (4/17/2009)
Hey Flo!I'm surprised no-one has suggested sending the entire text to split to a CLR function in an XML parameter, generated from a SELECT with some kind of FOR XML stuff on the end. One round trip...interesting?!
A completely mad idea, of course...?
Paul
Hey Paul!
Convert everything to XML, send to CLR, split and back to TSQL? And you called me spooky in another thread... :w00t:
This may become a resource problem if the text table is to huge. Anyway, I have to try this! :Whistling:
Greets
Flo
PS: It's about 2:00 AM where you are... GO TO BED!
April 17, 2009 at 8:14 am
Derek Dongray (4/17/2009)
Phil Factor (4/17/2009)
Derek, not me, Florian. All I'm doing is showing you guys how to test the idea! Let's get some detail on the relative performance of all these functions.I decided I'd test it myself. The CLR RegEx split routine is forty times slower than the Peso routine to split the entire Moby Dick novel into lines. My routine, and your While loop took about the same time, but you missed a line. Peso's is a third faster. I'm still experimenting.
Hey, Phil! Are you saying that CLR RegEx can be beaten by T-SQL?!?:w00t:
I recall reading several posts and running a whole slew of tests a few years ago which concluded that CLR RegEx would beat T-SQL every time. Is this another assumption that's wrong?
Or should I go and dig out my test scripts (if I still have them)?
Hi!
Thanks for the proposal!
I'll add a new test function for for CLR and RegEx this evening. I keep you informed!
Greets
Flo
April 17, 2009 at 8:20 am
Derek Dongray (4/17/2009)
Of course, when I thought about it, the Perl to split a string would simply be.
@lines = split('\ n',$string)
Since I usually need to do more complex matches I immediately thought of coding it as a full RegEx, which is, of course, unnecessary. :blush:
Now all I need is the Perl.Net CLR interface... 😀
P.S.: I gave up trying to get a Backslash-N sequence into the code and stuck a space in!
Perl.Net... erm... let me think about... nope! 😀
I'll try the .Net RegEx.
Greets
Flo
April 17, 2009 at 8:32 am
Flo!
Yes it's getting late - and I have an appointment with the dentist early in the morning...eeek!
Resource problems? Well maybe, but we could always process the XML in appropriate 'chunks'...as I'm sure you are already spookily aware. 😉
Cool yeah - no problem. Mostly I wanted you to clarify what you meant, and you've done that. It's not that important that we ultimately agree, after all 😉
I don't think it is 'disappointing' if CLR 'beats' T-SQL, they are both part of the same product after all. The right tool for the job, that's the ticket.
Um, and no - CLR is more appropriate than T-SQL for a whole range of operations, not just regex-like operations as you seem to be saying. CLR adds a *huge* range of mostly complimentary abilities to the SQL Server that went before. I'm not going to go on about it, as I think it's self-evident after a minute's thought.
It is nice to see that RBAR operations (in the CLR) are now faster than all set based operations.*
Paul
* - Do not feed the obvious troll 😛
edit for spelling
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 17, 2009 at 8:37 am
April 17, 2009 at 10:29 am
Hi Paul
Paul White (4/17/2009)
Resource problems? Well maybe, but we could always process the XML in appropriate 'chunks'...as I'm sure you are already spookily aware. 😉
Chunks? Your suggestion was to get all in one XML.
I know how to handle data chunks from huge stateful cells (e.g. an own stream implementation) and I know how to handle row chunks (paging/server-side cursors). I'm not sure (eq. don't know :hehe:) what you mean with "FOR XML" in chunks. FOR XML creates one row with one column as a temporary value, please correct me if I'm wrong ;-). How would handle this in chunks? The SqlXml .Net type has the CreateReader() method which creates a XmlReader but the data are already on client side (in CLR) at this moment.
Greets
Flo
April 17, 2009 at 1:39 pm
Paul White (4/17/2009)
It is nice to see that RBAR operations (in the CLR) are now faster than all set based operations.** - Do not feel the obvious troll 😛
Why not? You're the one speaking in absolutes. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 76 through 90 (of 522 total)
You must be logged in to reply to this topic. Login to reply