April 27, 2009 at 6:42 pm
Lynn Pettis (4/27/2009)
Flo...Since there was no production problem I had, after this input I could show my chief some really great new things and it causes that some parts of our applications will run about 100 times faster in future. So I also had a huge benefit for my business.
I'm a younger guy who tries to become better. I know that I'm one of the best developers in my company. But I also know there are many people who are much better than me - like here! So I'm very thankful for every answer to my questions, corrections of other posts or just some tips!
I really hope that you are not angry because there was initially no production problem to be fixed...
SSC isn't just about helping people fix production problems, it is also about helping people learn more about SQL Server. What is happening on this thread is a testament to this communities commitment to helping people learn and improve their skills and abilities. There is absolutely no reason to apologize for starting this thread. It is an awesome thread and we can't wait (but will have to) for you to write this whole thing up as an article (or two or three).
Man, I've gotta agree with that! I don't believe I've ever seen so many different solutions for the same thing. Even Adam has joined in the fray.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 28, 2009 at 3:02 am
I've been doing a 'skunkworks' project in TSQL that takes a database and scripts it out every day using SMO/DMO to file, and then imports the file contents into a table to check on what has changed since the last time it was done. It keeps a record of all changes and allows you to revert the database to a particular build. Essentially, you get a complete history of your database development and can inspect the revisions in a WIKI. Because of the file-mechanism for doing this, it allows you to archive CLR, data insert scripts and all the other stuff needed for a build.
I needed line and token splitting for all this: Hence my particular interest in this thread. (I use a WHILE loop). I know all this is easy, but I was tickled by the idea of doing it all in TSQL (with a thin smear of ASP.NET). DIFFing was an interesting challenge. I wanted to do it since, as a 'hairy-arsed' database developer, rather than a computer scientist, I felt I knew what joe DBA would want from such a system. It has been a great learning experience!
Best wishes,
Phil Factor
April 28, 2009 at 3:33 am
Jeff Moden (4/27/2009)
Florian Reischl (4/27/2009)
Hi AdamThe String Enumerator (I just called it "String Enumerator" for better understanding what it does) performs very good! Especially for more rows. I just tried with some of my test cases:
ItemLength ItemCount ItemRows ItemType ItemDescription Ranking Module Duration(ms) CpuTime
----------- ----------- ----------- -------------- ---------------- -------- -------------------------- ------------ -----------
10 10 10 VARCHAR(111) Fixed Length 1 CLR Simple 00:00:00.010 0
10 10 10 VARCHAR(111) Fixed Length 2 CLR RegEx 00:00:00.010 0
10 10 10 VARCHAR(111) Fixed Length 3 CLR String Enumerator 00:00:00.010 16
10 10 10 VARCHAR(111) Fixed Length 4 Traditional Tally TOP(0) 00:00:00.010 16
10 10 10 VARCHAR(111) Fixed Length 5 Cursor 00:00:00.033 15
10 10 10 VARCHAR(111) Fixed Length 6 CLR char loop 00:00:00.037 0
10 10 10 VARCHAR(111) Fixed Length 7 CLR XML 00:00:00.053 31
10 10 10 VARCHAR(111) Fixed Length 8 Traditional Tally 00:00:00.113 0
10 10 1000 VARCHAR(111) Fixed Length 1 Traditional Tally TOP(0) 00:00:00.203 156
10 10 1000 VARCHAR(111) Fixed Length 2 CLR String Enumerator 00:00:00.443 422
10 10 1000 VARCHAR(111) Fixed Length 3 CLR Simple 00:00:00.467 468
10 10 1000 VARCHAR(111) Fixed Length 4 CLR RegEx 00:00:00.473 468
10 10 1000 VARCHAR(111) Fixed Length 5 CLR char loop 00:00:00.533 516
10 10 1000 VARCHAR(111) Fixed Length 6 Traditional Tally 00:00:00.747 188
10 10 1000 VARCHAR(111) Fixed Length 7 CLR XML 00:00:00.987 954
10 10 1000 VARCHAR(111) Fixed Length 8 Cursor 00:00:01.027 969
100 10 10 VARCHAR(1011) Fixed Length 1 CLR Simple 00:00:00.010 16
100 10 10 VARCHAR(1011) Fixed Length 2 CLR RegEx 00:00:00.010 15
100 10 10 VARCHAR(1011) Fixed Length 3 CLR String Enumerator 00:00:00.013 16
100 10 10 VARCHAR(1011) Fixed Length 4 Traditional Tally TOP(0) 00:00:00.017 16
100 10 10 VARCHAR(1011) Fixed Length 5 CLR char loop 00:00:00.027 0
100 10 10 VARCHAR(1011) Fixed Length 6 Cursor 00:00:00.040 16
100 10 10 VARCHAR(1011) Fixed Length 7 CLR XML 00:00:00.070 15
100 10 10 VARCHAR(1011) Fixed Length 8 Traditional Tally 00:00:00.370 31
100 10 100 VARCHAR(1011) Fixed Length 1 CLR String Enumerator 00:00:00.073 47
100 10 100 VARCHAR(1011) Fixed Length 2 CLR char loop 00:00:00.087 63
100 10 100 VARCHAR(1011) Fixed Length 3 CLR RegEx 00:00:00.087 62
100 10 100 VARCHAR(1011) Fixed Length 4 CLR Simple 00:00:00.090 78
100 10 100 VARCHAR(1011) Fixed Length 5 Traditional Tally TOP(0) 00:00:00.110 94
100 10 100 VARCHAR(1011) Fixed Length 6 CLR XML 00:00:00.150 109
100 10 100 VARCHAR(1011) Fixed Length 7 Cursor 00:00:00.167 140
100 10 100 VARCHAR(1011) Fixed Length 8 Traditional Tally 00:00:00.417 141
100 10 100 NVARCHAR(1011) Fixed Length 1 CLR String Enumerator 00:00:00.057 47
100 10 100 NVARCHAR(1011) Fixed Length 2 CLR Simple 00:00:00.070 63
100 10 100 NVARCHAR(1011) Fixed Length 3 CLR RegEx 00:00:00.070 62
100 10 100 NVARCHAR(1011) Fixed Length 4 Traditional Tally TOP(0) 00:00:00.143 110
100 10 100 NVARCHAR(1011) Fixed Length 5 CLR char loop 00:00:00.147 79
100 10 100 NVARCHAR(1011) Fixed Length 6 CLR XML 00:00:00.150 125
100 10 100 NVARCHAR(1011) Fixed Length 7 Cursor 00:00:00.180 141
100 10 100 NVARCHAR(1011) Fixed Length 8 Traditional Tally 00:00:00.720 109
100 10 1000 NVARCHAR(1011) Fixed Length 1 CLR String Enumerator 00:00:00.567 532
100 10 1000 NVARCHAR(1011) Fixed Length 2 CLR char loop 00:00:00.670 547
100 10 1000 NVARCHAR(1011) Fixed Length 3 CLR Simple 00:00:00.677 609
100 10 1000 NVARCHAR(1011) Fixed Length 4 CLR RegEx 00:00:00.717 641
100 10 1000 NVARCHAR(1011) Fixed Length 5 Traditional Tally TOP(0) 00:00:01.313 1172
100 10 1000 NVARCHAR(1011) Fixed Length 6 Cursor 00:00:01.350 1313
100 10 1000 NVARCHAR(1011) Fixed Length 7 CLR XML 00:00:01.517 1094
100 10 1000 NVARCHAR(1011) Fixed Length 8 Traditional Tally 00:00:01.837 1141
10 10 10 VARCHAR(111) Variable Length 1 Traditional Tally TOP(0) 00:00:00.007 0
10 10 10 VARCHAR(111) Variable Length 2 CLR RegEx 00:00:00.007 16
10 10 10 VARCHAR(111) Variable Length 3 CLR String Enumerator 00:00:00.007 16
10 10 10 VARCHAR(111) Variable Length 4 CLR Simple 00:00:00.010 0
10 10 10 VARCHAR(111) Variable Length 5 CLR char loop 00:00:00.020 0
10 10 10 VARCHAR(111) Variable Length 6 Cursor 00:00:00.030 15
10 10 10 VARCHAR(111) Variable Length 7 CLR XML 00:00:00.047 15
10 10 10 VARCHAR(111) Variable Length 8 Traditional Tally 00:00:00.097 0
10 10 1000 VARCHAR(111) Variable Length 1 Traditional Tally TOP(0) 00:00:00.110 109
10 10 1000 VARCHAR(111) Variable Length 2 CLR String Enumerator 00:00:00.453 453
10 10 1000 VARCHAR(111) Variable Length 3 CLR Simple 00:00:00.457 469
10 10 1000 VARCHAR(111) Variable Length 4 CLR RegEx 00:00:00.483 484
10 10 1000 VARCHAR(111) Variable Length 5 CLR char loop 00:00:00.537 484
10 10 1000 VARCHAR(111) Variable Length 6 Traditional Tally 00:00:00.537 141
10 10 1000 VARCHAR(111) Variable Length 7 Cursor 00:00:00.933 906
10 10 1000 VARCHAR(111) Variable Length 8 CLR XML 00:00:01.120 891
500 10 10 VARCHAR(5011) Variable Length 1 CLR String Enumerator 00:00:00.010 16
500 10 10 VARCHAR(5011) Variable Length 2 CLR Simple 00:00:00.013 15
500 10 10 VARCHAR(5011) Variable Length 3 CLR RegEx 00:00:00.013 0
500 10 10 VARCHAR(5011) Variable Length 4 CLR char loop 00:00:00.020 0
500 10 10 VARCHAR(5011) Variable Length 5 Traditional Tally TOP(0) 00:00:00.027 15
500 10 10 VARCHAR(5011) Variable Length 6 Cursor 00:00:00.043 15
500 10 10 VARCHAR(5011) Variable Length 7 CLR XML 00:00:00.087 16
500 10 10 VARCHAR(5011) Variable Length 8 Traditional Tally 00:00:00.560 32
500 10 100 VARCHAR(5011) Variable Length 1 CLR RegEx 00:00:00.073 62
500 10 100 VARCHAR(5011) Variable Length 2 CLR String Enumerator 00:00:00.073 47
500 10 100 VARCHAR(5011) Variable Length 3 CLR Simple 00:00:00.090 78
500 10 100 VARCHAR(5011) Variable Length 4 CLR char loop 00:00:00.110 78
500 10 100 VARCHAR(5011) Variable Length 5 CLR XML 00:00:00.193 125
500 10 100 VARCHAR(5011) Variable Length 6 Traditional Tally TOP(0) 00:00:00.210 203
500 10 100 VARCHAR(5011) Variable Length 7 Cursor 00:00:00.217 188
500 10 100 VARCHAR(5011) Variable Length 8 Traditional Tally 00:00:00.440 219
Greets
Flo
PS: I just formatted the code more to a C# look in my solution. DBAs writing C#, pew... 😀
Flo, which test generator are you using for this. Just gotta make sure.
Hi Jeff
I use a little extended version of the generic approach I posted. Complete environment (CLR-Project in C#, Database-DDL, Test script) is attached here.
Greets
Flo
April 28, 2009 at 3:34 am
Bob Hovious (4/23/2009)
With respect to testing, and I apologize for not having done more, it seems there are at least three or four variables about the test data to consider:1. The number of rows in the table
2. The average size of the data in the column to be parsed
3. The average number of delimited strings to be parsed per row
4. The average size of the strings being parsed (really the inverse of #3 above).
1. should scale almost linerarily for all of these methods, except one semi-numerical issue(below).
2. Only relevant as part of the same semi-numrical issue below.
3. see #4, below.
4. is really #2 &3 refactored as average space between delimiters (i.e.: #2 / #3).
I never thought about this before, but would the optimizer choose a different execution plan based on the characteristics (RAM, number of processors) of the server (or instance) it was running on?
Yes, definitely. paralellism thresholds alone gets figured in here.
The semi-numerical issues I refferred to above is whter the test can be done all in memory or not. Although we test on large datasets (million rows) modern memorys are truly huge (2-4gb for personal, 8-16gb for servers) so our tests actually typically fit entirely within memory (remember to count caches for this too). however going up by just one factor of 10 would put about half of our test systems' optimizer in the position of noe know it if will fit for sure. this can have several effects:
1) memory degredation is an extremely sharp-kneed threshold, so much that it can barely be considered numerical.
2) The methods that are more memory eficient will be effected last and generally least.
3) I must confess, that I do not know for surif the optimizer will change the query plan as you potentially approach the threshold or not.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
April 28, 2009 at 4:23 am
Hey Flo,
Great stuff - very handy having that all in one post.
You know, Adam's CLR function may not always offer a performance advantage; however it's memory efficiency and potential scalability (assuming that the mulit-page allocator memory area is at a premium and garbage collection interruptions can be tolerated) make it the one to beat overall...do you agree?
I bet you didn't know what you were starting when you asked your first question to start this thread! :hehe:
Cheers,
Paul
April 28, 2009 at 4:42 am
Hey Paul
Completely correct. The huge advantage of the enumerator approaches should be a better scalability. Since now I only tested the single request handling to get some indicators. Workloads are outstanding but will be covered until end of the story ;-).
I bet you didn't know what you were starting when you asked your first question to start this thread! :hehe:
Nope. I couldn't suspect this thread would start this debate on principals. It stays very interesting.
I think my current test environment covers almost every possible type of data to be split. So I hope I can start with load tests this evening (your early morning). Can't promise, but that's the current plan.
Keep you (all) informed!
Flo
April 28, 2009 at 5:33 am
EDIT: forget it - when I figure out how to post results I will try again!
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
April 28, 2009 at 6:02 am
mister.magoo (4/28/2009)
EDIT: forget it - when I figure out how to post results I will try again!
Ooo! You removed that just as I was quoting you!
Actually the formatting was ok. 😉
The 'simpler function' is pretty near identical to previously posted code on this thread.
The MAX version of your 'simpler' function lacks a SqlFacet to mark it as a LOB.
It may perform about the same as Adam's, but you missed the crucial point that Adam's version splits the string on demand, as SQL Server asks for rows from the TVF. The advantage is that it avoids the potentially large initial memory allocation required to split the string all in one go, and allows more effective garbage collection.
I would encourage you to visit Adam's Blog entry for a detailed explanation.
Paul
April 28, 2009 at 6:43 am
We're not done yet -- if we're going to create the "best" method it should do as much as possible in terms of functionality we would normally want from string splitting routines. So I'm currently in the process of modifying Flo's modification to my character-based approach, and trying to figure out how to make it ignore consecutive delimiters ... Any other ideas? I often add a DISTINCT to my T-SQL efforts but I don't think that's a possibility in the SQLCLR routine.
--
Adam Machanic
whoisactive
April 28, 2009 at 7:04 am
Ah. A post from Adam. Must be later than I realized. :w00t:
This is probably feeping creaturism, but it would be great if it could handle more complex splitting tasks.
For example, strings like:
~Parent~child, child, child, child~Another Parent~its child~...
or
~Fish~colour=gold, size=M, memory=0~...
and yes, this is a real example (fish used to protect the guilty).
Also, I would still like to see a CLR TVF stream on the input as well as the output. I don't think this is going to fly though - not without an external class library, the unsafe permission set, and maybe (gasp) storing to a static field - a horrible idea, which I am trying hard to forget.
Time for bed.
Paul
edit: P.S. Most reasonable requirements could be handled I guess by allowing a regualr expression to be passed in, but that will break the nice split-on-the-fly thing. Wouldn't it?
April 28, 2009 at 7:28 am
Adam Machanic (4/28/2009)
We're not done yet -- if we're going to create the "best" method it should do as much as possible in terms of functionality we would normally want from string splitting routines. So I'm currently in the process of modifying Flo's modification to my character-based approach, and trying to figure out how to make it ignore consecutive delimiters ... Any other ideas? I often add a DISTINCT to my T-SQL efforts but I don't think that's a possibility in the SQLCLR routine.
Parsing quoted strings, which would allow embedded delimiters
e.g.
"Field1 with commas ,," , Field2 , "Field 3 with embedded quote"" here"
which would give three rows
Field1 with commas ,,
Field2
Field 3 with embedded quote " here
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537April 28, 2009 at 7:39 am
Paul White (4/28/2009)
mister.magoo (4/28/2009)
EDIT: forget it - when I figure out how to post results I will try again!Ooo! You removed that just as I was quoting you!
Sorry!
The 'simpler function' is pretty near identical to previously posted code on this thread.
Yes it is, and almost identical to Flo's but I could not understand the point of the complexity built into a simple split...
The MAX version of your 'simpler' function lacks a SqlFacet to mark it as a LOB.
Fair enough - I didn't originally use it with MAX in mind, so just kind of added that in for Flo's test harness.
It may perform about the same as Adam's, but you missed the crucial point that Adam's version splits the string on demand, as SQL Server asks for rows from the TVF. The advantage is that it avoids the potentially large initial memory allocation required to split the string all in one go, and allows more effective garbage collection.
I would encourage you to visit Adam's Blog entry for a detailed explanation.
I accept that premise and offer these thoughts:
Realistically though - how often would you encounter a single chunk of text large enough for that to be an actual problem? (See end of post for more thoughts on this...)
If truly large amounts of data are to be processed in one go, is passing that data to a CLR function really the most memory efficient method of working with it anyway?
(re-reading - that sounds a bit narky and it is not meant to be, so please treat these comments as happy discussion, not defensive or attacking 😛 )
As for performance, I would offer up preliminary results:
This is the result of two different sets of input data - one was 4000 rows of up to 4000 characters, the other 40,000 rows of up to 100 characters (to speed things up a little).
Both sets of data had each "text" chunk split 5 times at varying (random) positions, including some consecutive delimiters and some text chunks starting with or ending with a delimiter.
source_rows source_min_length source_max_length source_avg_length func_name quickest_time longest_time average_time
4000 5 3902 1970 CLR Simpler Split 853 1360 1130
4000 5 3902 1970 ufn_clr_SplitString_Enumerator_Flo2 900 2523 1271
4000 5 3902 1970 even_faster 1073 1506 1279
40000 6 93 49 CLR Simpler Split 3320 3560 3387
40000 6 93 49 ufn_clr_SplitString_Enumerator_Flo2 4736 4736 4736
You will notice the "even_faster" function posted by Adam is missing from the 40000 row sample - I had to kill it after 9 minutes and still no results.
I appreciate this is still a limited test and larger datasets may see a reversal of that - also I have no doubt that Adam would be able to tweak whatever is causing the problem. Of course it could also be that "some other factor" was causing Adam's function to take a long time - it was a quick test and it was late...
Aside..sort of...
Thinking....
Whilst out to lunch just now, I got to thinking about how irritating it was trying to get the results formatted correctly - Presentation Layer problem - and this led me to think about the problem from a wider perspective.
"Why is this horrible data being put into SQL Server tables like this in the first place?" - there is no business case to solve - and I for one cannot think of a single situation where I would willingly put data like this into a database - knowing I would then have to "untangle" it using T-SQL.
My own feeling is that this is a twist on the Presentation Layer - it is just that we should now be thinking about "presenting" the data to SQL server in a reasonable format.
The topic has tended towards CLR because it is better at dealing with convoluted tasks such as this - my take is now :
"Use .Net (or some other suitable system - perl, php, java , whatever, multi-threaded!) to manipulate the data before it is presented to SQL server, then let SQL server deal with the data in sets as it is designed to do."
If that means using a BULK INSERT to split the text (as demonstrated earlier in this thread) BEFORE the data gets into SQL server then fine, use that SQL server tool to do it, but why is anyone allowing "dirty" data into their database only to have to attempt to clean it using a set based language?
(Wow - not finding it easy to post results REALLY annoyed me!)
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
April 28, 2009 at 8:07 am
You will notice the "even_faster" function posted by Adam is missing from the 40000 row sample - I had to kill it after 9 minutes and still no results.
Probably a bug in the code causing it to run into an endless loop. I noticed that there was a similar issue with leading delimiters, which I've fixed on my end but not posted back here yet.
--
Adam Machanic
whoisactive
April 28, 2009 at 8:10 am
Parsing quoted strings, which would allow embedded delimiters
e.g.
"Field1 with commas ,," , Field2 , "Field 3 with embedded quote"" here"
That's a fun challenge -- I did it in T-SQL a while back (yes, I've been obsessed with string splitting for a long, long time ;-)):
http://sqlblog.com/blogs/adam_machanic/archive/2006/07/12/tokenize-udf.aspx
--
Adam Machanic
whoisactive
April 28, 2009 at 8:20 am
@mister.magoo,
I think the function you refer to was one I posted originally, and improved by Flo. Not that I'm territorial!
The point about the scalability is not so much about how much memory a single call would allocate, rather: consider many concurrent users executing the function...!
Cheers,
Paul
Viewing 15 posts - 286 through 300 (of 522 total)
You must be logged in to reply to this topic. Login to reply