April 20, 2009 at 10:20 am
Hi Phil
Thanks for your tests! Heh... so it seems that the best way stays depending on the current case. 🙂
I just tried to reproduce but in my environment the Chars solution seems to be faster. Here the statistics about the "Monster" table:
SourceType AvgLineLength CountLines MaxLineLength AvgTextLength MaxTextLength SumTextLength
---------- ------------- ----------- ------------- -------------------- -------------------- --------------------
Monster 1024000 160 1024001 8192014 8192014 163840280
... So the lines are 1mb and each row contains 8mb data.
I tried with 20 rows and get the following results:
Sorting Name Lines Duration CpuTime LogicalReads PhysicalReads
------- ------------------ ----- -------- ------- ------------ -------------
Monster CLR Chars solution 160 15750 969 553428 7980
Monster Cursor solution 160 18706 3062 686344 5907
Could you please post one of the cursor solutions you used? There have been so much scripts... I think I lost track.
Greets
Flo
April 20, 2009 at 11:22 am
I'd still like to see what the inputs and outputs of the binary to hex problem look like just to play...
--Jeff Moden
Change is inevitable... Change for the better is not.
April 20, 2009 at 11:26 am
I modified the WHILE loop function to make it quicker, and to give the same table of results as the Quirky Update function, which I haven't changed. The WHILE loop function is getting closer to Peso's solution now, so it is a sort of hybrid. I'm working on getting a really fast Tally-based function but that's a bit of a struggle on a large block of text! The tally isn't linear either on my test machine....
[font="Courier New"]
ALTER FUNCTION [dbo].[WhileSplit](@String NVARCHAR(MAX), @Delimiter VARCHAR(255))
RETURNS @strings TABLE
(
line_no INT IDENTITY(1,1),
theIndex INT DEFAULT 0,
previousIndex INT DEFAULT 0,
TheLine VARCHAR(MAX) DEFAULT '')
AS
-- body of the function
BEGIN
DECLARE @pos INT, @next INT, @LenDelimiter INT
SELECT @pos = 1, @lenDelimiter=LEN(@Delimiter),
@next = CHARINDEX(@Delimiter, @String, @pos)
WHILE (@next > 0)
BEGIN
INSERT INTO @Strings (PreviousIndex,TheIndex,TheLine)
SELECT @pos, @next, SUBSTRING(@String, @pos, @next - @pos)
SELECT @pos = @next + @lenDelimiter,
@next = CHARINDEX(@Delimiter, @String, @pos)
END
INSERT INTO @Strings (PreviousIndex,TheIndex,TheLine)
SELECT @pos, @next, SUBSTRING(@String, @pos, LEN(@STRING)-@POS+1)
RETURN
END
[/font]
The text of Moby Dick is all over the place on the internet. It is very useful, though Melville used pretty long lines/paragraphs.
Best wishes,
Phil Factor
April 20, 2009 at 12:12 pm
Hi Phil
As first, thanks again for all your input!
I included your function into my test environment and took the Moby Dick text into a new table #SourceMobyDick. Here the stats:
SourceType AvgLineLength CountLines MaxLineLength AvgTextLength MaxTextLength SumTextLength SourceLines
---------- ------------- ----------- ------------- ------------- ------------- ------------- -----------
MobyDick 609116 1 609116 609116 609116 609116 1
And here the results on my system.
Sorting Name Duration(ms) CpuTime LogicalReads PhysicalReads
-------- ---------------------- ------------ ----------- ------------ -------------
Large CLR Chars solution 350 343 7321 0
Large CLR DanielJ solution 420 437 7321 0
Large Cursor solution 2166 2106 155276 0
Large Phil's Cursor solution 2240 2090 152109 0
Many CLR Chars solution 1183 1123 6669 0
Many CLR DanielJ solution 1430 1498 6669 0
Many Cursor solution 6960 7035 478439 0
Many Phil's Cursor solution 7443 7160 476209 0
MobyDick CLR Chars solution 120 109 7838 0
MobyDick CLR DanielJ solution 143 125 7841 0
MobyDick Cursor solution 796 795 80884 0
MobyDick Phil's Cursor solution 853 733 79525 0
Monster CLR DanielJ solution 30 31 2492 0
Monster CLR Chars solution 63 16 2492 0
Monster Phil's Cursor solution 186 234 7137 0
Monster Cursor solution 196 219 9820 0
Short CLR Chars solution 133 156 149 0
Short CLR DanielJ solution 160 156 149 0
Short Cursor solution 696 702 20347 0
Short Phil's Cursor solution 763 796 18341 0
Looks a bit strange if I compare to your results. On my system the CLR functions are still much faster and even my cursor performs better... :ermm:
April 20, 2009 at 12:31 pm
Am I right in seeing that your copy of Moby Dick only has one line? My copy is twice the length too.
Best wishes,
Phil Factor
April 20, 2009 at 12:40 pm
Phil Factor (4/20/2009)
Am I right in seeing that your copy of Moby Dick only has one line? My copy is twice the length too.
Hi Phil
Sorry, I had an error in my stats. The performance counters are correct. Here the correct statistics:
SourceType AvgLineLength CountLines MaxLineLength AvgTextLength MaxTextLength SumTextLength SourceLines
---------- ------------- ----------- ------------- ------------- ------------- ------------- -----------
MobyDick 93 12801 120 1218231 1218231 1218231 1
I took the Moby Dick text from here:
http://www.enotes.com/mobydick-text/
I downloaded the PDF and copied all into a text file, than loaded by:
CREATE TABLE #SourceMobyDick (Name NVARCHAR(128), Definition NVARCHAR(MAX))
INSERT INTO #SourceMobyDick
SELECT 'Moby Dick', CONVERT(VARCHAR(MAX), BulkColumn)
FROM OPENROWSET ( BULK 'C:\AnyWhere\MobyDick.txt', SINGLE_BLOB ) t
If you have another source it would be great if you could give me a link!
Greets
Flo
April 20, 2009 at 3:22 pm
Phil Factor (4/20/2009)
I'm sorry, folks, but these CLR string-splitting techniques don't scale up
It will be interesting to see how this pans out. I would be surprised if your results are correct, but I've been surprised before. As for the wrong line-count, make sure you are using the CLR TVF with the right String.Split option. The initial version threw away blanks (on purpose).
Cheers,
Paul
April 20, 2009 at 5:58 pm
I use the recursive function below
Create Function dbo.GetStringTokens(
@String Varchar(Max),
@Delimiter varchar(1)
)
Returns Table
AS
Return
(
with StartEnd(ID,i,j) as
(
select
ID=1,
i=1,
j=charindex(@Delimiter,@String+@Delimiter)
union all
select
ID=ID+1,
i=j+1,
j=charindex(@Delimiter,@String+@Delimiter,j+1)
from StartEnd
where charindex(@Delimiter,@String+@Delimiter,j+1) 0
)
select
ID=ID,
Token=substring(@String,i,j-i)
from StartEnd
)
April 20, 2009 at 6:20 pm
Hey Edward,
The recursive CTE is cool from a technical point of view, but performance is kinda sucky, and any recursion below 100 levels will require an OPTION (MAXRECURSION = x) hint on the client query - it can't be embedded in the function.
It is an appealing solution though, I just wish MS would improve the performance and make recursion parallelizable.
Paul
April 20, 2009 at 6:24 pm
My strings usually have less that 20 "tokens"
April 20, 2009 at 6:30 pm
Edward Boyle (4/20/2009)
My strings usually have less that 20 "tokens"
Well that's ok then! Though it is worth bearing in mind that future users of your function may not be aware of it's limitations - unless it is well documented I guess.
The thrust of this thread is performance though, which is why I commented as I did.
If you ever need to split larger strings, or lots of them, check out the various methods posted so far. I would be fairly confident that the recursive method would be the slowest.
Cheers,
Paul
April 21, 2009 at 4:11 am
As far as I can tell, the difference in the results that we are getting is due to the way different servers behave with different resources and configurations. I do my testing on an old dog of a server so that performance problems are highlighted.
The problem with looking at these performance issues is that there are so many variables that affect performance. Obviously, the length of the string is an important factor. I'm coming to the conclusion that CLR routines only shine on high-spec servers which aren't suffering heavy usage. I've tried my test-harness on two servers so far.
For the particular string-splitting function, then the WHILE loop, and the Quirky Update solutions, behave predictably and linearly. Whereas they are slower under light load on fast servers, they are scaleable, and the time taken increases linearly. The tally solution is also linear, but very expensive, and it is never a candidate for this sort of task.
The performace of CLR solutions deteriorate as the string-lenth increases. The time taken does not increase linearly as a good database solution must: it is a second-order polynomial curve. The more powerful the server, the flatter the polynomial. I assume that this is due to the CLR routine taking place on a single thread, whereas a TSQL routine is inherently multi-process.
Best wishes,
Phil Factor
April 21, 2009 at 4:55 am
Phil Factor (4/21/2009)
...the difference in the results that we are getting is due to the way different servers behave with different resources and configurations. I do my testing on an old dog of a server so that performance problems are highlighted.
It seems logical that "different machines with different resources and configurations" should produce different results, doesn't it? FYI I use a 2GHz (single core obviously) Pentium IV-m laptop of a fairly ordinary sort. I wouldn't be surprised if your 'old dog' compares quite well with it!
Phil Factor (4/21/2009)
I'm coming to the conclusion that CLR routines only shine on high-spec servers which aren't suffering heavy usage. I've tried my test-harness on two servers so far.
This is not true in my experience. Hopefully other people will be able to add to the weight of evidence either way. A sample of two servers on an unspecified test-harness probably isn't sufficient to satisfy full scientific rigour yet. The definition of "high-end" and "heavy usage" (of what?) probably needs nailing down too.
Phil Factor (4/21/2009)
For the particular string-splitting function, then the WHILE loop, and the Quirky Update solutions, behave predictably and linearly. Whereas they are slower under light load on fast servers, they are scaleable, and the time taken increases linearly. The tally solution is also linear, but very expensive, and it is never a candidate for this sort of task.The performace of CLR solutions deteriorate as the string-lenth increases. The time taken does not increase linearly as a good database solution must: it is a second-order polynomial curve. The more powerful the server, the flatter the polynomial. I assume that this is due to the CLR routine taking place on a single thread, whereas a TSQL routine is inherently multi-process.
It would be great to see the test-harness and results (as Flo provided before announcing conclusions).
Be careful dismissing the tally solution - I understand that Jeff and Flo have some new and surprising results to share with us shortly.
Cheers,
Paul
April 21, 2009 at 5:35 am
Phil Factor (4/21/2009)
For the particular string-splitting function, then the WHILE loop, and the Quirky Update solutions, behave predictably and linearly.
Sorry Phil... Haven't been able to keep up on this thread and I've missed it. Could you repost the code for the Quirky Update splitter solution or post the URL for it? Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 21, 2009 at 6:49 am
My Set-based 'quirky update' solution is half way down the page on page 6 of this thread. http://www.sqlservercentral.com/Forums/Topic695508-338-6.aspx
Which is the current front-runner for the CLR solution? I'm using DanielJ's
The reason I haven't posted my data, harness, or results is that the forum software won't let me. (no damn tables) I'll have to do it in the blog, i reckon, but I'd like to include a good Tally function if there is such a beast.
Paul, the whole point of the exercise, surely, is to try and better-understand the performance factors that determine the best solution. I maybe didn't explain what I meant about the server, but it seems to change the Relative performance of the various solutions, not just the overall times. I've only run the full test suite on two servers (it means creating a database) and would be delighted if we can run it on others. I'm not too bothered which solution eventually comes out as the best. I guess we'll emerge with a better understanding of how to do large-string processing in TSQL. It's all too easy with little strings: even a CTE wll do it. (I just do a regex split on any small string nowadays as the time differences aren't significant)
Best wishes,
Phil Factor
Viewing 15 posts - 136 through 150 (of 522 total)
You must be logged in to reply to this topic. Login to reply