April 13, 2009 at 3:23 pm
[font="Verdana"]I remember doing a bunch of performance testing a couple of years ago when we were converting code from SQL Server 6.5 to SQL Server 7. We found then that in some cases, cursors (or while loops) were faster. So it's interesting to see that can still be the case.
Like everything, there's never one simple answer.
[/font]
April 13, 2009 at 3:26 pm
Here is the tally-function with the same bug as the other routines:
create function [dbo].[fnSplit4](
@parameter varchar(Max) -- the string to split
, @Separator Varchar(64) -- the string to use as a separator
)
RETURNS @Items TABLE(
ID INT -- the element number
, item VARCHAR(8000) -- the split-out string element
, OffSet int -- the original offset
--( not entirely accurate if LEN(@Separator) > 1 because of the Replace() )
)
AS
BEGIN
/*
"Monster" Split in SQL Server 2005
From Jeff Moden, 2008/05/22
BYoung, 2008/06/18: Modified to be a Table-Valued Function
And to handle CR/LF or LF-only line breaks
Test: (scripts all triggers in your database)
Select Lines.Item
From sys.sql_modules M
Join sys.objects O on O.object_id = M.object_id
cross apply dbo.fnSplit4(M.definition, char(13)+char(10)) Lines
Where O.Type = 'TR'
Order by O.create_date, Lines.ID
*/
Declare @Sep char(1)
--our separator character (convenient, doesn't affect performance)
Set @Sep = char(10)
--NOTE: we make the @Sep character LF so that we will automatically
-- parse out rogue LF-only line breaks.
;WITH cteTally AS
(--==== Create a Tally CTE from 1 to whatever the length
-- of the parameter is
SELECT TOP (LEN(@Parameter))
ROW_NUMBER() OVER (ORDER BY t1.ID) AS N
FROM Master.sys.sysColumns t1
CROSS JOIN Master.sys.sysColumns t2
)
INSERT into @Items
SELECT ROW_NUMBER() OVER (ORDER BY N) AS Number,
SUBSTRING(@Parameter, N+1, CHARINDEX(@Sep, @Parameter, N+1)-N-1) AS Value
, N+1
FROM cteTally
WHERE N < LEN(@Parameter)
AND SUBSTRING(@Parameter, N, 1) = @Sep
--Notice how we find the separator
Return
END
And here is the correct version:
create function [dbo].[fnSplit3](
@parameter varchar(Max) -- the string to split
, @Separator Varchar(64) -- the string to use as a separator
)
RETURNS @Items TABLE(
ID INT -- the element number
, item VARCHAR(8000) -- the split-out string element
, OffSet int -- the original offset
--( not entirely accurate if LEN(@Separator) > 1 because of the Replace() )
)
AS
BEGIN
/*
"Monster" Split in SQL Server 2005
From Jeff Moden, 2008/05/22
BYoung, 2008/06/18: Modified to be a Table-Valued Function
And to handle CR/LF or LF-only line breaks
Test: (scripts all triggers in your database)
Select Lines.Item
From sys.sql_modules M
Join sys.objects O on O.object_id = M.object_id
cross apply dbo.fnSplit3(M.definition, char(13)+char(10)) Lines
Where O.Type = 'TR'
Order by O.create_date, Lines.ID
*/
Declare @Sep char(1)
--our separator character (convenient, doesn't affect performance)
Set @Sep = char(10)
--NOTE: we make the @Sep character LF so that we will automatically
-- parse out rogue LF-only line breaks.
--===== Add start and end separators to the Parameter so we can handle
-- all the elements the same way
-- Also change the separator expressions to our separator
-- character to keep all offsets = 1
SET @Parameter = @Sep + Replace(@Parameter, @Separator, @Sep) + @Sep
-- This reduces run-time about 10%
;WITH cteTally AS
(--==== Create a Tally CTE from 1 to whatever the length
-- of the parameter is
SELECT TOP (LEN(@Parameter))
ROW_NUMBER() OVER (ORDER BY t1.ID) AS N
FROM Master.sys.sysColumns t1
CROSS JOIN Master.sys.sysColumns t2
--Note: using the sysColumns trick is faster than a permanent Tally table
-- for this case anyway
)
INSERT into @Items
SELECT ROW_NUMBER() OVER (ORDER BY N) AS Number,
SUBSTRING(@Parameter, N+1, CHARINDEX(@Sep, @Parameter, N+1)-N-1) AS Value
, N+1
FROM cteTally
WHERE N < LEN(@Parameter)
AND SUBSTRING(@Parameter, N, 1) = @Sep
--Notice how we find the separator
Return
END
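For anyone who wants to try it quickly, here is a minimal usage sketch (the sample string is illustrative, not part of the test harness); it should return all three lines even though the second break is LF-only:
-- Usage sketch (illustrative only): split a string that mixes CR/LF and LF-only breaks
DECLARE @sample VARCHAR(MAX), @crlf VARCHAR(2)
SET @crlf   = CHAR(13) + CHAR(10)
SET @sample = 'first line' + @crlf + 'second line' + CHAR(10) + 'third line'
SELECT ID, item, OffSet
FROM dbo.fnSplit3(@sample, @crlf)
ORDER BY ID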
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
April 13, 2009 at 3:28 pm
And finally, here is the inclusion code to include it with the other tests:
--- ==========================================================
--- -> Tally/Function solution
PRINT 'Start Tally-function 3 solution (correct line-splitting)'
SELECT @now = GETDATE()
--- Split text into lines
INSERT INTO @result
SELECT l.item
FROM @source s
CROSS APPLY dbo.fnSplit3(s.definition, char(13)+char(10)) l
--- Results
SELECT @duration = DATEDIFF(MILLISECOND, @now, GETDATE())
SELECT @count = COUNT(*) FROM @result
PRINT 'Milliseconds: ' + CONVERT(VARCHAR(10), @duration) + ' | Lines: ' + CONVERT(VARCHAR(10), @count)
--- <- Tally solution
--- ==========================================================
DELETE FROM @result --- Clean up
--- ==========================================================
--- -> Tally/Function solution
PRINT 'Start Tally-function 4 solution (has same bug as the rest)'
SELECT @now = GETDATE()
--- Split text into lines
INSERT INTO @result
SELECT l.item
FROM @source s
CROSS APPLY dbo.fnSplit4(s.definition, char(13)+char(10)) l
--- Results
SELECT @duration = DATEDIFF(MILLISECOND, @now, GETDATE())
SELECT @count = COUNT(*) FROM @result
PRINT 'Milliseconds: ' + CONVERT(VARCHAR(10), @duration) + ' | Lines: ' + CONVERT(VARCHAR(10), @count)
--- <- Tally solution
--- ==========================================================
DELETE FROM @result --- Clean up
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
April 13, 2009 at 3:32 pm
Good morning Paul! (every time I see your first posts I know it's time for bed 🙂 )
Paul White (4/13/2009)
I'm still wondering why the CLR TVF is so much slower than the XML.
Florian Reischl
I don't really understand why the route through XML is faster than a direct TVF. The only idea I have is that SQL Server calls the FillRow method via reflection, which would be a really ugly implementation! I usually handle those kinds of dynamic method calls via dynamic classes and delegates. It's a little more source code, but the performance improvement can be up to a factor of 1000.
I'm quite sure they use reflection, because you have to specify the name of the row method as a string instead of a delegate. That is often a starting point for reflection solutions (MS isn't perfect either). Some months ago I tested some reflection solutions against dynamic property descriptors and the performance difference was dramatic.
Note that the line counts are different simply because the TVF omits empty results:
string[] items = toSplit.Split(new string[] { delimeter }, StringSplitOptions.RemoveEmptyEntries);
I noticed this ;-). I used StringSplitOptions.None because I want all rows, just split.
Greets
Flo
April 13, 2009 at 3:43 pm
RBarryYoung (4/13/2009)
And finally, here is the inclusion code to include it with the other tests: ...
Hi Barry!
Thanks also for your multi-line tally functions! They are much faster than my tally solution.
I included them all in my test setup and here are the results:
Start tally solution
Milliseconds: 5453 | Lines: 28145
Start Tally-function 3 solution (correct line-splitting)
Milliseconds: 3376 | Lines: 28545
Start Tally-function 4 solution (has same bug as the rest)
Milliseconds: 3316 | Lines: 28145
Start cursor solution
Milliseconds: 1950 | Lines: 28145
Start clr xml solution
Milliseconds: 773 | Lines: 28545
Start clr tvf (Paul White) solution
Milliseconds: 1186 | Lines: 28545
Greets
Flo
April 13, 2009 at 3:47 pm
I think the reason the Tally version is slower has to do with the blob data type, nvarchar(max).
I posted some information in The Thread.
April 13, 2009 at 3:51 pm
Lynn Pettis (4/13/2009)
I think the reason the Tally version is slower has to do with the blob data type, nvarchar(max).
I posted some information in The Thread.
Hi Lynn
Thanks for your feedback! I currently have to handle the NVARCHAR(MAX)...
By the way: which "The Thread" do you mean? I thought there were two.
Greets
Flo
April 13, 2009 at 4:00 pm
Flo,
I understand needing to work with NVARCHAR(MAX). I was trying to find out why the Tally solution may have been slower, and I think the reason is the blob data type. I don't think there is anything inherently wrong with the Tally method when used with the "fixed length" data types, including VARCHAR(8000) and NVARCHAR(4000).
This is where testing a variety of solutions is always a good thing to do.
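As a rough sketch of that idea (this is not Flo's harness; it assumes a permanent Tally table dbo.Tally(N) with at least 8000 rows and uses a made-up sample string), the same tally pattern can be kept entirely in fixed-length types so SUBSTRING and CHARINDEX never touch a MAX type:
-- Sketch: tally-style split where every string is VARCHAR(8000), not a MAX type.
-- Assumes a Tally table dbo.Tally(N) with at least 8000 rows.
DECLARE @work VARCHAR(8000), @sep CHAR(1)
SET @sep  = CHAR(10)
SET @work = 'line one' + CHAR(13) + CHAR(10) + 'line two' + CHAR(10) + 'line three'
SET @work = @sep + REPLACE(@work, CHAR(13) + CHAR(10), @sep) + @sep
SELECT ROW_NUMBER() OVER (ORDER BY N) AS ID,
       SUBSTRING(@work, N + 1, CHARINDEX(@sep, @work, N + 1) - N - 1) AS item
FROM dbo.Tally
WHERE N < LEN(@work)
  AND SUBSTRING(@work, N, 1) = @sep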
April 13, 2009 at 4:02 pm
Lynn Pettis (4/13/2009)
I think the reason the Tally version is slower has to do with the blob data type, nvarchar(max).
I posted some information in The Thread.
Hi again
I just tested with VARCHAR(8000). Now I get an exception in my first tally solution, but that doesn't matter for the moment. The other tally solutions provided by Barry (which were much faster than mine) are still slower than the cursor.
The changed DDL and the changed test data:
DECLARE @tally TABLE (N INT NOT NULL, PRIMARY KEY CLUSTERED (N))
DECLARE @source TABLE (name NVARCHAR(128), definition VARCHAR(8000))
DECLARE @result TABLE (line nvarchar(max))
SELECT @crlf = CHAR(13) + CHAR(10)
-- //////////////////////////////////////////////////////////
-- -> Test data and tally table
-- Get some system procedures to split into lines
INSERT INTO @source
-- SELECT TOP(200) o.name, @crlf + m.definition + @crlf
SELECT TOP(200) o.name, @crlf + LEFT(m.definition, 7996) + @crlf
FROM master.sys.all_objects o
JOIN master.sys.all_sql_modules m ON o.object_id = m.object_id
WHERE type = 'P'
The results (of course everything is faster because there is less data):
Start tally solution
Msg 537, Level 16, State 2, Line 37
Invalid length parameter passed to the LEFT or SUBSTRING function.
The statement has been terminated.
Milliseconds: 13 | Lines: 0
Start Tally-function 3 solution (correct line-splitting)
Milliseconds: 2453 | Lines: 21115
Start Tally-function 4 solution (has same bug as the rest)
Milliseconds: 2380 | Lines: 20715
Start cursor solution
Milliseconds: 1466 | Lines: 20715
Start clr xml solution
Milliseconds: 590 | Lines: 21115
Start clr tvf (Paul White) solution
Milliseconds: 880 | Lines: 21115
Greets
Flo
April 13, 2009 at 4:02 pm
Paul White (4/13/2009)
RBarryYoung (4/13/2009)
1) For splitting the lines in stored procedure definitions, there is a subtle bug that exists in almost every line-splitting routine, including the ones so far in this thread: there are actually two different kinds of line-breaks incorporated in the system stored procedures: CR-LF, and LF alone. To correctly split them, you must take both into account (which is not trivial).
It's pretty trivial in the CLR case 🙂
Code-wise, yes. But I suspect that it will add 10-20% to the run-time. If you could test it for me I'd appreciate it as I probably will not get to loading the CLR code for a while yet.
RBarryYoung (4/13/2009)
3) The final observation that I had at that time was that the loop-based routines start to catch up with the tally-based methods if the average distance between separators gets large enough, and for line-splitting system procedures those distances tend to be quite large, of course (as opposed to comma-separated strings, for instance). This is because the loop-based methods can skip ahead over the "dead" characters in between with CHARINDEX(). Because Florian figured out how to call CHARINDEX only once per loop, that lowered where that threshold is.
*cough*
Paul White
The tally or numbers table solution is, to an extent, a brute-force approach. In the case of many long strings where the frequency of the searched-for string is low, a solution based on a loop/CHARINDEX approach should out-perform the tally approach.
Yes, I was pointing out that our previous analysis had indicated the same thing.
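For readers following along, here is a minimal sketch of the loop/CHARINDEX pattern being discussed (it is not Flo's actual cursor code; the sample text and table variable are just illustrative). The key point is that each iteration calls CHARINDEX once and jumps straight to the next separator:
-- Sketch: WHILE loop splitter, one CHARINDEX call per iteration.
DECLARE @text VARCHAR(MAX), @sep VARCHAR(10)
DECLARE @start INT, @pos INT
DECLARE @lines TABLE (ID INT IDENTITY(1,1), item VARCHAR(8000))
SET @sep   = CHAR(13) + CHAR(10)
SET @text  = 'alpha' + @sep + 'beta' + @sep + 'gamma' + @sep   -- sample, separator-terminated
SET @start = 1
SET @pos   = CHARINDEX(@sep, @text, @start)
WHILE @pos > 0
BEGIN
    INSERT INTO @lines (item) VALUES (SUBSTRING(@text, @start, @pos - @start))
    SET @start = @pos + LEN(@sep)                -- skip past the "dead" characters
    SET @pos   = CHARINDEX(@sep, @text, @start)  -- one CHARINDEX call per loop
END
SELECT ID, item FROM @lines ORDER BY ID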
RBarryYoung (4/13/2009)
So anyway, I wanted to test my line splitting functions against these also, so here are the results:
Flo
Start clr xml solution
Milliseconds: 790 | Lines: 28545
Start clr tvf (PW) solution
Milliseconds: 1083 | Lines: 25661
Still slower than both CLR implementations 😉
I wasn't disagreeing with that either, I just don't have the CLR stuff up yet.
And these CLR results are not a big surprise either, because about 4-8 weeks ago I wrote a CLR function for a Search & Replace problem that was blowing the loop-based and set-based code out by about 4-5x. Since then I've been speculating on a hypothesis that any time loop-based SQL comes close to set-based performance, CLR will blow them both out anyway. (That was on the forums here, so I will see if I can find the thread...)
Now, I personally have no problem with CLR, except that up until about two months ago it had been really hard to find things that it didn't stink at. My assertion has always been that any time the CPU cost per record is above a certain point, it should be really good, but I'm kicking myself for not realizing that this was one of those cases.
Note that the line counts are different simply because the TVF omits empty results:
string[] items = toSplit.Split(new string[] { delimeter }, StringSplitOptions.RemoveEmptyEntries);
It would be really nice if we could get a version with this fixed so that we can be sure of comparing apples to apples.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
April 13, 2009 at 4:07 pm
Lynn Pettis (4/13/2009)
Flo,
I understand needing to work with NVARCHAR(MAX). I was trying to find out why the Tally solution may have been slower, and I think the reason is due to the blob data type. I don't think there is anything inherently wrong with the Tally method when used with the "fixed length" data type, including the VARCHAR(8000) and NVARCHAR(4000).
This is where testing a variety of solutions is always a good thing to do.
Done. See the post below yours.
April 13, 2009 at 4:17 pm
Flo,
If you check my code, I created the temporary table (#TestData) with definition defined as nvarchar(4000) and when loading it, I selected records where the length of definition was <= 4000.
This may be why you have an issue with the change you made.
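In other words, a sketch of that idea against Flo's harness (this is not Lynn's exact #TestData script; it reuses the @source table variable and @crlf from the VARCHAR(8000) test above) would filter on length instead of truncating with LEFT():
-- Sketch: load only definitions that already fit the fixed-length type,
-- rather than truncating them with LEFT().
INSERT INTO @source
SELECT TOP(200) o.name, @crlf + CAST(m.definition AS VARCHAR(8000)) + @crlf
FROM master.sys.all_objects o
JOIN master.sys.all_sql_modules m ON o.object_id = m.object_id
WHERE type = 'P'
  AND LEN(m.definition) <= 7996   -- leave room for the two added @crlf pairs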
April 13, 2009 at 4:19 pm
RBarryYoung (4/13/2009)
Paul White (4/13/2009)
RBarryYoung (4/13/2009)
1) For splitting the lines in stored procedure definitions, there is a subtle bug that exists in almost every line-splitting routine, including the ones so far in this thread: there are actually two different kinds of line-breaks incorporated in the system stored procedures: CR-LF, and LF alone. To correctly split them, you must take both into account (which is not trivial).
It's pretty trivial in the CLR case 🙂
Code-wise, yes. But I suspect that it will add 10-20% to the run-time. If you could test it for me I'd appreciate it as I probably will not get to loading the CLR code for a while yet.
I just added the LF handling to Paul's CLR function and to mine; you are correct, it affects the performance. Over the next few days I will investigate whether a character or binary loop, or a RegEx, might be a faster solution. I'll keep you informed. Here are the test results:
Start tally solution
Milliseconds: 5426 | Lines: 28145
Start Tally-function 3 solution (correct line-splitting)
Milliseconds: 3296 | Lines: 28545
Start Tally-function 4 solution (has same bug as the rest)
Milliseconds: 3236 | Lines: 28145
Start cursor solution
Milliseconds: 1940 | Lines: 28145
Start clr xml solution
Milliseconds: 816 | Lines: 28545
Start clr tvf (Paul White) solution
Milliseconds: 1223 | Lines: 28545
Anyway, if you have a look at my previously posted results, there was no difference between your function and the CLR functions.
I already have some ideas to get the performance back to the previous level, including the LF handling. I will try tomorrow.
Greets
Flo
April 13, 2009 at 4:29 pm
RBarryYoung (4/13/2009)
Note that the line counts are different simply because the TVF omits empty results:
string[] items = toSplit.Split(new string[] { delimeter }, StringSplitOptions.RemoveEmptyEntries);
It would be really nice if we could get a version with this fixed so that we can be sure of comparing apples to apples.
[font="Verdana"]Flo mentioned the correction earlier: just replace "RemoveEmptyEntries" with "None".[/font]
April 13, 2009 at 6:26 pm
Out of curiosity, I had a play with replace()....
using this stored proc:
CREATE PROCEDURE #usp_print_lines
@text NVARCHAR(MAX)
AS
DECLARE @eol nvarchar(22)
SET @eol=N''' UNION ALL SELECT '''
SELECT @text= N'SELECT '''+REPLACE(REPLACE(REPLACE(@text,N'''',N''''''),CHAR(13) + CHAR(10),@eol),CHAR(10),@eol)+''''
EXEC(@text)
--- Return lines
I get these results:
[p][font="Courier New"]Start tally solution
Milliseconds: 4520 | Lines: 24811
Start cursor solution
Milliseconds: 1160 | Lines: 26161[/font][/p]
Compared to my timings using the original stored proc:
[font="Courier New"]Start tally solution
Milliseconds: 4600 | Lines: 24811
Start cursor solution
Milliseconds: 1916 | Lines: 25761[/font]
You can see a significant increase in speed using replace in the stored proc.
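For anyone who wants to try the procedure on its own, here is a minimal call sketch (the target procedure is chosen arbitrarily; any module visible in sys.all_sql_modules will do, and it assumes the temp proc above was created in the same session):
-- Sketch: call the temp proc on a single module definition.
DECLARE @def NVARCHAR(MAX)
SELECT @def = m.definition
FROM master.sys.all_sql_modules m
JOIN master.sys.all_objects o ON o.object_id = m.object_id
WHERE o.type = 'P' AND o.name = 'sp_spaceused'
EXEC #usp_print_lines @text = @def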
MM