April 17, 2009 at 2:09 pm
Hi Jeff!
Some intermediate results of my current tests. Some new CLR functions joined the competition ;-).
SourceType Description Lines Duration AverageTextLength MaxTextLength SumTextLength
---------- ----------------------------- ------ -------- ----------------- ------------- -------------
Large CLR DanielJ solution 28544 60 5076 73908 1015384
Large CLR tvf (Paul White) solution 28544 106 5076 73908 1015384
Large CLR RegEx solution 28544 120 5076 73908 1015384
Large CLR Chars solution 25967 266 5076 73908 1015384
Large CLR XML Solution 28544 346 5076 73908 1015384
Large Tally solution 28144 3246 5076 73908 1015384
Many CLR DanielJ solution 119609 203 3857 12672 771456
Many CLR tvf (Paul White) solution 119624 270 3857 12672 771456
Many CLR Chars solution 113024 310 3857 12672 771456
Many CLR RegEx solution 119624 403 3857 12672 771456
Many CLR XML Solution 119624 766 3857 12672 771456
Many Tally solution 119239 2026 3857 12672 771456
Short CLR DanielJ solution 15560 46 2526 3994 505257
Short CLR tvf (Paul White) solution 15560 60 2526 3994 505257
Short CLR RegEx solution 15561 80 2526 3994 505257
Short Tally solution 15160 123 2526 3994 505257
Short CLR Chars solution 14083 146 2526 3994 505257
Short CLR XML Solution 15560 250 2526 3994 505257
I will add the other functions back to the results when finished. Complete results maybe tomorrow.
Greets
Flo
April 17, 2009 at 3:15 pm
Florian Reischl (4/14/2009)
Here my final test results
SourceType Description Lines Duration AverageTextLength MaxTextLength
---------- ----------------------------------------------------- ------- ----------- -------------------- --------------------
Large CLR tvf (Paul White) solution 28545 580 5076 73908
Large CLR XML Solution 28545 823 5076 73908
Large Cursor solution 28145 2000 5076 73908
Large Peso solution 28545 2143 5076 73908
Large Phil Factor solution 28577 2703 5076 73908
Large Tally-function 4 solution (has same bug as the rest) 28145 3200 5076 73908
Large Tally-function 3 solution (correct line-splitting) 28545 3290 5076 73908
Large Tally solution 28145 3870 5076 73908
Many CLR tvf (Paul White) solution 157586 1626 3903 12672
Many CLR XML Solution 157586 2590 3903 12672
Many Tally solution 156410 4356 3903 12672
Many Tally-function 4 solution (has same bug as the rest) 156410 5823 3903 12672
Many Tally-function 3 solution (correct line-splitting) 157586 6030 3903 12672
Many Cursor solution 155288 6716 3903 12672
Many Peso solution 157562 7193 3903 12672
Many Phil Factor solution 157677 11163 3903 12672
Short CLR tvf (Paul White) solution 44106 226 2526 3994
Short Tally solution 43306 310 2526 3994
Short CLR XML Solution 44106 473 2526 3994
Short Cursor solution 43306 1080 2526 3994
Short Peso solution 44106 1083 2526 3994
Short Phil Factor solution 44197 1123 2526 3994
Short Tally-function 4 solution (has same bug as the rest) 43306 1750 2526 3994
Short Tally-function 3 solution (correct line-splitting) 44106 1813 2526 3994
Florian:
Could you get us the Average Line lengths for these tests? Or possibly the total source rows (from which I could calculate the line lengths)?
Thanks,
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
April 17, 2009 at 3:22 pm
I am real eager to see this DanielJ solution too.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
April 17, 2009 at 4:16 pm
Hi Barry
RBarryYoung (4/17/2009)
Florian:Could you get us the Average Line lengths for these tests? Or possibly the total source rows (from which I could calculate the line lengths)?
The source totals have been included in the other results. I just did some changes in source data generation to ensure deterministic tests. Here the statistics about the data:
SourceType AvgLineLength CountLines MaxLineLength AvgTextLength MaxTextLength SumTextLength
---------- ------------- ----------- ------------- ------------- ------------- -------------
Large 32 26441 329 4659 73908 931902
Many 9 141000 10 8448 15360 1689600
Short 30 14902 296 2420 3994 484158
If you need further information let me know.
Here the snippets to generate the test data:
---- ======================================================
---- -> #SourceLarge
CREATE TABLE #SourceLarge (Name NVARCHAR(128), Definition NVARCHAR(MAX))
INSERT INTO #SourceLarge
SELECT TOP(200) o.name, CHAR(13) + CHAR(10) + m.definition + CHAR(13) + CHAR(10)
FROM master.sys.all_objects o
JOIN master.sys.all_sql_modules m ON o.object_id = m.object_id
WHERE type = 'P'
ORDER BY o.object_id
---- ======================================================
---- -> #SourceShort
CREATE TABLE #SourceShort (Name NVARCHAR(128), Definition NVARCHAR(4000))
INSERT INTO #SourceShort
SELECT TOP(200) o.name, @crlf + LEFT(m.definition, 3990) + @crlf
FROM master.sys.all_objects o
JOIN master.sys.all_sql_modules m ON o.object_id = m.object_id
WHERE type = 'P' --AND LEN(m.definition) #SourceMany
CREATE TABLE #SourceMany (Name NVARCHAR(128), Definition NVARCHAR(MAX))
INSERT INTO #SourceMany
--SELECT TOP(200) o.name, REPLICATE(LEFT(NEWID(), ABS(CHECKSUM(NEWID())) % 10) + NCHAR(13) + NCHAR(10), ABS(CHECKSUM(NEWID())) % 10)
SELECT TOP(200) o.name, REPLICATE(LEFT(NEWID(), 10) + NCHAR(13) + NCHAR(10), ROW_NUMBER() OVER (ORDER BY object_id) % 10 + 1)
FROM master.sys.all_objects o
ORDER BY o.object_id
WHILE (10000 > (SELECT MAX(LEN(Definition)) FROM #SourceMany))
UPDATE #SourceMany SET Definition = Definition + Definition
--SELECT MAX(LEN(Definition)), AVG(LEN(Definition)), MIN(LEN(Definition)) FROM #SourceMany
---- =======================================================
Greets
Flo
April 17, 2009 at 4:20 pm
RBarryYoung (4/17/2009)
I am real eager to see this DanielJ solution too.
Stay patient, please. 😉
I just want to post it together with the other functions. I also did some performance improvements in the other functions (especially Paul's TVF). I will post them together in a few hours.
Greets
Flo
April 17, 2009 at 4:40 pm
Thanks, Flo.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
April 17, 2009 at 4:41 pm
Flo,
To echo someone else in this thread, this would be an awesome article. You should take the time to write it up.
April 17, 2009 at 7:51 pm
Florian Reischl (4/17/2009)
Hi Jeff!Some intermediate results of my current tests. Some new CLR functions joined the competition ;-).
Very cool, Flo. I do have a couple of concerns. Since some of the tests show a different number of result lines for a given test, I'm thinking that some of the methods actually don't work correctly. Also, the DanielJ CLR's appear to operate with extraordinary speed. I looked back a bit in this thread and can't find the DanielJ postings. Could you provide a listing of the source code for those? Thanks.
Slowly but surely, I'm working up a very determinate and qualified set of tests. I won't be able to test the CLR methods on my machine to compare apples with apples. When I'm ready, would you do the honors?
--Jeff Moden
Change is inevitable... Change for the better is not.
April 18, 2009 at 5:29 am
Jeff Moden (4/17/2009)
Paul White (4/17/2009)
Why not? You're the one speaking in absolutes. 😉
Yes. That was the humourous part (YMMV on that).
April 18, 2009 at 5:40 am
Florian Reischl (4/17/2009)
Chunks? Your suggestion was to get all in one XML.Flo
You gave my suggestion a much deeper analysis than it warranted 😉
By 'chunks' I meant something like a TOP (n) on the SELECT which would feed n rows at a time, via FOR XML, to a CLR function which would split the strings contained in the XML input. I imagine tabular results (rather than more XML) would be optimal for the CLR function's output.
The postulate is that sending, say 10K rows at a time inside XML, is more efficient than 10K TVF calls using CROSS APPLY.
Sorry I wasn't clearer before.
Paul
April 18, 2009 at 6:18 am
Hi!
So finally I’m done with all my tests. This post will contain the main information about the tests. I will separate the results into more than one post to hopefully get more than one webpage and save the IE ;-).
Here are the final deterministic statistics for the test data:
SourceType AvgLineLength CountLines MaxLineLength AvgTextLength MaxTextLength SumTextLength
---------- ------------- ----------- ------------- ------------- ------------- -------------
Large 34 41113 369 7661 95941 1532211
Many 9 141000 10 8448 15360 1689600
Short 30 14902 296 2420 3994 484158
Here are the final test results on a SQL Server 2005:
Sorting Name Lines Duration CpuTime LogicalReads PhysicalReads
------- ---------------------------------------------------- ------ -------- ------- ------------ -------------
Large CLR Chars solution 40475 63 62 7140 0
Large CLR DanielJ solution 40475 80 62 7140 0
Large CLR RegEx solution 40475 110 78 7140 0
Large CLR tvf (Paul White) solution 40475 126 110 7140 0
Large CLR XML Solution 40475 373 218 10905 0
Large Peso solution 40475 1936 1812 195814 0
Large Cursor solution 40475 2063 1953 152361 0
Large Tally-function 4 solution (has same bug as the rest) 40075 2936 2890 61400 0
Large Tally-function 3 solution (correct line-splitting) 40475 3013 2860 61842 0
Large Phil Factor solution 40487 3843 3422 339304 156
Large Tally solution 40075 4140 4125 1700176 0
Many CLR Chars solution 141000 140 141 6658 0
Many CLR DanielJ solution 141000 140 140 6658 0
Many CLR tvf (Paul White) solution 141000 233 219 6658 0
Many CLR RegEx solution 141000 266 266 6658 0
Many CLR XML Solution 141000 766 578 11177 0
Many Tally-function 4 solution (has same bug as the rest) 140600 4233 4015 176616 0
Many Tally-function 3 solution (correct line-splitting) 141000 4470 4407 177016 0
Many Tally solution 140600 4626 4625 1605852 0
Many Peso solution 141000 6190 5938 618419 0
Many Cursor solution 141000 6280 6157 477816 0
Many Phil Factor solution 141000 10733 9812 1180874 0
Short CLR RegEx solution 15065 30 31 154 0
Short CLR DanielJ solution 15064 30 32 154 0
Short CLR Chars solution 15064 30 31 154 0
Short CLR tvf (Paul White) solution 15064 46 31 154 0
Short Tally solution 14664 93 93 1677 0
Short CLR XML Solution 15064 140 16 158 0
Short Lynn's tally solution 15064 266 266 1439 0
Short Phil Factor solution 15115 546 469 34658 0
Short Peso solution 15064 640 625 34296 0
Short Cursor solution 15064 673 656 20190 0
Short Tally-function 4 solution (has same bug as the rest) 14664 1046 985 23500 0
Short Tally-function 3 solution (correct line-splitting) 15064 1080 1015 23910 0
Here are the final test results on a SQL Server 2008 on same machine:
Sorting Name Lines Duration CpuTime LogicalReads PhysicalReads
------- ---------------------------------------------------- ------ -------- ------- ------------ -------------
Large CLR Chars solution 41113 50 47 7337 0
Large CLR DanielJ solution 41113 53 32 7337 0
Large CLR RegEx solution 41113 96 110 7337 0
Large CLR tvf (Paul White) solution 41113 103 109 7337 0
Large CLR XML Solution 41113 293 327 4277 0
Large Peso solution 41113 1990 2121 198363 0
Large Cursor solution 41113 2076 2293 155583 0
Large Phil Factor solution 41125 2730 2933 345116 0
Large Tally solution 40713 2740 6241 1746845 0
Large Tally-function 4 solution (has same bug as the rest) 40713 3460 3666 62837 0
Large Tally-function 3 solution (correct line-splitting) 41113 3570 3837 63176 0
Many CLR Chars solution 141000 113 124 6696 0
Many CLR DanielJ solution 141000 116 109 6696 0
Many CLR tvf (Paul White) solution 141000 220 280 6696 0
Many CLR RegEx solution 141000 253 265 6696 0
Many CLR XML Solution 141000 576 593 5133 0
Many Tally solution 140600 2040 7300 1607421 0
Many Tally-function 4 solution (has same bug as the rest) 140600 5190 5554 177992 0
Many Tally-function 3 solution (correct line-splitting) 141000 5363 5757 178072 0
Many Peso solution 141000 6330 6942 616292 0
Many Cursor solution 141000 6676 7223 478034 0
Many Phil Factor solution 141000 10936 12153 1183408 0
Short CLR Chars solution 14902 23 16 157 0
Short CLR DanielJ solution 14902 26 31 157 0
Short CLR RegEx solution 14902 40 62 157 0
Short CLR tvf (Paul White) solution 14902 40 47 157 0
Short Tally solution 14502 46 141 1652 0
Short Lynn's tally solution 14902 123 342 1447 0
Short CLR XML Solution 14902 146 141 157 0
Short Phil Factor solution 14946 603 733 34119 0
Short Peso solution 14902 650 702 33637 0
Short Cursor solution 14902 686 780 19977 0
Short Tally-function 4 solution (has same bug as the rest) 14502 1273 1404 23461 0
Short Tally-function 3 solution (correct line-splitting) 14902 1310 1482 23872 0
It seems that SQL Server 2008 is faster in most cases. Especially the CLR integration is optimized!
I moved the cursor solution from procedure to a TVF to give the same chance as the other solutions:
---- ======================================================
---- -> Function to split lines by cursor
CREATE FUNCTION dbo.ufn_SplitLines_Cursor (@text NVARCHAR(MAX))
RETURNS @ret TABLE (Id INT IDENTITY, line NVARCHAR(2000))
AS
BEGIN
DECLARE @pos INT, @next INT, @crlf NCHAR(2)
SELECT @pos = 0, @crlf = NCHAR(13) + NCHAR(10)
WHILE (1 = 1)
BEGIN
SELECT @next = CHARINDEX(@crlf, @text, @pos)
IF (@next = 0) BREAK -- Nothing more to do
IF (@pos != @next)
INSERT INTO @ret SELECT SUBSTRING(@text, @pos, @next - @pos - 1)
SELECT @pos = @next + 1
END
INSERT INTO @ret
SELECT SUBSTRING(@text, @pos, LEN(@text) - 1)
RETURN
END
Different results between Tally and others
Because the tally tables needs the separator at the start and end position of the text I added this CRLF into the test data. This might be the reason why is shows a little less rows than all other solutions.
April 18, 2009 at 6:21 am
Some explanation for the CLR functions.[/b]
It seems that SQL Server doe many optimizations for SUBSTRING functions like “SUBSTRING(s.definition, N, 2) = @crlf”. I don’t think that it really copies every character out of the source string to compare with the constant expression. Maybe it does a memory compare from constant expression to the source text without copying. Otherwise the performance could not reach this performance in my opinion.
CLR doesn’t do this automatically so one of the most important things for thus string operations is avoid needless memory copies.
CLR DanielJ solution
A short information about “DanielJ”. Daniel is, believe it, my apprentice at work. He is young but a already a great developer!
I spoke with him about the discussion and he just had a great idea; “Reduce to the max”. Skip all cool .Net gifts for this simple solution and use a C/C++ style solution by just looping over the characters of the incoming string.
using System;
using System.Collections;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction(
FillRowMethodName = "FillRowChars",
TableDefinition = "line nvarchar(2000)"
)
]
public static IEnumerable ufn_clr_SplitLines_DJ(
[SqlFacet(
MaxSize = -1
)
]
SqlString txtIn
)
{
ArrayList _al = new ArrayList(1000);
int currentIndex = 0;
int lastIndex = 0;
int skipCount = 0;
foreach (char c in txtIn.Value)
{
if (c == 0x0D)
{
skipCount++;
}
else if (c == 0x0A)
{
_al.Add(txtIn.Value.Substring(lastIndex, currentIndex - lastIndex - skipCount));
lastIndex = currentIndex + 1;
skipCount = 0;
}
currentIndex++;
}
_al.Add(txtIn.Value.Substring(lastIndex, currentIndex - lastIndex));
return _al;
}
public static void FillRowDJ(object obj, out string line)
{
line = (string)obj;
}
};
Special thanks to Daniel for this great solution! 🙂
CLR Chars solution
This is my last solution for the business case. It is quiet equal to Daniel’s solution but works with chars and a for-int loop instead of a for-each loop. As known in .Net the for-each loops are nice features but create a little overhead. I just fixed a performance issue in my function. It seems that the “Value” property of the SqlChars object is not fixed but recalculated at each access. So I take the char-array only once and work with this:
using System;
using System.Collections;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction(
FillRowMethodName = "FillRowChars",
TableDefinition = "line nvarchar(2000)"
)
]
public static IEnumerable ufn_clr_SplitLines_Chars(
[SqlFacet(
MaxSize = -1
)
]
SqlChars charsIn
)
{
char[] chars = charsIn.Value;
ArrayList al = new ArrayList(1000);
int lastIndex = 0;
int len = 0;
int i;
for (i = 0; i < chars.Length; i++)
{
char c = chars;
if (c == 0x0A)
{
al.Add(new string(chars, lastIndex, len));
lastIndex = i;
len = 0;
}
else if (c != 0x0D)
{
len++;
}
}
al.Add(new string(chars, lastIndex, i - lastIndex));
return al;
}
public static void FillRowChars(object obj, out string line)
{
line = (string)obj;
}
};
CLR RegEx solution
I just have to apologize to Phil and Derek. I undervalued the RegEx! The performance is great and it is very flexible. I used a once compiled RegEx to optimize:
using System;
using System.Collections;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
private static readonly Regex _regEx = new Regex(@"\r|\r", RegexOptions.Compiled);
[Microsoft.SqlServer.Server.SqlFunction(
FillRowMethodName = "FillRowRegEx",
TableDefinition = "line nvarchar(2000)"
)
]
public static System.Collections.IEnumerable ufn_clr_SplitLines_RegEx(
[SqlFacet(
MaxSize = -1
)
]
SqlString txtIn
)
{
string[] lines = _regEx.Split(txtIn.Value);
return lines;
}
public static void FillRowRegEx(object obj, out SqlString line)
{
line = (string)obj;
}
};
CLR Paul’s TVF
Since TVF’s until Sql2k8 don’t use MARS and SQL Server works with an IEnumerable the data are always returned in same direction as the CLR function returns them we return the result of string.Split without any row-ids.
using System;
using System.Collections;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction(
FillRowMethodName = "FillRow",
TableDefinition = "line nvarchar(4000)"
)
]
public static IEnumerable ufn_clr_SplitLines_PW(
[SqlFacet(
MaxSize = -1
)
]
SqlString StringToSplit
)
{
return StringToSplit.Value.Split(new string[] { "\r", "" }, StringSplitOptions.None);
}
public static void FillRow(Object obj, out SqlString line)
{
line = (string)obj;
}
};
I hope I didn’t forget any of the CLRs...
April 18, 2009 at 6:24 am
The Test Framework
To simplify the tests I create one table “PerfMon” (delete the data if it already exists) and two procedures “usp_PerfMon_Init” and “usp_PerfMon_Write”. If you want to do the same tests please ensure that the table and procedures are not available in your database!
---- ==========================================================
---- -> PerfMon table and depending procedures
--DROP TABLE PerfMon
IF (OBJECT_ID('PerfMon') IS NULL)
BEGIN
CREATE TABLE PerfMon
(
Id INT NOT NULL IDENTITY,
PRIMARY KEY CLUSTERED (Id),
Name VARCHAR(100),
Lines INT,
CreationDate DATETIME,
ElapsedTime INT,
CpuTime INT,
PhysicalReads INT,
PhysicalWrites INT,
LogicalReads INT,
Sorting VARCHAR(30),
IsInit BIT
)
END
GO
IF (OBJECT_ID('usp_PerfMon_Init') IS NOT NULL)
DROP PROCEDURE usp_PerfMon_Init
GO
CREATE PROCEDURE usp_PerfMon_Init
AS
INSERT INTO PerfMon (CreationDate, ElapsedTime, CpuTime, PhysicalReads, PhysicalWrites, LogicalReads, IsInit)
SELECT GETDATE(), r.total_elapsed_time, r.cpu_time, r.reads, r.writes, r.logical_reads, 1
FROM sys.dm_exec_requests r --sys.dm_exec_sessions r
WHERE r.session_id = @@SPID
GO
IF (OBJECT_ID('usp_PerfMon_Write') IS NOT NULL)
DROP PROCEDURE usp_PerfMon_Write
GO
CREATE PROCEDURE usp_PerfMon_Write
@Name VARCHAR(100),
@Lines INT,
@Sorting VARCHAR(30)
AS
INSERT INTO PerfMon (Name, Lines, CreationDate, ElapsedTime, CpuTime, PhysicalReads, PhysicalWrites, LogicalReads, Sorting, IsInit)
SELECT @Name, @Lines, GETDATE(), r.total_elapsed_time, r.cpu_time, r.reads, r.writes, r.logical_reads, ISNULL(@Sorting, 0), 0
FROM sys.dm_exec_requests r -- sys.dm_exec_sessions r
WHERE r.session_id = @@SPID
GO
-- Delete previous test results
DELETE FROM PerfMon
---- <- PerfMon table and depending procedures
---- ==========================================================
The complete test SQL script is attached to this post.
Again thanks to everybody who attended this great discussion!
Greets
Flo
April 18, 2009 at 6:34 am
Paul White (4/18/2009)
Florian Reischl (4/17/2009)
Chunks? Your suggestion was to get all in one XML.FloYou gave my suggestion a much deeper analysis than it warranted 😉
By 'chunks' I meant something like a TOP (n) on the SELECT which would feed n rows at a time, via FOR XML, to a CLR function which would split the strings contained in the XML input. I imagine tabular results (rather than more XML) would be optimal for the CLR function's output.
The postulate is that sending, say 10K rows at a time inside XML, is more efficient than 10K TVF calls using CROSS APPLY.
Sorry I wasn't clearer before.
Paul
Hi Paul
This opens two different possible approaches:
First a "do everything I want" solution in CLR. Within the function use the context connection to get the chunks as XML, parse them and put the results into one result list which returns all source table ids and their split lines.
As second a TSQL loop which takes chunk pages from source table and supplies them with FOR XML to the CLR function. within this function the XML becomes parsed and returns the lines with the source ids.
If you mean another approach could you please populate a little sample? 🙂
Greets
Flo
April 18, 2009 at 6:41 am
Hey Flo,
Great stuff - the optimized CLR routines you and Daniel came up with are very impressive!
Note to self: don't pass strings - pass SqlChars!
Paul
Viewing 15 posts - 91 through 105 (of 522 total)
You must be logged in to reply to this topic. Login to reply