October 2, 2010 at 11:37 am
Pavel Pawlowski (9/26/2010)
But the situation differs as the number of elements increases. I tested it on 1333 elements (nearly 8k). I stopped the Tally table version after 19 minutes of execution time.
Do you see the two errors we've both made that caused the Tally Table to look bad? 😉
The Tally Table works by joining to the data at the character level. So the first performance error occurs because of what happens when you join two columns of different datatypes: the data is NVARCHAR but the Tally Table splitter is based on VARCHAR, so the join drags an implicit conversion along with it.
The second "error" is caused by the myth that an iTVF always works as well at high row counts as it does at low row counts. Because of the join at the character level, you actually end up with a much larger cross join between the Tally Table and the data than is required at the larger element counts, because the optimizer can "see" the splitter code instead of it being hidden as it would be in a scalar function. In other words, the optimizer makes a bad mistake here. There are some fixes you can add, but let's not do that for the moment. I will, however, include mlTVF code to do a Tally Table split below.
So... if we do what we would do when tuning any query for performance, we need to make the datatypes match for the join, and we need to change to a NON-iTVF to get the performance the Tally Table is famous for even at larger numbers of CSV elements.
In my tests, the only change I made to the data generator was to switch between creating an NVARCHAR column and a VARCHAR column, and then I ran the tests at 10, 100, and 1333 elements for each datatype. I also threw in the NON-iTVF function to show that the problem really isn't with the Tally Table... the problem is with what the optimizer chose to do with it all, and we had to do a little tuning just like we would with any query...
Here are the results from the tests. As you can see, "It Depends" and "Tuning Prevails". They clearly demonstrate that the iTVF is very good (in this case) at very low element counts but loses its mind, even to the XML splitters, at higher element counts, especially when the datatypes don't match. They also clearly demonstrate that the mlTVF for the Tally Table is slow at very low element counts but blows the doors off both the iTVF and the XML method at the larger element count you did your testing with.
So, it's not the Tally Table split that's the problem... it's how it's used that's the problem and, just like any other code in SQL, there are some best practices to follow, including matching datatypes in joins.
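Just to make the datatype fix concrete, here's a minimal sketch of the call-site version of it (treat this as illustrative only; the CAST is the point, and it assumes the data really is single-byte so nothing is lost in the conversion):

--===== Sketch only: match the join datatypes at the call site
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
 SELECT @RowNum     = D.RowNum,
        @ItemNumber = V.ItemNumber,
        @ItemValue  = V.ItemValue
   FROM dbo.CsvTest D
  CROSS APPLY dbo.Split8KTallyM(CAST(D.CsvParameter AS VARCHAR(8000)), ',') V;

The other option, of course, is simply to store the CSV in a VARCHAR column to begin with, which is what I did in the data generator.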
Oh yeah... almost forgot... here's the mlTVF I used...
--===== Tally Table (Split8KTallyM mlTVF) ===============================================================================
IF OBJECT_ID('dbo.Split8KTallyM') IS NOT NULL
    DROP FUNCTION dbo.Split8KTallyM;
GO
CREATE FUNCTION dbo.Split8KTallyM
        (@Parameter VARCHAR(8000), @Delimiter VARCHAR(1))
RETURNS @Result TABLE (ItemNumber INT, ItemValue INT) AS
BEGIN
 INSERT INTO @Result
        (ItemNumber, ItemValue)
 SELECT CAST(ROW_NUMBER() OVER (ORDER BY N) AS INT) AS ItemNumber,
        SUBSTRING(@Parameter, N, CHARINDEX(@Delimiter, @Parameter + @Delimiter, N) - N) AS ItemValue
   FROM dbo.Tally
  WHERE N BETWEEN 1 AND LEN(@Parameter) + 1
    AND SUBSTRING(@Delimiter + @Parameter, N, 1) = @Delimiter; --Notice how we find the comma
 RETURN;
END;
GO
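For anyone following along at home, here's a quick smoke test (it assumes the usual dbo.Tally table, with at least 8000 sequential numbers, already exists):

--===== Smoke test: should return (1,10), (2,20), (3,30)
 SELECT ItemNumber, ItemValue
   FROM dbo.Split8KTallyM('10,20,30', ',');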
And, no... I didn't take the time to test an NVARCHAR version of the Tally Table code here. I'm saving that for the article. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
October 2, 2010 at 1:59 pm
Jeff,
you are right about the data types, so I've changed the nvarchar to varchar in the test data generator and made a comparison of Split8KTallyM vs. CLR RegEx. I also made a small enhancement to the CLR RegEx by changing the input and output datatypes to match the actual data more closely, and I simplified the regular expression for this concrete test.
As you said, it's always "It Depends".
But even Split8KTallyM is 8 times slower than the CLR RegEx at the high element counts (1333).
Maybe I'm doing something wrong or forgetting something, but I'm not aware of it.
Here are the results from the test on my workstation (both run against the test data with a varchar column instead of nvarchar).
Here is the definition of the CLR RegEx function (the rest is the same as in the previous post):
[SqlFunction(FillRowMethodName = "FillRegExRow")]
public static IEnumerable RegExMatches2(SqlString sourceString, SqlString pattern)
{
    Regex r = new Regex(pattern.Value, RegexOptions.Compiled);
    int rowId = 0;
    int matchId = 0;

    foreach (Match m in r.Matches(sourceString.Value))
    {
        matchId++;
        for (int i = 0; i < m.Groups.Count; i++)
        {
            //Emit one row per capture group of this match
            yield return new RegExRow(++rowId, matchId, i, m.Groups[i].Value);
        }
    }
}
CREATE FUNCTION [dbo].[fn_RegExMatches2](@sourceString [nvarchar](max), @pattern [nvarchar](10))
RETURNS TABLE (
[rowId] [int] NULL,
[matchId] [int] NULL,
[groupId] [int] NULL,
[value] [nvarchar](10) NULL
) WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [SQLRegEx].[SQLRegEx].[RegExMatches2]
GO
And the test queries (note that assigning the results to variables takes display time out of the measurements):
--=== CLR-RegEx Version ===
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
@RowNum = D.RowNum,
@ItemNumber = V.matchId,
@ItemValue = CAST(V.value as int)
FROM dbo.CsvTest4 D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, '\d+') V
GO
--=== Split8KTallyM ===
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
@RowNum = D.RowNum,
@ItemNumber = V.ItemNumber,
@ItemValue = V.ItemValue
FROM dbo.CsvTest4 D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO
October 2, 2010 at 2:29 pm
Aha, the 1,333 test :-) Curiously enough, I had already carried out the analysis according to your CsvTest, but on a 10,000 row x 1,333 element table. Results available on request.
October 3, 2010 at 9:13 am
mishaluba (6/27/2010)
Hello, I have a table with three columns containing CSV values. I need to normalize them. The order of the CSV values is meaningful (explained more below). Here is the sample DDL:
CREATE TABLE #t (id int, kpi1 varchar(16), kpi2 varchar(16), kpi3 varchar(16))
insert into #t (id, kpi1, kpi2, kpi3)
values(1, '0.1,0.11,0.111', '0.2,0.22,0.222', '0.3,0.33,0.333')
insert into #t (id, kpi1, kpi2, kpi3)
values(2, '0.4,0.44,0.444', '0.5,0.55,0.555', '0.6,0.66,0.666')
select *
from #t
My desired result would look something like this:
id          kpi1            kpi2            kpi3            items1  items2  items3
----------- --------------- --------------- --------------- ------- ------- -------
1           0.1,0.11,0.111  0.2,0.22,0.222  0.3,0.33,0.333  0.1     0.2     0.3
1           0.1,0.11,0.111  0.2,0.22,0.222  0.3,0.33,0.333  0.11    0.22    0.33
1           0.1,0.11,0.111  0.2,0.22,0.222  0.3,0.33,0.333  0.111   0.222   0.333
2           0.4,0.44,0.444  0.5,0.55,0.555  0.6,0.66,0.666  0.4     0.5     0.6
2           0.4,0.44,0.444  0.5,0.55,0.555  0.6,0.66,0.666  0.44    0.55    0.66
2           0.4,0.44,0.444  0.5,0.55,0.555  0.6,0.66,0.666  0.444   0.555   0.666
I hope this is clear. Basically, I want to split the CSV values in each of the columns without multiplying the number of records every time I do the split, but just to append them to the same id. Once again, the order of the CSV values is meaningful for the subsequent manipulations. For example: 0.1, 0.2, 0.3 (the first values from kpi1, kpi2, kpi3) form a meaningful set of values; that's why I want them to be in the first row. The next row contains the second values from each of the kpis: 0.11, 0.22, 0.33. Since I do have an id in my source table and the function returns this id, I thought I could do a JOIN instead of a CROSS APPLY, but this gives me a syntax error:
OK then folks, returning to the original post :-) I have another solution.
This approach uses CROSS APPLY and some string manipulation to accomplish the splitting:
SELECT id, kpi1, kpi2, kpi3, items1, items2, items3
FROM #t
CROSS APPLY
(
    SELECT LEFT(kpi1, CHARINDEX(',', kpi1) - 1),              --first element of each column
           LEFT(kpi2, CHARINDEX(',', kpi2) - 1),
           LEFT(kpi3, CHARINDEX(',', kpi3) - 1)
    UNION ALL                                                 --middle element of each column
    SELECT SUBSTRING(kpi1, CHARINDEX(',', kpi1) + 1, CHARINDEX(',', kpi1, CHARINDEX(',', kpi1) + 1) - CHARINDEX(',', kpi1) - 1),
           SUBSTRING(kpi2, CHARINDEX(',', kpi2) + 1, CHARINDEX(',', kpi2, CHARINDEX(',', kpi2) + 1) - CHARINDEX(',', kpi2) - 1),
           SUBSTRING(kpi3, CHARINDEX(',', kpi3) + 1, CHARINDEX(',', kpi3, CHARINDEX(',', kpi3) + 1) - CHARINDEX(',', kpi3) - 1)
    UNION ALL                                                 --last element of each column
    SELECT RIGHT(kpi1, CHARINDEX(',', REVERSE(kpi1)) - 1),
           RIGHT(kpi2, CHARINDEX(',', REVERSE(kpi2)) - 1),
           RIGHT(kpi3, CHARINDEX(',', REVERSE(kpi3)) - 1)
) AS Z (items1, items2, items3)
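That works nicely when every column holds exactly three elements. If the element counts vary, a more general sketch would be to split each column and stitch the pieces back together on the item position. Note that dbo.Split8KVarchar below is hypothetical: a variant of this thread's Split8KTallyM that returns the item as VARCHAR instead of INT, since these values aren't integers.

--===== Sketch only: dbo.Split8KVarchar is hypothetical and would have to return
--      (ItemNumber INT, ItemValue VARCHAR(8000)).
SELECT t.id, t.kpi1, t.kpi2, t.kpi3,
       s1.ItemValue AS items1,
       s2.ItemValue AS items2,
       s3.ItemValue AS items3
  FROM #t t
 CROSS APPLY dbo.Split8KVarchar(t.kpi1, ',') s1
 CROSS APPLY dbo.Split8KVarchar(t.kpi2, ',') s2
 CROSS APPLY dbo.Split8KVarchar(t.kpi3, ',') s3
 WHERE s2.ItemNumber = s1.ItemNumber
   AND s3.ItemNumber = s1.ItemNumber;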
October 3, 2010 at 12:22 pm
steve-893342 (10/2/2010)
Aha, the 1,333 test :-) Curiously enough, I had already carried out the analysis according to your CsvTest, but on a 10,000 row x 1,333 element table. Results available on request.
Heh... I'm not quite sure why you make people request the results. They'd be useful to the forum even if I had absolutely no interest. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
October 3, 2010 at 12:27 pm
Pavel Pawlowski (10/2/2010)
But even Split8KTallyM is 8 times slower than the CLR RegEx at the high element counts (1333). Maybe I'm doing something wrong or forgetting something, but I'm not aware of it.
You're correct. I'm not sure what you're doing wrong, or if you're doing anything wrong at all, because the Tally Table versions I used obviously took a lot less time than yours did. I'll take a look at your code and see if I can figure it out.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 3, 2010 at 12:32 pm
@Pavel,
What is the default collation on your machine, please? And thanks ahead of time for helping me figure this out.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 4, 2010 at 2:04 pm
Jeff Moden (10/3/2010)
@Pavel,What is the default collation on your machine, please? And thanks ahead of time for helping me figure this out.
Jeff,
my default collation is Czech_CI_AS, but I've also tested this on a DB with the Latin1_General_CI_AS collation and the results are the same. I tested it on both a clustered table and a heap.
Maybe the difference is in the query plan. I don't know your query plan, but on my machine SQL Server chooses a parallel plan for the RegEx solution, and maybe that benefits it.
Here are the plans. The first, parallel one is for the RegEx CLR.
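If parallelism turns out to be what's helping the RegEx version, one quick sanity check (just a sketch) would be to pin it to a serial plan and re-run the timing:

--===== Sanity check only: force a serial plan so parallelism can't skew the comparison
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.matchId,
    @ItemValue = CAST(V.value AS INT)
FROM dbo.CsvTest4 D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, '\d+') V
OPTION (MAXDOP 1);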
Also, in terms of optimization and the speed of CSV splitting, I've created a small CLR function which is nearly 5 times quicker than the RegEx for CSV splitting and more than 30 times faster than the Tally on my box. It can be used for a simple CSV with a one-character delimiter. (The maxLen argument is the length of the buffer, i.e., the maximum length of an item in the CSV.)
Here is the source code, and below it, the results from Profiler.
using System;
using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class StringSplit
{
    //Simple carrier for one split-out item
    private struct StrRow
    {
        public StrRow(int rowId, SqlChars value)
        {
            RowId = rowId;
            Value = value;
        }

        public int RowId;
        public SqlChars Value;
    }

    [SqlFunction(FillRowMethodName = "FillSplitString3")]
    public static IEnumerable SplitString3(SqlString sourceString, string delimiter, int maxLen)
    {
        char[] buffer = new char[maxLen]; //reusable buffer; maxLen must cover the longest item
        char delim = delimiter[0];
        int rowNumber = 0;
        int chars = 0;
        char[] finalString;

        foreach (char chr in sourceString.Value)
        {
            if (chr == delim)
            {
                //End of an item: copy the buffered characters and emit a row
                finalString = new char[chars];
                Array.Copy(buffer, finalString, chars);
                yield return new StrRow(++rowNumber, new SqlChars(finalString));
                chars = 0;
            }
            else
            {
                buffer[chars++] = chr;
            }
        }

        //Emit the trailing item (there's no delimiter after the last element)
        if (chars > 0)
        {
            finalString = new char[chars];
            Array.Copy(buffer, finalString, chars);
            yield return new StrRow(++rowNumber, new SqlChars(finalString));
        }
    }

    public static void FillSplitString3(object obj, out int rowId, out SqlChars value)
    {
        StrRow r = (StrRow)obj;
        rowId = r.RowId;
        value = r.Value;
    }
}
CREATE FUNCTION dbo.fn_SplitString3(
@sourceString nvarchar(max),
@delimiter nchar(1),
@maxLen int
)
returns TABLE (
RowID int,
Value nvarchar(10)
)
AS
EXTERNAL NAME [CQLCLR].StringSplit.SplitString3
GO
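A quick usage check (the third argument just has to be at least as long as the longest item):

--===== Usage check: should return (1,'10'), (2,'8'), (3,'156'), (4,'1389')
SELECT RowID, Value
FROM dbo.fn_SplitString3(N'10,8,156,1389', N',', 10);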
Profiler Output:
October 5, 2010 at 4:55 am
I'm not sure why, but the Tally Table method appears (from this angle... can't really tell) to be cross joining with the data, which, of course, is going to make it terribly slow. The question is... why? Again... I can't tell from here, and it's not happening that way with any of the (apparently) same tests I've run on 2005 or 2008. At this point, I'm not sure what the two of us are doing differently.
I've also lost track a bit of what we've done. The difference may be in the Tally Table itself (post the CREATE TABLE statement and all the indexes you have on it just so I can have a look, please?). This type of stuff is difficult to troubleshoot remotely.
If you want (don't worry about the CLR side... we know that's good), post all of the code YOU are using to set up the test, the function, and the Tally Table and I'll try that on my machine to see if I can duplicate the problem you're having. I have both a 2k5 and 2k8 machine at home to test on now.
And, yes... I realize that much of the code may be what's already been posted and much of it may be mine. Since I'm trying to duplicate your problem, I wanted you to post the actual code you used so I can try and find a difference.
Thanks, Pavel.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 5, 2010 at 1:21 pm
Jeff Moden (10/5/2010)
If you want (don't worry about the CLR side... we know that's good), post all of the code YOU are using to set up the test, the function, and the Tally Table and I'll try that on my machine to see if I can duplicate the problem you're having. I have both a 2k5 and 2k8 machine at home to test on now. And, yes... I realize that much of the code may be what's already been posted and much of it may be mine. Since I'm trying to duplicate your problem, I wanted you to post the actual code you used so I can try and find a difference.
Thanks, Pavel.
Jeff,
here is the complete script and also the profiler output:
--Create TestDB
CREATE DATABASE [TestDB]
COLLATE Latin1_General_CI_AS
GO
--Use TestDB
USE [TestDB]
GO
--Create and populate tally table
SELECT TOP 11000
IDENTITY(INT, 1, 1) AS N
INTO dbo.Tally
FROM sys.all_objects o1, sys.all_objects
GO
--Add Clustered Index on Tally table
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
GO
--Create and populate the CsvTest table (it doesn't matter whether the table has a clustered index or is simply a heap)
SELECT TOP (10000) --Controls the number of rows in the test table
ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
(
SELECT CAST(STUFF( --=== STUFF gets rid of the leading comma
( --=== This builds CSV row with a leading comma
SELECT TOP (1333) --Controls the number of CSV elements in each row
','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
FROM dbo.Tally t3 --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t4 --can produce row sets of up to 121 million rows.
WHERE t1.N <> t3.N --Without this line, all rows would be the same
FOR XML PATH('')
)
,1,1,'') AS VARCHAR(8000))
) AS CsvParameter
INTO CsvTest
FROM dbo.Tally t1 --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2; --can produce row sets of up to 121 million rows.
GO
--Create Split Tally Function
CREATE FUNCTION dbo.Split8KTallyM (
@Parameter VARCHAR(8000),
@Delimiter VARCHAR(1)
)
RETURNS @Result TABLE (ItemNumber INT, ItemValue INT) AS
BEGIN
INSERT INTO @Result
(ItemNumber, ItemValue)
SELECT CAST(ROW_NUMBER() OVER (ORDER BY N) AS INT) AS ItemNumber,
SUBSTRING(@Parameter,N,CHARINDEX(@Delimiter,@Parameter+@Delimiter,N)-N) AS ItemValue
FROM dbo.Tally
WHERE N BETWEEN 1 AND LEN(@Parameter)+1
AND SUBSTRING(@Delimiter+@Parameter,N,1) = @Delimiter; --Notice how we find the comma
RETURN
END;
GO
--Tally Test
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
@RowNum = D.RowNum,
@ItemNumber = V.ItemNumber,
@ItemValue = V.ItemValue
FROM dbo.CsvTest D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO
October 6, 2010 at 4:01 pm
Thanks, Pavel. I'll give it a whirl on my 2k5 box. Are you using 2k5 or 2k8?
--Jeff Moden
Change is inevitable... Change for the better is not.
October 7, 2010 at 12:41 am
Jeff Moden (10/6/2010)
Thanks, Pavel. I'll give it a whirl on my 2k5 box. Are you using 2k5 or 2k8?
I'm running 2008
October 7, 2010 at 5:41 am
Zounds... not sure what just changed, but I used your script exactly and I'm coming up on 10 minutes on my 2k5 box. I've got the Actual Execution Plan enabled, so we'll see what happens shortly.
I know it's a stupid question but I have to make sure... are the two CLR methods you ran (I can't test them) producing the correct output?
--Jeff Moden
Change is inevitable... Change for the better is not.
October 7, 2010 at 6:36 am
Ok... here's what I got (using Pavel's code) on my 2005 box, which only has 1 CPU (P4 @ 1.8GHz), 1GB of RAM, and an 8-year-old hard drive. Notice that collation makes a huge difference. The place where I changed the collation was within the function itself.
My default collation is the SQL Server default of SQL_Latin1_General_CP1_CI_AS, which explains why stuff runs faster on my machine than on a lot of other folks' machines.
Event               SPID  TextData      Duration  CPU     Reads     Writes  RowCount  Collation
------------------  ----  ------------  --------  ------  --------  ------  --------  ----------------------------
SQL:BatchCompleted  55    --Tally Test  862545    737422  14813936  29871   26660000  Latin1_General_CI_AS
SQL:BatchCompleted  55    --Tally Test  491414    457594  14819820  29884   26660001  SQL_Latin1_General_CP1_CI_AS
SQL:BatchCompleted  55    --Tally Test  470092    455282  14864872  30166   26660000  Latin1_General_BIN
Of course, it's still not going to touch the SQLCLR splitter. We knew that going into this.
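In case anyone wants to try the same thing, here's a sketch of the kind of change I'm talking about (one plausible placement; it forces a binary collation on the character comparison inside the splitter):

--===== Sketch only: the splitter's SELECT with a binary collation forced on the comparison
 SELECT CAST(ROW_NUMBER() OVER (ORDER BY N) AS INT) AS ItemNumber,
        SUBSTRING(@Parameter, N, CHARINDEX(@Delimiter, @Parameter + @Delimiter, N) - N) AS ItemValue
   FROM dbo.Tally
  WHERE N BETWEEN 1 AND LEN(@Parameter) + 1
    AND SUBSTRING(@Delimiter + @Parameter, N, 1) = @Delimiter COLLATE Latin1_General_BIN;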
--Jeff Moden
Change is inevitable... Change for the better is not.
October 7, 2010 at 12:28 pm
Jeff Moden (10/7/2010)
Zounds... not sure what just changed, but I used your script exactly and I'm coming up on 10 minutes on my 2k5 box. I've got the Actual Execution Plan enabled, so we'll see what happens shortly. I know it's a stupid question but I have to make sure... are the two CLR methods you ran (I can't test them) producing the correct output?
Jeff,
for sure, the two CLR methods produce correct results. I ran a simple test query and compared them against the Tally.
WITH TestData(RowNum, CsvParameter) AS(
SELECT 1 as RowNum, '10,8,156,1389,45726,125456' as CsvParameter UNION ALL
SELECT 2 as RowNum, '4578,13,78784,888,45726' as CsvParameter UNION ALL
SELECT 3 as RowNum, '973,576,12547,88877,2111333' as CsvParameter
),
Split8KTallyM AS (
SELECT
D.RowNum,
V.ItemNumber,
V.ItemValue AS Split8KTallyM
FROM TestData D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
),
CLRSplitString3 AS (
SELECT
D.RowNum,
V.RowID as ItemNumber,
V.Value AS CLRSplitString3
FROM TestData D
CROSS APPLY dbo.fn_SplitString3(D.CsvParameter, ',', 10) V
),
CLRRegEx AS (
SELECT
D.RowNum,
V.RowID AS ItemNumber,
V.Value AS CLRRegEx
FROM TestData D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, '\d+') V
)
SELECT
T.RowNum,
T.ItemNumber,
T.Split8KTallyM,
T1.CLRSplitString3,
T2.CLRRegEx
FROM Split8KTallyM T
FULL OUTER JOIN CLRSplitString3 T1 ON T.RowNum = T1.RowNum AND T.ItemNumber = T1.ItemNumber
FULL OUTER JOIN CLRRegEx T2 ON T.RowNum = T2.RowNum AND T.ItemNumber = T2.ItemNumber
and of course, here are the results:
RowNum      ItemNumber  Split8KTallyM  CLRSplitString3  CLRRegEx
----------- ----------- -------------- ---------------- ----------
1           1           10             10               10
1           2           8              8                8
1           3           156            156              156
1           4           1389           1389             1389
1           5           45726          45726            45726
1           6           125456         125456           125456
2           1           4578           4578             4578
2           2           13             13               13
2           3           78784          78784            78784
2           4           888            888              888
2           5           45726          45726            45726
3           1           973            973              973
3           2           576            576              576
3           3           12547          12547            12547
3           4           88877          88877            88877
3           5           2111333        2111333          2111333
And here is also a modified version of the test, run on the big CsvTest table with 1333 (or any other count of) items. This version only returns the differences against Split8KTallyM. If there is no difference and the functions work correctly, then we will receive an empty result set.
WITH Split8KTallyM AS (
SELECT
D.RowNum,
V.ItemNumber,
V.ItemValue AS Split8KTallyM
FROM CsvTest D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
),
CLRSplitString3 AS (
SELECT
D.RowNum,
V.RowID AS ItemNumber,
V.Value AS CLRSplitString3
FROM CsvTest D
CROSS APPLY dbo.fn_SplitString3(D.CsvParameter, ',', 10) V
),
CLRRegEx AS (
SELECT
D.RowNum,
V.RowID AS ItemNumber,
V.Value AS CLRRegEx
FROM CsvTest D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, '\d+') V
)
SELECT
T.RowNum,
T.ItemNumber,
T.Split8KTallyM,
T1.CLRSplitString3,
T2.CLRRegEx
FROM Split8KTallyM T
FULL OUTER JOIN CLRSplitString3 T1 ON T.RowNum = T1.RowNum AND T.ItemNumber = T1.ItemNumber
FULL OUTER JOIN CLRRegEx T2 ON T.RowNum = T2.RowNum AND T.ItemNumber = T2.ItemNumber
WHERE
T.RowNum IS NULL
OR
T1.RowNum IS NULL
OR
T2.RowNum IS NULL
OR
T.Split8KTallyM <> T1.CLRSplitString3
OR
T.Split8KTallyM <> T2.CLRRegEx
And the result:
RowNum ItemNumber Split8KTallyM CLRSplitString3 CLRRegEx
-------------------- ----------- ------------- --------------- ----------
(0 row(s) affected)
So I can confirm that both the RegEx and the fastest function (the one processing the string at the character level) work correctly and return correct results.