October 2, 2015 at 4:32 pm
Hi,
After reading the articles around the performance of CLR we decide to roll out our own version of a CLR string splitter to replace a TSQL version that we floating around and from a performance / scalability point of view things appear to be great (we were very much inspired by the “Tally OH! An Improved SQL 8K “CSV Splitter” Function by Jeff Moden and the “SQLCLR String Splitting Part 2: Even Faster, Even More Scalable” by Adam Machanic so thanks to all involved for those pieces of great work).
I have tried to recreate this using a our own string splitter / a few others (including the Tally Oh splitter without success) and I don't think this can be recreated if you only have 1 column being referenced from your CLR function either.
This isn't really causing any problems but just something interesting I thought I would share since i have not found any information whilst searching around.
So let’s say we have a very basic string splitter. such as this one (don’t worry this isn’t being used, it’s something I knocked up when I thought I had make a mistake in the actual code we are using).
using System.Collections;
using System.Collections.Generic;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
namespace CLR_SQLCentralServer
{ public class StringSplitter
{
[SqlFunction(Name = "CLR_Split",
FillRowMethodName = "FillRow",
TableDefinition = "idx int, id nvarchar(max)")]
public static IEnumerable SplitStringBySingleDelimiter(SqlString str, SqlChars delimiter)
{
List<ReturnValues> allVals = new List<ReturnValues>();
var vals = new ReturnValues();
if (delimiter.Length == 0)
{
for (var index = 0; index < str.Value.Length; index++)
{
vals.Idx = index;
vals.Value = str.Value[index].ToString();
allVals.Add(new ReturnValues {Idx = vals.Idx, Value = vals.Value});
}
}
else
{
var lineElements = str.Value.Split(delimiter[0]);
for (var index = 0; index < lineElements.Length; index++)
{
vals.Idx = index;
vals.Value = lineElements[index];
allVals.Add(new ReturnValues {Idx = vals.Idx, Value = vals.Value});
}
}
return allVals;
}
private struct ReturnValues
{
public int Idx;
public string Value;
}
public static void FillRow(object row, out SqlInt32 idx, out SqlString str)
{
var returnVals = (ReturnValues)row;
str = returnVals.Value;
idx = returnVals.Idx;
}
}
}
If the input string contains a delimiter it will split, if no delimiter is supplied it will spit out the whole string.
so statements such as:
declare
@SwappedString varchar(max) = 'Test?';
select idx, value
from dbo.fn_Split(@SwappedString, '')
order by idx desc
returns:
idxvalue
4?
3t
2s
1e
0T
and
declare
@SwappedString varchar(max) = 'Te,st?';
select idx, value
from dbo.fn_Split(@SwappedString, ',')
order by idx desc
returns:
idxvalue
1st?
0Te
Now the interesting part is when you do something like this:
declare
@WorkingString varchar(max) = '',
@SwappedString varchar(max) = 'Lets test this string?';
select @WorkingString = @WorkingString + value
from dbo.fn_Split(@SwappedString, '')
order by idx desc
select @WorkingString
returns:
L
Yet by changing the order by to the value we get the full string
declare
@WorkingString varchar(max) = '',
@SwappedString varchar(max) = 'Lets test this string?';
select @WorkingString = @WorkingString + value
from dbo.fn_Split(@SwappedString, '')
order by value desc
Returns:
tttttssssrnLiihgee?
or add a top X and we get the full string again
declare
@WorkingString varchar(max) = '',
@SwappedString varchar(max) = 'Lets test this string?';
select top 100 @WorkingString = @WorkingString + value
from dbo.fn_Split(@SwappedString, '')
order by idx desc
select @WorkingString
Returns:
?gnirts siht tset steL
I have tried to attach some execution plans to this email but I don't think they will show, it's getting late here now so I will add them in the morning.
Thanks for looking.
October 3, 2015 at 12:16 am
Hey there. So I found the cause, kinda. It has to do with the NVARCHAR(MAX) datatype, though I am not sure why or how it is only showing up in a rather specific scenario: it doesn't matter if it is ASC or DESC order, but it needs to be ordered, and without a TOP clause. It also does not matter if the string is split on a character or no character.
This can be reproduced by anyone without doing any coding. Either use whatever splitter you currently have, or download the free version of SQL#[/url] (which I am the author of, but the String_Split function is in the Free version).
DECLARE @WorkingString VARCHAR(MAX) = '',
@SwappedString VARCHAR(MAX) = 'Lets test this string?';
SELECT @WorkingString = @WorkingString + spl.SplitVal
FROM SQL#.String_Split(@SwappedString, ' ', 1) spl
ORDER BY spl.SplitNum ASC;
SELECT @WorkingString;
Returns:
string?
Changing the ORDER BY to DESC returns:
Lets
If you change the SELECT to be:
SELECT TOP 100000 @WorkingString = @WorkingString + spl.SplitVal
then it works correctly. Look at the execution plans for both and you should see a difference. The version that does not work looks like:
SELECT <-- Sort <-- Compute Scalar <-- Table Valued Function
The version that does work looks like:
SELECT <-- Compute Scalar <-- Sort <-- Table Valued Function
Notice that "Compute Scalar" and "Sort" swapped positions.
The first attempt to fix this is to use the ORDER option for SQLCLR TVFs (not available in SQL Server 2005). Run the following:
CREATE FUNCTION [SQL#].[String_SplitOrdered](@StringValue [nvarchar](max),
@Separator [nvarchar](4000), @SplitOption [int])
RETURNS TABLE (
[SplitNum] [int] NULL,
[SplitVal] [nvarchar](max) NULL
) WITH EXECUTE AS CALLER
ORDER ([SplitNum] ASC)
AS EXTERNAL NAME [SQL#].[STRING].[Split];
Notice how we added ORDER ([SplitNum] ASC). Now we see a difference when running:
DECLARE @WorkingString VARCHAR(MAX) = '',
@SwappedString VARCHAR(MAX) = 'Lets test this string?';
SELECT @WorkingString = @WorkingString + spl.SplitVal
FROM SQL#.String_SplitOrdered(@SwappedString, ' ', 1) spl
ORDER BY spl.SplitNum ASC;
SELECT @WorkingString;
Returns:
Letstestthisstring?
Nice. But change the ORDER BY to DESC and we are back to:
Lets
Now let's try a different approach. We will take out the ORDER option and then change the datatype of the SplitVal field in the returned table to be NVARCHAR(4000) instead of NVARCHAR(MAX) (please note that we are not changing the type of the input parameters):
CREATE FUNCTION [SQL#].[String_SplitReturnNV4k](@StringValue [NVARCHAR](MAX),
@Separator [NVARCHAR](4000), @SplitOption [INT])
RETURNS TABLE (
[SplitNum] [INT] NULL,
[SplitVal] [NVARCHAR](4000) NULL
) WITH EXECUTE AS CALLER
AS EXTERNAL NAME [SQL#].[STRING].[Split];
And now we run:
DECLARE @WorkingString VARCHAR(MAX) = '',
@SwappedString VARCHAR(MAX) = 'Lets test this string?';
SELECT @WorkingString = @WorkingString + spl.SplitVal
FROM SQL#.String_SplitReturnNV4k(@SwappedString, ' ', 1) spl
ORDER BY spl.SplitNum ASC;
SELECT @WorkingString;
Returns:
Letstestthisstring?
And, changing the ORDER BY to DESC we finally get:
string?thistestLets
So, it seems like there is a specific combination of things that trips up the Query Optimizer, and it appears to be mostly centered around the datatype(s) of the result set of the SQLCLR TVF.
I do not know the exact mechanism causing the problem, but at least slightly less of a mystery now :-).
Hope this helps.
P.S. the "namespace" should probably be: "CLR_SQLServerCentral" 😉
SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR
October 3, 2015 at 3:26 am
Thanks,
Setting the column to a specified value in the function does seem to make sense, I suppose it's giving the compiler more hints that data should be output, in the same way as using top.
Look at the execution plans for both and you should see a difference. The version that does not work looks like:
SELECT <-- Sort <-- Compute Scalar <-- Table Valued Function
The version that does work looks like:
SELECT <-- Compute Scalar <-- Sort <-- Table Valued Function
Notice that "Compute Scalar" and "Sort" swapped positions.
In the version that works its also worth noting that the 'sort' contains both column's in the output list which it passes along the chain.
The version that does not work does list the 'value' output column when at the 'Table Valued Function' but the Compute Scalar seems to drop it.
Here's an extract of the Compute Scalar's to further showcase this:
Not working:
<ComputeScalar>
<DefinedValues>
<DefinedValue>
<ColumnReference Column="Expr1001" />
<ScalarOperator ScalarString="CONVERT_IMPLICIT(varchar(max),CONVERT_IMPLICIT(nvarchar(max),[@WorkingString],0)+fn_Split.[value],0)">
<Convert DataType="varchar(max)" Length="2147483647" Style="0" Implicit="true">
<ScalarOperator>
<Arithmetic Operation="ADD">
<ScalarOperator>
<Convert DataType="nvarchar(max)" Length="2147483647" Style="0" Implicit="true">
<ScalarOperator>
<Identifier>
<ColumnReference Column="@WorkingString" />
</Identifier>
</ScalarOperator>
</Convert>
</ScalarOperator>
<ScalarOperator>
<Identifier>
<ColumnReference Table="[fn_Split]" Column="value" />
</Identifier>
</ScalarOperator>
</Arithmetic>
</ScalarOperator>
</Convert>
</ScalarOperator>
</DefinedValue>
</DefinedValues>
<RelOp AvgRowSize="4039" EstimateCPU="0.00100016" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="1000" LogicalOp="Table-valued function" NodeId="2" Parallel="false" PhysicalOp="Table-valued function" EstimatedTotalSubtreeCost="0.00100016">
<OutputList>
<ColumnReference Table="[fn_Split]" Column="idx" />
<ColumnReference Table="[fn_Split]" Column="value" />
</OutputList>
<RunTimeInformation>
<RunTimeCountersPerThread Thread="0" ActualRebinds="1" ActualRewinds="0" ActualRows="22" Batches="0" ActualEndOfScans="1" ActualExecutions="1" ActualExecutionMode="Row" />
</RunTimeInformation>
<TableValuedFunction>
<DefinedValues>
<DefinedValue>
<ColumnReference Table="[fn_Split]" Column="idx" />
</DefinedValue>
<DefinedValue>
<ColumnReference Table="[fn_Split]" Column="value" />
</DefinedValue>
</DefinedValues>
<Object Database="[CLRTestDB]" Schema="[dbo]" Table="[fn_Split]" />
<ParameterList>
<ScalarOperator ScalarString="CONVERT_IMPLICIT(nvarchar(max),[@SwappedString],0)">
<Identifier>
<ColumnReference Column="ConstExpr1002">
<ScalarOperator>
<Convert DataType="nvarchar(max)" Length="2147483647" Style="0" Implicit="true">
<ScalarOperator>
<Identifier>
<ColumnReference Column="@SwappedString" />
</Identifier>
</ScalarOperator>
</Convert>
</ScalarOperator>
</ColumnReference>
</Identifier>
</ScalarOperator>
<ScalarOperator ScalarString="N''">
<Const ConstValue="N''" />
</ScalarOperator>
</ParameterList>
</TableValuedFunction>
</RelOp>
</ComputeScalar>
Working:
<ComputeScalar>
<DefinedValues>
<DefinedValue>
<ColumnReference Column="Expr1001" />
<ScalarOperator ScalarString="CONVERT_IMPLICIT(varchar(max),CONVERT_IMPLICIT(nvarchar(max),[@WorkingString],0)+fn_Split.[value],0)">
<Convert DataType="varchar(max)" Length="2147483647" Style="0" Implicit="true">
<ScalarOperator>
<Arithmetic Operation="ADD">
<ScalarOperator>
<Convert DataType="nvarchar(max)" Length="2147483647" Style="0" Implicit="true">
<ScalarOperator>
<Identifier>
<ColumnReference Column="@WorkingString" />
</Identifier>
</ScalarOperator>
</Convert>
</ScalarOperator>
<ScalarOperator>
<Identifier>
<ColumnReference Table="[fn_Split]" Column="value" />
</Identifier>
</ScalarOperator>
</Arithmetic>
</ScalarOperator>
</Convert>
</ScalarOperator>
</DefinedValue>
</DefinedValues>
<RelOp AvgRowSize="4039" EstimateCPU="0.0156507" EstimateIO="0.0112613" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="100" LogicalOp="TopN Sort" NodeId="1" Parallel="false" PhysicalOp="Sort" EstimatedTotalSubtreeCost="0.0279121">
<OutputList>
<ColumnReference Table="[fn_Split]" Column="idx" />
<ColumnReference Table="[fn_Split]" Column="value" />
</OutputList>
<MemoryFractions Input="1" Output="1" />
<RunTimeInformation>
<RunTimeCountersPerThread Thread="0" ActualRebinds="1" ActualRewinds="0" ActualRows="22" Batches="0" ActualEndOfScans="1" ActualExecutions="1" ActualExecutionMode="Row" InputMemoryGrant="1024" OutputMemoryGrant="1024" UsedMemoryGrant="16" />
</RunTimeInformation>
<TopSort Distinct="false" Rows="100">
<OrderBy>
<OrderByColumn Ascending="false">
<ColumnReference Table="[fn_Split]" Column="idx" />
</OrderByColumn>
</OrderBy>
<RelOp AvgRowSize="4039" EstimateCPU="0.00100016" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="1000" LogicalOp="Table-valued function" NodeId="2" Parallel="false" PhysicalOp="Table-valued function" EstimatedTotalSubtreeCost="0.00100016">
<OutputList>
<ColumnReference Table="[fn_Split]" Column="idx" />
<ColumnReference Table="[fn_Split]" Column="value" />
</OutputList>
<RunTimeInformation>
<RunTimeCountersPerThread Thread="0" ActualRebinds="1" ActualRewinds="0" ActualRows="22" Batches="0" ActualEndOfScans="1" ActualExecutions="1" ActualExecutionMode="Row" />
</RunTimeInformation>
<TableValuedFunction>
<DefinedValues>
<DefinedValue>
<ColumnReference Table="[fn_Split]" Column="idx" />
</DefinedValue>
<DefinedValue>
<ColumnReference Table="[fn_Split]" Column="value" />
</DefinedValue>
</DefinedValues>
<Object Database="[CLRTestDB]" Schema="[dbo]" Table="[fn_Split]" />
<ParameterList>
<ScalarOperator ScalarString="CONVERT_IMPLICIT(nvarchar(max),[@SwappedString],0)">
<Identifier>
<ColumnReference Column="ConstExpr1002">
<ScalarOperator>
<Convert DataType="nvarchar(max)" Length="2147483647" Style="0" Implicit="true">
<ScalarOperator>
<Identifier>
<ColumnReference Column="@SwappedString" />
</Identifier>
</ScalarOperator>
</Convert>
</ScalarOperator>
</ColumnReference>
</Identifier>
</ScalarOperator>
<ScalarOperator ScalarString="N''">
<Const ConstValue="N''" />
</ScalarOperator>
</ParameterList>
</TableValuedFunction>
</RelOp>
</TopSort>
</RelOp>
</ComputeScalar>
Whilst we have been messing around with combinations another member of the team also discovered we could output the full result using XML.
declare
@WorkingString varchar(max) = '',
@SwappedString varchar(max) = 'Lets test this string?';
set @WorkingString = (
select value as [text()]
from dbo.fn_Split(@SwappedString, '')
order by idx desc
for xml path('')
)
select @WorkingString
October 3, 2015 at 3:33 am
P.S. the "namespace" should probably be: "CLR_SQLServerCentral" Wink
oops... I'm sure I can be forgiven, to be honest at the time I posted this I'm just happy to find its on the correct forum at all :-P.
June 16, 2018 at 6:27 am
I believe I see some pieces of this puzzle that nobody has pointed out. The first is a bug in your CLR, although it's one you are unlikely to ever trip.
public static void FillRow(object row, out SqlInt32 idx, [SqlFacet(MaxSize = -1)] out SqlString str);
Unless you specify this SqlFacet your output will be truncated to 4000 bytes. Something that probably has not bitten you. More importantly it also prevents the need for an implicit cast in your execution plan which should speed its execution a bit.
The real key to your puzzle has to do with the fact that your SqlFunction doesn't specify IsDeterministic = true. The default is false. When a function is non-deterministic the optimizer is quite a bit more free to re-order operations. There are really poignant examples of this in all kinds of forums where functions like newid are used in CTEs and wholly unexpected results are produced.
For example
with x as ( select newid id,val from t1 )
select x.id, count(*) cnt from t2 join x on x.val = t2.val group by x.id;
Queries of this form can create plans where cnt is always 1 regardless of the number of rows added by the join to t2 because the optimizer re-orders newid creation to a point after the join occurs. The group by then sees unique values associated with every row and does no grouping.
June 17, 2018 at 12:29 am
Richard.Bielawski - Saturday, June 16, 2018 6:27 AMI believe I see some pieces of this puzzle that nobody has pointed out. The first is a bug in your CLR, although it's one you are unlikely to ever trip.
public static void FillRow(object row, out SqlInt32 idx, [SqlFacet(MaxSize = -1)] out SqlString str);
Unless you specify this SqlFacet your output will be truncated to 4000 bytes. Something that probably has not bitten you. More importantly it also prevents the need for an implicit cast in your execution plan which should speed its execution a bit.The real key to your puzzle has to do with the fact that your SqlFunction doesn't specify IsDeterministic = true. The default is false. When a function is non-deterministic the optimizer is quite a bit more free to re-order operations. There are really poignant examples of this in all kinds of forums where functions like newid are used in CTEs and wholly unexpected results are produced.
Hi Richard. Thanks for lending a hand in trying to figure this one out. With regards to those two points, however, they are both incorrect:
For all other situations, that particular SqlFacet property (i.e. "MaxSize") is only used by SSDT when generating the T-SQL to publish. In earlier versions of SSDT (and its predecessor: DataDude), the default was to use NVARCHAR(4000) for SqlString and NVARCHAR(MAX) for SqlChars. Starting in Visual Studio 2013 (or 2015, perhaps), the default was changed to be NVARCHAR(MAX) for both SqlString and SqlChars. Meaning, in the earlier versions, if using Visual Studio / SSDT to generate the T-SQL, then yes, you did need that SqlFacet to get NVARCHAR(MAX), but at this point you don't. And even back then, if you wrote the CREATE FUNCTION statement yourself, then it didn't matter.
SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply