Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Usman Butt (5/14/2014)


    BTW, can someone test the physical tally table splitters as well?

    Just had the chance to do a quick test; the physical table splitters seem to be at a 2005/2008 level while the inline ones are pretty close to the CLR. Will advise at the earliest convenience.

    😎

  • mburbea (5/20/2014)


    How can I be sure SQL Server actually does the work and doesn't just cheat by realizing that the first 999 rows don't matter, and only processes the last row (since the SQL CLR function is a black box)? I know it's not quite as bad as, say, a C compiler when it comes to removing code, but I know the query optimizer can be quite clever in its evaluation.

    I don't see how SQL Server could avoid generating all of the values, as it is not really in control of that code. I suppose it is theoretically possible (though still highly doubtful) that it simply ignores all but the last row, but I would just consider that a well-done optimization ;-). Regardless, you could force the test to require all rows by using them via accumulation:

    '
    DECLARE @DummyRowNum INT,
            @DummyItemNumber INT,
            @DummyItemLEN INT

    SELECT @DummyRowNum = 0,
           @DummyItemNumber = 0,
           @DummyItemLEN = 0

    --===== Run the test
    SELECT @DummyRowNum = (@DummyRowNum + csv.RowNum),
           @DummyItemNumber = (@DummyItemNumber + split.ItemNumber),
           @DummyItemLEN = (@DummyItemLEN + LEN(split.Item))
    FROM dbo.CSV8K csv
    CROSS APPLY dbo.'+c.name+'(csv.CSV, CHAR(44)) split
    '

    (note: using SQL Server 2005 compatible syntax for testing flexibility, hence no "= 0" on DECLARE or @var += val)

    Just a thought.

    Take care,

    Solomon..

    SQL# - https://SQLsharp.com/ (SQLCLR library of over 340 functions and procedures)
    Sql Quantum Lift - https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap - https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations • Module Signing • SQLCLR

  • Paul's CLR split method is quite excellent and so simple. However, in this trial one of its pain points is that it takes an nvarchar(max), so every call incurs an implicit conversion to nvarchar(max). The fact that it still wins soundly and regularly shows how well it performs.

    From the C# code perspective, there is a small bit of optimization you can do. First, it makes more sense to make the record object a class: since the method is called through an interface there is little hope of inlining the Current property, and it's best to avoid boxing to reduce memory pressure. Second, the for loop's exit condition compared the counter against a value held in a property, and though the x64 JIT in .NET is much more clever than its x86 counterpart, it still inserts a bounds check; comparing directly against Input.Length eliminates the bounds check.
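
    A minimal sketch of those two tweaks, with hypothetical names (Chunk, FindDelimiter) rather than Paul's actual code:

    using System;

    // A class, not a struct: IEnumerator.Current returns object, so returning
    // a struct from it would box on every row.
    internal sealed class Chunk
    {
        public readonly int Start;
        public readonly int Length;
        public Chunk(int start, int length) { Start = start; Length = length; }
    }

    internal static class Scan
    {
        // Comparing the counter directly against input.Length is the pattern
        // the JIT recognizes, letting it elide the per-access bounds check.
        internal static int FindDelimiter(char[] input, int from, char delimiter)
        {
            for (int i = from; i < input.Length; i++)
                if (input[i] == delimiter)
                    return i;
            return -1;
        }
    }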

    Another thing you can do is return a SqlChars object instead of a SqlString. SqlChars has less overhead when being serialized and deserialized. However, to do that we need to send a char array; Buffer.BlockCopy is probably the lowest-overhead way to do this, as creating a new string and grabbing its chars would incur more memory pressure.
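
    A hedged sketch of that change (the Slice helper name is hypothetical; note Buffer.BlockCopy counts in bytes, so char counts are doubled):

    using System;
    using System.Data.SqlTypes;

    internal static class Slicer
    {
        // Copy a slice of the input chars and wrap it in SqlChars without
        // materializing an intermediate string.
        internal static SqlChars Slice(char[] input, int start, int length)
        {
            char[] item = new char[length];
            // Buffer.BlockCopy offsets/counts are in BYTES; a char is 2 bytes.
            Buffer.BlockCopy(input, start * 2, item, 0, length * 2);
            return new SqlChars(item);
        }
    }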

    This version performs marginally better than Paul's original version.

    However, if you are trying to win this trial, your best bet is to avoid LOB types. With that said, I present a T-SQL function that will beat Paul's version for VARCHARs shorter than 8K.

    The idea of SplitVarchar is basically to take your varchar, convert it to varbinary(8000), call a CLR function that works on bytes instead of chars, and then convert back 🙂 Even though this method is horribly gross, it can still perform quite nicely because it avoids LOB types.
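
    A minimal sketch of that byte-oriented core, under the assumption it is shaped roughly like this (names hypothetical; the real code is in the gist below). The T-SQL wrapper does CONVERT(varbinary(8000), @s), this scans raw bytes, and the wrapper converts each slice back to varchar; in a real SQLCLR TVF this would be wired up through a FillRow method:

    using System;
    using System.Collections.Generic;
    using System.Data.SqlTypes;

    public static class ByteSplit
    {
        // Scan raw bytes for a single-byte delimiter and yield each slice;
        // no nvarchar conversion and no LOB types are involved.
        public static IEnumerable<SqlBytes> SplitBytes(SqlBytes input, SqlByte delimiter)
        {
            byte[] buf = input.Value;
            byte d = delimiter.Value;
            int start = 0;
            for (int i = 0; i <= buf.Length; i++)
            {
                if (i == buf.Length || buf[i] == d)
                {
                    byte[] item = new byte[i - start];
                    Buffer.BlockCopy(buf, start, item, 0, i - start);
                    yield return new SqlBytes(item);
                    start = i + 1;
                }
            }
        }
    }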

    Full source code here:

    https://gist.github.com/mburbea/e72151af503873d82d6f

    SplitterName    total_d    avg_d
    Split           11.078102  0.21721768627451
    SplitNVarchar   10.606059  0.207961941176471
    SplitVarchar     8.052808  0.157898196078431

    Notes:

    Jeff's test harness is somewhat biased: it seems that the earlier a function appears in the run order, the better it may rate. It's unfortunate.

    The .NET Framework has an optimized path for IndexOf on bytes. It performs n/4 iterations through the loop, doing 4 comparisons at once. Unfortunately, the optimization really only helps at extremely long element lengths; the cost of its setup usually dwarfs any gain at the short lengths we normally use.

    Buffering isn't worth investigating at 8K (varchar)/16K (nvarchar). Microsoft's own StreamReader has an internal buffer of 8096 bytes on .NET 2 or 2048 bytes on .NET 4, and every operation would incur a read from the stream to the internal buffer and then to your buffer, which you would scan to make slices. The ideal would be to use a method that only uses the internal buffer, like ReadLine, which would require you to make your delimiter the newline character.

  • For those looking for the fastest way to split 8k CSVs in SQL Server, I present hybrid.

    Hybrid performs about TWICE as fast as Paul White's Split CLR function, and about FIVE times faster than the state-of-the-art T-SQL splitters. It pretty much exceeds all thresholds I could imagine.

    SplitterName       total_d    avg_d
    ============================================
    CLRBaseline         8.564854  0.142747566666667
    CLRNVBaseline       9.903986  0.165066433333333
    DelimitedSplit8kb  26.079607  0.434660116666667
    hybrid              5.195519  0.0865919833333333
    Split              10.627056  0.1771176
    SplitNVarchar      10.246028  0.170767133333333
    SplitVarchar        8.882886  0.1480481
    SplitVarcharA       8.997896  0.149964933333333
    SqlBaseline         5.750573  0.0958428833333333

    All source code and an updated test script below.

    https://gist.github.com/mburbea/e72151af503873d82d6f

    To create the best SQL splitter, I first needed to figure out what I was trying to optimize and how much was lost to I/O, so I decided to create some baselines.

    These functions receive a parameter that tells them how many chunks to spit out; they chunk the input into even parts and return them.

    • 1. SQLBaseline -- This is basically a CTE splitter.
    • 2. CLRBaseLine -- A wrapper T-SQL function that calls a CLR function to split the string as a binary string, then converts the binary string back to varchar (a sketch of this kind of chunker follows the list).
    • 3. CLRNVBaseLine -- Same thing as 2 except it converts to nvarchar(max) and returns nvarchar(4000).
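
    A rough sketch of the chunker behind the CLR baselines, under the assumption it looks something like this (names hypothetical; the real code is in the gist):

    using System;
    using System.Collections.Generic;
    using System.Data.SqlTypes;

    public static class Baseline
    {
        // No delimiter scan at all: cut the input into 'parts' even slices,
        // so what gets measured is pure copying/marshalling overhead.
        public static IEnumerable<SqlBytes> Chunk(SqlBytes input, SqlInt32 parts)
        {
            byte[] buf = input.Value;
            int size = (buf.Length + parts.Value - 1) / parts.Value; // ceiling division
            for (int start = 0; start < buf.Length; start += size)
            {
                int len = Math.Min(size, buf.Length - start);
                byte[] slice = new byte[len];
                Buffer.BlockCopy(buf, start, slice, 0, len);
                yield return new SqlBytes(slice);
            }
        }
    }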

    What I found was a bit depressing. My improved CLR splitters were usually extremely close to their respective baselines. I tried fancy vectorization techniques, but it was mostly pointless. I was fighting for an average gain of a few hundred milliseconds across all trials. Not worth it. Most sane people would have given up here.

    I was ready to publish my results showing how close they were to their theoretical limiting baselines and urge everyone to try my revised functions, when I noticed that the SQLBaseline was fast. Nearly twice as fast as Paul's Split function, which is quite a feat, and about 4 times faster than the best T-SQL splitter.

    What was going on? Well, the big slowdown in the original splitters is that they need to find the split intervals. For a given string of length N with P parts, that requires creating a tally table of N elements, doing N string compares, N+P substrings, and P CHARINDEXes.

    That's a lot of work.

    The SQLBaseline on the other hand creates a tally table of P elements and performs P substrings.

    This is a relatively light load.

    The CLR code, on the other hand, does N char/byte comparisons (basically free) and performs P bulk copies. But for every chunk it has to marshal the returned object to SQL Server, which then has to move it into unmanaged memory and send it to the return set. This basically means we are doing two bulk copies on each step.

    So the solution hit me: exploit what each system is good at. HybridInternal is great at finding the next chunk in a blob of text; SQL's intrinsic SUBSTRING function is fast at copying strings. The result is pretty incredible.
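
    In code, the CLR half of that division of labour only has to report positions. A hedged sketch (hypothetical names; see the gist for the real HybridInternal):

    using System.Collections.Generic;

    public static class Hybrid
    {
        // Return only (start, length) pairs; the T-SQL wrapper applies
        // SUBSTRING itself, so no string data is copied back across the
        // CLR boundary.
        public static IEnumerable<KeyValuePair<int, int>> FindChunks(string input, char delimiter)
        {
            int start = 0;
            for (int i = 0; i <= input.Length; i++)
            {
                if (i == input.Length || input[i] == delimiter)
                {
                    // +1 because T-SQL's SUBSTRING is 1-based
                    yield return new KeyValuePair<int, int>(start + 1, i - start);
                    start = i + 1;
                }
            }
        }
    }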

    Hybrid beats the SQL baseline in most cases. Don't believe me? Try it yourself 🙂

    I included some longer string lengths than the original benchmark. DelimitedSplit8kb actually works really well if your string lengths aren't going to be monstrously long. I figured a few of these would be worth looking at.

  • mburbea (5/25/2014)


    For those looking for the fastest way to split 8k CSVs in SQL Server, I present hybrid. Hybrid performs about TWICE as fast as Paul White's Split CLR function, and about FIVE times faster than the state-of-the-art T-SQL splitters. [...]

    I have not gone into the details, but 10 out of 10 for thinking outside the box. The whole work is really admirable :cool:.

    You have tried to address the problems which Paul White has pointed out various times: his CLR splitter

    1. is a generic splitter; a general solution (the most important point to me)

    2. is made for Unicode strings

    3. is made for larger strings, i.e. nvarchar(max)

    Hence, there was always a cost to convert from varchar to nvarchar and vice versa. Also, when I asked him about any room for improvement in his CLR (obviously way earlier than your posts), his response was that there should be, as the CLR function he made was a very straightforward .NET 2.0 implementation (now .NET 4.0 is available :-)) and he never spent any time writing anything beyond the obvious simple code because he simply never saw the need.

    Having said that, it is really nice to see how effectively you implemented your ideas and took it as a challenge to improve the CLR splitter as much as possible. Commendable :Wow:. I can foresee interesting feedback coming up from our expert.

  • Usman Butt (5/26/2014)


    mburbea (5/25/2014)


    For those looking for the fastest way to split 8k CSVs in SQL Server, I present hybrid. [...]

    I have not gone into the details, but 10 out of 10 for thinking outside the box. [...] I can foresee interesting feedback coming up from our expert.

    From the description it sounds like he did the same as my version, which I posted here a while back:

    http://www.sqlservercentral.com/Forums/FindPost1561800.aspx

    At the time I got zero feedback on this approach. That doesn't take anything away from it, and his might be a better implementation overall, as I haven't seen the exact code yet. For one, my own version uses simple strings instead of char arrays, as most of the time I expect the input to be small enough not to worry about streaming.

    As for type conversion, the way I handled it is to make one case-sensitive and one case-insensitive C# version, then create several SQL functions per C# function, where each version handles a different input type (varchar 8K, varchar max, nvarchar 4K, nvarchar max). The C# code returns only the sub-string index, start index, and length of the sub-strings, which allows the SQL code to do the sub-strings directly into the original search text, with survival of the collation and without any conversions.

    It kind of bites me now that I never took the time to benchmark the code; I just wanted to have the whole lib done first. Which can still take a while, as I haven't put much time into it since I got the splitters and some other functions going. I was particularly occupied by a sound naming scheme and more specialized functions to help with parameter passing, like delivering N sub-strings at a time so that it's easy to cast the strings and use them as arguments in other operations.

  • Because I have little time to performance test now, I will just hand over the code so someone can check it out and/or improve on it.

    This is how the SQL functions look:

    /* String: Split */
    create function [dbo].[$Split]( @instr nvarchar(max), @delimiter nvarchar(64) )
    returns table ( SequenceNr int, StartIndex int, Length int )
    as external name [cs.db].[Cs.Db.String].[Split];
    go

    create function [dbo].[Split]( @instr nvarchar(max), @delimiter nvarchar(64) )
    returns table
    as return ( select SequenceNr, StartIndex, Length, Value = substring( @instr, StartIndex, Length ) from [dbo].[$Split]( @instr, @delimiter ) )
    go

    create function [dbo].[Split4K]( @instr nvarchar(4000), @delimiter nvarchar(64) )
    returns table
    as return ( select SequenceNr, StartIndex, Length, Value = substring( @instr, StartIndex, Length ) from [dbo].[$Split]( @instr, @delimiter ) )
    go

    create function [dbo].[SplitC]( @instr varchar(max), @delimiter varchar(64) )
    returns table
    as return ( select SequenceNr, StartIndex, Length, Value = substring( @instr, StartIndex, Length ) from [dbo].[$Split]( @instr, @delimiter ) )
    go

    create function [dbo].[SplitC8K]( @instr varchar(8000), @delimiter varchar(64) )
    returns table
    as return ( select SequenceNr, StartIndex, Length, Value = substring( @instr, StartIndex, Length ) from [dbo].[$Split]( @instr, @delimiter ) )
    go

    /* String: SplitNoCase */
    create function [dbo].[$SplitNoCase]( @instr nvarchar(max), @delimiter nvarchar(64) )
    returns table ( SequenceNr int, StartIndex int, Length int )
    as external name [cs.db].[Cs.Db.String].[SplitNoCase];
    go

    create function [dbo].[SplitNoCase]( @instr nvarchar(max), @delimiter nvarchar(64) )
    returns table
    as return ( select SequenceNr, StartIndex, Length, Value = substring( @instr, StartIndex, Length ) from [dbo].[$SplitNoCase]( @instr, @delimiter ) )
    go

    create function [dbo].[SplitNoCase4K]( @instr nvarchar(4000), @delimiter nvarchar(64) )
    returns table
    as return ( select SequenceNr, StartIndex, Length, Value = substring( @instr, StartIndex, Length ) from [dbo].[$SplitNoCase]( @instr, @delimiter ) )
    go

    create function [dbo].[SplitNoCaseC]( @instr varchar(max), @delimiter varchar(64) )
    returns table
    as return ( select SequenceNr, StartIndex, Length, Value = substring( @instr, StartIndex, Length ) from [dbo].[$SplitNoCase]( @instr, @delimiter ) )
    go

    create function [dbo].[SplitNoCaseC8K]( @instr varchar(8000), @delimiter varchar(64) )
    returns table
    as return ( select SequenceNr, StartIndex, Length, Value = substring( @instr, StartIndex, Length ) from [dbo].[$SplitNoCase]( @instr, @delimiter ) )
    go

    Queries will use these:

    dbo.Split

    dbo.Split4K

    dbo.SplitC

    dbo.SplitC8K

    dbo.SplitNoCase

    dbo.SplitNoCase4K

    dbo.SplitNoCaseC

    dbo.SplitNoCaseC8K

    The functions starting with a $ are for internal use only.

    The C# code has 3 parts (an enumerator and two interface functions):

    using System;
    using System.Collections;
    using System.Data.SqlTypes;
    using System.Text;
    using Microsoft.SqlServer.Server;

    namespace Cs.Db.Solution.String
    {
        internal class StringSplitEnumerator : IEnumerator
        {
            #region result class
            internal class Result
            {
                private readonly int _sequenceNr;
                private readonly int _startIndex;
                private readonly int _length;

                public Result( int sequenceNr, int startIndex, int length )
                {
                    _sequenceNr = sequenceNr;
                    _startIndex = startIndex;
                    _length = length;
                }

                internal void SQLFillRow( out SqlInt32 sequenceNr, out SqlInt32 startIndex, out SqlInt32 length )
                {
                    sequenceNr = new SqlInt32( _sequenceNr );     // In T-SQL everything works base 1, so adjust to that
                    startIndex = new SqlInt32( _startIndex + 1 ); // In T-SQL the first character is at index 1
                    length = new SqlInt32( _length );
                }
            }
            #endregion

            private int _sequenceNr;
            private int _lastPos;
            private int _nextPos;
            private readonly string _value;
            private readonly string _delimiter;
            private readonly int _delimiterLength;
            private readonly StringComparison _comparisonType;

            public StringSplitEnumerator( string value, string delimiter, StringComparison comparisonType )
            {
                _value = value;
                _delimiter = delimiter;
                _delimiterLength = (delimiter ?? string.Empty).Length;
                _comparisonType = comparisonType;
                _sequenceNr = -1;
                _nextPos = 0 - _delimiterLength;
            }

            #region IEnumerator Members
            public void Reset()
            {
                _sequenceNr = -1;
                _nextPos = 0 - _delimiterLength;
                _lastPos = 0;
            }

            public object Current
            {
                get
                {
                    int startIndex = _lastPos + _delimiterLength;
                    int length = _nextPos - startIndex;
                    return new Result( _sequenceNr, startIndex, length );
                }
            }

            public bool MoveNext()
            {
                // There is never anything to return for null input!
                if ( _value == null )
                    return false;

                // If the last sequence of the string has already been delivered, there is no next!
                if ( _nextPos >= _value.Length )
                    return false;

                // Prepare internal state for this iteration
                _lastPos = _nextPos;
                _sequenceNr++;

                if ( _delimiter == null )
                {
                    _nextPos = _value.Length;
                    return true;
                }

                // Handle delimiters with length = 0 (this splits on imaginary delimiters between each existing character in the source string)
                if ( _delimiterLength == 0 )
                {
                    _nextPos++;
                    return true;
                }

                // Handle delimiters with length >= 1
                _nextPos = _value.IndexOf( _delimiter, _lastPos + _delimiterLength, _comparisonType );
                if ( _nextPos == -1 )
                    _nextPos = _value.Length;
                return true;
            }
            #endregion
        }
    }

    using Cs.Db.Solution.String;
    using System;
    using System.Collections;
    using System.Data.SqlTypes;
    using System.Text;
    using Microsoft.SqlServer.Server;

    namespace Cs.Db
    {
        public partial class String
        {
            // Case sensitive ordinal string splitter
            [SqlFunction
            (
                Name = "Split"
                , FillRowMethodName = "Splitter_SQLFillRow"
                , TableDefinition = "SequenceNr int, StartIndex int, Length int"
                , IsDeterministic = true
                , IsPrecise = true
            )
            ]
            public static IEnumerator Split
            (
                [SqlFacet( MaxSize=-1 )] string instr
                , [SqlFacet( MaxSize=64 )] string delimiter
            )
            {
                return new StringSplitEnumerator( instr, delimiter, StringComparison.Ordinal );
            }

            // This method is used to output table-row results.
            public static void Splitter_SQLFillRow
            (
                object obj
                , out SqlInt32 sequenceNr
                , out SqlInt32 startIndex
                , out SqlInt32 length
            )
            {
                ((StringSplitEnumerator.Result) obj).SQLFillRow( out sequenceNr, out startIndex, out length );
            }
        }
    }

    using Cs.Db.Solution.String;
    using System;
    using System.Collections;
    using System.Data.SqlTypes;
    using System.Text;
    using Microsoft.SqlServer.Server;

    namespace Cs.Db
    {
        public partial class String
        {
            // Case insensitive ordinal string splitter
            [SqlFunction
            (
                Name = "SplitNoCase"
                , FillRowMethodName = "SplitterNoCase_SQLFillRow"
                , TableDefinition = "SequenceNr int, StartIndex int, Length int"
                , IsDeterministic = true
                , IsPrecise = true
            )
            ]
            public static IEnumerator SplitNoCase
            (
                [SqlFacet( MaxSize=-1 )] string instr
                , [SqlFacet( MaxSize=64 )] string delimiter
            )
            {
                return new StringSplitEnumerator( instr, delimiter, StringComparison.OrdinalIgnoreCase );
            }

            // This method is used to output table-row results.
            public static void SplitterNoCase_SQLFillRow
            (
                object obj
                , out SqlInt32 sequenceNr
                , out SqlInt32 startIndex
                , out SqlInt32 length
            )
            {
                ((StringSplitEnumerator.Result) obj).SQLFillRow( out sequenceNr, out startIndex, out length );
            }
        }
    }

    The C# code is also attached (it should be put into a project first). As is visible, the enumerator is very plainly written and is simply configured to work case sensitively or insensitively. Unique, perhaps, is that it accepts zero-length delimiters, in which case it simply imagines a separator between each character in the input.

  • If you look at my GitHub, I included an updated test script which should make it easier to test your functions. Simply add them to the function array at the top and add a conditional drop for your functions/assembly.

    My average run took less than 2 minutes, I think you can find time for it 😛

    It's unfortunate that SQL Server doesn't pass the collation with the strings to the CLR; that way you could avoid having multiple functions, e.g. (a hypothetical mapping sketch follows this list):

    • CS-AS - Ordinal
    • CS-AI - InvariantCulture
    • CI-AS - OrdinalIgnoreCase
    • CI-AI - InvariantCultureIgnoreCase
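
    A purely hypothetical sketch of that mapping, if SQL Server did forward the collation (it doesn't, which is why separate functions per behaviour are needed today):

    using System;

    public static class CollationMap
    {
        // Maps the four case/accent sensitivity combinations above to a
        // StringComparison; hypothetical, since the collation never arrives.
        public static StringComparison FromCollation(bool caseSensitive, bool accentSensitive)
        {
            if (caseSensitive)
                return accentSensitive ? StringComparison.Ordinal
                                       : StringComparison.InvariantCulture;
            return accentSensitive ? StringComparison.OrdinalIgnoreCase
                                   : StringComparison.InvariantCultureIgnoreCase;
        }
    }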

    This is a pretty good approach, but from testing:

    • Making anything LOB is cancer. Once you LOB the input, you get a LOB on the substrings from SQL Server. While SUBSTRING on non-LOBs is fast, SUBSTRING on LOB types (especially manufactured ones, as in this trial) is super slow. You are better off having the CLR do the substrings when dealing with nvarchar(max) inputs. There is a reason I didn't include a version of hybrid that works on varchar(max)/nvarchar(max) types: it performs about ELEVEN times worse than hybrid does in these trials. You're best off using SplitNVarchar.
    • SqlString is slower than SqlChars. There is a larger overhead in converting the buffer into a string, and it gets worse on LOB strings.
    • string.IndexOf is faster when you are looking for a single character. If you must support variable-length delimiters, special-case the length-one case and you will have faster code (see the sketch after this list).
    • Error checking and pinning add up. Be careful with the method calls you make, as they might be more costly than just implementing the logic in the loop body.
    • Streaming TVFs stream. They act like other source elements in T-SQL and only get called until their explicit row goal is met. If you want to return just the first 10 items, do a TOP 10 on the call to the TVF. If you want to return a single row with those 10, consider a transformation with either a pivot or partial aggregation like SUM.
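
    For the string.IndexOf point, a small sketch of the single-character special case (hypothetical helper name):

    using System;

    internal static class DelimiterScan
    {
        // IndexOf(char, int) is a plain scan, while the string overload pays
        // per-call setup, so special-case length-one delimiters.
        internal static int Next(string value, string delimiter, int from)
        {
            return delimiter.Length == 1
                ? value.IndexOf(delimiter[0], from)
                : value.IndexOf(delimiter, from, StringComparison.Ordinal);
        }
    }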

    Edit:

    Perplexingly enough, when I created a version of HybridInternal that works on chars (and thus takes an nvarchar(max) instead of bytes) and created a varchar(8000) wrapper to call it, it was much slower than the varbinary version: about 60% of the speed of the hybrid function. If you call it on an nvarchar(max) input, it is about ELEVEN times slower than its varbinary cousin on a non-LOB input. My guess is that LOB manufacturing is just wildly expensive; even doing it once is that slow.

    The way it goes: if your input is a varchar of less than 8K, use hybrid. If it's nvarchar or a long string, use SplitNVarchar.

    Repo updated with the new function and new test bed code.

    https://gist.github.com/mburbea/e72151af503873d82d6f

    It would be really nice if SSDT supported mapping varchar to varbinary; that way we could avoid having to do any casts :). Anyway, what else is a Latin1-encoded string but an array of bytes?

  • mburbea (5/27/2014)


    If you look at my GitHub, I included an updated test script which should make it easier to test your functions. [...] This is a pretty good approach, but from testing: [...]

    Thanks. Some of the tips I knew, others I did not, and yet others I could argue against or have an easy-to-test workaround for.

    As for collations, they are passed with the argument, but it is the wrong one (not kidding you, the database default is used regardless of the input).

    When I find a free moment I will do some testing, just not right now (else I would have done it in the first place).

  • CREATE FUNCTION permission denied in database 'tempdb'.

    Not sure why I do not have permission on tempdb, but I could not get started with these much-needed tools because of the error above.

    I have permission on all other tables.

    Any suggestions?

  • timgapinski-841846 (6/10/2014)


    CREATE FUNCTION permission denied in database 'tempdb'.

    Not sure why I do not have permission on tempdb, but I could not get started with these much-needed tools because of the error above.

    I have permission on all other tables.

    Any suggestions?

    Create a new database and work in there.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Paul White (5/15/2014)


    Jeff Moden (5/15/2014)


    I'd not seen that happen in my testing but know that such a thing could happen. Thanks for the heads up and the patch. When I get my head above water, I'll see if I can incorporate it. Thanks! 🙂

    Yes, it was always a calculated risk. It seems the new cardinality estimator in 2014 results in a plan favouring a table spool, which effectively caches a single NEWID call. The semantics of NEWID are a bit of a mess, but that's a whole other discussion.

    Please don't incorporate the hack fix into your fine script; it is undocumented and unsupported, after all. No doubt some revision to the script will be necessary, but please try to avoid trace flag 8690. Cheers.

    I've finally had a bit of time to spend on this. I can't get the symptoms you spoke of to repeat themselves. Not sure why you may be having the problem.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (6/25/2014)


    I've finally had a bit of time to spend on this. I can't get the symptoms you spoke of to repeat themselves. Not sure why you may be having the problem.

    Make sure you're on SQL Server 2014 and using the 120 model cardinality estimator.

  • It seems that Paul is right, which explains why the memory-optimized tally table splitters appeared to beat the CLR functions at first blush (sadly, this isn't actually true). Adding the trace flag removes the caching advantage.

  • I have a CSV file that I bulk loaded into a single-column staging table.

    I need a way to parse proper CSV files that are comma delimited with optional quoted text qualifiers, as the data may have embedded commas.

    i.e. Jason,"900 N. May ST., #5"

    Using Jeff's splitter it would get split into 3 columns: Jason as column 1, 900 N. May ST. as column 2, and #5 as column 3.

    I need it to keep the address together when there are quoted text qualifiers.

    I was able to get the function from the article "Reaping the benefits of the Window functions in T-SQL" to work on my dev machine:

    http://www.sqlservercentral.com/articles/SQL+Server+2012/106903

    But I need to deploy this to a server running 2008, and that function uses syntax that is only valid in 2012.

    I would appreciate any help.
