Performance issue with tally solution

  • Paul White (4/28/2009)


    @mister.magoo,

    I think the function you refer to was one I posted originally, and improved by Flo. Not that I'm territorial!

    I do apologise - it's such a long thread and function names keep changing!

    The point about the scalability is not so much about how much memory a single call would allocate, rather: consider many concurrent users executing the function...!

    Now you really are scaring me - people are going to put this kind of unprocessed data into a database and then allow multiple users access to it at the same time?

    Please excuse me while I book myself on a woodwork course :hehe:

    P.S. What about the question of whether passing large chunks of data to a CLR function is really good practice to start with? 😉

    MM

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Now you really are scaring me - people are going to put this kind of unprocessed data into a database and then allow multiple users access to it at the same time?

    The primary use case I see is apps that need to send collections of identifiers to the database for one reason or another. This can be done using a chatty interface (i.e. multiple calls to a stored procedure that only accepts a single identifier) and that tends not to scale at all. So more informed developers (or consultants who come in to clean up the mess :-D) will instead figure out some way to batch the calls, usually via lists. I've seen some extremely active apps using this pattern -- tens of thousands of batch requests per second -- so the method must be able to scale up. The T-SQL methods actually work pretty well in this case, but I've come to believe that a finely-tuned CLR method could do even better.

    --
    Adam Machanic
    whoisactive

  • how to make it ignore consecutive delimiters

    Ignore in which way?

    I split data with consecutive delimiters, ie a column/field with no data

    Far away is close at hand in the images of elsewhere.
    Anon.

  • David Burrows (4/28/2009)


    how to make it ignore consecutive delimiters

    Ignore in which way?

    I split data with consecutive delimiters, ie a column/field with no data

    In most of the use cases I'm interested in, "a,,b,,c" with a delimiter of "," should return 3 rows:

    a

    b

    c

    likewise ",a,b,c" or "a,b,c," or ",,a,b,c,," or any number of other slightly malformed combinations.

    --
    Adam Machanic
    whoisactive

  • In most of the use cases I'm interested in, "a,,b,,c" with a delimiter of "," should return 3 rows:

    Fair enough, another one of those 'It Depends' features 😀

    I suppose the 'trade of' in time is between processing during split vs post split

    Far away is close at hand in the images of elsewhere.
    Anon.

  • OK, here's my latest. A few bug fixes, and it seems to properly ignore consecutive delimiters (you don't want to know how many times I recompiled this fighting off-by-one errors...)

    Enjoy! By the way, if you do performance tests, make sure to compile in Release mode, not Debug mode. I just did a quick comparison and it does seem to make a significant difference.

    using System;

    using System.Collections;

    using System.Data;

    using System.Data.SqlClient;

    using System.Data.SqlTypes;

    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions

    {

    [Microsoft.SqlServer.Server.SqlFunction(

    FillRowMethodName = "FillRow_Multi",

    TableDefinition = "item nvarchar(4000)"

    )

    ]

    public static IEnumerator SplitString_Multi(

    [SqlFacet(MaxSize = -1)]

    SqlChars Input,

    [SqlFacet(MaxSize = 255)]

    SqlChars Delimiter

    )

    {

    return (

    (Input.IsNull || Delimiter.IsNull) ?

    new SplitStringMulti(new char[0], new char[0]) :

    new SplitStringMulti(Input.Value, Delimiter.Value));

    }

    public static void FillRow_Multi(object obj, out SqlString item)

    {

    item = new SqlString((string)obj);

    }

    public class SplitStringMulti : IEnumerator

    {

    public SplitStringMulti(char[] TheString, char[] Delimiter)

    {

    theString = TheString;

    stringLen = TheString.Length;

    delimiter = Delimiter;

    delimiterLen = (byte)(Delimiter.Length);

    isSingleCharDelim = (delimiterLen == 1);

    lastPos = 0;

    nextPos = delimiterLen * -1;

    }

    #region IEnumerator Members

    public object Current

    {

    get

    {

    return new string(theString, lastPos, nextPos - lastPos);

    }

    }

    public bool MoveNext()

    {

    if (nextPos >= stringLen)

    return false;

    else

    {

    lastPos = nextPos + delimiterLen;

    for (int i = lastPos; i < stringLen; i++)

    {

    bool matches = true;

    //Optimize for single-character delimiters

    if (isSingleCharDelim)

    {

    if (theString != delimiter[0])

    matches = false;

    }

    else

    {

    for (byte j = 0; j = stringLen) || (theString != delimiter[j]))

    {

    matches = false;

    break;

    }

    }

    }

    if (matches)

    {

    nextPos = i;

    //Deal with consecutive delimiters

    if ((nextPos - lastPos) > 0)

    return true;

    else

    {

    i += (delimiterLen-1);

    lastPos += delimiterLen;

    }

    }

    }

    lastPos = nextPos + delimiterLen;

    nextPos = stringLen;

    if ((nextPos - lastPos) > 0)

    return true;

    else

    return false;

    }

    }

    public void Reset()

    {

    lastPos = 0;

    nextPos = delimiterLen * -1;

    }

    #endregion

    private int lastPos;

    private int nextPos;

    private readonly char[] theString;

    private readonly char[] delimiter;

    private readonly int stringLen;

    private readonly byte delimiterLen;

    private readonly bool isSingleCharDelim;

    }

    };

    --
    Adam Machanic
    whoisactive

  • Adam Machanic (4/28/2009)


    We're not done yet -- if we're going to create the "best" method it should do as much as possible in terms of functionality we would normally want from string splitting routines. So I'm currently in the process of modifying Flo's modification to my character-based approach, and trying to figure out how to make it ignore consecutive delimiters ... Any other ideas? I often add a DISTINCT to my T-SQL efforts but I don't think that's a possibility in the SQLCLR routine.

    Hi Adam

    Correct. A complete well done split function should be able to handle delimiters within the text to be split. I usually avoid this "practice" by splitting with CHAR(0) which should never appear in any text. Anyway, the requirement stays absolutely valid. The question is only "how should it be handled?". As usual in CSV this will be done by quotation marks. If even the quotation mark appears within the text it will be masked by a second quotation mark. Would this be okay?

    Greets

    Flo

  • Paul White (4/28/2009)


    Ah. A post from Adam. Must be later than I realized. :w00t:

    This is probably feeping creaturism, but it would be great if it could handle more complex splitting tasks.

    For example, strings like:

    ~Parent~child, child, child, child~Another Parent~its child~...

    or

    ~Fish~colour=gold, size=M, memory=0~...

    and yes, this is a real example (fish used to protect the guilty).

    Also, I would still like to see a CLR TVF stream on the input as well as the output. I don't think this is going to fly though - not without an external class library, the unsafe permission set, and maybe (gasp) storing to a static field - a horrible idea, which I am trying hard to forget.

    Time for bed.

    Paul

    edit: P.S. Most reasonable requirements could be handled I guess by allowing a regualr expression to be passed in, but that will break the nice split-on-the-fly thing. Wouldn't it?

    Hi Paul!

    Hierarchical data to be split? Well. This becomes really freaky... It would require either recursive methods or a much more complex RegEx which also becomes quiet slow.

    Give me some hours to think about 😉

    Greets

    Flo

  • Hi mister.magoo

    As first thanks a lot for your invested time and your test results!!

    mister.magoo (4/28/2009)


    Realistically though - how often would you encounter a single chunk of text large enough for that to be an actual problem? (See end of post for more thoughts on this...)

    This splitting requirement is very current at my company. Many of your projects are based on more than one database and those databases are on different servers. For criteria which excess one database the data have to be taken from one database to another. It is much faster to transfer the data as text to the server and just let it be split instead of creating temporary tables.

    Another business case here is my own project. It is a distributed system. The communication works over XML the data will be imported by C# processes. The data have a local Id which is different on every server and a global Id (a VARCHAR) which is unique all over the world. The importer parses the XML file and now has to select 20,000 objects from database to merge them with the new data.

    Greets

    Flo

  • Paul White (4/28/2009)The point about the scalability is not so much about how much memory a single call would allocate, rather: consider many concurrent users executing the function...!

    I'm currently doin' my first load tests, but since now I don't have enough information that it would make sense to share 😉

  • Adam Machanic (4/28/2009)


    OK, here's my latest. A few bug fixes, and it seems to properly ignore consecutive delimiters (you don't want to know how many times I recompiled this fighting off-by-one errors...)

    Enjoy! By the way, if you do performance tests, make sure to compile in Release mode, not Debug mode. I just did a quick comparison and it does seem to make a significant difference.

    using System;

    using System.Collections;

    using System.Data;

    using System.Data.SqlClient;

    using System.Data.SqlTypes;

    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions

    {

    [Microsoft.SqlServer.Server.SqlFunction(

    FillRowMethodName = "FillRow_Multi",

    TableDefinition = "item nvarchar(4000)"

    )

    ]

    public static IEnumerator SplitString_Multi(

    [SqlFacet(MaxSize = -1)]

    SqlChars Input,

    [SqlFacet(MaxSize = 255)]

    SqlChars Delimiter

    )

    {

    return (

    (Input.IsNull || Delimiter.IsNull) ?

    new SplitStringMulti(new char[0], new char[0]) :

    new SplitStringMulti(Input.Value, Delimiter.Value));

    }

    public static void FillRow_Multi(object obj, out SqlString item)

    {

    item = new SqlString((string)obj);

    }

    public class SplitStringMulti : IEnumerator

    {

    public SplitStringMulti(char[] TheString, char[] Delimiter)

    {

    theString = TheString;

    stringLen = TheString.Length;

    delimiter = Delimiter;

    delimiterLen = (byte)(Delimiter.Length);

    isSingleCharDelim = (delimiterLen == 1);

    lastPos = 0;

    nextPos = delimiterLen * -1;

    }

    #region IEnumerator Members

    public object Current

    {

    get

    {

    return new string(theString, lastPos, nextPos - lastPos);

    }

    }

    public bool MoveNext()

    {

    if (nextPos >= stringLen)

    return false;

    else

    {

    lastPos = nextPos + delimiterLen;

    for (int i = lastPos; i < stringLen; i++)

    {

    bool matches = true;

    //Optimize for single-character delimiters

    if (isSingleCharDelim)

    {

    if (theString != delimiter[0])

    matches = false;

    }

    else

    {

    for (byte j = 0; j = stringLen) || (theString != delimiter[j]))

    {

    matches = false;

    break;

    }

    }

    }

    if (matches)

    {

    nextPos = i;

    //Deal with consecutive delimiters

    if ((nextPos - lastPos) > 0)

    return true;

    else

    {

    i += (delimiterLen-1);

    lastPos += delimiterLen;

    }

    }

    }

    lastPos = nextPos + delimiterLen;

    nextPos = stringLen;

    if ((nextPos - lastPos) > 0)

    return true;

    else

    return false;

    }

    }

    public void Reset()

    {

    lastPos = 0;

    nextPos = delimiterLen * -1;

    }

    #endregion

    private int lastPos;

    private int nextPos;

    private readonly char[] theString;

    private readonly char[] delimiter;

    private readonly int stringLen;

    private readonly byte delimiterLen;

    private readonly bool isSingleCharDelim;

    }

    };

    Hi Adam!

    Thanks for your new version :-)! I will include it into my other tests and give you feedback as soon as possible!! I just have to finish the setup of a load-test environment.

    Sure, all test results are done in release build 😉

    I keep you informed!

    Flo

  • [font="Verdana"]You know? I am really looking forward to Flo's final article. :-D[/font]

  • Bruce W Cassidy (4/28/2009)


    [font="Verdana"]You know? I am really looking forward to Flo's final article. :-D[/font]

    Howdy Bruce!

    Hold your horses 😛

    I just don't think I can put all these information into one article...

    Currently I have over 400 test cases for single threading and just setting up the load testing environment.

    Best wishes!

    Flo

  • Florian Reischl (4/28/2009)


    I just don't think I can put all these information into one article...

    [font="Verdana"]Hey Flo,

    A series of articles won't hurt. Then you can do a separate articles around performance of RegEx, best of breed CLR, best of breed T-SQL, etc.

    And I think having one article on dealing with hierarchical string splitting would be a good one to add. 😀

    [/font]

  • A series of articles won't hurt. Then you can do a separate articles around performance of RegEx, best of breed CLR, best of breed T-SQL, etc.

    Yes, I think it will be a series, especially for different requirements. One huge split, different data, multi users...

    And I think having one article on dealing with hierarchical string splitting would be a good one to add. 😀

    :sick:!!

    It's late now...

    CU tomorrow!

    Flo

  • Viewing 15 posts - 301 through 315 (of 522 total)

    You must be logged in to reply to this topic. Login to reply