String tokenizing / splitting

    Interesting point about the varying times with smaller data sets. I'll do some testing and see what I can find.

    /* Anything is possible but is it worth it? */

  • Hey guys,

    take a look at this --> udf_ListToTable

    maybe you could run it within your performance tests,

    Norbert

    PS: For the trip back the other way, take this --> udf_TableToList

  • That function looks pretty similar to the one from the original article, which I called "fn_Split". It does fine, too, and it will work on older versions of SQL Server.

    For joining, there are a couple of basic ways to do it. The tough part can be when you want to pass the table name involved as a variable: you either have to do some SQL-Server-2008-only stuff using a newly declared type, or you have to use string concatenation with EXEC, so be careful with that one regarding SQL injection... and some of the XML functionality might let you find a back door, too.
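
    To illustrate the EXEC concern, here is a sketch of one way to harden the dynamic-SQL route (the table name, the "token" column, and the dbo.fn_Split call returning a single "value" column are all just stand-ins here): validate the name against sys.tables and wrap it in QUOTENAME() before concatenating.

    DECLARE @TableName sysname = N'MyTokenTable';  -- hypothetical target table
    DECLARE @sql NVARCHAR(MAX);

    -- Refuse anything that is not a real table, then quote the name so the
    -- concatenated statement cannot be hijacked through user input.
    IF EXISTS (SELECT 1 FROM sys.tables WHERE name = @TableName)
    BEGIN
        SET @sql = N'INSERT INTO dbo.' + QUOTENAME(@TableName)
                 + N' (token) SELECT value FROM dbo.fn_Split(@list, @delim);';
        EXEC sp_executesql @sql,
             N'@list VARCHAR(MAX), @delim CHAR(1)',
             @list = 'abc,123', @delim = ',';
    END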

    The basic approach for joining can use either a CTE (as I wrote in my very first blog post), a COALESCE(), or perhaps a CURSOR (which will be a little slow). Here is a quick and dirty example using a table variable, a 2008-only insert statement to fill that test table variable, and then a COALESCE() to join all the rows in that table:

    -- Build a test table variable (the multi-row VALUES insert is 2008-only)...
    DECLARE @table TABLE (token VARCHAR(MAX));
    INSERT INTO @table VALUES ('xyz'), ('pdq'), ('abc'), ('123');

    -- ...then fold every row into one comma-delimited string.
    DECLARE @joined VARCHAR(MAX);
    SELECT @joined = COALESCE(@joined + ',', '') + token FROM @table;
    SELECT @joined;
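
    As a side note beyond the post above: on SQL Server 2005 and later, another common way to do the join is FOR XML PATH(''), which sidesteps the row-ordering caveats of the variable-assignment trick. A minimal sketch against the same @table variable:

    -- Concatenate every token with a leading comma, then STUFF() strips
    -- the first one off.
    SELECT STUFF((SELECT ',' + token
                  FROM @table
                  FOR XML PATH('')), 1, 1, '') AS joined;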

    I ran the three types again, with "SET STATISTICS TIME ON" and...

    ...8788 tokens:

    XML: CPU time = 203722 ms
    Tally: CPU time = 47 ms
    Original: CPU time = 686 ms
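
    For anyone following along, the tally-style splitter being timed here is broadly of this shape (a sketch only, assuming a dbo.Tally table of sequential integers starting at 1; the function actually tested may differ in its details):

    DECLARE @s VARCHAR(8000) = 'abc,def,ghi', @d CHAR(1) = ',';

    -- Every position N that a delimiter immediately precedes marks the start
    -- of a token; CHARINDEX against @s + @d finds where that token ends.
    SELECT SUBSTRING(@s, t.N, CHARINDEX(@d, @s + @d, t.N) - t.N) AS token
    FROM dbo.Tally t
    WHERE t.N <= LEN(@s)
      AND SUBSTRING(@d + @s, t.N, 1) = @d;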

    Surely other factors like CPU, memory, and disk configuration could affect these timings, which is why, in my tests, XML was the better fit. There probably isn't a one-size-fits-all answer.

    Cheap toilet paper is a false economy, beware!

  • Craig Sunderland (3/7/2011)


    Surely other factors like CPU, memory, and disk configuration could affect these timings, which is why, in my tests, XML was the better fit. There probably isn't a one-size-fits-all answer.

    Yes, it would skew the tests from one machine to another, but he's been testing on the same machine and instance. If the results vary widely, that shows the approach doesn't scale well. However, each of the three ideas tested has its own disadvantages: one requires CLR procs to be enabled, another needs a tally table, and the third only works well with smaller delimited lists. If you plan to never have a list of hundreds or thousands, XML will work without any other object. But if you need that flexibility, it's not a big performance hit to use the tally function for smaller sets too. Testing, and re-visiting those tests from time to time, will let you know whether you're still using the best function for your current hardware and data distribution.
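
    Since the XML approach keeps coming up in this thread without being shown, its usual shape is below (a sketch of the common technique, not anyone's exact test code; note that it breaks if the input contains XML-special characters such as & or <, which would need escaping first):

    DECLARE @s VARCHAR(MAX) = 'abc,def,ghi', @d CHAR(1) = ',';

    -- Turn 'abc,def,ghi' into '<t>abc</t><t>def</t><t>ghi</t>' and shred it.
    DECLARE @x XML = CAST('<t>' + REPLACE(@s, @d, '</t><t>') + '</t>' AS XML);

    SELECT n.value('.', 'VARCHAR(MAX)') AS token
    FROM @x.nodes('/t') AS x(n);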

    /* Anything is possible but is it worth it? */

    My theory on the XML problem is that it's actually a CLR call-out under the covers. Tally seems to work well; it shows an up-front execution cost the others don't have, but after that it seems to perform better.

    I'm going to tweak each idea a bit and put them through more concurrent scenarios to see which ones fare better at scale under parallel use.

  • I had a little more fun... again using 17,000+ tokens.

    I put the CTE version inside the original function: it carves the input into 200-character chunks using (more or less) the original method, assuming a single-character delimiter, then uses the CTE to further split each chunk into tokens. It still has a few bugs, but it did improve the overall results (hopefully not because of the bugs):

    Original + CTE Hybrid: CPU time = 468 ms, elapsed time = 473 ms.
    Original: CPU time = 686 ms, elapsed time = 734 ms.
    Tally: CPU time = 172 ms, elapsed time = 157 ms.

    CREATE FUNCTION [dbo].[fn_Split_Hybrid]
    (
        @InputString VARCHAR(MAX),
        @Delimiter   CHAR(1)
    )
    RETURNS @Values TABLE (
        VALUE VARCHAR(MAX)
        --,IPS VARCHAR(MAX)   -- debugging columns, left commented out
        --,DCI BIGINT
    )
    AS
    BEGIN
        -- Carve the input into 200-character chunks, then let the recursive
        -- CTE below split each chunk on the delimiter.
        DECLARE @ChunkLen TINYINT = 200;
        DECLARE @RemainingLen BIGINT = @ChunkLen;
        DECLARE @newval VARCHAR(200);

        -- Known bugs, as noted above: an input shorter than one chunk makes
        -- SUBSTRING() fail on a negative length; a token straddling a chunk
        -- boundary is split in two; a chunk with no delimiter makes LEFT()
        -- fail in the CTE anchor; and very many tokens in one chunk can hit
        -- the default MAXRECURSION limit of 100.
        WHILE (@RemainingLen >= @ChunkLen)
        BEGIN
            -- Peel the next 200-character chunk off the front of the input.
            SET @newval = LEFT(@InputString, @ChunkLen);
            SET @InputString = SUBSTRING(@InputString, @ChunkLen + 1, LEN(@InputString) - @ChunkLen);
            SET @RemainingLen = LEN(@InputString);

            -- Each recursion level slices one token off the front of the
            -- remainder; recursion ends once token = remainder (last token).
            WITH split (i, token, remainder) AS
            (
                SELECT 1,
                       LEFT(@newval, CHARINDEX(@Delimiter, @newval) - 1),
                       LTRIM(RIGHT(@newval, LEN(@newval) - CHARINDEX(@Delimiter, @newval)))
                UNION ALL
                SELECT i + 1,
                       CASE WHEN CHARINDEX(@Delimiter, remainder) > 0
                            THEN LEFT(remainder, CHARINDEX(@Delimiter, remainder) - 1)
                            ELSE remainder
                       END,
                       LTRIM(RIGHT(remainder, LEN(remainder) - CHARINDEX(@Delimiter, remainder)))
                FROM split
                WHERE CHARINDEX(@Delimiter, remainder) >= 0
                  AND token != remainder
            )
            INSERT INTO @Values
            SELECT token FROM split;
        END;

        -- Whatever is left after the last full chunk goes through the same CTE.
        SET @newval = @InputString;

        WITH split (i, token, remainder) AS
        (
            SELECT 1,
                   LEFT(@newval, CHARINDEX(@Delimiter, @newval) - 1),
                   LTRIM(RIGHT(@newval, LEN(@newval) - CHARINDEX(@Delimiter, @newval)))
            UNION ALL
            SELECT i + 1,
                   CASE WHEN CHARINDEX(@Delimiter, remainder) > 0
                        THEN LEFT(remainder, CHARINDEX(@Delimiter, remainder) - 1)
                        ELSE remainder
                   END,
                   LTRIM(RIGHT(remainder, LEN(remainder) - CHARINDEX(@Delimiter, remainder)))
            FROM split
            WHERE CHARINDEX(@Delimiter, remainder) >= 0
              AND token != remainder
        )
        INSERT INTO @Values
        SELECT token FROM split;

        RETURN;
    END
    GO

    As promised, I did some testing (sorry it took this long). This was run from my laptop in a Win2K3 Server VM. The results showed a pretty consistent run time of ~40 ms. Running the script in a loop of 20 iterations, the max was 50 ms, the min 30 ms, the average 40.45 ms, and the median 41 ms. I will note that, yes, there is a penalty the first time the tally table is loaded, which I saw the first time I ran it in tempdb. However, if it's used often it will likely stay in memory, which it does on our system.

    But to be fair, I did a DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS between the tests below, and I did see the occasional doubling of time that you saw, but it wasn't consistent.

    Other changes made to your test script:

    1) The udf takes nvarchar, so I changed the chars to nchar to remove the implicit conversion

    2) Used datetime2

    3) DATEDIFF for time difference calculation

    4) Added GO 20 to repeat the script 20 times

    declare @i int = 26, @x nvarchar(max) = N'', @d nchar(1) = N' ', @j int;
    declare @t table (id tinyint primary key, tokens int, which nvarchar(32), start datetime2, finish datetime2);

    --Note: this is going to take a while, so if you want to run this more than once, store this data somewhere...
    --Build a test string of 676 batches of 26 single-letter tokens (17,576 tokens in all).
    set @j = @i * @i;
    while @j > 0 begin
        while @i > 0 begin
            set @x = @x + @d + NCHAR(91 - @i);
            set @i = @i - 1;
        end
        set @j = @j - 1;
        set @i = 26;
    end

    declare @C int;
    --(remnant of a first timing block; its insert/select did not survive the forum posting)
    update @t set tokens = @C, finish = sysdatetime() where id = 1;

    insert into @t (id, which, start) values (2, 'udf_StrList2Table', sysdatetime());
    select @C = COUNT(*) from dbo.[udf_StrList2Table](@x, @d);
    update @t set tokens = @C, finish = sysdatetime() where id = 2;

    select *, DATEDIFF(millisecond, start, finish) as runtime from @t;
    GO 20

    /* Anything is possible but is it worth it? */

    Jeff Moden posted a very nice article about CSV parsing. His code was written for a maximum input string of 8K, so if you need more, stay tuned for an upcoming article from him. I tested his code against mine and it completely blows mine away. Not to mention that his CTE tally table did much better than relying on a physical Tally table.

    Tally Oh!

    /* Anything is possible but is it worth it? */

  • Gatekeeper (5/2/2011)


    Not to mention his CTE tally table...

    Thank you for the wonderful kudo, but I absolutely cannot take the credit for the cteTally. The original concept was first published by Itzik Ben-Gan, and he used a "base 2" generator. The "base 10" generator that I included in my article was the result of the efforts of a lot of good people here on SSC (Lynn Pettis started the ball rolling in this area), where we tested many different "bases". I chose to stick with the "base 10" generator because, as with many different based generators, it appeared to have a minor speed advantage over the "base 2" generator, took fewer cross joins to ramp up to the larger numbers, and, as a result, took a bit less code.

    Still, I have to give the credit to Itzik for the wonderful and very useful idea especially since it generates virtually no reads. The use of a physical Tally Table will still beat it by a bit in some applications but you simply have to love the "no reads" aspect of Itzik's fine idea.
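
    For reference, a "base 10" cteTally along the lines described above looks something like this (a sketch, not the exact code from the article):

    WITH E1(N) AS (                                    -- 10 rows
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1
    ),
    E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),     -- 100 rows
    E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),     -- 10,000 rows
    cteTally(N) AS (
        SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    )
    SELECT TOP (1000) N
    FROM cteTally;  -- no table access at all, hence "virtually no reads"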

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/25/2012)


    Still, I have to give the credit to Itzik for the wonderful and very useful idea especially since it generates virtually no reads. The use of a physical Tally Table will still beat it by a bit in some applications but you simply have to love the "no reads" aspect of Itzik's fine idea.

    Yes, the "no reads" aspect is very hard to pass up when thinking about tally tables. Thanks to Itzik.

    /* Anything is possible but is it worth it? */
