Performance issue with tally solution

  • Phil Factor (4/25/2009)


Most data feeds are a bit more regular, I'll admit, but then it is great for testing text parsing routines. I've had to parse larger strings than this before now (e.g. a daily feed from a telecommunications switch), though I've usually used BCP to do the line-splitting for me. I've been more interested in doing this sort of thing with website files for fast indexing recently.

    Phil,

    The graphs are interesting, but run somewhat contrary to other test results in this thread. I would once again invite you to clarify the exact code for the Quirky and While methods, together with the core code for the test rig. It is frustrating not to be able to validate your findings for myself.

As it stands, I am forming the opinion that Flo's CLR-with-chars method or the basic CLR TVF is the best general solution, since one or the other performed best or second-best in all of the tests he ran, including Moby Dick.

    A quick post from you containing the exact code for Quirky and While would be great.

    Paul

    edit: P.S. I had to import 3TB of web log files last week. I have always found Microsoft LogParser to be a good solution - no point in re-inventing the wheel, eh? 😉

  • Paul White (4/25/2009)


    By the way, reading the whole data column (all rows at once) inside the CLR TVF works very well. Splitting that data also works well. However, writing the results to the database down the SqlPipe is horrible.

    *affirmative*

I really don't know what they do when they try to get the data back to the server.

    It's a shame, because the plan was to read the CSV data into a streaming SqlDataReader, writing the results back down the context connection pipe through the streaming CLR TVF interface.

    The problem is that one is only allowed to do data operations in the InitMethod part of the CLR function. If you try to create a SqlDataReader in Init and then read it on the streaming side (while being called through IEnumerator) the SqlDataReader is closed and all attempts to access data fail. Annoying.

    It would have been great because one could just write:

SELECT * FROM dbo.My_Streaming_TVF('dbo.DataTable', 'KeyColumnName', 'DataColumnName')

and you would get all the results at once, with just one call through reflection (or whatever it uses to bind to the IEnumerator), and both reading and writing would have been streaming, and therefore very scalable. Oh well, never mind.

I can also completely confirm this! Since it is not possible to use input streaming and output streaming at the same time, you always have to ensure that the amount of data is not too much to be handled in one step. It would be great if this became possible in a future version. SQL Server 2008 also doesn't support...
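Until then, the workaround is to keep each call small enough, for example by feeding the splitter in key-range batches from T-SQL. A minimal sketch of that idea (the table, column, and function names here are hypothetical placeholders, not code from this thread):

DECLARE @BatchStart INT, @BatchSize INT, @MaxID INT
SELECT @BatchStart = 1, @BatchSize = 1000
SELECT @MaxID = MAX(Row_ID) FROM dbo.SourceRows

WHILE @BatchStart <= @MaxID
BEGIN
    --split only this slice of rows, so no single call has to hold too much data at once
    INSERT INTO dbo.SplitResults (line)
    SELECT WS.TheLine
    FROM dbo.SourceRows AS S
    CROSS APPLY dbo.SomeSplitter(S.TheText, ',') AS WS
    WHERE S.Row_ID BETWEEN @BatchStart AND @BatchStart + @BatchSize - 1

    SELECT @BatchStart = @BatchStart + @BatchSize
END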

    Greets

    Flo

I've already posted the Quirky and While solutions, but I put the harness in just to make the post more interesting to others.

    [font="Courier New"]ALTER FUNCTION [dbo].[splitStringToTable](@String NVARCHAR(MAX), @Delimiter VARCHAR(255))

    /*

    This splits a string based on a delimiting substring. It returns a table of the resulting strings, and the locations within the original string.

    This technique uses the 'Quirky Update' on a table-update. A table is created with the correct number of delimiters within the string, and is then updated to contain the locations of all the delimiters, and the actual text between the delimiters.

    I suspect that there is an improvement to be made somewhere as it should be as fast as the WHILE loop.

    */

    RETURNS @strings TABLE

    (

    line_no INT IDENTITY(1,1),

    theIndex INT DEFAULT 0,

    previousIndex INT DEFAULT 0,

    TheLine VARCHAR(MAX) DEFAULT '')

    AS

    -- body of the function

    BEGIN

    DECLARE @TheIndex INT, @Previousindex INT,@LenDelimiter INT

    SELECT @TheIndex=-1, @LenDelimiter=LEN(@Delimiter)

    --create the string table unfilled but the right length

    INSERT INTO  @strings(TheIndex)

       SELECT 0 FROM numbers

           WHERE number <=(LEN(@String)

               -LEN(REPLACE(@String,@Delimiter,'')))/LEN(@Delimiter)+1

    --and populate the table

    UPDATE  @strings

       SET @PreviousIndex=PreviousIndex=@TheIndex,

       @TheIndex=theIndex=CASE WHEN @PreviousIndex<>0

           THEN CHARINDEX(@Delimiter,@String,@PreviousIndex+1)ELSE 0 END,

       TheLine=CASE WHEN @TheIndex>0 THEN

           SUBSTRING(@String,@PreviousIndex+@LenDelimiter,

                   @TheIndex-@previousindex-@LenDelimiter)

       WHEN @PreviousIndex>0 THEN

           SUBSTRING(@String,@PreviousIndex+@LenDelimiter,

                   LEN(@String)-@previousindex)    

       ELSE '' END    

       RETURN

    END

    [/font]
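A quick sanity check, if you want to try it (not part of the timing runs; any small delimited string will do):

SELECT line_no, theIndex, previousIndex, TheLine
FROM dbo.splitStringToTable('one,two,three', ',')
--expected: three rows with TheLine = 'one', 'two', 'three'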

...and the While loop...

    [font="Courier New"]GO

    /*

    This splits a string based on a delimiting substring. It returns a table of the resulting strings, and the locations within the original string.

    This technique uses the 'While Loop'.

    */

    ALTER FUNCTION [dbo].[WhileSplit](@String NVARCHAR(MAX), @Delimiter VARCHAR(255))

    RETURNS @strings TABLE

    (

    line_no INT IDENTITY(1,1),

    theIndex INT DEFAULT 0,

    previousIndex INT DEFAULT 0,

    TheLine VARCHAR(MAX) DEFAULT '')

    AS

    -- body of the function

    BEGIN

       DECLARE @pos INT, @next INT, @LenDelimiter INT

       SELECT @pos = 1, @lenDelimiter=LEN(@Delimiter),

             @next = CHARINDEX(@Delimiter, @String, @pos)

       WHILE (@next > 0)

       BEGIN

       INSERT INTO @Strings (PreviousIndex,TheIndex,TheLine)

               SELECT @pos, @next, SUBSTRING(@String, @pos, @next - @pos)

        SELECT @pos = @next + @lenDelimiter,

           @next = CHARINDEX(@Delimiter, @String, @pos)

       END

       INSERT INTO @Strings (PreviousIndex,TheIndex,TheLine)

           SELECT @pos, @next, SUBSTRING(@String, @pos, LEN(@STRING)-@POS+1)

       RETURN

    END

    [/font]
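And a quick cross-check that the two functions agree on a trivial input (a rough test only, since EXCEPT ignores duplicates):

SELECT TheLine FROM dbo.splitStringToTable('one,two,three', ',')
EXCEPT
SELECT TheLine FROM dbo.WhileSplit('one,two,three', ',')
--an empty result here (and with the two functions swapped) means they returned the same lines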

...and the current test harness (this is something I keep fiddling with).

    [font="Courier New"]--create a log table to record timings and results

    DECLARE  @log TABLE

       (

        Log_ID INT IDENTITY(1, 1),

        Event VARCHAR(40),

        param1 INT, param2 INT,

        InsertionDate DATETIME DEFAULT GETDATE()

       )

    --various variables

    DECLARE @LotsOfText VARCHAR(MAX),

           @TestText VARCHAR(MAX),

           @Test VARCHAR(80),

           @ii INT,

           @lentext INT,

           @lenString INT,

           @linesFound INT

    --flag the start of the test run in the log

    INSERT INTO @log (event,param1,param2) SELECT 'Start Test run', NULL, NULL

    --have we a data table? if not then create it

    IF NOT EXISTS (SELECT 1

                   FROM information_schema.tables

                   WHERE table_name='SampleText')

       CREATE TABLE SampleText(SampleText_ID INT IDENTITY(1,1),

                               TheText VARCHAR(MAX))

    --have we data in the table. If not then read it in.

    IF NOT EXISTS (SELECT 1 FROM sampleText)

       BEGIN

       SELECT  @LotsOfText = BulkColumn

       -- 2005/8 only for this code)

       FROM    OPENROWSET(BULK 'c:\workbench\moby-dick.txt',

                   SINGLE_BLOB) AS x

       INSERT INTO SampleText(TheText) SELECT @lotsOfText

       END

    SELECT @LotsOfText=TheText FROM SampleText WHERE SampleText_ID=1

    SELECT @ii=10 --100%, 1=10%. Start with all, then decrease.

    SELECT @LenString=LEN(@LotsOfText)--for calulating chop of data.

    WHILE @ii>0

       BEGIN

       --start by creating the size of string for the test

       SELECT @TestText =

           CASE @ii WHEN 10 THEN @LotsOfText

                   ELSE SUBSTRING( @LotsOfText,1, CAST((@lenString*(@ii/10.00)) AS INT))

           END

       SELECT @LenText=LEN(@TestText)

      

       DECLARE  @bucket TABLE

          (

           line VARCHAR(8000)

          )

       SELECT @LinesFound=COUNT(*), @Test='Tally'

               FROM dbo.SplitTally(@TestText)

       INSERT INTO @log (event,param1,Param2) --report time

           SELECT @Test, @LenText, @LinesFound

       INSERT INTO @bucket (line) SELECT line

               FROM dbo.SplitTally(@TestText)

       INSERT INTO @log (event,param1,Param2) --report time

           SELECT @Test+ ' insert', @LenText, @LinesFound

       DECLARE  @bucket1 TABLE

          (

           line VARCHAR(8000)

          )

       SELECT @LinesFound=COUNT(*), @Test='Regex Split'

      FROM dbo.RegExSplit (dbo.regexescape('

    '),@TestText,1)

       INSERT INTO @log (event,param1,Param2) --report time

           SELECT @Test, @LenText, @LinesFound

       INSERT INTO @bucket1 (line) SELECT MATCH

               FROM dbo.RegExSplit (dbo.regexescape('

    '),@TestText,1)

       INSERT INTO @log (event,param1,Param2) --report time

           SELECT @Test+ ' insert', @LenText, @LinesFound

       DECLARE  @bucket2 TABLE

          (

           line VARCHAR(8000)

          )

       SELECT @LinesFound=COUNT(*) , @Test='Quirky Update'

               FROM dbo.splitStringToTable(@TestText,'

    ')

       INSERT INTO @log (event,param1,Param2) --report time

           SELECT @Test, @LenText, @LinesFound

       INSERT INTO @bucket2 (line) SELECT Theline

               FROM dbo.splitStringToTable(@TestText,'

    ')

       INSERT INTO @log (event,param1,Param2) --report time

           SELECT @Test+ ' insert', @LenText, @LinesFound

       DECLARE  @bucket3 TABLE

          (

           line VARCHAR(8000)

          )

       SELECT @LinesFound=COUNT(*) , @Test='CLR Split'

               FROM dbo.DanielJ(@TestText)

       INSERT INTO @log (event,param1,Param2) --report time

           SELECT @Test, @LenText, @LinesFound

       INSERT INTO @bucket3 (line) SELECT match

               FROM dbo.DanielJ(@TestText)

       INSERT INTO @log (event,param1,Param2) --report time

           SELECT @Test+ ' insert', @LenText, @LinesFound

       DECLARE  @bucket4 TABLE

          (

           line VARCHAR(8000)

          )

       SELECT @LinesFound=COUNT(*) , @Test='WHILE Split'

               FROM WhileSplit(@TestText,'

    ')

       INSERT INTO @log (event,param1,Param2) --report time

           SELECT @Test, @LenText, @LinesFound

       INSERT INTO @bucket4 (line) SELECT theLine

               FROM WhileSplit(@TestText,'

    ')  INSERT INTO @log (event,param1,Param2) --report time

           SELECT @Test+ ' insert', @LenText, @LinesFound

       SELECT @LinesFound=COUNT(*) , @Test='Double-barrel tally'

               FROM DoubleBarreledTally(@TestText,'

    ')

       INSERT INTO @log (event,param1,Param2) --report time

           SELECT @Test, @LenText, @LinesFound

       INSERT INTO @bucket4 (line) SELECT theLine

               FROM DoubleBarreledTally(@TestText,'

    ')

       innsert INTO @log (event,param1,Param2) --report time

           SELECT @Test+ ' insert', @LenText, @LinesFound

       SELECT @ii=@ii-1

    END

    --Now see how long it took...

    SELECT   event,

           [length of string] = param1,

          [No. Lines] =param2,

           [Time (Ms)] = DATEDIFF(ms,  

                               (SELECT [InsertionDate]

                                   FROM @log f

                                   WHERE f . Log_ID = g . Log_ID - 1

                               ),

                               [InsertionDate]

                           )

    FROM     @log g

    WHERE    Log_ID >1

    GO

    [/font]

    Best wishes,
    Phil Factor

  • Hi Phil

These results would be great for cursors! As Paul already wrote, could you point me to the "Quirky Update" and "While Loop"? Then I can include them in the other tests.

    Thanks a lot!

    Flo

  • Paul White (4/25/2009)


    - no point in re-inventing the wheel, eh? 😉

One of my boss's hobbies. We have our own string, DateTime, List, DB-Connection, even our own File... :hehe:

  • Phil Factor (4/26/2009)


    I've already posted the Quirky and While solutions, but I put the harness in just to make the post more interesting to others.

Thanks! I will include it in my other tests when I'm back!

    ... now it's time for barbecue 🙂

    Greets

    Flo

  • A quick post before bed concerning the TVFs posted by Phil.

    Unless I am missing something obvious, this solution is heavily dependent on the TVF only being called once.

This is achieved in the script by reading the whole of Moby Dick into an NVARCHAR(MAX) variable.

Changing its data types to VARCHAR(8000) on the way in and VARCHAR(100) on the way out, and running it against the first 200 rows of Jeff's test table:

    ALTER FUNCTION [dbo].[WhileSplit](@String VARCHAR(8000), @Delimiter VARCHAR(255))

    *snip*

    TheLine VARCHAR(100) DEFAULT '')

    *snip*

    GO

SELECT WS.line_no, WS.theLine
INTO dbo.Result
FROM (SELECT TOP 200 * FROM dbo.JBMTest) AS J
CROSS APPLY dbo.WhileSplit(J.CSV, ',') AS WS

    Table 'Worktable'. Scan count 1, logical reads 363973, physical reads 566, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#0BC6C43E'. Scan count 200, logical reads 800, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'JBMTest'. Scan count 1, logical reads 203, physical reads 152, read-ahead reads 479, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (160200 row(s) affected)

    SQL Server Execution Times:

    CPU time = 141484 ms, elapsed time = 189769 ms.

    I ran it on the first 200 rows only because of time constraints. As it is, it took over three minutes...

The actual and estimated execution plans are attached. The actual plan doesn't show the inner execution of the TVF, so that's why the estimated one is there.

    Cheers,

    Paul

  • Florian Reischl (4/26/2009)


    Paul White (4/25/2009)


    - no point in re-inventing the wheel, eh? 😉

One of my boss's hobbies. We have our own string, DateTime, List, DB-Connection, even our own File... :hehe:

    I have no words to adequately convey my feelings on that :crazy:

    :sigh:

    You have my sympathy!

    Paul

  • Hey Paul, I followed your link about the new flag. My son has lived in Dunedin for years now, but never mentioned this to me. As an American, I can cast no vote, but personally I favor the Silver Fern. 🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Hi Phil

I just ran your exact script on my system. I only removed the Tally version because it is definitely the wrong choice for this requirement.

The chart is attached. I hope it shows up here...

    I don't know why, but there seems to be a huge difference in the execution between our systems.

    Edit: Please forgive the lousy formatting. I'm no Excel guru...

    Greets

    Flo

Wow... Just looked back at some of the graphs... I've gotta find out which While Loop and Tally table methods Phil used to come up with his graphs, because those numbers just seem really odd to me. Also, if I recall correctly, when Phil and I did some testing on Moby Dick a while back, the while loop won on his machine while the Tally table won on mine.

Heh... The "split" function is certainly turning out to be a huge "it depends". Depends on the machine, depends on delimiter density, depends on delimiter randomness between rows, etc, etc, etc. Flo, I don't envy you the task of writing an article about all of this.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Bob Hovious (4/26/2009)


    Hey Paul, I followed your link about the new flag. My son has lived in Dunedin for years now, but never mentioned this to me. As an American, I can cast no vote, but personally I favor the Silver Fern. 🙂

    Yeah, me too - I had it as my avatar for a while, but figured I should spread the love a little.

    The Silver Fern would make one cool flag - the only slight downside would be the licensing costs to the All Blacks...

    I think the 'new flag' guys welcome support from whatever quarter, so thanks for that.

    @Flo

Regex is pretty awesome for long strings, eh?

  • [font="Verdana"]I like the southern cross, and I like the silver fern and I like the kiwi. They're all our national symbols. I don't see why the flag need change (unless we become a republic, which at the moment doesn't seem to be a serious thing.)

    Now, back to the excellent code!

I did have one thought, although I don't see it as practical in this case. When I was at university we were doing performance analysis on sorting algorithms (which I believe is a classic exercise). We noticed that for small input, even a bubble sort would beat a quick sort, but a bubble sort soon fell behind as the input size increased. However, a quick sort works by chunking the input, and as the chunk sizes got smaller, you could swap sorting algorithms to get the best performance.

    Would it be possible to use a hybrid solution? I don't think so, because it appears that the result can't be "chunked" (or partitioned) in any way. It also seems that the two CLR solutions are still performing well.

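Even if the result can't be chunked, the swap-algorithms-by-size part of the analogy could still apply: a wrapper could dispatch to whichever splitter suits the input length. A minimal sketch, using the two functions Phil posted above (the 8000-character cut-over is an illustrative guess, not a measured threshold):

CREATE FUNCTION dbo.HybridSplit (@String NVARCHAR(MAX), @Delimiter VARCHAR(255))
RETURNS @strings TABLE (line_no INT IDENTITY(1,1), TheLine VARCHAR(MAX))
AS
BEGIN
    IF LEN(@String) <= 8000
        --short input: dispatch to one splitter...
        INSERT INTO @strings (TheLine)
            SELECT TheLine FROM dbo.splitStringToTable(@String, @Delimiter)
    ELSE
        --...long input: fall back to the other
        INSERT INTO @strings (TheLine)
            SELECT TheLine FROM dbo.WhileSplit(@String, @Delimiter)
    RETURN
END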

Hmmm... I'm really scratching my head now wondering about the differences between the two test runs, Flo. At least we can agree that the Tally doesn't cut it when the delimiters are widely spaced. On my system, both the regex solutions (yours and mine) produced identically dire performance, so bad that I took them out of the test because they made the test runs take far too long.

    If we can find out what makes the CLR solutions run so fast on your tests, we ought to bottle it and sell it to everyone. There is a lot of work to do.

In answer to Paul's comments about the CROSS APPLY, it doesn't surprise me that it runs like a fly in syrup. I have Quirky and WHILE loop solutions to Flo's original problem, but I got a bit distracted when I got interested in the relative performance of various TVFs on large strings. I'll dust them off.

    Best wishes,
    Phil Factor

  • @Phil, Flo:

    Probably the easiest way would be to compare your respective .sqlplan actual execution plans for the queries in question, together with the usual CPU and IO information...?
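For example, wrapping each test in something like this captures comparable CPU and IO numbers, and SET STATISTICS XML returns the actual execution plan as XML, which SSMS can save as a .sqlplan file:

SET STATISTICS IO ON
SET STATISTICS TIME ON
SET STATISTICS XML ON

SELECT COUNT(*) FROM dbo.WhileSplit('one,two,three', ',') --the query under test

SET STATISTICS XML OFF
SET STATISTICS TIME OFF
SET STATISTICS IO OFF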

    @bruce-2:

Don't get me wrong; I like the existing flag, it's just a bit similar to another I could mention, and doesn't have anything very NZ about it apart from the Southern Cross. A design that incorporates a little more of NZ without throwing away the good things on the current flag would be perfect. I just think the present balance is a little off.

    Paul

P.S. One obvious disadvantage of the CLR solutions, which I forgot to mention before, is that T-SQL is more easily re-used. Copying a procedure or function definition in T-SQL is easier and faster than creating an assembly and a T-SQL stub. This is the main reason I still use a ROW_NUMBER() generated numbers table for quick one-off string splits.
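For the record, that quick one-off pattern looks something like this (a sketch only; the CROSS JOIN of a system view is just a convenient row source for ROW_NUMBER):

DECLARE @String VARCHAR(8000), @Delimiter CHAR(1)
SELECT @String = 'alpha,bravo,charlie', @Delimiter = ','

;WITH Numbers AS
(
    SELECT TOP (LEN(@String))
           n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM master.sys.all_columns AS a
    CROSS JOIN master.sys.all_columns AS b
)
SELECT element = SUBSTRING(@String, n,
           CHARINDEX(@Delimiter, @String + @Delimiter, n) - n)
FROM Numbers
--an element starts at n wherever the preceding character is a delimiter (or at position 1)
WHERE SUBSTRING(@Delimiter + @String, n, 1) = @Delimiter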
