Performance issue with tally solution

  • Bruce W Cassidy (4/13/2009)


    P.S. Phil Factor's blog post makes interesting reading, particularly for this little gem:

    Jeff Moden


    P.S. Heh... Always say "Never" to a RBAR While loop... if the While loop beats the set based method, there's something else wrong. 😉

    :w00t:

    [/font]

    Heh... and the stone throwing from the band wagon begins. 😉 Don't forget, I was the one that first said that Flo's beat the Tally table, so be nice. :hehe:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/13/2009)


    Heh... and the stone throwing from the band wagon begins. 😉 Don't forget, I was the one that first said that Flo's beat the Tally table, so be nice. :hehe:

    [font="Verdana"]Whoa! No stones being thrown here! In fact, quite the opposite. I think it's brilliant that you're chipping in and helping disprove your own theory (I should add "in this case"), and that's a clear indication of your professionalism. 😀

    Besides, the real answer in this case is CLR, as it's still far faster than the while loop (or cursor). So in a way, your earlier statement is still true.

    [/font]

  • Bruce W Cassidy (4/13/2009)


    Jeff Moden (4/13/2009)


    Heh... and the stone throwing from the band wagon begins. 😉 Don't forget, I was the one that first said that Flo's beat the Tally table, so be nice. :hehe:

    [font="Verdana"]Whoa! No stones being thrown here! In fact, quite the opposite. I think it's brilliant that you're chipping in and helping disprove your own theory (I should add "in this case"), and that's a clear indication of your professionalism. 😀

    Besides, the real answer in this case is CLR, as it's still far faster than the while loop (or cursor). So in a way, your earlier statement is still true.

    [/font]

    Thanks, Bruce... and sorry I took it the wrong way. Guess I'm a little sensitive to see my favorite baby get the tar beat out of it by what used to be considered RBAR, not once, but twice. If you include the fact that a While Loop worked better on Phil's old machine on the Moby Dick split, that makes 3 times. BWAA-HAA! Looks like I've gotta change my sig and my avatar, again. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/13/2009)


    Thanks, Bruce... and sorry I took it the wrong way.

    [font="Verdana"]No problem.

    I think it's yet another example of using the right tool for the job. It's just that the right tool in this case is not a tally table. Considering lots of us have been recommending the use of CLR for things like text handling, I'm just thrilled to see the numbers to back that advice up! (Because frankly, I haven't had time to learn C# in depth yet, so I haven't felt my own code was indicative of how well CLR can perform.)[/font]

  • Bruce W Cassidy (4/13/2009)


    Jeff Moden (4/13/2009)


    Thanks, Bruce... and sorry I took it the wrong way.

    [font="Verdana"]No problem.

    I think it's yet another example of using the right tool for the job. It's just that the right tool in this case is not a tally table. Considering lots of us have been recommending the use of CLR for things like text handling, I'm just thrilled to see the numbers to back that advice up! (Because frankly, I haven't had time to learn C# in depth yet, so I haven't felt my own code was indicative of how well CLR can perform.)[/font]

    Yep... I agree... it's the first time I've actually seen someone post numbers on an apples-to-apples test with a CLR on such a thing. Since I don't write C# at all, had to depend on others and, up to now, there were no takers even though I've been asking.

    Heh... like I said... looks like I'm gonna have to learn C#. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Or a faster split UDF?

    --- -> Peso solution

    PRINT 'Start Peso solution'

    SELECT @now = GETDATE()

    --- Split text into lines

    INSERT INTO @result

    SELECT l.data

    FROM @source s

    CROSS APPLY dbo.fnParseList(CHAR(13), s.definition) l

    --- Results

    SELECT @duration = DATEDIFF(MILLISECOND, @now, GETDATE())

    SELECT @count = COUNT(*) FROM @result

    PRINT 'Milliseconds: ' + CONVERT(VARCHAR(10), @duration) + ' | Lines: ' + CONVERT(VARCHAR(10), @count)

    --- <- Peso solution

    http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=76033


    N 56°04'39.16"
    E 12°55'05.25"

  • looks like I'm gonna have to learn C#.

    Watch out that your halo does not slip too far Jeff 😉

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Peso (4/14/2009)


    Or a faster split UDF?

    PRINT 'Start Peso solution'

    ...blah blah blah...

    Awesome. You have saved the day! Well done sir! :ermm:

    I can't wait for the version of your solution that actually does something similar to the requirements.

    The input string is NVARCHAR(MAX), not VARCHAR(8000).

    The delimiter is CR + LF : NCHAR(2), not a CHAR.

    Aside from that, awesome, as I say.

    Thanks also for not bothering to post performance figures for us to compare.

    I would edit your TVF to match the requirements and get the performance figures for myself, but it's late and I can't be bothered, frankly. Maybe tomorrow.

    :laugh:

    Paul

  • No problem Mr White.

    CREATE FUNCTION dbo.fnParseList

    (

    @Text NVARCHAR(MAX)

    )

    RETURNS @Result TABLE (RowID SMALLINT IDENTITY(1, 1) PRIMARY KEY, Data NVARCHAR(MAX))

    AS

    BEGIN

    DECLARE@NextPos INT,

    @LastPos INT,

    @Delimiter NCHAR(2)

    SELECT@Delimiter = NCHAR(13) + NCHAR(10),

    @NextPos = CHARINDEX(@Delimiter, @Text, 1),

    @LastPos = 0

    WHILE @NextPos > 0

    BEGIN

    INSERT@Result

    (

    Data

    )

    SELECTSUBSTRING(@Text, @LastPos + 1, @NextPos - @LastPos - 1)

    SELECT@LastPos = @NextPos,

    @NextPos = CHARINDEX(@Delimiter, @Text, @NextPos + 1)

    END

    IF SCOPE_IDENTITY() > 0

    INSERT@Result

    (

    Data

    )

    SELECTSUBSTRING(@Text, @LastPos + 1, DATALENGTH(@Text) - @LastPos)

    RETURN

    END


    N 56°04'39.16"
    E 12°55'05.25"

  • We really haven't got to the bottom of what was causing the original problem.

    As I'm a bit stupid, and read it too quickly, I hadn't realized that the original code was using the stored procedure data in master for its data. (Doh!) When I finally realized that we had the test data, I ran some tests with my own hot-shot generic split-table-function routine. It took almost exactly the same length of time as Florian's usp_print_lines procedure. Peso's was faster by a third. When I then looked at the test harness, I realised that it just isn't comparing like-with- like and we're jumping to conclusions about what is causing the problem with the tally solution. Why not re-write the test harness so that both strategies use the CROSS APPLY with a table-function, for example, or if both strategies use the outer WHILE loop and the INSERT....EXECUTE.

    Until we can control for all other things that effect the performance we are merely guessing.

    Until you can prove me wrong, I shall claim that, with this test-data, a good sensible set-based split routine out-performs the RBAR WHILE-loop by almost ten-fold, but then I may have made a mistake somewhere!

    Best wishes,
    Phil Factor

  • Check out the code in this post.

    There I have one method being used. If you read the following post, you will also notice I did a second run making sure that both processed the same data (although I didn't post the modified code).

  • David Burrows (4/14/2009)


    looks like I'm gonna have to learn C#.

    Watch out that your halo does not slip too far Jeff 😉

    Heh... Not to worry. I'll wear the tin hat over it so it doesn't slip.

    I see lots and lots of good code... I've seen some of it before like Barry's and Peter's. I'm going to have to revisit it all mostly because I've gotta test things for myself. You know the old saying... "Trust... but verify". 😛

    I do wish they'd fix these bloody code windows, though... it's a huge pain to reformat what you copy and paste. It's definitely NOT a WYSIWYG thing.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Phil Factor (4/14/2009)


    I ran some tests with my own hot-shot generic split-table-function routine.

    Phil, could you post that code here so we have all these fine split functions on one wonderful thread? Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • [font="Courier New"]--================================================

    -- Multi-statement Table-valued Function that splits an NVARCHAR(MAX)

    -- into rows depending on a delimiter

    --================================================

    IF OBJECT_ID (N'splitStringToTable') IS NOT NULL

       DROP FUNCTION splitStringToTable

    GO

    CREATE FUNCTION splitStringToTable(@String NVARCHAR(MAX), @Delimiter VARCHAR(255))

    RETURNS @strings TABLE

    (

    line_no INT IDENTITY(1,1),

    theIndex INT DEFAULT 0,

    previousIndex INT DEFAULT 0,

    TheLine VARCHAR(255) DEFAULT '')

    AS

    -- body of the function

    BEGIN

    DECLARE @TheIndex INT, @Previousindex INT,@LenDelimiter INT

    SELECT @TheIndex=-1, @LenDelimiter=LEN(@Delimiter)

    --create the string table unfilled but the right length

    INSERT INTO  @strings(TheIndex) SELECT 0 FROM numbers

    WHERE number <=(LEN(@String)

           -LEN(REPLACE(@String,@Delimiter,'')))/LEN(@Delimiter)+1

    --and populate the table

    UPDATE  @strings

       SET @PreviousIndex=PreviousIndex=@TheIndex,

       @TheIndex=theIndex=CASE WHEN @PreviousIndex<>0

           THEN CHARINDEX(@Delimiter,@String,@PreviousIndex+1)ELSE 0 END,

       TheLine=CASE WHEN @TheIndex>0 THEN

           LEFT(SUBSTRING(@String,@PreviousIndex+@LenDelimiter,

                   @TheIndex-@previousindex-@LenDelimiter),255)

       WHEN @PreviousIndex>0 THEN

           LEFT(SUBSTRING(@String,@PreviousIndex+@LenDelimiter,

                   LEN(@String)-@previousindex),255)  

       ELSE '' END    

       RETURN

    END

    GO

    DECLARE @mehitabelsExtensivePast VARCHAR(MAX)

    SELECT @mehitabelsExtensivePast=

    '

    i have been

    used something fierce in my time but

    i am no bum sport archy

    i am a free spirit archy i

    look on myself as being

    quite a romantic character oh the

    queens i have been and the

    swell feeds i have ate

    a cockroach which you are

    and a poet which you used to be

    archy couldn t understand

    my feelings at having come

    down to this i have

    had bids to elegant feeds where poets

    and cockroaches would

    neither one be mentioned without a

    laugh archy i have had

    adventures but i

    have never been an adventuress'

    SELECT * FROM dbo.splitStringToTable(@mehitabelsExtensivePast,'

    ')

    [/font]

    Best wishes,
    Phil Factor

  • Hi everybody

    Sorry for my late answer but I had a busy day at work...

    Thank you for all your investigations and your time!!!

    I've not been lazy. I did many tests for different cases and added each single of your posted split functions to my test environment. I'm not sure if I covered all business cases, but I did my best...

    I tested three different cases for each function:

    * First the initial tested master.sys.sql_modules. Property, very large test especially some of them.

    * Second have been also larger data but with much more lines and shorter text in a single line.

    * Third test was Lynn's idea. A table with shorter text (NVARCHAR(4000))

    Here my final test results

    SourceType Description Lines Duration AverageTextLength MaxTextLength

    ---------- ----------------------------------------------------- ------- ----------- -------------------- --------------------

    Large CLR tvf (Paul White) solution 28545 580 5076 73908

    Large CLR XML Solution 28545 823 5076 73908

    Large Cursor solution 28145 2000 5076 73908

    Large Peso solution 28545 2143 5076 73908

    Large Phil Factor solution 28577 2703 5076 73908

    Large Tally-function 4 solution (has same bug as the rest) 28145 3200 5076 73908

    Large Tally-function 3 solution (correct line-splitting) 28545 3290 5076 73908

    Large Tally solution 28145 3870 5076 73908

    Many CLR tvf (Paul White) solution 157586 1626 3903 12672

    Many CLR XML Solution 157586 2590 3903 12672

    Many Tally solution 156410 4356 3903 12672

    Many Tally-function 4 solution (has same bug as the rest) 156410 5823 3903 12672

    Many Tally-function 3 solution (correct line-splitting) 157586 6030 3903 12672

    Many Cursor solution 155288 6716 3903 12672

    Many Peso solution 157562 7193 3903 12672

    Many Phil Factor solution 157677 11163 3903 12672

    Short CLR tvf (Paul White) solution 44106 226 2526 3994

    Short Tally solution 43306 310 2526 3994

    Short CLR XML Solution 44106 473 2526 3994

    Short Cursor solution 43306 1080 2526 3994

    Short Peso solution 44106 1083 2526 3994

    Short Phil Factor solution 44197 1123 2526 3994

    Short Tally-function 4 solution (has same bug as the rest) 43306 1750 2526 3994

    Short Tally-function 3 solution (correct line-splitting) 44106 1813 2526 3994

    I'm still wondering why my Tally solution does not return the correct counts of lines but I will investigate.

    Resume

    * The winner of the split contest is Pauls CLR table valued function!

    * The tally table does a very good work on smaler strings! It even beats my XML based scalar CLR function.

    * The XML based CLR function seems also a good approach but creates too much overhead

    * The other solutions are also all very good if we keep in mind that we speak about MBs to be splited!

    I'm afraid that I can't post the complete script because of the formatting of this page. Firefox takes over 350mb RAM and IE7 just blows with some script failures. I attached the complete test script as text file.

    Thanks again to all of you!

    Flo

    Edit: Reformatted the test results

Viewing 15 posts - 46 through 60 (of 522 total)

You must be logged in to reply to this topic. Login to reply