Performance issue with tally solution

  • Florian Reischl (5/11/2009)


    Have a look at the attached statistics from my tests.

    Flo,

    Which cursor-based split solution were you using in those statistics?

  • UMG Developer (5/11/2009)


    1 Line1|Line2|Line3

    Heh... "Tab or whatever to separate fields"...

    Please be very specific with no "whatevers"... What is the actual format of the lines/comments/other fields and what determines the end of a "full record"?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/11/2009)


    UMG Developer (5/11/2009)


    1 Line1|Line2|Line3

    Heh... "Tab or whatever to separate fields"...

    Please be very specific with no "whatevers"... What is the actual format of the lines/comments/other fields and what determines the end of a "full record"?

    Like I said, it doesn't currently exist as a text file, but I can export it to one fairly easily. (That's why I included the "whatever": any format would be workable.)

    Currently it is in a table as previously described with three fields: Comment_ID (int), Comment_Part_Num (int), Comment_Part (varchar(4000))

    There is a | at the end of each line in the comment, but a given line could be broken across many parts, and a single part can contain many lines.

    I assume I wouldn't need to export the Comment_Part_Num field, just use it to order the export correctly.
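
    For illustration (hypothetical values, just to pin down the format described above), a record split across parts might look like this:

    [font="Courier New"]Comment_ID  Comment_Part_Num  Comment_Part
    ----------  ----------------  ---------------------
             1                 1  'Comment Line 1|Comm'
             1                 2  'ent Line 2|Comment '
             1                 3  'Line 3|'
    -- Concatenated in Comment_Part_Num order and split on '|', this
    -- yields the lines 'Comment Line 1', 'Comment Line 2', 'Comment Line 3'.
    [/font]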

  • Florian Reischl (5/11/2009)


    Hi Paul

    I know this sounds strange, but all my tests showed that the set-based solutions using a Tally table run into problems with huge text and large result items.

    Remember Phil's test to split the book Moby Dick. I included this in my tests to split the text of the book by commas, with the following results:

    CLR: 00:00.867

    Cursor: 00:01.310

    Set-Based: 00:06.260

    The book has about 1.2 million characters and the text specified by UMG has more than 4.3 million characters.
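
    (For context, the set-based pattern under test is presumably the usual Tally-table split; a minimal sketch, where @BookText is a placeholder for the loaded book text and dbo.Tally(N) is the numbers table:)

    [font="Courier New"]DECLARE @Text VARCHAR(MAX)
    SELECT @Text = ',' + @BookText + ','   -- wrap so every item is bounded by delimiters

    SELECT SUBSTRING(@Text, t.N + 1, CHARINDEX(',', @Text, t.N + 1) - t.N - 1) AS Item
      FROM dbo.Tally t
     WHERE t.N < LEN(@Text)
       AND SUBSTRING(@Text, t.N, 1) = ','
    [/font]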

    Greets

    Flo

    Heh... Phil didn't tell you about the original "Whale Tests" we did where the Tally table won on my machine but lost on his. I'm going to have to make the time to do my own testing of all these different methods.... Heh... except for the CLR's. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • UMG Developer (5/11/2009)


    Jeff Moden (5/11/2009)


    UMG Developer (5/11/2009)


    1 Line1|Line2|Line3

    Heh... "Tab or whatever to separate fields"...

    Please be very specific with no "whatevers"... What is the actual format of the lines/comments/other fields and what determines the end of a "full record"?

    Like I said, it doesn't currently exist as a text file, but I can export it to one fairly easily. (That's why I included the "whatever": any format would be workable.)

    Currently it is in a table as previously described with three fields: Comment_ID (int), Comment_Part_Num (int), Comment_Part (varchar(4000))

    There is a | at the end of each line in the comment, but a given line could be broken across many parts, and a single part can contain many lines.

    I assume I wouldn't need to export the Comment_Part_Num field, just use it to order the export correctly.

    Ah... got it. I was confused. Thanks.

    Take a look at the link in my signature. Any chance of you cranking out the first, say, 10 rows of data and attaching it as a file? It would provide a whole lot of answers for folks. Of course, don't do it if the file contains any sensitive or private information.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/11/2009)


    Take a look at the link in my signature. Any chance of you cranking out the first, say, 10 rows of data and attaching it as a file? It would provide a whole lot of answers for folks. Of course, don't do it if the file contains any sensitive or private information.

    I was trying to avoid this, but here is a script to create a table with some sample data, which is essentially what I am starting with:

    [font="Courier New"]--Drop table comments

        --Create a sample comments table

    CREATE TABLE Comments (

                    CommentID INT,

                    CommentPartNum INT,

                    CommentCont CHAR(1),

                    CommentDate DATETIME,

                    CommentPart VARCHAR(4000));

        --Add some sample data

    INSERT INTO Comments

        SELECT

                1, 1, 'N', '03/23/2009', 'Comment Line 1'

        UNION ALL

    SELECT

                1, 2, 'N', '03/26/2009', 'Comment Line 2|Comment Line 3|Comment Line 4'

        UNION ALL

    SELECT

                1, 3, 'Y', '03/26/2009', 'Comment Line 4 continued'

        UNION ALL

    SELECT

                2, 1, 'N', '03/23/2009', 'Comment '

        UNION ALL

    SELECT

                2, 2, 'Y', '03/24/2009', 'Line '

        UNION ALL

    SELECT

                2, 3, 'Y', '03/24/2009', '1|Comment Line 2'

        UNION ALL

    SELECT

                2, 4, 'N', '03/25/2009', 'Comment Line 3|Comment Line 4';

        --Display the sample data

    SELECT *

        FROM Comments;

                [/font]

    (Creating fake data is easier than trying to sanitize it.)

    It wasn't even as painful as I thought it would be...

    Edited because I left out one small piece, and that is a field to signify if this part is a continuation of the previous part...

  • Florian Reischl (5/11/2009)


    Well, this brings up some more "fun"!

    I made some changes to my previous version. Give it a try:

    Flo,

    Thanks for that, I re-did it a little bit to handle multiple comments:

    [font="Courier New"]

    ---==================================

    -- Some source data

    DECLARE @Source TABLE (Id INT NOT NULL IDENTITY, CommentID INT, CommentPart INT, Txt VARCHAR(100))

    INSERT INTO @Source

                 SELECT 1, 1, 'aaa,bb'

       UNION ALL SELECT 1, 2, 'b,ccc,ddd'

       UNION ALL SELECT 1, 3, 'ddd'

       UNION ALL SELECT 1, 4, 'ddd,e'

       UNION ALL SELECT 1, 5, 'eee'

       UNION ALL SELECT 1, 6, ',fff'

       UNION ALL SELECT 2, 1, '1,22,33'

       UNION ALL SELECT 2, 2, '3,4444,55'

       UNION ALL SELECT 2, 3, '5'

       UNION ALL SELECT 2, 4, '55,66'

       UNION ALL SELECT 2, 5, '6666'

       UNION ALL SELECT 2, 6, ',7777777'

    ---==================================

    -- Destination table

    DECLARE @Result TABLE (Id INT NOT NULL IDENTITY, SourceId INT, CommentID INT, Txt VARCHAR(100), Sequence INT)

    ---==================================

    -- Delimiter

    DECLARE @Delimiter VARCHAR(20)

    DECLARE @DelimiterLen INT

    SELECT @Delimiter = ','

    SELECT @DelimiterLen = LEN(@Delimiter)

    ---==================================

    -- Split items without a required leading/trailing delimiter

    --

    -- IMPORATNT:

    -- If the source row was determined correct (with the delimiter) this will return an empty row

    INSERT INTO @Result (

             SourceId,

             CommentId,        

             Txt,

             Sequence

             )

       SELECT

             s.Id,

             s.CommentID,

             l.Item,

             1

          FROM @Source s

             CROSS APPLY

             (

                SELECT

                      SUBSTRING(s.Txt, 1, ISNULL(NULLIF(CHARINDEX(@Delimiter, s.Txt, 1) - 1, -1), LEN(s.Txt))) Item,

                      1 Sorting

                UNION ALL

                SELECT TOP 100 PERCENT

                      SUBSTRING(s.Txt, t.N + @DelimiterLen, ISNULL(NULLIF(CHARINDEX(@Delimiter, s.Txt, t.N + 1) - t.N - 1, -t.N - 1), LEN(s.Txt) - t.N)) Item,

                      2 Sorting

                   FROM Counter t

                   WHERE t.N <= LEN(s.Txt)

                      AND SUBSTRING(s.Txt, t.N, @DelimiterLen) = @Delimiter

                ORDER BY Sorting, t.N

             ) l

    ---==================================

    -- Move single fragments up

    WHILE (1 = 1)

    BEGIN

       ; WITH

       single_fragment (CommentID, Id, Txt) AS

       (

          SELECT CommentID, MIN(Id), MIN(Txt) FROM @Result GROUP BY CommentID, SourceId HAVING COUNT(*) = 1

       )

       UPDATE r SET

             r.Txt = r.Txt + sf.Txt,

             r.Sequence = r.Sequence + 1

          FROM @Result r

             JOIN single_fragment sf ON r.CommentID = sf.CommentID AND r.Id + r.Sequence = sf.Id

       IF (@@ROWCOUNT = 0)

          BREAK

    END

    ---==================================

    -- Move last fragments up

    ; WITH

    last_item (CommentID, Id) AS

    (

       SELECT CommentID, MAX(Id) FROM @Result GROUP BY CommentID, SourceId

    )

    UPDATE r1 SET

          Txt = r1.Txt + r2.Txt,

          Sequence = r1.Sequence + 1

       FROM last_item li

          JOIN @Result r1 ON li.CommentID = r1.CommentID AND li.Id = r1.Id

          JOIN @Result r2 ON r1.CommentID = r2.CommentID AND r1.Id + r1.Sequence = r2.Id

       WHERE

          r1.Txt != ''

    ---==================================

    -- Delete the rows which have been joined up to the previous

    DELETE r1

       FROM @Result r1

          JOIN @Result r2 ON r1.Id > r2.Id AND r1.Id < r2.Id + r2.Sequence

    ---==================================

    -- Delete the empty last rows which resulted by the initial

    -- tally split

    ; WITH

    last_item AS ( SELECT MAX(Id) Id FROM @Result GROUP BY SourceId )

    DELETE r

       FROM last_item li

          JOIN @Result r ON li.Id = r.Id

       WHERE r.Txt = ''

    ---==================================

    -- Result

    SELECT * FROM @Source

    SELECT * FROM @Result

    [/font]

    I probably missed what was necessary on the last two parts of the process to deal with the multiple comments.

    It took about a minute to copy the data into a table with the structure of @Source, but that wouldn't really count, as I doubt it would add anything to the time of bringing the data over the WAN from Oracle during a real refresh. The actual splitting didn't use parallelism, but that may have been on purpose. It's been running for over an hour so far, but I think it is close to done. 😉 (Edit: After two and a half hours I killed it; it was on the "Move last fragments up" step.)

    I'm still wrapping my head around everything it is doing (as I read it: split each part independently with the tally table, then stitch the boundary fragments back together across parts), but it is an interesting approach that I wouldn't have come up with.

  • Once you recognize the "pattern", you don't need all the ancillary conditional data moves and deletes. The "pattern" is that if it's NOT a continuation line, add a leading delimiter to the concatenation by CommentID, ordered by CommentPartNum.
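
    For a concrete picture (worked by hand from the sample data above, not part of the query below), the concatenation for CommentID 2 becomes:

    [font="Courier New"]'|Comment ' + 'Line ' + '1|Comment Line 2' + '|Comment Line 3|Comment Line 4' + '|'
    = '|Comment Line 1|Comment Line 2|Comment Line 3|Comment Line 4|'
    [/font]

    Every line is then bounded by a pair of delimiters, ready for a straight Tally split.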

    [font="Courier New"]--Drop table comments

    --Create a sample comments table

     CREATE TABLE Comments

            (

            CommentID INT NOT NULL,

            CommentPartNum INT NOT NULL,

            CommentCont CHAR(1) NOT NULL,

            CommentDate DATETIME,

            CommentPart VARCHAR(4000)

            )

    --Add some sample data

     INSERT INTO Comments

     SELECT 1, 1, 'N', '03/23/2009', 'Comment Line 1' UNION ALL

     SELECT 1, 2, 'N', '03/26/2009', 'Comment Line 2|Comment Line 3|Comment Line 4' UNION ALL

     SELECT 1, 3, 'Y', '03/26/2009', 'Comment Line 4 continued' UNION ALL

     SELECT 2, 1, 'N', '03/23/2009', 'Comment ' UNION ALL

     SELECT 2, 2, 'Y', '03/24/2009', 'Line ' UNION ALL

     SELECT 2, 3, 'Y', '03/24/2009', '1|Comment Line 2' UNION ALL

     SELECT 2, 4, 'N', '03/25/2009', 'Comment Line 3|Comment Line 4'

     CREATE CLUSTERED INDEX IXC_Comments

         ON Comments (CommentID,CommentPartNum,CommentCont)

    --Display the sample data

     SELECT *

       FROM Comments

    DECLARE @Delimiter CHAR(1)

     SELECT @Delimiter = '|'

    ;WITH cteComments AS

    (

     SELECT t1.CommentID,(SELECT CASE CommentCont

                                 WHEN 'N' THEN @Delimiter

                                 ELSE '' END + CAST(t2.CommentPart AS VARCHAR(MAX))

                            FROM Comments t2

                           WHERE t2.CommentID = t1.CommentID   --- must match GROUP BY below

                           ORDER BY t2.CommentPartNum

                             FOR XML PATH('')) + @Delimiter AS Comment

       FROM Comments t1

      GROUP BY t1.CommentID

    )

     SELECT c.CommentID,

            ca.CommentNum,

            ca.Item

       FROM cteComments c

      CROSS APPLY (SELECT ROW_NUMBER() OVER (ORDER BY t.N) AS CommentNum,

                          SUBSTRING(c.Comment, t.N +1, CHARINDEX(@Delimiter, c.Comment, t.N +1) -t.N -1) AS Item

                     FROM dbo.Tally t

                    WHERE t.N < LEN(c.Comment)

                      AND SUBSTRING(c.Comment, t.N, 1) = @Delimiter) ca

    [/font]

    ==================================================================================================================================

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Nice one Jeff 😎

    Far away is close at hand in the images of elsewhere.
    Anon.

  • UMG Developer (5/11/2009)


    Florian Reischl (5/11/2009)


    Have a look at the attached statistics from my tests.

    Flo,

    Which cursor-based split solution were you using in those statistics?

    Hi UMG

    Here is the cursor I actually used in my tests. I compared it with all the others and it seems to be the fastest.

    [font="Courier New"]ALTER FUNCTION dbo.ufn_SplitString_Cursor_VM8

    (

       @Text VARCHAR(MAX),

       @Delimiter VARCHAR(255)

    )

       RETURNS @ReturnData TABLE (Item VARCHAR(8000))

    AS

    BEGIN

       DECLARE @Pos INT, @Next INT, @DelimiterLen INT

       -- Get only once the len of the delimiter to avoid any recalculation

       SELECT @DelimiterLen = LEN(@Delimiter)

      

       -- Get the first occurence to get the item before the first delimiter

       SELECT @Pos = CHARINDEX(@Delimiter, @Text, 1)

      

       IF (@Pos = 0)

       BEGIN

          -- We didn't find any delimiter so return the complete text as one item

          INSERT INTO @ReturnData

             SELECT @Text

          RETURN

       END

       ELSE

       BEGIN

          -- Insert the first item

          INSERT INTO @ReturnData

             SELECT SUBSTRING(@Text, 1, @Pos - 1)

       END

      

       -- Step over the first delimiter

       SELECT @Pos = @Pos + @DelimiterLen

       -- Start an infinite loop to avoid a second CHARINDEX call for each item

       WHILE (1 = 1)

       BEGIN

          -- Get the next delimiter position from our previous position

          SELECT @Next = CHARINDEX(@delimiter, @Text, @Pos)

          

          -- CHARINDEX returned "0" so we have to break the loop

          IF (@Next = 0) BREAK

          -- Insert the next found item

          INSERT INTO @ReturnData

             SELECT SUBSTRING(@Text, @Pos, @Next - @Pos)

          -- Step of the delimiter

          SELECT @Pos = @Next + @DelimiterLen

       END

       -- Due to the fact that our cursor only jumps from delimter to delimiter

       -- the last item would be lost

       INSERT INTO @ReturnData

          SELECT SUBSTRING(@Text, @Pos, LEN(@Text) - 1)

       RETURN

    END

    [/font]
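
    For reference, a quick way to exercise the splitter once the function exists (use CREATE FUNCTION instead of ALTER on first deployment):

    [font="Courier New"]SELECT Item
      FROM dbo.ufn_SplitString_Cursor_VM8('Line1|Line2|Line3', '|')
    -- returns three rows: 'Line1', 'Line2', 'Line3'
    [/font]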

    Thanks to Steve for the bug fix for the case where the text doesn't contain any delimiter.

    Greets

    Flo

  • UMG Developer (5/11/2009)


    ... (Edit: After two and a half hours I killed it; it was on the "Move last fragments up" step.)

    Well, errmm... You need bigger hardware! 😀

    It seems to depend on the row count. I will investigate later today. Do you know how long the previous steps took?
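
    If it helps, one simple way to time the steps (a minimal sketch; the labels are just examples) is to capture a timestamp between them:

    [font="Courier New"]DECLARE @t DATETIME
    SELECT @t = GETDATE()

    -- ... "Split items" step ...
    PRINT 'Split items: ' + CONVERT(VARCHAR(12), GETDATE() - @t, 114)
    SELECT @t = GETDATE()

    -- ... "Move single fragments up" step ...
    PRINT 'Move single fragments up: ' + CONVERT(VARCHAR(12), GETDATE() - @t, 114)
    [/font]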

    Greets

    Flo

  • Jeff Moden (5/11/2009)


    Once you recognize the "pattern", you don't need all the ancillary conditional data moves and deletes. The "pattern" is that if it's NOT a continuation line, add a leading delimiter to the concatenation by CommentID, ordered by CommentPartNum.

    Jeff,

    Now you've essentially got my roll-it-up-and-then-split-it method, which takes ~2 minutes for the roll-up and ~16 minutes for the split. I do have to use the .value() trick with FOR XML to avoid the entitization. (I know, I should have posted my code.)
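
    (For the record, the trick mentioned is presumably the TYPE directive plus .value(); applied to Jeff's concatenation above it would look something like this sketch, not UMG's actual code:)

    [font="Courier New"]SELECT t1.CommentID,
           (SELECT CASE t2.CommentCont WHEN 'N' THEN @Delimiter ELSE '' END
                   + CAST(t2.CommentPart AS VARCHAR(MAX))
              FROM Comments t2
             WHERE t2.CommentID = t1.CommentID
             ORDER BY t2.CommentPartNum
               FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)') + @Delimiter AS Comment
      FROM Comments t1
     GROUP BY t1.CommentID
    [/font]

    Without TYPE/.value(), characters such as & and < would come back entitized (&amp;, &lt;).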

  • Florian Reischl (5/12/2009)


    Well, errmm... You need bigger hardware! 😀

    It seems to depend on the row count. I will investigate later today. Do you know how long the previous steps took?

    Bigger hardware would be nice, but it isn't exactly small as it is. I didn't time the pieces individually, but I think it took about 1 hour and 15 minutes to get to the "Move last fragments up" step. When I built the source table I left one little thing out; I doubt that caused the issue, though, as it just would have messed up the results. I'll try again tonight.

    I'll also try your splitter UDF and see how it works for our data. Thanks!

  • Do you get parallelism with Jeff's solution? I would think that would help a lot.

    T-SQL UDFs always generate a serial plan.

    (CLR UDFs don't - unless defined with a MAX datatype parameter)
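
    To illustrate the point (a minimal sketch; dbo.SomeLargeTable is a placeholder for any table big enough to otherwise earn a parallel plan):

    [font="Courier New"]-- A trivial scalar T-SQL UDF
    CREATE FUNCTION dbo.fn_Example (@s VARCHAR(100))
    RETURNS INT
    AS
    BEGIN
       RETURN LEN(@s)
    END
    GO
    -- Any query that references it gets a serial plan,
    -- no matter how large the table is
    SELECT dbo.fn_Example(SomeColumn)
      FROM dbo.SomeLargeTable
    [/font]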

  • Paul White (5/12/2009)


    Do you get parallelism with Jeff's solution? I would think that would help a lot.

    T-SQL UDFs always generate a serial plan.

    Hmm, I did not know that. Is that true for any scalar UDF in any query?

    Does it also apply to table-valued UDFs, including inline TVFs?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]
