Performance issue with tally solution

  • Jeff Moden (5/9/2009)


    Peter,

    Sorry man. I tested against 1,000 rows of 800 comma-separated random numbers each in a VARCHAR(8000) environment... your code did the split into a table in a minute and 50 seconds. Many of the other solutions come in at less than 10 seconds for the same thing.

    Heh, I was still running against the solutions listed in the initial few pages, you know, all those that build an intermediate table due to the non-inline table-valued functions. I bet it was significantly faster on those on your machine too... if not, that would worry me. As I did some further tweaking, I kept hitting a wall where SQL Server refuses to act smart.

    Now I have to look at your latest finding and understand how that impacts things. If all it takes is some implicit sorting to speed up code then I am all for it :). I did take a quick look at the later SQL solutions, but converting those to be exactly how I would want a generic splitter will take some doing and could possibly kill the optimizer when used in larger queries.

    Thanks for your testing tho 🙂

    Note:

    I found that my otherwise fast tally function does not work well with your cross join queries.

    It slows it down from 2 seconds to at least 3 minutes at which point I aborted it!

    So much more testing to do 😉

    Suffice to say, this topic is going to live forever until I find the best solution for all cases 🙂

  • I know I am coming in late, but boy has it been a read so far and I can't wait to see the final outcome.

    In the particular problem I am trying to solve I have 4,558,077 comment pieces, that have to be combined so that I have the full comment in one field. It takes about 2 minutes and 12 seconds using FOR XML (with .VALUE) to create the table with 1,173,035 comments all assembled.

    The comments are as short as 43 characters, as long as 19,629 characters, with an average of 502 characters. (They have as little as 1 line, up to 434 lines, with an average of 11 lines.)

    Then I have to take all of that and burst it into individual lines for each comment. (The lines may have been broken apart in different pieces in the source.)

    Using the last Tally solution from Jeff Moden it took 15 minutes and 1 second.

    The resulting 13,519,756 lines are as short as 0 characters, as long as 733 characters, with an average of 42 characters.

    One interesting thing I found is that if I have a clustered index on the comment table (dbo.JBMTest in Jeff's example), SQL Server decides to uni-task (it picks a serial plan), and it took 41 minutes 19 seconds to finish. By dropping the PK it used parallelism, and was almost 3 times faster.

    I'm interested in trying other T-SQL solutions, but I'm not sure which I should try or where the best/latest versions are. (I'm not ready to go down the path of getting CLR turned on.)
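
    For what it's worth, the FOR XML (with .value) roll-up step described above might be sketched something like this; the table and column names are hypothetical, since the actual schema isn't shown in the thread:

    ```sql
    -- Hypothetical schema: dbo.CommentParts(CommentID, PartNo, PartText)
    -- Roll all parts of each comment into one string, ordered by part number.
    -- The TYPE directive plus .value() avoids XML entitization of &, <, and >.
    SELECT  cp.CommentID,
            FullComment = (SELECT cp2.PartText + ''
                           FROM   dbo.CommentParts cp2
                           WHERE  cp2.CommentID = cp.CommentID
                           ORDER BY cp2.PartNo
                           FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)')
    INTO    dbo.FullComments
    FROM    dbo.CommentParts cp
    GROUP BY cp.CommentID;
    ```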

  • I'm confused... why would you put over 4 million comments together only to break them apart again?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/10/2009)


    I'm confused... why would you put over 4 million comments together only to break them apart again?

    Because in the source data an individual line may be broken across multiple parts. I figured rather than split the parts, and then try to figure out what line fragments have to be put together it would be easier to make one huge string out of all the parts for a given comment and then break that into the individual lines. (i.e. Part 1 of Comment 1 may contain lines 1 and 2, and the first portion of line 3, while part 2 may contain the rest of line 3 as well as line 4.)

  • Hi UMG

    Maybe I have an idea. Just to be sure: is your data text data, not binary?

    Greets

    Flo

  • UMG Developer (5/10/2009)


    Jeff Moden (5/10/2009)


    I'm confused... why would you put over 4 million comments together only to break them apart again?

    Because in the source data an individual line may be broken across multiple parts. I figured rather than split the parts, and then try to figure out what line fragments have to be put together it would be easier to make one huge string out of all the parts for a given comment and then break that into the individual lines. (i.e. Part 1 of Comment 1 may contain lines 1 and 2, and the first portion of line 3, while part 2 may contain the rest of line 3 as well as line 4.)

    Heh... step 1 for me would be to pork chop the data provider. What a terrible thing to do to the data.

    That notwithstanding, what is the unique character being used to signify the parts of a comment? And, how long are the individual lines? Last but not least, are you importing this data from a file?

    --Jeff Moden



  • Hi UMG

    For now, just a simple approach.

    Since your source data are not enclosed by a leading/trailing delimiter, I took my UNION ALL Tally version to do the initial split. This includes the first and the last item. It is a little slower than the traditional Tally version, but it doesn't require delimiters at the start and end of the text.

    Rows which end with the specified delimiter (I took a comma for now) will produce an empty row as the last item. After that: move the fragments up to the previous result row, delete the moved rows, and delete the empty rows at the end of each row-split.

    ---==================================
    -- Some source data
    DECLARE @Source TABLE (Id INT NOT NULL IDENTITY, Txt VARCHAR(100))

    INSERT INTO @Source
                 SELECT 'aaa,bb'
       UNION ALL SELECT 'b,ccc,'
       UNION ALL SELECT 'ddd'

    ---==================================
    -- Destination table
    DECLARE @Result TABLE (Id INT NOT NULL IDENTITY, SourceId INT, Txt VARCHAR(100), IsJoin BIT)

    ---==================================
    -- Delimiter
    DECLARE @Delimiter VARCHAR(20)
    DECLARE @DelimiterLen INT
    SELECT @Delimiter = ','
    SELECT @DelimiterLen = LEN(@Delimiter)

    ---==================================
    -- Split items without a required leading/trailing delimiter
    -- (requires an existing Tally table of sequential numbers)
    --
    -- IMPORTANT:
    -- If the source row was terminated correctly (with the delimiter) this will return an empty row
    INSERT INTO @Result (
             SourceId,
             Txt
             )
       SELECT
             s.Id,
             l.Item
          FROM @Source s
             CROSS APPLY
             (
                SELECT
                      SUBSTRING(s.Txt, 1, ISNULL(NULLIF(CHARINDEX(@Delimiter, s.Txt, 1) - 1, -1), LEN(s.Txt))) Item,
                      1 Sorting
                UNION ALL
                SELECT TOP 100 PERCENT
                      SUBSTRING(s.Txt, t.N + @DelimiterLen, ISNULL(NULLIF(CHARINDEX(@Delimiter, s.Txt, t.N + 1) - t.N - 1, -t.N - 1), LEN(s.Txt) - t.N)) Item,
                      2 Sorting
                   FROM Tally t
                   WHERE t.N <= LEN(s.Txt)
                      AND SUBSTRING(s.Txt, t.N, @DelimiterLen) = @Delimiter
                ORDER BY Sorting, t.N
             ) l

    ---==================================
    -- Move fragments from the next row up to the previous one
    ; WITH
    last_item (Id) AS
    (
       SELECT MAX(Id) FROM @Result GROUP BY SourceId
    )
    UPDATE r1 SET
          Txt = r1.Txt + r2.Txt,
          IsJoin = 1
       FROM last_item li
          JOIN @Result r1 ON li.Id = r1.Id
          JOIN @Result r2 ON r1.Id + 1 = r2.Id
       WHERE
          r1.Txt != ''

    ---==================================
    -- Delete the rows which have been joined up to the previous one
    DELETE r1
       FROM @Result r1
          JOIN @Result r2 ON r1.Id = r2.Id + 1
       WHERE r2.IsJoin = 1

    ---==================================
    -- Delete the empty last rows which resulted from the initial
    -- tally split
    ; WITH
    last_item AS ( SELECT MAX(Id) Id FROM @Result GROUP BY SourceId )
    DELETE r
       FROM last_item li
          JOIN @Result r ON li.Id = r.Id
       WHERE r.Txt = ''

    ---==================================
    -- Result
    SELECT * FROM @Source
    SELECT * FROM @Result

    Greets

    Flo

  • I was kind of hoping the data was being imported from a file and that each comment, despite the number of whole and partial lines it may occupy, would have a unique delimiter. Then you could just bulk insert the whole bloody thing using the comment delimiter as the end of line delimiter and simply replace any CrLf's with empty strings.

    --Jeff Moden


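
    Jeff's BULK INSERT idea above might look roughly like this; the file path, staging table, and '~' delimiter character are all hypothetical, since the real delimiter hadn't been identified at this point in the thread:

    ```sql
    -- Hypothetical: Comments.txt uses '~' as the unique end-of-comment delimiter
    CREATE TABLE dbo.CommentStage (CommentText VARCHAR(MAX));

    BULK INSERT dbo.CommentStage
    FROM 'C:\Import\Comments.txt'        -- hypothetical path
    WITH (
         ROWTERMINATOR   = '~',          -- the comment delimiter acts as "end of line"
         FIELDTERMINATOR = '\0',         -- no field splitting; each comment lands in one column
         TABLOCK
         );

    -- Replace any embedded CrLf pairs with empty strings, as described above
    UPDATE dbo.CommentStage
    SET    CommentText = REPLACE(CommentText, CHAR(13) + CHAR(10), '');
    ```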

  • Hi Jeff

    I saw your question and you are correct. If the data were imported from a file, this would be the best approach.

    But I already started with my solution and didn't want to hit "Ctrl+A" and "Del" 😛

    Greets

    Flo

  • Flo,

    Yes, the data is all text. I'll have to take a look at your proposed solution, but it would need some work to handle all the possible crap-data cases. If you use this for the source, you lose a whole set of the Ds:

    INSERT INTO @Source
                 SELECT 'aaa,bb'
       UNION ALL SELECT 'b,ccc,ddd'
       UNION ALL SELECT 'ddd'
       UNION ALL SELECT 'ddd,e'
       UNION ALL SELECT 'eee'
       UNION ALL SELECT ',fff'

    This is why I thought it would just be easier to roll the parts all up and then split it. (But maybe the resulting really long strings make the splitting go slower?)

    Jeff,

    I've never heard the phrase to "pork chop" someone; what does it mean? There isn't much I can do about where it comes from. It isn't even complete data because of a "bug" in their software that sometimes stores chunks longer than 4000 characters, but they ETL it out of the source system into Oracle into a VARCHAR2 field, which has a limit of 4000 characters. This is only a problem for about 150 of the 4+ million rows, so it isn't a huge deal. (And it isn't like we can do anything about it; they are thinking about how they might solve the problem...)

    Each row has a comment ID and a part number in addition to the comment part so that they can be put back together in the correct order.

    The delimiter between lines is a: |

    The source parts are between 18 and 4000 characters, with an average of 220 characters. (This could change at any time as new data is added.)

    The lines (once split) are between 0 and 733 characters, with an average of 42 characters. (Again this could change as new data is added.)

    I get the data from an Oracle 10g DB. (I could do some processing on Oracle, but I use SQL Server more and that is where the final analysis needs to be anyhow.)

    There are no CR/LFs in the data. (At least at this point.)

  • Florian Reischl (5/10/2009)


    Hi Jeff

    I saw your question and you are correct. If the data were imported from a file, this would be the best approach.

    But I already started with my solution and didn't want to hit "Ctrl+A" and "Del" 😛

    Greets

    Flo

    LOL! Man, do I ever understand that! 😀

    --Jeff Moden



  • UMG Developer (5/10/2009)


    I've never heard the phrase to "pork chop" someone; what does it mean?

    There's a pretty long story behind it, but it basically means taking the person responsible for your dilemma out for a nice pork chop dinner... where you will tie that person to a chair and feed that person pork chops... at point-blank range with a hefty slingshot. Most folks on this forum have come to know the phrase and I constantly forget not to use it around someone new to the forum. My apologies.

    There isn't much I can do about where it comes from. It isn't even complete data because of a "bug" in their software that sometimes stores chunks longer than 4000 characters, but they ETL it out of the source system into Oracle into a VARCHAR2 field, which has a limit of 4000 characters. This is only a problem for about 150 of the 4+ million rows, so it isn't a huge deal. (And it isn't like we can do anything about it; they are thinking about how they might solve the problem...)

    Understood.

    Each row has a comment ID and a part number in addition to the comment part so that they can be put back together in the correct order.

    The delimiter between lines is a: |

    The source parts are between 18 and 4000 characters, with an average of 220 characters. (This could change at any time as new data is added.)

    The lines (once split) are between 0 and 733 characters, with an average of 42 characters. (Again this could change as new data is added.)

    I get the data from an Oracle 10g DB. (I could do some processing on Oracle, but I use SQL Server more and that is where the final analysis needs to be anyhow.)

    There are no CR/LFs in the data. (At least at this point.)

    You say you get the data from an Oracle 10 DB... do you mean directly or is it in a text file? Even if it's not in a text file, see if you can get them to give it to you that way because, from everything you described, what I said about using bulk insert will work very quickly and almost by magic.

    --Jeff Moden



  • Jeff Moden

    Any word on this, Phil? Copying code from the code windows makes some really ugly stuff with extra line breaks, no indentation, and all those nasty things I posted on the test thread for the code windows.

    I spoke to Mel at RedGate last week. He hasn't given me a date yet, but I've given him the full force of the urgency with which you want it done. Sorry I've been a bit quiet on this thread; I've been very busy, but I am trying to follow things. It is getting very interesting.

    Best wishes,
    Phil Factor

  • Phil Factor (5/10/2009)


    Jeff Moden

    Any word on this, Phil? Copying code from the code windows makes some really ugly stuff with extra line breaks, no indentation, and all those nasty things I posted on the test thread for the code windows.

    I spoke to Mel at RedGate last week. He hasn't given me a date yet, but I've given him the full force of the urgency with which you want it done. Sorry I've been a bit quiet on this thread; I've been very busy, but I am trying to follow things. It is getting very interesting.

    Heh... Like Mel will listen to me. It's not for me... it's for all the others that may want to copy code from this good forum. To wit...

    http://www.thecounter.com/stats/2009/March/browser.php

    I do appreciate the feedback, though. Thanks, Phil.

    --Jeff Moden



  • Jeff Moden (5/10/2009)


    You say you get the data from an Oracle 10 DB... do you mean directly or is it in a text file? Even if it's not in a text file, see if you can get them to give it to you that way because, from everything you described, what I said about using bulk insert will work very quickly and almost by magic.

    I get it directly from the Oracle server via a linked server set up in SQL 2k5. I can certainly dump it to a text file if that makes it easier or faster. But frankly I'm pretty happy with the ~2 minutes to roll the parts together and the ~15 minutes to split it out into lines. On the other hand, if I can make it even faster, then all the better.

    I'm curious about the bulk insert method, so you're saying it can take something like this:

    ID Part Comment
    -- ---- ---------
    1  1    Line 1|Line 2 part 1
    1  2    Line 2 part 2
    1  3    Line 2 part 3|Line 3|
    1  4    Line 4|
    2  1    Line 1|
    2  2    Line 2|

    and give me something like this:

    ID Line Comment
    -- ---- ---------
    1  1    Line 1
    1  2    Line 2 part 1Line 2 part 2Line 2 part 3
    1  3    Line 3
    1  4    Line 4
    2  1    Line 1
    2  2    Line 2

    with very little work?
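
    For reference, the two-step approach being discussed (roll the parts up per ID, then split on '|') can be sketched against that sample data as follows. The ad hoc Tally CTE and all names here are illustrative only, not the solution from the thread:

    ```sql
    -- Sample data shaped like the post above
    DECLARE @Parts TABLE (ID INT, Part INT, Comment VARCHAR(8000));
    INSERT INTO @Parts
              SELECT 1, 1, 'Line 1|Line 2 part 1'
    UNION ALL SELECT 1, 2, 'Line 2 part 2'
    UNION ALL SELECT 1, 3, 'Line 2 part 3|Line 3|'
    UNION ALL SELECT 1, 4, 'Line 4|'
    UNION ALL SELECT 2, 1, 'Line 1|'
    UNION ALL SELECT 2, 2, 'Line 2|';

    WITH Rolled AS (
        -- Step 1: concatenate the parts per ID, ordered by part number
        SELECT p.ID,
               Txt = (SELECT p2.Comment + ''
                      FROM @Parts p2
                      WHERE p2.ID = p.ID
                      ORDER BY p2.Part
                      FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)')
        FROM @Parts p
        GROUP BY p.ID
    ),
    Tally AS (
        -- Ad hoc numbers table, 1..8000
        SELECT TOP 8000 N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM sys.all_columns a CROSS JOIN sys.all_columns b
    ),
    Starts AS (
        -- Step 2: every position where a line begins (start of string, or just after a '|')
        SELECT r.ID, r.Txt, N = 1 FROM Rolled r
        UNION ALL
        SELECT r.ID, r.Txt, t.N + 1
        FROM Rolled r JOIN Tally t ON t.N <= LEN(r.Txt)
        WHERE SUBSTRING(r.Txt, t.N, 1) = '|'
    )
    SELECT ID,
           Line    = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N),
           Comment = SUBSTRING(Txt, N,
                     ISNULL(NULLIF(CHARINDEX('|', Txt, N), 0), LEN(Txt) + 1) - N)
    FROM Starts
    WHERE N <= LEN(Txt);   -- drops the empty fragment after a trailing '|'
    ```

    This should return rows in the ID/Line/Comment shape shown above, with the three fragments of line 2 reassembled into one line.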
