Which is Quicker? - updating 90 million records

  • Hi, I've got a table of 60 million (ish) records I need to update in chunks.

    Which is the quicker approach (or doesn't it even matter?):

    a higher number of iterations, or a higher number of records per update?

    Many thanks for any words of wisdom

    DECLARE @COUNTER INT

    SET @COUNTER = 1

    WHILE @COUNTER < 8
    BEGIN

    BEGIN TRANSACTION

    UPDATE TOP (800000) A
    SET A.FIELD1 = B.FIELD2 + B.FIELD3, A.UPDATED = 1
    FROM TABLEA A
    INNER JOIN TABLEB B
    ON A.ID = B.ID
    AND A.RECNO = B.RECNO
    WHERE A.UPDATED IS NULL

    SET @COUNTER = @COUNTER + 1

    COMMIT TRANSACTION

    END

  • As you might expect, "it depends" is the most appropriate answer.

    Keep in mind that bigger batches require more T-LOG space.

    It also depends on how expensive the query to find the rows to update is compared to the update itself. If the INNER JOIN I see in your query costs more than updating the data, bigger batches will likely be faster.

    -- Gianluca Sartori
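    One common shape for this (a sketch only, reusing the table and column names from the question; the batch size is just a starting point to tune) is to loop until the UPDATE touches no rows instead of hard-coding the iteration count, committing between batches so the log space can be reused:

```sql
-- Sketch: batch until no rows remain; TABLEA/TABLEB/FIELD names come from the question.
DECLARE @BatchSize INT = 800000,
        @Rows      INT = 1;

WHILE @Rows > 0
BEGIN
    BEGIN TRANSACTION;

    UPDATE TOP (@BatchSize) A
    SET    A.FIELD1  = B.FIELD2 + B.FIELD3,
           A.UPDATED = 1
    FROM   TABLEA A
    INNER JOIN TABLEB B
            ON  A.ID    = B.ID
            AND A.RECNO = B.RECNO
    WHERE  A.UPDATED IS NULL;

    SET @Rows = @@ROWCOUNT;

    COMMIT TRANSACTION;

    CHECKPOINT; -- lets the log truncate between batches under SIMPLE recovery;
                -- under FULL recovery, take log backups instead
END;
```

    Making @BatchSize a variable makes it easy to test both ends of the trade-off Gianluca describes on your own hardware.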

  • thanks Gianluca, much appreciated 🙂

  • I'll say that this would be slower because you're still using a single transaction and you're using several queries. This is a guess and I might be wrong.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (1/23/2015)


    I'll say that this would be slower because you're still using a single transaction and you're using several queries. This is a guess and I might be wrong.

    I also believe that a single transaction will play hell with the log file. The transaction trick is great for nearly doubling the performance of a RBAR While loop, but that's not what this one is. It's a "set size control" rather than RBAR.

    I'd say that it would be better to just lock the whole table rather than doing this in an explicit transaction if you can afford the luxury of excluding all other users.

    You also have to remember that every system has a "tipping point" where, for example, updating 3 million rows might only take 3 times the duration as updating a million but 4 million rows (again, for example) might take an hour or two.

    Doing a million rows at a time isn't going to be horribly slower overall than doing 2 million rows if things are done correctly. I also wouldn't use an "Updated" column as the primary search for the update. I'll have to explain what I mean by that later tonight.

    In the meantime, what is the definition of the clustered index on this table, and what are the datatypes of the columns in that clustered index? I ask because I believe this can be made quite a bit faster, but I need that information to be sure.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
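    As an aside, the clustered index definition and key datatypes Jeff asks for can be read straight from the catalog views (a sketch; the dbo.TableA name is assumed from later in the thread):

```sql
-- Sketch: key columns and datatypes of the clustered index on dbo.TableA.
SELECT  i.name   AS index_name,
        ic.key_ordinal,
        c.name   AS column_name,
        t.name   AS data_type
FROM    sys.indexes        i
JOIN    sys.index_columns  ic ON ic.object_id   = i.object_id
                             AND ic.index_id    = i.index_id
JOIN    sys.columns        c  ON c.object_id    = ic.object_id
                             AND c.column_id    = ic.column_id
JOIN    sys.types          t  ON t.user_type_id = c.user_type_id
WHERE   i.object_id = OBJECT_ID('dbo.TableA')
  AND   i.type = 1            -- 1 = clustered
  AND   ic.key_ordinal > 0    -- key columns only, not included columns
ORDER BY ic.key_ordinal;
```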

  • Hey Jeff

    thanks for the input

    as far as I was concerned the updated flag was a poor choice, but it was the only one I had; the updated data isn't visually distinguishable from the data it's replacing

    the PK is a clustered composite on the fields used for the join

    regards, Simon!

  • Jeff Moden (1/23/2015)


    ....I'd say that it would be better to just lock the whole table rather than doing this in an explicit transaction if you can afford the luxury of excluding all other users.

    I know this is a bit of a tangent from the original question, but this reminded me of something I do at my company, similar to (or literally) read committed snapshot behavior. This way the table as it was before the updates began remains online while you perform all the locking transactions your heart could desire.

    You can also do this manually with a set of views and clones of the same table so you can do work on one and then alter the view to swap out which table you're pointing at when the work (in this case, updates) are finished.

    Executive Junior Cowboy Developer, Esq.

  • simon_s (1/23/2015)


    Hey Jeff

    thanks for the input

    as far as I was concerned the updated flag was a poor choice, but it was the only one I had; the updated data isn't visually distinguishable from the data it's replacing

    the PK is a clustered composite on the fields used for the join

    regards simon !

    In that case (and assuming that you're trying to update all 60 million rows), I have an idea. Before we do a deep dive on the idea, tell me how long the following query takes to run. I'm thinking about 12-15 seconds.

    WITH
    cteEnumerate AS
    (
    SELECT RowNum = ROW_NUMBER() OVER (ORDER BY ID, RECNO)
    ,ID
    FROM dbo.TableA
    )
    SELECT *
    FROM cteEnumerate
    WHERE RowNum % 1000000 = 0
    ;

    --Jeff Moden


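    A guess at where this is heading (not stated outright in the thread): the query above returns every millionth boundary key in clustered-index order, and those boundaries can drive range-based batches that seek on the clustered key instead of rescanning for UPDATED IS NULL each pass. A sketch, assuming the boundary IDs were saved into a hypothetical #Boundaries table (and ignoring RECNO ties at the boundaries, which a real version would handle):

```sql
-- Sketch: walk clustered-key ranges captured from the enumeration query.
-- Assumes #Boundaries holds (RowNum, ID) for every millionth row.
DECLARE @LoID INT = 0, @HiID INT;

DECLARE BoundaryCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT ID FROM #Boundaries ORDER BY RowNum;

OPEN BoundaryCursor;
FETCH NEXT FROM BoundaryCursor INTO @HiID;

WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE A
    SET    A.FIELD1 = B.FIELD2 + B.FIELD3
    FROM   dbo.TableA A
    INNER JOIN TABLEB B
            ON  A.ID    = B.ID
            AND A.RECNO = B.RECNO
    WHERE  A.ID > @LoID
      AND  A.ID <= @HiID;   -- range seek on the leading clustered-key column

    SET @LoID = @HiID;
    FETCH NEXT FROM BoundaryCursor INTO @HiID;
END;

CLOSE BoundaryCursor;
DEALLOCATE BoundaryCursor;

-- One final pass for rows above the last boundary would still be needed.
```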

  • Hey Jeff

    takes 4 minutes and returns 555 rows

    thanks simon

