HUGE Database Update (> 550 million rows)

  • We have a database that is currently 900 GB in size, and we need to update over 550 million rows within it. Essentially, we have a custom extract package that takes a value out of a source database and concatenates 3 fields; that concatenated value is the key stored within 9 different tables in the target database. We have since found that this key changes periodically, invalidating some of our data, so we have modified the extract to pull a single, unique key from the source database instead (an oversight at design time).

    So what I have now are 9 tables that need to be updated based on 43,000 records in a new reference table.

    I have been able to write efficient UPDATE statements for 7 of the 9 tables using a simple cursor that pulls the old key value and the new key value. After updating a certain number of rows in each table, I extrapolated the UPDATE times out, and it will likely take about 4 days to complete (the goal is to finish the entire database update in under 10 days).

    On the other 2 tables - here are the specs:

    Table 1 - 224 million rows

    Table 2 - 259 million rows

    Now, I have tried numerous ways of doing this update, but have not found any of them to be efficient. I've tried adding new indexes to the tables based on DTA recommendations to improve performance, but the update itself still takes far too long. I've worked through different approaches with various members of our team without finding any truly efficient way to do the updates. If anyone has any suggestions, I'd greatly appreciate it.

    Basically, we have the 2 values in the reference table

    Ref_Table (NewKey, OldKey)

    In the 2 tables mentioned above, there is a Key field that needs to be updated, so it's a basic UPDATE statement:

    UPDATE Table1
    SET KEY = NewKey
    WHERE KEY = OldKey

    The trick is finding the optimal way to loop through all 43,000 records in the reference table and efficiently update the roughly 500 million rows in these 2 tables.

    If you have any suggestions or can point me to a command or utility that might work, I would greatly appreciate it.

  • Number one suggestion: Do NOT use a cursor or WHILE loop to do this. You can let SQL match up the reference table with the rows to be updated using the "Update.... From..." syntax.

    Your two-column reference table should, of course, be indexed on the old key.
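
    Something along these lines, for example (real table and index names assumed here; the INCLUDE just lets the update read the new key straight from the index):

    create index IX_Ref_OldKey on Ref_Table (OldKey) include (NewKey)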

    Here's a quick sample of Update... From. I didn't bother with indexes, but you surely understand how to do that.

    create Table #RefTable(
    oldKey int
    ,newKey int)
    go

    create Table #FixMePlease(
    MyKey int)
    go

    insert #FixMePlease -- Put some data in the "data" table
    select 1 union all
    select 2 union all
    select 3 union all
    select 4 union all
    select 5 union all
    select 6 union all
    select 7 union all
    select 8
    go

    Insert #RefTable -- create some old/new pairs in the reference table
    Select 2, 35 union all
    Select 5, 48 union all
    Select 7, 34 union all
    Select 8, 99
    go

    Select * -- Look at what's in the tables
    from #FixMePlease

    Select *
    from #RefTable
    go

    --= = = = = = = = = = = Here's the update Statement...
    Update #FixMePlease
    set MyKey = r.newKey
    from #RefTable r
    where MyKey = r.oldKey
    go

    Select * -- look at the updated data table
    from #FixMePlease
    go

    drop table #FixMePlease
    drop table #RefTable

  • Actually, a Cursor or While Loop is just what the doctor ordered here...

    Every system has a "tipping" point where it just loses its mind during an update. A million rows may only take a minute. Two million rows may take only twice as long, or two minutes. But three million rows may take an hour or two, depending on when the tipping point is reached. A Cursor or While loop to control the "batch" size of such updates so they stay below the "tipping" point is absolutely necessary on such large updates.

    There are other things to consider during such large updates. The physical and logical order of the rows being updated can be used to substantially reduce both individual row locking and physical read/write head movement on the disk drive itself. My recommendation would be to do the batched updates in the same order as the clustered index UNLESS that index is in the wrong order to maintain non-page-split data.

    Something else that affects massive updates is whether or not there are indexes on the columns being updated. The worst scenario would be if you're updating a column that's part of the clustered index... you could be page splitting the whole "world". IIRC, nonclustered indexes that are larger than a single extent split at the extent level, which can be just as bad as page splits (or worse if the index is tightly packed by FILL FACTOR).

    If the clustered index is in good shape (more temporally based than not), doing things in clustered index order with page locks may also reduce contention on the table, especially if the "new stuff" and the "stuff people are accessing" are near the logical end of the table and the stuff being updated is outside that range.
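
    Just to illustrate the batch-controlled idea (table, column, and batch-size names below are assumed, not taken from the actual schema), a WHILE loop version might look something like this:

    declare @BatchSize int, @Rows int
    set @BatchSize = 250000
    set @Rows = 1

    while @Rows > 0
    begin
        -- each pass updates at most @BatchSize rows that still match an old key,
        -- then commits on its own (autocommit)
        update top (@BatchSize) t
        set    t.[Key] = r.NewKey
        from   dbo.Table1 t
        join   dbo.Ref_Table r
          on   t.[Key] = r.OldKey

        set @Rows = @@ROWCOUNT   -- reaches 0 once no rows still match an OldKey
    end

    Note that this sketch assumes no NewKey value is itself someone else's OldKey (otherwise a row could be remapped twice), and UPDATE TOP doesn't guarantee which rows a given batch touches, so driving the batches from predefined key ranges is the safer way to stay in clustered index order.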

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ah, thank you for the correction, Jeff.

    Now I do wonder if the procedural loop need be done at the row level or if it could be done with batches. That is, would it be advisable for a case like Brian's to loop not through the individual rows, but rather through ranges of rows. I can imagine a cursor running through a thousand iterations, each of which updates about 500K rows with an "update...from". There'd be a preliminary step that perhaps uses the NTile() function to define the 1000 batches, then the loop.
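
    To make that concrete (purely a sketch, with assumed table and column names), NTILE could pre-assign each reference row to one of 1,000 batches, and the loop would then run one "update...from" per batch number:

    -- spread the 43,000 reference rows across 1,000 roughly equal batches
    select NTILE(1000) over (order by OldKey) as BatchNo,
           OldKey, NewKey
    into   #RefBatches
    from   dbo.Ref_Table

    declare @BatchNo int
    set @BatchNo = 1

    while @BatchNo <= 1000
    begin
        update t
        set    t.[Key] = b.NewKey
        from   dbo.Table1 t
        join   #RefBatches b on t.[Key] = b.OldKey
        where  b.BatchNo = @BatchNo

        set @BatchNo = @BatchNo + 1
    end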

    Make sense? or do we just say go get 'em RBAR?

    One other thing I wanted to ask: does anyone know whether these updates could be done as a bulk update, instead of having each update logged in the transaction log?

    The plan is to take a backup (using a SAN-based snapshot tool), update a table and make sure it completed, then proceed with another backup and another table update. I was also wondering whether, if I grouped these updates into batches of say 500,000 rows, it could record one transaction per batch as opposed to 500,000 individual ones. The reason being: if the update fails, I'm just going to roll back to the last backup and start over.
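
    To illustrate what I mean (just a sketch, using the same assumed table names as earlier in the thread), each 500,000-row batch would be wrapped in its own transaction:

    declare @Rows int
    set @Rows = 1

    while @Rows > 0
    begin
        begin transaction

        update top (500000) t
        set    t.[Key] = r.NewKey
        from   dbo.Table1 t
        join   dbo.Ref_Table r on t.[Key] = r.OldKey

        set @Rows = @@ROWCOUNT
        commit transaction   -- the whole batch commits or rolls back as a unit
    end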

    Thoughts?

  • Jeff Moden (1/21/2010)


    Actually, a Cursor or While Loop is just what the doctor ordered here...

    Who are you, and what have you done with Jeff Moden? 😀

    (Sorry, couldn't resist!)

    +--------------------------------------------------------------------------------------+
    Check out my blog at https://pianorayk.wordpress.com/

  • john.arnott (1/22/2010)


    Ah, thank you for the correction, Jeff.

    Now I do wonder if the procedural loop need be done at the row level or if it could be done with batches. That is, would it be advisable for a case like Brian's to loop not through the individual rows, but rather through ranges of rows. I can imagine a cursor running through a thousand iterations, each of which updates about 500K rows with an "update...from". There'd be a preliminary step that perhaps uses the NTile() function to define the 1000 batches, then the loop.

    Make sense? or do we just say go get 'em RBAR?

    In the rare instances where I've had to do such a thing, I actually predefine the ranges of PK's to work with and store them in a table. Depending on the system, concurrency requirements, and a whole bunch of other stuff, the ranges may be as small as 500 rows or as large as a million rows. Then, I just loop through that control table one row at a time to control the batches of updates which affect 500 to 1 million rows at a time.
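
    A rough sketch of that pattern (all object names assumed), with the predefined PK ranges stored in a control table and a loop driving one set-based update per range:

    -- control table of predefined PK ranges (populate it to whatever batch size suits the system)
    create table dbo.BatchControl (
        BatchNo   int identity(1,1) primary key,
        FromKey   bigint not null,
        ToKey     bigint not null,
        Completed bit not null default 0
    )

    declare @Batch int, @From bigint, @To bigint

    while exists (select 1 from dbo.BatchControl where Completed = 0)
    begin
        select top 1 @Batch = BatchNo, @From = FromKey, @To = ToKey
        from dbo.BatchControl
        where Completed = 0
        order by BatchNo

        update t
        set    t.[Key] = r.NewKey
        from   dbo.Table1 t
        join   dbo.Ref_Table r on t.[Key] = r.OldKey
        where  t.ID between @From and @To   -- ID = the big table's PK, assumed here

        update dbo.BatchControl set Completed = 1 where BatchNo = @Batch
    end

    The Completed flag also makes the whole thing restartable if it gets interrupted partway through.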

    Heh... in this case, I don't consider it to be "true" RBAR because for each control row read, up to a million rows are affected in a set based manner. The cursor or while loop isn't a performance problem. So far as the control table goes, I suppose you could classify it as RBR but I wouldn't classify it as RBAR. 🙂

    --Jeff Moden



  • Brian.Dunat (1/22/2010)


    One other thing I wanted to ask: does anyone know whether these updates could be done as a bulk update, instead of having each update logged in the transaction log?

    The plan is to take a backup (using a SAN-based snapshot tool), update a table and make sure it completed, then proceed with another backup and another table update. I was also wondering whether, if I grouped these updates into batches of say 500,000 rows, it could record one transaction per batch as opposed to 500,000 individual ones. The reason being: if the update fails, I'm just going to roll back to the last backup and start over.

    Thoughts?

    That's mostly what I suggested. I also provided some additional thoughts in the post just above this one.

    --Jeff Moden



  • Ray K (1/22/2010)


    Jeff Moden (1/21/2010)


    Actually, a Cursor or While Loop is just what the doctor ordered here...

    Who are you, and what have you done with Jeff Moden? 😀

    (Sorry, couldn't resist!)

    Heh... not a problem and I definitely deserve the friendly poking on that. I can see the headlines now... "EXTRA! EXTRA! Leader of the anti-RBAR alliance recommends a Cursor!" 😛

    Not to disappoint, though... I don't actually use a Cursor or While Loop to do this in 2k5. I gen all of the necessary commands using set-based code, concatenate them into a single VARCHAR(MAX), and execute the VARCHAR(MAX) (a rough sketch of this is below). I just can't bring myself to write an actual cursor. The reason I recommend a cursor for this is threefold...

    1) The cursor will not be a performance problem for such a thing.

    2) Everyone understands what cursors are and how to write one, whereas they may not understand the set-based command generation method I use or the reasons why I do it that way.

    3) It's actually one of the few appropriate uses for a cursor.
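
    A stripped-down sketch of that command-generation approach (object names assumed; in practice the generated statements would come from whatever control table defines the batch ranges):

    declare @sql varchar(max)
    set @sql = ''

    -- build one UPDATE statement per batch range by concatenating into the variable
    select @sql = @sql
        + 'update t set t.[Key] = r.NewKey '
        + 'from dbo.Table1 t join dbo.Ref_Table r on t.[Key] = r.OldKey '
        + 'where t.ID between ' + cast(FromKey as varchar(20))
        + ' and ' + cast(ToKey as varchar(20)) + '; '
    from dbo.BatchControl
    order by BatchNo

    exec (@sql)   -- run all of the generated batch updates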

    --Jeff Moden



  • Jeff,

    Thanks for the hint on the clustered index. I was checking out the table and the execution plan, scrolling through everything that would be affected, and found that the field I need to update is part of a 7-field clustered index. When the update was running, it was causing all sorts of page splits, which I didn't catch when reviewing the execution plan the first time.

    For the purposes of this update, we are going to drop the clustered index on that table, create a nonclustered index on the key field I need to update, and then parse through that in batches of 250K rows at a time. Once the update is done, we'll recreate the clustered index.
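
    In outline (all object and column names here are just placeholders), that looks something like this:

    -- 1. drop the 7-column clustered index (assumed here to be the PK constraint)
    alter table dbo.Table1 drop constraint PK_Table1

    -- 2. temporary nonclustered index on the key column to support the batched update
    create nonclustered index IX_Table1_Key_Tmp on dbo.Table1 ([Key])

    -- 3. ...run the batched update, 250K rows at a time...

    -- 4. drop the temporary index and rebuild the clustered primary key
    drop index IX_Table1_Key_Tmp on dbo.Table1
    alter table dbo.Table1 add constraint PK_Table1
        primary key clustered (Col1, Col2, Col3, Col4, Col5, Col6, [Key])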

    I've been testing this since last night, and I think that with those changes I should be able to get this table updated in about 7 days as opposed to the original 40 I had estimated. THANK YOU FOR THAT TIP!

    Still learning, but I think this is how we all learn best: getting these types of issues thrown at us and figuring out the solution. Thanks for the help!

  • Brian.Dunat (1/22/2010)


    One other thing I wanted to ask: does anyone know whether these updates could be done as a bulk update, instead of having each update logged in the transaction log?

    There is no 'bulk update' operation. All updates are fully logged.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

  • Brian.Dunat (1/23/2010)


    Jeff,

    Thanks for the hint on the clustered index. I was checking out the table and the execution plan, scrolling through everything that would be affected, and found that the field I need to update is part of a 7-field clustered index. When the update was running, it was causing all sorts of page splits, which I didn't catch when reviewing the execution plan the first time.

    For the purposes of this update, we are going to drop the clustered index on that table, create a nonclustered index on the key field I need to update, and then parse through that in batches of 250K rows at a time. Once the update is done, we'll recreate the clustered index.

    I've been testing this since last night, and I think that with those changes I should be able to get this table updated in about 7 days as opposed to the original 40 I had estimated. THANK YOU FOR THAT TIP!

    Still learning, but I think this is how we all learn best: getting these types of issues thrown at us and figuring out the solution. Thanks for the help!

    You bet. But before you take off on that, check a couple of other things and make a plan for the long term...

    Dropping the 7-column clustered index is definitely a good idea because it's part of the update, provided that you understand it will have a drastic effect on any SELECTs you run against the table while it's gone. When you go to reinstate the index, ask yourself, how many times do you insert into this table and will the inserts be done near the logical end of whatever the sort order of the clustered index is? If the answer is that you're going to be doing a lot of inserts and they won't be done near the logical end, then DON'T make it a clustered index, because it will require constant rebuilding. Instead, make it a non-clustered index and make something else the clustered index, even if it's just an IDENTITY column or a date column that represents the logical order of inserts.

    It sounds like the tables you're doing the update to can actually be offline for quite a while. If that's true and you have the Enterprise Edition of SQL Server, you might want to consider partitioning the table. If nothing else, it makes index rebuilding a whole lot less of an impact.

    So far as the update goes... what is the PK of this table? The clustered index? If not, the method in the "side bar" below may work swimmingly.

    As a side bar... lots of folks yell at me for wanting to put an IDENTITY column on such tables. They talk of things such as natural keys and how useful they are, and while that's all true, a 7-column natural key makes life pretty difficult for such updates and other "special ops". If the table had an IDENTITY column as an "alternate key" or maybe even a "primary key" (where the 7 columns would become the alternate key and the IDENTITY column would also be the clustered index), then you could pull off a very high speed method for doing the updates... read the PK and the column to be updated (using the same batch size as you already planned) into an ancillary table. Update the ancillary table for all possible updates against your 43,000 row control table, as well as marking an "IsUpdate" column. Then use the ancillary table as the driver (paying particular attention to the IsUpdate column) to update the original table. There are a couple of reasons why that's so fast... first, it requires only a single-column join and the updated column is guaranteed not to be a part of the index you need to do the join. Second, you're guaranteed that the update will happen in the same order as the clustered index, so no hopping around a large table with the read/write heads.
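
    A sketch of that ancillary-table approach (assuming a hypothetical IDENTITY column named ID as the clustered PK, and reusing the other assumed names from earlier in the thread):

    declare @FromID bigint, @ToID bigint
    select @FromID = 1, @ToID = 1000000   -- the current batch range, supplied by the control loop

    -- 1. pull one batch of PK + key values into an ancillary table
    select t.ID,
           t.[Key],
           t.[Key]        as NewKey,      -- placeholder, resolved in step 2
           cast(0 as bit) as IsUpdate
    into   #Ancillary
    from   dbo.Table1 t
    where  t.ID between @FromID and @ToID

    -- 2. resolve the new keys against the 43,000-row reference table and flag the hits
    update a
    set    a.NewKey   = r.NewKey,
           a.IsUpdate = 1
    from   #Ancillary a
    join   dbo.Ref_Table r on a.[Key] = r.OldKey

    -- 3. drive the real update through the single-column PK join, in clustered index order
    update t
    set    t.[Key] = a.NewKey
    from   dbo.Table1 t
    join   #Ancillary a on t.ID = a.ID
    where  a.IsUpdate = 1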

    --Jeff Moden



  • Jeff Moden (1/23/2010)


    When you go to reinstate the index, ask yourself, how many times do you insert into this table and will the inserts be done near the logical end of whatever the sort order of the clustered index is?

    And while you're thinking about that, read this article http://www.sqlservercentral.com/articles/Indexing/68563/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Jeff,

    This is the proper clustered index, and it makes up the primary key on the table. This is a wealth management database, so the data that will be inserted after this update will all be logically ordered in accordance with the clustered index.

    We are near rollout of this system and we are anticipating that the database will be offline while these updates take place (no SELECTs), so that is why I think the best approach will be to drop that clustered index, create the nonclustered index on the key column, perform the update, drop that temporary index and recreate the clustered index. Once that is done and we verify the updates, then we will be able to bring the system back online.

    Thanks again!

    Brian
