Updating Large datasets

  • I'm running into a problem updating 25+ million rows of data; tempdb keeps blowing up (it exceeds 166 GB). I've set the database to bulk-logged recovery, and of course that doesn't help. I've tried a loop that updates 1 million rows, commits, and then grabs the next million; it either didn't work (nothing committed) or didn't commit because I still blew tempdb within 4 hours. It's a simple update along these lines:

    update a
    set PKid = b.PKid
    from mytable a, my_other_table b
    where a.col1 = b.col1
      and a.col2 = b.col2

    Both col1 and col2 are indexed (nonclustered) and are ints.

    The optimizer says that even if I limit the update to PKid < 1 million, I still have to hash join 125 million rows (it's 2 billion on the total table). Any ideas?
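
    Roughly the shape of the loop I tried, in case it helps (a simplified sketch only; the real names and batch bookkeeping differ, and it assumes PKid starts at zero):

    -- Walk PKid in one-million-row windows, committing each window so the
    -- work per transaction stays bounded.
    DECLARE @lo int, @hi int, @max int
    SELECT @max = MAX(PKid) FROM mytable
    SET @lo = 0
    WHILE @lo <= @max
    BEGIN
        SET @hi = @lo + 1000000
        BEGIN TRAN
        -- update only the rows whose PKid falls in the current window
        UPDATE a
        SET PKid = b.PKid
        FROM mytable a, my_other_table b
        WHERE a.col1 = b.col1
          AND a.col2 = b.col2
          AND a.PKid >= @lo AND a.PKid < @hi
        COMMIT TRAN
        SET @lo = @hi
    END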

    Thanks, and don't forget to Chuckle

  • Updating a million rows at a time is still an awful lot. It looks like you have the luxury of not needing access to the database during this update, but I would cut the batch size down to about 10,000 rows per update.
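
    One way to do that, for what it's worth, is a SET ROWCOUNT loop (just a sketch; it assumes the new PKid is always different from the old one, so rows that are already done fall out of the WHERE clause):

    -- Sketch only: cap each UPDATE at 10,000 rows and repeat until nothing
    -- is left to change. Assumes a.PKid <> b.PKid until a row has been done.
    SET ROWCOUNT 10000
    WHILE 1 = 1
    BEGIN
        UPDATE a
        SET PKid = b.PKid
        FROM mytable a, my_other_table b
        WHERE a.col1 = b.col1
          AND a.col2 = b.col2
          AND a.PKid <> b.PKid
        IF @@ROWCOUNT = 0 BREAK
    END
    SET ROWCOUNT 0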

    However, the other question is why so much is being updated. Are there business processes or other approaches that could work with the data closer to its creation, so that massive updates are not needed?

    Guarddata-

  • It's actually a business process that is mandating this: I'm aggregating the 25 million rows down to 1.8 million or so, and the update writes the FK back to the un-aggregated part.

    It's all offline, not on the production servers (hardware being cheap now).
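
    To give a rough idea of the shape of it (the column names and the aggregate below are invented, just to illustrate the pattern):

    -- Illustration only: build the 1.8M-row summary table, then point the
    -- detail rows' FK at the summary rows.
    SELECT col1, col2, MIN(PKid) AS PKid, SUM(amount) AS total_amount
    INTO my_other_table
    FROM mytable
    GROUP BY col1, col2

    UPDATE a
    SET PKid = b.PKid
    FROM mytable a, my_other_table b
    WHERE a.col1 = b.col1
      AND a.col2 = b.col2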

    Thanks, and don't forget to Chuckle

  • quote:


    It's actually a business process that is mandating this: I'm aggregating the 25 million rows down to 1.8 million or so, and the update writes the FK back to the un-aggregated part.

    It's all offline, not on the production servers (hardware being cheap now).

    Thanks, and don't forget to Chuckle


    Sounds like perhaps you're updating all the rows, in which case you might want to leave the recovery model as Bulk-Logged, use SELECT...INTO to make the changed table, then drop the original table, rename the changed table, and create the indexes (use SORT_IN_TEMPDB if tempdb is on its own disk array).
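
    The skeleton of that sequence looks something like this (names are illustrative, and the real table presumably has more columns):

    -- Skeleton only; under Bulk-Logged recovery the SELECT...INTO is
    -- minimally logged, and the index builds can sort in tempdb.
    SELECT b.PKid, a.col1, a.col2          -- ...plus mytable's other columns
    INTO mytable_new
    FROM mytable a
    JOIN my_other_table b ON a.col1 = b.col1 AND a.col2 = b.col2

    DROP TABLE mytable
    EXEC sp_rename 'mytable_new', 'mytable'

    CREATE CLUSTERED INDEX ix_PKid ON mytable(PKid) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_col1 ON mytable(col1) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_col2 ON mytable(col2) WITH SORT_IN_TEMPDB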

    --Jonathan

  • I prefer to break the huge update into small batches, as Guarddata recommended.

    Edited by - allen_cui on 11/07/2003 2:40:56 PM

  • quote:


    I prefer to break the huge update into small batches, as Guarddata recommended.


    The last time I tested this sort of thing was years ago on SQL 7, so I decided to see whether there really is less of a performance impact in this specific case to support your "preference."

    This is simplified as we of course do not know the exact schema. I used only one million rows as I didn't feel like taking the time and space to create 25 million rows, but the results are obviously extensible.

    
    
    CREATE TABLE MyTable(
    PKid int,
    Col1 int,
    Col2 int,
    Col3 varchar(40))
    CREATE CLUSTERED INDEX ix_PKid ON MyTable(PKid) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_Col1 ON MyTable(Col1) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_Col2 ON MyTable(Col2) WITH SORT_IN_TEMPDB

    CREATE TABLE My_Other_Table(
    PKid int PRIMARY KEY,
    Col1 int,
    Col2 int)
    CREATE INDEX ix_Col1 ON My_Other_Table(Col1) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_Col2 ON My_Other_Table(Col2) WITH SORT_IN_TEMPDB

    SET NOCOUNT ON
    DECLARE @i int
    SET @i = 0
    WHILE @i < 1000000 BEGIN
        INSERT MyTable
        SELECT @i, @i/100, @i%10, 'auehlkdsajlkfeaj'
        SET @i = @i + 1
    END
    SET @i = 0
    WHILE @i < 100000 BEGIN
        INSERT My_Other_Table
        SELECT 1000000 + @i, @i/10, @i%10
        SET @i = @i + 1
    END

    UPDATE MyTable
    SET PKid = b.PKid
    FROM MyTable a, My_Other_Table b
    WHERE a.Col1 = b.Col1 AND a.Col2= b.Col2

    With only one million rows and no other users on the server, I did not bother to cut this into separate batches (which would be slower). The above update took 5:14 on my (slow) test system, and added about 500MB to the tran log (which was sized so that it did not autogrow).

    
    
    SELECT b.PKid, a.Col1, a.Col2, a.Col3
    INTO MyTable2
    FROM MyTable a JOIN My_Other_Table b ON a.Col1 = b.Col1 AND a.Col2 = b.Col2

    CREATE CLUSTERED INDEX ix_PKid ON MyTable2(PKid) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_Col1 ON MyTable2(Col1) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_Col2 ON MyTable2(Col2) WITH SORT_IN_TEMPDB

    The SELECT...INTO took 8 seconds, and the index creation took 24 seconds. So my method is almost ten times faster and doesn't have the tran log repercussions when the Bulk-Logged model is used.

    --Jonathan

  • I've been trying this all weekend - I need to go back to the drawing board as I obviously have a bad join.

    Using INSERT INTO, I blew the database up (in bulk-logged recovery, and still growing) to over 100 gigs, with the log at 80 gigs. The optimizer says I was joining 23 million rows to 23 million and getting 299 million back, which is not what I intended :). Even knowing that the optimizer is often wrong, I must have had something wrong somewhere (it's a simple PK-to-PK join).
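
    Next step, I suppose, is to check whether the join columns really are unique on each side; a quick sanity check along these lines (placeholder names again) should show where the fan-out comes from:

    -- If either query returns rows, that side's join key is not unique
    -- and the join will multiply rows instead of matching one-to-one.
    SELECT col1, col2, COUNT(*) AS cnt
    FROM mytable
    GROUP BY col1, col2
    HAVING COUNT(*) > 1

    SELECT col1, col2, COUNT(*) AS cnt
    FROM my_other_table
    GROUP BY col1, col2
    HAVING COUNT(*) > 1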

    Thanks, and don't forget to Chuckle
