Updating Large datasets

  • I'm running into a problem updating 25+ million rows of data; tempdb keeps blowing up (it exceeds 166 GB). I've set the database to bulk-logged recovery, and of course that doesn't help. I've tried a loop that updates 1 million rows, commits, and then grabs the next million; it either didn't work (nothing committed) or didn't commit because I still blew tempdb within 4 hours. It's a simple update along these lines:

    update a
    set PKid = b.PKid
    from mytable a, my_other_table b
    where a.col1 = b.col1
      and a.col2 = b.col2

    Both col1 and col2 are indexed (nonclustered) and are ints.

    The optimizer says that even if I limit the update to PKid < 1 million, I still have to hash join 125 million rows (it's 2 billion on the total table). Any ideas?
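
    Roughly the shape of the loop I tried, in case it helps (a simplified sketch only; the real names and batch bookkeeping differ, and it assumes PKid starts at zero):

    -- Walk PKid in one-million-row windows, committing each window so the
    -- work per transaction stays bounded.
    DECLARE @lo int, @hi int, @max int
    SELECT @max = MAX(PKid) FROM mytable
    SET @lo = 0
    WHILE @lo <= @max
    BEGIN
        SET @hi = @lo + 1000000
        BEGIN TRAN
        -- update only the rows whose PKid falls in the current window
        UPDATE a
        SET PKid = b.PKid
        FROM mytable a, my_other_table b
        WHERE a.col1 = b.col1
          AND a.col2 = b.col2
          AND a.PKid >= @lo AND a.PKid < @hi
        COMMIT TRAN
        SET @lo = @hi
    END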

    Thanks, and don't forget to Chuckle

  • Updating a million rows at a time is still an awful lot. It looks like you have the luxury of not needing access to the database during this update, but I would cut the batch size down to about 10,000 rows per update.
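
    One way to do that, for what it's worth, is a SET ROWCOUNT loop (just a sketch; it assumes the new PKid is always different from the old one, so rows that are already done fall out of the WHERE clause):

    -- Sketch only: cap each UPDATE at 10,000 rows and repeat until nothing
    -- is left to change. Assumes a.PKid <> b.PKid until a row has been done.
    SET ROWCOUNT 10000
    WHILE 1 = 1
    BEGIN
        UPDATE a
        SET PKid = b.PKid
        FROM mytable a, my_other_table b
        WHERE a.col1 = b.col1
          AND a.col2 = b.col2
          AND a.PKid <> b.PKid
        IF @@ROWCOUNT = 0 BREAK
    END
    SET ROWCOUNT 0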

    However, the other question is why so much is being updated. Are there business processes or other approaches that could work with the data closer to its creation, so that massive updates are not needed?

    Guarddata-

  • It's actually a business process that is mandating this: I'm aggregating the 25 million rows down to 1.8 million or so, and the update writes the FK back to the un-aggregated part.

    It's all offline, not on the production servers (hardware being cheap now).
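
    To give a rough idea of the shape of it (the column names and the aggregate below are invented, just to illustrate the pattern):

    -- Illustration only: build the 1.8M-row summary table, then point the
    -- detail rows' FK at the summary rows.
    SELECT col1, col2, MIN(PKid) AS PKid, SUM(amount) AS total_amount
    INTO my_other_table
    FROM mytable
    GROUP BY col1, col2

    UPDATE a
    SET PKid = b.PKid
    FROM mytable a, my_other_table b
    WHERE a.col1 = b.col1
      AND a.col2 = b.col2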

    Thanks, and don't forget to Chuckle

  • quote:


    It's actually a business process that is mandating this: I'm aggregating the 25 million rows down to 1.8 million or so, and the update writes the FK back to the un-aggregated part.

    It's all offline, not on the production servers (hardware being cheap now).

    Thanks, and don't forget to Chuckle


    Sounds like perhaps you're updating all the rows, in which case you might want to leave the recovery model as Bulk-Logged, use SELECT...INTO to make the changed table, then drop the original table, rename the changed table, and create the indexes (use SORT_IN_TEMPDB if tempdb is on its own disk array).
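
    The skeleton of that sequence looks something like this (names are illustrative, and the real table presumably has more columns):

    -- Skeleton only; under Bulk-Logged recovery the SELECT...INTO is
    -- minimally logged, and the index builds can sort in tempdb.
    SELECT b.PKid, a.col1, a.col2          -- ...plus mytable's other columns
    INTO mytable_new
    FROM mytable a
    JOIN my_other_table b ON a.col1 = b.col1 AND a.col2 = b.col2

    DROP TABLE mytable
    EXEC sp_rename 'mytable_new', 'mytable'

    CREATE CLUSTERED INDEX ix_PKid ON mytable(PKid) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_col1 ON mytable(col1) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_col2 ON mytable(col2) WITH SORT_IN_TEMPDB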

    --Jonathan

  • I prefer to break the huge update into small batches, as Guarddata recommended.

    Edited by - allen_cui on 11/07/2003 2:40:56 PM

  • quote:


    I prefer to break the huge update into small batches, as Guarddata recommended.


    The last time I tested this sort of thing was years ago on SQL 7, so I decided to see whether there really is less of a performance impact in this specific case to support your "preference."

    This is simplified as we of course do not know the exact schema. I used only one million rows as I didn't feel like taking the time and space to create 25 million rows, but the results are obviously extensible.

    
    
    CREATE TABLE MyTable(
    PKid int,
    Col1 int,
    Col2 int,
    Col3 varchar(40))
    CREATE CLUSTERED INDEX ix_PKid ON MyTable(PKid) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_Col1 ON MyTable(Col1) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_Col2 ON MyTable(Col2) WITH SORT_IN_TEMPDB

    CREATE TABLE My_Other_Table(
    PKid int PRIMARY KEY,
    Col1 int,
    Col2 int)
    CREATE INDEX ix_Col1 ON My_Other_Table(Col1) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_Col2 ON My_Other_Table(Col2) WITH SORT_IN_TEMPDB

    SET NOCOUNT ON
    DECLARE @i int
    SET @i = 0
    WHILE @i < 1000000 BEGIN
        INSERT MyTable
        SELECT @i, @i/100, @i%10, 'auehlkdsajlkfeaj'
        SET @i = @i + 1
    END
    SET @i = 0
    WHILE @i < 100000 BEGIN
        INSERT My_Other_Table
        SELECT 1000000 + @i, @i/10, @i%10
        SET @i = @i + 1
    END

    UPDATE MyTable
    SET PKid = b.PKid
    FROM MyTable a, My_Other_Table b
    WHERE a.Col1 = b.Col1 AND a.Col2= b.Col2

    With only one million rows and no other users on the server, I did not bother to cut this into separate batches (which would be slower). The above update took 5:14 on my (slow) test system, and added about 500MB to the tran log (which was sized so that it did not autogrow).

    
    
    SELECT b.PKid, a.Col1, a.Col2, a.Col3
    INTO MyTable2
    FROM MyTable a JOIN My_Other_Table b ON a.Col1 = b.Col1 AND a.Col2 = b.Col2

    CREATE CLUSTERED INDEX ix_PKid ON MyTable2(PKid) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_Col1 ON MyTable2(Col1) WITH SORT_IN_TEMPDB
    CREATE INDEX ix_Col2 ON MyTable2(Col2) WITH SORT_IN_TEMPDB

    The SELECT...INTO took 8 seconds, and the index creation took 24 seconds. So my method is almost ten times faster and doesn't have the tran log repercussions when the Bulk-Logged model is used.

    --Jonathan

  • I've been trying this all weekend - I need to go back to the drawing board as I obviously have a bad join.

    Using INSERT INTO, I blew the database up (in bulk-logged recovery, and still growing) to over 100 gigs, with the log at 80 gigs. The optimizer says I was joining 23 million rows to 23 million and getting 299 million back, which is not what I intended :). Even knowing that the optimizer is often wrong, I must have had something wrong somewhere (it's a simple PK-to-PK join).
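
    Next step, I suppose, is to check whether the join columns really are unique on each side; a quick sanity check along these lines (placeholder names again) should show where the fan-out comes from:

    -- If either query returns rows, that side's join key is not unique
    -- and the join will multiply rows instead of matching one-to-one.
    SELECT col1, col2, COUNT(*) AS cnt
    FROM mytable
    GROUP BY col1, col2
    HAVING COUNT(*) > 1

    SELECT col1, col2, COUNT(*) AS cnt
    FROM my_other_table
    GROUP BY col1, col2
    HAVING COUNT(*) > 1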

    Thanks, and don't forget to Chuckle
