November 7, 2003 at 5:57 am
I'm running into a problem updating 25+ million rows of data: tempdb keeps blowing up (it exceeds 166 GB). I've set the database to the bulk-logged recovery model; of course that doesn't help. I've tried a loop that updates 1 million rows, commits, and then grabs the next million; it either didn't work (nothing committed) or did commit but still blew out tempdb within 4 hours. It's a simple update based on:
update a
set PKid = b.PKid
from mytable a, my_other_table b
where a.col1 = b.col1 and
      a.col2 = b.col2
Both col1 and col2 are indexed (nonclustered) and are ints.
The optimizer says that even if I limit the update to PKid < 1 million, I still have to hash join 125 million rows (the total table is 2 billion) - any ideas?
Thanks, and don't forget to Chuckle
November 7, 2003 at 9:11 am
Updating a million rows at a time is still an awful lot. It looks like you have the luxury of not needing access to the database during this update, but I would cut the batch size down to about 10,000 rows per update.
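A batched version of that update might look like this in SQL 2000 syntax. This is only a sketch: the `a.PKid <> b.PKid` filter is an assumption about how already-updated rows can be recognized, and the table and column names are taken from the earlier post.

```sql
-- Sketch: update in 10,000-row batches (SQL 2000 era SET ROWCOUNT).
-- Assumes rows not yet updated can be identified by PKid <> b.PKid.
SET ROWCOUNT 10000

WHILE 1 = 1
BEGIN
    UPDATE a
    SET PKid = b.PKid
    FROM mytable a
    JOIN my_other_table b ON a.col1 = b.col1 AND a.col2 = b.col2
    WHERE a.PKid <> b.PKid      -- skip rows done in earlier batches

    IF @@ROWCOUNT = 0 BREAK     -- nothing left to update
END

SET ROWCOUNT 0                  -- restore the default (no limit)
```

Because each batch commits as its own implicit transaction, the log space from one batch can be reused by the next instead of accumulating in a single giant transaction.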
However, the other question is why so much is being updated. Are there business processes or other approaches that could work with the data closer to its creation, so that massive updates are not needed?
Guarddata-
November 7, 2003 at 9:21 am
It's actually a business process that mandates this: I'm aggregating the 25 million rows down to roughly 1.8 million, and the update writes the FK back to the un-aggregated part.
It's all offline, not on the production servers (hardware being cheap now).
Thanks, and don't forget to Chuckle
November 7, 2003 at 1:23 pm
quote:
It's actually a business process that is mandating this, I'm aggregating the 25 million down to 1.8 or so million, and the update is to update the FK back to the un-aggregated part. It's all offline, not on the production servers (hardware being cheap now).
Thanks, and don't forget to Chuckle
Sounds like perhaps you're updating all the rows, in which case you might want to leave the recovery model as Bulk-Logged, use SELECT...INTO to make the changed table, then drop the original table, rename the changed table, and create the indexes (use SORT_IN_TEMPDB if tempdb is on its own disk array).
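The drop-and-rename step of that approach could be sketched like this (table and index names are examples only):

```sql
-- After SELECT ... INTO MyTable2 has built the changed table:
DROP TABLE MyTable                      -- remove the original table
EXEC sp_rename 'MyTable2', 'MyTable'    -- give the new table the old name

-- Then recreate the indexes on the renamed table, e.g.:
CREATE CLUSTERED INDEX ix_PKid ON MyTable(PKid) WITH SORT_IN_TEMPDB
```

Since SELECT...INTO is minimally logged under the Bulk-Logged model, this avoids writing the full 25 million row images to the transaction log.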
--Jonathan
November 7, 2003 at 2:41 pm
I prefer to break the huge update into small batches as Guarddata recommended.
Edited by - allen_cui on 11/07/2003 2:40:56 PM
November 10, 2003 at 7:44 am
quote:
I prefer to break the huge update into small batches as Guarddata recommended.
The last time I tested this sort of thing was years ago with SQL 7, so I decided to see whether there is less of a performance impact in this specific case, to support your "preference."
This is simplified as we of course do not know the exact schema. I used only one million rows as I didn't feel like taking the time and space to create 25 million rows, but the results are obviously extensible.
CREATE TABLE MyTable(
PKid int,
Col1 int,
Col2 int,
Col3 varchar(40))
CREATE CLUSTERED INDEX ix_PKid ON MyTable(PKid) WITH SORT_IN_TEMPDB
CREATE INDEX ix_Col1 ON MyTable(Col1) WITH SORT_IN_TEMPDB
CREATE INDEX ix_Col2 ON MyTable(Col2) WITH SORT_IN_TEMPDB
CREATE TABLE My_Other_Table(
PKid int PRIMARY KEY,
Col1 int,
Col2 int)
CREATE INDEX ix_Col1 ON My_Other_Table(Col1) WITH SORT_IN_TEMPDB
CREATE INDEX ix_Col2 ON My_Other_Table(Col2) WITH SORT_IN_TEMPDB
SET NOCOUNT ON
DECLARE @i int
SET @i = 0
WHILE @i < 1000000
BEGIN
    INSERT MyTable
    SELECT @i, @i/100, @i%10, 'auehlkdsajlkfeaj'
    SET @i = @i + 1
END
SET @i = 0
WHILE @i < 100000
BEGIN
    INSERT My_Other_Table
    SELECT 1000000 + @i, @i/10, @i%10
    SET @i = @i + 1
END
UPDATE a
SET PKid = b.PKid
FROM MyTable a
JOIN My_Other_Table b ON a.Col1 = b.Col1 AND a.Col2 = b.Col2
With only one million rows and no other users on the server, I did not bother to cut this into separate batches (which would be slower). The above update took 5:14 on my (slow) test system and added about 500 MB to the tran log (which was sized so that it did not autogrow).
SELECT b.PKid, a.Col1, a.Col2, a.Col3
INTO MyTable2
FROM MyTable a JOIN My_Other_Table b ON a.Col1 = b.Col1 AND a.Col2 = b.Col2
CREATE CLUSTERED INDEX ix_PKid ON MyTable2(PKid) WITH SORT_IN_TEMPDB
CREATE INDEX ix_Col1 ON MyTable2(Col1) WITH SORT_IN_TEMPDB
CREATE INDEX ix_Col2 ON MyTable2(Col2) WITH SORT_IN_TEMPDB
The SELECT...INTO took 8 seconds, and the index creation took 24 seconds. So my method is almost ten times faster and doesn't have the tran log repercussions when the Bulk-Logged model is used.
--Jonathan
November 10, 2003 at 11:48 am
I've been trying this all weekend - I need to go back to the drawing board, as I obviously have a bad join.
Using INSERT INTO, I blew the database (in bulk-logged recovery, and still growing) to over 100 GB, and the log was at 80 GB. The optimizer says I was joining 23 million rows to 23 million and getting 299 million - that's not what I intended :). Even knowing that the optimizer is often wrong, I must have had something wrong somewhere (it's a simple PK-to-PK join).
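A join that takes 23 million rows against 23 million and returns 299 million usually means the join key is not unique on at least one side. A quick way to check, using the column names from the earlier posts (adjust for the actual schema):

```sql
-- List join keys that occur more than once in the lookup table;
-- every duplicate key here multiplies the matching rows in the join.
SELECT col1, col2, COUNT(*) AS cnt
FROM my_other_table
GROUP BY col1, col2
HAVING COUNT(*) > 1
```

If this returns rows, the update is also nondeterministic: each target row matches several source rows and could be set from any of them.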
Thanks, and don't forget to Chuckle