August 24, 2005 at 12:55 pm
hey all...
ok, table structure...
Id int, ContactId bigint, FieldValue varchar(4000), TypeId int, CurrentValue char(1)
there are currently over 170 million rows in this table... when I load new users in, their field values are separated into rows and inserted here (it adds up pretty quickly). I need to be able to go through and update CurrentValue to 'N' for older values (non-current values are then removed from the table and inserted into a history table).
My current statement is:
Update cd
set CurrentValue = 'N'
from ContactDetails cd
where Id not in (select max(id) from ContactDetails cd1 group by contactid, typeid)
this works, but I just had to stop it as it was running for over 36 hours... I need a faster way to do this.
I've tried
Update cd
set CurrentValue = 'N'
from ContactDetails cd
inner join ContactDetails cd1 on cd.ContactId = cd1.ContactId and cd.TypeId = cd1.TypeId
where cd.Id < cd1.Id
but this actually seems slower (for a test section of 450k rows, 12 seconds for the subselect and over 30 minutes for the inner join).
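A third variant that sometimes performs better is joining once against the grouped maximums as a derived table, so the aggregate is computed a single time instead of per row (untested sketch; the CurrentValue = 'Y' filter assumes rows still carry 'Y' until they're demoted):

```sql
Update cd
set CurrentValue = 'N'
from ContactDetails cd
inner join (select ContactId, TypeId, max(Id) as MaxId
            from ContactDetails
            group by ContactId, TypeId) m
    on cd.ContactId = m.ContactId and cd.TypeId = m.TypeId
where cd.Id < m.MaxId
  and cd.CurrentValue = 'Y'
```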
Anyone have any other ideas? it's driving me insane.
Thanks a ton!
Cheers
August 24, 2005 at 1:09 pm
Shooting in the dark here, but is it necessary to process the entire table, or when you "load new users" can you just run this update for those records that are relevant?
August 24, 2005 at 1:21 pm
I have actually tried this using a couple of different methods; they all slow down the insert rather significantly... inserts of 7500 users go from 45 - 60 seconds to 15 - 20 minutes, not good when you're loading another million users (I wish it was!)
Cheers
August 24, 2005 at 1:23 pm
Is there a trigger on that table?
August 24, 2005 at 1:44 pm
nope, no triggers... tried that, was REALLY slow
there are a ton (7 - 8) of indexes on it... but those should only speed it up.
Cheers
August 24, 2005 at 1:46 pm
Have you tried forcing the use of any of the Indexes?
I wasn't born stupid - I had to study.
August 24, 2005 at 1:55 pm
Have you tried doing the update in small batches (10 000 rows, wait 1 sec, next batch).
August 24, 2005 at 2:00 pm
do you have any syntax that I could see regarding this? I've never had to force the use of an index...
Cheers
August 24, 2005 at 2:03 pm
I wouldn't go that route first. I'd make sure the plan is actually bad before forcing anything, and I doubt that's the case here.
August 24, 2005 at 2:03 pm
I'm guessing the syntax for this would be something like
declare @Id int, @MaxId int
select @Id = min(Id), @MaxId = max(Id) from ContactDetails
while @Id <= @MaxId
begin
Update cd
set CurrentValue = 'N'
from ContactDetails cd
where Id not in (select max(id) from ContactDetails group by contactid, typeid)
and Id between @Id and @Id + 9999
set @Id = @Id + 10000
end
no, I haven't done this... seems like the loop would cause more overhead (although it wouldn't take NEARLY as long to roll back). Am I wrong about the loop adding overhead?
Cheers
August 24, 2005 at 2:13 pm
It adds a little overhead, but each batch holds its locks for a much shorter time, so your current tasks wouldn't be affected. So in essence that thing could run for days without anyone noticing it.
I have a few ideas for a better looping way, can you add a column to the tables (they all need that).
August 24, 2005 at 2:18 pm
Remi is right. Plus, it has been a long time since I had to force an index (I think it was 6.5). I will look, but it will take me a while to find it... (it is simple, but if I remember right, hard to find in BOL).
I wasn't born stupid - I had to study.
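For reference, the table-hint syntax in SQL Server 7.0/2000 is WITH (INDEX(...)); a sketch, where IX_ContactDetails_Contact_Type is a hypothetical index name (substitute one of the real indexes on the table):

```sql
select max(Id)
from ContactDetails with (index(IX_ContactDetails_Contact_Type))
group by ContactId, TypeId
```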
August 24, 2005 at 2:19 pm
Colin - also note that indexes don't speed everything up -- they can actually slow inserts and updates, since each index has to be brought in line with the changes. Removing indexes that are not used can actually speed up changes to tables.
August 24, 2005 at 2:28 pm
Remi: I CAN add columns, but that'll definitely slow stuff down, no?
Farrell: if I remember, in 6.5 you had to explicitly call indexes into use, but 7.0 and 2000 do it automatically.
Merrill: I'm waiting for the 36 hour rollback to happen, then I've actually modded the proc to remove all indexes as step 1, then add them all back as a final step (I've read indexes can slow inserts/updates over the past couple days as well)
Cheers
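The drop-and-recreate pattern for the load would look roughly like this (IX_ContactDetails_Contact_Type is a hypothetical index name; the real column lists will differ):

```sql
-- step 1: drop the nonclustered indexes before the bulk load
drop index ContactDetails.IX_ContactDetails_Contact_Type

-- ... bulk insert the new users here ...

-- final step: rebuild the indexes
create index IX_ContactDetails_Contact_Type
    on ContactDetails (ContactId, TypeId, Id)
```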
August 24, 2005 at 2:37 pm
It would take time to add the column, but it would speed up the loop process greatly:
Add a new identity(1,1) column, INDEX IT
then something like this :
Declare @Loops as int
Declare @BatchSize as int
Declare @Rows as int
set @Loops = -1
set @BatchSize = 10000
set @Rows = 1
While @Rows > 0
begin
set @Loops = @Loops + 1
waitfor delay '00:00:01'
update ContactDetails set ... where NewColumn >= @Loops * @BatchSize and NewColumn < (@Loops + 1) * @BatchSize
set @Rows = @@Rowcount
end
I haven't tested this... try it on a small table to be sure it works. Use the batch size that locks the resource for the least time... and still updates a good chunk of data.
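Another batching option on SQL Server 2000 that avoids the new column entirely is SET ROWCOUNT, which caps how many rows each UPDATE touches (untested sketch; assumes not-yet-demoted rows still carry CurrentValue = 'Y', so each pass shrinks the remaining work):

```sql
set rowcount 10000           -- cap each update at 10 000 rows
while 1 = 1
begin
    update cd
    set CurrentValue = 'N'
    from ContactDetails cd
    where cd.CurrentValue = 'Y'
      and cd.Id not in (select max(Id) from ContactDetails
                        group by ContactId, TypeId)
    if @@rowcount = 0 break  -- nothing left to demote
    waitfor delay '00:00:01' -- breathe between batches
end
set rowcount 0               -- reset the session limit
```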