April 21, 2009 at 3:18 pm
Good afternoon all,
I am working on a script that will remove duplicate records from a very large table (40 million+ records). It is taking forever to accomplish.
Deduping table: tDuplicatesExist
--Back up the data
Select * into tDuplicatesExist_BackUp from tDuplicatesExist
--Get the data to dedup
Select * into tDuplicatesExist_DeDup from tDuplicatesExist
--Truncate originating table to insert distinct records back (WITH orig pk recordnumber [unique identifier])
Truncate table tDuplicatesExist
SET IDENTITY_INSERT Database.dbo.tDuplicatesExist ON
Insert tDuplicatesExist
    (RecordNumber, EmployeeID, AccountID, TransactionType, TransactionData, TransactionDate)
Select
    d.RecordNumber, --PK/unique identifier
    d.EmployeeID,
    d.AccountID,
    d.TransactionType,
    d.TransactionData,
    d.TransactionDate
from tDuplicatesExist_DeDup d
JOIN
    (Select
        EmployeeID,
        AccountID,
        TransactionType,
        convert(datetime, convert(varchar(10), TransactionDate, 101)) TDate,
        Max(RecordNumber) RecordNumber
    From tDuplicatesExist_DeDup
    Group by
        EmployeeID,
        AccountID,
        TransactionType,
        convert(datetime, convert(varchar(10), TransactionDate, 101))
    ) A
on d.RecordNumber = A.RecordNumber
SET IDENTITY_INSERT Database.dbo.tDuplicatesExist OFF
1. Does anyone see a better way to do this to speed it up?
2. Could I add indexes to my _DeDup table to help?
Any guidance would be appreciated!
thanks
April 21, 2009 at 4:32 pm
Whoops, did not realize this was SQL 2000 - sorry. Thanks Sergiy.
Have you looked at ROW_NUMBER (Transact-SQL) in BOL? It is very useful and quite fast for listing duplicates. A quick and dirty example for checking by two duplicate columns is:
;with numbered as(SELECT rowno=row_number()
over (partition by PRODUCTID, Customerid order by PRODUCTID),
PRODUCTID,CustomerId from PartDemo)
select * from numbered
The data returned is:
rowno PRODUCTID CustomerId
-------------------- ----------- -----------
1 5 5
2 5 5
1 10 1
1 10 5
1 20 101
2 20 101
3 20 101
4 20 101
1 20 102
Same idea, but only checking for duplicate PRODUCTID (one column):
;with numbered as(SELECT rowno=row_number()
over (partition by PRODUCTID order by PRODUCTID),
PRODUCTID,CustomerId from PartDemo)
select * from numbered
The data returned is:
rowno PRODUCTID CustomerId
-------------------- ----------- -----------
1 5 5
2 5 5
1 10 5
2 10 1
1 20 101
2 20 102
3 20 101
4 20 101
5 20 101
Note that rowno values higher than 1 are duplicates, so all one has to do is delete the rows where rowno > 1 to eliminate them.
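Putting that together, a rough sketch of the delete itself, reusing the same PartDemo table and columns (note this requires SQL Server 2005 or later, since ROW_NUMBER does not exist in SQL 2000; deleting through a CTE targets the single base table it selects from):

```sql
-- Sketch only: SQL Server 2005+.
-- Removes every row except the first in each (PRODUCTID, CustomerId) group.
;with numbered as (
    SELECT rowno = row_number()
        over (partition by PRODUCTID, CustomerId order by PRODUCTID),
        PRODUCTID, CustomerId
    from PartDemo
)
delete from numbered
where rowno > 1
```

Run the earlier SELECT version first to eyeball which rows would go before committing to the delete.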
April 21, 2009 at 4:38 pm
One observation. You're making a backup copy of the original table and then another copy of the table which you use as the dedup source. Seeing as you're not changing the data in the dedup source why not just use the backup copy as the source? You'll save a lot of time and disk space for any reasonably sized table.
When I've needed to make any changes to any large table I've always:
- Renamed the existing table (and any indices and stats on the table),
- Created an empty table (and clustered index if any) of the same structure and name as the original,
- Moved the data to the new table (sorted as per the clustered index where necessary, and in multiple streams if I'm trying to cut down the time), then
- Recreated any non-clustered indices (and any stats I want to have available from the "restart" with the new table), again with the same names as the originals.
At least that saves moving the data twice (or three times as per your supplied script). It also means you've got your original indices on the renamed table to help speed things up too, so there's no need to worry about building new indices (unless you're doing something weird or maybe if there's no clustered index).
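A rough sketch of those steps against this thread's table (the column data types, the _Old suffix, and the outage window are all assumptions; script the real structure and indexes from the actual database first):

```sql
-- Sketch of the rename approach; names and data types are illustrative.
-- 1. Rename the existing table out of the way.
exec sp_rename 'tDuplicatesExist', 'tDuplicatesExist_Old'

-- 2. Recreate an empty table with the original name and structure
--    (columns taken from the post; types are guesses).
CREATE TABLE tDuplicatesExist (
    RecordNumber    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    EmployeeID      int,
    AccountID       int,
    TransactionType int,
    TransactionData varchar(255),
    TransactionDate datetime
)

-- 3. Move the deduped data across, keeping the original key values.
SET IDENTITY_INSERT tDuplicatesExist ON
INSERT tDuplicatesExist
    (RecordNumber, EmployeeID, AccountID, TransactionType, TransactionData, TransactionDate)
SELECT RecordNumber, EmployeeID, AccountID, TransactionType, TransactionData, TransactionDate
FROM tDuplicatesExist_Old o
WHERE RecordNumber = (SELECT MAX(RecordNumber)
                      FROM tDuplicatesExist_Old
                      WHERE EmployeeID = o.EmployeeID
                        AND AccountID = o.AccountID
                        AND TransactionType = o.TransactionType
                        AND convert(varchar(10), TransactionDate, 101)
                          = convert(varchar(10), o.TransactionDate, 101))
SET IDENTITY_INSERT tDuplicatesExist OFF

-- 4. Recreate any non-clustered indexes here with the original names.
```

The old table doubles as the backup, so no separate copies are needed.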
April 21, 2009 at 4:41 pm
bitbucket (4/21/2009)
Have you looked at ROW_NUMBER (Transact-SQL) in BOL? It is very useful and quite fast for listing duplicates.
Did you ever try it on 40+ mil tables?
Not to mention it does not work on SQL2000.
_____________
Code for TallyGenerator
April 21, 2009 at 4:45 pm
Thanks!
This is an old, old database, and I am not sure if indexes exist, or how to check whether they do.
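On SQL 2000 the quickest check is sp_helpindex, or you can query the sysindexes system table directly, for example:

```sql
-- Lists the indexes (if any) on the table; works on SQL Server 2000.
exec sp_helpindex 'tDuplicatesExist'

-- Or query the SQL 2000 metadata directly, filtering out auto-created statistics.
select name, indid
from sysindexes
where id = object_id('tDuplicatesExist')
  and indid between 1 and 254
  and indexproperty(id, name, 'IsStatistics') = 0
```

indid 1 is the clustered index; 2 through 254 are non-clustered indexes.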
April 21, 2009 at 5:11 pm
Glenn Dorling (4/21/2009)
One observation. You're making a backup copy of the original table and then another copy of the table which you use as the dedup source. Seeing as you're not changing the data in the dedup source why not just use the backup copy as the source? You'll save a lot of time and disk space for any reasonably sized table.When I've needed to make any changes to any large table I've always:
- Renamed the existing table (and any indices and stats on the table),
- Created an empty table (and clustered index if any) of the same structure and name as the original,
- Moved the data to the new table (sorted as per the clustered index where necessary, and in multiple streams if I'm trying to cut down the time), then
- Recreated any non-clustered indices (and any stats I want to have available from the "restart" with the new table), again with the same names as the originals.
At least that saves moving the data twice (or three times as per your supplied script). It also means you've got your original indices on the renamed table to help speed things up too, so there's no need to worry about building new indices (unless you're doing something weird or maybe if there's no clustered index).
I appreciate your input. One thing, though: this is a transactional table, and if I rename it the app will break by not allowing data to be inserted, so that is not an option. I wonder if I should delete in place instead of re-inserting into a new table.
April 21, 2009 at 6:23 pm
If you can't get an outage you're going to have real problems with managing locking/blocking.
One way of doing it would be to search for the first duplicate, then delete all but one of them, repeating until no duplicates remain. That could literally take days, and it will impact any concurrent activity, perhaps significantly. You could minimise that by forcing row-level locking with no escalation to table locks (so inserts could still happen), but then you may well run into memory issues.
Any way involving inserts into a new table is going to cause a lot of I/O, so even using a nolock hint may impact concurrent users. If you do use a nolock hint you'd have to have a key to determine what's already been written and do a second pass to get what's been written (inserted or updated) since the first one started prior to doing any table renames. If rows can be deleted (not likely in most systems) then a two-pass process won't be possible.
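If an in-place delete is the route taken, one SQL 2000-compatible sketch is to delete all but the highest RecordNumber in each duplicate group in capped batches, using SET ROWCOUNT so each transaction (and its lock footprint) stays small. The batch size is arbitrary, and the grouping columns are assumed from the original post:

```sql
-- Sketch: SQL 2000-compatible in-place dedup, deleting in small batches.
-- Keeps the row with the highest RecordNumber in each duplicate group.
SET ROWCOUNT 10000          -- cap each DELETE to limit lock escalation
WHILE 1 = 1
BEGIN
    DELETE d
    FROM tDuplicatesExist d
    WHERE RecordNumber < (SELECT MAX(RecordNumber)
                          FROM tDuplicatesExist
                          WHERE EmployeeID = d.EmployeeID
                            AND AccountID = d.AccountID
                            AND TransactionType = d.TransactionType
                            AND convert(varchar(10), TransactionDate, 101)
                              = convert(varchar(10), d.TransactionDate, 101))
    IF @@ROWCOUNT = 0 BREAK
END
SET ROWCOUNT 0              -- restore the default
```

On a 40-million-row table the correlated subquery will crawl without a supporting index, so an index on (EmployeeID, AccountID, TransactionType, TransactionDate) would be worth adding first if one does not already exist.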
And most importantly of all, before doing any sort of deduping you must fix the issue that caused the duplicates in the first place, otherwise you'll have to go through the whole tedious process again (and again and ...).
April 22, 2009 at 6:44 am
Also seems asked and answered here
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=124271
N 56°04'39.16"
E 12°55'05.25"