Removing Duplicate Records Efficiently

Question

Removing Duplicate Records Efficiently

DanG

SSC Eights!

Points: 951
More actions
October 23, 2002 at 6:56 pm

#159050

I periodically get 12 million records with names, addresses and phone numbers. In the data there are some duplicate phone numbers. Where there are duplicates, I only want one of the records - which one does not matter. Is there an efficient way to get rid of the dups without running a cursor?
Thanks,
Dan

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

Antares686 SSC Guru Points: 125444 More actions · Answer 1

Depending on how you get the records and what fields are duplicated you have multiple options. How many columns must have the same data to be considered a duplicate. If all then SELECT DISTINCT will handle. If less then is there any existing field such as a date which determines which to keep or is it just keep one. If keep one then setup a processing table with an IDENTITY column to key off of.

Then query something like this.

SELECT * FROM tblProc oq WHERE IDX = (SELECT TOP 1 IDX FROM tblProc iq WHERE iq.Col1 = oq.Col1 AND iq.Col2 = oq.Col2 AND iq.Col3 = oq.Col3 ORDER BY IDX)

-- Basically where the duplicating fields are equal.

Now if you have a date or other unique column and want to keep the last record query like this on the data (again if coming from another server may not need processing table)

SELECT * FROM tblProc oq WHERE ColDate = (SELECT Max(ColDate) FROM tblProc iq WHERE iq.Col1 = oq.Col1 AND iq.Col2 = oq.Col2 AND iq.Col3 = oq.Col3)

It all boils down to how you decide what to keep and what to throw away.

"Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

Andy Warren SSC Guru Points: 119902 More actions · Answer 2

A quick and dirty alternative is to make a copy of the table structure and create a unique index with ignore dupes on your key field(s). Then append the data, applying an order by that makes sense (keep newest, keep oldest, etc).

Andy

http://www.sqlservercentral.com/columnists/awarren/

Andy
SQLAndy - My Blog!
Connect with me on LinkedIn
Follow me on Twitter

wmyip Grasshopper Points: 24 More actions · Answer 3

Hi,

Maybe you can try this if you want to avoid creating any temp table or using any cursor -->

SET ROWCOUNT 1

SELECT 'Start Removing Duplicate Record...' -- Don't remove, seed record

WHILE (@@ROWCOUNT>=1)

BEGIN

DELETE TABLE_1 WHERE COL_1 in ( SELECT COL_1 FROM TABLE_1 GROUP BY COL_1 HAVING COUNT(*) > 1)

END

SET ROWCOUNT 0

SELECT * FROM TABLE_1

Basically it just a WHILE..LOOP process to keep scanning TABLE_1, until there is not more duplicate record, @@ROWCOUNT will return 0 and the process will stop.

Another alternative/dirty way which I used to do that before I came out with this script is :

SELECT DISTINCT * INTO TEMP_TABLE FROM TABLE_1

TRUNCATE TABLE TABLE_1

INSERT INTO TABLE_1 SELECT * FROM TEMP_TABLE

I find this still the simplest way, but it required high IO resources.

Hope this help!!!

or you can reach me at :

ywmeng@rocketmail.com