Removing Duplicate Records

Question

Post reply

Removing Duplicate Records

endo64

Ten Centuries

Points: 1012
More actions
February 28, 2008 at 5:10 am

#179435

Comments posted to this topic are about the item Removing Duplicate Records

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

I cant let you do that Dave Right there with Babe Points: 783 More actions · Answer 1

Are you sure you don't mean

INSERT INTO phonebook SELECT

[phonenumber],

[firstname] ,

[lastname] ,

[company]

FROM #tmp

DROP TABLE #Tmp

at the end

Ted Manasa Default port Points: 1461 More actions · Answer 2

If the data set is large, and there are no FKs defined on the table, a faster method might be to SELECT INTO a physical table (which will create the table for you), drop the table with duplicate rows, then rename the table with the clean data set.

Ted

www.intermz.com | www.intermz.com/blog

---------------------------
|Ted Pin >>

endo64 Ten Centuries Points: 1012 More actions · Answer 3

DAvid (4/18/2008)
Are you sure you don't mean
INSERT INTO phonebook SELECT
[phonenumber],
[firstname] ,
[lastname] ,
[company]
FROM #tmp
DROP TABLE #Tmp
at the end

Sure. I fixed the artice now.

Thank you.

endo64 Ten Centuries Points: 1012 More actions · Answer 4

ted.pin (4/18/2008)
If the data set is large, and there are no FKs defined on the table, a faster method might be to SELECT INTO a physical table (which will create the table for you), drop the table with duplicate rows, then rename the table with the clean data set.
Ted
www.intermz.com | www.intermz.com/blog

SELECT INTO creates the table but with no indexes. Then how to remove duplicate records?

Ted Manasa Default port Points: 1461 More actions · Answer 5

Ah, you are right. You can add an IDENTITY column to the table after the SELECT INTO, which will act as the index. But by this time you might be better off adding the IDENTITY column to the source table (with dups), then INSERT INTO a new table like:

INSERT INTO Table_NoDups

(keyCol1, col1, col2)

SELECT keyCol1, col1, col2

FROM

Table_WithDups

WHERE identityCol IN

(SELECT MAX(identityCol) FROM Table_WithDups

GROUP BY keyCol1)

But by this point, I'm not sure what the performance gains would be... 😉

---------------------------
|Ted Pin >>

endo64 Ten Centuries Points: 1012 More actions · Answer 6

I'm using that method (creating a unique index) for already existing tables. I mean without changing the original table.

If you able to change / redesign your table you can add a unique index on multiple fields. So inserting duplicate records gives a warning like 'Duplicate key was ignored', but no error.

For example:

CREATE TABLE phonebook (

[phonenumber] [varchar] (30),

[firstname] [varchar] (30),

[lastname] [varchar] (30),

[company] [varchar] (100)

)

CREATE UNIQUE INDEX IX_Unique ON phonebook (firstname, lastname) WITH IGNORE_DUP_KEY

But do not forget that, ADO will take 'Duplicate key was ignored' warning as an error!

So you should take care of this in your application.

A VB6 example;

Const cnstError_DuplicateKeyWasIgnored As Long = 3604

'...

On Error GoTo ADO_ERROR

Call oCON.Execute(strAnInsertSql)

On Error GoTo E 'Normal error handler

'...

ADO_ERROR:

If oCON.Errors(0).NativeError = cnstError_DuplicateKeyWasIgnored Then

'Duplicate Key Was Ignored errors should be ignored

Resume Next

End If

toniupstny Hall of Fame Points: 3986 More actions · Answer 7

If there is another field in the table that would determine which of the duplicate records you want to keep (such as the oldest record) you could key off that to remove the duplicates.

/* Set up the test table */

Declare @phonenum table (phone varchar(10), pdate int)

insert

into @phonenum (phone,pdate)

select 1,1 union all

select 2,2 union all

select 3,3 union all

select 3,2 union all

select 4,2 union all

select 5,1 union all

select 5,7 union all

select 5,3 union all

select 5,1 union all

select 6,1

select 'original table', phone, pdate

from @phonenum

order by phone, pdate

/* copy the non-dups from the original table based on oldest date */

select phone,pdate

into #temptab

from

(select p.phone,p.pdate

from @phonenum p

join @phonenum p2

on p.phone = p2.phone

GROUP BY p.phone,p.pdate having p.pdate = min(p2.pdate)) t

select 'temporary table - non-dups with oldest date', phone, pdate from #temptab

order by phone

drop table #temptab

If you need another tiebreaker when the dates are also the same, add that to the selection criteria. If it doesn't matter if the dates are the same, this will just put one of the records out there.

It is always good to review the data before running any updates or deletes so you can see if there are any special cases to take into account.

Toni

toniupstny Hall of Fame Points: 3986 More actions · Answer 8

Oh... one more thing. If there are Null values in any of your comparison columns, then you would have to treat those separately (doing comparisons with Null is a bit tricky). For instance should a Null date be considered lower than another date?

This can lead to some intricate modifications such as this one...

select phone , pdate

into #temptab

from

(select p.phone,p.pdate

from @phonenum p

join @phonenum p2

on p.phone = p2.phone or p.phone is null

GROUP BY p.phone,p.pdate having p.pdate = min(p2.pdate) or p.pdate is null or p.phone is null) t

So you should think these kinds of situations through before making the final changes.

Toni

Peter E. Kierstead SSCarpal Tunnel Points: 4114 More actions · Answer 9

If the ratio of duplicate to unique phone numbers is "small" AND you have a unique row identifier on the table you might be better off doing something like this...

If Object_Id('tempdb.dbo.#Tmp') is not Null Drop table dbo.#Tmp

Create table dbo.#Tmp(RId Int Identity(1,1),Phone VarChar(7))

Insert dbo.#Tmp Values('1234567')

Insert dbo.#Tmp Values('9876543')

Insert dbo.#Tmp Values('1234567') -- dup

Insert dbo.#Tmp Values('9876543') -- dup

Select * from dbo.#Tmp

Delete #Tmp

from dbo.#Tmp

join dbo.#Tmp t2 on #Tmp.RId>t2.RId and #Tmp.Phone=t2.Phone

Select * from dbo.#Tmp

PeteK
I have CDO. It's like OCD but all the letters are in alphabetical order... as they should be.

Syed-201559 SSC-Addicted Points: 477 More actions · Answer 10

There is no need to create temporary tables, very helpful when deleting from large tables, you can delete the duplicate records using inner join and setting the rowcout to 1, the modified version of script.

CREATE TABLE #phonebook (

[phonenumber] [varchar] (30),

[firstname] [varchar] (30),

[lastname] [varchar] (30),

[company] [varchar] (100)

)

INSERT #phonebook SELECT '902','syed','iqbal','sm soft'

INSERT #phonebook SELECT '905','john','ksjl89','company llc'

INSERT #phonebook SELECT '909','joe','average','united'

-- Duplicate insert 1

INSERT #phonebook SELECT '902','syed','iqbal','sm soft'

INSERT #phonebook SELECT '905','john','ksjl89','company llc'

INSERT #phonebook SELECT '909','joe','average','united'

-- Duplicate insert 2

INSERT #phonebook SELECT '902','syed','iqbal','sm soft'

INSERT #phonebook SELECT '905','john','ksjl89','company llc'

INSERT #phonebook SELECT '909','joe','average','united'

--Show Duplicate Phonenumbers in Phonebook

SELECT phonenumber, COUNT(*) FROM #phonebook

GROUP BY phonenumber HAVING COUNT(*) > 1

ORDER BY COUNT(*) DESC

SET ROWCOUNT 1

SELECT @@rowcount

WHILE @@rowcount > 0

DELETE pb FROM #phonebook as pb

INNER JOIN

(SELECT phonenumber

FROM #phonebook

GROUP BY phonenumber HAVING count(*) > 1)

AS c ON c.phonenumber = pb.phonenumber

SET ROWCOUNT 0

SELECT * FROM #phonebook

DROP TABLE #phonebook

Kindest Regards,

Syed
Sr. SQL Server DBA

toniupstny Hall of Fame Points: 3986 More actions · Answer 11

I agree you do not really need the temporary table unless you want it to be able to review results before making final changes to the "live" database.

If you want to forego the temp table, below is a variation (without loops or temp tables) for removing duplicates if you have some other criteria to determine which duplicate to keep. If you have to pick off a completely identical record, then I guess you would have to do some sort of @@Rowcount function as shown in a previous posting.

If you do have criteria to pick out which record(s) to remove then the removal is a simple Delete statement. Again if the records are exact duplicates then it is a coin-toss to find which one to take out so you would need a rowcount, cursor kind of method, index of columns with dup key (as in the original article and some posts), or a select distinct into a temporary table following that method.

The Delete Statement (for duplicate phone values with differing pdate values)

DELETE p1

FROM @phonenum p1

INNER JOIN @phonenum p2

ON p1.phone = p2.phone and p1.pdate > p2.pdate

The Delete statement in test code

Declare @phonenum table (phone varchar(10), pdate int)

insert -- set up for test

into @phonenum (phone,pdate)

select '1',1 union all

select '2',2 union all

select '3',3 union all

select '3',2 union all

select '4',2 union all

select '5',1 union all

select '5',7 union all

select '5',3 union all

select '6',1

select 'original table', phone, pdate -- show before picture

from @phonenum

/* Remove the dups from the original table based on oldest date */

DELETE p1

FROM @phonenum p1

INNER JOIN @phonenum p2

ON p1.phone = p2.phone and p1.pdate > p2.pdate

select 'after delete', phone, pdate -- show after picture

from @phonenum

Toni

I cant let you do that Dave Right there with Babe Points: 783 More actions · Answer 12

I would be hesitant about deleting duplicates in the same statement that finds them. If the table was non trivial size (ie. you didnt want to hand list the record keys to be deleted)

Any long running scan of the table looking for duplicates could cascade its lock to table level.

Generated record deletes in small transactions would have less impact if the system was live.

endo64 Ten Centuries Points: 1012 More actions · Answer 13

And ofcourse you can create a CHECKSUM computed column which is calculated from all other fields. So you can determine all the duplicate records easily, they are all give the same checksum value.

But unfortunately CHECKSUM is 32 bit integers, so if you have too much records in your table it may give the same checksum value.

You can also use the

--create a checksum field to see duplicates

ALTER TABLE Phonebook

ADD chkValue AS checksum(first_name, last_name, phonenumber)

--select and/or delete duplicates

...

--create unique index so no more dup. can be inserted

CREATE INDEX chkUnique ON Phonebook (chkValue) WITH IGNORE_DUP_KEY

Don't forget, if you have thousands of records then there can be different rows that give the same checksum value.