Help needed with removing duplicates

Question

Post reply

Help needed with removing duplicates

tom-231504

Valued Member

Points: 74
More actions
May 9, 2005 at 10:53 am

#66418

Hi All
I have a table of approx 500,000 members which I need to dedupe only by email address. and i dont mind which records get removed. as long as there is a unique email address.
e.g.
tom, harrow, tom@tom.com, 30/09/1978, blue, £456
thomas, harrow, tom@tom.com, <NULL>, blue, £456
tom, harrow, tom@tom.com, 30/09/1978, <NULL>, <NULL>
I dont mind which record remains... as long as there are no duplicate email addresses.
I have tried a number of ways but none seem to work properly and the only effective method opnly keeps one field(email) i.e. creates a table with only distinct email addresses.
Any help hugely appreciated.
Thanks
Tom

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply

ChrisMoix-87856 SSCertifiable Points: 7288 More actions · Answer 1

There's a script that someone anonymous posted in the scripts section. Do a search on this site for remove duplicates:

Remove Duplicates from Table - No Cursors

This script deletes all duplicates from a table. It keeps the first instance of the record, and discards all others. If you want to keep the last instance of the record, use MAX instead of MIN.

The script can be modified very easily to include more/less fields for comparison. I've successfully used this SP to find dupes in over 100 fields. The only requirement is that the table must have a unique ID field.

tom-231504 Valued Member Points: 74 More actions · Answer 2

hey chris thanks for your help ... i found the script ... just having a little trouble unbdrestanding the pseudo code - is t1 a tablew which i need to ctreate first? my table is called customer do i replace MyTable with customer?

DELETE FROM

t1

FROM

MyTable t1

INNER JOIN

(

SELECT

MIN(FieldID) AS FieldID,

FieldWithDupes1,

FieldWithDupes2,

FieldWithDupes3

FROM

MyTable

GROUP BY

FieldWithDupes1,

FieldWithDupes2,

FieldWithDupes3

HAVING

COUNT(*) > 1

&nbsp t2

ON(

t1.FieldWithDupes1 = t2.FieldWithDupes1

AND t1.FieldWithDupes2 = t2.FieldWithDupes2

AND t1.FieldWithDupes3 = t2.FieldWithDupes3

AND t1.FieldID <> t2.FieldID

&nbsp

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 3

t1 and t2 are the same aliased table. The table you want to remove dups from.

tom-231504 Valued Member Points: 74 More actions · Answer 4

so do i need to creat t1 and t2 before i start?

DELETE FROM t1

FROM customer t1 INNER JOIN

(SELECT MIN(autoID) AS autoID, Email

FROM customer

GROUP BY Email

HAVING COUNT(*) > 1 t2 ON (t1.email = t2.email AND t1.email = t2.email AND t1.email = t2.email AND t1.AutoID <> t2.AutoID

does this look correct?

ChrisMoix-87856 SSCertifiable Points: 7288 More actions · Answer 5

I would try something like the following (sorry, I haven't had a chance to test it):

DELETE FROM t1

FROM customer t1

INNER JOIN

(SELECT MIN(autoID) AS autoID, Email

FROM customer

GROUP BY Email

HAVING COUNT(*) > 1) t2

ON(

t1.email = t2.email AND t1.AutoID <> t2.AutoID

)

Since you are just filtering on 1 field (email), I don't think you need those additional email fields in the last section (after ON). I couldn't make the formatting look any better, so SQL Server QA would probably complain 🙁

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 6

You don't need to create anything.

t1 and t2 are aliases for the table name.

It's the same as doing this :

Select * from dbo.SysObjects O1 inner join dbo.SysObjects O2 on O1.id = O2.id

tom-231504 Valued Member Points: 74 More actions · Answer 7

thanks chris that was very clear ...i executed this successfully but affected 0 records??

there are definately duplicates

any ideas?

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 8

How many records does this returns??

SELECT MIN(autoID) AS autoID, Email

FROM dbo.customer

GROUP BY Email

HAVING COUNT(*) > 1

noeld SSC Guru Points: 96590 More actions · Answer 9

In my opinion

>>SELECT MIN(autoID) AS autoID, Email

FROM dbo.customer

GROUP BY Email

HAVING COUNT(*) > 1 <<<

will return ALWAYS 0 records affected !

* Noel

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 10

Still don't see it...

what did I miss Noeld???

This returns a lot of records for me >>>

Select name, min(id) as AutoID, count(*) from dbo.SysColumns group by name having count(*) > 1

noeld SSC Guru Points: 96590 More actions · Answer 11

Well,

I GOOFED! my count(*) > 1 was count(*) > 11

then I copied an pasted the SAME query twice to change only the columns

Sorry for the confusion

* Noel

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 12

Hehe... I thaught I was going crazy missing something so simple in that query...

But that still doesn't tell us what's wrong with the delete statement. Any ideas?

Jeremy Wilhite SSC Rookie Points: 37 More actions · Answer 13

I noticed this thread because of the news letter!

delete

FROM Customer

WHERE (autoid >

(SELECT MIN(autoid)

FROM Customer t1

WHERE Customer.email = t1.email))

Mike Prascak SSC Eights! Points: 930 More actions · Answer 14

Tom, If you are having trouble understanding the SQL syntax in the scripts above, you definitely need some more training in this area. Get to it! You will benefit by the effort!

Another way (isn't there always) is to create a matching table structure and create a unique index on the email field with the "ignore duplicates" box checked. Then use DTS to append the data from your existing table into the new structure. This will prevent the duplicates from entering the table. If you incorporate this index in your table, it will stop this problem from happening again in the future. ( a very good thing! )

Mike