May 9, 2005 at 10:53 am
Hi All
I have a table of approx 500,000 members which I need to dedupe only by email address. and i dont mind which records get removed. as long as there is a unique email address.
e.g.
tom, harrow, tom@tom.com, 30/09/1978, blue, £456
thomas, harrow, tom@tom.com, <NULL>, blue, £456
tom, harrow, tom@tom.com, 30/09/1978, <NULL>, <NULL>
I dont mind which record remains... as long as there are no duplicate email addresses.
I have tried a number of ways but none seem to work properly and the only effective method opnly keeps one field(email) i.e. creates a table with only distinct email addresses.
Any help hugely appreciated.
Thanks
Tom
May 9, 2005 at 11:20 am
There's a script that someone anonymous posted in the scripts section. Do a search on this site for remove duplicates:
This script deletes all duplicates from a table. It keeps the first instance of the record, and discards all others. If you want to keep the last instance of the record, use MAX instead of MIN.
The script can be modified very easily to include more/less fields for comparison. I've successfully used this SP to find dupes in over 100 fields. The only requirement is that the table must have a unique ID field.
May 9, 2005 at 11:34 am
hey chris thanks for your help ... i found the script ... just having a little trouble unbdrestanding the pseudo code - is t1 a tablew which i need to ctreate first? my table is called customer do i replace MyTable with customer?
DELETE FROM
t1
FROM
MyTable t1
INNER JOIN
(
SELECT
MIN(FieldID) AS FieldID,
FieldWithDupes1,
FieldWithDupes2,
FieldWithDupes3
FROM
MyTable
GROUP BY
FieldWithDupes1,
FieldWithDupes2,
FieldWithDupes3
HAVING
COUNT(*) > 1
  t2
ON(
t1.FieldWithDupes1 = t2.FieldWithDupes1
AND t1.FieldWithDupes2 = t2.FieldWithDupes2
AND t1.FieldWithDupes3 = t2.FieldWithDupes3
AND t1.FieldID <> t2.FieldID
 
May 9, 2005 at 11:36 am
t1 and t2 are the same aliased table. The table you want to remove dups from.
May 9, 2005 at 11:53 am
so do i need to creat t1 and t2 before i start?
DELETE FROM t1
FROM customer t1 INNER JOIN
(SELECT MIN(autoID) AS autoID, Email
FROM customer
GROUP BY Email
HAVING COUNT(*) > 1 t2 ON (t1.email = t2.email AND t1.email = t2.email AND t1.email = t2.email AND t1.AutoID <> t2.AutoID
does this look correct?
May 9, 2005 at 12:09 pm
I would try something like the following (sorry, I haven't had a chance to test it):
DELETE FROM t1
FROM customer t1
INNER JOIN
(SELECT MIN(autoID) AS autoID, Email
FROM customer
GROUP BY Email
HAVING COUNT(*) > 1) t2
ON(
t1.email = t2.email AND t1.AutoID <> t2.AutoID
)
Since you are just filtering on 1 field (email), I don't think you need those additional email fields in the last section (after ON). I couldn't make the formatting look any better, so SQL Server QA would probably complain 🙁
May 9, 2005 at 12:11 pm
You don't need to create anything.
t1 and t2 are aliases for the table name.
It's the same as doing this :
Select * from dbo.SysObjects O1 inner join dbo.SysObjects O2 on O1.id = O2.id
May 9, 2005 at 12:26 pm
thanks chris that was very clear ...i executed this successfully but affected 0 records??
there are definately duplicates
any ideas?
May 9, 2005 at 12:32 pm
How many records does this returns??
SELECT MIN(autoID) AS autoID, Email
FROM dbo.customer
GROUP BY Email
HAVING COUNT(*) > 1
May 9, 2005 at 1:12 pm
In my opinion
>>SELECT MIN(autoID) AS autoID, Email
FROM dbo.customer
GROUP BY Email
HAVING COUNT(*) > 1 <<<
will return ALWAYS 0 records affected !
* Noel
May 9, 2005 at 1:26 pm
Still don't see it...
what did I miss Noeld???
This returns a lot of records for me >>>
Select name, min(id) as AutoID, count(*) from dbo.SysColumns group by name having count(*) > 1
May 9, 2005 at 2:19 pm
Well,
I GOOFED! my count(*) > 1 was count(*) > 11
then I copied an pasted the SAME query twice to change only the columns
Sorry for the confusion
* Noel
May 9, 2005 at 2:36 pm
Hehe... I thaught I was going crazy missing something so simple in that query...
But that still doesn't tell us what's wrong with the delete statement. Any ideas?
May 10, 2005 at 6:56 am
I noticed this thread because of the news letter!
delete
FROM Customer
WHERE (autoid >
(SELECT MIN(autoid)
FROM Customer t1
WHERE Customer.email = t1.email))
May 10, 2005 at 7:48 am
Tom, If you are having trouble understanding the SQL syntax in the scripts above, you definitely need some more training in this area. Get to it! You will benefit by the effort!
Another way (isn't there always) is to create a matching table structure and create a unique index on the email field with the "ignore duplicates" box checked. Then use DTS to append the data from your existing table into the new structure. This will prevent the duplicates from entering the table. If you incorporate this index in your table, it will stop this problem from happening again in the future. ( a very good thing! )
Mike
Viewing 15 posts - 1 through 15 (of 17 total)
You must be logged in to reply to this topic. Login to reply