Is This The Best Way to Remove Duplicate Records ?

Question

Is This The Best Way to Remove Duplicate Records ?

isuckatsql

SSChampion

Points: 10381
More actions
July 14, 2009 at 2:58 pm

#136518

http://support.microsoft.com/kb/139444
Thanks

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply

DCPeterson SSCoach Points: 15266 More actions · Answer 1

Well, that article describes several ways to approach the problem, but without knowing the details it's impossible to say.

That article addresses the situation where the duplicates are completely duplicates, more often in my experience, there will be an IDENTITY primary key and the other data will be duplicated. That situation can be easier to resolve, but depending on the business requirements may actually be much more difficult. It all depends on the specifics of your situation and the requirements you have to meet.

/*****************

If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

*****************/

LutzM SSC Guru Points: 107049 More actions · Answer 2

As always, the answer is "it depends".

If you refer to the first option in the article for deleting just one row I would agree that this is the best way.

What also need to be considered: Are you talking about "real duplicates" (every single column is identical) or about duplicates with different primary key.

You might want to have a look at this post to get an idea of an alternative (using rank function).

Note: please also read the posts following after the one I referred to.

You might want to search for "delete duplicates rank" on this site for a few more examples.

If you have trouble getting it to work please post DDL and sample data as described in the link in my signature.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

isuckatsql SSChampion Points: 10381 More actions · Answer 3

DECLARE @Table TABLE (id3 int, resume varchar(256), email1 varchar(256))

INSERT INTO @Table

SELECT 454382, 'testresume1', 'abc@hotmail.com' UNION ALL

SELECT 612832, 'testresume2', 'john@global.net' UNION ALL

SELECT 612833, 'testresume2', 'john@global.net' UNION ALL

SELECT 612834, 'testresume2', 'john@global.net'

SELECT * FROM @Table;

--SELECT MAX(id3) as id3, resume, email1

--FROM @Table

--GROUP BY resume, email1;

with Dups as

( SELECT *,row_number()

OVER

(partition by resume order by resume) as RowNum

FROM @Table)

Delete from Dups where rownum > 1;

SELECT * FROM @Table;

When i run the following query, I get:

id3resume email1

454382testresume1abc@hotmail.com

612832testresume2john@global.net

But i would like:

id3resume email1

454382testresume1abc@hotmail.com

612834testresume2john@global.net

I tried using the max(id3) with the dups code but it failed!

Thanks

Jack Corbett SSC Guru Points: 184394 More actions · Answer 4

Here's what you need:

with Dups

as (SELECT

*,

row_number() OVER (partition by resume order by resume, id3 desc) as RowNum

FROM

@Table)

Delete from

Dups

where

rownum > 1 ;

SELECT

*

FROM

@Table ;

I just added "id3 desc" to the order by.

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

isuckatsql SSChampion Points: 10381 More actions · Answer 5

isuckatsql

SSChampion

Points: 10381

July 15, 2009 at 9:14 am

#1025151

Many Thanks !

isuckatsql SSChampion Points: 10381 More actions · Answer 6

with Dups as

( SELECT *,row_number()

OVER

(partition by email1 order by email1, id3 desc) as RowNum

FROM outputresume3)

Delete from Dups where rownum > 1;

I ran the above code and am missing some records, when i do some before and after testing:(

Am i missing something obvious ?

Thanks

LutzM SSC Guru Points: 107049 More actions · Answer 7

ifila (7/15/2009)
...
I ran the above code and am missing some records, when i do some before and after testing:(
Thanks

Isn't that an expected effect when running a DELETE statement? :ermm:

If this statement deleted rows that shouldn't have been deleted from your point of view you should run the cte statement with an

"select top 100 * from dups" instead of "delete from dups where rownum>1"

This would display the (first 100) rows with the rownum column added so you could verify what rows would have been deleted (rownum>1) and the ones that wouldn't (rownum=1).

You also could post some sample data the way you already did (BTW: excellent job!!) supporting your post from above so we have something to test against.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 8

I always wonder at the words "Remove Duplicate Records"... from what? The original data as in a "Delete" or just from the output of a Select?

People really need to be specific about these types of things.

So... which way do you mean???

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

isuckatsql SSChampion Points: 10381 More actions · Answer 9

Delete original records from a SQL Server database.

Thanks

isuckatsql SSChampion Points: 10381 More actions · Answer 10

To be more specific, the records with the same email, since the whole record is not an exact duplicate. Each record key is still unique.

I am trying to generate some test data to duplicate the issue.

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 11

ifila (7/16/2009)
To be more specific, the records with the same email, since the whole record is not an exact duplicate. Each record key is still unique.
I am trying to generate some test data to duplicate the issue.

ifila (7/15/2009)
with Dups as
( SELECT *,row_number()
OVER
(partition by email1 order by email1, id3 desc) as RowNum
FROM outputresume3)
Delete from Dups where rownum > 1;
I ran the above code and am missing some records, when i do some before and after testing:(
Am i missing something obvious ?
Thanks

The code above looks correct. I'm just not sure what you mean by "missing some records" only because I cannot see what you see.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 12

I tried your code with the test data you previously posted. All seems well with the code. DECLARE @Table TABLE (id3 int, resume varchar(256), email1 varchar(256))

INSERT INTO @Table

SELECT 454382, 'testresume1', 'abc@hotmail.com' UNION ALL

SELECT 612832, 'testresume2', 'john@global.net' UNION ALL

SELECT 612833, 'testresume2', 'john@global.net' UNION ALL

SELECT 612834, 'testresume2', 'john@global.net'

;with Dups as

( SELECT *,row_number()

OVER

(partition by email1 order by email1, id3 desc) as RowNum

FROM @Table)

Delete from Dups where rownum > 1;

SELECT * FROM @Table

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

isuckatsql SSChampion Points: 10381 More actions · Answer 13

select distinct email1

from outputresume3

where (contains (originalresume, '"livelink"

and "java"

and "cobol"

and "j2ee"

and "asp"

and "sql"

and "unix"

and "oracle"

and "sharepoint"

and "weblogic"'))

When i run the above query on the database, the before results equal 24. The after delete results equal 21. I am trying to work out where my 3 extra results went.

Thanks

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 14

Heh... good luck... that doesn't help much on our end.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)