July 14, 2009 at 2:58 pm
July 14, 2009 at 3:27 pm
Well, that article describes several ways to approach the problem, but without knowing the details it's impossible to say.
That article addresses the situation where the duplicates are completely duplicates, more often in my experience, there will be an IDENTITY primary key and the other data will be duplicated. That situation can be easier to resolve, but depending on the business requirements may actually be much more difficult. It all depends on the specifics of your situation and the requirements you have to meet.
/*****************
If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek
*****************/
July 14, 2009 at 3:46 pm
As always, the answer is "it depends".
If you refer to the first option in the article for deleting just one row I would agree that this is the best way.
What also need to be considered: Are you talking about "real duplicates" (every single column is identical) or about duplicates with different primary key.
You might want to have a look at this post to get an idea of an alternative (using rank function).
Note: please also read the posts following after the one I referred to.
You might want to search for "delete duplicates rank" on this site for a few more examples.
If you have trouble getting it to work please post DDL and sample data as described in the link in my signature.
July 15, 2009 at 7:46 am
DECLARE @Table TABLE (id3 int, resume varchar(256), email1 varchar(256))
INSERT INTO @Table
SELECT 454382, 'testresume1', 'abc@hotmail.com' UNION ALL
SELECT 612832, 'testresume2', 'john@global.net' UNION ALL
SELECT 612833, 'testresume2', 'john@global.net' UNION ALL
SELECT 612834, 'testresume2', 'john@global.net'
SELECT * FROM @Table;
--SELECT MAX(id3) as id3, resume, email1
--FROM @Table
--GROUP BY resume, email1;
with Dups as
( SELECT *,row_number()
OVER
(partition by resume order by resume) as RowNum
FROM @Table)
Delete from Dups where rownum > 1;
SELECT * FROM @Table;
When i run the following query, I get:
id3resume email1
454382testresume1abc@hotmail.com
612832testresume2john@global.net
But i would like:
id3resume email1
454382testresume1abc@hotmail.com
612834testresume2john@global.net
I tried using the max(id3) with the dups code but it failed!
Thanks
July 15, 2009 at 9:00 am
Here's what you need:
with Dups
as (SELECT
*,
row_number() OVER (partition by resume order by resume, id3 desc) as RowNum
FROM
@Table)
Delete from
Dups
where
rownum > 1 ;
SELECT
*
FROM
@Table ;
I just added "id3 desc" to the order by.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
July 15, 2009 at 9:14 am
Many Thanks !
July 15, 2009 at 2:11 pm
with Dups as
( SELECT *,row_number()
OVER
(partition by email1 order by email1, id3 desc) as RowNum
FROM outputresume3)
Delete from Dups where rownum > 1;
I ran the above code and am missing some records, when i do some before and after testing:(
Am i missing something obvious ?
Thanks
July 15, 2009 at 3:24 pm
ifila (7/15/2009)
...I ran the above code and am missing some records, when i do some before and after testing:(
Thanks
Isn't that an expected effect when running a DELETE statement? :ermm:
If this statement deleted rows that shouldn't have been deleted from your point of view you should run the cte statement with an
"select top 100 * from dups" instead of "delete from dups where rownum>1"
This would display the (first 100) rows with the rownum column added so you could verify what rows would have been deleted (rownum>1) and the ones that wouldn't (rownum=1).
You also could post some sample data the way you already did (BTW: excellent job!!) supporting your post from above so we have something to test against.
July 15, 2009 at 9:58 pm
I always wonder at the words "Remove Duplicate Records"... from what? The original data as in a "Delete" or just from the output of a Select?
People really need to be specific about these types of things.
So... which way do you mean???
--Jeff Moden
Change is inevitable... Change for the better is not.
July 16, 2009 at 6:41 am
Delete original records from a SQL Server database.
Thanks
July 16, 2009 at 7:09 am
To be more specific, the records with the same email, since the whole record is not an exact duplicate. Each record key is still unique.
I am trying to generate some test data to duplicate the issue.
July 16, 2009 at 7:12 am
ifila (7/16/2009)
To be more specific, the records with the same email, since the whole record is not an exact duplicate. Each record key is still unique.I am trying to generate some test data to duplicate the issue.
ifila (7/15/2009)
with Dups as( SELECT *,row_number()
OVER
(partition by email1 order by email1, id3 desc) as RowNum
FROM outputresume3)
Delete from Dups where rownum > 1;
I ran the above code and am missing some records, when i do some before and after testing:(
Am i missing something obvious ?
Thanks
The code above looks correct. I'm just not sure what you mean by "missing some records" only because I cannot see what you see.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 16, 2009 at 7:18 am
I tried your code with the test data you previously posted. All seems well with the code. DECLARE @Table TABLE (id3 int, resume varchar(256), email1 varchar(256))
INSERT INTO @Table
SELECT 454382, 'testresume1', 'abc@hotmail.com' UNION ALL
SELECT 612832, 'testresume2', 'john@global.net' UNION ALL
SELECT 612833, 'testresume2', 'john@global.net' UNION ALL
SELECT 612834, 'testresume2', 'john@global.net'
;with Dups as
( SELECT *,row_number()
OVER
(partition by email1 order by email1, id3 desc) as RowNum
FROM @Table)
Delete from Dups where rownum > 1;
SELECT * FROM @Table
--Jeff Moden
Change is inevitable... Change for the better is not.
July 16, 2009 at 7:21 am
select distinct email1
from outputresume3
where (contains (originalresume, '"livelink"
and "java"
and "cobol"
and "j2ee"
and "asp"
and "sql"
and "unix"
and "oracle"
and "sharepoint"
and "weblogic"'))
When i run the above query on the database, the before results equal 24. The after delete results equal 21. I am trying to work out where my 3 extra results went.
Thanks
July 16, 2009 at 7:40 am
Heh... good luck... that doesn't help much on our end.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 1 through 15 (of 20 total)
You must be logged in to reply to this topic. Login to reply