February 2, 2016 at 12:33 pm
This will delete all of the records that have more than 1 occurrence, including the original.
February 2, 2016 at 12:49 pm
John Donaldson (2/2/2016)
This will delete all of the records that have more than 1 occurrence, including the original.
I guess you didn't try the code. Here, let me give you the opportunity to run it by giving you sample data and the code all in one.
CREATE TABLE Duplicates(
CustID INT,
Col1 INT,
Col2 VARCHAR(100));
INSERT INTO Duplicates
VALUES (1, 100, 'ABC'),
(2, 200, 'ABC'),
(3, 300, 'BBB'),
(3, 300, 'BBB'),
(3, 300, 'BBB'),
(4, 200, 'CCC'),
(5, 400, 'EEE'),
(5, 400, 'EEE'),
(5, 400, 'EEE'),
(6, 500, 'DDA');
SELECT * FROM Duplicates ORDER BY Col1, col2;
SET NOCOUNT ON
SET ROWCOUNT 1
WHILE 1 = 1
BEGIN
DELETE
FROM Duplicates
WHERE Col1 IN
(SELECT Col1
FROM Duplicates
GROUP BY Col1
HAVING COUNT(*) > 1)
IF @@Rowcount = 0
BREAK ;
END
SET ROWCOUNT 0
SELECT * FROM Duplicates ORDER BY Col1, col2;
GO
--Clean my DB
DROP TABLE Duplicates;
This technique is bad because it relies on a deprecated feature (SET ROWCOUNT limiting the rows affected by DML) and deletes one row at a time, but it actually does what it promises to do.
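To see concretely that the loop leaves one row per group, here is a minimal sketch of the same row-at-a-time idea, assuming Python with SQLite rather than SQL Server (a LIMIT 1 subquery on rowid stands in for the deprecated SET ROWCOUNT 1):

```python
import sqlite3

# Same sample data as the T-SQL above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Duplicates(CustID INT, Col1 INT, Col2 TEXT);
    INSERT INTO Duplicates VALUES
        (1, 100, 'ABC'), (2, 200, 'ABC'),
        (3, 300, 'BBB'), (3, 300, 'BBB'), (3, 300, 'BBB'),
        (4, 200, 'CCC'),
        (5, 400, 'EEE'), (5, 400, 'EEE'), (5, 400, 'EEE'),
        (6, 500, 'DDA');
""")

while True:
    # Delete exactly one row from any Col1 group that still has duplicates.
    cur = conn.execute("""
        DELETE FROM Duplicates
        WHERE rowid = (SELECT rowid FROM Duplicates
                       WHERE Col1 IN (SELECT Col1 FROM Duplicates
                                      GROUP BY Col1
                                      HAVING COUNT(*) > 1)
                       LIMIT 1)
    """)
    if cur.rowcount == 0:
        break  # no group has more than one row left

rows = conn.execute(
    "SELECT Col1, COUNT(*) FROM Duplicates GROUP BY Col1").fetchall()
print(rows)  # every Col1 value now appears exactly once
```

As in the T-SQL, once a group is down to one row the HAVING clause excludes it, so the last "original" in each group survives.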
February 2, 2016 at 1:20 pm
Luis Cazares (2/1/2016)
The problem is that you're missing an important option that SQL Server offers: the ability to modify data through CTEs and views, as long as they follow certain rules (check updatable views). With that option, the code becomes easier to read and generate. Here's an example of your first case.
--CASE 1:
CREATE TABLE Duplicates(
RowID INT IDENTITY (1,1),
Col1 INT,
Col2 VARCHAR(100));
INSERT INTO Duplicates
VALUES (100, 'ABC'),
(200, 'ABC'),
(300, 'BBB'),
(300, 'BBB'),
(200, 'CCC'),
(400, 'EEE'),
(400, 'EEE'),
(400, 'EEE'),
(500, 'DDA');
--Validate the data before the delete
SELECT * FROM Duplicates ORDER BY Col1, col2;
/* --Original code (rearranged)
DELETE FROM Duplicates
FROM (SELECT *,
ROW_NUMBER() OVER (Partition by col1,col2 order by col1) as RankCol1
FROM Duplicates) As T
WHERE Duplicates.RowID IN (SELECT T.RowID WHERE T.RankCol1 >1)
*/
WITH CTE AS(
SELECT *, ROW_NUMBER() OVER (Partition by col1,col2 --Use the columns that define what is a duplicate row
order by RowID --Ability to define which row will be left
) as RowNum
FROM Duplicates
)
DELETE FROM CTE
WHERE RowNum > 1;
--Validate the data after the delete
SELECT * FROM Duplicates ORDER BY Col1, col2;
GO
--Clean my DB
DROP TABLE Duplicates;
And here's the second case you stated. I just needed to include the CustID column in the PARTITION BY to establish that it's also part of the duplicates definition.
CREATE TABLE Duplicates(
CustID INT,
Col1 INT,
Col2 VARCHAR(100));
INSERT INTO Duplicates
VALUES (1, 100, 'ABC'),
(2, 200, 'ABC'),
(3, 300, 'BBB'),
(3, 300, 'BBB'),
(3, 300, 'BBB'),
(4, 200, 'CCC'),
(5, 400, 'EEE'),
(5, 400, 'EEE'),
(5, 400, 'EEE'),
(6, 500, 'DDA');
SELECT * FROM Duplicates ORDER BY Col1, col2;
WITH CTE AS(
SELECT *, ROW_NUMBER() OVER (Partition by CustID, col1,col2 order by col1) as RowNum
FROM Duplicates
)
DELETE FROM CTE
WHERE RowNum > 1;
SELECT * FROM Duplicates ORDER BY Col1, col2;
GO
--Clean my DB
DROP TABLE Duplicates;
No loops, no RBAR, no deprecated features, no multiple FROM clauses in the delete statement, just simple SQL code that can even be ported to a different RDBMS.
Is this the best solution? It depends.
With large data sets and few duplicates, a solution as simple as this might not perform well enough, and depending on various factors the code might need changes. For most scenarios, though, this method should work just fine.
Thank you, Luis, for posting an approach that makes sense, is flexible enough to include multiple columns of multiple types, and is performant.
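For readers without a SQL Server instance handy, the ROW_NUMBER() partitioning quoted above can be sketched in SQLite via Python. Note this is a port under assumptions: SQLite cannot DELETE through a CTE the way SQL Server can, so the windowed query feeds a rowid list instead, but the duplicate-defining logic is identical.

```python
import sqlite3

# Case 1 data from the thread: RowID is a surrogate key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Duplicates(RowID INTEGER PRIMARY KEY, Col1 INT, Col2 TEXT);
    INSERT INTO Duplicates(Col1, Col2) VALUES
        (100, 'ABC'), (200, 'ABC'),
        (300, 'BBB'), (300, 'BBB'),
        (200, 'CCC'),
        (400, 'EEE'), (400, 'EEE'), (400, 'EEE'),
        (500, 'DDA');
""")

conn.execute("""
    DELETE FROM Duplicates
    WHERE RowID IN (
        SELECT RowID
        FROM (SELECT RowID,
                     ROW_NUMBER() OVER (
                         PARTITION BY Col1, Col2  -- columns that define a duplicate
                         ORDER BY RowID           -- which row is kept (lowest RowID)
                     ) AS RowNum
              FROM Duplicates)
        WHERE RowNum > 1)
""")

remaining = conn.execute(
    "SELECT Col1, Col2 FROM Duplicates ORDER BY Col1, Col2").fetchall()
print(remaining)  # one row per distinct (Col1, Col2) pair
```

Swapping the PARTITION BY column list is all it takes to change what counts as a duplicate, mirroring the second case in the post.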
February 2, 2016 at 1:23 pm
Luis Cazares (2/2/2016)
o.kaesmann (2/2/2016)
My favorite method to clean up duplicated rows is this (I used the example code from above):
That's certainly interesting. I've never heard of %%physloc%% before, and for a moment I thought it wasn't even T-SQL code. I'm not sure I'd like to use it in production code, but it's something worth some research.
It's a physical row locator. It's been around a while. Paul did a post on it at http://www.sqlskills.com/blogs/paul/sql-server-2008-new-undocumented-physical-row-locator-function/. Yes, it's real.
February 2, 2016 at 1:31 pm
Ed Wagner (2/2/2016)
Luis Cazares (2/2/2016)
o.kaesmann (2/2/2016)
My favorite method to clean up duplicated rows is this (I used the example code from above):
That's certainly interesting. I've never heard of %%physloc%% before, and for a moment I thought it wasn't even T-SQL code. I'm not sure I'd like to use it in production code, but it's something worth some research.
It's a physical row locator. It's been around a while. Paul did a post on it at http://www.sqlskills.com/blogs/paul/sql-server-2008-new-undocumented-physical-row-locator-function/. Yes, it's real.
Yes, I tested it after finding Paul's post. The question is: would you use it in production code?
February 2, 2016 at 1:47 pm
Luis Cazares (2/2/2016)
Ed Wagner (2/2/2016)
Luis Cazares (2/2/2016)
o.kaesmann (2/2/2016)
My favorite method to clean up duplicated rows is this (I used the example code from above):
That's certainly interesting. I've never heard of %%physloc%% before, and for a moment I thought it wasn't even T-SQL code. I'm not sure I'd like to use it in production code, but it's something worth some research.
It's a physical row locator. It's been around a while. Paul did a post on it at http://www.sqlskills.com/blogs/paul/sql-server-2008-new-undocumented-physical-row-locator-function/. Yes, it's real.
Yes, I tested it after finding Paul's post. The question is: would you use it in production code?
I don't see a use for production code, but maybe for troubleshooting internals.
February 2, 2016 at 10:50 pm
I do use it in a production environment, because we get data from other systems that contains duplicated data (don't ask ....), and it works fine.
In the past I've worked on Oracle and learned to eliminate duplicates by using ROWID. I searched for something equivalent in MS-SQL and found this physloc thing.
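The Oracle ROWID pattern described here has a close analogue in SQLite's implicit rowid, which can be sketched via Python. This is only a loose parallel (%%physloc%% is a physical locator, SQLite's rowid a logical key), but the dedup shape is the same: keep the smallest rowid per duplicate group, delete the rest.

```python
import sqlite3

# Fully duplicated rows: no column distinguishes the copies,
# so only a row locator can.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Duplicates(CustID INT, Col1 INT, Col2 TEXT);
    INSERT INTO Duplicates VALUES
        (3, 300, 'BBB'), (3, 300, 'BBB'), (3, 300, 'BBB'),
        (5, 400, 'EEE'), (5, 400, 'EEE'),
        (6, 500, 'DDA');
""")

# Keep the lowest rowid in each (CustID, Col1, Col2) group.
conn.execute("""
    DELETE FROM Duplicates
    WHERE rowid NOT IN (SELECT MIN(rowid)
                        FROM Duplicates
                        GROUP BY CustID, Col1, Col2)
""")

rows = conn.execute("SELECT CustID, Col1, Col2 FROM Duplicates").fetchall()
print(rows)
```

In Oracle the same statement works with ROWID; in SQL Server, %%physloc%% is undocumented, which is the crux of the production-use question in this thread.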
February 3, 2016 at 2:09 am
Not sure if I am missing something here, but why can't the following be used?
delete Customers
where CustID NOT in (
select min(CustID)
from Customers
group by CustName
)
EDIT: I have subsequently realised (following responses) that CustID is not a PK or covered by any other unique constraint, so I'm aware this will not work in this instance.
February 3, 2016 at 5:24 am
SQLian (2/3/2016)
Not sure if I am missing something here, but why can't the following be used?
delete Customers
where CustID NOT in (
select min(CustID)
from Customers
group by CustName
)
Because that will delete all the non-duplicate customers as well.
February 3, 2016 at 7:24 am
o.kaesmann (2/2/2016)
I do use it in a production environment, because we get data from other systems that contains duplicated data (don't ask ....), and it works fine.
In the past I've worked on Oracle and learned to eliminate duplicates by using ROWID. I searched for something equivalent in MS-SQL and found this physloc thing.
I believe that Oracle's ROWID is the fastest mechanism to access a specific row in a table. Do you know if the same is true of MS-SQL?
February 3, 2016 at 7:37 am
Brandie Tarvin (2/3/2016)
SQLian (2/3/2016)
Not sure if I am missing something here, but why can't the following be used?
delete Customers
where CustID NOT in (
select min(CustID)
from Customers
group by CustName
)
Because that will delete all the non-duplicate customers as well.
No, it won't. But it won't work with rows that are complete duplicates or with tables with composite keys.
February 3, 2016 at 7:58 am
Yes, my bad as I failed to read the original question properly. I assumed wrongly that CustId was a PK.
I'll return to my cave.
February 3, 2016 at 9:07 am
SQLian (2/3/2016)
Yes, my bad as I failed to read the original question properly. I wrongly assumed that CustId was a PK. I'll return to my cave.
Your approach is certainly better than the one proposed in the article. It just won't work in certain cases, but it's not bad per se.
February 3, 2016 at 9:50 am
Luis Cazares (2/3/2016)
Brandie Tarvin (2/3/2016)
SQLian (2/3/2016)
Not sure if I am missing something here, but why can't the following be used?
delete [FROM] Customers
where CustID NOT in (
select min(CustID)
from Customers
group by CustName
)
--Brandie added FROM clause
Because that will delete all the non-duplicate customers as well.
No, it won't. But it won't work with rows that are complete duplicates or with tables with composite keys.
Maybe I'm misreading, but SELECT MIN(CustID) would select the minimum customer ID (say 1) and then delete everything else, wouldn't it?
If not, what am I missing?
February 3, 2016 at 9:56 am
Brandie Tarvin (2/3/2016)
Luis Cazares (2/3/2016)
Brandie Tarvin (2/3/2016)
SQLian (2/3/2016)
Not sure if I am missing something here, but why can't the following be used?
delete [FROM] Customers
where CustID NOT in (
select min(CustID)
from Customers
group by CustName
)
--Brandie added FROM clause
Because that will delete all the non-duplicate customers as well.
No, it won't. But it won't work with rows that are complete duplicates or with tables with composite keys.
Maybe I'm misreading, but SELECT MIN(CustID) would select the minimum customer ID (say 1) and then delete everything else, wouldn't it?
If not, what am I missing?
You're missing the GROUP BY CustName. It will leave one CustID for each CustName. For unique names it won't do anything, and for duplicate names it will delete the higher CustIDs.
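A quick sketch of that behavior, assuming Python with SQLite and the hypothetical Customers/CustName schema from this exchange (CustID must be unique, as the thread established):

```python
import sqlite3

# 'Bob' is duplicated; 'Alice' and 'Carol' are unique names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers(CustID INTEGER PRIMARY KEY, CustName TEXT);
    INSERT INTO Customers VALUES
        (1, 'Alice'), (2, 'Bob'), (3, 'Bob'), (4, 'Carol'), (5, 'Bob');
""")

# GROUP BY CustName keeps one MIN(CustID) per name, so unique names
# always survive and only the higher CustIDs of duplicated names go.
conn.execute("""
    DELETE FROM Customers
    WHERE CustID NOT IN (SELECT MIN(CustID)
                         FROM Customers
                         GROUP BY CustName)
""")

survivors = conn.execute(
    "SELECT CustID, CustName FROM Customers ORDER BY CustID").fetchall()
print(survivors)  # [(1, 'Alice'), (2, 'Bob'), (4, 'Carol')]
```

This confirms both points above: nothing beyond the duplicates is deleted, but the approach needs a unique CustID and fails on fully duplicated rows.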
Viewing 15 posts - 46 through 60 (of 76 total)