Find and Remove Duplicate Records SQL Server

Question

Post reply

Find and Remove Duplicate Records SQL Server

Viewing 15 posts - 31 through 45 (of 76 total)

You must be logged in to reply to this topic. Login to reply

bob_chang Grasshopper Points: 12 More actions · Answer 1

Thanks!

When I CAST the text field as a VARCHAR(MAX) I was able to locate the duplicate text field values.

WITH cte AS (

SELECT a.*,

ROW_NUMBER() OVER(PARTITION BY [varcharfield],CAST([textfield] as VARCHAR(MAX)) ORDER BY CAST([textfield] as VARCHAR(MAX))) rrn

FROM dbo.[table_with_dups] a

)

SELECT * FROM cte WHERE rrn > 1

venkataprasanth SSC Veteran Points: 207 More actions · Answer 2

The author's posted solution works and it DOESN'T DELETE ALL Duplicates. (If there are 3 dup rows it deletes 2 and leaves 1). So it works. Also I guess people posting the row_number() solution works only when we have a unique identifier in the table. But having a unique identifier obviously doesn't make two rows appear duplicate (no matter even if all other columns are same). The post is related to deleting duplicates when there is no unique identifier (i.e all columns values are same for more than one row)

Luis Cazares SSC Guru Points: 183706 More actions · Answer 3

venkataprasanth (2/1/2016)
The author's posted solution works and it DOESN'T DELETE ALL Duplicates. (If there are 3 dup rows it deletes 2 and leaves 1). So it works. Also I guess people posting the row_number() solution works only when we have a unique identifier in the table. But having a unique identifier obviously doesn't make two rows appear duplicate (no matter even if all other columns are same). The post is related to deleting duplicates when there is no unique identifier (i.e all columns values are same for more than one row)

What do you mean with having a unique identifier? The ROW_NUMBER option allows you to define what constitutes a duplicate (either one column, some columns or all the columns). The method shown in the article is bad in terms that it needs the table to have a single column that defines duplication and the worst part is that it deletes one row at a time using a deprecated option.

I'm sorry that Ginger Keys got such a harsh reaction, but the alternatives are certainly much better.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

INCREDIBLEmouse SSC Eights! Points: 835 More actions · Answer 4

Luis Cazares (2/1/2016)

venkataprasanth (2/1/2016)
..but the alternatives are certainly much better.

I unnecessarily second the motion. :Whistling:

venkataprasanth SSC Veteran Points: 207 More actions · Answer 5

This is what I mean by the unique identifier using a ROW_NUMBER() methodology

CASE 1:

CREATE TABLE Duplicates(RowID INT IDENTITY (1,1), Col1 INT, Col2 VARCHAR(100))

INSERT INTO Duplicates VALUES (100, 'ABC'),(200, 'ABC'),(300, 'BBB'),(300, 'BBB'),(200, 'CCC'),(400, 'EEE'),(400, 'EEE'),(400, 'EEE'),(500, 'DDA')

DELETE FROM Duplicates FROM (SELECT *,ROW_NUMBER() OVER (Partition by col1,col2 order by col1) as RankCol1 FROM Duplicates) As T

WHERE Duplicates.RowID IN (SELECT T.RowID WHERE T.RankCol1 >1)

The RowID being the anchor column in deciding which rows can be deleted from the table that are formed by the duplicates of rest of the columns.

CASE2: NOTE the RowID is replaced by CustID column which is not a unique identifier.

CREATE TABLE Duplicates(CustID INT, Col1 INT, Col2 VARCHAR(100))

INSERT INTO Duplicates VALUES (1, 100, 'ABC'),(2, 200, 'ABC'),(3, 300, 'BBB'),(3, 300, 'BBB'),(3, 300, 'BBB'),(4, 200, 'CCC'),(5, 400, 'EEE'),(5, 400, 'EEE'),(5, 400, 'EEE'),(6, 500, 'DDA')

Can you achieve the delete functionality using the ROW_NUMBER() from the above example. I am guessing you cannot. You can post if you have an answer.

So my point is CASE1 is not a table with duplicate rows as there is a unique identifier (even though col1 and col2 have duplicate values). But CASE2 is.

You can delete the duplicate valued rows of col1 and col2 from CASE1 using ROW_NUMBER(). But cannot do this for CASE2.

Luis Cazares SSC Guru Points: 183706 More actions · Answer 6

The problem is that you're missing an important option that SQL Server offers: The possibility to modify data through CTEs and views as long as they follow certain rules (Check Updatable views).

With that option, the code results in something easier to read and generate. Here's an example of your first case.

--CASE 1:

CREATE TABLE Duplicates(

RowID INT IDENTITY (1,1),

Col1 INT,

Col2 VARCHAR(100));

INSERT INTO Duplicates

VALUES (100, 'ABC'),

(200, 'ABC'),

(300, 'BBB'),

(200, 'CCC'),

(400, 'EEE'),

(500, 'DDA');

--Validate the data before the delete

SELECT * FROM Duplicates ORDER BY Col1, col2;

/* --Original code (rearranged)

DELETE FROM Duplicates

FROM (SELECT *,

ROW_NUMBER() OVER (Partition by col1,col2 order by col1) as RankCol1

FROM Duplicates) As T

WHERE Duplicates.RowID IN (SELECT T.RowID WHERE T.RankCol1 >1)

*/

WITH CTE AS(

SELECT *, ROW_NUMBER() OVER (Partition by col1,col2 --Use the columns that define what is a duplicate row

order by RowID --Ability to define which row will be left

) as RowNum

FROM Duplicates

)

DELETE FROM CTE

WHERE RowNum > 1;

--Validate the data after the delete

SELECT * FROM Duplicates ORDER BY Col1, col2;

GO

--Clean my DB

DROP TABLE Duplicates;

And here's the second case you stated. I just needed to include the CustID column in the PARTITION BY to establish that it's also part of the duplicates definition.

CREATE TABLE Duplicates(

CustID INT,

Col1 INT,

Col2 VARCHAR(100));

INSERT INTO Duplicates

VALUES (1, 100, 'ABC'),

(2, 200, 'ABC'),

(3, 300, 'BBB'),

(4, 200, 'CCC'),

(5, 400, 'EEE'),

(6, 500, 'DDA');

SELECT * FROM Duplicates ORDER BY Col1, col2;

WITH CTE AS(

SELECT *, ROW_NUMBER() OVER (Partition by CustID, col1,col2 order by col1) as RowNum

FROM Duplicates

)

DELETE FROM CTE

WHERE RowNum > 1;

SELECT * FROM Duplicates ORDER BY Col1, col2;

GO

--Clean my DB

DROP TABLE Duplicates;

No Loops, no RBAR, no deprecated features, no multiple FROM in the delete statement, just simple SQL code that can even be ported to a different RDBMS.

Is this the best solution? It depends.

With large data sets and few duplicates, a solution as simple as this might not perform well enough. Depending on different factors, changes should be done to the code. For most scenarios, this method should work just fine.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

venkataprasanth SSC Veteran Points: 207 More actions · Answer 7

I am not aware that you can do DML operations on CTEs. I learned this today. This works well too and certainly better than looping. Thank you for posting your answer.

o.kaesmann SSC Enthusiast Points: 111 More actions · Answer 8

My faforite method to clean-up duplicated rows is this (I used the example code from above):

CREATE TABLE Duplicates(

CustID INT,

Col1 INT,

Col2 VARCHAR(100));

INSERT INTO Duplicates

VALUES (1, 100, 'ABC'),

(2, 200, 'ABC'),

(3, 300, 'BBB'),

(4, 200, 'CCC'),

(5, 400, 'EEE'),

(6, 500, 'DDA');

SELECT * FROM Duplicates ORDER BY Col1, col2;

delete from Duplicates where %%physloc%% not in (select max (%%physloc%%) from Duplicates group by CustID, Col1, Col2)

SELECT * FROM Duplicates ORDER BY Col1, col2;

GO

--Clean my DB

DROP TABLE Duplicates;

acemary4918 Newbie Points: 3 More actions · Answer 9

The article on deleting duplicates worked well for me. But, now every time I try to create the PK a new duplicate record is created. I'm not sure what's going on maybe it's a technical issue in SQL 2005?

PHP Training in Chennai 🙂 | Pega Training in Chennai 🙂

PHP Training in Chennai 🙂

dbuckley-maidt SSC Rookie Points: 42 More actions · Answer 10

I also thought it would delete all the duplicated records. However, when I tested it the code did work. The key is setting ROWCOUNT = 1 which means that all subsequent queries affect only 1 row.

When I looked up setting ROWCOUNT there was a note stating "Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax. For more information, see TOP (Transact-SQL)."

An advantage of using the "SELECT ROW_NUMBER() Over" option is that it would be set-based vs. working on only 1 row at a time and is less coding.

Luis Cazares SSC Guru Points: 183706 More actions · Answer 11

o.kaesmann (2/2/2016)
My faforite method to clean-up duplicated rows is this (I used the example code from above):

That's certainly interesting. I've never heard of %%physloc%% before and for a moment I thought it wasn't even t-sql code. I'm not sure I'd like to use it in production code, but it's something worth some research.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

marquito_61 Valued Member Points: 50 More actions · Answer 12

But this method eliminates all duplicate records

that is wrong

Luis Cazares SSC Guru Points: 183706 More actions · Answer 13

marquito_61 (2/2/2016)
But this method eliminates all duplicate records
that is wrong

As far as I know, none of the code posted in the article of the forum eliminates all rows with duplicates, they all remove duplicates and leave the desired rows.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

kenreg SSC Rookie Points: 31 More actions · Answer 14

It seems that the author is using SET ROWCOUNT 1 to limit the DELETE to just affecting one record at a time. Checking the MSDN documentation for SET ROWCOUNT it contains this warning:

Important

Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release of SQL Server. Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax.

In light of this fact, you may want to come up with a better method for removing duplicates.

Jeff Moden SSC Guru Points: 1004405 More actions · Answer 15

andy_111 (1/31/2016)
Not really useful article.

Not really a useful comment, either. Please explain why you think it's not useful.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)