Find and Remove Duplicate Records SQL Server

  • Thanks!

    When I CAST the text field as a VARCHAR(MAX) I was able to locate the duplicate text field values.

    WITH cte AS (

    SELECT a.*,

    ROW_NUMBER() OVER(PARTITION BY [varcharfield],CAST([textfield] as VARCHAR(MAX)) ORDER BY CAST([textfield] as VARCHAR(MAX))) rrn

    FROM dbo.[table_with_dups] a

    )

    SELECT * FROM cte WHERE rrn > 1

  • The author's posted solution works and it DOESN'T DELETE ALL Duplicates. (If there are 3 dup rows it deletes 2 and leaves 1). So it works. Also I guess people posting the row_number() solution works only when we have a unique identifier in the table. But having a unique identifier obviously doesn't make two rows appear duplicate (no matter even if all other columns are same). The post is related to deleting duplicates when there is no unique identifier (i.e all columns values are same for more than one row)

  • venkataprasanth (2/1/2016)


    The author's posted solution works and it DOESN'T DELETE ALL Duplicates. (If there are 3 dup rows it deletes 2 and leaves 1). So it works. Also I guess people posting the row_number() solution works only when we have a unique identifier in the table. But having a unique identifier obviously doesn't make two rows appear duplicate (no matter even if all other columns are same). The post is related to deleting duplicates when there is no unique identifier (i.e all columns values are same for more than one row)

    What do you mean with having a unique identifier? The ROW_NUMBER option allows you to define what constitutes a duplicate (either one column, some columns or all the columns). The method shown in the article is bad in terms that it needs the table to have a single column that defines duplication and the worst part is that it deletes one row at a time using a deprecated option.

    I'm sorry that Ginger Keys got such a harsh reaction, but the alternatives are certainly much better.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (2/1/2016)


    venkataprasanth (2/1/2016)


    ..but the alternatives are certainly much better.

    I unnecessarily second the motion. :Whistling:

  • This is what I mean by the unique identifier using a ROW_NUMBER() methodology

    CASE 1:

    CREATE TABLE Duplicates(RowID INT IDENTITY (1,1), Col1 INT, Col2 VARCHAR(100))

    INSERT INTO Duplicates VALUES (100, 'ABC'),(200, 'ABC'),(300, 'BBB'),(300, 'BBB'),(200, 'CCC'),(400, 'EEE'),(400, 'EEE'),(400, 'EEE'),(500, 'DDA')

    DELETE FROM Duplicates FROM (SELECT *,ROW_NUMBER() OVER (Partition by col1,col2 order by col1) as RankCol1 FROM Duplicates) As T

    WHERE Duplicates.RowID IN (SELECT T.RowID WHERE T.RankCol1 >1)

    The RowID being the anchor column in deciding which rows can be deleted from the table that are formed by the duplicates of rest of the columns.

    CASE2: NOTE the RowID is replaced by CustID column which is not a unique identifier.

    CREATE TABLE Duplicates(CustID INT, Col1 INT, Col2 VARCHAR(100))

    INSERT INTO Duplicates VALUES (1, 100, 'ABC'),(2, 200, 'ABC'),(3, 300, 'BBB'),(3, 300, 'BBB'),(3, 300, 'BBB'),(4, 200, 'CCC'),(5, 400, 'EEE'),(5, 400, 'EEE'),(5, 400, 'EEE'),(6, 500, 'DDA')

    Can you achieve the delete functionality using the ROW_NUMBER() from the above example. I am guessing you cannot. You can post if you have an answer.

    So my point is CASE1 is not a table with duplicate rows as there is a unique identifier (even though col1 and col2 have duplicate values). But CASE2 is.

    You can delete the duplicate valued rows of col1 and col2 from CASE1 using ROW_NUMBER(). But cannot do this for CASE2.

  • The problem is that you're missing an important option that SQL Server offers: The possibility to modify data through CTEs and views as long as they follow certain rules (Check Updatable views).

    With that option, the code results in something easier to read and generate. Here's an example of your first case.

    --CASE 1:

    CREATE TABLE Duplicates(

    RowID INT IDENTITY (1,1),

    Col1 INT,

    Col2 VARCHAR(100));

    INSERT INTO Duplicates

    VALUES (100, 'ABC'),

    (200, 'ABC'),

    (300, 'BBB'),

    (300, 'BBB'),

    (200, 'CCC'),

    (400, 'EEE'),

    (400, 'EEE'),

    (400, 'EEE'),

    (500, 'DDA');

    --Validate the data before the delete

    SELECT * FROM Duplicates ORDER BY Col1, col2;

    /* --Original code (rearranged)

    DELETE FROM Duplicates

    FROM (SELECT *,

    ROW_NUMBER() OVER (Partition by col1,col2 order by col1) as RankCol1

    FROM Duplicates) As T

    WHERE Duplicates.RowID IN (SELECT T.RowID WHERE T.RankCol1 >1)

    */

    WITH CTE AS(

    SELECT *, ROW_NUMBER() OVER (Partition by col1,col2 --Use the columns that define what is a duplicate row

    order by RowID --Ability to define which row will be left

    ) as RowNum

    FROM Duplicates

    )

    DELETE FROM CTE

    WHERE RowNum > 1;

    --Validate the data after the delete

    SELECT * FROM Duplicates ORDER BY Col1, col2;

    GO

    --Clean my DB

    DROP TABLE Duplicates;

    And here's the second case you stated. I just needed to include the CustID column in the PARTITION BY to establish that it's also part of the duplicates definition.

    CREATE TABLE Duplicates(

    CustID INT,

    Col1 INT,

    Col2 VARCHAR(100));

    INSERT INTO Duplicates

    VALUES (1, 100, 'ABC'),

    (2, 200, 'ABC'),

    (3, 300, 'BBB'),

    (3, 300, 'BBB'),

    (3, 300, 'BBB'),

    (4, 200, 'CCC'),

    (5, 400, 'EEE'),

    (5, 400, 'EEE'),

    (5, 400, 'EEE'),

    (6, 500, 'DDA');

    SELECT * FROM Duplicates ORDER BY Col1, col2;

    WITH CTE AS(

    SELECT *, ROW_NUMBER() OVER (Partition by CustID, col1,col2 order by col1) as RowNum

    FROM Duplicates

    )

    DELETE FROM CTE

    WHERE RowNum > 1;

    SELECT * FROM Duplicates ORDER BY Col1, col2;

    GO

    --Clean my DB

    DROP TABLE Duplicates;

    No Loops, no RBAR, no deprecated features, no multiple FROM in the delete statement, just simple SQL code that can even be ported to a different RDBMS.

    Is this the best solution? It depends.

    With large data sets and few duplicates, a solution as simple as this might not perform well enough. Depending on different factors, changes should be done to the code. For most scenarios, this method should work just fine.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • I am not aware that you can do DML operations on CTEs. I learned this today. This works well too and certainly better than looping. Thank you for posting your answer.

  • My faforite method to clean-up duplicated rows is this (I used the example code from above):

    CREATE TABLE Duplicates(

    CustID INT,

    Col1 INT,

    Col2 VARCHAR(100));

    INSERT INTO Duplicates

    VALUES (1, 100, 'ABC'),

    (2, 200, 'ABC'),

    (3, 300, 'BBB'),

    (3, 300, 'BBB'),

    (3, 300, 'BBB'),

    (4, 200, 'CCC'),

    (5, 400, 'EEE'),

    (5, 400, 'EEE'),

    (5, 400, 'EEE'),

    (6, 500, 'DDA');

    SELECT * FROM Duplicates ORDER BY Col1, col2;

    delete from Duplicates where %%physloc%% not in (select max (%%physloc%%) from Duplicates group by CustID, Col1, Col2)

    SELECT * FROM Duplicates ORDER BY Col1, col2;

    GO

    --Clean my DB

    DROP TABLE Duplicates;

  • The article on deleting duplicates worked well for me. But, now every time I try to create the PK a new duplicate record is created. I'm not sure what's going on maybe it's a technical issue in SQL 2005?

    PHP Training in Chennai 🙂 | Pega Training in Chennai 🙂

  • I also thought it would delete all the duplicated records. However, when I tested it the code did work. The key is setting ROWCOUNT = 1 which means that all subsequent queries affect only 1 row.

    When I looked up setting ROWCOUNT there was a note stating "Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax. For more information, see TOP (Transact-SQL)."

    An advantage of using the "SELECT ROW_NUMBER() Over" option is that it would be set-based vs. working on only 1 row at a time and is less coding.

  • o.kaesmann (2/2/2016)


    My faforite method to clean-up duplicated rows is this (I used the example code from above):

    That's certainly interesting. I've never heard of %%physloc%% before and for a moment I thought it wasn't even t-sql code. I'm not sure I'd like to use it in production code, but it's something worth some research.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • But this method eliminates all duplicate records

    that is wrong

  • marquito_61 (2/2/2016)


    But this method eliminates all duplicate records

    that is wrong

    As far as I know, none of the code posted in the article of the forum eliminates all rows with duplicates, they all remove duplicates and leave the desired rows.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • It seems that the author is using SET ROWCOUNT 1 to limit the DELETE to just affecting one record at a time. Checking the MSDN documentation for SET ROWCOUNT it contains this warning:

    Important

    Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release of SQL Server. Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax.

    In light of this fact, you may want to come up with a better method for removing duplicates.

  • andy_111 (1/31/2016)


    Not really useful article.

    Not really a useful comment, either. Please explain why you think it's not useful.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 31 through 45 (of 76 total)

You must be logged in to reply to this topic. Login to reply