Is This The Best Way to Remove Duplicate Records ?

  • http://support.microsoft.com/kb/139444

    Thanks

  • Well, that article describes several ways to approach the problem, but without knowing the details it's impossible to say.

    That article addresses the situation where the duplicates are completely duplicates, more often in my experience, there will be an IDENTITY primary key and the other data will be duplicated. That situation can be easier to resolve, but depending on the business requirements may actually be much more difficult. It all depends on the specifics of your situation and the requirements you have to meet.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • As always, the answer is "it depends".

    If you refer to the first option in the article for deleting just one row I would agree that this is the best way.

    What also need to be considered: Are you talking about "real duplicates" (every single column is identical) or about duplicates with different primary key.

    You might want to have a look at this post to get an idea of an alternative (using rank function).

    Note: please also read the posts following after the one I referred to.

    You might want to search for "delete duplicates rank" on this site for a few more examples.

    If you have trouble getting it to work please post DDL and sample data as described in the link in my signature.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • DECLARE @Table TABLE (id3 int, resume varchar(256), email1 varchar(256))

    INSERT INTO @Table

    SELECT 454382, 'testresume1', 'abc@hotmail.com' UNION ALL

    SELECT 612832, 'testresume2', 'john@global.net' UNION ALL

    SELECT 612833, 'testresume2', 'john@global.net' UNION ALL

    SELECT 612834, 'testresume2', 'john@global.net'

    SELECT * FROM @Table;

    --SELECT MAX(id3) as id3, resume, email1

    --FROM @Table

    --GROUP BY resume, email1;

    with Dups as

    ( SELECT *,row_number()

    OVER

    (partition by resume order by resume) as RowNum

    FROM @Table)

    Delete from Dups where rownum > 1;

    SELECT * FROM @Table;

    When i run the following query, I get:

    id3resume email1

    454382testresume1abc@hotmail.com

    612832testresume2john@global.net

    But i would like:

    id3resume email1

    454382testresume1abc@hotmail.com

    612834testresume2john@global.net

    I tried using the max(id3) with the dups code but it failed!

    Thanks

  • Here's what you need:

    with Dups

    as (SELECT

    *,

    row_number() OVER (partition by resume order by resume, id3 desc) as RowNum

    FROM

    @Table)

    Delete from

    Dups

    where

    rownum > 1 ;

    SELECT

    *

    FROM

    @Table ;

    I just added "id3 desc" to the order by.

  • Many Thanks !

  • with Dups as

    ( SELECT *,row_number()

    OVER

    (partition by email1 order by email1, id3 desc) as RowNum

    FROM outputresume3)

    Delete from Dups where rownum > 1;

    I ran the above code and am missing some records, when i do some before and after testing:(

    Am i missing something obvious ?

    Thanks

  • ifila (7/15/2009)


    ...

    I ran the above code and am missing some records, when i do some before and after testing:(

    Thanks

    Isn't that an expected effect when running a DELETE statement? :ermm:

    If this statement deleted rows that shouldn't have been deleted from your point of view you should run the cte statement with an

    "select top 100 * from dups" instead of "delete from dups where rownum>1"

    This would display the (first 100) rows with the rownum column added so you could verify what rows would have been deleted (rownum>1) and the ones that wouldn't (rownum=1).

    You also could post some sample data the way you already did (BTW: excellent job!!) supporting your post from above so we have something to test against.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • I always wonder at the words "Remove Duplicate Records"... from what? The original data as in a "Delete" or just from the output of a Select?

    People really need to be specific about these types of things.

    So... which way do you mean???

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Delete original records from a SQL Server database.

    Thanks

  • To be more specific, the records with the same email, since the whole record is not an exact duplicate. Each record key is still unique.

    I am trying to generate some test data to duplicate the issue.

  • ifila (7/16/2009)


    To be more specific, the records with the same email, since the whole record is not an exact duplicate. Each record key is still unique.

    I am trying to generate some test data to duplicate the issue.

    ifila (7/15/2009)


    with Dups as

    ( SELECT *,row_number()

    OVER

    (partition by email1 order by email1, id3 desc) as RowNum

    FROM outputresume3)

    Delete from Dups where rownum > 1;

    I ran the above code and am missing some records, when i do some before and after testing:(

    Am i missing something obvious ?

    Thanks

    The code above looks correct. I'm just not sure what you mean by "missing some records" only because I cannot see what you see.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I tried your code with the test data you previously posted. All seems well with the code. DECLARE @Table TABLE (id3 int, resume varchar(256), email1 varchar(256))

    INSERT INTO @Table

    SELECT 454382, 'testresume1', 'abc@hotmail.com' UNION ALL

    SELECT 612832, 'testresume2', 'john@global.net' UNION ALL

    SELECT 612833, 'testresume2', 'john@global.net' UNION ALL

    SELECT 612834, 'testresume2', 'john@global.net'

    ;with Dups as

    ( SELECT *,row_number()

    OVER

    (partition by email1 order by email1, id3 desc) as RowNum

    FROM @Table)

    Delete from Dups where rownum > 1;

    SELECT * FROM @Table

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • select distinct email1

    from outputresume3

    where (contains (originalresume, '"livelink"

    and "java"

    and "cobol"

    and "j2ee"

    and "asp"

    and "sql"

    and "unix"

    and "oracle"

    and "sharepoint"

    and "weblogic"'))

    When i run the above query on the database, the before results equal 24. The after delete results equal 21. I am trying to work out where my 3 extra results went.

    Thanks

  • Heh... good luck... that doesn't help much on our end.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply