Get Rid of Duplicates!

  • If you found two duplicated item_no's, why did four rows get deleted? Wouldn't you want to delete just one of the duplicates so that one unique row would remain?

    I must be missing something. Thanks for your explanation in advance.

    Richard

  • I have to admit that when I read the following, I thought that Seth had simply lost his mind...

    My quick resolution in this situation is to:

    1. Remove the unique index temporarily;

    2. Run the application, allowing it to insert duplicate item(s);

    3. Find the duplicate(s) and remove them.

    Of course, these steps are preceded by performing a good backup of the database and possibly by putting the database in single-user mode to prevent unexpected query results during my work. As simple as the task of removing a record with a duplicate value sounds, it can get confusing, and I need to proceed with care. To be safe, I follow this rule of thumb: first I perform a SELECT of the record(s) that will be removed, then I convert it to a DELETE statement once I'm sure it will affect only the record(s) that I want it to.

    .. because it just wasn't clear that it was a legacy app that shouldn't be changed because of the impending rewrite. I thought that was an awful lot of work to do a simple conditional insert.

    Now that Seth has clarified the problem a bit, I can mostly agree with the pain he goes through, including that of duplicate elimination. On that subject, and to all of those who made the very good suggestion of using ROW_NUMBER() to isolate duplicates, keep in mind that this is a legacy app on a legacy DB and it might be pre-2k5, where ROW_NUMBER() simply doesn't exist. Still, the title of the article is "Get Rid of Duplicates" and not "Get Rid of Duplicates for a Special Case", and I can certainly understand why people may have jumped to the wrong conclusion on this article, especially when the wrap-up line in the Conclusion is "Now you can confidently remove duplicate records from your tables!" and there was no mention of version or ROW_NUMBER(). 😉
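    For anyone who IS on 2005 or later, here's a minimal sketch of that ROW_NUMBER() approach. The table and column names (dbo.items, item_no, id) are made up for illustration, not taken from the article:

        -- Requires SQL Server 2005 or later.
        -- Number the rows within each group of identical item_no values,
        -- then delete everything past the first row of each group.
        WITH numbered AS
        (
            SELECT item_no,
                   ROW_NUMBER() OVER (PARTITION BY item_no ORDER BY id) AS rn
            FROM dbo.items
        )
        DELETE FROM numbered
        WHERE rn > 1;

    The DELETE runs against the CTE, but it removes the underlying rows from the base table; only the "extra" copies go.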

    That notwithstanding, for what the article was actually about, it was a good, well written article. Thanks, Seth.

    As a sidebar... I don't know what the app would do with the "Duplicate key was ignored." message that would show up if you tried to insert any dupes (some apps interpret such messages as an error... same goes for returned row counts), but have you tried changing the unique index to a unique index with the "IGNORE_DUP_KEY = ON" setting? If the app forgives the warning message(s) about dupes being ignored (one per INSERT statement that has dupes, no matter how many dupes exist in that INSERT), it could save you the wad of trouble that you're currently going through.
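    If you want to try it, here's a sketch of what I mean (the index and table names are hypothetical):

        -- 2005+ syntax; on SQL 2000 the options are written without
        -- parentheses: WITH IGNORE_DUP_KEY, DROP_EXISTING.
        -- Rebuilds the existing unique index so that INSERTs containing
        -- duplicate keys discard the dupes with a warning instead of failing.
        CREATE UNIQUE NONCLUSTERED INDEX IX_items_item_no
            ON dbo.items (item_no)
            WITH (IGNORE_DUP_KEY = ON, DROP_EXISTING = ON);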

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I prefer to use GROUP BY when I want to see the number of duplicates for each group of attributes used to check uniqueness, while ROW_NUMBER is a nicer solution that I use especially when I want to keep the latest or earliest entered version of the same record (a quick sketch of both follows at the end of this post).

    The check for duplicates might be required when merging data from two different sources or when breaking a non-normalized table into multiple tables, for example a header/lines set of tables. Another situation in which I had to check for duplicates is when importing data from non-relational sources (e.g. text files, Excel sheets, etc.), in which the chances of having duplicates are quite high.

    As already stressed, it's preferable to reduce the possibility of entering duplicates up front; unfortunately, that's not always possible.

    It's not always required to add unique indexes/constraints, though that was a good tip.
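    The sketch I promised above - the table and column names (dbo.items, item_no, entered_date) are invented for the example:

        -- GROUP BY: see how many copies exist per set of "key" attributes.
        SELECT item_no, COUNT(*) AS copies
        FROM dbo.items
        GROUP BY item_no
        HAVING COUNT(*) > 1;

        -- ROW_NUMBER() (2005+): pick only the latest-entered copy of each
        -- record, assuming entered_date reflects entry order.
        WITH ranked AS
        (
            SELECT item_no, entered_date,
                   ROW_NUMBER() OVER (PARTITION BY item_no
                                      ORDER BY entered_date DESC) AS rn
            FROM dbo.items
        )
        SELECT item_no, entered_date
        FROM ranked
        WHERE rn = 1;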

  • Seth,

    this is an age-old dilemma and was put to rest with many variations, as JP de Jong-202059 pointed out. In fact, this site has articles with scripts to perform the same task. Stop wasting our time with your eureka moments ....

  • If you found two duplicated item_no's, why did four rows get deleted? Wouldn't you want to delete just one of the duplicates so that one unique row would remain?

    I must be missing something. Thanks for your explanation in advance.

    Richard

    Richard,

    In this scenario, I have 2 duplicated records, where every field is identical. I copied one instance of each record into a temp table, then deleted ALL the records from the original table that had item_nos that were duplicated (each of the 2 records had 1 duplicate, so the total was 4 records). I chose to group by item_no, but could just as easily have used id. Then I copied everything from the temp table back to the original table (2 non-duplicate records). This method just seemed to make sense to me; I'm sure there are other, possibly more efficient, ways to do this.
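    Roughly, the steps look like this (dbo.items, item_no, and the temp table name are illustrative; this is a sketch of the idea, not the article's exact script):

        -- 1) Save ONE copy of each fully-duplicated record. Since every
        --    field is identical between the copies, DISTINCT collapses
        --    each pair down to a single row.
        SELECT DISTINCT i.*
        INTO #keepers
        FROM dbo.items AS i
        WHERE i.item_no IN
        (
            SELECT item_no
            FROM dbo.items
            GROUP BY item_no
            HAVING COUNT(*) > 1
        );

        -- 2) Delete ALL rows whose item_no is duplicated (all copies go).
        DELETE FROM dbo.items
        WHERE item_no IN
        (
            SELECT item_no
            FROM dbo.items
            GROUP BY item_no
            HAVING COUNT(*) > 1
        );

        -- 3) Reinsert the saved single copies. (If the table had an
        --    IDENTITY column, you'd need SET IDENTITY_INSERT and an
        --    explicit column list here.)
        INSERT INTO dbo.items
        SELECT * FROM #keepers;

        DROP TABLE #keepers;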

    _________________________________
    seth delconte
    http://sqlkeys.com

  • Yes, but there is nothing fantastic about your method; it's long been done by many in the SQL2K world. Just google it and you will find a ton of articles ...

    cheers.

  • Seth,

    Thanks. Oh... so... simple. Gotcha. Makes sense. Just never cleaned up dups by deleting them all from the original table.

    Yeah, there are tons of ways to deal with dups. Still, I appreciate your taking the time to write the article.

    Write on!

    Richard

  • Hey all - can I firstly just join the "I think this is a good article" camp - obviously there is always more than one way of doing things - but assuming you're not always on 2005+, I think this solution is very nice.

    One thing I stumbled on in the discussion that followed is this comment: "It's not always required to add unique indexes/constraints, though that was a good tip."

    Just out of curiosity - if I have a table where I use a surrogate PK of some sort (ints or GUIDs or whatever), I always make sure that I also have a natural key in the form of a unique index - so if it's a Person table I might place this on the email; if it's an Order table I might place it on the Customer and TimePlaces, etc. (see the sketch at the end of this post).

    I normally go to quite some length to do this, mainly because I was once taught it was good practice - and also because I find it traps a lot of application-logic errors that would otherwise go unnoticed.

    Now I should say that I am an application developer and not a DBA - so I am actually quite interested in hearing your opinion on this.
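    To make concrete what I mean, here's a hypothetical sketch (the names are mine, not from the article):

        -- Surrogate PK plus a separate unique index on the natural key.
        CREATE TABLE dbo.Person
        (
            PersonID int IDENTITY(1, 1) NOT NULL
                CONSTRAINT PK_Person PRIMARY KEY,
            Email    varchar(320) NOT NULL
        );

        -- The natural key: the engine now traps application-logic errors
        -- that would otherwise insert the same person twice.
        CREATE UNIQUE NONCLUSTERED INDEX UQ_Person_Email
            ON dbo.Person (Email);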

  • G33kKahuna (12/1/2009)


    Seth,

    this is an age-old dilemma and was put to rest with many variations, as JP de Jong-202059 pointed out. In fact, this site has articles with scripts to perform the same task. Stop wasting our time with your eureka moments ....

    Heh... man... you don't need to be so rude. Most every article that folks write having to do with SQL Server covers an "age-old dilemma", and yet nothing is truly at rest. A newbie might stumble into the discussion that follows such an article and actually learn something new. 😉

    --Jeff Moden



  • Heh... man... you don't need to be so rude. Most every article that folks write having to do with SQL Server covers an "age-old dilemma", and yet nothing is truly at rest. A newbie might stumble into the discussion that follows such an article and actually learn something new.

    Jeff,

    with all due respect, there is nothing new to learn in the article. It's done, closed, and available across the internet ... here is a simple Google search ..

    Those were the days when SSC had interesting articles every day; these days anyone with access to the internet and oxygen is dumping garbage on the site. I wish SSC moderated article posts more ...

  • Well, don't read it then - the only thing that's worse than low-quality forum posts is people poncing about complaining about how offended their intellect is by these terrible posts they are having to read.

    If you know how to find duplicates then yes - you probably should stop reading articles about how to find duplicates.

    And before you get all huffed up and spend all night drafting your reply - that's my final word on the matter!

    Cheers mate! :0)

  • tpoulsen (12/1/2009)


    Well, don't read it then - the only thing that's worse than low-quality forum posts is people poncing about complaining about how offended their intellect is by these terrible posts they are having to read.

    If you know how to find duplicates then yes - you probably should stop reading articles about how to find duplicates.

    And before you get all huffed up and spend all night drafting your reply - that's my final word on the matter!

    Cheers mate! :0)

    Ditto to you .... don't like my comments ... move on ...

  • G33kKahuna (12/1/2009)


    Heh... man... you don't need to be so rude. Most every article that folks write having to do with SQL Server covers an "age-old dilemma", and yet nothing is truly at rest. A newbie might stumble into the discussion that follows such an article and actually learn something new.

    Jeff,

    with all due respect, there is nothing new to learn in the article. It's done, closed, and available across the internet ... here is a simple Google search ..

    Those were the days when SSC had interesting articles every day; these days anyone with access to the internet and oxygen is dumping garbage on the site. I wish SSC moderated article posts more ...

    With the same respect, you don't strike me as an SSC 'old timer' ("Those were the days...") with a whopping big 139 posts. 😉 Heh... and there are other things to learn from such an article, like the rare alternate method found in the discussions that follow, or the human element that causes people to waste their time flaming about such articles. At least the guy tried... how many articles have you written? :hehe:

    Instead of offering a Google link on this subject, take a look at item 6 in the following article...

    http://www.ehow.com/how_2106033_use-proper-forum-etiquette.html

    --Jeff Moden



  • Sad, isn't it, when my forum posts offer more challenge and value .....

    cheers,

  • Heh... nah... it's just the human element on a (thankfully) open forum. Sometimes ya just gotta play the ol' Jedi mind trick on yourself... "These are not the droids I want... I'll move along." 😀

    --Jeff Moden


