How can I find duplicate Entries in a table?

  • I have a table that has CustomerNumber as one of its fields (CustomerNumber is not a key field). I was hoping that there was a way that I could find duplicate entries. Sort of like:

    select Not(Distinct(CustomerNumber)) From CustomerContactsXref

    Thanks

  • Select * from (SELECT *,

    Row_Number() OVER(PARTITION BY CustomerNumber ORDER BY

    CustomerNUmber) as row_no

    FRom CustomerContactsXref) t

    Where row_no > 1

    ---------------------------------------------------------------------------------

  • Probably the simplest way to do this.

    SELECT mt.CustomerNumber

    FROM MyTable AS mt

    GROUP BY mt.CustomerNumber

    HAVING COUNT(mt.CustomerNumber) > 1

    Hope This Helps!

  • Thanks so much to the both of you. Both statements worked perfectly. You saved me hours of time.

  • Nabha (11/24/2009)


    Select * from (SELECT *,

    Row_Number() OVER(PARTITION BY CustomerNumber ORDER BY

    CustomerNUmber) as row_no

    FRom CustomerContactsXref) t

    Where row_no > 1

    Hi Nabha, I was curious about any differences in perfomance using the Group By and Having clauses vs. the windowing functions. While doing this I get some unexpected results. Please help me understand the behavior of the windowing functions better. If I Select * I get a particular result back, if I Select CustomerNumber, Date (for example) , I get a completely different result returned. Thanks.

    Scott.

    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    I'm Sorry... I had my date filter misplaced.

  • SSSolice (11/24/2009)


    Nabha (11/24/2009)


    Select * from (SELECT *,

    Row_Number() OVER(PARTITION BY CustomerNumber ORDER BY

    CustomerNUmber) as row_no

    FRom CustomerContactsXref) t

    Where row_no > 1

    Hi Nabha, I was curious about any differences in perfomance using the Group By and Having clauses vs. the windowing functions. While doing this I get some unexpected results. Please help me understand the behavior of the windowing functions better. If I Select * I get a particular result back, if I Select CustomerNumber, Date (for example) , I get a completely different result returned. Thanks.

    Scott.

    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    I'm Sorry... I had my date filter misplaced.

    Performance difference; I believe there is not much but its again debatable. This is the closest topic that discussed about the difference that I could find out. ( I guess there is plenty over here)

    http://www.sqlservercentral.com/Forums/Topic711987-360-1.aspx

    Personally I prefer row_number() over group by in this case because, i can see all the other records in the table along with their rank and you would have to do a self join back in group by case to see that.

    ---------------------------------------------------------------------------------

  • Heh... just a "forum, survival tip"... I wouldn't make any such performance claims unless I had code to back it up. I've been burned without the code.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (11/25/2009)


    Heh... just a "forum, survival tip"... I wouldn't make any such performance claims unless I had code to back it up. I've been burned without the code.

    🙂 I knew this was coming! Yes sure Jeff. Thanks.

    I should not have commented about the performance because,

    1. I think its not even fair to compare these two queries here as they

    are doing two different things. Row_number() would get me all those

    rows except the first one. group by would get me just those customer

    numbers which are duplicates. So two different result sets.

    Gears start shifting when you actually want to delete it.

    Then we may start the comparision, and then comes performance difference.

    2. I was lazy enough(and not completely sure on how)to setup the test

    data for this case and use these two approaches to create script to delete duplicates and test performance.

    Thanks Jeff, will be careful in making performance claims in future :hehe:

    EDIT: Deleted the last line in my previous post about the performance of windowing function against the subquery in select clause 🙂

    ---------------------------------------------------------------------------------

  • Heh... no problem, Nabha... just didn't want to see a food fight break out. I'm not perfect either. I miss stuff just like on that previous row problem I just stuck my foot in my mouth with. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Nabha and Jeff,

    How would you change the query to find duplicates if the entries were numbers? I have been tasked with creating a query that will pull out duplicate entries but, the entries are numbers. We have a process that automates the entry of meters into a database. Unfortunately there are times when the same meter is entered. IF I knew how to write a query to find the duplicates I would be able to alert users when a duplicate is entered. Another issue is that the amount of meter entries will vary. Is there a way to write a query that will look at each piece of equipment individually and compare it's meters and then, if there is a meter that is duplicated, return the equipment ID? The last issue would be that I only want to compare the two latest (most recent) entries. Any suggestions would be greatly appreciated.

  • For something like that, why not just put a unique index on the meter column and use the thrown error as an alert?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I may not understand what you mean but, I do not think that will work in the database I work on. We have one table that has thousands of equipment records and another table that has tens or hundreds of meters for each piece of equipment. The table that has the meters in it has meters for each piece of equipment. I need a way to group the meters for each piece of equipment and then take the last two meters and compare them to see if they are duplicates.

    So for instance say the database has 5,000 pieces of equipment. Each piece of equipment will have 1 to 150 meters. I have to figure out a way to compare the last two meters for each piece of equipment to see if they are the same.

  • atwite (12/18/2009)


    Hi Nabha and Jeff,

    How would you change the query to find duplicates if the entries were numbers? I have been tasked with creating a query that will pull out duplicate entries but, the entries are numbers. We have a process that automates the entry of meters into a database. Unfortunately there are times when the same meter is entered. IF I knew how to write a query to find the duplicates I would be able to alert users when a duplicate is entered. Another issue is that the amount of meter entries will vary. Is there a way to write a query that will look at each piece of equipment individually and compare it's meters and then, if there is a meter that is duplicated, return the equipment ID? The last issue would be that I only want to compare the two latest (most recent) entries. Any suggestions would be greatly appreciated.

    Change the * to CustomerNumber. Same principle, just specify the columns. Otherwise, the queries are already grouped by CustomerNumber.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • atwite (12/18/2009)


    I may not understand what you mean but, I do not think that will work in the database I work on. We have one table that has thousands of equipment records and another table that has tens or hundreds of meters for each piece of equipment. The table that has the meters in it has meters for each piece of equipment. I need a way to group the meters for each piece of equipment and then take the last two meters and compare them to see if they are duplicates.

    So for instance say the database has 5,000 pieces of equipment. Each piece of equipment will have 1 to 150 meters. I have to figure out a way to compare the last two meters for each piece of equipment to see if they are the same.

    I guess I don't understand why a piece of equipment would have more than one meter associated with it during the lifetime of the equipment. Whatever...

    See the article in the first link in my signature line below. If you were to provide table creation and readily consumable data statements in the format indicated in that article, someone would be able to help you lickety-split.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (12/18/2009)


    For something like that, why not just put a unique index on the meter column and use the thrown error as an alert?

    Yes, prevent the duplicates in the first place. If for some reason you can't use a unique index because duplicates are allowed in special situations, then the insert statement could be coded to prevent adding a duplicate by doing something like WHERE x not in (select y from ...)

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply