Find Duplicates

  • Hello,

    I am working with sql 2000, and I am faced with a project that requires myself to find duplicate entries. To begin I will give an example of what I need to finish with and hopfully someone can tell me where to begin. I have been asked to find all the duplicate entries within a table of 18099597 rows of people and give the percentage of duplicates within that table. So duplicate people is what needs to be found along with the percentage. And with these people are of course identifying fields such as passport numbers, gender, data of birth etc. Where should I begin?

  • I'd begin by deciding what columns have to match in order to consider two rows a duplicate of each other. Also ensure you have a candidate key in the table which doesn't include those columns that are in your must-match list. Then you can perform a self-join on the table specifying your matching criteria as the join criteria and a not-equal against the candidate key.

  • I've come up with this query so far and it tells me some differences, but not the percentage. I read that I just have to devide the number to rows I get back from the query by the number of original rows in the table then multiply by 100. Can anyone confirm this?

    select Last_Name, First_name, Date_of_Birth, Id_Num, count(*) AS NumOfTimes

    from people_info

    group by Last_name, First_Name, Date_of_Birth, Id_Num

    having count(*) > 1

    order by count(*) desc, Last_Name, First_Name

  • Close... I'd check just for dupe ID_Num's first... the rest of the columns are ancillary in nature...

    Or, just take out the ID_Num to see if the same person has been registered twice...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I agree, the fewer number of fields you check, the less likely you are to exclude someone due to a spelling (or name change ...) error. If you feel the ID_num is a reliable field than just do your grouping on it, get your count and yes divide by the total number of records for a percentage.

    Now for the real kick. This will tell you the number of exact dupes. What about inexact dupes (id_num misentered...)?  I call thes fuzzy or inexact dupes and they are much harder to find and much harder to quantify. I should have an article out on how to use the fuzzy grouping task in SSIS to discover fuzzy dupes soon.

    But you could also join the table back to itself on a soundex of the first name, lastname ... This will likely overexaggerate your dups, so its not my favorite.

     

  • Try this:

    SELECT NULL

    FROM

    b where

    b. =

    .

    and b. =

    .

    GROU BY b., b.

    HAVING

    . > min(b.))

    ORDER BY

    .

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply