Self join to check duplicates

  • Hi,

    I would like to see how many duplicates we have in our users database. This is the simplified users table - id, fname, lname, address, city, state, username, password. I want to retrieve all the records with same fname, same lname, but different username. And all the records with same address, but different username.

    Thanks.

  • SELECT u1.fname, u1.lname, u1.username, u2.fname, u2.lname, u2.username

    FROM Users As u1

    INNER JOIN Users As u2

      On (u1.fname = u2.fname AND

            u1.lname = u2.lname )

    WHERE u1.username <> u2.username

  • It is not returning correct results. There is 40,000+ records in the table, the query returns 180,000+ records.

  • Depends on what you want to see - you get large resultsets if a person has 3, 4 or more records.

    This is an alternative that shows only users with 2 or more different usernames.

    SELECT u1.fname, u1.lname, Count( Distinct u2.username ) As NumerOfUsernames

    FROM Users As u1

    INNER JOIN Users As u2

      On (u1.fname = u2.fname AND

            u1.lname = u2.lname )

    GROUP BY u1.fname, u1.lname

    HAVING Count( Distinct u2.username ) > 1

  • Well, the result cannot be more than 40,000, no matter how many users has how many records. I am not sure I explained myself clearly. The total records in the table is 40,000. Say 2000 of them registered themselves again, exactly once, with a different username, but same fname and lname. The query shold return 4000 records, 2000 original records, plus the 2000 duplicates. Thanks.

  • >>Well, the result cannot be more than 40,000, no matter how many users has how many records.

    How do you know ?

    You need to understand how joins work if there are more than 2 records with the same key. Consider this example where Mr. forgetful, Joe Blow has registered himself 4 times. Do you get 2 recrods in a self join. Nope, you get 12. And that's just for 1 user and a table of only 4 rows to begin with.

    Declare @Users Table (

      fname varchar(100),

      lname varchar(100),

      username varchar(100)

    )

    Insert Into @Users

    Select 'Joe', 'Blow', 'jb1' Union all

    Select 'Joe', 'Blow', 'jb2' Union all

    Select 'Joe', 'Blow', 'jb3' Union all

    Select 'Joe', 'Blow', 'jb4'

    SELECT u1.fname, u1.lname, u1.username, u2.fname, u2.lname, u2.username

    FROM @Users As u1

    INNER JOIN @Users As u2

      On (u1.fname = u2.fname AND

            u1.lname = u2.lname )

    WHERE u1.username <> u2.username

     

  • I am sorry I had 'self join' in the title. It doesn't matter whether it is a self join, I just want the correct results.

  • I would concatenate firstname and last name in the query to get just one field for grouping, group on it and do having count user name >1 like PW suggested. Something like that:

    Select lastname+firstname, count(username) from my table

    Group by lastname+firstname Having Count(username)>1

    Regards,Yelena Varsha

  • Same with address, just tested it on my table:

    select address1+address2+state+city+zip

    from mytable

    group by address1+address2+state+city+zip having count(username)>1)

    Or to select all fields for these people:

    select * from mytable where address1+address2+state+city+zip IN

    (select address1+address2+state+city+zip

    from mytable

    group by address1+address2+state+city+zip having count(username)>1)

    order by address1+address2+state+city+zip

    The performance was not optimized, it took betwwen 1 and 2 seconds to run this query on about 100,000 rows, but if you have 40,000 rows I think it is OK

    The real side of the story when I ran the query on my database  was that there was about a dosen pairs of records with the same address but different names. Some names were spelled differently, like full first name and short first name and some people were relatives with the same last name but different first names.

    Regards,Yelena Varsha

  • Thanks.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply