January 20, 2006 at 10:19 am
Hi,
I would like to see how many duplicates we have in our users database. This is the simplified users table - id, fname, lname, address, city, state, username, password. I want to retrieve all the records with same fname, same lname, but different username. And all the records with same address, but different username.
Thanks.
January 20, 2006 at 10:22 am
SELECT u1.fname, u1.lname, u1.username, u2.fname, u2.lname, u2.username
FROM Users As u1
INNER JOIN Users As u2
On (u1.fname = u2.fname AND
u1.lname = u2.lname )
WHERE u1.username <> u2.username
January 20, 2006 at 10:31 am
It is not returning correct results. There is 40,000+ records in the table, the query returns 180,000+ records.
January 20, 2006 at 10:44 am
Depends on what you want to see - you get large resultsets if a person has 3, 4 or more records.
This is an alternative that shows only users with 2 or more different usernames.
SELECT u1.fname, u1.lname, Count( Distinct u2.username ) As NumerOfUsernames
FROM Users As u1
INNER JOIN Users As u2
On (u1.fname = u2.fname AND
u1.lname = u2.lname )
GROUP BY u1.fname, u1.lname
HAVING Count( Distinct u2.username ) > 1
January 20, 2006 at 10:58 am
Well, the result cannot be more than 40,000, no matter how many users has how many records. I am not sure I explained myself clearly. The total records in the table is 40,000. Say 2000 of them registered themselves again, exactly once, with a different username, but same fname and lname. The query shold return 4000 records, 2000 original records, plus the 2000 duplicates. Thanks.
January 20, 2006 at 11:13 am
>>Well, the result cannot be more than 40,000, no matter how many users has how many records.
How do you know ?
You need to understand how joins work if there are more than 2 records with the same key. Consider this example where Mr. forgetful, Joe Blow has registered himself 4 times. Do you get 2 recrods in a self join. Nope, you get 12. And that's just for 1 user and a table of only 4 rows to begin with.
Declare @Users Table (
fname varchar(100),
lname varchar(100),
username varchar(100)
)
Insert Into @Users
Select 'Joe', 'Blow', 'jb1' Union all
Select 'Joe', 'Blow', 'jb2' Union all
Select 'Joe', 'Blow', 'jb3' Union all
Select 'Joe', 'Blow', 'jb4'
SELECT u1.fname, u1.lname, u1.username, u2.fname, u2.lname, u2.username
FROM @Users As u1
INNER JOIN @Users As u2
On (u1.fname = u2.fname AND
u1.lname = u2.lname )
WHERE u1.username <> u2.username
January 20, 2006 at 12:38 pm
I am sorry I had 'self join' in the title. It doesn't matter whether it is a self join, I just want the correct results.
January 20, 2006 at 1:53 pm
I would concatenate firstname and last name in the query to get just one field for grouping, group on it and do having count user name >1 like PW suggested. Something like that:
Select lastname+firstname, count(username) from my table
Group by lastname+firstname Having Count(username)>1
Regards,Yelena Varsha
January 20, 2006 at 2:05 pm
Same with address, just tested it on my table:
select address1+address2+state+city+zip
from mytable
group by address1+address2+state+city+zip having count(username)>1)
Or to select all fields for these people:
select * from mytable where address1+address2+state+city+zip IN
(select address1+address2+state+city+zip
from mytable
group by address1+address2+state+city+zip having count(username)>1)
order by address1+address2+state+city+zip
The performance was not optimized, it took betwwen 1 and 2 seconds to run this query on about 100,000 rows, but if you have 40,000 rows I think it is OK
The real side of the story when I ran the query on my database was that there was about a dosen pairs of records with the same address but different names. Some names were spelled differently, like full first name and short first name and some people were relatives with the same last name but different first names.
Regards,Yelena Varsha
January 20, 2006 at 3:21 pm
Thanks.
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply