November 29, 2005 at 1:54 pm
I have a complicated and interesting problem here for you. I am currently working on a data cleansing project in preparation of an upcoming system migration. The issue here is duplicate email addresses. Here's a sample data set:
username, email, product_id, product_status
mike123, mike@yahoo.com, 123456, 1
mike123, mike@yahoo.com, 123457, 1
nick456, mike@yahoo.com, 124567, 0
nick456, mike@yahoo.com, 124578, 0
jack777, jack@gmail.com, 777890, 0
jack777, jack@gmail.com, 777891, 0
james88, jack@gmail.com, 777892, 0
brad555, brad@yahoo.com, 555123, 0
brad555, brad@yahoo.com, 555234, 0
kate666, brad@yahoo.com, 666123, 1
mill777, brad@yahoo.com, 777234, 1
hill888, brad@yahoo.com, 888123, 1
john111, john@gmail.com, 111222, 1
mary222, john@gmail.com, 111333, 1
kate333, john@gmail.com, 111444, 1
A total of 103749 rows are populated into a table called "dup_emails" with data like the above sample set. Now, the goal of this project is to remove duplicate email across multiple users (e.g. we're going to use email address as the user login in the future.) Due to the sheer volume, I have come up with a few categories that I hope will help my clean up crew attack this project with better efficiency. They are:
a) Only one user is linked to any active product(s)
b) No user is linked to active product
c) > 1 user is linked to active product(s), but not all users are linked to active products
d) All users are linked to active products
(These 4 categories are respectively reflected in the sample data above.)
Now, the question. At first, I tried to just use GROUP BY, but aside from b) No user is linked to active product, the others all failed, I had no choice but to use a loop. I'll post my queries next, but can anyone think of ways to use a SET operation for this instead a procedure one? (The loop takes ~3 hrs to go through the entire set!!!)
Thanks!
Nick
November 29, 2005 at 2:09 pm
DECLARE @currentEmail varchar(100), @currentUser varchar(50),
@totalUser int, @totalActive int, @count int, @error int
SET @currentEmail = ''
SET @currentUser = ''
SET @totalUser = 0
SET @totalActive = 0
SET @count = 0
SET @error = 0
BEGIN TRAN
WHILE EXISTS (SELECT * FROM dup_emails
WHERE email > @currentEmail)
BEGIN
SET @currentEmail = (SELECT MIN(email)
FROM dup_emails
WHERE email > @currentEmail)
/*** For each email, get the count of distinct users ***/
SET @totalUser = (SELECT count(distinct username)
FROM dup_emails
WHERE email = @currentEmail)
WHILE EXISTS (SELECT * FROM dup_emails
WHERE email = @currentEmail
AND username > @currentUser)
BEGIN
SET @currentUser = (SELECT MIN(username)
FROM dup_emails
WHERE email = @currentEmail
AND username > @currentUser)
/*** For each distinct user, if it has active products, then @totalActive ++1 ***/
IF (SELECT SUM(product_status)
FROM dup_emails
WHERE username = @currentUser) >= 1
SET @totalActive = @totalActive + 1
END
--IF @totalUser = @totalActive THEN all users within said email is linked to active products
IF @totalUser = @totalActive
BEGIN
INSERT INTO #temp_dup_email_AllActive
SELECT @currentEmail
END
/*** Reset values for next email ***/
SET @currentUser = ''
SET @totalUser = 0
SET @totalActive = 0
END
PRINT 'Number of COUNT ' + CAST (@count AS varchar(9))
PRINT 'TranCount ' + CAST(@@TRANCOUNT AS varchar(5))
COMMIT TRAN
GO
November 29, 2005 at 2:39 pm
I am pretty sure that there is a set based approach to solve your problem but I don't understad clearly what is the final outcome that you need. Based on the sample data that you posted, can you post the output that you expect?
* Noel
November 29, 2005 at 3:56 pm
In the future, please post create table and insert statements.
How about this SQL (an index on Users.email is recommended)
select email, product_id, product_status
, MIN (username)
Into GoodUsers
FROM ( SELECT email, product_id
, CASE WHEN SUM(product_status ) > 1 then 1 else 0 END
from Users
group by email, product_id
) as X (email, product_id, product_status )
JOIN Users
On Users.email = X.email
Then:
truncate table users
insert into users (..) select .. from GoodUsers
SQL = Scarcely Qualifies as a Language
November 29, 2005 at 4:54 pm
I'm looking for just the email in the result. Sorry, here's the create + insert statements.
CREATE TABLE dbo.dup_emails (
username varchar(100),
email varchar(100),
product_id int,
product_status bit)
-- a) only one active
INSERT dbo.dup_emails
VALUES ('mike123', 'mike@yahoo.com', 123456, 1)
INSERT dbo.dup_emails
VALUES ('mike123', 'mike@yahoo.com', 123457, 1)
INSERT dbo.dup_emails
VALUES ('nick456', 'mike@yahoo.com', 124567, 0)
INSERT dbo.dup_emails
VALUES ('nick456', 'mike@yahoo.com', 124578, 0)
INSERT dbo.dup_emails
VALUES ('nick567', 'mike@yahoo.com', 124579, 0)
-- b) none active
INSERT dbo.dup_emails
VALUES ('jack777', 'jack@gmail.com', 777890, 0)
INSERT dbo.dup_emails
VALUES ('jack777', 'jack@gmail.com', 777891, 0)
INSERT dbo.dup_emails
VALUES ('james88', 'jack@gmail.com', 777892, 0)
-- c) > 1 active but not all
INSERT dbo.dup_emails
VALUES ('brad555', 'brad@yahoo.com', 555123, 0)
INSERT dbo.dup_emails
VALUES ('brad555', 'brad@yahoo.com', 555234, 0)
INSERT dbo.dup_emails
VALUES ('kate666', 'brad@yahoo.com', 666123, 1)
INSERT dbo.dup_emails
VALUES ('mill777', 'brad@yahoo.com', 777234, 1)
INSERT dbo.dup_emails
VALUES ('hill888', 'brad@yahoo.com', 888123, 1)
-- d) All active
INSERT dbo.dup_emails
VALUES ('john111', 'john@gmail.com', 111222, 1)
INSERT dbo.dup_emails
VALUES ('mary222', 'john@gmail.com', 111333, 1)
INSERT dbo.dup_emails
VALUES ('kate333', 'john@gmail.com', 111444, 1)
November 30, 2005 at 3:03 am
I am not sure what you expect, you state that you want to delete duplicate email addresses, so here you go:
IF OBJECT_ID(N'dup_emails') > 0
DROP TABLE dbo.dup_emails
CREATE TABLE dbo.dup_emails (
username varchar(100),
email varchar(100),
product_id int,
product_status bit,
id int identity)
...
DELETE FROM dup_emails
FROM dup_emails
INNER JOIN dup_emails AS dupe ON dup_emails.email = dupe.email
AND dup_emails.id <> dupe.id
WHERE dup_emails.id > dupe.id
Andy
November 30, 2005 at 9:14 am
DELETE
A
FROM
tableA AS A
INNER JOIN
tableA AS B
ON
A.ID > B.ID
AND
A.column_of_interest = B.column_of_interest
December 1, 2005 at 4:50 pm
It's much harder than it looks. I'm not trying to delete anything, I just need to get the email address the met my criteria. Look at my loop, that's for d) All users are linked to active products. I basically have to go in, for each email address, get a count of unique usernames, then for each username, check to see that it has at least one active product. In the end, if the number of unique users = the number of active users, then I'll return the email address. Now, do you can that can be done without a loop?
December 3, 2005 at 6:40 am
Based on your post:
1. For each email address, get a count of unique usernames.
SQL:
select email, count(distinct username)
from dup_emails
group by email
Result:
2. For each username, check to see that it has at least one active product
SQL:
select distinct username
from dup_emails
where EXISTS
(select 1
from dup_emails as Actives
where Actives.username = dup_emails.username
and product_status = 1)
Result:
hill888
john111
kate333
kate666
mary222
mike123
mill777
3. When the number of unique users = the number of active users, get the email address.
This requirement does not match any of the outputs of the prior steps. Please re-phrase.
SQL = Scarcely Qualifies as a Language
December 3, 2005 at 10:41 pm
Not sure... I think this is what you want... make sure you have a primary key on your table and this should only take a couple of seconds to run. I tested using the data you provided but that's not exactly a "load" test...
PRINT 'A. Only One Active...'
SELECT EMail
FROM dbo.Dup_Emails
GROUP BY EMail
HAVING SUM(CAST(Product_Status AS TINYINT))=1
PRINT 'B. None Active...'
SELECT EMail
FROM dbo.Dup_Emails
GROUP BY EMail
HAVING SUM(CAST(Product_Status AS TINYINT))=0
PRINT 'C. >1 Active but not all...'
SELECT EMail
FROM dbo.Dup_Emails
GROUP BY EMail
HAVING COUNT(*) > SUM(CAST(Product_Status AS TINYINT))
AND SUM(CAST(Product_Status AS TINYINT))>1
PRINT 'D. All Active...'
SELECT EMail
FROM dbo.Dup_Emails
GROUP BY EMail
HAVING COUNT(*) = SUM(CAST(Product_Status AS TINYINT))
AND SUM(CAST(Product_Status AS TINYINT)) > 0
By the way... you had an error in your sample data for mike@yahoo.com... he's actually got more than one active product but I set one of his statuses to 0 to proof the code.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply