Anyway to make this de-duping process run faster?

Question

Post reply

Anyway to make this de-duping process run faster?

Nicholas Wang

Default port

Points: 1463
More actions
November 29, 2005 at 1:54 pm

#89979

I have a complicated and interesting problem here for you. I am currently working on a data cleansing project in preparation of an upcoming system migration. The issue here is duplicate email addresses. Here's a sample data set:
username, email, product_id, product_status
mike123, mike@yahoo.com, 123456, 1
mike123, mike@yahoo.com, 123457, 1
nick456, mike@yahoo.com, 124567, 0
nick456, mike@yahoo.com, 124578, 0
jack777, jack@gmail.com, 777890, 0
jack777, jack@gmail.com, 777891, 0
james88, jack@gmail.com, 777892, 0
brad555, brad@yahoo.com, 555123, 0
brad555, brad@yahoo.com, 555234, 0
kate666, brad@yahoo.com, 666123, 1
mill777, brad@yahoo.com, 777234, 1
hill888, brad@yahoo.com, 888123, 1
john111, john@gmail.com, 111222, 1
mary222, john@gmail.com, 111333, 1
kate333, john@gmail.com, 111444, 1
A total of 103749 rows are populated into a table called "dup_emails" with data like the above sample set. Now, the goal of this project is to remove duplicate email across multiple users (e.g. we're going to use email address as the user login in the future.) Due to the sheer volume, I have come up with a few categories that I hope will help my clean up crew attack this project with better efficiency. They are:
a) Only one user is linked to any active product(s)
b) No user is linked to active product
c) > 1 user is linked to active product(s), but not all users are linked to active products
d) All users are linked to active products
(These 4 categories are respectively reflected in the sample data above.)
Now, the question. At first, I tried to just use GROUP BY, but aside from b) No user is linked to active product, the others all failed, I had no choice but to use a loop. I'll post my queries next, but can anyone think of ways to use a SET operation for this instead a procedure one? (The loop takes ~3 hrs to go through the entire set!!!)
Thanks!
Nick

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

Nicholas Wang Default port Points: 1463 More actions · Answer 1

DECLARE @currentEmail varchar(100), @currentUser varchar(50),

@totalUser int, @totalActive int, @count int, @error int

SET @currentEmail = ''

SET @currentUser = ''

SET @totalUser = 0

SET @totalActive = 0

SET @count = 0

SET @error = 0

BEGIN TRAN

WHILE EXISTS (SELECT * FROM dup_emails

WHERE email > @currentEmail)

BEGIN

SET @currentEmail = (SELECT MIN(email)

FROM dup_emails

WHERE email > @currentEmail)

/*** For each email, get the count of distinct users ***/

SET @totalUser = (SELECT count(distinct username)

FROM dup_emails

WHERE email = @currentEmail)

WHILE EXISTS (SELECT * FROM dup_emails

WHERE email = @currentEmail

AND username > @currentUser)

BEGIN

SET @currentUser = (SELECT MIN(username)

FROM dup_emails

WHERE email = @currentEmail

AND username > @currentUser)

/*** For each distinct user, if it has active products, then @totalActive ++1 ***/

IF (SELECT SUM(product_status)

FROM dup_emails

WHERE username = @currentUser) >= 1

SET @totalActive = @totalActive + 1

END

--IF @totalUser = @totalActive THEN all users within said email is linked to active products

IF @totalUser = @totalActive

BEGIN

INSERT INTO #temp_dup_email_AllActive

SELECT @currentEmail

END

/*** Reset values for next email ***/

SET @currentUser = ''

SET @totalUser = 0

SET @totalActive = 0

END

PRINT 'Number of COUNT ' + CAST (@count AS varchar(9))

PRINT 'TranCount ' + CAST(@@TRANCOUNT AS varchar(5))

COMMIT TRAN

GO

noeld SSC Guru Points: 96590 More actions · Answer 2

I am pretty sure that there is a set based approach to solve your problem but I don't understad clearly what is the final outcome that you need. Based on the sample data that you posted, can you post the output that you expect?

* Noel

Carl Federl One Orange Chip Points: 25384 More actions · Answer 3

In the future, please post create table and insert statements.

How about this SQL (an index on Users.email is recommended)

select email, product_id, product_status

, MIN (username)

Into GoodUsers

FROM ( SELECT email, product_id

, CASE WHEN SUM(product_status ) > 1 then 1 else 0 END

from Users

group by email, product_id

) as X (email, product_id, product_status )

JOIN Users

On Users.email = X.email

Then:

truncate table users

insert into users (..) select .. from GoodUsers

SQL = Scarcely Qualifies as a Language

Nicholas Wang Default port Points: 1463 More actions · Answer 4

I'm looking for just the email in the result. Sorry, here's the create + insert statements.

CREATE TABLE dbo.dup_emails (

username varchar(100),

email varchar(100),

product_id int,

product_status bit)

-- a) only one active

INSERT dbo.dup_emails

VALUES ('mike123', 'mike@yahoo.com', 123456, 1)

INSERT dbo.dup_emails

VALUES ('mike123', 'mike@yahoo.com', 123457, 1)

INSERT dbo.dup_emails

VALUES ('nick456', 'mike@yahoo.com', 124567, 0)

INSERT dbo.dup_emails

VALUES ('nick456', 'mike@yahoo.com', 124578, 0)

INSERT dbo.dup_emails

VALUES ('nick567', 'mike@yahoo.com', 124579, 0)

-- b) none active

INSERT dbo.dup_emails

VALUES ('jack777', 'jack@gmail.com', 777890, 0)

INSERT dbo.dup_emails

VALUES ('jack777', 'jack@gmail.com', 777891, 0)

INSERT dbo.dup_emails

VALUES ('james88', 'jack@gmail.com', 777892, 0)

-- c) > 1 active but not all

INSERT dbo.dup_emails

VALUES ('brad555', 'brad@yahoo.com', 555123, 0)

INSERT dbo.dup_emails

VALUES ('brad555', 'brad@yahoo.com', 555234, 0)

INSERT dbo.dup_emails

VALUES ('kate666', 'brad@yahoo.com', 666123, 1)

INSERT dbo.dup_emails

VALUES ('mill777', 'brad@yahoo.com', 777234, 1)

INSERT dbo.dup_emails

VALUES ('hill888', 'brad@yahoo.com', 888123, 1)

-- d) All active

INSERT dbo.dup_emails

VALUES ('john111', 'john@gmail.com', 111222, 1)

INSERT dbo.dup_emails

VALUES ('mary222', 'john@gmail.com', 111333, 1)

INSERT dbo.dup_emails

VALUES ('kate333', 'john@gmail.com', 111444, 1)

David A. Long SSCrazy Eights Points: 8421 More actions · Answer 5

I am not sure what you expect, you state that you want to delete duplicate email addresses, so here you go:

IF OBJECT_ID(N'dup_emails') > 0

DROP TABLE dbo.dup_emails

CREATE TABLE dbo.dup_emails (

username varchar(100),

email varchar(100),

product_id int,

product_status bit,

id int identity)

...

DELETE FROM dup_emails

FROM dup_emails

INNER JOIN dup_emails AS dupe ON dup_emails.email = dupe.email

AND dup_emails.id <> dupe.id

WHERE dup_emails.id > dupe.id

Andy

David Korb SSCrazy Points: 2505 More actions · Answer 6

DELETE

A

FROM

tableA AS A

INNER JOIN

tableA AS B

ON

A.ID > B.ID

AND

A.column_of_interest = B.column_of_interest

Nicholas Wang Default port Points: 1463 More actions · Answer 7

It's much harder than it looks. I'm not trying to delete anything, I just need to get the email address the met my criteria. Look at my loop, that's for d) All users are linked to active products. I basically have to go in, for each email address, get a count of unique usernames, then for each username, check to see that it has at least one active product. In the end, if the number of unique users = the number of active users, then I'll return the email address. Now, do you can that can be done without a loop?

Carl Federl One Orange Chip Points: 25384 More actions · Answer 8

Based on your post:

1. For each email address, get a count of unique usernames.

SQL:

select email, count(distinct username)

from dup_emails

group by email

Result:

2. For each username, check to see that it has at least one active product

SQL:

select distinct username

from dup_emails

where EXISTS

(select 1

from dup_emails as Actives

where Actives.username = dup_emails.username

and product_status = 1)

Result:

hill888

john111

kate333

kate666

mary222

mike123

mill777

3. When the number of unique users = the number of active users, get the email address.

This requirement does not match any of the outputs of the prior steps. Please re-phrase.

SQL = Scarcely Qualifies as a Language

Jeff Moden SSC Guru Points: 1004340 More actions · Answer 9

Not sure... I think this is what you want... make sure you have a primary key on your table and this should only take a couple of seconds to run. I tested using the data you provided but that's not exactly a "load" test...

PRINT 'A. Only One Active...'

SELECT EMail

FROM dbo.Dup_Emails

GROUP BY EMail

HAVING SUM(CAST(Product_Status AS TINYINT))=1

PRINT 'B. None Active...'

SELECT EMail

FROM dbo.Dup_Emails

GROUP BY EMail

HAVING SUM(CAST(Product_Status AS TINYINT))=0

PRINT 'C. >1 Active but not all...'

SELECT EMail

FROM dbo.Dup_Emails

GROUP BY EMail

HAVING COUNT(*) > SUM(CAST(Product_Status AS TINYINT))

AND SUM(CAST(Product_Status AS TINYINT))>1

PRINT 'D. All Active...'

SELECT EMail

FROM dbo.Dup_Emails

GROUP BY EMail

HAVING COUNT(*) = SUM(CAST(Product_Status AS TINYINT))

AND SUM(CAST(Product_Status AS TINYINT)) > 0

By the way... you had an error in your sample data for mike@yahoo.com... he's actually got more than one active product but I set one of his statuses to 0 to proof the code.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)