Anyway to make this de-duping process run faster?

  • I have a complicated and interesting problem here for you. I am currently working on a data cleansing project in preparation of an upcoming system migration. The issue here is duplicate email addresses. Here's a sample data set:

    username, email, product_id, product_status

    mike123, mike@yahoo.com, 123456, 1

    mike123, mike@yahoo.com, 123457, 1

    nick456, mike@yahoo.com, 124567, 0

    nick456, mike@yahoo.com, 124578, 0

    jack777, jack@gmail.com, 777890, 0

    jack777, jack@gmail.com, 777891, 0

    james88, jack@gmail.com, 777892, 0

    brad555, brad@yahoo.com, 555123, 0

    brad555, brad@yahoo.com, 555234, 0

    kate666, brad@yahoo.com, 666123, 1

    mill777, brad@yahoo.com, 777234, 1

    hill888, brad@yahoo.com, 888123, 1

    john111, john@gmail.com, 111222, 1

    mary222, john@gmail.com, 111333, 1

    kate333, john@gmail.com, 111444, 1

    A total of 103749 rows are populated into a table called "dup_emails" with data like the above sample set. Now, the goal of this project is to remove duplicate email across multiple users (e.g. we're going to use email address as the user login in the future.) Due to the sheer volume, I have come up with a few categories that I hope will help my clean up crew attack this project with better efficiency. They are:

    a) Only one user is linked to any active product(s)

    b) No user is linked to active product

    c) > 1 user is linked to active product(s), but not all users are linked to active products

    d) All users are linked to active products

    (These 4 categories are respectively reflected in the sample data above.)

    Now, the question. At first, I tried to just use GROUP BY, but aside from b) No user is linked to active product, the others all failed, I had no choice but to use a loop. I'll post my queries next, but can anyone think of ways to use a SET operation for this instead a procedure one? (The loop takes ~3 hrs to go through the entire set!!!)

    Thanks!

    Nick

  • DECLARE @currentEmail varchar(100), @currentUser varchar(50),

    @totalUser int, @totalActive int, @count int, @error int

    SET @currentEmail = ''

    SET @currentUser = ''

    SET @totalUser = 0

    SET @totalActive = 0

    SET @count = 0

    SET @error = 0

    BEGIN TRAN

    WHILE EXISTS (SELECT * FROM dup_emails

    WHERE email > @currentEmail)

    BEGIN

    SET @currentEmail = (SELECT MIN(email)

    FROM dup_emails

    WHERE email > @currentEmail)

    /*** For each email, get the count of distinct users ***/

    SET @totalUser = (SELECT count(distinct username)

    FROM dup_emails

    WHERE email = @currentEmail)

    WHILE EXISTS (SELECT * FROM dup_emails

    WHERE email = @currentEmail

    AND username > @currentUser)

    BEGIN

    SET @currentUser = (SELECT MIN(username)

    FROM dup_emails

    WHERE email = @currentEmail

    AND username > @currentUser)

    /*** For each distinct user, if it has active products, then @totalActive ++1 ***/

    IF (SELECT SUM(product_status)

    FROM dup_emails

    WHERE username = @currentUser) >= 1

    SET @totalActive = @totalActive + 1

    END

    --IF @totalUser = @totalActive THEN all users within said email is linked to active products

    IF @totalUser = @totalActive

    BEGIN

    INSERT INTO #temp_dup_email_AllActive

    SELECT @currentEmail

    END

    /*** Reset values for next email ***/

    SET @currentUser = ''

    SET @totalUser = 0

    SET @totalActive = 0

    END

    PRINT 'Number of COUNT ' + CAST (@count AS varchar(9))

    PRINT 'TranCount ' + CAST(@@TRANCOUNT AS varchar(5))

    COMMIT TRAN

    GO

  • I am pretty sure that there is a set based approach to solve your problem but I don't understad clearly what is the final outcome that you need. Based on the sample data that you posted, can you post the output that you expect?

     


    * Noel

  • In the future, please post create table and insert statements.

    How about this SQL (an index on Users.email is recommended)

    select email, product_id, product_status

    , MIN (username)

    Into GoodUsers

    FROM ( SELECT email, product_id

    , CASE WHEN SUM(product_status ) > 1 then 1 else 0 END

    from Users

    group by email, product_id

    ) as X (email, product_id, product_status )

    JOIN Users

    On Users.email = X.email

    Then:

    truncate table users

    insert into users (..) select .. from GoodUsers

    SQL = Scarcely Qualifies as a Language

  • I'm looking for just the email in the result. Sorry, here's the create + insert statements.

    CREATE TABLE dbo.dup_emails (

    username varchar(100),

    email varchar(100),

    product_id int,

    product_status bit)

    -- a) only one active

    INSERT dbo.dup_emails

    VALUES ('mike123', 'mike@yahoo.com', 123456, 1)

    INSERT dbo.dup_emails

    VALUES ('mike123', 'mike@yahoo.com', 123457, 1)

    INSERT dbo.dup_emails

    VALUES ('nick456', 'mike@yahoo.com', 124567, 0)

    INSERT dbo.dup_emails

    VALUES ('nick456', 'mike@yahoo.com', 124578, 0)

    INSERT dbo.dup_emails

    VALUES ('nick567', 'mike@yahoo.com', 124579, 0)

    -- b) none active

    INSERT dbo.dup_emails

    VALUES ('jack777', 'jack@gmail.com', 777890, 0)

    INSERT dbo.dup_emails

    VALUES ('jack777', 'jack@gmail.com', 777891, 0)

    INSERT dbo.dup_emails

    VALUES ('james88', 'jack@gmail.com', 777892, 0)

    -- c) > 1 active but not all

    INSERT dbo.dup_emails

    VALUES ('brad555', 'brad@yahoo.com', 555123, 0)

    INSERT dbo.dup_emails

    VALUES ('brad555', 'brad@yahoo.com', 555234, 0)

    INSERT dbo.dup_emails

    VALUES ('kate666', 'brad@yahoo.com', 666123, 1)

    INSERT dbo.dup_emails

    VALUES ('mill777', 'brad@yahoo.com', 777234, 1)

    INSERT dbo.dup_emails

    VALUES ('hill888', 'brad@yahoo.com', 888123, 1)

    -- d) All active

    INSERT dbo.dup_emails

    VALUES ('john111', 'john@gmail.com', 111222, 1)

    INSERT dbo.dup_emails

    VALUES ('mary222', 'john@gmail.com', 111333, 1)

    INSERT dbo.dup_emails

    VALUES ('kate333', 'john@gmail.com', 111444, 1)

  • I am not sure what you expect, you state that you want to delete duplicate email addresses, so here you go:

    IF OBJECT_ID(N'dup_emails') > 0

            DROP TABLE dbo.dup_emails

    CREATE TABLE dbo.dup_emails (

    username varchar(100),

    email varchar(100),

    product_id int,

    product_status bit,

    id int identity)

    ...

    DELETE FROM dup_emails

    FROM dup_emails

     INNER JOIN dup_emails AS dupe ON dup_emails.email = dupe.email

                    AND dup_emails.id <> dupe.id

    WHERE dup_emails.id > dupe.id

    Andy

  • DELETE

     A

    FROM

     tableA AS A

    INNER JOIN

     tableA AS B

    ON

     A.ID > B.ID

     AND

     A.column_of_interest = B.column_of_interest

     

  • It's much harder than it looks. I'm not trying to delete anything, I just need to get the email address the met my criteria. Look at my loop, that's for d) All users are linked to active products. I basically have to go in, for each email address, get a count of unique usernames, then for each username, check to see that it has at least one active product. In the end, if the number of unique users = the number of active users, then I'll return the email address. Now, do you can that can be done without a loop?

  • Based on your post:

    1. For each email address, get a count of unique usernames.

    SQL:

    select email, count(distinct username)

    from dup_emails

    group by email

    Result:

    brad@yahoo.com4

    jack@gmail.com2

    john@gmail.com3

    mike@yahoo.com3

    2. For each username, check to see that it has at least one active product

    SQL:

    select distinct username

    from dup_emails

    where EXISTS

    (select 1

    from dup_emails as Actives

    where Actives.username = dup_emails.username

    and product_status = 1)

    Result:

    hill888

    john111

    kate333

    kate666

    mary222

    mike123

    mill777

    3. When the number of unique users = the number of active users, get the email address.

    This requirement does not match any of the outputs of the prior steps. Please re-phrase.

    SQL = Scarcely Qualifies as a Language

  • Not sure... I think this is what you want... make sure you have a primary key on your table and this should only take a couple of seconds to run.  I tested using the data you provided but that's not exactly a "load" test...

     PRINT 'A. Only One Active...'

     SELECT EMail

       FROM dbo.Dup_Emails

      GROUP BY EMail

     HAVING SUM(CAST(Product_Status AS TINYINT))=1

      PRINT 'B. None Active...'

     SELECT EMail

       FROM dbo.Dup_Emails

      GROUP BY EMail

     HAVING SUM(CAST(Product_Status AS TINYINT))=0

      PRINT 'C. >1 Active but not all...'

     SELECT EMail

       FROM dbo.Dup_Emails

      GROUP BY EMail

     HAVING COUNT(*) > SUM(CAST(Product_Status AS TINYINT))

        AND SUM(CAST(Product_Status AS TINYINT))>1

      PRINT 'D. All Active...'

     SELECT EMail

       FROM dbo.Dup_Emails

      GROUP BY EMail

     HAVING COUNT(*) = SUM(CAST(Product_Status AS TINYINT))

        AND SUM(CAST(Product_Status AS TINYINT)) > 0

    By the way... you had an error in your sample data for mike@yahoo.com... he's actually got more than one active product but I set one of his statuses to 0 to proof the code.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply