Duplicates returned even using DISTINCT and WHERE NOT IN

  • I have a table with nearly 6,000 rows that needs to be updated with new rows periodically. The update is based on whether the email address already exists. Email is the PK.

    The query has been changed and tweaked, but no matter what I do (including using WHERE NOT EXISTS in place of WHERE NOT IN) I get 3 duplicate emails blocking the insert.

    This is the code I am using:

    INSERT INTO InternalML.dbo.TempEmail

    (Email)

    SELECT DISTINCT c.Email

    FROM AAService.dbo.Customers c WITH (NOLOCK)

    WHERE c.Email NOT IN (SELECT t.email

    FROM InternalML.dbo.TempML t)

    Unfortunately, I cannot post sample data that duplicates the situation. Mocking up data doesn't make this happen and the actual data consists of real email addresses.

    I am hoping someone will look at this and see something I don't.

    Thanks!

  • jerry-621596 (10/6/2010)


    I have a table with nearly 6,000 rows that needs to be updated with new rows periodically. The update is based on whether the email address already exists. Email is the PK.

    The query has been changed and tweaked, but no matter what I do (including using WHERE NOT EXISTS in place of WHERE NOT IN) I get 3 duplicate emails blocking the insert.

    This is the code I am using:

    INSERT INTO InternalML.dbo.TempEmail

    (Email)

    SELECT DISTINCT c.Email

    FROM AAService.dbo.Customers c WITH (NOLOCK)

    WHERE c.Email NOT IN (SELECT t.email

    FROM InternalML.dbo.TempML t)

    Unfortunately, I cannot post sample data that duplicates the situation. Mocking up data doesn't make this happen and the actual data consists of real email addresses.

    Subquery should check in InternalML.dbo.TempEmail rather than in InternalML.dbo.TempML 🙂

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • Do you have leading/trailing spaces on the fields showing as duplicates? You could try doing ltrim(rtrim(t.email)) on the where not in clause.

  • do the trimming in the SELECT DISTINCT also



    Alvin Ramard
    Memphis PASS Chapter[/url]

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • Sorry, I should have been more clear. The query is inserting into the TempEmail table so I can join back to the table with the conflict to see which emails are duplicates.

    I am checking back to TempML with this query:

    SELECT e.email, t.email

    FROM dbo.TempEmail e

    INNER JOIN dbo.TempML t ON e.email = t.email

  • I did try trimming on all of the email columns. It made the query run much longer, but came up with the same results.

  • Can you try this?

    INSERT INTO InternalML.dbo.TempEmail

    (Email)

    SELECT DISTINCT

    Ltrim(Rtrim(c.Email))

    FROM AAService.dbo.Customers c WITH (NOLOCK)

    EXCEPT

    SELECT Ltrim(Rtrim(t.email))

    FROM InternalML.dbo.TempML t WITH (NOLOCK)

    edit: ah just saw your post about no difference with the trims... I really don't know where to go from here without data.

  • Not sure what the difference was with EXCEPT but that worked perfectly!

    Thanks!

  • jerry-621596 (10/6/2010)


    Not sure what the difference was with EXCEPT but that worked perfectly!

    Thanks!

    Well...logically there would be no difference, but I've been using EXCEPT on all 2005+ queries and have had more consistent results and better performance compared to where not in or where not exists..I do not know why. Good to hear it worked though 😀

  • SELECT c.Email

    FROM AAService.dbo.Customers c WITH (NOLOCK)

    group by c.Email having c.Email except (SELECT t.email

    FROM InternalML.dbo.TempML t)

    Try this use group by instead of distinct.

    Distinct utilizes more resources.

    [font="Tahoma"]
    --SQLFRNDZ[/url]
    [/font]

  • pavan_srirangam (10/6/2010)


    Try this use group by instead of distinct.

    Distinct utilizes more resources.

    What is your reasoning for this?

    You get the same execution plan either way (index scan + hash match)

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply