Interesting query - remove duplicates from my table

  • Hi everyone,

    I need help understanding something.

    The situation is - I have a table that stores comments about shipments, and sometimes the same comment is in a table several times. So I had to write some code to delete duplicate comments.

    I could not find a way without using some programming language and loop through each row, but I asked a colleague from India and he gave me this query:

    DELETE NestleComments FROM NestleComments d

    LEFT OUTER JOIN

    (SELECT ManagerComment, pk = MIN(PK)

    FROM NestleComments

    GROUP BY ManagerComment) x

    ON d.pk = x.pk

    WHERE x.pk IS NULL

    GO

    It works perfect and deletes the duplicate comments! But I don't understand why! I looked at it like 100 times but don't get it. But I don't want to give up, I want to learn why it works! Can someone please explain what happens in this query? Thank you!

  • Are you sure that works?

    For me it has errors. There's no such thing as "DELETE field FROM Table" as it will delete the entire row. And AFAIK you can't use JOINs in a DELETE statement.You should post DDL and sample data and explain why are you using 2 tables ( NestleComments & Dup).

    Here's an example of what you can do if you have an ID field.

    DECLARE @Tabletable(

    tableidint identity,

    stringvarchar(15))

    INSERT @Table(string)

    SELECT 'a' UNION ALL

    SELECT 'a' UNION ALL

    SELECT 'b' UNION ALL

    SELECT 'b' UNION ALL

    SELECT 'b' UNION ALL

    SELECT 'c'

    SELECT * FROM @Table

    DELETE FROM @Table

    WHERE tableid NOT IN (SELECT MIN(tableid)

    FROM @Table

    GROUP BY string)

    SELECT * FROM @Table

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Are you sure that works?

    For me it has errors. There's no such thing as "DELETE field FROM Table" as it will delete the entire row. And AFAIK you can't use JOINs in a DELETE statement.

    You should post DDL and sample data and explain why are you using 2 tables ( NestleComments & Dup).

    The code is not deleting a field it is deleting the table. This is perfectly valid syntax. It is sql server specific but it is much the same as UPDATE FROM.

    Delete TableName

    from TableName

    join OtherTable ...

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Oksana March (9/26/2012)


    Hi everyone,

    I need help understanding something.

    The situation is - I have a table that stores comments about shipments, and sometimes the same comment is in a table several times. So I had to write some code to delete duplicate comments.

    I could not find a way without using some programming language and loop through each row, but I asked a colleague from India and he gave me this query:

    DELETE NestleComments FROM NestleComments d

    LEFT OUTER JOIN

    (SELECT ManagerComment, pk = MIN(PK)

    FROM Dup

    GROUP BY ManagerComment) x

    ON d.pk = x.pk

    WHERE x.pk IS NULL

    GO

    It works perfect and deletes the duplicate comments! But I don't understand why! I looked at it like 100 times but don't get it. But I don't want to give up, I want to learn why it works! Can someone please explain what happens in this query? Thank you!

    As Luis said, without ddl and some sample data it is hard to say exactly but the gist of it is pretty straight forward.

    You have two tables NestleComments and another called Dup. I assume that this table Dup contains a copy of the comment when it is duplicated? Maybe it will help your understanding if you change this into a select.

    Below is the query that will return all the rows to delete.

    Select *

    FROM NestleComments d

    LEFT OUTER JOIN

    (SELECT ManagerComment, pk = MIN(PK)

    FROM Dup

    GROUP BY ManagerComment) x

    ON d.pk = x.pk

    WHERE x.pk IS NULL

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (9/26/2012)


    Are you sure that works?

    For me it has errors. There's no such thing as "DELETE field FROM Table" as it will delete the entire row. And AFAIK you can't use JOINs in a DELETE statement.

    You should post DDL and sample data and explain why are you using 2 tables ( NestleComments & Dup).

    The code is not deleting a field it is deleting the table. This is perfectly valid syntax. It is sql server specific but it is much the same as UPDATE FROM.

    Delete TableName

    from TableName

    join OtherTable ...

    Oh, you're right. I got confused and thought it was a field (not sure why). Now it makes sense.

    What does not makes sense is the "Dup" table.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • It looks as though dup is a copy of NestleComments.

    Assuming this is so:

    The subquery gets the minimum PK for the comment. This is joined to NestleComments on PK. This means all comments in NestleComments with PK <> minimum PK show a NULL in the join, on the x-side.

    Any records where x.PK is null are deleted - i.e. any where PK <> minimum PK!

    I think thats right... 🙂

  • Sorry, my bad. Of course Dup is an exact copy of NestleComments which is a table I made for testing. I have corrected the original query.

    Thanks everyone for responding. I think I started to understand after turning it into SELECT as Sean suggested. Thanks, Sean! The SELECT returns only the records that need to be deleted.

    I still have questions...

    I have never seen this syntax "delete table from table". When is it used?

    What is the value of x in my query? Is x like a table with records? Does it loop through all the records in x and compare the PK? Didn't think this is possible with simple query, only with cursor 😉

  • Oksana March (9/26/2012)


    Sorry, my bad. Of course Dup is an exact copy of NestleComments which is a table I made for testing. I have corrected the original query.

    Thanks everyone for responding. I think I started to understand after turning it into SELECT as Sean suggested. Thanks, Sean! The SELECT returns only the records that need to be deleted.

    I still have questions...

    I have never seen this syntax "delete table from table". When is it used?

    What is the value of x in my query? Is x like a table with records? Does it loop through all the records in x and compare the PK? Didn't think this is possible with simple query, only with cursor 😉

    If Dup is a copy of NestleComments then it is not needed for this at all. You can just do a self join (in fact, I see you have edited your original post doing this).

    x is the alias given to the derived table. It is exactly like a table but it is derived for the purpose of this query. This DOES NOT LOOP, it is a join. In this case it is just joining to another query instead of a base table. You can use this technique when you want to delete rows that meet a certain condition like in this example.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • DELETE table FROM table... is used when you want to use a something like a select statement with joins (usually) to control which records you delete. You leave off 'SELECT *' & use the FROM... onwards. If the condition is within the single table you can just say 'DELETE table WHERE ...'.

    x is just an alias for the query in brackets. The results of the query are treated exactly as if it was a real table, & x is the alias of that table. It's not a loop - it's a set-based operation like this:

    SELECT ManagerComment, MIN(PK) FROM table GROUP BY ManagerComment.

  • Oksana March (9/26/2012)


    DELETE NestleComments FROM NestleComments d

    LEFT OUTER JOIN

    (SELECT ManagerComment, pk = MIN(PK)

    FROM NestleComments

    GROUP BY ManagerComment) x

    ON d.pk = x.pk

    WHERE x.pk IS NULL

    GO

    It works perfect and deletes the duplicate comments! But I don't understand why! I looked at it like 100 times but don't get it. But I don't want to give up, I want to learn why it works! Can someone please explain what happens in this query? Thank you!

    Let me have a try at explaining (though I think those who have responded before me have probably given you enough info already):

    Let's say you start with data like this:

    PKManagerComment

    1Some Comment

    2Another Comment

    3Stupid Comment

    4Some Comment

    5No Comment

    6Stupid Comment

    The subquery (in quotes) will group by the ManagerComment field and return a sort of virtual table with the smallest PK value for each duplicate comment. In this example, like this:

    PKManagerComment

    1Some Comment

    2Another Comment

    3Stupid Comment

    5No Comment

    The main query refers to this virtual table with the alias 'x', and the original table as 'd'. So, a left outer join would give you results like this:

    d.PKd.ManagerCommentx.PKx.ManagerComment

    1Some Comment1Some Comment

    2Another Comment2Another Comment

    3Stupid Comment3Stupid Comment

    4Some CommentNULLNULL

    5No Comment5No Comment

    6Stupid CommentNULLNULL

    So, the rows where x.PK is NULL have corresponding d.PK values of 4 and 6. These are the duplicates. SQL then deletes the corresponding rows in d. That's the DELETE [FROM] <table> FROM <table> JOIN <othertable>... It's a confusing syntax, and not an ANSI SQL standard, but it works.

    Hope that helps.

    Rob Schripsema
    Propack, Inc.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply