Removing Duplicate Rows

  • I have a table called company_contacts

    CREATE TABLE [dbo].[company_contact] (

     [contact_id] [int] IDENTITY (1, 1) NOT NULL ,

     [company_id] [int] NOT NULL ,

     [contact_type_id] [smallint] NOT NULL ,

     [contact_name] [varchar] (55)

    ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

    GO

    ALTER TABLE [dbo].[company_contact] WITH NOCHECK ADD

     CONSTRAINT [PK_company_contact] PRIMARY KEY  CLUSTERED

     (

      [contact_id]

    &nbsp  ON [PRIMARY]

    GO

    I have removed superfluous columns from the script.

    Sample Data

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (329,6,'John Deere')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (329,6,'John Deere')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (329,6,'John Deere')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (108,6,'Mary Washington')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (108,6,'Mary Washington')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (685,6,'Hillary clinton')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (1002,6,'Shania Twain')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (1002,6,'Shania Twain')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (1002,6,'Shania Twain')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (1002,6,'Shania Twain')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (1002,6,'Shania Twain')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (3,6,'Dave Navaro')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (1,6,'William Tale')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (1,6,'William Tale')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (1,6,'William Tale')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (1,6,'William Tale')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (1,6,'William Tale')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (1,6,'William Tale')

    insert into company_contacts (company_id ,contact_type_id ,contact_name ) values (1,6,'William Tale')

    As one can see from the sample data, multiple entries have been made by mistake. This is a result of a bug in the application that did the insert.

    My problem is this. I need to remove only duplicate entries while leaving on one entry for each person and contact type. For this exercise I am dealing only with contact type, 6.

    After you insert the data you can see that in order to delete the duplicate rows you need the [contact_id] of the dupliccate row. I need help in finding the contact_id of just the duplicates while leaving any non duplicated row.

    As always your help is appreciated.

    Gary

  • Gary, I replied earlier to your problem and have since deleted that post.  I was not reading clearly and have since noticed that you have an indentity column as your primary key.

    Your solution could end up being very easy, or a bit of a hassle.

    Tell me - are you replicating that PK as a foreign key in other tables?

  • Gary, this should get you started.  Assuming that the duplicate record you want to keep is arbitrary, you can left join to a derived table to get the information you want:

    SELECT MIN(contact_ID)as PK, company_id, contact_type_id, contact_name

    FROM company_contacts

    GROUP BY company_id, contact_type_id, contact_name

    Based on this query, you can delete your duplicate records using:

    DELETE FROM company_contacts

    FROM company_contacts c

    LEFT JOIN

    (SELECT MIN(contact_ID)as PK, company_id, contact_type_id, contact_name

    FROM company_contacts

    GROUP BY company_id, contact_type_id, contact_name) vt

    ON c.contact_ID=vt.PK

    WHERE vt.PK IS NULL

    I have never had to delete replicated FK's being used this way but I would think you could delete those the same way - by left joining to your cleaned up company_contacts table.

    Also, your original post indicates that you ommitted superfluous columns from your DDL for the post.  You will want to consider your requirements for these columns and adjust the SQL above.

    Again, depending on your PK to FK relationsships, definitely logic check the above to make sure it works for you.

  • rhunt, the contact_id is not a foregin key in any table. The company_id is a foregin key to another table.

    I was thinking along the same line as you but I just could not put it together.

     

    Thanks for your replies.

     

    Gary

  • Hi,

    I tested rhunt's SQL and it does precisely what you need - as far as I can tell. I thought there was a typing error in it (superfluous FROM on the first line), but it seems that even with that it works correctly - I didn't know you can repeat it. However, I'd prefer to write it this way, because the 2 "FROM" look confusing to me:

    DELETE company_contacts

    FROM company_contacts c

    LEFT JOIN

    (SELECT MIN(contact_ID)as PK, company_id, contact_type_id, contact_name

    FROM company_contacts

    GROUP BY company_id, contact_type_id, contact_name) vt

    ON c.contact_ID=vt.PK

    WHERE vt.PK IS NULL

    If it doesn't do the job, please let us know what's the problem.

  • Hi,

    One more way of removing the duplicate records would be to dump the filtered data into a new table.

    SELECT company_id,contact_type_id, contact_name into company_contacts

    FROM company_contact

    GROUP BY company_id, contact_type_id, contact_name

    Then alter the new <company_contacts> table, to add the contact_id column

    alter TABLE [dbo].[company_contacts]

    add [contact_id] [int] IDENTITY (1, 1) NOT NULL

    Then drop the existing table and rename the new table to the name of original table.

    I know this is a much lenghty method of doing it. but just wanted to share it with you guys.

    --Kishore.

     

     

  • Or you can just run this statement:

    delete c1

    From  company_contact c1

    JOIN  company_contact c2

     on  c1.company_ID = c2.Company_ID

     and c1.contact_type_id = c2.contact_type_id

     and c1.contact_name = c2.contact_name

    where c1.contact_id > c2.contact_id

     

    Signature is NULL

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply