Old problem - deleting duplicates in a table

  • I would appreciate it if someone could help me with SQL (or anything else) to delete duplicates from a table which has almost 10 million rows. I think there may be 500,000 duplicate rows in this table. As you can understand, it is difficult to create an environment to test this (because of problems with the transaction log, etc.) and then apply it in production.

    Thanks

  • The easy way to do it is to add a new field to the table, called MyID, as an INT IDENTITY(1,1).

    And now comes the tricky part...

    For example:

    SELECT col1, col2, col3 (and so on... ending with), MAX(MyID) AS MaxMyID
    INTO #Temp
    FROM MyTable
    GROUP BY col1, col2, col3 (and so on)

    And for deleting the obsolete rows:

    DELETE FROM MyTable WHERE MyID NOT IN (SELECT MaxMyID FROM #Temp)

    DROP TABLE #Temp
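    Before running the delete above, a quick sanity check (a sketch, using the same MyID and #Temp) helps confirm the number of rows to be removed matches expectations:

    SELECT COUNT(*) AS RowsToDelete
    FROM MyTable
    WHERE MyID NOT IN (SELECT MaxMyID FROM #Temp)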

  • I am worried about adding a new column, as well as about doing the delete in bulk (500,000 rows), as it is a huge table. What do you think?

  • You could declare a cursor over the duplicate rows in the base table and delete them one by one with the syntax WHERE CURRENT OF, as in the sketch below.
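    A minimal sketch of that approach, assuming the MyID identity column and the #Temp table of MaxMyID values from the earlier post (the cursor name dup_cursor is just an example):

    DECLARE dup_cursor CURSOR FOR
        SELECT MyID FROM MyTable
        WHERE MyID NOT IN (SELECT MaxMyID FROM #Temp)

    DECLARE @id INT
    OPEN dup_cursor
    FETCH NEXT FROM dup_cursor INTO @id

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- delete the row the cursor is currently positioned on
        DELETE FROM MyTable WHERE CURRENT OF dup_cursor
        FETCH NEXT FROM dup_cursor INTO @id
    END

    CLOSE dup_cursor
    DEALLOCATE dup_cursor

    Note that deleting 500,000 rows one at a time will be slow; the main benefit is keeping each transaction small.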

  • Hi,

    See this article.

    http://www.sqlservercentral.com/columnists/ccubley/findinganddeletingduplicatedata.asp

    Good luck,

    K.Senthil Kumar

  • I suggest you first run a query like the following, where col1 and col2 are the columns that are supposed to be unique:

    SELECT COUNT(*) AS cnt, col1, col2
    FROM MyTable
    GROUP BY col1, col2
    HAVING COUNT(*) > 1

    Run it in Query Analyzer, then make a view of that query. Then use the import/export wizard to export the view to Excel and show it to the concerned person for verification before deleting those records.

    That will be much safer, because the concerned authority will have signed off on exactly what is being removed.
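    As a sketch (again assuming the example columns col1 and col2; the view name vDuplicateRows is hypothetical), the view could be:

    CREATE VIEW vDuplicateRows AS
    SELECT col1, col2, COUNT(*) AS cnt
    FROM MyTable
    GROUP BY col1, col2
    HAVING COUNT(*) > 1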

    Best of luck


    *Sukhoi*

  • I've done this before with a correlated subquery.

    Assuming you have an identity column named ID and columns FirstName, LastName, Zip for which each combination should be unique, you could run:

    DELETE FROM mytable WHERE ID >
        (SELECT MIN(ID) FROM mytable b
         WHERE mytable.FirstName = b.FirstName
           AND mytable.LastName = b.LastName
           AND mytable.Zip = b.Zip)

    This would delete all but the lowest ID record for each duplicate.

    Alternatively, you could change it to an UPDATE query to mark the records to be kept. Whichever way you mark them can then be used as the criteria to delete the rest after review; a sketch follows.
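    A sketch of that marking variant, using a hypothetical KeepFlag column (not part of the original table):

    ALTER TABLE mytable ADD KeepFlag BIT  -- hypothetical flag column
    GO

    UPDATE mytable SET KeepFlag = 1
    WHERE ID = (SELECT MIN(ID) FROM mytable b
                WHERE mytable.FirstName = b.FirstName
                  AND mytable.LastName = b.LastName
                  AND mytable.Zip = b.Zip)

    -- after reviewing the unmarked rows, delete them:
    -- DELETE FROM mytable WHERE KeepFlag IS NULL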

  • To address your concern about deleting too many records at once (a bulk delete), I suggest doing it in a loop with SET ROWCOUNT 10000 so that the delete runs in batches. It will definitely help keep the transaction log from filling up, too.

    while <duplicates are there>
    begin
        set rowcount 10000
        <delete statement for duplicates>
        set rowcount 0

        -- if you do not care about logging the delete operation:
        dump tran <DB> with no_log
    end

    You can also include statements to display how many records are deleted in each iteration so that you can see the progress.

    I definitely suggest a batch delete; a fuller sketch follows.
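    Putting this together with the earlier posts, a minimal runnable sketch (assuming the MyID identity column and the #Temp table of MaxMyID values from above; MyDB is a placeholder database name):

    DECLARE @n INT
    SET ROWCOUNT 10000

    WHILE 1 = 1
    BEGIN
        DELETE FROM MyTable
        WHERE MyID NOT IN (SELECT MaxMyID FROM #Temp)

        SET @n = @@ROWCOUNT  -- rows deleted in this batch
        IF @n = 0 BREAK      -- no duplicates left

        PRINT CONVERT(VARCHAR(12), @n) + ' rows deleted'

        -- optional, only if you do not need point-in-time recovery
        -- (SQL Server 2000-era syntax):
        -- DUMP TRAN MyDB WITH NO_LOG
    END

    SET ROWCOUNT 0  -- restore normal row-limit behavior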

    Good luck!

  • Thanks everyone for all the good ideas. The only question that remains is that the table does not have an identity column. So I guess I have to add a column, assign a unique number programmatically, and update all 10 million rows. If anyone has a better idea, please update the forum, as I expect a very long working weekend doing this.

  • You don't have to assign the values programmatically. In the table designer, add a column with data type int, then set the identity property to Yes. By default SQL Server will number the column starting at 1 and incrementing by 1.
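    The equivalent in T-SQL is a single statement (MyID is just an example name):

    ALTER TABLE MyTable ADD MyID INT IDENTITY(1,1)

    When an IDENTITY column is added to a table that already has rows, SQL Server assigns values to all the existing rows as part of the ALTER.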

  • I always thought that if you add an identity column to an existing table, only new rows get a number assigned. I just tried it and was wrong. Thanks so much; this should make life pretty easy.
