June 17, 2004 at 12:26 pm
I would appreciate it if someone could help me with SQL (or something similar) to delete duplicates from a table that has almost 10 million rows. I think there may be about 500,000 duplicate rows in this table. As you can understand, it is difficult to create an environment to test this (because of transaction log issues, etc.) before applying it in production.
Thanks
June 17, 2004 at 12:53 pm
The easy way to do it is to add a new column to the table, call it MyID, as an INT IDENTITY(1,1).
And now comes the tricky part...
For example, select each group's columns plus MAX(MyID) into a temp table:

SELECT col1, col2, col3 (and so on... end with), MAX(MyID) AS MaxMyID
INTO #Temp
FROM MyTable
GROUP BY col1, col2, col3 (and so on)

And for deleting the obsolete rows:

DELETE FROM MyTable WHERE MyID NOT IN (SELECT MaxMyID FROM #Temp)
DROP TABLE #Temp
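To see the keep-the-MAX(MyID) idea end to end, here is a minimal, runnable sketch using SQLite through Python's sqlite3 module instead of T-SQL (the table mytable and the columns myid/col1/col2 are made-up illustration names, not the original poster's schema; SQLite's INTEGER PRIMARY KEY plays the role of the IDENTITY column, and a TEMP TABLE plays the role of #Temp):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A table with duplicate (col1, col2) pairs and an identity-style id column.
cur.execute("CREATE TABLE mytable (myid INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT)")
cur.executemany("INSERT INTO mytable (col1, col2) VALUES (?, ?)",
                [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y"), ("c", "z")])

# Capture the highest id of each group, then delete everything else.
cur.execute("""
    CREATE TEMP TABLE keep AS
    SELECT MAX(myid) AS max_id FROM mytable GROUP BY col1, col2
""")
cur.execute("DELETE FROM mytable WHERE myid NOT IN (SELECT max_id FROM keep)")
conn.commit()

cur.execute("SELECT col1, col2 FROM mytable ORDER BY col1")
print(cur.fetchall())  # [('a', 'x'), ('b', 'y'), ('c', 'z')]
```

Each (col1, col2) combination survives exactly once: the row with the highest id in its group.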
June 17, 2004 at 1:01 pm
I am worried about adding a new column, and also about doing the delete in bulk (500,000 rows), since it is a huge table. What do you think?
June 17, 2004 at 3:26 pm
You could create a cursor for the #temp table and delete rows one by one with the syntax WHERE CURRENT OF.
June 17, 2004 at 8:58 pm
Hi,
See this article.
http://www.sqlservercentral.com/columnists/ccubley/findinganddeletingduplicatedata.asp
Good luck,
K.Senthil Kumar
June 18, 2004 at 6:35 am
I suggest you first run a query like:

SELECT COUNT(*) AS cnt, col1, col2
FROM table
GROUP BY col1, col2
HAVING COUNT(*) > 1

where col1 and col2 are the columns that are supposed to be unique.

Run it in Query Analyzer, then create a view from that query, use the Import/Export Wizard to export the view to Excel, and show it to the concerned person for verification before you delete those records.

That will be much safer, because you will have shown the duplicates to the concerned authority first.
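The duplicate-finding query above can be tried out with SQLite via Python's sqlite3 module; a small sketch with made-up table and column names (t, col1, col2):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", "x"), ("a", "x"), ("a", "x"), ("b", "y"), ("c", "z")])

# List each duplicated (col1, col2) pair with its occurrence count.
cur.execute("""
    SELECT COUNT(*) AS cnt, col1, col2
    FROM t
    GROUP BY col1, col2
    HAVING COUNT(*) > 1
""")
duplicates = cur.fetchall()
print(duplicates)  # [(3, 'a', 'x')]
```

Only the ('a', 'x') pair appears more than once, so it is the only group reported.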
Best Luck
*Sukhoi*
June 18, 2004 at 9:38 am
I've done this before with a correlated subquery.
Assuming you have an identity column named ID, and columns FirstName, LastName, Zip for which each combination should be unique, you could run:
DELETE FROM mytable WHERE ID >
(SELECT MIN(ID) FROM mytable b WHERE
mytable.FirstName = b.FirstName AND
mytable.LastName = b.LastName AND
mytable.Zip = b.Zip)
This would delete all but the lowest ID record for each duplicate.
Alternatively, you could change it to an UPDATE query that marks the records to be kept; whichever way you mark them can then serve as the criterion for deleting the rest after review.
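The correlated-subquery delete can be demonstrated in a self-contained way with SQLite via Python's sqlite3 module (the people table and its sample rows are invented for illustration; the delete itself is the same shape as the T-SQL above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE people (
    id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, zip TEXT)""")
cur.executemany(
    "INSERT INTO people (firstname, lastname, zip) VALUES (?, ?, ?)",
    [("Ann", "Lee", "10001"), ("Ann", "Lee", "10001"),
     ("Bob", "Ray", "20002"), ("Ann", "Lee", "10001")])

# Delete every row whose id is greater than the lowest id of its
# (firstname, lastname, zip) group, leaving exactly one row per group.
cur.execute("""
    DELETE FROM people WHERE id >
        (SELECT MIN(id) FROM people b
         WHERE people.firstname = b.firstname
           AND people.lastname  = b.lastname
           AND people.zip       = b.zip)
""")
conn.commit()

cur.execute("SELECT id, firstname FROM people ORDER BY id")
print(cur.fetchall())  # [(1, 'Ann'), (3, 'Bob')]
```

The three "Ann Lee" duplicates collapse to the row with the lowest id; the unique "Bob Ray" row is untouched.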
June 18, 2004 at 9:48 am
To address your concern about deleting too many records at once (a bulk delete), I suggest doing it in a loop with SET ROWCOUNT 10000 so that the delete runs in batches. It will definitely also help keep the transaction log from filling up.
SET ROWCOUNT 10000
WHILE 1 = 1
BEGIN
    -- delete stmt for duplicates goes here
    IF @@ROWCOUNT = 0 BREAK
    -- if you do not care about logging the delete operation:
    DUMP TRAN <DB> WITH NO_LOG
END
SET ROWCOUNT 0
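The batching idea can be sketched outside SQL Server as well; here is a runnable illustration using SQLite via Python's sqlite3 module, where a LIMIT inside the subquery stands in for SET ROWCOUNT (the table t, its columns, and the batch size of 10 are all made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col1 TEXT)")
# 25 duplicate rows sharing one value, plus one unique row.
cur.executemany("INSERT INTO t (col1) VALUES (?)",
                [("dup",)] * 25 + [("keep",)])

BATCH = 10  # delete at most this many duplicate rows per pass
while True:
    # Same keep-the-MIN(id) rule as above, capped per pass by LIMIT.
    cur.execute("""
        DELETE FROM t WHERE id IN (
            SELECT id FROM t AS outer_t
            WHERE id > (SELECT MIN(id) FROM t b WHERE outer_t.col1 = b.col1)
            LIMIT ?)
    """, (BATCH,))
    deleted = cur.rowcount
    conn.commit()  # commit each batch so no single transaction grows large
    print(f"deleted {deleted} rows this pass")
    if deleted == 0:
        break
# Passes delete 10, 10, and 4 rows; the fourth pass deletes none and exits,
# leaving one "dup" row (the lowest id) and the "keep" row.
```

Printing the per-pass count, as suggested, doubles as a progress indicator on a long-running cleanup.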
You can also include statements to display how many records are deleted in each loop, so that you can see the progress.
I definitely suggest a batched delete.
Good luck!
June 18, 2004 at 10:38 am
Thanks, everyone, for all the good ideas. The only question that remains: the table does not have an identity column, so I guess I have to add a column and then assign a unique number programmatically, updating all 10 million rows. If anyone gets a great idea, please update the forum, as I expect a very long working weekend doing this.
June 18, 2004 at 10:53 am
You don't have to assign the values programmatically. In the table designer, add a column with data type int, then set its Identity property to Yes. By default SQL Server will number the column starting at 1 and incrementing by 1.
June 18, 2004 at 11:02 am
I always thought that if you add an identity column to an existing table, only new rows get numbers assigned. I just tried it and found I was wrong. Thanks so much; this should make life pretty easy.