I know, I know your data could never possibly have duplicates because obviously, you have all the constraints in place to stop this bad data before it gets into the system… But we all make mistakes right? Let’s imagine someone did let the little critters in, how can we then write a query that will remove the duplicates?
We actually have a few options, some good, some not so good. Before we run through them if you want to follow along then run the following SQL to create two tables and populate them with some duplicated data…
De-Duping Without Window Functions
Let’s imagine for whatever reason you’re still rocking SQL Server 2000 and don’t have access to the hotness that is window functions. What are our options for de-duping data?
We currently have 2 tables, one with a unique key and one without. Let’s start with the one with no unique key…
In the DupesWithNoUniqeKey table with have absolutely nothing unique to use in our filters to say delete this record but leave at least one copy, we also don’t have access to window functions so we can’t stick a row number on each record and delete all the ones with a row number greater than one. This really leaves us with only one option and it’s a pretty painful option, especially if it’s a large table…
We’re having to do a SELECT DISTINCT into a new table to remove the duplicates, then we’re having to clear out our original table and re-insert all the data.
Let’s now look at our other table that does have a unique key which allows us to differentiate between the duplicate records…
We can run the following query to see all the duplicate records and filter out the first occurrence of them…
This is basically getting all duplicates except the one with the lowest ID, this will allow us to keep the first occurrence of every duplicate.
Once we’ve run that and looked at the output to confirm it’s returning the records we want to delete we can use the same CTE to do our delete…
This time we’re only touching the deleted rows and not having to move the whole table which in most cases will be far less overhead and cause less blocking.
De-Duping With Windows Functions
Things get a lot better with the introduction of Window Functions in SQL Server 2005, we no longer need to worry about having a unique key or some way to differentiate duplicates from each other, We can now use PARTITION BY to define what needs to be unique in order for it to be classed as a duplicate and ROW_NUMBER() to give each duplicate an ascending ID. At which point we can then delete anything with an ID > 1…
Notice we’re ordering by SELECT NULL this is because we have no real ID and so don’t care which version of a duplicate gets deleted.
Let’s imagine on our other table that does have a duplicate ID we only want records after the first one to be deleted, we can do that by ordering by ID in our partition…
It’s worth noting that everything above can be achieved without the use of CTE’s, I use them so I can first run a select against the CTE to check what I’m going to be deleting, for example…
In the above, I’ve changed the CTE to “SELECT *” so we can see all the data and also commented out the delete statement to replace it with a select. With this method, you can always run the select first and when you’re happy just comment it out and uncomment the delete line.