August 24, 2013 at 12:41 pm
We are still using SQL 2000, so my question is about SQL 2000.
Problem statement
We receive a comma separated file about once every month.
We import that into a SQL table. Once it is imported we run various scripts to compare the imported data to a database we have. When we have matches based on certain criteria the results are inserted into a results table.
Despite trying to prevent duplicates from getting inserted into the “results” table we still get duplicates because of one or another of the required criteria.
In an effort to remove certain duplicates, we assign a “Score” to each record and we flag the duplicates with a 1.
What I would like is some help with the following.
1.If there is a duplicate record then remove all but the one with the highest score.
2.If there are duplicate records and they all have the same score, then remove all duplicates except one, preferably the one with the most data in the row.
In the “results” table there is a primary key called ID.
There is a unique id in the source file called ScrubID that gets inserted into the “results” table as well.
I used this script, which I modified from one on this site, to delete records with the same score.
DELETE FROM results WHERE ID >
(SELECT MIN(ID) FROM results b where
results.ScrubID=b.ScrubID AND
results.score=b.Score )
I have tried to modify the above script to keep the higest score and it seems no matter what I try it does not work.
Please see modification below.
DELETE FROM results WHERE ID >
(SELECT MIN(ID) FROM results b where
results.ScrubID=b.ScrubID AND
results.score > b.Score )
I will try to provide some samples in a little while.
Thanks for any help and or guidance.
Gary
August 24, 2013 at 6:02 pm
I can't help with requirement #2 because I don't know what your data or table looks like and don't know what exactly you mean by "preferably the one with the most data in the row".
There are two fairly easy ways to accomplish what you ask. They will outperform takes turns outperforming each other depending on how many duplicates you have per ScrubID and what the indexes on the table are.
Here are the two different methods. I didn't test them because you didn't post any readily consumable data but you should get the idea. Both will handle virtually any number of duplicate ScrubID's but, like I said previously, will work at different performance levels based on how many dupes there are for each ScrubID.
DELETE lo
FROM dbo.Results lo
JOIN dbo.Results hi
ON lo.ScrubID = hi.ScrubID
AND lo.Score < hi.Score
;
WITH
cteEnumerateDupes AS
(
SELECT SortOrder = ROW_NUMBER() OVER (PARTITION BY ScrubID ORDER BY Score DESC),
ID --You probably won't need this but it gives people the nice warm fuzzies.
FROM dbo.Results
)
DELETE cteEnumarateDupes
WHERE SortOrder > 1
;
--Jeff Moden
Change is inevitable... Change for the better is not.
August 25, 2013 at 3:15 am
Jeff Moden (8/24/2013)
I can't help with requirement #2 because I don't know what your data or table looks like and don't know what exactly you mean by "preferably the one with the most data in the row".There are two fairly easy ways to accomplish what you ask. They will outperform takes turns outperforming each other depending on how many duplicates you have per ScrubID and what the indexes on the table are.
Here are the two different methods. I didn't test them because you didn't post any readily consumable data but you should get the idea. Both will handle virtually any number of duplicate ScrubID's but, like I said previously, will work at different performance levels based on how many dupes there are for each ScrubID.
DELETE lo
FROM dbo.Results lo
JOIN dbo.Results hi
ON lo.ScrubID = hi.ScrubID
AND lo.Score < hi.Score
;
WITH
cteEnumerateDupes AS
(
SELECT SortOrder = ROW_NUMBER() OVER (PARTITION BY ScrubID ORDER BY Score DESC),
ID --You probably won't need this but it gives people the nice warm fuzzies.
FROM dbo.Results
)
DELETE cteEnumarateDupes
WHERE SortOrder > 1
;
Just one problem with what you provided, Jeff. The OP needs a solution that will work with SQL Server 2000.
August 25, 2013 at 10:57 am
If you like using aggregates (all those MINs suggest that) and are not too concerned by the idea of using two queries instead of one, and that "having the most data" can be changed into "having the lowest ID" as suggested by your code, something that will work in SQL 2000 is
delete results R where not exists (
select 1 from (
select A.ScrubID, max(A.Score) as Score from results A group by ScrubID
) MX
where MX.Score = R.Score and MX.ScrubId = R.ScrubID); -- gets rid of low scores
delete results R where R.ID not in (select min(A.ID) from results A group by scrubID) ; -- gets rid of duplicates with high IDs
Four years ago, I might have worked out how to do it in a single query in SQL 2000, but years of being spoilt by the neat new features in SQL 2008 and its successors have made me less willing (and probably less able, too) to write really convoluted queries.
Actually, Jeff's first suggestion does the same as my first query, and will probably do it a lot more efficiently unless there's some very unlikely (and rather silly) indexing.
edit: spelling :blush:
Tom
August 25, 2013 at 2:08 pm
Lynn Pettis (8/25/2013)
Jeff Moden (8/24/2013)
I can't help with requirement #2 because I don't know what your data or table looks like and don't know what exactly you mean by "preferably the one with the most data in the row".There are two fairly easy ways to accomplish what you ask. They will outperform takes turns outperforming each other depending on how many duplicates you have per ScrubID and what the indexes on the table are.
Here are the two different methods. I didn't test them because you didn't post any readily consumable data but you should get the idea. Both will handle virtually any number of duplicate ScrubID's but, like I said previously, will work at different performance levels based on how many dupes there are for each ScrubID.
DELETE lo
FROM dbo.Results lo
JOIN dbo.Results hi
ON lo.ScrubID = hi.ScrubID
AND lo.Score < hi.Score
;
WITH
cteEnumerateDupes AS
(
SELECT SortOrder = ROW_NUMBER() OVER (PARTITION BY ScrubID ORDER BY Score DESC),
ID --You probably won't need this but it gives people the nice warm fuzzies.
FROM dbo.Results
)
DELETE cteEnumarateDupes
WHERE SortOrder > 1
;
Just one problem with what you provided, Jeff. The OP needs a solution that will work with SQL Server 2000.
In that case, the first solution I provided will work.
Good to "see" you around, Lynn.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 26, 2013 at 9:15 am
2. If there are duplicate records and they all have the same score, then remove all duplicates except one, preferably the one with the most data in the row.
If you post the schema for this table I can quickly give you a script to do this in SQL 2000. I used to write jobs that had to clean tables with no unique ID's. Should be easy to use a join and write a clean up script for the one you have.
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply