How to calculate a duplication rate?

Question

How to calculate a duplication rate?

peterzeke

SSCrazy Eights

Points: 9003
More actions
October 15, 2010 at 7:34 am

#228070

I'm a reporting analyst/database developer in healthcare. One of the projects I work on focuses on our master person index (MPI) -- a MPI assigns a single ID to a person/patient that can be tracked across various sources of data: hospitals, labs, doctor offices...
It's important to minimize "duplicate" records in the MPI -- i.e., sometimes two or more different IDs exist for the same person due to failed logic in the MPI that didn't unify these IDs into one record.
I'm struggling to come up with a formula for calculating the percent of records (patients) that are duplicated in the MPI. Two approaches I've used are: 1) Take the count of all records that are considered duplicates divided by the total number of records overall; or 2) Take the count of all "cases" with duplicates divided by the total number of records. (A "case" reflects a situation where two or more patient records could be combined into one record.)
So, to add to the discussion, suppose I only have a total of 4 records wherein I discover that all 4 rows could be collapsed to 1 record --> what is the duplication rate? If I use the first calculation, I'll end up with a rate of 1 (100%); if I use the second calculation where 1 case exists for all 4 records, the rate will be .25 (25%).
Is one of the calculations correct? Does someone have a different method for determining the rate of duplicates in their sql tables?
Thanks in advance for suggestions/advice,
Pete

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Sean Lange SSC Guru Points: 286590 More actions · Answer 1

It sounds like in your example the duplication is either 100% or 75%. Obviously it is not 25%. That would indicate that of the 4 records only one of them is duplicated. 100% can be correct since it could be argued that all 4 of them were duplicated. However the most likely correct result is 75%. Meaning that the 1st entry was the original and then it was duplicated 3 times. Meaning that 3 out 4 are duplicates. To calculate it would be something like NumDuplicateRecords/TotalRecordCount.

In your example 3/4.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

peterzeke SSCrazy Eights Points: 9003 More actions · Answer 2

Thanks for the quick reply -- I've actually come up with a new formula that I think gets it right by merely tweaking the numerator (and agrees with your duplication rate):

Numerator = total number of records duplicated - total number of cases

Denom = total number of records overall

So, in the situation of having a total of 4 records, all of which appear to be duplicates as a single case:

Num = 4 records - 1 case = 3

Denom = 4 records

Result = (4 records - 1 case)/4 records = .75 (75%)

This new equation reminds me of how to calculate growth rate: (New value - Old value)/ Old value.

drew.allen SSC Guru Points: 76995 More actions · Answer 3

I think the best measure is probably

(totalRecords - uniqueRecords)/totalRecords

In your example that would give 75%. The hard part is, of course, identifying the unique records.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

peterzeke SSCrazy Eights Points: 9003 More actions · Answer 4

Thanks for the reply, Drew.

Yup, the trick is identifying the unique records (which I call "cases"). Fortunately, I developed quite a few identification rules which do a really good job of identify patient records that should be merged into a single record. Our vendor's EPI system lacks a lot of functionality; consequently, their system lets a lot of duplicates slip through that shouldn't, so we end up post-processing the output.