SQL Server 2005 Integration Services (SSIS) introduces two new
tools designed for Data Warehousing, but their uses are far more than just
warehousing. Fuzzy Lookups and Fuzzy Grouping can both improve quality of data.
Fuzzy lookups is designed to correct errors in lookup tables such as
misspelling in cities or states. Fuzzy grouping finds duplicates in datasets. The
hospital where I work strives to identify and merge duplicates to improve and
maintain one complete electronic medical record for each patient. We have been
using the fuzzy grouping tool for the past 6 months.
The concept of fuzzy grouping is not new; many have used
probabilistic linkage of datasets to find duplicates and related records in
other datasets (for more info visit http://www.utcodes.org/Linkage/description.htm).
Few commercial probabilistic linkage packages are available and they are
expensive. With SSIS we now have a free tool to do probabilistic linkages.
Fix the SSIS executables first
First install service pack 2 or the memory leak fix http://support.microsoft.com/kb/912423/. This fix resolves many memory issues when using record sets with over 1 million records, but there are still issues with more than 4 million records. Fuzzy grouping is memory and processor intensive so running SSIS on a server separate from the databases is recommended. On a 1.8 GHZ Pentium IV with 512 MB. My tests running an 800,000 record dataset took 5 hours. Adding another 512MB reduced the run time to 2 hours.
Define A Dataset
Open Visual Studio and start a BI project then define an OLE DB Source dataset. Any definable dataset may be grouped, but as many temporary tables are created, you must have a connection to a SQL Server and be a user with permission to create tables. Also limiting your dataset to the fields to be used for grouping and a primary key will increase performance. Be wary of data types as many are not supported and datetime fields must be transformed to varchar.
Add A Fuzzy Grouping Step
Next add a Fuzzy Grouping package. Define the fields that identify
the duplicates. Fields such as First Name, Last Name, Gender, and Date of Birth
will prove sufficient to find duplicates. Fields with high discerning power
such as Social Security number will improve accuracy and should be given a
higher weight. A primary key should also be used as a pass-through in the
dataset to aid in future joins of the result set. Define the type of grouping, either exact (data must be equal) or fuzzy (data is compared to determine similarity). The fuzzy grouping allows you to choose further options to ignore cases, punctuation, kana, non-spacing
characters, character width or symbols.
Minimum Similarities
Minimum Similarities should be defined for each field. These
thresholds define how closely you want each of the values to correspond. Minimum
similarities are between 0 and 1. Where 1 means the values are exact and 0 means
review everything. Setting a higher minimum similarity will speed the matching
process. Minimum similarities are very dependent on data and the number of
matches, and match quality will depend on these values. Setting minimum similarities
too high will result in only closer to exact records being found. So you may
need some experimentation to determine the best thresholds to use. For example,
I like my dates of birth (dob) to be close, but I realize that data entry
errors skew dates, so I set a threshold of .20 for dob. In the example below May 12 1972 is
compared against May 12 1971 and the similarity score is .84 so it would make the cut. If unsure run some tests with no minimum similarities and review your resulting _similarity fields for each field. Take some averages and review the matches to determine what a good match is, then set your threshold at this level.
Minimum Thresholds
An overall minimum threshold must also be set. I have found that matches with my data get weak after .80. So I set my minimum overall threshold at .80. Again this is dependent on your data, accuracy of entry staff and how closely you want/need your records to match.
Define the Destination
The last step sets an OLE DB destination and either creates
a new table or uses existing and matches the output fields in the fuzzy
grouping step.
The outcome of the grouping will give a _key_in which is the
unique key of the first table, a key_out field that identifies the unique
groups, the _similarity_fieldname and the _score field that is the average of
the _similarity_fieldname. The _score field is between 0 and 1 where a 0 is no
similarity and 1 is exact.
Running the following dataset through the grouping:
Here is the sample input data:
And here are the outputs:
Here it grouped the first 3 records together and the last 3 together. Note it did not choose the correct name in grouping one, it simply noted that they are all the same. So if you want the "correct" person, as we did for the hospital, you will need a human to review the results. In our hospital database of over 1.4 million records we found over 5000 duplicates. Unfortunately the nature of the business requires us to review each of the dataset and pull medical records, so an automated merge process is out of the question. But Fuzzy Grouping has proved to be a very valuable tool and has given us many useful reports.