Finding Similar Data Using SQL Server Integration Services

SQL Server 2005 Integration Services (SSIS) introduces two new

tools designed for Data Warehousing, but their uses are far more than just

warehousing. Fuzzy Lookups and Fuzzy Grouping can both improve quality of data.

Fuzzy lookups is designed to correct errors in lookup tables such as

misspelling in cities or states. Fuzzy grouping finds duplicates in datasets. The

hospital where I work strives to identify and merge duplicates to improve and

maintain one complete electronic medical record for each patient. We have been

using the fuzzy grouping tool for the past 6 months.

The concept of fuzzy grouping is not new; many have used

probabilistic linkage of datasets to find duplicates and related records in

other datasets (for more info visit http://www.utcodes.org/Linkage/description.htm).

Few commercial probabilistic linkage packages are available and they are

expensive. With SSIS we now have a free tool to do probabilistic linkages.

Fix the SSIS executables first

First install service pack 2 or the memory leak fix http://support.microsoft.com/kb/912423/. This fix resolves many memory issues when using record sets with over 1 million records, but there are still issues with more than 4 million records. Fuzzy grouping is memory and processor intensive so running SSIS on a server separate from the databases is recommended. On a 1.8 GHZ Pentium IV with 512 MB. My tests running an 800,000 record dataset took 5 hours. Adding another 512MB reduced the run time to 2 hours.

Define A Dataset

Open Visual Studio and start a BI project then define an OLE DB Source dataset. Any definable dataset may be grouped, but as many temporary tables are created, you must have a connection to a SQL Server and be a user with permission to create tables. Also limiting your dataset to the fields to be used for grouping and a primary key will increase performance. Be wary of data types as many are not supported and datetime fields must be transformed to varchar.

Add A Fuzzy Grouping Step

Next add a Fuzzy Grouping package. Define the fields that identify

the duplicates. Fields such as First Name, Last Name, Gender, and Date of Birth

will prove sufficient to find duplicates. Fields with high discerning power

such as Social Security number will improve accuracy and should be given a

higher weight. A primary key should also be used as a pass-through in the

dataset to aid in future joins of the result set. Define the type of grouping, either exact (data must be equal) or fuzzy (data is compared to determine similarity). The fuzzy grouping allows you to choose further options to ignore cases, punctuation, kana, non-spacing

characters, character width or symbols.

Minimum Similarities

Minimum Similarities should be defined for each field. These

thresholds define how closely you want each of the values to correspond. Minimum

similarities are between 0 and 1. Where 1 means the values are exact and 0 means

review everything. Setting a higher minimum similarity will speed the matching

process. Minimum similarities are very dependent on data and the number of

matches, and match quality will depend on these values. Setting minimum similarities

too high will result in only closer to exact records being found. So you may

need some experimentation to determine the best thresholds to use. For example,

I like my dates of birth (dob) to be close, but I realize that data entry

errors skew dates, so I set a threshold of .20 for dob. In the example below May 12 1972 is

compared against May 12 1971 and the similarity score is .84 so it would make the cut. If unsure run some tests with no minimum similarities and review your resulting _similarity fields for each field. Take some averages and review the matches to determine what a good match is, then set your threshold at this level.

Minimum Thresholds

An overall minimum threshold must also be set. I have found that matches with my data get weak after .80. So I set my minimum overall threshold at .80. Again this is dependent on your data, accuracy of entry staff and how closely you want/need your records to match.

Define the Destination

The last step sets an OLE DB destination and either creates

a new table or uses existing and matches the output fields in the fuzzy

grouping step.

The outcome of the grouping will give a _key_in which is the

unique key of the first table, a key_out field that identifies the unique

groups, the _similarity_fieldname and the _score field that is the average of

the _similarity_fieldname. The _score field is between 0 and 1 where a 0 is no

similarity and 1 is exact.

Running the following dataset through the grouping:

Here is the sample input data:

And here are the outputs:

Here it grouped the first 3 records together and the last 3 together. Note it did not choose the correct name in grouping one, it simply noted that they are all the same. So if you want the "correct" person, as we did for the hospital, you will need a human to review the results. In our hospital database of over 1.4 million records we found over 5000 duplicates. Unfortunately the nature of the business requires us to review each of the dataset and pull medical records, so an automated merge process is out of the question. But Fuzzy Grouping has proved to be a very valuable tool and has given us many useful reports.