In a previous article I discussed using Fuzzy Grouping to
find inexact duplicates in the same dataset. But what if you have 2 different
datasets that you need to review to find common people? You could look for
exact records, but what about misspellings, typos There are a few
probabilistic linkage software packages on the market, but the fuzzy lookup
transformation in SSIS can do the job. For example, in our hospital I was
presented with 1244 patients from an outside study database and was asked if
they existed in our 1.7 million patient registration database.
Fix the SSIS executables first
First install SQL Server service pack 2 or the memory leak
fix http://support.microsoft.com/kb/912423/
. This fix resolves many memory issues when using record sets with over 1
million records, but there are still issues with more than 4 million records.
Define a Dataset
First add the input dataset that contains the records you
want to match. In this example it is my table of 1244 patients. To do this,
open Business Intelligence Development Studio.and start a BI project, then
define an OLE DB Source dataset. Any definable dataset may be used, but as many
temporary tables are created, you must have a connection to a SQL Server and be
a user with permission to create tables. Also limit your dataset to the fields
to be used for lookups and pass through to increase performance. Be wary of
data types, the fuzzy transformations rely on string comparisons to build the
Error-Tolerant Index (ETI), so datatypes like datetime must be transformed to
varchar.
Add a Fuzzy Lookup Transformation
Drop the fuzzy lookup transformation on the project. Then
point the datasource output to the fuzzy lookup. Double click on the
transformation to get to the Fuzzy Lookup Transformation Editor.
In the editor, define your reference table. My reference
table will be my patient registration table. The transformation is dependent on
indexes that are created during the lookup process. These indexes are different
than normal table indexes and are not stored, unless the Store new index box is
checked. Since this will not be a normal process of finding small lookup
values, there is no reason to store the newly created index. Add your OLE DB
connection, choose Generate new index, and then select the reference table
(mine is actually a view that converts the date of birth to a varchar).
Choose the Columns to Lookup and Return
Clicking on the Columns tab will allow selection of columns
to use in the fuzzy lookup and to return as a pass through.
The two tables here are the Input and the reference tables.
Simply drag and drop the fields from the input to the appropriate Lookup
columns. In this example I have first name, last name and date of birth in my
input, so I will match on those in the reference table. I need to get the medical
record number (MRN) out of the lookup table to return to our study coordinator,
so I choose MRN as a pass through.
Right clicking on the joined field line and choosing Edit
Mappings will allow further customization of the lookup in the Create
Relationships editor.
The Create Relationships Editor
Here are the guts of the lookup:
The Mapping Type indicates if the join is Fuzzy or Exact. If
your data columns are anything aside from strings (datetime, int, etc.) the
Mapping Type will only allow exact. The Comparison Flags allow further
processing choices, such as Ignore case, Ignore kana type, and others.
The Minimum Similarity sets the threshold similarity for
each field. If you want some fields to have a stronger weight in the linkage,
it should be given a higher value (closer to 1). In my example, I have very few
fields, and both first and last name can vary due to marriage, nicknames,
abbreviations and so forth. However, a patients date of birth should remain
constant, so I want it weighted higher than other columns. Setting a higher
Minimum Similarity will reduce the number of potential matches and thus speed
processing. Lastly, set the Similarity Output Alias if you want column names
other than the default in the resulting output table.
The Advanced Tab
Clicking on the Advanced tab shows the Maximum number of
matches to output per lookup. The default is 1, but if multiple matches are
desired, this number can be adjusted. The Similarity threshold sets the
threshold for the entire lookup. The closer to 1, the more alike the records
must be to satisfy the lookup. Bumping up this threshold will reduce the number
of matches, but could also miss records if they are less exact.
The Advanced Editor for Fuzzy Lookup
Most of the Custom Properties have been set in the
Relationship Editor; however there are other important elements in the advanced
editor. The Exhaustive indicator defaults to False. Setting it to True will
cause the lookup to reference each row in the input to each row in the
reference table. To do this, all records in the reference table are loaded into
memory and this can be very slow for large datasets. But it may result in
better matches. False will limit the results to matches where at least one of
the substrings is common to both the input and the reference.
WarmCaches are very important for speed with large datasets,
but do eat up more memory. If set to True, the ETI and the reference table will
be partially loaded into RAM. For small datasets setting WarmCaches to False
may actually be faster. The MaxMemoryUseage setting can be useful if you dont
want to allocate all available RAM to the transformation. If zero, it will
dynamically use RAM based on need and availability.
Add a Destination
Once the lookup component is configured, add a destination
and link the Fuzzy Lookup Output to this destination. Create the destination
first in the Connection Manager. Next click on the mappings to see what the
output will look like.
Running the package.
I recommend having SSIS on a separate machine (yes you need
another license) from your databases as this is a very memory- and CPU-
intensive operation. I ran this on a Dual Core 1.86 GHZ with 2 GB of RAM with
my reference and input tables on a Dual Core 2.21 GHZ server with 4 GB RAM. My
reference table had 1,711,968 records and the input had 1244. A non-exhaustive
transform took just over 15 minutes.
The Output
The output is of course dependent on your input and
selection of pass through fields. But the important fields are the Similarity
(one for each field and an overall), and the Confidence. Each field can have
its similarity to the reference output and it will tell you how like the values
are, where 1 is an exact match and .00 has nothing in common. There is also an overall
Similarity for all fields. The confidence field is the measure of likelihood
that this particular record is the best match with the reference table. So both
these fields are important in determining if the records are a match. A tie in
the similarities could be broken with a higher confidence score. Of course,
with SSIS the output can be directed to do many things. For my purposes, I will
have to generate a work list for someone to review.
To demonstrate, I created 2 dummy tables with fake data, and
I ran the project with a MaxOutputMatchesPerInput value
of 2.
The input:
The reference:
The output:
Note that with the MaxOutputMatchesPerInput of 2, I get two possible
matches for Mary Smith, Mary and Mark (both mathematically very similar). The
Similarity and confidence scores should tell me which to use. Also, if no
records are found that meet the minimum similarity, Nulls will be returned.
Conclusion
The fuzzy lookup transformation was designed to find and
standardize lookup values; however, with a little tweaking, it can be used for
probabilistic linkage of different datasets. The quality of the matching is
dependent on the fields you have and the quality of data therein.