Update: Based on several excellent forum contribution's, I asked Michael Capes(friend with big brain) to review them and provide feedback. The result is a revision to the original script. See Forum posts for this article for details.We have also provided a test script and results.
The objective here is to demonstrate how to effectively use T-SQL to accomplish Fuzzy Matching and Fuzzy Grouping that supports record linkage. . I have worked in this area for many years and this is by no means a complete solution suitable for very high volumes, however this article will provide a workable T-SQL solution for matching and grouping required for de duplication. In addition we will demonstrate how to cleans a string and remove special characters.
This solution will create a capability similar to the UTL_MATCH Functions in Oracle.
The Problem
Let’s assume you have a list of prospective customers and you want to identify which ones are the same. However the list of prospective customers has some duplicate due to misspelling and or typos. Notice below cust_id 11 and 111 are probably the same person.
As you can see from the list above we have a list of Customer Ids and First and Last names. For our exercise the last names are assumed to be correct. Our objective is to group or match the unique Cust_Id records. We want to create an output list that links the similar customers and also normalizes or standardizes the first names.
I like to start with the end in mind and below is our final output.
We have cleansed (First_Name_Normalized) and grouped (Namegroup+_Id) similar customers. In this example for the customer Jones with three different first names Tomas, Tom and Thhomas we have normalized the first name as Thomas. Our next step would be to do a Merge/Purge and create a single customer records, we will examine this in a future article.
The Approach
Create a process which will compare the customer Name_Input list to a Lookup table with existing Standard first_Names , identify the similar first names and group them by assigning a Name_Id. separate lists into a new list containing and identifying the matches. In addition we will standardize the name. For instance we will make Tom and Tomas into Thomas or Pete into Peter.
This article will describe how to create a T-SQL Stored Procedure that will provide Fuzzy Matching via a Jaro Winkelr algorithm.
Specifically we will demonstrate how to implement the Jaro-Winkler Matching Algorithm .
“The Jaro-Winkler distance (Winkler, 1999) is a measure of similarity between two strings. It is a variant ofthe Jaro distance metric (Jaro, 1989, 1995) and mainly used in the area of record linkage (duplicatedetection). The higher the Jaro-Winkler distance for two strings is, the more similar the strings are. The Jaro-Winkler distance metric is designed and best suited for short strings such as person names. The score isnormalized such that 0 equates to no similarity and 1 is an exact match.”
The references used for this effort were:
- Thread: finding duplicates in large table with the use of utl_match
- Dev Shed Forums - Jaro-Winkler Matching Algorithm
The way it works
Basically we will read all the rows(NameInput) from an input dataet and compare each row (FirstName) in a lookup or reference table(NameLookup).
Obviously this process will create a large volumn of rows, we will filter the rows into one category.
Matches with Scores over 95%
First we wil create the Jaro-Winkler Stored Procedure.
-- ****************** Functions for Jaro **********************************************************************
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[fn_GetCommonCharacters]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))
DROP FUNCTION [dbo].[fn_GetCommonCharacters]
GO CREATE FUNCTION [dbo].fn_GetCommonCharacters(@firstWord VARCHAR(MAX), @secondWord VARCHAR(MAX), @matchWindow INT)
RETURNS VARCHAR(MAX) AS
BEGIN
DECLARE @CommonChars VARCHAR(MAX)
DECLARE @copy VARCHAR(MAX)
DECLARE @char CHAR(1)
DECLARE @foundIT BIT
DECLARE @f1_len INT
DECLARE @f2_len INT
DECLARE @i INT
DECLARE @j INT
DECLARE @j_Max INT
SET@CommonChars = ''
IF @firstWord IS NOT NULL AND @secondWord IS NOT NULL
BEGIN
SET @f1_len = LEN(@firstWord)
SET @f2_len = LEN(@secondWord)
SET @copy = @secondWord
SET @i = 1
WHILE @i < (@f1_len + 1)
BEGIN
SET@char = SUBSTRING(@firstWord, @i, 1)
SET @foundIT = 0
-- Set J starting value
IF @i - @matchWindow > 1
BEGIN
SET @j = @i - @matchWindow
END
ELSE
BEGIN
SET @j = 1
END -- Set J stopping value
IF @i + @matchWindow <= @f2_len
BEGIN
SET @j_Max = @i + @matchWindow
END
ELSE
IF @f2_len < @i + @matchWindow
BEGIN
SET @j_Max = @f2_len
END
WHILE @j < (@j_Max + 1) AND @foundIT = 0
BEGIN
IF SUBSTRING(@copy, @j, 1) = @char
BEGIN
SET@foundIT = 1
SET@CommonChars = @CommonChars + @char
SET @copy = STUFF(@copy, @j, 1, '#')
END
SET @j = @j + 1
END
SET @i = @i + 1
END
END
RETURN @CommonChars
END
GO IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[fn_calculateMatchWindow]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))
DROP FUNCTION [dbo].[fn_calculateMatchWindow]
GO CREATE FUNCTION [dbo].[fn_calculateMatchWindow](@s1_len INT, @s2_len INT)
RETURNS INT AS
BEGIN
DECLARE @matchWindow INT SET@matchWindow =CASEWHEN @s1_len >= @s2_len
THEN (@s1_len / 2) - 1
ELSE (@s2_len / 2) - 1
END
RETURN @matchWindow
END
GO IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[fn_calculateTranspositions]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))
DROP FUNCTION [dbo].[fn_calculateTranspositions]
GO CREATE FUNCTION [dbo].[fn_calculateTranspositions](@s1_len INT, @str1 VARCHAR(MAX), @str2 VARCHAR(MAX))
RETURNS INT AS
BEGIN
DECLARE @transpositions INT
DECLARE @i INT
SET@transpositions = 0
SET@i = 0
WHILE @i < @s1_len
BEGIN
IF SUBSTRING(@str1, @i+1, 1) <> SUBSTRING(@str2, @i+1, 1)
BEGIN
SET@transpositions = @transpositions + 1
END
SET @i = @i + 1
END
SET@transpositions = @transpositions / 2
RETURN @transpositions
END
GO IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[fn_calculateJaro]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))
DROP FUNCTION [dbo].[fn_calculateJaro]
GO CREATE FUNCTION [dbo].[fn_calculateJaro](@str1 VARCHAR(MAX), @str2 VARCHAR(MAX))
RETURNS FLOAT AS
BEGIN
DECLARE@Common1VARCHAR(MAX)
DECLARE@Common2VARCHAR(MAX)
DECLARE @Common1_LenINT
DECLARE@Common2_LenINT
DECLARE @s1_lenINT
DECLARE @s2_lenINT
DECLARE@transpose_cntINT
DECLARE @match_windowINT
DECLARE @jaro_distanceFLOAT SET@transpose_cnt= 0
SET@match_window= 0
SET@jaro_distance= 0 Set @s1_len = LEN(@str1)
Set @s2_len = LEN(@str2) SET@match_window = dbo.fn_calculateMatchWindow(@s1_len, @s2_len)
SET@Common1 = dbo.fn_GetCommonCharacters(@str1, @str2, @match_window)
SET @Common1_Len = LEN(@Common1)
IF @Common1_Len = 0 OR @Common1 IS NULL
BEGIN
RETURN 0
END SET @Common2 = dbo.fn_GetCommonCharacters(@str2, @str1, @match_window)
SET @Common2_Len = LEN(@Common2)
IF @Common1_Len <> @Common2_Len OR @Common2 IS NULL
BEGIN
RETURN 0
END
SET@transpose_cnt = dbo.[fn_calculateTranspositions](@Common1_Len, @Common1, @Common2) SET@jaro_distance =@Common1_Len / (3.0 * @s1_len) +
@Common1_Len / (3.0 * @s2_len) +
(@Common1_Len - @transpose_cnt) / (3.0 * @Common1_Len);
RETURN @jaro_distance
END
GO
-- ****************** Functions for Jaro ********************************************************************** -- ****************** Functions for Jaro-Winkler ***************************************************************
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[fn_calculatePrefixLength]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))
DROP FUNCTION [dbo].[fn_calculatePrefixLength]
GO CREATE FUNCTION [dbo].[fn_calculatePrefixLength](@firstWord VARCHAR(MAX), @secondWord VARCHAR(MAX))
RETURNS INT As
BEGIN
DECLARE @f1_len INT
DECLARE @f2_len INT
DECLARE@minPrefixTestLength INT
DECLARE @i INT
DECLARE @n INT
DECLARE @foundIT BIT
SET@minPrefixTestLength = 4
IF @firstWord IS NOT NULL AND @secondWord IS NOT NULL
BEGIN
SET @f1_len = LEN(@firstWord)
SET @f2_len = LEN(@secondWord)
SET @i = 0
SET@foundIT = 0
SET @n =CASEWHEN@minPrefixTestLength < @f1_len
AND @minPrefixTestLength < @f2_len
THEN@minPrefixTestLength WHEN@f1_len < @f2_len
AND @f1_len < @minPrefixTestLength
THEN@f1_len ELSE@f2_len
END WHILE @i < @n AND @foundIT = 0
BEGIN
IF SUBSTRING(@firstWord, @i+1, 1) <> SUBSTRING(@secondWord, @i+1, 1)
BEGIN
SET @minPrefixTestLength = @i
SET @foundIT = 1
END
SET @i = @i + 1
END
END
RETURN @minPrefixTestLength
END
GO IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[fn_calculateJaroWinkler]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))
DROP FUNCTION [dbo].[fn_calculateJaroWinkler]
GO CREATE FUNCTION [dbo].[fn_calculateJaroWinkler](@str1 VARCHAR(MAX), @str2 VARCHAR(MAX))
RETURNS float As
BEGIN
DECLARE @jaro_distanceFLOAT
DECLARE @jaro_winkler_distanceFLOAT
DECLARE @prefixLengthINT
DECLARE @prefixScaleFactorFLOAT
SET@prefixScaleFactor= 0.1 --Constant = .1
SET@jaro_distance= dbo.fn_calculateJaro(@str1, @str2)
SET@prefixLength= dbo.fn_calculatePrefixLength(@str1, @str2)
SET@jaro_winkler_distance = @jaro_distance + ((@prefixLength * @prefixScaleFactor) * (1.0 - @jaro_distance))
RETURN@jaro_winkler_distance
END
GO
-- ****************** Functions for Jaro-Winkler *************************************************************** /*
To provide for more effecient matching we will creat a string cleaning function to remove special charactera and normalize some values. This could also be accomplished with regular expressions. See TSQL Regular Expression Workbench
-- ****************** Miscellaneous Functions *****************************************************************
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[fn_StrClean]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))
DROP FUNCTION [dbo].[fn_StrClean]
GO
CREATE Function [dbo].[fn_StrClean](@p_str1 VARCHAR(MAX))
RETURNS VARCHAR(MAX) as
BEGIN
DECLARE @ret_value VARCHAR(MAX)
SET @ret_value = @p_str1
SET @ret_value = REPLACE(@ret_value, '.', '')
SET @ret_value = REPLACE(@ret_value, ',', '')
SET @ret_value = REPLACE(@ret_value, '-', '')
SET @ret_value = REPLACE(@ret_value, ';', '')
SET @ret_value = REPLACE(@ret_value, ':', '')
RETURN @ret_value
ENDl
GO
-- ****************** Miscellaneous Functions *****************************************************************
Now lets test it.
--For testing these functions run declare @i float
select @i=[dbo].[JaroWinkler]('Peter','Pete')
print @i
As you can see above the result of the score for comparing "Peter" to "Pete" is 0.933332.
We have also added additional test scripts and expected results.
I havepProvided the sctipts to load the test data:
DDL for NameInput
USE [AdventureWorksDW]
GO
/****** Object: Table [dbo].[NameInput] Script Date: 01/23/2009 00:05:41 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[NameInput](
[Cust_Id] [nvarchar](10) NULL,
[Name_Input] [nvarchar](10) NULL,
[Last_Name] [nchar](10) NULL
) ON [PRIMARY]
DDL for NameLookup
USE [AdventureWorksDW]
GO
/****** Object: Table [dbo].[NameLookup] Script Date: 01/24/2009 15:11:06 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[NameLookup](
[name_group_id] [nvarchar](10) NULL,
[first_name] [nvarchar](10) NULL,
[first_name_normalized] [nchar](10) NULL
) ON [PRIMARY]
Inserts for NameInput
insert into NameInput (Cust_Id, Name_Input, Last_Name) values ('11', 'Tomas', 'Jones ')
insert into NameInput (Cust_Id, Name_Input, Last_Name) values ('22', 'Peter', 'Jackson ')
insert into NameInput (Cust_Id, Name_Input, Last_Name) values ('33', 'Theresa', 'Smith ')
insert into NameInput (Cust_Id, Name_Input, Last_Name) values ('333', 'Therese', 'Smith ')
insert into NameInput (Cust_Id, Name_Input, Last_Name) values ('111', 'Tom', 'Jones ')
insert into NameInput (Cust_Id, Name_Input, Last_Name) values ('222', 'Pete', 'Jackson ')
insert into NameInput (Cust_Id, Name_Input, Last_Name) values ('55', 'Johnathan', 'Oconner ')
insert into NameInput (Cust_Id, Name_Input, Last_Name) values ('555', 'Johnatiag', 'Oconner ')
insert into NameInput (Cust_Id, Name_Input, Last_Name) values ('1111', 'Thhomas', 'Jones ')
insert into NameInput (Cust_Id, Name_Input, Last_Name) values ('2222', 'Petet', 'Jackson ')
insert into NameInput (Cust_Id, Name_Input, Last_Name) values ('3333', 'Theres', 'Smith ')
insert into NameInput (Cust_Id, Name_Input, Last_Name) values ('5555', 'Jonathan', 'Oconner ')
Inserts for NameLookup
insert into NameLookup (name_group_id, first_name, first_name_normalized) values ('3', 'Theresa', 'Theresa ')
insert into NameLookup (name_group_id, first_name, first_name_normalized) values ('4', 'Johnathan', 'Johnathan ')
insert into NameLookup (name_group_id, first_name, first_name_normalized) values ('1', 'Tom', 'Thomas ')
insert into NameLookup (name_group_id, first_name, first_name_normalized) values ('2', 'Pete', 'Peter ')
insert into NameLookup (name_group_id, first_name, first_name_normalized) values ('1', 'Thomas', 'Thomas ')
insert into NameLookup (name_group_id, first_name, first_name_normalized) values ('2', 'Peter', 'Peter ')
Now we will develope a query that will solve our business problem.
SELECT NameLookup.name_group_id, NameInput.Cust_Id, NameInput.Name_Input, NameLookup.first_name ,NameLookup.first_name_normalized, NameInput.Last_Name, dbo.JaroWinkler(NameInput.Name_Input,
NameLookup.first_name) AS Jaro3,
RANK() OVER
(Partition BY NameLookup.name_group_id ORDER BY dbo.JaroWinkler(NameInput.Name_Input,
NameLookup.first_name) Desc) Jar02 FROM NameInput CROSS JOIN
NameLookup
Where dbo.JaroWinkler(NameInput.Name_Input,
NameLookup.first_name) > .95
order by NameLookup.name_group_id
For this example we have used used a cross join to find close matches by creating a cartesian product of all rows . For demonstration purposes we have not filter the highest scores for close matches, instead we have output the entire cartesian product with all scores. We used Rank(Jaro2) to prioritize the Scores
As you review the results above you will see that for the most part scores over .95 are matches. Now lets look at the reults if we filter on Score for only match score over .95.
Also notice that the name_group_id is in essence the "Match_Key" or identifier that enables record linkage or "Grouping"
I also have a companion artiticle that will demonstrate the same Jaro Winkler implementation , in SSIS.,
There are several areas where this technique can be improved. One would be to remove the exact matches with, perhaprs with EXCEPT or LEFT OUTER JOIN, however for brevities I left all the reults.
Ira Warren Whiteside
Actuality Business Intelligence
"karo yaa na karo, koshish jaisa kuch nahi hai"
"Do, or do not. There is no try."