December 28, 2010 at 12:59 pm
This is a great article, thank you so much for posting. I am trying to implement a similar solution in my organization, and was doing some testing on your function. I found some odd results when comparing strings that use the same letters. I am wondering if these results should be expected in these cases?
DECLARE @string1 VARCHAR(100), @string2 VARCHAR(100)
SET @string1 = 'sottosanti'
SET @string2 = 'ottositnas'
PRINT dbo.JaroWinkler(@string1, @string2)
GO
DECLARE @string1 VARCHAR(100), @string2 VARCHAR(100)
SET @string1 = 'margin'
SET @string2 = 'arming'
PRINT dbo.JaroWinkler(@string1, @string2)
Thanks,
Adam Sottosanti
Adam Sottosanti
December 28, 2010 at 3:32 pm
Hi Adam,
I suggest you to try string matching function I developed and was using for years. This function compares two strings based on sequence of letters in the strings.
if exists (select 1 from dbo.sysobjects where id = object_id(N'[dbo].[fn_CompareThePair]'))
drop function [dbo].[fn_CompareThePair]
GO
--
create function dbo.fn_CompareThePair
(@String1 varchar (100), @String2 varchar (100), @FirstIn bit = 1, @LastIn bit = 1)
returns integer
/*************************************************************************************************
Function CompareThePair
This function accepts two string values and returns an integer value between
zero and one hundred indicating the similarity between the two string. This
algorithm was developed specifically to search for the spelling
variations of the names and addresses. For this purpose,
it is superior to SOUNDEX, which searches only for similar sounding words.
SOUNDEX fails when dealing with not Latin names(asian names as an example)
or with their interpretations.
Usage: select dbo.fn_CompareThePair('margin', 'arming', 1, 1)
Input Parameters
@String1 first string
@String2 second string
@FirstIn option to include comparison on the first character
@LastIn option to include comparison on the last character
--History
* Developed by marat 21/04/2008 based on the string matching algorithm developed in 2003 and used for address and name matching.
*
****************************************************************************************************/
begin
declare @Target integer
declare @Hits integer
declare @ExHits integer
declare @Pos integer
declare @index integer
declare @Cursor integer
declare @Keep integer
--
set @Hits = 0
set @ExHits = 0
--
if @FirstIn > 0
begin
if left(@string1,1) = left(@string2,1) set @ExHits = @ExHits + 1
end
if @LastIn > 0
begin
if right(@string1,1) = right(@string2,1) set @ExHits = @ExHits + 1
end
set @Target =
case when len(@String1) >= len(@String2) then len(@String1) -1 + @FirstIn + @LastIn
else len(@String2) -1 + @FirstIn + @LastIn
end
--Run I
set @Pos = 1
set @Cursor = 1
while @Pos < len(@String1)
begin
set @index = charindex(substring(@String1, @Pos, 2), @String2, @Cursor)
if @index > 0
begin
set @Cursor = @index + 1
set @Hits = @Hits + 1
end
set @Pos = @Pos + 1
end
set @Keep = @Hits --keep first result
--Run II
set @Pos = 1
set @Cursor = 1
set @Hits = 0
while @Pos < len(@String2)
begin
set @index = charindex(substring(@String2, @Pos, 2), @String1, @Cursor)
if @index > 0
begin
set @Cursor = @index + 1
set @Hits = @Hits + 1
end
set @Pos = @Pos + 1
end
finish:
if @Keep > @Hits set @Hits = @Keep
return 100*(@Hits + @ExHits)/@Target
end
GO
December 28, 2010 at 4:13 pm
Thank you for the code Anatoliy. I'll play around with this. It seems like a good approach might be blend this with the Jaro Winkler function.
Adam
Adam Sottosanti
December 29, 2010 at 10:24 am
Edit: I guess it also helps to read the entire article and discussion before asking questions, since doing so makes it abundantly clear I was using the wrong function to begin with! Whoops!
Adam Sottosanti
February 2, 2011 at 10:16 am
Just wanted to say thanks very much!
I wanted to use Jaro-Winkler matching of names and email addresses across two different db tables to filter out the bogus entries. This works perfectly. Several hundred rows matched on two JW's and an average in less than 2 minutes.
We might try this on addresses next, but I'm less confident of the results. I've tried similar before (not in sql) and the data is just too messy to get reliable results.
Thanks again!!!
February 2, 2011 at 10:48 am
Here's my C# SQL CLR implementation- just deploy to an SQL server.
In my opinion, for anything involving significant string manipulation, a CLR language is vastly superior to SQL server almost every time (as well as being a lot more elegant).
The JaroWinkler algorithm was ported from the java lingpipe version, so I'd advise checking the output before committing to production.
Also, it's very fast, I was getting 36k rows processing and sorting on JaroWinkler scores in under a second.
JaroWinkler.cs
using System;
using System.Collections.Generic;
using System.Text;
namespace JaroWinkler
{
class JaroWinklerDistance
{
private double mWeightThreshold;
private double mNumChars;
/**
* Construct a basic Jaro string distance without the Winkler
* modifications. See the class documentation above for more information
* on the exact algorithm and its parameters.
*/
public JaroWinklerDistance() {
mNumChars=Double.PositiveInfinity;
mWeightThreshold=0;
}
/**
* Construct a Winkler-modified Jaro string distance with the
* specified weight threshold for refinement and an initial number
* of characters over which to reweight. See the class
* documentation above for more information on the exact algorithm
* and its parameters.
*/
public JaroWinklerDistance(double weightThreshold, int numChars) {
mNumChars = numChars;
mWeightThreshold = weightThreshold;
}
/**
* Returns the Jaro-Winkler distance between the specified character
* sequences. Teh distance is symmetric and will fall in the
* range <code>0</code> (perfect match) to <code>1</code> (no overlap).
* See the class definition above for formal definitions.
*
*
This method is defined to be:
*
* <pre>
* distance(cSeq1,cSeq2) = 1 - proximity(cSeq1,cSeq2)</code></pre>
*
* @param cSeq1 First character sequence to compare.
* @param cSeq2 Second character sequence to compare.
* @return The Jaro-Winkler comparison value for the two character
* sequences.
*/
public double distance(String cSeq1, String cSeq2)
{
return 1.0 - proximity(cSeq1,cSeq2);
}
/**
* Return the Jaro-Winkler comparison value between the specified
* character sequences. The comparison is symmetric and will fall
* in the range <code>0</code> (no match) to <code>1</code>
* (perfect match)inclusive. See the class definition above for
* an exact definition of Jaro-Winkler string comparison.
*
*
The method {@link #distance(CharSequence,CharSequence)} returns
* a distance measure that is one minus the comparison value.
*
* @param cSeq1 First character sequence to compare.
* @param cSeq2 Second character sequence to compare.
* @return The Jaro-Winkler comparison value for the two character
* sequences.
*/
public double proximity(String cSeq1, String cSeq2) {
int len1 = cSeq1.Length;
int len2 = cSeq2.Length;
if (len1 == 0)
return len2 == 0 ? 1.0 : 0.0;
int searchRange = Math.Max(0,Math.Max(len1,len2)/2 - 1);
bool[] matched1 = new bool[len1];
matched1.Initialize();
//Arrays.Fill(matched1,false);
bool[] matched2 = new bool[len2];
matched2.Initialize();
//Arrays.fill(matched2,false);
int numCommon = 0;
for (int i = 0; i < len1; ++i) {
int start = Math.Max(0,i-searchRange);
int end = Math.Min(i+searchRange+1,len2);
for (int j = start; j < end; ++j) {
if (matched2[j]) continue;
if (cSeq1 != cSeq2[j])
continue;
matched1 = true;
matched2[j] = true;
++numCommon;
break;
}
}
if (numCommon == 0) return 0.0;
int numHalfTransposed = 0;
int k = 0;
for (int i = 0; i < len1; ++i) {
if (!matched1) continue;
while (!matched2[k]) { ++k; }
if (cSeq1 != cSeq2[k])
{
++numHalfTransposed;
}
++k;
}
// System.out.println("numHalfTransposed=" + numHalfTransposed);
int numTransposed = numHalfTransposed/2;
// System.out.println("numCommon=" + numCommon
// + " numTransposed=" + numTransposed);
double numCommonD = numCommon;
double weight = (numCommonD/len1
+ numCommonD/len2
+ (numCommon - numTransposed)/numCommonD)/3.0;
if (weight <= mWeightThreshold) return weight;
double max = Math.Min(mNumChars,Math.Min(cSeq1.Length,cSeq2.Length));
int pos = 0;
while (pos < max && cSeq1[pos] == cSeq2[pos])
{
++pos;
}
if (pos == 0) return weight;
return weight + 0.1 * pos * (1.0 - weight);
}
/**
* A constant for the Jaro distance. The value is the same as
* would be returned by the nullary constructor
* <code>JaroWinklerDistance()</code>.
*
*
Instances are thread safe, so this single distance instance
* may be used for all comparisons within an application.
*/
//public static JaroWinklerDistance JARO_DISTANCE = new JaroWinklerDistance();
/**
* A constant for the Jaro-Winkler distance with defaults set as
* in Winkler's papers. The value is the same as would be
* returned by the nullary constructor
* <code>JaroWinklerDistance(0.7,4)</code>.
*
*
Instances are thread safe, so this single distance instance
* may be used for all comparisons within an application.
*/
//public static JaroWinklerDistance JARO_WINKLER_DISTANCE = new JaroWinklerDistance(0.70,4);
}
}
JaroWinklerWrapper.cs
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using JaroWinkler;
public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction]
public static SqlDouble JaroWinklerDistance(string s, string t)
{
JaroWinklerDistance a = new JaroWinklerDistance();
return a.distance(s, t);
}
[Microsoft.SqlServer.Server.SqlFunction]
public static SqlDouble JaroWinklerProximity(string s, string t)
{
JaroWinklerDistance a = new JaroWinklerDistance();
return a.proximity(s, t);
}
};
Then call it from your code something like this
select
Name,
'TestString',
dbo.JaroWinklerDistance('TestString',Name)
from
Names
WHERE
Name is not null
February 3, 2011 at 11:45 am
For addresses, I'd recommend first using nested REPLACE()'s (SQL Server 2000, at least, will easily go 100 or 120 deep) to standardize abbreviations. You'll have to look at your own data to include common misspellings, as well.
Personally, I also standardize on the singular abbreviations, on the assumption that plural vs. singular when most other factors are the same is more often a typo than two different streets with the same <entity> having the same numeric address on both.
Splitting addresses into street #, street name, and suite can be very... interesting.
February 3, 2011 at 12:05 pm
mike.renwick (2/2/2011)
Here's my C# SQL CLR implementation- just deploy to an SQL server.In my opinion, for anything involving significant string manipulation, a CLR language is vastly superior to SQL server almost every time (as well as being a lot more elegant).
The JaroWinkler algorithm was ported from the java lingpipe version, so I'd advise checking the output before committing to production.
...
Sweet, thanks Mike! If I ever get some free time, maybe I'll run some side by side compares. I run new records through multiple "rounds" of matching against a lookup table which is at 250K records and growing, using T-SQL, and it keeps getting slower... and slower... and slower...
Adam Sottosanti
February 3, 2011 at 12:22 pm
Adam Sottosanti (2/3/2011)
Sweet, thanks Mike! If I ever get some free time, maybe I'll run some side by side compares. I run new records through multiple "rounds" of matching against a lookup table which is at 250K records and growing, using T-SQL, and it keeps getting slower... and slower... and slower...
I've found (heavily) indexed #temp tables to help significantly with speed on larger datasets. Calculate your double metaphone and cleaned up variants once, and then do Jaro-Winkler based on a WHERE clause (first two characters match, first two characters of double metaphone match, etc.). It's naturally a cartesian [O(n^2)] operation, which does get very slow, very fast, so to speak.
February 3, 2011 at 5:01 pm
I've found (heavily) indexed #temp tables to help significantly with speed on larger datasets. Calculate your double metaphone and cleaned up variants once, and then do Jaro-Winkler based on a WHERE clause (first two characters match, first two characters of double metaphone match, etc.). It's naturally a cartesian [O(n^2)] operation, which does get very slow, very fast, so to speak.
Agreed. I try to avoid the cartesian joins during matching as much as I can, and generally require certain demographics to match (inner join on DOB + Gender instead of cross join) to limit the result set, then also limit the amount I run through at a time. I may miss a handful of records with this approach, but they are easy to identify, and since the matching is fuzzy to begin with, you are going to need a clean up process anyway.
Adam Sottosanti
February 3, 2011 at 5:10 pm
Adam Sottosanti (2/3/2011)
Agreed. I try to avoid the cartesian joins during matching as much as I can, and generally require certain demographics to match (inner join on DOB + Gender instead of cross join) to limit the result set, then also limit the amount I run through at a time. I may miss a handful of records with this approach, but they are easy to identify, and since the matching is fuzzy to begin with, you are going to need a clean up process anyway.
Have you considered building a (large) precomputed Jaro-Winkler lookup table, and indexing it? Perhaps run your precise matching first, and once you're at the Jaro-Winkler stage, run against the lookup table first, and then work on those values that weren't in the table (and add them to the table)?
February 3, 2011 at 5:19 pm
Yep, for the most part that is exactly what I'm dong for the fuzzy matching.
Adam Sottosanti
December 29, 2013 at 7:45 am
Hi I AM TRYING TO USE THIS JARO-WINKLER ALGORITHM,
I CANT CREATE A STORED PROCEDURE WITH THE SCALAR FUNCTIONS HERE,
PLEASE PROVIDE ME THE CODE TO CREATE A STORED PROCEDURE
THANKS IN ADVANCE
December 29, 2013 at 11:38 am
kishorreddy.yuva (12/29/2013)
Hi I AM TRYING TO USE THIS JARO-WINKLER ALGORITHM,I CANT CREATE A STORED PROCEDURE WITH THE SCALAR FUNCTIONS HERE,
PLEASE PROVIDE ME THE CODE TO CREATE A STORED PROCEDURE
THANKS IN ADVANCE
Code to create a CLR function has been included as well as a TSQL function. Just call either of the functions from your stored proc.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
December 29, 2013 at 11:06 pm
Hi Jason,
I just copy the code from here and try to execute in my machine, But when i check these all are function not stored procedure.
Please provide me the code either for CLR function or to convert all this functions into stored procedure
Viewing 15 posts - 31 through 45 (of 57 total)
You must be logged in to reply to this topic. Login to reply