Fuzzy Matching

Question

Fuzzy Matching

Viewing 15 posts - 16 through 30 (of 30 total)

You must be logged in to reply to this topic. Login to reply

RonMexico Hall of Fame Points: 3474 More actions · Answer 1

We haven't created a client table of correct, distinct values yet. I'm thinking initially we will have to go through those 300k records to find any variations and then continue the build the table based on if future months have names that don't match with what have already been identified.

Eric M Russell SSC Guru Points: 125562 More actions · Answer 2

Unless you're using a 3rd party product like Melissa, then attempting to auto-magically identify and recode mistyped company names is probably beyond the level of effort you would be willing to put into this ETL process. At the very least, any straightforward matching process would need to leverage something like a primary reference Company table containing valid company names. If you're using SSIS to load your table, then you can leverage a Lookup DataFlow task to join the source file/table with the reference table and redirect non-matching records into an exception table for which you may then subsequently perform manual editing and re-loading.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

RonMexico Hall of Fame Points: 3474 More actions · Answer 3

Eric - Thanks for the SSIS tip. That will make it much easier to capture the handful of records without a match as opposed to joining across tables manually.

William Rayer SSC Eights! Points: 865 More actions · Answer 4

If you are using SSIS, there is a "Fuzzy Grouping" transformation that can be used for de-duplication. You provide it with the 'canonical' forms of the strings, and with the real data. It then returns the closest match and the degree of confidence in the match. Would this do what you need?

Disclaimer: I know of this by studying for the 70-463 exam (SQL Server 2012 Data warehouse) but I have not actually tried it! So you would need to experiment to get it working.

Chris Wooding SSCarpal Tunnel Points: 4558 More actions · Answer 5

I have worked on a similar task before. The approach I took was to copy the names to another table (along with the primary key so I could link back to the original). In the copy table, I standardise the names as much as possible; remove punctuation, change double spaces to single spaces, remove words that appear too often to be useful for matching (eg. the, of, and, &, plc, corp, ltd) and replace common misspellings. I then match the words individually and score the result based on the number of matching words as a proportion of the total number of words in the name. Anything above a certain score is deemed a match with the remainder being reported for manual checking (I use the original versions of the names in the report, hence the need to copy the data sideways). Obviously the score required for automatic matching and the list of words to drop will be unique to your situation and the type of data you need to process.

Jason A. Long SSC-Insane Points: 23786 More actions · Answer 6

An oldie but a goody... Years ago, Keith Henry translated Lawrence Philips C++ function to SQL function.

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=13574

Yes, it is a Scalar function and no I haven't attempted to rewrite it as an iTVF (I don't use it enough to make it worth the effort)...

It actually works really well... I've used it in the past with really good results.

SELECT[dbo].[DoubleMetaPhone]('A.B.C. Liquers');

SELECT[dbo].[DoubleMetaPhone]('a b c liquors');

SELECT[dbo].[DoubleMetaPhone]('ABC Lickers');

All return a value of: APKLKAPKLK

Rather than making you dig through that old thread, here is the function code...

/****** Object: UserDefinedFunction [dbo].[DoubleMetaPhone] Script Date: 9/6/2016 12:33:55 PM ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

ALTER FUNCTION [dbo].[DoubleMetaPhone] (@str varchar(70))

RETURNS char(10)

AS

BEGIN

/*#########################################################################

Double Metaphone Phonetic Matching Function

This reduces word to approximate phonetic string. This is deliberately

not a direct phonetic

Based off original C++ code and algorithm by

Lawrence Philips (lphilips_AT_verity.com)

Published in the C/C++ Users Journal:

http://www.cuj.com/articles/2000/0006/0006d/0006d.htm?topic=articles

Original Metaphone presented in article in "Computer Language" in 1990.

Reduces alphabet to

The 14 constonant sounds:

"sh" "p"or"b" "th"

| | |

X S K J T F H L M N P R 0 W

Drop vowels except at the beginning

Produces a char(10) string. The left(@result,5) gives the most common

pronouciation, right(@result,5) gives the commonest alternate.

Translated into t-SQL by Keith Henry (keithh_AT_lbm-solutions.com)

#########################################################################*/

Declare

@original varchar(70),

@primary varchar(70),

@secondary varchar(70),

@lengthint,

@Last int,

@current int,

@strcur1 char(1) ,

@strnext1 char(1) ,

@strprev1char(1),

@SlavoGermanic bit

set @SlavoGermanic= 0

set @primary = ''

set @secondary = ''

set @current = 1

set @length = len(@str)

set @Last = @length

set @original = isnull(@str,'') + ''

set @original = upper(@original)

if patindex('%[WK]%',@str) + charindex('CZ',@str) + charindex('WITZ',@str) <> 0

set @SlavoGermanic = 1

-- skip this at beginning of word

if substring(@original, 1, 2) in ('GN', 'KN', 'PN', 'WR', 'PS')

set @current = @current + 1

-- Initial 'X' is pronounced 'Z' e.g. 'Xavier'

if substring(@original, 1, 1) = 'X'

begin

set @primary = @primary + 'S' -- 'Z' maps to 'S'

set @secondary = @secondary + 'S'

set @current = @current + 1

end

if substring(@original, 1, 1) in ('A', 'E', 'I', 'O', 'U', 'Y')

begin

set @primary = @primary + 'A' -- all init vowels now map to 'A'

set @secondary = @secondary + 'A'

set @current = @current + 1

end

while @current <= @length

begin

if len(@primary) >= 5 break

set @strcur1 = substring(@original, @current, 1)

set @strnext1 = substring(@original, (@current + 1), 1)

set @strprev1 = substring(@original, (@current - 1), 1)

if @strcur1 in ('A', 'E', 'I', 'O', 'U', 'Y')

set @current = @current + 1

else

if @strcur1 = 'B' -- '-mb', e.g. 'dumb', already skipped over ...

begin

set @primary = @primary + 'P'

set @secondary = @secondary + 'P'

if @strnext1 = 'B'

set @current = @current + 2

else

set @current = @current + 1

end

else

if @strcur1 = 'Ç'

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'S'

set @current = @current + 1

end

else

if @strcur1 = 'C'

begin

if @strnext1 = 'H'

begin

if substring(@original, @current, 4) = 'CHIA'-- italian 'chianti'

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

end

else

begin

if @current > 1-- find 'michael'

and substring(@original, @current, 4) = 'CHAE'

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'X'

end

else

begin

if @current = 1-- greek roots e.g. 'chemistry', 'chorus'

and (substring(@original, @current + 1, 5) in ('HARAC', 'HARIS')

or substring(@original, @current + 1, 3) in ('HOR', 'HYM', 'HIA', 'HEM')

)

and substring(@original, 1, 5) <> 'CHORE'

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

end

else

begin

if (substring(@original, 0, 4) in ('VAN ', 'VON ')-- germanic, greek, or otherwise 'ch' for 'kh' sound

or substring(@original, 0, 3) = 'SCH'

)

or substring(@original, @current - 2, 6) in ('ORCHES', 'ARCHIT', 'ORCHID')-- 'architect' but not 'arch', orchestra', 'orchid'

or substring(@original, @current + 2, 1) in ('T', 'S')

or ((@strprev1 in ('A','O','U','E')

or @current = 0

)

and substring(@original, @current + 2, 1) in ('L','R','N','M','B','H','F','V','W',' ')-- e.g. 'wachtler', 'weschsler', but not 'tichner'

)

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

end

else

begin

if (@current > 1)

begin

if substring(@original, 1, 2) = 'MC' -- e.g. 'McHugh'

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

end

else

begin

set @primary = @primary + 'X'

set @secondary = @secondary + 'K'

end

else

begin

set @primary = @primary + 'X'

set @secondary = @secondary + 'X'

end

set @current = @current + 2

end --ch logic

else

begin

if @strnext1 = 'C'-- double 'C', but not McClellan'

and not(@current = 1

and substring(@original, 1, 1) = 'M'

)

begin

if substring(@original, @current + 2, 1) in ('I','E','H')-- 'bellocchio' but not 'bacchus'

and substring(@original, @current + 2, 2) <> 'HU'

begin

if (@current = 2-- 'accident', 'accede', 'succeed'

and @strprev1 = 'A'

)

or substring(@original, @current - 1, 5) in ('UCCEE', 'UCCES')

begin

set @primary = @primary + 'KS'

set @secondary = @secondary + 'KS'

end

else

begin-- 'bacci', 'bertucci', other italian

set @primary = @primary + 'X'

set @secondary = @secondary + 'X'

-- e.g. 'focaccia'if substring(@original, @current, 4) = 'CCIA'

end

set @current = @current + 3

end

else

begin

set @primary = @primary + 'K'-- Pierce's rule

set @secondary = @secondary + 'K'

set @current = @current + 2

end

else

begin

if @strnext1 in ('K','G','Q')

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

set @current = @current + 2

end

else

begin

if @strnext1 in ('I','E','Y')

begin

if substring(@original, @current, 3) in ('CIO','CIE','CIA')-- italian vs. english

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'X'

end

else

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'S'

end

set @current = @current + 2

end

else

begin

if @strnext1 = 'Z'-- e.g. 'czerny'

and substring(@original, @current -2, 4) <> 'WICZ'

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'X'

set @current = @current + 2

end

else

begin

if @current > 2 -- various gremanic

and substring(@original, @current - 2,1) not in ('A', 'E', 'I', 'O', 'U', 'Y')

and substring(@original, @current - 1, 3) = 'ACH'

and ((substring(@original, @current + 2, 1) <> 'I')

and ((substring(@original, @current + 2, 1) <> 'E')

or substring(@original, @current - 2, 6) in ('BACHER', 'MACHER')

)

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

set @current = @current + 2

end

else

begin

if @current = 1 -- special case 'caesar'

and substring(@original, @current, 6) = 'CAESAR'

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'S'

set @current = @current + 2

end

else

begin-- final else

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

if substring(@original, @current + 1, 2) in (' C',' Q',' G')-- name sent in 'mac caffrey', 'mac gregor'

set @current = @current + 3

else

set @current = @current + 1

end

else

if @strcur1 = 'D'

begin

if substring(@original, @current, 2) = 'DG'

begin

if substring(@original, @current + 2, 1) in ('I','E','Y')

begin

set @primary = @primary + 'J'-- e.g. 'edge'

set @secondary = @secondary + 'J'

set @current = @current + 3

end

else

begin

set @primary = @primary + 'TK'-- e.g. 'edgar'

set @secondary = @secondary + 'TK'

set @current = @current + 2

end

else

begin

if substring(@original, @current, 2) in ('DT','DD')

begin

set @primary = @primary + 'T'

set @secondary = @secondary + 'T'

set @current = @current + 2

end

else

begin

set @primary = @primary + 'T'

set @secondary = @secondary + 'T'

set @current = @current + 1

end

else

if @strcur1 = 'F'

begin

set @primary = @primary + 'F'

set @secondary = @secondary + 'F'

if (@strnext1 = 'F')

set @current = @current + 2

else

set @current = @current + 1

end

else

if @strcur1 = 'G'

begin

if (@strnext1 = 'H')

begin

if @current > 1

and @strprev1 not in ('A', 'E', 'I', 'O', 'U', 'Y')

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

end

else

begin

if not((@current > 2-- Parker's rule (with some further refinements) - e.g. 'hugh'

and substring(@original, @current - 2, 1) in ('B','H','D')

)-- e.g. 'bough'

or (@current > 3

and substring(@original, @current - 3, 1) in ('B','H','D')

)-- e.g. 'broughton'

or (@current > 4

and substring(@original, @current - 4, 1) in ('B','H')

))

begin

if @current > 3-- e.g. 'laugh', 'McLaughlin', 'cough', 'gough', 'rough', 'tough'

and @strprev1 = 'U'

and substring(@original, @current - 3, 1) in ('C','G','L','R','T')

begin

set @primary = @primary + 'F'

set @secondary = @secondary + 'F'

end

else

begin

if @current > 1

and @strprev1 <> 'I'

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

end

else

begin

if (@current < 4)

begin

if (@current = 1)-- 'ghislane', 'ghiradelli'

begin

if (substring(@original, @current + 2, 1) = 'I')

begin

set @primary = @primary + 'J'

set @secondary = @secondary + 'J'

end

else

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

end

set @current = @current + 2

end

else

begin

if (@strnext1 = 'N')

begin

if @current = 1

and substring(@original, 0,1) in ('A', 'E', 'I', 'O', 'U', 'Y')

and @SlavoGermanic = 0

begin

set @primary = @primary + 'KN'

set @secondary = @secondary + 'N'

end

else

begin

-- not e.g. 'cagney'

if substring(@original, @current + 2, 2) = 'EY'

and (@strnext1 <> 'Y')

and @SlavoGermanic = 0

begin

set @primary = @primary + 'N'

set @secondary = @secondary + 'KN'

end

else

begin

set @primary = @primary + 'KN'

set @secondary = @secondary + 'KN'

end

set @current = @current + 2

end

else

begin

if substring(@original, @current + 1, 2) = 'LI'-- 'tagliaro'

and @SlavoGermanic = 0

begin

set @primary = @primary + 'KL'

set @secondary = @secondary + 'L'

set @current = @current + 2

end

else

begin

if @current = 1-- -ges-, -gep-, -gel- at beginning

and (@strnext1 = 'Y'

or substring(@original, @current + 1, 2) in ('ES','EP','EB','EL','EY','IB','IL','IN','IE', 'EI','ER')

)

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'J'

set @current = @current + 2

end

else

begin

if (substring(@original, @current + 1, 2) = 'ER'-- -ger-, -gy-

or @strnext1 = 'Y'

)

and substring(@original, 1, 6) not in ('DANGER','RANGER','MANGER')

and @strprev1 not in ('E', 'I')

and substring(@original, @current - 1, 3) not in ('RGY','OGY')

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'J'

set @current = @current + 2

end

else

begin

if @strnext1 in ('E','I','Y')-- italian e.g. 'biaggi'

or substring(@original, @current -1, 4) in ('AGGI','OGGI')

begin

if (substring(@original, 1, 4) in ('VAN ', 'VON ')-- obvious germanic

or substring(@original, 1, 3) = 'SCH'

)

or substring(@original, @current + 1, 2) = 'ET'

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

end

else

begin

-- always soft if french ending

if substring(@original, @current + 1, 4) = 'IER '

begin

set @primary = @primary + 'J'

set @secondary = @secondary + 'J'

end

else

begin

set @primary = @primary + 'J'

set @secondary = @secondary + 'K'

end

set @current = @current + 2

end

else

begin-- other options exausted call it k sound

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

if (@strnext1 = 'G')

set @current = @current + 2

else

set @current = @current + 1

end

else

if @strcur1 = 'H'

begin

if (@current = 0 -- only keep if first & before vowel or btw. 2 vowels

or @strprev1 in ('A', 'E', 'I', 'O', 'U', 'Y')

)

and @strnext1 in ('A', 'E', 'I', 'O', 'U', 'Y')

begin

set @primary = @primary + 'H'

set @secondary = @secondary + 'H'

set @current = @current + 2

end

else

set @current = @current + 1

end

else

if @strcur1 = 'J'

begin

if substring(@original, @current, 4) = 'JOSE'-- obvious spanish, 'jose', 'san jacinto'

or substring(@original, 1, 4) = 'SAN '

begin

if (@current = 1

and substring(@original, @current + 4, 1) = ' '

)

or substring(@original, 1, 4) = 'SAN '

begin

set @primary = @primary + 'H'

set @secondary = @secondary + 'H'

end

else

begin

set @primary = @primary + 'J'

set @secondary = @secondary + 'H'

end

set @current = @current + 1

end

else

begin

if @current = 1

begin

set @primary = @primary + 'J' -- Yankelovich/Jankelowicz

set @secondary = @secondary + 'A'

set @current = @current + 1

end

else

begin

if @strprev1 in ('A', 'E', 'I', 'O', 'U', 'Y') -- spanish pron. of .e.g. 'bajador'

and @SlavoGermanic = 0

and @strnext1 in ('A','O')

begin

set @primary = @primary + 'J'

set @secondary = @secondary + 'H'

set @current = @current + 1

end

else

begin

if (@current = @Last)

begin

set @primary = @primary + 'J'

set @secondary = @secondary + ''

set @current = @current + 1

end

else

begin

if @strnext1 in ('L','T','K','S','N','M','B','Z')

and @strprev1 not in ('S','K','L')

begin

set @primary = @primary + 'J'

set @secondary = @secondary + 'J'

set @current = @current + 1

end

else

begin

if (@strnext1 = 'J') -- it could happen

set @current = @current + 2

else

set @current = @current + 1

end

else

if @strcur1 = 'K'

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

if (@strnext1 = 'K')

set @current = @current + 2

else

set @current = @current + 1

end

else

if @strcur1 = 'L'

begin

if (@strnext1 = 'L')

begin

if (@current = (@length - 3)-- spanish e.g. 'cabrillo', 'gallegos'

and substring(@original, @current - 1, 4) in ('ILLO','ILLA','ALLE')

)

or ((substring(@original, @Last - 1, 2) in ('AS','OS')

or substring(@original, @Last, 1) in ('A','O')

)

and substring(@original, @current - 1, 4) = 'ALLE'

)

set @primary = @primary + 'L'--set @secondary = @secondary + ''

set @current = @current + 2

end

else

begin

set @current = @current + 1

set @primary = @primary + 'L'

set @secondary = @secondary + 'L'

end

else

if @strcur1 = 'M'

begin

set @primary = @primary + 'M'

set @secondary = @secondary + 'M'

if substring(@original, @current - 1, 3) = 'UMB'

and (@current + 1 = @Last

or substring(@original, @current + 2, 2) = 'ER'

)-- 'dumb', 'thumb'

or @strnext1 = 'M'

set @current = @current + 2

else

set @current = @current + 1

end

else

if @strcur1 in ('N','Ñ')

begin

set @primary = @primary + 'N'

set @secondary = @secondary + 'N'

if @strnext1 in ('N','Ñ')

set @current = @current + 2

else

set @current = @current + 1

end

else

if @strcur1 = 'P'

begin

if (@strnext1 = 'H')

begin

set @current = @current + 2

set @primary = @primary + 'F'

set @secondary = @secondary + 'F'

end

else

begin

-- also account for 'campbell' and 'raspberry'

if @strnext1 in ('P','B')

set @current = @current + 2

else

begin

set @current = @current + 1

set @primary = @primary + 'P'

set @secondary = @secondary + 'P'

end

else

if @strcur1 = 'Q'

begin

set @primary = @primary + 'K'

set @secondary = @secondary + 'K'

if (@strnext1 = 'Q')

set @current = @current + 2

else

set @current = @current + 1

end

else

if @strcur1 = 'R'

begin

if @current = @Last-- french e.g. 'rogier', but exclude 'hochmeier'

and @SlavoGermanic = 0

and substring(@original, @current - 2, 2) = 'IE'

and substring(@original, @current - 4, 2) not in ('ME','MA')

set @secondary = @secondary + 'R' --set @primary = @primary + ''

else

begin

set @primary = @primary + 'R'

set @secondary = @secondary + 'R'

end

if (@strnext1 = 'R')

set @current = @current + 2

else

set @current = @current + 1

end

else

if @strcur1 = 'S'

begin

if substring(@original, @current - 1, 3) in ('ISL','YSL') -- special cases 'island', 'isle', 'carlisle', 'carlysle'

set @current = @current + 1--silent s

else

begin

if substring(@original, @current, 2) = 'SH'

begin

-- germanic

if substring(@original, @current + 1, 4) in ('HEIM','HOEK','HOLM','HOLZ')

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'S'

end

else

begin

set @primary = @primary + 'X'

set @secondary = @secondary + 'X'

end

set @current = @current + 2

end

else

begin

-- italian & armenian

if substring(@original, @current, 3) in ('SIO','SIA')

or substring(@original, @current, 4) in ('SIAN')

begin

if @SlavoGermanic = 0

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'X'

end

else

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'S'

end

set @current = @current + 3

end

else

begin

if (@current = 1-- german & anglicisations, e.g. 'smith' match 'schmidt', 'snider' match 'schneider'

and @strnext1 in ('M','N','L','W')-- also, -sz- in slavic language altho in hungarian it is pronounced 's'

)

or @strnext1 = 'Z'

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'X'

if @strnext1 = 'Z'

set @current = @current + 2

else

set @current = @current + 1

end

else

begin

if substring(@original, @current, 2) = 'SC'

begin

if substring(@original, @current + 2, 1) = 'H'-- Schlesinger's rule

begin

if substring(@original, @current + 3, 2) in ('OO','ER','EN','UY','ED','EM')-- dutch origin, e.g. 'school', 'schooner'

begin

if substring(@original, @current + 3, 2) in ('ER','EN')-- 'schermerhorn', 'schenker'

begin

set @primary = @primary + 'X'

set @secondary = @secondary + 'SK'

end

else

begin

set @primary = @primary + 'SK'

set @secondary = @secondary + 'SK'

end

set @current = @current + 3

end

else

begin

if @current = 1

and substring(@original, 3,1) not in ('A', 'E', 'I', 'O', 'U', 'Y')

and substring(@original, @current + 3, 1) <> 'W'

begin

set @primary = @primary + 'X'

set @secondary = @secondary + 'S'

end

else

begin

set @primary = @primary + 'X'

set @secondary = @secondary + 'X'

end

set @current = @current + 3

end

else

begin

if substring(@original, @current + 2, 1) in ('I','E','Y')

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'S'

end

else

begin

set @primary = @primary + 'SK'

set @secondary = @secondary + 'SK'

end

set @current = @current + 3

end

else

begin

if @current = 1-- special case 'sugar-'

and substring(@original, @current, 5) = 'SUGAR'

begin

set @primary = @primary + 'X'

set @secondary = @secondary + 'S'

set @current = @current + 1

end

else

begin

if @current = @Last-- french e.g. 'resnais', 'artois'

and substring(@original, @current - 2, 2) in ('AI','OI')

set @secondary = @secondary + 'S' --set @primary = @primary + ''

else

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'S'

end

if @strnext1 in ('S','Z')

set @current = @current + 2

else

set @current = @current + 1

end

else

if @strcur1 = 'T'

begin

if substring(@original, @current, 4) = 'TION'

begin

set @primary = @primary + 'X'

set @secondary = @secondary + 'X'

set @current = @current + 3

end

else

if substring(@original, @current, 3) in ('TIA','TCH')

begin

set @primary = @primary + 'X'

set @secondary = @secondary + 'X'

set @current = @current + 3

end

else

if substring(@original, @current, 2) = 'TH'

or substring(@original, @current, 3) = 'TTH'

begin

if substring(@original, @current + 2, 2) in ('OM','AM')-- special case 'thomas', 'thames' or germanic

or substring(@original, 0, 4) in ('VAN ','VON ')

or substring(@original, 0, 3) = 'SCH'

begin

set @primary = @primary + 'T'

set @secondary = @secondary + 'T'

end

else

begin

set @primary = @primary + '0'

set @secondary = @secondary + 'T'

end

set @current = @current + 2

end

else

begin

if @strnext1 in ('T','D')

begin

set @current = @current + 2

set @primary = @primary + 'T'

set @secondary = @secondary + 'T'

end

else

begin

set @current = @current + 1

set @primary = @primary + 'T'

set @secondary = @secondary + 'T'

end

else

if @strcur1 = 'V'

if (@strnext1 = 'V')

set @current = @current + 2

else

begin

set @current = @current + 1

set @primary = @primary + 'F'

set @secondary = @secondary + 'F'

end

else

if @strcur1 = 'W'

begin

-- can also be in middle of word

if substring(@original, @current, 2) = 'WR'

begin

set @primary = @primary + 'R'

set @secondary = @secondary + 'R'

set @current = @current + 2

end

else

if @current = 1

and (@strnext1 in ('A', 'E', 'I', 'O', 'U', 'Y')

or substring(@original, @current, 2) = 'WH'

)

begin

if @strnext1 in ('A', 'E', 'I', 'O', 'U', 'Y')-- Wasserman should match Vasserman

begin

set @primary = @primary + 'A'

set @secondary = @secondary + 'F'

set @current = @current + 1

end

else

begin

set @primary = @primary + 'A'-- need Uomo to match Womo

set @secondary = @secondary + 'A'

set @current = @current + 1

end

else

if (@current = @Last -- Arnow should match Arnoff

and @strprev1 in ('A', 'E', 'I', 'O', 'U', 'Y')

)

or substring(@original, @current - 1, 5) in ('EWSKI','EWSKY','OWSKI','OWSKY')

or substring(@original, 0, 3) = 'SCH'

begin

set @secondary = @secondary + 'F'--set @primary = @primary + ''

set @current = @current + 1

end

else

if substring(@original, @current, 4) in ('WICZ','WITZ') -- polish e.g. 'filipowicz'

begin

set @primary = @primary + 'TS'

set @secondary = @secondary + 'FX'

set @current = @current + 4

end

else

set @current = @current + 1-- else skip it

end

else

if @strcur1 = 'X'

begin

if not (@current = @Last-- french e.g. breaux

and (substring(@original, @current - 3, 3) in ('IAU', 'EAU')

or substring(@original, @current - 2, 2) in ('AU', 'OU')

)

begin

set @primary = @primary + 'KS'

set @secondary = @secondary + 'KS'

end--else skip it

if @strnext1 in ('C','X')

set @current = @current + 2

else

set @current = @current + 1

end

else

if @strcur1 = 'Z'

begin

if (@strnext1 = 'Z')

set @current = @current + 2

else

begin

if (@strnext1 = 'H') -- chinese pinyin e.g. 'zhao'

begin

set @primary = @primary + 'J'

set @secondary = @secondary + 'J'

set @current = @current + 2

end

else

begin

if (substring(@original, @current + 1, 2) in ('ZO', 'ZI', 'ZA'))

or (@SlavoGermanic = 1

and (@current > 0

and @strprev1 <> 'T'

)

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'TS'

end

else

begin

set @primary = @primary + 'S'

set @secondary = @secondary + 'S'

end

set @current = @current + 1

end

else

set @current = @current + 1

end

return cast(@primary as char(5)) + cast(@secondary as char(5))

end

Eric M Russell SSC Guru Points: 125562 More actions · Answer 7

'ABC Lickers' is probably an entirely different business.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 8

-- just for fun

SELECT[dbo].[DoubleMetaPhone]('Pheasant');

SELECT[dbo].[DoubleMetaPhone]('Phesant');

SELECT[dbo].[DoubleMetaPhone]('Fessant');

SELECT[dbo].[DoubleMetaPhone]('Fezant');

SELECT[dbo].[DoubleMetaPhone]('Phes ant');

SELECT[dbo].[DoubleMetaPhone]('Fez aunt');

quite impressive

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

Jason A. Long SSC-Insane Points: 23786 More actions · Answer 9

J Livingston SQL (9/6/2016)
-- just for fun SELECT[dbo].[DoubleMetaPhone]('Pheasant'); SELECT[dbo].[DoubleMetaPhone]('Phesant'); SELECT[dbo].[DoubleMetaPhone]('Fessant'); SELECT[dbo].[DoubleMetaPhone]('Fezant'); SELECT[dbo].[DoubleMetaPhone]('Phes ant'); SELECT[dbo].[DoubleMetaPhone]('Fez aunt');
quite impressive

Yeah... Wish I could take credit for it (Google skills perhaps?), but my contribution was just a straight copy & paste from the interwebs.

Eric M Russell SSC Guru Points: 125562 More actions · Answer 10

J Livingston SQL (9/6/2016)
-- just for fun SELECT[dbo].[DoubleMetaPhone]('Pheasant'); SELECT[dbo].[DoubleMetaPhone]('Phesant'); SELECT[dbo].[DoubleMetaPhone]('Fessant'); SELECT[dbo].[DoubleMetaPhone]('Fezant'); SELECT[dbo].[DoubleMetaPhone]('Phes ant'); SELECT[dbo].[DoubleMetaPhone]('Fez aunt');
quite impressive

Still there is the issue that two words sounding alike are completely different.

SELECT[dbo].[DoubleMetaPhone]('Pheasant');

FSNT FSNT

SELECT[dbo].[DoubleMetaPhone]('Fussy Aunt');

FSNT FSNT

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

Jason A. Long SSC-Insane Points: 23786 More actions · Answer 11

Eric M Russell (9/6/2016)
J Livingston SQL (9/6/2016)
-- just for fun SELECT[dbo].[DoubleMetaPhone]('Pheasant'); SELECT[dbo].[DoubleMetaPhone]('Phesant'); SELECT[dbo].[DoubleMetaPhone]('Fessant'); SELECT[dbo].[DoubleMetaPhone]('Fezant'); SELECT[dbo].[DoubleMetaPhone]('Phes ant'); SELECT[dbo].[DoubleMetaPhone]('Fez aunt');
quite impressive
Still there is the issue that two words sounding alike are completely different.
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
FSNT FSNT
SELECT[dbo].[DoubleMetaPhone]('Fussy Aunt');
FSNT FSNT

It's been my experience that there is no perfect solution to this kind of problem... At least nothing SQL based that I'm aware of. Google's "Did you mean" is probably the best but I don't have a clue what they're using or how they're using it.

The last big project I did like this, was to attempt to match up "common patients" coming from several different medical systems... I used a scoring method, where I compared every patient (name, ssn, address & phone) to every other patient (yes it was a brutal as it sounds) and assigned a "match quality score" for all of the possible combinations, discarded matches that scored below a given threshold and then used nested sets to identify & group the best quality matches. Management was far more impressed with it than I was at the end.

Eric M Russell SSC Guru Points: 125562 More actions · Answer 12

Jason A. Long (9/6/2016)
Eric M Russell (9/6/2016)
J Livingston SQL (9/6/2016)
-- just for fun SELECT[dbo].[DoubleMetaPhone]('Pheasant'); SELECT[dbo].[DoubleMetaPhone]('Phesant'); SELECT[dbo].[DoubleMetaPhone]('Fessant'); SELECT[dbo].[DoubleMetaPhone]('Fezant'); SELECT[dbo].[DoubleMetaPhone]('Phes ant'); SELECT[dbo].[DoubleMetaPhone]('Fez aunt');
quite impressive
Still there is the issue that two words sounding alike are completely different.
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
FSNT FSNT
SELECT[dbo].[DoubleMetaPhone]('Fussy Aunt');
FSNT FSNT
It's been my experience that there is no perfect solution to this kind of problem... At least nothing SQL based that I'm aware of. Google's "Did you mean" is probably the best but I don't have a clue what they're using or how they're using it.
The last big project I did like this, was to attempt to match up "common patients" coming from several different medical systems... I used a scoring method, where I compared every patient (name, ssn, address & phone) to every other patient (yes it was a brutal as it sounds) and assigned a "match quality score" for all of the possible combinations, discarded matches that scored below a given threshold and then used nested sets to identify & group the best quality matches. Management was far more impressed with it than I was at the end.

That's almost an exact description of my previous job. I maintained an ETL process for what we called the Master Patient Index; matching up distinct patients across medical claims, enrollments, and healthcare utilization records. The total distinct population was about 40 million across 100+ providers and payors.

Are you familiar with issue where newborns are assigned the mother or father's first name by default and then re-named on a subsequent record?

Or how about a client decides one day to start sending files with truncated names, dates of birth, and SSN for "privacy reasons" but still expect the MPI to continue matching?

Yep, I've been there. The original implementation had one rule that attempted to match on DoubleMetaPhone but it resulted in too many false matches on similar names. It's common for members of the same household to have similar or identical first names (ie: Alexander and Alexandra, or John Smith Sr. and Jr.), which can be an issue if there is no identifier like SSN or birth date to combine with.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 13

Jason A. Long (9/6/2016)
Eric M Russell (9/6/2016)
J Livingston SQL (9/6/2016)
-- just for fun SELECT[dbo].[DoubleMetaPhone]('Pheasant'); SELECT[dbo].[DoubleMetaPhone]('Phesant'); SELECT[dbo].[DoubleMetaPhone]('Fessant'); SELECT[dbo].[DoubleMetaPhone]('Fezant'); SELECT[dbo].[DoubleMetaPhone]('Phes ant'); SELECT[dbo].[DoubleMetaPhone]('Fez aunt');
quite impressive
Still there is the issue that two words sounding alike are completely different.
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
FSNT FSNT
SELECT[dbo].[DoubleMetaPhone]('Fussy Aunt');
FSNT FSNT
It's been my experience that there is no perfect solution to this kind of problem... At least nothing SQL based that I'm aware of. Google's "Did you mean" is probably the best but I don't have a clue what they're using or how they're using it.

Google search for "Fez aunt"

Did you mean:

Fez font

Faz unt

Fez hint

Fez anti

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

Jason A. Long SSC-Insane Points: 23786 More actions · Answer 14

That's almost an exact description of my previous job. I maintained an ETL process for what we called the Master Patient Index; matching up distinct patients across medical claims, enrollments, and healthcare utilization records. The total distinct population was about 40 million across 100+ providers and payors.
Are you familiar with issue where newborns are assigned the mother or father's first name by default and then re-named on a subsequent record?
Or how about a client decides one day to start sending files with truncated names, dates of birth, and SSN for "privacy reasons" but still expect the MPI to continue matching?
Yep, I've been there. The original implementation had one rule that attempted to match on DoubleMetaPhone but it resulted in too many false matches on similar names. It's common for members of the same household to have similar or identical first names (ie: Alexander and Alexandra, or John Smith Sr. and Jr.), which can be an issue if there is no identifier like SSN or birth date to combine with.

Yea... Nothing as painful as that... Everything we do is related to workers comp. So no infants to deal with.

The biggest problem we face is the fact that the database was created by C# developers who had no clue what normalization is or how to implement it. I won't go into gruesome detail but I'm sure you can imagine...

Jason A. Long SSC-Insane Points: 23786 More actions · Answer 15

Google search for "Fez aunt"
Did you mean:
Fez font
Faz unt
Fez hint
Fez anti

Same results I got... At least it's consistent. 😉