Fuzzy Matching

  • We haven't created a client table of correct, distinct values yet. I'm thinking initially we will have to go through those 300k records to find any variations and then continue the build the table based on if future months have names that don't match with what have already been identified.

  • Unless you're using a 3rd party product like Melissa, then attempting to auto-magically identify and recode mistyped company names is probably beyond the level of effort you would be willing to put into this ETL process. At the very least, any straightforward matching process would need to leverage something like a primary reference Company table containing valid company names. If you're using SSIS to load your table, then you can leverage a Lookup DataFlow task to join the source file/table with the reference table and redirect non-matching records into an exception table for which you may then subsequently perform manual editing and re-loading.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric - Thanks for the SSIS tip. That will make it much easier to capture the handful of records without a match as opposed to joining across tables manually.

  • If you are using SSIS, there is a "Fuzzy Grouping" transformation that can be used for de-duplication. You provide it with the 'canonical' forms of the strings, and with the real data. It then returns the closest match and the degree of confidence in the match. Would this do what you need?

    Disclaimer: I know of this by studying for the 70-463 exam (SQL Server 2012 Data warehouse) but I have not actually tried it! So you would need to experiment to get it working.

  • I have worked on a similar task before. The approach I took was to copy the names to another table (along with the primary key so I could link back to the original). In the copy table, I standardise the names as much as possible; remove punctuation, change double spaces to single spaces, remove words that appear too often to be useful for matching (eg. the, of, and, &, plc, corp, ltd) and replace common misspellings. I then match the words individually and score the result based on the number of matching words as a proportion of the total number of words in the name. Anything above a certain score is deemed a match with the remainder being reported for manual checking (I use the original versions of the names in the report, hence the need to copy the data sideways). Obviously the score required for automatic matching and the list of words to drop will be unique to your situation and the type of data you need to process.

  • An oldie but a goody... Years ago, Keith Henry translated Lawrence Philips C++ function to SQL function.

    http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=13574

    Yes, it is a Scalar function and no I haven't attempted to rewrite it as an iTVF (I don't use it enough to make it worth the effort)...

    It actually works really well... I've used it in the past with really good results.

    SELECT[dbo].[DoubleMetaPhone]('A.B.C. Liquers');

    SELECT[dbo].[DoubleMetaPhone]('a b c liquors');

    SELECT[dbo].[DoubleMetaPhone]('ABC Lickers');

    All return a value of: APKLKAPKLK

    Rather than making you dig through that old thread, here is the function code...

    /****** Object: UserDefinedFunction [dbo].[DoubleMetaPhone] Script Date: 9/6/2016 12:33:55 PM ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    ALTER FUNCTION [dbo].[DoubleMetaPhone] (@str varchar(70))

    RETURNS char(10)

    AS

    BEGIN

    /*#########################################################################

    Double Metaphone Phonetic Matching Function

    This reduces word to approximate phonetic string. This is deliberately

    not a direct phonetic

    Based off original C++ code and algorithm by

    Lawrence Philips (lphilips_AT_verity.com)

    Published in the C/C++ Users Journal:

    http://www.cuj.com/articles/2000/0006/0006d/0006d.htm?topic=articles

    Original Metaphone presented in article in "Computer Language" in 1990.

    Reduces alphabet to

    The 14 constonant sounds:

    "sh" "p"or"b" "th"

    | | |

    X S K J T F H L M N P R 0 W

    Drop vowels except at the beginning

    Produces a char(10) string. The left(@result,5) gives the most common

    pronouciation, right(@result,5) gives the commonest alternate.

    Translated into t-SQL by Keith Henry (keithh_AT_lbm-solutions.com)

    #########################################################################*/

    Declare

    @original varchar(70),

    @primary varchar(70),

    @secondary varchar(70),

    @lengthint,

    @Last int,

    @current int,

    @strcur1 char(1) ,

    @strnext1 char(1) ,

    @strprev1char(1),

    @SlavoGermanic bit

    set @SlavoGermanic= 0

    set @primary = ''

    set @secondary = ''

    set @current = 1

    set @length = len(@str)

    set @Last = @length

    set @original = isnull(@str,'') + ''

    set @original = upper(@original)

    if patindex('%[WK]%',@str) + charindex('CZ',@str) + charindex('WITZ',@str) <> 0

    set @SlavoGermanic = 1

    -- skip this at beginning of word

    if substring(@original, 1, 2) in ('GN', 'KN', 'PN', 'WR', 'PS')

    set @current = @current + 1

    -- Initial 'X' is pronounced 'Z' e.g. 'Xavier'

    if substring(@original, 1, 1) = 'X'

    begin

    set @primary = @primary + 'S' -- 'Z' maps to 'S'

    set @secondary = @secondary + 'S'

    set @current = @current + 1

    end

    if substring(@original, 1, 1) in ('A', 'E', 'I', 'O', 'U', 'Y')

    begin

    set @primary = @primary + 'A' -- all init vowels now map to 'A'

    set @secondary = @secondary + 'A'

    set @current = @current + 1

    end

    while @current <= @length

    begin

    if len(@primary) >= 5 break

    set @strcur1 = substring(@original, @current, 1)

    set @strnext1 = substring(@original, (@current + 1), 1)

    set @strprev1 = substring(@original, (@current - 1), 1)

    if @strcur1 in ('A', 'E', 'I', 'O', 'U', 'Y')

    set @current = @current + 1

    else

    if @strcur1 = 'B' -- '-mb', e.g. 'dumb', already skipped over ...

    begin

    set @primary = @primary + 'P'

    set @secondary = @secondary + 'P'

    if @strnext1 = 'B'

    set @current = @current + 2

    else

    set @current = @current + 1

    end

    else

    if @strcur1 = 'Ç'

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'S'

    set @current = @current + 1

    end

    else

    if @strcur1 = 'C'

    begin

    if @strnext1 = 'H'

    begin

    if substring(@original, @current, 4) = 'CHIA'-- italian 'chianti'

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    end

    else

    begin

    if @current > 1-- find 'michael'

    and substring(@original, @current, 4) = 'CHAE'

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'X'

    end

    else

    begin

    if @current = 1-- greek roots e.g. 'chemistry', 'chorus'

    and (substring(@original, @current + 1, 5) in ('HARAC', 'HARIS')

    or substring(@original, @current + 1, 3) in ('HOR', 'HYM', 'HIA', 'HEM')

    )

    and substring(@original, 1, 5) <> 'CHORE'

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    end

    else

    begin

    if (substring(@original, 0, 4) in ('VAN ', 'VON ')-- germanic, greek, or otherwise 'ch' for 'kh' sound

    or substring(@original, 0, 3) = 'SCH'

    )

    or substring(@original, @current - 2, 6) in ('ORCHES', 'ARCHIT', 'ORCHID')-- 'architect' but not 'arch', orchestra', 'orchid'

    or substring(@original, @current + 2, 1) in ('T', 'S')

    or ((@strprev1 in ('A','O','U','E')

    or @current = 0

    )

    and substring(@original, @current + 2, 1) in ('L','R','N','M','B','H','F','V','W',' ')-- e.g. 'wachtler', 'weschsler', but not 'tichner'

    )

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    end

    else

    begin

    if (@current > 1)

    begin

    if substring(@original, 1, 2) = 'MC' -- e.g. 'McHugh'

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    end

    else

    begin

    set @primary = @primary + 'X'

    set @secondary = @secondary + 'K'

    end

    end

    else

    begin

    set @primary = @primary + 'X'

    set @secondary = @secondary + 'X'

    end

    end

    end

    end

    end

    set @current = @current + 2

    end --ch logic

    else

    begin

    if @strnext1 = 'C'-- double 'C', but not McClellan'

    and not(@current = 1

    and substring(@original, 1, 1) = 'M'

    )

    begin

    if substring(@original, @current + 2, 1) in ('I','E','H')-- 'bellocchio' but not 'bacchus'

    and substring(@original, @current + 2, 2) <> 'HU'

    begin

    if (@current = 2-- 'accident', 'accede', 'succeed'

    and @strprev1 = 'A'

    )

    or substring(@original, @current - 1, 5) in ('UCCEE', 'UCCES')

    begin

    set @primary = @primary + 'KS'

    set @secondary = @secondary + 'KS'

    end

    else

    begin-- 'bacci', 'bertucci', other italian

    set @primary = @primary + 'X'

    set @secondary = @secondary + 'X'

    -- e.g. 'focaccia'if substring(@original, @current, 4) = 'CCIA'

    end

    set @current = @current + 3

    end

    else

    begin

    set @primary = @primary + 'K'-- Pierce's rule

    set @secondary = @secondary + 'K'

    set @current = @current + 2

    end

    end

    else

    begin

    if @strnext1 in ('K','G','Q')

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    set @current = @current + 2

    end

    else

    begin

    if @strnext1 in ('I','E','Y')

    begin

    if substring(@original, @current, 3) in ('CIO','CIE','CIA')-- italian vs. english

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'X'

    end

    else

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'S'

    end

    set @current = @current + 2

    end

    else

    begin

    if @strnext1 = 'Z'-- e.g. 'czerny'

    and substring(@original, @current -2, 4) <> 'WICZ'

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'X'

    set @current = @current + 2

    end

    else

    begin

    if @current > 2 -- various gremanic

    and substring(@original, @current - 2,1) not in ('A', 'E', 'I', 'O', 'U', 'Y')

    and substring(@original, @current - 1, 3) = 'ACH'

    and ((substring(@original, @current + 2, 1) <> 'I')

    and ((substring(@original, @current + 2, 1) <> 'E')

    or substring(@original, @current - 2, 6) in ('BACHER', 'MACHER')

    )

    )

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    set @current = @current + 2

    end

    else

    begin

    if @current = 1 -- special case 'caesar'

    and substring(@original, @current, 6) = 'CAESAR'

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'S'

    set @current = @current + 2

    end

    else

    begin-- final else

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    if substring(@original, @current + 1, 2) in (' C',' Q',' G')-- name sent in 'mac caffrey', 'mac gregor'

    set @current = @current + 3

    else

    set @current = @current + 1

    end

    end

    end

    end

    end

    end

    end

    end

    else

    if @strcur1 = 'D'

    begin

    if substring(@original, @current, 2) = 'DG'

    begin

    if substring(@original, @current + 2, 1) in ('I','E','Y')

    begin

    set @primary = @primary + 'J'-- e.g. 'edge'

    set @secondary = @secondary + 'J'

    set @current = @current + 3

    end

    else

    begin

    set @primary = @primary + 'TK'-- e.g. 'edgar'

    set @secondary = @secondary + 'TK'

    set @current = @current + 2

    end

    end

    else

    begin

    if substring(@original, @current, 2) in ('DT','DD')

    begin

    set @primary = @primary + 'T'

    set @secondary = @secondary + 'T'

    set @current = @current + 2

    end

    else

    begin

    set @primary = @primary + 'T'

    set @secondary = @secondary + 'T'

    set @current = @current + 1

    end

    end

    end

    else

    if @strcur1 = 'F'

    begin

    set @primary = @primary + 'F'

    set @secondary = @secondary + 'F'

    if (@strnext1 = 'F')

    set @current = @current + 2

    else

    set @current = @current + 1

    end

    else

    if @strcur1 = 'G'

    begin

    if (@strnext1 = 'H')

    begin

    if @current > 1

    and @strprev1 not in ('A', 'E', 'I', 'O', 'U', 'Y')

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    end

    else

    begin

    if not((@current > 2-- Parker's rule (with some further refinements) - e.g. 'hugh'

    and substring(@original, @current - 2, 1) in ('B','H','D')

    )-- e.g. 'bough'

    or (@current > 3

    and substring(@original, @current - 3, 1) in ('B','H','D')

    )-- e.g. 'broughton'

    or (@current > 4

    and substring(@original, @current - 4, 1) in ('B','H')

    ))

    begin

    if @current > 3-- e.g. 'laugh', 'McLaughlin', 'cough', 'gough', 'rough', 'tough'

    and @strprev1 = 'U'

    and substring(@original, @current - 3, 1) in ('C','G','L','R','T')

    begin

    set @primary = @primary + 'F'

    set @secondary = @secondary + 'F'

    end

    else

    begin

    if @current > 1

    and @strprev1 <> 'I'

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    end

    else

    begin

    if (@current < 4)

    begin

    if (@current = 1)-- 'ghislane', 'ghiradelli'

    begin

    if (substring(@original, @current + 2, 1) = 'I')

    begin

    set @primary = @primary + 'J'

    set @secondary = @secondary + 'J'

    end

    else

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    end

    end

    end

    end

    end

    end

    end

    set @current = @current + 2

    end

    else

    begin

    if (@strnext1 = 'N')

    begin

    if @current = 1

    and substring(@original, 0,1) in ('A', 'E', 'I', 'O', 'U', 'Y')

    and @SlavoGermanic = 0

    begin

    set @primary = @primary + 'KN'

    set @secondary = @secondary + 'N'

    end

    else

    begin

    -- not e.g. 'cagney'

    if substring(@original, @current + 2, 2) = 'EY'

    and (@strnext1 <> 'Y')

    and @SlavoGermanic = 0

    begin

    set @primary = @primary + 'N'

    set @secondary = @secondary + 'KN'

    end

    else

    begin

    set @primary = @primary + 'KN'

    set @secondary = @secondary + 'KN'

    end

    end

    set @current = @current + 2

    end

    else

    begin

    if substring(@original, @current + 1, 2) = 'LI'-- 'tagliaro'

    and @SlavoGermanic = 0

    begin

    set @primary = @primary + 'KL'

    set @secondary = @secondary + 'L'

    set @current = @current + 2

    end

    else

    begin

    if @current = 1-- -ges-, -gep-, -gel- at beginning

    and (@strnext1 = 'Y'

    or substring(@original, @current + 1, 2) in ('ES','EP','EB','EL','EY','IB','IL','IN','IE', 'EI','ER')

    )

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'J'

    set @current = @current + 2

    end

    else

    begin

    if (substring(@original, @current + 1, 2) = 'ER'-- -ger-, -gy-

    or @strnext1 = 'Y'

    )

    and substring(@original, 1, 6) not in ('DANGER','RANGER','MANGER')

    and @strprev1 not in ('E', 'I')

    and substring(@original, @current - 1, 3) not in ('RGY','OGY')

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'J'

    set @current = @current + 2

    end

    else

    begin

    if @strnext1 in ('E','I','Y')-- italian e.g. 'biaggi'

    or substring(@original, @current -1, 4) in ('AGGI','OGGI')

    begin

    if (substring(@original, 1, 4) in ('VAN ', 'VON ')-- obvious germanic

    or substring(@original, 1, 3) = 'SCH'

    )

    or substring(@original, @current + 1, 2) = 'ET'

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    end

    else

    begin

    -- always soft if french ending

    if substring(@original, @current + 1, 4) = 'IER '

    begin

    set @primary = @primary + 'J'

    set @secondary = @secondary + 'J'

    end

    else

    begin

    set @primary = @primary + 'J'

    set @secondary = @secondary + 'K'

    end

    end

    set @current = @current + 2

    end

    else

    begin-- other options exausted call it k sound

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    if (@strnext1 = 'G')

    set @current = @current + 2

    else

    set @current = @current + 1

    end

    end

    end

    end

    end

    end

    end

    else

    if @strcur1 = 'H'

    begin

    if (@current = 0 -- only keep if first & before vowel or btw. 2 vowels

    or @strprev1 in ('A', 'E', 'I', 'O', 'U', 'Y')

    )

    and @strnext1 in ('A', 'E', 'I', 'O', 'U', 'Y')

    begin

    set @primary = @primary + 'H'

    set @secondary = @secondary + 'H'

    set @current = @current + 2

    end

    else

    set @current = @current + 1

    end

    else

    if @strcur1 = 'J'

    begin

    if substring(@original, @current, 4) = 'JOSE'-- obvious spanish, 'jose', 'san jacinto'

    or substring(@original, 1, 4) = 'SAN '

    begin

    if (@current = 1

    and substring(@original, @current + 4, 1) = ' '

    )

    or substring(@original, 1, 4) = 'SAN '

    begin

    set @primary = @primary + 'H'

    set @secondary = @secondary + 'H'

    end

    else

    begin

    set @primary = @primary + 'J'

    set @secondary = @secondary + 'H'

    end

    set @current = @current + 1

    end

    else

    begin

    if @current = 1

    begin

    set @primary = @primary + 'J' -- Yankelovich/Jankelowicz

    set @secondary = @secondary + 'A'

    set @current = @current + 1

    end

    else

    begin

    if @strprev1 in ('A', 'E', 'I', 'O', 'U', 'Y') -- spanish pron. of .e.g. 'bajador'

    and @SlavoGermanic = 0

    and @strnext1 in ('A','O')

    begin

    set @primary = @primary + 'J'

    set @secondary = @secondary + 'H'

    set @current = @current + 1

    end

    else

    begin

    if (@current = @Last)

    begin

    set @primary = @primary + 'J'

    set @secondary = @secondary + ''

    set @current = @current + 1

    end

    else

    begin

    if @strnext1 in ('L','T','K','S','N','M','B','Z')

    and @strprev1 not in ('S','K','L')

    begin

    set @primary = @primary + 'J'

    set @secondary = @secondary + 'J'

    set @current = @current + 1

    end

    else

    begin

    if (@strnext1 = 'J') -- it could happen

    set @current = @current + 2

    else

    set @current = @current + 1

    end

    end

    end

    end

    end

    end

    else

    if @strcur1 = 'K'

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    if (@strnext1 = 'K')

    set @current = @current + 2

    else

    set @current = @current + 1

    end

    else

    if @strcur1 = 'L'

    begin

    if (@strnext1 = 'L')

    begin

    if (@current = (@length - 3)-- spanish e.g. 'cabrillo', 'gallegos'

    and substring(@original, @current - 1, 4) in ('ILLO','ILLA','ALLE')

    )

    or ((substring(@original, @Last - 1, 2) in ('AS','OS')

    or substring(@original, @Last, 1) in ('A','O')

    )

    and substring(@original, @current - 1, 4) = 'ALLE'

    )

    set @primary = @primary + 'L'--set @secondary = @secondary + ''

    set @current = @current + 2

    end

    else

    begin

    set @current = @current + 1

    set @primary = @primary + 'L'

    set @secondary = @secondary + 'L'

    end

    end

    else

    if @strcur1 = 'M'

    begin

    set @primary = @primary + 'M'

    set @secondary = @secondary + 'M'

    if substring(@original, @current - 1, 3) = 'UMB'

    and (@current + 1 = @Last

    or substring(@original, @current + 2, 2) = 'ER'

    )-- 'dumb', 'thumb'

    or @strnext1 = 'M'

    set @current = @current + 2

    else

    set @current = @current + 1

    end

    else

    if @strcur1 in ('N','Ñ')

    begin

    set @primary = @primary + 'N'

    set @secondary = @secondary + 'N'

    if @strnext1 in ('N','Ñ')

    set @current = @current + 2

    else

    set @current = @current + 1

    end

    else

    if @strcur1 = 'P'

    begin

    if (@strnext1 = 'H')

    begin

    set @current = @current + 2

    set @primary = @primary + 'F'

    set @secondary = @secondary + 'F'

    end

    else

    begin

    -- also account for 'campbell' and 'raspberry'

    if @strnext1 in ('P','B')

    set @current = @current + 2

    else

    begin

    set @current = @current + 1

    set @primary = @primary + 'P'

    set @secondary = @secondary + 'P'

    end

    end

    end

    else

    if @strcur1 = 'Q'

    begin

    set @primary = @primary + 'K'

    set @secondary = @secondary + 'K'

    if (@strnext1 = 'Q')

    set @current = @current + 2

    else

    set @current = @current + 1

    end

    else

    if @strcur1 = 'R'

    begin

    if @current = @Last-- french e.g. 'rogier', but exclude 'hochmeier'

    and @SlavoGermanic = 0

    and substring(@original, @current - 2, 2) = 'IE'

    and substring(@original, @current - 4, 2) not in ('ME','MA')

    set @secondary = @secondary + 'R' --set @primary = @primary + ''

    else

    begin

    set @primary = @primary + 'R'

    set @secondary = @secondary + 'R'

    end

    if (@strnext1 = 'R')

    set @current = @current + 2

    else

    set @current = @current + 1

    end

    else

    if @strcur1 = 'S'

    begin

    if substring(@original, @current - 1, 3) in ('ISL','YSL') -- special cases 'island', 'isle', 'carlisle', 'carlysle'

    set @current = @current + 1--silent s

    else

    begin

    if substring(@original, @current, 2) = 'SH'

    begin

    -- germanic

    if substring(@original, @current + 1, 4) in ('HEIM','HOEK','HOLM','HOLZ')

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'S'

    end

    else

    begin

    set @primary = @primary + 'X'

    set @secondary = @secondary + 'X'

    end

    set @current = @current + 2

    end

    else

    begin

    -- italian & armenian

    if substring(@original, @current, 3) in ('SIO','SIA')

    or substring(@original, @current, 4) in ('SIAN')

    begin

    if @SlavoGermanic = 0

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'X'

    end

    else

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'S'

    end

    set @current = @current + 3

    end

    else

    begin

    if (@current = 1-- german & anglicisations, e.g. 'smith' match 'schmidt', 'snider' match 'schneider'

    and @strnext1 in ('M','N','L','W')-- also, -sz- in slavic language altho in hungarian it is pronounced 's'

    )

    or @strnext1 = 'Z'

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'X'

    if @strnext1 = 'Z'

    set @current = @current + 2

    else

    set @current = @current + 1

    end

    else

    begin

    if substring(@original, @current, 2) = 'SC'

    begin

    if substring(@original, @current + 2, 1) = 'H'-- Schlesinger's rule

    begin

    if substring(@original, @current + 3, 2) in ('OO','ER','EN','UY','ED','EM')-- dutch origin, e.g. 'school', 'schooner'

    begin

    if substring(@original, @current + 3, 2) in ('ER','EN')-- 'schermerhorn', 'schenker'

    begin

    set @primary = @primary + 'X'

    set @secondary = @secondary + 'SK'

    end

    else

    begin

    set @primary = @primary + 'SK'

    set @secondary = @secondary + 'SK'

    end

    set @current = @current + 3

    end

    else

    begin

    if @current = 1

    and substring(@original, 3,1) not in ('A', 'E', 'I', 'O', 'U', 'Y')

    and substring(@original, @current + 3, 1) <> 'W'

    begin

    set @primary = @primary + 'X'

    set @secondary = @secondary + 'S'

    end

    else

    begin

    set @primary = @primary + 'X'

    set @secondary = @secondary + 'X'

    end

    set @current = @current + 3

    end

    end

    else

    begin

    if substring(@original, @current + 2, 1) in ('I','E','Y')

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'S'

    end

    else

    begin

    set @primary = @primary + 'SK'

    set @secondary = @secondary + 'SK'

    end

    set @current = @current + 3

    end

    end

    else

    begin

    if @current = 1-- special case 'sugar-'

    and substring(@original, @current, 5) = 'SUGAR'

    begin

    set @primary = @primary + 'X'

    set @secondary = @secondary + 'S'

    set @current = @current + 1

    end

    else

    begin

    if @current = @Last-- french e.g. 'resnais', 'artois'

    and substring(@original, @current - 2, 2) in ('AI','OI')

    set @secondary = @secondary + 'S' --set @primary = @primary + ''

    else

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'S'

    end

    if @strnext1 in ('S','Z')

    set @current = @current + 2

    else

    set @current = @current + 1

    end

    end

    end

    end

    end

    end

    end

    else

    if @strcur1 = 'T'

    begin

    if substring(@original, @current, 4) = 'TION'

    begin

    set @primary = @primary + 'X'

    set @secondary = @secondary + 'X'

    set @current = @current + 3

    end

    else

    if substring(@original, @current, 3) in ('TIA','TCH')

    begin

    set @primary = @primary + 'X'

    set @secondary = @secondary + 'X'

    set @current = @current + 3

    end

    else

    if substring(@original, @current, 2) = 'TH'

    or substring(@original, @current, 3) = 'TTH'

    begin

    if substring(@original, @current + 2, 2) in ('OM','AM')-- special case 'thomas', 'thames' or germanic

    or substring(@original, 0, 4) in ('VAN ','VON ')

    or substring(@original, 0, 3) = 'SCH'

    begin

    set @primary = @primary + 'T'

    set @secondary = @secondary + 'T'

    end

    else

    begin

    set @primary = @primary + '0'

    set @secondary = @secondary + 'T'

    end

    set @current = @current + 2

    end

    else

    begin

    if @strnext1 in ('T','D')

    begin

    set @current = @current + 2

    set @primary = @primary + 'T'

    set @secondary = @secondary + 'T'

    end

    else

    begin

    set @current = @current + 1

    set @primary = @primary + 'T'

    set @secondary = @secondary + 'T'

    end

    end

    end

    else

    if @strcur1 = 'V'

    if (@strnext1 = 'V')

    set @current = @current + 2

    else

    begin

    set @current = @current + 1

    set @primary = @primary + 'F'

    set @secondary = @secondary + 'F'

    end

    else

    if @strcur1 = 'W'

    begin

    -- can also be in middle of word

    if substring(@original, @current, 2) = 'WR'

    begin

    set @primary = @primary + 'R'

    set @secondary = @secondary + 'R'

    set @current = @current + 2

    end

    else

    if @current = 1

    and (@strnext1 in ('A', 'E', 'I', 'O', 'U', 'Y')

    or substring(@original, @current, 2) = 'WH'

    )

    begin

    if @strnext1 in ('A', 'E', 'I', 'O', 'U', 'Y')-- Wasserman should match Vasserman

    begin

    set @primary = @primary + 'A'

    set @secondary = @secondary + 'F'

    set @current = @current + 1

    end

    else

    begin

    set @primary = @primary + 'A'-- need Uomo to match Womo

    set @secondary = @secondary + 'A'

    set @current = @current + 1

    end

    end

    else

    if (@current = @Last -- Arnow should match Arnoff

    and @strprev1 in ('A', 'E', 'I', 'O', 'U', 'Y')

    )

    or substring(@original, @current - 1, 5) in ('EWSKI','EWSKY','OWSKI','OWSKY')

    or substring(@original, 0, 3) = 'SCH'

    begin

    set @secondary = @secondary + 'F'--set @primary = @primary + ''

    set @current = @current + 1

    end

    else

    if substring(@original, @current, 4) in ('WICZ','WITZ') -- polish e.g. 'filipowicz'

    begin

    set @primary = @primary + 'TS'

    set @secondary = @secondary + 'FX'

    set @current = @current + 4

    end

    else

    set @current = @current + 1-- else skip it

    end

    else

    if @strcur1 = 'X'

    begin

    if not (@current = @Last-- french e.g. breaux

    and (substring(@original, @current - 3, 3) in ('IAU', 'EAU')

    or substring(@original, @current - 2, 2) in ('AU', 'OU')

    )

    )

    begin

    set @primary = @primary + 'KS'

    set @secondary = @secondary + 'KS'

    end--else skip it

    if @strnext1 in ('C','X')

    set @current = @current + 2

    else

    set @current = @current + 1

    end

    else

    if @strcur1 = 'Z'

    begin

    if (@strnext1 = 'Z')

    set @current = @current + 2

    else

    begin

    if (@strnext1 = 'H') -- chinese pinyin e.g. 'zhao'

    begin

    set @primary = @primary + 'J'

    set @secondary = @secondary + 'J'

    set @current = @current + 2

    end

    else

    begin

    if (substring(@original, @current + 1, 2) in ('ZO', 'ZI', 'ZA'))

    or (@SlavoGermanic = 1

    and (@current > 0

    and @strprev1 <> 'T'

    )

    )

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'TS'

    end

    else

    begin

    set @primary = @primary + 'S'

    set @secondary = @secondary + 'S'

    end

    end

    set @current = @current + 1

    end

    end

    else

    set @current = @current + 1

    end

    return cast(@primary as char(5)) + cast(@secondary as char(5))

    end

  • 'ABC Lickers' is probably an entirely different business.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • -- just for fun

    SELECT[dbo].[DoubleMetaPhone]('Pheasant');

    SELECT[dbo].[DoubleMetaPhone]('Phesant');

    SELECT[dbo].[DoubleMetaPhone]('Fessant');

    SELECT[dbo].[DoubleMetaPhone]('Fezant');

    SELECT[dbo].[DoubleMetaPhone]('Phes ant');

    SELECT[dbo].[DoubleMetaPhone]('Fez aunt');

    quite impressive

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • J Livingston SQL (9/6/2016)


    -- just for fun

    SELECT[dbo].[DoubleMetaPhone]('Pheasant');

    SELECT[dbo].[DoubleMetaPhone]('Phesant');

    SELECT[dbo].[DoubleMetaPhone]('Fessant');

    SELECT[dbo].[DoubleMetaPhone]('Fezant');

    SELECT[dbo].[DoubleMetaPhone]('Phes ant');

    SELECT[dbo].[DoubleMetaPhone]('Fez aunt');

    quite impressive

    Yeah... Wish I could take credit for it (Google skills perhaps?), but my contribution was just a straight copy & paste from the interwebs.

  • J Livingston SQL (9/6/2016)


    -- just for fun

    SELECT[dbo].[DoubleMetaPhone]('Pheasant');

    SELECT[dbo].[DoubleMetaPhone]('Phesant');

    SELECT[dbo].[DoubleMetaPhone]('Fessant');

    SELECT[dbo].[DoubleMetaPhone]('Fezant');

    SELECT[dbo].[DoubleMetaPhone]('Phes ant');

    SELECT[dbo].[DoubleMetaPhone]('Fez aunt');

    quite impressive

    Still there is the issue that two words sounding alike are completely different.

    SELECT[dbo].[DoubleMetaPhone]('Pheasant');

    FSNT FSNT

    SELECT[dbo].[DoubleMetaPhone]('Fussy Aunt');

    FSNT FSNT

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (9/6/2016)


    J Livingston SQL (9/6/2016)


    -- just for fun

    SELECT[dbo].[DoubleMetaPhone]('Pheasant');

    SELECT[dbo].[DoubleMetaPhone]('Phesant');

    SELECT[dbo].[DoubleMetaPhone]('Fessant');

    SELECT[dbo].[DoubleMetaPhone]('Fezant');

    SELECT[dbo].[DoubleMetaPhone]('Phes ant');

    SELECT[dbo].[DoubleMetaPhone]('Fez aunt');

    quite impressive

    Still there is the issue that two words sounding alike are completely different.

    SELECT[dbo].[DoubleMetaPhone]('Pheasant');

    FSNT FSNT

    SELECT[dbo].[DoubleMetaPhone]('Fussy Aunt');

    FSNT FSNT

    It's been my experience that there is no perfect solution to this kind of problem... At least nothing SQL based that I'm aware of. Google's "Did you mean" is probably the best but I don't have a clue what they're using or how they're using it.

    The last big project I did like this, was to attempt to match up "common patients" coming from several different medical systems... I used a scoring method, where I compared every patient (name, ssn, address & phone) to every other patient (yes it was a brutal as it sounds) and assigned a "match quality score" for all of the possible combinations, discarded matches that scored below a given threshold and then used nested sets to identify & group the best quality matches. Management was far more impressed with it than I was at the end.

  • Jason A. Long (9/6/2016)


    Eric M Russell (9/6/2016)


    J Livingston SQL (9/6/2016)


    -- just for fun

    SELECT[dbo].[DoubleMetaPhone]('Pheasant');

    SELECT[dbo].[DoubleMetaPhone]('Phesant');

    SELECT[dbo].[DoubleMetaPhone]('Fessant');

    SELECT[dbo].[DoubleMetaPhone]('Fezant');

    SELECT[dbo].[DoubleMetaPhone]('Phes ant');

    SELECT[dbo].[DoubleMetaPhone]('Fez aunt');

    quite impressive

    Still there is the issue that two words sounding alike are completely different.

    SELECT[dbo].[DoubleMetaPhone]('Pheasant');

    FSNT FSNT

    SELECT[dbo].[DoubleMetaPhone]('Fussy Aunt');

    FSNT FSNT

    It's been my experience that there is no perfect solution to this kind of problem... At least nothing SQL based that I'm aware of. Google's "Did you mean" is probably the best but I don't have a clue what they're using or how they're using it.

    The last big project I did like this, was to attempt to match up "common patients" coming from several different medical systems... I used a scoring method, where I compared every patient (name, ssn, address & phone) to every other patient (yes it was a brutal as it sounds) and assigned a "match quality score" for all of the possible combinations, discarded matches that scored below a given threshold and then used nested sets to identify & group the best quality matches. Management was far more impressed with it than I was at the end.

    That's almost an exact description of my previous job. I maintained an ETL process for what we called the Master Patient Index; matching up distinct patients across medical claims, enrollments, and healthcare utilization records. The total distinct population was about 40 million across 100+ providers and payors.

    Are you familiar with issue where newborns are assigned the mother or father's first name by default and then re-named on a subsequent record?

    Or how about a client decides one day to start sending files with truncated names, dates of birth, and SSN for "privacy reasons" but still expect the MPI to continue matching?

    Yep, I've been there. The original implementation had one rule that attempted to match on DoubleMetaPhone but it resulted in too many false matches on similar names. It's common for members of the same household to have similar or identical first names (ie: Alexander and Alexandra, or John Smith Sr. and Jr.), which can be an issue if there is no identifier like SSN or birth date to combine with.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Jason A. Long (9/6/2016)


    Eric M Russell (9/6/2016)


    J Livingston SQL (9/6/2016)


    -- just for fun

    SELECT[dbo].[DoubleMetaPhone]('Pheasant');

    SELECT[dbo].[DoubleMetaPhone]('Phesant');

    SELECT[dbo].[DoubleMetaPhone]('Fessant');

    SELECT[dbo].[DoubleMetaPhone]('Fezant');

    SELECT[dbo].[DoubleMetaPhone]('Phes ant');

    SELECT[dbo].[DoubleMetaPhone]('Fez aunt');

    quite impressive

    Still there is the issue that two words sounding alike are completely different.

    SELECT[dbo].[DoubleMetaPhone]('Pheasant');

    FSNT FSNT

    SELECT[dbo].[DoubleMetaPhone]('Fussy Aunt');

    FSNT FSNT

    It's been my experience that there is no perfect solution to this kind of problem... At least nothing SQL based that I'm aware of. Google's "Did you mean" is probably the best but I don't have a clue what they're using or how they're using it.

    Google search for "Fez aunt"

    Did you mean:

    Fez font

    Faz unt

    Fez hint

    Fez anti

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • That's almost an exact description of my previous job. I maintained an ETL process for what we called the Master Patient Index; matching up distinct patients across medical claims, enrollments, and healthcare utilization records. The total distinct population was about 40 million across 100+ providers and payors.

    Are you familiar with issue where newborns are assigned the mother or father's first name by default and then re-named on a subsequent record?

    Or how about a client decides one day to start sending files with truncated names, dates of birth, and SSN for "privacy reasons" but still expect the MPI to continue matching?

    Yep, I've been there. The original implementation had one rule that attempted to match on DoubleMetaPhone but it resulted in too many false matches on similar names. It's common for members of the same household to have similar or identical first names (ie: Alexander and Alexandra, or John Smith Sr. and Jr.), which can be an issue if there is no identifier like SSN or birth date to combine with.

    Yea... Nothing as painful as that... Everything we do is related to workers comp. So no infants to deal with.

    The biggest problem we face is the fact that the database was created by C# developers who had no clue what normalization is or how to implement it. I won't go into gruesome detail but I'm sure you can imagine...

  • Google search for "Fez aunt"

    Did you mean:

    Fez font

    Faz unt

    Fez hint

    Fez anti

    Same results I got... At least it's consistent. 😉

Viewing 15 posts - 16 through 30 (of 30 total)

You must be logged in to reply to this topic. Login to reply