August 31, 2016 at 7:31 am
We haven't created a client table of correct, distinct values yet. I'm thinking initially we will have to go through those 300k records to find any variations and then continue the build the table based on if future months have names that don't match with what have already been identified.
August 31, 2016 at 7:45 am
Unless you're using a 3rd party product like Melissa, then attempting to auto-magically identify and recode mistyped company names is probably beyond the level of effort you would be willing to put into this ETL process. At the very least, any straightforward matching process would need to leverage something like a primary reference Company table containing valid company names. If you're using SSIS to load your table, then you can leverage a Lookup DataFlow task to join the source file/table with the reference table and redirect non-matching records into an exception table for which you may then subsequently perform manual editing and re-loading.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
August 31, 2016 at 9:42 am
Eric - Thanks for the SSIS tip. That will make it much easier to capture the handful of records without a match as opposed to joining across tables manually.
September 1, 2016 at 2:38 am
If you are using SSIS, there is a "Fuzzy Grouping" transformation that can be used for de-duplication. You provide it with the 'canonical' forms of the strings, and with the real data. It then returns the closest match and the degree of confidence in the match. Would this do what you need?
Disclaimer: I know of this by studying for the 70-463 exam (SQL Server 2012 Data warehouse) but I have not actually tried it! So you would need to experiment to get it working.
September 6, 2016 at 2:53 am
I have worked on a similar task before. The approach I took was to copy the names to another table (along with the primary key so I could link back to the original). In the copy table, I standardise the names as much as possible; remove punctuation, change double spaces to single spaces, remove words that appear too often to be useful for matching (eg. the, of, and, &, plc, corp, ltd) and replace common misspellings. I then match the words individually and score the result based on the number of matching words as a proportion of the total number of words in the name. Anything above a certain score is deemed a match with the remainder being reported for manual checking (I use the original versions of the names in the report, hence the need to copy the data sideways). Obviously the score required for automatic matching and the list of words to drop will be unique to your situation and the type of data you need to process.
September 6, 2016 at 10:45 am
An oldie but a goody... Years ago, Keith Henry translated Lawrence Philips C++ function to SQL function.
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=13574
Yes, it is a Scalar function and no I haven't attempted to rewrite it as an iTVF (I don't use it enough to make it worth the effort)...
It actually works really well... I've used it in the past with really good results.
SELECT[dbo].[DoubleMetaPhone]('A.B.C. Liquers');
SELECT[dbo].[DoubleMetaPhone]('a b c liquors');
SELECT[dbo].[DoubleMetaPhone]('ABC Lickers');
All return a value of: APKLKAPKLK
Rather than making you dig through that old thread, here is the function code...
/****** Object: UserDefinedFunction [dbo].[DoubleMetaPhone] Script Date: 9/6/2016 12:33:55 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER FUNCTION [dbo].[DoubleMetaPhone] (@str varchar(70))
RETURNS char(10)
AS
BEGIN
/*#########################################################################
Double Metaphone Phonetic Matching Function
This reduces word to approximate phonetic string. This is deliberately
not a direct phonetic
Based off original C++ code and algorithm by
Lawrence Philips (lphilips_AT_verity.com)
Published in the C/C++ Users Journal:
http://www.cuj.com/articles/2000/0006/0006d/0006d.htm?topic=articles
Original Metaphone presented in article in "Computer Language" in 1990.
Reduces alphabet to
The 14 constonant sounds:
"sh" "p"or"b" "th"
| | |
X S K J T F H L M N P R 0 W
Drop vowels except at the beginning
Produces a char(10) string. The left(@result,5) gives the most common
pronouciation, right(@result,5) gives the commonest alternate.
Translated into t-SQL by Keith Henry (keithh_AT_lbm-solutions.com)
#########################################################################*/
Declare
@original varchar(70),
@primary varchar(70),
@secondary varchar(70),
@lengthint,
@Last int,
@current int,
@strcur1 char(1) ,
@strnext1 char(1) ,
@strprev1char(1),
@SlavoGermanic bit
set @SlavoGermanic= 0
set @primary = ''
set @secondary = ''
set @current = 1
set @length = len(@str)
set @Last = @length
set @original = isnull(@str,'') + ''
set @original = upper(@original)
if patindex('%[WK]%',@str) + charindex('CZ',@str) + charindex('WITZ',@str) <> 0
set @SlavoGermanic = 1
-- skip this at beginning of word
if substring(@original, 1, 2) in ('GN', 'KN', 'PN', 'WR', 'PS')
set @current = @current + 1
-- Initial 'X' is pronounced 'Z' e.g. 'Xavier'
if substring(@original, 1, 1) = 'X'
begin
set @primary = @primary + 'S' -- 'Z' maps to 'S'
set @secondary = @secondary + 'S'
set @current = @current + 1
end
if substring(@original, 1, 1) in ('A', 'E', 'I', 'O', 'U', 'Y')
begin
set @primary = @primary + 'A' -- all init vowels now map to 'A'
set @secondary = @secondary + 'A'
set @current = @current + 1
end
while @current <= @length
begin
if len(@primary) >= 5 break
set @strcur1 = substring(@original, @current, 1)
set @strnext1 = substring(@original, (@current + 1), 1)
set @strprev1 = substring(@original, (@current - 1), 1)
if @strcur1 in ('A', 'E', 'I', 'O', 'U', 'Y')
set @current = @current + 1
else
if @strcur1 = 'B' -- '-mb', e.g. 'dumb', already skipped over ...
begin
set @primary = @primary + 'P'
set @secondary = @secondary + 'P'
if @strnext1 = 'B'
set @current = @current + 2
else
set @current = @current + 1
end
else
if @strcur1 = 'Ç'
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'S'
set @current = @current + 1
end
else
if @strcur1 = 'C'
begin
if @strnext1 = 'H'
begin
if substring(@original, @current, 4) = 'CHIA'-- italian 'chianti'
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
end
else
begin
if @current > 1-- find 'michael'
and substring(@original, @current, 4) = 'CHAE'
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'X'
end
else
begin
if @current = 1-- greek roots e.g. 'chemistry', 'chorus'
and (substring(@original, @current + 1, 5) in ('HARAC', 'HARIS')
or substring(@original, @current + 1, 3) in ('HOR', 'HYM', 'HIA', 'HEM')
)
and substring(@original, 1, 5) <> 'CHORE'
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
end
else
begin
if (substring(@original, 0, 4) in ('VAN ', 'VON ')-- germanic, greek, or otherwise 'ch' for 'kh' sound
or substring(@original, 0, 3) = 'SCH'
)
or substring(@original, @current - 2, 6) in ('ORCHES', 'ARCHIT', 'ORCHID')-- 'architect' but not 'arch', orchestra', 'orchid'
or substring(@original, @current + 2, 1) in ('T', 'S')
or ((@strprev1 in ('A','O','U','E')
or @current = 0
)
and substring(@original, @current + 2, 1) in ('L','R','N','M','B','H','F','V','W',' ')-- e.g. 'wachtler', 'weschsler', but not 'tichner'
)
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
end
else
begin
if (@current > 1)
begin
if substring(@original, 1, 2) = 'MC' -- e.g. 'McHugh'
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
end
else
begin
set @primary = @primary + 'X'
set @secondary = @secondary + 'K'
end
end
else
begin
set @primary = @primary + 'X'
set @secondary = @secondary + 'X'
end
end
end
end
end
set @current = @current + 2
end --ch logic
else
begin
if @strnext1 = 'C'-- double 'C', but not McClellan'
and not(@current = 1
and substring(@original, 1, 1) = 'M'
)
begin
if substring(@original, @current + 2, 1) in ('I','E','H')-- 'bellocchio' but not 'bacchus'
and substring(@original, @current + 2, 2) <> 'HU'
begin
if (@current = 2-- 'accident', 'accede', 'succeed'
and @strprev1 = 'A'
)
or substring(@original, @current - 1, 5) in ('UCCEE', 'UCCES')
begin
set @primary = @primary + 'KS'
set @secondary = @secondary + 'KS'
end
else
begin-- 'bacci', 'bertucci', other italian
set @primary = @primary + 'X'
set @secondary = @secondary + 'X'
-- e.g. 'focaccia'if substring(@original, @current, 4) = 'CCIA'
end
set @current = @current + 3
end
else
begin
set @primary = @primary + 'K'-- Pierce's rule
set @secondary = @secondary + 'K'
set @current = @current + 2
end
end
else
begin
if @strnext1 in ('K','G','Q')
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
set @current = @current + 2
end
else
begin
if @strnext1 in ('I','E','Y')
begin
if substring(@original, @current, 3) in ('CIO','CIE','CIA')-- italian vs. english
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'X'
end
else
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'S'
end
set @current = @current + 2
end
else
begin
if @strnext1 = 'Z'-- e.g. 'czerny'
and substring(@original, @current -2, 4) <> 'WICZ'
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'X'
set @current = @current + 2
end
else
begin
if @current > 2 -- various gremanic
and substring(@original, @current - 2,1) not in ('A', 'E', 'I', 'O', 'U', 'Y')
and substring(@original, @current - 1, 3) = 'ACH'
and ((substring(@original, @current + 2, 1) <> 'I')
and ((substring(@original, @current + 2, 1) <> 'E')
or substring(@original, @current - 2, 6) in ('BACHER', 'MACHER')
)
)
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
set @current = @current + 2
end
else
begin
if @current = 1 -- special case 'caesar'
and substring(@original, @current, 6) = 'CAESAR'
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'S'
set @current = @current + 2
end
else
begin-- final else
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
if substring(@original, @current + 1, 2) in (' C',' Q',' G')-- name sent in 'mac caffrey', 'mac gregor'
set @current = @current + 3
else
set @current = @current + 1
end
end
end
end
end
end
end
end
else
if @strcur1 = 'D'
begin
if substring(@original, @current, 2) = 'DG'
begin
if substring(@original, @current + 2, 1) in ('I','E','Y')
begin
set @primary = @primary + 'J'-- e.g. 'edge'
set @secondary = @secondary + 'J'
set @current = @current + 3
end
else
begin
set @primary = @primary + 'TK'-- e.g. 'edgar'
set @secondary = @secondary + 'TK'
set @current = @current + 2
end
end
else
begin
if substring(@original, @current, 2) in ('DT','DD')
begin
set @primary = @primary + 'T'
set @secondary = @secondary + 'T'
set @current = @current + 2
end
else
begin
set @primary = @primary + 'T'
set @secondary = @secondary + 'T'
set @current = @current + 1
end
end
end
else
if @strcur1 = 'F'
begin
set @primary = @primary + 'F'
set @secondary = @secondary + 'F'
if (@strnext1 = 'F')
set @current = @current + 2
else
set @current = @current + 1
end
else
if @strcur1 = 'G'
begin
if (@strnext1 = 'H')
begin
if @current > 1
and @strprev1 not in ('A', 'E', 'I', 'O', 'U', 'Y')
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
end
else
begin
if not((@current > 2-- Parker's rule (with some further refinements) - e.g. 'hugh'
and substring(@original, @current - 2, 1) in ('B','H','D')
)-- e.g. 'bough'
or (@current > 3
and substring(@original, @current - 3, 1) in ('B','H','D')
)-- e.g. 'broughton'
or (@current > 4
and substring(@original, @current - 4, 1) in ('B','H')
))
begin
if @current > 3-- e.g. 'laugh', 'McLaughlin', 'cough', 'gough', 'rough', 'tough'
and @strprev1 = 'U'
and substring(@original, @current - 3, 1) in ('C','G','L','R','T')
begin
set @primary = @primary + 'F'
set @secondary = @secondary + 'F'
end
else
begin
if @current > 1
and @strprev1 <> 'I'
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
end
else
begin
if (@current < 4)
begin
if (@current = 1)-- 'ghislane', 'ghiradelli'
begin
if (substring(@original, @current + 2, 1) = 'I')
begin
set @primary = @primary + 'J'
set @secondary = @secondary + 'J'
end
else
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
end
end
end
end
end
end
end
set @current = @current + 2
end
else
begin
if (@strnext1 = 'N')
begin
if @current = 1
and substring(@original, 0,1) in ('A', 'E', 'I', 'O', 'U', 'Y')
and @SlavoGermanic = 0
begin
set @primary = @primary + 'KN'
set @secondary = @secondary + 'N'
end
else
begin
-- not e.g. 'cagney'
if substring(@original, @current + 2, 2) = 'EY'
and (@strnext1 <> 'Y')
and @SlavoGermanic = 0
begin
set @primary = @primary + 'N'
set @secondary = @secondary + 'KN'
end
else
begin
set @primary = @primary + 'KN'
set @secondary = @secondary + 'KN'
end
end
set @current = @current + 2
end
else
begin
if substring(@original, @current + 1, 2) = 'LI'-- 'tagliaro'
and @SlavoGermanic = 0
begin
set @primary = @primary + 'KL'
set @secondary = @secondary + 'L'
set @current = @current + 2
end
else
begin
if @current = 1-- -ges-, -gep-, -gel- at beginning
and (@strnext1 = 'Y'
or substring(@original, @current + 1, 2) in ('ES','EP','EB','EL','EY','IB','IL','IN','IE', 'EI','ER')
)
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'J'
set @current = @current + 2
end
else
begin
if (substring(@original, @current + 1, 2) = 'ER'-- -ger-, -gy-
or @strnext1 = 'Y'
)
and substring(@original, 1, 6) not in ('DANGER','RANGER','MANGER')
and @strprev1 not in ('E', 'I')
and substring(@original, @current - 1, 3) not in ('RGY','OGY')
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'J'
set @current = @current + 2
end
else
begin
if @strnext1 in ('E','I','Y')-- italian e.g. 'biaggi'
or substring(@original, @current -1, 4) in ('AGGI','OGGI')
begin
if (substring(@original, 1, 4) in ('VAN ', 'VON ')-- obvious germanic
or substring(@original, 1, 3) = 'SCH'
)
or substring(@original, @current + 1, 2) = 'ET'
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
end
else
begin
-- always soft if french ending
if substring(@original, @current + 1, 4) = 'IER '
begin
set @primary = @primary + 'J'
set @secondary = @secondary + 'J'
end
else
begin
set @primary = @primary + 'J'
set @secondary = @secondary + 'K'
end
end
set @current = @current + 2
end
else
begin-- other options exausted call it k sound
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
if (@strnext1 = 'G')
set @current = @current + 2
else
set @current = @current + 1
end
end
end
end
end
end
end
else
if @strcur1 = 'H'
begin
if (@current = 0 -- only keep if first & before vowel or btw. 2 vowels
or @strprev1 in ('A', 'E', 'I', 'O', 'U', 'Y')
)
and @strnext1 in ('A', 'E', 'I', 'O', 'U', 'Y')
begin
set @primary = @primary + 'H'
set @secondary = @secondary + 'H'
set @current = @current + 2
end
else
set @current = @current + 1
end
else
if @strcur1 = 'J'
begin
if substring(@original, @current, 4) = 'JOSE'-- obvious spanish, 'jose', 'san jacinto'
or substring(@original, 1, 4) = 'SAN '
begin
if (@current = 1
and substring(@original, @current + 4, 1) = ' '
)
or substring(@original, 1, 4) = 'SAN '
begin
set @primary = @primary + 'H'
set @secondary = @secondary + 'H'
end
else
begin
set @primary = @primary + 'J'
set @secondary = @secondary + 'H'
end
set @current = @current + 1
end
else
begin
if @current = 1
begin
set @primary = @primary + 'J' -- Yankelovich/Jankelowicz
set @secondary = @secondary + 'A'
set @current = @current + 1
end
else
begin
if @strprev1 in ('A', 'E', 'I', 'O', 'U', 'Y') -- spanish pron. of .e.g. 'bajador'
and @SlavoGermanic = 0
and @strnext1 in ('A','O')
begin
set @primary = @primary + 'J'
set @secondary = @secondary + 'H'
set @current = @current + 1
end
else
begin
if (@current = @Last)
begin
set @primary = @primary + 'J'
set @secondary = @secondary + ''
set @current = @current + 1
end
else
begin
if @strnext1 in ('L','T','K','S','N','M','B','Z')
and @strprev1 not in ('S','K','L')
begin
set @primary = @primary + 'J'
set @secondary = @secondary + 'J'
set @current = @current + 1
end
else
begin
if (@strnext1 = 'J') -- it could happen
set @current = @current + 2
else
set @current = @current + 1
end
end
end
end
end
end
else
if @strcur1 = 'K'
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
if (@strnext1 = 'K')
set @current = @current + 2
else
set @current = @current + 1
end
else
if @strcur1 = 'L'
begin
if (@strnext1 = 'L')
begin
if (@current = (@length - 3)-- spanish e.g. 'cabrillo', 'gallegos'
and substring(@original, @current - 1, 4) in ('ILLO','ILLA','ALLE')
)
or ((substring(@original, @Last - 1, 2) in ('AS','OS')
or substring(@original, @Last, 1) in ('A','O')
)
and substring(@original, @current - 1, 4) = 'ALLE'
)
set @primary = @primary + 'L'--set @secondary = @secondary + ''
set @current = @current + 2
end
else
begin
set @current = @current + 1
set @primary = @primary + 'L'
set @secondary = @secondary + 'L'
end
end
else
if @strcur1 = 'M'
begin
set @primary = @primary + 'M'
set @secondary = @secondary + 'M'
if substring(@original, @current - 1, 3) = 'UMB'
and (@current + 1 = @Last
or substring(@original, @current + 2, 2) = 'ER'
)-- 'dumb', 'thumb'
or @strnext1 = 'M'
set @current = @current + 2
else
set @current = @current + 1
end
else
if @strcur1 in ('N','Ñ')
begin
set @primary = @primary + 'N'
set @secondary = @secondary + 'N'
if @strnext1 in ('N','Ñ')
set @current = @current + 2
else
set @current = @current + 1
end
else
if @strcur1 = 'P'
begin
if (@strnext1 = 'H')
begin
set @current = @current + 2
set @primary = @primary + 'F'
set @secondary = @secondary + 'F'
end
else
begin
-- also account for 'campbell' and 'raspberry'
if @strnext1 in ('P','B')
set @current = @current + 2
else
begin
set @current = @current + 1
set @primary = @primary + 'P'
set @secondary = @secondary + 'P'
end
end
end
else
if @strcur1 = 'Q'
begin
set @primary = @primary + 'K'
set @secondary = @secondary + 'K'
if (@strnext1 = 'Q')
set @current = @current + 2
else
set @current = @current + 1
end
else
if @strcur1 = 'R'
begin
if @current = @Last-- french e.g. 'rogier', but exclude 'hochmeier'
and @SlavoGermanic = 0
and substring(@original, @current - 2, 2) = 'IE'
and substring(@original, @current - 4, 2) not in ('ME','MA')
set @secondary = @secondary + 'R' --set @primary = @primary + ''
else
begin
set @primary = @primary + 'R'
set @secondary = @secondary + 'R'
end
if (@strnext1 = 'R')
set @current = @current + 2
else
set @current = @current + 1
end
else
if @strcur1 = 'S'
begin
if substring(@original, @current - 1, 3) in ('ISL','YSL') -- special cases 'island', 'isle', 'carlisle', 'carlysle'
set @current = @current + 1--silent s
else
begin
if substring(@original, @current, 2) = 'SH'
begin
-- germanic
if substring(@original, @current + 1, 4) in ('HEIM','HOEK','HOLM','HOLZ')
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'S'
end
else
begin
set @primary = @primary + 'X'
set @secondary = @secondary + 'X'
end
set @current = @current + 2
end
else
begin
-- italian & armenian
if substring(@original, @current, 3) in ('SIO','SIA')
or substring(@original, @current, 4) in ('SIAN')
begin
if @SlavoGermanic = 0
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'X'
end
else
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'S'
end
set @current = @current + 3
end
else
begin
if (@current = 1-- german & anglicisations, e.g. 'smith' match 'schmidt', 'snider' match 'schneider'
and @strnext1 in ('M','N','L','W')-- also, -sz- in slavic language altho in hungarian it is pronounced 's'
)
or @strnext1 = 'Z'
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'X'
if @strnext1 = 'Z'
set @current = @current + 2
else
set @current = @current + 1
end
else
begin
if substring(@original, @current, 2) = 'SC'
begin
if substring(@original, @current + 2, 1) = 'H'-- Schlesinger's rule
begin
if substring(@original, @current + 3, 2) in ('OO','ER','EN','UY','ED','EM')-- dutch origin, e.g. 'school', 'schooner'
begin
if substring(@original, @current + 3, 2) in ('ER','EN')-- 'schermerhorn', 'schenker'
begin
set @primary = @primary + 'X'
set @secondary = @secondary + 'SK'
end
else
begin
set @primary = @primary + 'SK'
set @secondary = @secondary + 'SK'
end
set @current = @current + 3
end
else
begin
if @current = 1
and substring(@original, 3,1) not in ('A', 'E', 'I', 'O', 'U', 'Y')
and substring(@original, @current + 3, 1) <> 'W'
begin
set @primary = @primary + 'X'
set @secondary = @secondary + 'S'
end
else
begin
set @primary = @primary + 'X'
set @secondary = @secondary + 'X'
end
set @current = @current + 3
end
end
else
begin
if substring(@original, @current + 2, 1) in ('I','E','Y')
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'S'
end
else
begin
set @primary = @primary + 'SK'
set @secondary = @secondary + 'SK'
end
set @current = @current + 3
end
end
else
begin
if @current = 1-- special case 'sugar-'
and substring(@original, @current, 5) = 'SUGAR'
begin
set @primary = @primary + 'X'
set @secondary = @secondary + 'S'
set @current = @current + 1
end
else
begin
if @current = @Last-- french e.g. 'resnais', 'artois'
and substring(@original, @current - 2, 2) in ('AI','OI')
set @secondary = @secondary + 'S' --set @primary = @primary + ''
else
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'S'
end
if @strnext1 in ('S','Z')
set @current = @current + 2
else
set @current = @current + 1
end
end
end
end
end
end
end
else
if @strcur1 = 'T'
begin
if substring(@original, @current, 4) = 'TION'
begin
set @primary = @primary + 'X'
set @secondary = @secondary + 'X'
set @current = @current + 3
end
else
if substring(@original, @current, 3) in ('TIA','TCH')
begin
set @primary = @primary + 'X'
set @secondary = @secondary + 'X'
set @current = @current + 3
end
else
if substring(@original, @current, 2) = 'TH'
or substring(@original, @current, 3) = 'TTH'
begin
if substring(@original, @current + 2, 2) in ('OM','AM')-- special case 'thomas', 'thames' or germanic
or substring(@original, 0, 4) in ('VAN ','VON ')
or substring(@original, 0, 3) = 'SCH'
begin
set @primary = @primary + 'T'
set @secondary = @secondary + 'T'
end
else
begin
set @primary = @primary + '0'
set @secondary = @secondary + 'T'
end
set @current = @current + 2
end
else
begin
if @strnext1 in ('T','D')
begin
set @current = @current + 2
set @primary = @primary + 'T'
set @secondary = @secondary + 'T'
end
else
begin
set @current = @current + 1
set @primary = @primary + 'T'
set @secondary = @secondary + 'T'
end
end
end
else
if @strcur1 = 'V'
if (@strnext1 = 'V')
set @current = @current + 2
else
begin
set @current = @current + 1
set @primary = @primary + 'F'
set @secondary = @secondary + 'F'
end
else
if @strcur1 = 'W'
begin
-- can also be in middle of word
if substring(@original, @current, 2) = 'WR'
begin
set @primary = @primary + 'R'
set @secondary = @secondary + 'R'
set @current = @current + 2
end
else
if @current = 1
and (@strnext1 in ('A', 'E', 'I', 'O', 'U', 'Y')
or substring(@original, @current, 2) = 'WH'
)
begin
if @strnext1 in ('A', 'E', 'I', 'O', 'U', 'Y')-- Wasserman should match Vasserman
begin
set @primary = @primary + 'A'
set @secondary = @secondary + 'F'
set @current = @current + 1
end
else
begin
set @primary = @primary + 'A'-- need Uomo to match Womo
set @secondary = @secondary + 'A'
set @current = @current + 1
end
end
else
if (@current = @Last -- Arnow should match Arnoff
and @strprev1 in ('A', 'E', 'I', 'O', 'U', 'Y')
)
or substring(@original, @current - 1, 5) in ('EWSKI','EWSKY','OWSKI','OWSKY')
or substring(@original, 0, 3) = 'SCH'
begin
set @secondary = @secondary + 'F'--set @primary = @primary + ''
set @current = @current + 1
end
else
if substring(@original, @current, 4) in ('WICZ','WITZ') -- polish e.g. 'filipowicz'
begin
set @primary = @primary + 'TS'
set @secondary = @secondary + 'FX'
set @current = @current + 4
end
else
set @current = @current + 1-- else skip it
end
else
if @strcur1 = 'X'
begin
if not (@current = @Last-- french e.g. breaux
and (substring(@original, @current - 3, 3) in ('IAU', 'EAU')
or substring(@original, @current - 2, 2) in ('AU', 'OU')
)
)
begin
set @primary = @primary + 'KS'
set @secondary = @secondary + 'KS'
end--else skip it
if @strnext1 in ('C','X')
set @current = @current + 2
else
set @current = @current + 1
end
else
if @strcur1 = 'Z'
begin
if (@strnext1 = 'Z')
set @current = @current + 2
else
begin
if (@strnext1 = 'H') -- chinese pinyin e.g. 'zhao'
begin
set @primary = @primary + 'J'
set @secondary = @secondary + 'J'
set @current = @current + 2
end
else
begin
if (substring(@original, @current + 1, 2) in ('ZO', 'ZI', 'ZA'))
or (@SlavoGermanic = 1
and (@current > 0
and @strprev1 <> 'T'
)
)
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'TS'
end
else
begin
set @primary = @primary + 'S'
set @secondary = @secondary + 'S'
end
end
set @current = @current + 1
end
end
else
set @current = @current + 1
end
return cast(@primary as char(5)) + cast(@secondary as char(5))
end
September 6, 2016 at 11:11 am
'ABC Lickers' is probably an entirely different business.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
September 6, 2016 at 11:23 am
-- just for fun
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
SELECT[dbo].[DoubleMetaPhone]('Phesant');
SELECT[dbo].[DoubleMetaPhone]('Fessant');
SELECT[dbo].[DoubleMetaPhone]('Fezant');
SELECT[dbo].[DoubleMetaPhone]('Phes ant');
SELECT[dbo].[DoubleMetaPhone]('Fez aunt');
quite impressive
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
September 6, 2016 at 11:32 am
J Livingston SQL (9/6/2016)
-- just for fun
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
SELECT[dbo].[DoubleMetaPhone]('Phesant');
SELECT[dbo].[DoubleMetaPhone]('Fessant');
SELECT[dbo].[DoubleMetaPhone]('Fezant');
SELECT[dbo].[DoubleMetaPhone]('Phes ant');
SELECT[dbo].[DoubleMetaPhone]('Fez aunt');
quite impressive
Yeah... Wish I could take credit for it (Google skills perhaps?), but my contribution was just a straight copy & paste from the interwebs.
September 6, 2016 at 11:33 am
J Livingston SQL (9/6/2016)
-- just for fun
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
SELECT[dbo].[DoubleMetaPhone]('Phesant');
SELECT[dbo].[DoubleMetaPhone]('Fessant');
SELECT[dbo].[DoubleMetaPhone]('Fezant');
SELECT[dbo].[DoubleMetaPhone]('Phes ant');
SELECT[dbo].[DoubleMetaPhone]('Fez aunt');
quite impressive
Still there is the issue that two words sounding alike are completely different.
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
FSNT FSNT
SELECT[dbo].[DoubleMetaPhone]('Fussy Aunt');
FSNT FSNT
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
September 6, 2016 at 11:49 am
Eric M Russell (9/6/2016)
J Livingston SQL (9/6/2016)
-- just for fun
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
SELECT[dbo].[DoubleMetaPhone]('Phesant');
SELECT[dbo].[DoubleMetaPhone]('Fessant');
SELECT[dbo].[DoubleMetaPhone]('Fezant');
SELECT[dbo].[DoubleMetaPhone]('Phes ant');
SELECT[dbo].[DoubleMetaPhone]('Fez aunt');
quite impressive
Still there is the issue that two words sounding alike are completely different.
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
FSNT FSNT
SELECT[dbo].[DoubleMetaPhone]('Fussy Aunt');
FSNT FSNT
It's been my experience that there is no perfect solution to this kind of problem... At least nothing SQL based that I'm aware of. Google's "Did you mean" is probably the best but I don't have a clue what they're using or how they're using it.
The last big project I did like this, was to attempt to match up "common patients" coming from several different medical systems... I used a scoring method, where I compared every patient (name, ssn, address & phone) to every other patient (yes it was a brutal as it sounds) and assigned a "match quality score" for all of the possible combinations, discarded matches that scored below a given threshold and then used nested sets to identify & group the best quality matches. Management was far more impressed with it than I was at the end.
September 6, 2016 at 2:40 pm
Jason A. Long (9/6/2016)
Eric M Russell (9/6/2016)
J Livingston SQL (9/6/2016)
-- just for fun
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
SELECT[dbo].[DoubleMetaPhone]('Phesant');
SELECT[dbo].[DoubleMetaPhone]('Fessant');
SELECT[dbo].[DoubleMetaPhone]('Fezant');
SELECT[dbo].[DoubleMetaPhone]('Phes ant');
SELECT[dbo].[DoubleMetaPhone]('Fez aunt');
quite impressive
Still there is the issue that two words sounding alike are completely different.
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
FSNT FSNT
SELECT[dbo].[DoubleMetaPhone]('Fussy Aunt');
FSNT FSNT
It's been my experience that there is no perfect solution to this kind of problem... At least nothing SQL based that I'm aware of. Google's "Did you mean" is probably the best but I don't have a clue what they're using or how they're using it.
The last big project I did like this, was to attempt to match up "common patients" coming from several different medical systems... I used a scoring method, where I compared every patient (name, ssn, address & phone) to every other patient (yes it was a brutal as it sounds) and assigned a "match quality score" for all of the possible combinations, discarded matches that scored below a given threshold and then used nested sets to identify & group the best quality matches. Management was far more impressed with it than I was at the end.
That's almost an exact description of my previous job. I maintained an ETL process for what we called the Master Patient Index; matching up distinct patients across medical claims, enrollments, and healthcare utilization records. The total distinct population was about 40 million across 100+ providers and payors.
Are you familiar with issue where newborns are assigned the mother or father's first name by default and then re-named on a subsequent record?
Or how about a client decides one day to start sending files with truncated names, dates of birth, and SSN for "privacy reasons" but still expect the MPI to continue matching?
Yep, I've been there. The original implementation had one rule that attempted to match on DoubleMetaPhone but it resulted in too many false matches on similar names. It's common for members of the same household to have similar or identical first names (ie: Alexander and Alexandra, or John Smith Sr. and Jr.), which can be an issue if there is no identifier like SSN or birth date to combine with.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
September 6, 2016 at 2:46 pm
Jason A. Long (9/6/2016)
Eric M Russell (9/6/2016)
J Livingston SQL (9/6/2016)
-- just for fun
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
SELECT[dbo].[DoubleMetaPhone]('Phesant');
SELECT[dbo].[DoubleMetaPhone]('Fessant');
SELECT[dbo].[DoubleMetaPhone]('Fezant');
SELECT[dbo].[DoubleMetaPhone]('Phes ant');
SELECT[dbo].[DoubleMetaPhone]('Fez aunt');
quite impressive
Still there is the issue that two words sounding alike are completely different.
SELECT[dbo].[DoubleMetaPhone]('Pheasant');
FSNT FSNT
SELECT[dbo].[DoubleMetaPhone]('Fussy Aunt');
FSNT FSNT
It's been my experience that there is no perfect solution to this kind of problem... At least nothing SQL based that I'm aware of. Google's "Did you mean" is probably the best but I don't have a clue what they're using or how they're using it.
Google search for "Fez aunt"
Did you mean:
Fez font
Faz unt
Fez hint
Fez anti
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
September 6, 2016 at 6:38 pm
That's almost an exact description of my previous job. I maintained an ETL process for what we called the Master Patient Index; matching up distinct patients across medical claims, enrollments, and healthcare utilization records. The total distinct population was about 40 million across 100+ providers and payors.
Are you familiar with issue where newborns are assigned the mother or father's first name by default and then re-named on a subsequent record?
Or how about a client decides one day to start sending files with truncated names, dates of birth, and SSN for "privacy reasons" but still expect the MPI to continue matching?
Yep, I've been there. The original implementation had one rule that attempted to match on DoubleMetaPhone but it resulted in too many false matches on similar names. It's common for members of the same household to have similar or identical first names (ie: Alexander and Alexandra, or John Smith Sr. and Jr.), which can be an issue if there is no identifier like SSN or birth date to combine with.
Yea... Nothing as painful as that... Everything we do is related to workers comp. So no infants to deal with.
The biggest problem we face is the fact that the database was created by C# developers who had no clue what normalization is or how to implement it. I won't go into gruesome detail but I'm sure you can imagine...
September 6, 2016 at 6:47 pm
Google search for "Fez aunt"
Did you mean:
Fez font
Faz unt
Fez hint
Fez anti
Same results I got... At least it's consistent. 😉
Viewing 15 posts - 16 through 30 (of 30 total)
You must be logged in to reply to this topic. Login to reply