September 5, 2016 at 7:11 am
I have two tables joining based on Member ID and Member Name. My question is do we have any function like lookup in ssis
Ex:
Table1
Name
Dhana
Table2
Name
Dmana
If you can see above case like operator is failed to match because of spelling mistake even though 80% match in string. I'm aware that if I use ssis lookup we can achieve this.
Can we do same in SQL, Any help greatly appreciated
September 5, 2016 at 7:35 am
How would you treat the string if someone had missed a character?
For example, would "dhana" and "dana" be a 80% match, or a 20% match (or even 75%/25% depending on the side you're looking at)?
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
September 5, 2016 at 7:44 am
Hmm. I missed the point character length wont change.."Dhana" and "Dmana".
I don't know how to match. Its typo mistakes by user so have to handle in sql.
Why it will vary from 80% to 70% or 25%..I believe that string match is more than 80% for above scenario
or is there any alterative solution?
let me know if any questions.Thanks
September 5, 2016 at 7:52 am
koti.raavi (9/5/2016)
I have two tables joining based on Member ID and Member Name. My question is do we have any function like lookup in ssisEx:
Table1
Name
Dhana
Table2
Name
Dmana
If you can see above case like operator is failed to match because of spelling mistake even though 80% match in string. I'm aware that if I use ssis lookup we can achieve this.
Can we do same in SQL, Any help greatly appreciated
Not really. SOUNDEX() is the closest function, but I would not use it in a join, for the following reasons:
1) You will get false matches
2) You may get multiple matches
3) Performance on a large dataset will be awful
I'm sure that there's a good reason, but why can you not match purely on MemberId?
Also, why should the string lengths always be the same? If this problem came about as a result of human typing errors, there could be all sorts of mistakes in there, not only mistyped characters.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
September 5, 2016 at 8:06 am
Thanks Phil.
I will check how Soundex() will work in my case
Yes true, I'm not going to use join operator instead I will use Case statement
Member is not unique in my case and one member can have multiple names ..but all names are different and need to match on name
Yes, true agree with you. So far I didn't see string length vary, but there is change of vary of string length in future, that's the reason I'm looking for closest match.
I also got few scenarios like below
Table1:
Name
Kevin O' Brain
Elizabeth Kesler JR
Table2:
Name
Kevin O" Brain
Elizibeth Kesler
I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.
Much Appreciated Phil
Thank You,
Dhana
September 5, 2016 at 8:17 am
Do you use the name match together with matching on any other columns? Or is match on name only?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
September 5, 2016 at 8:21 am
Yeah i do match on both member Id and name as well.
My case member id is not unique
Thanks,
Dhana
September 5, 2016 at 8:24 am
I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.
Not specifically. I suggest that you write an in-line nested REPLACE to do this. Please do not consider using a scalar UDF; they perform horribly. Also, rather than removing all spaces, I suggest that you just remove multiple concurrent spaces. Here's an example. Any resemblance of this name to an actual person is unintentional and purely accidental π
DECLARE @Name VARCHAR(100);
SET @Name = 'Elizabeth Kesler JR and Walter "Jackass" O''Brien';
SELECT Name = @Name
, CleanedName = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Name, '''', ''), '"', ''), 'jr', ''), ' ', '<>'),
'><', ''), '<>', ' ');
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
September 5, 2016 at 8:28 am
Try this fuzzy-matching function:
/****** Object: UserDefinedFunction [dbo].[IF_Levenshtein02] Script Date: 27/01/2014 20:12:53 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- this will score around 10,000 word pairs per second on 2010 laptop technology
create FUNCTION [dbo].[IF_Levenshtein02]
(
@Reference VARCHAR(20), @Target VARCHAR(20)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
( -- output query
SELECT [Score %] = CASE
WHEN @Reference = @Target THEN CAST(100 AS NUMERIC(5,2))
WHEN 0 = 1 THEN CAST(100 AS NUMERIC(5,2))-- placeholder for any other shortcuts
ELSE
(SELECT
[Score %] = CAST(SUM(LetterScore)*100.0/MAX(WordLength*WordLength) AS NUMERIC(5,2))
FROM ( -- do
SELECT
seq = t1.n,
ref.Letter,
v.WordLength,
LetterScore = v.WordLength - ISNULL(MIN(tgt.n),v.WordLength)
FROM ( -- v
SELECT
Reference = LEFT(@Reference + REPLICATE('_',WordLength),WordLength),
Target = LEFT(@Target + REPLICATE('_',WordLength),WordLength),
WordLength = WordLength
FROM ( -- di
SELECT WordLength = MAX(WordLength)
FROM (VALUES (DATALENGTH(@Reference)),(DATALENGTH(@Target))) d (WordLength)
) di
) v
CROSS APPLY ( -- t1
SELECT TOP(WordLength) n
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
) t1
CROSS APPLY (SELECT Letter = SUBSTRING(Reference,t1.n,1)) ref
OUTER APPLY ( -- tgt
SELECT TOP(WordLength) n = ABS(t1.n - t2.n)
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
WHERE SUBSTRING(@Target,t2.n,1) = ref.Letter
) tgt
GROUP BY t1.n, ref.Letter, v.WordLength
) do
)
END
) -- output query
GO
-- Usage
SELECT *
FROM MyTable t1
INNER JOIN MyTable t2
on whatever, e.g. t2.ID < t1.ID
CROSS APPLY [dbo].[IF_Levenshtein02] (t1.String, t2.String) l
WHERE l.[Score %] >= 80
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 5, 2016 at 8:39 am
Phil Parkin (9/5/2016)
I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.
Not specifically. I suggest that you write an in-line nested REPLACE to do this. Please do not consider using a scalar UDF; they perform horribly. Also, rather than removing all spaces, I suggest that you just remove multiple concurrent spaces. Here's an example. Any resemblance of this name to an actual person is unintentional and purely accidental π
DECLARE @Name VARCHAR(100);
SET @Name = 'Elizabeth Kesler JR and Walter "Jackass" O''Brien';
SELECT Name = @Name
, CleanedName = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Name, '''', ''), '"', ''), 'jr', ''), ' ', '<>'),
'><', ''), '<>', ' ');
This would replace any real names with "jr" in though. Is this intended (sorry to play devils advocate)?
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
September 5, 2016 at 8:54 am
Thom A (9/5/2016)
Phil Parkin (9/5/2016)
I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.
Not specifically. I suggest that you write an in-line nested REPLACE to do this. Please do not consider using a scalar UDF; they perform horribly. Also, rather than removing all spaces, I suggest that you just remove multiple concurrent spaces. Here's an example. Any resemblance of this name to an actual person is unintentional and purely accidental π
DECLARE @Name VARCHAR(100);
SET @Name = 'Elizabeth Kesler JR and Walter "Jackass" O''Brien';
SELECT Name = @Name
, CleanedName = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Name, '''', ''), '"', ''), 'jr', ''), ' ', '<>'),
'><', ''), '<>', ' ');
This would replace any real names with "jr" in though. Is this intended (sorry to play devils advocate)?
I thought about that, and considered replacing ' JR' with ' ' (ie, adding a preceding space to the search). Then I realised that I could not think of a single name containing 'JR' (other than as a suffix) and decided it was an unnecessary refinement. But no doubt such names are possible in certain languages.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
September 5, 2016 at 9:01 am
Phil Parkin (9/5/2016)
Thom A (9/5/2016)
Phil Parkin (9/5/2016)
I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.
Not specifically. I suggest that you write an in-line nested REPLACE to do this. Please do not consider using a scalar UDF; they perform horribly. Also, rather than removing all spaces, I suggest that you just remove multiple concurrent spaces. Here's an example. Any resemblance of this name to an actual person is unintentional and purely accidental π
DECLARE @Name VARCHAR(100);
SET @Name = 'Elizabeth Kesler JR and Walter "Jackass" O''Brien';
SELECT Name = @Name
, CleanedName = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Name, '''', ''), '"', ''), 'jr', ''), ' ', '<>'),
'><', ''), '<>', ' ');
This would replace any real names with "jr" in though. Is this intended (sorry to play devils advocate)?
I thought about that, and considered replacing ' JR' with ' ' (ie, adding a preceding space to the search). Then I realised that I could not think of a single name containing 'JR' (other than as a suffix) and decided it was an unnecessary refinement. But no doubt such names are possible in certain languages.
I'll admit, I did do a search on our client database (as I was stumped to think of one as well), and did find some hits. But from an English speaking perspective, they did seem to be of international origin, so it depends on the origin of your clients more than anything.
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
September 5, 2016 at 11:45 am
koti.raavi (9/5/2016)
Yeah i do match on both member Id and name as well.My case member id is not unique
Thanks,
Dhana
We do something similar, matching on license numbers and names. If you have one or more matches on member ID, the following function will tell you if a string is within a typo of another string. A typo is defined as an extra character, an omitted character, an incorrect character, or two characters being transposed.
For this to perform well, your table should have an index with member ID as the primary column, otherwise queries using it will do a table scan. To use it, cross apply the function after your joins and add StringsClose = 'Y' to your WHERE clause.
The function looks long, but it runs fast. Let me know if you have any questions
C REATE FUNCTION [dbo].[itvf_StringsClose]
(
@String1 varchar(50)
,@String2 varchar(50)
)
RETURNS TABLE
AS
-- =================================================================================================
-- Author:Bob Hovious
-- Create date: 3/19/2008
-- Description:This function compares two Strings and deems them to be "close" if:
--1. One has one more character than the other, in any position
--2. One character has been mistyped
--3. Two characters are transposed
-- =================================================================================================
RETURN
(
WITH
-- remove all blanks from Strings
trimmedStrings AS (SELECT replace(@String1,' ','') AS String1
,replace(@String2,' ','') AS String2)
-- make sure that the longest String is always put in String1
-- if they are 2 or more apart they aren't close, so don't bother passing rows to the "stuffing cte
,basevalues AS
(SELECT CASE WHEN len(String1) >= len(String2) THEN String1 ELSE String2 END AS String1 -- long String 1
,CASE WHEN len(String1) >= len(String2) THEN String2 ELSE String1 END AS String2 -- short String 2
FROM trimmedStrings
WHERE abs(len(String1)-len(String2)) < 2
)
,tally1 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N1)
)
,tally2 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N2)
)
----------------------------------------------------------------------------------------
-- build sets of character Strings for "Like" comparisons and transposition comparisons
----------------------------------------------------------------------------------------
,comparisons AS (-- if they're equal, they're equal
SELECT TOP (1) 'Y' AS StringsClose
FROM basevalues
WHERE String1 = String2
UNION ALL
-- replace each character in String1 with a wildcard to handle single keystroke typos
-- whether wrong character typed or single character omitted
-- this works whether or not the length of String1 is greater than String2
SELECT TOP (1) 'Y' AS StringsClose
FROM basevalues
CROSS JOIN tally1 t1
WHERE 1 = 1
AND String2 LIKE stuff(String1,N1,1,'%')
AND N1 BETWEEN 2 and len(String1)
UNION ALL
-- where Strings are the same length, check for transposition
SELECT TOP (1) 'Y' AS Stringsclose
FROM basevalues
CROSS JOIN tally2 t2
WHERE 1 = 1
AND len(String1) = len(String2)
AND N2 < len(String1)
AND String2 = stuff(String1,N2,2,reverse(subString(String1,N2,2)))
)
SELECT isnull((SELECT TOP (1) StringsClose FROM comparisons),'N') AS StringsClose
)
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
September 6, 2016 at 1:50 am
The Dixie Flatline (9/5/2016)
koti.raavi (9/5/2016)
Yeah i do match on both member Id and name as well.My case member id is not unique
Thanks,
Dhana
We do something similar, matching on license numbers and names. If you have one or more matches on member ID, the following function will tell you if a string is within a typo of another string. A typo is defined as an extra character, an omitted character, an incorrect character, or two characters being transposed.
For this to perform well, your table should have an index with member ID as the primary column, otherwise queries using it will do a table scan. To use it, cross apply the function after your joins and add StringsClose = 'Y' to your WHERE clause.
The function looks long, but it runs fast. Let me know if you have any questions
C REATE FUNCTION [dbo].[itvf_StringsClose]
(
@String1 varchar(50)
,@String2 varchar(50)
)
RETURNS TABLE
AS
-- =================================================================================================
-- Author:Bob Hovious
-- Create date: 3/19/2008
-- Description:This function compares two Strings and deems them to be "close" if:
--1. One has one more character than the other, in any position
--2. One character has been mistyped
--3. Two characters are transposed
-- =================================================================================================
RETURN
(
WITH
-- remove all blanks from Strings
trimmedStrings AS (SELECT replace(@String1,' ','') AS String1
,replace(@String2,' ','') AS String2)
-- make sure that the longest String is always put in String1
-- if they are 2 or more apart they aren't close, so don't bother passing rows to the "stuffing cte
,basevalues AS
(SELECT CASE WHEN len(String1) >= len(String2) THEN String1 ELSE String2 END AS String1 -- long String 1
,CASE WHEN len(String1) >= len(String2) THEN String2 ELSE String1 END AS String2 -- short String 2
FROM trimmedStrings
WHERE abs(len(String1)-len(String2)) < 2
)
,tally1 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N1)
)
,tally2 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N2)
)
----------------------------------------------------------------------------------------
-- build sets of character Strings for "Like" comparisons and transposition comparisons
----------------------------------------------------------------------------------------
,comparisons AS (-- if they're equal, they're equal
SELECT TOP (1) 'Y' AS StringsClose
FROM basevalues
WHERE String1 = String2
UNION ALL
-- replace each character in String1 with a wildcard to handle single keystroke typos
-- whether wrong character typed or single character omitted
-- this works whether or not the length of String1 is greater than String2
SELECT TOP (1) 'Y' AS StringsClose
FROM basevalues
CROSS JOIN tally1 t1
WHERE 1 = 1
AND String2 LIKE stuff(String1,N1,1,'%')
AND N1 BETWEEN 2 and len(String1)
UNION ALL
-- where Strings are the same length, check for transposition
SELECT TOP (1) 'Y' AS Stringsclose
FROM basevalues
CROSS JOIN tally2 t2
WHERE 1 = 1
AND len(String1) = len(String2)
AND N2 < len(String1)
AND String2 = stuff(String1,N2,2,reverse(subString(String1,N2,2)))
)
SELECT isnull((SELECT TOP (1) StringsClose FROM comparisons),'N') AS StringsClose
)
Very sneaky π
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 6, 2016 at 3:18 am
The Dixie Flatline (9/5/2016)
koti.raavi (9/5/2016)
Yeah i do match on both member Id and name as well.My case member id is not unique
Thanks,
Dhana
We do something similar, matching on license numbers and names. If you have one or more matches on member ID, the following function will tell you if a string is within a typo of another string. A typo is defined as an extra character, an omitted character, an incorrect character, or two characters being transposed.
For this to perform well, your table should have an index with member ID as the primary column, otherwise queries using it will do a table scan. To use it, cross apply the function after your joins and add StringsClose = 'Y' to your WHERE clause.
The function looks long, but it runs fast. Let me know if you have any questions
C REATE FUNCTION [dbo].[itvf_StringsClose]
(
@String1 varchar(50)
,@String2 varchar(50)
)
RETURNS TABLE
AS
-- =================================================================================================
-- Author:Bob Hovious
-- Create date: 3/19/2008
-- Description:This function compares two Strings and deems them to be "close" if:
--1. One has one more character than the other, in any position
--2. One character has been mistyped
--3. Two characters are transposed
-- =================================================================================================
RETURN
(
WITH
-- remove all blanks from Strings
trimmedStrings AS (SELECT replace(@String1,' ','') AS String1
,replace(@String2,' ','') AS String2)
-- make sure that the longest String is always put in String1
-- if they are 2 or more apart they aren't close, so don't bother passing rows to the "stuffing cte
,basevalues AS
(SELECT CASE WHEN len(String1) >= len(String2) THEN String1 ELSE String2 END AS String1 -- long String 1
,CASE WHEN len(String1) >= len(String2) THEN String2 ELSE String1 END AS String2 -- short String 2
FROM trimmedStrings
WHERE abs(len(String1)-len(String2)) < 2
)
,tally1 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N1)
)
,tally2 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N2)
)
----------------------------------------------------------------------------------------
-- build sets of character Strings for "Like" comparisons and transposition comparisons
----------------------------------------------------------------------------------------
,comparisons AS (-- if they're equal, they're equal
SELECT TOP (1) 'Y' AS StringsClose
FROM basevalues
WHERE String1 = String2
UNION ALL
-- replace each character in String1 with a wildcard to handle single keystroke typos
-- whether wrong character typed or single character omitted
-- this works whether or not the length of String1 is greater than String2
SELECT TOP (1) 'Y' AS StringsClose
FROM basevalues
CROSS JOIN tally1 t1
WHERE 1 = 1
AND String2 LIKE stuff(String1,N1,1,'%')
AND N1 BETWEEN 2 and len(String1)
UNION ALL
-- where Strings are the same length, check for transposition
SELECT TOP (1) 'Y' AS Stringsclose
FROM basevalues
CROSS JOIN tally2 t2
WHERE 1 = 1
AND len(String1) = len(String2)
AND N2 < len(String1)
AND String2 = stuff(String1,N2,2,reverse(subString(String1,N2,2)))
)
SELECT isnull((SELECT TOP (1) StringsClose FROM comparisons),'N') AS StringsClose
)
I am going to save this. Thanks!
Viewing 15 posts - 1 through 15 (of 15 total)
You must be logged in to reply to this topic. Login to reply