String Match Help

Question

String Match Help

koti.raavi

SSCrazy

Points: 2756
More actions
September 5, 2016 at 7:11 am

#312849

I have two tables joining based on Member ID and Member Name. My question is do we have any function like lookup in ssis
Ex:
Table1
Name
Dhana
Table2
Name
Dmana
If you can see above case like operator is failed to match because of spelling mistake even though 80% match in string. I'm aware that if I use ssis lookup we can achieve this.
Can we do same in SQL, Any help greatly appreciated

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply

Thom A SSC Guru Points: 99134 More actions · Answer 1

How would you treat the string if someone had missed a character?

For example, would "dhana" and "dana" be a 80% match, or a 20% match (or even 75%/25% depending on the side you're looking at)?

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

koti.raavi SSCrazy Points: 2756 More actions · Answer 2

Hmm. I missed the point character length wont change.."Dhana" and "Dmana".

I don't know how to match. Its typo mistakes by user so have to handle in sql.

Why it will vary from 80% to 70% or 25%..I believe that string match is more than 80% for above scenario

or is there any alterative solution?

let me know if any questions.Thanks

Phil Parkin SSC Guru Points: 246993 More actions · Answer 3

koti.raavi (9/5/2016)
I have two tables joining based on Member ID and Member Name. My question is do we have any function like lookup in ssis
Ex:
Table1
Name
Dhana
Table2
Name
Dmana
If you can see above case like operator is failed to match because of spelling mistake even though 80% match in string. I'm aware that if I use ssis lookup we can achieve this.
Can we do same in SQL, Any help greatly appreciated

Not really. SOUNDEX() is the closest function, but I would not use it in a join, for the following reasons:

1) You will get false matches

2) You may get multiple matches

3) Performance on a large dataset will be awful

I'm sure that there's a good reason, but why can you not match purely on MemberId?

Also, why should the string lengths always be the same? If this problem came about as a result of human typing errors, there could be all sorts of mistakes in there, not only mistyped characters.

The absence of evidence is not evidence of absence.
Martin Rees

You can lead a horse to water, but a pencil must be lead.
Stan Laurel

koti.raavi SSCrazy Points: 2756 More actions · Answer 4

Thanks Phil.

I will check how Soundex() will work in my case

Yes true, I'm not going to use join operator instead I will use Case statement

Member is not unique in my case and one member can have multiple names ..but all names are different and need to match on name

Yes, true agree with you. So far I didn't see string length vary, but there is change of vary of string length in future, that's the reason I'm looking for closest match.

I also got few scenarios like below

Table1:

Name

Kevin O' Brain

Elizabeth Kesler JR

Table2:

Name

Kevin O" Brain

Elizibeth Kesler

I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.

Much Appreciated Phil

Thank You,

Dhana

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 5

Do you use the name match together with matching on any other columns? Or is match on name only?

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

koti.raavi SSCrazy Points: 2756 More actions · Answer 6

Yeah i do match on both member Id and name as well.

My case member id is not unique

Thanks,

Dhana

Phil Parkin SSC Guru Points: 246993 More actions · Answer 7

I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.

Not specifically. I suggest that you write an in-line nested REPLACE to do this. Please do not consider using a scalar UDF; they perform horribly. Also, rather than removing all spaces, I suggest that you just remove multiple concurrent spaces. Here's an example. Any resemblance of this name to an actual person is unintentional and purely accidental 🙂

DECLARE @Name VARCHAR(100);

SET @Name = 'Elizabeth Kesler JR and Walter "Jackass" O''Brien';

SELECT Name = @Name

, CleanedName = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Name, '''', ''), '"', ''), 'jr', ''), ' ', '<>'),

'><', ''), '<>', ' ');

The absence of evidence is not evidence of absence.
Martin Rees

You can lead a horse to water, but a pencil must be lead.
Stan Laurel

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 8

Try this fuzzy-matching function:

/****** Object: UserDefinedFunction [dbo].[IF_Levenshtein02] Script Date: 27/01/2014 20:12:53 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

-- this will score around 10,000 word pairs per second on 2010 laptop technology

create FUNCTION [dbo].[IF_Levenshtein02]

(

@Reference VARCHAR(20), @Target VARCHAR(20)

)

RETURNS TABLE WITH SCHEMABINDING AS

RETURN

( -- output query

SELECT [Score %] = CASE

WHEN @Reference = @Target THEN CAST(100 AS NUMERIC(5,2))

WHEN 0 = 1 THEN CAST(100 AS NUMERIC(5,2))-- placeholder for any other shortcuts

ELSE

(SELECT

[Score %] = CAST(SUM(LetterScore)*100.0/MAX(WordLength*WordLength) AS NUMERIC(5,2))

FROM ( -- do

SELECT

seq = t1.n,

ref.Letter,

v.WordLength,

LetterScore = v.WordLength - ISNULL(MIN(tgt.n),v.WordLength)

FROM ( -- v

SELECT

Reference = LEFT(@Reference + REPLICATE('_',WordLength),WordLength),

Target = LEFT(@Target + REPLICATE('_',WordLength),WordLength),

WordLength = WordLength

FROM ( -- di

SELECT WordLength = MAX(WordLength)

FROM (VALUES (DATALENGTH(@Reference)),(DATALENGTH(@Target))) d (WordLength)

) di

) v

CROSS APPLY ( -- t1

SELECT TOP(WordLength) n

FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)

) t1

CROSS APPLY (SELECT Letter = SUBSTRING(Reference,t1.n,1)) ref

OUTER APPLY ( -- tgt

SELECT TOP(WordLength) n = ABS(t1.n - t2.n)

FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)

WHERE SUBSTRING(@Target,t2.n,1) = ref.Letter

) tgt

GROUP BY t1.n, ref.Letter, v.WordLength

) do

)

END

) -- output query

GO

-- Usage

SELECT *

FROM MyTable t1

INNER JOIN MyTable t2

on whatever, e.g. t2.ID < t1.ID

CROSS APPLY [dbo].[IF_Levenshtein02] (t1.String, t2.String) l

WHERE l.[Score %] >= 80

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Thom A SSC Guru Points: 99134 More actions · Answer 9

Phil Parkin (9/5/2016)
I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.
Not specifically. I suggest that you write an in-line nested REPLACE to do this. Please do not consider using a scalar UDF; they perform horribly. Also, rather than removing all spaces, I suggest that you just remove multiple concurrent spaces. Here's an example. Any resemblance of this name to an actual person is unintentional and purely accidental 🙂
DECLARE @Name VARCHAR(100);
SET @Name = 'Elizabeth Kesler JR and Walter "Jackass" O''Brien';
SELECT Name = @Name
, CleanedName = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Name, '''', ''), '"', ''), 'jr', ''), ' ', '<>'),
'><', ''), '<>', ' ');

This would replace any real names with "jr" in though. Is this intended (sorry to play devils advocate)?

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

Phil Parkin SSC Guru Points: 246993 More actions · Answer 10

Thom A (9/5/2016)
Phil Parkin (9/5/2016)
I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.
Not specifically. I suggest that you write an in-line nested REPLACE to do this. Please do not consider using a scalar UDF; they perform horribly. Also, rather than removing all spaces, I suggest that you just remove multiple concurrent spaces. Here's an example. Any resemblance of this name to an actual person is unintentional and purely accidental 🙂
DECLARE @Name VARCHAR(100);
SET @Name = 'Elizabeth Kesler JR and Walter "Jackass" O''Brien';
SELECT Name = @Name
, CleanedName = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Name, '''', ''), '"', ''), 'jr', ''), ' ', '<>'),
'><', ''), '<>', ' ');
This would replace any real names with "jr" in though. Is this intended (sorry to play devils advocate)?

I thought about that, and considered replacing ' JR' with ' ' (ie, adding a preceding space to the search). Then I realised that I could not think of a single name containing 'JR' (other than as a suffix) and decided it was an unnecessary refinement. But no doubt such names are possible in certain languages.

The absence of evidence is not evidence of absence.
Martin Rees

You can lead a horse to water, but a pencil must be lead.
Stan Laurel

Thom A SSC Guru Points: 99134 More actions · Answer 11

Phil Parkin (9/5/2016)
Thom A (9/5/2016)
Phil Parkin (9/5/2016)
I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.
Not specifically. I suggest that you write an in-line nested REPLACE to do this. Please do not consider using a scalar UDF; they perform horribly. Also, rather than removing all spaces, I suggest that you just remove multiple concurrent spaces. Here's an example. Any resemblance of this name to an actual person is unintentional and purely accidental 🙂
DECLARE @Name VARCHAR(100);
SET @Name = 'Elizabeth Kesler JR and Walter "Jackass" O''Brien';
SELECT Name = @Name
, CleanedName = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Name, '''', ''), '"', ''), 'jr', ''), ' ', '<>'),
'><', ''), '<>', ' ');
This would replace any real names with "jr" in though. Is this intended (sorry to play devils advocate)?
I thought about that, and considered replacing ' JR' with ' ' (ie, adding a preceding space to the search). Then I realised that I could not think of a single name containing 'JR' (other than as a suffix) and decided it was an unnecessary refinement. But no doubt such names are possible in certain languages.

I'll admit, I did do a search on our client database (as I was stumped to think of one as well), and did find some hits. But from an English speaking perspective, they did seem to be of international origin, so it depends on the origin of your clients more than anything.

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 12

koti.raavi (9/5/2016)
Yeah i do match on both member Id and name as well.
My case member id is not unique
Thanks,
Dhana

We do something similar, matching on license numbers and names. If you have one or more matches on member ID, the following function will tell you if a string is within a typo of another string. A typo is defined as an extra character, an omitted character, an incorrect character, or two characters being transposed.

For this to perform well, your table should have an index with member ID as the primary column, otherwise queries using it will do a table scan. To use it, cross apply the function after your joins and add StringsClose = 'Y' to your WHERE clause.

The function looks long, but it runs fast. Let me know if you have any questions

C REATE FUNCTION [dbo].[itvf_StringsClose]

(

@String1 varchar(50)

,@String2 varchar(50)

)

RETURNS TABLE

AS

-- =================================================================================================

-- Author:Bob Hovious

-- Create date: 3/19/2008

-- Description:This function compares two Strings and deems them to be "close" if:

--1. One has one more character than the other, in any position

--2. One character has been mistyped

--3. Two characters are transposed

-- =================================================================================================

RETURN

(

WITH

-- remove all blanks from Strings

trimmedStrings AS (SELECT replace(@String1,' ','') AS String1

,replace(@String2,' ','') AS String2)

-- make sure that the longest String is always put in String1

-- if they are 2 or more apart they aren't close, so don't bother passing rows to the "stuffing cte

,basevalues AS

(SELECT CASE WHEN len(String1) >= len(String2) THEN String1 ELSE String2 END AS String1 -- long String 1

,CASE WHEN len(String1) >= len(String2) THEN String2 ELSE String1 END AS String2 -- short String 2

FROM trimmedStrings

WHERE abs(len(String1)-len(String2)) < 2

)

,tally1 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),

(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),

(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),

(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),

(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N1)

)

,tally2 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),

(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),

(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),

(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),

(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N2)

)

----------------------------------------------------------------------------------------

-- build sets of character Strings for "Like" comparisons and transposition comparisons

----------------------------------------------------------------------------------------

,comparisons AS (-- if they're equal, they're equal

SELECT TOP (1) 'Y' AS StringsClose

FROM basevalues

WHERE String1 = String2

UNION ALL

-- replace each character in String1 with a wildcard to handle single keystroke typos

-- whether wrong character typed or single character omitted

-- this works whether or not the length of String1 is greater than String2

SELECT TOP (1) 'Y' AS StringsClose

FROM basevalues

CROSS JOIN tally1 t1

WHERE 1 = 1

AND String2 LIKE stuff(String1,N1,1,'%')

AND N1 BETWEEN 2 and len(String1)

UNION ALL

-- where Strings are the same length, check for transposition

SELECT TOP (1) 'Y' AS Stringsclose

FROM basevalues

CROSS JOIN tally2 t2

WHERE 1 = 1

AND len(String1) = len(String2)

AND N2 < len(String1)

AND String2 = stuff(String1,N2,2,reverse(subString(String1,N2,2)))

)

SELECT isnull((SELECT TOP (1) StringsClose FROM comparisons),'N') AS StringsClose

)

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 13

The Dixie Flatline (9/5/2016)
koti.raavi (9/5/2016)
Yeah i do match on both member Id and name as well.
My case member id is not unique
Thanks,
Dhana
We do something similar, matching on license numbers and names. If you have one or more matches on member ID, the following function will tell you if a string is within a typo of another string. A typo is defined as an extra character, an omitted character, an incorrect character, or two characters being transposed.
For this to perform well, your table should have an index with member ID as the primary column, otherwise queries using it will do a table scan. To use it, cross apply the function after your joins and add StringsClose = 'Y' to your WHERE clause.
The function looks long, but it runs fast. Let me know if you have any questions
C REATE FUNCTION [dbo].[itvf_StringsClose]
(
@String1 varchar(50)
,@String2 varchar(50)
)
RETURNS TABLE
AS
-- =================================================================================================
-- Author:Bob Hovious
-- Create date: 3/19/2008
-- Description:This function compares two Strings and deems them to be "close" if:
--1. One has one more character than the other, in any position
--2. One character has been mistyped
--3. Two characters are transposed
-- =================================================================================================
RETURN
(
WITH
-- remove all blanks from Strings
trimmedStrings AS (SELECT replace(@String1,' ','') AS String1
,replace(@String2,' ','') AS String2)
-- make sure that the longest String is always put in String1
-- if they are 2 or more apart they aren't close, so don't bother passing rows to the "stuffing cte
,basevalues AS
(SELECT CASE WHEN len(String1) >= len(String2) THEN String1 ELSE String2 END AS String1 -- long String 1
,CASE WHEN len(String1) >= len(String2) THEN String2 ELSE String1 END AS String2 -- short String 2
FROM trimmedStrings
WHERE abs(len(String1)-len(String2)) < 2
)
,tally1 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N1)
)
,tally2 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N2)
)
----------------------------------------------------------------------------------------
-- build sets of character Strings for "Like" comparisons and transposition comparisons
----------------------------------------------------------------------------------------
,comparisons AS (-- if they're equal, they're equal
SELECT TOP (1) 'Y' AS StringsClose
FROM basevalues
WHERE String1 = String2
UNION ALL
-- replace each character in String1 with a wildcard to handle single keystroke typos
-- whether wrong character typed or single character omitted
-- this works whether or not the length of String1 is greater than String2
SELECT TOP (1) 'Y' AS StringsClose
FROM basevalues
CROSS JOIN tally1 t1
WHERE 1 = 1
AND String2 LIKE stuff(String1,N1,1,'%')
AND N1 BETWEEN 2 and len(String1)
UNION ALL
-- where Strings are the same length, check for transposition
SELECT TOP (1) 'Y' AS Stringsclose
FROM basevalues
CROSS JOIN tally2 t2
WHERE 1 = 1
AND len(String1) = len(String2)
AND N2 < len(String1)
AND String2 = stuff(String1,N2,2,reverse(subString(String1,N2,2)))
)
SELECT isnull((SELECT TOP (1) StringsClose FROM comparisons),'N') AS StringsClose
)

Very sneaky 😎

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

joeroshan SSChampion Points: 10377 More actions · Answer 14

The Dixie Flatline (9/5/2016)
koti.raavi (9/5/2016)
Yeah i do match on both member Id and name as well.
My case member id is not unique
Thanks,
Dhana
We do something similar, matching on license numbers and names. If you have one or more matches on member ID, the following function will tell you if a string is within a typo of another string. A typo is defined as an extra character, an omitted character, an incorrect character, or two characters being transposed.
For this to perform well, your table should have an index with member ID as the primary column, otherwise queries using it will do a table scan. To use it, cross apply the function after your joins and add StringsClose = 'Y' to your WHERE clause.
The function looks long, but it runs fast. Let me know if you have any questions
C REATE FUNCTION [dbo].[itvf_StringsClose]
(
@String1 varchar(50)
,@String2 varchar(50)
)
RETURNS TABLE
AS
-- =================================================================================================
-- Author:Bob Hovious
-- Create date: 3/19/2008
-- Description:This function compares two Strings and deems them to be "close" if:
--1. One has one more character than the other, in any position
--2. One character has been mistyped
--3. Two characters are transposed
-- =================================================================================================
RETURN
(
WITH
-- remove all blanks from Strings
trimmedStrings AS (SELECT replace(@String1,' ','') AS String1
,replace(@String2,' ','') AS String2)
-- make sure that the longest String is always put in String1
-- if they are 2 or more apart they aren't close, so don't bother passing rows to the "stuffing cte
,basevalues AS
(SELECT CASE WHEN len(String1) >= len(String2) THEN String1 ELSE String2 END AS String1 -- long String 1
,CASE WHEN len(String1) >= len(String2) THEN String2 ELSE String1 END AS String2 -- short String 2
FROM trimmedStrings
WHERE abs(len(String1)-len(String2)) < 2
)
,tally1 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N1)
)
,tally2 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N2)
)
----------------------------------------------------------------------------------------
-- build sets of character Strings for "Like" comparisons and transposition comparisons
----------------------------------------------------------------------------------------
,comparisons AS (-- if they're equal, they're equal
SELECT TOP (1) 'Y' AS StringsClose
FROM basevalues
WHERE String1 = String2
UNION ALL
-- replace each character in String1 with a wildcard to handle single keystroke typos
-- whether wrong character typed or single character omitted
-- this works whether or not the length of String1 is greater than String2
SELECT TOP (1) 'Y' AS StringsClose
FROM basevalues
CROSS JOIN tally1 t1
WHERE 1 = 1
AND String2 LIKE stuff(String1,N1,1,'%')
AND N1 BETWEEN 2 and len(String1)
UNION ALL
-- where Strings are the same length, check for transposition
SELECT TOP (1) 'Y' AS Stringsclose
FROM basevalues
CROSS JOIN tally2 t2
WHERE 1 = 1
AND len(String1) = len(String2)
AND N2 < len(String1)
AND String2 = stuff(String1,N2,2,reverse(subString(String1,N2,2)))
)
SELECT isnull((SELECT TOP (1) StringsClose FROM comparisons),'N') AS StringsClose
)

I am going to save this. Thanks!

----------------------------------------------------------------------------------------------------------------------------------------------------
Roshan Joe

Jeff Moden -Forum Etiquette: How to post data/code on a forum to get the best help[/url]