String Match Help

  • I have two tables joining based on Member ID and Member Name. My question is do we have any function like lookup in ssis

    Ex:

    Table1

    Name

    Dhana

    Table2

    Name

    Dmana

    If you can see above case like operator is failed to match because of spelling mistake even though 80% match in string. I'm aware that if I use ssis lookup we can achieve this.

    Can we do same in SQL, Any help greatly appreciated

  • How would you treat the string if someone had missed a character?

    For example, would "dhana" and "dana" be a 80% match, or a 20% match (or even 75%/25% depending on the side you're looking at)?

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Hmm. I missed the point character length wont change.."Dhana" and "Dmana".

    I don't know how to match. Its typo mistakes by user so have to handle in sql.

    Why it will vary from 80% to 70% or 25%..I believe that string match is more than 80% for above scenario

    or is there any alterative solution?

    let me know if any questions.Thanks

  • koti.raavi (9/5/2016)


    I have two tables joining based on Member ID and Member Name. My question is do we have any function like lookup in ssis

    Ex:

    Table1

    Name

    Dhana

    Table2

    Name

    Dmana

    If you can see above case like operator is failed to match because of spelling mistake even though 80% match in string. I'm aware that if I use ssis lookup we can achieve this.

    Can we do same in SQL, Any help greatly appreciated

    Not really. SOUNDEX() is the closest function, but I would not use it in a join, for the following reasons:

    1) You will get false matches

    2) You may get multiple matches

    3) Performance on a large dataset will be awful

    I'm sure that there's a good reason, but why can you not match purely on MemberId?

    Also, why should the string lengths always be the same? If this problem came about as a result of human typing errors, there could be all sorts of mistakes in there, not only mistyped characters.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Thanks Phil.

    I will check how Soundex() will work in my case

    Yes true, I'm not going to use join operator instead I will use Case statement

    Member is not unique in my case and one member can have multiple names ..but all names are different and need to match on name

    Yes, true agree with you. So far I didn't see string length vary, but there is change of vary of string length in future, that's the reason I'm looking for closest match.

    I also got few scenarios like below

    Table1:

    Name

    Kevin O' Brain

    Elizabeth Kesler JR

    Table2:

    Name

    Kevin O" Brain

    Elizibeth Kesler

    I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.

    Much Appreciated Phil

    Thank You,

    Dhana

  • Do you use the name match together with matching on any other columns? Or is match on name only?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Yeah i do match on both member Id and name as well.

    My case member id is not unique

    Thanks,

    Dhana

  • I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.

    Not specifically. I suggest that you write an in-line nested REPLACE to do this. Please do not consider using a scalar UDF; they perform horribly. Also, rather than removing all spaces, I suggest that you just remove multiple concurrent spaces. Here's an example. Any resemblance of this name to an actual person is unintentional and purely accidental πŸ™‚

    DECLARE @Name VARCHAR(100);

    SET @Name = 'Elizabeth Kesler JR and Walter "Jackass" O''Brien';

    SELECT Name = @Name

    , CleanedName = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Name, '''', ''), '"', ''), 'jr', ''), ' ', '<>'),

    '><', ''), '<>', ' ');

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Try this fuzzy-matching function:

    /****** Object: UserDefinedFunction [dbo].[IF_Levenshtein02] Script Date: 27/01/2014 20:12:53 ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    -- this will score around 10,000 word pairs per second on 2010 laptop technology

    create FUNCTION [dbo].[IF_Levenshtein02]

    (

    @Reference VARCHAR(20), @Target VARCHAR(20)

    )

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    ( -- output query

    SELECT [Score %] = CASE

    WHEN @Reference = @Target THEN CAST(100 AS NUMERIC(5,2))

    WHEN 0 = 1 THEN CAST(100 AS NUMERIC(5,2))-- placeholder for any other shortcuts

    ELSE

    (SELECT

    [Score %] = CAST(SUM(LetterScore)*100.0/MAX(WordLength*WordLength) AS NUMERIC(5,2))

    FROM ( -- do

    SELECT

    seq = t1.n,

    ref.Letter,

    v.WordLength,

    LetterScore = v.WordLength - ISNULL(MIN(tgt.n),v.WordLength)

    FROM ( -- v

    SELECT

    Reference = LEFT(@Reference + REPLICATE('_',WordLength),WordLength),

    Target = LEFT(@Target + REPLICATE('_',WordLength),WordLength),

    WordLength = WordLength

    FROM ( -- di

    SELECT WordLength = MAX(WordLength)

    FROM (VALUES (DATALENGTH(@Reference)),(DATALENGTH(@Target))) d (WordLength)

    ) di

    ) v

    CROSS APPLY ( -- t1

    SELECT TOP(WordLength) n

    FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)

    ) t1

    CROSS APPLY (SELECT Letter = SUBSTRING(Reference,t1.n,1)) ref

    OUTER APPLY ( -- tgt

    SELECT TOP(WordLength) n = ABS(t1.n - t2.n)

    FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)

    WHERE SUBSTRING(@Target,t2.n,1) = ref.Letter

    ) tgt

    GROUP BY t1.n, ref.Letter, v.WordLength

    ) do

    )

    END

    ) -- output query

    GO

    -- Usage

    SELECT *

    FROM MyTable t1

    INNER JOIN MyTable t2

    on whatever, e.g. t2.ID < t1.ID

    CROSS APPLY [dbo].[IF_Levenshtein02] (t1.String, t2.String) l

    WHERE l.[Score %] >= 80

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Phil Parkin (9/5/2016)


    I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.

    Not specifically. I suggest that you write an in-line nested REPLACE to do this. Please do not consider using a scalar UDF; they perform horribly. Also, rather than removing all spaces, I suggest that you just remove multiple concurrent spaces. Here's an example. Any resemblance of this name to an actual person is unintentional and purely accidental πŸ™‚

    DECLARE @Name VARCHAR(100);

    SET @Name = 'Elizabeth Kesler JR and Walter "Jackass" O''Brien';

    SELECT Name = @Name

    , CleanedName = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Name, '''', ''), '"', ''), 'jr', ''), ' ', '<>'),

    '><', ''), '<>', ' ');

    This would replace any real names with "jr" in though. Is this intended (sorry to play devils advocate)?

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Thom A (9/5/2016)


    Phil Parkin (9/5/2016)


    I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.

    Not specifically. I suggest that you write an in-line nested REPLACE to do this. Please do not consider using a scalar UDF; they perform horribly. Also, rather than removing all spaces, I suggest that you just remove multiple concurrent spaces. Here's an example. Any resemblance of this name to an actual person is unintentional and purely accidental πŸ™‚

    DECLARE @Name VARCHAR(100);

    SET @Name = 'Elizabeth Kesler JR and Walter "Jackass" O''Brien';

    SELECT Name = @Name

    , CleanedName = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Name, '''', ''), '"', ''), 'jr', ''), ' ', '<>'),

    '><', ''), '<>', ' ');

    This would replace any real names with "jr" in though. Is this intended (sorry to play devils advocate)?

    I thought about that, and considered replacing ' JR' with ' ' (ie, adding a preceding space to the search). Then I realised that I could not think of a single name containing 'JR' (other than as a suffix) and decided it was an unnecessary refinement. But no doubt such names are possible in certain languages.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin (9/5/2016)


    Thom A (9/5/2016)


    Phil Parkin (9/5/2016)


    I'm planning to use function to remove spaces, quotes and "JR" , is there any function already exists?.so I can reuse same else I will start fresh one.

    Not specifically. I suggest that you write an in-line nested REPLACE to do this. Please do not consider using a scalar UDF; they perform horribly. Also, rather than removing all spaces, I suggest that you just remove multiple concurrent spaces. Here's an example. Any resemblance of this name to an actual person is unintentional and purely accidental πŸ™‚

    DECLARE @Name VARCHAR(100);

    SET @Name = 'Elizabeth Kesler JR and Walter "Jackass" O''Brien';

    SELECT Name = @Name

    , CleanedName = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Name, '''', ''), '"', ''), 'jr', ''), ' ', '<>'),

    '><', ''), '<>', ' ');

    This would replace any real names with "jr" in though. Is this intended (sorry to play devils advocate)?

    I thought about that, and considered replacing ' JR' with ' ' (ie, adding a preceding space to the search). Then I realised that I could not think of a single name containing 'JR' (other than as a suffix) and decided it was an unnecessary refinement. But no doubt such names are possible in certain languages.

    I'll admit, I did do a search on our client database (as I was stumped to think of one as well), and did find some hits. But from an English speaking perspective, they did seem to be of international origin, so it depends on the origin of your clients more than anything.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • koti.raavi (9/5/2016)


    Yeah i do match on both member Id and name as well.

    My case member id is not unique

    Thanks,

    Dhana

    We do something similar, matching on license numbers and names. If you have one or more matches on member ID, the following function will tell you if a string is within a typo of another string. A typo is defined as an extra character, an omitted character, an incorrect character, or two characters being transposed.

    For this to perform well, your table should have an index with member ID as the primary column, otherwise queries using it will do a table scan. To use it, cross apply the function after your joins and add StringsClose = 'Y' to your WHERE clause.

    The function looks long, but it runs fast. Let me know if you have any questions

    C REATE FUNCTION [dbo].[itvf_StringsClose]

    (

    @String1 varchar(50)

    ,@String2 varchar(50)

    )

    RETURNS TABLE

    AS

    -- =================================================================================================

    -- Author:Bob Hovious

    -- Create date: 3/19/2008

    -- Description:This function compares two Strings and deems them to be "close" if:

    --1. One has one more character than the other, in any position

    --2. One character has been mistyped

    --3. Two characters are transposed

    -- =================================================================================================

    RETURN

    (

    WITH

    -- remove all blanks from Strings

    trimmedStrings AS (SELECT replace(@String1,' ','') AS String1

    ,replace(@String2,' ','') AS String2)

    -- make sure that the longest String is always put in String1

    -- if they are 2 or more apart they aren't close, so don't bother passing rows to the "stuffing cte

    ,basevalues AS

    (SELECT CASE WHEN len(String1) >= len(String2) THEN String1 ELSE String2 END AS String1 -- long String 1

    ,CASE WHEN len(String1) >= len(String2) THEN String2 ELSE String1 END AS String2 -- short String 2

    FROM trimmedStrings

    WHERE abs(len(String1)-len(String2)) < 2

    )

    ,tally1 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),

    (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),

    (21),(22),(23),(24),(25),(26),(27),(28),(29),(30),

    (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),

    (41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N1)

    )

    ,tally2 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),

    (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),

    (21),(22),(23),(24),(25),(26),(27),(28),(29),(30),

    (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),

    (41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N2)

    )

    ----------------------------------------------------------------------------------------

    -- build sets of character Strings for "Like" comparisons and transposition comparisons

    ----------------------------------------------------------------------------------------

    ,comparisons AS (-- if they're equal, they're equal

    SELECT TOP (1) 'Y' AS StringsClose

    FROM basevalues

    WHERE String1 = String2

    UNION ALL

    -- replace each character in String1 with a wildcard to handle single keystroke typos

    -- whether wrong character typed or single character omitted

    -- this works whether or not the length of String1 is greater than String2

    SELECT TOP (1) 'Y' AS StringsClose

    FROM basevalues

    CROSS JOIN tally1 t1

    WHERE 1 = 1

    AND String2 LIKE stuff(String1,N1,1,'%')

    AND N1 BETWEEN 2 and len(String1)

    UNION ALL

    -- where Strings are the same length, check for transposition

    SELECT TOP (1) 'Y' AS Stringsclose

    FROM basevalues

    CROSS JOIN tally2 t2

    WHERE 1 = 1

    AND len(String1) = len(String2)

    AND N2 < len(String1)

    AND String2 = stuff(String1,N2,2,reverse(subString(String1,N2,2)))

    )

    SELECT isnull((SELECT TOP (1) StringsClose FROM comparisons),'N') AS StringsClose

    )

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (9/5/2016)


    koti.raavi (9/5/2016)


    Yeah i do match on both member Id and name as well.

    My case member id is not unique

    Thanks,

    Dhana

    We do something similar, matching on license numbers and names. If you have one or more matches on member ID, the following function will tell you if a string is within a typo of another string. A typo is defined as an extra character, an omitted character, an incorrect character, or two characters being transposed.

    For this to perform well, your table should have an index with member ID as the primary column, otherwise queries using it will do a table scan. To use it, cross apply the function after your joins and add StringsClose = 'Y' to your WHERE clause.

    The function looks long, but it runs fast. Let me know if you have any questions

    C REATE FUNCTION [dbo].[itvf_StringsClose]

    (

    @String1 varchar(50)

    ,@String2 varchar(50)

    )

    RETURNS TABLE

    AS

    -- =================================================================================================

    -- Author:Bob Hovious

    -- Create date: 3/19/2008

    -- Description:This function compares two Strings and deems them to be "close" if:

    --1. One has one more character than the other, in any position

    --2. One character has been mistyped

    --3. Two characters are transposed

    -- =================================================================================================

    RETURN

    (

    WITH

    -- remove all blanks from Strings

    trimmedStrings AS (SELECT replace(@String1,' ','') AS String1

    ,replace(@String2,' ','') AS String2)

    -- make sure that the longest String is always put in String1

    -- if they are 2 or more apart they aren't close, so don't bother passing rows to the "stuffing cte

    ,basevalues AS

    (SELECT CASE WHEN len(String1) >= len(String2) THEN String1 ELSE String2 END AS String1 -- long String 1

    ,CASE WHEN len(String1) >= len(String2) THEN String2 ELSE String1 END AS String2 -- short String 2

    FROM trimmedStrings

    WHERE abs(len(String1)-len(String2)) < 2

    )

    ,tally1 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),

    (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),

    (21),(22),(23),(24),(25),(26),(27),(28),(29),(30),

    (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),

    (41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N1)

    )

    ,tally2 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),

    (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),

    (21),(22),(23),(24),(25),(26),(27),(28),(29),(30),

    (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),

    (41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N2)

    )

    ----------------------------------------------------------------------------------------

    -- build sets of character Strings for "Like" comparisons and transposition comparisons

    ----------------------------------------------------------------------------------------

    ,comparisons AS (-- if they're equal, they're equal

    SELECT TOP (1) 'Y' AS StringsClose

    FROM basevalues

    WHERE String1 = String2

    UNION ALL

    -- replace each character in String1 with a wildcard to handle single keystroke typos

    -- whether wrong character typed or single character omitted

    -- this works whether or not the length of String1 is greater than String2

    SELECT TOP (1) 'Y' AS StringsClose

    FROM basevalues

    CROSS JOIN tally1 t1

    WHERE 1 = 1

    AND String2 LIKE stuff(String1,N1,1,'%')

    AND N1 BETWEEN 2 and len(String1)

    UNION ALL

    -- where Strings are the same length, check for transposition

    SELECT TOP (1) 'Y' AS Stringsclose

    FROM basevalues

    CROSS JOIN tally2 t2

    WHERE 1 = 1

    AND len(String1) = len(String2)

    AND N2 < len(String1)

    AND String2 = stuff(String1,N2,2,reverse(subString(String1,N2,2)))

    )

    SELECT isnull((SELECT TOP (1) StringsClose FROM comparisons),'N') AS StringsClose

    )

    Very sneaky 😎

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • The Dixie Flatline (9/5/2016)


    koti.raavi (9/5/2016)


    Yeah i do match on both member Id and name as well.

    My case member id is not unique

    Thanks,

    Dhana

    We do something similar, matching on license numbers and names. If you have one or more matches on member ID, the following function will tell you if a string is within a typo of another string. A typo is defined as an extra character, an omitted character, an incorrect character, or two characters being transposed.

    For this to perform well, your table should have an index with member ID as the primary column, otherwise queries using it will do a table scan. To use it, cross apply the function after your joins and add StringsClose = 'Y' to your WHERE clause.

    The function looks long, but it runs fast. Let me know if you have any questions

    C REATE FUNCTION [dbo].[itvf_StringsClose]

    (

    @String1 varchar(50)

    ,@String2 varchar(50)

    )

    RETURNS TABLE

    AS

    -- =================================================================================================

    -- Author:Bob Hovious

    -- Create date: 3/19/2008

    -- Description:This function compares two Strings and deems them to be "close" if:

    --1. One has one more character than the other, in any position

    --2. One character has been mistyped

    --3. Two characters are transposed

    -- =================================================================================================

    RETURN

    (

    WITH

    -- remove all blanks from Strings

    trimmedStrings AS (SELECT replace(@String1,' ','') AS String1

    ,replace(@String2,' ','') AS String2)

    -- make sure that the longest String is always put in String1

    -- if they are 2 or more apart they aren't close, so don't bother passing rows to the "stuffing cte

    ,basevalues AS

    (SELECT CASE WHEN len(String1) >= len(String2) THEN String1 ELSE String2 END AS String1 -- long String 1

    ,CASE WHEN len(String1) >= len(String2) THEN String2 ELSE String1 END AS String2 -- short String 2

    FROM trimmedStrings

    WHERE abs(len(String1)-len(String2)) < 2

    )

    ,tally1 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),

    (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),

    (21),(22),(23),(24),(25),(26),(27),(28),(29),(30),

    (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),

    (41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N1)

    )

    ,tally2 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),

    (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),

    (21),(22),(23),(24),(25),(26),(27),(28),(29),(30),

    (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),

    (41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N2)

    )

    ----------------------------------------------------------------------------------------

    -- build sets of character Strings for "Like" comparisons and transposition comparisons

    ----------------------------------------------------------------------------------------

    ,comparisons AS (-- if they're equal, they're equal

    SELECT TOP (1) 'Y' AS StringsClose

    FROM basevalues

    WHERE String1 = String2

    UNION ALL

    -- replace each character in String1 with a wildcard to handle single keystroke typos

    -- whether wrong character typed or single character omitted

    -- this works whether or not the length of String1 is greater than String2

    SELECT TOP (1) 'Y' AS StringsClose

    FROM basevalues

    CROSS JOIN tally1 t1

    WHERE 1 = 1

    AND String2 LIKE stuff(String1,N1,1,'%')

    AND N1 BETWEEN 2 and len(String1)

    UNION ALL

    -- where Strings are the same length, check for transposition

    SELECT TOP (1) 'Y' AS Stringsclose

    FROM basevalues

    CROSS JOIN tally2 t2

    WHERE 1 = 1

    AND len(String1) = len(String2)

    AND N2 < len(String1)

    AND String2 = stuff(String1,N2,2,reverse(subString(String1,N2,2)))

    )

    SELECT isnull((SELECT TOP (1) StringsClose FROM comparisons),'N') AS StringsClose

    )

    I am going to save this. Thanks!

    ----------------------------------------------------------------------------------------------------------------------------------------------------
    Roshan Joe

    Jeff Moden -Forum Etiquette: How to post data/code on a forum to get the best help[/url]

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply