Need a function to compare strings.

  • I am having to deal with a lot of external data for which NO front end editing can be accomplished. Trust me, that is a fervent wish. I am trying to match this external data to our existing data, and will accept close matches. Here's an example:

    declare @sample1-2 table (name varchar(30), phone varchar(12))

    insert into @sample

    select 'Hovious','123-456-7890'

    My name gets typoed a LOT and some people have numeric dyslexia that might make them miskey the phone number. I need to be able to find this record when the name is an exact match and the phone number is close, OR when the phone number is an exact match and the name is close.

    Here is my working definition of "close"

    String A has only one character different than String B: Hovious/Havious

    or

    String A has one character more or less than String B: Hovious/Hovios

    or

    String A has two characters transposed in String B: Hovious/Hoviuos

    This may be used for large batch runs, so speed is a must. The existing function seems slow (32MS CPU to compare Hovious/Hoviouous), so I'm prepared to rewrite it from scratch. But before I do, I thought I'd ask if anyone out there already has a fast way to something similar to this?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Hi Bob!

    Before diving into a SQL parser... What about a CLR function? Would this be a possibility?

    Greets

    Flo

  • Google for: SSIS - Fuzzy Lookup


    * Noel

  • Unfortunately CLR isn't an option for me for the same reason that front end editing isn't. Long story.

    As for fuzzy lookup, I haven't looked at it for a while, but I thought it was basically a comparison of how many exact matches were found in columns, not characters. Did I overlook something?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Are you able to compare on a table or do you have to match each single text?

    How many data will come within one import action?

    Greets

    Flo

  • The normal application will be a comparison between two tables, with a table of several thousand being compared to a table of 75,000 and growing.

    I am coding as we chat about this, but any good ideas would still be welcomed. Thanks, Flo.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob Hovious (3/18/2009)


    I am having to deal with a lot of external data for which NO front end editing can be accomplished. Trust me, that is a fervent wish. I am trying to match this external data to our existing data, and will accept close matches. Here's an example:

    declare @sample1-2 table (name varchar(30), phone varchar(12))

    insert into @sample

    select 'Hovious','123-456-7890'

    My name gets typoed a LOT and some people have numeric dyslexia that might make them miskey the phone number. I need to be able to find this record when the name is an exact match and the phone number is close, OR when the phone number is an exact match and the name is close.

    Here is my working definition of "close"

    String A has only one character different than String B: Hovious/Havious

    or

    String A has one character more or less than String B: Hovious/Hovios

    or

    String A has two characters transposed in String B: Hovious/Hoviuos

    This may be used for large batch runs, so speed is a must. The existing function seems slow (32MS CPU to compare Hovious/Hoviouous), so I'm prepared to rewrite it from scratch. But before I do, I thought I'd ask if anyone out there already has a fast way to something similar to this?

    SQL Server has two SOP answers for this problem. One is old and limited, by easy to use. The other is newer,very good, but involves administrative considerations to setup and manager.

    The old one: SOUNDEX and DIFFERENCE functions. Both use the old Soundex algorithm, invented by Ma Bell in the 60's or 70's for the exact problem that you mention: matching names that might be misspelled. Look it up in BOL. It's easy, fast , and not too hard to make it really fast (you add a soundex column to the table and then index it). However, for anything more than matching spelling variations of one name against a column of single names, it may not work too well.

    The new one: Full-Text Indexing. This is the Cadillac solution, it can find words and phrases inside large string columns of text, multiple tables or even entire databases. Tons of flexibility. The cost? Well, the price is free with SQL Server, but you will have to set it up, determine it's scope, and manage the indexer, which is going to want a lot of space for the full-text indexes that it will be making and maintaining. Your choice.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • [font="Verdana"]Yes, you have had a look at SOUNDEX() (and DIFFERENCE()) :-D[/font]

  • Thanks guys. 🙂

    I'm aware of SOUNDEX. The problem is that some of these strings aren't names, or even proper words, so I eliminated that early. Full text search comes closer to general application but still has problems with transposed characters, and maybe speed. Remember, I don't know where in the string an error may appear.

    All good thoughts though. Please keep them coming if others occur to you.

    I am onto something in my coding although it may turn out to be an ITVF from hell. If I can get it to work, I will post it for critique back here when I have a working prototype.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Hi Bob

    I think in this case a usual inline function may be one of the fastest solutions.

    Here my function

    USE tempdb

    IF (OBJECT_ID('dbo.ufn_CompareBH') IS NOT NULL)

    DROP FUNCTION dbo.ufn_CompareBH

    GO

    CREATE FUNCTION dbo.ufn_CompareBH (@s1 VARCHAR(100), @s2 VARCHAR(100))

    RETURNS BIT

    -- RETURNS VARCHAR(100)

    AS

    BEGIN

    DECLARE @len1 SMALLINT

    DECLARE @len2 SMALLINT

    DECLARE @pos1 SMALLINT

    DECLARE @pos2 SMALLINT

    DECLARE @c1 CHAR(1)

    DECLARE @c2 CHAR(1)

    DECLARE @joker BIT

    SELECT @len1 = LEN(@s1), @len2 = LEN(@s2), @joker = 1, @pos1 = 1, @pos2 = 1

    IF (@len1 = @len2 - 1)

    BEGIN

    -- s1 is shorter than s2

    WHILE (@pos1 < @len1)

    BEGIN

    SELECT @c1 = SUBSTRING(@s1, @pos1, 1), @c2 = SUBSTRING(@s2, @pos2, 1)

    -- Maybe the missing character

    IF (@c1 != @c2)

    BEGIN

    -- Joker is gambled away and reuse the current character in s1

    IF (@joker = 1)

    SELECT @joker = 0, @pos1 = @pos1 - 1

    ELSE

    RETURN 0

    END

    SELECT @pos1 = @pos1 + 1, @pos2 = @pos2 + 1

    END

    END

    ELSE IF (@len2 = @len1 - 1)

    BEGIN

    -- s2 is shorter than s1

    WHILE (@pos1 < @len1)

    BEGIN

    SELECT @c1 = SUBSTRING(@s1, @pos1, 1), @c2 = SUBSTRING(@s2, @pos2, 1)

    IF (@c1 != @c2)

    BEGIN

    -- Joker is gambled away and reuse current character in s2

    IF (@joker = 1)

    SELECT @joker = 0, @pos2 = @pos2 - 1

    ELSE

    RETURN 0

    END

    SELECT @pos1 = @pos1 + 1, @pos2 = @pos2 + 1

    END

    RETURN 1

    END

    ELSE IF (@len1 = @len2)

    BEGIN

    -- len is equal; check the characters

    WHILE (@pos1 < @len1)

    BEGIN

    SELECT @c1 = SUBSTRING(@s1, @pos1, 1), @c2 = SUBSTRING(@s2, @pos1, 1)

    IF (@c1 != @c2)

    BEGIN

    -- Check if two characters have been swapped

    IF (@pos1 < @len1 AND @c1 = SUBSTRING(@s2, @pos1 + 1, 1) AND @c2 = SUBSTRING(@s1, @pos1 + 1, 1))

    BEGIN

    IF (@joker = 1)

    SELECT @joker = 0, @pos1 = @pos1 + 1

    ELSE

    RETURN 0

    END

    -- Any other wrong character

    ELSE IF (@joker = 1)

    SET @joker = 0

    ELSE

    RETURN 0

    END

    SET @pos1 = @pos1 + 1

    END

    RETURN 1

    END

    RETURN 0

    END

    GO

    My test script

    SET NOCOUNT ON

    GO

    DROP TABLE tbh

    GO

    CREATE TABLE tbh (id INT IDENTITY NOT NULL, s1 VARCHAR(100), s2 VARCHAR(100), result BIT, PRIMARY KEY (id))

    GO

    INSERT INTO tbh VALUES ('Hovious', 'Havious', NULL)

    INSERT INTO tbh VALUES ('Hovious', 'Hovios', NULL)

    INSERT INTO tbh VALUES ('Hovious', 'Hoviuos', NULL)

    WHILE ((SELECT COUNT(*) FROM tbh) < 75000)

    INSERT INTO tbh SELECT s1, s2, result FROM tbh

    --SELECT COUNT(*) FROM tbh

    GO

    DECLARE @now DATETIME

    SET @now = GETDATE()

    UPDATE TOP(75000) tbh SET result = dbo.ufn_CompareBH(s1, s2)

    SELECT DATEDIFF(MILLISECOND, @now, GETDATE())

    My test result

    For 75,000 records was 1,833 milliseconds. Hope this might be fast enough... 🙂

    ... time for finish now for me. It's quiet late here. See you tomorrow!

    Greets

    Flo

  • Thanks, Flo. I am going to hold time trials tonight.

    Hope you have a good evening. 🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Flo, using your function, when I test:

    PRINT dbo.ufn_CompareBH('Jesse', 'Jessie')

    I get zero returned, but I think it should be 1, right?

    Bob, as far as performance is concerned, I think you're going to have to limit the comparisons you will test. If you have 1 million rows in your target, and 10,000 coming in daily, you really don't want to run 10 billion comparisons.

    What about this - add an integer column to the tables holding values for comparison, and populate it with the sum of ASCII values of the upper-cased letters to be compared. If the difference between the two sums is less than or equal to (65 + 26), then the two words should be compared.

    For example, comparing 'JESSE' and 'JESSIE':

    SELECT ABS((ASCII('J') + ASCII('E') + ASCII('S') + ASCII('S') + ASCII('E'))

    - (ASCII('J') + ASCII('E') + ASCII('S') + ASCII('S') + ASCII('I') + ASCII('E'))) - 65

    results in 8, indicating that the two words should be compared.

  • Interesting suggestion, and I will certainly look into it. Right now I minimize the number of rows by running separate queries, first joining for example on name (exact match) and number being close in the where clause, and then joining on number(exact match) with name being close in the where clause.

    I haven't done time trials yet, but I came up with the following inline table-valued function last night, if anyone wants to try to break it (or improve it).

    Thanks to everyone again for your input and suggestions.

    ALTER FUNCTION [dbo].[itvf_StringsClose]

    (

    @string1 varchar(50)

    ,@string2 varchar(50)

    )

    RETURNS TABLE

    AS

    -- ============================================================================

    -- Author: Bob Hovious

    -- Create date: 3/19/2008

    -- Description: This function compares two strings and deems them to be "close" if:

    -- 1. One has one more character than the other

    -- 2. One character has been mistyped

    -- 3. Two characters are transposed

    -- ============================================================================

    RETURN

    (

    WITH

    L0 AS (SELECT 1 AS C UNION ALL

    SELECT 1 union all

    SELECT 1 union all

    SELECT 1 union all

    SELECT 1 union all

    SELECT 1 union all

    SELECT 1 union all

    SELECT 1 )-- 8 rows

    ,L1 AS (SELECT 1 as C from L0 A cross join L0 b) --64 rows

    ,Tally AS (SELECT ROW_NUMBER() OVER(ORDER BY C) AS N FROM L1) -- maximum test string size is 50, so 64 is plenty

    ,trimmedStrings as (select ltrim(rtrim(@string1)) as string1, ltrim(rtrim(@string2)) as string2)

    ,basevalues as

    (select case when string1 >= string2 then string1 else string2 end as string1 -- long string 1

    ,case when string1 >= string2 then string2 else string1 end as string2 -- short string 2

    from trimmedStrings

    where abs(len(string1)-len(string2)) < 2

    )

    ,stuffing as

    (select string1,ltrim(rtrim(stuff(' '+string2+' ',t2.N,0,'_'))) as S2

    ,string2,stuff(string1,t1.N,1,'_') as S1

    ,case when t1.N < len(string1) then stuff(string1,t1.N,2,reverse(substring(string1,t1.N,2)))

    else 'askljfaqwieyhui' -- highly unlikely to be compared

    end as S1Transposition

    from basevalues b

    cross join tally t1

    cross join tally t2

    where t1.N <= len(string1)

    and t2.N <= len(string2)+2

    )

    select isnull((select 'Y' as [StringsClose]

    where exists(select 1 from stuffing

    where string2 like S1

    or string1 like S2

    or string2 = S1Transposition

    )

    ),'N') as stringsClose

    )

    /* tests

    set statistics time on;

    select * from dbo.StringsClose('Hovious','Hoviouscc')

    set statistics time off;

    ------------------------------------------------------

    set statistics time on;

    with names (name1,name2) as

    (select 'Hovious','Hovious' union all

    select 'xHovious','Hovious' union all

    select 'Hovious','xHovious' union all

    select 'Hoviousx','Hovious' union all

    select 'Hovious','Hoviousx' union all

    select 'Xovious','Hovious' union all

    select 'Hovious','Xovious' union all

    select 'HoviouX','HoviouX' union all

    select 'HoXious','Hovious' union all

    select 'Hovious','HoXious' union all

    select 'HoviousSS','Hovious' union all

    select 'Hovious','HoviousSS' union all

    select 'HHHovious','Hovious' union all

    select 'Hovious','HHHovious' union all

    select 'OHvious','Hovious' union all

    select 'Hovious','OHvious' union all

    select 'HoIVous','Hovious' union all

    select 'Hovious','HoIVous' union all

    select 'HovioSU','Hovious' union all

    select 'Hovious','HovioSU' union all

    select 'Hovis','Hovious' union all

    select 'Hovious','Hovis' union all

    select 'HovIAs','Hovious' union all

    select 'Hovious','HovIAs' union all

    select 'HXviXus','Hovious' union all

    select 'Hovious','HXviXus'

    )

    select name1,name2,name.stringsClose

    from names

    cross apply dbo.itvf_StringsClose(name1,name2) name

    where stringsClose = 'Y'

    set statistics time off;

    */

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob,

    If you are processing incoming data, you may really want to look at the fuzzy lookup in SSIS. I used it quite a while ago in a now defunct process and it worked well. It does require some testing and tuning during development, but it may just fit your requirements.

  • Thanks, Lynn. Apparently I missed that when I first skimmed through fuzzy lookup. I was under the misimpression that it involved how many columns were exact matches, rather than the character strings themselves. Frankly, I'm going to have to go back to school on SSIS. For the moment, doing it in pure SQL, I can get it through QA faster.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply