Need a function to compare strings.

Question

Post reply

Need a function to compare strings.

The Dixie Flatline

SSC Guru

Points: 53354
More actions
March 18, 2009 at 4:22 pm

#135676

I am having to deal with a lot of external data for which NO front end editing can be accomplished. Trust me, that is a fervent wish. I am trying to match this external data to our existing data, and will accept close matches. Here's an example:
declare @sample1 table (name varchar(30), phone varchar(12))
insert into @sample
select 'Hovious','123-456-7890'
My name gets typoed a LOT and some people have numeric dyslexia that might make them miskey the phone number. I need to be able to find this record when the name is an exact match and the phone number is close, OR when the phone number is an exact match and the name is close.
Here is my working definition of "close"
String A has only one character different than String B: Hovious/Havious
or
String A has one character more or less than String B: Hovious/Hovios
or
String A has two characters transposed in String B: Hovious/Hoviuos
This may be used for large batch runs, so speed is a must. The existing function seems slow (32MS CPU to compare Hovious/Hoviouous), so I'm prepared to rewrite it from scratch. But before I do, I thought I'd ask if anyone out there already has a fast way to something similar to this?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Viewing 15 posts - 1 through 15 (of 24 total)

You must be logged in to reply to this topic. Login to reply

Florian Reischl SSC-Dedicated Points: 37301 More actions · Answer 1

Hi Bob!

Before diving into a SQL parser... What about a CLR function? Would this be a possibility?

Greets

Flo

noeld SSC Guru Points: 96590 More actions · Answer 2

noeld

SSC Guru

Points: 96590

March 18, 2009 at 4:32 pm

#962333

Google for: SSIS - Fuzzy Lookup

* Noel

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 3

Unfortunately CLR isn't an option for me for the same reason that front end editing isn't. Long story.

As for fuzzy lookup, I haven't looked at it for a while, but I thought it was basically a comparison of how many exact matches were found in columns, not characters. Did I overlook something?

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Florian Reischl SSC-Dedicated Points: 37301 More actions · Answer 4

Are you able to compare on a table or do you have to match each single text?

How many data will come within one import action?

Greets

Flo

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 5

The normal application will be a comparison between two tables, with a table of several thousand being compared to a table of 75,000 and growing.

I am coding as we chat about this, but any good ideas would still be welcomed. Thanks, Flo.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

RBarryYoung SSC Guru Points: 143329 More actions · Answer 6

Bob Hovious (3/18/2009)
I am having to deal with a lot of external data for which NO front end editing can be accomplished. Trust me, that is a fervent wish. I am trying to match this external data to our existing data, and will accept close matches. Here's an example:
declare @sample1 table (name varchar(30), phone varchar(12))
insert into @sample
select 'Hovious','123-456-7890'
My name gets typoed a LOT and some people have numeric dyslexia that might make them miskey the phone number. I need to be able to find this record when the name is an exact match and the phone number is close, OR when the phone number is an exact match and the name is close.
Here is my working definition of "close"
String A has only one character different than String B: Hovious/Havious
or
String A has one character more or less than String B: Hovious/Hovios
or
String A has two characters transposed in String B: Hovious/Hoviuos
This may be used for large batch runs, so speed is a must. The existing function seems slow (32MS CPU to compare Hovious/Hoviouous), so I'm prepared to rewrite it from scratch. But before I do, I thought I'd ask if anyone out there already has a fast way to something similar to this?

SQL Server has two SOP answers for this problem. One is old and limited, by easy to use. The other is newer,very good, but involves administrative considerations to setup and manager.

The old one: SOUNDEX and DIFFERENCE functions. Both use the old Soundex algorithm, invented by Ma Bell in the 60's or 70's for the exact problem that you mention: matching names that might be misspelled. Look it up in BOL. It's easy, fast , and not too hard to make it really fast (you add a soundex column to the table and then index it). However, for anything more than matching spelling variations of one name against a column of single names, it may not work too well.

The new one: Full-Text Indexing. This is the Cadillac solution, it can find words and phrases inside large string columns of text, multiple tables or even entire databases. Tons of flexibility. The cost? Well, the price is free with SQL Server, but you will have to set it up, determine it's scope, and manage the indexer, which is going to want a lot of space for the full-text indexes that it will be making and maintaining. Your choice.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Bruce W Cassidy SSChampion Points: 12668 More actions · Answer 7

[font="Verdana"]Yes, you have had a look at SOUNDEX() (and DIFFERENCE()) :-D[/font]

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 8

Thanks guys. 🙂

I'm aware of SOUNDEX. The problem is that some of these strings aren't names, or even proper words, so I eliminated that early. Full text search comes closer to general application but still has problems with transposed characters, and maybe speed. Remember, I don't know where in the string an error may appear.

All good thoughts though. Please keep them coming if others occur to you.

I am onto something in my coding although it may turn out to be an ITVF from hell. If I can get it to work, I will post it for critique back here when I have a working prototype.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Florian Reischl SSC-Dedicated Points: 37301 More actions · Answer 9

Hi Bob

I think in this case a usual inline function may be one of the fastest solutions.

Here my function

USE tempdb

IF (OBJECT_ID('dbo.ufn_CompareBH') IS NOT NULL)

DROP FUNCTION dbo.ufn_CompareBH

GO

CREATE FUNCTION dbo.ufn_CompareBH (@s1 VARCHAR(100), @s2 VARCHAR(100))

RETURNS BIT

-- RETURNS VARCHAR(100)

AS

BEGIN

DECLARE @len1 SMALLINT

DECLARE @len2 SMALLINT

DECLARE @pos1 SMALLINT

DECLARE @pos2 SMALLINT

DECLARE @c1 CHAR(1)

DECLARE @c2 CHAR(1)

DECLARE @joker BIT

SELECT @len1 = LEN(@s1), @len2 = LEN(@s2), @joker = 1, @pos1 = 1, @pos2 = 1

IF (@len1 = @len2 - 1)

BEGIN

-- s1 is shorter than s2

WHILE (@pos1 < @len1)

BEGIN

SELECT @c1 = SUBSTRING(@s1, @pos1, 1), @c2 = SUBSTRING(@s2, @pos2, 1)

-- Maybe the missing character

IF (@c1 != @c2)

BEGIN

-- Joker is gambled away and reuse the current character in s1

IF (@joker = 1)

SELECT @joker = 0, @pos1 = @pos1 - 1

ELSE

RETURN 0

END

SELECT @pos1 = @pos1 + 1, @pos2 = @pos2 + 1

END

ELSE IF (@len2 = @len1 - 1)

BEGIN

-- s2 is shorter than s1

WHILE (@pos1 < @len1)

BEGIN

SELECT @c1 = SUBSTRING(@s1, @pos1, 1), @c2 = SUBSTRING(@s2, @pos2, 1)

IF (@c1 != @c2)

BEGIN

-- Joker is gambled away and reuse current character in s2

IF (@joker = 1)

SELECT @joker = 0, @pos2 = @pos2 - 1

ELSE

RETURN 0

END

SELECT @pos1 = @pos1 + 1, @pos2 = @pos2 + 1

END

RETURN 1

END

ELSE IF (@len1 = @len2)

BEGIN

-- len is equal; check the characters

WHILE (@pos1 < @len1)

BEGIN

SELECT @c1 = SUBSTRING(@s1, @pos1, 1), @c2 = SUBSTRING(@s2, @pos1, 1)

IF (@c1 != @c2)

BEGIN

-- Check if two characters have been swapped

IF (@pos1 < @len1 AND @c1 = SUBSTRING(@s2, @pos1 + 1, 1) AND @c2 = SUBSTRING(@s1, @pos1 + 1, 1))

BEGIN

IF (@joker = 1)

SELECT @joker = 0, @pos1 = @pos1 + 1

ELSE

RETURN 0

END

-- Any other wrong character

ELSE IF (@joker = 1)

SET @joker = 0

ELSE

RETURN 0

END

SET @pos1 = @pos1 + 1

END

RETURN 1

END

RETURN 0

END

GO

My test script

SET NOCOUNT ON

GO

DROP TABLE tbh

GO

CREATE TABLE tbh (id INT IDENTITY NOT NULL, s1 VARCHAR(100), s2 VARCHAR(100), result BIT, PRIMARY KEY (id))

GO

INSERT INTO tbh VALUES ('Hovious', 'Havious', NULL)

INSERT INTO tbh VALUES ('Hovious', 'Hovios', NULL)

INSERT INTO tbh VALUES ('Hovious', 'Hoviuos', NULL)

WHILE ((SELECT COUNT(*) FROM tbh) < 75000)

INSERT INTO tbh SELECT s1, s2, result FROM tbh

--SELECT COUNT(*) FROM tbh

GO

DECLARE @now DATETIME

SET @now = GETDATE()

UPDATE TOP(75000) tbh SET result = dbo.ufn_CompareBH(s1, s2)

SELECT DATEDIFF(MILLISECOND, @now, GETDATE())

My test result

For 75,000 records was 1,833 milliseconds. Hope this might be fast enough... 🙂

... time for finish now for me. It's quiet late here. See you tomorrow!

Greets

Flo

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 10

Thanks, Flo. I am going to hold time trials tonight.

Hope you have a good evening. 🙂

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Jesse McLain SSCommitted Points: 1782 More actions · Answer 11

Flo, using your function, when I test:

PRINT dbo.ufn_CompareBH('Jesse', 'Jessie')

I get zero returned, but I think it should be 1, right?

Bob, as far as performance is concerned, I think you're going to have to limit the comparisons you will test. If you have 1 million rows in your target, and 10,000 coming in daily, you really don't want to run 10 billion comparisons.

What about this - add an integer column to the tables holding values for comparison, and populate it with the sum of ASCII values of the upper-cased letters to be compared. If the difference between the two sums is less than or equal to (65 + 26), then the two words should be compared.

For example, comparing 'JESSE' and 'JESSIE':

SELECT ABS((ASCII('J') + ASCII('E') + ASCII('S') + ASCII('S') + ASCII('E'))

- (ASCII('J') + ASCII('E') + ASCII('S') + ASCII('S') + ASCII('I') + ASCII('E'))) - 65

results in 8, indicating that the two words should be compared.

Jesse McLain
jesse@jessemclain.com
www.jessemclain.com
www.jessesql.blogspot.com

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 12

Interesting suggestion, and I will certainly look into it. Right now I minimize the number of rows by running separate queries, first joining for example on name (exact match) and number being close in the where clause, and then joining on number(exact match) with name being close in the where clause.

I haven't done time trials yet, but I came up with the following inline table-valued function last night, if anyone wants to try to break it (or improve it).

Thanks to everyone again for your input and suggestions.

ALTER FUNCTION [dbo].[itvf_StringsClose]

(

@string1 varchar(50)

,@string2 varchar(50)

)

RETURNS TABLE

AS

-- ============================================================================

-- Author: Bob Hovious

-- Create date: 3/19/2008

-- Description: This function compares two strings and deems them to be "close" if:

-- 1. One has one more character than the other

-- 2. One character has been mistyped

-- 3. Two characters are transposed

-- ============================================================================

RETURN

(

WITH

L0 AS (SELECT 1 AS C UNION ALL

SELECT 1 union all

SELECT 1 )-- 8 rows

,L1 AS (SELECT 1 as C from L0 A cross join L0 b) --64 rows

,Tally AS (SELECT ROW_NUMBER() OVER(ORDER BY C) AS N FROM L1) -- maximum test string size is 50, so 64 is plenty

,trimmedStrings as (select ltrim(rtrim(@string1)) as string1, ltrim(rtrim(@string2)) as string2)

,basevalues as

(select case when string1 >= string2 then string1 else string2 end as string1 -- long string 1

,case when string1 >= string2 then string2 else string1 end as string2 -- short string 2

from trimmedStrings

where abs(len(string1)-len(string2)) < 2

)

,stuffing as

(select string1,ltrim(rtrim(stuff(' '+string2+' ',t2.N,0,'_'))) as S2

,string2,stuff(string1,t1.N,1,'_') as S1

,case when t1.N < len(string1) then stuff(string1,t1.N,2,reverse(substring(string1,t1.N,2)))

else 'askljfaqwieyhui' -- highly unlikely to be compared

end as S1Transposition

from basevalues b

cross join tally t1

cross join tally t2

where t1.N <= len(string1)

and t2.N <= len(string2)+2

)

select isnull((select 'Y' as [StringsClose]

where exists(select 1 from stuffing

where string2 like S1

or string1 like S2

or string2 = S1Transposition

)

),'N') as stringsClose

)

/* tests

set statistics time on;

select * from dbo.StringsClose('Hovious','Hoviouscc')

set statistics time off;

------------------------------------------------------

set statistics time on;

with names (name1,name2) as

(select 'Hovious','Hovious' union all

select 'xHovious','Hovious' union all

select 'Hovious','xHovious' union all

select 'Hoviousx','Hovious' union all

select 'Hovious','Hoviousx' union all

select 'Xovious','Hovious' union all

select 'Hovious','Xovious' union all

select 'HoviouX','HoviouX' union all

select 'HoXious','Hovious' union all

select 'Hovious','HoXious' union all

select 'HoviousSS','Hovious' union all

select 'Hovious','HoviousSS' union all

select 'HHHovious','Hovious' union all

select 'Hovious','HHHovious' union all

select 'OHvious','Hovious' union all

select 'Hovious','OHvious' union all

select 'HoIVous','Hovious' union all

select 'Hovious','HoIVous' union all

select 'HovioSU','Hovious' union all

select 'Hovious','HovioSU' union all

select 'Hovis','Hovious' union all

select 'Hovious','Hovis' union all

select 'HovIAs','Hovious' union all

select 'Hovious','HovIAs' union all

select 'HXviXus','Hovious' union all

select 'Hovious','HXviXus'

)

select name1,name2,name.stringsClose

from names

cross apply dbo.itvf_StringsClose(name1,name2) name

where stringsClose = 'Y'

set statistics time off;

*/

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Lynn Pettis SSC Guru Points: 442470 More actions · Answer 13

Bob,

If you are processing incoming data, you may really want to look at the fuzzy lookup in SSIS. I used it quite a while ago in a now defunct process and it worked well. It does require some testing and tuning during development, but it may just fit your requirements.

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 14

Thanks, Lynn. Apparently I missed that when I first skimmed through fuzzy lookup. I was under the misimpression that it involved how many columns were exact matches, rather than the character strings themselves. Frankly, I'm going to have to go back to school on SSIS. For the moment, doing it in pure SQL, I can get it through QA faster.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills