March 18, 2009 at 4:22 pm
I am having to deal with a lot of external data for which NO front end editing can be accomplished. Trust me, that is a fervent wish. I am trying to match this external data to our existing data, and will accept close matches. Here's an example:
declare @sample1-2 table (name varchar(30), phone varchar(12))
insert into @sample
select 'Hovious','123-456-7890'
My name gets typoed a LOT and some people have numeric dyslexia that might make them miskey the phone number. I need to be able to find this record when the name is an exact match and the phone number is close, OR when the phone number is an exact match and the name is close.
Here is my working definition of "close"
String A has only one character different than String B: Hovious/Havious
or
String A has one character more or less than String B: Hovious/Hovios
or
String A has two characters transposed in String B: Hovious/Hoviuos
This may be used for large batch runs, so speed is a must. The existing function seems slow (32MS CPU to compare Hovious/Hoviouous), so I'm prepared to rewrite it from scratch. But before I do, I thought I'd ask if anyone out there already has a fast way to something similar to this?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 18, 2009 at 4:30 pm
Hi Bob!
Before diving into a SQL parser... What about a CLR function? Would this be a possibility?
Greets
Flo
March 18, 2009 at 4:32 pm
Google for: SSIS - Fuzzy Lookup
* Noel
March 18, 2009 at 4:48 pm
Unfortunately CLR isn't an option for me for the same reason that front end editing isn't. Long story.
As for fuzzy lookup, I haven't looked at it for a while, but I thought it was basically a comparison of how many exact matches were found in columns, not characters. Did I overlook something?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 18, 2009 at 5:02 pm
Are you able to compare on a table or do you have to match each single text?
How many data will come within one import action?
Greets
Flo
March 18, 2009 at 5:07 pm
The normal application will be a comparison between two tables, with a table of several thousand being compared to a table of 75,000 and growing.
I am coding as we chat about this, but any good ideas would still be welcomed. Thanks, Flo.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 18, 2009 at 5:19 pm
Bob Hovious (3/18/2009)
I am having to deal with a lot of external data for which NO front end editing can be accomplished. Trust me, that is a fervent wish. I am trying to match this external data to our existing data, and will accept close matches. Here's an example:
declare @sample1-2 table (name varchar(30), phone varchar(12))
insert into @sample
select 'Hovious','123-456-7890'
My name gets typoed a LOT and some people have numeric dyslexia that might make them miskey the phone number. I need to be able to find this record when the name is an exact match and the phone number is close, OR when the phone number is an exact match and the name is close.
Here is my working definition of "close"
String A has only one character different than String B: Hovious/Havious
or
String A has one character more or less than String B: Hovious/Hovios
or
String A has two characters transposed in String B: Hovious/Hoviuos
This may be used for large batch runs, so speed is a must. The existing function seems slow (32MS CPU to compare Hovious/Hoviouous), so I'm prepared to rewrite it from scratch. But before I do, I thought I'd ask if anyone out there already has a fast way to something similar to this?
SQL Server has two SOP answers for this problem. One is old and limited, by easy to use. The other is newer,very good, but involves administrative considerations to setup and manager.
The old one: SOUNDEX and DIFFERENCE functions. Both use the old Soundex algorithm, invented by Ma Bell in the 60's or 70's for the exact problem that you mention: matching names that might be misspelled. Look it up in BOL. It's easy, fast , and not too hard to make it really fast (you add a soundex column to the table and then index it). However, for anything more than matching spelling variations of one name against a column of single names, it may not work too well.
The new one: Full-Text Indexing. This is the Cadillac solution, it can find words and phrases inside large string columns of text, multiple tables or even entire databases. Tons of flexibility. The cost? Well, the price is free with SQL Server, but you will have to set it up, determine it's scope, and manage the indexer, which is going to want a lot of space for the full-text indexes that it will be making and maintaining. Your choice.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 18, 2009 at 5:21 pm
[font="Verdana"]Yes, you have had a look at SOUNDEX() (and DIFFERENCE()) :-D[/font]
March 18, 2009 at 5:59 pm
Thanks guys. 🙂
I'm aware of SOUNDEX. The problem is that some of these strings aren't names, or even proper words, so I eliminated that early. Full text search comes closer to general application but still has problems with transposed characters, and maybe speed. Remember, I don't know where in the string an error may appear.
All good thoughts though. Please keep them coming if others occur to you.
I am onto something in my coding although it may turn out to be an ITVF from hell. If I can get it to work, I will post it for critique back here when I have a working prototype.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 18, 2009 at 6:06 pm
Hi Bob
I think in this case a usual inline function may be one of the fastest solutions.
Here my function
USE tempdb
IF (OBJECT_ID('dbo.ufn_CompareBH') IS NOT NULL)
DROP FUNCTION dbo.ufn_CompareBH
GO
CREATE FUNCTION dbo.ufn_CompareBH (@s1 VARCHAR(100), @s2 VARCHAR(100))
RETURNS BIT
-- RETURNS VARCHAR(100)
AS
BEGIN
DECLARE @len1 SMALLINT
DECLARE @len2 SMALLINT
DECLARE @pos1 SMALLINT
DECLARE @pos2 SMALLINT
DECLARE @c1 CHAR(1)
DECLARE @c2 CHAR(1)
DECLARE @joker BIT
SELECT @len1 = LEN(@s1), @len2 = LEN(@s2), @joker = 1, @pos1 = 1, @pos2 = 1
IF (@len1 = @len2 - 1)
BEGIN
-- s1 is shorter than s2
WHILE (@pos1 < @len1)
BEGIN
SELECT @c1 = SUBSTRING(@s1, @pos1, 1), @c2 = SUBSTRING(@s2, @pos2, 1)
-- Maybe the missing character
IF (@c1 != @c2)
BEGIN
-- Joker is gambled away and reuse the current character in s1
IF (@joker = 1)
SELECT @joker = 0, @pos1 = @pos1 - 1
ELSE
RETURN 0
END
SELECT @pos1 = @pos1 + 1, @pos2 = @pos2 + 1
END
END
ELSE IF (@len2 = @len1 - 1)
BEGIN
-- s2 is shorter than s1
WHILE (@pos1 < @len1)
BEGIN
SELECT @c1 = SUBSTRING(@s1, @pos1, 1), @c2 = SUBSTRING(@s2, @pos2, 1)
IF (@c1 != @c2)
BEGIN
-- Joker is gambled away and reuse current character in s2
IF (@joker = 1)
SELECT @joker = 0, @pos2 = @pos2 - 1
ELSE
RETURN 0
END
SELECT @pos1 = @pos1 + 1, @pos2 = @pos2 + 1
END
RETURN 1
END
ELSE IF (@len1 = @len2)
BEGIN
-- len is equal; check the characters
WHILE (@pos1 < @len1)
BEGIN
SELECT @c1 = SUBSTRING(@s1, @pos1, 1), @c2 = SUBSTRING(@s2, @pos1, 1)
IF (@c1 != @c2)
BEGIN
-- Check if two characters have been swapped
IF (@pos1 < @len1 AND @c1 = SUBSTRING(@s2, @pos1 + 1, 1) AND @c2 = SUBSTRING(@s1, @pos1 + 1, 1))
BEGIN
IF (@joker = 1)
SELECT @joker = 0, @pos1 = @pos1 + 1
ELSE
RETURN 0
END
-- Any other wrong character
ELSE IF (@joker = 1)
SET @joker = 0
ELSE
RETURN 0
END
SET @pos1 = @pos1 + 1
END
RETURN 1
END
RETURN 0
END
GO
My test script
SET NOCOUNT ON
GO
DROP TABLE tbh
GO
CREATE TABLE tbh (id INT IDENTITY NOT NULL, s1 VARCHAR(100), s2 VARCHAR(100), result BIT, PRIMARY KEY (id))
GO
INSERT INTO tbh VALUES ('Hovious', 'Havious', NULL)
INSERT INTO tbh VALUES ('Hovious', 'Hovios', NULL)
INSERT INTO tbh VALUES ('Hovious', 'Hoviuos', NULL)
WHILE ((SELECT COUNT(*) FROM tbh) < 75000)
INSERT INTO tbh SELECT s1, s2, result FROM tbh
--SELECT COUNT(*) FROM tbh
GO
DECLARE @now DATETIME
SET @now = GETDATE()
UPDATE TOP(75000) tbh SET result = dbo.ufn_CompareBH(s1, s2)
SELECT DATEDIFF(MILLISECOND, @now, GETDATE())
My test result
For 75,000 records was 1,833 milliseconds. Hope this might be fast enough... 🙂
... time for finish now for me. It's quiet late here. See you tomorrow!
Greets
Flo
March 18, 2009 at 6:12 pm
Thanks, Flo. I am going to hold time trials tonight.
Hope you have a good evening. 🙂
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 19, 2009 at 9:22 am
Flo, using your function, when I test:
PRINT dbo.ufn_CompareBH('Jesse', 'Jessie')
I get zero returned, but I think it should be 1, right?
Bob, as far as performance is concerned, I think you're going to have to limit the comparisons you will test. If you have 1 million rows in your target, and 10,000 coming in daily, you really don't want to run 10 billion comparisons.
What about this - add an integer column to the tables holding values for comparison, and populate it with the sum of ASCII values of the upper-cased letters to be compared. If the difference between the two sums is less than or equal to (65 + 26), then the two words should be compared.
For example, comparing 'JESSE' and 'JESSIE':
SELECT ABS((ASCII('J') + ASCII('E') + ASCII('S') + ASCII('S') + ASCII('E'))
- (ASCII('J') + ASCII('E') + ASCII('S') + ASCII('S') + ASCII('I') + ASCII('E'))) - 65
results in 8, indicating that the two words should be compared.
March 19, 2009 at 9:36 am
Interesting suggestion, and I will certainly look into it. Right now I minimize the number of rows by running separate queries, first joining for example on name (exact match) and number being close in the where clause, and then joining on number(exact match) with name being close in the where clause.
I haven't done time trials yet, but I came up with the following inline table-valued function last night, if anyone wants to try to break it (or improve it).
Thanks to everyone again for your input and suggestions.
ALTER FUNCTION [dbo].[itvf_StringsClose]
(
@string1 varchar(50)
,@string2 varchar(50)
)
RETURNS TABLE
AS
-- ============================================================================
-- Author: Bob Hovious
-- Create date: 3/19/2008
-- Description: This function compares two strings and deems them to be "close" if:
-- 1. One has one more character than the other
-- 2. One character has been mistyped
-- 3. Two characters are transposed
-- ============================================================================
RETURN
(
WITH
L0 AS (SELECT 1 AS C UNION ALL
SELECT 1 union all
SELECT 1 union all
SELECT 1 union all
SELECT 1 union all
SELECT 1 union all
SELECT 1 union all
SELECT 1 )-- 8 rows
,L1 AS (SELECT 1 as C from L0 A cross join L0 b) --64 rows
,Tally AS (SELECT ROW_NUMBER() OVER(ORDER BY C) AS N FROM L1) -- maximum test string size is 50, so 64 is plenty
,trimmedStrings as (select ltrim(rtrim(@string1)) as string1, ltrim(rtrim(@string2)) as string2)
,basevalues as
(select case when string1 >= string2 then string1 else string2 end as string1 -- long string 1
,case when string1 >= string2 then string2 else string1 end as string2 -- short string 2
from trimmedStrings
where abs(len(string1)-len(string2)) < 2
)
,stuffing as
(select string1,ltrim(rtrim(stuff(' '+string2+' ',t2.N,0,'_'))) as S2
,string2,stuff(string1,t1.N,1,'_') as S1
,case when t1.N < len(string1) then stuff(string1,t1.N,2,reverse(substring(string1,t1.N,2)))
else 'askljfaqwieyhui' -- highly unlikely to be compared
end as S1Transposition
from basevalues b
cross join tally t1
cross join tally t2
where t1.N <= len(string1)
and t2.N <= len(string2)+2
)
select isnull((select 'Y' as [StringsClose]
where exists(select 1 from stuffing
where string2 like S1
or string1 like S2
or string2 = S1Transposition
)
),'N') as stringsClose
)
/* tests
set statistics time on;
select * from dbo.StringsClose('Hovious','Hoviouscc')
set statistics time off;
------------------------------------------------------
set statistics time on;
with names (name1,name2) as
(select 'Hovious','Hovious' union all
select 'xHovious','Hovious' union all
select 'Hovious','xHovious' union all
select 'Hoviousx','Hovious' union all
select 'Hovious','Hoviousx' union all
select 'Xovious','Hovious' union all
select 'Hovious','Xovious' union all
select 'HoviouX','HoviouX' union all
select 'HoXious','Hovious' union all
select 'Hovious','HoXious' union all
select 'HoviousSS','Hovious' union all
select 'Hovious','HoviousSS' union all
select 'HHHovious','Hovious' union all
select 'Hovious','HHHovious' union all
select 'OHvious','Hovious' union all
select 'Hovious','OHvious' union all
select 'HoIVous','Hovious' union all
select 'Hovious','HoIVous' union all
select 'HovioSU','Hovious' union all
select 'Hovious','HovioSU' union all
select 'Hovis','Hovious' union all
select 'Hovious','Hovis' union all
select 'HovIAs','Hovious' union all
select 'Hovious','HovIAs' union all
select 'HXviXus','Hovious' union all
select 'Hovious','HXviXus'
)
select name1,name2,name.stringsClose
from names
cross apply dbo.itvf_StringsClose(name1,name2) name
where stringsClose = 'Y'
set statistics time off;
*/
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 19, 2009 at 9:40 am
Bob,
If you are processing incoming data, you may really want to look at the fuzzy lookup in SSIS. I used it quite a while ago in a now defunct process and it worked well. It does require some testing and tuning during development, but it may just fit your requirements.
March 19, 2009 at 9:49 am
Thanks, Lynn. Apparently I missed that when I first skimmed through fuzzy lookup. I was under the misimpression that it involved how many columns were exact matches, rather than the character strings themselves. Frankly, I'm going to have to go back to school on SSIS. For the moment, doing it in pure SQL, I can get it through QA faster.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
Viewing 15 posts - 1 through 15 (of 23 total)
You must be logged in to reply to this topic. Login to reply