November 3, 2010 at 2:30 pm
Say I've got a varchar variable that contains 'Jackson Greene inc'
In a table in my database I have a company that has been listed as 'Jackson-Green incorporated'
What I want to do is find any companies in the database that, to some extent, match what is in the varchar variable - so I can put a list of them on the screen so a user can decide whether they are, or are not, the same company.
What I'd like to do is loop through the varchar variable, moving forwards one character at a time, testing if 4 characters are in the company name field.
So the test, in the example above, would be ...
WHERE tblCompany.CompanyName LIKE '%Jack%'
WHERE tblCompany.CompanyName LIKE '%acks%'
WHERE tblCompany.CompanyName LIKE '%ckso%'
WHERE tblCompany.CompanyName LIKE '%kson%'
etc.
For the sake of simplicity - I will strip spaces and punctation out of the varchar variable and I have a stripped version of the company name that I will compare against.
I've got no idea how to do it and would be grateful for any pointers.
November 3, 2010 at 3:13 pm
Would this work for you? (FYI - this is the first time I've ever used this function!)
DECLARE @test-2 TABLE (TestChar varchar(50));
INSERT INTO @test-2 VALUES ('Jackson-Green incorporated');
declare @FindStr varchar(50);
set @FindStr = 'Jackson Greene inc';
SELECT *
FROM @test-2
WHERE SOUNDEX(TestChar) = SOUNDEX(@FindStr);
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 3, 2010 at 4:35 pm
Are you working with one table looking for potential dupes?
Are you working with two or more tables looking for potential matches? Please provide as much information as you can.
This is token-matching. You can match, form the source string abcdef
abc
bcd
cde etc
or alternatively
abc
def
ghi etc
The solution to this requires a tally table to set start and end points (or lengths) of the tokens from the source string to match into the target. If you can wait 12 hours (cos it's bedtime here in the UK) and don't know about tally tables, then Google or search here for a good start to the technique. I'll post some in the morning.
Soundex doesn't work too well for this (sorry Wayne). Matching on whole words of a sentence: 'Jackson' and 'Greene' and 'inc' from 'Jackson Greene inc' works rather better.
In the UK, it helps to first replace all possible permutations of company abbreviations with a single fixed token e.g. CLT for Company Limited - which reduces the weighting of the fixed part of the name: e.g. ACC Ltd / ACC Company Ltd / ACC Co Ltd. A lookup table is useful for this. In practice, using more than one method (preferably at the same time to enhance performance) with a percentage matching accuracy for each gives good results.
If you're itching to see how this is done, I've posted it on the forum before now but under a different user. Look for something like Dedupe / Token matching. Have fun ๐
For better assistance in answering your questions, please read this[/url].
Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
November 3, 2010 at 5:15 pm
Hi, and thanks for your reply.
It's two tables. One contains a new imported list of organisations - the other is the existing list of organisations (about 10,000 rows in the existing table, 3,000 rows in the new table).
Unfortunately many of the organisations are not commercial companies - and have very long names often abbreviated or spelt incorrectly. I've tried loads of routines of comparing different bits of the strings but, in the end, none are accurate enough and a lot of it is going to have to be done manually.
Which is why I want to put on the screen the imported organisations on the left and, say, 20 near matches on the right so the user can select the right match etc.
Any examples you could provide would be very much appreciated. (It is, in fact, a list of UK organisations but hardly any of them have Ltd. or Limited or Company in their name. Most of them are education, research or charity related and they have arcane names and lots of acronyms are used (some at the beginning, some at the end, some in brackets, some with the acronym - a hyphen (space or comma) and then the full name, some with an acronym and an abbreviated name - it's endless). Which is why I want to loop through (thinking in terms of one imported organisation at a time saying 'do any 4 consecutive characters anywhere in this organisation's name match any 4 consecutive characters in any of the existing organisation names' - bearing in mind I've already stripped all punctuation and spaces.)
November 3, 2010 at 5:25 pm
Thanks for your reply. I'll give it a go although Chris (below) seems to think it might not be the solution for this situation.
November 4, 2010 at 4:41 am
-------------------------------------------------------------
-- Create a tally table - CAREFUL, this is only 120 rows / numbers
-- and should be used only for this demo
-------------------------------------------------------------
DROP TABLE #Numbers
SELECT TOP 120 n = ROW_NUMBER() OVER(ORDER BY Name)
INTO #Numbers
FROM Master.dbo.sysColumns
CREATE UNIQUE CLUSTERED INDEX CIn ON #Numbers ([n] ASC)
-------------------------------------------------------------
-- See how the method works with a code snippet
-------------------------------------------------------------
DECLARE @String1 VARCHAR(120), @String2 VARCHAR(120)
SELECT @String1 = 'Jackson Greene inc', @String2 = 'Jackson-Green incorporated'
-- Fit tokens from @String1 into @String2
SELECT TokenRefCount = CAST(LEN(@String2)-2 AS NUMERIC),
TokenTestCount = CAST(LEN(@String1)-2 AS NUMERIC),
TokenMatched = (CASE
WHEN CHARINDEX(SUBSTRING(@String1, (n.n), 3), @String2, 1) = 0 THEN 0
ELSE 1 END),
Token = SUBSTRING(@String1, (n.n), 3)
FROM #Numbers n
WHERE (n.n) < LEN(@String1) - 1
GO
-------------------------------------------------------------
-- Put it together as a CROSS APPLY for SQL Server 2008
-- Returning rows from a table where a matching threshold is exceeded
-------------------------------------------------------------
-- Make a small sample table
DROP TABLE #Company
CREATE TABLE #Company (CompanyID INT IDENTITY, CompanyName VARCHAR(120))
INSERT INTO #Company (CompanyName)
SELECT 'Jackson Greene inc' UNION ALL
SELECT 'Jackson-Green incorporated' UNION ALL
SELECT 'Jack. Greene inc' UNION ALL
SELECT 'Jackson Grn inc' UNION ALL
SELECT 'Jackson-Greene' UNION ALL
SELECT 'Jackson Browne' UNION ALL
SELECT 'Jackson Greene'
-- Solution starts here
DECLARE @String1 VARCHAR(120), @TokenTestCount INT
SET @String1 = 'Jackson Greene inc'
SET @TokenTestCount = CAST(LEN(@String1)-2 AS NUMERIC(6,2))
SELECT TOP 20 d.CompanyName, d.MatchScore
FROM (
SELECT c.CompanyName,
iTVF.TokenMatches,
MatchScore = CASE
WHEN @String1 = c.CompanyName THEN 100
WHEN @String1 LIKE c.CompanyName + '%' OR c.CompanyName LIKE @String1 + '%' THEN 90
WHEN iTVF.tokenrefcount = 0 OR @tokentestcount = 0 THEN 0
ELSE CAST(100 * (iTVF.TokenMatches / @TokenTestCount) AS NUMERIC(6,2))
--Fine-tune: you can apply a weighting according to the relative lengths of the two strings
--* CASE
--WHEN @tokentestcount > iTVF.tokenrefcount THEN (iTVF.tokenrefcount / @tokentestcount)
--WHEN @tokentestcount < iTVF.tokenrefcount THEN (@tokentestcount / iTVF.tokenrefcount)
--ELSE 1
--END
END
FROM #Company c
CROSS APPLY (
SELECT TokenRefCount, TokenMatches = CAST(SUM(TokenMatches) AS NUMERIC(6,2))
FROM (
SELECT c.CompanyName,
TokenRefCount = CAST(LEN(c.CompanyName)-2 AS NUMERIC),
TokenMatches = CASE
WHEN CHARINDEX(SUBSTRING(@String1, (n.n), 3), c.CompanyName, 1) = 0 THEN 0
ELSE 1 END
FROM #Numbers n
WHERE (n.n) < LEN(@String1) - 1
) xi GROUP BY xi.CompanyName, xi.TokenRefCount
) iTVF
) d
WHERE d.MatchScore > 40
ORDER BY d.MatchScore DESC
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
November 4, 2010 at 5:30 am
Chris Morris-439714 (11/4/2010)
-------------------------------------------------------------
-- Create a tally table - CAREFUL, this is only 120 rows / numbers
-- and should be used only for this demo
-------------------------------------------------------------
DROP TABLE #Numbers
SELECT TOP 120 n = ROW_NUMBER() OVER(ORDER BY Name)
INTO #Numbers
FROM Master.dbo.sysColumns
CREATE UNIQUE CLUSTERED INDEX CIn ON #Numbers ([n] ASC)
-------------------------------------------------------------
-- See how the method works with a code snippet
-------------------------------------------------------------
DECLARE @String1 VARCHAR(120), @String2 VARCHAR(120)
SELECT @String1 = 'Jackson Greene inc', @String2 = 'Jackson-Green incorporated'
-- Fit tokens from @String1 into @String2
SELECT TokenRefCount = CAST(LEN(@String2)-2 AS NUMERIC),
TokenTestCount = CAST(LEN(@String1)-2 AS NUMERIC),
TokenMatched = (CASE
WHEN CHARINDEX(SUBSTRING(@String1, (n.n), 3), @String2, 1) = 0 THEN 0
ELSE 1 END),
Token = SUBSTRING(@String1, (n.n), 3)
FROM #Numbers n
WHERE (n.n) < LEN(@String1) - 1
GO
-------------------------------------------------------------
-- Put it together as a CROSS APPLY for SQL Server 2008
-- Returning rows from a table where a matching threshold is exceeded
-------------------------------------------------------------
-- Make a small sample table
DROP TABLE #Company
CREATE TABLE #Company (CompanyID INT IDENTITY, CompanyName VARCHAR(120))
INSERT INTO #Company (CompanyName)
SELECT 'Jackson Greene inc' UNION ALL
SELECT 'Jackson-Green incorporated' UNION ALL
SELECT 'Jack. Greene inc' UNION ALL
SELECT 'Jackson Grn inc' UNION ALL
SELECT 'Jackson-Greene' UNION ALL
SELECT 'Jackson Browne' UNION ALL
SELECT 'Jackson Greene'
-- Solution starts here
DECLARE @String1 VARCHAR(120), @TokenTestCount INT
SET @String1 = 'Jackson Greene inc'
SET @TokenTestCount = CAST(LEN(@String1)-2 AS NUMERIC(6,2))
SELECT TOP 20 d.CompanyName, d.MatchScore
FROM (
SELECT c.CompanyName,
iTVF.TokenMatches,
MatchScore = CASE
WHEN @String1 = c.CompanyName THEN 100
WHEN @String1 LIKE c.CompanyName + '%' OR c.CompanyName LIKE @String1 + '%' THEN 90
WHEN iTVF.tokenrefcount = 0 OR @tokentestcount = 0 THEN 0
ELSE CAST(100 * (iTVF.TokenMatches / @TokenTestCount) AS NUMERIC(6,2))
--Fine-tune: you can apply a weighting according to the relative lengths of the two strings
--* CASE
--WHEN @tokentestcount > iTVF.tokenrefcount THEN (iTVF.tokenrefcount / @tokentestcount)
--WHEN @tokentestcount < iTVF.tokenrefcount THEN (@tokentestcount / iTVF.tokenrefcount)
--ELSE 1
--END
END
FROM #Company c
CROSS APPLY (
SELECT TokenRefCount, TokenMatches = CAST(SUM(TokenMatches) AS NUMERIC(6,2))
FROM (
SELECT c.CompanyName,
TokenRefCount = CAST(LEN(c.CompanyName)-2 AS NUMERIC),
TokenMatches = CASE
WHEN CHARINDEX(SUBSTRING(@String1, (n.n), 3), c.CompanyName, 1) = 0 THEN 0
ELSE 1 END
FROM #Numbers n
WHERE (n.n) < LEN(@String1) - 1
) xi GROUP BY xi.CompanyName, xi.TokenRefCount
) iTVF
) d
WHERE d.MatchScore > 40
ORDER BY d.MatchScore DESC
Wow! Can't thank you enough. It's exactly what I need. Thanks very, very much.
November 4, 2010 at 5:32 am
You're welcome. Thank you for the generous feedback ๐
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
November 4, 2010 at 6:45 am
This could be useful too.... Tried to do in a small code....
--Table creation and records insertion...........
--create table aaa (name varchar(200))
--insert into aaa values ('abcdefghijklmnop')
--insert into aaa values ('bcdefghijklmnop')
--insert into aaa values ('cdefghijklmnop')
--insert into aaa values ('defghijklmnop')
--insert into aaa values ('efghijklmnop')
--insert into aaa values ('abcd efghijklmnop')
--insert into aaa values ('ab defghijklmnop')
--insert into aaa values ('ab f hiabcd')
--insert into aaa values ('cdef jklmnop')
--insert into aaa values ('a bcde fgh ijklmnop')
--insert into aaa values ('abcdefghijklmnop')
--insert into aaa values ('abcde f g hijk lmnop')
--insert into aaa values ('a bcd efghi jk lmn o p')
declare @counter int, @len int, @name varchar(200)
create table #tmp123 (SearchedName varchar(4), Name varchar(200))
declare c1 cursor
for
select name from aaa
open c1
fetch next from c1 into @name
while(@@fetch_status=0)
begin
select @counter = 1, @len = len(@name)
while (@counter<=@len)
begin
if (@len-@counter)>2
insert into #tmp123 select substring(@name, @counter, 4), @name
set @counter = @counter + 1
end
fetch next from c1 into @name
end
close c1
deallocate c1
select * from #tmp123 where SearchedName = 'fghi'
drop table #tmp123
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply