August 3, 2017 at 11:56 am
J Livingston SQL - Thursday, August 3, 2017 11:01 AMany links to code/data sources to generate this 1 million row "Names" table please?
The site has changed since I downloaded census data several years ago but it looks familiar enough I think you can find the same data.
https://www.census.gov/2010census/data/
I have several pages of scripts I wrote to randomly generate a million names in the frequency they normally occur. Census data is defined by frequency per 100,000. The 87,000 last names in the census would have different span sizes but end-to-end would go from 1 to 1 million when I create the ranges. Using the RAND function, I pick a number between 1 and 1 million to get a last name. The RAND is pretty good at having an even distribution. The drawback from my routines is you may get Chinese first names and Spanish last names when building a full name.
Here are the top 10 last names in the census with the range numbers I generated. A rand value of 30000 would return the name Jones. You can see that the first 10 names take up 62144 positions in the 1 million values. The least common last names may only take one or two positions each. Looking at the chart, I would expect to see 11,118 Smiths generated give or take 10%.
August 3, 2017 at 1:42 pm
I've had some success using the combination of Levenshtein and Difference functions. I had a problem where I needed to match names like 'Term Loan' to 'Term Loan A', 'Term Loan B' etc:
SELECT
x.ID
,RowNumber = ROW_NUMBER() OVER(PARTITION BY x.ID ORDER BY x.Lev ASC, x.Dif DESC)
,x.Lev
,x.Dif
INTO #FuzzyMatch
FROM (
SELECT
obj1.ID
,Lev = dbo.Levenshtein(obj1.Name,obj2.Name)
,Dif = DIFFERENCE(obj1.Name,obj2.Name)
,SnapshotID = snap.ID
FROM dbo.Object1 obj1
JOIN dbo.Object2 obj2 ....
) x
Later on, I would select from that table ordering by rank (from row_number). It was a reasonably quick way to accomplish the matches I was looking for.
August 4, 2017 at 1:48 am
danechambrs - Thursday, August 3, 2017 1:42 PMI've had some success using the combination of Levenshtein and Difference functions. I had a problem where I needed to match names like 'Term Loan' to 'Term Loan A', 'Term Loan B' etc:
SELECT
x.ID
,RowNumber = ROW_NUMBER() OVER(PARTITION BY x.ID ORDER BY x.Lev ASC, x.Dif DESC)
,x.Lev
,x.Dif
INTO #FuzzyMatch
FROM (
SELECT
obj1.ID
,Lev = dbo.Levenshtein(obj1.Name,obj2.Name)
,Dif = DIFFERENCE(obj1.Name,obj2.Name)
,SnapshotID = snap.ID
FROM dbo.Object1 obj1
JOIN dbo.Object2 obj2 ....
) x
Later on, I would select from that table ordering by rank (from row_number). It was a reasonably quick way to accomplish the matches I was looking for.
Levenshtein isn't an internal function, it's a user-defined function. I'd be very interested to see the code for it.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 4, 2017 at 7:01 am
I was hoping to find something that would return matches on similar names but this one did not accommodate that. I would like to search text like 'david' and return 'dave' or even search for 'james' and return 'james' or 'jim'.
August 4, 2017 at 7:23 am
dlchase - Friday, August 4, 2017 7:01 AMI was hoping to find something that would return matches on similar names but this one did not accommodate that. I would like to search text like 'david' and return 'dave' or even search for 'james' and return 'james' or 'jim'.
I have production implementations of nickname variation searches (Margaret, Agatha, Maggie). They're functional and rational and procedural but can't handle misspelled names and variation explosions like this function. I'd say build yourself a nickname or abbreviation or synonym table where your search needs it but then call the fuzzy search to do the actual matching so you can find misspellings and variations. If someone wants a nickname table article, I can write that, but I'm telling you, this fuzzy function is a magnitude more powerful. As Microsoft employees say "don't flip the bozo bit". That phrase means don't consider something as stupid unless you first understand it.
This general fuzzy search function can be used for any strings such as movie names, song names, addresses, and descriptions. If you know what you want to find, then type it into the search function. I have examples in a previous post of how to find variations. Search on all names and nicknames and variations to find all targets for a person. For example, search on "Jim Jimmy James TheDude Smith Smitty Smythe" or whatever else the person goes by. Do it all in one function call so you get the benefit of exponential explosion of variation matches. That is where the power is.
August 4, 2017 at 7:39 am
Thanks. BTW I love the function for many of my searches! I will try some way of incorporating nicknames. Do you know of any sources where this might be available so I don't have to manually enter them into a table?
August 4, 2017 at 7:52 am
dlchase - Friday, August 4, 2017 7:39 AMThanks. BTW I love the function for many of my searches! I will try some way of incorporating nicknames. Do you know of any sources where this might be available so I don't have to manually enter them into a table?
These are screen-scraped nicknames from dozens of web pages. The attached 49k text file has about 1000 names and their 3389 variations.
August 4, 2017 at 7:58 am
Thanks.
August 4, 2017 at 7:39 pm
William, thank you for your very nice work with the fuzzy search function. I'm a 74-year-old retired DBA who now 'plays' nearly every day with SQL code working mainly on analyzing and cleaning my digital music collection of 70k tracks and standardizing my 31 years of Quicken financial data which is all imported into SQL Server because it is my tool of choice. Unfortunately my learning curve has flattened lots but I still love to work with data and keep what's left of my brain active. I appreciate finding work such as yours that serves to further my projects and at the same time give me a little exposure to the newer techniques. I peruse nearly all the information that arrives from the SSC website.
Rick
Disaster Recovery = Backup ( Backup ( Your Backup ) )
August 6, 2017 at 8:24 am
Every now and then you come across a solution that is both simple and elegant. The only flaw is that I didn't come up with it myself! My first association was to anagrams, if the characters appears in pairs it’s often easy to figure out, but if there is no character combination from the original word it’s much harder.
With the help of your function I built a simple tool, FindObject, that we’ll useinternally to look for SQL objects. We'll bring it into our production as soon as possible.
Something along those lines:
DECLARE @s1 varchar(100) = 'Countries';
SELECT DISTINCT top (50)
ObjectName = concat(s.name, '.', o.name), o.type_desc, fms.score
from sys.objects o
inner join sys.schemas s on s.schema_id = o.schema_id
CROSS APPLY ( selectdbo.FuzzyMatchString(@s1, concat(s.name, o.name)) AS score) AS fms
where o.parent_object_id = 0
ORDER by fms.score desc
Thanks for sharing this simple and elegant solution!
August 7, 2017 at 9:59 am
I love this idea. I can immediately see a potential use for this that warrants some further research. I work in the legal sector and can see that this might be very useful in conflict of interest searches.
"Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it. When we enquire into any subject, the first thing we have to do is to know what books have treated of it. This leads us to look at catalogues, and at the backs of books in libraries."
— Samuel Johnson
I wonder, would the great Samuel Johnson have replaced that with "GIYF" now?
August 7, 2017 at 10:00 am
I always enjoy seeing different approaches to handling fuzzy matching in SQL Server. That said, there are many issues with this approach that I don't want to dive into that deeply.
If you were to use this approach (not suggested) the #1 thing to address is ensure that the shorter of the two strings is always evaluated first. Otherwise a consistant or accurate result is not guaranteed. Observe:
DECLARE
@s1 varchar(100) = 'Billy William Lee Talada Jr. alalalalalalalalalal',
@s2 varchar(100) = 'Mr. William Lee Talada Jr.';
SELECT dbo.FuzzyMatchString(@s1, @s1);
SELECT dbo.FuzzyMatchString(@s2, @s1);
#2: Performance wise - scalar always fails. This can be done easily by using NGrams8k like so:
DECLARE
@s1 varchar(100) = 'Billy William Lee Talada Jr.',
@s2 varchar(100) = 'Mr. William Lee Talada Jr.';
SELECT fms = sum(sign(charindex(token, @s2)))
FROM dbo.NGrams8k(@s1, 2)
WHERE charindex(token, @s2) > 1;
Performance test:select TOP (10000) someid = identity(int, 1, 1), x.[name]
into #names
from (VALUES ('Mr. William Lee Talada Jr.'), ('William Lee Talada Jr.'),
('William Lee Talada'), ('Lee Talada'), ('William Lee Talada Jr.')
,(cast(newid() as varchar(100))), (cast(newid() as varchar(100)))
) x([name])
CROSS JOIN sys.all_columns a, sys.all_columns b;
GO
PRINT 'scalar udf'+char(10)+replicate('-',50);
GO
DECLARE @st datetime = getdate(), @x int, @s1 varchar(100) = 'Billy William Lee Talada Jr.';
SELECT @x = fms.score
from #Names
CROSS APPLY (select dbo.FuzzyMatchString(@s1, Name) AS score) AS fms;
PRINT datediff(ms, @st, getdate());
GO 5
PRINT char(10)+'ngrams serial'+char(10)+replicate('-',50);
GO
DECLARE @st datetime = getdate(), @x int, @s1 varchar(100) = 'Billy William Lee Talada Jr.';
SELECT @x = sum(sign(charindex(token, [name])))
FROM #names
CROSS APPLY dbo.NGrams8k(@s1, 2)
GROUP BY someid
OPTION (MAXDOP 1);
PRINT datediff(ms, @st, getdate());
GO 5
PRINT char(10)+'ngrams parallel'+char(10)+replicate('-',50);
GO
DECLARE @st datetime = getdate(), @x int, @s1 varchar(100) = 'Billy William Lee Talada Jr.';
SELECT @x = sum(sign(charindex(token, [name])))
FROM #names
CROSS APPLY dbo.NGrams8k(@s1, 2)
CROSS APPLY dbo.make_parallel()
GROUP BY someid;
PRINT datediff(ms, @st, getdate());
GO 5
Results
scalar udf
--------------------------------------------------
Beginning execution loop
243
237
240
236
240
Batch execution completed 5 times.ngrams serial
--------------------------------------------------
Beginning execution loop
237
230
227
240
233
Batch execution completed 5 times.ngrams parallel
--------------------------------------------------
Beginning execution loop
100
103
100
100
100
Batch execution completed 5 times.
-- Itzik Ben-Gan 2001
August 8, 2017 at 10:58 am
ChrisM@Work - Friday, August 4, 2017 1:48 AMdanechambrs - Thursday, August 3, 2017 1:42 PMI've had some success using the combination of Levenshtein and Difference functions. I had a problem where I needed to match names like 'Term Loan' to 'Term Loan A', 'Term Loan B' etc:
SELECT
x.ID
,RowNumber = ROW_NUMBER() OVER(PARTITION BY x.ID ORDER BY x.Lev ASC, x.Dif DESC)
,x.Lev
,x.Dif
INTO #FuzzyMatch
FROM (
SELECT
obj1.ID
,Lev = dbo.Levenshtein(obj1.Name,obj2.Name)
,Dif = DIFFERENCE(obj1.Name,obj2.Name)
,SnapshotID = snap.ID
FROM dbo.Object1 obj1
JOIN dbo.Object2 obj2 ....
) x
Later on, I would select from that table ordering by rank (from row_number). It was a reasonably quick way to accomplish the matches I was looking for.Levenshtein isn't an internal function, it's a user-defined function. I'd be very interested to see the code for it.
I basically took this https://www.dotnetperls.com/levenshtein and turned it into a CLR:
https://pastebin.com/grC4enaR
August 10, 2017 at 1:22 pm
Here is an sql function implementation of the Levenshtein Difference function. I am posting it with the statement that it is not my original code, but I am certain that it came from this forum or a similar one but I have long since lost the original reference. Use it as you wish, I desire neither credit or blame
CREATE FUNCTION [dbo].[edit_distance]
(
@s1 NVARCHAR(3999)
, @s2 NVARCHAR(3999)
)
RETURNS INT
AS
BEGIN
/* example SELECT dbo.edit_distance ( N'kitten', N'kitchen' ) */
DECLARE @s1_len INT
, @s2_len INT
, @i INT
, @j-2 INT
, @s1_char NCHAR
, @C INT
, @c_temp INT
, @cv0 VARBINARY(8000)
, @cv1 VARBINARY(8000);
SELECT @s1_len = LEN(@s1)
, @s2_len = LEN(@s2)
, @cv1 = 0x0000
, @j-2 = 1
, @i = 1
, @C = 0
IF @s1= ' '
BEGIN
SELECT @s1= REPLICATE('Z',25)
SELECT @s1_len = LEN(@s1)
END
IF @s2 = ' '
BEGIN
SELECT @s2= REPLICATE('X',25)
SELECT @s2_len = LEN(@s2)
END
WHILE @j-2 <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS BINARY(2))
, @j-2 = @j-2 + 1;
WHILE @i <= @s1_len
BEGIN
SELECT @s1_char = SUBSTRING(@s1, @i, 1)
, @C = @i
, @cv0 = CAST(@i AS BINARY(2))
, @j-2 = 1;
WHILE @j-2 <= @s2_len
BEGIN
SET @C = @C + 1;
SET @c_temp = CAST(SUBSTRING(@cv1, @j-2 + @j-2 - 1, 2) AS INT)
+ CASE WHEN @s1_char = SUBSTRING(@s2, @j-2, 1)
THEN 0
ELSE 1
END;
IF @C > @c_temp
SET @C = @c_temp;
SET @c_temp = CAST(SUBSTRING(@cv1, @j-2 + @j-2 + 1, 2) AS INT)
+ 1;
IF @C > @c_temp
SET @C = @c_temp;
SELECT @cv0 = @cv0 + CAST(@c AS BINARY(2))
, @j-2 = @j-2 + 1;
END;
SELECT @cv1 = @cv0
, @i = @i + 1;
END;
RETURN @C;
END;
GO
May 16, 2019 at 11:30 am
I have a question regarding this nice implementation. What happens when you have a string of less then four characters in length or two? I am noticing that it will obviously not rank it, so i tried to add a logical check for if the string input is less then four chars, do it per char not every two, but the desired outcome is not happening. Do you have any advice on tweaking this to handle strings less then 4 chars? Should it just be a LIKE clause at that point, and if so how to return score cards for it.
-Thanks-
Viewing 15 posts - 16 through 30 (of 33 total)
You must be logged in to reply to this topic. Login to reply