March 26, 2008 at 3:30 am
Hello! everyone 🙂
Please help me to solve this .... how I should compare Convergys and Congys or similar type of strings...
Regards,
Ahmad
Regards,
[font="Verdana"]Sqlfrenzy[/font]
March 26, 2008 at 8:09 am
March 26, 2008 at 8:31 am
Ahmad Osama (3/26/2008)
Hello! everyone 🙂Please help me to solve this .... how I should compare Convergys and Congys or similar type of strings...
Regards,
Ahmad
Well, one way to compare them is to use SOUNDEX (see http://msdn2.microsoft.com/en-us/library/ms189282.aspx)
You can use it like:
SELECT DIFFERENCE('Convergys', 'Congys')
Regards,
Andras
March 26, 2008 at 11:34 am
mrpolecat (3/26/2008)
What do you want to know about them? How alike they are or how different? Are you looking for words that start the same, end the same, have x number of same characters?What are your expected results?
sry i forget to tell that.. :)a function should return an interger which either represents how alike they are or how different are they
Regards,
[font="Verdana"]Sqlfrenzy[/font]
March 26, 2008 at 11:40 am
March 26, 2008 at 10:58 pm
Hi,
I want to check a user inputed company name with that of company that exist in our database in order to determine whether the user given company already exist in our database or it doesn't.As one can easily understand that users may misspell a company which already exist....so how should I check for these strings.....
Regards,
[font="Verdana"]Sqlfrenzy[/font]
March 27, 2008 at 10:23 am
If you have a table called CompanyNames you can query like this:
-- Remember to use procs to eliminate SQL injection
SELECT cid from CompanyNames where cname =
One project that I was on had an issue with this and a second table was created which stored the most used misspellings of a company name. Users entered the name and if it was wrong the app gave back a googlesqe, "did you mean...". Of course a better solution may be to update the application to let the user select an existing company.
ST
March 27, 2008 at 1:24 pm
souLTower (3/27/2008)
If you have a table called CompanyNames you can query like this:-- Remember to use procs to eliminate SQL injection
SELECT cid from CompanyNames where cname =
One project that I was on had an issue with this and a second table was created which stored the most used misspellings of a company name. Users entered the name and if it was wrong the app gave back a googlesqe, "did you mean...". Of course a better solution may be to update the application to let the user select an existing company.
ST
My issue is to restrict users to enter an existing company....I think that creating a new table to store most misspelled name of around 4 million companies is a tedious task..:)
Thanks
Ahmas
Regards,
[font="Verdana"]Sqlfrenzy[/font]
March 28, 2008 at 8:43 am
Hi Ahmad, depending on how many rows you're working with, fuzzy matching may also be an option. You could employ SOUNDEX or DIFFERENCE to quickly obtain a list of likely matches then run this through more sophisticated fuzzy matching algorithms to return the single best match to the app for eyeball checking.
Cheers
ChrisM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 28, 2008 at 1:56 pm
Chris Morris (3/28/2008)
Hi Ahmad, depending on how many rows you're working with, fuzzy matching may also be an option. You could employ SOUNDEX or DIFFERENCE to quickly obtain a list of likely matches then run this through more sophisticated fuzzy matching algorithms to return the single best match to the app for eyeball checking.Cheers
ChrisM
well i tried this function 1st to get a feel of difference function.....
Create FUNCTION strcmp
(@strone varchar(100),@strtwo varchar(100))
RETURNS char(5)
AS
BEGIN
Declare @result char(5)
if difference(@strtwo,@strone)=4
Begin
set @result='True'
End
Else
Begin
set @result='False'
End
return @result
END
GO
It is not working in this case select dbo.strcmp('Ta ta','Tat a')
Can we have fuzzy logic algo in SQL SERVER 2000?
Thanks:)
Regards,
[font="Verdana"]Sqlfrenzy[/font]
March 31, 2008 at 5:34 am
It's not built-in like 2k5 so you have to write your own. I've used it a lot for migration deduping on the last project with this client, looking for dupes in a table and matches between two different tables. It's fiddly and slow but very effective.
The method I use is token matching, and it works like this. A reference string is compared against each target string e.g. a column. Starting at the left hand side of the reference (first character), take the first three characters - the token -and see if they match anywhere in the target, 1 if they do, 0 if they don't. The second token is characters 2, 3 and 4, then 3, 4 and 5. Add up the matches and divide by the length of the reference to generate a score for each target string.
There's loads of room to tune this: you can use a 4-char token, you can use non-overlapping tokens i.e first token is characters 1, 2 and 3, second is 4, 5 and 6. You can change the way that scoring is calculated.
This snippet is the core code I use for an internal dedupe:
-- Self fuzzy match
SELECT t.VENDOR, r.VENDOR AS Dupe, COUNT(*) AS TokenCount
FROM FuzzyInputFile r -- reference (6,943)
INNER JOIN [Numbers] n ON n.number < LEN(r.VendorName)
INNER JOIN FuzzyInputFile t ON t.VENDOR <> r.VENDOR
AND CHARINDEX(SUBSTRING(r.VendorName, n.number, 3), t.VendorName) > 0 -- 3 is token size, tune to data
WHERE n.number < 30 -- column length
GROUP BY t.VENDOR, r.VENDOR, t.VendorName
HAVING COUNT(*) / CAST(LEN(t.VendorName) AS NUMERIC (5,2)) > 0.4 -- Cutoff: tune to data
Where VENDOR is the vendor code (PK) and the numbers table is a standard tally table.
This snippet is the core code I use for finding the closest match in a column to a string:
CREATE FUNCTION GetFuzzyMatchCount (@TargetString NVARCHAR(60), @ReferenceString NVARCHAR(60))
RETURNS Numeric (6,3)
AS
BEGIN
DECLARE @iCounter INT, @nOutput INT, @cString CHAR(3), @TokenCountReference INT, @SecondWordStart INT
SET @TokenCountReference = LEN(RTRIM(@ReferenceString)) - 2
SET @TargetString = RTRIM(@TargetString)
SET @nOutput = 0
SET @iCounter = 0
WHILE @TokenCountReference > @iCounter+1
BEGIN
SET @cString = RTRIM(SUBSTRING(@ReferenceString, @iCounter + 1, 3))
IF LEN(@cString) > 0 AND CHARINDEX (@cString, @TargetString) > 0
BEGIN
SET @nOutput = @nOutput + 1
-- See if the token exists anywhere else in the reference string
SET @SecondWordStart = CHARINDEX (@cString, @TargetString, @iCounter + 5)
IF @SecondWordStart > 0
SET @TargetString = STUFF(@TargetString, @SecondWordStart, 3, '')
END
SET @iCounter = @iCounter + 1
END
RETURN @nOutput
END
Usage is like this:
SELECT TOP 1 @MatchLevel = dbo.GetFuzzyMatchCount(@TargetFuzzyMatchName, FuzzyMatchName) / @TargetTokenCountName,
@ReferenceVENDOR = LTRIM(VENDOR)
FROM VendorMatchReference (NOLOCK)
ORDER BY 1 DESC
Beware that both of these methods have been used for migration where performance is far less important than with an app. I think the function could be tuned for far better performance should you need it.
Cheers
ChrisM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 31, 2008 at 7:06 am
Ahmad Osama (3/27/2008)
souLTower (3/27/2008)
If you have a table called CompanyNames you can query like this:-- Remember to use procs to eliminate SQL injection
SELECT cid from CompanyNames where cname =
One project that I was on had an issue with this and a second table was created which stored the most used misspellings of a company name. Users entered the name and if it was wrong the app gave back a googlesqe, "did you mean...". Of course a better solution may be to update the application to let the user select an existing company.
ST
My issue is to restrict users to enter an existing company....I think that creating a new table to store most misspelled name of around 4 million companies is a tedious task..:)
Thanks
Ahmas
You certainly wouldn't build a table of possible mis-spellings but I agree with the logic... create a table of correct names and if the user types something that doesn't have a perfect match, use the SOUNDEX functions to return a list of companies that have a "4" match.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 31, 2008 at 8:52 am
If you're trying to make sure that they don't re-enter existing clients - why not force them to go through the existing list of customers FIRST? Like - a DDL with compeny names, or a search screen? Or a few nag screens when they are trying to add something new (are you SURE this is a new company? No, as in REALLY REALLY SURE? How about now?). The people doing the data entry are likely to be the best people to make that determination (since they would presumably have all of the info in front of them), so making it a clean-up problem instead of preventing the problem is ultimately a bit wasteful.
Free-text typing key info that's already being tracked in a reference table is always dodgy.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
March 31, 2008 at 9:15 am
Matt Miller (3/31/2008)
If you're trying to make sure that they don't re-enter existing clients - why not force them to go through the existing list of customers FIRST? Like - a DDL with compeny names, or a search screen? Or a few nag screens when they are trying to add something new (are you SURE this is a new company? No, as in REALLY REALLY SURE? How about now?). The people doing the data entry are likely to be the best people to make that determination (since they would presumably have all of the info in front of them), so making it a clean-up problem instead of preventing the problem is ultimately a bit wasteful.Free-text typing key info that's already being tracked in a reference table is always dodgy.
Spot on, Matt. I've seen something like this working really well for company input and that's what Ahmad expects to implement. I'm sorry if this isn't clear from my posts and stand totally corrected.
Cheers
ChrisM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 31, 2008 at 1:02 pm
Jeff Moden (3/31/2008)
Ahmad Osama (3/27/2008)
souLTower (3/27/2008)
If you have a table called CompanyNames you can query like this:-- Remember to use procs to eliminate SQL injection
SELECT cid from CompanyNames where cname =
One project that I was on had an issue with this and a second table was created which stored the most used misspellings of a company name. Users entered the name and if it was wrong the app gave back a googlesqe, "did you mean...". Of course a better solution may be to update the application to let the user select an existing company.
ST
My issue is to restrict users to enter an existing company....I think that creating a new table to store most misspelled name of around 4 million companies is a tedious task..:)
Thanks
Ahmas
You certainly wouldn't build a table of possible mis-spellings but I agree with the logic... create a table of correct names and if the user types something that doesn't have a perfect match, use the SOUNDEX functions to return a list of companies that have a "4" match.
Thanx jeff...I appreciate your advice...
But to implement this I do need to create a list of correct spelled companies out of total 4 million list...which will be a manual(not totally) task...well I will hv to see how much time do i have:)
this will be a good logic to implement on a smaal list or one shud maintain a table right from the start.
Harlow!
Regards,
[font="Verdana"]Sqlfrenzy[/font]
Viewing 15 posts - 1 through 15 (of 18 total)
You must be logged in to reply to this topic. Login to reply