Comparing Strings......

  • Hello! everyone 🙂

    Please help me to solve this .... how I should compare Convergys and Congys or similar type of strings...

    Regards,

    Ahmad

    Regards,
    [font="Verdana"]Sqlfrenzy[/font]

  • What do you want to know about them? How alike they are or how different? Are you looking for words that start the same, end the same, have x number of same characters?What are your expected results?


  • Ahmad Osama (3/26/2008)


    Hello! everyone 🙂

    Please help me to solve this .... how I should compare Convergys and Congys or similar type of strings...

    Regards,

    Ahmad

    Well, one way to compare them is to use SOUNDEX (see http://msdn2.microsoft.com/en-us/library/ms189282.aspx)

    You can use it like:

    SELECT DIFFERENCE('Convergys', 'Congys')

    Regards,

    Andras


    Andras Belokosztolszki, MCPD, PhD
    GoldenGate Software

  • mrpolecat (3/26/2008)


    What do you want to know about them? How alike they are or how different? Are you looking for words that start the same, end the same, have x number of same characters?What are your expected results?

    sry i forget to tell that.. :)a function should return an interger which either represents how alike they are or how different are they

    Regards,
    [font="Verdana"]Sqlfrenzy[/font]

  • As Andras stated SOUNDEX and DIFFERNCE do that based the relation of consonents in the 2 words. Is this what you are looking for or do you have another standard of measurement you want to apply?


  • Hi,

    I want to check a user inputed company name with that of company that exist in our database in order to determine whether the user given company already exist in our database or it doesn't.As one can easily understand that users may misspell a company which already exist....so how should I check for these strings.....

    Regards,
    [font="Verdana"]Sqlfrenzy[/font]

  • If you have a table called CompanyNames you can query like this:

    -- Remember to use procs to eliminate SQL injection

    SELECT cid from CompanyNames where cname =

    One project that I was on had an issue with this and a second table was created which stored the most used misspellings of a company name. Users entered the name and if it was wrong the app gave back a googlesqe, "did you mean...". Of course a better solution may be to update the application to let the user select an existing company.

    ST

  • souLTower (3/27/2008)


    If you have a table called CompanyNames you can query like this:

    -- Remember to use procs to eliminate SQL injection

    SELECT cid from CompanyNames where cname =

    One project that I was on had an issue with this and a second table was created which stored the most used misspellings of a company name. Users entered the name and if it was wrong the app gave back a googlesqe, "did you mean...". Of course a better solution may be to update the application to let the user select an existing company.

    ST

    My issue is to restrict users to enter an existing company....I think that creating a new table to store most misspelled name of around 4 million companies is a tedious task..:)

    Thanks

    Ahmas

    Regards,
    [font="Verdana"]Sqlfrenzy[/font]

  • Hi Ahmad, depending on how many rows you're working with, fuzzy matching may also be an option. You could employ SOUNDEX or DIFFERENCE to quickly obtain a list of likely matches then run this through more sophisticated fuzzy matching algorithms to return the single best match to the app for eyeball checking.

    Cheers

    ChrisM

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Chris Morris (3/28/2008)


    Hi Ahmad, depending on how many rows you're working with, fuzzy matching may also be an option. You could employ SOUNDEX or DIFFERENCE to quickly obtain a list of likely matches then run this through more sophisticated fuzzy matching algorithms to return the single best match to the app for eyeball checking.

    Cheers

    ChrisM

    well i tried this function 1st to get a feel of difference function.....

    Create FUNCTION strcmp

    (@strone varchar(100),@strtwo varchar(100))

    RETURNS char(5)

    AS

    BEGIN

    Declare @result char(5)

    if difference(@strtwo,@strone)=4

    Begin

    set @result='True'

    End

    Else

    Begin

    set @result='False'

    End

    return @result

    END

    GO

    It is not working in this case select dbo.strcmp('Ta ta','Tat a')

    Can we have fuzzy logic algo in SQL SERVER 2000?

    Thanks:)

    Regards,
    [font="Verdana"]Sqlfrenzy[/font]

  • It's not built-in like 2k5 so you have to write your own. I've used it a lot for migration deduping on the last project with this client, looking for dupes in a table and matches between two different tables. It's fiddly and slow but very effective.

    The method I use is token matching, and it works like this. A reference string is compared against each target string e.g. a column. Starting at the left hand side of the reference (first character), take the first three characters - the token -and see if they match anywhere in the target, 1 if they do, 0 if they don't. The second token is characters 2, 3 and 4, then 3, 4 and 5. Add up the matches and divide by the length of the reference to generate a score for each target string.

    There's loads of room to tune this: you can use a 4-char token, you can use non-overlapping tokens i.e first token is characters 1, 2 and 3, second is 4, 5 and 6. You can change the way that scoring is calculated.

    This snippet is the core code I use for an internal dedupe:

    -- Self fuzzy match

    SELECT t.VENDOR, r.VENDOR AS Dupe, COUNT(*) AS TokenCount

    FROM FuzzyInputFile r -- reference (6,943)

    INNER JOIN [Numbers] n ON n.number < LEN(r.VendorName)

    INNER JOIN FuzzyInputFile t ON t.VENDOR <> r.VENDOR

    AND CHARINDEX(SUBSTRING(r.VendorName, n.number, 3), t.VendorName) > 0 -- 3 is token size, tune to data

    WHERE n.number < 30 -- column length

    GROUP BY t.VENDOR, r.VENDOR, t.VendorName

    HAVING COUNT(*) / CAST(LEN(t.VendorName) AS NUMERIC (5,2)) > 0.4 -- Cutoff: tune to data

    Where VENDOR is the vendor code (PK) and the numbers table is a standard tally table.

    This snippet is the core code I use for finding the closest match in a column to a string:

    CREATE FUNCTION GetFuzzyMatchCount (@TargetString NVARCHAR(60), @ReferenceString NVARCHAR(60))

    RETURNS Numeric (6,3)

    AS

    BEGIN

    DECLARE @iCounter INT, @nOutput INT, @cString CHAR(3), @TokenCountReference INT, @SecondWordStart INT

    SET @TokenCountReference = LEN(RTRIM(@ReferenceString)) - 2

    SET @TargetString = RTRIM(@TargetString)

    SET @nOutput = 0

    SET @iCounter = 0

    WHILE @TokenCountReference > @iCounter+1

    BEGIN

    SET @cString = RTRIM(SUBSTRING(@ReferenceString, @iCounter + 1, 3))

    IF LEN(@cString) > 0 AND CHARINDEX (@cString, @TargetString) > 0

    BEGIN

    SET @nOutput = @nOutput + 1

    -- See if the token exists anywhere else in the reference string

    SET @SecondWordStart = CHARINDEX (@cString, @TargetString, @iCounter + 5)

    IF @SecondWordStart > 0

    SET @TargetString = STUFF(@TargetString, @SecondWordStart, 3, '')

    END

    SET @iCounter = @iCounter + 1

    END

    RETURN @nOutput

    END

    Usage is like this:

    SELECT TOP 1 @MatchLevel = dbo.GetFuzzyMatchCount(@TargetFuzzyMatchName, FuzzyMatchName) / @TargetTokenCountName,

    @ReferenceVENDOR = LTRIM(VENDOR)

    FROM VendorMatchReference (NOLOCK)

    ORDER BY 1 DESC

    Beware that both of these methods have been used for migration where performance is far less important than with an app. I think the function could be tuned for far better performance should you need it.

    Cheers

    ChrisM

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Ahmad Osama (3/27/2008)


    souLTower (3/27/2008)


    If you have a table called CompanyNames you can query like this:

    -- Remember to use procs to eliminate SQL injection

    SELECT cid from CompanyNames where cname =

    One project that I was on had an issue with this and a second table was created which stored the most used misspellings of a company name. Users entered the name and if it was wrong the app gave back a googlesqe, "did you mean...". Of course a better solution may be to update the application to let the user select an existing company.

    ST

    My issue is to restrict users to enter an existing company....I think that creating a new table to store most misspelled name of around 4 million companies is a tedious task..:)

    Thanks

    Ahmas

    You certainly wouldn't build a table of possible mis-spellings but I agree with the logic... create a table of correct names and if the user types something that doesn't have a perfect match, use the SOUNDEX functions to return a list of companies that have a "4" match.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • If you're trying to make sure that they don't re-enter existing clients - why not force them to go through the existing list of customers FIRST? Like - a DDL with compeny names, or a search screen? Or a few nag screens when they are trying to add something new (are you SURE this is a new company? No, as in REALLY REALLY SURE? How about now?). The people doing the data entry are likely to be the best people to make that determination (since they would presumably have all of the info in front of them), so making it a clean-up problem instead of preventing the problem is ultimately a bit wasteful.

    Free-text typing key info that's already being tracked in a reference table is always dodgy.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (3/31/2008)


    If you're trying to make sure that they don't re-enter existing clients - why not force them to go through the existing list of customers FIRST? Like - a DDL with compeny names, or a search screen? Or a few nag screens when they are trying to add something new (are you SURE this is a new company? No, as in REALLY REALLY SURE? How about now?). The people doing the data entry are likely to be the best people to make that determination (since they would presumably have all of the info in front of them), so making it a clean-up problem instead of preventing the problem is ultimately a bit wasteful.

    Free-text typing key info that's already being tracked in a reference table is always dodgy.

    Spot on, Matt. I've seen something like this working really well for company input and that's what Ahmad expects to implement. I'm sorry if this isn't clear from my posts and stand totally corrected.

    Cheers

    ChrisM

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Jeff Moden (3/31/2008)


    Ahmad Osama (3/27/2008)


    souLTower (3/27/2008)


    If you have a table called CompanyNames you can query like this:

    -- Remember to use procs to eliminate SQL injection

    SELECT cid from CompanyNames where cname =

    One project that I was on had an issue with this and a second table was created which stored the most used misspellings of a company name. Users entered the name and if it was wrong the app gave back a googlesqe, "did you mean...". Of course a better solution may be to update the application to let the user select an existing company.

    ST

    My issue is to restrict users to enter an existing company....I think that creating a new table to store most misspelled name of around 4 million companies is a tedious task..:)

    Thanks

    Ahmas

    You certainly wouldn't build a table of possible mis-spellings but I agree with the logic... create a table of correct names and if the user types something that doesn't have a perfect match, use the SOUNDEX functions to return a list of companies that have a "4" match.

    Thanx jeff...I appreciate your advice...

    But to implement this I do need to create a list of correct spelled companies out of total 4 million list...which will be a manual(not totally) task...well I will hv to see how much time do i have:)

    this will be a good logic to implement on a smaal list or one shud maintain a table right from the start.

    Harlow!

    Regards,
    [font="Verdana"]Sqlfrenzy[/font]

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply