Dictonary Creation - Any ideas?????????

  • Hi All,

     

    I am trying to creat a dcitionary for my address table. I have a large list of incorrect spellings and a list of the correct spelling. Any ideas on how to match the incorrect to the correct without doing it manually?

    I have tried using SOUNDEX and DIFFERENCE but the result that is returned is not what I am looking for. Any help here would be appreciated.

     

    Thanks!

    L

  • I would create a temporary table that has two coulmns (correct_spelling, incorrect_spelling).

     

    Then run an update on the address table linking together the incorrect spelling with the address.

    I.E

     

    update address_table

    set address_table.address = correct_spelling

    from address_table, temp_table

    where address_table.address = temp_table.incorrect_spelling

     

    This way, the only rows that will be updated will be the incorrectly spelt rows in the address table.


    ------------------------------
    The Users are always right - when I'm not wrong!

  • thanks for the quick reply Jonathan,

     

    Although, I don't think I made myself too clear. I am trying to create the temp table you are talking about (with correct_spelling, incorrect_spelling as fields - which I imagine relate to one another) so that I can link the correct address with an incorrect spelling of that address.

    Problem is, I have over 150 000 incorrect Spellings and about 20 000 correct spelling with which to match them to. If it helps, I am trying to this for Suburb Names (correct_spelling, incorrect_spelling). So I want to run a query that will take the incorrect spelling of the suburb name and match it to the correct suburb name and then put this all into a table (equivalent to your temp table). Am I making sense?

    Thanks again

    L

  • I don't know if this will help in your situation or not but i had a similar problem with ssns. I used a combination of truncation and soundex to find matches. In this case typos usually occur at the end of the string rather than at the beginning so I truncated the last 2 or 4 characters then compared the name fields to determine if there was a likely match. I then assigned a confidence value based on the likely hood of a correct match.

    In your case typos are more likely in the street than the numbers so if you can separate out the numbers, then compare some of the other fields you may be able to match. Also we have found suffixs like rd, ave etc are not so often misspelled as the are labled dff. You can come up with a list of likely interpretations for those too (i.e rd and road).

    Hope that is helpful

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply