October 16, 2013 at 8:16 am
Apologies if this isn't the right place to post this, I struggled to find a relevant forum.
I'm working on a data matching engine at the minute. This will match UK name and address data from source files to a master database. I want to standardise forenames to aid in matching, e.g.
Michael -> Mike
Mike -> Mike
Michel -> Mike
Susan -> Sue
Suzanne -> Sue
Susanne -> Sue
David -> Dave
Dave -> Dave
Davide -> Dave
Etc.
I'd like to avoid building my own conversion table if at all possible, given the manual effort involved. Can anyone point me in the direction of a suitable resource? I have seen this list: http://www.galbithink.org/names/ginap.htm, which is a useful start, but not quite comprehensive enough for my purposes.
Thanks, Iain
October 16, 2013 at 8:47 pm
Could you use Soundex for this? And any comparison that has a score within an "acceptable" range (open to interpretation) would automatically update. You could probably do most/all of this in SSIS and just run it in an overnight job or something.
October 16, 2013 at 9:11 pm
Have you looked at the Fuzzy Lookup transformation in SSIS - it does the sort of thing you are after without needing to pre-determine all of the possible combinations of names. It is available with the Enterprise Edition of SQL Server 2008.
October 21, 2013 at 4:06 am
This type of standardisation may pervent you from getting false negatives based on name, but you are likely to have a significant number of false positives.
Name matching is the easy bit of name & adderss matching. Dealing with the variety of address formats and combinations you get will be much harder. Many organisations outsource this type of matching to third parties who make their living by doing this job well - do not expect that your home-grown system will match the accuracy of the professionals.
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
October 21, 2013 at 6:46 am
Hi all,
Sorry for the delayed response. Comments as follows:
pietlinden
Soundex doesn't really do the job here. As an example:
select soundex ('Dave') = D100
select soundex ('David') = D130
happycat59
Fuzzy lookup doesn't provide me with definitive synonyms, which is what I'm looking for. Depending on the settings it might match Dave to David, but also to Davina and maybe even Daphne.
EdVassie
"Dealing with the variety of address formats and combinations you get will be much harder." You're not kidding 🙂
Overall, this is intended only to form a single component of a wider matching effort, using a range of different techniques (exact matches, synonym matches, edit distance matches, generated keys, household composition, name commonality, etc), each with an associated score component. Match scores will then be calculated and compared to provide the best match.
Thanks all for the responses, much appreciated.
Regards, Iain
October 21, 2013 at 5:34 pm
Trying to do something for the UK (as opposed to just for England) is going to be quite hard. It is very common in Scotland for first names to be translated between two (or 3) languages, so for example someone whose name on his birth certificate is Iain may commonly be called Johnnie, and because he's called Johnnie people may think his name is Sean, or even John, and that's just one of dozens of names that get screwed up. Sometimes the changes are just mis-spelling rather than translation, as when Seumas becomes Hamish (because the vocative case of Seumas is Sheumais, for which the Anglicised spelling is Hamish), so you have to decide whether to reduce Seumas and Hamish both to Jim (since Seumas and James are interchangeable).
It's also pretty common for surnames to be translated, but that's still not as bad as addresses which are as has already been said far worse than names. Names of towns and villages are very comonly translated, just as names of people are.
Tom
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply