Introduction
If, like me, you have ever had to administer the organisations marketing database then at some time or other you would probably have had to deal the maintenance of the organisations contact/mailing list.
For me this involved cleaning it up, removing duplicate names, providing updates and selection facilities for that list.
Maintaining the company mailing list was a good way of introducing database discipline to the marketing department and gaining a better understanding of the companies business.
From a database and programming perspective it represents a good meaty challenge.
The basics
In Great Britain addresses can comprise of many lines of information. Not all of these are compulsory or necessary but may be used nonetheless.
- House/Building Name
- Precinct/Residence
- Street number
- Locality
- Double-dependant locality
- Post Town
- County
- Postal Code
- Country
The secret of deduplication is to identify correctly the different lines of your address so that you are comparing post town with post town, street with street etc.
If you can’t do this then you will end up matching an address with a street element of Chester Street with an address with a post town of Chester le Street.
Database Structure
The first thing I decided was that I was going to have a specific dedupe database that would hold a table for each address file that I was going to dedupe.
These would start off structurally identical to their original source files and I would append my own fields to the end of each structure.
In any address there can be only one postal code, county and postal town.
There can be between one and three lines that could be called the street element, and between one and three that I could use as catch alls for superfluous address information.
Next, I decided that I needed fields to hold both a SOUNDEX code and a METAPHONE code for my street lines.
Soundex
Soundex has been around for over a century, and was originally used to help match addresses written on file cards.
It is fairly good at catering for basic miss-spellings such as wrong vowels and double consonants. Where it falls down is that a miss-spelt word has to start at the very least with the same letter as the correct word.
It breaks down text into 6 possible codes plus a preceding letter. The rules are as follows:-
- Always retain the first character or number.
- Discard all vowels including the Welsh vowels H, W and Y.
- Discard any other non-alphabetic characters.
- Replace B,F,P, V with 1.
- Replace C, G, J, K, Q, S, X, Z with 2
- Replace D & T with 3
- Replace L with 4.
- Replace N & M with 5
- Replace R with 6.
- If any numbers occur in pairs or more only retain the first number.
In the SQL Server version only the first 4 characters of the code are retained. Why, oh why didn’t Microsoft put this limit on it.
There are some variations on this function.
- Replace the vowels with 0 and then strip, carry out all the steps above, then at the end, remove all the zeros.
- Pre-scan the text for duplicate letters and eliminate. This caters for Welsh place names such as Llangollen.
Metaphone
The metaphone function was defined by Lawrence Phillips. It is too detailed to document here, but click here for one of many good references on the web.
Metaphone breaks down text into 16 phonetic sounds and caters for the sounds that a combination of letters can make and also combinations that are silent i.e. the gh at the end of through.
You may see references to a function called double-metaphone. This is a variation that can handle transposition of letters within the text.
I wish that the metaphone function was included in SQL Server, but alas, there aren’t that many deduplication packages that use it either.
Identifying the address elements
There is a database in Great Britain called the Post Office Address File or PAF for short. It holds every deliverable address in GB.
There are quite a few PC based systems such as Quick Address that will match your internal databases against the PAF database to make sure your addresses are postally correct and up-to-date, and also to identify the address elements for you.
The systems themselves aren’t expensive, but subscribing to the PAF file is. I didn’t have the budget for this so I had to design my own system.
I decided that the best way to proceed was to work backward through the elements within an address.
Country
Firstly, 99.9% of my addresses were in GB so I knew I didn’t have to worry about the country.
Postal Code
This is the single most important address element in a British marketing database. There are 1.6 million of these covering 22 million households, averaging about 15 households per postal code. Therefore identifying this puts you are within 15 households of your target.
Apart from its use in deduplication there are many datasets of marketing information that can be bought that rely on the postal code as the primary key for their data. These include grid reference systems, geodemographics, census data etc.
A British postal code consists of two halves.
The first half identifies the sorting office and general area of the address. The 2nd have goes right down to the specific side of the street. If you are a business and receive a lot of mail then you may have a postal code all to yourself.
From a database perspective you can identify postal codes because they can be pattern matched.
The 2nd half is always [0-9][A-Z][A-Z] so this is easy to look for.
The 1st half will generally match to either of the following.
- [A-Z][0-9]
- [A-Z][0-9][0-9]
- [A-Z][A-Z][0-9]
- [A-Z][A-Z][0-9][0-9]
The letters identify the main sorting office i.e. SK = Stockport. The numbers identify the district covered by the Stockport sorting office.
I say generally because London (capital of England) breaks these rules. Legend has it that when WWII came along and all the postmen went off to fight, women took over (officially) and decided to employ logic to London post codes so the letters define an direction as well as an area within London. W= West 1, EC = East Central.
In addition to the patterns listed so far we have
[A-Z][0-9][A-Z]
[A-Z][A-Z][0-9][A-Z]
Overall though extracting postal codes is a pattern matching exercise.
Counties
Counties are not needed in UK addresses as far as the post office is concerned. It simply isn’t necessary for their systems. Give them a nice postal code and they’re happy.
Systems are one thing people are another. The English Civil War was between the House of Lancaster and the House of York. As far as the inhabitants of Lancashire and Yorkshire are concerned we are merely in a state cease-fire brought about when Edward VII (a Lancastrian) married Elizabeth of York. That was over 500 years ago but if you miss off the county for their addresses or, heaven forbid, mix them up then hostilities will resume.
From a database perspective there are a limited number of counties and the sort of people who use them are not the sort to miss-spell them. You can do an exact match against a look-up table in order to extract these.
The only thing to bare in mind is that there are short names for some counties.
Lancashire is often written as Lancs so the look-up table needs to have these short entries as well.
When the counties are extracted from an address I always recorded the long version of the name in my dedupe tables. This is so I had a standardised list of counties.
Post Towns.
Post Towns again are limited in number but more numerous than counties. Again, you can do a match against a lookup table.
Street Lines
For my purposes I defined street lines as
- Beginning with a number or
- Ending with a word/abbreviation such as St or Street.
Now it is not enough to simply extract the street lines.
Consider St John Street.This could be written as
- St John St
- Saint John St
- St John Street
- Saint John Street.
What I did was build a substitution table with the following fields
Field | Type | Description |
---|---|---|
SearchTerm | VARCHAR(50) | The abbreviation to search for. |
ReplaceTerm | VARCHAR(50) | The full version of the abbreviation. |
AddressLineType | CHAR(1) | S = Street. B = Business. N = Name. A = Any. |
TermType | CHAR(1) | P = Prefix. S = Suffix. A = Any. |
This table has two functions.
- Identify a street line from its contents.
- Replace the abbreviated version of a word with the full
Other Address Lines
Having gone through all the stages listed so far any remaining address lines I classified as “General” and didn’t bother to extract them.
Business Names
Names of businesses were always in their own field but on extraction I used the same substitution table I used for my street addresses to turn things Mitsubishi Ltd to Mitsubishi Limited. This is to make the comparison between the two simpler.
The actual dedupe
Encoding
Firstly, note that I didn't use a limited character version of either SOUNDEX or METAPHONE. As far as I was concerned I wanted the full phonetic encoding of the words as a starting point so that I could tune the dedupe process later.
I had identified and extracted my address elements and standardised terms within those elements. The next stage was to use my unlimited length SOUNDEX and METAPHONE to encode the extracted street and business name lines.
I didn’t need to do this for postal code, county or town because these conform to a standard list of limited values.
I now had encoded values that would enable me to compare two business addresses and be fairly sure that I had a close match.
Points to bear in mind
No single query is going to identify all the dupes. That is why I used both SOUNDEX and METAPHONE.
Data omissions also affect the results of a query so different combinations of elements will help you whittle away at the list of dupes.
As stated earlier I could truncate my codes for comparison purposes to act as a tuning mechanism on the dedupe.
As I had three address lines I had an iterative procedure to match any of the three lines to any of the other three lines in conjunction with other elements.
The basic rule with deduplication is set the rules so they are tightly defined i.e. you are 99% sure that a match is a match, then slacken off the rules to make the matching more generic.
Glaring Pitfalls
After I did the dedupe my company found that response rates to direct mail actually dropped. Not because the dedupe hadn’t worked but because of the way that the addresses were classified for marketing purposes.
We used to mail to plumbers with plumber specific offerings and to builders with building specific offerings. The problem came because the same name could appear once in plumbing and one under building. The deduplication process was effectively saying you are either one or the other, not both and in the process of doing so limiting the range of products offered to that person.
If you are aware of this pitfall then you can turn it to your advantage in situations where you have a customer who has been buying product ‘X’ off you for years, then all of a sudden their name and address crops up on a list under a totally different classification.
Possible future amendments?
If SOUNDEX gets turned into a letter then one of six values and METAPHONE gets turned into one of 16 letters then one thing I am keen to try is to compress these values so that one byte holds two characters, thus a single Int field could hold up to 4 characters of a METAPHONE word.
Why bother?
Because when you are deduping a large number of addresses the encoded values could represent a large volume of data. It is going to be indexed but it is going to be faster to compare two int values than two varchar(4)s.
I believe that Lawrence Phillips himself now works for Verity (the search engine people) and does something similar, in addition he uses a version of metaphone that reduces the phonetic characters from 16 to 12 encoded values.
Beyond that I should like to try foreign language versions of either SOUNDEX or METAPHONE on international data.