November 27, 2007 at 10:10 am
I just started working on a project for my company in which I need to build a process for matching the contacts in one database table with the contacts in a master database table.
The master table has approximately 200 million records and includes the fields first name, last name, address 1, address 2, zip code (zip5), and state (US only). It also has other fields, or is related to other tables, that hold information about the person such as demographics and income.
What I need to do is develop a process that takes a smaller table, let's say a few million records, with first name, last name, address 1, address 2, zip code, and state, and finds the best match in the master table. If I can't find an exact match, I need to look for records in the master table that are 'most like' the record in the smaller table; for example, I need to account for soundex() similarities. Once I find a match, I can then pull in the demographic information from the master database for further processing.
I am using SQL 2005, and would like to be able to do this through T-SQL queries and scripts because, for now at least, I don't have the option to buy a third party tool. As the data sets are large, efficiency and total run-time are my main concerns.
If anyone has done something similar in the past, and has any advice or can point me to useful articles, I would greatly appreciate it.
November 28, 2007 at 4:47 am
Hi,
I'm in a roughly similar situation: I have a table with company name (NVARCHAR(255)) and company address (all the address pieces mangled together in an NVARCHAR(1000)). I need to match duplicates, and have been trying different strategies over the last few days.
I'd recommend you look at the following SSIS articles, which deal with exact matching followed by fuzzy matching. Trying to use something like SOUNDEX in raw SQL worked terribly for me, but the fuzzy lookup component has been good so far. This approach will also let you handle some rows automatically while sending others for manual checking if required.
I also found that with 40,000 rows, comparing each to each (cross join) worked badly in T-SQL (for some reason 😉 ), but pipelining it through SSIS was better.
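For reference, a raw T-SQL SOUNDEX attempt along the lines the original post mentions would look roughly like the sketch below (table and column names are placeholders only); even restricted to the same zip code, SOUNDEX is crude and the join still gets expensive on large tables:
-- Placeholder names: WorkingContacts is the smaller table, MasterContacts the 200M-row master.
-- Joining on zip keeps this from becoming an all-against-all comparison.
SELECT w.ContactID, m.ContactID AS MasterContactID
FROM WorkingContacts w
INNER JOIN MasterContacts m
ON m.Zip5 = w.Zip5
AND SOUNDEX(m.LastName) = SOUNDEX(w.LastName)
AND DIFFERENCE(m.FirstName, w.FirstName) >= 3 -- DIFFERENCE returns 0-4; 4 is the closest match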
I appreciate you want to do this within SQL 2005, and without 3rd party tools. Do you have VS2005 with SSIS?
Good luck - Leo.
Articles:
http://www.sql-server-performance.com/articles/dba/data_cleaning_ssis_p1.aspx
http://www.jumpstarttv.com/Media.aspx?vid=27 (Video tutorial on fuzzy lookup SSIS component)
http://msdn.microsoft.com/msdnmag/issues/05/09/SQLServer2005/ (Fuzzy lookup/fuzzy grouping)
November 29, 2007 at 6:37 am
Hi David
Working with person names and addresses is far easier than working with company names and addresses because company names are likely to contain common elements which relate to the type of company, and which may or may not be abbreviated, e.g. "Smith & Co", "Smith and Company". These have to be carefully considered and dealt with before fuzzy matching, so you have a bit of a head start.
The first question is this: how much confidence do you have in the accuracy of the addresses, and in particular the zip codes? If your degree of confidence isn't high, consider running the master database through an address checker such as Quickaddress Batch, which will correct your master addresses. If you can do the same with your working db then so much the better.
A person in the working db is matched to a person in the master db on first name, surname and address. We can abbreviate the address to lines 1 and 2 plus the zip code; certainly in the UK, this is sufficient to uniquely identify an address. The zip code is exact, but line 1 and line 2 are inconsistent. The first name may be the full name, e.g. "John", but it's often only the initial "J". The surname is complete but may be mis-spelled. So we have the zip code, which is exact; the surname and address lines 1 and 2, which are fuzzy; and the forename, which is "extra fuzzy".
Here's the logic: we create a new column and populate it with surname & address lines 1 & 2. We strip out everything which isn't a number, letter or space and then convert the content to upper case. This is our fuzzy comparison column. Then we join our two tables on zip code and an expression to roughly match up the fuzzy columns (which also generates a % match). We apply a cutoff to the output, capture say only 85% match or better: the cutoff should be adjusted low enough so that all rows in the working db have one or more matches from the master db, but not so low that the number of matched rows in the output is unmanageably high. Note that this is a "triangular join" and performance is poor.
Next, the output is ranked according to the closeness of match by elements. You may find that you have an exact match on name elements but a slight mis-match on address elements; set these to MatchType = 1. Perhaps you have an exact match on surname and address elements, but you've matched "John" to "J" on the forename; set these to MatchType = 2. You need to eyeball your output matched data and figure out useful MatchTypes yourself, but once you've done it, it will apply to subsequent tagging jobs. You may be lucky and find that the % match alone is sufficient for your needs, in which case mark the row which has the highest % match.
Finally, using a combination of MatchType and % match, select out the optimum match for each row of your working db.
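As a sketch only: assuming the fuzzy-match output has been captured into a holding table (called MatchOutput here, with TargetID, MasterID, Match and MatchType columns), picking the optimum match per working-db row could look like this:
-- MatchOutput is a hypothetical holding table for the fuzzy-match results
SELECT TargetID, MasterID, MatchType, Match
FROM (
SELECT TargetID, MasterID, MatchType, Match,
ROW_NUMBER() OVER (PARTITION BY TargetID ORDER BY MatchType ASC, Match DESC) AS RowNo
FROM MatchOutput
) d
WHERE RowNo = 1 -- best match for each row of the working db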
You will almost certainly find that this procedure works best with extracts from your original tables containing only those columns required for it to work and to match back to the original tables to enable tagging. You may find useful shortcuts such as selecting only the highest matching row for the output in an "onion select". Experiment and find out what works best for you. I recommend that you test with a SMALL subset from the master db of say 10,000 rows.
THE CODE
Start by creating a numbers or tally table where the highest number is greater than the maximum number of characters you may have in your fuzzy comparison column. There are a number of ways of making the table; here's an example from Jeff Moden which is good for tables of any size:
IF EXISTS (select * from dbo.sysobjects where id = object_id(N'[dbo].[Numbers]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
DROP TABLE dbo.Numbers
--===== Create and populate the Tally table on the fly
SELECT TOP 120 -- exceeds max number of characters in fuzzy comparison column
IDENTITY(int,1,1) AS number
INTO dbo.Numbers
FROM master.dbo.syscolumns sc1,
master.dbo.syscolumns sc2
--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_numbers_number PRIMARY KEY CLUSTERED (number)
--===== Allow the general public to use it
GRANT SELECT ON dbo.Numbers TO PUBLIC
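Before running the match, build the fuzzy comparison column on both tables. Here's a rough sketch, assuming for illustration that the working table (TargetList) has Surname, Address1 and Address2 columns; the character-stripping loop is just one simple way of doing it in SQL 2005:
-- Sketch only: Surname, Address1 and Address2 are illustrative column names
ALTER TABLE TargetList ADD NameString varchar(120) NULL
GO
UPDATE TargetList
SET NameString = UPPER(ISNULL(Surname, '') + ' ' + ISNULL(Address1, '') + ' ' + ISNULL(Address2, ''))
-- strip out everything which isn't a number, letter or space
WHILE EXISTS (SELECT 1 FROM TargetList WHERE NameString LIKE '%[^A-Z0-9 ]%')
UPDATE TargetList
SET NameString = STUFF(NameString, PATINDEX('%[^A-Z0-9 ]%', NameString), 1, '')
WHERE NameString LIKE '%[^A-Z0-9 ]%'
Do the same on the master table (MasterList).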
Here's the query which performs the fuzzy match ("NameString" is the fuzzy comparison column):
-- Count how many 4-character tokens taken from the target string appear somewhere in
-- the master string, and express that count as a fraction of the master string's length
SELECT a.PKID AS TargetID, b.PKID AS MasterID, COUNT(*) / CAST(LEN(b.NameString) AS NUMERIC (5,2)) AS Match
FROM TargetList a
INNER JOIN [Numbers] n ON n.number < LEN(a.NameString)
INNER JOIN MasterList b ON b.Zip = a.Zip AND a.Zip <> ''
AND CHARINDEX(SUBSTRING(a.NameString, n.number, 4), b.NameString) > 0
WHERE n.number < 30 -- max length of fuzzy comparison column
GROUP BY b.PKID, a.PKID, b.NameString
HAVING COUNT(*) / CAST(LEN(b.NameString) AS NUMERIC (5,2)) > 0.8 -- cutoff
To see how it works, disassemble it and take out the group by clause.
There are quite a few variables to play with here. Start with the token size, play with 3 and 4. You may also find that you can get away with a non-overlapping token which will speed things up 3 or 4-fold. Make sure you have Zip indexed on both tables.
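For example, non-overlapping 4-character tokens can be had by keeping only every fourth starting position in the Numbers join; note that the match ratio then tops out at roughly 0.25 rather than 1, so the cutoff must be re-tuned accordingly:
SELECT a.PKID AS TargetID, b.PKID AS MasterID, COUNT(*) / CAST(LEN(b.NameString) AS NUMERIC (5,2)) AS Match
FROM TargetList a
INNER JOIN [Numbers] n ON n.number < LEN(a.NameString)
AND n.number % 4 = 1 -- token start positions 1, 5, 9, ... so the 4-char tokens don't overlap
INNER JOIN MasterList b ON b.Zip = a.Zip AND a.Zip <> ''
AND CHARINDEX(SUBSTRING(a.NameString, n.number, 4), b.NameString) > 0
WHERE n.number < 30 -- max length of fuzzy comparison column
GROUP BY b.PKID, a.PKID, b.NameString
HAVING COUNT(*) / CAST(LEN(b.NameString) AS NUMERIC (5,2)) > 0.2 -- lower cutoff than the overlapping version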
Good luck!
Leo:
When working with company names, I always tokenize common elements first. Make a word list from the company name column and do a count of each word. Then, for each word which appears in the list more than n times, generate a 3-char unique abbreviation. Finally, replace each occurrence in the company name column with the abbreviation. This sp makes the word list and generates a unique 3-char abbreviation:
CREATE PROCEDURE dbo.MakeFuzzyWordList
AS
DECLARE @sql varchar(2000)
CREATE TABLE #FuzzyWordList (Word varchar(60))
-- Dynamic SQL: split each cleaned company name into single words using the Numbers table
SET @sql =
'SELECT
LTRIM(SUBSTRING('' '' + dbo.GetCleanWord(t.VENDOR_VNAME) + '' '', n.number + 1,
CHARINDEX('' '', '' '' + dbo.GetCleanWord(t.VENDOR_VNAME) + '' '', n.number + 1) - n.number - 1)) AS Word
FROM [Numbers] n
INNER JOIN Staging_APVENMAST t
ON SUBSTRING('' '' + dbo.GetCleanWord(t.VENDOR_VNAME) + '' '', n.number, 1) = '' ''
AND n.number < LEN('' '' + dbo.GetCleanWord(t.VENDOR_VNAME) + '' '')
WHERE n.number < 61'
INSERT INTO #FuzzyWordList EXEC (@SQL)
-- remove trailing 'S' if appropriate
UPDATE #FuzzyWordList SET Word = REVERSE(SUBSTRING(REVERSE(Word), 2, 59)) WHERE LEFT(REVERSE(Word), 1) = 'S' AND NOT LEFT(REVERSE(Word), 2) = 'SS'
IF OBJECT_ID('dbo.FuzzyWordList') IS NOT NULL
DROP TABLE dbo.FuzzyWordList
SELECT CAST(REPLACE(REPLACE(Word, ')', ''), '(', '') AS varchar(30)) as LongWord, SPACE(3) AS ShortWord, COUNT(*) AS Counts
INTO FuzzyWordList
FROM #FuzzyWordList t
WHERE LEN(RTRIM(Word)) > 4
GROUP BY CAST(REPLACE(REPLACE(Word, ')', ''), '(', '') AS varchar(30))
HAVING COUNT(*) > 7
ORDER BY 1
DROP TABLE #FuzzyWordList
DECLARE @ShortWord VARCHAR(3), @LongWord varchar(30)
SET @ShortWord = 'AAA'
-- "Quirky update": the variables carry their values from one row to the next, so long
-- words which share a leading 3-character prefix are given distinct abbreviations
UPDATE FuzzyWordList SET
@ShortWord = ShortWord = CASE
WHEN LEFT(@LongWord, 6) = LEFT(LongWord, 6) THEN @ShortWord -- same stem as the previous word: reuse its abbreviation
WHEN LEFT(LongWord, 3) = LEFT(@LongWord, 3) THEN LEFT(@ShortWord, 2) + CHAR(ASCII(SUBSTRING(@ShortWord, 3, 1)) + 1)
WHEN LEFT(LongWord, 3) = @ShortWord THEN LEFT(@ShortWord, 2) + CHAR(ASCII(SUBSTRING(@ShortWord, 3, 1)) + 1)
ELSE LEFT(LongWord, 3)
END,
@LongWord = LongWord
GO
I'm sure you can figure out how to do the swap!
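If it helps, one rough way to do it is a simple loop over the word list, replacing each long word with its abbreviation; ideally against the same cleaned version of the company name that was used to build the word list, rather than the raw column:
-- Sketch only: walk the word list and replace each long word with its 3-char abbreviation.
-- CleanVendorName is an illustrative column holding the cleaned company name.
DECLARE @LongWord varchar(30), @ShortWord varchar(3)
DECLARE WordCursor CURSOR FAST_FORWARD FOR
SELECT LongWord, ShortWord FROM FuzzyWordList
OPEN WordCursor
FETCH NEXT FROM WordCursor INTO @LongWord, @ShortWord
WHILE @@FETCH_STATUS = 0
BEGIN
UPDATE Staging_APVENMAST
SET CleanVendorName = REPLACE(CleanVendorName, @LongWord, @ShortWord)
WHERE CleanVendorName LIKE '%' + @LongWord + '%'
FETCH NEXT FROM WordCursor INTO @LongWord, @ShortWord
END
CLOSE WordCursor
DEALLOCATE WordCursor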
Good luck
ChrisM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
November 29, 2007 at 7:22 am
I have done a similar thing matching flight dates/times, and it worked out really well using SSIS. It takes a bit of time playing with the threshold you want and how to weight it, but I got pretty good data after a while.
December 5, 2007 at 8:11 am
Have you considered trying to parse the address1 field into its components? Then the parts of the address that are more relevant (house number, street name, etc.) can be given more weight than, say, a directional prefix or suffix.
Here's an example of our address layout
========================================================================================
------ Table: address - Production Property Address Information
------
Column Name   Type      Length  Scale  Null  Description
-----------   -------   ------  -----  ----  --------------------------------------------
...
house_no      Int       10      0      YES   House Number.
house_fra     Varchar   3              YES   House number fraction, i.e. '1/2'.
st_dir_pre    Char      2              YES   Prefix Direction.
st_name       Varchar   40             YES   Street Name.
st_type       Char      4              YES   Street Type or Suffix (in the str_type_lu table).
st_dir_suf    Char      2              YES   Suffix Direction.
unit_type     Varchar   4              YES   Type of unit (APT = Apartment, STE = Suite, UNIT = Unit).
unit_num      Varchar   9              YES   Unit (Apartment, Suite) Number.
bldg          Varchar   6              YES   Building Number.
city          Varchar   30             YES   City Name (in the cityzip_lu view).
state         Char      2              YES   State code (IA).
zip           Char      5              YES   Five digit zip code (in the cityzip_lu view).
zip4          Char      4              YES   Four digit Zip Code Extension.
========================================================================================
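Even a crude first pass, like peeling the leading house number off address1, gives you a component you can weight more heavily (table and column names below are just for illustration):
-- Rough illustration only: split a leading house number from the rest of address1.
-- Real addresses need far more care (directionals, unit types, fractions, PO boxes, etc.).
SELECT address1,
CASE WHEN PATINDEX('%[^0-9]%', address1 + ' ') > 1
THEN LEFT(address1, PATINDEX('%[^0-9]%', address1 + ' ') - 1)
ELSE NULL END AS house_no,
LTRIM(SUBSTRING(address1, PATINDEX('%[^0-9]%', address1 + ' '), LEN(address1))) AS street_part
FROM SomeContactTable -- placeholder table name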
Steve
February 14, 2008 at 11:50 am
Parsing is of limited use for matching contacts, depending on the size of the datasets. Ignoring the easy stuff (city, state, zip), parsing the address lines requires context-sensitive grammar parsers, which are a challenge (weeks) to write. Post office standards are more like guidelines, and real addresses tend to be a lot more creative, especially if non-US or non-European contacts are involved. Fuzzy/similarity matching against whatever you consider to be an authority dataset is your best approach, especially with larger datasets and limited validation staff.
Depending on your business model, the goal for contacts is consistency. Correctness is often in the eye of the beholder: the contact, the sales rep, the post office, etc. Deliverable addresses are best handled via a mailing house service (unless economies of scale or business strategy get in the way).
February 14, 2008 at 12:31 pm
Of course, and the best approach I've seen yet in terms of both accuracy and speed is QA batch. But not everybody is prepared to pay for it.
Writing your own string-matching code is a bit like watching Groundhog Day. Every night for a couple of months!
Like everything else in SQL, there's more than one way of skinning the cat. Which way you choose to do it depends upon a lot of factors - disk space, data characteristics, data set size, skillset / capabilities of your programmers, whims of your DBA, whims of your IT manager. There's always a best way but you won't always figure it out in time to implement it.
The method I described above was the best method for us to use. QA batch would have been better.
Cheers
ChrisM
February 14, 2008 at 1:29 pm
With that many addresses, you might want to look into the USPS address matching system. With it, you can clean up and standardize mailing addresses, city, state and zip codes. Once standardized, the matching should be fairly straightforward, since there won't be multiple permutations of the same address to deal with. Your company would also have the side benefit of 'clean' mailing addresses.
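Once both sides are standardized, the match itself can be a plain join on the standardized fields; something along these lines (column names are purely illustrative):
-- Illustrative names only: assumes both tables carry USPS-standardized address columns
SELECT w.ContactID, m.ContactID AS MasterID
FROM WorkingContacts w
INNER JOIN MasterContacts m
ON m.Zip5 = w.Zip5
AND m.StdAddress1 = w.StdAddress1
AND m.LastName = w.LastName
AND LEFT(m.FirstName, 1) = LEFT(w.FirstName, 1)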
Google USPS AMS API and look at this guy's C# wrapper (which you can use in an extended stored procedure):
February 14, 2008 at 1:59 pm
antonio.collins (2/14/2008)
Google USPS AMS API and look this guy's C# wrapper (that you utilize in an extended stored procedure):
http://www.elserinteractive.com/chris/wp/?p=5
Spot on, Antonio. Same (or very similar) functionality as QA batch.
February 14, 2008 at 2:00 pm
Been down that route. It's good, but take it with a grain of salt. Here are some of the issues I ran into: addresses for individuals on large company campuses, addresses for individuals at universities, mangling of long addresses (especially rural), insistence on upper case and abbreviations, and city/section differing from what the contact told us.
The stats for us per USPS were: 2% perfect, 24% claimed undeliverable. In reality only 6% were undeliverable (mostly no longer at the address).
In the end we decided consistency to an authority dataset, be it USPS, or our own, was most important because the various mail delivery services do a fantastic job at deciphering and delivering fuzzy addresses.
Unfortunately, for most of the world, despite the hype, obtaining validated contact addresses isn't a reality. Similarity matching is better in such scenarios.