August 26, 2015 at 4:55 pm
Hi all,
I've inherited a table of members that has the following structure:
CREATE TABLE [dbo].[dimMember](
[dimMemberId] [int] IDENTITY(1,1) NOT NULL,
[dimSourceSystemId] [int] NOT NULL CONSTRAINT [DF_dimMember_dimSourceSystemId] DEFAULT ((-1)),
[MemberCode] [nvarchar](50) NOT NULL,
[FirstName] [nvarchar](250) NOT NULL,
[LastName] [nvarchar](250) NOT NULL,
[MiddleInitial] [nvarchar](5) NOT NULL,
[dimDayidDOB] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayidDOB] DEFAULT ((-1)),
[dimDayIdDeath] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdDeath] DEFAULT ((99991231)),
[Gender] [nvarchar](10) NOT NULL,
[SubcriberRelationship] [nvarchar](50) NOT NULL,
[SSN] [nvarchar](25) NOT NULL CONSTRAINT [DF_dimMember_SSN] DEFAULT (N'n/a'),
[ExecutionGUID] [uniqueidentifier] NOT NULL,
[RowRecordedDate] [datetime] NOT NULL CONSTRAINT [DF_dimMember_RowRecordedDate] DEFAULT (getdate()),
[isCurrent] [bit] NOT NULL CONSTRAINT [DF_dimMember_isCurrent] DEFAULT ((1)),
[SubscriberCode] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_SubscriberId] DEFAULT (N'n/a'),
[RowUpdatedDate] [datetime] NOT NULL CONSTRAINT [DF_dimMember_RowUpdatedDate] DEFAULT (getdate()),
[dimDayIdSubscriberStart] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdSubscriberStart] DEFAULT ((1)),
[dimDayIdSubscriberEnd] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdSubscriberEnd] DEFAULT ((99991231)),
[Race] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_Race] DEFAULT (N'n/a'),
[Ethnicity] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_Ethnicity] DEFAULT (N'n/a'),
[PrimaryLanguage] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_PrimaryLanguage] DEFAULT (N'n/a'),
[CurrentAddress1] [nvarchar](250) NOT NULL CONSTRAINT [DF_dimMember_CurrentAddress] DEFAULT (N'n/a'),
[CurrentAddress2] [nvarchar](250) NOT NULL CONSTRAINT [DF_dimMember_CurrentAddress2] DEFAULT (N'n/a'),
[CurrentZipCode] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_CurrentZipCode] DEFAULT (N'n/a'),
CONSTRAINT [PK_dimMember] PRIMARY KEY CLUSTERED
Based on the way the data loads into the table there's a possibility of some records being near duplicates of each other. For example, we can have a member that has records that have the same first name, last name, SSN, but different addresses, membercodes, subscribercode etc... This can happen in pretty much any variation thereof.
What I want to do, is add a new column and use that to group the similar records under based on comparing on several columns. By this I mean, if a member matches 4 of the 7 values below with another member, we would group these:
First Name (1st 3 characters)
Last Name
DOB
CurrentAddress1
MemberCode
SSN
SubscriberCode
I'm at a loss of how to structure the SQL to update the new column in the table. Any help is appreciated.
August 26, 2015 at 5:40 pm
Off topic a bit, I'm not sure why anyone in the right mind would store an SSN in a Unicode data type and, unless the fact that it's encrypted had just not been posted, why anyone if the right mind would store unencrypted SSNs. To be absolutely blunt, you have all the PII that hackers need in one table including unencrypted SSNs and if you get broken into and that information gets away from the company, the company can be sued into oblivion. If anyone in your company thinks (obvious play on words there) that it's ok to store SSNs in clear text, then ask them to enter all of their information, including their real SSN, into the table as a sign of good faith. I'll just bet no one takes you up on that and for good reason. It's not right and it's not safe.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 26, 2015 at 5:51 pm
Jeff Moden (8/26/2015)
Off topic a bit, I'm not sure why anyone in the right mind would store an SSN in a Unicode data type and, unless the fact that it's encrypted had just not been posted, why anyone if the right mind would store unencrypted SSNs. To be absolutely blunt, you have all the PII that hackers need in one table including unencrypted SSNs and if you get broken into and that information gets away from the company, the company can be sued into oblivion. If anyone in your company thinks (obvious play on words there) that it's ok to store SSNs in clear text, then ask them to enter all of their information, including their real SSN, into the table as a sign of good faith. I'll just bet no one takes you up on that and for good reason. It's not right and it's not safe.
You're preaching to the choir on that. The team I work on is all pretty new to this company (all joined in the last 6 months) and we all see this as an issue. We didn't have a say in the table creation and are trying our best to fix the existing problems as quickly as we can. That is high on the list of things to fix.
The main issue we have, is that while we can encrypt the SSN on our reporting database, the OLTP system we get the data (written and supported by a 3rd party vendor) from does not encrypt the SSN on their end.
August 26, 2015 at 7:37 pm
craig.bobchin (8/26/2015)
Jeff Moden (8/26/2015)
Off topic a bit, I'm not sure why anyone in the right mind would store an SSN in a Unicode data type and, unless the fact that it's encrypted had just not been posted, why anyone if the right mind would store unencrypted SSNs. To be absolutely blunt, you have all the PII that hackers need in one table including unencrypted SSNs and if you get broken into and that information gets away from the company, the company can be sued into oblivion. If anyone in your company thinks (obvious play on words there) that it's ok to store SSNs in clear text, then ask them to enter all of their information, including their real SSN, into the table as a sign of good faith. I'll just bet no one takes you up on that and for good reason. It's not right and it's not safe.You're preaching to the choir on that. The team I work on is all pretty new to this company (all joined in the last 6 months) and we all see this as an issue. We didn't have a say in the table creation and are trying our best to fix the existing problems as quickly as we can. That is high on the list of things to fix.
The main issue we have, is that while we can encrypt the SSN on our reporting database, the OLTP system we get the data (written and supported by a 3rd party vendor) from does not encrypt the SSN on their end.
Been there and done that. No fun at all yet lots of fun trying to fix what a whole 'nuther team managed to mess up. And, yeah... 3rd party vendors are a joy, aren't they?
Getting back to your problem... I see that the SSN column is NOT NULL but does have a default. For those that do have an SSN, I'd base the near duplicates on SSN only to determine the near duplicates.
As for the rest of them and the 4 of 7 grouping, I did such a similar thing a couple of years ago and it worked very quickly. I have to see if I can find the code in my archives because it was the only time I had to do such a thing so far. IIRC, it used cascading CTEs but I don't remember the particulars.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 26, 2015 at 8:27 pm
I found the code I was looking for and, unfortunately, it solves for a problem quite different that I thought I remembered and doesn't come close to solving this problem.
However, I do have an idea on how to solve this one using a CTE that contains a CROSS JOIN and 6 additive case statements with an outer query that looks for a count of 4 or more and does a group by one of the IDs. I need to test it. How many rows does your original table have in it?
--Jeff Moden
Change is inevitable... Change for the better is not.
August 27, 2015 at 8:40 am
I've done something very similar but it far from fast, especially on larger sets.
The following is a simplified version of what I did in the past but it does illustrate the premise.
(I apologize in advance for the lack of better comments...)
-- Create the necessary tables
IF OBJECT_ID('tempdb..#People') IS NOT NULL
DROP TABLE #People;
CREATE TABLE #People (
PersonID INT IDENTITY(1,1),
FirstName VARCHAR(20),
LastName VARCHAR(20),
SSN VARCHAR(9),
Address_1 VARCHAR(30),
Address_2 VARCHAR(30),
City VARCHAR(30),
[State] CHAR(2),
Zip CHAR(5),
GroupID INT,
MatchQuality INT
);
IF OBJECT_ID('tempdb..#MatchScores') IS NOT NULL
DROP TABLE #MatchScores;
CREATE TABLE #MatchScores (
ParentID INT,
ChildID INT,
FirstNameScore INT,
LastNameScore INT,
SSN_Score INT,
Add1_Score INT,
Add2_Score INT,
CityScore INT,
StateScore INT,
ZipScore INT,
TotalScore INT
);
IF OBJECT_ID('tempdb..#BestMatches') IS NOT NULL
DROP TABLE #BestMatches;
CREATE TABLE #BestMatches (
ParentID INT,
ChildID INT,
BestScore INT,
Lft INT,
Rgt INT
);
INSERT #People (FirstName,LastName,SSN,Address_1,Address_2,City,State,Zip) VALUES
('Bob','Jones','333224444','258 Grants Trail','#3','Bowling Green','KY','45459'),
('Robert','Jones','333224444','258 Grants Trail','Ste 3','Bowling Green','KY','45459'),
('Rob','Jones','333224444','258 Grants Tr.',NULL,'Bowling Green','KY','45459'),
('R.','Jones','333224444','258 Grants Tr.','Ste 3','Bowling Green','KY','45459'),
('Louis','Armstrong','258321478','123 Humming Bird Ln.',NULL,'Jacksonville','FL','32209'),
('L.','Armstrng','258321478','123 Humming Bird Lane',NULL,'Jacksonville','FL','32207'),
('Louie','Armstrong','258321478','123 Humming Bird Lain',NULL,'Jacksonville','FL','32209'),
('Louis','Armstong','258321478','123 Humming Bird Ln.',NULL,'Jacksonville','FL','32209'),
('Mark','Adams','321456987','555 Baymeadows Dr.','Unit 42','Ashville','NC','33304'),
('M.','Adams','321456987','555 Bay Meadows Dr.','# 42','Ashvlle','NC','33304'),
('Mark','Adams','321456987','555 Baymeadows Drive','Unit 42','Ashville','NC','33305'),
('Mark','Adams','321456987','555 Baymeadows Dr.','Unit 42','Ashville','NC','33306'),
('Bob','Jones','555229999','4227 Some Place Ln.','#3','Bowling Green','KY','45459'),
('Bob','Jones','555229999','4227 Some Place Ln.','# 3','Bowling Green','KY','45459'),
('Bob','Jones','555292999','4227 SomePlace Ln','Unit 3','Bowling Green','KY','45459'),
('Bob','Jones','555292999','4227 Some Place Ln.','Ste 3','Bowling Green','KY','45459'),
('Louis','Armstrong','147852369','123 Baldwin Ave.',NULL,'Jacksonville','FL','32209'),
('Louis','Armstrong','147852369','123 Baldwin Ave',NULL,'Jacksonville','FL','32209'),
('Louis','Armstrong','147852369','123 Baldwin Anenue',NULL,'Jachsonville','FL','32209'),
('Louis','Armstrong','147852639','123 Baldwin Ave.',NULL,'Jacksonville','FL','32209'),
('Mark','Adams','654789321','524 Main St.','Unit 42','Bakersville','NC','33304'),
('Mark','Adam','654789321','524 Main St.','Unit 42','Bakersville','NC','33304'),
('Mark','Adams','654789231','524 Main St.','Unit 42','Barkersville','NC','33304'),
('Mark','Adams','654789321','524 Main St.','Unit 42','Bakersville','NC','33304');
-- Take a look at the base data.
SELECT * FROM #People p;
WITH ColumnScores AS (-- Assign weights to the various attributes
SELECT
p1.PersonID AS ParentID,
p2.PersonID AS ChildID,
CASE WHEN p1.FirstName = p2.FirstName THEN 1 ELSE 0 END AS FirstNameScore,
CASE WHEN p1.LastName = p2.LastName THEN 2 ELSE 0 END AS LastNameScore,
CASE WHEN p1.SSN = p2.SSN THEN 6 ELSE 0 END AS SSN_Score,
CASE WHEN p1.Address_1 = p2.Address_1 THEN 2 ELSE 0 END AS Add1_Score,
CASE WHEN COALESCE(p1.Address_2, '') = COALESCE(p2.Address_2, '') THEN 1 ELSE 0 END AS Add2_Score,
CASE WHEN p1.City = p2.City THEN 1 ELSE 0 END AS CityScore,
CASE WHEN p1.State = p2.State THEN 1 ELSE 0 END AS StateScore,
CASE WHEN p1.Zip = p2.Zip THEN 1 ELSE 0 END AS ZipScore
FROM
#People p1
JOIN #People p2
ON p1.PersonID < p2.PersonID
)
INSERT #MatchScores (ParentID,ChildID,FirstNameScore,LastNameScore,SSN_Score,
Add1_Score,Add2_Score,CityScore,StateScore,ZipScore,TotalScore)
SELECT
cs.ParentID,
cs.ChildID,
cs.FirstNameScore,
cs.LastNameScore,
cs.SSN_Score,
cs.Add1_Score,
cs.Add2_Score,
cs.CityScore,
cs.StateScore,
cs.ZipScore,
cs.FirstNameScore + cs.LastNameScore + cs.SSN_Score + cs.Add1_Score +
cs.Add2_Score + cs.CityScore + cs.StateScore + cs.ZipScore AS TotalScore
FROM
ColumnScores cs
WHERE
cs.FirstNameScore + cs.LastNameScore + cs.SSN_Score + cs.Add1_Score +
cs.Add2_Score + cs.CityScore + cs.StateScore + cs.ZipScore > 7 -- Set a threashold for total match quality
ORDER BY
cs.ParentID;
--===============================================================
INSERT #BestMatches (ParentID,ChildID,BestScore)
SELECT
bs.ParentID,
bs.ChildID,
bs.TotalScore AS BestScore
FROM
(SELECT DISTINCT ms1.ChildID
FROM #MatchScores ms1) c
CROSS APPLY (
SELECT TOP 1
*
FROM #MatchScores ms2
WHERE c.ChildID = ms2.ChildID
ORDER BY ms2.TotalScore DESC
) bs;
INSERT #BestMatches (ParentID,ChildID,BestScore)
SELECT ParentID,ChildID,BestScore FROM (VALUES (NULL, 0, 100)) x (ParentID,ChildID,BestScore)
UNION ALL
SELECT DISTINCT
0 AS ParentID,
ms.ParentID AS ChildID,
100 AS BestScore
FROM #MatchScores ms
WHERE NOT EXISTS (SELECT 1 FROM #BestMatches bm WHERE ms.ParentID = bm.ChildID);
--==========================================================================
-- Use nested sets to create the groups
EXEC dbo.CreateNestedSets-- This proc is simply a dynamic version of Jeff Modens "Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets" script.
@TableName = '#BestMatches',
@ChildColName = 'ChildID',
@ParentColName = 'ParentID',
@LeftColName = 'Lft',
@RightColName = 'Rgt',
@DeBug = 0;
DELETE #BestMatches WHERE ParentID IS NULL;
--==========================================================================
UPDATE p SET
p.GroupID = bm.GroupID,
p.MatchQuality = bm.BestScore
FROM
#People p
JOIN (
SELECT
Groups.ChildID AS GroupID,
bm2.ChildID,
bm2.BestScore
FROM (
SELECT
bm1.ChildID,
bm1.Lft,
bm1.Rgt
FROM
#BestMatches bm1
WHERE
bm1.ParentID = 0
) Groups
JOIN #BestMatches bm2
ON bm2.Lft BETWEEN Groups.Lft AND Groups.Rgt
) bm
ON p.PersonID = bm.ChildID;
SELECT * FROM #People p;
The results...
PersonID FirstName LastName SSN Address_1 Address_2 City State Zip GroupID MatchQuality
----------- -------------------- -------------------- --------- ------------------------------ ------------------------------ ------------------------------ ----- ----- ----------- ------------
1 Bob Jones 333224444 258 Grants Trail #3 Bowling Green KY 45459 1 100
2 Robert Jones 333224444 258 Grants Trail Ste 3 Bowling Green KY 45459 1 13
3 Rob Jones 333224444 258 Grants Tr. NULL Bowling Green KY 45459 1 11
4 R. Jones 333224444 258 Grants Tr. Ste 3 Bowling Green KY 45459 1 13
5 Louis Armstrong 258321478 123 Humming Bird Ln. NULL Jacksonville FL 32209 5 100
6 L. Armstrng 258321478 123 Humming Bird Lane NULL Jacksonville FL 32207 5 9
7 Louie Armstrong 258321478 123 Humming Bird Lain NULL Jacksonville FL 32209 5 12
8 Louis Armstong 258321478 123 Humming Bird Ln. NULL Jacksonville FL 32209 5 13
9 Mark Adams 321456987 555 Baymeadows Dr. Unit 42 Ashville NC 33304 9 100
10 M. Adams 321456987 555 Bay Meadows Dr. # 42 Ashvlle NC 33304 9 10
11 Mark Adams 321456987 555 Baymeadows Drive Unit 42 Ashville NC 33305 9 12
12 Mark Adams 321456987 555 Baymeadows Dr. Unit 42 Ashville NC 33306 9 14
13 Bob Jones 555229999 4227 Some Place Ln. #3 Bowling Green KY 45459 13 100
14 Bob Jones 555229999 4227 Some Place Ln. # 3 Bowling Green KY 45459 13 14
15 Bob Jones 555292999 4227 SomePlace Ln Unit 3 Bowling Green KY 45459 15 100
16 Bob Jones 555292999 4227 Some Place Ln. Ste 3 Bowling Green KY 45459 15 12
17 Louis Armstrong 147852369 123 Baldwin Ave. NULL Jacksonville FL 32209 17 100
18 Louis Armstrong 147852369 123 Baldwin Ave NULL Jacksonville FL 32209 17 13
19 Louis Armstrong 147852369 123 Baldwin Anenue NULL Jachsonville FL 32209 17 12
20 Louis Armstrong 147852639 123 Baldwin Ave. NULL Jacksonville FL 32209 17 9
21 Mark Adams 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 100
22 Mark Adam 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 13
23 Mark Adams 654789231 524 Main St. Unit 42 Barkersville NC 33304 21 8
24 Mark Adams 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 15
Note... Setting the weights is the important part. We spent a lot of time tweaking the various values and the total threshold to get the best possible matches based on our own data...
The proc that I use to set the left and right bowers is attached a .txt file.
HTH,
Jason
August 27, 2015 at 11:51 am
Thanks. That looks very similar to what I'm looking for. I'll give it a try.
Another though I had was doing a fuzzy lookup in SSIS when I load the table. I'll try both ways and see which works best.
August 27, 2015 at 11:52 am
Jeff Moden (8/26/2015)
I found the code I was looking for and, unfortunately, it solves for a problem quite different that I thought I remembered and doesn't come close to solving this problem.However, I do have an idea on how to solve this one using a CTE that contains a CROSS JOIN and 6 additive case statements with an outer query that looks for a count of 4 or more and does a group by one of the IDs. I need to test it. How many rows does your original table have in it?
We only have about 30k records, so not an overwhelming about of records.
August 27, 2015 at 1:41 pm
To me it seems fairly easy to match single rows. Maybe you can just output those matches to a table and then consolidate the like ids using that result.
SELECT m1.dimMemberId, m2.dimMemberId
FROM dimMember m1
INNER JOIN dimMember m2 ON
m1.dimMemberId < m2.dimMemberId AND
CASE WHEN LEFT(m1.FirstName, 3) = LEFT(m2.FirstName, 3) THEN 1 ELSE 0 END +
CASE WHEN m1.LastName = m2.LastName THEN 1 ELSE 0 END +
CASE WHEN m1.DOB = m2.DOB THEN 1 ELSE 0 END +
CASE WHEN m1.CurrentAddress1 = m2.CurrentAddress1 THEN 1 ELSE 0 END +
CASE WHEN m1.MemberCode = m2.MemberCode THEN 1 ELSE 0 END +
CASE WHEN m1.SSN = m2.SSN THEN 1 ELSE 0 END +
CASE WHEN m1.SubscriberCode = m2.SubscriberCode THEN 1 ELSE 0 END
>= 4
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
August 27, 2015 at 3:05 pm
ScottPletcher (8/27/2015)
To me it seems fairly easy to match single rows. Maybe you can just output those matches to a table and then consolidate the like ids using that result.
SELECT m1.dimMemberId, m2.dimMemberId
FROM dimMember m1
INNER JOIN dimMember m2 ON
m1.dimMemberId < m2.dimMemberId AND
CASE WHEN LEFT(m1.FirstName, 3) = LEFT(m2.FirstName, 3) THEN 1 ELSE 0 END +
CASE WHEN m1.LastName = m2.LastName THEN 1 ELSE 0 END +
CASE WHEN m1.DOB = m2.DOB THEN 1 ELSE 0 END +
CASE WHEN m1.CurrentAddress1 = m2.CurrentAddress1 THEN 1 ELSE 0 END +
CASE WHEN m1.MemberCode = m2.MemberCode THEN 1 ELSE 0 END +
CASE WHEN m1.SSN = m2.SSN THEN 1 ELSE 0 END +
CASE WHEN m1.SubscriberCode = m2.SubscriberCode THEN 1 ELSE 0 END
>= 4
This is pretty close to what I'm looking for. I added your join login in the select as a new column called Score. Now I just need to figure out how to assign a common value to a new field based on the matches.
August 27, 2015 at 7:39 pm
craig.bobchin (8/27/2015)
ScottPletcher (8/27/2015)
To me it seems fairly easy to match single rows. Maybe you can just output those matches to a table and then consolidate the like ids using that result.
SELECT m1.dimMemberId, m2.dimMemberId
FROM dimMember m1
INNER JOIN dimMember m2 ON
m1.dimMemberId < m2.dimMemberId AND
CASE WHEN LEFT(m1.FirstName, 3) = LEFT(m2.FirstName, 3) THEN 1 ELSE 0 END +
CASE WHEN m1.LastName = m2.LastName THEN 1 ELSE 0 END +
CASE WHEN m1.DOB = m2.DOB THEN 1 ELSE 0 END +
CASE WHEN m1.CurrentAddress1 = m2.CurrentAddress1 THEN 1 ELSE 0 END +
CASE WHEN m1.MemberCode = m2.MemberCode THEN 1 ELSE 0 END +
CASE WHEN m1.SSN = m2.SSN THEN 1 ELSE 0 END +
CASE WHEN m1.SubscriberCode = m2.SubscriberCode THEN 1 ELSE 0 END
>= 4
This is pretty close to what I'm looking for. I added your join login in the select as a new column called Score. Now I just need to figure out how to assign a common value to a new field based on the matches.
Scott did it pretty much as I described including being able to eliminate roughly half the work by using a Triangular Join instead of a Full (Square) Cartesian Join.
The question now is, what do you mean by a "common value"?
--Jeff Moden
Change is inevitable... Change for the better is not.
August 27, 2015 at 9:36 pm
Jeff Moden (8/27/2015)
craig.bobchin (8/27/2015)
ScottPletcher (8/27/2015)
To me it seems fairly easy to match single rows. Maybe you can just output those matches to a table and then consolidate the like ids using that result.
SELECT m1.dimMemberId, m2.dimMemberId
FROM dimMember m1
INNER JOIN dimMember m2 ON
m1.dimMemberId < m2.dimMemberId AND
CASE WHEN LEFT(m1.FirstName, 3) = LEFT(m2.FirstName, 3) THEN 1 ELSE 0 END +
CASE WHEN m1.LastName = m2.LastName THEN 1 ELSE 0 END +
CASE WHEN m1.DOB = m2.DOB THEN 1 ELSE 0 END +
CASE WHEN m1.CurrentAddress1 = m2.CurrentAddress1 THEN 1 ELSE 0 END +
CASE WHEN m1.MemberCode = m2.MemberCode THEN 1 ELSE 0 END +
CASE WHEN m1.SSN = m2.SSN THEN 1 ELSE 0 END +
CASE WHEN m1.SubscriberCode = m2.SubscriberCode THEN 1 ELSE 0 END
>= 4
This is pretty close to what I'm looking for. I added your join login in the select as a new column called Score. Now I just need to figure out how to assign a common value to a new field based on the matches.
Scott did it pretty much as I described including being able to eliminate roughly half the work by using a Triangular Join instead of a Full (Square) Cartesian Join.
The question now is, what do you mean by a "common value"?
By Common Value I mean the following:
This is a real example of the data I'm looking at the data is changed but the N/As are correct:
MemberIDSubscriberCodeMembercodeSSNFirstnamelastnameAddress
123U12345N/A999-999-9999JohnDoe123 main St Los Angles CA
124U12346NAN/AJohnDoe123 main St Los Angles CA
5764U12346x0987N/AJohnDoeN/A
1126U12348x0987999-999-9999n/an/an/a
12n/aNA999-999-9999JohnDoe427 main St Los Angles CA
The Common Value would be a new Unique ID for that group of records (normal Single records would also get a unique one)
MemberIDSubscriberCodeMembercodeSSNFirstnamelastnameAddressNewID (Common Value)
123U12345N/A999-999-9999JohnDoe123 main St Los Angles CACH1234
124U12346NAN/AJohnDoe123 main St Los Angles CACH1234
5764U12346x0987N/AJohnDoeN/ACH1234
1126U12348x0987999-999-9999n/an/an/aCH1234
12n/aNA999-999-9999JohnDoe427 main St Los Angles CACH1234
3254H12342345111-111-1112JaneSmithn/aCH2398
The last record in this is a regular record and has it's own unique NewID(Common Value) while in the group above, all the associated records have the same NewID (Common Value)
I had trouble formatting the data into a table, but that is the idea.
August 27, 2015 at 10:21 pm
The code I posted earlier shows how to handle the "how to assign a common value to a new field based on the matches" problem.
The idea is to find the "best" single parent/child match that exists for any given "ChildID". (Note: every row is a "child" of every preceding row and every row is a "parent" of every row that follows.) It's just a matter of finding the best parent for each child.
The rows that don't have parents (there are no rows with, the PersonID in the ChildID position, with a score high enough to meet the match threshold) end up being the "GroupID" rows. Simply put, they are the 1st occurrence of a unique individual.
This causes a natural adjacency list type hierarchy, which can be converted to a nested set hierarchy. (Easily accomplished thanks to Jeff Moden[/url])
Once the Left & Right values are updated, they are used to define a "Group".
For example...
PersonID = 1 isn't the child of any other rows so it becomes the start of a group.
PersonID = 2 matches to PersonID = 1 with a score of 13 (above the threshold) so it will be a part of GroupID = 1
PersonID = 3 is matched equally to both 1 & 2 with a score of 11 (again above the threshold) making it part of Group 1. (note... It doesn't matter which of the "parents" is selected. They are equally good.)
PersonID = 4 is matched to 1,2,&3 but the best match is to 3 with a score of 13 (above the threshold) again making it part of Group 1.
PersonID = 5 is matched to 1,2,3&4 but none of those scores are high enough to meet the threshold, so it becomes the 1st member of Group 5.
... and so on...
Applying the Lft & Rht values looks something like this...
ParentIDChildIDBestScoreLftRgt
0110029
121334
131158
341367
051001017
Everything between 2 & 9 is part of Group 1
Everything between 10 & 17 is part of group 5...
August 27, 2015 at 11:31 pm
I think your looking at the problem the wrong way. using t-sql to do fuzzy lookups is kind of tricky since the db is not really suited to doing pattern matches. Sure you can write really complicated code but it still wont be bullet proof. Have to explored using the Fuzzy grouping task in SSIS.
All you might need is to schedule the package to run frequently.
August 27, 2015 at 11:50 pm
Jayanth_Kurup (8/27/2015)
I think your looking at the problem the wrong way. using t-sql to do fuzzy lookups is kind of tricky since the db is not really suited to doing pattern matches. Sure you can write really complicated code but it still wont be bullet proof. Have to explored using the Fuzzy grouping task in SSIS.All you might need is to schedule the package to run frequently.
Yes I'm looking at Fuzzy Grouping in SSIS as well. In fact that was my first idea. One of my co-workers was thinking an SQL Stored Proc might be a better alternative so I'm investigating both. I do have a package built, but I'm stuck on how to assign the new column to the groupings. I've not had a need to use either fuzzy grouping and aggregate tasks in SSIS before this.
I'll take a look at the video You suggested and see if it answers my questions.
Viewing 15 posts - 1 through 15 (of 33 total)
You must be logged in to reply to this topic. Login to reply