December 22, 2008 at 7:39 pm
Hi,
I'm struggling with the best approach to idenifying records from a single that should match (are equal) according to our business rules. Any help would be appreciated.
I have single a table that contains data describing individuals:
SourceID FirstName SecondName DateOfBirth
1 Jack Jones 10/08/2000
2 Jack Jones 10/08/2000
3 Jack J 10/08/2000
4 Jack Jones 01/08/2000
5 Bob Jones 05/11/1980
Business Rules to identify whether records reflect the same person:
Firstname - must be equal
SecondName - must be equal or intial only
DateOfBirth - minimum 6 of 8 Eg. Year+Month or Year+Day is ok
Applying the business rules to the above sample, records with a SourceID of 1-4 reflect the same person and I would treat as a match (& insert an identifier into a matched table).
What is the most efficient (there will be large volumes of data) way to identify these records?
Thanks
December 22, 2008 at 8:32 pm
Well it would depend on a lot of stuff that you haven't told us yet, but probably something like this:
Create Table #temp(
BirthYear smallint
, BirthMonth smallint
, BirthDay smallint
, FirstName varchar(24)
, LI char(1)
, RestOfName varchar(24)
, ID int
, Primary Key (BirthYear, LI, FirstName, ID)
)
Insert into #temp
Select Year(DateOfBirth)
, Month(DateOfBirth)
, Day(DateOfBirth)
, FirstName
, Left(SecondName, 1)
, Substring(SecondName,2)
, ID
From YourTable
INSERT into DuplNameIDs
Select ID
From #temp t
Where Exists( Select * From #temp t2
Where t2.BirthYear = t.BirthYear
And t2.LI = t.LI
And t2.FirstName = t.FirstName
And t2.ID <> t.ID
And (t2.BirthMonth = t.BirthMonth
Or t2.BirthDay = t.BirthDay)
And (t2.RestOfName = ''
Or t.RestOfName = ''
Or t2.RestOfName = t.RestOfName)
)
go
Drop table #temp
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
December 22, 2008 at 9:05 pm
You have a few things to check here and I might break it down into a couple queries and union them.
Each would join rows with first name matching.
1. would join last name matching and month+year
2. would join last name matching and month+day
1. would join last name matching last initial and month+year
1. would join last name matching last initial and month+day
Barry has given you a start there. Try that and see if it helps.
December 23, 2008 at 1:17 pm
That has been a big help. Thanks guys.
December 23, 2008 at 1:37 pm
Glad we could help. Be sure to let us know how it works out.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
January 28, 2009 at 1:18 pm
Well, it certainly works...but I'm finding the performance is poor under large volumes. Indexing & optimization is a bit of dark art for me at this stage, could you recommend an indexing strategy for the following query to get me moving in the right direction?
NB. There are 500,000 records in VwMatch_Base
SELECT TOP (100) PERCENT t .SubscriberID, t .CustomerID, t .Surname, t .Firstname, t .SecondName, t .Initial, t .SecondName_RestOf, t .DATEOFBIRTH,
t .GenderID, t .PostCode, t .StreetName6, dbo.tblMatch.MatchID, ROW_NUMBER() OVER (PARTITION BY t .Surname, t .Firstname, t .DATEOFBIRTH,
t .PostCode
ORDER BY t .Surname ASC) AS GroupID
FROM dbo.VwMatch_Base AS t LEFT OUTER JOIN
dbo.tblMatch ON t .CustomerID = dbo.tblMatch.CustomerID AND t .SubscriberID = dbo.tblMatch.SubscriberID
WHERE EXISTS
(SELECT SubscriberID, CustomerID, Surname, Firstname, SecondName, DATEOFBIRTH, GenderID, StreetName, Initial, SecondName_RestOf,
YearMonth, YearDay, StreetName6, PostCode
FROM dbo.VwMatch_Base AS t2
WHERE (Surname = t .Surname) AND (Initial = t .Initial) AND (Firstname = t .Firstname) AND (CustomerID <> t .CustomerID) AND
(SubscriberID <> t .SubscriberID) AND (YearMonth = t .YearMonth OR
YearDay = t .YearDay) AND (SecondName_RestOf = '' OR
t .SecondName_RestOf = '' OR
SecondName_RestOf = t .SecondName_RestOf) AND (GenderID = t .GenderID) AND (t .PostCode = PostCode))
CREATE TABLE [dbo].[vwMatch_Base](
[SubscriberID] [nvarchar](4) NOT NULL,
[CustomerID] [nvarchar](25) NOT NULL,
[Surname] [nvarchar](50) NOT NULL,
[Firstname] [nvarchar](50) NOT NULL,
[SecondName] [nvarchar](50) NULL,
[DateOfBirth] [datetime] NOT NULL,
[GenderID] [nchar](1) NOT NULL,
[StreetName] [nchar](10) NOT NULL,
[PostCode] [smallint] NOT NULL,
[Initial] [nvarchar](1) NULL,
[SecondName_RestOf] [nvarchar](35) NULL,
[YearMonth] [int] NULL,
[YearDay] [int] NULL,
[StreetName6] [nvarchar](6) NULL
) ON [PRIMARY]
CREATE TABLE [dbo].[tblMatch](
[MatchID] [int] NOT NULL,
[SubscriberID] [nvarchar](4) NOT NULL,
[CustomerID] [nvarchar](25) NOT NULL,
[Date] [datetime] NOT NULL,
[UserID] [nvarchar](10) NOT NULL,
CONSTRAINT [PK_tblMatch] PRIMARY KEY CLUSTERED
(
[MatchID] ASC,
[SubscriberID] ASC,
[CustomerID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
January 28, 2009 at 2:28 pm
I wonder if anyone realizes just how many MANNY COSTA's there are in Bristol, RI. 2 out of 3 on the date of birth just isn't going to get it. Someone needs to convince the designers of the sword they are about to fall on.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 30, 2009 at 9:12 am
simon (1/28/2009)
Well, it certainly works...but I'm finding the performance is poor under large volumes. Indexing & optimization is a bit of dark art for me at this stage, could you recommend an indexing strategy for the following query to get me moving in the right direction?
Can you post the XML query plan as a .TXT attachement? The actual plan is preferred if you can get it.
Also, how long does it take to run now?
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
January 30, 2009 at 9:49 am
Hmm, I notice some strange things here: you have a table named vwMatch_Base, which looks more like the name of a view. Also, it has no keys or indexes listed. So I have to ask: is this really a table or a view?
If it is a view, you will not be able to performance tune your query effectively while ignoring that fact.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
January 30, 2009 at 10:03 am
OK, if vwMatch_Base really is a table, then you might try applying a Clustered Index with the following columns:
Surname, PostCode, Firstname, YearMonth, YearDay
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply