The best way to identify matching records in a single table

  • Hi,

    I'm struggling with the best approach to idenifying records from a single that should match (are equal) according to our business rules. Any help would be appreciated.

    I have single a table that contains data describing individuals:

    SourceID FirstName SecondName DateOfBirth

    1 Jack Jones 10/08/2000

    2 Jack Jones 10/08/2000

    3 Jack J 10/08/2000

    4 Jack Jones 01/08/2000

    5 Bob Jones 05/11/1980

    Business Rules to identify whether records reflect the same person:

    Firstname - must be equal

    SecondName - must be equal or intial only

    DateOfBirth - minimum 6 of 8 Eg. Year+Month or Year+Day is ok

    Applying the business rules to the above sample, records with a SourceID of 1-4 reflect the same person and I would treat as a match (& insert an identifier into a matched table).

    What is the most efficient (there will be large volumes of data) way to identify these records?

    Thanks

  • Well it would depend on a lot of stuff that you haven't told us yet, but probably something like this:

    Create Table #temp(

    BirthYear smallint

    , BirthMonth smallint

    , BirthDay smallint

    , FirstName varchar(24)

    , LI char(1)

    , RestOfName varchar(24)

    , ID int

    , Primary Key (BirthYear, LI, FirstName, ID)

    )

    Insert into #temp

    Select Year(DateOfBirth)

    , Month(DateOfBirth)

    , Day(DateOfBirth)

    , FirstName

    , Left(SecondName, 1)

    , Substring(SecondName,2)

    , ID

    From YourTable

    INSERT into DuplNameIDs

    Select ID

    From #temp t

    Where Exists( Select * From #temp t2

    Where t2.BirthYear = t.BirthYear

    And t2.LI = t.LI

    And t2.FirstName = t.FirstName

    And t2.ID <> t.ID

    And (t2.BirthMonth = t.BirthMonth

    Or t2.BirthDay = t.BirthDay)

    And (t2.RestOfName = ''

    Or t.RestOfName = ''

    Or t2.RestOfName = t.RestOfName)

    )

    go

    Drop table #temp

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • You have a few things to check here and I might break it down into a couple queries and union them.

    Each would join rows with first name matching.

    1. would join last name matching and month+year

    2. would join last name matching and month+day

    1. would join last name matching last initial and month+year

    1. would join last name matching last initial and month+day

    Barry has given you a start there. Try that and see if it helps.

  • That has been a big help. Thanks guys.

  • Glad we could help. Be sure to let us know how it works out.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Well, it certainly works...but I'm finding the performance is poor under large volumes. Indexing & optimization is a bit of dark art for me at this stage, could you recommend an indexing strategy for the following query to get me moving in the right direction?

    NB. There are 500,000 records in VwMatch_Base

    SELECT TOP (100) PERCENT t .SubscriberID, t .CustomerID, t .Surname, t .Firstname, t .SecondName, t .Initial, t .SecondName_RestOf, t .DATEOFBIRTH,

    t .GenderID, t .PostCode, t .StreetName6, dbo.tblMatch.MatchID, ROW_NUMBER() OVER (PARTITION BY t .Surname, t .Firstname, t .DATEOFBIRTH,

    t .PostCode

    ORDER BY t .Surname ASC) AS GroupID

    FROM dbo.VwMatch_Base AS t LEFT OUTER JOIN

    dbo.tblMatch ON t .CustomerID = dbo.tblMatch.CustomerID AND t .SubscriberID = dbo.tblMatch.SubscriberID

    WHERE EXISTS

    (SELECT SubscriberID, CustomerID, Surname, Firstname, SecondName, DATEOFBIRTH, GenderID, StreetName, Initial, SecondName_RestOf,

    YearMonth, YearDay, StreetName6, PostCode

    FROM dbo.VwMatch_Base AS t2

    WHERE (Surname = t .Surname) AND (Initial = t .Initial) AND (Firstname = t .Firstname) AND (CustomerID <> t .CustomerID) AND

    (SubscriberID <> t .SubscriberID) AND (YearMonth = t .YearMonth OR

    YearDay = t .YearDay) AND (SecondName_RestOf = '' OR

    t .SecondName_RestOf = '' OR

    SecondName_RestOf = t .SecondName_RestOf) AND (GenderID = t .GenderID) AND (t .PostCode = PostCode))

    CREATE TABLE [dbo].[vwMatch_Base](

    [SubscriberID] [nvarchar](4) NOT NULL,

    [CustomerID] [nvarchar](25) NOT NULL,

    [Surname] [nvarchar](50) NOT NULL,

    [Firstname] [nvarchar](50) NOT NULL,

    [SecondName] [nvarchar](50) NULL,

    [DateOfBirth] [datetime] NOT NULL,

    [GenderID] [nchar](1) NOT NULL,

    [StreetName] [nchar](10) NOT NULL,

    [PostCode] [smallint] NOT NULL,

    [Initial] [nvarchar](1) NULL,

    [SecondName_RestOf] [nvarchar](35) NULL,

    [YearMonth] [int] NULL,

    [YearDay] [int] NULL,

    [StreetName6] [nvarchar](6) NULL

    ) ON [PRIMARY]

    CREATE TABLE [dbo].[tblMatch](

    [MatchID] [int] NOT NULL,

    [SubscriberID] [nvarchar](4) NOT NULL,

    [CustomerID] [nvarchar](25) NOT NULL,

    [Date] [datetime] NOT NULL,

    [UserID] [nvarchar](10) NOT NULL,

    CONSTRAINT [PK_tblMatch] PRIMARY KEY CLUSTERED

    (

    [MatchID] ASC,

    [SubscriberID] ASC,

    [CustomerID] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

  • I wonder if anyone realizes just how many MANNY COSTA's there are in Bristol, RI. 2 out of 3 on the date of birth just isn't going to get it. Someone needs to convince the designers of the sword they are about to fall on.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • simon (1/28/2009)


    Well, it certainly works...but I'm finding the performance is poor under large volumes. Indexing & optimization is a bit of dark art for me at this stage, could you recommend an indexing strategy for the following query to get me moving in the right direction?

    Can you post the XML query plan as a .TXT attachement? The actual plan is preferred if you can get it.

    Also, how long does it take to run now?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Hmm, I notice some strange things here: you have a table named vwMatch_Base, which looks more like the name of a view. Also, it has no keys or indexes listed. So I have to ask: is this really a table or a view?

    If it is a view, you will not be able to performance tune your query effectively while ignoring that fact.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • OK, if vwMatch_Base really is a table, then you might try applying a Clustered Index with the following columns:

    Surname, PostCode, Firstname, YearMonth, YearDay

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply