Name variation matching

Question

Name variation matching

sperry-750868

SSCommitted

Points: 1505
More actions
May 15, 2009 at 5:48 am

#95489

We have a number of jobs that perform matches across different tables. The initial match is based on the forename and surname of the person in question being equal. If the forename and surname is the same, a record is created in another table and further matching is performed to give a final match rating.
We have just purchased a database containing million of name variations. These are in two tables (one for forenames and one for surnames) with the following columns:
•Source_Name – This contains a list of Names that you can search on (E.g. ‘Steve’)
•Name_Variation – This list the variation of the sources name. Each variation is on a new row. (I.e. there may be 2 rows with the source name ‘Steve’. One with the variation of ‘Stephen’ and the other with the variation ‘Steven’)
•Score – This is a rating system with 100 being a close match and 75 being a loose match.
I am having trouble getting the matching script to return results in an efficient manner. I believe that this is caused by the cross join that I have used in the below script. Is there a more efficient way of doing it?
SELECT Distinct
AVP.ID,
PER.PersonID,
1,
1
FROM PERSON P
Cross JOIN C6..PERSON_SUBSET PER
WHERE P.Forename <> ''
AND
P.Surname <> ''
And (P.Forename = PER.EnglishForename Or PER.PERSONID in (
Select PersonID
From C6..Person_Subset
Where EnglishForename In
(Select Name_Variation
from Names_Database..Forename
Where [Source_Name] = P.Forename and Score >= 95)))
AND (P.Surname = PER.EnglishSurname OR PER.PERSONID in (
Select PersonID
From C6..Person_Subset
Where EnglishSurname in
(Select Name_Variation
from Names_Database..Surname
Where [Source_Name] = P.Surname and Score >= 95)))

Viewing 15 posts - 1 through 15 (of 29 total)

You must be logged in to reply to this topic. Login to reply

Samuel Vella SSCrazy Eights Points: 8055 More actions · Answer 1

SSIS has fuzzy lookup and fuzzy grouping which will find potential matches, create a link between the two, score the matching and allow you to give give certain cut off levels for the matching score. e.g. only match on greater than 0.95

It will probably be a lot easier to switch to the SSIS method than use T-SQL

sperry-750868 SSCommitted Points: 1505 More actions · Answer 2

Thanks for your answer Samuel.

I know its a bit cheeky, but would you be able to provide some more information as to how i would use the fuzzy transformation to create the same results set as the t sql query.

I am not an SSIS novice but not exactly an expert.

Thanks

Samuel Vella SSCrazy Eights Points: 8055 More actions · Answer 3

All my Fuzzy lookup packages are on laptop which is at home...

Try these instead

have a look here:

http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services

and here:

http://blogs.conchango.com/jamiethomson/archive/2005/03/30/SSIS_3A00_-Adventures-with-Fuzzy-Matching.aspx

sperry-750868 SSCommitted Points: 1505 More actions · Answer 4

Thanks for the links. I have had a look and i do not think that the fuzzy matching is the correct route for me to go down.

The names table already contains the explicit links between names and their variations. Also while the fuzzy searching would be able to catch things like 'Stephen' and 'Steven' it would have a harder time with names like 'Richard' and 'Dick'.

FelixG SSCrazy Points: 2550 More actions · Answer 5

I actually don't understand your query well, but without taking a deep look I advice to use INNER JOINS to substitute the IN clauses in your query .... the use of the IN causes real performance issues when used with big result sets....

sperry-750868 SSCommitted Points: 1505 More actions · Answer 6

Thanks, i will try that. It wont effect the query at the mo becuase its the initial cross join that is causing the problems. Also each of the sub selects after the IN's will only re returning around 10 names.

RBarryYoung SSC Guru Points: 143329 More actions · Answer 7

The CROSS JOIN isn't your problem, it's just an INNER JOIN that has had it's ON conditions moved down to the WHERE clause. It'll act the same.

Your real problem is that you have subqueries five deep.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

RBarryYoung SSC Guru Points: 143329 More actions · Answer 8

Hmmm, the multiple correlated subqueries of PERSON_SUBSET to itself are definitely a problem, and it does not appear that they are necessary.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

RBarryYoung SSC Guru Points: 143329 More actions · Answer 9

Hmm, also, this query is not valid:

sperry (5/15/2009)
SELECT Distinct AVP.ID, --<<< No AVP alias? PER.PersonID, 1, 1 FROM PERSON P Cross JOIN C6..PERSON_SUBSET PER WHERE P.Forename '' AND P.Surname '' And (P.Forename = PER.EnglishForename Or PER.PERSONID in ( ...

Notice that there is no AVP table? We cannot reorganize this if we do not know how this table is getting into the query.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

sperry-750868 SSCommitted Points: 1505 More actions · Answer 10

Sorry, my mistake. the AVP alias is the same as the P alias. I have changed the name of the table (and i thought all of the aliases) to remove potentially sensitive information.

Is the cross join not responsible for creating a larger the necessary data set?

RBarryYoung SSC Guru Points: 143329 More actions · Answer 11

sperry (5/15/2009)[hrIs the cross join not responsible for creating a larger the necessary data set?

No. Although syntactically it is a CROSS JOIN, logically it is not because the WHERE clause is full of P.* to PER.* conditions.

You performance issues are almost certainly related to those PERSON_SUBSET to PERSON_SUBSET correlated subqueries, which you probably do not need (I am working on it now).

It is also possible that you need to look at indexes, but lets get the query straight first...

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

RBarryYoung SSC Guru Points: 143329 More actions · Answer 12

Could you post the DDL (including keys & indexes) for the Surname and Forename tables?

thnx,

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

sperry-750868 SSCommitted Points: 1505 More actions · Answer 13

CREATE TABLE [dbo].[Surname](

[Source_Name] [varchar](20) NOT NULL,

[Name_Variation] [varchar](20) NOT NULL,

[Score] [int] NOT NULL

) ON [PRIMARY]

Both surname and forename tables are the same. There is a clustered index on the source_name column. There are no other keys or indexes

RBarryYoung SSC Guru Points: 143329 More actions · Answer 14

OK, try this and see if it is any better:

;WITH cteSurname95 as (

Select Name_Variation

From Names_Database..Surname

Where Score >= 95

), cteForeName95 as (

Select Name_Variation

From Names_Database..Forename

Where Score >= 95

)

SELECT Distinct

P.ID,

PER.PersonID,

1,

1

FROM PERSON P

INNER JOIN C6..PERSON_SUBSET PER

ON PER.EnglishForename IN (

Select P.Forename

UNION ALL

Select fn.Name_Variation

From cteForeName95 fn

Where fn.Source_Name = P.Forename

)

AND PER.EnglishSurname IN (

Select P.Surname

UNION ALL

Select fn.Name_Variation

From cteSurName95 fn

Where fn.Source_Name = P.Surname

)

WHERE P.Forename ''

AND P.Surname ''

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]