May 15, 2009 at 5:48 am
We have a number of jobs that perform matches across different tables. The initial match is based on the forename and surname of the person in question being equal. If the forename and surname is the same, a record is created in another table and further matching is performed to give a final match rating.
We have just purchased a database containing million of name variations. These are in two tables (one for forenames and one for surnames) with the following columns:
•Source_Name – This contains a list of Names that you can search on (E.g. ‘Steve’)
•Name_Variation – This list the variation of the sources name. Each variation is on a new row. (I.e. there may be 2 rows with the source name ‘Steve’. One with the variation of ‘Stephen’ and the other with the variation ‘Steven’)
•Score – This is a rating system with 100 being a close match and 75 being a loose match.
I am having trouble getting the matching script to return results in an efficient manner. I believe that this is caused by the cross join that I have used in the below script. Is there a more efficient way of doing it?
SELECT Distinct
AVP.ID,
PER.PersonID,
1,
1
FROM PERSON P
Cross JOIN C6..PERSON_SUBSET PER
WHERE P.Forename <> ''
AND
P.Surname <> ''
And (P.Forename = PER.EnglishForename Or PER.PERSONID in (
Select PersonID
From C6..Person_Subset
Where EnglishForename In
(Select Name_Variation
from Names_Database..Forename
Where [Source_Name] = P.Forename and Score >= 95)))
AND (P.Surname = PER.EnglishSurname OR PER.PERSONID in (
Select PersonID
From C6..Person_Subset
Where EnglishSurname in
(Select Name_Variation
from Names_Database..Surname
Where [Source_Name] = P.Surname and Score >= 95)))
May 15, 2009 at 6:30 am
SSIS has fuzzy lookup and fuzzy grouping which will find potential matches, create a link between the two, score the matching and allow you to give give certain cut off levels for the matching score. e.g. only match on greater than 0.95
It will probably be a lot easier to switch to the SSIS method than use T-SQL
May 15, 2009 at 7:17 am
Thanks for your answer Samuel.
I know its a bit cheeky, but would you be able to provide some more information as to how i would use the fuzzy transformation to create the same results set as the t sql query.
I am not an SSIS novice but not exactly an expert.
Thanks
May 15, 2009 at 7:24 am
All my Fuzzy lookup packages are on laptop which is at home...
Try these instead
have a look here:
http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services
and here:
May 15, 2009 at 8:00 am
Thanks for the links. I have had a look and i do not think that the fuzzy matching is the correct route for me to go down.
The names table already contains the explicit links between names and their variations. Also while the fuzzy searching would be able to catch things like 'Stephen' and 'Steven' it would have a harder time with names like 'Richard' and 'Dick'.
May 15, 2009 at 8:39 am
I actually don't understand your query well, but without taking a deep look I advice to use INNER JOINS to substitute the IN clauses in your query .... the use of the IN causes real performance issues when used with big result sets....
May 15, 2009 at 8:54 am
Thanks, i will try that. It wont effect the query at the mo becuase its the initial cross join that is causing the problems. Also each of the sub selects after the IN's will only re returning around 10 names.
May 15, 2009 at 11:00 am
The CROSS JOIN isn't your problem, it's just an INNER JOIN that has had it's ON conditions moved down to the WHERE clause. It'll act the same.
Your real problem is that you have subqueries five deep.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
May 15, 2009 at 11:26 am
Hmmm, the multiple correlated subqueries of PERSON_SUBSET to itself are definitely a problem, and it does not appear that they are necessary.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
May 15, 2009 at 11:41 am
Hmm, also, this query is not valid:
sperry (5/15/2009)
SELECT Distinct
AVP.ID, --<<< No AVP alias?
PER.PersonID,
1,
1
FROM PERSON P
Cross JOIN C6..PERSON_SUBSET PER
WHERE P.Forename ''
AND
P.Surname ''
And (P.Forename = PER.EnglishForename Or PER.PERSONID in (
...
Notice that there is no AVP table? We cannot reorganize this if we do not know how this table is getting into the query.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
May 15, 2009 at 11:53 am
Sorry, my mistake. the AVP alias is the same as the P alias. I have changed the name of the table (and i thought all of the aliases) to remove potentially sensitive information.
Is the cross join not responsible for creating a larger the necessary data set?
May 15, 2009 at 12:06 pm
sperry (5/15/2009)[hrIs the cross join not responsible for creating a larger the necessary data set?
No. Although syntactically it is a CROSS JOIN, logically it is not because the WHERE clause is full of P.* to PER.* conditions.
You performance issues are almost certainly related to those PERSON_SUBSET to PERSON_SUBSET correlated subqueries, which you probably do not need (I am working on it now).
It is also possible that you need to look at indexes, but lets get the query straight first...
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
May 15, 2009 at 12:21 pm
Could you post the DDL (including keys & indexes) for the Surname and Forename tables?
thnx,
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
May 15, 2009 at 12:32 pm
CREATE TABLE [dbo].[Surname](
[Source_Name] [varchar](20) NOT NULL,
[Name_Variation] [varchar](20) NOT NULL,
[Score] [int] NOT NULL
) ON [PRIMARY]
Both surname and forename tables are the same. There is a clustered index on the source_name column. There are no other keys or indexes
May 15, 2009 at 12:56 pm
OK, try this and see if it is any better:
;WITH cteSurname95 as (
Select Name_Variation
From Names_Database..Surname
Where Score >= 95
), cteForeName95 as (
Select Name_Variation
From Names_Database..Forename
Where Score >= 95
)
SELECT Distinct
P.ID,
PER.PersonID,
1,
1
FROM PERSON P
INNER JOIN C6..PERSON_SUBSET PER
ON PER.EnglishForename IN (
Select P.Forename
UNION ALL
Select fn.Name_Variation
From cteForeName95 fn
Where fn.Source_Name = P.Forename
)
AND PER.EnglishSurname IN (
Select P.Surname
UNION ALL
Select fn.Name_Variation
From cteSurName95 fn
Where fn.Source_Name = P.Surname
)
WHERE P.Forename ''
AND P.Surname ''
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
Viewing 15 posts - 1 through 15 (of 28 total)
You must be logged in to reply to this topic. Login to reply