April 22, 2013 at 10:38 am
The following SQL query isn't scaling very well! It works great on record sets of 10-20 million, but now that I'm trying it on a table of 90 million records I'm getting really slow performance 🙁
INSERT INTO ##BaseField_PotentialMatches_(Master_Id, Duplicate_Id, MatchKeyType, OrgName_Score, Address_Score, Postcode_Score)
SELECT TOP 100 a.id, b.id, 1, ONS.*, ADS.*, PS.*
FROM dbo.BaseField_keys_ AS a
INNER JOIN dbo.BaseField_keys_ AS b
ON a.mkMatchKeyType1 = b.mkMatchKeyType1
AND a.ID > b.ID
AND NOT EXISTS (SELECT 1 from ##BaseField_PotentialMatches_ as c WHERE c.duplicate_ID IN (a.id,b.id))
CROSS APPLY dbo.OrgNameScoring (a.mkNormalizedOrganization, a.mkOrgName1, a.mkOrgName2, a.mkOrgName3,
b.mkNormalizedOrganization, b.mkOrgName1, b.mkOrgName2, b.mkOrgName3) as ONS
CROSS APPLY dbo.AddressScoring (a.mkPhoneticStreet, a.mkPremise, a.mkPhoneticTown, a.mkAddressKey,
b.mkPhoneticStreet, b.mkPremise, b.mkPhoneticTown, b.mkAddressKey) as ADS
CROSS APPLY dbo.PostcodeScoring (a.mkPostOut, a.mkPostIn, b.mkPostOut, b.mkPostIn) as PS
Index definition on the table:
CREATE NONCLUSTERED INDEX [idx_mkMatchKeyType1] ON [dbo].[BaseField_keys_]
(
[mkMatchKeyType1] ASC,
[ID] DESC
)
INCLUDE ( [GUID],
[mkNormalizedName],
[mkName1],
[mkName2],
[mkName3],
[mkNormalizedOrganization],
[mkOrgName1],
[mkOrgName2],
[mkOrgName3],
[mkPhoneticStreet],
[mkPremise],
[mkPhoneticTown],
[mkAddressKey],
[mkPostOut],
[mkPostIn],
[mkTelephoneNumber]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
Destination table:
CREATE TABLE ##BaseField_PotentialMatches_(
Master_Id INT
,Duplicate_Id INT
,MatchKeyType TINYINT
,OrgName_Score VARCHAR(10)
,Address_Score TINYINT
,Postcode_Score VARCHAR(10))
CREATE CLUSTERED INDEX idx_MasterDuplicateIDs ON ##BaseField_PotentialMatches_(Duplicate_Id)
Execution plan attached below.
Any suggestions are welcome!
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media Lens
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
April 22, 2013 at 10:43 am
Your key types, being huge VARBINARYs, are preventing you from indexing; that's going to require a table scan.
Is that a default VARBINARY size that someone/some application imposed? Can you get the size down under 900 bytes so it could be indexed?
If not, maybe you can create a surrogate key, like the HASHBYTES of the value, so you can join them together and use an index?
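Something like this, purely as a sketch - the new column and index names here are just placeholders:

-- narrow, persisted hash of the existing wide key column;
-- HASHBYTES with SHA1 returns a fixed 20 bytes, well under the 900-byte index key limit
ALTER TABLE dbo.BaseField_keys_
ADD mkMatchKeyType1_Hash AS
    CAST(HASHBYTES('SHA1', mkMatchKeyType1) AS VARBINARY(20)) PERSISTED;

CREATE NONCLUSTERED INDEX idx_mkMatchKeyType1_Hash
ON dbo.BaseField_keys_ (mkMatchKeyType1_Hash, ID DESC);

The self-join would then compare a.mkMatchKeyType1_Hash = b.mkMatchKeyType1_Hash instead of the wide VARBINARY column.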
Lowell
April 22, 2013 at 10:52 am
Lowell (4/22/2013)
Your key types, being huge VARBINARYs, are preventing you from indexing; that's going to require a table scan. Is that a default VARBINARY size that someone/some application imposed? Can you get the size down under 900 bytes so it could be indexed?
If not, maybe you can create a surrogate key, like the HASHBYTES of the value, so you can join them together and use an index?
The mkMatchKeyType1 is actually a HASHBYTES of several columns that I wasn't able to index due to the 900-byte limit for index keys. As far as I know, the HASHBYTES value needs to be stored in a VARBINARY(8000) data type, which is what I've done.
I managed to create the index on the match keys (see above); these took around 25 minutes per index on my 90 million record table. But as I said, when I run the first query (with or without the CLR TVFs) it takes a long time to perform the join. No idea what to do now... I'm losing the will to live with this!
April 22, 2013 at 11:40 am
Why are you joining on a hash of several columns in the first place?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
April 22, 2013 at 11:57 am
GilaMonster (4/22/2013)
Why are you joining on a hash of several columns in the first place?
Record linkage.
How else would you do this if you want to look for duplicates where
forename, surname, address 1 and postcode are the same, etc.... Or
surname (first 5 characters), address 1, town, county. Or
the sound of the surname and postcode, etc. etc....
I'm open to suggestions.
April 23, 2013 at 4:26 am
Abu Dina (4/22/2013)
GilaMonster (4/22/2013)
Why are you joining on a hash of several columns in the first place?
Record linkage.
How else would you do this if you want to look for duplicates where
forename, surname, address 1 and postcode are the same, etc.... Or
surname (first 5 characters), address 1, town, county. Or
the sound of the surname and postcode, etc. etc....
I'm open to suggestions.
Not sure if I read the thread correctly, but it seems like you are trying to eliminate dupes from a set of people/addresses. If so, maintaining a "master person list" is an EXCEPTIONALLY difficult task no matter what country you live in. There are entire lines of business built just around that, especially as it relates to healthcare (where I have done my work at this mess), insurance, government, etc. Most that I know of rely on some form of parsing to get the data into proper elements (firstname, middlename, lastname, address1, address2, city, etc., etc.) and THEN set about looking for duplicates using a variety of mechanisms (that might include soundex, which you mentioned).
It seems that you have simply strung all the characters into your fields in some order. To check that for duplicates I wouldn't do a JOIN; I would simply do a COUNT(*) with a GROUP BY and a HAVING COUNT(*) > 1.
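Something like this, just as a sketch - it groups on the posted hash match-key column, and the MIN(ID) "master" choice is only illustrative:

-- one row per match-key value that occurs more than once
SELECT mkMatchKeyType1,
       MIN(ID)  AS Master_Id,   -- illustrative choice of surviving record
       COUNT(*) AS DupeCount
FROM dbo.BaseField_keys_
GROUP BY mkMatchKeyType1
HAVING COUNT(*) > 1;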
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
April 23, 2013 at 5:18 am
TheSQLGuru (4/23/2013)
Abu Dina (4/22/2013)
GilaMonster (4/22/2013)
Why are you joining on a hash of several columns in the first place?
Record linkage.
How else would you do this if you want to look for duplicates where
forename, surname, address 1 and postcode are the same, etc.... Or
surname (first 5 characters), address 1, town, county. Or
the sound of the surname and postcode, etc. etc....
I'm open to suggestions.
Not sure if I read the thread correctly, but it seems like you are trying to eliminate dupes from a set of people/addresses. If so, maintaining a "master person list" is an EXCEPTIONALLY difficult task no matter what country you live in. There are entire lines of business built just around that, especially as it relates to healthcare (where I have done my work at this mess), insurance, government, etc. Most that I know of rely on some form of parsing to get the data into proper elements (firstname, middlename, lastname, address1, address2, city, etc., etc.) and THEN set about looking for duplicates using a variety of mechanisms (that might include soundex, which you mentioned).
It seems that you have simply strung all the characters into your fields in some order. To check that for duplicates I wouldn't do a JOIN; I would simply do a COUNT(*) with a GROUP BY and a HAVING COUNT(*) > 1.
Yes you are right and yes it's a challenging task alright!
The table that I mentioned above actually contains the parsed elements you mention.
The keys merely group records together; I then use several CLR TVFs to compare data elements (names, company names, addresses, phone numbers) and work out a confidence score for each matching pair.
It works great for smaller record sets. I can dedupe a file of 2 million records in a few minutes, something that used to take hours with our old process. But now that I'm applying this to a record set of close to 90 million records I seem to have hit a brick wall with performance!
The configuration of the server doesn't help either, with a single drive for both data and log files. I had no input when the server was purchased for this task.
April 23, 2013 at 9:01 am
Abu Dina (4/23/2013)
TheSQLGuru (4/23/2013)
Abu Dina (4/22/2013)
GilaMonster (4/22/2013)
Why are you joining on a hash of several columns in the first place?
Record linkage.
How else would you do this if you want to look for duplicates where
forename, surname, address 1 and postcode are the same, etc.... Or
surname (first 5 characters), address 1, town, county. Or
the sound of the surname and postcode, etc. etc....
I'm open to suggestions.
Not sure if I read the thread correctly, but it seems like you are trying to eliminate dupes from a set of people/addresses. If so, maintaining a "master person list" is an EXCEPTIONALLY difficult task no matter what country you live in. There are entire lines of business built just around that, especially as it relates to healthcare (where I have done my work at this mess), insurance, government, etc. Most that I know of rely on some form of parsing to get the data into proper elements (firstname, middlename, lastname, address1, address2, city, etc., etc.) and THEN set about looking for duplicates using a variety of mechanisms (that might include soundex, which you mentioned).
It seems that you have simply strung all the characters into your fields in some order. To check that for duplicates I wouldn't do a JOIN; I would simply do a COUNT(*) with a GROUP BY and a HAVING COUNT(*) > 1.
Yes you are right and yes it's a challenging task alright!
The table that I mentioned above actually contains the parsed elements you mention.
The keys merely group records together; I then use several CLR TVFs to compare data elements (names, company names, addresses, phone numbers) and work out a confidence score for each matching pair.
It works great for smaller record sets. I can dedupe a file of 2 million records in a few minutes, something that used to take hours with our old process. But now that I'm applying this to a record set of close to 90 million records I seem to have hit a brick wall with performance!
The configuration of the server doesn't help either, with a single drive for both data and log files. I had no input when the server was purchased for this task.
The estimated row count is 4 on the self-join, leading to a nested loops join. The actual row count is 39.8 MILLION, so the logical reads are killing you. Try one or the other of these for that self-join:
FROM dbo.BaseField_keys_ AS a
INNER HASH JOIN dbo.BaseField_keys_ AS b

-- or --

FROM dbo.BaseField_keys_ AS a
INNER MERGE JOIN dbo.BaseField_keys_ AS b
MERGE will likely require a SORT operator and thus will be bad too. Check the estimated plan, and if you see a big sort you may not want to try that one.
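Applied to the posted query, the hash version would just be the following (untested; only the join keyword changes, and the three CROSS APPLY scoring calls stay exactly as they are):

FROM dbo.BaseField_keys_ AS a
INNER HASH JOIN dbo.BaseField_keys_ AS b
    ON a.mkMatchKeyType1 = b.mkMatchKeyType1
    AND a.ID > b.ID
    AND NOT EXISTS (SELECT 1 FROM ##BaseField_PotentialMatches_ AS c
                    WHERE c.Duplicate_Id IN (a.id, b.id))
-- CROSS APPLY dbo.OrgNameScoring (...) / dbo.AddressScoring (...) / dbo.PostcodeScoring (...) unchanged

Bear in mind that a join hint also forces the join order as written in the FROM clause.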
I wonder about the CLR TVFs - do they have any data access?
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
April 23, 2013 at 9:18 am
Thanks Kevin. I will try the join hints and report back.
Here is one of the CLR TVFs I use to score email addresses:
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Net;
using System.Data.SqlClient;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
namespace Merge
{
public class EmailScore
{
[SqlFunction(Name = "EmailScoring", FillRowMethodName = "FillRow", TableDefinition = "Email_Score NVARCHAR(100)")]
public static IEnumerable TelephoneScoring(
SqlString IncomingmasterEmail,
SqlString IncomingduplicateEmail
)
{
ArrayList rowsArray = new ArrayList();
// ***************************************************************************************************************************************
// INITIALISE VARIABLES
// ***************************************************************************************************************************************
string theResult = "";
// guard against SQL NULLs: casting a null SqlString to string throws, so use IsNull/Value instead
string masterEmail = IncomingmasterEmail.IsNull ? string.Empty : IncomingmasterEmail.Value;
string duplicateEmail = IncomingduplicateEmail.IsNull ? string.Empty : IncomingduplicateEmail.Value;
if (string.IsNullOrEmpty(masterEmail.Trim()) && string.IsNullOrEmpty(duplicateEmail.Trim())) { theResult = "bothEmpty"; } // both empty
else if (string.IsNullOrEmpty(masterEmail.Trim()) || string.IsNullOrEmpty(duplicateEmail.Trim())) { theResult = "oneEmpty"; } // one empty
else if (masterEmail.Length > 0 && duplicateEmail.Length > 0 && masterEmail == duplicateEmail) { theResult = "sure"; } // exact match!
else if (masterEmail.Length > 0 && duplicateEmail.Length > 0
&& (WordFunctions.LevenshteinDistance(masterEmail, duplicateEmail) == 1
|| WordFunctions.DamerauLevenshtein(masterEmail, duplicateEmail) > (SqlDecimal)0.9))
{
theResult = "likely";
} // fuzzy (1-char insertion/deletion/replacement, 2-char transposition)
else
{
theResult = "zero";
}
object[] column = new object[1];
column[0] = (SqlString)theResult;
rowsArray.Add(column);
return rowsArray;
}
public static void FillRow(Object obj, out SqlString theResult)
{
object[] row = (object[])obj;
theResult = (SqlString)row[0];
}
}
}
April 23, 2013 at 9:46 am
Abu Dina (4/23/2013)
Thanks Kevin. I will try the join hints and report back. Here is one of the CLR TVFs I use to score email addresses:
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Net;
using System.Data.SqlClient;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
namespace Merge
{
public class EmailScore
{
[SqlFunction(Name = "EmailScoring", FillRowMethodName = "FillRow", TableDefinition = "Email_Score NVARCHAR(100)")]
public static IEnumerable TelephoneScoring(
SqlString IncomingmasterEmail,
SqlString IncomingduplicateEmail
)
{
ArrayList rowsArray = new ArrayList();
// ***************************************************************************************************************************************
// INITIALISE VARIABLES
// ***************************************************************************************************************************************
string theResult = "";
// guard against SQL NULLs: casting a null SqlString to string throws, so use IsNull/Value instead
string masterEmail = IncomingmasterEmail.IsNull ? string.Empty : IncomingmasterEmail.Value;
string duplicateEmail = IncomingduplicateEmail.IsNull ? string.Empty : IncomingduplicateEmail.Value;
if (string.IsNullOrEmpty(masterEmail.Trim()) && string.IsNullOrEmpty(duplicateEmail.Trim())) { theResult = "bothEmpty"; } // both empty
else if (string.IsNullOrEmpty(masterEmail.Trim()) || string.IsNullOrEmpty(duplicateEmail.Trim())) { theResult = "oneEmpty"; } // one empty
else if (masterEmail.Length > 0 && duplicateEmail.Length > 0 && masterEmail == duplicateEmail) { theResult = "sure"; } // exact match!
else if (masterEmail.Length > 0 && duplicateEmail.Length > 0
&& (WordFunctions.LevenshteinDistance(masterEmail, duplicateEmail) == 1
|| WordFunctions.DamerauLevenshtein(masterEmail, duplicateEmail) > (SqlDecimal)0.9))
{
theResult = "likely";
} // fuzzy (1-char insertion/deletion/replacement, 2-char transposition)
else
{
theResult = "zero";
}
object[] column = new object[1];
column[0] = (SqlString)theResult;
rowsArray.Add(column);
return rowsArray;
}
public static void FillRow(Object obj, out SqlString theResult)
{
object[] row = (object[])obj;
theResult = (SqlString)row[0];
}
}
}
I am not a CLR guru and don't have time to give the code a good review, but my gut tells me there are some efficiency improvements to be had there. If nothing else, make theResult a TINYINT and use 1, 2, 3, etc. as states.
Check the documentation to see whether you need to do a Trim when testing IsNullOrEmpty. Also, make sure your matching tests are coded in most-likely to least-likely-to-hit order (if that is known). That will minimize code branching down into the ELSEs.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
April 23, 2013 at 9:56 am
Fair points, Kevin. I'm not really a C# coder, so I know it's not the best-written code in the world.
I will speak to one of our developers to see if they can help me re-write to make it more efficient.