March 12, 2012 at 6:12 am
Hi all,
I have a query that tries to identify unique sales records by comparing them against all previous sales. It does this using an 'edit distance' function over the customer details. This works fine when looking for similar records in the same table, but now that I need things to be unique, it takes 20+ hours to run against 5 million records (I have had to stop it at that point; it has never actually finished). I need this to complete in under 12 hours. Thereafter it will be run per day, so it shouldn't be as much of a problem.
Production system is SQL 2008 R2 SP1 on a Windows 2008 VM, 4GB RAM.
The function is as follows:
CREATE FUNCTION [dbo].[edit_distance_weighted](@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS float
AS
BEGIN
    -- Levenshtein edit distance between @s1 and @s2, scaled by their combined length
    -- (0 = identical, values approaching 1 = completely different)
    DECLARE @s1_len int, @s2_len int
    DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
    DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)  -- current and previous DP rows, 2 bytes per cell

    SELECT
        @s1_len = LEN(ISNULL(@s1, '')),
        @s2_len = LEN(ISNULL(@s2, '')),
        @cv1 = 0x0000,
        @j = 1, @i = 1, @c = 0

    IF (@s1_len + @s2_len = 0)
    BEGIN
        RETURN 0;
    END;

    -- Initialise the first row: d(0, j) = j
    WHILE @j <= @s2_len
        SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1

    WHILE @i <= @s1_len
    BEGIN
        SELECT
            @s1_char = SUBSTRING(@s1, @i, 1),
            @c = @i,
            @cv0 = CAST(@i AS binary(2)),  -- d(i, 0) = i
            @j = 1

        WHILE @j <= @s2_len
        BEGIN
            SET @c = @c + 1  -- cost from the cell to the left
            SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j - 1, 2) AS int) +
                CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
            IF @c > @c_temp SET @c = @c_temp  -- diagonal (match/substitution)
            SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j + 1, 2) AS int) + 1
            IF @c > @c_temp SET @c = @c_temp  -- cell above
            SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
        END

        SELECT @cv1 = @cv0, @i = @i + 1
    END

    RETURN @c * CAST(2 AS FLOAT) / (@s1_len + @s2_len)
END
Table DDL:
CREATE TABLE [dbo].[Sales](
[SaleID] [int] IDENTITY(1,1) NOT NULL,
[PhoneNumber] [varchar](20) NULL,
[Surname] [varchar](50) NULL,
[FlatNumber] [varchar](5) NULL,
[HouseNumber] [varchar](5) NULL,
[Address1] [varchar](100) NULL,
[Postcode] [varchar](8) NULL,
[SaleDate] [datetime] NULL,
[ManualSale] [bit] NULL,
CONSTRAINT [PK_Sales] PRIMARY KEY CLUSTERED
(
[SaleID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
And my query is as follows:
SELECT DISTINCT
ManualSales.Surname,
ManualSales.FlatNumber,
ManualSales.HouseNumber,
ManualSales.Address1,
ManualSales.Postcode,
ManualSales.PhoneNumber,
ManualSales.SaleDate
FROM
Sales ManualSales
WHERE ManualSales.SaleDate IS NOT NULL
AND ISNULL(ManualSales.ManualSale,0) = 1
AND NOT EXISTS
(
SELECT 1
FROM Sales S
WHERE dbo.edit_distance_weighted(ISNULL(S.PhoneNumber,ManualSales.PhoneNumber),ISNULL(ManualSales.PhoneNumber,S.PhoneNumber)) < 0.4
AND dbo.edit_distance_weighted(S.Postcode,ManualSales.Postcode) < 0.3
AND dbo.edit_distance_weighted(S.Address1,ManualSales.Address1) < 0.4
AND dbo.edit_distance_weighted(S.Surname,ManualSales.Surname) < 0.4
AND S.SaleID != ManualSales.SaleID
AND S.SaleDate IS NOT NULL
AND S.SaleDate < ManualSales.SaleDate
AND ISNULL(S.ManualSale,0) = 0
)
ORDER BY ManualSales.SaleDate DESC
Can anyone think of a way to rework this, or of other changes to make it more efficient? There is one suggested index (SaleDate with ManualSale), but it would only save around 25%, which is not enough.
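(For reference, I think the suggested index looks roughly like this; I have not kept the exact missing-index definition, so the key/include layout is a guess on my part.)
CREATE NONCLUSTERED INDEX IX_Sales_SaleDate_ManualSale
    ON dbo.Sales (SaleDate)
    INCLUDE (ManualSale);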
Thanks,
Craig
March 12, 2012 at 7:36 am
To start with, this is not a self join but an anti-semi-self-join, and it seems you could turn it into a semi-self-join (or a plain join). (I am not sure this is the right logic; it is just to show that your query can be redesigned. "NOT EXISTS" can be more expensive than "EXISTS".) I have also changed the ISNULL(ManualSales.ManualSale,0) = 1 to a more optimal predicate.
SELECT DISTINCT
ManualSales.Surname,
ManualSales.FlatNumber,
ManualSales.HouseNumber,
ManualSales.Address1,
ManualSales.Postcode,
ManualSales.PhoneNumber,
ManualSales.SaleDate
FROM
Sales ManualSales
WHERE ManualSales.SaleDate IS NOT NULL
AND ManualSales.ManualSale = 1
AND ManualSales.ManualSale IS NOT NULL
AND EXISTS
(
SELECT 1
FROM Sales S
WHERE
dbo.edit_distance_weighted(ISNULL(S.PhoneNumber,ManualSales.PhoneNumber),ISNULL(ManualSales.PhoneNumber,S.PhoneNumber)) > 0.4
AND dbo.edit_distance_weighted(S.Postcode,ManualSales.Postcode) > 0.3
AND dbo.edit_distance_weighted(S.Address1,ManualSales.Address1) > 0.4
AND dbo.edit_distance_weighted(S.Surname,ManualSales.Surname) > 0.4
AND S.CustomerID = ManualSales.CustomerID
--AND S.SaleDate IS NOT NULL --WOULD NOT BE NEEDED IN CASE OF A JOIN/SEMI JOIN
AND S.SaleDate > ManualSales.SaleDate
AND S.ManualSale = 1 --MAY NOT BE NEEDED
AND S.ManualSale IS NOT NULL --MAY NOT BE NEEDED
)
Since you did not mention what the function actually does, I cannot do much about that, but we all know that scalar functions in the WHERE clause hinder performance a great deal.
Changing the function would need additional testing, which may not be an option at the moment. Another way of dealing with it is to store partial output in a temp table by moving the scalar function calls from the WHERE clause to the SELECT clause. With proper indexing on the temp table you would be able to work with even more than 5 million records, and I am fairly sure you would not need anything like 12 hours for that.
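Something along these lines, for example (just an untested sketch; #Candidates and the column aliases are my own names, and you would keep your original thresholds when you filter it):
-- Stage the expensive UDF results once, instead of calling the function in the outer WHERE clause
SELECT
    ManualSales.SaleID,
    S.SaleID AS PriorSaleID,
    dbo.edit_distance_weighted(ISNULL(S.PhoneNumber, ManualSales.PhoneNumber),
                               ISNULL(ManualSales.PhoneNumber, S.PhoneNumber)) AS PhoneDist,
    dbo.edit_distance_weighted(S.Postcode, ManualSales.Postcode) AS PostcodeDist,
    dbo.edit_distance_weighted(S.Address1, ManualSales.Address1) AS AddressDist,
    dbo.edit_distance_weighted(S.Surname, ManualSales.Surname) AS SurnameDist
INTO #Candidates
FROM dbo.Sales ManualSales
JOIN dbo.Sales S
    ON S.SaleID <> ManualSales.SaleID
   AND S.SaleDate < ManualSales.SaleDate
WHERE ManualSales.ManualSale = 1
  AND ManualSales.SaleDate IS NOT NULL
  AND ISNULL(S.ManualSale, 0) = 0
  AND S.SaleDate IS NOT NULL;

CREATE CLUSTERED INDEX IX_Candidates ON #Candidates (SaleID);

-- Your NOT EXISTS check then becomes a cheap lookup against #Candidates,
-- comparing PhoneDist/PostcodeDist/AddressDist/SurnameDist to the same thresholds as before.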
Hope it helps.
March 12, 2012 at 8:49 am
Hi Usman,
Thanks for your reply.
I'm trying to wrap my head around the differences between my original query and what you suggest. The goal of the query is to look for manual sales which have unique customer information, compared to all past sales. The UDF gives a measure of similarity between two strings (edit distance, weighted for average length of the two strings).
You suggest swapping
ISNULL(ManualSales.ManualSale,0) = 1
for
AND ManualSales.ManualSale = 1
AND ManualSales.ManualSale IS NOT NULL
but I don't think these are logically equivalent? The original would include NULL but the replacement would exclude them? Unless I am missing something about the rest of the query.
Also, a note on the S.CustomerID = ManualSales.CustomerID line in your version: there is no CustomerID column in this table. The condition should be
S.SaleID != ManualSales.SaleID
which is there to exclude the manual sale from matching itself.
Another way of dealing with it is to store partial output in a temp table by moving the scalar function calls from the WHERE clause to the SELECT clause. With proper indexing on the temp table you would be able to work with even more than 5 million records, and I am fairly sure you would not need anything like 12 hours for that.
Can you elaborate here? If, say, 10% of the sales were manual, would this not be inserting something like 500,000 x 5,000,000 = 2.5 trillion rows? I'll have a look at it though, as it does give an alternative route.
Thanks,
Craig
March 12, 2012 at 9:03 am
CraigIW (3/12/2012)
Hi Usman, you suggest swapping
ISNULL(ManualSales.ManualSale,0) = 1
for
AND ManualSales.ManualSale = 1
AND ManualSales.ManualSale IS NOT NULL
but I don't think these are logically equivalent? The original would include NULL but the replacement would exclude them? Unless I am missing something about the rest of the query.
Actually they are both going to produce the same thing, and they are both redundant. Why bother with an ISNULL check at all? If it is not 1, does it matter whether it is NULL or not? The original would NOT include NULLs. All the ISNULL in the WHERE clause manages to do is eliminate index usage. You only need to check "AND ManualSales.ManualSale = 1".
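To put it concretely (the filtered index here is only a suggestion and has not been tested against your data):
-- Non-sargable: the column is wrapped in a function, so an index cannot be seeked on it
SELECT COUNT(*) FROM dbo.Sales WHERE ISNULL(ManualSale, 0) = 1;

-- Sargable, and returns exactly the same rows (NULL = 1 is never true)
SELECT COUNT(*) FROM dbo.Sales WHERE ManualSale = 1;

-- On 2008 a filtered index can then serve the outer manual-sales filter directly
CREATE NONCLUSTERED INDEX IX_Sales_Manual_SaleDate
    ON dbo.Sales (SaleDate)
    INCLUDE (Surname, FlatNumber, HouseNumber, Address1, Postcode, PhoneNumber)
    WHERE ManualSale = 1;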
That UDF is the reason your performance is so awful. Looping in SQL is like injecting heroin into the system, and calling the same scalar UDF four times in a WHERE clause is never going to perform well. You have effectively eliminated all indexing, and you have 5 million rows.
There are lots of people around here, including myself, willing to help with this if you want. I would suggest the first place to start is to eliminate that UDF. If you want some help, take a look at the first link in my signature.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
March 12, 2012 at 10:25 am
Hi Sean,
Sorry, I'm a bit tired. You are right about the ISNULL. It is there as the column is nullable, but the expressions are equivalent.
The UDF is what does all the work. If you can tell me a way to quantify the difference between strings without a UDF, I am all ears. I did explore using the SQL full text functionality but didn't see a way to get it to compare two columns.
Craig
March 12, 2012 at 10:31 am
CraigIW (3/12/2012)
Hi Sean, sorry, I'm a bit tired. You are right about the ISNULL. It is there as the column is nullable, but the expressions are equivalent.
The UDF is what does all the work. If you can tell me a way to quantify the difference between strings without a UDF, I am all ears. I did explore using the SQL full text functionality but didn't see a way to get it to compare two columns.
Craig
What is it doing? I don't really have the energy to parse that obfuscated code to decipher it.
March 12, 2012 at 11:16 am
Hi,
Well, I haven't worked on it much (just read through it once), so this is only a guess...
The problem lies in:
1) The subquery performs a cross join of Sales with itself. With x = 5*10^6 rows you can expect a join on the order of x^2.
2) The function contains nested WHILE loops. Not good for performance.
3) The function prevents index use, so a full table scan will be performed.
I think you can review it from a design point of view, not just a development one.
Maybe create some indexed computed columns (if possible) to avoid using the edit_distance_weighted function (whatever the awkward requirement behind it is).
March 12, 2012 at 11:16 am
Sean Lange (3/12/2012)
CraigIW (3/12/2012)
Hi Sean, sorry, I'm a bit tired. You are right about the ISNULL. It is there as the column is nullable, but the expressions are equivalent.
The UDF is what does all the work. If you can tell me a way to quantify the difference between strings without a UDF, I am all ears. I did explore using the SQL full text functionality but didn't see a way to get it to compare two columns.
Craig
What is it doing? I don't really have the energy to parse that obfuscated code to decipher it.
I picked that function up when I was first looking for a way to assess the relative difference between strings. I agree it could do with some tidying up, but it does seem to do the job.
It computes a weighted 'edit distance' to gauge how similar the strings are. E.g. 'G11 1AA' is quite similar to 'G11 1AB', but not very similar to 'NW2 4AB'. The function returns the difference as a float that I can use to set a limit on how much of a difference I want to allow.
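For example (values are approximate, worked out from the scaling in the function):
-- one substitution across 7 + 7 characters: 1 * 2 / 14, roughly 0.14
SELECT dbo.edit_distance_weighted('G11 1AA', 'G11 1AB');

-- five edits across the same lengths: 5 * 2 / 14, roughly 0.71
SELECT dbo.edit_distance_weighted('G11 1AA', 'NW2 4AB');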
Craig
March 12, 2012 at 11:22 am
As I said, maybe you can change the design, not only the implementation.
Since you are "cleaning" the table to find duplicate rows, you have to join the table to itself (all rows x all rows).
Maybe you can create indexed computed columns and just validate the data before each insert/update.
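Something like this, maybe (PostcodeNorm is just an example name, untested; the UDF itself cannot be a computed column because it compares two rows):
-- A normalised postcode, persisted and indexable, so exact or prefix matches
-- can pre-filter candidate pairs before the expensive comparison runs
ALTER TABLE dbo.Sales
    ADD PostcodeNorm AS UPPER(REPLACE(Postcode, ' ', '')) PERSISTED;

CREATE NONCLUSTERED INDEX IX_Sales_PostcodeNorm
    ON dbo.Sales (PostcodeNorm);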
Update:
It doesn't surprise me that running a nested-WHILE string manipulation (5 million x 5 million) times takes a long time.
March 12, 2012 at 11:30 am
Update:
It doesn't surprise me that running a nested-WHILE string manipulation (5 million x 5 million) times takes a long time.
Me neither. Though it does seem to be a lot faster on my 32-bit SQL 2008 development desktop than it is on the production server :/ I'm still comparing, but 3 million rows on my dev machine take 45 minutes. I'm going to try to get the quantities closer to production and see whether the difference is still apparent.
March 12, 2012 at 11:50 am
To compare between machines, use a dump from the prod server (if possible).
Comparing 5 million NULLs or small strings is faster than comparing 3 million large strings...
Also, the production server can be slower due to concurrency.
If that table is updated a few times per minute, you have a problem running a long check against it.
March 12, 2012 at 1:36 pm
Looking at your UDF, it would definitely benefit from being implemented as a CLR function.
Also, you could try splitting your query into two parts: first filter out everything you can without checking string similarities, then apply your "fine" checks using the UDF.
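Roughly like this (the LEFT() prefix pre-filter and #RoughMatches are only illustrative; the coarse filter has to be loose enough that it does not throw away genuine near-matches, and the phone check is omitted for brevity):
-- Step 1: cheap, coarse candidate pairs only
SELECT ms.SaleID, s.SaleID AS PriorSaleID
INTO #RoughMatches
FROM dbo.Sales ms
JOIN dbo.Sales s
    ON s.SaleID <> ms.SaleID
   AND s.SaleDate < ms.SaleDate
   AND LEFT(s.Postcode, 4) = LEFT(ms.Postcode, 4)
WHERE ms.ManualSale = 1
  AND ms.SaleDate IS NOT NULL
  AND ISNULL(s.ManualSale, 0) = 0
  AND s.SaleDate IS NOT NULL;

-- Step 2: run the expensive UDF only against the surviving pairs
SELECT DISTINCT ms.Surname, ms.FlatNumber, ms.HouseNumber, ms.Address1, ms.Postcode, ms.PhoneNumber, ms.SaleDate
FROM dbo.Sales ms
WHERE ms.ManualSale = 1
  AND ms.SaleDate IS NOT NULL
  AND NOT EXISTS
      (SELECT 1
       FROM #RoughMatches rm
       JOIN dbo.Sales s ON s.SaleID = rm.PriorSaleID
       WHERE rm.SaleID = ms.SaleID
         AND dbo.edit_distance_weighted(s.Postcode, ms.Postcode) < 0.3
         AND dbo.edit_distance_weighted(s.Address1, ms.Address1) < 0.4
         AND dbo.edit_distance_weighted(s.Surname, ms.Surname) < 0.4);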
March 13, 2012 at 9:17 am
Eugene Elutin (3/12/2012)
Looking at your UDF, it would definitely benefit from being implemented as a CLR function. Also, you could try splitting your query into two parts: first filter out everything you can without checking string similarities, then apply your "fine" checks using the UDF.
I am just testing a CLR version and so far it runs about 50% faster. 🙂
Good call on the filter-then-check. I'll do some re-structuring around that.
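For anyone following along, the T-SQL side of wiring the CLR version in looks roughly like this (the assembly, class and file names below are mine, not the real ones):
-- one-off server setting
-- EXEC sp_configure 'clr enabled', 1; RECONFIGURE;

CREATE ASSEMBLY EditDistanceClr
    FROM 'C:\clr\EditDistanceClr.dll'  -- placeholder path
    WITH PERMISSION_SET = SAFE;
GO

CREATE FUNCTION dbo.edit_distance_weighted_clr(@s1 nvarchar(4000), @s2 nvarchar(4000))
RETURNS float
AS EXTERNAL NAME EditDistanceClr.[StringFunctions].EditDistanceWeighted;
GO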
March 13, 2012 at 9:21 am
jcb (3/12/2012)
To compare between machines, use a dump from the prod server (if possible). Comparing 5 million NULLs or small strings is faster than comparing 3 million large strings...
Also, the production server can be slower due to concurrency.
If that table is updated a few times per minute, you have a problem running a long check against it.
I'd like to use the same data, but production is R2 (and my dev box isn't), so I'd have to move the data across with exports :/
Indeed, the shape of the data may be quite different, though I'm still surprised by the size of the discrepancy. The process runs overnight, though, so at least concurrency shouldn't be an issue.
March 14, 2012 at 9:35 am
CraigIW (3/12/2012)
Though it does seem to be a lot faster on my 32-bit SQL 2008 development desktop than it is on the production server :/
Did you say the production server was a VM?
If so, does that mean it cannot use parallelism?