large self-join taking ages, anything I can do?

  • Hi all,

    I have a query that tries to identify unique sales records by comparing them against all previous sales, using an 'edit distance' function on the customer details. This works fine when looking for similar records in the same table, but now that I need records to be unique it takes 20+ hours to run against 5 million records (I have had to stop it at that point; it has actually never finished). I need this to complete in under 12 hours. Thereafter it will be run per day, so it shouldn't be as much of a problem.

    The production system is SQL Server 2008 R2 SP1 on a Windows 2008 VM with 4GB RAM.

    The function is as follows:

    CREATE FUNCTION [dbo].[edit_distance_weighted](@s1 nvarchar(3999), @s2 nvarchar(3999))
    RETURNS float
    AS
    BEGIN
        -- Levenshtein edit distance between @s1 and @s2, scaled by the combined length
        -- of the two strings (0 = identical, larger = more different).
        DECLARE @s1_len int, @s2_len int
        DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
        -- @cv0/@cv1 hold the current and previous rows of the distance matrix,
        -- packed as consecutive 2-byte integers.
        DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)

        SELECT
            @s1_len = LEN(ISNULL(@s1, '')),
            @s2_len = LEN(ISNULL(@s2, '')),
            @cv1 = 0x0000,
            @j = 1, @i = 1, @c = 0

        IF (@s1_len + @s2_len = 0)
        BEGIN
            RETURN 0;
        END;

        WHILE @j <= @s2_len
            SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1

        WHILE @i <= @s1_len
        BEGIN
            SELECT
                @s1_char = SUBSTRING(@s1, @i, 1),
                @c = @i,
                @cv0 = CAST(@i AS binary(2)),
                @j = 1

            WHILE @j <= @s2_len
            BEGIN
                SET @c = @c + 1
                SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j - 1, 2) AS int) +
                    CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
                IF @c > @c_temp SET @c = @c_temp
                SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j + 1, 2) AS int) + 1
                IF @c > @c_temp SET @c = @c_temp
                SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
            END

            SELECT @cv1 = @cv0, @i = @i + 1
        END

        RETURN @c * CAST(2 AS float) / (@s1_len + @s2_len)
    END

    Table DDL:

    CREATE TABLE [dbo].[Sales](
        [SaleID] [int] IDENTITY(1,1) NOT NULL,
        [PhoneNumber] [varchar](20) NULL,
        [Surname] [varchar](50) NULL,
        [FlatNumber] [varchar](5) NULL,
        [HouseNumber] [varchar](5) NULL,
        [Address1] [varchar](100) NULL,
        [Postcode] [varchar](8) NULL,
        [SaleDate] [datetime] NULL,
        [ManualSale] [bit] NULL,
        CONSTRAINT [PK_Sales] PRIMARY KEY CLUSTERED
        (
            [SaleID] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]

    And my query is as follows:

    SELECT DISTINCT
        ManualSales.Surname,
        ManualSales.FlatNumber,
        ManualSales.HouseNumber,
        ManualSales.Address1,
        ManualSales.Postcode,
        ManualSales.PhoneNumber,
        ManualSales.SaleDate
    FROM Sales ManualSales
    WHERE ManualSales.SaleDate IS NOT NULL
    AND ISNULL(ManualSales.ManualSale,0) = 1
    AND NOT EXISTS
    (
        SELECT 1
        FROM Sales S
        WHERE dbo.edit_distance_weighted(ISNULL(S.PhoneNumber,ManualSales.PhoneNumber),ISNULL(ManualSales.PhoneNumber,S.PhoneNumber)) < 0.4
        AND dbo.edit_distance_weighted(S.Postcode,ManualSales.Postcode) < 0.3
        AND dbo.edit_distance_weighted(S.Address1,ManualSales.Address1) < 0.4
        AND dbo.edit_distance_weighted(S.Surname,ManualSales.Surname) < 0.4
        AND S.SaleID != ManualSales.SaleID
        AND S.SaleDate IS NOT NULL
        AND S.SaleDate < ManualSales.SaleDate
        AND ISNULL(S.ManualSale,0) = 0
    )
    ORDER BY ManualSales.SaleDate DESC

    Can anyone think of a way to re-work this, or suggest other changes to make it more efficient? There is one suggested index (SaleDate with ManualSale), but it only saves about 25%, which is not enough.
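    (For reference, the suggested index was along these lines; the exact key and include columns may differ slightly from what was actually proposed:)

    CREATE NONCLUSTERED INDEX IX_Sales_SaleDate_ManualSale
        ON dbo.Sales (SaleDate)
        INCLUDE (ManualSale);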

    Thanks,

    Craig

  • To start with, this is not a self join but an anti-semi-self-join, and it seems like you can turn it into a semi-self-join (or a join). (I am not too sure if it is the right logic, but it is just to show that your query can be redesigned; "NOT EXISTS" can be more expensive than "EXISTS".) Besides, I have changed the ISNULL(ManualSales.ManualSale,0) = 1 to a more optimal statement.

    SELECT DISTINCT
        ManualSales.Surname,
        ManualSales.FlatNumber,
        ManualSales.HouseNumber,
        ManualSales.Address1,
        ManualSales.Postcode,
        ManualSales.PhoneNumber,
        ManualSales.SaleDate
    FROM Sales ManualSales
    WHERE ManualSales.SaleDate IS NOT NULL
    AND ManualSales.ManualSale = 1
    AND ManualSales.ManualSale IS NOT NULL
    AND EXISTS
    (
        SELECT 1
        FROM Sales S
        WHERE dbo.edit_distance_weighted(ISNULL(S.PhoneNumber,ManualSales.PhoneNumber),ISNULL(ManualSales.PhoneNumber,S.PhoneNumber)) > 0.4
        AND dbo.edit_distance_weighted(S.Postcode,ManualSales.Postcode) > 0.3
        AND dbo.edit_distance_weighted(S.Address1,ManualSales.Address1) > 0.4
        AND dbo.edit_distance_weighted(S.Surname,ManualSales.Surname) > 0.4
        AND S.CustomerID = ManualSales.CustomerID
        --AND S.SaleDate IS NOT NULL -- would not be needed in the case of a join/semi-join
        AND S.SaleDate > ManualSales.SaleDate
        AND S.ManualSale = 1 -- may not be needed
        AND S.ManualSale IS NOT NULL -- may not be needed
    )

    Since you did not mention what the function is actually doing, I cannot do much about that, but we all know that scalar functions in the WHERE clause hinder performance a great deal.

    Since changing the function would need additional testing, which may not be a possibility at the moment, another way of dealing with it is to store partial output in a temp table by moving the scalar function values from the WHERE clause to the SELECT clause. With proper indexing on the temp table you would be able to work on even more than 5 million records, and I am pretty sure you won't need anything like 12 hours for that.
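    Something along these lines, for example (a rough, untested sketch only; I have reused the column names from your DDL and the thresholds from your query):

    -- Stage the expensive function results once instead of recalculating them in the WHERE clause.
    SELECT
        MS.SaleID AS ManualSaleID,
        dbo.edit_distance_weighted(ISNULL(S.PhoneNumber, MS.PhoneNumber),
                                   ISNULL(MS.PhoneNumber, S.PhoneNumber)) AS PhoneDist,
        dbo.edit_distance_weighted(S.Postcode, MS.Postcode) AS PostcodeDist,
        dbo.edit_distance_weighted(S.Address1, MS.Address1) AS AddressDist,
        dbo.edit_distance_weighted(S.Surname,  MS.Surname)  AS SurnameDist
    INTO #Distances
    FROM Sales MS
    JOIN Sales S
      ON S.SaleID <> MS.SaleID
     AND S.SaleDate < MS.SaleDate
    WHERE MS.ManualSale = 1
      AND MS.SaleDate IS NOT NULL
      AND S.SaleDate IS NOT NULL
      AND ISNULL(S.ManualSale, 0) = 0;

    CREATE CLUSTERED INDEX IX_Distances ON #Distances (ManualSaleID);

    -- Then filter from the staged values: manual sales with no sufficiently similar earlier sale.
    SELECT MS.Surname, MS.Postcode, MS.SaleDate
    FROM Sales MS
    WHERE MS.ManualSale = 1
      AND MS.SaleDate IS NOT NULL
      AND NOT EXISTS
      (
          SELECT 1
          FROM #Distances D
          WHERE D.ManualSaleID = MS.SaleID
            AND D.PhoneDist    < 0.4
            AND D.PostcodeDist < 0.3
            AND D.AddressDist  < 0.4
            AND D.SurnameDist  < 0.4
      );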

    Hope it helps.

  • Hi Usman,

    Thanks for your reply.

    I'm trying to wrap my head around the differences between my original query and what you suggest. The goal of the query is to look for manual sales which have unique customer information, compared to all past sales. The UDF gives a measure of similarity between two strings (edit distance, weighted for average length of the two strings).

    You suggest swapping

    ISNULL(ManualSales.ManualSale,0) = 1

    for

    AND ManualSales.ManualSale = 1

    AND ManualSales.ManualSale IS NOT NULL

    but I don't think these are logically equivalent? The original would include NULL but the replacement would exclude them? Unless I am missing something about the rest of the query.

    Also, I had a typo in the original post, it should be

    S.SaleID != ManualSales.SaleID

    which is to exclude the manual from matching itself.

    Another way of dealing with it is to store partial output in a temp table by moving the scalar function values from the WHERE clause to the SELECT clause. With proper indexing on the temp table you would be able to work on even more than 5 million records, and I am pretty sure you won't need anything like 12 hours for that.

    Can you elaborate here? If, say, 10% of the sales were manual, would this not be inserting something like 2,500,000,000,000 rows (500,000 manual sales x 5 million prior sales)? I'll have a look at it though, as it does give an alternative route.

    Thanks,

    Craig

  • CraigIW (3/12/2012)


    Hi Usman,

    You suggest swapping

    ISNULL(ManualSales.ManualSale,0) = 1

    for

    AND ManualSales.ManualSale = 1

    AND ManualSales.ManualSale IS NOT NULL

    but I don't think these are logically equivalent? The original would include NULL but the replacement would exclude them? Unless I am missing something about the rest of the query.

    Actually they are both going to produce the same thing, and the extra IS NOT NULL check is redundant. Why bother with an ISNULL check at all? If it is not 1, does it matter whether it is null or not? The original would NOT include NULLs. All the ISNULL in the WHERE clause manages to do is eliminate index usage. You only need to check "AND ManualSales.ManualSale = 1".
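    In other words, these two return the same rows, but only the second leaves ManualSale usable for an index seek (a quick illustration, assuming a suitable index existed):

    -- Non-sargable: wrapping the column in ISNULL() hides it from any index on ManualSale.
    SELECT COUNT(*) FROM dbo.Sales WHERE ISNULL(ManualSale, 0) = 1;

    -- Sargable, and logically the same: = 1 never matches NULL anyway.
    SELECT COUNT(*) FROM dbo.Sales WHERE ManualSale = 1;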

    That UDF is the reason your performance is so awful. Looping in SQL is like injecting heroin into the system. Calling the same scalar UDF four times in a WHERE clause is never going to perform well. You have effectively eliminated all indexing and you have 5 million rows.

    There are lots of people around here, including myself, willing to help with this if you want. I would suggest the first place to start is to eliminate that UDF. If you want some help, take a look at the first link in my signature.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Hi Sean,

    Sorry, I'm a bit tired. You are right about the ISNULL. It is there as the column is nullable, but the expressions are equivalent.

    The UDF is what does all the work. If you can tell me a way to quantify the difference between strings without a UDF, I am all ears. I did explore using the SQL full text functionality but didn't see a way to get it to compare two columns.

    Craig

  • CraigIW (3/12/2012)


    Hi Sean,

    Sorry, I'm a bit tired. You are right about the ISNULL. It is there as the column is nullable, but the expressions are equivalent.

    The UDF is what does all the work. If you can tell me a way to quantify the difference between strings without a UDF, I am all ears. I did explore using the SQL full text functionality but didn't see a way to get it to compare two columns.

    Craig

    What is it doing? I don't really have the energy to parse that obfuscated code to decipher it.


  • Hi,

    Well, I haven't worked on this much (just read through it once), so it's only a guess...

    The problems lie in:

    1) The sub-query is performing a cross join of Sales with itself. If there are x = 5*10^6 rows in it, you can expect a join of x^2.

    2) The function contains nested WHILE loops. Not good for performance.

    3) The function will hide the indexes, so a full table scan will be performed.

    I think you can review it from a design point of view, not just a development one.

    Maybe create some indexed computed columns (if possible) to avoid using the edit_distance_weighted thing (whatever the awful requirement used to create it is).

    http://msdn.microsoft.com/en-us/library/ms189292.aspx
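    Just to illustrate the idea (the normalisation expression here is only an example, not something from your requirement):

    -- Example only: a persisted computed column with a cheap, deterministic expression,
    -- which can then be indexed and used as a coarse pre-filter before the expensive UDF.
    ALTER TABLE dbo.Sales
        ADD PostcodeNormalised AS UPPER(REPLACE(Postcode, ' ', '')) PERSISTED;

    CREATE NONCLUSTERED INDEX IX_Sales_PostcodeNormalised
        ON dbo.Sales (PostcodeNormalised);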

  • Sean Lange (3/12/2012)


    CraigIW (3/12/2012)


    Hi Sean,

    Sorry, I'm a bit tired. You are right about the ISNULL. It is there as the column is nullable, but the expressions are equivalent.

    The UDF is what does all the work. If you can tell me a way to quantify the difference between strings without a UDF, I am all ears. I did explore using the SQL full text functionality but didn't see a way to get it to compare two columns.

    Craig

    What is it doing? I don't really have the energy to parse that obfuscated code to decipher it.

    I picked that function up when first looking for a way to assess the relative difference between strings. I agree it could do with some tidy-up, but it does seem to do the job.

    It's computing a weighted 'edit distance' to gauge how similar the strings are. E.g. 'G11 1AA' is quite similar to 'G11 1AB', but not very similar to 'NW2 4AB'. The function returns the difference as a float so I can set a limit on how much of a difference I want to allow.
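    For example (values are approximate, from the function as posted):

    -- one character differs out of 14 combined characters: returns about 0.14
    SELECT dbo.edit_distance_weighted('G11 1AA', 'G11 1AB') AS similar_postcodes;

    -- five characters differ out of 14 combined characters: returns about 0.71
    SELECT dbo.edit_distance_weighted('G11 1AA', 'NW2 4AB') AS dissimilar_postcodes;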

    Craig

  • As I said, maybe you can change the design, not only the implementation.

    Since you are "cleaning" the table to find duplicate rows, you must run a join of it against itself (all rows x all rows).

    Maybe you can create indexed computed columns and just validate the data before each insert/update.

    Updt: It doesn't surprise me that a nested-WHILE string manipulation executed (5 million x 5 million) times takes a long time.

  • Updt: It doesn't surprise me that a nested-WHILE string manipulation executed (5 million x 5 million) times takes a long time.

    Me neither. Though it does seem to be a lot faster on my 32-bit SQL 2008 development desktop than it is on the production server :/ I'm still comparing, but 3 million rows on my dev machine takes 45 minutes. I'm going to try to get the quantities closer to production and see if the difference is still apparent.

  • To compare between machines, use a dump from the prod server (if possible).

    Comparing 5 million NULLs or small strings is faster than comparing 3 million large strings...

    Anyway, the production server can be slower due to concurrency.

    If that table is updated a few times a minute, you have a problem running a long check against it.

  • Looking at your UDF, it would definitely benefit from being implemented as a CLR function.

    Also, you can try to split your query into two parts.

    First filter out everything without checking string similarities, then apply your "fine" checks using your UDF.
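    As a rough sketch only (the postcode-prefix match is just an illustrative coarse filter; pick whatever cheap predicate suits your data):

    -- Stage 1: cheap, index-friendly coarse filter to cut down the candidate pairs.
    SELECT MS.SaleID AS ManualSaleID, S.SaleID AS OtherSaleID
    INTO #Candidates
    FROM Sales MS
    JOIN Sales S
      ON S.SaleID <> MS.SaleID
     AND S.SaleDate < MS.SaleDate
     AND LEFT(S.Postcode, 3) = LEFT(MS.Postcode, 3)
    WHERE MS.ManualSale = 1
      AND MS.SaleDate IS NOT NULL
      AND S.SaleDate IS NOT NULL;

    -- Stage 2: run the expensive UDF only against the surviving pairs.
    -- Manual sales with no row in this result are the "unique" ones.
    SELECT DISTINCT C.ManualSaleID
    FROM #Candidates C
    JOIN Sales MS ON MS.SaleID = C.ManualSaleID
    JOIN Sales S  ON S.SaleID  = C.OtherSaleID
    WHERE dbo.edit_distance_weighted(S.Postcode, MS.Postcode) < 0.3
      AND dbo.edit_distance_weighted(S.Surname,  MS.Surname)  < 0.4
      AND dbo.edit_distance_weighted(S.Address1, MS.Address1) < 0.4;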


  • Eugene Elutin (3/12/2012)


    Looking at your UDF, it would definitely benefit from being implemented as a CLR function.

    Also, you can try to split your query into two parts.

    First filter out everything without checking string similarities, then apply your "fine" checks using your UDF.

    I am just testing a CLR version and so far it runs about 50% faster. 🙂

    Good call on the filter-then-check. I'll do some re-structuring around that.

  • jcb (3/12/2012)


    To compare between machines, use a dump from the prod server (if possible).

    Comparing 5 million NULLs or small strings is faster than comparing 3 million large strings...

    Anyway, the production server can be slower due to concurrency.

    If that table is updated a few times a minute, you have a problem running a long check against it.

    I'd like to use the same data but production is R2 so I'd have to move all the data across with exports :/

    Indeed, the shape of the data may be quite different, though I'm still surprised at the size of the discrepancy. The process runs overnight though so at least concurrency shouldn't be an issue.

  • CraigIW (3/12/2012)


    Though it does seem to be a lot faster on my 32-bit SQL 2008 development desktop than it is on the production server :/

    Did you say the production server was a VM?

    If so, does that mean it cannot use parallelism?
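    (If you want to check, something like this shows what the instance can see and how parallelism is configured; these are standard system views, nothing specific to your setup:)

    -- How many schedulers/CPUs the instance can see.
    SELECT cpu_count, hyperthread_ratio FROM sys.dm_os_sys_info;

    -- Instance-level parallelism settings.
    SELECT name, value_in_use
    FROM sys.configurations
    WHERE name IN ('max degree of parallelism', 'cost threshold for parallelism');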
