Performance issue with EXCEPT

  • I am rewriting some user-defined functions and I want to test if they give the exact same result as the functions they are replacing. I am testing the functions on all corporations of my database and the result contains about 15,000 rows. Running the following script:

    WITH
        OLD AS (
            SELECT C.UniqueID, F.Value
            FROM Corporations C
            OUTER APPLY (SELECT Value FROM OldFunction(C.UniqueID)) F
        ),
        NEW AS (
            SELECT C.UniqueID, F.Value
            FROM Corporations C
            OUTER APPLY (SELECT Value FROM NewFunction(C.UniqueID)) F
        )
    SELECT * FROM OLD
    EXCEPT
    SELECT * FROM NEW;

    Takes over 21 minutes while executing the EXCEPT outside the WITH like in:

    DECLARE @TOLD TABLE (UniqueID INT, Value INT);
    DECLARE @TNEW TABLE (UniqueID INT, Value INT);

    INSERT INTO @TOLD
        SELECT C.UniqueID, F.Value
        FROM Corporations C
        OUTER APPLY (SELECT Value FROM OldFunction(C.UniqueID)) F;
    INSERT INTO @TNEW
        SELECT C.UniqueID, F.Value
        FROM Corporations C
        OUTER APPLY (SELECT Value FROM NewFunction(C.UniqueID)) F;

    SELECT * FROM @TOLD
    EXCEPT
    SELECT * FROM @TNEW;

    Takes less than one minute. I would prefer using the former script as it is by far more elegant but its performance is not acceptable. Any idea of what is going on?
  • To tell you exactly what is going on would require examining the execution plan.

    In general, though, what is happening is that in the second case, you are forcing a particular execution order for the query by forcing it to materialize the result sets of the two queries. Once that is done, you then run a query using EXCEPT on those two result sets.

    In the CTE case, the optimizer is free to produce results however it sees fit; it won't necessarily process the queries defining the CTEs and then run the final query against those results.

    In this way, CTEs are very much like views and are in-lined with the rest of the query.

    In your case, the method chosen by the optimizer for the first case is just much worse than materializing the intermediate result sets and then running the final query. 

    Sometimes it's quite advantageous to let the optimizer decide this, since it will often choose physical execution order better than we do, but sometimes it doesn't. In cases like this, it might be because of poor statistics, or it might be that this data retrieval is a bit too complicated for the optimizer to figure out as one query.

    Especially if those UDFs are not iTVFs, they might be the source of much of the optimizer's confusion in the larger query.

    Hopefully that helps!

    EDIT: Fixed some typos.

  • Thanks Jacob,

    Both old functions and new functions are iTVFs (new functions use CTE). I could send you the plans which are actually complex since the functions are themselves complex. I was expecting that someone encountered this issue and share the magic trick that would speed up the first script (I don't mind if it is slower than the second one but not that slow!)

  • cmartel 20772 - Wednesday, May 17, 2017 2:33 PM

    Thanks Jacob,

    Both old functions and new functions are iTVFs (new functions use CTE). I could send you the plans which are actually complex since the functions are themselves complex. I was expecting that someone encountered this issue and share the magic trick that would speed up the first script (I don't mind if it is slower than the second one but not that slow!)

    There's no magic solution, though I've found many times in the past that breaking large queries into multiple chunks helps with execution speed.
    I suggest you try using temp tables with version 2 of your query, it may run even faster.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • As Phil indicated, there really isn't a one-size-fits-all trick that applies here.

    Given what you've stated about your code, I'd suspect that the query is simply too complicated for SQL Server to reliably (or at all) come up with a decent execution plan.

    I'd wager that if we saw the execution plan, the reason for early termination would be a time out.

    In that case, there's not a lot to do but either tune all the involved code as well as possible, or give the optimizer smaller, more manageable queries to work with, as you've done by materializing the intermediate result sets in your second example.

    Cheers!

  • Without Execution plans (you can just post them here, BTW, using the "Add File" link below) it's impossible to determine why each query is performing differently. In SQL server 2014 there's some improvements to how temp variables work - perhaps that is helping you. Note that your first solution could be reduced to:
    SELECT C.UniqueID, F.Value
    FROM Corporations C
    OUTER APPLY (SELECT Value FROM OldFunction(C.UniqueID)) F
    EXCEPT
    SELECT C.UniqueID, F.Value
    FROM Corporations C
    OUTER APPLY (SELECT Value FROM NewFunction(C.UniqueID)) F

    As has been mentioned: run both of these queries separately (try using WITH (RECOMPILE) and see if one is very slow. If one is very slow then you've isolated the problem; if they're both blazing fast and the EXCEPT is what's slowing you down then you could consider using temp tables instead of table variables and adding an index on each temp table on UniqueID, Value.

    It's most important to understand how EXCEPT works: EXCEPT returns a distinct result set which means that the data must be sorted. sorting is expensive and each row becomes more expensive to sort the more rows you add. Indexes can prevent sorting but not in your query because you are generating a derived column. I would be curious to see if this helps:
    (
    SELECT DISTINCT C.UniqueID, F.Value
    FROM Corporations C
    OUTER APPLY (SELECT Value FROM OldFunction(C.UniqueID)) F
    )
    EXCEPT
    (
    SELECT DISTINCT C.UniqueID, F.Value
    FROM Corporations C
    OUTER APPLY (SELECT Value FROM NewFunction(C.UniqueID)) F
    )

    Here we're performing 3 sorts instead of one but, depending on the number of duplicates in each of the two queries, you may be performing 3 wildly less expensive sorts instead of a really big one. 

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Like others have said, the code and execution plans are really what we need to be able offer any serious help.

  • Alan.B - Wednesday, May 17, 2017 3:50 PM

    Note that your first solution could be reduced to:
    SELECT C.UniqueID, F.Value
    FROM Corporations C
    OUTER APPLY (SELECT Value FROM OldFunction(C.UniqueID)) F
    EXCEPT
    SELECT C.UniqueID, F.Value
    FROM Corporations C
    OUTER APPLY (SELECT Value FROM NewFunction(C.UniqueID)) F

     

    I believe it can be reduced even further:

    SELECT C.UniqueID, F.Value
    FROM Corporations C
    CROSS APPLY ( -- changed this to a CROSS APPLY
        SELECT Value FROM OldFunction(C.UniqueID)
        EXCEPT 
        SELECT Value FROM NewFunction(C.UniqueID)
    ) F

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • I just found out a magic key. If I change the end of the first script from:

    SELECT * FROM OLD
    EXCEPT
    SELECT * FROM NEW;

    To:

    SELECT C.UniqueID + '|' + CONVERT(VARCHAR, F.Value) FROM OLD
    EXCEPT
    SELECT C.UniqueID + '|' + CONVERT(VARCHAR, F.Value) FROM NEW;

    Then both scripts run fast. The technique works here since I only want to know if there are differences. I do not care about presentation. I nevertheless found a way to trick the optimizer and that makes my day!

  • cmartel 20772 - Thursday, May 18, 2017 2:33 PM

    I just found out a magic key. If I change the end of the first script from:

    SELECT * FROM OLD
    EXCEPT
    SELECT * FROM NEW;

    To:

    SELECT C.UniqueID + '|' + CONVERT(VARCHAR, F.Value) FROM OLD
    EXCEPT
    SELECT C.UniqueID + '|' + CONVERT(VARCHAR, F.Value) FROM NEW;

    Then both scripts run fast. The technique works here since I only want to know if there are differences. I do not care about presentation. I nevertheless found a way to trick the optimizer and that makes my day!

    By not sizing the CONVERT(VARCHAR, you may be truncating your data and invalidating the comparison.
    Try this
    CREATE TABLE #Tmp
    (
        Var_val varchar(100)
    )

    INSERT INTO #Tmp(Var_val)
    VALUES('1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890')

    SELECT CONVERT(varchar, var_val)
    FROM #Tmp

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Michael L John - Thursday, May 18, 2017 2:46 PM

    cmartel 20772 - Thursday, May 18, 2017 2:33 PM

    I just found out a magic key. If I change the end of the first script from:

    SELECT * FROM OLD
    EXCEPT
    SELECT * FROM NEW;

    To:

    SELECT C.UniqueID + '|' + CONVERT(VARCHAR, F.Value) FROM OLD
    EXCEPT
    SELECT C.UniqueID + '|' + CONVERT(VARCHAR, F.Value) FROM NEW;

    Then both scripts run fast. The technique works here since I only want to know if there are differences. I do not care about presentation. I nevertheless found a way to trick the optimizer and that makes my day!

    By not sizing the CONVERT(VARCHAR, you may be truncating your data and invalidating the comparison.
    Try this
    CREATE TABLE #Tmp
    (
        Var_val varchar(100)
    )

    INSERT INTO #Tmp(Var_val)
    VALUES('1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890')

    SELECT CONVERT(varchar, var_val)
    FROM #Tmp

    Mike's correct with his point.  If you don't size a varchar, a variable will default to a length of 1 and a column will default to 30.

  • Ed Wagner - Friday, May 19, 2017 8:31 AM

    Michael L John - Thursday, May 18, 2017 2:46 PM

    cmartel 20772 - Thursday, May 18, 2017 2:33 PM

    I just found out a magic key. If I change the end of the first script from:

    SELECT * FROM OLD
    EXCEPT
    SELECT * FROM NEW;

    To:

    SELECT C.UniqueID + '|' + CONVERT(VARCHAR, F.Value) FROM OLD
    EXCEPT
    SELECT C.UniqueID + '|' + CONVERT(VARCHAR, F.Value) FROM NEW;

    Then both scripts run fast. The technique works here since I only want to know if there are differences. I do not care about presentation. I nevertheless found a way to trick the optimizer and that makes my day!

    By not sizing the CONVERT(VARCHAR, you may be truncating your data and invalidating the comparison.
    Try this
    CREATE TABLE #Tmp
    (
        Var_val varchar(100)
    )

    INSERT INTO #Tmp(Var_val)
    VALUES('1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890')

    SELECT CONVERT(varchar, var_val)
    FROM #Tmp

    Mike's correct with his point.  If you don't size a varchar, a variable will default to a length of 1 and a column will default to 30.

    Minor tweak to that:

    Both variable declaration and column DDL will use a length of 1 for varchar if length is not specified.

    It's when used in CAST/CONVERT without a length that it uses 30.

    Cheers!

  • Jacob Wilkins - Friday, May 19, 2017 8:58 AM

    Ed Wagner - Friday, May 19, 2017 8:31 AM

    Michael L John - Thursday, May 18, 2017 2:46 PM

    cmartel 20772 - Thursday, May 18, 2017 2:33 PM

    I just found out a magic key. If I change the end of the first script from:

    SELECT * FROM OLD
    EXCEPT
    SELECT * FROM NEW;

    To:

    SELECT C.UniqueID + '|' + CONVERT(VARCHAR, F.Value) FROM OLD
    EXCEPT
    SELECT C.UniqueID + '|' + CONVERT(VARCHAR, F.Value) FROM NEW;

    Then both scripts run fast. The technique works here since I only want to know if there are differences. I do not care about presentation. I nevertheless found a way to trick the optimizer and that makes my day!

    By not sizing the CONVERT(VARCHAR, you may be truncating your data and invalidating the comparison.
    Try this
    CREATE TABLE #Tmp
    (
        Var_val varchar(100)
    )

    INSERT INTO #Tmp(Var_val)
    VALUES('1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890')

    SELECT CONVERT(varchar, var_val)
    FROM #Tmp

    Mike's correct with his point.  If you don't size a varchar, a variable will default to a length of 1 and a column will default to 30.

    Minor tweak to that:

    Both variable declaration and column DDL will use a length of 1 for varchar if length is not specified.

    It's when used in CAST/CONVERT without a length that it uses 30.

    Cheers!

    Aaagh!  You're absolutely correct - thanks for correcting it.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply