Query Performance Question

  • Stefan_G (8/3/2010)


    What are the actual execution times of the two query plans you posted?

    The estimated cost for Query1 is 109.4, while the estimated cost for Query2 is 46.3.

    With these numbers it seems very strange that Query1 would really take more than 30 times as long as Query2.

    That was the reason for my original posting... I could not believe that a simple change in which table was used for the search conditions/constraints would have such a dramatic effect on the queries. In fact, I re-ran these tests on two different servers (and two different databases) just to be sure, as well as running the queries themselves three or four times. I also included my

    DBCC DROPCLEANBUFFERS -- frees the buffer pool
    DBCC FREEPROCCACHE -- frees the procedure cache; should not matter here

    statements just to be sure nothing was being cached to give me incorrect results.

    Just as an FYI, for this particular example/database there are 18,670,308 records in the VIEW_InvoiceData view and 40,015 records in the InvSummary table.

    Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)

  • Query 1 took 00:24:33 (+/- 1 minute in other trials) and Query 2 took 00:00:50 (+/- 2 seconds in other trials). Both were run on the server to minimize any network delays.
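    For what it's worth, a common way to capture those wall-clock numbers server-side (rather than reading the SSMS status bar) is the STATISTICS TIME session option; a minimal sketch:

    SET STATISTICS TIME ON;
    -- run Query1 or Query2 here; CPU time and elapsed time are reported per statement
    SET STATISTICS TIME OFF;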

    Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)

  • DB_Newbie2007 (8/3/2010)


    Stefan,

    1) We do have constraints on the tables... here is an example:

    ALTER TABLE [dbo].[Jan10] WITH NOCHECK ADD CONSTRAINT [CK_Jan10] CHECK ((datepart(month,[InvoiceDate])=(1) AND datepart(year,[InvoiceDate])=(2010)))
    GO
    ALTER TABLE [dbo].[Jan10] CHECK CONSTRAINT [CK_Jan10]
    GO

    I am not sure, but I think it would work better with a constraint like this:

    ALTER TABLE [dbo].[Jan10] ADD CONSTRAINT [CK_Jan10] CHECK (InvoiceDate >= '20100101' AND InvoiceDate < '20100201')
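    Also worth noting: because the original constraint was added WITH NOCHECK, SQL Server marks it as not trusted, and the optimizer will not use an untrusted constraint to eliminate member tables from a partitioned view. A minimal check-and-fix sketch, using the constraint name from the example above:

    -- is_not_trusted = 1 means the optimizer ignores this constraint
    SELECT name, is_not_trusted FROM sys.check_constraints WHERE name = 'CK_Jan10';

    -- Re-validate the existing data so the constraint becomes trusted
    ALTER TABLE [dbo].[Jan10] WITH CHECK CHECK CONSTRAINT [CK_Jan10];
    GO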

    2) The VIEW_InvoiceData table listed in the original post was incorrect. It should have been

    CREATE VIEW dbo.VIEW_InvoiceData
    AS
    SELECT * FROM dbo.Jan10 UNION ALL SELECT * FROM dbo.Feb10

    Good

    3) The actual execution plans are correct. "The names (of the fields and tables) have been changed to protect the innocent" (namely, ME 😉 ). A confidentiality agreement prevents me from posting the actual table design, indexes, etc.

    That is too bad. It makes it much harder to really understand your problem.

    There is a field (which I renamed to ChargeType), but again, you have to understand the table design to figure out how that field (plus other fields) combine to generate a "charge" record (again, this is where things get proprietary). This particular report used that (plus one other) field to generate its data (which you can see in the query plans). 😉 I just had to "scrub" the plans before they were posted.

    Hmm, exactly how did you produce the "scrubbed" plans?

    Did you first execute the real query, capture the execution plan, and then perform a search & replace in the plan XML?

    Or did you create a set of modified tables, indexes, and queries, run the modified queries, and then post the actual plans?

    The queries you posted in your first post contained a GROUP BY. The queries shown in the actual plans did not - at least no aggregation is visible in the query plans.

    Do your real problem queries contain GROUP BY or not?

  • Stefan_G (8/3/2010)


    Hmm, exactly how did you produce the "scrubbed" plans?

    Did you first execute the real query, capture the execution plan, and then perform a search & replace in the plan XML?

    Or did you create a set of modified tables, indexes, and queries, run the modified queries, and then post the actual plans?

    The queries you posted in your first post contained a GROUP BY. The queries shown in the actual plans did not - at least no aggregation is visible in the query plans.

    Do your real problem queries contain GROUP BY or not?

    Executed the real query, captured the execution plan(s), then did a search and replace on specific names (Database name, table name, field name).
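    For reference, a simple way to do that capture in SSMS is the STATISTICS XML session option, which returns the actual execution plan as an XML document that can be saved and scrubbed with search & replace:

    SET STATISTICS XML ON;
    GO
    -- run the query here; the actual plan comes back as an extra XML result set
    SET STATISTICS XML OFF;
    GO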

    There is no grouping on the original query.

    *SIGH* OK, if you really must have the original query (again, it is available in the posted execution plans), here it is after scrubbing the field names. This is the query IN ITS ENTIRETY (i.e., run directly in SSMS as is): no grouping, no sub-queries, etc.

    USE ScratchDB
    GO
    DBCC DROPCLEANBUFFERS -- frees the buffer pool
    DBCC FREEPROCCACHE -- frees the procedure cache

    SELECT
        ISU.InvoiceNumber,
        InvoiceItem,
        EnteredWeight,
        IndItemWeight,
        CONVERT(varchar(3), CONVERT(int, SUBSTRING(ID.PkgDim,1,5))) + ' x ' +
        CONVERT(varchar(3), CONVERT(int, SUBSTRING(ID.PkgDim,7,5))) + ' x ' +
        CONVERT(varchar(3), CONVERT(int, SUBSTRING(ID.PkgDim,13,5))) AS PkgDim,
        ISNULL(CustNote1,'') + ' ' + ISNULL(CustNote2,'') + ' ' + ISNULL(CustNote3,'') AS Notes,
        TransactionDate AS PickUp_Date,
        ItemCharge AS Charges
    FROM VIEW_InvoiceData ID
    LEFT OUTER JOIN [dbo].[BilledWtType] WT ON ID.[WtType] = WT.[Code]
    INNER JOIN dbo.InvSummary ISU
        ON ID.InvoiceDate = ISU.InvoiceDate
        AND RIGHT(ID.InvoiceNumber,9) = ISU.InvoiceNumber
    WHERE ISU.InvoiceDate BETWEEN '7/15/2010' AND '8/3/2010'
    --AND ID.InvoiceDate BETWEEN '7/15/2010' AND '8/3/2010'
    AND InvoiceChargeItem = 'abc'
    AND ChargeType = 'def'
    AND CustNote1 IS NOT NULL
    ---- A parameter is passed to this report to determine how to run it,
    ---- which makes the query dynamic
    AND ((PkgDim IS NOT NULL AND 'Yes' = 'Yes')
        OR (PkgDim IS NULL AND 'No' = 'Yes'))

    I understand that having the DB objects, query plans, etc. helps in diagnosis, but I was interested AT A HIGH LEVEL in knowing why there was such a dramatic difference in the execution times. The reason for posting the execution plans in this case was to compare the same query using the two different conditions, to help in diagnosing the differences. My original post was an example of the table structure and indexes, plus the focus was "not about the design of the database objects but on the query performance".

    Part of the question was answered:

    Grant Fritchey (8/3/2010)


    No, a different column in the WHERE clause means a different set of statistics on a different table, which can, and did, lead to differences in the execution plan.

    plus

    Grant Fritchey (8/3/2010)


    Stefan, good thought on the optimizer dropping out, but both queries went through a FULL optimization process. That means the optimizer thought it had them both right. I'm still wondering about the parallelism, because that's the one thing that's making the most difference here, from what I can see.

    Now I have some areas to research to learn more! 😀

    Thanks! 🙂

    Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)

  • DB_Newbie2007 (8/3/2010)


    I understand that having the DB objects, query plans, etc. helps in diagnosis, but I was interested AT A HIGH LEVEL in knowing why there was such a dramatic difference in the execution times. The reason for posting the execution plans in this case was to compare the same query using the two different conditions, to help in diagnosing the differences. My original post was an example of the table structure and indexes, plus the focus was "not about the design of the database objects but on the query performance".

    Sorry for asking so many questions. The reason I did not trust the query in the actual plans was that you said you had scrubbed the plans - I was not sure what that meant.

    Now let me rephrase your actual question slightly:

    1) Why is there such a dramatic difference in execution time between two different (but logically equivalent) formulations of the same query?

    The simple answer is that there is no guarantee that two different queries will get the exact same execution plan. Most of the time you will get very similar plans, but in this case one plan was fairly fast while the other was extremely slow.

    That leads to another question:

    2) Why is the plan generated for Query1 so slow?

    The reason for this is probably bad statistics.

    If you look at the plan for Query1, you see that the clustered index seek on InvSummary has an estimated number of rows of 1, while the actual number of rows is 915. This means the actual cost of this plan is roughly 915 times higher than the estimated cost.

    If that estimate were correct, this would be a very good plan. Since the estimate is so far off, the plan turns out to be very bad.

    Please make sure that the statistics for the primary key on InvSummary are up to date.

    While you are at it, also make sure that all other statistics on the relevant tables are up to date (especially the index on ChargeType in all the partitions).

    If you do this, I believe you will get good performance from both of your queries. At the very least, it would surprise me very much if the same plan were selected for Query1 if the optimizer knew the real cost was not 109 but 109 * 915 = 99,735.
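    For example, a minimal way to force that refresh (using the object names from this thread; FULLSCAN reads every row instead of sampling):

    UPDATE STATISTICS dbo.InvSummary WITH FULLSCAN;
    UPDATE STATISTICS dbo.Jan10 WITH FULLSCAN;
    UPDATE STATISTICS dbo.Feb10 WITH FULLSCAN;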

  • I ran some tests, and they confirm my suspicions:

    * It is very important to formulate the CHECK condition on the individual tables appropriately.

    * If you have correct statistics, it does not matter how you formulate the query.

    Try the script below.

    Change the constraints and see the difference in execution plan.

    Change the predicate in the query and see that it makes no difference with correct statistics.

    Enable the STATISTICS_NORECOMPUTE option and see that now you get different plans depending on the predicate.

    -- Let us play somewhere safe
    USE tempdb
    GO

    if object_id('dbo.InvSummary') is not null DROP TABLE dbo.InvSummary
    GO

    CREATE TABLE dbo.InvSummary (
        InvoiceNumber varchar(12) NOT NULL,
        InvoiceDate smalldatetime NOT NULL,
        CONSTRAINT [PK_InvSum] PRIMARY KEY CLUSTERED (InvoiceDate, InvoiceNumber)
        WITH (IGNORE_DUP_KEY = ON) ON [PRIMARY]
        --WITH (IGNORE_DUP_KEY = ON, STATISTICS_NORECOMPUTE = ON) ON [PRIMARY]
    ) ON [PRIMARY];
    GO

    GO

    if object_id('dbo.Jan10') is not null DROP TABLE dbo.Jan10
    GO

    CREATE TABLE dbo.Jan10 (
        RowNumber int NOT NULL,
        InvoiceNumber varchar(15) NOT NULL,
        InvoiceDate smalldatetime NOT NULL,
        CONSTRAINT [PK_Jan10] PRIMARY KEY CLUSTERED (InvoiceDate, InvoiceNumber, RowNumber) WITH (IGNORE_DUP_KEY = ON) ON [PRIMARY]
        -- Try switching between these two different constraints. Note that only the first performs as expected
        ,CONSTRAINT CK_Jan10 CHECK (InvoiceDate >= '20100101' and InvoiceDate < '20100201')
        --,CONSTRAINT CK_Jan10 CHECK (datepart(year,InvoiceDate)=2010 and datepart(month,InvoiceDate)=1)
    ) ON [PRIMARY];
    GO

    if object_id('dbo.Feb10') is not null DROP TABLE dbo.Feb10
    GO

    CREATE TABLE dbo.Feb10 (
        RowNumber int NOT NULL,
        InvoiceNumber varchar(15) NOT NULL,
        InvoiceDate smalldatetime NOT NULL,
        CONSTRAINT [PK_Feb10] PRIMARY KEY CLUSTERED (InvoiceDate, InvoiceNumber, RowNumber) WITH (IGNORE_DUP_KEY = ON) ON [PRIMARY]
        ,CONSTRAINT CK_Feb10 CHECK (InvoiceDate >= '20100201' and InvoiceDate < '20100301')
        --,CONSTRAINT CK_Feb10 CHECK (datepart(year,InvoiceDate)=2010 and datepart(month,InvoiceDate)=2)
    ) ON [PRIMARY];
    GO

    if object_id('dbo.VIEW_InvoiceData') is not null DROP VIEW dbo.VIEW_InvoiceData
    GO

    CREATE VIEW dbo.VIEW_InvoiceData
    AS
    SELECT * FROM dbo.Jan10 UNION ALL
    SELECT * FROM dbo.Feb10
    GO

    -- generate a tally table
    if object_id('dbo.Tally') is not null drop table dbo.Tally
    go

    create table dbo.Tally (n int primary key clustered)

    insert into dbo.Tally
    select top (1000) row_number() over (order by (select null)) as n
    from sys.columns c1, sys.columns c2, sys.columns c3

    -- generate sample data - 450 rows per invoice
    insert into dbo.Jan10 (InvoiceNumber, RowNumber, InvoiceDate)
    select r1.n as InvoiceNumber, r2.n as RowNumber, dateadd(day, r1.n % 28, '20100101') as InvoiceDate
    from dbo.Tally r1
    cross join dbo.Tally r2
    where r1.n <= 1000 and r2.n <= 450

    insert into dbo.Feb10 (InvoiceNumber, RowNumber, InvoiceDate)
    select r1.n as InvoiceNumber, r2.n as RowNumber, dateadd(day, r1.n % 28, '20100201') as InvoiceDate
    from dbo.Tally r1
    cross join dbo.Tally r2
    where r1.n <= 1000 and r2.n <= 450

    insert into dbo.InvSummary (InvoiceNumber, InvoiceDate)
    select distinct InvoiceNumber, InvoiceDate
    from VIEW_InvoiceData
    go

    ----- Query 1
    SET STATISTICS IO ON

    SELECT ID.InvoiceNumber, ID.InvoiceDate
    FROM dbo.VIEW_InvoiceData ID
    INNER JOIN dbo.InvSummary ISU
        ON ID.InvoiceNumber = RIGHT(RTRIM(ISU.InvoiceNumber), 12)
        AND ID.InvoiceDate = ISU.InvoiceDate
    WHERE
    -- Try switching between these two predicates. It should not matter which you use
    ISU.InvoiceDate BETWEEN '20100101' and '20100105'
    --ID.InvoiceDate BETWEEN '20100101' and '20100105'
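    After loading the data, you can also inspect what the optimizer actually knows about the clustered key; a quick check, assuming the names from the script above:

    -- Show the header, density, and histogram for the PK statistics on InvSummary
    DBCC SHOW_STATISTICS ('dbo.InvSummary', 'PK_InvSum');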

  • Stefan_G (8/4/2010)


    2) Why is the plan generated for Query1 so slow?

    The reason for this is probably bad statistics.

    Ok... I updated the statistics for the primary key on the dbo.InvSummary table, then re-ran the queries! WOW!

    Query 1 ran in 00:01:37 and used parallelism.

    Query 2 ran in 00:00:15.

    I do not know much about statistics and I would have thought (I dislike the word "assumed", but that is what this is) that the AUTO_UPDATE_STATISTICS setting on the database would be taking care of this, but obviously I am wrong. :w00t:

    I think I need to spend some time researching methods to review and update statistics on all of my databases.

    WOW again! I am stunned!

    Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)

  • Glad I could help! 😎

    Just out of curiosity, could you please post the actual query plans of Query1 and Query2 after the statistics update?

  • Execution plans after updating the statistics on the InvSummary table only.

    Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)

  • >>I do not know much about statistics and I would have thought (I dislike the word "assumed", but that is what this is) that the AUTO_UPDATE_STATISTICS setting on the database would be taking care of this, but obviously I am wrong.

    Most people aren't aware that for almost every table in existence you must modify TWENTY PERCENT OF THE ROWS (500 rows plus 20 percent, to be precise) before you get an auto-fired statistics update. And even when that fires, it will normally only sample the values. So for your 500M-row table you must modify 100 MILLION rows of data to get a stats update. This is a crippling issue for MANY of my clients who rely on it. You simply MUST do manual stats updates on some interval that makes sense for each table in your system. I have some client tables that get stats updates every HOUR.
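    A minimal sketch of this kind of manual refresh (the table name is a placeholder; sp_updatestats is the built-in, database-wide option and only touches statistics with row modifications since the last update):

    -- Targeted, full-scan refresh for one hot table (placeholder name):
    UPDATE STATISTICS dbo.SomeBigTable WITH FULLSCAN;

    -- Database-wide refresh using default sampling:
    EXEC sp_updatestats;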

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Kevin,

    Thanks for the post. I found the same information elsewhere on the web about the 20% threshold for 2005 and 2008.

    I found this website that needs more review: http://www.sqlstatistics.com/page/2/

    I also started on my new journey to get a statistics update process running for all of my databases. Hopefully this query will help identify the statistics that need to be updated:

    SELECT o.name AS [Table Name]
        , i.name AS [Index Name]
        , i.rowmodctr AS [Rows Modified]
        , STATS_DATE(ST.object_id, ST.stats_id) AS [Last Updated]
    FROM sys.sysobjects o
    INNER JOIN sys.sysindexes i
        ON o.id = i.id
    INNER JOIN sys.stats ST
        ON o.id = ST.object_id
        AND i.indid = ST.stats_id
    WHERE i.rowmodctr > 0 ---- rows inserted, deleted, or updated since statistics were last updated for the table
        AND o.xtype = 'U' ---- U = user table
        AND STATS_DATE(ST.object_id, ST.stats_id) < DATEADD(day, -5, GETDATE()) ---- statistics not updated within the last 5 days
    ORDER BY o.name, i.rowmodctr DESC

    I will need to modify the query slightly to exclude certain tables that I know do not need to be updated (staging tables, temp tables, or older static tables).
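    A sketch of that exclusion (these table names are purely illustrative placeholders):

    AND o.name NOT LIKE 'Staging%' ---- skip staging tables
    AND o.name NOT IN ('TempLoad', 'Archive2005') ---- known scratch / static tables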

    Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)
