February 6, 2013 at 2:43 am
nick.mcdermaid (2/6/2013)
p.s. The plan that comes from sys.dm_exec_query_plan is the estimated plan, because it has no actual row counts.
Are you saying that actual plans are not in that table?
It's not a table; it's a view into the plan cache. The plan cache stores plans with only the estimated row counts, hence plans pulled from the cache are estimated plans.
or are you saying the one I uploaded is not an actual plan because it doesn't have 'actuals?'
A plan that has only estimated row counts, not actual row counts, is an estimated plan.
If the actual plan is not in that table, can you tell me where to find it?
Turn the 'Include Actual Execution Plan' option on in SSMS and run the query.
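If you just want to see what's sitting in cache, something like this will return the cached (estimated) plans; the LIKE filter is purely illustrative and would need to match your query text:

-- Pull cached plans from the plan cache (these carry estimated row counts only).
-- The LIKE filter below is just an example of narrowing down to the query of interest.
SELECT st.text, qp.query_plan
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
WHERE st.text LIKE '%States_for_Interval%';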
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
February 6, 2013 at 3:00 am
I was afraid you were going to say that - so the only way to get the actual plan is to run this interactively in SSMS with that option ticked? I can't get the actual plan afterwards from the cache? Very well, I'll do that and see if the actuals differ from estimates. I'm pretty sure they will.
ChrisM, I did originally take the UPDATE out but I wanted to post the full story
Since this takes so long to run I probably won't know the concrete answers to these things until later in the week so I will let you know my results.
However, my suspicions still lie with the non-overlapping range issue. SQL doesn't know that the join is 1:n rather than n:n (and no amount of constraints can tell it that). In other words, I would be very surprised if the estimated row count (zillions of rows) is correct.
Can anyone recall if they have run into an issue like that before? SQL incorrectly estimates a very large result set (almost a cross join) when in fact the result is 1:n, and it generates an incorrect query plan from that?
I had the exact same issue in Oracle so it would be ironic if this is the same problem.
February 6, 2013 at 3:08 am
nick.mcdermaid (2/6/2013)
Can anyone recall if they have run into an issue like that before? SQL incorrectly estimates a very large result set (almost a cross join) when in fact the result is 1:n, and it generates an incorrect query plan from that?
Yes, more than once, and the opposite (SQL incorrectly estimates a very small result set by over-estimating what a filter will remove).
Sometimes you can break the query up using temp tables; that tends to reduce the impact of the mis-estimations. Sometimes a query hint is required.
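Roughly, the temp table approach looks like this (a sketch only, using table and column names from elsewhere in this thread, not tested against your schema): materialise the filtered range table first, index it, then run the update against the temp table so the optimiser has accurate statistics for the intermediate result.

-- Sketch only: split the work with a temp table (index name is made up).
SELECT SE.SRC_System, SE.RowNum_Start, SE.RowNum_End, SE.SmoothTimeKey
INTO #SmoothEvents
FROM tfm.stg_Time_SmoothEvents SE
WHERE SE.SRC_Application = 'LEICA';

CREATE CLUSTERED INDEX IX_SmoothEvents ON #SmoothEvents (SRC_System, RowNum_Start);

-- Assumes the ranges do not overlap, i.e. at most one #SmoothEvents row per SFI row.
UPDATE SFI
SET SFI.Smooth_Time_Key = COALESCE(SE.SmoothTimeKey, SFI.Time_Key)
FROM leica.States_for_Interval SFI
LEFT JOIN #SmoothEvents SE
  ON SE.SRC_System = SFI.SRC_System
 AND SFI.RowNum BETWEEN SE.RowNum_Start AND SE.RowNum_End;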
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
February 6, 2013 at 3:18 am
nick.mcdermaid (2/6/2013)
...ChrisM, I did originally take the UPDATE out but I wanted to post the full story
...
Of course...but when was the last time you tried to run just the SELECT component of the query? If you have time, the EXACT query I posted would be just grand.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
February 6, 2013 at 3:53 am
I did try running just the SELECT query early on in the piece and also when I was trying to get the Actual plan. In both cases I stopped it after 10 minutes, figuring that it would not return within 28 hours! In addition the query was still running on the same box and I didn't want to mess up the metrics.
Anyway, I have a few things to look at and have learnt a lot more about SQL Server query plans.
If SQL Server has come up with the wrong plan, I guess about all I can do is force a hash join and ensure indexes are absolutely optimal? I was hoping that adding more constraints could give it some clues without hints, but I don't think that's possible.
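Something along these lines is what I have in mind for the hint (a sketch only, not my real UPDATE, just the shape of it):

-- Sketch: force a hash join with a statement-level hint.
-- OPTION (HASH JOIN) needs at least one equality predicate (SRC_System here)
-- to build the hash table; the range check is applied as a residual predicate.
UPDATE SFI
SET SFI.Smooth_Time_Key = SE.SmoothTimeKey
FROM leica.States_for_Interval SFI
JOIN tfm.stg_Time_SmoothEvents SE
  ON SE.SRC_System = SFI.SRC_System
 AND SFI.RowNum BETWEEN SE.RowNum_Start AND SE.RowNum_End
OPTION (HASH JOIN);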
Anyway I will post back any findings.
February 6, 2013 at 4:14 am
nick.mcdermaid (2/6/2013)
I did try running just the SELECT query early on in the piece and also when I was trying to get the Actual plan. In both cases I stopped it after 10 minutes, figuring that it would not return within 28 hours! In addition the query was still running on the same box and I didn't want to mess up the metrics. Anyway, I have a few things to look at and have learnt a lot more about SQL Server query plans.
If SQL Server has come up with the wrong plan, I guess about all I can do is force a hash join and ensure indexes are absolutely optimal? I was hoping that adding more constraints could give it some clues without hints, but I don't think that's possible.
Anyway I will post back any findings.
The reason I'm asking for the actual plan of just the SELECT is because, in cases like this where an UPDATE...FROM is taking an unreasonable time to run, the SELECT on its own often runs substantially faster than the whole UPDATE. It provides us with information to understand why this is the case, and hence choices for accelerating the UPDATE. It's quite important.
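In other words, take the UPDATE...FROM and run only its SELECT component. As a generic illustration (placeholder table and column names, not your actual statement):

-- An UPDATE...FROM of this general shape:
UPDATE t
SET t.SomeColumn = s.SomeColumn
FROM dbo.Target t
JOIN dbo.Source s ON s.KeyCol = t.KeyCol;

-- is compared against just its SELECT component:
SELECT t.SomeColumn, s.SomeColumn
FROM dbo.Target t
JOIN dbo.Source s ON s.KeyCol = t.KeyCol;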
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
February 6, 2013 at 5:42 am
Thanks. The estimated plan for just the SELECT is missing everything after the loop join (including stream aggregate), but everything before (indexes used etc.) is exactly the same.
Tomorrow I will run the SELECT as you have posted and get the actual plan. If it's still running after four hours I think I'll try forcing a hash join, because, assuming it's got the estimated rows wrong, that's the only thing I can think of to alter the query plan besides altering indexes.
It is interesting to note that the range summary table is originally generated by a cursor that is first run over the detail table. So my other option is to put ALL of this logic in the cursor beforehand rather than having an UPDATE afterwards. I think I will spend some more time on the UPDATE portion first though.
February 6, 2013 at 5:45 am
nick.mcdermaid (2/6/2013)
... I think I will spend some more time on the UPDATE portion first though.
There's no point in thrashing the horse when someone's left the cart brake on...
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
February 6, 2013 at 5:57 am
I'm interested in getting to the bottom of it. In fact it's far more interesting than what I'm meant to be doing which is why I'm having difficulty spending the time I need to solve it.
Oh well.
February 6, 2013 at 6:43 am
nick.mcdermaid (2/6/2013)
I'm interested in getting to the bottom of it. In fact it's far more interesting than what I'm meant to be doing, which is why I'm having difficulty spending the time I need to solve it. Oh well.
Just as a matter of interest, can you see what estimated plan this gives...
UPDATE SFI
SET
SFI.Smooth_Time_Key = COALESCE(SE.SmoothTimeKey, SFI.Time_Key)
, SFI.Smooth_FailureEvent = CASE
WHEN SE.FirstFailure = 'Y'
AND SFI.RowNum = SE.RowNum_Start
THEN 'Y'
ELSE 'N'
END
FROM leica.States_for_Interval SFI
CROSS APPLY (
SELECT TOP 1
SE.SmoothTimeKey
, SE.FirstFailure
, SE.RowNum_Start
FROM tfm.stg_Time_SmoothEvents SE
WHERE SE.RowNum_Start <= SFI.RowNum
AND SE.RowNum_End >= SFI.RowNum
AND SE.SRC_System = SFI.SRC_System
AND SE.SRC_Application = 'LEICA'
ORDER BY SE.SmoothTimeKey
) AS SE
MM
February 6, 2013 at 3:49 pm
Thanks! That hit the nail on the head - if I understand correctly it's forcing the optimiser to realise it's a 1:n join. The plan is quite different but most notably the output from the nested loops operator = the number of rows input from the transactional table - this is exactly what I was after. Let me try it and see! (I will also upload the plan)
February 6, 2013 at 7:42 pm
Get out of bed people, I'm excited!
I've uploaded the actual query plan (1.Original.sqlplan) and it indeed has got the estimated record counts wrong all along the staging table branch and the nested loops operator.
nested loops estimated = 4 million
nested loops actual = 255,580
Mr Magoo's query (2.Force1n.sqlplan) forces it to get the estimated counts accurate, but it has still decided to use the same join operator (nested loops) on the same actual rows, so it will probably perform worse. At least now we can see what join plan it comes up with when it has the correct estimates... and there is no difference, except that it no longer summarises the join output to update the table. If the performance issue is in the join operator then it appears there is no solution!
The branch of the plan containing the incorrect estimates comes from a pre-generated staging table, so I have a multitude of options for providing columns and indexes so that it gets the estimate correct. Perhaps these have even been suggested in prior posts.
What can I do to the stg_Time_SmoothEvents table so that it knows this is a 1:n join without needing to use a subtable as in Mr Magoo's example? It has been suggested that I create a unique constraint on FROM/TO, which I will try when I have the opportunity.
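For what it's worth, the constraint I have in mind looks something like this (constraint name and exact column list are my own guess):

-- Sketch only: unique constraint on the FROM/TO range columns discussed above.
ALTER TABLE tfm.stg_Time_SmoothEvents
ADD CONSTRAINT UQ_stg_Time_SmoothEvents_Range
UNIQUE (SRC_System, SRC_Application, RowNum_Start, RowNum_End);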
Finally, not trolling here: just observing that Oracle has a handy 'cardinality' hint to help in these situations.
February 6, 2013 at 10:43 pm
I took an uneducated guess and
-Added a unique index to RowNum_Start
-Removed the SRC_System join (I know, I know, changing the requirements is cheating)
and query time went from 29 minutes to 1 second. I'm surprised, as the very first thing I did was check the query plan to see if it would suggest any indexes and it didn't.
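The index was just a plain unique index on the start column, something like this (index name is illustrative):

-- Sketch of the unique index described above (name is illustrative).
CREATE UNIQUE NONCLUSTERED INDEX UX_stg_Time_SmoothEvents_RowNum_Start
ON tfm.stg_Time_SmoothEvents (RowNum_Start);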
I tried many other combinations of indexes and PKs, but unless I removed the join on SRC_System there was no difference.
I know that some prior suggestions were heading in this general direction anyway but doing this the hard way has been a valuable learning experience for me.
The new query plan (4, attached) doesn't have an Index spool and has a hash match aggregate instead of a stream aggregate. It also has an index seek instead of scan. I'm reading up and trying to make sense of this. I'm suspicious of the Index spool now.
What would be really helpful is an 'Actual IO' and 'Actual CPU' in the actual query plan as well - that would tell me for sure afterwards which operator was unexpectedly high in resource usage.
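Per-statement, though not per-operator, actual CPU and IO figures can at least be captured in SSMS with the STATISTICS settings:

-- Reports logical reads and CPU/elapsed time per statement in the Messages tab.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- ...then run the query.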
I now need to remove the ability to process multiple systems from the outer logic.... unless anyone can think of a way to re-add this join to the query?
Any comments on what exactly has happened here are more than welcome.
February 7, 2013 at 8:55 am
nick.mcdermaid (2/6/2013)
I took an uneducated guess and
-Added a unique index to RowNum_Start
-Removed the SRC_System join (I know, I know, changing the requirements is cheating)
and query time went from 29 minutes to 1 second. I'm surprised, as the very first thing I did was check the query plan to see if it would suggest any indexes and it didn't.
I tried many other combinations of indexes and PKs, but unless I removed the join on SRC_System there was no difference.
I know that some prior suggestions were heading in this general direction anyway but doing this the hard way has been a valuable learning experience for me.
The new query plan (4, attached) doesn't have an Index spool and has a hash match aggregate instead of a stream aggregate. It also has an index seek instead of scan. I'm reading up and trying to make sense of this. I'm suspicious of the Index spool now.
What would be really helpful is an 'Actual IO' and 'Actual CPU' in the actual query plan as well - that would tell me for sure afterwards which operator was unexpectedly high in resource usage.
I now need to remove the ability to process multiple systems from the outer logic.... unless anyone can think of a way to re-add this join to the query?
Any comments on what exactly has happened here are more than welcome.
Did you try clustering SE properly? If you get away from the "never think about it, always cluster by ident" "rule" and properly cluster tables, a lot of the constant need to ueber-"tune" every query and create more and more covering indexes will go away.
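For example, something along these lines (index name and exact key order are just a suggestion):

-- Sketch: cluster the staging table on the columns the range lookup seeks on.
-- Assumes there is no existing clustered index; adjust as needed.
CREATE CLUSTERED INDEX CX_stg_Time_SmoothEvents
ON tfm.stg_Time_SmoothEvents (SRC_System, SRC_Application, RowNum_Start);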
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
February 7, 2013 at 3:56 pm
Thanks for your advice Scott.
I did change the PK to
RowNum_Start, SRC_System, SRC_Application
and also
SRC_System, SRC_Application, RowNum_Start
(Yes, I know you can create a clustered index that isn't based on the primary key.)
In both cases it was still stuck on the old query plan unless I removed SRC_System from the join. I do hear what you're saying about not getting into the habit of just clustering on the primary key every time; that's good advice.
There are probably more benefits to be had there but unfortunately my time to optimise this has run out. I'll learn some more next time I have this kind of issue (there will definitely be a next time).