Investigating (and resolving) bad plans and dynamic SQL.

  • Problem statement:

    - A vendor's client application is intermitttently reporting timeouts when using the search function on one of the forms.

    - The search is being performed using a field that maps to the primary key attribute of the table.

    - Once the timeout occurs, repeated searches continue to result in timeouts.

    - Logging in as a different user can result in different behaviour.

    Potentially related problems:

    On the same screen, the client application regularly reports QBE (query by example) errors.

    Similarly, after a QBE error occurs, a "syntax error" message can occur (not reported as a QBE error). This syntax error is sometimes of the form "invalid column name <value>", where <value> is the value entered into the search field.

    All of these problems can be reproduced in multiple environments (production, UAT, training).

    Query information:

    Using Adam Machanic's sp_whoisactive, the query being executed is:

    (@0 int,@1 int)select top 50 * from Hire_HireMast where @0 = @1 order by Hire_no

    The plan as displayed by sp_whoisactive is (output lists and defined values excluded for brevity):

    <ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan" Version="1.1" Build="10.50.1600.1">

    <BatchSequence>

    <Batch>

    <Statements>

    <StmtSimple StatementText="(@0 int,@1 int)select top 50 * from Hire_HireMast where @0 = @1 order by Hire_no" StatementId="1" StatementCompId="4" StatementType="SELECT" StatementSubTreeCost="5.87112" StatementEstRows="1" StatementOptmLevel="FULL" QueryHash="0x3D50FF3717EE69B9" QueryPlanHash="0x82196C6EC0BA1CBB">

    <StatementSetOptions QUOTED_IDENTIFIER="true" ARITHABORT="false" CONCAT_NULL_YIELDS_NULL="true" ANSI_NULLS="true" ANSI_PADDING="true" ANSI_WARNINGS="true" NUMERIC_ROUNDABORT="false" />

    <QueryPlan CachedPlanSize="424" CompileTime="10" CompileCPU="10" CompileMemory="912">

    <RelOp NodeId="0" PhysicalOp="Sort" LogicalOp="TopN Sort" EstimateRows="1" EstimateIO="0.0112613" EstimateCPU="0.000104215" AvgRowSize="4215" EstimatedTotalSubtreeCost="5.87112" Parallel="0" EstimateRebinds="0" EstimateRewinds="0">

    <OutputList>

    <!-- all columns -->

    </OutputList>

    <MemoryFractions Input="0.333333" Output="1" />

    <TopSort Distinct="0" Rows="50">

    <OrderBy>

    <OrderByColumn Ascending="1">

    <ColumnReference Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Column="Hire_No" />

    </OrderByColumn>

    </OrderBy>

    <RelOp NodeId="1" PhysicalOp="Parallelism" LogicalOp="Gather Streams" EstimateRows="1" EstimateIO="0" EstimateCPU="0.0288832" AvgRowSize="4215" EstimatedTotalSubtreeCost="5.85976" Parallel="1" EstimateRebinds="0" EstimateRewinds="0">

    <OutputList>

    <!-- all columns -->

    </OutputList>

    <Parallelism>

    <RelOp NodeId="2" PhysicalOp="Nested Loops" LogicalOp="Inner Join" EstimateRows="1" EstimateIO="0" EstimateCPU="2.09e-006" AvgRowSize="4215" EstimatedTotalSubtreeCost="5.83087" Parallel="1" EstimateRebinds="0" EstimateRewinds="0">

    <OutputList>

    </OutputList>

    <NestedLoops Optimized="1">

    <OuterReferences>

    <ColumnReference Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Column="Hire_No" />

    </OuterReferences>

    <RelOp NodeId="4" PhysicalOp="Parallelism" LogicalOp="Repartition Streams" EstimateRows="1" EstimateIO="0" EstimateCPU="0.0285033" AvgRowSize="25" EstimatedTotalSubtreeCost="5.85609" Parallel="1" EstimateRebinds="0" EstimateRewinds="0">

    <OutputList>

    <ColumnReference Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Column="Hire_No" />

    <ColumnReference Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Column="DLicenceNo" />

    </OutputList>

    <Parallelism PartitioningType="RoundRobin">

    <RelOp NodeId="5" PhysicalOp="Filter" LogicalOp="Filter" EstimateRows="1" EstimateIO="0" EstimateCPU="0.372119" AvgRowSize="25" EstimatedTotalSubtreeCost="5.82759" Parallel="1" EstimateRebinds="0" EstimateRewinds="0">

    <OutputList>

    <ColumnReference Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Column="Hire_No" />

    <ColumnReference Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Column="DLicenceNo" />

    </OutputList>

    <Filter StartupExpression="1">

    <RelOp NodeId="6" PhysicalOp="Index Scan" LogicalOp="Index Scan" EstimateRows="2.65799e+006" EstimateIO="3.9935" EstimateCPU="1.46197" AvgRowSize="25" EstimatedTotalSubtreeCost="5.45547" TableCardinality="2.65799e+006" Parallel="1" EstimateRebinds="0" EstimateRewinds="0">

    <OutputList>

    <ColumnReference Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Column="Hire_No" />

    <ColumnReference Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Column="DLicenceNo" />

    </OutputList>

    <IndexScan Ordered="0" ForcedIndex="0" ForceSeek="0" NoExpandHint="0">

    <DefinedValues>

    <DefinedValue>

    <ColumnReference Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Column="Hire_No" />

    </DefinedValue>

    <DefinedValue>

    <ColumnReference Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Column="DLicenceNo" />

    </DefinedValue>

    </DefinedValues>

    <Object Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Index="[DLicenceNo]" IndexKind="NonClustered" />

    </IndexScan>

    </RelOp>

    <Predicate>

    <ScalarOperator ScalarString="[@0]=[@1]">

    <Identifier>

    <ColumnReference Column="ConstExpr1003">

    <ScalarOperator>

    <Compare CompareOp="EQ">

    <ScalarOperator>

    <Identifier>

    <ColumnReference Column="@0" />

    </Identifier>

    </ScalarOperator>

    <ScalarOperator>

    <Identifier>

    <ColumnReference Column="@1" />

    </Identifier>

    </ScalarOperator>

    </Compare>

    </ScalarOperator>

    </ColumnReference>

    </Identifier>

    </ScalarOperator>

    </Predicate>

    </Filter>

    </RelOp>

    </Parallelism>

    </RelOp>

    <RelOp NodeId="9" PhysicalOp="Clustered Index Seek" LogicalOp="Clustered Index Seek" EstimateRows="1" EstimateIO="0.003125" EstimateCPU="0.0001581" AvgRowSize="4199" EstimatedTotalSubtreeCost="0.0032831" TableCardinality="2.65799e+006" Parallel="1" EstimateRebinds="0" EstimateRewinds="0">

    <OutputList>

    <!-- all columns -->

    </OutputList>

    <IndexScan Lookup="1" Ordered="1" ScanDirection="FORWARD" ForcedIndex="0" ForceSeek="0" NoExpandHint="0">

    <DefinedValues>

    <!-- all columns -->

    </DefinedValues>

    <Object Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Index="[PK__HIRE_HireMast__6DAC5172]" TableReferenceId="-1" IndexKind="Clustered" />

    <SeekPredicates>

    <SeekPredicateNew>

    <SeekKeys>

    <Prefix ScanType="EQ">

    <RangeColumns>

    <ColumnReference Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Column="Hire_No" />

    </RangeColumns>

    <RangeExpressions>

    <ScalarOperator ScalarString="[the_DB].[dbo].[HIRE_HireMast].[Hire_No]">

    <Identifier>

    <ColumnReference Database="[the_DB]" Schema="[dbo]" Table="[HIRE_HireMast]" Column="Hire_No" />

    </Identifier>

    </ScalarOperator>

    </RangeExpressions>

    </Prefix>

    </SeekKeys>

    </SeekPredicateNew>

    </SeekPredicates>

    </IndexScan>

    </RelOp>

    </NestedLoops>

    </RelOp>

    </Parallelism>

    </RelOp>

    </TopSort>

    </RelOp>

    <ParameterList>

    <ColumnReference Column="@1" ParameterCompiledValue="(0)" />

    <ColumnReference Column="@0" ParameterCompiledValue="(1)" />

    </ParameterList>

    </QueryPlan>

    </StmtSimple>

    </Statements>

    </Batch>

    </BatchSequence>

    </ShowPlanXML>

    Schema information:

    The relevant DDL is as follows:

    CREATE TABLE [dbo].[HIRE_HireMast] (

    [Hire_No] [int] NOT NULL,

    --...

    CONSTRAINT [PK__HIRE_HireMast__6DAC5172] PRIMARY KEY CLUSTERED

    ([Hire_No] ASC)

    )

    CREATE NONCLUSTERED INDEX [DLicenceNo] ON [dbo].[HIRE_HireMast]

    ([DLicenceNo] ASC)

    Server and DB information:

    Microsoft SQL Server 2008 R2 (RTM) - 10.50.1600.1 (X64) Apr 2 2010 15:48:46 Copyright (c) Microsoft Corporation Enterprise Edition (64-bit) on Windows NT 5.2 <X64> (Build 3790: Service Pack 2)

    Cost threshhold for parallelism is 50.

    Max DOP is 2.

    Paramaterization is forced.

    Auto create stats is true.

    Auto update stats is true.

    Auto shrink is false.

    Optimize for ad hoc workloads is false.

    Maximum server memory is 30000MB.

    Investigation:

    Clearly the client application is building some dynamic SQL, and the logic behind the query generation is somewhat faulty.

    Also clearly, the plan being chosen is a severely suboptimal plan.

    I ran the following query against the cache:

    select*

    fromsys.dm_exec_cached_plans p

    cross applysys.dm_exec_sql_text(p.plan_handle) t

    cross apply sys.dm_exec_query_plan(p.plan_handle) x

    --cross apply sys.dm_exec_Plan_attributes(p.plan_handle) a

    wheret.text like '%top 50 *%hire_hiremast%'

    andt.text not like '%sys.dm_exec_cached_plans%'

    This query returns 1437 rows in production. If the cross apply to plan attribues is uncommented I can see that the user_id is NOT -2 in most cases. This may account for the observations of different behaviour for different users.

    Some of the rows show prepared statements with parameterized queries, such as:

    (@0 int,@1 int)select top 50 * from Hire_hireMast where @0 = @1 order by Hire_no

    This particular row appears dozens of times, with different plan handles, most of which are the "same" plan when viewed graphically (using the dlicenceNo index or some other similarly inapporpriate index)

    Other rows show ad hoc statments, such as:

    SET FMTONLY OFF;

    SET NO_BROWSETABLE ON;

    Selecttop 50 *

    fromHire_HireMast

    WHERE1=1

    and(1 = 1 and Hire_no = 3212783 )

    order byHire_no desc

    SET NO_BROWSETABLE OFF;

    In such cases, viewing the plan XML shows only a simple one component operation for each of four statements (3 sets and a select) in the graphical plan. I'm not sure how to drill into the actual query plan for the select in these cases.

    Resolution:

    Unsurprisingly, I can fix this problem by issuing a dbcc freeproccache. Subsequent client seraches by hire_no then use a much saner query plan (clustered index seek) and return instantly.

    I am, however, hesitant to do this in production for two reasons. 1) It seems drastic, and 2) I suspect the problem will eventually return, given that it currently appears in multiple environments.

    Therefore I am looking for a different solution. One that is either more targeted (eg, clearing just the "bad" plans from the cache), or even better, one that helps prevent the problem occurring. I do have access to the vendor's programming staff.

    Any advice would be appreciated. Specific questions I have include:

    1) How do I drill into the graphical plan for the ad hoc select statements as returned by my cache query? In the plan XML I can see that the StmtSimple element includes The attribute ParameterizedPlanHandle followed by a hex value. Can I follow the hex value to a catalog view?

    2) What's the deal with having the column name paramaterized as well as the value being searched for (ie, those prepared statements of the form @0 = @1)? While I understand that this is possible, it seems like something that would confuse the hell out of the optimizer and it's ability to pick from the plan cache. True?

    3) Does anyone have a query handy to detect "bad" plans? In my specific case perhaps I could add a filter on the plan XML searching for parallelism, since it seems to me that a clustered index seek based on a PK value has no use for parallelism, and any plans including parallelism are therefore bad?

  • allmhuran (11/4/2011)


    2) What's the deal with having the column name paramaterized as well as the value being searched for (ie, those prepared statements of the form @0 = @1)? While I understand that this is possible, it seems like something that would confuse the hell out of the optimizer and it's ability to pick from the plan cache. True?

    Column names don't get parameterised, they can't. Parameter means the values change, if the column names change it's a completely different query, completely different plan.

    If you look in the plan, it states what the values for those parameters were at compile time. That query was originally:

    select top 50 * from Hire_HireMast where 1 = 0 order by Hire_no

    <ParameterList>

    <ColumnReference Column="@1" ParameterCompiledValue="(0)" />

    <ColumnReference Column="@0" ParameterCompiledValue="(1)" />

    </ParameterList>

    As for 'bad plans'... http://www.sqlpass.org/24hours/fall2011/default.aspx

    If you're registered and logged in (free) look for the recording of the presentation 'Bad Plan! Sit!'

    Probably an attempt to get the metadata. If that wasn't parameterised, the optimiser would be able to do contradiction detection and know that it wouldn't ever return rows, but the fact that it was subject to forced parameterisation means SQL can't do that and essentially will have to scan the table. This is one of the downsides of forced parameterisation.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • p.s. Schema-qualify tables to ensure that plans can be better reused. Without schema qualification the same table name may map to different tables for different users.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I'm aware that in SQL there's no paramaterization of the column name from SQL's point of view, but it seems the client app is doing some kind of strange hybrid between pure dynamic SQL and paramaterized .net sqlcommand objects (based on the hundreds of plans that are not paramaterized in the cache... that ugly "1=1 and (1=1 and hire_no = 12345" plan, for example). Unfortunately I can't see the app code as I don't have the source. All I can do is make some guesses as to the client logic based on what I see in various tracing and monitoring tools.

    I'm aware of the need for schema qualification (this is what I was getting at with the user_id not being -2), however unfortunately this is not my code, this is all vendor code. I have previously suggested they do this for other reasons (like the one you've mentioned here).

    What about this specific situation, though? Any thoughts on how to prevent the client logic from messing up the optimizer so badly? Or what about drilling into the specific plan for those ad hoc statements? I can't see what those are doing in terms of the optimizer's choices, all I see in the graphical plan are 4 icons (representing set, set, select, set).

  • Oh and thanks for the link. Registering now.

    Edit: Can't seem to find that particular presentation. It wasn't #18 I hope...

    Tangent on PASS: While I was in Canada for a couple of years there was a lot more social activity around things like PASS (monthly pizza, soft drink and lecture presentations at the Microsoft office in downtown Ottawa was fantastic.) Back in Sydney Australia now and this kind of thing is either really hard to find or doesn't seem to exist. There was a PASS group in Sydney but last I checked the website was pretty dormant. All this makes me a sad panda.

  • allmhuran (11/5/2011)


    There was a PASS group in Sydney but last I checked the website was pretty dormant. All this makes me a sad panda.

    Dormant? Or just changed sites? http://www.sqlserver.org.au/

    Am mixing my presentations up. That one was Spring 2011, not Fall. http://www.sqlpass.org/LearningCenter/24Hours2011.aspx Session 10.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks muchly, I'll take a look.

    I guess what I'm hoping for in replies is some ideas for "next steps". I know SQL pretty well... but for the true experts out there (like Gail :D), what would you do next in this situation? I'm not necessarily looking for "the fix", recommendations on further avenues of investigation would be fine. I know SQL Server well enough to get to where I am as described in the OP, but I'm hesitant to go making changes right away (like turning off forced params, or turning on opitmization for ad hoc workloads, or scheduling a full or partial cache clear). Of course, whether anything like this is "the right thing to do" is going to come down to the magic SQL phrase ("It depends"). I get that. Can I look deeper than I have so far? Or is it fair to say that "the client code is obviously crappy, you're going to have a hard time either way, so you might as well just play with it on the back end and see what happens".

    You know what would be great (I've thought this many times over the past few years)? To kidnap one of you guys and/or gals who write the books I read. I'd only keep you for a week and I'd give you pizza. Plus Australia is nice. Promise!

  • allmhuran (11/5/2011)


    What about this specific situation, though? Any thoughts on how to prevent the client logic from messing up the optimizer so badly?

    This kind of thing will likely need fixing at the source, though maybe a plan guide to disable forced parameterisation for queries like that might help (if it's even valid), as then the optimiser can use the contradiction detection and not have to even run that query. But that's an individual patch for an individual problem

    Or what about drilling into the specific plan for those ad hoc statements? I can't see what those are doing in terms of the optimizer's choices, all I see in the graphical plan are 4 icons (representing set, set, select, set).

    Normally if I'm looking for plans I'll look straight in the plan cache (sys.dm_exec_cached_plans and related), not at the execution-related DMVs

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Normally if I'm looking for plans I'll look straight in the plan cache (sys.dm_exec_cached_plans and related), not at the execution-related DMVs

    Indeed... that's where I got the plan from: cached_plans to exec_query_plan via the plan_handle. The only XML plan available in my cache query is from the exec_query_plan DMV, which shows the generalized plan without the operator breakdown. That's why I'm struggling to figure out where to get the actual operators used for the ad hoc statements. Am I cross applying the wrong DMV perhaps?

    Thanks again! (Watching the presentation now BTW)

  • Something else that just hit me: Why would SQL server be using the @0=@1 paramaterized plan for the execution of the "where hire_no = <value>" query? That doesn't make sense. The client code could be sending a 1=0 query first to get the schema before sending the user's actual query... but again, that doesn't seem to make sense. I'm pretty confident .net will just issue a set fmtonly to do this, and in any case, it is the @0=@1 query which is timing out, which, again, makes no sense no matter what values are being passed.... unless it's passing where 1=1 and timing out while doing a full clustered scan.... but then... where did the client's filter go?

    Gah, now I'm more confused than I was when I started... 😀

    Although... heh... OK, I suppose the client dynamic SQL generation logic (which I already know is broken) could be screwing up here. Perhaps it is supposed to be issuing a 1=0 (for some odd reason), or supposed to be building the actual predicated query, but the build_dynamic_sql (or whatever) function is broken and is simply returning the string with a "where 1=1". I mean... clearly the client code loves its 1=1 predicates, some queries have two of them (heck, some even have three!). The more I think about it, the more this "feels" like the best hypothesis. Unfortunately Adam's sp_whoisactive output is only showing me the compiled value, not the runtime value. Might have to wait until monday and run a trace.

    I hope all this thinking out loud is interesting/entertaining/informative to someone out there! 🙂

  • allmhuran (11/5/2011)


    Why would SQL server be using the @0=@1 paramaterized plan for the execution of the "where hire_no = <value>" query? That doesn't make sense.

    It wouldn't be. It can't. Ad-hoc queries match on a hash of their text and those 2 have different texts hence match to different plans. The plan for the query with that silly predicate is only used for queries with that silly predicate

    The client code could be sending a 1=0 query first to get the schema before sending the user's actual query... but again, that doesn't seem to make sense.

    That's probably what it's there for. Seen that enough times.

    I'm pretty confident .net will just issue a set fmtonly to do this, and in any case, it is the @0=@1 query which is timing out, which, again, makes no sense no matter what values are being passed.... unless it's passing where 1=1 and timing out while doing a full clustered scan.... but then... where did the client's filter go?

    Since it's parameterised (the values can change) and does not have any other filter, the optimiser cannot do contradiction detection and turn that into a constant scan. It has to create a plan safe for reuse, and that plan will be a scan, filter and sort (only thing that's safe and will return correct results no matter the values of those 2 parameters).

    If those were hard coded values (and it could be that they are and it's the forced parameterisation that is to blame here) of 1 and 0 rather than parameters, the optimiser would know that the query can never return rows and would simplify that entire query into a constant scan.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks for helping me think this one through Gail. I have a hypothesis I need to test, but it seems to me like it makes a lot of sense.

    Given:

    1) The dynamic SQL generation in the client code is pretty faulty

    2) A plan exists with the @0=@1 predicate, and the operations for that plan are a scan down the DLicenceNo index followed by a filter, then bookmark lookup to the clustered, sort, then top (ie, expensive)

    3) The client code seems to like using the 1=1 predicate.

    Assumption on which the hypothesis is based: (3) exists because it's easier to generate dynamic SQL filter predicates if you simply start off with a "1=1" as part of the template. That way additional predicates are always "and-ed" in, there's no need to figure out whether you need to use "where" or "and" in the concatenation logic. (There will be no "or"s coming from this kind of form based search).

    Hypothesis: A bug in the client SQL generation is causing the concatenation of additional predicates to fail. Perhaps a type mismatch is generating an exception which is silently handled in an empty catch, or something similar. In any case, the hypothesis continues: the hire_no predicate from the user's search is being lost.

    This results in the query "select top 50 * from hire_hiremast where 1 = 1 order by hire_no asc". Normally this would be easily optimized and, presumably result in a clustered index scan on the range including just the first 50 rows of the table.

    However, a cached plan exists that matches this query signature (the one compiled with the values 1 and 0). The engine therefore reuses this existing plan, and what should be a range scan on the clustered ends up being a scan down an entirely different index, followed by filter (on 1=1 in this case), lookup, sort and top. As such there will be a lookup for every row in the table, followed by a sort of the entire table. It would not be surprising if this resulted in a client timeout.

    I can test this hypothesis: It predicts that the runtime values for the parameters of the query which is timing out are both 1. If I find this to be the case when tracing on Monday I will have a high degree of confidence in my analysis and will submit this to the vendor as my diagnosis of their issue.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply