unidentified SEVERE performance issue

  • Everyone,

    I've had an issue come up on a production system that hosts a large and active website.

Essentially what happens is that query optimizer memory spikes, the number of connections spikes, and performance goes to the dogs until I run DBCC FREEPROCCACHE. It happens intermittently, but you can count on it about every 4-5 days.

    Here are the details:

    Microsoft SQL Server 2005 - 9.00.4207.00 (X64) Dec 17 2008 14:30:05 Copyright (c) 1988-2005 Microsoft Corporation Enterprise Edition (64-bit) on Windows NT 5.2 (Build 3790: Service Pack 2)

    Clustered.

    Fairly static data

    I'm updating statistics nightly anyway

Lots of stored procedures are called that (normally) complete very quickly; they run hundreds of thousands of times per day. During the 'events', response times go south.

One of them is selecting from a 7-row table and taking 10 seconds (duration, not CPU time).

There is no increase in webserver activity to correspond with the increased utilization.

The number of active connections and overall connections jumps up about 8x.

The MPA (multi-page allocator) balloons to 10 GB (by my calculation) during the event, possibly due to all the worker threads that can't do anything (2 MB per worker thread on 64-bit).

    Physical Info:

    Disk I/O doesn't increase at all.

Processor use jumps from around 25% to 50% across 16 CPUs, so CPU use roughly doubles.

    We have tons of available system memory and 16 GB in the SQL Server buffer.

    Any advice on where to go next?

    ~BOT

- Start a trace to figure out what's going on. (You can have the trace start a new file every x MB, so if the issue didn't occur during the last day you can delete the older files manually.) A minimal sketch follows below this list.

- Have you set max server memory for your instance?

- Is there any atypical processing going on at that time (so the critical sprocs might be running with a very bad plan for that atypical load)?
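
For example, a rollover server-side trace could look something like the sketch below. The file path, rollover size, and event/column choices are placeholders for illustration and would need to be extended for a real capture.

-- Sketch: server-side trace that rolls over to a new file every 50 MB.
DECLARE @TraceID INT, @MaxFileSize BIGINT;
SET @MaxFileSize = 50;  -- rollover size in MB

-- Option value 2 = TRACE_FILE_ROLLOVER; the path (without .trc) is a placeholder.
EXEC sp_trace_create @TraceID OUTPUT, 2, N'D:\Traces\perf_issue', @MaxFileSize;

-- Capture RPC:Completed (event 10): TextData (column 1), SPID (12), Duration (13).
DECLARE @On BIT;
SET @On = 1;
EXEC sp_trace_setevent @TraceID, 10, 1, @On;
EXEC sp_trace_setevent @TraceID, 10, 12, @On;
EXEC sp_trace_setevent @TraceID, 10, 13, @On;

-- Start the trace (status 1 = start, 0 = stop, 2 = close and delete the definition).
EXEC sp_trace_setstatus @TraceID, 1;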

    Johan

Learn to play, play to learn!

Don't drive faster than your guardian angel can fly...
but keeping both feet on the ground won't get you anywhere :w00t:


  • ALZDBA (3/26/2009)


- Start a trace to figure out what's going on. (You can have the trace start a new file every x MB, so if the issue didn't occur during the last day you can delete the older files manually.)

- Have you set max server memory for your instance?

- Is there any atypical processing going on at that time (so the critical sprocs might be running with a very bad plan for that atypical load)?

We've got a baseline trace and a 'during event' trace, and nothing is jumping out.

Max server memory is 16 GB; the server has 32 GB (we're planning on adding more instances).

I really do think that the critical procs are grabbing a bad plan somehow... maybe not even that bad, but when you run hundreds of times per second it doesn't need to be /that/ bad... just different.

We've tried looking at before/after plans using the plan_handle in sys.dm_exec_cached_plans, and maybe we're not looking at the right ones. A rough sketch of the kind of query we mean is below.

Any idea on how to get a better picture of 'plans gone bad'?

Any way to tell for sure that it is a bad plan causing it... different wait types, for example?
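
For reference, the sort of thing we mean is a query along these lines (just a sketch; the two-argument OBJECT_NAME assumes 2005 SP2 or later, and the TOP/ordering is only one way to slice it):

-- Sketch: cached plans plus runtime stats for procs, worst average duration first.
SELECT TOP (50)
       OBJECT_NAME(st.objectid, st.dbid) AS proc_name,
       cp.usecounts,
       cp.size_in_bytes,
       qs.execution_count,
       qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
       qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
JOIN sys.dm_exec_query_stats AS qs
     ON qs.plan_handle = cp.plan_handle
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE st.objectid IS NOT NULL          -- objects (procs) only, skip ad-hoc batches
ORDER BY avg_elapsed_us DESC;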

Parameter sniffing maybe? Have you isolated which proc is slowing things down? Does it have a default value for any of its parameters? In my experience, it's the default values that make a bad execution plan get chosen.

    Lowell


--Help us help you! If you post a question, make sure you include a CREATE TABLE... statement and an INSERT INTO... statement to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • SQLBOT (3/26/2009)


    Any advice on where to go next?

What are the individual physical disks' %Idle figures, before & during?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (3/26/2009)


    SQLBOT (3/26/2009)


    Any advice on where to go next?

What are the individual physical disks' %Idle figures, before & during?

    Thanks for your response, sir.

Interestingly enough, physical and logical disk utilization spikes to 100% during the event and is 'normal' the rest of the time.

    HOWEVER!

    We're not seeing disk waits in SQL at event time, and SQL File I/O rates are normal during the event (actually below average in some cases).

    Your question makes me wonder if you've seen this somewhere before.

    We're contacting our storage team in the meantime.

    ~BOT

  • Lowell (3/26/2009)


Parameter sniffing maybe? Have you isolated which proc is slowing things down? Does it have a default value for any of its parameters? In my experience, it's the default values that make a bad execution plan get chosen.

    Lowell,

There are dozens of likely candidates. Our problem is that we don't know how to tell whether the proc response times are slow because of the event, or whether they are causing the event.

I'll pass your wisdom on to the dev team and have them check their default values. I hadn't thought of default values causing problems, but I could definitely see it if the default were something ridiculous.

    That fits my original theory, so I'm ready to buy in!!

    Thanks,

    ~BOT

Here's an article I saved on parameter sniffing. Basically, the default behaviour when building an execution plan is to look at the statistics to determine how unique the parameter value is; but when you supply a default value, like NULL, it ignores the statistics and builds a plan assuming the default (since the default is supposed to be the most common value).

But a plan for a NULL value is different from, say, the index seek that the statistics would have led to.

Looks like a kind of parameter sniffing. Here's a brilliant explanation of that "feature" (I've added a quick sketch of option #1 below the quote).

    quote:

    --------------------------------------------------------------------------------

The reason for the performance difference stems from a feature called "parameter sniffing". Consider a stored proc defined as follows:

CREATE PROC proc1 @p1 int AS
SELECT * FROM table1 WHERE c1 = @p1
GO

Keep in mind that the server has to compile a complete execution plan for the proc before the proc begins to execute. In 6.5, at compile time SQL didn't know what the value of @p1 was, so it had to make a lot of guesses when compiling a plan. Suppose all of the actual parameter values for "@p1 int" that a user ever passed into this stored proc were unique integers that were greater than 0, but suppose 40% of the [c1] values in [table1] were, in fact, 0. SQL would use the average density of the column to estimate the number of rows that this predicate would return; this would be an overestimate, and SQL might choose a table scan over an index seek based on the rowcount estimates. A table scan would be the best plan if the parameter value was 0, but unfortunately it happens that users will never or rarely pass @p1=0, so performance of the stored proc for more typical parameters suffers.

In SQL 7.0 or 2000, suppose you executed this proc for the first time (when the sp plan is not in cache) with the command "EXEC proc1 @p1 = 10". Parameter sniffing allows SQL to insert the known value of parameter @p1 into the query at compile time before a plan for the query is generated. Because SQL knows that the value of @p1 is not 0, it can compile a plan that is tailored to the class of parameters that is actually passed into the proc, so for example it might select an index seek instead of a table scan based on the smaller estimated rowcount -- this is a good thing if most of the time 0 is not the value passed as @p1. Generally speaking, this feature allows more efficient stored proc execution plans, but a key requirement for everything to work as expected is that the parameter values used for compilation be "typical".

In your case, the problem is that you have default NULL values for your parameters ("@Today DATETIME = NULL, ...") that are not typical because the parameter values are changed inside the stored proc before they are used -- as a result NULL will never actually be used to search the column. If the first execution of this stored proc doesn't pass in an explicit value for the @Today parameter, SQL believes that its value will be NULL. When SQL compiles the plan for this sp it substitutes NULL for each occurrence of @Today that is embedded within a query. Unfortunately, after execution begins the first thing the stored proc does is change @Today to a non-NULL value if it is found to be NULL, but unfortunately SQL doesn't know about this at compile time. Because NULL is a very atypical parameter value, the plan that SQL generates may not be a good one for the new value of the parameter that is assigned at execution time.

So, the bottom line is that if you assign defaults to your sp parameters and later use those same parameters in a query, the defaults should be "typical" because they will be used during plan generation. If you must use defaults and business logic dictates that they be atypical (as may be the case here if app modifications are not an option), there are two possible solutions if you determine that the substitution of atypical parameter values is causing bad plans:

1. "Disable" parameter sniffing by using local DECLARE'd variables that you SET equal to the parameters inside the stored proc, and use the local variables instead of the offending parameters in the queries. This is the solution that you found yourself. SQL can't use parameter sniffing in this case so it must make some guesses, but in this case the guess based on average column density is better than the plan based on a specific but "wrong" parameter value (NULL).

2. Nest the affected queries somehow so that they run within a different context that will require a distinct execution plan. There are several possibilities here, for example:

a. Put the affected queries in a different "child" stored proc. If you execute that stored proc within this one *after* the parameter @Today has been changed to its final value, parameter sniffing will suddenly become your friend because the value SQL uses to compile the queries inside the child stored proc is the actual value that will be used in the query.

b. Use sp_executesql to execute the affected queries. The plan won't be generated until the sp_executesql stmt actually runs, which is of course after the parameter values have been changed.

c. Use dynamic SQL ("EXEC (@sql)") to execute the affected queries. An equivalent approach would be to put the query in a child stored proc just like 2.a, but execute it within the parent proc with EXEC WITH RECOMPILE.

Option #1 seems to have worked well for you in this case, although sometimes one of the options in #2 is a preferable choice. Here are some guidelines, although when you're dealing with something as complicated as the query optimizer, experimentation is often the best approach:

- If you have only one "class" (defined as values that have similar density in the table) of actual parameter value that is used within a query (even if there are other classes of data in the base table that are never or rarely searched on), 2.a or 2.b is probably the best option. This is because these options permit the actual parameter values to be used during compilation, which should result in the most efficient query plan for that class of parameter.

- If you have multiple "classes" of parameter value (for example, for the column being searched, half the table data is NULL, the other half are unique integers, and you may do searches on either class), 2.c can be effective. The downside is that a new plan for the query must be compiled on each execution, but the upside is that the plan will always be tailored to the parameter value being used for that particular execution. This is best when there is no single execution plan that provides acceptable execution time for all classes of parameters.

HTH -

Bart

------------
Bart Duncan
Microsoft SQL Server Support

    --------------------------------------------------------------------------------
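
To make option #1 concrete, here is a minimal sketch of the local-variable workaround Bart describes. The proc, table, and column names are made up purely for illustration:

-- Sketch of option #1: copy the parameter into a local variable so the optimizer
-- can no longer sniff the (atypical) NULL default and falls back to density estimates.
-- dbo.Orders and its columns are hypothetical.
CREATE PROC dbo.usp_GetRecentOrders
    @Today DATETIME = NULL
AS
BEGIN
    IF @Today IS NULL
        SET @Today = GETDATE();       -- the value actually used at run time

    DECLARE @LocalToday DATETIME;
    SET @LocalToday = @Today;         -- breaks parameter sniffing for this value

    SELECT OrderID, CustomerID, OrderDate
    FROM dbo.Orders
    WHERE OrderDate >= @LocalToday;
END
GO

Options 2.a/2.b (a child proc or sp_executesql) would instead compile the query after @Today has been given its real value, so sniffing works in your favour.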

    Lowell


--Help us help you! If you post a question, make sure you include a CREATE TABLE... statement and an INSERT INTO... statement to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • Thanks for that information, Lowell.

    I'm going to file it away.

My question is: how does SQL Server 2005 decide when it's had enough of a cached plan and needs to throw it out or change it? Is it just clock hands, like the buffer cache?
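
For anyone who wants to watch those clock hands directly, something like this sketch against sys.dm_os_memory_cache_clock_hands shows each plan cache store's sweep activity (the two type values filtered on are the procedure and ad-hoc plan stores):

-- Sketch: watch the clock hands sweeping the plan cache stores.
SELECT name,
       type,
       clock_hand,                -- external (memory pressure) vs. internal (cache too large)
       clock_status,
       rounds_count,              -- number of sweeps so far
       removed_all_rounds_count   -- entries removed across all sweeps
FROM sys.dm_os_memory_cache_clock_hands
WHERE type IN ('CACHESTORE_OBJCP', 'CACHESTORE_SQLCP')
ORDER BY rounds_count DESC;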

    Anyone??

    ~BOT

  • SQLBOT (3/27/2009)


    RBarryYoung (3/26/2009)


    SQLBOT (3/26/2009)


    Any advice on where to go next?

What are the individual physical disks' %Idle figures, before & during?

    Thanks for your response, sir.

Interestingly enough, physical and logical disk utilization spikes to 100% during the event and is 'normal' the rest of the time.

    ...

    Is this based on "% Idle Time" or "% Disk Time"? They are not converse statistics, which is why I ask.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (3/27/2009)


    SQLBOT (3/27/2009)


    RBarryYoung (3/26/2009)


    SQLBOT (3/26/2009)


    Any advice on where to go next?

What are the individual physical disks' %Idle figures, before & during?

    Thanks for your response, sir.

Interestingly enough, physical and logical disk utilization spikes to 100% during the event and is 'normal' the rest of the time.

    ...

    Is this based on "% Idle Time" or "% Disk Time"? They are not converse statistics, which is why I ask.

I'm using a 3rd-party tool that keeps historical data. It calls the counter 'utilization %'. I would imagine it corresponds to the '% Disk Time' perfmon counter, but I've found nothing concrete to prove that.

  • SQLBOT (3/27/2009)


    We're not seeing disk waits in SQL at event time, and SQL File I/O rates are normal during the event (actually below average in some cases).

    How are you determining this? What are the statistics that you are using?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (3/27/2009)


    SQLBOT (3/27/2009)


    We're not seeing disk waits in SQL at event time, and SQL File I/O rates are normal during the event (actually below average in some cases).

    How are you determining this? What are the statistics that you are using?

We're using two different tools that gather history. They are not Microsoft tools.

The file I/O rates are gathered as read and write bytes per second by one of the tools; I'm assuming those come from fn_virtualfilestats. These file I/O statistics follow the same trends they do every day and fall within the baselines we've established.

The waits come from sys.dm_os_wait_stats, watching anecdotally how fast log, I/O, and completion waits are growing. Not too scientific there. (A sketch of the per-file check is below.)
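
For reference, a minimal sketch of that per-file check using sys.dm_io_virtual_file_stats (the DMV counterpart of fn_virtualfilestats); the figures are cumulative since startup, so two samples have to be diffed to get rates:

-- Sketch: cumulative per-file I/O and stall numbers; diff two samples for rates.
SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.file_id,
       vfs.num_of_reads,
       vfs.num_of_bytes_read,
       vfs.num_of_writes,
       vfs.num_of_bytes_written,
       vfs.io_stall_read_ms,
       vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;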

For posterity: it turned out to be an issue with cached user permissions, because the vendor code uses too much ad-hoc SQL.

This article sums up the issues nicely.

The fix is to set trace flag 4621.
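
For anyone hitting the same symptoms: the cache involved is the security token and permission store (USERSTORE_TOKENPERM, often called TokenAndPermUserStore), and a quick way to watch its size is a sketch like this against sys.dm_os_memory_clerks (what counts as 'too big' varies by system):

-- Sketch: current size of the security token/permission cache.
SELECT SUM(single_pages_kb + multi_pages_kb) AS token_perm_kb
FROM sys.dm_os_memory_clerks
WHERE type = 'USERSTORE_TOKENPERM';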

Thank you for this valuable feedback!

Keep in mind that trace flag 4621 is only available with SP3, and it is only advised if the combination of trace flags 4618 and 4610 does not solve your issue! (A short sketch of enabling the flags is below.)
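
A minimal sketch of turning the flags on for the running instance; DBCC TRACEON with -1 applies them globally, and adding them as startup parameters (-T4618 -T4610) is the usual way to keep them across restarts:

-- Sketch: enable the flags globally (requires sysadmin).
DBCC TRACEON (4618, 4610, -1);

-- Check which trace flags are currently active.
DBCC TRACESTATUS (-1);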

    Johan

Learn to play, play to learn!

Don't drive faster than your guardian angel can fly...
but keeping both feet on the ground won't get you anywhere :w00t:

