Performance Degradation

  • I do not know if this is a SQL Server issue but I am hoping that maybe someone has experienced something similar and might be able to give me some advice on how diagnose this.

    For starters I have a clustered SQL Server environment running on Windows Enterprise Server 2008 r2 (64bit). We recently rebuilt the server to 2008 r2 (JULY)

    8 Quad Core 2.4Ghz processors

    256 GB RAM

    Version SQL Server 2005 SP3CU7 (4273)... SQL Server has 100 Gig RAM dedicated to it

    A couple of weeks ago we had some reports of sluggishness on our application.

    We identified the dedicated database server as the source of the sluggishness but we were unable to identify a smoking gun. We realized that the problem had to be the SQL Server machine because our remote desktop connections were taking very long to connect. We could not identify any resources that were starved on the machine.

    All my DBA monitors were showing a little high in values but nothing that I can say that would create the kind of degradation that we were experiencing.

    CPU utilization across the board was running at 30% to 35%, maybe 5% higher than normal.

    Memory looked good but I can not say that I did a good job of analysis here...I need some better monitors and I am researching that now (Any advice for researching memory issues would be appreciated)

    CPU Queue Waits looked fine

    Normal Blocking ocurrances

    SQL Traces showed nothing completely out of the ordinary as far as response times

    Overall nothing that I could see inside SQL Server that I could say was causing the issue.

    We ended up failing over SQL Server to the B side of the cluster and the problem resolved itself immediately for both the appliciations accessing SQL Server and the Remote Desktop Connection accessing the primary server. We re-booted the A side of the cluster just to be safe and then failed SQL Server back to its primary location and all was fine.

    Now, two weeks later we are starting to see slight signs of the same problem.

    I have opened a case with Microsoft But I have not had any satisfactory results yet.

    My first Question...What resources beyond the obvious should I be monitoring and how?

    Second Part...Part of a performance analysis that I did last year identified one stored procedure that was using high CPU cycles and being called quite frequently. I traced the problem to a FOR XML RAW statement. One of the applications that has been seeing the highest amount of performance degradation calls this procedure often. I have called this out to the developers in hopes that we can redesign this so that we do not use the FOR XML RAW. Note...peak usage for our sytem starts around September and I am starting to see a pattern ever two weeks.

    My Second question....Can frequent use of FOR XML RAW cause a SYSTEM wide issue? Maybe memory related?

    I know this is a complicated issue.

    Any advice would be greatly appreciated.

  • Eric, I believe you're on the right track regarding memory.

    FOR XML by itself is not going to the problem, any more then any heavily used, large result set query would be. A quick question.

    What's the SQL Server's memory settings?


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Min server Memory (MB) - 30720

    Max Server Memory (MB) - 102400

    Perfmon shows SQL Server:memory Manager:Total Server Memory (KB) at 104,857,600 ( or 102400 MB)

  • First question:

    Are You set "lock pages in memory" in gpedit.msc for account sqlservices?

    Twice:

    what are you set RAID for Database, TempDb, Log Files... ? It is for separates LUN and spindles?

    Best regards,

    Michal Marek

    Best regards,
    Michał Marek

  • michal_marek (10/12/2010)


    First question:

    Are You set "lock pages in memory" in gpedit.msc for account sqlservices?

    Twice:

    what are you set RAID for Database, TempDb, Log Files... ? It is for separates LUN and spindles?

    Yes, we are set for "lock pages in memory".

    One question...we have this set via the Domain Admins group which our SQL Service account is part of (not recommended I know). Would this be a problem? Do I need to set the policy on the domain account directly?

    There is a lot of conflicting documentation surrounding this policy on 64bit machines. I have read that it is no longer needed on 64bit machines? We have it set to be safe.

    We are 0+1 across the board accept for the c: which is 1.

    Tempdb has its own LUN with its own set of spindles as does log, data and system DB's

    thanks for the reply, let me know if there is anything else you can think of.

    I am working with Microsft to caputre statistics durring the next degradation occurance.

    Please keep the questions and comments flowing, any help is appreciated.

    I will post any solutions that I come across.

  • Yes, we are set for "lock pages in memory".

    One question...we have this set via the Domain Admins group which our SQL Service account is part of (not recommended I know). Would this be a problem? Do I need to set the policy on the domain account directly?

    No, i think so 🙂

    There is a lot of conflicting documentation surrounding this policy on 64bit machines. I have read that it is no longer needed on 64bit machines? We have it set to be safe.

    This set is for "lock" pages in memory and to avoid Windows and other application "memory pressure",

    We are 0+1 across the board accept for the c: which is 1.

    Tempdb has its own LUN with its own set of spindles as does log, data and system DB's

    thanks for the reply, let me know if there is anything else you can think of.

    I am working with Microsoft to caputre statistics durring the next degradation occurance.

    Please keep the questions and comments flowing, any help is appreciated.

    I will post any solutions that I come across.

    ...

    My suggestion:

    MS DTC, Check in PerfMon Counters for it, statistic, Operations IOs ...

    Other Check:

    Average disk queue lenght (recommendation is <2 ms per physical disk in LUN/RAID for Database, for Log and TempDb...

    Please, read this paper: http://www.servicesorientedstorage.com/assets/pdf/tuning-microsoft-sql-server-2005-performance-wp.pdf

    Use PAL tool to chceck bottlenack disk subsystem http://pal.codeplex.com/

    Best regards,
    Michał Marek

  • Whot is set in "sp_configure" ?

    sp_configure 'show advanced options', 1;

    GO

    RECONFIGURE WITH OVERRIDE;

    GO

    sp_configure

    GO

    Best regards,
    Michał Marek

  • 1) disk queue length is a useless measure of IO performance on modern hardware. avg disk sec/read and /write are best.

    2) do a fileIO stall analysis and wait stat analysis during the event (and before to get a baseline). andy kelly has a presentation/scripts to help with this, although I use track_waitstats_2005 for the waits. There is a whitepaper from microsoft on this too. search for sql server 2005 waits and queues

    3) do NOT use lock pages in memory without reading this first: http://support.microsoft.com/kb/918483

    4) have you analyzed network utilization?

    5) sound like you are doing a lot of remote desktop sessions. why?

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • NameMinMaxConfigValRunVal

    Ad Hoc Distributed Queries0100

    affinity I/O mask-2147483648214748364700

    affinity mask-2147483648214748364700

    affinity64 I/O mask-2147483648214748364700

    affinity64 mask-2147483648214748364700

    Agent XPs0111

    allow updates0100

    awe enabled0100

    blocked process threshold08640055

    c2 audit mode0100

    clr enabled0100

    common criteria compliance enabled0100

    cost threshold for parallelism03276755

    cross db ownership chaining0100

    cursor threshold-12147483647-1-1

    Database Mail XPs0111

    default full-text language0214748364710331033

    default language0999900

    default trace enabled0111

    disallow results from triggers0100

    fill factor (%)010000

    ft crawl bandwidth (max)032767100100

    ft crawl bandwidth (min)03276700

    ft notify bandwidth (max)032767100100

    ft notify bandwidth (min)03276700

    index create memory (KB)704214748364700

    in-doubt xact resolution0200

    lightweight pooling0100

    locks5000214748364700

    max degree of parallelism06444

    max full-text crawl range025644

    max server memory (MB)162147483647102400102400

    max text repl size (B)021474836476553665536

    max worker threads1283276700

    media retention036500

    min memory per query (KB)512214748364720482048

    min server memory (MB)021474836473072030720

    nested triggers0111

    network packet size (B)5123276740964096

    Ole Automation Procedures0111

    open objects0214748364700

    PH timeout (s)136006060

    precompute rank0100

    priority boost0100

    query governor cost limit0214748364700

    query wait (s)-12147483647-1-1

    recovery interval (min)03276700

    remote access0111

    remote admin connections0100

    remote login timeout (s)021474836472020

    remote proc trans0100

    remote query timeout (s)02147483647600600

    Replication XPs0100

    scan for startup procs0100

    server trigger recursion0111

    set working set size0100

    show advanced options0111

    SMO and DMO XPs0111

    SQL Mail XPs0111

    transform noise words0100

    two digit year cutoff1753999920492049

    user connections03276700

    user options03276700

    Web Assistant Procedures0111

    xp_cmdshell0100

  • 1) disk queue length is a useless measure of IO performance on modern hardware. avg disk sec/read and /write are best.

    Yes, am am now capturing and trending these numbers daily.

    2) do a fileIO stall analysis and wait stat analysis during the event (and before to get a baseline). andy kelly has a presentation/scripts to help with this, although I use track_waitstats_2005 for the waits. There is a whitepaper from microsoft on this too. search for sql server 2005 waits and queues

    Yes, I have set up perfmons to capure fileIO stats, wait stats etc. If this does re-occur, I am ready to compare healthy to degraded

    3) do NOT use lock pages in memory without reading this first: http://support.microsoft.com/kb/918483%5B/quote%5D

    I've looked into this as well as consulted MS on this, we have it set as a recomendation from MS.

    4) have you analyzed network utilization?

    That is monitored daily by NW engineer, but I have not monitored from the servers perspective. I have included Network stats in the perfmon traces.

    5) sound like you are doing a lot of remote desktop sessions. why?

    I wouldn't say we do a lot. This is a remote server and RDC is the primary means for logging into the server for debugging and administrator tasks. When we were debugging this issue it was obvious that the RDC was not responding as expected.

    I appreciate the input and questions.

    It sounds like I am on the right track for diagnosis. I found the MS White papper on tracking wait stats and I plan on giving that an in depth look.

    Tomorrow is the day where we might expect a re-occurance. I will post my findings if it does occur.

  • My Second question....Can frequent use of FOR XML RAW cause a SYSTEM wide issue? Maybe memory related?

    I know this is a complicated issue.

    Any advice would be greatly appreciated.

    I did a no-no and piggy backed two questions in the same topic...I am going to post this question in a more suitable forum.

  • hi, can you please leave a link to the new thread ?

    i am following this 🙂

  • Final followup to this thread in case anyone was interested....

    We ended up having the problem re-occur and neither Microsoft nor I could determine the source of the issue even with all of the monitors I had going.

    We ended up failing SQL Server to the secondary side of the cluster and we have not seen the problem since.

    We are also seeing some very strange results in the form of extremely High Disk Response times coming from the simple Task Manager Resource Monitor. Microsoft has indicated that the monitor results are incorrect as they differ from the perf mon results. The second side of the cluster is built with the exact same configuration and we do not see these anomolies.

    I agree with Microsoft that the numbers in the Resource Monitor are probably off, but I believe that this could be an indication of something else going on either with a corrupted OS install/patch or hardware issue.

    We plan on rebuilding this machine at a later time to see if the problems go away.

  • Eric1/2aB (12/20/2010)


    Final followup to this thread in case anyone was interested....

    We ended up having the problem re-occur and neither Microsoft nor I could determine the source of the issue even with all of the monitors I had going.

    We ended up failing SQL Server to the secondary side of the cluster and we have not seen the problem since.

    We are also seeing some very strange results in the form of extremely High Disk Response times coming from the simple Task Manager Resource Monitor. Microsoft has indicated that the monitor results are incorrect as they differ from the perf mon results. The second side of the cluster is built with the exact same configuration and we do not see these anomolies.

    I agree with Microsoft that the numbers in the Resource Monitor are probably off, but I believe that this could be an indication of something else going on either with a corrupted OS install/patch or hardware issue.

    We plan on rebuilding this machine at a later time to see if the problems go away.

    My advice for the "corrupted OS install/patch": If you're really concerned, do a binary (or at least a hash) comparison of all OS files between, in this case, your primary and secondary sides of the cluster. A corrupt file is by definition different than the original. Jacksum[/url] is a Java-based tool to do a variety of hashes, and there are many others (md5sum is an old, out of date Unix favorite).

    I hear a lot of "X must be corrupt" tossed around on issues that nobody's figured out; many times an actual investigation of X shows this to translate directly to "I have no idea", as X is identical on a working and a nonworking system (or a before/after snapshot, whichever).

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply