Performance degradation - Help troubleshooting

  • We have a server. App team has noticed performance degradation from 1-2 months. We don’t think there are any changes in windows or sql server. Seeing 100% CPU usage by SQL Server & seeing io_stall_read_ms. Queries that were running below 4 min are taking 20min now. Any help will be appreciated . Using sql 2014 on windows 2012. The errors from sql logs are below

    SQL Server has encountered 20 occurrences of I/O requests taking longer than 15 seconds to complete on file … in database id. The is file

    The instance of the sql server DBA engine Canot obtain a lock resource at this time. Rerun your statement when there are fewer active users.

    • This topic was modified 2 years, 2 months ago by  ramana3327.
    • This topic was modified 2 years, 2 months ago by  ramana3327.
  • Slow disk errors in the log suggest some sort of hardware problem.

    I would get your sysadmin to look for failing RAID disks, problems with a SAN, bad network connections etc. Also check that anti-virus software is not scanning the database files.

    You are then left with all the usual suspects:- MAX memory in SQL Server, large memory grants, missing indexes etc. I would be surprised if these caused IO erros in the log.

     

  • Could you please elaborate further with some tangible details?

    😎

    From what you have posted, my first guess would be the wrong butterfly flapping its wings in the wrong direction in central Amazon, but then again, I might be wrong.

  • This is going to sound strange but it's happened where I work at twice now...

    On high performance storage, there's usually a rechargeable cache battery.  Most people don't even know such a thing exists.  Like any rechargeable battery, it has a lifetime.  If it fails for one reason or another, even if there is no power loss to the system, writes to the system go haywire.  For our latest go 'round of fun, it turned backups that usually only take 90 minutes to sudden take 8-12 hours.

    Of course, there are a thousand other crazy reasons like someone kinked a cable or someone meant to make an "improvement" by changing a switch setting on a piece of network hardware, or a network card suddenly has a problem or a physical connection got dirty by air plating.  I remember one issue where someone plugged in a new coffee pot just outside the server room... that particular circuit happened to be on the same circuit as some network equipment and the "feedback" noise that it created on the 120 volt line was too much for the line filters and every time the hotplate on the pot would turn on to keep the coffee warm, nothing made it through the related network.

    Of course, it could be that "crap code" has overwhelmed you system by reaching a "tipping point" as scalability increased, as well.

    Problems like these are difficult to hammer down a solution for.

    And you say that you're not aware of any changes... I wouldn't bet on that.  Suspect EVERYTHING but, keep in mind that the data is always changing and very likely growing rapidly.

    Other questions to ask are things like "When was the last time you rebuilt statistics"?  And if the system is at 100% CPU usage, you have nothing to lose... try a DBCC FREEPROCCACHE and see what happens after 10 minutes.

    While we're on that subject, what kinds of connections are being made?  We went for years with no issues and then we suddenly starting having "spurts" of 10 and 20 minutes of 100% CPU usage and it just got worse from there.  Adding extra CPUs only made it MUCH worse.  It turned out that everyone thought that the application connections were defaulting to disabling M.A.R.S.  It turned out that the documentation was wrong (and I proved it across many companies that my fellow DBAs and Developer friend worked at) and that you had to explicitly define M.A.R.S. as being disabled.  It took us about 20 minutes to do that an all of the sudden, we were golden and actually running better than before the CPU issue started.

    We found that issue by looking for "rollbacks" on the SPIDs, which M.A.R.S. is prone to doing if it's not appropriate (and it usually isn't) for it to be enabled.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi,

    The databases has Auto statistics enabled. Some of the statistics are showing recent date. I updated on some tables.

    Whenever I open the performance dashboard, the user sessions are 2-3 times than user requests. Wait time for user requests is showing 0 but user sessions are varying up to 99.50%. CPU times for user requests are 22.50% & user sessions are .50%. Also most of the times, there are 1-2 queries (Noticed same query most of the time) showing CPU-Parallelism under current waiting requests. It has 8 cpus. Also under miscellaneous information, I am seeing missing indexes (About 400 indexes). The data for these databases will come from another SQL Server database. The SQL Scripts will be running in a bat file. So this server is more like data warehouse. Is creating missing missing indexes with high cost will be useful? Most of the tables has primary key. I don't have any baseline. Not sure how it was previously. Is restarting the SQL Server will helpful? Currently the performance dashboard has the message Overall performance may be degraded because the system shows signs of being cpu-bound. The SQL Server instance is consuming the majority of the CPU.

    From the device management the disks are showing VM virtual disk SCSI. Any suggestions will be helpful.

    Thanks,

    Ramana

     

  • As Ken said, stall's are usually an infrastructure thing. (network, san, ...)

    Could also be due a new query plan generating a heavier load

     

    The server is cpu-bound -> find the most intensive cpu queries

    Not sure if the queries in https://docs.microsoft.com/en-us/troubleshoot/sql/performance/troubleshoot-high-cpu-usage-issues

    work for SQL 2014

     

    sp_Blitz, sp_Blitzfirst might give you some general working points

    https://github.com/BrentOzarULTD/SQL-Server-First-Responder-Kit/

    Rebooting the server just to clear the plan cache isn't really recommended as you may suffer downtime from rollbacks

  • The Performance Dashboard has an area for the most expensive queries.  Have a look at those.  Unfortunately, it doesn't tell you which part of some really long queries might be the part to concentrate on.

    Fortunately, there are other reports that will.  Using the same method you used to get to the performance dashboard. go look at the two reports that start with "Performance - Top Queries by Total..." .  The code found in that are what you need to concentrate on because it lists code by totals whether they ran once or 100,000 times since the proc cache was cleared.

    And, yeah... both find procs  and stuff coming from the front end and other sources of execution requests.

    Also, are the 8 CPU's you're talking about the actual number of physical core your machine has?  I suspect not... I suspect those are "logical processors.  If that's true, then you only have 4 physical core and if you're that CPU bound, consider spending the bucks for more core and SQL Server licenses to get you out of the woods until you can square away some of the code running on that box.

    And do check out the articles/products that Jo Pattyn recommended above.  Brent Ozar's collection of sp_Blitz* code can be seriously helpful.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Have you actually had the hardware checked?

    A hardware fault is the most likely cause of your problems and it could mean your server is about the fail completely.

  • Ken McKelvey wrote:

    Have you actually had the hardware checked?

    A hardware fault is the most likely cause of your problems and it could mean your server is about the fail completely.

    I totally agree... I also touched on some strange issues that had such a 1-2 month degradation as you describe.  Of course, that's also a lot of time for bad code to be added or a large amount of data to be added to the system, as well.

    ramana3327 wrote:

    Hi,

    The databases has Auto statistics enabled. Some of the statistics are showing recent date. I updated on some tables.

    To be honest, Auto statistics don't work well with indexes with ever-increasing keys.  And, even after the improvements they made, I still think that they fall a bit short.  Go take a look at how many rows have been modified on the stats that cover your indexes.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • To be honest, Auto statistics don't work well with indexes with ever-increasing keys. And, even after the improvements they made, I still think that they fall a bit short. Go take a look at how many rows have been modified on the stats that cover your indexes.

    As this is a SQL2014 thread the OP might want to check the trace flag 2371 has been set. (I do not think it was the default until SQL2016.)

    I think it would require a high through-put with some bad code, bad indexes and bad maintenance to produce a 15s IO alert. We know very little about the OPs system so it is difficult to judge. I still strongly suspect a hardware problem.

     

  • Agreed on both.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • From the performan

    The disk bytes/sec

    last: 377,652,605 Avg: 1,091,938,363, Min: 209,067,244 , Max: 1,493,374,735

    last: 204,419,027 Avg: 318,095,304  Min: 149,107,905 , Max: 638,625,164

    IO Data bytes/sec

    last: 393,516,961 Avg: 334,045,587 Min: 158,358,914 , Max: 656,608,128

    IO Data bytes/sec

    last: 1,455,148,834 Avg: 1,292,252,400 Min: 148,377,209, Max: 1,591,670,412

    Disk reads/sec:

    last: 12,521.431 Avg: 14,431.182 Min: 5,803.871 Max: 19,383.516

    Disk transfers/sec

    last: 16,092.022 Avg: 15,440.802 Min: 8,740.250, Max: 19,782.266

    Disk Write bytes/sec

    last: 2,146,310 Avg: 4,977,631 Min: 0 Max: 116,569,986

    Disk read bytes/sec

    last: 989,158,829 Avg: 1,296,936,202 Min: 656,876,770 Max: 1,579,871,435

    page reads/sec SQLserver : Buffer Manager

    last: 140,795.925 Avg: 156,220.621 Min: 35,837.037 Max:198,068.104

    page writes/sec sql server: Buffer Manager

    last: 26.001 Avg: 133.777, Min: 8.99 Max: 1,112.077

     

    • This reply was modified 2 years, 2 months ago by  ramana3327.
  • All that information provides is that something is beating the hell our of your disk system.  You need to find out what that is.  Have you tried looking at Task Manager or anything like that?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks everyone. Statistics update did the magic. The CPU inbound message is gone. Waiting for the confirmation from App tram

  • ramana3327 wrote:

    Thanks everyone. Statistics update did the magic. The CPU inbound message is gone. Waiting for the confirmation from App tram

    Awesome.  Thanks for the feedback on what the fix was.

    As a bit of a sidebar, I update stats every week on most things.  I do NOT base it on the date except to say that if they were rebuilt in the last 24 hours, not to rebuild them.  I base the number of rows modified and whether or not the index has an ever-increasing key.  If it has one of those and it's had mods and it hasn't been rebuilt in the last 24 hours, I rebuild it.  I also use FULL SCAN because it takes (old study though... haven't checked lately) nearly as long as a 25% scan (for example).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 32 total)

You must be logged in to reply to this topic. Login to reply