CPU Pressure.

  • Good morning!

    I'm seeing CPU pressure on my SQL VMs - 4 node cluster, 20 CPU Intel Xeon 2.2 GHz, 128 GB RAM (100 GB max SQL Server memory).  SQL Server 2019 & Windows Server 2016.

    The cluster is for SharePoint 2019.  I started at 95%+ CPU & have stopped a large regular batch copy & some additional poorly formed CPU-bound queries; I will optimise these later.

    Now I'm at 60-70%, give or take.  I'm running Process Explorer & can see that it is SQL Server, not some other errant service, consuming the CPU.  I'm not sure how to tell whether what I have left is my new "normal", whether I should consider requesting more CPUs, and/or what else, if anything, I should do to optimise or investigate.  I'm unsure whether the original designer took resource consumption into account in the spec, or whether the system has just ticked along and been given more resources where necessary.  I don't know where to begin in saying we need X, Y, Z for this, other than "it's been under pressure".

    For additional info: I have 200 busy databases and 3 AGs, currently async while we resolve things, but to be set to sync soon.  Backups are daily.

    I have run various queries I have found & cannot see any stand-out poor queries.  I update stats daily.  With SharePoint, MAXDOP is set to 1 per best practice, & I don't defrag indexes - SharePoint supposedly does this itself (though I see no evidence of it).  My waits are currently skewed by re-seeding various databases, so I will re-run all of this after the tidy-up process.

    Within sys.dm_os_schedulers nothing seems particularly over-subscribed (ideal_workers_limit vs. current_tasks_count).
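
    For reference, the sort of check I ran (a sketch, not my exact query) - a scheduler showing sustained runnable tasks would mean work is queuing for CPU, rather than the CPU just being busy:

    SELECT scheduler_id,
           current_tasks_count,
           runnable_tasks_count,  -- tasks ready to run but waiting for CPU time
           work_queue_count
    FROM sys.dm_os_schedulers
    WHERE status = 'VISIBLE ONLINE';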

    Ideas or pointers either technical or procedural gratefully received as ever.

    Thanks for reading!

  • The below is entirely my opinion - feel free to ignore what I say.

    If you are not frequently running at 100% CPU, then you are not experiencing CPU pressure.

    Having your CPU running at 60%-70% sounds to me like it is being properly utilized, with room for overhead. That doesn't sound like CPU pressure. If it were constantly running at 100% CPU, that would be CPU pressure. You want your CPU to be utilized. If the CPU is constantly sitting at 10% or less, then you have wasted CPU that could be better allocated.

    Having the CPU spike to 95%+ for brief periods likely isn't anything to worry about as long as you are not hitting and sticking at 100%.

    What I would do is look at what is running on the SQL instance when it is slow. If my CPU spikes to 90+%, I will sometimes have a peek at the instance to see what is currently running, as it MAY be that someone ran an odd query that could be better tuned; but most of the time when mine spikes to 90% or higher, it drops back down pretty quickly too. The few times I've seen it spike and stay spiked, it was due to a problem with a query or an SSIS package.
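
    For that quick peek, something along these lines works (just a sketch - adjust to taste):

    -- Currently executing requests, heaviest CPU consumers first
    SELECT r.session_id,
           r.status,
           r.cpu_time,             -- ms of CPU used by this request so far
           r.total_elapsed_time,
           t.text AS query_text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.session_id <> @@SPID   -- exclude this monitoring query itself
    ORDER BY r.cpu_time DESC;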

    Now, if you NEED to fix this, I would look for queries (likely custom stored procedures; I personally wouldn't touch any out-of-the-box stored procedures, as that is just asking for trouble) that do unneeded work. For example, sorting data can be a high-CPU operation, and it is one of those things I recommend doing application-side, not database-side: if you want to re-sort the data, it is MUCH faster to do that in the application layer than in the database layer, UNLESS your application layer doesn't support reordering data.

    A similar thing applies to DISTINCT - if you know your result set will be unique as-is, skip the DISTINCT. If you are uncertain, but also don't think anyone will care, skip the DISTINCT. If it is absolutely required, and you have subqueries, try to do the DISTINCT on the smallest subquery that still gives a distinct result for the whole query. The reason is that DISTINCT needs to sort the data, and sort is a high-CPU operation.
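
    To illustrate the DISTINCT point (made-up Customers/Orders tables, sketch only):

    -- DISTINCT over the whole join sorts the full joined set:
    SELECT DISTINCT c.CustomerID, c.CustomerName
    FROM dbo.Customers AS c
    JOIN dbo.Orders AS o ON o.CustomerID = c.CustomerID;

    -- If Customers is already unique on CustomerID, an EXISTS gives the
    -- same result with no DISTINCT (and no big sort) at all:
    SELECT c.CustomerID, c.CustomerName
    FROM dbo.Customers AS c
    WHERE EXISTS (SELECT 1 FROM dbo.Orders AS o
                  WHERE o.CustomerID = c.CustomerID);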

    As a thought, is your power profile set to "high performance"? If not, I would adjust it.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

    1. How are you determining the 60%-70% usage? Where are you seeing these values?  Have these values exceeded your baselines for CPU, or are you looking at today and it just looks high?  You have captured these metrics over time, right?
    2. Do you have the VMs set to not balloon memory and CPU? The VM host is not over-allocating resources, correct?
    3. As Brian said, do you have the power plan set properly in both the Windows settings and in VMware?
    4. Are you sure CPU is the issue, and this is not an effect of something else?  As one example, poor disk performance can cause an increase in CPU usage.
    5. This is SharePoint. Like Brian, I am thinking that this is the "new normal".  CPU may have averaged 10%-20% when this SharePoint farm was first brought online.  Now, the volume of content, as well as the number of users accessing it, is likely much higher.
    6. Depending upon the nature of the system, I'm typically not too concerned about constant 60-70% CPU usage.  It depends!  But I'm paying for these processors, I am going to use them!!!
    7. How are you updating statistics?  FULLSCAN?  After updating stats, does CPU usage go down?  (See the sketch after this list.)

      1. With SharePoint, MAXDOP is set to 1 per best practice, & I don't defrag indexes - SharePoint supposedly does this itself

        Try rebuilding (not reorganizing) a few of the heavily fragmented indexes on some of the larger tables and see if that has any effect on CPU (sketch after this list).

      2. If neither of these has any measurable effect on CPU, then I would look at the disk subsystem.

    8. Is this VMware?  Do you have the disks on different SCSI controllers?  Are they the paravirtual (PVSCSI) controllers?  You can have 4 SCSI adaptors on a VM.  I would create separate paravirtual adaptors for the data, log, and tempdb drives, and use the default for everything else.
    9. Can you see CPU rise and fall with usage patterns? Does it rise with heavy usage? If it does not rise and fall with usage, then you may be CPU bound.
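
    For point 7 and the rebuild suggestion above, sketches with hypothetical table/index names:

    -- Full-scan statistics update on one large table:
    UPDATE STATISTICS dbo.SomeLargeTable WITH FULLSCAN;

    -- Rebuild (not reorganize) a heavily fragmented index, then watch CPU.
    -- ONLINE = ON requires Enterprise Edition; drop it otherwise.
    ALTER INDEX IX_SomeLargeTable_SomeColumn ON dbo.SomeLargeTable
    REBUILD WITH (ONLINE = ON);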

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Brian and Michael said a lot already, so random thoughts...

    2.2 GHz is incredibly slow for a database server in general, but for a MAXDOP = 1 application like SharePoint it's going to be particularly painful.

    Windows' CPU time does not mean a whole lot for SQL Server, as Brian and Michael alluded to. You can still have CPU pressure in SQL Server at 60% CPU time because of the time it takes SQL Server to switch between tasks. Generally, the more OLTP an application is, the lower the total CPU time you can tolerate before hitting CPU contention in SQL Server. Usually 60-70% is pretty good, however.
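
    A rough way to see that kind of pressure (a sketch; the numbers are cumulative since the last restart or wait-stats clear):

    -- Signal waits = time spent waiting to get back on a CPU after the
    -- resource wait ended. A high share suggests CPU contention even
    -- when overall CPU % looks moderate.
    SELECT CAST(100.0 * SUM(signal_wait_time_ms)
                / NULLIF(SUM(wait_time_ms), 0) AS DECIMAL(5, 2)) AS signal_wait_pct
    FROM sys.dm_os_wait_stats;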

    Did these databases get migrated from a previous version of SQL Server? I don't have any data to support this, nor have I cared to investigate, but I have found that it 'seems' that, after a version upgrade, rebuilding all indexes does something for performance that a FULLSCAN statistics update and a compatibility-level update do not.

  • CreateIndexNonclustered wrote:

    2.2 GHz is incredibly slow for a database server in general, but for a MAXDOP = 1 application it's going to be particularly painful.

    SharePoint requires MAXDOP to be set to 1.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Michael L John wrote:

    CreateIndexNonclustered wrote:

    2.2 GHz is incredibly slow for a database server in general, but for a MAXDOP = 1 application it's going to be particularly painful.

    SharePoint requires MAXDOP to be set to 1.

    That was part of my point, and it was in my original post.

  • Thank you all!  It's so helpful to get your ideas.  I work alone, and sometimes you end up down the rabbit holes of your own mind, so it's great to get 2nd-5th opinions.

    Excellent point about my CPU now being effectively utilised - it's changed the way I am thinking about it. CPU is now fairly steady to the eye. I was called in to troubleshoot, so they did not have baselines previously, but I will now establish baselines over time & look at any spikes.
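
    For the baseline, I'll start by sampling the scheduler-monitor ring buffer (the usual sketch of this query; it holds roughly the last few hours at one sample per minute):

    -- Last hour of per-minute CPU samples: SQL Server's share vs. everything else
    SELECT TOP (60)
           DATEADD(ms, rb.[timestamp] - si.ms_ticks, GETDATE()) AS sample_time,
           rb.record.value('(Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_cpu_pct,
           rb.record.value('(Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS system_idle_pct
    FROM (SELECT [timestamp], CONVERT(xml, record) AS record
          FROM sys.dm_os_ring_buffers
          WHERE ring_buffer_type = 'RING_BUFFER_SCHEDULER_MONITOR'
            AND record LIKE '%<SystemHealth>%') AS rb
    CROSS JOIN sys.dm_os_sys_info AS si
    ORDER BY sample_time DESC;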

    The queries I have stopped were doing all of their processing database-side with very poorly formed queries - clearly not written by anyone with any idea about the database engine.  I will seek to optimise them, or pass them back to the 3rd-party app team to do so, before re-establishing them.

    Re: VM ideas - I will check the power profile & speak to the VM team about your suggestions.  We are using VMM.

    I will experiment with your collective ideas of:

    · Updating stats.
    · Rebuilding indexes.

    I will ask for faster disks.

    I will also look at the disk subsystem.  As with the CPUs, we do not have the best of anything here, so I would imagine it is also a factor.

    Lots to do.

    🙂


    I recommend reading this.  It's VMware's recommendation for SQL Server on a VM.

    https://core.vmware.com/resource/architecting-microsoft-sql-server-vmware-vsphere

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

    One thing before asking for faster disks - I would check whether the disks are actually the bottleneck. I wouldn't try to fix a problem that doesn't exist.
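
    A sketch of the kind of check I mean (the averages are cumulative since instance startup, so trend it rather than trusting one snapshot):

    -- Average I/O latency per database file; sustained double-digit ms
    -- on data files is worth a conversation with the storage team
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.type_desc,  -- ROWS (data) or LOG
           vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
           vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id = vfs.file_id
    ORDER BY avg_read_ms DESC;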

    Also, if the 200 databases are all busy, you may benefit from putting them on their own physical disks (if possible). In a VM it is a bit more of a challenge to ensure the disks presented to the VM are actually physically separated. But if, for example, database 1 is writing to disk and database 2 is reading from disk, and they share the same physical HDD (a worst-case scenario... I really hope you are on SSDs) for the data and log files, it will be slower than if each had its own disks, even if they are HDDs. By "each" I mean the data and log files for databases 1 and 2 should ideally all be on different disks. This isn't always possible, obviously, but ideally that is how you'd want it set up. In a VM, you likely have a SAN back end, which will HOPEFULLY try to balance the load across the disks as activity spikes.

    But step 1 is to determine: do you have a bottleneck, OR is the system properly provisioned? Throwing more hardware at a problem that MAY not exist isn't going to solve anything, and MAY make it harder to get more hardware when you do have a hardware bottleneck. I know my IT department is much more willing to listen to my suggestions (more RAM, more CPU, faster disk, etc.) if I can prove to them that it will help the performance issue I am having.

    One real-world example: I had some SSIS packages failing when they ran on live, but they ran fine on the test/dev environments. I did some digging, and it turned out that the SSIS server on live is shared with a few SQL instances. SSIS was running a complex package and pulling in a lot of data, and my theory was that SSIS ran out of memory to store/process it. To test this, I loaded up the test environment with a SQL instance, set the min memory so that the free memory on the test environment matched live, and the package failed. Stopped the SQL instance and the package succeeded. My options were to reduce the memory footprint of the SQL instances on the live system (not really an option) or get more memory from IT.

    In your scenario, I would be asking "is anything slower than I expect?" If not, then maybe you are trying to fix a problem that doesn't exist. If things ARE slower than you expect, then I'd find evidence of what the problem is (CPU, memory, disk, network) and work with the IT department to test it.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Mr. Brian Gale wrote:

    Also, if the 200 databases are all busy,

    If this is anything like the SharePoint farms I have worked with, there may be a few very busy databases while the rest are almost static.

    Mr. Brian Gale wrote:

    ensure the disks presented to the VM are actually physically separated. But if, for example, database 1 is writing to disk and database 2 is reading from disk, and they share the same physical HDD (a worst-case scenario... I really hope you are on SSDs) for the data and log files

    With VMware, it's critical to create the maximum of 4 SCSI controllers, 3 of which should be the paravirtual (PVSCSI) controller.  Put data, logs, and tempdb on the 3 paravirtual controllers and everything else on the 4th.  That made a HUGE difference in our busy systems.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/
