April 9, 2021 at 2:44 pm
Trying to figure this out, unsuccessfully so far. Can anyone advise?
Server specs are in the attachment: 4 NUMA nodes, 122 cores, all cores except 0-7 given to SQL Server through CPU affinity. Max Server Memory is 913,402 MB. I have 24 tempdb data files, 1.3 GB each, and a 300 GB log for tempdb, on an SSD drive. The system is highly transactional and makes extensive use of temp tables. The user database getting the lion's share of transactions has one primary data file and one secondary data file used exclusively for indexes; these files reside on separate physical disks.
Every now and then, with no pattern I've been able to predict, CPU activity on the system spikes, specifically kernel-level activity, which results in slow queries and connection timeouts (see attached DataDog screenshots; the relevant machine is labeled "03"). The PHP client is pooling connections. The only somewhat predictable part is that this almost invariably happens once a day at 10:00 AM; after that it may or may not recur throughout the day.
SolarWinds Database Performance Analyzer shows a drastic increase in PAGELATCH_XX waits during this time (see DPA screenshot), but no significant disk or file activity at that time. The queries affected by these waits are the same queries that execute throughout the day; there's nothing new or out of the ordinary in that regard, and both the load and the type of load are consistent.
How can I trace the cause of these CPU spikes and related timeouts?
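For reference, one way to see whether those PAGELATCH_XX waits point at tempdb allocation pages (PFS/GAM/SGAM contention, a common pattern with heavy temp-table use) is to sample the waiting tasks while the spike is happening. This is a generic diagnostic sketch, not something from the original post:

```sql
-- Sample waiting tasks during a spike and pull out the PAGELATCH waits.
-- resource_description is formatted 'db_id:file_id:page_id'; db_id 2 is tempdb.
-- Page 1 (PFS), 2 (GAM), and 3 (SGAM) in a file indicate allocation-page
-- contention.
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.resource_description
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.wait_type LIKE N'PAGELATCH%';
```

If most rows show db_id 2 and allocation pages, the spike is classic tempdb contention; if they point at the user database instead, the waits are on a hot page there and the tempdb layout isn't the issue.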
April 9, 2021 at 5:49 pm
Looking at the attached images, I would say something is "happening" on the server at that time. Possibly a scheduled thing, possibly a malicious thing, possibly an automated thing.
I would check for scheduled tasks that start at 10:00 AM and MAY be scheduled to re-occur.
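On the SQL side of that check, Agent jobs with a 10:00 AM start time can be listed straight from msdb. A sketch (the date/time encoding is SQL Server's integer HHMMSS format; adjust if you also want recurring intra-day schedules):

```sql
-- Find enabled SQL Agent jobs whose schedule starts at 10:00:00 AM.
-- active_start_time is an integer encoded as HHMMSS (100000 = 10:00 AM).
SELECT j.name      AS job_name,
       s.name      AS schedule_name,
       s.active_start_time,
       s.freq_type -- 4 = daily, 8 = weekly, etc.
FROM msdb.dbo.sysjobs          AS j
JOIN msdb.dbo.sysjobschedules  AS js ON js.job_id = j.job_id
JOIN msdb.dbo.sysschedules     AS s  ON s.schedule_id = js.schedule_id
WHERE s.enabled = 1
  AND s.active_start_time = 100000;
```

Windows Task Scheduler entries on the box are worth the same pass, since a 10:00 AM OS-level task wouldn't show up in msdb at all.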
Looking at the first DataDog image, those drastic spikes in PHP errors right around when the CPU jumps up would be the first thing I'd look into after scheduled tasks. Errors going from low (not even showing up on the graph) to over 1000 in a minute looks to me like something went bonkers. You should also check the NGINX logs, as you are getting some errors there too.
Kernel time spikes are generally caused by hardware related issues, so you may want to investigate if driver updates are available.
Last thing to check would be the server logs. Check the physical server logs to make sure nothing weird shows up there. What I mean by "weird" is ANY ERROR level messages that you are not expecting (some things will throw errors into the windows log that are safe to ignore) or WARNING/Information level messages that are out of place. I'd check for anything in the log that is happening before, during, and after the spike.
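Alongside the Windows event logs, the SQL Server error log for the window around the spike can be pulled with xp_readerrorlog. A sketch using the widely used (though undocumented) parameter list of log number, log type, two search strings, and a start/end time; swap in the date of your spike:

```sql
-- Read the current SQL Server error log (0) between 09:55 and 10:15
-- on the day of the spike, with no search-string filtering (the two NULLs).
EXEC master.dbo.xp_readerrorlog
     0,                               -- 0 = current log file
     1,                               -- 1 = SQL Server log, 2 = Agent log
     NULL, NULL,                      -- search strings (none)
     '2021-04-09 09:55', '2021-04-09 10:15';
```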
The final thing - I'd check with the hardware vendor for any known issues with the OS you are using, drivers you are using, and patch level you are at for OS and SQL.
My thoughts - this is NOT a SQL Server issue, but a physical server issue. The PAGELATCH waits are LIKELY all related to some underlying hardware or OS level issue.
One other thing that MAY not hurt to check is if the antivirus is set to ignore scanning the SQL related folders. Antivirus hitting those will cause database performance issues. I wouldn't expect this to cause a spike in the kernel times, but Windows does weird things at times.
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
April 9, 2021 at 9:49 pm
Tried it all so far...
I understand that "something" is happening, I just can't figure out how to pin down "what" it is.
April 9, 2021 at 10:11 pm
What about the PHP errors? You are seeing a jump to 1500 errors in that window when the CPU is spiking.
The NGINX error of 502 is "bad gateway". So it sounds like you have some hardware or software related errors around the NIC.
To me, this is looking like a hardware issue (or virtual hardware issue), likely related to the network. I'd check the network-related logs on the VM (Event Viewer has a LOT of logs you can go through, so I would look even at Information-level entries related to the network; I would START with the "Windows Logs" folder and move on to the "Applications and Services Logs" folder if nothing interesting turns up there). I would also check your VMware logs to see if anything weird happened at that time, like a loss of network or a NIC failing over (if you have multiple NICs). It MIGHT be a problem with the physical hardware, with the VM, with the SAN (assuming you have a SAN), etc.
Another thing to check would be if anything is scheduled that might touch that server. What I mean is if you have an update server that pushes updates out to the server on a schedule, it may be using all the network bandwidth to push updates to the machine. Or if you have an asset management tool, it may be querying the server in a strange way and causing that performance abnormality. If you can clone the box in an isolated network, it would be easier to test if it is an internal to the VM problem or external to the VM.
Since you don't see anything odd on the server (apart from the NGINX error, which I would recommend you try to correct before digging deeper, as I HIGHLY suspect the SQL performance issues are a symptom rather than the problem, and the NGINX 502 error at the SAME TIME as the CPU spike very likely points to the problem), my next step would be to clone it to an isolated network and let it run for a day to see if any odd performance happens. Whether it does or doesn't, you will have ruled out a LOT of things. If you DO get odd performance, you have ruled out the Dell and any network-related problems, since it is on an isolated network on different hardware, so it is probably a VMware misconfiguration or something broken on the server itself. If you DON'T get odd performance, you have ruled out the VM and VMware, and the problem is likely with the Dell.
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
April 11, 2021 at 1:58 am
You mentioned it's a VM.
Did you check what's going on on the other VMs sharing the same physical server?
What's going on on the physical server itself?
_____________
Code for TallyGenerator