August 29, 2017 at 10:15 pm
I have a backup regime that creates a backup on a remote server and then runs a verify. This solution has been working without issue for many months. Recently the backups have been reporting as failed even though they are actually succeeding; it's the verify step that is failing.
A quick rundown of the system's architecture: it's a three-node AOAG, a primary and two secondaries, one secondary in synchronous commit and the other in asynchronous commit. The backup runs on the primary.
After some testing I determined that running the verify on anything other than the asynchronous secondary causes the CPU spike and the job to terminate. When it runs on the primary node I see the CPU spike and almost all transactions waiting on the HADR_SYNC_COMMIT wait type. I assume what's happening here is thread exhaustion: the primary node is waiting on acknowledgement from the synchronous-commit secondary that the transactions have been committed?
Can anyone shed any light on why running a RESTORE VERIFYONLY now causes the CPU to spike on the primary and the synchronous-commit secondary, but not on the asynchronous secondary, when this has never occurred before?
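For reference, the failing step is essentially this shape (a minimal sketch only; the share path and database name below are placeholders, not the actual job):

-- Step 1: back up the database to the remote server (this part still succeeds)
BACKUP DATABASE [MyDatabase]
TO DISK = N'\\RemoteServer\Backups\MyDatabase_Full.bak'
WITH CHECKSUM, COMPRESSION, STATS = 10;

-- Step 2: verify the backup file (this is the step that now fails with the CPU spike)
RESTORE VERIFYONLY
FROM DISK = N'\\RemoteServer\Backups\MyDatabase_Full.bak'
WITH CHECKSUM;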
PS: I have checked config settings (sp_configure), process affinity settings, etc.; all three instances have the same settings.
PPS: All three nodes have the same hardware and the same SAN backend.
Thanks in advance to anyone that may be able to assist me with this matter 🙂
August 30, 2017 at 6:43 am
There are a fair number of articles out there stating that with synchronous Always On, the server is likely to take a performance hit due to the commit acknowledgements it requires, whereas asynchronous will not. For your issue, CPU spikes can come from a lack of memory, but that is usually not the underlying cause; more likely something has changed, possibly as low as the database level, to cause the spike. So, is anything running from the SQL Agent or Windows scheduler, such as a backup or any type of maintenance job, on the instance at the same time the verify is occurring? Are the max memory instance settings on the primary and secondary exactly the same? You could run a wait stats query during the verify to see exactly what SQL Server is waiting on at the time of the spike. You could also run a script like sp_WhoIsActive or sp_who2 to see what is running during the verify and how much CPU is being used. As for why this just started to occur, asking what changed in your environment is a good starting point. Also, is anything being written to the SQL Server error log at the time of the spike?
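A minimal sketch of the kind of check described above, run on the node doing the verify while it is executing (these are standard DMVs; the memory settings listed are just examples of what to compare between nodes):

-- What is each active request waiting on, and how much CPU is it using?
SELECT r.session_id, r.status, r.command, r.wait_type, r.wait_time, r.cpu_time
FROM sys.dm_exec_requests AS r
WHERE r.session_id > 50
ORDER BY r.cpu_time DESC;

-- Compare memory-related instance settings between the primary and the secondaries
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN (N'max server memory (MB)', N'min server memory (MB)');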
August 30, 2017 at 9:02 am
ReamerXXVI - Tuesday, August 29, 2017 10:15 PM
How about doing a consistency check?
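For example, something along these lines against the database in question (a sketch; the database name is a placeholder):

DBCC CHECKDB (N'MyDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;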
January 23, 2018 at 9:42 pm
FYI - we finally resolved this; the solution was an upgrade of VMware Tools. Go figure, hey.