Rebooting secondary replica of AG causes some databases to stop synchronizing

Question

Post reply

Rebooting secondary replica of AG causes some databases to stop synchronizing

steal

SSC Eights!

Points: 827
More actions
April 5, 2023 at 5:29 pm

#4169603
Rebooting secondary replicas should be non-impactful, but starting about 6 months ago we pretty consistently see a handful of databases that fail to synchronize post-reboot. Background:
- Each of our SQL cluster nodes hosts 6 AGs, and each AG has around 50 DBs
- Definition of 'handful' - most recently, we rebooted 8 secondary nodes and had 15 databases across 5 AGs that failed to recover
- The timing when this started loosely correlates to a migration from Nutanix (VM and storage) to PureStorage/VMWare (we did the migration over several months). No other significant infrastructure changes have happened.
Details
- On the secondary node there are logs for only affected DBs that report 'nonqualified transactions are being rolled back in database for an Always On Availability Groups state change...' (6x in 10 ms), followed immediately by 'Database (database ID 57) startup failed with error 3602, severity 25, state 75'
- On the primary node
  - the logs look like this for each database that DOES recover successfully
    Transitioning from [PENDING] to [CHECK_IF_SEEDING_NEEDED] 12:49:56
    Transitioning from [CHECK_IF_SEEDING_NEEDED] to [CATCHUP] 12:49:56
    Transitioning from [CATCHUP] to [COMPLETED] 12:49:57
  - For each database that does NOT recover successfully, logs show that the DB does not successfully reach the 'Transitioning from [CATCHUP] to [COMPLETED] phase of recovery. Instead it reports [CATCHUP] to [FAILED] after 30 minutes and then tries again, so the logs look like this
    Transitioning from [PENDING] to [CHECK_IF_SEEDING_NEEDED]  12:49:56
    Transitioning from [CHECK_IF_SEEDING_NEEDED] to [CATCHUP]  12:49:56
    Transitioning from [CATCHUP] to [FAILED]  13:19:56
    Transitioning from [PENDING] to [CHECK_IF_SEEDING_NEEDED]  13:19:56
    Transitioning from [CHECK_IF_SEEDING_NEEDED] to [CATCHUP]  13:19:56
    Transitioning from [CATCHUP] to [FAILED]  13:49:56
To fix, we manually remove the DB from the AG, drop it from the secondary, and add back to the AG. Our IT team reports no unusual network activity or underlying infrastructure anomalies when this happens, and CPU/memory on the SQL Servers is stable/normal. I haven't been able to find anything noteworthy in the Windows Event logs on the SQL servers.
Does anyone have ideas what might be causing a small subset of our databases to have these 'nonqualified transactions', or how to troubleshoot further?

Viewing 15 posts - 1 through 15 (of 39 total)

You must be logged in to reply to this topic. Login to reply

Site Owners SSC Guru Points: 80325 More actions · Answer 1

Thanks for posting your issue and hopefully someone will answer soon.

This is an automated bump to increase visibility of your question.

onetsql Valued Member Points: 57 More actions · Answer 2

We've had various issues all very similar in nature, never managed to get a root cause. Methods of fixing include -

Resume replication if it's paused
Remove database from secondary replica and re add. Dont delete database from secondary
Restart SQL
Sometimes the dbstartup command gets blocked, in this case you will see blocking, make sure to check the system processes. Only method to get out of this is to force SQL to shutdown. I.e. kill sqlserver.exe process
Remove from primary and re add to AG.

You can probably tell we've seen our fair share of issues.

You might want to review your max worker threads, if I'm reading it right you've got up to 300 databases per instance, that's could easily cause worker thread starvation as each replicated database requires 2 or 3 threads each. If there is no activity it doesn't use the worker thread and releases back to the pool. I expect when SQL starts up it has to evaluate the replication state of each database so will require more threads.

We used to have a server with a large number of databases and we had similar issues after failover. Once we reduced the number of databases on the server the problems went away.

steal SSC Eights! Points: 827 More actions · Answer 3

Yes, we have ~300 dbs/server and indeed we do see thread exhaustion errors. We are working on an initiative to reduce the number of databases but it's a slow process.

Thanks for the reply, misery loves company 🙂

rothj Hall of Fame Points: 3412 More actions · Answer 4

I also started to experience this about 6 months ago, I wonder if an SP introduced the issue. I had over 300 tiny databases in my AG. I've reduced it to about 170 and I still experience it. I tried artificially bumping up the threads but still have the issue. The server is not using much CPU otherwise. I'm on SPLA licensing so adding cores would be expensive. I also have to remove the 'not synchronizing' databases from the AG and re-add them. I agree, it needs a bunch of threads at startup or failover. I have thought about splitting my 8 core Cluster into two 4 core Clusters to get more threads, my SQL licensing would be the same in that scenario. Please let me know if you find any resolution. Thanks

Perry Whittle SSC Guru Points: 234013 More actions · Answer 5

300 dbs in an AG?

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

steal SSC Eights! Points: 827 More actions · Answer 6

steal

SSC Eights!

Points: 827

April 14, 2023 at 9:45 pm

#4174219

No, 300 dbs on a server. About 50 dbs/AG.

Perry Whittle SSC Guru Points: 234013 More actions · Answer 7

First question is why 50 dbs per AG, generally you want all application related dbs in a per application AG, are these sharepoint databases

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

steal SSC Eights! Points: 827 More actions · Answer 8

No they are not Sharepoint DBs. 50 DBs/AG is relatively arbitrary, that's just where we landed with our capacity testing when we first implemented AGs about 5 years ago.

Rebooting the secondary replicas used to be non-impactful (did not have issues for several years). We can't figure out why we've started having problems around 6 months ago.

rothj Hall of Fame Points: 3412 More actions · Answer 9

Same here, mine were fine for a few years before this issue started about 6 months ago.

Perry Whittle SSC Guru Points: 234013 More actions · Answer 10

Have you checked the cluster log for more detailed event information

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

steal SSC Eights! Points: 827 More actions · Answer 11

Yes we have combed the logs pretty thoroughly and can't find anything definitive. We have a ticket open with MS so I'll post back here with whatever they report.

AlexSQLForums SSChampion Points: 14277 More actions · Answer 12

Hi Steal and Rothj

Do you have seeding mode set to Automatic in your AG group?

Does this happen every time you reboot secondary?

Alex S

steal SSC Eights! Points: 827 More actions · Answer 13

Yes, we are using Automatic seeding for our AGs.

Our reboots most often are due to patching, so we reboot many secondaries at the same time and yes, we always have a handful of databases that fail to re-synchronize following the reboots.

rothj Hall of Fame Points: 3412 More actions · Answer 14

Same, I have auto seed on and most of the time I have some db's that get out of sync. Like about 15-20 out of 170 db's need manually fixed after the failover/restart. Mine is also mostly around patching as the reason for the failover.