Rebooting secondary replica of AG causes some databases to stop synchronizing

  • Rebooting secondary replicas should be non-impactful, but starting about 6 months ago we pretty consistently see a handful of databases that fail to synchronize post-reboot.  Background:

    • Each of our SQL cluster nodes hosts 6 AGs, and each AG has around 50 DBs
    • Definition of 'handful' - most recently, we rebooted 8 secondary nodes and had 15 databases across 5 AGs that failed to recover
    • The timing when this started loosely correlates to a migration from Nutanix (VM and storage) to PureStorage/VMWare (we did the migration over several months).  No other significant infrastructure changes have happened.

    Details

    • On the secondary node there are logs for only affected DBs that report 'nonqualified transactions are being rolled back in database for an Always On Availability Groups state change...' (6x in 10 ms), followed immediately by 'Database (database ID 57) startup failed with error 3602, severity 25, state 75'
    • On the primary node

      • the logs look like this for each database that DOES recover successfully

        • Transitioning from [PENDING] to [CHECK_IF_SEEDING_NEEDED]  12:49:56
        • Transitioning from [CHECK_IF_SEEDING_NEEDED] to [CATCHUP] 12:49:56
        • Transitioning from [CATCHUP] to [COMPLETED]  12:49:57

      • For each database that does NOT recover successfully, logs show that the DB does not successfully reach the 'Transitioning from [CATCHUP] to [COMPLETED] phase of recovery.  Instead it reports [CATCHUP] to [FAILED] after 30 minutes and then tries again, so the logs look like this

        • Transitioning from [PENDING] to [CHECK_IF_SEEDING_NEEDED]  12:49:56
        • Transitioning from [CHECK_IF_SEEDING_NEEDED] to [CATCHUP]  12:49:56
        • Transitioning from [CATCHUP] to [FAILED]  13:19:56
        • Transitioning from [PENDING] to [CHECK_IF_SEEDING_NEEDED]  13:19:56
        • Transitioning from [CHECK_IF_SEEDING_NEEDED] to [CATCHUP]  13:19:56
        • Transitioning from [CATCHUP] to [FAILED]  13:49:56

    To fix, we manually remove the DB from the AG, drop it from the secondary, and add back to the AG.  Our IT team reports no unusual network activity or underlying infrastructure anomalies when this happens, and CPU/memory on the SQL Servers is stable/normal.  I haven't been able to find anything noteworthy in the Windows Event logs on the SQL servers.

    Does anyone have ideas what might be causing a small subset of our databases to have these 'nonqualified transactions', or how to troubleshoot further?

  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.

  • We've had various issues all very similar in nature, never managed to get a root cause. Methods of fixing include -

    1. Resume replication if it's paused
    2. Remove database from secondary replica and re add. Dont delete database from secondary
    3. Restart SQL
    4. Sometimes the dbstartup command gets blocked, in this case you will see blocking, make sure to check the system processes. Only method to get out of this is to force SQL to shutdown. I.e. kill sqlserver.exe process
    5. Remove from primary and re add to AG.

    You can probably tell we've seen our fair share of issues.

    You might want to review your max worker threads, if I'm reading it right you've got up to 300 databases per instance, that's could easily cause worker thread starvation as each replicated database requires 2 or 3 threads each. If there is no activity it doesn't use the worker thread and releases back to the pool. I expect when SQL starts up it has to evaluate the replication state of each database so will require more threads.

    We used to have a server with a large number of databases and we had similar issues after failover. Once we reduced the number of databases on the server the problems went away.

  • Yes, we have ~300 dbs/server and indeed we do see thread exhaustion errors.  We are working on an initiative to reduce the number of databases but it's a slow process.

    Thanks for the reply, misery loves company 🙂

  • I also started to experience this about 6 months ago, I wonder if an SP introduced the issue.  I had over 300 tiny databases in my AG.  I've reduced it to about 170 and I still experience it.  I tried artificially bumping up the threads but still have the issue.  The server is not using much CPU otherwise.  I'm on SPLA licensing so adding cores would be expensive.  I also have to remove the 'not synchronizing' databases from the AG and re-add them.  I agree, it needs a bunch of threads at startup or failover. I have thought about splitting my 8 core Cluster into two 4 core Clusters to get more threads, my SQL licensing would be the same in that scenario.  Please let me know if you find any resolution. Thanks

  • 300 dbs in an AG?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • No, 300 dbs on a server.  About 50 dbs/AG.

  • First question is why 50 dbs per AG, generally you want all application related dbs in a per application AG, are these sharepoint databases

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • No they are not Sharepoint DBs.   50 DBs/AG is relatively arbitrary, that's just where we landed with our capacity testing when we first implemented AGs about 5 years ago.

    Rebooting the secondary replicas used to be non-impactful (did not have issues for several years).  We can't figure out why we've started having problems around 6 months ago.

  • Same here, mine were fine for a few years before this issue started about 6 months ago.

  • Have you checked the cluster log for more detailed event information

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Yes we have combed the logs pretty thoroughly and can't find anything definitive.  We have a ticket open with MS so I'll post back here with whatever they report.

  • Hi Steal and Rothj

    Do you have seeding mode set to Automatic in your AG group?

    Does this happen every time you reboot secondary?

    Alex S
  • Yes, we are using Automatic seeding for our AGs.

    Our reboots most often are due to patching, so we reboot many secondaries at the same time and yes, we always have a handful of databases that fail to re-synchronize following the reboots.

    • Same, I have auto seed on and most of the time I have some db's that get out of sync.  Like about 15-20 out of 170 db's need manually fixed after the failover/restart.  Mine is also mostly around patching as the reason for the failover.

Viewing 15 posts - 1 through 15 (of 39 total)

You must be logged in to reply to this topic. Login to reply