Always On FailureConditionLevel and Session Timeout

  • Hi All,

    Over the past week, I've been monitoring my AOAG setup and have noticed that around the same time each night, my primary fails over to my secondary instance. It happens right after the VM backup tool runs, which is when I get notifications about errors 1480 and 19406. Good to know those alerts are working as expected.

    Anyway, I believe the way the VM backup tool snapshots and merges changes is causing a few missed pings to the primary server, enough to trigger the failover. So, part of me wants to know if there's a way to get another backup tool (or configure this tool) to not cause this little blip in connectivity. However, that's a bigger conversation for another day.

    In the meantime, I'm researching ways on the SQL Server side to mitigate this problem, as it's causing me to have to fail back over to the primary each time (which is an annoyance).

    1) Initially my thought was to just change the failover mode on the primary to Manual to prevent the failover and keep activity on the primary while I work with the network team to explore options.

    2) I came across some suggestions related to the FailureConditionLevel and Session Timeout settings of the Cluster/AOAG, and how changing these from their defaults could raise the threshold for a failover to occur.

    I'm leaning toward #2 because I think it's a more realistic long-term solution as opposed to replacing our whole VM backup process (not that there isn't some tweaking on that side that could help--I just don't know yet). Assuming that I go this route, it'd be nice to know which of these Cluster/AOAG settings to test first. I'd rather not make multiple changes at the same time. So my questions are (for those of you that have some experience with this):

    Is there something in the SQL Server error logs that would help me understand which of the two settings' criteria were met that caused the replica to fail over? I know 1480 and 19406 were triggered, but that just tells me what happened, not necessarily why it happened. For example, could it not ping the primary for more than 10 seconds and therefore increasing the timeout would help? If 1480 and 19406 are considered system errors, then did it fail over because the FailureConditionLevel was set at 3 (System Errors) and therefore setting it to 2 or 1 might make it less sensitive and increase the threshold for failover?
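    Before changing anything, it may help to see what the two settings are currently set to. The following is only a sketch: it assumes the SqlServer PowerShell module (for Invoke-Sqlcmd) is installed, and SQLNODE1 is a hypothetical instance name. It reads FailureConditionLevel and the health-check timeout at the availability group level, and the session timeout per replica.

    ```powershell
    # Assumption: the SqlServer module is installed and provides Invoke-Sqlcmd.
    Import-Module SqlServer

    # Hypothetical instance name; replace with your primary replica.
    $primary = 'SQLNODE1'

    # Group-level settings: failure_condition_level (1-5) and
    # health_check_timeout (milliseconds).
    Invoke-Sqlcmd -ServerInstance $primary -Query "
        SELECT name, failure_condition_level, health_check_timeout
        FROM sys.availability_groups;"

    # Replica-level settings: session_timeout (seconds) and the failover mode.
    Invoke-Sqlcmd -ServerInstance $primary -Query "
        SELECT replica_server_name, session_timeout, failover_mode_desc
        FROM sys.availability_replicas;"
    ```

    Both queries are read-only, so they can be run before and after a change to confirm that only one setting moved.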

    I'm not looking for all of the answers--just some guidance from those a little more experienced than me with this. And if I'm approaching this in the entirely wrong way, please let me know. I'd really appreciate any insight.

    Thanks in advance,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Just for a little more information, it looks like SameSubnetDelay and SameSubnetThreshold are two additional settings that affect the sensitivity of a failover (https://blogs.msdn.microsoft.com/clustering/2012/11/21/tuning-failover-cluster-network-thresholds/).

    With so many settings that seemingly affect AOAG failovers (AOAG Timeout, Cluster FailureConditionLevel, SameSubnetDelay, SameSubnetThreshold, etc.), I suppose it would be good to know what combination of settings you've needed to use and which ones can usually be ignored. Ideally I'd like to change only one so that my instance doesn't fail over immediately on a network hiccup...

    Microsoft recommends leaving the SameSubnetDelay alone to keep the heartbeats frequent, so I'm thinking I might be able to simply adjust SameSubnetThreshold from 5 lost pings to something like 10 or 15 and resolve my issue. Not that anyone is particularly interested, but I'll update this once I do more testing (which may take a few days...).
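    As a rough illustration of the arithmetic: the time a node can go silent before it is considered down is approximately SameSubnetDelay multiplied by SameSubnetThreshold, so a 1000 ms delay with a threshold of 10 tolerates about 10 seconds of missed heartbeats. A minimal sketch, assuming it is run on a cluster node with the FailoverClusters module available (the value 10 is only an example, not a recommendation):

    ```powershell
    Import-Module FailoverClusters

    # Read the current same-subnet heartbeat settings for the local cluster.
    $cluster = Get-Cluster
    $toleranceMs = $cluster.SameSubnetDelay * $cluster.SameSubnetThreshold
    "Current tolerance: $toleranceMs ms of missed heartbeats"

    # Raise only the threshold; SameSubnetDelay is left untouched.
    # 10 is an example value chosen for illustration.
    (Get-Cluster).SameSubnetThreshold = 10

    # Confirm the change.
    Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold
    ```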

    Mike

  • I think you are on the right track. It has been a number of years since I set up my AOAG, but looking through my notes and at the current settings of one of the clusters I had issues with, I changed the two parameters you mentioned:

    PS C:\Windows\system32> Get-Cluster | fl *subnet*
    CrossSubnetDelay          : 1000
    CrossSubnetThreshold      : 5
    PlumbAllCrossSubnetRoutes : 0
    SameSubnetDelay           : 2000
    SameSubnetThreshold       : 20

    Have you dumped the cluster log to see in detail what is happening?

    You can do that by running the following in PowerShell:

    Import-Module FailoverClusters
    Get-ClusterLog -UseLocalTime -TimeSpan 30   # -TimeSpan: number of minutes back to dump

    By default, the log file is created in %WINDIR%\cluster\reports
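    A slightly fuller sketch of the same idea, for anyone who wants the logs from all nodes in one place. The folder C:\Temp\ClusterLogs is a hypothetical path, and the one-day look-back is an arbitrary choice. The second half checks the System event log for event ID 1135, which is logged when a node is removed from active cluster membership; its timestamps can be compared against the backup window.

    ```powershell
    Import-Module FailoverClusters

    # Hypothetical destination folder; create it if it does not exist.
    $dest = 'C:\Temp\ClusterLogs'
    New-Item -ItemType Directory -Path $dest -Force | Out-Null

    # Dump the last 30 minutes of the cluster log from every node into $dest,
    # using local time instead of UTC.
    Get-ClusterLog -UseLocalTime -TimeSpan 30 -Destination $dest

    # List node-removal events (ID 1135) from the past day on this node.
    Get-WinEvent -FilterHashtable @{
        LogName   = 'System'
        Id        = 1135
        StartTime = (Get-Date).AddDays(-1)
    } -ErrorAction SilentlyContinue |
        Select-Object TimeCreated, MachineName, Message
    ```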

  • In reply to CC-597066 (Monday, August 6, 2018 2:40 PM):

    Thank you, CC! I haven't dumped the cluster log yet, but I'll do that next to see whether adjusting the subnet settings will help (which I think it will). Again, thank you.

    Mike

  • Mike,

    I think I'm having the same issue as you. It looks like VMware kicks off a snapshot process that causes a node to lose connectivity for a while. The node then thinks it has lost quorum and throws itself out of the cluster, which causes the Always On availability group hosted on that node to fail over to another. Is this what you were seeing, and did tweaking SameSubnetThreshold have any positive effect?

    Cheers

    Craig

  • In reply to broonie27:

    Hi Craig,

    Yes, this is exactly what I was seeing at the time. As I said in one of my previous posts in this thread, there seemed to be so many settings that affect an AOAG cluster that it was difficult to figure out what the "right" combination was. We were able to do it, and the two most important settings for us were:

    1. SameSubnetThreshold. Default is 10 heartbeats (W2k12R2+). I changed ours to 20. Per Microsoft's recommendation, I did not change the frequency at which it checks for the heartbeats (i.e., the SameSubnetDelay property). I left it at 1000 ms (1 second).
    2. Session Timeout. This is what ultimately helped us. By default, this is set to 10 seconds (and it's set per availability replica). We changed ours to 180 seconds. Yes, I know that's three minutes, but after slowly increasing it and continuing to see failovers when VM snapshots were occurring, I set it to three minutes and both replicas in the cluster successfully tolerated the process, so it worked well for us.

    https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/change-the-session-timeout-period-for-an-availability-replica-sql-server?view=sql-server-2017
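    For reference, the session timeout is changed per replica with ALTER AVAILABILITY GROUP, run on the primary. This is only a sketch under assumptions: the SqlServer PowerShell module is installed, and AG1, SQLNODE1 and SQLNODE2 are hypothetical names for the availability group, the primary instance and the replica being modified.

    ```powershell
    # Assumption: the SqlServer module is installed and provides Invoke-Sqlcmd.
    Import-Module SqlServer

    # Hypothetical primary instance name.
    $primary = 'SQLNODE1'

    # Set the session timeout for one replica to 180 seconds.
    # AG1 and SQLNODE2 are placeholder names; repeat per replica as needed.
    Invoke-Sqlcmd -ServerInstance $primary -Query "
        ALTER AVAILABILITY GROUP [AG1]
        MODIFY REPLICA ON N'SQLNODE2'
        WITH (SESSION_TIMEOUT = 180);"
    ```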

    I hope that helps, and if you have any other questions, let me know.

    Mike

  • Hey Mike,

    Thanks for getting back to me so quickly. Yeah, I also looked into the session timeout setting, but this document https://blog.dbi-services.com/sql-server-alwayson-and-availability-groups-session-timeout-parameter/ states that it has no effect on failover. Instead, it simply marks the replica as disconnected after the timeout period has elapsed, which results in failed attempts to connect to the DB.

    Anyway, I've increased my SameSubnetThreshold value to 15 and I'll keep an eye on it.

    Cheers

    Craig

  • In reply to broonie27:

    Craig,

    You're right. Now that you mention it, I believe we adjusted that setting for a different reason (but around the same time as SameSubnetThreshold). We use synchronous commit with our cluster, and I think that setting also determines how long the primary replica will wait before switching to asynchronous-commit mode and just committing the pending transactions. We decided that increasing the timeout, and potentially taking longer to commit synchronously with no data loss, mattered more to us than better performance with the potential for data loss...

    In any case, I hope the SameSubnetThreshold setting helps alleviate your issue. Good luck.

    Mike

