Cluster Timeout Settings to handle transient network outages

  • I have an Availability group with 2 replicas on the same subnet.  We've been experiencing short network outages that last a few seconds and cause connections to timeout.  This causes the AG to fail completely (not failover automatically) and we have to manually bring it back online since the DBs are marked "not synchronizing" on the primary.  I believe the cluster role is failing because there are timeouts (network loss) on both replicas so its stuck in a failed state.  Working with the infrastructure team to reduce network outages has not been helpful, so I think the next best solution is to add a delay in the cluster timeout settings to allow the network to recover and keep the DBs available once that happens.  We will still experience an outage, but at least the AG will be online and not need manual attention.  Here is what I have tried to configure the delay.

    ALTER AVAILABILITY GROUP AG_name MODIFY REPLICA ON 'All_replicas' WITH (SESSION_TIMEOUT = 40);

    On the FCI, change "LeaseTimeout" to 40000 (40 seconds) and "HealthCheckTimeout" to 120,000 (120 seconds)

    I tested the delay by turning off the primary sql service and wait for the secondary replica to take over. Unfortunately there is no delay, the secondary becomes the primary within 5 seconds.

    Is this the best solution?  If so, what else do I need to change to make the delay work as expected.

  • Just my 2 cents but it feels like you are trying to treat the symptoms and not the problem. What is causing the network hiccups? I would work on addressing that issue rather than treating the symptoms.

    I know you said you worked with the infrastructure team to figure out what's up and they were not helpful, but my next step would not be to work-around the problem. I would escalate. If the networking team is telling you "everything looks good on our end", AND you have shown them logs or error messages indicating the network is the problem, escalate them to their boss. If you haven't shown them logs, then that'd be my next step before escalating. Prove it is a networking issue and not something else. I've been on the receiving end of problems like that. Had a support ticket of "Database is down" and I thought "what the heck? Why didn't I get notified of this?" My boss came to my cubicle and told me "drop everything and get that database back online". The stupid thing though - it WAS online and accepting connections and profiler (SQL Server 2008 R2 at the time... this was years ago... I know XE existed then, but profiler was just so much faster to set up for monitoring traffic) showed that connections were going and the system was in use. Turns out the developers had released a new version of their software that had a typo so it was connecting to a non-existent SQL instance. Took me 30 minutes to convince my boss it was online, over an hour to convince the app team it was online and the problem was on their end, and about half a day to get everything fixed because the app team refused to admit they created the problem.

    Now, I'm not saying that's your situation, but playing devil's advocate here - if the networking team looked at the logs and everything came back clean, is there any chance it is a problem on your end? Like it isn't a problem with the network, but with the replication? What I'm meaning is if the replication is trying to write from primary to secondary and gets blocked for a long time or worse - deadlocked - that'll cause all sorts of issues.

    But assuming it is definitely the network, does it goes down predictably (ie every Wednesday at 10:00 PM for example)? If so, someone has likely scheduled a process that is mucking things up. If it is unpredictable, then it is much harder to troubleshoot. My opinion, network related issues MUST be corrected and not worked-around as they will impact other things (end user queries for example). PLUS if it is random, there MAY come a time where it impacts your backups (if they are on a SAN or NAS) and that's never fun to find out your backup failed due to a known network related issue as you and your team (DBA's) are responsible for the backups.

    The solution may be something simple like replacing the ethernet cable (if it is a physical box) or updating the network drivers or maybe a switch port is starting to go or the switch is doing something wrong with the traffic or the firewall or the network monitor tools or any number of things. Either way, I personally wouldn't work around the issue as it will show up again somewhere else and there will come a point where you can't work around the problem.

    Also, I think the reason your test ("I tested the delay by turning off the primary sql service and wait for the secondary replica to take over. Unfortunately there is no delay, the secondary becomes the primary within 5 seconds.") failed over quickly rather than waiting like you expected is because you had a clean shutdown. SQL instance was going offline, so it told the secondary "you are the primary now" before it went offline. If you want a real test of your scenario and configuration you'd need to initiate an unexpected outage. If it is a phyiscal box, pull the power cable on it. If it is a VM, do a hard shutdown from the VM manager. Or if you want a "safer" (and faster to recover from) option as pulling the power cable is hard on computers, pull the network cable. If it is a VM, disable the NIC. BUT where is the fun in the safe options? In a real world scenario, the main thing that'll trigger your instance going offline unexpectedly is power loss, so pulling the power may be a good test case or may be overkill. I have had computers die on me when I pull the power cable without doing a clean shutdown, but they were OLD boxes.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply