July 25, 2023 at 8:58 am
I'm trying to write a DR doc with no test system for someone with no experience of SQL / clustering. I know! Don't ask. It's a weird old place where I work but they pay, I do.
So, 2 nodes at site 1. 2 at site 2. 2016. AG. WSFC underlying on 2016. File share witness at site 1.
Scenario is site 1 is completely gone forever.
DR monkey logs onto site 2. What state is the WSFC in?
July 25, 2023 at 3:37 pm
The cluster is more than likely failed, you will have to force quorum. 2012 R2 introduced dynamic quorum, but it can't lose most of the votes at the same time. The design and scenario means that the DR plan will also have to plan for restoring the original site as it could cause split brain in the cluster if it is not brought back 'correctly.'
July 25, 2023 at 8:17 pm
On your DR Doc, I would document what the role is of Each server then in what order they will take in the event of a failure. ie how to promote to primary etc...
July 25, 2023 at 8:20 pm
It depends on the how the cluster quorum was configured in the first place. If all nodes in the cluster have a quorum vote - then you have 2 possibilities.
Since the file witness is located in DC1 - you are always running the risk of not having quorum. If the file witness is not available (site 1 is down), then you cannot restart either node in the cluster without taking the cluster down. You definitely need to consider moving that file witness to a location that is accessible from either location and not dependent on either location. A cloud witness is one option...
Another consideration - you can set the node weight on the DR nodes to 0. That removes those nodes from quorum - so in the event you lose access to site 2 (DR) you will not affect quorum and your cluster will remain up and healthy. If site 1 goes down, you can then manually set the node weights on the nodes in the DR site - and force the AG with data loss to recover and bring the system up. You really don't want to fail over to a DR site automatically - it should be driven by the business, once it has been determined that the primary site will not be up within a certain time.
You also have to consider all of the other required servers and services. If you failover automatically - you may be in a situation where the databases are hosted in your DR site - and the web servers and attempting to perform cross-site communications. That could increase the latency and potential have a significant impact on performance.
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
August 4, 2023 at 10:01 am
It is also worth putting near the start of your documentation a warning sbout risk.
Explicitly say that anyone trying to deal with DR who does not already have practical experience of working with the concepts involved will increase the risk of failure to complete the DR process and increase the risk of permanently losing data. If a problem occurs that is outside the scope of the documented troubleshooting the best approach is to stop trying to get things working, until experienced help becomes available. A delayed recovery is always better than continually hitting the problem until something permanently breaks.
I have done many cluster failovers and seen many one-off problem scenarios. Mostly they got resolved fairly quickly, but some resulted in the loss of the cluster. Do not expect inexperienced staff to be able to deal with all problems.
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply