January 10, 2024 at 8:47 am
We have a 2-node, non-FCI AG setup with a primary and a secondary. It is async with manual failover, as we don't want a failover to happen.
Of late we have been seeing AG failover attempts that seem network related; to me it looks like a heartbeat problem between the nodes. We also have a file share witness (FSW).
The networking team has been looking into it but has found nothing yet, and we keep getting failovers, which leads me to my question below.
I have relaxed the cluster timeouts, but if there is still a heartbeat (HB) issue, the primary will 'attempt' a failover. That means the AG goes to PRIMARY_RESOLVING before coming back online on the primary a couple of minutes later, and this causes an outage.
What I am thinking is: can I take the vote away from the secondary?
Currently the primary, the secondary and the FSW each have 1 vote.
The most recent outage had this message in the cluster log on the primary: "Quorum witness has better epoch than local node, this node must have been on the losing side of arbitration!"
It seems to me that the primary and secondary lost communication and reached the heartbeat limit (a HB is tried every 2 sec and the threshold is 40, so the timeout is 80 secs in this env). If I understand correctly, the secondary beat the primary to the FSW, hence the message above and the outage.
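As a quick check of the timeout arithmetic above, here is a minimal Python sketch. It assumes a node is declared unreachable once `threshold` consecutive heartbeats, sent every `delay_ms` milliseconds, have been missed. The function name and the millisecond unit are assumptions made for illustration; they are not taken from the cluster configuration itself.

```python
def heartbeat_timeout_seconds(delay_ms: int, threshold: int) -> float:
    """Return how long heartbeats must be missed before a node is treated as down.

    delay_ms  -- interval between heartbeats, in milliseconds (assumed unit)
    threshold -- number of consecutive missed heartbeats that is tolerated
    """
    if delay_ms <= 0 or threshold <= 0:
        raise ValueError("delay_ms and threshold must be positive")
    return delay_ms * threshold / 1000.0


# Values described in the post: a heartbeat every 2 sec, threshold of 40.
print(heartbeat_timeout_seconds(delay_ms=2000, threshold=40))  # 80.0
```

Raising either value lengthens the window before the cluster reacts, but it does not remove the underlying heartbeat loss.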
So my question is: can I remove the vote from the secondary, so that only the primary and the FSW have votes? What are the repercussions?
I will be left with an even number of votes (2), which I have read is not recommended. But since I never want the secondary server to become primary anyway, what's the harm?
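To make the even-vote concern concrete, here is a minimal Python sketch of a plain static majority rule. It is a simplification: it assumes quorum needs strictly more than half of all configured votes, and it deliberately ignores dynamic quorum and dynamic witness, which can adjust votes at runtime. The function name `has_quorum` is hypothetical.

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """Static majority rule: strictly more than half of all configured votes."""
    if not 0 <= votes_present <= total_votes:
        raise ValueError("votes_present must be between 0 and total_votes")
    return votes_present > total_votes // 2


# Current setup: primary, secondary and FSW hold 1 vote each (3 total).
print(has_quorum(2, 3))  # True  -> primary + FSW survive losing the secondary
print(has_quorum(1, 3))  # False -> an isolated primary loses quorum

# Proposed setup: only the primary and the FSW vote (2 total).
print(has_quorum(1, 2))  # False -> primary loses quorum if the FSW is unreachable
```

Under this simplified rule, a 2-vote cluster tolerates no lost voter at all, whereas the 3-vote cluster tolerates one.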
All servers are on the same subnet. In other applications I have worked with, the DR site was on a different subnet, and there I know you should remove the votes from the DR nodes; I have no doubts about that. But can the same be done in this env, where it is all one subnet with 2 machines, one meant to be the primary and the other not, plus a FSW?
Please let me know what you think, thanks.
January 10, 2024 at 4:55 pm
You can remove a vote from a node, but in your case I don't think it would help anything on its own, because there are quorum problems happening with at least two of the nodes. If you remove the vote from the remote datacenter node AND configure dynamic quorum, it may help a little bit, but you may have problems getting the cluster to start if there is a non-graceful shutdown.
I would be cautious about relying on async replication for DR. You would need to carefully monitor how far behind that copy gets, and with the problems you are having there is a real risk of it getting very far behind. I think log shipping would be more reliable, easier to monitor and more fault tolerant than using an AAG in this way.
The cluster logs should contain all the information needed to identify what the problem is. If you need to, clear them out and recreate the issue so you don't have a needle-in-a-haystack problem with the logs.
Are you using a snapshot-based backup by any chance? You mentioned adjusting timeouts, and one of the things snapshot backup vendors commonly recommend is to adjust timeouts. Veeam recommends something insane like 200 seconds, if I remember correctly. Backups being configured wrong, not being properly patch managed, the environment being configured wrong, etc. can all create major clustering problems.
January 10, 2024 at 5:07 pm
Possibly a dumb question, but if secondary should never become primary, do you even need AG to be set up? Could you set up something like replication instead?
As for having only 2 votes in an AG: it means that if the voters lose the ability to communicate with each other, each will have a vote of 1 and think that it should be the primary. What I mean is, if the FSW and the primary lose their network connection, the primary will say "I have 1 vote and that is the majority so I am primary" and the FSW will say "I have 1 vote and that is the majority so I am giving it to the secondary", so then the secondary will be primary. So then you have both systems thinking they are primary.
If you really never want to fail over, I would recommend not setting up an AG to begin with and picking a different tool for the job.
Now the above is all my understanding of WFC. I currently don't have this set up and haven't had to work with WFC for a few years now, so my knowledge is a bit rusty on it.
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
January 10, 2024 at 5:44 pm
Dynamic quorum should resolve the two-vote scenario described in the previous post, but in his/her case, with node number 2 being a witness node, it would never try to become primary. I don't think this is a good design; I think it is unnecessarily complicated and fragile for a use case that can be addressed with something simpler and more reliable.
January 10, 2024 at 8:10 pm
This configuration doesn't really make sense - it is either a DR scenario without an HA solution or it is just to allow for offloading read only workloads.
There are many different options - but no way to determine what would work without understanding the goal. For example, I would probably lean towards adding a new node with shared storage to the 'primary' and create an FCI cluster - then add the secondary AG for read-only access. This allows for failover between FCI nodes and asynchronous mode for the AG node - with only those databases needed for reporting included in the AG.
Jeffrey Williams
January 11, 2024 at 8:43 pm
Yes, I think it's messy, so I will leave the votes as they are.
CreateIndexNonclustered - There is no remote DR.
Just some history: I set up this env years ago. Initially it was automatic failover, but every now and then the AG would fail over and cause an outage, mostly network/clustering related. Then I set it up as async with manual failover, which is how it has been for the past couple of years.
But of late we are again getting network HB issues. I increased the SameSubnetDelay and SameSubnetThreshold values recently, but had another outage last week. So the thought of removing the vote came to me, but it seems like it could open another can of worms.
Thanks to all for clearing my mind. I will continue to look into it, and the network team is also looking.