May 17, 2024 at 7:07 am
It seems the question was incomplete.
If we consider an Always On AG (synchronous commit, automatic failover) on a Windows cluster, a complete primary outage is possible if the primary server is powered off or network-isolated. In that case, if the underlying Windows cluster resources are not available to the secondary node, the Always On secondary cannot promote itself to primary automatically and is left stuck.
One resolution (without much delay and with no data loss) is to restart the secondary node multiple times until its cluster resources (and in turn Always On) come back online.
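When the stuck secondary is really a cluster quorum problem, an alternative to repeatedly restarting the node is to force quorum on the surviving node. A rough sketch, assuming a standard WSFC setup; the node and AG names here are hypothetical, and the forced-failover step risks data loss, so it is a last resort:

```powershell
# Run in an elevated PowerShell session on the surviving secondary node.
# Node name "SQLNODE2" is hypothetical.
Stop-ClusterNode -Name "SQLNODE2"
Start-ClusterNode -Name "SQLNODE2" -FixQuorum   # start cluster service without quorum

# Once the cluster is up, check whether the AG comes online on its own.
# Only if the former primary is truly gone and the replica will not come
# online, a forced failover may be needed (can lose data):
#   ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```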
May 17, 2024 at 7:44 am
I think this is probably the worst piece of advice I've seen for an AG outage/failover scenario.
Make a primary outage even worse by continually bouncing the secondary node?
What if you're running multiple instances and AGs (a common scenario) and the secondary is actually a primary for another AG?
Quorum configuration should be understood, but even then, dynamic quorum voting in WSFC since Windows Server 2012 adjusts node votes to keep the cluster up. A synchronous AG with automatic failover should not see the scenario you are describing, and I've yet to reproduce anything like it.
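If you want to see what the dynamic quorum is actually doing, the cluster exposes both the static vote assignment and the weight it is currently using. A quick sketch using the FailoverClusters PowerShell module (run on any cluster node):

```powershell
# Static vote (NodeWeight) vs. the weight WSFC is currently applying (DynamicWeight)
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight

# Cluster-wide dynamic quorum setting and witness weight
Get-Cluster | Format-List DynamicQuorum, WitnessDynamicWeight
```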
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
May 17, 2024 at 11:28 am
Let's start off with what I think the answer was: Stop & resume data movement.
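For anyone who hasn't done it, stopping and resuming data movement is a one-liner per database. A sketch, with a hypothetical database name; run it on the replica whose movement you want to pause:

```sql
-- Suspend data movement for one database in the AG
ALTER DATABASE [BigDB] SET HADR SUSPEND;

-- Resume it once the replica looks healthy again
ALTER DATABASE [BigDB] SET HADR RESUME;
```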
With a large 40 TB database, yes, I could see this scenario RARELY. They don't want data loss at all? Probably not possible. But minimal data loss can happen if you also ensure that you are using filegroup backups, proper RAID on your backups, and lots of transaction log backups.
The thing is, did they specify if this was a data warehouse or an OLTP database? It makes a difference.
OLTP databases usually see heavy traffic: multiple updates per hour, minute, or second. Data warehouses might only get their data every few hours or once a day. In the latter case, if this scenario happens at the right time, then recovering a data warehouse without data loss is very easy.
I worked with a 4-node, 2-data-center clustered AG on SQL 2012, with 2 servers in each DC. We used manual failover: synchronous commit between the two nodes in the primary DC and asynchronous to the secondary DC servers. We had a large vended database around 1 or 2 TB (can't remember) using one data file, which was a PITA to restore / recover. There were a couple of other databases almost as big as part of the vended solution. The reporting databases were on Server B, the OLTP databases were on Server A. Servers C & D were in the secondary DC.
Our corporate office hadn't figured out yet that they couldn't patch Servers A & B at the same time (we had the first AG setup) and brought down our application on numerous occasions. Fortunately we didn't have a 24x7 requirement at the time and patching happened in off hours. But there was one time we spent almost 8 hours trying to bring everything up. We first tried stopping and resuming data movement, which started causing worse problems, then failing over manually, which really brought everything down and froze SSMS. Then we tried restarting the SQL server services but that didn't work. So we eventually had to restart the servers themselves and be patient.
AFAIK, we didn't lose any data because no one could connect. We didn't have to restore the databases, but if we did, we had TLogs going every 15 minutes in addition to a weekly full backup. We also had daily copy-only full backups because we restored these DBs to a reporting server, which let us know our backups were good as well. So we had multiple things going for us.
For a 40 TB server? I would do it this way: multiple data files / filegroups. Weekly FULL filegroup backups (staggered), daily differentials, and TLogs every 15 or 30 minutes, assuming this is OLTP. If this is a data warehouse, I would probably change that up based on how often the data is updated and not have as many TLog backups. Maybe one every 4 hours? RAID 10 would be preferable for me, probably striped. But that would depend on budget.
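The staggered filegroup approach above might look something like this; database, filegroup, and path names are hypothetical, and the jobs would normally be scheduled through SQL Agent:

```sql
-- Night 1: back up the first filegroup
BACKUP DATABASE [BigDB] FILEGROUP = 'FG1'
    TO DISK = N'X:\Backups\BigDB_FG1.bak'
    WITH COMPRESSION, CHECKSUM;

-- Night 2: back up the next filegroup, and so on through the week
BACKUP DATABASE [BigDB] FILEGROUP = 'FG2'
    TO DISK = N'X:\Backups\BigDB_FG2.bak'
    WITH COMPRESSION, CHECKSUM;

-- Transaction log backup, run every 15-30 minutes
BACKUP LOG [BigDB]
    TO DISK = N'X:\Backups\BigDB_log.trn'
    WITH COMPRESSION, CHECKSUM;
```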
Then, if it gets caught mid-transition, start by checking the sync state on both primary and secondary. The AG tech is better now than it was in 2012. You can try stopping and resuming data movement on the primary first. If the secondary still looks wonky, do the same thing on that side. If that doesn't work, try running T-SQL to fail over. Don't do the failover through the GUI, as at that point it can freeze SSMS. And if the interviewer says that won't work, then you apologize to them and say, "I don't see a way it can be recovered without data loss."
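The T-SQL failover itself is run on the target secondary; the AG name below is hypothetical:

```sql
-- Planned manual failover to this synchronous-commit secondary (no data loss)
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

-- Absolute last resort when a normal failover is impossible (may lose data):
-- ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```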
May 17, 2024 at 2:42 pm
With a busy 40 TB DB, I would think the reason the AG does not fail over immediately is the outstanding redo queue. Redo is multi-threaded in later versions of SQL Server, but if there's a lot waiting to be applied, it can take a short while for the AG to complete failover.
There are stats that can be read to ascertain how much is waiting, the sync status, etc.
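For example, the redo and send queue sizes are exposed per database replica in a DMV; a sketch of the kind of query I mean:

```sql
-- Redo/send queue sizes and sync state for each database replica in the AG
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.redo_queue_size,       -- KB of log waiting to be redone on the secondary
       drs.redo_rate,             -- KB/sec currently being applied
       drs.log_send_queue_size    -- KB of log not yet sent to the secondary
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id;
```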
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉