Unplanned failover of AG

Question

SQLAssAS

SSCertifiable

Points: 7246
February 9, 2017 at 10:02 am

#327093

Last night our AG failed over to our passive node (same subnet). This isn't a problem, as our data loads are configured to run from the current primary.
The problem is that our listener stopped working for some reason. We did a lot of investigation today and couldn't find anything. The listener would work locally on the box, but not remotely: from the box itself it would ping and connect to the instance, but it would not ping remotely.
I have just failed the AG back over to the primary and the listener started working again. I failed it back to the passive, and the listener still works.
Any idea why this might have occurred?

Perry Whittle SSC Guru Points: 234013 · Answer 1

Under what accounts do the SQL Server services run on each node?

Does the listener name and IP come online at all on the second node? Check the cluster log for more detail on why it failed.

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

Ben Miller SSCrazy Points: 2995 · Answer 2

I would be interested in the configuration of the Listener too. Is it in a multi-subnet environment? Which port is it listening on, etc.

Ben Miller
Microsoft Certified Master: SQL Server, SQL MVP
@DBAduck - http://dbaduck.com

Perry Whittle SSC Guru Points: 234013 · Answer 3

SQLAssAS - Thursday, February 9, 2017 10:02 AM
Last night our AG failed over to our passive node (same subnet). This isn't a problem, as our data loads are configured to run from the current primary.
The problem is that our listener stopped working for some reason. We did a lot of investigation today and couldn't find anything. The listener would work locally on the box, but not remotely: from the box itself it would ping and connect to the instance, but it would not ping remotely.
I have just failed the AG back over to the primary and the listener started working again. I failed it back to the passive, and the listener still works.
Any idea why this might have occurred?

Also, please provide the AG configuration details. Is this a synchronous or asynchronous group, and is there any automatic failover? This will dictate which nodes are possible owners of the AG cluster resources.

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

SQLAssAS SSCertifiable Points: 7246 · Answer 4

We have a 3-node AG:
Node 1 - Subnet 1 - Sync - Automatic failover
Node 2 - Subnet 1 - Sync - Automatic failover
Node 3 - Subnet 2 - Async - Manual failover
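
To make the earlier point about possible owners concrete, here is a minimal Python sketch of the topology above. It assumes the usual rule that automatic failover is only possible between replicas that are both configured for synchronous commit with automatic failover; it ignores synchronization state and quorum, and the `Replica` class and function name are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    subnet: str
    synchronous: bool         # availability mode: sync (True) or async (False)
    automatic_failover: bool  # failover mode: automatic (True) or manual (False)


# The three replicas as described in this post.
REPLICAS = [
    Replica("Node 1", "Subnet 1", synchronous=True, automatic_failover=True),
    Replica("Node 2", "Subnet 1", synchronous=True, automatic_failover=True),
    Replica("Node 3", "Subnet 2", synchronous=False, automatic_failover=False),
]


def automatic_failover_targets(replicas, primary_name):
    """Names of replicas that could take over from the primary automatically.

    Simplification: both the primary and the target must be synchronous
    with automatic failover; real clusters also require the secondary to
    be synchronized and the cluster to have quorum.
    """
    by_name = {r.name: r for r in replicas}
    primary = by_name[primary_name]
    if not (primary.synchronous and primary.automatic_failover):
        return []
    return [
        r.name
        for r in replicas
        if r.name != primary_name and r.synchronous and r.automatic_failover
    ]
```

Under these assumptions, `automatic_failover_targets(REPLICAS, "Node 1")` yields only Node 2, which matches the failover that was observed.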

All services run under a service account created specifically for this AG. The account cannot be locked out.

The listener uses a custom port number (not used by anything else).

The unplanned failover happened between node 1 and node 2, which is when the listener stopped responding. The failover was caused by the quorum witness (file share) becoming unavailable for a period of time.

SQLAssAS SSCertifiable Points: 7246 · Answer 5

Perry Whittle - Thursday, February 9, 2017 10:05 AM
Under what accounts do the SQL Server services run on each node?
Does the listener name and IP come online at all on the second node? Check the cluster log for more detail on why it failed.

Hi

Yes, it was online. It was working absolutely fine locally, but not remotely.

Ben Miller SSCrazy Points: 2995 · Answer 6

Sounds to me like you have a firewall issue on that node that does not allow remote connections, especially if you can connect to the listener locally.
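
One way to compare local and remote behaviour is to run the same TCP check from the node itself and from a remote client. This is only a sketch in Python using the standard `socket` module; the function name is made up here, and you would substitute your own listener name and custom port.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Any socket-level failure (refused, timed out, name not resolved)
    is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("aglistener.example.local", 50000)` (a hypothetical name and port) run on the node and then on an application server would show whether only remote clients are affected. It does not say why: a blocked port and an unreachable IP both come back as `False`.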

Ben Miller
Microsoft Certified Master: SQL Server, SQL MVP
@DBAduck - http://dbaduck.com

SQLAssAS SSCertifiable Points: 7246 · Answer 7

dbaduck - Thursday, February 9, 2017 10:28 AM
Sounds to me like you have a firewall issue on that node that does not allow remote connections, especially if you can connect to the listener locally.

Hi

The listener is working fine now, with no firewall changes.

As mentioned above, I had to fail the AG over to node 2, then back again, and it now works as expected. But that doesn't explain why it stopped working.

Joie Andrew One Orange Chip Points: 27360 · Answer 8

Can you check that gratuitous ARP packets are allowed on the networking devices hosting your VLAN? At my workplace, gratuitous ARP is blocked on the VLANs because of DISA STIG requirements. The problem is that gratuitous ARP is how a Windows cluster node announces the updated MAC/IP pair to the gateway for the subnet. As a result, the device hosting the subnet gateway refuses the packet and does not update its local ARP table with the new node hosting the listener address. External traffic (from application servers trying to connect to the AG listener) then times out for about 15 minutes, until the ARP table entry expires and the gateway requests an update; after that, traffic to the AG listener starts working again.
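
The behaviour described above can be illustrated with a toy model in Python. It is a sketch, not how any particular switch or router works: the class, the IP and MAC values, and the 900-second lifetime (taken from the "about 15 minutes" I see here) are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GatewayArpCache:
    entry_ttl_s: float        # how long a learned entry is trusted
    accept_gratuitous: bool   # False models a VLAN that blocks gratuitous ARP
    _entries: dict = field(default_factory=dict, init=False)  # ip -> (mac, learned_at)

    def learn(self, ip, mac, now):
        self._entries[ip] = (mac, now)

    def on_gratuitous_arp(self, ip, mac, now):
        # The new owner announces itself; a blocking device drops the update.
        if self.accept_gratuitous:
            self.learn(ip, mac, now)

    def resolve(self, ip, now, current_owner_mac):
        """MAC address the gateway would forward traffic for `ip` to at time `now`."""
        entry = self._entries.get(ip)
        if entry is None or now - entry[1] >= self.entry_ttl_s:
            # Missing or expired: the gateway asks again and the current owner answers.
            self.learn(ip, current_owner_mac, now)
            return current_owner_mac
        return entry[0]


LISTENER_IP = "10.0.0.50"        # hypothetical listener address
NODE1_MAC = "00:00:00:00:00:01"  # hypothetical MACs
NODE2_MAC = "00:00:00:00:00:02"


def mac_after_failover(accept_gratuitous, seconds_since_learned):
    """Where the gateway sends listener traffic after a failover to node 2."""
    cache = GatewayArpCache(entry_ttl_s=900, accept_gratuitous=accept_gratuitous)
    cache.learn(LISTENER_IP, NODE1_MAC, now=0)               # node 1 owns the listener
    cache.on_gratuitous_arp(LISTENER_IP, NODE2_MAC, now=100)  # failover at t=100
    return cache.resolve(LISTENER_IP, now=seconds_since_learned,
                         current_owner_mac=NODE2_MAC)
```

With gratuitous ARP blocked, `mac_after_failover(False, 200)` still points at node 1, and only once the entry has aged out (`mac_after_failover(False, 900)`) does traffic reach node 2. With gratuitous ARP allowed, `mac_after_failover(True, 200)` points at node 2 straight away.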

The only fixes I know of for this are:

- Get your networking guys to allow gratuitous ARP packets on the networking devices.
- Add NICs to the clients connecting to the AG, on the same subnet as the AG listener. This lets them receive the ARP packets directly and should give you the failover times you expect.

I posted about this on the TechNet forums a little while back.

WSFC Virtual IPs and GARP
https://social.technet.microsoft.com/Forums/en-US/5f0831b7-fef6-4efd-a6b5-6ddacd1c3f89/wsfc-virtual-ips-and-garp?forum=winserverClustering

Joie Andrew
"Since 1982"