February 9, 2017 at 10:02 am
Last night our AG failed over to our passive node (same subnet). This isn't a problem as our data loads are configured to run from the current primary.
The problem is that our listener stopped working for some reason. We did a lot of investigation today and couldn't find anything. The listener worked locally on the box, but not remotely. From the box itself it would ping and connect to the instance, but it would not respond to ping from remote machines.
I have just failed the AG back over to the primary and the listener started working again. I failed it back to the passive, and the listener still works.
Any idea why this might have occurred?
February 9, 2017 at 10:05 am
Under what accounts do the SQL Server services run on each node?
Does the listener name and IP come online at all on the second node? Check the cluster log for more detail on why it failed.
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
February 9, 2017 at 10:06 am
I would be interested in the configuration of the Listener too. Is it in a multi-subnet environment? Which port is it listening on, etc.
February 9, 2017 at 10:17 am
SQLAssAS - Thursday, February 9, 2017 10:02 AM
Last night our AG failed over to our passive node (same subnet). This isn't a problem as our data loads are configured to run from the current primary.
What is a problem, is that our Listener stopped working for some reason. We have done lots of investigation today and couldn't find anything. The listener would work locally on the box, but not remotely. It would ping from the box and connect to the instance, but not ping remotely.
I have just failed the AG back over to the primary and the listener started working again. I failed it back to the passive, and the listener still works.
Any idea why this might have occurred?
Also, please provide the AG configuration details. Is this a synchronous or asynchronous group, and is automatic failover configured? This will dictate which nodes are possible owners of the AG cluster resources.
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
February 9, 2017 at 10:22 am
We have a 3 node AG
Node 1 - Subnet 1 - Sync - Automatic failover
Node 2 - Subnet 1 - Sync - Automatic failover
Node 3 - Subnet 2 - Async - Manual failover
All services are set up using a service account specifically created for running this AG. The account cannot be locked out.
The listener is using a custom Port number (not used by anything else)
The unplanned failover happened between node 1 and node 2, which is when the listener stopped responding. The failover itself was due to the quorum (file share) becoming unavailable for a period of time.
February 9, 2017 at 10:23 am
Perry Whittle - Thursday, February 9, 2017 10:05 AM
under what accounts do the sql server services run on each node?
does the listener name and ip come online at all on the second node, check the cluster log for more detail on why it failed
Hi
Yes, it was online. It was working absolutely fine locally, but not remotely.
February 9, 2017 at 10:28 am
Sounds to me like you have a firewall issue on that node that is blocking remote connections, especially since you can connect to the listener locally.
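One way to separate "the port is blocked" from "the address is unreachable" is to try a plain TCP connection to the listener from a remote client, since ping only tells you about ICMP. The snippet below is a minimal sketch using only the Python standard library; the helper name `can_connect` is made up for illustration, and the listener name and port in the usage note are placeholders, not values from this thread.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    A False result covers refused connections, timeouts, and DNS failures,
    so it only says "not reachable from here", not why.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from an application server, something like `can_connect("aglistener.example.local", 50000)` (hypothetical name and port) returning False while the same call on the node itself returns True would match the local-only behaviour described above. It would not by itself distinguish a firewall rule from a routing or ARP problem.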
February 10, 2017 at 1:54 am
dbaduck - Thursday, February 9, 2017 10:28 AM
Sounds to me like you have a firewall issue on that node that does not let remote connections, especially if you can connect to the listener locally.
Hi
The listener is working fine now with no firewall changes.
As mentioned above, I had to fail the AG over to node 2, then back again, and it now works as expected. But that doesn't explain why it stopped working.
February 12, 2017 at 11:41 am
Can you check that gratuitous ARP packets are allowed on the networking devices hosting your VLAN? I have an issue in my workplace where gratuitous ARP is blocked on the VLANs because of DISA STIG requirements. The problem is that this is how a Windows cluster node announces the updated MAC/IP pair to the gateway for the subnet. What happens is that the device hosting the subnet gateway refuses the packet, so its local ARP table is not updated with the new node hosting the listener address. External traffic (traffic coming from application servers trying to connect to the AG listener) times out for about 15 minutes until the ARP table entry expires and the gateway requests an update; then traffic to the AG listener starts working again.
The only fixes I know of for this are:
- Get your networking guys to allow gratuitous ARP packets on networking devices
- Add NICs on clients connecting to the AG that are on the same subnet as the AG listener. This will allow them to receive the ARP packets directly and should allow for failover times like you expect
I did a post on this to TechNet forums a little while back.
WSFC Virtual IPs and GARP
https://social.technet.microsoft.com/Forums/en-US/5f0831b7-fef6-4efd-a6b5-6ddacd1c3f89/wsfc-virtual-ips-and-garp?forum=winserverClustering
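To make the timing above concrete, here is a toy simulation of a gateway ARP table across a failover. It is only a sketch under stated assumptions: the 15-minute entry lifetime is taken from the figure I quoted above, the `Gateway` class, `outage_seconds` function, IP, and MAC addresses are all invented for illustration, and a real device's ARP ageing is more involved than a fixed timer.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

ARP_TIMEOUT_S = 15 * 60  # assumed entry lifetime ("about 15 minutes")


@dataclass
class Gateway:
    allow_garp: bool
    # ip -> (mac, time the entry was learned, in seconds)
    arp_table: Dict[str, Tuple[str, int]] = field(default_factory=dict)

    def learn(self, ip: str, mac: str, now: int) -> None:
        self.arp_table[ip] = (mac, now)

    def receive_garp(self, ip: str, mac: str, now: int) -> None:
        # A gateway that blocks gratuitous ARP silently ignores the announcement.
        if self.allow_garp:
            self.learn(ip, mac, now)

    def resolve(self, ip: str, owner_mac: str, now: int) -> str:
        entry = self.arp_table.get(ip)
        if entry is None or now - entry[1] >= ARP_TIMEOUT_S:
            # Entry missing or expired: the gateway asks again and the
            # node that currently owns the IP answers.
            self.learn(ip, owner_mac, now)
        return self.arp_table[ip][0]


def outage_seconds(allow_garp: bool, failover_at: int = 60, step: int = 60) -> int:
    """Seconds after failover until the gateway maps the listener IP to the new node."""
    ip, node1, node2 = "10.0.0.50", "aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02"
    gw = Gateway(allow_garp)
    gw.learn(ip, node1, now=0)                   # gateway learned node 1 at t=0
    gw.receive_garp(ip, node2, now=failover_at)  # node 2 announces itself on failover
    now = failover_at
    while gw.resolve(ip, node2, now) != node2:
        now += step                              # remote traffic still goes to node 1
    return now - failover_at
```

With gratuitous ARP allowed the gateway switches to the new node immediately; with it blocked, remote clients keep being sent to the old node until the stale entry ages out, while connections made on the node itself never touch the gateway and so keep working.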
Joie Andrew
"Since 1982"