June 12, 2020 at 8:27 am
Hello, I'm hoping someone can help.
I have an Always On availability group set up with 4 nodes, two of them in an HA pairing with synchronous commit. A day ago the AG twice failed over automatically to the secondary node in the HA pairing. The errors I have seen are -
From Failover Cluster Manager - 'Cluster node 'hh-###-###' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.'
And from the cluster logs for that node -
Local endpoint: 10.##.#.###:~3343~
Remote endpoint: 10.##.#.###:~3343~
[Operational] 0000#####.0000#####::2020/06/10-10:34:31.486 INFO Microsoft Failover Cluster Virtual Adapter (NetFT) has missed more than 40 percent of consecutive heartbeats.
Local endpoint: 10.##.#.###:~3343~
Remote endpoint: 10.##.#.###:~3343~
[Operational] 0000####.0000####::2020/06/10-10:34:36.908 INFO Cluster has lost the UDP connection from local endpoint 10.##.#.###:~3343~ connected to remote endpoint 10.##.#.###:~3343~.
This seems to me to be network related. As I am a DBA, I passed it on to the Networks team to have a look, and all they have said is 'The loss of connection to the remote endpoint is almost certainly a server side issue.' They are not helping any further.
It happened a couple of times a couple of days ago but has been fine since. Being a DBA, I am not sure what to look for or whether it is something to be worried about.
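One way to see exactly when the network-related entries occurred is to pull their timestamps out of the cluster log. The following is a minimal Python sketch, assuming the log lines keep the shape of the excerpt above (`::`, then a `YYYY/MM/DD-HH:MM:SS.mmm` timestamp, a level, and the message). The function name `network_events` and the phrase list are illustrative choices, not part of any Microsoft tooling.

```python
import re
from datetime import datetime

# Timestamp, level and message, as they appear after the "::" in the excerpt above.
LOG_LINE = re.compile(
    r"::(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)"
)

# Phrases taken from the two messages quoted in this post (an assumption:
# other network-related messages may be worded differently).
NETWORK_PHRASES = ("missed more than", "lost the UDP connection")


def network_events(lines):
    """Return (timestamp, message) pairs for heartbeat-miss and UDP-loss entries."""
    events = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match and any(p in match.group("msg") for p in NETWORK_PHRASES):
            ts = datetime.strptime(match.group("ts"), "%Y/%m/%d-%H:%M:%S.%f")
            events.append((ts, match.group("msg").strip()))
    return events


# The two log lines from this post, with the masked values left as posted.
sample = [
    "[Operational] 0000#####.0000#####::2020/06/10-10:34:31.486 INFO Microsoft Failover "
    "Cluster Virtual Adapter (NetFT) has missed more than 40 percent of consecutive heartbeats.",
    "[Operational] 0000####.0000####::2020/06/10-10:34:36.908 INFO Cluster has lost the UDP "
    "connection from local endpoint 10.##.#.###:~3343~ connected to remote endpoint 10.##.#.###:~3343~.",
]

for ts, msg in network_events(sample):
    print(ts.isoformat(), msg)
```

The resulting timestamps can then be compared with whatever else was running on the node or the network at those moments.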
June 13, 2020 at 11:05 am
Do you know if the heartbeat is using the same network as other data?
I once had an issue where the database backup was using all of the bandwidth and stopping the heartbeat from working long enough for a failover to occur. Or, if you are running in a virtual environment, another VM guest could have used the bandwidth.
As for whether it's something to worry about: ideally I think it's worth trying to find out what caused it, if possible, since it could repeat or get worse. I assume the failover would have disconnected all of the client applications?
June 15, 2020 at 7:46 am
I do believe the heartbeat is using the same network as everything else.
I also use a monitoring tool, and at the same time this happened to the cluster it picked up this -
Cannot connect to SQL Server instance '######' :
A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 -
The semaphore timeout period has expired.) : The semaphore timeout period has expired [121] (requires acknowledgement)
From what I have read, this error can relate to network adapter issues, so I will ask my Network guys to have another look. I can't see any error messages that don't point towards the network.
June 19, 2020 at 12:48 pm
I've seen issues like this related to VSS backups. During these backups the server is halted for a short time, including the network cards. In one AG environment the primary went into a pending state for a couple of seconds but afterwards came back online without a failover. On other days a failover occurred.
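Both backup theories in this thread (a backup saturating the shared network, or a VSS backup briefly freezing the server) can be checked the same way: see whether the heartbeat-loss times fall inside a backup window. Below is a minimal Python sketch. The event times come from the cluster log in the first post; the backup window, the two-minute margin and the function name `events_near_backups` are hypothetical placeholders. Real windows would have to come from your own backup history, for example the `backup_start_date` and `backup_finish_date` columns of `msdb.dbo.backupset`.

```python
from datetime import datetime, timedelta


def events_near_backups(event_times, backup_windows, margin=timedelta(minutes=2)):
    """Return (event, start, end) for each event inside a backup window, widened by margin."""
    hits = []
    for event in event_times:
        for start, end in backup_windows:
            if start - margin <= event <= end + margin:
                hits.append((event, start, end))
                break  # one matching window per event is enough
    return hits


# Times of the two cluster log entries quoted in the first post.
events = [
    datetime(2020, 6, 10, 10, 34, 31),
    datetime(2020, 6, 10, 10, 34, 36),
]

# Hypothetical backup window for illustration only; not taken from the thread.
backups = [
    (datetime(2020, 6, 10, 10, 30), datetime(2020, 6, 10, 10, 40)),
]

for event, start, end in events_near_backups(events, backups):
    print(f"{event} falls within backup window {start} to {end}")
```

If the events consistently line up with backups, that points towards the backup job or the shared network rather than a faulty adapter; if they never do, the backup theory can probably be set aside.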