SQL AlwaysOn Failover Due to Missed Heartbeats

Question

rajpat13

Old Hand

Points: 357
December 10, 2014 at 7:26 pm

#301752

Hi Folks,
I am having some difficulties with my AlwaysOn setup and would appreciate it if someone could provide some guidance.
I have set up SQL AlwaysOn between the primary and DR data centers. Here is the setup:
Primary data center: Server1 (Primary), Server2 (Sync Commit Secondary), Server3 (ASync Commit Secondary)
DR data center: Server4 (ASync Commit Secondary)
Data synchronization and manual failover work fine. But sometimes the AlwaysOn cluster automatically fails over to the Sync Commit Secondary in the primary data center. Here are the error messages from Failover Cluster Manager->Cluster Events:
"Cluster has missed two consecutive heartbeats for the local endpoint xx.xx.xx.yy:3343 connected to remote endpoint xx.xx.xx.zz:3343"
"Cluster has lost the UDP connection from local endpoint xx.xx.xx.yy:3343 connected to remote endpoint xx.xx.xx.zz:3343"
I had our network engineer check all connections multiple times and he confirmed everything is fine. But he was also able to confirm (using monitoring tools) that right at the time of a failover, there is almost 2GB worth of traffic going from the primary server to the DR server. That happens every time. I checked the times of all the failovers and there is no job or process running then that would produce 2GB worth of data. Also, this happens regardless of which server is primary.
Even though the failover itself works fine, these unexpected automatic failovers due to missed heartbeats are occurring often (2-3 times a month).
Here is the list of errors from the Cluster Validation Report:
Under the Network section, I see the following error messages in red:
Validate Network Communication
Network interfaces Server4 (DR) - SAN_Team and Server1 (Primary) - SAN_Team - VLAN 20 are on the same cluster network, yet address xx.xx.xx.pp is not reachable from xx.xx.xx.yy using UDP on port 3343.
Network interfaces Server4 (DR) - SAN_Team and Server2 (Secondary) - SAN_Team - VLAN 20 are on the same cluster network, yet address xx.xx.xx.qq is not reachable from xx.xx.xx.yy using UDP on port 3343.
Network interfaces Server4 (DR) - SAN_Team and Server3 (Secondary) - SAN_Team - VLAN 20 are on the same cluster network, yet address xx.xx.xx.zz is not reachable from xx.xx.xx.yy using UDP on port 3343.
Network interfaces Server1 (Primary) - SAN_Team - VLAN 20 and Server4 (DR) - SAN_Team are on the same cluster network, yet address xx.xx.xx.yy is not reachable from xx.xx.xx.pp using UDP on port 3343.
Network interfaces Server2 (Secondary) - SAN_Team - VLAN 20 and Server4 (DR) - SAN_Team are on the same cluster network, yet address xx.xx.xx.yy is not reachable from xx.xx.xx.qq using UDP on port 3343.
Network interfaces Server3 (Secondary) - SAN_Team - VLAN 20 and Server4 (DR) - SAN_Team are on the same cluster network, yet address xx.xx.xx.yy is not reachable from xx.xx.xx.zz using UDP on port 3343.
--
Under System Configuration, I see the following error messages in red:
Validate Active Directory Configuration
The servers are not all in the same organizational unit. Connectivity to a writable domain controller from node Server4 (DR) could not be determined because of this error: Could not get domain controller name from machine Server4 (DR).
Connectivity to a writable domain controller from node Server1 (Primary) could not be determined because of this error: Could not get domain controller name from machine Server1 (Primary).
Connectivity to a writable domain controller from node Server2 (Secondary) could not be determined because of this error: Could not get domain controller name from machine Server2 (Secondary).
Connectivity to a writable domain controller from node Server3 (Secondary) could not be determined because of this error: Could not get domain controller name from machine Server3 (Secondary).
Node(s) Server4 (DR) Server1 (Primary) Server2 (Secondary) Server3 (Secondary) cannot reach a writable domain controller. Please check connectivity of these nodes to the domain controllers.
Thanks in advance. Have a nice day.

TheSQLGuru SSC Guru Points: 134017 · Answer 1

1) Makes me twitchy to see those types of network stack errors in the Cluster Validation Report

2) I would watch out for uncommitted transactions causing a backlog in the tlog that isn't being sent over to the secondaries. Also trend tlog space used (dbcc sqlperf(logspace) still works for that), and if you start seeing it grow very large, find out why and squash the root cause. There is also a monitor for AGs you can use. See BOL for that if you aren't familiar with it. You can also use DMVs to see internal information about AGs.
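
As a rough sketch of the DMV approach mentioned above (not part of the original reply): the query below, run on the primary replica, lists the log send and redo queues per database and replica. The view and column names are those of the sys.dm_hadr_* DMVs in SQL Server 2012 and later. What counts as "too large" for a queue is an assumption you would have to set for your own workload.

```sql
-- Run on the primary replica (SQL Server 2012 or later).
-- A log_send_queue_size that keeps growing on one replica means log
-- is piling up on the primary waiting to be shipped to that secondary.
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id)   AS database_name,
    drs.synchronization_state_desc,
    drs.log_send_queue_size,   -- KB of log not yet sent to this secondary
    drs.log_send_rate,         -- KB/sec
    drs.redo_queue_size,       -- KB of log received but not yet redone
    drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
ORDER BY drs.log_send_queue_size DESC;
```

Comparing the send queue for the DR replica just before a failover with its normal level would show whether the 2GB burst is queued log being shipped.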

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Perry Whittle SSC Guru Points: 234013 · Answer 2

It's no good complaining about network issues when your validation report is full of network configuration errors. You'll need to get these sorted first and ensure the report is clean before looking at anything else.

Did you separate the mirror traffic onto its own network? If not, you may want to consider this.

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

rajpat13 Old Hand Points: 357 · Answer 3

Thanks Kevin and Perry for the reply.

I talked to our network engineer and according to him these network errors can be ignored. The errors relate to IP address communication between the primary datacenter and the DR datacenter. Since they are in different datacenters, their IP addresses should not be able to communicate, as the DR node doesn't have a vote in the quorum. They are part of the same Windows cluster network, but since they are in different datacenters, there should not be any direct connection between these IP addresses.

I will do some more research on this and try to figure out whether the IP addresses should communicate even though they are in different datacenters.

Also, I checked the size of the transaction log files, and it's not that big. We have a transaction backup job running every 20 mins, which will shrink the log file as soon as the backup is created.

I will review the DMV option to look at the internal information.

Thanks.

TheSQLGuru SSC Guru Points: 134017 · Answer 4

rajpat13 wrote: "Also, I checked the size of the transaction log files, and it's not that big. We have a transaction backup job running every 20 mins, which will shrink the log file as soon as the backup is created."

You need to check the tlog at the time you have the issue - i.e. when 2GB of stuff is being transferred. Also the committed-tran tlog space will NOT be flushed under the scenario I painted - namely if there is a long-running transaction or one of several other things going on that can prevent that operation.
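
One way to check this at the time of the issue, as a minimal sketch (not part of the original reply): sys.databases reports why log space cannot currently be reused, and DBCC OPENTRAN reports the oldest active transaction in the current database. Both are standard SQL Server features. Running them in the database with the fullest log is an assumption about where to look.

```sql
-- Why can't each database's log be cleared right now?
-- Values such as ACTIVE_TRANSACTION, LOG_BACKUP or AVAILABILITY_REPLICA
-- point at a long-running transaction, a missing log backup, or a
-- secondary that has not yet received the log, respectively.
SELECT name,
       recovery_model_desc,
       log_reuse_wait_desc
FROM sys.databases
ORDER BY name;

-- Oldest active transaction in the current database, if any.
-- Switch to the suspect database first (USE <database>).
DBCC OPENTRAN;
```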

Also, I do hope you MEANT "flush committed space" and not actually what you said - "shrink the log file"!!! Please confirm that you aren't doing that!

rajpat13 Old Hand Points: 357 · Answer 5

Hi Kevin,

I have asked our network engineer to do some more research about these errors. Correct, I am not shrinking the transaction log files. I meant that the size of the .ldf files goes back to the "Initial Size (MB)" setting when the transaction log backups complete.

Also, I will write a script to log the output of dbcc sqlperf(logspace) into a table so I can review it later.
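
A minimal sketch of such a logging script, under stated assumptions: the table name dbo.LogSpaceHistory and its column names are hypothetical, chosen only for illustration. It relies on INSERT ... EXEC to capture the four-column result set of DBCC SQLPERF(LOGSPACE).

```sql
-- Hypothetical history table; the table and column names are illustrative.
IF OBJECT_ID(N'dbo.LogSpaceHistory', N'U') IS NULL
BEGIN
    CREATE TABLE dbo.LogSpaceHistory
    (
        CaptureTime     datetime NOT NULL
            CONSTRAINT DF_LogSpaceHistory_CaptureTime DEFAULT (GETDATE()),
        DatabaseName    sysname  NOT NULL,
        LogSizeMB       float    NOT NULL,
        LogSpaceUsedPct float    NOT NULL,
        Status          int      NOT NULL
    );
END;

-- DBCC SQLPERF(LOGSPACE) returns four columns, in this order:
-- Database Name, Log Size (MB), Log Space Used (%), Status.
-- CaptureTime is filled in by its default.
INSERT INTO dbo.LogSpaceHistory (DatabaseName, LogSizeMB, LogSpaceUsedPct, Status)
EXEC ('DBCC SQLPERF(LOGSPACE) WITH NO_INFOMSGS;');
```

Scheduled as a SQL Server Agent job every minute or so, this would leave samples close to the time of each failover to compare against normal levels.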

Here is the current output of dbcc sqlperf(logspace):

Database Name   Log Size (MB)   Log Space Used (%)   Status
master          12907.99        0.1977336            0
tempdb          3089.867        38.89262             0
model           1.242188        83.33334             0   <-- 83%
msdb            26.17969        27.08147             0
DB1             32.24219        67.87012             0   <-- 68%
DB2             739.4922        94.87454             0   <-- 95%
DB3             22500.99        0.1485351            0

For DB1 and DB2, I see a high number for Log Space Used. How do I reduce that number?

TheSQLGuru SSC Guru Points: 134017 · Answer 6

DB1 32.24219 67.87012 0 <-- 68%
DB2 739.4922 94.87454 0 <-- 95%

DB1 just has a tiny tlog file. It may or may not need to be bigger. 68% full isn't an issue if the tlog backups are happening such that it doesn't grow. If it does, just make it bigger so it doesn't need to grow. Autogrowths should NOT be used to manage your file sizes (for tlog or data either one)!

DB2: are tlog backups being done on it? If so, they may need to be done more frequently. Or the file needs to be bigger. Or someone needs to stop holding a transaction open for too long. Or any combination of the above. 🙂

ramyours2003 SSChampion Points: 12268 · Answer 7

I am having the same issue. Can you please explain why the issue occurs and how to resolve it?