SQL Cluster with 10 nodes losing network connection.


AlexSQLForums

SSChampion

Points: 14277
September 17, 2019 at 3:27 pm

#3680736

Hi Everyone
We have a 10-node (Windows Server 2016) SQL Server 2016 SP2 CU6 cluster with Always On availability groups (AOAG) set to automatic failover.
Every Saturday night at exactly 11:00 PM the cluster loses connection between nodes and we get random failovers.
I extended the session timeout to 2 minutes and still get failovers.
We checked all the jobs and processes and nothing is running at that time.
I even followed all the changes proposed in this article: https://www.virtual-dba.com/always-on-changing-cluster-configuration/
but random failovers still occur.
Is there anything I can monitor or do to prove to the network team that it's not an AG or SQL issue?
Alex S
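
One way to collect evidence for the network team is to log timestamped connection attempts between nodes across the Saturday 11:00 PM window. The following is a minimal sketch, not a definitive monitor. It assumes Python is available on a node, the host names `node02` and `node03` are hypothetical, and port 5022 is only the commonly used default for the AG endpoint, so confirm the real port first. It tests TCP reachability only; the cluster heartbeat itself runs over UDP 3343 and is not measured here.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=2.0):
    """Try one TCP connection; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False, time.monotonic() - start


def log_line(host, port, ok, elapsed):
    """Format one result as a timestamped line for later correlation."""
    stamp = datetime.now().isoformat(timespec="seconds")
    status = "OK" if ok else "FAIL"
    return f"{stamp} {host}:{port} {status} {elapsed * 1000:.0f}ms"


def monitor(targets, interval=1.0, rounds=None, out=print, probe_fn=probe):
    """Probe every (host, port) target once per round; rounds=None runs forever."""
    done = 0
    while rounds is None or done < rounds:
        for host, port in targets:
            ok, elapsed = probe_fn(host, port)
            out(log_line(host, port, ok, elapsed))
        time.sleep(interval)
        done += 1


# Example call (hypothetical node names, assumed port):
# monitor([("node02", 5022), ("node03", 5022)], interval=1.0)
```

Running this from more than one node at the same time shows whether the FAIL lines line up across nodes, which would point at shared network gear rather than at a single server.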


Sue_H SSC Guru Points: 90861 · Answer 1

You may want to start by checking the cluster log as well as the Windows event logs at the times the issue hits.

Sue
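
To narrow a large cluster log down to the minutes around the incident, something like the sketch below may help. It assumes the usual text layout of a generated cluster log, where each line starts with `<process id>.<thread id>::<yyyy/mm/dd-hh:mm:ss.fff> <LEVEL>`; verify that against your own file. It also assumes the timestamps are in the same time zone as the `center` value you pass in (cluster logs are typically written in UTC unless generated with local time). The function name `entries_near` is made up for this example.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape: "<pid>.<tid>::YYYY/MM/DD-HH:MM:SS.mmm LEVEL message"
LINE_RE = re.compile(
    r"^[0-9a-fA-F]+\.[0-9a-fA-F]+::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)


def entries_near(lines, center, minutes=5, levels=("ERR", "WARN")):
    """Return (timestamp, level, message) for matching lines within the window."""
    window = timedelta(minutes=minutes)
    hits = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\r\n"))
        if not match or match.group("level") not in levels:
            continue  # skip continuation lines and levels we do not care about
        ts = datetime.strptime(match.group("ts"), "%Y/%m/%d-%H:%M:%S.%f")
        if abs(ts - center) <= window:
            hits.append((ts, match.group("level"), match.group("msg")))
    return hits
```

The function takes any iterable of lines, so it can be fed an open file handle; the right `encoding` to open the file with depends on how the log was exported.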

AlexSQLForums SSChampion Points: 14277 · Answer 2

Hi Sue

I did check the cluster logs and Event Viewer; there are tons of messages about the cluster losing network connections:

The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.

Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

I also get iSCSI connectivity messages:

Connection to the target was lost. The initiator will attempt to retry the connection.

Alex S

Sue_H SSC Guru Points: 90861 · Answer 3

You may need to start checking for issues with the NICs. Make sure the firmware is current and doesn't have known issues with iSCSI (Broadcom, for whatever reason, seems to have a lot of issues with iSCSI). Or the network may be getting saturated at that time, for example if all the backups and copies are scheduled for 11:00 pm Saturday, that type of scenario. Hopefully you have a network group you can pawn this issue off to 🙂

Sue

AlexSQLForums SSChampion Points: 14277 · Answer 4

Hi Sue

We checked all the NICs and even updated the firmware; none are Broadcom.

We even rescheduled the backup and DFS sync jobs.

Alex S

Sue_H SSC Guru Points: 90861 · Answer 5

If you already checked everything, including the network trace, and found nothing, I don't have anything to add, other than that I would wonder whether it's really random. Your post indicates it happens every Saturday at 11:00 pm. Either there are other times or it's not random.

Sue
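
Whether the failures are random or tied to a schedule can be checked by counting incidents per weekday and hour. This is a rough sketch that assumes the incident timestamps have already been pulled out of the event or cluster logs as `datetime` values; the function name is invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Fixed English names, so the result does not depend on the machine's locale.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def bucket_by_weekday_hour(timestamps):
    """Count incidents per (weekday name, hour of day)."""
    return Counter((DAYS[ts.weekday()], ts.hour) for ts in timestamps)
```

Calling `bucket_by_weekday_hour(timestamps).most_common(5)` lists the busiest slots. If nearly everything lands in `("Saturday", 23)`, the cause is almost certainly something scheduled; a flat spread would suggest a truly intermittent fault.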

Beatrix Kiddo SSC-Dedicated Points: 32407 · Answer 6

Are these physical servers or VMs? Do you have any VM snapshots/system-state backups/fileset backups running on those servers at the time? What about antivirus scans?

Steve Jones - SSC Editor SSC Guru Points: 741407 · Answer 7

I think Beatrix is on the right track. Something scheduled, likely outside of SQL, is affecting the cards. I would also be sure no "phone home" software is looking for NIC or other updates at that time and maybe trying to do some update.

AlexSQLForums SSChampion Points: 14277 · Answer 8

They are physical servers. We eliminated/rescheduled backups, antivirus, and Windows updates.

Thank you

Alex S

sterling3721 SSChasing Mays Points: 658 · Answer 9

The cluster log will probably provide the best clue to pinpoint the cause, especially the entries with errors.

I would also check the power plan on each node. All should be on High performance, not Balanced or Power saver; otherwise a node could miss some heartbeats and trigger a failover.
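
The power plan check can be scripted per node. The sketch below wraps the built-in Windows command `powercfg /getactivescheme` and compares the reported GUID with the well-known GUID of the stock High performance plan. It assumes the built-in plan is in use; a vendor-supplied or custom plan would have a different GUID. The helper names are made up for this example.

```python
import re
import subprocess

# GUID of the built-in Windows "High performance" plan.
HIGH_PERFORMANCE_GUID = "8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c"

SCHEME_RE = re.compile(
    r"([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})\s*(?:\((.*)\))?"
)


def parse_active_scheme(output):
    """Extract (guid, plan name) from powercfg output, or None if no GUID found."""
    match = SCHEME_RE.search(output)
    if not match:
        return None
    return match.group(1).lower(), (match.group(2) or "").strip()


def is_high_performance(output):
    """Compare by GUID, since the plan name is localized."""
    parsed = parse_active_scheme(output)
    return parsed is not None and parsed[0] == HIGH_PERFORMANCE_GUID


def active_scheme_output():
    """Run powercfg on the local Windows machine and return its text output."""
    result = subprocess.run(
        ["powercfg", "/getactivescheme"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a node, `is_high_performance(active_scheme_output())` returns `False` for Balanced or Power saver, which makes it easy to spot the one server out of ten that was left on the default plan.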

AlexSQLForums SSChampion Points: 14277 · Answer 10

OK, all the same errors, but there was no failover.

I'm contacting Microsoft.

Alex S