December 17, 2013 at 7:49 am
Hi,
We have set up a SQL Server 2012 AlwaysOn availability group on Windows Server 2012. It runs and fails over successfully. Recently, VMware took a snapshot of the primary and it broke the cluster. We saw the errors below in the cluster log.
In the VMware settings, the quiesce option is turned off for these VMs. We also configured this cluster setting:
:\Windows\system32> (Get-Cluster).SameSubnetThreshold = 10 (Relaxed)
Cluster node 'Test1' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or
File share witness resource 'File Share Witness' failed to arbitrate for the file share '\\Test\TEstQuorum'. Please ensure that file share '\\Test\TestQuorum' exists and is accessible by the cluster.
In the AlwaysOn error log:
A connection timeout has occurred on a previously established connection to availability replica 'Test1' with id [CAD40D99-E333-457E-9993-BBE977D2CDA2]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.
I am pulling my hair out. Any thoughts?
Thanks
AyeMya
December 17, 2013 at 11:03 am
Are you using RDMs or normal VMDKs? If you are using VMDKs then you'll want to turn the quiesce option back on. If you are using RDMs then just stop taking snapshots of the VMs. Why the change to the cluster (SameSubnetThreshold)? What version of vSphere? Does this happen when you snapshot the primary replica, or the secondary replica, or any replica?
Odds are your answers will lead me to ask more questions before giving you an answer.
December 17, 2013 at 11:38 am
It is a normal VMDK backup. The VMware version is 5.1. The snapshot is taken of the primary replica. We set SameSubnetThreshold = 10 (Relaxed) to make the cluster more tolerant of transient failures, following this link:
http://blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx
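For context, here is a rough sketch of the arithmetic behind that setting. It assumes the Windows Server 2012 defaults of a 1000 ms SameSubnetDelay between heartbeats and a SameSubnetThreshold of 5 missed heartbeats; the function name is made up for illustration.

```python
def heartbeat_tolerance_seconds(delay_ms: int, threshold: int) -> float:
    """Seconds of missed heartbeats tolerated before a node is treated as down.

    delay_ms  : interval between heartbeats (SameSubnetDelay), in milliseconds
    threshold : consecutive missed heartbeats allowed (SameSubnetThreshold)
    """
    return delay_ms * threshold / 1000.0


# Assumed Windows Server 2012 defaults: 1000 ms delay, threshold of 5.
default_tolerance = heartbeat_tolerance_seconds(1000, 5)

# The "Relaxed" setting from this thread: threshold raised to 10.
relaxed_tolerance = heartbeat_tolerance_seconds(1000, 10)

print(f"default: {default_tolerance} s, relaxed: {relaxed_tolerance} s")
```

Under those assumptions the relaxed setting doubles the window in which a node can be unresponsive, but a pause longer than that window would still get the node removed from cluster membership.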
Every time VMware takes a snapshot, we get a connection timeout error.
A connection timeout has occurred on a previously established connection to availability replica 'Test' with id [399ED765-5052-4448-86B1-02818E038E45]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.
After 30 seconds, the connection is re-established:
A connection for availability group 'AG' from availability replica 'Test' with id [DE46449A-072C-4274-9E48-ABD821D815B6] to 'Test' with id [399ED765-5052-4448-86B1-02818E038E45] has been successfully established. This is an informational message only. No user action is required.
December 17, 2013 at 12:37 pm
OK, well, the first problem is that with the quiesce option turned off, the snapshots are useless for backups: SQL Server hasn't flushed the buffer, so the transaction log and the database file may not be in sync. Turn the quiesce option on, take the snapshot again, and see if that resolves the problem.
December 17, 2013 at 12:56 pm
We already turned quiesce back on and the connection timeout still occurred.
"connection timeout has occurred on a previously established connection to availability replica "
How high would you suggest setting the SameSubnetTimeout value?
Thanks.
AyeMya
December 17, 2013 at 1:07 pm
The AlwaysOn availability group connection timeout occurs whenever a snapshot is created or removed.
December 18, 2013 at 2:18 pm
The VM admin set the virtual disk mode to independent on the drives that hold the SQL Server data/log, tempdb data/log, and system database data/log files. They took a snapshot and no errors occurred. In this case only the OS drive is included in the snapshot. If the server crashes, will we be able to start the SQL Server service after the OS snapshot is restored? The drives holding the system database and tempdb files are not included in the snapshot, but we back up the system and user databases to two different locations.
December 18, 2013 at 2:32 pm
You may need to restore the master, model, and msdb databases and then the user databases, but yes, you should be able to recover with just an OS snapshot.
March 20, 2014 at 10:26 am
Hi All,
We are also facing the same issue when the VMware backup triggers. Have you found a solution?
Or can we change or increase the timeout values to delay failover until the snapshot completes?
Thanks
sandeep
March 20, 2014 at 3:16 pm
We haven't found a solution yet. We have a workaround.
First, we set the cluster threshold to the Relaxed setting.
Second, we set the data/log/temp disks to independent mode in VMware.
We only back up the C:\ OS and Apps drives.
http://blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx
Hope it works for you too.
Thanks
ayemya
March 21, 2014 at 5:35 am
Hi,
Thank you for providing the workaround. I need to work with my VMware team to set this up and try the backups.
I have another question. In VMware we can choose to back up either only the OS drive or the data drive. How can we configure VMware to back up only the OS and Apps drives?
I will test this and update you.
Thanks,
sandeep
April 16, 2014 at 7:27 pm
Hi,
We are experiencing the exact same issue with an AG on SQL Server 2012 and VMware 5.5. Snapshots always cause the AG to fail over. Increasing the timeout and only taking a snapshot of the C: drive alleviates the problem somewhat, but it can still cause a failover intermittently. Did you find a resolution to this? We really want to back these boxes up with VDP but can't until we resolve this issue.
Cheers,
Joe
April 17, 2014 at 7:54 am
The quiesce option is turned off in VMWare for the servers since we are not backing up SQL drives.
April 24, 2014 at 12:08 pm
Do not include the VM's memory when taking the snapshot.
Capturing memory stuns the server, which causes the cluster to fail over.
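As a simplified illustration of why a stun matters, the sketch below compares a stun duration against two timers: the cluster heartbeat tolerance (SameSubnetDelay times SameSubnetThreshold) and the availability group session timeout. The 10-second session timeout is the SQL Server default as far as I know; the class, function, and stun durations are hypothetical, and real behaviour depends on more than these two timers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Timeouts:
    # Values assumed for illustration; check your own cluster and AG settings.
    same_subnet_delay_ms: int = 1000   # heartbeat interval
    same_subnet_threshold: int = 10    # "Relaxed" setting from this thread
    ag_session_timeout_s: int = 10     # assumed AG SESSION_TIMEOUT default


def effects_of_stun(stun_s: float, t: Timeouts) -> List[str]:
    """Return which timers a VM pause of stun_s seconds would exceed."""
    effects = []
    heartbeat_tolerance_s = t.same_subnet_delay_ms * t.same_subnet_threshold / 1000.0
    if stun_s > heartbeat_tolerance_s:
        effects.append("cluster heartbeat tolerance exceeded")
    if stun_s > t.ag_session_timeout_s:
        effects.append("AG session timeout exceeded")
    return effects


# Hypothetical stun durations: a short disk-only stun and a longer memory snapshot.
for stun in (3, 12):
    print(stun, effects_of_stun(stun, Timeouts()))
```

In this model a stun shorter than both timers goes unnoticed, while a longer one trips the connection timeout and can remove the node from the cluster, which is consistent with the advice to keep memory out of the snapshot.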
October 23, 2017 at 12:52 pm
ayemya - Thursday, March 20, 2014 3:16 PM
We haven't found a solution yet. We have a workaround.
First, we set the cluster threshold to the Relaxed setting.
Second, we set the data/log/temp disks to independent mode in VMware.
We only back up the C:\ OS and Apps drives.
http://blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx
Hope it works for you too.
Thanks
ayemya
Please let me know where I should change the timeout settings: on the primary or on the secondary? I see error 35206 (connection timeout) on the primary and, as a result, error 976 (broken connection) on the secondary.