2 node cluster Split-Brain Issue

  • I have a two-node ESXi v5 VM Windows 2008 cluster in a secure environment that recently had a split-brain issue due to loss of heartbeat. The quorum configuration is "Node and Disk Majority". I guess I am surprised that loss of heartbeat, caused by Retina scans crashing the switches at the same time, would cause a split-brain scenario; I thought the Quorum disk would not allow it. I stand corrected, but I am very nervous now. It was a pain getting my SQL cluster up and running after major disk corruption on all shared drives, and I have 9 SQL clusters in this environment to look after.

    Reviewing Cluster.log, it looks like the following occurred:

    When the heartbeat was lost, the inactive Node2 brought the Quorum disk online.

    Then Node2 brought the DTC disk online.

    Then Node2 brought all of the SQL SAN disks online and started the SQL services.

    The active Node1 thought it had lost Node2 and stayed active with the SQL, DTC and Quorum disks and the SQL services.

    The network was restored.

    SQL crashed from corrupted disks.

    Ideas on prevention? Configuration changes?

    I did implement a second heartbeat through a separate switch, but I am still nervous, and I think the quorum disk is worthless in these two-node clusters.
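
    To understand why, I sketched the vote math for Node and Disk Majority myself (this is just a toy model of my own, not Microsoft's actual algorithm). With 2 nodes plus a witness disk there are 3 votes, and any partition that can still bring the witness disk online counts it as a vote; so if nothing at the storage layer (i.e. a SCSI-3 persistent reservation) actually refuses the second node, both halves can believe they hold 2 of 3 votes:

        # Toy model of "Node and Disk Majority" voting - my own sketch, only meant to
        # illustrate why a missing disk reservation lets both partitions claim majority.

        TOTAL_VOTES = 3          # 2 nodes + 1 witness (quorum) disk

        def has_majority(nodes_visible, can_online_witness_disk):
            """Votes a partition believes it has: the nodes it can see, plus the
            witness disk if nothing (such as a SCSI-3 PR) stops it onlining it."""
            votes = nodes_visible + (1 if can_online_witness_disk else 0)
            return votes > TOTAL_VOTES // 2

        # Heartbeat lost: each node sees only itself.
        # Without working persistent reservations, BOTH sides online the quorum disk:
        print(has_majority(1, can_online_witness_disk=True))    # Node1 -> True
        print(has_majority(1, can_online_witness_disk=True))    # Node2 -> True (split-brain)

        # With PR enforced by the SAN, only the current reservation holder gets the disk:
        print(has_majority(1, can_online_witness_disk=True))    # Node1 (holds reservation) -> True
        print(has_majority(1, can_online_witness_disk=False))   # Node2 (reservation refused) -> False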

  • This Split-Brain issue is re-creatable.

    In a separate environment, I have another 2 Node SQL Cluster.

    By interrupting network communication on the active node for 1 minute, I caused the same split-brain scenario, described below.

    Both nodes became active, and when the network was restored, the split-brain (active/active) state remained.

    Since the Quorum and DTC disks go ACTIVE/ACTIVE before SQL becomes ACTIVE/ACTIVE, this is not a SQL issue.

    One way to re-create the issue:

    Although this issue is probably not specific to ESXi v5, I am going to use it to re-create the issue rather than physically pulling cables or turning off switches (a scripted alternative for the network interruption is sketched after the steps below):

    In Failover Cluster Manager

    Click on each Cluster Network under Networks

    One network should have "allow cluster network communication..." and "allow clients..."

    All other networks should have "Do not allow cluster network communication..."

    In our ESXi v5 environment using vSphere,

    Click on the Physical Server that hosts the VM for the active SQL Node.

    Configuration Tab

    Networking (Hardware)

    Click on Properties of Network "Switch" that matches up with the network set to "allow clients..."

    Click on the Port that matches up with the network set to "allow clients..."

    Click on "Edit"

    Click on "NIC Teaming" tab

    Move all Adapters down to "Unused Adapters" and click OK

    Both Nodes come up as active (Split-Brain)

    Restore previous Network Adapter

    Confirm that network is restored on both nodes (I could browse out to other servers)

    Both Nodes still active

    Waited 20 minutes...Both Nodes still active

    Shut down 2nd Node (Reboot)

    Cluster Crashed
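
    For repeat runs, the network interruption itself can be scripted instead of clicking through NIC Teaming. Below is only a rough sketch using the pyVmomi SDK; the vCenter address, credentials and the VM name "SQL-NODE1" are placeholders, and it simply disconnects the active node's virtual NICs, which should produce a similar loss of heartbeat:

        # Rough pyVmomi sketch: disconnect the active SQL node's vNICs to simulate
        # a network outage. Host, credentials and VM name are placeholders.
        import ssl
        from pyVim.connect import SmartConnect, Disconnect
        from pyVmomi import vim

        ctx = ssl._create_unverified_context()      # lab only; use real certificates in production
        si = SmartConnect(host="vcenter.example.local", user="administrator",
                          pwd="password", sslContext=ctx)
        content = si.RetrieveContent()

        view = content.viewManager.CreateContainerView(content.rootFolder,
                                                       [vim.VirtualMachine], True)
        vm = next(v for v in view.view if v.name == "SQL-NODE1")   # hypothetical VM name

        for dev in vm.config.hardware.device:
            if isinstance(dev, vim.vm.device.VirtualEthernetCard):
                change = vim.vm.device.VirtualDeviceSpec()
                change.operation = vim.vm.device.VirtualDeviceSpec.Operation.edit
                change.device = dev
                change.device.connectable.connected = False        # "unplug" the vNIC
                vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change]))

        Disconnect(si)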

  • Are both VMs on the same host (Cluster in a box) or are they on separate hosts?

    Is this a host based vswitch or a distributed vswitch?

    How is the storage presented to both node VMs?
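
    (If you are not sure how to check those, a rough pyVmomi sketch like the one below will list which host each node runs on, whether that host uses standard or distributed vswitches, and the SCSI bus sharing mode on each VM. The vCenter address, credentials and VM names are placeholders.)

        # Rough pyVmomi sketch to answer the questions above; adjust the placeholder
        # vCenter address, credentials and VM names for your environment.
        import ssl
        from pyVim.connect import SmartConnect
        from pyVmomi import vim

        si = SmartConnect(host="vcenter.example.local", user="administrator",
                          pwd="password", sslContext=ssl._create_unverified_context())
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(content.rootFolder,
                                                       [vim.VirtualMachine], True)

        for vm in view.view:
            if vm.name not in ("SQL-NODE1", "SQL-NODE2"):          # hypothetical node names
                continue
            host = vm.runtime.host
            print(vm.name, "runs on host", host.name)
            print("  standard vswitches:  ", [s.name for s in host.config.network.vswitch])
            print("  distributed switches:", [p.dvsName for p in (host.config.network.proxySwitch or [])])
            for dev in vm.config.hardware.device:
                if isinstance(dev, vim.vm.device.VirtualSCSIController):
                    print("  SCSI controller", dev.busNumber, "bus sharing:", dev.sharedBus)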

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Hi Perry,

    The VMs are on separate physical hosts.

    I think the vswitch is host-based (I am not sure how to verify that).

    Storage is presented as shared disks via a SCSI controller (LSI Logic SAS), with SCSI bus sharing set to Physical.

    The SAN shared drives are thick provisioned, lazy zeroed.

    We are currently reviewing a Cluster Validation Error - SCSI-3 Persistent Reservation Failed.

    It is my understanding that lack of PR is the most likely cause of this split-brain issue.

  • sbaker-757360 (5/22/2012)


    shared disks (SCSI bus sharing set to Physical)

    This type of shared drive is not supported for Windows 2008; it will only work with Windows 2003 nodes. Hence why you see this:

    sbaker-757360 (5/22/2012)


    Cluster Validation Error - SCSI-3 Persistent Reservation Failed.

    sbaker-757360 (5/22/2012)


    It is my understanding that lack of PR is the most likely cause of this split-brain issue.

    That is correct. I would hazard a guess that your shared quorum drive isn't shared and visible to both nodes, and therefore you have no quorum.

    Have you tried using a file share witness instead of a disk?

    Is this a test environment or are you running it live?

    I use ESX for running virtual multi-node Windows 2008 clusters with no problems; I get around the SCSI-3 PR requirement by presenting storage to my Windows 2008 VMs via iSCSI LUNs from a virtual SAN device.

    See my guide at this link for more info.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Perry,

    I do not have a test environment just for SQL 2008 R2 clustering.

    I am performing tests on an environment that does not have SharePoint set up on it yet (SharePoint will use this SQL cluster).

    I have not seen any current ESXi v5 or Microsoft 2008 R2 MSCS documentation saying that our configuration is not supported (as long as our IBM SAN supports PR).

    Our SAN Specialist is very busy and I am waiting for his reply on Persistent Reservations.

    Currently our 9 clusters work great and failovers work great... unless all heartbeats are lost (complete network failure).

    When working properly, the Quorum disk is offline on the inactive Node2 and online on the active Node1. If the heartbeat is lost, the inactive Node2 goes after the Quorum disk (brings it online). Once it has the Quorum drive, it has majority, so it goes after the DTC drive (brings it online); then the SQL data, SQL log and SQL backup drives are brought online, and the SQL services are started.

    The issue here is the lack of PR: if the active Node1 did not die and still has the Quorum disk (only a network failure occurred), we do not want the inactive Node2 bringing the Quorum disk online, which it does, subjecting the Quorum disk to corruption. Hopefully, with PR enabled, the SAN will tell Node2: "You can't have the Quorum disk because it is reserved by Node1." Without the Quorum drive, Node2 does not have majority and cannot become the active node, preventing an active/active split-brain scenario.

    One of the most disappointing things I have seen is that when an MSCS split-brain occurs, there is no mechanism in place (after network communication is restored) to detect it and recover and/or shut down (short of a reboot or a restart of the cluster service), so split-brain can persist for hours or days (depending on how much disk corruption the systems can take).
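
    In the meantime I am thinking about a crude watchdog of my own, since the cluster itself won't flag this. The idea (just a sketch, not a supported tool) is to run it on each node and have each node report which node it believes owns the quorum disk; if both nodes claim ownership, you are active/active. The resource name "Cluster Disk Quorum" is an assumption, and the cluster.exe output parsing may need adjusting:

        # Crude split-brain check (a sketch only). Run it on EACH node: it asks the
        # local cluster service, via cluster.exe, which node it thinks owns the quorum
        # disk resource. If both nodes report themselves as the owner, you have an
        # active/active (split-brain) condition.
        # "Cluster Disk Quorum" is a hypothetical resource name - use whatever your
        # quorum disk resource is actually called, and adjust the column parsing to
        # your cluster.exe output.
        import socket
        import subprocess

        QUORUM_RESOURCE = "Cluster Disk Quorum"      # hypothetical resource name

        def local_view_of_quorum():
            out = subprocess.run(["cluster", "res"], capture_output=True, text=True).stdout
            for line in out.splitlines():
                if QUORUM_RESOURCE.lower() in line.lower():
                    parts = line.split()
                    # expected layout: <resource...> <group...> <node> <status>
                    return parts[-2], parts[-1]
            return None, None

        owner, status = local_view_of_quorum()
        me = socket.gethostname()
        print(f"{me}: quorum disk owner={owner} status={status}")
        if owner and owner.lower() == me.lower() and status.lower() == "online":
            print(f"{me} believes it OWNS the quorum disk - compare with the other node!")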

  • sbaker-757360 (5/23/2012)


    I have not seen any current ESXi v5 or Microsoft 2008 R2 MSCS documentation saying that our configuration is not supported (as long as our IBM SAN supports PR).

    Our SAN Specialist is very busy and I am waiting for his reply on Persistent Reservations.

    Looks like I may have misunderstood your shared disk setup. The shared disks are LUNs attached to the ESX servers from a SAN, is that correct?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • The shared disks are created from LUNs attached to the ESX servers from a SAN.

    Typically, more than 1 shared disk is created from a single LUN.

    So you're attaching the LUN, creating an ESX datastore and then creating VMware virtual disks for sharing amongst the VMs?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • sbaker-757360 (5/23/2012)


    I have not seen any current ESXi v5 or Microsoft 2008 R2 MSCS documentation saying that our configuration is not supported

    Yes, you have: the mere fact that your cluster validation fails the SCSI-3 persistent reservation test is all you need.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Perry Whittle (5/23/2012)


    So you're attaching the LUN, creating an ESX datastore and then creating VMware virtual disks for sharing amongst the VMs?

    We are attaching a LUN and creating one datastore per LUN, and then creating VMware virtual disks.

    My concern is that since Persistent Reservations are at the LUN level, does that mean I need a Dedicated LUN for each shared disk in the cluster?

    I am meeting with our SAN Specialist this afternoon. He is concerned that PR only works at the physical nodes (so VMs won't be able to use PR), but I just googled some information that states that with ESXi 4 and above, "passing through of SCSI-3 persistent reservations from a virtual machine to the underlying physical storage" is enabled.

    http://www.google.com/url?sa=t&rct=j&q=esxi%205%20passing%20through%20scsi-3%20%22persistent%20reservations%22&source=web&cd=4&ved=0CGQQFjAD&url=http%3A%2F%2Fhaiderriz.files.wordpress.com%2F2009%2F08%2Fmodule-3-storage-rev-p.ppt&ei=_fy8T9qHJYbH6AGyupAh&usg=AFQjCNFNy46CrG5aJodPpThDUx4kPWqxMg&cad=rja
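
    Once the storage changes are in place, I plan to re-run just the storage portion of cluster validation to see whether the SCSI-3 Persistent Reservation test passes. A rough way to drive that from a script (assuming the FailoverClusters PowerShell module is available on a 2008 R2 node; the node names are placeholders, and note that the storage tests take the shared disks offline, so don't run them against disks in production use):

        # Rough sketch: kick off only the Storage category of cluster validation
        # (which includes the SCSI-3 Persistent Reservation test) by shelling out
        # to PowerShell. Node names are placeholders.
        import subprocess

        ps_command = (
            "Import-Module FailoverClusters; "
            "Test-Cluster -Node SQL-NODE1,SQL-NODE2 -Include 'Storage'"
        )

        result = subprocess.run(
            ["powershell.exe", "-NoProfile", "-Command", ps_command],
            capture_output=True, text=True,
        )
        print(result.stdout)     # on success this includes the path to the validation report
        print(result.stderr)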

  • "Use of Microsoft Cluster Service requires that each cluster disk resource is in its own LUN"

    I believe this requirement has everything to do with Persistent Reservations

    ...Looks like I have to create a bunch of LUNs.:w00t:

    http://pubs.vmware.com/vsphere-50/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-50-storage-guide.pdf

  • sbaker-757360 (5/23/2012)


    We are attaching a LUN and creating one datastore per LUN, and then creating VMware virtual disks.

    Bingo, thought so 😉

    sbaker-757360 (5/23/2012)


    My concern is that since Persistent Reservations are at the LUN level, does that mean I need a Dedicated LUN for each shared disk in the cluster?

    Straight VMware virtual disks DO NOT support SCSI-3 persistent reservations; you need to map each LUN as a raw device mapping (RDM) to the ESX hosts. When you add the virtual disk to the actual VMs, you select the raw device mapping option instead of "Create new virtual disk". RDMs can come from a SAN/NAS over FC or iSCSI.

    sbaker-757360 (5/23/2012)


    He is concerned that PR only works at the physical nodes (so VMs won't be able to use PR)

    No, absolutely wrong. It works for VMs too; read my guide that I linked previously 😉

    sbaker-757360 (5/23/2012)


    but I just googled some information that states that with ESXi 4 and above, "passing through of SCSI-3 persistent reservations from a virtual machine to the underlying physical storage" is enabled.

    It's the Windows 2008 OS that's the issue, not VMware. Windows 2008 clusters require high-end storage via either SAS, FC or iSCSI. You're not passing through (that's why you want RDMs in physical compatibility); you're attaching the LUN(s) to the ESX host, creating a VMware VMFS-formatted datastore and then creating a VMware virtual disk. As I have already said, this is NOT supported and will cause you all sorts of issues.

    Do the following:

    - Remove the disks from the VMs and remove the VMFS datastores. Create the required number of unformatted LUNs and ensure they are attached directly to the ESX hosts where the VMs will run.

    - Once this is done, go to the first VM and start adding virtual disks, but select the Raw Device Mappings option. The adapter must be an LSI Logic SAS adapter in physical compatibility (not virtual) for a Windows 2008 cluster.

    - Review the mappings you have just made on the first VM and then add the shared disks to the partner VM, this time selecting "existing virtual disk" (not RDM or new virtual disk); also note the virtual SCSI adapter IDs of each existing disk on the first VM.

    For example, if on the first node you attached the first RDM as a virtual disk named "Shared-disk1.vmdk" on SCSI ID 1:0, then attach it exactly the same way on the partner node, and so on.
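
    If you want to double-check the result afterwards, a rough sketch along these lines (pyVmomi; the vCenter details and VM names are placeholders) will list each disk's backing type, RDM compatibility mode and SCSI ID so you can confirm the two nodes line up:

        # Rough pyVmomi sketch: for each cluster VM, list every virtual disk's backing
        # type, RDM compatibility mode and SCSI ID so the two nodes can be compared.
        # vCenter address, credentials and VM names are placeholders.
        import ssl
        from pyVim.connect import SmartConnect
        from pyVmomi import vim

        si = SmartConnect(host="vcenter.example.local", user="administrator",
                          pwd="password", sslContext=ssl._create_unverified_context())
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(content.rootFolder,
                                                       [vim.VirtualMachine], True)

        for vm in view.view:
            if vm.name not in ("SQL-NODE1", "SQL-NODE2"):          # hypothetical node names
                continue
            devices = vm.config.hardware.device
            controllers = {d.key: d for d in devices
                           if isinstance(d, vim.vm.device.VirtualSCSIController)}
            print(vm.name)
            for d in devices:
                if not isinstance(d, vim.vm.device.VirtualDisk):
                    continue
                ctrl = controllers.get(d.controllerKey)
                scsi_id = f"{ctrl.busNumber}:{d.unitNumber}" if ctrl else "?"
                if isinstance(d.backing, vim.vm.device.VirtualDisk.RawDiskMappingVer1BackingInfo):
                    # a Windows 2008 cluster needs compatibilityMode 'physicalMode' here
                    print(f"  SCSI {scsi_id}  RDM ({d.backing.compatibilityMode})  {d.backing.fileName}")
                else:
                    print(f"  SCSI {scsi_id}  plain virtual disk  {d.backing.fileName}")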

    It's all detailed here 😎

    Shout back if you're still stuck.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Perry,

    Thank you for the info.

    The SAN guy finished creating a new set of LUNs for the two SQL cluster servers, and our VM guy is currently creating the RDMs. I'll let you know how it goes.

  • Cool!

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉
