August 22, 2014 at 8:36 am
We've been having this exact same issue. Our environments are close enough that it sounds like I wrote your original post: Windows 2008 R2 cluster nodes, SQL 2008 R2, multiple ESXi 5.5 clusters, and two different EMC SANs (5300, 5400) at two different sites. We have three clusters built using VMDKs, and all three have existed for more than a year. Two of the three recently started getting these "LogWriter: Operating system error 170(The requested resource is in use.) encountered" errors in SQL around May/June, possibly around the time we upgraded to ESXi 5.5U1a (the Heartbleed fix release). The error would hit random databases or transaction logs several times a day, sometimes msdb or tempdb, which of course pretty much crashes SQL altogether.
There were no errors in the SAN event logs and nothing in the vmkernel.log files. No performance issues on either the SANs or the ESXi hosts. Nothing to indicate any sort of problem in the infrastructure underneath the Windows OS. We have dozens of standalone SQL instances, none of which were having this problem. We had one SQL cluster not having the problem, but it was SQL 2012, so we were initially sidetracked thinking it was a SQL 2008 R2 problem.
However, after comparing the three environments, we eventually tracked the problem down to the configuration of the shared cluster disks. When building a traditional FCI cluster, you have to configure the SCSI controller in the VM to use SCSI Bus Sharing. Your options are either Virtual, for Cluster-In-A-Box (CIB) configurations, or Physical, for Cluster-Across-Boxes (CAB). The two problem clusters were both configured as CABs with the Physical option. However, one of those two clusters was a single-node cluster using only one VM; we never got around to adding a second node, so it didn't really reside across multiple hosts.
The working cluster was configured with Virtual SCSI Bus Sharing but its VMs resided on different hosts, which is apparently unsupported but has worked like this for more than a year. We ended up shutting down both problem clusters, changing the SCSI Bus Sharing to Virtual, and vMotioning the VMs to a single host, and our problem has now gone away. It has been more than 36 hours with no "Operating system error 170" errors on either cluster, when previously we couldn't make it more than 4 hours for 2+ months straight.
I am confident this was the problem for us. What I haven't figured out yet is why Physical used to work and doesn't now. These two problem clusters survived ESXi upgrades from 5.1U1 to 5.1U2 to 5.5U1 to 5.5U1a and didn't have this issue until recently. I know you stated that your "solution" was to migrate your DBs off the clustered instance, since standalone instances are apparently not affected, most likely because they have no shared disks. But any idea if you were using Physical SCSI Bus Sharing previously? I haven't involved VMware at this point because, quite frankly, their support has never really been helpful for me. They seem to be more interested in getting me off the phone than in actually solving any issues. Now that I know the culprit, I am going to try the VMware forums to see if I can find others with the same issue, and maybe if it gets loud enough, someone at VMware beyond frontline support will actually have a look into it.
August 22, 2014 at 3:08 pm
OK, so the definitive answer is that you can't use Physical SCSI Bus Sharing with VMDKs at all. As of 5.5, Physical Bus Sharing is only supported with RDMs. Whether it becomes supported in the future is unknown. If you use Physical with VMDKs, you will run into the "Operating system error 170" errors in SQL, even if both nodes are on the same host, or even if you only have one node in the cluster. According to the "VMware vSphere support for Microsoft clustering solutions on VMware products" document, Physical with VMDKs is not a supported configuration. That means you cannot run a CAB if you are using shared VMDKs. You must use Virtual and keep all nodes on the same host in a CIB configuration. I swear this worked for us for over a year, but unsupported is unsupported. Hope this helps someone else.
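For anyone who wants to sanity-check their own clusters against the two rules above, here is a rough Python sketch. It only encodes what is described in this thread (Physical + VMDK is unsupported, and Virtual needs every node on one host). The class name, field names, and host names are made up for illustration, and it does not talk to vCenter or read any VM files; you would fill in the values by hand.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SharedDiskConfig:
    """Hypothetical description of one cluster's shared-disk setup."""
    bus_sharing: str              # "virtual" or "physical" (SCSI Bus Sharing mode)
    disk_type: str                # "vmdk" or "rdm"
    node_hosts: Tuple[str, ...]   # ESXi host each cluster node VM runs on


def check_shared_disk_config(cfg: SharedDiskConfig) -> List[str]:
    """Return a list of problems based on the rules stated in this thread."""
    problems = []
    # Rule 1: Physical bus sharing is only supported with RDMs.
    if cfg.bus_sharing == "physical" and cfg.disk_type == "vmdk":
        problems.append("Physical bus sharing with VMDKs is unsupported")
    # Rule 2: Virtual bus sharing (CIB) needs all nodes on the same host.
    if cfg.bus_sharing == "virtual" and len(set(cfg.node_hosts)) > 1:
        problems.append("Virtual bus sharing requires all nodes on one host")
    return problems


# The configurations described in this thread (host names are placeholders).
two_node_cab = SharedDiskConfig("physical", "vmdk", ("esx-a", "esx-b"))
single_node = SharedDiskConfig("physical", "vmdk", ("esx-a",))
after_fix = SharedDiskConfig("virtual", "vmdk", ("esx-a", "esx-a"))

for name, cfg in [("two-node CAB", two_node_cab),
                  ("single-node", single_node),
                  ("after fix", after_fix)]:
    print(name, "->", check_shared_disk_config(cfg) or "OK")
```

Note that the single-node cluster still gets flagged, which matches what we saw: the errors showed up with Physical + VMDK regardless of how many nodes or hosts were involved.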
August 23, 2014 at 8:39 am
Hi,
Yes, we used physical + VMDK. I was thinking along the same lines as you and was going to try the RDM configuration... The problem smacked of the paravirtualised SCSI bug, although we're using LSI Logic SAS. I was sure it was something in that arena, but it's seemingly undocumented...
Thanks for replying - I wasn't aware that physical + VMDK isn't supported. We've got a few builds coming up and I was worried that we'd get the same issue all over again.