Failover Cluster reliability in a shop with limited experience in clustering

chuck.forbes

SSCarpal Tunnel

Points: 4294
September 6, 2017 at 9:42 am

#339049

We recently had a consultant help set up a 2-node Failover Cluster (both nodes are VMs) with SQL Server 2014 installed. For the most part it's been pretty stable, but our non-clustered SQL Server machines (which are also VMs) have had better uptime. For us, downtime in the cluster is related to a loss of mapping for the shared storage between the two nodes, or the Quorum disk, ultimately requiring a bit of fiddling within Failover Cluster Manager and an instance restart. Until that happens, the databases within the instance are unavailable.
We're not really that big of a shop, and so I don't see us being able to devote a large amount of time for a deep dive into Clustering. If you're in a similar situation, are you experiencing more stability with Clusters than we are? I'm concerned that after we resolve this loss of shared storage, that another issue may come up that we're unprepared for. We're much more intellectually invested in single VM setups, and so we're on the brink of making a decision of whether to stick with the cluster, or move the databases out of that environment.
Thanks for your help,
--=Chuck

Henrico Bekker One Orange Chip Points: 27652 · Answer 1

If you can afford to carve off more storage, my vote would go to Availability Groups on two standalone VMs instead.
No shared storage is required; just use a file share witness for the quorum.
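
To show why a two-node setup needs a witness at all, here is a minimal sketch of the vote arithmetic. It assumes a simple static majority rule (one vote per node plus one for the witness) and ignores the dynamic quorum adjustments that newer Windows Server versions apply; the function name `has_quorum` is made up for illustration and is not part of any cluster API.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Return True while strictly more than half of all votes are reachable."""
    return 2 * votes_online > total_votes


# Two nodes plus one witness (file share or disk) gives three votes in total.
TOTAL_WITH_WITNESS = 2 + 1

# One node fails: the surviving node and the witness still hold 2 of 3 votes.
print("one node lost, witness up:", has_quorum(2, TOTAL_WITH_WITNESS))

# A node and the witness are both unreachable: 1 of 3 votes is not a majority.
print("node and witness lost:", has_quorum(1, TOTAL_WITH_WITNESS))

# No witness configured: losing either node leaves 1 of 2 votes, no majority.
print("no witness, one node lost:", has_quorum(1, 2))
```

The witness only breaks the tie. It does not matter for this arithmetic whether it is a disk or a file share, but a file share witness does not depend on the shared storage that both nodes have to see.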

It does sound like your storage was not assigned correctly to the two nodes. The question is whether you have someone who can fix it.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This thing is addressing problems that dont exist. Its solution-ism at its worst. We are dumbing down machines that are inherently superior. - Gilfoyle

chuck.forbes SSCarpal Tunnel Points: 4294 · Answer 2

We'll likely pull in a consultant to examine the current setup, with the caveat that in-house training accompany the evaluation & changes.

I'll check out Availability Groups beforehand, but we're using SQL Server Standard. I didn't think AGs were available in that edition until SQL Server 2016, and even there only in a more restrictive setup. Would you still recommend them knowing we won't have any Enterprise licenses?

--=cf

Henrico Bekker One Orange Chip Points: 27652 · Answer 3

That changes things, but you can still use Basic Availability Groups (BAGs), with one database per group. EDIT: again, that is only from SQL Server 2016 onward.
I know this might not be a solution for instances with plenty of databases.

If you need to stick with 2014 and want to get rid of clustering, you might be limited to more traditional HA options like Mirroring or Log Shipping, although each of them also needs its own administration and some SQL knowledge to set up, maintain and actually use.

Read more here to see if it could be a solution for you:
https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/basic-availability-groups-always-on-availability-groups

Perry Whittle SSC Guru Points: 234065 · Answer 4

chuck.forbes - Wednesday, September 6, 2017 9:42 AM
For us, downtime in the cluster is related to a loss of mapping for the shared storage between the two nodes

The storage is the first place you'll need to start.
Does the cluster role fail on both nodes, or on one in particular?
I've seen issues similar to this in the past, and the cause was that the storage admin hadn't fully unmasked the LUNs to both nodes because of a mistyped WWN. Only one node could fully access the shared storage, and you can guess the issues this caused.
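
As a rough illustration of that kind of check, the sketch below compares the disks each node reports and flags any that are not visible everywhere. The node names and disk identifiers are invented sample data, and `disks_missing_per_node` is a hypothetical helper; in practice you would fill the input from whatever your storage or cluster tooling reports for each node.

```python
from typing import Dict, Set


def disks_missing_per_node(visible: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each node, return the disk IDs some other node sees but this node does not."""
    all_disks: Set[str] = set().union(*visible.values())
    missing = {}
    for node, disks in visible.items():
        gap = all_disks - disks
        if gap:
            missing[node] = gap
    return missing


# Invented sample data: NODE2 cannot see the log LUN (e.g. a masking mistake).
visible_disks = {
    "NODE1": {"lun-data", "lun-log", "lun-quorum"},
    "NODE2": {"lun-data", "lun-quorum"},
}

for node, gap in sorted(disks_missing_per_node(visible_disks).items()):
    print(f"{node} cannot see: {', '.join(sorted(gap))}")
```

An empty result means every node reports the same set of disks, which is what a shared-storage cluster needs before failover can work in both directions.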

chuck.forbes - Wednesday, September 6, 2017 9:42 AM
or the Quorum disk

Every time I see a WSFC with a disk witness it chills me to the bone. Get into the 21st century, people.
It sounds like your consultant didn't have a very good handle on SQL Server HA.
It may pay you to read my Stairway to AlwaysOn, starting at this link:

http://www.sqlservercentral.com/stairway/112556/

chuck.forbes - Wednesday, September 6, 2017 9:42 AM
We're not really that big of a shop, and so I don't see us being able to devote a large amount of time for a deep dive into Clustering.

Well, at the moment you're stuck with it, so some quick learning will be required.

chuck.forbes - Wednesday, September 6, 2017 9:42 AM
If you're in a similar situation, are you experiencing more stability with Clusters than we are? I'm concerned that after we resolve this loss of shared storage, that another issue may come up that we're unprepared for. We're much more intellectually invested in single VM setups, and so we're on the brink of making a decision of whether to stick with the cluster, or move the databases out of that environment.
Thanks for your help,
--=Chuck

Check my stairway for more info.

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

chuck.forbes SSCarpal Tunnel Points: 4294 · Answer 5

I'll check out your stairway, and then get back to this post once we have more information. Thanks to both of you for some direction.
--=Chuck

Perry Whittle SSC Guru Points: 234065 · Answer 6

chuck.forbes - Thursday, September 7, 2017 8:53 AM
I'll check out your stairway, and then get back to this post once we have more information. Thanks to both of you for some direction.
--=Chuck

WSFCs are generally reliable when set up correctly.
It may pay to get a good consultant to come in and look at this for you. If you're in the UK I am available for site work, or for remote work if you're outside the UK.
