Post-Installation: SQL Failover Cluster Instance preventing nodes from booting after restart

Question

Dougieson

SSC Eights!

Points: 971
December 11, 2015 at 7:53 am

#310110

First, my problem. I am attempting to set up a two-node SQL Server 2014 Failover Cluster Instance. I have successfully installed the default SQL instance on both nodes and am able to pass the role between the two nodes. The problem is that when a node is restarted, it gets stuck while booting back up.
The C$ admin share is accessible and we can ping the server with no difficulty, but within Failover Cluster Manager the node will either be stuck at 'Joining...' or show as 'Up' while the server is not accessible via RDP.
If it does allow an RDP session, the session is not fully initialized (missing Start menu, taskbar, etc., as if explorer.exe isn't running, although it is listed in Task Manager) and I have to use hotkey combinations to bring anything up.
This occurs on both nodes. If the node that currently owns the SQL Server role is rebooted without draining roles (bear in mind I'm trying to break things to ensure this is going to be reliable during production use), then the WSFC itself will not be accessible on the other node once it finishes coming up. If one server gets stuck booting, rebooting the opposing server allows the original server to boot.
I am kind of at a loss to explain this issue. I can provide additional details, but I am unsure of precisely what would cause it. I suspect a conflict in accessing the CSV resources: on a node stuck booting, the C$ share is accessible and the ClusterStorage folder is accessible and shows the full list of my mount points, but any attempt to open one of the CSV drives hangs.
Additionally, I have been able to bring both nodes up by passing ownership of the disk that the SQL instance resides on back to the node that is stuck booting. (Ex: Node A is up, Node B is rebooting. Node A has the SQL role running, and Disk 1 is owned by Node A. Failover Cluster Manager lists Node B as 'Up' after reboot, but I cannot RDP to the server or open the Disk 1 ClusterStorage directory. Pass Disk 1 ownership to Node B, and now it finishes booting.) Bear in mind, I've only gotten this example to work once or twice.
If I remove the SQL role (uninstall the SQL Database Engine, etc. from both servers), then I can freely restart the servers and they boot just fine.
I cannot find any online resources that contradict my assumption that rebooting an MSFC node should not prevent it from immediately coming back up (and the behavior without the SQL role reinforces my belief that this should be acceptable). I also cannot locate any articles on the specific behavior that I'm seeing.
Can anyone possibly provide a reason as to why this is occurring?
Environment:
2 x Node WSFC (Both nodes are identical physical boxes): Node A & Node B - Both running Windows Server 2012 R2, both same patch level
2 x SQL Server 2014 SP1 Enterprise (installed on both nodes), using the Advanced > Prepare Cluster Installation/Complete Cluster Installation wizards
3 x CSV Disks, Current SQL instance is installed on Disk1
1 x Quorum Disk, set as Disk Witness in Quorum
I apologize if my explanation is not as detailed as necessary. Please point out any areas of my description that I can flesh out and I'll provide further details.

Dougieson SSC Eights! Points: 971 · Answer 1

The following test reinforces my belief that there is some conflict on the CSV disks when the server comes back online, though I have not been able to identify its cause.

Environment:

Node A = No Roles currently running

Node B = SQL Role Running

Both Nodes are 'Up' and functioning as intended.

Test:

1 - Change Node A to running SQL Role

2 - Move CSV disks from Node B to Node A

3 - Pause Node B

4 - Restart server for Node B

5 - Once Node B is back on the network, test to see if you can RDP to server

Results: Failure to RDP, but can PING by Hostname or IP & Can connect to C$ Admin Share - though you are not able to open the Directories for the Mount Points for the disks being hosted on CSV

6 - From Failover Cluster Manager on Node A, Unpause Node B

Results: Node B shows status 'Up', but still unable to RDP to server

7 - Move SQL Server Role from Node A to Node B, through Failover Cluster Manager on Node A

Results: The SQL Server Resource for the SQL Server Role gets stuck in an 'Online Pending' status. After a short window, the SQL Server Resource fails to start and gives us a 'Failed' status.

8 - Pause and Drain Roles on Node B, to revert them back to Node A.

Results: Gets stuck on 'Draining...' status when moving the role, attempt to Pause Node B and it will get stuck on Draining as well. Ended up having to issue a remote restart command for Node B, and then restarting Node B within the cluster once the server was back online.

9 - Move the CSV Disk for SQL Server Instance, from Node A to Node B

Results: Succeeds, and I am now able to successfully RDP to Node B's server with no problems.

10 - Move SQL Server Failover Instance Role back to Node B

Results: Succeeds, and SQL instance is running as intended and remotely accessible.

Anyone able to identify what could be the potential cause of this issue? I'm finding nothing in the Failover Cluster Event Logs or Windows Logs that would explain this behavior. And I'm coming up dry when I search online for issues with similar behavior.
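
For anyone repeating steps 5 and 6, the manual checks (can I reach RDP, does opening a CSV mount point hang) could be scripted so a stuck node is spotted without waiting on a frozen Explorer window. This is only a rough sketch in Python and was not part of the test above. The host name and UNC paths in the usage note are hypothetical placeholders, and 3389 is assumed to be the RDP port (the Windows default).

```python
import os
import socket
import threading


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def listing_responds(path, timeout=5.0):
    """Return True if listing `path` completes without error within timeout.

    The listing runs in a daemon thread so that a hung share (the symptom
    described for the CSV mount points) cannot block the caller forever.
    """
    result = {}

    def worker():
        try:
            os.listdir(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    # If the thread is still running, nothing was stored: treat as a hang.
    return result.get("ok", False)


def probe_node(host, mount_points, rdp_port=3389):
    """Collect the checks from steps 5 and 6 into one dictionary."""
    return {
        "rdp_port_open": tcp_port_open(host, rdp_port),
        "mount_points": {p: listing_responds(p) for p in mount_points},
    }
```

A call such as `probe_node("NodeB", [r"\\NodeB\C$\ClusterStorage\Volume1"])` (both names made up for illustration) would report whether the RDP port answers and whether each mount point can be listed. Note that an open port only shows the RDP listener is up, not that a session will fully initialize.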

Perry Whittle SSC Guru Points: 234013 · Answer 2

Have you run a cluster validation report?

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

Dougieson SSC Eights! Points: 971 · Answer 3

Repeatedly, lol. I think I may have thought of the possible cause over the weekend. Unfortunately my workplace is a stickler about hours, so I only get to test it now that I'm in the office.

I'll post if I am able to come up with a resolution, but yes - definitely ran the cluster validation wizard repeatedly.

Dougieson SSC Eights! Points: 971 · Answer 4

Determined the cause to be a very peculiar permissions issue on the SAN Manager itself, which was handling the LUNs for the CSV.

I can't give the exact details of the resolution because, to be honest, I tried so many different things to get this resolved that I'm unsure which steps actually corrected it. That being said, I hope no one else runs into this (I spent 4 days on it, off and on).

will.hall SSC Journeyman Points: 82 · Answer 5

Can you give a hint as to a few of the things you tried to resolve it? I'm stuck trying to install SQL 2014:

Updating permission setting for folder 'C:\ClusterStorage\myinstance\MSSQL12.MSSQLSERVER\MSSQL\DATA' failed.

The folder permission setting were supposed to be set to

'D:P(A;OICI;FA;;;BA)(A;OICI;FA;;;SY)(A;OICI;FA;;;CO)(A;OICI;FA;;;S-1-5-80-3880718306-3832830129-1677859214-2598158968-1052248003)'.

This happens even though I'm in the Administrators group on the server. I also have a SAN setup and am trying to use CSV.
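
The string in that error is SDDL (Security Descriptor Definition Language), so it may help to spell out what setup was trying to apply to the DATA folder. Below is a small Python sketch that decodes only the codes appearing in this particular string. The lookup tables are deliberately incomplete, the function names are my own, and it is not a general SDDL parser. The `S-1-5-80-` prefix marks a per-service SID. The sketch does not resolve which service account it maps to.

```python
import re

# Partial lookup tables: only the codes that appear in the error message.
ACE_TYPES = {"A": "allow", "D": "deny"}
ACE_FLAGS = {"OI": "object inherit", "CI": "container inherit"}
RIGHTS = {"FA": "file all access"}
TRUSTEES = {
    "BA": "BUILTIN\\Administrators",
    "SY": "NT AUTHORITY\\SYSTEM",
    "CO": "CREATOR OWNER",
}


def describe_trustee(code):
    """Translate a trustee alias, or label a raw service SID as such."""
    if code.startswith("S-1-5-80-"):
        return "service SID " + code
    return TRUSTEES.get(code, code)


def decode_dacl(sddl):
    """Decode a DACL-only SDDL string of the form D:<flags>(ace)(ace)..."""
    match = re.match(r"D:([A-Z]*)((?:\([^)]*\))*)$", sddl)
    if match is None:
        raise ValueError("not a DACL-only SDDL string")
    dacl_flags, ace_block = match.groups()
    aces = []
    for ace in re.findall(r"\(([^)]*)\)", ace_block):
        # ACE layout: type;flags;rights;object_guid;inherit_guid;trustee
        ace_type, flags, rights, _obj, _inherit, trustee = ace.split(";")
        flag_codes = [flags[i:i + 2] for i in range(0, len(flags), 2)]
        aces.append({
            "type": ACE_TYPES.get(ace_type, ace_type),
            "flags": [ACE_FLAGS.get(f, f) for f in flag_codes],
            "rights": RIGHTS.get(rights, rights),
            "trustee": describe_trustee(trustee),
        })
    # "P" means the DACL is protected, i.e. it does not inherit from the parent.
    return {"protected": "P" in dacl_flags, "aces": aces}
```

Read this way, the string asks for a protected DACL granting full control, inherited by files and subfolders, to Administrators, SYSTEM, CREATOR OWNER and one service SID. The error means that DACL could not be written to the folder on the CSV, which is a separate question from whether the installing user is in the Administrators group.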