Removing a SQL node from a cluster

  • First and foremost - I AM NOT A SYSTEMS ADMIN

    I'm a DBA. My Windows admin and I built a SQL Server cluster to run availability groups. It was the first time either of us had built one for AGs, and the build itself was a total success. Then we pointed our third-party software at it, and it couldn't deal with the fact that we'd clustered. It was a disaster. We had to very quickly remove the nodes from the cluster, point everything at just one of the nodes, and move forward as if the cluster never existed. The B node is completely powered down at this point, and the A node is now treated as a single stand-alone server.

    The problem is that the logs in that A node are bombarded with cluster-related failure entries. Over and over and over and over again, the same set of errors in the Cluster Events log and in the Windows event log.

    Let's say the cluster is named TESTSERVER with 2 nodes: TESTSERVERA and TESTSERVERB. The repeated errors are as follows:

    • The cluster service encountered an unexpected problem and will be shut down. The error code was '5050'.
    • Failed to form cluster 'TESTSERVER' with error code '6'. Failover cluster will not be available.
    • The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

    Those are specifically in the Cluster Events on TESTSERVERA (remember B is powered down).

    The errors from the Windows Event Viewer on A happen every 1 minute:

    • Failed to form cluster 'TESTSERVER' with error code '6'. Failover cluster will not be available.
    • The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
    • The Cluster Service service terminated with the following service-specific error: The handle is invalid.

    While using SSMS on A, SELECT * FROM SYS.DM_OS_CLUSTER_NODES returns no records. I definitely ran the SQL setup and removed the node for A, but can't remember if I did that on B. And I have NO idea what to do within the Windows config/settings. I just know I need to make it all go away asap so I can update SQL and move on with my life.

  • My best guess is that you are getting these errors because the cluster cannot establish quorum. A 2-node cluster normally has 3 votes (node1, node2, and either a disk or file share witness), and a majority of those votes has to be online for the cluster to form.

    Without knowing everything that was done, it is impossible to identify how to get the system up and running. And without understanding what issues the application had with accessing the system, I'm not sure what help can be provided.

    Did you create a listener when you set up and configured the AG? If not, that would definitely cause problems with the application connecting to the system. A listener is required to redirect connections to the current primary node. Even without a listener you can connect directly to the primary node, but that would not satisfy the requirement for failover.

    At this point, I am guessing that what you really have is a single-node cluster, but since you only have the single node and probably don't have a witness, the cluster is not healthy and shuts down. Since the instance is a stand-alone instance, SQL Server is up and available directly.
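
    One way to check that guess from the SQL side is sketched below. It assumes you have VIEW SERVER STATE permission on TESTSERVERA. Note that sys.dm_os_cluster_nodes only returns rows for a failover cluster instance, so an empty result from it is expected on a stand-alone instance and does not by itself say anything about the Windows cluster.

    ```sql
    -- Is this instance a failover cluster instance, and is the Always On AG feature enabled?
    SELECT
        SERVERPROPERTY('IsClustered')   AS is_clustered_instance,  -- 0 = stand-alone install
        SERVERPROPERTY('IsHadrEnabled') AS is_hadr_enabled;        -- 1 = AG feature is enabled

    -- What SQL Server can see of the Windows cluster and its quorum.
    -- These may return no rows or stale data while the cluster service is down.
    SELECT cluster_name, quorum_type_desc, quorum_state_desc
    FROM sys.dm_hadr_cluster;

    -- Cluster members (nodes and any witness) and the votes each one holds.
    SELECT member_name, member_type_desc, member_state_desc, number_of_quorum_votes
    FROM sys.dm_hadr_cluster_members;
    ```

    If is_clustered_instance comes back 0, the instance itself is stand-alone and the errors are coming from the Windows cluster service rather than from SQL Server.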

    As to next steps, you have two options:

    1. Fix the cluster: add a witness and the secondary node. Once the cluster is fixed, you can review and restart the AG part of the process. Note: you don't have to set a secondary for automatic failover; it can be set to manual failover until you have a listener available and have tested access through the listener.
    2. Remove clustering from that node completely. Again, SQL is installed as a stand-alone instance, so removing clustering from the system will not have any effect on SQL.
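
    Before going with option 2, it may be worth checking whether any availability group is still defined on the instance, since an AG depends on the Windows cluster. A minimal sketch using the standard catalog views follows; the name YourAgName in the commented-out statement is a placeholder, not something from this thread.

    ```sql
    -- Availability groups and replicas still defined on this instance.
    SELECT ag.name AS availability_group,
           ar.replica_server_name,
           ar.availability_mode_desc,
           ar.failover_mode_desc
    FROM sys.availability_groups AS ag
    JOIN sys.availability_replicas AS ar
        ON ar.group_id = ag.group_id;

    -- Databases that still belong to an availability group.
    SELECT d.name, d.state_desc
    FROM sys.databases AS d
    WHERE d.replica_id IS NOT NULL;

    -- If an AG is left over and you are sure you no longer want it, it can be dropped.
    -- Deliberately commented out; YourAgName is a hypothetical name.
    -- DROP AVAILABILITY GROUP [YourAgName];
    ```

    If both queries return no rows, there is no AG configuration left on the SQL side and what remains is Windows-level cluster cleanup for your Windows admin.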

     

    Jeffrey Williams

  • If you set up an availability group, your apps should have been connecting to the listener, not the cluster name. SQL is not a clustered resource in an AG.
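
    To see whether a listener was ever created, and which name and port the apps should have been given, something like the following should work on the instance. This is only a sketch against the standard catalog views; no rows back would mean no listener exists.

    ```sql
    -- Listeners defined for each availability group, with their DNS name, port and IP addresses.
    SELECT ag.name AS availability_group,
           l.dns_name,
           l.port,
           ip.ip_address,
           ip.state_desc
    FROM sys.availability_group_listeners AS l
    JOIN sys.availability_groups AS ag
        ON ag.group_id = l.group_id
    LEFT JOIN sys.availability_group_listener_ip_addresses AS ip
        ON ip.listener_id = l.listener_id;
    ```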

     

    Michael L John
