This weekend I was working as usual, covering the on-call and listening to my favorite singer, Kishor Kumar, when my friend Rakesh called me in distress. One of his clients had reported that the Cluster Service on one of their clusters failed to bring a clustered role online or offline, and that the cluster kept failing back to the primary instead of moving to the secondary server.
Before proceeding further, I asked him a few of the questions that we normally ask:
- What is the version of the SQL Server?
  - SQL Server 2012
- What is the edition of the SQL Server?
  - SQL Server Enterprise Edition
- What is the version of the Windows Server?
  - Windows Server 2016 Datacenter
- Do you have the logs available for the Windows Server?
  - Yes
- Do you have the logs available for the SQL Server?
  - Yes
I asked him to email me these details. While we were discussing the issue, he suddenly excused himself and then called back. This time, he came with the consent of his customer and asked if I could connect, look at the issue, and help resolve the error: Cluster Service failed to bring clustered role online or offline.
Rakesh was my classmate and has been a friend since my college days. And this error was interesting enough that I could not resist diving in to help. I recalled that I had seen this error earlier, so I tried to find my notes on it based on the error details. The error details I saw in the cluster logs were event IDs 1205 and 1069. If you look at these event IDs, they are pretty generic: 1069 says that a cluster resource failed, and 1205 says that the Cluster service failed to bring the clustered role completely online or offline.
Luckily, the installation and configuration of these cluster nodes were pretty recent. I went ahead and checked the SQL Server installation summary and found that the setup had not completed successfully. I also ran the Cluster Validation Report, and it reported some issues. Most likely, the engineer had overlooked some of the messages.
Anyway, I did find an interesting MSDN thread in my notes that talked about this very issue. Basically, the cause is that the SQL Server setup program cannot find the SQL Server Agent resource in the cluster because it was never created, and the sqagtres.dll was missing.
FIXING CLUSTER SERVICE FAILED TO BRING CLUSTERED ROLE COMPLETELY ONLINE OR OFFLINE:
Step 1: Review and confirm that the SQL Server is functioning fine on the current primary node
Step 2: Run the repair on the secondary node and reboot
Step 3: Run the repair on the primary node and reboot
Step 4: Run the cluster validation report again; this time it came back clean
Step 5: Attempt a manual failover; it succeeded
Although the issue was fixed and the cluster started to function as expected, the fix I have described above is not the only possible one, because this error can have other causes. So, here are some additional tips that I can offer for fixing issues like this:
- Please review the SQL Server installation summary once the setup is complete
- Please make sure that you run the cluster validation report at regular intervals
- You can use the command cluster log /g to generate the cluster log in the C:\Windows\Cluster\Reports folder
- In most cases, the issue can also happen if we miss adding a resource to another node
- You may also need to check if the object(s) are created properly in the OU
- And, last but not least, it can also happen if the Cluster / Resource policies are configured in a way that prevents the failover or failback
- Most importantly, these notes were written for SQL Server 2008 R2, but they worked for me here as well. Your mileage may vary
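One of the tips above, a resource that was not added on another node, can be checked mechanically once you have the owner information in hand. Below is a minimal sketch in Python, assuming you have already collected the list of cluster nodes and the possible owners of each resource (for example from the Failover Cluster Manager, or from the Get-ClusterOwnerNode cmdlet). The function name, the input shape, and the node and resource names are all my own assumptions for illustration; nothing here talks to a real cluster.

```python
from typing import Dict, Iterable, List


def resources_missing_owners(
    cluster_nodes: Iterable[str],
    possible_owners: Dict[str, Iterable[str]],
) -> Dict[str, List[str]]:
    """Return, per resource, the cluster nodes that are NOT possible owners.

    Hypothetical helper: the inputs are plain Python collections that you
    fill in yourself from whatever tool you used to inspect the cluster.
    """
    # Compare node names case-insensitively.
    all_nodes = {node.upper() for node in cluster_nodes}
    missing: Dict[str, List[str]] = {}
    for resource, owners in possible_owners.items():
        owner_set = {owner.upper() for owner in owners}
        gap = sorted(all_nodes - owner_set)
        if gap:
            # This resource cannot be hosted on the nodes listed in gap.
            missing[resource] = gap
    return missing


if __name__ == "__main__":
    # Hypothetical node and resource names, for illustration only.
    nodes = ["NODE1", "NODE2"]
    owners = {
        "SQL Server": ["NODE1", "NODE2"],
        "SQL Server Agent": ["NODE1"],  # NODE2 is not a possible owner
    }
    for resource, gap in resources_missing_owners(nodes, owners).items():
        print(f"{resource}: not a possible owner on {', '.join(gap)}")
```

An empty result only means every resource lists every node as a possible owner; it does not rule out the other causes mentioned above, such as a failed setup or restrictive failover policies.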
I am pretty sure you would be interested in browsing through more of the troubleshooting and HA & DR tips. And do not forget to let me know how you liked this post.