Intermittent issue with cluster failover when SQL Agent service is restarted

  • Hi everyone,

    Some background:

    Very recently, we upgraded both our SQL cluster and the servers that house it. We're now on SQL Server 2016 and Windows Server 2016. I'm the 'database administrator', though not a real one, because I don't do the upgrades or make many changes to our setup. I'm basically here keeping the ship afloat, while Rackspace and our app developers set up the servers and clusters and make other changes through patch applications. I don't have a background in permissions, networking, or OS administration.

    So, I can't say exactly what happened for sure during the upgrade but as far as I know, Rackspace set up this new cluster and new servers 'exactly the same' (I'm sure there were some minor differences that naturally come with new SW/OS). Things seem to be working well so far, except we're having an odd issue I can't really figure out.

    The issue:

    This issue occurs when stopping/starting the SQLAgent service, but it only occurs intermittently. Traffic/load doesn't seem to have any impact. When we stop/start, upon 'starting' the service, we see the cluster fail over to the other server. The error immediately presented is 'Service Dependency Failure (ObjectExplorer)'. In Event Viewer, there's one error event for the service, and it is event ID 324, message 'OpenCluster (reason:5)'. I see no obvious reference to this event in the SQLAgent log on disk.

    All of my research suggests that this is a permissions issue. However, it's hard to pinpoint what that would be (and hard for me to believe outright) because the issue is so intermittent. We could go 5 restarts without the issue and then have it happen on the next one (not that 5 restarts would be normal, but that's what I found in testing after hours). Do the logins for the MSSQLSERVER service and the SQLAGENT service need to be the same? They are currently two different users, but as far as I know, both have administrative privileges.

    Anyone have any thoughts on this? I appreciate the help in advance.

  • How are you stopping/starting SQL Server Agent - and why is that something that needs to be done on a regular basis?

    If you are using the services applet to stop/start the service, that is why you see a cluster failure and failover. The service should be managed by the cluster - stopping/starting the service in that manner tells the cluster manager that the service failed - and the cluster manager then fails over to the other node(s).

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll


    This occurred when I restarted via both the SSMS UI and services.msc.

    Restarting the service isn't going to be a regular thing - this was discovered during a restart after the server move. That restart was made because I set up the DB mail and changed some SQL Agent properties related to that. Even if this isn't a regular occurrence, we feel this shouldn't be happening.

    Are you saying the stop/start should be done strictly in the SQL Server 2016 Configuration Manager app?

    In a cluster setup (FCI), any services should be handled through Failover Cluster Manager. Starting/stopping services with either SSMS or SSCM (Configuration Manager) will cause the cluster to believe the service has failed and trigger a failover. The same goes for starting/stopping the service in services.msc.

    Alan H
    MCSE - Data Management and Analytics
    Senior SQL Server DBA

    Best way to ask a question: http://www.sqlservercentral.com/articles/Best+Practices/61537/
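
For anyone who would rather script the restart than click through Failover Cluster Manager, here is a minimal sketch of the approach described in the replies above: take the Agent's cluster resource offline and bring it back online, instead of touching the Windows service directly. It only builds and (optionally) runs the `Stop-ClusterResource` and `Start-ClusterResource` cmdlets from the FailoverClusters PowerShell module. The resource name `SQL Server Agent` is an assumption (it is the usual name for a default instance), and the helper names are hypothetical, so check the actual resource name on your cluster first.

```python
import subprocess

# Assumption: resource name for a default instance; a named instance will
# usually look different. Verify it in Failover Cluster Manager.
AGENT_RESOURCE = "SQL Server Agent"


def cluster_restart_commands(resource_name=AGENT_RESOURCE):
    """Build the PowerShell command lines that take a cluster resource
    offline and then bring it back online (hypothetical helper)."""
    quoted = resource_name.replace("'", "''")  # escape quotes for PowerShell
    cmdlets = ("Stop-ClusterResource", "Start-ClusterResource")
    return [
        ["powershell.exe", "-NoProfile", "-Command", f"{cmdlet} -Name '{quoted}'"]
        for cmdlet in cmdlets
    ]


def restart_agent_via_cluster(resource_name=AGENT_RESOURCE, dry_run=True):
    """Print the commands (dry run) or execute them on a cluster node."""
    for argv in cluster_restart_commands(resource_name):
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops after the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    restart_agent_via_cluster(dry_run=True)
```

With `dry_run=True` nothing is changed and the two commands are only printed, which makes it easy to review them before running with `dry_run=False` on a cluster node after hours.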
