cluster question

  • First, I apologize for the double post. I originally asked this question in the "General" forum before I realized there was a "High Availability" forum and I didn't see any way to *move* my question.

    I have a two node SS 2008 cluster. I regularly test the failover by using Cluster Administrator to move the groups from one node to the other with no issues.

    Yesterday, I powered off the passive node to upgrade memory and my users complained that active connections were killed and they were unable to establish new connections for a few seconds. Based on the SQL logs, it does not appear the services restarted.

    My understanding and past experience is that powering down the passive node should have had no effect on the active node. Does anyone have any suggestions what may have happened or what I should look at to try to determine the cause? (The Windows logs didn't shed any light on this, at least not to me.)

    Thanks!

  • Check the cluster logs, usually in C:\Windows\Cluster.

  • Thanks for the suggestion Chrissy. I haven't had much reason to troubleshoot our cluster in the past so I was not aware of this log.

    The log on the primary node (which is where the problem occurred) only goes back 2 days and this happened Tuesday afternoon, so that's a bummer. Interestingly, though, the log on the secondary node goes back like 6 months. I couldn't find anything useful during the time frame I was interested in - until I discovered this log is in GMT instead of local time. I guess they do this for geographically dispersed clusters?

    Anyway, I can't tell if the log entries are out of the ordinary since I've never looked at this log. I'll need to play around with our dev cluster and see what the logs look like.

    Good to know about this log - thanks again!
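Since the cluster log is stamped in GMT rather than local time, it can help to convert the local incident time to a UTC search window before scanning the log. A minimal sketch in Python follows; the function name, the 15-minute margin, the UTC-5 offset, and the example date are all assumptions for illustration, not details from this thread.

```python
from datetime import datetime, timedelta, timezone

def local_window_to_utc(local_time, utc_offset_hours, margin_minutes=15):
    """Return (start, end) in UTC for a window around a local wall-clock time."""
    # Attach a fixed offset to the naive local timestamp (hypothetical helper).
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    aware = local_time.replace(tzinfo=local_tz)
    utc_time = aware.astimezone(timezone.utc)
    margin = timedelta(minutes=margin_minutes)
    return utc_time - margin, utc_time + margin

# Assumed example: an incident at 14:30 local time on a server running at UTC-5.
start, end = local_window_to_utc(datetime(2010, 1, 5, 14, 30), utc_offset_hours=-5)
print(start.strftime("%Y/%m/%d %H:%M"), "to", end.strftime("%Y/%m/%d %H:%M"))
```

A fixed offset is used to keep the sketch dependency-free; if the incident falls near a daylight saving change, the offset in effect on that date is the one to pass in.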

  • Hey Henry,

    You're right; normally this should not cause any outage. It is called the passive node for a reason.

    Is this a Windows 2008 or 2003 cluster? Are you certain that all of your resources were owned by the "Active" node at the time? In Windows 2003, if the OS/cluster quorum was on the "Passive" node, then the dependency that the SQL resources have on it may have caused your issue.

  • My guess is the passive node was actually the active node. If the service didn't restart, you need to find out why the group did not fail over correctly.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Bradley, thanks for the reply. This is a Win 2003 Cluster. I am completely certain all three groups - the cluster / quorum group, the MSDTC group, and the SQL group - were owned by the active node.

    Perry, thanks also for the reply, but that was not the case. Additionally, the concern here is not that a failover did not take place - indeed, there was no reason for a failover to occur. All resources were owned by the active node and it was the passive node that was powered down. Users experienced a brief period of dropped connections, then shortly thereafter were able to connect again. I could understand the loss of connectivity if the services had restarted, but that was not the case - the SQL services never stopped, as evidenced by the logs. It makes sense that the SQL services never stopped, as there was no failover. I just can't explain the brief period of dropped connections.

  • Hi Henry,

    Did the connections come back up as the other server started to come back online from a reboot? Is this a prod server? Do you have any way of testing (after hours, if prod) by shutting down the passive node to see if you can repeat the network drop against the active node?

  • Hi Brad,

    Unfortunately, I can't say for sure exactly when the connections dropped and were subsequently able to be re-established. I was in our server room and it wasn't until I returned a while later that I got the feedback from users.

    It is a production server, so I won't be able to test it until our maintenance window later this month. I did however attempt to repeat this on a dev cluster which is a similar build - Win 2003, SS 2008 - and was unable to reproduce the issue. I completely powered off the passive node with no interruption at all to the active node (which of course I am ultimately happy about!)

    I will test during our maintenance window later this month and post back the results.

    Thanks again.
