May 26, 2010 at 11:16 am
Running a 2-node (disk-only quorum), multi-instance cluster (active-active).
Failover is successful when testing with server reboots.
Manual failover is successful.
The "Cluster Group" and SQL groups fail when the switch is powered down; this should trigger an automatic failover but does not.
Disabling all NICs to force a failover is successful.
Disabling only the public NIC is unsuccessful.
System Log events: 1126, 1127, 1129, 1130
Failover Clustering log events: 1566, 1538, 1537, 1281, 1280, 1204, 1203, 1201, 1200, 1153, 1132, 1131, 1128, 1125, 1062
Testing by modifying the NIC negotiation settings (Auto, 10/Full) for the heartbeat has also been unsuccessful.
The heartbeat is a direct connection between the nodes via a crossover cable (i.e., no switch in between).
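For reference, this is roughly how I've been checking group state and pulling logs after each test (cluster.exe from an elevated command prompt; the SQL group name is just an example from my setup, so substitute your own):

rem List all cluster groups with their owning node and current status
cluster group

rem Generate cluster.log on every node (written to %windir%\Cluster\Reports)
cluster log /g

rem Manually move a group to the other node to confirm manual failover still works
cluster group "SQL Server (MSSQLSERVER)" /move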
Anyone seen a similar issue or have any ideas?
May 26, 2010 at 12:57 pm
I believe this is the crux of the issue, although I cannot figure out why. I've tried all duplex modes and transfer rates without success:
Resource             Group          Node   Status
-------------------- -------------- ------ -------
Cluster IP Address   Cluster Group  %1     Failed
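(For anyone wanting to check the same thing: that table is the cluster.exe resource listing. Something like the following should show it; the exact command is an assumption here, since only the output is pasted above.)

rem List all cluster resources (Resource, Group, Node, Status), or name one to show just that resource
cluster res
cluster res "Cluster IP Address"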
May 26, 2010 at 1:15 pm
Windows 2003 or Windows 2008 cluster?
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
May 26, 2010 at 1:16 pm
2008
November 2, 2011 at 5:37 am
I know this thread has been dormant for a long time now, but just in case it helps anyone...
I've been having the exact same problem as outlined here on a new system I'm building, and many hours of searching turned up very little to help. I then happened to try a test failover of my SQL cluster after a day or so of inactivity - and it worked.
That has since led me to discover that there's a set of failover parameters for the cluster hiding in the Failover Cluster Manager console.
If you go to <MSCS cluster name>/Services and Applications, right-click on <SQL cluster name> and choose Properties from the context menu, you should see two tabs - General and Failover.
If you click on the Failover tab, there are two configurable parameters:
Maximum failures in the specified period and Period (hours)
The failures parameter defaults to (n-1), where n is the number of nodes in your cluster. The period was set to 6 hours in my case, which I think is the default, but your mileage may vary. More than (n-1) failures within the period will put the clustered service into a permanently failed state, to stop a problematic service endlessly bouncing between cluster nodes.
What this meant in my case, having a 2-node cluster, was that the cluster would only automatically fail over once in any 6-hour period. Any further failures in that time would put the cluster in a state of permanent failure. Manually moving the service between nodes doesn't count as a failure, so can be done at any time.
As this cluster is not live yet, I've since changed the options to allow 10 failures every hour for testing, and I can now perform an automatic test failover from one node to the other simply by disabling the Windows-facing network interface, as long as there are no more than 10 test failures in an hour. So it looks like my cluster was actually behaving itself all along!
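If you'd rather script it than click through the console, those two settings correspond to the FailoverThreshold and FailoverPeriod group properties, so something like this should do the same job from cluster.exe (the group name is just an example; substitute your own SQL group):

rem Show the current failover policy for the group (includes FailoverThreshold and FailoverPeriod)
cluster group "SQL Server (MSSQLSERVER)" /prop

rem Allow up to 10 failures in any 1-hour period while testing
cluster group "SQL Server (MSSQLSERVER)" /prop FailoverThreshold=10
cluster group "SQL Server (MSSQLSERVER)" /prop FailoverPeriod=1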
Hope that helps someone.
Cheers
Graham