Cluster Failover Testing

  • I was shown these two articles related to testing cluster failure and failover:

     

    http://technet2.microsoft.com/WindowsServer/en/library/20e7a35f-2477-4f9d-acb9-5146c92152211033.mspx?mfr=true

    http://msdn2.microsoft.com/en-us/library/ms189117.aspx

     

    Was just wondering if anyone has other tests that they have performed, or would that just be overkill? We're using SQL Server 2000 on Windows Server 2003.

  • Aside from the testing with the Cluster Administrator as outlined in TechNet, we ALWAYS perform the following tests after clustering and SQL Server are configured and installed:

    • Both nodes up - On the 'active' node perform a shutdown, no restart - things should fail over to the 'passive' node.
    • Both nodes up - On the 'passive' node perform a shutdown, no restart - things should remain 'active' on the active node.
    • Both nodes up - On the 'active' node pull the power cord (YUP, that is what happens when a power supply fails) - things should fail over to the 'passive' node.
    • Both nodes up - On the 'passive' node pull the power cord (YUP, again) - things should remain 'active' on the active node.
    • Both nodes up - On the 'active' node pull the network cable out (YUP, that is what happens when your primary switch fails) - things should fail over to the 'passive' node.
    • Both nodes up - On the 'passive' node pull the network cable (YUP, again) - things should remain 'active' on the active node.
    • Both nodes up - Unplug the crossover cable - both nodes should remain up, but the Cluster Administrator will complain.

    Now if your shared storage is on a SAN and you have a 'solid' environment:

    • Both nodes up - On the 'active' node pull the fiber cable out (YUP, that is what happens when your primary SAN switch fails) - things should fail over to the 'passive' node.
    • Both nodes up - On the 'passive' node pull the fiber cable (YUP, again) - things should remain 'active' on the active node.
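
    The tests above can also be written down as a small checklist, so that each run is recorded against its expected outcome. What follows is only a sketch: the names FailoverTest, TESTS, expected_owner and check, and the node names NODE1 and NODE2, are made up for illustration. It does not talk to the cluster; the owner after each test is still read off Cluster Administrator by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverTest:
    target: str            # where the fault is injected
    fault: str             # what is done to it
    expect_failover: bool  # True: resources should move to the other node


# The nine tests from the lists above, in the same order.
TESTS = [
    FailoverTest("active", "shutdown, no restart", True),
    FailoverTest("passive", "shutdown, no restart", False),
    FailoverTest("active", "pull power cord", True),
    FailoverTest("passive", "pull power cord", False),
    FailoverTest("active", "pull network cable", True),
    FailoverTest("passive", "pull network cable", False),
    FailoverTest("heartbeat", "unplug crossover cable", False),
    FailoverTest("active", "pull fiber cable (SAN only)", True),
    FailoverTest("passive", "pull fiber cable (SAN only)", False),
]


def expected_owner(test, active="NODE1", passive="NODE2"):
    """Return the node that should own the resources after the test."""
    return passive if test.expect_failover else active


def check(test, owner_after, active="NODE1", passive="NODE2"):
    """Compare the owner observed by hand with the expected one."""
    return owner_after == expected_owner(test, active, passive)


if __name__ == "__main__":
    for number, test in enumerate(TESTS, 1):
        print(f"{number}. [{test.target}] {test.fault} "
              f"-> resources on {expected_owner(test)}")
```

    Both nodes must be brought back up, and the resources moved back if needed, before the next test starts; the sketch assumes every test begins from the same 'both nodes up' state.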

    Some might call it 'overkill'; we prefer to call it 'due diligence'!

    The 'stuff' above is what I prefer to call 'fun'. But you need to make it educational while you do it. To accomplish this, examine the System, Application and Security event logs on both nodes before and after each test. By doing this you will be one up on diagnosing a real problem when it occurs. Also, do not forget to examine the actual cluster log located at C:\Windows\Cluster\cluster.log. This file wraps like a transaction log, so older entries get overwritten. Note too that its timestamps are in GMT.
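
    Because the cluster log is in GMT while the event logs are usually read in local time, lining the two up takes a conversion. The sketch below is one way to do it, and it rests on an assumption: that each entry carries a stamp shaped like "::YYYY/MM/DD-HH:MM:SS.mmm" after the process and thread IDs. Check that against your own cluster.log first. The sample line is made up, and parse_gmt and lines_between are illustrative names.

```python
import re
from datetime import datetime, timezone

# Assumed entry shape: "<pid>.<tid>::YYYY/MM/DD-HH:MM:SS.mmm <text>"
STAMP = re.compile(r"::(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s")


def parse_gmt(line):
    """Return the entry's timestamp as an aware UTC datetime, or None."""
    match = STAMP.search(line)
    if match is None:
        return None
    naive = datetime.strptime(match.group(1), "%Y/%m/%d-%H:%M:%S.%f")
    return naive.replace(tzinfo=timezone.utc)


def lines_between(lines, start_utc, end_utc):
    """Yield (local_time, text) for entries inside the test window."""
    for line in lines:
        stamp = parse_gmt(line)
        if stamp is not None and start_utc <= stamp <= end_utc:
            # astimezone() with no argument converts to the machine's zone.
            yield stamp.astimezone(), line.rstrip("\n")


# Made-up sample entry, only to show the assumed format.
SAMPLE = [
    "00000a5c.00000b1c::2007/03/15-14:22:31.123 INFO example entry\n",
    "a line without a timestamp\n",
]

if __name__ == "__main__":
    start = datetime(2007, 3, 15, 14, 0, tzinfo=timezone.utc)
    end = datetime(2007, 3, 15, 15, 0, tzinfo=timezone.utc)
    for local_time, text in lines_between(SAMPLE, start, end):
        print(local_time.isoformat(), text)
```

    To use it for real, note the GMT start and end time of each test and pass in the lines of a copy of cluster.log instead of SAMPLE.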

    Regards
    Rudy Komacsar
    Senior Database Administrator
    "Ave Caesar! - Morituri te salutamus."

  • This is great information! Thanks Rudy!
