Failover testing Clustered SQL2000 with storagewor

  • Hi.

    We are running clustered SQL2000 Enterprise on two Win2K Advanced Servers. The data is held on a Compaq Storageworks SAN, and each server has two fiber connections to the SAN for resilience. Our cluster failover tests all worked well until we simulated a loss of connection to the Storageworks disks by disconnecting the fiber connections to one of the servers. Instead of failing over to the other server, we got a series of hard disk write failure messages. When we manually moved the cluster group to the other server, SQL Server would not start because the master database was corrupted.

    When questioned, our reseller theorised that when we pulled the fiber connections from the server there was still traffic in the fiber cable, in the form of photons of light, and that this caused the corruption.

    I am not convinced by this answer as I would have thought that error checking algorithms in the fiber comms would prevent this happening.

    I would like to know:

    Has anyone successfully failed over a SQL2000 cluster when the server’s connection to the storage is cut?

    Does anyone have a better theory on why the master database was corrupted?

    Thanks

  • Gotta love the photons in the cable. Might be true I suppose...

    Andy

    http://www.sqlservercentral.com/columnists/awarren/

  • Well, I think it is a load of crap personally. I have disconnected fibre channel arrays from the server with no issues. That is a VERY poor answer. I don't know why it didn't fail over the way it was expected to. I do know you need to have your reseller contact Compaq or Storageworks directly and get the scoop from them. I'm sure they would be shocked to hear that it failed in that manner. Heck, Compaq doesn't even support clustering on anything but fibre channel anymore.

    Wes

  • Wes

    The hard disks in the cluster did fail over to the other server. However, the MSSQLSERVER service associated with the cluster group would not come up again because the master database was corrupted.

    Thanks

    Stephen Blanchard

  • Can you see the master database physical files from the second node with Windows Explorer?
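
    One way to go beyond eyeballing the files in Explorer is to try reading them end to end from the second node. Below is a minimal sketch in Python, assuming the SQL Server service is stopped (otherwise the files are locked). The paths are hypothetical placeholders for wherever the instance keeps its data on the shared disk, and the function name check_readable is made up for illustration. A clean read only shows the file is readable at the OS level, not that the database pages inside it are consistent.

```python
import os


def check_readable(path, chunk_size=1024 * 1024):
    """Try to read the whole file once; return (ok, detail)."""
    if not os.path.isfile(path):
        return False, "missing"
    total = 0
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except OSError as exc:
        # Any I/O error from the disk or the path to it shows up here.
        return False, f"read error: {exc}"
    return True, f"read {total} bytes"


if __name__ == "__main__":
    # Hypothetical paths: substitute the shared-disk location of your instance.
    for p in [r"S:\MSSQL\Data\master.mdf", r"S:\MSSQL\Data\mastlog.ldf"]:
        print(p, check_readable(p))
```

    If a read fails partway through, that points at the storage path rather than at SQL Server itself.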

  • Were you using multi-path software such as Compaq Securepath to shield the o/s from the dual paths to the storage array?

    I am building a similar setup using a Compaq EVA, so the question is a little too close for my liking. I will let you know how I go in the coming weeks; failover testing is part of our test roll-out as well.

    Regards, Brian

  • To answer Allen_Cui: yes, we could see the master database file from the other server, but it was corrupted. Database consistency checks would not run on it, and the rebuildm tool, which tries to rebuild the master database, also failed.

    Brian's question about Securepath was a good one. Yes, we do use Securepath on all the SAN-attached servers. On a separate note, I heard a rumour this week that the very latest version of Securepath (ver 4?) can bluescreen Win2K servers on Service Pack 3. We use version 3.1 without any problems.

  • We are running the same config on a Compaq EVA.

    If I drop a fibre connection, the shared drives remain online.

    No errors noticed as yet in the SQL log, cluster log, application event log or system event logs.

    What version of Securepath are you using? It must be 4.x for EVA.

    There is a known issue with Oracle 9i running with Securepath.

    Are you running a SQL virtual cluster or a standalone SQL on a single node?

    Regards, Brian

    contiguous1@Notmail.com
