SQL 2000 Failover Clustering Logic?

  • I am a relative newbie, so forgive me.

    I have been reading some material on how to set up a SQL Server 2000 failover cluster. The material describes two SQL Server 2000 servers and a shared storage device. The logic is that if one server fails, the other will take over. This sounds great for a server machine failure, but what about the shared storage device? The descriptions I've read only illustrate one shared device. Wouldn't someone still be just as much at risk of hardware failure? I understand that shared storage devices require system boards, controller cards, power supplies, etc. This sounds like just another server to me. How does adding two more servers reduce this risk if I am only using one shared device? What am I missing? Is my assumption that a shared storage device is simply another server with more disks incorrect? Please educate this newbie. I am trying to decide on a course of action for a database-oriented website. Thanks 🙂

  • A cluster is two or more servers with one or more shared storage devices. We have a DELL cluster. Our two servers are 8450's and we have two shared storage devices; one is an array of 14 x 36G hard drives and the other is an array of 14 x 73G hard drives. You are kind of correct in saying the shared storage is another server, but where you are missing the point is that they are 'virtual' servers. My shared storage doesn't have any RAM, doesn't have any processors, etc. They are just a set of hard drives and RAID controllers. My shared storage is 'seen' by both nodes of the cluster.

    So, here is what happens:

    I have two nodes, SQL1 and SQL2; these are 'true' servers. I have two shared 'virtual' servers, which are really two arrays of hard drives. Both nodes see the shared storage. Let's say SQL1 is the primary node...it sees and controls the virtual servers, and SQL2 can't do anything with them (this is active/passive mode). When SQL1 fails, SQL2 immediately sees and takes control of the virtual servers. Unlike the C and D drives of the 'true' servers, the shared storage doesn't belong to a specific server.

    Hope that explains it a bit better for you.

    -SQLBill

  • Your points are well taken. However, I still wonder about the shared storage device. Isn't there hardware failure risk (i.e. controller cards, etc.)? How do you protect yourself from this? Consider the following two scenarios:

    1. A single SQL Server machine with a RAID-5 disk array

    2. Two clustered SQL Server machines with a separate RAID-5 shared storage device.

    In both cases there is RAID-5 disk failure protection.

    In the case of scenario 1, if there is a controller card failure, you are down. There is no failover to protect you.

    In the case of scenario 2, if there is a controller card failure in the shared storage device, you are down. There is no failover to protect you. Granted, you have failover protection if one of the SQL Server machines goes down. But it seems that you don't have adequate protection for the shared storage device.

  • Yes, you do technically have a single point of failure at the shared storage. Some of that risk is mitigated by using multiple HBAs (fibre cards) to the SAN. For our clusters we normally run 2 physical cards in each machine.

    Most SANs are heavily redundant on their own, so the risk of failure is pretty remote.

    KlK, MCSE

  • My shared storage has two RAID controllers; each handles half of the drives. If one fails, the other is supposed to take over. I can come up with lots of scenarios where you will have a failure and not be able to recover. You could create an off-site cluster, built exactly the same way as your on-site one, and mirror all your data to it. But what happens if both experience the exact same problem, or a tornado/hurricane/earthquake/fire/whatever hits both sites at the same time?

    We had a RAID controller failure and the other RAID controller couldn't pick up the slack. Initially it looked like we had 3 hard drives from our 14x36GB shared storage array fail all at the same time. We figured the odds of that were slim, and finally narrowed it down to the one RAID controller. Replaced that and were back up and running.

    About the only way to plan for total failure is to make backups, but what happens when your backups fail?

    -SQLBill

  • In most of the clusters and SANs I have seen, the disk controllers are redundant as well, which should eliminate a controller failure.

    Russ


  • Rustroot1,

    I believe you missed part of my post...we do have two RAID controllers; yes, they are SUPPOSED to be redundant. But we lost one and the other could not accept the full load, so our system crashed. Redundancy isn't always redundant.

    -SQLBill

  • What do you guys suggest....

    I am debating the following two configurations:

    1. Two SQL Servers, each with two 36GB mirrored disks. One of these servers will be the primary database server which will use "log shipping" to the second "warm standby" server. This will give me redundancy but will provide only manual failover. While the failover is not automatic, I would still be in business in a matter of approximately 15 minutes.

    2. Two SQL Servers in an active/passive cluster connected to a RAID NAS with two RAID controller cards and a SCSI array of four 36GB disks. This will provide me with both redundancy and nearly immediate failover protection. However, if anything goes wrong with the RAID NAS, I am out of business for potentially hours.

    I am on a somewhat limited budget and investing in a SAN is not feasible at this time. Consequently, NAS options are all I can consider for the time being.
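    For what it's worth, the manual log-shipping flow described in option 1 can be sketched in SQL Server 2000 T-SQL roughly like this (the database name WebDB and the file paths are placeholders I've made up for illustration, not anything from this thread):

    ```sql
    -- On the primary: back up the transaction log on a schedule
    -- (e.g. every 5-15 minutes via a SQL Agent job).
    BACKUP LOG WebDB
       TO DISK = 'E:\LogShip\WebDB_log.trn'
       WITH INIT;

    -- Copy the .trn file to the standby server, then restore it there
    -- WITH STANDBY so later log backups can still be applied:
    RESTORE LOG WebDB
       FROM DISK = 'E:\LogShip\WebDB_log.trn'
       WITH STANDBY = 'E:\LogShip\undo_WebDB.dat';

    -- At failover time, recover the standby copy and bring it online:
    RESTORE DATABASE WebDB WITH RECOVERY;
    ```

    The failover step is what makes this "manual": someone has to notice the outage, run the final RESTORE, and repoint the clients.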

  • Here are my thoughts on both of your options...

    1. "While the failover is not automatic, I would still be in business in a matter of approximately 15 minutes." This is true ONLY if you are immediately aware of the failure and available to do something about it. I will admit that I'm not familiar with log shipping, but the bottom line is that you are replicating all the data to a second server. That's great if you have a hardware failure, but what happens if you have a virus, worm, etc.? How fast can you catch it? Before it replicates? If you set your log shipping so that you can possibly catch things like that (maybe twice a day), that's great, but you could possibly lose half a day's data. Is that acceptable? Also, once you bring the standby server online, you might have to have all your clients redo their ODBC connections, which might slow down your recovery. So: possible loss of data and possible delay in becoming operational.

    2. In this method, all of the data is immediately available unless a hardware/software issue takes down the WHOLE cluster. If that happens, possible data loss, definite delay in becoming operational.

    Method 1 solves the issue for a hardware failure. The only issue is getting your source and your clients to recognize the newly operational database.

    Method 2 helps keep a hardware failure from being critical...normally if one part fails the system fails over and your client doesn't even notice anything. Everything remains up and running while you repair the failure.

    I prefer method 2, and that is what we are using (but with SAN storage and more robust equipment).

    The best of all worlds would be an active/passive cluster replicating to a standby server. But who has the money for that?

    -SQLBill

  • SQLBill,

    I agree with your comments.

    With respect to option 1, I had planned to log ship to the standby server every 5 minutes and to create a backup to another location a couple of times each day to protect against viruses/worms/etc. Regarding the client connection problem, you have a point there. This is an issue.

    With respect to option 2, I agree that it is the preferred option. However, what do you do in the event of viruses/worms/etc.? It would seem that you have the same problem that you do in option 1.

    Any thoughts? Is it simply a backup issue? What backup procedures do you follow?

  • As I said, I'm using an active/passive cluster. I am doing some kind of backup every hour. Once a day, I do a full backup, then every 4 hours I do a differential backup (except for the hour that the full backup is done), and I do transaction log backups every hour except when a differential or full backup is being done. This gives me (I believe) the best ability to recover the database to a specific point in time.
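    A staggered regimen like that can be set up as three SQL Agent jobs along these lines (SQL Server 2000 T-SQL; the database name WebDB and the paths are placeholders of mine, not the poster's actual configuration):

    ```sql
    -- Once a day: full backup.
    BACKUP DATABASE WebDB
       TO DISK = 'E:\Backup\WebDB_full.bak' WITH INIT;

    -- Every 4 hours (skipping the full-backup hour): differential,
    -- which captures everything changed since the last full backup.
    BACKUP DATABASE WebDB
       TO DISK = 'E:\Backup\WebDB_diff.bak' WITH DIFFERENTIAL, INIT;

    -- Every hour (skipping full/differential hours): transaction log.
    BACKUP LOG WebDB
       TO DISK = 'E:\Backup\WebDB_log.trn' WITH INIT;
    ```

    The point of the mix is restore time: full + latest differential + the hourly logs since it, rather than replaying a whole day of logs.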

    I have had a MAJOR crash that took six days to recover from (actually a month; read on)....

    It took us three days to recover from the hardware failure; our vendor's help desk treated us as though we didn't know anything. You know, the dumb-user type. We were showing that three of the fourteen hard drives in our array had failed all at once. We believed there was something else wrong. On the third day of trying to solve this, the vendor admitted it was a RAID controller. It turned out to be both a RAID controller and an LS module. It appears the two together kept the array from properly failing over (the bad LS module kept the good RAID controller from being able to take over all the drives). Then chkdsk damaged our data. Then we found out that Veritas BE 8.6 does not work with SQL Server 2000 under SPECIFIC circumstances, which we met. So our backup tapes no longer worked, and applying the patch is not retroactive. So, six days after the crash we were operational again, but without the original data. We had to send the tape to Veritas for recovery. As of yesterday, we have all of our data recovered 100%.

    Lesson learned: you can't prepare for everything. We had a truly catastrophic failure, but we managed to recover from it. Would it have helped if we had a standby server? Yes. But my client isn't going to spend the money for that. Plus, to be really failsafe, a standby server needs to be in an off-site location (in case the building is destroyed, for example the World Trade Center). Except for that one failure, where just about everything that could go wrong did, we have had several other failures and they were not noticeable to the client. The cluster failed over and kept on going as it was designed to do.

    As for backups, we have the patch applied, so all the backups should work. However, I have changed the way I do them. I suggest that anyone using SQL Server use the 'native' SQL Server backup method: back up to the hard drive and then use the tape software to back up the .bak file to tape. It takes longer but is worth it. If you have the ability, the BEST backup is to put the database in single-user mode and copy the .mdf and .ldf files to tape.
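    To illustrate the recovery side of a full/differential/log regimen like the one described earlier in the thread, a point-in-time restore in SQL Server 2000 looks roughly like this (WebDB, the paths, and the timestamp are all placeholder values):

    ```sql
    -- Restore the full and differential WITH NORECOVERY so that
    -- transaction logs can still be applied on top of them:
    RESTORE DATABASE WebDB
       FROM DISK = 'E:\Backup\WebDB_full.bak' WITH NORECOVERY;
    RESTORE DATABASE WebDB
       FROM DISK = 'E:\Backup\WebDB_diff.bak' WITH NORECOVERY;

    -- Apply the log backup(s) up to a specific moment, e.g. just
    -- before a virus or bad delete hit, then bring the DB online:
    RESTORE LOG WebDB
       FROM DISK = 'E:\Backup\WebDB_log.trn'
       WITH RECOVERY, STOPAT = '2003-06-01 14:30:00';
    ```

    STOPAT is what gives the "specific point in time" recovery mentioned above; without the log backups you can only recover to the last full or differential.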

    To restate the above, build it however you feel best suits your need BUT make sure your backups can cover you no matter what happens. And consider storing your backups off-site in case of a fire, etc.

    -SQLBill

  • SQLBill,

    Wow! I think I would have a meltdown under the circumstances you described.

    Your advice is well noted. I believe I will opt for the active/passive cluster with a shared storage device (NAS instead of SAN due to budget). Additionally, I will implement a rigorous and staggered backup regimen.

    Thank you all for your input 🙂

  • Just be aware that MS recommends NOT using NAS for data files, for performance reasons. SANs might be prohibitively priced, but NAS has some pretty serious limitations.

    I am a big believer in clusters, but I'm a big believer in performance too, and since disk subsystems tend to be a bottleneck under normal circumstances, I would be very hesitant to use an inherently slow one for my databases.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • One option is to use Microsoft Network Load Balancing on the front end to route requests to multiple machines. If one goes down, requests will be routed to the remaining server(s). As far as keeping them in sync, I would say that log shipping is not a bad idea for an "active/passive" scenario, or you could use transactional replication. The problem with both of these is that a DB restore from the backup server to the primary is required before you switch back. You might also take a look at merge replication between servers. This works well in an "active/active" scenario, but requires GUIDs on all replicated tables.
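    That GUID requirement means every table published for merge replication needs a uniqueidentifier column marked ROWGUIDCOL. If a table lacks one, you can add it up front along these lines (Orders is a made-up table name for illustration):

    ```sql
    -- Merge replication tracks rows by a GUID; NEWID() fills the
    -- column for existing and future rows.
    ALTER TABLE Orders
       ADD rowguid UNIQUEIDENTIFIER ROWGUIDCOL
       NOT NULL DEFAULT NEWID();
    ```

    This is a schema change (and extra 16 bytes per row) that the log-shipping and transactional-replication options don't impose, which is worth weighing in the comparison.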

    I've never been a fan of clustering...you still have a single point of failure: the SAN/disk arrays. You can achieve a similar level of redundancy without a cluster by just using a standard RAID (1/5) setup. In my experience, 95%+ of all failures happen in the disk array, not the CPU, not the power supplies, not the NIC, not the controller cards. It's interesting that failure usually happens in the most critical component: the disks. You can buy a new NIC, but you can't buy new data.

    -Dan

  • How difficult would it be to cluster two SQL Servers in an active/passive mode without a separate storage device? For example, have a RAID-5 array in each SQL server and have the active server log ship data to the passive server or use transactional replication to the passive server? Assuming that there is an effective, staggered backup plan in place, what would be the implications of such a configuration?
