May 25, 2005 at 2:18 pm
I was just given the honorable duty of researching the ins and outs (pluses and minuses) of SQL clustering. As a quickly growing software development firm, this will be the "hot" topic on our radar for FY 2006.
Can anyone point me in the direction of an article or reference on the advantages/disadvantages of clustering with regard to:
1. maintainability
2. initial effort
3. cost
4. options
Any help would be appreciated. Thanks.
May 25, 2005 at 2:44 pm
Sorry, I can't direct you to an article, but I can give you a quick overview.
SQL Server Clustering (and MSCS in general) can be very complicated and you can actually decrease your uptime numbers quite easily (especially if you don't set up everything very well in the first place).
Where clustering really comes in handy is when you have a catastrophic failure of the underlying server hardware. In an emergency, you just can't beat running on Node B while Node A is being repaired by the field engineer.
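As a rough illustration of that point (the virtual server name SQLVIRT01 and the Python/pyodbc client are just placeholders I'm assuming, and ComputerNamePhysicalNetBIOS needs a reasonably recent build if I recall correctly), you can ask the instance itself whether it's clustered and which physical node is currently hosting it:

# Sketch only: SQLVIRT01 is a placeholder virtual server name; assumes a pyodbc client.
import pyodbc

conn = pyodbc.connect("DRIVER={SQL Server};SERVER=SQLVIRT01;Trusted_Connection=yes")
row = conn.cursor().execute(
    "SELECT CAST(SERVERPROPERTY('IsClustered') AS int) AS is_clustered, "
    "       CAST(SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS sysname) AS current_node"
).fetchone()
print("Clustered:", row.is_clustered, "| currently hosted on node:", row.current_node)
conn.close()

After a failover you'd see the same virtual server name answering, but current_node would report the other physical machine.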
Not to complicate matters, but I would look into VMware (especially their ESX Server with VMotion). It lets you create 'virtual' SQL Servers on VMware farms, so if one physical server fails, everything moves to another physical server. The nice thing about this is that you avoid the complexity (and expense) of running Cluster Server and SQL Server Enterprise Edition, and you may be able to run Windows 2000/2003 Server instead of Advanced Server. It really is a great product and it is supported by most (if not all) of the major hardware vendors.
DISCLAIMER: I am in no way affiliated with VMware, Microsoft, or any hardware vendor. My experience with both clustering and VMware comes as a customer.
May 25, 2005 at 2:57 pm
Thanks for the quick reply. I actually found a couple of resources; this is the best so far:
http://www.sql-server-performance.com/clustering_resources.asp
May 26, 2005 at 3:03 pm
At a previous employer, I supported Active/Active clusters on NT4/SQL 7, Windows 2000/SQL 2000, and Windows 2003/SQL 2000, some with GEOSpan and long-distance clusters. They work great for business continuance but not so well for availability. Applying patches to the OS and SQL, in my experience, meant more downtime than it did on a regular server. Support (OS) people had to know what they were doing, or they could accidentally cause outages of the cluster during maintenance work. Things like installing SQL service packs often had cluster quirks, and network blips could cause issues. All in all, after several years of using the clusters, they were replaced with boot-from-SAN servers that had drives replicated to hot sites, with identical hardware waiting to take over if needed.
May 26, 2005 at 7:49 pm
The cluster I helped implement has increased downtime significantly through unplanned failovers. The idea and the marketing hype of Microsoft clustering sound good, but in a production environment, blips in the network, SAN connectivity issues, and unexplained issues (even after opening numerous MS calls) are the reality. Each of these issues causes a failover that would not have occurred in a non-clustered environment (clusters are very sensitive). Successfully implementing clustering requires a great deal of discipline and maturity in operational practices, which most IT departments lack. The added complexity means added points of failure (the SAN and the network are even more critical to clustered servers than to non-clustered ones). Usually there is a lot of finger pointing by MS when issues occur, stating things such as: the application is not cluster aware, it's SAN issues, or it's operational maturity issues.
Part of the problem with MS clustering stems from the way in which it is implemented. The MS implementation is rather simple: a shared-nothing architecture. There are resources: disks, an IP address, a network name, and SQL Server. Only one node can own a resource, so when a failover occurs, ownership is transferred to another node. This means SQL Server shuts down, ownership is transferred, then SQL Server starts back up. All application connections are broken, and if the application isn't coded to gracefully handle 5-10 minutes without database connectivity, it causes problems. Because only one node can own a resource at a time, there is no load balancing. On a high-activity server, a failover can take 5-10 minutes.
Contrast this with Oracle clustering (RAC, Grid), which is a shared architecture where multiple nodes can access the database at the same time, providing load balancing and transparent failover. Although I have not worked with Oracle, the technology behind it sounds much more advanced. Microsoft should seek to simplify its clustering implementation; right now it's just too hard to get working correctly. If you want to test the experience of supporting an MS production cluster without investing a lot of time, money, and misery, simply restart SQL Server at random times during business hours and see what kind of fallout you receive from dropping the application connections.
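To put the "gracefully handle a failover" point in concrete terms, here is a rough sketch of the kind of retry logic the application side would need (Python/pyodbc, the server and database names, and the table are placeholders I'm assuming; the attempt count and delay would have to match your actual failover times):

# A sketch, not production code: SQLVIRT01 and the Orders database are placeholders.
import time
import pyodbc

CONN_STR = "DRIVER={SQL Server};SERVER=SQLVIRT01;DATABASE=Orders;Trusted_Connection=yes"

def query_with_retry(sql, attempts=10, delay_seconds=30):
    last_error = None
    for attempt in range(attempts):
        try:
            conn = pyodbc.connect(CONN_STR, timeout=15)
            try:
                return conn.cursor().execute(sql).fetchall()
            finally:
                conn.close()
        except pyodbc.Error as err:
            # Connection refused or dropped, e.g. while SQL Server restarts on the other node:
            # wait out the failover and try again instead of surfacing the error immediately.
            last_error = err
            time.sleep(delay_seconds)
    raise last_error

rows = query_with_retry("SELECT TOP 10 OrderID FROM dbo.Orders ORDER BY OrderDate DESC")

An application without something like this just throws errors at users for the entire 5-10 minutes that the resources are moving between nodes.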
May 27, 2005 at 6:08 am
We have two production clusters here. I was not a fan of this, but I was told we would do it to guard against hardware failure. The first one we implemented failed over every so often for no apparent reason; no one could figure out why. We ended up replacing the hardware, and it has now run OK for 17 months. The second one, implemented 3 weeks ago on Win2003 SP1, has an issue where SQL Server stops and starts for no apparent reason 2-3 times a week. Microsoft is currently looking at this issue via PSSDIAG output... they are leaning toward blocking/locking in the application code... I say the worst code and the highest CPU usage in the world should NEVER cause an RDBMS to go down...
My experience is that you have to get specific hardware that is certified for clustering, as it is extremely sensitive, and make sure the server engineering team has knowledge of and experience with clustering... it is not just "install the operating system and hand it over." SQL Server installs and service packs take twice as long to apply, since the binaries have to be put on both servers, so your downtime for that is doubled. We are looking into a product called Double Take, which replicates Windows / SQL Server data byte by byte, and we may use this as a disaster recovery and/or hardware failure solution.
May 27, 2005 at 6:22 am
Cmille / Markus,
Thank you both for your comments; they are much appreciated. Unfortunately, it seems that everywhere I've asked this question, I get similar feedback. Has anyone out there successfully implemented a cluster without a rash of difficulties in the process?
Thank you again.
May 27, 2005 at 7:10 am
Hi Alex,
Fortunately it's not all doom and gloom. I've been successfully working with clusters for the last few years, and whilst I agree that it takes a fair amount of knowledge and discipline to set them up, it's not beyond the bounds of mortal man (which is just as well, otherwise I wouldn't be able to do it).
The latest SQL cluster I implemented (last November) was based on twin Itanium2 servers with EMC-based SAN storage on the back end (we're talking multiple-TB databases here), running Windows 2003 Enterprise and SQL 2000 Enterprise, both 64-bit. Once I'd got all of the teething problems out of the way (a couple of weeks' worth of work), everything settled down and started running properly; we haven't had an uncontrolled failover since then.
In my experience, it's often a lack of knowledge/experience in some IT departments that causes unexplained failovers (sometimes it's just people playing to see what a utility does). If you can implement strict controls over who has access to your kit, and everyone works in a professional manner, then you should be able to manage it without a problem.
I'd be more than happy to give you the benefit of my experience if you like; just send me a PM with any questions. (Oops, forgot to mention that I'm on holiday for a week, so it might take me a little while to respond.)
May 27, 2005 at 7:25 am
I've had good experiences with clustering as well. My earlier post detailed some of the possible problems. I managed several clusters directly (and I was an escalation point for support on many more). There is a steep learning curve and you have to be very disciplined. I was a hard case about this, and I got a lot of complaints from people who just wanted to 'make the change now' and worry about it later.
Long story short: document EVERYTHING, follow your vendors' maintenance guidelines, don't do anything that would make you unsupportable from Microsoft's or your hardware vendor's perspective, and test everything if you can (VMware is great for this).
I was spoiled in that I had a SAN and three clusters attached to it in a test lab, so I could go and break it before trying the stuff out for real in production. I think that this is where many people go wrong. They are forced to try stuff out in production due to a lack of resources in the lab.
May 27, 2005 at 7:32 am
Mike, PM'd ya, thanks.
Chris, thanks for the update. I've copied out some of your info for part of a presentation I have to do. I'll be looking into VMware to see what it can do for us. Are there any other viable alternatives to clustering besides log shipping? I realize, of course, that log shipping requires a lot more manual intervention and therefore can't guarantee the availability that a cluster can.
May 27, 2005 at 7:39 am
Log shipping is good for a warm standby, but it doesn't really come close to clustering (and it has some problems of its own). The two technologies I have personal experience with are Microsoft clustering and VMware virtual machines. I know that Veritas (and another company I can't recall) offer clustering solutions for Windows and SQL Server as well. Veritas' solution is supposed to be pretty impressive; I've just never worked with it.
On the log shipping issue, the best use of log shipping I've been able to implement was a reporting database, which got my ad-hoc query users out of production. I know there is an option to have the secondary database take over if the primary one goes down, but I've never used it.
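For what it's worth, the reporting-side restore step looked roughly like the sketch below (the server, database, and file names are made up, and in reality the job loops over whatever log backups have arrived). The key points are connecting to master with autocommit so the RESTORE isn't wrapped in a transaction, kicking the readers off first, and restoring WITH STANDBY so the database stays readable between restores:

# A sketch with placeholder names: REPORTSRV, the Sales database, and the paths are made up.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=REPORTSRV;DATABASE=master;Trusted_Connection=yes",
    autocommit=True,  # RESTORE cannot run inside an open transaction
)
cur = conn.cursor()

# Kick the ad-hoc readers off, apply the next log backup, then reopen read-only access.
cur.execute("ALTER DATABASE Sales SET SINGLE_USER WITH ROLLBACK IMMEDIATE")
cur.execute(
    "RESTORE LOG Sales "
    "FROM DISK = N'D:\\LogShip\\Sales_tlog_0700.trn' "
    "WITH STANDBY = N'D:\\LogShip\\Sales_undo.dat'"
)
cur.execute("ALTER DATABASE Sales SET MULTI_USER")
conn.close()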
May 27, 2005 at 8:04 pm
Log shipping is fairly useful for disaster recovery, but it requires manual intervention. I use hostname and port number in connection strings, and in the event of an emergency I change the DNS entries to point the hostname at the disaster recovery IP address. This approach requires zero changes to application connection strings and saves the DBA from rebooting many servers, as you would have to if you renamed each server.
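As a small illustration (the host, port, and database names are placeholders), the application side just connects to a DNS alias and a fixed port, so the DR cutover is purely a DNS change:

# Placeholder names: sqlprod.example.com is a DNS alias, not a machine name.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=sqlprod.example.com,1433;"
    "DATABASE=Sales;Trusted_Connection=yes"
)
# In a disaster, only the DNS entry for sqlprod.example.com is repointed to the
# DR server's IP address; no application connection strings have to change.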
As for also using the log shipped server for reporting, this is very difficult, since you can't restore a log while users are connected. I keep my log shipped server 30 minutes behind the production server. You could batch up your logs and restore them during off-hours, but in many situations this will not work well, since a day's worth of log backups can take longer to restore than a full database. Besides, having a log shipped server 12 hours behind isn't a great solution, since you can't plan emergencies.
May 31, 2005 at 7:15 am
A little off topic, but I've always used log shipping with 15-minute intervals between log backups. Being 12 hours behind just seems like a waste of time if you're using it for disaster recovery; you'll spend more time recovering than if you had just done a complete backup and restore. Just my opinion.
June 13, 2005 at 2:46 am
I've implemented clustering software from Legato at many client sites. I'm not sure if you are looking for clustering in a LAN or a WAN environment. We've had a lot of success using their Co-Standby Server product, which keeps an exact copy of all data on a SQL Server. Co-Standby simply mirrors all the disk writes for a partition to the copy, so you're covered for hardware or network failure, either of which will trigger an automatic failover.
If you are looking at WAN replication/clustering, RepliStor is worth a look. It allows data to be replicated over a fairly slow WAN link, and it has the advantage of being able to replicate at the file level rather than the disk level. Both products control the starting and stopping of the SQL Servers and work with standard SQL Server and the standard Windows 2000 OS, so they are a cost-effective solution. Good luck.