For Day 29 of this series, I am going to talk about some of the basic things you should consider from a hardware perspective when you are trying to increase the resiliency and availability of an individual database server. These are some of the first steps you would take as part of designing a high availability solution for your data tier.
The basic principle here is to try to eliminate as many single points of failure as possible at the hardware and configuration level. I believe you should do these things regardless of which other high availability techniques you decide to use. When you are choosing components for a database server (as opposed to a web server, for example), here are some basic things to include:
Two internal drives in a RAID 1 configuration for the operating system and the SQL Server binaries. These drives should use the embedded hardware RAID controller that is available on most new rack-mounted servers. I try to get at least 146GB, 15K, 2.5” drives for this purpose. Using 15K drives helps Windows Server boot a little faster, and helps SQL Server load a little faster when the service first starts. Using 146GB (or larger) drives gives you more room to accommodate things like SQL Server error log files, dump files, etc., without having to worry about drive space.
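Once SQL Server is running, it is easy to keep an eye on free space on every volume that holds a database file (which includes the system drive, if your system databases live there). Here is a minimal sketch using sys.dm_os_volume_stats, which assumes SQL Server 2008 R2 SP1 or later:

-- Free space on each volume that hosts a database file
-- (sys.dm_os_volume_stats requires SQL Server 2008 R2 SP1 or later)
SELECT DISTINCT vs.volume_mount_point, vs.logical_volume_name,
       CONVERT(DECIMAL(10,2), vs.total_bytes / 1073741824.0) AS total_gb,
       CONVERT(DECIMAL(10,2), vs.available_bytes / 1073741824.0) AS available_gb
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs;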
Dual power supplies for the server, each plugged into a separate circuit in your server room or data center. You should also be plugged into an Uninterruptible Power Supply (UPS) on each circuit, and ideally have a backup power source, such as a diesel generator, for your data center. The idea here is to protect against an internal power supply failure, a cord being kicked out of a plug, a circuit breaker tripping, or the loss of electrical power from the utility grid.
You should have multiple network ports in the server, with Ethernet connections to at least two different network switches. These network switches should be plugged into different electrical circuits in your data center. Most new rack-mounted servers have at least four Gigabit Ethernet ports embedded on the motherboard.
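If you want to confirm which local IP address (and therefore which network interface) your SQL Server connections are actually using, something like this query against sys.dm_exec_connections can help; note that local_net_address is NULL for shared memory connections:

-- Local address and port in use by each current connection
SELECT session_id, net_transport, local_net_address,
       local_tcp_port, client_net_address
FROM sys.dm_exec_connections
WHERE session_id <> @@SPID;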
You should have multiple RAID controller cards if you are using direct attached storage (DAS), multiple Host Bus Adapters (HBAs) if you are using a Fibre Channel SAN, or multiple PCIe Gigabit (or better) Ethernet cards if you are using an iSCSI SAN. This will give you better redundancy and better throughput, depending on your configuration.
Wherever your SQL Server data files, log files, TempDB files, and SQL Server backup files are located, they should be protected by an appropriate RAID level, depending on your budget and performance needs. We want to keep our databases from going down due to the loss of a single drive. One thing to keep in mind is that RAID is not a substitute for an appropriate SQL Server backup and restore strategy! Never, never, never let anyone, whether it is a SAN vendor, a server admin from your Operations team, or your boss, talk you into not doing SQL Server backups as appropriate for your Recovery Point Objective (RPO) and Recovery Time Objective (RTO) metrics. I cannot emphasize this point enough! There is absolutely no substitute for having SQL Server backup files.
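One quick sanity check here is to look at the backup history that SQL Server records in msdb, to make sure every database actually has a recent full backup. This is just a sketch (it only reports what msdb knows about, and it is no substitute for actually testing your restores):

-- Most recent full backup for each database, oldest first
SELECT d.name AS database_name,
       MAX(bs.backup_finish_date) AS last_full_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS bs
    ON bs.database_name = d.name
   AND bs.type = 'D'  -- D = full database backup
WHERE d.name <> 'tempdb'
GROUP BY d.name
ORDER BY last_full_backup;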
Despite this fact, you will undoubtedly be pressured multiple times in your career, by different people, into not running SQL Server database backups. You really need to stand your ground and not give in to this pressure. There is an old saying: “If you don’t have backups, you don’t have a database.”
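For reference, a bare-bones full backup with a checksum looks something like this; the database name and file path are placeholders, and the COMPRESSION option assumes a version and edition of SQL Server that supports backup compression:

-- Full backup with a checksum, so page corruption is caught at backup time
-- (database name and path below are placeholders)
BACKUP DATABASE [YourDatabase]
TO DISK = N'B:\SQLBackups\YourDatabase_Full.bak'
WITH CHECKSUM, COMPRESSION, INIT, STATS = 10;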
I also want to note one configuration setting I like to use for database servers to reduce their boot and SQL Server startup time. For a standalone database server, reducing your total reboot time has a direct effect on your high availability numbers.
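If you want to track how this works out in practice, it is easy to see when the SQL Server service last started, which lets you measure how long a full reboot cycle actually takes; for example:

-- When did this instance of SQL Server last start?
-- (sqlserver_start_time requires SQL Server 2008 or later)
SELECT sqlserver_start_time,
       DATEDIFF(MINUTE, sqlserver_start_time, GETDATE()) AS uptime_minutes
FROM sys.dm_os_sys_info;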
I always go into the BIOS setup for the server and disable the memory testing that normally occurs during the POST sequence. This will shave a significant amount of time off the POST sequence (often many minutes), so the server will boot faster. I think this is pretty low risk, since this testing only occurs during the POST sequence; it has nothing to do with detecting a memory problem while the server is running later (which is the job of your hardware monitoring software). I am sure some people may disagree with this setting, so I would love to hear your opinions.