You might be able to shave thousands of dollars, or more, off the monthly cloud bills for your critical SQL Servers, all while maintaining or even improving performance.
Public cloud providers charge organizations for everything they deploy, and while a few items in the cloud are based on a pay-by-consumption model, most of the services that really stack up, namely SQL Server licensing, are charged on a pay-by-allocation model. If you provision a SQL Server VM with eight cores and promptly forget about it, you will be shocked at the end of the month when you receive your cloud subscription bill. If you provision a 32-core VM, and your workload only uses four cores, you are still paying for all 32 cores, regardless of the low utilization rates. The same goes for memory and storage. A large memory footprint and lots of high-speed attached managed disks all add up.
It’s time to be objective and ask the question: “How are my servers using the resources currently allocated?” I find that most organizations do not have a solid resource consumption baseline for their servers, so understanding where cost optimizations can be made becomes a guessing game, which is the exact opposite of what we need and usually backfires spectacularly.
A few tools exist to help with this process, but be careful and weigh the output carefully. Highly granular data helps immensely, because a lack of granularity can skew the output through the law of averages or weight the results in misleading ways. For example, a 30-second CPU spike from a time-sensitive job running at midnight might not even appear in the telemetry if the telemetry reports an average for each 30-minute interval. A utility that reports just the minimum, average, and maximum CPU usage for a day might show a maximum of 95% and base its recommendations on that spike. But if that spike occurs at 2:00am because of routine database index maintenance, finishes within an hour when it has a three-hour window to run safely, and the business-day CPU peak is just 15%, the lack of context in the results can cause you to spend too much.
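To see how much the averaging hides, here is a quick sketch with purely synthetic numbers: a 30-second spike to 95% on a server that otherwise idles at 15%, summarized at one-second and 30-minute granularity.

```python
# Minimal sketch with synthetic numbers: how 30-minute averaging hides a
# 30-second spike. One CPU sample per second over a 24-hour day.
samples = [15.0] * 86_400            # baseline: ~15% CPU all day
for s in range(30):                  # 30-second spike to 95% at midnight
    samples[s] = 95.0

window = 30 * 60                     # 30-minute averaging window, in seconds
averages = [
    sum(samples[i:i + window]) / window
    for i in range(0, len(samples), window)
]

print(f"Peak of 1-second samples:   {max(samples):.0f}%")   # 95%
print(f"Peak of 30-minute averages: {max(averages):.1f}%")  # ~16.3%
```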
I’ve even seen folks provision 16-32TB of storage just to get a high IOPS rate for a VM, only to find that the metrics they based the estimate on came from weekend database backups and were nowhere near the consumption rates during the week. (They were also hitting the VM’s I/O limits well before maximizing the available speed of the virtual disks.)
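That last point is worth pinning down with a quick sketch; the VM and disk limits below are hypothetical, so look up the documented caps for the series you actually run.

```python
# Minimal sketch: the effective IOPS ceiling is the lower of what the attached
# disks can deliver and what the VM size itself allows. The limits below are
# hypothetical; check the documented caps for your VM series and disk tiers.
vm_uncached_iops_limit = 12_800            # hypothetical cap for a mid-sized VM
disk_iops = [5_000, 5_000, 5_000, 5_000]   # four large premium disks (hypothetical)

effective_iops = min(sum(disk_iops), vm_uncached_iops_limit)
print(f"Disks offer {sum(disk_iops):,} IOPS, but the VM caps you at {effective_iops:,}")
```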
Usually, I find that retrieving the data yourself with built-in tools such as Windows’ Perfmon and Linux’s sar can provide data as granular as you need to build a better resource consumption baseline. Take the raw data and review the in-depth metrics: CPU consumption by core, the memory usage of your application/database engine (and not just the OSE, as many applications like databases hide their memory usage stats from the underlying OSE), disk read and write demand, and so on. Every bit of data helps paint a more complete picture of a day in the life of the server.
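As one way to slice the raw data, here is a small sketch that summarizes counters from a Perfmon log exported to CSV (for example, via relog -f csv); the file name and the counter-matching string are placeholders, so adjust them to whatever you actually captured.

```python
# Minimal sketch: summarize counters from a Perfmon log exported to CSV
# (e.g., "relog mylog.blg -f csv -o mylog.csv"). The file name and the
# counter-matching string below are assumptions; adjust to your own capture.
import csv
import statistics

def summarize(csv_path: str, counter_substring: str) -> None:
    """Print min/median/p95/max for every counter column matching a substring."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        columns = [c for c in reader.fieldnames if counter_substring in c]
        values = {c: [] for c in columns}
        for row in reader:
            for c in columns:
                try:
                    values[c].append(float(row[c]))
                except (TypeError, ValueError):
                    pass  # skip blank or non-numeric samples
    for c, v in values.items():
        if not v:
            continue
        v.sort()
        p95 = v[int(len(v) * 0.95)]
        print(f"{c}: min={v[0]:.1f} median={statistics.median(v):.1f} "
              f"p95={p95:.1f} max={v[-1]:.1f}")

# Example: per-core CPU from the "\Processor(*)\% Processor Time" counters.
summarize("mylog.csv", "Processor(")
```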
Next, understand the workload itself from a business perspective. A database server underneath a Monday-to-Friday, 8:00am to 5:00pm business has a far different resource consumption footprint than one behind an around-the-clock, three-shifts-a-day manufacturing facility. What are the normal operating windows that contain time-sensitive tasks? They could be limited to normal business hours, or be constant, or be scattered across the week.
If you see spikes or dips in the telemetry, what task or process corresponds to the change? A sharp CPU spike at 2:00am from index maintenance that completes in 20 minutes could be perfectly normal, and acceptable to let run a few minutes longer. Review the big items to understand whether they are time sensitive, and what sort of impact they have on everything running concurrently.
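One simple way to keep that context attached to the numbers is to split the peaks by time window. Here is a minimal sketch, assuming your telemetry is already parsed into timestamp/CPU pairs and that your business window is weekdays, 8:00am to 5:00pm.

```python
# Minimal sketch: split CPU peaks into business-hours vs. off-hours windows so
# an overnight maintenance spike does not drive the sizing decision. Assumes
# telemetry already parsed into (timestamp, cpu_percent) pairs; the window
# below (weekdays, 8:00am-5:00pm) is an assumption to match your business.
from datetime import datetime

def windowed_peaks(samples, start_hour=8, end_hour=17, workdays=(0, 1, 2, 3, 4)):
    business, off_hours = [], []
    for ts, cpu in samples:
        in_window = ts.weekday() in workdays and start_hour <= ts.hour < end_hour
        (business if in_window else off_hours).append(cpu)
    return max(business, default=0.0), max(off_hours, default=0.0)

samples = [
    (datetime(2024, 6, 3, 2, 10), 95.0),   # overnight index maintenance spike
    (datetime(2024, 6, 3, 10, 30), 15.0),  # typical business-hours load
    (datetime(2024, 6, 3, 14, 0), 22.0),
]
biz_peak, off_peak = windowed_peaks(samples)
print(f"Business-hours peak: {biz_peak}%   Off-hours peak: {off_peak}%")
```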
Now, map consumption rates against resource allocations. You’ll be shocked to find that, in almost all cases, your servers are oversized, and oversized means you are paying too much. Use the baseline to estimate how much you could reduce a VM’s resource allocations without impacting performance. (In a few cases, you might find that some of your servers are actually undersized and starved for resources, and you can adjust those allocations upward to regain performance.)
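Here is a minimal sketch of one way to turn that baseline into a first-pass estimate; the 30% headroom factor and the two-vCPU floor are assumptions, so tune them to your own comfort level and validate with testing before you resize anything.

```python
# Minimal sketch: turn the baseline into a first-pass sizing estimate. The 30%
# headroom and the two-vCPU floor are assumptions; tune them to your own risk
# tolerance and always test before resizing production.
import math

def recommended_vcpus(allocated_vcpus: int, peak_cpu_percent: float,
                      headroom: float = 0.30) -> int:
    cores_in_use = allocated_vcpus * (peak_cpu_percent / 100.0)
    return max(2, math.ceil(cores_in_use * (1 + headroom)))

# A 16-vCPU VM whose business-hours peak is 25% busy:
print(recommended_vcpus(16, 25.0))   # -> 6 vCPUs
```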
In some cases, reducing the allocated resources can actually improve the performance of the VM. Every compute resource allocated to a VM must pass through the hypervisor’s scheduling queues, and the cloud is nothing more than someone else’s datacenter with automation layered on top of a hypervisor. The fewer resources you must push through those queues, the less overhead the hypervisor can impose on your workload, and that means it could get faster.
Consider a 16-core VM running SQL Server Standard edition. An Azure E16ads v5 VM, with 16 vCPUs and 128GB of RAM, lists at $2,470 a month before managed storage. If your workload only consumes the equivalent of three or four CPUs, why not test downsizing it? Cutting the VM scale in half to an E8ads v5 reduces the monthly cost to $1,235. Your utilization percentage goes up, but as long as the server runs in a ‘safe’ range and is not constantly maxed out, you just saved your organization a significant amount of money with a couple of clicks while maintaining the speed your users expect.
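The arithmetic is simple enough to sanity-check yourself; here is a quick sketch using the list prices from the example (your actual rates will vary with region, reservations, and licensing).

```python
# Minimal sketch of the arithmetic, using the list prices from the example
# above; your actual rates will vary with region, reservations, and licensing.
current_monthly = 2470    # E16ads v5
resized_monthly = 1235    # E8ads v5

monthly_savings = current_monthly - resized_monthly
print(f"Per VM: ${monthly_savings:,}/month, ${monthly_savings * 12:,}/year")
# -> Per VM: $1,235/month, $14,820/year
```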
This example covered just one server. What if you had a hundred like it? Or a thousand? How much of your monthly cloud budget could you reclaim? DBAs today must be as much cost optimizers as they are performance and availability engineers, so use this technique to review your resource allocations and save your organization a lot of money!