Recently, I was at work in the early hours of the morning, staring bleary-eyed at my screen as I fought to regain space on one of my SQL Servers that had run critically low. Silently, I cursed the monitoring system for failing to alert me to this problem before it became an imminent crisis. Dammit, as a DBA team, disk space was our currency! We had to know exactly how much was in the collective bank! Otherwise, we too often ended up like this, trading megabytes for sleep.
My mental grumblings had me recall a recent experience with my real bank. While at work I had, unusually, received an email from the bank notifying me that I had an online message waiting. I was too busy to check the message right away, and it slipped my mind until later when, attempting to pay for lunch, I was embarrassed to hear that my card had been declined. I returned to the office to find I'd missed a call from the bank.
An email, a phone call, online messages and a declined card, all responses, as it turned out, to the potential theft of my money, revealed by a trail of data transactions somewhere in Las Vegas. The bank had detected the potential problem quickly, acted decisively, and so averted a crisis in my bank balance. I was impressed by the bank's multi-pronged alerts, and felt more secure that they'd detect, and respond to, any future fraudulent activity on my card just as quickly.
Suddenly, I was more alert myself. Surely we needed a similar approach to this disk space problem, which we spent so much of our time battling? Early detection and multi-pronged alerts. It was one thing to have disk alerts through a general-purpose monitoring system, but clearly it wasn't 100% reliable, and surely such alerts were sometimes missed amongst all the others. What if I set up an additional, dedicated alert, just for low disk space? It felt a bit like "redundant redundancy", but I was gripped by the idea and had to find out what impact, if any, it would have on the team's responses to these recurrent problems.
I pulled out and ran a query I had developed several months earlier to find the top 20 servers lowest on disk space. Immediately, I saw that there were 15 servers that needed attention. I scheduled a job in SQL Agent to send the results to the DBA team, ran the job that afternoon, and waited for the reaction. It took only 10 minutes before the first response came in.
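The query itself isn't reproduced here, and the original ran across many servers, but a minimal sketch of the idea for a single instance, a free-space check emailed out via Database Mail from an Agent job step, might look something like the following. The 10 GB threshold, mail profile name and recipient address are placeholders for illustration, not the values we used, and the free-space DMV assumes SQL Server 2008 R2 SP1 or later.

    -- Sketch only: flag drives hosting database files that have less than
    -- 10 GB free, and mail the list to the team via Database Mail.
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = N'DBA Mail',              -- placeholder Database Mail profile
        @recipients   = N'dba-team@example.com',  -- placeholder distribution list
        @subject      = N'Low disk space report',
        @query        = N'
            SELECT DISTINCT
                   vs.volume_mount_point,
                   CAST(vs.available_bytes / 1073741824.0 AS DECIMAL(10,1)) AS free_gb,
                   CAST(vs.total_bytes     / 1073741824.0 AS DECIMAL(10,1)) AS total_gb
            FROM sys.master_files AS mf
            CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs
            WHERE vs.available_bytes < 10 * 1073741824.0   -- under 10 GB free
            ORDER BY free_gb;';

Wrapped in a scheduled SQL Agent job step, something along these lines gives the team a daily, dedicated nudge about shrinking drives, independent of whatever the general-purpose monitoring system happens to catch.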
The first DBA noted that we had only 2 GB free on one of the data drives of a test server. This was not a crisis, but nevertheless one more auto-growth of the data file would push the drive over the limit and it would be out of space. He did not question where the mail had originated or whether the results were valid. He simply responded, investigated, and in short order had recouped 15 GB of space. Later, he sent an email to the business informing them that they had to reduce their data footprint on the server or purchase more space.
At the next meeting, I explained to the team what I had done and why. As DBAs, we spend a lot of our time responding to full disks: log files that fill up because of huge data loads that don't commit transactions frequently; space rapidly depleted by an application that no one anticipated would use 500 GB. We cannot always trust a single monitoring system to catch every event early enough. Any event, good or bad, can be monitored and trigger a downstream response, either from a live person or from an automated job, but if it is not caught it can go unnoticed for a long time, by which point it may have become an unrecoverable problem. Adding a simple email report to our environment to point out potential disk space problems, even if it seemed 'redundant', had a demonstrable impact on our response as a team, and was a move in the right direction in terms of responding to recurrent problems before they escalated into a crisis.
Rodney Landrum (Guest Editor).