Just as a society’s civilization is largely nurtured by its culture, I believe the quality of DBA work is largely determined by the requirements set by management. I have been fortunate to work with a few excellent managers in the past and benefited a lot from their requirements. I’ll share some of those requirements, along with a few stories, here.
1st requirement: The DBA team needs to be the first to know what happens in the database ecosystem and the first to respond to any issue.
This is easier said than done, because there is a long path from the database engine to the end user: problems on the database host itself (CPU, memory), network issues, storage system issues, and the end-user application itself.
To meet this requirement, our DBA team came up with lots of monitoring mechanisms covering almost every issue we had ever encountered. We even dynamically profile each stored procedure (SP) and send an alert if an SP consumes twice its average CPU usage over the last 5 days; usually this indicates an execution plan has changed from good to bad.
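A minimal sketch of that kind of check (assuming SQL Server 2008 or later), where the live numbers come from sys.dm_exec_procedure_stats and a hypothetical table dbo.ProcCpuBaseline holds each procedure’s 5-day average CPU per execution — the table name and the 2x threshold are illustrative, not our exact framework:

/* compare current average CPU per execution against a collected 5-day baseline */
SELECT p.object_id,
       OBJECT_NAME(p.object_id, p.database_id) AS proc_name,
       p.total_worker_time / NULLIF(p.execution_count, 0) AS avg_cpu_now,
       b.avg_cpu_5day
FROM sys.dm_exec_procedure_stats AS p
JOIN dbo.ProcCpuBaseline AS b
  ON b.object_id = p.object_id AND b.database_id = p.database_id
WHERE p.total_worker_time / NULLIF(p.execution_count, 0) > 2 * b.avg_cpu_5day;
/* any row returned here becomes an alert, e.g. via Database Mail */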
One story: there were times when a clustered instance failed over to another node in just 5 seconds, and the DBA team was notified while nobody else knew. When we raised the alert to all stakeholders with our damage evaluation report (how long the downtime window was, the number of users impacted, the error messages found in the error log, etc.), I think the DBA team demonstrated its value to management.
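For the curious, failover detection like this can be as simple as a monitoring job that polls, every few seconds, the name of the physical node currently hosting the instance. The sketch below assumes a hypothetical one-row state table dbo.LastClusterNode that remembers the previous poll:

/* detect a cluster failover by watching which physical node hosts the instance */
DECLARE @node sysname;
SET @node = CONVERT(sysname, SERVERPROPERTY('ComputerNamePhysicalNetBIOS'));
IF EXISTS (SELECT 1 FROM dbo.LastClusterNode WHERE NodeName <> @node)
BEGIN
    UPDATE dbo.LastClusterNode SET NodeName = @node, ChangedAt = GETDATE();
    /* the node changed: raise the failover alert to the DBA team here */
END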
2nd requirement: Never cause another problem while fixing one.
To meet this requirement, we are prudent about every solution we propose. We look not only at the benefit of a solution but also at its potential “costs”, and we prepare the corresponding mitigation. For example, we do index maintenance (even online index maintenance) at night; however, index maintenance sometimes blocks business user sessions and causes timeouts on the end-user side. This is rare but it does happen, so we finally came up with a central monitoring framework that kills any DBA-related maintenance work whenever it blocks user sessions (I sketch the idea after the feature wish below). Here I really wish SQL Server had the following feature:
Set Lock_Priority Low /* when this is set, the session will auto-die if it blocks others */
(We know that with Set Deadlock_Priority Low, the session will be chosen as the victim in a deadlock if the other sessions have a higher priority.)
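Since that lock-priority option does not exist, our central framework approximates it from outside the maintenance session. A simplified sketch of the idea, assuming maintenance jobs run under a dedicated login (the login name below is made up, not our actual one):

/* kill any DBA maintenance session that is blocking a user session */
DECLARE @spid int, @cmd nvarchar(20);
SELECT TOP (1) @spid = r.blocking_session_id
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s ON s.session_id = r.blocking_session_id
WHERE r.blocking_session_id > 0
  AND s.login_name = N'dba_maintenance';  /* assumption: a dedicated maintenance login */
IF @spid IS NOT NULL
BEGIN
    SET @cmd = N'KILL ' + CAST(@spid AS nvarchar(10));
    EXEC (@cmd);  /* the maintenance work dies; the blocked user session carries on */
END

In practice a framework like this also has to reschedule or resume the killed maintenance step later, but that part is outside the scope of the sketch.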
One story here: we found duplicate data in a table, a few hundred rows caused by an application bug. No big deal: we DBAs prepared a data fix script, while the developer team prepared the fix for the application bug (just a stored procedure issue). On Wednesday, we tested our script in the QA environment, which had been refreshed with production data from Monday (two days earlier). Everything went well within 1 minute. Two days later, on Friday night, we started the implementation: a fix to the stored procedure plus execution of the data fix script. The stored procedure change was quick, no more than one second; however, the data fix kept running and even blocked other jobs. We were astonished, as we had scheduled only a 10-minute downtime window. After 20 minutes, we had to cancel the fix and roll back the implementation. What we finally found was that the underlying table had suddenly grown from tens of thousands of rows to tens of millions due to some big business activities, so the duplicate rows had grown from a few hundred to over a million, and the data fix script, which deleted all duplicates in one transaction, could not finish the work.
From this lesson, we learned to be more defensive in our DBA scripts so we would not be embarrassed like that again.
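One defensive pattern that addresses exactly this failure mode is to delete the duplicates in small batches inside a loop, so the script never holds one huge transaction and stays cheap to cancel. A sketch, with made-up table and column names:

/* delete duplicates in batches of 5000, keeping the lowest Id per business key */
DECLARE @rows int;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (5000) t
    FROM dbo.Orders AS t
    WHERE EXISTS (SELECT 1 FROM dbo.Orders AS d
                  WHERE d.OrderKey = t.OrderKey AND d.Id < t.Id);
    SET @rows = @@ROWCOUNT;
END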
3rd requirement: Acknowledge and respond to critical alerts within 5 minutes.
This does not seem very inspiring at first, but it indeed opens a door for innovation and imagination. The current practice is that the on-duty DBA must be able to reach a computer from which to acknowledge and evaluate the alert (unfortunately we have seen a fairly large number of false alarms). Usually this means the on-duty DBA cannot leave the office or home, so s/he can respond within 5 minutes. However, we have brainstormed lots of good ideas and will work on them when resources are available. One idea is to use the Twitter API to send critical alerts: the on-duty DBA can then tweet back to the DBA monitoring server, which decodes the DBA’s tweet and acts accordingly, such as sending an acknowledgement email to the necessary stakeholders, or auto-checking some data and tweeting the result back to the DBA to help judge whether it is a false alarm or something beyond the DBA’s responsibility (such as the network being down). If such a response system were available, we DBAs could live a much better life (i.e. not confined to the house or office) yet still meet this requirement.
In short, I believe actionable and reasonable requirements (or, I’d say, guidelines) are key to quality DBA work and a nutritious ingredient in a healthy database administration culture.