This editorial was originally published on Mar 13, 2018. It is being republished as Steve is on vacation.
Many of us would consider ourselves to be reliable at work. Our employers count on us, co-workers may assume we're handling certain tasks, and it's human nature to think that we are meeting our obligations and responsibilities. Maybe not 100% of the time, but I know my goal is to complete the tasks I've committed to, on time, and to avoid dropping any of the balls that I'm juggling. I think I am mostly successful here, though I am certainly late or forgetful about things at times.
However, does your organization think highly of your databases and of you as a group, department, or even a set of services? Are your systems meeting their SLAs for performance and availability? If not, does your group respond in a timely manner? We sometimes think that we are doing well, but without feedback and communication, can we be sure? It can also go the other way, with other groups constantly complaining about your performance even while you are meeting your commitments.
I ran across a talk from Uber on reliability. It's more of a high-level architecture talk about distributed systems and being able to detect and respond to issues. Certainly Uber works at a rate and scale that few of us will reach in our organizations. Add to that the highly public nature and real-time demands of their service, and reliability is extremely important for their business. Mistakes can have dramatic, hard-dollar effects instantly, and there is a lot of pressure on their staff.
For most of us, our databases continue to become more important. Even when we are meeting our SLA commitments, are we providing reliable service from the entire staff? Is information being shared, with root cause analysis or retrospectives that help transfer knowledge among all of the individuals who might respond to an issue? Are you dependent on a superstar who must be called in to solve issues with the database, network, or storage?
I've been the main person on call, the expert for a system, who received calls on weekends, on vacation, and during other downtime. It's no fun to be in this position, and it certainly distorts work-life balance, not to mention upsetting the rest of my family. My goal is to be there if necessary, but to train others so that they can provide a consistent, similar level of service to customers and clients if need be.
Becoming too dependent on any one person isn't much different from becoming too dependent on one server or disk drive or network cable. At some point you'll have a failure and that person or item will no longer be available. If you don't have a spare, you'll have regrets.