This week's editorial is a guest post from Phil Factor.
I take a great interest in the post-mortems of system outages, not from a wish to gloat over the misfortune of others, a misplaced schadenfreude or epicaricacy, but to see how well the alerting, detection, instrumentation and recovery took place. I realise that a lot of these accounts are never made public, but when they are, they often contain a great deal of information that is useful to many of us. Because of the detail in which the events are described, it is often possible to pick out many of the things done right, beyond the more obvious account of the things done wrong. You can be reading about a simple operator error caused by poor teamwork, when you are suddenly struck by the rapidity of the subsequent alert and the coordinated response that rescued the downed service. I find great value in reading beyond the obvious mistakes to revel in the description of the way the system was recovered.
When developing any technology, we learn so much when things go wrong. Transport technology and practice, for example, developed, and continues to evolve, through intense study of what happened in accidents and incidents; inquiries can change industry practice within months. In IT, we are curiously reticent about our disasters, whether they are security breaches, system downtime, or data loss, perhaps because there are generally no flames and it is rare for people to die as a consequence. IT companies tend to publish post-mortems on the death of their systems only when compelled to. A few accounts of IT disasters have become required reading for any IT professional, but they are the exception, and too often we are left in the dark. In the absence of obvious victims, it is difficult to compel companies to publish accounts of catastrophic mistakes and the subsequent recovery process.
We’re getting better at reporting certain IT traumas, such as when cloud services vanish or when the personal information from a dating site is published on the Dark Web. I’d like to see more information about disasters in delivering IT initiatives too, particularly in public services, and especially where there was a successful recovery. I suspect that poor communications and weak stakeholder management underlie a lot of these failures, but it would be far better to be able to judge from the evidence. Until we can get clear evidence of the causes of failure and learn from effective ways of recovering, we’ll continue to ‘evolve’ development and delivery techniques in a haphazard and unscientific way.