Software Updates, Outages, Processes, and Protocols

This past Thursday, February 22, AT&T had a major outage on their U.S. network. For upwards of 10 hours, hundreds of thousands of customers could not make phone calls, send or receive texts, or use mobile data for apps or browsing websites. Aside from disrupting normal communication, the outage also appeared to knock out users' ability to reach emergency services like 911, raising the level of alarm. It was such a major communications event that the FCC and the U.S. Cybersecurity and Infrastructure Security Agency are now involved in the investigation.

Thankfully, the initial investigation does not point to a cyber-attack as the cause. Instead, AT&T's response on Friday was that the incident was caused "by the application and execution of an incorrect process used as we were expanding our network…" Inside sources have told some news outlets that it was specifically a software update that didn't complete correctly and took essential equipment offline. In other words, somebody (or some team) didn't follow the playbook. All I could think about on Thursday in relation to this story was #HugOps. :)

Here's the thing. In more than 20 years of IT operations, software and database development, and building multiple SaaS applications, I have personally never had to worry about a mistake, even a failed upgrade, putting people's lives in danger. Although I, and my teams, have always striven to provide an excellent experience for our customers at every level, nobody lost their job if utility bill data couldn't be entered into the application for a few hours. When we did fall short of our goal and our customer SLA was breached, it was always a learning opportunity to prepare better for next time.

The longer you work in this field, the more you recognize that preparation is key. Checklists, playbooks, and planning for how to react when the unexpected happens are essential skills to learn and build within a team. I'm positive that the folks at AT&T had all of that, and still a major outage occurred. Thankfully, it appears to have been remedied in less than 10 hours, which honestly feels like a speedy recovery considering the size and scope of the network they deal with.

Jokes about DNS, BGP, or an expired certificate aside, it was a good reminder for us all that thinking about, and planning for, the unexpected is always a part of our jobs. I mean, none of us want to end up in the headlines when data loss occurs, accidentally or not, right?

If you were affected by, or aware of, the outage this past week, what did it make you think about in relation to your job? Do you see opportunities to prepare in a new way the next time you release an update or modify the database schema? Give it some thought if you haven't. And remember, in times like Thursday, it never hurts to send out some #HugOps.

Ryan Booz

Join the debate, and respond to the editorial on the forums