Scripting Makes Mistakes Easier Than Ever

A number of you likely use Atlassian products like Jira, Confluence, Opsgenie, or another of their tools. You might have been affected by the large outage they had recently (post incident blog, Company Q&A, TechRepublic report), which lasted at least 9 days. I don't know if all customers have their data back and are working again, but according to a number of reports from customers, this was a surprisingly poorly handled incident. There's a great write-up from the outside that you might want to read.

The bottom line in this incident is that Atlassian set out to deactivate a legacy product with a script, but they apparently didn't communicate well among their teams. The script ended up using the wrong customer IDs, and it also marked the sites for permanent removal rather than temporary removal (a soft delete). While they supposedly test their restore capabilities, they weren't prepared for partial restores of subsites. I'm guessing this amounts to a partial database restore, which many of us know is far more complex than a full database restore.

Leave aside the issue of a software-as-a-service (SaaS) company failing its customers, and the lack of communication with those customers. The more interesting thing for me is the internal challenge of poor coding and poor communication. Clearly, the project to deactivate their legacy app wasn't well planned or tested, and the code was probably executed at too wide a scale initially.

When we deploy code changes to a large number of items, we want to test them on a small number first. Whether we are deploying to multiple databases, against many customers, or across different systems, a standard method of making changes at scale is to work in rings. Azure DevOps describes this in its docs, and that team actually uses rings to change the platform itself. We used the same pattern 20 years ago for software and database updates across many systems. We would deploy internally to a few users to look for issues. A week later, we would deploy to a small number of systems to check for unexpected problems. Then we would typically deploy to most systems in a third ring, with a fourth ring a week later to catch the stragglers that needed more time to prepare. There's a rough sketch of this pattern at the end of this piece.

I find many customers, especially those with sharded/federated databases or many systems, unwilling to spread out deployments in this manner. Often they yield to pressure from business users to ensure everyone gets the same update at the same time. I would never recommend that approach, as we need to be looking at scripts in a controlled environment, or even two, before we deploy them widely. I'd be even more cautious about one-off administrative scripts that make a change similar to the one Atlassian attempted. Those are often not tested seriously enough (the second sketch at the end shows the kind of guard rails such a script can carry).

At the very least, any of us working with multiple customers in a single database, or across multiple databases, ought to ensure we can back up and restore a single customer and, more importantly, that we can restore a group of customers. If you make a mistake like Atlassian's, which scripting allows us to do extremely rapidly, can you recover a partial set of data? Many of us don't test this, but it's likely something we ought to consider when we work with scripts that are designed to change only some of the data. Most of us don't experience complete failures, but partial ones, usually because of human error.

Steve Jones - SSC Editor

Join the debate, and respond to today's editorial on the forums
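
Here is the first sketch: a minimal, hypothetical ring-based rollout driver written in Python for illustration. The ring membership, the soak time, and the apply_change and verify helpers are all assumptions standing in for whatever your own deployment tooling and health checks look like. The only point is the shape: touch a small ring, verify, wait, then widen.

    # Minimal sketch of a ring-based rollout. Targets, soak time, and the
    # apply_change/verify helpers are placeholders for real tooling.
    import time

    RINGS = [
        ["internal-dev-01"],                         # ring 0: internal users only
        ["customer-a", "customer-b"],                # ring 1: a handful of systems
        ["customer-c", "customer-d", "customer-e"],  # ring 2: most systems
        ["customer-f"],                              # ring 3: stragglers
    ]

    # In practice this would be days or a week between rings; kept short here
    # so the sketch actually finishes when run.
    SOAK_SECONDS = 5


    def apply_change(target: str) -> None:
        """Placeholder: run the change/script against one target."""
        print(f"applying change to {target}")


    def verify(target: str) -> bool:
        """Placeholder: confirm the target still looks healthy."""
        return True


    def rollout() -> None:
        for ring_number, targets in enumerate(RINGS):
            for target in targets:
                apply_change(target)
                if not verify(target):
                    # Stop before the blast radius grows any further.
                    raise RuntimeError(f"verification failed on {target}; halting rollout")
            print(f"ring {ring_number} complete")
            if ring_number < len(RINGS) - 1:
                time.sleep(SOAK_SECONDS)  # soak time to catch slow-burning issues


    if __name__ == "__main__":
        rollout()

The design choice worth copying is that a failure in an early ring stops everything; nothing after ring 0 runs until the smallest group has been verified and given time to surface problems.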
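
And here is the second sketch: the kind of guard rails a one-off deactivation script can carry. It uses an explicit, reviewed allowlist of customer IDs, runs as a dry run by default, and performs a soft delete (flipping a flag) rather than a permanent removal. The sites table, its columns, and the IDs are hypothetical, and sqlite3 is used only to keep the example self-contained and runnable; this is not Atlassian's script, just an illustration of the safeguards it apparently lacked.

    # Minimal sketch of a guarded one-off deactivation script (hypothetical schema).
    import sqlite3

    # IDs are listed explicitly and reviewed; nothing is derived at run time.
    CUSTOMER_IDS_TO_DEACTIVATE = [101, 102]


    def deactivate_sites(conn: sqlite3.Connection, customer_ids: list[int], dry_run: bool = True) -> int:
        placeholders = ",".join("?" for _ in customer_ids)

        # Show exactly which rows would be touched before anything changes.
        rows = conn.execute(
            f"SELECT site_id, customer_id FROM sites "
            f"WHERE customer_id IN ({placeholders}) AND is_active = 1",
            customer_ids,
        ).fetchall()
        print(f"{len(rows)} site(s) match: {rows}")

        if dry_run:
            print("dry run only; re-run with dry_run=False to apply")
            return 0

        # Soft delete: flip a flag so the change can be reversed without a restore.
        cur = conn.execute(
            f"UPDATE sites SET is_active = 0 "
            f"WHERE customer_id IN ({placeholders}) AND is_active = 1",
            customer_ids,
        )
        conn.commit()
        return cur.rowcount


    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE sites (site_id INTEGER, customer_id INTEGER, is_active INTEGER)")
        conn.executemany(
            "INSERT INTO sites VALUES (?, ?, ?)",
            [(1, 101, 1), (2, 102, 1), (3, 999, 1)],
        )
        deactivate_sites(conn, CUSTOMER_IDS_TO_DEACTIVATE)                  # preview only
        deactivate_sites(conn, CUSTOMER_IDS_TO_DEACTIVATE, dry_run=False)   # apply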