The Blast Radius of a Database

This week's editorial is a guest post from Phil Factor.

Just occasionally, journalists whose eloquence greatly exceeds their understanding of technology write inflammatory articles about the impending obsolescence of the 'monolithic' relational database in favor of distributed, fault-tolerant database systems. The idea of the database being an essentially vulnerable component resonates with our culture. Nobody argues when we blame the database, so it has become the classic call-center excuse: 'sorry, but the database has gone down'. This feeds into the myth of its fragility, perpetuated by the hotheads of distributed architectures. Certainly, a relational database is by necessity a single point of failure if your business is unable to function when it fails.

In recent history, though, a complete failure of a system is very rarely due to a 'monolithic RDBMS'. It is much more likely, in the cold light of day, to be due to the network, the system architecture or the application design. Much has happened with the industrial-strength RDBMSs in the past ten years to make them intrinsically resilient. Even if there is downtime, many software devices have been contrived to make partially-connected computing practicable. There is still a potential for failure, but judging from publicly-reported incidents, the aging monolith of legend is no longer one of the usual suspects.

It is a bad idea for a single system failure to have a large impact on the business. In software engineering, there is the idea of resiliency and availability: limiting the 'blast radius' of any incident, such as an error or fault. The whole idea of failover clustering is based on these principles. The major commercial RDBMSs are designed for resilience in the light of their potential blast radius.

It would seem to be common sense to deal with the risk of blast radius by increasing resilience. In the aircraft industry, where the pioneering work on resilient architectures was done, it is a simple decision: it is almost never a good idea to shut down a computer system entirely in-flight, so it is always better for the computer systems to 'stagger on bleeding' than to allow a 'blast' to happen. Surprisingly perhaps, the same isn't always true in other sectors of industry. In commerce, trading and financial services, it isn't quite so clear-cut. In the case of a database, for example, some incidents require you to take the database offline to prevent the damage from cascading. If a fault exposes a vector for fraud, you leave the database working at your peril. These are all issues that must be determined when the system is designed, but in those cases where swift action must be taken, a resilient, monolithic architecture may be easier to 'orchestrate', in that it is easier to turn off the service in a coordinated way.

Like most good things in life, resilient systems require a great deal of conscious effort. For the application developer, it means a far more rigorous discipline in handling exceptions, and in the design of the systems that monitor and control the applications. There is no free lunch to be had by, say, adopting microservices or putting your databases in the cloud: you still have to consider the intricacies of service registries, software circuit breakers and bulkheads. Radical alternative designs need very smart architectural decisions.

Phil Factor

Join the debate, and respond to the editorial on the forums
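
For readers who haven't met the 'software circuit breaker' mentioned above, the sketch below shows the basic idea in Python: after a run of failures against a dependency, further calls are rejected outright for a cooling-off period, so a struggling database or downstream service is not hammered while it recovers. It is a minimal illustration only; the names used here (CircuitBreaker, call, failure_threshold, reset_timeout, run_query) are assumptions for the example, not any particular library's API.

    import time


    class CircuitOpenError(Exception):
        """Raised when the breaker is open and calls are being rejected."""


    class CircuitBreaker:
        def __init__(self, failure_threshold=3, reset_timeout=30.0):
            self.failure_threshold = failure_threshold  # failures allowed before opening
            self.reset_timeout = reset_timeout          # seconds to wait before a trial call
            self.failure_count = 0
            self.opened_at = None                       # None means the circuit is closed

        def call(self, func, *args, **kwargs):
            # While the circuit is open, fail fast until the reset timeout expires.
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise CircuitOpenError("circuit open; failing fast")
                # Half-open: let one trial call through to see if the dependency has recovered.
                self.opened_at = None

            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failure_count += 1
                if self.failure_count >= self.failure_threshold:
                    self.opened_at = time.monotonic()   # open the circuit
                raise
            else:
                self.failure_count = 0                  # a success closes the circuit fully
                return result


    # Hypothetical usage: wrap a database call so repeated failures stop
    # hammering the server and surface quickly to the caller instead.
    # breaker = CircuitBreaker(failure_threshold=5, reset_timeout=60)
    # rows = breaker.call(run_query, "SELECT 1")   # run_query is assumed, not defined here

The point of the pattern, in the terms of the editorial, is that it limits the blast radius of a failing dependency: the application degrades or fails fast in a controlled way rather than queueing up work against a component that cannot serve it.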