This article has a concept I've never heard of: invisible downtime. This is the idea that there are problems in your application that the customer sees but your monitoring doesn't. Your servers are running, but the application doesn't work correctly or responds with delays that impact customers. From an IT perspective, the SLA is being met and there aren't any problems. From a customer viewpoint, they're ready to start looking at a competitor's offering.
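To make the gap concrete, here is a minimal sketch of how the same traffic can look healthy from one angle and degraded from another. Everything in it is an assumption for illustration: the `Request` record, the 2-second tolerance, and the sample data are hypothetical and not taken from the article or any monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # how long the customer waited for the response


def availability(requests):
    """Share of requests that did not fail with a server error (5xx)."""
    if not requests:
        return 0.0
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)


def degraded_share(requests, slow_ms=2000.0):
    """Share of successful requests slower than an assumed tolerance."""
    ok = [r for r in requests if r.status < 500]
    if not ok:
        return 0.0
    slow = sum(1 for r in ok if r.latency_ms > slow_ms)
    return slow / len(ok)


# Hypothetical sample: every request succeeds, but one in ten is very slow.
sample = [Request(200, 120.0)] * 90 + [Request(200, 4500.0)] * 10

print(f"availability:   {availability(sample):.1%}")
print(f"degraded share: {degraded_share(sample):.1%}")
```

With this sample, the availability figure is perfect while a tenth of the requests keep customers waiting several seconds, which is the kind of problem an uptime-based SLA would not surface.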
Lots of developers and operations people know there are issues in our systems. We know networks go down or connectivity to some service is delayed. We also know the database gets slow, or at least slower than we'd like. We know there is poorly performing code and undersized hardware, running with storage that doesn't deliver as many IOPS as our workload demands. We would like time to fix these issues, but often we aren't given any resources.
The current buzzword among executives and senior IT leaders is observability. It's the goal of looking at how our entire system (application, database, and network) is linked and performing, with an eye on improving performance. Leaders are interested not because they want to spend time or money here, but because customers are becoming more fickle and quicker to move to another offering. They know that degraded application performance (another phrase for invisible downtime) can have a real bottom-line impact on revenue.
There are a lot of products in this space, known as application performance monitoring (APM), designed to look at lines of code and determine how well each is performing. They can help you spot issues in application code, but they lack insight into database and network details, at least at the level that experts need. As a result, digging into performance issues and performing root cause analysis usually means pulling data from multiple sources and correlating log entries.
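As a rough illustration of what correlating log entries involves, the sketch below pairs slow application requests with database events logged around the same time. The event tuples, messages, and two-second window are all hypothetical, and the approach assumes the clocks on both sources are reasonably in sync. It is a toy version of the manual work, not how any particular APM product does it.

```python
from datetime import datetime, timedelta


def correlate(app_events, db_events, window=timedelta(seconds=2)):
    """Pair each app event with the DB events logged within `window` of it.

    Both inputs are lists of (timestamp, message) tuples.
    """
    pairs = []
    for app_ts, app_msg in app_events:
        nearby = [
            db_msg
            for db_ts, db_msg in db_events
            if abs(db_ts - app_ts) <= window
        ]
        pairs.append((app_msg, nearby))
    return pairs


# Hypothetical entries pulled from two separate log sources.
app_events = [
    (datetime(2024, 1, 1, 12, 0, 5), "GET /orders took 4.2s"),
    (datetime(2024, 1, 1, 12, 3, 0), "GET /cart took 3.1s"),
]
db_events = [
    (datetime(2024, 1, 1, 12, 0, 4), "slow query on orders table (3.9s)"),
    (datetime(2024, 1, 1, 12, 10, 0), "deadlock detected"),
]

result = correlate(app_events, db_events)
for app_msg, related in result:
    print(app_msg, "->", related or "no database event nearby")
```

In this made-up data, the slow `/orders` request lines up with a slow query, while the slow `/cart` request has no database explanation and would send you looking at the network or the application code instead.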
This is likely an area where AI/ML technologies can help, especially across large estates, though I think in many cases what we need is just a pointer to the poorly performing code. C#, Java, SQL, whatever. We need to know where the bad code is, and then we need to train developers to write more efficient code. That might be the best way to improve application and database performance.