While scanning the headlines this past week, I came across this sentence in the opening paragraph of one article: Is the pressure to be “always on” leaving data center power systems more vulnerable to failure?
That really caught my eye. In terms of mechanical systems, and even electrical connections, it makes sense. There is a certain amount of maintenance that you need to do in order to ensure things keep running. Your car needs regular oil changes, your house might need regular painting, even your bike chain needs a little grease now and again.
There are plenty of systems that continue to function without maintenance, but they degrade. Your HVAC system will keep running even if you change filters once every 3 years, but it will not run as efficiently and it will cost you more money than it needs to. As a suggestion, I set a reminder in Outlook to change filters every month. Your computer is the same way in that it will keep running even if you never clean out the case, but at some point those dust bunnies might cause you a premature failure from heat.
In a data center, there are definitely recommendations for handling the various supporting systems, and the article mentions that electrical connections need to get checked, along with other items. There is concern that the constant demand for uptime might contribute to larger-scale failures from a lack of maintenance, and that should be a concern. It has me wondering if my co-location center is actually doing the maintenance that is needed to ensure they'll keep running.
However, it also had me wondering about software. Our code doesn't need a checkup, per se, but it might need some maintenance. There might be things that are slightly broken, or that cause minor issues. If we delay fixing things, especially those items with a low probability of occurring but a high potential impact on the system, are we asking for trouble?
It's hard with software to determine how much and how often maintenance is required, but I do think that there are definitely times we defer maintenance for various reasons and eventually cause downtime.
Steve Jones