Cost of downtime vs Risk of downtime vs Cost of fault tolerance

8:28 pm in infrastructure management by Matt Jenkins

What is the risk of an outage occurring? How many outages are acceptable? And what are the most effective ways to reduce them?

There are two types of outage: planned and unplanned. A planned outage can be scheduled out of hours, communicated to users and customers in advance, and have workarounds designed where necessary. When the marketing department is aware the website will be down for an hour next Thursday from 12 to 1am, they may decide to hold off that email campaign, or reschedule the US-based demo meeting they had booked.

Whilst planned outages are necessary, if they occur too often it can be better to adopt a regular maintenance slot, so users get used to an hour (or several) of downtime per week or month at an agreed time.

The cost of downtime

This can be broken down into the effect on staff and the effect on customers. If staff rely on the application to do their job, then a 4 hour outage means those staff are idle while the app is down. If customers cannot get to the site and would normally be placing orders, then there is a value for lost revenue. There is also a lost spend cost that needs to be accounted for: this includes items such as advertising that drove customers to the site while it was down, datacenter costs, and potential compensation for customers.
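As a rough sketch, the cost of a single outage could be added up along the following lines. The function name, its parameters and every figure in the usage line are hypothetical, chosen only to illustrate the breakdown; substitute your own numbers.

```python
def outage_cost(hours, idle_staff, staff_hourly_cost,
                revenue_per_hour, lost_spend):
    """Estimate the cost of one outage. All inputs are assumptions."""
    # Staff who rely on the application are idle for the whole outage.
    staff_cost = hours * idle_staff * staff_hourly_cost
    # Orders that would normally have been placed while the site was down.
    lost_revenue = hours * revenue_per_hour
    # lost_spend covers wasted advertising, datacenter costs, compensation.
    return staff_cost + lost_revenue + lost_spend


# Hypothetical 4 hour outage: 10 idle staff at 20 per hour, 500 per hour
# in orders, and 300 of wasted advertising and compensation.
print(outage_cost(4, 10, 20, 500, 300))
```

Keeping the three terms separate makes it easier to see whether staff time, lost orders or lost spend dominates for a given application.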

Risk of downtime

Unplanned outages can be caused by several parts of the infrastructure:

  • application code issues (memory leaks / loops etc)
  • server failure (web / database / other)
  • network failure (load balancer / switch / firewall)
  • data provider issue
  • power failure

A risk assessment should take into account Mean Time Between Failures (MTBF), Mean Time To Recovery (MTTR) and failure impact (site outage, email outage, images failing etc). If you know how often a component fails (MTBF), how long it takes to fix (MTTR) and the impact of it failing, you know the risk of that component.
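One way to turn MTBF and MTTR into a single number is the expected downtime per year for a component. The sketch below assumes MTTR is small compared with MTBF, so failures per year is approximately the hours in a year divided by MTBF. The function name and the sample figures are illustrative only, and the impact of the failure still has to be judged separately.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def expected_annual_downtime(mtbf_hours, mttr_hours):
    """Expected hours of downtime per year for one component."""
    # Approximation: valid when repair time is short relative to MTBF.
    failures_per_year = HOURS_PER_YEAR / mtbf_hours
    return failures_per_year * mttr_hours


# Hypothetical component: fails about twice a year, takes 4 hours to fix.
print(expected_annual_downtime(mtbf_hours=4380, mttr_hours=4))
```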

Cost of fault tolerance

Cost of fault tolerance is simply the capital outlay, plus the annual support, of the component(s) that will increase the MTBF (or reduce the MTTR) to an acceptable level. This may mean replacing the equipment, upgrading the current equipment, buying a hot/warm spare or just getting a support agreement for the device. The difference between these is mainly cost, but it also involves installation and management time for your staff and the simplicity of your network.

What is the most effective fix?

Using the information above, you can put together a breakdown of risk vs cost and let the business owner decide on the downtime they find acceptable.

Example

Multiplying the number of failures per year by the downtime per failure for each component, this example shows 44 hours of downtime per year from hardware related outages, which means we can only provide 99.5% uptime before the application is even involved!

Looking at this example, I would suggest an investment in a 2nd data feed as soon as possible, with a plan to install a 2nd firewall and database very soon after. Those upgrades would cost £26,500 but increase expected uptime to 99.84%.
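The percentages in this example can be checked by converting between hours of downtime per year and uptime. This is a minimal sketch assuming a 365 day year; the function names are my own.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def uptime_percent(downtime_hours):
    """Uptime percentage for a given number of downtime hours per year."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


def downtime_hours(uptime_pct):
    """Hours of downtime per year implied by an uptime percentage."""
    return (1 - uptime_pct / 100) * HOURS_PER_YEAR


print(round(uptime_percent(44), 1))     # 99.5
print(round(downtime_hours(99.84), 1))  # 14.0
```

So 44 hours per year corresponds to roughly 99.5% uptime, and 99.84% uptime corresponds to roughly 14 hours of downtime per year.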

Obviously a real breakdown would include many more items and several would be educated guesswork, but this method can help decide on the priority of new equipment and show which single points of failure (SPoF) are really critical.
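To prioritise, one simple approach is to rank each possible upgrade by its cost per hour of downtime avoided. The sketch below uses placeholder upgrades with made-up costs and savings, not the figures from the example above, and the ranking rule itself is an assumption; a business owner may weigh impact differently.

```python
# (name, total cost, downtime hours avoided per year): all made up
upgrades = [
    ("upgrade B", 12000, 8),
    ("upgrade A", 5000, 20),
    ("upgrade C", 3000, 1),
]


def cost_per_hour_saved(upgrade):
    """Cost of an upgrade divided by the yearly downtime it removes."""
    _name, cost, hours_saved = upgrade
    return cost / hours_saved


# Cheapest cost per hour of downtime avoided comes first.
ranked = sorted(upgrades, key=cost_per_hour_saved)
for name, cost, hours_saved in ranked:
    print(f"{name}: {cost / hours_saved:.0f} per hour of downtime avoided")
```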

How much downtime is acceptable depends on how vital your site is to your business, but this approach may give you a better understanding of where the biggest risks lie and which are the quick wins for your web infrastructure.