DevOps Metrics [Mean Time To Recovery] (MTTR) by Solidify

DevOps Metrics: Mean Time To Recovery (MTTR)

Published February 01, 2017

You and your organisation have decided to start working towards DevOps. Your transformation was successful and the various departments are now more integrated than ever. The level of collaboration is high, and what used to be waterfall-based development now has more in common with a flowing tap of continuous updates and deliveries. So how do you know that everything is actually working and that the transformation has delivered results? In many cases it is easy to get started with DevOps, but harder to keep improving it.

The short answer is that you need a few key metrics to keep track of your success. One such metric is the Mean Time To Recovery (MTTR). Without metrics you’re shooting blind when it comes to improvements and process development. With metrics, you get an overview of what works well and what doesn’t, and a better picture of where you are and where you’re headed. We have therefore created a list of metrics that we think are general and important for you and your organisation when you’re doing DevOps. Let’s start by talking about Mean Time To Recovery (MTTR).

<h2 style="text-align: left;">Mean Time To Recovery</h2>

With MTTR, you measure the time it takes to recover from a production failure. It’s often a good way to gauge the skill and flexibility of your organisation, and it’s a measurement that should decrease over time. Compared to the more traditional Mean Time To Failure (MTTF), which measures the time between failures, Mean Time To Recovery is often the more important metric, because short disturbances are far less noticeable than long ones. If, for example, Google.com had a 3 year MTTF but 24 hours of downtime when it failed, many would notice and be annoyed. If it instead had 3 failures per day but less than one minute of downtime each time, barely anyone would care.

When you focus on continuous improvement and take every chance you get to achieve it, you will sometimes have failures. What’s important then is keeping the recovery times as short as possible.
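To make the measurement concrete, here is a minimal sketch of how MTTR could be calculated from an incident log. The incident timestamps and the function name `mean_time_to_recovery` are made up for illustration; in practice the data would come from whatever monitoring or incident tracking system you use.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure detected, service restored).
incidents = [
    (datetime(2017, 1, 3, 9, 0), datetime(2017, 1, 3, 9, 45)),
    (datetime(2017, 1, 12, 14, 30), datetime(2017, 1, 12, 14, 50)),
    (datetime(2017, 1, 25, 22, 10), datetime(2017, 1, 25, 23, 5)),
]


def mean_time_to_recovery(incidents):
    """Return the average time from failure to recovery as a timedelta."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the downtime of each incident, starting from a zero timedelta.
    total = sum(
        (restored - failed for failed, restored in incidents),
        timedelta(),
    )
    return total / len(incidents)


mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")  # MTTR: 40 minutes
```

The three example incidents lasted 45, 20 and 55 minutes, which gives an MTTR of 40 minutes. Tracking this number per month or per release is one simple way to see whether it is decreasing over time.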
By measuring MTTR you accept that sometimes things will go wrong: it’s just a part of development. Once you’ve accepted this, and that DevOps is about continuously improving, analysing and collecting feedback, you will realise that focusing on MTTR leads to things such as:

<ul> <li style="text-align: left;">Faster feedback mechanisms and quicker responses</li> <li style="text-align: left;">Better logging and monitoring systems</li> <li style="text-align: left;">Processes that make recovery as simple as deployment</li> </ul>

MTTR is of course affected by things such as the complexity of the codebase, the number of new features, operational changes and other factors.

<h2>Conclusion</h2>

No matter what metrics you choose to use, measuring and acting on the results is a necessity for DevOps. Without metrics, you have no idea whether or not your DevOps initiative actually contributes what you expect it to. In addition, you won’t know if there are any problem areas requiring your attention.

That said, be careful: when you start investigating which metrics to use, you will quickly notice that there are quite a few to choose from, and it’s easy to end up in a labyrinth of irrelevant data. Before choosing your metrics, take the time to work out which ones are the most relevant for you and your organisation.

Have you started measuring Mean Time To Recovery yet? What other metrics do you use? Let us know in the comments below!