MTTR “not a viable metric” for complex software system reliability and security

Verica Open Incident Database Report suggests mean time to resolve should be retired and replaced with other metrics more appropriate for software systems and networks.

Mean time to resolve (MTTR) isn’t a viable metric for measuring the reliability or security of complex software systems and should be replaced by other, more trustworthy options.

That’s according to a new report from Verica, which argued that the use of MTTR to gauge software and network failures and outages is not appropriate, partly due to the distribution of duration data and because failures in such systems don’t arrive uniformly over time.

Site reliability engineering (SRE) teams and others in similar roles should therefore retire MTTR as a key metric, instead looking to other strategies including service level objectives (SLOs) and post-incident data review, the report stated.

MTTR metric not descriptive of system reliability

MTTR originated in manufacturing organisations to measure the average time required to repair a failed physical component or device, according to the second annual Verica Open Incident Database (VOID) Report.

However, such devices had simpler, predictable operations with wear and tear that lent themselves to reasonably standard and consistent estimates of MTTR, it added. “Over time the use of MTTR has expanded to software systems, [and] software companies view it as an indicator of system reliability and team agility/effectiveness.”

Verica researchers argued that MTTR is not an appropriate metric for complex software systems. “Each failure is inherently different, unlike issues with physical manufacturing devices. Operators of modern software systems regularly invest in improving the reliability of their systems, only to be caught off guard by unexpected and unusual failures.”

“MTTR is appealing because it appears to make clear, concrete sense of what are really messy, surprising situations that don’t lend themselves to simple summaries, but MTTR has too much variance in the underlying data to be a measure of system reliability,” Courtney Nash, lead researcher, Verica, tells CSO.

“It also tells us little about what an incident is really like for the organisation, which can vary wildly in terms of the number of people and teams involved, the level of stress, what is needed technically and organisationally to fix it, and what the team learned as a result,” she adds.

The same set of technological circumstances could conceivably go a lot of different ways depending on the responders, what they know or don’t know, their risk appetite and internal pressures, Nash says.

Using incident data collected for the report, Verica claimed it was able to show that MTTR is not descriptive of complex software system reliability, conducting two experiments to test MTTR reliability based on previous findings published by Štěpán Davidovič in Incident Metrics in SRE: Critically Evaluating MTTR and Friends.

The results showed that reducing incident duration by 10 per cent did not cause a reliable reduction in the calculated MTTR, regardless of sample size (e.g., total number of incidents), the report stated. “Our results [also] highlight how much the extreme variance in duration data can impact calculated changes in MTTR.”
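The intuition behind that result can be shown with a small simulation. The sketch below is not the VOID methodology itself, just an analogous Monte Carlo in Python: it draws two independent sets of heavy-tailed incident durations (the lognormal parameters are purely illustrative), genuinely shortens one set by 10 per cent, and counts how often the calculated MTTR actually comes out lower.

```python
import random
import statistics

def improvement_visible(n_incidents, improvement=0.10, trials=2_000, seed=1):
    """How often does a genuine 10% cut in incident duration show up as a
    lower calculated MTTR? Durations come from a lognormal distribution
    with illustrative parameters (not the VOID dataset)."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(trials):
        # "Before" period: n incidents with heavy-tailed durations (minutes).
        before = [rng.lognormvariate(4.0, 1.5) for _ in range(n_incidents)]
        # "After" period: independent incidents, each genuinely 10% shorter.
        after = [rng.lognormvariate(4.0, 1.5) * (1 - improvement)
                 for _ in range(n_incidents)]
        if statistics.mean(after) < statistics.mean(before):
            detected += 1
    return detected / trials

for n in (30, 100, 500):
    print(f"{n:>3} incidents per period: MTTR drops in {improvement_visible(n):.0%} of trials")
```

Even with hundreds of incidents per period, the “improved” group frequently fails to show a lower MTTR, because the variance in skewed duration data dwarfs a 10 per cent shift in the mean.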

Implementing alternatives to the MTTR metric

A single averaged number should never have been used to measure or represent the reliability of complex software systems, the report read.

“No matter what your (unreliable) MTTR might seem to indicate, you’d still need to investigate your incidents to understand what is truly happening with your systems.” However, moving away from MTTR isn’t just swapping one metric for another; it’s a mindset shift, Nash says.

“Much the way the early DevOps movement was as much about changing culture as technology, organisations that embrace data-driven decisions and empower people to enact change when and where necessary will be able to reckon with a metric that isn’t useful and adapt.”

Verica’s report listed a set of metrics (most of them based on incident analysis) to consider instead of MTTR.

SLOs/customer feedback: “SLOs are commitments that a service provider makes to ensure they are serving users adequately (and investing in reliability when needed to meet those commitments). SLOs help align technical system metrics with business objectives, making them a more useful frame for reliability.

"However, SLOs can share weaknesses with MTTR, including being backward-looking only, not including information about known risks, and not capturing capture non-SLO-impacting near misses."

Sociotechnical incident data: Modern, complex systems are sociotechnical, comprising code, machines, and the humans who build and maintain them, the report read. However, teams tend to consistently collect only technical data to assess how they are doing.

“One rich source of socio-technical data comes from the concept of Costs of Coordination as studied by Dr. Laura Maguire.” These data types include the number of people involved in an incident, tools used, unique teams, and concurrent events.

“Until you start collecting this kind of information, you won’t know how your organisation actually responds to incidents (as opposed to how you may believe it does),” the report stated.
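One practical starting point is simply giving these fields a home in whatever incident tracker a team already uses. The record below is a minimal, hypothetical sketch; the field names and the crude coordination score are illustrative, not a schema from the VOID dataset.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IncidentRecord:
    """Sociotechnical fields captured alongside the usual technical timeline."""
    incident_id: str
    duration_minutes: float
    responders: List[str] = field(default_factory=list)   # people pulled in
    teams: List[str] = field(default_factory=list)         # unique teams involved
    tools_used: List[str] = field(default_factory=list)    # chat, dashboards, runbooks...
    concurrent_incidents: int = 0                           # other events competing for attention

    @property
    def coordination_load(self) -> int:
        """Crude proxy for cost of coordination: more people, more teams and
        more parallel events generally mean a harder incident to run."""
        return len(self.responders) + len(self.teams) + self.concurrent_incidents

# Hypothetical example record.
incident = IncidentRecord(
    incident_id="2023-04-payments-outage",
    duration_minutes=95,
    responders=["alice", "bob", "priya"],
    teams=["payments", "sre"],
    tools_used=["pager", "chat", "dashboards"],
    concurrent_incidents=1,
)
print(incident.coordination_load)  # 6
```

Collected consistently, this kind of data shows how an organisation actually responds to incidents, rather than how it assumes it does.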

Post-incident review data: “Another way to assess the effectiveness of incident analysis within/across your organisation is to track the degree of participation, sharing, and dissemination of post-incident review information.” This can include the number of people reading write-ups and voluntarily attending post-incident review meetings, the report read.

Near misses: Prioritising learning from near misses alongside actual customer/user-impacting incidents is another fledgling practice within the software industry, Verica claimed.

“We know from the aviation industry that focusing on near misses can provide deeper understanding of gaps in knowledge, misaligned mental models, and other forms of organisational and technical blind spots.”

However, deciding what constitutes a near miss is by no means straightforward. Example scenarios provided by Verica include: “System X is down, but users don’t notice because system Y serves cached or generic content for the duration of the outage. Is this an incident? [Also] Your backups start failing but the team doesn’t notice for a month, and customers don’t notice either. Is that an incident?”

“It’s not an overnight shift, but at the end of the day, it’s being honest about the contributing factors and the role that people play in coming up with solutions,” Nash states. “It sounds simple, but it takes time, and these are the concrete activities that will build better metrics.”