How SLOs and error budgets improve app reliability

These practical tools help devops teams balance operational issues and the pursuit of perfection.

Once upon a time, business sponsors pestered development teams about when a feature would be done or a release ready for deployment. Today, agile development teams use tools like Jira Software to track epics and releases in burn-down charts to answer these questions, review priorities, and consider adding or reducing scope.

Operations teams have had similar challenges defining, measuring, and managing service-level agreements (SLAs), which are often used to establish a business service’s or application’s target uptime or performance. For example, an SLA might specify that a business service must have four 9s of reliability (99.99 per cent), which means that the application can be down for only about 52.6 minutes a year.
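
The downtime arithmetic behind such a target can be sketched in a few lines of Python. This is a minimal illustration, assuming a 365-day year; the function name and the list of targets are made up for the example and do not come from any particular tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return how many minutes of downtime an availability target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    # Illustrative targets: three, four, and five 9s.
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target} per cent -> {minutes:.1f} minutes of downtime a year")
```

Under these assumptions, four 9s works out to a little under 53 minutes a year, and each additional 9 cuts the allowance by a factor of ten.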

The difference between a service-level agreement and a service-level objective

Defining SLAs might work well for reporting, and IT managers use them to report whether business services met their targets. But SLAs are typically broad metrics and often don’t specify when, where, and for whom service levels should differ. For example, an e-commerce application may seek only three 9s of reliability during slow times but require much higher reliability during holiday shopping periods.

SLAs are also not very predictive or actionable. You are more likely to hear, “The devops team missed its SLA,” without any feedback about improvement. Compare that to “The service is not on track to meet its objectives, so what steps will the team prioritise to improve reliability?”

These issues are why site reliability engineers changed the terms, practices, mindset, and tools from SLA to SLO (service-level objective).

Site reliability engineering (SRE) was introduced at Google in 2003, and its practices were published in the Site Reliability Engineering book in 2016. Key SRE principles include embracing risk, eliminating toil, instituting release engineering, and simplifying architectures.

SRE principles and functions work well with devops and IT Infrastructure Library (ITIL). The SRE’s primary roles include incident defence and promoting dev practices that improve application performance and reliability.

Creating SLOs is a foundational SRE practice. They can be defined at the business level or more granularly at the application, API, or data levels. They measure successful versus failed events during a defined duration window.

For example, an API with a monthly service level of three 9s must successfully respond to 99.9 per cent of API requests over 30 days. If this API handles 100,000 service requests in 30 days, it can have up to 100 failed events during that period and still meet its SLO.

Business leaders can specify SLOs by dimensions such as peak times, customer type, user type, or business activity. For example, the SLO for peak periods, or for customers trying to buy products, might be increased to four 9s (99.99 per cent). Since SLOs are measured against specific events or personas, it’s more straightforward to map them to meaningful business dimensions.
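
One way to picture segment-level SLOs is as a small table of targets that each batch of events is checked against. The sketch below is hypothetical: the segment names, the targets, and the rule that a window with no events counts as compliant are all assumptions made for illustration. Exact fractions are used so that a result sitting right on the threshold is not distorted by floating-point rounding.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class SegmentSlo:
    """A success-ratio target for one business segment (hypothetical model)."""
    name: str
    target: Fraction  # required share of successful events, e.g. 999/1000


# Hypothetical segments: everyday browsing versus checkout during peak periods.
SEGMENT_SLOS = {
    "browse": SegmentSlo("browse", Fraction(999, 1000)),                   # three 9s
    "checkout_peak": SegmentSlo("checkout_peak", Fraction(9999, 10000)),   # four 9s
}


def meets_slo(slo: SegmentSlo, total_events: int, failed_events: int) -> bool:
    """Return True if the observed success ratio reaches the segment's target."""
    if total_events == 0:
        return True  # assumption: no traffic means nothing was violated
    success_ratio = Fraction(total_events - failed_events, total_events)
    return success_ratio >= slo.target


if __name__ == "__main__":
    # The same 100 failures in 100,000 events pass one target and fail the other.
    for slo in SEGMENT_SLOS.values():
        print(slo.name, meets_slo(slo, total_events=100_000, failed_events=100))
```

The same failure count can satisfy a three-9s segment while breaching a four-9s one, which is the point of tying targets to business dimensions.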

Error budgets help devops teams improve reliability

SLOs also define an error budget: how many failed events the service can experience and still meet its SLO. In the previous example, an API with an SLO of three 9s (99.9 per cent) that handles 100,000 requests during a 30-day window can have 100 failed events, so its error budget is 100 events.
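
The error budget follows directly from the SLO target and the event count. The snippet below is a rough sketch with illustrative function names; it rounds the budget down to a whole number of events, which is an assumption rather than a rule from any SRE tool.

```python
from fractions import Fraction


def error_budget_events(slo_target: Fraction, total_events: int) -> int:
    """Return how many failed events the window can absorb and still meet the SLO."""
    if not 0 <= slo_target <= 1:
        raise ValueError("slo_target must be between 0 and 1")
    # Round down: a fractional event cannot fail, so only whole events count.
    return int((1 - slo_target) * total_events)


def budget_remaining(slo_target: Fraction, total_events: int, failed_events: int) -> int:
    """Return the unspent budget; a negative value means the SLO was missed."""
    return error_budget_events(slo_target, total_events) - failed_events


if __name__ == "__main__":
    three_nines = Fraction(999, 1000)
    print(error_budget_events(three_nines, 100_000))  # 100, as in the example above
    print(budget_remaining(three_nines, 100_000, failed_events=40))
```

A negative remainder is a convenient signal that the window has already overspent its budget.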

I discussed SLOs and error budgets with Kit Merker, COO of Nobl9, at a webinar on ways SRE practices help deliver exceptional service. He explains the difference in mindset in managing with SLOs and error budgets instead of SLAs.

“We want to gain confidence and deliver excellence with our customers. But at the same time, we have a business to run, and we want to do that sustainably. I think of the difference between the edge of excellence (the SLO) and hard-to-achieve perfection. Once we define the SLO, we can make really important automated business decisions about engineering our solution.”

The edge of excellence refers to how much unreliability can be tolerated by end users and performance metrics. The important decision Merker alludes to is when and how much teams should prioritise development work to improve application reliability and performance, compared to work to enhance end-user experiences, add features, or address other business needs.

Error budgets provide a metric for making these decisions. Teams that consistently miss their SLOs should prioritise more application reliability improvements. And when investments in reliability are followed by fewer errors, agile devops teams can point to the benefits of proactive reliability work.

Error budgets can also help devops teams optimise their time. Teams that are in danger of missing their SLOs may elect to prioritise responding to incidents, handling support escalations, or addressing defects. On the other hand, teams operating well under their error budgets might avoid chasing perfection and stay on the strategic course by completing their sprints, releases, and feature deployments.

Giving more decision-making authority to devops teams on whether they prioritise operational issues may sound counterintuitive, but chasing perfection has a high cost.

Instead, IT leaders should define a service-level objective policy to help teams know how to respond when they have exhausted their error budget or when SLOs are at risk. Leaders can also set operating principles on how teams can “spend” their error budgets or can recommend actions when missing SLOs.
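
Such a policy can be as simple as a rule that maps budget consumption to a recommended response. The following sketch is purely illustrative: the 75 per cent "at risk" threshold, the three actions, and all names are assumptions, and a real policy would be agreed between IT leaders and their teams.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical responses an SLO policy might recommend."""
    CONTINUE_ROADMAP = "continue planned sprint and release work"
    PRIORITISE_OPERATIONS = "prioritise incidents, escalations, and defects"
    FOCUS_ON_RELIABILITY = "pause feature work and focus on reliability"


def recommend_action(budget_events: int, failed_events: int,
                     at_risk_ratio: float = 0.75) -> Action:
    """Map error budget consumption to a recommended action.

    at_risk_ratio is an assumed threshold: the share of the budget that,
    once spent, marks the SLO as at risk.
    """
    if budget_events <= 0:
        raise ValueError("budget_events must be positive")
    consumed = failed_events / budget_events
    if consumed >= 1.0:
        return Action.FOCUS_ON_RELIABILITY   # budget exhausted
    if consumed >= at_risk_ratio:
        return Action.PRIORITISE_OPERATIONS  # SLO at risk
    return Action.CONTINUE_ROADMAP           # comfortably within budget


if __name__ == "__main__":
    for failed in (20, 80, 120):
        print(failed, "->", recommend_action(100, failed).value)
```

Keeping the thresholds as parameters lets each team tune how early the policy shifts its priorities.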

SLOs affect development, SRE, and ops. They can also elevate the role of quality assurance. When production defects correlate with errors, it’s a signal to increase test automation and align it with the missed SLOs.

Lastly, just as devops teams use epic and release burn-downs to track progress against business objectives, creating error rate burn-downs helps teams forecast whether they are on track to meet SLOs.
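
A simple burn-down forecast can be sketched by projecting the failures seen so far to the end of the window. The example assumes failures keep arriving at the same average daily rate and that the budget for the window is already known; both are simplifications, and the function names are illustrative.

```python
def projected_failures(failed_so_far: int, days_elapsed: int,
                       window_days: int = 30) -> float:
    """Linearly project the failure count to the end of the SLO window."""
    if not 0 < days_elapsed <= window_days:
        raise ValueError("days_elapsed must be between 1 and window_days")
    return failed_so_far * window_days / days_elapsed


def on_track(budget_events: int, failed_so_far: int, days_elapsed: int,
             window_days: int = 30) -> bool:
    """Return True if the projected failures stay within the error budget."""
    projection = projected_failures(failed_so_far, days_elapsed, window_days)
    return projection <= budget_events


if __name__ == "__main__":
    # Ten days into a 30-day window with an error budget of 100 events.
    print(on_track(budget_events=100, failed_so_far=30, days_elapsed=10))
    print(on_track(budget_events=100, failed_so_far=40, days_elapsed=10))
```

Plotting the remaining budget day by day against this projection gives the error-rate equivalent of a release burn-down chart.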

Can SLOs and error budgets change IT’s culture?

The use of error budgets is still in early adoption. In the recently published 2021 SRE report, 50 per cent of respondents continuously refine their SLOs, but only 20 per cent of SREs regularly use error budgets.

I spoke to Thad West, CEO of Isos Technology, about using SLOs and error budgets to change the operational mindset. “You find many ops groups operating in a hero mode, flying from incident to incident, and that can sometimes become their identity. It can really burn out the IT team, and it’s a detriment to transformation.”

With organisations developing more mission-critical applications and expecting higher service levels, IT teams must seek ways to balance operations and innovation. Establishing SLOs helps define more realistic operational goals, and defining error budget policies and principles equips teams with prudent decision-making tools.