Providing concise metrics for decision-making without losing reliability.
Measuring software reliability
Service Level Objectives (SLOs) set expectations for system behavior. Users want to know what to expect from a service so they can judge whether it’s appropriate for their use case.
For instance, a team building a photo-sharing website might want to avoid a service that promises very strong durability and low cost in exchange for slightly lower availability, while the same service might be a perfect fit for an archival records management system.
To measure an SLO, it’s important to first define some Service Level Indicators (SLIs). Intuition, experience, and an understanding of what users want all help us define SLIs, and from those SLIs come SLOs and Service Level Agreements (SLAs). These measurements describe the basic properties of the metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives the team confidence that a service is healthy.
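As a concrete illustration, one common SLI is availability expressed as the ratio of successful requests to total requests, compared against an SLO target. The sketch below is a minimal, hypothetical example; the counter names and the 99.9% target are assumptions, not values from this article.

```python
# Hypothetical availability SLI: the fraction of well-formed requests
# that were served successfully. Names and targets are illustrative.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Return availability as a ratio in [0.0, 1.0]."""
    if total_requests == 0:
        return 1.0  # no traffic: nothing failed, treat as meeting the objective
    return successful_requests / total_requests

def meets_slo(sli: float, slo_target: float = 0.999) -> bool:
    """Compare a measured SLI against an SLO target (here, an assumed 99.9%)."""
    return sli >= slo_target

sli = availability_sli(999_531, 1_000_000)
print(sli)             # 0.999531
print(meets_slo(sli))  # True
```

In practice such ratios are usually computed from monitoring counters over a rolling window rather than from raw totals, but the structure of the comparison is the same.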
It’s both unrealistic and undesirable to insist that SLOs will be met 100% of the time: doing so can reduce the rate of innovation and deployment, require expensive, overly conservative solutions, or both. Instead, it is better to allow an error budget—a rate at which the SLOs can be missed—and track that on a daily or weekly basis. Upper management would benefit from a monthly or quarterly assessment.
The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
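The arithmetic behind an error budget is simple: the budget is whatever fraction of the quarter falls outside the SLO target. The sketch below assumes a 99.9% availability SLO and a 90-day quarter purely for illustration; neither figure comes from this article.

```python
# Hypothetical error-budget math for an availability SLO over one quarter.
SLO_TARGET = 0.999               # assumed 99.9% availability target
QUARTER_MINUTES = 90 * 24 * 60   # assumed 90-day quarter, in minutes

budget_fraction = 1 - SLO_TARGET                 # 0.1% of the quarter may be "bad"
budget_minutes = QUARTER_MINUTES * budget_fraction

def budget_remaining(bad_minutes: float) -> float:
    """Fraction of the quarter's error budget still unspent."""
    return 1 - bad_minutes / budget_minutes

print(round(budget_minutes, 1))        # 129.6 minutes of allowed unavailability
print(round(budget_remaining(30), 3))  # budget left after 30 bad minutes
```

Framing reliability this way turns "how much risk is acceptable?" into a number both teams can read off a dashboard: when budget remains, product development can ship; when it is exhausted, the focus shifts to reliability work.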
The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.