How to Design SLOs That Engineering Teams Actually Use

Most SLO programs fail because the targets are detached from real user experience. A practical guide to SLO design that engineering teams will own, not ignore.

TL;DR. Most SLO programs fail not because the math is wrong but because the targets were chosen by the wrong people for the wrong reasons. SLOs that engineering teams use are grounded in what users actually experience, owned by the teams that can change the system, and connected to a decision that changes when the budget is gone. Here is the design process.

Service level objectives get introduced in most engineering organizations through a top-down initiative. Someone reads the SRE book, attends a conference, or sees a vendor demo. A reliability program gets stood up. Dashboards are built. Targets are set.

Six months later, the dashboards are ignored. The SLOs are considered aspirational at best, a compliance exercise at worst. The error budgets are never referenced in planning. Nobody has changed their behavior in response to a budget alert.

This is the most common SLO failure pattern, and it is not caused by poor tooling or incorrect percentages. It is caused by a design process that puts the numbers before the users.

Start with the user, not the system

The first question in SLO design is not "what can we achieve?" It is "what does a user experience when the service is degraded?"

This distinction determines whether the SLO will drive behavior or collect dust. An SLO that measures latency at the database layer is a system health metric dressed up as an SLO. An SLO that measures the percentage of checkout flows completing in under three seconds is measuring what the user actually experiences. When that SLO is breached, the team knows a user was impacted. When the database latency metric is breached, the team knows something is elevated — but not whether it matters.

The process that produces meaningful SLOs is called critical user journey mapping. For each service with an SLO, identify the two or three most important things users do with it. Measure those journeys, not the underlying system metrics. The SLO is a commitment about user experience backed by system measurement, not the other way around.
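A journey-based indicator can be sketched in a few lines. This is a minimal illustration, assuming journey attempts are already recorded as events; the `JourneyEvent` shape, the field names, and the three-second threshold are hypothetical stand-ins for whatever your telemetry provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JourneyEvent:
    """One attempt at a user journey (hypothetical event shape)."""
    journey: str        # e.g. "checkout"
    completed: bool     # did the user reach the end of the flow?
    duration_s: float   # time from the first step to the last


def journey_sli(events, journey, threshold_s):
    """Fraction of attempts at `journey` that completed within `threshold_s`."""
    attempts = [e for e in events if e.journey == journey]
    if not attempts:
        return None  # no traffic: the indicator is undefined, not 100 percent
    good = sum(1 for e in attempts if e.completed and e.duration_s < threshold_s)
    return good / len(attempts)


events = [
    JourneyEvent("checkout", True, 1.2),
    JourneyEvent("checkout", True, 2.8),
    JourneyEvent("checkout", True, 4.5),   # completed, but too slow
    JourneyEvent("checkout", False, 0.9),  # abandoned or errored
    JourneyEvent("search", True, 0.3),     # a different journey, ignored here
]
print(journey_sli(events, "checkout", threshold_s=3.0))  # 0.5
```

Note that a slow completion and a failed attempt both count against the indicator, because the user experiences both as a bad checkout.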

This matters doubly for AI-assisted workflows. When AI agents call your services at machine speed, an aggregate p99 latency SLO can look healthy while human-facing journeys are degraded, because high-volume agent traffic skews the aggregate statistics. Designing SLOs against explicit user journeys gives you a measurement that holds for both human and agent traffic.
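The effect is easy to reproduce with synthetic numbers. The sketch below is illustrative only: the traffic volumes, latencies, and the three-second threshold are invented, and `percentile` is a simple nearest-rank helper rather than any monitoring tool's implementation.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer `pct` between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def good_ratio(latencies_s, threshold_s):
    """Fraction of requests faster than the threshold."""
    return sum(1 for x in latencies_s if x < threshold_s) / len(latencies_s)


# Synthetic traffic: agents send 99,000 fast calls, humans send 1,000
# checkout requests of which half take six seconds.
agent_latencies = [0.05] * 99_000
human_latencies = [1.0] * 500 + [6.0] * 500

combined = agent_latencies + human_latencies
print(percentile(combined, 99))          # 0.05 -> aggregate p99 looks healthy
print(good_ratio(human_latencies, 3.0))  # 0.5  -> half of human checkouts are slow
```

Segmenting the indicator by journey, rather than by request, is what exposes the degradation in this example.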

Make the target a decision, not a threshold

An SLO that nobody acts on is not an SLO. It is a dashboard.

The difference between an SLO and a dashboard is that an SLO is connected to a decision. The decision is: when 50 percent of the error budget remains, reliability engineering takes priority over new features. When 20 percent remains, we stop non-essential deployments. When the budget is fully consumed, we run a formal reliability review before resuming normal velocity.

Without the decision, the error budget is a number that counts down and then resets. With the decision, the error budget is the mechanism by which reliability demands are automatically converted into engineering priorities. The team does not need to wait for an incident review or a leadership directive. The budget state drives the work.

This is the connection between SLOs and team behavior that most SLO programs miss. The target is not the output. The decision tree triggered by the budget state is the output.
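The decision tree can be written down as code so that the budget state, not a meeting, selects the action. This is a sketch under stated assumptions: the thresholds are read as budget remaining, the function and enum names are hypothetical, and the good and total counts would come from your own SLI source.

```python
from enum import Enum


class BudgetAction(Enum):
    NORMAL = "normal feature velocity"
    PRIORITIZE_RELIABILITY = "reliability work takes priority over new features"
    FREEZE_NONESSENTIAL = "stop non-essential deployments"
    RELIABILITY_REVIEW = "formal reliability review before resuming normal velocity"


def budget_remaining(slo_target, good, total):
    """Fraction of the error budget left in the window; negative when overspent."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1 - slo_target) * total
    return 1 - (total - good) / allowed_bad


def action_for(remaining):
    """Map the remaining budget fraction to the agreed decision."""
    if remaining <= 0:
        return BudgetAction.RELIABILITY_REVIEW
    if remaining <= 0.20:
        return BudgetAction.FREEZE_NONESSENTIAL
    if remaining <= 0.50:
        return BudgetAction.PRIORITIZE_RELIABILITY
    return BudgetAction.NORMAL


# 99.9 percent target, 1,000,000 journeys, 600 bad: about 40 percent of budget left.
remaining = budget_remaining(0.999, good=999_400, total=1_000_000)
print(action_for(remaining).value)
```

Keeping the thresholds in one reviewed function makes the policy explicit and easy to change deliberately.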

The target range problem

One of the most common SLO design mistakes is setting targets too high.

A 99.9 percent availability SLO on a service whose actual availability is 99.7 percent burns error budget at three times the allowed rate, so a 30-day budget is gone in about ten days. From day eleven on, the team is over budget. The budget becomes meaningless because it is always exhausted.

The opposite mistake is setting targets too low. A 95 percent availability SLO on a service that rarely falls below 99.5 percent produces an error budget that is barely touched: at 99.5 percent, the service spends only a tenth of it. The team cannot tell whether their reliability work is having any effect because the budget is always nearly full.
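The arithmetic behind both mistakes is the burn rate: the observed failure fraction divided by the failure fraction the target allows. A small sketch, assuming a 30-day window and a constant failure rate (both simplifications), with hypothetical function names:

```python
def burn_rate(slo_target, actual_availability):
    """How fast the budget is spent relative to the rate the target allows."""
    return (1 - actual_availability) / (1 - slo_target)


def days_until_exhausted(slo_target, actual_availability, window_days=30):
    """Days until the window's budget is gone at a constant burn rate."""
    rate = burn_rate(slo_target, actual_availability)
    if rate == 0:
        return None  # no failures, the budget is never consumed
    return window_days / rate


# Target too high: 99.9 percent target, 99.7 percent actual.
print(round(burn_rate(0.999, 0.997), 2))             # 3.0
print(round(days_until_exhausted(0.999, 0.997), 1))  # 10.0

# Target too low: 95 percent target, 99.5 percent actual.
print(round(burn_rate(0.95, 0.995), 2))              # 0.1
```

A burn rate of 1.0 means the budget runs out exactly at the end of the window. A rate of 3.0 exhausts it in a third of the window, and a rate of 0.1 leaves 90 percent unspent.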

The target should be calibrated to where the system actually performs: just inside measured performance, so the budget is neither permanently exhausted nor permanently full. A service with measured 99.6 percent availability should have an initial SLO of 99.5 percent (in range, achievable, meaningful) while the team works toward improving actual performance. Raise the target when the system reliably exceeds it.

Starting with a target the service already meets, rather than an aspirational target, is not pessimism. It is the practice that produces a usable error budget from day one.
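One way to make this calibration repeatable is a small helper that derives the starting target from measured availability. The rule below is an assumption for illustration, not a standard: subtract a margin of 0.1 percentage points, round down to a 0.1-point grid, and treat "reliably exceeds" as every recent window beating the target by the same headroom.

```python
from decimal import Decimal, ROUND_FLOOR


def initial_target(measured_pct, margin_pct=Decimal("0.1"), step_pct=Decimal("0.1")):
    """Measured availability minus a margin, rounded down to the step grid."""
    raw = measured_pct - margin_pct
    steps = (raw / step_pct).to_integral_value(rounding=ROUND_FLOOR)
    return steps * step_pct


def should_raise(target_pct, recent_window_pcts, headroom_pct=Decimal("0.1")):
    """True when every recent window beat the target by at least the headroom."""
    return bool(recent_window_pcts) and all(
        m >= target_pct + headroom_pct for m in recent_window_pcts
    )


print(initial_target(Decimal("99.6")))  # 99.5
print(should_raise(Decimal("99.5"),
                   [Decimal("99.7"), Decimal("99.8"), Decimal("99.7")]))  # True
```

`Decimal` is used so that percentages such as 99.6 stay exact instead of picking up binary floating-point noise.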

SLO ownership: the team that can change the system

An SLO without an owner is not an SLO. It is a metric.

Ownership means the team listed as the SLO owner has the ability to change the system when the budget is consumed. Not just the ability to file tickets. The actual ability to modify the deployment, the architecture, the dependencies, and the operational practice that determines whether the SLO is met.

This has an organizational implication that many SLO programs do not resolve. Platform SLOs that are owned by a team that does not have deploy access to the infrastructure layer will never produce behavior change. The team can observe the budget state, write post-mortems, and escalate — but they cannot act on the signal themselves. The ownership model and the deployment model must be aligned, or the SLO ownership is nominal.

The test is simple: if the error budget is consumed tomorrow, what specific action does the SLO owner take, and do they have the access and authority to take it? If the answer is not concrete, the ownership is not real.
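The ownership test can be captured as a record that is checked whenever an SLO is registered. This is a hypothetical sketch: the `SloOwnership` fields and the permission names are invented, and real access data would come from your own deployment and identity systems.

```python
from dataclasses import dataclass


@dataclass
class SloOwnership:
    slo_name: str
    owning_team: str
    exhaustion_action: str     # what the owner does when the budget is gone
    required_permissions: set  # access that action needs
    team_permissions: set      # access the team actually has


def ownership_gaps(o):
    """Return the reasons this ownership is nominal; an empty list means it is real."""
    gaps = []
    if not o.exhaustion_action.strip():
        gaps.append("no concrete action defined for budget exhaustion")
    missing = o.required_permissions - o.team_permissions
    if missing:
        gaps.append("missing access: " + ", ".join(sorted(missing)))
    return gaps


platform_slo = SloOwnership(
    slo_name="ingress-availability",
    owning_team="platform-observability",
    exhaustion_action="roll back the last ingress config change",
    required_permissions={"deploy:ingress", "rollback:ingress"},
    team_permissions={"read:dashboards", "create:tickets"},
)
print(ownership_gaps(platform_slo))
# ['missing access: deploy:ingress, rollback:ingress']
```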

Connecting SLOs to on-call

The SLO program and the on-call program are not separate. They are the same system described from different angles.

An SLO describes the target reliability a service should maintain. The on-call rotation is the mechanism that responds when the target is breached. A team that has SLOs but whose on-call rotation is not aware of those SLOs has disconnected the measurement from the response.

Practically, this means every on-call runbook should reference the service's SLOs and error budget state in its triage procedure. An on-call engineer who sees a budget alert should have a documented path: what to check, what constitutes an incident, and when to escalate. SLO awareness in the runbook converts the error budget from a planning tool into an operational signal.

This integration also makes the on-call rotation more sustainable. When the on-call engineer's response is guided by error budget state rather than raw alert volume, the signal-to-noise ratio improves. Alerts tied to budget consumption are alerts that matter. Alerts that are not affecting the error budget are candidates for tuning or removal.
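A periodic alert review along these lines might look like the sketch below. It assumes you can count, per alert, how often it fired while the error budget was actually being consumed; the `AlertStats` shape, the alert names, and the 50 percent overlap cutoff are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertStats:
    name: str
    fired: int              # times the alert fired in the review period
    fired_during_burn: int  # of those, times the error budget was being consumed


def tuning_candidates(stats, min_burn_overlap=0.5):
    """Alerts that mostly fire while the budget is untouched."""
    return [
        a.name
        for a in stats
        if a.fired > 0 and a.fired_during_burn / a.fired < min_burn_overlap
    ]


review = [
    AlertStats("checkout_error_rate", fired=12, fired_during_burn=11),
    AlertStats("disk_usage_high", fired=40, fired_during_burn=2),
    AlertStats("pod_restart", fired=25, fired_during_burn=5),
    AlertStats("never_fired", fired=0, fired_during_burn=0),
]
print(tuning_candidates(review))  # ['disk_usage_high', 'pod_restart']
```

The output is a list of candidates for a human to review, not an automatic deletion list. An alert can be rare and still protect against a failure the budget has not yet seen.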

The SLO that fails the Clouditive test

The Foundations Framework Pillar 02 is Signal Integrity: the capacity to produce metrics that are reproducible, defensible, and accurately describe the current state of delivery. SLOs are a reliability measurement. They either have Signal Integrity or they do not.

An SLO fails the Signal Integrity test when:

The measurement method and the SLO definition are not aligned. You measure latency at the load balancer but the SLO is defined as end-to-end response time including third-party dependencies. The number on the dashboard and the commitment to users describe different things.

The SLO was set without a measurement baseline. You have a 99.9 percent SLO but no data on what actual availability was before the SLO existed. You cannot tell whether you improved or whether you started with a target below actual performance.

The SLO has not been reviewed since it was set. The service architecture changed six months ago. The user journey the SLO was designed to protect is now served by a different component. The SLO is measuring the old path.

Signal Integrity for SLOs requires a review at least once a year: verify that the measurement method and the SLO definition still match, that the target is still calibrated to actual performance, and that the user journey the SLO protects is still the right one.
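The three failure modes above can be checked mechanically if each SLO carries a little metadata. A minimal sketch, assuming a hypothetical `SloRecord` with free-text measurement points and a 365-day review limit:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class SloRecord:
    name: str
    defined_at: str   # where the SLO definition says experience is measured
    measured_at: str  # where the indicator is actually collected
    baseline_availability: Optional[float]  # measured before the target was set
    last_reviewed: date


def signal_integrity_failures(slo, today, max_age=timedelta(days=365)):
    """List the reasons an SLO fails the review; an empty list means it passes."""
    failures = []
    if slo.defined_at != slo.measured_at:
        failures.append(
            f"defined at {slo.defined_at} but measured at {slo.measured_at}"
        )
    if slo.baseline_availability is None:
        failures.append("no measurement baseline")
    if today - slo.last_reviewed > max_age:
        failures.append("not reviewed in the last year")
    return failures


slo = SloRecord(
    name="checkout-latency",
    defined_at="end-to-end",
    measured_at="load-balancer",
    baseline_availability=None,
    last_reviewed=date(2023, 1, 10),
)
for reason in signal_integrity_failures(slo, today=date(2024, 6, 1)):
    print(reason)
```

Whether the protected journey is still the right one remains a human judgment. The check only flags when that judgment is overdue.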

Where to start if you have none of this yet

If your organization has no SLOs, the highest-leverage starting point is a single service and a single user journey.

Pick the service that is most visible to users or most critical to revenue. Map the one journey that matters most. Define the SLO based on that journey's performance over the last ninety days. Set the target just inside the performance measured over that period, so the service already meets it. Define the decision that triggers when 50 percent of the monthly budget is consumed. Assign an owner with real deploy access.

Run that for ninety days. Observe whether the budget was consumed, whether the decision actually changed engineering priorities, and whether the on-call team referenced the SLO in their triage. Adjust based on what you learn. Then expand to the next service.

One SLO that produces behavior change is worth more than twenty SLOs that sit on a dashboard. The program grows from the one that works.


If you want to build an SLO program that connects to engineering decisions and on-call practice, the SRE Program engagement covers SLO design, error budget implementation, and on-call optimization as a structured delivery. The Foundations Assessment includes an operational accountability baseline that surfaces the current state of reliability measurement before any SLO design begins.

Matías Caniglia

Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.
