SLO, SLI, and SLA Explained
Measure first. Set targets second. Contract third.
SLIs, SLOs, and SLAs are three distinct concepts that only work in the right order. Most reliability problems come from teams that skip the first two and jump straight to the third.
The three concepts
SLI, SLO, SLA: what each one is
SLI — Service Level Indicator
The raw measurement. An SLI is a quantitative signal that tells you how a service is performing right now. Common SLIs are availability (percentage of successful requests), latency (99th percentile request duration), and error rate (percentage of requests that return an error). An SLI is just a number — it has no target attached to it yet.
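As a rough illustration of the three SLIs named above, the sketch below derives them from a batch of request records. The `Request` type and its field names are hypothetical, and treating only 5xx responses as failures is an assumption; your service may define "successful" differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request. The field names are hypothetical."""
    status: int         # HTTP status code returned to the caller
    duration_ms: float  # time taken to serve the request


def availability(requests: list[Request]) -> float:
    """Percentage of requests that did not fail with a server error (5xx)."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    ok = sum(1 for r in requests if r.status < 500)
    return 100.0 * ok / len(requests)


def error_rate(requests: list[Request]) -> float:
    """Percentage of requests that returned a server error."""
    return 100.0 - availability(requests)


def p99_latency_ms(requests: list[Request]) -> float:
    """99th percentile request duration, using the nearest-rank method."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    durations = sorted(r.duration_ms for r in requests)
    rank = math.ceil(99 * len(durations) / 100)  # 1-based rank
    return durations[rank - 1]
```

Each function returns a plain number with no target attached, which is exactly what separates an SLI from an SLO.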
SLO — Service Level Objective
The internal target. An SLO is the threshold you set for an SLI: "99.9% of requests will succeed over a rolling 28-day window." The SLO is a commitment the engineering team makes to itself. It determines when you are burning your error budget and when behavior should change — typically, deploying new features pauses when the budget runs low.
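To make the link between an SLO and its error budget concrete, here is a minimal sketch for a request-based availability SLO. The function names are illustrative, and the request counts in the usage note are made-up numbers, not measurements.

```python
def error_budget(slo_percent: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates within the window."""
    return total_requests * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_percent, total_requests)
    if budget == 0:
        # A 100% target tolerates no failures, so there is nothing to spend.
        raise ValueError("a 100% SLO has no error budget")
    return 1.0 - failed_requests / budget
```

With a 99.9% target and 10,000,000 requests in the 28-day window, the budget is about 10,000 failed requests. If 2,500 have already failed, roughly three quarters of the budget is left.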
SLA — Service Level Agreement
The external contract. An SLA is a formal agreement with customers or partners that specifies what reliability they can expect, and what remediation they receive when you fail to deliver it. SLAs typically include financial penalties (service credits) or operational consequences. Because SLAs carry real consequences, they should always be set looser than the corresponding SLO.
Why the order matters
You need SLIs before SLOs, and SLOs before SLAs
The dependency chain is strict. An SLO without an SLI is a guess — you are committing to a target you are not measuring. An SLA without an SLO behind it is a liability — you are making a customer-facing promise with no internal signal to tell you when you are at risk of breaching it.
The most common sequencing mistake is negotiating SLAs with enterprise customers before reliable SLI measurement is in place. The SLA looks reasonable when signed. Then an incident happens and nobody can tell whether the SLA was breached, because the measurement does not exist. Damage to the customer relationship follows.
The second most common mistake is choosing SLIs that are easy to collect rather than meaningful to users. Disk I/O throughput is easy to instrument. What the user experiences when the service is slow is harder to capture but more relevant. Google's Site Reliability Engineering literature popularized user journey SLIs: measuring availability and latency from the user's perspective rather than the infrastructure's. That reframe is still the right one.
A third mistake is setting 100% availability as an SLO. 100% availability means no error budget. No error budget means every planned deployment is a risk because you have no tolerance for any failure. Teams that set 100% SLOs either never deploy or lie about their incidents. Neither outcome serves users.
Platform team role
The platform team owns the tooling. Product teams own the values.
A common organizational question is: who sets the SLOs? The answer has two parts. The platform team owns the SLO infrastructure — the dashboards, the alerting, the error budget tracking tooling, the templates that make it straightforward to define a new SLO for a new service. The platform team also typically writes the error budget policy: the documented rules about what happens when budgets deplete (feature freeze, reliability sprint, postmortem required).
The actual SLO values — 99.9% vs 99.5%, over a 7-day window vs a 28-day window — are a product decision as much as an engineering decision. The product team understands what reliability means to users and what a service degradation costs in business terms. The engineering team understands what the system can realistically achieve at current maturity. Setting SLO values without both perspectives produces targets that are either too strict (burning error budget constantly, blocking all features) or too loose (never catching real user-impacting problems).
The error budget is the mechanism that makes SLOs actionable. It is the amount of unreliability the SLO permits: with a 99.9% target, 0.1% of requests in the window may fail. Whatever part of that allowance is still unspent is the budget you can spend on deployments. When it runs low, the policy kicks in. That policy is what gives the SLO teeth. Without it, a breached SLO produces a dashboard alert and no change in behavior.
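An error budget policy can be as simple as a mapping from remaining budget to an agreed response. The sketch below is one hypothetical policy: the action names echo the examples mentioned earlier (feature freeze, reliability sprint), while the 25% warning threshold is an assumption chosen only for illustration.

```python
from enum import Enum


class Action(Enum):
    SHIP_NORMALLY = "ship features as usual"
    RELIABILITY_SPRINT = "prioritize reliability work"
    FEATURE_FREEZE = "pause feature deployments"


def policy_action(remaining_fraction: float, warn_at: float = 0.25) -> Action:
    """Map the unspent share of the error budget to a team response.

    remaining_fraction is 1.0 for an untouched budget, 0.0 when it is
    fully spent, and negative when the SLO has been breached.
    warn_at is a hypothetical threshold; a real policy would set its own.
    """
    if remaining_fraction <= 0.0:
        return Action.FEATURE_FREEZE
    if remaining_fraction <= warn_at:
        return Action.RELIABILITY_SPRINT
    return Action.SHIP_NORMALLY
```

The value of writing the policy down, in code or in a document, is that the response to a depleted budget is decided before the incident rather than argued about during it.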
Error budget: the related concept
The error budget is the operational expression of an SLO. It quantifies how much unreliability you have agreed to tolerate, and it determines when behavior changes.
Common questions
SLO, SLI, SLA: direct answers
What's the difference between an SLO and an SLA?
An SLO is an internal commitment: the reliability threshold your team holds itself to. An SLA is an external contract with customers or partners that includes financial or operational consequences when breached. The SLO should be stricter than the SLA — if you are meeting your SLO, you have headroom before the SLA triggers. If your SLO and SLA are the same number, you have no buffer and every real incident risks a customer-facing breach.
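The buffer between the two numbers can be expressed as downtime. The following is a small sketch, assuming a time-based availability target and a 30-day window; the function name and the example targets are illustrative only.

```python
def sla_buffer_minutes(slo_percent: float, sla_percent: float,
                       window_days: float = 30) -> float:
    """Downtime between breaching the SLO and breaching the SLA."""
    if slo_percent < sla_percent:
        raise ValueError("the SLO should be at least as strict as the SLA")
    window_minutes = window_days * 24 * 60
    return window_minutes * (slo_percent - sla_percent) / 100.0
```

For a 99.9% SLO behind a 99.5% SLA, the buffer over 30 days is a little under three hours. When both targets are the same number, the function returns zero: the first minute past the SLO is also past the SLA.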
Who sets SLOs?
In practice, SLOs require input from both the product team (who understands what reliability means to users) and the engineering team (who understands what is achievable with the current system). Platform teams typically own the tooling and templates for defining and tracking SLOs. The actual values — what percentage, over what window, for what signal — come from product and engineering together. A platform team that sets SLOs unilaterally usually sets the wrong targets.
What happens when you breach an SLO?
An SLO breach means the error budget for that window is depleted. The appropriate response depends on your error budget policy: typically, new feature deployments pause and reliability work takes priority until the budget recovers. This is the mechanism that gives SLOs operational meaning — without a policy that changes team behavior when the budget runs out, an SLO is a dashboard number rather than a decision-making tool.
Is 99.9% a good SLO?
99.9% availability means roughly 44 minutes of downtime per month. Whether that is appropriate depends entirely on the service. For an internal developer tool, 99.9% is probably more than sufficient. For a payment processing API, it may not be. The question to ask is not "what percentage looks good" but "what does a user experience when availability is at 99.9% and what does that cost the business?" Start from user impact, then derive the target.
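The downtime figure follows directly from the target and the window length. Here is a minimal sketch of the arithmetic, assuming availability is measured as uptime over the window; the function name is illustrative.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: float) -> float:
    """Minutes of full downtime an availability target permits in a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for days in (28, 30, 30.44):  # 30.44 is roughly the average month length
        minutes = allowed_downtime_minutes(99.9, days)
        print(f"99.9% over {days} days: {minutes:.1f} minutes")
```

The "roughly 44 minutes" figure corresponds to an average-length month; a 30-day window allows about 43 minutes and a 28-day window about 40. A 100% target yields zero minutes, which is why it leaves no error budget at all.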
Build reliability into your platform
SLOs work when they have a policy behind them and a platform team enforcing it.
The Reliability Program engagement wires SLI instrumentation, SLO dashboards, and error budget policy into a system your engineering teams actually use.