
Error Budgets in Practice: How Platform Teams Stop the Reliability Argument

Without error budgets, the reliability debate has no numbers. Here is how to set one up, who owns what, and the three mistakes that make it stop working.


TL;DR. The reliability debate — ship faster vs. slow down and fix things — never resolves without numbers. An error budget converts a vague SLO into a quantified allowance. Platform teams provide the tooling. Product teams define the SLOs. Both agree on the policy for what happens when the budget runs out. Without all three, the budget is a number that nobody acts on.

Every engineering organization that has both a product function and a reliability function eventually has the same argument. Product wants to ship faster. Reliability wants to slow down and fix the system. Neither side has numbers. Both sides have anecdotes. The argument continues indefinitely, usually until something breaks badly enough that one side wins temporarily.

Error budgets are how Google's SRE organization addressed this. The concept is straightforward: you agree in advance on the reliability level your users need, you express that as a monthly allowance of acceptable failures, and you let the current state of that allowance determine deployment policy. The argument stops being about opinions and starts being about math.

For a platform team at a Series A–C company, this means two things. First, building the tooling that makes error budget tracking automated and visible. Second, getting all the relevant stakeholders to agree on what the budget means before it depletes for the first time.

What an error budget actually is

An SLO (Service Level Objective) is a target for service reliability. A common example: 99.9% availability over a rolling 30-day window. That SLO implies an error budget. If 99.9% is the target, the acceptable failure rate is 0.1%. Over 30 days (43,200 minutes), that is about 43 minutes of downtime.
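The arithmetic above can be sketched in a few lines. This is a minimal illustration, not part of any tool mentioned in this article; the function name `error_budget_minutes` is made up for the example, and it assumes a time-based availability SLO.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    slo_target is a fraction (0.999 for 99.9%). Hypothetical helper for illustration.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        budget = error_budget_minutes(target)
        print(f"{target:.2%} over 30 days -> {budget:.1f} minutes of budget")
```

Each extra nine divides the budget by ten, which is why the choice of target deserves an explicit conversation.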

When the system is degraded, the error budget depletes. With a rolling window, the budget does not accumulate beyond its full size; it recovers as older failures age out of the 30-day window. If the system is running well and the budget is full, deploy freely: the system has demonstrated it can absorb the risk. If the budget is nearly gone, stop new deployments until the budget recovers, or revise the SLO downward.
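One way to express "how much budget is left" is as a fraction of the failures the SLO allows in the current window. The sketch below assumes a request-based SLO (good requests divided by total requests); the function name and the sample numbers are illustrative only.

```python
def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent in the current window.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    Assumes a request-based SLO; hypothetical helper for illustration.
    """
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 1,000,000 requests at 99.9% allows about 1,000 failures; 250 have occurred.
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget remains")
```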

This framing does two things that arguing about reliability does not. First, it makes the cost of reliability explicit. An SLO of 99.99% gives you about 4 minutes of downtime per month. If your deployment process causes 2 minutes of degradation per deploy, you can run at most two deployments per month before you exhaust the budget. That is a concrete constraint that forces an honest conversation about what "high availability" actually means operationally.
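The deployment constraint in that example is a single division. A minimal sketch, assuming every deploy costs the same fixed number of degraded minutes and nothing else spends the budget (both simplifications); `max_deploys` is a hypothetical name.

```python
import math


def max_deploys(slo_target: float, minutes_per_deploy: float, window_days: int = 30) -> int:
    """Number of deploys the budget can absorb if each one costs a fixed
    amount of degradation and no other incident consumes budget."""
    budget_minutes = (1.0 - slo_target) * window_days * 24 * 60
    # round() guards against float noise before taking the floor
    return math.floor(round(budget_minutes / minutes_per_deploy, 9))


if __name__ == "__main__":
    print(max_deploys(0.9999, 2))   # 99.99% SLO, 2 minutes per deploy
    print(max_deploys(0.999, 2))    # 99.9% SLO, 2 minutes per deploy
```

In practice incidents unrelated to deploys also spend budget, so the real ceiling is lower than this upper bound.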

Second, it makes budget depletion a signal, not a failure. When the error budget is gone, the system is telling you something: either the SLO is set too high for the current infrastructure, or the system has reliability problems that need fixing before new features are safe to add. Both are useful information. Neither is a reason to assign blame.

Who sets the SLO and who manages the budget

This is where most error budget implementations break down. The SLO is set by the wrong party, the tooling lives in the wrong place, or nobody has agreed on the policy before it matters.

The right ownership structure:

Product team sets the SLO. They are closest to users and best positioned to answer the question: what reliability level do our users actually require? A 99.9% SLO over 30 days allows roughly 43 minutes of full outage, or a longer stretch of partial degradation if the SLO is measured per request. Is that acceptable? The product team knows. The platform team does not.

Platform team provides the tooling and the budget calculation. Tools such as Pyrra or Sloth generate Prometheus recording rules and multi-window, multi-burn-rate alert rules from a short SLO definition; hand-written Prometheus rules are the alternative. The platform team builds and maintains this. Product teams configure their SLOs through it. The calculation is automatic once the SLO is defined.

Both teams agree on the error budget policy. The policy is a document that exists before the first incident, not a conversation that happens during one. It defines what deployment behavior changes at different budget levels. Without the policy, having a budget number does not change behavior.

What an error budget policy looks like

A basic policy has three tiers:

When more than 50% of the monthly error budget remains — normal deployment cadence. The system is healthy. Ship.

When 25–50% remains — new deployments require a brief reliability review before proceeding. The review does not need to be formal. The question is: does this deployment introduce risk that, given our current budget level, we should be aware of?

When less than 25% remains — no new feature deployments until the budget recovers above 50% or the SLO is revised. Engineering effort redirects to reliability work: fixing known issues, improving test coverage, reducing error rates.
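The three tiers above can be written down as a small decision function, which is a useful test of whether the policy is unambiguous. This is a sketch under assumptions: the names are hypothetical, exactly 50% is treated as part of the review tier, and the `frozen` flag models the rule that a freeze holds until the budget is back above 50%.

```python
from enum import Enum


class DeployPolicy(Enum):
    NORMAL = "normal deployment cadence"
    REVIEW = "reliability review before deploying"
    FREEZE = "no feature deployments; reliability work only"


def deploy_policy(budget_remaining: float, frozen: bool = False) -> DeployPolicy:
    """Map the remaining budget fraction (0.0 to 1.0) to a policy tier.

    frozen: whether a freeze is already in effect. Once frozen, the freeze
    lifts only when the budget recovers above 50%.
    """
    if budget_remaining > 0.50:
        return DeployPolicy.NORMAL
    if frozen or budget_remaining < 0.25:
        return DeployPolicy.FREEZE
    return DeployPolicy.REVIEW


if __name__ == "__main__":
    for remaining, frozen in ((0.80, False), (0.40, False), (0.10, False), (0.40, True)):
        print(f"{remaining:.0%} remaining, frozen={frozen}: {deploy_policy(remaining, frozen).value}")
```

The last case is the one teams tend to argue about during an incident: a budget that has climbed back to 40% is still frozen, because the agreed exit condition is 50%.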

The specific thresholds are less important than having thresholds that both teams have agreed to. A policy with slightly wrong numbers that everyone follows is more valuable than a theoretically optimal policy that nobody acts on because it was never agreed to in the first place.

Three mistakes that make error budgets stop working

Setting the SLO at a number the infrastructure cannot support. A 99.99% SLO sounds good in a planning document. If your deployment pipeline causes 10 minutes of degradation per deploy, the first deploy of the month spends more than twice the roughly 4-minute monthly budget. The SLO should reflect what the current system can actually achieve, with room to improve. An honest baseline is more useful than an aspirational number that is always depleted.

Not reviewing SLOs quarterly. Systems change. The SLO that reflected reality when the system was handling 1,000 requests per day may not reflect reality at 100,000. A quarterly review — looking at actual reliability data against the SLO — keeps the SLO honest. It also creates a regular forcing function for the conversation about whether the current reliability level is meeting user expectations.

Treating budget depletion as a failure rather than a signal. This is the cultural mistake, and it undermines the technical implementation. If the engineering organization treats a depleted error budget as something to explain away or as evidence that the on-call team failed, the error budget stops producing honest data. Engineers find ways to exclude certain failures from the calculation. The SLO drifts away from reality. The budget number becomes theater.

Budget depletion is information. It says: the system is less reliable than the target, and something about that needs to change — either the target or the system. The value of the error budget framework is that it makes this conversation happen with data rather than opinions. Treating depletion as blame destroys that value.

The platform team's tooling role

The platform team does not define product teams' SLOs. It provides the infrastructure that makes SLO-based reliability management tractable.

In practice, this means four things. First, Prometheus recording rules that calculate error rate and availability from existing metrics. Second, a Grafana dashboard that shows current budget status across all services. Third, alert rules generated by Sloth or Pyrra that fire at meaningful burn rates: the "fast burn" alert fires when you are consuming budget roughly 14 times faster than the rate that would spend it exactly over the window (14.4x is the commonly used value), a pace that would exhaust a 30-day budget in about two days, which leaves time to respond. Fourth, a runbook template that on-call engineers follow when a budget alert fires.
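The burn-rate idea is easier to reason about with the formula written out. The sketch below is not the output of Sloth or Pyrra; it assumes the commonly cited fast-burn threshold of 14.4 and a long window paired with a short window (for example one hour and five minutes), and all function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 spends the whole budget in exactly one window.
    """
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if this burn rate is sustained."""
    return window_days * 24 / rate


def fast_burn_alert(long_window_ratio: float, short_window_ratio: float,
                    slo_target: float, threshold: float = 14.4) -> bool:
    """Fire only if both windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_ratio, slo_target) > threshold
            and burn_rate(short_window_ratio, slo_target) > threshold)


if __name__ == "__main__":
    print(f"{hours_to_exhaustion(14.4):.0f} hours to exhaust a 30-day budget at 14.4x")
    # 2% of requests failing against a 99.9% SLO is a burn rate of about 20.
    print(fast_burn_alert(0.02, 0.02, 0.999))
```

Requiring both windows is the design choice that keeps the alert from paging for a spike that has already ended.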

Product teams configure SLOs as a simple YAML definition. The platform tooling generates the recording rules and alert rules from that definition. The engineer who needs to add an SLO does not need to understand multi-window multi-burn-rate alert math. The platform team has translated that math into a configuration interface.
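To make the "definition in, rules out" idea concrete, here is a toy generator. It is not the Sloth or Pyrra schema and does not reproduce their output: the `SLODefinition` fields, the `http_requests_total` metric with a `code` label, the alert names, and the two window pairs are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLODefinition:
    """What a product team would declare. All field names are hypothetical."""
    service: str
    objective: float                     # e.g. 0.999 for 99.9%
    metric: str = "http_requests_total"  # assumed counter with a `code` label


# (name, long window, short window, burn-rate factor); illustrative values
BURN_WINDOWS = (
    ("fast", "1h", "5m", 14.4),
    ("slow", "6h", "30m", 6.0),
)


def error_ratio_expr(slo: SLODefinition, window: str) -> str:
    """PromQL for the share of 5xx responses over the given window."""
    selector = f'service="{slo.service}"'
    errors = f'sum(rate({slo.metric}{{{selector},code=~"5.."}}[{window}]))'
    total = f'sum(rate({slo.metric}{{{selector}}}[{window}]))'
    return f"({errors} / {total})"


def generate_alert_rules(slo: SLODefinition) -> list:
    """Turn one SLO definition into multi-window burn-rate alert rules."""
    rules = []
    for name, long_w, short_w, factor in BURN_WINDOWS:
        # An error ratio above factor * (1 - objective) means burning faster than `factor`.
        threshold = f"{factor * (1.0 - slo.objective):.6g}"
        expr = (f"{error_ratio_expr(slo, long_w)} > {threshold} and "
                f"{error_ratio_expr(slo, short_w)} > {threshold}")
        rules.append({"alert": f"{slo.service}-error-budget-{name}-burn", "expr": expr})
    return rules


if __name__ == "__main__":
    for rule in generate_alert_rules(SLODefinition(service="checkout", objective=0.999)):
        print(rule["alert"])
        print("  " + rule["expr"])
```

The product team only touches the two lines that name the service and the objective; everything else is the platform team's responsibility.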

This is the appropriate division. Platform teams own the standard and the infrastructure. Product teams own the configuration for their services. Neither team does the other's job.

For how error budgets connect to the broader observability picture, read Observability for Platform Teams. For what an SLO design looks like end-to-end, see the SLO design guide. For how Clouditive structures a reliability program engagement, see the Reliability Program.


Matías Caniglia


Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.
