What Is an Error Budget?
Unreliability you can spend intentionally rather than suffer accidentally.
An error budget is the amount of unreliability your team has agreed to tolerate before breaching its SLO. It turns reliability from a vague aspiration into a quantified resource that engineering and product teams can manage deliberately.
Definition
What an error budget is and how it is calculated
An error budget is the complement of an SLO target. If your SLO is 99.9% availability over a 30-day window, your error budget is 0.1% — approximately 43 minutes of allowed downtime per month. That is not a problem to eliminate. It is a resource to spend deliberately.
The budget framing matters. Treating those 43 minutes as "the amount we can fail" produces reactive operations: incidents burn budget, the team scrambles, the budget recovers, repeat. Treating them as a resource produces deliberate operations: the team decides how to spend the budget (new deployments, experiments, planned maintenance), and the budget functions as a constraint on how much risk to take at any given time.
Error budgets apply to time-based SLOs (availability over a period) and to request-based SLOs (success rate over a number of requests). For a service handling one million requests per month with a 99.9% success rate SLO, the budget is 1,000 allowed failures per month. When the team has used 900 of those failures, 90% of the budget is spent, and the team should start reducing deployment risk before the remainder is gone.
Error budget calculation
Error budget = (1 - SLO target) × measurement window length
Example: 99.9% SLO over 30 days → 0.001 × 43,200 minutes = 43.2 minutes of allowed downtime per month
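The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names time_budget and request_budget are hypothetical, and the SLO target is assumed to be given as a fraction (0.999 for 99.9%).

```python
from datetime import timedelta


def _check_target(slo_target: float) -> None:
    # Assumption: targets are fractions strictly between 0 and 1.
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")


def time_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed downtime for a time-based SLO: (1 - target) x window length."""
    _check_target(slo_target)
    return window * (1.0 - slo_target)


def request_budget(slo_target: float, total_requests: int) -> int:
    """Allowed failed requests for a request-based SLO."""
    _check_target(slo_target)
    # round() absorbs floating-point noise in (1 - target).
    return round(total_requests * (1.0 - slo_target))
```

Calling time_budget(0.999, timedelta(days=30)) gives the 43.2 minutes from the example, and request_budget(0.999, 1_000_000) gives the 1,000 allowed failures from the request-based case.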
How teams use error budgets
The budget changes deployment and reliability decisions
The fundamental mechanism: when the budget is full (no incidents, SLO comfortably met), deploy as fast as you want. When the budget is depleted (the SLO window's tolerance has been used up), slow down and fix things. This aligns product and engineering incentives without a manager mediating every reliability-vs-feature conversation.
A product team that wants to ship a risky new feature has an incentive to maintain the error budget — if the budget is depleted by previous deployments, the policy blocks the new feature too. That creates a natural pressure toward reliability that does not require the SRE team to be the permanent bearer of bad news.
In practice, most teams track their budget consumption in a dashboard: current reliability vs. SLO target, percentage of budget consumed for the current window, burn rate (how fast the budget is being spent). When burn rate is high, the team investigates before the budget depletes, not after.
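The two dashboard numbers mentioned above, budget consumed and burn rate, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and burn rate is normalized the way it commonly is in SRE practice (observed error rate divided by the error rate the SLO allows), so a value of 1.0 means the budget would run out exactly at the end of the window. The text above only defines burn rate as "how fast the budget is being spent", so treat that normalization as an assumption.

```python
def budget_consumed(failed: int, expected_total: int, slo_target: float) -> float:
    """Fraction of the window's error budget used so far.

    expected_total is the request volume for the whole SLO window.
    """
    allowed_failures = expected_total * (1.0 - slo_target)
    return failed / allowed_failures


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate over a recent interval, relative to the allowed rate.

    1.0 means spending exactly on pace; 2.0 means the budget would be gone
    halfway through the window if the rate held.
    """
    if total == 0:
        return 0.0  # no traffic in the interval, nothing burned
    return (failed / total) / (1.0 - slo_target)
```

For the request-based example earlier, budget_consumed(900, 1_000_000, 0.999) is about 0.9. An interval with 20 failures in 10,000 requests against the same SLO has a burn rate of about 2, which is the kind of signal that should prompt investigation before the budget depletes.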
Error budget policies
The rules that give error budgets operational meaning
An error budget without a policy is a dashboard metric. The policy is the document that specifies what changes when the budget depletes: which deployments are blocked, which are allowed (security patches, bug fixes), who approves exceptions, and what triggers the policy to lift (budget recovery over a defined window, postmortem completed and action items closed).
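One way to make such a policy concrete is to express its rules as data plus two small decision functions. The sketch below is hypothetical throughout: the field names, the change-type labels, the approver role, and the 10% recovery threshold are illustrative assumptions, not values the policy must use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetPolicy:
    # Change types still allowed while the budget is depleted (hypothetical labels).
    exempt_changes: frozenset = frozenset({"security_patch", "bug_fix"})
    # Who may approve exceptions (hypothetical role name).
    exception_approver: str = "engineering-director"
    # Conditions for lifting the freeze (hypothetical thresholds).
    min_remaining_to_lift: float = 0.10
    require_postmortem_closed: bool = True


def is_change_allowed(policy: ErrorBudgetPolicy, change_type: str,
                      budget_remaining: float,
                      exception_approved: bool = False) -> bool:
    """budget_remaining is the unspent fraction of the budget (0.0 = depleted)."""
    if budget_remaining > 0.0:
        return True
    return change_type in policy.exempt_changes or exception_approved


def freeze_can_lift(policy: ErrorBudgetPolicy, budget_remaining: float,
                    postmortem_closed: bool) -> bool:
    """True once the budget has recovered and any required postmortem is closed."""
    if budget_remaining < policy.min_remaining_to_lift:
        return False
    return postmortem_closed or not policy.require_postmortem_closed
```

Writing the rules down this way forces the questions the policy has to answer (what is blocked, what is exempt, who approves, what lifts the freeze) to be settled ahead of time rather than during an incident.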
A well-designed error budget policy is written before the budget depletes, reviewed by product and engineering leadership, and referenced during incidents rather than negotiated in the moment. The policy removes the need for ad-hoc conversations about whether a particular deployment is acceptable given the current reliability state.
Policies that are too strict become obstacles: every minor deployment requires a waiver request and the policy gets ignored. Policies that are too loose have no operational effect. The right calibration reflects the business cost of reliability breaches for the specific service and the team's current capacity to recover quickly when things go wrong.
Platform team ownership
What the platform team owns in the error budget system
The platform team owns two things in the error budget system. First, the tooling: SLO dashboards, error budget tracking, alerting that fires before the budget depletes rather than after, and the templates that make it straightforward for a new service to define its first SLO and start tracking its budget.
Second, the policy enforcement mechanism: the automated gate that blocks deployments when the error budget policy criteria are met. Without automation, the policy depends on someone checking the dashboard before every deployment — which means it gets skipped when teams are moving fast. An automated deployment gate that checks current budget consumption before allowing a production deploy makes the policy real.
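A deployment gate of this kind can be small. The sketch below assumes the pipeline can already obtain failed and total request counts for the current SLO window from its monitoring system; that lookup is out of scope here. The names gate_decision, GateDecision, and the block_at threshold are hypothetical.

```python
from typing import NamedTuple


class GateDecision(NamedTuple):
    allowed: bool
    reason: str


def gate_decision(failed_requests: int, total_requests: int,
                  slo_target: float, block_at: float = 1.0) -> GateDecision:
    """Block the deploy once the consumed share of the budget reaches block_at.

    block_at=1.0 blocks only when the budget is fully spent; a lower value
    (for example 0.9) blocks earlier. Both are illustrative choices.
    """
    # round() absorbs floating-point noise in (1 - slo_target).
    allowed_failures = round(total_requests * (1.0 - slo_target))
    if allowed_failures == 0:
        consumed = float("inf") if failed_requests > 0 else 0.0
    else:
        consumed = failed_requests / allowed_failures
    if consumed >= block_at:
        return GateDecision(False, f"blocked: {consumed:.0%} of error budget consumed")
    return GateDecision(True, f"allowed: {consumed:.0%} of error budget consumed")
```

A pipeline step could call gate_decision before a production deploy and fail the step when allowed is False, printing reason so the team sees why. Exemptions for security patches or approved exceptions would come from the error budget policy and are deliberately left out of this sketch.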
The SLO values themselves — the targets that determine how large each service's budget is — are not the platform team's to set unilaterally. Those values require product context the platform team does not have: what reliability means to users, what a breach costs in business terms, and what the service is actually capable of achieving at current maturity.
SLO, SLI, and SLA — the related concept
Error budgets are derived from SLOs. The SLO entry covers the full SLI-SLO-SLA relationship and why the order matters.
Read the SLO, SLI, SLA definition
Common questions
Error budget: direct answers
How do you calculate an error budget?
Error budget = (1 - SLO target) × measurement window length. For a 99.9% availability SLO over 30 days: error budget = 0.1% × (30 days × 24 hours per day × 60 minutes per hour) = 0.001 × 43,200 minutes = 43.2 minutes of allowed downtime per month. For a request-based SLO with 99.9% success rate and one million requests per month: error budget = 0.1% × 1,000,000 = 1,000 allowed failed requests per month.
What happens when you run out of error budget?
When the error budget depletes, the error budget policy determines what happens. A typical policy: new feature deployments pause, and engineering capacity shifts to reliability work until the budget recovers. The important thing is that the policy is documented before the budget depletes, not decided in the moment. A policy created reactively during an incident is not a policy — it is a negotiation.
Who sets the error budget?
The error budget is derived directly from the SLO — whoever sets the SLO implicitly sets the error budget. SLO values require input from both the product team (what reliability do users need?) and the engineering team (what is the system capable of?). Platform teams own the tooling and dashboards that track error budget consumption, and they typically author the error budget policy.
Implement error budgets that your teams actually use
The Reliability Program wires SLO infrastructure, error budget dashboards, and policy enforcement into your engineering org.
SLI instrumentation. SLO dashboard templates. Error budget policy. Automated deployment gates.