What Is Site Reliability Engineering?
Reliability is a feature, and it can be engineered.
Site Reliability Engineering applies software engineering discipline to operations. The core premise, introduced by Google, is that the same principles used to build reliable software can be applied to make the operation of that software reliable.
Origin and definition
Where SRE came from and what it means
Google created SRE in 2003, when Ben Treynor Sloss joined the company and was asked to run a small production team. The problem he was trying to solve: operations teams using manual processes could not scale with Google's growth, and the cultural divide between development and operations was producing unreliable systems. His solution was to hire software engineers and hand them production reliability as an engineering problem to solve.
The SRE book, written by Google engineers and published by O'Reilly in 2016, codified the practices that emerged: SLOs, error budgets, toil measurement and elimination, blameless postmortems, and the 50% cap on operational work. These are not soft cultural aspirations; they are specific, measurable commitments about how the team operates.
The core idea is precise: reliability is a feature that must be designed, measured, and continuously improved, not a property that emerges from hoping nothing breaks. SRE gives that feature an engineering home with specific tools and metrics.
What SRE teams do
Five activities that define SRE work
Define and track SLOs. SRE teams own the service level objective framework for the services they support. They define what "reliable" means for each service using SLIs (the measurements) and set SLO targets that translate those measurements into operational commitments. See the SLO, SLI, SLA glossary entry for the full definition.
Manage error budgets. The error budget is the gap between the SLO target and 100% — the amount of unreliability the team has agreed to tolerate. SRE teams track how the budget is being spent, escalate when it is depleting too fast, and enforce the error budget policy (typically: when budget runs low, feature work pauses and reliability work takes priority).
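The error budget arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library or any vendor's API: the names `BudgetStatus`, `error_budget_status`, and `should_pause_feature_work` are hypothetical, the SLI is assumed to be a simple ratio of good events to total events over one window, and the 25% pause threshold is an arbitrary example of an error budget policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    sli: float               # measured fraction of good events in the window
    allowed_failures: float  # failed events the SLO tolerates in the window
    budget_remaining: float  # fraction of the error budget still unspent


def error_budget_status(slo_target: float, good: int, total: int) -> BudgetStatus:
    """Compare a request-based SLI against its SLO target (hypothetical helper)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0 or not 0 <= good <= total:
        raise ValueError("need total > 0 and 0 <= good <= total")
    sli = good / total
    # The error budget is the gap between the SLO target and 100%.
    allowed_failures = (1.0 - slo_target) * total
    failed = total - good
    # Goes negative once the budget is overspent.
    budget_remaining = 1.0 - failed / allowed_failures
    return BudgetStatus(sli, allowed_failures, budget_remaining)


def should_pause_feature_work(status: BudgetStatus, threshold: float = 0.25) -> bool:
    """Example policy (an assumption): pause features when little budget is left."""
    return status.budget_remaining < threshold


status = error_budget_status(slo_target=0.999, good=999_400, total=1_000_000)
print(status, should_pause_feature_work(status))
```

With a 99.9% target over one million requests, the budget is roughly 1,000 failed requests; 600 failures leaves about 40% of it, which is above the example threshold, so feature work continues.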
Write postmortems. After every significant incident, SRE teams facilitate a blameless postmortem: a structured analysis of what happened, why it happened, and what changes would prevent recurrence. The blameless framing is deliberate — blame produces cover-up; blameless analysis produces systemic fixes.
Eliminate toil. Toil is manual, repetitive operations work that grows with service load unless it is automated away. SRE teams measure toil, track its percentage of their workweek, and are expected to continuously reduce it through engineering. Google's original model capped toil at 50% of SRE time; without that discipline, the team drifts toward pure operations work.
Build reliability tooling. SRE teams write software: chaos engineering tools, deployment safety checks, automated rollback triggers, observability instrumentation. This is the engineering part of site reliability engineering — the part that distinguishes SRE from a well-organized operations function.
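As one illustration of the tooling mentioned above, an automated rollback trigger can be as simple as comparing a canary's error rate against the baseline. The sketch below is an assumption-laden example, not a description of any particular deployment system: the function name `should_roll_back` and the parameters `max_ratio`, `min_requests`, and `floor` are hypothetical, and real triggers usually add statistical tests and multiple signals.

```python
def should_roll_back(
    canary_errors: int,
    canary_requests: int,
    baseline_error_rate: float,
    max_ratio: float = 2.0,    # assumed tolerance: canary may be up to 2x baseline
    min_requests: int = 500,   # assumed minimum traffic before judging the canary
    floor: float = 0.001,      # ignore error rates below this absolute level
) -> bool:
    """Return True when the canary looks clearly worse than the baseline."""
    if canary_requests < min_requests:
        # Too little traffic to tell noise from a real regression.
        return False
    canary_rate = canary_errors / canary_requests
    # The floor keeps a near-zero baseline from making any single error fatal.
    limit = max(baseline_error_rate * max_ratio, floor)
    return canary_rate > limit


# Canary at 3% errors against a 1% baseline exceeds the 2x tolerance.
print(should_roll_back(canary_errors=30, canary_requests=1000, baseline_error_rate=0.01))
```

The minimum-traffic guard is a deliberate design choice in this sketch: it trades slower detection for fewer false rollbacks on small samples.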
SRE vs DevOps vs platform engineering
How these practices relate to each other
SRE vs DevOps. DevOps is a culture and set of practices: shared ownership between development and operations, continuous delivery, infrastructure as code, monitoring and feedback loops. SRE is a specific implementation of those practices with particular engineering rigor. An SRE team is one way to do DevOps: a concrete organizational and methodological choice rather than a cultural aspiration. In the Google SRE book, Ben Treynor Sloss describes SRE as "what happens when you ask a software engineer to design an operations team."
SRE vs platform engineering. Platform engineering and SRE operate at different levels. SRE focuses on the reliability of specific services — defining SLOs for the checkout service, running postmortems for the auth service, managing error budgets per service. Platform engineering focuses on the shared infrastructure that all service teams run on — the deployment platform, the CI/CD golden paths, the observability stack, the internal developer platform.
In many mature organizations, the platform engineering team absorbs SRE functions at the platform layer: they own the SLOs for the platform itself, they run postmortems for platform incidents, and they manage the error budget for shared infrastructure. Individual service teams then practice SRE principles (with or without dedicated SRE headcount) on top of that platform.
When to invest
SRE team vs. platform team: rough guidance
The question "should we hire SREs or build a platform team?" usually has the wrong frame. Most Series A and B companies (20 to 150 engineers) need a platform team before they need dedicated SREs. The shared infrastructure — CI/CD, deployment, observability, secrets management — is the foundation everything runs on. Getting that right has leverage across the entire engineering org. SRE practices on top of a shaky platform produce sophisticated measurement of unreliability without fixing the causes of it.
SRE investment makes sense when: services are mature and stable enough to have meaningful SLOs, the organization has enough traffic and incident volume that error budget management is a real operational concern, and the platform foundation is sound enough that reliability is genuinely a service-level problem rather than a platform-level problem.
That usually means companies at or beyond Series C, with 150+ engineers, running customer-facing services at meaningful scale. Earlier than that, most of what looks like an "SRE problem" is actually a platform engineering problem in disguise.
Clouditive's Reliability Program
The Reliability Program wires SLO infrastructure, error budget dashboards, and postmortem process into your engineering organization.
See the Reliability Program
Common questions
SRE: direct answers
Do I need an SRE team or a platform team?
The question reveals a common framing problem. SRE and platform engineering solve different problems. SRE is focused on reliability of services: SLOs, error budgets, postmortems, toil elimination. Platform engineering is focused on the shared infrastructure that all teams run on. At most Series A and B companies, the right answer is a platform team first — get the shared infrastructure working before adding reliability specialists for specific services. Once services are mature, SRE practices add meaningful value on top.
What does an SRE do day to day?
SRE work falls into three categories. Reliability work: defining SLOs, reviewing error budgets, running postmortems. Toil elimination: identifying manual operations work that is repetitive and automatable, then automating it. Engineering: writing software to improve system reliability, observability, or deployment safety. Google's original SRE model capped operational work at 50% of an SRE's time to ensure the engineering work gets done. Without that discipline, SREs become sophisticated on-call operators rather than reliability engineers.
How many SREs do I need?
There is no universal formula. A ratio often attributed to Google is roughly one SRE per five to eight software engineers for services in the SRE program, but any such ratio reflects Google's specific context. Most Series B companies with 50 to 150 engineers find that a small SRE practice (two to four engineers) embedded within a platform team covers the reliability engineering work. The more relevant question is: what are the specific reliability problems you are trying to solve?
What is toil in SRE?
Toil, as defined in the Google SRE book, is work that is manual, repetitive, automatable, tactical rather than strategic, devoid of enduring value, and scales linearly with service growth. Classic examples: manually restarting services, manually reviewing deployments, manually rotating credentials. Toil is not inherently bad, but it should be tracked and continuously reduced through automation. An SRE team that is not reducing toil is just an operations team with better naming.
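Tracking toil can start with something as plain as a tagged time log. The following is a minimal sketch under stated assumptions: `WorkEntry`, `toil_fraction`, and `exceeds_cap` are hypothetical names, each entry is hand-labeled as toil or not, and the default cap of 0.5 mirrors the 50% limit from Google's original model.

```python
from typing import Iterable, NamedTuple


class WorkEntry(NamedTuple):
    description: str
    hours: float
    is_toil: bool  # labeled by the engineer; the classification itself is a judgment call


def toil_fraction(entries: Iterable[WorkEntry]) -> float:
    """Share of logged hours spent on toil, between 0.0 and 1.0."""
    entries = list(entries)
    total = sum(e.hours for e in entries)
    if total <= 0:
        raise ValueError("no hours logged")
    return sum(e.hours for e in entries if e.is_toil) / total


def exceeds_cap(entries: Iterable[WorkEntry], cap: float = 0.5) -> bool:
    """True when toil takes more than the capped share of the logged time."""
    return toil_fraction(entries) > cap


week = [
    WorkEntry("manually restart services", 6, True),
    WorkEntry("manually rotate credentials", 4, True),
    WorkEntry("facilitate postmortem", 5, False),
    WorkEntry("build rollback tooling", 25, False),
]
print(toil_fraction(week), exceeds_cap(week))
```

In this example week, 10 of 40 logged hours are toil, so the team sits at a quarter of its time and under the cap. The trend across weeks matters more than any single reading.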
Build reliability into your engineering org
SRE practices work when the platform foundation is solid underneath them.