SRE for growth-stage engineering — when you need it and what to build first
TL;DR. Growth-stage organizations do not need Google-scale SRE. The right question is which subset of practices addresses the reliability problems you actually have. Three conditions tell you the investment will return: on-call consuming more than 20 percent of engineering time, more than two customer-impacting incidents a quarter, and operational knowledge stuck in one or two engineers. If none apply, SRE is premature.
Site reliability engineering was developed at Google to solve Google's problems. The SRE book documented what a team of hundreds of experienced engineers built over more than a decade to operate a global-scale distributed system with stringent reliability requirements. It is an excellent reference for what mature SRE looks like.
It is not a guide for what a growth-stage engineering organization needs to do in the next quarter.
Most growth-stage organizations that attempt to adopt SRE practices wholesale discover two things. First, the practices require foundations that do not yet exist: an established software development lifecycle, consistent deployment pipelines, monitoring that is already producing actionable signals. Second, the practices are designed for a scale of complexity that most growth-stage organizations will not encounter for years, if ever.
The right question for a growth-stage engineering leader is not "how do we implement SRE?" It is "what subset of SRE practices addresses our actual reliability problems in our current context?" The answer is a shorter list than the SRE book suggests, and the sequencing matters more than comprehensiveness.
Three conditions that tell you SRE investment will return before it stalls
I use three conditions to evaluate whether a growth-stage organization has reached the point where SRE investment produces positive return.
The first condition is on-call load. When on-call work consumes more than 20 percent of engineering time in a typical week, and that pattern holds for more than one quarter, the reliability problem is costing more than the investment to fix it. Below that threshold, the cost of the reliability problem is bounded enough that basic incident management practices are sufficient. The threshold matters because SRE practices require engineering time to build. If that time is not available because incidents are consuming it, the investment will stall before it returns value.
The second condition is customer-impacting incidents occurring more than twice per quarter. Above two P1s per quarter, the combined cost in customer trust and engineering distraction exceeds the cost of systematic reliability engineering. At two or fewer per quarter, incident response quality (how fast you detect and respond) matters more than reliability investment aimed at reducing incident frequency.
The third is operational knowledge concentrated in one or two engineers. When an engineering organization reaches the point where one or two engineers are the only ones who can resolve complex production incidents, the reliability risk is a people risk. Those engineers are the bottleneck. They will eventually leave, and when they do, the organization's reliability will degrade sharply. SRE practices, specifically runbook discipline, blameless postmortems, and operational knowledge documentation, address this risk directly.
If none of these conditions apply, investing in SRE practices is premature. Invest in deployment reliability and consistent monitoring first. The SRE practices depend on those foundations.
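The three conditions can be written down as a small check. The sketch below is illustrative only: the `QuarterSnapshot` fields and function names are hypothetical, and it assumes one particular reading of the on-call condition (a quarter counts as overloaded when its average weekly on-call share exceeds 20 percent, and the condition holds when more than one quarter is overloaded).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class QuarterSnapshot:
    """Hypothetical per-quarter inputs; field names are illustrative."""
    weekly_oncall_fraction: list[float]   # share of engineering time spent on-call, one value per week
    customer_impacting_incidents: int     # incidents customers noticed during the quarter
    engineers_who_can_resolve: int        # people able to resolve complex production incidents


def sre_conditions(quarters: list[QuarterSnapshot]) -> dict[str, bool]:
    """Evaluate the three conditions; quarters are ordered oldest to newest."""
    if not quarters:
        raise ValueError("need at least one quarter of data")
    latest = quarters[-1]
    # Assumption: "overloaded" means the quarter's mean weekly on-call share is above 20 percent.
    overloaded = [
        q for q in quarters
        if q.weekly_oncall_fraction and mean(q.weekly_oncall_fraction) > 0.20
    ]
    return {
        "oncall_load": len(overloaded) > 1,
        "incident_frequency": latest.customer_impacting_incidents > 2,
        "knowledge_concentration": latest.engineers_who_can_resolve <= 2,
    }


def sre_investment_is_premature(quarters: list[QuarterSnapshot]) -> bool:
    """True when none of the three conditions apply."""
    return not any(sre_conditions(quarters).values())
```

If your organization tracks on-call time differently (for example per rotation rather than per week), the overload test is the part to adapt; the other two comparisons follow the thresholds stated above directly.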
The four capabilities to build first — and why skipping any of them breaks the sequence
Growth-stage SRE investment should follow a specific sequence. Each step builds the foundation for the next. Organizations that skip steps discover they have invested in practices that depend on foundations that do not exist.
The first thing to build is monitoring that produces actionable alerts. The most common monitoring failure in growth-stage engineering organizations is not a lack of monitoring — it is monitoring that alerts on everything and therefore alerts on nothing useful. The on-call engineer wakes up to a PagerDuty alert about CPU usage exceeding 70 percent and has no way to know whether that is the cause of a customer-impacting problem or a normal pattern that has been alerting falsely for six months.
Actionable alerting has three properties: every alert that fires at 2 AM should require human action (if there is no action to take, the alert should not fire at that hour); every alert should link directly to the runbook for the failure mode it indicates; and every alert should have a documented history of how often it fires and what the resolution has been. Alerts that consistently resolve without action are candidates for threshold adjustment or suppression. The investment to move from noisy alerting to actionable alerting is typically two to four weeks of an experienced engineer's time. The return is measured in on-call hours recovered: the difference between waking up three times per night for false positives versus once per week for real incidents.
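One way to find the alerts that consistently resolve without action is to audit the firing history. The following is a minimal sketch, not a prescribed tool: the `AlertFiring` record, the `min_firings` floor, and the 80 percent `noise_ratio` cutoff are all assumptions chosen for illustration, and the history would come from whatever your paging system can export.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertFiring:
    """Hypothetical record of one alert firing, as exported from a paging tool."""
    alert_name: str
    required_action: bool          # did a human have to do something to resolve it?
    runbook_url: Optional[str]     # link attached to the alert, if any


def audit_alerts(firings, min_firings=5, noise_ratio=0.8):
    """Summarize each alert: how often it fired, how often nothing needed doing,
    whether it should be reviewed for threshold adjustment or suppression,
    and whether any firing lacked a runbook link."""
    counts = defaultdict(lambda: {"fired": 0, "no_action": 0, "has_runbook": True})
    for firing in firings:
        entry = counts[firing.alert_name]
        entry["fired"] += 1
        if not firing.required_action:
            entry["no_action"] += 1
        if not firing.runbook_url:
            entry["has_runbook"] = False

    report = {}
    for name, entry in counts.items():
        ratio = entry["no_action"] / entry["fired"]
        report[name] = {
            "fired": entry["fired"],
            "no_action_ratio": ratio,
            # Assumed cutoff: enough firings to judge, and most needed no action.
            "review_threshold": entry["fired"] >= min_firings and ratio >= noise_ratio,
            "missing_runbook": not entry["has_runbook"],
        }
    return report
```

Run against a few months of history, a report like this gives the engineer doing the cleanup a ranked starting list rather than a judgment call per alert.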
Once the alerts are actionable, the next investment is runbooks for the top five failure modes. Identify the five failure modes that have occurred most often in the last six months. Write a runbook for each. A runbook is not a long document — it is a decision tree: here is the symptom, here are the checks to run, here are the possible causes, here is the resolution step for each cause. The runbook discipline changes the on-call experience before it changes the incident frequency. Engineers who are responding to documented failure modes can restore service faster because they do not have to re-diagnose problems they have seen before. Write the runbooks by asking the engineers who have resolved each failure mode to narrate the resolution process. Record the narration. Edit it into a checklist. Validate it on the next occurrence. Runbooks written speculatively are usually missing the step that matters most.
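The decision-tree shape described above (symptom, checks, possible causes, a resolution per cause) can be captured in a very small data structure. This is a sketch under assumptions: the `Check` and `Runbook` names are hypothetical, and the example failure mode is invented purely to show the shape, not taken from a real incident.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Check:
    description: str   # what the on-call engineer should look at
    cause: str         # the cause this check confirms
    resolution: str    # the step that restores service for that cause


@dataclass
class Runbook:
    symptom: str
    checks: list[Check] = field(default_factory=list)

    def diagnose(self, confirmed: Callable[[Check], bool]) -> Optional[Check]:
        """Walk the checks in order and return the first one that is confirmed."""
        for check in self.checks:
            if confirmed(check):
                return check
        return None

    def as_checklist(self) -> list[str]:
        """Render the runbook as plain-text checklist lines."""
        lines = [f"Symptom: {self.symptom}"]
        for number, check in enumerate(self.checks, start=1):
            lines.append(
                f"{number}. Check: {check.description}. "
                f"If confirmed, cause: {check.cause}. Do: {check.resolution}"
            )
        return lines


# Invented example content, for illustration only.
example = Runbook(
    symptom="Checkout requests returning 5xx",
    checks=[
        Check("database connection pool usage", "pool exhausted", "restart the worker pool"),
        Check("most recent deploy time", "bad release", "roll back to the previous release"),
    ],
)
```

Keeping the checks ordered matters: the order should be the one the engineers who resolved the incident actually followed, since that is what the narration captures.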
After the monitoring and runbooks are working, blameless postmortems become the mechanism for compounding the investment. A blameless postmortem answers one question: what in the system, process, or tooling allowed this failure to occur, and what can be changed to make it less likely or less impactful? It does not answer: who made the mistake? Blame-focused incident reviews produce engineers who conceal near-misses, avoid taking ownership of ambiguous situations, and stop surfacing problems they can route around. Organizations that respond to incidents with blame see degrading reliability over time as engineers start optimizing for not getting blamed rather than for system reliability.
A blameless postmortem requires four elements: a timeline of what happened (specific events with times, not narrative); a root cause analysis covering the sequence of conditions that made the failure possible; a set of action items, each with an owner and a deadline; and a written document that survives the people in the room. The immediate value is the action items from each incident. The compounding value is the institutional knowledge that accumulates in the postmortem archive. New engineers can read the last two years of postmortems and understand the system's failure history without experiencing those failures themselves.
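The four elements lend themselves to a simple completeness check before a postmortem is filed. The sketch below assumes a hypothetical `Postmortem` record; the field names and the wording of the gap messages are illustrative, not a standard template.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Optional


@dataclass
class TimelineEvent:
    at: datetime   # when it happened
    what: str      # the specific event, not a narrative


@dataclass
class ActionItem:
    description: str
    owner: str = ""
    deadline: Optional[date] = None


@dataclass
class Postmortem:
    title: str
    timeline: list[TimelineEvent] = field(default_factory=list)
    contributing_conditions: list[str] = field(default_factory=list)  # root cause analysis
    action_items: list[ActionItem] = field(default_factory=list)
    document_path: str = ""   # where the written document lives

    def gaps(self) -> list[str]:
        """List which of the four elements are missing or incomplete."""
        problems = []
        if not self.timeline:
            problems.append("timeline has no timestamped events")
        if not self.contributing_conditions:
            problems.append("root cause analysis lists no contributing conditions")
        if not self.action_items:
            problems.append("no action items")
        for item in self.action_items:
            if not item.owner or item.deadline is None:
                problems.append(f"action item without owner or deadline: {item.description}")
        if not self.document_path:
            problems.append("no written document recorded")
        return problems
```

Note that the record has no field for who made the mistake; the structure itself keeps the review pointed at conditions and changes.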
Once the monitoring, runbooks, and postmortem practice are established, service level objectives become useful. An SLO is a target: the payment service will process 99.9 percent of transactions within 500 milliseconds over any rolling 30-day period. Either the service is meeting its SLO or it is not. Without SLOs, reliability is a judgment call, evaluated against an implicit standard that nobody has agreed on. With SLOs, the standard is explicit and the team knows when reliability investment is adequate.
Implement SLOs after the monitoring and runbook steps, not before. SLOs require monitoring that is accurate and consistent — if the monitoring produces false signals, the SLO burn rate calculation is meaningless. SLOs also require a team ready to act on the signals. Start with two or three SLOs for your most critical customer-facing services. The effort required to implement a small set of accurate SLOs with clean monitoring is significant. The effort required to maintain a large set across all services is unsustainable for a growth-stage team. Start small, learn the practice, expand when the discipline is established.
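To make the payment-service example concrete, here is a minimal sketch of how compliance with that SLO could be computed from transaction records. The `Transaction` record and `slo_report` function are hypothetical; the sketch assumes a transaction counts as good only if it succeeded and finished within the latency limit, and it uses the common definition of burn rate as the observed bad fraction divided by the bad fraction the target allows.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Transaction:
    """Hypothetical record of one processed transaction."""
    finished_at: datetime
    latency_ms: float
    succeeded: bool


def slo_report(transactions, now, target=0.999, latency_limit_ms=500.0,
               window=timedelta(days=30)):
    """Compliance over the rolling window ending at `now`.
    Returns None when the window contains no transactions."""
    recent = [t for t in transactions if now - window <= t.finished_at <= now]
    if not recent:
        return None
    # Assumption: "good" means succeeded and finished within the latency limit.
    good = sum(1 for t in recent if t.succeeded and t.latency_ms <= latency_limit_ms)
    compliance = good / len(recent)
    allowed_bad = 1.0 - target
    return {
        "total": len(recent),
        "good": good,
        "compliance": compliance,
        "meeting_slo": compliance >= target,
        # Burn rate above 1.0 means bad events are arriving faster than the target allows.
        "burn_rate": (1.0 - compliance) / allowed_bad,
    }
```

The calculation is only as trustworthy as the records fed into it, which is the reason the monitoring work comes first: if slow or failed transactions are not recorded, the reported compliance will look better than the service really is.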
What to skip until the foundations are working — error budgets, capacity planning, chaos engineering
The SRE book describes practices that address problems you probably do not have yet. Error budgets — the mechanism by which feature development is paused when reliability falls below the SLO target — require a level of organizational maturity and executive buy-in that most growth-stage organizations have not established. Implementing error budgets without that maturity produces conflict between the SRE function and the product function rather than alignment.
Capacity planning at the SRE level requires traffic patterns that are stable and predictable enough to model. Growth-stage organizations with rapidly changing product usage patterns will find that capacity planning models have a short shelf life.
Chaos engineering — deliberately injecting failures to validate resilience — requires the monitoring, runbook, and incident response foundations to be well established before it produces value. Chaos engineering on a system with poor observability teaches you nothing. Chaos engineering on a system with incomplete runbooks produces unresolved incidents rather than learning. These practices are valuable at the right time. That time is typically after the four-capability sequence is complete and the organization has operated at that level for at least two quarters.
How the platform team and SRE function reinforce each other at growth stage
In a growth-stage engineering organization, the platform team and the SRE function are often the same team or closely overlapping teams. The platform team's reliability investments — deployment pipeline stability, monitoring standardization, runbook templates — create the foundations that SRE practices require. The SRE practices — postmortems, SLOs, on-call rotation design — create the feedback loops that tell the platform team where to invest next.
The Foundations Framework positions Delivery Reliability as the first pillar precisely because reliability is the prerequisite for everything else the platform does. A platform that is not reliable cannot absorb complexity. A platform team spending 30 percent of its time on reactive incident response cannot invest in proactive platform capability.
The sequencing: establish delivery reliability first, then build the SRE practices that maintain and improve it, then invest in developer experience and portal capabilities on top of the reliable foundation. Organizations that invert this sequence — building developer portals and golden paths on an unreliable platform — produce experiences that look polished and fail under load.
Frequently asked questions
How do you know when it is too early to invest in SRE practices?
If none of the three conditions apply (on-call load at or below 20 percent of engineering time, two or fewer customer-impacting incidents per quarter, and operational knowledge distributed across the team rather than concentrated in one or two engineers), SRE investment is likely premature. The practices require foundations that take time to build (consistent deployment pipelines, monitoring producing actionable signals), and investing in those practices before the foundations exist stalls the investment before it returns value. The more productive use of that engineering time is establishing Delivery Reliability: consistent deployment automation, basic observability, and reliable infrastructure state.
Why do runbooks need to be written from real incident narratives rather than speculatively?
Because the step that matters most in any failure mode is the one that was discovered by someone who has resolved it. Engineers who have walked through a real incident know the three things they checked that turned out to be irrelevant and the one thing that actually identified the root cause. Speculative runbooks document the ideal diagnosis path, which is often not how real incidents resolve. The method is to ask the engineers who have resolved each of the top five failure modes to narrate what they did — record it, edit it into a decision tree, and validate it on the next real occurrence. The result is a runbook that reflects how the system actually fails rather than how it was designed to fail.
When does an SLO become useful, and why implement it after monitoring and runbooks?
An SLO becomes useful when monitoring is accurate enough to produce consistent SLO burn-rate calculations and when the team has runbooks ready to act on SLO breaches. Without accurate monitoring, the SLO burn rate is a function of what the monitoring catches, not what the system is doing. Without runbooks, an SLO alert at 2 AM produces an engineer who knows the service is degraded but not what to do about it. SLOs also require explicit organizational agreement on what reliability level is acceptable, which is a conversation that is easier to have once the team has direct experience with the failure modes the SLO is meant to bound. Start with two or three SLOs for your most critical customer-facing services after the monitoring and runbook investments are working, not before.
Related reading
- The microservices migration that nearly killed the company — the platform reliability problem that SRE practices are meant to prevent: 47 services with no runbooks, no unified observability, and on-call load that burned out three senior engineers.
- How a traditional bank transformed its engineering without a big-bang rewrite — the constraint-by-constraint sequence that moved change failure rate from 25% to under 4%: exactly the reliability improvement SRE practices aim at.
- 5 signs your platform team is stuck in ad-hoc mode — Sign 4 (repeated incidents when the hero is unavailable) and Sign 5 (no baseline) are the two most direct SRE readiness indicators.
- Platform engineering ROI — what to measure and how to defend it — incident reduction ROI is one of the four categories most platform ROI calculations miss: this post shows how to calculate it.
The Foundations Assessment includes a reliability audit as part of the Horizon phase, covering monitoring quality, runbook coverage, and incident response maturity. It produces a sequenced roadmap showing which reliability investments have the highest return given your current state. The free Platform Score surfaces your reliability gaps in fifteen minutes.

Mat Caniglia
Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.