Engineering Leadership · 14 min read

How Engineering Leaders Respond to Production Crises (And What That Reveals)

A production incident stress-tests engineering leadership as much as engineering systems. Three failure modes in the first 30 minutes decide the outcome.

TL;DR. Three failure modes in the first 30 minutes of a production incident determine whether it resolves in 90 minutes or four hours: parallel diagnosis without coordination, stakeholder management displacing technical work, and alert noise masking the actual signal. Teams that resolve incidents faster are not staffed with better engineers. They have incident response structures they have practiced: an incident commander who coordinates rather than diagnoses, a dedicated communication role, and runbooks that were updated after the last incident. Those 90 minutes of crisis also reveal the actual values of engineering leadership, not the stated ones.

At 2:47am on a Tuesday, an e-commerce platform lost the ability to process payments. The trigger was a deployment that had passed every automated check, but something downstream had changed. Thirty thousand dollars in orders were failing per hour.

I was on the call as an observer. What I watched over the next 90 minutes told me more about the engineering organization than any assessment I could have run. The technical quality of the system matters, but it is the organizational quality that determines whether a system failure becomes a recoverable incident or a cascading disaster.

Three failure modes in the first 30 minutes that determine whether an incident resolves or cascades

The first 30 minutes of a production incident establish the pattern for everything that follows. Teams with practiced, well-structured incident response restore service faster not because they have better engineers, but because they waste less time on specific failure modes that are observable in nearly every organization experiencing a significant incident for the first time.

The first failure mode is parallel diagnosis without coordination. Multiple engineers independently chase different hypotheses without sharing findings. The most senior person starts looking at the database while someone else checks the CDN and a third person reviews recent deployments, and nobody is communicating what they are finding. Each person might have a piece of the answer, but because there is no coordination mechanism, the puzzle stays unsolved longer than it should. In a distributed system with many potential failure points, the exhaustive-parallel approach is both slow and exhausting.

The second failure mode is stakeholder management displacing technical work. The on-call engineer is on a technical bridge working through the diagnosis and simultaneously fielding Slack messages from five different executives asking for updates, each with their own urgency. The switching cost between debugging mode and communication mode is significant. Every time an engineer breaks out of the focused attention required to debug a complex failure to draft an executive update, they lose the thread they were following. The incident takes longer and the quality of both the debugging and the communication suffers.

The third failure mode is alert noise masking the signal. When an organization's alerting is not well-tuned, a major incident often triggers dozens of secondary alerts that are symptoms of the root cause rather than the cause itself. Engineers spend time investigating alerts that will resolve themselves once the root cause is addressed, instead of going directly to the source. The noise creates urgency without direction and exhausts the team before the real problem is found.

The organization at 2:47am had all three problems. But they had a playbook, and they used it. The playbook did not tell them how to fix the system. It told them how to run the response, and that turned out to be the difference between a 90-minute incident and a four-hour one.

What a good incident playbook actually does — and why it is not a technical document

A good incident playbook does not tell you how to fix the technical problem. It tells you how to run the response: who is the incident commander for this severity level, what is the communication cadence for stakeholder updates, who is authorized to make a rollback decision without escalation, and how are hypotheses and findings shared in real time.
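Those four questions can be captured as data rather than prose, so the answer is a lookup instead of a discussion at 2am. The following is a minimal sketch, not a recommended policy: the severity names, role names, cadences, and channel name are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    incident_commander: str    # role that takes command at this severity
    update_every_minutes: int  # cadence for stakeholder updates
    rollback_authority: str    # role allowed to roll back without escalation
    findings_channel: str      # where hypotheses and findings are shared


# All values below are hypothetical placeholders, not recommendations.
PLAYBOOK = {
    "sev1": SeverityPolicy("senior on-call engineer", 15, "incident commander", "#inc-bridge"),
    "sev2": SeverityPolicy("on-call engineer", 30, "incident commander", "#inc-bridge"),
}


def policy_for(severity: str) -> SeverityPolicy:
    """Look up the response policy; an unknown severity is treated as the most severe."""
    return PLAYBOOK.get(severity, PLAYBOOK["sev1"])


print(policy_for("sev2").update_every_minutes)
```

Defaulting an unknown severity to the strictest policy is one design choice among several; the point is only that the response structure is decided before the incident, not during it.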

The incident commander role in particular is undervalued in most engineering organizations. The job is not to be the best diagnostician in the room. The job is to ensure that the best diagnosticians are working effectively together: coordinating hypotheses, preventing duplicate work, making sure findings are shared across the bridge, and making decisions when a decision is needed rather than waiting for consensus to naturally emerge.

This last point deserves emphasis. Production incidents are high-pressure, high-ambiguity situations where the correct decision is often not clear. An organization that requires consensus before making a decision will be slower than one that has given the incident commander the authority to decide with the best available information. Speed matters in incident response not because faster decisions are better decisions, but because a fast incorrect decision is often more recoverable than a slow correct one: a wrong call, such as an unnecessary rollback, can usually be reversed, while the outage keeps running for as long as the team deliberates.

At 2:47am, the senior engineer who took incident commander did something counterintuitive. She immediately designated someone else to handle all stakeholder communication and explicitly removed herself from that Slack channel. For the next 60 minutes, she focused entirely on the technical bridge. The stakeholder updates went out on schedule because the dedicated communicator had a template and a cadence. The diagnosis proceeded without interruption because the incident commander was not switching contexts.

At 4:14am, they had identified the root cause and restored service. The incident had lasted 87 minutes, which was fast for a failure of that complexity and business impact.
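The arithmetic behind those figures can be checked with a short sketch. It assumes, for illustration only, that orders failed at a constant $30,000 per hour for the whole outage. The calendar date is arbitrary and the function name is hypothetical; only the clock times and the hourly rate come from the account above.

```python
from datetime import datetime

FAILED_ORDERS_PER_HOUR = 30_000  # dollars per hour, from the incident description


def outage_cost(start: datetime, end: datetime, rate_per_hour: float) -> tuple[int, float]:
    """Return (duration in whole minutes, value of failed orders in dollars)."""
    minutes = int((end - start).total_seconds() // 60)
    return minutes, rate_per_hour * minutes / 60


# The date is arbitrary; only the clock times come from the article.
start = datetime(2024, 1, 2, 2, 47)
end = datetime(2024, 1, 2, 4, 14)

minutes, cost = outage_cost(start, end, FAILED_ORDERS_PER_HOUR)
four_hour_cost = FAILED_ORDERS_PER_HOUR * 4  # same assumed rate over four hours

print(f"{minutes} minutes, ${cost:,.0f} in failed orders")
print(f"A four-hour incident at the same rate: ${four_hour_cost:,.0f}")
```

Under that constant-rate assumption, the 87-minute response corresponds to about $43,500 in failed orders, against $120,000 for a four-hour one. Failed orders are not necessarily lost revenue, so treat both numbers as an upper bound.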

Why runbooks exist and are not trusted — and what trustworthy runbooks actually contain

One of the most consistent findings when I assess engineering organizations' incident response capability is that runbooks exist and are not trusted. The runbook was written when the service was deployed, has not been updated since, references tooling that was deprecated two years ago, and describes a debugging procedure that only works if the specific conditions of the original failure are present.

Engineers who have been through a few incidents at the organization have learned not to rely on the runbooks. They rely instead on the institutional knowledge of the engineers who have been there longest, which concentrates critical knowledge in a few people, a risk that shows up in who gets escalated to. When the engineer who knows how to debug this service is not on call, the incident takes significantly longer.

Trustworthy runbooks have specific characteristics. They were written by the engineers who actually debug the service, not by the engineers who built it. They were updated the last time the service had an incident. They include the commands to run, the specific output to look for, and the decision point for when to escalate versus when to continue debugging. They do not include architectural documentation or explanations of how the system works. They contain the specific information an engineer needs to resolve an incident at 2am without needing to understand the system design.
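One way to make those characteristics checkable is to give each runbook entry a fixed shape. The sketch below is an illustration under assumptions, not a standard format: the class and field names are invented for this example, and the script name, output text, escalation rule, and dates are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RunbookStep:
    command: str   # the exact command to run
    look_for: str  # the specific output that confirms this failure mode


@dataclass
class RunbookEntry:
    failure_mode: str
    steps: list[RunbookStep]
    escalate_when: str   # decision point: escalate versus keep debugging
    last_updated: date

    def is_stale(self, last_incident: date) -> bool:
        """True if the entry has not been updated since the service's last incident."""
        return self.last_updated < last_incident


# Hypothetical entry: the script name, output text, and dates are placeholders.
entry = RunbookEntry(
    failure_mode="payments failing after a deploy",
    steps=[RunbookStep("./check_gateway_health.sh", "status: degraded")],
    escalate_when="no root-cause hypothesis after 20 minutes",
    last_updated=date(2024, 1, 5),
)

print(entry.is_stale(last_incident=date(2024, 3, 1)))  # True
```

Note what the structure leaves out: there is no field for architecture notes or system explanations, which matches the rule that a runbook entry holds only what is needed to act.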

The investment required to produce trustworthy runbooks is surprisingly small when structured correctly. After each incident, the engineer who resolved it spends 30 minutes updating or creating the runbook entry for that failure mode. Over 12 months of incidents, the runbook library covers the scenarios that actually occur. The accumulated knowledge is distributed across the organization rather than concentrated in the heads of the most experienced engineers.

What blameless postmortems are actually for — and why they improve MTTR over time

The postmortem happened three days after the 2:47am incident. No one was assigned blame. The question was not "who deployed the change that caused this?" It was "what does this failure reveal about our system that we should address?"

The answer involved four things: a gap in the documentation of the downstream service interface, an alerting threshold that was set too sensitively and created noise during the incident, a runbook step that referenced an internal tool that had been deprecated six months prior, and a dependency that had no automated health check and therefore produced no signal when it began degrading.

None of these findings required assigning blame. All of them were actionable. Within three weeks, all four had been addressed. The documentation gap was filled. The alert threshold was adjusted. The runbook was updated. The dependency got a health check.

This is the purpose of a blameless postmortem: to treat production failures as a diagnostic tool for improving the system rather than as a performance failure by individuals. Teams that do this consistently improve their mean time to restore not because the engineers get better at debugging, but because the system gets better at being debugged. The runbooks are more accurate. The alerts are better tuned. The health checks catch degradations earlier. Each incident, handled well, makes the next incident less severe.

The practical difference between organizations that run blameless postmortems and those that do not is measurable in the incident metrics over time. Organizations with consistent blameless postmortem practices see their mean time to restore decrease over 12 to 24 months. Organizations that run postmortems as blame-assignment exercises see no improvement, because the information that would improve the system never surfaces.
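Whether mean time to restore is actually falling is straightforward to check from incident records. The following is a minimal sketch of that calculation; the function name is an assumption and the incident data is entirely hypothetical, included only so the example runs.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def mttr_by_quarter(incidents: list[tuple[date, float]]) -> dict[tuple[int, int], float]:
    """Group (incident date, minutes to restore) pairs by (year, quarter) and average them."""
    buckets: dict[tuple[int, int], list[float]] = defaultdict(list)
    for day, minutes in incidents:
        quarter = (day.month - 1) // 3 + 1
        buckets[(day.year, quarter)].append(minutes)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Hypothetical incident log: (date of incident, minutes until service was restored).
incident_log = [
    (date(2024, 1, 15), 180.0),
    (date(2024, 2, 20), 150.0),
    (date(2024, 5, 3), 120.0),
    (date(2024, 8, 11), 95.0),
    (date(2024, 9, 2), 85.0),
]

for (year, quarter), minutes in mttr_by_quarter(incident_log).items():
    print(f"{year} Q{quarter}: {minutes:.0f} min")
```

With only a handful of incidents per quarter, a single long outage dominates the average, so the trend is more meaningful over the 12 to 24 month horizon than quarter to quarter.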

What engineering leaders reveal about their actual values during a crisis

The behavior of engineering leadership during a production incident tells engineers a great deal about the actual values of the organization, as opposed to the stated ones. These signals are processed quickly and retained for a long time.

A leader who responds to a major incident by publicly or privately communicating that the incident was caused by specific individuals' errors is teaching the engineering organization to hide problems and protect themselves rather than communicate openly. Engineers who witness this once become more cautious in postmortems, more careful about what they report in near-miss situations, and less likely to escalate early when they see a situation developing. The damage to incident response capability from a single blame-oriented response can take years to repair.

A leader who shows up on the incident bridge, asks how they can help remove obstacles rather than asking for status updates, and participates constructively in the postmortem is demonstrating a different set of values. Engineers notice this too. It shapes how they behave in the next incident. They escalate earlier because they trust that escalating will produce help rather than scrutiny. They share more information in postmortems because they trust that the information will produce improvement rather than assignment of responsibility.

The 90 minutes of a production incident is not wasted time from a leadership perspective. It is information about how the organization functions under pressure, what the actual decision-making process looks like when the playbook is being followed, and where the gaps between the stated values and the actual values become visible. The organizations that learn from their incidents and use them to improve both the technical systems and the organizational practices are the ones that build genuine reliability over time.

How to build incident response capability without waiting for the next major incident

For engineering organizations that want to improve their incident response capability without waiting for the next major incident to learn from, the most effective approach is deliberate practice through tabletop exercises and process reviews.

A tabletop exercise for incident response involves walking through a hypothetical incident scenario with the team, following the playbook, and identifying where the playbook breaks down or where team members are uncertain about what to do. This produces findings without the pressure of a real incident and allows the team to update the playbook before it is needed.

A process review involves examining the last three to five incidents that occurred and asking specific questions: at what point during each incident did the response team have the correct hypothesis? How long did it take from that point to resolution? What caused the delay? The pattern across multiple incidents often reveals a specific bottleneck: escalation that takes too long, a diagnostic step that depends on a specific person's institutional knowledge, or an authorization step that slows the rollback decision.
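The process review questions reduce to two intervals per incident: time until the correct hypothesis, and time from that hypothesis to resolution. The sketch below shows one way to compute them. The class and function names are assumptions, and the three timelines are hypothetical sample data.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentTimeline:
    name: str
    started: datetime
    correct_hypothesis: datetime  # when the team first had the right hypothesis
    resolved: datetime

    def minutes_to_hypothesis(self) -> float:
        return (self.correct_hypothesis - self.started).total_seconds() / 60

    def minutes_hypothesis_to_resolution(self) -> float:
        return (self.resolved - self.correct_hypothesis).total_seconds() / 60


def review(incidents: list[IncidentTimeline]) -> dict[str, float]:
    """Average both intervals across the incidents under review."""
    return {
        "mean_minutes_to_hypothesis": mean(i.minutes_to_hypothesis() for i in incidents),
        "mean_minutes_hypothesis_to_resolution": mean(
            i.minutes_hypothesis_to_resolution() for i in incidents
        ),
    }


# Hypothetical timelines for three past incidents.
past = [
    IncidentTimeline("A", datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 20), datetime(2024, 1, 1, 11, 30)),
    IncidentTimeline("B", datetime(2024, 2, 1, 14, 0), datetime(2024, 2, 1, 14, 10), datetime(2024, 2, 1, 15, 0)),
    IncidentTimeline("C", datetime(2024, 3, 1, 2, 0), datetime(2024, 3, 1, 2, 30), datetime(2024, 3, 1, 3, 30)),
]

print(review(past))
```

In this made-up data the team finds the right hypothesis in 20 minutes on average but needs another 60 to resolve, which would point the review at what happens after diagnosis (authorization, rollback mechanics) rather than at diagnosis itself.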

The organizations that handle production crises well are not the ones with the fewest incidents. They are the ones that have invested in the organizational infrastructure (the playbooks, the runbooks, the communication protocols, and the postmortem culture) that allows them to resolve incidents quickly and learn from them systematically.

Alert fatigue: when alert volume trains engineers to ignore the signal

One of the most common findings when assessing incident response capability is alert fatigue: a state where the volume and noise level of alerts has trained engineers to ignore them. In a high-alert-volume environment, engineers learn to distinguish the alerts they usually ignore from the alerts they act on based on pattern recognition rather than the content of the alert. This creates a gap: a novel failure mode that produces a familiar-looking alert will be ignored until a customer reports it.

The resolution is not simply reducing alert volume, though that is usually part of the answer. It is establishing clear standards for what alerts should demand response: alerts should fire only when human action is required, and they should fire with enough context to direct the first five minutes of investigation. Alerts that fire without these properties should be either eliminated or converted to a lower-urgency notification that does not interrupt the on-call engineer.
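The two standards can be applied as an audit over existing alert rules. This is a minimal sketch, assuming each rule can be annotated with whether it requires human action and what the responder should check first; the field names and the sample rules are hypothetical and do not correspond to any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    requires_human_action: bool  # does someone have to do something when it fires?
    first_check: str = ""        # context that directs the first minutes of investigation


def meets_paging_standard(rule: AlertRule) -> bool:
    """A rule may page only if it is actionable and carries investigation context."""
    return rule.requires_human_action and bool(rule.first_check.strip())


def audit(rules: list[AlertRule]) -> list[str]:
    """Names of rules that are candidates to eliminate or demote to a notification."""
    return [rule.name for rule in rules if not meets_paging_standard(rule)]


# Hypothetical rules for illustration.
rules = [
    AlertRule("payment_error_rate_high", True, "check the most recent deploy to the payment service"),
    AlertRule("disk_usage_above_70_percent", False),
    AlertRule("queue_depth_high", True),  # actionable, but gives the responder no starting point
]

print(audit(rules))
```

The audit only flags rules; deciding whether a flagged rule is deleted, demoted, or given the missing context is still a judgment call for the team that owns the service.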

The work of tuning alerts to these standards is engineering work that pays back through every subsequent incident that resolves faster because the signal was clear. It is also work that most organizations defer because it is not adding features and the impact is not visible until the next incident. The organizations that invest in it do so because they have connected the cost of incidents to the alert quality, and the connection is unmistakable.

Three postmortem anti-patterns that produce documentation without learning

Blameless postmortems are well-understood in principle and poorly executed in practice. The common anti-patterns are worth naming specifically because they are easy to fall into and have high costs.

The first is the timeline that substitutes for analysis. The postmortem document contains a detailed minute-by-minute account of what happened during the incident, but no analysis of what made those events possible. A timeline is useful context for analysis. It is not analysis in itself. The postmortem that ends with a timeline has done the documentation work without the learning work.

The second is action items that address symptoms rather than causes. The postmortem concludes that an alert threshold was set too low, so the action item is to raise the threshold. But the threshold was too low because the engineering team had insufficient understanding of normal behavior for this service, and nobody owned the responsibility for reviewing alert thresholds when services were deployed. Raising the threshold addresses the immediate symptom but not the systemic issue. Six months later, a different service has the same problem.

The third is the postmortem that is never read again. The document gets written, the action items get captured in a ticket system, and the tickets get deprioritized and expire without resolution. The incident happens again six months later because the action items from the previous postmortem were never completed. The postmortem process has all the form of learning without the substance.

The organizations with excellent incident response cultures treat postmortems as commitments rather than documentation. The action items are assigned, prioritized against feature work with the same urgency framework as the incident itself, and tracked to completion. The learning from each incident is genuinely incorporated into the operational system before the organizational attention moves to other priorities.

Frequently asked questions

What is the incident commander role and why is it undervalued?

The incident commander's job is not to be the best diagnostician in the room. It is to ensure that the best diagnosticians are working effectively together: coordinating hypotheses, preventing duplicate work, making sure findings are shared across the bridge, and making decisions when a decision is needed rather than waiting for consensus. The role is undervalued because coordination is less visible than diagnosis. Speed in incident response matters not because fast decisions are better decisions, but because a fast incorrect decision can usually be reversed, while a slow correct one extends the outage for as long as it is debated. An organization that requires consensus before acting will consistently restore service more slowly than one that has given the incident commander decision authority.

What makes a runbook trustworthy versus a runbook that gets ignored?

Trustworthy runbooks were written by the engineers who actually debug the service, not the engineers who built it. They were updated the last time the service had an incident. They include the specific commands to run, the output to look for, and the decision point for when to escalate. They do not explain how the system works — that is architectural documentation, not incident documentation. The investment to produce them is smaller than it appears: 30 minutes per incident, per engineer who resolved it, to update or create the entry for that failure mode.

Why do blameless postmortems improve mean time to restore over time?

Because they surface the information that improves the system. A postmortem that assigns blame teaches engineers to protect themselves, not share information. Engineers in blame-oriented organizations become more careful about what they report in near-miss situations and less likely to escalate early. Over time, the runbooks stay inaccurate, the alerts stay poorly tuned, and each incident costs as much as the last one. In organizations with genuine blameless postmortem cultures, each incident handled well makes the next one less severe — not because the engineers get better at debugging, but because the system gets better at being debugged.

What are the three postmortem anti-patterns worth avoiding?

The timeline that substitutes for analysis: detailed minute-by-minute documentation with no investigation of what made those events possible. Action items that address symptoms rather than causes: raising an alert threshold without asking why the threshold was wrong and who owns reviewing thresholds when services are deployed. And the postmortem that is never read again: action items captured in a ticket system, deprioritized, and expired without resolution. The organizations with excellent incident cultures treat postmortem action items as commitments with the same urgency framework as the incident itself.


If your incident management process is more improvised than practiced, a Foundations Assessment includes a review of your incident response capabilities and specific recommendations for improvement.

For the runbook and postmortem discipline that transforms improvised incident response into a practiced one, read SRE for growth-stage engineering — when you need it and what to build first.

For why knowledge concentrated in hero engineers (the same pattern that makes incident response improvised) is a platform design failure, read 5 signs your platform team is stuck in ad-hoc mode.

For the DORA metric (mean time to restore) that measures whether incident response improvement is actually happening, read the DORA metrics implementation guide.

Matías Caniglia

Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.
