The Simplified Tech

© 2026 TheSimplifiedTech. All rights reserved.
Incident Management Basics

Severity levels, incident lifecycle, roles and responsibilities, communication protocols, mitigation strategies, and tooling — the complete foundation for managing production incidents at any scale, from startup to FAANG.

🎯Key Takeaways
An incident is an unplanned disruption requiring organized, urgent response. Classify severity by user impact (P0-P4), not by internal system behavior. Declare early and declare often — the cost of a false alarm is trivially small compared to the cost of a delayed response to a real incident.
The incident lifecycle has five phases — Detection, Triage, Mitigation, Resolution, Postmortem — each with distinct goals and time-based checkpoints. Use the 5-15-30-60 rule to maintain urgency: alert acknowledged in 5 minutes, severity declared in 15, mitigation plan executing in 30, user impact resolved in 60.
Assign explicit incident roles: Incident Commander (coordinates, does not debug), Communications Lead (owns all status updates), Operations Lead (leads technical work), SMEs (deep system expertise), and Scribe (documents everything). The IC should spend 80% of their time coordinating and at most 20% on technical work.
Prioritize mitigation over root cause analysis during every active incident. Roll back first, toggle feature flags, shift traffic, scale up — restore service by any means necessary and investigate root cause in the postmortem. You do not investigate arson while the building is burning.
Build your first incident response process in 30 days: week 1 establishes on-call rotation and severity levels, week 2 creates the incident workflow and status page, week 3 writes the three most critical runbooks, and week 4 runs the first tabletop exercise. Start simple, iterate based on postmortem findings, and improve continuously.


What Is an Incident?

An incident is an unplanned disruption or degradation of a service that requires an organized, urgent response. This definition is deliberately broad because the threshold for declaring an incident varies between organizations, but the core principle is universal: when the impact to users exceeds what your normal engineering workflow can address in a reasonable timeframe, you have an incident. Not every bug is an incident. A cosmetic UI issue on a settings page that affects no critical user flow is a bug — file a ticket, prioritize it in the next sprint. But if that same UI issue prevents users from completing purchases, it becomes an incident because it is actively causing business impact that grows with every minute it remains unresolved.

The distinction between a bug and an incident comes down to three factors: urgency (does this need to be fixed right now?), impact scope (how many users or how much revenue is affected?), and organizational response (does this require coordination beyond a single engineer?). If all three answers point to "yes," you are dealing with an incident, not a bug. At Google, we use the phrase "declare early, declare often" — it is always better to declare an incident and then downgrade it than to treat something as a routine bug and discover hours later that it was impacting millions of users.

When to Declare an Incident vs. File a Bug

Declare an incident when any of the following holds: an SLO is being violated, customer-facing functionality is degraded or unavailable, data loss or corruption is occurring or imminent, a security breach is suspected, or the issue requires coordination across multiple teams. File a bug when all of the following hold: no SLO is at risk, the impact is limited to internal tooling with workarounds available or is cosmetic with no loss of user-facing functionality, and a single engineer can diagnose and fix the issue within their normal workflow. When in doubt, declare the incident. The cost of a false alarm (30 minutes of wasted coordination time) is trivially small compared to the cost of a delayed response (hours of unmitigated user impact).

Severity levels: P0 through P4

  • P0 — Critical — Complete service outage or data loss affecting all users. Revenue-generating systems are down. Examples: payment processing is completely broken, the entire website returns 500 errors, database corruption is spreading. Response: all-hands, executive notification, status page updated immediately.
  • P1 — High — Major feature degradation affecting a significant percentage of users (>10%). Core functionality is impaired but not entirely unavailable. Examples: search results are returning stale data, login succeeds but session tokens expire after 30 seconds, API latency is 10x normal. Response: dedicated incident team, stakeholder notification within 15 minutes.
  • P2 — Moderate — Partial degradation affecting a subset of users or a non-critical feature. Workarounds exist. Examples: image uploads fail for users in a specific region, email notifications are delayed by 2 hours, a dashboard widget shows incorrect data. Response: on-call engineer investigates, business-hours escalation.
  • P3 — Low — Minor issue with minimal user impact. Functionality is degraded but operational. Examples: a tooltip displays the wrong text, CSV export is slow but functional, a non-essential API endpoint returns stale cache data. Response: ticket filed, addressed in normal sprint cycle.
  • P4 — Informational — No current user impact but requires tracking. Potential future risk or cosmetic issue. Examples: a dependency has a known vulnerability with no exploit in the wild, disk usage is trending toward capacity in 3 months, a deprecated API is still receiving traffic. Response: tracked in backlog, reviewed during planning.
| Severity | User Impact | Response Time | Escalation | Communication |
| --- | --- | --- | --- | --- |
| P0 | All users affected, complete outage | < 5 minutes | All on-call + engineering leadership + exec | Status page + all-hands Slack + email to customers |
| P1 | >10% users or core feature degraded | < 15 minutes | Primary + secondary on-call + team lead | Status page + incident Slack channel + stakeholders |
| P2 | Subset of users, workarounds exist | < 1 hour | Primary on-call, business-hours escalation | Incident Slack channel + team notification |
| P3 | Minor impact, no urgency | < 4 hours (business hours) | On-call engineer triages, no escalation | Team Slack channel |
| P4 | No current impact | Next sprint planning | None — tracked in backlog | None required |

Severity Inflation and Deflation Are Both Dangerous

Severity inflation (calling everything P0) leads to alert fatigue and the boy-who-cried-wolf effect — when everything is critical, nothing is. Severity deflation (classifying a P1 as P3 to avoid the overhead of incident management) is even more dangerous because it delays response while users suffer. The fix is a well-defined severity classification framework with concrete examples, reviewed and updated quarterly. If your team debates severity for more than 2 minutes during an active incident, your framework needs clearer criteria.

Remember: Impact, Urgency, Scope

Three questions determine whether something is an incident and what severity it is. Impact: What percentage of users or revenue is affected? Urgency: Does this need to be fixed right now, or can it wait? Scope: Does this require coordination beyond one engineer? A P0 scores "maximum" on all three. A bug scores "low" on all three. Everything in between is P1 through P4 based on the combination.
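The three questions above can be sketched as a small classifier. This is an illustrative helper, not part of any standard tooling; the function name and the numeric thresholds are assumptions you would align with your own framework.

```python
def classify_severity(pct_users_affected: float,
                      urgent: bool,
                      needs_coordination: bool) -> str:
    """Map impact, urgency, and scope to a P0-P4 label (illustrative thresholds)."""
    if pct_users_affected >= 100 and urgent and needs_coordination:
        return "P0"   # complete outage: maximum on all three questions
    if pct_users_affected > 10 and urgent:
        return "P1"   # core feature or >10% of users degraded
    if pct_users_affected > 0 and needs_coordination:
        return "P2"   # subset of users affected, workarounds exist
    if pct_users_affected > 0:
        return "P3"   # minor impact, normal sprint cycle
    return "P4"       # no current user impact, track it
```

For example, a full outage (`classify_severity(100, True, True)`) scores maximum on all three questions and comes back `"P0"`, while a cosmetic issue touching almost nobody falls through to `"P3"` or `"P4"`.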

The Incident Lifecycle

Every incident, regardless of severity, follows a lifecycle with distinct phases. Understanding this lifecycle is critical because each phase has different goals, different activities, and different time constraints. The biggest mistake teams make is conflating phases — trying to find root cause while the service is still on fire, or skipping the postmortem because the fix was "obvious." Each phase exists for a reason, and skipping or rushing any phase creates downstream problems that compound over time.

The lifecycle begins the moment an anomaly is detected — whether by an automated alert, a customer report, or an engineer noticing something wrong. From that moment, the clock is ticking. The lifecycle does not end when the service is restored; it ends when the postmortem is completed and action items are assigned. Organizations that end the lifecycle at "service restored" are doomed to repeat the same incidents because they never invest in structural prevention.

graph LR
    D[Detection<br/>Alert fires or<br/>report received] -->|"< 5 min"| T[Triage<br/>Assess severity,<br/>assign IC]
    T -->|"< 15 min"| M[Mitigation<br/>Stop the bleeding,<br/>restore service]
    M -->|"< 60 min"| R[Resolution<br/>Confirm stable,<br/>clean up]
    R -->|"< 72 hours"| P[Postmortem<br/>Root cause,<br/>action items]
    P --> I[Improvement<br/>Implement fixes,<br/>update runbooks]

    style D fill:#ef4444,color:#fff
    style T fill:#f59e0b,color:#fff
    style M fill:#3b82f6,color:#fff
    style R fill:#10b981,color:#fff
    style P fill:#8b5cf6,color:#fff
    style I fill:#06b6d4,color:#fff

The incident lifecycle: each phase has a time-based checkpoint. Detection to triage should take less than 5 minutes, triage to active mitigation under 15 minutes, mitigation to stable service under 60 minutes for P0/P1, and postmortem completed within 72 hours.

The five phases with time-based checkpoints

1. DETECTION (0-5 minutes): An alert fires, a customer reports an issue, or an engineer notices anomalous behavior. The goal is to recognize that something is wrong and begin the response process. Automated detection through monitoring and alerting is the gold standard — you should never learn about a P0 from a customer tweet. Key checkpoint: within 5 minutes of the problem starting, someone should be aware of it.

2. TRIAGE (5-15 minutes): The on-call engineer assesses the situation, determines severity, declares an incident if warranted, and assigns the Incident Commander role. This phase answers three questions: How bad is it? Who needs to know? What resources do we need? Key checkpoint: within 15 minutes, severity is assigned, an incident channel is open, the right people are engaged, and the first status update has been sent.

3. MITIGATION (15-60 minutes): The team works to stop the bleeding and restore service. This is NOT the time to find root cause — it is the time to reduce user impact by any means necessary: rollback, feature flag toggle, traffic rerouting, scaling up, or applying a workaround. Key checkpoint: for P0/P1, user impact should be mitigated within 60 minutes. If not, escalate further.

4. RESOLUTION (60+ minutes): The service is confirmed stable, temporary mitigations are cleaned up or made permanent, monitoring confirms metrics have returned to baseline, and the incident is formally closed. Key checkpoint: the Incident Commander confirms resolution by verifying SLO metrics are healthy for at least 15 minutes after the fix is applied.

5. POSTMORTEM (within 72 hours): The team conducts a blameless review of the incident. What happened? Why did it happen? How do we prevent it from happening again? The postmortem produces a written document with a timeline, root cause analysis, impact assessment, and prioritized action items. Key checkpoint: the postmortem document is completed and shared with stakeholders within 72 hours of resolution.


Mitigation First, Root Cause Later

During an active incident, your only goal is to reduce user impact. If a rollback fixes the problem, roll back first and investigate why the deploy broke things after the service is healthy. If disabling a feature flag stops the cascading failure, disable it first and figure out the root cause in the postmortem. The fire department does not stop to investigate arson while the building is burning. They put out the fire first. This is the single most important principle in incident management: mitigation before root cause analysis, every single time.

The 5-15-30-60 Rule

Use time-based checkpoints to maintain urgency during P0/P1 incidents. At 5 minutes: the on-call engineer should be actively investigating. At 15 minutes: severity should be declared, incident channel created, and IC assigned. At 30 minutes: the team should have a working theory and be actively executing a mitigation plan. At 60 minutes: user impact should be mitigated. If any checkpoint is missed, escalate immediately — the incident is taking longer than it should and needs additional resources.
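The 5-15-30-60 checkpoints can be encoded so a bot or the IC's timer reports the first overdue milestone automatically. A minimal sketch; the function name, state keys, and message wording are assumptions, not an existing tool's API.

```python
# (deadline in minutes, state key, what should be true by then)
CHECKPOINTS = [
    (5,  "acknowledged",       "alert acknowledged and under active investigation"),
    (15, "severity_declared",  "severity declared, incident channel open, IC assigned"),
    (30, "mitigation_running", "working theory formed, mitigation plan executing"),
    (60, "impact_mitigated",   "user impact resolved"),
]

def first_missed_checkpoint(minutes_elapsed, state):
    """Return a description of the first overdue checkpoint, or None if on track."""
    for deadline, key, description in CHECKPOINTS:
        if minutes_elapsed >= deadline and not state.get(key):
            return f"{deadline}-minute checkpoint missed: {description} -- escalate"
    return None
```

At minute 20 of an incident where only the alert has been acknowledged, this flags the 15-minute checkpoint, which per the rule above means escalating immediately.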

| Phase | Duration Target | Primary Goal | Key Actions | Exit Criteria |
| --- | --- | --- | --- | --- |
| Detection | < 5 min | Recognize the problem exists | Alert fires, on-call acknowledges, initial assessment begins | On-call engineer is actively investigating |
| Triage | 5-15 min | Classify and mobilize | Assign severity, declare incident, open war room, notify stakeholders | IC assigned, team assembled, first status update sent |
| Mitigation | 15-60 min | Reduce user impact | Rollback, feature flag, traffic shift, scale up, workaround | User-facing impact is measurably reduced or eliminated |
| Resolution | 1-4 hours | Confirm stability | Verify metrics, clean up workarounds, close incident | SLO metrics healthy for 15+ minutes, incident formally closed |
| Postmortem | < 72 hours | Learn and prevent | Timeline reconstruction, root cause analysis, action items | Document published, action items assigned and tracked |

Incident Roles and Responsibilities

Effective incident response requires clear roles with well-defined responsibilities. Without explicit roles, incidents devolve into chaos: multiple engineers attempt the same fix simultaneously, nobody communicates status to stakeholders, and critical diagnostic steps are skipped because everyone assumed someone else was doing them. The role structure used by Google, PagerDuty, and most mature SRE organizations is adapted from the Incident Command System (ICS) originally developed for firefighting and emergency response.

Role assignment happens during the triage phase and is the responsibility of the Incident Commander. Roles are not permanent job titles — they are hats worn during an incident. A junior engineer can be IC for a P3 incident. A VP can serve as Communications Lead if they happen to be available. The key principle is that every role is explicitly assigned to exactly one person, and everyone on the incident knows who holds each role. Unassigned roles lead to gaps; multiply-assigned roles lead to conflicts.

The five core incident roles

  • Incident Commander (IC) — The single point of authority and coordination during an incident. The IC does NOT debug the problem — they coordinate the people who do. Responsibilities include: assigning all other roles, making severity classification decisions, authorizing mitigation actions (rollbacks, failovers), managing the timeline and checkpoints, deciding when to escalate, and determining when the incident is resolved. The IC should spend 80% of their time coordinating and communicating, and at most 20% on technical work. When in doubt, the IC delegates technical tasks and focuses on keeping the response organized. At Google, IC is a formally trained role with certification requirements.
  • Communications Lead (Comms Lead) — Owns all external and internal communication during the incident. This includes updating the status page, posting regular updates to the incident Slack channel, sending email notifications to stakeholders, coordinating with customer support teams, and drafting any customer-facing communications. The Comms Lead shields the technical team from communication overhead — engineers should never be interrupted to write a status update while they are debugging. Update frequency depends on severity: P0 gets updates every 15 minutes, P1 every 30 minutes, P2 every hour.
  • Operations Lead (Ops Lead) — Leads the technical debugging and mitigation effort. The Ops Lead is typically the most senior engineer available with domain expertise in the affected system. They direct the diagnostic work, propose mitigation strategies to the IC for approval, and execute or delegate the technical fix. The Ops Lead works closely with Subject Matter Experts and reports findings to the IC. In smaller incidents, the Ops Lead may also be the person hands-on-keyboard making changes, but in larger incidents they coordinate multiple engineers working in parallel on different investigation threads.
  • Subject Matter Experts (SMEs) — Engineers with deep knowledge of specific subsystems who are pulled into the incident as needed. The IC or Ops Lead identifies which systems are involved and pages the appropriate SMEs. SMEs provide diagnostic insight, answer questions about system behavior, and execute targeted technical tasks. A typical P0 incident might involve SMEs from the database team, the networking team, and the application team working in parallel on different investigation threads. SMEs report to the Ops Lead, not directly to the IC, to maintain a clean command structure.
  • Scribe — Documents everything that happens during the incident in real time. The scribe maintains a running timeline of events, decisions, actions taken, and their outcomes. This timeline becomes the foundation of the postmortem document and is invaluable for reconstructing what happened after the fact. Without a scribe, the postmortem relies on memory, which is unreliable under stress. The scribe captures: timestamps of key events, who did what, what commands were run, what the results were, when severity changed, and when communications were sent. Many teams automate parts of this role using incident management tools that log Slack messages and PagerDuty actions automatically.
| Role | Primary Focus | Communication Pattern | Time Allocation |
| --- | --- | --- | --- |
| Incident Commander | Coordination, decisions, escalation | Receives updates from all roles, broadcasts decisions | 80% coordinating, 20% technical |
| Communications Lead | Status updates, stakeholder management | Receives info from IC, broadcasts to stakeholders/customers | 90% communication, 10% technical |
| Operations Lead | Technical debugging and mitigation | Reports findings to IC, directs SMEs | 80% technical, 20% coordinating SMEs |
| Subject Matter Experts | Deep system-specific investigation | Reports to Ops Lead, answers technical questions | 95% technical, 5% reporting |
| Scribe | Real-time documentation of all actions | Listens to all channels, asks clarifying questions | 100% documentation |

The IC Should NOT Be Debugging

The most common mistake in incident response is having the Incident Commander also serve as the primary debugger. When the IC is heads-down in logs, nobody is coordinating the response, nobody is tracking time checkpoints, and nobody is managing communication. The IC role exists specifically to maintain situational awareness while others do the technical work. If you are the IC and you feel the urge to start typing commands, assign someone else to do it and focus on coordination. At Google, ICs who start debugging are gently reminded: "IC, please delegate that task and return to coordination."

Role Assignment Script for the IC

When you declare an incident, use this script in the incident channel: "I am declaring a [severity] incident for [brief description]. I am taking the IC role. [Name], you are Communications Lead — please post an initial status page update within 10 minutes. [Name], you are Operations Lead — please begin diagnosis and report findings every 10 minutes. [Name], please scribe — capture everything in the incident doc. I will page additional SMEs as needed. Next status check in 15 minutes." This explicit role assignment eliminates ambiguity and gets everyone moving immediately.
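Because the wording should never be improvised under stress, some teams templatize this script so the IC only fills in names. A minimal sketch, assuming a hypothetical `declaration_message` helper; the field names are illustrative.

```python
def declaration_message(severity, description, comms, ops, scribe):
    """Render the IC's incident-declaration script with explicit role assignments."""
    return (
        f"I am declaring a {severity} incident for {description}. "
        f"I am taking the IC role. "
        f"{comms}, you are Communications Lead -- please post an initial "
        f"status page update within 10 minutes. "
        f"{ops}, you are Operations Lead -- please begin diagnosis and "
        f"report findings every 10 minutes. "
        f"{scribe}, please scribe -- capture everything in the incident doc. "
        f"I will page additional SMEs as needed. Next status check in 15 minutes."
    )
```

Posting the rendered message as the first line of the incident channel doubles as the scribe's first timeline entry.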

Scaling Roles by Severity

Not every incident needs all five roles. For a P3, a single on-call engineer may handle all roles. For a P2, you might have an IC/Ops Lead (combined) and a Comms Lead. For P0/P1, all five roles should be explicitly assigned to different people. The rule of thumb: if the incident requires more than two people, you need an explicit IC who is not doing technical work. If it requires more than four people, you need all five roles assigned.

Severity Classification Framework

A severity classification framework removes ambiguity from the most time-critical decision in incident response: how bad is this, and what response does it warrant? Without a framework, severity assignment becomes subjective — the on-call engineer who tends to panic calls everything a P0, while the one who tends to minimize calls everything a P3. Both behaviors are harmful. The framework provides objective criteria that any engineer can apply consistently, regardless of their temperament or experience level.

The best severity frameworks are based on user impact, not system behavior. A server with 100% CPU utilization is not inherently a P0 — it depends on whether users are affected. Conversely, a server at 20% CPU that is returning corrupted data to every request is absolutely a P0. Always classify severity by what users experience, not by what internal metrics show. This user-centric approach aligns incident severity with business impact, which is what leadership and customers actually care about.

| Criteria | P0 — Critical | P1 — High | P2 — Moderate | P3 — Low | P4 — Info |
| --- | --- | --- | --- | --- | --- |
| User Impact | All users affected | >10% users or core feature | Subset of users, non-core feature | Minor, workaround exists | None currently |
| Revenue Impact | Revenue completely stopped | Significant revenue loss | Moderate revenue impact | Negligible revenue impact | No revenue impact |
| Data Risk | Active data loss or corruption | Potential data loss imminent | Data integrity risk exists | No data risk | No data risk |
| Response Time | < 5 minutes | < 15 minutes | < 1 hour | < 4 hours (business) | Next sprint |
| Who Gets Paged | All on-call + leads + VP | Primary + secondary on-call + lead | Primary on-call | On-call triages in queue | Nobody |
| War Room | Immediate, dedicated bridge | Dedicated Slack channel | Slack thread | Ticket only | Backlog item |
| Status Page | Updated within 10 minutes | Updated within 20 minutes | Internal update only | No update needed | No update needed |
| Exec Notification | VP+ within 15 minutes | Director within 30 minutes | Manager notified async | None | None |

Escalation Triggers: When to Upgrade Severity

Upgrade severity when any of these conditions are met: (1) Impact is spreading — what started as regional is now global. (2) The 60-minute mitigation checkpoint has passed with no progress on P1/P2. (3) Additional systems are failing as a result of the original incident (cascading failure). (4) Data loss or corruption is discovered. (5) Customer or media attention increases (a P2 that gets a trending tweet becomes a P1 for business reasons). Severity should also be downgraded when impact reduces: a P0 where mitigation restores service to 95% of users becomes a P1 for the remaining 5%.

The Danger of Under-Classification

Under-classifying incident severity is more dangerous than over-classifying. A P0 incorrectly classified as P2 means: no executive notification (they learn from Twitter), no dedicated war room (the response is uncoordinated), no status page update (customers have no information), and slow escalation (additional engineers are not paged). The cost of over-classification is wasted coordination time — maybe 30 minutes for a few engineers. The cost of under-classification is hours of unmitigated user impact, customer trust erosion, and potential SLA violations with contractual penalties.

How to assign severity in under 2 minutes

1. CHECK USER IMPACT FIRST: Are users experiencing the problem right now? How many? Use your monitoring dashboards — check error rates, latency, and availability. If you do not have dashboards, check social media and customer support queues.

2. CHECK REVENUE IMPACT: Is the issue affecting revenue-generating transactions? Payment processing, checkout, subscription renewals, ad serving — if any of these are impaired, this is automatically P1 or higher.

3. CHECK DATA RISK: Is data being lost, corrupted, or exposed? Data incidents are always P1 or P0 regardless of the number of users affected, because data loss is often irreversible.

4. APPLY THE MATRIX: Match your answers against the severity criteria table. If the incident matches multiple severity levels across different criteria, use the highest severity.

5. ANNOUNCE AND ADJUST: Declare the severity in the incident channel. Severity is not permanent — re-evaluate at every 15-minute checkpoint and upgrade or downgrade as the situation evolves.


SLO-Based Severity Classification

Mature SRE organizations tie severity directly to SLO burn rates. If your error budget for the month is being consumed at 100x the normal rate, that is a P0 regardless of absolute error counts. If the burn rate is 10x normal, it is a P1. If it is 3x normal, it is a P2. This approach automatically scales severity with your SLO targets and eliminates the need for arbitrary thresholds like "error rate > 5%." It also means that a 0.1% error rate increase on a 99.99% SLO target is correctly classified as more severe than a 1% error rate increase on a 99% target.
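A minimal burn-rate calculation makes this concrete. The 100x/10x/3x thresholds follow the text; the helper functions themselves are an illustrative sketch, not a standard library.

```python
def burn_rate(observed_error_rate, slo_target):
    """How many times faster than normal the error budget is being consumed.

    A burn rate of 1.0 means errors exactly exhaust the budget over the period.
    """
    error_budget = 1.0 - slo_target          # e.g. a 99.99% SLO leaves a 0.0001 budget
    return observed_error_rate / error_budget

def severity_from_burn_rate(rate):
    """Map burn rate to severity using the 100x/10x/3x thresholds above."""
    if rate >= 100:
        return "P0"
    if rate >= 10:
        return "P1"
    if rate >= 3:
        return "P2"
    return None  # within normal burn; not an incident on SLO grounds

# A 0.1% error rate against a 99.99% target is a 10x burn -> P1, while a
# 1% error rate against a 99% target is only a 1x burn -> no incident.
```

This reproduces the example in the paragraph above: the smaller absolute error rate is correctly classified as the more severe event because its error budget is far tighter.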

The First 15 Minutes: Detection and Triage

The first 15 minutes of an incident determine whether the response is organized and effective or chaotic and wasteful. Research from PagerDuty and Google shows that incidents where triage is completed within 15 minutes have significantly shorter overall resolution times compared to incidents where triage takes longer than 30 minutes. The reason is straightforward: early, structured triage gets the right people working on the right problem with the right level of urgency. Delayed or chaotic triage means engineers investigate the wrong things, communication is inconsistent, and mitigation is delayed while the team figures out what is actually happening.

The first 15 minutes follow a predictable sequence that should be practiced until it is muscle memory. This is not the time for creative problem-solving or deep analysis — it is the time for a disciplined checklist. Just as pilots follow a pre-flight checklist regardless of how many flights they have completed, SRE teams should follow a triage checklist regardless of how many incidents they have handled.

The first 15 minutes, step by step

1. MINUTE 0-1: ALERT FIRES AND ON-CALL ACKNOWLEDGES. A monitoring alert triggers (PagerDuty page, Slack alert, or similar). The on-call engineer acknowledges the alert within 1 minute. Acknowledgment does not mean the problem is being fixed — it means a human is now aware and will begin assessment. If the alert is not acknowledged within 5 minutes, it auto-escalates to the secondary on-call.

2. MINUTE 1-3: INITIAL ASSESSMENT. The on-call engineer opens the relevant dashboard and assesses the situation. Three questions to answer: What is broken? (which service, which feature, which region). How bad is it? (error rate, latency, percentage of users affected). Is it getting worse? (is the impact trending up, stable, or recovering on its own). This assessment should take no more than 2-3 minutes using pre-built dashboards.

3. MINUTE 3-5: SEVERITY DECISION AND INCIDENT DECLARATION. Based on the initial assessment, the on-call engineer assigns a severity level using the classification framework. If P2 or higher, they formally declare an incident by creating a dedicated incident channel (e.g., #inc-20250219-payment-failures) and sending the declaration message with severity, brief description, and the role assignment script.

4. MINUTE 5-8: WAR ROOM ASSEMBLY. For P0/P1, the IC pages additional responders: the Ops Lead (the engineer most familiar with the affected system), the Comms Lead (typically a rotation role), relevant SMEs, and a scribe. For P2, the on-call may handle all roles or pull in one additional person. The IC posts the initial situation summary in the incident channel.

5. MINUTE 8-12: FIRST COMMUNICATION. The Communications Lead posts the first external and internal updates. Internal: a message in the company-wide incidents channel summarizing the impact and who is responding. External: a status page update indicating awareness of the issue. Customer support is notified with a brief FAQ so they can handle incoming support tickets. These communications happen regardless of whether the root cause is known — the message is "we are aware and investigating."

6. MINUTE 12-15: FIRST DIAGNOSTIC ACTIONS. The Ops Lead begins systematic diagnosis: checking recent deployments (did anything change?), reviewing error logs, checking dependency health, and looking for correlating signals. The IC sets a timer for the next checkpoint at the 30-minute mark. If no progress is made by then, escalation is automatic.


Pre-Build Your Incident Channel Template

Do not waste the first 5 minutes of an incident manually setting up a Slack channel and typing a description. Use an automated incident creation workflow (PagerDuty, incident.io, or a custom Slack bot) that creates the channel, sets the topic with the incident summary, pins the runbook, invites the on-call team, and posts a pre-formatted incident declaration message — all with a single slash command like /incident create P1 "Payment processing failures in us-east-1". Automation saves 3-5 minutes of manual setup during the most critical phase of the response.
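As a minimal sketch of what such automation generates, the following Python functions build the channel name and declaration message described above. The channel convention (#inc-YYYYMMDD-description) comes from this article; the function names and the exact message formatting are illustrative, not any vendor's API.

```python
from datetime import date


def incident_channel_name(slug, on=None):
    """Build a channel name following the #inc-YYYYMMDD-description
    convention used in this article."""
    on = on or date.today()
    return f"#inc-{on:%Y%m%d}-{slug}"


def declaration_message(severity, description, ic):
    """Pre-formatted declaration a bot could post on declaration.
    The wording and structure here are illustrative."""
    return (
        f":rotating_light: *{severity} INCIDENT DECLARED*\n"
        f"Description: {description}\n"
        f"Incident Commander: {ic}\n"
        f"Requesting volunteers for Ops Lead and Comms Lead."
    )
```

A bot wired to a slash command would call both, create the channel, pin the runbook, and page the rotation in one step.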

Do Not Debug Alone for 30 Minutes Before Declaring

A common anti-pattern: the on-call engineer receives an alert, spends 30 minutes investigating on their own, cannot figure it out, and then finally declares an incident. By the time the team assembles, 30 minutes of user impact have already occurred. The fix: if initial assessment (minute 1-3) shows any P2+ symptoms, declare immediately and bring in help. You can always downgrade the severity if it turns out to be less severe than initially thought. You cannot recover the 30 minutes of delayed response if it turns out to be worse.

| Minute | Action | Owner | Output |
| --- | --- | --- | --- |
| 0-1 | Acknowledge alert | On-call engineer | Alert acknowledged in PagerDuty/Opsgenie |
| 1-3 | Initial assessment (what, how bad, trending?) | On-call engineer | Preliminary impact assessment |
| 3-5 | Severity decision, declare incident, create channel | On-call (becomes IC or assigns IC) | Incident channel open, severity assigned |
| 5-8 | Page additional responders, assign roles | Incident Commander | IC, Comms, Ops, Scribe, SMEs assigned |
| 8-12 | First status update (internal + external) | Communications Lead | Status page updated, stakeholders notified |
| 12-15 | Begin systematic diagnosis, set 30-min checkpoint | Operations Lead | Diagnostic investigation underway, next checkpoint set |

The "Did Anything Change?" Question

The single most effective diagnostic question in the first 15 minutes is: "What changed?" Check the deployment log — was there a deploy in the last 2 hours? Check configuration management — was a config change pushed? Check infrastructure — was there a scaling event, a node failure, or a dependency update? In the majority of incidents, the root cause is a recent change. If you can identify the change, you can often mitigate by reverting it, which is faster than diagnosing the specific failure mode the change introduced.
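The "what changed?" sweep amounts to filtering a merged change feed down to a lookback window. A minimal Python sketch, assuming you can merge deploys, config pushes, and infrastructure events into a common list of dicts (the event shape here is an assumption — adapt it to whatever your deploy log actually emits):

```python
from datetime import datetime, timedelta


def recent_changes(events, now, window_hours=2):
    """Filter a merged change feed (deploys, config pushes, infra
    events) down to entries inside the lookback window, most recent
    first -- the newest change is the prime rollback suspect."""
    cutoff = now - timedelta(hours=window_hours)
    return sorted(
        (e for e in events if e["at"] >= cutoff),
        key=lambda e: e["at"],
        reverse=True,
    )
```

If this returns anything, reverting the top entry is usually the fastest mitigation candidate; if it returns nothing, widen the window before concluding "nothing changed."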

Communication During Incidents

Communication during an incident serves two critical purposes: it keeps the response team coordinated, and it keeps stakeholders informed. Poor communication is the number one cause of incident response dysfunction. Engineers working on different mitigation approaches without knowing about each other, stakeholders flooding the incident channel with questions because they have no status updates, customer support giving incorrect information because they were never briefed — all of these failures are communication failures, not technical failures. Google treats incident communication as a first-class engineering discipline with its own training, templates, and feedback loops.

The key insight is that communication during an incident follows a fundamentally different pattern than normal engineering communication. In normal work, you communicate when you have something complete to share. During an incident, you communicate on a regular cadence regardless of whether you have new information. Saying "no update — still investigating, next update in 15 minutes" is a valid and valuable communication because it tells stakeholders the team is still engaged and sets expectations for the next update.

graph TB
    IC[Incident Commander] --> CL[Communications Lead]
    IC --> OL[Operations Lead]
    OL --> SME1[SME: Database]
    OL --> SME2[SME: Networking]
    OL --> SME3[SME: Application]
    CL --> SP[Status Page]
    CL --> SI[Slack: #incidents]
    CL --> CS[Customer Support]
    CL --> EX[Executive Updates]
    IC --> SC[Scribe]

    style IC fill:#3b82f6,color:#fff
    style CL fill:#f59e0b,color:#fff
    style OL fill:#10b981,color:#fff
    style SP fill:#8b5cf6,color:#fff
    style SI fill:#8b5cf6,color:#fff
    style CS fill:#8b5cf6,color:#fff
    style EX fill:#8b5cf6,color:#fff

Information flows up from SMEs to Ops Lead to IC. Communications flow out from the Comms Lead to all external channels. The IC is the bridge between technical investigation and stakeholder communication.

| Severity | Internal Update Frequency | Status Page Update | Executive Update | Customer Email |
| --- | --- | --- | --- | --- |
| P0 | Every 15 minutes | Every 15-30 minutes | Within 15 min, then every 30 min | If outage > 30 min |
| P1 | Every 30 minutes | Every 30-60 minutes | Within 30 min, then hourly | If outage > 1 hour |
| P2 | Every 60 minutes | Only if customer-visible | Async summary after resolution | Rarely needed |
| P3 | As needed | No | No | No |
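The internal-update column of the cadence above is easy to encode so a bot (or the IC's timer) can compute when the next update is owed. A small Python sketch; the names are illustrative, and P3's "as needed" is represented as no fixed deadline:

```python
from datetime import datetime, timedelta

# Internal update cadence in minutes, per the table above.
# P3 has no fixed cadence ("as needed"), so it is omitted.
INTERNAL_UPDATE_MINUTES = {"P0": 15, "P1": 30, "P2": 60}


def next_update_due(severity, last_update):
    """Return when the Comms Lead owes the next internal update,
    or None when the severity has no fixed cadence."""
    minutes = INTERNAL_UPDATE_MINUTES.get(severity)
    if minutes is None:
        return None
    return last_update + timedelta(minutes=minutes)
```

Wiring this to a recurring reminder is what prevents the "Silent Incident" anti-pattern described below.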

The Incident Status Update Template

Every status update, whether internal or external, should follow this structure: (1) CURRENT STATUS: one sentence on whether the issue is ongoing, mitigated, or resolved. (2) IMPACT: who is affected and how. (3) WHAT WE ARE DOING: specific actions the team is taking right now. (4) NEXT UPDATE: when stakeholders can expect the next communication. Example: "CURRENT STATUS: Payment processing is degraded. IMPACT: Approximately 15% of checkout attempts are failing in us-east-1. WHAT WE ARE DOING: We have identified a database connection pool exhaustion issue and are scaling up the connection pool. NEXT UPDATE: 30 minutes or sooner if status changes." This template works for Slack, status pages, and executive emails.
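The four-part structure is simple enough to enforce in code, so no part gets skipped under pressure. A minimal Python sketch (the function name is an assumption):

```python
def status_update(current_status, impact, actions, next_update):
    """Render the four-part status update template: status, impact,
    what we are doing, and when the next update will come."""
    return (
        f"CURRENT STATUS: {current_status} "
        f"IMPACT: {impact} "
        f"WHAT WE ARE DOING: {actions} "
        f"NEXT UPDATE: {next_update}"
    )
```

Because all four arguments are required, a Comms Lead using this helper cannot post an update that omits the impact or forgets to promise a next update.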

Communication anti-patterns to avoid

  • The Silent Incident — No updates for 45+ minutes. Stakeholders do not know if the team is working on it or if everyone went to lunch. Fix: set a recurring timer and post updates even when there is no new information.
  • The Blame Channel — Engineers start pointing fingers in the incident channel: "this is the database team's fault" or "who approved this deploy?" Blame during an incident wastes time and damages trust. Fix: the IC immediately redirects blame conversations and reminds the team that root cause analysis happens in the postmortem, not during mitigation.
  • The Noise Flood — Dozens of people join the incident channel and start asking questions, posting theories, or offering unsolicited help. The signal-to-noise ratio drops to zero. Fix: use a structured incident channel where only the IC and assigned roles post primary updates. Others can post in a separate thread or a companion discussion channel.
  • The Stakeholder Interrupt — A VP joins the incident channel and starts asking "why is this happening?" and "when will it be fixed?" These are legitimate questions but asking them in the incident channel interrupts the team. Fix: the Communications Lead proactively sends executive updates on a regular cadence so executives have information without needing to interrupt the war room.
  • The Internal Jargon Blast — The status page update says "connection pool exhaustion in pgbouncer causing 502s from the API gateway." Customers have no idea what this means. Fix: external communications use plain language — "Some customers may experience errors when attempting to make purchases. Our engineering team is actively working to resolve this."

Never Speculate in External Communications

Status page updates and customer communications must never include speculation about root cause. Saying "we believe the issue is caused by a database failure" and then discovering it was actually a network partition makes your team look incompetent and erodes customer trust. External communications should describe the impact (what customers experience), the response (that engineering is actively working), and the timeline for the next update. Root cause is shared only after the postmortem confirms it.

Pre-Write Your Status Page Templates

During an active P0, the Communications Lead should not be wordsmithing status page updates. Pre-write templates for common scenarios: "We are investigating reports of [service] issues affecting [feature]. Some users may experience [symptom]. Our engineering team is actively investigating and we will provide updates every [frequency]." Having templates ready means the first status page update goes live in 2 minutes instead of 10.
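One way to keep templates ready is as format strings with named placeholders. In this Python sketch, the "investigating" wording comes from the tip above; the "identified" and "resolved" wordings are illustrative placeholders, not approved copy:

```python
# Pre-written status page templates. "investigating" follows the
# example in the text; "identified" and "resolved" are illustrative.
STATUS_TEMPLATES = {
    "investigating": (
        "We are investigating reports of {service} issues affecting "
        "{feature}. Some users may experience {symptom}. Our engineering "
        "team is actively investigating and we will provide updates "
        "every {frequency}."
    ),
    "identified": (
        "We have identified the cause of the {service} issues and are "
        "working on a fix. We will provide updates every {frequency}."
    ),
    "resolved": (
        "The issue affecting {service} has been resolved. We will share "
        "more detail after our internal review is complete."
    ),
}


def render_status(stage, **fields):
    """Fill a template; raises KeyError on a missing placeholder,
    which is better than publishing a half-filled update."""
    return STATUS_TEMPLATES[stage].format(**fields)
```

Note that every template speaks in the plain, customer-facing language this section recommends; no internal jargon slots exist to fill in.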

Mitigation Strategies

Mitigation is the art of reducing user impact as quickly as possible, regardless of whether you understand the root cause. This is a fundamentally different mindset from debugging, and many engineers struggle with it. The debugging mindset says "I need to understand why this is happening before I can fix it." The mitigation mindset says "I need to make the user impact stop right now, and I will figure out why later." In a production incident, the mitigation mindset always wins. Every minute spent investigating root cause while a mitigation option is available is a minute of unnecessary user suffering.

The best mitigation strategies are pre-built and pre-tested. You do not want to be figuring out how to roll back a deployment for the first time during a P0 at 3 AM. Rollback procedures, feature flag configurations, traffic shifting mechanisms, and scaling playbooks should all be documented, tested regularly, and executable within minutes. The goal is to have a menu of mitigation options that the Ops Lead can choose from based on the situation, not a blank page that requires improvisation under pressure.

The mitigation toolkit: ordered by speed and safety

  • Rollback the last deployment — If the incident correlates with a recent deploy, rolling back is the fastest and safest mitigation. Modern CI/CD systems should support one-click rollback to the previous known-good version. Target: rollback should complete within 5 minutes. This is by far the most common mitigation strategy and resolves the majority of incidents caused by code changes. If your rollback takes more than 15 minutes, fix your deployment pipeline before fixing anything else.
  • Toggle a feature flag — If the incident is caused by a specific feature, disabling that feature via a feature flag is faster than a full rollback and more targeted. Feature flags allow you to surgically disable problematic functionality while keeping the rest of the service healthy. This requires the feature to have been built behind a flag — which is why feature flags should be mandatory for all new features. Target: feature flag toggle takes effect within 1 minute.
  • Shift traffic away from the affected region/instance — If the problem is localized to a specific region, availability zone, or set of instances, redirect traffic to healthy capacity. This can be done via DNS changes (slow, 5-30 minute TTL), load balancer configuration changes (fast, 1-2 minutes), or service mesh routing rules (fastest, seconds). This buys time to investigate the problem in the affected region without user impact.
  • Scale up or out — If the incident is caused by capacity exhaustion (connection pool full, CPU saturated, memory exhausted), adding more capacity can restore service. This is a temporary mitigation, not a permanent fix — you need to understand why capacity was exhausted and either optimize the code or right-size the infrastructure. Auto-scaling should handle this automatically, but if auto-scaling is broken or too slow, manual scaling is the mitigation.
  • Apply a targeted workaround or hotfix — If you can identify the specific code path or configuration causing the issue, a targeted fix may be faster than a full rollback. This is riskier than a rollback because you are pushing a change under pressure, and changes made under pressure have a higher error rate. Use this only when rollback is not possible (e.g., the previous version has a different critical bug, or data migration prevents rollback). Require a second engineer to review the hotfix before deployment.
  • Fail over to a disaster recovery site — For catastrophic failures that cannot be mitigated within the primary environment, fail over to a DR site. This is the mitigation of last resort because failover is complex, often involves data synchronization risks, and takes the longest to execute. DR failover should be tested quarterly so the team knows how to execute it under pressure.
| Strategy | Speed | Risk | When to Use | Prerequisite |
| --- | --- | --- | --- | --- |
| Rollback | < 5 min | Low | Incident correlates with recent deploy | CI/CD supports one-click rollback |
| Feature flag toggle | < 1 min | Low | Problem is in a flagged feature | Feature built behind a feature flag |
| Traffic shifting | 1-5 min | Medium | Problem is regional or instance-specific | Multi-region deployment, load balancer access |
| Scale up/out | 5-15 min | Low | Capacity exhaustion is the cause | Infrastructure supports scaling, capacity available |
| Hotfix | 15-30 min | High | Rollback not possible, specific fix known | Fast CI/CD, second-engineer review |
| DR failover | 30-60 min | High | Primary environment is unrecoverable | DR site tested, data sync confirmed |
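The ordering in the table can be sketched as a first-match rule. The `ctx` flags below are hypothetical names for conditions your monitoring and deploy log would establish; treat this Python sketch as a mnemonic for the ordering, not a decision engine — real mitigation choices involve judgment.

```python
def choose_mitigation(ctx):
    """Walk the mitigation toolkit in order of speed and safety and
    return the first applicable strategy. All ctx keys are
    illustrative condition flags, defaulting to False/absent."""
    if ctx.get("recent_deploy") and ctx.get("rollback_safe", True):
        return "rollback"          # fastest, safest; default choice
    if ctx.get("behind_feature_flag"):
        return "feature_flag_toggle"
    if ctx.get("regional"):
        return "traffic_shift"
    if ctx.get("capacity_exhaustion"):
        return "scale"
    if ctx.get("fix_known"):
        return "hotfix"            # riskier: change under pressure
    return "dr_failover"           # last resort
```

Note how `rollback_safe` defaults to True: per the rollback-first policy below, the burden is on proving rollback is NOT safe, not the other way around.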

The Rollback-First Policy

Adopt a rollback-first policy: if the incident correlates with a recent change (deploy, config push, infrastructure modification), the default mitigation is to revert that change. The IC should not need to justify a rollback — the person who made the change should need to justify NOT rolling back. This policy dramatically reduces mean time to recovery because rollback is the fastest mitigation available. Engineers sometimes resist rollback because they are "almost done" finding the bug. Overrule this instinct. Roll back, restore service, and then debug at leisure.

When Rollback Is Not Possible

Rollback is not always safe. Database schema migrations that drop columns cannot be rolled back without data loss. Changes that alter data formats (serialization changes) may make existing data unreadable by the old code. API version changes where clients have already adopted the new version cannot be rolled back without breaking those clients. For these scenarios, you need forward-fix capability: the ability to quickly deploy a targeted patch rather than reverting to a previous version. These scenarios should be identified during deploy review and have explicit mitigation plans documented before the change is deployed.

Test Your Mitigations Quarterly

Every mitigation strategy in your toolkit should be tested at least quarterly. Can you actually roll back a deploy in under 5 minutes? Try it. Can you toggle a feature flag and see the effect within 1 minute? Test it. Can you shift traffic between regions seamlessly? Prove it. Mitigations that have never been tested are mitigations that will fail when you need them most. Include mitigation testing in your chaos engineering program or Game Day exercises.

Incident Tools and Infrastructure

Incident management tooling automates the repetitive, error-prone parts of incident response so that engineers can focus on the technical work. The right tooling can reduce the overhead of incident declaration from 10 minutes to 10 seconds, ensure that communication happens on schedule without manual reminders, and automatically generate postmortem timelines from the incident channel history. The wrong tooling — or worse, no tooling — means every incident requires manual channel creation, manual role assignment, manual status page updates, and manual timeline reconstruction.

The incident management tool ecosystem has matured significantly in the last five years. The core platforms serve different needs depending on team size, budget, and workflow preferences. Understanding the strengths and limitations of each helps you choose the right tool for your organization and avoid the trap of building custom tooling when off-the-shelf solutions exist.

Core incident management tool categories

  • On-call management and alerting — PagerDuty and Opsgenie are the market leaders for on-call scheduling, alert routing, escalation policies, and notification delivery. These tools ensure that when an alert fires, the right person is notified through the right channel (push notification, phone call, SMS) with automatic escalation if there is no acknowledgment. They are the entry point for the incident lifecycle — without reliable alert delivery, everything downstream fails.
  • Incident coordination platforms — incident.io, FireHydrant, and Rootly provide the coordination layer: automated incident channel creation, role assignment, status page integration, stakeholder communication, and postmortem generation. These tools integrate with Slack and provide slash commands like /incident create P1 "description" that trigger the entire incident workflow automatically. They dramatically reduce the manual overhead of incident management.
  • Status pages — Statuspage (Atlassian), Instatus, and Better Uptime provide external-facing incident communication. When an incident is declared, the status page is updated to show the current impact, and customers can subscribe to updates via email, SMS, or RSS. A well-maintained status page reduces inbound support tickets by 30-50% during incidents because customers get information proactively instead of contacting support.
  • Incident communication (Slack/Teams) — Slack is the de facto war room for most engineering organizations. Dedicated incident channels (#inc-YYYYMMDD-description) keep incident discussion separate from normal work. Slack bots automate repetitive tasks: posting status update reminders, capturing timeline events, and paging additional responders. Microsoft Teams serves the same purpose for organizations in the Microsoft ecosystem.
  • Postmortem and documentation tools — Notion, Confluence, or Google Docs for postmortem documents. incident.io and FireHydrant can auto-generate postmortem templates pre-filled with the incident timeline from Slack messages. The key requirement is that postmortem documents are searchable, so the team can find previous incidents when a similar issue occurs.
  • Monitoring and observability (upstream) — Datadog, Grafana, New Relic, and Prometheus are not incident management tools per se, but they are critical infrastructure for incident detection and diagnosis. The monitoring stack feeds alerts to the on-call management platform, which triggers the incident management workflow. Without monitoring, there are no alerts, and without alerts, incidents are detected by customer complaints instead of automated systems.
| Feature | PagerDuty | Opsgenie | incident.io |
| --- | --- | --- | --- |
| Primary Strength | On-call management and alerting | On-call management (Jira integration) | Incident coordination and workflow |
| On-call Scheduling | Best in class, complex rotation support | Strong, Jira ecosystem integration | Basic, focused on incident workflow |
| Alert Routing | Sophisticated escalation policies, event intelligence | Good escalation, Jira auto-ticket creation | Relies on PagerDuty/Opsgenie for alerting |
| Incident Coordination | PagerDuty Incident Response add-on | Basic incident tracking | Best in class Slack-native coordination |
| Slack Integration | Good, bidirectional | Good, bidirectional | Excellent, Slack-first design |
| Status Page | PagerDuty Status Page add-on | Opsgenie Status Page | Integrates with Statuspage, Instatus |
| Postmortem | Basic template generation | Basic template | Auto-generated from Slack timeline |
| Pricing | $21-41/user/month | $9-35/user/month | From $16/user/month |
| Best For | Large orgs needing robust on-call + alerting | Jira/Atlassian-heavy organizations | Slack-first teams wanting workflow automation |

The Minimum Viable Incident Management Stack

If you are starting from zero, here is the minimum viable stack: (1) PagerDuty or Opsgenie for on-call scheduling and alert routing — pick one based on your existing tooling ecosystem. (2) Slack with a naming convention for incident channels (#inc-YYYYMMDD-brief-description). (3) A status page (Statuspage free tier or Instatus) for external communication. (4) A Google Doc or Notion template for postmortems. This stack costs under $500/month for a team of 10 and covers the essential workflow. Upgrade to incident.io or FireHydrant when you are handling more than 10 incidents per month and the manual overhead becomes unsustainable.

Do Not Build Your Own Incident Management Tool

Every year, well-intentioned SRE teams decide to build a custom Slack bot for incident management. After six months of development, they have a fragile bot that handles the happy path and breaks on edge cases. Meanwhile, incident.io, FireHydrant, and Rootly have dedicated product teams that have spent years building robust incident workflow automation. Unless your incident management needs are truly unique (unlikely), buy, do not build. Spend your engineering time on product reliability, not on reinventing incident management tooling.

Tool Integration Is Non-Negotiable

Your incident management tools must integrate seamlessly with each other. When an alert fires in PagerDuty, it should automatically create an incident in incident.io, which creates a Slack channel, pages the on-call team, and posts an initial status page update — all without manual intervention. If any step requires a human to copy-paste information between tools, the workflow will break under pressure. Test your integration end-to-end by firing a test alert and verifying the entire chain executes correctly. This test should be part of your on-call onboarding process.

Building Your First Incident Response Process

If you are starting from scratch — no incident process, no on-call rotation, no runbooks, no postmortem culture — this section provides a step-by-step guide to building your first incident response process in 30 days. The goal is not to build a Google-scale incident management system on day one. The goal is to establish the minimum viable process that ensures incidents are detected, responded to, communicated about, and learned from. You can iterate and improve from there.

The biggest mistake teams make when building their first incident response process is trying to do everything at once. They write a 50-page incident management playbook, configure elaborate PagerDuty escalation policies, and design a complex severity classification matrix — and then nobody follows any of it because it is too complicated. Start simple. A process that is followed 100% of the time is infinitely more valuable than a comprehensive process that is followed 20% of the time. Simplicity drives adoption, and adoption drives reliability.

Week 1-4: From zero to operational incident response

1. WEEK 1, DAY 1-2: ESTABLISH ON-CALL ROTATION. Set up PagerDuty or Opsgenie (free tier is sufficient to start). Create a single on-call schedule that covers 24/7 with your team. If you have 4 engineers, each person is on-call for one week out of four. Configure alert routing so that monitoring alerts reach the on-call engineer via push notification and phone call. Test the chain end-to-end: have someone trigger a test alert and verify the on-call person receives it within 60 seconds.

2. WEEK 1, DAY 3-5: DEFINE YOUR SEVERITY LEVELS. Write a one-page document defining P0 through P3 (skip P4 for now — keep it simple). For each severity, specify: what qualifies (with concrete examples from your system), who gets notified, expected response time, and communication expectations. Share this document with the entire engineering team and get agreement. Post it in a pinned Slack message in your engineering channel.

3. WEEK 2: CREATE INCIDENT WORKFLOW. Define the incident channel naming convention (#inc-YYYYMMDD-description). Write a one-paragraph incident declaration template that includes severity, description, who is IC, and a request for the Ops Lead and Comms Lead. Set up a free Statuspage for external communication. Write three status page update templates: one for "investigating," one for "identified and working on fix," and one for "resolved." Test the entire workflow with a fake incident.

4. WEEK 3: WRITE YOUR FIRST THREE RUNBOOKS. Identify the three most common types of incidents your team has experienced (e.g., "database connection pool exhaustion," "high API latency," "deployment failure"). For each, write a runbook that covers: symptoms (how the alert presents), diagnostic commands (what to check), mitigation steps (what to do), and escalation criteria (when to call for help). Store runbooks in your team wiki and link them from the corresponding PagerDuty alerts.

5. WEEK 4: CONDUCT YOUR FIRST TABLETOP EXERCISE. A tabletop exercise is a simulated incident where the team practices the response process without any actual system impact. The facilitator presents a scenario ("It is 2 AM, PagerDuty wakes you up. The dashboard shows 50% error rate on the checkout API. What do you do?") and the team walks through their response step by step. Identify gaps in the process, update runbooks, and schedule the next tabletop for one month later.

6. ONGOING: ESTABLISH POSTMORTEM CULTURE. After every P0 and P1 incident, write a blameless postmortem within 72 hours. The postmortem should include: a timeline of events, root cause analysis, what went well, what could be improved, and action items with owners and due dates. Share the postmortem with the entire engineering team. Review action item completion monthly. The postmortem is where the real improvement happens — without it, you will repeat the same incidents.
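To lower the friction of that postmortem habit, a small generator can seed every document with the required sections. The section names follow the list in the ongoing step above; the function name and Markdown formatting in this Python sketch are illustrative.

```python
# Sections every blameless postmortem should contain, per the
# ongoing step above.
POSTMORTEM_SECTIONS = [
    "Timeline of events",
    "Root cause analysis",
    "What went well",
    "What could be improved",
    "Action items (owner + due date)",
]


def postmortem_skeleton(incident_id, severity):
    """Generate a postmortem skeleton so no section is forgotten.
    The Markdown layout here is an illustrative choice."""
    header = f"Postmortem: {incident_id} ({severity})\n"
    body = "\n".join(f"## {s}\n\n_TODO_\n" for s in POSTMORTEM_SECTIONS)
    return header + "\n" + body
```

Dropping this skeleton into the team wiki within 72 hours of resolution turns "we should write a postmortem" into filling in five headed sections.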


The Tabletop Exercise Format

Run a 60-minute tabletop exercise monthly. Structure: (1) 5 minutes — facilitator presents the scenario and initial alert. (2) 10 minutes — the team practices triage: severity classification, incident declaration, role assignment. (3) 20 minutes — the facilitator reveals additional information in stages (e.g., "the error rate is now 80%," "a customer VP just tweeted about it," "the database team says the replica is 30 minutes behind") and the team practices response, escalation, and communication. (4) 15 minutes — debrief: what went well, what was confusing, what process gaps were exposed. (5) 10 minutes — update runbooks and the incident process based on findings. Tabletop exercises are the cheapest, lowest-risk way to improve incident response. They require zero infrastructure and expose process gaps before real incidents do.
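The five-phase agenda above can be captured as data with a sanity check that the phases actually fill the hour. A trivial sketch; phase names are paraphrased from this section:

```python
# The 60-minute tabletop agenda as (phase, minutes) pairs.
TABLETOP_AGENDA = [
    ("Facilitator presents scenario and initial alert", 5),
    ("Triage practice: severity, declaration, roles", 10),
    ("Staged reveals: response, escalation, communication", 20),
    ("Debrief: wins, confusion, process gaps", 15),
    ("Update runbooks and incident process", 10),
]

# The agenda should sum to exactly one hour.
assert sum(minutes for _, minutes in TABLETOP_AGENDA) == 60
```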

On-Call Onboarding Checklist

Before an engineer joins the on-call rotation, they should complete this checklist: (1) PagerDuty/Opsgenie account configured with correct notification preferences. (2) Access to all production dashboards and monitoring tools. (3) Access to deployment tools and ability to perform rollbacks. (4) Read all existing runbooks and understand where to find them. (5) Shadow an experienced on-call engineer for one full rotation. (6) Complete a tabletop exercise. (7) Perform a supervised first on-call shift where an experienced engineer is available as backup. Skipping onboarding leads to an overwhelmed on-call engineer who freezes during their first real incident.

Week | Deliverable | Time Investment | Impact
Week 1 | On-call rotation + severity levels defined | 8-12 hours | Alerts reach the right person, severity is consistent
Week 2 | Incident workflow + status page + templates | 6-8 hours | Incidents are declared consistently, customers are informed
Week 3 | Three critical runbooks written | 6-10 hours | On-call engineers know what to do for common incidents
Week 4 | First tabletop exercise conducted | 4-6 hours | Process gaps identified and fixed before real incidents
Ongoing | Postmortem after every P0/P1 | 3-4 hours per incident | Team learns from incidents and prevents recurrence

Do Not Wait for Perfection to Start

The biggest risk in building an incident response process is waiting until everything is perfect before going live. A team with a basic severity classification, a Slack channel naming convention, and three runbooks is dramatically better prepared than a team with nothing. Start with the minimum viable process from this section, use it for real incidents, and improve it iteratively based on what breaks. Your postmortems will tell you exactly what to improve next. The process you have today does not need to be the process you have in six months — it just needs to be better than having no process at all.

The Incident Response Maturity Ladder

Level 0: No process — incidents are handled ad hoc by whoever notices the problem. Level 1: Basic on-call rotation and severity levels exist, incidents are declared in Slack. Level 2: Roles are assigned, communication templates exist, runbooks cover common scenarios. Level 3: Incident management tooling automates workflow, postmortems are conducted consistently, tabletop exercises are regular. Level 4: SLO-based severity classification, automated mitigation for known failure modes, incident metrics tracked and reviewed. Level 5: Proactive chaos engineering, automated incident detection and response, incident-free deploys through progressive delivery. Most teams are at Level 0 or 1. Getting to Level 2 in 30 days using this guide is a realistic and impactful goal.

How this might come up in interviews

Incident management is one of the most frequently tested topics in SRE interviews at FAANG companies. In behavioral interviews, you will be asked to describe how you handled a production incident — interviewers evaluate your ability to stay calm, coordinate a team, communicate clearly, and make sound decisions under pressure. In system design interviews, you may be asked to design an incident management system or explain how you would set up on-call for a new service. In scenario-based interviews, you will be given a simulated incident and asked to walk through your response step by step, with the interviewer introducing complications at each stage. Knowing the incident lifecycle, role definitions, severity framework, and mitigation strategies gives you a structured, vocabulary-rich answer that demonstrates real operational experience.

Common questions:

  • Walk me through how you would handle a P0 incident from the moment the alert fires to the postmortem. What are the specific steps, who is involved, and what decisions do you make at each stage?
  • You are the Incident Commander for a major outage. The CEO joins the incident channel and starts asking questions. How do you handle this without disrupting the technical response?
  • Describe the difference between mitigation and resolution. Give me an example of when you would choose each approach during an active incident.
  • How would you build an on-call rotation and incident response process for a team that currently has neither? Walk me through the first month.
  • Tell me about a time when incident communication went wrong. What happened, and what would you do differently?
  • How do you write a blameless postmortem? What sections does it include, and how do you ensure action items actually get completed?

Try this question: Ask the interviewer: What does your on-call rotation look like — what is the typical pages-per-shift rate? How do you conduct postmortems — is there a blameless culture? What incident management tooling do you use? These questions demonstrate operational maturity and help you evaluate whether the organization has a healthy on-call culture or a burnout-inducing one.

Strong answer: Immediately structuring the response around the incident lifecycle phases. Naming specific roles and their responsibilities. Emphasizing the mitigation-first principle. Describing a communication cadence (every 15 minutes for P0). Mentioning blameless postmortems with tracked action items. Citing specific tools (PagerDuty, incident.io) and explaining when each is appropriate. Describing tabletop exercises or chaos engineering as proactive preparation.

Red flags: Describing an incident response that is just "everyone jumps in and starts debugging." Not knowing what an Incident Commander does. Prioritizing root cause analysis over mitigation during an active incident. No mention of communication to stakeholders. Treating postmortems as optional or blame-oriented. Not being able to define severity levels with concrete criteria.

Key takeaways

  • An incident is an unplanned disruption requiring organized, urgent response. Classify severity by user impact (P0-P4), not by internal system behavior. Declare early and declare often — the cost of a false alarm is trivially small compared to the cost of a delayed response to a real incident.
  • The incident lifecycle has five phases — Detection, Triage, Mitigation, Resolution, Postmortem — each with distinct goals and time-based checkpoints. Use the 5-15-30-60 rule to maintain urgency: alert acknowledged in 5 minutes, severity declared in 15, mitigation plan executing in 30, user impact resolved in 60.
  • Assign explicit incident roles: Incident Commander (coordinates, does not debug), Communications Lead (owns all status updates), Operations Lead (leads technical work), SMEs (deep system expertise), and Scribe (documents everything). The IC should spend 80% of their time coordinating and at most 20% on technical work.
  • Prioritize mitigation over root cause analysis during every active incident. Roll back first, toggle feature flags, shift traffic, scale up — restore service by any means necessary and investigate root cause in the postmortem. You do not investigate arson while the building is burning.
  • Build your first incident response process in 30 days: week 1 establishes on-call rotation and severity levels, week 2 creates the incident workflow and status page, week 3 writes the three most critical runbooks, and week 4 runs the first tabletop exercise. Start simple, iterate based on postmortem findings, and improve continuously.
Before you move on: can you answer these?

Walk through the first 15 minutes of a P0 incident. What specific actions happen at each time checkpoint, and who is responsible for each action?

Minute 0-1: Alert fires, on-call engineer acknowledges. Minute 1-3: On-call checks dashboards to assess what is broken, how bad it is, and whether it is getting worse. Minute 3-5: On-call assigns severity (P0), declares incident, creates dedicated incident channel, and assigns IC role (may take IC themselves or assign to another). Minute 5-8: IC pages Ops Lead, Comms Lead, Scribe, and relevant SMEs using the role assignment script. Minute 8-12: Comms Lead posts first status page update and notifies stakeholders — message focuses on impact and that the team is investigating, not on root cause. Minute 12-15: Ops Lead begins systematic diagnosis starting with "what changed recently?" and sets the 30-minute checkpoint for escalation if no progress is made.

Explain the difference between mitigation and resolution. Why should incident response prioritize mitigation, and when is it acceptable to skip directly to root cause analysis?

Mitigation reduces or eliminates user impact without necessarily understanding or fixing the root cause — examples include rollback, feature flag toggle, traffic shifting, and scaling up. Resolution is the permanent fix that addresses the underlying root cause. Incident response prioritizes mitigation because every minute of user impact has real cost (revenue, trust, SLA). You should almost never skip to root cause analysis during an active incident. The only exception is when the mitigation options are exhausted or would cause more harm than the incident itself (e.g., rolling back would cause data loss). Even then, the team should split: one track investigates root cause while another continues exploring mitigation options.

Your team has no incident response process. Describe the minimum viable process you would implement in the first week and how you would iterate on it.

Week 1 deliverables: (1) Set up PagerDuty or Opsgenie with a 24/7 on-call rotation across the team. Test alert delivery end-to-end. (2) Define P0/P1/P2/P3 severity levels with concrete examples and response time expectations. Post in a pinned Slack message. (3) Establish a Slack channel naming convention for incidents (#inc-YYYYMMDD-description) and a one-paragraph declaration template. Iteration: after the first real incident, conduct a postmortem focused on process gaps. In week 2-3, write runbooks for the most common incident types. In week 4, run a tabletop exercise. Each postmortem and tabletop exercise reveals the next thing to improve. The process evolves through real-world feedback, not upfront design.

🧠Mental Model

💡 Analogy

Incident management is like a fire department response. When the alarm rings (alert fires), the dispatcher (Incident Commander) assesses the situation (triage), assigns trucks and crews (roles), and coordinates the response (mitigation). Engine companies fight the fire, ladder companies rescue occupants, and the battalion chief coordinates everything from the command post. After the fire is out, the fire marshal investigates what happened and how to prevent it (postmortem). Critically, you do not stop to investigate the cause of the fire while the building is still burning — you put out the fire first. The fire department does not send one firefighter to handle a five-alarm fire alone, and they do not send the entire department for a trash can fire. They scale the response to the severity, just as incident management scales from a single on-call engineer for a P3 to an all-hands response for a P0.

⚡ Core Idea

Incident management is the structured process of detecting, triaging, mitigating, resolving, and learning from production incidents. It requires clear severity classification, explicit role assignment, disciplined communication, and a bias toward mitigation over root cause analysis during the active response. The postmortem closes the loop by turning incidents into structural improvements that prevent recurrence.

🎯 Why It Matters

Every production system will have incidents — the question is not whether you will face them but whether you will handle them well or poorly. A team with a practiced incident response process mitigates a P0 in 30 minutes. A team without a process takes 3 hours to mitigate the same P0 while engineers step on each other, stakeholders are uninformed, and nobody documents what happened. That 2.5-hour difference directly translates to user impact, revenue loss, SLA penalties, and engineering burnout. Incident management is also one of the most commonly tested topics in SRE interviews at FAANG companies — interviewers want to know that you can stay calm, coordinate a team, and make sound decisions under pressure.
