Severity levels, incident lifecycle, roles and responsibilities, communication protocols, mitigation strategies, and tooling — the complete foundation for managing production incidents at any scale, from startup to FAANG.
An incident is an unplanned disruption or degradation of a service that requires an organized, urgent response. This definition is deliberately broad because the threshold for declaring an incident varies between organizations, but the core principle is universal: when the impact to users exceeds what your normal engineering workflow can address in a reasonable timeframe, you have an incident. Not every bug is an incident. A cosmetic UI issue on a settings page that affects no critical user flow is a bug — file a ticket, prioritize it in the next sprint. But if that same UI issue prevents users from completing purchases, it becomes an incident because it is actively causing business impact that grows with every minute it remains unresolved.
The distinction between a bug and an incident comes down to three factors: urgency (does this need to be fixed right now?), impact scope (how many users or how much revenue is affected?), and organizational response (does this require coordination beyond a single engineer?). If all three answers point to "yes," you are dealing with an incident, not a bug. At Google, we use the phrase "declare early, declare often" — it is always better to declare an incident and then downgrade it than to treat something as a routine bug and discover hours later that it was impacting millions of users.
When to Declare an Incident vs. File a Bug
Declare an incident when: any SLO is being violated, customer-facing functionality is degraded or unavailable, data loss or corruption is occurring or imminent, a security breach is suspected, or the issue requires coordination across multiple teams. File a bug when: the issue affects only internal tooling with workarounds available, the impact is cosmetic with no user-facing functionality loss, a single engineer can diagnose and fix the issue within their normal workflow, and no SLO is at risk. When in doubt, declare the incident. The cost of a false alarm (30 minutes of wasted coordination time) is trivially small compared to the cost of a delayed response (hours of unmitigated user impact).
Severity levels: P0 through P4
| Severity | User Impact | Response Time | Escalation | Communication |
|---|---|---|---|---|
| P0 | All users affected, complete outage | < 5 minutes | All on-call + engineering leadership + exec | Status page + all-hands Slack + email to customers |
| P1 | >10% users or core feature degraded | < 15 minutes | Primary + secondary on-call + team lead | Status page + incident Slack channel + stakeholders |
| P2 | Subset of users, workarounds exist | < 1 hour | Primary on-call, business-hours escalation | Incident Slack channel + team notification |
| P3 | Minor impact, no urgency | < 4 hours (business hours) | On-call engineer triages, no escalation | Team Slack channel |
| P4 | No current impact | Next sprint planning | None — tracked in backlog | None required |
Severity Inflation and Deflation Are Both Dangerous
Severity inflation (calling everything P0) leads to alert fatigue and the boy-who-cried-wolf effect — when everything is critical, nothing is. Severity deflation (classifying a P1 as P3 to avoid the overhead of incident management) is even more dangerous because it delays response while users suffer. The fix is a well-defined severity classification framework with concrete examples, reviewed and updated quarterly. If your team debates severity for more than 2 minutes during an active incident, your framework needs clearer criteria.
Remember: Impact, Urgency, Scope
Three questions determine whether something is an incident and what severity it is. Impact: What percentage of users or revenue is affected? Urgency: Does this need to be fixed right now, or can it wait? Scope: Does this require coordination beyond one engineer? A P0 scores "maximum" on all three. A bug scores "low" on all three. Everything in between is P1 through P4 based on the combination.
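The three-question test can be sketched as a small helper. This is a minimal illustration, not a real triage tool: the three boolean inputs are assumptions standing in for the judgment calls described above.

```python
# Illustrative sketch of the impact/urgency/scope test. The inputs are
# hypothetical stand-ins for the on-call engineer's three judgment calls.

def is_incident(urgent: bool, broad_impact: bool, needs_coordination: bool) -> bool:
    """All three 'yes' means incident; all three 'no' means file a bug.

    For mixed answers, the guidance above applies: when in doubt, declare.
    """
    return urgent and broad_impact and needs_coordination
```

A P0 scores "yes" on all three inputs; a routine bug scores "no" on all three.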
Every incident, regardless of severity, follows a lifecycle with distinct phases. Understanding this lifecycle is critical because each phase has different goals, different activities, and different time constraints. The biggest mistake teams make is conflating phases — trying to find root cause while the service is still on fire, or skipping the postmortem because the fix was "obvious." Each phase exists for a reason, and skipping or rushing any phase creates downstream problems that compound over time.
The lifecycle begins the moment an anomaly is detected — whether by an automated alert, a customer report, or an engineer noticing something wrong. From that moment, the clock is ticking. The lifecycle does not end when the service is restored; it ends when the postmortem is completed and action items are assigned. Organizations that end the lifecycle at "service restored" are doomed to repeat the same incidents because they never invest in structural prevention.
```mermaid
graph LR
    D[Detection<br/>Alert fires or<br/>report received] -->|"< 5 min"| T[Triage<br/>Assess severity,<br/>assign IC]
    T -->|"< 15 min"| M[Mitigation<br/>Stop the bleeding,<br/>restore service]
    M -->|"< 60 min"| R[Resolution<br/>Confirm stable,<br/>clean up]
    R -->|"< 72 hours"| P[Postmortem<br/>Root cause,<br/>action items]
    P --> I[Improvement<br/>Implement fixes,<br/>update runbooks]
    style D fill:#ef4444,color:#fff
    style T fill:#f59e0b,color:#fff
    style M fill:#3b82f6,color:#fff
    style R fill:#10b981,color:#fff
    style P fill:#8b5cf6,color:#fff
    style I fill:#06b6d4,color:#fff
```

The incident lifecycle: each phase has a time-based checkpoint. Detection to triage should take less than 5 minutes, triage to active mitigation under 15 minutes, mitigation to stable service under 60 minutes for P0/P1, and postmortem completed within 72 hours.
The five phases with time-based checkpoints
1. DETECTION (0-5 minutes): An alert fires, a customer reports an issue, or an engineer notices anomalous behavior. The goal is to recognize that something is wrong and begin the response process. Automated detection through monitoring and alerting is the gold standard — you should never learn about a P0 from a customer tweet. Key checkpoint: within 5 minutes of the problem starting, someone should be aware of it.
2. TRIAGE (5-15 minutes): The on-call engineer assesses the situation, determines severity, declares an incident if warranted, and assigns the Incident Commander role. This phase answers three questions: How bad is it? Who needs to know? What resources do we need? Key checkpoint: within 15 minutes, severity is assigned, an incident channel is open, the right people are engaged, and the first status update has been sent.
3. MITIGATION (15-60 minutes): The team works to stop the bleeding and restore service. This is NOT the time to find root cause — it is the time to reduce user impact by any means necessary: rollback, feature flag toggle, traffic rerouting, scaling up, or applying a workaround. Key checkpoint: for P0/P1, user impact should be mitigated within 60 minutes. If not, escalate further.
4. RESOLUTION (60+ minutes): The service is confirmed stable, temporary mitigations are cleaned up or made permanent, monitoring confirms metrics have returned to baseline, and the incident is formally closed. Key checkpoint: the Incident Commander confirms resolution by verifying SLO metrics are healthy for at least 15 minutes after the fix is applied.
5. POSTMORTEM (within 72 hours): The team conducts a blameless review of the incident. What happened? Why did it happen? How do we prevent it from happening again? The postmortem produces a written document with a timeline, root cause analysis, impact assessment, and prioritized action items. Key checkpoint: the postmortem document is completed and shared with stakeholders within 72 hours of resolution.
Mitigation First, Root Cause Later
During an active incident, your only goal is to reduce user impact. If a rollback fixes the problem, roll back first and investigate why the deploy broke things after the service is healthy. If disabling a feature flag stops the cascading failure, disable it first and figure out the root cause in the postmortem. The fire department does not stop to investigate arson while the building is burning. They put out the fire first. This is the single most important principle in incident management: mitigation before root cause analysis, every single time.
The 5-15-30-60 Rule
Use time-based checkpoints to maintain urgency during P0/P1 incidents. At 5 minutes: the on-call engineer should be actively investigating. At 15 minutes: severity should be declared, incident channel created, and IC assigned. At 30 minutes: the team should have a working theory and be actively executing a mitigation plan. At 60 minutes: user impact should be mitigated. If any checkpoint is missed, escalate immediately — the incident is taking longer than it should and needs additional resources.
| Phase | Duration Target | Primary Goal | Key Actions | Exit Criteria |
|---|---|---|---|---|
| Detection | < 5 min | Recognize the problem exists | Alert fires, on-call acknowledges, initial assessment begins | On-call engineer is actively investigating |
| Triage | 5-15 min | Classify and mobilize | Assign severity, declare incident, open war room, notify stakeholders | IC assigned, team assembled, first status update sent |
| Mitigation | 15-60 min | Reduce user impact | Rollback, feature flag, traffic shift, scale up, workaround | User-facing impact is measurably reduced or eliminated |
| Resolution | 1-4 hours | Confirm stability | Verify metrics, clean up workarounds, close incident | SLO metrics healthy for 15+ minutes, incident formally closed |
| Postmortem | < 72 hours | Learn and prevent | Timeline reconstruction, root cause analysis, action items | Document published, action items assigned and tracked |
Effective incident response requires clear roles with well-defined responsibilities. Without explicit roles, incidents devolve into chaos: multiple engineers attempt the same fix simultaneously, nobody communicates status to stakeholders, and critical diagnostic steps are skipped because everyone assumed someone else was doing them. The role structure used by Google, PagerDuty, and most mature SRE organizations is adapted from the Incident Command System (ICS) originally developed for firefighting and emergency response.
Role assignment happens during the triage phase and is the responsibility of the Incident Commander. Roles are not permanent job titles — they are hats worn during an incident. A junior engineer can be IC for a P3 incident. A VP can serve as Communications Lead if they happen to be available. The key principle is that every role is explicitly assigned to exactly one person, and everyone on the incident knows who holds each role. Unassigned roles lead to gaps; multiply-assigned roles lead to conflicts.
The five core incident roles
| Role | Primary Focus | Communication Pattern | Time Allocation |
|---|---|---|---|
| Incident Commander | Coordination, decisions, escalation | Receives updates from all roles, broadcasts decisions | 80% coordinating, 20% technical |
| Communications Lead | Status updates, stakeholder management | Receives info from IC, broadcasts to stakeholders/customers | 90% communication, 10% technical |
| Operations Lead | Technical debugging and mitigation | Reports findings to IC, directs SMEs | 80% technical, 20% coordinating SMEs |
| Subject Matter Experts | Deep system-specific investigation | Reports to Ops Lead, answers technical questions | 95% technical, 5% reporting |
| Scribe | Real-time documentation of all actions | Listens to all channels, asks clarifying questions | 100% documentation |
The IC Should NOT Be Debugging
The most common mistake in incident response is having the Incident Commander also serve as the primary debugger. When the IC is heads-down in logs, nobody is coordinating the response, nobody is tracking time checkpoints, and nobody is managing communication. The IC role exists specifically to maintain situational awareness while others do the technical work. If you are the IC and you feel the urge to start typing commands, assign someone else to do it and focus on coordination. At Google, ICs who start debugging are gently reminded: "IC, please delegate that task and return to coordination."
Role Assignment Script for the IC
When you declare an incident, use this script in the incident channel: "I am declaring a [severity] incident for [brief description]. I am taking the IC role. [Name], you are Communications Lead — please post an initial status page update within 10 minutes. [Name], you are Operations Lead — please begin diagnosis and report findings every 10 minutes. [Name], please scribe — capture everything in the incident doc. I will page additional SMEs as needed. Next status check in 15 minutes." This explicit role assignment eliminates ambiguity and gets everyone moving immediately.
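The declaration script lends itself to a template. A minimal sketch, assuming the IC fills in severity, description, and responder names; how the message is posted (Slack bot, manual paste) is left out.

```python
# Hypothetical formatter for the IC role-assignment script above.
# Names and the posting mechanism are assumptions for illustration.

def declaration_message(severity: str, description: str,
                        comms: str, ops: str, scribe: str) -> str:
    return (
        f"I am declaring a {severity} incident for {description}. "
        f"I am taking the IC role. "
        f"{comms}, you are Communications Lead — please post an initial "
        f"status page update within 10 minutes. "
        f"{ops}, you are Operations Lead — please begin diagnosis and "
        f"report findings every 10 minutes. "
        f"{scribe}, please scribe — capture everything in the incident doc. "
        f"I will page additional SMEs as needed. Next status check in 15 minutes."
    )
```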
Scaling Roles by Severity
Not every incident needs all five roles. For a P3, a single on-call engineer may handle all roles. For a P2, you might have an IC/Ops Lead (combined) and a Comms Lead. For P0/P1, all five roles should be explicitly assigned to different people. The rule of thumb: if the incident requires more than two people, you need an explicit IC who is not doing technical work. If it requires more than four people, you need all five roles assigned.
A severity classification framework removes ambiguity from the most time-critical decision in incident response: how bad is this, and what response does it warrant? Without a framework, severity assignment becomes subjective — the on-call engineer who tends to panic calls everything a P0, while the one who tends to minimize calls everything a P3. Both behaviors are harmful. The framework provides objective criteria that any engineer can apply consistently, regardless of their temperament or experience level.
The best severity frameworks are based on user impact, not system behavior. A server with 100% CPU utilization is not inherently a P0 — it depends on whether users are affected. Conversely, a server at 20% CPU that is returning corrupted data to every request is absolutely a P0. Always classify severity by what users experience, not by what internal metrics show. This user-centric approach aligns incident severity with business impact, which is what leadership and customers actually care about.
| Criteria | P0 — Critical | P1 — High | P2 — Moderate | P3 — Low | P4 — Info |
|---|---|---|---|---|---|
| User Impact | All users affected | >10% users or core feature | Subset of users, non-core feature | Minor, workaround exists | None currently |
| Revenue Impact | Revenue completely stopped | Significant revenue loss | Moderate revenue impact | Negligible revenue impact | No revenue impact |
| Data Risk | Active data loss or corruption | Potential data loss imminent | Data integrity risk exists | No data risk | No data risk |
| Response Time | < 5 minutes | < 15 minutes | < 1 hour | < 4 hours (business) | Next sprint |
| Who Gets Paged | All on-call + leads + VP | Primary + secondary on-call + lead | Primary on-call | On-call triages in queue | Nobody |
| War Room | Immediate, dedicated bridge | Dedicated Slack channel | Slack thread | Ticket only | Backlog item |
| Status Page | Updated within 10 minutes | Updated within 20 minutes | Internal update only | No update needed | No update needed |
| Exec Notification | VP+ within 15 minutes | Director within 30 minutes | Manager notified async | None | None |
Escalation Triggers: When to Upgrade Severity
Upgrade severity when any of these conditions are met: (1) Impact is spreading — what started as regional is now global. (2) The 60-minute mitigation checkpoint has passed with no progress on P1/P2. (3) Additional systems are failing as a result of the original incident (cascading failure). (4) Data loss or corruption is discovered. (5) Customer or media attention increases (a P2 that gets a trending tweet becomes a P1 for business reasons). Severity should also be downgraded when impact reduces: a P0 where mitigation restores service to 95% of users becomes a P1 for the remaining 5%.
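The five upgrade triggers reduce to an any-of predicate. A sketch under stated assumptions: the field names are illustrative, not a real incident-tracker schema.

```python
# Sketch of the five severity-upgrade triggers as a single predicate.
# Any one condition being true is sufficient reason to upgrade.

def should_upgrade(spreading: bool, minutes_without_progress: int,
                   cascading: bool, data_loss_found: bool,
                   external_attention: bool) -> bool:
    return any([
        spreading,                        # (1) regional impact going global
        minutes_without_progress >= 60,   # (2) mitigation checkpoint missed
        cascading,                        # (3) additional systems failing
        data_loss_found,                  # (4) data loss/corruption discovered
        external_attention,               # (5) customer or media attention
    ])
```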
The Danger of Under-Classification
Under-classifying incident severity is more dangerous than over-classifying. A P0 incorrectly classified as P2 means: no executive notification (they learn from Twitter), no dedicated war room (the response is uncoordinated), no status page update (customers have no information), and slow escalation (additional engineers are not paged). The cost of over-classification is wasted coordination time — maybe 30 minutes for a few engineers. The cost of under-classification is hours of unmitigated user impact, customer trust erosion, and potential SLA violations with contractual penalties.
How to assign severity in under 2 minutes
1. CHECK USER IMPACT FIRST: Are users experiencing the problem right now? How many? Use your monitoring dashboards — check error rates, latency, and availability. If you do not have dashboards, check social media and customer support queues.
2. CHECK REVENUE IMPACT: Is the issue affecting revenue-generating transactions? Payment processing, checkout, subscription renewals, ad serving — if any of these are impaired, this is automatically P1 or higher.
3. CHECK DATA RISK: Is data being lost, corrupted, or exposed? Data incidents are always P1 or P0 regardless of the number of users affected, because data loss is often irreversible.
4. APPLY THE MATRIX: Match your answers against the severity criteria table. If the incident matches multiple severity levels across different criteria, use the highest severity.
5. ANNOUNCE AND ADJUST: Declare the severity in the incident channel. Severity is not permanent — re-evaluate at every 15-minute checkpoint and upgrade or downgrade as the situation evolves.
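Step 4, "apply the matrix," has a simple mechanical form: when criteria disagree, the highest severity wins. A minimal sketch of that rule, where P0 outranks P1 and so on down to P4:

```python
# "Use the highest severity": P0 outranks P1, which outranks P2, etc.

RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3, "P4": 4}

def overall_severity(per_criterion: list) -> str:
    """E.g. user impact says P2 but data risk says P1 -> incident is P1."""
    return min(per_criterion, key=RANK.__getitem__)
```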
SLO-Based Severity Classification
Mature SRE organizations tie severity directly to SLO burn rates. If your error budget for the month is being consumed at 100x the normal rate, that is a P0 regardless of absolute error counts. If the burn rate is 10x normal, it is a P1. If it is 3x normal, it is a P2. This approach automatically scales severity with your SLO targets and eliminates the need for arbitrary thresholds like "error rate > 5%." It also means that a 0.1% error rate increase on a 99.99% SLO target is correctly classified as more severe than a 1% error rate increase on a 99% target.
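The burn-rate thresholds above (100x for P0, 10x for P1, 3x for P2) can be sketched directly. Burn rate here is the observed error rate divided by the error budget rate (1 minus the SLO target), which is what makes the 99.99% example in the text come out more severe than the 99% one.

```python
# Sketch of SLO burn-rate severity classification using the thresholds
# stated above. Thresholds and the burn-rate definition are from the text.

def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is burning."""
    return error_rate / (1.0 - slo_target)

def severity_from_burn_rate(rate: float):
    if rate >= 100:
        return "P0"
    if rate >= 10:
        return "P1"
    if rate >= 3:
        return "P2"
    return None  # below incident threshold: normal error-budget spend
```

A 0.1% error rate against a 99.99% target burns budget at 10x (a P1), while a 1% error rate against a 99% target burns at only 1x (not an incident at all).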
The first 15 minutes of an incident determine whether the response is organized and effective or chaotic and wasteful. Research from PagerDuty and Google shows that incidents where triage is completed within 15 minutes have significantly shorter overall resolution times compared to incidents where triage takes longer than 30 minutes. The reason is straightforward: early, structured triage gets the right people working on the right problem with the right level of urgency. Delayed or chaotic triage means engineers investigate the wrong things, communication is inconsistent, and mitigation is delayed while the team figures out what is actually happening.
The first 15 minutes follow a predictable sequence that should be practiced until it is muscle memory. This is not the time for creative problem-solving or deep analysis — it is the time for a disciplined checklist. Just as pilots follow a pre-flight checklist regardless of how many flights they have completed, SRE teams should follow a triage checklist regardless of how many incidents they have handled.
The first 15 minutes, step by step
1. MINUTE 0-1: ALERT FIRES AND ON-CALL ACKNOWLEDGES. A monitoring alert triggers (PagerDuty page, Slack alert, or similar). The on-call engineer acknowledges the alert within 1 minute. Acknowledgment does not mean the problem is being fixed — it means a human is now aware and will begin assessment. If the alert is not acknowledged within 5 minutes, it auto-escalates to the secondary on-call.
2. MINUTE 1-3: INITIAL ASSESSMENT. The on-call engineer opens the relevant dashboard and assesses the situation. Three questions to answer: What is broken? (which service, which feature, which region). How bad is it? (error rate, latency, percentage of users affected). Is it getting worse? (is the impact trending up, stable, or recovering on its own). This assessment should take no more than 2-3 minutes using pre-built dashboards.
3. MINUTE 3-5: SEVERITY DECISION AND INCIDENT DECLARATION. Based on the initial assessment, the on-call engineer assigns a severity level using the classification framework. If P2 or higher, they formally declare an incident by creating a dedicated incident channel (e.g., #inc-20250219-payment-failures) and sending the declaration message with severity, brief description, and the role assignment script.
4. MINUTE 5-8: WAR ROOM ASSEMBLY. For P0/P1, the IC pages additional responders: the Ops Lead (the engineer most familiar with the affected system), the Comms Lead (typically a rotation role), relevant SMEs, and a scribe. For P2, the on-call may handle all roles or pull in one additional person. The IC posts the initial situation summary in the incident channel.
5. MINUTE 8-12: FIRST COMMUNICATION. The Communications Lead posts the first external and internal updates. Internal: a message in the company-wide incidents channel summarizing the impact and who is responding. External: a status page update indicating awareness of the issue. Customer support is notified with a brief FAQ so they can handle incoming support tickets. These communications happen regardless of whether the root cause is known — the message is "we are aware and investigating."
6. MINUTE 12-15: FIRST DIAGNOSTIC ACTIONS. The Ops Lead begins systematic diagnosis: checking recent deployments (did anything change?), reviewing error logs, checking dependency health, and looking for correlating signals. The IC sets a timer for the next checkpoint at the 30-minute mark. If no progress is made by then, escalation is automatic.
Pre-Build Your Incident Channel Template
Do not waste the first 5 minutes of an incident manually setting up a Slack channel and typing a description. Use an automated incident creation workflow (PagerDuty, incident.io, or a custom Slack bot) that creates the channel, sets the topic with the incident summary, pins the runbook, invites the on-call team, and posts a pre-formatted incident declaration message — all with a single slash command like /incident create P1 "Payment processing failures in us-east-1". Automation saves 3-5 minutes of manual setup during the most critical phase of the response.
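The channel-naming convention shown earlier (#inc-YYYYMMDD-short-slug) is easy to automate. A hypothetical sketch of just the naming step; in practice a bot such as incident.io or a custom slash command would create the channel, set the topic, and pin the runbook.

```python
# Hypothetical channel-name generator for the #inc-YYYYMMDD-slug convention.
# This sketches only the naming; channel creation is left to your tooling.
import datetime
import re

def incident_channel_name(description: str, when: datetime.date) -> str:
    """Slugify a short description into an incident channel name."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"inc-{when:%Y%m%d}-{slug}"
```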
Do Not Debug Alone for 30 Minutes Before Declaring
A common anti-pattern: the on-call engineer receives an alert, spends 30 minutes investigating on their own, cannot figure it out, and then finally declares an incident. By the time the team assembles, 30 minutes of user impact have already occurred. The fix: if initial assessment (minute 1-3) shows any P2+ symptoms, declare immediately and bring in help. You can always downgrade the severity if it turns out to be less severe than initially thought. You cannot recover the 30 minutes of delayed response if it turns out to be worse.
| Minute | Action | Owner | Output |
|---|---|---|---|
| 0-1 | Acknowledge alert | On-call engineer | Alert acknowledged in PagerDuty/Opsgenie |
| 1-3 | Initial assessment (what, how bad, trending?) | On-call engineer | Preliminary impact assessment |
| 3-5 | Severity decision, declare incident, create channel | On-call (becomes IC or assigns IC) | Incident channel open, severity assigned |
| 5-8 | Page additional responders, assign roles | Incident Commander | IC, Comms, Ops, Scribe, SMEs assigned |
| 8-12 | First status update (internal + external) | Communications Lead | Status page updated, stakeholders notified |
| 12-15 | Begin systematic diagnosis, set 30-min checkpoint | Operations Lead | Diagnostic investigation underway, next checkpoint set |
The "Did Anything Change?" Question
The single most effective diagnostic question in the first 15 minutes is: "What changed?" Check the deployment log — was there a deploy in the last 2 hours? Check configuration management — was a config change pushed? Check infrastructure — was there a scaling event, a node failure, or a dependency update? In the majority of incidents, the root cause is a recent change. If you can identify the change, you can often mitigate by reverting it, which is faster than diagnosing the specific failure mode the change introduced.
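The "what changed?" check amounts to filtering your change events to a recent window. A minimal sketch, assuming change events (deploys, config pushes, infra events) are available as timestamped records; the event shape here is an illustrative assumption, not a real deploy-log schema.

```python
# Illustrative "what changed?" filter. The event dicts are stand-ins for
# your real deploy log, config audit trail, or infrastructure event feed.
from datetime import datetime, timedelta

def recent_changes(events: list, now: datetime, window_hours: int = 2) -> list:
    """Return change events within the lookback window, newest first."""
    cutoff = now - timedelta(hours=window_hours)
    hits = [e for e in events if e["at"] >= cutoff]
    return sorted(hits, key=lambda e: e["at"], reverse=True)
```

The newest-first ordering reflects the diagnostic heuristic: the most recent change is the most likely culprit, and reverting it is often faster than diagnosing the failure mode it introduced.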
Communication during an incident serves two critical purposes: it keeps the response team coordinated, and it keeps stakeholders informed. Poor communication is the number one cause of incident response dysfunction. Engineers working on different mitigation approaches without knowing about each other, stakeholders flooding the incident channel with questions because they have no status updates, customer support giving incorrect information because they were never briefed — all of these failures are communication failures, not technical failures. At Google, we treat incident communication as a first-class engineering discipline with its own training, templates, and feedback loops.
The key insight is that communication during an incident follows a fundamentally different pattern than normal engineering communication. In normal work, you communicate when you have something complete to share. During an incident, you communicate on a regular cadence regardless of whether you have new information. Saying "no update — still investigating, next update in 15 minutes" is a valid and valuable communication because it tells stakeholders the team is still engaged and sets expectations for the next update.
```mermaid
graph TB
    IC[Incident Commander] --> CL[Communications Lead]
    IC --> OL[Operations Lead]
    OL --> SME1[SME: Database]
    OL --> SME2[SME: Networking]
    OL --> SME3[SME: Application]
    CL --> SP[Status Page]
    CL --> SI[Slack: #incidents]
    CL --> CS[Customer Support]
    CL --> EX[Executive Updates]
    IC --> SC[Scribe]
    style IC fill:#3b82f6,color:#fff
    style CL fill:#f59e0b,color:#fff
    style OL fill:#10b981,color:#fff
    style SP fill:#8b5cf6,color:#fff
    style SI fill:#8b5cf6,color:#fff
    style CS fill:#8b5cf6,color:#fff
    style EX fill:#8b5cf6,color:#fff
```

Information flows up from SMEs to Ops Lead to IC. Communications flow out from the Comms Lead to all external channels. The IC is the bridge between technical investigation and stakeholder communication.
| Severity | Internal Update Frequency | Status Page Update | Executive Update | Customer Email |
|---|---|---|---|---|
| P0 | Every 15 minutes | Every 15-30 minutes | Within 15 min, then every 30 min | If outage > 30 min |
| P1 | Every 30 minutes | Every 30-60 minutes | Within 30 min, then hourly | If outage > 1 hour |
| P2 | Every 60 minutes | Only if customer-visible | Async summary after resolution | Rarely needed |
| P3 | As needed | No | No | No |
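The internal cadences in the table above are easy to enforce mechanically, so the Comms Lead never has to watch the clock. A minimal sketch (the function and its epoch-seconds interface are illustrative, not from any specific tool):

```python
# Internal update cadences from the table above (minutes between updates).
INTERNAL_CADENCE_MIN = {"P0": 15, "P1": 30, "P2": 60}  # P3: as needed

def next_update_due(severity, last_update_epoch, now_epoch):
    """Return seconds until the next internal update is due (<= 0 means overdue).
    Returns None for severities with no fixed cadence (P3 or unknown)."""
    cadence = INTERNAL_CADENCE_MIN.get(severity)
    if cadence is None:
        return None
    return last_update_epoch + cadence * 60 - now_epoch
```

A Slack bot polling this function can nudge the Comms Lead a few minutes before each update is due, which is exactly the kind of repetitive discipline that should be automated rather than remembered.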
The Incident Status Update Template
Every status update, whether internal or external, should follow this structure: (1) CURRENT STATUS: one sentence on whether the issue is ongoing, mitigated, or resolved. (2) IMPACT: who is affected and how. (3) WHAT WE ARE DOING: specific actions the team is taking right now. (4) NEXT UPDATE: when stakeholders can expect the next communication. Example: "CURRENT STATUS: Payment processing is degraded. IMPACT: Approximately 15% of checkout attempts are failing in us-east-1. WHAT WE ARE DOING: We have identified a database connection pool exhaustion issue and are scaling up the connection pool. NEXT UPDATE: 30 minutes or sooner if status changes." This template works for Slack, status pages, and executive emails.
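The four-part template is simple enough to encode directly, which keeps every update structurally identical regardless of who writes it. A minimal sketch (function name and parameters are illustrative):

```python
def format_status_update(status, impact, actions, next_update):
    """Render the four-part status update template from the section above."""
    return (
        f"CURRENT STATUS: {status}\n"
        f"IMPACT: {impact}\n"
        f"WHAT WE ARE DOING: {actions}\n"
        f"NEXT UPDATE: {next_update}"
    )

msg = format_status_update(
    status="Payment processing is degraded.",
    impact="Approximately 15% of checkout attempts are failing in us-east-1.",
    actions="Scaling up the exhausted database connection pool.",
    next_update="30 minutes or sooner if status changes.",
)
print(msg)
```

Wiring this into a `/incident update` slash command means the Comms Lead fills in four fields instead of composing free-form prose under pressure.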
Communication anti-patterns to avoid
Never Speculate in External Communications
Status page updates and customer communications must never include speculation about root cause. Saying "we believe the issue is caused by a database failure" and then discovering it was actually a network partition makes your team look incompetent and erodes customer trust. External communications should describe the impact (what customers experience), the response (that engineering is actively working), and the timeline for the next update. Root cause is shared only after the postmortem confirms it.
Pre-Write Your Status Page Templates
During an active P0, the Communications Lead should not be wordsmithing status page updates. Pre-write templates for common scenarios: "We are investigating reports of [service] issues affecting [feature]. Some users may experience [symptom]. Our engineering team is actively investigating and we will provide updates every [frequency]." Having templates ready means the first status page update goes live in 2 minutes instead of 10.
Mitigation is the art of reducing user impact as quickly as possible, regardless of whether you understand the root cause. This is a fundamentally different mindset from debugging, and many engineers struggle with it. The debugging mindset says "I need to understand why this is happening before I can fix it." The mitigation mindset says "I need to make the user impact stop right now, and I will figure out why later." In a production incident, the mitigation mindset always wins. Every minute spent investigating root cause while a mitigation option is available is a minute of unnecessary user suffering.
The best mitigation strategies are pre-built and pre-tested. You do not want to be figuring out how to roll back a deployment for the first time during a P0 at 3 AM. Rollback procedures, feature flag configurations, traffic shifting mechanisms, and scaling playbooks should all be documented, tested regularly, and executable within minutes. The goal is to have a menu of mitigation options that the Ops Lead can choose from based on the situation, not a blank page that requires improvisation under pressure.
The mitigation toolkit: ordered by speed and safety
| Strategy | Speed | Risk | When to Use | Prerequisite |
|---|---|---|---|---|
| Rollback | < 5 min | Low | Incident correlates with recent deploy | CI/CD supports one-click rollback |
| Feature flag toggle | < 1 min | Low | Problem is in a flagged feature | Feature built behind a feature flag |
| Traffic shifting | 1-5 min | Medium | Problem is regional or instance-specific | Multi-region deployment, load balancer access |
| Scale up/out | 5-15 min | Low | Capacity exhaustion is the cause | Infrastructure supports scaling, capacity available |
| Hotfix | 15-30 min | High | Rollback not possible, specific fix known | Fast CI/CD, second-engineer review |
| DR failover | 30-60 min | High | Primary environment is unrecoverable | DR site tested, data sync confirmed |
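Because the table is ordered by speed and safety, the "menu" the Ops Lead chooses from can be expressed as a first-match lookup. A sketch under assumed context fields (the `ctx` keys are hypothetical, not from any specific tool):

```python
# Mitigation menu from the table above, ordered fastest-and-safest first.
# Each entry: (strategy, applies(ctx) predicate).
MITIGATIONS = [
    ("rollback",      lambda ctx: ctx.get("recent_deploy") and ctx.get("rollback_supported")),
    ("feature_flag",  lambda ctx: ctx.get("behind_flag")),
    ("traffic_shift", lambda ctx: ctx.get("regional") and ctx.get("multi_region")),
    ("scale_out",     lambda ctx: ctx.get("capacity_exhaustion")),
    ("hotfix",        lambda ctx: ctx.get("fix_known")),
    ("dr_failover",   lambda ctx: ctx.get("primary_unrecoverable")),
]

def pick_mitigation(ctx):
    """Return the first applicable strategy, mirroring the rollback-first bias."""
    for name, applies in MITIGATIONS:
        if applies(ctx):
            return name
    return None

print(pick_mitigation({"recent_deploy": True, "rollback_supported": True}))  # rollback
```

The ordering does the real work here: the first predicate that matches is also the fastest safe option, which is the decision rule the next section makes explicit as the rollback-first policy.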
The Rollback-First Policy
Adopt a rollback-first policy: if the incident correlates with a recent change (deploy, config push, infrastructure modification), the default mitigation is to revert that change. The IC should not need to justify a rollback — the person who made the change should need to justify NOT rolling back. This policy dramatically reduces mean time to recovery because rollback is the fastest mitigation available. Engineers sometimes resist rollback because they are "almost done" finding the bug. Overrule this instinct. Roll back, restore service, and then debug at leisure.
When Rollback Is Not Possible
Rollback is not always safe. Database schema migrations that drop columns cannot be rolled back without data loss. Changes that alter data formats (serialization changes) may make existing data unreadable by the old code. API version changes where clients have already adopted the new version cannot be rolled back without breaking those clients. For these scenarios, you need forward-fix capability: the ability to quickly deploy a targeted patch rather than reverting to a previous version. These scenarios should be identified during deploy review and have explicit mitigation plans documented before the change is deployed.
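The standard way to keep schema changes rollback-safe is the expand/contract (three-phase) pattern: make only additive changes first, migrate data in a second deploy, and drop the old structure only after nothing depends on it. A sketch with illustrative table and column names:

```python
# Expand/contract (three-phase) schema change: every deploy stays
# rollback-safe because old and new code can both run against the schema
# at each step. Table and column names are illustrative.

PHASE_1_EXPAND = """
ALTER TABLE orders ADD COLUMN shipping_address_v2 TEXT;  -- additive: old code ignores it
"""

# Phase 2 (separate deploy): new code writes both columns and backfills,
# so rolling back to phase-1 code loses nothing.
PHASE_2_BACKFILL = """
UPDATE orders SET shipping_address_v2 = shipping_address
WHERE shipping_address_v2 IS NULL;
"""

# Phase 3 (only after phase 2 is verified and no code reads the old column):
PHASE_3_CONTRACT = """
ALTER TABLE orders DROP COLUMN shipping_address;  -- now safe: nothing depends on it
"""
```

A change reviewed as three independently rollback-safe deploys never needs the "rollback would lose data" exception in the first place, which is why deploy review should flag single-step destructive migrations.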
Test Your Mitigations Quarterly
Every mitigation strategy in your toolkit should be tested at least quarterly. Can you actually roll back a deploy in under 5 minutes? Try it. Can you toggle a feature flag and see the effect within 1 minute? Test it. Can you shift traffic between regions seamlessly? Prove it. Mitigations that have never been tested are mitigations that will fail when you need them most. Include mitigation testing in your chaos engineering program or Game Day exercises.
Incident management tooling automates the repetitive, error-prone parts of incident response so that engineers can focus on the technical work. The right tooling can reduce the overhead of incident declaration from 10 minutes to 10 seconds, ensure that communication happens on schedule without manual reminders, and automatically generate postmortem timelines from the incident channel history. The wrong tooling — or worse, no tooling — means every incident requires manual channel creation, manual role assignment, manual status page updates, and manual timeline reconstruction.
The incident management tool ecosystem has matured significantly in the last five years. The core platforms serve different needs depending on team size, budget, and workflow preferences. Understanding the strengths and limitations of each helps you choose the right tool for your organization and avoid the trap of building custom tooling when off-the-shelf solutions exist.
Core incident management tool categories
| Feature | PagerDuty | Opsgenie | incident.io |
|---|---|---|---|
| Primary Strength | On-call management and alerting | On-call management (Jira integration) | Incident coordination and workflow |
| On-call Scheduling | Best in class, complex rotation support | Strong, Jira ecosystem integration | Basic, focused on incident workflow |
| Alert Routing | Sophisticated escalation policies, event intelligence | Good escalation, Jira auto-ticket creation | Relies on PagerDuty/Opsgenie for alerting |
| Incident Coordination | PagerDuty Incident Response add-on | Basic incident tracking | Best in class Slack-native coordination |
| Slack Integration | Good, bidirectional | Good, bidirectional | Excellent, Slack-first design |
| Status Page | PagerDuty Status Page add-on | Opsgenie Status Page | Integrates with Statuspage, Instatus |
| Postmortem | Basic template generation | Basic template | Auto-generated from Slack timeline |
| Pricing | $21-41/user/month | $9-35/user/month | From $16/user/month |
| Best For | Large orgs needing robust on-call + alerting | Jira/Atlassian-heavy organizations | Slack-first teams wanting workflow automation |
The Minimum Viable Incident Management Stack
If you are starting from zero, here is the minimum viable stack: (1) PagerDuty or Opsgenie for on-call scheduling and alert routing — pick one based on your existing tooling ecosystem. (2) Slack with a naming convention for incident channels (#inc-YYYYMMDD-brief-description). (3) A status page (Statuspage free tier or Instatus) for external communication. (4) A Google Doc or Notion template for postmortems. This stack costs under $500/month for a team of 10 and covers the essential workflow. Upgrade to incident.io or FireHydrant when you are handling more than 10 incidents per month and the manual overhead becomes unsustainable.
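Even the naming convention in step (2) benefits from a tiny helper so channels stay consistent when created by hand. A minimal sketch (the 80-character cap reflects Slack's channel-name limit; the function is illustrative):

```python
import re
from datetime import date

def incident_channel_name(description, on=None):
    """Build a Slack channel name following the #inc-YYYYMMDD-brief-description
    convention from the stack above."""
    on = on or date.today()
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    # Slack channel names are capped at 80 characters; truncate the slug.
    return f"inc-{on:%Y%m%d}-{slug}"[:80]

print(incident_channel_name("Checkout API 500s", on=date(2024, 5, 1)))
# inc-20240501-checkout-api-500s
```

When you later adopt a tool like incident.io, it generates channels for you, and this helper simply becomes the spec for what those names should look like.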
Do Not Build Your Own Incident Management Tool
Every year, well-intentioned SRE teams decide to build a custom Slack bot for incident management. After six months of development, they have a fragile bot that handles the happy path and breaks on edge cases. Meanwhile, incident.io, FireHydrant, and Rootly have dedicated product teams that have spent years building robust incident workflow automation. Unless your incident management needs are truly unique (unlikely), buy, do not build. Spend your engineering time on product reliability, not on reinventing incident management tooling.
Tool Integration Is Non-Negotiable
Your incident management tools must integrate seamlessly with each other. When an alert fires in PagerDuty, it should automatically create an incident in incident.io, which creates a Slack channel, pages the on-call team, and posts an initial status page update — all without manual intervention. If any step requires a human to copy-paste information between tools, the workflow will break under pressure. Test your integration end-to-end by firing a test alert and verifying the entire chain executes correctly. This test should be part of your on-call onboarding process.
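The end-to-end test usually starts by injecting a synthetic event at the very front of the chain. A sketch using the PagerDuty Events API v2 shape (trigger events POSTed to `https://events.pagerduty.com/v2/enqueue`); the routing key and dedup key values are placeholders:

```python
import json

def make_test_event(routing_key):
    """Build a PagerDuty Events API v2 trigger event for an end-to-end drill.
    POST this as JSON to https://events.pagerduty.com/v2/enqueue."""
    return {
        "routing_key": routing_key,        # from your PagerDuty service integration
        "event_action": "trigger",
        "dedup_key": "integration-drill",  # lets you resolve the same event later
        "payload": {
            "summary": "DRILL: end-to-end incident tooling test (safe to ignore)",
            "source": "tooling-drill",
            "severity": "info",            # drills should not page as critical
        },
    }

body = json.dumps(make_test_event("YOUR_ROUTING_KEY"))
# e.g. with the requests library:
# requests.post("https://events.pagerduty.com/v2/enqueue", data=body,
#               headers={"Content-Type": "application/json"})
```

After firing the drill event, verify each downstream step (incident created, channel opened, on-call paged, status page drafted) and then send a matching `resolve` event with the same `dedup_key` to clean up.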
If you are starting from scratch — no incident process, no on-call rotation, no runbooks, no postmortem culture — this section provides a step-by-step guide to building your first incident response process in 30 days. The goal is not to build a Google-scale incident management system on day one. The goal is to establish the minimum viable process that ensures incidents are detected, responded to, communicated about, and learned from. You can iterate and improve from there.
The biggest mistake teams make when building their first incident response process is trying to do everything at once. They write a 50-page incident management playbook, configure elaborate PagerDuty escalation policies, and design a complex severity classification matrix — and then nobody follows any of it because it is too complicated. Start simple. A process that is followed 100% of the time is infinitely more valuable than a comprehensive process that is followed 20% of the time. Simplicity drives adoption, and adoption drives reliability.
Week 1-4: From zero to operational incident response
1. WEEK 1, DAY 1-2: ESTABLISH ON-CALL ROTATION. Set up PagerDuty or Opsgenie (free tier is sufficient to start). Create a single on-call schedule that covers 24/7 with your team. If you have 4 engineers, each person is on-call for one week out of four. Configure alert routing so that monitoring alerts reach the on-call engineer via push notification and phone call. Test the chain end-to-end: have someone trigger a test alert and verify the on-call person receives it within 60 seconds.
2. WEEK 1, DAY 3-5: DEFINE YOUR SEVERITY LEVELS. Write a one-page document defining P0 through P3 (skip P4 for now — keep it simple). For each severity, specify: what qualifies (with concrete examples from your system), who gets notified, expected response time, and communication expectations. Share this document with the entire engineering team and get agreement. Post it in a pinned Slack message in your engineering channel.
3. WEEK 2: CREATE INCIDENT WORKFLOW. Define the incident channel naming convention (#inc-YYYYMMDD-description). Write a one-paragraph incident declaration template that includes severity, description, who is IC, and a request for the Ops Lead and Comms Lead. Set up a free Statuspage for external communication. Write three status page update templates: one for "investigating," one for "identified and working on fix," and one for "resolved." Test the entire workflow with a fake incident.
4. WEEK 3: WRITE YOUR FIRST THREE RUNBOOKS. Identify the three most common types of incidents your team has experienced (e.g., "database connection pool exhaustion," "high API latency," "deployment failure"). For each, write a runbook that covers: symptoms (how the alert presents), diagnostic commands (what to check), mitigation steps (what to do), and escalation criteria (when to call for help). Store runbooks in your team wiki and link them from the corresponding PagerDuty alerts.
5. WEEK 4: CONDUCT YOUR FIRST TABLETOP EXERCISE. A tabletop exercise is a simulated incident where the team practices the response process without any actual system impact. The facilitator presents a scenario ("It is 2 AM, PagerDuty wakes you up. The dashboard shows 50% error rate on the checkout API. What do you do?") and the team walks through their response step by step. Identify gaps in the process, update runbooks, and schedule the next tabletop for one month later.
6. ONGOING: ESTABLISH POSTMORTEM CULTURE. After every P0 and P1 incident, write a blameless postmortem within 72 hours. The postmortem should include: a timeline of events, root cause analysis, what went well, what could be improved, and action items with owners and due dates. Share the postmortem with the entire engineering team. Review action item completion monthly. The postmortem is where the real improvement happens — without it, you will repeat the same incidents.
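The one-page severity definitions from Week 1 can also be encoded as a first-pass triage helper so classification is consistent across on-call engineers. A sketch with illustrative thresholds (replace them with the criteria from your own severity document):

```python
def classify_severity(slo_violated, customer_facing, pct_users_affected,
                      data_loss=False, security_breach=False):
    """Rough first-pass severity classification; thresholds are illustrative
    and should mirror your team's one-page severity definitions."""
    if data_loss or security_breach or (customer_facing and pct_users_affected >= 25):
        return "P0"
    if slo_violated or (customer_facing and pct_users_affected >= 5):
        return "P1"
    if customer_facing or slo_violated:
        return "P2"
    return "P3"

print(classify_severity(slo_violated=True, customer_facing=True,
                        pct_users_affected=15))  # P1
```

Even a crude rule like this beats ad hoc judgment at 3 AM, and disagreements with its output are exactly the cases to discuss when you revise the severity document.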
The Tabletop Exercise Format
Run a 60-minute tabletop exercise monthly. Structure: (1) 5 minutes — facilitator presents the scenario and initial alert. (2) 10 minutes — the team practices triage: severity classification, incident declaration, role assignment. (3) 20 minutes — the facilitator reveals additional information in stages (e.g., "the error rate is now 80%," "a customer VP just tweeted about it," "the database team says the replica is 30 minutes behind") and the team practices response, escalation, and communication. (4) 15 minutes — debrief: what went well, what was confusing, what process gaps were exposed. (5) 10 minutes — update runbooks and the incident process based on findings. Tabletop exercises are the cheapest, lowest-risk way to improve incident response. They require zero infrastructure and expose process gaps before real incidents do.
On-Call Onboarding Checklist
Before an engineer joins the on-call rotation, they should complete this checklist: (1) PagerDuty/Opsgenie account configured with correct notification preferences. (2) Access to all production dashboards and monitoring tools. (3) Access to deployment tools and ability to perform rollbacks. (4) Read all existing runbooks and understand where to find them. (5) Shadow an experienced on-call engineer for one full rotation. (6) Complete a tabletop exercise. (7) Perform a supervised first on-call shift where an experienced engineer is available as backup. Skipping onboarding leads to an overwhelmed on-call engineer who freezes during their first real incident.
| Week | Deliverable | Time Investment | Impact |
|---|---|---|---|
| Week 1 | On-call rotation + severity levels defined | 8-12 hours | Alerts reach the right person, severity is consistent |
| Week 2 | Incident workflow + status page + templates | 6-8 hours | Incidents are declared consistently, customers are informed |
| Week 3 | Three critical runbooks written | 6-10 hours | On-call engineers know what to do for common incidents |
| Week 4 | First tabletop exercise conducted | 4-6 hours | Process gaps identified and fixed before real incidents |
| Ongoing | Postmortem after every P0/P1 | 3-4 hours per incident | Team learns from incidents and prevents recurrence |
Do Not Wait for Perfection to Start
The biggest risk in building an incident response process is waiting until everything is perfect before going live. A team with a basic severity classification, a Slack channel naming convention, and three runbooks is dramatically better prepared than a team with nothing. Start with the minimum viable process from this section, use it for real incidents, and improve it iteratively based on what breaks. Your postmortems will tell you exactly what to improve next. The process you have today does not need to be the process you have in six months — it just needs to be better than having no process at all.
The Incident Response Maturity Ladder
Level 0: No process — incidents are handled ad hoc by whoever notices the problem. Level 1: Basic on-call rotation and severity levels exist, incidents are declared in Slack. Level 2: Roles are assigned, communication templates exist, runbooks cover common scenarios. Level 3: Incident management tooling automates workflow, postmortems are conducted consistently, tabletop exercises are regular. Level 4: SLO-based severity classification, automated mitigation for known failure modes, incident metrics tracked and reviewed. Level 5: Proactive chaos engineering, automated incident detection and response, incident-free deploys through progressive delivery. Most teams are at Level 0 or 1. Getting to Level 2 in 30 days using this guide is a realistic and impactful goal.
Incident management is one of the most frequently tested topics in SRE interviews at FAANG companies. In behavioral interviews, you will be asked to describe how you handled a production incident — interviewers evaluate your ability to stay calm, coordinate a team, communicate clearly, and make sound decisions under pressure. In system design interviews, you may be asked to design an incident management system or explain how you would set up on-call for a new service. In scenario-based interviews, you will be given a simulated incident and asked to walk through your response step by step, with the interviewer introducing complications at each stage. Knowing the incident lifecycle, role definitions, severity framework, and mitigation strategies gives you a structured, vocabulary-rich answer that demonstrates real operational experience.
Common questions:
Try this question: Ask the interviewer: What does your on-call rotation look like — what is the typical pages-per-shift rate? How do you conduct postmortems — is there a blameless culture? What incident management tooling do you use? These questions demonstrate operational maturity and help you evaluate whether the organization has a healthy on-call culture or a burnout-inducing one.
Strong answer: Immediately structuring the response around the incident lifecycle phases. Naming specific roles and their responsibilities. Emphasizing the mitigation-first principle. Describing a communication cadence (every 15 minutes for P0). Mentioning blameless postmortems with tracked action items. Citing specific tools (PagerDuty, incident.io) and explaining when each is appropriate. Describing tabletop exercises or chaos engineering as proactive preparation.
Red flags: Describing an incident response that is just "everyone jumps in and starts debugging." Not knowing what an Incident Commander does. Prioritizing root cause analysis over mitigation during an active incident. No mention of communication to stakeholders. Treating postmortems as optional or blame-oriented. Not being able to define severity levels with concrete criteria.
Key takeaways
Walk through the first 15 minutes of a P0 incident. What specific actions happen at each time checkpoint, and who is responsible for each action?
Minute 0-1: Alert fires, on-call engineer acknowledges. Minute 1-3: On-call checks dashboards to assess what is broken, how bad it is, and whether it is getting worse. Minute 3-5: On-call assigns severity (P0), declares incident, creates dedicated incident channel, and assigns IC role (may take IC themselves or assign to another). Minute 5-8: IC pages Ops Lead, Comms Lead, Scribe, and relevant SMEs using the role assignment script. Minute 8-12: Comms Lead posts first status page update and notifies stakeholders — message focuses on impact and that the team is investigating, not on root cause. Minute 12-15: Ops Lead begins systematic diagnosis starting with "what changed recently?" and sets the 30-minute checkpoint for escalation if no progress is made.
Explain the difference between mitigation and resolution. Why should incident response prioritize mitigation, and when is it acceptable to skip directly to root cause analysis?
Mitigation reduces or eliminates user impact without necessarily understanding or fixing the root cause — examples include rollback, feature flag toggle, traffic shifting, and scaling up. Resolution is the permanent fix that addresses the underlying root cause. Incident response prioritizes mitigation because every minute of user impact has real cost (revenue, trust, SLA). You should almost never skip to root cause analysis during an active incident. The only exception is when the mitigation options are exhausted or would cause more harm than the incident itself (e.g., rolling back would cause data loss). Even then, the team should split: one track investigates root cause while another continues exploring mitigation options.
Your team has no incident response process. Describe the minimum viable process you would implement in the first week and how you would iterate on it.
Week 1 deliverables: (1) Set up PagerDuty or Opsgenie with a 24/7 on-call rotation across the team. Test alert delivery end-to-end. (2) Define P0/P1/P2/P3 severity levels with concrete examples and response time expectations. Post in a pinned Slack message. (3) Establish a Slack channel naming convention for incidents (#inc-YYYYMMDD-description) and a one-paragraph declaration template. Iteration: after the first real incident, conduct a postmortem focused on process gaps. In week 2-3, write runbooks for the most common incident types. In week 4, run a tabletop exercise. Each postmortem and tabletop exercise reveals the next thing to improve. The process evolves through real-world feedback, not upfront design.
💡 Analogy
Incident management is like a fire department response. When the alarm rings (alert fires), the dispatcher (Incident Commander) assesses the situation (triage), assigns trucks and crews (roles), and coordinates the response (mitigation). Engine companies fight the fire, ladder companies rescue occupants, and the battalion chief coordinates everything from the command post. After the fire is out, the fire marshal investigates what happened and how to prevent it (postmortem). Critically, you do not stop to investigate the cause of the fire while the building is still burning — you put out the fire first. The fire department does not send one firefighter to handle a five-alarm fire alone, and they do not send the entire department for a trash can fire. They scale the response to the severity, just as incident management scales from a single on-call engineer for a P3 to an all-hands response for a P0.
⚡ Core Idea
Incident management is the structured process of detecting, triaging, mitigating, resolving, and learning from production incidents. It requires clear severity classification, explicit role assignment, disciplined communication, and a bias toward mitigation over root cause analysis during the active response. The postmortem closes the loop by turning incidents into structural improvements that prevent recurrence.
🎯 Why It Matters
Every production system will have incidents — the question is not whether you will face them but whether you will handle them well or poorly. A team with a practiced incident response process mitigates a P0 in 30 minutes. A team without a process takes 3 hours to mitigate the same P0 while engineers step on each other, stakeholders are uninformed, and nobody documents what happened. That 2.5-hour difference directly translates to user impact, revenue loss, SLA penalties, and engineering burnout. Incident management is also one of the most commonly tested topics in SRE interviews at FAANG companies — interviewers want to know that you can stay calm, coordinate a team, and make sound decisions under pressure.