
On-Call and Incident Response

A FAANG-depth guide to running healthy on-call rotations, responding to incidents with precision, escalating effectively, and continuously improving the system — from the first page to the final handoff. Covers Google, Netflix, and Stripe on-call models, the Uber burnout crisis, and a full incident response decision tree.

🎯Key Takeaways
On-call is a formal engineering contract with explicit SLAs: Google targets 5-10 minute acknowledgement times, Netflix targets fewer than 1 incident per week per on-call engineer, and Google recommends a maximum of 2 actionable pages per 12-hour shift.
Healthy rotations require a minimum of 6 engineers for critical services, weekly handoffs with written handoff documents, shadow rotations for new members, and a maximum of 8 interrupt hours per engineer per week.
The first 5 minutes protocol — Acknowledge → Assess → Triage → Communicate → Mitigate — must be followed in order. Skipping Assess and Triage to jump directly to Mitigation is the most common cause of prolonged incidents and unnecessary damage.
Escalation at 15 minutes without a clear hypothesis is the correct professional response, not a sign of weakness. Every minute of solo debugging past the 15-minute threshold without progress is a minute of unnecessary customer impact.
On-call burnout is a systemic organizational failure, not an individual resilience problem. Uber's Project Oncall Health demonstrated that systematic measurement, alert audits, rotation size enforcement, and runbook standards can reduce weekly interrupt hours by 67% — from 18 hours to 6 hours per engineer — over 18 months.



The On-Call Contract: Expectations and Responsibilities

Being on-call is not a punishment, a badge of honour, or a casual commitment. It is a formal engineering contract with specific performance expectations, compensation policies, and health boundaries. At Google, the on-call contract is written down and signed. At Netflix and Stripe it is embedded in the engineering culture. At every FAANG company, the contract answers three questions clearly: what must you respond to, how fast, and what support do you have when you do.

The foundation of the contract is the response time SLA. Google SRE mandates a 5-minute acknowledgement time for Severity-1 pages during business hours, and a 10-minute acknowledgement time at night. Netflix engineering aims for sub-5-minute acknowledgement for Tier-1 services like the streaming API and playback pipeline. Stripe, which processes payments, targets a 3-minute acknowledgement for any page that touches the payment critical path. These are not aspirational goals — they are written SLOs on the on-call function itself, and they are reviewed during quarterly SRE health reviews.

Google's Page Frequency Target

Google SRE guidance states that an on-call engineer should receive no more than 2 actionable pages per 12-hour shift. More than that is a signal that something is wrong with alerting quality, runbook effectiveness, or service reliability — and it triggers a review. Netflix targets fewer than 1 incident requiring human intervention per week per on-call engineer for stable services.

| Company | Acknowledgement SLA | Max Pages/Shift | Compensation Model | On-Call Cadence |
|---|---|---|---|---|
| Google | 5 min (business hours), 10 min (nights/weekends) | 2 actionable pages per 12-hour shift | On-call pay differential + time-in-lieu for night shifts | Typically 1 week primary, 1 week secondary, 6+ weeks off-rotation |
| Netflix | <5 min for Tier-1 services | <1 incident/week per engineer (target) | High base salary model; no separate on-call pay | Weekly rotation with clear primary/secondary model |
| Stripe | <3 min for payment critical path | Not publicly specified; burnout metrics reviewed quarterly | On-call stipend + time off after high-burden shifts | Bi-weekly rotation; secondary always available |
| Uber | <5 min (post-2019 reforms) | Capped at 10 actionable pages/week after Project Oncall Health | On-call differential pay + mandatory recovery time | 1-week rotation after Project Oncall Health reforms |

The distinction between a healthy and a toxic on-call rotation comes down to five factors: page frequency, page actionability (can the engineer actually do something?), rotation size (is burden shared fairly?), support availability (is there a secondary?), and recovery time (does the team get rest after a hard shift?). A rotation that pages 15 times per night, where half the pages require no action, with only one engineer covering the service and no backup, is toxic regardless of the compensation model. Engineers in that situation burn out, make mistakes, and eventually leave.

Signs of a healthy on-call rotation vs a toxic one

  • Healthy: Actionable pages only — Every page requires an action. Noise is tracked and eliminated within weeks. The alert has a runbook. The engineer knows exactly what to check first.
  • Healthy: Shared load — The rotation has enough people that no one is on-call more than once per 6-8 weeks for high-burden services. Google targets at minimum 6 engineers in any full on-call rotation.
  • Healthy: Clear escalation path — The on-call engineer is never the last line of defence. There is always a secondary, a team lead, and a manager available to escalate to.
  • Toxic: Alert storms — A single incident generates dozens of alerts. Engineers are alert-fatigued and start dismissing pages without reading them. This is the precursor to a major missed incident.
  • Toxic: No runbooks — The on-call engineer is paged at 3 AM with no documented steps. They debug from memory, make errors under pressure, and the resolution takes 3x longer than necessary.
  • Toxic: Single point of failure — One engineer covers the rotation alone. They cannot sleep soundly, cannot take sick days, and cannot go to the bathroom during an incident. This is an organizational risk, not just a personal burden.
  • Toxic: No follow-up time — On-call engineers are expected to be fully productive on feature work the day after an all-night incident. This guarantees mistakes and breeds resentment.

The On-Call Bill of Rights (Google SRE Team Norms)

At Google, SRE teams maintain an informal "on-call bill of rights": (1) You will not be paged for something you cannot fix. (2) Every page will have a runbook. (3) You will have a secondary. (4) You will receive time-in-lieu after high-burden shifts. (5) Alert noise that makes on-call miserable will be tracked and eliminated within one quarter. Any rotation that violates these norms for more than two consecutive quarters triggers a reliability review.

Designing Healthy On-Call Rotations

The structure of your on-call rotation is a reliability decision with direct human consequences. A poorly designed rotation guarantees burnout and mistakes. A well-designed rotation distributes burden fairly, ensures continuity, and builds the knowledge that prevents future incidents. Designing a rotation is a systems engineering problem: you are optimizing for engineer effectiveness under time pressure, not just coverage hours.

Rotation size is the first variable. Google SRE recommends a minimum of 6 engineers for any production rotation to ensure that each engineer is on primary at most once every 6 weeks. For rotations with both primary and secondary roles (which every production rotation should have), the minimum grows to 8 engineers to avoid overlap fatigue. A rotation with fewer than 4 engineers is a risk that should be escalated to engineering leadership — it is the organizational equivalent of a SPOF.

The 4-Engineer Rotation Trap

A rotation with only 4 engineers sounds manageable — everyone is on-call once per 4 weeks. But add in vacations (2 weeks/year per engineer), sick days (5 days/year), and conference travel, and your effective rotation drops below 3 engineers for significant periods. At that point, on-call is weekly, and burnout follows within 6 months. Size rotations for your worst realistic case, not your best.
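The arithmetic behind this trap is easy to check. A minimal sketch (the ~4 weeks/year of absence is an illustrative figure combining the vacation, sick-day, and travel estimates above):

```python
# Rough sanity check for rotation sizing. The absence figure is illustrative:
# ~2 weeks vacation + ~1 week sick days + ~1 week travel per engineer per year.

def effective_rotation_size(engineers: int, absence_weeks: float = 4.0) -> float:
    """Average number of engineers actually available in a given week."""
    return engineers * (1 - absence_weeks / 52)

def worst_case_size(engineers: int, simultaneously_out: int) -> int:
    """Rotation size in a week when several engineers are away at once."""
    return engineers - simultaneously_out

print(round(effective_rotation_size(4), 2))  # 4 on paper is ~3.69 in practice
print(worst_case_size(4, 2))                 # vacation overlapping a conference: 2
```

Sizing for the worst realistic case means planning around `worst_case_size`, not the on-paper headcount.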

| Rotation Size | Max On-Call Frequency (Primary) | Risk Level | Minimum for Healthy Operations? |
|---|---|---|---|
| 1-2 engineers | 50-100% of shifts | Critical risk; immediate organizational intervention required | No |
| 3-4 engineers | Every 3-4 weeks | High risk; burnout likely within 6-12 months | No |
| 5-6 engineers | Every 5-6 weeks | Acceptable for low-burden services only | Borderline |
| 7-8 engineers | Every 7-8 weeks | Healthy for most services | Yes |
| 9+ engineers | Every 9+ weeks | Healthy; consider splitting into primary + secondary pools | Yes |

Handoff cadence determines how long each engineer carries the pager. Weekly handoffs are the industry standard at FAANG companies because they balance continuity (enough time to see how a service behaves across load patterns) with fairness (no one carries the burden indefinitely). Google and Netflix use Sunday-to-Sunday handoffs to avoid mid-week disruption. Stripe uses Monday-to-Monday. Avoid mid-week handoffs (Wednesday) unless the service load is extremely low — mid-week handoffs mean every engineer transitions in the middle of peak traffic.

Shadow rotations are how you grow the rotation sustainably. Before an engineer takes primary on-call for the first time, they shadow an experienced primary for one full rotation. During the shadow week, they acknowledge every page alongside the primary, walk through every runbook step, and build the muscle memory of the incident response flow. At Google, shadow rotations are mandatory for all new SRE hires and for any software engineer taking on on-call for the first time.

Onboarding a New On-Call Engineer (Shadow Rotation Protocol)

1. Week 1 (Observation): New engineer observes a full rotation without any response responsibility. They read every runbook, review dashboards, and ask questions but do not acknowledge pages themselves.

2. Week 2 (Shadow): New engineer shadows an experienced primary. They acknowledge pages alongside the primary, suggest diagnostic steps, and participate in runbook execution. The primary has final decision-making authority.

3. Week 3 (Reverse Shadow): New engineer is nominally primary. The experienced engineer shadows them. The new engineer leads all responses; the shadow only intervenes for critical errors.

4. Week 4 (Solo with buddy): New engineer takes primary independently with an experienced engineer available on Slack. Every page generates a brief post-action note reviewed by the buddy at end of shift.

5. Full Rotation Entry: After successful completion of weeks 1-4, the engineer joins the full rotation. They are paired with an experienced secondary for their first 2 independent primary weeks.

Toil budget per week is the operational health metric that most teams do not track but should. Google SRE defines the target as no more than 8 hours of interrupt-driven operational work per week for a primary on-call engineer. This is the ceiling at which on-call remains compatible with engineering productivity. Above 8 hours per week, engineers cannot make meaningful progress on feature or reliability work. Above 16 hours, cognitive load degrades decision quality during incidents. The Uber crisis in 2018 was characterized by averages above 18 hours per week — a crisis level that required a dedicated internal project to address.

Track Interrupt Hours, Not Just Page Count

Page count is a misleading metric. A single incident that spans 6 hours is far more damaging than 6 quick 5-minute pages. Track total interrupt hours per engineer per week — time spent acknowledging, investigating, mitigating, and documenting. If the weekly average exceeds 8 hours for more than two consecutive weeks, the team must prioritize reliability improvement work over feature work. This is a team health decision, not an engineering performance failure.
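A minimal sketch of this kind of tracking, assuming interrupt events are already logged as (engineer, week, hours) tuples; the data model here is hypothetical:

```python
from collections import defaultdict

WEEKLY_BUDGET_HOURS = 8  # the Google SRE toil ceiling cited above

def weekly_interrupt_hours(events):
    """Total interrupt hours (acknowledging + investigating + mitigating +
    documenting) per (engineer, week). `events` are (engineer, week, hours)."""
    totals = defaultdict(float)
    for engineer, week, hours in events:
        totals[(engineer, week)] += hours
    return totals

def needs_reliability_work(totals, engineer, week):
    """True if the engineer exceeded the budget this week and the week before,
    the two-consecutive-weeks trigger described above."""
    return all(
        totals.get((engineer, w), 0.0) > WEEKLY_BUDGET_HOURS
        for w in (week - 1, week)
    )

events = [("alice", 1, 6.0), ("alice", 1, 3.5), ("alice", 2, 9.0), ("bob", 2, 2.0)]
totals = weekly_interrupt_hours(events)
print(needs_reliability_work(totals, "alice", 2))  # True: 9.5h then 9.0h
print(needs_reliability_work(totals, "bob", 2))    # False
```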

The On-Call Toolkit: What You Need Before Your First Page

An on-call engineer without their toolkit is like a surgeon walking into the OR without instruments. The toolkit is not a luxury — it is a prerequisite for effective incident response. Every item in the toolkit must be verified and tested before the engineer takes the pager. "Runbook exists" is not sufficient; "Runbook was tested in the last 30 days and the steps produce the expected result" is the standard.

Runbooks are the most critical toolkit item. A runbook is a step-by-step operational guide for a specific alert or failure scenario. It answers four questions: what is failing, how do I confirm it is failing (diagnostic commands), how do I mitigate it (immediate actions), and when do I escalate. At Google, every PagerDuty alert must have a corresponding runbook link in the alert body. An alert without a runbook link is not permitted to fire in production. This policy is enforced at the alerting configuration level.

Anatomy of a production-quality runbook

  • Alert name and severity — The exact alert name as it appears in PagerDuty/Opsgenie, the severity level (P1/P2/P3), and whether this alert has ever caused a customer-visible incident.
  • What this alert means — A plain-language explanation of what is failing and why it matters. A new engineer who has never seen this service should understand the impact within 30 seconds of reading this section.
  • Initial triage commands — Exact commands to copy-paste, with expected output. No paraphrasing. Example: kubectl get pods -n payments -l app=payment-processor -o wide | grep -v Running to identify unhealthy pods.
  • Decision tree — If X, do Y. If not X, check Z. Explicit branching logic that accounts for the 3-4 most common failure modes for this alert. Reduces cognitive load at 3 AM.
  • Mitigation steps — Ordered steps to restore service. Each step has its own expected output confirmation so the engineer knows if the step worked before proceeding.
  • Escalation criteria — Explicit conditions under which to escalate: "If mitigation steps 1-3 do not restore error rate to <1% within 15 minutes, page the payment-systems team lead."
  • Post-incident actions — What to do after mitigation: which metrics to watch for recurrence, which Slack channels to notify, whether a postmortem is required.
  • Last updated date and owner — Runbooks rot. If a runbook has not been reviewed in 90 days, it may contain commands for deprecated systems. Owner is accountable for keeping it current.
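The anatomy above lends itself to a simple lint check. A sketch, assuming runbooks are stored as structured records; the section keys are illustrative, not a real schema:

```python
from datetime import date

# Required sections mirror the runbook anatomy above; key names are illustrative.
REQUIRED_SECTIONS = {
    "alert_name", "severity", "meaning", "triage_commands", "decision_tree",
    "mitigation_steps", "escalation_criteria", "post_incident_actions",
    "last_updated", "owner",
}
MAX_AGE_DAYS = 90  # older than this, commands may target deprecated systems

def lint_runbook(runbook: dict, today: date) -> list:
    """Return a list of problems; an empty list means the runbook passes."""
    problems = ["missing section: " + s
                for s in sorted(REQUIRED_SECTIONS - set(runbook))]
    updated = runbook.get("last_updated")
    if updated is not None and (today - updated).days > MAX_AGE_DAYS:
        problems.append("stale: not reviewed in the last 90 days")
    return problems
```

Running a check like this in CI keeps "runbook exists" from silently decaying into "runbook rotted".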

Dashboards are the second critical toolkit item. Before the first page, an on-call engineer must know exactly which dashboards to open for triage, in what order, and what normal looks like. Time spent searching for the right dashboard during an incident is time the user is experiencing degraded service. At Netflix, every service has a single "service health" landing dashboard pinned in the service repository, team Slack channel, and PagerDuty alert. At Google, the on-call briefing document for each rotation includes a "dashboards quick reference" section with direct links.

The Dashboard Pre-Shift Checklist

Before every on-call shift, spend 10 minutes verifying your toolkit: (1) Open each triage dashboard and confirm data is loading correctly. (2) Verify your PagerDuty/Opsgenie app is installed and notifications are enabled on your phone. (3) Check that you have access to all production systems needed for mitigation (SSH keys valid, Kubernetes contexts configured, database credentials accessible). (4) Read the handoff notes from the outgoing engineer. (5) Check if any changes were deployed in the last 24 hours that might cause incidents. This 10-minute investment saves hours of frantic searching when the page fires at 2 AM.
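The checklist can be scripted so it is actually run. A minimal sketch; every check below is a placeholder to be wired to your real dashboards, paging tool, and access systems:

```python
# Pre-shift checklist runner. Each check is a stub -- replace the bodies with
# real probes (fetch dashboards, send a test page, run a read-only command).

def dashboards_loading():      return True  # e.g. GET each dashboard, confirm fresh data
def pager_reachable():         return True  # e.g. send a test page, confirm mobile receipt
def prod_access_valid():       return True  # e.g. one read-only command per critical system
def handoff_notes_read():      return True  # manual confirmation from outgoing engineer
def recent_deploys_reviewed(): return True  # anything shipped in the last 24 hours?

CHECKLIST = [
    ("triage dashboards load", dashboards_loading),
    ("pager delivers a test page", pager_reachable),
    ("production access valid", prod_access_valid),
    ("handoff notes read", handoff_notes_read),
    ("last 24h of deploys reviewed", recent_deploys_reviewed),
]

def run_checklist():
    """Run every check, print a PASS/FAIL line, return overall status."""
    results = [(name, check()) for name, check in CHECKLIST]
    for name, ok in results:
        print(("PASS " if ok else "FAIL ") + name)
    return all(ok for _, ok in results)
```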

Access and credentials are the most overlooked toolkit item. Engineers regularly discover during incidents that their credentials have expired, their VPN certificate has lapsed, or their Kubernetes context points to the wrong cluster. A locked-out engineer during a P1 incident adds 15-30 minutes to MTTR with no benefit. Build access verification into your pre-shift checklist and into your onboarding process.

| Toolkit Item | What to Verify | Verification Method | Review Cadence |
|---|---|---|---|
| Runbooks | Steps produce correct output, all commands work, escalation contacts are current | Run through the runbook manually against a staging environment or via dry-run flags | Monthly, or after any infrastructure change |
| Dashboards | Data is loading, metrics are labeled correctly, alert thresholds match current expectations | Open each dashboard and verify live data from the previous 24 hours looks reasonable | Weekly, at shift start |
| PagerDuty/Opsgenie | App installed, notifications enabled, correct phone number registered, escalation policy is accurate | Send a test page and confirm receipt on mobile within 60 seconds | First day of each rotation |
| Production access | SSH keys valid, VPN works, Kubernetes contexts correct, database credentials not expired | Run a read-only diagnostic command against each critical system | First day of each rotation |
| Communication channels | Incident Slack channel exists and is accessible, war-room video tool configured, stakeholder contact list current | Verify Slack channel access; confirm video conferencing works on your device | Weekly |

Escalation paths are the safety net that makes on-call psychologically safe. No engineer should ever feel that they are alone at 3 AM with a P1 incident and no one to call. Every rotation must have a documented escalation path: secondary on-call → team lead → domain expert → engineering manager → incident commander. Each person in the chain must be reachable at any hour and must have agreed to this role. At Stripe, escalation contacts are verified monthly — not just listed.

The Stale Escalation Contact Trap

Nothing is more demoralizing than a P1 incident at 2 AM where the escalation path leads to engineers who have left the company, changed teams, or are on vacation in a timezone 12 hours away. Escalation contacts must be reviewed at the start of every rotation. Use tooling (PagerDuty schedules, team directories) rather than a static document that will go stale. At minimum, verify that the person on the escalation list still works at the company, is not on leave, and has acknowledged their role this quarter.
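This kind of monthly verification can be automated against whatever contact data your tooling exposes. A sketch, assuming contacts are simple records with activity and acknowledgement fields (a hypothetical schema):

```python
from datetime import date

def quarter_start(today: date) -> date:
    """First day of the calendar quarter containing `today`."""
    return date(today.year, 3 * ((today.month - 1) // 3) + 1, 1)

def stale_contacts(contacts, today: date):
    """Contacts who left, are on leave, or have not acknowledged their
    escalation role this quarter -- the three checks named above."""
    qs = quarter_start(today)
    return [
        c["name"] for c in contacts
        if not c["active"] or c["on_leave"] or c["last_acknowledged"] < qs
    ]

contacts = [
    {"name": "lead",   "active": True,  "on_leave": False, "last_acknowledged": date(2024, 4, 8)},
    {"name": "expert", "active": True,  "on_leave": True,  "last_acknowledged": date(2024, 4, 2)},
    {"name": "alumni", "active": False, "on_leave": False, "last_acknowledged": date(2024, 1, 5)},
]
print(stale_contacts(contacts, date(2024, 5, 10)))  # ['expert', 'alumni']
```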

When the Page Fires: The First 5 Minutes

The first 5 minutes of an incident response determine the trajectory of the entire event. Engineers who follow a practiced protocol in the first 5 minutes achieve faster MTTR, make fewer mistakes under pressure, and communicate more clearly to stakeholders. Engineers who improvise in the first 5 minutes often make the problem worse — running commands without understanding their impact, jumping to conclusions about root cause, or failing to communicate until the situation has deteriorated further.

The protocol has five steps: Acknowledge → Assess → Triage → Communicate → Mitigate. These steps must happen in order. Jumping from Acknowledge directly to Mitigate (applying fixes before assessing the scope of the problem) is the single most common error new on-call engineers make. It leads to applying the wrong mitigation, making the incident worse, or missing a larger underlying issue.

flowchart TD
  A["🔔 Page fires"] --> B["Acknowledge page (< 5 minutes)"]
  B --> C{"Is this a real incident? Check the initial signal"}
  C -->|False positive / noise| D["Snooze; create a noise-reduction ticket"]
  C -->|Real incident| E["Assess scope: open dashboards, check SLI impact"]
  E --> F{"Customer impact?"}
  F -->|Minimal / no user impact| G["Investigate quietly; monitor for escalation"]
  F -->|Confirmed user impact| H["Open incident channel; post initial status"]
  H --> I["Triage: narrow root cause via hypothesis-driven debugging"]
  I --> J{"Can you mitigate without escalation?"}
  J -->|Yes| K["Apply mitigation; document each step"]
  J -->|No| L["Escalate to secondary or domain expert"]
  L --> K
  K --> M{"Mitigation working? Check SLI recovery"}
  M -->|SLIs recovering| N["Continue monitoring; update the incident channel"]
  M -->|Not improving| O["Escalate further; broaden scope assessment"]
  N --> P["Declare incident resolved; post the all-clear to the channel"]
  P --> Q["Post-incident actions: runbook update, postmortem if needed"]

The complete on-call response flow from page receipt to resolution. The key discipline is following the Acknowledge → Assess → Triage → Communicate → Mitigate order, never skipping Assess and Triage to jump straight to mitigation.

The First 5 Minutes: Exact Steps

1. Acknowledge (0:00-1:00): Open PagerDuty/Opsgenie and acknowledge the page immediately. Do not start investigating yet. Acknowledging stops the escalation timer. If you cannot acknowledge within the SLA, the secondary will be paged. Acknowledge first, read second.

2. Assess (1:00-2:30): Open the triage dashboard linked in the alert. Determine: Is this alert real (not a false positive from a monitoring blip)? Is customer impact confirmed (error rate above threshold, latency above SLO)? How long has the issue been active (look at the graph: was this a gradual degradation or a sudden spike)? What changed recently (deployments, config pushes, infrastructure events)? These 4 questions take 60-90 seconds with a good dashboard.

3. Triage (2:30-4:00): Form your initial hypothesis based on the signal. The most common root causes in order of frequency are: (1) recent code/config deployment, (2) dependency failure, (3) traffic spike. Check each in order. Pull up deployment history: was anything pushed in the last 2 hours? Check dependency health dashboards. Check traffic graphs for unusual spikes. Do not start running fix commands yet.

4. Communicate (3:00-4:00): Post an initial status update in the incident Slack channel. Even if you do not know the root cause, communicate: "I am investigating an elevated error rate on the payments service since approximately 14:32 UTC. Dashboards open, assessing scope. Will update in 5 minutes." This communication stops stakeholders from pinging you individually and buys you time to focus. At Google, this initial post is mandatory within 4 minutes of acknowledging a P1.

5. Mitigate (4:00+): Only now do you start executing mitigation steps. Begin with the highest-confidence, lowest-risk action first (e.g., rollback the most recent deployment). Execute one step at a time. Document every command you run and its output in the incident channel. If a step does not improve the metrics within 5 minutes, move to the next hypothesis.

The Most Common First-5-Minute Mistakes

Three mistakes account for the majority of prolonged incidents: (1) Running mitigation commands before assessing scope — engineers restart services or roll back deployments before confirming those are the right actions, sometimes making things worse. (2) Failing to communicate — silence in the incident channel causes stakeholders to escalate, generating noise that distracts the responder. (3) Anchoring on the first hypothesis — the first alert you see is not necessarily the root cause. A database slowdown might alert as an API error rate spike. Always follow the data.

The "What Changed?" Heuristic

Approximately 70-80% of production incidents are caused by recent changes: code deployments, configuration pushes, infrastructure updates, or traffic pattern changes (feature launches, viral events). Before running a single diagnostic command, check: what was deployed in the last 2 hours? This single question resolves most incidents faster than any amount of metric analysis.
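The heuristic is easy to encode in triage tooling. A minimal sketch, assuming a change feed with timestamps (the data model is illustrative):

```python
from datetime import datetime, timedelta

def recent_changes(changes, now, window_hours=2):
    """Deploys, config pushes, and infra changes within the last
    `window_hours`, newest first: the first suspects when a page fires."""
    cutoff = now - timedelta(hours=window_hours)
    return sorted(
        (c for c in changes if c["at"] >= cutoff),
        key=lambda c: c["at"],
        reverse=True,
    )

now = datetime(2024, 5, 1, 14, 40)
changes = [
    {"what": "payments deploy v214",       "at": datetime(2024, 5, 1, 14, 32)},
    {"what": "config push: retry budget",  "at": datetime(2024, 5, 1, 13, 5)},
    {"what": "infra patch",                "at": datetime(2024, 5, 1, 9, 0)},
]
for c in recent_changes(changes, now):
    print(c["what"])  # the 14:32 deploy first, then the 13:05 config push
```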

Incident Response in Practice

Incident response in practice is messier than any protocol suggests. Multiple alerts fire simultaneously. Stakeholders send messages asking for updates. Two engineers start investigating the same system from different angles. The database is slow but it might be because of the bad deploy or it might be a separate issue. This section covers the operational mechanics of managing a real incident: war room setup, hypothesis-driven debugging, the critical distinction between mitigation and resolution, and when to expand the response team.

War room setup is the first decision for any P1 or P2 incident. A war room is a dedicated communication space — a Slack channel, a video call, or both — where all incident responders coordinate. The war room prevents the most common coordination failure in multi-engineer incidents: engineers working in parallel without knowing what each other is doing, applying conflicting mitigations, or duplicating diagnostic work. At Google, every P1 incident automatically creates a dedicated Slack channel and a Google Meet link. At Stripe, P1 incidents trigger an automated war room setup via their internal incident management tooling.

War room setup checklist for P1/P2 incidents

  • Create a dedicated Slack channel — Name it #inc-YYYYMMDD-service-brief-description. Post a link to it in the team channel and pin it. Post the incident management link (PagerDuty, Jira, Linear) immediately.
  • Designate an Incident Commander (IC) — The IC coordinates the response. They do not do the technical debugging — they manage communication, coordinate escalations, and ensure no two engineers are doing the same thing. For P1s at Google, the IC role is explicitly assigned within the first 5 minutes.
  • Establish a communication cadence — Agree on update frequency: "I will post a status update every 10 minutes until mitigation." Predictable updates prevent stakeholder messages from disrupting the responder.
  • Assign clear ownership of each investigation thread — If multiple engineers are investigating, assign one to the database, one to the application layer, one to the infrastructure layer. No unassigned investigation threads.
  • Keep a running incident log — Every command run, every hypothesis formed, every action taken goes into the incident channel. This is both your coordination tool and the raw material for the postmortem timeline.

Hypothesis-driven debugging is the analytical framework that prevents random diagnostic wandering. The principle: form a specific hypothesis before running any command, then run the minimum set of commands to confirm or refute the hypothesis. If the hypothesis is refuted, form the next most likely hypothesis and repeat. This sounds obvious but it requires discipline under pressure — when you are stressed at 2 AM and multiple things look wrong, the temptation is to run everything at once and hope something reveals the answer.

Hypothesis-Driven Debugging in 3 Steps

Step 1: State the hypothesis explicitly in the incident channel ("I think this is caused by the config push at 14:32 because the error rate started at 14:34"). Step 2: Identify the single metric or log that would confirm or refute this hypothesis ("If it is the config push, rollback should restore the error rate to baseline within 2 minutes"). Step 3: Execute the test and observe the result. If error rate recovers: hypothesis confirmed. If not: state the next hypothesis. This process documents your reasoning and allows other engineers to follow your logic.
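The three-step loop can be expressed directly in code. A sketch: each hypothesis pairs a statement with the single test that would confirm it, and every transition is posted to the incident channel (here, any callable such as print):

```python
def run_hypotheses(hypotheses, post_update=print):
    """Hypothesis-driven debugging: state, test, confirm or move on.
    `hypotheses` is a list of (statement, test) pairs where test() returns
    True if the single confirming signal is observed."""
    for statement, test in hypotheses:
        post_update("Hypothesis: " + statement)
        if test():
            post_update("Confirmed: " + statement)
            return statement
        post_update("Refuted: " + statement)
    post_update("All hypotheses refuted; escalate and broaden scope")
    return None

# In practice each lambda would check a metric or log, not a constant.
result = run_hypotheses([
    ("config push at 14:32 caused the spike", lambda: False),
    ("dependency X is returning 5xx",         lambda: True),
])
print(result)  # dependency X is returning 5xx
```

Posting each statement before testing it is what keeps other responders able to follow (and challenge) your reasoning.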

Mitigation versus resolution is a distinction that every on-call engineer must internalize. Mitigation stops the user-visible impact: it is the fastest path back to normal service. Resolution finds and permanently fixes the root cause: it takes much longer and should not block service recovery. The on-call engineer's primary responsibility during an active incident is mitigation. Resolution is the postmortem's responsibility. A traffic rollback that restores service in 5 minutes is better than a 2-hour debugging session that finds the perfect fix — even if the rollback means you will need to redeploy tomorrow.

| Concept | Definition | Who Does It | Timing | Example |
| --- | --- | --- | --- | --- |
| Mitigation | Stopping user-visible impact by any means necessary | On-call engineer during the incident | As fast as possible | Rolling back a bad deployment to restore API response rates |
| Resolution | Finding and permanently fixing the root cause | On-call + development team after mitigation | Hours to days after mitigation | Identifying the specific code bug that caused the bad deployment and fixing it |
| Prevention | Changing the system to prevent the same root cause from triggering again | Development team driven by postmortem actions | Days to weeks after resolution | Adding validation that prevents the class of config error that caused the incident |

The Mitigation-First Principle

When a service is degraded, the user does not care whether you understand why — they care that it is fixed. Mitigate first, investigate later. Common mitigations in order of speed and risk: (1) Rollback the most recent deployment (fastest, lowest risk, works 70% of the time). (2) Increase capacity (spin up more instances to absorb load). (3) Enable a feature flag to disable the affected functionality. (4) Activate the circuit breaker to the failing dependency. Each of these can be executed in minutes. Each gives you breathing room to investigate properly.
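The mitigation ladder above can be expressed as a simple ordered loop: try the fastest, lowest-risk action first and stop as soon as service health recovers. This is a toy sketch under assumed names; the four action functions are placeholders, not real APIs:

```python
def mitigate(actions, is_healthy):
    """Try mitigations in order of speed and risk; return the name of
    the first one that restores service, or None if all fail (escalate)."""
    for name, action in actions:
        action()
        if is_healthy():
            return name   # mitigated -- root-cause investigation comes later
    return None

# Toy example: only the rollback restores health.
state = {"healthy": False}

def rollback():        state["healthy"] = True
def add_capacity():    pass
def disable_feature(): pass
def open_circuit():    pass

result = mitigate(
    [("rollback", rollback),
     ("add_capacity", add_capacity),
     ("disable_feature", disable_feature),
     ("open_circuit_breaker", open_circuit)],
    is_healthy=lambda: state["healthy"],
)
```

The ordering encodes the principle from the text: a rollback that works in 5 minutes beats a perfect fix that takes 2 hours, so the cheapest action always runs first.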

When to expand the incident response team is a judgment call that new on-call engineers consistently make too late. The rule: if you are not making progress toward mitigation within 15 minutes, bring in additional expertise. This is not a sign of weakness — it is the correct operational decision. Every minute a P1 persists while you attempt to solve it alone is a minute of additional customer impact.

Escalation: When to Ask for Help

Escalation is one of the most misunderstood practices in incident response. New engineers frequently treat it as a last resort — something you do only after you have exhausted all other options and have been struggling for an hour. Experienced SREs treat it as a precision tool: you escalate at the right moment, with the right information, to the right person. The goal is to get the right expertise engaged as quickly as possible, not to demonstrate that you can handle everything alone.

At Google, the escalation doctrine is explicit: if you are not clearly making progress within 15 minutes of starting investigation, escalate. "Making progress" means you have a confirmed hypothesis and a mitigation path. If you are still in "I am not sure what is wrong" territory after 15 minutes, you need more expertise. This threshold is not 30 minutes, 45 minutes, or "after I have tried everything." It is 15 minutes.

Escalation Is Not a Failure

The most experienced SREs escalate more quickly than junior engineers, not less. Senior engineers have learned that time is the most precious resource during an incident. An extra 30 minutes of solo debugging because an engineer did not want to wake someone up can mean 30 additional minutes of customer impact. Escalation is a professional skill that demonstrates operational maturity, not weakness.

Escalation criteria: when to escalate immediately vs escalate-with-context

  • Escalate immediately (no delay) — P1 incident with confirmed customer impact exceeding 5% of traffic. Data loss or corruption confirmed or suspected. Security incident suspected (unauthorized access, credential exposure). Incident scope expanding instead of contracting after mitigation attempts.
  • Escalate with context (gather first, then escalate) — Investigation stalled after 15 minutes with no clear hypothesis. Alert is outside your team's domain of expertise. Incident involves multiple services from different teams. Mitigation requires access or permissions you do not have.
  • Do not escalate (handle yourself) — Alert is a known false positive with a ticket open for it. Issue is self-healing (flapping network, transient spike that has already resolved). Issue is P3 or below with no customer impact and a clear runbook path.

The escalation template is what separates effective escalations from ineffective ones. When you page someone at 3 AM, you owe them a structured briefing that lets them be productive within 60 seconds of joining. An escalation that says "payments is broken, help" wastes the first 5-10 minutes of the escalated engineer's time while you explain context that you should have provided upfront.

The SBAR Escalation Template

Use SBAR format for every escalation: Situation ("payments API error rate is 15% for the last 8 minutes, up from baseline of 0.1%"), Background ("a config push was deployed at 14:32, 6 minutes before the error rate spike"), Assessment ("I believe the config push changed the database connection pool settings causing connection exhaustion; I have checked deployment logs but need help interpreting the JVM thread dump"), Recommendation ("I need a JVM expert to review the thread dump — here is the link: [URL]"). This 4-sentence briefing gets the escalated engineer productive immediately.
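A pre-written SBAR template can be as simple as a four-field formatter; the field contents below mirror the example in the text, and the function name and Slack-style formatting are illustrative assumptions:

```python
def sbar_escalation(situation, background, assessment, recommendation):
    """Render a four-part SBAR escalation message for the incident channel."""
    return (
        f"*Situation:* {situation}\n"
        f"*Background:* {background}\n"
        f"*Assessment:* {assessment}\n"
        f"*Recommendation:* {recommendation}"
    )

msg = sbar_escalation(
    situation="payments API error rate is 15% for the last 8 minutes (baseline 0.1%)",
    background="config push deployed at 14:32, 6 minutes before the spike",
    assessment=("config push likely changed DB connection pool settings, "
                "causing connection exhaustion; need help reading the JVM thread dump"),
    recommendation="need a JVM expert to review the thread dump: <link>",
)
```

Filling four named blanks under stress is far easier than composing a briefing from scratch at 3 AM, which is exactly the argument for pre-written templates in the runbook.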

| Company Size | Primary Escalation | Secondary Escalation | Executive Escalation | External Escalation |
| --- | --- | --- | --- | --- |
| Startup (< 50 engineers) | Team lead or senior engineer | CTO directly for P0 incidents | CEO for customer-impacting outages > 1 hour | Cloud provider support ticket |
| Mid-size (50-500 engineers) | Domain team on-call engineer | Engineering manager on call | VP Engineering for P0/P1 | Cloud provider TAM for infrastructure issues |
| FAANG (500+ engineers) | Secondary on-call in rotation | Domain expert team (payments team, infra team) | Incident Commander role (senior IC on-call pool) | Cloud provider critical support SLA |

Pre-Write Your Escalation Messages

For high-severity alerts, pre-write the escalation message template in the runbook. When an alert fires, the runbook should say "if you have not mitigated within 15 minutes, send this message to #payments-escalation: [template with blanks for current metrics]." Pre-written templates are faster, more complete, and more professional than messages composed under stress at 3 AM.

Escalation paths by company size vary significantly. At a startup, you may escalate directly to the founding engineer or CTO. At Google, there is a tiered system: secondary on-call → domain team expert → incident commander pool → VP-level incident management for critical infrastructure outages. At Netflix, there is a dedicated "war room team" that can be engaged for Tier-1 service incidents. Regardless of company size, the principle is the same: the escalation path must be documented, the people on it must have agreed to their role, and the path must be tested before you need it.

Handoffs: Ending Your Shift

The handoff is the moment where on-call continuity is either preserved or broken. A poor handoff leaves the incoming engineer blind to current risk, unaware of recent incidents, and without context about the service's current state. A good handoff takes 15-20 minutes to prepare and saves 2-3 hours of confusion for the incoming engineer. The investment is asymmetric — the effort is small, the benefit is large.

The handoff document is the primary artifact. It is a written record, not an informal chat message. At Google, handoffs are written in an internal on-call handoff tool that maintains a structured format. At companies without dedicated tooling, a structured Slack message or Confluence page achieves the same purpose. The format must be consistent across shifts so the incoming engineer knows exactly where to look for each piece of information.

The On-Call Handoff Template (Required Sections)

1. Shift Summary: Start and end time, rotation week number, overall shift rating (quiet / moderate / heavy). One sentence describing the week. This helps the team track on-call burden trends over time.

2. Active Incidents: For each incident that is not fully resolved: incident ID, current status, what is known, what still needs investigation, who is the current owner, and the expected next action. The incoming engineer should be able to own any active incident within 5 minutes of reading this section.

3. Open Alerts and False Positives: List any alerts that fired but were determined to be false positives, with a note about why and whether a suppression or fix ticket exists. List any alerts that are in a degraded but stable state and need monitoring.

4. Recent Changes: All deployments, config changes, and infrastructure events in the last 48 hours. This is the most important section for understanding risk — the incoming engineer needs to know what changed and whether those changes might cause incidents during their shift.

5. Risk Assessment: Your honest assessment of what might break during the incoming shift. "The new caching layer deployed today is untested at peak traffic — watch for cache-miss spikes around 18:00 UTC when the US East Coast load ramps." This is institutional knowledge that saves the incoming engineer hours.

6. Action Items: Anything that needs follow-up but was not completed during your shift — runbook updates, noise-reduction tickets, postmortem contributions. Assign each item to a specific person with a due date.

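One way to keep the six sections consistent across shifts is to render the handoff from a fixed template that refuses to produce an incomplete document. This is a sketch of the idea, not any company's actual handoff tooling; section contents below are invented examples:

```python
HANDOFF_SECTIONS = [
    "Shift Summary",
    "Active Incidents",
    "Open Alerts and False Positives",
    "Recent Changes",
    "Risk Assessment",
    "Action Items",
]

def render_handoff(content):
    """Render a handoff document; raise if any required section is missing,
    so an incomplete handoff cannot be posted. `content` maps section -> text."""
    missing = [s for s in HANDOFF_SECTIONS if s not in content]
    if missing:
        raise ValueError(f"handoff incomplete, missing sections: {missing}")
    lines = []
    for i, section in enumerate(HANDOFF_SECTIONS, start=1):
        lines.append(f"{i}. {section}")
        lines.append(f"   {content[section]}")
    return "\n".join(lines)

doc = render_handoff({
    "Shift Summary": "Week 14, moderate: 3 actionable pages.",
    "Active Incidents": "INC-201 (owner: alice, next: verify cache fix).",
    "Open Alerts and False Positives": "disk-latency alert flapping; ticket OPS-88 open.",
    "Recent Changes": "New caching layer deployed Tuesday.",
    "Risk Assessment": "Cache untested at peak; watch the 18:00 UTC ramp.",
    "Action Items": "Update rollback runbook (owner: bob, due Friday).",
})
```

Making the empty-section case an error rather than a blank heading is what turns the template into an enforcement mechanism for the "it is fine" anti-pattern described below.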

The "Worse Than Before" Rule

An informal but powerful norm at several SRE teams: you must not hand off the service in a worse state than you received it. This does not mean you must resolve every incident — some incidents take days. It means that if you received the service with 1 open incident and 2 known risks, you should not hand off with 3 open incidents and 5 known risks without having made visible progress or without explicit documentation of why. If the service deteriorated during your shift, the handoff document must explain why, what was tried, and what the incoming engineer should focus on first.

Open incident tracking is the operational discipline that prevents incidents from falling through the cracks between shifts. Every open incident must have a ticket (in PagerDuty, Jira, or your issue tracker), a current owner, and a clear next action. "Watching it" is not a valid state for an open incident — if it is worth monitoring, it is worth documenting what you are watching for and what threshold triggers a mitigation action.

Context transfer in practice means that the outgoing engineer should be available for questions for the first 30 minutes after handoff. This is the "warm handoff" principle used by Google SRE teams. The incoming engineer reads the handoff document, identifies their top 3 questions, and asks them while the outgoing engineer is still fresh. After 30 minutes, the outgoing engineer is officially off-duty and should not be disturbed except for P1 emergencies.

The Shift Debrief Habit

Before writing the handoff document, spend 5 minutes on a personal shift debrief: What were the top 3 things that made this shift harder than it needed to be? For each one, is there a ticket open to fix it? If not, create one now — it takes 2 minutes and is the most impactful contribution you can make to improving on-call quality. Over time, these tickets aggregate into a prioritized on-call improvement backlog. Teams that maintain this backlog consistently reduce their interrupt hours every quarter.

Common handoff anti-patterns and how to fix them

  • The "it is fine" handoff — Outgoing engineer says "quiet week, no issues" but does not document the three alerts they silenced or the deployment they are nervous about. Fix: require documentation of all silenced alerts and recent changes regardless of whether incidents occurred.
  • The verbal-only handoff — Context is transferred in a 10-minute Slack call with no written record. The incoming engineer remembers 40% of what was said. Fix: written handoff document is mandatory; verbal call supplements it, never replaces it.
  • The missing owner — Open incident has no current owner listed. The incoming engineer assumes someone else is handling it. No one handles it. Fix: every open incident in the handoff document must have a named owner and a next action.
  • The stale runbook note — Handoff says "runbook needs updating" but no ticket exists and no one is assigned. Fix: if you identify a runbook gap during your shift, create the ticket before you hand off, not in a comment in the handoff doc.

On-Call Health: Preventing Burnout

On-call burnout is one of the most significant causes of SRE and infrastructure engineering attrition. PagerDuty's 2021 State of Digital Operations report found that 26% of on-call engineers have seriously considered leaving their job due to on-call stress, and that engineers who are paged more than 10 times per week are 2.7x more likely to report burnout than those paged fewer than 3 times. This is not a personality weakness — it is a systemic problem that engineering organizations create and can systematically fix.

Interrupt tracking is the measurement foundation for on-call health improvement. You cannot improve what you do not measure. Every team should track, per engineer per week: total interrupt hours (time spent on on-call activities), number of actionable pages, number of noise pages, time to acknowledge (MTTA), time to resolve (MTTR), and whether each incident was covered by a runbook. These metrics should be reviewed in team retrospectives monthly and in quarterly health reviews.

| Metric | Healthy Target | Warning Threshold | Critical Threshold | How to Measure |
| --- | --- | --- | --- | --- |
| Weekly interrupt hours (per engineer) | < 8 hours/week average | 8-16 hours/week | > 16 hours/week | PagerDuty/Opsgenie incident duration + time logged in incident channels |
| Actionable pages per shift (12 hours) | < 2 pages | 2-5 pages | > 5 pages | PagerDuty alert count filtered by "requires action" tag |
| Noise pages (false positives) | < 20% of total pages | 20-40% | > 40% | Incidents closed as "no action required" / total incidents |
| Mean Time to Acknowledge (MTTA) | < 5 minutes for P1/P2 | 5-15 minutes | > 15 minutes | PagerDuty MTTA report by service and severity |
| Engineers who report high stress | < 10% of rotation | 10-25% | > 25% | Quarterly team health survey; anonymous NPS-style question on on-call experience |
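The numeric thresholds in the table can be checked mechanically once the metrics are collected. A minimal sketch, with illustrative metric names and the four numeric rows hard-coded from the table:

```python
def oncall_health(metrics):
    """Classify each metric as healthy / warning / critical against
    the table's thresholds. `metrics` maps metric name -> current value."""
    thresholds = {
        # name: (warning above this, critical above this)
        "weekly_interrupt_hours": (8, 16),
        "actionable_pages_per_shift": (2, 5),
        "noise_page_ratio": (0.20, 0.40),
        "mtta_minutes_p1": (5, 15),
    }
    report = {}
    for name, (warn, crit) in thresholds.items():
        value = metrics[name]
        if value > crit:
            report[name] = "critical"
        elif value > warn:
            report[name] = "warning"
        else:
            report[name] = "healthy"
    return report

report = oncall_health({
    "weekly_interrupt_hours": 12,      # inside the 8-16 warning band
    "actionable_pages_per_shift": 1,   # under the healthy target of < 2
    "noise_page_ratio": 0.45,          # above the 40% critical threshold
    "mtta_minutes_p1": 4,              # under the 5-minute healthy target
})
```

Any metric landing in "critical" is the signal, per the improvement loop below, to start the next quarter's noise-reduction work there.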

The on-call improvement loop is the systematic process that prevents burnout from being a permanent condition. It has four steps: measure current burden, identify the top sources of interrupt work, reduce them (either by fixing reliability or by reducing alert noise), and remeasure. This loop should run continuously — ideally with a dedicated "on-call improvement sprint" each quarter where the team's primary goal is reducing alert noise and toil, not shipping features.

Uber's Project Oncall Health: The Numbers

In 2018-2019, Uber's engineering organization ran "Project Oncall Health" after identifying that multiple SRE teams were spending >40% of their time on interrupt work. The project systematically measured interrupt hours across all teams, identified the top 20 alert-generating services, assigned reliability improvement sprints to each, and implemented a company-wide policy that any team exceeding 20 weekly interrupt hours per engineer triggers a mandatory reliability review. Result: average weekly interrupt hours across SRE teams fell from 18 hours to 6 hours over 18 months. Engineer retention in on-call roles improved significantly.

Organizational structures that improve on-call health include: a rotating "on-call improvement lead" role where one engineer per quarter has their feature work reduced to focus exclusively on alert noise reduction; a "noisy alert fund" budget that reserves engineering time for fixing the top 10 noise-generating alerts each quarter; and a policy that any alert that fires more than 5 times in a week without resulting in a runbook action triggers automatic review and must be either fixed or suppressed within 30 days.

Proven interventions for reducing on-call burden (ranked by impact)

  • 1. Alert noise reduction — The single highest-impact intervention. Identify the top 5 alerts by frequency that require no action (or whose action is always "wait and see"). Either fix the underlying cause or raise the alert threshold. At Google, alert noise above 10% of total pages triggers a mandatory alert review within 2 weeks.
  • 2. Runbook completeness — Incidents without runbooks take 2-4x longer to resolve. Identify the top 10 alert types by time-to-resolve and verify each has a complete, tested runbook. A complete runbook reduces MTTR by 50% for that alert type.
  • 3. Rotation size increase — Adding one or two engineers to an under-staffed rotation immediately reduces per-engineer burden by 15-20%. This requires making the business case with data: interrupt hours per engineer per week, engineer satisfaction scores, and engineer attrition risk.
  • 4. Shadow rotation for recurring issues — For incidents that require specialized knowledge (a complex database issue, a specific Kubernetes configuration failure), maintain a subject matter expert shadow rotation where the expert is available for rapid escalation during their shadow week.
  • 5. Post-incident automation — For incidents that require the same manual steps every time (restart a specific service, clear a specific queue), automate the mitigation. A runbook that says "run this script" is better than one that says "run these 8 commands." A button in a dashboard that runs the script is better than a runbook.
  • 6. Mandatory recovery time — Enforce a policy that engineers who are paged more than 3 times during a night shift receive the next day as recovery time, not counted against vacation. This policy costs the team ~1 day of productivity per heavy night shift but prevents the cognitive degradation that comes from sleep deprivation.

The On-Call Experience Survey

Run a quarterly anonymous 5-question survey of all on-call engineers: (1) How would you rate your on-call experience this quarter (1-10)? (2) How many times were you woken during sleeping hours? (3) What was the most frustrating aspect of on-call this quarter? (4) Is there a specific alert or service that should be improved immediately? (5) Would you recommend on-call at this company to a friend? Track these scores over time. A consistent score below 7/10 is a signal that organizational intervention is required.

Advanced On-Call: GameDays, Runbook Drills, and Improving the System

Surviving on-call is the minimum bar. Elite engineering teams use on-call as a continuous improvement engine. They run game days to build muscle memory before the real incident. They drill runbooks to find gaps before those gaps appear at 3 AM. They analyze on-call metrics to identify systemic reliability problems and address them proactively. The result is a team that is not just reactive to production incidents but actively preventing them.

Game days (also called "chaos drills" or "failure injection exercises") are controlled experiments where the team deliberately introduces failures to practice response procedures. The goal is to build the muscle memory of incident response without the pressure of a real production outage. At Google, the DiRT (Disaster Recovery Testing) program is one of the most mature game day programs in the industry — it systematically tests recovery from datacenter failures, network partitions, data corruption, and service outages on a quarterly basis.

Types of game days and what each tests

  • Alert response drills — Silently trigger a real alert in a staging environment and measure the time for the on-call engineer to acknowledge, diagnose (using only dashboards and runbooks), and mitigate. Reveals gaps in runbook completeness and alert clarity.
  • Dependency failure injection — Use a chaos engineering tool (Chaos Monkey, Gremlin, AWS Fault Injection Simulator) to kill one dependency (database replica, cache, message queue). Measure how the service degrades, whether circuit breakers activate, and how long it takes the on-call engineer to identify the failure and execute the dependency-failure runbook.
  • On-call rotation handoff drill — Simulate a complex incident state at shift change and have both outgoing and incoming engineers execute the handoff procedure. Evaluate: does the incoming engineer understand the current state within 10 minutes? Are all open incidents accurately documented in the handoff?
  • Escalation chain test — Send a test page through the escalation chain and verify response times at each level. Does the secondary receive the page within SLA? Does the domain expert respond within the escalation window? Are all contacts in the escalation path reachable?
  • Full outage simulation (Red Team exercise) — A "Red Team" (a small group of senior engineers) simulates a major multi-system failure without informing the on-call team what they are about to inject. The on-call team responds as they would to a real incident. Debrief evaluates coordination, communication, tooling effectiveness, and time to mitigation. Reserved for teams with mature on-call practices.

Runbook drills are the underinvested complement to game days. A runbook that has never been executed outside a real incident is a runbook that will fail during a real incident. Steps will be missing. Commands will be wrong. Links will be stale. Every runbook should be executed by a new rotation member (who has not memorized it through experience) in a staging environment at least once per quarter. The execution should be timed, and any gaps should be filed as tickets before the drill ends.

The "Intern Test" for Runbooks

A runbook should be executable by a competent but new engineer with zero prior knowledge of the service. If a runbook requires the engineer to "know" something that is not written in the runbook — a background assumption, an undocumented system behavior, a Slack channel that is not mentioned — then the runbook is incomplete. Test your runbooks by having a new team member execute them and noting every question they had to ask. Each question is a gap that needs to be filled.

Building an On-Call Improvement Program (Quarterly Cycle)

1. Month 1 — Measure: Collect on-call metrics for the quarter: interrupt hours per engineer, page frequency by alert, MTTA and MTTR by service, and on-call experience survey scores. Identify the top 5 sources of interrupt work by time and frequency.

2. Month 1-2 — Prioritize: Rank the top 5 interrupt sources by combination of (impact on engineer hours) and (feasibility of fix). Create improvement tickets for each. Assign owners. For each ticket, define what "done" looks like: "Alert X will fire fewer than 2 times per week" or "Runbook Y will be executable by a new engineer in under 20 minutes."

3. Month 2-3 — Execute: Development team and SRE team collaborate on the reliability improvements. SRE tracks progress weekly. Use the 10% rule: if a service improvement is taking more than 10% of the team's engineering capacity to fix, escalate to the service owner's engineering manager.

4. Month 3 — Validate: Run a game day at the end of the quarter to validate that the improvements had the intended effect. Rerun the on-call experience survey. Compare metrics to the baseline from Month 1.

5. Quarter Close — Retrospective: Run a 1-hour on-call retrospective with the full rotation. Celebrate improvements. Identify what did not improve. Set goals for next quarter. Share the results with engineering leadership — visible on-call health metrics create organizational pressure to continue investing in reliability.


Making On-Call Better, Not Just Survivable

The goal is not to make on-call tolerable — it is to make on-call a positive engineering experience. The best on-call programs at FAANG companies have engineered on-call to be an opportunity: to see the full system, to build cross-service knowledge, to develop incident command skills, and to drive improvements that benefit every engineer on the rotation. This requires organizational investment in tooling, runbooks, game days, and improvement cycles — but the return is a more reliable system, better engineer retention, and faster incident response.

SRE On-Call at Scale: Netflix's Approach

Netflix's "Freedom and Responsibility" culture extends to on-call: service teams own their on-call rotations end-to-end. Netflix does not have a central SRE team with dedicated on-call; instead, every team that owns a service is responsible for running it. This creates strong incentives for reliability — if your on-call is miserable, it is because your service is unreliable, and it is your team's problem to fix. Netflix compensates for the lack of central SRE by investing heavily in tooling (the Simian Army chaos engineering suite, Vizceral traffic visualization, Atlas metrics platform) that makes every team's on-call more effective.

How this might come up in interviews

On-call and incident response questions appear in multiple interview formats at FAANG companies. System design interviews ask you to describe how you would monitor and operate the system you just designed — SLOs, alerting thresholds, runbooks, escalation paths. Behavioral interviews at SRE roles ask you to describe a production incident you handled, specifically looking at your decision-making process, communication under pressure, and what you learned. Senior engineering and staff engineering interviews ask how you would improve a team's on-call health — they are evaluating your organizational thinking, not just technical skills.

Common questions:

  • Describe a production incident you personally handled. Walk me through your exact steps from the first page to resolution. What would you do differently today?
  • Your team is experiencing significant on-call burnout — engineers are complaining, one engineer has already quit citing on-call stress, and weekly interrupt hours average 20 hours per engineer. You are the new tech lead. What are your first three actions?
  • Walk me through how you would design an on-call rotation for a new service being launched next quarter with a team of 8 engineers. What rotation cadence, escalation paths, and runbook standards would you put in place?
  • During an active P1 incident, how do you balance the pressure to communicate updates to stakeholders with the need to focus on technical investigation? Give a concrete example of how you have managed this.
  • What does a good handoff look like at the end of an on-call shift? What are the most common handoff failures you have seen, and how do you prevent them?
  • How do you make the case to engineering leadership that the team needs more engineers on the on-call rotation? What data do you gather, and how do you present it?

Try this question: ask the interviewer how on-call burden is tracked at the company (specifically, whether interrupt hours per engineer per week are measured), what happens when a service consistently generates excessive on-call burden, and whether there is a company policy on minimum rotation sizes. These questions signal that you have thought deeply about on-call as a system, not just as a personal commitment.

Strong answer: Describing the mitigation vs resolution distinction clearly. Explaining the 15-minute escalation threshold. Mentioning specific metrics (interrupt hours, MTTA, noise ratio). Discussing how they have improved on-call health at a previous company with data. Using the ER doctor or shift-worker analogy to explain why rotation design matters. Mentioning game days and runbook drills as proactive investments.

Red flags: Describing on-call as "you just respond to pages and fix things." Not knowing what MTTA and MTTR are. Treating escalation as a last resort or a failure. Having no opinion on runbook quality. Not mentioning communication during incidents. Describing a rotation with fewer than 4 engineers as acceptable.

Key takeaways

  • On-call is a formal engineering contract with explicit SLAs: Google targets 5-10 minute acknowledgement times, Netflix targets fewer than 1 incident per week per on-call engineer, and Google recommends a maximum of 2 actionable pages per 12-hour shift.
  • Healthy rotations require a minimum of 6 engineers for critical services, weekly handoffs with written handoff documents, shadow rotations for new members, and a maximum of 8 interrupt hours per engineer per week.
  • The first 5 minutes protocol — Acknowledge → Assess → Triage → Communicate → Mitigate — must be followed in order. Skipping Assess and Triage to jump directly to Mitigation is the most common cause of prolonged incidents and unnecessary damage.
  • Escalation at 15 minutes without a clear hypothesis is the correct professional response, not a sign of weakness. Every minute of solo debugging past the 15-minute threshold without progress is a minute of unnecessary customer impact.
  • On-call burnout is a systemic organizational failure, not an individual resilience problem. Uber's Project Oncall Health demonstrated that systematic measurement, alert audits, rotation size enforcement, and runbook standards can reduce weekly interrupt hours by 67% — from 18 hours to 6 hours per engineer — over 18 months.
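The numeric targets in the takeaways above can be checked mechanically as part of an on-call health review. A minimal sketch, with the thresholds taken directly from the takeaways (the function name and inputs are illustrative):

```python
# Healthy-rotation thresholds from the takeaways above: at least 6 engineers,
# at most 2 actionable pages per 12-hour shift, at most 8 interrupt hours/week.
MIN_ROTATION_SIZE = 6
MAX_PAGES_PER_SHIFT = 2
MAX_INTERRUPT_HOURS_PER_WEEK = 8

def rotation_health(engineers, pages_per_shift, interrupt_hours_per_week):
    """Return a list of violations; an empty list means the rotation looks healthy."""
    violations = []
    if engineers < MIN_ROTATION_SIZE:
        violations.append(f"rotation size {engineers} < minimum {MIN_ROTATION_SIZE}")
    if pages_per_shift > MAX_PAGES_PER_SHIFT:
        violations.append(f"{pages_per_shift} actionable pages/shift > {MAX_PAGES_PER_SHIFT}")
    if interrupt_hours_per_week > MAX_INTERRUPT_HOURS_PER_WEEK:
        violations.append(f"{interrupt_hours_per_week}h interrupts/week > {MAX_INTERRUPT_HOURS_PER_WEEK}h")
    return violations

# An unhealthy rotation trips all three checks.
print(rotation_health(engineers=4, pages_per_shift=3, interrupt_hours_per_week=12))
```

Encoding the thresholds as code rather than tribal knowledge is the point: a weekly job can flag violations to management automatically, which is exactly the kind of systematic measurement the Uber example describes.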
Before you move on: can you answer these?

You have been investigating a P1 payment service incident for 20 minutes and still have no confirmed hypothesis. You are the primary on-call engineer. What should you do?

Escalate immediately. The 15-minute threshold for escalation without a clear hypothesis has already passed. Escalate to the secondary on-call engineer or the payments domain expert using SBAR format: state the current metrics (error rate at X%, started at Y time), what you have ruled out so far (rollback did not help, no dependency failures visible), your best current hypothesis, and what expertise you need. Do not wait longer — every additional minute of solo debugging when you are stuck is a minute of unnecessary customer impact.
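The SBAR structure in the answer above (Situation, Background, Assessment, Recommendation) is worth templating so it is available at 3 AM without thinking. A minimal sketch; all values are illustrative:

```python
def sbar_escalation(situation, background, assessment, recommendation):
    """Format an SBAR escalation message for the secondary on-call or a domain expert."""
    return (
        f"SITUATION: {situation}\n"
        f"BACKGROUND: {background}\n"
        f"ASSESSMENT: {assessment}\n"
        f"RECOMMENDATION: {recommendation}"
    )

# Example matching the scenario above (metrics and times are illustrative).
msg = sbar_escalation(
    situation="P1: payments error rate 14%, started 02:10 UTC",
    background="Rolled back last deploy at 02:25 with no effect; no dependency failures visible",
    assessment="No confirmed hypothesis after 20 minutes of solo debugging",
    recommendation="Need the payments domain expert on the incident call now",
)
print(msg)
```

A template like this removes the hesitation cost of escalating: the message is four fill-in-the-blank lines, not a composition exercise under pressure.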

Your on-call rotation has only 4 engineers. You are experiencing roughly 12 actionable pages per week per engineer. What are the most important two actions to take?

First, immediately escalate the rotation size problem to engineering management with data: with only 4 engineers, each engineer is on-call one week in four on paper, and more often in practice once vacations and sick days are covered. The minimum healthy rotation size is 6 engineers. Second, address the page frequency: 12 actionable pages per week almost certainly exceeds the 8-hour/week interrupt threshold. Run an alert audit to identify the top 5 alert sources and create reliability improvement tickets for each. Both fixes are needed — adding engineers without reducing alert noise means the new engineers burn out just as fast, while reducing alert noise without adding engineers still leaves the rotation under-staffed.
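Mechanically, the alert audit described above is a group-by-and-rank over your paging history. A minimal sketch — the alert names are hypothetical, and real input would be an export from your paging tool:

```python
from collections import Counter

# One entry per actionable page over the audit window; alert names are hypothetical.
pages = [
    "HighLatency-checkout", "DiskFull-db-primary", "HighLatency-checkout",
    "CertExpiry-edge", "DiskFull-db-primary", "HighLatency-checkout",
    "PodCrashLoop-search", "HighLatency-checkout", "DiskFull-db-primary",
]

# Rank alert sources by page count; file one reliability ticket per top offender.
top_5 = Counter(pages).most_common(5)
for alert, count in top_5:
    print(f"{count:>3}  {alert}")
```

Page volume usually follows a power law, so fixing the top handful of sources tends to eliminate most of the interrupt load — which is why the audit targets the top 5 rather than every alert.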

What is the difference between mitigation and resolution, and why does that distinction matter during an active incident?

Mitigation stops user-visible impact as fast as possible — it is the on-call engineer's primary responsibility during the incident. Resolution finds and permanently fixes the root cause — it happens in the postmortem and follow-up work, not during the active incident. The distinction matters because conflating the two leads to the most common on-call mistake: spending 45 minutes debugging the exact root cause when a 5-minute rollback would have restored service. The principle is mitigation-first: restore service, then investigate. The user does not care whether you understand why the incident happened — they care that it is fixed.

🧠Mental Model

💡 Analogy

On-call is like an emergency room doctor rotation. ER doctors work scheduled shifts with clear handoffs — a day doctor hands off explicitly to the night doctor, reviewing every active patient, every pending test, every risk. They have backup colleagues (the attending physician, the specialist on call) they escalate to without hesitation. They follow structured protocols (runbooks) for every common emergency, so decisions at 3 AM are fast and correct even under pressure. And critically, no ER makes its doctors on-call 24/7 indefinitely — they rotate, they rest, and they have defined limits on shift length. A hospital that eliminated rotations and made its ER doctors permanently on-call would see degraded performance, burnout, and eventually patient harm within months. SRE on-call is exactly the same system: structured, finite, with clear handoffs, escalation paths, and explicit recovery time. The ER doctor analogy also captures the mitigation-first principle: when a patient arrives in cardiac arrest, the ER doctor does not pause to investigate the lifestyle choices that caused the cardiac condition. They stabilize the patient first. Root cause analysis happens afterward, in a controlled setting.

⚡ Core Idea

On-call is a formal engineering contract, not an informal commitment. It requires structured rotations with minimum size requirements (6+ engineers), tested runbooks for every alert, clear escalation paths, and disciplined handoffs. The goal is not heroic individual response to incidents — it is a reliable, repeatable, human-sustainable system that responds effectively to failures at any hour without burning out the engineers who run it.

🎯 Why It Matters

On-call quality directly determines both service reliability and engineer retention. Teams with toxic on-call lose their best engineers first — experienced engineers have options and they use them when the pager becomes unbearable. The resulting junior-heavy rotation handles incidents more slowly, generates more mistakes, and perpetuates the cycle. Investing in healthy on-call is not a quality-of-life initiative — it is a reliability and retention investment with measurable financial return.
