On-Call Health & Sustainable Rotations

Q: Is on-call compensation a perk or a necessity?

It is a reliability investment, not a perk you grant when budget allows. On-call is how you detect and contain failures before customers feel them. Tired responders miss pages and misdiagnose incidents, so burning out your responders is reliability debt with interest.

Q: What makes a rotation unhealthy?

The difference is rarely the technology; it comes down to four variables: too few people, no alert-load budget, no handoff, and no compensation or recovery time. If your rotation is unhealthy on three of those four, you have an attrition problem that has not surfaced yet.

Q: What is an alert-load budget?

It is a running count of pages that, when exceeded, forces the team to fix or silence noisy alerts instead of just absorbing them. Every page counts against it, so flapping or low-value alerts get addressed rather than quietly eroding people.

Q: How does broken on-call actually cost a team?

It manufactures attrition. In the article's example, three brutal on-call weeks with no alert budget, handoff, or recovery time pushed one of the best engineers to quit, even though no genuinely catastrophic system failure occurred. The rotation was broken, not the system.

TL;DR

Design on-call so people stay healthy and stay employed. The levers that matter: humane rotations, fair comp, an alert-load budget, clean handoffs, and measuring on-call health before burnout costs you good engineers.

The engineer who quit after three weeks

She was one of the best engineers on the team. Then she drew the short straw: three consecutive on-call weeks because two teammates were on leave and nobody adjusted the schedule. Week one, eleven pages, four of them after midnight. Week two, a flapping alert that fired every forty minutes for two days because nobody owned the fix. Week three, a real incident at 3am that ate her entire next day. She showed up to standup grey-faced, shipped nothing for a sprint, and a month later she handed in her notice. Exit interview, one line: "I can't keep being woken up for things that aren't emergencies."

Here is the part that should sting: none of those three weeks involved a genuinely catastrophic system failure. The system was mostly fine. The rotation was broken: too few people, no alert budget, no handoff, no comp, no recovery time. On-call didn't reveal a reliability problem. It manufactured an attrition problem.

Who this is for

Engineers about to join their first on-call rotation, and tech leads or managers who own one. You do not need to run a 200-person SRE org; if your team carries a pager, this applies. We will cover rotation design, compensation, alert-load budgets, handoffs, escalation, and how to measure whether on-call is actually healthy.

A sustainable rotation is an investment, not a perk

A sustainable on-call rotation is a reliability investment, not a perk you grant. The pager is your early-warning system, and you only get honest signal from people who are rested enough to respond.
The on-call principle

Teams often frame on-call comp and reasonable load as "nice-to-haves": perks for when budget allows. That gets the causality backwards. On-call is how you detect and contain failures before customers feel them. If the humans running it are exhausted, the early-warning system degrades exactly when you need it most: tired people miss pages, misdiagnose incidents, and stop filing the follow-ups that would prevent the next one. Burning out your responders is not a cost saving; it is reliability debt with interest.

Analogy	On-call equivalent
A smoke detector that beeps every hour for no reason	Alert fatigue: people stop trusting and stop reacting to pages
A fire crew working 72 hours straight	A 1-person or back-to-back rotation with no recovery time
Paying firefighters for night shifts	On-call compensation: time off in lieu or a stipend
A shift-change briefing at the firehouse	The on-call handoff: open incidents, known risks, what to watch

On-call health maps cleanly onto how we already think about systems.

The picture: how a healthy rotation flows

A healthy rotation is a small pipeline. An alert fires and pages the primary responder. If they do not acknowledge within a few minutes, it escalates to the secondary, then to a lead. Crucially, every page also feeds back into an alert-load budget: a running count that, when exceeded, forces the team to fix or silence noisy alerts instead of just absorbing them.

Alert → primary → secondary → escalation, with every page feeding the alert-load budget that governs what is allowed to page at all.
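
The escalation flow above can be sketched in a few lines of Python. This is a minimal illustration, not a real paging tool: the primary, secondary, lead chain comes from the article, but the 5-minute ack window and the function name `who_is_paged` are assumptions (the article only says "a few minutes").

```python
from typing import Optional

# Chain taken from the article; the 5-minute window is an assumed value.
ESCALATION_CHAIN = ["primary", "secondary", "lead"]
ACK_WINDOW_MINUTES = 5


def who_is_paged(minutes_since_alert: float, acked: bool = False) -> Optional[str]:
    """Return the role currently being paged, or None once someone has acked."""
    if acked:
        return None
    # Each full ack window that passes without an ack moves one step down the chain.
    step = max(0, int(minutes_since_alert // ACK_WINDOW_MINUTES))
    # The chain stops at its last entry, so the lead keeps being paged.
    return ESCALATION_CHAIN[min(step, len(ESCALATION_CHAIN) - 1)]
```

Keeping the chain as plain data means the documented escalation path and the executed one cannot drift apart, which is the point of writing it down so nobody has to think at 3am.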

1. Size the rotation. Aim for 6–8 people so each person is primary roughly once every 6–8 weeks. Below 4, there is no slack for leave, illness, or recovery; that is where burnout starts.
2. Set shift length and cadence. One-week shifts are common; some teams prefer shorter shifts to cap fatigue. Never schedule back-to-back primary weeks for the same person.
3. Define primary + secondary. Primary takes the page; secondary is the named backup who gets it if primary does not ack. Two layers mean one unreachable person is never a single point of failure.
4. Wire escalation. If neither responder acks within the window, escalate to a lead and, for high-severity incidents, an incident commander. Document the chain so nobody has to think at 3am.
5. Set an alert-load budget. Decide the maximum pages per shift you consider acceptable (a common target is fewer than 2 actionable pages per shift, near-zero off-hours). Exceeding it triggers alert cleanup, not more stoicism.
6. Add a handoff ritual. End every shift with a written handoff: open incidents, flapping alerts, scheduled risky changes, anything to watch. The next person should start informed, not blind.
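
Steps 1 to 3 can be made checkable with a small schedule generator. The sketch below is one possible approach under stated assumptions: one-week shifts, a plain round-robin, and a secondary picked half a rotation away from the primary so nobody carries the pager two weeks running. The names `build_schedule` and `has_back_to_back_primary` are hypothetical, and the example roster is invented.

```python
def build_schedule(people: list[str], weeks: int) -> list[tuple[str, str]]:
    """Return one (primary, secondary) pair per week, assigned round-robin."""
    if len(people) < 2:
        raise ValueError("need at least two people for primary + secondary")
    n = len(people)
    schedule = []
    for week in range(weeks):
        primary = people[week % n]
        # Assumed policy: the secondary sits half a rotation away from the primary.
        secondary = people[(week + n // 2) % n]
        schedule.append((primary, secondary))
    return schedule


def has_back_to_back_primary(schedule: list[tuple[str, str]]) -> bool:
    """True if anyone is primary in two consecutive weeks."""
    return any(this[0] == nxt[0] for this, nxt in zip(schedule, schedule[1:]))


# Example with a hypothetical six-person team.
team = ["ana", "ben", "chi", "dee", "eli", "fay"]
schedule = build_schedule(team, weeks=12)
```

Running `has_back_to_back_primary` on the schedule after every manual swap (leave, illness) would catch the situation in the opening story, where cover was never rearranged.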

Healthy vs unhealthy on-call

The difference between a rotation people tolerate for years and one that drives them out is rarely the technology. It is these four variables.

Dimension	Healthy	Unhealthy
Rotation size	6–8 people; primary ~1 week in 6–8	1–3 people; primary every other week
Pages per shift	< 2 actionable/shift; near-zero off-hours	Nightly pages; flapping alerts tolerated for days
Compensation	Time off in lieu or a stipend; recovery day after a rough night	Nothing; "it's part of the job"
Handoff	Written, every shift; open risks listed	None; each person rediscovers the state from scratch

The four dimensions that decide whether on-call is sustainable.

Watch out

If your rotation is unhealthy on three of these four, you do not have an alerting problem; you have an attrition problem that has not surfaced yet. The page volume is the symptom; the missing comp, handoff, and budget are the disease.

Make it concrete: an alert-load budget + handoff

Healthy on-call needs two artifacts written down: a budget that says how much paging is allowed, and a handoff that travels between shifts. Keep both in the repo next to the runbooks so they are reviewed, not folklore. Here is a budget policy you can adapt.

on-call/alert-budget.yaml

yaml

# Alert-load budget: the contract for what is allowed to page a human.
# Reviewed each retro. Breaching it creates work to FIX alerts, not to endure them.

budget:
  actionable_pages_per_shift_max: 2      # a "shift" = one primary week
  off_hours_pages_per_shift_max: 1       # 22:00–08:00 local; aim for zero
  flapping_tolerance_minutes: 30         # an alert firing > 2x in 30m is "flapping"

# What happens when the budget is blown:
on_breach:
  - file_an_action_item: "Tune, fix, or delete the noisiest alert"
  - owner: "team-lead"                   # not the exhausted on-call person
  - due: "before the next rotation starts"

# Every paging alert must justify itself against these rules:
alert_rules:
  must_be_actionable: true               # if the human can't do anything, it's not a page
  must_link_runbook: true                # no runbook link => downgrade to a ticket
  must_map_to_user_impact: true          # page on symptoms (SLO burn), not causes (CPU 80%)
  default_severity_below_threshold: ticket  # warnings go to a queue, not the pager
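
To show how the budget above might be enforced, here is a hedged Python sketch that checks one shift's pages against the three limits. The limits are copied from the policy; the `Page` record, the function names, and the reading of "firing > 2x in 30m" as three firings inside any 30-minute window are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Limits copied from the budget policy above.
ACTIONABLE_PAGES_MAX = 2
OFF_HOURS_PAGES_MAX = 1
FLAP_WINDOW = timedelta(minutes=30)


@dataclass
class Page:  # hypothetical record of one page
    alert: str
    fired_at: datetime  # local time
    actionable: bool


def is_off_hours(ts: datetime) -> bool:
    """True between 22:00 and 08:00 local."""
    return ts.hour >= 22 or ts.hour < 8


def flapping_alerts(pages: list[Page]) -> set[str]:
    """Alerts that fired more than twice inside one 30-minute window."""
    times_by_alert: dict[str, list[datetime]] = {}
    for page in pages:
        times_by_alert.setdefault(page.alert, []).append(page.fired_at)
    flapping = set()
    for alert, times in times_by_alert.items():
        times.sort()
        # Compare each firing with the one two positions later: three firings
        # that fit inside the window count as flapping.
        if any(third - first <= FLAP_WINDOW for first, third in zip(times, times[2:])):
            flapping.add(alert)
    return flapping


def budget_breaches(pages: list[Page]) -> list[str]:
    """Return one line per breached limit; an empty list means within budget."""
    breaches = []
    actionable = sum(1 for p in pages if p.actionable)
    off_hours = sum(1 for p in pages if is_off_hours(p.fired_at))
    if actionable > ACTIONABLE_PAGES_MAX:
        breaches.append(f"actionable pages: {actionable} > {ACTIONABLE_PAGES_MAX}")
    if off_hours > OFF_HOURS_PAGES_MAX:
        breaches.append(f"off-hours pages: {off_hours} > {OFF_HOURS_PAGES_MAX}")
    for alert in sorted(flapping_alerts(pages)):
        breaches.append(f"flapping alert: {alert}")
    return breaches
```

Each returned line maps naturally to one action item under `on_breach`, owned by the team lead rather than the person who just finished the shift.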

Then the handoff, the firehouse briefing: a short, structured note at the end of every shift, dropped in the team channel and pinned for the incoming responder.

on-call/handoff-template.md

markdown

## On-call handoff, week of <date> → <next person>

**Outgoing:** @you   **Incoming:** @next   **Pages this shift:** 3 (1 off-hours)

### Open / unresolved
- [ ] INC-482 checkout latency, mitigated (scaled pods), root cause TBD. Owner: @you
- [ ] Flapping disk-space alert on db-3, silenced 24h, needs real fix (see ALRT-77)

### Watch list (no page yet, but be ready)
- Payments deploy scheduled Tue 14:00, risky, blast radius = checkout
- Cert for api.example.com expires in 9 days, renewal PR open #1204

### Noise this week (feeds the alert budget)
- "CPU > 80%" paged twice, both non-actionable → proposed: downgrade to ticket

### Anything else
- Runbooks were accurate this week. Escalation path tested once and worked.

Measuring on-call health

"How is on-call going?" is not a vibe; it is a small set of numbers you can track per shift and review every retro. If you measure nothing, the only signal you get is someone quitting, and by then it is too late to fix.

Pages per shift: total alerts that paged a human. The headline number. Trending up means your budget is being ignored.
Off-hours pages: pages between roughly 22:00 and 08:00 local. These cost the most (broken sleep, lost next day) and should trend toward zero. One bad night can erase a week of goodwill.
Actionable ratio: the fraction of pages that led to a real action. A low ratio means you are paging on noise; those alerts belong in a ticket queue, not the pager.
Time to acknowledge / time to resolve: slow acks can mean the rotation is too thin or the secondary is not wired up. Slow resolves can mean missing runbooks.
Recovery honored: did people who took a 3am page actually get the next day (or a slice of it) to recover? Track it; an unhealthy team will quietly skip this.
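
Most of the numbers above fall out of a simple page log. The following is a minimal sketch, assuming each page is stored with its fire time, ack time, and an actionable flag; the `PageRecord` shape and the `shift_health` name are hypothetical. Time to resolve and recovery honored are left out because they need data beyond the page log.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class PageRecord:  # hypothetical shape of one logged page
    fired_at: datetime  # local time
    acked_at: datetime
    actionable: bool


def shift_health(pages: list[PageRecord]) -> dict:
    """Summarise one shift: page counts, actionable ratio, median time to ack."""
    total = len(pages)
    # Off-hours follows the article's rough 22:00 to 08:00 local window.
    off_hours = sum(1 for p in pages if p.fired_at.hour >= 22 or p.fired_at.hour < 8)
    actionable = sum(1 for p in pages if p.actionable)
    ack_minutes = [(p.acked_at - p.fired_at).total_seconds() / 60 for p in pages]
    return {
        "pages_per_shift": total,
        "off_hours_pages": off_hours,
        # A quiet shift has no ratio or ack time to report.
        "actionable_ratio": actionable / total if total else None,
        "median_minutes_to_ack": median(ack_minutes) if ack_minutes else None,
    }
```

Reporting per shift rather than per person keeps the retro focused on the rotation as a system.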

Pro tip

Review these numbers in the retro with the whole team, not in a 1:1 with the manager. On-call health is a shared system property. The point is to generate action items against the alert-load budget, not to score individuals on how much suffering they absorbed.

Common mistakes that cost you people

1. Tiny rotations. A 1–3 person rotation has no slack. One person on leave and someone is doing back-to-back weeks. There is no recovery, and the most reliable person silently absorbs the most pain until they leave.
2. No compensation. Treating off-hours response as "part of the job" with no time off in lieu, stipend, or recovery day tells people their nights are free to you. They will eventually agree, by taking their nights elsewhere.
3. No alert budget. Without a cap on pages, noisy alerts accumulate forever. Every flapping alert that fires for two days is a tax on whoever is on-call, and nobody is accountable for paying it down.
4. No handoff. When each shift rediscovers the state of the world from scratch, open incidents get dropped, risky deploys surprise the new responder, and the same flapping alert gets re-investigated three times.
5. Paging on causes, not symptoms. "CPU > 80%" is a cause and usually not actionable; "checkout error rate breached its SLO" is a symptom users feel. Paging on causes floods the pager with noise.
6. No escalation path. If primary is unreachable and there is no secondary or lead defined, a single unanswered page becomes a single point of failure for the entire incident response.
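
Mistake 5 is the easiest to guard against mechanically. As a rough sketch, the three `alert_rules` from the budget policy (actionable, linked runbook, mapped to user impact) can gate what reaches the pager. The `Alert` fields, the `route` function, and the example runbook URL are illustrative assumptions, not part of any real alerting tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:  # hypothetical alert definition
    name: str
    actionable: bool            # can a human do something about it right now?
    runbook_url: Optional[str]  # None means no runbook is linked
    user_impact: bool           # symptom users feel, not an internal cause


def route(alert: Alert) -> str:
    """Return "page" only when every rule passes; otherwise downgrade to "ticket"."""
    if alert.actionable and alert.runbook_url and alert.user_impact:
        return "page"
    return "ticket"


# The two examples from the list above.
cpu = Alert("CPU > 80%", actionable=False, runbook_url=None, user_impact=False)
slo = Alert(
    "checkout error rate breached its SLO",
    actionable=True,
    runbook_url="https://runbooks.example.com/checkout-errors",  # placeholder URL
    user_impact=True,
)
```

With these definitions, `route(cpu)` yields a ticket and `route(slo)` yields a page, which matches the cause-versus-symptom distinction the list draws.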

Takeaways

On-call health in eight lines

A sustainable rotation is a reliability investment, not a perk: tired responders give you worse signal exactly when you need it.
Size for slack: 6–8 people so primary lands roughly once every 6–8 weeks, never back-to-back.
Two layers always: primary + secondary, with a documented escalation path to a lead/IC.
Set an alert-load budget (e.g. < 2 actionable pages/shift, near-zero off-hours) and let breaches create cleanup work.
Page on symptoms users feel, not causes; if it is not actionable, it is a ticket, not a page.
Compensate the cost: time off in lieu or a stipend, plus a real recovery day after a rough night.
Write a handoff every shift: open incidents, watch list, noise to fix.
Measure pages/shift, off-hours pages, and actionable ratio in the retro; the only alternative signal is someone's resignation.

Where to go next

On-call health sits on top of two adjacent skills: keeping the pager quiet in the first place, and running the incident well when it does fire. Read these next.

Alerting Without Burnout: how to design alerts that only page on real, actionable, user-facing symptoms, so your alert-load budget is achievable.
Incident Command: Severity, Roles & Communication covers what happens after the page: severity levels, the incident commander role, and clear communication.
Follow the full SRE career path to see how on-call, SLOs, and incident response fit together into a reliability practice.
