She was one of the best engineers on the team. Then she drew the short straw: three consecutive on-call weeks because two teammates were on leave and nobody adjusted the schedule. Week one, eleven pages, four of them after midnight. Week two, a flapping alert that fired every forty minutes for two days because nobody owned the fix. Week three, a real incident at 3am that ate her entire next day. She showed up to standup grey-faced, shipped nothing for a sprint, and a month later she handed in her notice. Exit interview, one line: "I can't keep being woken up for things that aren't emergencies."
Here is the part that should sting: none of those three weeks involved a genuinely catastrophic system failure. The system was mostly fine. The rotation was broken: too few people, no alert budget, no handoff, no comp, no recovery time. On-call didn't reveal a reliability problem. It manufactured an attrition problem.
Who this is for
Engineers about to join their first on-call rotation, and tech leads or managers who own one. You do not need to run a 200-person SRE org; if your team carries a pager, this applies. We will cover rotation design, compensation, alert-load budgets, handoffs, escalation, and how to measure whether on-call is actually healthy.
A sustainable rotation is an investment, not a perk
A sustainable on-call rotation is a reliability investment, not a perk you grant. The pager is your early-warning system, and you only get honest signal from people who are rested enough to respond.
Teams often frame on-call comp and reasonable load as "nice-to-haves": perks for when budget allows. That gets the causality backwards. On-call is how you detect and contain failures before customers feel them. If the humans running it are exhausted, the early-warning system degrades exactly when you need it most: tired people miss pages, misdiagnose incidents, and stop filing the follow-ups that would prevent the next one. Burning out your responders is not a cost saving; it is reliability debt with interest.
A smoke detector that beeps every hour for no reason → Alert fatigue: people stop trusting and stop reacting to pages
A fire crew working 72 hours straight → A 1-person or back-to-back rotation with no recovery time
Paying firefighters for night shifts → On-call compensation: time off in lieu or a stipend
A shift-change briefing at the firehouse → The on-call handoff: open incidents, known risks, what to watch
On-call health maps cleanly onto how we already think about systems.
The picture: how a healthy rotation flows
A healthy rotation is a small pipeline. An alert fires, it pages the primary responder, and if they do not acknowledge within a few minutes it escalates to the secondary, then to a lead. Crucially, every page also feeds back into an alert-load budget: a running count that, when exceeded, forces the team to fix or silence noisy alerts instead of just absorbing them.
Alert → primary → secondary → escalation, with every page feeding the alert-load budget that governs what is allowed to page at all.
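The escalation flow above can be sketched in a few lines of Python. This is a minimal illustration, not a real paging integration: the `EscalationStep` type, the `paged_roles` helper, and the 5- and 15-minute windows are assumptions chosen for the example (the article only says "a few minutes").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    role: str           # who gets paged at this step
    after_minutes: int  # minutes without an ack before this step is paged


# Hypothetical chain; the windows are illustrative, tune them to your team.
CHAIN = [
    EscalationStep("primary", 0),
    EscalationStep("secondary", 5),
    EscalationStep("lead", 15),
]


def paged_roles(minutes_unacked: int, chain=CHAIN) -> list:
    """Return every role that has been paged after this long without an ack."""
    return [step.role for step in chain if minutes_unacked >= step.after_minutes]
```

Calling `paged_roles(7)` returns the primary and the secondary; the lead is only added once the page has gone unacknowledged for 15 minutes. Keeping the chain as data rather than logic makes it easy to review alongside the runbooks.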
1. Size the rotation
Aim for 6–8 people so each person is primary roughly once every 6–8 weeks. Below 4, there is no slack for leave, illness, or recovery; that is where burnout starts.
2. Set shift length and cadence
One-week shifts are common; some teams prefer shorter shifts to cap fatigue. Never schedule the same person for back-to-back primary weeks.
3. Define primary + secondary
Primary takes the page; secondary is the named backup who gets it if primary does not ack. Two layers mean that one person being unreachable is never a single point of failure.
4. Wire escalation
If neither responder acks within the window, escalate to a lead and, for high-severity incidents, an incident commander. Document the chain so nobody has to think at 3am.
5. Set an alert-load budget
Decide the maximum pages per shift you consider acceptable (a common target is fewer than 2 actionable pages per shift, near-zero off-hours). Exceeding it triggers alert cleanup, not more stoicism.
6. Add a handoff ritual
End every shift with a written handoff: open incidents, flapping alerts, scheduled risky changes, anything to watch. The next person should start informed, not blind.
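Steps 1 to 3 can be checked mechanically. The sketch below builds a simple round-robin schedule with one-week shifts and verifies that nobody is primary two weeks running. The function names and the choice to make next week's primary this week's secondary are assumptions for illustration; the article does not prescribe how the secondary is picked.

```python
def build_schedule(people: list, weeks: int) -> list:
    """Round-robin: week i has people[i % n] as primary, the next person as secondary."""
    if len(people) < 2:
        raise ValueError("need at least two people for primary + secondary")
    n = len(people)
    return [(people[i % n], people[(i + 1) % n]) for i in range(weeks)]


def has_back_to_back_primary(schedule: list) -> bool:
    """True if the same person is primary in two consecutive weeks."""
    return any(a[0] == b[0] for a, b in zip(schedule, schedule[1:]))


# Hypothetical six-person team.
team = ["ana", "ben", "chi", "dee", "eli", "fay"]
schedule = build_schedule(team, weeks=12)
```

With six people, each person is primary once every six weeks. Leave and swaps are not modelled here; a real schedule would re-run the back-to-back check after every manual change, which is exactly the step that was skipped in the opening story.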
Healthy vs unhealthy on-call
The difference between a rotation people tolerate for years and one that drives them out is rarely the technology. It is these four variables.
| Dimension | Healthy | Unhealthy |
| --- | --- | --- |
| Rotation size | 6–8 people; primary ~1 week in 6–8 | 1–3 people; primary every other week |
| Pages per shift | < 2 actionable/shift; near-zero off-hours | Nightly pages; flapping alerts tolerated for days |
| Compensation | Time off in lieu or a stipend; recovery day after a rough night | Nothing: "it's part of the job" |
| Handoff | Written, every shift; open risks listed | None: each person rediscovers the state from scratch |
The four dimensions that decide whether on-call is sustainable.
Watch out
If your rotation is unhealthy on three of these four, you do not have an alerting problem; you have an attrition problem that has not surfaced yet. The page volume is the symptom; the missing comp, handoff, and budget are the disease.
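The "three of four" rule is easy to turn into a quick self-check. The sketch below is one possible encoding, not a standard: `RotationSnapshot` and both functions are hypothetical names, and the cut-offs (fewer than 4 people, 2 or more actionable pages per shift) are assumptions that pick a single line through the ranges given above.

```python
from dataclasses import dataclass


@dataclass
class RotationSnapshot:
    rotation_size: int
    actionable_pages_per_shift: float
    has_compensation: bool
    has_written_handoff: bool


def unhealthy_dimensions(s: RotationSnapshot) -> list:
    """List which of the four dimensions look unhealthy (assumed thresholds)."""
    problems = []
    if s.rotation_size < 4:                 # no slack for leave or recovery
        problems.append("rotation size")
    if s.actionable_pages_per_shift >= 2:   # healthy target is fewer than 2
        problems.append("pages per shift")
    if not s.has_compensation:
        problems.append("compensation")
    if not s.has_written_handoff:
        problems.append("handoff")
    return problems


def attrition_risk(s: RotationSnapshot) -> bool:
    """Unhealthy on three or more of the four dimensions."""
    return len(unhealthy_dimensions(s)) >= 3
```

A three-person rotation with nine actionable pages per shift and no compensation trips the check even if handoffs are written, which matches the warning above.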
Make it concrete: an alert-load budget + handoff
Healthy on-call needs two artifacts written down: a budget that says how much paging is allowed, and a handoff that travels between shifts. Keep both in the repo next to the runbooks so they are reviewed, not folklore. Here is a budget policy you can adapt.
on-call/alert-budget.yaml
```yaml
# Alert-load budget, the contract for what is allowed to page a human.
# Reviewed each retro. Breaching it creates work to FIX alerts, not to endure them.
budget:
  actionable_pages_per_shift_max: 2   # a "shift" = one primary week
  off_hours_pages_per_shift_max: 1    # 22:00–08:00 local; aim for zero
  flapping_tolerance_minutes: 30      # an alert firing > 2x in 30m is "flapping"

# What happens when the budget is blown:
on_breach:
  - file_an_action_item: "Tune, fix, or delete the noisiest alert"
  - owner: "team-lead"                # not the exhausted on-call person
  - due: "before the next rotation starts"

# Every paging alert must justify itself against these rules:
alert_rules:
  must_be_actionable: true            # if the human can't do anything, it's not a page
  must_link_runbook: true             # no runbook link => downgrade to a ticket
  must_map_to_user_impact: true       # page on symptoms (SLO burn), not causes (CPU 80%)
  default_severity_below_threshold: ticket  # warnings go to a queue, not the pager
```
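A budget only helps if something checks it. The sketch below shows one way to evaluate a shift against the numbers in the policy. It is an illustration under assumptions: the values are copied into a dict to avoid depending on a YAML parser, and `is_flapping` and `budget_breaches` are hypothetical helper names. "More than 2x in 30 minutes" is interpreted here as three firings whose first and last fall within the window.

```python
from datetime import datetime, timedelta

# Values mirrored by hand from the budget policy above.
BUDGET = {
    "actionable_pages_per_shift_max": 2,
    "off_hours_pages_per_shift_max": 1,
    "flapping_tolerance_minutes": 30,
}


def is_flapping(fire_times: list, window_minutes: int = 30) -> bool:
    """True if the alert fired more than twice inside any window of that length."""
    times = sorted(fire_times)
    window = timedelta(minutes=window_minutes)
    # Any three consecutive firings that fit inside the window count as flapping.
    return any(times[i + 2] - times[i] <= window for i in range(len(times) - 2))


def budget_breaches(actionable_pages: int, off_hours_pages: int, budget=BUDGET) -> list:
    """Return the names of the budget limits this shift exceeded."""
    breaches = []
    if actionable_pages > budget["actionable_pages_per_shift_max"]:
        breaches.append("actionable_pages_per_shift_max")
    if off_hours_pages > budget["off_hours_pages_per_shift_max"]:
        breaches.append("off_hours_pages_per_shift_max")
    return breaches
```

A non-empty result from `budget_breaches` is the trigger for the `on_breach` steps: an action item owned by the team lead, due before the next rotation starts.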
And the handoff, the firehouse briefing: a short, structured note at the end of every shift, dropped in the team channel and pinned for the incoming responder.
on-call/handoff-template.md
```markdown
## On-call handoff, week of <date> → <next person>

**Outgoing:** @you **Incoming:** @next **Pages this shift:** 3 (1 off-hours)

### Open / unresolved
- [ ] INC-482 checkout latency, mitigated (scaled pods), root cause TBD. Owner: @you
- [ ] Flapping disk-space alert on db-3, silenced 24h, needs real fix (see ALRT-77)

### Watch list (no page yet, but be ready)
- Payments deploy scheduled Tue 14:00, risky, blast radius = checkout
- Cert for api.example.com expires in 9 days, renewal PR open #1204

### Noise this week (feeds the alert budget)
- "CPU > 80%" paged twice, both non-actionable → proposed: downgrade to ticket

### Anything else
- Runbooks were accurate this week. Escalation path tested once and worked.
```
Measuring on-call health
"How is on-call going?" is not a vibe; it is a small set of numbers you can track per shift and review every retro. If you measure nothing, the only signal you get is someone quitting, and by then it is too late to fix.
Pages per shift: total alerts that paged a human. The headline number. Trending up means your budget is being ignored.
Off-hours pages: pages between roughly 22:00 and 08:00 local. These cost the most (broken sleep, lost next day) and should trend toward zero. One bad night can erase a week of goodwill.
Actionable ratio: what fraction of pages led to a real action. A low ratio means you are paging on noise; those alerts belong in a ticket queue, not the pager.
Time to acknowledge / time to resolve: slow acks can mean the rotation is too thin or the secondary is not wired up. Slow resolves can mean missing runbooks.
Recovery honored: did people who took a 3am page actually get the next day (or a slice of it) to recover? Track it; an unhealthy team will quietly skip this.
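Most of these numbers fall out of a simple log of pages. The sketch below computes the first four for one shift. The `Page` record and `shift_metrics` function are hypothetical; a real team would pull these fields from its paging tool's export rather than build them by hand. Off-hours is taken as 22:00 to 08:00 by the hour of the local timestamp.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Page:
    fired_at: datetime   # local time the page fired
    acked_at: datetime   # local time a human acknowledged it
    actionable: bool     # did it lead to a real action?


def shift_metrics(pages: list) -> dict:
    """Summarise one shift: volume, off-hours load, actionable ratio, mean ack time."""
    if not pages:
        return {"pages": 0, "off_hours_pages": 0,
                "actionable_ratio": None, "mean_minutes_to_ack": None}
    off_hours = sum(1 for p in pages if p.fired_at.hour >= 22 or p.fired_at.hour < 8)
    actionable = sum(1 for p in pages if p.actionable)
    ack_minutes = [(p.acked_at - p.fired_at).total_seconds() / 60 for p in pages]
    return {
        "pages": len(pages),
        "off_hours_pages": off_hours,
        "actionable_ratio": actionable / len(pages),
        "mean_minutes_to_ack": mean(ack_minutes),
    }
```

"Recovery honored" is deliberately left out: it is a yes/no answer from the person who was paged, not something to infer from timestamps.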
Pro tip
Review these numbers in the retro with the whole team, not in a 1:1 with the manager. On-call health is a shared system property. The point is to generate action items against the alert-load budget, not to score individuals on how much suffering they absorbed.
Common mistakes that cost you people
Tiny rotations. A 1–3 person rotation has no slack. One person on leave and someone is doing back-to-back weeks. There is no recovery, and the most reliable person silently absorbs the most pain until they leave.
No compensation. Treating off-hours response as "part of the job" with no time off in lieu, stipend, or recovery day tells people their nights are free to you. They will eventually agree, by taking their nights elsewhere.
No alert budget. Without a cap on pages, noisy alerts accumulate forever. Every flapping alert that fires for two days is a tax on whoever is on-call, and nobody is accountable for paying it down.
No handoff. When each shift rediscovers the state of the world from scratch, open incidents get dropped, risky deploys surprise the new responder, and the same flapping alert gets re-investigated three times.
Paging on causes, not symptoms. "CPU > 80%" is a cause and usually not actionable; "checkout error rate breached its SLO" is a symptom users feel. Paging on causes floods the pager with noise.
No escalation path. If primary is unreachable and there is no secondary or lead defined, a single unanswered page becomes a single point of failure for the entire incident response.
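The "causes, not symptoms" mistake in particular can be caught before an alert ever reaches the pager. The sketch below applies the three rules from the budget policy (actionable, linked runbook, mapped to user impact) and sends everything else to a ticket queue. The `Alert` type, the `route` function, and the runbook URL are hypothetical, shown only to make the rule concrete.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    actionable: bool            # can the responder actually do something?
    runbook_url: Optional[str]  # None means no runbook is linked
    user_impact: bool           # is this a symptom users feel?


def route(alert: Alert) -> str:
    """Return "page" only when all three rules hold; otherwise "ticket"."""
    if alert.actionable and alert.runbook_url and alert.user_impact:
        return "page"
    return "ticket"


cpu = Alert("CPU > 80%", actionable=False, runbook_url=None, user_impact=False)
slo = Alert("checkout error rate breached its SLO", actionable=True,
            runbook_url="https://runbooks.example.com/checkout", user_impact=True)
```

Here `route(cpu)` yields a ticket and `route(slo)` yields a page. Defaulting to "ticket" means a new alert has to earn its place on the pager rather than the other way round.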
Takeaways
On-call health in eight lines
A sustainable rotation is a reliability investment, not a perk; tired responders give you worse signal exactly when you need it.
Size for slack: 6–8 people so primary lands roughly once every 6–8 weeks, never back-to-back.
Two layers always: primary + secondary, with a documented escalation path to a lead/IC.
Set an alert-load budget (e.g. < 2 actionable pages/shift, near-zero off-hours) and let breaches create cleanup work.
Page on symptoms users feel, not causes; if it is not actionable, it is a ticket, not a page.
Compensate the cost: time off in lieu or a stipend, and a real recovery day after a rough night.
Write a handoff every shift: open incidents, watch list, noise to fix.
Measure pages/shift, off-hours pages, and actionable ratio in the retro; the only alternative signal is someone's resignation.
Where to go next
On-call health sits on top of two adjacent skills: keeping the pager quiet in the first place, and running the incident well when it does fire. Read these next.
Alerting Without Burnout: how to design alerts that only page on real, actionable, user-facing symptoms, so your alert-load budget is achievable.
Follow the full SRE career path to see how on-call, SLOs, and incident response fit together into a reliability practice.