Run the incident: command, comms, severity
Continues from the last build: Rung 10 left you a chaos catalog that found real gaps, but the drill itself was the lesson. Three engineers piled into one inventory container and stepped on each other. Two side arguments raged over root cause while checkout stayed broken. Nobody told support, so customers heard nothing. You had the telemetry but no command. This rung installs that human discipline, not the rollback or ChatOps automation that DevOps r12 covers.
Your rung 10 chaos experiments did their job too well. The last run killed catalog and slowed payments at the same time, and what you remember is not the graphs but the noise.
What you'll build
You walk away with a working incident command system: a one-page severity matrix with concrete Carta examples, role cards for IC, ops lead, and comms lead, an IC loop checklist, status update templates, and a handoff protocol. You will have run a live unannounced SEV2 drill end to end, produced a timestamped incident log, and stood the incident down with a clean handoff. You will be able to walk into a real incident and impose order in the first ninety seconds.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
# Carta Incident Command

This directory is the human operating system for incidents. The telemetry (Prometheus, Grafana, Loki) is inherited and already running. These docs are about people, not pipes.

## Map

- severity.md   how bad is it, decided in under a minute
- roles.md      who does what: IC, ops lead, comms lead
- ic-loop.md    the loop the IC repeats, plus status templates
- log.txt       the timestamped record, one per incident (generated)

## Golden rules

1. Declare early, downgrade later. Under-responding is the expensive mistake.
2. One hands-on-keyboard. Only the ops lead changes the service.
3. Communicate on a clock. Status every 20 min even if nothing changed.

## Simulation honesty

Drills here use INJECTED failures from the rung 10 chaos catalog (for example payments-stub FAIL_RATE=0.3). In real prod the same discipline runs against a real provider outage. Here you inject it on purpose.
Reading this file
- "One hands-on-keyboard. Only the ops lead changes the service." The rule that directly fixes the r10 three-people-one-container failure.
- "Status every 20 min even if nothing changed." Sets the comms cadence whose absence kept support in the dark last time.
- "INJECTED failures from the rung 10 chaos catalog" The mandatory honesty note: every failure in this rung is deliberately injected.
The orientation page a new responder reads first. It states the three golden rules and the simulation honesty note up front so nobody mistakes a drill for a real outage.
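Golden rule 3 is easier to keep when the update itself states when the next one is due. The sketch below is one possible way to render a status line on the 20-minute clock from the README. The `StatusUpdate` class, its field names, and the line format are illustrative assumptions, not the templates shipped in ic-loop.md.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Cadence taken from golden rule 3 in the README above.
STATUS_INTERVAL = timedelta(minutes=20)


@dataclass
class StatusUpdate:
    """Hypothetical status record; field names are assumptions."""
    severity: str
    summary: str
    sent_at: datetime

    @property
    def next_due(self) -> datetime:
        # The next update is owed one interval after this one was sent.
        return self.sent_at + STATUS_INTERVAL

    def render(self) -> str:
        # Assumed one-line format: time, severity, summary, next deadline.
        return (
            f"[{self.sent_at:%H:%M}Z] {self.severity} | {self.summary} | "
            f"next update by {self.next_due:%H:%M}Z"
        )


def is_overdue(last: StatusUpdate, now: datetime) -> bool:
    """True once the clock has reached the promised next-update time."""
    return now >= last.next_due


sent = datetime(2024, 1, 1, 14, 5, tzinfo=timezone.utc)
update = StatusUpdate("SEV2", "checkout errors elevated, mitigation in progress", sent)
print(update.render())
# [14:05Z] SEV2 | checkout errors elevated, mitigation in progress | next update by 14:25Z
```

Putting the deadline in the message means the comms lead owes an update even when nothing has changed, which is the point of the rule.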
That's 1 of 7 explained code blocks in this single project.
The build, milestone by milestone
1. Write the severity matrix with Carta examples (4 guided steps)
Severity is the first decision in any incident and it sets everything downstream: who gets paged, how often you communicate, whether you wake people up. A vague matrix means every incident starts with a meeting about how bad it is, which is exactly the meeting you cannot afford during an outage.

2. Define the three incident roles as cards (4 guided steps)
The rung 10 mess happened because everyone did everything: three people in one container, nobody on comms. Named roles with explicit boundaries are how you guarantee exactly one person changes the system and exactly one person talks to stakeholders, which is the difference between a coordinated response and a scrum.

3. Build the IC loop and status update templates (4 guided steps)
Under stress people freeze or thrash. A loop gives the IC a default next move when they do not know what to do, and templates give the comms lead a structure so updates are consistent, timestamped, and fast. This is the machinery that keeps support informed every 20 minutes instead of never.

4. Arm the surprise drill trigger (4 guided steps)
A drill you schedule tests your runbook. A drill that surprises you tests your nervous system: the freeze, the scramble for who is IC, the first thirty seconds. That gap is exactly what bit you in r10, so you must rehearse the surprise, not just the procedure.

5. Run the SEV2 drill end to end and stand down (4 guided steps)
Runbooks you never execute rot. Running the full loop under a surprise trigger is the only way to find that your severity boundary was ambiguous, your templates had a missing field, or your handoff line was unclear. The drill converts paper into reflexes and produces a log you can review like a real postmortem.
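
Two habits the milestones lean on, named roles with one hands-on-keyboard and a timestamped log, can be sketched together. This is a minimal illustration under stated assumptions: the `IncidentLog` class, the role keys, the responder names, and the log line format are all hypothetical and are not the project's actual log.txt layout.

```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Optional

# The three roles named in the role cards; the key spellings are assumptions.
ROLES = ("ic", "ops_lead", "comms_lead")


class IncidentLog:
    """Hypothetical in-memory incident log that enforces role boundaries."""

    def __init__(self, severity: str, roles: dict[str, str]) -> None:
        # Refuse to open an incident until every role has a named owner.
        missing = [r for r in ROLES if r not in roles]
        if missing:
            raise ValueError(f"unassigned roles: {missing}")
        self.severity = severity
        self.roles = roles
        self.lines: list[str] = []

    def note(self, who: str, text: str, at: Optional[datetime] = None) -> str:
        # Every entry gets a UTC timestamp so the log can be reviewed later.
        at = at or datetime.now(timezone.utc)
        line = f"{at:%Y-%m-%dT%H:%M:%SZ} {who}: {text}"
        self.lines.append(line)
        return line

    def change(self, who: str, text: str, at: Optional[datetime] = None) -> str:
        # One hands-on-keyboard: only the ops lead may record a service change.
        if who != self.roles["ops_lead"]:
            raise PermissionError(f"{who} is not the ops lead")
        return self.note(who, f"CHANGE {text}", at)


# Hypothetical responders for a drill.
log = IncidentLog("SEV2", {"ic": "alex", "ops_lead": "sam", "comms_lead": "jo"})
t0 = datetime(2024, 1, 1, 14, 5, 0, tzinfo=timezone.utc)
print(log.note("alex", "declared SEV2, payments-stub failures injected", t0))
print(log.change("sam", "restarted payments-stub", t0))
```

Raising on a change from anyone but the ops lead is a deliberately blunt design choice for a drill: it makes a role violation visible in the moment instead of in the postmortem.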