Advanced · Carta · Project 11 of 14 · ~6h · 5 milestones

Run the incident: command, comms, severity

Continues from the last build: Rung 10 left you a chaos catalog that found real gaps, but the drill itself was the lesson. Three engineers piled into one inventory container and stepped on each other, two side arguments raged over root cause while checkout stayed broken, and nobody told support, so customers heard nothing. You had the telemetry but no command. This rung installs that human discipline (not the rollback or ChatOps automation that DevOps r12 covers).

Your rung 10 chaos experiments did their job too well. The last run killed catalog and slowed payments at the same time, and what you remember is not the graphs, it is the noise.

Incident command system · Severity classification · Incident roles and delegation · Stakeholder communication · Status update cadence · Handoff discipline · Drill facilitation

What you'll build

You walk away with a working incident command system: a one-page severity matrix with concrete Carta examples, role cards for IC, ops lead, and comms lead, an IC loop checklist, status update templates, and a handoff protocol. You will have run a live unannounced SEV2 drill end to end, produced a timestamped incident log, and stood the incident down with a clean handoff. You will be able to walk into a real incident and impose order in the first ninety seconds.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

docs/incident/README.md (markdown)
# Carta Incident Command

This directory is the human operating system for incidents. The telemetry
(Prometheus, Grafana, Loki) is inherited and already running. These docs are
about people, not pipes.

## Map
- severity.md  how bad is it, decided in under a minute
- roles.md     who does what: IC, ops lead, comms lead
- ic-loop.md   the loop the IC repeats, plus status templates
- log.txt      the timestamped record, one per incident (generated)

## Golden rules
1. Declare early, downgrade later. Under-responding is the expensive mistake.
2. One hands-on-keyboard. Only the ops lead changes the service.
3. Communicate on a clock. Status every 20 min even if nothing changed.

## Simulation honesty
Drills here use INJECTED failures from the rung 10 chaos catalog
(for example payments-stub FAIL_RATE=0.3). In real prod the same discipline
runs against a real provider outage. Here you inject it on purpose.

Reading this file

  • "One hands-on-keyboard. Only the ops lead changes the service." The rule that directly fixes the r10 three-people-one-container failure.
  • "Status every 20 min even if nothing changed." Sets the comms cadence whose absence left support in the dark last time.
  • "INJECTED failures from the rung 10 chaos catalog" The mandatory honesty note: every failure in this rung is deliberately injected.

The orientation page a new responder reads first. It states the three golden rules and the simulation honesty note up front so nobody mistakes a drill for a real outage.
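
Golden rule 3 in the README amounts to a simple timer check. The sketch below is one minimal way to express it in Python and is not the project's own tooling. The names `next_status_due` and `is_status_overdue` are hypothetical; only the 20-minute interval comes from the README.

```python
from datetime import datetime, timedelta, timezone

# Interval taken from golden rule 3 in the README ("Status every 20 min").
STATUS_INTERVAL = timedelta(minutes=20)


def next_status_due(last_update: datetime) -> datetime:
    """Return the time by which the next status update must go out."""
    return last_update + STATUS_INTERVAL


def is_status_overdue(last_update: datetime, now: datetime) -> bool:
    """True once the interval has elapsed, even if nothing has changed."""
    return now >= next_status_due(last_update)


if __name__ == "__main__":
    sent = datetime.now(timezone.utc)
    print("next update due by", next_status_due(sent).isoformat())
```

The check deliberately ignores whether anything changed: the rule is about the clock, not about news.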

That's 1 of 7 explained code blocks in this single project.

The build, milestone by milestone

  1. Write the severity matrix with Carta examples

    4 guided steps

    Severity is the first decision in any incident and it sets everything downstream: who gets paged, how often you communicate, whether you wake people up. A vague matrix means every incident starts with a meeting about how bad it is, which is exactly the meeting you cannot afford during an outage.

  2. Define the three incident roles as cards

    4 guided steps

    The rung 10 mess happened because everyone did everything: three people in one container, nobody on comms. Named roles with explicit boundaries are how you guarantee exactly one person changes the system and exactly one person talks to stakeholders, which is the difference between a coordinated response and a scrum.

  3. Build the IC loop and status update templates

    4 guided steps

    Under stress people freeze or thrash. A loop gives the IC a default next move when they do not know what to do, and templates give the comms lead a structure so updates are consistent, timestamped, and fast. This is the machinery that keeps support informed every 20 minutes instead of never.

  4. Arm the surprise drill trigger

    4 guided steps

    A drill you schedule tests your runbook. A drill that surprises you tests your nervous system: the freeze, the scramble for who is IC, the first thirty seconds. That gap is exactly what bit you in r10, so you must rehearse the surprise, not just the procedure.

  5. Run the SEV2 drill end to end and stand down

    4 guided steps

    Runbooks you never execute rot. Running the full loop under a surprise trigger is the only way to find that your severity boundary was ambiguous, your templates had a missing field, or your handoff line was unclear. The drill converts paper into reflexes and produces a log you can review like a real postmortem.
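
A severity matrix with numeric triggers, as the first milestone asks for, can be read as a lookup from a measured signal to a level. The Python sketch below assumes a single signal (checkout error rate) and invents every threshold, plus the SEV1 and SEV3 cadences, purely for illustration. Only the 20-minute cadence appears in the project's README, and pairing it with SEV2 here is also an assumption. The real matrix lives in docs/incident/severity.md and may use different signals entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityRule:
    level: str
    min_error_rate: float   # hypothetical trigger, not from the course material
    update_every_min: int   # hypothetical cadence, except the 20 from the README


# Ordered from most to least severe so the first match wins.
RULES = (
    SeverityRule("SEV1", 0.50, 10),
    SeverityRule("SEV2", 0.10, 20),
    SeverityRule("SEV3", 0.01, 60),
)


def classify(error_rate: float) -> Optional[SeverityRule]:
    """Return the most severe rule whose trigger is met, or None if no rule fires."""
    for rule in RULES:
        if error_rate >= rule.min_error_rate:
            return rule
    return None


if __name__ == "__main__":
    rule = classify(0.3)
    print(rule.level if rule else "no incident")
```

Under these assumed thresholds, an injected error rate of 0.3 (the README's payments-stub FAIL_RATE=0.3 example) lands in SEV2. Using inclusive lower bounds means a reading exactly on a boundary gets the higher severity, in the spirit of "declare early, downgrade later".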

What's inside when you start

2 starter files, ready to clone
5 guided milestones
5 full reference solutions
7 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

docs/incident/severity.md: SEV1 to SEV3 matrix with numeric triggers, comms cadence, and a Carta example per level
docs/incident/roles.md: role cards for IC, ops lead, and comms lead with explicit boundaries and handoff lines
docs/incident/ic-loop.md: the four-step IC loop plus declaration, recurring update, and stand-down status templates
scripts/log_event.sh: appends UTC ISO-8601 timestamped lines to the incident log
scripts/drill_trigger.sh: surprise timer that injects one r10 scenario and stamps the injection time
docs/incident/log.txt: a completed timestamped incident log from one end-to-end SEV2 drill
