ExpertBeacon · Project 12 of 12 · ~12h · 6 milestones

Automate incident response and safe rollback

Continues from the last build: Rung 11 gave teams a self-service golden path, and Rung 10 showed you the DORA numbers. You can measure failures with DORA, but recovering from one is still a manual, stressful scramble: someone digs for the last good SHA, copies kubectl commands from a wiki, and prays.

It is 02:14 and the canary you promoted at 22:00 has been quietly failing 18 percent of SMS sends.

Automated rollback to last known-good image
ChatOps incident workflows with GitHub Actions
SLO and canary breach to rollback automation
Runbooks-as-code and on-call handoff design
Blameless postmortem authoring and MTTR measurement
Game-day drill design and chaos injection
Auditable incident logging for delivery systems

What you'll build

You finish with a delivery platform that detects a bad release and recovers from it on its own or with one human keystroke. A bad SHA in production triggers an automated or ChatOps rollback to the last known-good image, the rollback verifies service health before declaring success, every action is logged for the postmortem, and a recurring game-day drill proves your MTTR is measured in single-digit minutes rather than tense guesswork at 02:14.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

deploy/known-good.json
{
  "prod": {
    "api": "a1b2c3d",
    "worker": "a1b2c3d",
    "verified_at": "2026-06-08T22:00:00Z",
    "verified_by": "canary-analysis"
  },
  "staging": {
    "api": "e4f5a6b",
    "worker": "e4f5a6b",
    "verified_at": "2026-06-09T09:30:00Z",
    "verified_by": "smoke-suite"
  }
}

Reading this file

  • "api": "a1b2c3d" - The last known-good api image SHA. The rollback script reads this, so recovery never depends on a human remembering the tag at 02:14.
  • "verified_by": "canary-analysis" - Records WHY this SHA is trusted: it survived the Rung 8 canary. This provenance is gold in a postmortem.
  • "verified_at": "2026-06-08T22:00:00Z" - The timestamp lets you compute how long the bad release was live, a direct input to MTTR.
  • "staging": - Per-environment keys keep prod and staging rollbacks from ever crossing wires, the fat-finger mistake that hurt in the last incident.

The single source of truth for what to roll back TO. Updated only after a release passes canary analysis or the smoke suite, never by hand during an incident.
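
To make the reading above concrete, here is a minimal Python sketch of how a rollback script might resolve its target from this file. The project itself ships scripts/rollback.sh; the names rollback_targets and verified_at (as a function) and the inline copy of the JSON are assumptions made for illustration, not the course's reference solution.

```python
import json
from datetime import datetime, timezone

# Inline copy of deploy/known-good.json so the sketch is self-contained.
# A real script would read the file from disk instead.
KNOWN_GOOD = json.loads("""
{
  "prod": {
    "api": "a1b2c3d",
    "worker": "a1b2c3d",
    "verified_at": "2026-06-08T22:00:00Z",
    "verified_by": "canary-analysis"
  },
  "staging": {
    "api": "e4f5a6b",
    "worker": "e4f5a6b",
    "verified_at": "2026-06-09T09:30:00Z",
    "verified_by": "smoke-suite"
  }
}
""")

SERVICES = ("api", "worker")


def rollback_targets(known_good: dict, env: str) -> dict:
    """Return the SHA to roll each service back to (hypothetical helper).

    Refusing unknown environments is the per-environment guard: a typo
    such as "production" fails loudly instead of touching the wrong place.
    """
    if env not in known_good:
        raise ValueError(f"unknown environment: {env!r}")
    entry = known_good[env]
    missing = [svc for svc in SERVICES if svc not in entry]
    if missing:
        raise ValueError(f"{env} has no known-good SHA for: {missing}")
    return {svc: entry[svc] for svc in SERVICES}


def verified_at(known_good: dict, env: str) -> datetime:
    """Parse the verification timestamp as a timezone-aware UTC datetime."""
    raw = known_good[env]["verified_at"]
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)
```

Calling rollback_targets(KNOWN_GOOD, "prod") yields the api and worker SHAs from the prod key only, so a staging SHA can never leak into a prod rollback.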

That's 1 of 8 explained code blocks in this single project.

The build, milestone by milestone

  1. Make rollback one idempotent command

    4 guided steps

    In the last incident, recovery was a scavenger hunt for the right SHA and namespace. Making rollback a single idempotent command removes the two biggest contributors to MTTR: deciding what to roll back to, and typing it correctly under stress.

  2. Verify health before declaring recovery

    4 guided steps

    A rollback that flips images but never confirms the app actually recovered is theater. Verification turns rollback into a closed loop and produces the timestamp that ends your MTTR clock honestly.

  3. Trigger rollback from Slack with ChatOps

    4 guided steps

    At 02:14 nobody should context-switch to a terminal, find the right kubeconfig, and remember flags. ChatOps puts recovery where the incident conversation already is, with an audit trail baked in.

  4. Auto-trigger rollback on SLO and canary breach

    4 guided steps

    SLO and change-failure breaches are exactly the conditions where humans are slowest and most error-prone. Tying them straight to the rollback you already built collapses MTTR toward the time the alert takes to fire.

  5. Codify on-call handoff and the blameless postmortem

    4 guided steps

    A recovery you cannot learn from repeats. A blameless template plus an auto-generated stub means the postmortem starts written, the timeline is accurate, and the handoff is unambiguous, so the next on-call inherits context instead of a mystery.

  6. Prove low MTTR with a scheduled game-day drill

    4 guided steps

    Untested recovery is a belief, not a capability. A recurring game-day turns rollback into a rehearsed, measured drill, so the first time you roll back in anger is not actually the first time.
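
The closed loop that the milestones describe (roll back, verify health, stop the MTTR clock only on confirmed recovery, give up at a bounded deadline) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's shell implementation: the names recover and Recovery, the default deadline, and the polling interval are all hypothetical, and the rollback and health check are passed in as plain callables so the logic can be exercised without a cluster.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Recovery:
    recovered: bool      # True only if a health check passed
    mttr_seconds: float  # time from detection to the final check
    checks: int          # number of health checks performed


def recover(
    detected_at: float,
    rollback: Callable[[], None],
    is_healthy: Callable[[], bool],
    now: Callable[[], float],
    sleep: Callable[[float], None],
    deadline_seconds: float = 300.0,  # assumed default, not from the course
    interval_seconds: float = 10.0,   # assumed default, not from the course
) -> Recovery:
    """Run the rollback, then poll health until it passes or time runs out."""
    rollback()  # expected to be idempotent, so a retried run is harmless
    checks = 0
    while True:
        checks += 1
        if is_healthy():
            # The MTTR clock stops here, on verified health,
            # not at the moment the images were flipped.
            return Recovery(True, now() - detected_at, checks)
        if now() - detected_at >= deadline_seconds:
            return Recovery(False, now() - detected_at, checks)
        sleep(interval_seconds)
```

In production you would pass time.time and time.sleep for now and sleep; a game-day drill could then assert that the returned mttr_seconds is under its target and that recovered is True.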

What's inside when you start

2 starter files, ready to clone
6 guided milestones
6 full reference solutions
8 code blocks explained line-by-line
6 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

scripts/rollback.sh and a Makefile rollback target that flip the app's api and worker to the last known-good SHA from deploy/known-good.json with a per-ENV namespace guard
scripts/verify-health.sh (invoked as ENV=<env> scripts/verify-health.sh) that probes /healthz and runs a real /v1/notifications smoke send, gating recovery on a terminal status
A ChatOps rollback workflow (chatops-rollback.yml) wired to /beacon rollback <env> with a responder allowlist, plus an auto-rollback workflow (auto-rollback.yml) triggered by SLO and canary-abort dispatches with a two-service (api AND worker) anti-flap guard
A runbook-as-code (runbooks/rollback.md), a blameless postmortem template (postmortems/TEMPLATE.md) with on-call handoff, and an auto-generated postmortem stub with computed MTTR
A scheduled game-day workflow (gameday.yml) that injects a bad SHA into both staging services, lets auto-rollback recover it, and asserts MTTR under target inside a bounded deadline
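
As a rough illustration of two of the guards listed above, the following Python sketch parses a /beacon rollback <env> command against a responder allowlist and decides whether a rollback is needed at all. The allowlist members, the function names, and in particular the reading of the anti-flap guard (act only when api or worker is not already on its known-good SHA) are assumptions; the project's workflows may implement these checks differently.

```python
import re
from typing import Optional

# Hypothetical values for illustration only.
ALLOWED_RESPONDERS = {"alice", "bob"}
KNOWN_ENVS = {"prod", "staging"}

COMMAND = re.compile(r"^/beacon\s+rollback\s+(\S+)$")


def parse_rollback_command(text: str, user: str) -> Optional[str]:
    """Return the target env, or None if the text is not a rollback command.

    Raises PermissionError for users outside the allowlist and
    ValueError for environments that are not known.
    """
    match = COMMAND.match(text.strip())
    if match is None:
        return None
    if user not in ALLOWED_RESPONDERS:
        raise PermissionError(f"{user} is not an allowed responder")
    env = match.group(1)
    if env not in KNOWN_ENVS:
        raise ValueError(f"unknown environment: {env!r}")
    return env


def needs_rollback(running: dict, known_good: dict) -> bool:
    """One plausible anti-flap check (an assumption, see the lead-in).

    A rollback is needed only if api or worker is off its known-good SHA,
    so a repeated alert after recovery does not trigger another rollback.
    """
    return any(running.get(svc) != known_good[svc] for svc in ("api", "worker"))
```

Checking the allowlist before the environment means an unauthorized user learns nothing about which environments exist, and the no-op path in needs_rollback keeps repeated dispatches from flapping the deployment.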
