ExpertBeacon · Project 12 of 12 · ~12h · 6 milestones

Automate incident response and safe rollback

Continues from the last build: Rung 11 gave teams a self-service golden path, and Rung 10 showed you the DORA numbers. You can measure failures with DORA, but recovering from one is still a manual, stressful scramble: someone digs for the last good SHA, copies kubectl commands from a wiki, and prays.

It is 02:14 and the canary you promoted at 22:00 has been quietly failing 18 percent of SMS sends.

Automated rollback to last known-good image
ChatOps incident workflows with GitHub Actions
SLO and canary breach to rollback automation
Runbooks-as-code and on-call handoff design
Blameless postmortem authoring and MTTR measurement
Game-day drill design and chaos injection
Auditable incident logging for delivery systems


What you'll build

You finish with a delivery platform that detects a bad release and recovers from it on its own or with one human keystroke. A bad SHA in production triggers an automated or ChatOps rollback to the last known-good image, the rollback verifies service health before declaring success, every action is logged for the postmortem, and a recurring game-day drill proves your MTTR is measured in single-digit minutes rather than tense guesswork at 02:14.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

deploy/known-good.json

{
  "prod": {
    "api": "a1b2c3d",
    "worker": "a1b2c3d",
    "verified_at": "2026-06-08T22:00:00Z",
    "verified_by": "canary-analysis"
  },
  "staging": {
    "api": "e4f5a6b",
    "worker": "e4f5a6b",
    "verified_at": "2026-06-09T09:30:00Z",
    "verified_by": "smoke-suite"
  }
}

Reading this file

"api": "a1b2c3d"
The last known-good api image SHA. The rollback script reads this, so recovery never depends on a human remembering the tag at 02:14.

"verified_by": "canary-analysis"
Records WHY this SHA is trusted: it survived the Rung 8 canary. This provenance is gold in a postmortem.

"verified_at": "2026-06-08T22:00:00Z"
Timestamp lets you compute how long the bad release was live, a direct input to MTTR.

"staging":
Per-environment keys keep prod and staging rollbacks from ever crossing wires, the fat-finger mistake that hurt in the last incident.

The single source of truth for what to roll back TO. Updated only after a release passes canary analysis or the smoke suite, never by hand during an incident.
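
To make the reading above concrete, here is a minimal Python sketch of how a script might resolve a rollback target from this file. It is an illustration only: the project's own rollback script is scripts/rollback.sh, and the function names `rollback_target` and `minutes_since_verified`, the `SERVICES` tuple, and the 02:14 incident time on 2026-06-09 are assumptions made for this example.

```python
import json
from datetime import datetime, timezone

# Inline copy of deploy/known-good.json so the sketch runs on its own.
KNOWN_GOOD_JSON = """
{
  "prod": {"api": "a1b2c3d", "worker": "a1b2c3d",
           "verified_at": "2026-06-08T22:00:00Z", "verified_by": "canary-analysis"},
  "staging": {"api": "e4f5a6b", "worker": "e4f5a6b",
              "verified_at": "2026-06-09T09:30:00Z", "verified_by": "smoke-suite"}
}
"""

SERVICES = ("api", "worker")  # assumption: the only deployable keys per environment


def rollback_target(manifest: dict, env: str, service: str) -> str:
    """Return the last known-good SHA for one service in one environment."""
    if env not in manifest:
        raise KeyError(f"unknown environment: {env!r}")
    if service not in SERVICES:
        raise KeyError(f"unknown service: {service!r}")
    return manifest[env][service]


def minutes_since_verified(manifest: dict, env: str, now: datetime) -> float:
    """Minutes between the recorded verified_at and a given instant."""
    raw = manifest[env]["verified_at"]
    # Swap the trailing "Z" so fromisoformat also works before Python 3.11.
    verified_at = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return (now - verified_at).total_seconds() / 60


manifest = json.loads(KNOWN_GOOD_JSON)
incident_time = datetime(2026, 6, 9, 2, 14, tzinfo=timezone.utc)  # hypothetical
print(rollback_target(manifest, "prod", "api"))                  # a1b2c3d
print(minutes_since_verified(manifest, "prod", incident_time))   # 254.0
```

Looking up the environment first and rejecting unknown names is what keeps a prod request from silently resolving to a staging SHA.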

That's 1 of 8 explained code blocks in this single project.

The build, milestone by milestone

1. Make rollback one idempotent command (4 guided steps)
Last incident, recovery was a scavenger hunt for the right SHA and namespace. Making rollback a single idempotent command removes the two biggest sources of MTTR: deciding what to roll back to, and typing it correctly under stress.

2. Verify health before declaring recovery (4 guided steps)
A rollback that flips images but never confirms the app actually recovered is theater. Verification turns rollback into a closed loop and produces the timestamp that ends your MTTR clock honestly.

3. Trigger rollback from Slack with ChatOps (4 guided steps)
At 02:14 nobody should context-switch to a terminal, find the right kubeconfig, and remember flags. ChatOps puts recovery where the incident conversation already is, with an audit trail baked in.

4. Auto-trigger rollback on SLO and canary breach (4 guided steps)
SLO and change-failure breaches are exactly the conditions where humans are slowest and most error-prone. Tying them straight to the rollback you already built collapses MTTR toward the time the alert takes to fire.

5. Codify on-call handoff and the blameless postmortem (4 guided steps)
A recovery you cannot learn from repeats. A blameless template plus an auto-generated stub means the postmortem starts written, the timeline is accurate, and the handoff is unambiguous, so the next on-call inherits context instead of a mystery.

6. Prove low MTTR with a scheduled game-day drill (4 guided steps)
Untested recovery is a belief, not a capability. A recurring game-day turns rollback into a rehearsed, measured drill, so the first time you roll back in anger is not actually the first time.
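
The closed loop that milestones 2 and 6 describe (roll back, verify health, stop the MTTR clock only on a confirmed recovery, give up at a bounded deadline) can be sketched in Python as follows. This is a hedged illustration, not the project's reference solution: `rollback_and_verify`, `RecoveryResult`, and the 300-second deadline and 10-second interval defaults are hypothetical, and the rollback and health probe are passed in as plain callables so the sketch runs without a cluster.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RecoveryResult:
    recovered: bool
    attempts: int
    mttr_seconds: Optional[float]  # None when the deadline passed without recovery


def rollback_and_verify(
    apply_rollback: Callable[[], None],
    is_healthy: Callable[[], bool],
    clock: Callable[[], float],
    sleep: Callable[[float], None],
    detected_at: float,
    deadline_s: float = 300.0,   # assumption: bounded verification window
    interval_s: float = 10.0,    # assumption: gap between health probes
) -> RecoveryResult:
    """Roll back once, then poll health until it passes or the deadline expires."""
    apply_rollback()
    attempts = 0
    while clock() - detected_at <= deadline_s:
        attempts += 1
        if is_healthy():
            # The MTTR clock stops only on a confirmed healthy probe.
            return RecoveryResult(True, attempts, clock() - detected_at)
        sleep(interval_s)
    return RecoveryResult(False, attempts, None)


# Usage with a fake clock: health passes on the third probe, 20 seconds in.
fake_now = {"t": 100.0}
probes = iter([False, False, True])
result = rollback_and_verify(
    apply_rollback=lambda: None,
    is_healthy=lambda: next(probes),
    clock=lambda: fake_now["t"],
    sleep=lambda s: fake_now.update(t=fake_now["t"] + s),
    detected_at=100.0,
)
print(result)  # RecoveryResult(recovered=True, attempts=3, mttr_seconds=20.0)
```

Injecting the clock and sleep functions lets a game-day test assert on MTTR deterministically instead of waiting in real time.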

What's inside when you start

2 starter files, ready to clone

6 guided milestones

6 full reference solutions

8 code blocks explained line-by-line

6 "is it working?" checks

4 interview questions it prepares you for

You'll walk away with

scripts/rollback.sh and a Makefile rollback target that flip the app's api and worker to the last known-good SHA from deploy/known-good.json with a per-ENV namespace guard

scripts/verify-health.sh (invoked as ENV=<env> scripts/verify-health.sh) that probes /healthz and runs a real /v1/notifications smoke send, gating recovery on a terminal status

A ChatOps rollback workflow (chatops-rollback.yml) wired to /beacon rollback <env> with a responder allowlist, plus an auto-rollback workflow (auto-rollback.yml) triggered by SLO and canary-abort dispatches with a two-service (api AND worker) anti-flap guard

A runbook-as-code (runbooks/rollback.md), a blameless postmortem template (postmortems/TEMPLATE.md) with on-call handoff, and an auto-generated postmortem stub with computed MTTR

A scheduled game-day workflow (gameday.yml) that injects a bad SHA into both staging services, lets auto-rollback recover it, and asserts MTTR under target inside a bounded deadline
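
As a rough illustration of the `/beacon rollback <env>` command and responder allowlist listed above, the validation step might look like the Python sketch below. The real workflow lives in chatops-rollback.yml. The function name `parse_rollback_command`, the `ALLOWED_ENVS` set, and the example user names are assumptions made for this sketch.

```python
ALLOWED_ENVS = frozenset({"prod", "staging"})  # assumption: mirrors known-good.json keys


def parse_rollback_command(text: str, user: str, allowlist: frozenset) -> str:
    """Validate a '/beacon rollback <env>' message and return the target env."""
    parts = text.strip().split()
    if len(parts) != 3 or parts[:2] != ["/beacon", "rollback"]:
        raise ValueError("usage: /beacon rollback <env>")
    env = parts[2]
    if env not in ALLOWED_ENVS:
        raise ValueError(f"unknown environment: {env!r}")
    if user not in allowlist:
        # Only listed responders may trigger a rollback.
        raise PermissionError(f"{user!r} is not on the responder allowlist")
    return env


responders = frozenset({"alice", "bob"})  # hypothetical on-call responders
print(parse_rollback_command("/beacon rollback staging", "alice", responders))  # staging
```

Rejecting malformed or unauthorized commands before any rollback runs keeps a typo in chat from ever reaching the cluster.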

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
