Advanced · Carta · Project 12 of 14 · ~5h · 5 milestones

Learn from failure: blameless postmortems

Continues from the last build: You ran the r11 drill as Incident Commander with severity, roles, status updates, and a clean handoff, but no written record of what actually happened.

Last week you ran the incident on Carta cleanly: you declared SEV2, took the IC role, gave status updates every 20 minutes, and handed off without dropping the ball.

Blameless postmortem authoring
Timeline reconstruction from logs and metrics
Contributing-factor analysis over root cause
Counterfactual reasoning
Action item ownership and tracking
Loki LogQL querying
Translating incidents into SLO and alert changes

What you'll build

You walk away with a reusable postmortem template, one full worked postmortem of the r11 payments drill, a backlog of action items with owners and due dates, and concrete edits to SLOs, alert rules, and a runbook. More importantly, you can run a blameless review meeting that produces change instead of blame, and you understand why "root cause" is usually a trap and "contributing factors" is the honest frame.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

load/postmortem/template.md (markdown)
# Postmortem: <title>

ID: PM-<date>-<slug>
Status: Draft
Severity: SEV<n>
Authors: <you>

## Summary
One paragraph: what broke, who was affected, how long.

## Timeline (UTC)
| Time | Event | Source |
|------|-------|--------|
| | | |

Detection time: __ min
Mitigation time: __ min

## Analysis
We do not name a single root cause. Contributing factors:
1. [technical|process|detection] <factor>
   counterfactual: if <X> had <Y>, the incident would have <Z>.

## What went well
- 

## Action items
| ID | Action | Owner | Due | Priority | Factor |
|----|--------|-------|-----|----------|--------|
| | | | | | |

## Review record
Date: 
Attendees: 
Decisions: 
Status: Final

Reading this file

  • Status: Draft
    The doc starts Draft and only becomes Final in the review record, so its lifecycle is visible at a glance.
  • | Time | Event | Source |
    The timeline table forces a source on every row, which is the milestone-1 discipline baked into the template.
  • We do not name a single root cause.
    The blameless frame is pre-written so the author cannot forget it under deadline.
  • | ID | Action | Owner | Due | Priority | Factor |
    The action-item columns match the verify checks downstream: id, single owner, date, priority, linked factor.

Copy this per incident. The headings are the checklist: a section left empty is a question you have not answered yet. Keep the 'we do not name a single root cause' line, because it sets the frame.
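To make "a section left empty is a question you have not answered yet" checkable, here is a minimal sketch in Python. It assumes that "empty" means a section body is missing, blank, or unchanged from the template. The function names `split_sections` and `unanswered_sections` are hypothetical helpers, not part of the project's starter files.

```python
def split_sections(markdown_text: str) -> dict[str, str]:
    """Map each '## ' heading to the stripped text found under it."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line.rstrip())
    return {title: "\n".join(body).strip() for title, body in sections.items()}


def unanswered_sections(template: str, draft: str) -> list[str]:
    """Template headings whose draft body is missing, blank, or unchanged."""
    template_sections = split_sections(template)
    draft_sections = split_sections(draft)
    return [
        title
        for title, template_body in template_sections.items()
        if draft_sections.get(title, "") in ("", template_body)
    ]


# Illustrative two-section excerpt of the template; the draft text is made up.
template = (
    "## Summary\n"
    "One paragraph: what broke, who was affected, how long.\n"
    "\n"
    "## What went well\n"
    "- \n"
)
draft = (
    "## Summary\n"
    "Payments calls failed during the drill.\n"
    "\n"
    "## What went well\n"
    "- \n"
)
print(unanswered_sections(template, draft))
```

Comparing against the template, rather than guessing what counts as placeholder text, keeps the check honest for table sections whose header rows are always present.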

That's 1 of 8 explained code blocks in this single project.

The build, milestone by milestone

  1. Reconstruct the timeline from evidence

    4 guided steps

    A postmortem written from memory drifts toward the author's narrative. A timeline anchored to log lines and panel screenshots is evidence, not opinion, and it is the only way to honestly compute detection and mitigation times. Disagreements at the review collapse fast when every claim has a citation.

  2. Write the analysis: contributing factors, not root cause

    4 guided steps

    Real incidents never have a single cause. The payments stub failing was necessary but not sufficient: it became a SEV2 because retries were missing, the alert went unacked for six minutes, and the inventory cache masked early signal. Naming one 'root cause' hides the other three and lets them survive. Counterfactuals turn analysis into a to-do list.

  3. Derive action items with owners and due dates

    4 guided steps

    The single most common reason postmortems change nothing is action items with no owner and no date. 'The team will add retries' is a wish; 'AI-1, owner Priya, due 2026-06-24, add retry-with-backoff to checkout payments call' is a commitment you can audit. One owner per item, never a team, because shared ownership is no ownership.

  4. Wire lessons back into SLOs, alerts, and runbooks

    4 guided steps

    r10 gave you SLOs and alerts, r11 gave you the incident command system, and this rung proves they are living things. If the postmortem does not change an alert, a runbook, or an SLO, the same incident runs the same way next time. The edit is small; the discipline of always making it is the skill.

  5. Run the blameless review meeting

    4 guided steps

    A postmortem reviewed by its author alone is a diary. The review is where other people stress-test the timeline, where owners accept their items out loud, and where the org decides what 'done' means. The meeting being blameless is a property you enforce live: you redirect any sentence that starts blaming a person back to the system.
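Milestone 1 asks for detection and mitigation times computed from a sourced timeline rather than from memory. The sketch below shows one way that arithmetic could look. It makes several assumptions that are not in the course material: each row carries a hypothetical `kind` tag (`impact_start`, `detected`, `mitigated`), both durations are measured from the start of impact, and timestamps use the format shown. The sample rows are invented for illustration and are not the r11 drill data.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed UTC timestamp format


def derive_times(rows: list[dict[str, str]]) -> dict[str, float]:
    """Return detection and mitigation times in minutes from impact start."""
    # Refuse to compute anything if a row has no citation.
    unsourced = [row["event"] for row in rows if not row["source"].strip()]
    if unsourced:
        raise ValueError(f"timeline rows without a source: {unsourced}")

    # Keep the earliest timestamp seen for each tagged kind of event.
    marks: dict[str, datetime] = {}
    for row in rows:
        kind = row.get("kind", "")
        if not kind:
            continue
        when = datetime.strptime(row["time"], TIME_FORMAT)
        if kind not in marks or when < marks[kind]:
            marks[kind] = when

    start = marks["impact_start"]
    return {
        "detection_min": (marks["detected"] - start).total_seconds() / 60,
        "mitigation_min": (marks["mitigated"] - start).total_seconds() / 60,
    }


# Invented example rows; times and sources are placeholders.
rows = [
    {"time": "2026-01-01T10:00:00Z", "event": "first payment error logged",
     "source": "log query", "kind": "impact_start"},
    {"time": "2026-01-01T10:04:00Z", "event": "alert fired",
     "source": "alert history", "kind": "detected"},
    {"time": "2026-01-01T10:11:00Z", "event": "IC declared SEV2",
     "source": "incident channel", "kind": ""},
    {"time": "2026-01-01T10:19:00Z", "event": "error rate back to baseline",
     "source": "dashboard panel", "kind": "mitigated"},
]
times = derive_times(rows)
print(times)  # {'detection_min': 4.0, 'mitigation_min': 19.0}
```

Raising on an unsourced row mirrors the template's Source column: a time you cannot cite should not feed a metric.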

What's inside when you start

3 starter files, ready to clone
5 guided milestones
5 full reference solutions
8 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

timeline.md: a sourced, minute-by-minute reconstruction of the r11 drill with derived detection and mitigation times
analysis.md: at least three contributing factors tagged technical/process/detection, each with a testable counterfactual, plus a what-went-well section
action-items.md: a tracked table of owned, dated, prioritized action items linked to factors
An edited Prometheus alert rule that pages and escalates after 5 minutes unacked, referencing the postmortem id
A runbook 'payments degraded' section runnable by someone who was not in the incident
postmortem.md: the assembled blameless postmortem with a review-record section marked Final
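The action-item rules described above (an id, one named owner, a date, a priority, a linked factor) lend themselves to a small lint. What follows is a sketch under stated assumptions, not the project's actual verify check: `action_item_problems` is a hypothetical helper, the "single owner" test is a rough heuristic, and the `P1` and `F1` values are placeholders. The owner, date, and action come from the example in milestone 3.

```python
from datetime import date

REQUIRED_FIELDS = ("id", "action", "owner", "due", "priority", "factor")


def action_item_problems(item: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the item is auditable."""
    problems = [
        f"missing {field}"
        for field in REQUIRED_FIELDS
        if not item.get(field, "").strip()
    ]

    # Heuristic only: flag owners that look like a group or a list of people.
    owner = item.get("owner", "").strip().lower()
    looks_shared = "team" in owner or any(
        sep in owner for sep in (",", "/", "&", " and ")
    )
    if owner and looks_shared:
        problems.append("owner must be one named person")

    due = item.get("due", "").strip()
    if due:
        try:
            date.fromisoformat(due)
        except ValueError:
            problems.append("due must be an ISO date (YYYY-MM-DD)")
    return problems


commitment = {
    "id": "AI-1",
    "action": "add retry-with-backoff to checkout payments call",
    "owner": "Priya",
    "due": "2026-06-24",
    "priority": "P1",  # placeholder value
    "factor": "F1",    # placeholder value
}
wish = {"id": "AI-2", "action": "add retries", "owner": "The team",
        "due": "", "priority": "", "factor": ""}

print(action_item_problems(commitment))
print(action_item_problems(wish))
```

A lint like this catches the mechanical failures; whether the due date is realistic is still a question for the review meeting.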

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
