Postmortems: Learning from Failure at FAANG Depth

A Staff SRE-depth guide to running blameless postmortems that actually change systems: from the philosophy of Just Culture and Sidney Dekker to the Google postmortem template, 5 Whys methodology, action item tracking, real industry examples including the GitLab 2017 database deletion, and the advanced techniques used at Google, Netflix, and Stripe to build organizations that learn from every failure.

🎯Key Takeaways
The postmortem is not documentation after the incident — it is the highest-leverage engineering activity available after a failure. An incident without a postmortem is pure cost. An incident with a great postmortem is an investment that pays compound returns through prevented future incidents, organizational learning, and systemic improvements that benefit every engineer who works on the system afterward.
Blamelessness is not a kindness policy — it is a data quality requirement. When engineers fear blame, they produce diplomatically edited postmortems that omit inconvenient truths. Inaccurate postmortems produce wrong root cause analyses, which produce ineffective action items, which allow incidents to recur. Blameless postmortems are the only kind that actually change systems for the better.
The 5 Whys must never stop at a human action as the root cause. "Engineer X did Y" is a description of what happened. The root cause is always the system property that made that human action possible, likely, or hard to prevent: the absent confirmation step, the missing automated check, the monitoring gap, the access control that should have blocked the action. Root causes that are system properties produce action items that prevent recurrence. Human-action root causes produce nothing actionable.
Action items are the only reason to write a postmortem, and they are valueless unless implemented. An organization with a postmortem graveyard — thorough documents and unimplemented action items — is paying the full cost of incident analysis with none of the benefit. Action items must be in the sprint backlog, assigned to named individuals, tracked to closure, and reported to engineering leadership. A 90-day action item closure rate below 70% is a management priority failure, not an engineering failure.
The best postmortem programs in the world — at Google, GitLab, GitHub, and Netflix — treat every incident as a gift: a free, live experiment revealing exactly how the system fails. The goal is not to have fewer incidents in the short term (though that follows). The goal is to extract maximum learning from every incident that does occur, building an organization that becomes reliably more reliable over time because it learns systematically from every failure.


Why Postmortems Are the Most Important SRE Practice

After 500+ postmortems at Google, I can tell you with certainty: the postmortem is the single highest-leverage activity in all of SRE. An incident lasts hours. A great postmortem lasts forever — it changes the system, trains the organization, and prevents a class of future failures that no one has yet imagined. An incident without a postmortem is pure cost. An incident with a great postmortem is an investment that pays compound interest.

Most engineering teams treat postmortems as documentation overhead — a bureaucratic artifact that must be filed before the on-call engineer can return to their real work. This is a fundamental misunderstanding. The postmortem is not the paperwork after the incident. The postmortem IS the work. The incident gave you a gift: a free, live experiment revealing exactly how your system fails. The postmortem is how you extract full value from that experiment.

The Economics of Postmortems

A P1 incident at a large company costs $500,000-$5,000,000 in lost revenue, engineer time, and customer trust (Gartner estimates $5,600/minute for enterprise downtime). A great postmortem that prevents one recurrence of that incident is worth every one of those dollars. A team that files postmortems but never implements the action items is paying the incident cost twice: once in downtime, once in the engineering time spent writing a document that changes nothing.
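To make the arithmetic concrete, here is a back-of-the-envelope cost model in Python. The $5,600/minute figure is the Gartner estimate cited above; the incident duration, responder count, and loaded engineer cost are illustrative assumptions, not data from any specific company.

```python
# Rough incident cost using the Gartner estimate of $5,600/minute for
# enterprise downtime. Responder count and engineer cost are assumptions.
DOWNTIME_COST_PER_MINUTE = 5_600  # USD, Gartner estimate

def incident_cost(duration_minutes: int, responders: int,
                  hourly_engineer_cost: int = 150) -> int:
    """Total cost: downtime plus responder time during the incident."""
    downtime = duration_minutes * DOWNTIME_COST_PER_MINUTE
    responder_time = responders * (duration_minutes / 60) * hourly_engineer_cost
    return round(downtime + responder_time)

# A 90-minute P1 with 6 responders lands at ~$505,000 -- the bottom of
# the $500,000-$5,000,000 range above, before counting customer trust.
print(incident_cost(duration_minutes=90, responders=6))  # 505350
```

Against that baseline, the 4-8 hours an author spends on a thorough postmortem is a rounding error if it prevents even one recurrence.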

The Google SRE Book describes postmortems as a "written record of an incident, its impact, the actions taken to mitigate or resolve it, the root causes, and the follow-up actions to prevent the incident from recurring." But that definition understates the real purpose. A postmortem is how a team becomes smarter than it was before the incident. It is the mechanism by which hard-won operational knowledge is preserved, distributed, and converted into systemic improvements.

What a great postmortem actually does for an organization

  • Extracts maximum learning from every incident — An incident without a postmortem means the knowledge of what happened, why it happened, and how to prevent it lives only in the memory of the engineers who were there — and disappears when they rotate, leave, or simply forget the details 3 months later.
  • Converts expensive incidents into systemic improvements — The action items from a great postmortem directly improve the system: better monitoring, stronger safeguards, clearer runbooks, removed single points of failure. The incident pays for itself in prevented future incidents.
  • Builds organizational memory — A searchable database of postmortems is one of the most valuable engineering knowledge bases an organization can maintain. When a new engineer joins, reading the last 2 years of postmortems teaches them more about how the system actually behaves in production than any design document.
  • Creates psychological safety for honest failure disclosure — When engineers see that postmortems lead to system improvements (not punishments), they report near-misses, small issues, and operational concerns that they would otherwise hide. This early warning system prevents major incidents.
  • Demonstrates engineering maturity to the business — A well-written postmortem with clear action items and owners demonstrates to leadership that the engineering organization is in control of its incidents — not merely reacting to chaos, but systematically learning and improving.
  • Provides input to capacity planning and risk decisions — Aggregate postmortem data tells you which services fail most often, which failure modes recur, and where systemic investment is most needed. This is the empirical foundation for reliability investment decisions.

Sidney Dekker, the professor of human factors and safety science who wrote "The Field Guide to Understanding Human Error" and "Just Culture," articulated the core principle: in complex systems, accidents are not caused by bad people making mistakes. They are caused by systems that make mistakes possible, even inevitable. The pilot who lands on the wrong runway did not fail because they are careless. They failed because the system — ATC communication, runway signage, crew resource management, fatigue policies — allowed and perhaps encouraged that error. Asking "why did this pilot make this mistake?" is the wrong question. The right question is "what did our system do to make this mistake possible, and what can we change to make it impossible next time?"

Postmortem as Investment, Not Expense

Frame postmortems to your team as: this is the one activity where the time you spend is guaranteed to improve the system for every engineer who works on it after you. A runbook improvement from a postmortem saves 30 minutes for every future engineer who handles that incident. A monitoring gap fixed from a postmortem gives earlier warning to every future on-call engineer. The postmortem author is investing their time for the benefit of everyone who comes after.

At Google, every incident that meets the trigger criteria is required to have a postmortem. This is not optional. The postmortem culture is so deeply embedded that engineers will spontaneously open a postmortem document for minor incidents that technically do not require one, because they have internalized that learning from failure is core to the work. That cultural internalization — where postmortems are viewed as a professional obligation and an opportunity, not a punishment — is the hallmark of a mature SRE organization.

| Postmortem Trigger | Threshold | Rationale |
| --- | --- | --- |
| P1 incident (any duration) | Always required | Any customer-visible outage warrants full analysis regardless of duration — a 5-minute outage that reveals a systemic vulnerability is as important as a 2-hour outage. |
| P2 incident with >15 min customer impact | Required | Extended degradation at P2 indicates a monitoring or response gap worth analyzing. |
| Data loss or corruption (any scale) | Always required | Data incidents have extraordinary consequence. Even small data losses must be fully analyzed. |
| Security incident | Always required | Security incidents require thorough analysis to close attack surfaces. |
| Novel failure mode (never seen before) | Required | Novel failures reveal gaps in system design that likely have other manifestations not yet discovered. |
| Near-miss (caught before customer impact) | Strongly recommended | Near-misses are the highest-value learning events: you can analyze the failure without the pressure of an active incident. |
| SLO burn rate threshold exceeded | Team discretion | Rapid error budget consumption that does not trigger a formal incident may still warrant analysis. |
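Trigger criteria like these are easy to encode directly into incident tooling, so the decision to open a postmortem never depends on anyone's memory. A minimal sketch, assuming a handful of incident fields of my own choosing; the rules themselves come from the table above.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    severity: str                  # "P1", "P2", ...
    customer_impact_minutes: int
    data_loss: bool = False
    security_incident: bool = False
    novel_failure_mode: bool = False
    near_miss: bool = False

def postmortem_required(i: Incident) -> str:
    """Apply the trigger table: 'required', 'recommended', or 'discretionary'."""
    if i.severity == "P1" or i.data_loss or i.security_incident:
        return "required"       # always required, regardless of duration
    if i.severity == "P2" and i.customer_impact_minutes > 15:
        return "required"       # extended P2 degradation
    if i.novel_failure_mode:
        return "required"       # novel failures reveal design gaps
    if i.near_miss:
        return "recommended"    # highest-value learning events
    return "discretionary"      # e.g. SLO burn without a formal incident

print(postmortem_required(Incident("P2", customer_impact_minutes=22)))  # required
```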

Blameless Postmortems: The Philosophy

Blameless postmortems are not a kindness to engineers who make mistakes. They are a hard-nosed engineering requirement for building reliable systems. This distinction is critical: blamelessness is not about protecting feelings. It is about getting accurate data. When engineers fear punishment for mistakes, they distort, omit, or rationalize information in the postmortem. That distorted information produces an inaccurate root cause analysis, which produces ineffective action items, which means the incident recurs. Blame does not just hurt morale — it actively makes systems less reliable.

The psychological mechanism is well documented in human factors research. John Allspaw, former CTO of Etsy and pioneer of the blameless postmortem movement in tech, documented how engineers who fear blame engage in "information editing" — consciously or unconsciously shaping their account of events to minimize their apparent culpability. They omit the decision at 2:47 AM that they regret. They skip the step they took that worsened the incident. They frame contributing factors to shift attention away from their own actions. The result is a postmortem that is factually incomplete, and an organization that learns the wrong lesson.

Blame Creates a Hidden Tax on Every Postmortem

Every time an engineer is blamed, criticized, or made to feel responsible for a failure in a postmortem, you impose a hidden tax on every future postmortem. Future engineers will be less forthcoming. They will mention the "system issues" but leave out the judgment call they made. They will describe the sequence of events but omit the context that explains why those decisions seemed reasonable at the time. Over time, your postmortems become diplomatic documents rather than engineering ones — and they stop being useful.

The accountability versus blame distinction is the most important conceptual boundary in postmortem culture. Accountability is forward-looking: given what we know happened, who is responsible for ensuring specific improvements are made? Blame is backward-looking: who did the wrong thing and should face consequences? In a blameless culture, there is full accountability for action items — owners are named, deadlines are set, and follow-through is tracked. There is no blame for the decisions that led to the incident, because those decisions were made by competent engineers working with the information available to them at the time.

Blame vs. blameless framing: concrete examples

  • Blame: "The engineer forgot to back up the database before running the migration script." — This framing identifies a person as the cause. It produces action items like "remind engineers to back up databases" — which are not actionable and do not prevent recurrence.
  • Blameless: "The migration tooling did not include a pre-flight backup step, and no automated check prevented execution without a recent backup in the last 30 minutes." — This framing identifies a system gap. It produces action items like "add mandatory backup verification to migration tooling" — which actually prevents recurrence.
  • Blame: "The on-call engineer should have noticed the error rate spike earlier." — This framing creates anxiety in future on-call engineers and leads to over-alerting (more noise). It produces no useful action.
  • Blameless: "The alerting threshold for this service was set at 5% error rate, but the SLO breach began at 2% and persisted for 14 minutes before the alert fired." — This framing identifies a monitoring gap. Action item: review and lower alert thresholds for this service.
  • Blame: "The developer pushed untested code to production." — Shaming the individual does not prevent others from doing the same under similar time pressure.
  • Blameless: "The CI/CD pipeline permitted deployment of a commit that had not passed integration tests, because the integration test suite was marked as non-blocking after a previous false failure incident three months ago." — This reveals the system condition that made the unsafe action possible. Action item: restore integration tests as blocking gates.

Creating psychological safety in postmortem review meetings requires active facilitation, not just good intentions. The postmortem facilitator must explicitly establish norms at the start of every review: we are here to understand the system, not to judge decisions; every person who was involved made the best decision they could with the information available at the time; we use "I" statements about system conditions, not "you should have" statements about individual choices. This framing must be actively enforced — if someone begins to express blame in the meeting, the facilitator must redirect immediately.

The "I Would Have Done the Same" Test

A useful facilitation technique from John Allspaw: when analyzing a decision made during an incident, ask the group "if you had been in that position, with the same context, the same tools, the same time pressure, and the same information available — would you have made the same decision?" In the vast majority of cases, the answer is yes. This exercise builds empathy, surfaces the systemic conditions that made the decision reasonable, and redirects the analysis from "why did they do that?" to "what did our system provide to make that decision seem right?"

Sidney Dekker's "Just Culture" model provides the most rigorous framework for navigating the accountability-blame boundary in postmortems. Dekker distinguishes between honest mistakes (which should never be punished), risky behaviors (which require coaching and system improvements), and reckless behavior (which may warrant disciplinary action). In 10+ years of SRE postmortems, I can count on one hand the number of incidents that fell into the "reckless" category. The overwhelming majority — 99%+ — were honest mistakes in complex systems. Defaulting to the blame mindset treats every incident as if it falls into the reckless category. Defaulting to the blameless mindset treats every incident as a systems learning opportunity, which it almost always is.

Blameless Does Not Mean Consequence-Free

Blamelessness is not a free pass for repeated reckless behavior. The distinction: if an engineer circumvents safety controls deliberately, ignores explicit warnings, and causes an incident, that is a conduct issue that should be addressed through management channels — separately from the postmortem. The postmortem should still focus on the system conditions that permitted the reckless behavior to succeed (what controls could have prevented even deliberate circumvention?). Addressing both the conduct issue and the system issue produces better outcomes than addressing either alone.

The Google Postmortem Format

The Google postmortem format has become the de facto industry standard for a reason: it is structured around learning, not documentation. Every section has a specific purpose. Every section earns its place by either capturing essential facts (Summary, Impact, Timeline), driving root cause analysis (Root Causes, Contributing Factors), or producing actionable improvements (Action Items). There is no filler. If a section does not help you understand what happened or prevent recurrence, it does not belong in the template.

The 9 sections of the Google postmortem template

  • 1. Summary (2-3 sentences) — One paragraph that captures: what failed, how long it lasted, what the customer impact was, and what the root cause was. Written last, after all other sections are complete. This is the TL;DR for readers who need to understand the incident without reading the full document. Should be written to be understood by a non-technical stakeholder.
  • 2. Impact — Quantified customer and business impact: what percentage of users were affected, for how long, in which regions, with what error behavior. Include monetary impact if measurable. Include SLO impact (how much error budget was consumed). This section establishes why the incident warranted a postmortem and calibrates the urgency of the action items.
  • 3. Timeline — A chronological log of every significant event from the first symptom to full resolution. Includes: when the issue began (even before it was detected), when detection occurred, when each team member joined, every mitigation action taken and its result, when customer impact ended, and when full resolution was confirmed. The timeline is the hardest section to write and the most important for root cause analysis.
  • 4. Root Causes — The underlying system conditions that made the incident possible. Typically 1-3 root causes. Root causes are system states, not human actions. "The database replica was configured with synchronous replication, causing writes to fail when the replica became unavailable" is a root cause. "The engineer ran the wrong command" is not — it is a symptom that requires further analysis to reach the root cause.
  • 5. Contributing Factors — The conditions that amplified or extended the incident but were not primary root causes. Examples: monitoring gap that delayed detection, runbook gap that extended investigation time, shared dependency that made the blast radius larger. Contributing factors are often where the most valuable action items come from.
  • 6. What Went Well — An honest assessment of what worked: what detected the issue, what accelerated the response, what limited the blast radius, what tools or practices proved valuable. This section is often neglected but is critical — it tells you what to preserve and strengthen, and it provides psychological balance that enables honest assessment of what went wrong.
  • 7. What Went Poorly — An honest assessment of what failed: what delayed detection, what slowed the response, what made the incident worse, what gaps in tools or process were exposed. Written in system terms, not personal terms. "Our deployment pipeline did not have an automatic rollback capability" is correct. "The deployment engineer did not roll back fast enough" is not.
  • 8. Action Items — Specific, assigned, time-bounded improvements derived from root causes and contributing factors. Every action item has an owner (a named individual), a type (prevent/detect/mitigate/process), a priority, and a due date. Action items are not vague aspirations ("improve monitoring") — they are specific deliverables ("add a p99 latency alert for the payments API that fires when latency exceeds 500ms for 3 consecutive minutes, by [date], owner: [engineer]").
  • 9. Lessons Learned — Optional but valuable: 2-5 distilled insights from the incident that apply beyond this specific failure. Lessons learned are the transferable knowledge from the incident — patterns that other teams should know about, architectural anti-patterns that were discovered, or process gaps that affect the broader organization.
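Teams that automate the template see far fewer skipped sections. A minimal sketch of a skeleton generator that emits the nine sections in order; the function name and Markdown layout are assumptions, not part of any official template.

```python
from datetime import date

SECTIONS = [
    "Summary", "Impact", "Timeline", "Root Causes", "Contributing Factors",
    "What Went Well", "What Went Poorly", "Action Items", "Lessons Learned",
]

def new_postmortem(incident_id: str, author: str) -> str:
    """Emit a Markdown skeleton containing all nine sections, in order."""
    header = (f"# Postmortem: {incident_id}\n"
              f"Author: {author} | Opened: {date.today().isoformat()}"
              f" (within 24 hours of resolution)\n")
    body = "\n".join(f"## {n}. {title}\n\nTODO\n"
                     for n, title in enumerate(SECTIONS, start=1))
    return f"{header}\n{body}"

print(new_postmortem("2024-03-17-checkout-503s", "primary on-call"))
```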
```mermaid
flowchart TD
  A[Incident Occurs] --> B[Incident Resolved / Mitigated]
  B --> C{"Postmortem Required?<br/>Check trigger criteria"}
  C -->|No| D[Brief incident note<br/>in on-call log]
  C -->|Yes| E[Open Postmortem Doc<br/>Within 24 hours]
  E --> F[Assign Author<br/>Typically primary IC or on-call]
  F --> G[Reconstruct Timeline<br/>Logs + monitoring + chat history]
  G --> H[Identify Root Causes<br/>5 Whys from each symptom]
  H --> I[Draft Contributing Factors<br/>Monitoring gaps, process gaps]
  I --> J[Write What Went Well<br/>Honest positive assessment]
  J --> K[Write What Went Poorly<br/>System terms, no blame]
  K --> L["Draft Action Items<br/>SMART: owner + deadline + type"]
  L --> M[Postmortem Review Meeting<br/>Blameless facilitation]
  M --> N{"Action Items Approved?"}
  N -->|Needs revision| L
  N -->|Approved| O[Post to Postmortem Database<br/>Share org-wide]
  O --> P[Track Action Items<br/>In ticket system]
  P --> Q{"All Actions Closed?"}
  Q -->|No| R[Follow-up in sprint planning]
  Q -->|Yes| S[Postmortem Complete<br/>Learning extracted]
```

The complete postmortem lifecycle: from incident resolution through documentation, review, and action item closure. The postmortem is not complete until the action items are closed — not when the document is published.

The timeline is the foundation of the entire postmortem. You cannot identify root causes without an accurate timeline. You cannot write meaningful action items without understanding the sequence of events. And the timeline is the section most likely to be inaccurate if you wait too long to write it — memories fade, logs are rotated, and the mental model of "what happened" starts to calcify into a simplified narrative that omits inconvenient details.

The 24-Hour Rule

Start the postmortem within 24 hours of incident resolution. This is not about rushing — it is about preserving data. Within 24 hours, engineers remember what they were thinking when they made each decision. The incident Slack channel is still fresh. Log data is readily available. The mental model is detailed and accurate. After 72 hours, engineers have moved on mentally, memories have begun to simplify, and the team will reconstruct the timeline from fragmentary sources rather than direct memory. After one week, the postmortem will be significantly less accurate — and thus significantly less useful.

The Postmortem Author Role

The postmortem author is typically the Incident Commander or primary on-call engineer for the incident, but this is not a rigid rule. The author should be the person who was most involved in the incident and has the most complete picture of the timeline and decision-making. At Google, the author is expected to spend 4-8 hours on a thorough postmortem for a P1 incident — this is considered primary work, not overhead. The author is not solely responsible for the action items: each action item must be assigned to the engineer best positioned to implement it, regardless of who wrote the postmortem.

Writing the Timeline: The Hardest Part

The postmortem timeline is the hardest section to write well and the most important to get right. A precise, honest, and complete timeline is the foundation on which the entire root cause analysis is built. An inaccurate timeline produces an inaccurate root cause analysis, which produces ineffective action items, which means the incident recurs. I have seen more postmortems ruined by timeline errors than by any other single factor.

The sources of truth for a postmortem timeline are, in order of reliability: structured logs with millisecond timestamps, monitoring system alert history, deployment and configuration change history, incident management tool event log (PagerDuty, Opsgenie), and chat history (Slack, Teams). Human memory is the least reliable source and should always be cross-referenced against these objective sources. The phrase "I think the error started around 2 PM" is not a timeline entry — it is a hypothesis to be verified against logs.

Sources of truth for timeline reconstruction, in order of reliability

  • 1. Application logs with structured timestamps — The most reliable source. Structured JSON logs with ISO 8601 timestamps to millisecond precision are ground truth. Use log aggregation platforms (Splunk, Datadog, Elasticsearch/Kibana) to pull exact log events at exact times. Always use UTC timestamps in postmortems — never local time.
  • 2. Monitoring system alert history — When did Prometheus/Datadog/New Relic first detect the anomaly? Alert firing times are reliably timestamped. The gap between when the anomaly first appeared in metrics and when the alert fired is a critical data point for contributing factors (detection gap).
  • 3. Deployment and config change history — Every deployment system (Spinnaker, ArgoCD, Kubernetes rollout) logs exact deployment times. Every config management system (Puppet, Chef, Terraform) logs change times. Cross-referencing incident start time against recent deployment history often immediately identifies the precipitating change.
  • 4. Incident management tool timeline — PagerDuty and Opsgenie log exact times for: alert firing, alert acknowledgement, escalations, engineer notes, and incident state changes. This gives you the response timeline as distinct from the incident timeline.
  • 5. Chat history (Slack, Teams) — Slack timestamps every message to the second. The incident channel is a chronological log of the responders' observations, hypotheses, actions, and decisions. This is invaluable for understanding what people knew and when they knew it — and for the "what went well/poorly" analysis.
  • 6. Human memory (cross-referenced) — Ask participants to write their recollections before the postmortem review meeting — not during it. Group recall in meetings produces false consensus around the first confident voice. Individual written recollections, cross-referenced against objective sources, reveal discrepancies worth investigating.

Timestamp precision is a source of significant timeline errors that most teams underestimate. At 10:42 AM, a monitoring alert fires. The engineer acknowledges at 10:45 AM. But looking at the logs, you can see the error rate began rising at 10:38 AM — 4 minutes before the alert fired. That 4-minute detection gap is a contributing factor: the alert threshold was too high or the evaluation window too long. Without precise timestamps, you would not see this gap. Always include timestamps precise to at least the minute in postmortem timelines; use seconds when logs provide that precision.

The Hindsight Bias Trap

The single most common timeline error is hindsight bias: writing the timeline as if people knew, at each point in time, what you now know in hindsight. "At 10:38 AM, the error rate began rising due to the config change deployed at 10:31 AM" implies that the on-call engineer knew at 10:38 AM that the config change was the cause. They did not — they only learned that at 10:54 AM after investigation. A correctly written timeline says "At 10:38 AM, error rate began rising [this was not yet detected]." Hindsight bias in timelines leads to false conclusions about how quickly the incident should have been resolved, and to unfair implicit criticism of responders.

The pre-incident timeline is a section that most postmortems omit but should not. Every incident has a history. When did the system first enter the vulnerable state? For the GitLab 2017 incident, the vulnerable state began months before the incident: the backup system had been silently failing, but no one had verification checks to detect the silent failure. The incident on the day was the trigger, but the root cause — silent backup failures — had been present for months. A timeline that starts at the moment of the incident misses this systemic vulnerability entirely.

Ask "When Did the System Enter the Vulnerable State?"

For every incident, after drafting the incident-day timeline, ask: when did the system first enter the state that made this incident possible? This question often reveals a root cause weeks or months before the incident date. A misconfiguration applied three weeks ago. A dependency that has been degraded for days before it triggered a failure. A monitoring gap that has existed since the service was first deployed. This pre-incident timeline is where the most important root causes and action items often hide.

Timeline writing process: step by step

1. Open your log aggregation platform and set the time range to 2 hours before the first detected anomaly through 2 hours after full resolution. Export or screenshot the key time series graphs.

2. Pull the deployment history for all services involved in the incident for the 48 hours before the incident. Note all deployments with their exact timestamps.

3. Download the full incident Slack channel history. Sort by timestamp. Note every engineer action, hypothesis, and decision.

4. Pull the PagerDuty/Opsgenie incident log. Record exact alert fire time, acknowledgement time, escalation events, and resolution time.

5. Write the timeline as a chronological list of events with timestamps in UTC, using logs as ground truth. For each event, note what was observed and what was done in response.

6. Specifically mark the detection gap: the time between first measurable symptom (in logs/metrics) and first human awareness (alert acknowledgement or engineer observation). This gap is always a contributing factor (see the sketch after this list).

7. Identify the knowledge gap moments: points in the timeline where the responders made a decision based on incorrect or incomplete information. These are the richest sources of action items.

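As flagged in step 6, once the timeline is normalized the detection and response gaps fall out as simple subtractions. A minimal sketch reusing the timestamps from the earlier merge example:

```python
from datetime import datetime, timezone

first_symptom = datetime(2024, 3, 17, 14, 31, tzinfo=timezone.utc)  # from metrics
alert_fired   = datetime(2024, 3, 17, 14, 35, tzinfo=timezone.utc)  # from PagerDuty
acknowledged  = datetime(2024, 3, 17, 14, 37, tzinfo=timezone.utc)  # from PagerDuty

detection_gap = alert_fired - first_symptom   # alerting latency: 4 minutes
response_gap  = acknowledged - alert_fired    # human pickup: 2 minutes
print(f"detection gap: {detection_gap}, response gap: {response_gap}")
# Any non-trivial detection gap belongs in Contributing Factors, with an
# action item against the alert threshold or evaluation window.
```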

Root Cause Analysis: The 5 Whys

The 5 Whys is the most widely used and most widely misused root cause analysis technique in software engineering. Used correctly, it is a powerful tool for drilling past surface symptoms to systemic root causes. Used incorrectly, it produces a root cause of "human error" that generates useless action items and misses the actual opportunity for improvement. The difference lies in a single discipline: never accept a person as the end of the why chain.

The 5 Whys was developed by Sakichi Toyoda and made famous through the Toyota Production System. The principle: for any problem, ask "why" repeatedly until you reach a root cause that is either a fundamental system property you can change, or a management decision you can challenge. The number "5" is approximate — it takes as many whys as needed to reach that root cause, which is often 4-7 in complex software systems.

Worked Example: 5 Whys for the GitLab 2017 Database Incident

Why did 300GB of production data get deleted? Because a database administrator, working to resolve a replication lag issue, ran a deletion command against the primary production database instead of the secondary replica it was intended for. Why? Because both hosts were accessible from near-identical terminal sessions with no clear visual distinction between them. Why? Because the team's operational tooling did not enforce environment separation or require host confirmation before destructive commands. Why? Because the tooling had been built incrementally without a systematic review of which operations required additional safeguards. Why? Because there was no process that required destructive database operations to have an automated backup verification step and a human confirmation step before execution. Root cause: absence of automated safeguards for destructive database operations.

5 Whys: Worked example with a web service incident

1. Problem statement: Users received 503 errors on the checkout API for 22 minutes starting at 14:31 UTC.

2. Why 1 — Why did users receive 503 errors? Because the checkout API pods were in CrashLoopBackOff, refusing all traffic. The connection pool to the payments database was exhausted.

3. Why 2 — Why was the connection pool exhausted? Because each checkout API request was opening a new database connection and not returning it to the pool after completion.

4. Why 3 — Why were connections not returned to the pool? Because a new feature deployed at 14:28 UTC introduced a code path that returned early without calling the connection.release() method.

5. Why 4 — Why was this code path released without connection.release() being called? Because the integration tests for this code path did not include a test for the early-return scenario, and static analysis was not configured to detect unreleased connections.

6. Why 5 — Why were integration tests not configured to catch this pattern? Because the team had no standard for "connection lifecycle testing" in integration test suites, and the code review checklist did not include a prompt to verify resource cleanup in all execution paths. Root cause: absence of code review and test standards for resource lifecycle management.


The critical discipline in 5 Whys is recognizing when you have landed on a person and refusing to stop there. "Because the engineer forgot to call release()" is not a root cause — it is a symptom. Competent engineers make this kind of mistake regularly, at every level of experience and seniority. The system that allows a single forgotten line to exhaust a connection pool and take down a service is the problem, not the engineer. Ask: what would have to be true about the system for a forgotten release() call to be impossible, automatically caught, or immediately mitigated?
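To make that question concrete for the worked example: the systemic fix is to make manual release impossible by construction, for instance by only handing out connections through a context manager. A minimal sketch; StubPool stands in for whatever client library your service actually uses.

```python
from contextlib import contextmanager

class StubPool:
    """Stand-in for a real connection pool, for illustration only."""
    def acquire(self):
        print("connection acquired")
        return object()
    def release(self, conn):
        print("connection released")

@contextmanager
def checked_out(pool):
    """Yield a connection that is returned on every exit path:
    normal return, early return, or exception."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)  # runs unconditionally: impossible to forget

def handle_checkout(pool, items):
    with checked_out(pool) as conn:
        if not items:
            return None     # the early-return path from the worked example
        ...                 # do the real work with conn here

handle_checkout(StubPool(), items=[])  # prints: acquired, then released
```

With this structure, the Why 4 and Why 5 gaps (missing early-return tests, missing static analysis) still deserve action items, but the class of leak itself is gone.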

When 5 Whys Is Insufficient

The 5 Whys has real limitations. First, it produces a single linear causal chain, missing the multiple contributing factors that most complex failures involve. Second, different analysts asking "why" will reach different root causes — the technique has low inter-rater reliability. Third, it can lead prematurely to a root cause that feels satisfying but is actually too shallow. For complex incidents with multiple contributing factors, consider combining 5 Whys with a Contributing Factors analysis (what conditions enabled each step?) or a Fishbone/Ishikawa diagram (which maps multiple causal branches simultaneously).

Common 5 Whys failure modes and how to avoid them

  • Stopping at human error — "Because the engineer made a mistake" is not a root cause — it is a description of the incident, not an explanation. Always ask: what system properties made that mistake possible? What would have detected it? What would have mitigated it even if it occurred?
  • Following a single causal chain in a multi-causal incident — Most P1 incidents require 3-5 independent failures to align (Swiss Cheese Model). If your 5 Whys only follows one chain, you are missing the contributing factors that are often as important as the primary root cause.
  • Reaching an unfixable root cause — "Because distributed systems can fail" is technically true but not actionable. If your why chain reaches a fundamental property of distributed computing, back up one level — the previous why is your actual root cause for this analysis.
  • Producing root causes that blame a team rather than the system — "Because the database team did not notify us about the maintenance window" is blame-adjacent. The system root cause is: "there is no automated mechanism for database maintenance windows to suppress or defer application deployments during the maintenance period."
  • Asking "why did the person do X" instead of "why did the system permit X" — The framing of each why is critical. Always frame the why around the system: "why was X possible?" or "why did the system not prevent X?" rather than "why did the person do X?"

Systemic causes are the gold standard destination for a 5 Whys analysis. A systemic cause is a property of the system — architecture, tooling, process, policy — that made the incident possible, and which can be changed to make the incident impossible or much less likely. Action items that address systemic causes are the most valuable because they prevent an entire class of future incidents, not just the specific one that occurred.

Action Items: Making Postmortems Actionable

Action items are the only reason to write a postmortem. Everything else — the timeline, the root cause analysis, the contributing factors — is in service of producing action items that change the system. A postmortem without implemented action items is not a postmortem. It is a forensics report. Forensics reports are filed in archives and read by no one. Implemented action items prevent future incidents.

The SMART framework for action items is the minimum quality bar. Each action item must be: Specific (not "improve monitoring" but "add a p99 latency alert for the checkout-api service that fires when latency exceeds 300ms for 5 consecutive minutes"), Measurable (the outcome of completing the action item can be observed), Assignable (one named human is accountable — not a team, not "someone," but a specific engineer or manager), Realistic (completable with available resources and skills), and Time-bound (a specific due date, not "Q2" but "by March 15th").
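The SMART bar can be enforced mechanically before a postmortem is accepted into the database. A minimal sketch; the field names and validation heuristics are assumptions distilled from the paragraph above, not a standard schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str  # a specific deliverable, not an aspiration
    owner: str        # a named individual, never a team
    kind: str         # "prevent" | "detect" | "mitigate" | "process" | "document"
    priority: str     # "P1" | "P2" | "P3"
    due: date         # a real date, not "Q2"
    ticket: str       # linked Jira/Linear/GitHub issue ID

    def violations(self) -> list[str]:
        problems = []
        if len(self.description) < 30:
            problems.append("too vague to be Specific and Measurable")
        if "team" in self.owner.lower() or self.owner.lower() in ("", "tbd"):
            problems.append("owner must be a named individual")
        if not self.ticket:
            problems.append("no linked ticket")
        return problems

item = ActionItem(
    description=("Add a p99 latency alert for the payments API that fires "
                 "when latency exceeds 500ms for 3 consecutive minutes"),
    owner="jsmith", kind="detect", priority="P1",
    due=date(2024, 4, 1), ticket="OPS-1421",
)
print(item.violations() or "SMART")  # SMART
```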

| Action Item Type | Purpose | Example | Priority |
| --- | --- | --- | --- |
| Prevent | Remove the condition that made the incident possible | Add automated backup verification to migration tooling that blocks execution if no recent backup exists | P1 for root causes |
| Detect | Enable faster detection of the same failure mode in the future | Add monitoring for database connection pool utilization with alert at 80% capacity | P1 for detection gaps |
| Mitigate | Reduce the impact or duration of future incidents of this type | Create a runbook for connection pool exhaustion with the exact kubectl exec commands to restart affected pods | P2 for response improvements |
| Process | Change team processes to prevent the category of failure | Add "verify all execution paths call resource cleanup methods" to the code review checklist | P2 for systematic gaps |
| Document | Preserve knowledge for future incidents | Update the deployment runbook to include the rollback procedure for this service | P3 for knowledge capture |

Owner assignment is where most postmortem action items die. An action item assigned to "the team" is owned by no one. An action item assigned to a team lead who has twelve other priorities is effectively owned by no one. The postmortem author must negotiate ownership with the specific engineer who will implement the action item during the postmortem review meeting — not assign it unilaterally and hope for the best. The owner must acknowledge the assignment and the deadline explicitly.

The Postmortem Graveyard Anti-Pattern

The postmortem graveyard is one of the most common and most damaging anti-patterns in SRE: a growing collection of postmortems with action items that were never implemented. Teams with postmortem graveyards see the same incidents recur quarter after quarter, because the root causes are identified but never addressed. The graveyard also destroys postmortem culture: engineers who write thorough postmortems and then watch the action items rot in a backlog quickly conclude that postmortems are a waste of time, and start writing lower-quality documents or avoiding postmortems entirely. Postmortem action items must be treated as P1-equivalent engineering work — not as optional backlog items to be deprioritized whenever a new feature request arrives.

Systems for tracking postmortem action items that actually work

  • Link every action item to a ticket — Every action item in the postmortem document links to a corresponding Jira/Linear/GitHub issue. The postmortem doc shows the ticket status. When the ticket is closed, the action item is marked complete. No ticket = the action item does not exist in engineering reality.
  • Pull action items into sprint planning — Postmortem action items must compete directly against feature work in sprint planning. Teams that keep postmortem actions in a separate "reliability backlog" that is never worked almost always have postmortem graveyards. The action items must be in the same prioritized queue as everything else.
  • Monthly action item review — Designate one engineering meeting per month as a postmortem action item review. Walk through open action items: are they on track? Have they become blocked? Do they need to be re-prioritized? This meeting keeps the organizational attention on follow-through.
  • Escalate overdue action items — Any action item that misses its due date by more than 2 weeks should be escalated to the engineering manager. The escalation is not punitive — it is a signal that either the priority needs to change (the work is getting deprioritized unfairly) or the scope needs to change (the action item is too large and should be broken down).
  • Track closure rate as an SRE metric — Measure and report the 90-day action item closure rate: of the postmortem action items created 90 days ago, what percentage are closed today? A rate below 70% indicates a systemic follow-through problem. A rate above 90% indicates a healthy postmortem culture.
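The closure-rate metric in the last bullet is a one-liner once action items live in a ticket tracker. A minimal sketch; the input format is an assumption standing in for a Jira or Linear export.

```python
from datetime import date, timedelta

def closure_rate_90d(items: list[dict], today: date) -> float:
    """Of action items created at least 90 days ago, the fraction now closed."""
    cutoff = today - timedelta(days=90)
    cohort = [i for i in items if i["created"] <= cutoff]
    if not cohort:
        return 1.0
    return sum(i["closed"] for i in cohort) / len(cohort)

items = [  # invented export rows for illustration
    {"created": date(2024, 1, 5),  "closed": True},
    {"created": date(2024, 1, 9),  "closed": False},
    {"created": date(2024, 1, 12), "closed": True},
    {"created": date(2024, 3, 30), "closed": False},  # too recent for the cohort
]
print(f"{closure_rate_90d(items, today=date(2024, 4, 15)):.0%}")
# 67% -- below the 70% bar, which the text above calls a management
# priority failure, not an engineering failure.
```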

Prioritizing Action Items: The Recurrence and Impact Matrix

Not all action items are equal. Prioritize using two dimensions: recurrence probability (how likely is this failure mode to happen again?) and impact if it does recur (what is the blast radius?). High probability + high impact = P1 action item, implement within 2 weeks. High probability + low impact = P2, implement within 4-6 weeks. Low probability + high impact = P2, implement within 4-6 weeks (rare but catastrophic failures are worth preventing even if unlikely). Low probability + low impact = P3, implement in next reliability sprint.
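The matrix reduces to a small lookup, which makes it easy to apply consistently in the review meeting. A minimal sketch of exactly the mapping stated above:

```python
def prioritize(recurrence: str, impact: str) -> tuple[str, str]:
    """Map recurrence probability and impact ('high'/'low') to
    (priority, implementation window), per the matrix above."""
    if recurrence == "high" and impact == "high":
        return ("P1", "within 2 weeks")
    if recurrence == "low" and impact == "low":
        return ("P3", "next reliability sprint")
    # Mixed cases both land at P2: rare-but-catastrophic failures are
    # worth preventing even if unlikely.
    return ("P2", "within 4-6 weeks")

print(prioritize("low", "high"))  # ('P2', 'within 4-6 weeks')
```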

Deadline discipline is the last critical element. A due date without enforcement is a suggestion. The postmortem review meeting should set due dates that are aggressive but realistic — typically 2 weeks for P1 prevention items, 4 weeks for detection improvements, 6-8 weeks for process changes. These dates must then be tracked. An action item that slips its due date without an explicit re-negotiation and explanation is a failure of postmortem follow-through — and it is a management failure, not just an engineering failure.

Real Postmortem Examples

The best way to understand what a great postmortem looks like is to study real ones from companies that have demonstrated the courage to publish them. These public postmortems are simultaneously the most honest engineering documents in the industry and the most valuable learning resources. The companies that publish them — GitLab, GitHub, AWS, Stripe — demonstrate a level of transparency that is rare and admirable, and they are rewarded for it with increased user trust.

GitLab 2017: The Database Deletion Incident

On January 31, 2017, a GitLab database administrator accidentally deleted 300GB of production database data: while working to resolve a replication lag issue during a maintenance operation, they ran a destructive command intended for the secondary database host against the primary. The error was discovered almost immediately, but the team then discovered something far more alarming: the backup systems they had intended to use for recovery were not functioning correctly — most had been silently failing for months. GitLab live-streamed the entire recovery attempt on YouTube, an act of extraordinary transparency. The resulting postmortem (https://about.gitlab.com/blog/2017/02/01/gitlab-dot-com-database-incident/) is an industry benchmark for honesty, specificity, and the quality of its action items.

What made the GitLab postmortem exemplary was not that the incident was well-managed (it was catastrophic) but that the postmortem was written with complete honesty about every failure in both the incident itself and the organizational systems surrounding it. The postmortem identified that: automated backups to S3 had been failing silently for months with no alert; the database replication monitoring was not granular enough to catch the replication lag before it became critical; there was no tested procedure for the type of maintenance operation the DBA was attempting; and the access controls that would have prevented production database deletion from a terminal were absent.

Key structural features of the GitLab postmortem that made it excellent

  • Complete honesty about multiple simultaneous failures — The postmortem did not identify a single root cause and stop. It methodically documented 5 separate system failures that aligned to produce the outcome: the human error, the backup system failure, the monitoring gaps, the access control absence, and the process gaps.
  • Specific, actionable action items with owners and dates — The GitLab postmortem listed 15+ specific action items, each with a named owner and a target date. Examples: "Add monitoring for backup success/failure with alerts to ops team (owner: database team lead, due: 2 weeks)", "Implement pg_basebackup continuously (owner: infrastructure team, due: 2 weeks)".
  • Transparent disclosure of what went wrong with the response — The postmortem honestly described the failed recovery attempts: the LVM snapshot that was too old, the Azure disk snapshot that failed, the filesystem backup that turned out to be incomplete. This transparency about the response failures was as important as the transparency about the incident causes.
  • Public accountability without individual blame — The postmortem named system failures throughout. The DBA who deleted the data is mentioned in the timeline but is not named or blamed. The focus throughout is on the system conditions that made the deletion possible and the recovery difficult.

GitHub 2018: The 24-Hour Network Partition Incident

On October 21-22, 2018, GitHub experienced a 24-hour period of degraded availability caused by a 43-second loss of connectivity between its US East Coast network hub and its primary US East Coast datacenter. The partition triggered a failover that promoted a secondary MySQL cluster to primary before replication had completed, causing data divergence between the new primary and the old. The recovery required 24 hours because undoing the data divergence while maintaining data integrity required extreme care. The GitHub postmortem (https://github.blog/2018-10-30-oct21-post-incident-analysis/) is remarkable for its depth of technical explanation and its honest analysis of how a 43-second partition became a 24-hour recovery.

The GitHub 2018 postmortem demonstrates a lesson that is hard to learn any other way: automatic failover in distributed systems can make certain categories of incidents significantly worse. The automatic failover in the GitHub incident promoted a MySQL replica that was 9 minutes behind, leading to a database state inconsistency that required manual reconciliation of over 100GB of data. The postmortem was honest about this: automatic failover was designed to reduce downtime for the common case of primary failure, but it had the opposite effect in this specific scenario.

AWS S3 US-East-1 Outage (February 28, 2017)

The AWS S3 US-East-1 outage in February 2017 brought down a significant fraction of the internet for approximately 4 hours. The root cause: a misconfigured command in a runbook step, executed by an authorized AWS engineer performing routine capacity adjustments to the S3 billing system, removed a larger subset of servers from the index subsystem than intended. The S3 index subsystem needed a full restart — a procedure that had not been exercised at this scale in years and took longer than anticipated. The AWS postmortem (https://aws.amazon.com/message/41926/) is notable for its explicit acknowledgment that the runbook did not have sufficient safeguards against human error and for its commitment to adding automated input validation to prevent the same class of error.

| Postmortem | What Made It Excellent | Key Lesson | Action Item Quality |
| --- | --- | --- | --- |
| GitLab 2017 | Complete transparency about multiple simultaneous failures, including backup system silent failures | Silent failures are the most dangerous: systems can fail for months without anyone knowing if there are no verification checks | Exemplary: 15+ specific items, named owners, dates |
| GitHub 2018 | Honest analysis of how automatic failover made a 43-second partition into a 24-hour recovery | Automatic failover optimizes for the common case but can worsen rare, high-severity incidents — design accordingly | Strong: technical specificity, honest about trade-offs |
| AWS S3 2017 | Explicit acknowledgment of runbook safeguard gap and commitment to automated input validation | Runbooks for destructive operations need automated safeguards that prevent human error in execution, not just documentation | Clear: specific changes to tooling and runbook design |

What These Postmortems Have in Common

All three postmortems share five structural properties of an excellent postmortem: (1) Honest, complete timelines that do not omit inconvenient details. (2) Root causes framed as system properties, not individual failures. (3) Specific, assigned, time-bound action items — not vague commitments. (4) Transparent disclosure of what was tried and failed during the response, not just what ultimately worked. (5) Action items that addressed entire classes of future failures, not just the specific failure that occurred. These properties are not accidental — they reflect disciplined postmortem culture at organizations that take learning from failure seriously.

Building a Postmortem Culture

A postmortem process is a set of procedures. A postmortem culture is an organizational norm. The difference is the difference between engineers who write postmortems because they are required to and engineers who write postmortems because they are committed to learning from failure. Building the culture is harder than implementing the process, but it is the only thing that produces the outcome: a learning organization that becomes more reliable over time rather than simply better at responding to incidents.

Postmortem review meetings are the cultural moment where the organization demonstrates what it believes about failure. A review meeting where managers or peers question why the engineer did not notice the issue sooner, or why a different decision was not made, or who is responsible for the outage — this is a blame meeting, regardless of whether anyone says the words "your fault." The psychological safety required for honest postmortems is destroyed in these meetings and rebuilt slowly through months of consistently blameless facilitation.

Postmortem Review Meeting Facilitation Guide

The facilitator's opening statement should set the norm explicitly: "This meeting has one goal: to understand what happened and what we can change to prevent it. We start from the assumption that everyone involved made the best decision they could with the information and tools available. We are not here to evaluate whether different decisions should have been made — we are here to understand why the system produced the outcome it did, and how to change the system." This statement should be made verbatim, not paraphrased, at the start of every postmortem review.

Sharing postmortems organization-wide is a force multiplier for learning. A postmortem that is shared only within the team that experienced the incident produces local learning. A postmortem that is shared across the entire engineering organization produces organizational learning. The GitLab and GitHub postmortems produced global learning — the lessons from GitLab's backup failures have improved backup practices at hundreds of companies. At Google, postmortems are automatically shared with a broad engineering audience (with appropriate confidentiality for security incidents), and a weekly "postmortem digest" is distributed highlighting interesting recent incidents and their lessons.

Mechanisms for sharing postmortems organization-wide

  • Searchable postmortem database — All postmortems are stored in a searchable, indexed wiki or knowledge base. Engineers facing an unfamiliar incident type can search the postmortem database for past incidents involving the same service, failure mode, or technology. At Google, this database is one of the most-consulted engineering knowledge resources.
  • Weekly postmortem digest — A brief weekly email or Slack message highlighting 2-3 recent postmortems across the organization, with a one-sentence summary of the key lesson from each. This creates broad organizational awareness of failure patterns without requiring everyone to read every full postmortem (a minimal digest sketch follows this list).
  • Postmortem lightning talks — Monthly engineering all-hands includes a 5-minute "postmortem of the month" segment where the author of a particularly interesting or instructive postmortem walks through the incident, the root cause, and the key action items. This builds the norm that sharing failures is valued, not penalized.
  • Cross-team postmortem reviews — For incidents that reveal systemic issues affecting multiple teams, invite engineers from affected adjacent teams to the postmortem review. This surfaces contributing factors that are invisible to any single team and produces action items that address organization-wide gaps.
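
To make the weekly digest concrete, here is a minimal sketch in Python, assuming postmortem records are available as plain dicts exported from the postmortem database; the record fields (`id`, `date`, `service`, `lesson`) are illustrative assumptions, not a real schema:

```python
from datetime import date, timedelta

# Hypothetical records as they might be exported from a postmortem database.
# The field names here are illustrative assumptions, not a real schema.
postmortems = [
    {"id": "PM-101", "date": date(2024, 5, 2), "service": "checkout",
     "lesson": "Retry storms amplified a 30s database blip into a 20-minute outage."},
    {"id": "PM-102", "date": date(2024, 5, 3), "service": "auth",
     "lesson": "A silent cron failure went undetected for six weeks; no success heartbeat existed."},
]

def weekly_digest(records, today=None, limit=3):
    """Select the most recent postmortems from the past week and format
    a short digest: one line per incident, lesson first."""
    today = today or date.today()
    week_ago = today - timedelta(days=7)
    recent = sorted(
        (p for p in records if p["date"] >= week_ago),
        key=lambda p: p["date"], reverse=True,
    )[:limit]
    lines = [f"Postmortem digest, week of {week_ago:%Y-%m-%d}:"]
    for p in recent:
        lines.append(f"- [{p['id']}] {p['service']}: {p['lesson']}")
    return "\n".join(lines)

print(weekly_digest(postmortems, today=date(2024, 5, 6)))
```

The point of the sketch is the shape of the output, not the plumbing: one line per incident, lesson first, so the digest is skimmable in under a minute.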

Measuring postmortem quality prevents the quality from degrading over time as the process becomes routine. Without explicit quality criteria, postmortems gradually become shorter, less specific, and less honest — not from malice but from time pressure and normalization. At Google, postmortem quality is assessed on five dimensions: completeness of timeline (are all key events documented with precise timestamps?), depth of root cause analysis (does it reach systemic root causes, not surface symptoms?), quality of action items (SMART criteria), blamelessness (are all statements about system conditions, not individual performance?), and follow-through rate (are action items from previous postmortems being closed?).

| Quality Dimension | Strong Signal | Weak Signal | How to Improve |
| --- | --- | --- | --- |
| Timeline completeness | Precise timestamps throughout, detection gap clearly marked, pre-incident vulnerable state identified | Vague time references ("around 2 PM"), timeline starts at alert rather than first symptom | Require log-sourced timestamps; train authors to look for pre-incident conditions |
| Root cause depth | Reaches systemic property (tool design, process gap, architecture choice) | Stops at human action ("engineer X did Y") | Train facilitators to ask "why was that possible?" when human action appears as a why |
| Action item quality | SMART criteria, named owners, specific dates, linked tickets | Vague commitments ("improve testing"), unassigned, no due date | Provide action item templates with required fields; review each in the meeting before accepting |
| Blamelessness | All language about system conditions, no individual named as cause | Implicit blame through language like "should have", "failed to", "neglected" | Facilitator reviews draft before meeting; flags and rewrites blame-adjacent language (see the linter sketch below the table) |
| Follow-through rate | > 80% of action items closed within 90 days | < 50% of action items closed; same issues recurring in multiple postmortems | Monthly action item reviews; escalate overdue items to engineering management |
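
Of the five dimensions, blamelessness is the easiest to partially automate: a facilitator can run a first-pass scan for blame-adjacent phrasing before reviewing the draft by hand. A minimal sketch, assuming drafts are plain text; the phrase list is an illustrative starting point, not an exhaustive one:

```python
import re

# Illustrative blame-adjacent phrases; a real list would be tuned to the
# organization's own vocabulary and maintained by the facilitators.
BLAME_PATTERNS = [
    r"\bshould have\b",
    r"\bfailed to\b",
    r"\bneglected\b",
    r"\bdidn't bother\b",
]

def flag_blame_language(draft: str):
    """Return (line_number, line, matched_pattern) tuples for human review.
    This flags rewrite candidates; the facilitator decides the rewrite."""
    findings = []
    for lineno, line in enumerate(draft.splitlines(), start=1):
        for pattern in BLAME_PATTERNS:
            if re.search(pattern, line, flags=re.IGNORECASE):
                findings.append((lineno, line.strip(), pattern))
    return findings

draft = """At 14:02 the engineer failed to notice the alert.
The deploy tooling allowed the rollout to proceed with no canary check."""
for lineno, line, pattern in flag_blame_language(draft):
    print(f"line {lineno}: matched {pattern!r}: {line}")
```

Note that the second sentence of the sample draft passes the scan because it describes a system condition; that is the rewrite target for the first.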

The Normalization of Deviance in Postmortem Culture

Diane Vaughan's concept of normalization of deviance — originally developed to explain the Challenger disaster — applies directly to postmortem culture. When postmortems consistently fail to produce action items, or when action items consistently fail to be implemented, this becomes the new normal. Each cycle of "postmortem filed, action items ignored" makes the next cycle more likely. Teams normalize the dysfunction until a major incident breaks the pattern — often painfully. The antidote is explicit measurement and leadership attention: track action item closure rates and present them to engineering leadership quarterly.
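
Tracking the closure rate requires nothing more than an export from the ticket tracker. A minimal sketch, assuming each action item carries an opened date and an optional closed date; the `ActionItem` shape and ticket names are illustrative:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    ticket: str
    owner: str
    opened: date
    closed: Optional[date] = None  # None means still open

def closure_rate_90d(items, as_of):
    """Fraction of action items at least 90 days old that were closed
    within 90 days of being opened."""
    due = [i for i in items if (as_of - i.opened).days >= 90]
    if not due:
        return None  # nothing old enough to measure yet
    on_time = [i for i in due
               if i.closed is not None and (i.closed - i.opened).days <= 90]
    return len(on_time) / len(due)

items = [
    ActionItem("REL-41", "alice", date(2024, 1, 10), date(2024, 2, 1)),
    ActionItem("REL-42", "bob", date(2024, 1, 12)),                      # still open
    ActionItem("REL-43", "carol", date(2024, 2, 3), date(2024, 6, 20)),  # closed late
]

rate = closure_rate_90d(items, as_of=date(2024, 7, 1))
print(f"90-day closure rate: {rate:.0%}")  # 33% here: a number to put in front of leadership
```

The quarterly leadership report is then one number plus the list of overdue items and their owners, which is exactly the visibility that breaks the normalization cycle.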

Advanced Postmortem Techniques

After mastering the standard postmortem process, the next level is developing analytical sophistication: moving beyond 5 Whys to techniques that reveal the systemic conditions that make entire classes of failures possible, not just the proximate cause of the specific incident you are analyzing. These techniques are used by Google's most senior SREs, Netflix's resilience engineering team, and the aviation safety investigators whose practices inspired the blameless postmortem movement.

Learning reviews extend the standard postmortem to examine not just what the system did, but how the organization responded — and what the organization's response reveals about its resilience capabilities. A learning review asks: what adaptive capacity did our engineers demonstrate during this incident? What information management practices helped (or hindered) their decision-making? Where did the response go better than our formal procedures would predict — and why? Where did coordination break down despite everyone following the process? These questions surface organizational capabilities and gaps that the standard postmortem structure misses entirely.

John Allspaw's "Debriefing Facilitation Guide" Technique

John Allspaw and his colleagues at Etsy, and later the SNAFUcatchers research consortium, developed a learning review methodology based on interview-driven debriefs. Rather than asking participants to fill out a postmortem template, the facilitator interviews each key participant individually before the group review, asking: "Walk me through your experience of the incident from your position. What were you thinking when X happened? What information did you have at that point? What were you uncertain about?" These individual accounts are assembled into a rich multi-perspective timeline that reveals coordination gaps, information asymmetries, and decision-making context that never appears in a template-driven postmortem.

Normalization of deviance analysis goes beyond identifying what went wrong in this specific incident to ask: what practices in our system have been gradually drifting from their designed safe operating state? Diane Vaughan's analysis of the Challenger disaster revealed that NASA engineers had slowly normalized the observation of O-ring erosion — each incident of erosion without disaster made them slightly more comfortable with the risk, until the deviation from the original design specification was so large that catastrophic failure was virtually certain. The same pattern occurs in software systems: organizations gradually normalize degraded alert thresholds ("it always fires at this level, we have learned to ignore it"), deferred maintenance ("we will fix that technical debt when we have time"), and accepted risks ("we know the backup system is unreliable but we have not had to use it yet").

Advanced root cause analysis techniques beyond 5 Whys

  • Fishbone (Ishikawa) Diagram — Visual tool that maps multiple causal categories (People, Process, Technology, Environment, Management, Materials) simultaneously. Useful for complex incidents where multiple independent causal chains contributed. The fish spine is the problem statement; each bone is a causal category; sub-bones are specific causes within that category. Better than 5 Whys for multi-causal incidents.
  • Fault Tree Analysis (FTA) — Top-down deductive analysis using Boolean logic (AND/OR gates) to trace all combinations of lower-level events that could produce the top-level incident. Developed in aerospace for nuclear weapon and aircraft systems. Produces a tree diagram showing which specific combinations of failures led to the incident. Excellent for understanding blast radius and for identifying single points of failure that need redundancy (a minimal evaluation sketch follows this list).
  • Systems-Theoretic Accident Model and Processes (STAMP) and its analysis method STPA — Developed by MIT professor Nancy Leveson. Views accidents as failures of control — the system's feedback loops, constraints, and controllers failed to maintain safe operating states. Better than linear cause-effect models for analyzing modern software systems where failure emerges from complex interactions rather than linear causal chains. Used in aviation, nuclear power, and increasingly by advanced software reliability teams.
  • Retrospective Analysis across multiple postmortems — Rather than analyzing each postmortem in isolation, retrospective analysis looks for patterns across many postmortems: which services recur most often? Which failure modes appear repeatedly? Which action item types are consistently not implemented? This meta-analysis reveals systemic organizational and architectural vulnerabilities that no single postmortem can surface.
  • Pre-mortem (prospective analysis) — Developed by Gary Klein. Before launching a new system or making a high-risk change, the team imagines that the launch has failed disastrously and works backward: "It is 6 months after launch, and we have had a catastrophic failure. What most likely caused it?" This prospective technique identifies vulnerabilities before they cause incidents, and is the only technique that can prevent rather than just analyze failures.
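
To make fault tree analysis concrete, here is a minimal sketch of a fault tree as nested AND/OR gates over basic events, for a hypothetical "user-visible outage" top event; the event names and tree structure are invented for illustration:

```python
# A fault tree as nested gates: ("AND", ...) requires all children,
# ("OR", ...) requires any child; plain strings are basic events.
# Top event: user-visible outage (hypothetical, for illustration).
TREE = ("AND",
        ("OR", "primary_db_down", "primary_db_partitioned"),  # primary unavailable
        ("OR", "replica_stale", "failover_script_broken"))    # and failover cannot take over

def occurs(node, events):
    """Evaluate whether the top event occurs given a set of basic events."""
    if isinstance(node, str):
        return node in events
    gate, *children = node
    combine = all if gate == "AND" else any
    return combine(occurs(child, events) for child in children)

print(occurs(TREE, {"primary_db_down"}))                   # False: failover covers it
print(occurs(TREE, {"primary_db_down", "replica_stale"}))  # True: this pair causes the outage

# Single points of failure: basic events that alone trigger the top event.
basic_events = {"primary_db_down", "primary_db_partitioned",
                "replica_stale", "failover_script_broken"}
spofs = [e for e in basic_events if occurs(TREE, {e})]
print("Single points of failure:", spofs)  # [] -> the AND gate means no single event suffices
```

Real FTA assigns probabilities to basic events and computes minimal cut sets, but even this Boolean skeleton shows the payoff: it enumerates which combinations of failures matter, which 5 Whys cannot do.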

The Aggregate Postmortem Review: Quarterly Pattern Analysis

Every quarter, the SRE team should conduct an aggregate review of all postmortems from the past quarter. Create two lists: (1) Which failure modes appeared more than once? Any failure mode that appeared twice in a quarter is a systemic issue that warrants a dedicated investigation. (2) Which action item categories are consistently not being closed? If "improve monitoring" action items are consistently being written and consistently not implemented, that is a resourcing and prioritization problem that must be escalated to engineering leadership. These quarterly reviews are how postmortem analysis evolves from reactive (what happened?) to proactive (what is structurally likely to happen next?).
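
A minimal sketch of producing those two lists, assuming each postmortem record is tagged with a failure mode and its action items carry a category and a closed flag; the tags and field names are illustrative assumptions:

```python
from collections import Counter

# Illustrative quarterly export; field names are assumptions, not a real schema.
quarter = [
    {"id": "PM-201", "failure_mode": "config-change", "actions": [
        {"category": "monitoring", "closed": False}]},
    {"id": "PM-202", "failure_mode": "config-change", "actions": [
        {"category": "tooling-safeguard", "closed": True}]},
    {"id": "PM-203", "failure_mode": "capacity", "actions": [
        {"category": "monitoring", "closed": False}]},
]

# List 1: failure modes that appeared more than once -> systemic issues.
mode_counts = Counter(p["failure_mode"] for p in quarter)
recurring = {mode: n for mode, n in mode_counts.items() if n > 1}
print("Recurring failure modes:", recurring)  # {'config-change': 2}

# List 2: action item categories that are consistently left open.
open_by_category = Counter(
    a["category"] for p in quarter for a in p["actions"] if not a["closed"]
)
print("Open action items by category:", open_by_category)  # monitoring: 2 -> escalate
```

In this toy export, monitoring items are written twice and closed zero times, which is precisely the resourcing signal described above that must be escalated to leadership.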

```mermaid
flowchart LR
  A["Individual Incident"] --> B["Standard Postmortem<br/>Timeline + 5 Whys + Actions"]
  B --> C["Action Items Closed"]
  C --> D["Postmortem Database"]
  D --> E["Quarterly Aggregate Review<br/>Pattern analysis across postmortems"]
  E --> F["Systemic Themes Identified<br/>Recurring failure modes, chronic gaps"]
  F --> G["Architectural / Org Investments<br/>Address root causes at system level"]
  G --> H["Reduced Incident Rate<br/>System reliability improves over time"]
  H --> A

  I["Pre-mortem Analysis<br/>Prospective failure analysis"] --> J["Vulnerabilities Identified<br/>Before they cause incidents"]
  J --> G
```

  I[Pre-mortem Analysis
Prospective failure analysis] --> J[Vulnerabilities Identified
Before they cause incidents]
  J --> G

The mature postmortem program: individual postmortems feed a database; quarterly aggregate reviews identify systemic patterns; systemic investments reduce the incident rate over time. The addition of pre-mortem analysis closes the loop by preventing incidents before they occur.

Publishing Your Postmortems: The Trust Dividend

GitLab, GitHub, and Stripe publish their postmortems publicly. This practice generates a trust dividend that is disproportionate to its cost. Users who see an honest, thorough postmortem from a company that experienced an outage are more likely to continue trusting that company than users who receive only a brief apology. The companies that publish great postmortems are demonstrating that they take reliability seriously, that they learn from failures, and that they are transparent with their users. This signal is more valuable for long-term user trust than any marketing campaign. If your organization is not ready to publish postmortems externally, start by sharing them broadly within the company — the cultural benefits of transparency at any scope are real.

How this might come up in interviews

Postmortem questions appear in SRE and senior engineering interviews across three formats: behavioral questions asking you to describe a real postmortem you wrote ("walk me through an incident and how you analyzed it"), system design questions asking how you would design for observability and post-incident learning, and organizational design questions asking how you would build a postmortem culture at a company that currently has none. At Staff+ level, the interviewer is specifically evaluating whether you think about reliability as a systems and organizational problem, not just a technical one — and postmortems are the primary organizational mechanism for system-level learning.

Common questions:

  • Walk me through a significant incident you were involved in. How did you conduct the postmortem? What were the root causes, and what action items came out of it? Were they implemented?
  • Your team has been writing postmortems for a year, but you keep having the same incidents recur. The same failure modes appear in postmortems every quarter. What is going wrong, and how do you fix it?
  • How do you facilitate a blameless postmortem review meeting when the incident was caused by a sequence of decisions that, in hindsight, seem obviously wrong? How do you keep the meeting from becoming a blame session?
  • What is the difference between a root cause and a contributing factor? Give an example of an incident where distinguishing between the two would have produced meaningfully different action items.
  • How would you build a postmortem culture from scratch at a 200-person engineering organization that has never done them before? What are your first three actions, and what are the common failure modes you would watch out for?
  • A postmortem action item has been sitting open for 4 months past its due date. The owner says they keep getting deprioritized by product work. What do you do?

Try this question. Ask the interviewer: Does your organization have a searchable postmortem database? How are postmortem action items tracked, and what is the typical closure rate? Has your team ever experienced a postmortem graveyard, and how did you address it? Are postmortems shared cross-functionally or kept within teams? These questions signal that you understand postmortems as an organizational and cultural practice, not just a documentation process.

Strong answer: Describing the accountability-versus-blame distinction clearly and specifically. Referencing Sidney Dekker or Just Culture. Knowing the GitLab or GitHub postmortems and what made them excellent. Describing concrete experiences where postmortem action items actually prevented a class of future incidents. Understanding the organizational dynamics of the postmortem graveyard anti-pattern. Using the NTSB or aviation analogy for blameless investigation. Measuring postmortem health through action item closure rates, not just postmortem count.

Red flags: Conflating postmortems with blame sessions or thinking blamelessness means no accountability. Not knowing what a root cause is versus a contributing factor. Describing action items in vague terms ("improve the process", "add more monitoring"). Having no opinion on what makes a postmortem action item effective versus ineffective. Treating postmortems as overhead rather than as an investment. Not being able to describe a real postmortem they have written.

Key takeaways

  • The postmortem is not documentation after the incident — it is the highest-leverage engineering activity available after a failure. An incident without a postmortem is pure cost. An incident with a great postmortem is an investment that pays compound returns through prevented future incidents, organizational learning, and systemic improvements that benefit every engineer who works on the system afterward.
  • Blamelessness is not a kindness policy — it is a data quality requirement. When engineers fear blame, they produce diplomatically edited postmortems that omit inconvenient truths. Inaccurate postmortems produce wrong root cause analyses, which produce ineffective action items, which allow incidents to recur. Blameless postmortems are the only kind that actually change systems for the better.
  • The 5 Whys must never stop at a human action as the root cause. "Engineer X did Y" is a description of what happened. The root cause is always the system property that made that human action possible, likely, or impossible to prevent: the absent confirmation step, the missing automated check, the monitoring gap, the access control that should have blocked the action. Root causes that are system properties produce action items that prevent recurrence. Human-action root causes produce nothing actionable.
  • Action items are the only reason to write a postmortem, and they are valueless unless implemented. An organization with a postmortem graveyard — thorough documents and unimplemented action items — is paying the full cost of incident analysis with none of the benefit. Action items must be in the sprint backlog, assigned to named individuals, tracked to closure, and reported to engineering leadership. A 90-day action item closure rate below 70% is a management priority failure, not an engineering failure.
  • The best postmortem programs in the world — at Google, GitLab, GitHub, and Netflix — treat every incident as a gift: a free, live experiment revealing exactly how the system fails. The goal is not to have fewer incidents in the short term (though that follows). The goal is to extract maximum learning from every incident that does occur, building an organization that becomes reliably more reliable over time because it learns systematically from every failure.
Before you move on: can you answer these?

You are writing a postmortem for a database outage. Your 5 Whys analysis arrives at "the database administrator ran the wrong command." A teammate says this is the root cause. Is it? What would you say?

"The administrator ran the wrong command" is a description of what happened, not a root cause — it is a human action, not a system property. Ask: why was running the wrong command possible? What properties of the system — access controls, confirmation steps, environment labels, automated safeguards — should have prevented or detected the error? A true root cause for this incident is something like: "the operational tooling permitted destructive commands to execute against the production database without an environment confirmation step or automated pre-operation backup verification." That root cause produces actionable items. "Administrator ran the wrong command" produces no action item that would actually prevent recurrence.

Your team has written 12 postmortems in the past 6 months. Looking back, only 3 of the 12 have had their action items fully closed. What is the problem, and how do you fix it?

A 25% action item closure rate is a postmortem graveyard — the most common and most destructive postmortem anti-pattern. The problem is not the quality of the postmortems themselves; it is a failure of follow-through driven by organizational prioritization. Action items are being treated as optional "reliability backlog" items rather than as first-class engineering work. The fix has three parts: (1) Pull all open postmortem action items into sprint planning as explicit work items competing directly with feature work — not in a separate backlog. (2) Conduct a monthly action item review meeting where overdue items are escalated to the engineering manager with a concrete re-negotiated due date. (3) Present the closure rate (25%) to engineering leadership with the list of open items and their original due dates. Leadership visibility on this metric creates organizational pressure to treat reliability work as real work.

A senior engineer on your team suggests that postmortems should identify the specific engineer responsible for the error so that the organization can ensure that engineer receives appropriate training. How do you respond?

This proposal, however well-intentioned, would destroy the postmortem culture. When engineers learn that postmortems lead to identification of "the responsible person" — even for training rather than punishment — they will begin editing their accounts. They will omit the decision they regret, frame the timeline to minimize their apparent culpability, and produce a postmortem that is diplomatically written rather than technically accurate. An inaccurate postmortem produces wrong root causes and ineffective action items. The incident recurs. The response is: individual training needs are identified and addressed through the normal manager relationship, entirely separate from the postmortem process. The postmortem exclusively analyzes system properties. Dekker's "Just Culture" framework is worth sharing with the senior engineer: accountability for action items (who implements the improvement?), yes; blame for decisions that led to the incident, no.

🧠Mental Model

💡 Analogy

Aviation accident investigation is the mental model that best captures the spirit of blameless postmortems. The United States National Transportation Safety Board (NTSB) investigates every aviation accident — not to blame pilots, but to find systemic improvements. When a plane crashes, NTSB investigators do not ask "who made a mistake?" They ask "what conditions made this mistake possible?" and "how do we change the system so this can never happen again?" The NTSB has the authority to subpoena records, compel testimony, and produce findings — but its findings are explicitly non-punitive. FAA disciplinary actions are handled separately, and NTSB investigators have no role in them. This institutional separation of "learning about what happened" from "determining consequences for who did it" is the exact architecture of a blameless postmortem program. The result speaks for itself: commercial aviation in the United States has become statistically the safest form of transportation in history, with fatal accidents per passenger mile measured in billionths. The aviation industry achieves this by treating every single accident and near-miss as a system learning event — never, ever as an opportunity to blame a pilot. Every time you read a postmortem, imagine an NTSB investigator reading over your shoulder asking: "What does this tell us about the system? What would we need to change so that a different pilot, under different circumstances, could not make the same mistake?"

⚡ Core Idea

A postmortem is not documentation about an incident. It is an investment in the reliability of the system and the capability of the organization. The timeline, root cause analysis, and contributing factors are the research. The action items are the return on investment. An incident that produces no changes to the system has cost the organization the full price of the outage with no return. An incident that produces thorough root cause analysis and implemented action items pays compound interest — it prevents future incidents, trains future engineers, and builds the organizational memory that makes reliability possible at scale.

🎯 Why It Matters

Postmortems are the mechanism by which engineering organizations become smarter than they were before each failure. Without them, organizations repeat the same incidents quarterly, lose the knowledge when engineers leave, and make reliability investments based on intuition rather than evidence. With them, organizations build the searchable, institutional memory that transforms incidents from pure costs into investments in future reliability. The companies that are reliably most reliable — Google, Netflix, Amazon — are not the ones that have the fewest incidents; they are the ones that learn the most from every incident they do have.

Interview prep: 1 resource

Use this resource to reinforce this concept for interviews.

  • Site Reliability Engineering (Google SRE book)
