A Staff SRE-depth guide to running blameless postmortems that actually change systems: from the philosophy of Just Culture and Sidney Dekker to the Google postmortem template, 5 Whys methodology, action item tracking, real industry examples including the GitLab 2017 database deletion, and the advanced techniques used at Google, Netflix, and Stripe to build organizations that learn from every failure.
After 500+ postmortems at Google, I can tell you with certainty: the postmortem is the single highest-leverage activity in all of SRE. An incident lasts hours. A great postmortem lasts forever — it changes the system, trains the organization, and prevents a class of future failures that no one has yet imagined. An incident without a postmortem is pure cost. An incident with a great postmortem is an investment that pays compound interest.
Most engineering teams treat postmortems as documentation overhead — a bureaucratic artifact that must be filed before the on-call engineer can return to their real work. This is a fundamental misunderstanding. The postmortem is not the paperwork after the incident. The postmortem IS the work. The incident gave you a gift: a free, live experiment revealing exactly how your system fails. The postmortem is how you extract full value from that experiment.
The Economics of Postmortems
A P1 incident at a large company costs $500,000-$5,000,000 in lost revenue, engineer time, and customer trust (Gartner estimates $5,600/minute for enterprise downtime). A great postmortem that prevents one recurrence of that incident is worth every one of those dollars. A team that files postmortems but never implements the action items is paying the incident cost twice: once in downtime, once in the engineering time spent writing a document that changes nothing.
The Google SRE Book describes postmortems as a "written record of an incident, its impact, the actions taken to mitigate or resolve it, the root causes, and the follow-up actions to prevent the incident from recurring." But that definition understates the real purpose. A postmortem is how a team becomes smarter than it was before the incident. It is the mechanism by which hard-won operational knowledge is preserved, distributed, and converted into systemic improvements.
What a great postmortem actually does for an organization
Sidney Dekker, the professor of human factors and safety science who wrote "The Field Guide to Understanding Human Error" and "Just Culture," articulated the core principle: in complex systems, accidents are not caused by bad people making mistakes. They are caused by systems that make mistakes possible, even inevitable. The pilot who lands on the wrong runway did not fail because they are careless. They failed because the system — ATC communication, runway signage, crew resource management, fatigue policies — allowed and perhaps encouraged that error. Asking "why did this pilot make this mistake?" is the wrong question. The right question is "what did our system do to make this mistake possible, and what can we change to make it impossible next time?"
Postmortem as Investment, Not Expense
Frame postmortems to your team as: this is the one activity where the time you spend is guaranteed to improve the system for every engineer who works on it after you. A runbook improvement from a postmortem saves 30 minutes for every future engineer who handles that incident. A monitoring gap fixed from a postmortem gives earlier warning to every future on-call engineer. The postmortem author is investing their time for the benefit of everyone who comes after.
At Google, every incident that meets the trigger criteria is required to have a postmortem. This is not optional. The postmortem culture is so deeply embedded that engineers will spontaneously open a postmortem document for minor incidents that technically do not require one, because they have internalized that learning from failure is core to the work. That cultural internalization — where postmortems are viewed as a professional obligation and an opportunity, not a punishment — is the hallmark of a mature SRE organization.
| Postmortem Trigger | Threshold | Rationale |
|---|---|---|
| P1 incident (any duration) | Always required | Any customer-visible outage warrants full analysis regardless of duration — a 5-minute outage that reveals a systemic vulnerability is as important as a 2-hour outage. |
| P2 incident with >15 min customer impact | Required | Extended degradation at P2 indicates a monitoring or response gap worth analyzing. |
| Data loss or corruption (any scale) | Always required | Data incidents have extraordinary consequence. Even small data losses must be fully analyzed. |
| Security incident | Always required | Security incidents require thorough analysis to close attack surfaces. |
| Novel failure mode (never seen before) | Required | Novel failures reveal gaps in system design that likely have other manifestations not yet discovered. |
| Near-miss (caught before customer impact) | Strongly recommended | Near-misses are the highest-value learning events: you can analyze the failure without the pressure of an active incident. |
| SLO burn rate threshold exceeded | Team discretion | Rapid error budget consumption that does not trigger a formal incident may still warrant analysis. |
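The burn-rate trigger in the last row is easy to automate. A minimal sketch in Python (the 14.4x fast-burn threshold and the example numbers are illustrative assumptions, not part of the trigger table):

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is burning.

    error_ratio: fraction of requests failing over the lookback window.
    slo_target:  e.g. 0.999 for a 99.9% availability SLO.
    """
    budget = 1.0 - slo_target    # allowed failure fraction
    return error_ratio / budget  # 1.0 means exactly on budget

def postmortem_recommended(error_ratio: float, slo_target: float,
                           threshold: float = 14.4) -> bool:
    # 14.4x sustained for 1h burns ~2% of a 30-day budget,
    # a common fast-burn paging threshold (assumed policy).
    return burn_rate(error_ratio, slo_target) >= threshold

# A 99.9% SLO with 2% of requests failing burns budget at ~20x:
print(burn_rate(0.02, 0.999))
print(postmortem_recommended(0.02, 0.999))
```

A team can run this check against its error-budget dashboard after any fast-burn page and use the result to decide whether the "team discretion" row should become a postmortem.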
Blameless postmortems are not a kindness to engineers who make mistakes. They are a hard-nosed engineering requirement for building reliable systems. This distinction is critical: blamelessness is not about protecting feelings. It is about getting accurate data. When engineers fear punishment for mistakes, they distort, omit, or rationalize information in the postmortem. That distorted information produces an inaccurate root cause analysis, which produces ineffective action items, which means the incident recurs. Blame does not just hurt morale — it actively makes systems less reliable.
The psychological mechanism is well documented in human factors research. John Allspaw, former CTO of Etsy and pioneer of the blameless postmortem movement in tech, documented how engineers who fear blame engage in "information editing" — consciously or unconsciously shaping their account of events to minimize their apparent culpability. They omit the decision at 2:47 AM that they regret. They skip the step they took that worsened the incident. They frame contributing factors to shift attention away from their own actions. The result is a postmortem that is factually incomplete, and an organization that learns the wrong lesson.
Blame Creates a Hidden Tax on Every Postmortem
Every time an engineer is blamed, criticized, or made to feel responsible for a failure in a postmortem, you impose a hidden tax on every future postmortem. Future engineers will be less forthcoming. They will mention the "system issues" but leave out the judgment call they made. They will describe the sequence of events but omit the context that explains why those decisions seemed reasonable at the time. Over time, your postmortems become diplomatic documents rather than engineering ones — and they stop being useful.
The accountability versus blame distinction is the most important conceptual boundary in postmortem culture. Accountability is forward-looking: given what we know happened, who is responsible for ensuring specific improvements are made? Blame is backward-looking: who did the wrong thing and should face consequences? In a blameless culture, there is full accountability for action items — owners are named, deadlines are set, and follow-through is tracked. There is no blame for the decisions that led to the incident, because those decisions were made by competent engineers working with the information available to them at the time.
Blame vs. blameless framing: concrete examples
Creating psychological safety in postmortem review meetings requires active facilitation, not just good intentions. The postmortem facilitator must explicitly establish norms at the start of every review: we are here to understand the system, not to judge decisions; every person who was involved made the best decision they could with the information available at the time; we use "I" statements about system conditions, not "you should have" statements about individual choices. This framing must be actively enforced — if someone begins to express blame in the meeting, the facilitator must redirect immediately.
The "I Would Have Done the Same" Test
A useful facilitation technique from John Allspaw: when analyzing a decision made during an incident, ask the group "if you had been in that position, with the same context, the same tools, the same time pressure, and the same information available — would you have made the same decision?" In the vast majority of cases, the answer is yes. This exercise builds empathy, surfaces the systemic conditions that made the decision reasonable, and redirects the analysis from "why did they do that?" to "what did our system provide to make that decision seem right?"
Sidney Dekker's "Just Culture" model provides the most rigorous framework for navigating the accountability-blame boundary in postmortems. Dekker distinguishes between honest mistakes (which should never be punished), risky behaviors (which require coaching and system improvements), and reckless behavior (which may warrant disciplinary action). In 10+ years of SRE postmortems, I can count on one hand the number of incidents that fell into the "reckless" category. The overwhelming majority — 99%+ — were honest mistakes in complex systems. Defaulting to the blame mindset treats every incident as if it falls into the reckless category. Defaulting to the blameless mindset treats every incident as a systems learning opportunity, which it almost always is.
Blameless Does Not Mean Consequence-Free
Blamelessness is not a free pass for repeated reckless behavior. The distinction: if an engineer circumvents safety controls deliberately, ignores explicit warnings, and causes an incident, that is a conduct issue that should be addressed through management channels — separately from the postmortem. The postmortem should still focus on the system conditions that permitted the reckless behavior to succeed (what controls could have prevented even deliberate circumvention?). Addressing both the conduct issue and the system issue produces better outcomes than addressing either alone.
The Google postmortem format has become the de facto industry standard for a reason: it is structured around learning, not documentation. Every section has a specific purpose. Every section earns its place by either capturing essential facts (Summary, Impact, Timeline), driving root cause analysis (Root Causes, Contributing Factors), or producing actionable improvements (Action Items). There is no filler. If a section does not help you understand what happened or prevent recurrence, it does not belong in the template.
The 9 sections of the Google postmortem template
```mermaid
flowchart TD
    A[Incident Occurs] --> B[Incident Resolved / Mitigated]
    B --> C{Postmortem Required?<br/>Check trigger criteria}
    C -->|No| D[Brief incident note<br/>in on-call log]
    C -->|Yes| E[Open Postmortem Doc<br/>Within 24 hours]
    E --> F[Assign Author<br/>Typically primary IC or on-call]
    F --> G[Reconstruct Timeline<br/>Logs + monitoring + chat history]
    G --> H[Identify Root Causes<br/>5 Whys from each symptom]
    H --> I[Draft Contributing Factors<br/>Monitoring gaps, process gaps]
    I --> J[Write What Went Well<br/>Honest positive assessment]
    J --> K[Write What Went Poorly<br/>System terms, no blame]
    K --> L[Draft Action Items<br/>SMART: owner + deadline + type]
    L --> M[Postmortem Review Meeting<br/>Blameless facilitation]
    M --> N{Action Items Approved?}
    N -->|Needs revision| L
    N -->|Approved| O[Post to Postmortem Database<br/>Share org-wide]
    O --> P[Track Action Items<br/>In ticket system]
    P --> Q{All Actions Closed?}
    Q -->|No| R[Follow-up in sprint planning]
    Q -->|Yes| S[Postmortem Complete<br/>Learning extracted]
```

The complete postmortem lifecycle: from incident resolution through documentation, review, and action item closure. The postmortem is not complete until the action items are closed — not when the document is published.
The timeline is the foundation of the entire postmortem. You cannot identify root causes without an accurate timeline. You cannot write meaningful action items without understanding the sequence of events. And the timeline is the section most likely to be inaccurate if you wait too long to write it — memories fade, logs are rotated, and the mental model of "what happened" starts to calcify into a simplified narrative that omits inconvenient details.
The 24-Hour Rule
Start the postmortem within 24 hours of incident resolution. This is not about rushing — it is about preserving data. Within 24 hours, engineers remember what they were thinking when they made each decision. The incident Slack channel is still fresh. Log data is readily available. The mental model is detailed and accurate. After 72 hours, engineers have moved on mentally, memories have begun to simplify, and the team will reconstruct the timeline from fragmentary sources rather than direct memory. After one week, the postmortem will be significantly less accurate — and thus significantly less useful.
The Postmortem Author Role
The postmortem author is typically the Incident Commander or primary on-call engineer for the incident, but this is not a rigid rule. The author should be the person who was most involved in the incident and has the most complete picture of the timeline and decision-making. At Google, the author is expected to spend 4-8 hours on a thorough postmortem for a P1 incident — this is considered primary work, not overhead. The author is not solely responsible for the action items: each action item must be assigned to the engineer best positioned to implement it, regardless of who wrote the postmortem.
The postmortem timeline is the hardest section to write well and the most important to get right. A precise, honest, and complete timeline is the foundation on which the entire root cause analysis is built. An inaccurate timeline produces an inaccurate root cause analysis, which produces ineffective action items, which means the incident recurs. I have seen more postmortems ruined by timeline errors than by any other single factor.
The sources of truth for a postmortem timeline are, in order of reliability: structured logs with millisecond timestamps, monitoring system alert history, deployment and configuration change history, incident management tool event log (PagerDuty, Opsgenie), and chat history (Slack, Teams). Human memory is the least reliable source and should always be cross-referenced against these objective sources. The phrase "I think the error started around 2 PM" is not a timeline entry — it is a hypothesis to be verified against logs.
Sources of truth for timeline reconstruction, in order of reliability
Timestamp precision is a source of significant timeline errors that most teams underestimate. At 10:42 AM, a monitoring alert fires. The engineer acknowledges at 10:45 AM. But looking at the logs, you can see the error rate began rising at 10:38 AM — 4 minutes before the alert fired. That 4-minute detection gap is a contributing factor: the alert threshold was too high or the evaluation window too long. Without precise timestamps, you would not see this gap. Always include timestamps precise to at least the minute in postmortem timelines; use seconds when logs provide that precision.
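The detection gap described above is trivial to compute once every timestamp is normalized to UTC. A small sketch using the illustrative times from this paragraph (the dates are invented):

```python
from datetime import datetime, timezone

def detection_gap(first_symptom: str, first_awareness: str) -> float:
    """Minutes between the first measurable symptom (from logs/metrics)
    and the first human awareness (alert fire or engineer observation)."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    t0 = datetime.strptime(first_symptom, fmt).replace(tzinfo=timezone.utc)
    t1 = datetime.strptime(first_awareness, fmt).replace(tzinfo=timezone.utc)
    return (t1 - t0).total_seconds() / 60.0

# Error rate began rising at 10:38, the alert fired at 10:42:
gap = detection_gap("2024-03-01T10:38:00", "2024-03-01T10:42:00")
print(f"detection gap: {gap:.0f} minutes")  # prints "detection gap: 4 minutes"
```

Recording this number explicitly in every postmortem makes the alert-threshold contributing factor impossible to overlook.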
The Hindsight Bias Trap
The single most common timeline error is hindsight bias: writing the timeline as if people knew, at each point in time, what you now know in hindsight. "At 10:38 AM, the error rate began rising due to the config change deployed at 10:31 AM" implies that the on-call engineer knew at 10:38 AM that the config change was the cause. They did not — they only learned that at 10:54 AM after investigation. A correctly written timeline says "At 10:38 AM, error rate began rising [this was not yet detected]." Hindsight bias in timelines leads to false conclusions about how quickly the incident should have been resolved, and to unfair implicit criticism of responders.
The pre-incident timeline is a section that most postmortems omit but should not. Every incident has a history. When did the system first enter the vulnerable state? For the GitLab 2017 incident, the vulnerable state began months before the incident: the backup system had been silently failing, but no one had verification checks to detect the silent failure. The incident on the day was the trigger, but the root cause — silent backup failures — had been present for months. A timeline that starts at the moment of the incident misses this systemic vulnerability entirely.
Ask "When Did the System Enter the Vulnerable State?"
For every incident, after drafting the incident-day timeline, ask: when did the system first enter the state that made this incident possible? This question often reveals a root cause weeks or months before the incident date. A misconfiguration applied three weeks ago. A dependency that has been degraded for days before it triggered a failure. A monitoring gap that has existed since the service was first deployed. This pre-incident timeline is where the most important root causes and action items often hide.
Timeline writing process: step by step
1. Open your log aggregation platform and set the time range to 2 hours before the first detected anomaly through 2 hours after full resolution. Export or screenshot the key time series graphs.
2. Pull the deployment history for all services involved in the incident for the 48 hours before the incident. Note all deployments with their exact timestamps.
3. Download the full incident Slack channel history. Sort by timestamp. Note every engineer action, hypothesis, and decision.
4. Pull the PagerDuty/Opsgenie incident log. Record exact alert fire time, acknowledgement time, escalation events, and resolution time.
5. Write the timeline as a chronological list of events with timestamps in UTC, using logs as ground truth. For each event, note what was observed and what was done in response.
6. Specifically mark the detection gap: the time between first measurable symptom (in logs/metrics) and first human awareness (alert acknowledgement or engineer observation). This gap is always a contributing factor.
7. Identify the knowledge gap moments: points in the timeline where the responders made a decision based on incorrect information or incomplete information. These are the richest sources of action items.
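The steps above amount to merging several event streams (logs, deploys, pager events, chat) into one UTC-ordered list. A minimal sketch, with invented example events:

```python
# (utc_timestamp, source, description) -- invented illustrative events
logs = [("2024-03-01T14:28:12", "deploy", "checkout-api v2.41 rollout complete"),
        ("2024-03-01T14:31:05", "metrics", "5xx rate begins rising")]
pager = [("2024-03-01T14:35:00", "pagerduty", "HighErrorRate alert fired"),
         ("2024-03-01T14:37:20", "pagerduty", "alert acknowledged")]
chat = [("2024-03-01T14:41:00", "slack", "hypothesis: recent deploy; starting rollback")]

def build_timeline(*streams):
    """Merge event streams into one chronological timeline.
    ISO-8601 UTC timestamps sort correctly as plain strings."""
    return sorted((event for stream in streams for event in stream),
                  key=lambda event: event[0])

for ts, source, desc in build_timeline(logs, pager, chat):
    print(f"{ts}  [{source:9}] {desc}")
```

The merged output makes the detection gap (14:31 to 14:35 here) and the investigation gap (14:35 to 14:41) visible at a glance, exactly the intervals steps 6 and 7 ask you to mark.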
The 5 Whys is the most widely used and most widely misused root cause analysis technique in software engineering. Used correctly, it is a powerful tool for drilling past surface symptoms to systemic root causes. Used incorrectly, it produces a root cause of "human error" that generates useless action items and misses the actual opportunity for improvement. The difference lies in a single discipline: never accept a person as the end of the why chain.
The 5 Whys was developed by Sakichi Toyoda and made famous through the Toyota Production System. The principle: for any problem, ask "why" repeatedly until you reach a root cause that is either a fundamental system property you can change, or a management decision you can challenge. The number "5" is approximate — it takes as many whys as needed to reach that root cause, which is often 4-7 in complex software systems.
Worked Example: 5 Whys for the GitLab 2017 Database Incident
Why did 300GB of production data get deleted? Because a database administrator ran a deletion command against the production database instead of staging. Why? Because both databases were accessible from the same terminal session without a clear visual distinction. Why? Because the team's operational tooling did not enforce environment separation or require environment confirmation before destructive commands. Why? Because the tooling had been built incrementally without a systematic review of which operations required additional safeguards. Why? Because there was no process that required destructive database operations to have an automated backup verification step and a human confirmation step before execution. Root cause: absence of automated safeguards for destructive database operations.
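The root cause above points at a concrete class of safeguard: tooling that refuses destructive commands unless the operator explicitly confirms the target environment and a recent verified backup exists. A hedged sketch of such a guard (this is illustrative, not GitLab's actual tooling):

```python
class UnsafeOperationError(Exception):
    """Raised when a destructive operation fails its pre-flight checks."""

def guard_destructive_op(environment: str, typed_confirmation: str,
                         last_verified_backup_age_hours: float,
                         max_backup_age_hours: float = 24.0) -> None:
    """Call before any DROP/DELETE/rm against a database host.

    Raises unless (1) the operator typed the exact environment name,
    preventing the staging-vs-production confusion, and (2) a *verified*
    backup exists that is recent enough to recover from.
    """
    if typed_confirmation != environment:
        raise UnsafeOperationError(
            f"confirmation {typed_confirmation!r} does not match "
            f"target environment {environment!r}")
    if last_verified_backup_age_hours > max_backup_age_hours:
        raise UnsafeOperationError(
            f"last verified backup is {last_verified_backup_age_hours:.1f}h old "
            f"(limit {max_backup_age_hours}h); refusing destructive operation")
```

Note that the second check depends on backup *verification*, not backup *scheduling*: a backup job that ran but silently produced nothing would still fail this guard, which is precisely the gap the GitLab incident exposed.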
5 Whys: Worked example with a web service incident
1. Problem statement: Users received 503 errors on the checkout API for 22 minutes starting at 14:31 UTC.
2. Why 1 — Why did users receive 503 errors? Because the checkout API pods were in CrashLoopBackOff, refusing all traffic. The connection pool to the payments database was exhausted.
3. Why 2 — Why was the connection pool exhausted? Because each checkout API request was opening a new database connection and not returning it to the pool after completion.
4. Why 3 — Why were connections not returned to the pool? Because a new feature deployed at 14:28 UTC introduced a code path that returned early without calling the connection.release() method.
5. Why 4 — Why was this code path released without connection.release() being called? Because the integration tests for this code path did not include a test for the early-return scenario, and static analysis was not configured to detect unreleased connections.
6. Why 5 — Why were integration tests not configured to catch this pattern? Because the team had no standard for "connection lifecycle testing" in integration test suites, and the code review checklist did not include a prompt to verify resource cleanup in all execution paths. Root cause: absence of code review and test standards for resource lifecycle management.
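The bug shape from Why 3 above, and the systemic fix, can both be shown in a few lines. A sketch with a toy pool (the `Pool` class and function names are invented for illustration):

```python
from contextlib import contextmanager

class Pool:
    """Toy connection pool, just enough to illustrate the leak."""
    def __init__(self, size: int):
        self.available = size
    def acquire(self):
        if self.available == 0:
            raise RuntimeError("pool exhausted")
        self.available -= 1
        return object()  # stand-in for a real connection
    def release(self, conn) -> None:
        self.available += 1

# Buggy shape from Why 3: the early return skips release().
def checkout_buggy(pool: Pool, cart_empty: bool) -> str:
    conn = pool.acquire()
    if cart_empty:
        return "nothing to do"   # LEAK: conn is never released
    pool.release(conn)
    return "charged"

# Systemic fix: a context manager guarantees release on every exit path,
# so a forgotten release() call becomes impossible rather than merely rare.
@contextmanager
def connection(pool: Pool):
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)

def checkout_fixed(pool: Pool, cart_empty: bool) -> str:
    with connection(pool):
        if cart_empty:
            return "nothing to do"  # release still runs via finally
        return "charged"
```

This is what a Prevent-type action item for Why 3 looks like in practice: change the API so the unsafe pattern cannot be written, rather than asking engineers to remember the safe one.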
The critical discipline in 5 Whys is recognizing when you have landed on a person and refusing to stop there. "Because the engineer forgot to call release()" is not a root cause — it is a symptom. Competent engineers make this kind of mistake regularly, at every level of experience and seniority. The system that allows a single forgotten line to exhaust a connection pool and take down a service is the problem, not the engineer. Ask: what would have to be true about the system for a forgotten release() call to be impossible, automatically caught, or immediately mitigated?
When 5 Whys Is Insufficient
The 5 Whys has real limitations. First, it produces a single linear causal chain, missing the multiple contributing factors that most complex failures involve. Second, different analysts asking "why" will reach different root causes — the technique has low inter-rater reliability. Third, it can lead prematurely to a root cause that feels satisfying but is actually too shallow. For complex incidents with multiple contributing factors, consider combining 5 Whys with a Contributing Factors analysis (what conditions enabled each step?) or a Fishbone/Ishikawa diagram (which maps multiple causal branches simultaneously).
Common 5 Whys failure modes and how to avoid them
Systemic causes are the gold standard destination for a 5 Whys analysis. A systemic cause is a property of the system — architecture, tooling, process, policy — that made the incident possible, and which can be changed to make the incident impossible or much less likely. Action items that address systemic causes are the most valuable because they prevent an entire class of future incidents, not just the specific one that occurred.
Action items are the only reason to write a postmortem. Everything else — the timeline, the root cause analysis, the contributing factors — is in service of producing action items that change the system. A postmortem without implemented action items is not a postmortem. It is a forensics report. Forensics reports are filed in archives and read by no one. Implemented action items prevent future incidents.
The SMART framework for action items is the minimum quality bar. Each action item must be: Specific (not "improve monitoring" but "add a p99 latency alert for the checkout-api service that fires when latency exceeds 300ms for 5 consecutive minutes"), Measurable (the outcome of completing the action item can be observed), Assignable (one named human is accountable — not a team, not "someone," but a specific engineer or manager), Realistic (completable with available resources and skills), and Time-bound (a specific due date, not "Q2" but "by March 15th").
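A lightweight way to enforce the SMART bar is to give action items structure and lint them before the review meeting closes. A sketch (the field names and the vagueness heuristics are assumptions, not a standard):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str  # Specific: names the service, threshold, and behavior
    owner: str        # Assignable: one named human, never a team
    due: date         # Time-bound: a real date, not a quarter
    kind: str         # prevent | detect | mitigate | process | document

    def smart_violations(self) -> list[str]:
        """Return a list of SMART problems; empty means the item passes.
        These checks are illustrative heuristics, not an exhaustive linter."""
        problems = []
        words = self.description.split()
        if len(self.description) < 20 or (words and words[0].lower() == "improve"):
            problems.append("description too vague to be Specific/Measurable")
        if self.owner.strip().lower() in {"", "team", "tbd", "someone"}:
            problems.append("owner is not a named human")
        if self.kind not in {"prevent", "detect", "mitigate", "process", "document"}:
            problems.append("unknown action item type")
        return problems
```

For example, `ActionItem("improve monitoring", "team", date(2024, 6, 30), "detect")` fails two checks, while a fully specified item with a named owner passes cleanly.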
| Action Item Type | Purpose | Example | Priority |
|---|---|---|---|
| Prevent | Remove the condition that made the incident possible | Add automated backup verification to migration tooling that blocks execution if no recent backup exists | P1 for root causes |
| Detect | Enable faster detection of the same failure mode in the future | Add monitoring for database connection pool utilization with alert at 80% capacity | P1 for detection gaps |
| Mitigate | Reduce the impact or duration of future incidents of this type | Create a runbook for connection pool exhaustion with the exact kubectl exec commands to restart affected pods | P2 for response improvements |
| Process | Change team processes to prevent the category of failure | Add "verify all execution paths call resource cleanup methods" to the code review checklist | P2 for systematic gaps |
| Document | Preserve knowledge for future incidents | Update the deployment runbook to include the rollback procedure for this service | P3 for knowledge capture |
Owner assignment is where most postmortem action items die. An action item assigned to "the team" is owned by no one. An action item assigned to a team lead who has twelve other priorities is effectively owned by no one. The postmortem author must negotiate ownership with the specific engineer who will implement the action item during the postmortem review meeting — not assign it unilaterally and hope for the best. The owner must acknowledge the assignment and the deadline explicitly.
The Postmortem Graveyard Anti-Pattern
The postmortem graveyard is one of the most common and most damaging anti-patterns in SRE: a growing collection of postmortems with action items that were never implemented. Teams with postmortem graveyards see the same incidents recur quarter after quarter, because the root causes are identified but never addressed. The graveyard also destroys postmortem culture: engineers who write thorough postmortems and then watch the action items rot in a backlog quickly conclude that postmortems are a waste of time, and start writing lower-quality documents or avoiding postmortems entirely. Postmortem action items must be treated as P1-equivalent engineering work — not as optional backlog items to be deprioritized whenever a new feature request arrives.
Systems for tracking postmortem action items that actually work
Prioritizing Action Items: The Recurrence and Impact Matrix
Not all action items are equal. Prioritize using two dimensions: recurrence probability (how likely is this failure mode to happen again?) and impact if it does recur (what is the blast radius?). High probability + high impact = P1 action item, implement within 2 weeks. High probability + low impact = P2, implement within 4-6 weeks. Low probability + high impact = P2, implement within 4-6 weeks (rare but catastrophic failures are worth preventing even if unlikely). Low probability + low impact = P3, implement in next reliability sprint.
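The matrix above reduces to a small decision function. A sketch (the 12-week P3 horizon is an assumption standing in for "next reliability sprint"):

```python
def prioritize(recurrence_high: bool, impact_high: bool) -> tuple[str, int]:
    """Map the recurrence/impact matrix to (priority, deadline in weeks)."""
    if recurrence_high and impact_high:
        return ("P1", 2)       # implement within 2 weeks
    if recurrence_high or impact_high:
        return ("P2", 6)       # either dimension high: 4-6 weeks, use the outer bound
    return ("P3", 12)          # next reliability sprint (assumed ~12 weeks out)
```

Encoding the matrix this way removes the per-meeting debate: the review meeting argues about the two boolean inputs, and the priority and deadline follow mechanically.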
Deadline discipline is the last critical element. A due date without enforcement is a suggestion. The postmortem review meeting should set due dates that are aggressive but realistic — typically 2 weeks for P1 prevention items, 4 weeks for detection improvements, 6-8 weeks for process changes. These dates must then be tracked. An action item that slips its due date without an explicit re-negotiation and explanation is a failure of postmortem follow-through — and it is a management failure, not just an engineering failure.
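Tracking slipped deadlines can be automated against whatever ticket system holds the action items. A minimal sketch over plain dicts (the field names are assumptions; a real version would query the ticket API):

```python
from datetime import date

def overdue_items(items: list[dict], today: date) -> list[dict]:
    """Return open action items past their due date -- the list that should
    be reviewed, and explicitly re-negotiated, in every sprint planning."""
    return [it for it in items if it["status"] != "closed" and it["due"] < today]

items = [
    {"id": "PM-101", "due": date(2024, 3, 15), "status": "open"},
    {"id": "PM-102", "due": date(2024, 4, 1),  "status": "closed"},
    {"id": "PM-103", "due": date(2024, 4, 20), "status": "open"},
]
late = overdue_items(items, today=date(2024, 4, 5))
print([it["id"] for it in late])  # only PM-101 is open and past due
```

Running a report like this weekly, and surfacing it to management rather than burying it in a dashboard, is what turns a due date from a suggestion into a commitment.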
The best way to understand what a great postmortem looks like is to study real ones from companies that have demonstrated the courage to publish them. These public postmortems are simultaneously the most honest engineering documents in the industry and the most valuable learning resources. The companies that publish them — GitLab, GitHub, AWS, Stripe — demonstrate a level of transparency that is rare and admirable, and they are rewarded for it with increased user trust.
GitLab 2017: The Database Deletion Incident
On January 31, 2017, a GitLab database administrator accidentally deleted 300GB of production database data while attempting to remove a staging database during a maintenance operation to address a replication lag issue. The error was discovered almost immediately, but the team then discovered something far more alarming: the backup systems they had intended to use for recovery were not functioning correctly — most had been silently failing for months. GitLab live-streamed the entire recovery attempt on YouTube, an act of extraordinary transparency. The resulting postmortem (https://about.gitlab.com/blog/2017/02/01/gitlab-dot-com-database-incident/) is an industry benchmark for honesty, specificity, and the quality of its action items.
What made the GitLab postmortem exemplary was not that the incident was well-managed (it was catastrophic) but that the postmortem was written with complete honesty about every failure in both the incident itself and the organizational systems surrounding it. The postmortem identified that: automated backups to S3 had been failing silently for months with no alert; the database replication monitoring was not granular enough to catch the replication lag before it became critical; there was no tested procedure for the type of maintenance operation the DBA was attempting; and the access controls that would have prevented production database deletion from a terminal were absent.
Key structural features of the GitLab postmortem that made it excellent
GitHub 2018: The 24-Hour Network Partition Incident
On October 21-22, 2018, GitHub experienced a 24-hour period of degraded availability caused by a 43-second network partition affecting connectivity between its primary US East Coast datacenter and its US West Coast datacenter. The partition triggered a failover that promoted a secondary MySQL cluster to primary before replication had completed, causing data divergence between the new primary and the old. The recovery required 24 hours because undoing the data divergence while maintaining data integrity required extreme care. The GitHub postmortem (https://github.blog/2018-10-30-oct21-post-incident-analysis/) is remarkable for its depth of technical explanation and its honest analysis of how a 43-second outage became a 24-hour recovery.
The GitHub 2018 postmortem demonstrates a lesson that is hard to learn any other way: automatic failover in distributed systems can make certain categories of incidents significantly worse. The automatic failover in the GitHub incident promoted a MySQL replica that was 9 minutes behind, leading to a database state inconsistency that required manual reconciliation of over 100GB of data. The postmortem was honest about this: automatic failover was designed to reduce downtime for the common case of primary failure, but it had the opposite effect in this specific scenario.
AWS S3 US-East-1 Outage (February 28, 2017)
The AWS S3 US-East-1 outage in February 2017 brought down a significant fraction of the internet for approximately 4 hours. The root cause: a command entered with an incorrect parameter during an established runbook step, executed by an authorized AWS engineer performing routine capacity adjustments to the S3 billing system, removed a larger subset of servers from the index subsystem than intended. The S3 index subsystem needed a full restart — a procedure that had not been exercised at this scale in years and took longer than anticipated. The AWS postmortem (https://aws.amazon.com/message/41926/) is notable for its explicit acknowledgment that the runbook did not have sufficient safeguards against human error and for its commitment to adding automated input validation to prevent the same class of error.
| Postmortem | What Made It Excellent | Key Lesson | Action Item Quality |
|---|---|---|---|
| GitLab 2017 | Complete transparency about multiple simultaneous failures, including backup system silent failures | Silent failures are the most dangerous: systems can fail for months without anyone knowing if there are no verification checks | Exemplary: 15+ specific items, named owners, dates |
| GitHub 2018 | Honest analysis of how automatic failover made a 43-second partition into a 24-hour recovery | Automatic failover optimizes for the common case but can worsen rare, high-severity incidents — design accordingly | Strong: technical specificity, honest about trade-offs |
| AWS S3 2017 | Explicit acknowledgment of runbook safeguard gap and commitment to automated input validation | Runbooks for destructive operations need automated safeguards that prevent human error in execution, not just documentation | Clear: specific changes to tooling and runbook design |
What These Postmortems Have in Common
All three postmortems share five structural properties of an excellent postmortem: (1) Honest, complete timelines that do not omit inconvenient details. (2) Root causes framed as system properties, not individual failures. (3) Specific, assigned, time-bound action items — not vague commitments. (4) Transparent disclosure of what was tried and failed during the response, not just what ultimately worked. (5) Action items that addressed entire classes of future failures, not just the specific failure that occurred. These properties are not accidental — they reflect disciplined postmortem culture at organizations that take learning from failure seriously.
A postmortem process is a set of procedures. A postmortem culture is an organizational norm. The difference is the difference between engineers who write postmortems because they are required to and engineers who write postmortems because they are committed to learning from failure. Building the culture is harder than implementing the process, but it is the only thing that produces the outcome: a learning organization that becomes more reliable over time rather than simply better at responding to incidents.
Postmortem review meetings are the cultural moment where the organization demonstrates what it believes about failure. A review meeting where managers or peers question why the engineer did not notice the issue sooner, or why a different decision was not made, or who is responsible for the outage — this is a blame meeting, regardless of whether anyone says the words "your fault." The psychological safety required for honest postmortems is destroyed in these meetings and rebuilt slowly through months of consistently blameless facilitation.
Postmortem Review Meeting Facilitation Guide
The facilitator's opening statement should set the norm explicitly: "This meeting has one goal: to understand what happened and what we can change to prevent it. We start from the assumption that everyone involved made the best decision they could with the information and tools available. We are not here to evaluate whether different decisions should have been made — we are here to understand why the system produced the outcome it did, and how to change the system." This statement should be made verbatim, not paraphrased, at the start of every postmortem review.
Sharing postmortems organization-wide is a force multiplier for learning. A postmortem that is shared only within the team that experienced the incident produces local learning. A postmortem that is shared across the entire engineering organization produces organizational learning. The GitLab and GitHub postmortems produced global learning — the lessons from GitLab's backup failures have improved backup practices at hundreds of companies. At Google, postmortems are automatically shared with a broad engineering audience (with appropriate confidentiality for security incidents), and a weekly "postmortem digest" is distributed highlighting interesting recent incidents and their lessons.
Mechanisms for sharing postmortems organization-wide
Measuring postmortem quality prevents the quality from degrading over time as the process becomes routine. Without explicit quality criteria, postmortems gradually become shorter, less specific, and less honest — not from malice but from time pressure and normalization. At Google, postmortem quality is assessed on five dimensions: completeness of timeline (are all key events documented with precise timestamps?), depth of root cause analysis (does it reach systemic root causes, not surface symptoms?), quality of action items (SMART criteria), blamelessness (are all statements about system conditions, not individual performance?), and follow-through rate (are action items from previous postmortems being closed?).
| Quality Dimension | Strong Signal | Weak Signal | How to Improve |
|---|---|---|---|
| Timeline completeness | Precise timestamps throughout, detection gap clearly marked, pre-incident vulnerable state identified | Vague time references ("around 2 PM"), timeline starts at alert rather than first symptom | Require log-sourced timestamps; train authors to look for pre-incident conditions |
| Root cause depth | Reaches systemic property (tool design, process gap, architecture choice) | Stops at human action ("engineer X did Y") | Train facilitators to ask "why was that possible?" when human action appears as a why |
| Action item quality | SMART criteria, named owners, specific dates, linked tickets | Vague commitments ("improve testing"), unassigned, no due date | Provide action item templates with required fields; review each in the meeting before accepting |
| Blamelessness | All language about system conditions, no individual named as cause | Implicit blame through language like "should have", "failed to", "neglected" | Facilitator reviews draft before meeting; flags and rewrites blame-adjacent language |
| Follow-through rate | > 80% of action items closed within 90 days | < 50% of action items closed; same issues recurring in multiple postmortems | Monthly action item reviews; escalate overdue items to engineering management |
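The follow-through-rate dimension in the table above is the easiest to compute mechanically. A minimal sketch, assuming action items are exported from a tracker with `opened` and `closed` dates (the record format is hypothetical):

```python
from datetime import date

# Hypothetical action item records exported from a ticket tracker.
tracked_items = [
    {"id": "PM-101", "opened": date(2024, 1, 10), "closed": date(2024, 2, 1)},
    {"id": "PM-102", "opened": date(2024, 1, 10), "closed": None},
    {"id": "PM-103", "opened": date(2024, 2, 5),  "closed": date(2024, 6, 20)},
]

def closure_rate_within(items, days=90):
    """Fraction of action items closed within `days` of being opened --
    the '> 80% within 90 days' strong signal from the quality table."""
    if not items:
        return 0.0
    on_time = sum(
        1 for it in items
        if it["closed"] is not None and (it["closed"] - it["opened"]).days <= days
    )
    return on_time / len(items)

rate = closure_rate_within(tracked_items)
# 1 of 3 closed within 90 days: PM-101 closed in 22 days, PM-102 still open,
# PM-103 closed after 136 days
```

Publishing this number monthly, alongside the list of open items, is what turns the quality table from aspiration into enforcement.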
The Normalization of Deviance in Postmortem Culture
Diane Vaughan's concept of normalization of deviance — originally developed to explain the Challenger disaster — applies directly to postmortem culture. When postmortems consistently fail to produce action items, or when action items consistently fail to be implemented, this becomes the new normal. Each cycle of "postmortem filed, action items ignored" makes the next cycle more likely. Teams normalize the dysfunction until a major incident breaks the pattern — often painfully. The antidote is explicit measurement and leadership attention: track action item closure rates and present them to engineering leadership quarterly.
After mastering the standard postmortem process, the next level is developing analytical sophistication: moving beyond 5 Whys to techniques that reveal the systemic conditions that make entire classes of failures possible, not just the proximate cause of the specific incident you are analyzing. These techniques are used by Google's most senior SREs, Netflix's resilience engineering team, and the aviation safety investigators whose practices inspired the blameless postmortem movement.
Learning reviews extend the standard postmortem to examine not just what the system did, but how the organization responded — and what the organization's response reveals about its resilience capabilities. A learning review asks: what adaptive capacity did our engineers demonstrate during this incident? What information management practices helped (or hindered) their decision-making? Where did the response go better than our formal procedures would predict — and why? Where did coordination break down despite everyone following the process? These questions surface organizational capabilities and gaps that the standard postmortem structure misses entirely.
John Allspaw's "Debriefing Facilitation Guide" Technique
John Allspaw and his colleagues at Etsy, and later through the SNAFUcatchers research consortium, developed a learning review methodology based on interview-driven debriefs. Rather than asking participants to fill out a postmortem template, the facilitator interviews each key participant individually before the group review, asking: "Walk me through your experience of the incident from your position. What were you thinking when X happened? What information did you have at that point? What were you uncertain about?" These individual accounts are assembled into a rich multi-perspective timeline that reveals coordination gaps, information asymmetries, and decision-making context that never appears in a template-driven postmortem.
Normalization of deviance analysis goes beyond identifying what went wrong in this specific incident to ask: what practices in our system have been gradually drifting from their designed safe operating state? Diane Vaughan's analysis of the Challenger disaster revealed that NASA engineers had slowly normalized the observation of O-ring erosion — each incident of erosion without disaster made them slightly more comfortable with the risk, until the deviation from the original design specification was so large that catastrophic failure was virtually certain. The same pattern occurs in software systems: organizations gradually normalize degraded alert thresholds ("it always fires at this level, we have learned to ignore it"), deferred maintenance ("we will fix that technical debt when we have time"), and accepted risks ("we know the backup system is unreliable but we have not had to use it yet").
Advanced root cause analysis techniques beyond 5 Whys
The Aggregate Postmortem Review: Quarterly Pattern Analysis
Every quarter, the SRE team should conduct an aggregate review of all postmortems from the past quarter. Create two lists: (1) Which failure modes appeared more than once? Any failure mode that appeared twice in a quarter is a systemic issue that warrants a dedicated investigation. (2) Which action item categories are consistently not being closed? If "improve monitoring" action items are consistently being written and consistently not implemented, that is a resourcing and prioritization problem that must be escalated to engineering leadership. These quarterly reviews are how postmortem analysis evolves from reactive (what happened?) to proactive (what is structurally likely to happen next?).
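The first list in the quarterly review — failure modes appearing more than once — is a simple frequency count over a tagged postmortem database. A sketch under the assumption that each postmortem record carries a list of failure-mode tags (the tag names and record format here are illustrative):

```python
from collections import Counter

# Hypothetical postmortem records with tagged failure modes, as they might
# appear in a simple postmortem database export for one quarter.
postmortems = [
    {"id": "PM-2024-03", "failure_modes": ["config-push", "missing-alert"]},
    {"id": "PM-2024-07", "failure_modes": ["cert-expiry"]},
    {"id": "PM-2024-11", "failure_modes": ["config-push", "slow-rollback"]},
]

def recurring_failure_modes(records, threshold=2):
    """Failure modes appearing in >= threshold postmortems this quarter --
    each one warrants a dedicated systemic investigation."""
    counts = Counter(mode for pm in records for mode in pm["failure_modes"])
    return {mode: n for mode, n in counts.items() if n >= threshold}

recurring = recurring_failure_modes(postmortems)
# {"config-push": 2}: appeared twice in one quarter, so it is a systemic issue
```

The hard part is not the counting but the tagging discipline: the aggregate review only works if postmortem authors apply a shared failure-mode taxonomy consistently.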
```mermaid
flowchart LR
    A[Individual Incident] --> B[Standard Postmortem<br/>Timeline + 5 Whys + Actions]
    B --> C[Action Items Closed]
    C --> D[Postmortem Database]
    D --> E[Quarterly Aggregate Review<br/>Pattern analysis across postmortems]
    E --> F[Systemic Themes Identified<br/>Recurring failure modes, chronic gaps]
    F --> G[Architectural / Org Investments<br/>Address root causes at system level]
    G --> H[Reduced Incident Rate<br/>System reliability improves over time]
    H --> A
    I[Pre-mortem Analysis<br/>Prospective failure analysis] --> J[Vulnerabilities Identified<br/>Before they cause incidents]
    J --> G
```
The mature postmortem program: individual postmortems feed a database; quarterly aggregate reviews identify systemic patterns; systemic investments reduce the incident rate over time. The addition of pre-mortem analysis closes the loop by preventing incidents before they occur.
Publishing Your Postmortems: The Trust Dividend
GitLab, GitHub, and Stripe publish their postmortems publicly. This practice generates a trust dividend that is disproportionate to its cost. Users who see an honest, thorough postmortem from a company that experienced an outage are more likely to continue trusting that company than users who receive only a brief apology. The companies that publish great postmortems are demonstrating that they take reliability seriously, that they learn from failures, and that they are transparent with their users. This signal is more valuable for long-term user trust than any marketing campaign. If your organization is not ready to publish postmortems externally, start by sharing them broadly within the company — the cultural benefits of transparency at any scope are real.
Postmortem questions appear in SRE and senior engineering interviews across three formats: behavioral questions asking you to describe a real postmortem you wrote ("walk me through an incident and how you analyzed it"), system design questions asking how you would design for observability and post-incident learning, and organizational design questions asking how you would build a postmortem culture at a company that currently has none. At Staff+ level, the interviewer is specifically evaluating whether you think about reliability as a systems and organizational problem, not just a technical one — and postmortems are the primary organizational mechanism for system-level learning.
Common questions:
Try this question: Ask the interviewer: Does your organization have a searchable postmortem database? How are postmortem action items tracked, and what is the typical closure rate? Has your team ever experienced a postmortem graveyard and how did you address it? Are postmortems shared cross-functionally or kept within teams? These questions signal that you understand postmortems as an organizational and cultural practice, not just a documentation process.
Strong answer: Describing the accountability-versus-blame distinction clearly and specifically. Referencing Sidney Dekker or Just Culture. Knowing the GitLab or GitHub postmortems and what made them excellent. Describing concrete experiences where postmortem action items actually prevented a class of future incidents. Understanding the organizational dynamics of the postmortem graveyard anti-pattern. Using the NTSB or aviation analogy for blameless investigation. Measuring postmortem health through action item closure rates, not just postmortem count.
Red flags: Conflating postmortems with blame sessions or thinking blamelessness means no accountability. Not knowing what a root cause is versus a contributing factor. Describing action items in vague terms ("improve the process", "add more monitoring"). Having no opinion on what makes a postmortem action item effective versus ineffective. Treating postmortems as overhead rather than as an investment. Not being able to describe a real postmortem they have written.
Key takeaways
You are writing a postmortem for a database outage. Your 5 Whys analysis arrives at "the database administrator ran the wrong command." A teammate says this is the root cause. Is it? What would you say?
"The administrator ran the wrong command" is a description of what happened, not a root cause — it is a human action, not a system property. Ask: why was running the wrong command possible? What properties of the system — access controls, confirmation steps, environment labels, automated safeguards — should have prevented or detected the error? A true root cause for this incident is something like: "the operational tooling permitted destructive commands to execute against the production database without an environment confirmation step or automated pre-operation backup verification." That root cause produces actionable items. "Administrator ran the wrong command" produces no action item that would actually prevent recurrence.
Your team has written 12 postmortems in the past 6 months. Looking back, only 3 of the 12 postmortem action items have been fully closed. What is the problem, and how do you fix it?
A 25% action item closure rate is a postmortem graveyard — the most common and most destructive postmortem anti-pattern. The problem is not the quality of the postmortems themselves; it is a failure of follow-through driven by organizational prioritization. Action items are being treated as optional "reliability backlog" items rather than as first-class engineering work. The fix has three parts: (1) Pull all open postmortem action items into sprint planning as explicit work items competing directly with feature work — not in a separate backlog. (2) Conduct a monthly action item review meeting where overdue items are escalated to the engineering manager with a concrete re-negotiated due date. (3) Present the closure rate (25%) to engineering leadership with the list of open items and their original due dates. Leadership visibility on this metric creates organizational pressure to treat reliability work as real work.
A senior engineer on your team suggests that postmortems should identify the specific engineer responsible for the error so that the organization can ensure that engineer receives appropriate training. How do you respond?
This proposal, however well-intentioned, would destroy the postmortem culture. When engineers learn that postmortems lead to identification of "the responsible person" — even for training rather than punishment — they will begin editing their accounts. They will omit the decision they regret, frame the timeline to minimize their apparent culpability, and produce a postmortem that is diplomatically written rather than technically accurate. An inaccurate postmortem produces wrong root causes and ineffective action items. The incident recurs. The response is: individual training needs are identified and addressed through the normal manager relationship, entirely separate from the postmortem process. The postmortem exclusively analyzes system properties. Dekker's "Just Culture" framework is worth sharing with the senior engineer: accountability for action items (who implements the improvement?) yes; blame for decisions that led to the incident, no.
💡 Analogy
Aviation accident investigation is the mental model that best captures the spirit of blameless postmortems. The United States National Transportation Safety Board (NTSB) investigates every aviation accident — not to blame pilots, but to find systemic improvements. When a plane crashes, NTSB investigators do not ask "who made a mistake?" They ask "what conditions made this mistake possible?" and "how do we change the system so this can never happen again?" The NTSB has the authority to subpoena records, compel testimony, and produce findings — but its findings are explicitly non-punitive. FAA disciplinary actions are handled separately, and NTSB investigators have no role in them. This institutional separation of "learning about what happened" from "determining consequences for who did it" is the exact architecture of a blameless postmortem program. The result speaks for itself: commercial aviation in the United States has become statistically the safest form of transportation in history, with fatal accidents per passenger mile measured in billionths. The aviation industry achieves this by treating every single accident and near-miss as a system learning event — never, ever as an opportunity to blame a pilot. Every time you read a postmortem, imagine an NTSB investigator reading over your shoulder asking: "What does this tell us about the system? What would we need to change so that a different pilot, under different circumstances, could not make the same mistake?"
⚡ Core Idea
A postmortem is not documentation about an incident. It is an investment in the reliability of the system and the capability of the organization. The timeline, root cause analysis, and contributing factors are the research. The action items are the return on investment. An incident that produces no changes to the system has cost the organization the full price of the outage with no return. An incident that produces thorough root cause analysis and implemented action items pays compound interest — it prevents future incidents, trains future engineers, and builds the organizational memory that makes reliability possible at scale.
🎯 Why It Matters
Postmortems are the mechanism by which engineering organizations become smarter than they were before each failure. Without them, organizations repeat the same incidents quarterly, lose the knowledge when engineers leave, and make reliability investments based on intuition rather than evidence. With them, organizations build the searchable, institutional memory that transforms incidents from pure costs into investments in future reliability. The companies that are reliably most reliable — Google, Netflix, Amazon — are not the ones that have the fewest incidents; they are the ones that learn the most from every incident they do have.