Detect, contain, eradicate, and recover from security incidents — with runbooks, timelines, and the blameless postmortem culture that turns incidents into systemic improvements.
Security incident response follows the same lifecycle across almost every incident type. The phases differ in timing and tactics but the structure is universal. NIST SP 800-61 defines the framework; the DevSecOps version adds automation, pipeline integration, and blameless postmortem culture.
The five phases
Detection → Containment → Eradication → Recovery → Postmortem: each phase hands off to the next, and the postmortem feeds improvements back into detection.
Speed matters — especially in containment
A credential exposed on GitHub can be found by automated scanners within seconds. API keys scraped from public repos are tested within minutes. Every second between exposure and revocation is active risk. Automated containment (e.g., GitHub secret scanning auto-revokes some provider tokens) dramatically reduces the window.
You cannot respond to incidents you do not detect. Detection must be built proactively — before incidents happen. The time to set up GuardDuty is not during a breach; it is on day one.
| Detection source | What it catches | Primary tool | Mean time to detect |
|---|---|---|---|
| Secret scanning | API keys, passwords committed to git | GitHub Secret Scanning, GitGuardian, TruffleHog | Seconds to minutes (real-time scan on push) |
| Runtime threat detection | Unusual API calls, privilege escalation, lateral movement | AWS GuardDuty, Microsoft Defender, CrowdStrike | Minutes (near-real-time) |
| SIEM correlation | Multi-event attack patterns (failed auth → success → bulk access) | Splunk, Datadog SIEM, OpenSearch | Minutes to hours (rule-based) |
| Anomaly detection | Deviations from baseline (unusual data volume, new geolocation) | Datadog APM, Elastic ML, AWS CloudWatch | Minutes to hours |
| Container runtime | Process execution, file writes, network calls in containers | Falco, Tetragon, Aqua Runtime | Seconds (in-kernel) |
| External disclosure | Bug bounty report, customer report, researcher notification | HackerOne, email, support ticket | Hours to days after event |
Mean Time to Detect (MTTD) is your most important metric
Dwell time — the time between an attacker gaining access and detection — had a reported global median of 21 days (Mandiant M-Trends 2022). Every day of dwell time is a day of additional data exposure. Your goal is to reduce MTTD to minutes, not days.
A runbook (or playbook) is a documented, step-by-step response procedure for a specific incident type. In DevOps, runbooks live in the repo, are reviewed in PRs, and are tested before incidents happen. Automated runbooks use tools like PagerDuty Runbook Automation, AWS Systems Manager, or custom scripts.
The three runbook anti-patterns: (1) runbook does not exist for the incident type, (2) runbook exists but is out of date, (3) runbook exists but lives in a system that is down during the incident. Store runbooks in a separate system from your primary infrastructure.
```markdown
# Runbook: Exposed Credential Response
# Severity: P1 (if key was used) / P2 (if not used)
# Owner: Security team + service owner
# Last tested: [date] | Last updated: [date]

## Detection criteria
This runbook applies when:
- GitHub Secret Scanning / GitGuardian alerts on a committed credential
- CloudTrail shows API calls from an unexpected source IP for a service key
- A developer reports accidentally committing a secret

## Severity triage (first 5 minutes)
1. Was the commit to a PUBLIC or private repository?
   - Public: immediate P1 — bots scan GitHub continuously
   - Private: P2 unless logs show it was accessed externally
2. Was the key USED after the exposure timestamp?
   - Query CloudTrail: filter by the key's access key ID, timeframe after commit
   - If yes: escalate to P1 regardless of repo visibility
3. What PERMISSIONS does the key have?
   - Read-only data key: lower severity
   - Admin/write access: P1 regardless of usage evidence

## Containment (target: < 15 minutes from detection)
- [ ] Step 1: Revoke the credential immediately
  - AWS IAM key: aws iam delete-access-key --access-key-id AKIA...
  - Stripe key: Stripe dashboard > API keys > Roll key
  - Generic: provider console > security > revoke
- [ ] Step 2: Rotate related credentials
  - Identify what else is in the same credential store
  - Rotate DB passwords if co-located with the leaked key
- [ ] Step 3: Block active abuse (if key was used)
  - Identify source IPs from CloudTrail logs
  - Block in WAF / security group if still active
- [ ] Step 4: Notify incident channel
  - Slack #incident-response: "P[N] credential leak declared"
  - Page service owner if outside business hours

## Eradication (target: < 4 hours)
- [ ] Remove secret from git history
  - git filter-repo --path-glob '*.env' --invert-paths
  - Force push and notify all developers to re-clone
- [ ] Scan for other occurrences
  - TruffleHog: trufflehog git --only-verified file://./repo
  - Grep across all repos: gh search code "AKIA[A-Z0-9]{16}"
- [ ] Audit what was accessed
  - CloudTrail query: last 30 days for the compromised key
  - Document scope of access for postmortem

## Recovery
- [ ] Deploy with new credentials from secrets manager
  - Never hardcode the new credential; use Vault or AWS Secrets Manager
- [ ] Verify application functionality end-to-end
- [ ] Notify affected users if customer data was accessed (legal review)

## Postmortem triggers
- Always write a postmortem for P1 incidents
- For P2: write if root cause is systemic (not a one-time human error)
- Template: runbooks/postmortem-template.md
```

Notes called out in the runbook above:
- Include severity, owner, and test date in the header — stale runbooks are dangerous.
- Triage within 5 minutes: public repo + used key = immediate P1.
- Checkboxes: runbooks should be executable, not just readable — each step is an action.
- Include the exact CLI command — during an incident, looking up syntax wastes critical minutes.
- git filter-repo removes the secret from history; all forks still have it — notify GitHub to remove forks too.
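The five-minute triage in the runbook is mechanical enough to encode so responders do not have to reason under pressure. A minimal sketch (the function name and the P1/P2 strings are illustrative, not from any standard library):

```python
def triage_severity(public_repo: bool, key_used_after_exposure: bool,
                    admin_or_write_key: bool) -> str:
    """Classify an exposed-credential incident per the runbook's triage rules."""
    # Public repo: bots scan GitHub continuously, so exposure means immediate P1.
    # Key used after exposure: P1 regardless of repo visibility.
    # Admin/write permissions: P1 regardless of usage evidence.
    if public_repo or key_used_after_exposure or admin_or_write_key:
        return "P1"
    # Private repo, key unused, limited permissions: P2.
    return "P2"

# Example: private repo, unused read-only key -> P2
print(triage_severity(False, False, False))  # P2
```

Encoding the rules this way also makes them testable in CI, so a runbook change and its automation cannot drift apart.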
A blameless postmortem investigates what systems and processes failed — not who made a mistake. In a blame culture, people hide incidents and avoid admitting errors. In a blameless culture, people report incidents quickly and participate honestly in reviews, because they know the outcome is improvement, not punishment.
This is not about no accountability — it is about acknowledging that in complex systems, humans make decisions based on the information and tools available to them. The goal is to improve the system so the same decision does not cause the same outcome next time.
Blameless postmortem structure (within 48–72 hours of incident)
1. Timeline: construct a factual, chronological timeline of events — what happened, when, what signals were visible at each point
2. Contributing factors: what system conditions, tool limitations, or process gaps contributed? No "human error" as a root cause — what made the human error possible?
3. Detection: how was this detected? Could it have been detected earlier? What detection gap allowed the incident to progress to this severity?
4. Response effectiveness: what went well in the response? What slowed the team down? (Tool access issues, unclear runbook, missing automation)
5. Action items: specific, assigned, time-bounded improvements — new detection rule, automated containment script, updated runbook, training session
6. Publish and share: postmortems should be shared broadly — similar incidents at other teams can be prevented
Five Whys for security incidents
A secret was committed → Why? No pre-commit hook was configured. Why? The developer did not know about it. Why? Onboarding documentation did not cover it. Why? Documentation was written when secret scanning was not standard practice. Why? The team grew quickly and documentation did not keep up. Root cause: onboarding gap + missing standard tooling. Fix: add Gitleaks to the standard developer environment setup script.
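The fix from that Five Whys chain can itself be scripted. A sketch of wiring Gitleaks in as a pre-commit hook, assuming the gitleaks binary is already installed and on PATH (a team setup script would typically install it first):

```shell
# Sketch: add Gitleaks as a pre-commit hook in a developer's clone.
# Assumes `gitleaks` is installed; uses the default .git/hooks location.
mkdir -p .git/hooks
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
# Scan only staged changes; a non-zero exit blocks the commit.
exec gitleaks protect --staged --verbose
EOF
chmod +x .git/hooks/pre-commit
```

In practice teams often manage this via the pre-commit framework or a shared hooks directory, so the hook cannot be silently skipped on new clones.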
Security incident response does not exist in isolation from the engineering pipeline. The same tools that build and deploy your software are your fastest response path. Many containment and eradication actions can be automated.
Automation opportunities in incident response
NIST SP 800-61 and regulatory requirements
NIST SP 800-61 (Computer Security Incident Handling Guide) is the foundational framework. PCI-DSS Requirement 12.10 mandates a formal incident response plan. HIPAA requires response procedures for security incidents. GDPR requires breach notification to the supervisory authority within 72 hours of the organization becoming aware of the breach. Build your runbooks with these timelines in mind.
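The GDPR clock is strict enough to compute rather than estimate. A trivial sketch (the function name is illustrative):

```python
from datetime import datetime, timedelta, timezone

def gdpr_notification_deadline(became_aware_at: datetime) -> datetime:
    """Latest time to notify the supervisory authority (GDPR Art. 33: 72 hours)."""
    return became_aware_at + timedelta(hours=72)

aware = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)
print(gdpr_notification_deadline(aware))  # 2024-05-04 09:00:00+00:00
```

A runbook's notification step can surface this deadline in the incident channel automatically when the incident is declared.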
Incident response questions appear in DevSecOps and senior engineering interviews as scenario questions: "An alert fires at 2 AM showing unusual API traffic — what do you do?" Strong candidates structure their answer using the five phases and mention specific automation. Expect questions about postmortem culture and regulatory timelines at companies in regulated industries.
Common questions:
Strong answer: Mentioning that they would preserve logs before taking action (forensics). Knowing that public repo credential exposure requires immediate revocation (bots scrape in seconds). Discussing automated containment. Mentioning GDPR notification timelines. Asking clarifying questions about scope and severity before answering.
Red flags: Jumping straight to "I would fix the bug" without containment steps. Treating incidents as one-time events rather than systemic learning opportunities. Not knowing what a blameless postmortem is. Thinking the incident is "over" after patching without doing a postmortem.
Quick check · Security Incident Response in DevOps
You are on-call and GuardDuty alerts that an EC2 instance is making unusual API calls to exfiltrate data. List the first three containment actions.
(1) Apply a "quarantine" security group via AWS CLI/Terraform that blocks all inbound and outbound traffic — this isolates the instance while preserving its state for forensics. (2) Revoke the IAM role credentials attached to the instance (or detach the role). (3) Snapshot the instance volumes and capture VPC Flow Logs and CloudTrail events for forensic analysis. Do not terminate the instance yet — forensic data is needed.
What is the difference between Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR)?
MTTD is the time from when a breach or incident begins to when the team detects it (the gap where the attacker operates undetected — also called dwell time). MTTR is the time from when the incident is detected to when full service is restored. Both matter: MTTD determines how much damage occurs before you respond; MTTR determines how long you are impacted. A low MTTD is more valuable than a low MTTR because it limits the attacker's window.
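In concrete terms, the per-incident values that feed these means are just timestamp differences. A small sketch with hypothetical timestamps:

```python
from datetime import datetime, timedelta

# Hypothetical incident timestamps
breach_start = datetime(2024, 3, 1, 2, 0)    # attacker gains access
detected     = datetime(2024, 3, 1, 14, 0)   # GuardDuty alert fires
recovered    = datetime(2024, 3, 1, 18, 30)  # service fully restored

ttd = detected - breach_start   # dwell time: 12 hours of undetected access
ttr = recovered - detected      # response window: 4h30m to restore service

print(ttd, ttr)  # 12:00:00 4:30:00
```

MTTD and MTTR are then the means of these values across incidents, which is why a single long-dwell incident can dominate the metric.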
From the books
The Site Reliability Workbook (Google SRE, 2018)
Chapter 8: On-Call, Chapter 10: Postmortems
Chapters 8 and 10 cover on-call practices and the postmortem culture that apply directly to security incident response. The blameless postmortem template is widely adopted across engineering teams.
💡 Analogy
A hospital trauma team protocol
⚡ Core Idea
Trauma teams do not improvise — they follow practiced protocols under pressure. Each person has a defined role, actions are prioritized by what stops the bleeding first, and documentation happens throughout. Security incident response applies the same structure: pre-defined roles, prioritized actions, automated where possible, documented throughout.
🎯 Why It Matters
IBM's 2023 Cost of a Data Breach Report found that organizations with practiced incident response plans contained breaches 54 days faster (median) than those without. Faster containment directly reduces breach cost: the same report found contained breaches cost $1.49M less on average than uncontained ones.