Detect, contain, eradicate, and recover from security incidents — with runbooks, timelines, and the blameless postmortem culture that turns incidents into systemic improvements.
Security incident response follows the same lifecycle across almost every incident type. The phases differ in timing and tactics but the structure is universal. NIST SP 800-61 defines the framework; the DevSecOps version adds automation, pipeline integration, and blameless postmortem culture.
The five phases
Detection → Containment → Eradication → Recovery → Postmortem: each phase hands off to the next, and the postmortem feeds improvements back into detection.
Speed matters — especially in containment
A credential exposed on GitHub can be found by automated scanners within seconds. API keys scraped from public repos are tested within minutes. Every second between exposure and revocation is active risk. Automated containment (e.g., GitHub secret scanning auto-revokes some provider tokens) dramatically reduces the window.
You cannot respond to incidents you do not detect. Detection must be built proactively — before incidents happen. The time to set up GuardDuty is not during a breach; it is on day one.
| Detection source | What it catches | Primary tool | Mean time to detect |
|---|---|---|---|
| Secret scanning | API keys, passwords committed to git | GitHub Secret Scanning, GitGuardian, TruffleHog | Seconds to minutes (real-time scan on push) |
| Runtime threat detection | Unusual API calls, privilege escalation, lateral movement | AWS GuardDuty, Microsoft Defender, CrowdStrike | Minutes (near-real-time) |
| SIEM correlation | Multi-event attack patterns (failed auth → success → bulk access) | Splunk, Datadog SIEM, OpenSearch | Minutes to hours (rule-based) |
| Anomaly detection | Deviations from baseline (unusual data volume, new geolocation) | Datadog APM, Elastic ML, AWS CloudWatch | Minutes to hours |
| Container runtime | Process execution, file writes, network calls in containers | Falco, Tetragon, Aqua Runtime | Seconds (in-kernel) |
| External disclosure | Bug bounty report, customer report, researcher notification | HackerOne, email, support ticket | Hours to days after event |
Mean Time to Detect (MTTD) is your most important metric
Dwell time — the time between an attacker gaining access and detection — had a reported global median of 21 days (Mandiant M-Trends 2022). Every day of dwell time is a day of additional data exposure. Your goal is to reduce MTTD to minutes, not days.
A runbook (or playbook) is a documented, step-by-step response procedure for a specific incident type. In DevOps, runbooks live in the repo, are reviewed in PRs, and are tested before incidents happen. Automated runbooks use tools like PagerDuty Runbook Automation, AWS Systems Manager, or custom scripts.
The three runbook anti-patterns: (1) runbook does not exist for the incident type, (2) runbook exists but is out of date, (3) runbook exists but lives in a system that is down during the incident. Store runbooks in a separate system from your primary infrastructure.
```markdown
# Runbook: Exposed Credential Response
# Severity: P1 (if key was used) / P2 (if not used)
# Owner: Security team + service owner
# Last tested: [date] | Last updated: [date]

## Detection criteria
This runbook applies when:
- GitHub Secret Scanning / GitGuardian alerts on a committed credential
- CloudTrail shows API calls from an unexpected source IP for a service key
- A developer reports accidentally committing a secret

## Severity triage (first 5 minutes)
1. Was the commit to a PUBLIC or private repository?
   - Public: immediate P1 — bots scan GitHub continuously
   - Private: P2 unless logs show it was accessed externally
2. Was the key USED after the exposure timestamp?
   - Query CloudTrail: filter by the key's access key ID, timeframe after commit
   - If yes: escalate to P1 regardless of repo visibility
3. What PERMISSIONS does the key have?
   - Read-only data key: lower severity
   - Admin/write access: P1 regardless of usage evidence

## Containment (target: < 15 minutes from detection)
- [ ] Step 1: Revoke the credential immediately
  - AWS IAM key: aws iam delete-access-key --access-key-id AKIA...
  - Stripe key: Stripe dashboard > API keys > Roll key
  - Generic: provider console > security > revoke
- [ ] Step 2: Rotate related credentials
  - Identify what else is in the same credential store
  - Rotate DB passwords if co-located with the leaked key
- [ ] Step 3: Block active abuse (if key was used)
  - Identify source IPs from CloudTrail logs
  - Block in WAF / security group if still active
- [ ] Step 4: Notify incident channel
  - Slack #incident-response: "P[N] credential leak declared"
  - Page service owner if outside business hours

## Eradication (target: < 4 hours)
- [ ] Remove secret from git history
  - git filter-repo --path-glob '*.env' --invert-paths
  - Force push and notify all developers to re-clone
- [ ] Scan for other occurrences
  - TruffleHog: trufflehog git --only-verified file://./repo
  - Grep across all repos: gh search code "AKIA[A-Z0-9]{16}"
- [ ] Audit what was accessed
  - CloudTrail query: last 30 days for the compromised key
  - Document scope of access for postmortem

## Recovery
- [ ] Deploy with new credentials from secrets manager
  - Never hardcode the new credential; use Vault or AWS Secrets Manager
- [ ] Verify application functionality end-to-end
- [ ] Notify affected users if customer data was accessed (legal review)

## Postmortem triggers
- Always write a postmortem for P1 incidents
- For P2: write if root cause is systemic (not a one-time human error)
- Template: runbooks/postmortem-template.md
```

Notes called out in the runbook above:
- Include severity, owner, and test date in the header — stale runbooks are dangerous.
- Triage within 5 minutes: public repo + used key = immediate P1.
- Checkboxes: runbooks should be executable, not just readable — each step is an action.
- Include the exact CLI command — during an incident, looking up syntax wastes critical minutes.
- git filter-repo removes the secret from history; all forks still have it — notify GitHub to remove forks too.
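The five-minute triage in the runbook is mechanical enough to encode so responders do not have to reason under pressure. A minimal sketch (the function name and the P1/P2 strings are illustrative, not from any standard library):

```python
def triage_severity(public_repo: bool, key_used_after_exposure: bool,
                    admin_or_write_key: bool) -> str:
    """Classify an exposed-credential incident per the runbook's triage rules."""
    # Public repo: bots scan GitHub continuously, so exposure means immediate P1.
    # Key used after exposure: P1 regardless of repo visibility.
    # Admin/write permissions: P1 regardless of usage evidence.
    if public_repo or key_used_after_exposure or admin_or_write_key:
        return "P1"
    # Private repo, key unused, limited permissions: P2.
    return "P2"

# Example: private repo, unused read-only key -> P2
print(triage_severity(False, False, False))  # P2
```

Encoding the rules this way also makes them testable in CI, so a runbook change and its automation cannot drift apart.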
A blameless postmortem investigates what systems and processes failed — not who made a mistake. In a blame culture, people hide incidents and avoid admitting errors. In a blameless culture, people report incidents quickly and participate honestly in reviews, because they know the outcome is improvement, not punishment.
This is not about no accountability — it is about acknowledging that in complex systems, humans make decisions based on the information and tools available to them. The goal is to improve the system so the same decision does not cause the same outcome next time.
Blameless postmortem structure (within 48–72 hours of incident)
1. Timeline: construct a factual, chronological timeline of events — what happened, when, what signals were visible at each point
2. Contributing factors: what system conditions, tool limitations, or process gaps contributed? No "human error" as a root cause — what made the human error possible?
3. Detection: how was this detected? Could it have been detected earlier? What detection gap allowed the incident to progress to this severity?
4. Response effectiveness: what went well in the response? What slowed the team down? (Tool access issues, unclear runbook, missing automation)
5. Action items: specific, assigned, time-bounded improvements — new detection rule, automated containment script, updated runbook, training session
6. Publish and share: postmortems should be shared broadly — similar incidents at other teams can be prevented
Five Whys for security incidents
A secret was committed → Why? No pre-commit hook was configured. Why? The developer did not know about it. Why? Onboarding documentation did not cover it. Why? Documentation was written when secret scanning was not standard practice. Why? The team grew quickly and documentation did not keep up. Root cause: onboarding gap + missing standard tooling. Fix: add Gitleaks to the standard developer environment setup script.
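The fix from that Five Whys chain can itself be scripted. A sketch of wiring Gitleaks in as a pre-commit hook, assuming the gitleaks binary is already installed and on PATH (a team setup script would typically install it first):

```shell
# Sketch: add Gitleaks as a pre-commit hook in a developer's clone.
# Assumes `gitleaks` is installed; uses the default .git/hooks location.
mkdir -p .git/hooks
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
# Scan only staged changes; a non-zero exit blocks the commit.
exec gitleaks protect --staged --verbose
EOF
chmod +x .git/hooks/pre-commit
```

In practice teams often manage this via the pre-commit framework or a shared hooks directory, so the hook cannot be silently skipped on new clones.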
Security incident response does not exist in isolation from the engineering pipeline. The same tools that build and deploy your software are your fastest response path. Many containment and eradication actions can be automated.
Automation opportunities in incident response
NIST SP 800-61 and regulatory requirements
NIST SP 800-61 (Computer Security Incident Handling Guide) is the foundational framework. PCI-DSS Requirement 12.10 mandates a formal incident response plan. HIPAA requires response procedures for security incidents. GDPR requires breach notification to the supervisory authority within 72 hours of the organization becoming aware of the breach. Build your runbooks with these timelines in mind.
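The GDPR clock is strict enough to compute rather than estimate. A trivial sketch (the function name is illustrative):

```python
from datetime import datetime, timedelta, timezone

def gdpr_notification_deadline(became_aware_at: datetime) -> datetime:
    """Latest time to notify the supervisory authority (GDPR Art. 33: 72 hours)."""
    return became_aware_at + timedelta(hours=72)

aware = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)
print(gdpr_notification_deadline(aware))  # 2024-05-04 09:00:00+00:00
```

A runbook's notification step can surface this deadline in the incident channel automatically when the incident is declared.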
Incident response questions appear in DevSecOps and senior engineering interviews as scenario questions: "An alert fires at 2 AM showing unusual API traffic — what do you do?" Strong candidates structure their answer using the five phases and mention specific automation. Expect questions about postmortem culture and regulatory timelines at companies in regulated industries.
Common questions:
Strong answer: Mentioning that they would preserve logs before taking action (forensics). Knowing that public repo credential exposure requires immediate revocation (bots scrape in seconds). Discussing automated containment. Mentioning GDPR notification timelines. Asking clarifying questions about scope and severity before answering.
Red flags: Jumping straight to "I would fix the bug" without containment steps. Treating incidents as one-time events rather than systemic learning opportunities. Not knowing what a blameless postmortem is. Thinking the incident is "over" after patching without doing a postmortem.
Quick check · Security Incident Response in DevOps
You are on-call and GuardDuty alerts that an EC2 instance is making unusual API calls to exfiltrate data. List the first three containment actions.
(1) Apply a "quarantine" security group via AWS CLI/Terraform that blocks all inbound and outbound traffic — this isolates the instance while preserving its state for forensics. (2) Revoke the IAM role credentials attached to the instance (or detach the role). (3) Snapshot the instance volumes and capture VPC Flow Logs and CloudTrail events for forensic analysis. Do not terminate the instance yet — forensic data is needed.
What is the difference between Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR)?
MTTD is the time from when a breach or incident begins to when the team detects it (the gap where the attacker operates undetected — also called dwell time). MTTR is the time from when the incident is detected to when full service is restored. Both matter: MTTD determines how much damage occurs before you respond; MTTR determines how long you are impacted. A low MTTD is more valuable than a low MTTR because it limits the attacker's window.
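In concrete terms, the per-incident values that feed these means are just timestamp differences. A small sketch with hypothetical timestamps:

```python
from datetime import datetime, timedelta

# Hypothetical incident timestamps
breach_start = datetime(2024, 3, 1, 2, 0)    # attacker gains access
detected     = datetime(2024, 3, 1, 14, 0)   # GuardDuty alert fires
recovered    = datetime(2024, 3, 1, 18, 30)  # service fully restored

ttd = detected - breach_start   # dwell time: 12 hours of undetected access
ttr = recovered - detected      # response window: 4h30m to restore service

print(ttd, ttr)  # 12:00:00 4:30:00
```

MTTD and MTTR are then the means of these values across incidents, which is why a single long-dwell incident can dominate the metric.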
From the books
The Site Reliability Workbook (Google SRE, 2018)
Chapter 8: On-Call, Chapter 10: Postmortems
Chapters 8 and 10 cover on-call practices and the postmortem culture that apply directly to security incident response. The blameless postmortem template is widely adopted across engineering teams.
💡 Analogy
A hospital trauma team protocol
⚡ Core Idea
Trauma teams do not improvise — they follow practiced protocols under pressure. Each person has a defined role, actions are prioritized by what stops the bleeding first, and documentation happens throughout. Security incident response applies the same structure: pre-defined roles, prioritized actions, automated where possible, documented throughout.
🎯 Why It Matters
IBM's 2023 Cost of a Data Breach Report found that organizations with practiced incident response plans contained breaches 54 days faster (median) than those without. Faster containment directly reduces breach cost: the same report found contained breaches cost $1.49M less on average than uncontained ones.