The Simplified Tech

© 2026 TheSimplifiedTech. All rights reserved.


Disaster Recovery (DR)

RTO, RPO, and the four DR strategies — backup/restore, pilot light, warm standby, and active-active — with the cost and operational complexity trade-offs that determine which strategy fits each workload.

🎯Key Takeaways
RTO = maximum acceptable downtime; RPO = maximum acceptable data loss — these two numbers drive all DR architectural decisions
The four DR strategies are backup/restore (cheapest), pilot light, warm standby, and active-active (most expensive) — in order of increasing cost and decreasing RTO
Multi-AZ protects against data centre failures; multi-region DR protects against regional events — they are different threat models
An untested DR plan is not a DR plan — run quarterly drills and measure actual RTO and RPO
Cross-region RDS read replicas are the cornerstone of Pilot Light and Warm Standby DR patterns


RTO and RPO: The Two Numbers That Define Your DR

Recovery Time Objective (RTO) is the maximum acceptable duration of a service outage — how long until you must be back online. An RTO of 4 hours means you have 4 hours to recover before the business impact becomes unacceptable.

Recovery Point Objective (RPO) is the maximum acceptable data loss, measured in time: how old the most recent good copy of your data can be when you recover. An RPO of 1 hour means you can lose at most 1 hour of transactions.

RTO and RPO drive every architectural decision in DR

If your RPO is 24 hours, daily backups are sufficient. If your RPO is 15 minutes, you need continuous replication. If your RTO is 4 hours, backup/restore might work. If your RTO is 15 minutes, you need pre-provisioned infrastructure in the DR region. The cost of DR scales roughly inversely with RTO and RPO — tighter requirements cost dramatically more.

| RTO / RPO requirement | Strategy | Annual cost premium over single-region | Recovery mechanism |
| --- | --- | --- | --- |
| Hours to days / Hours to days | Backup & Restore | ~5–10% | Restore from S3/Glacier, recreate infrastructure |
| Hours / Minutes to hours | Pilot Light | ~15–25% | Scale up minimal DR infra; restore recent replica |
| Minutes to 1 hour / Minutes | Warm Standby | ~50–70% | Redirect DNS to scaled-down but running DR environment |
| Seconds / Near zero | Active-Active / Multi-site | ~100%+ | Automatic DNS/load balancer failover; dual active |
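To make the RPO half of this table concrete: the worst-case data loss at any moment is simply the age of your newest recoverable point. A minimal sketch of that arithmetic in shell, using hypothetical frozen timestamps so the example is reproducible (in practice you would pull the last recoverable point from your backup tooling, e.g. the RDS `LatestRestorableTime` field, and "now" from the clock):

```shell
# Hypothetical values for illustration (GNU date is assumed for -d parsing).
last_recoverable="2026-02-01T03:00:00Z"
now="2026-02-01T03:40:00Z"
rpo_target_seconds=$((60 * 60))   # 1-hour RPO target

# Age of the newest recoverable data = worst-case data loss right now.
age_seconds=$(( $(date -u -d "$now" +%s) - $(date -u -d "$last_recoverable" +%s) ))

if [ "$age_seconds" -le "$rpo_target_seconds" ]; then
  result="within RPO"
else
  result="RPO BREACH"
fi
echo "worst-case loss: ${age_seconds}s -> $result"
```

The same check, fed by real timestamps and run on a schedule, is a cheap continuous alarm that your backup cadence still satisfies the agreed RPO.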

The Four DR Strategies

AWS and most cloud architects describe four DR strategies on a spectrum from cheapest/slowest to most expensive/fastest. Each is appropriate for different criticality tiers.

DR strategy breakdown

  • 1. Backup and Restore — cheapest, slowest (RTO: hours; RPO: hours) — Take regular backups of data (RDS automated backups, EBS snapshots, S3 versioning) to a second region. Infrastructure is not pre-provisioned — you recreate everything from IaC during a disaster. Cost: essentially just storage for backups. Recovery: manual, slow (hours). Suitable for dev environments, low-criticality internal tools, and batch workloads where hours of downtime is acceptable.
  • 2. Pilot Light — minimal pre-provisioned DR (RTO: 1–4 hours; RPO: minutes) — Core data services (RDS read replica, DynamoDB global tables) run continuously in the DR region, keeping data fresh. Compute (EC2, ECS, Lambda) is not running but can be scaled up quickly from IaC. During a disaster: point DNS to DR region, scale up compute, promote database replica. Cost: ~15–25% overhead for the always-on data tier.
  • 3. Warm Standby — scaled-down but running DR (RTO: 15–60 min; RPO: seconds to minutes) — A fully functional but smaller-capacity version of production runs continuously in the DR region. During a disaster: scale up the DR environment (autoscaling, larger instances) and redirect DNS. Cost: ~50–70% overhead since you are always running a mini-production. The most common choice for Tier 1 production systems with 99.9%+ SLAs.
  • 4. Active-Active / Multi-site — full dual-region (RTO: seconds; RPO: near-zero) — Production workloads run simultaneously in two or more regions. Traffic is distributed by a global load balancer (Route 53, AWS Global Accelerator, Cloudflare). If one region fails, the other takes 100% of traffic automatically. Cost: ~100%+ overhead since you are running full production everywhere. Required for RPO near zero (financial systems, healthcare, e-commerce at scale).
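The "redirect DNS" step that Pilot Light, Warm Standby, and Active-Active all rely on is typically a Route 53 failover routing policy with a health check on the primary endpoint. A hedged sketch — the hosted zone ID, health check ID, domain, and ALB hostnames below are all placeholders to substitute with your own:

```shell
# Hypothetical IDs and hostnames. The PRIMARY record is served while its
# health check passes; Route 53 automatically answers with the SECONDARY
# record when the health check fails.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": "primary-us-east-1",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "11111111-2222-3333-4444-555555555555",
        "ResourceRecords": [{"Value": "primary-alb.us-east-1.elb.amazonaws.com"}]
      }
    }, {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": "secondary-eu-west-1",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": "dr-alb.eu-west-1.elb.amazonaws.com"}]
      }
    }]
  }'
```

Note the low TTL: clients cache DNS answers, so a long TTL silently adds minutes to your effective RTO regardless of how fast the failover itself is.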

Never confuse multi-AZ HA with multi-region DR

Multi-AZ (e.g. RDS Multi-AZ, ALB across 3 AZs) protects against data centre failures within a region. Multi-region DR protects against an entire region becoming unavailable. These are different threat models and different cost profiles. Most teams nail multi-AZ but skip multi-region DR entirely — and discover they needed it during a region-level event.
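The two protections are even separate knobs on the same service. For RDS, for example, Multi-AZ is an instance setting, while regional DR is a separate cross-region replica — a sketch with placeholder identifiers:

```shell
# Multi-AZ HA: synchronous standby in another AZ of the SAME region.
# Protects against an AZ/data-centre failure only.
aws rds modify-db-instance \
  --db-instance-identifier prod-db \
  --multi-az \
  --apply-immediately

# Multi-region DR: asynchronous replica in a DIFFERENT region.
# Protects against the whole primary region becoming unavailable.
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:prod-db \
  --region eu-west-1
```

Enabling one does nothing for the other: a Multi-AZ instance with no cross-region replica still disappears with its region.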

DR Testing and Runbooks

An untested DR plan is not a DR plan. The most common DR failure mode is not a missing backup — it is that no one has ever run the restore procedure and it does not work under pressure.

Minimum viable DR testing programme

1. Define RTO and RPO targets per service tier. Document them in a service runbook. Get business sign-off on the numbers.

2. Test restore from backup quarterly: restore an RDS snapshot to a non-production environment and verify data integrity. Document the exact time taken.

3. Test failover for pilot light and warm standby: simulate a region failure, execute the runbook, and measure actual RTO and RPO. Compare to targets.

4. Publish the DR runbook to a location that stays reachable when the primary region is down (e.g. the DR region) — if your runbook lives in the failing region, you cannot read it during an incident.

5. After each test, update the runbook with lessons learned. Automate whatever caused delays.
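The quarterly restore test can be scripted so the drill itself produces the measurement. A sketch assuming a recent snapshot named prod-db-snap-latest exists — all identifiers are placeholders:

```shell
# Quarterly restore drill: restore the latest snapshot into a throwaway
# non-production instance and record how long the restore actually takes.
start=$(date +%s)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier dr-drill-restore \
  --db-snapshot-identifier prod-db-snap-latest \
  --db-instance-class db.t4g.medium \
  --no-publicly-accessible

# Block until the restored instance is available.
aws rds wait db-instance-available \
  --db-instance-identifier dr-drill-restore

end=$(date +%s)
echo "Measured restore time: $(( (end - start) / 60 )) minutes"

# Verify data integrity here (row counts, checksums, an app smoke test),
# record the measured time in the runbook, then tear down:
aws rds delete-db-instance \
  --db-instance-identifier dr-drill-restore \
  --skip-final-snapshot
```

The number this prints is your measured restore-phase RTO — compare it to the target from step 1, not to what the runbook hopes it will be.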

Use AWS Backup or equivalent for centralised backup management

AWS Backup provides a single pane of glass for backup policies across RDS, DynamoDB, EBS, EFS, FSx, S3, and EC2. Set backup plans with retention rules, cross-region copy rules, and vault lock (WORM — write once, read many — for ransomware protection). Without a centralised backup service, teams rely on per-service backup settings and inevitably miss something.
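As a sketch, a backup plan with a cross-region copy rule might look like the following — the vault names, account ID, DR region, and retention periods are assumptions to adapt:

```shell
# Daily backups at 05:00 UTC, 35-day retention, with each recovery point
# copied to a vault in the DR region (eu-west-1 here is an assumption).
cat > dr-backup-plan.json <<'EOF'
{
  "BackupPlanName": "prod-daily-with-dr-copy",
  "Rules": [{
    "RuleName": "daily-0500-utc",
    "TargetBackupVaultName": "prod-vault",
    "ScheduleExpression": "cron(0 5 * * ? *)",
    "Lifecycle": { "DeleteAfterDays": 35 },
    "CopyActions": [{
      "DestinationBackupVaultArn": "arn:aws:backup:eu-west-1:123456789012:backup-vault:prod-dr-vault",
      "Lifecycle": { "DeleteAfterDays": 35 }
    }]
  }]
}
EOF

aws backup create-backup-plan --backup-plan file://dr-backup-plan.json
```

The cross-region CopyActions rule is what turns a backup policy into a DR policy: without it, your backups share a fate with the region they protect.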

dr-rds-failover.sh

# Create a cross-region RDS read replica (Pilot Light pattern).
# A cross-region read replica keeps DR data near-current at low cost.
# Run against the destination region via --region; --source-region lets the
# CLI generate the pre-signed URL required when the source is encrypted.
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:prod-db \
  --db-instance-class db.r6g.large \
  --region eu-west-1 \
  --source-region us-east-1 \
  --no-publicly-accessible

# Promote the read replica to a standalone instance during DR failover.
# Promotion is irreversible (it disconnects the replica from the primary):
# only run this during an actual DR event or a DR drill.
aws rds promote-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --region eu-west-1

# Check RDS backup retention and automated backup status
aws rds describe-db-instances \
  --query 'DBInstances[*].{ID:DBInstanceIdentifier,BackupRetention:BackupRetentionPeriod,BackupWindow:PreferredBackupWindow}' \
  --output table
How this might come up in interviews

Cloud engineer and solutions architect interviews at senior level always include a DR scenario. Expect to be given RTO/RPO targets and asked to design the appropriate strategy and justify the cost.

Common questions:

  • What is the difference between RTO and RPO? Give an example of each.
  • Describe the four DR strategies on AWS and when you would use each.
  • Your production RDS is in us-east-1. A region outage occurs. Walk me through your DR procedure.
  • How would you design a DR strategy for a system that needs to recover in under 15 minutes?
  • What is the difference between multi-AZ and multi-region from a disaster recovery perspective?

Strong answer: Maps RTO/RPO requirements to specific strategies unprompted. Mentions Route 53 health checks for automatic DNS failover. Brings up the cost implications and the business case for each tier. Notes that an existing cross-region read replica (e.g. one serving analytics) can double as an accidental pilot light.

Red flags: Treats multi-AZ as equivalent to DR. Cannot define RTO and RPO clearly. Recommends active-active for everything without discussing cost. Has never mentioned testing the DR plan.

🧠Mental Model

💡 Analogy

Think of DR strategies like insurance policies. Backup & Restore is basic liability insurance — cheap, but if something bad happens, you spend hours filing claims (restoring). Pilot Light is comprehensive insurance with a hotline that answers in an hour. Warm Standby is a pre-booked rental car that's waiting at the airport — you just have to drive it. Active-Active is owning two identical cars and driving both simultaneously — if one breaks down, you're already in the other one.

⚡ Core Idea

RTO (how fast) and RPO (how much data loss) determine which DR strategy you need. Tighter requirements cost exponentially more. The four strategies (backup/restore → pilot light → warm standby → active-active) are a spectrum of cost vs. recovery speed.

🎯 Why It Matters

DR is where most companies discover their architecture has hidden assumptions ("we assumed us-east-1 would never go down"). Designing DR into the architecture from the beginning is 5–10x cheaper than retrofitting it. Engineers who understand RTO/RPO and can map them to concrete AWS services and costs are invaluable in architecture discussions.

Related concepts

Explore topics that connect to this one.

  • High availability (HA) & resilience
  • Observability
  • Regions & Availability Zones

Test yourself: 2 challenges

Apply this concept with scenario-based Q&A.

  • Design database backup and disaster recovery
  • Design multi-cloud disaster recovery
