How to design systems that survive an entire cloud region going offline — covering active-passive vs active-active patterns, global routing strategies, data replication trade-offs, failover automation, and the hidden single points of failure that make "multi-region" setups false safety.
A cloud region is a geographic area containing multiple availability zones (AZs). Regions fail — not often, but when they do, every service inside them is affected simultaneously. AWS us-east-1 has experienced multiple large-scale events. The question is not whether regions can fail; it is whether the cost of multi-region (2–3x infrastructure, operational complexity, data consistency overhead) is justified by your availability and latency SLOs.
Multi-AZ is not multi-region
A common misconception: "We are multi-AZ, so we are resilient." Multi-AZ protects against a single data centre failure within a region. If the region itself has a networking, power, or control plane outage (which happens), multi-AZ provides no protection. Region-level resilience requires actual resources in a second region.
| Pattern | Description | RTO | RPO | Cost multiplier | Complexity |
|---|---|---|---|---|---|
| Single region, multi-AZ | Resources across AZs, automatic failover within region | < 5 min | < 1 min | 1.5x | Low |
| Active-passive (warm standby) | Secondary region with scaled-down replica, manual or automated failover | 15–60 min | 1–5 min | 1.8x | Medium |
| Active-passive (hot standby) | Secondary region at full capacity, near-instant automated failover | < 5 min | < 1 min | 2x | High |
| Active-active (multi-region) | Both regions serve traffic simultaneously, global DNS routing | < 30 sec | Near-zero | 2.5–3x | Very high |
Choose based on your actual SLOs, not aspiration
Most applications do not need active-active. A 99.99% availability SLO (52 min downtime/year) is achievable with well-implemented active-passive. Active-active is for 99.999% SLOs (5 min downtime/year) — typically payment systems, real-time trading, and services where every second of downtime costs thousands of dollars. Do the math: 60 minutes of RTO × your revenue per minute = maximum acceptable cost of recovery. If that number is smaller than the ongoing cost of active-active, active-passive is the right choice.
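The break-even arithmetic above can be sketched in a few lines of shell; all dollar figures here are hypothetical placeholders — substitute your own revenue and infrastructure numbers:

```shell
RTO_MINUTES=60                 # warm-standby worst-case recovery time
REVENUE_PER_MINUTE=500         # hypothetical: $500 lost per minute of downtime
BASELINE_MONTHLY_INFRA=20000   # hypothetical single-region monthly spend

# Maximum acceptable cost of one recovery event at this RTO
RECOVERY_COST=$((RTO_MINUTES * REVENUE_PER_MINUTE))

# Annual premium of active-active (2.5x) over warm standby (1.8x): a 0.7x delta
ACTIVE_ACTIVE_PREMIUM=$((BASELINE_MONTHLY_INFRA * 12 * 7 / 10))

echo "Cost of one ${RTO_MINUTES}-min outage: \$${RECOVERY_COST}"
echo "Annual active-active premium:          \$${ACTIVE_ACTIVE_PREMIUM}"
```

With these placeholder numbers a 60-minute outage costs $30,000 while the active-active premium runs $168,000/year — so warm standby wins unless outages are frequent or revenue per minute is much higher.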
The application tier can be replicated trivially — you deploy the same container image in two regions. The hard problem in multi-region architecture is data: how do you keep databases consistent across regions, and what happens to writes that are in-flight when a region fails?
Data replication strategies and their trade-offs
The write conflict problem in active-active
If both regions accept writes simultaneously and a user's session switches regions (e.g., during failover), you can get conflicting writes to the same record. DynamoDB Global Tables handles this with "last writer wins" conflict resolution. Aurora Global Database uses a single-writer model (only the primary region accepts writes; secondaries are read-only). If you need true active-active with writes in both regions, you must implement conflict resolution logic in your application or choose a database that supports multi-master writes (Cassandra, CockroachDB, Spanner).
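As an illustration of the last-writer-wins idea (a sketch of the strategy, not DynamoDB's actual implementation), a resolver can keep whichever replica's copy carries the newer write timestamp. The `<epoch>:<value>` record format here is invented for the example:

```shell
# resolve_lww REPLICA_A_RECORD REPLICA_B_RECORD
# Each record is "<write_timestamp>:<value>"; the newer timestamp wins.
resolve_lww() {
  a_ts=${1%%:*}   # strip everything after the first colon
  b_ts=${2%%:*}
  if [ "$a_ts" -ge "$b_ts" ]; then
    echo "$1"
  else
    echo "$2"
  fi
}

# us-east-1 wrote at t=1005, eu-west-1 wrote at t=1009: the later write survives
resolve_lww "1005:name=Alice" "1009:name=Alicia"
# prints: 1009:name=Alicia
```

Note what this implies: the earlier write is silently discarded, which is exactly why last-writer-wins is acceptable for some records (user preferences) and dangerous for others (account balances).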
DNS TTL is part of your RTO
When failing over to a secondary region, you update DNS records to point to the secondary region's endpoints. The time clients take to observe that change is governed by the record's TTL. If your failover automation completes in 60 seconds but clients cached the old record with a 300-second TTL, your RTO floor is roughly 5 minutes of client-side DNS caching, regardless of how fast the automation runs. Set low TTLs (60 seconds or less) on records you intend to fail over, and confirm the behaviour with a failover drill before you need it.
```shell
# Route 53 health check + failover routing policy
# Primary record (us-east-1 ALB)
aws route53 create-health-check \
  --caller-reference "primary-us-east-1-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "api-us-east-1.example.com",
    "Port": 443,
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 2
  }'
# FailureThreshold: 2 means 2 consecutive failures (20s) before marking unhealthy

# Create failover records in the hosted zone
# PRIMARY record — fails over to secondary if the health check fails
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "AliasTarget": {
          "DNSName": "my-alb-us-east-1.us-east-1.elb.amazonaws.com",
          "HostedZoneId": "ALB-CANONICAL-ZONE-ID",
          "EvaluateTargetHealth": true
        },
        "Failover": "PRIMARY",
        "HealthCheckId": "abc-health-check-id",
        "SetIdentifier": "primary"
      }
    }]
  }'
# EvaluateTargetHealth: true — the ALB must also report healthy for the record to be active
# AliasTarget.HostedZoneId is the ALB's canonical hosted zone ID (see
# `aws elbv2 describe-load-balancers`), not your Route 53 hosted zone ID

# Monitor replication lag on Aurora Global Database
aws rds describe-global-clusters \
  --query 'GlobalClusters[].GlobalClusterMembers[?IsWriter==`false`].DBClusterArn'
# Then check CloudWatch metric: AuroraGlobalDBReplicationLag
```
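To make the TTL arithmetic concrete, here is the RTO floor implied by the figures above: a 10-second check interval with a failure threshold of 2, plus a worst-case client DNS cache. The 300-second cache value is illustrative:

```shell
CHECK_INTERVAL=10      # seconds between Route 53 health checks
FAILURE_THRESHOLD=2    # consecutive failures before the endpoint is marked unhealthy
CLIENT_DNS_TTL=300     # worst-case seconds a client caches the stale record

DETECTION=$((CHECK_INTERVAL * FAILURE_THRESHOLD))
RTO_FLOOR=$((DETECTION + CLIENT_DNS_TTL))
echo "Detection: ${DETECTION}s, DNS-imposed RTO floor: ${RTO_FLOOR}s"
# With these numbers: detection 20s, floor 320s. Lowering the record TTL
# to 60s brings the floor to 80s, before any database promotion work starts.
```

This is why the TTL belongs in your RTO budget alongside failover automation and database promotion time.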
Once you have resources in multiple regions, you need a global routing layer to direct users to the right region. AWS Route 53 provides four relevant routing policies: latency-based, geolocation, failover, and weighted.
| Policy | Routes based on | Use case | Caveat |
|---|---|---|---|
| Latency-based | Measured latency from user to region | Route each user to the lowest-latency region | Latency can change; users may switch regions mid-session |
| Geolocation | User's geographic location (country/continent) | Data residency (GDPR: EU users → EU region) | Does not guarantee low latency — geography ≠ latency |
| Failover | Health check status of primary record | Active-passive with automated cutover to secondary | DNS TTL determines how long clients are stuck on failed primary |
| Weighted | Manually configured % split between records | Gradual traffic migration, blue-green at DNS level | Not responsive to load — just a static percentage split |
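As a sketch of the weighted policy in the last row, a 90/10 split for a gradual migration might look like the following; the zone ID, record values, and set identifiers are placeholders:

```shell
# Two weighted A records for the same name: 90% of resolutions go to
# us-east-1, 10% to eu-west-1. Adjust Weight values to shift traffic.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "api.example.com", "Type": "A", "TTL": 60,
        "Weight": 90, "SetIdentifier": "us-east-1",
        "ResourceRecords": [{"Value": "192.0.2.10"}]}},
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "api.example.com", "Type": "A", "TTL": 60,
        "Weight": 10, "SetIdentifier": "eu-west-1",
        "ResourceRecords": [{"Value": "198.51.100.10"}]}}
    ]}'
```

Note the caveat from the table: weights shape the proportion of DNS resolutions, not of load — a few heavy clients behind one resolver can skew actual traffic well away from 90/10.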
AWS Global Accelerator vs CloudFront vs Route 53 — know when each is right
Route 53 latency-based routing resolves DNS to the nearest region, but the TCP connection still traverses the public internet. Global Accelerator anycast-routes the connection to the nearest AWS edge location and carries it the rest of the way over the AWS backbone, reducing latency by 20–40% for TCP connections because less of the path crosses the public internet. CloudFront is for HTTP/HTTPS content distribution and caching. Global Accelerator suits any TCP/UDP traffic that benefits from the AWS backbone but does not need HTTP caching. Use Global Accelerator for latency-sensitive APIs; use CloudFront for assets and HTML.
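To show the Global Accelerator path concretely, a minimal setup for a TCP API might look like the sketch below. All names and ARNs are placeholders, and note that the Global Accelerator control-plane API is itself homed in us-west-2:

```shell
# Create the accelerator (allocates two static anycast IPv4 addresses)
aws globalaccelerator create-accelerator \
  --region us-west-2 \
  --name api-accelerator \
  --ip-address-type IPV4 \
  --enabled

# Attach a TCP listener for HTTPS traffic on port 443
aws globalaccelerator create-listener \
  --region us-west-2 \
  --accelerator-arn arn:aws:globalaccelerator::123456789012:accelerator/EXAMPLE \
  --protocol TCP \
  --port-ranges FromPort=443,ToPort=443

# Point an endpoint group at the us-east-1 ALB; adding a second endpoint
# group for the other region gives multi-region routing with health checks
aws globalaccelerator create-endpoint-group \
  --region us-west-2 \
  --listener-arn arn:aws:globalaccelerator::123456789012:listener/EXAMPLE \
  --endpoint-group-region us-east-1 \
  --endpoint-configurations EndpointId=arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/EXAMPLE,Weight=100
```

Because the anycast IPs are static, clients are not subject to DNS TTL during failover — traffic shifts at the routing layer, which is one reason Global Accelerator appears in low-RTO designs.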
This topic appears in senior cloud engineer, solutions architect, and staff engineer system design interviews. It is often framed as a business continuity problem: "The CTO wants five-nines availability. How do you architect this?" Expect to justify cost vs resilience trade-offs.
Common questions to establish scope first: What is the target RTO and RPO? What budget multiplier can the business support? Is data residency a legal requirement (GDPR)? Are write conflicts possible if both regions accept writes simultaneously?
Strong answer: Audits the data plane (every database, cache, queue, secret) before declaring an architecture multi-region. Recommends Aurora Global Database over RDS read replicas. Explicitly designs the automated failover runbook and proposes quarterly DR drills. Quantifies cost with a multiplier estimate.
Red flags: Claims "multi-region is easy — just deploy the same containers in two regions." Does not know what RPO means. Assumes a read replica can be promoted instantly with zero data loss. Does not mention DNS TTL as a component of RTO.
Key takeaways
💡 Analogy
Active-passive is like having a main office and an emergency backup office. The backup office has desks, computers, and phone lines, but the staff sit idle waiting for the main office to burn down. When disaster strikes, you redirect the phone system to the backup number and staff walk over. It takes 15–60 minutes to get the backup office fully operational. Active-active is like two offices that both serve customers simultaneously: London handles European customers, New York handles American customers. If London burns down, New York absorbs all traffic within seconds — because it was already running.
⚡ Core Idea
Active-passive trades recovery time (RTO minutes) for lower cost (one idle standby). Active-active achieves near-zero RTO but requires solved data consistency across regions and costs 2.5–3x. The hard part of multi-region is never the application tier — it is always the data layer and how you handle writes in two places.
🎯 Why It Matters
A "multi-region" architecture that has a single-region database is not multi-region at all. The hidden single point of failure — a database with no cross-region replica, an S3 bucket in one region, a secrets manager secret that has not been replicated — will be the exact point of failure during a real region outage. Understanding the full data plane requirements of multi-region is what separates engineers who can design for real resilience from those who draw a topology diagram that looks multi-region but fails on first test.