How to design systems that survive an entire cloud region going offline — covering active-passive vs active-active patterns, global routing strategies, data replication trade-offs, failover automation, and the hidden single points of failure that make "multi-region" setups false safety.
A cloud region is a geographic area containing multiple availability zones (AZs). Regions fail — not often, but when they do, every service inside them is affected simultaneously. AWS us-east-1 has experienced multiple large-scale events. The question is not whether regions can fail; it is whether the cost of multi-region (2–3x infrastructure, operational complexity, data consistency overhead) is justified by your availability and latency SLOs.
Multi-AZ is not multi-region
A common misconception: "We are multi-AZ, so we are resilient." Multi-AZ protects against a single data centre failure within a region. If the region itself has a networking, power, or control plane outage (which happens), multi-AZ provides no protection. Region-level resilience requires actual resources in a second region.
| Pattern | Description | RTO | RPO | Cost multiplier | Complexity |
|---|---|---|---|---|---|
| Single region, multi-AZ | Resources across AZs, automatic failover within region | < 5 min | < 1 min | 1.5x | Low |
| Active-passive (warm standby) | Secondary region with scaled-down replica, manual or automated failover | 15–60 min | 1–5 min | 1.8x | Medium |
| Active-passive (hot standby) | Secondary region at full capacity, near-instant automated failover | < 5 min | < 1 min | 2x | High |
| Active-active (multi-region) | Both regions serve traffic simultaneously, global DNS routing | < 30 sec | Near-zero | 2.5–3x | Very high |
Choose based on your actual SLOs, not aspiration
Most applications do not need active-active. A 99.99% availability SLO (52 min downtime/year) is achievable with well-implemented active-passive. Active-active is for 99.999% SLOs (5 min downtime/year) — typically payment systems, real-time trading, and services where every second of downtime costs thousands of dollars. Do the math: 60 minutes of RTO × your revenue per minute = maximum acceptable cost of recovery. If that number is smaller than the ongoing cost of active-active, active-passive is the right choice.
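The break-even arithmetic above can be sketched in a few lines of shell; all dollar figures here are hypothetical placeholders — substitute your own revenue and infrastructure numbers:

```shell
RTO_MINUTES=60                 # warm-standby worst-case recovery time
REVENUE_PER_MINUTE=500         # hypothetical: $500 lost per minute of downtime
BASELINE_MONTHLY_INFRA=20000   # hypothetical single-region monthly spend

# Maximum acceptable cost of one recovery event at this RTO
RECOVERY_COST=$((RTO_MINUTES * REVENUE_PER_MINUTE))

# Annual premium of active-active (2.5x) over warm standby (1.8x): a 0.7x delta
ACTIVE_ACTIVE_PREMIUM=$((BASELINE_MONTHLY_INFRA * 12 * 7 / 10))

echo "Cost of one ${RTO_MINUTES}-min outage: \$${RECOVERY_COST}"
echo "Annual active-active premium:          \$${ACTIVE_ACTIVE_PREMIUM}"
```

With these placeholder numbers a 60-minute outage costs $30,000 while the active-active premium runs $168,000/year — so warm standby wins unless outages are frequent or revenue per minute is much higher.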
The application tier can be replicated trivially — you deploy the same container image in two regions. The hard problem in multi-region architecture is data: how do you keep databases consistent across regions, and what happens to writes that are in-flight when a region fails?
Data replication strategies and their trade-offs
The write conflict problem in active-active
If both regions accept writes simultaneously and a user's session switches regions (e.g., during failover), you can get conflicting writes to the same record. DynamoDB Global Tables handles this with "last writer wins" conflict resolution. Aurora Global Database uses a single-writer model (only the primary region accepts writes; secondaries are read-only). If you need true active-active with writes in both regions, you must implement conflict resolution logic in your application or choose a database that supports multi-master writes (Cassandra, CockroachDB, Spanner).
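As an illustration of the last-writer-wins idea (a sketch of the strategy, not DynamoDB's actual implementation), a resolver can keep whichever replica's copy carries the newer write timestamp. The `<epoch>:<value>` record format here is invented for the example:

```shell
# resolve_lww REPLICA_A_RECORD REPLICA_B_RECORD
# Each record is "<write_timestamp>:<value>"; the newer timestamp wins.
resolve_lww() {
  a_ts=${1%%:*}   # strip everything after the first colon
  b_ts=${2%%:*}
  if [ "$a_ts" -ge "$b_ts" ]; then
    echo "$1"
  else
    echo "$2"
  fi
}

# us-east-1 wrote at t=1005, eu-west-1 wrote at t=1009: the later write survives
resolve_lww "1005:name=Alice" "1009:name=Alicia"
# prints: 1009:name=Alicia
```

Note what this implies: the earlier write is silently discarded, which is exactly why last-writer-wins is acceptable for some records (user preferences) and dangerous for others (account balances).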
DNS TTL is part of your RTO
When failing over to a secondary region, you update DNS records to point to the secondary region's endpoints. The time clients take to observe that change is governed by the record's TTL. If your failover automation completes in 60 seconds but clients cached the old record with a 300-second TTL, your RTO floor is roughly 5 minutes of client-side DNS caching, regardless of how fast the automation runs. Set low TTLs (60 seconds or less) on records you intend to fail over, and confirm the behaviour with a failover drill before you need it.
```shell
# Route 53 health check + failover routing policy
# Primary record (us-east-1 ALB)
aws route53 create-health-check \
  --caller-reference "primary-us-east-1-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "api-us-east-1.example.com",
    "Port": 443,
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 2
  }'
# FailureThreshold: 2 means 2 consecutive failures (20s) before marking unhealthy

# Create failover records in the hosted zone
# PRIMARY record — fails over to secondary if the health check fails
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "AliasTarget": {
          "DNSName": "my-alb-us-east-1.us-east-1.elb.amazonaws.com",
          "HostedZoneId": "ALB-CANONICAL-ZONE-ID",
          "EvaluateTargetHealth": true
        },
        "Failover": "PRIMARY",
        "HealthCheckId": "abc-health-check-id",
        "SetIdentifier": "primary"
      }
    }]
  }'
# EvaluateTargetHealth: true — the ALB must also report healthy for the record to be active
# AliasTarget.HostedZoneId is the ALB's canonical hosted zone ID (see
# `aws elbv2 describe-load-balancers`), not your Route 53 hosted zone ID

# Monitor replication lag on Aurora Global Database
aws rds describe-global-clusters \
  --query 'GlobalClusters[].GlobalClusterMembers[?IsWriter==`false`].DBClusterArn'
# Then check CloudWatch metric: AuroraGlobalDBReplicationLag
```
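To make the TTL arithmetic concrete, here is the RTO floor implied by the figures above: a 10-second check interval with a failure threshold of 2, plus a worst-case client DNS cache. The 300-second cache value is illustrative:

```shell
CHECK_INTERVAL=10      # seconds between Route 53 health checks
FAILURE_THRESHOLD=2    # consecutive failures before the endpoint is marked unhealthy
CLIENT_DNS_TTL=300     # worst-case seconds a client caches the stale record

DETECTION=$((CHECK_INTERVAL * FAILURE_THRESHOLD))
RTO_FLOOR=$((DETECTION + CLIENT_DNS_TTL))
echo "Detection: ${DETECTION}s, DNS-imposed RTO floor: ${RTO_FLOOR}s"
# With these numbers: detection 20s, floor 320s. Lowering the record TTL
# to 60s brings the floor to 80s, before any database promotion work starts.
```

This is why the TTL belongs in your RTO budget alongside failover automation and database promotion time.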
Once you have resources in multiple regions, you need a global routing layer to direct users to the right region. AWS Route 53 provides four relevant routing policies: latency-based, geolocation, failover, and weighted.
| Policy | Routes based on | Use case | Caveat |
|---|---|---|---|
| Latency-based | Measured latency from user to region | Route each user to the lowest-latency region | Latency can change; users may switch regions mid-session |
| Geolocation | User's geographic location (country/continent) | Data residency (GDPR: EU users → EU region) | Does not guarantee low latency — geography ≠ latency |
| Failover | Health check status of primary record | Active-passive with automated cutover to secondary | DNS TTL determines how long clients are stuck on failed primary |
| Weighted | Manually configured % split between records | Gradual traffic migration, blue-green at DNS level | Not responsive to load — just a static percentage split |
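As a sketch of the weighted policy in the last row, a 90/10 split for a gradual migration might look like the following; the zone ID, record values, and set identifiers are placeholders:

```shell
# Two weighted A records for the same name: 90% of resolutions go to
# us-east-1, 10% to eu-west-1. Adjust Weight values to shift traffic.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "api.example.com", "Type": "A", "TTL": 60,
        "Weight": 90, "SetIdentifier": "us-east-1",
        "ResourceRecords": [{"Value": "192.0.2.10"}]}},
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "api.example.com", "Type": "A", "TTL": 60,
        "Weight": 10, "SetIdentifier": "eu-west-1",
        "ResourceRecords": [{"Value": "198.51.100.10"}]}}
    ]}'
```

Note the caveat from the table: weights shape the proportion of DNS resolutions, not of load — a few heavy clients behind one resolver can skew actual traffic well away from 90/10.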
AWS Global Accelerator vs CloudFront vs Route 53 — know when each is right
Route 53 latency-based routing resolves DNS to the nearest region, but the TCP connection still traverses the public internet. Global Accelerator anycast-routes the connection to the nearest AWS edge location and carries it the rest of the way over the AWS backbone, reducing latency by 20–40% for TCP connections because less of the path crosses the public internet. CloudFront is for HTTP/HTTPS content distribution and caching. Global Accelerator suits any TCP/UDP traffic that benefits from the AWS backbone but does not need HTTP caching. Use Global Accelerator for latency-sensitive APIs; use CloudFront for assets and HTML.
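To show the Global Accelerator path concretely, a minimal setup for a TCP API might look like the sketch below. All names and ARNs are placeholders, and note that the Global Accelerator control-plane API is itself homed in us-west-2:

```shell
# Create the accelerator (allocates two static anycast IPv4 addresses)
aws globalaccelerator create-accelerator \
  --region us-west-2 \
  --name api-accelerator \
  --ip-address-type IPV4 \
  --enabled

# Attach a TCP listener for HTTPS traffic on port 443
aws globalaccelerator create-listener \
  --region us-west-2 \
  --accelerator-arn arn:aws:globalaccelerator::123456789012:accelerator/EXAMPLE \
  --protocol TCP \
  --port-ranges FromPort=443,ToPort=443

# Point an endpoint group at the us-east-1 ALB; adding a second endpoint
# group for the other region gives multi-region routing with health checks
aws globalaccelerator create-endpoint-group \
  --region us-west-2 \
  --listener-arn arn:aws:globalaccelerator::123456789012:listener/EXAMPLE \
  --endpoint-group-region us-east-1 \
  --endpoint-configurations EndpointId=arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/EXAMPLE,Weight=100
```

Because the anycast IPs are static, clients are not subject to DNS TTL during failover — traffic shifts at the routing layer, which is one reason Global Accelerator appears in low-RTO designs.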
This topic appears in senior cloud engineer, solutions architect, and staff engineer system design interviews. It is often framed as a business continuity problem: "The CTO wants five-nines availability. How do you architect this?" Expect to justify cost vs resilience trade-offs.
Common questions to establish scope first: What is the target RTO and RPO? What budget multiplier can the business support? Is data residency a legal requirement (GDPR)? Are write conflicts possible if both regions accept writes simultaneously?
Strong answer: Audits the data plane (every database, cache, queue, secret) before declaring an architecture multi-region. Recommends Aurora Global Database over RDS read replicas. Explicitly designs the automated failover runbook and proposes quarterly DR drills. Quantifies cost with a multiplier estimate.
Red flags: Claims "multi-region is easy — just deploy the same containers in two regions." Does not know what RPO means. Assumes a read replica can be promoted instantly with zero data loss. Does not mention DNS TTL as a component of RTO.
Key takeaways
💡 Analogy
Active-passive is like having a main office and an emergency backup office. The backup office has desks, computers, and phone lines, but the staff sit idle waiting for the main office to burn down. When disaster strikes, you redirect the phone system to the backup number and staff walk over. It takes 15–60 minutes to get the backup office fully operational. Active-active is like two offices that both serve customers simultaneously: London handles European customers, New York handles American customers. If London burns down, New York absorbs all traffic within seconds — because it was already running.
⚡ Core Idea
Active-passive trades recovery time (RTO minutes) for lower cost (one idle standby). Active-active achieves near-zero RTO but requires solved data consistency across regions and costs 2.5–3x. The hard part of multi-region is never the application tier — it is always the data layer and how you handle writes in two places.
🎯 Why It Matters
A "multi-region" architecture that has a single-region database is not multi-region at all. The hidden single point of failure — a database with no cross-region replica, an S3 bucket in one region, a secrets manager secret that has not been replicated — will be the exact point of failure during a real region outage. Understanding the full data plane requirements of multi-region is what separates engineers who can design for real resilience from those who draw a topology diagram that looks multi-region but fails on first test.