Expert · TaskFlow · Project 13 of 13 · ~28h · 5 milestones

Add multi-region disaster recovery

Continues from the last build: TaskFlow runs securely and observably in one region on EKS via GitOps.

TaskFlow already runs well in eu-west-1: it is on EKS, deployed by Argo CD from Git, with Prometheus and Grafana watching it and security baked into the pipeline.

Multi-region AWS architecture and DR strategy selection
Cross-region RDS replication and replica promotion
Route 53 health checks and DNS failover routing
Reusable Terraform modules across regions and providers
Writing and rehearsing a failover runbook (game-day)

What you'll build

TaskFlow survives a full primary-region outage by failing over to a warm standby region, with a target RTO under 15 minutes and a measured RPO that you record during a real game-day (typically single-digit seconds in steady state, not a guarantee). Route 53 health-checked DNS failover routes traffic to the standby, whose database is kept current by asynchronous cross-region replication, and the whole stack is defined as reusable Terraform modules and proven by a rehearsed failover.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

taskflow/infra/dr/providers.tf (hcl)
terraform {
  required_version = ">= 1.6"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Primary region: eu-west-1
provider "aws" {
  alias  = "primary"
  region = "eu-west-1"
}

# Standby region: eu-central-1
provider "aws" {
  alias  = "standby"
  region = "eu-central-1"
}

Reading this file

  • alias = "primary": A named copy of the AWS provider so you can target eu-west-1 explicitly instead of relying on a single default region.
  • region = "eu-central-1": This is the standby region where the warm copy of TaskFlow lives, ready to take traffic.
  • version = "~> 5.0": Pins the AWS provider to the 5.x line so a future major release cannot silently change resource behavior under you.

One provider per region using alias. Every module call must say which provider it uses, so resources land in the right region.
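
As a minimal sketch of what that looks like, the two module calls below each pass one aliased provider. The module path ./modules/region and the block names are assumptions for illustration, not files confirmed by this project.

```hcl
# Hypothetical module calls: the source path and block names are illustrative.
module "region_primary" {
  source = "./modules/region"

  # Map the module's default "aws" provider to the eu-west-1 alias.
  providers = {
    aws = aws.primary
  }
}

module "region_standby" {
  source = "./modules/region"

  # Same module, different provider: these resources land in eu-central-1.
  providers = {
    aws = aws.standby
  }
}
```

Because both provider blocks carry an alias, there is no explicitly configured default provider, which is why each call has to name one.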

That's 1 of 7 explained code blocks in this single project.

The build, milestone by milestone

  1. Decide the DR strategy and write down RTO/RPO targets

    3 guided steps

    RTO and RPO are the contract. Without them you cannot tell whether your architecture is over-engineered (paying for active-active when warm standby suffices) or under-engineered (promising a 5-minute RPO when your replica lag is 10 minutes). They also make the build testable: the game-day passes only if you hit the numbers you wrote here.
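
    One way to keep the targets honest is to write the RTO budget down as numbers and let Terraform check the sum. The sketch below is an illustration under stated assumptions: only the 15-minute target and the 60s TTL come from this project, and every other duration is a placeholder until the game-day replaces it with a measurement.

    ```hcl
    locals {
      rto_target_seconds = 15 * 60 # target from the design doc: under 15 minutes

      # Placeholder estimates; replace with game-day measurements.
      detection_seconds = 3 * 30 # assumed: 3 failed health checks, 30s apart
      dns_ttl_seconds   = 60     # record TTL, worst case for cached answers
      promotion_seconds = 4 * 60 # assumed replica promotion time
      restart_seconds   = 2 * 60 # assumed app restart against the new endpoint

      rto_estimate_seconds = (
        local.detection_seconds +
        local.dns_ttl_seconds +
        local.promotion_seconds +
        local.restart_seconds
      )
    }

    # check blocks need Terraform 1.5 or later.
    check "rto_budget" {
      assert {
        condition     = local.rto_estimate_seconds < local.rto_target_seconds
        error_message = "Estimated RTO no longer fits the 15-minute target."
      }
    }
    ```

    With these placeholders the estimate is 510 seconds (8.5 minutes). A failing check block reports a warning rather than blocking the apply, so treat it as a reminder, not a gate.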

  2. Build a region as a reusable Terraform module and stamp it twice

    3 guided steps

    A region module is what makes the second region trustworthy. If the standby were hand-built, it would drift from the primary and your failover would fail on some forgotten config. One module stamped twice means the standby is the primary, minus the differences you chose on purpose.
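
    The differences you chose on purpose can be made explicit as module inputs, so anything not listed is identical by construction. A minimal sketch, assuming hypothetical variable names inside the region module:

    ```hcl
    # Hypothetical inputs for the region module (for example modules/region/variables.tf).
    variable "role" {
      type        = string
      description = "Whether this stamp is the primary or the standby region."

      validation {
        condition     = contains(["primary", "standby"], var.role)
        error_message = "role must be \"primary\" or \"standby\"."
      }
    }

    variable "node_desired_size" {
      type        = number
      description = "EKS worker count; a warm standby can run fewer nodes."
      default     = 2
    }
    ```

    Keeping this list short is the point: every extra variable is one more way for the standby to differ from the primary.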

  3. Replicate images, config, and secrets so the standby can actually run

    3 guided steps

    DR failures are almost never about the database; they are about the boring supporting cast. The classic game-day surprise is 'the pods are stuck in ImagePullBackOff in the standby because the image is not there' or 'the app cannot start because the secret only exists in the primary region.' Replicating images and secrets ahead of time removes the two most common failover blockers.
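
    Both blockers can be addressed in Terraform. The sketch below assumes the aliased aws.primary provider from providers.tf, a same-account destination registry, and a hypothetical secret name; the project's own resource names may differ.

    ```hcl
    data "aws_caller_identity" "current" {
      provider = aws.primary
    }

    # Copy images pushed to repositories starting with "taskflow" to the standby.
    resource "aws_ecr_replication_configuration" "to_standby" {
      provider = aws.primary

      replication_configuration {
        rule {
          destination {
            region      = "eu-central-1"
            registry_id = data.aws_caller_identity.current.account_id
          }

          repository_filter {
            filter      = "taskflow"
            filter_type = "PREFIX_MATCH"
          }
        }
      }
    }

    # Secret name is hypothetical; the replica block keeps a copy in the standby.
    resource "aws_secretsmanager_secret" "db" {
      provider = aws.primary
      name     = "taskflow/db"

      replica {
        region = "eu-central-1"
      }
    }
    ```

    ECR replication only applies to images pushed after the rule exists, so images already in the primary repository need to be pushed again before the standby can pull them.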

  4. Wire Route 53 health-checked failover and prove it switches

    3 guided steps

    DNS failover is the automatic part of your RTO. If the health check is wrong (too slow, wrong path, too sensitive) your failover is either too slow to meet RTO or so trigger-happy it flaps during a minor blip. Proving the switch in a controlled way, before a real outage, is the only way to trust the number you wrote in milestone 1.
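
    A minimal sketch of the failover pair, assuming the hosted zone ID and the two ALB DNS names arrive as hypothetical variables. The 30s interval and threshold of 3 are illustrative tuning values, not the project's confirmed settings.

    ```hcl
    variable "zone_id" { type = string }         # hypothetical: hosted zone for taskflow.com
    variable "primary_alb_dns" { type = string } # hypothetical: eu-west-1 ALB DNS name
    variable "standby_alb_dns" { type = string } # hypothetical: eu-central-1 ALB DNS name

    # The single health signal: Route 53 probes /healthz on the primary.
    resource "aws_route53_health_check" "primary" {
      provider          = aws.primary
      fqdn              = var.primary_alb_dns
      port              = 443
      type              = "HTTPS"
      resource_path     = "/healthz"
      request_interval  = 30
      failure_threshold = 3
    }

    resource "aws_route53_record" "app_primary" {
      provider        = aws.primary
      zone_id         = var.zone_id
      name            = "app.taskflow.com"
      type            = "CNAME"
      ttl             = 60
      records         = [var.primary_alb_dns]
      set_identifier  = "primary"
      health_check_id = aws_route53_health_check.primary.id

      failover_routing_policy {
        type = "PRIMARY"
      }
    }

    # Served only while the primary record's health check is failing.
    resource "aws_route53_record" "app_standby" {
      provider       = aws.primary
      zone_id        = var.zone_id
      name           = "app.taskflow.com"
      type           = "CNAME"
      ttl            = 60
      records        = [var.standby_alb_dns]
      set_identifier = "standby"

      failover_routing_policy {
        type = "SECONDARY"
      }
    }
    ```

    Interval times threshold gives a rough detection time (about 90 seconds here), and the 60s TTL adds the time cached answers can linger, so both numbers count against the RTO.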

  5. Write and rehearse the failover runbook with a game-day

    3 guided steps

    A DR plan you have never executed is fiction. The first time you promote a replica should not be during a real outage. The game-day surfaces the things no diagram shows: the replica took 4 minutes to promote, the app needed a restart to pick up the new DB endpoint, a secret was missing. Measuring the real RTO/RPO is what turns 'we have DR' into 'we have a tested 12-minute RTO,' which is the sentence that wins the audit and the interview.
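
    The replica that the game-day promotes could be declared roughly as below. This is a sketch under assumptions: the variable names, identifier, and instance class are hypothetical, and an encrypted source would also need a KMS key in the standby region.

    ```hcl
    variable "primary_db_arn" { type = string }          # hypothetical: ARN of the eu-west-1 instance
    variable "standby_db_subnet_group" { type = string } # hypothetical: subnet group in eu-central-1

    # Cross-region read replica: the source is referenced by ARN, not by identifier.
    resource "aws_db_instance" "standby_replica" {
      provider             = aws.standby
      identifier           = "taskflow-standby"
      replicate_source_db  = var.primary_db_arn
      instance_class       = "db.t3.medium"
      db_subnet_group_name = var.standby_db_subnet_group
      skip_final_snapshot  = true
    }
    ```

    In the AWS provider, removing replicate_source_db from a managed replica promotes it to a standalone instance. Whether the runbook promotes through Terraform or through an explicit operator step, record the choice so that failback starts from a known state.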

What's inside when you start

3 starter files, ready to clone
5 guided milestones
4 full reference solutions
7 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

A Terraform DR stack under infra/dr/ with a reusable region module stamped into eu-west-1 (primary) and eu-central-1 (standby), each with EKS, RDS, ECR, and an ALB.
Cross-region RDS read replication from the primary to the standby, plus ECR image replication (destination repo keeps the source name 'taskflow') and a multi-region database secret so the standby can run unaided.
Route 53 health-checked failover (PRIMARY/SECONDARY records on app.taskflow.com with a 60s TTL, a single health signal on the primary) that automatically routes to the standby when the primary /healthz fails.
runbooks/dr-design.md stating the warm-standby strategy with RTO/RPO targets, and runbooks/failover.md with an executable, Linux-portable, rehearsed failover and failback procedure.
A completed game-day record showing the measured RTO and RPO from actually executing the failover.

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
