Intermediate · TaskFlow · Project 6 of 13 · ~15h · 5 milestones

Survive an availability-zone outage

Continues from the last build: TaskFlow is fully defined in Terraform but the compute may still be single-AZ.

In the last rung you got all of TaskFlow into Terraform, but the compute is still effectively single-AZ: one Launch Template, instances that happen to land in one subnet, a database with no standby.

  • Designing a multi-AZ VPC with public and private subnets, route tables, IGW and NAT
  • Fronting compute with an ALB and target group health checks against /healthz
  • Running stateless backends as an EC2 Auto Scaling Group across availability zones
  • Provisioning RDS Postgres Multi-AZ with backups and read replicas
  • Building CloudWatch dashboards and alarms wired to SNS for on-call paging
  • Running a controlled failover game day and writing an incident runbook

What you'll build

TaskFlow runs across two AZs and heals itself: traffic enters through a highly available ALB, the backend runs as an Auto Scaling Group in private subnets across two AZs, the database is RDS Multi-AZ with backups and a read replica, and CloudWatch alarms page on-call when capacity drops. You have proven, with captured output, that killing an instance causes zero user-visible downtime, and you have a one-page incident runbook the next on-call can actually follow. This matters because a single-AZ deployment goes down completely whenever its one zone has an outage, and multi-AZ with an ALB, an ASG, and RDS Multi-AZ is the baseline that production systems on AWS are generally expected to meet.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

taskflow/infra/variables.tf (HCL)
variable "region" {
  type    = string
  default = "eu-west-1"
}

variable "azs" {
  type        = list(string)
  description = "Exactly two AZs to spread TaskFlow across."
  default     = ["eu-west-1a", "eu-west-1b"]
}

variable "db_password" {
  type        = string
  description = "Postgres password. Pass via TF_VAR_db_password, never commit it."
  sensitive   = true
}

variable "alert_email" {
  type        = string
  description = "Email that receives CloudWatch alarm notifications."
}

Reading this file

  • default = ["eu-west-1a", "eu-west-1b"]: The two availability zones every multi-AZ resource is spread across.
  • sensitive = true: Redacts the database password in plan and apply output. The value is still written to the Terraform state in plain text, so the state needs protecting too.
  • Pass via TF_VAR_db_password: Reminds you to supply the password as an environment variable instead of a committed file.
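
The description asks for exactly two AZs, but nothing in the file enforces it. As a minimal sketch, assuming you want Terraform to reject any other list at plan time, a validation block could be added to the same variable. This is an illustration, not part of the starter file shown above:

```hcl
variable "azs" {
  type        = list(string)
  description = "Exactly two AZs to spread TaskFlow across."
  default     = ["eu-west-1a", "eu-west-1b"]

  # Assumed addition: fail early if the list is not two distinct zones.
  validation {
    condition     = length(var.azs) == 2 && length(distinct(var.azs)) == 2
    error_message = "Provide exactly two distinct availability zones."
  }
}
```

With this in place, passing a single zone or the same zone twice should stop the plan with the error message instead of quietly producing a single-AZ layout.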

That's 1 of 8 explained code blocks in this single project.

The build, milestone by milestone

  1. Stretch the network across two AZs with public and private subnets

    4 guided steps

    An ALB must be attached to subnets in at least two AZs, and instances spread across two private subnets survive the loss of one AZ. Without this, every later resilience feature is theater because everything still lives in one failure domain. Putting compute in private subnets also removes any direct inbound internet path to the app servers: traffic can reach them only through the ALB.
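
    One way this layout might look in Terraform is sketched below. The resource names and CIDR ranges are assumptions chosen for illustration, var.azs is the list from variables.tf, and the route tables, internet gateway and NAT gateway are left out to keep the sketch short:

    ```hcl
    # Hypothetical VPC; the 10.0.0.0/16 range is an assumption.
    resource "aws_vpc" "taskflow" {
      cidr_block           = "10.0.0.0/16"
      enable_dns_hostnames = true
    }

    # One public subnet per AZ (for the ALB and the NAT gateway).
    resource "aws_subnet" "public" {
      count                   = length(var.azs)
      vpc_id                  = aws_vpc.taskflow.id
      availability_zone       = var.azs[count.index]
      cidr_block              = cidrsubnet(aws_vpc.taskflow.cidr_block, 8, count.index)
      map_public_ip_on_launch = true
    }

    # One private subnet per AZ (for backend instances and the database).
    resource "aws_subnet" "private" {
      count             = length(var.azs)
      vpc_id            = aws_vpc.taskflow.id
      availability_zone = var.azs[count.index]
      cidr_block        = cidrsubnet(aws_vpc.taskflow.cidr_block, 8, count.index + 10)
    }
    ```

    Driving both subnet sets from var.azs means every later resource can refer to aws_subnet.private[*].id and pick up both zones without repeating the zone names.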

  2. Put the backend behind an ALB with a /healthz target group and Auto Scaling Group

    4 guided steps

    This is what makes a dead instance a non-event. The ALB stops sending traffic to a target once it fails the configured number of consecutive health checks (roughly the check interval times the unhealthy threshold), and the ASG, using ELB health checks, replaces it automatically. Spreading desired_capacity across two AZs means losing one AZ leaves you with running capacity in the other. Referencing the ALB's security group by ID in the backend's ingress rule applies least privilege: only the load balancer can reach the app port.
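
    The pieces described here could be wired together roughly as follows. This is a sketch under assumptions: port 8080, the capacity numbers, and the names aws_vpc.taskflow, aws_subnet.private, aws_launch_template.backend, aws_security_group.backend and aws_security_group.alb are all hypothetical and would be defined elsewhere in the configuration:

    ```hcl
    resource "aws_lb_target_group" "backend" {
      name     = "taskflow-backend"
      port     = 8080 # assumed app port
      protocol = "HTTP"
      vpc_id   = aws_vpc.taskflow.id

      # Two failed checks 10 seconds apart mark a target unhealthy.
      health_check {
        path                = "/healthz"
        interval            = 10
        timeout             = 5
        healthy_threshold   = 2
        unhealthy_threshold = 2
        matcher             = "200"
      }
    }

    resource "aws_autoscaling_group" "backend" {
      name                      = "taskflow-backend"
      min_size                  = 2
      max_size                  = 4
      desired_capacity          = 2
      vpc_zone_identifier       = aws_subnet.private[*].id # both AZs
      target_group_arns         = [aws_lb_target_group.backend.arn]
      health_check_type         = "ELB" # replace instances the ALB marks unhealthy
      health_check_grace_period = 60

      launch_template {
        id      = aws_launch_template.backend.id
        version = "$Latest"
      }
    }

    # Only the ALB's security group may reach the app port.
    resource "aws_security_group_rule" "app_from_alb" {
      type                     = "ingress"
      from_port                = 8080
      to_port                  = 8080
      protocol                 = "tcp"
      security_group_id        = aws_security_group.backend.id
      source_security_group_id = aws_security_group.alb.id
    }
    ```

    Shorter intervals and lower thresholds take a bad target out of rotation sooner, at the cost of more false positives during slow starts, which is the trade-off the grace period is there to soften.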

  3. Provision RDS Postgres Multi-AZ with backups and a read replica

    4 guided steps

    Multi-AZ is what lets the database survive an AZ outage: RDS keeps a synchronously replicated standby in the second AZ and fails over automatically by repointing the endpoint's DNS record at the standby, typically in 60-120 seconds, without losing committed data. Backups protect against logical mistakes (a bad migration, a dropped table) that Multi-AZ does not, because the standby faithfully replicates those too. The read replica offloads GET /tasks reads so they do not compete with writes, which matters as TaskFlow grows.
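
    A database along these lines might be declared as in the sketch below. The instance class, storage size, identifiers, and the aws_subnet.private and aws_security_group.db references are assumptions for illustration; only Multi-AZ, the 7-day retention and the read replica come from the project description:

    ```hcl
    resource "aws_db_subnet_group" "taskflow" {
      name       = "taskflow"
      subnet_ids = aws_subnet.private[*].id # private subnets in both AZs
    }

    resource "aws_db_instance" "primary" {
      identifier              = "taskflow-primary"
      engine                  = "postgres"
      instance_class          = "db.t3.micro" # assumed size
      allocated_storage       = 20
      db_name                 = "taskflow"
      username                = "taskflow"
      password                = var.db_password # from TF_VAR_db_password
      db_subnet_group_name    = aws_db_subnet_group.taskflow.name
      vpc_security_group_ids  = [aws_security_group.db.id]
      multi_az                = true # synchronous standby in the second AZ
      backup_retention_period = 7    # days; must be above 0 for replicas
      skip_final_snapshot     = true # convenient for practice, not for production
    }

    resource "aws_db_instance" "replica" {
      identifier          = "taskflow-replica"
      replicate_source_db = aws_db_instance.primary.identifier
      instance_class      = "db.t3.micro"
      skip_final_snapshot = true
    }
    ```

    The replica is asynchronous, so reads served from it can lag slightly behind the primary; that is usually acceptable for list views such as GET /tasks.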

  4. Add a CloudWatch dashboard and alarms so an AZ problem is visible fast

    4 guided steps

    Resilience you cannot see is luck. When an AZ degrades, the first signal is usually a drop in healthy hosts and a bump in 5xx before users complain loudly. Alarms turn that into a page within minutes instead of a Twitter thread. The dashboard gives the on-call one screen to reason from during the incident the runbook describes.
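
    As a sketch of the healthy-host signal mentioned above, one alarm could be wired to SNS like this. The topic and alarm names, the thresholds, and the aws_lb.taskflow and aws_lb_target_group.backend references are assumptions; var.alert_email is the variable from variables.tf:

    ```hcl
    resource "aws_sns_topic" "alerts" {
      name = "taskflow-alerts"
    }

    # The recipient has to confirm the subscription email before alarms arrive.
    resource "aws_sns_topic_subscription" "email" {
      topic_arn = aws_sns_topic.alerts.arn
      protocol  = "email"
      endpoint  = var.alert_email
    }

    # Fires when any target has been unhealthy for two consecutive minutes.
    resource "aws_cloudwatch_metric_alarm" "unhealthy_hosts" {
      alarm_name          = "taskflow-unhealthy-hosts"
      namespace           = "AWS/ApplicationELB"
      metric_name         = "UnHealthyHostCount"
      statistic           = "Maximum"
      period              = 60
      evaluation_periods  = 2
      comparison_operator = "GreaterThanThreshold"
      threshold           = 0
      alarm_actions       = [aws_sns_topic.alerts.arn]

      dimensions = {
        LoadBalancer = aws_lb.taskflow.arn_suffix
        TargetGroup  = aws_lb_target_group.backend.arn_suffix
      }
    }
    ```

    The 5xx and RDS alarms would follow the same shape with a different namespace, metric name and threshold.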

  5. Run a failover test and write the incident runbook

    4 guided steps

    Untested resilience is a hypothesis. A controlled game day converts 'it should heal' into evidence and surfaces gaps (too-slow health checks, missing alarms, a NAT in the wrong AZ) before a real outage at 3am does. The runbook means the next on-call does not have to reverse-engineer your architecture under pressure.

What's inside when you start

3 starter files, ready to clone
5 guided milestones
5 full reference solutions
8 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

A Terraform module set that provisions a VPC with public and private subnets across two AZs, an internet-facing ALB with a /healthz target group, a Launch Template plus Auto Scaling Group of TaskFlow backend instances spread over both AZs, one NAT gateway, and security groups referenced by ID (not CIDR).
RDS Postgres in Multi-AZ mode with automated backups (7-day retention) and a read replica, with DATABASE_URL injected into instances via user-data and the password supplied through a variable rather than hard-coded.
A CloudWatch dashboard plus at least three alarms (ALB 5xx, unhealthy host count, RDS CPU/free storage) wired so a failing AZ is visible within minutes.
A reproducible failover test: terminate one backend instance and observe the ASG replace it with zero user-visible 5xx, captured as command output.
A one-page incident runbook (INCIDENT.md) covering AZ loss, instance failure, and RDS failover, with the exact AWS CLI and console steps an on-call engineer follows.

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
