Survive an availability-zone outage
Continues from the last build: TaskFlow is fully defined in Terraform but the compute may still be single-AZ.
In the last rung you got all of TaskFlow into Terraform, but the compute is still effectively single-AZ: one Launch Template, instances that happen to land in one subnet, a database with no standby.
What you'll build
TaskFlow runs across two AZs and heals itself. Traffic enters through a highly available ALB, the backend runs as an Auto Scaling Group in private subnets across two AZs, the database is RDS Multi-AZ with backups and a read replica, and CloudWatch alarms page on-call when capacity drops. You will have proven, with captured output, that killing an instance causes zero user-visible downtime, and you will have a one-page incident runbook the next on-call can actually follow.
This matters because a single-AZ deployment goes fully down whenever AWS has a bad day in that one zone. Multi-AZ with an ALB, an ASG, and RDS Multi-AZ is the baseline that production systems on AWS are expected to meet.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
variable "region" {
  type    = string
  default = "eu-west-1"
}

variable "azs" {
  type        = list(string)
  description = "Exactly two AZs to spread TaskFlow across."
  default     = ["eu-west-1a", "eu-west-1b"]
}

variable "db_password" {
  type        = string
  description = "Postgres password. Pass via TF_VAR_db_password, never commit it."
  sensitive   = true
}

variable "alert_email" {
  type        = string
  description = "Email that receives CloudWatch alarm notifications."
}

Reading this file
default = ["eu-west-1a", "eu-west-1b"]: The two availability zones every multi-AZ resource is spread across.
sensitive = true: Redacts the database password in Terraform's plan and apply output. The value is still written to the state file, so the state needs protecting too.
Pass via TF_VAR_db_password: Reminds you to supply the password as an environment variable instead of a committed file.
That's 1 of 8 explained code blocks in this single project.
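The description on `azs` says "exactly two", but nothing in the file above enforces it. As a minimal sketch, and not necessarily how the project's reference solution does it, a `validation` block (available in Terraform 0.13 and later) could turn that comment into a check that fails at plan time:

```hcl
variable "azs" {
  type        = list(string)
  description = "Exactly two AZs to spread TaskFlow across."
  default     = ["eu-west-1a", "eu-west-1b"]

  # Illustrative addition, not part of the original file:
  # reject lists that are not exactly two distinct zones.
  validation {
    condition     = length(var.azs) == 2 && length(distinct(var.azs)) == 2
    error_message = "Provide exactly two distinct availability zones."
  }
}
```

With this in place, passing a single zone or the same zone twice stops `terraform plan` with the error message instead of silently building a single-AZ stack.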
The build, milestone by milestone
1. Stretch the network across two AZs with public and private subnets
4 guided steps
An ALB needs subnets in at least two AZs, and instances spread across two private subnets survive the loss of one AZ. Without this, every later resilience feature is theater because everything still lives in one failure domain. Putting compute in private subnets also removes any direct inbound internet path to the app servers.

2. Put the backend behind an ALB with a /healthz target group and Auto Scaling Group
4 guided steps
This is what makes a dead instance a non-event. The ALB stops routing to a target once it fails a few consecutive health checks, which takes seconds with a short check interval, and the ASG, using ELB health checks, replaces it automatically. Spreading desired_capacity across two AZs means losing one AZ still leaves you with running capacity in the other. Referencing security groups by ID, rather than opening a CIDR range, applies the principle of least privilege: only the load balancer can reach the app port.

3. Provision RDS Postgres Multi-AZ with backups and a read replica
4 guided steps
Multi-AZ is what lets the database survive an AZ outage: RDS keeps a synchronous standby in the second AZ and fails over automatically by repointing the instance endpoint to the standby, typically in 60-120 seconds, with no loss of committed data. Backups protect against logical mistakes (a bad migration, a dropped table) that Multi-AZ does not. The read replica offloads GET /tasks reads so they do not compete with writes, which matters as TaskFlow grows.

4. Add a CloudWatch dashboard and alarms so an AZ problem is visible fast
4 guided steps
Resilience you cannot see is luck. When an AZ degrades, the first signal is usually a drop in healthy hosts and a bump in 5xx responses before users complain loudly. Alarms turn that into a page within minutes instead of a Twitter thread. The dashboard gives the on-call one screen to reason from during the incident the runbook describes.

5. Run a failover test and write the incident runbook
4 guided steps
Untested resilience is a hypothesis. A controlled game day converts "it should heal" into evidence and surfaces gaps (too-slow health checks, missing alarms, a NAT in the wrong AZ) before a real outage at 3 a.m. does. The runbook means the next on-call does not have to reverse-engineer your architecture under pressure.
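To make milestone 1 concrete, here is a rough sketch of one public and one private subnet per AZ, driven by the `azs` variable from the sample file. The VPC name `main`, the 10.0.0.0/16 range, and the subnet numbering are assumptions chosen for illustration, not values taken from the project.

```hcl
# Assumed VPC; the name and CIDR range are hypothetical.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# One public subnet per AZ (10.0.0.0/24, 10.0.1.0/24) for the ALB.
resource "aws_subnet" "public" {
  count                   = length(var.azs)
  vpc_id                  = aws_vpc.main.id
  availability_zone       = var.azs[count.index]
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  map_public_ip_on_launch = true
}

# One private subnet per AZ (10.0.10.0/24, 10.0.11.0/24) for the app servers.
resource "aws_subnet" "private" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  availability_zone = var.azs[count.index]
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 10)
}
```

Using `count` over `var.azs` keeps the two tiers symmetric, so adding or swapping a zone is a one-line change to the variable. Route tables, the internet gateway, and NAT are left out to keep the sketch short.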
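Milestone 2's pieces (a /healthz target group, an ASG that trusts the load balancer's health checks, and an app security group that admits only the ALB) might look roughly like the fragment below. Port 8080, the resource names, and the referenced `aws_vpc.main`, `aws_subnet.private`, `aws_security_group.alb`, `aws_security_group.app`, and `aws_launch_template.backend` are all assumed to exist elsewhere; none of them come from the project's actual files.

```hcl
resource "aws_lb_target_group" "backend" {
  name     = "taskflow-backend"
  port     = 8080 # assumed app port
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  # Two failed checks 10 s apart mark a target unhealthy in roughly 20 s.
  health_check {
    path                = "/healthz"
    interval            = 10
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 2
    matcher             = "200"
  }
}

# Only the ALB's security group may reach the app port (referenced by ID, not CIDR).
resource "aws_security_group_rule" "app_from_alb" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.alb.id
}

resource "aws_autoscaling_group" "backend" {
  name                      = "taskflow-backend"
  min_size                  = 2
  max_size                  = 4
  desired_capacity          = 2
  vpc_zone_identifier       = aws_subnet.private[*].id # both AZs
  target_group_arns         = [aws_lb_target_group.backend.arn]
  health_check_type         = "ELB" # replace instances the ALB reports unhealthy
  health_check_grace_period = 60

  launch_template {
    id      = aws_launch_template.backend.id
    version = "$Latest"
  }
}
```

The `health_check_type = "ELB"` line is the one that ties the two halves together: without it the ASG only replaces instances that fail EC2 status checks, so an instance with a crashed app process would stay in service.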
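For milestone 3, a hedged sketch of the primary and its read replica follows. The instance class, storage size, identifiers, and the `aws_db_subnet_group.main` reference are placeholders; `db_name` assumes AWS provider v4 or later.

```hcl
resource "aws_db_instance" "primary" {
  identifier              = "taskflow-db" # hypothetical name
  engine                  = "postgres"
  instance_class          = "db.t3.micro"
  allocated_storage       = 20
  db_name                 = "taskflow"
  username                = "taskflow"
  password                = var.db_password
  db_subnet_group_name    = aws_db_subnet_group.main.name
  multi_az                = true # synchronous standby in the second AZ
  backup_retention_period = 7    # must be > 0 for a read replica to be created
  skip_final_snapshot     = true # sketch only; keep a final snapshot in production
}

resource "aws_db_instance" "replica" {
  identifier          = "taskflow-db-replica"
  replicate_source_db = aws_db_instance.primary.identifier
  instance_class      = "db.t3.micro"
  skip_final_snapshot = true
}
```

The replica inherits engine and storage settings from its source, which is why it declares so little. Note that the standby created by `multi_az` serves no traffic; only the separate replica can take reads.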
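The "drop in healthy hosts" signal from milestone 4 could be wired to the `alert_email` variable along these lines. The threshold of 2, the alarm name, and the `aws_lb.main` and `aws_lb_target_group.backend` references are assumptions for the sake of the example.

```hcl
resource "aws_sns_topic" "alerts" {
  name = "taskflow-alerts"
}

# The recipient must confirm the subscription email before alerts are delivered.
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Fire when fewer than 2 targets are healthy for 2 consecutive minutes.
resource "aws_cloudwatch_metric_alarm" "healthy_hosts_low" {
  alarm_name          = "taskflow-healthy-hosts-low"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HealthyHostCount"
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = 2
  comparison_operator = "LessThanThreshold"
  threshold           = 2
  treat_missing_data  = "breaching"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    TargetGroup  = aws_lb_target_group.backend.arn_suffix
    LoadBalancer = aws_lb.main.arn_suffix
  }
}
```

Treating missing data as breaching is a deliberate choice in this sketch: if the metric stops arriving altogether, that is more likely an outage than a quiet period.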
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.