Expert · Storefront · Project 11 of 11 · ~16h · 6 milestones

Survive a region outage with multi-cluster failover

Continues from the last build: The Storefront is meshed, autoscaling, and supply-chain-locked after admission policy, but it all still lives in one cluster in one region, so a single cluster or region outage takes the entire Storefront down.

Everything you built works beautifully, in one cluster, in one region.

Multi-cluster Kubernetes topology · Multi-region disaster recovery design · GitOps fan-out to many clusters · Istio multi-primary / east-west gateway · Global load balancing & DNS failover · Cross-region database replication · RTO / RPO definition & measurement · Failover game day & failback

What you'll build

Two Storefront clusters in two regions, both reconciled from the same GitOps repo so they run the identical Helm release; cross-region data survival via a managed multi-region database (or replicated storage) plus a documented plan for the StatefulSet; a global load balancer / DNS failover that health-checks both regions and routes users to a healthy one; a written RTO and RPO with the design that meets them; and a recorded failover game day where you kill the primary, watch traffic move to the secondary inside the RTO, then fail back, with a clear active-active vs active-passive tradeoff decision.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

clusters/region-2.yaml

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: storefront-region-2
  namespace: argocd
spec:
  project: storefront
  source:
    repoURL: https://github.com/your-org/storefront-gitops
    path: apps/storefront
    helm:
      valueFiles:
        - values.yaml
        - values-region-2.yaml
  destination:
    server: https://cluster-b.example.com   # second cluster API
    namespace: storefront
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Reading this file

  • `kind: Application`: An Argo CD Application is one sync target; this second one reconciles the identical chart onto cluster B so both regions stay in lockstep.
  • `path: apps/storefront`: Points at the same chart path cluster A uses, so there is one chart, not a fork, deployed to both clusters.
  • `values-region-2.yaml`: The only per-region difference is this overlay (region, hostname, replica floor); the app itself is identical across clusters.
  • `server: https://cluster-b`: Targets the second cluster's API endpoint, so this Application deploys to region-2 rather than region-1.

A GitOps Application that points the SAME Storefront chart at the second cluster, with the region-2 values overlay. This is how one repo fans out to both regions.

That's 1 of 9 explained code blocks in this single project.
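
The file above defines one Application per cluster. As a hedged alternative sketch (not one of the project's files), Argo CD's ApplicationSet resource can render both Applications from a single template with a list generator. The `region-1` entry, the `cluster-a.example.com` URL, and a `values-region-1.yaml` overlay are assumptions made for illustration; only the region-2 values come from the file above.

```yaml
# Sketch only: one ApplicationSet that renders an Application per region.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: storefront
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - region: region-1                        # assumed region name
            server: https://cluster-a.example.com   # assumed first cluster API
          - region: region-2
            server: https://cluster-b.example.com   # second cluster API
  template:
    metadata:
      name: 'storefront-{{region}}'                 # storefront-region-1, storefront-region-2
    spec:
      project: storefront
      source:
        repoURL: https://github.com/your-org/storefront-gitops
        targetRevision: HEAD                        # explicit form of the default
        path: apps/storefront                       # same chart path for every region
        helm:
          valueFiles:
            - values.yaml
            - 'values-{{region}}.yaml'              # per-region overlay
      destination:
        server: '{{server}}'
        namespace: storefront
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

With this shape, adding a third region would be one more list element. Separate Application files, as in the sample above, are easier to pause or change one region at a time.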

The build, milestone by milestone

  1. Stand up a second cluster and fan GitOps out to both regions

    5 guided steps

    A second region is only disaster recovery if its app is provably identical to the first and stays that way. GitOps fan-out makes both clusters converge on one source of truth, so they cannot drift, and a fix lands in both regions from a single commit instead of a manual second deploy you will forget under pressure.

  2. Link the two meshes with an Istio east-west gateway

    5 guided steps

    Two isolated clusters are two islands: if a backend only lives in region-1, a request landing in region-2 has nowhere to go. A multi-primary mesh with an east-west gateway lets endpoints be discovered across clusters, so during a partial failure the surviving cluster can still serve, and so you can later shift traffic deliberately rather than all-or-nothing.

  3. Make state survive a region loss

    5 guided steps

    Stateless pods are easy to run in two regions; data is the hard part of DR. If the database only lives in region-1, failing over the app to region-2 just points it at a dead database. State has to either live in a multi-region managed service or be replicated, and you have to decide consciously how much recent data you are willing to lose. That decision is your RPO.

  4. Put a global load balancer with health-checked failover in front

    5 guided steps

    All the cross-region machinery is useless if users still resolve to a dead region. A global LB or health-checked DNS is the thing that actually moves users to the healthy cluster, and its health-check interval plus DNS TTL are the biggest levers on how fast that switch happens, which directly sets your RTO.

  5. Define RTO and RPO and prove the design meets them

    5 guided steps

    DR without numbers is theatre. RTO and RPO are the contract: they tell the business what to expect and tell you which design choices (replication mode, TTL, warm vs cold standby) actually matter. Writing them down turns "we have a second region" into "we recover in N minutes losing at most M seconds of data", which is what an interviewer and an incident commander both want.

  6. Run a failover game day, then fail back

    5 guided steps

    A DR plan you have never executed is a guess. The only way to know your RTO/RPO are real, your health checks fire, your standby promotes, and your runbook is correct is to break the primary on purpose, under calm conditions, and time the recovery. The first failover should be a scheduled game day, not a 3am surprise.
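
Milestones 4 and 5 tie the RTO to the health-check interval and the DNS TTL. A rough worked budget can make that concrete. It assumes the health check needs a fixed number of consecutive failures before it marks a region unhealthy, that resolvers honor the TTL, and that the standby needs some time to promote. Every number below is hypothetical, not one of the project's targets.

```latex
\[
\mathrm{RTO} \;\approx\;
  \underbrace{n_{\mathrm{fail}} \cdot t_{\mathrm{check}}}_{\text{detection}}
  \;+\; \underbrace{t_{\mathrm{TTL}}}_{\text{DNS cache expiry}}
  \;+\; \underbrace{t_{\mathrm{promote}}}_{\text{standby promotion}}
\]
\[
\mathrm{RTO} \;\approx\; 3 \cdot 10\,\mathrm{s} + 60\,\mathrm{s} + 120\,\mathrm{s}
  \;=\; 210\,\mathrm{s} \;=\; 3.5\ \text{min}
\]
\[
\mathrm{RPO} \;\approx\; \text{replication lag at the moment of failure}
  \quad \text{(asynchronous replication)}
\]
```

Summing the terms is deliberately conservative: promotion may overlap with DNS cache expiry, so a measured game-day RTO can come in lower. Clients that cache DNS beyond the TTL would push it higher.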

What's inside when you start

3 starter files, ready to clone
6 guided milestones
6 full reference solutions
9 code blocks explained line-by-line
6 "is it working?" checks
5 interview questions it prepares you for

You'll walk away with

A second Storefront cluster in a second region, reconciled from the same GitOps repo so both run the identical Helm release
An Istio east-west gateway linking the two meshes, with locality load balancing so traffic prefers in-region endpoints
Cross-region data survival: a managed multi-region database (or replicated storage) plus a written plan for the StatefulSet on failover
A global load balancer / DNS failover that health-checks both regions and routes users to a healthy one
A DR runbook with explicit RTO and RPO targets, their decomposition, and an active-active vs active-passive decision
A recorded failover game day: killed the primary, measured RTO/RPO against targets, and failed back cleanly
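
The locality load balancing named in the list above is typically configured with an Istio DestinationRule. This is a minimal sketch, assuming a service host of `storefront.storefront.svc.cluster.local` (hypothetical; substitute the real service name) and illustrative thresholds. Outlier detection appears alongside the locality setting because Istio's locality failover relies on it to notice unhealthy in-region endpoints.

```yaml
# Sketch only: prefer in-region endpoints, fail over when they are ejected.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: storefront-locality                        # hypothetical name
  namespace: storefront
spec:
  host: storefront.storefront.svc.cluster.local    # assumed service host
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true              # route to same-locality endpoints first
    outlierDetection:
      consecutive5xxErrors: 5      # illustrative: eject after 5 consecutive 5xx responses
      interval: 10s                # illustrative: how often endpoints are evaluated
      baseEjectionTime: 30s        # illustrative: minimum time an endpoint stays ejected
```

In a multi-primary mesh each cluster runs its own control plane, so a rule like this would likely need to be applied in both clusters, for example through the same GitOps repo.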
