Advanced · Carta · Project 10 of 14 · ~7h · 5 milestones

Break it on purpose: chaos experiments

Continues from the last build: A remediator service auto-heals the known fd-leak signature, but it only handles one failure you already knew about.

Your remediator from the last rung heals the one failure you already understood, the fd leak.

Chaos engineering principles
Hypothesis-driven experiments
Blast-radius and abort conditions
Fault injection with tc netem
Steady-state SLI definition
Container failure injection
Incident-style observation
Experiment documentation

What you'll build

You walk away with a repeatable chaos experiment harness: a steady-state definition tied to your SLIs, three written hypotheses with blast-radius and abort conditions, an injection runner script, and an experiment log documenting what held, what broke, and the one fix the surprise forced. You will be able to defend "our checkout survives a dead catalog" with evidence, not hope.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

chaos/experiment-template.md (markdown)
# Experiment N: <short name>

hypothesis:   checkout SLO holds when <dependency> is <fault>
blast radius: <which service, how long, what % of traffic>
abort:        <the exact SLI threshold that stops the run>

method:
  # 1. start baseline load
  k6 run load/baseline.js
  # 2. inject
  <injection command>
  # 3. observe steady-state dashboard for <duration>
  # 4. rollback
  <rollback command>

result: <held | broke> + the surprise + the action you took

Reading this file

  • hypothesis: checkout SLO holds when <dependency> is <fault>
    A falsifiable claim, the heart of a chaos experiment.
  • abort: <the exact SLI threshold that stops the run>
    Declared before the injection so you stop early, not after damage.
  • k6 run load/baseline.js
    Steady-state traffic must flow or the SLIs are undefined.

Copy this per experiment. The order (hypothesis, blast radius, abort BEFORE method) is the discipline that keeps chaos controlled.
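The abort line is easiest to honour when it is a command rather than a sentence. Here is a minimal sketch, assuming a Prometheus server at a hypothetical `PROM_URL` and that `curl`, `jq`, and `awk` are installed. The function names, the query, and the threshold are illustrative and are not taken from the project's starter files.

```shell
#!/usr/bin/env bash
# Sketch of an abort check. PROM_URL, the function names, and any query or
# threshold passed in are assumptions made for illustration.

PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant query against the Prometheus HTTP API.
# Prints the first sample value, or nothing if the query returns no series.
fetch_sli() {
  local query="$1"
  curl -fsS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=${query}" \
    | jq -r '.data.result[0].value[1] // empty'
}

# Succeeds (exit 0) when the observed value is above the abort threshold.
# awk does the comparison because shell arithmetic is integer-only.
sli_breached() {
  local value="$1" threshold="$2"
  awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Treat a missing SLI as a breach: without data the run cannot be judged.
should_abort() {
  local value="$1" threshold="$2"
  [ -z "$value" ] || sli_breached "$value" "$threshold"
}
```

A run might call `should_abort "$(fetch_sli "$QUERY")" 0.02` in a loop during the observation window, where `QUERY` holds whatever PromQL expression your steady-state definition uses for the SLI, and trigger the rollback as soon as it succeeds.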

That's 1 of 7 explained code blocks in this single project.

The build, milestone by milestone

  1. Define steady state and write the experiment template

    4 guided steps

    Chaos without a steady-state definition is just breaking things. The steady-state metrics are your scoreboard: they tell you whether the system tolerated the fault. Writing abort conditions first is what separates a controlled experiment from an outage you caused.

  2. Experiment 1: kill catalog and observe

    5 guided steps

    This is the cheapest, most informative experiment: a hard dependency failure. It directly tests the resilience claim from earlier rungs. If checkout collapses when catalog dies, you found a single point of failure before the flash sale did.

  3. Experiment 2: inject payments latency with tc netem

    5 guided steps

    A slow dependency is often more dangerous than a dead one: connections pile up, threads block, and the slowness propagates upstream. This tests whether your timeout budget actually protects the checkout latency SLI under realistic provider degradation.

  4. Experiment 3: fill the Postgres disk

    5 guided steps

    Resource exhaustion is the failure mode teams forget. A full disk rarely produces one clean error: writes can hang or fail partway, and state can be left inconsistent. This experiment checks the most important safety property of a payment system: it must never take money without recording the order.

  5. Fix the surprise and write the experiment log

    5 guided steps

    Chaos engineering only pays off if you close the loop: a found weakness becomes a fix, and the fix becomes a passing re-run. The log is the artifact you bring to the flash-sale readiness review: dated, reproducible evidence that the checkout path tolerates each fault.
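The safety property from Experiment 3, never taking money without recording the order, can be checked mechanically after the disk-fill run instead of by reading the dashboard. The sketch below assumes you can export two plain-text files with one order ID per line: one from the payment side and one from the orders table. The file format and the function name are hypothetical, and how you produce the exports depends on your own services.

```shell
#!/usr/bin/env bash
# Sketch: list payments that have no matching order.
# Input format is an assumption: one order ID per line in each file.

# Print every ID present in the payments file but absent from the orders file.
orphaned_payments() {
  local payments="$1" orders="$2" tmp
  tmp="$(mktemp -d)" || return 1
  sort -u "$payments" > "$tmp/payments"
  sort -u "$orders"   > "$tmp/orders"
  # comm -23 keeps only the lines unique to the first (sorted) file.
  comm -23 "$tmp/payments" "$tmp/orders"
  rm -rf "$tmp"
}
```

Empty output means every charged ID has a recorded order. Any line printed is a payment without an order, which is the kind of surprise the experiment log should record along with its fix.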

What's inside when you start

2 starter files, ready to clone
5 guided milestones
5 full reference solutions
7 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

steady_state.md defining the checkout SLIs, PromQL queries, thresholds, and global abort conditions
experiment-1.md, experiment-2.md, experiment-3.md, each with hypothesis, blast radius, abort condition, method, and result
run-experiment.sh, an injection runner script with trap-based rollback for the latency and disk-fill faults
An api catalog-fallback fix serving product data from Redis when catalog is dead, with a degraded-mode metric
CHAOS_LOG.md summarizing all three experiments, the surprise, the fix, and follow-up hypotheses
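To make "trap-based rollback" concrete, here is one possible shape for the runner. It is a sketch under stated assumptions, not the project's reference solution: the function name, the argument order, and the 60-second default observation window are all invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a runner with trap-based rollback. Argument order and defaults
# are assumptions, not the project's reference solution.

# run_experiment INJECT_CMD ROLLBACK_CMD [OBSERVE_SECONDS]
# The body is a subshell, so the EXIT trap fires when the experiment ends,
# whether it finishes normally, the injection fails, or it is interrupted.
run_experiment() (
  inject="$1"
  rollback="$2"
  observe="${3:-60}"

  # Register the rollback BEFORE injecting, so a half-applied fault is still undone.
  trap 'sh -c "$rollback"' EXIT
  trap 'exit 130' INT TERM

  sh -c "$inject" || exit 1
  sleep "$observe"
)
```

For the latency fault, a call might look like `run_experiment "tc qdisc add dev eth0 root netem delay 300ms" "tc qdisc del dev eth0 root" 120`. The interface `eth0` and the 300 ms delay are placeholders, and the `tc` commands need root or `CAP_NET_ADMIN` in the target network namespace. Because the rollback lives in a trap, pressing Ctrl-C during the observation window still removes the fault.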
