Intermediate · Carta · Project 5 of 14 · ~6h · 5 milestones

Alert on symptoms, kill the noise

Continues from the last build: You have SLO targets (99.9% checkout availability and 95% of requests under 400ms, over 30 days), error budget math, a written budget policy, and recording rules in Prometheus.

In the last rung you wrote real SLOs and recording rules for Carta, but the inherited Alertmanager still pages on whatever the previous owner wired. A CPU-high rule fired at 2am and woke you for a batch job that no user ever felt, while a real checkout outage burned 40 minutes of error budget in total silence.

Multi-window multi-burn-rate alerting · Prometheus recording and alerting rules · PromQL for error budget burn · Alertmanager routing and inhibition · Symptom vs cause alert design · Alert quality review (precision and recall) · Failure injection with k6 and stub knobs

What you'll build

You walk away able to design and ship multi-window multi-burn-rate SLO alerts from scratch, explain why cause-based alerts page wrongly, route and inhibit alerts in Alertmanager so the pager only fires on real user pain, and run an honest alert quality review (precision, recall, time-to-detect) against injected failures.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

prometheus/rules/legacy_cause.rules.yml (yaml)
groups:
  - name: legacy_cause_alerts
    rules:
      # Inherited cause-based alert. Pages on CPU, not on user pain.
      # This is the 2am false page. You will DELETE this whole file.
      - alert: HighCPU
        expr: rate(process_cpu_seconds_total{job="api"}[5m]) > 0.8
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "API CPU is high"
          description: "CPU rate above 0.8 for 2m. Does not mean users are hurt."

Reading this file

  • alert: HighCPU
    The offending rule name. It alerts on a machine condition, not on whether checkout works.
  • rate(process_cpu_seconds_total{job="api"}[5m]) > 0.8
    A cause-based expression: high CPU. Users do not feel CPU, they feel failed or slow checkouts.
  • severity: page
    This label is why it wakes you. Routing CPU to the pager is the root mistake you are fixing.
  • for: 2m
    A flat duration threshold with no notion of error budget burn rate, so it cannot tell real pain from noise.

The inherited noise. CPU being busy is a cause, not a symptom. A batch job or a healthy traffic spike trips this without any user ever seeing an error. Milestone 1 deletes this file and proves nothing breaks.

That's 1 of 8 explained code blocks in this single project.

The build, milestone by milestone

  1. Delete the inherited cause-based alerts and prove nothing breaks

    4 guided steps

    Cause-based alerts page on machine conditions (CPU, memory, disk) that do not map to user pain. They generate false pages (the 2am batch job) and false confidence (CPU was fine while checkout was down). You cannot add good alerts on top of noise; you have to remove the noise first so the pager regains meaning.

  2. Add multi-window burn-rate recording rules

    4 guided steps

    A burn rate is how fast you are spending the 30-day error budget relative to an even spend that would use it up in exactly 30 days. At 1x you exhaust the budget in exactly 30 days. At 14.4x you would exhaust it in roughly 2 days (30 / 14.4 ≈ 2.1). Multi-window alerting compares a long window (is this sustained?) against a short window (is it still happening now?), which gives both precision and a fast reset. Pre-recording each window keeps alert rules readable and cheap to evaluate.
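
    A minimal sketch of what pre-recorded burn-rate windows might look like for the 99.9% availability SLO. The metric `http_requests_total`, its `job`, `route`, and `code` labels, and the recorded series names are assumptions made for illustration, not names taken from the project's starter files.

    ```yaml
    groups:
      - name: slo_burn_recording
        rules:
          # Burn rate = observed error ratio / error budget.
          # For a 99.9% SLO the budget is 1 - 0.999 = 0.001.
          # Metric and label names below are hypothetical.
          - record: checkout:availability_burn_rate:5m
            expr: |
              (
                sum(rate(http_requests_total{job="api", route="/checkout", code=~"5.."}[5m]))
                /
                sum(rate(http_requests_total{job="api", route="/checkout"}[5m]))
              ) / 0.001
          - record: checkout:availability_burn_rate:30m
            expr: |
              (
                sum(rate(http_requests_total{job="api", route="/checkout", code=~"5.."}[30m]))
                /
                sum(rate(http_requests_total{job="api", route="/checkout"}[30m]))
              ) / 0.001
          - record: checkout:availability_burn_rate:1h
            expr: |
              (
                sum(rate(http_requests_total{job="api", route="/checkout", code=~"5.."}[1h]))
                /
                sum(rate(http_requests_total{job="api", route="/checkout"}[1h]))
              ) / 0.001
          - record: checkout:availability_burn_rate:6h
            expr: |
              (
                sum(rate(http_requests_total{job="api", route="/checkout", code=~"5.."}[6h]))
                /
                sum(rate(http_requests_total{job="api", route="/checkout"}[6h]))
              ) / 0.001
    ```

    Dividing by the budget inside the recording rule means each recorded series is already a burn-rate multiple, so an alert rule can compare it directly against 14.4 or 3.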

  3. Write the multi-window multi-burn-rate alert rules

    4 guided steps

    A single long window is slow to fire and slow to clear. A single short window is twitchy and noisy. Pairing them gives you the best of both: the long window confirms the burn is real and sustained, the short window confirms it is still happening and lets the alert clear fast when you fix it. 14.4x over 1h burns roughly 2% of a 30-day budget in an hour (14.4 × 1h / 720h), which is page-now territory; 3x over 6h is a slow leak worth a ticket, not a 2am page.
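
    The pairing described here could be sketched as two alert rules, each requiring both its long and its short window to exceed the same threshold. This assumes burn-rate series named `checkout:availability_burn_rate:<window>` have already been recorded; those names, the alert names, and the `service` label are hypothetical.

    ```yaml
    groups:
      - name: slo_burn_alerts
        rules:
          # Fast burn: long window (1h) confirms the burn is sustained,
          # short window (5m) confirms it is still happening right now.
          - alert: CheckoutFastBurn
            expr: |
              checkout:availability_burn_rate:1h > 14.4
              and
              checkout:availability_burn_rate:5m > 14.4
            labels:
              severity: page
              service: checkout
            annotations:
              summary: "Checkout is burning error budget faster than 14.4x"
          # Slow burn: same shape with a lower threshold and longer windows.
          - alert: CheckoutSlowBurn
            expr: |
              checkout:availability_burn_rate:6h > 3
              and
              checkout:availability_burn_rate:30m > 3
            labels:
              severity: ticket
              service: checkout
            annotations:
              summary: "Checkout is slowly leaking error budget (more than 3x)"
    ```

    The `and` operator only returns a result while both sides are above threshold, so once the fix lands and the short window drops back under it, the alert clears without waiting for the long window to drain.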

  4. Route pages and tickets and inhibit downstream noise in Alertmanager

    4 guided steps

    Routing is what makes severity mean something. Without it, the page and the ticket land in the same place and the human cannot tell what needs them now. Inhibition encodes the fact that during a known outage, downstream symptoms (slow inventory, cache misses) are consequences, not new incidents, so they should not each fire a separate notification and bury the real signal.
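
    One way the routing and inhibition could look in `alertmanager/alertmanager.yml`, as a sketch only. The receiver names, webhook URLs, and the `CheckoutFastBurn` alert name are placeholders, and the inhibition assumes downstream alerts carry the same `service` label value as the page.

    ```yaml
    route:
      receiver: ticket-queue            # default: anything unmatched becomes a ticket
      group_by: ["alertname", "service"]
      routes:
        - matchers:
            - severity="page"
          receiver: pager
        - matchers:
            - severity="ticket"
          receiver: ticket-queue

    inhibit_rules:
      # While the fast-burn page is firing for a service, mute ticket-level
      # alerts that share the same service label.
      - source_matchers:
          - alertname="CheckoutFastBurn"
        target_matchers:
          - severity="ticket"
        equal: ["service"]

    receivers:
      - name: pager
        webhook_configs:
          - url: "http://pager-stub:9000/page"      # hypothetical endpoint
      - name: ticket-queue
        webhook_configs:
          - url: "http://ticket-stub:9000/ticket"   # hypothetical endpoint
    ```

    The `equal: ["service"]` line is what scopes the inhibition: a checkout page mutes checkout-labelled tickets but leaves alerts for other services untouched.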

  5. Run an alert quality review against injected failures

    5 guided steps

    An alert you never tested under failure is a guess. Reviewing precision (did every page correspond to real user pain?) and recall (did every period of real pain produce a page?) is how you prove the new symptom-based alerts are better than the deleted cause-based ones, not just different. Time-to-detect tells you whether the fast window is fast enough.
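
    The three review numbers can be written down using the standard definitions. Here TP counts injected scenarios that should page and did, FP counts pages with no user pain (the 2am CPU page), and FN counts periods of user pain with no page (the silent checkout outage). The timestamp symbols are notation introduced for this sketch.

    ```latex
    \[
    \text{precision} = \frac{TP}{TP + FP},
    \qquad
    \text{recall} = \frac{TP}{TP + FN},
    \qquad
    \text{time-to-detect} = t_{\text{first page}} - t_{\text{failure injected}}
    \]
    ```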

What's inside when you start

3 starter files, ready to clone
5 guided milestones
5 full reference solutions
8 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

prometheus/rules/slo_burn.rules.yml with four burn-rate recording rules (5m and 1h for the fast burn, 30m and 6h for the slow burn) plus the fast-burn page and slow-burn ticket alerts
The inherited prometheus/rules/legacy_cause.rules.yml deleted, with a note recording why CPU-high was noise
An extended alertmanager/alertmanager.yml with page and ticket routes and a service-scoped inhibition rule
An alert quality review document: a table of injected scenarios, should-page versus did-page, and time-to-detect
A short runbook entry mapping the fast-burn page to first investigation steps (check payments-stub, check 5xx by dependency)

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
