Alert on symptoms, kill the noise
Continues from the last build: You have SLO targets (99.9% checkout availability, 95% under 400ms over 30 days), error budget math, a written budget policy, and recording rules in Prometheus.
Last rung you wrote real SLOs and recording rules for Carta, but the inherited Alertmanager still pages on whatever the previous owner wired. A CPU-high rule fired at 2am and woke you for a batch job that no user ever felt, while a real checkout outage burned 40 minutes of error budget in total silence.
What you'll build
You walk away able to design and ship multi-window multi-burn-rate SLO alerts from scratch, explain why cause-based alerts page wrongly, route and inhibit alerts in Alertmanager so the pager only fires on real user pain, and run an honest alert quality review (precision, recall, time-to-detect) against injected failures.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
groups:
  - name: legacy_cause_alerts
    rules:
      # Inherited cause-based alert. Pages on CPU, not on user pain.
      # This is the 2am false page. You will DELETE this whole file.
      - alert: HighCPU
        expr: rate(process_cpu_seconds_total{job="api"}[5m]) > 0.8
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "API CPU is high"
          description: "CPU rate above 0.8 for 2m. Does not mean users are hurt."
Reading this file
alert: HighCPU
The offending rule name. It alerts on a machine condition, not on whether checkout works.

rate(process_cpu_seconds_total{job="api"}[5m]) > 0.8
A cause-based expression: high CPU. Users do not feel CPU, they feel failed or slow checkouts.

severity: page
This label is why it wakes you. Routing CPU to the pager is the root mistake you are fixing.

for: 2m
A flat duration threshold with no notion of error budget burn rate, so it cannot tell real pain from noise.
The inherited noise. CPU being busy is a cause, not a symptom. A batch job or a healthy traffic spike trips this without any user ever seeing an error. Milestone 1 deletes this file and proves nothing breaks.
That's 1 of 8 explained code blocks in this single project.
The build, milestone by milestone
- 1
Delete the inherited cause-based alerts and prove nothing breaks
4 guided steps
Cause-based alerts page on machine conditions (CPU, memory, disk) that do not map to user pain. They generate false pages (the 2am batch job) and false confidence (CPU was fine while checkout was down). You cannot add good alerts on top of noise; you have to remove the noise first so the pager regains meaning.
- 2
Add multi-window burn-rate recording rules
4 guided steps
A burn rate is how fast you are spending the 30-day error budget relative to a steady spend. 1x means you will exactly exhaust the budget in 30 days. 14.4x means you would exhaust it in roughly 2 days. Multi-window alerting compares a long window (is this sustained?) against a short window (is it still happening now?), which gives both precision and a fast reset. Pre-recording each window keeps alert rules readable and cheap to evaluate.
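As a rough sketch of what pre-recorded windows could look like: burn rate here is the observed error ratio divided by the budget fraction, which is 0.001 for the 99.9% availability target. The counter `http_requests_total`, its `job="checkout"` and `code` labels, the group and rule names, and the 5m and 30m short windows are all assumptions for illustration; the project's own files may use different names and windows.

```yaml
groups:
  - name: checkout_burn_rate  # hypothetical group name
    rules:
      # Burn rate = error ratio / (1 - SLO target) = error ratio / 0.001.
      # ASSUMPTION: a request counter http_requests_total with a `code` label.
      - record: checkout:burn_rate:ratio_rate5m
        expr: |
          (
            sum(rate(http_requests_total{job="checkout", code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="checkout"}[5m]))
          ) / 0.001
      - record: checkout:burn_rate:ratio_rate1h
        expr: |
          (
            sum(rate(http_requests_total{job="checkout", code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="checkout"}[1h]))
          ) / 0.001
      - record: checkout:burn_rate:ratio_rate30m
        expr: |
          (
            sum(rate(http_requests_total{job="checkout", code=~"5.."}[30m]))
            /
            sum(rate(http_requests_total{job="checkout"}[30m]))
          ) / 0.001
      - record: checkout:burn_rate:ratio_rate6h
        expr: |
          (
            sum(rate(http_requests_total{job="checkout", code=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{job="checkout"}[6h]))
          ) / 0.001
```

With one series per window, an alert rule only has to compare a recorded value against a threshold instead of repeating the ratio expression.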
- 3
Write the multi-window multi-burn-rate alert rules
4 guided steps
A single long window is slow to fire and slow to clear. A single short window is twitchy and noisy. Pairing them gives you the best of both: the long window confirms the burn is real and sustained, the short window confirms it is still happening and lets the alert clear fast when you fix it. 14.4x over 1h burns roughly 2% of a 30-day budget in an hour, which is page-now territory; 3x over 6h is a slow leak worth a ticket, not a 2am page.
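The pairing might be written roughly as follows. The arithmetic behind the thresholds: a 30-day budget spans 720 hours, so 14.4x for 1h spends 14.4 / 720 = 2% of it, and 3x for 6h spends 18 / 720 = 2.5%. This sketch assumes recording rules named `checkout:burn_rate:ratio_rate5m`, `ratio_rate1h`, `ratio_rate30m`, and `ratio_rate6h` already exist; those names, the alert names, the `ticket` severity value, and the 5m and 30m short windows are hypothetical.

```yaml
groups:
  - name: checkout_slo_alerts  # hypothetical group name
    rules:
      # Fast burn: the long window (1h) shows a sustained burn AND the
      # short window (5m, assumed) shows it is still happening right now.
      - alert: CheckoutFastBurn
        expr: |
          checkout:burn_rate:ratio_rate1h > 14.4
          and
          checkout:burn_rate:ratio_rate5m > 14.4
        labels:
          severity: page
        annotations:
          summary: "Checkout is burning error budget at more than 14.4x"
      # Slow burn: same pairing at 3x over 6h, confirmed by a 30m window (assumed).
      - alert: CheckoutSlowBurn
        expr: |
          checkout:burn_rate:ratio_rate6h > 3
          and
          checkout:burn_rate:ratio_rate30m > 3
        labels:
          severity: ticket
        annotations:
          summary: "Checkout is leaking error budget at more than 3x"
```

Because the short window drops back under the threshold soon after a fix, the `and` stops matching and the alert clears without waiting for the long window to drain.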
- 4
Route pages and tickets and inhibit downstream noise in Alertmanager
4 guided steps
Routing is what makes severity mean something. Without it, the page and the ticket land in the same place and the human cannot tell what needs them now. Inhibition encodes the fact that during a known outage, downstream symptoms (slow inventory, cache misses) are consequences, not new incidents, so they should not each fire a separate notification and bury the real signal.
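One possible shape for that routing and inhibition in an Alertmanager config, as a sketch only. The receiver names, the `CheckoutFastBurn` alert name, and the `service` and `environment` labels are assumptions, and the receivers are left without integrations, so real pager and ticketing settings would still have to be added. The `matchers`, `source_matchers`, and `target_matchers` fields need Alertmanager 0.22 or later.

```yaml
route:
  receiver: ticket-queue        # default: anything unmatched becomes a ticket
  group_by: ['alertname']
  routes:
    - matchers:
        - 'severity = "page"'
      receiver: pager
    - matchers:
        - 'severity = "ticket"'
      receiver: ticket-queue

inhibit_rules:
  # While the checkout page is firing, mute ticket-level alerts from the
  # downstream services in the same environment.
  # ASSUMPTION: downstream alerts carry service and environment labels.
  - source_matchers:
      - 'alertname = "CheckoutFastBurn"'
    target_matchers:
      - 'severity = "ticket"'
      - 'service =~ "inventory|cache"'
    equal: ['environment']

receivers:
  - name: pager                 # hypothetical; add the paging integration here
  - name: ticket-queue          # hypothetical; add the ticketing integration here
```

Inhibited alerts still show up in the Alertmanager UI while the source alert is firing; only their notifications are suppressed.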
- 5
Run an alert quality review against injected failures
5 guided steps
An alert you never tested under failure is a guess. Reviewing precision (did every page correspond to real user pain?) and recall (did every period of real pain produce a page?) is how you prove the new symptom-based alerts are better than the deleted cause-based ones, not just different. Time-to-detect tells you whether the fast window is fast enough.
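The three review metrics can be written down compactly. The symbols are introduced here for illustration: TP counts pages that fired during real user pain, FP counts pages that fired with no user pain, FN counts periods of real pain that produced no page, and the two timestamps are when the injected failure started and when the page fired.

```latex
\begin{align*}
\text{precision} &= \frac{TP}{TP + FP} \\
\text{recall} &= \frac{TP}{TP + FN} \\
\text{time-to-detect} &= t_{\text{page}} - t_{\text{onset}}
\end{align*}
```

As a made-up example, if five injected failures produce four correct pages and one extra page fires with no failure behind it, then TP = 4, FP = 1, FN = 1, so precision and recall are both 4/5.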
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.