Beginner · Carta · Project 4 of 14 · ~5h · 5 milestones

Set SLOs and spend an error budget

Continues from the last build: Rung 3 left you with three user-centric SLIs wired into Prometheus (checkout availability, checkout journey latency, and inventory freshness). They are queryable, but no one has agreed on targets for them.

You inherited Carta with SLIs that work: you can graph the fraction of POST /checkout requests that succeed and the p95 of the full checkout journey.

SLO definition · Error budget math · Prometheus recording rules · PromQL ratio queries · Grafana SLO dashboards · Error budget policy · Burn rate

What you'll build

You walk away with two committed SLOs, Prometheus recording rules that compute SLI, error budget, and burn rate continuously, a Grafana SLO dashboard that an on-call can read in ten seconds, and a written error budget policy that turns "should we fix this?" into a rule instead of an argument. You will have spent budget on purpose and watched the dashboard react, so the numbers stop being abstract.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

slo/slos.md (markdown)
# Carta checkout SLOs

Window: rolling 30 days for both.

## SLO 1: checkout availability
Target: 99.9% of POST /checkout requests succeed (HTTP 2xx).
SLI: successful_checkouts / total_checkouts.
Error budget: 0.1% of requests may fail.
Time budget: 30 days = 43200 minutes, so 0.1% = 43.2 minutes of total failure.

## SLO 2: checkout latency
Target: 95% of checkout journeys finish under 400ms.
SLI: fast_checkouts (le 0.4s) / total_checkouts.
Error budget: up to 5% of requests may exceed 400ms.

## Why these numbers
99.9 is reachable for a stub-backed demo and leaves a real budget to spend.
400ms is the point past which a checkout feels slow to a buyer.

Reading this file

  • rolling 30 days: A rolling window means the budget is always measured over the last 30 days, not reset on the 1st, so it cannot be gamed by waiting for month end.
  • 43200 minutes: 30 days times 24 hours times 60 minutes; this is the denominator that turns a percentage budget into a minutes budget on-call can feel.
  • 43.2 minutes of total failure: 0.1% of 43200 minutes; this is the single most quoted number on this rung, the 30-day allowance for SLO 1.
  • 95% of checkout journeys finish under 400ms: A latency SLO is a percentile target, not an average; 95% under 400ms still allows a slow tail for 5% of buyers.

The contract. Two SLOs, the window, and the budget each one buys, written so a new on-call understands it in one read.
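The arithmetic in slo/slos.md can be checked with a few lines of Python. This is a minimal sketch, not one of the project's files: the function names are hypothetical, and the request volumes in the usage lines are assumed figures chosen only to make the numbers round. Targets are passed as strings so that Fraction keeps a value like 99.9 exact instead of picking up floating-point error.

```python
from fractions import Fraction
from math import floor


def window_minutes(days: int) -> int:
    """Minutes in a rolling window of `days` days."""
    return days * 24 * 60


def budget_fraction(target_percent: str) -> Fraction:
    """Share of events allowed to miss the target, e.g. "99.9" -> 1/1000."""
    return (100 - Fraction(target_percent)) / 100


def time_budget_minutes(target_percent: str, days: int) -> float:
    """Minutes of total failure the budget allows over the window."""
    return float(window_minutes(days) * budget_fraction(target_percent))


def allowed_bad_requests(target_percent: str, total_requests: int) -> int:
    """Whole requests that may miss the target out of `total_requests`."""
    return floor(total_requests * budget_fraction(target_percent))


if __name__ == "__main__":
    print(window_minutes(30))                       # 43200
    print(time_budget_minutes("99.9", 30))          # 43.2
    # Assumed volumes, for illustration only:
    print(allowed_bad_requests("99.9", 1_000_000))  # 1000 failed checkouts
    print(allowed_bad_requests("95", 20_000))       # 1000 slow checkouts
```

The time budget and the request budget are two readings of the same 0.1%: the first assumes a total outage, the second counts individual failed requests.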

That's 1 of 7 explained code blocks in this single project.

The build, milestone by milestone

  1. Do the budget math and write the SLO contract

    3 guided steps

    An SLO no one computed is an SLO no one believes. Deriving the budget yourself is what makes the later dashboard mean something instead of being a decoration.

  2. Encode SLI, budget, and burn rate as recording rules

    3 guided steps

    Computing the budget once, centrally, means the dashboard and next rung's alerts read the same trustworthy number instead of each re-deriving it slightly differently.

  3. Build the SLO dashboard as code

    3 guided steps

    A budget the team cannot see is a budget the team will not respect. Dashboard-as-code means anyone can stand up the identical board and it survives a Grafana wipe.

  4. Write the error budget policy

    3 guided steps

    Numbers without a policy are just decoration. The policy is the pre-agreed decision so that at 2am no one argues; they read the rule. It is also what makes the SLO have teeth instead of being aspirational.

  5. Spend the budget on purpose and log it

    3 guided steps

    Until you have watched the number move you do not trust it. Spending budget on purpose proves the rules, the dashboard, and the policy all react together, and it rehearses the policy decision in a safe setting.
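Milestones 2, 4, and 5 rest on two derived numbers: how fast the budget is being spent (burn rate) and how much of it is left. The sketch below shows that arithmetic in plain Python rather than PromQL. It is an illustration under stated assumptions, not the project's recording rules or its policy: the function names, the request counts, and the 25% policy threshold are all hypothetical.

```python
def error_ratio(bad: int, total: int) -> float:
    """Fraction of requests that missed the SLO; 0.0 when there is no traffic."""
    return bad / total if total > 0 else 0.0


def burn_rate(bad: int, total: int, budget: float) -> float:
    """Observed error ratio divided by the allowed error ratio.

    1.0 means the budget runs out exactly at the end of the window;
    anything above 1.0 means it runs out early.
    """
    return error_ratio(bad, total) / budget


def budget_remaining(bad: int, total: int, budget: float) -> float:
    """Share of the budget still unspent; negative once the SLO is missed.

    Pass counts for the whole SLO window here, not a short lookback.
    """
    return 1.0 - error_ratio(bad, total) / budget


def days_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Days a full budget would last if the current burn rate held."""
    return float("inf") if rate <= 0 else window_days / rate


def policy_action(remaining: float, caution_below: float = 0.25) -> str:
    """Hypothetical policy rule; the 0.25 threshold is an assumption."""
    if remaining <= 0:
        return "halt feature releases"
    if remaining < caution_below:
        return "reliability work first"
    return "ship normally"


if __name__ == "__main__":
    budget = 0.001  # 99.9% availability target leaves 0.1% of requests
    # Assumed counts: 10 failures in the last 1000 requests (1% failing).
    rate = burn_rate(bad=10, total=1000, budget=budget)
    print(round(rate, 2), round(days_until_exhausted(rate), 2))
    # Assumed counts for the full 30-day window.
    left = budget_remaining(bad=400, total=1_000_000, budget=budget)
    print(round(left, 2), policy_action(left))
```

Burn rate and budget remaining use the same ratio but over different lookbacks: a short window tells you how fast you are spending right now, while the full 30-day window tells you how much is left to spend.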

What's inside when you start

2 starter files, ready to clone
5 guided milestones
5 full reference solutions
7 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

A written SLO definition (99.9% checkout availability, 95% under 400ms, 30-day rolling window) committed as slo/slos.md
A Prometheus recording rules file (slo/recording-rules.yml) computing SLI, error budget remaining, and burn rate
A Grafana SLO dashboard JSON (grafana/dashboards/slo.json) showing budget remaining and burn rate as code
A one-page error budget policy (slo/budget-policy.md) stating what halts when budget burns
A short burn log noting which FAIL_RATE value and k6 run you used to spend budget, and what the dashboard showed

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
