Beginner · Carta · Project 1 of 14 · ~4h · 5 milestones

Triage a live incident, find it, stop it

You just joined the team that runs Carta, a small storefront checkout system, and the pager goes off in your first hour: checkout is timing out.

Incident triage and the on-call loop
Reading Grafana dashboards under pressure
Querying logs in Loki with LogQL
Reasoning about latency percentiles (p50, p95, p99)
Diagnosing database connection pool exhaustion
Applying and verifying a mitigation
Writing an incident timeline

What you'll build

You walk away having run a full on-call loop under pressure: symptom to dashboards to logs to a localized root cause (connection pool exhaustion) to a mitigation (raise the pool, scale api) to telemetry-verified recovery, plus a short written incident timeline you could paste into a channel. You learn to USE existing observability rather than build it, and you internalize that "the dashboard is green again" is a claim you must prove, not assume.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

docs/runbook-triage.md (markdown)
# Runbook: checkout latency triage (the on-call loop)

Follow this loop every time. Do not skip steps under pressure.

1. Orient: confirm the baseline. What does healthy p95 look like right now?
   - Grafana http://localhost:3001 -> Carta Overview -> checkout p95
2. Reproduce / observe the symptom.
   - Load: k6 run load/checkout.js
   - Confirm: timed checkout with curl -w "%{time_total}\n"
3. Localize. Eliminate, do not guess.
   - Are downstreams (catalog, inventory, payments) slow, or just api?
   - Logs: Loki Explore -> {service="api"} |= "pool"
4. Mitigate. Smallest safe change first.
   - Raise DB_POOL_SIZE, optionally add an api replica, recreate api.
5. Verify recovery WITH telemetry, under the same load.
   - p95 back near baseline AND pool wait logs stopped.
6. Write the timeline (docs/incident-timeline.md).

Reading this file

  • 1. Orient: confirm the baseline. Step one is always orientation. You cannot prove recovery without a baseline number.
  • Loki Explore -> {service="api"} |= "pool": The exact LogQL query that localizes pool exhaustion, ready to paste into Grafana Explore.
  • 4. Mitigate. Smallest safe change first. Raise the pool before reaching for bigger hammers. Minimal, reversible mitigations first.
  • 5. Verify recovery WITH telemetry, under the same load. The habit this whole rung teaches: recovery is a telemetry claim, proven under the original load.

Your on-call loop on one page. Keep it open while you work the incident; it is the spine of every milestone.

That's 1 of 8 explained code blocks in this single project.
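Step 3 of the runbook runs its LogQL query in Grafana Explore. As a rough sketch of the same query made scriptable, the snippet below builds a URL for Loki's `query_range` HTTP endpoint. The base URL (Loki's default port 3100), the function name `build_pool_query_url`, and the 15-minute window are assumptions for illustration, not part of the project files; only the LogQL string comes from the runbook.

```python
import time
from urllib.parse import urlencode

# Assumption: Loki listens on its default port 3100. The project may expose it elsewhere.
LOKI_BASE_URL = "http://localhost:3100"

# LogQL taken from runbook step 3: api log lines that contain "pool".
POOL_QUERY = '{service="api"} |= "pool"'


def build_pool_query_url(start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a Loki query_range URL for the runbook's pool query.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {
        "query": POOL_QUERY,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
    }
    return f"{LOKI_BASE_URL}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical window: the last 15 minutes.
    now_ns = time.time_ns()
    window_ns = 15 * 60 * 1_000_000_000
    print(build_pool_query_url(now_ns - window_ns, now_ns))
```

Fetching the printed URL (for example with curl) should return the matching log lines as JSON, provided Loki is actually reachable at that address.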

The build, milestone by milestone

  1. Bring up the inherited stack and confirm a healthy baseline

    4 guided steps

    On-call begins with orientation, not action. A baseline turns a vague feeling ("checkout feels slow") into a measurable claim ("p95 went from 90ms to 4s"). Without it you cannot tell whether a mitigation actually worked.

  2. Reproduce the incident by injecting load

    4 guided steps

    A fault you cannot reproduce is a fault you cannot confidently fix. Reproducing on demand lets you watch the symptom appear and, later, watch it disappear, which is how you prove the mitigation worked.

  3. Localize the fault with Grafana and Loki

    4 guided steps

    The hardest part of on-call is not fixing; it is localizing. Latency at the edge can come from any layer. Disciplined elimination (which dependency is slow? what do the logs say?) is what separates a 10-minute incident from a 2-hour one.

  4. Mitigate by raising the pool and scaling api, then verify recovery

    4 guided steps

    Mitigation without verification is wishful thinking. Many incidents get prolonged because someone declared victory at deploy time. Closing the loop with the same dashboard you opened it with is the core SRE habit this whole rung exists to teach.

  5. Write the incident timeline

    4 guided steps

    Incidents that are not written down repeat. A crisp timeline with real numbers makes the next on-call faster, feeds a future postmortem, and is the artifact a hiring manager or teammate trusts more than any verbal war story.

What's inside when you start

3 starter files, ready to clone
5 guided milestones
5 full reference solutions
8 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

A baseline note recording healthy p95 latency and a working checkout before the incident
A reproduced incident: k6 driving load with the Grafana p95 panel visibly elevated
A localization writeup pointing to api connection pool exhaustion, backed by Grafana plus Loki evidence
An applied mitigation (raised DB_POOL_SIZE and/or scaled api) with telemetry showing recovery under load
A short markdown incident timeline with timestamps, root cause, before/after p95, and one follow-up action
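To make the last deliverable concrete, here is a minimal sketch that renders a timeline containing the fields listed above: timestamps, root cause, before/after p95, and one follow-up action. The `TimelineEntry` type, the `render_timeline` function, the output layout, and every timestamp and number in the example are hypothetical. The project's own docs/incident-timeline.md may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class TimelineEntry:
    time_utc: str  # e.g. "14:02"
    event: str


def render_timeline(title, entries, root_cause, p95_before_ms, p95_during_ms,
                    p95_after_ms, follow_up):
    """Render a short markdown incident timeline as a single string."""
    lines = [f"# Incident timeline: {title}", ""]
    lines += [f"- {entry.time_utc} UTC: {entry.event}" for entry in entries]
    lines += [
        "",
        f"Root cause: {root_cause}",
        f"p95 before / during / after: {p95_before_ms}ms / {p95_during_ms}ms / {p95_after_ms}ms",
        f"Follow-up: {follow_up}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # All times and numbers below are made up for illustration.
    entries = [
        TimelineEntry("14:02", "Page fired: checkout timing out"),
        TimelineEntry("14:09", 'Loki shows api log lines matching "pool"'),
        TimelineEntry("14:15", "Raised DB_POOL_SIZE and recreated api"),
        TimelineEntry("14:21", "p95 back near baseline under the same load"),
    ]
    print(render_timeline(
        title="checkout latency",
        entries=entries,
        root_cause="api database connection pool exhaustion",
        p95_before_ms=90,
        p95_during_ms=4000,
        p95_after_ms=95,
        follow_up="Alert on pool wait time before checkout p95 degrades",
    ))
```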

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.