Beginner · Carta · Project 2 of 14 · ~5h · 5 milestones

Debug the slow box with the USE method

Continues from the last build: Rung 1 left you with checkout latency tamed (DB pool widened from 5) but one api replica still drags, with clean logs.

The pager is quiet again. In rung 1 you traced the checkout latency spike to an exhausted api database connection pool, bumped the pool, and watched the dashboard recover.

USE method performance triage
Reading top/vmstat/iostat output
Inspecting /proc for a process
docker stats for per-container resources
ss for socket and network saturation
py-spy flame graph profiling
Counting file descriptors to find a leak

What you'll build

You walk away able to triage a slow host without logs or guesses. You get a repeatable USE checklist across CPU, memory, disk, network, and file descriptors; fluency in top, vmstat, iostat, ss, /proc, and docker stats; and the ability to read a py-spy flame graph to find a CPU hot path and to confirm an fd leak by counting descriptors over time.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

prod-sim.yml (YAML)
# Inherited prod-sim stack (excerpt). Telemetry services omitted for brevity.
# This rung adds two injected-fault flags. Set to 1 to inject, 0 to fix.
services:
  api:
    build: ./api
    environment:
      - DB_POOL_SIZE=20            # widened in rung 1
      - API_SLOW_SERIALIZER=1      # INJECTED: CPU-burning JSON hot path
    deploy:
      replicas: 3
    ports:
      - "8080:8080"
  inventory:
    build: ./inventory
    environment:
      - REDIS_URL=redis://redis:6379
      - INVENTORY_LEAK_FDS=1       # INJECTED: fd opened per request, never closed
    ports:
      - "8082:8082"
  payments-stub:
    build: ./payments-stub
    environment:
      - FAIL_RATE=0
      - LATENCY_MS=0
    ports:
      - "9000:9000"

Reading this file

  • API_SLOW_SERIALIZER=1: Set to 1 to inject the CPU hot path, 0 to fix it in the last milestone.
  • INVENTORY_LEAK_FDS=1: Set to 1 to inject the fd leak, 0 to fix it once you have confirmed the leak.
  • replicas: 3: Three api replicas so only the injected one looks slow on the dashboard.
  • DB_POOL_SIZE=20: Carried from rung 1, the pool fix that is no longer the bottleneck.

The two INJECTED comments mark the only lines you toggle this rung. Leave the stable contract (ports, healthz, k6 path) untouched.
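To make the 1/0 toggle concrete, here is a minimal Python sketch of how a service could read one of these flags at startup. The helper name fault_enabled is hypothetical and the project's real services may gate the faults differently; only the flag names come from prod-sim.yml.

```python
import os
from typing import Mapping, Optional


def fault_enabled(name: str, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the named flag is set to the string "1".

    Hypothetical helper: an unset flag, "0", or any other value
    is treated as "fault off".
    """
    env = os.environ if environ is None else environ
    return env.get(name, "0").strip() == "1"


# Flag names taken from prod-sim.yml; the dicts stand in for the container environment.
injected = {"API_SLOW_SERIALIZER": "1", "INVENTORY_LEAK_FDS": "1"}
fixed = {"API_SLOW_SERIALIZER": "0", "INVENTORY_LEAK_FDS": "0"}
```

Treating anything other than "1" as off is a design assumption: a missing or mistyped flag then fails safe instead of silently injecting a fault.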

That's 1 of 8 explained code blocks in this single project.

The build, milestone by milestone

  1. Reproduce the slowness and write a USE checklist

    4 guided steps

    SRE triage starts with a reproducible signal and a method. The USE method (for every resource, check utilization, saturation, and errors) stops you from tunnel-visioning on CPU when the real problem might be file descriptors. A repeatable repro means you can prove the fix later.

  2. Walk CPU and memory with top, vmstat, and docker stats

    4 guided steps

    docker stats gives a fast per-container read, but it hides per-thread detail and the run queue. vmstat shows saturation (the r column is threads waiting for a CPU) that a single utilization percentage cannot. Knowing the PID lets you target /proc and py-spy later.

  3. Profile the CPU hot path with a py-spy flame graph

    4 guided steps

    Utilization tells you that the CPU is busy, not why. A flame graph shows where time is actually spent: the wider the frame, the more samples landed in it. py-spy is a sampling profiler that attaches to a live process with no code changes and no restart, exactly what you want during an incident.

  4. Find the file descriptor leak in inventory

    4 guided steps

    fd leaks are silent: no CPU spike, no memory blowup, just a slow march toward the ulimit, after which every new connection fails with EMFILE and the service falls over with no obvious cause. Counting /proc/PID/fd over time is the canonical way to catch a leak before it becomes an outage.

  5. Fix both bugs and prove it with the same tools

    4 guided steps

    Verification is the difference between resolved and 'seems better'. By re-running k6, docker stats, the flame graph, and the fd count, you prove utilization dropped, the hot path shrank, and the fd count stabilized. This is the closing half of the on-call loop you learned in rung 1, now applied to host performance.
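The fd count used in milestones 4 and 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's reference solution: it assumes a Linux host where /proc/PID/fd is readable, the function names are hypothetical, and the min_growth threshold of 10 is an arbitrary starting point that you would tune to your load.

```python
import os
import time
from typing import List, Optional


def count_fds(pid: int) -> Optional[int]:
    """Count open descriptors by listing /proc/<pid>/fd (Linux only).

    Returns None if the process is gone or the directory is unreadable.
    """
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except (FileNotFoundError, PermissionError, NotADirectoryError):
        return None


def sample_fds(pid: int, samples: int = 5, interval_s: float = 1.0) -> List[int]:
    """Take several fd counts spaced interval_s seconds apart."""
    counts: List[int] = []
    for i in range(samples):
        n = count_fds(pid)
        if n is not None:
            counts.append(n)
        if i < samples - 1:
            time.sleep(interval_s)
    return counts


def looks_like_leak(counts: List[int], min_growth: int = 10) -> bool:
    """Heuristic: the count never drops and grows by at least min_growth.

    The threshold is an assumption for illustration, not a standard value.
    """
    if len(counts) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(counts, counts[1:]))
    return never_drops and (counts[-1] - counts[0]) >= min_growth
```

Run sample_fds against the inventory PID while the load test is active, then again after the fix. A healthy process fluctuates around a steady count, while a leaking one climbs and never comes back down.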

What's inside when you start

3 starter files, ready to clone
5 guided milestones
5 full reference solutions
8 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

use-checklist.md: a one-page USE table with concrete U/S/E commands for CPU, memory, disk, network, and file descriptors
api-flame.svg and api-flame-after.svg: before/after py-spy flame graphs of the api worker
Recorded baseline and post-fix k6 p95/p99 numbers for the slow replica
An fd-count log for inventory showing the climb under load and the flat count after the fix
findings.md: a short writeup naming each bug, the USE signal that flagged it, the tool that confirmed it, and before/after evidence
An updated prod-sim.yml with both injected-fault env flags documented and set to 0
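As a rough idea of what use-checklist.md could contain, the sketch below keeps the table as data and renders it as a Markdown table. The specific commands per cell are illustrative starting points chosen for this example, not the project's reference answers, and the names USE_CHECKLIST and render_markdown are hypothetical.

```python
from typing import Dict

# Illustrative U/S/E starting points per resource (assumptions, not the
# course's reference checklist). PID stands for the process under test.
USE_CHECKLIST: Dict[str, Dict[str, str]] = {
    "CPU": {
        "U": "top or docker stats (CPU %)",
        "S": "vmstat 1 (r column)",
        "E": "dmesg (machine check events)",
    },
    "Memory": {
        "U": "free -m or docker stats (MEM %)",
        "S": "vmstat 1 (si/so columns)",
        "E": "dmesg (OOM killer messages)",
    },
    "Disk": {
        "U": "iostat -x 1 (%util)",
        "S": "iostat -x 1 (queue size column)",
        "E": "dmesg (I/O errors)",
    },
    "Network": {
        "U": "docker stats (NET I/O)",
        "S": "ss -tan (Recv-Q/Send-Q)",
        "E": "ip -s link (errors, dropped)",
    },
    "File descriptors": {
        "U": "entries in /proc/PID/fd vs /proc/PID/limits",
        "S": "n/a",
        "E": "EMFILE (Too many open files) in logs",
    },
}


def render_markdown(checklist: Dict[str, Dict[str, str]]) -> str:
    """Render the checklist as a Markdown table, one row per resource."""
    lines = [
        "| Resource | Utilization | Saturation | Errors |",
        "| --- | --- | --- | --- |",
    ]
    for resource, row in checklist.items():
        lines.append(f"| {resource} | {row['U']} | {row['S']} | {row['E']} |")
    return "\n".join(lines) + "\n"
```

Writing render_markdown(USE_CHECKLIST) to use-checklist.md gives a one-page table. Keeping the rows as data makes it easy to add a resource without retyping the layout.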
