Gate launches with a production readiness review
Continues from the last build. Rung 13 left you with a stateful data layer that survives bad days: automated backups, a timed restore drill, streaming replication with a lag SLI, and a failover game-day runbook.
Carta is reliable now. Thirteen rungs of work bought you SLOs, symptom alerts, timeouts and breakers, capacity headroom, runbooks, backups, and a tested failover.
What you'll build
You walk away with a reusable PRR checklist grounded in the whole ladder, a completed review of a real (if seeded) service, fixes for its worst gaps, a signed go/no-go record, and a recurring error-budget review agenda. You can now gate any new service launch on reliability evidence rather than optimism, and keep it honest month over month.
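The error-budget review comes down to simple arithmetic: how many failures the SLO allows in a window, and how much of that allowance is already spent. The sketch below is illustrative only. The function name `error_budget`, the 99.9% target, and the request counts are assumptions, not values taken from the project.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for one review window.

    slo_target is the success-ratio objective, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests the SLO permits to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "consumed_fraction": failed_requests / allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
    }


# Hypothetical month: one million requests, 250 failures, 99.9% objective.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"budget consumed: {budget['consumed_fraction']:.0%}")
```

A consumed fraction above 1.0 means the budget is overspent for the window, which is the kind of signal a recurring review would put on its agenda.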
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
import os

import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()

PAYMENTS_URL = os.environ.get("PAYMENTS_URL", "http://payments-stub:9000")

# GAP: no SLI/SLO defined for this service anywhere.
# GAP: no request metrics exported, so symptom alerts are impossible.


@app.get("/healthz")
def healthz():
    # GAP: liveness only, does not check the payments dependency.
    return {"status": "ok"}


@app.post("/giftcards/redeem")
async def redeem(code: str, amount: int):
    # GAP: no timeout on the payments call, a slow provider hangs the request.
    # GAP: no circuit breaker, every request retries into a known-bad dependency.
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{PAYMENTS_URL}/charge",
            json={"amount": amount, "ref": code},
        )
    if resp.status_code != 200:
        # GAP: no structured log line, so Loki cannot correlate failures.
        raise HTTPException(status_code=502, detail="payment failed")
    return {"redeemed": code, "amount": amount}

Reading this file
PAYMENTS_URL = os.environ.get("PAYMENTS_URL", "http://payments-stub:9000")
Calls the inherited flaky payment simulator, the same dependency you tamed in earlier rungs.

async with httpx.AsyncClient() as client:
The client is created with no timeout argument, so it relies on httpx's library default (5 seconds of network inactivity) rather than a deadline chosen for this call. Nothing bounds the total request time, and the timeout exception is never caught, so a slow provider can hold the request open and then fail it as an unhandled error.

raise HTTPException(status_code=502, detail="payment failed")
Returns an error but emits no structured log and no metric, so neither Loki nor Prometheus can see the failure.

return {"status": "ok"}
Health check is liveness only and never checks the payments dependency, so it stays green during a downstream outage.
The launch candidate. Read the GAP comments as hints, but do not trust them to be complete: part of the exercise is finding gaps the comments do not name (capacity untested, no runbook, no backup story). The missing payments timeout is the most dangerous line.
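One way to picture the fix for the two payments gaps is an overall deadline around the call plus a small circuit breaker. The sketch below uses only the standard library and is not the project's reference solution. `CircuitBreaker`, `BreakerOpen`, `call_with_deadline`, and the thresholds are hypothetical names and values chosen for illustration. In the real service you would also pass an explicit `timeout=` to the httpx client.

```python
import asyncio
import time


class BreakerOpen(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def before_call(self):
        if self.opened_at is None:
            return
        if self.clock() - self.opened_at < self.reset_after_s:
            raise BreakerOpen("payments breaker is open")
        # Cool-down elapsed: allow a trial call; one more failure reopens.
        self.opened_at = None
        self.failures = self.max_failures - 1

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


async def call_with_deadline(breaker, make_call, deadline_s):
    """Fail fast if the breaker is open, else bound the whole call by deadline_s."""
    breaker.before_call()
    try:
        # wait_for cancels the call once the total deadline passes.
        result = await asyncio.wait_for(make_call(), timeout=deadline_s)
    except (asyncio.TimeoutError, ConnectionError):
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

Here `make_call` is any zero-argument coroutine function, for example a wrapper around the payments request. Because the deadline covers the whole call, a provider that trickles bytes slowly cannot keep a worker busy past `deadline_s`, and once the breaker opens, requests fail immediately instead of queueing behind a dependency that is already known to be bad.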
That's 1 of 7 explained code blocks in this single project.
The build, milestone by milestone
1. Write the PRR checklist from the whole ladder (4 guided steps)
   A PRR is only fair if the standard is written down before the review starts. Grounding each item in a rung you completed keeps the checklist honest and teaches the next engineer why the line exists, not just that it does.
2. Review gift-cards and find the 8 seeded gaps (4 guided steps)
   Finding gaps systematically against a written checklist is the core SRE review skill. The seeded service is deliberately plausible, which is exactly how reliability debt slips through a tired reviewer.
3. Fix the most dangerous gap: add a payments timeout and breaker (4 guided steps)
   The missing timeout is the gap most likely to cause a real outage: one slow dependency hangs every request and exhausts the worker pool. Demonstrating it with an injected knob, then fixing it, is the difference between a checklist item and earned confidence.
4. Run the go/no-go and stand up the error-budget review (4 guided steps)
   A review that does not end in a documented decision is just discussion. And a one-time gate decays unless an error-budget cadence keeps reliability a standing conversation, which is what turns this from a launch checkbox into a practice.