Automate the toil away: build a remediator
Continues from the last build: You left rung 8 with a capacity model and headroom policy, but inventory still leaks file descriptors and gets restarted by hand every week.
The inventory fd leak from rung 2 will not die. Roughly once a week the service slowly exhausts its file descriptors, /checkout starts throwing 5xx, an alert pages a human at 3am, and that human SSHes in and runs the same three commands: drain the sick replica, restart it, confirm recovery.
What you'll build
You walk away with a tested, rate-limited remediation service that turns a recurring 3am page into a logged, automatic, bounded action, plus a one-page toil ledger that justifies the work and a runbook entry that now reads "the remediator handles this; here is how to read its audit log."
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
FROM python:3.12-slim
WORKDIR /app

# Copy the dependency manifest first so the pip layer is cached
COPY remediator/requirements.txt ./requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Then copy the source
COPY remediator/ ./remediator/

# Run as a non-root user with only the access it needs
RUN useradd --create-home runner
USER runner

CMD ["python", "-m", "remediator.controller"]
Reading this file
- COPY remediator/requirements.txt ./requirements.txt
  Copies the dependency manifest before the source so the pip layer stays cached when only code changes.
- RUN pip install --no-cache-dir -r requirements.txt
  Installs dependencies in that cached layer, ahead of the source copy.
- RUN useradd --create-home runner
  Creates a non-root user so the remediator does not run as root.
- CMD ["python", "-m", "remediator.controller"]
  Makes the control loop the container's default command.
Manifest-before-source ordering keeps the pip layer cached across code edits; the non-root user limits blast radius, even though the remediator still reaches the Docker socket.
That's 1 of 8 explained code blocks in this single project.
The build, milestone by milestone
- 1
Measure the toil before you automate it
3 guided steps
Automating low-frequency toil can cost more than it saves. A ledger forces the payback question (does building this beat the manual cost?) and gives you a metric to prove the remediator worked after you ship it.
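The payback question is plain arithmetic, and a few lines of Python make it concrete. This is a minimal sketch, not the project's ledger: the names `ToilEntry` and `payback_years` are hypothetical, and the figures (45 minutes per incident, 40 hours to build, 4 hours of upkeep a year) are illustrative assumptions. Only the weekly frequency comes from the scenario above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilEntry:
    """One row of a toil ledger (hypothetical shape)."""
    incidents_per_year: float
    minutes_per_incident: float

    def annual_hours(self) -> float:
        # Human time spent on this task per year, in hours.
        return self.incidents_per_year * self.minutes_per_incident / 60.0


def payback_years(entry: ToilEntry, build_hours: float,
                  upkeep_hours_per_year: float = 0.0) -> float:
    """Years until the build cost is repaid by the hours saved."""
    saved = entry.annual_hours() - upkeep_hours_per_year
    if saved <= 0:
        # Upkeep eats the savings: the automation never pays for itself.
        return float("inf")
    return build_hours / saved


# Assumed numbers: a weekly page that costs 45 minutes each time.
weekly_leak = ToilEntry(incidents_per_year=52, minutes_per_incident=45)
print(weekly_leak.annual_hours())                      # 39.0 hours per year
print(round(payback_years(weekly_leak, 40, 4), 2))     # 1.14 years
```

A task that fires twice a year would return infinity with the same build and upkeep costs, which is the "can cost more than it saves" case.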
- 2
Inject the leak and detect its Prometheus signature
3 guided steps
A remediator is only as safe as its trigger. A precise, sustained-over-time signature (not a single spiky sample) is what stops the tool from restarting healthy replicas. This is the detection half of the control loop.
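One way to express "sustained, not spiky" is to require every sample in a recent window to breach the threshold, and to require enough samples that the window is actually covered. The sketch below works on samples you have already fetched; the function name, the 0.8 threshold, the 10-minute window, and the minimum sample count are all assumptions for illustration, not the project's values.

```python
from typing import Sequence, Tuple

# (unix_timestamp_seconds, fd_usage_ratio), where the ratio is open fds / fd limit.
Sample = Tuple[float, float]


def is_sustained_breach(samples: Sequence[Sample], now: float,
                        threshold: float = 0.8, window_s: float = 600.0,
                        min_samples: int = 5) -> bool:
    """True only if every sample in the window is at or above the threshold."""
    recent = [value for ts, value in samples if now - window_s <= ts <= now]
    if len(recent) < min_samples:
        # Too little data to call it sustained; do nothing rather than guess.
        return False
    return all(value >= threshold for value in recent)


timestamps = [0, 120, 240, 360, 480, 600]
spike = [(t, 0.95 if t == 240 else 0.5) for t in timestamps]
leak = [(t, 0.9) for t in timestamps]

print(is_sustained_breach(spike, now=600))  # False: one hot sample is not a leak
print(is_sustained_breach(leak, now=600))   # True: the whole window is above 0.8
```

Returning False on sparse data is a deliberate bias: a missed restart costs one page, while a wrong restart costs a healthy replica.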
- 3
Drain and restart the sick replica through the Docker API
3 guided steps
Doing the restart through the Docker API (not a shelled-out command) makes the action testable with a mock and gives you structured errors. Draining before restart avoids dropping in-flight checkouts, which is the part humans often skip at 3am.
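A rough sketch of what "drain, then restart, with structured errors" can look like. It assumes `client` is shaped like the Docker SDK for Python client returned by `docker.from_env()`, where `client.containers.get(name)` yields an object with `restart(timeout=...)`. The `drain` callable and the `RestartResult` type are hypothetical; how draining really works depends on your load balancer and is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RestartResult:
    """Structured outcome, instead of parsing a shell command's output."""
    container: str
    ok: bool
    error: Optional[str] = None


def drain_and_restart(client, name: str, drain: Callable[[str], None],
                      stop_timeout_s: int = 10) -> RestartResult:
    try:
        container = client.containers.get(name)
        # Drain first so in-flight requests finish before the restart.
        drain(name)
        container.restart(timeout=stop_timeout_s)
    except Exception as exc:
        # With the real SDK this would typically be docker.errors.NotFound
        # or docker.errors.APIError; caught broadly to keep the sketch
        # free of the dependency.
        return RestartResult(name, False, f"{type(exc).__name__}: {exc}")
    return RestartResult(name, True)
```

Because the client is passed in rather than created inside the function, a test can hand it a fake and never touch a real daemon.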
- 4
Add the rate-limit guard and audit log
3 guided steps
An unbounded remediator that hits a restart-crash loop becomes an outage amplifier. The rate limit is a circuit breaker, and the audit log is how a human reconstructs what the robot did. Both are non-negotiable for any automation that touches production.
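The guard can be as small as a sliding-window counter, and the audit log as simple as one JSON line per decision. The following is an illustrative sketch: `RestartBudget`, `audit_record`, the field names, and the limits are assumptions, not the project's actual interface. The clock is passed in as `now` so the behaviour is easy to test.

```python
import json
from collections import deque
from typing import Deque


class RestartBudget:
    """Allow at most max_restarts within any window_s seconds."""

    def __init__(self, max_restarts: int = 3, window_s: float = 3600.0) -> None:
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._stamps: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Forget restarts that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False  # budget spent: stop and leave it to a human
        self._stamps.append(now)
        return True


def audit_record(now: float, target: str, action: str, allowed: bool) -> str:
    """One JSON line per decision, including the refused ones."""
    return json.dumps(
        {"ts": now, "target": target, "action": action, "allowed": allowed},
        sort_keys=True,
    )


budget = RestartBudget(max_restarts=2, window_s=60)
for now in (0.0, 10.0, 20.0, 60.0):
    print(audit_record(now, "inventory-1", "restart", budget.allow(now)))
```

In this run the third attempt is refused and the fourth is allowed again once the first restart has aged out. Logging refusals matters as much as logging actions: a run of refused restarts is the signature of a crash loop.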
- 5
Test it like the software it is
3 guided steps
Operational automation that can restart production must be the most-tested code you own, because its failure mode is taking the system down faster than a human would. Mocking the Docker client lets you test the restart path without restarting anything real.
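As a sketch of what such a test might look like, the example below uses the standard library's `unittest.mock.MagicMock` in place of a Docker client. `restart_replica` is a hypothetical stand-in for the remediator's restart path, written against the same `containers.get(name).restart(timeout=...)` shape the Docker SDK for Python exposes; the real project code may differ.

```python
import unittest
from unittest.mock import MagicMock


def restart_replica(client, name: str, timeout_s: int = 10) -> bool:
    """Hypothetical stand-in for the code under test."""
    try:
        client.containers.get(name).restart(timeout=timeout_s)
    except Exception:
        return False
    return True


class RestartReplicaTest(unittest.TestCase):
    def test_restarts_named_container_once(self):
        client = MagicMock()
        self.assertTrue(restart_replica(client, "inventory-1"))
        client.containers.get.assert_called_once_with("inventory-1")
        client.containers.get.return_value.restart.assert_called_once_with(timeout=10)

    def test_reports_failure_without_retrying(self):
        client = MagicMock()
        restart = client.containers.get.return_value.restart
        restart.side_effect = RuntimeError("daemon unreachable")
        self.assertFalse(restart_replica(client, "inventory-1"))
        # A failed restart must not be retried blindly.
        self.assertEqual(restart.call_count, 1)
```

Run it with `python -m unittest` against the file that contains it. The second test covers the dangerous path: a failing restart is reported once rather than hammered in a loop.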