The SRE team just rejected your deploy. Ship a microservice that passes their go-live checklist.
Understands why each of the four pillars matters. Can implement /health and /ready endpoints. Knows the difference between liveness and readiness. Can add OpenTelemetry auto-instrumentation.
Implements all four pillars correctly. Tunes circuit breaker thresholds for the actual failure profile. Implements SIGTERM handler that correctly drains requests. Adds business-relevant span attributes.
Designs the production-readiness checklist for the entire org. Reviews PRs for missing observability and resilience patterns. Establishes org-wide defaults for circuit breaker configuration and graceful shutdown timeout. Architects the observability platform.
Your microservice is built and tested. The SRE team reviews the deploy request and rejects it. Their go-live checklist has four failing items: no distributed tracing on requests; no /health or /ready endpoints for Kubernetes probes; no SIGTERM handler (Kubernetes sends SIGTERM 30 seconds before SIGKILL, so in-flight requests will be cut); and no circuit breaker on the database call (one slow database will take down the entire service). You have three hours to fix all four items before the deployment window closes.
The question this raises
What does it mean for a service to be production-ready — and can you build all four requirements from scratch in one session?
Kubernetes is performing a rolling deploy: it starts a new pod and terminates the old one. The old pod has 5 in-flight requests. What must your service do to prevent those 5 requests from receiving errors?
How this concept changes your thinking
Debugging a latency issue in production
“grep through logs on 8 servers looking for the request ID, manually reconstruct which service called which, spend 3 weeks finding a 400ms bug”
“paste the trace ID into Jaeger, see the waterfall, find the slow span in 20 minutes — OpenTelemetry auto-instruments every HTTP call and database query”
Kubernetes rolling deploy
“Pod restarts, in-flight requests get SIGKILL, 503 errors spike for 30 seconds. Users see errors during every deploy.”
“SIGTERM handler stops accepting new connections, drains in-flight requests over 10 seconds, then exits cleanly. Zero 503s during deploy.”
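The draining behavior can be sketched independently of Express: count in-flight work, and refuse to exit until the count reaches zero or the grace period expires. The names here (`trackRequest`, `drain`) are illustrative, not a library API — a real Express server gets the equivalent from `server.close()`:

```typescript
// Illustrative sketch of request draining — hypothetical helpers,
// not an Express or Node API.
let inFlight = 0;

// Wrap each request handler so we know how much work is outstanding.
async function trackRequest<T>(work: () => Promise<T>): Promise<T> {
  inFlight++;
  try {
    return await work();
  } finally {
    inFlight--;
  }
}

// On SIGTERM: stop accepting new work, then poll until in-flight work
// finishes or the grace period expires.
async function drain(timeoutMs: number, pollMs = 10): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (inFlight > 0) {
    if (Date.now() >= deadline) return false; // grace period exhausted
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return true; // fully drained — safe to exit 0
}
```

The same two-phase shape (stop intake, then wait for outstanding work) appears in the full implementation below via `server.close()` plus a forced-exit timeout.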
Database goes slow or unreachable
“Requests pile up waiting for DB, connection pool exhausted, entire service becomes unresponsive, cascades to all upstream callers”
“Circuit breaker opens after 50% failure rate, immediately returns 503 with cached fallback data, upstream callers get fast failures instead of hangs”
Kubernetes liveness vs readiness
“Single /health endpoint returns 200 even during startup (DB not connected yet). Kubernetes sends traffic before service is ready. First 30 seconds of requests fail.”
“/health returns 200 if process is alive (always). /ready returns 200 only when DB and Redis connections are confirmed. Kubernetes waits for /ready before routing traffic.”
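The distinction reduces to two tiny decision functions — liveness depends only on the process, readiness only on its dependencies. A minimal illustrative sketch (the `DependencyHealth` shape is an assumption, not part of the service below):

```typescript
// Liveness vs readiness as pure decision functions.
interface DependencyHealth {
  db: boolean;
  redis: boolean;
}

// Liveness: if this code runs at all, the process is alive.
function livenessStatusCode(): number {
  return 200;
}

// Readiness: 200 only when every dependency is reachable.
function readinessStatusCode(deps: DependencyHealth): number {
  return Object.values(deps).every(Boolean) ? 200 : 503;
}
```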
Deploy Rejected — SRE Go-Live Checklist: 0/4 passing
```text
DEPLOY REQUEST: my-service v1.2.3 -> production
STATUS: BLOCKED — SRE go-live checklist failed

[ ] FAIL: Distributed tracing
    No traceparent headers propagated. Cannot correlate requests across services.
    Required: OpenTelemetry SDK initialized before first import.

[ ] FAIL: Health + readiness endpoints
    /health not found (404). Kubernetes liveness probe will restart pods on any 404.
    /ready not found (404). Traffic will be sent to unready pods.
    Required: GET /health -> 200 (process alive) | GET /ready -> 200 (dependencies ok)

[ ] FAIL: Graceful shutdown
    SIGTERM handler not implemented. K8s sends SIGTERM 30s before SIGKILL.
    In-flight requests will be cut on every rolling deploy.
    Required: stop accepting connections on SIGTERM, drain requests, then exit.

[ ] FAIL: Circuit breaker on downstream DB
    No circuit breaker detected on database calls. One slow DB will exhaust
    the connection pool and cascade to all callers.
    Required: circuit breaker with fallback response.

Deployment window: 3 hours. Fix all 4 items and re-submit.
```
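Cross-service correlation rides on the `traceparent` header the checklist mentions, which follows the W3C Trace Context format `version-traceid-spanid-flags`. A minimal sketch of pulling the trace ID out of that header for log correlation — illustrative only, not the OpenTelemetry SDK's own propagator:

```typescript
// Hypothetical helper: extract the trace ID from a W3C `traceparent`
// header so log lines can be pasted straight into Jaeger.
// Format: 00-<32 hex trace-id>-<16 hex span-id>-<2 hex flags>
function extractTraceId(traceparent: string): string | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    traceparent.trim().toLowerCase(),
  );
  if (!m) return null;
  const [, version, traceId] = m;
  // Version ff and an all-zero trace ID are invalid per the spec.
  if (version === 'ff' || /^0{32}$/.test(traceId)) return null;
  return traceId;
}
```

In practice the OpenTelemetry auto-instrumentation reads and propagates this header for you; the sketch only shows what the SDK is doing under the hood.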
Each of the four checklist items produces a distinct signal in production. Knowing the signal tells you whether your implementation is working.
```jsonc
// What a valid OpenTelemetry trace looks like in Jaeger
// After implementing auto-instrumentation, every request produces this:
{
  "traceID": "abc123def456789",
  "spans": [
    {
      "operationName": "GET /api/users/:id",
      "serviceName": "my-service",
      "duration": 145000, // microseconds — 145 ms total
      "tags": {
        "http.method": "GET",
        "http.url": "/api/users/42",
        "http.status_code": 200,
        "service.version": "1.2.3"
      }
    },
    {
      "operationName": "pg.query",
      "serviceName": "my-service",
      "parentSpanID": "root-span-id",
      "duration": 130000, // microseconds — the 130 ms DB query, visible as a child span
      "tags": {
        "db.type": "postgresql",
        "db.statement": "SELECT * FROM users WHERE id = $1",
        "db.rows_affected": 1
      }
    }
  ]
}
```

```js
// Circuit breaker state transitions — opossum emits these events:
circuitBreaker.on('open', () => {
  // Fires when failure rate exceeds threshold (50%)
  // All subsequent calls return fallback immediately (no DB hit)
  metrics.increment('circuit_breaker.opened', { service: 'db' });
});

circuitBreaker.on('halfOpen', () => {
  // After resetTimeout (30s), allows ONE test request through
  // If it succeeds -> close. If it fails -> open again.
  metrics.increment('circuit_breaker.half_open', { service: 'db' });
});

circuitBreaker.on('close', () => {
  // Test request succeeded — circuit closed, normal operation resumed
  metrics.increment('circuit_breaker.closed', { service: 'db' });
});
```
What you are building
| Kubernetes probe | Endpoint | Returns 200 when | Returns 503 when |
|---|---|---|---|
| Liveness | /health | Process is running | Process is deadlocked or unresponsive (K8s restarts it) |
| Readiness | /ready | DB + Redis connections healthy | DB or Redis unreachable (K8s stops routing traffic) |
Stripe requires all four pillars before any service ships to production. The full working implementation is a single Node.js/Express file you can copy-paste as a starting point.
The Limit: observability adds overhead
OpenTelemetry adds 3-5% CPU overhead per service (span creation, attribute serialization, exporter network calls). At very high throughput (10k+ req/s per instance), use head-based sampling (1-5%) to reduce this. The circuit breaker's 3-second call timeout means brief DB hiccups count as failures and can open the circuit — tune the threshold (errorThresholdPercentage and resetTimeout) for your database's actual reliability profile. A circuit that opens on every deploy is worse than no circuit.
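Tuning those thresholds is easier when you can see the state machine opossum implements. The following is a minimal, illustrative sketch of that closed → open → half-open cycle — the class and its parameters are hypothetical, not the opossum API (which wraps the call for you):

```typescript
// Sketch of a circuit-breaker state machine (illustrative, not opossum).
type BreakerState = 'closed' | 'open' | 'halfOpen';

class Breaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private calls = 0;
  private openedAt = 0;

  constructor(
    private failureRateThreshold = 0.5, // open at >= 50% failures
    private volumeThreshold = 10,       // minimum calls before tripping
    private resetTimeoutMs = 30_000,    // how long to stay open
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  currentState(): BreakerState {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'halfOpen'; // allow one probe request through
    }
    return this.state;
  }

  record(success: boolean): void {
    if (this.currentState() === 'halfOpen') {
      // The single probe decides: success closes, failure reopens.
      if (success) {
        this.state = 'closed';
        this.failures = 0;
        this.calls = 0;
      } else {
        this.state = 'open';
        this.openedAt = this.now();
      }
      return;
    }
    this.calls++;
    if (!success) this.failures++;
    if (
      this.calls >= this.volumeThreshold &&
      this.failures / this.calls >= this.failureRateThreshold
    ) {
      this.state = 'open'; // trip: stop hitting the slow dependency
      this.openedAt = this.now();
    }
  }
}
```

The sketch makes the tuning trade-off concrete: a low `volumeThreshold` or short `resetTimeoutMs` reacts faster to real outages but also trips on the transient errors every deploy produces.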
```typescript
// production-ready-service.ts
// Full implementation: OpenTelemetry + /health + /ready + SIGTERM + circuit breaker
// Run: npx ts-node --require ./tracing.ts production-ready-service.ts

// ── tracing.ts (import FIRST) ─────────────────────────────────────
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';

export const sdk = new NodeSDK({
  resource: new Resource({ 'service.name': 'my-service' }),
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
// Every HTTP, Express, pg, redis call now creates spans automatically.

// ── server.ts ─────────────────────────────────────────────────────
import express from 'express';
import CircuitBreaker from 'opossum';
import { Pool } from 'pg';

const app = express();
const db = new Pool({ connectionString: process.env.DATABASE_URL });

// Middleware: reject new requests during shutdown.
// Registered BEFORE the routes — Express middleware added after a route
// never runs for that route, so this must come first.
let isShuttingDown = false;
app.use((_req, res, next) => {
  if (isShuttingDown) {
    res.setHeader('Connection', 'close');
    return res.status(503).json({ error: 'Service shutting down' });
  }
  next();
});

// ── Pillar 4: Circuit breaker on DB calls ─────────────────────────
async function queryDb(sql: string, params: unknown[]) {
  const result = await db.query(sql, params as string[]);
  return result.rows;
}

const dbBreaker = new CircuitBreaker(queryDb, {
  errorThresholdPercentage: 50, // open when 50%+ of calls fail
  resetTimeout: 30000,          // try again after 30 seconds
  timeout: 3000,                // treat calls taking > 3s as failures
  volumeThreshold: 10,          // need at least 10 calls before tripping
});

dbBreaker.fallback(() => []); // return empty array when circuit is open

// ── Pillar 2: Health + Readiness ──────────────────────────────────
app.get('/health', (_req, res) => {
  res.json({ status: 'ok', uptime: process.uptime() });
});

app.get('/ready', async (_req, res) => {
  try {
    await db.query('SELECT 1'); // verify DB is reachable
    res.json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({
      status: 'not ready',
      error: (err as Error).message,
    });
  }
});

// ── Application routes ────────────────────────────────────────────
app.get('/api/users/:id', async (req, res) => {
  const users = await dbBreaker.fire(
    'SELECT id, name, email FROM users WHERE id = $1',
    [req.params.id],
  );
  res.json(users[0] ?? { error: 'not found' });
});

// ── Pillar 3: Graceful shutdown ───────────────────────────────────
const server = app.listen(8080, () => {
  console.log('Listening on :8080');
});

process.on('SIGTERM', () => {
  console.log('SIGTERM received — starting graceful shutdown');
  isShuttingDown = true;

  // Stop accepting new connections, drain in-flight requests
  server.close(() => {
    console.log('HTTP server closed — all in-flight requests drained');
    // Shut down the OpenTelemetry exporter (flush remaining spans)
    sdk.shutdown().finally(() => {
      process.exit(0);
    });
  });

  // Force exit after 25 seconds (K8s terminationGracePeriodSeconds is 30)
  setTimeout(() => {
    console.error('Graceful shutdown timeout — forcing exit');
    process.exit(1);
  }, 25_000);
});
```
```yaml
# Kubernetes deployment with liveness + readiness probes + graceful termination
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  template:
    spec:
      terminationGracePeriodSeconds: 30  # K8s waits 30s before SIGKILL
      containers:
        - name: my-service
          image: my-service:1.2.3
          ports:
            - containerPort: 8080

          # Liveness probe: restart pod if process is deadlocked
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5   # wait 5s before first check
            periodSeconds: 10
            failureThreshold: 3      # restart after 3 consecutive failures

          # Readiness probe: stop routing traffic if dependencies down
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10  # wait for DB connection to establish
            periodSeconds: 5
            failureThreshold: 2      # stop traffic after 2 consecutive failures

          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
            - name: OTEL_ENDPOINT
              value: "http://jaeger-collector:4318/v1/traces"
```
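A rough worst-case timeline falls out of these probe settings: Kubernetes acts after roughly `periodSeconds × failureThreshold` of consecutive failures. This back-of-envelope sketch ignores probe timeouts and jitter, so treat it as an approximation:

```typescript
// Approximate time from "probe starts failing" to "Kubernetes acts",
// assuming consecutive failures and ignoring probe timeout/jitter.
function secondsUntilAction(periodSeconds: number, failureThreshold: number): number {
  return periodSeconds * failureThreshold;
}

// Using the values from the Deployment above:
const livenessRestartAfter = secondsUntilAction(10, 3);  // ~30s until pod restart
const readinessUnroutedAfter = secondsUntilAction(5, 2); // ~10s until traffic stops
```

Note the asymmetry is deliberate: traffic is pulled quickly (~10s) when dependencies fail, while restarts — which are disruptive — require a longer (~30s) sustained failure.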
Liveness vs readiness probes
📖 What the exam expects
Liveness checks if the process is alive — Kubernetes restarts the pod if it fails. Readiness checks if the pod can serve traffic — Kubernetes stops routing to it if it fails.
Production readiness questions are the clearest signal of engineering maturity. Every L4+ engineer should be able to implement all four pillars without looking them up.
From the books
Release It! — Michael T. Nygard (2018)
Chapter 5: Stability Patterns
Circuit breakers, bulkheads, and timeouts are the difference between a service that fails fast and recovers and one that fails slow and cascades. Every production service needs all three.
Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda (2022)
Chapter 2: What Is Observability?
Observability is not a set of tools — it is the property of a system that allows you to ask any question about its internal state from its external outputs. You build for observability before incidents, not after.