© 2026 TheSimplifiedTech. All rights reserved.

Interactive Explainer

Build Challenge: Make It Observable and Resilient

The SRE team just rejected your deploy. Ship a microservice that passes their go-live checklist.

Relevant for: Mid-level · Senior · Principal
Why this matters at your level
Mid-level

Understands why each of the four pillars matters. Can implement /health and /ready endpoints. Knows the difference between liveness and readiness. Can add OpenTelemetry auto-instrumentation.

Senior

Implements all four pillars correctly. Tunes circuit breaker thresholds for the actual failure profile. Implements SIGTERM handler that correctly drains requests. Adds business-relevant span attributes.

Principal

Designs the production-readiness checklist for the entire org. Reviews PRs for missing observability and resilience patterns. Establishes org-wide defaults for circuit breaker configuration and graceful shutdown timeout. Architects the observability platform.
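The "business-relevant span attributes" expected at the Senior level can be sketched without pulling in OpenTelemetry itself. The `SpanLike` interface, `annotateCheckoutSpan` helper, and attribute names below are illustrative, not from this lesson's code; in real code the span comes from `trace.getActiveSpan()` in `@opentelemetry/api`:

```typescript
// Minimal structural sketch — no OpenTelemetry dependency needed to illustrate.
// In production you would call trace.getActiveSpan() from '@opentelemetry/api'
// and pass the real span here.
interface SpanLike {
  setAttribute(key: string, value: string | number): void;
}

// Attach attributes an on-call engineer actually filters by during an incident.
function annotateCheckoutSpan(
  span: SpanLike,
  tier: string | undefined,
  cartValueUsd: number,
): void {
  span.setAttribute('customer.tier', tier ?? 'free'); // who is affected
  span.setAttribute('cart.value_usd', cartValueUsd);  // revenue at risk
}

// Demo with a fake span that records the calls:
const recorded: Record<string, string | number> = {};
annotateCheckoutSpan(
  { setAttribute: (k, v) => { recorded[k] = v; } },
  undefined,
  129.99,
);
console.log(recorded); // { 'customer.tier': 'free', 'cart.value_usd': 129.99 }
```

Attributes like these turn "which requests were slow?" into "which *enterprise* checkouts over $100 were slow?" — the query auto-instrumentation alone cannot answer.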


~6 min read
Go-Live Blocker

Your microservice is built and tested, but the SRE team reviews the deploy request and rejects it. Their go-live checklist has four failing items: requests emit no distributed traces; there are no /health or /ready endpoints for Kubernetes probes; there is no SIGTERM handler (Kubernetes sends SIGTERM 30 seconds before SIGKILL, so in-flight requests will be cut); and there is no circuit breaker on the database call (one slow database can take down the entire service). You have three hours to fix all four items before the deployment window closes.

The question this raises

What does it mean for a service to be production-ready — and can you build all four requirements from scratch in one session?

Test your assumption first

Kubernetes is performing a rolling deploy: it starts a new pod and terminates the old one. The old pod has 5 in-flight requests. What must your service do to prevent those 5 requests from receiving errors?

Lesson outline

Before observability: the broken contract

How this concept changes your thinking

Situation: Debugging a latency issue in production
Before: "grep through logs on 8 servers looking for the request ID, manually reconstruct which service called which, spend 3 weeks finding a 400ms bug"
After: "paste the trace ID into Jaeger, see the waterfall, find the slow span in 20 minutes — OpenTelemetry auto-instruments every HTTP call and database query"

Situation: Kubernetes rolling deploy
Before: "Pod restarts, in-flight requests get SIGKILL, 503 errors spike for 30 seconds. Users see errors during every deploy."
After: "SIGTERM handler stops accepting new connections, drains in-flight requests over 10 seconds, then exits cleanly. Zero 503s during deploy."

Situation: Database goes slow or unreachable
Before: "Requests pile up waiting for the DB, the connection pool is exhausted, the entire service becomes unresponsive, and the failure cascades to all upstream callers"
After: "Circuit breaker opens at a 50% failure rate, immediately returns 503 with cached fallback data, and upstream callers get fast failures instead of hangs"

Situation: Kubernetes liveness vs readiness
Before: "A single /health endpoint returns 200 even during startup (DB not connected yet). Kubernetes sends traffic before the service is ready. The first 30 seconds of requests fail."
After: "/health returns 200 if the process is alive (always). /ready returns 200 only when DB and Redis connections are confirmed. Kubernetes waits for /ready before routing traffic."

The SRE go-live checklist

Deploy Rejected — SRE Go-Live Checklist: 0/4 passing

DEPLOY REQUEST: my-service v1.2.3 -> production
STATUS: BLOCKED — SRE go-live checklist failed

[ ] FAIL: Distributed tracing
    No traceparent headers propagated. Cannot correlate requests across services.
    Required: OpenTelemetry SDK initialized before first import.

[ ] FAIL: Health + readiness endpoints
    /health not found (404). Kubernetes liveness probe will restart pods on any 404.
    /ready not found (404). Traffic will be sent to unready pods.
    Required: GET /health -> 200 (process alive) | GET /ready -> 200 (dependencies ok)

[ ] FAIL: Graceful shutdown
    SIGTERM handler not implemented. K8s sends SIGTERM 30s before SIGKILL.
    In-flight requests will be cut on every rolling deploy.
    Required: stop accepting connections on SIGTERM, drain requests, then exit.

[ ] FAIL: Circuit breaker on downstream DB
    No circuit breaker detected on database calls. One slow DB will exhaust
    the connection pool and cascade to all callers.
    Required: circuit breaker with fallback response.

Deployment window: 3 hours. Fix all 4 items and re-submit.

Reading what production tells you

Each of the four checklist items produces a distinct signal in production. Knowing the signal tells you whether your implementation is working.

otel-trace-output.json
// What a valid OpenTelemetry trace looks like in Jaeger
// After implementing auto-instrumentation, every request produces this:

{
  "traceID": "abc123def456789",
  "spans": [
    {
      "operationName": "GET /api/users/:id",
      "serviceName": "my-service",
      "duration": 145000,  // microseconds (145ms total)
      "tags": {
        "http.method": "GET",
        "http.url": "/api/users/42",
        "http.status_code": 200,
        "service.version": "1.2.3"
      }
    },
    {
      "operationName": "pg.query",
      "serviceName": "my-service",
      "parentSpanID": "root-span-id",
      "duration": 130000,  // DB query took 130ms — visible as child span
      "tags": {
        "db.type": "postgresql",
        "db.statement": "SELECT * FROM users WHERE id = $1",
        "db.rows_affected": 1
      }
    }
  ]
}

// Circuit breaker state transitions — opossum emits these events:
circuitBreaker.on('open', () => {
  // Fires when the failure rate exceeds the threshold (50%)
  // All subsequent calls return the fallback immediately (no DB hit)
  metrics.increment('circuit_breaker.opened', { service: 'db' });
});

circuitBreaker.on('halfOpen', () => {
  // After resetTimeout (30s), allows ONE test request through
  // If it succeeds -> close. If it fails -> open again.
  metrics.increment('circuit_breaker.half_open', { service: 'db' });
});

circuitBreaker.on('close', () => {
  // Test request succeeded — circuit closed, normal operation resumed
  metrics.increment('circuit_breaker.closed', { service: 'db' });
});

The four pillars of production readiness

What you are building

  • Pillar 1: Distributed tracing (OpenTelemetry) — Auto-instruments every HTTP request, Express route, and Postgres query. Injects W3C traceparent headers on outgoing calls. Every request in production is traceable from entry to exit across all services.
  • Pillar 2: Health + readiness endpoints — /health = liveness probe (is this process alive?). Returns 200 always if the process is running. /ready = readiness probe (can this pod serve traffic?). Returns 200 only when DB and Redis connections are confirmed. Kubernetes stops routing traffic to pods that fail /ready.
  • Pillar 3: Graceful shutdown on SIGTERM — Kubernetes sends SIGTERM 30 seconds before SIGKILL on pod termination (rolling deploy, scale-down, eviction). Your SIGTERM handler must: stop accepting new connections, wait for in-flight requests to complete, then call process.exit(0). Without this, every deploy drops active requests.
  • Pillar 4: Circuit breaker on DB calls (opossum) — When the DB failure rate exceeds 50% over 10 seconds, the circuit opens. All subsequent calls return the fallback response immediately (cached data, empty array, or explicit 503) without hitting the DB. Prevents connection pool exhaustion and cascade failure. After 30 seconds, the circuit half-opens to test recovery.
Kubernetes probe | Endpoint | Returns 200 when | Returns 503 when
Liveness         | /health  | Process is running | Process is deadlocked or unresponsive (K8s restarts it)
Readiness        | /ready   | DB + Redis connections healthy | DB or Redis unreachable (K8s stops routing traffic)
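The readiness contract reduces to a simple rule: 200 only when every dependency check passes, 503 otherwise, with the failing dependencies named in the body. A minimal sketch of that aggregation — the `readiness` helper and its result shape are illustrative, not part of this lesson's implementation:

```typescript
// Hypothetical helper: collapse per-dependency checks into one probe result.
type CheckResult = { name: string; ok: boolean };

function readiness(checks: CheckResult[]): { status: number; body: object } {
  const failing = checks.filter((c) => !c.ok).map((c) => c.name);
  return failing.length === 0
    ? { status: 200, body: { status: 'ready' } }
    : { status: 503, body: { status: 'not ready', failing } };
}

// Example: Postgres reachable but Redis down -> 503 naming the culprit
console.log(readiness([
  { name: 'postgres', ok: true },
  { name: 'redis', ok: false },
]));
// { status: 503, body: { status: 'not ready', failing: ['redis'] } }
```

In a real /ready handler, each `ok` flag would come from an awaited probe such as `db.query('SELECT 1')` or a Redis PING, each wrapped in its own try/catch and timeout.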

What Stripe requires before every deploy

Stripe requires all four pillars before any service ships to production. The full working implementation is a single Node.js/Express file you can copy-paste as a starting point.

The Limit: observability adds overhead

OpenTelemetry adds 3-5% CPU overhead per service (span creation, attribute serialization, exporter network calls). At very high throughput (10k+ req/s per instance), use head-based sampling (1-5%) to reduce this. The circuit breaker timeout window (10s) means brief DB hiccups will open the circuit — tune the threshold (errorThresholdPercentage and resetTimeout) for your DB reliability profile. A circuit that opens on every deploy is worse than no circuit.
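As a sketch of the head-based sampling mentioned above: the OpenTelemetry Node SDK accepts a `sampler`, and wrapping a ratio sampler in `ParentBasedSampler` keeps decisions consistent within a trace (child spans follow the root span's sampled flag). Assumes the same NodeSDK setup as the tracing.ts snippet in this lesson; the 5% ratio is an example value to tune for your throughput:

```typescript
// tracing.ts variant with 5% head-based sampling
import { NodeSDK } from '@opentelemetry/sdk-node';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    // Decide once at the root span: keep ~5% of traces, drop the rest
    // before any span is exported — this is what cuts the CPU overhead.
    root: new TraceIdRatioBasedSampler(0.05),
  }),
});
sdk.start();
```

Sampling is decided at trace start ("head-based"), so a dropped trace costs almost nothing; the trade-off is that you may sample away the one request that mattered, which is why rare-error investigation often needs tail-based sampling in the collector instead.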

production-ready-service.ts
// production-ready-service.ts
// Full implementation: OpenTelemetry + /health + /ready + SIGTERM + circuit breaker
// Run: npx ts-node --require ./tracing.ts production-ready-service.ts

// ── tracing.ts (import FIRST) ─────────────────────────────────────
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';

export const sdk = new NodeSDK({
  resource: new Resource({ 'service.name': 'my-service' }),
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
// Every HTTP, Express, pg, redis call now creates spans automatically.

// ── server.ts ─────────────────────────────────────────────────────
import express from 'express';
import CircuitBreaker from 'opossum';
import { Pool } from 'pg';

const app = express();
const db = new Pool({ connectionString: process.env.DATABASE_URL });

// ── Pillar 3 (part 1): reject new requests during shutdown ────────
// Registered BEFORE the routes — Express runs middleware in registration
// order, so registering this after the routes would never intercept them.
let isShuttingDown = false;

app.use((_req, res, next) => {
  if (isShuttingDown) {
    res.setHeader('Connection', 'close');
    return res.status(503).json({ error: 'Service shutting down' });
  }
  next();
});

// ── Pillar 4: Circuit breaker on DB calls ─────────────────────────
async function queryDb(sql: string, params: unknown[]) {
  const result = await db.query(sql, params as string[]);
  return result.rows;
}

const dbBreaker = new CircuitBreaker(queryDb, {
  errorThresholdPercentage: 50, // open when 50%+ of calls fail
  resetTimeout: 30_000,         // try again after 30 seconds
  timeout: 3_000,               // treat calls taking > 3s as failures
  volumeThreshold: 10,          // need at least 10 calls before tripping
});

dbBreaker.fallback(() => []); // return empty array when circuit is open

// ── Pillar 2: Health + Readiness ──────────────────────────────────
app.get('/health', (_req, res) => {
  res.json({ status: 'ok', uptime: process.uptime() });
});

app.get('/ready', async (_req, res) => {
  try {
    await db.query('SELECT 1'); // verify DB is reachable
    res.json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({
      status: 'not ready',
      error: (err as Error).message,
    });
  }
});

// ── Application routes ────────────────────────────────────────────
app.get('/api/users/:id', async (req, res) => {
  const users = await dbBreaker.fire(
    'SELECT id, name, email FROM users WHERE id = $1',
    [req.params.id],
  );
  res.json(users[0] ?? { error: 'not found' });
});

// ── Pillar 3 (part 2): Graceful shutdown on SIGTERM ───────────────
const server = app.listen(8080, () => {
  console.log('Listening on :8080');
});

process.on('SIGTERM', () => {
  console.log('SIGTERM received — starting graceful shutdown');
  isShuttingDown = true;

  // Stop accepting new connections; the callback fires once all
  // in-flight requests have drained.
  server.close(() => {
    console.log('HTTP server closed — all in-flight requests drained');
    // Shut down the OpenTelemetry exporter (flush remaining spans)
    sdk.shutdown().finally(() => {
      process.exit(0);
    });
  });

  // Force exit after 25 seconds (K8s terminationGracePeriodSeconds is 30)
  setTimeout(() => {
    console.error('Graceful shutdown timeout — forcing exit');
    process.exit(1);
  }, 25_000);
});
kubernetes-probes.yaml
# Kubernetes deployment with liveness + readiness probes + graceful termination
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:             # required by apps/v1 — must match the pod labels
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      terminationGracePeriodSeconds: 30  # K8s waits 30s before SIGKILL
      containers:
        - name: my-service
          image: my-service:1.2.3
          ports:
            - containerPort: 8080

          # Liveness probe: restart pod if process is deadlocked
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5   # wait 5s before first check
            periodSeconds: 10
            failureThreshold: 3      # restart after 3 consecutive failures

          # Readiness probe: stop routing traffic if dependencies down
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10  # wait for DB connection to establish
            periodSeconds: 5
            failureThreshold: 2      # stop traffic after 2 consecutive failures

          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
            - name: OTEL_ENDPOINT
              value: "http://jaeger-collector:4318/v1/traces"

Exam Answer vs. Production Reality


Liveness vs readiness probes

📖 What the exam expects

Liveness checks if the process is alive — Kubernetes restarts the pod if it fails. Readiness checks if the pod can serve traffic — Kubernetes stops routing to it if it fails.


How this might come up in interviews

Production readiness questions are the clearest signal of engineering maturity. Every L4+ engineer should be able to implement all four pillars without looking them up.

Common questions:

  • What happens when a Kubernetes pod receives SIGTERM?
  • How do you prevent a circuit breaker from opening on every deploy?
  • What attributes should you add to OpenTelemetry spans?
  • What is the difference between /health and /ready?
  • How do you trace across service boundaries with OpenTelemetry?
  • What does your circuit breaker return when the database is down?
  • Why does graceful shutdown matter for Kubernetes deployments?

Strong answers include:

  • Immediately distinguishes liveness (process alive) vs readiness (dependencies healthy)
  • Knows Kubernetes sends SIGTERM 30s before SIGKILL and implements drain logic
  • Mentions opossum or similar for circuit breakers, knows the three states (open/half-open/closed)
  • Knows W3C traceparent propagates trace context across HTTP boundaries
  • Mentions that circuit breaker fallback should return cached data or explicit 503, not hang
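The W3C traceparent header mentioned above has a fixed four-field shape: version, 16-byte trace-id, 8-byte parent span id, and trace flags (01 = sampled), joined by hyphens. A small illustrative parser — `parseTraceparent` is not a library function, and the example header value is the one used in the W3C Trace Context spec:

```typescript
// Illustrative parser for the header auto-instrumentation injects for you.
function parseTraceparent(header: string): {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
} {
  const [version, traceId, parentId, flags] = header.split('-');
  return { version, traceId, parentId, sampled: flags === '01' };
}

const tp = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
console.log(tp.traceId); // "4bf92f3577b34da6a3ce929d0e0e4736"
console.log(tp.sampled); // true
```

In an interview, being able to name these fields and explain that every downstream service reads the incoming traceparent and emits a new one with its own span id is what demonstrates real cross-service tracing knowledge.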

Red flags:

  • "Both /health and /ready just check if the server is up" — conflates liveness and readiness
  • Does not know what SIGTERM is or why Kubernetes sends it
  • Has never implemented a circuit breaker — only knows retry logic
  • Thinks tracing is just adding request ID to logs
  • Cannot explain what happens to in-flight requests during a rolling deploy

Quick check · Build Challenge: Make It Observable and Resilient


Your Kubernetes pod receives SIGTERM. It has 8 in-flight requests. What is the correct behavior?

From the books

Release It! — Michael T. Nygard (2018)

Chapter 5: Stability Patterns

Circuit breakers, bulkheads, and timeouts are the difference between a service that fails fast and recovers and one that fails slow and cascades. Every production service needs all three.

Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda (2022)

Chapter 2: What Is Observability?

Observability is not a set of tools — it is the property of a system that allows you to ask any question about its internal state from its external outputs. You build for observability before incidents, not after.

