Metrics, logs, traces, alerting strategy, SLOs and error budgets, and how senior engineers build systems that are easy to debug.
Lesson outline
Monitoring answers "is the system working?" — predefined dashboards and alerts for known failure modes. Reactive.
Observability answers "why is it not working?" — the ability to understand internal state from external outputs (metrics, logs, traces). Proactive: debug novel failures you have never seen before.
You need both. Monitoring for fast detection. Observability for fast diagnosis.
Metrics: Numeric time-series. Counters (requests_total), Gauges (active_connections), Histograms (request_duration_seconds). Cheap, fast, great for dashboards. Cannot tell you WHY something is slow — only THAT it is slow.
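To make the three metric types concrete, here is a minimal, illustrative sketch of their semantics (hand-rolled for clarity — in production you would use a real client library such as prom-client or the OTel metrics SDK):

```typescript
// Illustrative sketch only — shows the semantics of the three metric types.

class MetricCounter {
  private value = 0;
  inc(by = 1) {
    if (by < 0) throw new Error('counters are monotonic — they only go up');
    this.value += by;
  }
  get() { return this.value; }
}

class MetricGauge {
  private value = 0;
  set(v: number) { this.value = v; } // gauges can go up or down
  get() { return this.value; }
}

class MetricHistogram {
  readonly counts: number[];
  constructor(private buckets: number[]) {
    this.counts = new Array(buckets.length + 1).fill(0); // last slot = +Inf bucket
  }
  observe(v: number) {
    const i = this.buckets.findIndex((b) => v <= b);
    this.counts[i === -1 ? this.buckets.length : i]++;
  }
}

const requestsTotal = new MetricCounter();
requestsTotal.inc(); // one request served

const activeConnections = new MetricGauge();
activeConnections.set(12); // current value; may drop later

const requestDuration = new MetricHistogram([0.1, 0.25, 0.5, 1]);
requestDuration.observe(0.045); // lands in the <= 0.1s bucket
requestDuration.observe(2.3);   // lands in the +Inf bucket
```

The histogram is why dashboards can show p99 latency cheaply: buckets are pre-aggregated counts, not raw samples.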
Logs: Timestamped structured records. Use JSON (not free-text). Include: requestId, userId, operation, duration, error. Answers "what happened?" for specific events.
Traces: Distributed request paths across services. A trace follows one request through frontend → API gateway → auth → orders → database. Essential for diagnosing latency in microservices.
Correlation IDs link all three pillars
Alert fires (metric) → query logs for slow requests in that window → open trace for a specific slow request → see which service is slow. Trace IDs in logs are the link.
```typescript
import pino from 'pino';
import crypto from 'node:crypto';
import express from 'express';
import { AsyncLocalStorage } from 'async_hooks';

const app = express();

// Request context — automatically included in all logs.
// AsyncLocalStorage injects request context without threading it through every function.
const requestContext = new AsyncLocalStorage<{
  requestId: string;
  userId?: string;
  traceId?: string;
}>();

const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  formatters: {
    log: (obj) => ({
      ...obj,
      ...requestContext.getStore(), // Auto-inject requestId, traceId
    }),
  },
});

// Middleware: set context for all logs in this request.
// Propagate X-Request-ID — it links logs across services.
app.use((req, res, next) => {
  const requestId = (req.headers['x-request-id'] as string) ?? crypto.randomUUID();
  const traceId = (req.headers['x-trace-id'] as string) ?? requestId;
  requestContext.run({ requestId, traceId }, () => {
    res.setHeader('x-request-id', requestId);
    next();
  });
});

// Every log automatically includes requestId, traceId
logger.info({ operation: 'order.create', orderId: 42, duration: 45 }, 'Order created');
// ✅ Structured — machine-queryable in Grafana Loki
// ❌ Avoid: logger.info(`Order ${orderId} created`) — unqueryable free text
```
Alert on: p99 latency > threshold, error rate > 1%, saturation > 80%.
SLI: Metric measuring service behavior. "p99 latency of successful HTTP requests."
SLO: Target for an SLI. "p99 latency < 200ms over 30 days."
Error budget: 100% - SLO target = allowed failure budget. 99.9% availability = 43.8 min/month downtime allowed. If budget exhausted → no new deploys until next period.
Error budgets create organizational alignment: "do we spend the budget on features (risky deploys) or protect reliability (no deploys)?"
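The error-budget arithmetic above is simple enough to sketch directly (illustrative only — the 30.44-day figure is the average month length, which is where the 43.8-minute number comes from):

```typescript
// Error budget: allowed downtime = window length × (1 - SLO target).
const AVG_MONTH_DAYS = 365.25 / 12; // ≈ 30.44 days

function errorBudgetMinutes(sloTarget: number, windowDays = AVG_MONTH_DAYS): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloTarget); // minutes of failure allowed per window
}

errorBudgetMinutes(0.999);  // ≈ 43.8 min/month — the 99.9% figure above
errorBudgetMinutes(0.99);   // ≈ 438 min (~7.3 h)/month
errorBudgetMinutes(0.9999); // ≈ 4.4 min/month
```

Each extra "nine" shrinks the budget tenfold, which is why 99.99% is dramatically harder than 99.9%.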
SLOs should be based on user happiness
"CPU < 80%" is infrastructure. "99.9% of searches return in < 500ms" is a user-facing SLO. Alert on what users experience, not on infrastructure metrics.
```yaml
# SLO alerting: burn rate approach

# Error rate alert (page the on-call).
# >1% for 5 minutes → page. Not every spike — sustained burn.
- alert: SLOErrorBudgetBurningFast
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) /
      sum(rate(http_requests_total[1h]))
    ) > 0.01  # > 1% error rate over the last hour
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Error budget burning — SLO at risk"

# Burn rate: how fast is the monthly budget being consumed?
# Burn rate of 1 = on track (sustainable).
# Burn rate of 14.4 = monthly budget exhausted in 50 hours (urgent).
- record: slo:error_budget_burn_rate:1h
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[1h])) /
     sum(rate(http_requests_total[1h])))
    / 0.001  # divide by (1 - SLO target) = (1 - 0.999)
```
OpenTelemetry (OTel): Industry standard. Instrument once, export to any backend (Jaeger, Tempo, Honeycomb, Datadog). Replaces all vendor-specific SDKs.
Context propagation: Trace ID flows through every service via `traceparent` HTTP headers. Each service creates a child span. Result: waterfall showing exactly where time is spent.
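As a sketch of what `traceparent` propagation actually looks like on the wire (this hand-rolls the W3C Trace Context header format for illustration — in practice the OTel SDK parses and injects it for you):

```typescript
import { randomUUID } from 'node:crypto';

// W3C Trace Context header: `traceparent: 00-<traceId>-<spanId>-<flags>`
interface TraceContext {
  traceId: string;  // 32 lowercase hex chars — constant across every service
  spanId: string;   // 16 lowercase hex chars — the caller's span
  sampled: boolean; // flags bit 0: is this trace being recorded?
}

function parseTraceparent(header: string): TraceContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return { traceId: m[1], spanId: m[2], sampled: (parseInt(m[3], 16) & 1) === 1 };
}

// For an outgoing call: keep the traceId, mint a fresh spanId for the child span
function childTraceparent(ctx: TraceContext): string {
  const spanId = randomUUID().replace(/-/g, '').slice(0, 16);
  return `00-${ctx.traceId}-${spanId}-${ctx.sampled ? '01' : '00'}`;
}
```

Because the traceId survives every hop while each service mints its own spanId, the backend can reassemble the spans into the waterfall view.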
Sampling: At 100K RPS, recording every trace is expensive. Tail-based sampling (record all traces with errors or high latency) is most useful for production.
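A tail-based sampling policy might look like this (a hedged sketch of an OpenTelemetry Collector `tail_sampling` processor — policy names are arbitrary, and exact fields can vary by collector version):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans until the full trace is seen
    policies:
      - name: keep-errors     # always keep traces containing an error
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow       # always keep slow traces
        type: latency
        latency: { threshold_ms: 500 }
      - name: baseline        # plus a 1% random sample of everything else
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
```

The trade-off: the collector must buffer whole traces before deciding, which costs memory — the price of never dropping the interesting ones.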
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Auto-instrumentation captures HTTP, Express, pg, Redis spans with zero manual code
const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': 'orders-service',
    'service.version': process.env.APP_VERSION,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// Manual span for critical business operations
const tracer = trace.getTracer('orders-service');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('payment.process', async (span) => {
    span.setAttributes({ 'order.id': orderId, 'payment.amount': amount });
    try {
      const result = await paymentGateway.charge(amount); // paymentGateway: your payment client
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end(); // ALWAYS end spans in finally — unclosed spans leak memory
    }
  });
}
```
Observability questions test maturity: junior engineers add console.log; senior engineers build observable systems from day one.
Common questions:
Strong answers include:
Red flags:
Quick check · Observability & Monitoring: You Cannot Fix What You Cannot See
Key takeaways
From the books
Site Reliability Engineering — Beyer, Jones, Petoff, Murphy (2016)
Chapter 6: Monitoring Distributed Systems
Four golden signals — latency, traffic, errors, saturation — properly alerted will catch most production problems before users report them.