When a request fails across 15 services, how do you find the problem in 60 seconds?
The term "observability" comes from control theory and was popularized for distributed systems by Google's SRE movement: the ability to understand a system's internal state from its external outputs. Without it, you're flying blind at 30,000 feet.
| Pillar | What It Is | Answers | Tools |
|---|---|---|---|
| Metrics | Numeric aggregates over time | "How many requests/sec? What's p99 latency?" | Prometheus, DataDog, CloudWatch |
| Logs | Timestamped text events | "What happened exactly? What error occurred?" | Elasticsearch, Loki, Splunk |
| Traces | A request's journey through services | "Which service caused this 5-second request?" | Jaeger, Zipkin, Honeycomb, DataDog APM |
Why Traces Are Different
Metrics tell you p99 latency is 5 seconds. Logs tell you there was an error. Traces tell you that out of 15 services the request visited, the database query in the inventory-service took 4.8 seconds because of a missing index. Traces connect the dots.
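To make that concrete, here is a minimal sketch of the span data a tracing backend assembles into a waterfall view. The span records below are invented for illustration, not from any real backend:

```typescript
// Hypothetical span records, as a backend would collect them for one trace-id
interface Span {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  name: string;
  startMs: number;
  durationMs: number;
}

const spans: Span[] = [
  { traceId: '7d8e9f0a', spanId: 'a1', name: 'GET /checkout', startMs: 0, durationMs: 5000 },
  { traceId: '7d8e9f0a', spanId: 'b2', parentSpanId: 'a1', name: 'inventory-service', startMs: 40, durationMs: 4900 },
  { traceId: '7d8e9f0a', spanId: 'c3', parentSpanId: 'b2', name: 'db.query stock', startMs: 80, durationMs: 4800 },
];

// The slowest leaf span (one with no children) is usually the culprit
function findCulprit(trace: Span[]): Span {
  const leaves = trace.filter(s => !trace.some(t => t.parentSpanId === s.spanId));
  return leaves.reduce((a, b) => (b.durationMs > a.durationMs ? b : a));
}
```

Walking from the root down to the slowest leaf is exactly what you do visually in a waterfall chart: the 4.8-second database span nested under the inventory-service span is the answer.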
The Journey of a Trace
1. Request arrives at the API gateway. The gateway generates a unique trace-id (e.g., 7d8e9f0a) and a span-id for the root span.
2. The gateway adds trace context to outgoing requests: `traceparent: 00-7d8e9f0a...-1a2b...-01`
3. The user service receives the request and creates a child span with a new span-id but the same trace-id. It records the start time and the parent span-id.
4. The user service calls the database. It records the query, duration, and result, then sends the span to the tracing backend.
5. The user service calls the order service, propagating the trace context headers.
6. The order service and its dependencies each create spans, record timing, and propagate context.
7. All spans with the same trace-id are collected and visualized as a waterfall/Gantt chart.
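The propagation steps above hinge on the traceparent header format: version, trace-id, parent span-id, and flags, joined by hyphens. Here is a minimal sketch of generating and continuing that header by hand (the helper names are invented for illustration; in practice OpenTelemetry does this for you):

```typescript
// Sketch of W3C traceparent handling: version-traceId-spanId-flags, lowercase hex
import { randomBytes } from 'crypto';

// Root of a new trace: fresh trace-id and root span-id
function newTraceparent(): string {
  const traceId = randomBytes(16).toString('hex'); // 32 hex chars, shared by every span in the trace
  const spanId = randomBytes(8).toString('hex');   // 16 hex chars, unique per span
  return `00-${traceId}-${spanId}-01`;             // 01 = sampled flag
}

// Continuing a trace in a downstream service: same trace-id, new span-id
function childTraceparent(parent: string): string {
  const [version, traceId, , flags] = parent.split('-');
  const spanId = randomBytes(8).toString('hex');
  return `${version}-${traceId}-${spanId}-${flags}`;
}
```

Every hop replaces the span-id but preserves the trace-id, which is what lets the backend stitch all spans back into one trace.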
Use OpenTelemetry
W3C Trace Context (the traceparent/tracestate headers) is the standard for propagating trace context between services. OpenTelemetry implements it. It's vendor-neutral, auto-instruments Node.js, Python, Go, and Java, and exports to any backend (Jaeger, Honeycomb, DataDog, etc.).
```typescript
// OpenTelemetry: auto-instrument a Node.js service.
// OTel must initialize BEFORE other imports to successfully instrument libraries.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    // service.name is how you identify this service in trace visualizations
    [SemanticResourceAttributes.SERVICE_NAME]: 'order-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.GIT_SHA ?? 'unknown',
    environment: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [
    // Auto-instrumentation patches HTTP, Express, pg, Redis with zero code changes
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },    // auto-traces HTTP
      '@opentelemetry/instrumentation-express': { enabled: true }, // auto-traces Express
      '@opentelemetry/instrumentation-pg': { enabled: true },      // auto-traces PostgreSQL
      '@opentelemetry/instrumentation-redis': { enabled: true },   // auto-traces Redis
    }),
  ],
});

sdk.start();

// Manual spans for custom operations
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processPayment(orderId: string, amountCents: number) {
  const span = tracer.startSpan('process-payment');
  // Business-relevant attributes make searching and filtering traces useful
  span.setAttributes({
    'order.id': orderId,
    'payment.amount_cents': amountCents,
    'payment.currency': 'USD',
  });

  try {
    const result = await paymentProvider.charge({ orderId, amountCents });
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
    span.recordException(error as Error);
    throw error;
  } finally {
    span.end(); // Always end spans in finally: leaked spans cause memory issues
  }
}
```
The SRE Observability Framework
Alert on Symptoms, Not Causes
Bad: "CPU > 80%." Good: "p99 latency > 500ms for 5 minutes." Bad: "Redis connection errors." Good: "Payment success rate < 99.5%." Alert on what users experience, not what infrastructure does.
| Signal | Example | Alert When |
|---|---|---|
| Rate | Requests/sec | 50% deviation from baseline for 5min |
| Error rate | 5xx / total requests | Error rate > 1% for 5min OR > 5% for 1min |
| Duration | p50, p95, p99 request time | p99 > SLO threshold for 10min |
| Saturation | DB connections, queue depth | Connection pool > 80% utilized |
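The error-rate rule from the table ("> 1% for 5min OR > 5% for 1min") can be sketched as a check over per-minute request counts. This is an invented illustration of the logic, not how any particular alerting tool implements it:

```typescript
// One minute of aggregated request counts
interface MinuteBucket {
  total: number;
  errors5xx: number;
}

function errorRate(buckets: MinuteBucket[]): number {
  const total = buckets.reduce((sum, b) => sum + b.total, 0);
  const errors = buckets.reduce((sum, b) => sum + b.errors5xx, 0);
  return total === 0 ? 0 : errors / total;
}

// buckets are ordered oldest-first; the last entry is the most recent minute
function shouldAlert(buckets: MinuteBucket[]): boolean {
  const last5 = buckets.slice(-5);
  const last1 = buckets.slice(-1);
  const sustained = last5.length === 5 && errorRate(last5) > 0.01; // > 1% for 5min
  const spike = errorRate(last1) > 0.05;                           // > 5% for 1min
  return sustained || spike;
}
```

The two thresholds serve different purposes: the 5-minute window catches a slow burn without flapping on noise, while the 1-minute window pages fast on a hard outage.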
Observability questions show how you think about running systems in production. Senior engineers think about debugging before writing the code.
From the books
Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda (2022)
Chapter 4: Distributed Tracing
Observability is not about collecting data — it's about the ability to ask any question of your system without predicting the question in advance.
Site Reliability Engineering — Betsy Beyer et al. (Google) (2016)
Chapter 4: Service Level Objectives
SLOs and error budgets create a rational framework for the reliability vs. feature velocity trade-off. Without them, every outage becomes a political argument.