Interactive Explainer

Distributed Tracing: Observability in a Microservices World

When a request fails across 15 services, how do you find the problem in 60 seconds?

🎯 Key Takeaways

  • Three pillars: Metrics (aggregate), Logs (what happened), Traces (where did time go across services).
  • OpenTelemetry is the vendor-neutral standard. Auto-instruments most frameworks with zero code changes.
  • Trace context propagation (W3C traceparent) links spans across service boundaries.
  • Alert on symptoms (user-visible latency/errors), not causes (CPU, memory).
  • SLO error budget: your allowed downtime. When exhausted, freeze features and fix reliability.


The Three Pillars of Observability

"Observability" comes from control theory: the ability to understand a system's internal state from its external outputs. Google's SRE practice popularized the term for distributed systems. Without it, you're flying blind at 30,000 feet.

| Pillar | What It Is | Answers | Tools |
| --- | --- | --- | --- |
| Metrics | Numeric aggregates over time | "How many requests/sec? What's p99 latency?" | Prometheus, DataDog, CloudWatch |
| Logs | Timestamped text events | "What happened exactly? What error occurred?" | Elasticsearch, Loki, Splunk |
| Traces | A request's journey through services | "Which service caused this 5-second request?" | Jaeger, Zipkin, Honeycomb, DataDog APM |

Why Traces Are Different

Metrics tell you p99 latency is 5 seconds. Logs tell you there was an error. Traces tell you that out of 15 services the request visited, the database query in the inventory-service took 4.8 seconds because of a missing index. Traces connect the dots.
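To make "traces connect the dots" concrete, here is a minimal sketch of the analysis a tracing backend performs over stored spans. The span shape and the `slowestBySelfTime` helper are illustrative (not any particular library's API): self-time, a span's duration minus the time spent in its direct children, points straight at the culprit.

```typescript
// A minimal span model, roughly what every tracing backend stores per trace.
interface Span {
  spanId: string;
  parentId?: string;   // undefined for the root span
  service: string;
  operation: string;
  durationMs: number;  // wall-clock time for this span
}

// Self-time = a span's duration minus time spent in its direct children.
// Ranking by self-time answers "where did the time actually go?"
function slowestBySelfTime(spans: Span[]): { span: Span; selfTimeMs: number }[] {
  const childTime = new Map<string, number>();
  for (const s of spans) {
    if (s.parentId) {
      childTime.set(s.parentId, (childTime.get(s.parentId) ?? 0) + s.durationMs);
    }
  }
  return spans
    .map((span) => ({ span, selfTimeMs: span.durationMs - (childTime.get(span.spanId) ?? 0) }))
    .sort((a, b) => b.selfTimeMs - a.selfTimeMs);
}

// The 5-second request from the example above, as a three-span trace:
const checkoutTrace: Span[] = [
  { spanId: 'a', service: 'api-gateway', operation: 'POST /checkout', durationMs: 5000 },
  { spanId: 'b', parentId: 'a', service: 'order-service', operation: 'createOrder', durationMs: 4900 },
  { spanId: 'c', parentId: 'b', service: 'inventory-service', operation: 'SELECT stock', durationMs: 4800 },
];

const [worst] = slowestBySelfTime(checkoutTrace);
console.log(`${worst.span.service}: ${worst.selfTimeMs}ms self-time`);
// inventory-service: 4800ms self-time (the missing-index query)
```

The gateway and order-service spans each have only ~100ms of self-time; almost all of the 5 seconds sits inside the inventory query, which is exactly what the waterfall view shows you visually.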

How Distributed Tracing Works

The Journey of a Trace

  1. Request arrives at the API gateway. The gateway generates a unique trace-id (e.g., 7d8e9f0a) and a span-id for the root span.
  2. The gateway adds the trace context to outgoing requests: traceparent: 00-7d8e9f0a...-1a2b...-01
  3. The user service receives the request and creates a child span with a new span-id but the same trace-id. It records its start time and the parent span-id.
  4. The user service calls the database. It records the query, duration, and result, then sends the span to the tracing backend.
  5. The user service calls the order service, propagating the trace-context headers.
  6. The order service and its dependencies each create spans, record timing, and propagate context.
  7. All spans with the same trace-id are collected and visualized as a waterfall/Gantt chart.
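The propagation steps above hinge on the W3C traceparent header. Here is a minimal sketch of minting and forwarding one by hand, purely to show the mechanics; in a real service, OpenTelemetry's propagator does this for you. The helper names are illustrative.

```typescript
import { randomBytes } from 'crypto';

// W3C traceparent format: version-traceid-spanid-flags
// e.g. 00-<32 hex chars>-<16 hex chars>-01
function newTraceparent(): string {
  const traceId = randomBytes(16).toString('hex'); // shared by every span in the trace
  const spanId = randomBytes(8).toString('hex');   // unique per span
  return `00-${traceId}-${spanId}-01`;             // flags 01 = sampled
}

// A downstream service keeps the trace-id but mints a new span-id (step 3 above).
function childTraceparent(parent: string): string {
  const [version, traceId, , flags] = parent.split('-');
  const spanId = randomBytes(8).toString('hex');
  return `${version}-${traceId}-${spanId}-${flags}`;
}

const root = newTraceparent();
const child = childTraceparent(root);
// Both headers share the same 32-char trace-id, so the backend can stitch the spans together.
console.log(root.split('-')[1] === child.split('-')[1]); // true
```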


Use OpenTelemetry

W3C Trace Context (the traceparent/tracestate headers) is the standard, a W3C Recommendation since 2020. OpenTelemetry implements it. It's vendor-neutral, auto-instruments Node.js/Python/Go/Java, and exports to any backend (Jaeger, Honeycomb, DataDog, etc.).

opentelemetry-setup.ts

```typescript
// OpenTelemetry: auto-instrument a Node.js service.
// OTel must be initialized BEFORE other imports so it can patch libraries.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    // service.name is how you identify this service in trace visualizations
    [SemanticResourceAttributes.SERVICE_NAME]: 'order-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.GIT_SHA ?? 'unknown',
    environment: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [
    // Auto-instrumentation patches HTTP, Express, pg, and Redis with zero code changes
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },    // auto-traces HTTP
      '@opentelemetry/instrumentation-express': { enabled: true }, // auto-traces Express
      '@opentelemetry/instrumentation-pg': { enabled: true },      // auto-traces PostgreSQL
      '@opentelemetry/instrumentation-redis': { enabled: true },   // auto-traces Redis
    }),
  ],
});

sdk.start();

// Manual spans for custom operations
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processPayment(orderId: string, amountCents: number) {
  const span = tracer.startSpan('process-payment');
  // Business-relevant attributes make traces searchable and filterable
  span.setAttributes({
    'order.id': orderId,
    'payment.amount_cents': amountCents,
    'payment.currency': 'USD',
  });

  try {
    // paymentProvider is assumed to be your payment client, defined elsewhere
    const result = await paymentProvider.charge({ orderId, amountCents });
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
    span.recordException(error as Error);
    throw error;
  } finally {
    span.end(); // Always end spans: leaked spans cause memory issues
  }
}
```

SLOs, Error Budgets, and Alert Design

The SRE Observability Framework

  • 🎯SLI (Service Level Indicator) — The metric you measure. Success rate = successful_requests / total_requests.
  • 📊SLO (Service Level Objective) — The target. "99.9% of requests succeed." Internal commitment.
  • 📜SLA (Service Level Agreement) — Customer contract. Usually more lenient than SLO. Breach = refunds/penalties.
  • 💰Error Budget — 100% - SLO = error budget. 99.9% SLO = 43.8 minutes downtime/month. When exhausted, freeze features and fix reliability.
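The error-budget arithmetic above is simple enough to sketch. The function name is illustrative; the "average month" of 365.25 / 12 ≈ 30.44 days is what yields the commonly quoted 43.8 minutes at 99.9%:

```typescript
// Error budget: the downtime an SLO permits over a window.
// budget = (1 - SLO) * window
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const budgetFraction = 1 - sloPercent / 100;
  return budgetFraction * windowDays * 24 * 60;
}

const averageMonthDays = 365.25 / 12; // ≈ 30.44 days

console.log(errorBudgetMinutes(99.9, averageMonthDays).toFixed(1));  // "43.8" minutes/month
console.log(errorBudgetMinutes(99.99, averageMonthDays).toFixed(1)); // "4.4" minutes at four nines
```

Each extra nine divides the budget by ten, which is why 99.99% is a fundamentally different operational commitment than 99.9%.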

Alert on Symptoms, Not Causes

Bad: "CPU > 80%." Good: "p99 latency > 500ms for 5 minutes." Bad: "Redis connection errors." Good: "Payment success rate < 99.5%." Alert on what users experience, not what infrastructure does.

| Signal | Example | Alert When |
| --- | --- | --- |
| Rate | Requests/sec | 50% deviation from baseline for 5 min |
| Error rate | 5xx / total requests | Error rate > 1% for 5 min OR > 5% for 1 min |
| Duration | p50, p95, p99 request time | p99 > SLO threshold for 10 min |
| Saturation | DB connections, queue depth | Connection pool > 80% utilized |

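The error-rate row above is a multi-window rule: the long window catches slow burns, the short window catches sudden spikes. A sketch of evaluating it over request counters (the `Window` shape and thresholds are taken from the table; how you populate the windows is up to your metrics system):

```typescript
interface Window {
  errors: number; // 5xx responses in the window
  total: number;  // all responses in the window
}

// Multi-window error-rate rule: fire if > 1% over 5 min OR > 5% over 1 min.
function shouldAlert(fiveMin: Window, oneMin: Window): boolean {
  const rate = (w: Window) => (w.total === 0 ? 0 : w.errors / w.total);
  return rate(fiveMin) > 0.01 || rate(oneMin) > 0.05;
}

// Slow burn: 2% errors sustained across five minutes fires the long window.
console.log(shouldAlert({ errors: 120, total: 6000 }, { errors: 24, total: 1200 })); // true
// Healthy traffic (0.5% errors) stays quiet.
console.log(shouldAlert({ errors: 30, total: 6000 }, { errors: 2, total: 1200 }));   // false
```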
How this might come up in interviews

Observability questions show how you think about running systems in production. Senior engineers think about debugging before writing the code.

Common questions:

  • How would you debug a latency spike in a microservices system?
  • What is the difference between tracing, logging, and metrics?
  • How do you design alerts to avoid alert fatigue?
  • What are SLOs and SLAs and why do they matter?

Strong answers include:

  • Knows the three pillars and when each is useful
  • Mentions error budget and burn rate alerting
  • Knows OpenTelemetry and trace context propagation
  • Discusses sampling strategies for high-traffic systems

Red flags:

  • No distinction between logs, metrics, and traces
  • Alerts on infrastructure metrics rather than user-facing symptoms
  • Never heard of OpenTelemetry or distributed tracing



From the books

Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda (2022)

Chapter 4: Distributed Tracing

Observability is not about collecting data — it's about the ability to ask any question of your system without predicting the question in advance.

Site Reliability Engineering — Betsy Beyer et al. (Google) (2016)

Chapter 4: Service Level Objectives

SLOs and error budgets create a rational framework for the reliability vs. feature velocity trade-off. Without them, every outage becomes a political argument.
