© 2026 TheSimplifiedTech. All rights reserved.

Interactive Explainer

Istio Observability & Distributed Tracing

How Istio generates golden signals automatically -- and why missing trace propagation headers gives you 0% complete traces.

Relevant for: Mid-level, Senior, Staff
Why this matters at your level
Mid-level

Know the 4 golden signals Istio provides automatically. Understand why distributed tracing requires application header propagation. Know what randomSamplingPercentage does.

Senior

Build Grafana dashboards from Istio metrics. Debug trace propagation issues. Tune sampling rates for production. Know cardinality limits and why high-cardinality labels are dangerous.

Staff

Design the observability strategy for the platform: metrics cardinality budget, trace sampling policy, log retention. Define trace propagation as a platform contract all services must implement.


~4 min read
Incident timeline: Data Plane Blind Spot -- Missing Trace Headers -- 0% End-to-End Traces

T+0

Istio installed with Jaeger enabled -- team expects full distributed traces

T+1w

Jaeger shows traces -- but all traces are single-span, no parent-child relationships

WARNING
T+1w

Investigation: Envoy correctly adds x-b3-traceid/spanid headers on every request

T+2w

Root cause: app HTTP clients not forwarding trace headers to downstream calls

CRITICAL
T+4w

30 services updated to propagate B3 headers -- full traces appear in Jaeger

30 — Services requiring header propagation fix
0% — Complete end-to-end traces before fix
4 weeks — Time to achieve full trace coverage
7 — Trace headers Envoy adds per request

The question this raises

How does Istio generate observability data, what does it do automatically vs what requires application cooperation, and why is distributed tracing the hardest part to get right?

Test your assumption first

Your team enables Istio tracing with Jaeger. After a week, you check Jaeger and see hundreds of traces -- but every trace shows only one span with no child spans. Services are definitely calling each other. What is the root cause?

Lesson outline

What Istio gives you automatically (no app changes)

Istio generates L7 golden signals for every service pair -- automatically

Because every request flows through Envoy, Istio can measure request rate (traffic), error rate, and latency (P50/P95/P99) for every source-destination pair. That covers three of the four golden signals from SRE; the fourth, saturation, comes from platform metrics such as CPU and memory rather than from Envoy. All of this is available in Prometheus/Grafana without any changes to application code.

Automatic observability from Envoy sidecars

  • Request/error rate per service pair — istio_requests_total with labels: source_workload, destination_service, response_code
  • Latency histograms per service pair — istio_request_duration_milliseconds -- P50, P95, P99 per route
  • TCP connection metrics — Bytes sent/received, active connections, connection errors
  • Trace span creation — Envoy creates a span for every request and reports to Zipkin/Jaeger -- but only one span per hop
  • Access logs — Structured JSON logs with full request/response metadata -- configurable per workload
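The metrics above are plain counters and histograms; what PromQL's rate() computes over istio_requests_total is simple counter arithmetic. A minimal Python sketch, using hypothetical counter values:

```python
# Illustrative only: what rate(istio_requests_total[5m]) and an error-ratio
# query compute, expressed as plain Python. Counter values are hypothetical.

def request_rate(count_start: float, count_end: float, window_s: float) -> float:
    """Per-second request rate over a window, like rate(istio_requests_total[5m])."""
    return (count_end - count_start) / window_s

def error_ratio(errors_per_s: float, total_per_s: float) -> float:
    """Error rate as a fraction of traffic, like a 5xx-over-total PromQL ratio."""
    return errors_per_s / total_per_s if total_per_s else 0.0

# Counters observed 300 s apart for one source -> destination pair:
total_rate = request_rate(12_000, 12_900, 300)   # 3.0 req/s
err_rate = request_rate(240, 255, 300)           # 0.05 err/s
print(total_rate, error_ratio(err_rate, total_rate))
```

Prometheus does the same subtraction-and-divide server-side, which is why no application change is needed: Envoy increments the counters, Prometheus computes the rates.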

What requires application cooperation


Service A --> Service B --> Service C --> Service D
(frontend)   (orders)     (inventory)  (database)

What Istio does automatically at each hop:
  A's sidecar: generates span-1, adds trace headers to request to B
  B's sidecar: receives span-1 context, generates span-2 for B-to-C call

THE PROBLEM: B's sidecar generates span-2 using the INCOMING request context.
But when B's app makes the outbound call to C, it uses its OWN HTTP client.
If that HTTP client does not copy the incoming trace headers to the outbound call,
the outgoing request has NO parent trace context.

Result:
  Span-1: A to B                    <- traced
  Span-2: B to C (orphan)           <- unconnected -- NEW root trace
  Span-3: C to D (orphan)           <- unconnected -- NEW root trace

Fix: Applications MUST copy these headers from incoming to outgoing requests:
  x-request-id
  x-b3-traceid
  x-b3-spanid
  x-b3-parentspanid
  x-b3-sampled
  x-b3-flags
  x-ot-span-context (if using OpenTracing)

Trace propagation is the application's responsibility

Istio cannot automatically propagate trace context through your application code. The incoming x-b3-* headers must be extracted from the inbound HTTP request and attached to all outbound HTTP calls. This requires a one-time change to your HTTP client wrapper in every service. Without it, tracing shows isolated one-hop spans, not end-to-end traces.
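The header-copy step can be sketched in a few lines. This is a minimal Python illustration using stdlib only; the dicts stand in for your framework's request and client types, and the header list is the set named in this lesson:

```python
# Minimal sketch of app-side trace header propagation. The request/headers
# objects are plain dicts standing in for your framework's real types.

TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
]

def extract_trace_headers(incoming_headers: dict) -> dict:
    """Pick the trace headers off the inbound request (case-insensitive)."""
    lowered = {k.lower(): v for k, v in incoming_headers.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}

# Usage inside a request handler: merge into every outbound call's headers
incoming = {"X-B3-TraceId": "463ac35c9f6413ad", "X-B3-SpanId": "a2fb4a1d1a96d312",
            "X-B3-Sampled": "1", "Accept": "application/json"}
outbound_headers = {"content-type": "application/json",
                    **extract_trace_headers(incoming)}
print(sorted(extract_trace_headers(incoming)))
```

Centralizing this in one HTTP client wrapper is what makes the fix a one-time change per service rather than a change at every call site.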

How to implement trace propagation

Making traces end-to-end in your services

1. Identify the 7 trace headers your framework must propagate: x-request-id, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, x-b3-flags, and x-ot-span-context (if using OpenTracing)

2. Create a centralized HTTP client wrapper that extracts the incoming headers from the request context

3. Pass the extracted headers to all outbound HTTP calls made during request processing

4. Alternatively, use the OpenTelemetry SDK with auto-instrumentation -- it handles propagation automatically

5. Validate in Jaeger: send a test request and confirm the trace graph shows A -> B -> C -> D (not 4 separate root spans)
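The step-5 validation can be expressed as a small check: one shared trace id, exactly one root span, and every other span parented inside the trace. A sketch in Python, with simplified span dicts standing in for what Jaeger's API returns:

```python
# Is a set of exported spans truly end-to-end? Broken propagation shows up
# as multiple trace ids (each hop starts its own root trace).

def is_end_to_end(spans: list) -> bool:
    trace_ids = {s["trace_id"] for s in spans}
    if len(trace_ids) != 1:
        return False  # orphan spans started new root traces
    span_ids = {s["span_id"] for s in spans}
    roots = [s for s in spans if s["parent_id"] is None]
    children_ok = all(s["parent_id"] in span_ids for s in spans if s["parent_id"])
    return len(roots) == 1 and children_ok

good = [  # A -> B -> C chained under one trace
    {"trace_id": "t1", "span_id": "a", "parent_id": None},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b"},
]
broken = [  # headers not propagated: every hop is its own root trace
    {"trace_id": "t1", "span_id": "a", "parent_id": None},
    {"trace_id": "t2", "span_id": "b", "parent_id": None},
]
print(is_end_to_end(good), is_end_to_end(broken))  # True False
```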


kubectl

# Check if tracing is configured in the mesh
kubectl get configmap istio -n istio-system -o jsonpath='{.data.mesh}' | grep -A5 tracing

# Check Envoy's tracing config for a specific pod
istioctl proxy-config bootstrap my-pod.my-namespace | grep -A10 tracing

# Check Prometheus metrics Istio generates (golden signals)
# Rate of requests per second by service
kubectl exec -n monitoring prometheus-0 -- \
  curl -sg 'http://localhost:9090/api/v1/query?query=rate(istio_requests_total[5m])'

# P99 latency per destination service
# Query: histogram_quantile(0.99, rate(istio_request_duration_milliseconds_bucket[5m]))

# Enable access logging for a workload
kubectl apply -f - <<EOF
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: enable-access-log
  namespace: my-namespace
spec:
  accessLogging:
  - providers:
    - name: envoy
EOF
telemetry.yaml

# Istio Telemetry API -- configure tracing per namespace
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing-config
  namespace: production
spec:
  tracing:
  - providers:
    - name: jaeger
    randomSamplingPercentage: 1.0  # 1% sampling -- adjust for traffic volume
    # Use 100% only in development -- at 1000 RPS, 100% = 1000 spans/second to Jaeger

# Increase sampling for a specific workload during debugging
---
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: debug-tracing
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-svc  # only this workload
  tracing:
  - providers:
    - name: jaeger
    randomSamplingPercentage: 100.0  # 100% for debugging only

What breaks in production

Blast radius of observability misconfig

  • 100% trace sampling at high RPS — At 1000 RPS, 100% sampling = 1000 spans/second -- Jaeger storage fills, Envoy CPU increases 20%
  • Access log verbosity — DEBUG access logging with full request/response bodies at 500 RPS -- disk fills in hours
  • Missing Prometheus scrape config — Istio metrics generated but not scraped -- golden signals dashboard shows no data
  • Label cardinality explosion — Custom request_url label on every metric with unique URLs -- Prometheus OOM from millions of time series
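The cardinality failure mode above is just multiplication: the number of time series one metric produces is roughly the product of each label's distinct values. A back-of-envelope sketch with hypothetical label counts:

```python
# Approximate Prometheus time series count as the product of the distinct
# values of each label. The label cardinalities below are hypothetical.
from math import prod

def series_count(label_cardinalities: list) -> int:
    """Rough upper bound on time series one metric can produce."""
    return prod(label_cardinalities)

# istio_requests_total with bounded labels:
#   50 source workloads x 50 destination services x 10 response codes
print(series_count([50, 50, 10]))             # 25000 -- manageable

# Add an unbounded request_url label with 1M distinct values:
print(series_count([50, 50, 10, 1_000_000]))  # 25000000000 -- Prometheus OOM
```

This is why any label whose value set grows with traffic (URLs, user IDs, request IDs) must never be attached to mesh metrics.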

100% trace sampling in production -- storage and CPU overload

Bug
# Telemetry with 100% sampling in production
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing
  namespace: production
spec:
  tracing:
  - providers:
    - name: jaeger
    randomSamplingPercentage: 100.0  # WRONG for production
    # At 500 RPS: 500 spans/second written to Jaeger
    # Jaeger storage fills within hours
    # Envoy pays CPU overhead creating and reporting every span
Fix
# Production: 1% sampling (sample 1 in 100 requests)
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing
  namespace: production
spec:
  tracing:
  - providers:
    - name: jaeger
    randomSamplingPercentage: 1.0  # 1% = 5 spans/sec at 500 RPS
    # For error-focused tracing: use tail-based sampling (decide after the
    # trace completes) or Jaeger's adaptive sampling to auto-tune rates

# Override to 100% only for specific debug sessions, e.g. a Telemetry
# resource scoped to one workload via a selector (see debug-tracing above)

1% sampling captures 1 in 100 requests -- sufficient to see latency distributions and catch common errors. For rare errors (< 1%), use tail-based sampling or temporarily raise the rate with a workload-scoped Telemetry resource. Never run 100% in production at scale.
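Conceptually, head-based sampling makes the keep/drop decision once per trace, derived from the trace id, so every hop agrees. An illustrative sketch (simplified -- in a real mesh, downstream Envoys honor the x-b3-sampled flag set at the first hop rather than re-deciding):

```python
# Illustrative head-based sampling: a deterministic bucket on the trace id
# keeps `percentage`% of traces, and gives the same answer at every hop.

def sampled(trace_id_hex: str, percentage: float) -> bool:
    """Deterministically keep `percentage`% of traces by trace id."""
    bucket = int(trace_id_hex, 16) % 10_000    # 0..9999
    return bucket < percentage * 100           # 1.0% -> buckets 0..99

kept = sum(sampled(f"{i:x}", 1.0) for i in range(10_000))
print(kept)  # 100 of 10000 trace ids kept at 1% sampling
```

The downside this illustrates: the decision is made before anything is known about the request, so a rare error has only a 1-in-100 chance of being traced -- the reason tail-based sampling exists.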

Decision guide: observability stack selection

Do you already have a metrics backend (Prometheus, Datadog, etc.)?
Yes: Integrate Istio metrics via Prometheus scraping (port 15014 on istiod, 15090 on sidecars). Reuse existing dashboards -- the Grafana Istio addon provides pre-built boards.
No: Install Prometheus + Grafana from the Istio addons manifests, then open them with istioctl dashboard grafana. Baseline observability in about 10 minutes.

Istio observability components

Signal             | Source                      | Automatic?                | App changes needed        | Key metric/config
Request rate       | Envoy sidecar -> Prometheus | Yes                       | None                      | istio_requests_total
Error rate         | Envoy sidecar -> Prometheus | Yes                       | None                      | istio_requests_total{response_code!~"2.."}
Latency            | Envoy sidecar -> Prometheus | Yes                       | None                      | istio_request_duration_milliseconds
Distributed traces | Envoy -> Jaeger/Zipkin      | Partial                   | Must propagate B3 headers | randomSamplingPercentage
Access logs        | Envoy -> stdout/Loki        | Yes (disabled by default) | None                      | Telemetry API or MeshConfig

Exam Answer vs. Production Reality


What Istio generates automatically

📖 What the exam expects

Istio automatically generates request rate, error rate, and latency metrics for every service-to-service communication. These metrics are available in Prometheus without application code changes.


How this might come up in interviews

This comes up when the interviewer wants to test real-world Istio experience. "What observability does Istio give you for free?" is the lead-in. The follow-up "what does it NOT give you for free?" reveals depth.

Common questions:

  • What observability does Istio provide automatically?
  • Why might your Jaeger traces show only one span per service?
  • How do you configure trace sampling in Istio?
  • What are the golden signals and how does Istio help you measure them?
  • What is the risk of high-cardinality labels in Istio metrics?

Strong answer: Immediately mentions header propagation limitation, knows B3 header names, has built Grafana dashboards from istio_requests_total, knows Telemetry API.

Red flags: Thinks distributed tracing is fully automatic, does not know about B3 header propagation, thinks 100% sampling is fine in production.

Related concepts

Explore topics that connect to this one.

  • Sidecar Injection & Init Containers
  • Observability
  • VirtualServices & Traffic Routing

Suggested next

Often learned after this topic.

VirtualServices & Traffic Routing
