The Simplified Tech · Interactive Explainer

© 2026 TheSimplifiedTech. All rights reserved.
Observability

The practice of understanding the internal state of a system from its external outputs — metrics, logs, and traces — enabling teams to debug novel failures in complex distributed systems without deploying new code.

🎯 Key Takeaways

  • Observability is fundamentally different from monitoring: monitoring checks known failure conditions, while observability enables you to ask arbitrary questions about novel failures without deploying new code. At FAANG scale with thousands of services, this distinction is the difference between 10-minute and 4-hour incident resolution.
  • The three pillars work together in a diagnostic workflow: metrics tell you something is wrong and when it started, logs tell you what happened with specific error details, and traces show you exactly where the problem is across your distributed service graph. No single pillar is sufficient alone.
  • High-cardinality data is the key insight that separates observability from monitoring. The ability to slice telemetry by arbitrary dimensions (shard_id, customer_tier, feature_flag) in real time is what lets you debug problems that affect 2% of users while aggregate metrics show "everything green."
  • Alert on user-facing symptoms (error rate, latency) using SLO-based burn rates, not on internal causes (CPU usage) with arbitrary thresholds. Multi-window burn rate alerting catches both fast incidents and slow degradation while minimizing alert fatigue. Every alert must have a runbook.
  • Observability at scale requires deliberate cost management: bound metric cardinality by using label discipline, implement tail-based trace sampling to keep interesting traces while controlling volume, use tiered retention (hot/warm/cold), and treat the observability pipeline itself as a production system with its own SLOs.

~38 min read

What Is Observability and Why It Matters

Observability is not just monitoring with a fancier name. Monitoring tells you when something breaks based on conditions you anticipated. Observability lets you ask arbitrary questions about your system that you did not predict in advance. At its core, observability is about being able to understand the internal state of a system by examining its external outputs. This distinction matters enormously at FAANG scale, where novel failure modes emerge from the combinatorial explosion of services, configurations, and traffic patterns that no one could anticipate ahead of time.

Charity Majors’ Definition

Charity Majors, co-founder of Honeycomb and one of the most influential voices in the observability community, defines observability as the ability to ask new questions of your system without having to ship new code. If you have to add a new metric, create a new log line, or redeploy to debug a production problem, your system is not observable — it is merely monitored. True observability means the telemetry data you already collect is rich enough to answer questions you have never asked before.

The distinction between monitoring and observability becomes painfully clear in microservice architectures. A monolith has a single process with a single stack trace. When it breaks, you look at one log file and one set of metrics. A distributed system with 200 services has 200 log streams, thousands of metric time series, and requests that fan out across dozens of services. Monitoring each service individually tells you nothing about the cross-service behavior that causes most production incidents.

Monitoring vs Observability

  • Monitoring — Predefined dashboards and alerts for known failure modes. You decide what to watch before the failure happens. Works well for infrastructure-level signals like CPU, memory, and disk.
  • Observability — The ability to explore and query telemetry data in real time to understand novel failures. You decide what to ask after the failure happens. Essential for application-level debugging in distributed systems.
  • Key difference — Monitoring answers "is the system working?" Observability answers "why is the system not working for this specific user, right now, in this specific way?"

The three pillars of observability — metrics, logs, and traces — each provide a different lens into system behavior. Metrics tell you that something is wrong (the error rate spiked). Logs tell you what happened (this specific request failed with this error). Traces tell you where it happened (the request was slow because shard-17 in the payments database took 30 seconds). You need all three because no single pillar gives you the full picture.

The Three Pillars Working Together

At Google, we train new SREs with this debugging flow: (1) Metrics alert you that p99 latency exceeded the SLO threshold. (2) You drill into the dashboard to identify which service is slow. (3) You search logs for errors or slow queries in that service during the affected time window. (4) You pull a trace for a slow request to see exactly which downstream dependency is the bottleneck. Each pillar narrows the search space until you reach the root cause.

At FAANG scale, observability is not optional — it is a survival requirement. Google runs over 10,000 production services. A single user request to Google Search can touch dozens of services. Without comprehensive observability, debugging a latency regression would require guessing across thousands of services. The systems that handle this at scale — Monarch for metrics, Dapper for traces, centralized structured logging — were built because traditional monitoring could not cope with the complexity.

Quick Mental Model

Monitoring is like a smoke detector: it tells you there is a fire. Observability is like a security camera system: it lets you rewind, zoom in, and understand exactly what happened, when, and why — even for events you never anticipated.

Aspect | Traditional Monitoring | Observability
Approach | Predefined checks and thresholds | Exploratory querying of rich telemetry
Failure types | Known, anticipated failures | Novel, emergent failures
Data model | Aggregated metrics and static dashboards | High-cardinality events, traces, structured logs
Debugging | Check dashboards, grep logs manually | Slice and dice data across any dimension in real time
Scale challenge | Dashboard sprawl, alert fatigue | Cardinality explosion, storage costs, sampling trade-offs

The three pillars of observability

Metrics

What is happening?

Aggregated numbers over time — trends, rates, percentiles.

  • Error rate: 0.1%
  • Latency p99: 200ms
  • CPU: 45%

Logs

When did it happen?

Event records with timestamps — who did what, and when.

  • User logged in
  • Payment processed
  • Error: timeout

Traces

Why did it happen?

Request journey across services — find bottlenecks and root cause.

  • API → Auth → DB
  • Slow span: tax API
  • Distributed debug

Example: Distributed trace (one request)

API → Auth → DB

Traces show where time is spent across services — e.g. the DB span is longest → optimize or cache.

The power of observability

Metrics tell you what is happening. Logs tell you when it happened. Traces tell you why — following a request through services to find the root cause.

Metrics: The Foundation of Quantitative Monitoring

Metrics are numerical measurements collected over time. They are the most mature and well-understood pillar of observability, providing the foundation for dashboards, alerting, and capacity planning. At Google, our internal metrics system (Monarch) ingests billions of data points per second. In the open-source world, Prometheus has become the de facto standard for metrics collection in cloud-native environments.

Understanding metric types is fundamental. Each type answers a different question about your system, and using the wrong type leads to incorrect conclusions or inefficient storage.

The Four Metric Types

  • Counter — A monotonically increasing value that can only go up (or reset to zero on restart). Use for: total requests served, total errors, total bytes transferred. You always query counters with rate() to get per-second values. Example: http_requests_total. Never use a counter for something that can decrease.
  • Gauge — A value that can go up and down at any time. Use for: current CPU utilization, memory usage, number of active connections, queue depth, temperature. Example: node_memory_available_bytes. Gauges represent a snapshot in time.
  • Histogram — Samples observations and counts them in configurable buckets, plus sum and count. Use for: request latency distribution, response size distribution. Example: http_request_duration_seconds_bucket{le="0.1"} shows requests under 100ms. Histograms let you calculate percentiles (p50, p95, p99) at query time.
  • Summary — Similar to histogram but calculates quantiles on the client side. Use when: you need precise quantiles for a single instance and cannot aggregate across instances. Limitation: summaries cannot be aggregated across instances, making them less useful in distributed systems. Prefer histograms in most cases.

Counter vs Gauge: The Most Common Mistake

New engineers frequently query a counter directly (e.g., http_requests_total = 1,523,847) and wonder why the number is so large or meaningless. Counters must always be wrapped in rate() or increase(). rate(http_requests_total[5m]) gives you requests per second averaged over 5 minutes. Querying a gauge with rate() is equally wrong — you would get the rate of change of CPU utilization, not the CPU utilization itself.
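To make the counter semantics concrete, here is a small pure-Python sketch of how rate() and increase() handle a counter reset. This is a simplification (the real Prometheus functions also extrapolate to the window boundaries), and the sample data is invented:

```python
def increase(samples):
    """Total counter increase over a window of (timestamp, value) samples,
    treating any decrease as a counter reset after a process restart."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole new
        # value counts as fresh increase rather than a negative delta.
        total += curr - prev if curr >= prev else curr
    return total

def rate(samples):
    """Per-second average rate over the window covered by the samples."""
    window_seconds = samples[-1][0] - samples[0][0]
    return increase(samples) / window_seconds

# http_requests_total scraped every 15s; the process restarts after t=15.
samples = [(0, 100), (15, 250), (30, 10), (45, 160)]
print(increase(samples))  # 150 + 10 + 150 = 310 requests
print(rate(samples))      # 310 / 45, about 6.9 requests/second
```

Note how the dip from 250 to 10 is interpreted as a restart rather than negative traffic; this is exactly why querying raw counter values without rate() or increase() is misleading.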

The Prometheus data model organizes metrics as time series identified by a metric name and a set of key-value labels. For example, http_requests_total{method="GET", handler="/api/users", status="200"} is one time series. Each unique combination of labels creates a separate time series. This is extremely powerful for filtering and grouping but introduces the cardinality problem (covered in Section 9).

Prometheus Metric Naming Conventions

  1. Use snake_case for metric names: http_requests_total, not httpRequestsTotal or HttpRequestsTotal
  2. Include the unit as a suffix: http_request_duration_seconds, node_memory_bytes, disk_io_operations_total
  3. Use _total suffix for counters: http_requests_total, errors_total, bytes_sent_total
  4. Use base units (seconds not milliseconds, bytes not megabytes): this avoids confusion and makes unit conversion explicit in queries
  5. Prefix with a namespace for your application: myapp_http_requests_total distinguishes your metrics from library metrics
  6. Keep label cardinality bounded: labels like user_id or request_id create unbounded cardinality and will crash Prometheus

PromQL (Prometheus Query Language) is the query language for extracting insights from metrics. It is a functional language where you compose operations on time series data. Understanding PromQL is essential for building dashboards and writing alert rules.

PromQL Expression | What It Returns | Use Case
rate(http_requests_total[5m]) | Per-second request rate averaged over 5 minutes | Throughput dashboard panel, capacity planning
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) | 99th percentile latency over 5 minutes | SLO monitoring, latency alerting
sum by (status)(rate(http_requests_total[5m])) | Request rate broken down by HTTP status code | Error rate calculation, success/failure ratio
increase(errors_total[1h]) | Total number of errors in the last hour | Error spike detection, incident triage
avg_over_time(node_cpu_utilization[30m]) | Average CPU utilization over 30 minutes | Capacity planning, rightsizing alerts

The rate() vs increase() Decision

rate() gives you per-second averages — ideal for dashboards and comparing throughput across services. increase() gives you the total count over a window — ideal for "how many errors happened in the last hour?" queries. Both handle counter resets correctly. Use rate() for continuous monitoring and increase() for incident investigation.

Averaging Percentiles Is Statistically Invalid

A common and dangerous mistake is computing avg(p99_latency) across multiple instances. Percentiles are not aggregatable — the average of p99 values from 10 instances is not the system-wide p99. Use histogram_quantile() on aggregated histogram buckets instead. This is why Prometheus histograms exist: they let you compute accurate percentiles across any number of instances at query time.
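A tiny pure-Python demonstration of why this fails. The numbers are invented: one large fast instance plus one small slow instance make the averaged p99 wildly wrong:

```python
import math

def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of
    samples at or below it."""
    s = sorted(values)
    k = max(0, math.ceil(q / 100 * len(s)) - 1)
    return s[k]

instance_a = [10] * 1000    # big instance: all requests fast (10ms)
instance_b = [1000] * 10    # tiny instance: all requests slow (1000ms)

avg_of_p99s = (percentile(instance_a, 99) + percentile(instance_b, 99)) / 2
true_p99 = percentile(instance_a + instance_b, 99)

print(avg_of_p99s)  # 505.0 ms (suggests a system-wide latency disaster)
print(true_p99)     # 10 ms (the actual pooled p99: only 10 of 1010 requests were slow)
```

The average of per-instance percentiles weights each instance equally regardless of traffic, which is why you must aggregate histogram buckets first and compute the quantile afterwards.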

Logs: Structured Event Records

Logs are discrete event records emitted by applications and infrastructure. They are the most detailed pillar of observability — every log line captures a specific moment in time with full context. However, this richness comes at a cost: logs are expensive to store, difficult to query at scale, and only useful if they are structured and correlated.

Structured Logging Is Non-Negotiable

At Google and across all FAANG companies, unstructured logs (plain text lines like "Error processing request") are considered a bug, not a feature. Every log line must be a structured object (JSON, protobuf, or similar) with machine-parseable fields. This enables automated querying, aggregation, and correlation across services. If you cannot filter logs by field (e.g., request_id=abc AND level=error AND service=payments), your logging is broken.

The difference between structured and unstructured logging is the difference between a searchable database and a pile of text files. Unstructured logs require regex-based grep across gigabytes of text. Structured logs can be indexed, filtered, and aggregated using the same kinds of queries you use on a database.

  • Format — Unstructured: 2024-01-15 10:23:45 ERROR Payment failed for user 12345. Structured: {"timestamp":"2024-01-15T10:23:45Z","level":"error","msg":"Payment failed","user_id":"12345","service":"payments","trace_id":"abc123"}
  • Querying — Unstructured: grep "Payment failed" | grep "12345". Structured: WHERE user_id="12345" AND level="error"
  • Aggregation — Unstructured: impossible without custom parsing. Structured: COUNT(*) GROUP BY service WHERE level="error"
  • Correlation — Unstructured: manual text matching across files. Structured: JOIN on trace_id or request_id across all services
  • Machine parsing — Unstructured: fragile regex, breaks on format changes. Structured: reliable field extraction, schema-aware
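As a sketch, structured logging needs nothing beyond the standard library. This hypothetical formatter emits one JSON object per line (the service name and field names are illustrative, chosen to match the example above):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "payments",  # assumed service name for this sketch
            "msg": record.getMessage(),
        }
        # Contextual fields are passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment failed",
             extra={"fields": {"user_id": "12345", "trace_id": "abc123"}})
```

Every line this logger emits can now be filtered by field in Loki or Elasticsearch instead of being grepped as text.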

Log levels provide a severity hierarchy that helps you filter signal from noise. Using them consistently across all services is critical for effective log management.

Log Levels (Syslog Severity)

  • DEBUG — Detailed diagnostic information. Disabled in production by default. Enable per-service or per-request (via header) when actively debugging. High volume — will overwhelm log storage if left on.
  • INFO — Significant business events: user login, order placed, payment processed. Should be informative but not noisy. If you cannot explain why someone would search for this log line, remove it.
  • WARN — Unexpected conditions that the system handled gracefully: retry succeeded after first attempt failed, cache miss forcing a database query, approaching a resource limit. Not urgent but warrants investigation if the rate increases.
  • ERROR — An operation failed and could not be recovered automatically: payment declined, database query timeout, external API returned 500. Every ERROR log should be actionable — if you would not investigate it, it should be WARN.
  • FATAL/CRITICAL — The process cannot continue and is shutting down: out of memory, cannot connect to database on startup, corrupted configuration. These should trigger immediate alerts and are extremely rare in well-built systems.

Correlation IDs: The Thread That Connects Everything

A correlation ID (also called trace ID or request ID) is a unique identifier assigned at the entry point of a request and propagated through every service that handles it. When a user reports "my checkout failed," you search for their request's correlation ID and instantly see every log line, metric, and trace span across every service that touched that request. Without correlation IDs, debugging a distributed system is like searching for a needle in a haystack blindfolded. Implement them from day one.
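Within a single Python process, for example, contextvars makes propagation nearly free; cross-service propagation then only requires forwarding the ID as a header. The function and field names here are illustrative:

```python
import contextvars
import uuid

# One ContextVar holds the current request's correlation ID. Each async
# task or copied context sees its own value, so concurrent requests
# cannot clobber each other.
request_id = contextvars.ContextVar("request_id", default=None)

def handle_checkout():
    """Entry point: assign the ID once, then call down the stack."""
    request_id.set(uuid.uuid4().hex)
    return charge_card()

def charge_card():
    # Deep in the call stack, no parameter threading is needed: any log
    # line or outgoing request header can attach the same ID.
    return {"event": "charge", "request_id": request_id.get()}

event = handle_checkout()
print(event["request_id"])  # the 32-char hex ID assigned at the entry point
```

Searching your log store for that one hex string then returns every line, from every service, that touched the request.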

Log aggregation systems collect logs from all services into a centralized store where they can be searched, filtered, and analyzed. The two dominant approaches in the industry are the ELK stack (Elasticsearch, Logstash, Kibana) and the Grafana Loki stack.

FeatureELK Stack (Elasticsearch)Grafana Loki
Indexing approachFull-text index on log content — fast arbitrary searchIndex only on labels (like Prometheus) — cheaper storage, filter then grep
Storage costHigh — Elasticsearch indexes everything, requires significant disk and memoryLow — stores compressed log chunks, only indexes metadata labels
Query speedFast for complex queries across any fieldFast for label-based filtering, slower for full-text search within results
Operational complexityHigh — Elasticsearch cluster management is a job in itself (JVM tuning, shard management, rolling upgrades)Low — designed for simplicity, integrates natively with Grafana
Best forOrganizations that need powerful full-text search and complex analyticsTeams already using Prometheus and Grafana that want cost-effective log aggregation

The Log Volume Trap

At FAANG scale, a single service handling 100K requests per second that logs one line per request generates 8.6 billion log lines per day. At 500 bytes per line, that is 4 TB per day for one service. Multiply by hundreds of services and you quickly reach petabytes. Log sampling, dynamic log levels (increase verbosity only during incidents), and aggressive retention policies (7 days hot, 30 days warm, 90 days cold) are essential cost controls.
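The arithmetic behind those numbers is worth internalizing; a quick back-of-the-envelope check:

```python
rps = 100_000          # requests per second, one service
bytes_per_line = 500   # average structured log line size
seconds_per_day = 86_400

lines_per_day = rps * seconds_per_day
tb_per_day = lines_per_day * bytes_per_line / 1e12

print(f"{lines_per_day:,}")  # 8,640,000,000 lines (about 8.6 billion)
print(f"{tb_per_day} TB")    # 4.32 TB per day, for a single service
```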

Structured Logging Best Practices

  1. Always include: timestamp (ISO 8601), level, service name, trace_id/request_id, and a human-readable message
  2. Add contextual fields relevant to the operation: user_id, order_id, endpoint, duration_ms, status_code
  3. Never log sensitive data: passwords, API keys, credit card numbers, PII without redaction
  4. Use consistent field names across all services: trace_id, not traceId or TraceID or request_trace_id
  5. Set log levels correctly: if you would not investigate a log line at 3 AM, it is not an ERROR
  6. Implement dynamic log levels: allow per-service or per-request DEBUG logging via feature flags or headers without redeploying

Distributed Traces: Following Requests Across Services

Distributed tracing is the third pillar of observability and arguably the most transformative for microservice architectures. A trace represents the complete journey of a single request as it propagates through your distributed system. It answers the question that metrics and logs alone cannot: "What happened to this specific request, across every service it touched, and how long did each step take?"

A trace is composed of spans. Each span represents a unit of work performed by a single service — an HTTP handler, a database query, a gRPC call to another service. Spans are organized in a tree structure: a root span (the initial request) has child spans (downstream calls), which can have their own child spans, creating a full picture of the request's execution path.

Anatomy of a Span

  • Trace ID — Globally unique identifier shared by all spans in the trace. This is what connects spans across services. Typically a 128-bit value rendered as a 32-character hex string like a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6.
  • Span ID — Unique identifier for this specific span within the trace. Used to establish parent-child relationships between spans.
  • Parent Span ID — The span ID of the caller. The root span has no parent. This field builds the tree structure of the trace.
  • Operation name — Human-readable name describing the work: "GET /api/orders", "SELECT orders WHERE user_id=?", "gRPC payments.Charge".
  • Start time and duration — When the span started and how long it took. This is what makes the waterfall visualization possible — you can see exactly where time was spent.
  • Tags/attributes — Key-value metadata: http.method=GET, http.status_code=200, db.type=postgresql, error=true. Used for filtering and searching traces.
  • Events/logs — Timestamped annotations within the span: "cache miss at 10ms", "retry attempt 2 at 150ms". Provide fine-grained context without separate log lines.
sequenceDiagram
    participant U as User Browser
    participant GW as API Gateway
    participant OS as Order Service
    participant PS as Payment Service
    participant DB as Database
    participant Q as Message Queue

    U->>GW: POST /checkout (trace_id: abc123)
    Note over GW: span: api-gateway (12ms)
    GW->>OS: gRPC CreateOrder
    Note over OS: span: order-service (340ms)
    OS->>DB: INSERT order
    Note over DB: span: db-insert (45ms)
    DB-->>OS: OK
    OS->>PS: gRPC ChargeCard
    Note over PS: span: payment-service (280ms)
    PS->>DB: SELECT payment_method
    Note over DB: span: db-select (15ms)
    DB-->>PS: result
    PS-->>OS: charged
    OS->>Q: Publish OrderCreated
    Note over Q: span: queue-publish (5ms)
    Q-->>OS: ack
    OS-->>GW: 201 Created
    GW-->>U: 201 Created

A distributed trace showing a checkout request flowing through API Gateway, Order Service, Payment Service, Database, and Message Queue. Each service creates a span with timing information.
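The same trace can be represented as plain data. In this sketch the span IDs (s1 through s6) are invented and the field names follow the anatomy described above; the durations come from the diagram:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int

# The checkout trace: every span shares trace_id abc123, and the
# parent_span_id fields rebuild the call tree.
spans = [
    Span("abc123", "s1", None, "api-gateway", 12),
    Span("abc123", "s2", "s1", "order-service", 340),
    Span("abc123", "s3", "s2", "db-insert", 45),
    Span("abc123", "s4", "s2", "payment-service", 280),
    Span("abc123", "s5", "s4", "db-select", 15),
    Span("abc123", "s6", "s2", "queue-publish", 5),
]

def children(parent_id: str) -> List[Span]:
    return [s for s in spans if s.parent_span_id == parent_id]

def slowest_child(parent_id: str) -> Span:
    """Walk one level down the tree to see where the time went."""
    return max(children(parent_id), key=lambda s: s.duration_ms)

print(slowest_child("s2").name)  # payment-service (280 of order-service's 340ms)
```

This parent-pointer structure is all a tracing backend needs to render the waterfall view: sort children by start time, indent by depth, and the bottleneck is visually obvious.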

The W3C Trace Context standard (traceparent and tracestate HTTP headers) provides a vendor-neutral way to propagate trace context across service boundaries. The traceparent header carries the trace ID, parent span ID, and trace flags. This standardization means you can use different tracing backends for different services and still get connected traces.
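A minimal sketch of parsing the traceparent header. The example value is the one commonly quoted from the W3C spec; real implementations also reject all-zero trace and span IDs, which this sketch skips:

```python
import re

# traceparent layout: 2-hex version, 32-hex trace-id, 16-hex parent
# span id, 2-hex trace flags ("01" = sampled).
TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Return the four fields as a dict, or None if malformed."""
    m = TRACEPARENT.match(header)
    return m.groupdict() if m else None

ctx = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
print(ctx["flags"])     # 01 (this trace is sampled)
```

A service continues the trace by reusing the incoming trace_id, generating a fresh span ID for its own work, and sending that span ID as the parent_id on outgoing calls.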

OpenTelemetry: The Industry Standard

OpenTelemetry (OTel) has become the industry standard for instrumentation. It provides vendor-neutral APIs, SDKs, and the OpenTelemetry Collector for collecting and exporting metrics, logs, and traces. Major cloud providers (AWS, GCP, Azure) and observability vendors (Datadog, New Relic, Honeycomb, Grafana) all support OpenTelemetry natively. If you are starting a new project today, instrument with OpenTelemetry — it avoids vendor lock-in and gives you flexibility to switch backends without changing application code.

Tracing Backend | Architecture | Best For
Jaeger | Open-source, CNCF graduated, Elasticsearch or Cassandra storage | Teams that want full control, self-hosted environments, strong Kubernetes integration
Zipkin | Open-source, lightweight, originated at Twitter | Simpler deployments, Java/Spring ecosystem, lower operational overhead than Jaeger
Tempo (Grafana) | Open-source, object storage backend (S3/GCS), no indexing | Cost-effective at scale, teams already using Grafana, pairs with Loki for logs
Google Cloud Trace | Managed service, tight GCP integration | GCP-native workloads, automatic instrumentation for GCP services
AWS X-Ray | Managed service, tight AWS integration | AWS-native workloads, Lambda and ECS auto-instrumentation

The Most Common Tracing Failure

The number one reason distributed tracing fails in organizations is broken context propagation. If even one service in the request path does not propagate the trace context headers (traceparent), the trace is broken into disconnected fragments. Before investing in tracing infrastructure, ensure every service — including proxies, message queues, and batch jobs — correctly propagates trace context. A partial trace is often worse than no trace because it gives you a false picture of the request path.

Auto-Instrumentation Saves Months of Work

OpenTelemetry SDKs provide auto-instrumentation for common frameworks and libraries (Express, Spring Boot, Django, gRPC, database drivers). Auto-instrumentation wraps library calls to automatically create spans, propagate context, and record timing. Start with auto-instrumentation to get basic tracing working across all services in days, then add custom spans for business-critical operations that need more detail.

The Observability Pipeline

An observability pipeline is the data infrastructure that collects telemetry from applications, processes and enriches it, routes it to the appropriate storage backends, and makes it queryable through dashboards and alerting systems. At scale, the pipeline itself becomes a critical production system that must be reliable, scalable, and cost-efficient.

graph LR
    subgraph "Applications"
      A1[Service A] -->|metrics, logs, traces| C
      A2[Service B] -->|metrics, logs, traces| C
      A3[Service C] -->|metrics, logs, traces| C
      A4[Infrastructure] -->|node metrics, syslogs| C
    end
    subgraph "Collection Layer"
      C[OTel Collector] -->|filter, enrich, batch| R
    end
    subgraph "Routing & Processing"
      R[Router / Fan-out] -->|metrics| M
      R -->|logs| L
      R -->|traces| T
    end
    subgraph "Storage Backends"
      M[Prometheus / Mimir]
      L[Loki / Elasticsearch]
      T[Tempo / Jaeger]
    end
    subgraph "Query & Visualization"
      M --> G[Grafana Dashboards]
      L --> G
      T --> G
      M --> AL[Alertmanager]
      AL --> PD[PagerDuty / Slack]
    end

End-to-end observability pipeline: applications emit telemetry to collectors, which process and route data to specialized storage backends, queryable through dashboards and alerting.

The OpenTelemetry Collector is the central component of a modern observability pipeline. It is a vendor-neutral data pipeline that receives telemetry data in multiple formats, processes it (filtering, enriching, batching, sampling), and exports it to one or more backends. Running the Collector as a sidecar or as a gateway deployment decouples your applications from the specifics of your observability backend.

OpenTelemetry Collector Components

  • Receivers — Accept telemetry data in various formats: OTLP (native OTel), Prometheus scrape, Jaeger Thrift, Zipkin, StatsD, Fluentd. A single Collector can receive data from heterogeneous sources.
  • Processors — Transform data in the pipeline: batch (group data for efficient export), filter (drop unwanted metrics or spans), attributes (add/modify/delete attributes), tail_sampling (sample traces based on latency or error status), memory_limiter (prevent OOM).
  • Exporters — Send processed data to backends: OTLP (to another Collector or compatible backend), Prometheus remote_write, Loki, Jaeger, Datadog, New Relic, Cloud provider backends. Multiple exporters enable fan-out to different systems.
  • Connectors — Bridge between pipelines within a single Collector: e.g., extract metrics from spans (span-to-metrics connector) to generate RED metrics automatically from trace data.
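These components are wired together in the Collector's YAML configuration. The sketch below shows the overall shape only: the endpoints are placeholders, and exporter names vary across Collector versions and distributions, so check your distribution's documentation before copying:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:          # protect the Collector itself from OOM
    check_interval: 1s
    limit_mib: 512
  batch:                   # group telemetry for efficient export

exporters:
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push   # placeholder
  otlp/tempo:
    endpoint: tempo:4317                      # placeholder
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
```

Each pipeline is an independent receivers → processors → exporters chain, which is what makes the Collector a single reconfiguration point for routing telemetry.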

Gateway vs Sidecar Deployment

Run the OTel Collector in two tiers: (1) As a sidecar or DaemonSet agent on each node that collects local telemetry with minimal processing. (2) As a centralized gateway cluster that performs heavier processing (sampling, enrichment, fan-out). This architecture distributes collection load while centralizing expensive operations, and it provides a single point to reconfigure routing without touching application deployments.

One of the most important architectural decisions in observability is whether to use pull-based (scrape) or push-based collection for metrics. Prometheus pioneered the pull model; StatsD and OpenTelemetry use push.

Aspect | Pull / Scrape (Prometheus) | Push (OTel, StatsD)
How it works | Prometheus server periodically scrapes HTTP endpoints (/metrics) on each target | Applications push telemetry to a collector or backend on each event or at regular intervals
Service discovery | Prometheus must know where to scrape — uses Kubernetes SD, Consul, file-based configs | Applications push to a known endpoint — collector does not need to discover targets
Failure detection | If a scrape fails, Prometheus knows the target is down — built-in up/down detection | If a service stops pushing, silence is ambiguous — is it down or just not emitting?
Short-lived jobs | Problematic — job may finish before the next scrape interval. Use Pushgateway as workaround | Natural fit — push metrics before the job exits
Network topology | Requires Prometheus to reach targets (problematic across firewalls, NATs) | Requires targets to reach the collector (usually easier in practice)
Dominant use | Prometheus ecosystem, Kubernetes-native monitoring | OpenTelemetry, serverless, edge computing, cross-network scenarios

The Best of Both Worlds

Many production setups use both models: Prometheus scrapes infrastructure and application metrics from long-running services, while the OTel Collector receives push-based traces and logs. The OTel Collector can also scrape Prometheus endpoints, acting as a unified collection layer. Do not force one model everywhere — choose based on the telemetry type and the service characteristics.
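
To make the pull model concrete, here is a minimal stdlib Python sketch of the text exposition format a /metrics endpoint serves. The metric name and labels mirror the PromQL examples used later in this section; the counter values are invented.

```python
# Minimal sketch of what a scrape target serves at /metrics: the Prometheus
# text exposition format. The server pulls this on its own schedule; the
# application only renders its current counter values. Values are invented.

def render_counter(name: str, help_text: str,
                   samples: list[tuple[dict, int]]) -> str:
    """Render one counter metric, with labels, in text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

exposition = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 1027),
     ({"method": "POST", "status": "500"}, 3)],
)
```

In practice a client library (prometheus_client, the OTel SDK) handles the registry, rendering, and HTTP serving; the point is that the target holds the state and the server decides when to read it.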

Building a Production Observability Pipeline

1. Instrument applications with OpenTelemetry SDKs (auto-instrumentation first, then custom spans for critical paths)
2. Deploy the OTel Collector as a DaemonSet on every node for local collection and basic processing
3. Deploy a centralized OTel Collector gateway for sampling, enrichment, and fan-out to multiple backends
4. Set up Prometheus (or Mimir for multi-tenant) for metrics storage with appropriate retention (15 days hot, 1 year long-term)
5. Set up Loki (or Elasticsearch) for log aggregation with structured log parsing and retention policies
6. Set up Tempo (or Jaeger) for trace storage with tail-based sampling to control costs
7. Configure Grafana dashboards for each service with RED and USE method panels
8. Configure Alertmanager with SLO-based alerts routed to PagerDuty/Slack by service ownership


Dashboard Design: Building Effective Visualizations

A dashboard is only useful if it helps someone make a decision. The most common failure in dashboard design is creating "metric museums" — screens full of graphs that look impressive but do not help anyone debug an incident or understand system health. Every panel on a dashboard should answer a specific question, and the dashboard as a whole should tell a coherent story about the health of a service.

The RED Method for Request-Driven Services

The RED method (coined by Tom Wilkie of Grafana Labs) defines three golden metrics for every request-driven service: Rate (requests per second), Errors (failed requests per second or error rate percentage), and Duration (latency distribution, especially p50, p95, p99). A single RED dashboard answers the critical questions: "Is the service handling traffic? Is it failing? Is it slow?" Every microservice should have a RED dashboard as its primary health view.

| RED Metric | PromQL Example | Dashboard Panel Type | Alert Threshold Example |
|---|---|---|---|
| Rate | sum(rate(http_requests_total[5m])) | Time series line graph | Alert if rate drops >50% in 5 min (traffic loss) |
| Errors | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) | Time series + single stat (current error %) | Alert if error rate >1% for 5 min |
| Duration (p50) | histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) | Time series line graph | Context line — not alerted on directly |
| Duration (p99) | histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) | Time series line graph | Alert if p99 >500ms for 5 min |
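
The histogram_quantile calls above estimate percentiles from cumulative bucket counts. A stdlib Python sketch of the same interpolation logic follows; the bucket bounds and counts are illustrative, and real Prometheus histograms also carry a +Inf bucket.

```python
# Sketch of the interpolation behind PromQL's histogram_quantile: find the
# first cumulative bucket that contains the target rank, then interpolate
# linearly between that bucket's lower and upper bound. Bucket bounds and
# counts below are illustrative, not from any real service.

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (le_upper_bound, cumulative_count), sorted by bound."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the containing bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Cumulative request counts for le bounds of 0.1s, 0.25s, 0.5s, 1.0s.
buckets = [(0.1, 800.0), (0.25, 950.0), (0.5, 990.0), (1.0, 1000.0)]
p50 = histogram_quantile(0.50, buckets)  # 0.0625s: most requests are fast
p99 = histogram_quantile(0.99, buckets)  # 0.5s: the tail is much slower
```

This is also why bucket layout matters: the estimate can only be as precise as the bucket boundaries near the percentile you care about.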

The USE Method for Infrastructure Resources

While RED works for application services, the USE method (by Brendan Gregg) works for infrastructure resources like CPU, memory, disk, and network. USE stands for: Utilization (percent of resource busy — CPU at 85%), Saturation (degree of queuing — 15 processes waiting for CPU), Errors (error count — disk I/O errors). A USE dashboard for each node or instance helps distinguish "the application is slow" from "the infrastructure is overloaded."

Google's Four Golden Signals (from the SRE book) provide another framework that overlaps with RED but adds saturation: Latency, Traffic, Errors, and Saturation. Use RED for application services, USE for infrastructure, and the Four Golden Signals as a unifying framework that covers both.

Dashboard Layout Principles

  • Top row: the answer — Place the most critical metrics (error rate, p99 latency, SLO burn rate) in the top row as single-stat panels with color thresholds (green/yellow/red). An on-call engineer opening this dashboard at 3 AM should know within 2 seconds if the service is healthy.
  • Second row: the context — Show traffic rate, error rate over time, and latency percentiles as time-series graphs. This row answers "when did the problem start?" and "is it getting worse?"
  • Third row: the breakdown — Break down metrics by endpoint, status code, downstream dependency, or region. This answers "where is the problem?" — is it all endpoints or just /api/checkout? All regions or just us-east-1?
  • Bottom row: infrastructure — CPU, memory, disk I/O, network throughput for the service. This answers "is the problem caused by resource exhaustion?"
  • Annotations — Overlay deployment markers, incident start/end times, and config changes on time-series graphs. The most common root cause is "what changed?" — annotations make this immediately visible.

Dashboard Anti-Patterns

The Metric Museum: 50 panels showing every metric available, with no clear narrative. The Wall of Green: everything shows "OK" because thresholds are too loose, giving false confidence. The Average Dashboard: showing only averages hides tail latency — p99 can be 10x the average. The Stale Dashboard: built for the service as it existed 6 months ago, now showing metrics for endpoints that no longer exist. Every dashboard should have an owner and a quarterly review cycle.
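
The "Average Dashboard" trap is easy to demonstrate with invented numbers: a small slow tail barely moves the mean while completely dominating p99.

```python
# Invented latency sample: 98% of requests at 50 ms, 2% stuck at 2000 ms.
# The mean looks acceptable while p99 reveals the broken tail.

latencies_ms = [50.0] * 980 + [2000.0] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = sorted(latencies_ms)[int(0.99 * len(latencies_ms)) - 1]
# mean_ms is 89.0 ms, but p99_ms is 2000.0 ms -- more than 22x the mean
```

A dashboard showing only the 89 ms average would look healthy while 2% of users wait two full seconds.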

Grafana Best Practices

1. Use dashboard variables ($service, $environment, $region) so one template works for all services
2. Set meaningful Y-axis labels and units (requests/sec, milliseconds, bytes) — never leave "undefined"
3. Use consistent color schemes: green for healthy, yellow for warning, red for critical across all dashboards
4. Add links between dashboards: service overview links to per-endpoint detail, links to relevant logs and traces
5. Use dashboard folders organized by team or domain, with consistent naming: "Team - Service - Detail Level"
6. Set appropriate time ranges as defaults: 6 hours for operational dashboards, 7 days for capacity planning


The 3-Click Rule for Incident Response

An on-call engineer should be able to go from a PagerDuty alert to the root cause in three clicks or fewer: (1) Click the alert link to open the service dashboard. (2) Click a spike on the error graph to see the affected time range and endpoint. (3) Click through to the logs or traces filtered by that time range and endpoint. If your dashboards require more than three clicks, reorganize them with better linking and pre-filtered views.

Alerting: From Signals to Action

Alerting is where observability meets human action. The best metrics, logs, and traces in the world are useless if alerts fire for the wrong reasons, go to the wrong people, or generate so much noise that on-call engineers learn to ignore them. Alert fatigue is the number one operational failure mode at scale — it is more dangerous than missing alerts because it trains people to dismiss real problems.

Symptom-Based Alerting, Not Cause-Based

Alert on symptoms that users experience, not on internal causes that may or may not matter. Bad alert: "CPU usage > 80%." Why bad? CPU at 85% might be perfectly fine — the service is handling traffic within SLO. Good alert: "Error rate > 0.5% for 5 minutes" or "p99 latency > 500ms for 10 minutes." These alert on what the user actually experiences. CPU at 85% becomes a capacity planning concern, not a paging alert. The rule is simple: if this condition does not impact users, it should not page anyone.

SLO-based alerting takes symptom-based alerting to the next level by connecting alerts directly to your Service Level Objectives. Instead of setting arbitrary thresholds, you alert when you are consuming your error budget faster than sustainable.

Multi-Window Burn Rate Alerting

  • Concept — An error budget is the allowed amount of unreliability (e.g., 99.9% SLO = 0.1% error budget = ~43 minutes of downtime per 30-day window). Burn rate measures how fast you are consuming this budget. A burn rate of 1 means you will exactly exhaust the budget by end of month. A burn rate of 10 means you will exhaust it in 3 days.
  • Fast burn alert (pages) — Window: 5 minutes and 1 hour. Threshold: burn rate > 14.4x (exhausts the monthly budget in ~2 days). This catches severe incidents — total outage, massive error spike. Pages the on-call engineer immediately.
  • Slow burn alert (tickets) — Window: 6 hours and 3 days. Threshold: burn rate > 1x (on track to exhaust monthly budget). This catches chronic degradation — a slow memory leak, gradually increasing latency. Creates a ticket for the team to investigate during business hours.
  • Why multi-window? — A single window creates either noisy alerts (short window catches every blip) or slow alerts (long window misses fast incidents). Two windows per alert combine sensitivity with specificity: both must exceed the threshold to fire, reducing false positives while catching real problems quickly.
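
The multi-window logic above can be sketched in a few lines of Python. The 14.4x threshold and window pairing follow the fast-burn example; the request counts are invented.

```python
# Sketch of multi-window burn-rate alerting for a 99.9% SLO. Burn rate is
# the observed error rate divided by the error budget (1 - SLO target).
# Both the short and the long window must exceed the threshold to page,
# which filters out brief blips. All traffic numbers are invented.

SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(errors: int, requests: int) -> float:
    """Error rate over a window, as a multiple of the error budget."""
    if requests == 0:
        return 0.0
    return (errors / requests) / ERROR_BUDGET

def should_page(short_window: tuple[int, int], long_window: tuple[int, int],
                threshold: float = 14.4) -> bool:
    """Fast-burn page: both windows must exceed the threshold."""
    return (burn_rate(*short_window) > threshold
            and burn_rate(*long_window) > threshold)

# A sustained outage: 2% errors in both windows -> burn rate ~20x -> page.
page = should_page(short_window=(200, 10_000), long_window=(2_400, 120_000))
# A brief blip: the 5-minute window is hot but the 1-hour window is calm.
blip = should_page(short_window=(200, 10_000), long_window=(150, 120_000))
```

Note how the blip fails the long-window check: sensitivity comes from the short window, specificity from the long one.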

The Alert Fatigue Death Spiral

When alerts fire too frequently for non-actionable reasons, on-call engineers start ignoring them. When they ignore alerts, they miss real incidents. When they miss real incidents, teams add more alerts to "catch everything." More alerts create more noise, which leads to more ignoring. This is the death spiral. The fix is brutal but necessary: delete every alert that has not resulted in a human taking meaningful action in the last 30 days. If it did not page to a real problem, it should not page at all.

| Alert Type | Window | Burn Rate Threshold | Action | Routing |
|---|---|---|---|---|
| Page (critical) | 5 min AND 1 hour | 14.4x (exhausts budget in 2 days) | Wake up on-call engineer | PagerDuty high-urgency |
| Page (urgent) | 30 min AND 6 hours | 6x (exhausts budget in 5 days) | Notify on-call during working hours | PagerDuty low-urgency |
| Ticket (slow burn) | 6 hours AND 3 days | 1x (on track to exhaust budget) | Create JIRA ticket for team | Slack channel + ticket system |
| Info (awareness) | 24 hours | 0.5x (consuming budget slowly) | Weekly report | Email digest or dashboard only |

Alert routing ensures the right person receives the right alert at the right time. This requires a clear service ownership model and integration between your alerting system and your incident management platform.

Alert Routing Best Practices

1. Define service ownership: every alert must route to a specific team, never to a shared "ops" channel
2. Use severity levels that map to response expectations: P1 = respond in 5 minutes, P2 = 30 minutes, P3 = next business day
3. Configure escalation paths: if a P1 is not acknowledged in 5 minutes, escalate to the secondary, then the engineering manager
4. Route through PagerDuty or Opsgenie for on-call management — never alert directly to Slack for critical issues (messages get buried)
5. Include runbook links in every alert: the alert message should contain a direct URL to the troubleshooting guide for that specific alert
6. Implement alert grouping and deduplication: 100 instances of the same alert should create 1 incident, not 100 pages


The Runbook Test

For every alert, ask: "Does a runbook exist for this alert? Could a junior engineer follow the runbook at 3 AM and resolve the issue?" If the answer is no, the alert needs either a runbook or a redesign. Alerts without runbooks create panic. Alerts with runbooks create confidence. At Google, every production alert is required to have an associated entry in the team's playbook before it can be deployed.

Alerting on Every HTTP 500

A single 500 error is not an incident — it might be a client sending a malformed request, a transient network blip, or an expected retry. Alerting on individual errors creates massive noise. Instead, alert on the error rate exceeding a threshold over a time window. "5xx rate > 1% for 5 minutes" is actionable. "One 500 error occurred" is noise. The goal is to detect degradation patterns, not individual failures.
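
A sliding-window sketch of that rate-over-window idea, in pure Python with invented traffic: a lone 500 never fires, while a sustained burst does.

```python
# Sketch of windowed error-rate detection: individual 500s are ignored;
# the alert condition is the 5xx fraction over a sliding window exceeding
# a threshold. Timestamps are in seconds; all traffic below is invented.

from collections import deque

class ErrorRateWindow:
    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.01):
        self.window = window_seconds
        self.threshold = threshold
        self.events: deque[tuple[float, bool]] = deque()  # (ts, is_error)

    def record(self, ts: float, status: int) -> None:
        self.events.append((ts, status >= 500))
        # Evict observations that fell out of the sliding window.
        while self.events and self.events[0][0] < ts - self.window:
            self.events.popleft()

    def firing(self) -> bool:
        if not self.events:
            return False
        errors = sum(1 for _, is_err in self.events if is_err)
        return errors / len(self.events) > self.threshold

w = ErrorRateWindow()
for i in range(990):
    w.record(i * 0.1, 200)          # 99 seconds of healthy traffic
w.record(99.0, 500)                 # one lone 500: rate ~0.1%
lone_error = w.firing()             # below the 1% threshold

for i in range(30):
    w.record(100.0 + i * 0.1, 500)  # a burst pushes the rate above 1%
burst = w.firing()
```

A production system would do this in the metrics backend (the PromQL in the RED table is the equivalent expression), but the detection logic is the same: a ratio over a window, not a count of individual failures.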

Debugging with Observability: The Investigation Workflow

Observability tools are only valuable if you know how to use them during an incident. The investigation workflow is the structured process that transforms "something is broken" into "here is the root cause and here is the fix." This workflow is what separates a 10-minute incident resolution from a 4-hour guessing session.

graph TD
    A["PagerDuty Alert Fires"] --> B["Open Service Dashboard"]
    B --> C{"Error rate spike or latency spike?"}
    C -->|Error spike| D["Filter by status code: 5xx vs 4xx"]
    C -->|Latency spike| E["Check p50 vs p99: all requests or tail?"]
    D --> F["Search logs: level=error, time range, service"]
    E --> F
    F --> G["Extract trace_id from error/slow log line"]
    G --> H["Open trace: identify slowest span"]
    H --> I{"Slow span is downstream service?"}
    I -->|Yes| J["Open downstream service dashboard"]
    I -->|No| K["Check resource metrics: CPU, memory, disk I/O"]
    J --> L["Repeat from step B for downstream service"]
    K --> M["Root cause identified: apply fix or mitigation"]
    L --> M

The investigation workflow: alert fires, dashboard narrows the service, logs identify the request, traces pinpoint the bottleneck, and resource metrics reveal the root cause.

Let us walk through a real debugging flow. You are on-call for the checkout service at 2 AM. PagerDuty fires: "checkout-service error rate > 2% for 5 minutes."

Step-by-Step Debugging Flow

1. Open the checkout-service RED dashboard (linked from the PagerDuty alert). You see the error rate spiked from 0.1% to 3.5% starting at 1:47 AM. Traffic rate is normal — this is not a traffic spike causing overload.
2. Drill into the error breakdown panel. All errors are HTTP 504 Gateway Timeout. The checkout service is timing out on a downstream call.
3. Check the deployment annotations. No deployments in the last 6 hours. This is not a bad deploy — something else changed.
4. Open logs filtered by level=error, service=checkout, time range 1:47-1:55 AM. Log lines show: "context deadline exceeded calling payments-service, trace_id=def456, duration=30012ms". The payments service is taking 30 seconds to respond.
5. Open the trace for trace_id=def456. The waterfall shows: checkout-service (30.1s total) → payments-service gRPC call (30.0s). Inside payments-service: database query "SELECT * FROM payment_methods WHERE user_id=?" took 29.8 seconds.
6. Open the payments-service dashboard. CPU and memory are fine. Open the database dashboard. The "slow queries" panel shows a spike in queries over 10 seconds starting at 1:45 AM. The database connection pool is saturated.
7. Check the database metrics. A long-running analytical query (started at 1:44 AM by a batch job) is holding table locks. The batch job runs nightly but usually finishes in seconds — tonight it is processing 10x more data due to a marketing campaign.
8. Resolution: Kill the long-running query immediately. Create a ticket to move the batch job to a read replica. Add a query timeout of 5 seconds to the payments service database client. Total time to resolution: 12 minutes.


The Five Whys During Incidents

Do not stop at the first answer. Why did the checkout service error? Because payments timed out. Why did payments time out? Because the database was slow. Why was the database slow? Because a batch query held locks. Why did the batch query hold locks? Because it ran on the primary instead of a replica. Why did it run on the primary? Because the connection string was not updated after the replica was added 3 months ago. The root cause is not "database was slow" — it is "missing configuration for read replica routing."

Exemplars: Connecting Metrics to Traces

Prometheus exemplars attach a trace_id to specific metric observations. When you see a latency spike on a dashboard, you can click on a data point and jump directly to the trace for that specific slow request. This eliminates the manual step of searching logs for a trace_id. Grafana supports exemplar display natively — configure your histogram metrics to include exemplars and enable the exemplar layer on your time-series panels.

The Observability Blind Spot: Between Services

The hardest bugs to find are not inside services — they are between services. Network timeouts, DNS resolution failures, load balancer misconfigurations, and TLS handshake problems are invisible to application-level tracing. When traces show a gap (parent span ended but child span never started), investigate the network layer: check service mesh metrics (Istio, Linkerd), load balancer access logs, and DNS resolution times. Infrastructure-level observability is just as important as application-level.

The key insight about debugging with observability is that each tool narrows the search space. Metrics tell you when and what (error rate spike at 1:47 AM in checkout-service). Logs tell you which (specific request IDs, error messages). Traces tell you where (which service, which database query). Resource metrics tell you why (CPU exhaustion, disk full, connection pool saturated). The workflow always moves from broad to narrow: metrics → logs → traces → root cause.

Common Investigation Patterns

  • The slow dependency — p99 latency spike → trace shows one downstream service is slow → open downstream dashboard → repeat until you find the actual bottleneck (often a database query or external API).
  • The partial outage — Error rate up but only for some users → breakdown by region/endpoint/customer-tier reveals the affected segment → logs for affected segment show a specific error → trace shows the failing path.
  • The resource leak — Gradually increasing latency over hours/days → not correlated with traffic → resource metrics show memory growing or connection count climbing → the service has a leak. Restart mitigates; profiling reveals the root cause.
  • The configuration change — Sudden change in behavior with no deployment → check config management audit logs, feature flag changes, DNS changes, certificate rotations → overlay these events on the metrics timeline.

Observability at Scale: Cardinality, Sampling, and Cost

As systems grow from tens to thousands of services, observability itself becomes one of the most expensive and challenging infrastructure problems. The three pillars generate enormous volumes of data, and naive approaches to collection, storage, and querying break down. At Google scale, internal observability systems handle more telemetry than many companies' entire business data. Understanding cardinality, sampling, and cost optimization is essential for any team operating at scale.

The Cardinality Explosion Problem

Cardinality is the number of unique time series in your metrics system. Every unique combination of metric name + label values creates a new time series. A metric with labels {method, path, status, region, instance} where method has 5 values, path has 100 endpoints, status has 5 codes, region has 4, and instance has 50 creates 5 x 100 x 5 x 4 x 50 = 500,000 time series from one metric. Add a user_id label and you get billions. Prometheus will run out of memory and crash. This is the single most common way teams accidentally destroy their monitoring infrastructure.
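
The arithmetic in this paragraph is worth making explicit: each label multiplies the series count. The label-value counts below are the ones used in the example above; the user_id count is an assumed one million.

```python
# Worst-case time-series count for one metric: the product of the number
# of values each label can take. Label counts match the example in the
# text; the user_id cardinality of one million is an assumption.

from math import prod

def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on unique time series for a single metric name."""
    return prod(label_values.values())

base = {"method": 5, "path": 100, "status": 5, "region": 4, "instance": 50}
safe = series_count(base)                          # 500,000 series

exploded = series_count({**base, "user_id": 1_000_000})
# 500 billion series: no metrics backend survives this
```

One innocent-looking label turns a heavy-but-survivable 500,000 series into half a trillion, which is why label review belongs in code review.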

| Label | Cardinality | Safe? | Recommendation |
|---|---|---|---|
| method (GET, POST, etc.) | ~8 | Yes | Always include — low cardinality, high value |
| status_code (200, 404, 500) | ~15 | Yes | Always include — essential for error rate calculation |
| endpoint (/api/users, /api/orders) | 10-200 | Usually | Include if bounded; use route patterns not raw paths (/api/users/:id not /api/users/12345) |
| region (us-east, eu-west) | 3-10 | Yes | Always include — essential for regional debugging |
| instance/pod | 10-1000 | Careful | Include only on infrastructure metrics, not application metrics. Aggregate application metrics across instances. |
| user_id | Millions | Never | Never use as a metric label. Use traces or logs for per-user debugging. |
| request_id | Infinite | Never | Never use as a metric label. This is what traces are for. |

Trace sampling is the most important cost control mechanism for distributed tracing. At high traffic volumes, collecting every trace is prohibitively expensive and provides diminishing returns. The key is to sample strategically so you retain the traces that matter most.

Trace Sampling Strategies

  • Head-based sampling — The sampling decision is made at the start of the request (the head). Example: sample 10% of all requests randomly. Simple to implement but you miss rare interesting traces — 90% of errors, slow requests, and edge cases are discarded. Use as a baseline only.
  • Tail-based sampling — The sampling decision is made after the request completes (the tail), when you know the outcome. Keep 100% of error traces, 100% of traces above p99 latency, and 1% of normal traces. This ensures you always have traces for the requests that matter most. Requires buffering complete traces before deciding, which adds memory overhead to the collector.
  • Priority sampling — Mark certain requests as high-priority at the entry point: requests from VIP customers, requests that touch sensitive endpoints (checkout, payment), synthetic canary requests. These are always sampled at 100% regardless of overall sampling rate.
  • Adaptive sampling — Dynamically adjust sampling rate based on traffic volume. During low traffic (100 RPS), sample 100%. During peak (100K RPS), sample 0.1%. This provides complete visibility during quiet periods and cost control during spikes. OpenTelemetry Collector supports this via the tail_sampling processor.
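
A tail-based decision function can be sketched directly from the rules above. The latency threshold and baseline rate here are the illustrative values from the bullet (keep all errors, all slow traces, 1% of the rest); a real collector, such as the OTel tail_sampling processor, evaluates equivalent policies over buffered spans.

```python
# Sketch of a tail-based sampling decision: made after the trace completes,
# when the outcome is known. Keep all errors, all slow traces, and a small
# fraction of normal traffic. Thresholds and rates are illustrative.

import random

SLOW_THRESHOLD_MS = 500.0  # assumed "slow" cutoff (e.g., near p99 latency)
BASELINE_RATE = 0.01       # keep 1% of ordinary traces

def keep_trace(has_error: bool, duration_ms: float,
               rng: random.Random) -> bool:
    if has_error:
        return True                       # keep 100% of error traces
    if duration_ms > SLOW_THRESHOLD_MS:
        return True                       # keep 100% of slow traces
    return rng.random() < BASELINE_RATE   # sample ordinary traces at 1%

rng = random.Random(42)  # seeded for reproducibility
err_kept = keep_trace(True, 20.0, rng)
slow_kept = keep_trace(False, 900.0, rng)
# Of 10,000 ordinary fast traces, roughly 1% survive sampling.
kept = sum(keep_trace(False, 50.0, rng) for _ in range(10_000))
```

The cost of this selectivity is buffering: the collector must hold every span of a trace in memory until the trace ends and the decision can be made.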

The 1/1/1 Rule for Cost Control

A practical cost framework for observability at scale: keep metrics for 1 year (small, aggregated, cheap), keep logs for 1 month (large, detailed, expensive), keep traces for 1 week (very large, per-request, most expensive). Adjust based on compliance requirements, but this baseline prevents the common mistake of storing everything forever. Use tiered storage: hot (SSD, fast queries) for recent data, warm (HDD) for older data, cold (object storage like S3) for archival.

| Telemetry Type | Volume per Service | Storage Cost Driver | Primary Cost Control |
|---|---|---|---|
| Metrics | ~1,000-50,000 time series | Cardinality (unique label combinations) | Reduce labels, aggregate across instances, use recording rules for expensive queries |
| Logs | ~1-100 GB per day | Volume (bytes stored and indexed) | Sampling, dynamic log levels, structured logging with selective indexing, aggressive retention |
| Traces | ~10-500 GB per day | Volume (spans stored per request) | Tail-based sampling, span filtering (drop health checks), shorter retention |

Metric Aggregation and Recording Rules

1. Identify expensive PromQL queries that run frequently (dashboard panels queried every 30 seconds by many viewers)
2. Create recording rules that pre-compute these queries and store the result as a new time series: e.g., record: job:http_requests:rate5m, expr: sum(rate(http_requests_total[5m])) by (job)
3. Replace dashboard queries with the pre-computed recording rule metrics for faster load times and reduced Prometheus load
4. Use metric relabeling to drop unnecessary labels at scrape time — this prevents high-cardinality labels from ever entering storage
5. Implement metric federation for multi-cluster setups: each cluster runs local Prometheus, a global Prometheus scrapes aggregated recording rules from each cluster
6. Consider Mimir, Thanos, or Cortex for long-term storage: they use object storage (S3/GCS) for cost-effective retention beyond the 15-day local Prometheus default


Cost Monitoring for Observability Itself

Monitor the cost and performance of your observability infrastructure with the same rigor you monitor production services. Track: total time series in Prometheus (tsdb head series), log ingestion rate (bytes per second), trace storage growth (GB per day), query latency for dashboard panels. Set alerts when observability costs exceed budget thresholds. At Google, the observability infrastructure team has its own SLOs and error budgets. If monitoring is down, you are flying blind — treat it as a P1.

Storing Raw Metrics for User-Level Debugging

A common and expensive mistake is trying to use metrics for per-user or per-request debugging by adding high-cardinality labels (user_id, session_id, request_id). This creates millions of time series and crashes Prometheus. Metrics are for aggregate patterns ("5% of requests are failing"). Logs and traces are for individual request debugging ("why did user 12345's request fail?"). Use the right tool for the right job.

How this might come up in interviews

Observability is a core topic in SRE, platform engineering, and backend engineering interviews at all FAANG companies. Expect system design questions like "design the monitoring for a payment service" where you must specify metrics (RED method), dashboards, alerting strategy (SLO-based), log aggregation, and distributed tracing. Debugging scenarios are common: "here is a set of symptoms, walk me through how you would find the root cause." Senior-level interviews test your understanding of trade-offs: cardinality management, sampling strategies, cost optimization, and choosing between self-hosted and SaaS observability platforms.

Common questions:

  • How would you debug a request that is slow for only 2% of users while all dashboards show healthy metrics?
  • Design the observability stack for a new microservice handling payment processing. What metrics, logs, traces, dashboards, and alerts would you set up?
  • What is the difference between monitoring and observability? When is traditional monitoring sufficient, and when do you need full observability?
  • Explain the trade-offs between head-based and tail-based trace sampling. How would you implement sampling for a service handling 100K requests per second?
  • Your Prometheus is running out of memory. Walk me through how you would diagnose and fix the cardinality problem.
  • How do SLO-based alerts with multi-window burn rates work, and why are they better than simple threshold-based alerts?

Try this question: Ask about the team's current observability stack, alert-to-resolution time, on-call burden (pages per week), and whether they use SLO-based alerting. These questions reveal operational maturity.

Strong answers: Candidates who describe the debugging workflow (metrics → logs → traces → root cause). Candidates who discuss high-cardinality observability for subset debugging. Candidates who mention SLO-based alerting and explain burn rates. Candidates who proactively discuss cost and cardinality management at scale.

Red flags: Candidates who equate observability with "setting up Prometheus and Grafana" without discussing traces or structured logging. Candidates who cannot explain why averaging percentiles is wrong. Candidates who suggest alerting on CPU usage as a primary strategy.

Key takeaways

  • Observability is fundamentally different from monitoring: monitoring checks known failure conditions, while observability enables you to ask arbitrary questions about novel failures without deploying new code. At FAANG scale with thousands of services, this distinction is the difference between 10-minute and 4-hour incident resolution.
  • The three pillars work together in a diagnostic workflow: metrics tell you something is wrong and when it started, logs tell you what happened with specific error details, and traces show you exactly where the problem is across your distributed service graph. No single pillar is sufficient alone.
  • High-cardinality data is the key insight that separates observability from monitoring. The ability to slice telemetry by arbitrary dimensions (shard_id, customer_tier, feature_flag) in real time is what lets you debug problems that affect 2% of users while aggregate metrics show "everything green."
  • Alert on user-facing symptoms (error rate, latency) using SLO-based burn rates, not on internal causes (CPU usage) with arbitrary thresholds. Multi-window burn rate alerting catches both fast incidents and slow degradation while minimizing alert fatigue. Every alert must have a runbook.
  • Observability at scale requires deliberate cost management: bound metric cardinality by using label discipline, implement tail-based trace sampling to keep interesting traces while controlling volume, use tiered retention (hot/warm/cold), and treat the observability pipeline itself as a production system with its own SLOs.
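The multi-window burn-rate rule from the takeaways can be sketched in a few lines of Python. This is a minimal illustration, not a production alerting rule: it assumes a 99.9% availability SLO and the commonly cited 14.4x fast-burn threshold (at which a 30-day error budget is exhausted in roughly two days), and the function names are hypothetical.

```python
def burn_rate(observed_error_rate: float, slo: float = 0.999) -> float:
    """How many times faster than 'sustainable' the error budget is burning.
    1.0 means the budget runs out exactly at the end of the SLO window."""
    error_budget = 1.0 - slo
    return observed_error_rate / error_budget

def should_page(rate_1h: float, rate_5m: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Fast-burn page: BOTH the long and short window must exceed the
    threshold, so a spike that has already stopped does not page anyone."""
    return (burn_rate(rate_1h, slo) > threshold and
            burn_rate(rate_5m, slo) > threshold)

# 2% errors sustained across both windows: burn rate 20x, page immediately.
print(should_page(0.02, 0.02))    # True
# The 5-minute window has recovered, so the incident is over: no page.
print(should_page(0.02, 0.0005))  # False
```

The two-window AND condition is what suppresses alert fatigue: the long window proves the burn is significant, the short window proves it is still happening.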
Before you move on: can you answer these?

A user reports that checkout is slow, but your service dashboard shows healthy p50 and p95 latency. How would you investigate?

Check p99 and p99.9 latency — the problem likely affects a small subset of requests hidden by lower percentiles. Break down latency by endpoint, region, database shard, or customer tier to find the affected segment. Pull a trace for a slow request to see which downstream service or query is the bottleneck. The issue may only affect requests routed to a specific shard or region, invisible to global aggregates.
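The percentile pitfall behind this scenario can be shown numerically. A small sketch with hypothetical per-shard latencies: the global p95 looks healthy while one shard's users see 500 ms, and averaging per-shard p99s badly understates the true combined p99 (percentiles do not average).

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    s = sorted(values)
    k = math.ceil(len(s) * p / 100) - 1
    return s[k]

# Hypothetical latencies in ms: shard A is healthy, shard B is slow
# for 10% of its requests (e.g. a hot partition).
shard_a = [10] * 99 + [20]
shard_b = [10] * 90 + [500] * 10
combined = shard_a + shard_b

print(percentile(combined, 95))   # 20 -- global dashboard looks healthy
print(percentile(shard_b, 95))    # 500 -- the affected segment
# Averaging per-shard p99s gives (10 + 500) / 2 = 255 ms,
# but the true combined p99 is 500 ms.
print(percentile(combined, 99))   # 500
```

This is why the breakdown by shard, region, or customer tier matters: the global aggregate and the per-segment view tell different stories about the same traffic.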

Your Prometheus instance is consuming 32 GB of memory and crashing. What are the most likely causes and how do you fix them?

The most likely cause is high cardinality — too many unique time series from labels with many values (user_id, request_id, IP address). Check the TSDB head series count on the Prometheus /tsdb-status page. Fix by: (1) removing high-cardinality labels using metric_relabel_configs, (2) using recording rules to pre-aggregate expensive queries, (3) reducing scrape targets or scrape frequency, (4) for long-term scaling, migrating to Mimir or Thanos, which use object storage instead of local memory.
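Fix (1) can be expressed directly in the scrape config. A hedged sketch — the job name, target, and metric names are hypothetical; `metric_relabel_configs` runs after the scrape but before ingestion, so dropped labels and series never reach the TSDB:

```yaml
scrape_configs:
  - job_name: checkout
    static_configs:
      - targets: ["checkout:9090"]
    metric_relabel_configs:
      # Strip an unbounded label before ingestion (make sure the remaining
      # label set still uniquely identifies each series).
      - action: labeldrop
        regex: user_id
      # Drop a per-request metric family outright.
      - source_labels: [__name__]
        regex: "http_request_duration_by_request_id.*"
        action: drop
```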

Explain the difference between head-based and tail-based trace sampling and when you would use each.

Head-based sampling decides at request start (randomly keep X% of traces). Simple but discards interesting traces — errors and slow requests are sampled at the same rate as normal requests. Tail-based sampling decides after request completion, keeping 100% of error traces and high-latency traces while sampling normal traces at a lower rate. Use tail-based sampling in production to ensure you always have traces for the requests that matter most. Head-based is simpler and suitable for development or low-traffic services where you can afford 100% sampling.
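The tail-based decision described above fits in a few lines. A minimal sketch, assuming a collector that buffers spans until the trace completes; the 1000 ms threshold, the 1% baseline rate, and the type/function names are illustrative.

```python
import random
from dataclasses import dataclass

@dataclass
class CompletedTrace:
    trace_id: str
    has_error: bool
    duration_ms: float

def keep_trace(trace: CompletedTrace,
               slow_threshold_ms: float = 1000.0,
               baseline_rate: float = 0.01,
               rng=random.random) -> bool:
    """Tail-based decision, made only after the trace has completed:
    keep 100% of errors and slow traces, ~1% of normal traffic."""
    if trace.has_error:
        return True                    # always keep failed requests
    if trace.duration_ms >= slow_threshold_ms:
        return True                    # always keep slow requests
    return rng() < baseline_rate       # sample the boring majority

print(keep_trace(CompletedTrace("a", has_error=True, duration_ms=20)))     # True
print(keep_trace(CompletedTrace("b", has_error=False, duration_ms=4500)))  # True
print(keep_trace(CompletedTrace("c", False, 30.0), rng=lambda: 0.9))       # False
```

The cost is buffering: every span must be held until its trace finishes, which is why head-based sampling, which needs no buffering, remains attractive for low-traffic services.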

🧠Mental Model

💡 Analogy

Think of observability as a doctor's diagnostic toolkit. Metrics are like vital signs — heart rate, blood pressure, temperature — quick numerical checks that tell you immediately if something is off. Logs are like patient history — detailed records of every event, symptom, and treatment, which you review to understand what happened over time. Traces are like an MRI scan — they let you see exactly what is happening inside the body at each layer, revealing the precise location and nature of the problem. You need all three: vitals tell you something is wrong, history tells you what happened, and the MRI shows you exactly where the problem is. A doctor who only checks vital signs might know the patient is sick but not why. A doctor who only reads history misses the current crisis. A doctor who orders an MRI for every patient goes bankrupt. The art of observability, like medicine, is knowing which tool to reach for at each stage of diagnosis.

⚡ Core Idea

Observability is the ability to understand the internal state of a system from its external outputs. Metrics provide quantitative signals (the "what"), logs provide qualitative context (the "when and how"), and traces provide causal chains across distributed components (the "where"). True observability enables you to debug novel failures you never anticipated without deploying new code.

🎯 Why It Matters

Without observability, debugging distributed systems is guesswork. A microservice architecture with 200 services generates failure modes that no one can predict. Metrics tell you something is wrong, logs tell you what happened, and traces show you the exact request path that failed. The difference between a 10-minute incident resolution and a 4-hour outage is whether your observability lets you ask the right questions in real time.

Related concepts

Explore topics that connect to this one.

  • High availability (HA) & resilience
  • Disaster Recovery (DR)
  • SLIs, SLOs, and Error Budgets
