The practice of understanding the internal state of a system from its external outputs — metrics, logs, and traces — enabling teams to debug novel failures in complex distributed systems without deploying new code.
Observability is not just monitoring with a fancier name. Monitoring tells you when something breaks based on conditions you anticipated. Observability lets you ask arbitrary questions about your system that you did not predict in advance. At its core, observability is about being able to understand the internal state of a system by examining its external outputs. This distinction matters enormously at FAANG scale, where novel failure modes emerge from the combinatorial explosion of services, configurations, and traffic patterns that no one could anticipate ahead of time.
Charity Majors’ Definition
Charity Majors, co-founder of Honeycomb and one of the most influential voices in the observability community, defines observability as the ability to ask new questions of your system without having to ship new code. If you have to add a new metric, create a new log line, or redeploy to debug a production problem, your system is not observable — it is merely monitored. True observability means the telemetry data you already collect is rich enough to answer questions you have never asked before.
The distinction between monitoring and observability becomes painfully clear in microservice architectures. A monolith has a single process with a single stack trace. When it breaks, you look at one log file and one set of metrics. A distributed system with 200 services has 200 log streams, thousands of metric time series, and requests that fan out across dozens of services. Monitoring each service individually tells you nothing about the cross-service behavior that causes most production incidents.
Monitoring vs Observability
The three pillars of observability — metrics, logs, and traces — each provide a different lens into system behavior. Metrics tell you that something is wrong (the error rate spiked). Logs tell you what happened (this specific request failed with this error). Traces tell you where it happened (the request was slow because shard-17 in the payments database took 30 seconds). You need all three because no single pillar gives you the full picture.
The Three Pillars Working Together
At Google, we train new SREs with this debugging flow: (1) Metrics alert you that p99 latency exceeded the SLO threshold. (2) You drill into the dashboard to identify which service is slow. (3) You search logs for errors or slow queries in that service during the affected time window. (4) You pull a trace for a slow request to see exactly which downstream dependency is the bottleneck. Each pillar narrows the search space until you reach the root cause.
At FAANG scale, observability is not optional — it is a survival requirement. Google runs over 10,000 production services. A single user request to Google Search can touch dozens of services. Without comprehensive observability, debugging a latency regression would require guessing across thousands of services. The systems that handle this at scale — Monarch for metrics, Dapper for traces, centralized structured logging — were built because traditional monitoring could not cope with the complexity.
Quick Mental Model
Monitoring is like a smoke detector: it tells you there is a fire. Observability is like a security camera system: it lets you rewind, zoom in, and understand exactly what happened, when, and why — even for events you never anticipated.
| Aspect | Traditional Monitoring | Observability |
|---|---|---|
| Approach | Predefined checks and thresholds | Exploratory querying of rich telemetry |
| Failure types | Known, anticipated failures | Novel, emergent failures |
| Data model | Aggregated metrics and static dashboards | High-cardinality events, traces, structured logs |
| Debugging | Check dashboards, grep logs manually | Slice and dice data across any dimension in real time |
| Scale challenge | Dashboard sprawl, alert fatigue | Cardinality explosion, storage costs, sampling trade-offs |
The three pillars of observability
Metrics: What is happening? Aggregated numbers over time: trends, rates, percentiles.
Logs: When did it happen? Event records with timestamps: who did what, and when.
Traces: Why did it happen? The journey of a request across services: find bottlenecks and the root cause.

Example: a distributed trace for one request shows where time is spent across services. If the database span is the longest, you optimize the query or add a cache.

Metrics tell you what is happening. Logs tell you when it happened. Traces tell you why, following a request through services to the root cause.
Metrics are numerical measurements collected over time. They are the most mature and well-understood pillar of observability, providing the foundation for dashboards, alerting, and capacity planning. At Google, our internal metrics system (Monarch) ingests billions of data points per second. In the open-source world, Prometheus has become the de facto standard for metrics collection in cloud-native environments.
Understanding metric types is fundamental. Each type answers a different question about your system, and using the wrong type leads to incorrect conclusions or inefficient storage.
The Four Metric Types
Counter: a cumulative value that only increases (resetting to zero on restart), e.g. http_requests_total.
Gauge: a value that can go up or down, e.g. memory_usage_bytes or queue_depth.
Histogram: observations counted into configurable buckets, enabling accurate percentile queries at aggregation time, e.g. http_request_duration_seconds.
Summary: quantiles precomputed on the client; cheap to query but not aggregatable across instances.
Counter vs Gauge: The Most Common Mistake
New engineers frequently query a counter directly (e.g., http_requests_total = 1,523,847) and wonder why the number is so large or meaningless. Counters must always be wrapped in rate() or increase(). rate(http_requests_total[5m]) gives you requests per second averaged over 5 minutes. Querying a gauge with rate() is equally wrong — you would get the rate of change of CPU utilization, not the CPU utilization itself.
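A toy sketch may make this concrete. The samples and helper below are hypothetical, and real PromQL rate() additionally extrapolates to the window boundaries; this only shows the core idea of summing increases and dividing by elapsed time:

```python
# Hypothetical (timestamp_seconds, counter_value) samples.

def counter_rate(samples):
    """Per-second rate from counter samples, tolerating counter resets."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:               # counter reset: the process restarted at zero
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# The counter resets from 1500 down to 10 mid-window (a pod restart).
samples = [(0, 1000), (60, 1200), (120, 1500), (180, 10), (240, 210)]
print(counter_rate(samples))  # total increase 710 over 240s, about 2.96/s
```

Note how the raw counter value (210 at the end) says nothing useful on its own, while the reset-aware rate does.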
The Prometheus data model organizes metrics as time series identified by a metric name and a set of key-value labels. For example, http_requests_total{method="GET", handler="/api/users", status="200"} is one time series. Each unique combination of labels creates a separate time series. This is extremely powerful for filtering and grouping but introduces the cardinality problem (covered in Section 9).
Prometheus Metric Naming Conventions
1. Use snake_case for metric names: http_requests_total, not httpRequestsTotal or HttpRequestsTotal.
2. Include the unit as a suffix: http_request_duration_seconds, node_memory_bytes, disk_io_operations_total.
3. Use the _total suffix for counters: http_requests_total, errors_total, bytes_sent_total.
4. Use base units (seconds not milliseconds, bytes not megabytes): this avoids confusion and makes unit conversion explicit in queries.
5. Prefix with a namespace for your application: myapp_http_requests_total distinguishes your metrics from library metrics.
6. Keep label cardinality bounded: labels like user_id or request_id create unbounded cardinality and will crash Prometheus.
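As an illustration, a few of these conventions could be encoded in a small checker. This helper is hypothetical and deliberately incomplete (the official Prometheus guidance has more rules):

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")
NON_BASE_UNITS = re.compile(r"(milliseconds|_ms\b|megabytes|_mb\b)")

def check_metric_name(name, is_counter):
    """Return a list of convention violations (empty list = clean)."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("use snake_case")
    if is_counter and not name.endswith("_total"):
        problems.append("counters should end in _total")
    if NON_BASE_UNITS.search(name):
        problems.append("use base units (seconds, bytes)")
    return problems

print(check_metric_name("httpRequestsTotal", is_counter=True))
# flags both the camelCase name and the missing _total suffix
print(check_metric_name("myapp_http_requests_total", is_counter=True))  # []
```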
PromQL (Prometheus Query Language) is the query language for extracting insights from metrics. It is a functional language where you compose operations on time series data. Understanding PromQL is essential for building dashboards and writing alert rules.
| PromQL Expression | What It Returns | Use Case |
|---|---|---|
| rate(http_requests_total[5m]) | Per-second request rate averaged over 5 minutes | Throughput dashboard panel, capacity planning |
| histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) | 99th percentile latency over 5 minutes | SLO monitoring, latency alerting |
| sum by (status)(rate(http_requests_total[5m])) | Request rate broken down by HTTP status code | Error rate calculation, success/failure ratio |
| increase(errors_total[1h]) | Total number of errors in the last hour | Error spike detection, incident triage |
| avg_over_time(node_cpu_utilization[30m]) | Average CPU utilization over 30 minutes | Capacity planning, rightsizing alerts |
The rate() vs increase() Decision
rate() gives you per-second averages — ideal for dashboards and comparing throughput across services. increase() gives you the total count over a window — ideal for "how many errors happened in the last hour?" queries. Both handle counter resets correctly. Use rate() for continuous monitoring and increase() for incident investigation.
Averaging Percentiles Is Statistically Invalid
A common and dangerous mistake is computing avg(p99_latency) across multiple instances. Percentiles are not aggregatable — the average of p99 values from 10 instances is not the system-wide p99. Use histogram_quantile() on aggregated histogram buckets instead. This is why Prometheus histograms exist: they let you compute accurate percentiles across any number of instances at query time.
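A quick synthetic demonstration of the fallacy (all latency numbers invented; the p99 helper uses a simplified nearest-rank definition):

```python
# Synthetic latencies in milliseconds, invented for illustration.

def p99(values):
    values = sorted(values)
    return values[int(0.99 * (len(values) - 1))]  # nearest-rank, simplified

a = [10] * 985 + [50] * 15       # instance A: fast, small tail
b = [20] * 900 + [2000] * 100    # instance B: 10% of requests take 2s

avg_of_p99s = (p99(a) + p99(b)) / 2   # what avg(p99) would report
true_p99 = p99(a + b)                 # p99 over the pooled observations

print(avg_of_p99s, true_p99)  # 1025.0 2000 -- the average hides the tail
```

The averaged value suggests a ~1s tail when 1% of real requests actually take 2s, which is exactly why percentile math must happen on merged histogram buckets, not on per-instance quantiles.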
Logs are discrete event records emitted by applications and infrastructure. They are the most detailed pillar of observability — every log line captures a specific moment in time with full context. However, this richness comes at a cost: logs are expensive to store, difficult to query at scale, and only useful if they are structured and correlated.
Structured Logging Is Non-Negotiable
At Google and across all FAANG companies, unstructured logs (plain text lines like "Error processing request") are considered a bug, not a feature. Every log line must be a structured object (JSON, protobuf, or similar) with machine-parseable fields. This enables automated querying, aggregation, and correlation across services. If you cannot filter logs by field (e.g., request_id=abc AND level=error AND service=payments), your logging is broken.
The difference between structured and unstructured logging is the difference between a searchable database and a pile of text files. Unstructured logs require regex-based grep across gigabytes of text. Structured logs can be indexed, filtered, and aggregated using the same kinds of queries you use on a database.
| Aspect | Unstructured | Structured (JSON) |
|---|---|---|
| Format | 2024-01-15 10:23:45 ERROR Payment failed for user 12345 | {"timestamp":"2024-01-15T10:23:45Z","level":"error","msg":"Payment failed","user_id":"12345","service":"payments","trace_id":"abc123"} |
| Querying | grep "Payment failed" \| grep "12345" | WHERE user_id="12345" AND level="error" |
| Aggregation | Impossible without custom parsing | COUNT(*) GROUP BY service WHERE level="error" |
| Correlation | Manual text matching across files | JOIN on trace_id or request_id across all services |
| Machine parsing | Fragile regex, breaks on format changes | Reliable field extraction, schema-aware |
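The structured column above can be produced with a few lines of stdlib Python. This is a minimal sketch; the field names (service, user_id, trace_id) are illustrative, and real services usually use a library such as structlog or their framework's JSON formatter:

```python
import datetime
import json

def log(level, msg, **fields):
    """Emit one structured log line; returns the line for convenience."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "level": level,
        "msg": msg,
        **fields,   # every extra field becomes a queryable column
    }
    line = json.dumps(record)
    print(line)
    return line

log("error", "Payment failed",
    service="payments", user_id="12345", trace_id="abc123")
```

Because every field is machine-parseable, the aggregation and correlation queries in the table work without any custom parsing.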
Log levels provide a severity hierarchy that helps you filter signal from noise. Using them consistently across all services is critical for effective log management.
Log Levels (Syslog Severity)
DEBUG: detailed diagnostic output, normally disabled in production.
INFO: routine operational events (request handled, job completed).
WARNING: unexpected but recoverable conditions worth reviewing.
ERROR: a failure that broke an operation and needs investigation.
CRITICAL and above (ALERT, EMERGENCY): the service is unusable or data is at risk; these should page.
Correlation IDs: The Thread That Connects Everything
A correlation ID (also called trace ID or request ID) is a unique identifier assigned at the entry point of a request and propagated through every service that handles it. When a user reports "my checkout failed," you search for their request's correlation ID and instantly see every log line, metric, and trace span across every service that touched that request. Without correlation IDs, debugging a distributed system is like searching for a needle in a haystack blindfolded. Implement them from day one.
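Within a single process, correlation-ID propagation can be sketched with contextvars, so deep call sites never need the ID passed explicitly. This is hypothetical illustration code; across process boundaries the ID would travel in a header such as X-Request-ID or the W3C traceparent header:

```python
import contextvars
import uuid

# The variable set at the request entry point is visible to every
# function running in the same context.
request_id = contextvars.ContextVar("request_id", default="-")

def handle_request(incoming_id=None):
    # Adopt the caller's ID if one arrived in a header; otherwise mint one.
    request_id.set(incoming_id or uuid.uuid4().hex)
    charge_card()

def charge_card():
    # No ID is passed in, yet the log line still carries it.
    print(f'{{"request_id": "{request_id.get()}", "msg": "charging card"}}')

handle_request("abc123")  # prints a log line stamped with request_id abc123
```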
Log aggregation systems collect logs from all services into a centralized store where they can be searched, filtered, and analyzed. The two dominant approaches in the industry are the ELK stack (Elasticsearch, Logstash, Kibana) and the Grafana Loki stack.
| Feature | ELK Stack (Elasticsearch) | Grafana Loki |
|---|---|---|
| Indexing approach | Full-text index on log content — fast arbitrary search | Index only on labels (like Prometheus) — cheaper storage, filter then grep |
| Storage cost | High — Elasticsearch indexes everything, requires significant disk and memory | Low — stores compressed log chunks, only indexes metadata labels |
| Query speed | Fast for complex queries across any field | Fast for label-based filtering, slower for full-text search within results |
| Operational complexity | High — Elasticsearch cluster management is a job in itself (JVM tuning, shard management, rolling upgrades) | Low — designed for simplicity, integrates natively with Grafana |
| Best for | Organizations that need powerful full-text search and complex analytics | Teams already using Prometheus and Grafana that want cost-effective log aggregation |
The Log Volume Trap
At FAANG scale, a single service handling 100K requests per second that logs one line per request generates 8.6 billion log lines per day. At 500 bytes per line, that is roughly 4.3 TB per day for one service. Multiply by hundreds of services and you quickly reach petabytes. Log sampling, dynamic log levels (increase verbosity only during incidents), and aggressive retention policies (7 days hot, 30 days warm, 90 days cold) are essential cost controls.
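The arithmetic behind those numbers is easy to check (hypothetical service: 100K req/s, one 500-byte log line per request):

```python
rps = 100_000                          # requests per second
lines_per_day = rps * 86_400           # one log line per request
bytes_per_day = lines_per_day * 500    # 500 bytes per line

print(f"{lines_per_day:,} lines/day")        # 8,640,000,000 lines/day
print(f"{bytes_per_day / 1e12:.2f} TB/day")  # 4.32 TB/day
```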
Structured Logging Best Practices
1. Always include: timestamp (ISO 8601), level, service name, trace_id/request_id, and a human-readable message.
2. Add contextual fields relevant to the operation: user_id, order_id, endpoint, duration_ms, status_code.
3. Never log sensitive data: passwords, API keys, credit card numbers, or PII without redaction.
4. Use consistent field names across all services: trace_id, not traceId or TraceID or request_trace_id.
5. Set log levels correctly: if you would not investigate a log line at 3 AM, it is not an ERROR.
6. Implement dynamic log levels: allow per-service or per-request DEBUG logging via feature flags or headers without redeploying.
Distributed tracing is the third pillar of observability and arguably the most transformative for microservice architectures. A trace represents the complete journey of a single request as it propagates through your distributed system. It answers the question that metrics and logs alone cannot: "What happened to this specific request, across every service it touched, and how long did each step take?"
A trace is composed of spans. Each span represents a unit of work performed by a single service — an HTTP handler, a database query, a gRPC call to another service. Spans are organized in a tree structure: a root span (the initial request) has child spans (downstream calls), which can have their own child spans, creating a full picture of the request's execution path.
Anatomy of a Span
Each span carries the shared trace_id, its own span_id, the parent span's ID (empty for the root), an operation name, a start timestamp, a duration, and key-value attributes such as the HTTP status or the database statement.
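The span tree described above can be sketched as a small data structure. This is a hypothetical in-memory model for illustration only; real SDKs such as OpenTelemetry manage span creation, context, and export for you:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]        # None marks the root span
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

# A tiny trace: root -> order-service -> database insert.
root = Span("abc123", "s1", None, "POST /checkout", 637)
order = Span("abc123", "s2", "s1", "order-service CreateOrder", 340)
db = Span("abc123", "s3", "s2", "db-insert", 45)
root.children.append(order)
order.children.append(db)

def slowest_leaf(span):
    """Descend to the leaf span with the longest duration."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children),
               key=lambda s: s.duration_ms)

print(slowest_leaf(root).name)  # db-insert
```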
```mermaid
sequenceDiagram
    participant U as User Browser
    participant GW as API Gateway
    participant OS as Order Service
    participant PS as Payment Service
    participant DB as Database
    participant Q as Message Queue
    U->>GW: POST /checkout (trace_id: abc123)
    Note over GW: span: api-gateway (12ms)
    GW->>OS: gRPC CreateOrder
    Note over OS: span: order-service (340ms)
    OS->>DB: INSERT order
    Note over DB: span: db-insert (45ms)
    DB-->>OS: OK
    OS->>PS: gRPC ChargeCard
    Note over PS: span: payment-service (280ms)
    PS->>DB: SELECT payment_method
    Note over DB: span: db-select (15ms)
    DB-->>PS: result
    PS-->>OS: charged
    OS->>Q: Publish OrderCreated
    Note over Q: span: queue-publish (5ms)
    Q-->>OS: ack
    OS-->>GW: 201 Created
    GW-->>U: 201 Created
```

A distributed trace showing a checkout request flowing through API Gateway, Order Service, Payment Service, Database, and Message Queue. Each service creates a span with timing information.
The W3C Trace Context standard (traceparent and tracestate HTTP headers) provides a vendor-neutral way to propagate trace context across service boundaries. The traceparent header carries the trace ID, parent span ID, and trace flags. This standardization means you can use different tracing backends for different services and still get connected traces.
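A sketch of parsing and propagating the traceparent header (the example trace ID below is the one used in the W3C spec; the regex is a simplified validator and does not, for example, reject the all-zero trace and span IDs the spec declares invalid):

```python
import re
import secrets

# traceparent format: version-trace_id-span_id-flags, lowercase hex.
TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$")

def parse_traceparent(header):
    m = TRACEPARENT.match(header)
    return m.groupdict() if m else None

def child_traceparent(parent_header):
    """Propagate the trace onward: keep trace_id, mint a new span_id."""
    ctx = parse_traceparent(parent_header)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    return f"00-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"

hdr = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(hdr)["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

Every outbound call must carry a child header like this; if one hop drops it, the trace fragments.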
OpenTelemetry: The Industry Standard
OpenTelemetry (OTel) has become the industry standard for instrumentation. It provides vendor-neutral APIs, SDKs, and the OpenTelemetry Collector for collecting and exporting metrics, logs, and traces. Major cloud providers (AWS, GCP, Azure) and observability vendors (Datadog, New Relic, Honeycomb, Grafana) all support OpenTelemetry natively. If you are starting a new project today, instrument with OpenTelemetry — it avoids vendor lock-in and gives you flexibility to switch backends without changing application code.
| Tracing Backend | Architecture | Best For |
|---|---|---|
| Jaeger | Open-source, CNCF graduated, Elasticsearch or Cassandra storage | Teams that want full control, self-hosted environments, strong Kubernetes integration |
| Zipkin | Open-source, lightweight, originated at Twitter | Simpler deployments, Java/Spring ecosystem, lower operational overhead than Jaeger |
| Tempo (Grafana) | Open-source, object storage backend (S3/GCS), no indexing | Cost-effective at scale, teams already using Grafana, pairs with Loki for logs |
| Google Cloud Trace | Managed service, tight GCP integration | GCP-native workloads, automatic instrumentation for GCP services |
| AWS X-Ray | Managed service, tight AWS integration | AWS-native workloads, Lambda and ECS auto-instrumentation |
The Most Common Tracing Failure
The number one reason distributed tracing fails in organizations is broken context propagation. If even one service in the request path does not propagate the trace context headers (traceparent), the trace is broken into disconnected fragments. Before investing in tracing infrastructure, ensure every service — including proxies, message queues, and batch jobs — correctly propagates trace context. A partial trace is often worse than no trace because it gives you a false picture of the request path.
Auto-Instrumentation Saves Months of Work
OpenTelemetry SDKs provide auto-instrumentation for common frameworks and libraries (Express, Spring Boot, Django, gRPC, database drivers). Auto-instrumentation wraps library calls to automatically create spans, propagate context, and record timing. Start with auto-instrumentation to get basic tracing working across all services in days, then add custom spans for business-critical operations that need more detail.
An observability pipeline is the data infrastructure that collects telemetry from applications, processes and enriches it, routes it to the appropriate storage backends, and makes it queryable through dashboards and alerting systems. At scale, the pipeline itself becomes a critical production system that must be reliable, scalable, and cost-efficient.
```mermaid
graph LR
    subgraph "Applications"
        A1[Service A] -->|metrics, logs, traces| C
        A2[Service B] -->|metrics, logs, traces| C
        A3[Service C] -->|metrics, logs, traces| C
        A4[Infrastructure] -->|node metrics, syslogs| C
    end
    subgraph "Collection Layer"
        C[OTel Collector] -->|filter, enrich, batch| R
    end
    subgraph "Routing & Processing"
        R[Router / Fan-out] -->|metrics| M
        R -->|logs| L
        R -->|traces| T
    end
    subgraph "Storage Backends"
        M[Prometheus / Mimir]
        L[Loki / Elasticsearch]
        T[Tempo / Jaeger]
    end
    subgraph "Query & Visualization"
        M --> G[Grafana Dashboards]
        L --> G
        T --> G
        M --> AL[Alertmanager]
        AL --> PD[PagerDuty / Slack]
    end
```

End-to-end observability pipeline: applications emit telemetry to collectors, which process and route data to specialized storage backends, queryable through dashboards and alerting.
The OpenTelemetry Collector is the central component of a modern observability pipeline. It is a vendor-neutral data pipeline that receives telemetry data in multiple formats, processes it (filtering, enriching, batching, sampling), and exports it to one or more backends. Running the Collector as a sidecar or as a gateway deployment decouples your applications from the specifics of your observability backend.
OpenTelemetry Collector Components
Receivers ingest telemetry (OTLP, Prometheus scrape, filelog). Processors transform it in flight (batching, memory limiting, attribute enrichment, sampling). Exporters deliver it to backends (Prometheus remote write, Loki, Tempo, vendor APIs). A pipeline wires receivers, processors, and exporters together per signal type: one each for metrics, logs, and traces.
Gateway vs Sidecar Deployment
Run the OTel Collector in two tiers: (1) As a sidecar or DaemonSet agent on each node that collects local telemetry with minimal processing. (2) As a centralized gateway cluster that performs heavier processing (sampling, enrichment, fan-out). This architecture distributes collection load while centralizing expensive operations, and it provides a single point to reconfigure routing without touching application deployments.
One of the most important architectural decisions in observability is whether to use pull-based (scrape) or push-based collection for metrics. Prometheus pioneered the pull model; StatsD and OpenTelemetry use push.
| Aspect | Pull / Scrape (Prometheus) | Push (OTel, StatsD) |
|---|---|---|
| How it works | Prometheus server periodically scrapes HTTP endpoints (/metrics) on each target | Applications push telemetry to a collector or backend on each event or at regular intervals |
| Service discovery | Prometheus must know where to scrape — uses Kubernetes SD, Consul, file-based configs | Applications push to a known endpoint — collector does not need to discover targets |
| Failure detection | If a scrape fails, Prometheus knows the target is down — built-in up/down detection | If a service stops pushing, silence is ambiguous — is it down or just not emitting? |
| Short-lived jobs | Problematic — job may finish before the next scrape interval. Use Pushgateway as workaround | Natural fit — push metrics before the job exits |
| Network topology | Requires Prometheus to reach targets (problematic across firewalls, NATs) | Requires targets to reach the collector (usually easier in practice) |
| Dominant use | Prometheus ecosystem, Kubernetes-native monitoring | OpenTelemetry, serverless, edge computing, cross-network scenarios |
The Best of Both Worlds
Many production setups use both models: Prometheus scrapes infrastructure and application metrics from long-running services, while the OTel Collector receives push-based traces and logs. The OTel Collector can also scrape Prometheus endpoints, acting as a unified collection layer. Do not force one model everywhere — choose based on the telemetry type and the service characteristics.
Building a Production Observability Pipeline
1. Instrument applications with OpenTelemetry SDKs (auto-instrumentation first, then custom spans for critical paths).
2. Deploy the OTel Collector as a DaemonSet on every node for local collection and basic processing.
3. Deploy a centralized OTel Collector gateway for sampling, enrichment, and fan-out to multiple backends.
4. Set up Prometheus (or Mimir for multi-tenant) for metrics storage with appropriate retention (15 days hot, 1 year long-term).
5. Set up Loki (or Elasticsearch) for log aggregation with structured log parsing and retention policies.
6. Set up Tempo (or Jaeger) for trace storage with tail-based sampling to control costs.
7. Configure Grafana dashboards for each service with RED and USE method panels.
8. Configure Alertmanager with SLO-based alerts routed to PagerDuty/Slack by service ownership.
A dashboard is only useful if it helps someone make a decision. The most common failure in dashboard design is creating "metric museums" — screens full of graphs that look impressive but do not help anyone debug an incident or understand system health. Every panel on a dashboard should answer a specific question, and the dashboard as a whole should tell a coherent story about the health of a service.
The RED Method for Request-Driven Services
The RED method (coined by Tom Wilkie of Grafana Labs) defines three golden metrics for every request-driven service: Rate (requests per second), Errors (failed requests per second or error rate percentage), and Duration (latency distribution, especially p50, p95, p99). A single RED dashboard answers the critical questions: "Is the service handling traffic? Is it failing? Is it slow?" Every microservice should have a RED dashboard as its primary health view.
| RED Metric | PromQL Example | Dashboard Panel Type | Alert Threshold Example |
|---|---|---|---|
| Rate | sum(rate(http_requests_total[5m])) | Time series line graph | Alert if rate drops >50% in 5 min (traffic loss) |
| Errors | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) | Time series + single stat (current error %) | Alert if error rate >1% for 5 min |
| Duration (p50) | histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) | Time series line graph | Context line — not alerted on directly |
| Duration (p99) | histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) | Time series line graph | Alert if p99 >500ms for 5 min |
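The error-rate row in the table can be mirrored in plain Python to check the arithmetic (hypothetical per-status rates standing in for the output of sum by (status)(rate(http_requests_total[5m]))):

```python
# Invented per-status request rates in requests/second.
rates_by_status = {"200": 940.0, "201": 40.0, "404": 12.0,
                   "500": 7.0, "503": 1.0}

# Mirrors the PromQL expression: 5xx rate divided by total rate.
error_rate = sum(v for s, v in rates_by_status.items() if s.startswith("5"))
total_rate = sum(rates_by_status.values())
print(f"{error_rate / total_rate:.2%}")  # 0.80%
```

Note that 404s are excluded on purpose: a client asking for a missing resource is not a server failure.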
The USE Method for Infrastructure Resources
While RED works for application services, the USE method (by Brendan Gregg) works for infrastructure resources like CPU, memory, disk, and network. USE stands for: Utilization (percent of resource busy — CPU at 85%), Saturation (degree of queuing — 15 processes waiting for CPU), Errors (error count — disk I/O errors). A USE dashboard for each node or instance helps distinguish "the application is slow" from "the infrastructure is overloaded."
Google's Four Golden Signals (from the SRE book) provide another framework that overlaps with RED but adds saturation: Latency, Traffic, Errors, and Saturation. Use RED for application services, USE for infrastructure, and the Four Golden Signals as a unifying framework that covers both.
Dashboard Layout Principles
Put the panels that answer "is the service healthy right now" (SLO status, error rate, p99 latency) at the top, where eyes land first. Arrange top-down from overview to detail, group related panels into rows, and keep each dashboard focused on one audience and one question rather than endless scrolling.
Dashboard Anti-Patterns
The Metric Museum: 50 panels showing every metric available, with no clear narrative. The Wall of Green: everything shows "OK" because thresholds are too loose, giving false confidence. The Average Dashboard: showing only averages hides tail latency — p99 can be 10x the average. The Stale Dashboard: built for the service as it existed 6 months ago, now showing metrics for endpoints that no longer exist. Every dashboard should have an owner and a quarterly review cycle.
Grafana Best Practices
1. Use dashboard variables ($service, $environment, $region) so one template works for all services.
2. Set meaningful Y-axis labels and units (requests/sec, milliseconds, bytes): never leave them undefined.
3. Use consistent color schemes across all dashboards: green for healthy, yellow for warning, red for critical.
4. Add links between dashboards: the service overview links to per-endpoint detail, and to the relevant logs and traces.
5. Use dashboard folders organized by team or domain, with consistent naming: "Team - Service - Detail Level".
6. Set appropriate default time ranges: 6 hours for operational dashboards, 7 days for capacity planning.
The 3-Click Rule for Incident Response
An on-call engineer should be able to go from a PagerDuty alert to the root cause in three clicks or fewer: (1) Click the alert link to open the service dashboard. (2) Click a spike on the error graph to see the affected time range and endpoint. (3) Click through to the logs or traces filtered by that time range and endpoint. If your dashboards require more than three clicks, reorganize them with better linking and pre-filtered views.
Alerting is where observability meets human action. The best metrics, logs, and traces in the world are useless if alerts fire for the wrong reasons, go to the wrong people, or generate so much noise that on-call engineers learn to ignore them. Alert fatigue is the number one operational failure mode at scale — it is more dangerous than missing alerts because it trains people to dismiss real problems.
Symptom-Based Alerting, Not Cause-Based
Alert on symptoms that users experience, not on internal causes that may or may not matter. Bad alert: "CPU usage > 80%." Why bad? CPU at 85% might be perfectly fine — the service is handling traffic within SLO. Good alert: "Error rate > 0.5% for 5 minutes" or "p99 latency > 500ms for 10 minutes." These alert on what the user actually experiences. CPU at 85% becomes a capacity planning concern, not a paging alert. The rule is simple: if this condition does not impact users, it should not page anyone.
SLO-based alerting takes symptom-based alerting to the next level by connecting alerts directly to your Service Level Objectives. Instead of setting arbitrary thresholds, you alert when you are consuming your error budget faster than sustainable.
Multi-Window Burn Rate Alerting
A burn rate of 1x consumes the error budget exactly as fast as the SLO allows; 14.4x would exhaust a 30-day budget in about two days. Multi-window alerts fire only when both a long window (for statistical significance) and a short window (so the alert resets quickly once the problem is fixed) exceed the burn-rate threshold, which sharply reduces flappy pages.
The Alert Fatigue Death Spiral
When alerts fire too frequently for non-actionable reasons, on-call engineers start ignoring them. When they ignore alerts, they miss real incidents. When they miss real incidents, teams add more alerts to "catch everything." More alerts create more noise, which leads to more ignoring. This is the death spiral. The fix is brutal but necessary: delete every alert that has not resulted in a human taking meaningful action in the last 30 days. If it did not page to a real problem, it should not page at all.
| Alert Type | Window | Burn Rate Threshold | Action | Routing |
|---|---|---|---|---|
| Page (critical) | 5 min AND 1 hour | 14.4x (exhausts budget in 2 days) | Wake up on-call engineer | PagerDuty high-urgency |
| Page (urgent) | 30 min AND 6 hours | 6x (exhausts budget in 5 days) | Notify on-call during working hours | PagerDuty low-urgency |
| Ticket (slow burn) | 6 hours AND 3 days | 1x (on track to exhaust budget) | Create JIRA ticket for team | Slack channel + ticket system |
| Info (awareness) | 24 hours | 0.5x (consuming budget slowly) | Weekly report | Email digest or dashboard only |
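The burn-rate thresholds in the table follow from simple arithmetic over a 30-day SLO window (a 99.9% SLO is assumed here for illustration):

```python
slo = 0.999               # 99.9% availability target
budget = 1 - slo          # 0.1% of requests may fail over the window
window_days = 30

def days_to_exhaust(burn_rate):
    """Days until the whole error budget is gone at this burn rate."""
    return window_days / burn_rate

print(f"{days_to_exhaust(14.4):.2f}")  # 2.08 -> page immediately
print(f"{days_to_exhaust(1.0):.2f}")   # 30.00 -> exactly on budget
print(f"{14.4 * budget:.2%}")          # 1.44% observed error rate at 14.4x
```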
Alert routing ensures the right person receives the right alert at the right time. This requires a clear service ownership model and integration between your alerting system and your incident management platform.
Alert Routing Best Practices
1. Define service ownership: every alert must route to a specific team, never to a shared "ops" channel
2. Use severity levels that map to response expectations: P1 = respond in 5 minutes, P2 = 30 minutes, P3 = next business day
3. Configure escalation paths: if a P1 is not acknowledged in 5 minutes, escalate to the secondary on-call, then the engineering manager
4. Route through PagerDuty or Opsgenie for on-call management — never alert directly to Slack for critical issues (messages get buried)
5. Include runbook links in every alert: the alert message should contain a direct URL to the troubleshooting guide for that specific alert
6. Implement alert grouping and deduplication: 100 instances of the same alert should create 1 incident, not 100 pages
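The grouping and deduplication practice can be sketched as a toy fingerprinting scheme. The name-plus-service key is an assumption for the example; real systems such as Alertmanager use configurable grouping labels:

```python
from collections import defaultdict

def group_alerts(alerts):
    """Collapse N firing instances of the same alert into one incident.
    Fingerprint on what the alert IS (name + service), not on which
    instance fired it."""
    incidents = defaultdict(list)
    for alert in alerts:
        fingerprint = (alert["name"], alert["service"])
        incidents[fingerprint].append(alert["instance"])
    return incidents

# 100 pods all firing the same alert should produce one incident:
firing = [{"name": "HighErrorRate", "service": "checkout", "instance": f"pod-{i}"}
          for i in range(100)]
incidents = group_alerts(firing)
print(len(incidents))  # 1 incident, not 100 pages
```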
The Runbook Test
For every alert, ask: "Does a runbook exist for this alert? Could a junior engineer follow the runbook at 3 AM and resolve the issue?" If the answer is no, the alert needs either a runbook or a redesign. Alerts without runbooks create panic. Alerts with runbooks create confidence. At Google, every production alert is required to have an associated entry in the team's playbook before it can be deployed.
Alerting on Every HTTP 500
A single 500 error is not an incident — it might be a client sending a malformed request, a transient network blip, or an expected retry. Alerting on individual errors creates massive noise. Instead, alert on the error rate exceeding a threshold over a time window. "5xx rate > 1% for 5 minutes" is actionable. "One 500 error occurred" is noise. The goal is to detect degradation patterns, not individual failures.
Observability tools are only valuable if you know how to use them during an incident. The investigation workflow is the structured process that transforms "something is broken" into "here is the root cause and here is the fix." This workflow is what separates a 10-minute incident resolution from a 4-hour guessing session.
```mermaid
graph TD
    A["PagerDuty Alert Fires"] --> B["Open Service Dashboard"]
    B --> C{"Error rate spike or latency spike?"}
    C -->|Error spike| D["Filter by status code: 5xx vs 4xx"]
    C -->|Latency spike| E["Check p50 vs p99: all requests or tail?"]
    D --> F["Search logs: level=error, time range, service"]
    E --> F
    F --> G["Extract trace_id from error/slow log line"]
    G --> H["Open trace: identify slowest span"]
    H --> I{"Slow span is downstream service?"}
    I -->|Yes| J["Open downstream service dashboard"]
    I -->|No| K["Check resource metrics: CPU, memory, disk I/O"]
    J --> L["Repeat from step B for downstream service"]
    K --> M["Root cause identified: apply fix or mitigation"]
    L --> M
```

The investigation workflow: alert fires, dashboard narrows the service, logs identify the request, traces pinpoint the bottleneck, and resource metrics reveal the root cause.
Let us walk through a real debugging flow. You are on-call for the checkout service at 2 AM. PagerDuty fires: "checkout-service error rate > 2% for 5 minutes."
Step-by-Step Debugging Flow
1. Open the checkout-service RED dashboard (linked from the PagerDuty alert). You see the error rate spiked from 0.1% to 3.5% starting at 1:47 AM. Traffic rate is normal — this is not a traffic spike causing overload.
2. Drill into the error breakdown panel. All errors are HTTP 504 Gateway Timeout. The checkout service is timing out on a downstream call.
3. Check the deployment annotations. No deployments in the last 6 hours. This is not a bad deploy — something else changed.
4. Open logs filtered by level=error, service=checkout, time range 1:47-1:55 AM. Log lines show: "context deadline exceeded calling payments-service, trace_id=def456, duration=30012ms". The payments service is taking 30 seconds to respond.
5. Open the trace for trace_id=def456. The waterfall shows: checkout-service (30.1s total) → payments-service gRPC call (30.0s). Inside payments-service: database query "SELECT * FROM payment_methods WHERE user_id=?" took 29.8 seconds.
6. Open the payments-service dashboard. CPU and memory are fine. Open the database dashboard. The "slow queries" panel shows a spike in queries over 10 seconds starting at 1:45 AM. The database connection pool is saturated.
7. Check the database metrics. A long-running analytical query (started at 1:44 AM by a batch job) is holding table locks. The batch job runs nightly but usually finishes in seconds — tonight it is processing 10x more data due to a marketing campaign.
8. Resolution: Kill the long-running query immediately. Create a ticket to move the batch job to a read replica. Add a query timeout of 5 seconds to the payments service database client. Total time to resolution: 12 minutes.
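Step 4 of the flow, filtering structured logs and extracting trace IDs to pivot into the tracing system, might look like this in Python. The JSON field names (timestamp, level, service, trace_id) are illustrative assumptions about the log schema:

```python
import json
from datetime import datetime, timezone

def find_error_traces(log_lines, service, start, end):
    """Filter structured JSON log lines by level, service, and time
    window, and collect trace IDs for follow-up in the trace viewer."""
    trace_ids = []
    for line in log_lines:
        entry = json.loads(line)
        ts = datetime.fromisoformat(entry["timestamp"])
        if (entry["level"] == "error" and entry["service"] == service
                and start <= ts <= end):
            trace_ids.append(entry["trace_id"])
    return trace_ids

logs = [
    '{"timestamp": "2024-05-01T01:48:00+00:00", "level": "error", '
    '"service": "checkout", "trace_id": "def456", '
    '"message": "context deadline exceeded calling payments-service"}',
    '{"timestamp": "2024-05-01T01:48:01+00:00", "level": "info", '
    '"service": "checkout", "trace_id": "abc123", "message": "ok"}',
]
window = (datetime(2024, 5, 1, 1, 47, tzinfo=timezone.utc),
          datetime(2024, 5, 1, 1, 55, tzinfo=timezone.utc))
print(find_error_traces(logs, "checkout", *window))  # ['def456']
```

In practice this query runs in your log aggregator (Loki, Elasticsearch, Splunk), but the filter logic is the same: level, service, time range, then extract the trace_id.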
The Five Whys During Incidents
Do not stop at the first answer. Why did the checkout service error? Because payments timed out. Why did payments time out? Because the database was slow. Why was the database slow? Because a batch query held locks. Why did the batch query hold locks? Because it ran on the primary instead of a replica. Why did it run on the primary? Because the connection string was not updated after the replica was added 3 months ago. The root cause is not "database was slow" — it is "missing configuration for read replica routing."
Exemplars: Connecting Metrics to Traces
Prometheus exemplars attach a trace_id to specific metric observations. When you see a latency spike on a dashboard, you can click on a data point and jump directly to the trace for that specific slow request. This eliminates the manual step of searching logs for a trace_id. Grafana supports exemplar display natively — configure your histogram metrics to include exemplars and enable the exemplar layer on your time-series panels.
The Observability Blind Spot: Between Services
The hardest bugs to find are not inside services — they are between services. Network timeouts, DNS resolution failures, load balancer misconfigurations, and TLS handshake problems are invisible to application-level tracing. When traces show a gap (parent span ended but child span never started), investigate the network layer: check service mesh metrics (Istio, Linkerd), load balancer access logs, and DNS resolution times. Infrastructure-level observability is just as important as application-level.
The key insight about debugging with observability is that each tool narrows the search space. Metrics tell you when and what (error rate spike at 1:47 AM in checkout-service). Logs tell you which (specific request IDs, error messages). Traces tell you where (which service, which database query). Resource metrics tell you why (CPU exhaustion, disk full, connection pool saturated). The workflow always moves from broad to narrow: metrics → logs → traces → root cause.
As systems grow from tens to thousands of services, observability itself becomes one of the most expensive and challenging infrastructure problems. The three pillars generate enormous volumes of data, and naive approaches to collection, storage, and querying break down. At Google scale, internal observability systems handle more telemetry than many companies' entire business datasets. Understanding cardinality, sampling, and cost optimization is essential for any team operating at scale.
The Cardinality Explosion Problem
Cardinality is the number of unique time series in your metrics system. Every unique combination of metric name + label values creates a new time series. A metric with labels {method, path, status, region, instance} where method has 5 values, path has 100 endpoints, status has 5 codes, region has 4, and instance has 50 creates 5 x 100 x 5 x 4 x 50 = 500,000 time series from one metric. Add a user_id label and you get billions. Prometheus will run out of memory and crash. This is the single most common way teams accidentally destroy their monitoring infrastructure.
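The arithmetic is worth internalizing, because each new label multiplies rather than adds. A minimal sketch:

```python
from math import prod

def series_count(label_cardinalities):
    """Total time series one metric creates: the product of the
    number of distinct values per label."""
    return prod(label_cardinalities.values())

labels = {"method": 5, "path": 100, "status": 5, "region": 4, "instance": 50}
print(series_count(labels))  # 500000

# Adding a single high-cardinality label multiplies everything:
labels["user_id"] = 1_000_000
print(series_count(labels))  # 500000000000 -- game over for Prometheus
```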
| Label | Cardinality | Safe? | Recommendation |
|---|---|---|---|
| method (GET, POST, etc.) | ~8 | Yes | Always include — low cardinality, high value |
| status_code (200, 404, 500) | ~15 | Yes | Always include — essential for error rate calculation |
| endpoint (/api/users, /api/orders) | 10-200 | Usually | Include if bounded; use route patterns not raw paths (/api/users/:id not /api/users/12345) |
| region (us-east, eu-west) | 3-10 | Yes | Always include — essential for regional debugging |
| instance/pod | 10-1000 | Careful | Include only on infrastructure metrics, not application metrics. Aggregate application metrics across instances. |
| user_id | Millions | Never | Never use as a metric label. Use traces or logs for per-user debugging. |
| request_id | Infinite | Never | Never use as a metric label. This is what traces are for. |
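The route-pattern advice from the table can be sketched as a normalization step. The regexes below are illustrative assumptions; in practice, label with the route your HTTP router actually matched rather than rewriting raw paths after the fact:

```python
import re

def normalize_path(path: str) -> str:
    """Collapse raw URL paths into bounded route patterns before
    using them as a metric label value."""
    path = re.sub(r"/\d+", "/:id", path)  # numeric IDs -> placeholder
    path = re.sub(r"/[0-9a-f]{8}-[0-9a-f-]{27}", "/:uuid", path)  # UUIDs
    return path

print(normalize_path("/api/users/12345"))          # /api/users/:id
print(normalize_path("/api/orders/987/items/42"))  # /api/orders/:id/items/:id
```

Without this step, every distinct user ID in a URL mints a new time series; with it, the path label stays bounded by the number of routes in your API.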
Trace sampling is the most important cost control mechanism for distributed tracing. At high traffic volumes, collecting every trace is prohibitively expensive and provides diminishing returns. The key is to sample strategically so you retain the traces that matter most.
Trace Sampling Strategies
The 1/1/1 Rule for Cost Control
A practical cost framework for observability at scale: keep metrics for 1 year (small, aggregated, cheap), keep logs for 1 month (large, detailed, expensive), keep traces for 1 week (very large, per-request, most expensive). Adjust based on compliance requirements, but this baseline prevents the common mistake of storing everything forever. Use tiered storage: hot (SSD, fast queries) for recent data, warm (HDD) for older data, cold (object storage like S3) for archival.
| Telemetry Type | Volume per Service | Storage Cost Driver | Primary Cost Control |
|---|---|---|---|
| Metrics | ~1,000-50,000 time series | Cardinality (unique label combinations) | Reduce labels, aggregate across instances, use recording rules for expensive queries |
| Logs | ~1-100 GB per day | Volume (bytes stored and indexed) | Sampling, dynamic log levels, structured logging with selective indexing, aggressive retention |
| Traces | ~10-500 GB per day | Volume (spans stored per request) | Tail-based sampling, span filtering (drop health checks), shorter retention |
Metric Aggregation and Recording Rules
1. Identify expensive PromQL queries that run frequently (dashboard panels queried every 30 seconds by many viewers)
2. Create recording rules that pre-compute these queries and store the result as a new time series: e.g., record: job:http_requests:rate5m, expr: sum(rate(http_requests_total[5m])) by (job)
3. Replace dashboard queries with the pre-computed recording rule metrics for faster load times and reduced Prometheus load
4. Use metric relabeling to drop unnecessary labels at scrape time — this prevents high-cardinality labels from ever entering storage
5. Implement metric federation for multi-cluster setups: each cluster runs local Prometheus, a global Prometheus scrapes aggregated recording rules from each cluster
6. Consider Mimir, Thanos, or Cortex for long-term storage: they use object storage (S3/GCS) for cost-effective retention beyond the 15-day local Prometheus default
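What a recording rule precomputes can be illustrated in plain Python. This toy mirrors the shape of sum(rate(http_requests_total[5m])) by (job): per-instance request rates collapse into one cheap series per job, so dashboards read a single precomputed series instead of re-aggregating hundreds at view time:

```python
from collections import defaultdict

def record_job_rate(per_instance_rates):
    """Aggregate per-(job, instance) request rates into one
    coarse-grained series per job, dropping the instance label."""
    by_job = defaultdict(float)
    for (job, _instance), rate in per_instance_rates.items():
        by_job[job] += rate
    return dict(by_job)

raw = {("api", "pod-1"): 120.0, ("api", "pod-2"): 95.5, ("worker", "pod-3"): 12.0}
print(record_job_rate(raw))  # {'api': 215.5, 'worker': 12.0}
```

In real Prometheus this aggregation runs on the server at the rule's evaluation interval; the point of the sketch is only to show which labels survive.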
Cost Monitoring for Observability Itself
Monitor the cost and performance of your observability infrastructure with the same rigor you monitor production services. Track: total time series in Prometheus (tsdb head series), log ingestion rate (bytes per second), trace storage growth (GB per day), query latency for dashboard panels. Set alerts when observability costs exceed budget thresholds. At Google, the observability infrastructure team has its own SLOs and error budgets. If monitoring is down, you are flying blind — treat it as a P1.
Storing Raw Metrics for User-Level Debugging
A common and expensive mistake is trying to use metrics for per-user or per-request debugging by adding high-cardinality labels (user_id, session_id, request_id). This creates millions of time series and crashes Prometheus. Metrics are for aggregate patterns ("5% of requests are failing"). Logs and traces are for individual request debugging ("why did user 12345's request fail?"). Use the right tool for the right job.
Observability is a core topic in SRE, platform engineering, and backend engineering interviews at all FAANG companies. Expect system design questions like "design the monitoring for a payment service" where you must specify metrics (RED method), dashboards, alerting strategy (SLO-based), log aggregation, and distributed tracing. Debugging scenarios are common: "here is a set of symptoms, walk me through how you would find the root cause." Senior-level interviews test your understanding of trade-offs: cardinality management, sampling strategies, cost optimization, and choosing between self-hosted and SaaS observability platforms.
Try this question: Ask about the team's current observability stack, alert-to-resolution time, on-call burden (pages per week), and whether they use SLO-based alerting. These questions reveal operational maturity.
Strong answer: Candidates who describe the debugging workflow (metrics → logs → traces → root cause). Candidates who discuss high-cardinality observability for subset debugging. Candidates who mention SLO-based alerting and explain burn rates. Candidates who proactively discuss cost and cardinality management at scale.
Red flags: Candidates who equate observability with "setting up Prometheus and Grafana" without discussing traces or structured logging. Candidates who cannot explain why averaging percentiles is wrong. Candidates who suggest alerting on CPU usage as a primary strategy.
Key takeaways
A user reports that checkout is slow, but your service dashboard shows healthy p50 and p95 latency. How would you investigate?
Check p99 and p99.9 latency — the problem likely affects a small subset of requests hidden by lower percentiles. Break down latency by endpoint, region, database shard, or customer tier to find the affected segment. Pull a trace for a slow request to see which downstream service or query is the bottleneck. The issue may only affect requests routed to a specific shard or region, invisible to global aggregates.
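A quick numeric illustration of how p50 and p95 can look healthy while users suffer: here 1.5% of requests (say, those routed to a bad shard) take 8 seconds, and only p99 reveals them. The nearest-rank percentile function is a simple sketch, not a production implementation:

```python
def percentile(values, p):
    """Nearest-rank percentile on a sorted copy (toy implementation)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# 1,000 requests: 985 fast ones at ~50ms, 15 hitting a bad shard at 8s.
latencies_ms = [50] * 985 + [8000] * 15

print(percentile(latencies_ms, 50))  # 50   -- p50 looks healthy
print(percentile(latencies_ms, 95))  # 50   -- p95 looks healthy
print(percentile(latencies_ms, 99))  # 8000 -- the tail only shows at p99
```

This is also why you break latency down by shard, region, or customer tier: the global p99 tells you a tail exists, but only a segmented view tells you which subset of traffic is in it.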
Your Prometheus instance is consuming 32 GB of memory and crashing. What are the most likely causes and how do you fix them?
The most likely cause is high cardinality — too many unique time series from labels with many values (user_id, request_id, IP address). Check the tsdb head series count with the Prometheus /tsdb-status page. Fix by: (1) removing high-cardinality labels using metric_relabel_configs, (2) using recording rules to pre-aggregate expensive queries, (3) reducing scrape targets or scrape frequency, (4) for long-term scaling, migrate to Mimir or Thanos which use object storage instead of local memory.
Explain the difference between head-based and tail-based trace sampling and when you would use each.
Head-based sampling decides at request start (randomly keep X% of traces). Simple but discards interesting traces — errors and slow requests are sampled at the same rate as normal requests. Tail-based sampling decides after request completion, keeping 100% of error traces and high-latency traces while sampling normal traces at a lower rate. Use tail-based sampling in production to ensure you always have traces for the requests that matter most. Head-based is simpler and suitable for development or low-traffic services where you can afford 100% sampling.
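A tail-based sampling decision can be sketched as a post-completion predicate. The thresholds, field names, and 1% baseline are illustrative assumptions, not any collector's actual defaults:

```python
import random

def tail_sample(trace, baseline_rate=0.01, latency_slo_ms=500):
    """Tail-based sampling: decide AFTER the request completes.
    Always keep errors and SLO-violating traces; keep only a small
    random share of normal traffic."""
    if trace["error"]:
        return True                       # keep 100% of error traces
    if trace["duration_ms"] > latency_slo_ms:
        return True                       # keep 100% of slow traces
    return random.random() < baseline_rate  # ~1% of normal traffic

random.seed(0)  # seeded so the example is deterministic
traces = [
    {"trace_id": "a1", "error": True,  "duration_ms": 120},
    {"trace_id": "b2", "error": False, "duration_ms": 900},
    {"trace_id": "c3", "error": False, "duration_ms": 45},
]
kept = [t["trace_id"] for t in traces if tail_sample(t)]
print(kept)  # always includes 'a1' (error) and 'b2' (slow)
```

A head-based sampler, by contrast, would have made a random keep/drop decision before knowing that a1 errored or that b2 was slow, which is exactly why tail-based sampling retains the traces that matter.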
💡 Analogy
Think of observability as a doctor's diagnostic toolkit. Metrics are like vital signs — heart rate, blood pressure, temperature — quick numerical checks that tell you immediately if something is off. Logs are like patient history — detailed records of every event, symptom, and treatment, which you review to understand what happened over time. Traces are like an MRI scan — they let you see exactly what is happening inside the body at each layer, revealing the precise location and nature of the problem. You need all three: vitals tell you something is wrong, history tells you what happened, and the MRI shows you exactly where the problem is. A doctor who only checks vital signs might know the patient is sick but not why. A doctor who only reads history misses the current crisis. A doctor who orders an MRI for every patient goes bankrupt. The art of observability, like medicine, is knowing which tool to reach for at each stage of diagnosis.
⚡ Core Idea
Observability is the ability to understand the internal state of a system from its external outputs. Metrics provide quantitative signals (the "what"), logs provide qualitative context (the "when and how"), and traces provide causal chains across distributed components (the "where"). True observability enables you to debug novel failures you never anticipated without deploying new code.
🎯 Why It Matters
Without observability, debugging distributed systems is guesswork. A microservice architecture with 200 services generates failure modes that no one can predict. Metrics tell you something is wrong, logs tell you what happened, and traces show you the exact request path that failed. The difference between a 10-minute incident resolution and a 4-hour outage is whether your observability lets you ask the right questions in real time.