The practice of understanding the internal state of a system from its external outputs — metrics, logs, and traces — enabling teams to debug novel failures in complex distributed systems without deploying new code.
Observability is not just monitoring with a fancier name. Monitoring tells you when something breaks based on conditions you anticipated. Observability lets you ask arbitrary questions about your system that you did not predict in advance. At its core, observability is about being able to understand the internal state of a system by examining its external outputs. This distinction matters enormously at FAANG scale, where novel failure modes emerge from the combinatorial explosion of services, configurations, and traffic patterns that no one could anticipate ahead of time.
Charity Majors’ Definition
Charity Majors, co-founder of Honeycomb and one of the most influential voices in the observability community, defines observability as the ability to ask new questions of your system without having to ship new code. If you have to add a new metric, create a new log line, or redeploy to debug a production problem, your system is not observable — it is merely monitored. True observability means the telemetry data you already collect is rich enough to answer questions you have never asked before.
The distinction between monitoring and observability becomes painfully clear in microservice architectures. A monolith has a single process with a single stack trace. When it breaks, you look at one log file and one set of metrics. A distributed system with 200 services has 200 log streams, thousands of metric time series, and requests that fan out across dozens of services. Monitoring each service individually tells you nothing about the cross-service behavior that causes most production incidents.
Monitoring vs Observability
The three pillars of observability — metrics, logs, and traces — each provide a different lens into system behavior. Metrics tell you that something is wrong (the error rate spiked). Logs tell you what happened (this specific request failed with this error). Traces tell you where it happened (the request was slow because shard-17 in the payments database took 30 seconds). You need all three because no single pillar gives you the full picture.
The Three Pillars Working Together
At Google, we train new SREs with this debugging flow: (1) Metrics alert you that p99 latency exceeded the SLO threshold. (2) You drill into the dashboard to identify which service is slow. (3) You search logs for errors or slow queries in that service during the affected time window. (4) You pull a trace for a slow request to see exactly which downstream dependency is the bottleneck. Each pillar narrows the search space until you reach the root cause.
At FAANG scale, observability is not optional — it is a survival requirement. Google runs over 10,000 production services. A single user request to Google Search can touch dozens of services. Without comprehensive observability, debugging a latency regression would require guessing across thousands of services. The systems that handle this at scale — Monarch for metrics, Dapper for traces, centralized structured logging — were built because traditional monitoring could not cope with the complexity.
Quick Mental Model
Monitoring is like a smoke detector: it tells you there is a fire. Observability is like a security camera system: it lets you rewind, zoom in, and understand exactly what happened, when, and why — even for events you never anticipated.
| Aspect | Traditional Monitoring | Observability |
|---|---|---|
| Approach | Predefined checks and thresholds | Exploratory querying of rich telemetry |
| Failure types | Known, anticipated failures | Novel, emergent failures |
| Data model | Aggregated metrics and static dashboards | High-cardinality events, traces, structured logs |
| Debugging | Check dashboards, grep logs manually | Slice and dice data across any dimension in real time |
| Scale challenge | Dashboard sprawl, alert fatigue | Cardinality explosion, storage costs, sampling trade-offs |
The three pillars of observability
Metrics: What is happening? Aggregated numbers over time: trends, rates, percentiles.
Logs: When did it happen? Event records with timestamps: who did what, and when.
Traces: Why did it happen? The journey of a request across services: find bottlenecks and the root cause.

Example: a distributed trace for one request shows where time is spent across services. If the database span is the longest, you optimize the query or add a cache.

Metrics tell you what is happening. Logs tell you when it happened. Traces tell you why, following a request through services to the root cause.
Metrics are numerical measurements collected over time. They are the most mature and well-understood pillar of observability, providing the foundation for dashboards, alerting, and capacity planning. At Google, our internal metrics system (Monarch) ingests billions of data points per second. In the open-source world, Prometheus has become the de facto standard for metrics collection in cloud-native environments.
Understanding metric types is fundamental. Each type answers a different question about your system, and using the wrong type leads to incorrect conclusions or inefficient storage.
The Four Metric Types
Counter: a cumulative value that only increases (resetting to zero on restart), e.g. http_requests_total.
Gauge: a value that can go up or down, e.g. memory_usage_bytes or queue_depth.
Histogram: observations counted into configurable buckets, enabling accurate percentile queries at aggregation time, e.g. http_request_duration_seconds.
Summary: quantiles precomputed on the client; cheap to query but not aggregatable across instances.
Counter vs Gauge: The Most Common Mistake
New engineers frequently query a counter directly (e.g., http_requests_total = 1,523,847) and wonder why the number is so large or meaningless. Counters must always be wrapped in rate() or increase(). rate(http_requests_total[5m]) gives you requests per second averaged over 5 minutes. Querying a gauge with rate() is equally wrong — you would get the rate of change of CPU utilization, not the CPU utilization itself.
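A toy sketch may make this concrete. The samples and helper below are hypothetical, and real PromQL rate() additionally extrapolates to the window boundaries; this only shows the core idea of summing increases and dividing by elapsed time:

```python
# Hypothetical (timestamp_seconds, counter_value) samples.

def counter_rate(samples):
    """Per-second rate from counter samples, tolerating counter resets."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:               # counter reset: the process restarted at zero
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# The counter resets from 1500 down to 10 mid-window (a pod restart).
samples = [(0, 1000), (60, 1200), (120, 1500), (180, 10), (240, 210)]
print(counter_rate(samples))  # total increase 710 over 240s, about 2.96/s
```

Note how the raw counter value (210 at the end) says nothing useful on its own, while the reset-aware rate does.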
The Prometheus data model organizes metrics as time series identified by a metric name and a set of key-value labels. For example, http_requests_total{method="GET", handler="/api/users", status="200"} is one time series. Each unique combination of labels creates a separate time series. This is extremely powerful for filtering and grouping but introduces the cardinality problem (covered in Section 9).
Prometheus Metric Naming Conventions
1. Use snake_case for metric names: http_requests_total, not httpRequestsTotal or HttpRequestsTotal.
2. Include the unit as a suffix: http_request_duration_seconds, node_memory_bytes, disk_io_operations_total.
3. Use the _total suffix for counters: http_requests_total, errors_total, bytes_sent_total.
4. Use base units (seconds not milliseconds, bytes not megabytes): this avoids confusion and makes unit conversion explicit in queries.
5. Prefix with a namespace for your application: myapp_http_requests_total distinguishes your metrics from library metrics.
6. Keep label cardinality bounded: labels like user_id or request_id create unbounded cardinality and will crash Prometheus.
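As an illustration, a few of these conventions could be encoded in a small checker. This helper is hypothetical and deliberately incomplete (the official Prometheus guidance has more rules):

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")
NON_BASE_UNITS = re.compile(r"(milliseconds|_ms\b|megabytes|_mb\b)")

def check_metric_name(name, is_counter):
    """Return a list of convention violations (empty list = clean)."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("use snake_case")
    if is_counter and not name.endswith("_total"):
        problems.append("counters should end in _total")
    if NON_BASE_UNITS.search(name):
        problems.append("use base units (seconds, bytes)")
    return problems

print(check_metric_name("httpRequestsTotal", is_counter=True))
# flags both the camelCase name and the missing _total suffix
print(check_metric_name("myapp_http_requests_total", is_counter=True))  # []
```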
PromQL (Prometheus Query Language) is the query language for extracting insights from metrics. It is a functional language where you compose operations on time series data. Understanding PromQL is essential for building dashboards and writing alert rules.
| PromQL Expression | What It Returns | Use Case |
|---|---|---|
| rate(http_requests_total[5m]) | Per-second request rate averaged over 5 minutes | Throughput dashboard panel, capacity planning |
| histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) | 99th percentile latency over 5 minutes | SLO monitoring, latency alerting |
| sum by (status)(rate(http_requests_total[5m])) | Request rate broken down by HTTP status code | Error rate calculation, success/failure ratio |
| increase(errors_total[1h]) | Total number of errors in the last hour | Error spike detection, incident triage |
| avg_over_time(node_cpu_utilization[30m]) | Average CPU utilization over 30 minutes | Capacity planning, rightsizing alerts |
The rate() vs increase() Decision
rate() gives you per-second averages — ideal for dashboards and comparing throughput across services. increase() gives you the total count over a window — ideal for "how many errors happened in the last hour?" queries. Both handle counter resets correctly. Use rate() for continuous monitoring and increase() for incident investigation.
Averaging Percentiles Is Statistically Invalid
A common and dangerous mistake is computing avg(p99_latency) across multiple instances. Percentiles are not aggregatable — the average of p99 values from 10 instances is not the system-wide p99. Use histogram_quantile() on aggregated histogram buckets instead. This is why Prometheus histograms exist: they let you compute accurate percentiles across any number of instances at query time.
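A quick synthetic demonstration of the fallacy (all latency numbers invented; the p99 helper uses a simplified nearest-rank definition):

```python
# Synthetic latencies in milliseconds, invented for illustration.

def p99(values):
    values = sorted(values)
    return values[int(0.99 * (len(values) - 1))]  # nearest-rank, simplified

a = [10] * 985 + [50] * 15       # instance A: fast, small tail
b = [20] * 900 + [2000] * 100    # instance B: 10% of requests take 2s

avg_of_p99s = (p99(a) + p99(b)) / 2   # what avg(p99) would report
true_p99 = p99(a + b)                 # p99 over the pooled observations

print(avg_of_p99s, true_p99)  # 1025.0 2000 -- the average hides the tail
```

The averaged value suggests a ~1s tail when 1% of real requests actually take 2s, which is exactly why percentile math must happen on merged histogram buckets, not on per-instance quantiles.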
Logs are discrete event records emitted by applications and infrastructure. They are the most detailed pillar of observability — every log line captures a specific moment in time with full context. However, this richness comes at a cost: logs are expensive to store, difficult to query at scale, and only useful if they are structured and correlated.
Structured Logging Is Non-Negotiable
At Google and across all FAANG companies, unstructured logs (plain text lines like "Error processing request") are considered a bug, not a feature. Every log line must be a structured object (JSON, protobuf, or similar) with machine-parseable fields. This enables automated querying, aggregation, and correlation across services. If you cannot filter logs by field (e.g., request_id=abc AND level=error AND service=payments), your logging is broken.
The difference between structured and unstructured logging is the difference between a searchable database and a pile of text files. Unstructured logs require regex-based grep across gigabytes of text. Structured logs can be indexed, filtered, and aggregated using the same kinds of queries you use on a database.
| Aspect | Unstructured | Structured (JSON) |
|---|---|---|
| Format | 2024-01-15 10:23:45 ERROR Payment failed for user 12345 | {"timestamp":"2024-01-15T10:23:45Z","level":"error","msg":"Payment failed","user_id":"12345","service":"payments","trace_id":"abc123"} |
| Querying | grep "Payment failed" \| grep "12345" | WHERE user_id="12345" AND level="error" |
| Aggregation | Impossible without custom parsing | COUNT(*) GROUP BY service WHERE level="error" |
| Correlation | Manual text matching across files | JOIN on trace_id or request_id across all services |
| Machine parsing | Fragile regex, breaks on format changes | Reliable field extraction, schema-aware |
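The structured column above can be produced with a few lines of stdlib Python. This is a minimal sketch; the field names (service, user_id, trace_id) are illustrative, and real services usually use a library such as structlog or their framework's JSON formatter:

```python
import datetime
import json

def log(level, msg, **fields):
    """Emit one structured log line; returns the line for convenience."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "level": level,
        "msg": msg,
        **fields,   # every extra field becomes a queryable column
    }
    line = json.dumps(record)
    print(line)
    return line

log("error", "Payment failed",
    service="payments", user_id="12345", trace_id="abc123")
```

Because every field is machine-parseable, the aggregation and correlation queries in the table work without any custom parsing.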
Log levels provide a severity hierarchy that helps you filter signal from noise. Using them consistently across all services is critical for effective log management.
Log Levels (Syslog Severity)
DEBUG: detailed diagnostic output, normally disabled in production.
INFO: routine operational events (request handled, job completed).
WARNING: unexpected but recoverable conditions worth reviewing.
ERROR: a failure that broke an operation and needs investigation.
CRITICAL and above (ALERT, EMERGENCY): the service is unusable or data is at risk; these should page.
Correlation IDs: The Thread That Connects Everything
A correlation ID (also called trace ID or request ID) is a unique identifier assigned at the entry point of a request and propagated through every service that handles it. When a user reports "my checkout failed," you search for their request's correlation ID and instantly see every log line, metric, and trace span across every service that touched that request. Without correlation IDs, debugging a distributed system is like searching for a needle in a haystack blindfolded. Implement them from day one.
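Within a single process, correlation-ID propagation can be sketched with contextvars, so deep call sites never need the ID passed explicitly. This is hypothetical illustration code; across process boundaries the ID would travel in a header such as X-Request-ID or the W3C traceparent header:

```python
import contextvars
import uuid

# The variable set at the request entry point is visible to every
# function running in the same context.
request_id = contextvars.ContextVar("request_id", default="-")

def handle_request(incoming_id=None):
    # Adopt the caller's ID if one arrived in a header; otherwise mint one.
    request_id.set(incoming_id or uuid.uuid4().hex)
    charge_card()

def charge_card():
    # No ID is passed in, yet the log line still carries it.
    print(f'{{"request_id": "{request_id.get()}", "msg": "charging card"}}')

handle_request("abc123")  # prints a log line stamped with request_id abc123
```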
Log aggregation systems collect logs from all services into a centralized store where they can be searched, filtered, and analyzed. The two dominant approaches in the industry are the ELK stack (Elasticsearch, Logstash, Kibana) and the Grafana Loki stack.
| Feature | ELK Stack (Elasticsearch) | Grafana Loki |
|---|---|---|
| Indexing approach | Full-text index on log content — fast arbitrary search | Index only on labels (like Prometheus) — cheaper storage, filter then grep |
| Storage cost | High — Elasticsearch indexes everything, requires significant disk and memory | Low — stores compressed log chunks, only indexes metadata labels |
| Query speed | Fast for complex queries across any field | Fast for label-based filtering, slower for full-text search within results |
| Operational complexity | High — Elasticsearch cluster management is a job in itself (JVM tuning, shard management, rolling upgrades) | Low — designed for simplicity, integrates natively with Grafana |
| Best for | Organizations that need powerful full-text search and complex analytics | Teams already using Prometheus and Grafana that want cost-effective log aggregation |
The Log Volume Trap
At FAANG scale, a single service handling 100K requests per second that logs one line per request generates 8.6 billion log lines per day. At 500 bytes per line, that is roughly 4.3 TB per day for one service. Multiply by hundreds of services and you quickly reach petabytes. Log sampling, dynamic log levels (increase verbosity only during incidents), and aggressive retention policies (7 days hot, 30 days warm, 90 days cold) are essential cost controls.
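The arithmetic behind those numbers is easy to check (hypothetical service: 100K req/s, one 500-byte log line per request):

```python
rps = 100_000                          # requests per second
lines_per_day = rps * 86_400           # one log line per request
bytes_per_day = lines_per_day * 500    # 500 bytes per line

print(f"{lines_per_day:,} lines/day")        # 8,640,000,000 lines/day
print(f"{bytes_per_day / 1e12:.2f} TB/day")  # 4.32 TB/day
```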
Structured Logging Best Practices
1. Always include: timestamp (ISO 8601), level, service name, trace_id/request_id, and a human-readable message.
2. Add contextual fields relevant to the operation: user_id, order_id, endpoint, duration_ms, status_code.
3. Never log sensitive data: passwords, API keys, credit card numbers, or PII without redaction.
4. Use consistent field names across all services: trace_id, not traceId or TraceID or request_trace_id.
5. Set log levels correctly: if you would not investigate a log line at 3 AM, it is not an ERROR.
6. Implement dynamic log levels: allow per-service or per-request DEBUG logging via feature flags or headers without redeploying.
Distributed tracing is the third pillar of observability and arguably the most transformative for microservice architectures. A trace represents the complete journey of a single request as it propagates through your distributed system. It answers the question that metrics and logs alone cannot: "What happened to this specific request, across every service it touched, and how long did each step take?"
A trace is composed of spans. Each span represents a unit of work performed by a single service — an HTTP handler, a database query, a gRPC call to another service. Spans are organized in a tree structure: a root span (the initial request) has child spans (downstream calls), which can have their own child spans, creating a full picture of the request's execution path.
Anatomy of a Span
Each span carries the shared trace_id, its own span_id, the parent span's ID (empty for the root), an operation name, a start timestamp, a duration, and key-value attributes such as the HTTP status or the database statement.
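The span tree described above can be sketched as a small data structure. This is a hypothetical in-memory model for illustration only; real SDKs such as OpenTelemetry manage span creation, context, and export for you:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]        # None marks the root span
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

# A tiny trace: root -> order-service -> database insert.
root = Span("abc123", "s1", None, "POST /checkout", 637)
order = Span("abc123", "s2", "s1", "order-service CreateOrder", 340)
db = Span("abc123", "s3", "s2", "db-insert", 45)
root.children.append(order)
order.children.append(db)

def slowest_leaf(span):
    """Descend to the leaf span with the longest duration."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children),
               key=lambda s: s.duration_ms)

print(slowest_leaf(root).name)  # db-insert
```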
```mermaid
sequenceDiagram
    participant U as User Browser
    participant GW as API Gateway
    participant OS as Order Service
    participant PS as Payment Service
    participant DB as Database
    participant Q as Message Queue
    U->>GW: POST /checkout (trace_id: abc123)
    Note over GW: span: api-gateway (12ms)
    GW->>OS: gRPC CreateOrder
    Note over OS: span: order-service (340ms)
    OS->>DB: INSERT order
    Note over DB: span: db-insert (45ms)
    DB-->>OS: OK
    OS->>PS: gRPC ChargeCard
    Note over PS: span: payment-service (280ms)
    PS->>DB: SELECT payment_method
    Note over DB: span: db-select (15ms)
    DB-->>PS: result
    PS-->>OS: charged
    OS->>Q: Publish OrderCreated
    Note over Q: span: queue-publish (5ms)
    Q-->>OS: ack
    OS-->>GW: 201 Created
    GW-->>U: 201 Created
```

A distributed trace showing a checkout request flowing through API Gateway, Order Service, Payment Service, Database, and Message Queue. Each service creates a span with timing information.
The W3C Trace Context standard (traceparent and tracestate HTTP headers) provides a vendor-neutral way to propagate trace context across service boundaries. The traceparent header carries the trace ID, parent span ID, and trace flags. This standardization means you can use different tracing backends for different services and still get connected traces.
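A sketch of parsing and propagating the traceparent header (the example trace ID below is the one used in the W3C spec; the regex is a simplified validator and does not, for example, reject the all-zero trace and span IDs the spec declares invalid):

```python
import re
import secrets

# traceparent format: version-trace_id-span_id-flags, lowercase hex.
TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$")

def parse_traceparent(header):
    m = TRACEPARENT.match(header)
    return m.groupdict() if m else None

def child_traceparent(parent_header):
    """Propagate the trace onward: keep trace_id, mint a new span_id."""
    ctx = parse_traceparent(parent_header)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    return f"00-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"

hdr = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(hdr)["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

Every outbound call must carry a child header like this; if one hop drops it, the trace fragments.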
OpenTelemetry: The Industry Standard
OpenTelemetry (OTel) has become the industry standard for instrumentation. It provides vendor-neutral APIs, SDKs, and the OpenTelemetry Collector for collecting and exporting metrics, logs, and traces. Major cloud providers (AWS, GCP, Azure) and observability vendors (Datadog, New Relic, Honeycomb, Grafana) all support OpenTelemetry natively. If you are starting a new project today, instrument with OpenTelemetry — it avoids vendor lock-in and gives you flexibility to switch backends without changing application code.
| Tracing Backend | Architecture | Best For |
|---|---|---|
| Jaeger | Open-source, CNCF graduated, Elasticsearch or Cassandra storage | Teams that want full control, self-hosted environments, strong Kubernetes integration |
| Zipkin | Open-source, lightweight, originated at Twitter | Simpler deployments, Java/Spring ecosystem, lower operational overhead than Jaeger |
| Tempo (Grafana) | Open-source, object storage backend (S3/GCS), no indexing | Cost-effective at scale, teams already using Grafana, pairs with Loki for logs |
| Google Cloud Trace | Managed service, tight GCP integration | GCP-native workloads, automatic instrumentation for GCP services |
| AWS X-Ray | Managed service, tight AWS integration | AWS-native workloads, Lambda and ECS auto-instrumentation |
The Most Common Tracing Failure
The number one reason distributed tracing fails in organizations is broken context propagation. If even one service in the request path does not propagate the trace context headers (traceparent), the trace is broken into disconnected fragments. Before investing in tracing infrastructure, ensure every service — including proxies, message queues, and batch jobs — correctly propagates trace context. A partial trace is often worse than no trace because it gives you a false picture of the request path.
Auto-Instrumentation Saves Months of Work
OpenTelemetry SDKs provide auto-instrumentation for common frameworks and libraries (Express, Spring Boot, Django, gRPC, database drivers). Auto-instrumentation wraps library calls to automatically create spans, propagate context, and record timing. Start with auto-instrumentation to get basic tracing working across all services in days, then add custom spans for business-critical operations that need more detail.
An observability pipeline is the data infrastructure that collects telemetry from applications, processes and enriches it, routes it to the appropriate storage backends, and makes it queryable through dashboards and alerting systems. At scale, the pipeline itself becomes a critical production system that must be reliable, scalable, and cost-efficient.
```mermaid
graph LR
    subgraph "Applications"
        A1[Service A] -->|metrics, logs, traces| C
        A2[Service B] -->|metrics, logs, traces| C
        A3[Service C] -->|metrics, logs, traces| C
        A4[Infrastructure] -->|node metrics, syslogs| C
    end
    subgraph "Collection Layer"
        C[OTel Collector] -->|filter, enrich, batch| R
    end
    subgraph "Routing & Processing"
        R[Router / Fan-out] -->|metrics| M
        R -->|logs| L
        R -->|traces| T
    end
    subgraph "Storage Backends"
        M[Prometheus / Mimir]
        L[Loki / Elasticsearch]
        T[Tempo / Jaeger]
    end
    subgraph "Query & Visualization"
        M --> G[Grafana Dashboards]
        L --> G
        T --> G
        M --> AL[Alertmanager]
        AL --> PD[PagerDuty / Slack]
    end
```

End-to-end observability pipeline: applications emit telemetry to collectors, which process and route data to specialized storage backends, queryable through dashboards and alerting.
The OpenTelemetry Collector is the central component of a modern observability pipeline. It is a vendor-neutral data pipeline that receives telemetry data in multiple formats, processes it (filtering, enriching, batching, sampling), and exports it to one or more backends. Running the Collector as a sidecar or as a gateway deployment decouples your applications from the specifics of your observability backend.
OpenTelemetry Collector Components
Receivers ingest telemetry (OTLP, Prometheus scrape, filelog). Processors transform it in flight (batching, memory limiting, attribute enrichment, sampling). Exporters deliver it to backends (Prometheus remote write, Loki, Tempo, vendor APIs). A pipeline wires receivers, processors, and exporters together per signal type: one each for metrics, logs, and traces.
Gateway vs Sidecar Deployment
Run the OTel Collector in two tiers: (1) As a sidecar or DaemonSet agent on each node that collects local telemetry with minimal processing. (2) As a centralized gateway cluster that performs heavier processing (sampling, enrichment, fan-out). This architecture distributes collection load while centralizing expensive operations, and it provides a single point to reconfigure routing without touching application deployments.
One of the most important architectural decisions in observability is whether to use pull-based (scrape) or push-based collection for metrics. Prometheus pioneered the pull model; StatsD and OpenTelemetry use push.
| Aspect | Pull / Scrape (Prometheus) | Push (OTel, StatsD) |
|---|---|---|
| How it works | Prometheus server periodically scrapes HTTP endpoints (/metrics) on each target | Applications push telemetry to a collector or backend on each event or at regular intervals |
| Service discovery | Prometheus must know where to scrape — uses Kubernetes SD, Consul, file-based configs | Applications push to a known endpoint — collector does not need to discover targets |
| Failure detection | If a scrape fails, Prometheus knows the target is down — built-in up/down detection | If a service stops pushing, silence is ambiguous — is it down or just not emitting? |
| Short-lived jobs | Problematic — job may finish before the next scrape interval. Use Pushgateway as workaround | Natural fit — push metrics before the job exits |
| Network topology | Requires Prometheus to reach targets (problematic across firewalls, NATs) | Requires targets to reach the collector (usually easier in practice) |
| Dominant use | Prometheus ecosystem, Kubernetes-native monitoring | OpenTelemetry, serverless, edge computing, cross-network scenarios |
The Best of Both Worlds
Many production setups use both models: Prometheus scrapes infrastructure and application metrics from long-running services, while the OTel Collector receives push-based traces and logs. The OTel Collector can also scrape Prometheus endpoints, acting as a unified collection layer. Do not force one model everywhere — choose based on the telemetry type and the service characteristics.
Building a Production Observability Pipeline
1. Instrument applications with OpenTelemetry SDKs (auto-instrumentation first, then custom spans for critical paths).
2. Deploy the OTel Collector as a DaemonSet on every node for local collection and basic processing.
3. Deploy a centralized OTel Collector gateway for sampling, enrichment, and fan-out to multiple backends.
4. Set up Prometheus (or Mimir for multi-tenant) for metrics storage with appropriate retention (15 days hot, 1 year long-term).
5. Set up Loki (or Elasticsearch) for log aggregation with structured log parsing and retention policies.
6. Set up Tempo (or Jaeger) for trace storage with tail-based sampling to control costs.
7. Configure Grafana dashboards for each service with RED and USE method panels.
8. Configure Alertmanager with SLO-based alerts routed to PagerDuty/Slack by service ownership.
A dashboard is only useful if it helps someone make a decision. The most common failure in dashboard design is creating "metric museums" — screens full of graphs that look impressive but do not help anyone debug an incident or understand system health. Every panel on a dashboard should answer a specific question, and the dashboard as a whole should tell a coherent story about the health of a service.
The RED Method for Request-Driven Services
The RED method (coined by Tom Wilkie of Grafana Labs) defines three golden metrics for every request-driven service: Rate (requests per second), Errors (failed requests per second or error rate percentage), and Duration (latency distribution, especially p50, p95, p99). A single RED dashboard answers the critical questions: "Is the service handling traffic? Is it failing? Is it slow?" Every microservice should have a RED dashboard as its primary health view.
| RED Metric | PromQL Example | Dashboard Panel Type | Alert Threshold Example |
|---|---|---|---|
| Rate | sum(rate(http_requests_total[5m])) | Time series line graph | Alert if rate drops >50% in 5 min (traffic loss) |
| Errors | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) | Time series + single stat (current error %) | Alert if error rate >1% for 5 min |
| Duration (p50) | histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) | Time series line graph | Context line — not alerted on directly |
| Duration (p99) | histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) | Time series line graph | Alert if p99 >500ms for 5 min |
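The error-rate row in the table can be mirrored in plain Python to check the arithmetic (hypothetical per-status rates standing in for the output of sum by (status)(rate(http_requests_total[5m]))):

```python
# Invented per-status request rates in requests/second.
rates_by_status = {"200": 940.0, "201": 40.0, "404": 12.0,
                   "500": 7.0, "503": 1.0}

# Mirrors the PromQL expression: 5xx rate divided by total rate.
error_rate = sum(v for s, v in rates_by_status.items() if s.startswith("5"))
total_rate = sum(rates_by_status.values())
print(f"{error_rate / total_rate:.2%}")  # 0.80%
```

Note that 404s are excluded on purpose: a client asking for a missing resource is not a server failure.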
The USE Method for Infrastructure Resources
While RED works for application services, the USE method (by Brendan Gregg) works for infrastructure resources like CPU, memory, disk, and network. USE stands for: Utilization (percent of resource busy — CPU at 85%), Saturation (degree of queuing — 15 processes waiting for CPU), Errors (error count — disk I/O errors). A USE dashboard for each node or instance helps distinguish "the application is slow" from "the infrastructure is overloaded."
Google's Four Golden Signals (from the SRE book) provide another framework that overlaps with RED but adds saturation: Latency, Traffic, Errors, and Saturation. Use RED for application services, USE for infrastructure, and the Four Golden Signals as a unifying framework that covers both.
Dashboard Layout Principles
Put the panels that answer "is the service healthy right now" (SLO status, error rate, p99 latency) at the top, where eyes land first. Arrange top-down from overview to detail, group related panels into rows, and keep each dashboard focused on one audience and one question rather than endless scrolling.
Dashboard Anti-Patterns
The Metric Museum: 50 panels showing every metric available, with no clear narrative. The Wall of Green: everything shows "OK" because thresholds are too loose, giving false confidence. The Average Dashboard: showing only averages hides tail latency — p99 can be 10x the average. The Stale Dashboard: built for the service as it existed 6 months ago, now showing metrics for endpoints that no longer exist. Every dashboard should have an owner and a quarterly review cycle.
Grafana Best Practices
1. Use dashboard variables ($service, $environment, $region) so one template works for all services.
2. Set meaningful Y-axis labels and units (requests/sec, milliseconds, bytes): never leave them undefined.
3. Use consistent color schemes across all dashboards: green for healthy, yellow for warning, red for critical.
4. Add links between dashboards: the service overview links to per-endpoint detail, and to the relevant logs and traces.
5. Use dashboard folders organized by team or domain, with consistent naming: "Team - Service - Detail Level".
6. Set appropriate default time ranges: 6 hours for operational dashboards, 7 days for capacity planning.
The 3-Click Rule for Incident Response
An on-call engineer should be able to go from a PagerDuty alert to the root cause in three clicks or fewer: (1) Click the alert link to open the service dashboard. (2) Click a spike on the error graph to see the affected time range and endpoint. (3) Click through to the logs or traces filtered by that time range and endpoint. If your dashboards require more than three clicks, reorganize them with better linking and pre-filtered views.
Alerting is where observability meets human action. The best metrics, logs, and traces in the world are useless if alerts fire for the wrong reasons, go to the wrong people, or generate so much noise that on-call engineers learn to ignore them. Alert fatigue is the number one operational failure mode at scale — it is more dangerous than missing alerts because it trains people to dismiss real problems.
Symptom-Based Alerting, Not Cause-Based
Alert on symptoms that users experience, not on internal causes that may or may not matter. Bad alert: "CPU usage > 80%." Why bad? CPU at 85% might be perfectly fine — the service is handling traffic within SLO. Good alert: "Error rate > 0.5% for 5 minutes" or "p99 latency > 500ms for 10 minutes." These alert on what the user actually experiences. CPU at 85% becomes a capacity planning concern, not a paging alert. The rule is simple: if this condition does not impact users, it should not page anyone.
SLO-based alerting takes symptom-based alerting to the next level by connecting alerts directly to your Service Level Objectives. Instead of setting arbitrary thresholds, you alert when you are consuming your error budget faster than sustainable.
Multi-Window Burn Rate Alerting
A burn rate of 1x consumes the error budget exactly as fast as the SLO allows; 14.4x would exhaust a 30-day budget in about two days. Multi-window alerts fire only when both a long window (for statistical significance) and a short window (so the alert resets quickly once the problem is fixed) exceed the burn-rate threshold, which sharply reduces flappy pages.
The Alert Fatigue Death Spiral
When alerts fire too frequently for non-actionable reasons, on-call engineers start ignoring them. When they ignore alerts, they miss real incidents. When they miss real incidents, teams add more alerts to "catch everything." More alerts create more noise, which leads to more ignoring. This is the death spiral. The fix is brutal but necessary: delete every alert that has not resulted in a human taking meaningful action in the last 30 days. If it did not page to a real problem, it should not page at all.
| Alert Type | Window | Burn Rate Threshold | Action | Routing |
|---|---|---|---|---|
| Page (critical) | 5 min AND 1 hour | 14.4x (exhausts budget in 2 days) | Wake up on-call engineer | PagerDuty high-urgency |
| Page (urgent) | 30 min AND 6 hours | 6x (exhausts budget in 5 days) | Notify on-call during working hours | PagerDuty low-urgency |
| Ticket (slow burn) | 6 hours AND 3 days | 1x (on track to exhaust budget) | Create JIRA ticket for team | Slack channel + ticket system |
| Info (awareness) | 24 hours | 0.5x (consuming budget slowly) | Weekly report | Email digest or dashboard only |
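The burn-rate thresholds in the table follow from simple arithmetic over a 30-day SLO window (a 99.9% SLO is assumed here for illustration):

```python
slo = 0.999               # 99.9% availability target
budget = 1 - slo          # 0.1% of requests may fail over the window
window_days = 30

def days_to_exhaust(burn_rate):
    """Days until the whole error budget is gone at this burn rate."""
    return window_days / burn_rate

print(f"{days_to_exhaust(14.4):.2f}")  # 2.08 -> page immediately
print(f"{days_to_exhaust(1.0):.2f}")   # 30.00 -> exactly on budget
print(f"{14.4 * budget:.2%}")          # 1.44% observed error rate at 14.4x
```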
Alert routing ensures the right person receives the right alert at the right time. This requires a clear service ownership model and integration between your alerting system and your incident management platform.
Alert Routing Best Practices
1. Define service ownership: every alert must route to a specific team, never to a shared "ops" channel
2. Use severity levels that map to response expectations: P1 = respond in 5 minutes, P2 = 30 minutes, P3 = next business day
3. Configure escalation paths: if a P1 is not acknowledged in 5 minutes, escalate to the secondary on-call, then the engineering manager
4. Route through PagerDuty or Opsgenie for on-call management — never alert directly to Slack for critical issues (messages get buried)
5. Include runbook links in every alert: the alert message should contain a direct URL to the troubleshooting guide for that specific alert
6. Implement alert grouping and deduplication: 100 instances of the same alert should create 1 incident, not 100 pages
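The grouping and deduplication practice can be sketched as a toy fingerprinting scheme. The name-plus-service key is an assumption for the example; real systems such as Alertmanager use configurable grouping labels:

```python
from collections import defaultdict

def group_alerts(alerts):
    """Collapse N firing instances of the same alert into one incident.
    Fingerprint on what the alert IS (name + service), not on which
    instance fired it."""
    incidents = defaultdict(list)
    for alert in alerts:
        fingerprint = (alert["name"], alert["service"])
        incidents[fingerprint].append(alert["instance"])
    return incidents

# 100 pods all firing the same alert should produce one incident:
firing = [{"name": "HighErrorRate", "service": "checkout", "instance": f"pod-{i}"}
          for i in range(100)]
incidents = group_alerts(firing)
print(len(incidents))  # 1 incident, not 100 pages
```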
The Runbook Test
For every alert, ask: "Does a runbook exist for this alert? Could a junior engineer follow the runbook at 3 AM and resolve the issue?" If the answer is no, the alert needs either a runbook or a redesign. Alerts without runbooks create panic. Alerts with runbooks create confidence. At Google, every production alert is required to have an associated entry in the team's playbook before it can be deployed.
Alerting on Every HTTP 500
A single 500 error is not an incident — it might be a client sending a malformed request, a transient network blip, or an expected retry. Alerting on individual errors creates massive noise. Instead, alert on the error rate exceeding a threshold over a time window. "5xx rate > 1% for 5 minutes" is actionable. "One 500 error occurred" is noise. The goal is to detect degradation patterns, not individual failures.
Observability tools are only valuable if you know how to use them during an incident. The investigation workflow is the structured process that transforms "something is broken" into "here is the root cause and here is the fix." This workflow is what separates a 10-minute incident resolution from a 4-hour guessing session.
```mermaid
graph TD
    A["PagerDuty Alert Fires"] --> B["Open Service Dashboard"]
    B --> C{"Error rate spike or latency spike?"}
    C -->|Error spike| D["Filter by status code: 5xx vs 4xx"]
    C -->|Latency spike| E["Check p50 vs p99: all requests or tail?"]
    D --> F["Search logs: level=error, time range, service"]
    E --> F
    F --> G["Extract trace_id from error/slow log line"]
    G --> H["Open trace: identify slowest span"]
    H --> I{"Slow span is downstream service?"}
    I -->|Yes| J["Open downstream service dashboard"]
    I -->|No| K["Check resource metrics: CPU, memory, disk I/O"]
    J --> L["Repeat from step B for downstream service"]
    K --> M["Root cause identified: apply fix or mitigation"]
    L --> M
```

The investigation workflow: alert fires, dashboard narrows the service, logs identify the request, traces pinpoint the bottleneck, and resource metrics reveal the root cause.
Let us walk through a real debugging flow. You are on-call for the checkout service at 2 AM. PagerDuty fires: "checkout-service error rate > 2% for 5 minutes."
Step-by-Step Debugging Flow
1. Open the checkout-service RED dashboard (linked from the PagerDuty alert). You see the error rate spiked from 0.1% to 3.5% starting at 1:47 AM. Traffic rate is normal — this is not a traffic spike causing overload.
2. Drill into the error breakdown panel. All errors are HTTP 504 Gateway Timeout. The checkout service is timing out on a downstream call.
3. Check the deployment annotations. No deployments in the last 6 hours. This is not a bad deploy — something else changed.
4. Open logs filtered by level=error, service=checkout, time range 1:47-1:55 AM. Log lines show: "context deadline exceeded calling payments-service, trace_id=def456, duration=30012ms". The payments service is taking 30 seconds to respond.
5. Open the trace for trace_id=def456. The waterfall shows: checkout-service (30.1s total) → payments-service gRPC call (30.0s). Inside payments-service: database query "SELECT * FROM payment_methods WHERE user_id=?" took 29.8 seconds.
6. Open the payments-service dashboard. CPU and memory are fine. Open the database dashboard. The "slow queries" panel shows a spike in queries over 10 seconds starting at 1:45 AM. The database connection pool is saturated.
7. Check the database metrics. A long-running analytical query (started at 1:44 AM by a batch job) is holding table locks. The batch job runs nightly but usually finishes in seconds — tonight it is processing 10x more data due to a marketing campaign.
8. Resolution: Kill the long-running query immediately. Create a ticket to move the batch job to a read replica. Add a query timeout of 5 seconds to the payments service database client. Total time to resolution: 12 minutes.
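Step 4 of the flow, filtering structured logs and extracting trace IDs to pivot into the tracing system, might look like this in Python. The JSON field names (timestamp, level, service, trace_id) are illustrative assumptions about the log schema:

```python
import json
from datetime import datetime, timezone

def find_error_traces(log_lines, service, start, end):
    """Filter structured JSON log lines by level, service, and time
    window, and collect trace IDs for follow-up in the trace viewer."""
    trace_ids = []
    for line in log_lines:
        entry = json.loads(line)
        ts = datetime.fromisoformat(entry["timestamp"])
        if (entry["level"] == "error" and entry["service"] == service
                and start <= ts <= end):
            trace_ids.append(entry["trace_id"])
    return trace_ids

logs = [
    '{"timestamp": "2024-05-01T01:48:00+00:00", "level": "error", '
    '"service": "checkout", "trace_id": "def456", '
    '"message": "context deadline exceeded calling payments-service"}',
    '{"timestamp": "2024-05-01T01:48:01+00:00", "level": "info", '
    '"service": "checkout", "trace_id": "abc123", "message": "ok"}',
]
window = (datetime(2024, 5, 1, 1, 47, tzinfo=timezone.utc),
          datetime(2024, 5, 1, 1, 55, tzinfo=timezone.utc))
print(find_error_traces(logs, "checkout", *window))  # ['def456']
```

In practice this query runs in your log aggregator (Loki, Elasticsearch, Splunk), but the filter logic is the same: level, service, time range, then extract the trace_id.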
The Five Whys During Incidents
Do not stop at the first answer. Why did the checkout service error? Because payments timed out. Why did payments time out? Because the database was slow. Why was the database slow? Because a batch query held locks. Why did the batch query hold locks? Because it ran on the primary instead of a replica. Why did it run on the primary? Because the connection string was not updated after the replica was added 3 months ago. The root cause is not "database was slow" — it is "missing configuration for read replica routing."
Exemplars: Connecting Metrics to Traces
Prometheus exemplars attach a trace_id to specific metric observations. When you see a latency spike on a dashboard, you can click on a data point and jump directly to the trace for that specific slow request. This eliminates the manual step of searching logs for a trace_id. Grafana supports exemplar display natively — configure your histogram metrics to include exemplars and enable the exemplar layer on your time-series panels.
The Observability Blind Spot: Between Services
The hardest bugs to find are not inside services — they are between services. Network timeouts, DNS resolution failures, load balancer misconfigurations, and TLS handshake problems are invisible to application-level tracing. When traces show a gap (parent span ended but child span never started), investigate the network layer: check service mesh metrics (Istio, Linkerd), load balancer access logs, and DNS resolution times. Infrastructure-level observability is just as important as application-level.
The key insight about debugging with observability is that each tool narrows the search space. Metrics tell you when and what (error rate spike at 1:47 AM in checkout-service). Logs tell you which (specific request IDs, error messages). Traces tell you where (which service, which database query). Resource metrics tell you why (CPU exhaustion, disk full, connection pool saturated). The workflow always moves from broad to narrow: metrics → logs → traces → root cause.
As systems grow from tens to thousands of services, observability itself becomes one of the most expensive and challenging infrastructure problems. The three pillars generate enormous volumes of data, and naive approaches to collection, storage, and querying break down. At Google scale, internal observability systems handle more telemetry than many companies' entire business datasets. Understanding cardinality, sampling, and cost optimization is essential for any team operating at scale.
The Cardinality Explosion Problem
Cardinality is the number of unique time series in your metrics system. Every unique combination of metric name + label values creates a new time series. A metric with labels {method, path, status, region, instance} where method has 5 values, path has 100 endpoints, status has 5 codes, region has 4, and instance has 50 creates 5 x 100 x 5 x 4 x 50 = 500,000 time series from one metric. Add a user_id label and you get billions. Prometheus will run out of memory and crash. This is the single most common way teams accidentally destroy their monitoring infrastructure.
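The arithmetic is worth internalizing, because each new label multiplies rather than adds. A minimal sketch:

```python
from math import prod

def series_count(label_cardinalities):
    """Total time series one metric creates: the product of the
    number of distinct values per label."""
    return prod(label_cardinalities.values())

labels = {"method": 5, "path": 100, "status": 5, "region": 4, "instance": 50}
print(series_count(labels))  # 500000

# Adding a single high-cardinality label multiplies everything:
labels["user_id"] = 1_000_000
print(series_count(labels))  # 500000000000 -- game over for Prometheus
```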
| Label | Cardinality | Safe? | Recommendation |
|---|---|---|---|
| method (GET, POST, etc.) | ~8 | Yes | Always include — low cardinality, high value |
| status_code (200, 404, 500) | ~15 | Yes | Always include — essential for error rate calculation |
| endpoint (/api/users, /api/orders) | 10-200 | Usually | Include if bounded; use route patterns not raw paths (/api/users/:id not /api/users/12345) |
| region (us-east, eu-west) | 3-10 | Yes | Always include — essential for regional debugging |
| instance/pod | 10-1000 | Careful | Include only on infrastructure metrics, not application metrics. Aggregate application metrics across instances. |
| user_id | Millions | Never | Never use as a metric label. Use traces or logs for per-user debugging. |
| request_id | Infinite | Never | Never use as a metric label. This is what traces are for. |
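The route-pattern advice from the table can be sketched as a normalization step. The regexes below are illustrative assumptions; in practice, label with the route your HTTP router actually matched rather than rewriting raw paths after the fact:

```python
import re

def normalize_path(path: str) -> str:
    """Collapse raw URL paths into bounded route patterns before
    using them as a metric label value."""
    path = re.sub(r"/\d+", "/:id", path)  # numeric IDs -> placeholder
    path = re.sub(r"/[0-9a-f]{8}-[0-9a-f-]{27}", "/:uuid", path)  # UUIDs
    return path

print(normalize_path("/api/users/12345"))          # /api/users/:id
print(normalize_path("/api/orders/987/items/42"))  # /api/orders/:id/items/:id
```

Without this step, every distinct user ID in a URL mints a new time series; with it, the path label stays bounded by the number of routes in your API.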
Trace sampling is the most important cost control mechanism for distributed tracing. At high traffic volumes, collecting every trace is prohibitively expensive and provides diminishing returns. The key is to sample strategically so you retain the traces that matter most.
Trace Sampling Strategies
The 1/1/1 Rule for Cost Control
A practical cost framework for observability at scale: keep metrics for 1 year (small, aggregated, cheap), keep logs for 1 month (large, detailed, expensive), keep traces for 1 week (very large, per-request, most expensive). Adjust based on compliance requirements, but this baseline prevents the common mistake of storing everything forever. Use tiered storage: hot (SSD, fast queries) for recent data, warm (HDD) for older data, cold (object storage like S3) for archival.
| Telemetry Type | Volume per Service | Storage Cost Driver | Primary Cost Control |
|---|---|---|---|
| Metrics | ~1,000-50,000 time series | Cardinality (unique label combinations) | Reduce labels, aggregate across instances, use recording rules for expensive queries |
| Logs | ~1-100 GB per day | Volume (bytes stored and indexed) | Sampling, dynamic log levels, structured logging with selective indexing, aggressive retention |
| Traces | ~10-500 GB per day | Volume (spans stored per request) | Tail-based sampling, span filtering (drop health checks), shorter retention |
Metric Aggregation and Recording Rules
1. Identify expensive PromQL queries that run frequently (dashboard panels queried every 30 seconds by many viewers)
2. Create recording rules that pre-compute these queries and store the result as a new time series: e.g., record: job:http_requests:rate5m, expr: sum(rate(http_requests_total[5m])) by (job)
3. Replace dashboard queries with the pre-computed recording rule metrics for faster load times and reduced Prometheus load
4. Use metric relabeling to drop unnecessary labels at scrape time — this prevents high-cardinality labels from ever entering storage
5. Implement metric federation for multi-cluster setups: each cluster runs local Prometheus, a global Prometheus scrapes aggregated recording rules from each cluster
6. Consider Mimir, Thanos, or Cortex for long-term storage: they use object storage (S3/GCS) for cost-effective retention beyond the 15-day local Prometheus default
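What a recording rule precomputes can be illustrated in plain Python. This toy mirrors the shape of sum(rate(http_requests_total[5m])) by (job): per-instance request rates collapse into one cheap series per job, so dashboards read a single precomputed series instead of re-aggregating hundreds at view time:

```python
from collections import defaultdict

def record_job_rate(per_instance_rates):
    """Aggregate per-(job, instance) request rates into one
    coarse-grained series per job, dropping the instance label."""
    by_job = defaultdict(float)
    for (job, _instance), rate in per_instance_rates.items():
        by_job[job] += rate
    return dict(by_job)

raw = {("api", "pod-1"): 120.0, ("api", "pod-2"): 95.5, ("worker", "pod-3"): 12.0}
print(record_job_rate(raw))  # {'api': 215.5, 'worker': 12.0}
```

In real Prometheus this aggregation runs on the server at the rule's evaluation interval; the point of the sketch is only to show which labels survive.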
Cost Monitoring for Observability Itself
Monitor the cost and performance of your observability infrastructure with the same rigor you monitor production services. Track: total time series in Prometheus (tsdb head series), log ingestion rate (bytes per second), trace storage growth (GB per day), query latency for dashboard panels. Set alerts when observability costs exceed budget thresholds. At Google, the observability infrastructure team has its own SLOs and error budgets. If monitoring is down, you are flying blind — treat it as a P1.
Storing Raw Metrics for User-Level Debugging
A common and expensive mistake is trying to use metrics for per-user or per-request debugging by adding high-cardinality labels (user_id, session_id, request_id). This creates millions of time series and crashes Prometheus. Metrics are for aggregate patterns ("5% of requests are failing"). Logs and traces are for individual request debugging ("why did user 12345's request fail?"). Use the right tool for the right job.
Observability is a core topic in SRE, platform engineering, and backend engineering interviews at all FAANG companies. Expect system design questions like "design the monitoring for a payment service" where you must specify metrics (RED method), dashboards, alerting strategy (SLO-based), log aggregation, and distributed tracing. Debugging scenarios are common: "here is a set of symptoms, walk me through how you would find the root cause." Senior-level interviews test your understanding of trade-offs: cardinality management, sampling strategies, cost optimization, and choosing between self-hosted and SaaS observability platforms.
Try this question: Ask about the team's current observability stack, alert-to-resolution time, on-call burden (pages per week), and whether they use SLO-based alerting. These questions reveal operational maturity.
Strong answer: Candidates who describe the debugging workflow (metrics → logs → traces → root cause). Candidates who discuss high-cardinality observability for subset debugging. Candidates who mention SLO-based alerting and explain burn rates. Candidates who proactively discuss cost and cardinality management at scale.
Red flags: Candidates who equate observability with "setting up Prometheus and Grafana" without discussing traces or structured logging. Candidates who cannot explain why averaging percentiles is wrong. Candidates who suggest alerting on CPU usage as a primary strategy.
Key takeaways
A user reports that checkout is slow, but your service dashboard shows healthy p50 and p95 latency. How would you investigate?
Check p99 and p99.9 latency — the problem likely affects a small subset of requests hidden by lower percentiles. Break down latency by endpoint, region, database shard, or customer tier to find the affected segment. Pull a trace for a slow request to see which downstream service or query is the bottleneck. The issue may only affect requests routed to a specific shard or region, invisible to global aggregates.
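A quick numeric illustration of how p50 and p95 can look healthy while users suffer: here 1.5% of requests (say, those routed to a bad shard) take 8 seconds, and only p99 reveals them. The nearest-rank percentile function is a simple sketch, not a production implementation:

```python
def percentile(values, p):
    """Nearest-rank percentile on a sorted copy (toy implementation)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# 1,000 requests: 985 fast ones at ~50ms, 15 hitting a bad shard at 8s.
latencies_ms = [50] * 985 + [8000] * 15

print(percentile(latencies_ms, 50))  # 50   -- p50 looks healthy
print(percentile(latencies_ms, 95))  # 50   -- p95 looks healthy
print(percentile(latencies_ms, 99))  # 8000 -- the tail only shows at p99
```

This is also why you break latency down by shard, region, or customer tier: the global p99 tells you a tail exists, but only a segmented view tells you which subset of traffic is in it.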
Your Prometheus instance is consuming 32 GB of memory and crashing. What are the most likely causes and how do you fix them?
The most likely cause is high cardinality — too many unique time series from labels with many values (user_id, request_id, IP address). Check the tsdb head series count with the Prometheus /tsdb-status page. Fix by: (1) removing high-cardinality labels using metric_relabel_configs, (2) using recording rules to pre-aggregate expensive queries, (3) reducing scrape targets or scrape frequency, (4) for long-term scaling, migrate to Mimir or Thanos which use object storage instead of local memory.
Explain the difference between head-based and tail-based trace sampling and when you would use each.
Head-based sampling decides at request start (randomly keep X% of traces). Simple but discards interesting traces — errors and slow requests are sampled at the same rate as normal requests. Tail-based sampling decides after request completion, keeping 100% of error traces and high-latency traces while sampling normal traces at a lower rate. Use tail-based sampling in production to ensure you always have traces for the requests that matter most. Head-based is simpler and suitable for development or low-traffic services where you can afford 100% sampling.
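A tail-based sampling decision can be sketched as a post-completion predicate. The thresholds, field names, and 1% baseline are illustrative assumptions, not any collector's actual defaults:

```python
import random

def tail_sample(trace, baseline_rate=0.01, latency_slo_ms=500):
    """Tail-based sampling: decide AFTER the request completes.
    Always keep errors and SLO-violating traces; keep only a small
    random share of normal traffic."""
    if trace["error"]:
        return True                       # keep 100% of error traces
    if trace["duration_ms"] > latency_slo_ms:
        return True                       # keep 100% of slow traces
    return random.random() < baseline_rate  # ~1% of normal traffic

random.seed(0)  # seeded so the example is deterministic
traces = [
    {"trace_id": "a1", "error": True,  "duration_ms": 120},
    {"trace_id": "b2", "error": False, "duration_ms": 900},
    {"trace_id": "c3", "error": False, "duration_ms": 45},
]
kept = [t["trace_id"] for t in traces if tail_sample(t)]
print(kept)  # always includes 'a1' (error) and 'b2' (slow)
```

A head-based sampler, by contrast, would have made a random keep/drop decision before knowing that a1 errored or that b2 was slow, which is exactly why tail-based sampling retains the traces that matter.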
💡 Analogy
Think of observability as a doctor's diagnostic toolkit. Metrics are like vital signs — heart rate, blood pressure, temperature — quick numerical checks that tell you immediately if something is off. Logs are like patient history — detailed records of every event, symptom, and treatment, which you review to understand what happened over time. Traces are like an MRI scan — they let you see exactly what is happening inside the body at each layer, revealing the precise location and nature of the problem. You need all three: vitals tell you something is wrong, history tells you what happened, and the MRI shows you exactly where the problem is. A doctor who only checks vital signs might know the patient is sick but not why. A doctor who only reads history misses the current crisis. A doctor who orders an MRI for every patient goes bankrupt. The art of observability, like medicine, is knowing which tool to reach for at each stage of diagnosis.
⚡ Core Idea
Observability is the ability to understand the internal state of a system from its external outputs. Metrics provide quantitative signals (the "what"), logs provide qualitative context (the "when and how"), and traces provide causal chains across distributed components (the "where"). True observability enables you to debug novel failures you never anticipated without deploying new code.
🎯 Why It Matters
Without observability, debugging distributed systems is guesswork. A microservice architecture with 200 services generates failure modes that no one can predict. Metrics tell you something is wrong, logs tell you what happened, and traces show you the exact request path that failed. The difference between a 10-minute incident resolution and a 4-hour outage is whether your observability lets you ask the right questions in real time.