Monitoring Fundamentals: RED, USE & Golden Signals

A FAANG-depth guide to the monitoring frameworks every SRE must master — the RED method for request-driven services, the USE method for infrastructure resources, and Google's Four Golden Signals that map directly to SLOs. Covers Prometheus architecture and PromQL, production dashboard design, monitoring anti-patterns, monitoring at FAANG scale (Google Monarch, Netflix Atlas, Uber M3), and a step-by-step guide to building your first production monitoring stack. Includes real numbers: Google monitors 100B+ time series, Netflix monitors ~2B metrics, and a typical startup handles 10K-100K time series.

🎯 Key Takeaways
Monitoring is the nervous system of SRE — without it, SLOs are guesses, error budgets are fictional, incident detection relies on user reports, and capacity planning is fortune-telling. Every other SRE practice depends on monitoring being in place first.
Use RED (Rate, Errors, Duration) for request-driven services to monitor the user experience, and USE (Utilization, Saturation, Errors) for infrastructure resources to find bottlenecks. The Four Golden Signals (Latency, Traffic, Errors, Saturation) unify both into the minimum metric set that maps to SLO compliance.
Prometheus + Grafana is the industry-standard open-source monitoring stack. Understand the four metric types (Counter, Gauge, Histogram, Summary), use histograms over summaries, manage label cardinality ruthlessly, and add Thanos when you outgrow a single instance.
Dashboard design follows the 5-second rule: an on-call engineer at 3 AM must determine system health within 5 seconds. Use a three-level hierarchy (executive overview, service dashboards, component dashboards) with no more than 12-16 panels per dashboard.
Alert on symptoms (SLO violations), not causes (CPU > 80%). Use multi-window burn-rate alerts to catch both fast-burn incidents and slow-burn degradation. Every alert must link to a runbook. If more than 50% of alerts are non-actionable, your alerting is broken — fix it before alert fatigue causes a missed incident.


Why Monitoring Is the Foundation of SRE

You cannot manage what you cannot measure. This maxim, attributed to Peter Drucker, is the foundational truth of Site Reliability Engineering. Monitoring is not a feature you bolt onto a production system after launch — it is the nervous system of your entire operation. Without monitoring, every other SRE practice collapses: SLOs become aspirational guesses because you have no data to measure compliance, error budgets are fictional because you cannot calculate burn rate, incident response degrades to users reporting outages on Twitter, and capacity planning becomes fortune-telling.

At its core, monitoring answers four fundamental questions that every SRE must be able to answer at any moment: Is the system working right now? How fast is it working? Are users experiencing errors? Is the system close to capacity? These questions map directly to the Four Golden Signals (latency, traffic, errors, saturation) that Google codified in the SRE book, and they form the backbone of every monitoring framework discussed in this entry.

Monitoring vs Observability

Monitoring tells you WHEN something is wrong — it watches known metrics and fires alerts when thresholds are breached. Observability tells you WHY something is wrong — it lets you ask arbitrary questions of your system using metrics, logs, and traces. Monitoring is a subset of observability. This entry focuses on monitoring fundamentals; the observability entry covers the broader discipline including distributed tracing and log aggregation.

Consider what happens when monitoring fails or does not exist. In 2012, Knight Capital lost $440 million in 45 minutes because no automated monitoring existed to detect that their trading volume had spiked to anomalous levels. A human noticed the problem visually on a trading screen — not an alert, not a dashboard, not a metric threshold. If a simple metric alert on "order volume exceeds 3 standard deviations from historical average" had existed, automated circuit breakers could have stopped the bleeding within two minutes, saving over $400 million. This is not a theoretical scenario; it is a $440 million argument for monitoring.

The Monitoring Hierarchy of Needs

Think of monitoring like Maslow's hierarchy: you must satisfy the lower levels before the upper levels provide value. Level 1: Collection (are you gathering data?). Level 2: Visualization (can you see the data?). Level 3: Alerting (does the data tell you when something is wrong?). Level 4: Analysis (can you explore the data to find root causes?). Level 5: Prediction (can the data forecast problems before they happen?). Most teams get stuck at Level 2 with dashboards that nobody watches.

What monitoring gives you in production

  • SLO compliance measurement — Without metrics, your SLOs are aspirational wishes. Monitoring provides the raw SLI data that feeds your SLO calculations and error budget tracking. A 99.9% latency SLO requires sub-second latency measurement at the p99 level across every request.
  • Incident detection speed — The mean time to detect (MTTD) directly determines your mean time to resolve (MTTR). If detection takes 30 minutes and resolution takes 15 minutes, your MTTR is 45 minutes. If monitoring detects the issue in 30 seconds, your MTTR drops to 15.5 minutes — a 65% improvement from detection alone.
  • Capacity planning data — Monitoring historical resource utilization trends lets you predict when you will exhaust capacity. If your database CPU grows 2% per week and you are at 60%, you have roughly 20 weeks before you hit 100%. Without monitoring, you discover capacity limits when the system falls over.
  • Change impact verification — Every deployment, configuration change, and infrastructure update should produce a measurable signal. Monitoring lets you compare before-and-after metrics to confirm that a change had the intended effect and did not introduce regressions.
  • Cost attribution — Resource utilization metrics let you attribute infrastructure costs to specific services, teams, and features. At FAANG scale, this matters enormously: a 5% efficiency improvement in a service running on 10,000 machines saves 500 machines worth of compute.
  • Regulatory compliance evidence — Industries like finance and healthcare require audit trails proving system availability and performance. Monitoring data provides the evidence needed for SOC 2, HIPAA, and PCI-DSS compliance audits.
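The capacity-planning arithmetic above (2% weekly growth from 60% CPU) reduces to a one-line linear forecast. A minimal sketch — the helper name is hypothetical, not from any monitoring library:

```python
def weeks_until_exhaustion(current_pct: float, growth_pct_per_week: float) -> float:
    """Linear forecast: weeks until utilization reaches 100%."""
    if growth_pct_per_week <= 0:
        return float("inf")  # flat or shrinking usage never exhausts
    return (100.0 - current_pct) / growth_pct_per_week

# Database CPU at 60%, growing 2 percentage points per week -> 20 weeks of headroom.
headroom_weeks = weeks_until_exhaustion(60, 2)
```

Real growth is rarely perfectly linear, but even this crude forecast turns a raw utilization gauge into an actionable deadline.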

The Three Monitoring Frameworks

The industry has converged on three complementary monitoring frameworks: RED (Rate, Errors, Duration) for request-driven services, USE (Utilization, Saturation, Errors) for infrastructure resources, and Google's Four Golden Signals (Latency, Traffic, Errors, Saturation) as a unified overlay. Mastering all three gives you a complete vocabulary for monitoring any production system at any scale.

Monitoring at scale is a systems problem in itself. Google monitors over 100 billion time series across its infrastructure. Netflix tracks approximately 2 billion metrics from its streaming platform. Even a typical startup with a handful of microservices generates 10,000 to 100,000 time series. The monitoring system must be more reliable than the systems it monitors — if your monitoring goes down during an incident, you are flying blind. This is why FAANG companies build dedicated monitoring infrastructure teams and treat the monitoring stack as a Tier-0 service.

The RED Method: Rate, Errors, Duration

The RED method was coined by Tom Wilkie (Grafana Labs, formerly Weaveworks) as a simple, memorable framework for monitoring request-driven services — any service that responds to requests from users or other services. RED stands for Rate (how many requests per second), Errors (how many of those requests are failing), and Duration (how long those requests take). If you can answer these three questions for every service in your architecture, you have a solid foundation for user-facing monitoring.

RED is specifically designed for the application layer. It does not tell you about CPU utilization or disk space — that is the USE method's job. RED tells you about the user experience: are requests being served, are they succeeding, and are they fast enough? This user-centric perspective is what makes RED so valuable for SRE work. Your SLOs are almost certainly defined in terms of RED metrics: request success rate (errors), response latency (duration), and throughput capacity (rate).

graph LR
  subgraph "RED Dashboard Layout"
    direction TB
    subgraph "Row 1: Rate"
      R1[Requests/sec<br/>Current vs Historical]
      R2[Rate by Endpoint<br/>Top 10 Breakdown]
      R3[Rate by Status Code<br/>2xx / 3xx / 4xx / 5xx]
    end
    subgraph "Row 2: Errors"
      E1[Error Rate %<br/>5xx / Total Requests]
      E2[Error Rate by Type<br/>Timeout / 5xx / Connection]
      E3[Error Budget Burn<br/>Current vs Allowed]
    end
    subgraph "Row 3: Duration"
      D1[Latency p50 / p95 / p99<br/>Over Time]
      D2[Latency by Endpoint<br/>Slowest Endpoints]
      D3[Latency Distribution<br/>Histogram Heatmap]
    end
  end

A production RED dashboard layout organized in three rows. Each row focuses on one signal with three panels: current state, breakdown by dimension, and comparison to baseline or budget.

Signal | What to Measure | Prometheus Metric Type | Example PromQL
Rate | Requests per second, broken down by endpoint and status code | Counter (http_requests_total) | rate(http_requests_total[5m])
Errors | Failed requests as a percentage of total requests | Counter (http_requests_total{status=~"5.."}) | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
Duration | Request latency at p50, p95, and p99 percentiles | Histogram (http_request_duration_seconds) | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

Why p99 Matters More Than Average Latency

If your average latency is 50ms but your p99 is 2 seconds, one in every hundred users waits 40 times longer than average. At 1,000 requests per second, that is 10 users every second having a terrible experience. Amazon found that every 100ms of latency costs 1% of sales. Always monitor p50 (median), p95 (most users), and p99 (worst case that still represents real users). Ignore average latency in production — it hides outliers.
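How thoroughly an average hides the tail is easy to demonstrate with plain arithmetic. The numbers below are invented to mirror the scenario described: 1 in 100 requests is 40x slower than the rest.

```python
# 99 fast requests at 50 ms, 1 slow request at 2000 ms.
latencies_ms = [50] * 99 + [2000]

avg = sum(latencies_ms) / len(latencies_ms)  # ~70 ms -- looks healthy
# Simple upper-rank pick: index 99 of 100 sorted values = the worst 1%.
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]  # 2000 ms -- the real tail
```

The average barely moves (69.5 ms) while one user per hundred waits two full seconds; only the percentile exposes it.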

When to use the RED method

  • HTTP/gRPC API services — Any service that receives and responds to requests. This includes web servers, API gateways, microservices, and GraphQL endpoints. RED is the primary monitoring framework for these services.
  • Load balancers and proxies — Nginx, HAProxy, Envoy, and cloud load balancers all process requests. Their RED metrics (connection rate, error rate, response time) are critical for understanding user-facing performance.
  • Database query layers — Treat the database as a service that receives query requests. Measure queries per second (rate), failed queries (errors), and query latency (duration). This is especially valuable for connection poolers like PgBouncer.
  • Message consumers — While not request-driven in the HTTP sense, message queue consumers process messages at a rate, may fail (errors), and have processing duration. RED applies with minor terminology adjustments.

RED Does Not Monitor Infrastructure

A common mistake is using only RED metrics and ignoring the underlying infrastructure. Your API might show healthy RED metrics while the database server is at 95% CPU utilization — one traffic spike away from collapse. RED tells you the patient feels fine; USE (next section) tells you the patient's blood pressure is dangerously high. You need both.

Labeling Strategy for RED Metrics

Always include at minimum these labels on your RED metrics: service name, endpoint/route, HTTP method, response status code, and deployment version. This lets you slice metrics by any dimension during an incident. Avoid high-cardinality labels like user ID or request ID — these explode your time series count and crash your monitoring system. A service with 100 endpoints, 4 methods, and 5 status codes generates 2,000 time series per metric. Add a user_id label with 1 million users and you get 2 billion time series.
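Cardinality is multiplicative across labels, which is why one careless label is catastrophic. A quick sketch of the arithmetic (hypothetical helper name):

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Time series per metric = product of each label's distinct value counts."""
    return prod(label_cardinalities.values())

safe = series_count({"endpoint": 100, "method": 4, "status": 5})  # 2,000 series
exploded = series_count({"endpoint": 100, "method": 4, "status": 5,
                         "user_id": 1_000_000})  # 2,000,000,000 series
```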

The USE Method: Utilization, Saturation, Errors

The USE method was created by Brendan Gregg, a performance engineer at Netflix (formerly Sun Microsystems and Joyent), as a systematic approach to analyzing the performance of any system resource. USE stands for Utilization (how busy is the resource), Saturation (how much extra work is queued because the resource is busy), and Errors (how many error events the resource has experienced). While RED focuses on the user experience at the application layer, USE focuses on the infrastructure layer — the CPUs, memory, disks, and network interfaces that your applications depend on.

The key insight of USE is that every performance problem is ultimately a resource problem. If your API latency suddenly doubles, the root cause is almost always one of these: CPU saturation (not enough compute), memory pressure (swapping or garbage collection), disk I/O bottleneck (slow reads or writes), or network saturation (bandwidth exhaustion or connection limits). USE gives you a systematic checklist for investigating each resource rather than guessing.

Resource | Utilization Metric | Saturation Metric | Error Metric
CPU | CPU usage % (user + system time across all cores) | Run queue length (processes waiting for CPU) | Machine check exceptions, thermal throttling events
Memory | Used memory / total memory (excluding cache/buffers) | Swap usage, OOM kill count, page fault rate | ECC memory errors, allocation failures
Disk I/O | Disk busy % (time spent doing I/O operations) | I/O queue depth (requests waiting for disk) | Disk read/write errors, SMART warnings, filesystem errors
Network | Bandwidth utilization (bytes in + bytes out / capacity) | TCP retransmit rate, socket backlog depth | Interface errors (CRC, frame), packet drops, connection resets
File Descriptors | Open FDs / max FDs (ulimit) | Threads blocked on FD allocation | EMFILE/ENFILE errors (too many open files)
Connection Pools | Active connections / pool max size | Connection wait queue length, connection wait time | Connection timeout errors, pool exhaustion events

Utilization vs Saturation: The Crucial Difference

A CPU at 80% utilization might be perfectly healthy — it means you are getting good value from your hardware. But if the CPU run queue length is 50 (meaning 50 processes are waiting for CPU time), the system is saturated and users will experience latency. Utilization tells you how busy a resource is; saturation tells you whether that busyness is causing queuing delays. A highway at 80% capacity flows smoothly. A highway at 80% capacity with a 3-mile backup at every on-ramp is saturated. Same utilization, very different user experience.

Applying USE to diagnose a slow API

1. Start with CPU: Check utilization (top, mpstat) and saturation (vmstat procs column, /proc/loadavg). If CPU utilization is above 80% with a run queue above 2x the core count, CPU is your bottleneck. Profile the application to find hot code paths.

2. Check memory: Look at utilization (free -h, /proc/meminfo) and saturation (vmstat swap columns, dmesg for OOM kills). If the system is swapping or experiencing frequent page faults, memory pressure is causing I/O wait that masquerades as disk latency.

3. Check disk I/O: Use iostat -x to see disk utilization (%util) and saturation (avgqu-sz for queue depth). If disk utilization is above 70% with a queue depth above 2, disk I/O is your bottleneck. Check if the application is doing synchronous writes or if a logging framework is flushing too aggressively.

4. Check network: Use sar -n DEV for bandwidth utilization and ss -s for socket statistics. Look for TCP retransmits (indicates packet loss or congestion) and socket backlog growth (indicates connection handling cannot keep up with incoming requests).

5. Check application-level resources: Connection pool utilization (active/max), thread pool saturation (queued tasks), and file descriptor usage (lsof | wc -l). These application resources often saturate before system resources because their limits are configured conservatively.

6. Correlate findings: The bottleneck is almost always at ONE resource. If CPU is saturated but memory and disk are idle, do not add more memory — optimize CPU usage. Fix the actual bottleneck, then re-measure to see if a new bottleneck has emerged.


USE Method Checklist: Print This Out

For every physical and logical resource in your system, answer three questions: (1) What is its utilization? (2) What is its saturation? (3) Is it producing errors? If you cannot answer any of these questions, you have a monitoring gap. Brendan Gregg maintains a USE method checklist at his website with specific commands for Linux, Solaris, and other operating systems. This checklist should be in every SRE's toolkit.
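The checklist's three questions per resource can be sketched as a tiny triage function. Everything here — the structure, the field names, and the thresholds — is an illustrative assumption, not a standard tool; real limits depend on the resource (e.g. run queue vs. core count, as discussed above).

```python
def use_triage(resources: dict[str, dict]) -> list[str]:
    """Apply the USE questions in priority order: errors, then saturation,
    then utilization. Returns one finding per unhealthy resource."""
    findings = []
    for name, r in resources.items():
        if r.get("errors", 0) > 0:
            findings.append(f"{name}: ERRORS ({r['errors']} events) -- investigate first")
        elif r.get("saturation", 0) > r.get("saturation_limit", 0):
            findings.append(f"{name}: SATURATED (queue depth {r['saturation']})")
        elif r.get("utilization", 0) > 0.8:  # illustrative threshold
            findings.append(f"{name}: high utilization ({r['utilization']:.0%})")
    return findings

snapshot = {
    # run queue 12 with a limit of 8 (2x a hypothetical 4-core box) -> saturated
    "cpu":  {"utilization": 0.85, "saturation": 12, "saturation_limit": 8, "errors": 0},
    "disk": {"utilization": 0.30, "saturation": 0,  "saturation_limit": 2, "errors": 0},
}
```

Running the triage on this snapshot flags only the CPU: disk utilization is low and its queue is empty, so it is not the bottleneck.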

RED + USE = Complete Picture

RED and USE are complementary, not competing. RED monitors the symptoms (users are seeing slow responses). USE monitors the causes (the database disk is saturated). In an incident, start with RED to confirm the user impact, then use USE to find the bottleneck. Think of RED as the check engine light on your car dashboard and USE as the mechanic's diagnostic tool under the hood.

The Four Golden Signals

In Chapter 6 of the Google SRE book, the authors present the Four Golden Signals as the minimum set of metrics that every service should monitor: Latency, Traffic, Errors, and Saturation. These four signals were distilled from years of operating some of the largest production systems in the world. Google's position is unambiguous: if you can only measure four things about your user-facing system, measure these four things. Everything else is secondary.

The Four Golden Signals are not a third framework competing with RED and USE. They are a unifying layer that borrows from both. Latency and Errors come from RED (Duration and Errors). Traffic comes from RED (Rate). Saturation comes from USE. Google's contribution was recognizing that these four specific metrics are the most important subset and that they map directly to SLO compliance.

Golden Signal | Definition | Maps to RED/USE | SLO Connection
Latency | Time to serve a request. Must distinguish between successful and failed request latency — a fast error is not a success. | RED: Duration | Latency SLO: 99% of requests complete in under 200ms. Measure as histogram percentiles (p50, p95, p99).
Traffic | Demand placed on your system. For a web service: HTTP requests per second. For a streaming system: bytes per second. For a database: queries per second. | RED: Rate | Traffic is the denominator in your SLO calculations. High traffic with low errors = healthy. High traffic with rising errors = investigate.
Errors | Rate of requests that fail. Includes explicit failures (HTTP 5xx), implicit failures (HTTP 200 with wrong content), and policy violations (responses slower than your SLO threshold). | RED: Errors | Availability SLO: 99.9% of requests succeed. Error budget = 0.1% = ~43 minutes of total failure per month.
Saturation | How full your service is. Emphasize the most constrained resource (CPU, memory, disk, network). Saturation is the leading indicator that predicts failures before they happen. | USE: Saturation | Capacity planning: if CPU saturation trends hit 80% in 4 weeks, provision more capacity now.

The Hidden Golden Signal: Failed Request Latency

Google specifically warns that you must measure latency for successful AND failed requests separately. A service that returns errors in 1ms and successful responses in 500ms has an average latency of 250ms — which looks great. But users are experiencing 500ms latency. If errors suddenly spike, average latency drops and your dashboard shows "improvement" while users suffer. Always filter latency metrics by response status.
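The averaging trap is easy to reproduce with invented numbers matching the scenario above: fast errors drag the blended mean down, so an error spike reads as an "improvement".

```python
def mean_latency_ms(successes: int, errors: int) -> float:
    """Blended mean latency: successes take 500 ms, errors return in 1 ms."""
    return (successes * 500 + errors * 1) / (successes + errors)

healthy = mean_latency_ms(successes=1000, errors=0)    # 500.0 ms
degraded = mean_latency_ms(successes=500, errors=500)  # ~250 ms -- looks "better"
```

Half the requests failing cut the blended mean in half; latency filtered by status code would have shown successful requests unchanged at 500 ms and a 50% error rate.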

graph TD
  subgraph "Four Golden Signals → SLO Mapping"
    L[Latency<br/>p99 < 200ms] -->|feeds| LSLO[Latency SLO<br/>99% requests < 200ms]
    T[Traffic<br/>requests/sec] -->|denominator| ESLO[Availability SLO<br/>99.9% success rate]
    E[Errors<br/>5xx rate] -->|numerator| ESLO
    S[Saturation<br/>CPU/Memory/Disk] -->|predicts| CAP[Capacity Planning<br/>Scale before breach]
    LSLO -->|violation triggers| EB[Error Budget<br/>Burn Rate Alert]
    ESLO -->|violation triggers| EB
    EB -->|depleted| FREEZE[Feature Freeze<br/>Reliability Focus]
    EB -->|healthy| SHIP[Ship Features<br/>Accelerate Velocity]
  end

How the Four Golden Signals feed into SLO compliance, error budget tracking, and the reliability-vs-velocity decision loop.

Implementing the Four Golden Signals in practice

  • Instrument at the service boundary — Measure the golden signals at the point where your service receives requests (the ingress) and where it makes requests to dependencies (the egress). Middleware or service mesh sidecars (like Envoy in Istio) can capture these automatically without code changes.
  • Use histograms for latency, not averages — Prometheus histograms with carefully chosen bucket boundaries let you calculate any percentile. Recommended buckets for a web service: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s. Adjust based on your SLO threshold.
  • Include implicit errors in error counting — A request that returns HTTP 200 but contains stale data, wrong results, or partial content is an error from the user perspective. Instrument your application to emit error metrics for semantic failures, not just HTTP status codes.
  • Measure saturation with headroom, not just current value — A saturation metric that shows "70% CPU" is less useful than one that shows "at current growth rate, CPU hits 100% in 3 weeks." Linear regression on utilization trends converts current metrics into actionable capacity planning data.
  • Aggregate signals per service in a single dashboard row — Each service should have exactly one dashboard row with four panels: latency (time series with p50/p95/p99 lines), traffic (requests per second), errors (error rate percentage), saturation (most constrained resource). This is your at-a-glance health indicator.
  • Alert on golden signals, not on causes — Alert on "p99 latency exceeded SLO threshold for 5 minutes" rather than "CPU > 80%." Cause-based alerts fire even when users are not impacted (noisy) and miss novel failure modes. Symptom-based alerts catch any issue that affects users, regardless of root cause.
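The "headroom, not just current value" bullet above amounts to fitting a trend line to utilization samples. A minimal least-squares sketch — the sample values and function name are invented for illustration:

```python
def weeks_to_saturation(samples: list[float]) -> float:
    """Fit a least-squares line to weekly utilization samples (%) and
    forecast how many weeks past the last sample it crosses 100%."""
    n = len(samples)
    mean_x, mean_y = (n - 1) / 2, sum(samples) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
             / sum((x - mean_x) ** 2 for x in range(n)))
    if slope <= 0:
        return float("inf")  # flat or falling trend: no projected saturation
    intercept = mean_y - slope * mean_x
    return (100 - intercept) / slope - (n - 1)

# Four weekly samples growing 2%/week from 54% to 60% -> ~20 weeks of headroom.
forecast = weeks_to_saturation([54, 56, 58, 60])
```

In production you would run this as a recording rule (Prometheus's predict_linear() does the same job), but the underlying arithmetic is just this regression.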

The Google Approach to Alert Design

Google recommends alerting on multi-window, multi-burn-rate error budget consumption. Instead of alerting on "error rate > 1%," alert on "the error budget is being consumed 14.4x faster than sustainable over a 1-hour window and 6x faster than sustainable over a 6-hour window." This approach catches both fast-burn incidents (server crash affecting all traffic) and slow-burn degradation (gradual memory leak increasing latency) while maintaining a low false-positive rate.
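The burn-rate arithmetic behind this recommendation fits in a few lines. The thresholds (14.4x over 1 hour, 6x over 6 hours) follow the description above; treating them as two independent triggers, plus the function names, are assumptions of this sketch:

```python
def burn_rate(error_ratio: float, budget_fraction: float) -> float:
    """How many times faster than sustainable the error budget is burning.
    budget_fraction is the allowed error ratio, e.g. 0.001 for a 99.9% SLO."""
    return error_ratio / budget_fraction

def should_page(err_1h: float, err_6h: float, budget: float = 0.001) -> bool:
    # Fast burn: 14.4x over 1 hour. Slow burn: 6x over 6 hours.
    return (burn_rate(err_1h, budget) >= 14.4 or
            burn_rate(err_6h, budget) >= 6.0)

fires = should_page(err_1h=0.02, err_6h=0.02)    # 2% errors = 20x burn -> page
quiet = should_page(err_1h=0.005, err_6h=0.002)  # 5x / 2x burn -> no page
```

A 14.4x burn rate over one hour means the monthly budget would be gone in roughly two days, which is why it warrants an immediate page, while slower burns can wait for a ticket.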

Prometheus Deep Dive

Prometheus is the open-source monitoring system that has become the de facto standard for cloud-native infrastructure monitoring. Originally built at SoundCloud in 2012, inspired by Google's internal Borgmon monitoring system, Prometheus was the second project accepted into the Cloud Native Computing Foundation (CNCF) after Kubernetes. Today, Prometheus is used by virtually every company running Kubernetes and is the backbone of monitoring at organizations from small startups to companies processing billions of metrics per day.

Understanding Prometheus architecture is essential for any SRE role because it underpins how metrics are collected, stored, queried, and alerted on in most modern infrastructure. Even if you use a managed monitoring service like Datadog or New Relic, understanding Prometheus helps you reason about metric types, query patterns, and performance tradeoffs.

graph LR
  subgraph "Targets"
    App1[App Server<br/>/metrics endpoint]
    App2[Database Exporter<br/>node_exporter]
    App3[Kubernetes<br/>kube-state-metrics]
  end
  subgraph "Prometheus Server"
    SD[Service Discovery<br/>K8s, Consul, DNS] --> Scrape[Scrape Manager<br/>Pull metrics every 15-30s]
    Scrape --> TSDB[Time Series DB<br/>Local SSD Storage]
    TSDB --> Rules[Recording Rules<br/>Pre-compute expensive queries]
    TSDB --> Alerts[Alert Rules<br/>Evaluate conditions]
  end
  subgraph "Downstream"
    Alerts --> AM[Alertmanager<br/>Dedup, Group, Route]
    AM --> PD[PagerDuty / Slack]
    TSDB --> Grafana[Grafana<br/>Dashboards]
    TSDB --> API[HTTP API<br/>PromQL Queries]
    TSDB --> Fed[Federation<br/>Global Prometheus]
  end
  App1 --> Scrape
  App2 --> Scrape
  App3 --> Scrape

Prometheus architecture: targets expose /metrics endpoints, the scrape manager pulls metrics on a configurable interval, the TSDB stores time series on local disk, and downstream consumers (Grafana, Alertmanager, federation) query the data via PromQL.

Metric Type | What It Measures | Example | Query Pattern
Counter | Cumulative value that only goes up (resets on restart) | http_requests_total, node_cpu_seconds_total | rate() or increase() to get per-second rate or total increase over a window
Gauge | Value that can go up or down at any time | node_memory_available_bytes, temperature_celsius | Direct value, or delta()/deriv() for rate of change
Histogram | Observations bucketed into configurable ranges | http_request_duration_seconds with le buckets | histogram_quantile() for percentiles, _count and _sum for averages
Summary | Client-side calculated quantiles (not recommended for new code) | go_gc_duration_seconds | Pre-calculated quantiles; cannot be aggregated across instances — prefer histograms

Always Use Histograms Over Summaries

Summaries calculate percentiles on the client side, which means you cannot aggregate them across multiple instances. If you have 10 API server replicas and each reports its own p99 latency via a Summary, you cannot calculate the overall p99 across all replicas — the average of percentiles is not the percentile of the average. Histograms store raw bucket counts that can be aggregated with histogram_quantile(), giving you correct cross-instance percentiles. The only trade-off is that histogram accuracy depends on bucket boundary selection.
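The aggregation point can be shown concretely: merge per-instance cumulative bucket counts, then read a quantile off the combined distribution. This is a simplified stand-in for what histogram_quantile() does (it omits the interpolation within a bucket), with invented bucket data:

```python
from collections import Counter

def merge_buckets(instances: list[dict[float, int]]) -> dict[float, int]:
    """Sum cumulative bucket counts (keyed by le upper bound) across instances.
    This addition is exactly what Summaries cannot do with pre-computed quantiles."""
    total = Counter()
    for inst in instances:
        total.update(inst)
    return dict(total)

def quantile(buckets: dict[float, int], q: float) -> float:
    """Smallest bucket upper bound covering quantile q (no interpolation)."""
    total = buckets[max(buckets)]  # highest le bucket holds the full count
    rank = q * total
    for le in sorted(buckets):
        if buckets[le] >= rank:
            return le
    return max(buckets)

a = {0.1: 90, 0.5: 99, 2.5: 100}   # instance A: mostly fast requests
b = {0.1: 10, 0.5: 50, 2.5: 100}   # instance B: much slower distribution
p99 = quantile(merge_buckets([a, b]), 0.99)  # correct fleet-wide p99 bucket
```

Averaging instance A's p99 (~0.5s) with instance B's (~2.5s) would give a meaningless ~1.5s; merging the buckets first correctly places the fleet-wide p99 in the 2.5s bucket.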

Prometheus production best practices

  • Scrape interval: 15-30 seconds — The default 15s scrape interval is appropriate for most services. Going below 10s increases storage costs and network traffic without meaningful alerting benefit. Going above 60s means your dashboards are too coarse-grained to diagnose short-lived issues.
  • Recording rules for expensive queries — If a dashboard panel or alert rule requires a complex PromQL query that scans millions of samples, create a recording rule that pre-computes the result every evaluation interval. Store the result as a new metric. This moves the computation cost from query time to ingestion time.
  • Label cardinality management — Every unique combination of metric name and label values creates a new time series. A metric with labels {method, endpoint, status, instance} across 100 endpoints, 4 methods, 5 status codes, and 50 instances creates 100,000 time series for ONE metric. Monitor your total time series count and set alerts at 80% of your Prometheus server capacity.
  • Retention and storage sizing — Prometheus stores data on local disk in 2-hour blocks that are compacted into larger blocks over time. Plan storage as: bytes per sample (1-2 bytes with compression) multiplied by samples per second multiplied by retention period in seconds. For 100K time series at a 15s scrape interval: ~6,667 samples/sec times 1.5 bytes times 1,296,000 seconds (15 days) = ~13 GB.
  • High availability with Thanos or Cortex — Single Prometheus servers are a single point of failure. For production, run two identical Prometheus servers scraping the same targets. Use Thanos Sidecar to upload blocks to object storage (S3/GCS) for long-term retention and global querying across multiple Prometheus instances.
  • Service discovery over static configuration — In Kubernetes, use the kubernetes_sd_config to automatically discover pods, services, and nodes. This eliminates manual target management and ensures new deployments are monitored immediately. Use relabel_configs to filter and transform discovered targets.
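The recording-rule advice above can be sketched as a Prometheus rule file. The rule names below follow the conventional level:metric:operations naming scheme and are illustrative, not prescribed:

```yaml
groups:
  - name: api_slo_rules
    interval: 30s
    rules:
      # Pre-compute the 5xx error ratio so dashboards and alerts
      # query one cheap series instead of scanning raw counters.
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      # Pre-compute p99 latency from histogram buckets.
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```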

The Pull vs Push Debate

Prometheus uses a pull model: the server scrapes targets. This is ideal for service monitoring because if a target stops responding to scrapes, Prometheus knows it is down. Push-based systems (like StatsD or Graphite) cannot distinguish between "the service is down" and "the service stopped sending data." However, for short-lived batch jobs, use the Prometheus Pushgateway — jobs push their metrics to the gateway before exiting, and Prometheus scrapes the gateway.
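For the batch-job case, a minimal Python sketch using the prometheus_client library — the gateway address, job name, and metric name are assumptions for illustration:

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest, push_to_gateway

# Batch jobs use a dedicated registry so only job-specific
# metrics are pushed, not the default process metrics.
registry = CollectorRegistry()
last_success = Gauge(
    "batch_job_last_success_timestamp_seconds",
    "Unix timestamp of the last successful run",
    registry=registry,
)
last_success.set_to_current_time()

def report(gateway: str = "pushgateway.monitoring:9091") -> None:
    # Push once before the job exits; Prometheus then scrapes
    # the gateway on its normal schedule. (Gateway address is hypothetical.)
    push_to_gateway(gateway, job="nightly-etl", registry=registry)

# The exposition format can be inspected locally without a gateway:
print(generate_latest(registry).decode())
```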

Building Production Dashboards

A monitoring system that collects perfect data but presents it poorly is nearly as useless as no monitoring at all. Dashboards are the primary interface between your metrics and the humans who need to act on them — SREs triaging incidents, managers reviewing service health, and developers validating deployments. A well-designed dashboard hierarchy lets anyone in the organization understand system health within seconds. A poorly designed dashboard wastes time, creates confusion, and leads to missed signals during incidents when speed matters most.

The cardinal rule of dashboard design is the 5-second rule: an on-call engineer woken at 3 AM should be able to determine whether the system is healthy or unhealthy within 5 seconds of looking at a dashboard. If the dashboard requires scrolling, mental arithmetic, or context that is not visible on the screen, it fails the 5-second rule. Everything else in dashboard design flows from this principle.

```mermaid
graph TD
  subgraph "Dashboard Hierarchy"
    L1[Level 1: Executive Overview<br/>All services - green/yellow/red<br/>Audience: VP Eng, On-Call Lead]
    L1 --> L2A[Level 2: Service Dashboard<br/>Golden Signals for Service A<br/>Audience: Service Owner, On-Call]
    L1 --> L2B[Level 2: Service Dashboard<br/>Golden Signals for Service B<br/>Audience: Service Owner, On-Call]
    L1 --> L2C[Level 2: Service Dashboard<br/>Golden Signals for Service C<br/>Audience: Service Owner, On-Call]
    L2A --> L3A[Level 3: Component Dashboard<br/>Database, Cache, Queue details<br/>Audience: On-Call debugging]
    L2A --> L3B[Level 3: Component Dashboard<br/>Deployment & Canary metrics<br/>Audience: On-Call debugging]
    L2B --> L3C[Level 3: Component Dashboard<br/>Dependency health & USE metrics<br/>Audience: On-Call debugging]
  end
```

Three-level dashboard hierarchy: executive overview (are we healthy?), service dashboards (Golden Signals per service), and component dashboards (deep-dive debugging). Navigation flows from overview to detail during incident triage.

Dashboard design principles

  • The 5-second rule — An on-call engineer at 3 AM should know if the system is healthy within 5 seconds. Use color coding (green/yellow/red), large stat panels for current values, and time series graphs with clear SLO threshold lines. If the engineer needs to read a legend or hover over data points to understand status, the dashboard has failed.
  • Top-down information hierarchy — The most important information goes at the top. Row 1: overall health (SLO compliance status). Row 2: Golden Signals (latency, traffic, errors, saturation). Row 3: key dependencies. Row 4: recent deployments and changes. Users should never need to scroll to find the "is it broken?" answer.
  • Consistent time ranges across panels — All panels on a single dashboard should share the same time range. If the latency graph shows the last 6 hours and the error graph shows the last 24 hours, correlation becomes impossible. Use Grafana's shared time range feature and avoid per-panel overrides except for historical comparison panels.
  • Show context, not just current values — A latency of 200ms means nothing without context. Is it usually 50ms (4x regression) or usually 190ms (normal)? Add historical baselines (same time last week), SLO threshold lines, and deployment markers to every time series graph. Context turns data into information.
  • One dashboard per service, one row per Golden Signal — Resist the temptation to create a single mega-dashboard with every metric. Each service gets its own dashboard with exactly four rows (latency, traffic, errors, saturation) plus dependency health and deployment markers. This structure is predictable: anyone who has seen one service dashboard can navigate any other.
  • No vanity metrics — Total requests served all time, total users registered, or uptime since last restart are vanity metrics that look impressive but provide zero operational value. Every panel on an operational dashboard should answer a question that an on-call engineer might ask during an incident. If no one would look at a panel during an outage, remove it.

The Wall of Graphs Anti-Pattern

Many teams create dashboards with 30+ panels covering every conceivable metric. These "wall of graphs" dashboards are overwhelming and useless during incidents because no human can scan 30 graphs for anomalies. The on-call engineer ends up ignoring the dashboard and looking at logs instead. Limit service dashboards to 12-16 panels maximum. Put deep-dive metrics on separate component dashboards that you drill into only when needed.

Building your first service dashboard in Grafana

  1. Create a new dashboard and set the default time range to "Last 1 hour" with auto-refresh every 30 seconds. Name it "[Service Name] - Golden Signals." Add a variable for environment (prod/staging) to enable environment switching without separate dashboards.
  2. Row 1 - Status Overview: Add a Stat panel showing current SLO compliance percentage (green above SLO, red below). Add a Stat panel showing current error budget remaining (percentage). Add a Stat panel showing active alerts count.
  3. Row 2 - Latency: Add a time series panel with p50, p95, and p99 latency lines. Add a horizontal threshold line at your SLO latency target (e.g., 200ms). Add a second panel showing a latency heatmap (Grafana's native heatmap from Prometheus histograms) to visualize the full distribution.
  4. Row 3 - Traffic and Errors: Add a time series panel showing requests per second, broken down by status code class (2xx, 4xx, 5xx). Add a time series panel showing error rate percentage with an SLO threshold line. Add annotations for deployments so you can correlate changes with error spikes.
  5. Row 4 - Saturation: Add panels for the most constrained resource for this service (CPU utilization for compute-heavy services, disk I/O for storage services, connection pool utilization for database-heavy services). Add capacity threshold lines at 70% (warning) and 90% (critical).
  6. Row 5 - Dependencies: Add panels showing the latency and error rate of every downstream dependency (database, cache, external APIs). These are the first thing to check when your service shows degradation — the root cause is often a dependency, not your own code.
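The panels above map to PromQL along these lines. This is a sketch; it assumes the conventional http_requests_total counter and http_request_duration_seconds histogram with a status label:

```promql
# Row 2 (Latency): p99 across all replicas from histogram buckets
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Row 3 (Traffic): requests per second, by status code
sum by (status) (rate(http_requests_total[5m]))

# Row 3 (Errors): error rate percentage, using a 5xx regex matcher
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
```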


Deployment Annotations Are Critical

Add annotations to your Grafana dashboards that mark every deployment, configuration change, and scaling event. During an incident, the single most valuable piece of information is "what changed?" Deployment annotations let you visually correlate a metric change with a code push in less than a second. Use Grafana annotations API in your CI/CD pipeline to automate this.
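A stdlib-only Python sketch of what such a CI/CD step might post to Grafana's /api/annotations HTTP endpoint — the Grafana URL, API token, service name, and tag names are assumptions:

```python
import json
import time
import urllib.request

def deployment_annotation(service: str, version: str) -> dict:
    # Grafana expects epoch milliseconds; tags let dashboards
    # filter annotations per service.
    return {
        "time": int(time.time() * 1000),
        "tags": ["deployment", service],
        "text": f"Deployed {service} {version}",
    }

def post_annotation(payload: dict, grafana_url: str, token: str) -> None:
    # Called from the CI/CD pipeline after a successful rollout.
    req = urllib.request.Request(
        f"{grafana_url}/api/annotations",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    urllib.request.urlopen(req)

payload = deployment_annotation("checkout", "v1.42.0")
print(payload["text"])
```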

Monitoring Anti-Patterns

Monitoring systems degrade over time in predictable ways. Like code, monitoring accumulates technical debt — dashboards become stale, alert rules proliferate, metrics grow unbounded, and the gap between what is monitored and what matters widens. Recognizing these anti-patterns early prevents your monitoring from becoming a liability rather than an asset. The worst outcome is a team that believes they are well-monitored because they have "lots of dashboards and alerts" when in reality their monitoring is noisy, incomplete, or measuring the wrong things.

The seven deadly monitoring anti-patterns

  • Dashboard rot — Dashboards created for a specific investigation or incident are never cleaned up. Over months, the Grafana instance accumulates hundreds of dashboards, most of which are outdated or reference metrics that no longer exist. No one knows which dashboards are authoritative. Fix: assign dashboard ownership, review quarterly, delete unused dashboards aggressively, and maintain a curated "official dashboards" folder per service.
  • Alert fatigue — Too many alerts fire too often, training the on-call team to ignore pages. Studies show that if more than 50% of alerts are non-actionable, engineers start treating ALL alerts as non-urgent. Fix: every alert must have a documented action (runbook link). If an alert fires and the on-call response is "acknowledge and do nothing," delete the alert. Aim for fewer than 2 pages per on-call shift.
  • Metric explosion — Careless labeling creates millions of time series that overwhelm Prometheus. A single metric with high-cardinality labels (user_id, request_id, trace_id) can generate more time series than the rest of the system combined. Fix: audit label cardinality monthly. Any label with more than 100 unique values should be reviewed. Move high-cardinality data to logs or traces, not metrics.
  • Monitoring the monitoring — The monitoring system itself becomes unreliable: Prometheus runs out of disk, Alertmanager loses connectivity, Grafana becomes unresponsive under load. During the one incident where monitoring matters most, it is unavailable. Fix: treat your monitoring stack as a Tier-0 service with its own SLOs, its own alerts (sent through a DIFFERENT channel), and its own capacity planning.
  • Vanity dashboards — Dashboards designed to impress executives or visitors with impressive-looking graphs that provide zero operational value. "Total requests served since launch" and "registered users" are product metrics, not operational metrics. Fix: every panel on an operational dashboard must answer a question an on-call engineer would ask during an incident. Product metrics belong on product dashboards.
  • Copy-paste alerting — Teams copy alert rules from blog posts or other teams without adjusting thresholds to their service characteristics. A "CPU > 80%" alert makes sense for a latency-sensitive API server but is noise for a batch processing job that intentionally uses 100% CPU. Fix: derive alert thresholds from your SLOs. Alert on SLO violations, not on arbitrary resource thresholds.
  • Missing the business layer — All monitoring covers infrastructure (CPU, memory, disk) and application (latency, errors) but nothing about business outcomes (orders placed, payments processed, sign-ups completed). An incident where the payment gateway returns 200 OK but charges $0 instead of $99 produces perfect RED metrics and zero revenue. Fix: instrument business-critical transactions as explicit metrics.
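A minimal sketch of instrumenting a business-critical transaction as the last bullet suggests, assuming the prometheus_client Python library; the metric names are illustrative:

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()

# Business-layer metrics: these catch the "200 OK but charged $0" class
# of failure that infrastructure and RED metrics cannot see.
payments_total = Counter(
    "payments_processed_total", "Payments processed", ["status"], registry=registry
)
payment_amount_cents = Counter(
    "payment_amount_cents_total", "Total amount charged, in cents", registry=registry
)

def record_payment(amount_cents: int) -> None:
    # Alert when payments_processed_total keeps increasing while
    # payment_amount_cents_total flatlines -- revenue has stopped.
    status = "ok" if amount_cents > 0 else "zero_amount"
    payments_total.labels(status=status).inc()
    payment_amount_cents.inc(amount_cents)

record_payment(9900)   # a normal $99.00 charge
record_payment(0)      # the silent-failure case
print(generate_latest(registry).decode())
```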

Alert Fatigue Kills

Alert fatigue is not merely annoying — it is dangerous. When engineers become desensitized to alerts, they delay response to real incidents. The 2003 Northeast blackout in the US and Canada was worsened because operators had so many alarms firing that they missed the critical ones indicating a cascading grid failure. In SRE, the same dynamic plays out: if your team receives 20 pages per shift and 18 are false positives, the 2 real incidents get the same delayed, skeptical response. The target is fewer than 2 actionable pages per 12-hour on-call shift.

| Anti-Pattern | Symptom | Root Cause | Fix |
|---|---|---|---|
| Dashboard rot | 200+ dashboards, nobody knows which to use | No ownership, no review cadence | Assign owners, quarterly review, delete unused |
| Alert fatigue | 15+ pages per on-call shift, most ignored | Threshold alerts without SLO basis | SLO-based alerting, runbook requirement, max 2 pages/shift target |
| Metric explosion | Prometheus OOM or slow queries | High-cardinality labels | Monthly cardinality audit, max 100 values per label |
| Monitoring the monitoring | Monitoring down during incidents | Monitoring treated as non-critical | Tier-0 SLO for monitoring stack, separate alert channel |
| Vanity dashboards | Pretty graphs, no operational value | Built for demos, not incidents | Every panel must answer an on-call question |

The Quarterly Monitoring Audit

Every quarter, review all dashboards and alerts with the on-call team. For each dashboard: when was it last viewed during an incident? If never, consider deleting it. For each alert: how many times did it fire? How many times was the response "do nothing"? If more than 50% of firings are non-actionable, fix the alert threshold or delete it. Track the total alert count and the actionable percentage as team health metrics.

The On-Call Handoff Test

A quick test for monitoring quality: can a new on-call engineer who has never seen the service before diagnose a problem using only the dashboards and runbooks? If they need to ask a senior engineer "which dashboard should I look at?" or "what does this graph mean?", your monitoring and documentation have failed. Run this test with every new team member and fix the gaps they reveal.

Monitoring at FAANG Scale

When your infrastructure spans millions of machines across dozens of datacenters serving billions of users, monitoring itself becomes one of the most challenging distributed systems problems you face. The monitoring system must ingest, store, query, and alert on more data than most companies' entire production workloads. This section examines how Google, Netflix, and Uber solved the monitoring-at-scale problem with purpose-built systems that handle 100 million to 100 billion time series.

The numbers are staggering. Google's internal monitoring system Monarch ingests over 100 billion time series. Netflix's Atlas handles approximately 2 billion metrics from their streaming infrastructure. Uber's M3 was built to handle their rapid growth from thousands to millions of time series within a few years. Each of these systems made fundamentally different architectural decisions, but they all share a common insight: off-the-shelf monitoring tools like Prometheus were not designed for this scale and must be either heavily extended or replaced.

| System | Company | Scale | Architecture | Key Innovation |
|---|---|---|---|---|
| Monarch | Google | 100B+ time series, petabytes of storage | Fully integrated with Borg (Kubernetes predecessor). In-memory zones + long-term Leaf storage. Global query federation across all datacenters. | Zone-local ingestion with global query: metrics are stored in the zone where they originate but can be queried globally. This avoids cross-datacenter write traffic. |
| Atlas | Netflix | ~2B metrics, ~2.5 petabytes per day ingested | In-memory dimensional time series database. Tags replace hierarchical metric names. Integration with Spectator client library. | Dimensional data model: metrics are identified by a name and an arbitrary set of key-value tags, enabling flexible slicing and dicing without pre-defined hierarchies. |
| M3 | Uber | Hundreds of millions of time series | Distributed TSDB (M3DB) + query engine (M3Query) + aggregation (M3Aggregator). Prometheus-compatible APIs. | Prometheus compatibility: M3 accepts the Prometheus remote-write protocol and speaks PromQL, allowing teams to keep their existing Prometheus setup while gaining horizontal scalability. |

Why Not Just Scale Prometheus?

Prometheus was designed as a single-server system with local disk storage. A single Prometheus instance can handle approximately 1-10 million time series depending on hardware. At FAANG scale, you need 10,000-100,000x that capacity. Solutions like Thanos and Cortex add horizontal scalability to Prometheus through remote storage backends and query federation, but they add significant operational complexity. At extreme scale, companies build purpose-built systems because the operational overhead of maintaining thousands of federated Prometheus instances exceeds the cost of building a custom solution.

Common patterns in FAANG-scale monitoring

  • Write-local, query-global — Metrics are written to local storage in the datacenter or region where they originate, avoiding cross-datacenter write amplification. Queries can span all regions when needed (e.g., "show me global error rate") through a federation layer that routes sub-queries to each region's local store.
  • Tiered storage with automatic downsampling — Recent data (last 24 hours) is stored at full resolution (every 15s sample). Older data is downsampled: 1-minute resolution for the last 7 days, 5-minute for 30 days, 1-hour for 1 year. This reduces storage costs by 100-1000x for historical data while preserving detail for recent incidents.
  • Pre-aggregation at ingestion — Instead of storing every individual time series and aggregating at query time, compute aggregates (sum, average, percentiles) during ingestion. If 10,000 instances report a metric, store the per-instance series AND pre-computed aggregates. Queries that need the aggregate hit a single series instead of scanning 10,000.
  • Separation of metrics, logs, and traces — At FAANG scale, metrics, logs, and traces are stored in different systems optimized for their access patterns. Metrics are time-series optimized (Monarch, Atlas, M3). Logs are append-optimized (Google Cloud Logging, ELK). Traces are graph-optimized (Jaeger, Zipkin). Correlation between pillars uses shared identifiers (trace ID, request ID).
  • Self-service with guardrails — Teams can create their own metrics and dashboards without approval, but cardinality limits and quota systems prevent any single team from overwhelming the shared monitoring infrastructure. If a team creates a metric with 1 million time series, their quota is charged and they are notified to reduce cardinality.
  • Monitoring system as a Tier-0 service — The monitoring infrastructure itself has dedicated SRE teams, its own SLOs (e.g., 99.99% ingest availability, query latency p99 under 5 seconds), and its own capacity planning. If monitoring goes down, EVERY service in the company loses visibility. This makes the monitoring system one of the most critical services to keep running.
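The storage arithmetic behind tiered downsampling can be sketched in a few lines of Python. The bytes-per-sample and resolution figures are the rough numbers quoted earlier in this guide, not measurements:

```python
# Rough storage estimate for one region's metrics under tiered retention.
SERIES = 1_000_000          # active time series (assumed fleet size)
BYTES_PER_SAMPLE = 1.5      # typical with Prometheus-style compression

def gb(series: int, resolution_s: int, days: int) -> float:
    # samples = series * (seconds retained / seconds between samples)
    samples = series * (days * 86_400 / resolution_s)
    return samples * BYTES_PER_SAMPLE / 1e9

full  = gb(SERIES, 15, 1)       # last 24h at full 15s resolution
tier1 = gb(SERIES, 60, 7)       # 7 days at 1-minute resolution
tier2 = gb(SERIES, 300, 30)     # 30 days at 5-minute resolution
tier3 = gb(SERIES, 3600, 365)   # 1 year at 1-hour resolution

naive = gb(SERIES, 15, 365)     # a full year at 15s resolution, for contrast
print(f"tiered: {full + tier1 + tier2 + tier3:,.0f} GB vs full-res: {naive:,.0f} GB")
```

The same function reproduces the single-server sizing example from earlier: 100K series at 15s for 15 days comes out to roughly 13 GB.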

Scale Numbers to Know for Interviews

When discussing monitoring at scale in interviews, cite these benchmarks: Google Monarch handles 100B+ time series. Netflix Atlas handles ~2B metrics with ~2.5PB/day ingest. A single Prometheus server handles 1-10M time series. Thanos/Cortex extends Prometheus to tens of millions. A typical startup generates 10K-100K time series. A mid-size company generates 1M-10M. These numbers help you calibrate architecture decisions to the actual scale of the problem.

Start Simple, Scale When Needed

Do not build a Monarch-scale monitoring system for a startup with 5 services. Start with a single Prometheus server and Grafana. When you outgrow one Prometheus instance (typically around 1-5M active time series), add Thanos for long-term storage and cross-cluster querying. Only consider custom solutions or managed services like Datadog when the operational overhead of self-hosted monitoring exceeds 1-2 full-time engineers.

Your First Production Monitoring Setup

Theory without practice is useless. This section provides a concrete, step-by-step guide to building a production monitoring stack from scratch. The goal is to take you from zero monitoring to a fully operational setup with metrics collection, storage, visualization, and alerting — all using open-source tools that are free to run and used by FAANG companies at scale. By the end of this section, you will have a monitoring stack that covers the RED method, the USE method, and the Four Golden Signals.

Step-by-step: From zero to production monitoring

  1. Instrument your application: Add a Prometheus client library to your service (available for Go, Java, Python, Node.js, Ruby, and more). Expose an HTTP endpoint at /metrics. At minimum, instrument three things: a Counter for total requests (http_requests_total with method, endpoint, status labels), a Histogram for request duration (http_request_duration_seconds with the same labels), and a Gauge for in-flight requests (http_requests_in_flight). These three metrics give you the RED signals: Rate (rate of the counter), Errors (rate of the counter filtered by 5xx status), Duration (histogram percentiles).
  2. Collect infrastructure metrics: Deploy node_exporter on every machine to expose USE method metrics — CPU utilization, memory usage, disk I/O, and network statistics. If running Kubernetes, deploy kube-state-metrics for cluster state (pod status, deployment replicas, resource requests/limits) and cAdvisor (built into kubelet) for container-level resource metrics.
  3. Set up Prometheus: Deploy a Prometheus server with service discovery configured for your environment (kubernetes_sd_config for Kubernetes, ec2_sd_config for AWS, file_sd_config for static targets). Set the scrape interval to 15 seconds. Configure retention for 15 days of local storage. This single instance handles up to 1-5 million active time series.
  4. Create recording rules: Write recording rules for your most common queries. Pre-compute the error rate (ratio of 5xx requests to total requests), the p99 latency, and the error budget burn rate. Recording rules reduce dashboard load times from seconds to milliseconds and make alert rules simpler to write and maintain.
  5. Build your Golden Signals dashboard: In Grafana, create one dashboard per service with four rows: Latency (p50/p95/p99 time series with SLO threshold line), Traffic (requests per second by status code), Errors (error rate percentage with SLO threshold), Saturation (most constrained resource). Add deployment annotations from your CI/CD pipeline.
  6. Configure SLO-based alerts: In Prometheus, create alerting rules based on SLO violations rather than arbitrary thresholds. Use multi-window burn-rate alerts: a fast-burn alert (14.4x budget consumption over 1 hour) for pages and a slow-burn alert (3x budget consumption over 3 days) for tickets. Route alerts through Alertmanager to PagerDuty for pages and Slack for warnings.
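Step 1 above might be implemented like this with the official prometheus_client library for Python; the label sets and bucket boundaries are illustrative choices, not prescriptions:

```python
import time
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram, generate_latest

registry = CollectorRegistry()

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "endpoint", "status"], registry=registry,
)
DURATION = Histogram(
    "http_request_duration_seconds", "Request duration",
    ["method", "endpoint", "status"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),  # tune to your latency SLO
    registry=registry,
)
IN_FLIGHT = Gauge(
    "http_requests_in_flight", "Requests currently being served", registry=registry
)

def handle_request(method: str, endpoint: str) -> str:
    # Wrap every handler: these three metrics yield the full RED set.
    IN_FLIGHT.inc()
    start = time.perf_counter()
    status = "500"  # assume failure until the handler succeeds
    try:
        result = "ok"   # stand-in for real handler logic
        status = "200"
        return result
    finally:
        REQUESTS.labels(method, endpoint, status).inc()
        DURATION.labels(method, endpoint, status).observe(time.perf_counter() - start)
        IN_FLIGHT.dec()

handle_request("GET", "/api/users")
# In a real service, expose generate_latest(registry) at /metrics for scraping.
print(generate_latest(registry).decode())
```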

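Step 6's multi-window burn-rate alerts might be expressed as Prometheus alerting rules like the following sketch. It assumes a 99.9% availability SLO (allowed error ratio 0.001) and recording rules that pre-compute the error ratio over each window; the window pairs follow the common burn-rate pattern, and the rule names and runbook URLs are illustrative:

```yaml
groups:
  - name: slo_burn_rate_alerts
    rules:
      # Fast burn: 14.4x consumes 2% of a 30-day error budget in one hour.
      # The short 5m window makes the alert reset quickly once fixed.
      - alert: ErrorBudgetFastBurn
        expr: |
          job:http_errors:ratio_rate1h > 14.4 * 0.001
          and
          job:http_errors:ratio_rate5m > 14.4 * 0.001
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning 14.4x too fast (1h window)"
      # Slow burn: 3x over 3 days exhausts the budget well before month end.
      - alert: ErrorBudgetSlowBurn
        expr: |
          job:http_errors:ratio_rate3d > 3 * 0.001
          and
          job:http_errors:ratio_rate6h > 3 * 0.001
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Error budget burning 3x too fast (3d window)"
```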

| Component | Tool | Purpose | Resource Requirements |
|---|---|---|---|
| Metrics Collection | Prometheus | Scrape, store, and query time series metrics | 2 CPU, 8 GB RAM, 100 GB SSD for ~500K time series with 15-day retention |
| Infrastructure Metrics | node_exporter + kube-state-metrics | Expose OS-level and Kubernetes cluster metrics | Minimal: 0.1 CPU, 64 MB RAM per node for node_exporter |
| Visualization | Grafana | Dashboards, graphs, and stat panels | 1 CPU, 2 GB RAM, 10 GB disk for dashboard storage |
| Alerting | Alertmanager | Alert deduplication, grouping, routing, and silencing | 0.5 CPU, 512 MB RAM; runs as a small sidecar |
| Long-term Storage | Thanos (when needed) | Object storage backend for retention beyond 15 days | Sidecar: 0.5 CPU, 1 GB RAM. Store gateway: 2 CPU, 4 GB RAM. S3/GCS costs vary. |
| Application Instrumentation | Prometheus client libraries | Expose /metrics endpoint from your application code | Negligible overhead: ~1-2% CPU, ~10 MB RAM for metric tracking |

The Starter Stack Recommendation

For teams starting from zero, deploy the kube-prometheus-stack Helm chart. It bundles Prometheus, Grafana, Alertmanager, node_exporter, and kube-state-metrics with sane defaults and pre-built dashboards for Kubernetes monitoring. You get a fully operational monitoring stack in under 30 minutes. Customize from there by adding application-specific metrics, recording rules, and SLO-based alerts.

Common Starter Mistakes

Three mistakes kill most first monitoring setups: (1) Not persisting Prometheus data — if Prometheus restarts and loses its storage, you lose all metrics history. Always use a PersistentVolumeClaim in Kubernetes. (2) Not configuring Alertmanager receivers — alerts that fire but go nowhere are worse than no alerts because they create a false sense of security. (3) Not instrumenting the application — relying solely on infrastructure metrics (CPU, memory) means you cannot detect application-level failures. Instrument your code from day one.

What to monitor on day one vs day thirty

  • Day 1: RED metrics for your primary service — Request rate, error rate, and latency percentiles. These three metrics tell you if users are happy. If you instrument nothing else, instrument these.
  • Day 1: USE metrics for your infrastructure — CPU, memory, disk, and network utilization and saturation via node_exporter. These tell you when you are running out of capacity.
  • Day 7: Golden Signals dashboard per service — A Grafana dashboard with the four golden signals for each service, plus deployment annotations. This becomes your single source of truth for service health.
  • Day 14: SLO-based alerting — Replace threshold alerts (CPU > 80%) with SLO-based alerts (error budget burn rate > 14.4x). This dramatically reduces alert noise while catching all user-impacting issues.
  • Day 30: Dependency monitoring and business metrics — Add latency and error rate monitoring for every downstream dependency. Add business transaction metrics (orders placed, payments processed). These fill the gaps that RED and USE leave.
  • Day 30: Runbooks for every alert — Every alert rule should link to a runbook that describes what the alert means, how to diagnose the issue, and what actions to take. An alert without a runbook is just noise.

Test Your Monitoring Before You Need It

Intentionally break your service in a staging environment and verify that your monitoring detects the failure. Introduce latency with a sleep statement and check if your latency alert fires. Return 500 errors and verify the error rate dashboard reflects the increase. Exhaust memory and confirm the saturation panel shows the spike. If your monitoring cannot detect intentional failures in staging, it will not detect real failures in production.

How this might come up in interviews

Monitoring fundamentals appear in virtually every SRE and infrastructure interview at FAANG companies. In system design interviews, you will be asked how you would monitor a system you just designed — interviewers expect you to name specific metrics (p99 latency, error rate, CPU saturation), specific tools (Prometheus, Grafana), and specific frameworks (RED, USE, Golden Signals). In behavioral interviews, you will be asked about incidents where monitoring helped (or where lack of monitoring caused problems). In coding interviews at SRE-focused roles, you may be asked to write PromQL queries or design a monitoring schema for a given service. Knowing RED, USE, and the Four Golden Signals gives you a structured, vocabulary-rich answer for any monitoring question.

Common questions:

  • How would you set up monitoring for a new microservice from scratch? Walk me through the specific metrics, tools, and dashboards you would create.
  • Explain the RED method and USE method. How are they different, and when would you use each? How do they relate to the Four Golden Signals?
  • Your service has a p99 latency SLO of 200ms. Describe how you would set up alerting for this SLO using Prometheus, including the specific alert rules and thresholds you would configure.
  • You inherit a monitoring system with 500 dashboards and 200 alert rules. The on-call team reports severe alert fatigue. How would you triage and fix this monitoring system?
  • Compare Prometheus with a managed monitoring solution like Datadog. What are the tradeoffs, and how would you decide which to use for a new project?
  • Describe how monitoring works at FAANG scale. What happens when you have billions of time series? What architectural patterns enable monitoring at that scale?

Questions to ask the interviewer: What monitoring stack do you use, and how large is it (time-series count)? How do you handle alert fatigue — what is your typical pages-per-shift rate? Is your alerting SLO-based or threshold-based? These questions show that you understand monitoring at a practical, operational level and that you are evaluating the maturity of their monitoring infrastructure.

Strong answer: Immediately structuring the answer using RED/USE/Golden Signals. Explaining why histograms are preferred over summaries. Mentioning label cardinality as a Prometheus scaling concern. Describing multi-window burn-rate alerts for SLO-based alerting. Knowing the three-level dashboard hierarchy. Citing concrete scale numbers and tool recommendations appropriate to the scale being discussed.

Red flags: Monitoring only CPU and memory without application-level metrics. Not knowing what PromQL is. Suggesting average latency instead of percentiles. Setting alerts on arbitrary thresholds (CPU > 80%) without connecting to SLOs. Not understanding the difference between monitoring and observability. Describing dashboards with 50+ panels as "comprehensive monitoring."

Key takeaways

  • Monitoring is the nervous system of SRE — without it, SLOs are guesses, error budgets are fictional, incident detection relies on user reports, and capacity planning is fortune-telling. Every other SRE practice depends on monitoring being in place first.
  • Use RED (Rate, Errors, Duration) for request-driven services to monitor the user experience, and USE (Utilization, Saturation, Errors) for infrastructure resources to find bottlenecks. The Four Golden Signals (Latency, Traffic, Errors, Saturation) unify both into the minimum metric set that maps to SLO compliance.
  • Prometheus + Grafana is the industry-standard open-source monitoring stack. Understand the four metric types (Counter, Gauge, Histogram, Summary), use histograms over summaries, manage label cardinality ruthlessly, and add Thanos when you outgrow a single instance.
  • Dashboard design follows the 5-second rule: an on-call engineer at 3 AM must determine system health within 5 seconds. Use a three-level hierarchy (executive overview, service dashboards, component dashboards) with no more than 12-16 panels per dashboard.
  • Alert on symptoms (SLO violations), not causes (CPU > 80%). Use multi-window burn-rate alerts to catch both fast-burn incidents and slow-burn degradation. Every alert must link to a runbook. If more than 50% of alerts are non-actionable, your alerting is broken — fix it before alert fatigue causes a missed incident.
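
The multi-window burn-rate pattern mentioned above can be sketched as a Prometheus rule. This example assumes a 99.9% availability SLO (0.1% error budget) and follows the fast-burn pattern popularized by the Google SRE Workbook: a 14.4x burn rate must be visible in both a long (1h) and a short (5m) window before paging, which filters out brief blips while still catching fast incidents. Metric names and the runbook URL are illustrative:

```yaml
# Sketch: fast-burn alert for a 99.9% SLO. At 14.4x burn, a 30-day error
# budget is exhausted in roughly 2 days, so this should page immediately.
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 14.4 * 0.001
          )
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at >14.4x; 30-day budget gone in ~2 days"
          runbook_url: "https://runbooks.example.com/slo-fast-burn"
```

A slow-burn companion rule (e.g. 6x over 6h and 30m windows, routed to a ticket queue instead of a page) covers gradual degradation.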

Before you move on: can you answer these?

Explain the difference between the RED method and the USE method. When would you apply each, and how do they complement each other during incident investigation?

RED (Rate, Errors, Duration) monitors the user experience of request-driven services — how many requests are coming in, how many are failing, and how long they take. USE (Utilization, Saturation, Errors) monitors infrastructure resources — how busy CPUs, memory, disks, and networks are, whether work is queuing, and whether hardware errors are occurring. During an incident, you start with RED to confirm user impact (symptoms), then switch to USE to find the bottleneck (root cause). RED tells you the patient is sick; USE tells you which organ is failing. They are complementary, not competing — you need both for complete production monitoring.

Name the Four Golden Signals and explain how each maps to an SLO. Why does Google recommend separating latency measurements for successful and failed requests?

The Four Golden Signals are Latency (response time, feeds latency SLO like "99% of requests under 200ms"), Traffic (request rate, provides the denominator for SLO calculations), Errors (failure rate, feeds availability SLO like "99.9% of requests succeed"), and Saturation (resource fullness, feeds capacity planning to prevent future SLO violations). Google recommends separating successful and failed request latency because errors often return much faster than successful responses. If error rate spikes, average latency drops, making the dashboard falsely show "improvement" while users experience degradation. Measuring latency only for successful requests gives an accurate view of user experience.
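
The success/failure latency split can be expressed directly in PromQL. A sketch as recording rules, assuming the duration histogram carries a status label (names are illustrative); the success-only p99 is the one that should feed the latency SLO:

```yaml
# Sketch: keep successful and failed request latency separate so that a
# spike of fast-failing errors cannot drag the "latency" panel down.
groups:
  - name: latency-by-outcome
    rules:
      - record: job:http_request_duration_seconds:p99_success
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{status!~"5.."}[5m]))
          )
      - record: job:http_request_duration_seconds:p99_error
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{status=~"5.."}[5m]))
          )
```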

You are building monitoring for a new microservice. Walk through the specific metrics you would instrument and why, covering both RED and USE perspectives.

For RED: (1) A Counter http_requests_total with labels for method, endpoint, and status code — rate() gives request rate, filtering by 5xx gives error rate. (2) A Histogram http_request_duration_seconds with the same labels — histogram_quantile() gives p50/p95/p99 latency. (3) A Gauge http_requests_in_flight for connection concurrency monitoring. For USE: deploy node_exporter for system metrics (CPU utilization/saturation, memory usage, disk I/O, network bandwidth). Add application-specific resource metrics: database connection pool utilization (active/max), thread pool queue depth, and cache hit ratio. For business context: add a Counter for the primary business transaction (e.g., orders_placed_total). This gives complete coverage of the user experience (RED), infrastructure health (USE), and business outcomes.
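
The three RED queries fall straight out of the instrumentation described above. A sketch as Prometheus recording rules (the metric names match the answer; the rule names and 5-minute windows are illustrative choices):

```yaml
# Sketch: RED method derived from http_requests_total (Counter) and
# http_request_duration_seconds (Histogram).
groups:
  - name: red-metrics
    rules:
      # Rate: requests per second, broken down by endpoint
      - record: job:http_requests:rate5m
        expr: sum by (endpoint) (rate(http_requests_total[5m]))
      # Errors: fraction of requests returning 5xx
      - record: job:http_requests:error_ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
      # Duration: p99 latency computed from the histogram buckets
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```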

🧠Mental Model

💡 Analogy

Monitoring is like the instrument panel in an aircraft cockpit. Pilots do not fly by looking out the window — they rely on instruments that measure altitude (utilization), airspeed (rate), engine warnings (errors), and fuel level (saturation). The RED method is like monitoring the passenger experience: are flights departing on time (rate), are passengers arriving at their destination (errors), and how long is the journey taking (duration)? The USE method is like monitoring the aircraft systems: engine temperature (utilization), fuel reserves (saturation), and mechanical warnings (errors). The Four Golden Signals combine both perspectives into the minimum instrument set needed to fly safely. Just as a pilot scans instruments in a specific pattern (aviate, navigate, communicate), an SRE scans golden signals in a specific pattern: latency first (is the service responsive?), errors second (are requests succeeding?), traffic third (what is the load?), saturation fourth (how much headroom remains?). You need both the passenger perspective and the aircraft systems perspective to fly safely — and you need both RED and USE to operate production systems safely.

⚡ Core Idea

Monitoring provides the nervous system of your production infrastructure — without it, every other SRE practice (SLOs, error budgets, incident response, capacity planning) is impossible. The RED method monitors the user experience of request-driven services. The USE method monitors the health of infrastructure resources. The Four Golden Signals unify both into the minimum metric set that maps directly to SLO compliance. Master all three frameworks to monitor any system at any scale.

🎯 Why It Matters

Every FAANG SRE interview will test your ability to describe how you would monitor a production system. Knowing RED, USE, and the Four Golden Signals gives you a structured answer for any monitoring question. More importantly, in production, monitoring is the difference between detecting an incident in 30 seconds (automated alert) and detecting it in 30 minutes (user complaint on Twitter). That 29.5-minute difference directly translates to user impact, revenue loss, and error budget consumption.
