A FAANG-depth guide to the monitoring frameworks every SRE must master — the RED method for request-driven services, the USE method for infrastructure resources, and Google's Four Golden Signals that map directly to SLOs. Covers Prometheus architecture and PromQL, production dashboard design, monitoring anti-patterns, monitoring at FAANG scale (Google Monarch, Netflix Atlas, Uber M3), and a step-by-step guide to building your first production monitoring stack. Includes real numbers: Google monitors 100B+ time series, Netflix monitors ~2B metrics, and a typical startup handles 10K-100K time series.
You cannot manage what you cannot measure. This maxim, attributed to Peter Drucker, is the foundational truth of Site Reliability Engineering. Monitoring is not a feature you bolt onto a production system after launch — it is the nervous system of your entire operation. Without monitoring, every other SRE practice collapses: SLOs become aspirational guesses because you have no data to measure compliance, error budgets are fictional because you cannot calculate burn rate, incident response degrades to users reporting outages on Twitter, and capacity planning becomes fortune-telling.
At its core, monitoring answers four fundamental questions that every SRE must be able to answer at any moment: Is the system working right now? How fast is it working? Are users experiencing errors? Is the system close to capacity? These questions map directly to the Four Golden Signals (latency, traffic, errors, saturation) that Google codified in the SRE book, and they form the backbone of every monitoring framework discussed in this entry.
Monitoring vs Observability
Monitoring tells you WHEN something is wrong — it watches known metrics and fires alerts when thresholds are breached. Observability tells you WHY something is wrong — it lets you ask arbitrary questions of your system using metrics, logs, and traces. Monitoring is a subset of observability. This entry focuses on monitoring fundamentals; the observability entry covers the broader discipline including distributed tracing and log aggregation.
Consider what happens when monitoring fails or does not exist. In 2012, Knight Capital lost $440 million in 45 minutes because no automated monitoring existed to detect that their trading volume had spiked to anomalous levels. A human noticed the problem visually on a trading screen — not an alert, not a dashboard, not a metric threshold. If a simple metric alert on "order volume exceeds 3 standard deviations from historical average" had existed, automated circuit breakers could have stopped the bleeding within two minutes, saving over $400 million. This is not a theoretical scenario; it is a $440 million argument for monitoring.
The Monitoring Hierarchy of Needs
Think of monitoring like Maslow's hierarchy: you must satisfy the lower levels before the upper levels provide value. Level 1: Collection (are you gathering data?). Level 2: Visualization (can you see the data?). Level 3: Alerting (does the data tell you when something is wrong?). Level 4: Analysis (can you explore the data to find root causes?). Level 5: Prediction (can the data forecast problems before they happen?). Most teams get stuck at Level 2 with dashboards that nobody watches.
What monitoring gives you in production
The Three Monitoring Frameworks
The industry has converged on three complementary monitoring frameworks: RED (Rate, Errors, Duration) for request-driven services, USE (Utilization, Saturation, Errors) for infrastructure resources, and Google's Four Golden Signals (Latency, Traffic, Errors, Saturation) as a unified overlay. Mastering all three gives you a complete vocabulary for monitoring any production system at any scale.
Monitoring at scale is a systems problem in itself. Google monitors over 100 billion time series across its infrastructure. Netflix tracks approximately 2 billion metrics from its streaming platform. Even a typical startup with a handful of microservices generates 10,000 to 100,000 time series. The monitoring system must be more reliable than the systems it monitors — if your monitoring goes down during an incident, you are flying blind. This is why FAANG companies build dedicated monitoring infrastructure teams and treat the monitoring stack as a Tier-0 service.
The RED method was coined by Tom Wilkie (Grafana Labs, formerly Weaveworks) as a simple, memorable framework for monitoring request-driven services — any service that responds to requests from users or other services. RED stands for Rate (how many requests per second), Errors (how many of those requests are failing), and Duration (how long those requests take). If you can answer these three questions for every service in your architecture, you have a solid foundation for user-facing monitoring.
RED is specifically designed for the application layer. It does not tell you about CPU utilization or disk space — that is the USE method's job. RED tells you about the user experience: are requests being served, are they succeeding, and are they fast enough? This user-centric perspective is what makes RED so valuable for SRE work. Your SLOs are almost certainly defined in terms of RED metrics: request success rate (errors), response latency (duration), and throughput capacity (rate).
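Conceptually, the three RED signals fall out of a stream of per-request records. A stdlib-only Python sketch of that idea (the `Request` record and `red_summary` helper are hypothetical; a real service would use a metrics client library rather than recomputing from raw requests):

```python
from dataclasses import dataclass

@dataclass
class Request:
    endpoint: str
    status: int        # HTTP status code
    duration_ms: float

def red_summary(requests: list, window_seconds: float) -> dict:
    """Compute Rate, Errors, Duration for one observation window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p99: the duration below which 99% of requests fall.
    p99 = durations[min(len(durations) - 1, int(0.99 * len(durations)))]
    return {
        "rate_rps": total / window_seconds,   # Rate
        "error_ratio": errors / total,        # Errors (5xx / total)
        "p99_ms": p99,                        # Duration
    }

# 99 successes at 20 ms plus one fast failure, over a 60 s window:
reqs = [Request("/api", 200, 20.0)] * 99 + [Request("/api", 500, 5.0)]
print(red_summary(reqs, 60.0))   # ~1.67 req/s, 1% errors, p99 = 20 ms
```

In Prometheus terms, the same three numbers come from a request counter and a duration histogram, as the table below shows.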
graph LR
subgraph "RED Dashboard Layout"
direction TB
subgraph "Row 1: Rate"
R1[Requests/sec<br/>Current vs Historical]
R2[Rate by Endpoint<br/>Top 10 Breakdown]
R3[Rate by Status Code<br/>2xx / 3xx / 4xx / 5xx]
end
subgraph "Row 2: Errors"
E1[Error Rate %<br/>5xx / Total Requests]
E2[Error Rate by Type<br/>Timeout / 5xx / Connection]
E3[Error Budget Burn<br/>Current vs Allowed]
end
subgraph "Row 3: Duration"
D1[Latency p50 / p95 / p99<br/>Over Time]
D2[Latency by Endpoint<br/>Slowest Endpoints]
D3[Latency Distribution<br/>Histogram Heatmap]
end
end
A production RED dashboard layout organized in three rows. Each row focuses on one signal with three panels: current state, breakdown by dimension, and comparison to baseline or budget.
| Signal | What to Measure | Prometheus Metric Type | Example PromQL |
|---|---|---|---|
| Rate | Requests per second, broken down by endpoint and status code | Counter (http_requests_total) | rate(http_requests_total[5m]) |
| Errors | Failed requests as a percentage of total requests | Counter (http_requests_total{status=~"5.."}) | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) |
| Duration | Request latency at p50, p95, and p99 percentiles | Histogram (http_request_duration_seconds) | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) |
Why p99 Matters More Than Average Latency
If your average latency is 50ms but your p99 is 2 seconds, one in every hundred users waits 40 times longer than average. At 1,000 requests per second, that is 10 users every second having a terrible experience. Amazon found that every 100ms of latency costs 1% of sales. Always monitor p50 (median), p95 (most users), and p99 (worst case that still represents real users). Ignore average latency in production — it hides outliers.
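The masking effect is easy to verify with invented numbers, here 990 fast requests and 10 slow outliers:

```python
import statistics

# 990 requests at 30 ms plus 10 outliers at 2000 ms (hypothetical numbers).
latencies_ms = [30.0] * 990 + [2000.0] * 10

mean = statistics.mean(latencies_ms)      # 49.7 ms: looks healthy
ranked = sorted(latencies_ms)
p99 = ranked[int(0.99 * len(ranked))]     # nearest-rank p99: 2000 ms
# The average says "fast"; the p99 says 1 in 100 users waits 2 seconds.
```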
When to use the RED method
RED Does Not Monitor Infrastructure
A common mistake is using only RED metrics and ignoring the underlying infrastructure. Your API might show healthy RED metrics while the database server is at 95% CPU utilization — one traffic spike away from collapse. RED tells you the patient feels fine; USE (next section) tells you the patient's blood pressure is dangerously high. You need both.
Labeling Strategy for RED Metrics
Always include at minimum these labels on your RED metrics: service name, endpoint/route, HTTP method, response status code, and deployment version. This lets you slice metrics by any dimension during an incident. Avoid high-cardinality labels like user ID or request ID — these explode your time series count and crash your monitoring system. A service with 100 endpoints, 4 methods, and 5 status codes generates 2,000 time series per metric. Add a user_id label with 1 million users and you get 2 billion time series.
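The cardinality arithmetic is worth internalizing, since every distinct label combination creates a distinct time series:

```python
# Series per metric = product of the cardinalities of its labels.
endpoints, methods, statuses = 100, 4, 5
base_series = endpoints * methods * statuses     # 2,000 series: fine

users = 1_000_000
exploded_series = base_series * users            # 2,000,000,000 series: fatal
```

This is why unbounded identifiers (user ID, request ID, container hash) belong in logs or traces, never in metric labels.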
The USE method was created by Brendan Gregg, a performance engineer at Netflix (formerly Sun Microsystems and Joyent), as a systematic approach to analyzing the performance of any system resource. USE stands for Utilization (how busy is the resource), Saturation (how much extra work is queued because the resource is busy), and Errors (how many error events the resource has experienced). While RED focuses on the user experience at the application layer, USE focuses on the infrastructure layer — the CPUs, memory, disks, and network interfaces that your applications depend on.
The key insight of USE is that every performance problem is ultimately a resource problem. If your API latency suddenly doubles, the root cause is almost always one of these: CPU saturation (not enough compute), memory pressure (swapping or garbage collection), disk I/O bottleneck (slow reads or writes), or network saturation (bandwidth exhaustion or connection limits). USE gives you a systematic checklist for investigating each resource rather than guessing.
| Resource | Utilization Metric | Saturation Metric | Error Metric |
|---|---|---|---|
| CPU | CPU usage % (user + system time across all cores) | Run queue length (processes waiting for CPU) | Machine check exceptions, thermal throttling events |
| Memory | Used memory / total memory (excluding cache/buffers) | Swap usage, OOM kill count, page fault rate | ECC memory errors, allocation failures |
| Disk I/O | Disk busy % (time spent doing I/O operations) | I/O queue depth (requests waiting for disk) | Disk read/write errors, SMART warnings, filesystem errors |
| Network | Bandwidth utilization (bytes in + bytes out / capacity) | TCP retransmit rate, socket backlog depth | Interface errors (CRC, frame), packet drops, connection resets |
| File Descriptors | Open FDs / max FDs (ulimit) | Threads blocked on FD allocation | EMFILE/ENFILE errors (too many open files) |
| Connection Pools | Active connections / pool max size | Connection wait queue length, connection wait time | Connection timeout errors, pool exhaustion events |
Utilization vs Saturation: The Crucial Difference
A CPU at 80% utilization might be perfectly healthy — it means you are getting good value from your hardware. But if the CPU run queue length is 50 (meaning 50 processes are waiting for CPU time), the system is saturated and users will experience latency. Utilization tells you how busy a resource is; saturation tells you whether that busyness is causing queuing delays. A highway at 80% capacity flows smoothly. A highway at 80% capacity with a 3-mile backup at every on-ramp is saturated. Same utilization, very different user experience.
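The highway analogy can be made concrete with the textbook M/M/1 queue, where mean response time is W = 1/(mu - lambda). This is an illustrative model, not a claim about any particular system:

```python
def mm1_response_time_ms(service_rate_rps: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: W = 1 / (mu - lambda)."""
    arrival_rate = utilization * service_rate_rps
    return 1000.0 / (service_rate_rps - arrival_rate)

# A server that completes 1000 req/s (1 ms service time):
for rho in (0.50, 0.80, 0.95, 0.99):
    print(f"{rho:.0%} utilized -> {mm1_response_time_ms(1000, rho):.0f} ms")
# 50% -> 2 ms, 80% -> 5 ms, 95% -> 20 ms, 99% -> 100 ms
```

Response time is still tolerable at 80% utilization but grows without bound as utilization approaches 100%, which is exactly why saturation (queuing), not utilization, is the signal users feel.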
Applying USE to diagnose a slow API
1. Start with CPU: Check utilization (top, mpstat) and saturation (vmstat procs column, /proc/loadavg). If CPU utilization is above 80% with a run queue above 2x the core count, CPU is your bottleneck. Profile the application to find hot code paths.
2. Check memory: Look at utilization (free -h, /proc/meminfo) and saturation (vmstat swap columns, dmesg for OOM kills). If the system is swapping or experiencing frequent page faults, memory pressure is causing I/O wait that masquerades as disk latency.
3. Check disk I/O: Use iostat -x to see disk utilization (%util) and saturation (avgqu-sz for queue depth). If disk utilization is above 70% with a queue depth above 2, disk I/O is your bottleneck. Check if the application is doing synchronous writes or if a logging framework is flushing too aggressively.
4. Check network: Use sar -n DEV for bandwidth utilization and ss -s for socket statistics. Look for TCP retransmits (indicates packet loss or congestion) and socket backlog growth (indicates connection handling cannot keep up with incoming requests).
5. Check application-level resources: Connection pool utilization (active/max), thread pool saturation (queued tasks), and file descriptor usage (lsof | wc -l). These application resources often saturate before system resources because their limits are configured conservatively.
6. Correlate findings: The bottleneck is almost always at ONE resource. If CPU is saturated but memory and disk are idle, do not add more memory — optimize CPU usage. Fix the actual bottleneck, then re-measure to see if a new bottleneck has emerged.
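The final step, correlating findings down to one bottleneck, can be encoded as a trivial rule. The thresholds and the `find_bottleneck` helper below are hypothetical illustrations, not universal limits:

```python
# Per-resource (utilization, saturation) readings: utilization normalized
# 0-1, saturation as queue depth relative to capacity (e.g. run queue /
# cores). Thresholds are illustrative, not universal.
THRESHOLDS = {"cpu": (0.80, 2.0), "memory": (0.90, 1.0),
              "disk": (0.70, 2.0), "network": (0.80, 1.0)}

def find_bottleneck(readings):
    """Return the single most saturated resource, or None if all healthy."""
    suspects = []
    for name, (util, sat) in readings.items():
        u_limit, s_limit = THRESHOLDS[name]
        if util >= u_limit and sat >= s_limit:
            # Rank by how far saturation exceeds its limit.
            suspects.append((sat / s_limit, name))
    return max(suspects)[1] if suspects else None
```

Requiring both high utilization AND queuing before flagging a resource mirrors the section's core distinction: a busy resource is only a problem once work starts waiting on it.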
USE Method Checklist: Print This Out
For every physical and logical resource in your system, answer three questions: (1) What is its utilization? (2) What is its saturation? (3) Is it producing errors? If you cannot answer any of these questions, you have a monitoring gap. Brendan Gregg maintains a USE method checklist at his website with specific commands for Linux, Solaris, and other operating systems. This checklist should be in every SRE's toolkit.
RED + USE = Complete Picture
RED and USE are complementary, not competing. RED monitors the symptoms (users are seeing slow responses). USE monitors the causes (the database disk is saturated). In an incident, start with RED to confirm the user impact, then use USE to find the bottleneck. Think of RED as the check engine light on your car dashboard and USE as the mechanic's diagnostic tool under the hood.
In Chapter 6 of the Google SRE book, the authors present the Four Golden Signals as the minimum set of metrics that every service should monitor: Latency, Traffic, Errors, and Saturation. These four signals were distilled from years of operating some of the largest production systems in the world. Google's position is unambiguous: if you can only measure four things about your user-facing system, measure these four things. Everything else is secondary.
The Four Golden Signals are not a third framework competing with RED and USE. They are a unifying layer that borrows from both. Latency and Errors come from RED (Duration and Errors). Traffic comes from RED (Rate). Saturation comes from USE. Google's contribution was recognizing that these four specific metrics are the most important subset and that they map directly to SLO compliance.
| Golden Signal | Definition | Maps to RED/USE | SLO Connection |
|---|---|---|---|
| Latency | Time to serve a request. Must distinguish between successful and failed request latency — a fast error is not a success. | RED: Duration | Latency SLO: 99% of requests complete in under 200ms. Measure as histogram percentiles (p50, p95, p99). |
| Traffic | Demand placed on your system. For a web service: HTTP requests per second. For a streaming system: bytes per second. For a database: queries per second. | RED: Rate | Traffic is the denominator in your SLO calculations. High traffic with low errors = healthy. High traffic with rising errors = investigate. |
| Errors | Rate of requests that fail. Includes explicit failures (HTTP 5xx), implicit failures (HTTP 200 with wrong content), and policy violations (responses slower than your SLO threshold). | RED: Errors | Availability SLO: 99.9% of requests succeed. Error budget = 0.1% = ~43 minutes of total failure per month. |
| Saturation | How full your service is. Emphasize the most constrained resource (CPU, memory, disk, network). Saturation is the leading indicator that predicts failures before they happen. | USE: Saturation | Capacity planning: if CPU saturation trends hit 80% in 4 weeks, provision more capacity now. |
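The error-budget arithmetic in the Errors row is a one-liner worth having at hand:

```python
def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Minutes of total failure an availability SLO allows per window."""
    return (1.0 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.999))    # ~43.2 minutes per 30-day month
print(error_budget_minutes(0.9999))   # ~4.3 minutes
```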
The Hidden Golden Signal: Failed Request Latency
Google specifically warns that you must measure latency for successful AND failed requests separately. A service that returns errors in 1ms and successful responses in 500ms, in equal volumes, has a blended average latency of roughly 250ms — which looks great. But the users whose requests succeed are experiencing 500ms latency. If errors suddenly spike, average latency drops and your dashboard shows "improvement" while users suffer. Always filter latency metrics by response status.
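The masking effect described here is easy to reproduce with toy numbers:

```python
import statistics

successes = [500.0] * 100     # what real users experience: 500 ms
fast_errors = [1.0] * 2       # a handful of 1 ms failures

before_spike = statistics.mean(successes + fast_errors)    # ~490 ms

# Error spike: failures now equal successes in volume.
fast_errors = [1.0] * 100
after_spike = statistics.mean(successes + fast_errors)     # 250.5 ms

# The blended average "improved" by almost half while successful
# requests still take 500 ms: filter latency by response status.
```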
graph TD
subgraph "Four Golden Signals → SLO Mapping"
L[Latency<br/>p99 < 200ms] -->|feeds| LSLO[Latency SLO<br/>99% requests < 200ms]
T[Traffic<br/>requests/sec] -->|denominator| ESLO[Availability SLO<br/>99.9% success rate]
E[Errors<br/>5xx rate] -->|numerator| ESLO
S[Saturation<br/>CPU/Memory/Disk] -->|predicts| CAP[Capacity Planning<br/>Scale before breach]
LSLO -->|violation triggers| EB[Error Budget<br/>Burn Rate Alert]
ESLO -->|violation triggers| EB
EB -->|depleted| FREEZE[Feature Freeze<br/>Reliability Focus]
EB -->|healthy| SHIP[Ship Features<br/>Accelerate Velocity]
end
How the Four Golden Signals feed into SLO compliance, error budget tracking, and the reliability-vs-velocity decision loop.
Implementing the Four Golden Signals in practice
The Google Approach to Alert Design
Google recommends alerting on multi-window, multi-burn-rate error budget consumption. Instead of alerting on "error rate > 1%," alert on "the error budget is being consumed 14.4x faster than sustainable over a 1-hour window and 6x faster than sustainable over a 6-hour window." This approach catches both fast-burn incidents (server crash affecting all traffic) and slow-burn degradation (gradual memory leak increasing latency) while maintaining a low false-positive rate.
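A minimal sketch of this policy, using the two windows and factors quoted above (real implementations, per the Google SRE Workbook, also pair each long window with a short confirmation window; the function names here are hypothetical):

```python
def burn_rate(error_ratio: float, budget_ratio: float) -> float:
    """How many times faster than sustainable the error budget is burning.
    1.0 means the budget is consumed exactly over the SLO window."""
    return error_ratio / budget_ratio

def should_page(burn_1h: float, burn_6h: float) -> bool:
    # Page only when both windows agree the burn is unsustainable.
    return burn_1h >= 14.4 and burn_6h >= 6.0

# 99.9% SLO -> 0.001 budget. A sustained 2% error rate burns ~20x:
assert should_page(burn_rate(0.02, 0.001), burn_rate(0.008, 0.001))
# A brief spike that the 6-hour window does not confirm stays silent:
assert not should_page(burn_rate(0.02, 0.001), burn_rate(0.001, 0.001))
```

The AND across windows is what suppresses false positives: a one-minute blip can push the 1-hour burn rate over 14.4x, but it cannot move the 6-hour average.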
Prometheus is the open-source monitoring system that has become the de facto standard for cloud-native infrastructure monitoring. Originally built at SoundCloud in 2012, inspired by Google's internal Borgmon monitoring system, Prometheus was the second project accepted into the Cloud Native Computing Foundation (CNCF) after Kubernetes. Today, Prometheus is used by virtually every company running Kubernetes and is the backbone of monitoring at organizations from small startups to companies processing billions of metrics per day.
Understanding Prometheus architecture is essential for any SRE role because it underpins how metrics are collected, stored, queried, and alerted on in most modern infrastructure. Even if you use a managed monitoring service like Datadog or New Relic, understanding Prometheus helps you reason about metric types, query patterns, and performance tradeoffs.
graph LR
subgraph "Targets"
App1[App Server<br/>/metrics endpoint]
App2[Database Exporter<br/>node_exporter]
App3[Kubernetes<br/>kube-state-metrics]
end
subgraph "Prometheus Server"
SD[Service Discovery<br/>K8s, Consul, DNS] --> Scrape[Scrape Manager<br/>Pull metrics every 15-30s]
Scrape --> TSDB[Time Series DB<br/>Local SSD Storage]
TSDB --> Rules[Recording Rules<br/>Pre-compute expensive queries]
TSDB --> Alerts[Alert Rules<br/>Evaluate conditions]
end
subgraph "Downstream"
Alerts --> AM[Alertmanager<br/>Dedup, Group, Route]
AM --> PD[PagerDuty / Slack]
TSDB --> Grafana[Grafana<br/>Dashboards]
TSDB --> API[HTTP API<br/>PromQL Queries]
TSDB --> Fed[Federation<br/>Global Prometheus]
end
App1 --> Scrape
App2 --> Scrape
App3 --> Scrape
Prometheus architecture: targets expose /metrics endpoints, the scrape manager pulls metrics on a configurable interval, the TSDB stores time series on local disk, and downstream consumers (Grafana, Alertmanager, federation) query the data via PromQL.
| Metric Type | What It Measures | Example | Query Pattern |
|---|---|---|---|
| Counter | Cumulative value that only goes up (resets on restart) | http_requests_total, node_cpu_seconds_total | rate() or increase() to get per-second rate or total increase over a window |
| Gauge | Value that can go up or down at any time | node_memory_available_bytes, temperature_celsius | Direct value or delta/deriv() for rate of change |
| Histogram | Observations bucketed into configurable ranges | http_request_duration_seconds with le buckets | histogram_quantile() for percentiles, _count and _sum for averages |
| Summary | Client-side calculated quantiles (not recommended for new code) | go_gc_duration_seconds | Pre-calculated quantiles; cannot be aggregated across instances — prefer histograms |
Always Use Histograms Over Summaries
Summaries calculate percentiles on the client side, which means you cannot aggregate them across multiple instances. If you have 10 API server replicas and each reports its own p99 latency via a Summary, you cannot calculate the overall p99 across all replicas — the average of percentiles is not the percentile of the average. Histograms store raw bucket counts that can be aggregated with histogram_quantile(), giving you correct cross-instance percentiles. The only trade-off is that histogram accuracy depends on bucket boundary selection.
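A quick numeric demonstration of why averaging per-instance percentiles is wrong, and why merging the raw distribution (which histograms make possible, since bucket counts are summable) is right. The numbers are invented:

```python
def p99(samples):
    """Nearest-rank p99 of a list of observations."""
    ranked = sorted(samples)
    return ranked[min(len(ranked) - 1, int(0.99 * len(ranked)))]

# Two replicas: A is uniformly fast, B has a 10% slow tail.
replica_a = [10.0] * 100
replica_b = [10.0] * 90 + [1000.0] * 10

avg_of_p99s = (p99(replica_a) + p99(replica_b)) / 2   # 505 ms -- wrong
true_p99 = p99(replica_a + replica_b)                 # 1000 ms -- correct
```

In PromQL you get the merged-distribution behavior by summing the _bucket counters across instances before applying histogram_quantile(); a Summary's pre-computed quantiles can never be combined this way.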
Prometheus production best practices
The Pull vs Push Debate
Prometheus uses a pull model: the server scrapes targets. This is ideal for service monitoring because if a target stops responding to scrapes, Prometheus knows it is down. Push-based systems (like StatsD or Graphite) cannot distinguish between "the service is down" and "the service stopped sending data." However, for short-lived batch jobs, use the Prometheus Pushgateway — jobs push their metrics to the gateway before exiting, and Prometheus scrapes the gateway.
A monitoring system that collects perfect data but presents it poorly is nearly as useless as no monitoring at all. Dashboards are the primary interface between your metrics and the humans who need to act on them — SREs triaging incidents, managers reviewing service health, and developers validating deployments. A well-designed dashboard hierarchy lets anyone in the organization understand system health within seconds. A poorly designed dashboard wastes time, creates confusion, and leads to missed signals during incidents when speed matters most.
The cardinal rule of dashboard design is the 5-second rule: an on-call engineer woken at 3 AM should be able to determine whether the system is healthy or unhealthy within 5 seconds of looking at a dashboard. If the dashboard requires scrolling, mental arithmetic, or context that is not visible on the screen, it fails the 5-second rule. Everything else in dashboard design flows from this principle.
graph TD
subgraph "Dashboard Hierarchy"
L1[Level 1: Executive Overview<br/>All services - green/yellow/red<br/>Audience: VP Eng, On-Call Lead]
L1 --> L2A[Level 2: Service Dashboard<br/>Golden Signals for Service A<br/>Audience: Service Owner, On-Call]
L1 --> L2B[Level 2: Service Dashboard<br/>Golden Signals for Service B<br/>Audience: Service Owner, On-Call]
L1 --> L2C[Level 2: Service Dashboard<br/>Golden Signals for Service C<br/>Audience: Service Owner, On-Call]
L2A --> L3A[Level 3: Component Dashboard<br/>Database, Cache, Queue details<br/>Audience: On-Call debugging]
L2A --> L3B[Level 3: Component Dashboard<br/>Deployment & Canary metrics<br/>Audience: On-Call debugging]
L2B --> L3C[Level 3: Component Dashboard<br/>Dependency health & USE metrics<br/>Audience: On-Call debugging]
end
Three-level dashboard hierarchy: executive overview (are we healthy?), service dashboards (Golden Signals per service), and component dashboards (deep-dive debugging). Navigation flows from overview to detail during incident triage.
Dashboard design principles
The Wall of Graphs Anti-Pattern
Many teams create dashboards with 30+ panels covering every conceivable metric. These "wall of graphs" dashboards are overwhelming and useless during incidents because no human can scan 30 graphs for anomalies. The on-call engineer ends up ignoring the dashboard and looking at logs instead. Limit service dashboards to 12-16 panels maximum. Put deep-dive metrics on separate component dashboards that you drill into only when needed.
Building your first service dashboard in Grafana
1. Create a new dashboard and set the default time range to "Last 1 hour" with auto-refresh every 30 seconds. Name it "[Service Name] - Golden Signals." Add a variable for environment (prod/staging) to enable environment switching without separate dashboards.
2. Row 1 - Status Overview: Add a Stat panel showing current SLO compliance percentage (green above SLO, red below). Add a Stat panel showing current error budget remaining (percentage). Add a Stat panel showing active alerts count.
3. Row 2 - Latency: Add a time series panel with p50, p95, and p99 latency lines. Add a horizontal threshold line at your SLO latency target (e.g., 200ms). Add a second panel showing latency heatmap (Grafana's native heatmap from Prometheus histograms) to visualize the full distribution.
4. Row 3 - Traffic and Errors: Add a time series panel showing requests per second, broken down by status code class (2xx, 4xx, 5xx). Add a time series panel showing error rate percentage with an SLO threshold line. Add annotations for deployments so you can correlate changes with error spikes.
5. Row 4 - Saturation: Add panels for the most constrained resource for this service (CPU utilization for compute-heavy services, disk I/O for storage services, connection pool utilization for database-heavy services). Add capacity threshold lines at 70% (warning) and 90% (critical).
6. Row 5 - Dependencies: Add panels showing the latency and error rate of every downstream dependency (database, cache, external APIs). These are the first thing to check when your service shows degradation — the root cause is often a dependency, not your own code.
Deployment Annotations Are Critical
Add annotations to your Grafana dashboards that mark every deployment, configuration change, and scaling event. During an incident, the single most valuable piece of information is "what changed?" Deployment annotations let you visually correlate a metric change with a code push in less than a second. Use Grafana annotations API in your CI/CD pipeline to automate this.
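A hedged sketch of automating this from a CI/CD job using Grafana's annotations HTTP API (the endpoint is POST /api/annotations; the URL, token, and helper names below are placeholders for your environment):

```python
import json
import time
import urllib.request

def deployment_annotation(service: str, version: str, extra_tags=None) -> dict:
    """JSON body for Grafana's POST /api/annotations endpoint."""
    return {
        "time": int(time.time() * 1000),    # epoch milliseconds
        "text": f"Deploy {service} {version}",
        "tags": ["deployment", service] + (extra_tags or []),
    }

def post_annotation(grafana_url: str, api_token: str, body: dict) -> None:
    # grafana_url and api_token are placeholders for your environment.
    req = urllib.request.Request(
        f"{grafana_url}/api/annotations",
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {api_token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)
```

Call post_annotation as the final step of the deploy job, tagged with the service name, so each service dashboard can filter to its own deployments.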
Monitoring systems degrade over time in predictable ways. Like code, monitoring accumulates technical debt — dashboards become stale, alert rules proliferate, metrics grow unbounded, and the gap between what is monitored and what matters widens. Recognizing these anti-patterns early prevents your monitoring from becoming a liability rather than an asset. The worst outcome is a team that believes they are well-monitored because they have "lots of dashboards and alerts" when in reality their monitoring is noisy, incomplete, or measuring the wrong things.
The seven deadly monitoring anti-patterns
Alert Fatigue Kills
Alert fatigue is not merely annoying — it is dangerous. When engineers become desensitized to alerts, they delay response to real incidents. The 2003 Northeast blackout in the US and Canada was worsened because operators had so many alarms firing that they missed the critical ones indicating a cascading grid failure. In SRE, the same dynamic plays out: if your team receives 20 pages per shift and 18 are false positives, the 2 real incidents get the same delayed, skeptical response. The target is fewer than 2 actionable pages per 12-hour on-call shift.
| Anti-Pattern | Symptom | Root Cause | Fix |
|---|---|---|---|
| Dashboard rot | 200+ dashboards, nobody knows which to use | No ownership, no review cadence | Assign owners, quarterly review, delete unused |
| Alert fatigue | 15+ pages per on-call shift, most ignored | Threshold alerts without SLO basis | SLO-based alerting, runbook requirement, max 2 pages/shift target |
| Metric explosion | Prometheus OOM or slow queries | High-cardinality labels | Monthly cardinality audit, max 100 values per label |
| Monitoring the monitoring | Monitoring down during incidents | Monitoring treated as non-critical | Tier-0 SLO for monitoring stack, separate alert channel |
| Vanity dashboards | Pretty graphs, no operational value | Built for demos, not incidents | Every panel must answer an on-call question |
The Quarterly Monitoring Audit
Every quarter, review all dashboards and alerts with the on-call team. For each dashboard: when was it last viewed during an incident? If never, consider deleting it. For each alert: how many times did it fire? How many times was the response "do nothing"? If more than 50% of firings are non-actionable, fix the alert threshold or delete it. Track the total alert count and the actionable percentage as team health metrics.
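The "more than 50% non-actionable" rule is easy to automate against your paging history. A sketch with invented alert names and counts:

```python
def audit_alerts(firings: dict) -> list:
    """firings maps alert name -> (total fires, actionable fires).
    Return alerts where fewer than half the fires were actionable."""
    noisy = []
    for name, (total, actionable) in firings.items():
        if total and actionable / total < 0.5:
            noisy.append(name)
    return sorted(noisy)

quarter = {"HighErrorBudgetBurn": (6, 6),
           "DiskAlmostFull": (40, 4),     # 90% noise: fix or delete
           "CPUAbove80": (120, 10)}
print(audit_alerts(quarter))   # ['CPUAbove80', 'DiskAlmostFull']
```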
The Oncall Handoff Test
A quick test for monitoring quality: can a new on-call engineer who has never seen the service before diagnose a problem using only the dashboards and runbooks? If they need to ask a senior engineer "which dashboard should I look at?" or "what does this graph mean?", your monitoring and documentation have failed. Run this test with every new team member and fix the gaps they reveal.
When your infrastructure spans millions of machines across dozens of datacenters serving billions of users, monitoring itself becomes one of the most challenging distributed systems problems you face. The monitoring system must ingest, store, query, and alert on more data than most companies' entire production workloads. This section examines how Google, Netflix, and Uber solved the monitoring-at-scale problem with purpose-built systems that handle 100 million to 100 billion time series.
The numbers are staggering. Google's internal monitoring system Monarch ingests over 100 billion time series. Netflix's Atlas handles approximately 2 billion metrics from their streaming infrastructure. Uber's M3 was built to handle their rapid growth from thousands to hundreds of millions of time series within a few years. Each of these systems made fundamentally different architectural decisions, but they all share a common insight: off-the-shelf monitoring tools like Prometheus were not designed for this scale and must be either heavily extended or replaced.
| System | Company | Scale | Architecture | Key Innovation |
|---|---|---|---|---|
| Monarch | Google | 100B+ time series, petabytes of storage | Fully integrated with Borg (Kubernetes predecessor). In-memory zones + long-term Leaf storage. Global query federation across all datacenters. | Zone-local ingestion with global query: metrics are stored in the zone where they originate but can be queried globally. This avoids cross-datacenter write traffic. |
| Atlas | Netflix | ~2B metrics, ~2.5 petabytes per day ingested | In-memory dimensional time series database. Tags replace hierarchical metric names. Integration with Spectator client library. | Dimensional data model: metrics are identified by a name and an arbitrary set of key-value tags, enabling flexible slicing and dicing without pre-defined hierarchies. |
| M3 | Uber | Hundreds of millions of time series | Distributed TSDB (M3DB) + query engine (M3Query) + aggregation (M3Aggregator). Prometheus-compatible APIs. | Prometheus compatibility: M3 accepts Prometheus remote-write protocol and speaks PromQL, allowing teams to keep their existing Prometheus setup while gaining horizontal scalability. |
Why Not Just Scale Prometheus?
Prometheus was designed as a single-server system with local disk storage. A single Prometheus instance can handle approximately 1-10 million time series depending on hardware. At FAANG scale, you need 10,000-100,000x that capacity. Solutions like Thanos and Cortex add horizontal scalability to Prometheus through remote storage backends and query federation, but they add significant operational complexity. At extreme scale, companies build purpose-built systems because the operational overhead of maintaining thousands of federated Prometheus instances exceeds the cost of building a custom solution.
Common patterns in FAANG-scale monitoring
Scale Numbers to Know for Interviews
When discussing monitoring at scale in interviews, cite these benchmarks: Google Monarch handles 100B+ time series. Netflix Atlas handles ~2B metrics with ~2.5PB/day ingest. A single Prometheus server handles 1-10M time series. Thanos/Cortex extends Prometheus to tens of millions. A typical startup generates 10K-100K time series. A mid-size company generates 1M-10M. These numbers help you calibrate architecture decisions to the actual scale of the problem.
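To calibrate where a given system falls on this spectrum, a back-of-envelope estimate helps. All inputs below are illustrative assumptions, not measurements:

```python
# Back-of-envelope active time series estimate: infrastructure metrics
# scale with host count, application metrics scale with services times
# metrics times label combinations. All numbers are hypothetical.

def estimate_series(hosts, metrics_per_host, services,
                    app_metrics, label_combos):
    infra = hosts * metrics_per_host            # node_exporter-style metrics
    app = services * app_metrics * label_combos # instrumented app metrics
    return infra + app

# A hypothetical mid-size deployment:
total = estimate_series(hosts=500, metrics_per_host=1000,
                        services=50, app_metrics=30, label_combos=200)
print(f"~{total:,} active time series")
```

At roughly 800K series, this hypothetical deployment sits comfortably within a single Prometheus server's 1-10M capacity; growing it 10x would push it into Thanos/Cortex territory.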
Start Simple, Scale When Needed
Do not build a Monarch-scale monitoring system for a startup with 5 services. Start with a single Prometheus server and Grafana. When you outgrow one Prometheus instance (typically around 1-5M active time series), add Thanos for long-term storage and cross-cluster querying. Only consider custom solutions or managed services like Datadog when the operational overhead of self-hosted monitoring exceeds 1-2 full-time engineers.
Theory without practice is useless. This section provides a concrete, step-by-step guide to building a production monitoring stack from scratch. The goal is to take you from zero monitoring to a fully operational setup with metrics collection, storage, visualization, and alerting — all using open-source tools that are free to run and used by FAANG companies at scale. By the end of this section, you will have a monitoring stack that covers the RED method, the USE method, and the Four Golden Signals.
Step-by-step: From zero to production monitoring
1. INSTRUMENT YOUR APPLICATION: Add a Prometheus client library to your service (available for Go, Java, Python, Node.js, Ruby, and more). Expose an HTTP endpoint at /metrics. At minimum, instrument three things: a Counter for total requests (http_requests_total with method, endpoint, status labels), a Histogram for request duration (http_request_duration_seconds with the same labels), and a Gauge for in-flight requests (http_requests_in_flight). These three metrics give you the RED signals: Rate (rate of counter), Errors (rate of counter filtered by 5xx status), Duration (histogram percentiles).
2. COLLECT INFRASTRUCTURE METRICS: Deploy node_exporter on every machine to expose USE method metrics — CPU utilization, memory usage, disk I/O, and network statistics. If running Kubernetes, deploy kube-state-metrics for cluster state (pod status, deployment replicas, resource requests/limits) and cAdvisor (built into kubelet) for container-level resource metrics.
3. SET UP PROMETHEUS: Deploy a Prometheus server with service discovery configured for your environment (kubernetes_sd_config for Kubernetes, ec2_sd_config for AWS, file_sd_config for static targets). Set the scrape interval to 15 seconds. Configure retention for 15 days of local storage. This single instance handles up to 1-5 million active time series.
4. CREATE RECORDING RULES: Write recording rules for your most common queries. Pre-compute the error rate (ratio of 5xx requests to total requests), the p99 latency, and the error budget burn rate. Recording rules reduce dashboard load times from seconds to milliseconds and make alert rules simpler to write and maintain.
5. BUILD YOUR GOLDEN SIGNALS DASHBOARD: In Grafana, create one dashboard per service with four rows: Latency (p50/p95/p99 time series with SLO threshold line), Traffic (requests per second by status code), Errors (error rate percentage with SLO threshold), Saturation (most constrained resource). Add deployment annotations from your CI/CD pipeline.
6. CONFIGURE SLO-BASED ALERTS: In Prometheus, create alerting rules based on SLO violations rather than arbitrary thresholds. Use multi-window burn-rate alerts: a fast-burn alert (14.4x budget consumption over 1 hour) for pages and a slow-burn alert (3x budget consumption over 3 days) for tickets. Route alerts through Alertmanager to PagerDuty for pages and Slack for warnings.
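The burn-rate thresholds in step 6 follow directly from the SLO arithmetic. A sketch for a hypothetical 99.9% SLO over a 30-day window (the alert tiers mirror the step above; exact multipliers vary by team):

```python
# Multi-window burn-rate math for a hypothetical 99.9% SLO over a
# 30-day (720-hour) window. A burn rate of N means errors are consuming
# the budget N times faster than the SLO allows.

SLO = 0.999
BUDGET = 1 - SLO          # 0.1% allowed error rate over the window
PERIOD_H = 30 * 24        # 720-hour SLO window

def error_rate_threshold(burn_rate):
    """Observed error rate that corresponds to a given burn rate."""
    return burn_rate * BUDGET

def budget_consumed(burn_rate, window_h):
    """Fraction of the monthly budget burned during the window."""
    return burn_rate * window_h / PERIOD_H

# Fast burn (page): 14.4x sustained for 1 hour.
print(f"page at error rate   > {error_rate_threshold(14.4):.3%}")
print(f"  budget consumed    = {budget_consumed(14.4, 1):.1%} per hour")
# Slow burn (ticket): 3x sustained for 3 days.
print(f"ticket at error rate > {error_rate_threshold(3):.3%}")
print(f"  budget consumed    = {budget_consumed(3, 72):.0%} over 3 days")
```

This is why 14.4x is the conventional fast-burn multiplier: at that rate a service burns 2% of its monthly error budget every hour, which justifies waking someone up.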
| Component | Tool | Purpose | Resource Requirements |
|---|---|---|---|
| Metrics Collection | Prometheus | Scrape, store, and query time series metrics | 2 CPU, 8 GB RAM, 100 GB SSD for ~500K time series with 15-day retention |
| Infrastructure Metrics | node_exporter + kube-state-metrics | Expose OS-level and Kubernetes cluster metrics | Minimal: 0.1 CPU, 64 MB RAM per node for node_exporter |
| Visualization | Grafana | Dashboards, graphs, and stat panels | 1 CPU, 2 GB RAM, 10 GB disk for dashboard storage |
| Alerting | Alertmanager | Alert deduplication, grouping, routing, and silencing | 0.5 CPU, 512 MB RAM — runs as a small sidecar |
| Long-term Storage | Thanos (when needed) | Object storage backend for retention beyond 15 days | Sidecar: 0.5 CPU, 1 GB RAM. Store gateway: 2 CPU, 4 GB RAM. S3/GCS costs vary. |
| Application Instrumentation | Prometheus client libraries | Expose /metrics endpoint from your application code | Negligible overhead: ~1-2% CPU, ~10 MB RAM for metric tracking |
The Starter Stack Recommendation
For teams starting from zero, deploy the kube-prometheus-stack Helm chart. It bundles Prometheus, Grafana, Alertmanager, node_exporter, and kube-state-metrics with sane defaults and pre-built dashboards for Kubernetes monitoring. You get a fully operational monitoring stack in under 30 minutes. Customize from there by adding application-specific metrics, recording rules, and SLO-based alerts.
Common Starter Mistakes
Three mistakes kill most first monitoring setups: (1) Not persisting Prometheus data — if Prometheus restarts and loses its storage, you lose all metrics history. Always use a PersistentVolumeClaim in Kubernetes. (2) Not configuring Alertmanager receivers — alerts that fire but go nowhere are worse than no alerts because they create a false sense of security. (3) Not instrumenting the application — relying solely on infrastructure metrics (CPU, memory) means you cannot detect application-level failures. Instrument your code from day one.
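To make mistake (3) concrete, the sketch below shows what a /metrics endpoint actually returns: the Prometheus text exposition format. Real services should use an official Prometheus client library rather than hand-rolling this; the stdlib-only code here only illustrates the label-based data model, and the metric values are hypothetical:

```python
# Stdlib-only illustration of the Prometheus text exposition format.
# A real service would use a prometheus client library instead.

from collections import Counter as TallyCounter

requests = TallyCounter()          # (method, endpoint, status) -> count

def record(method, endpoint, status):
    """Increment the request counter for one labeled series."""
    requests[(method, endpoint, status)] += 1

def render_metrics() -> str:
    """Render counters in Prometheus text exposition format."""
    lines = ["# TYPE http_requests_total counter"]
    for (method, endpoint, status), count in sorted(requests.items()):
        lines.append(
            f'http_requests_total{{method="{method}",'
            f'endpoint="{endpoint}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"

record("GET", "/api/users", "200")
record("GET", "/api/users", "200")
record("GET", "/api/users", "500")
print(render_metrics())
```

Each unique label combination is its own time series, which is also why high-cardinality labels (step 1's method/endpoint/status labels are safe; user IDs are not) cause the metric explosion anti-pattern described earlier.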
What to monitor on day one vs day thirty
Test Your Monitoring Before You Need It
Intentionally break your service in a staging environment and verify that your monitoring detects the failure. Introduce latency with a sleep statement and check if your latency alert fires. Return 500 errors and verify the error rate dashboard reflects the increase. Exhaust memory and confirm the saturation panel shows the spike. If your monitoring cannot detect intentional failures in staging, it will not detect real failures in production.
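One way to make these staging experiments repeatable is a fault-injection toggle in the request path. This is a minimal sketch; the environment variable names are hypothetical, and the hook should be compiled or configured out of production builds:

```python
# Sketch of a staging-only fault-injection hook for testing monitoring,
# as described above. Environment variable names are hypothetical.

import os
import random
import time

def maybe_inject_faults():
    """Call at the top of a request handler in staging builds."""
    extra_ms = int(os.environ.get("FAULT_LATENCY_MS", "0"))
    err_rate = float(os.environ.get("FAULT_ERROR_RATE", "0"))
    if extra_ms:
        time.sleep(extra_ms / 1000)           # should trip latency alerts
    if random.random() < err_rate:
        raise RuntimeError("injected 500")    # should move error-rate panels
```

Setting FAULT_LATENCY_MS=500 should fire the latency alert; FAULT_ERROR_RATE=0.2 should visibly move the error-rate panel. If neither happens, the monitoring gap is found before a real incident finds it.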
Monitoring fundamentals appear in virtually every SRE and infrastructure interview at FAANG companies. In system design interviews, you will be asked how you would monitor a system you just designed — interviewers expect you to name specific metrics (p99 latency, error rate, CPU saturation), specific tools (Prometheus, Grafana), and specific frameworks (RED, USE, Golden Signals). In behavioral interviews, you will be asked about incidents where monitoring helped (or where lack of monitoring caused problems). In coding interviews at SRE-focused roles, you may be asked to write PromQL queries or design a monitoring schema for a given service. Knowing RED, USE, and the Four Golden Signals gives you a structured, vocabulary-rich answer for any monitoring question.
Common questions:
Questions to ask the interviewer: What monitoring stack do you use, and how large is it (time series count)? How do you handle alert fatigue — what is your typical pages-per-shift rate? Do you use SLO-based or threshold-based alerting? These questions demonstrate that you understand monitoring at a practical, operational level and are evaluating the maturity of their monitoring infrastructure.
Strong answer: Immediately structuring the answer using RED/USE/Golden Signals. Explaining why histograms are preferred over summaries. Mentioning label cardinality as a Prometheus scaling concern. Describing multi-window burn-rate alerts for SLO-based alerting. Knowing the three-level dashboard hierarchy. Citing concrete scale numbers and tool recommendations appropriate to the scale being discussed.
Red flags: Monitoring only CPU and memory without application-level metrics. Not knowing what PromQL is. Suggesting average latency instead of percentiles. Setting alerts on arbitrary thresholds (CPU > 80%) without connecting to SLOs. Not understanding the difference between monitoring and observability. Describing dashboards with 50+ panels as "comprehensive monitoring."
Key takeaways
Explain the difference between the RED method and the USE method. When would you apply each, and how do they complement each other during incident investigation?
RED (Rate, Errors, Duration) monitors the user experience of request-driven services — how many requests are coming in, how many are failing, and how long they take. USE (Utilization, Saturation, Errors) monitors infrastructure resources — how busy CPUs, memory, disks, and networks are, whether work is queuing, and whether hardware errors are occurring. During an incident, you start with RED to confirm user impact (symptoms), then switch to USE to find the bottleneck (root cause). RED tells you the patient is sick; USE tells you which organ is failing. They are complementary, not competing — you need both for complete production monitoring.
Name the Four Golden Signals and explain how each maps to an SLO. Why does Google recommend separating latency measurements for successful and failed requests?
The Four Golden Signals are Latency (response time, feeds latency SLO like "99% of requests under 200ms"), Traffic (request rate, provides the denominator for SLO calculations), Errors (failure rate, feeds availability SLO like "99.9% of requests succeed"), and Saturation (resource fullness, feeds capacity planning to prevent future SLO violations). Google recommends separating successful and failed request latency because errors often return much faster than successful responses. If error rate spikes, average latency drops, making the dashboard falsely show "improvement" while users experience degradation. Measuring latency only for successful requests gives an accurate view of user experience.
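The averaging trap in that answer is easy to show numerically. The latencies below are made-up examples chosen to illustrate the effect:

```python
# Why failed requests must be excluded from latency SLO measurements:
# fast-failing errors drag the blended average down, making a degraded
# service look "faster". Latency values are hypothetical.

OK_MS, ERR_MS = 180.0, 5.0          # errors fail fast

def avg_latency(error_fraction):
    """Blended average over successes and fast-failing errors."""
    return (1 - error_fraction) * OK_MS + error_fraction * ERR_MS

print(f"1% errors:  avg = {avg_latency(0.01):.1f} ms")
print(f"20% errors: avg = {avg_latency(0.20):.1f} ms")  # lower, yet worse
```

At a 20% error rate the blended average drops to 145 ms from roughly 178 ms, so a dashboard plotting all-request latency would show "improvement" during an outage. Measuring successful requests only keeps the latency panel honest.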
You are building monitoring for a new microservice. Walk through the specific metrics you would instrument and why, covering both RED and USE perspectives.
For RED: (1) A Counter http_requests_total with labels for method, endpoint, and status code — rate() gives request rate, filtering by 5xx gives error rate. (2) A Histogram http_request_duration_seconds with the same labels — histogram_quantile() gives p50/p95/p99 latency. (3) A Gauge http_requests_in_flight for connection concurrency monitoring. For USE: deploy node_exporter for system metrics (CPU utilization/saturation, memory usage, disk I/O, network bandwidth). Add application-specific resource metrics: database connection pool utilization (active/max), thread pool queue depth, and cache hit ratio. For business context: add a Counter for the primary business transaction (e.g., orders_placed_total). This gives complete coverage of the user experience (RED), infrastructure health (USE), and business outcomes.
💡 Analogy
Monitoring is like the instrument panel in an aircraft cockpit. Pilots do not fly by looking out the window — they rely on instruments that measure altitude (utilization), airspeed (rate), engine warnings (errors), and fuel level (saturation). The RED method is like monitoring the passenger experience: are flights departing on time (rate), are passengers arriving at their destination (errors), and how long is the journey taking (duration)? The USE method is like monitoring the aircraft systems: engine temperature (utilization), fuel reserves (saturation), and mechanical warnings (errors). The Four Golden Signals combine both perspectives into the minimum instrument set needed to fly safely. Just as a pilot scans instruments in a specific pattern (aviate, navigate, communicate), an SRE scans golden signals in a specific pattern: latency first (is the service responsive?), errors second (are requests succeeding?), traffic third (what is the load?), saturation fourth (how much headroom remains?). You need both the passenger perspective and the aircraft systems perspective to fly safely — and you need both RED and USE to operate production systems safely.
⚡ Core Idea
Monitoring provides the nervous system of your production infrastructure — without it, every other SRE practice (SLOs, error budgets, incident response, capacity planning) is impossible. The RED method monitors the user experience of request-driven services. The USE method monitors the health of infrastructure resources. The Four Golden Signals unify both into the minimum metric set that maps directly to SLO compliance. Master all three frameworks to monitor any system at any scale.
🎯 Why It Matters
Every FAANG SRE interview will test your ability to describe how you would monitor a production system. Knowing RED, USE, and the Four Golden Signals gives you a structured answer for any monitoring question. More importantly, in production, monitoring is the difference between detecting an incident in 30 seconds (automated alert) and detecting it in 30 minutes (user complaint on Twitter). That 29.5-minute difference directly translates to user impact, revenue loss, and error budget consumption.