The Four Golden Signals: The Minimal Set of Metrics That Catch Almost Everything

On this page

You can't dashboard your way to calm
The four signals in one breath
The picture: a service emitting four signals
Signal by signal: what to measure and why
Monitoring vs. observability, RED, and USE
Instrument a service with the four signals
Common mistakes that cost hours
Takeaways
Where to go next

TL;DR

Latency, traffic, errors, saturation: monitor these four and you catch most user-facing problems. Learn what to measure for each, how they relate to RED and USE, and how to instrument a service with Prometheus.

You can't dashboard your way to calm

Here is the trap every team falls into. The service is slow, someone is paging you, and you open a dashboard with forty graphs on it. CPU, heap, GC pauses, thread pools, cache hit ratios, p50/p90/p99 for twelve endpoints, three flavors of queue depth. You stare at the wall of sparklines and feel *less* sure of what's wrong than when you started. More metrics did not buy you more clarity; they bought you more places to look.

The fix is not another graph. It's a smaller, sharper set. In 2016, Google's Site Reliability Engineering book made a deceptively simple claim: if you can only measure four things about a user-facing system, measure latency, traffic, errors, and saturation. They called them the Four Golden Signals, and they hold up because they are framed around the user's experience, not the machine's internals.

Who this is for

Engineers who own a service in production and want a starting point for monitoring that isn't "graph everything and hope." If you can deploy a service and read a chart, you're ready. No prior SRE background is assumed. By the end, you'll be able to instrument a service with all four signals and know what a spike in each one is telling you.

The four signals in one breath

If you can measure only four metrics of your user-facing system, focus on these four: latency, traffic, errors, and saturation.
Google SRE Book, Monitoring Distributed Systems

Think of these four as the dashboard of a car. You don't need to watch the engine's combustion timing to know something is wrong: the dashboard already surfaces the signals that matter to the driver, and each one maps cleanly to a golden signal.

Car gauge	Golden signal
Speedometer: how fast you're actually going	Latency: how fast requests are actually served
Trip odometer / RPM: how hard the car is working right now	Traffic: how much demand is hitting the service right now
Check-engine light: something just went wrong	Errors: the rate of requests that are failing
Fuel gauge / temperature: how close to the limit you are	Saturation: how full the most constrained resource is

A driver doesn't read the engine block. They read four gauges. Your on-call self should too.

The power of the set is *coverage*. A user-facing problem almost always shows up as: requests got slow (latency), demand spiked or vanished (traffic), requests started failing (errors), or a resource is about to tip over (saturation). Watch four, catch most.

The picture: a service emitting four signals

Here's the mental model. Users hit your service on the happy path. The service emits the four golden signals to an observability stack: a metrics store that feeds dashboards and an alerting pipeline. The signals are a side channel; they never block the request.

Users flow through the service on the happy path (solid). The four golden signals fan out to the observability node (dashed, async).

1. A user sends a request. The request enters your service on the happy path. This is the only flow the user cares about; everything else exists to keep it fast and correct.
2. The service times the request. It records when work started and finished. That duration is your latency sample, ideally tagged by endpoint and split by success vs. failure.
3. The service counts the request. A counter ticks up for every request handled; its rate over time is traffic. A second counter (or a status label on the first) ticks only when the request fails; that's the errors signal.
4. The runtime exposes saturation. CPU, memory, connection-pool usage, queue depth: whatever shows how full the most constrained resource is. This is the leading indicator: it tends to climb before latency and errors do.
5. An agent scrapes the signals. Prometheus pulls these numbers on an interval and stores them as time series. Dashboards visualize them; alert rules fire when they cross a threshold tied to your SLOs.
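Steps 2 through 4 can be sketched in a few lines of plain Python. This is only an illustration of the flow, not a real metrics client: the `GoldenSignals` class, its method names, and the `/checkout` route are all hypothetical, and the in-flight request count stands in for whatever your real bottleneck resource is.

```python
import time
from collections import Counter


class GoldenSignals:
    """Tiny in-memory recorder; a hypothetical stand-in for a real metrics client."""

    def __init__(self, pool_max):
        self.requests = Counter()   # (route, status) -> count: traffic and errors
        self.durations = []         # (route, status, seconds): latency samples
        self.in_flight = 0          # saturation numerator (assumed bottleneck)
        self.pool_max = pool_max    # saturation limit

    def handle(self, route, handler):
        """Run the handler, time it, and count the outcome."""
        start = time.perf_counter()
        self.in_flight += 1
        try:
            status = handler()
        except Exception:
            status = 500            # an unhandled exception counts as an explicit error
        finally:
            self.in_flight -= 1
        elapsed = time.perf_counter() - start
        self.requests[(route, status)] += 1
        self.durations.append((route, status, elapsed))
        return status

    def traffic(self):
        return sum(self.requests.values())

    def errors(self):
        return sum(n for (_, status), n in self.requests.items() if status >= 500)

    def saturation(self):
        return self.in_flight / self.pool_max


def failing_handler():
    raise RuntimeError("dependency down")


signals = GoldenSignals(pool_max=10)
signals.handle("/checkout", lambda: 200)
signals.handle("/checkout", lambda: 200)
signals.handle("/checkout", failing_handler)
print(signals.traffic(), signals.errors(), signals.saturation())
```

Note that recording happens around the handler, never inside it: the request path does not wait on the metrics store, which matches the side-channel idea above.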

Signal by signal: what to measure and why

Each signal answers a different question about the user's experience. Here's the cheat sheet: what each signal measures, a concrete metric to emit, and what a spike is usually telling you.

Signal	What it measures	Example metric	What a spike means
Latency	How long a request takes, as a distribution (not an average)	p99 of http_request_duration_seconds, split success vs. error	Something downstream is slow or you're resource-starved; users are waiting
Traffic	How much demand the service is handling right now	requests per second from http_requests_total	A real load spike or a retry storm; a drop to zero usually means something upstream is broken
Errors	The rate of requests that fail, explicitly or implicitly	ratio of 5xx (and bad 200s) to total requests	A bad deploy or failing dependency; correctness is breaking, not just speed
Saturation	How full your most constrained resource is, vs. its limit	memory / CPU / pool utilization as a percentage	You're approaching a cliff; performance often degrades nonlinearly past roughly 80%

Latency, traffic, and errors measure what the user feels now; saturation predicts what they'll feel soon.

Latency: measure the tail, not the average

The single most common latency mistake is reporting the average. Averages hide pain: if 99 requests take 10ms and one takes 10 seconds, the average is a comfortable 110ms while a real user stares at a frozen page. Always track percentiles: p50 for the typical user, p99 for the unlucky tail. And split the latency of *successful* requests from that of *failed* ones: a fast error and a slow success are both interesting, and averaging them together is meaningless.
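To see the gap between the mean and the tail in numbers, here is a small sketch using a nearest-rank percentile. The sample is hypothetical and deliberately larger than the example above (980 fast requests and 20 slow ones), so that slow requests make up more than 1% of traffic and p99 actually lands on them; the `percentile` helper is an illustration, not a library function.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 980 fast requests (10 ms) and 20 slow ones (10 s).
latencies_ms = [10] * 980 + [10_000] * 20

avg = mean(latencies_ms)             # about 209.8 ms: looks tolerable
p50 = percentile(latencies_ms, 50)   # 10 ms: the typical user is fine
p99 = percentile(latencies_ms, 99)   # 10000 ms: the unlucky tail is not

print(f"mean={avg:.1f}ms p50={p50}ms p99={p99}ms")
```

The mean describes no request that actually happened; p50 and p99 each describe a real one.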

Traffic: your denominator and your context

Traffic is requests per second (or transactions, or queries). On its own it's rarely an alert, but it's the context that makes the other three readable. A 500-error count means nothing without knowing whether you served 10 requests or 10 million. Traffic is also the denominator for your error *rate*, and a sudden drop to zero is its own kind of incident.
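Requests per second is usually derived from a counter that only goes up, sampled at intervals. A minimal sketch of that derivation follows, assuming samples arrive as `(timestamp, value)` pairs and that a drop in value means the process restarted. The function name and sample data are hypothetical, and Prometheus's own `rate()` does more than this (for example, it extrapolates to the edges of the window).

```python
def per_second_rate(samples):
    """Average per-second increase of a monotonically increasing counter.

    samples: list of (unix_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (process restart),
    so the new value is counted as the increase since the reset.
    """
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15 s; the service restarts between t=30 and t=45.
scrapes = [(0, 100), (15, 400), (30, 700), (45, 50), (60, 350)]
rps = per_second_rate(scrapes)
print(f"{rps:.2f} requests/second")
```

Dividing an error counter's rate by this number gives the error rate, which is why traffic is the denominator for the errors signal.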

Errors: count the lies, not just the crashes

Explicit errors (HTTP 5xx, exceptions) are easy. The dangerous ones are implicit: a 200 OK that returns the wrong content, a response that violates a contract, or a success that took 30 seconds when your SLO is 300ms. Define what "failure" means for *your* service and count all of it. Errors are about correctness: a fast wrong answer is still wrong.
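One way to make "define failure for your service" concrete is a single predicate that every request outcome passes through. The sketch below is an assumption about how you might structure it: the `Outcome` type, the `body_valid` contract check, and the sample data are hypothetical, and the 300ms threshold is the SLO used as an example above.

```python
from dataclasses import dataclass

SLO_SECONDS = 0.3  # the 300 ms SLO from the example in the text


@dataclass
class Outcome:
    status: int
    duration_s: float
    body_valid: bool  # hypothetical result of a response-contract check


def is_failure(outcome: Outcome) -> bool:
    """Count explicit errors and implicit ones alike."""
    if outcome.status >= 500:
        return True                          # explicit: server error
    if not outcome.body_valid:
        return True                          # implicit: a 200 with the wrong content
    return outcome.duration_s > SLO_SECONDS  # implicit: a success that blew the SLO


outcomes = [
    Outcome(200, 0.05, True),    # healthy
    Outcome(500, 0.02, True),    # explicit error
    Outcome(200, 0.04, False),   # fast, but wrong
    Outcome(200, 30.0, True),    # correct, but far too slow
]

explicit = sum(o.status >= 500 for o in outcomes)
all_failures = sum(is_failure(o) for o in outcomes)
print(f"explicit={explicit} all_failures={all_failures} of {len(outcomes)}")
```

Counting only 5xx responses here would report one failure in four requests; the predicate reports three.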

Saturation: the leading indicator

Saturation is how full your most constrained resource is: memory, CPU, disk I/O, a connection pool, a thread pool, a queue. It's special because it's a leading indicator: it climbs *before* latency and errors degrade. Most systems behave badly well below 100%; a queue at 80% utilization is already adding latency. Find your bottleneck resource, measure it as a fraction of its limit, and alert before the cliff.
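Why does a queue at 80% already hurt? A rough way to see it is the textbook M/M/1 queueing model, where mean response time equals service time divided by (1 - utilization). That model is an assumption brought in here for illustration (a single server, random arrivals, exponentially distributed service times); real systems will differ, but the shape of the curve is the point. The 10ms service time is hypothetical.

```python
def mean_response_time(service_time_s, utilization):
    """Mean time in system for an M/M/1 queue: S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1 - utilization)


SERVICE_TIME_S = 0.010  # hypothetical: 10 ms of actual work per request

curve = {rho: mean_response_time(SERVICE_TIME_S, rho) for rho in (0.5, 0.8, 0.9, 0.95)}
for rho, seconds in curve.items():
    print(f"utilization {rho:.0%}: about {seconds * 1000:.0f} ms per request")
```

Under this model, going from 50% to 80% utilization takes a request from roughly 20ms to 50ms, and 95% takes it to about 200ms. Latency grows slowly, then suddenly, which is why saturation is worth measuring as a fraction of the limit rather than waiting for 100%.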

Monitoring vs. observability, RED, and USE

The four signals sit inside a bigger vocabulary. Two distinctions are worth knowing so you don't get lost in the jargon.

Monitoring vs. observability. Monitoring answers *known* questions ("is the error rate above 1%?") with predefined dashboards and alerts. The golden signals are classic monitoring. Observability is the property of being able to ask *new* questions of a running system without shipping new code: "why is p99 high *only* for users in region X on checkout after the 2pm deploy?" You get there by adding high-cardinality context (traces, structured logs, exemplars). Golden signals tell you *that* something is wrong; observability helps you find out *why*.

RED and USE are two popular reframings of the same idea, each aimed at a different layer:

Method	Best for	What it tracks
Golden Signals	Any user-facing system (the general case)	Latency, Traffic, Errors, Saturation
RED	Request-driven services and microservices	Rate (traffic), Errors, Duration (latency)
USE	Hardware and finite resources (a host, a disk, a pool)	Utilization, Saturation, Errors

RED is the golden signals minus saturation, viewed per service. USE is the resource-side view. They overlap by design.

In practice: use RED to instrument each service from the request's point of view, USE to investigate the resources underneath it, and treat the four golden signals as the umbrella that guarantees you've covered both angles.

Instrument a service with the four signals

Here's the concrete walkthrough for a request-driven HTTP service exposing Prometheus metrics. The pattern is the same in any language with a Prometheus client.

1. Expose a /metrics endpoint. Add a Prometheus client library and mount a /metrics route. This is the surface Prometheus scrapes; it returns the current value of every metric in a plain-text format.
2. Emit a histogram for latency. Wrap your request handler in a timer that observes duration into a histogram labeled by route and status. Histograms let you compute p50/p99 later in PromQL.
3. Emit a counter for traffic and errors. Increment http_requests_total on every request, labeled by status code. Traffic is the rate of the whole counter; errors are the rate of the 5xx slice. One counter, two signals.
4. Expose saturation from the runtime. Most client libraries auto-export process CPU/memory. Add your bottleneck explicitly as a gauge: connection-pool in use vs. max, or queue depth vs. capacity.
5. Scrape and alert. Point Prometheus at the endpoint, then write alert rules in PromQL that map to your SLOs. Alert on symptoms users feel (error ratio, p99), not on raw CPU.
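To show what steps 1 through 4 produce on the wire, here is a dependency-free sketch that keeps a counter, a histogram, and a gauge in memory and renders them in the Prometheus text exposition format. In a real service you would use an official Prometheus client library rather than hand-rolling this. The `Metrics` class and the bucket bounds are hypothetical, and the histogram is left unlabeled for brevity even though the steps above label it by route and status.

```python
BUCKETS = (0.1, 0.3, 1.0)  # hypothetical bucket upper bounds, in seconds


class Metrics:
    """Hand-rolled stand-in for a Prometheus client, for illustration only."""

    def __init__(self):
        self.requests = {}                             # status -> count (counter)
        self.bucket_counts = [0] * (len(BUCKETS) + 1)  # last slot is the +Inf bucket
        self.duration_sum = 0.0
        self.pool_in_use = 0                           # gauge

    def observe(self, status, seconds):
        """Record one finished request: bump the counter and the histogram."""
        self.requests[status] = self.requests.get(status, 0) + 1
        self.duration_sum += seconds
        for i, bound in enumerate(BUCKETS):
            if seconds <= bound:
                self.bucket_counts[i] += 1
                break
        else:
            self.bucket_counts[-1] += 1

    def render(self):
        """Return the body a /metrics endpoint would serve."""
        lines = ["# TYPE http_requests_total counter"]
        for status, count in sorted(self.requests.items()):
            lines.append(f'http_requests_total{{status="{status}"}} {count}')

        lines.append("# TYPE http_request_duration_seconds histogram")
        cumulative = 0  # histogram buckets are cumulative in the exposition format
        for bound, count in zip(BUCKETS, self.bucket_counts):
            cumulative += count
            lines.append(f'http_request_duration_seconds_bucket{{le="{bound}"}} {cumulative}')
        cumulative += self.bucket_counts[-1]
        lines.append(f'http_request_duration_seconds_bucket{{le="+Inf"}} {cumulative}')
        lines.append(f"http_request_duration_seconds_sum {self.duration_sum}")
        lines.append(f"http_request_duration_seconds_count {cumulative}")

        lines.append("# TYPE db_pool_connections_in_use gauge")
        lines.append(f"db_pool_connections_in_use {self.pool_in_use}")
        return "\n".join(lines) + "\n"


metrics = Metrics()
metrics.observe("200", 0.05)
metrics.observe("200", 0.2)
metrics.observe("500", 2.0)
metrics.pool_in_use = 3
exposition = metrics.render()
print(exposition)
```

Each bucket line counts every request at or below its bound, so the `+Inf` bucket always equals the total request count. That cumulative layout is what lets PromQL estimate quantiles later.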

prometheus.yml

yaml

# Tell Prometheus to scrape your service's /metrics endpoint
scrape_configs:
  - job_name: "checkout-service"
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["checkout:8080"]
        labels:
          team: payments

With those three metric types in place, all four signals fall out of a handful of PromQL queries. These are the expressions you put on a dashboard and behind alerts:

golden-signals.promql

promql

# 1) LATENCY: p99 of successful requests over the last 5 minutes
histogram_quantile(
  0.99,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{status=~"2.."}[5m])
  )
)

# 2) TRAFFIC: requests per second across the whole service
sum(rate(http_requests_total[5m]))

# 3) ERRORS: fraction of requests returning 5xx (the error RATE)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# 4) SATURATION: connection pool fullness as a fraction of its limit,
#    computed per instance, then taking the fullest pool
max(db_pool_connections_in_use / db_pool_connections_max)
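
It helps to know that the p99 from the latency query is an estimate, not an exact value: for classic bucketed histograms, the quantile is located in a bucket and then linearly interpolated between that bucket's bounds. The Python sketch below mimics that idea under simplifying assumptions (cumulative counts, at least one finite bucket, lowest bucket starting at 0). It is an illustration of the technique, not Prometheus's implementation, and the bucket data is hypothetical.

```python
def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    ending with (float("inf"), total_count).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")

    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for i, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if upper_bound == float("inf"):
                # No upper edge to interpolate toward: report the highest finite bound.
                return buckets[i - 1][0]
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count


# Hypothetical histogram: 1000 requests, bounds in seconds.
sample_buckets = [(0.1, 900), (0.3, 980), (1.0, 998), (float("inf"), 1000)]

p50_estimate = estimate_quantile(0.50, sample_buckets)
p99_estimate = estimate_quantile(0.99, sample_buckets)
p999_estimate = estimate_quantile(0.999, sample_buckets)
print(p50_estimate, p99_estimate, p999_estimate)
```

The practical consequence: the estimate can never be more precise than your bucket bounds, so it pays to put bucket boundaries near your SLO threshold.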

Alert on the ratio, not the raw count

A rule like "5xx > 10/s" fires on a traffic spike even when reliability is fine. Alert on the error *ratio* against your SLO (e.g. error rate > 1% for 5 minutes) so the threshold scales with load instead of paging you every Black Friday.
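A few lines of Python make the difference concrete. The two functions below are hypothetical stand-ins for alert conditions (the "for 5 minutes" duration is left out), and the three traffic scenarios are made-up numbers chosen to show where each rule goes wrong.

```python
def count_alert(errors_per_s, threshold=10.0):
    """Raw-count rule: fire when errors per second exceed a fixed number."""
    return errors_per_s > threshold


def ratio_alert(errors_per_s, requests_per_s, slo_ratio=0.01):
    """Ratio rule: fire when the error rate exceeds the SLO (1% here)."""
    if requests_per_s == 0:
        return False  # no traffic at all needs its own alert; a ratio is undefined
    return errors_per_s / requests_per_s > slo_ratio


# Hypothetical scenarios: (requests per second, errors per second)
scenarios = {
    "normal day": (200.0, 1.0),       # 0.5% errors
    "black friday": (5000.0, 25.0),   # still 0.5% errors, 25x the load
    "bad deploy": (200.0, 8.0),       # 4% errors at normal load
}

results = {
    name: (count_alert(errs), ratio_alert(errs, reqs))
    for name, (reqs, errs) in scenarios.items()
}
for name, (by_count, by_ratio) in results.items():
    print(f"{name}: count rule fires={by_count}, ratio rule fires={by_ratio}")
```

The raw-count rule pages on Black Friday when reliability is unchanged and stays silent during the bad deploy; the ratio rule does the opposite in both cases.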

Common mistakes that cost hours

1. Reporting averages for latency. The mean hides the tail where real users hurt. Always use percentiles (p50, p99), and never average a histogram; compute the quantile.
2. Alerting on causes instead of symptoms. "CPU > 80%" pages you when nothing is broken for users. Page on what users feel (error ratio and p99 latency), and keep saturation for investigation and early warning.
3. Counting errors without traffic. "50 errors" is meaningless. 50 of 50 is an outage; 50 of 5 million is noise. Always express errors as a *rate* over total requests.
4. Ignoring implicit errors. A 200 OK with a corrupt body, or a success that blew your latency SLO, is still a failure. Define failure for your service and count all of it.
5. Watching saturation last. It's the leading indicator: it moves first. If you only look at it in the postmortem, you missed the warning the system gave you ten minutes before the page.
6. Confusing dashboards with observability. Forty static graphs are still monitoring. If you can't ask a new question without deploying code, you have monitoring, not observability.

Takeaways

The whole article in seven lines

Measure four things to catch most user-facing problems: latency, traffic, errors, saturation.
Latency: track percentiles (p99), and split successful from failed requests.
Traffic: requests per second, the context and denominator for everything else.
Errors: count as a rate over total, and include implicit failures (wrong 200s, blown SLOs).
Saturation: how full your bottleneck is, the leading indicator that moves first.
RED = the per-service request view; USE = the per-resource view; golden signals = the umbrella.
Monitoring answers known questions; observability lets you ask new ones. Alert on symptoms, not causes.

Where to go next

The four signals are the *what* to measure. The next two questions are how to set targets for them and how to alert without burning out your on-call. Both have their own articles in this track.

Turn signals into reliability targets: SLIs, SLOs & SLAs: Defining Reliability.
Make the alerting humane: Alerting Without Burnout (symptom-based, SLO-driven alerts).
Get hands-on with the runtime these signals come from: the Linux lab and the kubectl lab.
See where this fits: the SRE career path walks from monitoring foundations through SLOs, alerting, and incident response.

Check your understanding

1. What are the Four Golden Signals named in the article?

2. What does the article say is the single most common latency mistake?

Frequently asked questions

What are the Four Golden Signals?

Latency, traffic, errors, and saturation. Google's SRE book argued that if you can measure only four things about a user-facing system, these are the four, because they are framed around the user's experience rather than the machine's internals.

Why are just four signals enough when I could graph dozens of metrics?

Their strength is coverage. A user-facing problem almost always shows up as requests getting slow (latency), demand spiking or vanishing (traffic), requests failing (errors), or a resource about to tip over (saturation). Watching four catches most issues, while forty graphs often buy more places to look rather than more clarity.

Why shouldn't I report latency as an average?

Averages hide pain. If 99 requests take 10ms and one takes 10 seconds, the average looks like a comfortable 110ms while a real user stares at a frozen page. Track percentiles like p50, p90, and p99 so the tail is visible.

How do the golden signals relate to RED and USE?

They are complementary lenses on the same idea. RED (rate, errors, duration) is the golden signals minus saturation, viewed per service; USE (utilization, saturation, errors) is the resource-side view for hosts, disks, and pools. The golden signals are the umbrella that covers both angles.
