© 2026 TheSimplifiedTech. All rights reserved.

Interactive Explainer

Observable & Monitorable Systems

The three pillars of observability, SLOs and error budgets, distributed tracing, and alerting that doesn't cry wolf.

🎯Key Takeaways
Monitoring tells you THAT something is wrong. Observability tells you WHY — you can answer questions you did not know you would need to ask.
Three pillars: logs (what happened), metrics (how it is performing quantitatively), traces (where time was spent across services).
Propagate trace_id through all services and include it in every log line — this enables correlation across pillars.
SLO = reliability target. Error budget = allowed failure quota (100% - SLO). When budget is exhausted, pause feature work, fix reliability.
Alert on symptoms (user-facing impact) not causes (CPU, memory). Alert on error budget burn rate, not static thresholds.


~6 min read


The 3 AM phone call you cannot answer

The page fires at 3 AM. "API error rate is 12%." You open your laptop. Your monitoring dashboard shows the error rate. But it does not show you why.

You check the logs — millions of lines, no structured search. You check the metrics — all look normal except the error rate. You have no traces to follow a request through the system. You are flying blind.

Thirty minutes later, you find it: a downstream authentication service started returning 401s due to a certificate expiry. It took 30 minutes to find something that a distributed trace would have shown in 10 seconds.

Monitoring vs Observability

Monitoring: you know in advance what questions to ask, and you instrument for those specific questions. Observability: the system emits enough data that you can answer questions you did not know you would need to ask. Monitoring tells you THAT something is wrong. Observability tells you WHY.

The three pillars of observability

What each pillar answers and when to use it

  • Logs — What happened? — Structured event records capturing discrete events. Best format: JSON with consistent fields (timestamp, level, service, trace_id, user_id, message). Tools: CloudWatch Logs, Loki, Elasticsearch/OpenSearch. Structured logs enable filtering by user_id=4421 to find all events for a specific user during an incident.
  • Metrics — How is it performing? — Numerical time-series data. Counters (request count), gauges (memory usage, connection pool size), histograms (request latency distribution). Tools: Prometheus + Grafana, Datadog, CloudWatch Metrics. Metrics tell you "P99 latency is 340ms and rising" — great for alerting and trend analysis.
  • Traces — Where did time go? — A distributed trace follows a single request as it travels through multiple services. Each service adds a "span" — a timed segment of work with metadata. The trace shows the full waterfall: API Gateway (5ms) → Auth Service (8ms) → Product Service (320ms — this is your bottleneck) → Database (310ms of that). Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM.
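As an illustration, a structured log emitter along the lines described above might look like the following sketch. The field names mirror the list (timestamp, level, service, trace_id, message, plus context like user_id); the function itself is hypothetical, not from any particular logging library.

```python
import json
import time


def log_event(level: str, service: str, trace_id: str, message: str, **fields) -> dict:
    """Emit one JSON log line with a consistent set of fields."""
    record = {
        "timestamp": time.time(),
        "level": level,
        "service": service,
        "trace_id": trace_id,
        "message": message,
        **fields,  # extra context such as user_id
    }
    print(json.dumps(record))
    return record


# A log backend can now filter on user_id=4421 to pull every event
# for that user during an incident, or on trace_id to follow one request.
evt = log_event("ERROR", "auth-service", "abc-123",
                "token validation failed", user_id=4421)
```

Because every line is machine-parseable JSON with the same keys, "millions of lines, no structured search" becomes a single filter query.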

The correlation key: trace_id

Generate a unique trace_id at the entry point (API gateway, load balancer) and propagate it through every service call as an HTTP header (X-Trace-ID). Include it in every log line. Now you can: search logs by trace_id to find all events for one request, find the trace in Jaeger to see the waterfall, and correlate with metrics from that time window.
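A minimal sketch of that propagation pattern, assuming the `X-Trace-ID` header named above (function names are illustrative, not from a specific framework):

```python
import uuid

TRACE_HEADER = "X-Trace-ID"


def incoming_trace_id(headers: dict) -> str:
    """Reuse the caller's trace id if present; otherwise this service
    is the entry point and starts a new trace."""
    return headers.get(TRACE_HEADER) or str(uuid.uuid4())


def outgoing_headers(trace_id: str) -> dict:
    """Every downstream call carries the same trace id."""
    return {TRACE_HEADER: trace_id}


# Entry point: no incoming header, so a new id is generated once...
tid = incoming_trace_id({})
# ...and every hop downstream sees and reuses the same id.
assert incoming_trace_id(outgoing_headers(tid)) == tid
```

The key property is that the id is generated exactly once, at the edge, and then only ever copied forward, so logs, traces, and metrics from every hop share one correlation key.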

SLOs, SLIs, and error budgets — the language of reliability

The reliability vocabulary

  • SLI (Service Level Indicator) — A specific measurable metric about your service. "The fraction of requests that returned a successful response (2xx) in under 500ms over the past 24 hours." SLIs are ratios, not absolutes.
  • SLO (Service Level Objective) — Your reliability target for an SLI. "99.9% of requests should return a 2xx response in under 500ms." This is an internal goal — not a promise to customers (that is an SLA).
  • Error budget — The allowed failure quota derived from your SLO. At 99.9% SLO over 30 days, your error budget = 0.1% × 30 × 24 × 60 = 43.2 minutes of downtime. If you spend it all, engineering velocity is paused until the budget resets.
  • SLA (Service Level Agreement) — A contractual promise to customers with financial penalties for violations. SLAs are always looser than SLOs — you keep internal targets higher than what you promise externally.
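The budget arithmetic above is simple enough to sketch directly (assuming a 30-day month, as in the worked example):

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed failure time per window: (1 - SLO) x window length in minutes."""
    return (1 - slo) * days * 24 * 60


# 99.9% over 30 days -> 0.1% of 43,200 minutes = 43.2 minutes of downtime
budget = error_budget_minutes(0.999)
```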

How error budgets change engineering decisions

The team has 43 minutes of error budget left for the month, and a risky migration is proposed. The team asks: "If this migration has a 20% chance of causing a 30-minute incident, should we do it?" The error budget quantifies the risk: 20% × 30 min = 6 minutes of expected burn. Is that worth the feature benefit? This is the conversation error budgets enable.
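That risk conversation is just expected-value arithmetic; a tiny sketch (the numbers come from the scenario above):

```python
def expected_burn_minutes(p_incident: float, incident_minutes: float) -> float:
    """Expected error-budget cost of a risky change."""
    return p_incident * incident_minutes


remaining_budget = 43.0  # minutes left this month, per the scenario
burn = expected_burn_minutes(0.20, 30)  # 20% chance of a 30-minute incident
# Expected burn is 6 minutes: affordable, but roughly 14% of what's left.
```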

| SLO | Allowed downtime / month | Allowed downtime / year | Who uses it |
| --- | --- | --- | --- |
| 99% (two nines) | 7.3 hours | 3.65 days | Internal tools, batch jobs |
| 99.9% (three nines) | 43.8 minutes | 8.76 hours | Most production services |
| 99.95% | 21.9 minutes | 4.38 hours | Important services, e-commerce |
| 99.99% (four nines) | 4.4 minutes | 52.6 minutes | Payment processing, auth |
| 99.999% (five nines) | 26 seconds | 5.26 minutes | Telecom, life-critical systems |
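The table values can be reproduced in a couple of lines. This assumes a 365-day year and an average month of one-twelfth of a year, which matches the table's rounding:

```python
def yearly_downtime_hours(slo_percent: float) -> float:
    """Allowed downtime per 365-day year for a given SLO percentage."""
    return (1 - slo_percent / 100) * 365 * 24


def monthly_downtime_minutes(slo_percent: float) -> float:
    """Average month taken as 1/12 of a year."""
    return yearly_downtime_hours(slo_percent) * 60 / 12


# 99.9% -> 8.76 hours/year and 43.8 minutes/month, matching the table
```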
Quick check

Your API has a 99.9% availability SLO. After a 60-minute incident, your error budget is fully consumed with 15 days left in the month. What should the team do?

Alerting that does not cry wolf

The fastest way to make alerts useless is to have too many of them. If engineers get 50 alerts per shift and 45 are noise, they will train themselves to ignore all 50. This is how critical alerts get missed.

Principles for effective alerting

  • Alert on symptoms, not causes — "API error rate > 5%" is a symptom alert — something is wrong for users. "CPU > 80%" is a cause alert — might not be causing user impact. Alert on what users experience; diagnose causes in runbooks.
  • Alert on SLO burn rate, not thresholds — Instead of "error rate > 1%", alert on "current error rate will exhaust the monthly error budget within X hours at this burn rate." This makes alerts actionable (budget-proportional urgency) and reduces false positives.
  • Every alert must have a runbook — If there is no runbook, the on-call engineer cannot resolve it — they will just escalate. If the resolution is always the same, automate it. Alerts with runbooks that say "restart the service" should be automated remediations.
  • Reduce alert fatigue with multi-window burn rate — Alert when the short window (1h) burn rate AND the long window (6h) burn rate both exceed threshold. Short window catches sudden failures, long window catches slow burns. This dramatically reduces false positives.
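A sketch of a multi-window burn-rate check. Burn rate is the ratio of the observed error rate to the rate the SLO allows, so 1.0 means the budget lasts exactly the whole window. The 14.4 threshold is a commonly cited example value for paging (it corresponds to spending about 2% of a 30-day budget in one hour); treat the specific numbers here as assumptions, not a standard.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the error budget is burning relative to plan.
    1.0 means the budget is consumed exactly over the full window."""
    return error_rate / (1 - slo)


def should_page(err_1h: float, err_6h: float, slo: float,
                threshold: float = 14.4) -> bool:
    """Page only when BOTH windows burn fast: the 1h window catches
    sudden failures, the 6h window filters out short blips."""
    return (burn_rate(err_1h, slo) > threshold and
            burn_rate(err_6h, slo) > threshold)


SLO = 0.999
# A sustained 2% error rate burns at 20x in both windows -> page.
# A brief spike that never registers in the 6h window -> no page.
```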
How this might come up in interviews

This topic comes up in SRE, platform engineering, and senior backend interviews; observability is one of the most commonly tested SRE concepts.

Common questions:

  • What is the difference between monitoring and observability?
  • What are the three pillars of observability?
  • What is an SLO and how does it relate to an error budget?
  • How do you design a distributed tracing system?
  • How do you reduce alert fatigue?

Before you move on: can you answer these?

What is the difference between an SLI, SLO, and SLA?

SLI: the specific metric (fraction of successful requests). SLO: your internal target for the SLI (99.9% success). SLA: contractual promise to customers — always looser than the SLO.
