The Simplified Tech · Interactive Explainer

Observability & Monitoring: You Cannot Fix What You Cannot See

Metrics, logs, traces, alerting strategy, SLOs and error budgets, and how senior engineers build systems that are easy to debug.

🎯Key Takeaways
Monitoring = known failures. Observability = novel failures. You need both.
Three pillars: metrics (what), logs (what happened), traces (where time went)
Structured logging (JSON) with correlation IDs — free-text logs are unqueryable
Alert on four golden signals: p99 latency, traffic, error rate, saturation
SLOs + error budgets align engineering and product around reliability
OpenTelemetry: instrument before you need it, not after an incident

Lesson outline

Monitoring vs observability

Monitoring answers "is the system working?" — predefined dashboards and alerts for known failure modes. Reactive.

Observability answers "why is it not working?" — the ability to understand internal state from external outputs (metrics, logs, traces). Proactive: you can debug failure modes you never anticipated.

You need both. Monitoring for fast detection. Observability for fast diagnosis.

The three pillars: metrics, logs, traces

Metrics: Numeric time-series. Counters (requests_total), Gauges (active_connections), Histograms (request_duration_seconds). Cheap, fast, great for dashboards. Cannot tell you WHY something is slow — only THAT it is slow.

Logs: Timestamped structured records. Use JSON (not free-text). Include: requestId, userId, operation, duration, error. Answers "what happened?" for specific events.

Traces: Distributed request paths across services. A trace follows one request through frontend → API gateway → auth → orders → database. Essential for diagnosing latency in microservices.
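The three metric types can be sketched as toy TypeScript classes — illustrative only, to show the semantics; a real service would use a client library such as prom-client:

```typescript
// Counter: monotonically increasing — e.g. requests_total
class Counter {
  private value = 0;
  inc(by = 1) { this.value += by; } // only ever goes up
  get() { return this.value; }
}

// Gauge: a current value that can go up or down — e.g. active_connections
class Gauge {
  private value = 0;
  set(v: number) { this.value = v; }
  get() { return this.value; }
}

// Histogram: cumulative bucket counts — e.g. request_duration_seconds_bucket{le="..."}
class Histogram {
  constructor(private bounds: number[], private counts = bounds.map(() => 0)) {}
  observe(v: number) {
    this.bounds.forEach((le, i) => { if (v <= le) this.counts[i]++; });
  }
  bucketCounts() { return [...this.counts]; }
}

const requestsTotal = new Counter();
requestsTotal.inc();                                   // one request served
const activeConnections = new Gauge();
activeConnections.set(17);                             // current value, not cumulative
const requestDuration = new Histogram([0.1, 0.5, 1, 5]);
requestDuration.observe(0.3);                          // lands in every bucket with le >= 0.5
```

Histograms are what make p95/p99 queries cheap: the server stores a handful of bucket counters, not every observation.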

Correlation IDs link all three pillars

Alert fires (metric) → query logs for slow requests in that window → open trace for a specific slow request → see which service is slow. Trace IDs in logs are the link.

structured-logging.ts

import pino from 'pino';
import express from 'express';
import { AsyncLocalStorage } from 'async_hooks';
import { randomUUID } from 'crypto';

const app = express();

// Request context — automatically included in all logs
const requestContext = new AsyncLocalStorage<{
  requestId: string;
  userId?: string;
  traceId?: string;
}>();

const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  formatters: {
    // AsyncLocalStorage injects request context without threading it through every function
    log: (obj) => ({
      ...obj,
      ...requestContext.getStore(), // auto-inject requestId, traceId
    }),
  },
});

// Middleware: set context for all logs in this request.
// Propagate X-Request-ID — it links logs across services.
app.use((req, res, next) => {
  const requestId = (req.headers['x-request-id'] as string | undefined) ?? randomUUID();
  const traceId = (req.headers['x-trace-id'] as string | undefined) ?? requestId;
  requestContext.run({ requestId, traceId }, () => {
    res.setHeader('x-request-id', requestId);
    next();
  });
});

// Every log automatically includes requestId, traceId
logger.info({ operation: 'order.create', orderId: 42, duration: 45 }, 'Order created');
// ✅ Structured — machine-queryable in Grafana Loki
// ❌ Avoid: logger.info(`Order ${orderId} created`) — unqueryable free text

Four golden signals (Google SRE)

  • ⏱️Latency — Track p50, p95, p99 — NOT average. A 2s p99 means 1 in 100 requests takes 2 seconds or longer.
  • 🚦Traffic — Requests/sec. Correlates latency increases with traffic spikes.
  • 💥Errors — Distinguish 4xx (client errors, expected) from 5xx (server errors, alert on). Track as a percentage of traffic.
  • 📊Saturation — CPU %, queue depth, connection pool utilization. A leading indicator — it rises before failures.

Alert on: p99 latency > threshold, error rate > 1%, saturation > 80%.
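Why averages mislead can be shown in a few lines of TypeScript — a naive nearest-rank percentile, for illustration only; production systems derive percentiles from histograms:

```typescript
// 99 fast requests and 1 slow one: the mean looks healthy, the p99 does not.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

const latencies = [...Array(99).fill(45), 2000]; // ms
const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;
console.log(mean);                      // 64.55 — looks fine
console.log(percentile(latencies, 50)); // 45 — median also looks fine
console.log(percentile(latencies, 99)); // 2000 — the tail users actually feel
```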

SLOs and error budgets

SLI: Metric measuring service behavior. "p99 latency of successful HTTP requests."

SLO: Target for an SLI. "p99 latency < 200ms over 30 days."

Error budget: 100% - SLO target = allowed failure budget. 99.9% availability = 43.8 min/month downtime allowed. If budget exhausted → no new deploys until next period.

Error budgets create organizational alignment: "do we spend the budget on features (risky deploys) or protect reliability (no deploys)?"
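The arithmetic is simple enough to sketch. Note the two conventions: a fixed 30-day window gives 43.2 minutes of budget, while an average calendar month (~30.4 days) gives the commonly quoted 43.8 figure.

```typescript
const sloTarget = 0.999;                       // 99.9% availability SLO
const windowMinutes = 30 * 24 * 60;            // 43,200 minutes in a 30-day window
const budgetMinutes = (1 - sloTarget) * windowMinutes;
console.log(budgetMinutes.toFixed(1));         // 43.2 minutes of allowed downtime

// Burn rate = observed error rate ÷ budgeted error rate (1 - SLO target)
function burnRate(errorRate: number, slo: number): number {
  return errorRate / (1 - slo);
}

const rate = burnRate(0.0144, sloTarget);      // 14.4 — burning 14.4x too fast
const hoursToExhaust = (30 * 24) / rate;       // ~50 hours until the budget is gone
```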

SLOs should be based on user happiness

"CPU < 80%" is infrastructure. "99.9% of searches return in < 500ms" is a user-facing SLO. Alert on what users experience, not on infrastructure metrics.

prometheus-slo.yaml

# SLO alerting: burn rate approach

# Error rate alert (page the on-call)
- alert: SLOErrorBudgetBurningFast
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) /
      sum(rate(http_requests_total[1h]))
    ) > 0.01  # > 1% error rate over a 1h window
  for: 5m     # sustained for 5 minutes → page; not every spike, only sustained burn
  labels:
    severity: page
  annotations:
    summary: "Error budget burning — SLO at risk"

# Burn rate: how fast is the monthly budget being consumed?
# Burn rate of 1 = on track (sustainable)
# Burn rate of 14.4 = monthly budget exhausted in 50 hours (urgent)
- record: slo:error_budget_burn_rate:1h
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[1h])) /
     sum(rate(http_requests_total[1h])))
    / 0.001  # divide by (1 - SLO target) = (1 - 0.999)

Distributed tracing with OpenTelemetry

OpenTelemetry (OTel): Industry standard. Instrument once, export to any backend (Jaeger, Tempo, Honeycomb, Datadog). Replaces all vendor-specific SDKs.

Context propagation: The trace ID flows through every service via the `traceparent` HTTP header. Each service creates a child span. Result: a waterfall showing exactly where time is spent.

Sampling: At 100K RPS, recording every trace is expensive. Tail-based sampling (record all traces with errors or high latency) is most useful for production.
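A sketch of the tail-based sampling decision — the thresholds and hash function here are illustrative, not what any particular collector implements:

```typescript
interface CompletedTrace {
  traceId: string;
  durationMs: number;
  hasError: boolean;
}

// Decided AFTER the trace completes — that is what makes it "tail-based"
function shouldKeep(t: CompletedTrace, baselineRate = 0.01): boolean {
  if (t.hasError) return true;          // always keep failed requests
  if (t.durationMs > 1000) return true; // always keep slow requests
  // Hash the trace ID so every collector makes the same decision for a given trace
  const hash = [...t.traceId].reduce((h, c) => (h * 31 + c.charCodeAt(0)) >>> 0, 0);
  return (hash % 10000) / 10000 < baselineRate; // keep ~1% of healthy traffic
}
```

Head-based sampling (deciding at the first span) is cheaper to run, but it cannot know a trace will end in an error — which is exactly the trace you want during an incident.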

opentelemetry-setup.ts

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Payment gateway client, assumed to exist elsewhere in the service
declare const paymentGateway: { charge(amount: number): Promise<unknown> };

// Auto-instrumentation captures HTTP, Express, pg, Redis spans with zero manual code
const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': 'orders-service',
    'service.version': process.env.APP_VERSION,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// Manual span for critical business operations
const tracer = trace.getTracer('orders-service');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('payment.process', async (span) => {
    span.setAttributes({ 'order.id': orderId, 'payment.amount': amount });
    try {
      const result = await paymentGateway.charge(amount);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end(); // ALWAYS end spans — unclosed spans leak memory
    }
  });
}

How this might come up in interviews

Observability questions test maturity — junior engineers reach for console.log; senior engineers build observable systems from day one.

Common questions:

  • How would you diagnose a latency spike in a microservices system?
  • What is the difference between monitoring and observability?
  • Design an SLO for a checkout API.
  • How does distributed tracing work?

Strong answers include:

  • Monitors p99, not just p50
  • Can explain error budgets and deployment decisions
  • Knows OpenTelemetry and context propagation
  • Distinguishes metrics, logs, traces and when to use each

Red flags:

  • Relies on console.log for production debugging
  • Monitors only CPU and memory
  • Cannot explain SLOs
  • Does not know distributed tracing

From the books

Site Reliability Engineering — Beyer, Jones, Petoff, Murphy (2016)

Chapter 6: Monitoring Distributed Systems

The four golden signals — latency, traffic, errors, saturation — with proper alerting, catch most production problems before users report them.
