© 2026 TheSimplifiedTech. All rights reserved.


SRE Alerting: From Alert Fatigue to SLO-Driven Precision

Master the art and engineering of production alerting: eliminate alert fatigue with symptom-based and SLO-driven burn-rate alerts, design actionable runbook-linked pages, build multi-window alerting algorithms, and architect robust routing pipelines — the way Google, Cloudflare, and Netflix do it at scale.

🎯Key Takeaways
Alert fatigue is not a people problem — it is a system design problem. Google found teams ignoring 50%+ of alerts as a rational survival response to poorly designed alerting. The fix is engineering-grade alerting: symptom-based, SLO-calibrated, and built with the same rigor as production software.
SLO-based burn rate alerting is the gold standard. Instead of asking "is CPU > 80%?", ask "how fast is the error budget being consumed?" A burn rate of 14.4x over 1 hour means the monthly SLO budget will be exhausted in 50 hours — that is always a P1, regardless of what the underlying cause metric looks like.
The multi-window algorithm eliminates false positives: a P1 alert should require elevated burn rate in BOTH a 1-hour AND 5-hour window simultaneously. Transient 20-minute spikes elevate the short window but not the long window, and never page. Sustained incidents elevate both and page immediately.
Alert routing and noise reduction require active engineering: Alertmanager inhibition rules, grouping, deduplication, and properly configured escalation policies in PagerDuty or Grafana OnCall. The Cloudflare 2019 incident shows that an alert storm (100+ pages for one root cause) can add 17 minutes to MTTR — time during which 27 million users were affected.
Every alert must pass the UIA test (Urgent, Important, Actionable) and link to a current runbook. An alert that fires but leaves the on-call engineer unsure of what to do is not an alert — it is confusion with a pager attached. The dead man's switch (heartbeat alert) is the highest-priority alert in any system: if your alerting pipeline fails silently, you have no alerts at all.


~42 min read


The Alert Fatigue Crisis: Why Most Teams Drown in Noise

Alert fatigue is one of the most insidious reliability problems in modern operations. It begins innocuously: a team adds a CPU alert, then a memory alert, then a disk alert, then five more "just in case" alerts. Within months, the on-call rotation receives 200+ pages per week. Engineers begin ignoring pages. They silence their phones. They acknowledge alerts without investigating them. When the real incident finally arrives — the one that matters — it drowns in the noise. By the time someone responds, millions of users have been affected for 20 minutes.

Google conducted internal research on alerting quality across their SRE organization and found a staggering result: teams with poorly-designed alerting were ignoring more than 50% of their alerts. Not occasionally — routinely, as a survival mechanism. The engineers were not lazy or negligent; they had learned through painful experience that most alerts did not require action. The signal-to-noise ratio had collapsed so badly that treating all alerts as urgent had become impossible. The only rational response was to triage by intuition rather than by alert content.

The Alert Fatigue Death Spiral

High alert volume → engineers ignore alerts → real incidents missed → outages → add more alerts to "catch everything" → even higher alert volume. Teams caught in this spiral often cannot escape without a deliberate alerting overhaul. The spiral is self-reinforcing because each new alert added after a missed incident feels justified, even when it makes the underlying problem worse.

The Nagios era gave us the template for this failure. Nagios popularized check-based alerting in the late 1990s: you define a threshold on a metric (CPU > 90%, disk > 80%, memory > 85%), and when the threshold is crossed, a page fires. This model has three critical flaws. First, thresholds are arbitrary. What makes CPU > 90% an emergency versus CPU > 85%? Nobody knows — these numbers are typically copied from a default config. Second, many threshold breaches are transient — CPU spikes for 30 seconds and recovers. The alert fired, the on-call engineer woke up at 3 AM, nothing was wrong. Third, and most importantly, threshold-based alerts are cause-based, not symptom-based. A CPU spike does not tell you whether users are being affected. Users might be experiencing no impact at all, or the impact might be severe — you cannot tell from the CPU metric alone.

Consider the numbers. A team with 50 active Nagios-style threshold alerts across a fleet of 100 servers generates thousands of alert events per week, even in a healthy system, due to natural metric variance. At Google scale (millions of servers), this model is not just impractical — it is impossible. Google moved to symptom-based alerting and eventually SLO-based alerting specifically because threshold-based alerting does not scale. The lesson applies equally to smaller teams: a startup with 20 services and 200 threshold alerts has the same structural problem as Google with 10,000 services and 2 million alerts.

Five Root Causes of Alert Fatigue

  • Cause-based alerts — Alerting on low-level system metrics (CPU, memory, disk) that may or may not correspond to user pain. A CPU spike might be completely harmless — a batch job running normally. Cause-based alerts generate massive noise because systems are full of transient metric excursions that resolve on their own.
  • Arbitrary thresholds — Thresholds copied from defaults or set by intuition rather than derived from SLO analysis. "CPU > 80% is bad" has no grounding in actual user impact. Over time, thresholds are never recalibrated as traffic patterns change, leading to chronic false positives.
  • No ownership model — Alerts added by one team that pages a different team. Nobody has clear responsibility for maintaining alert quality. Stale alerts accumulate because nobody has an incentive to clean them up — the person who added them is long gone, and the person receiving them has no authority to delete them.
  • "Just in case" alerts — Alerts added after incidents with the reasoning "we should have caught this earlier." Each post-mortem adds two more alerts. The team never removes old alerts. After 3 years, 80% of active alerts were added reactively and have fired fewer than 5 times total.
  • No alert retirement process — Alerts are immortal by default. There is no process for reviewing and retiring alerts that are low-value or consistently noisy. The alert inventory grows monotonically. A healthy alerting system requires active gardening: add alerts deliberately, retire them systematically, and review the full inventory quarterly.
Metric | Healthy Team | Alert-Fatigued Team
Pages per week per on-call engineer | < 10 | 50-200+
Percentage of pages requiring action | > 80% | < 30%
Mean time to acknowledge (MTTA) | < 5 minutes | 15-30 minutes (engineers stopped checking)
Alert response rate | > 95% | 40-60% (active ignoring)
Percentage of alerts ignored or auto-acked | < 5% | 40-60%
Time spent triaging before identifying real issue | < 2 minutes | 10-20 minutes per incident
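These health metrics are easy to compute from a week of pager events. A minimal sketch in Python — the Page schema and field names here are hypothetical illustrations, not any pager vendor's API:

```python
from dataclasses import dataclass

@dataclass
class Page:
    acked: bool            # did the on-call acknowledge the page?
    action_taken: bool     # did the page require a real remediation action?
    minutes_to_ack: float  # time from fire to acknowledgement (if acked)

def alerting_health(pages: list[Page]) -> dict:
    """Summarize one week of pages into the health metrics above."""
    total = len(pages)
    acked = [p for p in pages if p.acked]
    return {
        "pages_per_week": total,
        "action_rate": sum(p.action_taken for p in pages) / total,
        "mtta_minutes": sum(p.minutes_to_ack for p in acked) / len(acked),
        "ignored_rate": 1 - len(acked) / total,
    }
```

A team tracking these numbers weekly can see the death spiral forming (action rate falling, ignored rate rising) long before a missed incident makes it obvious.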

The solution is not fewer alerts across the board — it is better alerts. The SRE approach treats alerting as engineering: alerts are software artifacts that must be designed, tested, maintained, and retired like any other software. The foundation of this approach is asking one question before adding any alert: "If this alert fires at 3 AM, what will the on-call engineer do?" If the answer is "investigate and probably find nothing wrong," or "acknowledge and go back to sleep," the alert should not exist. If the answer is "execute this specific runbook to mitigate a user-visible problem," the alert belongs in production.

The Alert Justification Test

For every alert in your system, answer three questions: (1) Does this alert fire only when users are experiencing pain, not just when a metric is elevated? (2) Does the alert contain a clear description of what is happening and a link to the runbook? (3) Is there a specific, documented action the on-call engineer must take? If you cannot answer yes to all three, the alert should not be paging anyone.

Symptom-Based vs Cause-Based Alerting

The most important conceptual shift in modern alerting is moving from cause-based to symptom-based alerting. Cause-based alerting monitors the internal state of your systems: CPU utilization, memory pressure, disk I/O, connection pool saturation, garbage collection pause duration. Symptom-based alerting monitors the user experience: error rate, request latency, throughput, availability. The difference is not semantic — it is the difference between a system that pages you 50 times per week and one that pages you 5 times per week, with the symptom-based system catching every real incident and the cause-based system missing the ones that matter most.

The classic demonstration: your application is experiencing elevated error rates for the checkout flow. Users cannot complete purchases. The root cause is a deadlock in the payment service database connection pool. Now count the cause-based alerts that fired: connection pool saturation alert, database connection error rate alert, application CPU alert (from retry storms), network latency alert (from connection timeouts). Four alerts for one incident, all requiring triage to understand the relationship. Count the symptom-based alerts that fired: one alert — "checkout flow error rate > 1% for 5 minutes." One alert, directly tied to user pain, with a clear description of the impact.

flowchart TD
    subgraph CAUSE["Cause-Based Alert Chain (4 pages, 1 incident)"]
        direction TB
        C1["CPU Alert: app-server-03 CPU > 90%"]
        C2["DB Alert: connection_pool_saturation > 85%"]
        C3["Net Alert: p99 latency > 500ms"]
        C4["App Alert: connection_timeout_rate > 5%"]
        C1 & C2 & C3 & C4 --> TRIAGE["Engineer spends 17 min triaging
4 alerts to find root cause"]
        TRIAGE --> ROOT["Root cause: DB deadlock in payment service"]
    end

    subgraph SYMPTOM["Symptom-Based Alert (1 page, same incident)"]
        direction TB
        S1["SLO Alert: checkout_error_rate > 1% for 5m
Impact: users cannot complete purchases"]
        S1 --> ACTION["Engineer opens runbook immediately
Finds root cause in < 2 minutes"]
        ACTION --> ROOT2["Root cause: DB deadlock in payment service"]
    end

    CAUSE -.->|"Same incident
different alerting approach"| SYMPTOM

Cause-Based vs Symptom-Based Alert Comparison

The examples of well-designed symptom-based alerts at major companies illustrate the principle clearly. Google alerts on HTTP 5xx error rate and p99 latency per endpoint, not on server CPU or memory. Netflix alerts on play start failure rate and rebuffering ratio, not on streaming server resource utilization. Cloudflare alerts on edge hit rate and origin error rate, not on individual edge node health. In each case, the alert is grounded in user experience — it fires if and only if users are being harmed, regardless of what the underlying infrastructure is doing.

Symptom-Based Alert Examples by Service Type

  • Web API (user-facing) — Alert: HTTP 5xx error rate > 0.5% over 5 minutes. Symptom: users receiving error responses. Do NOT alert on: individual server CPU, application memory usage, thread pool size.
  • Database — Alert: query latency p99 > 200ms sustained for 3 minutes, or failed transaction rate > 0.1%. Symptom: application operations failing or slowing. Do NOT alert on: raw disk IOPS, connection count, buffer pool hit rate.
  • Message queue — Alert: consumer lag growing for > 10 minutes, or message processing error rate > 1%. Symptom: downstream consumers are falling behind or failing. Do NOT alert on: broker memory, partition count, replication lag (unless it affects consumer lag).
  • CDN / cache layer — Alert: cache miss rate spike exceeding baseline by 50% for 5 minutes, or origin error rate > 2%. Symptom: users experiencing increased latency or errors from cache stampede or origin failure. Do NOT alert on: individual cache node memory, eviction rate.
  • Background job processing — Alert: job failure rate > 5% over 15 minutes, or job processing latency p95 > 2x SLA. Symptom: batch operations not completing, downstream data pipelines stalled. Do NOT alert on: worker CPU, queue depth in isolation.

The counterargument is that cause-based alerts provide early warning before user impact. This is a legitimate concern, but it is usually solved better by dashboards and capacity planning than by pages. A CPU trending toward 100% over 24 hours is a capacity signal that belongs on a dashboard with a daily report, not a 3 AM page. If the CPU actually causes user-visible errors, the symptom-based alert will fire. The question is whether you need to wake someone up for the early warning — in most cases you do not. Dashboards, capacity alerts in business-hours-only routing, and weekly review cadences handle early warning appropriately without generating nighttime noise.

Where Cause-Based Alerts Still Belong

Some cause-based alerts are valuable when they are reliably predictive of user impact AND have a long enough lead time to allow preventive action. Examples: disk usage trending to full in < 4 hours (there is a runbook action: clean logs or expand disk), TLS certificate expiring in < 7 days (there is a runbook action: renew certificate), SSL cipher suite using deprecated algorithm (there is a runbook action: rotate). These pass the actionability test. Pure resource utilization metrics that fluctuate without user impact do not.

The Google SRE Book Definition

The Google SRE Book states: "Symptoms are a better way to catch a broad class of problems, but they only catch the right things in the right measure when the thresholds are correct." The key insight is that symptom-based alerting is not about reducing sensitivity — it is about alerting on the right things. A 1% error rate threshold catches real user pain while ignoring the constant background noise of cause-based metrics.

SLO-Based Alerting: The Gold Standard

SLO-based alerting is the most sophisticated and effective alerting approach in modern SRE practice. Instead of alerting on raw metrics or even individual symptom thresholds, SLO-based alerting monitors the rate at which your error budget is being consumed. This approach, detailed in the Google SRE Workbook, solves the fundamental problem with threshold-based alerting: it automatically adjusts alert sensitivity to reflect the actual user impact and the service's reliability commitments.

The foundation is the error budget. If your service has a 99.9% availability SLO, you are allowed 0.1% of requests to fail (or equivalently, 43.2 minutes of downtime per 30-day month). Your error budget is that 0.1% of requests. Every time a user request fails, you spend a tiny fraction of that budget. SLO-based alerting asks: how fast are you spending your error budget? If you spend it slowly (small but persistent error rate), you get a low-severity alert. If you spend it rapidly (high error rate), you get an immediate P1 page.

Error Budget Mathematics

For a 99.9% SLO with a 30-day window: total error budget = 0.1% x 30 days x 24 hours x 60 minutes = 43.2 minutes of downtime equivalent. If your service is currently returning 5% errors, you are consuming the error budget at 50x the normal rate (5% / 0.1% = 50x). At that rate, you will exhaust the entire monthly error budget in 30 days / 50 ≈ 14.4 hours. This is a P1 incident. This calculation is the core of burn rate alerting.

The burn rate concept formalizes this. A burn rate of 1 means you are consuming your error budget at exactly the rate that would use it up precisely at the end of the window. A burn rate of 10 means you are consuming it 10x faster — if you started at the beginning of the month, you would exhaust the budget in 3 days instead of 30. A burn rate of 60 means you would exhaust the budget in 12 hours. The burn rate translates the abstract error rate into a concrete measure of severity: "how much runway do you have before the SLO is breached?"
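The arithmetic is mechanical enough to encode directly. A minimal sketch in Python of the three quantities involved (function names are illustrative):

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Downtime-equivalent budget: e.g. 99.9% over 30 days -> 43.2 minutes."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than plan the budget is being spent.
    A rate of 1 exhausts the budget exactly at the end of the window."""
    return error_rate / (1 - slo)

def hours_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """At a constant burn rate, hours until the window's budget is gone."""
    return window_days * 24 / rate

# 5% errors against a 99.9% SLO is a 50x burn: budget gone in 720 / 50 = 14.4 hours.
# The 14.4x fast-burn threshold is "2% of budget per hour": 0.02 * 720 = 14.4,
# which leaves 720 / 14.4 = 50 hours of runway.
```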

Burn Rate | Error Rate (for 99.9% SLO) | Time to Exhaust Budget | Recommended Action
1x | 0.1% | 30 days (full window) | No alert — this is normal budget consumption
2x | 0.2% | 15 days | No page — ticket for investigation during business hours
5x | 0.5% | 6 days | Low-severity alert — P3, business hours only
10x | 1% | 3 days | Medium alert — P2, page during business hours, on-call at night
14.4x | 1.44% | 50 hours (2% budget in 1 hr) | High severity — P1 fast burn, page immediately
36x | 3.6% | 20 hours | Critical — P1, immediate page + escalation
60x | 6% | 12 hours | Severe — P1, major incident declaration

The Google SRE Workbook specifies a precise burn rate threshold for the fast-burn case: a burn rate of 14.4 or higher over the last hour should trigger an immediate page. Why 14.4? Because 14.4x burn rate means you are consuming 2% of your 30-day error budget every hour. At that rate, you exhaust the budget in 50 hours — less than 3 days. This is significant enough to require immediate human attention regardless of time of day. The threshold was chosen to balance sensitivity (catching real incidents quickly) with specificity (not paging on minor transient spikes).

Why SLO Alerting Beats Threshold Alerting

A 1% error rate on a service with a 99.9% SLO is a P1 incident (burn rate 10x). A 1% error rate on a service with a 99% SLO is background noise (burn rate 1x — you are consuming budget exactly on plan). SLO-based alerting automatically calibrates alert severity to the service's actual reliability commitments. Threshold-based alerting treats 1% errors the same regardless of what the service promised users.

Setting Up SLO-Based Alerting: Step by Step

1. Define your SLI (Service Level Indicator): the metric that measures user experience. Usually: error rate (proportion of requests returning 5xx), latency (proportion of requests slower than a threshold), or availability (proportion of time the service is serving traffic).

2. Set your SLO (Service Level Objective): the target reliability level. Example: "99.9% of checkout requests will succeed over any 30-day rolling window." This sets your error budget at 0.1%.

3. Calculate your error budget in concrete terms: for a 99.9% SLO over 30 days, error budget = 43.2 minutes of downtime equivalent, or 0.1% of request volume.

4. Define burn rate thresholds for each alert severity. Fast burn (P1): 14.4x over 1 hour (2% of monthly budget consumed in 1 hour). Slow burn (P2): 6x over 6 hours (5% of monthly budget in 6 hours). Very slow burn (P3/ticket): 3x over 24 hours (10% of monthly budget in a day).

5. Implement multi-window alerting: require the burn rate to be elevated in BOTH a short window AND a long window. This eliminates false positives from transient spikes while maintaining fast detection of sustained incidents.

6. Link every alert to a runbook with specific diagnostic steps. The alert body should include the current burn rate, the percentage of error budget remaining, and the runbook URL.

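The thresholds in steps 4-5 are fully determined by the SLO target, so they can be generated rather than hand-set per service. A hypothetical helper (the tier values mirror the P1/P2/P3 windows used in this lesson; nothing here is a standard library API):

```python
def burn_rate_tiers(slo: float) -> list[dict]:
    """Derive multi-window burn-rate alert tiers from an SLO target."""
    budget = 1 - slo  # e.g. 0.001 for a 99.9% SLO
    tiers = [
        {"priority": "P1", "burn": 14.4, "short_h": 1,  "long_h": 5},
        {"priority": "P2", "burn": 6.0,  "short_h": 6,  "long_h": 30},
        {"priority": "P3", "burn": 3.0,  "short_h": 24, "long_h": 72},
    ]
    for t in tiers:
        # raw error-rate threshold to compare each SLI window against
        t["error_rate_threshold"] = t["burn"] * budget
    return tiers
```

Generating thresholds this way keeps alerting consistent across services with different SLOs: the same 14.4x fast-burn tier yields a 1.44% error-rate threshold for a 99.9% service and a 14.4% threshold for a 99% service.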

Multi-Window Burn Rate Alerts Deep Dive

The multi-window burn rate algorithm is the gold standard implementation of SLO-based alerting, and it is the approach used at Google, Spotify, and other organizations that have invested seriously in reliability engineering. The core insight is that a single-window burn rate alert (e.g., "burn rate > 14.4x in the last hour") has a significant flaw: it fires on transient spikes that resolve themselves within the measurement window. The multi-window approach requires the burn rate to be elevated in two windows simultaneously — a short window for fast detection and a longer window for false-positive reduction.

The algorithm works as follows. Define two measurement windows: a short window (1 hour for P1, 6 hours for P2) and a long window (5x the short window: 5 hours for P1, 30 hours for P2). The alert fires only when burn rate exceeds the threshold in BOTH windows simultaneously. This means a 20-minute spike that resolves before the end of the 5-hour long window will NOT fire a P1 alert. A sustained 2-hour incident WILL fire because both the 1-hour and 5-hour windows show elevated burn rate.
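The two-window AND condition is small enough to state directly. A minimal Python sketch of the decision, with the burn rates assumed to be precomputed (e.g. by recording rules):

```python
def should_page(short_burn: float, long_burn: float, threshold: float = 14.4) -> bool:
    """P1 fires only when BOTH the short and long windows show elevated burn."""
    return short_burn > threshold and long_burn > threshold

# Transient 20-minute spike: the 1-hour window is hot, but the 5-hour window
# has averaged the spike away -> no page.
# Sustained incident: both windows elevated -> page immediately.
```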

The Four Alert Categories from the SRE Workbook

P1 Fast Burn: burn rate > 14.4x in 1-hour AND 5-hour windows (2% budget consumed in 1 hour — exhausted in < 50 hours). P2 Slow Burn: burn rate > 6x in 6-hour AND 30-hour windows (5% budget consumed in 6 hours — exhausted in < 5 days). P3 Ticket: burn rate > 3x in 24-hour window (10% budget consumed in 24 hours — exhausted in < 10 days). P4 No Action: burn rate between 1x and 3x — monitor on dashboard, no page.

Complete Prometheus Alert Rules for Multi-Window Burn Rate

  • P1 Fast Burn Alert (PromQL) —

        groups:
          - name: slo_alerts
            rules:
              - alert: ErrorBudgetFastBurn
                expr: |
                  (
                    job:slo_errors_per_request:ratio_rate1h{job="checkout"} > (14.4 * 0.001)
                  )
                  and
                  (
                    job:slo_errors_per_request:ratio_rate5h{job="checkout"} > (14.4 * 0.001)
                  )
                for: 2m
                labels:
                  severity: page
                  priority: P1
                annotations:
                  summary: "Fast burn rate on checkout SLO"
                  description: "Error budget burn rate is {{ $value | humanizePercentage }} (threshold: 14.4x). Budget exhaustion in < 50 hours. Runbook: https://wiki.internal/runbooks/checkout-errors"
                  runbook_url: "https://wiki.internal/runbooks/checkout-errors"

  • P2 Slow Burn Alert (PromQL) —

        - alert: ErrorBudgetSlowBurn
          expr: |
            (
              job:slo_errors_per_request:ratio_rate6h{job="checkout"} > (6 * 0.001)
            )
            and
            (
              job:slo_errors_per_request:ratio_rate30h{job="checkout"} > (6 * 0.001)
            )
          for: 15m
          labels:
            severity: ticket
            priority: P2
          annotations:
            summary: "Slow burn rate on checkout SLO"
            description: "Error budget burn rate elevated for 6+ hours. Current rate will exhaust budget in < 5 days. Investigate during business hours."

  • Recording Rules for Efficiency —

        groups:
          - name: slo_recording_rules
            rules:
              - record: job:slo_errors_per_request:ratio_rate1h
                expr: |
                  sum(rate(http_requests_total{status=~"5.."}[1h])) by (job)
                  /
                  sum(rate(http_requests_total[1h])) by (job)
              - record: job:slo_errors_per_request:ratio_rate5h
                expr: |
                  sum(rate(http_requests_total{status=~"5.."}[5h])) by (job)
                  /
                  sum(rate(http_requests_total[5h])) by (job)
              - record: job:slo_errors_per_request:ratio_rate6h
                expr: |
                  sum(rate(http_requests_total{status=~"5.."}[6h])) by (job)
                  /
                  sum(rate(http_requests_total[6h])) by (job)
              - record: job:slo_errors_per_request:ratio_rate30h
                expr: |
                  sum(rate(http_requests_total{status=~"5.."}[30h])) by (job)
                  /
                  sum(rate(http_requests_total[30h])) by (job)

The for: 2m parameter in the P1 alert deserves explanation. The for clause means the condition must be continuously true for 2 minutes before the alert fires. This eliminates alert flapping from transient metric collection gaps or momentary spikes that cross the threshold for a single scrape interval. For P1 alerts, 2 minutes is the right balance: it is short enough to page quickly for real incidents (total detection time: spike begins + 2 minutes) but long enough to avoid spurious pages from 15-second metric collection anomalies. P2 alerts use for: 15m because the longer burn window already provides temporal smoothing.
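The behavior of the for: clause can be illustrated with a toy state machine. This is a simplified, scrape-count model (real Prometheus evaluates the condition against wall-clock time at each evaluation interval), and all names are hypothetical:

```python
def pending_then_firing(samples: list[bool], for_intervals: int) -> list[str]:
    """Simulate a `for:` clause: the condition must hold for `for_intervals`
    consecutive scrapes before the alert moves from 'pending' to 'firing'.
    A single false sample resets the alert to 'inactive'."""
    streak, states = 0, []
    for cond in samples:
        streak = streak + 1 if cond else 0
        if streak == 0:
            states.append("inactive")
        elif streak < for_intervals:
            states.append("pending")
        else:
            states.append("firing")
    return states
```

A one-scrape threshold crossing stays in "pending" and resolves without ever paging, which is exactly the flapping protection the 2-minute hold provides.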

Alert Level | Short Window | Long Window | Burn Rate | Budget Consumed | Time to Exhaustion | Response SLA
P1 Fast Burn | 1 hour | 5 hours | > 14.4x | 2% in 1 hr | < 50 hours | Page immediately, respond within 5 min
P2 Slow Burn | 6 hours | 30 hours | > 6x | 5% in 6 hr | < 5 days | Page engineer, respond within 30 min
P3 Ticket | 24 hours | 72 hours | > 3x | 10% in 24 hr | < 10 days | Create ticket, address in current sprint
P4 Monitor | N/A | N/A | 1x-3x | On track | On track | Dashboard review, weekly SLO meeting

Latency SLO Burn Rate Requires Different PromQL

For latency SLOs, you cannot use error rate. Instead, measure the proportion of requests that EXCEED the latency threshold: sum(rate(http_request_duration_seconds_bucket{le="0.3"}[1h])) / sum(rate(http_request_duration_seconds_count[1h])). If your SLO is "95% of requests complete within 300ms," the error rate equivalent is 1 minus the proportion of requests under 300ms. Apply the same burn rate thresholds to this derived metric.
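In code form the derivation is one line. A hypothetical helper; the two inputs correspond to the two sums in the PromQL above (requests under the latency threshold, and all requests):

```python
def latency_error_rate(under_threshold: float, total: float) -> float:
    """Budget-burning 'errors' for a latency SLO are requests SLOWER than the
    threshold: 1 minus the proportion that finished in time."""
    return 1 - under_threshold / total

# 9,700 of 10,000 requests under 300ms -> a 3% "slow-request" error rate,
# to which the usual burn-rate thresholds apply.
```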

Use Recording Rules, Not Raw PromQL in Alert Rules

Multi-window burn rate calculations over long windows (30 hours) are expensive to evaluate in real time for every Prometheus evaluation interval (typically 15-30 seconds). Pre-compute the ratio using recording rules, evaluated on a slower interval (1-5 minutes). This reduces query load on Prometheus by 90%+ while providing the same alerting accuracy. The SRE Workbook recommends this pattern explicitly.

Alert Routing and Escalation Architecture

Generating the right alerts is only half the battle. Those alerts must reach the right people, through the right channel, at the right time. Alert routing is the system that maps alerts to on-call engineers based on service ownership, severity, time of day, and team availability. When routing fails — alerts go to the wrong team, bypass escalation, or disappear silently — the best alerting logic in the world does not protect your users.

The modern alert routing pipeline has several components. Alertmanager (for Prometheus-based stacks) or equivalent receives raw alerts from the monitoring system. A routing tree evaluates each alert against rules that map labels to receivers. Receivers send notifications to PagerDuty, Opsgenie, or a direct integration. The on-call scheduling platform manages rotation schedules and escalation policies. A secondary escalation path ensures that un-acknowledged alerts reach someone within the SLA.

flowchart TD
    A["Prometheus / Grafana
Alert fires"] --> B["Alertmanager
Receives alert"]
    B --> C{"Route by labels:
team, severity, env"}

    C -->|"severity=critical
team=payments"| D["PagerDuty
Payments On-Call Policy"]
    C -->|"severity=warning
team=infra"| E["Slack #infra-alerts
+ PagerDuty low-urgency"]
    C -->|"severity=info
env=staging"| F["Slack #staging-alerts
(no page)"]

    D --> G["Primary On-Call
Notified via SMS + app"]
    G -->|"No ACK in 5 min"| H["Escalation Layer 1
Secondary On-Call"]
    H -->|"No ACK in 10 min"| I["Escalation Layer 2
Engineering Manager"]
    I -->|"No ACK in 15 min"| J["Escalation Layer 3
Director of Engineering"]

    E --> K["On-Call Engineer
Low urgency notification"]
    K -->|"No ACK in 30 min"| L["Escalation
Team Lead (business hours)"]

Alert Routing and Escalation Flow

Alertmanager Routing Configuration Example

  • Root route —

        route:
          group_by: ['alertname', 'cluster', 'service']
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 4h
          receiver: 'default-receiver'
          routes:
            - match:
                severity: critical
              receiver: 'pagerduty-critical'
              continue: false
            - match:
                severity: warning
              receiver: 'pagerduty-warning'
              continue: false
            - match:
                severity: info
              receiver: 'slack-info'
              continue: false

  • PagerDuty critical receiver —

        receivers:
          - name: 'pagerduty-critical'
            pagerduty_configs:
              - routing_key: '<PAGERDUTY_INTEGRATION_KEY>'
                description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
                details:
                  firing: '{{ .Alerts.Firing | len }} alerts firing'
                  runbook: '{{ .CommonAnnotations.runbook_url }}'
                  dashboard: '{{ .CommonAnnotations.dashboard_url }}'
                severity: 'critical'

  • Slack receiver for low-severity —

        - name: 'slack-info'
          slack_configs:
            - channel: '#alerts-info'
              title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
              text: '{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}'
              send_resolved: true

PagerDuty escalation policies provide the safety net when an on-call engineer does not respond. A well-designed escalation policy has at minimum three levels: (1) Primary on-call engineer, notified immediately via SMS and push notification, with a 5-minute acknowledgement window. (2) Secondary on-call engineer, notified if primary does not acknowledge within 5 minutes — this is the backup, typically the previous week's primary. (3) Engineering manager or team lead, notified if secondary does not acknowledge within 10 minutes. For P1 incidents at companies with strict SLAs, a fourth level may exist: a VP or Director who receives a notification after 20 minutes if the incident is still unacknowledged.
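The three-level policy reduces to a timetable: who has been notified t minutes into an unacknowledged P1. The times follow the policy described above (primary immediately, secondary at 5 minutes, manager 10 minutes after that); the structure is an illustration, not a PagerDuty API:

```python
# (delay_minutes, who) -- escalation layers for an unacknowledged P1
ESCALATION = [
    (0, "primary on-call"),
    (5, "secondary on-call"),
    (15, "engineering manager"),
]

def notified_by(minute: int) -> list[str]:
    """Everyone who has been paged by `minute`, assuming no acknowledgement."""
    return [who for delay, who in ESCALATION if minute >= delay]
```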

Severity-Based Notification Channels

P1 (critical): SMS + phone call + push notification. Never just email. The goal is to wake someone up if necessary. P2 (warning): Push notification + email. Should interrupt but not necessarily wake from sleep. P3 (info): Email + Slack. Business hours only — configure time-based routing in PagerDuty to suppress nighttime pages for P3. P4 (diagnostic): Slack only, or dashboard visibility only — no active notification.

Severity | Notification Channel | ACK Window | Escalation Path | Business Hours Only?
--- | --- | --- | --- | ---
P1 Critical | Phone call + SMS + Push | 5 min | Primary → Secondary → Manager → Director | No — always pages
P2 High | SMS + Push | 15 min | Primary → Secondary → Manager | No — always pages
P3 Warning | Push notification | 30 min | Primary → Secondary | Yes — suppress 10pm-8am
P4 Info | Slack + Email | Next business day | None — ticket created | Yes — Slack only
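The severity matrix above reduces to a small routing function. A sketch, assuming the 10pm-8am suppression window from the table (channel names and the function are illustrative, not any vendor's API):

```python
from datetime import time

# Severity-to-channel matrix from the table above (illustrative values).
CHANNELS = {
    "P1": ["phone", "sms", "push"],
    "P2": ["sms", "push"],
    "P3": ["push"],
    "P4": ["slack", "email"],
}
# Severities whose notifications are suppressed outside business hours.
SUPPRESS_OVERNIGHT = {"P3", "P4"}

def channels_for(severity: str, now: time) -> list[str]:
    """Return active notification channels, honoring overnight suppression."""
    overnight = now >= time(22, 0) or now < time(8, 0)
    if severity in SUPPRESS_OVERNIGHT and overnight:
        return []  # queued until business hours; no active page
    return CHANNELS[severity]

print(channels_for("P1", time(3, 0)))  # P1 always pages, even at 3 AM
print(channels_for("P3", time(3, 0)))  # P3 suppressed overnight
```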

Alert Routing Anti-Pattern: No Team Label

Alerts without a team label route to a catch-all "default" receiver and often page the wrong team or nobody at all. Every alert must have a team label that maps to a specific on-call rotation. At Google, the ownership model is strict: every service has exactly one SRE team responsible for it, and every alert for that service has that team's label. Alerts without a clear owner are not alerts — they are noise with a low probability of resolution.

Designing Actionable Alerts: The Three Questions

The difference between a good alert and a bad alert comes down to three questions, formulated by Google's SRE team and described in both the SRE Book and the SRE Workbook. Before any alert is added to production, every member of the team should be able to answer yes to all three: Is it urgent? Is it important? Is it actionable? These are not trivially obvious questions. Each one has a specific operational meaning, and failing even one disqualifies the alert from paging on-call engineers.

"Is it urgent?" asks: does this situation require a human response within minutes? An alert that can wait until morning is not urgent. An alert that can wait until the on-call engineer finishes their current task is not urgent. Urgency means: if this is not addressed in the next 5-15 minutes, user harm will escalate or SLO budget will be exhausted. P1 burn rate alerts are urgent. CPU > 80% for 10 minutes is not urgent — it can wait for a dashboard review.

"Is it important?" asks: does the impact on users justify interrupting an engineer? A 0.01% increase in error rate that keeps you well within SLO is not important. A 2% error rate burning down your error budget is important. Importance is calibrated against the SLO. An alert that fires when you are at 1x burn rate (normal budget consumption) is not important. An alert that fires at 14.4x burn rate (imminent SLO breach) is critically important.

"Is it actionable?" asks: is there a specific set of steps the on-call engineer can execute to diagnose and mitigate the issue? An alert that says "something is wrong" is not actionable. An alert that says "checkout error rate is 3% (burn rate 30x), likely causes: payment gateway timeout or database deadlock, see runbook for diagnostic commands" is actionable. Actionability requires a runbook link. It requires the alert to contain enough context that an engineer who has never seen this failure before can begin effective triage in under 2 minutes.

The Minimum Viable Alert Body

Every alert should contain: (1) WHAT is happening in plain English — "Checkout API error rate is 2.3%, above the 1% SLO threshold." (2) WHO is affected — "This affects users attempting to complete purchases." (3) SEVERITY AND URGENCY — "P1: error budget burn rate is 23x. Monthly budget exhausted in 32 hours at this rate." (4) WHERE to go — "Runbook: https://wiki.internal/runbooks/checkout-errors | Dashboard: https://grafana.internal/d/checkout-overview"

Alert Template Structure for PagerDuty / Alertmanager

  • Alert name — Use a consistent naming convention: [SERVICE]_[METRIC]_[SEVERITY]. Example: "CHECKOUT_ERROR_RATE_P1", "PAYMENT_LATENCY_P2". This makes alerts immediately identifiable in PagerDuty incident lists and searchable in historical data.
  • Summary (50 chars) — One sentence that answers "what is wrong and why it matters." Example: "Checkout API error rate 2.3% — SLO breach in 32 hours." This is what the on-call engineer sees on their phone lock screen at 3 AM. It must be unambiguous and informative enough to trigger the right mental model immediately.
  • Description (rich text) — Full context: current metric value, SLO threshold, burn rate, budget remaining, affected endpoints or users, top suspected root causes based on historical incidents. Include any relevant context that helps triage: "This alert has fired 3 times in the last 30 days; each time the root cause was a payment gateway timeout during high traffic."
  • Runbook URL — A direct deep link to the runbook for this specific alert. Not a link to the runbooks index — a link to the specific page. The runbook should include: diagnostic commands to run, metrics to check, common root causes and their mitigations, rollback procedure, and escalation contacts.
  • Dashboard URL — A deep link to the Grafana or equivalent dashboard pre-filtered to the affected service and time range. Ideally a link that auto-fills the time range to the last 2 hours so the engineer can immediately see the incident timeline.
  • Severity and priority labels — Machine-readable labels that drive routing: severity: critical|warning|info, priority: P1|P2|P3|P4, team: payments|infra|platform, environment: production|staging. These labels are used by Alertmanager routing rules and should be consistent across all alerts.
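Putting the template together, a sketch of a Prometheus alerting rule that carries all six fields above. The service name, threshold, labels, and URLs are placeholders for your own environment, not a real internal config:

```yaml
# Illustrative rule following the template above; all names and URLs
# are placeholders.
groups:
  - name: checkout-slo
    rules:
      - alert: CHECKOUT_ERROR_RATE_P1
        expr: |
          sum(rate(http_requests_total{service="checkout",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="checkout"}[1h])) > 0.001 * 14.4
        for: 5m
        labels:
          severity: critical
          priority: P1
          team: payments
          environment: production
        annotations:
          summary: "Checkout API error rate above SLO burn threshold"
          description: "Burn rate > 14.4x; monthly error budget exhausted in ~50h at this rate."
          runbook_url: "https://wiki.internal/runbooks/checkout/checkout-error-rate-p1"
          dashboard_url: "https://grafana.internal/d/checkout-overview"
```

The machine-readable labels drive Alertmanager routing; the annotations are what the on-call engineer actually reads on their phone.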

Alert Design Review Process (Pre-Merge Checklist)

01. Does this alert fire only when users are experiencing measurable harm (error rate increase, latency increase, availability drop)? If it fires on an internal metric that does not directly correlate to user impact, reconsider.

02. Is the alert threshold derived from the SLO? For error rate alerts, is the threshold above the SLO budget consumption rate? For latency alerts, is it above the agreed SLA latency? Arbitrary thresholds (90%, 80%) are a red flag.

03. Does the alert have a runbook link that exists and is current? Test the link. Open the runbook and verify the diagnostic commands still work against current infrastructure.

04. Has this alert been tested in a non-production environment or validated against historical data? Can you find at least one real historical incident where this alert would have fired? Can you confirm it would NOT have fired during non-incident periods in the last 30 days?

05. Does the alert have an owner (team label) who understands the runbook and is on the on-call rotation that will receive it? If no team owns it, do not deploy it.

06. Is there a review date or TTL set for this alert? Alerts should expire and require re-justification after 90-180 days unless they have demonstrated value by firing and triggering real incident response.


The Alert Quality Mnemonic: UIA

Before adding any alert, ask: Urgent (must respond in minutes), Important (user-visible impact at SLO-relevant scale), Actionable (runbook exists, engineer knows what to do). All three must be true. If any one is missing, the UIA test fails and the alert should not page: demote it to a dashboard, a ticket, or a Slack notification.
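The UIA gate is simple enough to express directly in code. A minimal sketch — the field and function names are illustrative, not from any library:

```python
from dataclasses import dataclass

# The UIA test as code: all three answers must be yes before an
# alert is allowed to page (names are illustrative).
@dataclass
class AlertProposal:
    urgent: bool      # needs a human response within minutes
    important: bool   # user-visible impact at SLO-relevant scale
    actionable: bool  # runbook exists and the engineer knows what to do

def disposition(a: AlertProposal) -> str:
    """Return 'page' only if the proposal passes all three UIA questions."""
    if a.urgent and a.important and a.actionable:
        return "page"
    return "demote"  # dashboard, ticket, or Slack notification instead

print(disposition(AlertProposal(True, True, True)))   # page
print(disposition(AlertProposal(True, True, False)))  # demote: no runbook
```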

Reducing Alert Noise: Deduplication, Grouping, and Inhibition

Even with well-designed individual alerts, a fleet of thousands of services will generate correlated alert storms during major incidents. When a shared database goes down, 50 services that depend on it all fire alerts simultaneously. The on-call engineer receives 50 pages within 2 minutes, all describing symptoms of the same root cause. Without noise reduction mechanisms, the alert storm becomes the primary obstacle to incident resolution — engineers spend more time acknowledging and triaging the storm than fixing the actual problem.

Alertmanager provides four primary noise-reduction mechanisms: deduplication, grouping, inhibition, and silencing. Each addresses a different failure mode. Understanding all four and knowing when to use each is the mark of a mature alerting practice.

The Four Noise Reduction Mechanisms

  • Deduplication — Alertmanager automatically deduplicates alerts with the same labels. If the same alert fires 10 times in 60 seconds (from different Prometheus instances, each evaluating the same rule), only one notification is sent. This is handled by the fingerprint: a hash of the alert's labels. Two related timing knobs matter here: group_interval: 5m means a notification for an existing group is re-sent at most every 5 minutes when new alerts join it, while repeat_interval controls how often an unchanged, still-firing group is re-notified.
  • Grouping — Grouping combines multiple related alerts into a single notification. Configure with group_by: [alertname, cluster, service]. When 50 database-dependency alerts fire for the same cluster, they are grouped into one notification: "50 alerts firing in cluster prod-us-east-1." The group_wait: 30s parameter means Alertmanager waits 30 seconds after the first alert to collect related alerts before sending the grouped notification. This prevents alert storm floods during mass failures.
  • Inhibition — Inhibition suppresses child alerts when a parent alert is active. Configuration: if a "DatabaseDown" alert is firing for a specific cluster, inhibit all "ServiceHighErrorRate" alerts for services in that cluster. The rule looks like: inhibit_rules: [{source_match: {alertname: 'DatabaseDown'}, target_match: {alertname: 'ServiceHighErrorRate'}, equal: ['cluster']}] — the equal clause restricts suppression to alerts that share the same cluster label as the firing parent. Requires careful design to avoid inhibiting independent incidents.
  • Silencing — Silences suppress alerts during planned maintenance windows or after a decision to accept known impact. Unlike inhibition (automatic, condition-based), silences are manually created through the Alertmanager UI or API, typically by the on-call engineer when they have identified an incident and want to suppress downstream symptom alerts while working on the root cause. Silences are time-limited and require a comment justifying why the silence is appropriate.
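Deduplication by label fingerprint can be illustrated in a few lines. This is a sketch of the idea, not Alertmanager's actual fingerprint algorithm (which uses its own hashing over sorted label pairs):

```python
import hashlib
import json

# Sketch of label-fingerprint deduplication: alerts with identical label
# sets hash to the same fingerprint, so only one notification survives.
def fingerprint(labels: dict) -> str:
    canonical = json.dumps(sorted(labels.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

alerts = [
    {"alertname": "HighErrorRate", "cluster": "prod-us-east-1", "service": "checkout"},
    {"alertname": "HighErrorRate", "cluster": "prod-us-east-1", "service": "checkout"},  # duplicate
    {"alertname": "HighErrorRate", "cluster": "prod-eu-west-1", "service": "checkout"},
]

# Keying by fingerprint collapses the duplicate into one entry.
unique = {fingerprint(a): a for a in alerts}
print(len(alerts), "alerts ->", len(unique), "notifications")  # 3 alerts -> 2 notifications
```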

Alert Dependency Graphs: Inhibition Done Right

Inhibition requires modeling your service dependency graph in Alertmanager configuration. A payment service depends on: a database cluster, a fraud detection API, and a currency exchange service. If the database fires a critical alert, inhibit all downstream payment service symptom alerts. But if the fraud detection API fires independently, do NOT inhibit — that is a separate incident. Getting inhibition wrong means either missing real incidents (over-inhibition) or creating alert storms (under-inhibition). Start conservative: only add inhibition rules for well-understood parent-child relationships.

Flap detection addresses another noise source: alerts that rapidly oscillate between firing and resolved. An alert that fires, resolves, fires, resolves every 2 minutes over 30 minutes generates 15 notifications for what might be a single underlying issue. Alertmanager does not have native flap detection, but it can be approximated using the for clause in the Prometheus alerting rule (requiring a sustained condition before the alert fires at all) and repeat_interval (minimum time between re-notifications). A better approach is to use a longer evaluation window in the PromQL expression itself: instead of alerting on the current 5-minute error rate, alert on a 15-minute or 30-minute rate, which smooths transient fluctuations.
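Why the longer window helps can be shown numerically. In this invented example, a 2-minute error spike pushes a 5-minute average across a 1% threshold but barely moves a 30-minute average (the series and threshold are illustrative, not real data):

```python
# Demonstration: a short spike crosses a threshold under a 5-sample
# moving average but is absorbed by a 30-sample moving average.
def moving_avg(series, window):
    return [sum(series[max(0, i - window + 1):i + 1]) / min(window, i + 1)
            for i in range(len(series))]

# Per-minute error rate: 0.2% baseline with one 2-minute spike to 3%.
rate = [0.002] * 10 + [0.03, 0.03] + [0.002] * 18

short = moving_avg(rate, 5)    # analogue of rate(...[5m])
long_ = moving_avg(rate, 30)   # analogue of rate(...[30m])
threshold = 0.01

print(any(v > threshold for v in short))  # True: short window flaps
print(any(v > threshold for v in long_))  # False: long window stays quiet
```

The trade-off is detection latency: the 30-minute window also reacts more slowly to a genuine sustained outage, which is exactly why multi-window burn-rate alerting combines a short and a long window rather than choosing one.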

Noise Problem | Root Cause | Alertmanager Solution | PromQL Solution
--- | --- | --- | ---
Duplicate pages for same alert | Same condition firing from multiple sources | group_by + deduplication | N/A — Alertmanager handles this
50 pages in 2 minutes during outage | Mass failure of dependent services | group_by + group_wait: 30s | N/A — grouping is the right tool
50 symptom alerts masking 1 root cause | No parent-child alert modeling | inhibit_rules by cluster/service | N/A — inhibition is the right tool
Alert flapping every few minutes | Threshold too close to metric baseline | for: 5m to require sustained condition | Use longer rate window (15m or 30m)
Alerts firing during known maintenance | No planned silence mechanism | amtool silence add with time limit | N/A — silencing is the right tool
Continuous re-notifications for open incident | Default repeat_interval too short | repeat_interval: 4h for P2, 1h for P1 | N/A — repeat_interval is the right tool

The Alert Dependency Graph as Code

Maintain your service dependency graph in a machine-readable format (YAML or JSON) and generate your Alertmanager inhibition rules from it automatically. When Service A depends on Service B, the inhibition rule "B critical inhibits A warning" is a logical consequence of the dependency. Generating rules from the dependency graph ensures consistency and makes it easy to add new services without manually crafting inhibition rules. Store this in Git and apply it via Terraform or Helm.
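A sketch of that generation step, assuming a hypothetical dependency map and the label scheme used elsewhere in this article (service and severity labels; the graph contents are invented):

```python
# Generate Alertmanager-style inhibition rules from a dependency graph.
# One rule per edge: parent critical inhibits child warnings in the same
# cluster. The graph and label names are illustrative.
DEPENDENCIES = {
    "payment-api":  ["postgres-primary", "fraud-api", "fx-service"],
    "checkout-api": ["postgres-primary", "payment-api"],
}

def inhibit_rules(deps: dict) -> list[dict]:
    rules = []
    for child, parents in deps.items():
        for parent in parents:
            rules.append({
                "source_match": {"service": parent, "severity": "critical"},
                "target_match": {"service": child, "severity": "warning"},
                "equal": ["cluster"],  # only suppress within the same cluster
            })
    return rules

rules = inhibit_rules(DEPENDENCIES)
print(len(rules), "inhibition rules generated")  # 5 edges -> 5 rules
```

Serializing this list to YAML and committing it alongside the dependency map keeps the rules provably in sync with the graph.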

Alerting Anti-Patterns Hall of Shame

The best way to understand good alerting design is to study the failure modes in detail. These anti-patterns are drawn from real-world SRE experience at scale: teams that implemented them consistently saw degraded on-call quality, higher incident response times, and engineer burnout. Each anti-pattern comes with a real cost calculation — the goal is to make the organizational cost of bad alerting concrete and undeniable.

Anti-Pattern 1: The CPU Alert

Alert: "server CPU > 80% for 5 minutes — page on-call." Cost: fires 200+ times per month across a fleet of 50 servers during normal traffic patterns, batch jobs, and index rebuilds. On-call engineer investigates and finds nothing wrong 95% of the time. Total on-call time wasted: 200 pages x 10 min investigation = 2,000 minutes = 33 hours per month. That is the equivalent of an engineer's full work week, every month, spent on false positives. The fix: replace with a symptom-based alert on application error rate or latency that the CPU spike might eventually cause — and only if the CPU spike actually causes user impact, which it usually does not.

Anti-Pattern 2: Disk Usage Without Trend

Alert: "disk usage > 80%." A disk at 80% that has been stable for 6 months is not an emergency — it is not about to fill. A disk at 60% that is growing 5% per day will be full in 8 days. The right alert is trend-based: disk_usage > 80% AND disk_fill_rate > 2%/day. Even better: predict_linear(node_filesystem_free_bytes[6h], 4 * 3600) < 0, which fires when the disk is projected to fill within 4 hours based on the recent trend. The static threshold alert generates chronic noise from disks that are healthy but large, while missing the disks that are about to fill.
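What predict_linear() does under the hood is a least-squares extrapolation. A hand-rolled analogue, with invented sample data, showing why the fast-filling disk alerts and the stable one does not:

```python
# Hand-rolled analogue of PromQL's predict_linear(): fit a least-squares
# line to recent free-space samples and project it `horizon_s` seconds
# past the last sample. Sample data below is invented.
def projected_free_bytes(samples, horizon_s):
    """samples: list of (t_seconds, free_bytes) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    t_last = samples[-1][0]
    return mean_v + slope * (t_last + horizon_s - mean_t)

# Disk losing 1 GB every 10 minutes, 3 GB free now: fills well inside 4h.
filling = [(i * 600, 6e9 - i * 1e9) for i in range(4)]
# Disk stable at 6 GB free: never projected to fill.
stable = [(i * 600, 6e9) for i in range(4)]

print(projected_free_bytes(filling, 4 * 3600) < 0)  # True -> alert fires
print(projected_free_bytes(stable, 4 * 3600) < 0)   # False -> stays quiet
```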

Anti-Pattern 3: The "Just in Case" Alert

Motivation: "we had a database OOM last quarter and we should have caught it earlier, so let us add an alert for database memory > 70%." Every post-mortem adds two alerts. After 3 years, the team has 400 alerts, 80% added reactively after incidents. Most have fired fewer than 5 times total. The on-call queue has 200+ open low-severity alerts that nobody has time to investigate. The team's actual MTTD (mean time to detect) for real incidents is HIGHER than before the alerts were added because engineers learned to ignore the pager. The fix: post-mortem actions should add symptom-based SLO alerts, not cause-based metric alerts. Ask "what is the user-visible symptom we missed?" not "what metric was elevated before the incident?"

Anti-Pattern 4: Alert Copy-Paste Across Services

A team copies the alert config from service-A to service-B because the services are similar. Service-B has different traffic patterns (10x lower volume), so the absolute thresholds are wrong. The p99 latency alert fires constantly for service-B during normal operation because the thresholds were calibrated for service-A's traffic volume. After 3 weeks of noise, the on-call engineer silences the alert indefinitely. Cost: the alert that should catch service-B latency regressions is permanently silenced because it was configured incorrectly. Real incidents in service-B will now be missed. Fix: never copy-paste alert configs. Derive thresholds from each service's SLO and baseline metrics. Use templated alert rules with per-service parameterization.

Anti-Pattern 5: Too Many Owners (Alert CC Culture)

Alert: "database slow queries — page DBA team, notify platform team, CC app team, notify VP of Engineering." Five people receive a page for the same alert. Nobody knows who is responsible. The DBA assumes the app team is handling it. The app team assumes the DBA is handling it. The incident festers for 30 minutes before anyone takes ownership. The fix is strict single-owner alerting: every alert pages exactly one team. Other teams can be informed via Slack notifications, but only one team is responsible for response. The responsible team can escalate to others, but the default routing is unambiguous.

Anti-Pattern | Monthly False Positive Cost | Real Incident Risk | Fix
--- | --- | --- | ---
CPU > 80% alert | 200+ pages, 33 hours wasted | Low — CPU rarely causes user impact directly | Alert on application error rate or latency instead
Static disk > 80% alert | 50+ pages, 8 hours wasted | Medium — misses rapid-fill scenarios | Use predict_linear() for trend-based projection
"Just in case" post-mortem alerts | 100+ pages, 17 hours wasted | High — causes engineers to ignore all alerts | Add SLO/symptom alerts from post-mortems, not metric alerts
Copy-paste thresholds | Variable — typically 30-50 false positives | High — silenced alerts miss real incidents | Derive thresholds from per-service SLO and baseline
Multi-owner CC culture | 0 extra pages but delayed response | Very High — diffusion of responsibility | Single owner per alert; others notified via secondary channel

The Real Cost of Alert Fatigue: Incident Response Time Doubles

A study by VictorOps (now Splunk On-Call) found that teams with high alert fatigue (> 50 pages/week per on-call) have mean time to acknowledge (MTTA) of 15-25 minutes, compared to 2-5 minutes for teams with low alert fatigue (< 20 pages/week). During that extra 10-20 minutes, an incident causing $10,000/minute in revenue loss costs $100,000-200,000 more than it should have. Alert fatigue is not just an engineering quality-of-life issue — it is a direct driver of incident cost.
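The arithmetic behind those figures is worth making explicit — extra acknowledgement delay multiplied by revenue loss per minute (numbers taken from the paragraph above):

```python
# Cost of alert fatigue = extra MTTA minutes * revenue loss per minute.
def extra_incident_cost(mtta_fatigued_min, mtta_healthy_min, loss_per_min):
    return (mtta_fatigued_min - mtta_healthy_min) * loss_per_min

# 10-20 extra minutes at $10,000/minute, per the figures above:
print(extra_incident_cost(15, 5, 10_000))  # $100,000
print(extra_incident_cost(25, 5, 10_000))  # $200,000
```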

The Quarterly Alert Audit

Every quarter, pull the alert firing history from PagerDuty or Opsgenie. For each alert that fired in the last 90 days: How many times did it fire? What percentage of fires resulted in a real incident (not just acknowledged and closed)? What was the average MTTA? Alerts with a real-incident rate below 20% are candidates for retirement. Alerts with an average MTTA above 15 minutes are either misconfigured (wrong team) or lack runbooks. This audit, done quarterly, prevents alert inventory from growing monotonically.
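The audit itself is a straightforward filter once the firing history is exported. A sketch with invented alert records and the thresholds from the paragraph above (real data would come from the PagerDuty or Opsgenie API):

```python
# Quarterly alert audit: flag retirement candidates (real-incident rate
# below 20%) and misconfigured alerts (average MTTA above 15 minutes).
# Records are invented for illustration.
alert_history = [
    {"name": "CHECKOUT_ERROR_RATE_P1", "fires": 4,   "real_incidents": 3, "avg_mtta_min": 4},
    {"name": "CPU_HIGH",               "fires": 212, "real_incidents": 6, "avg_mtta_min": 22},
    {"name": "DISK_STATIC_80PCT",      "fires": 55,  "real_incidents": 2, "avg_mtta_min": 18},
]

def audit(history):
    retire, investigate = [], []
    for a in history:
        if a["real_incidents"] / a["fires"] < 0.20:
            retire.append(a["name"])       # candidate for retirement
        if a["avg_mtta_min"] > 15:
            investigate.append(a["name"])  # wrong team or missing runbook
    return retire, investigate

retire, investigate = audit(alert_history)
print("retire:", retire)
print("investigate:", investigate)
```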

Building Your Alerting Stack: From Config to Culture

Building a production-grade alerting stack involves more than writing PromQL. It requires choosing the right tools, configuring them correctly, integrating on-call scheduling, establishing an alert review culture, and building a starter library of high-value alerts that your team can deploy and customize. This section walks through the complete stack from infrastructure to culture, with concrete configurations you can adapt immediately.

Complete Alertmanager Configuration (Production Template)

  • Global config and routing —

        global:
          resolve_timeout: 5m
          slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

        templates:
          - '/etc/alertmanager/templates/*.tmpl'

        route:
          group_by: ['alertname', 'cluster', 'service', 'team']
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 4h
          receiver: 'default'
          routes:
            - match:
                severity: critical
              receiver: 'pagerduty-critical'
            - match:
                severity: warning
              receiver: 'pagerduty-warning'
            - match:
                severity: info
              receiver: 'slack-info'

  • Inhibition rules —

        inhibit_rules:
          - source_match:
              severity: 'critical'
            target_match:
              severity: 'warning'
            equal: ['alertname', 'cluster', 'service']
          - source_match:
              alertname: 'DatabaseDown'
            target_match:
              type: 'database-dependency'
            equal: ['cluster']

  • PagerDuty receiver —

        receivers:
          - name: 'pagerduty-critical'
            pagerduty_configs:
              - routing_key: '{{ .GroupLabels.pagerduty_key }}'
                description: '{{ template "pagerduty.default.description" . }}'
                client: 'alertmanager'
                client_url: '{{ template "pagerduty.default.clientURL" . }}'
                details:
                  firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
                  resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
                  num_firing: '{{ .Alerts.Firing | len }}'
                  runbook: '{{ .CommonAnnotations.runbook_url }}'

Starter Alert Library: 10 High-Value Alerts

  • HTTP Error Rate SLO Burn (P1) — Fires when error rate creates a burn rate > 14.4x over the last hour. PromQL: sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) > 0.001 * 14.4. The foundation of SLO-based alerting for any HTTP service.
  • HTTP Latency SLO Burn (P2) — Fires when p99 latency exceeds SLA threshold sustained for 5 minutes, indicating slow burn of latency SLO. PromQL: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 0.5. Threshold is 500ms here — adjust to your SLA.
  • TLS Certificate Expiry (P2) — Fires when a TLS certificate expires within 7 days. PromQL: (probe_ssl_earliest_cert_expiry - time()) / 3600 / 24 < 7. A cause-based alert that passes the actionability test: there is a specific runbook action (renew certificate) and a clear lead time for action.
  • Pod Crash Loop (P2) — Fires when a Kubernetes pod has restarted more than 5 times in 10 minutes. PromQL: rate(kube_pod_container_status_restarts_total[10m]) * 600 > 5. This is a symptom at the infrastructure layer that reliably indicates a broken service.
  • Disk Fill Prediction (P2) — Fires when disk is projected to fill within 4 hours. PromQL: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4 * 3600) < 0. Trend-based, not static threshold. Does not fire on stable-but-large disks.
  • Dead Man Switch (P1) — A heartbeat alert that fires when the alerting pipeline itself is broken. Configure a Prometheus alert that always fires (expr: vector(1)), route it to a "dead man switch" receiver in Alertmanager, and configure PagerDuty or Opsgenie to page if the heartbeat stops arriving. If Prometheus or Alertmanager is down, this alert stops sending and PagerDuty generates an incident.
  • Error Budget Exhaustion Warning (P2) — Fires when remaining error budget drops below 10% for the current 30-day window. Requires a recording rule that tracks cumulative budget consumed. This gives advance warning before the SLO is breached at month end, allowing proactive action.
  • Database Replication Lag (P2) — Fires when replication lag exceeds 30 seconds for a primary-replica setup. PromQL: mysql_slave_status_seconds_behind_master > 30 or pg_replication_lag > 30. Replication lag at this level risks data loss in failover scenarios.
  • Queue Consumer Lag (P2) — Fires when Kafka/SQS consumer lag has been growing for more than 10 minutes, indicating consumers are not keeping up with producers. PromQL: deriv(kafka_consumer_group_lag[10m]) > 0 — lag is a gauge, so use deriv() rather than rate(), which is only meaningful for counters. A growing lag is the symptom; the cause (slow consumers, high message volume, crashed consumer group) is for the runbook to diagnose.
  • Automated Deployment Health (P1) — Fires when a deployment is in progress and the success rate of the new version is more than 5% below the previous version's baseline within 10 minutes of rollout. This is a canary alert that enables automatic rollback triggers in your CD pipeline.
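The 14.4x factor in the first rule is not arbitrary; it falls directly out of the error-budget math. A quick check of the numbers this article uses (99.9% SLO, 30-day window):

```python
# Error-budget math behind the 14.4x burn-rate threshold: at burn rate B,
# a 30-day budget lasts 30*24/B hours.
def hours_to_exhaustion(burn_rate, window_days=30):
    return window_days * 24 / burn_rate

slo = 0.999
budget_fraction = 1 - slo  # 0.1% of requests may fail over the window

print(hours_to_exhaustion(14.4))       # 50.0 hours to exhaustion
print(budget_fraction * 14.4)          # error rate sustaining a 14.4x burn: 1.44%
```

This is why the burn-rate alert in the library compares the measured error ratio against 0.001 * 14.4: an error rate of 1.44% on a 99.9% SLO consumes the monthly budget in 50 hours.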

PagerDuty + Opsgenie Integration Checklist

When integrating your alerting stack with PagerDuty or Opsgenie: (1) Create one integration per service, not one per team — this enables per-service routing keys and per-service escalation policies. (2) Set the acknowledgement timeout to 5 minutes for P1 and 15 minutes for P2. (3) Enable conference bridge auto-creation for P1 incidents — when the first responder acknowledges, a Zoom or Google Meet link is automatically created and included in the incident. (4) Configure status page integration so P1 incidents automatically update your external status page. (5) Enable postmortem reminder automation: 24 hours after a P1 incident is resolved, PagerDuty sends a reminder to schedule the postmortem.

On-Call Schedule Design: Best Practices

01. Set rotation length to 1 week. Shorter rotations (3 days) do not allow engineers to build context; longer rotations (2 weeks) cause burnout. One week is the industry standard at Google, PagerDuty, and most FAANG companies.

02. Implement a follow-the-sun rotation for global teams: hand off on-call responsibility to a team in the next timezone during business hours. This ensures on-call engineers are never paged in the middle of their night if global coverage exists.

03. Require overlap during handoff: the outgoing and incoming on-call engineers are both on-call for 1 hour during the transition. This allows knowledge transfer on any active incidents and prevents "no coverage" windows during timezone handoffs.

04. Define explicit off-hours expectations: how quickly must a P1 be acknowledged? (5 minutes is the industry standard.) What constitutes an excused non-response? (Being on a flight, a medical emergency.) Who is the default secondary when the primary is unavailable?

05. Track on-call load metrics per engineer: total pages received, pages requiring action, MTTA, total incident time. Review these monthly. Engineers with significantly higher load than peers indicate an imbalanced rotation or a service that needs reliability investment.

06. Establish a post-on-call recovery policy: after a week with more than 3 overnight P1 incidents, the on-call engineer gets recovery time (typically a day or partial day) to compensate for lost sleep. This is explicit at Google and should be formalized at any organization that takes on-call health seriously.


Runbook as Code: The Foundation of Actionable Alerts

Every alert in your starter library must link to a runbook. Runbooks should live in your team's internal wiki with a consistent URL structure: https://wiki.internal/runbooks/{service}/{alert-name}. Each runbook contains: description of the alert and what it means, likely root causes ranked by frequency, diagnostic commands with expected output, mitigation steps in order of preference, rollback procedure if mitigation requires changes, and escalation contacts if the issue cannot be resolved within 30 minutes. Runbooks are living documents: update them every time an alert fires and the runbook is used, adding new root causes and updating commands that no longer work.
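The "every alert links to a specific runbook page" rule can be enforced in CI before a rule ever merges. A sketch of such a pre-merge check — the URL prefix and rule structure are illustrative, matching the convention described above:

```python
# Pre-merge check: every alert rule must carry a runbook_url annotation
# that points under the runbook tree, at a specific page rather than
# the bare index. Prefix and rule records are illustrative.
RUNBOOK_PREFIX = "https://wiki.internal/runbooks/"

def missing_runbooks(rules):
    bad = []
    for rule in rules:
        url = rule.get("annotations", {}).get("runbook_url", "")
        # Must live under the prefix AND be deeper than the index itself.
        if not (url.startswith(RUNBOOK_PREFIX) and len(url) > len(RUNBOOK_PREFIX)):
            bad.append(rule["alert"])
    return bad

rules = [
    {"alert": "CHECKOUT_ERROR_RATE_P1",
     "annotations": {"runbook_url": "https://wiki.internal/runbooks/checkout/error-rate"}},
    {"alert": "CPU_HIGH", "annotations": {}},  # no runbook: fails review
]
print(missing_runbooks(rules))  # ['CPU_HIGH']
```

A stricter version would also fetch each URL and fail on 404s, catching runbooks that were moved or deleted after the alert shipped.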

How this might come up in interviews

SRE alerting questions appear in three contexts at FAANG interviews. In system design, you may be asked to design the monitoring and alerting layer for a high-traffic service — interviewers expect SLO-based burn rate alerting, not threshold alerts. In behavioral interviews, questions like "tell me about a time you improved on-call quality" or "describe how you reduced alert fatigue on your team" directly test this knowledge. In SRE-specific technical screens, you may be asked to write PromQL for burn rate alerts, explain multi-window alerting, or describe how you would configure Alertmanager routing for a complex multi-team service. The most senior candidates differentiate themselves by connecting alerting design to business outcomes: fewer false positives mean faster MTTD, which means less user impact per incident.

Common questions:

  • What is alert fatigue and what causes it? How would you diagnose and fix it on a team that is receiving 200 pages per week per on-call engineer?
  • Explain SLO-based burn rate alerting. Why is a burn rate of 14.4x the right threshold for a P1 alert on a 99.9% SLO? Walk through the math.
  • What is the multi-window burn rate algorithm and why does requiring both a short and long window reduce false positives without sacrificing detection speed?
  • Design the alerting architecture for a global e-commerce checkout service with a 99.95% availability SLO. What alerts would you create, how would you route them, and how would you prevent an alert storm if the payment gateway goes down?
  • Walk me through how you would configure Alertmanager grouping and inhibition to reduce alert noise during a database outage that affects 50 dependent services.
  • Your team is considering migrating from PagerDuty to Grafana OnCall to reduce costs. What are the key trade-offs? What would you validate before cutting over, and what are the scenarios where you would keep PagerDuty?

Try this question: Ask the interviewer: What is your current alert-to-incident ratio — how many alerts fire per week per real incident? Have you adopted SLO-based alerting or are you still on threshold-based alerting? How do you measure on-call health and what is your target for actionable alert percentage? These questions demonstrate that you evaluate team operational health proactively and will bring a structured approach to improving alerting quality.

Strong answer: Demonstrating the burn rate calculation without prompting. Mentioning multi-window alerting unprompted when discussing SLO alerts. Describing the three-question alert test (Urgent, Important, Actionable). Knowing that Alertmanager handles routing and noise reduction while PagerDuty handles escalation and on-call scheduling — and being able to explain why the combination is better than either alone. Citing the dead man's switch as a critical reliability requirement for any alerting stack.
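The dead man's switch mentioned above is commonly implemented as an always-firing "Watchdog" alert. A minimal Prometheus sketch (the rule group name and labels are illustrative; the external heartbeat receiver is an assumption):

```yaml
# Dead man's switch: an expression that is always true, so this alert fires
# continuously. Route it to an external heartbeat service that pages only
# when the alert STOPS arriving — i.e. when the alerting pipeline itself
# has failed silently.
groups:
  - name: meta
    rules:
      - alert: Watchdog
        expr: vector(1)   # always evaluates to 1, so the alert is always firing
        labels:
          severity: heartbeat
        annotations:
          summary: "Alerting pipeline heartbeat; the ABSENCE of this alert is the emergency"
```

The key design point is that the failure mode inverts: every other alert signals trouble by firing, while the Watchdog signals trouble by going quiet, which is why it must be monitored outside the alerting stack it protects.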

Red flags: Recommending only CPU, memory, and disk threshold alerts as the primary alerting strategy. Not knowing what error budget burn rate means or being unable to do the burn rate math. Treating alert fatigue as a training problem ("engineers just need to take alerts more seriously") rather than a system design problem. Not mentioning runbooks as a requirement for actionable alerts. Describing inhibition rules as a way to suppress alerts rather than as a way to reduce noise while maintaining coverage of independent incidents.

Key takeaways

  • Alert fatigue is not a people problem — it is a system design problem. Google found teams ignoring 50%+ of alerts as a rational survival response to poorly designed alerting. The fix is engineering-grade alerting: symptom-based, SLO-calibrated, and built with the same rigor as production software.
  • SLO-based burn rate alerting is the gold standard. Instead of asking "is CPU > 80%?", ask "how fast is the error budget being consumed?" A burn rate of 14.4x over 1 hour means the monthly SLO budget will be exhausted in 50 hours — that is always a P1, regardless of what the underlying cause metric looks like.
  • The multi-window algorithm eliminates false positives: a P1 alert should require elevated burn rate in BOTH a 1-hour AND 5-hour window simultaneously. Transient 20-minute spikes elevate the short window but not the long window, and never page. Sustained incidents elevate both and page immediately.
  • Alert routing and noise reduction require active engineering: Alertmanager inhibition rules, grouping, deduplication, and properly-configured escalation policies in PagerDuty or Grafana OnCall. The Cloudflare 2019 incident shows that an alert storm (100+ pages for one root cause) can add 17 minutes to MTTR — time during which 27 million users were affected.
  • Every alert must pass the UIA test (Urgent, Important, Actionable) and link to a current runbook. An alert that fires but leaves the on-call engineer unsure of what to do is not an alert — it is confusion with a pager attached. The dead man's switch (heartbeat alert) is the highest-priority alert in any system: if your alerting pipeline fails silently, you have no alerts at all.
Before you move on: can you answer these?

Explain the difference between symptom-based and cause-based alerting, and give a concrete example of each for a checkout API. Then explain why Google's SRE team moved away from cause-based alerting at scale.

Cause-based alerting monitors internal system state: CPU > 80%, database connection pool saturation > 90%, garbage collection pause > 500ms. These metrics may be elevated without any user impact. Symptom-based alerting monitors user experience: HTTP 5xx error rate > 1%, checkout flow latency p99 > 2 seconds. For a checkout API: cause-based alert = "payment-service CPU > 80%." Symptom-based alert = "checkout API error rate > 1% for 5 minutes (SLO burn rate 10x)." Google moved away from cause-based alerting because at scale (millions of servers), cause-based metrics fluctuate constantly due to batch jobs, GC cycles, and natural load variation. This generated millions of false-positive alerts per day, destroyed on-call signal quality, and caused engineers to ignore the pager — the exact behavior that led to missed incidents. Symptom-based SLO alerting reduced false positives by 10x while maintaining or improving detection coverage for real incidents.
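The symptom-based checkout alert described above can be sketched as a Prometheus alerting rule. The metric name (`checkout_http_requests_total`) and the runbook URL are placeholders, not real identifiers:

```yaml
# Symptom-based alert: pages on user-visible pain (5xx ratio), not on
# internal state like CPU or connection-pool saturation.
groups:
  - name: checkout-symptoms
    rules:
      - alert: CheckoutErrorRateHigh
        # Fraction of responses that are 5xx over the last 5 minutes
        expr: |
          sum(rate(checkout_http_requests_total{code=~"5.."}[5m]))
            /
          sum(rate(checkout_http_requests_total[5m])) > 0.01
        for: 5m                # must stay elevated for 5 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "Checkout API 5xx error rate above 1% for 5 minutes"
          runbook_url: https://runbooks.example.com/checkout/error-rate
```

A cause-based equivalent would alert on `node_cpu` utilization of the payment service; the rule above fires only when users are actually seeing errors.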

Calculate the error budget burn rate for a service with a 99.9% SLO that is currently returning 2% errors. How long until the monthly error budget is exhausted? At what burn rate should you page P1? Walk through the multi-window algorithm.

Error budget for a 99.9% SLO: 0.1% of requests, or 43.8 minutes of full downtime per 30-day window. Current error rate: 2%. Burn rate = 2% / 0.1% = 20x. Time to budget exhaustion = 30 days / 20 = 1.5 days (36 hours). The P1 threshold is a 14.4x burn rate: it consumes 2% of the monthly budget in 1 hour and exhausts the budget in 30 days / 14.4 = 50 hours. The current burn rate of 20x exceeds the P1 threshold and should page immediately. Multi-window algorithm: the alert fires only when burn rate > 14.4x in BOTH the 1-hour AND 5-hour windows simultaneously. This eliminates false positives from transient 20-minute spikes (which elevate the 1-hour window but not the 5-hour window). A sustained 2-hour incident at 20x burn rate elevates both windows and fires the P1 alert.
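The multi-window algorithm walked through above can be expressed as a single Prometheus rule. This is a sketch: the metric names (`slo_errors_total`, `slo_requests_total`) are illustrative placeholders for your SLI counters:

```yaml
# Multi-window burn-rate P1 alert for a 99.9% SLO.
# Burn rate = observed error ratio / allowed error ratio (0.001), so
# "burn rate > 14.4x" is equivalent to "error ratio > 14.4 * 0.001".
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        # Page only when the burn rate exceeds 14.4x in BOTH windows:
        # the 1h window gives fast detection, the 5h window filters out
        # transient spikes that would self-resolve.
        expr: |
          (
            sum(rate(slo_errors_total[1h])) / sum(rate(slo_requests_total[1h]))
              > (14.4 * 0.001)
          )
          and
          (
            sum(rate(slo_errors_total[5h])) / sum(rate(slo_requests_total[5h]))
              > (14.4 * 0.001)
          )
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at >14.4x; monthly budget exhausted in <50h"
```

For slower burns you would add parallel rules at lower thresholds (e.g. 6x over 6h for a P2, 1x over 3 days for a P3 ticket), each with its own window pair.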

Your team experienced an alert storm during a database outage last week: 60 alerts fired for a single root cause. Walk through the Alertmanager configuration changes you would make to prevent this in the future, including grouping, inhibition, and any PromQL changes.

Three complementary changes: (1) Grouping: add group_by: [alertname, cluster, database] to the route configuration and set group_wait: 30s so Alertmanager waits 30 seconds to collect related alerts before sending the first notification. This collapses 60 simultaneous pages into a few grouped notifications. (2) Inhibition: add an inhibit rule that suppresses dependent-service alerts while a database alert is active: source matcher alertname="DatabaseDown", target matcher type="database-dependency", equal: [cluster]. This suppresses symptom alerts (application errors, latency spikes) that are downstream consequences of the database being down, while independent alerts in other clusters still fire. (3) PromQL: use longer evaluation windows (15m or 30m) for service-level alerts to smooth transient errors, reducing the flapping that generates duplicate pages. Also add a deployment-correlation rule: if > 20 alerts fire within 2 minutes of a database configuration change, group them under a single "database change impact" alert.
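The grouping and inhibition changes above might look like this as an Alertmanager configuration sketch. The receiver name and label values are assumptions, and this uses the current `source_matchers`/`target_matchers` syntax (the older `source_match`/`target_match` form is deprecated):

```yaml
route:
  receiver: team-pager
  # Collapse related alerts into one notification per (alertname, cluster, database)
  group_by: [alertname, cluster, database]
  group_wait: 30s        # wait 30s to collect related alerts before the first page
  group_interval: 5m     # batch further alerts for an existing group
  repeat_interval: 4h    # re-notify for still-firing groups at most every 4h

inhibit_rules:
  # While DatabaseDown is firing in a cluster, suppress alerts from services
  # that merely depend on that database in the SAME cluster. Alerts in other
  # clusters are unaffected, so coverage of independent incidents is kept.
  - source_matchers:
      - alertname = "DatabaseDown"
    target_matchers:
      - type = "database-dependency"
    equal: [cluster]

receivers:
  - name: team-pager
    # pagerduty_configs / webhook_configs would go here
```

Note that `equal: [cluster]` is what scopes the suppression: without it, a database outage in one region would silence unrelated symptom alerts everywhere.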

🧠Mental Model

💡 Analogy

SLO-based alerting is like a smart smoke detector. A cheap smoke detector goes off every time you toast bread — it is so sensitive to any trace of smoke that it has lost your trust. After the tenth false alarm, you remove the battery. Now your house burns down and you never know. A broken smoke detector that never triggers (no alerts at all) is equally useless. A good smoke detector only triggers for real fires AND tells you which room is affected (actionable, routed correctly). But the best smoke detector of all — the equivalent of SLO-based alerting — is smart enough to know whether the smoke is coming from a small candle that will self-extinguish (slow burn, P3 ticket) or a kitchen fire that is consuming your error budget at 14.4x and will burn the house down in 50 hours (P1 fast burn). It does not wake you up for bread crumbs. It wakes you up when there is actually a fire. And it tells you exactly which room to go to.

⚡ Core Idea

Alert on user pain (symptoms), not internal system state (causes). Calibrate alert severity to error budget burn rate, not arbitrary metric thresholds. Route alerts to exactly one owner through a graduated escalation chain. Reduce noise through deduplication, grouping, and inhibition. The goal is signal fidelity: every page represents a real situation requiring human action, and every real situation generates a page.

🎯 Why It Matters

Alert fatigue is the hidden reliability crisis that destroys on-call rotations. Google found that teams ignoring 50% of their alerts is not an edge case — it is the norm for teams that have not invested in alerting quality. An on-call engineer who has been conditioned to ignore pages by months of false positives will miss the one page that matters: the P1 incident that takes down the service for 200,000 users. SLO-based burn-rate alerting solves this by making alert severity proportional to actual user impact, eliminating the false positive flood, and ensuring that when the pager fires, it always means something.
