Master the art and engineering of production alerting: eliminate alert fatigue with symptom-based and SLO-driven burn-rate alerts, design actionable runbook-linked pages, build multi-window alerting algorithms, and architect robust routing pipelines — the way Google, Cloudflare, and Netflix do it at scale.
Alert fatigue is one of the most insidious reliability problems in modern operations. It begins innocuously: a team adds a CPU alert, then a memory alert, then a disk alert, then five more "just in case" alerts. Within months, the on-call rotation receives 200+ pages per week. Engineers begin ignoring pages. They silence their phones. They acknowledge alerts without investigating them. When the real incident finally arrives — the one that matters — it drowns in the noise. By the time someone responds, millions of users have been affected for 20 minutes.
Google conducted internal research on alerting quality across their SRE organization and found a staggering result: teams with poorly-designed alerting were ignoring more than 50% of their alerts. Not occasionally — routinely, as a survival mechanism. The engineers were not lazy or negligent; they had learned through painful experience that most alerts did not require action. The signal-to-noise ratio had collapsed so badly that treating all alerts as urgent had become impossible. The only rational response was to triage by intuition rather than by alert content.
The Alert Fatigue Death Spiral
High alert volume → engineers ignore alerts → real incidents missed → outages → add more alerts to "catch everything" → even higher alert volume. Teams caught in this spiral often cannot escape without a deliberate alerting overhaul. The spiral is self-reinforcing because each new alert added after a missed incident feels justified, even when it makes the underlying problem worse.
The Nagios era gave us the template for this failure. Nagios popularized check-based alerting in the late 1990s: you define a threshold on a metric (CPU > 90%, disk > 80%, memory > 85%), and when the threshold is crossed, a page fires. This model has three critical flaws. First, thresholds are arbitrary. What makes CPU > 90% an emergency versus CPU > 85%? Nobody knows — these numbers are typically copied from a default config. Second, many threshold breaches are transient — CPU spikes for 30 seconds and recovers. The alert fired, the on-call engineer woke up at 3 AM, nothing was wrong. Third, and most importantly, threshold-based alerts are cause-based, not symptom-based. A CPU spike does not tell you whether users are being affected. Users might be experiencing no impact at all, or the impact might be severe — you cannot tell from the CPU metric alone.
Consider the numbers. A team with 50 active Nagios-style threshold alerts across a fleet of 100 servers generates thousands of alert events per week, even in a healthy system, due to natural metric variance. At Google scale (millions of servers), this model is not just impractical — it is impossible. Google moved to symptom-based alerting and eventually SLO-based alerting specifically because threshold-based alerting does not scale. The lesson applies equally to smaller teams: a startup with 20 services and 200 threshold alerts has the same structural problem as Google with 10,000 services and 2 million alerts.
Five Root Causes of Alert Fatigue
| Metric | Healthy Team | Alert Fatigued Team |
|---|---|---|
| Pages per week per on-call engineer | < 10 | 50-200+ |
| Percentage of pages requiring action | > 80% | < 30% |
| Mean time to acknowledge (MTTA) | < 5 minutes | 15-30 minutes (engineers stopped checking) |
| Alert response rate | > 95% | 40-60% (active ignoring) |
| Percentage of alerts ignored or auto-acked | < 5% | 40-60% |
| Time spent triaging before identifying real issue | < 2 minutes | 10-20 minutes per incident |
The solution is not fewer alerts across the board — it is better alerts. The SRE approach treats alerting as engineering: alerts are software artifacts that must be designed, tested, maintained, and retired like any other software. The foundation of this approach is asking one question before adding any alert: "If this alert fires at 3 AM, what will the on-call engineer do?" If the answer is "investigate and probably find nothing wrong," or "acknowledge and go back to sleep," the alert should not exist. If the answer is "execute this specific runbook to mitigate a user-visible problem," the alert belongs in production.
The Alert Justification Test
For every alert in your system, answer three questions: (1) Does this alert fire only when users are experiencing pain, not just when a metric is elevated? (2) Does the alert contain a clear description of what is happening and a link to the runbook? (3) Is there a specific, documented action the on-call engineer must take? If you cannot answer yes to all three, the alert should not be paging anyone.
The most important conceptual shift in modern alerting is moving from cause-based to symptom-based alerting. Cause-based alerting monitors the internal state of your systems: CPU utilization, memory pressure, disk I/O, connection pool saturation, garbage collection pause duration. Symptom-based alerting monitors the user experience: error rate, request latency, throughput, availability. The difference is not semantic — it is the difference between a system that pages you 50 times per week and one that pages you 5 times per week, with the symptom-based system catching every real incident and the cause-based system missing the ones that matter most.
The classic demonstration: your application is experiencing elevated error rates for the checkout flow. Users cannot complete purchases. The root cause is a deadlock in the payment service database connection pool. Now count the cause-based alerts that fired: connection pool saturation alert, database connection error rate alert, application CPU alert (from retry storms), network latency alert (from connection timeouts). Four alerts for one incident, all requiring triage to understand the relationship. Count the symptom-based alerts that fired: one alert — "checkout flow error rate > 1% for 5 minutes." One alert, directly tied to user pain, with a clear description of the impact.
flowchart TD
subgraph CAUSE["Cause-Based Alert Chain (4 pages, 1 incident)"]
direction TB
C1["CPU Alert: app-server-03 CPU > 90%"]
C2["DB Alert: connection_pool_saturation > 85%"]
C3["Net Alert: p99 latency > 500ms"]
C4["App Alert: connection_timeout_rate > 5%"]
C1 & C2 & C3 & C4 --> TRIAGE["Engineer spends 17 min triaging
4 alerts to find root cause"]
TRIAGE --> ROOT["Root cause: DB deadlock in payment service"]
end
subgraph SYMPTOM["Symptom-Based Alert (1 page, same incident)"]
direction TB
S1["SLO Alert: checkout_error_rate > 1% for 5m
Impact: users cannot complete purchases"]
S1 --> ACTION["Engineer opens runbook immediately
Finds root cause in < 2 minutes"]
ACTION --> ROOT2["Root cause: DB deadlock in payment service"]
end
CAUSE -.->|"Same incident
different alerting approach"| SYMPTOM
Cause-Based vs Symptom-Based Alert Comparison
The examples of well-designed symptom-based alerts at major companies illustrate the principle clearly. Google alerts on HTTP 5xx error rate and p99 latency per endpoint, not on server CPU or memory. Netflix alerts on play start failure rate and rebuffering ratio, not on streaming server resource utilization. Cloudflare alerts on edge hit rate and origin error rate, not on individual edge node health. In each case, the alert is grounded in user experience — it fires if and only if users are being harmed, regardless of what the underlying infrastructure is doing.
Symptom-Based Alert Examples by Service Type
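Two concrete sketches in Prometheus alerting-rule form — one for an error-rate SLI on an HTTP API and one for a latency SLI. Metric names, thresholds, and labels are illustrative, not prescriptive; derive your own thresholds from your SLO.

```yaml
# Illustrative symptom-based alerts (metric names and thresholds are
# examples — derive the actual thresholds from your SLO).
groups:
  - name: symptom-alerts
    rules:
      # HTTP API: proportion of requests failing with a 5xx status
      - alert: CheckoutErrorRateHigh
        expr: |
          sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="checkout"}[5m])) > 0.01
        for: 5m
        labels: {severity: critical, team: payments}
        annotations:
          summary: "Checkout error rate > 1% for 5m — users cannot complete purchases"
      # Latency: proportion of requests slower than the 300ms SLO threshold
      - alert: CheckoutLatencyHigh
        expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="checkout",le="0.3"}[5m]))
            /
            sum(rate(http_request_duration_seconds_count{job="checkout"}[5m]))
          ) > 0.05
        for: 5m
        labels: {severity: warning, team: payments}
        annotations:
          summary: "More than 5% of checkout requests slower than 300ms"
```

Both alerts fire on user-visible symptoms — failed or slow requests — rather than on any internal resource metric.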
The counterargument is that cause-based alerts provide early warning before user impact. This is a legitimate concern, but it is usually solved better by dashboards and capacity planning than by pages. A CPU trending toward 100% over 24 hours is a capacity signal that belongs on a dashboard with a daily report, not a 3 AM page. If the CPU actually causes user-visible errors, the symptom-based alert will fire. The question is whether you need to wake someone up for the early warning — in most cases you do not. Dashboards, capacity alerts in business-hours-only routing, and weekly review cadences handle early warning appropriately without generating nighttime noise.
Where Cause-Based Alerts Still Belong
Some cause-based alerts are valuable when they are reliably predictive of user impact AND have a long enough lead time to allow preventive action. Examples: disk usage trending to full in < 4 hours (there is a runbook action: clean logs or expand disk), TLS certificate expiring in < 7 days (there is a runbook action: renew certificate), SSL cipher suite using deprecated algorithm (there is a runbook action: rotate). These pass the actionability test. Pure resource utilization metrics that fluctuate without user impact do not.
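One of the examples above, sketched as a rule. This is an illustrative sketch: probe_ssl_earliest_cert_expiry is the certificate-expiry metric exposed by blackbox_exporter; the alert name and labels are hypothetical.

```yaml
# Predictive, actionable cause-based alert: long lead time, clear runbook
# action (renew the certificate). Metric from blackbox_exporter.
- alert: TLSCertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 7 * 86400
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "TLS certificate for {{ $labels.instance }} expires in under 7 days"
```

Note that this alert passes the actionability test precisely because it predicts user impact days in advance and has a single documented remediation.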
The Google SRE Book Definition
The Google SRE Book states: "Symptoms are a better way to catch a broad class of problems, but they only catch the right things in the right measure when the thresholds are correct." The key insight is that symptom-based alerting is not about reducing sensitivity — it is about alerting on the right things. A 1% error rate threshold catches real user pain while ignoring the constant background noise of cause-based metrics.
SLO-based alerting is the most sophisticated and effective alerting approach in modern SRE practice. Instead of alerting on raw metrics or even individual symptom thresholds, SLO-based alerting monitors the rate at which your error budget is being consumed. This approach, detailed in the Google SRE Workbook, solves the fundamental problem with threshold-based alerting: it automatically adjusts alert sensitivity to reflect the actual user impact and the service's reliability commitments.
The foundation is the error budget. If your service has a 99.9% availability SLO, you are allowed 0.1% of requests to fail (or equivalently, 43.8 minutes of downtime per month). Your error budget is that 0.1% of requests. Every time a user request fails, you spend a tiny fraction of that budget. SLO-based alerting asks: how fast are you spending your error budget? If you spend it slowly (small but persistent error rate), you get a low-severity alert. If you spend it rapidly (high error rate), you get an immediate P1 page.
Error Budget Mathematics
For a 99.9% SLO with a 30-day window: total error budget = 0.1% x 30 days x 24 hours x 60 minutes = 43.2 minutes (the commonly cited 43.8-minute figure uses an average month of 30.44 days). If your service is currently returning 5% errors, you are consuming the error budget at 50x the normal rate (5% / 0.1% = 50x). At that rate, you will exhaust the entire monthly error budget in 30 days / 50 = 14.4 hours. This is a P1 incident. This calculation is the core of burn rate alerting.
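The arithmetic can be sketched in a few lines of Python — a minimal illustration of the burn-rate formulas, not production code:

```python
# Error budget arithmetic for a 99.9% SLO over a 30-day window.
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60            # 43,200 minutes in the window

budget_fraction = 1 - SLO                # 0.1% of requests may fail
budget_minutes = budget_fraction * WINDOW_MINUTES   # downtime-equivalent budget

def burn_rate(observed_error_rate: float) -> float:
    """How many times faster than 'exactly on plan' the budget is burning."""
    return observed_error_rate / budget_fraction

def hours_to_exhaustion(observed_error_rate: float) -> float:
    """Runway from a full budget if the current error rate persists."""
    return (WINDOW_MINUTES / burn_rate(observed_error_rate)) / 60

print(f"budget: {budget_minutes:.1f} downtime-equivalent minutes")
print(f"burn rate at 5% errors: {burn_rate(0.05):.0f}x")
print(f"runway at 5% errors: {hours_to_exhaustion(0.05):.1f} hours")
```

The same two functions reproduce every row of the burn-rate table below: a 14.4x burn rate yields a 50-hour runway, 36x yields 20 hours, 60x yields 12.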
The burn rate concept formalizes this. A burn rate of 1 means you are consuming your error budget at exactly the rate that would use it up precisely at the end of the window. A burn rate of 10 means you are consuming it 10x faster — if you started at the beginning of the month, you would exhaust the budget in 3 days instead of 30. A burn rate of 60 means you would exhaust the budget in 12 hours. The burn rate translates the abstract error rate into a concrete measure of severity: "how much runway do you have before the SLO is breached?"
| Burn Rate | Error Rate (for 99.9% SLO) | Time to Exhaust Budget | Recommended Action |
|---|---|---|---|
| 1x | 0.1% | 30 days (full window) | No alert — this is normal budget consumption |
| 2x | 0.2% | 15 days | No page — ticket for investigation during business hours |
| 5x | 0.5% | 6 days | Low-severity alert — P3, business hours only |
| 10x | 1% | 3 days | Medium alert — P2, page during business hours, on-call at night |
| 14.4x | 1.44% | 50 hours (2% budget in 1hr) | High severity — P1 fast burn, page immediately |
| 36x | 3.6% | 20 hours | Critical — P1, immediate page + escalation |
| 60x | 6% | 12 hours | Severe — P1, major incident declaration |
The Google SRE Workbook specifies a precise burn rate threshold for the fast-burn case: a burn rate of 14.4 or higher over the last hour should trigger an immediate page. Why 14.4? Because 14.4x burn rate means you are consuming 2% of your 30-day error budget every hour. At that rate, you exhaust the budget in 50 hours — less than 3 days. This is significant enough to require immediate human attention regardless of time of day. The threshold was chosen to balance sensitivity (catching real incidents quickly) with specificity (not paging on minor transient spikes).
Why SLO Alerting Beats Threshold Alerting
A 1% error rate on a service with a 99.9% SLO is a P1 incident (burn rate 10x). A 1% error rate on a service with a 99% SLO is background noise (burn rate 1x — you are consuming budget exactly on plan). SLO-based alerting automatically calibrates alert severity to the service's actual reliability commitments. Threshold-based alerting treats 1% errors the same regardless of what the service promised users.
Setting Up SLO-Based Alerting: Step by Step
1. Define your SLI (Service Level Indicator): the metric that measures user experience. Usually: error rate (proportion of requests returning 5xx), latency (proportion of requests slower than threshold), or availability (proportion of time service is serving traffic).
2. Set your SLO (Service Level Objective): the target reliability level. Example: "99.9% of checkout requests will succeed over any 30-day rolling window." This sets your error budget at 0.1%.
3. Calculate your error budget in concrete terms: for a 99.9% SLO over 30 days, error budget = 43.8 minutes of downtime equivalent or 0.1% of request volume.
4. Define burn rate thresholds for each alert severity. Fast burn (P1): 14.4x over 1 hour (2% of monthly budget consumed in 1 hour). Slow burn (P2): 6x over 6 hours (5% of monthly budget in 6 hours). Very slow burn (P3/ticket): 3x over 24 hours (10% of monthly budget in a day).
5. Implement multi-window alerting: require the burn rate to be elevated in BOTH a short window AND a long window. This eliminates false positives from transient spikes while maintaining fast detection of sustained incidents.
6. Link every alert to a runbook with specific diagnostic steps. The alert body should include current burn rate, percentage of error budget remaining, and the runbook URL.
The multi-window burn rate algorithm is the gold standard implementation of SLO-based alerting, and it is the approach used at Google, Spotify, and other organizations that have invested seriously in reliability engineering. The core insight is that a single-window burn rate alert (e.g., "burn rate > 14.4x in the last hour") has a significant flaw: it fires on transient spikes that resolve themselves within the measurement window. The multi-window approach requires the burn rate to be elevated in two windows simultaneously — a short window for fast detection and a longer window for false-positive reduction.
The algorithm works as follows. Define two measurement windows: a short window (1 hour for P1, 6 hours for P2) and a long window (5x the short window: 5 hours for P1, 30 hours for P2). The alert fires only when burn rate exceeds the threshold in BOTH windows simultaneously. This means a 20-minute spike that resolves before the end of the 5-hour long window will NOT fire a P1 alert. A sustained 2-hour incident WILL fire because both the 1-hour and 5-hour windows show elevated burn rate.
The Four Alert Categories from the SRE Workbook
P1 Fast Burn: burn rate > 14.4x in 1-hour AND 5-hour windows (2% budget consumed in 1 hour — exhausted in < 50 hours). P2 Slow Burn: burn rate > 6x in 6-hour AND 30-hour windows (5% budget consumed in 6 hours — exhausted in < 5 days). P3 Ticket: burn rate > 3x in 24-hour window (10% budget consumed in 24 hours — exhausted in < 10 days). P4 No Action: burn rate between 1x and 3x — monitor on dashboard, no page.
Complete Prometheus Alert Rules for Multi-Window Burn Rate
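A sketch of what such rules might look like. The recording-rule names (slo:checkout_error_ratio:rateXh) are illustrative, and 0.001 is the error budget fraction for a 99.9% SLO:

```yaml
# Multi-window burn-rate alerts: each alert requires BOTH the short and
# long window to exceed the burn-rate threshold simultaneously.
groups:
  - name: slo-burn-rate-alerts
    rules:
      - alert: CheckoutSLOFastBurn
        expr: |
          slo:checkout_error_ratio:rate1h > 14.4 * 0.001
          and
          slo:checkout_error_ratio:rate5h > 14.4 * 0.001
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Checkout SLO fast burn: 2% of monthly error budget per hour"
          runbook_url: "https://wiki.internal/runbooks/checkout-errors"
      - alert: CheckoutSLOSlowBurn
        expr: |
          slo:checkout_error_ratio:rate6h > 6 * 0.001
          and
          slo:checkout_error_ratio:rate30h > 6 * 0.001
        for: 15m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Checkout SLO slow burn: 5% of monthly error budget per 6 hours"
          runbook_url: "https://wiki.internal/runbooks/checkout-errors"
```

The PromQL `and` operator is what enforces the both-windows requirement: the alert expression only produces a result when both comparisons are simultaneously true.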
The for: 2m parameter in the P1 alert deserves explanation. The for clause means the condition must be continuously true for 2 minutes before the alert fires. This eliminates alert flapping from transient metric collection gaps or momentary spikes that cross the threshold for a single scrape interval. For P1 alerts, 2 minutes is the right balance: it is short enough to page quickly for real incidents (total detection time: spike begins + 2 minutes) but long enough to avoid spurious pages from 15-second metric collection anomalies. P2 alerts use for: 15m because the longer burn window already provides temporal smoothing.
| Alert Level | Short Window | Long Window | Burn Rate | Budget Consumed | Time to Exhaustion | Response SLA |
|---|---|---|---|---|---|---|
| P1 Fast Burn | 1 hour | 5 hours | > 14.4x | 2% in 1 hr | < 50 hours | Page immediately, respond within 5 min |
| P2 Slow Burn | 6 hours | 30 hours | > 6x | 5% in 6 hr | < 5 days | Page engineer, respond within 30 min |
| P3 Ticket | 24 hours | 72 hours | > 3x | 10% in 24 hr | < 10 days | Create ticket, address in current sprint |
| P4 Monitor | N/A | N/A | 1x–3x | On track | On track | Dashboard review, weekly SLO meeting |
Latency SLO Burn Rate Requires Different PromQL
For latency SLOs, you cannot use error rate directly. Instead, derive an equivalent: the proportion of requests that EXCEED the latency threshold. The histogram gives you the proportion within the threshold — sum(rate(http_request_duration_seconds_bucket{le="0.3"}[1h])) / sum(rate(http_request_duration_seconds_count[1h])) — so if your SLO is "95% of requests complete within 300ms," the latency error rate is 1 minus that ratio. Apply the same burn rate thresholds to this derived metric.
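In recording-rule form, the derived latency "error" ratio might look like this (metric and rule names are illustrative; the histogram must have a bucket boundary at the SLO threshold, here 0.3 seconds):

```yaml
# Proportion of checkout requests slower than 300ms over the last hour.
- record: slo:checkout_slow_request_ratio:rate1h
  expr: |
    1 - (
      sum(rate(http_request_duration_seconds_bucket{job="checkout",le="0.3"}[1h]))
      /
      sum(rate(http_request_duration_seconds_count{job="checkout"}[1h]))
    )
```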
Use Recording Rules, Not Raw PromQL in Alert Rules
Multi-window burn rate calculations over long windows (30 hours) are expensive to evaluate in real time for every Prometheus evaluation interval (typically 15-30 seconds). Pre-compute the ratio using recording rules, evaluated on a slower interval (1-5 minutes). This reduces query load on Prometheus by 90%+ while providing the same alerting accuracy. The SRE Workbook recommends this pattern explicitly.
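A sketch of the recording rules behind a multi-window burn-rate setup; the metric selectors and rule names are illustrative:

```yaml
# Pre-compute the per-window SLI error ratio so that alert expressions
# reduce to cheap threshold comparisons on precomputed series.
groups:
  - name: slo-recording-rules
    interval: 1m   # evaluated less often than the default evaluation interval
    rules:
      - record: slo:checkout_error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{job="checkout",code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="checkout"}[1h]))
      - record: slo:checkout_error_ratio:rate5h
        expr: |
          sum(rate(http_requests_total{job="checkout",code=~"5.."}[5h]))
          /
          sum(rate(http_requests_total{job="checkout"}[5h]))
      # ...repeat for the 6h and 30h windows used by the slow-burn alert
```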
Generating the right alerts is only half the battle. Those alerts must reach the right people, through the right channel, at the right time. Alert routing is the system that maps alerts to on-call engineers based on service ownership, severity, time of day, and team availability. When routing fails — alerts go to the wrong team, bypass escalation, or disappear silently — the best alerting logic in the world does not protect your users.
The modern alert routing pipeline has several components. Alertmanager (for Prometheus-based stacks) or equivalent receives raw alerts from the monitoring system. A routing tree evaluates each alert against rules that map labels to receivers. Receivers send notifications to PagerDuty, Opsgenie, or a direct integration. The on-call scheduling platform manages rotation schedules and escalation policies. A secondary escalation path ensures that un-acknowledged alerts reach someone within the SLA.
flowchart TD
A["Prometheus / Grafana
Alert fires"] --> B["Alertmanager
Receives alert"]
B --> C{"Route by labels:
team, severity, env"}
C -->|"severity=critical
team=payments"| D["PagerDuty
Payments On-Call Policy"]
C -->|"severity=warning
team=infra"| E["Slack #infra-alerts
+ PagerDuty low-urgency"]
C -->|"severity=info
env=staging"| F["Slack #staging-alerts
(no page)"]
D --> G["Primary On-Call
Notified via SMS + app"]
G -->|"No ACK in 5 min"| H["Escalation Layer 1
Secondary On-Call"]
H -->|"No ACK in 10 min"| I["Escalation Layer 2
Engineering Manager"]
I -->|"No ACK in 15 min"| J["Escalation Layer 3
Director of Engineering"]
E --> K["On-Call Engineer
Low urgency notification"]
K -->|"No ACK in 30 min"| L["Escalation
Team Lead (business hours)"]Alert Routing and Escalation Flow
Alertmanager Routing Configuration Example
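A sketch of an alertmanager.yml implementing such a routing tree; receiver names, team labels, and integration keys are placeholders:

```yaml
route:
  receiver: default-slack            # catch-all; should rarely match
  group_by: [alertname, team, env]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - 'severity="critical"'
        - 'team="payments"'
      receiver: pagerduty-payments
      repeat_interval: 1h            # re-page hourly while unresolved
    - matchers:
        - 'severity="warning"'
        - 'team="infra"'
      receiver: slack-infra-alerts
    - matchers:
        - 'severity="info"'
        - 'env="staging"'
      receiver: slack-staging-alerts

receivers:
  - name: pagerduty-payments
    pagerduty_configs:
      - routing_key: "<events-api-v2-integration-key>"
  - name: slack-infra-alerts
    slack_configs:
      - channel: "#infra-alerts"
  - name: slack-staging-alerts
    slack_configs:
      - channel: "#staging-alerts"
  - name: default-slack
    slack_configs:
      - channel: "#alerts-unrouted"
```

Routes are evaluated top-down; the first matching route wins, and anything unmatched falls through to the catch-all receiver, which should be monitored for misrouted alerts.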
PagerDuty escalation policies provide the safety net when an on-call engineer does not respond. A well-designed escalation policy has at minimum three levels: (1) Primary on-call engineer, notified immediately via SMS and push notification, with a 5-minute acknowledgement window. (2) Secondary on-call engineer, notified if primary does not acknowledge within 5 minutes — this is the backup, typically the previous week's primary. (3) Engineering manager or team lead, notified if secondary does not acknowledge within 10 minutes. For P1 incidents at companies with strict SLAs, a fourth level may exist: a VP or Director who receives a notification after 20 minutes if the incident is still unacknowledged.
Severity-Based Notification Channels
P1 (critical): SMS + phone call + push notification. Never just email. The goal is to wake someone up if necessary. P2 (warning): Push notification + email. Should interrupt but not necessarily wake from sleep. P3 (info): Email + Slack. Business hours only — configure time-based routing in PagerDuty to suppress nighttime pages for P3. P4 (diagnostic): Slack only, or dashboard visibility only — no active notification.
| Severity | Notification Channel | ACK Window | Escalation Path | Business Hours Only? |
|---|---|---|---|---|
| P1 Critical | Phone call + SMS + Push | 5 min | Primary → Secondary → Manager → Director | No — always pages |
| P2 High | SMS + Push | 15 min | Primary → Secondary → Manager | No — always pages |
| P3 Warning | Push notification | 30 min | Primary → Secondary | Yes — suppress 10pm-8am |
| P4 Info | Slack + Email | Next business day | None — ticket created | Yes — Slack only |
Alert Routing Anti-Pattern: No Team Label
Alerts without a team label route to a catch-all "default" receiver and often page the wrong team or nobody at all. Every alert must have a team label that maps to a specific on-call rotation. At Google, the ownership model is strict: every service has exactly one SRE team responsible for it, and every alert for that service has that team's label. Alerts without a clear owner are not alerts — they are noise with little chance of resolution.
The difference between a good alert and a bad alert comes down to three questions, formulated by Google's SRE team and described in both the SRE Book and the SRE Workbook. Before any alert is added to production, every member of the team should be able to answer yes to all three: Is it urgent? Is it important? Is it actionable? These are not trivially obvious questions. Each one has a specific operational meaning, and failing even one disqualifies the alert from paging on-call engineers.
"Is it urgent?" asks: does this situation require a human response within minutes? An alert that can wait until morning is not urgent. An alert that can wait until the on-call engineer finishes their current task is not urgent. Urgency means: if this is not addressed in the next 5-15 minutes, user harm will escalate or SLO budget will be exhausted. P1 burn rate alerts are urgent. CPU > 80% for 10 minutes is not urgent — it can wait for a dashboard review.
"Is it important?" asks: does the impact on users justify interrupting an engineer? A 0.01% increase in error rate that keeps you well within SLO is not important. A 2% error rate burning down your error budget is important. Importance is calibrated against the SLO. An alert that fires when you are at 1x burn rate (normal budget consumption) is not important. An alert that fires at 14.4x burn rate (imminent SLO breach) is critically important.
"Is it actionable?" asks: is there a specific set of steps the on-call engineer can execute to diagnose and mitigate the issue? An alert that says "something is wrong" is not actionable. An alert that says "checkout error rate is 3% (burn rate 30x), likely causes: payment gateway timeout or database deadlock, see runbook for diagnostic commands" is actionable. Actionability requires a runbook link. It requires the alert to contain enough context that an engineer who has never seen this failure before can begin effective triage in under 2 minutes.
The Minimum Viable Alert Body
Every alert should contain: (1) WHAT is happening in plain English — "Checkout API error rate is 2.3%, above the 1% SLO threshold." (2) WHO is affected — "This affects users attempting to complete purchases." (3) SEVERITY AND URGENCY — "P1: error budget burn rate is 23x. Monthly budget exhausted in 32 hours at this rate." (4) WHERE to go — "Runbook: https://wiki.internal/runbooks/checkout-errors | Dashboard: https://grafana.internal/d/checkout-overview"
Alert Template Structure for PagerDuty / Alertmanager
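One way to structure those four fields as Prometheus rule annotations; the names and URLs follow the examples above, the recording rule is illustrative, and the templating uses Prometheus's built-in Go template functions:

```yaml
# WHAT / WHO / SEVERITY / WHERE, encoded as alert annotations.
- alert: CheckoutErrorBudgetFastBurn
  expr: slo:checkout_error_ratio:rate1h > 14.4 * 0.001
  for: 2m
  labels:
    severity: critical
    team: payments
  annotations:
    summary: >-
      Checkout API error rate is {{ $value | humanizePercentage }},
      above the 1% SLO fast-burn threshold.
    impact: "Users attempting to complete purchases are failing."
    urgency: "P1: burn rate exceeds 14.4x; monthly budget exhausted in < 50h at this rate."
    runbook_url: "https://wiki.internal/runbooks/checkout-errors"
    dashboard_url: "https://grafana.internal/d/checkout-overview"
```

PagerDuty and Slack receivers can then surface these annotations directly in the notification body, so the engineer never has to open a dashboard just to learn what fired.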
Alert Design Review Process (Pre-Merge Checklist)
1. Does this alert fire only when users are experiencing measurable harm (error rate increase, latency increase, availability drop)? If it fires on an internal metric that does not directly correlate to user impact, reconsider.
2. Is the alert threshold derived from the SLO? For error rate alerts, is the threshold above the SLO budget consumption rate? For latency alerts, is it above the agreed SLA latency? Arbitrary thresholds (90%, 80%) are a red flag.
3. Does the alert have a runbook link that exists and is current? Test the link. Open the runbook and verify the diagnostic commands still work against current infrastructure.
4. Has this alert been tested in a non-production environment or validated against historical data? Can you find at least one real historical incident where this alert would have fired? Can you confirm it would NOT have fired during non-incident periods in the last 30 days?
5. Does the alert have an owner (team label) who understands the runbook and is on the on-call rotation that will receive it? If no team owns it, do not deploy it.
6. Is there a review date or TTL set for this alert? Alerts should expire and require re-justification after 90-180 days unless they have demonstrated value by firing and triggering real incident response.
The Alert Quality Mnemonic: UIA
Before adding any alert, ask: Urgent (must respond in minutes), Important (user-visible impact at SLO-relevant scale), Actionable (runbook exists, engineer knows what to do). All three must be true. Missing any one: UIA fails, alert should not page. Demote to dashboard, ticket, or Slack notification.
Even with well-designed individual alerts, a fleet of thousands of services will generate correlated alert storms during major incidents. When a shared database goes down, 50 services that depend on it all fire alerts simultaneously. The on-call engineer receives 50 pages within 2 minutes, all describing symptoms of the same root cause. Without noise reduction mechanisms, the alert storm becomes the primary obstacle to incident resolution — engineers spend more time acknowledging and triaging the storm than fixing the actual problem.
Alertmanager provides four primary noise-reduction mechanisms: deduplication, grouping, inhibition, and silencing. Each addresses a different failure mode. Understanding all four and knowing when to use each is the mark of a mature alerting practice.
The Four Noise Reduction Mechanisms
Deduplication collapses identical alerts (same label set) arriving from multiple sources — such as a high-availability Prometheus pair — into a single notification. Grouping batches related alerts that share group_by labels (for example, all alerts from one cluster) into one notification, controlled by group_wait and group_interval. Inhibition suppresses lower-severity alerts while a related higher-severity alert is firing, modeling cause-and-effect between services. Silencing temporarily mutes alerts matching specified labels for a fixed time window, typically during planned maintenance.
Alert Dependency Graphs: Inhibition Done Right
Inhibition requires modeling your service dependency graph in Alertmanager configuration. A payment service depends on: a database cluster, a fraud detection API, and a currency exchange service. If the database fires a critical alert, inhibit all downstream payment service symptom alerts. But if the fraud detection API fires independently, do NOT inhibit — that is a separate incident. Getting inhibition wrong means either missing real incidents (over-inhibition) or creating alert storms (under-inhibition). Start conservative: only add inhibition rules for well-understood parent-child relationships.
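A sketch of one such parent-child inhibition rule in Alertmanager configuration; the alert and label names are illustrative:

```yaml
# A critical database-cluster alert suppresses downstream payment-service
# symptom warnings, but only within the same cluster.
inhibit_rules:
  - source_matchers:
      - 'alertname="DatabaseClusterDown"'
      - 'severity="critical"'
    target_matchers:
      - 'service="payments"'
      - 'severity="warning"'
    equal: [cluster]   # both alerts must share the cluster label value
```

The `equal` clause is what keeps this conservative: an unrelated database failure in another cluster will not silence payment alerts elsewhere.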
Flap detection addresses another noise source: alerts that rapidly oscillate between firing and resolved. An alert that fires, resolves, fires, resolves every 2 minutes over 30 minutes generates 15 notifications for what might be a single underlying issue. Alertmanager does not have native flap detection, but it can be approximated using the for clause (requiring sustained condition before firing) and repeat_interval (minimum time between re-notifications). A better approach is to use a longer evaluation window in the PromQL expression itself: instead of alerting on the current 5-minute error rate, alert on a 15-minute or 30-minute rate, which smooths transient fluctuations.
| Noise Problem | Root Cause | Alertmanager Solution | PromQL Solution |
|---|---|---|---|
| Duplicate pages for same alert | Same condition firing from multiple sources | group_by + deduplication | N/A — Alertmanager handles this |
| 50 pages in 2 minutes during outage | Mass failure of dependent services | group_by + group_wait: 30s | N/A — grouping is the right tool |
| 50 symptom alerts masking 1 root cause | No parent-child alert modeling | inhibit_rules by cluster/service | N/A — inhibition is the right tool |
| Alert flapping every few minutes | Threshold too close to metric baseline | for: 5m to require sustained condition | Use longer rate window (15m or 30m) |
| Alerts firing during known maintenance | No planned silence mechanism | amtool silence add with time limit | N/A — silencing is the right tool |
| Continuous re-notifications for open incident | Default repeat_interval too short | repeat_interval: 4h for P2, 1h for P1 | N/A — repeat_interval is the right tool |
The Alert Dependency Graph as Code
Maintain your service dependency graph in a machine-readable format (YAML or JSON) and generate your Alertmanager inhibition rules from it automatically. When Service A depends on Service B, the inhibition rule "B critical inhibits A warning" is a logical consequence of the dependency. Generating rules from the dependency graph ensures consistency and makes it easy to add new services without manually crafting inhibition rules. Store this in Git and apply it via Terraform or Helm.
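A minimal sketch of that generation step in Python. The dependency map, label schema, and rule shape are assumptions; a real pipeline would feed the output into your Alertmanager configuration via Terraform or Helm:

```python
import json

# Hypothetical dependency map: service -> the services it depends on.
DEPENDENCIES = {
    "payment-service": ["postgres-cluster", "fraud-api", "fx-service"],
    "checkout-api": ["payment-service", "inventory-service"],
}

def inhibition_rules(deps):
    """One inhibit rule per dependency edge: a critical alert on the
    upstream suppresses warning-level symptom alerts on the dependent
    service within the same cluster."""
    rules = []
    for service, upstreams in sorted(deps.items()):
        for upstream in upstreams:
            rules.append({
                "source_matchers": [f'service = "{upstream}"',
                                    'severity = "critical"'],
                "target_matchers": [f'service = "{service}"',
                                    'severity = "warning"'],
                "equal": ["cluster"],
            })
    return rules

if __name__ == "__main__":
    # Emit a fragment ready to merge into alertmanager.yml.
    # (JSON is valid YAML, so this can be included directly.)
    print(json.dumps({"inhibit_rules": inhibition_rules(DEPENDENCIES)},
                     indent=2))
```

Because every rule is derived from an explicit edge, adding a new service is a one-line change to the dependency map rather than hand-written inhibition config.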
The best way to understand good alerting design is to study the failure modes in detail. These anti-patterns are drawn from real-world SRE experience at scale: teams that implemented them consistently saw degraded on-call quality, higher incident response times, and engineer burnout. Each anti-pattern comes with a real cost calculation — the goal is to make the organizational cost of bad alerting concrete and undeniable.
Anti-Pattern 1: The CPU Alert
Alert: "server CPU > 80% for 5 minutes — page on-call." Cost: fires 200+ times per month across a fleet of 50 servers during normal traffic patterns, batch jobs, and index rebuilds. The on-call engineer investigates and finds nothing wrong 95% of the time. Total on-call time wasted: 200 pages x 10 min investigation = 2,000 minutes = 33 hours per month. That is nearly a full engineer work week, every month, spent on false positives. The fix: replace it with a symptom-based alert on the application error rate or latency that a CPU spike might eventually cause — and page only when the CPU spike actually causes user impact, which it usually does not.
Anti-Pattern 2: Disk Usage Without Trend
Alert: "disk usage > 80%." A disk at 80% that has been stable for 6 months is not an emergency; it is unlikely to fill any time soon. A disk at 60% that is growing 5 percentage points per day will be full in 8 days. The right alert is trend-based: disk_usage > 80% AND disk_fill_rate > 2%/day. Even better: predict_linear(node_filesystem_free_bytes[6h], 4 * 3600) < 0, which fires when the disk is projected to fill within 4 hours based on the recent trend. The static threshold alert generates chronic noise from disks that are healthy but large, while missing the disks that are about to fill.
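As a Prometheus rule, the trend-based version might look like the following sketch (node_filesystem_free_bytes and the fstype filter follow node_exporter conventions; the for clause guards against a single noisy sample skewing the projection):

```yaml
groups:
  - name: disk-capacity
    rules:
      - alert: DiskProjectedFullIn4h
        # Linear projection over the last 6h of free-space samples; fires
        # only when the projection crosses zero within 4 hours.
        expr: |
          predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs|overlay"}[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: page
        annotations:
          runbook: https://wiki.internal/runbooks/infra/disk-projected-full
```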
Anti-Pattern 3: The "Just in Case" Alert
Motivation: "we had a database OOM last quarter and we should have caught it earlier, so let us add an alert for database memory > 70%." Every post-mortem adds two alerts. After 3 years, the team has 400 alerts, 80% added reactively after incidents. Most have fired fewer than 5 times total. The on-call queue has 200+ open low-severity alerts that nobody has time to investigate. The team's actual MTTD (mean time to detect) for real incidents is HIGHER than before the alerts were added because engineers learned to ignore the pager. The fix: post-mortem actions should add symptom-based SLO alerts, not cause-based metric alerts. Ask "what is the user-visible symptom we missed?" not "what metric was elevated before the incident?"
Anti-Pattern 4: Alert Copy-Paste Across Services
A team copies the alert config from service-A to service-B because the services are similar. Service-B has different traffic patterns (10x lower volume), so the absolute thresholds are wrong. The p99 latency alert fires constantly for service-B during normal operation because the thresholds were calibrated for service-A's traffic volume. After 3 weeks of noise, the on-call engineer silences the alert indefinitely. Cost: the alert that should catch service-B latency regressions is permanently silenced because it was configured incorrectly. Real incidents in service-B will now be missed. Fix: never copy-paste alert configs. Derive thresholds from each service's SLO and baseline metrics. Use templated alert rules with per-service parameterization.
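One way to parameterize per service, sketched as a Helm-style template (the values schema, metric names, and thresholds are assumptions for illustration):

```yaml
# values.yaml (assumed schema), thresholds derived from each service's SLO:
#
#   services:
#     service_a: { p99_slo_seconds: 0.5 }
#     service_b: { p99_slo_seconds: 2.0 }
#
# templates/alert-rules.yaml: one rule per service, threshold injected
{{- range $name, $cfg := .Values.services }}
- alert: {{ $name }}_p99_latency_high
  expr: |
    histogram_quantile(0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket{service="{{ $name }}"}[15m]))
    ) > {{ $cfg.p99_slo_seconds }}
  for: 10m
  labels:
    severity: page
{{- end }}
```

The rule logic lives in one place; only the SLO-derived threshold varies, which makes copy-paste drift structurally impossible.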
Anti-Pattern 5: Too Many Owners (Alert CC Culture)
Alert: "database slow queries — page DBA team, notify platform team, CC app team, notify VP of Engineering." Five people receive a page for the same alert. Nobody knows who is responsible. The DBA assumes the app team is handling it. The app team assumes the DBA is handling it. The incident festers for 30 minutes before anyone takes ownership. The fix is strict single-owner alerting: every alert pages exactly one team. Other teams can be informed via Slack notifications, but only one team is responsible for response. The responsible team can escalate to others, but the default routing is unambiguous.
| Anti-Pattern | Monthly False Positive Cost | Real Incident Risk | Fix |
|---|---|---|---|
| CPU > 80% alert | 200+ pages, 33 hours wasted | Low — CPU rarely causes user impact directly | Alert on application error rate or latency instead |
| Static disk > 80% alert | 50+ pages, 8 hours wasted | Medium — misses rapid-fill scenarios | Use predict_linear() for trend-based projection |
| "Just in case" post-mortem alerts | 100+ pages, 17 hours wasted | High — causes engineers to ignore all alerts | Add SLO/symptom alerts from post-mortems, not metric alerts |
| Copy-paste thresholds | Variable — typically 30-50 false positives | High — silenced alerts miss real incidents | Derive thresholds from per-service SLO and baseline |
| Multi-owner CC culture | 0 extra pages but delayed response | Very High — diffusion of responsibility | Single owner per alert; others notified via secondary channel |
The Real Cost of Alert Fatigue: Incident Response Time Doubles
A study by VictorOps (now Splunk On-Call) found that teams with high alert fatigue (> 50 pages/week per on-call) have mean time to acknowledge (MTTA) of 15-25 minutes, compared to 2-5 minutes for teams with low alert fatigue (< 20 pages/week). During that extra 10-20 minutes, an incident causing $10,000/minute in revenue loss costs $100,000-200,000 more than it should have. Alert fatigue is not just an engineering quality-of-life issue — it is a direct driver of incident cost.
The Quarterly Alert Audit
Every quarter, pull the alert firing history from PagerDuty or Opsgenie. For each alert that fired in the last 90 days: How many times did it fire? What percentage of fires resulted in a real incident (not just acknowledged and closed)? What was the average MTTA? Alerts with a real-incident rate below 20% are candidates for retirement. Alerts with an average MTTA above 15 minutes are either misconfigured (wrong team) or lack runbooks. This audit, done quarterly, prevents alert inventory from growing monotonically.
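The audit logic above is simple enough to script. A sketch in Python, assuming a hypothetical flat export of paging records from PagerDuty or Opsgenie (the record fields here are illustrative, not either vendor's API shape):

```python
from collections import defaultdict

# Hypothetical export: one record per page over the last 90 days.
PAGES = [
    {"alert": "ErrorBudgetFastBurn", "mtta_minutes": 4,  "real_incident": True},
    {"alert": "ErrorBudgetFastBurn", "mtta_minutes": 6,  "real_incident": True},
    {"alert": "CPUHigh", "mtta_minutes": 22, "real_incident": False},
    {"alert": "CPUHigh", "mtta_minutes": 18, "real_incident": False},
    {"alert": "CPUHigh", "mtta_minutes": 25, "real_incident": False},
    {"alert": "CPUHigh", "mtta_minutes": 31, "real_incident": False},
]

def audit(pages, min_incident_rate=0.20, max_mtta_minutes=15):
    """Per-alert firing stats plus the two audit flags from the text."""
    stats = defaultdict(lambda: {"fires": 0, "incidents": 0, "mtta_sum": 0})
    for page in pages:
        s = stats[page["alert"]]
        s["fires"] += 1
        s["incidents"] += int(page["real_incident"])
        s["mtta_sum"] += page["mtta_minutes"]
    report = {}
    for name, s in stats.items():
        incident_rate = s["incidents"] / s["fires"]
        avg_mtta = s["mtta_sum"] / s["fires"]
        report[name] = {
            "fires": s["fires"],
            "incident_rate": incident_rate,
            "avg_mtta": avg_mtta,
            # Real-incident rate below 20%: candidate for retirement.
            "retire_candidate": incident_rate < min_incident_rate,
            # Slow acknowledgement: wrong team or missing runbook.
            "check_routing_or_runbook": avg_mtta > max_mtta_minutes,
        }
    return report
```

Run against the sample data, CPUHigh is flagged on both counts while the burn-rate alert passes cleanly, which is exactly the triage signal the quarterly review needs.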
Building a production-grade alerting stack involves more than writing PromQL. It requires choosing the right tools, configuring them correctly, integrating on-call scheduling, establishing an alert review culture, and building a starter library of high-value alerts that your team can deploy and customize. This section walks through the complete stack from infrastructure to culture, with concrete configurations you can adapt immediately.
Complete Alertmanager Configuration (Production Template)
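A sketch of such a template, assuming Alertmanager 0.22+ matcher syntax; receiver names, URLs, routing keys, and label values are placeholders to adapt, not a canonical configuration:

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: slack-default              # catch-all for anything unrouted
  group_by: [alertname, cluster, service]
  group_wait: 30s                      # collect related alerts before notifying
  group_interval: 5m
  repeat_interval: 4h                  # P2 default; P1 overrides below
  routes:
    - matchers: ['alertname = "Watchdog"']
      receiver: deadman-switch         # dead man switch: must ALWAYS deliver
      repeat_interval: 1m
    - matchers: ['severity = "page"', 'priority = "P1"']
      receiver: pagerduty-p1
      repeat_interval: 1h
    - matchers: ['severity = "page"']
      receiver: pagerduty-p2
    - matchers: ['severity = "ticket"']
      receiver: ticket-queue

receivers:
  - name: pagerduty-p1
    pagerduty_configs:
      - routing_key: REPLACE_WITH_P1_ROUTING_KEY
  - name: pagerduty-p2
    pagerduty_configs:
      - routing_key: REPLACE_WITH_P2_ROUTING_KEY
  - name: ticket-queue
    webhook_configs:
      - url: https://ticket-bridge.internal/webhook   # hypothetical bridge
  - name: deadman-switch
    webhook_configs:
      - url: https://deadman.internal/ping            # external watchdog
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE
        channel: "#alerts-default"

inhibit_rules:
  # Critical inhibits warning for the same alert on the same service/cluster.
  - source_matchers: ['severity = "critical"']
    target_matchers: ['severity = "warning"']
    equal: [alertname, cluster, service]
```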
Starter Alert Library: 10 High-Value Alerts
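Two of the highest-value entries in any starter library are the SLO fast-burn alert and the dead man switch. A sketch of both, using the 1-hour/5-hour window pair and 14.4x threshold discussed in this lesson for a 99.9% SLO; the metric and job names are placeholders:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # P1 fast burn: BOTH the 1h and 5h windows must exceed 14.4x the
      # budgeted error rate (0.1% for a 99.9% SLO) before paging.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{job="payment-service", code=~"5.."}[1h]))
              / sum(rate(http_requests_total{job="payment-service"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="payment-service", code=~"5.."}[5h]))
              / sum(rate(http_requests_total{job="payment-service"}[5h]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
          priority: P1
        annotations:
          runbook: https://wiki.internal/runbooks/payment-service/error-budget-fast-burn
      # Dead man switch: always true. An external monitor pages if this
      # alert ever STOPS arriving, proving the pipeline itself is alive.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
```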
PagerDuty + Opsgenie Integration Checklist
When integrating your alerting stack with PagerDuty or Opsgenie: (1) Create one integration per service, not one per team — this enables per-service routing keys and per-service escalation policies. (2) Set the acknowledgement timeout to 5 minutes for P1 and 15 minutes for P2. (3) Enable conference bridge auto-creation for P1 incidents — when the first responder acknowledges, a Zoom or Google Meet link is automatically created and included in the incident. (4) Configure status page integration so P1 incidents automatically update your external status page. (5) Enable postmortem reminder automation: 24 hours after a P1 incident is resolved, PagerDuty sends a reminder to schedule the postmortem.
On-Call Schedule Design: Best Practices
1. Set rotation length to 1 week. Shorter rotations (3 days) do not allow engineers to build context; longer rotations (2 weeks) cause burnout. One week is the industry standard at Google, PagerDuty, and most FAANG companies.
2. Implement a follow-the-sun rotation for global teams: handoff on-call responsibility to a team in the next timezone during business hours. This ensures on-call engineers are never paged during the middle of their night if global coverage exists.
3. Require overlap during handoff: the outgoing on-call engineer and the incoming engineer are both on-call for 1 hour during the transition. This allows knowledge transfer on any active incidents and prevents "no coverage" windows during timezone handoffs.
4. Define explicit off-hours expectations: how quickly must P1 be acknowledged? (5 minutes is industry standard). What constitutes an excused non-response? (being on a flight, medical emergency). Who is the default secondary when the primary is unavailable?
5. Track on-call load metrics per engineer: total pages received, pages requiring action, MTTA, total incident time. Review these monthly. Engineers with significantly higher load than peers indicate an imbalanced rotation or a service that needs reliability investment.
6. Establish a post-on-call recovery policy: after a week with more than 3 overnight P1 incidents, the on-call engineer gets recovery time (typically a day or partial day) to compensate for lost sleep. This is explicit at Google and should be formalized at any organization that takes on-call health seriously.
Runbook as Code: The Foundation of Actionable Alerts
Every alert in your starter library must link to a runbook. Runbooks should live in your team's internal wiki with a consistent URL structure: https://wiki.internal/runbooks/{service}/{alert-name}. Each runbook contains: description of the alert and what it means, likely root causes ranked by frequency, diagnostic commands with expected output, mitigation steps in order of preference, rollback procedure if mitigation requires changes, and escalation contacts if the issue cannot be resolved within 30 minutes. Runbooks are living documents: update them every time an alert fires and the runbook is used, adding new root causes and updating commands that no longer work.
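A skeleton matching that structure, with hypothetical service names and diagnostic commands (the kubectl example assumes a Kubernetes deployment; substitute your own tooling):

```markdown
# Runbook: payment-service / ErrorBudgetFastBurn
<!-- https://wiki.internal/runbooks/payment-service/error-budget-fast-burn -->

## What this alert means
The checkout error rate is burning the 30-day error budget at > 14.4x.

## Likely root causes (ranked by frequency)
1. Bad deploy in the last hour
2. Database connection pool exhaustion
3. Upstream fraud API latency

## Diagnostics
    kubectl -n payments rollout history deploy/payment-service

## Mitigation (in order of preference)
1. Roll back the most recent deploy
2. Fail over to the secondary database

## Escalation
Escalate to the platform on-call if not mitigated within 30 minutes.

## Last updated
After each use of this runbook, add new root causes and fix stale commands.
```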
SRE alerting questions appear in three contexts at FAANG interviews. In system design, you may be asked to design the monitoring and alerting layer for a high-traffic service — interviewers expect SLO-based burn rate alerting, not threshold alerts. In behavioral interviews, questions like "tell me about a time you improved on-call quality" or "describe how you reduced alert fatigue on your team" directly test this knowledge. In SRE-specific technical screens, you may be asked to write PromQL for burn rate alerts, explain multi-window alerting, or describe how you would configure Alertmanager routing for a complex multi-team service. The most senior candidates differentiate themselves by connecting alerting design to business outcomes: fewer false positives mean faster MTTD, which means less user impact per incident.
Common questions:
Try this question: ask the interviewer what their current alert-to-incident ratio is — how many alerts fire per week per real incident? Have they adopted SLO-based alerting, or are they still on threshold-based alerting? How do they measure on-call health, and what is their target for actionable alert percentage? These questions demonstrate that you evaluate team operational health proactively and will bring a structured approach to improving alerting quality.
Strong answer: Demonstrating the burn rate calculation without prompting. Mentioning multi-window alerting unprompted when discussing SLO alerts. Describing the three-question alert test (Urgent, Important, Actionable). Knowing that Alertmanager handles routing and noise reduction while PagerDuty handles escalation and on-call scheduling — and being able to explain why the combination is better than either alone. Citing the dead man switch as a critical reliability requirement for any alerting stack.
Red flags: Recommending only CPU, memory, and disk threshold alerts as the primary alerting strategy. Not knowing what error budget burn rate means or being unable to do the burn rate math. Treating alert fatigue as a training problem ("engineers just need to take alerts more seriously") rather than a system design problem. Not mentioning runbooks as a requirement for actionable alerts. Describing inhibition rules as a way to suppress alerts rather than as a way to reduce noise while maintaining coverage of independent incidents.
Key takeaways
Explain the difference between symptom-based and cause-based alerting, and give a concrete example of each for a checkout API. Then explain why Google's SRE team moved away from cause-based alerting at scale.
Cause-based alerting monitors internal system state: CPU > 80%, database connection pool saturation > 90%, garbage collection pause > 500ms. These metrics may be elevated without any user impact. Symptom-based alerting monitors user experience: HTTP 5xx error rate > 1%, checkout flow latency p99 > 2 seconds. For a checkout API: cause-based alert = "payment-service CPU > 80%." Symptom-based alert = "checkout API error rate > 1% for 5 minutes (SLO burn rate 10x)." Google moved away from cause-based alerting because at scale (millions of servers), cause-based metrics fluctuate constantly due to batch jobs, GC cycles, and natural load variation. This generated millions of false-positive alerts per day, destroyed on-call signal quality, and caused engineers to ignore the pager — the exact behavior that led to missed incidents. Symptom-based SLO alerting reduced false positives by 10x while maintaining or improving detection coverage for real incidents.
Calculate the error budget burn rate for a service with a 99.9% SLO that is currently returning 2% errors. How long until the monthly error budget is exhausted? At what burn rate should you page P1? Walk through the multi-window algorithm.
Error budget for a 99.9% SLO: 0.1% of requests, or 43.2 minutes of full downtime per 30-day window. Current error rate: 2%. Burn rate = 2% / 0.1% = 20x. Time to budget exhaustion = 30 days / 20 = 1.5 days. The P1 threshold is a 14.4x burn rate (consumes 2% of the budget per hour, exhausting it in 50 hours). The current 20x burn rate exceeds the P1 threshold and should page immediately. Multi-window algorithm: the alert fires only when burn rate > 14.4x in BOTH the 1-hour AND 5-hour windows simultaneously. This eliminates false positives from transient 20-minute spikes (which would elevate the 1-hour window but not the 5-hour window). A sustained 2-hour incident at 20x burn rate would elevate both windows and fire the P1 alert.
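The arithmetic in that answer, sketched as a quick self-check:

```python
def burn_rate(error_rate, slo):
    """Burn rate = observed error rate / budgeted error rate (1 - SLO)."""
    return error_rate / (1.0 - slo)

def budget_minutes(slo, window_days=30):
    """Total allowed error-minutes over the SLO window."""
    return window_days * 24 * 60 * (1.0 - slo)

def days_to_exhaustion(rate, window_days=30):
    """At a constant burn rate of N, the window's budget lasts window/N days."""
    return window_days / rate

rate = burn_rate(error_rate=0.02, slo=0.999)
print(f"burn rate: {rate:.1f}x")                            # 20.0x
print(f"budget: {budget_minutes(0.999):.1f} minutes")       # 43.2
print(f"exhausted in {days_to_exhaustion(rate):.2f} days")  # 1.50
```

The same functions confirm the P1 threshold: at 14.4x, a 30-day budget lasts 30 / 14.4 ≈ 2.08 days, roughly 50 hours.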
Your team experienced an alert storm during a database outage last week: 60 alerts fired for a single root cause. Walk through the Alertmanager configuration changes you would make to prevent this in the future, including grouping, inhibition, and any PromQL changes.
Three complementary changes: (1) Grouping: add group_by: [alertname, cluster, database] to the route configuration. Set group_wait: 30s so Alertmanager waits 30 seconds to collect related alerts before sending. This reduces 60 simultaneous pages to a few grouped notifications. (2) Inhibition: add an inhibit rule that suppresses dependent service alerts when a database alert is active: source_match alertname=DatabaseDown, target_match type=database-dependency, equal: [cluster]. This reduces symptom alerts (application errors, latency spikes) that are consequences of the database being down. (3) PromQL: use longer evaluation windows (15m or 30m) for service-level alerts to smooth transient errors, reducing flapping that generates duplicate pages. Also add a deployment correlation rule: if > 20 alerts fire within 2 minutes of a database configuration change, group them under a single "database change impact" alert.
💡 Analogy
SLO-based alerting is like a smart smoke detector. A cheap smoke detector goes off every time you toast bread — it is so sensitive to any trace of smoke that it has lost your trust. After the tenth false alarm, you remove the battery. Now your house burns down and you never know. A broken smoke detector that never triggers (no alerts at all) is equally useless. A good smoke detector only triggers for real fires AND tells you which room is affected (actionable, routed correctly). But the best smoke detector of all — the equivalent of SLO-based alerting — is smart enough to know whether the smoke is coming from a small candle that will self-extinguish (slow burn, P3 ticket) or a kitchen fire that is consuming your error budget at 14.4x and will burn the house down in 50 hours (P1 fast burn). It does not wake you up for bread crumbs. It wakes you up when there is actually a fire. And it tells you exactly which room to go to.
⚡ Core Idea
Alert on user pain (symptoms), not internal system state (causes). Calibrate alert severity to error budget burn rate, not arbitrary metric thresholds. Route alerts to exactly one owner through a graduated escalation chain. Reduce noise through deduplication, grouping, and inhibition. The goal is signal fidelity: every page represents a real situation requiring human action, and every real situation generates a page.
🎯 Why It Matters
Alert fatigue is the hidden reliability crisis that destroys on-call rotations. Google found that teams ignoring 50% of their alerts is not an edge case — it is the norm for teams that have not invested in alerting quality. An on-call engineer who has been conditioned to ignore pages by months of false positives will miss the one page that matters: the P1 incident that takes down the service for 200,000 users. SLO-based burn-rate alerting solves this by making alert severity proportional to actual user impact, eliminating the false positive flood, and ensuring that when the pager fires, it always means something.