Burn-Rate Alerting: SLO Math That Doesn't Page You for Nothing
Static "errors > 1%" thresholds page you for every blip and miss the slow bleed. Replace them with multi-window, multi-burn-rate alerts tied to your error budget, and to real user impact.
Your alert says error_rate > 1%. At 3 a.m. it fires. You stumble to your laptop, pull up the dashboard, and the error rate is already back to 0.2%. A single bad deploy node flapped for ninety seconds, retried, and healed itself. Nothing was broken by the time you were awake. You acknowledge, you close the laptop, and you do not fall back asleep.
The opposite failure is quieter and worse. Your service sits at 0.9% errors for three straight days, never tripping the 1% line, but steadily torching the reliability budget your users actually feel. No page ever fires. By the time someone notices, you have already blown your quarterly target.
Both failures come from the same mistake: a static threshold has no sense of time or scale. It cannot tell a harmless blip from a sustained bleed, because it only ever looks at one instant. Burn-rate alerting fixes this by measuring how fast you are spending your error budget, and only waking a human when the spend rate actually threatens the target.
Who this is for
Engineers and SREs who already have an SLO (or want one) and are tired of noisy threshold alerts. You should be comfortable with Prometheus/PromQL and the idea of an [error budget](/blog/error-budgets-explained). No advanced math required, just ratios and time.
The principle: alert on spend rate, not on a line
Page a human when the error budget is being consumed fast enough that, left unchecked, it will run out before you can reasonably respond, and not before.
A static threshold is a fixed line painted on the road. A burn rate is a fuel gauge that warns you by how fast the needle is dropping, not by where it sits. If your tank empties in two minutes, that is an emergency whether you are at half a tank or a quarter. If it empties slowly over a week, you have time to pull over at the next station. The number to act on is the *rate of consumption relative to how long the fuel is supposed to last*, not the current level.
| Fuel-gauge analogy | Burn-rate alerting |
| --- | --- |
| Fuel tank size | Error budget (the allowed failures over the SLO window) |
| How fast the needle drops | Burn rate (multiple of normal budget spend) |
| "Empties in 2 minutes" warning light | Fast-burn page: budget gone in hours |
| "Range getting low" trip warning | Slow-burn ticket: budget trending to zero over days |
| Ignoring a single flickering gauge | Short confirmation window: don't react to one bad reading |
Burn rate replaces "how many errors right now" with "how fast am I spending my safety margin."
Concretely: burn rate is the ratio of your current error rate to the error rate you are *allowed* to sustain across the whole SLO window. A burn rate of 1x means you will spend exactly 100% of your budget by the end of the window, perfectly on plan. A burn rate of 14.4x means you will spend the entire month's budget in about two days. The bigger the multiple, the less time you have.
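The definition above can be written out as a short worked equation. This is a sketch that assumes a 99.9% SLO over a 30-day window (the example used throughout this article) and an observed error rate of 1.44%, chosen only because it produces the 14.4x figure.

```latex
\begin{align*}
\text{burn rate} &= \frac{\text{observed error rate}}{1 - \text{SLO}}
  = \frac{0.0144}{1 - 0.999} = 14.4 \\[4pt]
\text{time to exhaust} &= \frac{\text{SLO window}}{\text{burn rate}}
  = \frac{30\ \text{days}}{14.4} \approx 2.08\ \text{days} = 50\ \text{hours}
\end{align*}
```

At 1x the same formula gives exactly 30 days, which is why 1x is "on plan".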
From SLO to a page: the pipeline
Every burn-rate alert is the end of a short chain. You start from a target, derive a budget, measure how fast it is being spent, evaluate that spend over two time windows at once, and only then decide whether a human gets paged or a ticket gets filed.
SLO target → error budget → live burn rate → multi-window evaluation → page or ticket.
1. Pick an SLO. Say 99.9% of requests succeed over a rolling 30-day window. That leaves a 0.1% error budget.
2. Turn the target into a budget. The budget is `1 - SLO` = 0.1% of all requests over the window. Spend it however you like; just don't run out.
3. Measure the live error rate. Over a short window, compute failures ÷ total requests. This is your instantaneous spend.
4. Compute burn rate. Divide the live error rate by the budget (0.001). 1.4% errors ÷ 0.1% = 14x burn, fourteen times faster than sustainable.
5. Evaluate over two windows. Require a long window AND a short window to both exceed the threshold, so you confirm sustained burn and recover fast when it stops.
6. Route by severity. Fast burn (budget gone in hours) pages a human. Slow burn (budget gone in days) opens a ticket for business hours.
Choosing windows and thresholds
The art is mapping burn rates to severities. The trick is to anchor each threshold to a time-to-exhaust: at this burn rate, how long until the whole budget is gone? A useful identity for a 30-day window: the budget lasts (30 days) ÷ burn_rate. The classic Google two-tier setup pages on burn that would exhaust the budget within ~2 days and, at the slow end, tickets on burn that would exhaust it within ~10 days.
| Burn rate | Time to exhaust | Budget spent (window) | Action |
| --- | --- | --- | --- |
| 14.4x | ~2 days | 2% in 1h | Page on-call (urgent) |
| 6x | ~5 days | 5% in 6h | Page on-call (high) |
| 3x | ~10 days | 10% in 24h | File ticket (business hrs) |
| 1x | ~30 days | on plan | No action, expected spend |
| < 1x | never | under budget | No action, healthy |
Burn rate → time to exhaust a 30-day budget → recommended action. Higher burn = less time = louder alert.
Read the top row as: "if we keep burning this fast, 2% of the month's budget vanishes in an hour and the whole thing is gone in two days." That is worth a page. The 3x row bleeds slowly enough that a ticket and a fix during working hours is fine. Anything at or below 1x is exactly the spend you planned for, so silence is correct.
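The "Budget spent (window)" column follows from one relation: the fraction of the budget consumed during an alert window is the burn rate times that window's share of the SLO window. A worked check of the three alerting rows, assuming a 30-day (720-hour) SLO window:

```latex
\begin{align*}
\text{budget spent} &= \text{burn rate} \times \frac{\text{alert window}}{\text{SLO window}} \\[4pt]
14.4 \times \frac{1\ \text{h}}{720\ \text{h}} &= 0.02 = 2\% \\
6 \times \frac{6\ \text{h}}{720\ \text{h}} &= 0.05 = 5\% \\
3 \times \frac{24\ \text{h}}{720\ \text{h}} &= 0.10 = 10\%
\end{align*}
```

Read in reverse, this is also how to pick a threshold: decide how much budget you are willing to lose before someone is told, pick the window, and solve for the burn rate.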
Pick windows proportional to the burn rate
Faster tiers use shorter windows (1h long / 5m short) so you catch and recover quickly; slower tiers use longer windows (24h long / 2h short) so a brief spike can't open a ticket. Rule of thumb: the short window is ~1/12 of the long window.
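The 6x row of the table has no window pair spelled out in the text. Applying the 1/12 rule of thumb to the 6-hour window from that row gives a 30-minute short window. The rule below is a sketch of that middle tier, not part of the original two-tier setup: the alert name `ErrorBudgetMediumBurn`, the group name, and the `for: 5m` hold are illustrative assumptions, and the metric and labels are the same ones used in this article's Prometheus rule.

```yaml
groups:
  - name: slo_burn_rate_mid_tier   # hypothetical group name
    rules:
      # ---- MEDIUM BURN: 6x over 6h, confirmed by 30m (1/12 of 6h) ----
      - alert: ErrorBudgetMediumBurn   # hypothetical alert name
        expr: |
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[6h]))
            / sum(rate(http_requests_total{job="api"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[30m]))
            / sum(rate(http_requests_total{job="api"}[30m]))
          ) > (6 * 0.001)
        for: 5m   # assumed hold time, tune to taste
        labels:
          severity: page
        annotations:
          summary: "Medium budget burn, 30d budget gone in ~5 days"
```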
The alert in Prometheus
Here is a complete two-tier, multi-window burn-rate rule for a 99.9% availability SLO. Each alert requires both a long-window and a short-window burn rate to exceed the threshold. The error ratio is written inline at each range so the rule can be read top to bottom on its own.
burn-rate-rules.yaml

```yaml
groups:
  - name: slo_burn_rate
    rules:
      # Helper: error ratio over a range. Repeat per window below.
      # error_ratio = failed requests / total requests

      # ---- FAST BURN: 14.4x over 1h, confirmed by 5m ----
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Fast budget burn, 30d budget gone in ~2 days"

      # ---- SLOW BURN: 3x over 24h, confirmed by 2h ----
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[24h]))
            / sum(rate(http_requests_total{job="api"}[24h]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[2h]))
            / sum(rate(http_requests_total{job="api"}[2h]))
          ) > (3 * 0.001)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Slow budget burn, 30d budget trending to zero in ~10 days"
```
The `0.001` is your error budget (`1 - 0.999`). The multiplier (`14.4`, `3`) is the burn rate from the table. Route `severity: page` to PagerDuty/Opsgenie and `severity: ticket` to a queue. For production, extract the ratio into a recording rule so you are not recomputing the same query at four ranges.
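One way the recording-rule extraction might look is sketched below. The recorded series names (`job:http_request_errors:ratio_rate5m` and so on) and the group names are hypothetical, chosen to follow the common `level:metric:operations` naming pattern. Only the 5m and 1h windows are shown; the 2h and 24h rules would follow the same shape.

```yaml
groups:
  - name: slo_error_ratio_recordings   # hypothetical group name
    rules:
      # Error ratio per job, precomputed once per window.
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{job="api",code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total{job="api"}[5m]))
      - record: job:http_request_errors:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{job="api",code=~"5.."}[1h]))
          /
          sum by (job) (rate(http_requests_total{job="api"}[1h]))

  - name: slo_burn_rate_from_recordings   # hypothetical group name
    rules:
      # Same fast-burn condition, now reading the recorded series.
      - alert: ErrorBudgetFastBurn
        expr: |
          job:http_request_errors:ratio_rate1h{job="api"} > (14.4 * 0.001)
          and
          job:http_request_errors:ratio_rate5m{job="api"} > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Fast budget burn, 30d budget gone in ~2 days"
```

The alert expression shrinks to two series lookups, and dashboards can graph the same recorded series the alert evaluates, so what you see is what pages you.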
Why two windows, not one
A single long window is accurate but slow to react and slow to clear: a 1-hour window keeps firing for nearly an hour after the incident is fixed, because the bad minutes stay inside the range. A single short window is fast but jumpy: a 5-minute window fires on any brief spike and creates exactly the noise you were trying to escape.
Requiring both windows to breach gives you the best of each. The long window confirms the burn is real and sustained, not a flicker. The short window resets the alert quickly: the moment the recent error rate drops, the short window clears and the alert auto-resolves, even though the long window is still elevated. Confirmation from the long side, fast recovery from the short side.
The mental model
Long window = "is this actually a problem?" Short window = "is it still happening right now?" You only page when the answer to both is yes.
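A worked example makes the recovery behaviour concrete. It is a simplified sketch with assumed numbers: traffic is uniform, the error ratio is 20% during a 10-minute incident and zero otherwise, $t$ is minutes since the incident ended, and the `for:` hold and evaluation interval are ignored. The threshold is the fast-burn one, $14.4 \times 0.001 = 0.0144$.

```latex
\begin{align*}
r_{\text{1h}}(t) &= 0.2 \times \frac{\min(10,\ 60 - t)}{60}, \qquad 0 \le t \le 60 \\
r_{\text{5m}}(t) &= 0.2 \times \frac{\max(0,\ 5 - t)}{5} \\[6pt]
r_{\text{1h}}(t) > 0.0144 &\iff 60 - t > 4.32 \iff t < 55.68\ \text{min} \\
r_{\text{5m}}(t) > 0.0144 &\iff 5 - t > 0.36 \iff t < 4.64\ \text{min}
\end{align*}
```

Under these assumptions, the 1-hour window alone would keep the alert firing for about 56 minutes after the fix. With the 5-minute window also required, the combined condition stops being true after roughly 5 minutes.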
Common mistakes that cost hours
Forgetting the short window. Long-window-only alerts fire long after the incident heals, so on-call keeps getting paged for something already fixed.
Using the same window for every severity. An urgent burn that eats 2% of the budget in an hour and a slow 10-day bleed need different time scales. One window can't catch both without being either deaf or hysterical.
Alerting on raw error count instead of a ratio. 50 errors is catastrophic at low traffic and invisible at peak. Always divide by total requests.
Hard-coding the budget in five places. When the SLO changes, you'll miss one. Compute the threshold from a single budget constant or a recording rule.
Setting the SLO to 100%. A zero budget means any single error is an infinite burn rate. Pick a target your system can realistically meet, then alert against the slack.
Paging on slow burn. A 3x burn over 24h is a ticket, not a 3 a.m. wake-up. Reserve the page for budgets that vanish in hours.
Ignoring the denominator going to zero. At very low traffic the ratio gets noisy or undefined. Add a minimum-request guard or widen the window for low-volume services.
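Two of the mistakes above, the hard-coded budget and the low-traffic denominator, can be addressed in the rule file itself. The sketch below is one possible approach, not the only one: the constant series `job:slo_error_budget:ratio`, the group names, and the floor of 1 request per second are all illustrative assumptions to adapt to your service.

```yaml
groups:
  - name: slo_budget_constant   # hypothetical group name
    rules:
      # Single source of truth for the budget (1 - SLO) of job="api".
      - record: job:slo_error_budget:ratio
        expr: vector(0.001)
        labels:
          job: api

  - name: slo_burn_rate_guarded   # hypothetical group name
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum by (job) (rate(http_requests_total{job="api",code=~"5.."}[1h]))
            / sum by (job) (rate(http_requests_total{job="api"}[1h]))
          ) > on (job) (14.4 * job:slo_error_budget:ratio)
          and
          (
            sum by (job) (rate(http_requests_total{job="api",code=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total{job="api"}[5m]))
          ) > on (job) (14.4 * job:slo_error_budget:ratio)
          and
          # Guard: only alert when traffic exceeds an assumed floor of 1 req/s.
          sum by (job) (rate(http_requests_total{job="api"}[5m])) > 1
        for: 2m
        labels:
          severity: page
```

Changing the SLO then means editing one `vector(...)` value, and a handful of failed requests during a quiet period cannot trip the page on their own.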
Takeaways
The whole article in seven lines
Static thresholds have no sense of time: they page on blips and sleep through slow bleeds.
Burn rate = current error rate ÷ error budget. It measures how fast you're spending your safety margin.
Anchor every threshold to a time-to-exhaust: 14.4x ≈ 2 days (page), 3x ≈ 10 days (ticket), 1x = on plan (silence).
Always require a long window AND a short window to breach together.
Long window confirms the burn is real; short window clears the alert fast when it heals.
Fast burn pages a human; slow burn files a ticket for business hours.
Compute thresholds from one budget constant so an SLO change can't leave a stale alert behind.
Where to go next
Burn-rate alerting only works on top of a well-defined budget and a low-noise alerting culture. These pair directly with this article:
Error Budgets Explained: where the budget you're spending actually comes from, and how to negotiate it.
The Four Golden Signals: latency, traffic, errors, and saturation, the signals you build SLOs on.
Practice the PromQL and rule syntax in the YAML lab, then map the full SRE career path to see where SLOs sit in the bigger picture.
Start with one SLO, one budget, and the two-tier rule above. You'll page less and catch more.