Error Budgets Explained: Reliability You Can Actually Spend

On this page

The fight nobody wins
What an error budget actually is
The error-budget loop
What each SLO target really costs
Running an error-budget policy
Burn-rate alerts: catching it early
Common mistakes that cost hours
Takeaways
Where to go next

TL;DR

An error budget is the single number that lets you ship fast without breaking trust. You get the exact downtime each SLO target buys, how to run a policy around it, and how burn-rate alerts catch trouble early.

The fight nobody wins

Product wants to ship. Ops wants stability. So every release meeting becomes the same standoff: one side pushing features out the door, the other side hitting the brakes because "what if it breaks?" Both are right, and with no shared number to point at, the loudest voice wins, which is exactly how you end up either shipping nothing or paging someone at 3am.

The error budget ends that argument. It turns reliability into a quantity you can measure, spend, and run out of. Instead of arguing about whether a release is "safe," you look at how much budget is left. If there is room, you ship. If there is not, you stop and fix things. The decision stops being political and starts being arithmetic.

Who this is for

Engineers, SREs, and engineering leads who have heard the words "SLO" and "error budget" but want the concrete version: the real minutes, the policy, and the alert that catches a burn before it becomes an outage. If you have not met SLOs yet, read SLIs, SLOs & SLAs first; this article picks up right where that one ends.

What an error budget actually is

An error budget is the amount of unreliability you are allowed to spend in a given window. It is the simple complement of your reliability target.

The one equation to remember: error budget = 1 − SLO

If your SLO says "99.9% of requests succeed this month," then 0.1% are *allowed* to fail. That 0.1% is not a failure of the team; it is a deliberate allowance. You spend it on risky deploys, config changes, dependency upgrades, and the occasional genuine incident. Once it is gone, you have spent your reliability for the month and the rules change.
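
To make the arithmetic concrete, here is a minimal Python sketch of the equation. The function names `error_budget` and `allowed_failures` are hypothetical, chosen for illustration, and the request counts in the usage note are made-up inputs rather than figures from any real service.

```python
def error_budget(slo: float) -> float:
    """Return the error budget as a fraction: 1 - SLO."""
    # Assumption: the SLO is given as a fraction (0.999), not a percent (99.9).
    if not 0.0 < slo < 1.0:
        raise ValueError("SLO must be a fraction strictly between 0 and 1")
    return 1.0 - slo


def allowed_failures(slo: float, total_requests: int) -> int:
    """Requests that may fail in the window, rounded to the nearest whole request."""
    # round() absorbs tiny floating-point error in 1 - slo.
    return round(total_requests * error_budget(slo))
```

For example, `allowed_failures(0.999, 10_000_000)` works out to 10,000 requests: a service handling ten million requests in the window can fail ten thousand of them and still meet a 99.9% SLO.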

The mindset shift is the whole point: 100% is the wrong target. Chasing perfect reliability means never shipping, and your users cannot tell the difference between 99.99% and 100% anyway, because their own Wi-Fi drops more often than that. The budget gives you explicit permission to use the gap between your target and perfection.

Spending budget	Error budget
Your monthly paycheck	The error budget granted at the start of the window (1 − SLO)
Buying something nice	Shipping a risky feature or a big migration
An unexpected car repair	An unplanned incident eating into the budget
Checking your balance before a big purchase	Checking remaining budget before a deploy
Going broke, cutting all spending	Budget exhausted, feature freeze until it recovers

An error budget behaves exactly like a monthly spending budget.

The error-budget loop

Reliability is not a one-time decision; it is a cycle that resets every window. You start with a full budget, incidents and risky changes draw it down, and a policy gate decides what the team is allowed to do based on what is left. Here is the loop end to end:

The error-budget loop: the SLO sets the budget, incidents consume it, and a policy gate routes the team back to features or to reliability work.

1. Set the SLO. Pick a reliability target tied to user happiness, e.g. 99.9% of requests succeed over a rolling 30-day window.
2. Derive the budget. The budget is whatever is left: 1 − 0.999 = 0.1% of requests, or the equivalent in downtime minutes.
3. Spend it on change. Every deploy, migration, and incident draws the budget down. This is expected; the budget exists to be spent.
4. Check the gate. Continuously compare consumed vs. remaining. Budget left means you keep shipping; budget gone trips the policy.
5. Reset and repeat. When the window rolls forward, old errors age out and the budget refills. The loop starts again.

What each SLO target really costs

"Add another nine" sounds cheap until you see the minutes. Each extra nine cuts your allowed downtime by 10x, and the engineering cost to hold it climbs far faster than that. This table is the one you should screenshot before promising a number to a customer:

SLO target	Error budget	Downtime / month	Downtime / year
99%	1%	7.2 hours	3.65 days
99.9% (three nines)	0.1%	43.2 minutes	8.76 hours
99.95%	0.05%	21.6 minutes	4.38 hours
99.99% (four nines)	0.01%	4.32 minutes	52.6 minutes

Allowed downtime per SLO target (the error budget, in real time).

Read the jump from 99.9% to 99.99%: your entire monthly margin for error shrinks from 43 minutes to 4 minutes. A single bad deploy plus a rollback can blow four nines in one afternoon. That is why the right target is the *lowest* one your users will tolerate, not the highest one you can brag about.

Multiply it out yourself

A 30-day month is 43,200 minutes. Budget minutes = 43,200 × (1 − SLO). So 99.9% → 43,200 × 0.001 = 43.2 minutes. Swap 43,200 for 525,600 to get the yearly figure.
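
The same multiplication as a small Python sketch, in case you want to check other targets. The helper name `allowed_downtime_minutes` and the constants are hypothetical; the formula is the one stated above, with a 30-day month and a 365-day year assumed.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60   # 43,200
MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600


def allowed_downtime_minutes(
    slo: float, window_minutes: int = MINUTES_PER_30_DAY_MONTH
) -> float:
    """Budget minutes = window minutes x (1 - SLO), with SLO as a fraction."""
    return window_minutes * (1.0 - slo)


if __name__ == "__main__":
    # Print monthly and yearly allowances for the targets in the table.
    for target in (0.99, 0.999, 0.9995, 0.9999):
        monthly = allowed_downtime_minutes(target)
        yearly = allowed_downtime_minutes(target, MINUTES_PER_YEAR)
        print(f"{target:.2%}: {monthly:.2f} min/month, {yearly / 60:.2f} h/year")
```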

Running an error-budget policy

The budget is useless without a policy: a written, agreed-upon answer to the question "what happens when we run out?" The policy is what gives the number teeth. It must be decided *before* you are in the red, when nobody is defensive, and signed off by both engineering and product so neither can wriggle out later.

1. Write the policy before you need it. Agree the rules in calm times: who owns the budget, what window it covers, and what each threshold triggers. A policy invented mid-incident is just an argument.
2. Define the spend rule. While budget remains, the team ships freely. Risky changes are allowed and even encouraged, because unspent budget is wasted opportunity.
3. Set the freeze trigger. When the budget hits zero (or a buffer like 10% remaining), feature work stops. All engineering effort redirects to reliability: bug fixes, hardening, paying down the debt that burned the budget.
4. Make the freeze automatic, not optional. Wire the freeze to the dashboard, not to a manager's mood. "Budget exhausted" should block the release pipeline or flip a flag, with no debate and no exceptions for the VP's pet feature.
5. Run the postmortem and lift the freeze. When the window recovers the budget (or the team ships the hardening work), exit the freeze. Feed what you learned back into the next window's risk decisions.
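
One way to make steps 3 and 4 mechanical is a small gate the release pipeline calls before deploying. The sketch below is hypothetical: `Decision`, `release_decision`, and `gate_exit_code` are illustrative names, and where the remaining-budget figure comes from (a dashboard, a metrics query) is left as an assumption.

```python
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    FREEZE = "freeze"


def release_decision(remaining_fraction: float, freeze_buffer: float = 0.10) -> Decision:
    """Freeze when the remaining budget is at or below the agreed buffer.

    remaining_fraction: share of the window's budget still unspent (1.0 = untouched).
    freeze_buffer: the policy's trigger, e.g. 0.10 for "10% remaining", 0.0 for "zero".
    """
    if remaining_fraction <= freeze_buffer:
        return Decision.FREEZE
    return Decision.SHIP


def gate_exit_code(remaining_fraction: float, freeze_buffer: float = 0.10) -> int:
    """Map the decision to a process exit code: nonzero blocks the pipeline step."""
    return 0 if release_decision(remaining_fraction, freeze_buffer) is Decision.SHIP else 1
```

A pipeline step could call `sys.exit(gate_exit_code(remaining))`, so an exhausted budget fails the job without anyone having to argue for it. Keeping the threshold as a parameter means the number written in the policy is the number the gate enforces.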

A budget with no consequence is just a chart

The single most common failure is having an error budget that nobody enforces. If blowing the budget changes nothing about what the team does next week, you do not have an error budget; you have a vanity metric.

Burn-rate alerts: catching it early

Waiting until the budget is empty to react is like noticing you are broke only when the card declines. Burn rate measures how fast you are spending. A burn rate of 1 means you will use exactly 100% of the budget by the end of the window, which is sustainable. A burn rate of 10 means you will exhaust the whole 30-day budget in about three days.
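
As a quick sketch of that arithmetic in Python: burn rate is the observed error rate divided by the budget, and time to exhaustion is the window length divided by the burn rate. The function names are hypothetical, and a 30-day window is assumed by default.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent.

    error_rate and slo are fractions, e.g. 0.01 and 0.999.
    """
    return error_rate / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")  # nothing is being spent
    return window_days / rate
```

For a 99.9% SLO, a 1% error rate is a burn rate of about 10, and `days_to_exhaustion(10)` gives 3 days, matching the figure above.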

Good alerting fires on *fast* burn (a sudden spike, wake someone up) and *slow* burn (a steady leak, a ticket, not a page). The pattern below is the multi-window, multi-burn-rate alert popularized by Google's SRE workbook: page only when both a long and a short window agree, so a brief blip does not wake anyone.

burn-rate-alerts.yml

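# Prometheus alerting rules for a 99.9% SLO
# (error budget = 0.1%; burn rate = error_rate / 0.001)
# Assumes recording rules named job:slo_errors:ratio_rate<window> already
# exist and hold the error ratio (0 to 1) over each window.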
groups:
  - name: error-budget-burn
    rules:
      # FAST BURN: 14.4x rate exhausts a 30d budget in ~2 days -> page
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            job:slo_errors:ratio_rate1h{job="api"}  > (14.4 * 0.001)
          and
            job:slo_errors:ratio_rate5m{job="api"}  > (14.4 * 0.001)
          )
        labels:
          severity: page
        annotations:
          summary: "Burning error budget 14.4x too fast"
          runbook: "https://runbooks.example.com/error-budget"

      # SLOW BURN: 3x rate, steady leak -> ticket, not a 3am page
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            job:slo_errors:ratio_rate6h{job="api"}  > (3 * 0.001)
          and
            job:slo_errors:ratio_rate30m{job="api"} > (3 * 0.001)
          )
        labels:
          severity: ticket
        annotations:
          summary: "Steady error-budget leak at 3x"

The two-window "and" is the trick: the long window (1h/6h) confirms the problem is sustained, and the short window (5m/30m) confirms it is *still happening right now*, so the alert resolves quickly once you fix it. One window alone either pages on noise or recovers too slowly.
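
The same condition, stripped of Prometheus syntax, is easy to reason about as a plain function. This Python sketch is illustrative only: `should_alert` is a hypothetical name, and the two error rates are assumed to be precomputed ratios over the long and short windows.

```python
def should_alert(
    long_window_error_rate: float,
    short_window_error_rate: float,
    burn_rate_threshold: float,
    slo: float = 0.999,
) -> bool:
    """Fire only when BOTH windows exceed the burn-rate threshold."""
    # Convert the burn-rate multiple into an error-rate threshold,
    # e.g. 14.4 * 0.001 = 0.0144 for a 99.9% SLO.
    threshold = burn_rate_threshold * (1.0 - slo)
    sustained = long_window_error_rate > threshold        # the problem is real
    still_happening = short_window_error_rate > threshold  # and not already fixed
    return sustained and still_happening
```

With a 14.4x threshold, a 2% error rate in both windows fires. A short spike that has not moved the long window does not fire, and neither does a long window that is still elevated after the short window has recovered.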

Common mistakes that cost hours

1. Targeting 100% reliability. There is no budget at 100%. Every change is a violation, so the team either freezes forever or ignores the budget entirely. Pick a target with room to ship.
2. Setting the SLO from infrastructure, not users. A budget based on CPU or pod restarts measures the wrong thing. Tie it to a user-facing SLI (successful requests, fast responses), or the budget protects nothing users feel.
3. No written policy. Without an agreed consequence, "budget exhausted" is a number on a dashboard that everyone scrolls past. Decide the freeze rule in advance and get product to sign it.
4. Alerting on raw error rate instead of burn rate. "Errors > 5" pages on every blip and misses slow leaks. Alert on how fast you are spending the budget, with multi-window confirmation.
5. Making exceptions for big launches. The first time leadership overrides a freeze for a flagship feature, the budget is dead, because everyone learns it is negotiable. The whole value is that it is not.
6. Treating a blown budget as blame. The budget is a thermostat, not a courtroom. Pair it with blameless postmortems so a freeze drives learning, not finger-pointing.

Takeaways

The whole article in six lines

Error budget = 1 − SLO. It is the unreliability you are *allowed* to spend each window.
It ends the ship-vs-stability fight: budget left → ship; budget gone → freeze and harden.
Know the minutes: 99% = 7.2h/mo, 99.9% = 43.2m/mo, 99.95% = 21.6m/mo, 99.99% = 4.32m/mo.
A budget with no enforced policy is a vanity metric; write the freeze rule before you need it.
Alert on burn rate (how fast you spend), not raw errors, and use multiple windows to dodge false pages.
Pick the lowest target users tolerate; each extra nine costs 10x the budget and far more effort.

Where to go next

The error budget is the third piece of the reliability core. You now have the number, the policy, and the alert. Next is connecting them to how your team responds when the budget burns, and practicing the underlying skills hands-on.

Start upstream: SLIs, SLOs & SLAs. The budget is only as good as the SLO it comes from.
Close the loop: Blameless Postmortems turn a burned budget into durable learning instead of blame.
Practice the ops muscle: the kubectl lab lets you inspect, roll back, and debug the workloads whose failures eat the budget.
See the bigger map: the SRE career path sequences reliability, observability, and incident response end to end.

Frequently asked questions

What exactly is an error budget?

It is the amount of unreliability you are allowed to spend in a given window, and it is simply the complement of your reliability target: error budget equals 1 minus the SLO. If your SLO is 99.9%, then 0.1% of requests are allowed to fail.

How does an error budget settle the fight between shipping and stability?

It turns the argument into arithmetic. Instead of debating whether a release is safe, the team checks how much budget is left: if there is room, ship; if it is gone, stop and fix things. The decision stops being political.

How much downtime does each extra nine of reliability cost?

Each extra nine cuts your allowed downtime by about 10x. Going from 99.9% to 99.99% shrinks your monthly margin for error from roughly 43 minutes down to about 4 minutes, which a single bad deploy plus a rollback can blow in one afternoon.

What is a burn-rate alert?

It is an alert that watches how fast you are consuming the error budget so you can catch a burn early, before it becomes a full outage. It lets you react to a fast drawdown rather than only noticing once the budget is already gone.
