SRE · 11 min read · Jun 2026

Error Budgets Explained: Reliability You Can Actually Spend

An error budget is the one number that lets you ship fast without breaking trust. Here is what it is, the exact downtime each SLO buys you, and how to run a policy around it.

Tags: SRE, Error Budget, SLO, Reliability

Sri Balaji

Founder · TheSimplifiedTech

The fight nobody wins

Product wants to ship. Ops wants stability. So every release meeting becomes the same standoff: one side pushing features out the door, the other side hitting the brakes because "what if it breaks?" Both are right, and with no shared number to point at, the loudest voice wins, which is exactly how you end up either shipping nothing or paging someone at 3am.

The error budget ends that argument. It turns reliability into a quantity you can measure, spend, and run out of. Instead of arguing about whether a release is "safe," you look at how much budget is left. If there is room, you ship. If there is not, you stop and fix things. The decision stops being political and starts being arithmetic.

Who this is for

Engineers, SREs, and engineering leads who have heard the words "SLO" and "error budget" but want the concrete version: the real minutes, the policy, and the alert that catches a burn before it becomes an outage. If you have not met SLOs yet, read [SLIs, SLOs & SLAs](/blog/sli-slo-sla-defining-reliability) first; this article picks up right where that one ends.

What an error budget actually is

An error budget is the amount of unreliability you are allowed to spend in a given window. It is the simple complement of your reliability target, and it is the one equation to remember:

error budget = 1 − SLO

If your SLO says "99.9% of requests succeed this month," then 0.1% are *allowed* to fail. That 0.1% is not a failure of the team; it is a deliberate allowance. You spend it on risky deploys, config changes, dependency upgrades, and the occasional genuine incident. Once it is gone, you have spent your reliability for the month and the rules change.
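To make the allowance concrete, here is a minimal Python sketch that turns an SLO into a count of requests that may fail. The helper name `allowed_failures` and the 10-million-request traffic figure are illustrative assumptions, not part of any particular tool.

```python
from fractions import Fraction

def allowed_failures(slo_percent: str, total_requests: int) -> int:
    """Return how many requests may fail in the window without breaching the SLO.

    slo_percent is passed as a string such as "99.9" so the arithmetic stays
    exact (hypothetical helper, for illustration only).
    """
    budget = 1 - Fraction(slo_percent) / 100   # error budget = 1 - SLO
    return int(budget * total_requests)        # round down: no partial requests

# Assumed traffic: 10 million requests in the window.
print(allowed_failures("99.9", 10_000_000))    # 10000
print(allowed_failures("99.99", 10_000_000))   # 1000
```

At the same traffic, moving from three nines to four nines shrinks the allowance from ten thousand failed requests to one thousand.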

The mindset shift is the whole point: 100% is the wrong target. Chasing perfect reliability means never shipping, and your users cannot tell the difference between 99.99% and 100% anyway, because their own wifi drops more often than that. The budget gives you explicit permission to use the gap between your target and perfection.

| Monthly spending budget | Error budget |
| --- | --- |
| Your monthly paycheck | The error budget granted at the start of the window (1 − SLO) |
| Buying something nice | Shipping a risky feature or a big migration |
| An unexpected car repair | An unplanned incident eating into the budget |
| Checking your balance before a big purchase | Checking remaining budget before a deploy |
| Going broke, cutting all spending | Budget exhausted, feature freeze until it recovers |

An error budget behaves exactly like a monthly spending budget.

The error-budget loop

Reliability is not a one-time decision; it is a cycle that resets every window. You start with a full budget, incidents and risky changes draw it down, and a policy gate decides what the team is allowed to do based on what is left. Here is the loop end to end:

[Figure: SLO target (e.g. 99.9%/month) → Error budget (1 − SLO) → Consumption (incidents & bad deploys) → Policy gate (budget left?). Budget remaining: ship features and take more risk. Budget gone: reliability freeze, harden, then back to dev in the next window.]

The error-budget loop: the SLO sets the budget, incidents consume it, and a policy gate routes the team back to features or to reliability work.

  1. Set the SLO

     Pick a reliability target tied to user happiness, e.g. 99.9% of requests succeed over a rolling 30-day window.

  2. Derive the budget

     The budget is whatever is left: 1 − 0.999 = 0.001, i.e. 0.1% of requests, or the equivalent in downtime minutes.

  3. Spend it on change

     Every deploy, migration, and incident draws the budget down. This is expected: the budget exists to be spent.

  4. Check the gate

     Continuously compare consumed vs. remaining. Budget left means you keep shipping; budget gone trips the policy.

  5. Reset and repeat

     When the window rolls forward, old errors age out and the budget refills. The loop starts again.
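The five steps above can be sketched as a small rolling-window tracker. This is a simplified illustration, not a production design: the class name `RollingBudget`, the daily buckets, and the traffic numbers are all assumptions, and consumption is measured against the traffic seen in the window so far.

```python
from collections import deque

class RollingBudget:
    """Track error-budget consumption over a rolling window of daily buckets."""

    def __init__(self, slo: float, window_days: int = 30):
        self.budget = 1.0 - slo                  # step 2: derive the budget
        self.days = deque(maxlen=window_days)    # step 5: the oldest day ages out

    def record_day(self, total: int, failed: int) -> None:
        self.days.append((total, failed))        # step 3: changes draw it down

    def consumed(self) -> float:
        """Fraction of the budget spent so far (1.0 means fully spent)."""
        total = sum(t for t, _ in self.days)
        failed = sum(f for _, f in self.days)
        if total == 0:
            return 0.0
        return (failed / total) / self.budget

    def can_ship(self) -> bool:
        return self.consumed() < 1.0             # step 4: the gate

# A 3-day window keeps the demo short (assumed numbers).
tracker = RollingBudget(slo=0.999, window_days=3)
tracker.record_day(total=100_000, failed=400)    # one bad day overspends the budget
print(tracker.can_ship())                        # False
for _ in range(3):
    tracker.record_day(total=100_000, failed=0)  # clean days push the bad day out
print(tracker.can_ship())                        # True
```

The `deque(maxlen=...)` is what models the refill: once the bad day falls out of the window, its errors stop counting against the budget.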

What each SLO target really costs

"Add another nine" sounds cheap until you see the minutes. Each extra nine cuts your allowed downtime by 10x, and the engineering cost to hold it climbs far faster than that. This table is the one you should screenshot before promising a number to a customer:

| SLO target | Error budget | Downtime / month | Downtime / year |
| --- | --- | --- | --- |
| 99% | 1% | 7.2 hours | 3.65 days |
| 99.9% (three nines) | 0.1% | 43.2 minutes | 8.76 hours |
| 99.95% | 0.05% | 21.6 minutes | 4.38 hours |
| 99.99% (four nines) | 0.01% | 4.32 minutes | 52.6 minutes |

Allowed downtime per SLO target (the error budget, in real time).

Read the jump from 99.9% to 99.99%: your entire monthly margin for error shrinks from 43 minutes to 4 minutes. A single bad deploy plus a rollback can blow four nines in one afternoon. That is why the right target is the *lowest* one your users will tolerate, not the highest one you can brag about.

Multiply it out yourself

A 30-day month is 43,200 minutes. Budget minutes = 43,200 × (1 − SLO). So 99.9% → 43,200 × 0.001 = 43.2 minutes. Swap 43,200 for 525,600 to get the yearly figure.
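If you would rather let a script do the multiplication, the following Python sketch reproduces the arithmetic for the targets in the table. It assumes a 30-day month and a 365-day year, as the text does; `budget_minutes` is a hypothetical helper name.

```python
from fractions import Fraction

MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200 (30-day month)
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 (365-day year)

def budget_minutes(slo_percent: str, window_minutes: int) -> float:
    """Allowed downtime in minutes: window_minutes * (1 - SLO)."""
    budget = 1 - Fraction(slo_percent) / 100   # exact, avoids float drift
    return float(budget * window_minutes)

for slo in ("99", "99.9", "99.95", "99.99"):
    month = budget_minutes(slo, MINUTES_PER_MONTH)
    year = budget_minutes(slo, MINUTES_PER_YEAR)
    print(f"{slo}%: {month:.2f} min/month, {year:.2f} min/year")
```

Dividing the minute figures by 60 (or by 1,440) gives the hours and days shown in the table.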

Running an error-budget policy

The budget is useless without a policy: a written, agreed-upon answer to the question "what happens when we run out?" The policy is what gives the number teeth. It must be decided *before* you are in the red, when nobody is defensive, and signed off by both engineering and product so neither can wriggle out later.

  1. Write the policy before you need it

     Agree the rules in calm times: who owns the budget, what window it covers, and what each threshold triggers. A policy invented mid-incident is just an argument.

  2. Define the spend rule

     While budget remains, the team ships freely. Risky changes are allowed and even encouraged, because unspent budget is wasted opportunity.

  3. Set the freeze trigger

     When the budget hits zero (or a buffer like 10% remaining), feature work stops. All engineering effort redirects to reliability: bug fixes, hardening, paying down the debt that burned the budget.

  4. Make the freeze automatic, not optional

     Wire the freeze to the dashboard, not to a manager's mood. "Budget exhausted" should block the release pipeline or flip a flag, with no debate and no exceptions for the VP's pet feature.

  5. Run the postmortem and lift the freeze

     When the window recovers the budget (or the team ships the hardening work), exit the freeze. Feed what you learned back into the next window's risk decisions.
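One way to make the freeze automatic is to encode the spend rule and the freeze trigger as a function the release pipeline calls. The sketch below is an assumption about how such a gate might look; `release_gate`, `Action`, and the way the remaining budget reaches the function are hypothetical.

```python
from enum import Enum

class Action(Enum):
    SHIP = "ship"
    FREEZE = "freeze"

def release_gate(budget_remaining: float, freeze_buffer: float = 0.10) -> Action:
    """Decide what the pipeline allows, given the fraction of budget left (0.0 to 1.0).

    freeze_buffer models the "10% remaining" safety margin from the freeze
    trigger; set it to 0.0 to freeze only when the budget is fully spent.
    """
    if budget_remaining <= freeze_buffer:
        return Action.FREEZE
    return Action.SHIP

print(release_gate(0.42).value)   # ship
print(release_gate(0.08).value)   # freeze
```

A CI step could exit non-zero whenever the result is `Action.FREEZE`, which turns the written policy into something the pipeline enforces rather than something a meeting debates.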

A budget with no consequence is just a chart

The single most common failure is having an error budget that nobody enforces. If blowing the budget changes nothing about what the team does next week, you do not have an error budget; you have a vanity metric.

Burn-rate alerts: catching it early

Waiting until the budget is empty to react is like noticing you are broke only when the card declines. Burn rate measures how fast you are spending: it is the observed error rate divided by the error budget. A burn rate of 1 means you will use exactly 100% of the budget by the end of the window, which is sustainable. A burn rate of 10 means you will exhaust the whole month's budget in about three days.
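The relationship between error rate, burn rate, and time to exhaustion is simple enough to sketch in a few lines of Python. The function names are illustrative, and the calculation assumes the burn rate stays constant and the window starts with a full budget.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / (1.0 - slo)

def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if the burn rate holds steady."""
    if rate <= 0:
        return float("inf")   # no errors, the budget is never spent
    return window_days / rate

print(days_to_exhaustion(1.0))    # 30.0
print(days_to_exhaustion(10.0))   # 3.0

# A 1% error rate against a 99.9% SLO burns roughly 10x too fast.
print(round(burn_rate(0.01, 0.999), 2))
```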

Good alerting fires on *fast* burn (a sudden spike: wake someone up) and *slow* burn (a steady leak: a ticket, not a page). The pattern below is the multi-window, multi-burn-rate alert popularized by Google's SRE workbook: page only when both a long and a short window agree, so a brief blip does not wake anyone.

burn-rate-alerts.yml

```yaml
# Prometheus alerting rules for a 99.9% SLO
# (error budget = 0.1%; burn rate = error_rate / 0.001)
# Assumes recording rules named job:slo_errors:ratio_rate<window> are defined
# elsewhere, each giving the failed-request ratio over that window.
groups:
  - name: error-budget-burn
    rules:
      # FAST BURN: 14.4x rate exhausts a 30d budget in ~2 days -> page
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            job:slo_errors:ratio_rate1h{job="api"} > (14.4 * 0.001)
          and
            job:slo_errors:ratio_rate5m{job="api"} > (14.4 * 0.001)
          )
        labels:
          severity: page
        annotations:
          summary: "Burning error budget 14.4x too fast"
          runbook: "https://runbooks.example.com/error-budget"

      # SLOW BURN: 3x rate, steady leak -> ticket, not a 3am page
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            job:slo_errors:ratio_rate6h{job="api"} > (3 * 0.001)
          and
            job:slo_errors:ratio_rate30m{job="api"} > (3 * 0.001)
          )
        labels:
          severity: ticket
        annotations:
          summary: "Steady error-budget leak at 3x"
```

The two-window `and` is the trick: the long window (1h/6h) confirms the problem is sustained, and the short window (5m/30m) confirms it is *still happening right now*, so the alert resolves quickly once you fix it. One window alone either pages on noise or recovers too slowly.
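To see where a factor like 14.4 comes from and how the two-window condition behaves, here is a small Python sketch. It models the logic only and is not how Prometheus evaluates rules; the function names and sample error rates are assumptions. The 14.4 factor corresponds to spending 2% of a 30-day budget within one hour, and by the same arithmetic the 3x rule over six hours corresponds to spending 2.5%.

```python
def burn_factor(budget_fraction: float, alert_window_hours: float,
                slo_window_hours: float = 30 * 24) -> float:
    """Burn rate that spends budget_fraction of the budget within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours

def should_fire(long_rate: float, short_rate: float, factor: float,
                budget: float = 0.001) -> bool:
    """Mirror of the two-window `and`: both error rates must exceed factor * budget."""
    threshold = factor * budget
    return long_rate > threshold and short_rate > threshold

print(round(burn_factor(0.02, 1), 1))    # 14.4  (2% of the budget in 1 hour)
print(round(burn_factor(0.025, 6), 1))   # 3.0   (2.5% of the budget in 6 hours)

# Sustained and still happening: fire.
print(should_fire(long_rate=0.02, short_rate=0.03, factor=14.4))     # True
# Long window still elevated, but the short window has recovered: resolve.
print(should_fire(long_rate=0.02, short_rate=0.0005, factor=14.4))   # False
# Brief blip in the short window only: stay quiet.
print(should_fire(long_rate=0.001, short_rate=0.05, factor=14.4))    # False
```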

Common mistakes that cost hours

  1. Targeting 100% reliability. There is no budget at 100%: every change is a violation, so the team either freezes forever or ignores the budget entirely. Pick a target with room to ship.
  2. Setting the SLO from infrastructure, not users. A budget based on CPU or pod restarts measures the wrong thing. Tie it to a user-facing SLI (successful requests, fast responses), or the budget protects nothing users feel.
  3. No written policy. Without an agreed consequence, "budget exhausted" is a number on a dashboard that everyone scrolls past. Decide the freeze rule in advance and get product to sign it.
  4. Alerting on raw error rate instead of burn rate. "Errors > 5" pages on every blip and misses slow leaks. Alert on how fast you are spending the budget, with multi-window confirmation.
  5. Making exceptions for big launches. The first time leadership overrides a freeze for a flagship feature, the budget is dead, because everyone learns it is negotiable. The whole value is that it is not.
  6. Treating a blown budget as blame. The budget is a thermostat, not a courtroom. Pair it with blameless postmortems so a freeze drives learning, not finger-pointing.

Takeaways

The whole article in six lines

  • Error budget = 1 − SLO. It is the unreliability you are *allowed* to spend each window.
  • It ends the ship-vs-stability fight: budget left → ship; budget gone → freeze and harden.
  • Know the minutes: 99% = 7.2h/mo, 99.9% = 43.2m/mo, 99.95% = 21.6m/mo, 99.99% = 4.32m/mo.
  • A budget with no enforced policy is a vanity metric. Write the freeze rule before you need it.
  • Alert on burn rate (how fast you spend), not raw errors, and use multiple windows to dodge false pages.
  • Pick the lowest target users tolerate; each extra nine cuts the budget 10x and costs far more effort.

Where to go next

The error budget is the third piece of the reliability core. You now have the number, the policy, and the alert. Next is connecting it to how your team responds when the budget burns, and practicing the underlying skills hands-on.

  • Start upstream: [SLIs, SLOs & SLAs](/blog/sli-slo-sla-defining-reliability). The budget is only as good as the SLO it comes from.
  • Close the loop: Blameless Postmortems turn a burned budget into durable learning instead of blame.
  • Practice the ops muscle: the kubectl lab lets you inspect, roll back, and debug the workloads whose failures eat the budget.
  • See the bigger map: the SRE career path sequences reliability, observability, and incident response end to end.
