A comprehensive guide to defining Service Level Indicators (SLIs), setting Service Level Objectives (SLOs), and using error budgets to balance reliability with innovation velocity. Covers the math of nines, SLO-based alerting, real FAANG examples, common anti-patterns, and a step-by-step implementation guide from zero to production.
Every engineering organization eventually faces the same fundamental tension: product teams want to ship features as fast as possible, and operations teams want the system to be as stable as possible. Without a shared framework for discussing reliability, this tension degrades into an unproductive argument. Product managers say "why are we not shipping?" and engineers say "because the system is unstable." Neither side has data to support their position, and the loudest voice in the room wins.
SLIs, SLOs, and error budgets are the framework that resolves this tension permanently. They provide a shared, data-driven language for discussing reliability that every role in the organization can understand. An SLI is a metric that captures what users actually experience. An SLO is a target for that metric that represents the minimum acceptable reliability. An error budget is the amount of unreliability you are willing to tolerate — calculated as 100% minus the SLO target. Together, these three concepts create an objective arbiter that replaces subjective debates about when to ship and when to stabilize.
The Three Pillars in One Sentence
An SLI measures what users experience (the metric), an SLO sets the bar for that experience (the target), and the error budget is the gap between perfection and that bar (the tolerance). Together, they turn reliability from a vague aspiration into a measurable, manageable engineering practice.
At Google, SLOs are not optional guidelines. They are formal contracts between SRE teams and product teams. Every Tier-1 service (Search, Gmail, YouTube, Cloud) has published SLOs that are reviewed quarterly. When a service exhausts its error budget, the consequences are real: feature launches are delayed, deployment freezes are enforced, and engineering effort is redirected from feature development to reliability work. This is not punitive — it is a rational response to evidence that the system cannot handle more change right now.
Why SLOs matter at every level of the organization
SLOs vs SLAs: A Critical Distinction
An SLO is an internal target set by the engineering team. An SLA is an external contractual obligation to customers, often with financial penalties for violations. Your SLO should always be stricter than your SLA. If your SLA guarantees 99.9% availability, your internal SLO should target 99.95% to give yourself a buffer. If you set your SLO equal to your SLA, you will not know you are in trouble until you are already violating the customer contract.
The concept of error budgets was arguably Google SRE's most important contribution to the industry. Before error budgets, reliability was binary: either the system was "up" or it was "down." There was no nuance, no way to quantify how much risk was acceptable. Error budgets introduced a fundamental shift in thinking: reliability is not about preventing all failures. It is about ensuring that the rate and duration of failures stays within an acceptable range, measured from the user's perspective.
The Bank Account Analogy
Think of your error budget as a bank account. Every outage, latency spike, or error response is a withdrawal. The SLO sets the minimum balance you must maintain. When your balance is high (budget remaining), you can take risks — ship fast, experiment, migrate infrastructure. When your balance is low (budget nearly exhausted), you must stop spending and focus on deposits (reliability improvements). When you are overdrawn (budget exhausted), all discretionary spending stops (deployment freeze) until the account replenishes at the start of the next window.
A Service Level Indicator is a carefully defined quantitative measure of some aspect of the level of service that is provided. The key word is "carefully" — a poorly chosen SLI will lead you to optimize for the wrong thing. The fundamental rule of SLI selection is: measure what the user experiences, not what the infrastructure reports. CPU utilization, disk I/O, and network bandwidth are useful operational metrics, but they are not SLIs because they do not directly describe the user experience.
At Google, SLI selection starts with the user journey. For Google Search, the user journey is: type a query, press Enter, see results. The SLIs that matter are: did the search return results (availability), how fast did the results appear (latency), and were the results relevant (correctness/quality). Nobody cares about the CPU utilization of the search indexing cluster — that is an implementation detail. Users care about whether search works, how fast it works, and whether the results are good.
The five categories of SLIs
The SLI Specification vs SLI Implementation
Google distinguishes between the SLI specification (what you want to measure, described in terms of user experience) and the SLI implementation (how you actually measure it, with specific metrics and data sources). The specification is: "proportion of homepage requests that load successfully in under 1 second." The implementation is: "count of requests where HTTP status is 200 AND time-to-first-byte is under 1000ms, divided by total request count, measured at the load balancer." Separating these lets you refine the implementation without changing what you are fundamentally measuring.
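The SLI implementation described above can be sketched as a small function. This is a minimal illustration, not a production pipeline; the record fields (`status`, `ttfb_ms`) are hypothetical stand-ins for whatever your load balancer logs actually expose.

```python
# A sketch of the SLI implementation above: count "good" events
# (HTTP 200 AND time-to-first-byte under 1000 ms) over total events.
# The request-record fields here are hypothetical.

def homepage_sli(requests):
    """Proportion of homepage requests that are 'good' events."""
    if not requests:
        return None  # no traffic: the SLI is undefined, not 0% or 100%
    good = sum(1 for r in requests
               if r["status"] == 200 and r["ttfb_ms"] < 1000)
    return good / len(requests)

requests = [
    {"status": 200, "ttfb_ms": 120},   # good
    {"status": 200, "ttfb_ms": 1500},  # too slow -> bad
    {"status": 500, "ttfb_ms": 80},    # error -> bad
    {"status": 200, "ttfb_ms": 640},   # good
]
print(homepage_sli(requests))  # 0.5
```

Because the specification and implementation are separate, you could later swap the data source (say, from load-balancer logs to client-side telemetry) without changing what the function fundamentally computes.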
| Service Type | Primary SLI | Measurement Point | Why This SLI |
|---|---|---|---|
| User-facing web app | Availability + Latency (p99) | Load balancer access logs | Users perceive the app as "working" when pages load fast without errors |
| API service | Availability + Latency (p50, p99) | API gateway metrics | API consumers care about success rate and consistent response times |
| Data pipeline | Freshness + Correctness | Pipeline output timestamp vs source timestamp | Consumers of pipeline data care about recency and accuracy |
| Storage system | Durability + Availability | Object retrieval success rate, bit-rot detection | Data must not be lost (durability) and must be accessible (availability) |
| Streaming service | Availability + Startup latency + Rebuffer ratio | Client-side telemetry | Users experience buffering and startup delay directly |
Common SLI Mistake: Measuring at the Wrong Point
If you measure availability at the application server but your load balancer is dropping connections, your SLI will show 100% availability while users experience errors. Always measure SLIs as close to the user as possible. The ideal measurement point is client-side telemetry (Real User Monitoring). The next best is the load balancer or API gateway. The worst is the application server itself, because it cannot see failures that happen before requests reach it.
When choosing SLIs, resist the temptation to measure everything. A service should have 3 to 5 SLIs maximum. Each SLI should represent a distinct dimension of user experience. If you have 20 SLIs, you have a dashboard, not an SLO framework. The cognitive overhead of tracking too many indicators defeats the purpose of having clear signals. At Google, most services have exactly three SLIs: availability, latency, and one domain-specific indicator (correctness for financial services, freshness for search, throughput for batch systems).
Latency: Always Use Percentiles, Never Averages
Average latency is a misleading metric. If 99% of requests complete in 50ms and 1% take 10 seconds, the average is 149ms — which describes nobody's actual experience. Use percentiles: p50 (median, typical experience), p90 (the experience of the slowest 10%), p99 (the tail, where the worst user experiences hide). At Google, SLOs are typically set on p50 and p99 independently. A service might target p50 < 100ms AND p99 < 500ms.
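The arithmetic above is easy to verify. The sketch below uses synthetic latencies and a simple nearest-rank percentile; real systems compute percentiles from histogram metrics, not raw sample lists.

```python
# Demonstrates why averages mislead: 99% of requests at 50 ms plus
# 1% at 10 s yields an average near 149 ms that matches nobody's
# actual experience. Synthetic data; a sketch, not a metrics library.

def percentile(values, p):
    """Nearest-rank percentile (p in 1-100) of a list of samples."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil without math.ceil
    return ordered[int(rank) - 1]

latencies = [50.0] * 99 + [10_000.0]  # 99 fast requests, 1 pathological one
avg = sum(latencies) / len(latencies)

print(avg)                         # 149.5 -> describes nobody
print(percentile(latencies, 50))   # 50.0  -> the typical user
print(percentile(latencies, 99))   # 50.0  -> still fast at p99
print(percentile(latencies, 100))  # 10000.0 -> the tail outlier
```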
A Service Level Objective is a target value or range of values for a service level measured by an SLI. In plain terms, an SLO says: "We commit to keeping this SLI above this threshold over this time window." For example: "99.9% of API requests will return successfully within 200ms, measured over a rolling 30-day window." The SLO has three components: the SLI (success rate under 200ms), the target (99.9%), and the measurement window (rolling 30 days).
Setting the right SLO target is one of the hardest decisions in SRE. Too high, and your team will spend all its time firefighting to maintain an unnecessarily strict target. Too low, and users will leave because the service is unreliable. The target should reflect the point at which users are satisfied — not the point at which the system is technically perfect.
How to Set Your First SLO
Start with data. Measure your current performance over the past 30 to 90 days. If your service has been returning 99.95% successful responses, do not set an SLO of 99.99% — you are not meeting that today. Set your initial SLO slightly below your current performance (e.g., 99.9%) to give yourself a real error budget. Then tighten the SLO incrementally as you improve the system. An aspirational SLO that you never meet is worse than no SLO at all.
| SLO Target | Monthly Downtime Allowed | Annual Downtime Allowed | Typical Use Case | Engineering Cost |
|---|---|---|---|---|
| 99% (two nines) | 7.31 hours/month | 3.65 days/year | Internal tools, batch jobs, dev environments | Low — basic monitoring, manual failover |
| 99.9% (three nines) | 43.2 minutes/month | 8.77 hours/year | Standard web apps, most SaaS products, Google Search | Moderate — automated deployment, health checks, basic redundancy |
| 99.95% | 21.6 minutes/month | 4.38 hours/year | E-commerce checkout, business-critical APIs | Moderate-High — multi-AZ deployment, automated failover, SLO alerting |
| 99.99% (four nines) | 4.38 minutes/month | 52.6 minutes/year | Payment processing (Stripe), cloud platform APIs, Netflix streaming | High — multi-region, chaos engineering, sophisticated SLO-based alerting |
| 99.999% (five nines) | 26.3 seconds/month | 5.26 minutes/year | Emergency services (911), air traffic control, core banking ledgers | Very High — active-active multi-region, formal verification, dedicated SRE teams |
The cost of each additional nine is not linear — it is roughly exponential. Moving from 99.9% to 99.99% does not cost 10% more engineering effort; it costs approximately 10x more. This is because each additional nine requires eliminating an entirely new class of failures. At 99.9%, you can tolerate occasional server crashes and network blips. At 99.99%, you must also handle datacenter-level failures, slow database queries, and edge-case race conditions. At 99.999%, you must handle correlated failures, software bugs in your reliability infrastructure, and failures in your failure-handling code.
The Hidden Cost of Over-Reliability
If your SLO is higher than your users need, you are wasting engineering effort. A mobile app with a 3-second page load time does not benefit from 99.999% server-side availability because the cellular network introduces more downtime than your SLO permits. Users on unreliable networks will not notice the difference between 99.99% and 99.999% because their experience is dominated by network variability. Always consider the end-to-end user experience, including factors outside your control.
SLO document template — what every SLO must specify
1. Service name and owner: Which service, which team owns it, who reviews the SLO quarterly.
2. SLI definition: The exact metric, measurement point, and what constitutes a "good" event vs a "bad" event.
3. SLO target and window: The target percentage and the measurement window (rolling 30 days is most common). Example: 99.9% of requests succeed over a rolling 30-day window.
4. Error budget policy: What happens when the error budget is exhausted — deployment freeze, mandatory reliability sprint, escalation to leadership.
5. Exclusions: What events are excluded from SLO calculations — planned maintenance windows, load-shedding responses (429s), dependency failures that are explicitly out of scope.
6. Review cadence: How often the SLO is reviewed (quarterly at Google). SLOs should be tightened as the system matures and loosened if they are causing unnecessary toil.
Rolling Windows vs Calendar Windows
A rolling 30-day window is preferred over a calendar month because it avoids the "beginning of month" problem. With a calendar window, teams burn their error budget in the first week and then operate in panic mode for three weeks. With a rolling window, every day an old day drops off and a new day is added, providing a continuous and stable view of reliability. Google uses rolling windows for all Tier-1 SLOs.
The error budget is the single most powerful concept in the SLO framework. It answers a question that has plagued engineering organizations for decades: "How do we decide when to ship features versus when to work on reliability?" The error budget provides an objective, mathematical answer. Your error budget is 100% minus your SLO target. If your SLO is 99.9%, your error budget is 0.1%. Over a 30-day rolling window, 0.1% of total time is approximately 43.2 minutes. That is how much downtime, degradation, or errors you can tolerate before the SLO is violated.
Error Budget Calculation
Error Budget = 100% - SLO Target. For a 99.9% SLO over 30 days: budget = 0.1% of 30 days = 0.001 * 30 * 24 * 60 = 43.2 minutes. For a 99.99% SLO over 30 days: budget = 0.01% of 30 days = 0.0001 * 30 * 24 * 60 = 4.32 minutes. That is the difference between "we can tolerate a 40-minute outage this month" and "we cannot tolerate more than a 4-minute outage this month."
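The arithmetic above, as a tiny helper. This is a sketch for intuition; any monitoring stack computes this for you.

```python
# Error budget in minutes: (100% - SLO target) of the window.

def error_budget_minutes(slo_percent, window_days=30):
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60

print(error_budget_minutes(99.9))    # 43.2 minutes
print(error_budget_minutes(99.99))   # ~4.32 minutes
print(error_budget_minutes(99.999))  # ~0.43 minutes (~26 seconds)
```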
The real power of the error budget is not the math — it is the organizational policy attached to it. An error budget without a policy is just a dashboard metric that nobody acts on. The error budget policy defines concrete actions triggered at specific budget thresholds. This is where the error budget transforms from a metric into a decision-making framework.
Error budget policy — escalation tiers
1. Budget > 75% remaining: Green zone. Ship freely. Experiment with new architectures. Run chaos engineering experiments. This is the time to take calculated risks because you have budget to absorb failures.
2. Budget 50-75% remaining: Yellow zone. Proceed with normal deployments but increase monitoring. Cancel any planned chaos experiments. Review recent incidents for patterns that could accelerate budget consumption.
3. Budget 25-50% remaining: Orange zone. Require additional review for all deployments. Only deploy changes that have been tested in staging for at least 24 hours. Begin allocating engineering time to reliability improvements.
4. Budget 10-25% remaining: Red zone. Freeze all non-critical deployments. Only emergency fixes and reliability improvements are permitted. The team enters a mandatory reliability sprint to address the root causes of budget consumption.
5. Budget < 10% remaining: Critical zone. Full deployment freeze. All engineering effort redirected to reliability. Escalate to engineering leadership. Convene a reliability review to determine if the SLO is achievable with current architecture.
6. Budget exhausted (0%): SLO violation. Postmortem required. Feature velocity is zero until the error budget replenishes. If the SLO is tied to an SLA, customer credits may be triggered.
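The escalation tiers above can be expressed as a simple classifier. The zone names and thresholds below mirror this document's example policy; a real policy would be tuned per organization and enforced by CI/CD tooling, not a standalone function.

```python
# Escalation-tier lookup for the example error budget policy above.
# Thresholds are illustrative, taken from this document's tiers.

def budget_zone(budget_remaining_percent):
    if budget_remaining_percent <= 0:
        return "violation: postmortem required, zero feature velocity"
    if budget_remaining_percent < 10:
        return "critical: full deployment freeze, escalate to leadership"
    if budget_remaining_percent < 25:
        return "red: freeze non-critical deploys, mandatory reliability sprint"
    if budget_remaining_percent < 50:
        return "orange: extra deploy review, 24h staging soak"
    if budget_remaining_percent < 75:
        return "yellow: normal deploys, increased monitoring"
    return "green: ship freely, run chaos experiments"

print(budget_zone(80))  # green zone
print(budget_zone(30))  # orange zone
print(budget_zone(5))   # critical zone
```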
Burn Rate: How Fast Are You Spending?
The error budget remaining tells you how much you have left. The burn rate tells you how fast you are spending it. A burn rate of 1x means you will exactly exhaust your budget by the end of the window. A burn rate of 10x means you will exhaust it in 1/10th of the window — roughly 3 days for a 30-day window. Burn rate is more actionable than remaining budget because it predicts the future. If your current burn rate is 5x, you know you need to act now even if you still have 80% of the budget remaining.
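The forecast described above is simple division. A sketch of the arithmetic, assuming the burn rate stays constant:

```python
# Burn rate as a forecast: given the fraction of budget already
# consumed and the current burn rate, estimate days until exhaustion.
# A burn rate of 1x spends the whole budget in exactly one window.

def days_until_exhaustion(budget_consumed_fraction, burn_rate, window_days=30):
    if burn_rate <= 0:
        return float("inf")  # not burning: the budget never exhausts
    remaining = 1.0 - budget_consumed_fraction
    return remaining * window_days / burn_rate

print(days_until_exhaustion(0.0, 1))   # 30.0 -> exactly the window
print(days_until_exhaustion(0.0, 10))  # 3.0  -> 1/10th of the window
print(days_until_exhaustion(0.2, 5))   # 4.8  -> 80% left, but act now
```

This is why a 5x burn rate with 80% of the budget remaining still demands action: at that pace the budget is gone in under five days.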
Error budgets also serve as a negotiation tool between product and engineering. When a product manager requests an aggressive launch timeline, the SRE team does not need to say "no" — they can say "yes, but here is the risk." If the launch is expected to cause a temporary reliability dip that consumes 30% of the error budget, the product manager can make an informed decision: is this launch worth 30% of our reliability budget? Sometimes the answer is yes (major revenue-driving feature). Sometimes it is no (minor UI tweak that does not justify the risk).
Anti-Pattern: Error Budgets With No Consequences
An error budget policy that nobody enforces is worse than having no policy at all. It creates the illusion of a reliability framework while providing none of the benefits. If the error budget is exhausted and deployments continue as normal, the error budget is just a number on a dashboard. The policy must have teeth: deployment freezes must actually happen, reliability sprints must be staffed, and leadership must support enforcement even when it delays a high-priority feature launch.
| SLO Target | Error Budget (30-day) | What It Buys You | Realistic Scenario |
|---|---|---|---|
| 99% | 7.31 hours | One major outage per month without violating SLO | Internal tooling, non-critical batch processing |
| 99.9% | 43.2 minutes | A couple of moderate incidents or many brief blips | Standard web applications — this is the most common target for user-facing services |
| 99.95% | 21.6 minutes | One moderate incident or a handful of brief issues | E-commerce, SaaS platforms with paying customers |
| 99.99% | 4.32 minutes | Only a few brief incidents; no room for extended outages | Payment APIs, streaming platforms, cloud infrastructure |
| 99.999% | 26.3 seconds | Essentially zero tolerance; every second counts | Life-critical systems, financial settlement systems |
The "nines" of availability is the universal shorthand for reliability targets. When someone says "three nines" they mean 99.9% availability; "four nines" means 99.99%. Understanding the math behind the nines — and the exponential cost of each additional nine — is essential for every SRE. The math is simple, but the implications are profound. Each additional nine reduces your permitted downtime by a factor of 10 while increasing the engineering investment by roughly the same factor.
| Nines | Availability % | Downtime/Year | Downtime/Month | Downtime/Week | Downtime/Day |
|---|---|---|---|---|---|
| One nine | 90% | 36.5 days | 72 hours | 16.8 hours | 2.4 hours |
| Two nines | 99% | 3.65 days | 7.31 hours | 1.68 hours | 14.4 minutes |
| Three nines | 99.9% | 8.77 hours | 43.2 minutes | 10.1 minutes | 1.44 minutes |
| Four nines | 99.99% | 52.6 minutes | 4.32 minutes | 1.01 minutes | 8.6 seconds |
| Five nines | 99.999% | 5.26 minutes | 26.3 seconds | 6.05 seconds | 0.86 seconds |
| Six nines | 99.9999% | 31.6 seconds | 2.63 seconds | 0.6 seconds | 0.086 seconds |
Let these numbers sink in. At 99.9% (three nines), you have 43.2 minutes of downtime budget per month. That is enough for a brief outage, a bad deployment that gets rolled back in 15 minutes, and a few minutes of degraded performance from a traffic spike. At 99.99% (four nines), you have 4.32 minutes per month. That means a single deployment gone wrong can consume your entire monthly budget if you do not detect and roll back within 4 minutes. At 99.999% (five nines), you have 26.3 seconds per month — which means you essentially cannot have any human-in-the-loop recovery because humans cannot diagnose and fix problems in 26 seconds.
graph TB
A["99% — Two Nines<br/>7.3 hours/month<br/>Cost: $"] --> B["99.9% — Three Nines<br/>43.2 min/month<br/>Cost: $$"]
B --> C["99.95%<br/>21.6 min/month<br/>Cost: $$$"]
C --> D["99.99% — Four Nines<br/>4.32 min/month<br/>Cost: $$$$"]
D --> E["99.999% — Five Nines<br/>26.3 sec/month<br/>Cost: $$$$$"]
style A fill:#4ade80,stroke:#16a34a,color:#000
style B fill:#a3e635,stroke:#65a30d,color:#000
style C fill:#facc15,stroke:#ca8a04,color:#000
style D fill:#fb923c,stroke:#ea580c,color:#000
style E fill:#f87171,stroke:#dc2626,color:#000

Each additional nine reduces permitted downtime by 10x and roughly increases cost by 10x. The jump from three nines to four nines is where most organizations feel the exponential cost.
Compound Availability: The Chain Is Only as Strong as Its Weakest Link
If your service depends on three independent components each with 99.9% availability, the compound availability is 99.9% * 99.9% * 99.9% = 99.7%. Three "three nines" services chained together give you less than "three nines" end-to-end. This is why Google invests heavily in reducing the dependency chain for critical services, and why redundancy at every layer is non-negotiable for four-nines services.
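Compound availability for serially chained dependencies is just a product. A quick sketch:

```python
# Compound availability of a chain of dependencies: multiply the
# individual availabilities. Three 99.9% components in series give
# roughly 99.7% end-to-end.

def compound_availability(*availabilities_percent):
    product = 1.0
    for a in availabilities_percent:
        product *= a / 100.0
    return product * 100.0

print(compound_availability(99.9, 99.9, 99.9))  # ~99.7
print(compound_availability(99.9, 99.5))        # ~99.4
```

The second call shows why a 99.9% database plus a 99.5% external API caps a payment service at roughly 99.4%, no matter how reliable its own code is.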
What each nine costs in engineering practice
Quick Mental Math for Nines
For a 30-day month: 99% = ~7 hours, 99.9% = ~43 minutes, 99.99% = ~4 minutes, 99.999% = ~26 seconds. Each additional nine divides the permitted downtime by 10. Start with 99% = 7 hours and keep dividing: 7 hours / 10 = 43 minutes / 10 = 4.3 minutes / 10 = 26 seconds. This mental math will serve you well in interviews and design discussions.
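The mental math above, verified: start from 99% of a 30-day month (432 minutes, about 7.2 hours) and divide by 10 for each additional nine.

```python
# Verifying the divide-by-ten mental math for a 30-day month.

downtime_minutes = 0.01 * 30 * 24 * 60  # 1% of a 30-day month = 432 min
for slo in ["99%", "99.9%", "99.99%", "99.999%"]:
    print(f"{slo:>8}: {downtime_minutes:8.2f} minutes/month")
    downtime_minutes /= 10
# 99%      ~432 min (~7.2 h), 99.9% ~43.2 min,
# 99.99%   ~4.32 min,         99.999% ~0.43 min (~26 s)
```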
Traditional threshold-based alerting (alert when CPU > 80%, alert when error rate > 1%) is the source of most alert fatigue in the industry. It generates false positives (CPU spike during a batch job that does not affect users) and misses real problems (a slow degradation that keeps error rate at 0.8% but steadily burns the error budget). SLO-based alerting replaces threshold alerts with burn-rate alerts that are directly tied to user impact and error budget consumption.
The core idea is simple: instead of asking "is this metric above a threshold right now?" ask "at the current rate of errors, when will the error budget be exhausted?" If the answer is "in 2 hours," you need to act immediately. If the answer is "in 45 days," everything is fine. This is burn-rate alerting, and it is the standard at Google, Netflix, and most mature SRE organizations.
Burn Rate Defined
Burn rate is how fast you are consuming the error budget relative to the SLO window. A burn rate of 1 means you will exactly exhaust the budget by the end of the window. A burn rate of 14.4 means you will exhaust a 30-day budget in 50 hours. A burn rate of 720 means you will exhaust a 30-day budget in 1 hour. The formula is: burn_rate = (error_rate_observed / error_rate_allowed). If your SLO allows 0.1% errors and you are seeing 1.44% errors, your burn rate is 14.4x.
| Alert Type | Burn Rate | Budget Consumed | Detection Window | Response |
|---|---|---|---|---|
| Fast burn (page) | 14.4x | 2% in 1 hour | 5-minute + 1-hour window | Immediate page to on-call SRE; likely a significant outage or bad deployment |
| Moderate burn (page) | 6x | 5% in 6 hours | 30-minute + 6-hour window | Page on-call; sustained degradation requiring investigation |
| Slow burn (ticket) | 3x | 30% in 3 days | 6-hour + 3-day window | Create a ticket; investigate during business hours, no emergency response |
| Gradual burn (review) | 1x | On track to exhaust | 1-day + 30-day window | Add to weekly SLO review; may indicate growing technical debt |
The multi-window approach is critical. A single window will either be too sensitive (short window, many false positives) or too slow (long window, delayed detection). Multi-window burn-rate alerting uses two windows: a short window to confirm that the problem is happening right now, and a long window to confirm that it is significant enough to warrant action. Both conditions must be true simultaneously for the alert to fire.
graph TD
A["Error Occurs"] --> B{"Short Window<br/>5 min burn rate<br/>> 14.4x?"}
B -->|No| C["No Alert<br/>Transient blip"]
B -->|Yes| D{"Long Window<br/>1 hour burn rate<br/>> 14.4x?"}
D -->|No| E["No Alert<br/>Brief spike, recovering"]
D -->|Yes| F["PAGE ON-CALL<br/>Fast burn confirmed"]
A --> G{"Short Window<br/>30 min burn rate<br/>> 6x?"}
G -->|No| H["No Alert"]
G -->|Yes| I{"Long Window<br/>6 hour burn rate<br/>> 6x?"}
I -->|No| J["No Alert<br/>Recovering"]
I -->|Yes| K["PAGE ON-CALL<br/>Moderate burn confirmed"]
A --> L{"Short Window<br/>6 hour burn rate<br/>> 3x?"}
L -->|No| M["No Alert"]
L -->|Yes| N{"Long Window<br/>3 day burn rate<br/>> 3x?"}
N -->|No| O["No Alert"]
N -->|Yes| P["CREATE TICKET<br/>Slow burn confirmed"]
style F fill:#f87171,stroke:#dc2626,color:#000
style K fill:#fb923c,stroke:#ea580c,color:#000
style P fill:#facc15,stroke:#ca8a04,color:#000

Multi-window, multi-burn-rate alerting flow. Short window confirms the problem exists now; long window confirms it is significant. Both must trigger for an alert to fire, dramatically reducing false positives.
Alert Routing: Right Severity, Right Channel
Fast-burn alerts (14.4x) go to PagerDuty and wake up the on-call engineer. Moderate-burn alerts (6x) page during business hours or after a 15-minute delay. Slow-burn alerts (3x) create JIRA tickets for the next sprint. Gradual-burn alerts (1x) appear on weekly SLO review dashboards. This tiered approach means the pager only fires for genuine emergencies, and slower degradation is still tracked and addressed — just not at 3 AM.
Implementing SLO Alerts in Prometheus
In Prometheus, a multi-window burn-rate alert looks like: (rate(http_requests_total{code=~"5.."}[5m]) / rate(http_requests_total[5m]) > (14.4 * 0.001)) and (rate(http_requests_total{code=~"5.."}[1h]) / rate(http_requests_total[1h]) > (14.4 * 0.001)) — note the lowercase and, which is PromQL's logical vector-matching operator. The 0.001 is the error rate allowed by a 99.9% SLO. The Google SRE Workbook provides complete Prometheus alerting rules for all burn-rate tiers.
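The same two-window condition, sketched in Python for clarity. Real deployments express this as Prometheus recording and alerting rules; here `short_error_rate` and `long_error_rate` stand in for the two windowed rate() queries.

```python
# Multi-window fast-burn check: page only if BOTH the short (5 min)
# and long (1 hour) windows exceed the burn-rate threshold, i.e. the
# problem is happening now AND is significant. A sketch of the logic,
# not a monitoring integration.

SLO_ERROR_BUDGET = 0.001  # a 99.9% SLO allows 0.1% errors

def fast_burn_alert(short_error_rate, long_error_rate,
                    burn_threshold=14.4, allowed=SLO_ERROR_BUDGET):
    limit = burn_threshold * allowed  # 14.4 * 0.001 = 1.44% error rate
    return short_error_rate > limit and long_error_rate > limit

print(fast_burn_alert(0.05, 0.002))   # False -> brief spike, long window fine
print(fast_burn_alert(0.05, 0.020))   # True  -> sustained fast burn, page
print(fast_burn_alert(0.001, 0.020))  # False -> already recovering
```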
Theory is useful, but seeing how the world's largest and most reliable services define and manage SLOs makes the concepts concrete. Below are real-world SLO frameworks from companies that operate some of the highest-traffic services on the planet. These examples demonstrate that even at FAANG scale, the principles are the same: choose user-centric SLIs, set achievable targets, and enforce error budget policies.
Google Search — Three Nines (99.9%)
Gmail — Three Nines (99.9%)
Netflix — Four Nines (99.99%) for Stream Start
Stripe — Four Nines (99.99%) for API Availability
Pattern: The Higher the Stakes, the Tighter the SLO
Google Search and Gmail target 99.9% because users can retry (search again, reload inbox). Netflix and Stripe target 99.99% because failures are more impactful (broken stream start ruins the experience, failed payment loses revenue). Notice that no one targets 99.999% for user-facing web services — the cost is not justified because end-to-end reliability is bottlenecked by client-side factors (network, device, browser) that you cannot control.
Implementing SLOs incorrectly is often worse than not implementing them at all, because bad SLOs create false confidence. Teams believe they have a reliability framework, but the SLOs are measuring the wrong things, set at the wrong levels, or lack the enforcement mechanisms that make them useful. After years of rolling out SLOs across Google Tier-1 services and consulting with external organizations on SLO adoption, these are the most common and most damaging mistakes.
The most common SLO anti-patterns
The "100% SLO" Trap
If your SLO is 100%, your error budget is zero. That means any single error, any brief latency spike, any momentary blip violates the SLO. This makes the SLO permanently violated and therefore useless. No real-world system achieves 100% availability over any meaningful time window. Even Google, with its massive infrastructure investment, does not target 100% for any service. The pursuit of 100% reliability has infinite cost and zero marginal benefit beyond what five nines provides for all practical purposes.
Anti-Pattern: Ignoring Dependencies
Your service cannot be more reliable than its least reliable critical dependency. If your payment service depends on a database with 99.9% availability and an external fraud-detection API with 99.5% availability, your compound availability ceiling is approximately 99.4%, regardless of how perfect your own code is. Map your dependency chain and factor dependency SLOs into your own SLO target.
SLO health check — signs your SLO framework needs fixing
1. You have never triggered an error budget policy action. Either your SLOs are too loose, or you do not have a policy.
2. Nobody knows the current error budget remaining for their service. SLOs are not visible or not reviewed regularly.
3. The SLO has not been updated in over a year. User expectations and system architecture change; SLOs should be reviewed quarterly.
4. Error budget violations happen every month. The SLO may be too tight, or the service has fundamental reliability problems that need investment.
5. Teams work around the SLO by excluding too many error types. If your exclusion list is longer than your SLI definition, you are gaming the metric.
6. SLOs are only discussed during incidents. They should be part of sprint planning, quarterly reviews, and architectural decisions.
Rolling out SLOs in an organization that has never used them is a multi-quarter effort. It involves cultural change (getting product and engineering to agree on reliability targets), technical work (instrumenting services, building dashboards, configuring alerts), and process changes (creating error budget policies, integrating SLOs into sprint planning). This section provides a practical roadmap from zero to production-grade SLO implementation.
Phase 1: Foundation (weeks 1-4) — Choose your pilot service
1. Select one service that is user-facing, has existing monitoring, and has a team willing to adopt SLOs. Do not start with your most critical service; start with one where the stakes are manageable.
2. Map the user journey for the selected service. Identify the 2-3 most important interactions (e.g., login, search, checkout). These will inform your SLI choices.
3. Instrument the service to collect SLI data. Add HTTP status code tracking at the load balancer, latency histograms (not averages) with Prometheus or your preferred metrics system, and client-side error tracking if possible.
4. Measure current performance for 2-4 weeks to establish a baseline. What is the actual availability, latency p50, and latency p99 right now? This data will inform your SLO targets.
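Step 4's baseline should be expressed in percentiles, not averages. A pure-stdlib sketch with simulated data (in production these numbers come from your metrics system, e.g. Prometheus histograms):

```python
# Establishing a latency baseline: percentiles, not averages.
import random

def percentile(samples, p):
    """Nearest-rank percentile (p in (0, 100]) of a non-empty list."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

random.seed(7)
# Simulated request latencies: mostly ~60 ms, with a 2% slow tail.
latencies_ms = [random.gauss(60, 10) for _ in range(980)] + \
               [random.uniform(2000, 8000) for _ in range(20)]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
avg = sum(latencies_ms) / len(latencies_ms)
print(f"p50={p50:.0f}ms  p99={p99:.0f}ms  avg={avg:.0f}ms")
```

Even in this toy dataset, p50 and p99 describe two very different experiences that the single average blurs together.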
Phase 2: Define and Document (weeks 5-8)
1. Write the SLO document: SLI definitions, SLO targets (set slightly below current baseline), measurement windows (rolling 30 days recommended), and exclusions.
2. Draft the error budget policy with concrete escalation tiers. Get sign-off from both the engineering team lead and the product manager. Both must agree that the policy will be enforced.
3. Build the SLO dashboard in Grafana (or your visualization tool). The dashboard must show: current SLO compliance percentage, error budget remaining (minutes and percentage), burn rate (current and historical), and a time-series graph of SLI performance.
4. Configure multi-window burn-rate alerts. Start with the fast-burn (14.4x) and slow-burn (3x) tiers. Use a 5-minute short window and 1-hour long window for fast burn; a 6-hour short window and 3-day long window for slow burn.
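The multi-window logic in step 4 can be sketched in a few lines. This is a simplified model, not a production alerter: the error rates would come from your metrics system, and the function names are illustrative.

```python
# Multi-window burn-rate check using the 14.4x / 3x tiers from the text.
SLO = 0.999           # 99.9% availability target
BUDGET = 1 - SLO      # budgeted error rate: 0.1%

def burn_rate(error_rate: float) -> float:
    """How fast the budget burns relative to a service exactly on SLO."""
    return error_rate / BUDGET

def should_page(err_5m, err_1h, err_6h, err_3d):
    """Return 'page', 'ticket', or None. Both windows of a tier must
    exceed the threshold, which suppresses alerts for transient spikes
    that have already subsided."""
    if burn_rate(err_5m) > 14.4 and burn_rate(err_1h) > 14.4:
        return "page"    # fast burn: ~2% of a 30-day budget per hour
    if burn_rate(err_6h) > 3 and burn_rate(err_3d) > 3:
        return "ticket"  # slow burn: budget exhausted in ~10 days
    return None

print(should_page(0.02, 0.02, 0.001, 0.0005))  # sustained 2% errors -> "page"
```

The short window makes the alert responsive; the long window confirms the burn is sustained rather than a blip.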
Phase 3: Operate and Iterate (weeks 9-16)
1. Run the SLOs in "observe-only" mode for 4 weeks. The alerts fire, the dashboard is visible, but no error budget policy actions are enforced. This builds confidence that the SLOs and alerts are calibrated correctly.
2. Hold a weekly SLO review meeting during observe-only mode. Review: Were any alerts false positives? Was the burn rate stable? Did any incidents occur that should have triggered alerts but did not? Tune thresholds based on findings.
3. After the observe-only period, activate the error budget policy. Start with the least aggressive tier (slow-burn tickets) and add escalation tiers over the next month.
4. Conduct the first quarterly SLO review. Assess: Is the target too tight or too loose? Are the SLIs measuring the right things? Are the error budget policy actions proportionate? Adjust accordingly.
Tooling Stack for SLO Implementation
The industry-standard open-source stack is: Prometheus for metrics collection, Grafana for SLO dashboards (with the Grafana SLO plugin or custom PromQL queries), Alertmanager for burn-rate alerting, and a postmortem tool (Blameless, incident.io, or a shared Google Doc template) for tracking SLO violations. For organizations using cloud-native tooling, Google Cloud SLO Monitoring, Datadog SLOs, and Nobl9 are purpose-built SLO management platforms.
SLO adoption roadmap — organizational rollout
Start Small, Iterate Fast
The biggest mistake in SLO adoption is trying to boil the ocean — instrumenting every service, defining 50 SLIs, and building a comprehensive dashboard before anyone has used an SLO in production. Start with one service, three SLIs, and a simple Grafana dashboard. You will learn more from operating one SLO for a month than from planning 50 SLOs on a whiteboard. The first SLO does not need to be perfect; it needs to be operational.
Prometheus Recording Rules for SLO Tracking
Create Prometheus recording rules that pre-compute your SLI ratios: record: sli:availability:ratio_rate5m, expr: sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])). Then your SLO alert rule becomes: alert: SLOBurnRateFast, expr: (1 - sli:availability:ratio_rate5m) > (14.4 * 0.001), for: 2m. This pattern separates SLI computation from alerting logic, making both easier to maintain.
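Laid out as a Prometheus rules file, the inline rules above might look like the following (the group name is illustrative; the metric name and labels are as assumed in the snippet):

```yaml
groups:
  - name: slo-rules
    rules:
      # Pre-computed SLI: fraction of non-5xx requests over 5 minutes.
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # Fast-burn alert: error rate exceeds 14.4x the 0.1% budget
      # (i.e. a 99.9% SLO) for 2 consecutive minutes.
      - alert: SLOBurnRateFast
        expr: (1 - sli:availability:ratio_rate5m) > (14.4 * 0.001)
        for: 2m
```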
SLIs, SLOs, and error budgets are among the most commonly tested concepts in SRE, platform engineering, and senior backend interviews. Expect to be asked to define SLIs for a hypothetical system, calculate error budgets, explain burn-rate alerting, and describe how you would use error budget policies to balance feature velocity with reliability. Senior candidates are expected to cite real numbers (e.g., 99.9% = 43.2 min/month) and discuss the exponential cost of each additional nine. Staff-level candidates should be able to design an end-to-end SLO framework including the organizational rollout plan.
What interviewers listen for:
Strong answer: Citing real FAANG SLO examples with specific numbers. Explaining the bank-account analogy for error budgets unprompted. Discussing multi-window burn-rate alerting with specific burn rates (14.4x, 6x, 3x). Mentioning rolling windows vs calendar windows. Describing how to roll out SLOs incrementally starting with a pilot service. Understanding that SLOs should be tighter than SLAs.
Red flags: Confusing SLOs with SLAs. Suggesting 100% as an SLO target. Using averages instead of percentiles for latency. Not knowing the downtime math for common nines (99.9% = 43.2 min/month). Describing SLOs as purely a monitoring concern without mentioning error budget policies or organizational alignment. Inability to explain why each additional nine costs exponentially more.
Key takeaways
Your service has a 99.95% availability SLO over a rolling 30-day window. You had a 45-minute outage on day 10. How much error budget have you consumed, and what should you do?
A 99.95% SLO gives you an error budget of 0.05% of 30 days = 21.6 minutes. A 45-minute outage consumes 45/21.6 = 208% of your monthly budget. You have not only exhausted your error budget but exceeded it. This means you are in SLO violation. You should activate the full deployment freeze policy, redirect all engineering effort to reliability, conduct a postmortem on the outage, and report the SLO violation to leadership. The budget will begin replenishing as days older than 30 days drop off the rolling window.
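The arithmetic in that answer, as a quick sketch:

```python
# 99.95% SLO over a rolling 30-day window, with one 45-minute outage.
WINDOW_MINUTES = 30 * 24 * 60               # 43,200 minutes in 30 days

slo = 0.9995
budget_minutes = (1 - slo) * WINDOW_MINUTES  # 21.6 minutes of budget
outage_minutes = 45

consumed = outage_minutes / budget_minutes
print(f"budget={budget_minutes:.1f}min consumed={consumed:.0%}")  # 208%
```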
Why is it a mistake to set your internal SLO equal to your external SLA, and what is the recommended gap?
If your SLO equals your SLA, you will not know you are in trouble until you are already violating the customer contract (and potentially incurring financial penalties). The SLO should be stricter than the SLA to provide a safety buffer. A common recommendation is to set the SLO one "half-nine" above the SLA. If your SLA guarantees 99.9%, your internal SLO should be 99.95%. This gives you approximately half the downtime budget as an early warning system — when the SLO is violated, you still have buffer before the SLA is breached.
Explain why average latency is a dangerous metric for SLOs and what you should use instead.
Average latency hides tail latency problems. If 99% of requests complete in 50ms and 1% take 10 seconds, the average is 149ms — which describes nobody's actual experience. The 1% of users experiencing 10-second latency have a terrible experience that the average completely masks. SLOs should use percentiles: p50 for typical experience, p99 for worst-case tail experience. At Google, latency SLOs are typically set on both p50 and p99 independently, such as "p50 < 100ms AND p99 < 500ms." This ensures both the typical and worst-case user experiences are within acceptable bounds.
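The numbers in that answer check out with a few lines of arithmetic (a toy sample of 100 requests matching the 99%/1% split described above):

```python
# 99% of requests at 50 ms, 1% at 10 seconds.
latencies = [50.0] * 99 + [10_000.0]

avg = sum(latencies) / len(latencies)   # 149.5 ms: describes nobody
p50 = sorted(latencies)[49]             # 50 ms: the typical experience
worst = max(latencies)                  # 10,000 ms: the tail the average hides

print(f"avg={avg}ms p50={p50}ms worst={worst}ms")
```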
From the books
Site Reliability Engineering (Google)
Ch. 3-4
Chapters 3 and 4 define the SLI/SLO/error budget framework that has become the industry standard. The key insight is that error budgets transform reliability from a subjective debate into an objective, measurable trade-off between velocity and stability.
The Site Reliability Workbook (Google)
Ch. 2-5
Chapters 2-5 provide the practical implementation guide: how to define SLIs, set SLO targets, build SLO dashboards, configure burn-rate alerting, and draft error budget policies. This is the hands-on companion to the theoretical framework in the first book.
Implementing Service Level Objectives (Alex Hidalgo)
Full book
The most comprehensive single resource on SLO implementation. Covers the full lifecycle from choosing SLIs to organizational rollout, with detailed examples from multiple industries. Particularly strong on the cultural and organizational aspects of SLO adoption.
💡 Analogy
Your error budget is like a bank account. Every outage, latency spike, or error response is a withdrawal from the account. The SLO sets the minimum balance you must maintain. When your balance is high (budget remaining), you are rich — you can take risks, ship fast, experiment with new architectures, and run chaos engineering tests. When your balance is low (budget nearly exhausted), you are running low on funds — proceed with caution, increase review rigor, and defer risky changes. When you are overdrawn (budget exhausted), all discretionary spending stops — deployment freeze, mandatory reliability sprint, engineering effort redirected to stabilization. The account replenishes over time as the rolling window advances and old bad days drop off.
⚡ Core Idea
SLIs measure what users experience. SLOs set the minimum acceptable threshold for that experience. Error budgets (100% minus SLO) quantify exactly how much unreliability you can tolerate. When the budget is healthy, ship fast. When it is depleted, invest in reliability. This creates a data-driven framework that eliminates the subjective "ship vs stabilize" debate.
🎯 Why It Matters
Without SLOs, reliability decisions are driven by politics and gut feeling. With SLOs, they are driven by data. The error budget is the single mechanism that aligns product teams (who want velocity) and SRE teams (who want stability) around a shared objective. Organizations without SLOs either over-invest in reliability (wasting engineering capacity on unnecessary nines) or under-invest (shipping recklessly until a major outage forces a reactive reliability sprint). SLOs make the trade-off explicit, measurable, and manageable.