A deep-dive into rate limiting algorithms, distributed counting strategies, Redis Lua atomicity patterns, and production deployment at FAANG scale — covering token bucket, leaky bucket, fixed window, sliding window log, and sliding window counter with real-world tradeoffs.
Every public-facing API needs a way to say "slow down" before a flood of requests collapses the system. Rate limiting is the first line of defence against denial-of-service attacks, misbehaving clients, runaway retry loops, and simple misconfiguration. Without it, a single bad actor or a bug in a partner integration can saturate your database connection pool, exhaust thread pools, and cascade failures across every microservice downstream.
Rate limiting vs throttling vs backpressure
Rate limiting rejects excess requests at the boundary. Throttling delays them (queuing or slowing). Backpressure propagates slowness upstream so producers reduce their sending rate. A production system often uses all three in concert.
The token bucket is the most widely deployed rate limiting algorithm in production. It models a bucket that holds up to B tokens (the burst capacity). Tokens are added at a steady rate of R tokens per second. Each request consumes one token. If the bucket is empty, the request is rejected (or queued). This naturally allows short bursts up to B while enforcing an average rate of R.
[Refill: R tokens/sec] --> [Bucket: max B tokens] --> [Request arrives: take 1 token]
|
bucket empty? --> 429 Too Many Requests

Token bucket conceptual flow — tokens refill at a constant rate, each request drains one token.
The math
If refill rate = R tokens/sec and bucket size = B, the maximum burst is B requests served instantly, and sustained throughput is R req/sec. Refilling from `available` tokens back to full takes (B - available) / R seconds, so a fully drained bucket recovers in B / R seconds. Example: R = 10, B = 50 means a client can burst 50 requests, then sustain 10/sec, with a full refill taking 5 seconds.
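To sanity-check those formulas, here is a tiny standalone calculation in Python (purely illustrative):

Token Bucket Math — Worked Example (Python)
# Back-of-envelope numbers for R = 10 tokens/sec, B = 50.
R, B = 10.0, 50.0
print("max instant burst:", B)                   # 50 requests
print("sustained rate:", R, "req/sec")
print("full refill from empty:", B / R, "sec")   # 5.0
# Refilling from `available` tokens back to full takes (B - available) / R.
for available in (0, 25, 49):
    print(f"{available} tokens left -> {(B - available) / R:.1f}s to full")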
Token Bucket — Redis Lua Script
-- KEYS[1] = rate limit key
-- ARGV[1] = bucket capacity (B)
-- ARGV[2] = refill rate per second (R)
-- ARGV[3] = current timestamp (seconds)
-- ARGV[4] = tokens to consume (usually 1)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local bucket = redis.call("hmget", key, "tokens", "last_refill")
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
-- Calculate tokens to add since last refill
local elapsed = math.max(0, now - last_refill)
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)
if new_tokens >= requested then
    new_tokens = new_tokens - requested
    redis.call("hmset", key, "tokens", new_tokens, "last_refill", now)
    redis.call("expire", key, math.ceil(capacity / refill_rate) * 2)
    return {1, new_tokens} -- allowed, remaining
else
    redis.call("hmset", key, "tokens", new_tokens, "last_refill", now)
    redis.call("expire", key, math.ceil(capacity / refill_rate) * 2)
    return {0, new_tokens} -- denied, remaining
end

Use for: Atomic token bucket in Redis — a single Lua script ensures no race condition between read-modify-write steps across concurrent requests.
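Application code then calls the script via EVALSHA. A minimal sketch using the redis-py client, assuming the Lua source above is stored in a TOKEN_BUCKET_LUA string (register_script caches the SHA and falls back to EVAL on a NOSCRIPT error); the key format and handler stubs are illustrative:

Token Bucket — Invoking the Script (Python)
import time
import redis

r = redis.Redis()
token_bucket = r.register_script(TOKEN_BUCKET_LUA)  # assumed: the Lua source above

def allow_request(user_id: str, capacity: int = 50, rate: float = 10.0) -> bool:
    allowed, remaining = token_bucket(
        keys=[f"rate:tb:{user_id}"],                 # illustrative key format
        args=[capacity, rate, int(time.time()), 1],  # consume 1 token
    )
    return allowed == 1

if allow_request("user-123"):
    pass  # serve the request
else:
    pass  # return HTTP 429 with Retry-After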
Stripe uses a token bucket approach for their API rate limiting. Each API key gets a bucket sized to its plan tier. The token bucket is ideal when you want to allow controlled bursts (e.g., a batch import) while maintaining a long-term average rate.
The leaky bucket is the dual of the token bucket. Think of a bucket with a hole in the bottom that leaks at a constant rate. Requests enter the top of the bucket (a FIFO queue). If the bucket (queue) is full, new requests are dropped. The bucket drains at a fixed rate, so outgoing requests are perfectly smoothed — no bursts at all on the output side.
[Incoming requests] --> [Queue (max size Q)] --> drains at rate R --> [Backend service]
|
queue full? --> 429 Rejected

Leaky bucket model — the queue absorbs bursts, but the output is always steady at rate R.
| Property | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst handling | Allows bursts up to bucket size B | No output bursts; queue absorbs input bursts |
| Output shape | Bursty up to B, then R/sec | Perfectly smooth at R/sec |
| Implementation | Counter + timestamp | FIFO queue + timer |
| Memory | O(1) per key | O(Q) per key due to queue |
| Best for | APIs where short bursts are acceptable | Systems needing smooth, predictable load |
When to choose leaky bucket over token bucket
Use leaky bucket when your backend truly cannot handle any burstiness — for example, a payment processing pipeline with strict per-second throughput limits from a downstream provider. The tradeoff is added latency: requests sit in the queue waiting to drain rather than being served immediately.
Shopify historically used a leaky bucket for their REST Admin API. Each app got a bucket of 40 requests that leaked at 2 requests/second, giving a maximum sustained rate of 2/sec with the ability to absorb a short burst of 40.
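Since this lesson shows no Redis script for the leaky bucket, here is a minimal single-process sketch in Python with Shopify-style parameters (illustrative only; a production version would need locking and a real drain loop handing work to the backend):

Leaky Bucket — In-Process Sketch (Python)
import time
from collections import deque

class LeakyBucket:
    """FIFO queue of size Q that drains at a fixed rate R (single-threaded sketch)."""
    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity            # Q: max queued requests
        self.rate = rate                    # R: drain rate (req/sec)
        self.queue = deque()
        self.last_drain = time.monotonic()

    def _drain(self) -> None:
        now = time.monotonic()
        leaked = int((now - self.last_drain) * self.rate)
        for _ in range(min(leaked, len(self.queue))):
            self.queue.popleft()            # these requests proceed to the backend
        if leaked > 0:
            self.last_drain = now

    def offer(self, request) -> bool:
        self._drain()
        if len(self.queue) >= self.capacity:
            return False                    # queue full -> reject with 429
        self.queue.append(request)
        return True

bucket = LeakyBucket(capacity=40, rate=2.0)  # Shopify-style: burst 40, 2/sec sustained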
The simplest algorithm: divide time into fixed windows (e.g., every 60-second block starting at :00) and maintain a counter per window. When a request arrives, increment the counter for the current window. If counter > limit, reject. At the start of a new window, the counter resets to zero.
Fixed window — step by step
1. Determine the current window: window_start = floor(now / window_size) * window_size
2. Construct the key: "rate:{user_id}:{window_start}"
3. INCR the key in Redis. If the result is 1 (first request), also set EXPIRE to window_size.
4. If the counter exceeds the limit, return HTTP 429. Otherwise, allow the request.
The boundary burst problem
A user can send N requests at second 59 of window 1 and N more at second 0 of window 2 — achieving 2N requests in 2 seconds while technically staying within "N per minute" in both windows. This is the fundamental flaw of fixed windows.
Fixed Window — INCR + EXPIRE Made Atomic with Lua
-- Issued as two separate commands, INCR + EXPIRE is NOT atomic:
-- a crash between them leaves a counter with no TTL.
-- Wrapped in one Lua script, the pair executes atomically.
local count = redis.call("INCR", KEYS[1])
if count == 1 then
    redis.call("EXPIRE", KEYS[1], tonumber(ARGV[1]))
end
return count

Use for: This Lua script wraps INCR and EXPIRE atomically. Without the Lua script, a crash between INCR and EXPIRE leaves a key with no TTL — a "leaked counter" that never resets and permanently blocks the user.
GitHub uses a fixed window approach for some of their REST API limits (5 000 requests per hour per authenticated user). The simplicity is attractive at scale — just one INCR per request — but the boundary burst issue means actual instantaneous load can be up to 2x the nominal limit.
To fix the boundary burst problem, the sliding window log tracks the exact timestamp of every request in a sorted set. When a new request arrives, remove all entries older than the window size, count remaining entries, and accept or reject. This gives a perfect per-user rate limit with no boundary issues.
Sliding Window Log — Redis Sorted Set
-- KEYS[1] = rate limit key
-- ARGV[1] = window size in seconds
-- ARGV[2] = max requests per window
-- ARGV[3] = current timestamp (microseconds for precision)
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
-- Remove entries outside the window
redis.call("ZREMRANGEBYSCORE", key, 0, now - window * 1000000)
-- Count remaining entries
local count = redis.call("ZCARD", key)
if count < limit then
    -- Add current request timestamp as both score and member
    redis.call("ZADD", key, now, now .. ":" .. math.random(1000000))
    redis.call("EXPIRE", key, window + 1)
    return {1, limit - count - 1} -- allowed, remaining
else
    redis.call("EXPIRE", key, window + 1)
    return {0, 0} -- denied, 0 remaining
end

Use for: Each request timestamp is stored as a member in a Redis sorted set. ZREMRANGEBYSCORE prunes old entries, ZCARD counts current ones. Perfectly accurate but O(N) memory per user.
Memory trap at scale
If your limit is 10 000 requests per hour, each user key stores up to 10 000 entries in a sorted set. With 1 million active users, that is 10 billion entries in Redis. This algorithm is best suited for low-limit scenarios (e.g., 100 login attempts per hour) where memory per key stays manageable.
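The arithmetic behind that warning, assuming a rough 64 bytes per sorted-set entry (member string plus skiplist and dict overhead — the exact figure varies by Redis version and encoding):

Sliding Window Log — Memory Estimate (Python)
limit_per_user = 10_000                  # entries per key at the cap
active_users = 1_000_000
bytes_per_entry = 64                     # assumption; varies in practice

entries = limit_per_user * active_users                # 10 billion
memory_gib = entries * bytes_per_entry / 1024**3
print(f"{entries:,} entries ≈ {memory_gib:,.0f} GiB")  # ~596 GiB, per replica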
The sliding window log is the gold standard for accuracy. It is commonly used for authentication-related limits (login attempts, password resets) where the limit is low and precision matters more than memory efficiency.
The sliding window counter is the practical sweet spot — it combines the low memory of fixed windows with an approximation of sliding window accuracy. It keeps counters for the current and previous fixed windows, then uses a weighted formula to estimate the count in the sliding window.
The formula
estimated_count = (previous_window_count * overlap_percentage) + current_window_count. For example, if we are 30% into the current minute, overlap = 70% of previous window. If previous had 80 requests and current has 20, estimate = 80 * 0.7 + 20 = 76. With a limit of 100, we allow the request (76 < 100).
Timeline: |----prev window (count=80)----|----current window (count=20)----|
^-- 70% overlap --^-- 30% elapsed
Estimated sliding count = 80 * 0.70 + 20 = 76

Sliding window counter interpolates between two fixed windows for near-accurate results with O(1) memory.
Sliding Window Counter — Redis Implementation
-- KEYS[1] = current window key, KEYS[2] = previous window key
-- ARGV[1] = limit, ARGV[2] = window size (sec)
-- ARGV[3] = current timestamp
local curr_key = KEYS[1]
local prev_key = KEYS[2]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local curr_start = math.floor(now / window) * window
local elapsed = now - curr_start
local weight = 1 - (elapsed / window)
local prev_count = tonumber(redis.call("GET", prev_key) or "0")
local curr_count = tonumber(redis.call("GET", curr_key) or "0")
local estimated = prev_count * weight + curr_count
if estimated < limit then
    redis.call("INCR", curr_key)
    redis.call("EXPIRE", curr_key, window * 2)
    return {1, math.floor(limit - estimated - 1)}
else
    return {0, 0}
end

Use for: Two keys, one INCR — near-perfect accuracy with O(1) memory per user. Cloudflare uses this approach for many of their rate limiting products.
The recommended default
For most production systems, the sliding window counter is the best starting algorithm. It has O(1) memory per key (just two counters), no boundary burst problem (within ~1-2% error), and trivial Redis implementation. Only move to token bucket if you need explicit burst control, or sliding window log if you need exact precision.
In a single-node system, an in-memory counter with a mutex works fine. But production systems run behind load balancers with dozens of application servers. If each node keeps its own counter, a user with a limit of 100/min could send 100 requests to each of N nodes, achieving N * 100 total requests. You need a shared, centralized counter — and that introduces distributed systems challenges.
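That single-node baseline might look like the sketch below — a fixed-window counter guarded by a mutex. It is correct on one box and silently wrong behind a load balancer, which is exactly the failure mode just described (names are illustrative):

Single-Node Counter — Why It Breaks When Scaled (Python)
import threading
import time
from collections import defaultdict

class LocalFixedWindow:
    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)      # unbounded here; prune old windows in real code
        self.lock = threading.Lock()

    def allow(self, user_id: str) -> bool:
        window_start = int(time.time() // self.window)
        with self.lock:                     # serialize read-check-update
            key = (user_id, window_start)
            if self.counts[key] >= self.limit:
                return False
            self.counts[key] += 1
            return True

# Deployed on N nodes, each instance enforces `limit` independently,
# so one user can extract N * limit requests per window.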
The INCR/EXPIRE race condition
In Redis, INCR and EXPIRE are separate commands. If your process crashes after INCR but before EXPIRE, the key persists forever with no TTL — the user is permanently rate-limited. Always use a Lua script or a MULTI/EXEC transaction to make the pair atomic; on Redis 7.0+, calling EXPIRE with the NX flag on every request also heals a missed TTL. Alternatively, embed the window timestamp in the key name so stale keys naturally become irrelevant.
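That key-naming alternative, sketched with redis-py (key format is illustrative): because the window start is baked into the name, a key that misses its EXPIRE is simply never read again after the window rolls over.

Fixed Window — Self-Healing Key Names (Python)
import time
import redis

r = redis.Redis()
WINDOW = 60  # seconds

def allow(user_id: str, limit: int) -> bool:
    window_start = int(time.time()) // WINDOW * WINDOW
    key = f"rate:{user_id}:{window_start}"   # a new key every window
    count = r.incr(key)
    if count == 1:
        # Best effort: if this EXPIRE is lost to a crash, the stale key
        # wastes a little memory but never blocks a future window.
        r.expire(key, WINDOW * 2)
    return count <= limit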
Strategies for distributed counting
Why Lua scripts matter
Redis is single-threaded per shard — a Lua script runs atomically without any lock. This means your read-check-update logic cannot be interleaved with another request. This is why every production rate limiter uses EVAL/EVALSHA rather than pipelining separate GET/INCR/EXPIRE commands.
| Challenge | Impact | Mitigation |
|---|---|---|
| Clock skew across nodes | Different servers disagree on current window, causing inconsistent accept/reject | Use Redis server time (redis.call("TIME")) inside Lua scripts rather than application-side timestamps |
| Redis failover | During sentinel failover (~5-30s), rate limit state is lost; all counters reset to zero | Accept brief over-limit during failover; use Redis Cluster with replicas for faster failover |
| Hot keys | A viral endpoint causes millions of INCR calls to one key, creating a Redis hot-spot | Shard by {user_id}:{endpoint}, or use local pre-filtering to absorb obvious floods before hitting Redis |
| Network partition | App servers cannot reach Redis; rate limiter cannot make decisions | Fail-open (allow requests) or fail-closed (reject all) — choose based on your risk model; most systems fail-open |
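The fail-open choice from the last row can live in a small wrapper; check_rate_limit below stands in for any of the Lua-backed checks in this lesson (a sketch, assuming redis-py's exception types):

Fail-Open Wrapper — Python Sketch
import redis

def allow_with_fail_open(check_rate_limit, *args) -> bool:
    """If Redis is unreachable, let the request through."""
    try:
        return check_rate_limit(*args)
    except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
        # A rate limiter outage should not become an API outage.
        # For abuse-sensitive endpoints, return False here instead (fail-closed).
        return True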
A well-designed rate limiter communicates its state to clients through standard HTTP headers so that well-behaved clients can self-throttle and avoid hitting limits in the first place.
| Header / Status | Description | Example Value |
|---|---|---|
| HTTP 429 Too Many Requests | Status code returned when rate limit is exceeded | 429 |
| Retry-After | Seconds (or HTTP-date) until the client should retry | 30 |
| X-RateLimit-Limit | The maximum number of requests allowed in the window | 1000 |
| X-RateLimit-Remaining | How many requests the client has left in the current window | 247 |
| X-RateLimit-Reset | Unix timestamp when the window resets (or seconds until reset) | 1672531200 |
| RateLimit-Policy | IETF draft header describing the policy (e.g., "100;w=60") | 100;w=60 |
Always include Retry-After with 429
Without Retry-After, clients have no idea when to retry and will either give up or fall back to their own backoff schedule, which may not align with your window. Include it in every 429 response. Good clients like the GitHub CLI parse Retry-After and sleep exactly that long.
IETF RateLimit header fields (draft standard)
The IETF is standardizing rate-limit headers as RateLimit (combined remaining, limit, reset) and RateLimit-Policy. While X-RateLimit-* headers are still the de facto standard, new APIs should consider adopting the IETF draft for forward compatibility.
Stripe returns X-RateLimit-Limit and X-RateLimit-Remaining on every response, not just 429s. This lets clients monitor their consumption proactively. GitHub includes X-RateLimit-Reset as a Unix timestamp and offers a /rate_limit endpoint to check status without consuming a request.
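A well-behaved client honors these headers rather than guessing. A sketch using the requests library (assumes Retry-After arrives in its seconds form; the URL and thresholds are illustrative):

Header-Aware Client — Python Sketch
import time
import requests

def get_with_rate_limit(url: str, max_attempts: int = 5) -> requests.Response:
    for _ in range(max_attempts):
        resp = requests.get(url)
        if resp.status_code != 429:
            remaining = resp.headers.get("X-RateLimit-Remaining")
            if remaining is not None and int(remaining) < 10:
                time.sleep(1.0)             # proactively slow down near the cap
            return resp
        # Sleep exactly as long as the server asks; default to 1s if absent.
        time.sleep(float(resp.headers.get("Retry-After", 1)))
    raise RuntimeError("still rate limited after retries")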
Where you place the rate limiter has major implications for latency, accuracy, and operational complexity.
Layer your rate limiters
Production systems typically use multiple layers: a coarse global limit at the edge (e.g., 10 000 req/min per IP), a per-user limit at the API gateway (e.g., 1 000 req/min per API key), and a per-resource limit at the service level (e.g., 100 LLM calls/day per user). Each layer catches what the layer above misses.
Three broad strategies exist for multi-region counting — a single global limit (one source of truth), per-region independent limits, or a hybrid of local counting with asynchronous synchronization:
| Approach | Accuracy | Latency Impact | Complexity | Use Case |
|---|---|---|---|---|
| Global centralized Redis | Exact | High for distant regions (+50-150ms) | Low | Strict limits on financial APIs |
| Per-region independent | Per-region only | None (local) | Low | Approximate limits where 2x is tolerable |
| Hybrid with async sync | Eventually consistent | Low (local + background sync) | High | Global APIs needing low latency and reasonable accuracy |
| CRDTs (e.g., G-Counter) | Eventually consistent | Low | Very high | Research/cutting-edge; Cloudflare experiments |
For most applications, per-region independent limits with a known multiplier (total effective limit = per-region limit * active regions) is the simplest approach. Reserve global centralized counting for scenarios where over-limit has real cost — billing, compliance, or security-critical endpoints.
| Company | Algorithm | Limits | Notable Details |
|---|---|---|---|
| GitHub | Fixed window | 5 000 req/hr (authenticated), 60/hr (unauthenticated) | Dedicated /rate_limit endpoint; X-RateLimit-* headers on every response; uses conditional requests (304) to reduce consumption |
| Stripe | Token bucket | 100 req/sec (live mode), 25 req/sec (test mode) | Per-API-key; returns X-RateLimit-* headers; exponential backoff recommended in docs; idempotency keys prevent double-charges on retries |
| Cloudflare | Sliding window counter | Configurable per zone | Runs at the edge (200+ PoPs); can rate-limit by URI, headers, cookies, or JA3 fingerprint; action options: block, challenge, JS challenge, managed challenge |
| AWS API Gateway | Token bucket | 10 000 req/sec (account), 5 000 burst | Per-region; stage-level and method-level throttling; usage plans tie to API keys |
| Discord | Token bucket per route | Varies per endpoint (e.g., 5/sec for messages) | Returns X-RateLimit-Bucket header; global and per-route limits; bot library authors must handle both |
Idempotency and rate limiting work together
When a rate-limited request is retried, you need idempotency to prevent duplicate side effects. Stripe supports an Idempotency-Key header on POST requests — if the retry lands after the original succeeds, the duplicate is safely ignored and the original response is replayed. Always pair rate limiting with idempotency for write operations.
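Pairing the two on the client side, sketched with requests (the URL and payload are illustrative; the header name follows Stripe's documented pattern):

Retry with Idempotency-Key — Python Sketch
import time
import uuid
import requests

def create_charge(payload: dict, max_attempts: int = 3) -> requests.Response:
    # One key per logical operation, reused across retries, so a retry
    # landing after the original succeeded cannot double-charge.
    idempotency_key = str(uuid.uuid4())
    for _ in range(max_attempts):
        resp = requests.post(
            "https://api.example.com/v1/charges",    # illustrative endpoint
            json=payload,
            headers={"Idempotency-Key": idempotency_key},
        )
        if resp.status_code != 429:
            return resp
        time.sleep(float(resp.headers.get("Retry-After", 1)))
    raise RuntimeError("still rate limited after retries")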
Rate limiter design is one of the most frequently asked system design questions at FAANG companies. Interviewers expect you to start with requirements (who is being limited, what resource, what limit), walk through at least two algorithms with tradeoffs, discuss the distributed counting problem, and explain HTTP 429 headers. Senior candidates are expected to discuss Redis Lua atomicity, multi-region challenges, and how to handle rate limiter failures (fail-open vs fail-closed).
Common questions:
A user sends 100 requests at second 59 of a fixed window and 100 more at second 0 of the next window. If the limit is 100 requests per minute, how many total requests go through in those 2 seconds? How would a sliding window counter handle this differently?
With fixed windows, all 200 requests go through (100 per window, both within limit). With a sliding window counter just after the window rolls over, the previous window still carries ~98% weight, so the estimate starts near 100 * 0.98 = 98 before the new batch is counted; after only a couple of accepted requests the estimate crosses the limit of 100, so nearly all of the second batch is rejected.
Why must INCR and EXPIRE in a fixed window rate limiter be executed atomically, and what happens if they are not?
If the process crashes after INCR but before EXPIRE, the key has no TTL and persists forever. The counter never resets, permanently blocking the user. A Redis Lua script ensures both operations execute as a single atomic unit. Alternatively, embedding the window timestamp in the key name means stale keys become irrelevant even without EXPIRE.
You run 3 API gateway instances in us-east-1 and 3 in eu-west-1, each with an independent Redis. Your per-user limit is 1000 req/min. What is the effective global limit a user could achieve, and how would you fix it?
The effective global limit is 2000 req/min (1000 per region * 2 regions). To fix: either use a single global Redis (adds cross-region latency), implement async cross-region counter synchronization (eventually consistent), or set per-region limits to 500 so the global total stays at 1000.
💡 Analogy
A rate limiter is like a nightclub bouncer with a clicker counter. The bouncer lets people in one by one, clicking the counter each time. Once the club hits capacity, the bouncer says "line is full, come back later." The bouncer does not care who you are — just whether there is room. The token bucket is a bouncer who also hands out drink tickets at a steady rate: you can use your tickets in a burst, but once they are gone, you wait for the next batch.
⚡ Core Idea
Rate limiting is about converting an unpredictable flood of requests into a predictable, manageable stream — protecting both the system and the fairness contract between users.
🎯 Why It Matters
Without rate limiting, a single misbehaving client can take down an entire platform. Every FAANG outage postmortem eventually asks "why did we not rate-limit this?" It is the cheapest, highest-ROI reliability investment you can make.