Backend · 15 min read · Jun 2026

Rate Limiting and Throttling: Protecting APIs from Abuse and Overload

Why the same five algorithms keep showing up in every API gateway, where to enforce them, and how to build a distributed token-bucket limiter on Redis that returns clean 429s.

Tags: Rate Limiting, API, Redis, Reliability

Sri Balaji

Founder

Your API is only as polite as its rudest caller

See it: traffic vs refill

[Interactive demo: token-bucket rate limiter with capacity 10, showing tokens left, incoming requests, allowed, 429'd, and pass rate.]

When traffic outruns the refill rate, the bucket empties and excess requests get 429'd, but short bursts still pass thanks to the capacity buffer.

Who this is for

You ship an HTTP API and a single client just hammered it 10,000 times in a second: a buggy retry loop, a scraper, or a customer who wrote `while (true) { fetch() }`. Everyone else's requests slow to a crawl. This article is for backend and SRE folks who want to understand the **algorithms** behind rate limiting, **where** to enforce them, and how to build a **distributed** limiter that survives more than one server.

The uncomfortable truth: without limits, your API's capacity is a shared resource that anyone can monopolize. One abusive caller, malicious or just careless, degrades the experience for every honest one. Rate limiting is how you turn 'first come, first served until we fall over' into 'everyone gets a fair, predictable slice'.

Rate limiting is not about punishing users. It is about protecting the API so that it stays available for all of them.
The whole point, in one line

Four words people use interchangeably (and shouldn't)

Before the algorithms, get the vocabulary straight. These terms get mixed up constantly, but they solve different problems and live in different layers.

| Term | What it controls | Time horizon |
| --- | --- | --- |
| Rate limiting | Requests per unit of time per caller | Seconds to minutes |
| Throttling | Slowing or queueing requests once a soft limit is hit (instead of rejecting) | Real-time, per request |
| Load shedding | Dropping requests when the *server* is overloaded, regardless of who sent them | During an incident |
| Quotas | Total allowance over a long billing/contract window | Day, month, plan tier |

Rate limiting protects fairness; load shedding protects survival; quotas enforce business deals.

The clean mental split

Rate limiting and quotas are about the **caller** ('you get 100/min, 1M/month'). Load shedding is about the **server** ('I am at 95% CPU, start dropping the cheapest requests'). You need both: a healthy server still has to fend off one greedy client, and a fair limiter can't save a server that is genuinely on fire.

An analogy: the bouncer and the bucket

| At the club | In your API |
| --- | --- |
| A nightclub bouncer letting people in one at a time | A limiter admitting requests at a controlled rate |
| The club's fire-code capacity (a hard ceiling) | Max concurrency / server capacity |
| A short queue outside that the bouncer lets trickle in | Throttling: delay instead of reject |
| A VIP wristband that allows a small burst of friends together | Token bucket: accumulated tokens allow a burst |
| A bucket with a hole that drains at a fixed pace no matter how fast you pour | Leaky bucket: smooths bursty input into a steady output |

Token bucket forgives bursts; leaky bucket forces a smooth, constant outflow.

Hold the bucket image: a token bucket fills with tokens at a steady rate and each request spends one, so if you've been quiet, you've saved up tokens and can fire a burst. A leaky bucket is the opposite intuition: requests pour in, but they only leave through a fixed-size hole, so output is always smooth even when input spikes.
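The token bucket gets a full Redis build later in this article, so here is a minimal sketch of the other intuition, the leaky bucket as a queue. It is an illustration under stated assumptions, not a production limiter: the class name `LeakyBucket`, its parameters, and the choice to model time as a plain number of seconds (instead of real timers) are all hypothetical.

```typescript
// Leaky bucket as a queue, modelled without timers: offer() returns the time
// (in seconds) at which the request would be released downstream, or null if
// the queue is full and the request overflows.
class LeakyBucket {
  private nextFree = 0; // earliest time the next request may leave the bucket
  private readonly interval: number;  // seconds between two releases
  private readonly queueSize: number; // how many requests may wait at once

  constructor(leakPerSec: number, queueSize: number) {
    this.interval = 1 / leakPerSec;
    this.queueSize = queueSize;
  }

  offer(now: number): number | null {
    const departure = Math.max(now, this.nextFree);
    // Number of release slots between "now" and this request's own slot.
    const waiting = Math.round((departure - now) / this.interval);
    if (waiting > this.queueSize) return null; // bucket is full: overflow
    this.nextFree = departure + this.interval;
    return departure;
  }
}

// A burst of four requests at t=0 against a hole that leaks 2 requests/second:
const bucket = new LeakyBucket(2, 2);
console.log([0, 0, 0, 0].map((t) => bucket.offer(t))); // [ 0, 0.5, 1, null ]
```

The input arrived all at once, but the output is spaced exactly half a second apart, and the fourth request is rejected because the queue is full. A token bucket with the same numbers would instead have let the first few through immediately.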

Where the limiter actually sits

A limiter is only useful if every relevant request flows through it and it can see a shared count. The common shape: requests hit the API gateway / edge first, the limiter checks a counter in a shared store (Redis), and only admitted requests reach your app.

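[Diagram: Clients (per-IP / per-API-key) send each request to the API Gateway (rate limiter). The gateway runs INCR + check against Redis (shared counters), forwards the request to the App Service (business logic) if allowed, and emits Metrics (429 rate, limit hits).]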

The gateway checks a shared Redis counter before any request touches the app tier.

  1. Identify the caller. Derive a key: API key, user ID, or client IP. The key decides whose bucket you're checking. Per-key is fairest; per-IP is the fallback for anonymous traffic.
  2. Check the shared counter. Atomically read-and-update the caller's count in Redis. This must be atomic, or two concurrent requests both see 'under limit' and both pass.
  3. Decide: allow, throttle, or reject. Under limit -> forward to the app. Over limit -> return 429 with a Retry-After header (or queue briefly if you throttle).
  4. Tell the caller where they stand. Attach X-RateLimit-* headers on every response so well-behaved clients can self-pace before they ever get rejected.

Edge vs app vs per-user

**Edge/gateway** is the right default: it rejects abuse before it costs you compute, and it sees all traffic. **In-app** limiting is for expensive operations the gateway can't reason about (e.g. 'max 3 report exports per user'). **Per-user/per-key** is fairest for authenticated APIs; **per-IP** is your only lever for anonymous endpoints (but beware NAT and shared proxies bundling many users behind one IP).
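One way to express that choice of key in code is sketched below. The `CallerInfo` shape, the `limiterKey` function, and the `scope` parameter are hypothetical names introduced for illustration; the idea is only that the most specific identity wins, and that an in-app limit gets its own scope so it does not share a bucket with the gateway-wide one.

```typescript
// What the limiter knows about a caller. Only the IP is always present.
interface CallerInfo {
  apiKey?: string;
  userId?: string;
  ip: string;
}

// Pick the most specific identity available: API key, then user ID, then IP.
// `scope` separates independent limits, e.g. a gateway-wide limit from an
// in-app limit on one expensive operation.
function limiterKey(caller: CallerInfo, scope = "global"): string {
  if (caller.apiKey) return `${scope}:key:${caller.apiKey}`;
  if (caller.userId) return `${scope}:user:${caller.userId}`;
  return `${scope}:ip:${caller.ip}`;
}

console.log(limiterKey({ apiKey: "k_123", ip: "203.0.113.7" }));                // global:key:k_123
console.log(limiterKey({ userId: "42", ip: "203.0.113.7" }, "report-export")); // report-export:user:42
console.log(limiterKey({ ip: "203.0.113.7" }));                                 // global:ip:203.0.113.7
```

Because the anonymous fallback keys on IP, everyone behind the same NAT or proxy shares that bucket, which is the caveat noted above.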

The five algorithms, and what each one trades away

Every limiter you'll meet is one of these five. They differ mostly in how they handle bursts at the edge of a window and how much memory they cost per caller.

  • Fixed window: count requests per fixed clock interval (e.g. per calendar minute). Dead simple, one integer per caller. Flaw: a caller can fire a full limit at 0:59 and another full limit at 1:00, a 2x burst across the boundary.
  • Sliding window log: store a timestamp for every request, count how many fall in the last N seconds. Perfectly accurate, no boundary spikes. Cost: memory grows with request volume (one entry per request).
  • Sliding window counter: approximate the sliding window by weighting the previous window's count. Smooths the boundary problem with O(1) memory. The pragmatic favorite.
  • Token bucket: tokens refill at a steady rate up to a cap; each request spends one. Allows controlled bursts (up to the bucket size) while bounding the long-run average. What most APIs actually want.
  • Leaky bucket: a queue drained at a constant rate; overflow is rejected. Guarantees a smooth, constant output rate, great for protecting a fragile downstream that hates bursts.
| Algorithm | Burst handling | Memory / accuracy |
| --- | --- | --- |
| Fixed window | Allows 2x burst at window edges | 1 int/caller; least accurate |
| Sliding window log | No burst leak; exact | O(requests); accurate but heavy |
| Sliding window counter | Smooths edges (approx) | O(1); good accuracy, cheap |
| Token bucket | Allows bursts up to bucket size | 2 values/caller; bounds average |
| Leaky bucket | No bursts, constant outflow | Queue depth; smoothest output |

Token bucket and sliding-window-counter are the two you'll reach for 90% of the time.
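To make the boundary problem concrete, here is a small single-process sketch of the fixed window next to the sliding window counter. The class names, the `burst` helper, and the use of plain numbers for time (seconds) are assumptions made for the example. The weighting used is the common approximation `previous * (1 - elapsedFraction) + current`; other variants exist.

```typescript
// Fixed window: one counter per clock-aligned window.
class FixedWindow {
  private windowStart = Number.NEGATIVE_INFINITY;
  private count = 0;
  private readonly limit: number;
  private readonly windowSec: number;

  constructor(limit: number, windowSec: number) {
    this.limit = limit;
    this.windowSec = windowSec;
  }

  allow(now: number): boolean {
    const start = Math.floor(now / this.windowSec) * this.windowSec;
    if (start !== this.windowStart) { this.windowStart = start; this.count = 0; }
    if (this.count >= this.limit) return false;
    this.count++;
    return true;
  }
}

// Sliding window counter: two counters, with the previous window weighted by
// how much of it still overlaps the last `windowSec` seconds.
class SlidingWindowCounter {
  private windowStart = Number.NEGATIVE_INFINITY;
  private current = 0;
  private previous = 0;
  private readonly limit: number;
  private readonly windowSec: number;

  constructor(limit: number, windowSec: number) {
    this.limit = limit;
    this.windowSec = windowSec;
  }

  allow(now: number): boolean {
    const start = Math.floor(now / this.windowSec) * this.windowSec;
    if (start !== this.windowStart) {
      // Carry the old count over only if we moved exactly one window forward.
      this.previous = start - this.windowStart === this.windowSec ? this.current : 0;
      this.current = 0;
      this.windowStart = start;
    }
    const elapsedFraction = (now - start) / this.windowSec;
    const estimate = this.previous * (1 - elapsedFraction) + this.current;
    if (estimate >= this.limit) return false;
    this.current++;
    return true;
  }
}

// Fire n requests at the same instant and count how many pass.
function burst(limiter: { allow(now: number): boolean }, now: number, n: number): number {
  let passed = 0;
  for (let i = 0; i < n; i++) if (limiter.allow(now)) passed++;
  return passed;
}

const fixed = new FixedWindow(10, 60);
const sliding = new SlidingWindowCounter(10, 60);
console.log(burst(fixed, 59, 10) + burst(fixed, 60, 10));     // 20: the 2x boundary burst
console.log(burst(sliding, 59, 10) + burst(sliding, 60, 10)); // 10: the previous window still counts
```

With a limit of 10 per minute, the fixed window admits 20 requests within about one second around the boundary, while the sliding window counter admits 10 using only two integers per caller.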

Build it: a distributed token bucket on Redis

Here's a production-shaped token-bucket limiter in TypeScript. The whole check-and-decrement must be atomic, so it runs inside a single Redis Lua script. Doing INCR then EXPIRE in two round-trips opens a race where the key never expires; one script closes that gap.

rate-limit.ts
typescript
import { createClient } from "redis";

const redis = createClient();
await redis.connect();

// Atomic token-bucket refill-and-consume.
// KEYS[1] = bucket key   ARGV = [capacity, refillPerSec, now, cost]
const TOKEN_BUCKET = `
  local key      = KEYS[1]
  local capacity = tonumber(ARGV[1])
  local refill   = tonumber(ARGV[2])
  local now      = tonumber(ARGV[3])
  local cost     = tonumber(ARGV[4])

  local state  = redis.call('HMGET', key, 'tokens', 'ts')
  local tokens = tonumber(state[1])
  local ts     = tonumber(state[2])
  if tokens == nil then tokens = capacity; ts = now end

  -- refill based on elapsed time, capped at capacity
  local delta = math.max(0, now - ts)
  tokens = math.min(capacity, tokens + delta * refill)

  local allowed = 0
  if tokens >= cost then
    tokens = tokens - cost
    allowed = 1
  end

  redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
  redis.call('EXPIRE', key, math.ceil(capacity / refill) + 1)

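  -- whole seconds until enough tokens for this request are available again.
  -- Rounded up here because Redis truncates Lua numbers to integers on
  -- return, so a fractional wait such as 0.4 would otherwise come back as 0.
  local retry = 0
  if allowed == 0 then retry = math.ceil((cost - tokens) / refill) end
  return { allowed, math.floor(tokens), retry }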
`;

export interface LimitResult {
  allowed: boolean;
  remaining: number;
  retryAfter: number; // seconds
}

export async function rateLimit(
  key: string,
  capacity = 100,    // bucket size = max burst
  refillPerSec = 10, // steady-state rate
  cost = 1,
): Promise<LimitResult> {
  const now = Date.now() / 1000;
  const res = (await redis.eval(TOKEN_BUCKET, {
    keys: [`rl:${key}`],
    arguments: [String(capacity), String(refillPerSec), String(now), String(cost)],
  })) as [number, number, number];

  return { allowed: res[0] === 1, remaining: res[1], retryAfter: Math.ceil(res[2]) };
}

Now wire it into a request handler. The contract: set the X-RateLimit-* headers on every response, and on rejection return 429 with Retry-After so the client knows exactly how long to wait.

middleware.ts
typescript
import type { Request, Response, NextFunction } from "express";
import { rateLimit } from "./rate-limit";

export async function limiter(req: Request, res: Response, next: NextFunction) {
  // prefer an authenticated key; fall back to IP for anonymous traffic
  const key = (req.header("x-api-key") ?? req.ip) as string;
  const { allowed, remaining, retryAfter } = await rateLimit(key, 100, 10);

  res.setHeader("X-RateLimit-Limit", "100");
  res.setHeader("X-RateLimit-Remaining", String(Math.max(0, remaining)));
  res.setHeader("X-RateLimit-Reset", String(Math.ceil(Date.now() / 1000) + retryAfter));

  if (!allowed) {
    res.setHeader("Retry-After", String(retryAfter)); // seconds
    return res.status(429).json({
      error: "rate_limited",
      message: "Too many requests. Slow down and retry after the given delay.",
      retryAfter,
    });
  }
  next();
}

Headers are a contract, not decoration

`X-RateLimit-Remaining` lets a good client throttle itself *before* it hits the wall. `Retry-After` (seconds, or an HTTP date) on a `429` tells it exactly when to come back. Clients that respect these never see a hard rejection, which is the whole goal.

Distributed limiting: one source of truth, please

The moment you run more than one gateway instance, an in-memory counter is wrong: each instance only sees its own slice, so a '100/min' limit silently becomes '100/min times N instances'. The fix is a shared store (Redis), and that introduces two subtleties worth respecting.

Atomicity and clock skew

**Atomicity:** never do `GET` then `SET` as separate calls: concurrent requests read the same count, and both pass. `INCR` then `EXPIRE` as separate calls fails differently: if the process dies between them, the key never expires. Use a Lua script (one atomic execution) or `INCR` with `EXPIRE` set only on first creation. **Clock skew:** if each gateway uses its own wall clock for token refill, drift between machines makes the limit fuzzy. Either pass time *into* Redis from one trusted source, or use Redis's own `TIME` command inside the script so every instance agrees on 'now'.

redis
# The classic two-call pattern for a fixed window. DON'T do this:
INCR rl:user:42        # returns 1 on the first hit of the window
EXPIRE rl:user:42 60   # if the process dies before this runs, the key lives forever

# Safer: create the key together with its TTL in one atomic command, then count.
SET rl:user:42 0 EX 60 NX   # NX = only if not exists; starts at 0 so the first INCR yields 1
INCR rl:user:42             # every hit, including the first, bumps the counter
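The check-then-set race described above can be reproduced without Redis. The sketch below uses a hypothetical in-memory `FakeStore` whose methods each yield to the event loop once, standing in for a network round-trip; `racyAllow` and `atomicAllow` are illustrative names, not part of any library.

```typescript
// Stand-in for a remote store: every call yields once, like a round-trip would.
class FakeStore {
  private data = new Map<string, number>();

  async get(key: string): Promise<number> {
    await Promise.resolve();
    return this.data.get(key) ?? 0;
  }

  async set(key: string, value: number): Promise<void> {
    await Promise.resolve();
    this.data.set(key, value);
  }

  // Atomic read-modify-write: nothing can interleave between read and write.
  async incr(key: string): Promise<number> {
    await Promise.resolve();
    const next = (this.data.get(key) ?? 0) + 1;
    this.data.set(key, next);
    return next;
  }
}

// Two round-trips: concurrent callers can both read the same stale count.
async function racyAllow(store: FakeStore, key: string, limit: number): Promise<boolean> {
  const count = await store.get(key);
  if (count >= limit) return false;
  await store.set(key, count + 1);
  return true;
}

// One atomic step: each caller gets a distinct count back.
async function atomicAllow(store: FakeStore, key: string, limit: number): Promise<boolean> {
  return (await store.incr(key)) <= limit;
}

async function demo(): Promise<void> {
  const store = new FakeStore();
  // Limit is 1, yet both concurrent requests pass with the two-call version.
  console.log(await Promise.all([racyAllow(store, "a", 1), racyAllow(store, "a", 1)]));     // [ true, true ]
  console.log(await Promise.all([atomicAllow(store, "b", 1), atomicAllow(store, "b", 1)])); // [ true, false ]
}
demo();
```

A Lua script gives the Redis version of `incr` here: the whole read, decide, and write sequence runs as one step that no other client can interleave with.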

Two more realities. First, Redis is now in your hot path, so plan for it being unavailable: a common policy is to fail open (allow traffic if the limiter store is down, favoring availability over protection) for user-facing reads, and fail closed for sensitive or expensive endpoints. Second, the same Redis instance often does double duty as a cache (see caching strategies), so size and isolate it deliberately.
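The fail-open versus fail-closed decision can be kept in one small wrapper around whatever check you use. This is a sketch under assumptions: `checkWithPolicy`, `FailurePolicy`, and the 50 ms default timeout are made up for the example, and `check` stands for any function that resolves to true when the request is allowed.

```typescript
type FailurePolicy = "open" | "closed";

// Runs the limiter check, but never lets a broken or slow store decide by
// accident: on error or timeout the configured policy decides instead.
async function checkWithPolicy(
  check: () => Promise<boolean>,
  policy: FailurePolicy,
  timeoutMs = 50,
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("limiter timeout")), timeoutMs);
  });
  try {
    return await Promise.race([check(), timeout]);
  } catch {
    // Store unreachable or too slow: "open" admits the request, "closed" rejects it.
    return policy === "open";
  } finally {
    clearTimeout(timer);
  }
}
```

A read endpoint might call `checkWithPolicy(() => rateLimit(key).then((r) => r.allowed), "open")`, while a password-reset or export endpoint would pass `"closed"`. Either way, count how often the fallback branch fires, since a limiter that silently fails open is not limiting anything.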

Common mistakes that cost hours

  1. Counting in memory behind a load balancer. Each instance limits independently, so your real limit is N times what you configured. Move the counter to a shared store the day you add a second instance.
  2. Non-atomic check-and-update. GET/SET or INCR/EXPIRE as two calls races under load. Use one Lua script or SET ... NX so the limit is actually enforced.
  3. Returning the wrong status code. Use 429 Too Many Requests, not 403, not 500. The status is how clients and CDNs know to back off rather than treat it as a bug.
  4. Omitting `Retry-After`. Without it, clients guess, usually by retrying immediately and making the overload worse. Always tell them when to come back.
  5. Rate limiting purely by IP on authenticated APIs. NAT and corporate proxies put thousands of legitimate users behind one IP; key on the API key or user ID instead.
  6. Clients retrying without jitter. A thundering herd of synchronized retries hits the limit in lockstep forever. Use exponential backoff *plus random jitter* so retries spread out.
  7. Forgetting load shedding. A perfectly fair limiter still lets a healthy-but-saturated server accept more than it can handle. Add a server-side shed valve for when CPU/queue depth spikes.
backoff.ts
typescript
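// Graceful client backoff: honor Retry-After, then exponential + jitter.
async function fetchWithBackoff(url: string, max = 5): Promise<Response> {
  for (let attempt = 0; attempt < max; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res;
    if (attempt === max - 1) break; // no point sleeping before giving up

    // Retry-After is either a number of seconds or an HTTP date; handle both.
    const header = res.headers.get("Retry-After");
    const seconds = Number(header);
    const dateMs = header ? Date.parse(header) : NaN;
    const serverWait = !header ? 0
      : Number.isFinite(seconds) ? Math.max(0, seconds) * 1000
      : Number.isNaN(dateMs) ? 0
      : Math.max(0, dateMs - Date.now());

    const backoff = Math.min(2 ** attempt * 200, 10_000); // cap at 10s
    const jitter = Math.random() * backoff;               // spread the herd
    // Add jitter on top of the server's delay so clients that were told to
    // wait the same number of seconds do not all return at the same instant.
    await new Promise((r) => setTimeout(r, serverWait + jitter));
  }
  throw new Error("rate limited: gave up after retries");
}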
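Mistake 7 in the list above asks for a server-side shed valve. One simple form, sketched here, caps the number of requests in flight, which is a rough proxy for the CPU and queue-depth signals mentioned there. `LoadShedder` and `tryAcquire` are hypothetical names, and the right ceiling has to come from load testing your own service.

```typescript
// Caps concurrent work regardless of who is calling. This is not fairness,
// it is survival: when the server is full, new requests are turned away.
class LoadShedder {
  private inFlight = 0;
  private readonly maxInFlight: number;

  constructor(maxInFlight: number) {
    this.maxInFlight = maxInFlight;
  }

  // Returns a release function if the request is admitted, or null if it
  // should be shed. The release function is safe to call more than once.
  tryAcquire(): (() => void) | null {
    if (this.inFlight >= this.maxInFlight) return null;
    this.inFlight++;
    let released = false;
    return () => {
      if (released) return;
      released = true;
      this.inFlight--;
    };
  }
}

const shedder = new LoadShedder(2);
const first = shedder.tryAcquire();
const second = shedder.tryAcquire();
console.log(shedder.tryAcquire() === null); // true: a third concurrent request is shed
first?.();                                   // one request finishes...
console.log(shedder.tryAcquire() !== null); // true: ...and capacity frees up
second?.();
```

In a request handler you would call `tryAcquire()` first, answer shed requests with an overload status (503 Service Unavailable is the usual choice), and call the release function when the response finishes.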

Takeaways

The whole article in seven lines

  • Rate limiting protects fairness; throttling delays instead of rejecting; load shedding protects the server; quotas enforce the business deal.
  • There are only five algorithms; token bucket (bursts) and sliding-window-counter (cheap accuracy) cover almost everything.
  • Enforce at the edge/gateway by default; add in-app limits for expensive operations the gateway can't see.
  • Key on API key or user ID for authenticated traffic; fall back to IP only for anonymous endpoints.
  • Distributed means a shared store, and the check-and-update must be atomic: use a Lua script, not two round-trips.
  • Return 429 with Retry-After and emit X-RateLimit-* headers so good clients self-pace.
  • Decide your failure mode up front: fail open for availability, fail closed for sensitive endpoints.

Where to go next

Rate limiting is one layer of a larger reliability story. It lives next to the gateway, leans on the same shared cache, and is part of how systems stay up under load.
