Rate Limiting and Throttling: Protecting APIs from Abuse and Overload

On this page

Your API is only as polite as its rudest caller
Four words people use interchangeably (and shouldn't)
An analogy: the bouncer and the bucket
Where the limiter actually sits
The five algorithms, and what each one trades away
Build it: a distributed token bucket on Redis
Distributed limiting: one source of truth, please
Common mistakes that cost hours
Takeaways
Where to go next

TL;DR

Learn the five limiter algorithms and what each trades away, decide where in the stack to enforce them, and walk away able to build a distributed token bucket on Redis that returns clean 429s instead of falling over under abuse.

Your API is only as polite as its rudest caller

See it: traffic vs refill

[Interactive demo: token-bucket rate limiter. Sliders: capacity (burst) 10, refill rate 5/s, traffic 12/s. Readouts: tokens left, incoming requests, allowed, 429'd, pass rate.]

When traffic outruns the refill rate, the bucket empties and excess requests get 429'd, but short bursts still pass thanks to the capacity buffer.

Who this is for

You ship an HTTP API and a single client just hammered it 10,000 times in a second: a buggy retry loop, a scraper, or a customer who wrote while (true) { fetch() }. Everyone else's requests slow to a crawl. This article is for backend and SRE folks who want to understand the algorithms behind rate limiting, where to enforce them, and how to build a distributed limiter that survives more than one server.

The uncomfortable truth: without limits, your API's capacity is a shared resource that anyone can monopolize. One abusive caller, malicious or just careless, degrades the experience for every honest one. Rate limiting is how you turn 'first come, first served until we fall over' into 'everyone gets a fair, predictable slice'.

Rate limiting is not about punishing users. It is about protecting the API so that it stays available for all of them.
The whole point, in one line

Four words people use interchangeably (and shouldn't)

Before the algorithms, get the vocabulary straight. These terms get mixed up constantly, but they solve different problems and live in different layers.

Term	What it controls	Time horizon
Rate limiting	Requests per unit of time per caller	Seconds to minutes
Throttling	Slowing or queueing requests once a soft limit is hit (instead of rejecting)	Real-time, per request
Load shedding	Dropping requests when the server is overloaded, regardless of who sent them	During an incident
Quotas	Total allowance over a long billing/contract window	Day, month, plan tier

Rate limiting protects fairness; load shedding protects survival; quotas enforce business deals.

The clean mental split

Rate limiting and quotas are about the caller ('you get 100/min, 1M/month'). Load shedding is about the server ('I am at 95% CPU, start dropping the cheapest requests'). You need both: a healthy server still has to fend off one greedy client, and a fair limiter can't save a server that is genuinely on fire.

An analogy: the bouncer and the bucket

In the club	In the API
A nightclub bouncer letting people in one at a time	A limiter admitting requests at a controlled rate
The club's fire-code capacity (a hard ceiling)	Max concurrency / server capacity
A short queue outside that the bouncer lets trickle in	Throttling: delay instead of reject
A VIP wristband that allows a small burst of friends together	Token bucket: accumulated tokens allow a burst
A bucket with a hole that drains at a fixed pace no matter how fast you pour	Leaky bucket: smooths bursty input into a steady output

Token bucket forgives bursts; leaky bucket forces a smooth, constant outflow.

Hold the bucket image: a token bucket fills with tokens at a steady rate and each request spends one, so if you've been quiet, you've saved up tokens and can fire a burst. A leaky bucket is the opposite intuition: requests pour in, but they only leave through a fixed-size hole, so output is always smooth even when input spikes.
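To make the contrast concrete, here is a minimal in-memory sketch of both buckets. The class and method names are hypothetical, time is passed in explicitly as seconds, and the leaky bucket only models admission (how full the queue is), not the delayed dispatch of queued requests.

```typescript
// Illustrative only: single-process buckets with an explicit clock (seconds).

class TokenBucket {
  private tokens: number;
  private last: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, now = 0) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // a quiet caller starts with a full bucket
    this.last = now;
  }

  tryConsume(now: number): boolean {
    // Add the tokens earned since the last call, capped at capacity.
    const elapsed = Math.max(0, now - this.last);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.last = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

class LeakyBucket {
  private level = 0; // requests currently sitting in the bucket
  private last: number;
  private readonly capacity: number;
  private readonly leakPerSec: number;

  constructor(capacity: number, leakPerSec: number, now = 0) {
    this.capacity = capacity;
    this.leakPerSec = leakPerSec;
    this.last = now;
  }

  tryEnqueue(now: number): boolean {
    // Drain at the fixed leak rate, then accept only if there is room left.
    const elapsed = Math.max(0, now - this.last);
    this.level = Math.max(0, this.level - elapsed * this.leakPerSec);
    this.last = now;
    if (this.level + 1 > this.capacity) return false;
    this.level += 1;
    return true;
  }
}
```

With a capacity of 5 and a refill of 1/s, the token bucket admits five back-to-back requests and then one per second. The leaky bucket never lets work leave faster than its leak rate, whatever the input looks like.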

Where the limiter actually sits

A limiter is only useful if every relevant request flows through it and it can see a shared count. The common shape: requests hit the API gateway / edge first, the limiter checks a counter in a shared store (Redis), and only admitted requests reach your app.

The gateway checks a shared Redis counter before any request touches the app tier.

1. Identify the caller. Derive a key: API key, user ID, or client IP. The key decides whose bucket you're checking. Per-key is fairest; per-IP is the fallback for anonymous traffic.
2. Check the shared counter. Atomically read-and-update the caller's count in Redis. This must be atomic, or two concurrent requests both see 'under limit' and both pass.
3. Decide: allow, throttle, or reject. Under limit -> forward to the app. Over limit -> return 429 with a Retry-After header (or queue briefly if you throttle).
4. Tell the caller where they stand. Attach X-RateLimit-* headers on every response so well-behaved clients can self-pace before they ever get rejected.

Edge vs app vs per-user

Edge/gateway is the right default: it rejects abuse before it costs you compute, and it sees all traffic. In-app limiting is for expensive operations the gateway can't reason about (e.g. 'max 3 report exports per user'). Per-user/per-key is fairest for authenticated APIs; per-IP is your only lever for anonymous endpoints (but beware NAT and shared proxies bundling many users behind one IP).
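One way to express that preference order in code is a small key-derivation helper. This is a sketch under assumptions: the CallerInfo shape, the limiterKey name, the route scope, and the rl: key layout are all hypothetical.

```typescript
// Hypothetical helper: choose whose bucket a request is charged to.
interface CallerInfo {
  apiKey?: string;
  userId?: string;
  ip?: string;
}

function limiterKey(caller: CallerInfo, route = "global"): string {
  // Prefer the most specific authenticated identity; IP is the anonymous fallback.
  if (caller.apiKey) return `rl:${route}:key:${caller.apiKey}`;
  if (caller.userId) return `rl:${route}:user:${caller.userId}`;
  if (caller.ip) return `rl:${route}:ip:${caller.ip}`;
  // No identity at all: every such request shares a single bucket.
  return `rl:${route}:anonymous`;
}
```

Scoping the key by route lets an in-app limit such as report exports keep its own bucket, separate from the caller's general allowance at the gateway.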

The five algorithms, and what each one trades away

Every limiter you'll meet is one of these five. They differ mostly in how they handle bursts at the edge of a window and how much memory they cost per caller.

Fixed window: count requests per fixed clock interval (e.g. per calendar minute). Dead simple, one integer per caller. Flaw: a caller can fire a full limit at 0:59 and another full limit at 1:00, a 2x burst across the boundary.
Sliding window log: store a timestamp for every request, count how many fall in the last N seconds. Perfectly accurate, no boundary spikes. Cost: memory grows with request volume (one entry per request).
Sliding window counter: approximate the sliding window by weighting the previous window's count. Smooths the boundary problem with O(1) memory. The pragmatic favorite.
Token bucket: tokens refill at a steady rate up to a cap; each request spends one. Allows controlled bursts (up to the bucket size) while bounding the long-run average. What most APIs actually want.
Leaky bucket: a queue drained at a constant rate; overflow is rejected. Guarantees a smooth, constant output rate, great for protecting a fragile downstream that hates bursts.
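The boundary flaw of the fixed window, and the way a sliding-window counter softens it, are easier to see side by side. The sketch below is single-process and illustrative: the function names are hypothetical, a Map stands in for the shared store, and old windows are never evicted.

```typescript
// counts maps a window index (floor(now / windowSec)) to requests seen in it.

function fixedWindowAllows(
  counts: Map<number, number>,
  nowSec: number,
  limit: number,
  windowSec: number,
): boolean {
  const win = Math.floor(nowSec / windowSec);
  const used = counts.get(win) ?? 0;
  if (used >= limit) return false;
  counts.set(win, used + 1);
  return true;
}

function slidingCounterAllows(
  counts: Map<number, number>,
  nowSec: number,
  limit: number,
  windowSec: number,
): boolean {
  const win = Math.floor(nowSec / windowSec);
  // Fraction of the current window that has already elapsed (0..1).
  const elapsed = (nowSec - win * windowSec) / windowSec;
  const prev = counts.get(win - 1) ?? 0;
  const curr = counts.get(win) ?? 0;
  // Weight the previous window by how much of it still overlaps the sliding window.
  const estimate = prev * (1 - elapsed) + curr;
  if (estimate + 1 > limit) return false;
  counts.set(win, curr + 1);
  return true;
}
```

With a limit of 10 per 60 s, the fixed window admits 10 requests at t=59 and 10 more at t=60. The sliding counter admits the first 10, rejects everything at t=60, and has let only about half the budget back by t=90.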

Algorithm	Burst handling	Memory / accuracy
Fixed window	Allows 2x burst at window edges	1 int/caller; least accurate
Sliding window log	No burst leak; exact	O(requests); accurate but heavy
Sliding window counter	Smooths edges (approx)	O(1); good accuracy, cheap
Token bucket	Allows bursts up to bucket size	2 values/caller; bounds average
Leaky bucket	No bursts, constant outflow	Queue depth; smoothest output

Token bucket and sliding-window-counter are the two you'll reach for 90% of the time.
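For comparison, the exact but heavier sliding window log fits in a few lines. This is an in-memory sketch with a hypothetical function name; a Redis version would more likely keep the timestamps in a sorted set, which is not shown here.

```typescript
// log holds one timestamp (seconds) per admitted request, in ascending order.
function slidingLogAllows(
  log: number[],
  nowSec: number,
  limit: number,
  windowSec: number,
): boolean {
  // Drop entries that have aged out of the trailing window.
  while (log.length > 0 && log[0] <= nowSec - windowSec) log.shift();
  if (log.length >= limit) return false;
  log.push(nowSec); // memory cost: one entry per admitted request
  return true;
}
```

The array can grow to one entry per admitted request per caller, which is exactly the memory cost listed in the table above.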

Build it: a distributed token bucket on Redis

Here's a production-shaped token-bucket limiter in TypeScript. The whole check-and-decrement must be atomic, so it runs inside a single Redis Lua script. Doing INCR then EXPIRE in two round-trips opens a race where the key never expires; one script closes that gap.

rate-limit.ts

typescript

import { createClient } from "redis";

const redis = createClient();
await redis.connect();

// Atomic token-bucket refill-and-consume.
// KEYS[1] = bucket key   ARGV = [capacity, refillPerSec, now, cost]
const TOKEN_BUCKET = `
  local key      = KEYS[1]
  local capacity = tonumber(ARGV[1])
  local refill   = tonumber(ARGV[2])
  local now      = tonumber(ARGV[3])
  local cost     = tonumber(ARGV[4])

  local state  = redis.call('HMGET', key, 'tokens', 'ts')
  local tokens = tonumber(state[1])
  local ts     = tonumber(state[2])
  if tokens == nil then tokens = capacity; ts = now end

  -- refill based on elapsed time, capped at capacity
  local delta = math.max(0, now - ts)
  tokens = math.min(capacity, tokens + delta * refill)

  local allowed = 0
  if tokens >= cost then
    tokens = tokens - cost
    allowed = 1
  end

  redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
  redis.call('EXPIRE', key, math.ceil(capacity / refill) + 1)

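  -- Whole seconds until enough tokens exist for this request.
  -- Redis truncates Lua numbers to integers in replies, so round up here;
  -- returning a fraction such as 0.3 would reach the client as 0.
  local retry = 0
  if allowed == 0 then retry = math.ceil((cost - tokens) / refill) end
  return { allowed, math.floor(tokens), retry }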
`;

export interface LimitResult {
  allowed: boolean;
  remaining: number;
  retryAfter: number; // seconds
}

export async function rateLimit(
  key: string,
  capacity = 100,    // bucket size = max burst
  refillPerSec = 10, // steady-state rate
  cost = 1,
): Promise<LimitResult> {
  const now = Date.now() / 1000;
  const res = (await redis.eval(TOKEN_BUCKET, {
    keys: [`rl:${key}`],
    arguments: [String(capacity), String(refillPerSec), String(now), String(cost)],
  })) as [number, number, number];

  return { allowed: res[0] === 1, remaining: res[1], retryAfter: Math.ceil(res[2]) };
}

Now wire it into a request handler. The contract: set the X-RateLimit-* headers on every response, and on rejection return 429 with Retry-After so the client knows exactly how long to wait.

middleware.ts

typescript

import type { Request, Response, NextFunction } from "express";
import { rateLimit } from "./rate-limit";

export async function limiter(req: Request, res: Response, next: NextFunction) {
  // prefer an authenticated key; fall back to IP for anonymous traffic
  const key = (req.header("x-api-key") ?? req.ip) as string;
  const { allowed, remaining, retryAfter } = await rateLimit(key, 100, 10);

  res.setHeader("X-RateLimit-Limit", "100");
  res.setHeader("X-RateLimit-Remaining", String(Math.max(0, remaining)));
  res.setHeader("X-RateLimit-Reset", String(Math.ceil(Date.now() / 1000) + retryAfter));

  if (!allowed) {
    res.setHeader("Retry-After", String(retryAfter)); // seconds
    return res.status(429).json({
      error: "rate_limited",
      message: "Too many requests. Slow down and retry after the given delay.",
      retryAfter,
    });
  }
  next();
}
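The rateLimit function above accepts a cost argument that the middleware leaves at its default of 1. One possible use, sketched here with a hypothetical route table and made-up weights, is to charge expensive operations more tokens per call.

```typescript
// Hypothetical cost table: the routes and weights are illustrative, not from a real API.
const ROUTE_COSTS: Record<string, number> = {
  "GET /items": 1,
  "POST /search": 5,
  "POST /reports/export": 25,
};

function requestCost(method: string, path: string): number {
  // Routes that are not listed cost a single token.
  return ROUTE_COSTS[`${method.toUpperCase()} ${path}`] ?? 1;
}
```

The middleware could then pass requestCost(req.method, req.path) as the fourth argument to rateLimit, so a report export drains the bucket 25 times faster than a plain read.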

Headers are a contract, not decoration

X-RateLimit-Remaining lets a good client throttle itself *before* it hits the wall. Retry-After (seconds, or an HTTP date) on a 429 tells it exactly when to come back. Clients that respect these never see a hard rejection, which is the whole goal.
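On the client side, self-pacing can be as simple as spreading the remaining budget over the time left until the reset. The helper below is a sketch with a hypothetical name. It assumes X-RateLimit-Reset carries Unix epoch seconds, as in the middleware above; some APIs send a delta in seconds instead.

```typescript
// Hypothetical helper: how long to pause before sending the next request.
function selfPaceDelayMs(
  headers: { get(name: string): string | null },
  nowMs: number,
): number {
  const rawRemaining = headers.get("X-RateLimit-Remaining");
  const rawReset = headers.get("X-RateLimit-Reset");
  if (rawRemaining === null || rawReset === null) return 0; // server sent no hints

  const remaining = Number(rawRemaining);
  const resetSec = Number(rawReset); // assumption: Unix epoch seconds
  if (!Number.isFinite(remaining) || !Number.isFinite(resetSec)) return 0;

  const untilResetMs = Math.max(0, resetSec * 1000 - nowMs);
  if (remaining <= 0) return untilResetMs; // out of budget: wait for the reset
  // Spread the remaining requests evenly across the time left.
  return untilResetMs / (remaining + 1);
}
```

A client that sleeps for this long between calls slows down as its budget shrinks, rather than running at full speed into a 429.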

Distributed limiting: one source of truth, please

The moment you run more than one gateway instance, an in-memory counter is wrong: each instance only sees its own slice, so a '100/min' limit silently becomes '100/min times N instances'. The fix is a shared store (Redis), and that introduces two subtleties worth respecting.

Atomicity and clock skew

Atomicity: never do GET then SET (or INCR then EXPIRE) as separate calls. Two concurrent requests can both read 'under limit' and both pass, and a crash between INCR and EXPIRE leaves a counter with no TTL. Use a Lua script (one atomic execution), or create the key with its TTL in a single command and only then increment it.

Clock skew: if each gateway uses its own wall clock for token refill, drift between machines makes the limit fuzzy. Either pass time *into* Redis from one trusted source, or use Redis's own TIME command inside the script so every instance agrees on 'now'.

redis

# The classic two-call race for a fixed window. DON'T do this:
INCR rl:user:42        # returns 1 on the first hit of the window
EXPIRE rl:user:42 60   # if the process dies before this runs, the key lives forever

# Safer: create the counter together with its TTL in one atomic command, then count.
SET rl:user:42 0 EX 60 NX   # NX = only if not exists; starts at 0 so the first hit isn't counted twice
INCR rl:user:42             # every hit, including the first, bumps the counter
# A key that expires between these two calls is recreated by INCR without a TTL;
# a Lua script closes that last gap.

Two more realities. First, Redis is now in your hot path, so plan for it being unavailable. A common policy is to fail open (allow traffic if the limiter store is down, choosing availability over protection) for user-facing reads, and to fail closed for sensitive or expensive endpoints. Second, the same Redis instance often does double duty as a cache (see caching strategies), so size and isolate it deliberately.
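The fail-open versus fail-closed choice can be isolated in a small wrapper around the limiter call. This is a sketch, not a definitive implementation: checkWithPolicy, the Decision shape, and the 50 ms timeout are assumptions for illustration.

```typescript
type FailurePolicy = "open" | "closed";

interface Decision {
  allowed: boolean;
  degraded: boolean; // true when the limiter store failed or timed out
}

// Hypothetical wrapper: `check` is any async limiter call that resolves to allow/deny.
async function checkWithPolicy(
  check: () => Promise<boolean>,
  policy: FailurePolicy,
  timeoutMs = 50,
): Promise<Decision> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("limiter timeout")), timeoutMs);
  });
  try {
    // A slow store is treated the same as a store that is down.
    const allowed = await Promise.race([check(), timeout]);
    return { allowed, degraded: false };
  } catch {
    // Fail open: let traffic through. Fail closed: reject it.
    return { allowed: policy === "open", degraded: true };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

The degraded flag is there so the caller can log or alert on limiter outages instead of silently running unprotected.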

Common mistakes that cost hours

1. Counting in memory behind a load balancer. Each instance limits independently, so your real limit is N times what you configured. Move the counter to a shared store the day you add a second instance.
2. Non-atomic check-and-update. GET/SET or INCR/EXPIRE as two calls races under load. Use one Lua script or SET ... NX so the limit is actually enforced.
3. Returning the wrong status code. Use 429 Too Many Requests, not 403, not 500. The status is how clients and CDNs know to back off rather than treat it as a bug.
4. Omitting `Retry-After`. Without it, clients guess, usually by retrying immediately and making the overload worse. Always tell them when to come back.
5. Rate limiting purely by IP on authenticated APIs. NAT and corporate proxies put thousands of legitimate users behind one IP; key on the API key or user ID instead.
6. Clients retrying without jitter. A thundering herd of synchronized retries hits the limit in lockstep forever. Use exponential backoff *plus random jitter* so retries spread out.
7. Forgetting load shedding. A perfectly fair limiter still lets a healthy-but-saturated server accept more than it can handle. Add a server-side shed valve for when CPU/queue depth spikes.
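For the last item, a shed valve can start as a pure decision function over a load signal. The sketch below uses in-flight request count as that signal. The priority tiers and the 80% and 100% thresholds are illustrative assumptions, not recommendations.

```typescript
// Hypothetical shed valve: priorities and thresholds are illustrative.
type Priority = "critical" | "normal" | "background";

function shouldShed(inFlight: number, maxInFlight: number, priority: Priority): boolean {
  const load = inFlight / maxInFlight;
  if (load >= 1) return priority !== "critical";     // saturated: keep only critical work
  if (load >= 0.8) return priority === "background"; // getting hot: drop the cheapest traffic first
  return false;                                       // healthy: shed nothing
}
```

Unlike the limiter, this check ignores who sent the request. It only asks whether the server can afford it right now, and a shed request would typically get a 503 rather than a 429.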

backoff.ts

typescript

// Graceful client backoff: honor Retry-After, then exponential + jitter.
async function fetchWithBackoff(url: string, max = 5): Promise<Response> {
  for (let attempt = 0; attempt < max; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res;

    // Retry-After is either delay-seconds or an HTTP date; handle both.
    const header = res.headers.get("Retry-After");
    const secs = Number(header);
    const serverWait = !header
      ? 0
      : Number.isFinite(secs)
        ? secs * 1000
        : Math.max(0, Date.parse(header) - Date.now()) || 0; // unparseable -> 0
    const backoff = Math.min(2 ** attempt * 200, 10_000); // cap at 10s
    const jitter = Math.random() * backoff;               // spread the herd
    // Never retry earlier than the server asked; jitter on top avoids lockstep retries.
    await new Promise((r) => setTimeout(r, serverWait + jitter));
  }
  throw new Error("rate limited: gave up after retries");
}

Takeaways

The whole article in seven lines

Rate limiting protects fairness; throttling delays instead of rejecting; load shedding protects the server; quotas enforce the business deal.
There are only five algorithms; token bucket (bursts) and sliding-window counter (cheap accuracy) cover almost everything.
Enforce at the edge/gateway by default; add in-app limits for expensive operations the gateway can't see.
Key on API key or user ID for authenticated traffic; fall back to IP only for anonymous endpoints.
Distributed means a shared store, and the check-and-update must be atomic: use a Lua script, not two round-trips.
Return 429 with Retry-After and emit X-RateLimit-* headers so good clients self-pace.
Decide your failure mode up front: fail open for availability, fail closed for sensitive endpoints.

Where to go next

Rate limiting is one layer of a larger reliability story. It lives next to the gateway, leans on the same shared cache, and is part of how systems stay up under load.

API gateways and the edge: where the limiter most naturally lives, alongside auth and routing.
Caching strategies: the same Redis often backs both; learn to size and isolate it.
Scalability principles: limiting is one tool; horizontal scaling, queues, and backpressure are the rest of the kit.
Hands-on: practice the request path in the networking lab and tighten config at the DNS lab.

Check your understanding

1. How does the article distinguish rate limiting/quotas from load shedding?

2. What is the difference between a token bucket and a leaky bucket as the article describes them?

Frequently asked questions

Where should I enforce rate limits, at the gateway or in the app?

The edge or gateway is the right default because it rejects abuse before it costs you compute and it sees all traffic. In-app limiting is for expensive operations the gateway cannot reason about, such as capping report exports per user.

What is the difference between a token bucket and a leaky bucket?

A token bucket fills with tokens at a steady rate and each request spends one, so a quiet caller saves up tokens and can fire a burst. A leaky bucket is the opposite intuition: requests pour in but only leave through a fixed-size hole, so output stays smooth even when input spikes.

Should I limit per IP or per user?

Per-user or per-key limiting is fairest for authenticated APIs because it ties usage to a real account. Per-IP is your only lever for anonymous endpoints, but beware that NAT and shared proxies can bundle many real users behind one IP.

How is rate limiting different from load shedding?

Rate limiting and quotas are about the caller, for example 100 requests per minute. Load shedding is about the server protecting itself, for example dropping the cheapest requests when CPU hits 95%. You need both, since a healthy server still has to fend off a greedy client and a fair limiter cannot save a server that is genuinely on fire.
