Consistent hashing maps both keys and nodes onto a logical ring so that adding or removing a server moves only a tiny fraction of data. It is the foundational partitioning strategy inside Amazon Dynamo, Apache Cassandra, Akamai CDN, Discord, and dozens of other systems that must scale horizontally without downtime. Understanding the ring, virtual nodes, replication placement, and alternative algorithms like jump hashing and rendezvous hashing is essential for any system design interview.
The simplest way to distribute keys across N servers is to compute hash(key) mod N and send the request to that server. This works perfectly when N is fixed, but it falls apart the moment you add or remove a node.
| Key | hash(key) | hash mod 3 | hash mod 4 |
|---|---|---|---|
| user:1001 | 7 | Server 1 | Server 3 |
| user:1002 | 13 | Server 1 | Server 1 |
| user:1003 | 22 | Server 1 | Server 2 |
| user:1004 | 35 | Server 2 | Server 3 |
| user:1005 | 42 | Server 0 | Server 2 |
| user:1006 | 58 | Server 1 | Server 2 |
When N changes from 3 to 4, most keys land on a different server. For a cluster with 100 million cached keys, that means roughly 75 million cache misses simultaneously, slamming the database with a thundering herd of requests.
The mod-N rehash storm
With naive modular hashing, adding one server to an N-node cluster remaps approximately N/(N+1) of all key mappings. For N = 10 that is about 91% of keys; for N = 100, about 99%. This makes elastic scaling practically impossible without a full cache warm-up cycle.
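The rehash storm is easy to demonstrate. The sketch below uses md5 as a stable stand-in hash (key names and counts are illustrative) and counts how many keys change servers when a 3-node cluster grows to 4:

```python
import hashlib

def bucket(key: str, n: int) -> int:
    # Naive placement: hash mod N (md5 as a stable stand-in hash).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if bucket(k, 3) != bucket(k, 4))
print(f"{moved / len(keys):.1%} of keys changed servers")  # ~75%
```

A key keeps its server only when its hash gives the same remainder mod 3 and mod 4, which happens for roughly a quarter of uniform hashes — hence about three-quarters of keys move.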
Consistent hashing solves the rehashing problem by placing both keys and nodes on a circular hash space — a ring from 0 to 2^32 - 1 (or any large integer range). Each node is hashed to a position on the ring, and each key is assigned to the first node encountered when walking clockwise from the key's hash position.
graph TD
subgraph Ring["Hash Ring (0 to 2^32 - 1)"]
direction LR
A["Node A
position: 0x1A3F"] --> B["Node B
position: 0x5B72"]
B --> C["Node C
position: 0x9DE1"]
C --> A
end
K1["key: user:1001
hash: 0x0F22"] -.->|"clockwise →"| A
K2["key: user:1002
hash: 0x4C11"] -.->|"clockwise →"| B
K3["key: user:1003
hash: 0x7BAA"] -.->|"clockwise →"| C

Keys are assigned to the next node clockwise on the ring. Adding or removing a node only affects keys in the arc between it and its predecessor.
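The clockwise lookup is typically implemented as a binary search over sorted node positions. A minimal sketch, using md5 as a stand-in hash (production systems favor faster hashes like murmur3) and illustrative node names:

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    # Stable stand-in hash mapping strings onto a 2^32 ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, nodes):
        points = sorted((ring_pos(n), n) for n in nodes)
        self._positions = [p for p, _ in points]
        self._nodes = [n for _, n in points]

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position; wrap past 2^32 - 1.
        i = bisect.bisect_right(self._positions, ring_pos(key))
        return self._nodes[i % len(self._nodes)]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:1001"))
```

Lookups cost one hash plus an O(log N) binary search; the wrap-around modulo handles keys that hash past the highest node position.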
Worked example: adding a node
1. Start with nodes A (pos 10), B (pos 50), C (pos 90) on a ring of size 100.
2. Key K with hash 35 maps to B (first node clockwise after 35).
3. Add node D at position 40. Only keys in the arc (10, 40] now move from B to D.
4. Key K (hash 35) falls in (10, 40], so it moves to D. Key with hash 45 stays on B.
5. Result: only keys between D's predecessor (A at 10) and D (at 40) are reassigned — roughly 1/N of total keys.
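The worked example can be checked in a few lines; the node positions and ring size of 100 are taken straight from the steps above:

```python
def owner(key_hash: int, node_positions: dict[str, int]) -> str:
    # First node clockwise from the key on a ring of size 100; wrap at the top.
    for name, pos in sorted(node_positions.items(), key=lambda kv: kv[1]):
        if key_hash <= pos:
            return name
    return min(node_positions, key=node_positions.get)  # wrap to lowest position

before = {"A": 10, "B": 50, "C": 90}
after = {"A": 10, "B": 50, "C": 90, "D": 40}

print(owner(35, before))  # B — first node clockwise after 35
print(owner(35, after))   # D — 35 falls in the new arc (10, 40]
print(owner(45, after))   # B — outside the arc, unaffected
```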
The 1/N rule
When a node is added to a ring with N existing nodes, only approximately 1/N of all keys need to move. When a node is removed, only the keys that were assigned to that node move to its clockwise successor. This is the fundamental property that makes consistent hashing practical for elastic scaling.
The mathematical guarantee: with K keys and N nodes, adding one node moves only about K/N keys in expectation. Compare this to mod-N hashing, where adding one node remaps nearly all K keys.
With only physical node positions on the ring, hash randomness means that arcs between nodes will be unequal. One node might own 50% of the ring while another owns only 10%. Virtual nodes (vnodes) solve this by placing multiple hash positions per physical node.
How virtual nodes work
Instead of hashing each node once, hash it V times (e.g., "NodeA-0", "NodeA-1", ..., "NodeA-149"). Each virtual node is an independent point on the ring. The physical node owns all keys assigned to any of its virtual nodes. With V = 150 per node and 10 physical nodes, you have 1,500 points on the ring, giving a much more uniform distribution.
| Vnodes per node | Std dev of load (% of ideal) | Max/min load ratio |
|---|---|---|
| 1 (no vnodes) | ~80% | 5–10x |
| 10 | ~25% | ~2x |
| 50 | ~10% | ~1.4x |
| 100 | ~7% | ~1.2x |
| 150 | ~5.8% | ~1.15x |
| 200 | ~5% | ~1.12x |
| 500 | ~3% | ~1.07x |
The standard deviation of load decreases proportionally to 1/sqrt(V), where V is the number of virtual nodes per physical server. Cassandra defaults to 256 vnodes per node. Amazon Dynamo originally used 150. Most production systems use between 100 and 256.
Choosing the right vnode count
Start with 128 or 256 vnodes per node. Fewer than 50 produces visible skew; more than 500 adds memory overhead to the ring metadata without meaningful distribution improvement. Monitor the coefficient of variation (std dev / mean) of per-node key counts — aim for below 10%.
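The effect of the vnode count on balance can be measured directly. A small simulation (md5 stand-in hash, 10 physical nodes, illustrative key counts — exact numbers will differ from the indicative table above):

```python
import bisect
import hashlib
from statistics import mean, pstdev

def ring_pos(s: str) -> int:
    # Stable stand-in hash onto a 2^32 ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

def build_ring(nodes, vnodes):
    # Each physical node contributes `vnodes` independent ring positions.
    pts = sorted((ring_pos(f"{n}-{v}"), n) for n in nodes for v in range(vnodes))
    return [p for p, _ in pts], [n for _, n in pts]

def owner(key, positions, names):
    i = bisect.bisect_right(positions, ring_pos(key))
    return names[i % len(names)]

nodes = [f"node-{i}" for i in range(10)]
keys = [f"k{i}" for i in range(50_000)]
cv_by_v = {}
for v in (1, 10, 100):
    positions, names = build_ring(nodes, v)
    counts = {n: 0 for n in nodes}
    for k in keys:
        counts[owner(k, positions, names)] += 1
    # Coefficient of variation of per-node load: lower means more even.
    cv_by_v[v] = pstdev(list(counts.values())) / mean(counts.values())
    print(f"vnodes={v:3d}  load CV={cv_by_v[v]:.2f}")
```

The coefficient of variation should shrink roughly as 1/sqrt(V), matching the rule of thumb in the text.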
Handling heterogeneous hardware with weighted vnodes
Vnodes also make mixed fleets easy to weight: give each physical node a vnode count proportional to its capacity, so a machine with twice the disk and CPU receives twice the ring positions and therefore roughly twice the keys.
Consistent hashing naturally supports replication. To store R replicas of each key, simply walk clockwise from the key's position and place copies on the next R distinct physical nodes encountered. This is exactly how Amazon Dynamo and Cassandra implement their replication strategy.
Replication placement walkthrough (R = 3)
1. Hash the key to position P on the ring.
2. Walk clockwise from P. The first physical node encountered is the coordinator (primary replica).
3. Continue clockwise. The second distinct physical node becomes replica 2. Skip virtual nodes belonging to the same physical node.
4. Continue clockwise. The third distinct physical node becomes replica 3.
5. The coordinator handles the write and forwards it to replicas 2 and 3 in parallel.
6. With a consistency level of QUORUM, the write succeeds when 2 of 3 replicas acknowledge (W = floor(R/2) + 1).
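The walkthrough above can be sketched as a preference-list function. This is an illustrative model, not Dynamo's or Cassandra's actual code; the vnode count, hash, and node names are assumptions:

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    # Stable stand-in hash onto a 2^32 ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

def preference_list(key: str, nodes: list[str], vnodes: int = 8, rf: int = 3) -> list[str]:
    """Walk clockwise from the key's position, collecting the first `rf`
    DISTINCT physical nodes (skipping further vnodes of nodes already chosen)."""
    assert rf <= len(nodes), "cannot place more replicas than physical nodes"
    points = sorted((ring_pos(f"{n}-{v}"), n) for n in nodes for v in range(vnodes))
    positions = [p for p, _ in points]
    replicas: list[str] = []
    i = bisect.bisect_right(positions, ring_pos(key))
    while len(replicas) < rf:
        node = points[i % len(points)][1]
        if node not in replicas:
            replicas.append(node)
        i += 1
    return replicas  # replicas[0] is the primary/coordinator

print(preference_list("user:1001", [f"node-{i}" for i in range(5)]))
```

A rack-aware variant would additionally skip nodes whose rack is already represented, which is the trap discussed next.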
The rack-awareness trap
If all three replicas land on servers in the same rack, a single top-of-rack switch failure loses all copies. Production systems like Cassandra use a rack-aware or datacenter-aware replication strategy: the next R nodes must be in distinct racks (or distinct availability zones). This is configured via the NetworkTopologyStrategy with replication_factor per datacenter.
Consistency levels and the ring
The relationship W + R > N (where W = write consistency, R = read consistency, N = replication factor) is the fundamental formula guaranteeing that reads see the latest write. This is the mathematical backbone of tunable consistency in ring-based systems.
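The overlap argument is mechanical enough to encode directly; a minimal sketch:

```python
def quorum(n: int) -> int:
    # Majority quorum size for replication factor n (e.g. QUORUM in Cassandra).
    return n // 2 + 1

def reads_see_latest_write(w: int, r: int, n: int) -> bool:
    # W + R > N guarantees every read quorum intersects every write quorum,
    # so at least one queried replica holds the acknowledged write.
    return w + r > n

assert reads_see_latest_write(quorum(3), quorum(3), 3)  # 2 + 2 > 3
assert not reads_see_latest_write(1, 1, 3)              # ONE/ONE can miss it
```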
Ring-based consistent hashing is not the only approach. Two important alternatives solve specific problems that the ring does not handle well.
Jump consistent hashing
Jump consistent hashing, published by Google engineers Lamping and Veach in 2014, uses a pseudo-random sequence derived from the key to "jump" through bucket assignments. It produces a perfectly uniform distribution with zero memory overhead (no ring to store). The algorithm runs in O(ln N) time and requires only a key and a bucket count. The downside: it only supports appending or removing the last server. You cannot remove an arbitrary server without remapping many keys.
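The full algorithm fits in a few lines; the magic constant below is the one from the Lamping–Veach paper. The demo (illustrative sequential keys) shows that growing the bucket count only reassigns keys into the new bucket:

```python
def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash: maps a 64-bit integer key to a bucket in
    [0, num_buckets) with no stored state, in expected O(ln N) steps."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step from the published algorithm.
        key = (key * 2862933555777941757 + 1) % (1 << 64)
        j = int((b + 1) * (1 << 31) / ((key >> 33) + 1))
    return b

# Growing from 10 to 11 buckets reassigns only keys landing in the new bucket:
moved = sum(1 for k in range(100_000) if jump_hash(k, 10) != jump_hash(k, 11))
print(f"{moved / 100_000:.1%} moved")  # roughly 1/11
```

Note the limitation from the text in action: buckets are identified only by index, so you can grow or shrink at the end but cannot cheaply remove bucket 3 from a 10-bucket cluster.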
| Property | Ring-based consistent hashing | Jump consistent hashing | Rendezvous (HRW) hashing |
|---|---|---|---|
| Distribution uniformity | Good with 100+ vnodes | Perfect (provably optimal) | Near-perfect |
| Memory overhead | O(N * V) ring entries | O(1) — no stored state | O(1) — no stored state |
| Lookup time | O(log(N * V)) binary search | O(ln N) computation | O(N) — must check all nodes |
| Arbitrary node removal | Yes — only K/N keys move | No — only supports removing the last bucket | Yes — only K/N keys move |
| Weighted nodes | Yes — vary vnodes per node | Difficult | Yes — multiply weight into hash score |
| Use cases | General-purpose distributed systems | Append-only clusters, sharded databases with sequential IDs | DNS, CDN routing, small-to-medium clusters |
Rendezvous hashing (Highest Random Weight)
Rendezvous hashing, also called HRW (Highest Random Weight), takes a different approach: for each key, compute a score for every node as hash(key, node_id) and assign the key to the node with the highest score. When a node is removed, each key that was on that node simply goes to the node with the next-highest score — exactly K/N keys move. The downside is O(N) lookup time since every node must be scored, making it impractical for clusters with thousands of nodes. It is widely used in CDN request routing (Akamai), distributed caching with small clusters, and Windows DHCP server load balancing.
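A sketch of HRW, with md5 as the scoring hash and illustrative node names:

```python
import hashlib

def hrw_owner(key: str, nodes: list[str]) -> str:
    """Rendezvous (highest-random-weight) hashing: score every node
    against the key, pick the highest. O(len(nodes)) per lookup."""
    def score(node: str) -> int:
        return int(hashlib.md5(f"{node}:{key}".encode()).hexdigest(), 16)
    return max(nodes, key=score)

nodes = ["cache-a", "cache-b", "cache-c", "cache-d"]
print(hrw_owner("user:1001", nodes))

# Removing a node relocates only the keys that node owned, because the
# highest score among the survivors is unchanged for every other key:
survivors = nodes[:-1]
keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys
            if hrw_owner(k, nodes) != hrw_owner(k, survivors))
print(f"{moved / len(keys):.1%} moved")  # ~1/4 with 4 nodes
```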
Consistent hashing is not just a textbook concept — it is the core data placement algorithm in some of the most critical infrastructure on the internet, including Amazon Dynamo, Apache Cassandra, the Akamai CDN, and Discord's chat infrastructure.
Cassandra's implementation of consistent hashing is one of the most battle-tested in production. Understanding how it works at the token level is valuable for both operations and interviews.
How a write finds its replicas in Cassandra
1. The client sends a write with a partition key (e.g., user_id = "alice").
2. The coordinator node computes token = murmur3("alice") = -4567891234567890 (a 64-bit signed integer).
3. The coordinator consults the token ring metadata to find which node owns the range containing this token.
4. The owning node is the primary replica. The coordinator walks clockwise on the ring to find the next RF - 1 distinct rack/DC nodes for additional replicas.
5. With consistency level QUORUM and RF = 3, the coordinator sends the write to all 3 replicas and waits for 2 acknowledgments.
6. Each replica writes to its commit log (for durability), then to its memtable (in-memory sorted structure), and eventually flushes to SSTables on disk.
Choosing partition keys for even distribution
The partition key determines which token range (and therefore which node) handles the data. A poor partition key (e.g., country_code with only ~200 values) creates hot spots. A good partition key (e.g., user_id with millions of distinct values) distributes evenly. Use the nodetool tablestats and nodetool ring commands to check for partition skew. If one node has significantly more data than others, your partition key is too coarse.
| Operation | Token ring involvement | Cost |
|---|---|---|
| Point read (partition key lookup) | Hash to token, find owning node, read from RF nodes | O(1) node lookups, O(log N) ring search |
| Range scan (token range) | Contact all nodes owning tokens in the range | O(nodes in range) — can be expensive |
| Node bootstrap (join) | New node picks token ranges, streams data from existing owners | Background streaming, minutes to hours depending on data size |
| Node decommission (leave) | Leaving node streams its ranges to successors | Background streaming, similar cost to bootstrap |
| Repair (anti-entropy) | Compare Merkle trees between replicas for each token range | CPU and I/O intensive, schedule during low-traffic periods |
Cassandra 4.0 introduced transient replication, where some replicas only hold data temporarily for availability during writes. This reduces storage costs while maintaining the same write availability guarantees of the consistent hashing ring.
Consistent hashing looks elegant on a whiteboard, but running it in production reveals subtle failure modes that can catch teams off guard.
Hot partitions from celebrity keys
Even with perfect ring distribution, a single key that receives millions of reads per second (e.g., a viral tweet, a popular product page) creates a hot partition on one node. Consistent hashing distributes keys evenly across nodes but does not distribute load within a key. Solutions include read replicas, caching layers in front of the ring, or key splitting (appending a random suffix to distribute one logical key across multiple physical keys).
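Key splitting for a read-hot key can be sketched as follows. The split count and suffix scheme are illustrative assumptions; writes copy the value under every split so that any split can serve a read:

```python
import random

SPLITS = 8  # physical splits per hot logical key (illustrative; tune per workload)

def cache_write_keys(logical_key: str) -> list[str]:
    # Writes fan out: copy the value under every suffixed key, so each
    # split (living at its own ring position) is a full replica.
    return [f"{logical_key}#{i}" for i in range(SPLITS)]

def cache_read_key(logical_key: str) -> str:
    # Each reader picks a random split, spreading a read-hot key over
    # SPLITS independent ring positions — and therefore SPLITS nodes.
    return f"{logical_key}#{random.randrange(SPLITS)}"

print(cache_read_key("tweet:viral-123"))
```

The trade-off is write amplification (SPLITS writes per update) in exchange for dividing read load by roughly SPLITS.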
The split-brain risk during node replacement
When replacing a node (remove old, add new), there is a window where the old node is still serving reads while the new node is bootstrapping. If both nodes think they own the same token range, clients may read stale data from the old node while writes go to the new one. Always decommission the old node fully before bootstrapping the replacement, or use Cassandra's "replace node" operation that handles this atomically.
Capacity planning formula
When adding a node to an N-node cluster, expect to transfer approximately (total_data_size / (N + 1)) bytes of data. With 10 nodes holding 1 TB each (10 TB total), adding one node transfers roughly 910 GB (10 TB / 11). At 100 MB/s streaming throughput, this takes about 2.5 hours. Plan your maintenance windows accordingly.
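The same arithmetic as a reusable helper; the sizes and throughput are the illustrative figures from the paragraph above:

```python
def rebalance_estimate(total_bytes: float, n_existing: int,
                       stream_mb_per_s: float = 100.0) -> tuple[float, float]:
    """Estimate bytes moved and hours taken when one node joins an
    N-node ring (figures are planning approximations, not guarantees)."""
    moved = total_bytes / (n_existing + 1)          # new node's ~1/(N+1) share
    hours = moved / (stream_mb_per_s * 1e6) / 3600  # at the given throughput
    return moved, hours

moved, hours = rebalance_estimate(10e12, 10)  # 10 TB total across 10 nodes
print(f"{moved / 1e9:.0f} GB over {hours:.1f} hours")  # 909 GB over 2.5 hours
```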
Consistent hashing appears in almost every system design interview that involves distributed data. Knowing when to bring it up and how deep to go separates senior candidates from juniors.
How to present consistent hashing in an interview
1. Mention it at the right moment: when the interviewer asks "how do you distribute data across servers" or when you identify a sharding need in your design.
2. Start with the problem: "Naive mod-N hashing moves nearly all keys when we add a server. We need a scheme where adding a node moves only 1/N of the data."
3. Draw the ring: place 3-4 nodes on a circle, hash a few keys, show which node each key maps to.
4. Show the add-node scenario: insert a new node, demonstrate that only keys in one arc move.
5. Mention virtual nodes: "In practice, each server gets 100-256 virtual positions on the ring to ensure even distribution."
6. Connect to replication: "We replicate to the next R distinct nodes clockwise, using rack-awareness to avoid correlated failures."
7. Discuss trade-offs: "For very large clusters, we could consider jump hashing for better uniformity, but ring-based is more flexible for arbitrary node removal."
Depth signals that impress interviewers
Mentioning the W + R > N consistency formula, explaining why Cassandra uses 256 vnodes by default, discussing the Merkle tree anti-entropy mechanism, or comparing ring-based hashing to rendezvous hashing shows that you understand the concept at an implementation level, not just a whiteboard level. These details distinguish a "read the blog post" answer from a "built and operated it" answer.
Common interview contexts where consistent hashing applies
Consistent hashing is one of the most frequently tested concepts in system design interviews at FAANG companies. It appears whenever the design requires horizontal data partitioning: distributed caches, key-value stores, CDNs, chat systems, or any system where data must be spread across multiple servers. Interviewers expect candidates to explain why mod-N fails, draw the hash ring, introduce virtual nodes unprompted, and connect consistent hashing to replication and consistency guarantees. Senior candidates are expected to discuss trade-offs with jump hashing or rendezvous hashing and cite real systems like Dynamo and Cassandra.
Key takeaways
When a new node is added to a consistent hash ring with N existing nodes, approximately what fraction of keys need to be remapped, and why?
Approximately 1/N of all keys need to move. The new node takes ownership of the arc between itself and its predecessor on the ring. Only keys that hash into this arc (which is approximately 1/N of the total ring space) migrate from the existing successor node to the new node. All other keys remain on their current nodes.
Why do production systems use 100-256 virtual nodes per physical node instead of placing each node once on the ring?
With only one position per node, hash randomness creates highly uneven arc sizes — one node might own 50% of the ring while another owns 5%. Virtual nodes place each physical node at 100-256 random positions, breaking the ring into many small arcs. By the law of large numbers, the total arc length owned by each physical node converges to 1/N of the ring. The standard deviation of load decreases proportionally to 1/sqrt(V) where V is the vnode count.
Explain the W + R > N formula and how it relates to consistent hashing with replication factor R.
W is the number of replicas that must acknowledge a write, R is the number of replicas that must respond to a read, and N is the total replication factor. When W + R > N, at least one replica that acknowledged the write will be among those queried during the read, guaranteeing the read sees the latest write. On a consistent hash ring with N = 3, using W = 2 (QUORUM write) and R = 2 (QUORUM read) gives W + R = 4 > 3, ensuring strong consistency.
💡 Analogy
Imagine a circular conveyor belt in a warehouse with pickup stations evenly spaced around it. Packages (keys) are placed on the belt at a position determined by their barcode (hash). Each package rides the belt clockwise until it reaches the next pickup station (node), where a worker grabs it. If you add a new pickup station between two existing ones, only the packages between the previous station and the new one change hands — every other worker keeps their existing packages. If a station closes, its packages simply ride past to the next open station. Virtual nodes are like each worker operating multiple stations around the belt, ensuring no single worker is responsible for a disproportionately long stretch of belt.
⚡ Core Idea
Consistent hashing maps both data keys and server nodes onto the same circular hash space. Each key is assigned to the nearest node clockwise on the ring. This guarantees that adding or removing a node disrupts only K/N keys (where K is total keys and N is total nodes), compared to mod-N hashing which disrupts nearly all keys. Virtual nodes (100-256 per server) smooth out distribution, and walking clockwise for R distinct nodes provides natural replication placement.
🎯 Why It Matters
Every distributed system that stores state — caches, databases, message brokers, CDNs — must decide which server owns which data. Without consistent hashing, scaling up or down triggers massive data migration, cache invalidation storms, and potential outages. With it, you can add capacity smoothly, survive node failures gracefully, and reason about data placement deterministically. It is the single most important partitioning algorithm in distributed systems and appears in virtually every system design interview.