The Simplified Tech

© 2026 TheSimplifiedTech. All rights reserved.
Interactive Explainer

Consistent hashing: the algorithm behind every scalable distributed system

Consistent hashing maps both keys and nodes onto a logical ring so that adding or removing a server moves only a tiny fraction of data. It is the foundational partitioning strategy inside Amazon Dynamo, Apache Cassandra, Akamai CDN, Discord, and dozens of other systems that must scale horizontally without downtime. Understanding the ring, virtual nodes, replication placement, and alternative algorithms like jump hashing and rendezvous hashing is essential for any system design interview.

🎯 Key Takeaways

  • Consistent hashing maps keys and nodes onto the same ring, so adding or removing a node moves only approximately 1/N of the keys instead of nearly all of them as with mod-N hashing.
  • Virtual nodes (100-256 per physical server) are essential in production to smooth out distribution skew caused by hash randomness, reducing load variance from 5-10x down to within 15%.
  • Replication on the ring works by walking clockwise to the next R distinct physical nodes, with rack-awareness to prevent correlated failures from taking out all replicas.
  • Jump consistent hashing offers perfect uniformity with zero memory overhead but cannot handle arbitrary node removal; rendezvous hashing supports arbitrary removal but has O(N) lookup cost — choose based on your cluster dynamics.
  • The rate of key migration matters as much as the quantity: moving K/N keys over 15 minutes is smooth; moving K/N keys in 5 seconds causes cascading failures from memory pressure and cache miss storms.


~19 min read


The problem with naive mod-N hashing

The simplest way to distribute keys across N servers is to compute hash(key) mod N and send the request to that server. This works perfectly when N is fixed, but it falls apart the moment you add or remove a node.

Key       | hash(key) | hash mod 3 | hash mod 4
user:1001 | 7         | Server 1   | Server 3
user:1002 | 13        | Server 1   | Server 1
user:1003 | 22        | Server 1   | Server 2
user:1004 | 35        | Server 2   | Server 3
user:1005 | 42        | Server 0   | Server 2
user:1006 | 58        | Server 1   | Server 2

When N changes from 3 to 4, most keys land on a different server. For a cluster with 100 million cached keys, that means roughly 75 million cache misses simultaneously, slamming the database with a thundering herd of requests.

The mod-N rehash storm

With naive modular hashing, adding one server to an N-node cluster invalidates approximately N/(N+1) of all key mappings. For N = 3 that is 75% of keys; for N = 10, about 91%; for N = 100, roughly 99%. This makes elastic scaling practically impossible without a full cache warm-up cycle.
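The remapping fraction is easy to verify empirically. The sketch below is an illustration, not code from any library: hash values are simulated as a uniform range of integers, and we count how many keys change servers when a mod-N cluster grows from 3 to 4 nodes.

```python
# Count keys whose server assignment changes when a mod-N cluster
# grows from n_before to n_after nodes. Hash values are simulated
# as uniformly distributed integers.
def moved_fraction(num_keys: int, n_before: int, n_after: int) -> float:
    moved = sum(1 for h in range(num_keys) if h % n_before != h % n_after)
    return moved / num_keys

print(f"{moved_fraction(100_000, 3, 4):.1%} of keys change servers")  # ~75%
```

A key stays put only when its hash gives the same remainder mod 3 and mod 4, which happens for just a quarter of the hash space.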

Consequences of mod-N hashing in production

  • Thundering herd — Millions of cache misses hit the database simultaneously, often causing cascading failures.
  • Extended warm-up — The new node configuration needs minutes to hours of cache population before it reaches steady-state hit rates.
  • No graceful scaling — You cannot add capacity gradually; every scaling event is an all-or-nothing migration.
  • Sticky session breakage — Session affinity based on mod-N breaks when N changes, logging out users or losing in-memory state.

The hash ring: how consistent hashing works

Consistent hashing solves the rehashing problem by placing both keys and nodes on a circular hash space — a ring from 0 to 2^32 - 1 (or any large integer range). Each node is hashed to a position on the ring, and each key is assigned to the first node encountered when walking clockwise from the key's hash position.

graph TD
    subgraph Ring["Hash Ring (0 to 2^32 - 1)"]
        direction LR
        A["Node A<br/>position: 0x1A3F"] --> B["Node B<br/>position: 0x5B72"]
        B --> C["Node C<br/>position: 0x9DE1"]
        C --> A
    end
    K1["key: user:1001<br/>hash: 0x0F22"] -.->|"clockwise →"| A
    K2["key: user:1002<br/>hash: 0x4C11"] -.->|"clockwise →"| B
    K3["key: user:1003<br/>hash: 0x7BAA"] -.->|"clockwise →"| C

Keys are assigned to the next node clockwise on the ring. Adding or removing a node only affects keys in the arc between it and its predecessor.

Worked example: adding a node

  1. Start with nodes A (pos 10), B (pos 50), C (pos 90) on a ring of size 100.
  2. Key K with hash 35 maps to B (first node clockwise after 35).
  3. Add node D at position 40. Only keys in the arc (10, 40] now move from B to D.
  4. Key K (hash 35) falls in (10, 40], so it moves to D. Key with hash 45 stays on B.
  5. Result: only keys between D's predecessor (A at 10) and D (at 40) are reassigned — roughly 1/N of total keys.

The 1/N rule

When a node is added to a ring with N existing nodes, only approximately 1/N of all keys need to move. When a node is removed, only the keys that were assigned to that node move to its clockwise successor. This is the fundamental property that makes consistent hashing practical for elastic scaling.

The mathematical guarantee: with K keys and N nodes, adding one node moves at most K/N keys on average. Compare this to mod-N hashing where adding one node moves K*(N-1)/N keys — nearly all of them.
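A minimal ring can be built with a sorted list and binary search. The sketch below is illustrative (MD5 positions, one point per node, and the `HashRing` name are choices made here, not a standard API); it demonstrates the 1/N property: after adding a node, every key that moved now belongs to the new node, and everything else stays put.

```python
import bisect
import hashlib

def ring_pos(value: str) -> int:
    # 32-bit ring position; MD5 is just an illustrative choice of hash
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    def __init__(self, nodes):
        self._points = sorted((ring_pos(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._points]

    def add(self, node: str):
        h = ring_pos(node)
        i = bisect.bisect(self._hashes, h)
        self._hashes.insert(i, h)
        self._points.insert(i, (h, node))

    def lookup(self, key: str) -> str:
        # first node clockwise from the key's position (wrap past the top)
        i = bisect.bisect(self._hashes, ring_pos(key)) % len(self._points)
        return self._points[i][1]

ring = HashRing([f"node-{i}" for i in range(10)])
before = {k: ring.lookup(k) for k in (f"user:{i}" for i in range(10_000))}
ring.add("node-10")
moved = [k for k in before if ring.lookup(k) != before[k]]
print(f"{len(moved) / len(before):.1%} of keys moved, all to the new node")
```

With only one point per node, the moved fraction depends heavily on where node-10 happens to land on the ring; that variance is exactly the skew that virtual nodes smooth out.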

Virtual nodes: solving distribution skew

With only physical node positions on the ring, hash randomness means that arcs between nodes will be unequal. One node might own 50% of the ring while another owns only 10%. Virtual nodes (vnodes) solve this by placing multiple hash positions per physical node.

How virtual nodes work

Instead of hashing each node once, hash it V times (e.g., "NodeA-0", "NodeA-1", ..., "NodeA-149"). Each virtual node is an independent point on the ring. The physical node owns all keys assigned to any of its virtual nodes. With V = 150 per node and 10 physical nodes, you have 1,500 points on the ring, giving a much more uniform distribution.

Vnodes per node | Std dev of load (% of ideal) | Max/min load ratio
1 (no vnodes)   | ~80%                         | 5–10x
10              | ~25%                         | ~2x
50              | ~10%                         | ~1.4x
100             | ~7%                          | ~1.2x
150             | ~5.8%                        | ~1.15x
200             | ~5%                          | ~1.12x
500             | ~3%                          | ~1.07x

The standard deviation of load decreases proportionally to 1/sqrt(V), where V is the number of virtual nodes per physical server. Cassandra defaults to 256 vnodes per node. Amazon Dynamo originally used 150. Most production systems use between 100 and 256.

Choosing the right vnode count

Start with 128 or 256 vnodes per node. Fewer than 50 produces visible skew; more than 500 adds memory overhead to the ring metadata without meaningful distribution improvement. Monitor the coefficient of variation (std dev / mean) of per-node key counts — aim for below 10%.
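The effect of vnode count on skew can be measured directly. This sketch (node and key names are invented for illustration) places V points per physical node and reports the coefficient of variation of per-node key counts:

```python
import bisect
import hashlib
from statistics import mean, pstdev

RING = 2**32

def ring_pos(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING

def load_cv(num_nodes: int, vnodes: int, num_keys: int = 50_000) -> float:
    """Coefficient of variation (std dev / mean) of keys per physical node."""
    # place `vnodes` independent points per physical node on the ring
    points = sorted((ring_pos(f"node-{n}/vn-{v}"), n)
                    for n in range(num_nodes) for v in range(vnodes))
    hashes = [h for h, _ in points]
    counts = [0] * num_nodes
    for k in range(num_keys):
        i = bisect.bisect(hashes, ring_pos(f"key:{k}")) % len(points)
        counts[points[i][1]] += 1
    return pstdev(counts) / mean(counts)

for v in (1, 10, 100):
    print(f"vnodes per node: {v:3d}  load CV: {load_cv(10, v):.2f}")
```

The spread shrinks roughly as 1/sqrt(V), consistent with the table above.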

Handling heterogeneous hardware with weighted vnodes

  • Problem — Not all servers are identical. A server with 256 GB RAM and NVMe drives can handle 4x the load of a server with 64 GB RAM and spinning disks.
  • Solution — Assign more virtual nodes to more powerful servers. If server A has 4x the capacity of server B, give A 4x as many vnodes (e.g., 512 vs 128).
  • Migration — When upgrading a server, gradually increase its vnode count. Each new vnode steals a small arc from another node, providing smooth capacity absorption.
  • Cassandra approach — Cassandra allows setting num_tokens per node in cassandra.yaml. A high-spec node can be assigned num_tokens: 512 while a smaller node gets num_tokens: 128.
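The weighting idea can be sketched in a few lines: give one hypothetical high-spec node 512 vnodes and four smaller nodes 128 each, then check the share of keys each absorbs (the names and counts here are invented for illustration):

```python
import bisect
import hashlib

RING = 2**32

def ring_pos(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING

# 512 vnodes for the big node vs 128 for each small one: a 4x weight
vnode_counts = {"big-node": 512, **{f"small-{i}": 128 for i in range(4)}}
points = sorted((ring_pos(f"{node}/vn-{v}"), node)
                for node, count in vnode_counts.items() for v in range(count))
hashes = [h for h, _ in points]

owned = dict.fromkeys(vnode_counts, 0)
for k in range(40_000):
    i = bisect.bisect(hashes, ring_pos(f"key:{k}")) % len(points)
    owned[points[i][1]] += 1

for node, count in sorted(owned.items(), key=lambda kv: -kv[1]):
    print(f"{node:8s} {count:6d} keys ({count / 40_000:.0%})")
```

The big node holds half the vnode positions (512 of 1,024), so it ends up owning roughly half the keys: a 4x share relative to each small node.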

Replication on the ring

Consistent hashing naturally supports replication. To store R replicas of each key, simply walk clockwise from the key's position and place copies on the next R distinct physical nodes encountered. This is exactly how Amazon Dynamo and Cassandra implement their replication strategy.

Replication placement walkthrough (R = 3)

  1. Hash the key to position P on the ring.
  2. Walk clockwise from P. The first physical node encountered is the coordinator (primary replica).
  3. Continue clockwise. The second distinct physical node becomes replica 2. Skip virtual nodes belonging to the same physical node.
  4. Continue clockwise. The third distinct physical node becomes replica 3.
  5. The coordinator handles the write and forwards it to replicas 2 and 3 in parallel.
  6. With a consistency level of QUORUM, the write succeeds when 2 of 3 replicas acknowledge (W = floor(R/2) + 1).

The rack-awareness trap

If all three replicas land on servers in the same rack, a single top-of-rack switch failure loses all copies. Production systems like Cassandra use a rack-aware or datacenter-aware replication strategy: the next R nodes must be in distinct racks (or distinct availability zones). This is configured via the NetworkTopologyStrategy with replication_factor per datacenter.
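The clockwise replica walk, including the rack filter, can be sketched as follows (the cluster layout, node names, and RACKS map are hypothetical):

```python
import bisect
import hashlib

RING = 2**32

def ring_pos(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING

# hypothetical cluster: six nodes spread across three racks
RACKS = {"a": "r1", "b": "r1", "c": "r2", "d": "r2", "e": "r3", "f": "r3"}
VNODES = 64

points = sorted((ring_pos(f"{n}#{v}"), n) for n in RACKS for v in range(VNODES))
hashes = [h for h, _ in points]

def replicas(key: str, r: int = 3, rack_aware: bool = True):
    """Walk clockwise from the key, collecting r distinct physical nodes
    (optionally in r distinct racks), skipping vnodes of already-chosen nodes."""
    out, racks_used = [], set()
    i = bisect.bisect(hashes, ring_pos(key)) % len(points)
    for step in range(len(points)):
        node = points[(i + step) % len(points)][1]
        if node in out:
            continue                     # another vnode of a chosen node
        if rack_aware and RACKS[node] in racks_used:
            continue                     # same rack as an existing replica
        out.append(node)
        racks_used.add(RACKS[node])
        if len(out) == r:
            break
    return out

reps = replicas("user:42")
print(reps, [RACKS[n] for n in reps])
```

With rack awareness on, the three replicas always land in three different racks, so a top-of-rack switch failure can take out at most one copy.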

Consistency levels and the ring

  • ONE — Write or read succeeds after one replica responds. Fastest but risks reading stale data.
  • QUORUM — Requires a majority of replicas: floor(RF/2) + 1, so with RF = 3, QUORUM = 2. Using QUORUM for both reads and writes guarantees strong consistency (since W + R > N).
  • ALL — All R replicas must respond. Highest consistency but any single node failure blocks the operation.
  • LOCAL_QUORUM — Quorum within the local datacenter only. Used in multi-region deployments to avoid cross-region latency on every write.

The relationship W + R > N (where W = write consistency, R = read consistency, N = replication factor) is the fundamental formula guaranteeing that reads see the latest write. This is the mathematical backbone of tunable consistency in ring-based systems.
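The overlap guarantee can be checked exhaustively for small N. The snippet below is a self-contained illustration (not from any particular database): it enumerates every possible write set and read set for N = 3 with QUORUM on both sides and confirms they always intersect.

```python
from itertools import combinations

def quorum(n: int) -> int:
    # majority: floor(n/2) + 1
    return n // 2 + 1

N = 3
W = R = quorum(N)          # 2 of 3 for both writes and reads
assert W + R > N

replicas = ("n1", "n2", "n3")
for write_set in combinations(replicas, W):
    for read_set in combinations(replicas, R):
        # pigeonhole: any 2-of-3 read must touch a node from any 2-of-3 write
        assert set(write_set) & set(read_set)

print("every QUORUM read overlaps every QUORUM write")
```

Because W + R = 4 > 3, no read set of size 2 can avoid every member of a write set of size 2.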

Alternative algorithms: jump hashing and rendezvous hashing

Ring-based consistent hashing is not the only approach. Two important alternatives solve specific problems that the ring does not handle well.

Jump consistent hashing

Jump consistent hashing, published by Google engineers Lamping and Veach in 2014, uses a pseudo-random sequence derived from the key to "jump" through bucket assignments. It produces a perfectly uniform distribution with zero memory overhead (no ring to store). The algorithm runs in O(ln N) time and requires only a key and a bucket count. The downside: it only supports appending or removing the last server. You cannot remove an arbitrary server without remapping many keys.
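The algorithm itself is only a few lines. Below is a Python port of the published Lamping–Veach function (the constant is the 64-bit linear congruential multiplier from their paper); the demo confirms that growing the bucket count by one moves only a small fraction of keys, all into the new bucket.

```python
def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash: map a 64-bit key to a bucket in [0, num_buckets)."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF  # 64-bit LCG step
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b

# growing from 10 to 11 buckets should move only ~1/11 of keys,
# and every moved key must land in the new bucket (index 10)
moved = [k for k in range(10_000) if jump_hash(k, 10) != jump_hash(k, 11)]
print(f"{len(moved) / 10_000:.1%} of keys moved")
```

A key's bucket index only ever jumps forward as buckets are added, which is why removal is restricted to the last bucket.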

Property                | Ring-based consistent hashing        | Jump consistent hashing                                    | Rendezvous (HRW) hashing
Distribution uniformity | Good with 100+ vnodes                | Perfect (provably optimal)                                 | Near-perfect
Memory overhead         | O(N * V) ring entries                | O(1) — no stored state                                     | O(1) — no stored state
Lookup time             | O(log(N * V)) binary search          | O(ln N) computation                                        | O(N) — must check all nodes
Arbitrary node removal  | Yes — only K/N keys move             | No — only supports removing the last bucket                | Yes — only K/N keys move
Weighted nodes          | Yes — vary vnodes per node           | Difficult                                                  | Yes — multiply weight into hash score
Use cases               | General-purpose distributed systems  | Append-only clusters, sharded databases with sequential IDs | DNS, CDN routing, small-to-medium clusters

Rendezvous hashing (Highest Random Weight)

Rendezvous hashing, also called HRW (Highest Random Weight), takes a different approach: for each key, compute a score for every node as hash(key, node_id) and assign the key to the node with the highest score. When a node is removed, each key that was on that node simply goes to the node with the next-highest score — approximately K/N keys move. The downside is O(N) lookup time since every node must be scored, making it impractical for clusters with thousands of nodes. It is widely used in CDN request routing (Akamai), distributed caching with small clusters, and Windows DHCP server load balancing.
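A minimal HRW sketch (MD5 scoring and the node names are illustrative choices) showing that removing one node reroutes only that node's keys:

```python
import hashlib

def hrw_owner(key: str, nodes) -> str:
    # score every node against the key; highest score wins -> O(N) per lookup
    def score(node: str) -> int:
        return int(hashlib.md5(f"{node}:{key}".encode()).hexdigest(), 16)
    return max(nodes, key=score)

nodes = [f"cache-{i}" for i in range(5)]
before = {f"user:{i}": hrw_owner(f"user:{i}", nodes) for i in range(2_000)}

# remove cache-3: keys it did NOT own keep their owner, because the
# maximum score among the survivors is unchanged for those keys
survivors = [n for n in nodes if n != "cache-3"]
for key, owner in before.items():
    if owner != "cache-3":
        assert hrw_owner(key, survivors) == owner

moved = sum(1 for owner in before.values() if owner == "cache-3")
print(f"only {moved} of {len(before)} keys moved (~1/5)")
```

No ring metadata is stored anywhere: ownership is recomputed from scratch on every lookup, which is the source of both the simplicity and the O(N) cost.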

When to use which algorithm

  • Ring-based consistent hashing — The default choice for most distributed systems. Use when you need arbitrary node addition/removal, weighted nodes, and the cluster has more than ~20 nodes. Examples: Dynamo, Cassandra, Riak, Memcached client libraries.
  • Jump consistent hashing — Use when nodes are added sequentially and never removed from the middle, distribution must be perfectly uniform, and memory is constrained. Examples: sharded log processors, append-only data pipelines, Google internal systems.
  • Rendezvous hashing — Use when the cluster is small (under ~50 nodes), you need the simplicity of no ring metadata, and O(N) lookup is acceptable. Examples: CDN edge routing, DNS-based load balancing, small Memcached clusters.

Real-world implementations deep dive

Consistent hashing is not just a textbook concept — it is the core data placement algorithm in some of the most critical infrastructure on the internet.

Amazon Dynamo

  • Architecture — Dynamo uses a ring-based consistent hash with virtual nodes. Each node is responsible for the region between it and its predecessor on the ring. The coordinator for a key is the first node clockwise from the key's hash.
  • Preference list — For each key, Dynamo maintains a preference list of N nodes responsible for that key. The first N distinct physical nodes on the ring clockwise from the key form this list.
  • Sloppy quorum — Dynamo uses sloppy quorums with hinted handoff. If a node in the preference list is down, the write goes to the next healthy node clockwise, which stores a hint and forwards the data when the original node recovers.
  • Merkle trees — Each node maintains a Merkle tree per key range for anti-entropy. Background processes compare Merkle tree roots between replicas to detect and repair inconsistencies without transferring all data.

Apache Cassandra

  • Partitioner — Cassandra uses the Murmur3Partitioner by default, which hashes partition keys with MurmurHash3 to produce a 64-bit token in the range -2^63 to 2^63 - 1. Each node owns a set of token ranges on this ring.
  • Vnodes — Each node owns num_tokens (default 256) randomly placed tokens on the ring. This gives fine-grained load distribution and makes bootstrapping new nodes faster because they steal small ranges from many existing nodes rather than one large range from one node.
  • Replication — The NetworkTopologyStrategy walks the ring clockwise placing replicas on nodes in distinct racks within each datacenter. With RF = 3 in two datacenters, each key has 6 total replicas (3 per DC).
  • Streaming — When a node joins or leaves, Cassandra streams only the token ranges that need to move. With 256 vnodes across 10 nodes, adding one node transfers approximately 10% of data, streamed in parallel from multiple existing nodes.

Akamai CDN and Discord

  • Akamai — Akamai pioneered consistent hashing in their 1997 paper. Their CDN uses it to route user requests to the nearest edge server that has cached the requested content. When an edge server goes down, only the content assigned to that server needs re-routing — the rest of the CDN is unaffected.
  • Discord — Discord uses consistent hashing to route messages to the correct guild (server) process. Each guild is assigned to a node based on its guild_id hash. When Discord scales by adding nodes, only a fraction of guilds migrate. They also use it for voice server assignment, routing users to the nearest voice server that handles their channel.
  • Memcached client libraries — Libraries like libketama (used by Facebook, Twitter) implement consistent hashing on the client side. The client maintains the ring and routes each key to the correct Memcached server. Adding a server to the pool moves only 1/N of the keys, preventing the cache stampede that mod-N would cause.

Cassandra's Murmur3 partitioner in depth

Cassandra's implementation of consistent hashing is one of the most battle-tested in production. Understanding how it works at the token level is valuable for both operations and interviews.

How a write finds its replicas in Cassandra

  1. The client sends a write with a partition key (e.g., user_id = "alice").
  2. The coordinator node computes token = murmur3("alice") = -4567891234567890 (a 64-bit signed integer).
  3. The coordinator consults the token ring metadata to find which node owns the range containing this token.
  4. The owning node is the primary replica. The coordinator walks clockwise on the ring to find the next RF - 1 distinct rack/DC nodes for additional replicas.
  5. With consistency level QUORUM and RF = 3, the coordinator sends the write to all 3 replicas and waits for 2 acknowledgments.
  6. Each replica writes to its commit log (for durability), then to its memtable (in-memory sorted structure), and eventually flushes to SSTables on disk.

Choosing partition keys for even distribution

The partition key determines which token range (and therefore which node) handles the data. A poor partition key (e.g., country_code with only ~200 values) creates hot spots. A good partition key (e.g., user_id with millions of distinct values) distributes evenly. Use the nodetool tablestats and nodetool ring commands to check for partition skew. If one node has significantly more data than others, your partition key is too coarse.

Operation                         | Token ring involvement                                        | Cost
Point read (partition key lookup) | Hash to token, find owning node, read from RF nodes           | O(1) node lookups, O(log N) ring search
Range scan (token range)          | Contact all nodes owning tokens in the range                  | O(nodes in range) — can be expensive
Node bootstrap (join)             | New node picks token ranges, streams data from existing owners | Background streaming, minutes to hours depending on data size
Node decommission (leave)         | Leaving node streams its ranges to successors                 | Background streaming, similar cost to bootstrap
Repair (anti-entropy)             | Compare Merkle trees between replicas for each token range    | CPU and I/O intensive, schedule during low-traffic periods

Cassandra 4.0 introduced transient replication, where some replicas only hold data temporarily for availability during writes. This reduces storage costs while maintaining the same write availability guarantees of the consistent hashing ring.

Operational pitfalls and best practices

Consistent hashing looks elegant on a whiteboard, but running it in production reveals subtle failure modes that can catch teams off guard.

Hot partitions from celebrity keys

Even with perfect ring distribution, a single key that receives millions of reads per second (e.g., a viral tweet, a popular product page) creates a hot partition on one node. Consistent hashing distributes keys evenly across nodes but does not distribute load within a key. Solutions include read replicas, caching layers in front of the ring, or key splitting (appending a random suffix to distribute one logical key across multiple physical keys).
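Key splitting can be sketched in a few lines (HRW stands in for a full ring lookup to keep it short; the key and node names are invented): eight suffixed copies of one hot key scatter across the cluster, so each read can fan out to a random suffix.

```python
import hashlib
import random

def owner(key: str, nodes) -> str:
    # stand-in for a ring lookup: highest-random-weight keeps the sketch short
    return max(nodes, key=lambda n: hashlib.md5(f"{n}:{key}".encode()).digest())

nodes = [f"node-{i}" for i in range(10)]
hot_key = "tweet:123"                                # one celebrity key
split_keys = [f"{hot_key}#{s}" for s in range(8)]    # eight physical copies

print("unsplit, every read hits:", owner(hot_key, nodes))
print("split across:", sorted({owner(k, nodes) for k in split_keys}))

def read_hot() -> str:
    # each read picks a random suffix, spreading load over several nodes
    return owner(f"{hot_key}#{random.randrange(8)}", nodes)
```

The trade-off: writes must now update all eight copies (or invalidate them), so this suits read-heavy hot keys.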

Production best practices

  • Monitor per-node load variance — Track requests per second, data size, and latency percentiles per node. A coefficient of variation above 15% indicates distribution problems. Tools: nodetool status (Cassandra), Dynamo CloudWatch metrics, custom Prometheus exporters.
  • Use consistent hash-aware clients — Token-aware drivers (e.g., DataStax Java driver for Cassandra) route requests directly to the correct coordinator, avoiding an extra network hop. This reduces latency by 30-50% compared to random routing.
  • Avoid ring resizing during peak hours — Adding or removing nodes triggers data streaming that competes with production traffic for network bandwidth and disk I/O. Schedule topology changes during low-traffic windows.
  • Test with realistic key distributions — Zipf distributions (where a small number of keys get most of the traffic) are the norm in production. Test your ring with Zipf-distributed load, not uniform random keys.
  • Implement proper health checking — A node that is alive but slow (brown failure) is worse than a dead node. Dead nodes trigger failover immediately; slow nodes cause cascading latency. Implement latency-based health checks, not just liveness checks.

The split-brain risk during node replacement

When replacing a node (remove old, add new), there is a window where the old node is still serving reads while the new node is bootstrapping. If both nodes think they own the same token range, clients may read stale data from the old node while writes go to the new one. Always decommission the old node fully before bootstrapping the replacement, or use Cassandra's "replace node" operation that handles this atomically.

Capacity planning formula

When adding a node to an N-node cluster, expect to transfer approximately (total_data_size / (N + 1)) bytes of data. With 10 nodes holding 1 TB each (10 TB total), adding one node transfers roughly 910 GB (10 TB / 11). At 100 MB/s streaming throughput, this takes about 2.5 hours. Plan your maintenance windows accordingly.
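The arithmetic above as a reusable helper (a back-of-the-envelope sketch that ignores parallel streaming and protocol overhead):

```python
def rebalance_estimate(total_bytes: float, n_nodes: int, stream_mb_per_s: float):
    """Bytes transferred and wall-clock seconds when an N-node cluster adds one node."""
    moved_bytes = total_bytes / (n_nodes + 1)      # the new node's 1/(N+1) share
    seconds = moved_bytes / (stream_mb_per_s * 1e6)
    return moved_bytes, seconds

TB = 1e12
moved, secs = rebalance_estimate(10 * TB, 10, 100.0)
print(f"transfer ~{moved / 1e9:.0f} GB, ~{secs / 3600:.1f} hours at 100 MB/s")
# transfer ~909 GB, ~2.5 hours at 100 MB/s
```

Cassandra-style parallel streaming from multiple source nodes can shorten the wall-clock time, but the total bytes moved stays the same.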

System design interview strategy

Consistent hashing appears in almost every system design interview that involves distributed data. Knowing when to bring it up and how deep to go separates senior candidates from juniors.

How to present consistent hashing in an interview

  1. Mention it at the right moment: when the interviewer asks "how do you distribute data across servers" or when you identify a sharding need in your design.
  2. Start with the problem: "Naive mod-N hashing moves nearly all keys when we add a server. We need a scheme where adding a node moves only 1/N of the data."
  3. Draw the ring: place 3-4 nodes on a circle, hash a few keys, show which node each key maps to.
  4. Show the add-node scenario: insert a new node, demonstrate that only keys in one arc move.
  5. Mention virtual nodes: "In practice, each server gets 100-256 virtual positions on the ring to ensure even distribution."
  6. Connect to replication: "We replicate to the next R distinct nodes clockwise, using rack-awareness to avoid correlated failures."
  7. Discuss trade-offs: "For very large clusters, we could consider jump hashing for better uniformity, but ring-based is more flexible for arbitrary node removal."

Depth signals that impress interviewers

Mentioning the W + R > N consistency formula, explaining why Cassandra uses 256 vnodes by default, discussing the Merkle tree anti-entropy mechanism, or comparing ring-based hashing to rendezvous hashing shows that you understand the concept at an implementation level, not just a whiteboard level. These details distinguish a "read the blog post" answer from a "built and operated it" answer.

Common interview contexts where consistent hashing applies

  • Design a distributed cache — Use consistent hashing to shard keys across Memcached or Redis nodes. Discuss how adding a cache node moves only 1/N of keys instead of causing a full cache miss storm.
  • Design a key-value store — Consistent hashing is the partitioning layer. Combine it with replication (R = 3), tunable consistency (W + R > N), and gossip protocol for membership.
  • Design a CDN — Use consistent hashing to route requests to edge servers. When a server goes down, its content is automatically served by the next node on the ring.
  • Design a chat system — Hash channel_id or guild_id to assign each conversation to a specific server process. Discord does exactly this for guild routing.
  • Design a rate limiter — Shard rate limit counters across nodes using consistent hashing on user_id or API key. Each node tracks counts for its assigned keys.
  • Any horizontal scaling discussion — Whenever the interviewer pushes on "what happens when you need more servers," consistent hashing is the answer for stateful workloads.

How this might come up in interviews

Consistent hashing is one of the most frequently tested concepts in system design interviews at FAANG companies. It appears whenever the design requires horizontal data partitioning: distributed caches, key-value stores, CDNs, chat systems, or any system where data must be spread across multiple servers. Interviewers expect candidates to explain why mod-N fails, draw the hash ring, introduce virtual nodes unprompted, and connect consistent hashing to replication and consistency guarantees. Senior candidates are expected to discuss trade-offs with jump hashing or rendezvous hashing and cite real systems like Dynamo and Cassandra.

Common questions:

  • How would you distribute data across multiple cache servers, and what happens when you add a new server? Walk me through consistent hashing step by step.
  • Explain virtual nodes. Why are they necessary, and how many would you use in production? What is the trade-off between vnode count and memory overhead?
  • You are designing a key-value store with replication factor 3. How do you decide which nodes hold the replicas for a given key? How do you handle the case where the next three nodes on the ring are all in the same rack?
  • Compare consistent hashing with rendezvous hashing. When would you choose one over the other? What are the performance characteristics of each?
  • A single key in your distributed cache is receiving 10x more reads than any other key, causing one node to be overloaded. How would you solve this hot partition problem while still using consistent hashing?
  • Explain how Cassandra uses consistent hashing. What is the Murmur3Partitioner, what are tokens, and how does the system handle adding a new node to the cluster?

Before you move on: can you answer these?

When a new node is added to a consistent hash ring with N existing nodes, approximately what fraction of keys need to be remapped, and why?

Approximately 1/N of all keys need to move. The new node takes ownership of the arc between itself and its predecessor on the ring. Only keys that hash into this arc (which is approximately 1/N of the total ring space) migrate from the existing successor node to the new node. All other keys remain on their current nodes.

Why do production systems use 100-256 virtual nodes per physical node instead of placing each node once on the ring?

With only one position per node, hash randomness creates highly uneven arc sizes — one node might own 50% of the ring while another owns 5%. Virtual nodes place each physical node at 100-256 random positions, breaking the ring into many small arcs. By the law of large numbers, the total arc length owned by each physical node converges to 1/N of the ring. The standard deviation of load decreases proportionally to 1/sqrt(V) where V is the vnode count.

Explain the W + R > N formula and how it relates to consistent hashing with replication factor R.

W is the number of replicas that must acknowledge a write, R is the number of replicas that must respond to a read, and N is the total replication factor. When W + R > N, at least one replica that acknowledged the write will be among those queried during the read, guaranteeing the read sees the latest write. On a consistent hash ring with N = 3, using W = 2 (QUORUM write) and R = 2 (QUORUM read) gives W + R = 4 > 3, ensuring strong consistency.

🧠 Mental Model

💡 Analogy

Imagine a circular conveyor belt in a warehouse with pickup stations evenly spaced around it. Packages (keys) are placed on the belt at a position determined by their barcode (hash). Each package rides the belt clockwise until it reaches the next pickup station (node), where a worker grabs it. If you add a new pickup station between two existing ones, only the packages between the previous station and the new one change hands — every other worker keeps their existing packages. If a station closes, its packages simply ride past to the next open station. Virtual nodes are like each worker operating multiple stations around the belt, ensuring no single worker is responsible for a disproportionately long stretch of belt.

⚡ Core Idea

Consistent hashing maps both data keys and server nodes onto the same circular hash space. Each key is assigned to the nearest node clockwise on the ring. This guarantees that adding or removing a node disrupts only K/N keys (where K is total keys and N is total nodes), compared to mod-N hashing which disrupts nearly all keys. Virtual nodes (100-256 per server) smooth out distribution, and walking clockwise for R distinct nodes provides natural replication placement.

🎯 Why It Matters

Every distributed system that stores state — caches, databases, message brokers, CDNs — must decide which server owns which data. Without consistent hashing, scaling up or down triggers massive data migration, cache invalidation storms, and potential outages. With it, you can add capacity smoothly, survive node failures gracefully, and reason about data placement deterministically. It is the single most important partitioning algorithm in distributed systems and appears in virtually every system design interview.
