A deep, end-to-end blueprint for building a production-grade distributed key-value store. We cover the API surface, storage engine internals (LSM trees vs B+ trees), write and read paths, partitioning via consistent hashing, replication models, consistency trade-offs, conflict resolution, and how real systems like DynamoDB, Cassandra, and etcd put it all together.
A key-value store exposes three fundamental operations: put(key, value) to insert or update, get(key) to retrieve, and delete(key) to remove. Some systems add conditional variants such as compare-and-swap (CAS) for optimistic concurrency, or TTL-based expiration for cache workloads.
Keys are typically byte strings or UTF-8 strings with a system-imposed maximum length (e.g., DynamoDB caps partition keys at 2 KB). Values are opaque blobs with a per-system ceiling (DynamoDB items max out at 400 KB; other stores allow several megabytes). Keeping values small (under 1 MB) is critical for tail latency because large values consume network bandwidth and compaction I/O.
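A minimal sketch of this contract as a typed interface; the method names and the CAS signature here are illustrative choices, not any specific system's API.

```python
from typing import Optional, Protocol

class KVStore(Protocol):
    """Illustrative core contract; exact names and semantics vary by system."""

    def put(self, key: bytes, value: bytes, ttl_seconds: Optional[int] = None) -> None:
        """Insert or overwrite. The optional TTL expires entries for cache workloads."""
        ...

    def get(self, key: bytes) -> Optional[bytes]:
        """Return the value, or None if absent (or deleted/expired)."""
        ...

    def delete(self, key: bytes) -> None:
        """Remove the key (internally often recorded as a tombstone)."""
        ...

    def compare_and_swap(self, key: bytes, expected: Optional[bytes], new: bytes) -> bool:
        """Write `new` only if the stored value still equals `expected`;
        returns False on a lost race (optimistic concurrency)."""
        ...
```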
Extended API operations in production systems

Beyond the core trio, production systems commonly add batch operations (e.g., DynamoDB's BatchGetItem and BatchWriteItem), range scans over sorted keys (e.g., etcd's range reads), atomic counters, and conditional writes that fail if the stored value has changed.
Interview tip: Start with the API
In a system design interview, always begin by clarifying the API contract and access patterns. Asking "Do we need range queries or only point lookups?" immediately shapes your partitioning strategy and storage engine choice.
The storage engine is the heart of any key-value store. The two dominant approaches are Log-Structured Merge trees (LSM trees) and B+ trees. Each optimizes for a different axis of the read-write-space amplification triangle.
| Property | LSM Tree | B+ Tree |
|---|---|---|
| Write amplification | High (data rewritten 10-30x by compaction) | High for small random updates (each update rewrites a full 4-16 KB page) |
| Read amplification | Higher (check memtable + multiple SSTable levels) | Low (single tree traversal, O(log n)) |
| Space amplification | Higher (stale versions exist until compacted) | Low (in-place updates reclaim space) |
| Write throughput | Excellent (sequential I/O only) | Good (random I/O for page splits) |
| Range scan performance | Good after compaction (sorted SSTables) | Excellent (leaf pages are linked) |
| Use cases | Cassandra, RocksDB, LevelDB, HBase | MySQL InnoDB, PostgreSQL, etcd (via boltdb) |
LSM trees convert all writes into sequential appends, which is ideal for SSDs and spinning disks alike. The trade-off is that reads may need to consult multiple levels, and background compaction consumes I/O bandwidth. B+ trees provide predictable read latency but generate random I/O on writes due to page splits and in-place updates.
The amplification triangle
You cannot minimize read amplification, write amplification, and space amplification simultaneously. LSM trees favor write throughput at the cost of read and space amplification. B+ trees favor read performance at the cost of write amplification. This is a fundamental storage engine trade-off known as the RUM conjecture.
```mermaid
graph TD
    A[Client Write] --> B[Write-Ahead Log]
    B --> C[Memtable - in memory, sorted]
    C -->|Memtable full| D[Flush to SSTable Level 0]
    D --> E[Level 0 SSTables]
    E -->|Size threshold| F[Compaction to Level 1]
    F --> G[Level 1 SSTables - sorted, non-overlapping]
    G -->|Size threshold| H[Compaction to Level 2]
    H --> I[Level 2 SSTables]
    I --> J[...Level N]
```
LSM tree write path: data flows from WAL to memtable to increasingly larger SSTable levels via compaction.
Every write follows a strict sequence to guarantee durability without sacrificing throughput. Understanding this pipeline is essential for diagnosing latency spikes and data loss scenarios.
Write path step by step
1. Client sends put(key, value) to the coordinator node (the node responsible for this key's partition).
2. The coordinator appends the operation to the Write-Ahead Log (WAL) on disk. This is a fast sequential append. The WAL guarantees durability: if the process crashes after this step, the write can be recovered on restart.
3. The coordinator inserts the key-value pair into the in-memory memtable (typically a red-black tree or skip list). The memtable keeps entries sorted by key for efficient range scans and a cheap subsequent flush.
4. When the memtable exceeds a size threshold (e.g., 64 MB), it is frozen and a new empty memtable is created to accept incoming writes.
5. A background thread flushes the frozen memtable to disk as an SSTable (Sorted String Table). SSTables are immutable once written. The corresponding WAL segment can then be safely deleted.
6. The coordinator replicates the write to N-1 replica nodes (the replication factor minus one). Depending on consistency settings, it waits for W acknowledgments before responding to the client. A minimal single-node sketch of steps 2-5 follows the list.
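To make steps 2-5 concrete, here is a minimal single-node sketch of the WAL-then-memtable pipeline. Everything here (SimpleLSM, the 4-entry flush threshold, the JSON file format) is illustrative, not how any production engine lays out data.

```python
import json
import os

MEMTABLE_LIMIT = 4  # illustrative; real systems flush around 64 MB

class SimpleLSM:
    """Toy single-node write path: WAL append, then sorted in-memory memtable,
    then flush to an immutable sorted file once the memtable fills up."""

    def __init__(self, data_dir: str):
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)
        self.wal = open(os.path.join(data_dir, "wal.log"), "a")
        self.memtable: dict[str, str] = {}
        self.sstable_seq = 0

    def put(self, key: str, value: str) -> None:
        # Step 2: durable sequential append to the WAL before acknowledging.
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())  # strongest durability; see the sync-modes note below
        # Step 3: insert into the memtable (a dict here; real engines use a
        # skip list or red-black tree so entries stay sorted by key).
        self.memtable[key] = value
        # Steps 4-5: freeze and flush once the memtable exceeds its threshold.
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self) -> None:
        # Write keys in sorted order: that is what makes the file an SSTable.
        path = os.path.join(self.data_dir, f"sstable_{self.sstable_seq:04d}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstable_seq += 1
        self.memtable.clear()
        # A real engine would now delete or recycle the covered WAL segment.
```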
WAL sync modes matter
The fsync policy on the WAL determines the durability guarantee. fsync on every write provides the strongest guarantee but adds ~2ms latency per write. Batching fsync (e.g., every 10ms or every 100 writes) improves throughput 10-50x but risks losing the last batch on power failure. Cassandra defaults to periodic sync; etcd fsyncs every write because metadata loss is catastrophic.
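As a sketch of the batched-fsync idea (group commit): writers append to the log buffer and block until the background syncer's next fsync covers their entry, so one disk flush amortizes across the whole batch. The 10 ms window and class name are illustrative; on power failure, at most the last window of writes is lost.

```python
import os
import threading
import time

SYNC_INTERVAL_S = 0.010  # illustrative 10 ms window, like Cassandra's periodic mode

class GroupCommitWAL:
    """Batch many logical writes into one fsync: writers block until the
    flush that covers their entry completes."""

    def __init__(self, path: str):
        self.f = open(path, "ab")
        self.lock = threading.Lock()
        self.waiters: list[threading.Event] = []
        threading.Thread(target=self._syncer, daemon=True).start()

    def append(self, record: bytes) -> None:
        done = threading.Event()
        with self.lock:
            self.f.write(record + b"\n")  # cheap buffered append
            self.waiters.append(done)
        done.wait()  # released once the covering fsync has happened

    def _syncer(self) -> None:
        while True:
            time.sleep(SYNC_INTERVAL_S)
            with self.lock:
                batch, self.waiters = self.waiters, []
            if batch:
                self.f.flush()
                os.fsync(self.f.fileno())  # one sync covers every write in the batch
                for done in batch:
                    done.set()
```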
In a leader-follower setup, the leader executes the write path and ships the WAL entries (or a replication log) to followers. In a leaderless (Dynamo-style) system, the client or coordinator sends the write to all N replicas in parallel and waits for W acknowledgments.
Reads in an LSM-based store must search multiple locations because data for a given key may reside in the memtable, any SSTable level, or multiple levels simultaneously (if compaction has not yet merged them). Bloom filters are the critical optimization that makes this feasible.
Read path step by step
1. Check the active memtable. If the key is found, return immediately. This is the fastest path (sub-microsecond, pure memory lookup).
2. Check the frozen memtable (if one exists during a flush). Same logic.
3. For each SSTable level starting from Level 0 (most recent), consult the Bloom filter. A Bloom filter is a probabilistic data structure that can tell you "definitely not here" (true negative) or "maybe here" (possible false positive, typically <1% rate). This eliminates unnecessary disk reads for 99%+ of SSTables that don't contain the key.
4. If the Bloom filter says "maybe," perform a binary search on the SSTable's sparse index to locate the data block, then read the block from disk (or block cache) and search within it.
5. Return the most recent version of the key (highest timestamp wins). If a tombstone is the most recent entry, return "key not found." A sketch of this lookup order follows the list.
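A sketch of that newest-first lookup order. The memtable arguments are plain dicts, each SSTable is a hypothetical object exposing a bloom filter and a lookup(key) method, and tombstones are a sentinel value.

```python
TOMBSTONE = object()  # deletes are recorded as tombstones, not removed in place

def lsm_get(key, active_memtable, frozen_memtable, sstables):
    """Search newest data first; the first hit is authoritative because
    anything older is, by construction, a stale version.

    `sstables` is ordered newest-first; each element is a hypothetical
    object with a `bloom` filter and a `lookup(key)` method."""
    # Steps 1-2: in-memory tables, newest first.
    for table in (active_memtable, frozen_memtable):
        if table is not None and key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    # Steps 3-4: on-disk tables, gated by Bloom filters.
    for sst in sstables:
        if not sst.bloom.might_contain(key):
            continue  # "definitely not here" -- skip the disk read entirely
        found, value = sst.lookup(key)  # sparse index + block read
        if found:
            # Step 5: first (newest) hit wins; a tombstone means "deleted".
            return None if value is TOMBSTONE else value
    return None  # key was never written
```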
Bloom filter tuning
A Bloom filter with 10 bits per key gives a ~1% false positive rate. Increasing to 15 bits per key drops it to ~0.1%. Each additional bit halves the false positive rate but increases memory usage. For a 1 billion key dataset, a 10 bits/key Bloom filter needs ~1.2 GB of RAM. This is almost always worth it because a single avoided disk seek saves 100-500 microseconds on SSD.
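The numbers in this note can be checked against the standard Bloom filter formula: with m/n bits per key and the optimal hash count k = (m/n)·ln 2, the false positive rate is (1 - e^(-kn/m))^k, which simplifies to about 0.6185^(m/n). A quick verification using only the standard library:

```python
import math

def bloom_fpr(bits_per_key: float) -> float:
    """False positive rate at the optimal hash count k = bits_per_key * ln(2)."""
    k = bits_per_key * math.log(2)
    return (1 - math.exp(-k / bits_per_key)) ** k  # == 0.6185 ** bits_per_key

for bpk in (10, 15):
    print(f"{bpk} bits/key -> FPR ~ {bloom_fpr(bpk):.4%}")
# 10 bits/key -> FPR ~ 0.8194%   (the ~1% in the note above)
# 15 bits/key -> FPR ~ 0.0744%   (the ~0.1% in the note above)

keys = 1_000_000_000
print(f"RAM at 10 bits/key: {keys * 10 / 8 / 1e9:.2f} GB")  # ~1.25 GB
```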
| Read optimization | How it helps | Cost |
|---|---|---|
| Bloom filters | Eliminates 99%+ of unnecessary SSTable reads | ~1.2 GB RAM per billion keys at 10 bits/key |
| Block cache (LRU) | Caches recently read SSTable data blocks in RAM | Configurable memory budget (e.g., 4 GB) |
| Sparse index | Stores every Nth key offset; binary search narrows to a block | Tiny: ~0.1% of data size |
| Compression | Reduces disk I/O per block read (LZ4, Snappy, Zstd) | CPU overhead; LZ4 decompresses at 4 GB/s |
| Compaction (background) | Reduces number of SSTables to search | I/O and CPU budget; can spike latency during heavy compaction |
Compaction is the background process that merges SSTables, removes stale versions and tombstones, and reorganizes data to reduce read amplification. The two dominant strategies make different trade-offs between write amplification and read performance.
Size-tiered compaction (STCS)

STCS merges SSTables of similar size: once roughly four tables of the same size tier accumulate, they are merged into one table in the next tier. Write amplification is low, but a key can live in many overlapping tables at once, so read and space amplification are higher. It is Cassandra's default and suits write-heavy workloads; a toy version of its trigger logic follows below.

Leveled compaction (LCS)

LCS organizes SSTables into levels of exponentially increasing size (commonly 10x per level); every level beyond L0 holds sorted, non-overlapping tables, so a key exists in at most one table per level. Reads are faster and space overhead is lower, at the cost of higher write amplification from repeated rewriting. LevelDB and RocksDB use this layout, and it suits read-heavy workloads.
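A toy version of the size-tiered trigger: group tables whose sizes fall within 1.5x of the smallest in the group (an illustrative simplification of Cassandra's bucketing) and nominate any group of four or more for a merge.

```python
def pick_stcs_candidates(sstable_sizes_mb, bucket_ratio=1.5, min_threshold=4):
    """Group SSTables of similar size; any group with at least
    `min_threshold` members is a compaction candidate."""
    buckets = []
    for size in sorted(sstable_sizes_mb):
        # Extend the current bucket while sizes stay within bucket_ratio
        # of its smallest member; otherwise start a new bucket.
        if buckets and size <= buckets[-1][0] * bucket_ratio:
            buckets[-1].append(size)
        else:
            buckets.append([size])
    return [b for b in buckets if len(b) >= min_threshold]

# Four ~64 MB flushes have accumulated: merge them into one ~256 MB table.
print(pick_stcs_candidates([64, 63, 66, 65, 512]))  # -> [[63, 64, 65, 66]]
```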
Common mistake: ignoring compaction debt
If writes arrive faster than compaction can process them, the number of SSTables grows without bound ("compaction debt"), and read latency spikes as more files must be checked per lookup. Monitor the pending-compaction-bytes metric: if it keeps growing, throttle writes, add I/O capacity, or switch to a less write-amplifying compaction strategy.
A single machine cannot store or serve the entire dataset at scale. Partitioning (sharding) distributes keys across multiple nodes. Consistent hashing is the standard approach because it minimizes data movement when nodes are added or removed.
In consistent hashing, both keys and nodes are mapped onto a logical ring using a hash function (e.g., MD5 or Murmur3). Each key is assigned to the first node encountered when walking clockwise around the ring. When a node joins or leaves, only the keys in the affected arc need to be redistributed, not the entire dataset.
```mermaid
graph LR
    subgraph "Hash Ring"
        N1((Node A)) ---|owns keys| R1[Key range 0-90]
        N2((Node B)) ---|owns keys| R2[Key range 91-180]
        N3((Node C)) ---|owns keys| R3[Key range 181-270]
        N4((Node D)) ---|owns keys| R4[Key range 271-360]
    end
    subgraph "Virtual Nodes"
        N1 -.-> V1[A-v1]
        N1 -.-> V2[A-v2]
        N1 -.-> V3[A-v3]
        N2 -.-> V4[B-v1]
        N2 -.-> V5[B-v2]
        N2 -.-> V6[B-v3]
    end
```

Consistent hashing ring with virtual nodes. Each physical node owns multiple positions on the ring for balanced distribution.
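A compact sketch of such a ring, using bisect over sorted hash positions; the vnode count and the choice of MD5 are illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to nodes via a sorted list of (hash, node) positions.
    Each physical node is placed at `vnodes` points for smoother balance."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        """Walk clockwise: the first position at or after the key's hash owns it."""
        idx = bisect.bisect_left(self.ring, (self._hash(key),))
        return self.ring[idx % len(self.ring)][1]  # wrap around past the end

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # deterministic owner for this key
```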
Key partitioning concepts

- Virtual nodes (vnodes): each physical node claims many positions on the ring (e.g., 128-256), so load spreads evenly and a joining or leaving node shifts small slices to or from many peers rather than one large arc.
- Hot keys: one extremely popular key (a celebrity profile, a viral post) can overload its partition regardless of hashing; mitigations include caching in front of the store and splitting the key with a random suffix.
- Rebalancing: when capacity changes, only the affected key ranges stream between nodes; consistent hashing keeps this proportional to roughly 1/N of the data rather than a full reshuffle.
DynamoDB's approach to partitioning
DynamoDB uses consistent hashing internally but also adaptively splits partitions when throughput exceeds the per-partition limit (1,000 WCU or 3,000 RCU). This automatic splitting is invisible to the user and is a key differentiator from manually-partitioned systems.
Replication is non-negotiable for durability and availability: if a node fails, its replicas serve the data. The fundamental question is how to maintain consistency across replicas while maximizing availability.
Replication topologies

- Leader-follower (single leader): all writes for a partition go through one leader, which ships a replication log to followers. Simple to reason about; failover requires leader election.
- Multi-leader: several nodes accept writes (e.g., one leader per datacenter), improving write latency across regions at the cost of cross-leader conflicts.
- Leaderless (Dynamo-style): any replica accepts reads and writes; the client or coordinator fans out to N replicas and waits for quorum acknowledgments.
| Quorum setting (N=3) | Behavior | Consistency | Availability |
|---|---|---|---|
| W=1, R=1 | Write to 1 node, read from 1 node | Eventual (may read stale data) | Highest (tolerates 2 node failures for both reads and writes) |
| W=2, R=2 | Write to 2 nodes, read from 2 nodes | Strong (W+R=4 > N=3, overlap guaranteed) | Moderate (tolerates 1 node failure) |
| W=3, R=1 | Write to all 3 nodes, read from 1 node | Strong reads, but writes need all nodes up | Low for writes (0 node failures tolerated) |
| W=1, R=3 | Write to 1 node, read from all 3 nodes | Strong reads, fast writes, slow reads | Low for reads (0 node failures tolerated) |
The quorum formula W + R > N guarantees that at least one node in the read set has the latest write. However, quorums alone do not prevent all anomalies: sloppy quorums (where hinted handoff nodes substitute for unavailable replicas) can violate this guarantee. Cassandra and DynamoDB use sloppy quorums for higher availability, which means even "quorum" reads can return stale data during network partitions.
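A coordinator-side sketch of these quorum rules: fan the write out to all N replicas, acknowledge after W successes, and on reads take the freshest of R responses by timestamp. The replica transport functions (send_write, read_replica) are hypothetical RPC stubs.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

N, W, R = 3, 2, 2
assert W + R > N, "read and write sets must overlap for quorum consistency"

def quorum_write(replicas, key, value, send_write):
    """Fan out to all N replicas; succeed once W have acknowledged.
    `send_write(replica, key, value, ts)` is a hypothetical RPC returning True."""
    ts = time.time_ns()  # LWW timestamp assigned by the coordinator
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(send_write, r, key, value, ts) for r in replicas]
    acks = 0
    try:
        for fut in as_completed(futures):
            if fut.result():
                acks += 1
            if acks >= W:
                return True  # stragglers keep running in the background
        return False  # exhausted all replicas without reaching W
    finally:
        pool.shutdown(wait=False)  # don't block the client on slow replicas

def quorum_read(replicas, key, read_replica):
    """Query R replicas and return the value with the highest timestamp.
    A real system would also issue read repair to the stale replicas."""
    responses = [read_replica(r, key) for r in replicas[:R]]  # (ts, value) pairs
    return max(responses, key=lambda tv: tv[0])[1]
```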
Choosing consistency levels
For most applications: use W=QUORUM, R=QUORUM for data that must be consistent (financial transactions, counters). Use W=ONE, R=ONE for data where staleness is acceptable (user profile caches, activity feeds). Never use W=ALL in production unless you can tolerate write unavailability during any single node failure.
Consistency models explained

- Linearizable (strong): every read reflects the latest completed write, as if there were a single copy. etcd offers this via Raft.
- Eventual: replicas converge once writes stop; reads in the meantime may be stale. The default in Dynamo-style stores.
- Read-your-writes: a client always sees its own updates, a useful middle ground for user-facing sessions.
- Causal: causally related operations are seen in order everywhere; concurrent operations may be observed in different orders.
In a distributed system where multiple nodes accept writes (leaderless or multi-leader), conflicting writes to the same key are inevitable. The system must detect and resolve these conflicts, detect failed nodes, and repair data inconsistencies.
Conflict resolution strategies

- Last-writer-wins (LWW): attach a timestamp to every write and keep the highest. Simple, but clock skew can silently discard the "real" latest write.
- Vector clocks: track a per-node version counter so the system can distinguish "A happened before B" from "A and B are concurrent," surfacing true conflicts (siblings) for the application to merge (a comparison sketch follows this list).
- CRDTs: data types (counters, sets) whose merge function is mathematically convergent, so replicas reconcile automatically.

Failure detection mechanisms

- Heartbeats and timeouts: nodes periodically ping peers; a missed deadline marks a peer as suspect.
- Gossip: each node exchanges membership state with a few random peers per round, so failure information spreads epidemically without a central monitor. Cassandra pairs gossip with a phi-accrual detector that outputs a suspicion level rather than a binary up/down.

Anti-entropy and repair

- Read repair: when a quorum read observes divergent replicas, the coordinator writes the freshest value back to the stale ones.
- Hinted handoff: if a replica is down, another node temporarily stores its writes as "hints" and replays them when it returns.
- Merkle-tree repair: replicas exchange hash trees over their key ranges and stream only the subranges whose hashes differ.
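The vector-clock comparison from the list above, as a minimal sketch; clocks are plain node-to-counter dicts and the node names are illustrative.

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks (node -> counter). Returns 'a_before_b',
    'b_before_a', 'equal', or 'concurrent' (a true conflict to surface)."""
    nodes = vc_a.keys() | vc_b.keys()
    a_ahead = any(vc_a.get(n, 0) > vc_b.get(n, 0) for n in nodes)
    b_ahead = any(vc_b.get(n, 0) > vc_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"   # neither happened-before the other: siblings
    if a_ahead:
        return "b_before_a"   # b is an ancestor of a; a supersedes it
    if b_ahead:
        return "a_before_b"
    return "equal"

# Two clients updated the same key on different replicas: a genuine conflict.
print(compare({"node-a": 2, "node-b": 1}, {"node-a": 1, "node-b": 2}))  # concurrent
# One write strictly descends from the other: safe to discard the ancestor.
print(compare({"node-a": 2, "node-b": 1}, {"node-a": 1, "node-b": 1}))  # b_before_a
```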
The CAP theorem in practice
During a network partition, a system must choose between Consistency (reject writes to avoid inconsistency) and Availability (accept writes on both sides of the partition, resolve conflicts later). DynamoDB and Cassandra choose AP: they remain available during partitions but may serve stale reads. etcd and ZooKeeper choose CP: they reject writes if a majority quorum cannot be reached. In practice, most systems are not purely CP or AP; they offer tunable consistency per-operation.
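To illustrate the Merkle-tree repair mentioned above: both replicas hash the same sorted key range and recurse only into halves whose hashes differ, so a mostly-in-sync range costs O(log n) hash comparisons instead of a full scan. A toy in-memory version, assuming both sides split ranges identically:

```python
import hashlib

def range_hash(data: dict, keys: list) -> str:
    h = hashlib.sha256()
    for k in keys:
        h.update(f"{k}={data.get(k)}".encode())
    return h.hexdigest()

def find_diverged(local: dict, remote: dict, keys: list, leaf_size: int = 2):
    """Recursively compare range hashes; return keys needing repair.
    (A real system exchanges hashes over the network, not whole dicts.)"""
    if range_hash(local, keys) == range_hash(remote, keys):
        return []  # identical subtree: skip this entire range
    if len(keys) <= leaf_size:
        return [k for k in keys if local.get(k) != remote.get(k)]
    mid = len(keys) // 2  # split the sorted range and recurse into both halves
    return (find_diverged(local, remote, keys[:mid], leaf_size)
            + find_diverged(local, remote, keys[mid:], leaf_size))

a = {"k1": "v1", "k2": "v2", "k3": "v3", "k4": "v4"}
b = {"k1": "v1", "k2": "STALE", "k3": "v3", "k4": "v4"}
print(find_diverged(a, b, sorted(a)))  # -> ['k2']: stream only this key
```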
Understanding individual components is necessary but not sufficient. Let us see how production systems combine these building blocks into cohesive architectures.
| System | Storage Engine | Partitioning | Replication | Consistency | Conflict Resolution |
|---|---|---|---|---|---|
| DynamoDB | B-tree (internal) | Consistent hashing with adaptive splits | Leaderless, multi-AZ | Eventual (default) or strong (per-read) | LWW (timestamps) |
| Cassandra | LSM tree (Bigtable-style) | Consistent hashing with vnodes | Leaderless, tunable quorum | Tunable (ONE to ALL) | LWW (timestamps), optional CRDTs |
| Riak | Bitcask / LevelDB | Consistent hashing with vnodes | Leaderless (Dynamo model) | Eventual, tunable quorum | Vector clocks, CRDTs, sibling resolution |
| etcd | B+ tree (boltdb/bbolt) | No partitioning (single cluster) | Leader-follower via Raft consensus | Strong (linearizable) | Not applicable (single leader) |
| Redis | In-memory hash table | Hash slots (16384 slots) | Leader-follower, async | Eventual (async replication) | LWW (last replica write wins) |
Choosing the right system
Need strong consistency for coordination or config metadata? Use etcd or ZooKeeper. Need massive write throughput with tunable consistency? Use Cassandra. Need a fully managed solution with automatic scaling? Use DynamoDB. Need sub-millisecond latency for caching? Use Redis. The choice always starts with your access patterns, consistency requirements, and operational capacity.
```mermaid
graph TD
    Client[Client] -->|put/get/delete| Coord[Coordinator Node]
    Coord -->|1. Write to WAL| WAL[Write-Ahead Log]
    Coord -->|2. Insert into| MT[Memtable]
    MT -->|Flush| SST[SSTable Levels]
    SST -->|Compaction| SST
    Coord -->|3. Replicate| R1[Replica 1]
    Coord -->|3. Replicate| R2[Replica 2]
    R1 -->|Gossip| R2
    R2 -->|Gossip| Coord
    Coord -->|Read: Bloom Filter| SST
    Coord -->|Read: check| MT
    subgraph "Anti-Entropy"
        MK[Merkle Trees] -->|Compare hashes| Repair[Sync diverged keys]
    end
```

End-to-end architecture of a Dynamo-style distributed key-value store showing write path, read path, replication, and anti-entropy repair.
Building a distributed key-value store is a masterclass in trade-off engineering. Every decision, from the storage engine to the consistency model to the compaction strategy, involves sacrificing one desirable property to gain another. The best engineers do not memorize the "right" answer; they understand the trade-off space deeply enough to make the right choice for their specific requirements.
Designing a key-value store is one of the top 5 most frequently asked system design questions at FAANG companies. Interviewers use it to test depth across multiple dimensions: storage engine internals, distributed systems fundamentals, consistency trade-offs, and the ability to reason about failure modes. A strong candidate starts with the API, discusses storage engine trade-offs (LSM vs B+ tree), walks through the write and read paths, explains partitioning and replication, and then dives into consistency models and conflict resolution. The key differentiator between a senior and staff-level answer is demonstrating awareness of operational concerns: compaction tuning, failure detection, anti-entropy repair, and hot key mitigation.
Common questions:
In a Dynamo-style KV store with N=3 replicas, you set W=2 and R=2. One replica goes down permanently. Can you still serve reads and writes? What happens when the downed replica comes back?
Yes, you can still serve both reads and writes: W=2 requires 2 of 3 replicas (2 are alive), and R=2 also succeeds with 2 of 3. When the downed replica recovers, it will be out of date. Hinted handoff replays queued writes if the outage was short. For longer outages, an anti-entropy repair using Merkle trees synchronizes the diverged key ranges. Read repair also heals stale data opportunistically on subsequent reads.
Why does an LSM-tree-based KV store have higher read amplification than a B+ tree, and what are three techniques to mitigate it?
In an LSM tree, a key may exist in the memtable, L0 SSTables, and any deeper level. A point lookup must check all of these in the worst case (read amplification). Three mitigations: (1) Bloom filters eliminate 99%+ of unnecessary SSTable checks. (2) Leveled compaction ensures at most one SSTable per level contains a given key. (3) Block cache (LRU) keeps frequently accessed SSTable blocks in memory, avoiding repeated disk reads.
Explain what happens when two clients concurrently write different values to the same key in a leaderless KV store using last-writer-wins (LWW). Why is this problematic, and what is an alternative?
Both writes succeed on their respective replicas. When a read queries multiple replicas, it sees both versions and picks the one with the higher timestamp. The "losing" write is silently discarded, causing data loss. This is problematic because LWW conflates "most recent timestamp" with "correct value," and clock skew can cause the objectively earlier write to win. Alternatives: vector clocks detect the conflict and let the application merge values explicitly, or CRDTs (e.g., OR-Set) define automatic, mathematically convergent merge semantics.
💡 Analogy
A distributed key-value store is like a chain of safe deposit vaults across multiple cities. You give a clerk (coordinator) your box number (key) and they use a directory (consistent hashing ring) to determine which city vault holds your box. For durability, copies of your box contents are kept in 2-3 vaults in different cities (replication). To read, the clerk checks the most recently updated copies. If the clerks in different cities disagree about what is in the box (conflict), they use timestamps or version tracking to figure out which is correct. The vault rooms are organized with recent items in a fast-access drawer (memtable) and older items filed in organized cabinets (SSTables), with a manifest (Bloom filter) at each cabinet door that instantly tells you "your item is definitely not in here" to save you from searching.
⚡ Core Idea
A key-value store is deceptively simple on the outside (get/put/delete) but internally requires solving the hardest problems in distributed systems: partitioning data across machines, replicating for durability, maintaining consistency, detecting failures, and resolving conflicts. The storage engine (LSM tree vs B+ tree) determines single-node performance, while the replication and partitioning layers determine system-wide scalability and fault tolerance.
🎯 Why It Matters
Key-value stores are the backbone of modern infrastructure: session stores, caches, configuration management, metadata catalogs, and even as building blocks for larger databases. Understanding how they work gives you a mental framework for reasoning about any distributed storage system. In interviews, "Design a key-value store" is one of the most common and highest-signal system design questions because it tests your knowledge of storage engines, networking, distributed consensus, and trade-off analysis simultaneously.