A deep, end-to-end blueprint for building a production-grade distributed key-value store. We cover the API surface, storage engine internals (LSM trees vs B+ trees), write and read paths, partitioning via consistent hashing, replication models, consistency trade-offs, conflict resolution, and how real systems like DynamoDB, Cassandra, and etcd put it all together.
A key-value store exposes three fundamental operations: put(key, value) to insert or update, get(key) to retrieve, and delete(key) to remove. Some systems add conditional variants such as compare-and-swap (CAS) for optimistic concurrency, or TTL-based expiration for cache workloads.
Keys are typically byte strings or UTF-8 strings with a system-imposed maximum length (e.g., DynamoDB caps partition keys at 2 KB). Values are opaque blobs with a per-system ceiling (DynamoDB items max out at 400 KB; other stores allow several megabytes). Keeping values small (under 1 MB) is critical for tail latency because large values consume network bandwidth and compaction I/O.
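A minimal sketch of this contract as a typed interface; the method names and the CAS signature here are illustrative choices, not any specific system's API.

```python
from typing import Optional, Protocol

class KVStore(Protocol):
    """Illustrative core contract; exact names and semantics vary by system."""

    def put(self, key: bytes, value: bytes, ttl_seconds: Optional[int] = None) -> None:
        """Insert or overwrite. The optional TTL expires entries for cache workloads."""
        ...

    def get(self, key: bytes) -> Optional[bytes]:
        """Return the value, or None if absent (or deleted/expired)."""
        ...

    def delete(self, key: bytes) -> None:
        """Remove the key (internally often recorded as a tombstone)."""
        ...

    def compare_and_swap(self, key: bytes, expected: Optional[bytes], new: bytes) -> bool:
        """Write `new` only if the stored value still equals `expected`;
        returns False on a lost race (optimistic concurrency)."""
        ...
```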
Extended API operations in production systems

Beyond the core trio, production systems commonly add batch operations (e.g., DynamoDB's BatchGetItem and BatchWriteItem), range scans over sorted keys (e.g., etcd's range reads), atomic counters, and conditional writes that fail if the stored value has changed.
Interview tip: Start with the API
In a system design interview, always begin by clarifying the API contract and access patterns. Asking "Do we need range queries or only point lookups?" immediately shapes your partitioning strategy and storage engine choice.
The storage engine is the heart of any key-value store. The two dominant approaches are Log-Structured Merge trees (LSM trees) and B+ trees. Each optimizes for a different axis of the read-write-space amplification triangle.
| Property | LSM Tree | B+ Tree |
|---|---|---|
| Write amplification | High (data rewritten 10-30x by compaction) | High for small random updates (each update rewrites a full 4-16 KB page) |
| Read amplification | Higher (check memtable + multiple SSTable levels) | Low (single tree traversal, O(log n)) |
| Space amplification | Higher (stale versions exist until compacted) | Low (in-place updates reclaim space) |
| Write throughput | Excellent (sequential I/O only) | Good (random I/O for page splits) |
| Range scan performance | Good after compaction (sorted SSTables) | Excellent (leaf pages are linked) |
| Use cases | Cassandra, RocksDB, LevelDB, HBase | MySQL InnoDB, PostgreSQL, etcd (via boltdb) |
LSM trees convert all writes into sequential appends, which is ideal for SSDs and spinning disks alike. The trade-off is that reads may need to consult multiple levels, and background compaction consumes I/O bandwidth. B+ trees provide predictable read latency but generate random I/O on writes due to page splits and in-place updates.
The amplification triangle
You cannot minimize read amplification, write amplification, and space amplification simultaneously. LSM trees favor write throughput at the cost of read and space amplification. B+ trees favor read performance at the cost of write amplification. This is a fundamental storage engine trade-off known as the RUM conjecture.
```mermaid
graph TD
    A[Client Write] --> B[Write-Ahead Log]
    B --> C[Memtable - in memory, sorted]
    C -->|Memtable full| D[Flush to SSTable Level 0]
    D --> E[Level 0 SSTables]
    E -->|Size threshold| F[Compaction to Level 1]
    F --> G[Level 1 SSTables - sorted, non-overlapping]
    G -->|Size threshold| H[Compaction to Level 2]
    H --> I[Level 2 SSTables]
    I --> J[...Level N]
```
LSM tree write path: data flows from WAL to memtable to increasingly larger SSTable levels via compaction.
Every write follows a strict sequence to guarantee durability without sacrificing throughput. Understanding this pipeline is essential for diagnosing latency spikes and data loss scenarios.
Write path step by step
1. Client sends put(key, value) to the coordinator node (the node responsible for this key's partition).
2. The coordinator appends the operation to the Write-Ahead Log (WAL) on disk. This is a fast sequential append. The WAL guarantees durability: if the process crashes after this step, the write can be recovered on restart.
3. The coordinator inserts the key-value pair into the in-memory memtable (typically a red-black tree or skip list). The memtable keeps entries sorted by key for efficient range scans and a cheap subsequent flush.
4. When the memtable exceeds a size threshold (e.g., 64 MB), it is frozen and a new empty memtable is created to accept incoming writes.
5. A background thread flushes the frozen memtable to disk as an SSTable (Sorted String Table). SSTables are immutable once written. The corresponding WAL segment can then be safely deleted.
6. The coordinator replicates the write to N-1 replica nodes (the replication factor minus one). Depending on consistency settings, it waits for W acknowledgments before responding to the client. A minimal single-node sketch of steps 2-5 follows the list.
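To make steps 2-5 concrete, here is a minimal single-node sketch of the WAL-then-memtable pipeline. Everything here (SimpleLSM, the 4-entry flush threshold, the JSON file format) is illustrative, not how any production engine lays out data.

```python
import json
import os

MEMTABLE_LIMIT = 4  # illustrative; real systems flush around 64 MB

class SimpleLSM:
    """Toy single-node write path: WAL append, then sorted in-memory memtable,
    then flush to an immutable sorted file once the memtable fills up."""

    def __init__(self, data_dir: str):
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)
        self.wal = open(os.path.join(data_dir, "wal.log"), "a")
        self.memtable: dict[str, str] = {}
        self.sstable_seq = 0

    def put(self, key: str, value: str) -> None:
        # Step 2: durable sequential append to the WAL before acknowledging.
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())  # strongest durability; see the sync-modes note below
        # Step 3: insert into the memtable (a dict here; real engines use a
        # skip list or red-black tree so entries stay sorted by key).
        self.memtable[key] = value
        # Steps 4-5: freeze and flush once the memtable exceeds its threshold.
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self) -> None:
        # Write keys in sorted order: that is what makes the file an SSTable.
        path = os.path.join(self.data_dir, f"sstable_{self.sstable_seq:04d}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstable_seq += 1
        self.memtable.clear()
        # A real engine would now delete or recycle the covered WAL segment.
```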
WAL sync modes matter
The fsync policy on the WAL determines the durability guarantee. fsync on every write provides the strongest guarantee but adds ~2ms latency per write. Batching fsync (e.g., every 10ms or every 100 writes) improves throughput 10-50x but risks losing the last batch on power failure. Cassandra defaults to periodic sync; etcd fsyncs every write because metadata loss is catastrophic.
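As a sketch of the batched-fsync idea (group commit): writers append to the log buffer and block until the background syncer's next fsync covers their entry, so one disk flush amortizes across the whole batch. The 10 ms window and class name are illustrative; on power failure, at most the last window of writes is lost.

```python
import os
import threading
import time

SYNC_INTERVAL_S = 0.010  # illustrative 10 ms window, like Cassandra's periodic mode

class GroupCommitWAL:
    """Batch many logical writes into one fsync: writers block until the
    flush that covers their entry completes."""

    def __init__(self, path: str):
        self.f = open(path, "ab")
        self.lock = threading.Lock()
        self.waiters: list[threading.Event] = []
        threading.Thread(target=self._syncer, daemon=True).start()

    def append(self, record: bytes) -> None:
        done = threading.Event()
        with self.lock:
            self.f.write(record + b"\n")  # cheap buffered append
            self.waiters.append(done)
        done.wait()  # released once the covering fsync has happened

    def _syncer(self) -> None:
        while True:
            time.sleep(SYNC_INTERVAL_S)
            with self.lock:
                batch, self.waiters = self.waiters, []
            if batch:
                self.f.flush()
                os.fsync(self.f.fileno())  # one sync covers every write in the batch
                for done in batch:
                    done.set()
```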
In a leader-follower setup, the leader executes the write path and ships the WAL entries (or a replication log) to followers. In a leaderless (Dynamo-style) system, the client or coordinator sends the write to all N replicas in parallel and waits for W acknowledgments.
Reads in an LSM-based store must search multiple locations because data for a given key may reside in the memtable, any SSTable level, or multiple levels simultaneously (if compaction has not yet merged them). Bloom filters are the critical optimization that makes this feasible.
Read path step by step
1. Check the active memtable. If the key is found, return immediately. This is the fastest path (sub-microsecond, pure memory lookup).
2. Check the frozen memtable (if one exists during a flush). Same logic.
3. For each SSTable level starting from Level 0 (most recent), consult the Bloom filter. A Bloom filter is a probabilistic data structure that can tell you "definitely not here" (true negative) or "maybe here" (possible false positive, typically <1% rate). This eliminates unnecessary disk reads for 99%+ of SSTables that don't contain the key.
4. If the Bloom filter says "maybe," perform a binary search on the SSTable's sparse index to locate the data block, then read the block from disk (or block cache) and search within it.
5. Return the most recent version of the key (highest timestamp wins). If a tombstone is the most recent entry, return "key not found." A sketch of this lookup order follows the list.
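A sketch of that newest-first lookup order. The memtable arguments are plain dicts, each SSTable is a hypothetical object exposing a bloom filter and a lookup(key) method, and tombstones are a sentinel value.

```python
TOMBSTONE = object()  # deletes are recorded as tombstones, not removed in place

def lsm_get(key, active_memtable, frozen_memtable, sstables):
    """Search newest data first; the first hit is authoritative because
    anything older is, by construction, a stale version.

    `sstables` is ordered newest-first; each element is a hypothetical
    object with a `bloom` filter and a `lookup(key)` method."""
    # Steps 1-2: in-memory tables, newest first.
    for table in (active_memtable, frozen_memtable):
        if table is not None and key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    # Steps 3-4: on-disk tables, gated by Bloom filters.
    for sst in sstables:
        if not sst.bloom.might_contain(key):
            continue  # "definitely not here" -- skip the disk read entirely
        found, value = sst.lookup(key)  # sparse index + block read
        if found:
            # Step 5: first (newest) hit wins; a tombstone means "deleted".
            return None if value is TOMBSTONE else value
    return None  # key was never written
```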
Bloom filter tuning
A Bloom filter with 10 bits per key gives a ~1% false positive rate. Increasing to 15 bits per key drops it to ~0.1%. Each additional bit halves the false positive rate but increases memory usage. For a 1 billion key dataset, a 10 bits/key Bloom filter needs ~1.2 GB of RAM. This is almost always worth it because a single avoided disk seek saves 100-500 microseconds on SSD.
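The numbers in this note can be checked against the standard Bloom filter formula: with m/n bits per key and the optimal hash count k = (m/n)·ln 2, the false positive rate is (1 - e^(-kn/m))^k, which simplifies to about 0.6185^(m/n). A quick verification using only the standard library:

```python
import math

def bloom_fpr(bits_per_key: float) -> float:
    """False positive rate at the optimal hash count k = bits_per_key * ln(2)."""
    k = bits_per_key * math.log(2)
    return (1 - math.exp(-k / bits_per_key)) ** k  # == 0.6185 ** bits_per_key

for bpk in (10, 15):
    print(f"{bpk} bits/key -> FPR ~ {bloom_fpr(bpk):.4%}")
# 10 bits/key -> FPR ~ 0.8194%   (the ~1% in the note above)
# 15 bits/key -> FPR ~ 0.0744%   (the ~0.1% in the note above)

keys = 1_000_000_000
print(f"RAM at 10 bits/key: {keys * 10 / 8 / 1e9:.2f} GB")  # ~1.25 GB
```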
| Read optimization | How it helps | Cost |
|---|---|---|
| Bloom filters | Eliminates 99%+ of unnecessary SSTable reads | ~1.2 GB RAM per billion keys at 10 bits/key |
| Block cache (LRU) | Caches recently read SSTable data blocks in RAM | Configurable memory budget (e.g., 4 GB) |
| Sparse index | Stores every Nth key offset; binary search narrows to a block | Tiny: ~0.1% of data size |
| Compression | Reduces disk I/O per block read (LZ4, Snappy, Zstd) | CPU overhead; LZ4 decompresses at 4 GB/s |
| Compaction (background) | Reduces number of SSTables to search | I/O and CPU budget; can spike latency during heavy compaction |
Compaction is the background process that merges SSTables, removes stale versions and tombstones, and reorganizes data to reduce read amplification. The two dominant strategies make different trade-offs between write amplification and read performance.
Size-tiered compaction (STCS)

STCS merges SSTables of similar size: once roughly four tables of the same size tier accumulate, they are merged into one table in the next tier. Write amplification is low, but a key can live in many overlapping tables at once, so read and space amplification are higher. It is Cassandra's default and suits write-heavy workloads; a toy version of its trigger logic follows below.

Leveled compaction (LCS)

LCS organizes SSTables into levels of exponentially increasing size (commonly 10x per level); every level beyond L0 holds sorted, non-overlapping tables, so a key exists in at most one table per level. Reads are faster and space overhead is lower, at the cost of higher write amplification from repeated rewriting. LevelDB and RocksDB use this layout, and it suits read-heavy workloads.
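A toy version of the size-tiered trigger: group tables whose sizes fall within 1.5x of the smallest in the group (an illustrative simplification of Cassandra's bucketing) and nominate any group of four or more for a merge.

```python
def pick_stcs_candidates(sstable_sizes_mb, bucket_ratio=1.5, min_threshold=4):
    """Group SSTables of similar size; any group with at least
    `min_threshold` members is a compaction candidate."""
    buckets = []
    for size in sorted(sstable_sizes_mb):
        # Extend the current bucket while sizes stay within bucket_ratio
        # of its smallest member; otherwise start a new bucket.
        if buckets and size <= buckets[-1][0] * bucket_ratio:
            buckets[-1].append(size)
        else:
            buckets.append([size])
    return [b for b in buckets if len(b) >= min_threshold]

# Four ~64 MB flushes have accumulated: merge them into one ~256 MB table.
print(pick_stcs_candidates([64, 63, 66, 65, 512]))  # -> [[63, 64, 65, 66]]
```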
Common mistake: ignoring compaction debt
If writes arrive faster than compaction can process them, the number of SSTables grows without bound ("compaction debt"), and read latency spikes as more files must be checked per lookup. Monitor the pending-compaction-bytes metric: if it keeps growing, throttle writes, add I/O capacity, or switch to a less write-amplifying compaction strategy.
A single machine cannot store or serve the entire dataset at scale. Partitioning (sharding) distributes keys across multiple nodes. Consistent hashing is the standard approach because it minimizes data movement when nodes are added or removed.
In consistent hashing, both keys and nodes are mapped onto a logical ring using a hash function (e.g., MD5 or Murmur3). Each key is assigned to the first node encountered when walking clockwise around the ring. When a node joins or leaves, only the keys in the affected arc need to be redistributed, not the entire dataset.
```mermaid
graph LR
    subgraph "Hash Ring"
        N1((Node A)) ---|owns keys| R1[Key range 0-90]
        N2((Node B)) ---|owns keys| R2[Key range 91-180]
        N3((Node C)) ---|owns keys| R3[Key range 181-270]
        N4((Node D)) ---|owns keys| R4[Key range 271-360]
    end
    subgraph "Virtual Nodes"
        N1 -.-> V1[A-v1]
        N1 -.-> V2[A-v2]
        N1 -.-> V3[A-v3]
        N2 -.-> V4[B-v1]
        N2 -.-> V5[B-v2]
        N2 -.-> V6[B-v3]
    end
```

Consistent hashing ring with virtual nodes. Each physical node owns multiple positions on the ring for balanced distribution.
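A compact sketch of such a ring, using bisect over sorted hash positions; the vnode count and the choice of MD5 are illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to nodes via a sorted list of (hash, node) positions.
    Each physical node is placed at `vnodes` points for smoother balance."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        """Walk clockwise: the first position at or after the key's hash owns it."""
        idx = bisect.bisect_left(self.ring, (self._hash(key),))
        return self.ring[idx % len(self.ring)][1]  # wrap around past the end

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # deterministic owner for this key
```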
Key partitioning concepts

- Virtual nodes (vnodes): each physical node claims many positions on the ring (e.g., 128-256), so load spreads evenly and a joining or leaving node shifts small slices to or from many peers rather than one large arc.
- Hot keys: one extremely popular key (a celebrity profile, a viral post) can overload its partition regardless of hashing; mitigations include caching in front of the store and splitting the key with a random suffix.
- Rebalancing: when capacity changes, only the affected key ranges stream between nodes; consistent hashing keeps this proportional to roughly 1/N of the data rather than a full reshuffle.
DynamoDB's approach to partitioning
DynamoDB uses consistent hashing internally but also adaptively splits partitions when throughput exceeds the per-partition limit (1,000 WCU or 3,000 RCU). This automatic splitting is invisible to the user and is a key differentiator from manually-partitioned systems.
Replication is non-negotiable for durability and availability: if a node fails, its replicas serve the data. The fundamental question is how to maintain consistency across replicas while maximizing availability.
Replication topologies

- Leader-follower (single leader): all writes for a partition go through one leader, which ships a replication log to followers. Simple to reason about; failover requires leader election.
- Multi-leader: several nodes accept writes (e.g., one leader per datacenter), improving write latency across regions at the cost of cross-leader conflicts.
- Leaderless (Dynamo-style): any replica accepts reads and writes; the client or coordinator fans out to N replicas and waits for quorum acknowledgments.
| Quorum setting (N=3) | Behavior | Consistency | Availability |
|---|---|---|---|
| W=1, R=1 | Write to 1 node, read from 1 node | Eventual (may read stale data) | Highest (tolerates 2 node failures for both reads and writes) |
| W=2, R=2 | Write to 2 nodes, read from 2 nodes | Strong (W+R=4 > N=3, overlap guaranteed) | Moderate (tolerates 1 node failure) |
| W=3, R=1 | Write to all 3 nodes, read from 1 node | Strong reads, but writes need all nodes up | Low for writes (0 node failures tolerated) |
| W=1, R=3 | Write to 1 node, read from all 3 nodes | Strong reads, fast writes, slow reads | Low for reads (0 node failures tolerated) |
The quorum formula W + R > N guarantees that at least one node in the read set has the latest write. However, quorums alone do not prevent all anomalies: sloppy quorums (where hinted handoff nodes substitute for unavailable replicas) can violate this guarantee. Cassandra and DynamoDB use sloppy quorums for higher availability, which means even "quorum" reads can return stale data during network partitions.
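A coordinator-side sketch of these quorum rules: fan the write out to all N replicas, acknowledge after W successes, and on reads take the freshest of R responses by timestamp. The replica transport functions (send_write, read_replica) are hypothetical RPC stubs.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

N, W, R = 3, 2, 2
assert W + R > N, "read and write sets must overlap for quorum consistency"

def quorum_write(replicas, key, value, send_write):
    """Fan out to all N replicas; succeed once W have acknowledged.
    `send_write(replica, key, value, ts)` is a hypothetical RPC returning True."""
    ts = time.time_ns()  # LWW timestamp assigned by the coordinator
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(send_write, r, key, value, ts) for r in replicas]
    acks = 0
    try:
        for fut in as_completed(futures):
            if fut.result():
                acks += 1
            if acks >= W:
                return True  # stragglers keep running in the background
        return False  # exhausted all replicas without reaching W
    finally:
        pool.shutdown(wait=False)  # don't block the client on slow replicas

def quorum_read(replicas, key, read_replica):
    """Query R replicas and return the value with the highest timestamp.
    A real system would also issue read repair to the stale replicas."""
    responses = [read_replica(r, key) for r in replicas[:R]]  # (ts, value) pairs
    return max(responses, key=lambda tv: tv[0])[1]
```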
Choosing consistency levels
For most applications: use W=QUORUM, R=QUORUM for data that must be consistent (financial transactions, counters). Use W=ONE, R=ONE for data where staleness is acceptable (user profile caches, activity feeds). Never use W=ALL in production unless you can tolerate write unavailability during any single node failure.
Consistency models explained

- Linearizable (strong): every read reflects the latest completed write, as if there were a single copy. etcd offers this via Raft.
- Eventual: replicas converge once writes stop; reads in the meantime may be stale. The default in Dynamo-style stores.
- Read-your-writes: a client always sees its own updates, a useful middle ground for user-facing sessions.
- Causal: causally related operations are seen in order everywhere; concurrent operations may be observed in different orders.
In a distributed system where multiple nodes accept writes (leaderless or multi-leader), conflicting writes to the same key are inevitable. The system must detect and resolve these conflicts, detect failed nodes, and repair data inconsistencies.
Conflict resolution strategies

- Last-writer-wins (LWW): attach a timestamp to every write and keep the highest. Simple, but clock skew can silently discard the "real" latest write.
- Vector clocks: track a per-node version counter so the system can distinguish "A happened before B" from "A and B are concurrent," surfacing true conflicts (siblings) for the application to merge (a comparison sketch follows this list).
- CRDTs: data types (counters, sets) whose merge function is mathematically convergent, so replicas reconcile automatically.

Failure detection mechanisms

- Heartbeats and timeouts: nodes periodically ping peers; a missed deadline marks a peer as suspect.
- Gossip: each node exchanges membership state with a few random peers per round, so failure information spreads epidemically without a central monitor. Cassandra pairs gossip with a phi-accrual detector that outputs a suspicion level rather than a binary up/down.

Anti-entropy and repair

- Read repair: when a quorum read observes divergent replicas, the coordinator writes the freshest value back to the stale ones.
- Hinted handoff: if a replica is down, another node temporarily stores its writes as "hints" and replays them when it returns.
- Merkle-tree repair: replicas exchange hash trees over their key ranges and stream only the subranges whose hashes differ.
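The vector-clock comparison from the list above, as a minimal sketch; clocks are plain node-to-counter dicts and the node names are illustrative.

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks (node -> counter). Returns 'a_before_b',
    'b_before_a', 'equal', or 'concurrent' (a true conflict to surface)."""
    nodes = vc_a.keys() | vc_b.keys()
    a_ahead = any(vc_a.get(n, 0) > vc_b.get(n, 0) for n in nodes)
    b_ahead = any(vc_b.get(n, 0) > vc_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"   # neither happened-before the other: siblings
    if a_ahead:
        return "b_before_a"   # b is an ancestor of a; a supersedes it
    if b_ahead:
        return "a_before_b"
    return "equal"

# Two clients updated the same key on different replicas: a genuine conflict.
print(compare({"node-a": 2, "node-b": 1}, {"node-a": 1, "node-b": 2}))  # concurrent
# One write strictly descends from the other: safe to discard the ancestor.
print(compare({"node-a": 2, "node-b": 1}, {"node-a": 1, "node-b": 1}))  # b_before_a
```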
The CAP theorem in practice
During a network partition, a system must choose between Consistency (reject writes to avoid inconsistency) and Availability (accept writes on both sides of the partition, resolve conflicts later). DynamoDB and Cassandra choose AP: they remain available during partitions but may serve stale reads. etcd and ZooKeeper choose CP: they reject writes if a majority quorum cannot be reached. In practice, most systems are not purely CP or AP; they offer tunable consistency per-operation.
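To illustrate the Merkle-tree repair mentioned above: both replicas hash the same sorted key range and recurse only into halves whose hashes differ, so a mostly-in-sync range costs O(log n) hash comparisons instead of a full scan. A toy in-memory version, assuming both sides split ranges identically:

```python
import hashlib

def range_hash(data: dict, keys: list) -> str:
    h = hashlib.sha256()
    for k in keys:
        h.update(f"{k}={data.get(k)}".encode())
    return h.hexdigest()

def find_diverged(local: dict, remote: dict, keys: list, leaf_size: int = 2):
    """Recursively compare range hashes; return keys needing repair.
    (A real system exchanges hashes over the network, not whole dicts.)"""
    if range_hash(local, keys) == range_hash(remote, keys):
        return []  # identical subtree: skip this entire range
    if len(keys) <= leaf_size:
        return [k for k in keys if local.get(k) != remote.get(k)]
    mid = len(keys) // 2  # split the sorted range and recurse into both halves
    return (find_diverged(local, remote, keys[:mid], leaf_size)
            + find_diverged(local, remote, keys[mid:], leaf_size))

a = {"k1": "v1", "k2": "v2", "k3": "v3", "k4": "v4"}
b = {"k1": "v1", "k2": "STALE", "k3": "v3", "k4": "v4"}
print(find_diverged(a, b, sorted(a)))  # -> ['k2']: stream only this key
```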
Understanding individual components is necessary but not sufficient. Let us see how production systems combine these building blocks into cohesive architectures.
| System | Storage Engine | Partitioning | Replication | Consistency | Conflict Resolution |
|---|---|---|---|---|---|
| DynamoDB | B-tree (internal) | Consistent hashing with adaptive splits | Leaderless, multi-AZ | Eventual (default) or strong (per-read) | LWW (timestamps) |
| Cassandra | LSM tree (Bigtable-style) | Consistent hashing with vnodes | Leaderless, tunable quorum | Tunable (ONE to ALL) | LWW (timestamps), optional CRDTs |
| Riak | Bitcask / LevelDB | Consistent hashing with vnodes | Leaderless (Dynamo model) | Eventual, tunable quorum | Vector clocks, CRDTs, sibling resolution |
| etcd | B+ tree (boltdb/bbolt) | No partitioning (single cluster) | Leader-follower via Raft consensus | Strong (linearizable) | Not applicable (single leader) |
| Redis | In-memory hash table | Hash slots (16384 slots) | Leader-follower, async | Eventual (async replication) | LWW (last replica write wins) |
Choosing the right system
Need strong consistency for coordination or config metadata? Use etcd or ZooKeeper. Need massive write throughput with tunable consistency? Use Cassandra. Need a fully managed solution with automatic scaling? Use DynamoDB. Need sub-millisecond latency for caching? Use Redis. The choice always starts with your access patterns, consistency requirements, and operational capacity.
```mermaid
graph TD
    Client[Client] -->|put/get/delete| Coord[Coordinator Node]
    Coord -->|1. Write to WAL| WAL[Write-Ahead Log]
    Coord -->|2. Insert into| MT[Memtable]
    MT -->|Flush| SST[SSTable Levels]
    SST -->|Compaction| SST
    Coord -->|3. Replicate| R1[Replica 1]
    Coord -->|3. Replicate| R2[Replica 2]
    R1 -->|Gossip| R2
    R2 -->|Gossip| Coord
    Coord -->|Read: Bloom Filter| SST
    Coord -->|Read: check| MT
    subgraph "Anti-Entropy"
        MK[Merkle Trees] -->|Compare hashes| Repair[Sync diverged keys]
    end
```

End-to-end architecture of a Dynamo-style distributed key-value store showing write path, read path, replication, and anti-entropy repair.
Building a distributed key-value store is a masterclass in trade-off engineering. Every decision, from the storage engine to the consistency model to the compaction strategy, involves sacrificing one desirable property to gain another. The best engineers do not memorize the "right" answer; they understand the trade-off space deeply enough to make the right choice for their specific requirements.
Designing a key-value store is one of the top 5 most frequently asked system design questions at FAANG companies. Interviewers use it to test depth across multiple dimensions: storage engine internals, distributed systems fundamentals, consistency trade-offs, and the ability to reason about failure modes. A strong candidate starts with the API, discusses storage engine trade-offs (LSM vs B+ tree), walks through the write and read paths, explains partitioning and replication, and then dives into consistency models and conflict resolution. The key differentiator between a senior and staff-level answer is demonstrating awareness of operational concerns: compaction tuning, failure detection, anti-entropy repair, and hot key mitigation.
Common questions:
In a Dynamo-style KV store with N=3 replicas, you set W=2 and R=2. One replica goes down permanently. Can you still serve reads and writes? What happens when the downed replica comes back?
Yes, you can still serve both reads and writes: W=2 requires 2 of 3 replicas (2 are alive), and R=2 also succeeds with 2 of 3. When the downed replica recovers, it will be out of date. Hinted handoff replays queued writes if the outage was short. For longer outages, an anti-entropy repair using Merkle trees synchronizes the diverged key ranges. Read repair also heals stale data opportunistically on subsequent reads.
Why does an LSM-tree-based KV store have higher read amplification than a B+ tree, and what are three techniques to mitigate it?
In an LSM tree, a key may exist in the memtable, L0 SSTables, and any deeper level. A point lookup must check all of these in the worst case (read amplification). Three mitigations: (1) Bloom filters eliminate 99%+ of unnecessary SSTable checks. (2) Leveled compaction ensures at most one SSTable per level contains a given key. (3) Block cache (LRU) keeps frequently accessed SSTable blocks in memory, avoiding repeated disk reads.
Explain what happens when two clients concurrently write different values to the same key in a leaderless KV store using last-writer-wins (LWW). Why is this problematic, and what is an alternative?
Both writes succeed on their respective replicas. When a read queries multiple replicas, it sees both versions and picks the one with the higher timestamp. The "losing" write is silently discarded, causing data loss. This is problematic because LWW conflates "most recent timestamp" with "correct value," and clock skew can cause the objectively earlier write to win. Alternatives: vector clocks detect the conflict and let the application merge values explicitly, or CRDTs (e.g., OR-Set) define automatic, mathematically convergent merge semantics.
💡 Analogy
A distributed key-value store is like a chain of safe deposit vaults across multiple cities. You give a clerk (coordinator) your box number (key) and they use a directory (consistent hashing ring) to determine which city vault holds your box. For durability, copies of your box contents are kept in 2-3 vaults in different cities (replication). To read, the clerk checks the most recently updated copies. If the clerks in different cities disagree about what is in the box (conflict), they use timestamps or version tracking to figure out which is correct. The vault rooms are organized with recent items in a fast-access drawer (memtable) and older items filed in organized cabinets (SSTables), with a manifest (Bloom filter) at each cabinet door that instantly tells you "your item is definitely not in here" to save you from searching.
⚡ Core Idea
A key-value store is deceptively simple on the outside (get/put/delete) but internally requires solving the hardest problems in distributed systems: partitioning data across machines, replicating for durability, maintaining consistency, detecting failures, and resolving conflicts. The storage engine (LSM tree vs B+ tree) determines single-node performance, while the replication and partitioning layers determine system-wide scalability and fault tolerance.
🎯 Why It Matters
Key-value stores are the backbone of modern infrastructure: session stores, caches, configuration management, metadata catalogs, and even as building blocks for larger databases. Understanding how they work gives you a mental framework for reasoning about any distributed storage system. In interviews, "Design a key-value store" is one of the most common and highest-signal system design questions because it tests your knowledge of storage engines, networking, distributed consensus, and trade-off analysis simultaneously.