The hardest problems in distributed systems: consensus algorithms, consistency models, partition tolerance, distributed transactions, and the failure modes that appear only at scale.
In a single-process application, operations happen in order, data is consistent, and function calls either succeed or fail. In a distributed system, none of these are guaranteed. Two operations from different servers may appear to happen "at the same time." The same data may look different from two different servers. A remote call may succeed, fail, or — most insidiously — partially succeed (executed but never acknowledged).
The first principle of distributed systems: partial failures are the norm. A network packet may be delivered zero, one, or multiple times. A server may crash between writing data and acknowledging the write. A clock drift of 200ms can reorder events that "happened simultaneously." Designing for these failures is not paranoia — it is engineering.
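A standard defense against the "executed but never acknowledged" failure is to make retries safe with an idempotency key: the client sends the same key on every retry, and the server executes the operation at most once. A minimal sketch (all names here are hypothetical, with an in-memory map standing in for a durable store):

```typescript
// Idempotency-key sketch: a retried request returns the stored result
// instead of executing the side effect a second time.
const processed = new Map<string, string>(); // idempotencyKey -> paymentId

function chargeOnce(idempotencyKey: string, amountCents: number): string {
  const existing = processed.get(idempotencyKey);
  if (existing !== undefined) return existing; // already executed: replay result

  const paymentId = `pay_${processed.size + 1}`; // stand-in for the real side effect
  processed.set(idempotencyKey, paymentId);      // record BEFORE acknowledging
  return paymentId;
}
```

If the acknowledgment is lost, the client retries with the same key and gets the original result back, so the charge happens exactly once from the caller's point of view.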
A consistency model defines what values a read can return relative to previous writes. From strongest to weakest:
Linearizability (strong consistency): Every read sees the most recent write, globally. The system appears as a single, sequential machine. Expensive — requires coordination. Example: etcd, ZooKeeper (for writes), a single-node PostgreSQL instance.
Sequential consistency: Operations appear in the same order to all nodes, but not necessarily in real-time order. Rare in practice.
Causal consistency: Causally related operations are seen in the same order by all nodes. Unrelated operations may be seen in different orders. Example: Cosmos DB in session mode.
Eventual consistency: Given no new updates, all replicas will converge to the same value — eventually. No guarantees about how long. Example: DynamoDB by default, Cassandra, DNS.
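A toy model of how eventually consistent replicas converge, assuming a last-writer-wins merge (a deliberate simplification; real systems use vector clocks, CRDTs, or quorum reads to handle concurrent writes more carefully):

```typescript
// Toy eventual consistency: two replicas accept writes independently, then
// converge by exchanging state and keeping the write with the latest timestamp.
type Versioned = { value: string; ts: number };

function merge(local: Versioned, remote: Versioned): Versioned {
  return remote.ts > local.ts ? remote : local; // last writer wins
}
```

After an anti-entropy exchange in both directions, the replicas hold the same value — "eventually" is however long that exchange takes to happen.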
Most apps can live with eventual consistency — with careful design
User profile updates: eventual consistency is fine (a few seconds of stale data is not a problem). Bank account balance: linearizability is required (even momentary inconsistency enables double-spend). Choose your consistency model based on the cost of an inconsistency.
```typescript
// Consistency model selection based on use case
import {
  DynamoDBClient,
  PutItemCommand,
  GetItemCommand,
  UpdateItemCommand,
} from '@aws-sdk/client-dynamodb';

// ✅ Eventually consistent: user preferences (stale data for 5s is fine)
const dynamoDb = new DynamoDBClient({});
await dynamoDb.send(new PutItemCommand({
  TableName: 'UserPreferences',
  Item: { userId: { S: userId }, theme: { S: 'dark' } },
  // Default: eventual consistency — written to 1 replica, others catch up
}));

// For reading: eventually consistent (cheaper, faster). Use for non-critical reads.
const readResult = await dynamoDb.send(new GetItemCommand({
  TableName: 'UserPreferences',
  Key: { userId: { S: userId } },
  ConsistentRead: false, // Eventual consistency (default)
}));

// ✅ Strongly consistent: account balance (stale data = double-spend risk)
const strongReadResult = await dynamoDb.send(new GetItemCommand({
  TableName: 'AccountBalances',
  Key: { accountId: { S: accountId } },
  ConsistentRead: true, // Strong consistency (2x read cost, always reads from the primary)
}));

// ✅ Optimistic concurrency: prevent lost updates without a distributed lock,
// using a DynamoDB conditional expression (compare-and-swap)
await dynamoDb.send(new UpdateItemCommand({
  TableName: 'AccountBalances',
  Key: { accountId: { S: accountId } },
  UpdateExpression: 'SET balance = :newBalance, version = :newVersion',
  ConditionExpression: 'version = :expectedVersion', // Fails if version changed
  ExpressionAttributeValues: {
    ':newBalance': { N: String(newBalance) },
    ':newVersion': { N: String(currentVersion + 1) },
    ':expectedVersion': { N: String(currentVersion) },
  },
  // If the condition fails → ConditionalCheckFailedException → retry with a fresh read
}));
```
Two-Phase Commit (2PC): A coordinator asks all participants to "prepare" (vote). If all agree, coordinator sends "commit." Atomic across all participants. Problem: blocking — if the coordinator crashes after prepare but before commit, participants are locked waiting indefinitely. Use only for same-database transactions.
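The protocol above can be sketched with in-memory participants (hypothetical types; a real coordinator also logs its decision durably before phase 2):

```typescript
// 2PC sketch: commit happens only if EVERY participant votes yes in phase 1.
// Between prepare() and commit()/abort(), participants hold locks — this is
// the window where a crashed coordinator leaves them blocked.
type Participant = {
  prepare: () => boolean; // phase 1 vote: "can I commit?"
  commit: () => void;     // phase 2, on unanimous yes
  abort: () => void;      // phase 2, on any no
};

function twoPhaseCommit(participants: Participant[]): 'committed' | 'aborted' {
  const votes = participants.map((p) => p.prepare()); // phase 1: collect votes
  if (votes.every(Boolean)) {
    participants.forEach((p) => p.commit());          // phase 2: unanimous yes
    return 'committed';
  }
  participants.forEach((p) => p.abort());             // any no → abort everywhere
  return 'aborted';
}
```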
Saga pattern: A sequence of local transactions, each publishing an event or message to trigger the next step. If one step fails, compensating transactions undo previous steps. Non-blocking, resilient to failures. The standard approach for distributed transactions in microservices.
Saga for e-commerce checkout:
1. Reserve inventory (local tx in Inventory service)
2. Charge payment (local tx in Payment service) — if it fails → release inventory
3. Create order (local tx in Orders service) — if it fails → refund payment, release inventory
4. Send confirmation email (local tx in Notification service) — if it fails → just retry (email is idempotent)
Choreography vs Orchestration: Choreography — each service reacts to events published by others. Simple but hard to trace. Orchestration — a central saga coordinator drives the steps. Easier to trace, single point of failure if not built with care.
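The choreography style can be sketched with an in-memory event bus (hypothetical names): there is no coordinator, and the checkout flow emerges purely from which service subscribes to which event.

```typescript
// Choreography sketch: each service reacts to the previous service's event.
type Handler = (payload: unknown) => void;
const subscribers = new Map<string, Handler[]>();
const log: string[] = []; // record of published events, in order

function subscribe(event: string, handler: Handler): void {
  const list = subscribers.get(event) ?? [];
  list.push(handler);
  subscribers.set(event, list);
}

function publish(event: string, payload: unknown = {}): void {
  log.push(event);
  (subscribers.get(event) ?? []).forEach((h) => h(payload));
}

// The saga chain exists only implicitly, in the subscriptions:
subscribe('OrderPlaced', () => publish('InventoryReserved'));
subscribe('InventoryReserved', () => publish('PaymentCharged'));
subscribe('PaymentCharged', () => publish('OrderConfirmed'));
```

This is exactly why choreography is hard to trace: to reconstruct the flow you must read every service's subscriptions, whereas an orchestrator states the sequence in one place.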
```typescript
// Saga Orchestrator: manages the distributed checkout transaction

type SagaStep<T> = {
  name: string;
  execute: (ctx: T) => Promise<T>;
  compensate: (ctx: T) => Promise<void>; // Undo if a later step fails
};

async function runSaga<T>(initialCtx: T, steps: SagaStep<T>[]): Promise<T> {
  const completed: SagaStep<T>[] = [];
  let ctx = initialCtx;

  for (const step of steps) {
    try {
      ctx = await step.execute(ctx);
      completed.unshift(step); // Track for compensation (reverse order)
    } catch (error) {
      console.error(`Saga failed at step: ${step.name}`, error);
      // Compensating transactions run in REVERSE order of execution
      for (const doneStep of completed) {
        try {
          await doneStep.compensate(ctx);
        } catch (compensateError) {
          // Compensation failure requires manual intervention — alert immediately
          await sendAlertToOncall(`Compensation failed: ${doneStep.name}`, compensateError);
        }
      }
      throw error;
    }
  }
  return ctx;
}

// Checkout saga steps
interface CheckoutCtx {
  userId: string;
  orderId?: string;
  reservationId?: string;
  paymentId?: string;
}

const checkoutSaga: SagaStep<CheckoutCtx>[] = [
  {
    name: 'reserve-inventory',
    execute: async (ctx) => ({
      ...ctx,
      reservationId: await inventoryService.reserve(ctx.userId),
    }),
    compensate: async (ctx) => {
      if (ctx.reservationId) await inventoryService.release(ctx.reservationId);
    },
  },
  {
    name: 'charge-payment',
    execute: async (ctx) => ({
      ...ctx,
      paymentId: await paymentService.charge(ctx.userId, getOrderTotal(ctx)),
    }),
    compensate: async (ctx) => {
      if (ctx.paymentId) await paymentService.refund(ctx.paymentId);
    },
  },
  {
    name: 'create-order',
    execute: async (ctx) => ({
      ...ctx,
      orderId: await orderService.create(ctx),
    }),
    compensate: async (ctx) => {
      if (ctx.orderId) await orderService.cancel(ctx.orderId);
    },
  },
];
```
Many distributed systems need a single "leader" to coordinate writes: database primary, job scheduler, distributed lock holder. How do nodes agree on who is leader when any node can fail at any time?
Raft is the most understandable consensus algorithm (unlike Paxos, which is notoriously complex). Each node is in one of three states: Follower (receives entries from the Leader), Candidate (campaigning to become Leader), or Leader (accepts writes and replicates them to Followers).
Leader election: if a Follower does not hear from the Leader for a randomized timeout, it becomes a Candidate, increments its term, and asks others to vote. A node wins if it gets votes from a majority (quorum). Safety: only one leader per term; a leader has all committed entries.
Why quorum matters: With 3 nodes, you need 2 (majority) to agree. If one node fails, the system still works. With 5 nodes, you can tolerate 2 failures. Never run with 2 nodes — you need 2 for quorum, so if one fails, the system halts.
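The election and quorum rules above can be illustrated with a toy vote-granting function (a simplification: a real Raft node also compares log up-to-dateness before granting a vote, and candidates time out and retry):

```typescript
// Toy Raft election: a node grants at most one vote per term, and a
// candidate wins only with votes from a majority (quorum).
type RaftNode = { id: string; term: number; votedFor: string | null };

function requestVote(node: RaftNode, candidateId: string, candidateTerm: number): boolean {
  if (candidateTerm < node.term) return false; // stale candidate: reject
  if (candidateTerm > node.term) {
    node.term = candidateTerm; // newer term: adopt it, forget the old vote
    node.votedFor = null;
  }
  if (node.votedFor === null || node.votedFor === candidateId) {
    node.votedFor = candidateId; // at most one vote per term
    return true;
  }
  return false;
}

function wonElection(candidateId: string, term: number, cluster: RaftNode[]): boolean {
  const votes = cluster.filter((n) => requestVote(n, candidateId, term)).length;
  return votes >= Math.floor(cluster.length / 2) + 1; // quorum = majority
}
```

The "one vote per term" rule is what makes two leaders in the same term impossible: two candidates cannot both collect a majority when every node votes at most once.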
Use existing consensus implementations
Never implement Raft or Paxos yourself. Use etcd, ZooKeeper, or Consul for distributed coordination. They have years of battle-testing that your custom implementation will not have.
Circuit breaker: Wraps calls to an external service. Tracks failure rate. When failure rate exceeds threshold, "opens" the circuit — subsequent calls fail immediately without calling the service. After a cooldown, allows a test request to see if the service has recovered.
Bulkhead: Isolate failures using separate resource pools per dependency. If the payment service thread pool is exhausted by slow payment calls, it does not affect the inventory service thread pool. Named after ship bulkheads that prevent flooding one compartment from sinking the ship.
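The bulkhead idea reduces to one small concurrency limiter per dependency. A minimal sketch (hypothetical implementation; production code would also bound the wait queue and reject when it overflows):

```typescript
// Bulkhead sketch: a tiny async semaphore caps concurrent calls per
// dependency, so a slow dependency queues behind its own limit instead of
// exhausting shared capacity.
class Bulkhead {
  private active = 0;
  private waiters: (() => void)[] = [];

  constructor(private limit: number) {}

  async run<T>(fn: () => Promise<T>): Promise<T> {
    while (this.active >= this.limit) {
      // Pool full: wait until a slot frees up, then re-check
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
      this.waiters.shift()?.(); // wake one waiter
    }
  }
}

// Separate pools per dependency: slow payment calls cannot starve inventory calls
const paymentPool = new Bulkhead(2);
const inventoryPool = new Bulkhead(10);
```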
Timeout everywhere: Every network call must have a timeout. Without timeouts, a slow dependency eventually exhausts your thread pool. Rule of thumb: set timeout to 2x the p99 latency of the dependency under normal load.
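A generic timeout wrapper is only a few lines. A sketch (an AbortSignal is passed through so the underlying request can actually be cancelled rather than just abandoned):

```typescript
// Timeout sketch: reject if the call does not settle within the budget,
// and signal the callee to cancel its work.
async function withTimeout<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await Promise.race([
      fn(controller.signal),
      new Promise<never>((_, reject) =>
        controller.signal.addEventListener('abort', () =>
          reject(new Error(`Timed out after ${timeoutMs}ms`)),
        ),
      ),
    ]);
  } finally {
    clearTimeout(timer); // don't leak the timer when fn settles first
  }
}
```

Following the rule of thumb above, `timeoutMs` would be set to roughly 2x the dependency's p99 latency under normal load.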
Retry with exponential backoff and jitter: Retry transient failures. Exponential backoff prevents retry storms. Jitter randomizes the backoff to prevent synchronized retries from all clients hitting the server at the same instant.
```typescript
// Circuit Breaker implementation
// States: CLOSED (normal) → OPEN (failing) → HALF_OPEN (testing recovery)

class CircuitBreaker {
  private failures = 0;
  private lastFailureTime = 0;
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';

  constructor(
    private threshold = 5, // Open after 5 consecutive failures
    private recoveryTimeout = 30000, // Try again after 30s
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      // After the cooldown, allow a test request to see if the service recovered
      if (Date.now() - this.lastFailureTime > this.recoveryTimeout) {
        this.state = 'HALF_OPEN';
        this.failures = 0;
      } else {
        throw new Error('Circuit breaker OPEN — service unavailable');
      }
    }

    try {
      const result = await fn();
      if (this.state === 'HALF_OPEN') {
        this.state = 'CLOSED'; // Recovery confirmed
        console.log('Circuit breaker CLOSED — service recovered');
      }
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures++;
      this.lastFailureTime = Date.now();
      if (this.failures >= this.threshold) {
        this.state = 'OPEN';
        console.error(`Circuit breaker OPEN after ${this.failures} failures`);
      }
      throw error;
    }
  }
}

// Retry with exponential backoff + jitter
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts) throw error;
      // Jitter (Math.random() * 100) prevents a thundering herd of synchronized retries
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
      console.log(`Attempt ${attempt} failed, retrying in ${delay.toFixed(0)}ms`);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error('Should never reach here');
}
```
Distributed systems questions test whether you understand failure modes that only appear in production at scale.
From the books
Designing Data-Intensive Applications — Martin Kleppmann (2017)
Chapters 8-9: Trouble with Distributed Systems, Consistency and Consensus
The most important insight in distributed systems: a node cannot distinguish between a slow network and a crashed remote node. This uncertainty is fundamental and cannot be eliminated — only managed.