A deep dive into the twelve fundamental patterns that power every large-scale distributed system: CQRS, Event Sourcing, Sagas, Circuit Breakers, Bulkheads, Retry with Backoff, Idempotency, Write-Ahead Logs, Sharding, Leader Election, Fan-Out strategies, and the Sidecar pattern. Learn when to use each pattern, how they compose together, and how they appear in FAANG system design interviews.
CQRS separates the write model (commands) from the read model (queries) into two distinct paths. The write side validates business invariants and persists events or state changes. The read side projects data into shapes optimised for queries — denormalized views, materialized aggregations, or search indexes. This separation lets you scale reads and writes independently: your write database can be a strongly consistent relational store while reads are served from Redis, Elasticsearch, or a purpose-built read replica.
In practice, CQRS surfaces wherever you see a system that writes to one store and reads from another. An e-commerce platform writes orders to a PostgreSQL database but projects order history into DynamoDB for fast customer lookups. A social network writes posts to a primary datastore but fans them out into per-user timeline caches. The two sides are kept in sync by an event pipeline — usually Kafka, Kinesis, or an internal change data capture (CDC) mechanism.
When to reach for CQRS
Apply CQRS when read and write workloads have fundamentally different access patterns, scaling requirements, or consistency needs. If your reads and writes hit the same table with the same shape, CQRS adds complexity without benefit.
| Aspect | Traditional CRUD | CQRS |
|---|---|---|
| Data model | Single model for reads and writes | Separate read and write models |
| Scaling | Scale everything together | Scale reads and writes independently |
| Consistency | Immediate | Eventual (between write and read sides) |
| Complexity | Low — one model to maintain | Higher — two models, sync pipeline, eventual consistency |
| Best fit | Simple domains, low traffic | High read-to-write ratios, complex query needs |
Client → [Command Bus] → Write Model → Event Store → [Projection Engine] → Read Model → Client
CQRS data flow: commands mutate the write model, events propagate to projections that populate the read model
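The flow above can be sketched with in-memory stand-ins for the write store, event pipeline, and read model. The `OrderPlaced` event and the list-based pipeline are illustrative assumptions, not a real Kafka client or CDC hookup:

```python
from dataclasses import dataclass

@dataclass
class OrderPlaced:
    order_id: str
    total: float

class WriteModel:
    """Command side: validates business invariants, emits events."""
    def __init__(self, event_log):
        self.event_log = event_log

    def place_order(self, order_id, total):
        if total <= 0:
            raise ValueError("order total must be positive")
        self.event_log.append(OrderPlaced(order_id, total))

class OrderHistoryProjection:
    """Read side: consumes events into a query-optimised view."""
    def __init__(self):
        self.by_order = {}

    def apply(self, event):
        if isinstance(event, OrderPlaced):
            self.by_order[event.order_id] = {"total": event.total}

# The event pipeline (Kafka/CDC in production) is simulated by a plain list here.
log = []
write = WriteModel(log)
projection = OrderHistoryProjection()

write.place_order("o-1", 42.0)
for e in log:               # the projection engine drains the pipeline
    projection.apply(e)

print(projection.by_order["o-1"])   # {'total': 42.0}
```

The gap between appending to `log` and applying events to the projection is exactly the eventual-consistency window from the table above.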
Event Sourcing stores every state change as an immutable event rather than overwriting the current state. Instead of a "users" table with the latest values, you have an append-only log: UserCreated, EmailChanged, AccountDeactivated. The current state is derived by replaying events from the beginning (or from a snapshot). This gives you a complete audit trail, the ability to rebuild state at any point in time, and natural compatibility with CQRS — events from the write side feed projections on the read side.
Kafka, EventStoreDB, and AWS DynamoDB Streams are common infrastructure choices. The pattern is used by financial systems (every transaction is an event), e-commerce order pipelines, and collaborative editing tools. The main costs are increased storage, the need for snapshotting (replaying millions of events on every read is impractical), and the complexity of handling schema evolution as event shapes change over time.
Event schema evolution is hard
Once events are persisted they are immutable. Plan for versioning from day one: use a schema registry, include a version field in every event, and write upcasters that transform old event shapes into the current format during replay.
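A minimal sketch of replay with an upcaster, assuming illustrative dict-shaped events (`UserCreated`, `EmailChanged`, `AccountDeactivated`) and an assumed v1→v2 migration for `EmailChanged`:

```python
def upcast(event):
    """Transform old event shapes into the current format during replay."""
    if event["type"] == "EmailChanged" and event.get("version", 1) == 1:
        # Assumed migration: v1 stored a bare email, v2 adds a verified flag.
        return {"type": "EmailChanged", "version": 2,
                "email": event["email"], "verified": False}
    return event

def replay(events, state=None):
    """Derive current state by folding events (start from a snapshot if given)."""
    state = dict(state or {})
    for raw in events:
        e = upcast(raw)
        if e["type"] == "UserCreated":
            state = {"name": e["name"], "active": True}
        elif e["type"] == "EmailChanged":
            state["email"] = e["email"]
        elif e["type"] == "AccountDeactivated":
            state["active"] = False
    return state

log = [
    {"type": "UserCreated", "version": 1, "name": "Ada"},
    {"type": "EmailChanged", "version": 1, "email": "ada@example.com"},
    {"type": "AccountDeactivated", "version": 1},
]
print(replay(log))  # {'name': 'Ada', 'active': False, 'email': 'ada@example.com'}
```

Passing a snapshot as the initial `state` and replaying only the events after it is what keeps replay cost bounded as the log grows.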
Event Sourcing building blocks
A saga is a sequence of local transactions where each step publishes an event or sends a command that triggers the next step. If any step fails, compensating transactions undo the work of prior steps. Sagas replace the heavyweight two-phase commit (2PC) protocol, which requires locking resources across services and becomes a scalability bottleneck. There are two coordination styles: orchestration and choreography.
In orchestration, a central saga coordinator (orchestrator) tells each service what to do and what to compensate if things fail. In choreography, each service listens for events and decides independently what to do next — no central coordinator. Orchestration is easier to reason about and debug; choreography is more decoupled but can become a tangled web of implicit dependencies.
| Aspect | Orchestration | Choreography |
|---|---|---|
| Coordination | Central orchestrator sends commands | Services react to events independently |
| Coupling | Services coupled to orchestrator | Services coupled to event schema |
| Visibility | Single place to see the full flow | Flow is distributed across services |
| Debugging | Easier — trace through orchestrator | Harder — must correlate events across services |
| Scalability | Orchestrator can become a bottleneck | Better horizontal scalability |
| Best fit | 3-7 step workflows, business-critical flows | Simple 2-3 step reactive chains |
Compensating transactions are not rollbacks
A compensating transaction is a new forward action that semantically undoes the effect of a previous step. If step 2 reserved inventory, the compensation is a "release inventory" command — not a database rollback. Design every saga step with its compensation from the start.
Example: Order saga (orchestration)
1. Orchestrator sends CreateOrder command to Order Service → Order created in PENDING state
2. Orchestrator sends ReserveInventory to Inventory Service → Items reserved
3. Orchestrator sends ChargePayment to Payment Service → Payment processed
4. Orchestrator sends ConfirmOrder to Order Service → Order moves to CONFIRMED
5. If ChargePayment fails → Orchestrator sends ReleaseInventory (compensate step 2) → sends CancelOrder (compensate step 1)
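The orchestration flow above can be sketched as a generic coordinator that runs steps in order and fires compensations in reverse on failure. The step and compensation names are illustrative:

```python
class SagaOrchestrator:
    """Runs local transactions in order; on failure, fires the compensations
    of already-completed steps in reverse order."""
    def __init__(self, steps):
        self.steps = steps          # list of (name, action, compensation)

    def run(self):
        completed = []
        for name, action, compensate in self.steps:
            try:
                action()
            except Exception:
                # Compensations are new forward actions, not database rollbacks.
                for _done, undo in reversed(completed):
                    undo()
                return "ABORTED"
            completed.append((name, compensate))
        return "CONFIRMED"

trace = []

def charge_payment():
    raise RuntimeError("card declined")     # step 3 fails

saga = SagaOrchestrator([
    ("CreateOrder",      lambda: trace.append("order PENDING"),
                         lambda: trace.append("CancelOrder")),
    ("ReserveInventory", lambda: trace.append("items reserved"),
                         lambda: trace.append("ReleaseInventory")),
    ("ChargePayment",    charge_payment,
                         lambda: trace.append("RefundPayment")),
])
print(saga.run())   # ABORTED
print(trace)        # ['order PENDING', 'items reserved', 'ReleaseInventory', 'CancelOrder']
```

Note the compensation order: inventory is released before the order is cancelled, the mirror image of the forward path.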
The circuit breaker pattern prevents failures in a downstream dependency from cascading across the entire system. It wraps remote calls in a state machine with three states: Closed (normal operation — requests pass through), Open (the dependency is considered down — requests fail immediately without attempting the call), and Half-Open (a trial period where a limited number of requests are allowed through to test if the dependency has recovered). Netflix popularised this pattern with Hystrix; modern implementations include Resilience4j for Java and Polly for .NET.
When in the Closed state, the circuit breaker tracks failure rates. Once failures exceed a threshold (e.g., 50% of the last 100 requests within a 10-second window), the breaker trips to Open. After a configurable timeout, it transitions to Half-Open and allows a probe request. If the probe succeeds, the breaker resets to Closed; if it fails, the breaker returns to Open.
Circuit breaker state transitions
Tune thresholds per dependency
A payment gateway that fails 5% of the time is very different from a recommendation service with the same failure rate. Set circuit breaker thresholds, timeouts, and fallback strategies based on the criticality and expected behaviour of each downstream dependency.
Circuit Breaker with Resilience4j
```java
CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(100)
    .permittedNumberOfCallsInHalfOpenState(3)
    .build()
```
Use for: Protecting a microservice from a flaky downstream payment API with automatic recovery
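For illustration, here is a minimal breaker in Python that mirrors the Closed → Open → Half-Open transitions. To keep the sketch short it uses a consecutive-failure counter rather than the sliding-window failure rate described above, and the clock is injectable for testing:

```python
import time

class CircuitBreaker:
    """Sketch: trips Open after `max_failures` consecutive failures,
    allows a probe in Half-Open after `reset_timeout` seconds."""
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"        # let a probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.max_failures:
                self.state = "OPEN"             # trip, or re-open on failed probe
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "CLOSED"                   # successful call (or probe) resets
        return result
```

The key property is the Open branch: while the breaker is tripped, callers get an immediate error without ever touching the struggling dependency.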
Named after the watertight compartments in a ship hull, the bulkhead pattern isolates components so that a failure in one does not sink the entire system. In software, this means giving each downstream dependency (or tenant, or workload class) its own dedicated pool of resources — thread pools, connection pools, rate limiters, or even separate infrastructure. If the recommendation service exhausts its thread pool, the checkout service continues unaffected because it has its own isolated pool.
Bulkheads come in two flavours: thread-pool isolation and semaphore isolation. Thread-pool isolation gives each dependency its own fixed-size thread pool and a bounded queue. This provides strong isolation but adds overhead from context switching and thread management. Semaphore isolation limits concurrency with a counter (no separate threads), offering lower overhead but less protection against slow calls that tie up the calling thread.
| Isolation Type | Mechanism | Overhead | Protection Level | Use When |
|---|---|---|---|---|
| Thread-pool | Separate thread pool per dependency | Higher (thread management) | Strong — slow calls timeout without blocking caller | Downstream calls are unreliable or slow |
| Semaphore | Concurrency counter (permit-based) | Lower (no extra threads) | Moderate — slow calls still occupy the caller thread | Downstream calls are fast and generally reliable |
| Infrastructure | Separate clusters/processes per workload | Highest (duplicate infra) | Strongest — full fault domain isolation | Multi-tenant systems, critical workloads |
Ship hull analogy
A ship has watertight bulkheads so that a breach in one compartment does not flood the entire vessel. In your system, each thread pool or resource boundary is a bulkhead. Size them so that the most critical services (checkout, authentication) always have capacity even if less critical services (recommendations, analytics) are under stress.
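Semaphore isolation can be sketched in a few lines: each dependency gets its own permit pool, and calls that cannot acquire a permit are rejected immediately rather than queuing. A sketch, not a production implementation — the pool sizes below are illustrative:

```python
import threading

class Bulkhead:
    """Semaphore isolation: at most `max_concurrent` in-flight calls
    per dependency; excess calls fail fast instead of queuing."""
    def __init__(self, name, max_concurrent):
        self.name = name
        self.sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        if not self.sem.acquire(blocking=False):
            raise RuntimeError(f"bulkhead {self.name} full: rejecting call")
        try:
            return fn()
        finally:
            self.sem.release()

# Each dependency gets its own bulkhead, so exhaustion in one
# cannot starve the other.
checkout = Bulkhead("checkout", max_concurrent=10)
recommendations = Bulkhead("recommendations", max_concurrent=2)
```

Failing fast at the bulkhead boundary is the point: a saturated recommendations pool returns errors immediately instead of tying up threads that checkout needs.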
Transient failures — network blips, brief overloads, leader elections — are inevitable in distributed systems. A naive immediate retry floods the recovering service with a thundering herd. Exponential backoff increases the delay between retries exponentially (e.g., 100ms, 200ms, 400ms, 800ms), giving the downstream time to recover. Adding jitter (randomness) to the delay decorrelates retries from multiple callers, preventing synchronized retry storms.
There are three common jitter strategies. Full jitter picks a random delay between 0 and the exponential ceiling. Equal jitter uses half the exponential value plus a random component up to the other half. Decorrelated jitter makes each delay independent of the previous delay. AWS recommends full jitter for most cases because it produces the broadest spread and the fewest collisions.
| Strategy | Formula (simplified) | Spread | Best For |
|---|---|---|---|
| No jitter | delay = base * 2^attempt | None — all callers retry at the same time | Never use in production |
| Full jitter | delay = random(0, base * 2^attempt) | Maximum spread | General-purpose default |
| Equal jitter | delay = base * 2^attempt / 2 + random(0, base * 2^attempt / 2) | Moderate spread with guaranteed minimum | When you need a minimum delay floor |
| Decorrelated jitter | delay = random(base, prev_delay * 3) | High spread, no memory of exponential curve | High-concurrency clients |
Do not retry non-idempotent operations blindly
Retrying a CreateOrder call without an idempotency key can create duplicate orders. Pair retry logic with idempotency (covered next) or restrict retries to read operations and operations that are inherently idempotent.
Implementing retry with backoff
1. Define maximum retry count (typically 3-5) and base delay (e.g., 100ms)
2. On each failure, compute delay: base * 2^attempt, capped at a maximum (e.g., 30s)
3. Apply full jitter: actual_delay = random(0, computed_delay)
4. Check if the error is retryable (5xx, timeout, connection reset — NOT 4xx client errors)
5. If max retries exceeded, propagate the error to the caller or route to a dead-letter queue
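The steps above can be sketched as follows. The `kind` attribute on exceptions is an assumed stand-in for real error classification (HTTP status, errno), and `sleep` is injectable so the computed delays can be inspected in tests:

```python
import random

# Assumed error taxonomy: these map to 5xx/timeout-style transient failures.
RETRYABLE = {"timeout", "connection_reset", "server_error"}

def retry_with_backoff(fn, max_retries=4, base=0.1, cap=30.0,
                       sleep=None, rng=random.random):
    """Retry with capped exponential backoff and full jitter (a sketch)."""
    sleep = sleep or (lambda s: None)     # use time.sleep in production
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            kind = getattr(exc, "kind", None)
            if kind not in RETRYABLE or attempt == max_retries:
                raise                     # non-retryable, or retry budget spent
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)        # full jitter: random(0, ceiling)
```

Because full jitter draws from the whole `[0, ceiling]` range, concurrent callers that failed at the same instant spread their retries out instead of hammering the dependency in lockstep.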
An operation is idempotent if performing it multiple times produces the same result as performing it once. In distributed systems where retries, duplicate messages, and at-least-once delivery are the norm, idempotency is the mechanism that prevents double charges, duplicate orders, and corrupted state. The standard approach is an idempotency key: the client generates a unique identifier (UUID) for each logical operation and sends it with every request. The server stores the key alongside the result; if it sees the same key again, it returns the cached result instead of re-executing.
Stripe, PayPal, and most payment APIs require an Idempotency-Key header. Kafka achieves exactly-once semantics via producer IDs and sequence numbers. DynamoDB conditional writes (PutItem with attribute_not_exists) provide idempotent inserts. The pattern applies anywhere messages might be delivered more than once.
Idempotency implementation strategies
TTL your idempotency keys
Idempotency keys do not need to live forever. A TTL of 24-72 hours is typical. After that window, a duplicate request is almost certainly a new intentional request, not a retry of the original.
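A minimal in-memory sketch of the key-with-TTL approach. Production systems would store keys in Redis or DynamoDB with a conditional write rather than a Python dict, and the clock is injectable here only for testing:

```python
import time

class IdempotentHandler:
    """Caches results by idempotency key; a duplicate request inside the
    TTL window replays the stored result instead of re-executing."""
    def __init__(self, ttl_seconds=72 * 3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.seen = {}              # key -> (result, stored_at)

    def handle(self, key, operation):
        entry = self.seen.get(key)
        if entry and self.clock() - entry[1] < self.ttl:
            return entry[0]         # duplicate: return cached result
        result = operation()        # first sighting (or expired): execute
        self.seen[key] = (result, self.clock())
        return result
```

After the TTL expires, the same key executes again — matching the note above that a late duplicate is almost certainly a new intentional request.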
A write-ahead log (WAL) is an append-only file where every mutation is recorded before it is applied to the main data structure. If the system crashes mid-operation, the WAL provides the information needed to recover: either replay uncommitted entries to complete the operation or discard them to roll back. WAL is the foundation of durability in PostgreSQL, MySQL (InnoDB redo log), SQLite, etcd, and Kafka (the commit log itself is a WAL).
The mechanism is straightforward: (1) write the intended change to the WAL on disk, (2) fsync the WAL to guarantee persistence, (3) apply the change to the in-memory data structure and/or main data files, (4) acknowledge success to the client. Because sequential appends to a log are far faster than random writes to a B-tree or LSM-tree, WAL allows high write throughput while guaranteeing durability.
| System | WAL Implementation | Purpose |
|---|---|---|
| PostgreSQL | pg_wal directory with 16MB segment files | Crash recovery, streaming replication, point-in-time recovery |
| MySQL InnoDB | Redo log (ib_logfile0, ib_logfile1) | Crash recovery, double-write buffer coordination |
| Apache Kafka | Commit log segments on each broker | Message durability, replication, consumer offset tracking |
| etcd / Raft | Raft log entries persisted before applying to state machine | Consensus, crash recovery, leader replication |
| Cassandra | Commitlog (append-only, per-node) | Crash recovery before data reaches SSTables via memtable flush |
WAL and replication
WAL is not only for crash recovery. PostgreSQL streaming replication, Kafka replica fetching, and Raft log replication all ship WAL entries to followers. The same log that provides local durability also provides distributed consistency.
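The four-step mechanism can be sketched against a JSON-lines log file. This is a toy key-value store, not how PostgreSQL or Kafka lay out their logs:

```python
import json
import os
import tempfile

class WriteAheadLog:
    """Append-only log: record the mutation, fsync, then apply (a sketch)."""
    def __init__(self, path):
        self.path = path
        self.f = open(path, "a")

    def append(self, entry):
        self.f.write(json.dumps(entry) + "\n")   # step 1: write intent to the log
        self.f.flush()
        os.fsync(self.f.fileno())                # step 2: durable before applying

    @staticmethod
    def replay(path, state=None):
        """Crash recovery: rebuild in-memory state from the log."""
        state = dict(state or {})
        with open(path) as f:
            for line in f:
                entry = json.loads(line)
                if entry["op"] == "set":
                    state[entry["key"]] = entry["value"]
                elif entry["op"] == "delete":
                    state.pop(entry["key"], None)
        return state

# Usage: mutate through the WAL, then recover state by replaying it.
path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(path)
wal.append({"op": "set", "key": "a", "value": 1})
wal.append({"op": "set", "key": "b", "value": 2})
wal.append({"op": "delete", "key": "a"})
print(WriteAheadLog.replay(path))   # {'b': 2}
```

Steps 3-4 (apply in memory, acknowledge the client) happen only after the fsync returns, which is what makes the acknowledgement a durability guarantee.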
Leader election designates a single node as the coordinator (leader) for a task while others act as followers or standby replicas. The leader makes decisions, assigns work, or manages writes, avoiding the split-brain problems that arise when multiple nodes believe they are in charge. When the leader fails, the remaining nodes detect the failure and elect a new leader. Raft (used by etcd, Consul, CockroachDB) and ZAB (used by ZooKeeper) are the two most widely deployed consensus algorithms for leader election.
In Raft, nodes start as followers. If a follower does not receive a heartbeat from the leader within a randomised timeout, it transitions to candidate and requests votes from peers. If it receives a majority of votes, it becomes leader and begins sending heartbeats. The randomised timeout prevents split votes. ZooKeeper uses ephemeral sequential znodes: each node creates a znode and the node with the lowest sequence number becomes the leader. When the leader crashes, its ephemeral znode disappears, and the next node in sequence takes over.
| Mechanism | Algorithm / Tool | Failure Detection | Failover Time | Split-Brain Prevention |
|---|---|---|---|---|
| Raft | etcd, Consul, CockroachDB | Heartbeat timeout (randomised) | < 1 second typical | Majority quorum required for election |
| ZAB | ZooKeeper | Session timeout on ephemeral znodes | 1-5 seconds | Majority quorum, epoch numbers |
| Lease-based | DynamoDB, Chubby, Redis Redlock | Lease expiry (TTL) | Lease duration (seconds) | Fencing tokens, lease expiry |
| Bully algorithm | Academic / simple systems | Ping / response timeout | Variable | Highest ID wins — no quorum guarantee |
Fencing tokens prevent stale leaders
A leader holding an expired lease may still believe it is the leader (clock skew, GC pause, network partition). Fencing tokens — monotonically increasing numbers issued with each lease — let downstream systems reject operations from stale leaders. Always include fencing tokens when using lease-based leader election.
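A sketch of how fencing tokens protect a downstream store from a stale leader. The lease manager and store here are in-memory stand-ins for etcd/ZooKeeper and the protected resource:

```python
class LeaseManager:
    """Issues leases with monotonically increasing fencing tokens (a sketch)."""
    def __init__(self):
        self.token = 0

    def acquire(self):
        self.token += 1
        return self.token

class FencedStore:
    """Downstream resource that rejects writes carrying a stale token."""
    def __init__(self):
        self.highest_seen = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_seen:
            raise RuntimeError(f"stale fencing token {token}: rejecting write")
        self.highest_seen = token
        self.data[key] = value

mgr = LeaseManager()
store = FencedStore()
t_a = mgr.acquire()                 # leader A elected
t_b = mgr.acquire()                 # A's lease expires; B elected with a higher token
store.write(t_b, "config", "from-B")
# A wakes from a GC pause still believing it is leader — its write is fenced off:
try:
    store.write(t_a, "config", "from-A")
except RuntimeError as e:
    print(e)                        # stale fencing token 1: rejecting write
```

The store never has to know about leases or clocks — comparing integers is enough to keep a zombie leader from clobbering the new leader's writes.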
Follower --[timeout]--> Candidate --[majority vote]--> Leader --[heartbeat]--> Followers
Leader --[crash]--> Followers detect timeout --> New election round
Raft leader election lifecycle: followers become candidates on timeout, candidates become leaders with majority vote
This pattern addresses how to deliver content (posts, updates, notifications) from a producer to many consumers. Fan-out-on-write (push model) pre-computes the result at write time: when a user posts, the system immediately writes that post into every follower timeline cache. Fan-out-on-read (pull model) defers the work to read time: when a follower opens their feed, the system gathers and merges posts from all followed users on the fly. Most real-world systems use a hybrid approach.
Fan-out-on-write delivers fast reads (the timeline is pre-built) but is expensive for users with millions of followers — writing a single celebrity post to 50 million timeline caches is prohibitively slow and wasteful. Fan-out-on-read saves write amplification for high-follower users but pushes latency to read time, which can be slow if a user follows many accounts. The hybrid approach (used by Twitter) fans out on write for regular users and fans out on read for celebrities, merging the two at read time.
| Approach | Write Cost | Read Cost | Latency | Storage | Best For |
|---|---|---|---|---|---|
| Fan-out-on-write | O(followers) per post | O(1) — read from cache | Low read latency | High — duplicated across caches | Users with < 10K followers |
| Fan-out-on-read | O(1) — write once | O(following) per read | Higher read latency | Low — single copy | Celebrity users with millions of followers |
| Hybrid | O(followers) for regular users, O(1) for celebrities | O(1) + merge celebrity posts | Low for most users | Moderate | Production social networks |
The celebrity threshold
Define a follower threshold (e.g., 10,000 or 100,000) above which users switch from fan-out-on-write to fan-out-on-read. This threshold should be tuned based on your write throughput capacity and acceptable read latency percentiles.
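The hybrid routing decision can be sketched as follows. `CELEBRITY_THRESHOLD` and the in-memory caches are illustrative stand-ins for the real threshold config and timeline store:

```python
CELEBRITY_THRESHOLD = 10_000        # assumed tuning knob (see note above)

class FeedService:
    def __init__(self):
        self.followers = {}         # author -> set of follower ids
        self.timelines = {}         # user -> pre-built timeline cache (post ids)
        self.celebrity_posts = {}   # author -> posts stored once, merged at read

    def post(self, author, post_id):
        fans = self.followers.get(author, set())
        if len(fans) >= CELEBRITY_THRESHOLD:
            # Fan-out-on-read: store the post once; O(1) write cost.
            self.celebrity_posts.setdefault(author, []).append(post_id)
        else:
            # Fan-out-on-write: push into every follower's timeline cache.
            for f in fans:
                self.timelines.setdefault(f, []).append(post_id)

    def feed(self, user, following):
        # Pre-built timeline plus an on-the-fly merge of celebrity posts.
        merged = list(self.timelines.get(user, []))
        for author in following:
            merged.extend(self.celebrity_posts.get(author, []))
        return merged
```

A production feed would also merge by timestamp and paginate; the branch in `post` is the part that caps write amplification for high-follower accounts.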
The sidecar pattern deploys a helper process alongside your main application in the same host or pod. The sidecar handles cross-cutting concerns — networking (service mesh proxies like Envoy), observability (log shippers, metric collectors), security (mTLS termination, authentication proxies), or configuration (secret injection). The application itself remains unaware of the sidecar; it communicates with it over localhost or a shared Unix socket. This lets you add capabilities to any service regardless of its language, framework, or age.
Service meshes like Istio and Linkerd are the most prominent example: an Envoy sidecar proxy intercepts all inbound and outbound traffic, handling load balancing, circuit breaking, retries, mTLS, and distributed tracing without any code changes to the application. Other sidecar examples include Fluentd or Filebeat for log collection, Vault Agent for secret injection, and OPA (Open Policy Agent) for policy enforcement.
Common sidecar use cases
Sidecar lifecycle management
In Kubernetes, define the sidecar as a container in the same pod. Use an init container if the sidecar must start before the main app (e.g., Vault Agent populating secrets). Handle graceful shutdown ordering — the sidecar proxy should drain connections before the main container exits.
No pattern exists in isolation. A real system design composes multiple patterns to address different requirements simultaneously. Understanding which patterns complement each other — and which create tension — is a key differentiator in senior-level interviews.
| Pattern Combination | How They Compose | Real-World Example |
|---|---|---|
| CQRS + Event Sourcing | Events from the write side feed projections that build the read model | Order management: write events to Kafka, project to Elasticsearch for search |
| Saga + Idempotency | Each saga step must be idempotent so retries during compensation are safe | Payment saga: charge with idempotency key, compensate with refund idempotency key |
| Circuit Breaker + Retry + Backoff | Retry with backoff for transient failures; circuit breaker to stop retrying when the dependency is truly down | API gateway: Resilience4j retry wrapped in a circuit breaker per downstream service |
| Bulkhead + Circuit Breaker | Bulkhead isolates resource pools; circuit breaker stops calls to failing services within each pool | Netflix: separate thread pools per dependency, each with its own Hystrix circuit breaker |
| Sharding + Leader Election | Each shard has its own leader for writes; leader election runs per shard | Kafka: each partition has a leader broker elected via ZooKeeper/KRaft |
| Fan-out-on-write + Sidecar | Sidecar proxy handles fan-out routing and retry logic transparently | Notification system: Envoy sidecar routes push notifications to regional endpoints |
| WAL + Event Sourcing | The event log serves as both a WAL and the source of truth | Kafka-backed microservices: the commit log is the event store and durability layer |
Start simple, compose as needed
Do not apply all twelve patterns to every service. Start with the simplest approach that meets requirements. Add circuit breakers when you observe cascading failures. Add CQRS when read/write scaling diverges. Add sharding when a single node cannot hold the data. Each pattern adds operational complexity — earn that complexity with a real need.
System design interviews at FAANG companies almost always involve multiple patterns from this list. When you are asked to design a newsfeed, the interviewer expects you to discuss fan-out strategies and CQRS. When designing an e-commerce checkout, they expect sagas, idempotency, and circuit breakers. When designing a chat system, they expect sharding and leader election. The interviewer is not testing whether you can recite definitions — they are testing whether you can identify the right pattern for a given constraint and reason about its trade-offs.
Common questions:
A saga step (ReserveInventory) succeeds, but the next step (ChargePayment) fails. What happens, and how do you ensure the compensation (ReleaseInventory) does not accidentally release inventory from a different order?
The saga orchestrator triggers a compensating ReleaseInventory command using the same saga_id and step_id as the original reservation. The Inventory service uses this as an idempotency key — it only releases inventory tagged with that specific saga_id, not arbitrary inventory. This combination of saga compensation and idempotency keys ensures exactly-once semantics for the undo operation.
You are designing a social network where some users have 50 million followers. How do you handle posting for both regular users and celebrities?
Use a hybrid fan-out approach. For regular users (under a follower threshold), fan-out-on-write: push the post into every follower timeline cache at write time. For celebrities above the threshold, fan-out-on-read: do not push at write time. When a follower opens their feed, merge their pre-built timeline (from regular users) with on-the-fly lookups of celebrity posts. This caps write amplification while keeping read latency low for the majority of feed loads.
Your circuit breaker is in the Open state, but the downstream service has actually recovered. What mechanism allows traffic to resume, and what prevents a flood of requests from overwhelming the recovering service?
After a configured timeout, the circuit breaker transitions to Half-Open and permits a small number of probe requests (e.g., 3). If these probes succeed, the breaker resets to Closed and normal traffic resumes. The limited probe count prevents a sudden flood — traffic ramps up gradually. If probes fail, the breaker returns to Open for another timeout period. Combine this with gradual ramp-up (e.g., allow 10%, 25%, 50%, 100% of traffic) for extra safety.
💡 Analogy
Think of these patterns like the standard plays in a basketball playbook. A point guard does not invent a new play on every possession — they choose from a repertoire of proven plays (pick-and-roll, fast break, zone offense) based on what the defense gives them. Similarly, a system designer draws from a repertoire of proven patterns (CQRS, circuit breaker, saga) based on the constraints and failure modes they face. Mastering the playbook means knowing when each play works, when it fails, and how to combine plays into a fluid offense.
⚡ Core Idea
Distributed system design is not about inventing novel solutions but about recognising which proven pattern fits the constraint you are facing — then composing multiple patterns to address the full set of requirements.
🎯 Why It Matters
In interviews and in production, the engineers who ship reliable systems are not the ones who memorise definitions — they are the ones who can look at a requirement (high read throughput, cross-service transactions, fault isolation) and immediately reach for the right pattern. This pattern literacy separates senior engineers from junior ones.