The eight fallacies of distributed computing, CAP theorem in practice, replication, partitioning, and circuit breaker patterns.
Lesson outline
In 1994, Peter Deutsch at Sun Microsystems wrote down seven fallacies of distributed computing — incorrect assumptions that developers make when moving from single-machine to distributed systems; James Gosling later added the eighth. Thirty years later, they still cause production outages every day.
The eight fallacies — and what actually happens
1. The network is reliable: packets drop, switches reboot, cables get cut.
2. Latency is zero: every remote call adds delay, and delays compound across service hops.
3. Bandwidth is infinite: large payloads saturate links and queue behind each other.
4. The network is secure: traffic can be sniffed, spoofed, and replayed.
5. Topology doesn't change: nodes come and go; routes shift under autoscaling and failover.
6. There is one administrator: different teams change different parts of the network independently.
7. Transport cost is zero: serialization, handshakes, and bandwidth all cost CPU time and money.
8. The network is homogeneous: mixed hardware, protocols, and MTUs behave differently.
In any distributed system with a network partition (two parts of the system cannot communicate), you must choose between Consistency and Availability. You cannot have both.
| Choice | Behavior on partition | Real-world example | When to choose |
|---|---|---|---|
| CP (Consistent + Partition tolerant) | Return error/timeout rather than stale data | HBase, Zookeeper, etcd, bank account balances | Financial transactions, leader election, inventory counts — where stale = wrong |
| AP (Available + Partition tolerant) | Return possibly stale data rather than error | DynamoDB, Cassandra, CouchDB, DNS, social feeds | User experience matters more than perfect consistency — shopping carts, social feeds, product catalog |
| CA (Consistent + Available) | Only possible without partitions — single-node systems | PostgreSQL on one server (not distributed) | Not a real option for distributed systems: in any real network, partitions eventually happen |
The practical implication: "eventual consistency"
Most distributed databases choose AP and offer eventual consistency — all replicas will converge to the same value eventually, but reads may return stale data briefly. This is acceptable for most use cases. The window of inconsistency is typically milliseconds to seconds, not hours.
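The replication-lag window can be modeled with a toy in-memory store. This is a minimal sketch; the class and its single-primary, async-follower design are invented for illustration:

```python
class EventuallyConsistentStore:
    """Toy model: writes land on a primary replica immediately and are
    applied to followers only when replicate() runs (the 'lag window')."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # replication log not yet applied to followers

    def write(self, key, value):
        self.replicas[0][key] = value   # primary applies immediately
        self.pending.append((key, value))

    def read(self, replica_index, key):
        return self.replicas[replica_index].get(key)

    def replicate(self):
        for key, value in self.pending:
            for follower in self.replicas[1:]:
                follower[key] = value   # followers converge to the primary
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("stock", 0)          # item sells out; the primary knows
stale = store.read(1, "stock")   # a follower still returns the old value
store.replicate()                # the lag window closes
fresh = store.read(1, "stock")   # every replica now agrees
```

In a real database the `replicate()` step happens continuously and the window is typically milliseconds; the stale read in between is exactly what "eventual consistency" permits.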
Your e-commerce site shows product inventory counts. During a network partition, which is worse: showing a customer "5 left" when there are actually 0, or showing an error page?
The three replication strategies and when to use each
- Single-leader (primary/replica): all writes go to one leader, which ships changes to followers. Simple and the default in PostgreSQL and MySQL; the leader is a write bottleneck and a failover point.
- Multi-leader: several nodes accept writes and exchange changes, useful across datacenters; requires conflict resolution when the same record is written in two places.
- Leaderless: any replica accepts reads and writes, with quorums providing consistency guarantees. Used by Cassandra and Dynamo-style stores; tolerates node failures well at the cost of more complex read-repair logic.
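For leaderless (Dynamo-style) replication, the key arithmetic is the quorum overlap condition. A minimal sketch — the function name is invented for illustration:

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """Dynamo-style quorums: a write is acknowledged by w of n replicas
    and a read consults r of them. Every read set intersects every write
    set (so reads see the latest acknowledged write) exactly when w + r > n."""
    return w + r > n

# Classic configuration: n=3 replicas, write quorum 2, read quorum 2.
assert quorum_overlaps(3, 2, 2)      # reads overlap the latest write
assert not quorum_overlaps(3, 1, 1)  # a read may hit only a stale replica
```

Tuning w and r trades latency for consistency: lowering r makes reads faster but widens the window for stale results.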
When a downstream service is slow or failing, naive retry logic makes it worse. If 1,000 clients each retry every second, the struggling service absorbs an extra 1,000 requests per second on top of its normal traffic — a retry storm that prevents recovery.
Patterns for resilient distributed communication
The cascade failure pattern: how one slow service kills everything
Service A calls Service B (no timeout). Service B is slow. Service A threads block waiting. Service A runs out of threads. Service A starts failing. Service C calls Service A and starts failing. In 60 seconds, one slow service has cascaded to a full outage. Circuit breakers + timeouts break this chain.
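Timeouts are the first link in breaking that chain. A minimal sketch using the standard library; the wrapper name, URL, and timeout value are illustrative:

```python
import socket
import urllib.error
import urllib.request

def fetch_with_timeout(url, timeout_s=2.0):
    """Never call a downstream service without a deadline. A bounded wait
    frees the calling thread quickly instead of letting it block forever."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except (socket.timeout, urllib.error.URLError) as exc:
        # Fail fast: surface a clear error so the caller can degrade
        # gracefully (cached value, default response) instead of hanging.
        raise TimeoutError(f"downstream call failed: {exc}") from exc
```

Combined with a circuit breaker, this bounds how long any single slow dependency can hold your threads hostage.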
This material comes up constantly in senior/staff engineering interviews and distributed systems design questions; the CAP theorem question is almost universal in backend system design interviews.
Common questions:
What are the three states of a circuit breaker?
Closed (requests pass through, errors tracked), Open (all requests fast-fail to allow downstream recovery), Half-Open (one probe request allowed through; if successful, transitions back to Closed).
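Those three states can be sketched as a minimal class. The thresholds, state names, and single-probe policy here are illustrative assumptions, not taken from any specific library:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: closed -> open on repeated failure,
    open -> half_open after a recovery timeout, half_open -> closed on a
    successful probe (or back to open if the probe fails)."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"  # allow one probe request through
            else:
                raise RuntimeError("circuit open: fast-failing")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures while closed, opens the circuit.
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"  # any success (including a probe) closes it
        return result
```

The fast-fail in the open state is the whole point: callers get an immediate error instead of tying up a thread, and the downstream service gets breathing room to recover.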
Why does retry with a fixed interval cause a "thundering herd"?
All clients fail at the same time and all retry at t+1s simultaneously, creating a synchronized spike that can overwhelm the recovering service. Exponential backoff with random jitter staggers retries to prevent this.
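The fix can be sketched in a few lines; the base and cap values here are illustrative:

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0):
    """Full-jitter exponential backoff: pick a random delay in
    [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Because each client draws an independent random delay, retries that failed together are spread out over the interval instead of arriving in synchronized waves.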