The capstone lesson that ties every system design concept together into a repeatable methodology. Walk through the complete five-phase framework — requirements gathering, capacity estimation, high-level architecture, deep-dive component selection, and cross-cutting concerns — used by engineers at Google, Meta, Amazon, and Netflix in both interviews and production design reviews. Includes two full 45-minute walkthroughs (e-commerce platform and real-time analytics dashboard), a pre-interview checklist, common anti-patterns that fail candidates, and a self-evaluation rubric. This is the meta-skill that separates L5 engineers who recite patterns from L6+ engineers who synthesize complete systems under pressure.
Lesson outline
Every system design interview question you have ever studied — URL shorteners, web crawlers, notification systems, news feeds, chat systems, search autocomplete, video platforms, file storage — teaches you a specific pattern. But in a real interview, nobody asks you to implement a single pattern in isolation. They ask you to build a complete system from scratch, on a blank whiteboard, in 45 minutes, while explaining your reasoning out loud. The skill that matters is not knowing any single pattern; it is knowing how to combine them into a coherent architecture under time pressure.
This is the meta-skill of system design: the ability to take an ambiguous problem statement ("Design Uber Eats" or "Design a real-time analytics dashboard"), decompose it into structured phases, make deliberate trade-offs at each phase, and produce an architecture that is both correct and defensible. It is the difference between an engineer who can describe how consistent hashing works and an engineer who knows when to reach for consistent hashing versus range-based sharding based on the access patterns they derived from the requirements.
Why This Lesson Exists
You have spent the previous lessons mastering individual system design components: rate limiters, key-value stores, ID generators, URL shorteners, crawlers, notification systems, news feeds, chat systems, autocomplete, video platforms, and file storage. Each taught you a pattern. This lesson teaches you the framework that ties them all together. Think of it as learning to conduct the orchestra after learning to play each instrument.
The five-phase framework you will learn here is not an academic exercise. It is the same methodology used in production design reviews at companies like Google (where it is called a Design Doc review), Amazon (where it maps to the Working Backwards PR/FAQ process), and Netflix (where architecture reviews follow a structured template). The interview format is a compressed version of this production process, testing whether you can think through a system from end to end without getting lost.
The Five Phases of End-to-End System Design
The 45-Minute Time Budget
In a standard FAANG system design interview, allocate your time as follows: Requirements and Scope (5 minutes), Capacity Estimation (5 minutes), High-Level Architecture (10 minutes), Deep-Dive on 1-2 components (15 minutes), Cross-Cutting Concerns (5 minutes), Q&A and wrap-up (5 minutes). The most common failure mode is spending 20 minutes on high-level design and running out of time before showing any depth.
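The budget above is easy to misallocate under pressure; a quick sanity check (phase names are paraphrased from the callout) confirms the phases actually fill the slot:

```python
# The 45-minute interview budget from the callout; phases must sum to the slot.
budget_minutes = {
    "requirements_and_scope": 5,
    "capacity_estimation": 5,
    "high_level_architecture": 10,
    "component_deep_dive": 15,
    "cross_cutting_concerns": 5,
    "qa_and_wrap_up": 5,
}

total = sum(budget_minutes.values())
print(f"total: {total} minutes")  # must equal the 45-minute slot
```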
The critical insight that most candidates miss is that the five phases are not just a checklist — they are a dependency chain. Requirements drive estimation. Estimation constrains architecture. Architecture reveals which components need depth. Depth exposes cross-cutting concerns. Skip any phase and the entire downstream chain becomes unreliable. An architecture drawn without estimation might be 10x over-provisioned or 10x under-provisioned, and you would never know.
The Number One Failure Pattern
The most common reason experienced engineers fail system design interviews is jumping straight to the solution. They hear "Design a ride-sharing system" and immediately start drawing databases and APIs because they have built ride-sharing systems before. But the interviewer might want a food delivery system, or a carpooling system, or a system optimized for autonomous vehicles. Without requirements clarification, you are solving the wrong problem — and no amount of technical depth can save you.
Throughout this lesson, you will see how each phase builds on the previous one, and you will practice the complete methodology twice with two different systems. By the end, you will have internalized a repeatable framework that works for any system design problem you encounter.
The Five-Phase Mnemonic: REACH
Requirements, Estimation, Architecture, Component deep-dive, Hardening (cross-cutting concerns). REACH for the complete design. If you remember nothing else from this lesson, remember REACH.
Requirements gathering is the single most important phase of any system design exercise. It is also the phase most frequently skipped by candidates who are eager to show technical depth. The irony is that skipping requirements is itself a signal of shallow thinking — senior engineers know that the requirements determine the architecture, not the other way around.
In an interview setting, the problem statement is deliberately vague. "Design Twitter" could mean design the tweet creation and timeline system, or the trending topics algorithm, or the notification pipeline, or the search infrastructure. Each of these is a completely different architecture. The interviewer is testing whether you ask the right questions to narrow scope, or whether you assume scope and risk solving the wrong problem.
The Requirements Gathering Script (Use This in Every Interview)
1. Restate the problem in your own words: "So we are designing a system that allows users to [X]. Is that correct?" This confirms you understood the prompt and gives the interviewer a chance to redirect.
2. Ask about core use cases: "What are the 2-3 most important user actions? For example, for a food delivery system: browse restaurants, place orders, track delivery in real-time. Which should I focus on?"
3. Ask about scale: "How many daily active users should I design for? Are we talking about 1 million, 10 million, or 100 million?" This drives every capacity decision.
4. Ask about read/write ratio: "Is this system read-heavy (like a news feed) or write-heavy (like a logging system) or balanced (like a chat system)?" This determines your database and caching strategy.
5. Ask about latency requirements: "What is the acceptable latency for the core user action? Real-time (< 200ms), near-real-time (< 2s), or batch (minutes)?" This determines whether you need caching, WebSockets, or can use async processing.
6. Ask about consistency requirements: "Does this system need strong consistency (like a payment system) or is eventual consistency acceptable (like a social media feed)?" This determines your database replication strategy.
7. Confirm scope boundaries: "To keep this focused for 45 minutes, I will design [X, Y, Z] and defer [A, B, C] as out of scope. Does that align with what you want to see?"
Functional vs Non-Functional Requirements
Functional requirements describe what the system does (create posts, send messages, process payments). Non-functional requirements describe how well the system does it (availability, latency, throughput, durability). In an interview, spend 60% of your requirements time on functional requirements (to set scope) and 40% on non-functional requirements (to set constraints). The non-functional requirements directly determine your architecture.
| Requirement Type | Example | Architectural Impact |
|---|---|---|
| Functional | Users can post messages up to 280 characters | Simple text storage, small record size, high write QPS |
| Functional | Users see a personalized news feed | Fan-out strategy needed (push vs pull), ranking service |
| Non-functional | 99.99% availability | Multi-region deployment, no single points of failure |
| Non-functional | < 200ms p99 latency | Aggressive caching, CDN, read replicas, denormalized data |
| Non-functional | 50M DAU, 500M requests/day | Horizontal scaling, load balancing, database sharding |
| Non-functional | Strong consistency for payments | ACID transactions, synchronous replication, no caching on write path |
Scope negotiation is a subtle but powerful skill. When the interviewer says "Design Uber," they do not expect you to design the entire Uber ecosystem in 45 minutes. They expect you to identify the core components, propose a focused scope, and go deep on that scope. The best candidates proactively say: "I will focus on the ride-matching and real-time tracking system, and defer the payment processing and surge pricing algorithm as out-of-scope for today. Does that work?"
The Scope Trap
If you do not set scope, the interviewer will keep expanding the problem. You will spend 40 minutes on breadth and zero minutes on depth. The L6 rubric explicitly requires depth on at least one component. Breadth without depth is an L4 signal. Protect your time by setting scope early.
Requirements That Change Everything
The output of Phase 1 should be a crisp written or verbal summary: "We are building a food delivery platform for 50M DAU. Core use cases: restaurant browsing, order placement, real-time delivery tracking. Non-functional: 99.9% availability, < 500ms order confirmation, eventual consistency acceptable for tracking updates. Out of scope: payment processing, restaurant onboarding, menu management." This summary becomes the contract for the rest of the interview.
Write It Down
On the whiteboard or shared document, write your requirements summary in the top corner. Reference it throughout the design to justify decisions. When the interviewer asks "Why did you choose a message queue here?" you can point to the requirements: "Because our latency requirement is < 500ms for order confirmation, but delivery tracking updates can be eventually consistent, so I am decoupling them with a queue."
Capacity estimation translates the qualitative requirements from Phase 1 into quantitative constraints that drive every architectural decision. It is the bridge between "we need to handle a lot of traffic" and "we need 17 application servers behind a load balancer, a Redis cluster with 64 GB of memory, and a sharded database with 5 partitions." Without estimation, architecture is guesswork.
The Numbers You Must Know by Heart
Seconds in a day: 86,400 (round to 100,000 for easy math). Seconds in a month: 2.5 million. 1 million requests/day is roughly 12 QPS average. A single modern server handles 10K-50K simple HTTP requests/second. A single MySQL/Postgres instance handles 5K-10K queries/second. Redis handles 100K-200K operations/second. A single SSD can do 10K-100K IOPS. Network bandwidth per server: 1-10 Gbps.
The estimation chain always follows the same sequence: start with users, derive requests, derive storage, derive bandwidth, derive memory. Each step feeds the next. Here is the universal template that works for any system:
The Universal Estimation Template
1. Start with DAU (Daily Active Users). This is given in the requirements or assumed. Example: 50M DAU.
2. Derive daily actions per user. Example: each user browses 10 restaurants and places 0.5 orders per day. Total: 500M browses + 25M orders = 525M daily requests.
3. Derive QPS (Queries Per Second). Divide daily requests by 86,400. Example: 525M / 86,400 = ~6,100 QPS average. Multiply by peak factor (typically 3-5x): ~18,000-30,000 QPS peak.
4. Derive storage per record. Sum all fields: order_id (8B) + user_id (8B) + restaurant_id (8B) + items JSON (500B) + metadata (200B) = ~724B per order record. Round to 1KB.
5. Derive daily and yearly storage. Example: 25M orders/day x 1KB = 25 GB/day. Per year: ~9 TB. Over 5 years: ~45 TB.
6. Derive bandwidth. Example: if average API response is 5KB, then 30,000 QPS peak x 5KB = 150 MB/sec peak egress bandwidth.
7. Derive cache size. Apply the 80/20 rule: 20% of data serves 80% of requests. If you have 10M active restaurant listings at 2KB each = 20GB. Cache the hot 20%: 4GB fits in a single Redis node.
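The whole estimation chain fits in a short back-of-the-envelope script. This is a sketch using the food-delivery numbers from the template; the 5x peak factor and 80/20 cache split are the usual interview assumptions, not measured values:

```python
# Back-of-the-envelope capacity estimation for the food-delivery example.
# All numbers are interview-style approximations, not production sizing.

SECONDS_PER_DAY = 86_400

dau = 50_000_000                                   # 50M daily active users
daily_browses = dau * 10                           # 10 browses/user -> 500M
daily_orders = dau // 2                            # 0.5 orders/user -> 25M
daily_requests = daily_browses + daily_orders      # 525M

avg_qps = daily_requests / SECONDS_PER_DAY         # ~6,100
peak_qps = avg_qps * 5                             # ~30,000 with a 5x peak factor

order_record_bytes = 8 + 8 + 8 + 500 + 200         # ~724 B; round up to 1 KB
daily_storage_gb = daily_orders * 1_024 / 1e9      # ~25 GB/day at 1 KB/order
five_year_storage_tb = daily_storage_gb * 365 * 5 / 1_000   # ~45 TB

peak_egress_mb_s = peak_qps * 5 * 1_024 / 1e6      # 5 KB responses -> ~150 MB/s

# 80/20 rule: cache the hot 20% of the 20 GB restaurant-listing working set.
cache_gb = 10_000_000 * 2_048 / 1e9 * 0.20         # ~4 GB, fits one Redis node

print(f"avg QPS ~{avg_qps:,.0f}, peak QPS ~{peak_qps:,.0f}")
print(f"storage ~{daily_storage_gb:.0f} GB/day, ~{five_year_storage_tb:.0f} TB over 5 years")
print(f"peak egress ~{peak_egress_mb_s:.0f} MB/s, hot cache ~{cache_gb:.0f} GB")
```

The point of scripting it is not the arithmetic; it is that each derived number maps directly to an architecture decision in the flow below.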
graph LR
A[DAU: 50M] --> B[Daily Actions: 525M]
B --> C[Avg QPS: 6,100]
C --> D[Peak QPS: 30,000]
D --> E{Architecture Decision}
E -->|QPS < 1K| F[Single Server]
E -->|QPS 1K-10K| G[Load Balancer + 3-5 Servers]
E -->|QPS 10K-100K| H[Sharded DB + Cache + LB]
E -->|QPS > 100K| I[Multi-Region + CDN + Sharding]
B --> J[Storage: 25 GB/day]
J --> K[5-Year: 45 TB]
K --> L{Storage Decision}
L -->|< 1TB| M[Single DB Instance]
L -->|1-10TB| N[Read Replicas + Partitioning]
L -->|> 10TB| O[Sharded Database]
Estimation-to-architecture decision flow: numbers determine infrastructure
| Metric | Formula | Example (Food Delivery) | Architecture Implication |
|---|---|---|---|
| Average QPS | Daily requests / 86,400 | 525M / 86,400 = 6,100 | > 5K QPS: need load balancer and multiple app servers |
| Peak QPS | Avg QPS x peak factor (3-5x) | 6,100 x 5 = 30,000 | > 10K peak: need caching layer (Redis/Memcached) |
| Daily storage | Daily records x record size | 25M x 1KB = 25 GB | > 10 GB/day: plan storage growth strategy |
| 5-year storage | Daily storage x 365 x 5 | 25 GB x 1,825 = 45 TB | > 10 TB: will need sharding or archival strategy |
| Peak bandwidth | Peak QPS x avg response size | 30K x 5KB = 150 MB/s | > 100 MB/s: consider CDN for static content |
| Cache size | 20% of working set | 20% of 20GB = 4 GB | < 64 GB: single Redis node. > 64 GB: Redis cluster |
The Estimation Shortcut
In an interview, you do not need exact numbers. You need order-of-magnitude estimates that are correct within 2x. Use round numbers aggressively: 86,400 seconds/day becomes 100,000. 2.5 million seconds/month becomes 2 million. The point is not mathematical precision — it is architectural correctness. If your estimate says you need 5 servers, the real answer is probably 3-8 servers, and that is close enough to make the right architectural choice.
A common mistake is treating estimation as a one-time exercise at the start of the interview. In practice, you should re-estimate whenever you make a significant design choice. When you decide to add a cache, estimate the cache hit rate and recalculate the database QPS. When you decide to shard, estimate the per-shard QPS and verify each shard can handle it. Estimation is a continuous process, not a phase.
Do Not Forget Write Amplification
When estimating database QPS, remember that a single user action often generates multiple database operations. An order placement might write to the orders table, update the restaurant availability, create a payment record, enqueue a notification, and update analytics counters. A single user write can become 5-10 database writes. Factor this into your QPS estimate or your database will be undersized by 5-10x.
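To make the amplification concrete, here is a small sketch. The per-action fan-out counts and the 580 QPS peak are illustrative assumptions, not prescribed values:

```python
# Illustrative write amplification: one user action fans out to several DB writes.
# The fan-out counts below are assumptions for a food-delivery order placement.
WRITES_PER_ORDER = {
    "orders_insert": 1,
    "inventory_update": 1,
    "payment_record": 1,
    "notification_enqueue": 1,
    "analytics_counters": 2,
}

peak_order_qps = 580                              # assumed user-facing peak
amplification = sum(WRITES_PER_ORDER.values())    # 6x in this sketch
db_write_qps = peak_order_qps * amplification

print(f"{peak_order_qps} orders/s -> {db_write_qps} DB writes/s ({amplification}x)")
```

Size the database for `db_write_qps`, not `peak_order_qps`, or it will be undersized by the amplification factor.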
With requirements defined and capacity estimated, you now draw the high-level architecture. This is what most people think of when they hear "system design" — the boxes and arrows. But the boxes and arrows are not random. Every system you will ever design at internet scale uses the same canonical set of components, arranged in a predictable pattern. The skill is knowing which components your specific system needs (based on your estimates) and how data flows through them.
The Canonical Components (Every System Has Most of These)
graph TB
subgraph Clients
Web[Web Browser]
Mobile[Mobile App]
API[API Consumer]
end
subgraph Edge
CDN[CDN / CloudFront]
LB[Load Balancer / API Gateway]
end
subgraph Services
S1[Service A]
S2[Service B]
S3[Service C]
end
subgraph Data
Cache[(Redis Cache)]
DB[(Primary DB)]
ReadDB[(Read Replicas)]
Queue[Message Queue / Kafka]
S3Store[Object Storage / S3]
end
subgraph Async
Worker[Async Workers]
Search[(Search Index)]
Analytics[(Analytics Store)]
end
Web --> CDN
Mobile --> LB
API --> LB
CDN --> LB
LB --> S1
LB --> S2
LB --> S3
S1 --> Cache
S2 --> Cache
Cache -->|miss| DB
S1 --> DB
DB --> ReadDB
S2 --> ReadDB
S1 --> Queue
S3 --> S3Store
Queue --> Worker
Worker --> Search
Worker --> Analytics
Reference architecture: the canonical components that appear in virtually every internet-scale system
How to Draw the Architecture
Start from the left (or top) with clients and move right (or down) through the system. Draw the happy path first: the most common user action flowing through all components. Then add secondary paths. Label every arrow with the protocol (HTTP, gRPC, WebSocket, Kafka) and the data format (JSON, Protobuf, Avro). An unlabeled arrow is an undecided trade-off — the interviewer will ask about it.
The architecture should tell a story. Walk the interviewer through the primary use case: "A user opens the app, the request hits our CDN for static assets, then the API gateway for the restaurant listing API. The gateway routes to the restaurant service, which checks Redis for the cached listing. On cache miss, it queries the read replica. The response is cached for 5 minutes and returned. The whole flow takes approximately 50 milliseconds at p95."
The API Design Shortcut
For each core use case, write one API endpoint signature on the whiteboard: POST /api/v1/orders {restaurant_id, items[], payment_method} returns {order_id, estimated_delivery}. GET /api/v1/orders/{id}/track returns {status, driver_location, eta}. This shows the interviewer you are thinking about the interface contract, not just the backend plumbing.
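The two endpoint signatures from the callout can also be written as typed request/response contracts. A hedged sketch using `TypedDict`; the field names follow the callout, but the concrete types are assumptions:

```python
from typing import List, TypedDict

class OrderItem(TypedDict):
    menu_item_id: str   # assumed field name for illustration
    quantity: int

# POST /api/v1/orders
class CreateOrderRequest(TypedDict):
    restaurant_id: str
    items: List[OrderItem]
    payment_method: str

class CreateOrderResponse(TypedDict):
    order_id: str
    estimated_delivery: str  # assumed ISO-8601 timestamp

# GET /api/v1/orders/{id}/track
class TrackOrderResponse(TypedDict):
    status: str              # e.g. "preparing", "en_route", "delivered"
    driver_location: str     # assumed "lat,lon" encoding for this sketch
    eta: str
```

Writing the contract down this way surfaces questions (what encodes a location? what format is the ETA?) that are worth raising with the interviewer.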
| Component | When to Include | When to Skip |
|---|---|---|
| CDN | Static content > 10% of traffic OR users in multiple regions | Pure API service with no static assets and single-region users |
| Cache (Redis) | Read QPS > 10K OR read:write ratio > 10:1 | Write-heavy system where data changes faster than cache TTL |
| Message Queue | User action triggers > 2 downstream effects OR need to decouple services | Simple CRUD with no async processing needs |
| Read Replicas | Database read QPS > 5K after caching | Low traffic or already using NoSQL with built-in replication |
| Search Index | Full-text search, autocomplete, or faceted filtering required | Simple key-value lookups or ID-based queries only |
| Object Storage | Any binary data (images, videos, files, documents) | Text-only data that fits in relational columns |
A powerful technique is to draw the architecture incrementally. Start with the simplest possible version that works (client, single server, single database), then add components as your estimates demand. This shows the interviewer that you understand scaling as an incremental process, not a one-shot design. It also naturally creates discussion points: "At our estimated 30K peak QPS, a single database cannot handle the reads, so I am adding a Redis cache here to reduce database load by approximately 90%."
Over-Engineering Warning
Do not add components your estimates do not justify. If your system has 1,000 QPS, you do not need Kafka, sharding, or a CDN. Adding unnecessary complexity signals that you are pattern-matching from memorized architectures rather than reasoning from first principles. The interviewer wants to see that you add each component because the numbers demand it.
The high-level architecture demonstrates breadth. The deep dive demonstrates depth. In the FAANG leveling rubric, breadth without depth is an L4 signal (knows the components but cannot reason about trade-offs). Depth on one or two components is the L5-L6 differentiator. Depth on cross-cutting concerns across components is the L6-L7 signal.
Choosing what to deep-dive is itself a skill. You want to pick a component that (1) is critical to the system functioning correctly, (2) has interesting trade-offs worth discussing, and (3) plays to your strengths. The interviewer may also direct you: "Tell me more about how the database handles this at scale." Follow their lead if they redirect.
High-Value Deep-Dive Topics by System Type
The Deep-Dive Template
For every component you deep-dive, cover these four dimensions: (1) Data model — what is the schema, what are the access patterns, what indexes do you need? (2) Scaling strategy — how does this component handle 10x traffic? Sharding, replication, partitioning? (3) Failure modes — what happens when this component fails? How do you detect it and recover? (4) Trade-offs — what did you give up by choosing this approach? What are the alternatives and why did you reject them?
How to Demonstrate L6-Level Depth
1. Start with the data model: "The orders table has order_id (Snowflake ID), user_id, restaurant_id, items JSONB, status enum, created_at. I am indexing on (user_id, created_at) for order history and (restaurant_id, status) for restaurant dashboards."
2. Explain the access pattern: "The primary access pattern is point lookup by order_id for status checks (90% of reads), and range query by user_id for order history (10% of reads). This is a read-heavy workload with a 50:1 read:write ratio."
3. Show the scaling decision: "With 25M new orders per day and 45TB over 5 years, I am sharding by order_id using consistent hashing. This distributes writes evenly. For user order history queries, I maintain a secondary index in a separate table sharded by user_id."
4. Discuss failure handling: "If the primary shard is down, writes fail immediately and the client retries. Reads fall back to a read replica with 1-second replication lag. I am using the outbox pattern for events so no order confirmations are lost even during shard failover."
5. Articulate the trade-off: "I chose SQL over NoSQL because orders require ACID transactions for payment processing. The trade-off is that sharding a relational database is more operationally complex than using DynamoDB, but the transactional guarantees are worth it for financial data."
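The consistent-hashing decision in the scaling step can be sketched with a minimal hash ring. This is a teaching sketch with virtual nodes (shard names and vnode count are illustrative), not a production router:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps order IDs to shards; adding a shard moves only ~1/N of the keys."""

    def __init__(self, shards, vnodes=100):
        # Place each shard at `vnodes` pseudo-random points on the ring.
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, order_id: str) -> str:
        # Walk clockwise to the first vnode at or after the key's hash.
        idx = bisect.bisect(self._points, self._hash(order_id)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["orders-shard-1", "orders-shard-2", "orders-shard-3"])
print(ring.shard_for("order-8412"))  # deterministic: same ID always routes the same way
```

The virtual nodes are the part worth mentioning out loud in an interview: without them, three shards on a ring would get badly uneven key ranges.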
Trade-off articulation is the most important skill at the L6 level. Every design decision is a trade-off, and the interviewer wants to hear that you understand what you are giving up. "I chose eventual consistency for the restaurant listings cache because a user seeing a slightly stale menu is acceptable, but I chose strong consistency for the order placement flow because charging a user for an unavailable item is not acceptable." This sentence demonstrates more maturity than an hour of component descriptions.
Depth vs Trivia
Depth means understanding trade-offs, failure modes, and scaling strategies. Trivia means reciting implementation details like "Redis uses a single-threaded event loop with epoll." Interviewers are looking for depth, not trivia. Do not lecture about internal data structures unless the interviewer asks. Instead, focus on decisions: why did you choose this database, this sharding strategy, this consistency level? What are the alternatives and what did you trade away?
The Proactive Deep-Dive
Do not wait for the interviewer to ask you to go deep. After drawing the high-level architecture, say: "I would like to deep-dive on two areas. First, the order processing pipeline because it requires ACID transactions and saga orchestration across services. Second, the real-time delivery tracking because it requires high-frequency location updates with WebSocket fanout." This demonstrates initiative and focus.
Remember that the deep dive should always connect back to the requirements you established in Phase 1. If the requirement says "< 200ms p99 for order confirmation," your deep dive should explain exactly how the order processing pipeline achieves that latency target: "The synchronous path is client to API gateway (5ms) to order service (10ms) to database write (20ms) to payment confirmation (100ms) to response (5ms) = 140ms total. Async work (notifications, analytics) is offloaded to the message queue and does not affect user-facing latency."
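That per-hop arithmetic is worth writing on the whiteboard as an explicit latency budget. A sketch with the numbers from the paragraph (each hop value is the assumed p99 contribution):

```python
# Synchronous order-confirmation path: assumed per-hop p99 budget in milliseconds.
latency_budget_ms = {
    "client -> API gateway": 5,
    "gateway -> order service": 10,
    "order service -> DB write": 20,
    "payment confirmation": 100,
    "response to client": 5,
}

total_ms = sum(latency_budget_ms.values())
slo_ms = 200  # the requirement from Phase 1
print(f"total {total_ms}ms vs {slo_ms}ms SLO: {slo_ms - total_ms}ms headroom")
```

If the budget does not close against the SLO, something must move off the synchronous path, which is exactly why notifications and analytics go through the queue.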
Cross-cutting concerns are the architectural qualities that span all components rather than living in any single one. Caching, monitoring, security, rate limiting, and disaster recovery do not belong to any specific microservice — they are systemic properties. Addressing them is what separates a senior answer (L5+) from a junior one (L4). Junior engineers design systems that work. Senior engineers design systems that work, and that you can operate, observe, secure, and recover when things go wrong.
The Cross-Cutting Concerns Checklist
The 2-Minute Cross-Cutting Sweep
In the last 5-7 minutes of the interview, explicitly sweep through cross-cutting concerns: "Let me touch on operational concerns. For caching, I am using Redis with a 5-minute TTL and cache-aside pattern. For monitoring, I have the four golden signals on each service with alerts on p99 latency exceeding 500ms. For security, all API endpoints require JWT authentication, and I am rate-limiting at 100 requests per second per user at the API gateway. For disaster recovery, the database has a hot standby in another AZ with automatic failover." This sweep takes 2 minutes and signals production maturity.
| Concern | Junior Answer (L4) | Senior Answer (L6+) |
|---|---|---|
| Caching | "I will add a Redis cache" | "Cache-aside pattern with 5-min TTL on restaurant listings. Write-through on order status for consistency. Cache stampede prevention via probabilistic early expiration. Expected 95% hit rate reduces DB read QPS from 30K to 1.5K" |
| Monitoring | "I will add logging" | "Four golden signals per service. P99 latency SLO of 200ms. Distributed tracing with Jaeger. Alert on error rate > 0.1%. Custom business metrics: orders per minute, cart abandonment rate" |
| Security | "I will use HTTPS" | "JWT with 15-min expiry and refresh tokens. RBAC: customers vs restaurants vs drivers vs admins. Rate limit: 100 req/s per user, 1000 req/s per restaurant API key. Input validation via schema (Joi/Zod). SQL parameterized queries. PII encryption at rest" |
| Rate Limiting | "I will add rate limiting" | "Token bucket at API gateway. Per-user: 100 req/s. Per-IP: 1000 req/s (protects against DDoS). Per-endpoint: /search gets 50 req/s, /orders gets 10 req/s (higher cost per request). Distributed counter in Redis with lua script for atomicity" |
| DR | "I will have backups" | "Multi-AZ primary with synchronous replication. Cross-region async replica for regional DR. RPO: 1 second (async replication lag). RTO: 5 minutes (automatic failover). Monthly DR drill. Database backup to S3 every hour with point-in-time recovery" |
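The token-bucket limiter in the rate-limiting row can be sketched in-process. In production the counter would live in Redis with a Lua script for atomicity across gateway instances, but the refill logic is the same; this is a single-process teaching sketch:

```python
import time

class TokenBucket:
    """Allows `rate` requests/sec on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full: an idle user gets a full burst
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Per-user limit from the table: 100 req/s, with a burst capacity of 100 (assumed).
per_user = TokenBucket(rate=100, capacity=100)
allowed = sum(per_user.allow() for _ in range(150))
print(f"{allowed} of 150 burst requests allowed")  # roughly the burst capacity
```

The per-endpoint limits in the table are just separate buckets keyed by `(user, endpoint)` with different `rate` values.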
The "I Will Handle That Later" Trap
Many candidates dismiss cross-cutting concerns with "I would handle monitoring in a later phase" or "Security is an implementation detail." In a system design interview, this signals that you have not built production systems. Production engineers know that monitoring and security are not afterthoughts — they are architectural decisions that affect component choice, data flow, and operational cost. Address them proactively.
The strongest signal of production experience is when cross-cutting concerns influence your component choices rather than being bolted on after the fact. For example: "I chose Kafka over RabbitMQ for the event bus because Kafka provides message replay, which is critical for our disaster recovery strategy — if a consumer fails, we can replay events from the last checkpoint rather than losing them." This shows that DR thinking informed the technology choice, not the other way around.
The Observability Hook
If you are running short on time but want to demonstrate production maturity, say: "If I had more time, I would walk through the observability stack. Each service emits structured logs with a request correlation ID, publishes latency and error metrics to Prometheus, and participates in distributed tracing via OpenTelemetry. We have SLO-based alerts: if the 30-day error budget burn rate exceeds 2x, we page the on-call engineer." This single paragraph demonstrates SRE-level thinking.
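The burn-rate alert mentioned above has simple arithmetic behind it: burn rate is the observed error rate divided by the budgeted error rate. A sketch, where the 99.9% SLO and the example error rate are assumptions chosen for illustration:

```python
# Error-budget burn rate = observed error rate / budgeted error rate.
slo_target = 0.999              # assumed: 99.9% success over a 30-day window
error_budget = 1 - slo_target   # 0.1% of requests may fail within the window

observed_error_rate = 0.003     # assumed: 0.3% of requests currently failing
burn_rate = observed_error_rate / error_budget

PAGE_THRESHOLD = 2.0            # per the callout: page on-call above 2x burn
print(f"burn rate {burn_rate:.1f}x -> {'PAGE' if burn_rate > PAGE_THRESHOLD else 'ok'}")
```

At a sustained 2x burn rate, the 30-day budget is exhausted in 15 days, which is why 2x is a common paging threshold.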
Let us now apply the entire five-phase framework to a complete system design problem. This walkthrough simulates a 45-minute FAANG interview for the prompt: "Design an e-commerce platform like Amazon." We will follow the REACH framework (Requirements, Estimation, Architecture, Component deep-dive, Hardening) and show how each phase builds on the previous one.
Phase 1: Requirements (5 minutes)
After clarifying with the interviewer, we agree on: Core use cases: product search, product detail view, add to cart, checkout/place order, order tracking. Scale: 100M DAU, 5M orders per day. Non-functional: 99.99% availability, < 300ms p99 for search and browsing, < 1s p99 for checkout. Eventual consistency acceptable for search index, strong consistency required for orders and payments. Out of scope: seller onboarding, recommendation engine ML, fraud detection.
| Metric | Calculation | Result |
|---|---|---|
| Browse QPS (avg) | 100M DAU x 20 browses / 86,400 | ~23,000 QPS |
| Browse QPS (peak) | 23,000 x 5 | ~115,000 QPS |
| Order QPS (avg) | 5M orders / 86,400 | ~58 QPS |
| Order QPS (peak) | 58 x 10 (flash sales) | ~580 QPS |
| Product catalog size | 500M products x 2KB each | ~1 TB |
| Order storage/year | 5M/day x 1KB x 365 | ~1.8 TB/year |
| Image storage/year | 50M new images/year x 500KB | ~25 TB/year |
| Cache size (hot 20%) | 20% of 1TB catalog | ~200 GB (Redis cluster) |
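The arithmetic behind this table is worth being able to reproduce from memory under time pressure. A quick script to check the numbers (constants taken directly from the requirements above):

```python
SECONDS_PER_DAY = 86_400

dau = 100_000_000
browse_qps_avg = dau * 20 / SECONDS_PER_DAY        # ~23,000 QPS
browse_qps_peak = browse_qps_avg * 5               # ~115,000 QPS

orders_per_day = 5_000_000
order_qps_avg = orders_per_day / SECONDS_PER_DAY   # ~58 QPS
order_qps_peak = order_qps_avg * 10                # ~580 QPS (flash sales)

catalog_tb = 500_000_000 * 2_000 / 1e12            # ~1 TB at 2 KB per product
cache_gb = 0.20 * catalog_tb * 1_000               # ~200 GB hot set
```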
graph TB
subgraph Clients
Web[Web App]
Mobile[Mobile App]
end
subgraph Edge
CDN[CDN - images and static]
GW[API Gateway + Auth + Rate Limit]
end
subgraph Services
PS[Product Service]
CS[Cart Service]
OS[Order Service]
PAY[Payment Service]
INV[Inventory Service]
TRACK[Order Tracking Service]
SRCH[Search Service]
end
subgraph Data
PDB[(Product DB - Postgres)]
CDB[(Cart DB - Redis)]
ODB[(Order DB - Postgres Sharded)]
INVDB[(Inventory DB - Postgres)]
ES[(Elasticsearch)]
S3[S3 - Images]
Kafka[Kafka Event Bus]
end
Web --> CDN
Mobile --> GW
CDN --> GW
GW --> PS
GW --> CS
GW --> OS
GW --> SRCH
PS --> PDB
CS --> CDB
OS --> ODB
OS --> PAY
OS --> INV
INV --> INVDB
SRCH --> ES
PS --> S3
OS --> Kafka
Kafka --> TRACK
Kafka --> ES
E-commerce platform architecture: services decomposed by domain, connected via API gateway and Kafka event bus
Phase 4 Deep-Dive: Order Processing Pipeline
The order placement flow is the most critical path. It must be ACID-compliant and handle concurrent purchases of the same item. The flow: (1) Validate cart items are still in stock via Inventory Service (optimistic lock check), (2) Create order record with status PENDING in Orders DB, (3) Reserve inventory (decrement available count with WHERE available > 0 — this is the optimistic lock), (4) Initiate payment via Payment Service, (5) On payment success: update order status to CONFIRMED, publish OrderConfirmed event to Kafka, (6) On payment failure: release inventory reservation, update order status to FAILED. The entire synchronous path must complete within 1 second. Notifications, search index updates, and analytics are handled asynchronously via Kafka consumers.
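The six steps above amount to a small saga with one compensating action (releasing the reservation). A toy in-memory sketch of the synchronous path — all class names and stubs are ours, and `charge` stands in for the Payment Service call:

```python
from dataclasses import dataclass

@dataclass
class Cart:
    items: dict    # product_id -> quantity
    total: float

class Inventory:
    def __init__(self, stock):
        self.stock = dict(stock)
    def reserve(self, items):
        # All-or-nothing reservation (stand-in for the SQL optimistic lock in step 3).
        if any(self.stock.get(p, 0) < q for p, q in items.items()):
            return False
        for p, q in items.items():
            self.stock[p] -= q
        return True
    def release(self, items):
        for p, q in items.items():
            self.stock[p] += q

class Orders:
    def __init__(self):
        self.rows = {}
    def create(self, status):
        oid = len(self.rows) + 1
        self.rows[oid] = status
        return oid
    def update(self, oid, status):
        self.rows[oid] = status

def place_order(cart, inventory, charge, orders):
    """Synchronous critical path: create PENDING, reserve, charge, confirm or compensate."""
    oid = orders.create(status="PENDING")           # step 2
    if not inventory.reserve(cart.items):           # steps 1 + 3
        orders.update(oid, "FAILED")
        return oid, "out_of_stock"
    if charge(cart.total):                          # step 4
        orders.update(oid, "CONFIRMED")             # step 5: then publish OrderConfirmed
        return oid, "confirmed"
    inventory.release(cart.items)                   # step 6: compensating action
    orders.update(oid, "FAILED")
    return oid, "payment_failed"
```

Note that the compensation runs only after a failed charge; an out-of-stock result never touches payment, which keeps the failure modes independent.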
For the inventory management deep-dive, the key challenge is the flash sale scenario. When 10,000 users try to buy the last 100 units of a product simultaneously, naive approaches fail: SELECT available FROM inventory WHERE product_id = X followed by UPDATE inventory SET available = available - 1 has a classic race condition. The solution is optimistic locking: UPDATE inventory SET available = available - 1, version = version + 1 WHERE product_id = X AND version = current_version AND available > 0. If the update affects zero rows, the item is either out of stock or another transaction won the race. The client retries or shows "out of stock."
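The rowcount check is the whole trick: a zero-row update means you lost the race. A runnable sketch of the same statement using SQLite in place of Postgres (schema and function name are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (product_id INTEGER PRIMARY KEY,"
    " available INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO inventory VALUES (1, 100, 0)")

def try_reserve(conn, product_id, current_version):
    """One optimistic-lock attempt; True means this transaction won the race."""
    cur = conn.execute(
        "UPDATE inventory SET available = available - 1, version = version + 1 "
        "WHERE product_id = ? AND version = ? AND available > 0",
        (product_id, current_version),
    )
    conn.commit()
    # Zero rows affected => out of stock OR another writer bumped the version first.
    return cur.rowcount == 1
```

A caller that loses the race re-reads the current version and retries a bounded number of times before showing "out of stock."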
Phase 5: Cross-Cutting Concerns Sweep
Caching: Redis cluster (200GB) for product catalog with 10-minute TTL, cache-aside pattern. Redis single-node for cart data (ephemeral, no persistence needed). Monitoring: p99 latency on order placement is the primary SLI with a 1-second SLO. Alert on inventory service error rate > 0.5%. Security: JWT auth on all API endpoints, PCI-DSS compliance for payment data (payment service is a separate, hardened service that never stores full card numbers). Rate limiting: 50 req/s per user on search, 5 req/s per user on checkout (prevents scripted purchasing bots). DR: Orders DB has synchronous replication to a hot standby. Cross-region async replica for regional DR. RPO: 0 seconds for orders (synchronous replication). RTO: 30 seconds (automatic failover).
This walkthrough demonstrates the complete REACH framework in action. Notice how each phase constrained the next: the 115K peak browse QPS drove the need for a 200GB Redis cache and Elasticsearch (Postgres alone cannot serve 115K QPS). The 5M orders/day drove the sharded Orders DB and the saga-based order processing pipeline. The flash sale requirement drove the optimistic locking strategy for inventory.
For our second complete walkthrough, we tackle a fundamentally different system: "Design a real-time analytics dashboard that shows website traffic metrics (page views, unique visitors, top pages, geographic distribution) with updates every 5 seconds for customers with up to 1 billion events per day." This exercises different patterns: streaming data ingestion, time-series storage, pre-aggregation, and real-time visualization.
Phase 1: Requirements (5 minutes)
Core use cases: (1) Customers embed a JavaScript tracking snippet on their website, (2) The snippet sends page view events to our ingestion endpoint, (3) Customers view a dashboard showing real-time metrics with 5-second refresh, (4) Customers can query historical data (last 30 days) with drill-down by page, geography, device. Scale: 10,000 customers, largest customer has 100M events/day, total platform: 1B events/day. Non-functional: ingestion must never lose events (at-least-once delivery), dashboard latency < 2 seconds, historical queries < 5 seconds. Eventual consistency acceptable (5-second delay is fine). Out of scope: A/B testing, funnel analysis, custom event tracking.
| Metric | Calculation | Result |
|---|---|---|
| Ingestion QPS (avg) | 1B events / 86,400 | ~11,600 events/sec |
| Ingestion QPS (peak) | 11,600 x 5 | ~58,000 events/sec |
| Event size | URL (200B) + timestamp (8B) + user_id (16B) + geo (50B) + metadata (100B) | ~374B, round to 500B |
| Daily raw storage | 1B x 500B | ~500 GB/day |
| 30-day raw storage | 500 GB x 30 | ~15 TB |
| Dashboard read QPS | 10K customers x 1 query / 5 sec | ~2,000 QPS |
| Aggregated data size (daily) | 10K customers x 1000 pages x 24 hours x 100B per aggregation | ~24 GB/day |
graph LR
subgraph Ingestion
SDK[JS Tracking SDK]
EDGE[Edge Collectors - Global PoPs]
KAFKA[Kafka - Event Stream]
end
subgraph Processing
FLINK[Stream Processor - Flink]
AGG[Pre-Aggregation Service]
end
subgraph Storage
RAW[(Raw Events - S3 / Parquet)]
TSDB[(Time-Series DB - ClickHouse)]
CACHE[(Redis - Live Counters)]
end
subgraph Serving
API[Query API Service]
WS[WebSocket Server]
DASH[Dashboard UI]
end
SDK -->|HTTP POST| EDGE
EDGE -->|Batch + Forward| KAFKA
KAFKA --> FLINK
FLINK -->|5-sec windows| AGG
AGG --> TSDB
AGG --> CACHE
FLINK -->|Raw archival| RAW
TSDB --> API
CACHE --> WS
API --> DASH
WS -->|Push updates| DASH
Real-time analytics pipeline: edge collection to Kafka to stream processing to time-series storage with WebSocket push to dashboards
The architecture for this system is fundamentally different from the e-commerce platform. Instead of request-response, we have a streaming pipeline. The key insight is the separation between the ingestion path (which must handle 58K events/sec at peak without dropping any) and the query path (which serves 2K dashboard queries/sec). These two paths have completely different performance profiles and can be scaled independently.
Deep-Dive: Stream Processing and Pre-Aggregation
The stream processor (Apache Flink) reads from Kafka, groups events into 5-second tumbling windows by customer_id, and computes aggregations: page view counts, unique visitor counts (using HyperLogLog for memory-efficient cardinality estimation), top-N pages (using a min-heap), and geographic distribution (by mapping IP addresses to countries). These pre-aggregated results are written to ClickHouse (for historical queries) and Redis (for live dashboard updates). Pre-aggregation reduces the query-time data by a factor of roughly 1000x: instead of scanning 1 billion raw events, the dashboard reads pre-computed 5-second aggregation buckets.
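A toy version of that windowing logic, under stated simplifications: uniques are counted with an exact set rather than HyperLogLog, and top-N uses `heapq` over final counts rather than a streaming min-heap. All names are ours:

```python
import heapq
from collections import Counter, defaultdict

WINDOW_SECONDS = 5

def aggregate(events, top_n=3):
    """Group (timestamp, customer_id, url, visitor_id) events into 5-second
    tumbling windows per customer and compute per-window aggregates."""
    windows = defaultdict(lambda: {"views": Counter(), "visitors": set()})
    for ts, customer, url, visitor in events:
        bucket = int(ts) // WINDOW_SECONDS * WINDOW_SECONDS  # window start time
        w = windows[(customer, bucket)]
        w["views"][url] += 1
        w["visitors"].add(visitor)
    return {
        key: {
            "page_views": sum(w["views"].values()),
            "unique_visitors": len(w["visitors"]),
            "top_pages": heapq.nlargest(top_n, w["views"].items(), key=lambda kv: kv[1]),
        }
        for key, w in windows.items()
    }
```

The memory problem this glosses over is exactly why HyperLogLog matters: a customer with 100M daily visitors would need gigabytes for exact sets, versus a few kilobytes per HLL sketch at ~2% error.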
For the deep dive on ClickHouse as the time-series database choice: ClickHouse uses columnar storage, which compresses time-series data extremely well (10-20x compression ratios are common). A query like "show page views per hour for the last 7 days for customer X" only reads the page_views column and the timestamp column, skipping all other columns. With the MergeTree engine and proper partitioning by date and customer_id, this query scans roughly 2MB of compressed data and returns in under 100 milliseconds.
Cross-Cutting Concerns for Streaming Systems
Backpressure handling: if Flink falls behind Kafka, use Kafka consumer group lag monitoring to detect it and auto-scale Flink task managers. Exactly-once semantics: Flink provides exactly-once processing via checkpointing to S3 every 30 seconds. If a Flink task fails, it restores from the last checkpoint and replays from Kafka (which retains data for 7 days). Late-arriving events: use event-time processing with a 30-second watermark; events arriving more than 30 seconds late are directed to a side output for separate processing. Monitoring: track Kafka consumer lag, Flink checkpoint duration, ClickHouse insert latency, and end-to-end latency from event generation to dashboard display.
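The late-event rule can be sketched as a routing function. This is a simplification of Flink's event-time watermarks (which advance per source partition and are configured declaratively, not hand-rolled); the function name and list-based "side output" are ours:

```python
WATERMARK_SECONDS = 30

def route_by_event_time(events):
    """Split (event_time, payload) pairs into on-time and late streams.
    An event is 'late' if it arrives more than 30 s behind the newest
    event time seen so far (the watermark)."""
    watermark = float("-inf")
    on_time, late = [], []
    for ts, payload in events:
        watermark = max(watermark, ts)
        if ts < watermark - WATERMARK_SECONDS:
            late.append((ts, payload))    # side output, processed separately
        else:
            on_time.append((ts, payload))
    return on_time, late
```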
Notice how this system uses different patterns than the e-commerce platform: streaming ingestion instead of request-response, pre-aggregation instead of query-time computation, HyperLogLog instead of exact counting, WebSocket push instead of polling. The requirements drove these choices: 5-second freshness with 1 billion events per day makes query-time aggregation over raw data impossible (scanning 500GB of daily data every 5 seconds would require unrealistic disk throughput). Pre-aggregation is the only viable strategy, and this insight comes directly from the capacity estimation.
After studying hundreds of system design interview debriefs from FAANG candidates, a clear set of anti-patterns emerges. These mistakes are not about lacking technical knowledge — they are about process failures that prevent candidates from demonstrating the knowledge they have. Avoiding these mistakes is often the difference between a borderline and a strong hire signal.
The 10 Most Common System Design Interview Mistakes
The Pre-Interview Checklist (Review 30 Minutes Before)
- Memorize the estimation numbers (86,400 seconds/day, QPS thresholds, storage units).
- Practice the REACH mnemonic (Requirements, Estimation, Architecture, Component deep-dive, Hardening).
- Have 3-4 deep-dive topics prepared (database sharding, caching strategies, message queue patterns, real-time communication).
- Review the time budget (5-5-10-15-5-5).
- Prepare your opening: "Let me start by clarifying the requirements to make sure I design the right system."
| Phase | Self-Evaluation Question | Strong Signal | Weak Signal |
|---|---|---|---|
| Requirements | Did I ask clarifying questions before designing? | Asked 5+ questions, confirmed scope, identified key non-functional requirements | Started drawing immediately, assumed scope, no questions asked |
| Estimation | Did I derive numbers from requirements? | Calculated QPS, storage, bandwidth; used numbers to justify component choices | Skipped estimation or gave vague numbers without calculations |
| Architecture | Can I trace a request through every component? | Walked through the happy path with protocol labels, justified each component with estimates | Drew disconnected boxes, could not explain data flow, added components without justification |
| Deep Dive | Did I show depth on 1-2 components? | Discussed data model, scaling strategy, failure modes, and trade-offs for chosen components | Stayed at the surface level on all components, no trade-off discussion |
| Cross-Cutting | Did I address operational concerns? | Covered caching, monitoring, security, rate limiting, and DR in the last 5 minutes | No mention of monitoring, security, or how to operate the system in production |
The Flash Sale of System Design: Black Friday Traffic
Every e-commerce system design must address the 10-100x traffic spike scenario. If your architecture handles average load but crumbles at peak, the design is incomplete. Always ask: "What happens on the busiest day of the year?" For social media: viral content. For e-commerce: Black Friday. For video: live events. For messaging: New Year. Your architecture must degrade gracefully under extreme load, not fall over.
The final skill is self-evaluation. After practicing a system design problem, grade yourself on each phase using the rubric above. Record yourself (audio or video) doing a 45-minute mock interview and watch it back. You will immediately notice where you spend too long, where you forget phases, and where your explanations lack clarity. The candidates who pass system design interviews are not the ones with the most knowledge — they are the ones who have practiced presenting their knowledge in a structured, time-boxed format.
The REACH Framework Summary
Requirements (5 min): Ask questions, confirm scope, identify constraints. Estimation (5 min): DAU to QPS to storage to bandwidth to cache. Architecture (10 min): Canonical components, data flow, API design. Component deep-dive (15 min): Data model, scaling, failures, trade-offs on 1-2 components. Hardening (5 min): Caching, monitoring, security, rate limiting, DR. Every system. Every interview. Every time.
Your 7-Day Preparation Plan
1. Day 1: Review this lesson end-to-end. Memorize the REACH framework and the estimation numbers cheat sheet.
2. Day 2: Practice the e-commerce walkthrough from memory. Time yourself. Identify where you get stuck.
3. Day 3: Practice the analytics dashboard walkthrough from memory. Compare your architecture to the reference.
4. Day 4: Pick a new problem (Design Twitter/Instagram/WhatsApp). Apply REACH from scratch. Do not look at solutions until after.
5. Day 5: Pick another new problem (Design YouTube/Uber/Dropbox). Focus on estimation accuracy and deep-dive depth.
6. Day 6: Do a mock interview with a partner or record yourself. Review the recording against the self-evaluation rubric.
7. Day 7: Review your weak areas. Re-read the relevant individual design lessons (URL shortener, chat system, etc.) for patterns you struggled with.
End-to-end system design is THE system design interview. Every FAANG system design round tests exactly this skill: can you take an ambiguous problem, decompose it into structured phases, make deliberate trade-offs, and produce a coherent architecture in 45 minutes? This lesson is the capstone that ties together every individual design pattern into the methodology interviewers use to calibrate candidates from L4 (junior — can describe components) through L7 (principal — can synthesize complex systems with cross-cutting concerns and organizational implications). The REACH framework works for any prompt: URL shortener, chat system, video platform, or novel problems you have never seen before.
Common questions:
Try this question: Before designing, always ask the interviewer: What are the 2-3 most important use cases I should focus on? What scale should I design for (DAU, QPS)? Are there specific non-functional requirements (latency, consistency, availability) that are most important? Should I optimize for read-heavy or write-heavy workloads? Is this a greenfield design or migration from an existing system? Which aspect would you like me to go deepest on?
Strong answer: Structured approach with clear phase transitions. Estimates that drive component choices. Trade-off articulation on every major decision. Proactive deep-dive offer on 1-2 components. Cross-cutting concerns addressed in the last 5 minutes. Checking in with the interviewer after each phase. Connecting design decisions back to requirements.
Red flags: Jumping to solution without requirements. No capacity estimation. Drawing 15 boxes with no depth on any. Using buzzwords without understanding (mentioning consistent hashing but not being able to explain when it is better than range-based sharding). Over-engineering for scale the numbers do not justify. Monologuing for 10+ minutes without checking in. No mention of monitoring, security, or failure handling.
Key takeaways
You are asked to "Design Instagram" in an interview. Walk through the first 10 minutes: what questions do you ask, what do you estimate, and what is the first thing you draw?
Ask: What are the core use cases (post photos, view feed, follow users, search)? How many DAU (500M)? What is the read:write ratio (100:1)? Latency target (< 200ms for feed)? Consistency (eventual is fine for feed). Then estimate: 500M DAU x 5 feed views/day = 2.5B reads/day = 29K read QPS avg, 145K peak. 5M new posts/day x 1MB avg photo = 5TB/day image storage. Cache: 20% of feed data. First thing to draw: Client, CDN for images, API Gateway, Feed Service, Post Service, User Service, Redis Cache, Postgres (sharded by user_id), S3 for images, Kafka for fan-out.
Explain the difference between a junior (L4) and senior (L6) answer when asked "How would you handle a database that cannot keep up with read traffic?"
L4 answer: "Add a cache." L6 answer: "First, quantify the gap: our database handles 10K QPS but we need 100K. I would add a Redis cache with cache-aside pattern and a TTL appropriate for the data freshness requirements. For our product catalog (updates every few minutes), a 5-minute TTL gives us a 95% hit rate, reducing DB reads from 100K to 5K QPS. For user sessions (changes on every request), I would use write-through caching. I would also add 2-3 read replicas for queries that cannot be cached (complex aggregations). The trade-off: cache-aside has a cold-start problem and cache stampede risk, which I mitigate with probabilistic early expiration and request coalescing."
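The stampede mitigation named at the end of that answer, probabilistic early expiration, can be sketched as cache-aside with an XFetch-style early-refresh check. The `beta` knob and class name are ours; the formula follows the published XFetch approach (refresh probability rises as expiry nears and with the cost of the underlying load):

```python
import math
import random
import time

class CacheAside:
    """Cache-aside with probabilistic early expiration to soften stampedes."""
    def __init__(self, loader, ttl=300.0, beta=1.0):
        self.loader, self.ttl, self.beta = loader, ttl, beta
        self.store = {}   # key -> (value, expires_at, load_cost_seconds)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.store.get(key)
        if entry is not None:
            value, expires_at, cost = entry
            # Serve from cache unless we probabilistically choose to refresh early;
            # -log(random()) is a positive random draw scaled by load cost and beta.
            if now - cost * self.beta * math.log(random.random()) < expires_at:
                return value
        start = time.monotonic()
        value = self.loader(key)          # miss or early refresh: hit the database
        cost = time.monotonic() - start
        self.store[key] = (value, now + self.ttl, cost)
        return value
```

Because each request decides independently whether to refresh early, expensive keys tend to be reloaded by a single lucky request before the TTL fires, instead of by a thundering herd at expiry.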
Why is the capacity estimation phase critical, and what happens to your design if you skip it?
Capacity estimation is critical because it determines which architecture you need. Without it, you are guessing. A system handling 100 QPS needs a single server and database. A system handling 100K QPS needs load balancers, caching, database sharding, and possibly a CDN. If you skip estimation and draw a single database for a 100K QPS system, your design is fundamentally wrong. If you draw a sharded database with Kafka for a 100 QPS system, you are over-engineering by orders of magnitude. Both failures demonstrate that you cannot translate requirements into architecture, which is the core skill being evaluated.
💡 Analogy
End-to-end system design is like building a house. The requirements phase is the consultation with the homeowner — you must understand how many bedrooms they need, their budget, and the lot constraints before you design anything. The estimation phase is the materials quote — you calculate how much lumber, concrete, and wiring you need based on the house size, because a 1,500 sq ft house and a 15,000 sq ft mansion require completely different structural approaches. The high-level architecture is the floor plan — the arrangement of rooms, hallways, and the general flow from the front door to the kitchen. The deep dive is the plumbing and electrical detail — the specific pipe gauges, circuit breaker ratings, and drainage slopes that make each system actually work. Cross-cutting concerns are the fire codes, safety inspections, and insurance requirements — the systemic qualities that must be addressed in every room, not just one. You would never start installing plumbing before agreeing on how many bathrooms the house needs. Yet that is exactly what engineers do when they skip requirements and jump straight to database schemas.
⚡ Core Idea
System design is a structured five-phase process (REACH: Requirements, Estimation, Architecture, Component deep-dive, Hardening) where each phase constrains the next. Requirements determine what to build. Estimation determines how big to build it. Architecture determines which components to use. Deep dives demonstrate mastery of specific components. Hardening makes the system production-ready. Skipping any phase makes all downstream phases unreliable.
🎯 Why It Matters
In a FAANG interview, the system design round is the highest-signal round for L5+ candidates. It tests not just what you know, but how you think under ambiguity and time pressure. The five-phase framework gives you a repeatable process that works for any problem, prevents you from getting lost, ensures you demonstrate both breadth and depth, and naturally produces the artifacts (requirements doc, capacity estimates, architecture diagram, trade-off analysis) that interviewers use to calibrate your level. Engineers who wing it get inconsistent results. Engineers who follow a framework get consistently strong results.