The math every FAANG engineer does in their head: QPS, storage, bandwidth, memory. Real numbers from Twitter, YouTube, Instagram, WhatsApp. Learn to be accurate within 2x, fast.
Back-of-the-envelope estimation is the most underrated skill in system design. Most engineers treat it as a formality before drawing boxes — something you rush through to get to the "real" design. This is backwards. The estimates *are* the design. They tell you which architecture you need before you draw a single box.
Consider two engineers asked to design a photo sharing service. Engineer A skips estimation and draws: a single Postgres database, a Python API server, and an S3 bucket. Engineer B does the math first: 50M DAU × 3 photo views/day = 150M photo reads/day = 1,736 photo reads/sec average, 5,208/sec peak. At 500KB per photo, that's 2.6 GB/sec of bandwidth at peak. A single server cannot serve 2.6 GB/sec. Engineer A's design is wrong before it's even drawn.
Estimation Is Architecture Triage
Every capacity estimate answers a binary question: does the naive solution (single server, single DB) work, or do we need distributed infrastructure? 10K QPS → single DB fine. 100K QPS → read replicas needed. 1M QPS → sharding or NoSQL required. You can't make this decision without the numbers.
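These triage thresholds can be captured in a few lines. A minimal sketch in Python — the cutoffs follow the rule of thumb above and should be read as order-of-magnitude boundaries, not hard limits:

```python
def architecture_tier(peak_qps: float) -> str:
    """Map peak QPS to the simplest architecture that can serve it.

    Cutoffs are the lesson's rules of thumb; real boundaries depend
    on query cost, data size, and hardware.
    """
    if peak_qps < 10_000:
        return "single DB"
    if peak_qps < 100_000:
        return "read replicas + cache"
    if peak_qps < 1_000_000:
        return "sharding + Redis cluster"
    return "NoSQL + CDN + distributed everything"

print(architecture_tier(5_000))    # single DB
print(architecture_tier(250_000))  # sharding + Redis cluster
```

The point is not the exact numbers but the decision structure: the estimate selects the tier, and the tier dictates the boxes you draw.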
In a FAANG interview, estimation signals three things to the interviewer: (1) You quantify before you decide — the most critical engineering judgment skill. (2) You know the rough performance limits of common components. (3) You can derive architectural implications from numbers, not just recite patterns.
The good news: estimation is a skill with a formula, and the formula always works. There are only four estimation types you'll ever need: QPS, storage, bandwidth, and memory. Master these four and you can estimate any system.
The 4 Estimation Types and When Each Drives Architecture
Before you can estimate, you need a reference table of numbers to work from. These are the numbers experienced engineers have internalized. In an interview, you're expected to know these approximately — not perfectly, but within 2-3x.
| Storage Unit | Bytes | Real-World Reference |
|---|---|---|
| 1 KB | 1,000 bytes | A short tweet with metadata (~280 chars + fields) |
| 100 KB | 100,000 bytes | A compressed profile photo thumbnail |
| 1 MB | 1,000,000 bytes | A high-quality profile photo or 1 minute of audio |
| 10 MB | 10,000,000 bytes | A 1080p video clip (10 seconds) |
| 100 MB | 100,000,000 bytes | A compressed feature-length movie (low quality) |
| 1 GB | 1,000,000,000 bytes | A full HD movie (standard quality) |
| 1 TB | 1,000,000,000,000 bytes | 1,000 movies, or 1 year of detailed server logs |
| 1 PB | 1,000,000,000,000,000 bytes | Netflix's content library; YouTube uploads per month |
| Time Unit | Seconds | Use For |
|---|---|---|
| 1 minute | 60 sec | Rate limiting windows |
| 1 hour | 3,600 sec | Cache TTL calculation |
| 1 day | 86,400 sec | The most important number: DAU → QPS conversion |
| 1 month | 2,592,000 sec (~2.6M) | Monthly storage growth |
| 1 year | 31,536,000 sec (~31.5M) | Annual storage capacity planning |
The Numbers That Drive Every Estimation
86,400 = seconds per day (memorize this). 1M DAU × 1 req/day ÷ 86,400 = ~12 QPS. 1M DAU × 10 req/day = ~116 QPS. 1M DAU × 100 req/day = ~1,160 QPS. The pattern: multiply DAU × requests/user/day, then divide by 86,400. That gives average QPS. Multiply by 2-3x for peak.
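The same arithmetic as a reusable helper — a sketch; the default 3x peak multiplier is an assumption you should state explicitly in an interview:

```python
SECONDS_PER_DAY = 86_400

def avg_qps(dau: float, requests_per_user_per_day: float) -> float:
    """Average QPS = DAU × requests/user/day ÷ 86,400."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY

def peak_qps(dau: float, requests_per_user_per_day: float,
             peak_multiplier: float = 3.0) -> float:
    """Peak QPS = average QPS × peak multiplier (2-5x is typical)."""
    return avg_qps(dau, requests_per_user_per_day) * peak_multiplier

print(round(avg_qps(1_000_000, 10)))   # ~116 QPS average
print(round(peak_qps(1_000_000, 10)))  # ~347 QPS peak
```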
Latency numbers are equally important — they tell you when a solution is "fast enough" and when you need a cache:
| Operation | Latency | What This Means in Practice |
|---|---|---|
| L1 cache hit | 1 ns | Fastest possible — on-die CPU cache |
| L2 cache hit | 4 ns | Still extremely fast — on-die CPU cache |
| RAM access | 100 ns | In-process memory (a random pointer dereference) |
| SSD random read | 100 µs (0.1ms) | 1,000x slower than RAM |
| Network round trip (same datacenter) | 500 µs (0.5ms) | Redis, inter-service call |
| HDD random read | 10 ms | 100x slower than SSD |
| Cross-datacenter network round trip | 50-150 ms | US East to US West ~60ms; US to Europe ~150ms |
| Database query (no index) | 100-1,000 ms | Full table scan — never acceptable in production |
The latency table reveals why caching is so powerful: the difference between an in-memory Redis lookup (0.5ms) and a database query without index (100ms+) is 200x. If your system does 100K reads/sec and 99% hit Redis, only 1,000/sec hit the DB — a completely manageable load.
QPS (Queries Per Second) is the single most important number in system design estimation. It determines whether you need a monolith or microservices, a single database or sharding, one cache node or a cluster.
The formula is always: QPS = DAU × requests_per_user_per_day ÷ 86,400
Then for peak: Peak QPS = Average QPS × peak_multiplier (typically 2-5x depending on traffic pattern)
QPS Calculation Walkthrough: Instagram-like App
1. Get the DAU: Assume 500M DAU (Instagram's actual figure). If you don't know, ask in the interview or make an explicit assumption.
2. Estimate requests per user per day: A mix of reads and writes. Assume 20 feed scroll events/day (each loading 20 photos) = 400 photo reads, plus 2 photo uploads = 2 writes. Total: ~402 requests/user/day. Round to 400 for simplicity.
3. Calculate average QPS: 500M × 400 / 86,400 = 200,000,000,000 / 86,400 ≈ 2,314,815 QPS total. Round to ~2.3M QPS.
4. Split read vs. write: 500M × 400 reads / 86,400 ≈ 2.3M read QPS. 500M × 2 uploads / 86,400 ≈ 11,574 write QPS. Read:write ratio ≈ 200:1.
5. Calculate peak QPS: Instagram's peak is during lunch and evening hours — roughly 3x average. Read peak: 2.3M × 3 = ~7M QPS. Write peak: 11,574 × 3 = ~35K QPS.
6. Derive architecture: 7M read QPS cannot be served by a single database (max ~50K-100K QPS for a very well-tuned Postgres). You need a CDN for static assets, a Redis cluster for the feed cache, and multiple DB shards for user data. 35K write QPS is possible on a single primary DB with write buffering, but you'd want replicas.
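The walkthrough above, reproduced as a script. The inputs are the same assumptions stated in the steps (500M DAU, 400 reads and 2 uploads per user per day, 3x peak):

```python
SECONDS_PER_DAY = 86_400

dau = 500_000_000
reads_per_user = 20 * 20   # 20 feed scrolls × 20 photos each
writes_per_user = 2        # photo uploads

read_qps = dau * reads_per_user / SECONDS_PER_DAY     # ~2.3M average
write_qps = dau * writes_per_user / SECONDS_PER_DAY   # ~11.6K average

print(f"read QPS:  {read_qps:,.0f} avg, {read_qps * 3:,.0f} peak")
print(f"write QPS: {write_qps:,.0f} avg, {write_qps * 3:,.0f} peak")
print(f"read:write ratio ≈ {read_qps / write_qps:.0f}:1")
```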
Common Mistake: Forgetting the Peak Multiplier
Average QPS is not what you design for. You design for peak QPS, because your system must handle the worst moment — Super Bowl halftime, a viral tweet, midnight on New Year's. Typical peak multipliers: Consumer apps (lunch/evening spikes): 3-5x. B2B apps (9-5 business hours): 4-8x during business hours vs. overnight. Event-driven apps (live sports, concerts): can be 10-50x during the event. Always ask: "What's the traffic pattern? Does it spike around events?"
graph LR
A[DAU
e.g. 100M] --> B[× Requests/User/Day
e.g. 10]
B --> C[= Daily Requests
1 Billion]
C --> D[÷ 86,400 sec]
D --> E[= Average QPS
~11,574]
E --> F[× Peak Multiplier
3x]
F --> G[= Peak QPS
~35,000]
G --> H{Architectural Decision}
H -->|< 10K QPS| I[Single DB
no cache needed]
H -->|10K-100K QPS| J[Read replicas
+ Redis cache]
H -->|100K-1M QPS| K[DB sharding
+ Redis cluster]
H -->|> 1M QPS| L[NoSQL + CDN
+ distributed everything]

The QPS formula and how it drives architectural decisions
One important nuance: not all requests are equal. A "request" for a tweet feed requires 1 DB query. A "request" for a search might require 100 operations. When estimating, break mixed workloads into categories: read requests, write requests, search requests, media requests. Each has a different backend cost.
Storage estimation determines your database type, sharding strategy, and data retention policy. The key insight: you don't just estimate current storage — you estimate 5-year growth, because migrating a database at 10TB is much harder than planning for it at 10GB.
The formula: Daily storage = writes_per_day × avg_object_size. Then multiply by 365 × 5 for 5-year capacity. Add metadata overhead (typically 20-30% extra).
| Content Type | Avg Size | Compression | Effective Size |
|---|---|---|---|
| Tweet / Short text post | 280 chars + metadata | 2x compression | ~250 bytes effective |
| User profile record | Name, bio, settings, etc. | Low compressibility | ~2 KB |
| Instagram photo (display) | 1080p compressed JPG | Already compressed | ~300 KB |
| Instagram photo (original) | 12MP HEIC from iPhone | Minimal | ~8 MB |
| WhatsApp text message | ~200 bytes + metadata | 2x | ~100 bytes effective |
| YouTube video (1080p, 10 min) | H.264 encoded | Already compressed | ~500 MB |
| Spotify audio track (3 min) | 320kbps MP3 | Already compressed | ~7 MB |
| Database row (typical) | ID + 10 fields + indexes | N/A | ~1-10 KB with index overhead |
Twitter Storage Estimation: Full Walkthrough
1. Tweet volume: 300M DAU × 5 tweets/day = 1.5B tweets/day. Wait — daily active users don't all tweet. Realistically ~10% of users tweet daily. So 300M × 10% × 5 = 150M tweets/day.
2. Bytes per tweet: tweet_id (8B) + user_id (8B) + text (280 bytes avg ≈ 200 compressed) + timestamps (8B) + metadata (100B) ≈ 324 bytes. Round to 350 bytes.
3. Daily text storage: 150M × 350B = 52.5 GB/day.
4. 5-year text storage: 52.5 GB × 365 × 5 = 95.8 TB ≈ 100 TB. This means you cannot fit it in a single Postgres instance (max practical size ~10-20TB before it gets painful). You need partitioning or a NoSQL database like Cassandra.
5. Media storage: Assume 30% of tweets have images at 150KB average (compressed for display). 45M images/day × 150KB = 6.75 TB/day. 5-year media: 6.75 TB × 365 × 5 = 12.3 PB. This requires object storage (S3) with lifecycle policies to archive cold media to Glacier.
6. Cache storage: You're not caching everything — just hot recent data. Latest 100 tweets per user in the feed cache: 300M users × 100 tweets × 350 bytes = 10.5 TB. Even 10% of users active at once = 1.05 TB of cache needed. This requires a Redis cluster — definitely not a single node.
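The tweet-storage math above as a script, using the same assumptions (10% of 300M DAU tweet 5 times a day, ~350 bytes per stored tweet):

```python
dau = 300_000_000
tweeting_fraction = 0.10          # only ~10% of DAU tweet on a given day
tweets_per_tweeting_user = 5
bytes_per_tweet = 350

tweets_per_day = dau * tweeting_fraction * tweets_per_tweeting_user
daily_gb = tweets_per_day * bytes_per_tweet / 1e9    # bytes → GB
five_year_tb = daily_gb * 365 * 5 / 1e3              # GB → TB

print(f"{tweets_per_day:,.0f} tweets/day")           # ~150M
print(f"{daily_gb:.1f} GB/day of tweet text")        # ~52.5 GB
print(f"{five_year_tb:.0f} TB over 5 years")         # ~96 TB
```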
The Storage → Architecture Decision Map
- Under 100 GB: single Postgres instance. Use RDS Multi-AZ for availability.
- 100 GB to 10 TB: Postgres with partitioning by date/user_id, or DynamoDB for write-heavy workloads.
- Over 10 TB: must shard — either PostgreSQL with horizontal sharding (Citus) or NoSQL (Cassandra, DynamoDB).
- Over 1 PB: object storage (S3/GCS) is mandatory. Use tiered storage (hot → warm → cold → archive).
Bandwidth estimation is often skipped in interviews, but it's critical for two reasons: it determines whether you need a CDN (which changes your architecture significantly) and it drives cloud infrastructure costs (AWS charges $0.09/GB for data egress — this adds up fast).
The formula: Bandwidth = QPS × avg_response_size. For ingress (upload): Ingress = write_QPS × avg_object_size.
YouTube Bandwidth Estimation
1. Context: YouTube has 2.7B logged-in users. Assume 500M DAU. ~1B videos are watched per day.
2. Watch QPS: 1B views/day ÷ 86,400 = ~11,574 stream starts/sec average. Peak (2-3x): ~30,000 starts/sec. Note that concurrent streams are far higher, because each view lasts minutes, not a second.
3. Video bitrate: A 1080p YouTube stream ≈ 5 Mbps (megabits per second). A 720p stream ≈ 2.5 Mbps. Assume 3 Mbps average.
4. Egress bandwidth: even the floor of 30,000 simultaneous streams × 3 Mbps = 90,000 Mbps = 90 Gbps. At commodity CDN rates ($0.09/GB), 90 Gbps = 11.25 GB/sec ≈ $3,600/hour — and true concurrency runs into the tens of millions of viewers, pushing egress into the tens of Tbps (see the deep dive later in this lesson). This is why YouTube serves video from its own edge network (Google Global Cache) and peers directly with ISPs instead of paying per byte.
5. Upload bandwidth: 500 hours of video are uploaded per minute (YouTube's real stat). At ~1 GB per hour of video, that's 500 GB/minute ≈ 8.3 GB/sec of ingress — which is why uploads go through specialized upload servers with aggressive compression pipelines.
6. Storage cost implication: 500 GB/minute ≈ 720 TB of new raw video per day. At S3's $0.023/GB-month, each day's uploads would add roughly $16,500/month of recurring storage cost, compounding every day. In reality, YouTube transcodes every upload to multiple resolutions (360p, 720p, 1080p, 4K) and keeps raw originals off hot storage.
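A quick sanity check on the upload math. The ~1 GB per hour of video figure is an assumption, chosen to be consistent with the raw-upload rate used in the deep dive later in this lesson:

```python
hours_uploaded_per_minute = 500   # YouTube's published stat
gb_per_hour_of_video = 1.0        # assumed average for raw 1080p

ingress_gb_per_sec = hours_uploaded_per_minute * gb_per_hour_of_video / 60
daily_upload_tb = ingress_gb_per_sec * 86_400 / 1e3   # GB/sec → TB/day

print(f"ingress: {ingress_gb_per_sec:.1f} GB/sec")      # ~8.3 GB/sec
print(f"daily raw uploads: {daily_upload_tb:.0f} TB")   # ~720 TB/day
```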
Bandwidth Threshold → CDN Decision
Rule of thumb: if your static asset or media bandwidth exceeds 1 GB/sec, you need a CDN. Without CDN, your origin servers spend all their time serving bytes instead of processing requests. With a CDN, 95%+ of asset requests are served from edge nodes close to the user, cutting origin load by 20x and reducing latency from 100ms to < 10ms.
graph LR
U[User Devices
1B requests/day] -->|No CDN| OS[Origin Servers
11,574 req/sec]
OS --> DB[(Database)]
U2[User Devices
1B requests/day] -->|95% cache hit| CDN[CDN Edge
~11,000 req/sec cached]
CDN -->|5% miss| OS2[Origin Servers
~578 req/sec]
OS2 --> DB2[(Database)]
note1[Without CDN: 11,574 req/sec to origin]
note2[With CDN: 578 req/sec to origin
20x reduction]

CDN reduces origin server load by 20x for read-heavy media workloads
| Service | Daily Bandwidth | Peak Bandwidth | CDN Strategy |
|---|---|---|---|
| Twitter (text only) | ~50 GB/day | ~10 MB/sec | CDN for media only; API is lightweight |
| Instagram | ~500 TB/day | ~10 GB/sec | Multi-CDN (Akamai + FB PoPs), image CDN |
| YouTube | ~720 TB/day uploads, ~500 PB/day views | tens of Tbps streaming | Own CDN (Google Global Cache), ISP peering |
| Netflix | ~400 PB/day streaming | tens of Tbps | Open Connect CDN, placed inside ISPs |
| Facebook | ~100 PB/day media | multi-Tbps | Multiple CDNs + direct ISP peering |
Cache sizing is the estimation that most engineers get wrong, because they either over-provision (wasting money) or under-provision (causing cache thrashing that kills database performance). The right approach is to estimate the working set — the "hot" data that gets accessed repeatedly.
The 80-20 rule applies to caching: 20% of data accounts for 80% of reads. This is the data that should be cached. If you size your cache to hold 20% of your total data, you'll absorb 80% of your read load.
Redis Cache Sizing for a Twitter-like Feed
1. Total data: 300M users × 100 recent tweets in feed × 350 bytes/tweet = 10.5 TB of total feed data.
2. Working set (hot 20%): 10.5 TB × 20% = 2.1 TB. But not all hot users are active simultaneously.
3. Concurrently active users: At peak, Twitter sees ~50M concurrent sessions. 50M ÷ 300M = 16.7% of users active.
4. Cache target: Size the cache to hold feed data for concurrently active users. 50M × 100 tweets × 350 bytes = 1.75 TB. Round to 2 TB for headroom.
5. Redis nodes needed: A single Redis node maxes out at ~25-50 GB of memory before management overhead degrades performance. 2 TB ÷ 25 GB = 80 Redis nodes. Use Redis Cluster with consistent hashing. Add 50% overhead for replication: 120 nodes total.
6. Cost implication: 120 r6g.xlarge-class Redis nodes on AWS ≈ 120 × $0.226/hour = $27.12/hour ≈ $238,000/year just for the feed cache. This is why Twitter invested in its own cache systems such as Nighthawk.
Cache Sizing Rules of Thumb
For any cache sizing question: (1) Calculate total data size for the domain. (2) Take 20% for working set. (3) Account for serialization overhead (Redis stores objects with key + value + metadata — typically 2-3x raw data size). (4) Single Redis node: 25-50 GB usable. (5) Add 50% overhead for replicas. The formula: cache_nodes = (working_set_GB × 2.5 overhead × 1.5 replication) ÷ 30 GB per node.
| Cache Type | Typical Capacity | Latency | Use Case |
|---|---|---|---|
| L1 (in-process, Guava/Caffeine) | 1-10 GB per JVM | < 1 µs | Frequently accessed config, auth tokens, computed values |
| L2 (Redis single node) | 25-50 GB | 0.5-2 ms | Session data, user profiles, small working sets |
| L3 (Redis Cluster) | 100 GB - 100 TB | 1-5 ms | Feed data, product catalog, large working sets |
| CDN Edge Cache | Unlimited (edge-distributed) | 1-50 ms | Static assets, media files, API responses with long TTL |
Let's put it all together with a real estimation exercise. We'll estimate Twitter at its actual scale (pre-Elon acquisition, ~2022), then derive the architecture directly from the numbers.
Twitter's Real Numbers (2022)
Twitter had ~238M mDAU (monetizable daily active users). ~500M tweets per day. ~350K QPS at peak (verified by Twitter Engineering blog). ~1PB of new data per day including media. Their infrastructure ran on ~100,000 servers.
Full Twitter Estimation: Matching Reality
1. Write QPS (tweets): 500M tweets/day ÷ 86,400 = 5,787 tweets/sec average. Peak (3x): ~17,361 tweets/sec. Validation: Twitter Engineering has cited ~350K total QPS including reads — this tweet-write rate is plausible.
2. Read QPS (timeline reads): Twitter has a ~100:1 read:write ratio for feeds. 5,787 × 100 = ~578,700 read QPS average. Peak: ~1.7M read QPS. Add search queries, notifications, and DMs — totals peak around 350K-500K QPS in practice (lower than the pure feed math because not everyone loads their feed constantly).
3. Tweet storage: 500M tweets/day × 500 bytes = 250 GB/day of tweet text. 5-year tweet storage: 250 GB × 365 × 5 = 456 TB ≈ 0.5 PB. This drove Twitter to Cassandra (NoSQL) — too large for a single Postgres.
4. Media storage: ~30% of tweets have media. 150M media tweets/day × 500KB average compressed = 75 TB/day of new media. 5-year: 75 TB × 365 × 5 ≈ 137 PB. Twitter uses its own Blobstore plus a CDN for this.
5. Fan-out cost: Average follower count: ~200. Celebrity accounts: up to 130M followers. When Taylor Swift tweets: 130M × 1 Redis write = 130M operations. At 1M Redis ops/sec, that's 130 seconds to fan out — unacceptable. This forces a hybrid fan-out: push (precompute timelines) for typical users, pull at read time for celebrity accounts.
6. Cache size: 238M DAU. Feed cache: top 100 tweets per user × 500 bytes × 238M users = 11.9 TB working set. At peak concurrency (50M users at once): 50M × 100 × 500B = 2.5 TB needed in Redis. Twitter runs ~2,000+ Redis nodes across its various cache layers.
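The fan-out bottleneck in step 5 is easy to quantify, which is exactly the argument for the hybrid push/pull split. The 100K-follower cutoff below is an illustrative assumption, not Twitter's published threshold:

```python
def fanout_seconds(followers: int, redis_ops_per_sec: int = 1_000_000) -> float:
    """Time to push one tweet into every follower's cached timeline."""
    return followers / redis_ops_per_sec

def fanout_strategy(followers: int, push_limit: int = 100_000) -> str:
    """Push (precompute timelines) for normal users, pull for celebrities."""
    return "push" if followers < push_limit else "pull"

print(fanout_seconds(200))           # 0.0002 s for an average user
print(fanout_seconds(130_000_000))   # 130 s for a celebrity — too slow
print(fanout_strategy(130_000_000))  # pull
```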
graph TB
Numbers["Twitter Numbers
238M DAU, 500M tweets/day"] --> QPS["Write QPS: 5,787/sec
Read QPS: 578K/sec avg"]
Numbers --> Storage["Tweet Storage: 250GB/day
Media: 75TB/day
5-year: 137PB media"]
Numbers --> Cache["Feed Cache: 2.5TB
Requires Redis Cluster
~2000+ nodes at Twitter"]
QPS --> Arch1["→ Cassandra for tweet storage
(too much for Postgres)"]
QPS --> Arch2["→ Redis mandatory
(100:1 read:write)"]
Storage --> Arch3["→ Object storage + CDN
(137PB needs S3-scale)"]
Cache --> Arch4["→ Hybrid fan-out
(celebrity accounts overflow Redis)"]

Twitter estimation → architecture decisions: every box in the architecture is justified by a number
YouTube is the canonical example for bandwidth and storage estimation because the numbers are staggeringly large and well-documented. This case study shows how media-heavy services are fundamentally different from text-heavy services.
YouTube's actual stats (2023): 500 hours of video uploaded per minute. 2.7B logged-in users. 1B+ hours of video watched per day.
YouTube Storage and Bandwidth Deep Dive
1. Upload rate: 500 hours/min × 60 min × 24 hours = 720,000 hours of video per day. At 1 GB per hour of raw 1080p video: 720,000 GB = 720 TB of raw uploads per day.
2. Transcoded storage: YouTube stores videos at 7 quality levels (144p, 240p, 360p, 480p, 720p, 1080p, 4K). Combined, the transcoded versions ≈ 3x the size of the original. 720 TB × 3 = 2.16 PB added per day.
3. 5-year total video storage: 2.16 PB/day × 365 × 5 = 3,942 PB ≈ 4 exabytes. Only Google-scale infrastructure can handle this. YouTube uses Google's Colossus distributed file system.
4. Viewing bandwidth: 1B hours of video watched per day = 3.6 × 10^12 viewer-seconds/day. At 2 Mbps average bitrate (a mix of 720p, 1080p, mobile): 7.2 × 10^12 megabits/day. Spread over 86,400 seconds: ~8.3 × 10^7 Mbps ≈ 83 Tbps average egress. Peak (~1.5x, since a global audience smooths the daily curve): ~125 Tbps.
5. CDN implication: 125 Tbps ≈ 15,600 GB/sec. At $0.09/GB that's ~$1,400/second ≈ $5M/hour at standard CDN pricing — economically impossible to buy commercially. This is why YouTube/Google runs its own CDN (Google Global Cache), peering directly with major ISPs at no per-byte transfer cost.
6. Video processing load: When a video is uploaded, YouTube runs it through a transcoding pipeline — one job per resolution × bitrate variant. 720,000 hours uploaded/day ÷ 86,400 sec = 8.33 hours of video arriving per second. At 7 renditions each, ~58 rendition-hours of transcoding work arrive every second. This requires a dedicated async job queue (Google runs custom transcoding workers on Borg, its cluster management system).
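The viewing-bandwidth step, checked in code with the stated inputs (1B hours watched per day at 2 Mbps average):

```python
hours_watched_per_day = 1_000_000_000
avg_bitrate_mbps = 2

viewer_seconds_per_day = hours_watched_per_day * 3_600
megabits_per_day = viewer_seconds_per_day * avg_bitrate_mbps
avg_egress_tbps = megabits_per_day / 86_400 / 1e6   # Mbps → Tbps

print(f"average egress: {avg_egress_tbps:.0f} Tbps")   # ~83 Tbps
```

Equivalently: 1B hours spread over 24 hours means ~42M concurrent viewers on average, and 42M × 2 Mbps lands at the same ~83 Tbps.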
The Estimation → Architecture Reveal for YouTube
~125 Tbps peak egress → Google runs its own CDN (Google Global Cache boxes inside ISP networks). 4 EB of video storage → Google Colossus distributed file system (custom, not S3). ~58 rendition-hours/sec of transcode work → dedicated async job queue, not in-band processing. 1B hours watched/day → the recommendation system must serve personalized suggestions in < 100ms. These aren't arbitrary architecture choices — every one is forced by the numbers.
After seeing hundreds of estimation exercises in system design interviews, there are consistent mistakes that engineers make. These mistakes either make the interviewer lose confidence in the estimates, or worse, lead to designing the wrong architecture.
The 8 Most Common Estimation Mistakes
The Estimation Sanity Check
After any estimation, run a quick sanity check against known real-world systems. Is your QPS estimate higher than Google Search (~8.5 billion queries/day ≈ 100K QPS average)? Then something is probably wrong. Is your storage estimate larger than Amazon S3 (hundreds of exabytes)? Recheck. Useful reference points: Google Search: ~100K QPS average. Visa: ~24,000 TPS peak. Netflix: peak streaming in the tens of Tbps via Open Connect. AWS S3: 100+ trillion objects.
graph TD
Start[Start Estimation] --> DAU[Get DAU from requirements]
DAU --> ReqPerUser[Estimate requests per user per day
by use case]
ReqPerUser --> AvgQPS[Average QPS = DAU × req ÷ 86400]
AvgQPS --> PeakQPS[Peak QPS = Average × 2-5x]
PeakQPS --> Decision{What does Peak QPS tell us?}
Decision -->|< 1K| Mono[Monolith + single DB]
Decision -->|1K - 10K| SingleDB[Single DB + cache]
Decision -->|10K - 100K| Replicas[Read replicas + Redis]
Decision -->|100K - 1M| Sharding[DB sharding + Redis cluster]
Decision -->|> 1M| NoSQL[NoSQL + CDN + custom infra]
Mono --> Storage[Calculate storage]
SingleDB --> Storage
Replicas --> Storage
Sharding --> Storage
NoSQL --> Storage
Storage --> Bandwidth[Calculate bandwidth]
Bandwidth --> CDN{Bandwidth > 1 GB/s?}
CDN -->|Yes| AddCDN[Add CDN to architecture]
CDN -->|No| NoCDN[CDN optional]

Complete estimation decision flowchart: from DAU to architecture decisions
Every FAANG system design interview includes an estimation section, usually after requirements. The interviewer is explicitly watching for: (1) Does the candidate write numbers on the board? (2) Do they show the formula? (3) Do they derive architecture from numbers rather than recite patterns? Some interviewers ask standalone estimation questions: "Without designing a system, just estimate: how many storage servers does Instagram need?"
Common questions:
Key takeaways
An interviewer says: "Design a Dropbox-like file storage system for 100M users, where each user stores an average of 200 files at 5MB each." Estimate the storage requirements for 5 years.
Current storage: 100M users × 200 files × 5MB = 100 PB of raw data (each user stores ~1 GB). With 3x replication for durability: 300 PB. With 20% metadata overhead: 360 PB today. Growth: assume ~20% year-over-year growth in total data. Year 1: 360 PB × 1.2 = 432 PB. Year 2: 518 PB. Year 3: 622 PB. Year 4: 746 PB. Year 5: 896 PB ≈ ~0.9 EB. Architectural implication: hundreds of petabytes mandate object storage (S3 or equivalent) with tiered storage (frequently accessed files on hot storage, rarely accessed on Glacier/cold tiers). A relational DB cannot hold the file bytes — only the file metadata (owner, path, version history). This also motivates chunking: Dropbox splits files into 4MB chunks and syncs only changed chunks, cutting upload bandwidth substantially.
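Recomputing the answer's numbers from the stated inputs (100M users × 200 files × 5MB, with an assumed ~20% yearly growth rate):

```python
users = 100_000_000
files_per_user = 200
mb_per_file = 5

raw_pb = users * files_per_user * mb_per_file / 1e9   # MB → PB
replicated_pb = raw_pb * 3                            # 3x durability copies
with_metadata_pb = replicated_pb * 1.2                # +20% metadata overhead

year5_pb = with_metadata_pb * 1.2 ** 5                # 20% YoY compounding

print(f"raw data today: {raw_pb:.0f} PB")              # 100 PB
print(f"provisioned today: {with_metadata_pb:.0f} PB") # 360 PB
print(f"year 5: {year5_pb:.0f} PB")                    # ~896 PB ≈ 0.9 EB
```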
You're estimating a real-time chat system like WhatsApp. 2B users, 60B messages/day. How do you estimate storage and what database would you use?
Storage per message: sender_id (8B) + recipient_id (8B) + message body (avg 100 bytes compressed) + timestamps (16B) + metadata (30B) ≈ 162 bytes. Round to 200 bytes. Daily storage: 60B messages × 200B = 12 TB/day. 5-year storage: 12 TB × 365 × 5 = 21.9 PB ≈ 22 PB. With 3x replication: 66 PB. This immediately rules out a single relational database — no SQL cluster handles 66 PB at reasonable cost. WhatsApp actually runs on Erlang with an in-house distributed datastore. For an interview answer, Cassandra is ideal: messages are written once and read a few times, and the access pattern is always "get messages for conversation X after timestamp Y" — a perfect fit for a Cassandra partition key (conversation_id) plus clustering column (timestamp). Note: WhatsApp chose NOT to store messages long-term on the server — messages live on devices, and servers act only as a relay. This cuts server storage dramatically but makes clients the source of truth.
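The same storage math as a script, using the inputs stated above (60B messages/day at ~200 bytes each, 3x replication):

```python
messages_per_day = 60_000_000_000
bytes_per_message = 200

daily_tb = messages_per_day * bytes_per_message / 1e12   # bytes → TB
five_year_pb = daily_tb * 365 * 5 / 1e3                  # TB → PB
replicated_pb = five_year_pb * 3

print(f"{daily_tb:.0f} TB/day")                        # 12 TB/day
print(f"{five_year_pb:.1f} PB over 5 years")           # ~21.9 PB
print(f"{replicated_pb:.0f} PB with 3x replication")   # ~66 PB
```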
Your interview question is "Design YouTube." Your estimation shows on the order of 100 Tbps of peak egress bandwidth. The interviewer asks: "What does that number tell you about the architecture?" How do you answer?
Egress at that scale has massive cost and architectural implications. First, cost: 100 Tbps ≈ 12,500 GB/sec. At standard CDN rates of $0.09/GB, that's ~$1,125/second ≈ $4M/hour ≈ $35B/year at list prices — economically impossible to route through a commercial CDN. Second, architectural implication: YouTube must own its CDN infrastructure. This is exactly what they built — Google Global Cache (GGC), a network of edge servers placed physically inside ISP data centers, serving video directly to users without per-byte transfer costs. YouTube negotiates peering agreements with ISPs rather than paying per byte. Third, content placement: at this scale you must pre-place popular videos at edge nodes close to users. 20% of videos account for 80% of views (a Zipf-like distribution), so caching the popular 20% at the edge absorbs ~80% of the bandwidth. This drives a cache-warming pipeline: predicting which videos will trend and pre-seeding them to edge nodes before the traffic spike hits.
💡 Analogy
Back-of-the-envelope estimation is exactly like weather forecasting. A meteorologist doesn't need to know the exact temperature at every point in the atmosphere to accurately predict rain. They use known patterns, simplified models, and error bounds to get a forecast that's "right enough" to be useful. Similarly, you don't need exact QPS — you need to be within 2x to make the right architectural decision. ±2x is fine. ±10x (confusing millions for thousands) means you design the wrong system entirely.
⚡ Core Idea
Every estimation follows the same formula: DAU × action_rate → QPS → architectural boundary. QPS < 10K: single server. QPS < 100K: read replicas + cache. QPS < 1M: sharding + Redis cluster. QPS > 1M: NoSQL + CDN + distributed everything. The numbers aren't just academic — they are the architecture. If you skip estimation, you're guessing at which architecture to use.
🎯 Why It Matters
The difference between a 10K QPS system and a 1M QPS system isn't just 100x more servers. It's an entirely different architecture: different database type, different caching strategy, different deployment model, 100x more operational complexity. Wrong-order-of-magnitude estimates cause engineers to over-engineer (wasting millions in infrastructure) or under-engineer (causing production outages). Estimation is the skill that prevents both.