Interactive Explainer

Scaling from One User to Millions

How Instagram went from a single AWS server to serving 2 billion users — and the exact architectural decisions made at each inflection point. Master the universal scaling playbook used by every major tech company.

🎯Key Takeaways
Every system starts as a single server. Scale incrementally — add each layer when you hit the specific bottleneck that layer solves.
The scaling layers in order: database separation → load balancer → read replicas → caching → CDN → sharding → multi-region.
Instagram reached 1B users with 13 engineers by following this exact playbook, not by over-engineering from day 1.
Caching is the highest-ROI scaling technique: a Redis cache hit is 100-500x faster than a database query.
Sharding is painful — exhaust all other options (replicas, caching, vertical scaling) before sharding.

~17 min read

Why Scaling Matters: The Instagram Story

On October 6, 2010, Instagram launched. Within 24 hours, 25,000 people had signed up. Within 3 months: 1 million users. Within 18 months: 30 million users. By 2018: 1 billion active users.

The engineering team that handled this was just 13 people when Facebook acquired Instagram for $1 billion in 2012. They scaled from a single server to a globally distributed system — and the decisions they made are the foundation of system design.

Real Numbers: Instagram Day 1 vs Today

Day 1 (Oct 2010): 25K users, 1 server, ~50 req/sec, $50/month
Year 1 (Oct 2011): 10M users, 3 servers, ~500 req/sec, $5K/month
Acquisition (Apr 2012): 30M users, 30+ servers, ~2K req/sec, $50K/month
Today (2024): 2B users, 100K+ servers, ~5M req/sec, $100M+/month

The key insight: every architecture decision was driven by a specific bottleneck. Instagram did NOT design for 2 billion users on day 1. They solved each bottleneck as it appeared. This is the correct approach — over-engineering early is as dangerous as under-engineering.

graph TD
    A[Day 1: 25K Users] --> B[Month 3: 1M Users]
    B --> C[Month 18: 10M Users]
    C --> D[Acquisition: 30M Users]
    D --> E[Today: 2B Users]

    A -. Single EC2 Server .-> F[Single Server Architecture]
    B -. Database Overload .-> G[Database Separation]
    C -. Read Bottleneck .-> H[Read Replicas + Cache]
    D -. Write Bottleneck .-> I[Sharding + CDN]
    E -. Global Scale .-> J[Multi-Region + Microservices]

Instagram scaling journey: each growth milestone triggered a specific architectural change

Level 1: Single Server Architecture — Where Every App Starts

Every successful app starts the same way: one server, one database, one engineer at 3am wondering why it is slow. This is not a failure — it is correct engineering. Premature optimization is the enemy of shipping.

graph LR
    User[👤 User Request] --> DNS[DNS Lookup
yourapp.com → 54.23.1.1]
    DNS --> Server[Single EC2 Server
t3.medium: 2 vCPU, 4GB RAM
Nginx + Node.js + PostgreSQL]
    Server --> DB[(PostgreSQL
Same Machine
10GB SSD)]
    Server --> Files[User Photos
Local Disk
100GB]

    style Server fill:#f96,stroke:#333,stroke-width:3px
    style DB fill:#6c9,stroke:#333

Single server: everything on one machine. Simple to deploy, simple to debug, single point of failure.

What runs on that single server: Nginx (web server), Node.js/Python/Rails (application), PostgreSQL/MySQL (database), file system (user uploads). Total cost: $50-200/month on AWS.

| Component   | Purpose                   | Bottleneck at Scale               | Max Capacity (rough) |
| ----------- | ------------------------- | --------------------------------- | -------------------- |
| Nginx       | Serve HTTP, reverse proxy | Rarely the bottleneck             | 50K req/sec          |
| Node.js App | Business logic            | CPU-bound at high concurrency     | 1-5K req/sec         |
| PostgreSQL  | Data storage + queries    | Disk I/O, CPU for complex queries | 1-10K req/sec        |
| Local disk  | File storage              | IOPS and capacity limits          | 100-500 MB/s         |

When Single Server Is Fine

  • < 10,000 daily active users
  • < 100 req/sec sustained
  • < 100GB data
  • Non-critical availability (five nines not needed)
  • Early-stage startup or internal tool

Do NOT over-engineer early. An AWS t3.medium at ~$30/month can handle most early-stage apps.

First bottleneck signals to watch: database CPU > 70% sustained, response times > 500ms under load, disk I/O wait > 20%, memory pressure causing swapping.

Level 2: Database Separation — Splitting the Monolith

The first scaling step is almost always the same: move the database to its own machine. This was Instagram's first architectural change, made at roughly 100K users.

Why database first? The database is stateful — it holds your data. The app server is stateless — it just processes requests. Separating them lets you scale each independently.

Database Separation: Step-by-Step

  1. Provision a new, more powerful RDS instance (r6g.large: 2 vCPU, 16GB RAM, optimized for databases)
  2. Run both old and new databases in parallel for 24 hours, writing to both, reading from the old
  3. Verify data consistency between old and new (checksums, row counts)
  4. Switch reads to the new database, monitor for errors
  5. Switch writes to the new database, decommission the old DB on the app server
  6. The app server now has 4GB of RAM freed — use it for application caching

graph TB
    Users[👥 100K Users] --> LB_DNS[DNS Round-robin
or Elastic IP]
    LB_DNS --> App[App Server
ec2: t3.large
2 vCPU, 8GB RAM
Nginx + Node.js]
    App --> DB[(Dedicated DB Server
rds: r6g.large
2 vCPU, 16GB RAM
PostgreSQL)]
    App --> S3[AWS S3
User Photos
Unlimited scale
$0.023/GB/month]

    style App fill:#6af,stroke:#333,stroke-width:2px
    style DB fill:#6c9,stroke:#333,stroke-width:2px

After database separation: app and DB on separate machines. Now can scale independently.

| Metric          | Before (Single Server) | After (Separated)    | Improvement      |
| --------------- | ---------------------- | -------------------- | ---------------- |
| App server CPU  | 85% (DB stealing)      | 30%                  | 55% freed        |
| DB server RAM   | 2GB available          | 16GB dedicated       | 8x more          |
| Max req/sec     | ~200                   | ~1,000               | 5x               |
| Monthly cost    | $50                    | $200                 | 4x               |
| Failure domains | 1 (everything)         | 2 (app + DB)         | Better isolation |

The Hidden Cost: Network Latency

Before separation: app → DB was memory/IPC (~0.1ms).
After separation: app → DB is a network hop (~1-2ms).
For an endpoint making 10 DB queries: ~1ms total before vs 10-20ms after.
Mitigation: connection pooling (PgBouncer), N+1 query elimination, caching hot queries.
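The N+1 mitigation above can be sketched in a few lines. This is a toy example: `FakeDB` is a hypothetical stand-in that just counts round-trips, where a real app would use a driver like psycopg2 or an ORM.

```python
# Toy sketch: eliminating N+1 queries to offset the network hop added
# by database separation. FakeDB is a stand-in that counts round-trips.

class FakeDB:
    def __init__(self, rows):
        self.rows = rows          # {user_id: profile}
        self.round_trips = 0      # each call = one ~1-2ms network hop

    def get_profile(self, user_id):
        self.round_trips += 1
        return self.rows[user_id]

    def get_profiles(self, user_ids):  # one query with WHERE id IN (...)
        self.round_trips += 1
        return [self.rows[u] for u in user_ids]

db = FakeDB({i: f"profile-{i}" for i in range(10)})

# N+1 pattern: 10 queries -> ~10-20ms of pure network latency
naive = [db.get_profile(i) for i in range(10)]
assert db.round_trips == 10

db.round_trips = 0
# Batched pattern: 1 query -> ~1-2ms
batched = db.get_profiles(list(range(10)))
assert db.round_trips == 1
```

The batched call returns the same data with a tenth of the network round-trips, which is exactly what makes the separated-database latency tolerable.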

Level 3: Load Balancing — Handling Horizontal Scale

At ~500K users, a single app server starts struggling. The solution: add more app servers and put a load balancer in front. This is "horizontal scaling" — adding more machines instead of bigger ones.

Vertical vs Horizontal: The Numbers

Vertical (bigger server):
  • t3.medium → r6g.16xlarge (128 vCPU, 1TB RAM) = $3.84/hr ≈ $2,765/month
  • Hard ceiling: ~448 vCPU (AWS's largest instance)
  • Still a single point of failure

Horizontal (more servers):
  • 10x t3.large at $0.083/hr = $0.83/hr ≈ $600/month
  • No upper limit — add as many as you need
  • Fault tolerant: lose 2 of 10 and you still have 80% capacity

Winner: horizontal at scale. But stateless design is REQUIRED.
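A quick sanity check on those numbers (the hourly rates are the ones quoted in the text, not live AWS pricing, and a 720-hour month is assumed):

```python
# Back-of-envelope check of the vertical vs horizontal cost comparison.
HOURS_PER_MONTH = 720

vertical = 3.84 * HOURS_PER_MONTH          # one r6g.16xlarge
horizontal = 10 * 0.083 * HOURS_PER_MONTH  # ten t3.large

print(round(vertical))    # → 2765
print(round(horizontal))  # → 598
```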

graph TB
    Users[👥 1M Users] --> LB[AWS Application Load Balancer
Layer 7: HTTP/HTTPS routing
Health checks every 30s
$16/month + $0.008/LCU]

    LB --> App1[App Server 1
t3.large]
    LB --> App2[App Server 2
t3.large]
    LB --> App3[App Server 3
t3.large]
    LB -. Auto-scaling .-> AppN[App Server N
Add on demand]

    App1 & App2 & App3 --> DB[(RDS Primary
r6g.xlarge
4 vCPU, 32GB RAM)]
    App1 & App2 & App3 --> Redis[(Redis Cache
elasticache.r6g.large
13GB RAM)]
    App1 & App2 & App3 --> S3[(S3 + CloudFront
Static assets + uploads)]

    style LB fill:#f90,stroke:#333,stroke-width:3px

Load balanced architecture: horizontal app scaling with stateless servers.

Critical requirement: stateless app servers. If server 1 handles a user's login and stores session in memory, then server 2 won't know they're logged in. Solution: store session in Redis or database, NOT server memory.
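A minimal sketch of that external session store, with a plain dict standing in for Redis so the example is self-contained; the `SessionStore` class and its method names are illustrative, not a real library API. In production the same interface would wrap redis-py (e.g. `r.setex(f"session:{sid}", ttl, user_id)`).

```python
# Sketch: external session storage so app servers stay stateless.
# The dict backend is a stand-in for Redis.
import secrets

class SessionStore:
    def __init__(self, backend=None):
        self.backend = backend if backend is not None else {}

    def create(self, user_id):
        sid = secrets.token_hex(16)          # opaque session ID for the cookie
        self.backend[f"session:{sid}"] = user_id
        return sid

    def get_user(self, sid):
        return self.backend.get(f"session:{sid}")

# Any app server sharing the same store can resolve the session —
# it no longer matters which server the load balancer picks.
store = SessionStore()
sid = store.create(user_id=42)
assert store.get_user(sid) == 42
```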

Load Balancing Algorithms: When to Use Each

  • Round Robin — Default. Sends request 1→server1, 2→server2, 3→server1... Best when all servers have same capacity.
  • Least Connections — Sends to server with fewest active connections. Best when requests have variable processing time (some take 10ms, some 10 seconds).
  • IP Hash — Same client always goes to same server (sticky sessions). Use only if you MUST store state on server — prefer external session storage instead.
  • Weighted Round Robin — Send more requests to more powerful servers. Use when you have mixed instance types during migrations.

Health Checks Are Critical

Configure load balancer health checks:
  • Endpoint: GET /health → returns 200 if healthy
  • Interval: every 30 seconds
  • Unhealthy threshold: 2 consecutive failures → remove from rotation
  • Healthy threshold: 3 consecutive passes → add back

Without health checks, dead servers keep receiving traffic — and users see errors.

Level 4: Database Replication — Solving the Read Bottleneck

At 5M users, your database primary server hits 70-80% CPU. Investigation shows: 95% of queries are READs (users viewing feeds, profiles), only 5% are WRITEs (users posting, updating). Classic read-heavy workload.

Solution: database replication. One primary handles all writes, multiple read replicas handle all reads.

graph TB
    App[App Servers
10x t3.large] -- writes: INSERT/UPDATE/DELETE --> Primary[(DB Primary
r6g.4xlarge
16 vCPU, 128GB RAM)]
    App -- reads: SELECT --> Replica1[(Read Replica 1
r6g.2xlarge
8 vCPU, 64GB RAM)]
    App -- reads: SELECT --> Replica2[(Read Replica 2
r6g.2xlarge
8 vCPU, 64GB RAM)]
    App -- reads: SELECT --> Replica3[(Read Replica 3
r6g.2xlarge)]

    Primary -. async replication
lag: 1-100ms .-> Replica1
    Primary -. async replication .-> Replica2
    Primary -. async replication .-> Replica3

    style Primary fill:#f66,stroke:#333,stroke-width:3px
    note1[Primary: handles ALL writes
Replicas: handle ALL reads
95%+ traffic goes to replicas]

Read replica architecture: primary for writes, replicas for reads. 95% of traffic served by replicas.

| Metric          | Before Replicas    | After (3 replicas)   | Notes                     |
| --------------- | ------------------ | -------------------- | ------------------------- |
| Primary CPU     | 80% (reads+writes) | 20% (writes only)    | Safe headroom for writes  |
| Read capacity   | 1x                 | 3x (or more)         | Add replicas linearly     |
| Read latency    | 50ms avg           | 30ms avg             | Replicas less contended   |
| Monthly cost    | $800 (primary)     | $800 + $600 = $1,400 | 75% increase, 3x capacity |
| Replication lag | N/A                | 1-100ms typically    | Acceptable for most reads |

Replication Lag: The Hidden Gotcha

Async replication means replicas are slightly behind the primary.

Scenario: a user posts a photo → the write goes to the primary → they immediately read from a replica → the photo isn't there yet!

Solutions:
  1. Read-your-writes: route a user's reads back to the primary for 1-5 seconds after they write
  2. Sticky reads: for ~500ms after a write, route that user to the primary for reads
  3. Synchronous replication: zero lag, but ~2x write latency

Instagram used option 1 for profile updates and option 3 for financial data.
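Read-your-writes routing fits in a few lines. This is a sketch under simplified assumptions (a per-user timestamp table in the router; explicit `now` values stand in for the clock so the example is deterministic):

```python
# Sketch of read-your-writes: after a user writes, route that user's
# reads to the primary for a short window so they never observe lag.
import time

STICKY_SECONDS = 5.0
last_write_at = {}   # user_id -> time of their last write

def record_write(user_id, now=None):
    last_write_at[user_id] = now if now is not None else time.monotonic()

def choose_db(user_id, now=None):
    now = now if now is not None else time.monotonic()
    recent = now - last_write_at.get(user_id, float("-inf")) < STICKY_SECONDS
    return "primary" if recent else "replica"

record_write("alice", now=100.0)
assert choose_db("alice", now=101.0) == "primary"   # within the window
assert choose_db("alice", now=106.0) == "replica"   # window elapsed
assert choose_db("bob", now=101.0) == "replica"     # never wrote
```

The window only needs to exceed the worst replication lag you tolerate, so almost all reads still land on replicas.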

Capacity math: With 5M DAU, each visiting twice/day, viewing 20 items/visit = 200M reads/day = 2,314 reads/sec average, peak 10x = 23,140 reads/sec. Each replica handles ~5,000 reads/sec = need 5 replicas for peak. Instagram used 12 at this stage.
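That capacity math generalizes to a small back-of-envelope calculator (the inputs mirror the text: 5M DAU, 2 visits/day, 20 items/visit, a 10x peak factor, and ~5,000 reads/sec per replica — the last two are the assumptions stated above):

```python
# Back-of-envelope read-replica sizing.
import math

def replicas_needed(dau, visits_per_day, items_per_visit,
                    peak_factor=10, reads_per_replica=5_000):
    reads_per_day = dau * visits_per_day * items_per_visit
    avg_qps = reads_per_day / 86_400            # seconds in a day
    peak_qps = avg_qps * peak_factor
    return avg_qps, peak_qps, math.ceil(peak_qps / reads_per_replica)

avg, peak, n = replicas_needed(5_000_000, 2, 20)
print(round(avg), round(peak), n)   # → 2315 23148 5
```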

Level 5: Caching Layer — The 100x Multiplier

A cache hit is 0.1ms. A database query is 1-50ms. Caching the right data gives you 10-500x performance improvement and reduces database load by 80-95%. This is the highest ROI optimization in system design.

The Caching Hierarchy (Fastest to Slowest)

  • CPU L1 Cache — 0.4 nanoseconds. 32KB. Managed by CPU. You cannot control this.
  • CPU L2/L3 Cache — 1-10 nanoseconds. 256KB-32MB. Managed by CPU.
  • RAM / Application Memory — 100 nanoseconds. GBs. Node.js/Python in-process caching. Lost on restart.
  • Redis / Memcached — 0.1-1 milliseconds. GBs-TBs. Shared across servers. Survives restarts.
  • CDN Edge (Cloudflare, Akamai) — 1-50 milliseconds. Geographically distributed. Static content, images, videos.
  • Database Query Cache — 1-50 milliseconds. Limited, often disabled in production.
  • Database Storage (SSD) — 0.1-1 milliseconds. For data not in DB buffer pool.
  • Database Storage (HDD) — 5-10 milliseconds. Mostly legacy.
graph LR
    Client[Browser/App] --> CDN[CDN Edge
Cloudflare/Akamai
~10ms latency
Cache: images, JS, CSS]
    CDN --> LB[Load Balancer]
    LB --> App[App Server]
    App --> AppCache[In-Process Cache
Node.js Map/LRU
~0.01ms
Hot config data]
    App --> Redis[(Redis Cluster
elasticache
~0.5ms
User sessions, feeds
rate limiting)]
    App --> DB[(PostgreSQL
Primary + Replicas
~5ms
Source of truth)]

    Redis -. Cache Miss .-> DB
    CDN -. Cache Miss .-> LB

    style CDN fill:#fa0,stroke:#333
    style Redis fill:#c33,color:#fff,stroke:#333
    style DB fill:#363,color:#fff,stroke:#333

Multi-layer cache hierarchy. Most requests should be served by Redis or CDN — never reaching the database.

| Cache Strategy            | How It Works                                             | Best For                        | Risk                                                    |
| ------------------------- | -------------------------------------------------------- | ------------------------------- | ------------------------------------------------------- |
| Cache-Aside (Lazy)        | App checks cache first; miss → query DB → store in cache | General-purpose, read-heavy     | Cold start, thundering herd on cache miss               |
| Write-Through             | Write to cache AND DB together on every write            | Data that's immediately re-read | Slower writes, cache pollution (write-only data cached) |
| Write-Behind (Write-Back) | Write to cache only; async flush to DB                   | Very write-heavy workloads      | Data loss if cache dies before flush                    |
| Read-Through              | Cache handles DB reads automatically on miss             | Simplifies app code             | Cache vendor must support it (Redis modules)            |
| Refresh-Ahead             | Pre-fetch data before TTL expires                        | Predictable access patterns     | Wasted work if prediction wrong                         |

Cache Stampede: The Most Common Cache Failure

Scenario: a popular tweet is cached with a 60s TTL and 10,000 users are reading it. At T+60s the cache expires, all 10,000 concurrent readers miss, all 10,000 hit the DB simultaneously → DB overload → outage.

Fix 1: Probabilistic early expiry — refresh the cache slightly before the TTL with some probability
Fix 2: Mutex lock — the first miss acquires a lock and fetches from the DB; the others wait
Fix 3: Background refresh — an async job refreshes the cache before expiry, so the TTL never actually expires in production

Reddit hit this in 2012: the front-page cache expired during peak traffic, 50,000 simultaneous DB queries landed, and the site was down for 8 minutes.
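Fix 2 (the mutex) can be demonstrated with threads. A sketch under simplified assumptions — a single process and a single global lock, where a distributed system would use a Redis lock per key:

```python
# Stampede protection via mutex-on-miss: only one of N concurrent
# readers recomputes the expired entry; the rest wait, then read it.
import threading

cache = {}
lock = threading.Lock()
db_hits = 0

def fetch_from_db():
    global db_hits
    db_hits += 1             # the expensive call we want exactly once
    return "fresh-value"

def get_with_lock(key):
    if key in cache:
        return cache[key]
    with lock:               # first miss holds the lock; others queue here
        if key not in cache: # re-check: a queued thread may find it filled
            cache[key] = fetch_from_db()
    return cache[key]

threads = [threading.Thread(target=get_with_lock, args=("hot-key",))
           for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
assert db_hits == 1          # 100 concurrent misses, one DB query
```

The double-check inside the lock is the essential part: without it, every queued thread would still query the database after acquiring the lock.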

Level 6: CDN and Static Assets — Eliminating Origin Load

At 10M users, you discover that 70% of your bandwidth is serving the same static files: JavaScript bundles, CSS, user profile photos, product images. These never change (or change rarely). They should NEVER hit your application servers.

What CDN Numbers Look Like

Instagram at 10M users (2011 estimates):
  • Average user views 40 photos/session, ~200KB per photo
  • 10M users × 2 sessions/day = 20M sessions/day
  • 20M × 40 photos × 200KB = 160TB/day transferred

Without CDN: all 160TB served from origin → $15,000/month in bandwidth plus massive origin load.
With CDN (Cloudflare): ~95% cache hit rate → 8TB from origin, 152TB from edge.
Cost: $15,000 → $2,500/month (6x reduction). Latency: 200ms (origin) → 20ms (edge near the user).

graph TB
    US_User[👤 US User
New York] --> CF_US[Cloudflare Edge
New York PoP
10ms away]
    EU_User[👤 EU User
London] --> CF_EU[Cloudflare Edge
London PoP
8ms away]
    ASIA_User[👤 Asia User
Tokyo] --> CF_AS[Cloudflare Edge
Tokyo PoP
5ms away]

    CF_US -. Cache Miss
~5% of requests .-> Origin[Origin Servers
US-East-1
AWS]
    CF_EU -. Cache Miss .-> Origin
    CF_AS -. Cache Miss .-> Origin

    CF_US -. 95% Cache Hit
~20ms .-> US_User
    CF_EU -. 95% Cache Hit
~15ms .-> EU_User
    CF_AS -. 95% Cache Hit
~10ms .-> ASIA_User

    style CF_US fill:#fa0,stroke:#333
    style CF_EU fill:#fa0,stroke:#333
    style CF_AS fill:#fa0,stroke:#333

CDN global PoP (Points of Presence): users connect to nearest edge server. Origin only sees cache misses.

CDN Implementation Checklist

  1. Move all static assets (JS, CSS, images) to S3 or equivalent object storage
  2. Put CloudFront/Cloudflare in front of S3 with long TTLs (1 year for versioned assets)
  3. Use content hashing for filenames: app.a3f29c.js — the hash changes when the content changes, giving instant cache busting
  4. Configure cache-control headers: Cache-Control: public, max-age=31536000, immutable
  5. For user-uploaded content: route through the CDN with shorter TTLs (7 days)
  6. Monitor the CDN hit rate in your dashboard — target > 90%
  7. Configure the CDN WAF (Web Application Firewall) to block malicious traffic at the edge

Cache-Busting Without Downtime

Never name files static.js and rely on CDN purge for updates — purge propagation takes 30-60 seconds, during which some users see the old version.

Correct approach — content hash in the filename:
  • app.js → app.[sha256hash].js
  • HTML references the hashed filename
  • When the JS changes, the hash changes; the CDN sees a new URL and fetches fresh
  • The old version stays cached for users still on the old HTML
  • Result: zero-downtime deployment
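Content-hash naming is a one-liner with hashlib. A sketch (build tools like webpack or Vite do this automatically; the truncated-digest length is an arbitrary choice here):

```python
# Content-hashed filenames: the name changes iff the bytes change, so a
# new deploy is a new URL and old CDN copies can be cached "forever"
# (Cache-Control: public, max-age=31536000, immutable).
import hashlib

def hashed_name(filename, content: bytes, digest_len=8):
    stem, _, ext = filename.rpartition(".")
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    return f"{stem}.{digest}.{ext}"

v1 = hashed_name("app.js", b"console.log('v1')")
v2 = hashed_name("app.js", b"console.log('v2')")
assert v1 != v2                                           # new content, new URL
assert v1 == hashed_name("app.js", b"console.log('v1')")  # deterministic
```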

Level 7: Database Sharding — Breaking the Write Bottleneck

At 50M users, your database primary is at 80% CPU even with read replicas handling all reads. Writes are the problem. You have 50M users × 5 writes/day = 250M writes/day = 2,893 writes/sec sustained. A single PostgreSQL primary maxes out at ~5,000 writes/sec on powerful hardware — you're dangerously close.

Sharding (also called horizontal partitioning) splits your data across multiple independent database instances. Each instance is a "shard" responsible for a subset of data.

graph TB
    App[App Servers] --> Router[Shard Router
user_id % 4 = shard]

    Router -- user_id 0,4,8,12... --> Shard0[(Shard 0
PostgreSQL Primary
Users 0,4,8...)]
    Router -- user_id 1,5,9,13... --> Shard1[(Shard 1
PostgreSQL Primary
Users 1,5,9...)]
    Router -- user_id 2,6,10,14... --> Shard2[(Shard 2
PostgreSQL Primary
Users 2,6,10...)]
    Router -- user_id 3,7,11,15... --> Shard3[(Shard 3
PostgreSQL Primary
Users 3,7,11...)]

    Shard0 --> S0R1[(Shard 0 Replica)]
    Shard1 --> S1R1[(Shard 1 Replica)]

    style Router fill:#f90,stroke:#333,stroke-width:3px

Hash-based sharding by user_id: user_id % 4 determines which shard. Each shard has its own primary and replicas.

| Sharding Strategy | How It Works                              | Pros                                  | Cons                                     | Real-World Example                    |
| ----------------- | ----------------------------------------- | ------------------------------------- | ---------------------------------------- | ------------------------------------- |
| Hash-based        | hash(shard_key) % num_shards              | Even distribution, predictable        | Resharding is expensive                  | Instagram user_id, Stripe customer_id |
| Range-based       | shard_key 0-1M → shard 1, 1M-2M → shard 2 | Easy range queries, easier resharding | Hot spots if a range is popular          | MongoDB default, HBase                |
| Directory-based   | Lookup table: key → shard mapping         | Maximum flexibility, easy resharding  | Lookup table becomes a bottleneck        | Foursquare                            |
| Geo-based         | Route by user geography                   | GDPR compliance, low latency          | Uneven distribution (US > other regions) | Airbnb, WhatsApp                      |
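Hash-based routing from the table is a few lines. A sketch, with MD5 as one arbitrary choice of stable hash — the important property is that every app server maps the same key to the same shard, which rules out Python's per-process-salted built-in `hash()`:

```python
# Hash-based shard routing: hash(shard_key) % num_shards.
import hashlib

NUM_SHARDS = 4

def shard_for(shard_key) -> int:
    digest = hashlib.md5(str(shard_key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Purely numeric IDs can simply use user_id % NUM_SHARDS, as in the
# diagram above; the hash version also handles string keys evenly.
assert all(0 <= shard_for(uid) < NUM_SHARDS for uid in range(1000))
assert shard_for(42) == shard_for(42)   # deterministic across processes
```

The cost noted in the table is visible here too: changing `NUM_SHARDS` remaps almost every key, which is why resharding under modulo hashing requires a data migration (consistent hashing reduces, but does not eliminate, that movement).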

Sharding Is Painful — Avoid Until Necessary

Instagram waited until 30M users to shard. Before sharding, they tried:
  1. Vertical scaling of the primary (a larger instance)
  2. Read replicas (for reads)
  3. Aggressive caching (Redis for hot data)
  4. Query optimization (indexes, batching)

Only when writes hit 2,000/sec sustained did they shard. The pain of sharding:
  • Cross-shard queries become application-level joins
  • Distributed transactions are very hard
  • Resharding (adding more shards) requires data migration
  • Operational complexity: 4 primaries to monitor instead of 1

Do not shard until you absolutely must.

Level 8: Full Production Architecture — The Complete Picture

At 100M+ users, your architecture has evolved from a single server to a multi-layer globally distributed system. Here is what the complete architecture looks like — and the numbers behind it.

graph TB
    User[👤 2B Users Globally] --> CF[Cloudflare CDN
250 PoPs worldwide
DDoS protection
WAF + Rate Limiting]

    CF --> GLB[Global Load Balancer
Geodns / Anycast
Routes to nearest region]

    GLB --> US[US-East Region]
    GLB --> EU[EU-West Region]
    GLB --> AS[Asia-Pacific Region]

    subgraph US [US-East AWS Region]
        ALB_US[ALB
HTTPS/HTTP2] --> AppTier_US[App Tier
500x c6g.xlarge
Auto-scaling group]
        AppTier_US --> Cache_US[Redis Cluster
12x r6g.2xlarge
13TB cache]
        AppTier_US --> DB_US_Primary[(DB Primary
r6g.16xlarge
128 vCPU)]
        DB_US_Primary --> DB_US_R1[(Replica 1)]
        DB_US_Primary --> DB_US_R2[(Replica 2)]
        DB_US_Primary --> DB_US_R3[(Replica 3)]
    end

    DB_US_Primary -. Cross-region async
replication .-> DB_EU_Primary[(EU Primary)]
    DB_US_Primary -. Cross-region async .-> DB_AS_Primary[(APAC Primary)]

    AppTier_US --> MQ[Kafka
Message Queues
Async processing]
    MQ --> Workers[Worker Fleet
Image processing
Notifications
Analytics]
    Workers --> S3_CDN[S3 + CloudFront
User photos/videos
Global CDN]

Full production architecture at 2B users: multi-region, CDN, load balancers, app tier, Redis, sharded DB, message queues.

| Layer          | Technology                     | Scale               | Monthly Cost (rough) |
| -------------- | ------------------------------ | ------------------- | -------------------- |
| CDN            | Cloudflare Enterprise          | 5PB/month bandwidth | $50,000              |
| Load Balancers | AWS ALB (3 regions)            | 50M req/hour each   | $30,000              |
| App Servers    | 1,500x c6g.xlarge (auto-scale) | Avg 500 active      | $300,000             |
| Redis Cache    | 36x r6g.2xlarge cluster        | 40TB total cache    | $150,000             |
| Database (RDS) | 12 primaries, 36 replicas      | 100TB total data    | $500,000             |
| S3 Storage     | Multi-region replication       | 50PB user content   | $1,000,000           |
| Kafka/queues   | 200-node Kafka cluster         | 50M events/sec      | $200,000             |
| Total          | —                              | —                   | ~$2.2M/month         |

The Cost of 1B vs 1M Users

1M users: ~$5,000/month → $0.005 per user/month
1B users: ~$2,200,000/month → $0.0022 per user/month

Economies of scale: at 1B users, cost per user is LOWER than at 1M users. This is why scaling is worth it — marginal cost decreases as you grow.

Scaling Decision Framework: When to Add Each Layer

The biggest mistake in system design is over-engineering. Engineers who have worked at Google sometimes design Google-scale systems for 1,000-user apps. This is expensive and slows development.

| User Scale  | Architecture                 | Key Components                           | Monthly Cost |
| ----------- | ---------------------------- | ---------------------------------------- | ------------ |
| 0 - 10K     | Single server                | EC2 t3.medium + RDS db.t3.medium         | $100         |
| 10K - 100K  | Separated DB                 | EC2 t3.large + RDS r6g.large + S3        | $500         |
| 100K - 1M   | + Load balancer              | ALB + 3x app servers + RDS r6g.xlarge    | $2,000       |
| 1M - 10M    | + Read replicas + Redis      | 3 app servers + 2 replicas + ElastiCache | $8,000       |
| 10M - 100M  | + CDN + Sharding             | CDN + 20 app servers + 4 DB shards       | $50,000      |
| 100M - 1B   | Multi-region + Microservices | Full distributed system                  | $500,000+    |
| 1B+         | Custom infrastructure        | Custom hardware, networking, storage     | $2M+/month   |

The Instagram Rule

"Design for 10x your current scale, not 100x."

  • At 10K users: design for 100K
  • At 1M users: design for 10M
  • At 100M users: design for 1B

Do NOT design for 1B users when you have 10K — that architecture is 100x more complex and expensive. Focus on shipping features, not premature scaling infrastructure.

Scaling Decision Signals: When to Add Each Layer

  • Add DB Separation — App server CPU > 60% AND DB is part of it. Or you want to scale app servers independently.
  • Add Load Balancer — Single app server at > 60% CPU sustained, or you need zero-downtime deployments, or any availability requirement.
  • Add Read Replicas — DB primary CPU > 60% AND 80%+ of queries are SELECTs. Replication lag must be acceptable for your use case.
  • Add Redis Cache — Repeated identical DB queries. Same data fetched by multiple users. Rate limiting needed. Session storage needed.
  • Add CDN — Static file bandwidth > 1TB/month. Or serving users in multiple geographies. Or images/video are significant traffic.
  • Add DB Sharding — DB write throughput > 70% of primary capacity. Single primary cannot keep up with write load. Data size > 5TB per instance.
  • Go Multi-Region — Users in multiple continents with latency complaints. Need 99.99%+ SLA. Regulatory data residency requirements.
How this might come up in interviews

This is the foundation for every system design interview. Interviewers expect you to narrate the scaling journey when designing any system — "at 100K users I would add X, at 10M users I would add Y." Not knowing this journey is a red flag at L5+ levels.

Common questions:

  • Design Instagram. Walk me through the architecture at 10K, 1M, and 1B users. (L5 - L7 question)
  • Your single-server app is slow. Database CPU is at 90%. What do you do? Walk me through your debugging and solution approach. (L4 - L5 question)
  • When would you choose vertical scaling over horizontal scaling for the database tier? What are the trade-offs? (L5 question)
  • Explain cache stampede and three ways to prevent it. (L5 - L6 question)
  • Your app serves users in US and Europe with 300ms latency for EU users. How do you reduce this to < 50ms? (L5 - L6 question)
  • Instagram has a "hot user" problem: 1 million people follow a celebrity, and when they post, it generates 1M writes. How do you handle this? (L6 - L7 question)

Key takeaways

  • Every system starts as a single server. Scale incrementally — add each layer when you hit the specific bottleneck that layer solves.
  • The scaling layers in order: database separation → load balancer → read replicas → caching → CDN → sharding → multi-region.
  • Instagram reached 1B users with 13 engineers by following this exact playbook, not by over-engineering from day 1.
  • Caching is the highest-ROI scaling technique: a Redis cache hit is 100-500x faster than a database query.
  • Sharding is painful — exhaust all other options (replicas, caching, vertical scaling) before sharding.
Before you move on: can you answer these?

Your app has 500K DAU, each making 20 requests/day. Calculate peak QPS assuming 5x peak multiplier. What architecture handles this?

500K × 20 / 86,400 = 115 req/sec avg. × 5 = 578 req/sec peak. Handled by: 2-3 app servers behind a load balancer, with DB read replicas. Single server maxes at ~300-500 req/sec.

At what point should you add read replicas vs add a Redis cache? What signal tells you which one to add first?

Add Redis FIRST if you have repeated identical queries (same data fetched many times). Redis is 100x cheaper than another DB replica. Add read replicas when DB CPU is high DESPITE good cache hit rates — meaning new data is needed, not repeated data.

Why is sharding considered a last resort, and what are 3 alternatives to try first?

Sharding makes cross-shard queries expensive (application-level joins) and distributed transactions nearly impossible. Try first: (1) vertical scale the primary, (2) aggressive caching to reduce write load, (3) read replicas to eliminate all read load from primary. Only shard when writes exceed single-machine capacity.

🧠Mental Model

💡 Analogy

Scaling a system is like scaling a restaurant. You start as a food truck (single server) — one chef handles everything: cooking, serving, payment. It works for 50 customers/day. At 200 customers, you rent a small restaurant: separate kitchen (database server) from dining area (app server). At 500 customers, you hire more servers and add a host to direct them (load balancer). At 2,000 customers, you open a second location and replicate the kitchen setup (read replicas). At 10,000 customers, you franchise the concept: multiple independent restaurants with shared brand and supply chain (sharding + CDN). Each step adds complexity but handles more demand.

⚡ Core Idea

Every scaling journey follows the same pattern: identify the current bottleneck → add the layer that removes that bottleneck → repeat. The layers are: database separation, load balancing, read replicas, caching, CDN, sharding. Each layer solves a specific constraint at a specific scale.

🎯 Why It Matters

Without understanding the scaling journey, you will either under-engineer (and crash at 100K users like early Instagram) or over-engineer (waste 6 months building Google-scale infrastructure for a 1,000-user app). Knowing WHEN to add each layer is the difference between a startup that scales and one that dies from success.

Sign in to start or join a thread.