Interactive Explainer

Scaling from One User to Millions

How Instagram went from a single AWS server to serving 2 billion users — and the exact architectural decisions made at each inflection point. Master the universal scaling playbook used by every major tech company.

🎯Key Takeaways
Every system starts as a single server. Scale incrementally — add each layer when you hit the specific bottleneck that layer solves.
The scaling layers in order: database separation → load balancer → read replicas → caching → CDN → sharding → multi-region.
Instagram reached 1B users with 13 engineers by following this exact playbook, not by over-engineering from day 1.
Caching is the highest-ROI scaling technique: a Redis cache hit is 100-500x faster than a database query.
Sharding is painful — exhaust all other options (replicas, caching, vertical scaling) before sharding.

~17 min read

Why Scaling Matters: The Instagram Story

On October 6, 2010, Instagram launched. Within 24 hours, 25,000 people had signed up. Within 3 months: 1 million users. Within 18 months: 30 million users. By 2018: 1 billion active users.

The engineering team that handled this was just 13 people when Facebook acquired Instagram for $1 billion in 2012. They scaled from a single server to a globally distributed system — and the decisions they made are the foundation of system design.

Real Numbers: Instagram Day 1 vs Today

Day 1 (Oct 2010): 25K users, 1 server, ~50 req/sec, $50/month
Year 1 (Oct 2011): 10M users, 3 servers, ~500 req/sec, $5K/month
Acquisition (Apr 2012): 30M users, 30+ servers, ~2K req/sec, $50K/month
Today (2024): 2B users, 100K+ servers, ~5M req/sec, $100M+/month

The key insight: every architecture decision was driven by a specific bottleneck. Instagram did NOT design for 2 billion users on day 1. They solved each bottleneck as it appeared. This is the correct approach — over-engineering early is as dangerous as under-engineering.

graph TD
    A[Day 1: 25K Users] --> B[Month 3: 1M Users]
    B --> C[Month 18: 10M Users]
    C --> D[Acquisition: 30M Users]
    D --> E[Today: 2B Users]

    A -. Single EC2 Server .-> F[Single Server Architecture]
    B -. Database Overload .-> G[Database Separation]
    C -. Read Bottleneck .-> H[Read Replicas + Cache]
    D -. Write Bottleneck .-> I[Sharding + CDN]
    E -. Global Scale .-> J[Multi-Region + Microservices]

Instagram scaling journey: each growth milestone triggered a specific architectural change

Level 1: Single Server Architecture — Where Every App Starts

Every successful app starts the same way: one server, one database, one engineer at 3am wondering why it is slow. This is not a failure — it is correct engineering. Premature optimization is the enemy of shipping.

graph LR
    User[👤 User Request] --> DNS[DNS Lookup
yourapp.com → 54.23.1.1]
    DNS --> Server[Single EC2 Server
t3.medium: 2 vCPU, 4GB RAM
Nginx + Node.js + PostgreSQL]
    Server --> DB[(PostgreSQL
Same Machine
10GB SSD)]
    Server --> Files[User Photos
Local Disk
100GB]

    style Server fill:#f96,stroke:#333,stroke-width:3px
    style DB fill:#6c9,stroke:#333

Single server: everything on one machine. Simple to deploy, simple to debug, single point of failure.

What runs on that single server: Nginx (web server), Node.js/Python/Rails (application), PostgreSQL/MySQL (database), file system (user uploads). Total cost: $50-200/month on AWS.

| Component   | Purpose                   | Bottleneck at Scale               | Max Capacity (rough) |
| ----------- | ------------------------- | --------------------------------- | -------------------- |
| Nginx       | Serve HTTP, reverse proxy | Rarely the bottleneck             | 50K req/sec          |
| Node.js App | Business logic            | CPU-bound at high concurrency     | 1-5K req/sec         |
| PostgreSQL  | Data storage + queries    | Disk I/O, CPU for complex queries | 1-10K req/sec        |
| Local disk  | File storage              | IOPS and capacity limits          | 100-500 MB/s         |

When Single Server Is Fine

  • < 10,000 daily active users
  • < 100 req/sec sustained
  • < 100GB data
  • Non-critical availability (five nines not needed)
  • Early-stage startup or internal tool

Do NOT over-engineer early. An AWS t3.medium at ~$30/month can handle most early-stage apps.

First bottleneck signals to watch: database CPU > 70% sustained, response times > 500ms under load, disk I/O wait > 20%, memory pressure causing swapping.

Level 2: Database Separation — Splitting the Monolith

The first scaling step is almost always the same: move the database to its own machine. This was Instagram's first architectural change, made at roughly 100K users.

Why database first? The database is stateful — it holds your data. The app server is stateless — it just processes requests. Separating them lets you scale each independently.

Database Separation: Step-by-Step

  1. Provision a new, more powerful RDS instance (r6g.large: 2 vCPU, 16GB RAM, optimized for databases)
  2. Run both old and new databases in parallel for 24 hours, writing to both, reading from the old
  3. Verify data consistency between old and new (checksums, row counts)
  4. Switch reads to the new database, monitor for errors
  5. Switch writes to the new database, decommission the old DB on the app server
  6. The app server now has 4GB of RAM freed — use it for application caching

graph TB
    Users[👥 100K Users] --> LB_DNS[DNS Round-robin
or Elastic IP]
    LB_DNS --> App[App Server
ec2: t3.large
2 vCPU, 8GB RAM
Nginx + Node.js]
    App --> DB[(Dedicated DB Server
rds: r6g.large
2 vCPU, 16GB RAM
PostgreSQL)]
    App --> S3[AWS S3
User Photos
Unlimited scale
$0.023/GB/month]

    style App fill:#6af,stroke:#333,stroke-width:2px
    style DB fill:#6c9,stroke:#333,stroke-width:2px

After database separation: app and DB on separate machines. Now can scale independently.

| Metric          | Before (Single Server) | After (Separated)    | Improvement      |
| --------------- | ---------------------- | -------------------- | ---------------- |
| App server CPU  | 85% (DB stealing)      | 30%                  | 55% freed        |
| DB server RAM   | 2GB available          | 16GB dedicated       | 8x more          |
| Max req/sec     | ~200                   | ~1,000               | 5x               |
| Monthly cost    | $50                    | $200                 | 4x               |
| Failure domains | 1 (everything)         | 2 (app + DB)         | Better isolation |

The Hidden Cost: Network Latency

Before separation: app → DB was memory/IPC (~0.1ms).
After separation: app → DB is a network hop (~1-2ms).
For an endpoint making 10 DB queries: ~1ms total before vs 10-20ms after.
Mitigation: connection pooling (PgBouncer), N+1 query elimination, caching hot queries.
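The N+1 mitigation above can be sketched in a few lines. This is a toy example: `FakeDB` is a hypothetical stand-in that just counts round-trips, where a real app would use a driver like psycopg2 or an ORM.

```python
# Toy sketch: eliminating N+1 queries to offset the network hop added
# by database separation. FakeDB is a stand-in that counts round-trips.

class FakeDB:
    def __init__(self, rows):
        self.rows = rows          # {user_id: profile}
        self.round_trips = 0      # each call = one ~1-2ms network hop

    def get_profile(self, user_id):
        self.round_trips += 1
        return self.rows[user_id]

    def get_profiles(self, user_ids):  # one query with WHERE id IN (...)
        self.round_trips += 1
        return [self.rows[u] for u in user_ids]

db = FakeDB({i: f"profile-{i}" for i in range(10)})

# N+1 pattern: 10 queries -> ~10-20ms of pure network latency
naive = [db.get_profile(i) for i in range(10)]
assert db.round_trips == 10

db.round_trips = 0
# Batched pattern: 1 query -> ~1-2ms
batched = db.get_profiles(list(range(10)))
assert db.round_trips == 1
```

The batched call returns the same data with a tenth of the network round-trips, which is exactly what makes the separated-database latency tolerable.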

Level 3: Load Balancing — Handling Horizontal Scale

At ~500K users, a single app server starts struggling. The solution: add more app servers and put a load balancer in front. This is "horizontal scaling" — adding more machines instead of bigger ones.

Vertical vs Horizontal: The Numbers

Vertical (bigger server):
  • t3.medium → r6g.16xlarge (128 vCPU, 1TB RAM) = $3.84/hr ≈ $2,765/month
  • Hard ceiling: ~448 vCPU (AWS's largest instance)
  • Still a single point of failure

Horizontal (more servers):
  • 10x t3.large at $0.083/hr = $0.83/hr ≈ $600/month
  • No upper limit — add as many as you need
  • Fault tolerant: lose 2 of 10 and you still have 80% capacity

Winner: horizontal at scale. But stateless design is REQUIRED.
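A quick sanity check on those numbers (the hourly rates are the ones quoted in the text, not live AWS pricing, and a 720-hour month is assumed):

```python
# Back-of-envelope check of the vertical vs horizontal cost comparison.
HOURS_PER_MONTH = 720

vertical = 3.84 * HOURS_PER_MONTH          # one r6g.16xlarge
horizontal = 10 * 0.083 * HOURS_PER_MONTH  # ten t3.large

print(round(vertical))    # → 2765
print(round(horizontal))  # → 598
```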

graph TB
    Users[👥 1M Users] --> LB[AWS Application Load Balancer
Layer 7: HTTP/HTTPS routing
Health checks every 30s
$16/month + $0.008/LCU]

    LB --> App1[App Server 1
t3.large]
    LB --> App2[App Server 2
t3.large]
    LB --> App3[App Server 3
t3.large]
    LB -. Auto-scaling .-> AppN[App Server N
Add on demand]

    App1 & App2 & App3 --> DB[(RDS Primary
r6g.xlarge
4 vCPU, 32GB RAM)]
    App1 & App2 & App3 --> Redis[(Redis Cache
elasticache.r6g.large
13GB RAM)]
    App1 & App2 & App3 --> S3[(S3 + CloudFront
Static assets + uploads)]

    style LB fill:#f90,stroke:#333,stroke-width:3px

Load balanced architecture: horizontal app scaling with stateless servers.

Critical requirement: stateless app servers. If server 1 handles a user's login and stores session in memory, then server 2 won't know they're logged in. Solution: store session in Redis or database, NOT server memory.
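A minimal sketch of that external session store, with a plain dict standing in for Redis so the example is self-contained; the `SessionStore` class and its method names are illustrative, not a real library API. In production the same interface would wrap redis-py (e.g. `r.setex(f"session:{sid}", ttl, user_id)`).

```python
# Sketch: external session storage so app servers stay stateless.
# The dict backend is a stand-in for Redis.
import secrets

class SessionStore:
    def __init__(self, backend=None):
        self.backend = backend if backend is not None else {}

    def create(self, user_id):
        sid = secrets.token_hex(16)          # opaque session ID for the cookie
        self.backend[f"session:{sid}"] = user_id
        return sid

    def get_user(self, sid):
        return self.backend.get(f"session:{sid}")

# Any app server sharing the same store can resolve the session —
# it no longer matters which server the load balancer picks.
store = SessionStore()
sid = store.create(user_id=42)
assert store.get_user(sid) == 42
```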

Load Balancing Algorithms: When to Use Each

  • Round Robin — Default. Sends request 1→server1, 2→server2, 3→server1... Best when all servers have same capacity.
  • Least Connections — Sends to server with fewest active connections. Best when requests have variable processing time (some take 10ms, some 10 seconds).
  • IP Hash — Same client always goes to same server (sticky sessions). Use only if you MUST store state on server — prefer external session storage instead.
  • Weighted Round Robin — Send more requests to more powerful servers. Use when you have mixed instance types during migrations.

Health Checks Are Critical

Configure load balancer health checks:
  • Endpoint: GET /health → returns 200 if healthy
  • Interval: every 30 seconds
  • Unhealthy threshold: 2 consecutive failures → remove from rotation
  • Healthy threshold: 3 consecutive passes → add back

Without health checks, dead servers keep receiving traffic — and users see errors.

Level 4: Database Replication — Solving the Read Bottleneck

At 5M users, your database primary server hits 70-80% CPU. Investigation shows: 95% of queries are READs (users viewing feeds, profiles), only 5% are WRITEs (users posting, updating). Classic read-heavy workload.

Solution: database replication. One primary handles all writes, multiple read replicas handle all reads.

graph TB
    App[App Servers
10x t3.large] -- writes: INSERT/UPDATE/DELETE --> Primary[(DB Primary
r6g.4xlarge
16 vCPU, 128GB RAM)]
    App -- reads: SELECT --> Replica1[(Read Replica 1
r6g.2xlarge
8 vCPU, 64GB RAM)]
    App -- reads: SELECT --> Replica2[(Read Replica 2
r6g.2xlarge
8 vCPU, 64GB RAM)]
    App -- reads: SELECT --> Replica3[(Read Replica 3
r6g.2xlarge)]

    Primary -. async replication
lag: 1-100ms .-> Replica1
    Primary -. async replication .-> Replica2
    Primary -. async replication .-> Replica3

    style Primary fill:#f66,stroke:#333,stroke-width:3px
    note1[Primary: handles ALL writes
Replicas: handle ALL reads
95%+ traffic goes to replicas]

Read replica architecture: primary for writes, replicas for reads. 95% of traffic served by replicas.

| Metric          | Before Replicas    | After (3 replicas)   | Notes                     |
| --------------- | ------------------ | -------------------- | ------------------------- |
| Primary CPU     | 80% (reads+writes) | 20% (writes only)    | Safe headroom for writes  |
| Read capacity   | 1x                 | 3x (or more)         | Add replicas linearly     |
| Read latency    | 50ms avg           | 30ms avg             | Replicas less contended   |
| Monthly cost    | $800 (primary)     | $800 + $600 = $1,400 | 75% increase, 3x capacity |
| Replication lag | N/A                | 1-100ms typically    | Acceptable for most reads |

Replication Lag: The Hidden Gotcha

Async replication means replicas are slightly behind the primary.

Scenario: a user posts a photo → the write goes to the primary → they immediately read from a replica → the photo isn't there yet!

Solutions:
  1. Read-your-writes: route a user's reads back to the primary for 1-5 seconds after they write
  2. Sticky reads: for ~500ms after a write, route that user to the primary for reads
  3. Synchronous replication: zero lag, but ~2x write latency

Instagram used option 1 for profile updates and option 3 for financial data.
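Read-your-writes routing fits in a few lines. This is a sketch under simplified assumptions (a per-user timestamp table in the router; explicit `now` values stand in for the clock so the example is deterministic):

```python
# Sketch of read-your-writes: after a user writes, route that user's
# reads to the primary for a short window so they never observe lag.
import time

STICKY_SECONDS = 5.0
last_write_at = {}   # user_id -> time of their last write

def record_write(user_id, now=None):
    last_write_at[user_id] = now if now is not None else time.monotonic()

def choose_db(user_id, now=None):
    now = now if now is not None else time.monotonic()
    recent = now - last_write_at.get(user_id, float("-inf")) < STICKY_SECONDS
    return "primary" if recent else "replica"

record_write("alice", now=100.0)
assert choose_db("alice", now=101.0) == "primary"   # within the window
assert choose_db("alice", now=106.0) == "replica"   # window elapsed
assert choose_db("bob", now=101.0) == "replica"     # never wrote
```

The window only needs to exceed the worst replication lag you tolerate, so almost all reads still land on replicas.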

Capacity math: With 5M DAU, each visiting twice/day, viewing 20 items/visit = 200M reads/day = 2,314 reads/sec average, peak 10x = 23,140 reads/sec. Each replica handles ~5,000 reads/sec = need 5 replicas for peak. Instagram used 12 at this stage.
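That capacity math generalizes to a small back-of-envelope calculator (the inputs mirror the text: 5M DAU, 2 visits/day, 20 items/visit, a 10x peak factor, and ~5,000 reads/sec per replica — the last two are the assumptions stated above):

```python
# Back-of-envelope read-replica sizing.
import math

def replicas_needed(dau, visits_per_day, items_per_visit,
                    peak_factor=10, reads_per_replica=5_000):
    reads_per_day = dau * visits_per_day * items_per_visit
    avg_qps = reads_per_day / 86_400            # seconds in a day
    peak_qps = avg_qps * peak_factor
    return avg_qps, peak_qps, math.ceil(peak_qps / reads_per_replica)

avg, peak, n = replicas_needed(5_000_000, 2, 20)
print(round(avg), round(peak), n)   # → 2315 23148 5
```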

Level 5: Caching Layer — The 100x Multiplier

A cache hit is 0.1ms. A database query is 1-50ms. Caching the right data gives you 10-500x performance improvement and reduces database load by 80-95%. This is the highest ROI optimization in system design.

The Caching Hierarchy (Fastest to Slowest)

  • CPU L1 Cache — 0.4 nanoseconds. 32KB. Managed by CPU. You cannot control this.
  • CPU L2/L3 Cache — 1-10 nanoseconds. 256KB-32MB. Managed by CPU.
  • RAM / Application Memory — 100 nanoseconds. GBs. Node.js/Python in-process caching. Lost on restart.
  • Redis / Memcached — 0.1-1 milliseconds. GBs-TBs. Shared across servers. Survives restarts.
  • CDN Edge (Cloudflare, Akamai) — 1-50 milliseconds. Geographically distributed. Static content, images, videos.
  • Database Query Cache — 1-50 milliseconds. Limited, often disabled in production.
  • Database Storage (SSD) — 0.1-1 milliseconds. For data not in DB buffer pool.
  • Database Storage (HDD) — 5-10 milliseconds. Mostly legacy.
graph LR
    Client[Browser/App] --> CDN[CDN Edge
Cloudflare/Akamai
~10ms latency
Cache: images, JS, CSS]
    CDN --> LB[Load Balancer]
    LB --> App[App Server]
    App --> AppCache[In-Process Cache
Node.js Map/LRU
~0.01ms
Hot config data]
    App --> Redis[(Redis Cluster
elasticache
~0.5ms
User sessions, feeds
rate limiting)]
    App --> DB[(PostgreSQL
Primary + Replicas
~5ms
Source of truth)]

    Redis -. Cache Miss .-> DB
    CDN -. Cache Miss .-> LB

    style CDN fill:#fa0,stroke:#333
    style Redis fill:#c33,color:#fff,stroke:#333
    style DB fill:#363,color:#fff,stroke:#333

Multi-layer cache hierarchy. Most requests should be served by Redis or CDN — never reaching the database.

| Cache Strategy            | How It Works                                             | Best For                        | Risk                                                    |
| ------------------------- | -------------------------------------------------------- | ------------------------------- | ------------------------------------------------------- |
| Cache-Aside (Lazy)        | App checks cache first; miss → query DB → store in cache | General-purpose, read-heavy     | Cold start, thundering herd on cache miss               |
| Write-Through             | Write to cache AND DB together on every write            | Data that's immediately re-read | Slower writes, cache pollution (write-only data cached) |
| Write-Behind (Write-Back) | Write to cache only; async flush to DB                   | Very write-heavy workloads      | Data loss if cache dies before flush                    |
| Read-Through              | Cache handles DB reads automatically on miss             | Simplifies app code             | Cache vendor must support it (Redis modules)            |
| Refresh-Ahead             | Pre-fetch data before TTL expires                        | Predictable access patterns     | Wasted work if prediction wrong                         |

Cache Stampede: The Most Common Cache Failure

Scenario: a popular tweet is cached with a 60s TTL and 10,000 users are reading it. At T+60s the cache expires, all 10,000 concurrent readers miss, all 10,000 hit the DB simultaneously → DB overload → outage.

Fix 1: Probabilistic early expiry — refresh the cache slightly before the TTL with some probability
Fix 2: Mutex lock — the first miss acquires a lock and fetches from the DB; the others wait
Fix 3: Background refresh — an async job refreshes the cache before expiry, so the TTL never actually expires in production

Reddit hit this in 2012: the front-page cache expired during peak traffic, 50,000 simultaneous DB queries landed, and the site was down for 8 minutes.
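Fix 2 (the mutex) can be demonstrated with threads. A sketch under simplified assumptions — a single process and a single global lock, where a distributed system would use a Redis lock per key:

```python
# Stampede protection via mutex-on-miss: only one of N concurrent
# readers recomputes the expired entry; the rest wait, then read it.
import threading

cache = {}
lock = threading.Lock()
db_hits = 0

def fetch_from_db():
    global db_hits
    db_hits += 1             # the expensive call we want exactly once
    return "fresh-value"

def get_with_lock(key):
    if key in cache:
        return cache[key]
    with lock:               # first miss holds the lock; others queue here
        if key not in cache: # re-check: a queued thread may find it filled
            cache[key] = fetch_from_db()
    return cache[key]

threads = [threading.Thread(target=get_with_lock, args=("hot-key",))
           for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
assert db_hits == 1          # 100 concurrent misses, one DB query
```

The double-check inside the lock is the essential part: without it, every queued thread would still query the database after acquiring the lock.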

Level 6: CDN and Static Assets — Eliminating Origin Load

At 10M users, you discover that 70% of your bandwidth is serving the same static files: JavaScript bundles, CSS, user profile photos, product images. These never change (or change rarely). They should NEVER hit your application servers.

What CDN Numbers Look Like

Instagram at 10M users (2011 estimates):
  • Average user views 40 photos/session, ~200KB per photo
  • 10M users × 2 sessions/day = 20M sessions/day
  • 20M × 40 photos × 200KB = 160TB/day transferred

Without CDN: all 160TB served from origin → $15,000/month in bandwidth plus massive origin load.
With CDN (Cloudflare): ~95% cache hit rate → 8TB from origin, 152TB from edge.
Cost: $15,000 → $2,500/month (6x reduction). Latency: 200ms (origin) → 20ms (edge near the user).

graph TB
    US_User[👤 US User
New York] --> CF_US[Cloudflare Edge
New York PoP
10ms away]
    EU_User[👤 EU User
London] --> CF_EU[Cloudflare Edge
London PoP
8ms away]
    ASIA_User[👤 Asia User
Tokyo] --> CF_AS[Cloudflare Edge
Tokyo PoP
5ms away]

    CF_US -. Cache Miss
~5% of requests .-> Origin[Origin Servers
US-East-1
AWS]
    CF_EU -. Cache Miss .-> Origin
    CF_AS -. Cache Miss .-> Origin

    CF_US -. 95% Cache Hit
~20ms .-> US_User
    CF_EU -. 95% Cache Hit
~15ms .-> EU_User
    CF_AS -. 95% Cache Hit
~10ms .-> ASIA_User

    style CF_US fill:#fa0,stroke:#333
    style CF_EU fill:#fa0,stroke:#333
    style CF_AS fill:#fa0,stroke:#333

CDN global PoP (Points of Presence): users connect to nearest edge server. Origin only sees cache misses.

CDN Implementation Checklist

  1. Move all static assets (JS, CSS, images) to S3 or equivalent object storage
  2. Put CloudFront/Cloudflare in front of S3 with long TTLs (1 year for versioned assets)
  3. Use content hashing for filenames: app.a3f29c.js — the hash changes when the content changes, giving instant cache busting
  4. Configure cache-control headers: Cache-Control: public, max-age=31536000, immutable
  5. For user-uploaded content: route through the CDN with shorter TTLs (7 days)
  6. Monitor the CDN hit rate in your dashboard — target > 90%
  7. Configure the CDN WAF (Web Application Firewall) to block malicious traffic at the edge

Cache-Busting Without Downtime

Never name files static.js and rely on CDN purge for updates — purge propagation takes 30-60 seconds, during which some users see the old version.

Correct approach — content hash in the filename:
  • app.js → app.[sha256hash].js
  • HTML references the hashed filename
  • When the JS changes, the hash changes; the CDN sees a new URL and fetches fresh
  • The old version stays cached for users still on the old HTML
  • Result: zero-downtime deployment
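Content-hash naming is a one-liner with hashlib. A sketch (build tools like webpack or Vite do this automatically; the truncated-digest length is an arbitrary choice here):

```python
# Content-hashed filenames: the name changes iff the bytes change, so a
# new deploy is a new URL and old CDN copies can be cached "forever"
# (Cache-Control: public, max-age=31536000, immutable).
import hashlib

def hashed_name(filename, content: bytes, digest_len=8):
    stem, _, ext = filename.rpartition(".")
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    return f"{stem}.{digest}.{ext}"

v1 = hashed_name("app.js", b"console.log('v1')")
v2 = hashed_name("app.js", b"console.log('v2')")
assert v1 != v2                                           # new content, new URL
assert v1 == hashed_name("app.js", b"console.log('v1')")  # deterministic
```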

Level 7: Database Sharding — Breaking the Write Bottleneck

At 50M users, your database primary is at 80% CPU even with read replicas handling all reads. Writes are the problem. You have 50M users × 5 writes/day = 250M writes/day = 2,893 writes/sec sustained. A single PostgreSQL primary maxes out at ~5,000 writes/sec on powerful hardware — you're dangerously close.

Sharding (also called horizontal partitioning) splits your data across multiple independent database instances. Each instance is a "shard" responsible for a subset of data.

graph TB
    App[App Servers] --> Router[Shard Router
user_id % 4 = shard]

    Router -- user_id 0,4,8,12... --> Shard0[(Shard 0
PostgreSQL Primary
Users 0,4,8...)]
    Router -- user_id 1,5,9,13... --> Shard1[(Shard 1
PostgreSQL Primary
Users 1,5,9...)]
    Router -- user_id 2,6,10,14... --> Shard2[(Shard 2
PostgreSQL Primary
Users 2,6,10...)]
    Router -- user_id 3,7,11,15... --> Shard3[(Shard 3
PostgreSQL Primary
Users 3,7,11...)]

    Shard0 --> S0R1[(Shard 0 Replica)]
    Shard1 --> S1R1[(Shard 1 Replica)]

    style Router fill:#f90,stroke:#333,stroke-width:3px

Hash-based sharding by user_id: user_id % 4 determines which shard. Each shard has its own primary and replicas.

| Sharding Strategy | How It Works                              | Pros                                  | Cons                                     | Real-World Example                    |
| ----------------- | ----------------------------------------- | ------------------------------------- | ---------------------------------------- | ------------------------------------- |
| Hash-based        | hash(shard_key) % num_shards              | Even distribution, predictable        | Resharding is expensive                  | Instagram user_id, Stripe customer_id |
| Range-based       | shard_key 0-1M → shard 1, 1M-2M → shard 2 | Easy range queries, easier resharding | Hot spots if a range is popular          | MongoDB default, HBase                |
| Directory-based   | Lookup table: key → shard mapping         | Maximum flexibility, easy resharding  | Lookup table becomes a bottleneck        | Foursquare                            |
| Geo-based         | Route by user geography                   | GDPR compliance, low latency          | Uneven distribution (US > other regions) | Airbnb, WhatsApp                      |
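Hash-based routing from the table is a few lines. A sketch, with MD5 as one arbitrary choice of stable hash — the important property is that every app server maps the same key to the same shard, which rules out Python's per-process-salted built-in `hash()`:

```python
# Hash-based shard routing: hash(shard_key) % num_shards.
import hashlib

NUM_SHARDS = 4

def shard_for(shard_key) -> int:
    digest = hashlib.md5(str(shard_key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Purely numeric IDs can simply use user_id % NUM_SHARDS, as in the
# diagram above; the hash version also handles string keys evenly.
assert all(0 <= shard_for(uid) < NUM_SHARDS for uid in range(1000))
assert shard_for(42) == shard_for(42)   # deterministic across processes
```

The cost noted in the table is visible here too: changing `NUM_SHARDS` remaps almost every key, which is why resharding under modulo hashing requires a data migration (consistent hashing reduces, but does not eliminate, that movement).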

Sharding Is Painful — Avoid Until Necessary

Instagram waited until 30M users to shard. Before sharding, they tried:
  1. Vertical scaling of the primary (a larger instance)
  2. Read replicas (for reads)
  3. Aggressive caching (Redis for hot data)
  4. Query optimization (indexes, batching)

Only when writes hit 2,000/sec sustained did they shard. The pain of sharding:
  • Cross-shard queries become application-level joins
  • Distributed transactions are very hard
  • Resharding (adding more shards) requires data migration
  • Operational complexity: 4 primaries to monitor instead of 1

Do not shard until you absolutely must.

Level 8: Full Production Architecture — The Complete Picture

At 100M+ users, your architecture has evolved from a single server to a multi-layer globally distributed system. Here is what the complete architecture looks like — and the numbers behind it.

graph TB
    User[👤 2B Users Globally] --> CF[Cloudflare CDN
250 PoPs worldwide
DDoS protection
WAF + Rate Limiting]

    CF --> GLB[Global Load Balancer
Geodns / Anycast
Routes to nearest region]

    GLB --> US[US-East Region]
    GLB --> EU[EU-West Region]
    GLB --> AS[Asia-Pacific Region]

    subgraph US [US-East AWS Region]
        ALB_US[ALB
HTTPS/HTTP2] --> AppTier_US[App Tier
500x c6g.xlarge
Auto-scaling group]
        AppTier_US --> Cache_US[Redis Cluster
12x r6g.2xlarge
13TB cache]
        AppTier_US --> DB_US_Primary[(DB Primary
r6g.16xlarge
128 vCPU)]
        DB_US_Primary --> DB_US_R1[(Replica 1)]
        DB_US_Primary --> DB_US_R2[(Replica 2)]
        DB_US_Primary --> DB_US_R3[(Replica 3)]
    end

    DB_US_Primary -. Cross-region async
replication .-> DB_EU_Primary[(EU Primary)]
    DB_US_Primary -. Cross-region async .-> DB_AS_Primary[(APAC Primary)]

    AppTier_US --> MQ[Kafka
Message Queues
Async processing]
    MQ --> Workers[Worker Fleet
Image processing
Notifications
Analytics]
    Workers --> S3_CDN[S3 + CloudFront
User photos/videos
Global CDN]

Full production architecture at 2B users: multi-region, CDN, load balancers, app tier, Redis, sharded DB, message queues.

| Layer          | Technology                     | Scale               | Monthly Cost (rough) |
| -------------- | ------------------------------ | ------------------- | -------------------- |
| CDN            | Cloudflare Enterprise          | 5PB/month bandwidth | $50,000              |
| Load Balancers | AWS ALB (3 regions)            | 50M req/hour each   | $30,000              |
| App Servers    | 1,500x c6g.xlarge (auto-scale) | Avg 500 active      | $300,000             |
| Redis Cache    | 36x r6g.2xlarge cluster        | 40TB total cache    | $150,000             |
| Database (RDS) | 12 primaries, 36 replicas      | 100TB total data    | $500,000             |
| S3 Storage     | Multi-region replication       | 50PB user content   | $1,000,000           |
| Kafka/queues   | 200-node Kafka cluster         | 50M events/sec      | $200,000             |
| Total          | —                              | —                   | ~$2.2M/month         |

The Cost of 1B vs 1M Users

1M users: ~$5,000/month → $0.005 per user/month
1B users: ~$2,200,000/month → $0.0022 per user/month

Economies of scale: at 1B users, cost per user is LOWER than at 1M users. This is why scaling is worth it — marginal cost decreases as you grow.

Scaling Decision Framework: When to Add Each Layer

The biggest mistake in system design is over-engineering. Engineers who have worked at Google sometimes design Google-scale systems for 1,000-user apps. This is expensive and slows development.

| User Scale  | Architecture                 | Key Components                           | Monthly Cost |
| ----------- | ---------------------------- | ---------------------------------------- | ------------ |
| 0 - 10K     | Single server                | EC2 t3.medium + RDS db.t3.medium         | $100         |
| 10K - 100K  | Separated DB                 | EC2 t3.large + RDS r6g.large + S3        | $500         |
| 100K - 1M   | + Load balancer              | ALB + 3x app servers + RDS r6g.xlarge    | $2,000       |
| 1M - 10M    | + Read replicas + Redis      | 3 app servers + 2 replicas + ElastiCache | $8,000       |
| 10M - 100M  | + CDN + Sharding             | CDN + 20 app servers + 4 DB shards       | $50,000      |
| 100M - 1B   | Multi-region + Microservices | Full distributed system                  | $500,000+    |
| 1B+         | Custom infrastructure        | Custom hardware, networking, storage     | $2M+/month   |

The Instagram Rule

"Design for 10x your current scale, not 100x."

  • At 10K users: design for 100K
  • At 1M users: design for 10M
  • At 100M users: design for 1B

Do NOT design for 1B users when you have 10K — that architecture is 100x more complex and expensive. Focus on shipping features, not premature scaling infrastructure.

Scaling Decision Signals: When to Add Each Layer

  • Add DB Separation — App server CPU > 60% AND DB is part of it. Or you want to scale app servers independently.
  • Add Load Balancer — Single app server at > 60% CPU sustained, or you need zero-downtime deployments, or any availability requirement.
  • Add Read Replicas — DB primary CPU > 60% AND 80%+ of queries are SELECTs. Replication lag must be acceptable for your use case.
  • Add Redis Cache — Repeated identical DB queries. Same data fetched by multiple users. Rate limiting needed. Session storage needed.
  • Add CDN — Static file bandwidth > 1TB/month. Or serving users in multiple geographies. Or images/video are significant traffic.
  • Add DB Sharding — DB write throughput > 70% of primary capacity. Single primary cannot keep up with write load. Data size > 5TB per instance.
  • Go Multi-Region — Users in multiple continents with latency complaints. Need 99.99%+ SLA. Regulatory data residency requirements.
How this might come up in interviews

This is the foundation for every system design interview. Interviewers expect you to narrate the scaling journey when designing any system — "at 100K users I would add X, at 10M users I would add Y." Not knowing this journey is a red flag at L5+ levels.

Common questions:

  • Design Instagram. Walk me through the architecture at 10K, 1M, and 1B users. (L5 - L7 question)
  • Your single-server app is slow. Database CPU is at 90%. What do you do? Walk me through your debugging and solution approach. (L4 - L5 question)
  • When would you choose vertical scaling over horizontal scaling for the database tier? What are the trade-offs? (L5 question)
  • Explain cache stampede and three ways to prevent it. (L5 - L6 question)
  • Your app serves users in US and Europe with 300ms latency for EU users. How do you reduce this to < 50ms? (L5 - L6 question)
  • Instagram has a "hot user" problem: 1 million people follow a celebrity, and when they post, it generates 1M writes. How do you handle this? (L6 - L7 question)

Key takeaways

  • Every system starts as a single server. Scale incrementally — add each layer when you hit the specific bottleneck that layer solves.
  • The scaling layers in order: database separation → load balancer → read replicas → caching → CDN → sharding → multi-region.
  • Instagram reached 1B users with 13 engineers by following this exact playbook, not by over-engineering from day 1.
  • Caching is the highest-ROI scaling technique: a Redis cache hit is 100-500x faster than a database query.
  • Sharding is painful — exhaust all other options (replicas, caching, vertical scaling) before sharding.
Before you move on: can you answer these?

Your app has 500K DAU, each making 20 requests/day. Calculate peak QPS assuming 5x peak multiplier. What architecture handles this?

500K × 20 / 86,400 = 115 req/sec avg. × 5 = 578 req/sec peak. Handled by: 2-3 app servers behind a load balancer, with DB read replicas. Single server maxes at ~300-500 req/sec.

At what point should you add read replicas vs add a Redis cache? What signal tells you which one to add first?

Add Redis FIRST if you have repeated identical queries (same data fetched many times). Redis is 100x cheaper than another DB replica. Add read replicas when DB CPU is high DESPITE good cache hit rates — meaning new data is needed, not repeated data.

Why is sharding considered a last resort, and what are 3 alternatives to try first?

Sharding makes cross-shard queries expensive (application-level joins) and distributed transactions nearly impossible. Try first: (1) vertical scale the primary, (2) aggressive caching to reduce write load, (3) read replicas to eliminate all read load from primary. Only shard when writes exceed single-machine capacity.

🧠Mental Model

💡 Analogy

Scaling a system is like scaling a restaurant. You start as a food truck (single server) — one chef handles everything: cooking, serving, payment. It works for 50 customers/day. At 200 customers, you rent a small restaurant: separate kitchen (database server) from dining area (app server). At 500 customers, you hire more servers and add a host to direct them (load balancer). At 2,000 customers, you open a second location and replicate the kitchen setup (read replicas). At 10,000 customers, you franchise the concept: multiple independent restaurants with shared brand and supply chain (sharding + CDN). Each step adds complexity but handles more demand.

⚡ Core Idea

Every scaling journey follows the same pattern: identify the current bottleneck → add the layer that removes that bottleneck → repeat. The layers are: database separation, load balancing, read replicas, caching, CDN, sharding. Each layer solves a specific constraint at a specific scale.

🎯 Why It Matters

Without understanding the scaling journey, you will either under-engineer (and crash at 100K users like early Instagram) or over-engineer (waste 6 months building Google-scale infrastructure for a 1,000-user app). Knowing WHEN to add each layer is the difference between a startup that scales and one that dies from success.

Sign in to start or join a thread.