The two fundamental scaling strategies. When each is right. Stateless vs stateful. Auto-scaling groups. Real cost math from AWS. Why databases are the exception that proves every rule.
Lesson outline
Every successful system eventually hits the same wall: one machine can't keep up. Traffic grows, response times climb, errors increase. At this point, you have two options: make the machine bigger (vertical scaling) or add more machines (horizontal scaling). The choice you make defines your entire architecture for years.
Most engineers starting out assume the answer is obvious: "add more servers." The reality is more nuanced: adding servers introduces an entirely new class of problems, including distributed state, network partitions, consensus, and load distribution. The art of scaling is knowing which problems are preferable at each stage of your system's growth.
The Scaling Decision Timeline
Most successful systems follow this path: (1) Start: Single machine, no scaling needed. (2) Early growth: Add RAM/CPU (vertical) — easiest path, no code changes. (3) Scale: Hit hardware ceiling, add horizontal application servers. (4) Database bottleneck: Add read replicas (quasi-horizontal). (5) Write bottleneck: Add sharding (true horizontal DB scaling — hardest part). (6) Mature: Auto-scaling groups, multi-region, microservices. Understanding this timeline prevents premature optimization.
To make the scaling conversation concrete, let's use a real example: a startup that builds a booking platform. At launch, they have one server running everything — API, database, static files. The first 1,000 users work fine. At 10,000 concurrent users, the API starts timing out. At 50,000, the database falls over. What do they do?
graph LR
A[Single Server
All components
1K users OK] -->|Traffic grows| B[Server CPU 80%
Response time ↑]
B --> C{Decision Point}
C -->|Option 1| D[Vertical Scale
Bigger machine
r5.4xlarge → r5.16xlarge]
C -->|Option 2| E[Horizontal Scale
Multiple servers
+ Load Balancer]
D --> F[Fast to execute
No code changes
But has a ceiling]
E --> G[Complex setup
Code must be stateless
But unlimited scale]
The scaling decision tree: vertical is fast but limited, horizontal is complex but unlimited
This guide walks through both options in depth: when each is right, how to execute each, their cost curves, and the special case of database scaling — which is fundamentally different from application server scaling and where most teams get stuck.
Vertical scaling means upgrading to a more powerful machine: more CPU cores, more RAM, faster storage, more network bandwidth. It's the first scaling strategy you should exhaust before reaching for horizontal scaling, because it requires zero code changes and is immediately effective.
On AWS, vertical scaling means changing instance type. AWS organizes instances into "instance families" optimized for different bottlenecks:
| Instance Family | Optimized For | Max Config | Use Case |
|---|---|---|---|
| m-series (m7g, m6i) | General purpose (balanced CPU/RAM) | m7g.metal: 64 cores, 256 GB RAM | API servers, microservices, general workloads |
| c-series (c7g, c6i) | Compute-intensive | c7g.metal: 64 cores, 128 GB RAM | CPU-bound: video encoding, ML inference, cryptography |
| r-series (r7g, r6i) | Memory-intensive | r6i.metal: 128 cores, 1,024 GB RAM | Large in-memory databases, Redis, Elasticsearch |
| x-series (x2idn) | Extreme memory | x2idn.metal: 128 cores, 4,096 GB RAM (4TB!) | SAP HANA, in-memory analytics, massive databases |
| i-series (i4i) | Storage-intensive (NVMe SSD) | i4i.metal: 128 cores, 1,024 GB RAM, 30 TB NVMe | High-I/O databases, log analytics, Elasticsearch |
The Vertical Scaling Cost Curve
Vertical scaling does not have a linear cost curve — it's superlinear. Doubling RAM costs more than double. AWS pricing example: m6i.xlarge (4 cores, 16 GB): $0.192/hr. m6i.2xlarge (8 cores, 32 GB — 2x): $0.384/hr (exactly 2x). m6i.8xlarge (32 cores, 128 GB — 8x): $1.536/hr (still linear). BUT: m6i.metal (128 cores, 512 GB): $7.776/hr (a premium over linear). At the extremes, you pay a 30-50% premium per unit of compute for the biggest instances. This is the economic inflection point where horizontal scaling becomes cheaper.
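The superlinear curve is easiest to see as cost per vCPU. A quick sketch using the on-demand rates quoted above (treat the exact figures as illustrative; AWS prices vary by region and change over time):

```python
# Per-vCPU hourly cost across the m6i sizes quoted above.
prices = {
    "m6i.xlarge":  (4,   0.192),   # (vCPUs, $/hr)
    "m6i.2xlarge": (8,   0.384),
    "m6i.8xlarge": (32,  1.536),
    "m6i.metal":   (128, 7.776),
}

per_vcpu = {name: rate / vcpus for name, (vcpus, rate) in prices.items()}
for name, cost in per_vcpu.items():
    print(f"{name:12s} ${cost:.5f}/vCPU-hr")

# The small and mid sizes are exactly linear ($0.048/vCPU-hr); at these
# prices the metal instance carries a ~27% premium over that baseline
# (linear pricing would put it near $6.14/hr instead of $7.776/hr).
premium = per_vcpu["m6i.metal"] / per_vcpu["m6i.xlarge"] - 1
print(f"metal premium over linear: {premium:.1%}")
```

That premium, compounded across a fleet, is the economic inflection point the callout above describes.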
The hard ceiling on vertical scaling: even the biggest AWS instance (x2idn.metal) has 128 cores and 4TB RAM. At some scale, there is no bigger machine. Google, Facebook, and Amazon hit this ceiling in the 2000s and built horizontal-first architectures as a result. If you're not at Google scale, you have a lot of vertical headroom.
When Vertical Scaling Is the Right Answer
Horizontal scaling means adding more machines and distributing work across them. It's the only approach that works at massive scale — but it requires your application to be designed for it. The central requirement is statelessness: each server must be able to handle any request independently, without depending on local state from a previous request.
The Stateless Requirement: Why It's Non-Negotiable
Imagine you have 3 servers and a user logs in on Server 1. Server 1 stores the session in memory: {user_id: 123, auth_token: "xyz"}. On the next request, the load balancer routes to Server 2. Server 2 has no session. The user sees a login page again. This is the session affinity problem — it breaks horizontal scaling. The fix: never store session state on application servers. Store it externally in Redis or JWT tokens that are self-contained.
Once your app is stateless, horizontal scaling works as follows: a load balancer receives all incoming requests and distributes them across N identical application servers. Any server can handle any request. You can add or remove servers at any time without affecting users. This is the foundation of every scalable web architecture.
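A minimal sketch of why externalized sessions make any server interchangeable. The plain dict stands in for a shared Redis instance, and the class and method names are invented for illustration:

```python
import uuid

# A plain dict stands in for Redis here; in production this would be a
# shared Redis instance reachable from every app server.
shared_session_store = {}

class AppServer:
    """A stateless app server: it keeps no per-user state of its own."""

    def login(self, user_id):
        session_id = str(uuid.uuid4())
        shared_session_store[session_id] = {"user_id": user_id}
        return session_id              # returned to the client as a cookie

    def handle_request(self, session_id):
        session = shared_session_store.get(session_id)
        if session is None:
            return "401 Unauthorized"
        return f"200 OK for user {session['user_id']}"

server1, server2 = AppServer(), AppServer()
cookie = server1.login(123)            # user logs in via Server 1
print(server2.handle_request(cookie))  # Server 2 serves the next request fine
```

If the session lived in a member variable on `server1` instead, the second call would return 401, which is exactly the session affinity problem described above.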
graph LR
C[Client] --> LB[Load Balancer]
LB --> S1[App Server 1
Stateless]
LB --> S2[App Server 2
Stateless]
LB --> S3[App Server 3
Stateless]
S1 --> Redis[(Redis
Session Store)]
S2 --> Redis
S3 --> Redis
S1 --> DB[(Database)]
S2 --> DB
S3 --> DB
note[Each server can handle
any request - no shared
in-memory state]
Horizontal scaling: stateless app servers + shared external state (Redis for sessions)
Making an Application Stateless: The Migration Path
1. Audit all session storage: find every place your app stores state in server memory. Common culprits: the Flask session object, PHP $_SESSION, in-memory JWT stores, WebSocket connection state.
2. Move sessions to Redis: replace server-side session stores with Redis (npm install express-session connect-redis, then configure the Redis store). Sessions now survive server restarts and work across multiple servers.
3. JWT for stateless auth: an alternative to Redis sessions. Encode user claims in a signed JWT token; the server validates the signature without looking anything up. Tradeoff: tokens can't be revoked until expiry, so store a revocation list in Redis if needed.
4. Externalize all caches: any in-memory cache (Python dict, Java HashMap) must move to Redis or be acceptable to lose. If a cache miss causes only a slightly slower response (not a crash), you can keep local caches as a first layer.
5. Move file uploads to object storage: a file on Server 1's disk is not accessible from Server 2, so any files stored locally must move to S3 (or equivalent). All file storage must be external.
6. Verify with a kill test: deploy your change, then kill one server while load testing. User requests should continue without errors; if they fail, you still have state on servers.
| State Type | Naive (Breaks Horizontal Scale) | Fixed (Stateless) |
|---|---|---|
| User sessions | In-memory dict on server | Redis with TTL, or JWT tokens |
| File uploads | Server local filesystem | S3 / GCS / object storage |
| In-process cache | Python dict / Java HashMap | Redis + local L1 cache (TTL < 30s) |
| WebSocket connections | Per-server connection map | Redis pub/sub for broadcast, sticky sessions for single-user connection |
| Background job state | Thread-local variables | Celery/RQ with Redis/SQS backend |
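The "Redis + local L1 cache" row deserves a sketch, since a short-TTL local cache is the one piece of per-server state that is safe to keep. Here a shared dict stands in for Redis, and the class name and TTL default are illustrative:

```python
import time

class TwoTierCache:
    """A small local L1 dict with a short TTL in front of a shared store.

    Losing the L1 on server restart is harmless: the next read falls
    through to the shared tier. The short TTL bounds staleness.
    """

    def __init__(self, shared, l1_ttl=30.0):
        self.shared = shared          # e.g. a Redis client in production
        self.l1 = {}                  # per-process cache: fast but lossy
        self.l1_ttl = l1_ttl

    def get(self, key):
        hit = self.l1.get(key)
        if hit is not None and time.monotonic() - hit[1] < self.l1_ttl:
            return hit[0]             # L1 hit: no network round trip
        value = self.shared.get(key)  # L2: shared across all servers
        if value is not None:
            self.l1[key] = (value, time.monotonic())
        return value

shared = {"user:123": "alice"}
cache = TwoTierCache(shared)
print(cache.get("user:123"))          # first call fills L1 from the shared tier
```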
Session management is the most common reason apps fail to scale horizontally, and it's entirely solvable. There are two patterns: external session stores (Redis) and stateless tokens (JWT). Understanding the trade-offs tells you when to use each.
The Redis session store approach: when a user logs in, generate a random session_id, store session data in Redis with a TTL, return the session_id as a cookie. On subsequent requests, look up session_id in Redis (0.5ms). Any server can serve any request because they all share the same Redis.
Redis Session Store: The Architecture
Login: POST /login → validate credentials → generate session_id (UUID) → Redis SET session:{id} {user_id, permissions, created_at} EX 86400 (24hr TTL) → return cookie session_id={id}. Request: GET /profile → read cookie → Redis GET session:{id} → if found: authenticated, load user_id → if not found: 401 Unauthorized. Logout: Redis DEL session:{id} → session immediately invalid on all servers.
The JWT (JSON Web Token) approach: encode user data directly in a signed token. The server validates the signature without any database or Redis lookup. Perfectly stateless — no external store needed.
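To make "validates the signature without any lookup" concrete, here is a minimal HS256 JWT built from the standard library alone. This is a sketch of the mechanism, not production code; use a vetted library such as PyJWT in a real system:

```python
import base64, hashlib, hmac, json, time

def _b64url(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def _unb64url(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def jwt_encode(claims, secret):
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def jwt_verify(token, secret):
    """Return the claims if the signature is valid and unexpired, else None."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None                       # tampered, or signed with another key
    claims = json.loads(_unb64url(payload))
    if claims.get("exp", float("inf")) < time.time():
        return None                       # expired
    return claims

token = jwt_encode({"user_id": 123, "exp": time.time() + 900}, b"secret")
print(jwt_verify(token, b"secret"))       # claims come back with zero lookups
```

Note what is missing: there is no store to consult, which is both the appeal (pure CPU verification) and the limitation (no way to revoke before `exp`).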
| Characteristic | Redis Sessions | JWT Tokens |
|---|---|---|
| Revocation | Immediate: DEL session:{id} | Cannot revoke until expiry (unless maintaining blocklist in Redis) |
| Latency | ~0.5ms Redis network round trip | No network hop; signature verified locally in CPU |
| Storage | Redis cluster storage cost | No server-side storage — token lives in client |
| Scalability | Scales with Redis cluster | Infinite — no shared state anywhere |
| Token size | ~36 bytes (UUID in cookie) | ~500-2000 bytes (JSON + signature in header) |
| User data freshness | Always current (read from Redis) | Stale until token expiry — role/permission changes lag |
| Best for | Apps needing immediate session invalidation (banking, healthcare) | Stateless APIs, microservices, mobile clients |
The Hybrid Pattern (Production Best Practice)
Most production systems use both: JWT for authentication (who are you?) + Redis for authorization/session state (what can you do right now?). JWT contains: user_id, issued_at, expiry (15 minutes). Redis contains: user permissions, feature flags, active session status. Short JWT expiry + Redis check = best of both worlds: stateless auth lookup, instant permission revocation, no stale permission bugs.
Auto-scaling is horizontal scaling on autopilot: the system automatically adds servers when traffic increases and removes them when it decreases. This sounds simple but has significant operational complexity. Wrong configuration is responsible for a disproportionate number of production outages.
There are two fundamental auto-scaling strategies: reactive and predictive.
Reactive vs. Predictive Auto-Scaling
The Cooldown Period Trap
Cooldown periods prevent thrashing: after scaling up, wait N minutes before scaling again. This prevents adding 1 server, waiting 60 seconds, adding another, waiting, etc. BUT: too long a cooldown can cause outages. If traffic spikes hard and your cooldown is 5 minutes, you can't scale fast enough. Slack's 2019 outage (in our war story) was caused partly by a 5-minute scale-in cooldown that removed too many servers when traffic dropped, then couldn't scale back fast enough when a traffic spike hit. Recommended: scale-out cooldown: 60-120 seconds. Scale-in cooldown: 5-10 minutes (asymmetric — be cautious about removing servers).
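The asymmetric policy recommended above fits in a few lines. This is a sketch of the decision logic, not a real AWS API; the constants are the recommended values from the callout:

```python
SCALE_OUT_COOLDOWN = 120   # seconds: add capacity aggressively
SCALE_IN_COOLDOWN = 600    # seconds: remove capacity cautiously

def may_scale(direction, now, last_scale_at):
    """Gate a scaling action on the cooldown for its direction."""
    cooldown = SCALE_OUT_COOLDOWN if direction == "out" else SCALE_IN_COOLDOWN
    return now - last_scale_at >= cooldown

print(may_scale("out", now=1130, last_scale_at=1000))  # True: 130s elapsed
print(may_scale("in",  now=1130, last_scale_at=1000))  # False: needs 600s
```

The asymmetry encodes the lesson from the Slack incident: being slow to add capacity causes outages, while being slow to remove it only costs a little money.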
AWS Auto Scaling Group Configuration Best Practices
1. Set minimum and maximum capacity: min=2 (for availability during maintenance), max=50 (cost protection). Never set max to unlimited; a traffic spike can run up a massive bill.
2. Choose the right scaling metric: CPU utilization is the most common but not always right. For I/O-bound apps, scale on network throughput or request count; for databases, on connection count; for queues, on queue depth (number of unprocessed messages).
3. Use target tracking scaling: instead of "scale when CPU > 70%", use "keep CPU at 60%." AWS continuously adjusts capacity to maintain the target, which works much better than threshold-based alarms.
4. Configure health checks properly: EC2 health checks alone are insufficient because they only check that the instance is running. Configure Application Load Balancer health checks to verify the app is actually serving traffic, and remove unhealthy instances from rotation immediately.
5. Use lifecycle hooks for warm-up time: new instances often need time to warm up (JVM JIT compilation, cache warming, establishing DB connections). Add a lifecycle hook to keep instances in a "pending" state for 60-120 seconds before adding them to the load balancer.
6. Use asymmetric cooldowns: scale-out cooldown = 120 seconds, scale-in cooldown = 600 seconds. Scaling in (removing capacity) should be conservative; scaling out (adding capacity to handle load) should be aggressive.
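Target tracking (item 3 above) is easy to reason about because the capacity math is simple: keep metric-per-instance at the target. The sketch below mirrors that idea; the real AWS service smooths over multiple datapoints, so treat this as an approximation:

```python
import math

def desired_capacity(current_capacity, current_metric, target_metric):
    """Approximate target-tracking math: hold the per-instance metric
    at the target by resizing the fleet proportionally."""
    return math.ceil(current_capacity * current_metric / target_metric)

# 10 instances averaging 84% CPU against a 60% target -> scale out to 14.
print(desired_capacity(10, 84, 60))
# 10 instances averaging 30% CPU against a 60% target -> scale in to 5.
print(desired_capacity(10, 30, 60))
```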
graph LR
M[CloudWatch Metrics
CPU, Req/sec, Queue Depth] --> A[Auto Scaling Policy
Target: CPU 60%]
A -->|CPU > 70%| ScaleOut[Launch new instance
Warm-up: 90 sec
Add to Load Balancer]
A -->|CPU < 40%| ScaleIn[Drain connections
300 sec cooldown
Terminate instance]
ScaleOut --> ALB[Application Load Balancer]
ScaleIn --> ALB
ALB --> S1[Server 1]
ALB --> S2[Server 2]
ALB --> SN[Server N]
AWS Auto Scaling Group: metrics trigger scale-out or scale-in with warm-up and cooldown periods
Everything we've discussed so far applies cleanly to application servers: make them stateless, put a load balancer in front, auto-scale. Databases are completely different. Databases are *inherently stateful* — they are the state. You can't just spin up 5 more Postgres instances and round-robin queries across them. At least, not without significant complexity.
There are three ways to scale databases horizontally, each with significant trade-offs:
Database Horizontal Scaling Options
The Premature Sharding Mistake
Sharding should be a last resort, not a first instinct. Engineers hear "100M users" and immediately say "we should shard." But Postgres on a large r6i.16xlarge (64 cores, 512 GB RAM) handles 50K-100K QPS with proper indexing. Stripe was at billions of dollars in payment volume before they sharded their primary database. Instagram reached 300M users before sharding. Start with read replicas + vertical scaling. Shard only when a single machine cannot handle write QPS or when storage exceeds 10-20TB.
graph TB
App[Application Servers] --> PrimaryDB[(Primary DB
All Writes)]
App -->|Read queries| R1[(Read Replica 1)]
App -->|Read queries| R2[(Read Replica 2)]
App -->|Read queries| R3[(Read Replica 3)]
PrimaryDB -->|Async replication
1-50ms lag| R1
PrimaryDB -->|Async replication| R2
PrimaryDB -->|Async replication| R3
note1[Read Replicas: Simple, handles 95% of scaling needs
But: replication lag, no write scaling]
Read replicas: the simplest DB horizontal scaling strategy — one primary for writes, N replicas for reads
The hard question in database scaling: when do you switch from read replicas to sharding? The signals are: (1) Write QPS exceeds primary DB capacity (~50K QPS for Postgres). (2) Single table exceeds 500GB (queries slow down significantly). (3) Replication lag consistently exceeds 1 second (replicas too far behind). At this point, sharding is necessary, and it's a massive engineering investment.
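The application-side change that read replicas demand is read/write splitting. A minimal router sketch (connection handles are plain strings here for illustration; in production they would be connection pools, and real routers must also pin transactions and read-your-own-writes reads to the primary):

```python
import itertools

class ReadWriteRouter:
    """Send writes to the primary; round-robin reads across replicas."""

    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE"}

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def connection_for(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.primary       # writes must see the primary
        return next(self._replicas)   # reads tolerate a little replication lag

router = ReadWriteRouter("primary", ["replica-1", "replica-2", "replica-3"])
print(router.connection_for("SELECT * FROM bookings"))    # replica-1
print(router.connection_for("INSERT INTO bookings ..."))  # primary
```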
The cost comparison between vertical and horizontal scaling is not as simple as "horizontal is always cheaper at scale." It depends heavily on your traffic pattern, the type of bottleneck, and the operational cost of managing distributed systems.
Let's run the numbers for a specific scenario: you have an API server currently handling 10K QPS on a single server. Traffic is expected to grow 10x to 100K QPS. What are your options?
| Option | Current Setup | Configuration for 100K QPS | Monthly Cost | Operational Complexity |
|---|---|---|---|---|
| Vertical Only | c6i.2xlarge ($0.34/hr) handles 10K QPS | c6i.32xlarge ($5.44/hr) handles ~120K QPS | $3,916/month | Low — same deployment, same config |
| Horizontal (App only) | 3× c6i.xlarge ($0.17/hr each) + ALB | 30× c6i.xlarge + ALB | $3,671 + $40 ALB = $3,711/month | Medium — need load balancer, auto-scaling config, stateless app |
| Horizontal + Managed DB | RDS r6g.2xlarge ($0.63/hr) | RDS r6g.2xlarge + 4 read replicas ($0.63 × 5) | $2,268 DB + $3,711 app = $5,979/month | Medium-High — read replica routing, replication monitoring |
| Full Horizontal (Sharded) | Single RDS | 4 DB shards × r6g.large ($0.32/hr each) | ~$8,000+ total | Very High — shard routing logic, cross-shard queries, rebalancing |
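The monthly figures in the table fall out of simple arithmetic, assuming roughly 720 billed hours per month; small rounding differences from the table are expected:

```python
HOURS = 720  # ~hours per month used for the table's estimates

vertical = 5.44 * HOURS                        # one c6i.32xlarge
horizontal = 30 * 0.17 * HOURS + 40            # 30x c6i.xlarge + ALB
with_replicas = horizontal + 5 * 0.63 * HOURS  # + RDS primary and 4 replicas

print(f"vertical only:     ${vertical:,.0f}/month")
print(f"horizontal app:    ${horizontal:,.0f}/month")
print(f"+ managed DB tier: ${with_replicas:,.0f}/month")
```

Note how close the first two options are in raw AWS cost: at this scale the decision hinges on the operational-complexity column, not the bill.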
The Hidden Cost: Engineering Time
AWS cost is only part of the equation. Moving from vertical to horizontal requires: 1-2 weeks of engineering to make app stateless, 1 week for load balancer setup and deployment pipeline changes, 2-3 weeks for read replica setup and query routing, 3-6 months for sharding implementation. At $200K/year average engineer salary, 1 month of engineering time ≈ $16,667. A startup that spends 2 months on horizontal scaling to save $500/month on AWS costs needs 33 months to break even. Vertical scaling first is often the right financial decision.
The right answer depends on your situation. Here's the decision framework:
The Scaling Decision Framework
1. Is your bottleneck CPU, RAM, or I/O? CPU-bound: vertical CPU upgrade or horizontal scaling. RAM-bound: vertical RAM upgrade (cheapest). I/O-bound: SSD instance types or read replicas.
2. How much runway does vertical give you? If the next vertical tier buys 12+ months, take it. One engineer-week is worth avoiding 6 months of distributed-system complexity.
3. Is your app stateless? If not, can you make it stateless within 1-2 weeks? If the stateless migration takes more than a month, scale vertically first while the team prepares.
4. Is the database your bottleneck? If yes: read replicas first (3-5 days to set up), not horizontal app scaling; the DB is the constraint.
5. Do you have unpredictable traffic spikes? Horizontal with auto-scaling handles spikes; vertical doesn't. If you peak at 3x your baseline, you must provision for peak even during off-hours.
6. What's your ops team size? 1-2 engineers: prefer managed services (RDS, ElastiCache) plus vertical scaling. 5+ engineers: horizontal app scaling becomes operationally sustainable.
Horizontal scaling of a monolith scales everything together: your API servers, your background job processors, your authentication service all scale as one unit. If your image resize service is the bottleneck, you scale the entire monolith — even the parts that aren't bottlenecked. This wastes money.
Microservices solve this by allowing each service to scale independently. The image resize service gets 50 workers. The user profile service runs on 3. The payment service stays at 2 (and runs on dedicated high-security instances). You pay only for what each service actually needs.
When Microservices Scaling Makes Economic Sense
Microservices scaling is worth the overhead when: different components have wildly different scaling needs (10:1 ratio between most and least loaded), components need different hardware types (GPU for ML inference, high memory for caching, high CPU for encoding), components need different security boundaries (payment processing isolated from analytics), or different teams own different components with independent deployment cycles.
graph TB
LB[Load Balancer] --> A1[API Gateway
3 instances]
A1 --> U[User Service
5 instances
Low CPU]
A1 --> I[Image Resize
50 instances
High CPU, GPU]
A1 --> P[Payment Service
2 instances
Isolated network]
A1 --> S[Search Service
10 instances
High RAM]
I --> Q[Job Queue
SQS]
Q --> W[Worker Pool
20-200 instances
Auto-scales on queue depth]
Microservices enable independent scaling: each service scales to exactly what it needs
The trade-off: microservices horizontal scaling is dramatically more complex than monolith horizontal scaling. A monolith with 10 servers is operationally simple. 20 microservices each with 2-50 instances is 40-1000 running processes, each with their own load balancer, health checks, deployment pipeline, service discovery, and distributed tracing. The operational overhead is proportional to service count, not traffic.
The Microservices Premature Optimization Trap
Many startups adopt microservices before they have a scaling problem, introducing massive complexity with no benefit. Netflix, Uber, Amazon all built monoliths first and migrated to microservices when they had specific, proven scaling needs — and hundreds of engineers. For a team of < 20 engineers, a well-structured monolith with horizontal scaling at the app tier (and read replicas for the DB) is almost always the right architecture. Microservices are a solution to an org-scale problem as much as a technical scaling problem.
After covering all the scaling strategies, let's build a practical decision framework. When traffic grows and your system starts struggling, here is the exact order of steps to consider — from simplest to most complex.
The Scaling Decision Order (Simplest First)
1. Profile first: never scale blindly. Use APM tools (New Relic, Datadog, AWS X-Ray) to find the actual bottleneck. Is it a DB query? The network? The CPU? Scaling the wrong thing wastes money and time.
2. Optimize before scaling: a slow DB query with a missing index can cause 10x unnecessary DB load. Adding a single index often "scales" a system more than adding 5 servers; query optimization, connection pooling, and caching are almost always faster ROI than infrastructure changes.
3. Vertically scale the bottleneck: double the RAM/CPU of the bottlenecked component. Typically buys 3-12 months of runway with zero code changes.
4. Add caching: Redis in front of the DB reduces read load by 80-95%. One engineer-week of work, and often makes vertical scaling sufficient for another 12-18 months.
5. Add read replicas to the DB: route read queries to replicas. Scales read capacity by N×, where N is the number of replicas; AWS RDS supports up to 15 read replicas.
6. Horizontally scale the application tier: make the app stateless, add a load balancer, configure auto-scaling. 1-2 weeks of engineering; scales application servers indefinitely.
7. Vertically partition the database: split by functional domain (user_db, product_db). Reduces load per database with no sharding complexity yet.
8. Horizontally shard the database: last resort. Months of engineering and permanent architectural complexity; only when a single DB primary cannot handle the write QPS or data volume.
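Step 4 in the order above is worth a sketch, because cache-aside is the pattern that usually defers every later step. The dict-based `db` and `cache` are stand-ins for Postgres and Redis:

```python
class CacheAside:
    """Cache-aside reads: check the cache first, fall back to the DB,
    then populate the cache so later reads skip the DB entirely."""

    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.db_reads = 0   # counter to show how much DB load we shed

    def get(self, key):
        if key in self.cache:
            return self.cache[key]    # cache hit: no DB load
        self.db_reads += 1            # cache miss: one DB read, then fill
        value = self.db.get(key)
        self.cache[key] = value
        return value

store = CacheAside({"product:1": "widget"})
for _ in range(100):
    store.get("product:1")
print(store.db_reads)  # 1 -> 99 of 100 reads never reached the "database"
```

A real implementation also needs a TTL and an invalidation strategy on writes, which is where most cache bugs live.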
graph TD
Problem[System Under Load] --> Profile[Step 1: Profile
Find the bottleneck]
Profile --> Optimize[Step 2: Optimize
Index queries, fix N+1]
Optimize --> Vertical[Step 3: Vertical Scale
Bigger instance type]
Vertical --> Cache[Step 4: Add Caching
Redis in front of DB]
Cache --> Replicas[Step 5: Read Replicas
Scale read QPS]
Replicas --> HorizontalApp[Step 6: Horizontal App
Stateless + Load Balancer]
HorizontalApp --> VPartition[Step 7: Vertical Partition DB
Split by domain]
VPartition --> Shard[Step 8: Shard DB
Last resort]
style Shard fill:#ff6b6b,color:#fff
style Profile fill:#51cf66,color:#fff
style Optimize fill:#51cf66,color:#fff
Scaling decision order: always start with the simplest solution. Sharding is the nuclear option.
The Slack Approach: Scale What's Hot
Slack engineering published a detailed blog post on their scaling philosophy: "Scale the right component by the right method at the right time." Their API tier is horizontally scaled with auto-scaling groups. Their message fanout (delivering a message to all workspace members) is handled by a dedicated horizontally-scaled fanout service. Their database is vertically scaled with read replicas — no sharding. This selective approach means their complexity is proportional to their actual scaling needs, not aspirational ones.
The most important scaling principle: delay scaling complexity as long as possible. Every scaling mechanism adds operational complexity, increases deployment risk, and requires ongoing maintenance. A system that scales to 10M users on a single well-tuned Postgres with read replicas is better than a 20-microservice system that handles 1M users with constant operational incidents. Scale to the right level at the right time.
Scaling appears in almost every system design interview because every system at scale needs it. Interviewers test: (1) Do you know when vertical vs. horizontal is appropriate? (2) Do you understand the stateless requirement? (3) Can you explain database scaling (read replicas vs. sharding)? (4) Do you mention auto-scaling for variable traffic? The common trap is candidates who immediately jump to sharding for any scaling question — this signals lack of proportionality in engineering judgment.
Common questions:
Key takeaways
Your e-commerce site currently runs on a single r6i.4xlarge (16 cores, 128 GB RAM). Traffic has doubled and CPU is averaging 65%. Your next Black Friday is in 6 weeks. What do you do?
First, profile to confirm CPU is the actual bottleneck (not network or DB). Then, immediately vertical scale to r6i.8xlarge or r6i.16xlarge — zero downtime on RDS, minimal on EC2. This buys time. Simultaneously, check the DB: if DB CPU is below 40%, the app server is the bottleneck and vertical + read replicas should handle Black Friday. In parallel over the next 3 weeks: make the app stateless (move sessions to Redis if not already), configure an auto-scaling group with 3x minimum capacity for Black Friday week, and run a load test at 5x normal traffic to verify. Do NOT attempt to implement sharding in 6 weeks — that is a multi-month project that would create more risk than it solves before a major traffic event.
Your team is debating whether to shard your PostgreSQL database or add more read replicas. You're at 80K read QPS and 8K write QPS, 3TB of data. What do you recommend?
At 80K read QPS and 8K write QPS, sharding is not justified yet. A well-configured Postgres primary (r6g.8xlarge: 32 cores, 256 GB RAM) handles 30-50K write QPS with proper indexing. You're well within write capacity. For reads: 80K QPS across 5 read replicas = 16K QPS per replica, which is very manageable. The correct path: add read replicas (4-6 total). Implement connection pooling (PgBouncer) if not already present. Add Redis caching for hot read patterns. Sharding makes sense if write QPS exceeds 50K (you're at 8K — 6x away) or data exceeds 10-20TB (you're at 3TB). Current architecture with 4 read replicas should handle 3-5x current load. Revisit sharding at 50K write QPS or 15TB data.
An interviewer asks: "Design a system that needs to handle 500K QPS at peak but only 50K QPS at average. How do you scale it?" What is your approach and why?
This is exactly the use case for auto-scaling horizontal infrastructure, not vertical. The 10:1 peak-to-average ratio means vertical scaling (or fixed horizontal) would require provisioning for 500K QPS 24/7 — paying for 10x the average capacity. Auto-scaling with horizontal app servers solves this: at 50K QPS, run 5 servers. As traffic rises toward 500K, auto-scale out to 50 servers. Scale back in after the spike. Specific implementation: Application Load Balancer + EC2 Auto Scaling Group with target-tracking policy (target: 60% CPU). Scale-out cooldown: 90 seconds (aggressive — spikes must be handled fast). Scale-in cooldown: 600 seconds (conservative — don't remove servers mid-spike). Minimum: 5 instances. Maximum: 60 instances (20% headroom above peak). Database: the DB is the harder problem. 500K app QPS likely generates 50K-200K DB QPS. Add a Redis cache (target: 90% hit rate) to reduce DB QPS to 5K-20K — manageable with 1 primary + 3 read replicas.
💡 Analogy
Vertical scaling is like widening a road — adding more lanes to the same highway. It works until you hit the physical boundary of the road width (the machine's hardware limit). Horizontal scaling is like building parallel highways — you can always build another one. But parallel highways need traffic routing (load balancers), consistent road conditions (stateless app servers), and synchronized signals (distributed coordination). The choice isn't which is better in the abstract — it's which is right for your traffic level and team capacity right now.
⚡ Core Idea
The correct scaling strategy depends on where the bottleneck is and what your current scale requires. Most systems follow a progression: optimize → vertical → cache → read replicas → horizontal app → vertical DB partition → shard. Skipping steps wastes engineering time. The goal is to always be at the simplest architecture that handles your current + near-term scale.
🎯 Why It Matters
Choosing the wrong scaling strategy wastes money (over-engineering a system for 100x its actual traffic) or causes outages (under-engineering a system that's about to hit a wall). Scaling decisions also shape your entire team structure: horizontal microservices requires 10x the ops investment of a well-tuned monolith. Every scaling architecture is a bet on where your bottleneck will be at 10x current scale.