The two fundamental scaling strategies. When each is right. Stateless vs stateful. Auto-scaling groups. Real cost math from AWS. Why databases are the exception that proves every rule.
Lesson outline
Every successful system eventually hits the same wall: one machine can't keep up. Traffic grows, response times climb, errors increase. At this point, you have two options: make the machine bigger (vertical scaling) or add more machines (horizontal scaling). The choice you make defines your entire architecture for years.
Most engineers starting out assume the answer is obvious: "add more servers." The reality is more nuanced: adding servers introduces an entirely new class of problems, including distributed state, network partitions, consensus, and load distribution. The art of scaling is knowing which problems are preferable at each stage of your system's growth.
The Scaling Decision Timeline
Most successful systems follow this path: (1) Start: Single machine, no scaling needed. (2) Early growth: Add RAM/CPU (vertical) — easiest path, no code changes. (3) Scale: Hit hardware ceiling, add horizontal application servers. (4) Database bottleneck: Add read replicas (quasi-horizontal). (5) Write bottleneck: Add sharding (true horizontal DB scaling — hardest part). (6) Mature: Auto-scaling groups, multi-region, microservices. Understanding this timeline prevents premature optimization.
To make the scaling conversation concrete, let's use a real example: a startup that builds a booking platform. At launch, they have one server running everything — API, database, static files. The first 1,000 users work fine. At 10,000 concurrent users, the API starts timing out. At 50,000, the database falls over. What do they do?
graph LR
A[Single Server
All components
1K users OK] -->|Traffic grows| B[Server CPU 80%
Response time ↑]
B --> C{Decision Point}
C -->|Option 1| D[Vertical Scale
Bigger machine
r5.4xlarge → r5.16xlarge]
C -->|Option 2| E[Horizontal Scale
Multiple servers
+ Load Balancer]
D --> F[Fast to execute
No code changes
But has a ceiling]
E --> G[Complex setup
Code must be stateless
But unlimited scale]
The scaling decision tree: vertical is fast but limited, horizontal is complex but unlimited
This guide walks through both options in depth: when each is right, how to execute each, their cost curves, and the special case of database scaling — which is fundamentally different from application server scaling and where most teams get stuck.
Vertical scaling means upgrading to a more powerful machine: more CPU cores, more RAM, faster storage, more network bandwidth. It's the first scaling strategy you should exhaust before reaching for horizontal scaling, because it requires zero code changes and is immediately effective.
On AWS, vertical scaling means changing instance type. AWS organizes instances into "instance families" optimized for different bottlenecks:
| Instance Family | Optimized For | Max Config | Use Case |
|---|---|---|---|
| m-series (m7g, m6i) | General purpose (balanced CPU/RAM) | m7g.metal: 64 cores, 256 GB RAM | API servers, microservices, general workloads |
| c-series (c7g, c6i) | Compute-intensive | c7g.metal: 64 cores, 128 GB RAM | CPU-bound: video encoding, ML inference, cryptography |
| r-series (r7g, r6i) | Memory-intensive | r6i.metal: 128 cores, 1,024 GB RAM | Large in-memory databases, Redis, Elasticsearch |
| x-series (x2idn) | Extreme memory | x2idn.metal: 128 cores, 4,096 GB RAM (4TB!) | SAP HANA, in-memory analytics, massive databases |
| i-series (i4i) | Storage-intensive (NVMe SSD) | i4i.metal: 128 cores, 1,024 GB RAM, 30 TB NVMe | High-I/O databases, log analytics, Elasticsearch |
The Vertical Scaling Cost Curve
Vertical scaling does not have a linear cost curve — it's superlinear. Doubling RAM costs more than double. AWS pricing example: m6i.xlarge (4 cores, 16 GB): $0.192/hr. m6i.2xlarge (8 cores, 32 GB — 2x): $0.384/hr (exactly 2x). m6i.8xlarge (32 cores, 128 GB — 8x): $1.536/hr (still linear). BUT: m6i.metal (128 cores, 512 GB): $7.776/hr (a premium over linear). At the extremes, you pay a 30-50% premium per unit of compute for the biggest instances. This is the economic inflection point where horizontal scaling becomes cheaper.
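The superlinear curve is easiest to see as cost per vCPU. A quick sketch using the on-demand rates quoted above (treat the exact figures as illustrative; AWS prices vary by region and change over time):

```python
# Per-vCPU hourly cost across the m6i sizes quoted above.
prices = {
    "m6i.xlarge":  (4,   0.192),   # (vCPUs, $/hr)
    "m6i.2xlarge": (8,   0.384),
    "m6i.8xlarge": (32,  1.536),
    "m6i.metal":   (128, 7.776),
}

per_vcpu = {name: rate / vcpus for name, (vcpus, rate) in prices.items()}
for name, cost in per_vcpu.items():
    print(f"{name:12s} ${cost:.5f}/vCPU-hr")

# The small and mid sizes are exactly linear ($0.048/vCPU-hr); at these
# prices the metal instance carries a ~27% premium over that baseline
# (linear pricing would put it near $6.14/hr instead of $7.776/hr).
premium = per_vcpu["m6i.metal"] / per_vcpu["m6i.xlarge"] - 1
print(f"metal premium over linear: {premium:.1%}")
```

That premium, compounded across a fleet, is the economic inflection point the callout above describes.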
The hard ceiling on vertical scaling: even the biggest AWS instance (x2idn.metal) has 128 cores and 4TB RAM. At some scale, there is no bigger machine. Google, Facebook, and Amazon hit this ceiling in the 2000s and built horizontal-first architectures as a result. If you're not at Google scale, you have a lot of vertical headroom.
When Vertical Scaling Is the Right Answer
Horizontal scaling means adding more machines and distributing work across them. It's the only approach that works at massive scale — but it requires your application to be designed for it. The central requirement is statelessness: each server must be able to handle any request independently, without depending on local state from a previous request.
The Stateless Requirement: Why It's Non-Negotiable
Imagine you have 3 servers and a user logs in on Server 1. Server 1 stores the session in memory: {user_id: 123, auth_token: "xyz"}. On the next request, the load balancer routes to Server 2. Server 2 has no session. The user sees a login page again. This is the session affinity problem — it breaks horizontal scaling. The fix: never store session state on application servers. Store it externally in Redis or JWT tokens that are self-contained.
Once your app is stateless, horizontal scaling works as follows: a load balancer receives all incoming requests and distributes them across N identical application servers. Any server can handle any request. You can add or remove servers at any time without affecting users. This is the foundation of every scalable web architecture.
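A minimal sketch of why externalized sessions make any server interchangeable. The plain dict stands in for a shared Redis instance, and the class and method names are invented for illustration:

```python
import uuid

# A plain dict stands in for Redis here; in production this would be a
# shared Redis instance reachable from every app server.
shared_session_store = {}

class AppServer:
    """A stateless app server: it keeps no per-user state of its own."""

    def login(self, user_id):
        session_id = str(uuid.uuid4())
        shared_session_store[session_id] = {"user_id": user_id}
        return session_id              # returned to the client as a cookie

    def handle_request(self, session_id):
        session = shared_session_store.get(session_id)
        if session is None:
            return "401 Unauthorized"
        return f"200 OK for user {session['user_id']}"

server1, server2 = AppServer(), AppServer()
cookie = server1.login(123)            # user logs in via Server 1
print(server2.handle_request(cookie))  # Server 2 serves the next request fine
```

If the session lived in a member variable on `server1` instead, the second call would return 401, which is exactly the session affinity problem described above.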
graph LR
C[Client] --> LB[Load Balancer]
LB --> S1[App Server 1
Stateless]
LB --> S2[App Server 2
Stateless]
LB --> S3[App Server 3
Stateless]
S1 --> Redis[(Redis
Session Store)]
S2 --> Redis
S3 --> Redis
S1 --> DB[(Database)]
S2 --> DB
S3 --> DB
note[Each server can handle
any request - no shared
in-memory state]
Horizontal scaling: stateless app servers + shared external state (Redis for sessions)
Making an Application Stateless: The Migration Path
1. Audit all session storage: find every place your app stores state in server memory. Common culprits: the Flask session object, PHP $_SESSION, in-memory JWT stores, WebSocket connection state.
2. Move sessions to Redis: replace server-side session stores with Redis (npm install express-session connect-redis, then configure the Redis store). Sessions now survive server restarts and work across multiple servers.
3. JWT for stateless auth: an alternative to Redis sessions. Encode user claims in a signed JWT token; the server validates the signature without looking anything up. Tradeoff: tokens can't be revoked until expiry, so store a revocation list in Redis if needed.
4. Externalize all caches: any in-memory cache (Python dict, Java HashMap) must move to Redis or be acceptable to lose. If a cache miss causes only a slightly slower response (not a crash), you can keep local caches as a first layer.
5. Move file uploads to object storage: a file on Server 1's disk is not accessible from Server 2, so any files stored locally must move to S3 (or equivalent). All file storage must be external.
6. Verify with a kill test: deploy your change, then kill one server while load testing. User requests should continue without errors; if they fail, you still have state on servers.
| State Type | Naive (Breaks Horizontal Scale) | Fixed (Stateless) |
|---|---|---|
| User sessions | In-memory dict on server | Redis with TTL, or JWT tokens |
| File uploads | Server local filesystem | S3 / GCS / object storage |
| In-process cache | Python dict / Java HashMap | Redis + local L1 cache (TTL < 30s) |
| WebSocket connections | Per-server connection map | Redis pub/sub for broadcast, sticky sessions for single-user connection |
| Background job state | Thread-local variables | Celery/RQ with Redis/SQS backend |
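The "Redis + local L1 cache" row deserves a sketch, since a short-TTL local cache is the one piece of per-server state that is safe to keep. Here a shared dict stands in for Redis, and the class name and TTL default are illustrative:

```python
import time

class TwoTierCache:
    """A small local L1 dict with a short TTL in front of a shared store.

    Losing the L1 on server restart is harmless: the next read falls
    through to the shared tier. The short TTL bounds staleness.
    """

    def __init__(self, shared, l1_ttl=30.0):
        self.shared = shared          # e.g. a Redis client in production
        self.l1 = {}                  # per-process cache: fast but lossy
        self.l1_ttl = l1_ttl

    def get(self, key):
        hit = self.l1.get(key)
        if hit is not None and time.monotonic() - hit[1] < self.l1_ttl:
            return hit[0]             # L1 hit: no network round trip
        value = self.shared.get(key)  # L2: shared across all servers
        if value is not None:
            self.l1[key] = (value, time.monotonic())
        return value

shared = {"user:123": "alice"}
cache = TwoTierCache(shared)
print(cache.get("user:123"))          # first call fills L1 from the shared tier
```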
Session management is the most common reason apps fail to scale horizontally, and it's entirely solvable. There are two patterns: external session stores (Redis) and stateless tokens (JWT). Understanding the trade-offs tells you when to use each.
The Redis session store approach: when a user logs in, generate a random session_id, store session data in Redis with a TTL, return the session_id as a cookie. On subsequent requests, look up session_id in Redis (0.5ms). Any server can serve any request because they all share the same Redis.
Redis Session Store: The Architecture
Login: POST /login → validate credentials → generate session_id (UUID) → Redis SET session:{id} {user_id, permissions, created_at} EX 86400 (24hr TTL) → return cookie session_id={id}. Request: GET /profile → read cookie → Redis GET session:{id} → if found: authenticated, load user_id → if not found: 401 Unauthorized. Logout: Redis DEL session:{id} → session immediately invalid on all servers.
The JWT (JSON Web Token) approach: encode user data directly in a signed token. The server validates the signature without any database or Redis lookup. Perfectly stateless — no external store needed.
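To make "validates the signature without any lookup" concrete, here is a minimal HS256 JWT built from the standard library alone. This is a sketch of the mechanism, not production code; use a vetted library such as PyJWT in a real system:

```python
import base64, hashlib, hmac, json, time

def _b64url(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def _unb64url(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def jwt_encode(claims, secret):
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def jwt_verify(token, secret):
    """Return the claims if the signature is valid and unexpired, else None."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None                       # tampered, or signed with another key
    claims = json.loads(_unb64url(payload))
    if claims.get("exp", float("inf")) < time.time():
        return None                       # expired
    return claims

token = jwt_encode({"user_id": 123, "exp": time.time() + 900}, b"secret")
print(jwt_verify(token, b"secret"))       # claims come back with zero lookups
```

Note what is missing: there is no store to consult, which is both the appeal (pure CPU verification) and the limitation (no way to revoke before `exp`).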
| Characteristic | Redis Sessions | JWT Tokens |
|---|---|---|
| Revocation | Immediate: DEL session:{id} | Cannot revoke until expiry (unless maintaining blocklist in Redis) |
| Latency | ~0.5ms Redis network round trip | No network hop; signature verified locally in CPU |
| Storage | Redis cluster storage cost | No server-side storage — token lives in client |
| Scalability | Scales with Redis cluster | Infinite — no shared state anywhere |
| Token size | ~36 bytes (UUID in cookie) | ~500-2000 bytes (JSON + signature in header) |
| User data freshness | Always current (read from Redis) | Stale until token expiry — role/permission changes lag |
| Best for | Apps needing immediate session invalidation (banking, healthcare) | Stateless APIs, microservices, mobile clients |
The Hybrid Pattern (Production Best Practice)
Most production systems use both: JWT for authentication (who are you?) + Redis for authorization/session state (what can you do right now?). JWT contains: user_id, issued_at, expiry (15 minutes). Redis contains: user permissions, feature flags, active session status. Short JWT expiry + Redis check = best of both worlds: stateless auth lookup, instant permission revocation, no stale permission bugs.
Auto-scaling is horizontal scaling on autopilot: the system automatically adds servers when traffic increases and removes them when it decreases. This sounds simple but has significant operational complexity. Wrong configuration is responsible for a disproportionate number of production outages.
There are two fundamental auto-scaling strategies: reactive and predictive.
Reactive vs. Predictive Auto-Scaling
The Cooldown Period Trap
Cooldown periods prevent thrashing: after scaling up, wait N minutes before scaling again. This prevents adding 1 server, waiting 60 seconds, adding another, waiting, etc. BUT: too long a cooldown can cause outages. If traffic spikes hard and your cooldown is 5 minutes, you can't scale fast enough. Slack's 2019 outage (in our war story) was caused partly by a 5-minute scale-in cooldown that removed too many servers when traffic dropped, then couldn't scale back fast enough when a traffic spike hit. Recommended: scale-out cooldown: 60-120 seconds. Scale-in cooldown: 5-10 minutes (asymmetric — be cautious about removing servers).
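The asymmetric policy recommended above fits in a few lines. This is a sketch of the decision logic, not a real AWS API; the constants are the recommended values from the callout:

```python
SCALE_OUT_COOLDOWN = 120   # seconds: add capacity aggressively
SCALE_IN_COOLDOWN = 600    # seconds: remove capacity cautiously

def may_scale(direction, now, last_scale_at):
    """Gate a scaling action on the cooldown for its direction."""
    cooldown = SCALE_OUT_COOLDOWN if direction == "out" else SCALE_IN_COOLDOWN
    return now - last_scale_at >= cooldown

print(may_scale("out", now=1130, last_scale_at=1000))  # True: 130s elapsed
print(may_scale("in",  now=1130, last_scale_at=1000))  # False: needs 600s
```

The asymmetry encodes the lesson from the Slack incident: being slow to add capacity causes outages, while being slow to remove it only costs a little money.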
AWS Auto Scaling Group Configuration Best Practices
1. Set minimum and maximum capacity: min=2 (for availability during maintenance), max=50 (cost protection). Never set max to unlimited; a traffic spike can run up a massive bill.
2. Choose the right scaling metric: CPU utilization is the most common but not always right. For I/O-bound apps, scale on network throughput or request count; for databases, on connection count; for queues, on queue depth (number of unprocessed messages).
3. Use target tracking scaling: instead of "scale when CPU > 70%", use "keep CPU at 60%." AWS continuously adjusts capacity to maintain the target, which works much better than threshold-based alarms.
4. Configure health checks properly: EC2 health checks alone are insufficient because they only check that the instance is running. Configure Application Load Balancer health checks to verify the app is actually serving traffic, and remove unhealthy instances from rotation immediately.
5. Use lifecycle hooks for warm-up time: new instances often need time to warm up (JVM JIT compilation, cache warming, establishing DB connections). Add a lifecycle hook to keep instances in a "pending" state for 60-120 seconds before adding them to the load balancer.
6. Use asymmetric cooldowns: scale-out cooldown = 120 seconds, scale-in cooldown = 600 seconds. Scaling in (removing capacity) should be conservative; scaling out (adding capacity to handle load) should be aggressive.
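Target tracking (item 3 above) is easy to reason about because the capacity math is simple: keep metric-per-instance at the target. The sketch below mirrors that idea; the real AWS service smooths over multiple datapoints, so treat this as an approximation:

```python
import math

def desired_capacity(current_capacity, current_metric, target_metric):
    """Approximate target-tracking math: hold the per-instance metric
    at the target by resizing the fleet proportionally."""
    return math.ceil(current_capacity * current_metric / target_metric)

# 10 instances averaging 84% CPU against a 60% target -> scale out to 14.
print(desired_capacity(10, 84, 60))
# 10 instances averaging 30% CPU against a 60% target -> scale in to 5.
print(desired_capacity(10, 30, 60))
```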
graph LR
M[CloudWatch Metrics
CPU, Req/sec, Queue Depth] --> A[Auto Scaling Policy
Target: CPU 60%]
A -->|CPU > 70%| ScaleOut[Launch new instance
Warm-up: 90 sec
Add to Load Balancer]
A -->|CPU < 40%| ScaleIn[Drain connections
300 sec cooldown
Terminate instance]
ScaleOut --> ALB[Application Load Balancer]
ScaleIn --> ALB
ALB --> S1[Server 1]
ALB --> S2[Server 2]
ALB --> SN[Server N]
AWS Auto Scaling Group: metrics trigger scale-out or scale-in with warm-up and cooldown periods
Everything we've discussed so far applies cleanly to application servers: make them stateless, put a load balancer in front, auto-scale. Databases are completely different. Databases are *inherently stateful* — they are the state. You can't just spin up 5 more Postgres instances and round-robin queries across them. At least, not without significant complexity.
There are three ways to scale databases horizontally, each with significant trade-offs:
Database Horizontal Scaling Options
The Premature Sharding Mistake
Sharding should be a last resort, not a first instinct. Engineers hear "100M users" and immediately say "we should shard." But Postgres on a large r6i.16xlarge (64 cores, 512 GB RAM) handles 50K-100K QPS with proper indexing. Stripe was at billions of dollars in payment volume before they sharded their primary database. Instagram reached 300M users before sharding. Start with read replicas + vertical scaling. Shard only when a single machine cannot handle write QPS or when storage exceeds 10-20TB.
graph TB
App[Application Servers] --> PrimaryDB[(Primary DB
All Writes)]
App -->|Read queries| R1[(Read Replica 1)]
App -->|Read queries| R2[(Read Replica 2)]
App -->|Read queries| R3[(Read Replica 3)]
PrimaryDB -->|Async replication
1-50ms lag| R1
PrimaryDB -->|Async replication| R2
PrimaryDB -->|Async replication| R3
note1[Read Replicas: Simple, handles 95% of scaling needs
But: replication lag, no write scaling]
Read replicas: the simplest DB horizontal scaling strategy — one primary for writes, N replicas for reads
The hard question in database scaling: when do you switch from read replicas to sharding? The signals are: (1) Write QPS exceeds primary DB capacity (~50K QPS for Postgres). (2) Single table exceeds 500GB (queries slow down significantly). (3) Replication lag consistently exceeds 1 second (replicas too far behind). At this point, sharding is necessary, and it's a massive engineering investment.
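The application-side change that read replicas demand is read/write splitting. A minimal router sketch (connection handles are plain strings here for illustration; in production they would be connection pools, and real routers must also pin transactions and read-your-own-writes reads to the primary):

```python
import itertools

class ReadWriteRouter:
    """Send writes to the primary; round-robin reads across replicas."""

    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE"}

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def connection_for(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.primary       # writes must see the primary
        return next(self._replicas)   # reads tolerate a little replication lag

router = ReadWriteRouter("primary", ["replica-1", "replica-2", "replica-3"])
print(router.connection_for("SELECT * FROM bookings"))    # replica-1
print(router.connection_for("INSERT INTO bookings ..."))  # primary
```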
The cost comparison between vertical and horizontal scaling is not as simple as "horizontal is always cheaper at scale." It depends heavily on your traffic pattern, the type of bottleneck, and the operational cost of managing distributed systems.
Let's run the numbers for a specific scenario: you have an API server currently handling 10K QPS on a single server. Traffic is expected to grow 10x to 100K QPS. What are your options?
| Option | Current Setup | Configuration for 100K QPS | Monthly Cost | Operational Complexity |
|---|---|---|---|---|
| Vertical Only | c6i.2xlarge ($0.34/hr) handles 10K QPS | c6i.32xlarge ($5.44/hr) handles ~120K QPS | $3,916/month | Low — same deployment, same config |
| Horizontal (App only) | 3× c6i.xlarge ($0.17/hr each) + ALB | 30× c6i.xlarge + ALB | $3,671 + $40 ALB = $3,711/month | Medium — need load balancer, auto-scaling config, stateless app |
| Horizontal + Managed DB | RDS r6g.2xlarge ($0.63/hr) | RDS r6g.2xlarge + 4 read replicas ($0.63 × 5) | $2,268 DB + $3,711 app = $5,979/month | Medium-High — read replica routing, replication monitoring |
| Full Horizontal (Sharded) | Single RDS | 4 DB shards × r6g.large ($0.32/hr each) | ~$8,000+ total | Very High — shard routing logic, cross-shard queries, rebalancing |
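The monthly figures in the table fall out of simple arithmetic, assuming roughly 720 billed hours per month; small rounding differences from the table are expected:

```python
HOURS = 720  # ~hours per month used for the table's estimates

vertical = 5.44 * HOURS                        # one c6i.32xlarge
horizontal = 30 * 0.17 * HOURS + 40            # 30x c6i.xlarge + ALB
with_replicas = horizontal + 5 * 0.63 * HOURS  # + RDS primary and 4 replicas

print(f"vertical only:     ${vertical:,.0f}/month")
print(f"horizontal app:    ${horizontal:,.0f}/month")
print(f"+ managed DB tier: ${with_replicas:,.0f}/month")
```

Note how close the first two options are in raw AWS cost: at this scale the decision hinges on the operational-complexity column, not the bill.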
The Hidden Cost: Engineering Time
AWS cost is only part of the equation. Moving from vertical to horizontal requires: 1-2 weeks of engineering to make app stateless, 1 week for load balancer setup and deployment pipeline changes, 2-3 weeks for read replica setup and query routing, 3-6 months for sharding implementation. At $200K/year average engineer salary, 1 month of engineering time ≈ $16,667. A startup that spends 2 months on horizontal scaling to save $500/month on AWS costs needs 33 months to break even. Vertical scaling first is often the right financial decision.
The right answer depends on your situation. Here's the decision framework:
The Scaling Decision Framework
1. Is your bottleneck CPU, RAM, or I/O? CPU-bound: vertical CPU upgrade or horizontal scaling. RAM-bound: vertical RAM upgrade (cheapest). I/O-bound: SSD instance types or read replicas.
2. How much runway does vertical give you? If the next vertical tier buys 12+ months, take it. One engineer-week is worth avoiding 6 months of distributed-system complexity.
3. Is your app stateless? If not, can you make it stateless within 1-2 weeks? If the stateless migration takes more than a month, scale vertically first while the team prepares.
4. Is the database your bottleneck? If yes: read replicas first (3-5 days to set up), not horizontal app scaling; the DB is the constraint.
5. Do you have unpredictable traffic spikes? Horizontal with auto-scaling handles spikes; vertical doesn't. If you peak at 3x your baseline, you must provision for peak even during off-hours.
6. What's your ops team size? 1-2 engineers: prefer managed services (RDS, ElastiCache) plus vertical scaling. 5+ engineers: horizontal app scaling becomes operationally sustainable.
Horizontal scaling of a monolith scales everything together: your API servers, your background job processors, your authentication service all scale as one unit. If your image resize service is the bottleneck, you scale the entire monolith — even the parts that aren't bottlenecked. This wastes money.
Microservices solve this by allowing each service to scale independently. The image resize service gets 50 workers. The user profile service runs on 3. The payment service stays at 2 (and runs on dedicated high-security instances). You pay only for what each service actually needs.
When Microservices Scaling Makes Economic Sense
Microservices scaling is worth the overhead when: different components have wildly different scaling needs (10:1 ratio between most and least loaded), components need different hardware types (GPU for ML inference, high memory for caching, high CPU for encoding), components need different security boundaries (payment processing isolated from analytics), or different teams own different components with independent deployment cycles.
graph TB
LB[Load Balancer] --> A1[API Gateway
3 instances]
A1 --> U[User Service
5 instances
Low CPU]
A1 --> I[Image Resize
50 instances
High CPU, GPU]
A1 --> P[Payment Service
2 instances
Isolated network]
A1 --> S[Search Service
10 instances
High RAM]
I --> Q[Job Queue
SQS]
Q --> W[Worker Pool
20-200 instances
Auto-scales on queue depth]
Microservices enable independent scaling: each service scales to exactly what it needs
The trade-off: microservices horizontal scaling is dramatically more complex than monolith horizontal scaling. A monolith with 10 servers is operationally simple. 20 microservices each with 2-50 instances is 40-1000 running processes, each with their own load balancer, health checks, deployment pipeline, service discovery, and distributed tracing. The operational overhead is proportional to service count, not traffic.
The Microservices Premature Optimization Trap
Many startups adopt microservices before they have a scaling problem, introducing massive complexity with no benefit. Netflix, Uber, Amazon all built monoliths first and migrated to microservices when they had specific, proven scaling needs — and hundreds of engineers. For a team of < 20 engineers, a well-structured monolith with horizontal scaling at the app tier (and read replicas for the DB) is almost always the right architecture. Microservices are a solution to an org-scale problem as much as a technical scaling problem.
After covering all the scaling strategies, let's build a practical decision framework. When traffic grows and your system starts struggling, here is the exact order of steps to consider — from simplest to most complex.
The Scaling Decision Order (Simplest First)
1. Profile first: never scale blindly. Use APM tools (New Relic, Datadog, AWS X-Ray) to find the actual bottleneck. Is it a DB query? The network? The CPU? Scaling the wrong thing wastes money and time.
2. Optimize before scaling: a slow DB query with a missing index can cause 10x unnecessary DB load. Adding a single index often "scales" a system more than adding 5 servers; query optimization, connection pooling, and caching are almost always faster ROI than infrastructure changes.
3. Vertically scale the bottleneck: double the RAM/CPU of the bottlenecked component. Typically buys 3-12 months of runway with zero code changes.
4. Add caching: Redis in front of the DB reduces read load by 80-95%. One engineer-week of work, and often makes vertical scaling sufficient for another 12-18 months.
5. Add read replicas to the DB: route read queries to replicas. Scales read capacity by N×, where N is the number of replicas; AWS RDS supports up to 15 read replicas.
6. Horizontally scale the application tier: make the app stateless, add a load balancer, configure auto-scaling. 1-2 weeks of engineering; scales application servers indefinitely.
7. Vertically partition the database: split by functional domain (user_db, product_db). Reduces load per database with no sharding complexity yet.
8. Horizontally shard the database: last resort. Months of engineering and permanent architectural complexity; only when a single DB primary cannot handle the write QPS or data volume.
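Step 4 in the order above is worth a sketch, because cache-aside is the pattern that usually defers every later step. The dict-based `db` and `cache` are stand-ins for Postgres and Redis:

```python
class CacheAside:
    """Cache-aside reads: check the cache first, fall back to the DB,
    then populate the cache so later reads skip the DB entirely."""

    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.db_reads = 0   # counter to show how much DB load we shed

    def get(self, key):
        if key in self.cache:
            return self.cache[key]    # cache hit: no DB load
        self.db_reads += 1            # cache miss: one DB read, then fill
        value = self.db.get(key)
        self.cache[key] = value
        return value

store = CacheAside({"product:1": "widget"})
for _ in range(100):
    store.get("product:1")
print(store.db_reads)  # 1 -> 99 of 100 reads never reached the "database"
```

A real implementation also needs a TTL and an invalidation strategy on writes, which is where most cache bugs live.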
graph TD
Problem[System Under Load] --> Profile[Step 1: Profile
Find the bottleneck]
Profile --> Optimize[Step 2: Optimize
Index queries, fix N+1]
Optimize --> Vertical[Step 3: Vertical Scale
Bigger instance type]
Vertical --> Cache[Step 4: Add Caching
Redis in front of DB]
Cache --> Replicas[Step 5: Read Replicas
Scale read QPS]
Replicas --> HorizontalApp[Step 6: Horizontal App
Stateless + Load Balancer]
HorizontalApp --> VPartition[Step 7: Vertical Partition DB
Split by domain]
VPartition --> Shard[Step 8: Shard DB
Last resort]
style Shard fill:#ff6b6b,color:#fff
style Profile fill:#51cf66,color:#fff
style Optimize fill:#51cf66,color:#fff
Scaling decision order: always start with the simplest solution. Sharding is the nuclear option.
The Slack Approach: Scale What's Hot
Slack engineering published a detailed blog post on their scaling philosophy: "Scale the right component by the right method at the right time." Their API tier is horizontally scaled with auto-scaling groups. Their message fanout (delivering a message to all workspace members) is handled by a dedicated horizontally-scaled fanout service. Their database is vertically scaled with read replicas — no sharding. This selective approach means their complexity is proportional to their actual scaling needs, not aspirational ones.
The most important scaling principle: delay scaling complexity as long as possible. Every scaling mechanism adds operational complexity, increases deployment risk, and requires ongoing maintenance. A system that scales to 10M users on a single well-tuned Postgres with read replicas is better than a 20-microservice system that handles 1M users with constant operational incidents. Scale to the right level at the right time.
Scaling appears in almost every system design interview because every system at scale needs it. Interviewers test: (1) Do you know when vertical vs. horizontal is appropriate? (2) Do you understand the stateless requirement? (3) Can you explain database scaling (read replicas vs. sharding)? (4) Do you mention auto-scaling for variable traffic? The common trap is candidates who immediately jump to sharding for any scaling question — this signals lack of proportionality in engineering judgment.
Common questions:
Key takeaways
Your e-commerce site currently runs on a single r6i.4xlarge (16 cores, 128 GB RAM). Traffic has doubled and CPU is averaging 65%. Your next Black Friday is in 6 weeks. What do you do?
First, profile to confirm CPU is the actual bottleneck (not network or DB). Then, immediately vertical scale to r6i.8xlarge or r6i.16xlarge — zero downtime on RDS, minimal on EC2. This buys time. Simultaneously, check the DB: if DB CPU is below 40%, the app server is the bottleneck and vertical + read replicas should handle Black Friday. In parallel over the next 3 weeks: make the app stateless (move sessions to Redis if not already), configure an auto-scaling group with 3x minimum capacity for Black Friday week, and run a load test at 5x normal traffic to verify. Do NOT attempt to implement sharding in 6 weeks — that is a multi-month project that would create more risk than it solves before a major traffic event.
Your team is debating whether to shard your PostgreSQL database or add more read replicas. You're at 80K read QPS and 8K write QPS, 3TB of data. What do you recommend?
At 80K read QPS and 8K write QPS, sharding is not justified yet. A well-configured Postgres primary (r6g.8xlarge: 32 cores, 256 GB RAM) handles 30-50K write QPS with proper indexing. You're well within write capacity. For reads: 80K QPS across 5 read replicas = 16K QPS per replica, which is very manageable. The correct path: add read replicas (4-6 total). Implement connection pooling (PgBouncer) if not already present. Add Redis caching for hot read patterns. Sharding makes sense if write QPS exceeds 50K (you're at 8K — 6x away) or data exceeds 10-20TB (you're at 3TB). Current architecture with 4 read replicas should handle 3-5x current load. Revisit sharding at 50K write QPS or 15TB data.
An interviewer asks: "Design a system that needs to handle 500K QPS at peak but only 50K QPS at average. How do you scale it?" What is your approach and why?
This is exactly the use case for auto-scaling horizontal infrastructure, not vertical. The 10:1 peak-to-average ratio means vertical scaling (or fixed horizontal) would require provisioning for 500K QPS 24/7 — paying for 10x the average capacity. Auto-scaling with horizontal app servers solves this: at 50K QPS, run 5 servers. As traffic rises toward 500K, auto-scale out to 50 servers. Scale back in after the spike. Specific implementation: Application Load Balancer + EC2 Auto Scaling Group with target-tracking policy (target: 60% CPU). Scale-out cooldown: 90 seconds (aggressive — spikes must be handled fast). Scale-in cooldown: 600 seconds (conservative — don't remove servers mid-spike). Minimum: 5 instances. Maximum: 60 instances (20% headroom above peak). Database: the DB is the harder problem. 500K app QPS likely generates 50K-200K DB QPS. Add a Redis cache (target: 90% hit rate) to reduce DB QPS to 5K-20K — manageable with 1 primary + 3 read replicas.
💡 Analogy
Vertical scaling is like widening a road — adding more lanes to the same highway. It works until you hit the physical boundary of the road width (the machine's hardware limit). Horizontal scaling is like building parallel highways — you can always build another one. But parallel highways need traffic routing (load balancers), consistent road conditions (stateless app servers), and synchronized signals (distributed coordination). The choice isn't which is better in the abstract — it's which is right for your traffic level and team capacity right now.
⚡ Core Idea
The correct scaling strategy depends on where the bottleneck is and what your current scale requires. Most systems follow a progression: optimize → vertical → cache → read replicas → horizontal app → vertical DB partition → shard. Skipping steps wastes engineering time. The goal is to always be at the simplest architecture that handles your current + near-term scale.
🎯 Why It Matters
Choosing the wrong scaling strategy wastes money (over-engineering a system for 100x its actual traffic) or causes outages (under-engineering a system that's about to hit a wall). Scaling decisions also shape your entire team structure: horizontal microservices requires 10x the ops investment of a well-tuned monolith. Every scaling architecture is a bet on where your bottleneck will be at 10x current scale.