Interactive Explainer

Scaling AI Systems

Architecture Decisions for 10x, 100x, and 1000x Growth

🎯 Key Takeaways
KV cache memory is the binding scaling constraint for LLM systems — monitor utilization, not just GPU compute
Scale on leading indicators (queue depth) not lagging ones (GPU utilization) — by the time GPU is at 80%, users are already waiting
Synchronous request handling is an anti-pattern above 50 concurrent users — implement streaming and async queues
The hybrid architecture (self-hosted baseline + cloud API burst) is the correct scaling pattern for 100K-1M users
Pre-provision capacity before planned traffic spikes — reactive autoscaling cannot respond fast enough for GPU provisioning
Always load-test the fallback path — the path that saves you in an incident is the one most likely to be misconfigured

~13 min read
Why this matters

An AI system that works for 1,000 users often catastrophically fails at 100,000. The failure modes are different from scaling traditional web applications — LLMs are memory-hungry, stateful, and expensive in ways that break naive horizontal scaling.

Without this knowledge

Your LLM API costs grow super-linearly with users. The system falls over at peak traffic because GPU memory is exhausted. Response latency degrades from 800ms to 8 seconds as the request queue backs up. You cannot scale fast enough to meet demand.

With this knowledge

Your system scales from 1,000 to 1,000,000 users with predictable cost growth. GPU resources are utilized at 85%+ during peak hours and scale down during off-peak. P95 latency stays below 1 second at any traffic level through intelligent request routing and caching.


The Scene: The Launch That Broke Everything

A consumer AI app was featured on Product Hunt. Traffic went from 2,000 daily active users to 40,000 overnight. The engineering team had stress-tested to 5,000 concurrent users. At 8,000 concurrent users, the system started queuing requests. At 12,000, GPU memory was exhausted — the Llama-3-8B model running on 4x A100s ran out of KV cache space. New requests were being rejected with 503 errors.

The fallback: switch all traffic to the OpenAI API while the team scaled up GPU capacity. Cost: $0.03 per request × 40,000 users × 3 requests per session = $3,600 that day. Acceptable as an emergency. Three months later, still on OpenAI, still paying $3,600/day: $108,000/month they had not budgeted for.

The two compounding failures: (1) no autoscaling for GPU capacity — the cluster was fixed at 4x A100s, could not expand dynamically. (2) no request queue with backpressure — at 12,000 concurrent users, requests were rejected immediately rather than queued and served when capacity freed up. Both are solvable infrastructure problems that require designing for scale from the beginning.

Scaling AI systems is not the same as scaling web servers. You cannot just add more instances — GPU availability has minutes-long provisioning time (vs. seconds for CPU instances). KV cache memory is a hard constraint per request, not just CPU cycles. The architecture decisions you make at 1,000 users determine whether you survive 100,000.

Why Scaling AI Is Different From Scaling Web Apps

Most engineers have intuitions about scaling built from web application experience. These intuitions fail in important ways when applied to LLM systems.

Wrong Intuition 1: Just add more instances. Web servers: add an instance in 30 seconds. GPU instances: 2-10 minutes to provision and warm up. At 12,000 concurrent users during a traffic spike, you need GPU capacity available *before* the spike, not reactive provisioning during it. AI systems need predictive autoscaling (scale before demand, based on leading indicators), not reactive autoscaling (scale after you're already bottlenecked).

Wrong Intuition 2: Memory usage is approximately constant. Web servers use roughly the same memory per request. LLMs hold KV cache memory proportional to sequence length for every *active* request. At 100 concurrent 2,000-token conversations, an 8B-class model holds roughly 100 × 2,000 tokens × ~128 KB of cache per token ≈ 26 GB of VRAM in KV cache alone. When the KV cache is full, new requests cannot start. Memory management is a first-class scaling concern that does not exist in traditional web architecture.
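The arithmetic is worth doing explicitly. A back-of-envelope sketch, assuming a Llama-3-8B-like shape (32 layers, 8 grouped KV heads, head dimension 128, fp16) with no quantization or cache paging:

```python
# Back-of-envelope KV cache sizing for an 8B-class model.
# Assumed architecture numbers (Llama-3-8B): 32 layers, 8 KV heads
# (grouped-query attention), head_dim 128, fp16 (2 bytes per value).

def kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Both K and V are cached, per layer, per KV head, per head-dim element.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def total_kv_cache_gb(concurrent_requests, tokens_per_request, **kwargs):
    per_token = kv_cache_bytes_per_token(**kwargs)
    return concurrent_requests * tokens_per_request * per_token / 1e9

print(f"{kv_cache_bytes_per_token() / 1024:.0f} KB per token")  # 128 KB per token
print(f"{total_kv_cache_gb(100, 2000):.1f} GB")                 # 26.2 GB
```

That 26 GB is cache alone, on top of the ~16 GB of fp16 model weights — which is why 100 long conversations can exhaust an 80 GB A100 pair long before GPU compute saturates.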

Wrong Intuition 3: Queuing requests is free. Adding a request queue decouples your web tier from your processing tier — for web workloads, this is usually a straightforward win. For AI, long queues create a user experience problem: a user who submits a request and waits 45 seconds for a response will give up and reload. AI queues need smart backpressure: reject requests early (a fast 429) when queue depth exceeds a threshold, rather than silently queuing thousands of requests that will eventually time out.

Wrong Intuition 4: Horizontal scaling solves everything. More web servers = more capacity. More GPU servers = more capacity, but also more cost and complexity. At some point, the right answer is not "more GPUs" but "more efficient use of GPUs" — through quantization, continuous batching, and caching. The sweet spot for AI systems is usually a mix: optimized serving on self-hosted GPUs for baseline load, cloud GPU API for burst capacity, and semantic caching to reduce the load that actually reaches GPUs.

How Production AI Scaling Actually Works: Three Levels

Scaling AI systems requires infrastructure patterns that differ fundamentally from traditional web scaling.

request_queue.py

import asyncio
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable

# Level 1: Request queue with backpressure — prevents overload by
# controlling request flow into the LLM backend, without dropping
# requests silently.

@dataclass
class QueuedRequest:
    request_id: str
    query: str
    submitted_at: float = field(default_factory=time.time)
    future: asyncio.Future = field(default_factory=asyncio.Future)

class BackpressureQueue:
    def __init__(self, max_queue_size: int = 500, max_wait_seconds: float = 30.0):
        # max_queue_size=500: beyond this depth, reject new requests fast
        # (429) instead of silently queuing them into an eventual timeout.
        # max_wait_seconds=30: a request that waits longer than this should
        # fail fast — the user has almost certainly given up already.
        self.queue: deque[QueuedRequest] = deque()
        self.max_size = max_queue_size
        self.max_wait = max_wait_seconds
        self.processing_count = 0

    async def submit(self, request_id: str, query: str) -> str:
        """Submit a request. Raises immediately if the queue is full —
        fast rejection is better UX than a silent 45-second timeout,
        because the user knows to retry."""
        if len(self.queue) >= self.max_size:
            raise ValueError(
                f"Service at capacity. Queue depth: {len(self.queue)}/{self.max_size}"
            )

        req = QueuedRequest(request_id=request_id, query=query)
        self.queue.append(req)

        try:
            return await asyncio.wait_for(req.future, timeout=self.max_wait)
        except asyncio.TimeoutError:
            if req in self.queue:
                self.queue.remove(req)
            raise TimeoutError(
                f"Request {request_id} timed out after {self.max_wait}s in queue"
            )

    async def process_queue(self, handler: Callable):
        """Worker loop that dispatches queued requests to the LLM backend."""
        while True:
            if self.queue and self.processing_count < self._max_concurrent():
                req = self.queue.popleft()
                # Expire stale requests: if a request already waited max_wait
                # in the queue, discard it rather than burn GPU time on it.
                if time.time() - req.submitted_at > self.max_wait:
                    req.future.set_exception(TimeoutError("Expired in queue"))
                    continue
                # Dispatch as a task — awaiting the handler inline here would
                # serialize processing to one request at a time.
                asyncio.create_task(self._run(handler, req))
            else:
                await asyncio.sleep(0.01)

    async def _run(self, handler: Callable, req: QueuedRequest):
        self.processing_count += 1
        try:
            req.future.set_result(await handler(req.query))
        except Exception as exc:
            req.future.set_exception(exc)
        finally:
            self.processing_count -= 1

    def _max_concurrent(self) -> int:
        return 20  # Max concurrent LLM requests per serving instance
autoscaling.py

import time
from dataclasses import dataclass
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Level 2: Predictive autoscaling for GPU instances — scale on leading
# indicators (queue depth, request-rate trend), NOT just GPU utilization.
# A lagging metric is too late for LLM systems: by the time GPU util is
# high, users are already waiting, and provisioning takes minutes.

@dataclass
class ScalingPolicy:
    # Scale up at queue depth 100, NOT when the GPU is maxed — by then
    # users are already waiting in line.
    scale_up_queue_depth: int = 100
    scale_up_gpu_utilization: float = 0.80   # Also scale up at 80% GPU util
    scale_down_utilization: float = 0.30     # Scale down below 30% GPU util
    min_instances: int = 2
    max_instances: int = 20
    # cooldown_minutes=5: prevent oscillation — give new instances time
    # to come online before re-evaluating.
    cooldown_minutes: int = 5

class GPUAutoscaler:
    def __init__(self, asg_name: str, policy: ScalingPolicy):
        self.asg = asg_name
        self.policy = policy
        self.last_scale_action = 0.0

    def get_current_metrics(self) -> dict:
        """Get queue depth and GPU utilization from CloudWatch."""
        now = datetime.utcnow()
        response = cloudwatch.get_metric_statistics(
            Namespace="AI/Inference",
            MetricName="QueueDepth",
            Period=60,
            Statistics=["Average"],
            StartTime=now - timedelta(minutes=2),
            EndTime=now,
        )
        datapoints = response["Datapoints"]
        queue_depth = datapoints[0]["Average"] if datapoints else 0
        return {"queue_depth": queue_depth,
                "gpu_utilization": self._get_gpu_utilization()}

    def evaluate(self):
        """Evaluate scaling need and take action, unless still in cooldown."""
        if time.time() - self.last_scale_action < self.policy.cooldown_minutes * 60:
            return  # Still in cooldown period

        metrics = self.get_current_metrics()
        current_count = self._get_instance_count()

        # Scale UP by +2: aggressive, to clear the queue rapidly.
        # Scale DOWN by -1: conservative, to avoid thrashing.
        if (metrics["queue_depth"] > self.policy.scale_up_queue_depth or
                metrics["gpu_utilization"] > self.policy.scale_up_gpu_utilization):
            if current_count < self.policy.max_instances:
                new_count = min(current_count + 2, self.policy.max_instances)
                self._set_desired_capacity(new_count)
                self.last_scale_action = time.time()
                print(f"Scaled UP: {current_count} → {new_count} instances")

        elif metrics["gpu_utilization"] < self.policy.scale_down_utilization:
            if current_count > self.policy.min_instances:
                new_count = max(current_count - 1, self.policy.min_instances)
                self._set_desired_capacity(new_count)
                self.last_scale_action = time.time()
                print(f"Scaled DOWN: {current_count} → {new_count} instances")

    # _get_gpu_utilization, _get_instance_count, and _set_desired_capacity
    # (CloudWatch/Auto Scaling group plumbing) are elided here.
hybrid_routing.py

import asyncio
from enum import Enum

import httpx
from openai import OpenAI

client_openai = OpenAI()

# Level 3: Hybrid routing — self-hosted baseline + cloud API burst.
# Route traffic between the self-hosted GPU cluster and the API based on
# current capacity. Cost comparison: self-hosted ~$0.001/request vs.
# API ~$0.015/request — ~15x more expensive, so the API handles overflow only.

class RoutingTarget(Enum):
    SELF_HOSTED = "self_hosted"
    OPENAI_API = "openai_api"

class HybridInferenceRouter:
    """
    Routing strategy:
    - Self-hosted: baseline load (cost-efficient, ~$0.001/request)
    - OpenAI API: burst overflow (flexible, ~$0.015/request)
    Decision: route to the API when self-hosted queue depth or GPU
    utilization exceeds a threshold — i.e. BEFORE the cluster is fully
    maxed out (85% utilization), to prevent user-visible degradation.
    """

    def __init__(
        self,
        self_hosted_url: str = "http://vllm-cluster:8000",
        burst_threshold_queue_depth: int = 200,
        burst_threshold_gpu_util: float = 0.85,
    ):
        self.self_hosted_url = self_hosted_url
        self.burst_threshold_queue = burst_threshold_queue_depth
        self.burst_threshold_gpu = burst_threshold_gpu_util

    async def route(self, prompt: str) -> tuple[str, RoutingTarget]:
        """Route the request to the optimal target based on current capacity."""
        metrics = await self._get_cluster_metrics()

        if (metrics["queue_depth"] > self.burst_threshold_queue or
                metrics["gpu_utilization"] > self.burst_threshold_gpu):
            # Self-hosted cluster is saturated — overflow to the API
            return await self._call_api(prompt), RoutingTarget.OPENAI_API
        return await self._call_self_hosted(prompt), RoutingTarget.SELF_HOSTED

    async def _call_self_hosted(self, prompt: str) -> str:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.self_hosted_url}/v1/completions",
                json={"prompt": prompt, "max_tokens": 512},
                timeout=30.0,
            )
            return response.json()["choices"][0]["text"]

    async def _call_api(self, prompt: str) -> str:
        # gpt-4o-mini for burst overflow — reserve gpt-4o for cases where
        # quality is critical. The OpenAI client call is synchronous, so
        # run it in a thread to avoid blocking the event loop.
        response = await asyncio.to_thread(
            client_openai.chat.completions.create,
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # _get_cluster_metrics (queue depth + GPU utilization from the serving
    # cluster's metrics endpoint) is elided here.

Three Scale Levels, Three Architecture Decisions

Scaling AI systems has distinct architectural phases, each with its own dominant challenge.

Scale Level 1: 1K → 10K users (Optimize Before Scaling)

At this scale, the right answer is almost never "more hardware." Implement semantic caching (eliminate redundant queries), model routing (use cheaper models for simple tasks), and prompt compression (reduce token costs). A well-optimized system for 1K users can often handle 5-10K with no additional infrastructure. Profile first: where is the actual bottleneck? If it is GPU memory, optimize KV cache management. If it is API costs, add caching and routing. If it is latency, add streaming. Hardware is the last resort at this scale.

Scale Level 2: 10K → 100K users (Queue and Autoscale)

At 10K+ users, traffic spikes become existential threats. The key infrastructure: a request queue with backpressure (rejects requests fast when at capacity rather than queueing indefinitely), GPU autoscaling based on queue depth (not just CPU utilization), and a CDN layer for streaming responses (offloads streaming connection management from your serving cluster). The architecture is: load balancer → request queue → autoscaling GPU cluster → response streaming. This architecture survives 10x traffic spikes because it rejects gracefully at saturation rather than collapsing.

Scale Level 3: 100K → 1M users (Hybrid and Specialize)

At this scale, no single serving setup is optimal. The architecture becomes hybrid: self-hosted GPU cluster for the high-volume, well-understood request types (fine-tuned for your specific workload), cloud API for burst overflow and novel query types, and specialized fast-path serving for real-time features (smaller model, dedicated cluster, strict SLA). The cost optimization at this scale is through specialization: run the cheapest model that handles each specific query type, reserve the most expensive model for the cases where quality demonstrably matters.
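The "cheapest model that handles each query type" idea reduces to a routing table plus a classifier. The tier names, cost numbers, and keyword heuristic below are all illustrative placeholders — a production router would be a trained classifier or a cheap LLM call, not string matching.

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float  # illustrative numbers, not real pricing

# Hypothetical tiers: the cheapest model that handles each query class.
TIERS = {
    "simple": ModelTier("small-self-hosted-8b", 0.0002),
    "standard": ModelTier("medium-self-hosted-70b", 0.002),
    "complex": ModelTier("frontier-api", 0.015),
}

def classify(query: str) -> str:
    """Placeholder classifier: length + keyword heuristics only.
    Real systems train a router on labeled query/outcome data."""
    if any(k in query.lower() for k in ("prove", "analyze", "multi-step")):
        return "complex"
    return "simple" if len(query.split()) < 12 else "standard"

def route(query: str) -> ModelTier:
    return TIERS[classify(query)]
```

Even a crude router pays for itself quickly: if 70% of traffic is "simple," that traffic costs 75x less per token than sending everything to the frontier tier.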

The Scaling Traps That Break Systems at 10x

Three architectural decisions that work at 1,000 users and fail catastrophically at 10,000.

How Principals Think About Scale Architecture

Senior engineers ask: "How do we handle 10x the traffic?" Principal engineers ask: "What is the cost-per-user trajectory, and what architectural decisions preserve unit economics as we scale?"

The principal insight is that scaling AI is a unit economics problem. The question is not just "can our system handle 1M users" but "at what per-user cost, and does the business model work at that cost?" An AI system serving 1M users at $0.02/user/day costs $20K/day — $600K/month. Whether that is sustainable depends on revenue per user. The architect's job is to design the system so cost scales sub-linearly with users through caching, routing, and optimization.
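That sub-linear cost curve can be made concrete with a blended-cost calculation. All the rates below (requests per user, cache hit rate, routing split, per-request costs) are assumptions for illustration:

```python
def daily_cost_per_user(requests_per_user=3, cache_hit_rate=0.0,
                        cheap_fraction=0.0, cheap_cost=0.001,
                        expensive_cost=0.015):
    """Blended per-user daily cost. Cache hits are treated as free;
    the rest split between a cheap and an expensive serving path."""
    served = requests_per_user * (1 - cache_hit_rate)
    blended = cheap_fraction * cheap_cost + (1 - cheap_fraction) * expensive_cost
    return served * blended

naive = daily_cost_per_user()  # everything hits the expensive path
optimized = daily_cost_per_user(cache_hit_rate=0.3, cheap_fraction=0.8)
print(f"naive: ${naive:.4f}/user/day, optimized: ${optimized:.4f}/user/day")
# → naive: $0.0450/user/day, optimized: $0.0080/user/day
```

At 1M users that is the difference between $45K/day and $8K/day — same traffic, different architecture. The caching and routing work is what bends the cost curve, not the user count.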

The frontier in 2025: mixture-of-experts (MoE) models at scale — models like Mixtral 8x7B that activate only 2 of the 8 experts in each layer per token, providing near-70B quality at ~13B active-parameter cost. MoE models fundamentally change the scaling math: you can serve near-state-of-the-art quality with much lower per-token compute. The infrastructure challenge is that MoE models require all expert weights in VRAM even though only 2 per layer are active for any given token — requiring specialized memory management for efficient serving.
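The scaling math behind that claim, using Mixtral's published shape (figures are approximate):

```python
# Mixtral-style MoE: 8 experts per layer exist, but top-2 routing means
# only 2 run for any given token.
total_params_b = 46.7   # approx. total parameters for Mixtral 8x7B
active_params_b = 12.9  # approx. parameters exercised per token

compute_fraction = active_params_b / total_params_b
print(f"compute per token: {compute_fraction:.0%} of an equally sized dense model")
# VRAM, however, must hold ALL experts: serving memory scales with
# total_params_b — MoE cuts compute cost, not memory footprint.
```

This is why MoE serving is a memory-management problem: per-token FLOPs drop to roughly a quarter, but the full ~47B parameters must still be resident (or cleverly paged) on the serving GPUs.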

The five-year arc: AI serving infrastructure will bifurcate into two tiers — cloud API (fully managed, premium pricing, maximum flexibility) and specialized hardware (custom ASICs, Groq-style inference chips, 10-100x more efficient than GPUs for specific model architectures). Engineers who understand both the algorithmic efficiency (which model for which task) and the hardware efficiency (which serving hardware for which traffic pattern) will design the systems that sustain AI businesses at scale.

How this might come up in interviews

Common questions:

  • How do you scale an LLM serving system from 1,000 to 1,000,000 users?
  • What is the difference between scaling AI systems and scaling traditional web applications?
  • What causes an LLM system to fail under sudden traffic spikes?
  • When does self-hosting a model become cheaper than using an API?
  • Explain PagedAttention and why it matters for scaling.
  • How do you handle traffic spikes when GPU autoscaling takes 5+ minutes?

Strong answer:

  • Analyzes the actual bottleneck (is it KV cache, compute, network, or cost?) before recommending solutions
  • Knows the scaling staircase: optimize → autoscale → hybrid → multi-region
  • Understands why queue depth is a better autoscaling trigger than GPU utilization
  • Has designed or operated a hybrid serving architecture and can discuss the cost-benefit analysis

Red flags:

  • Recommends simply adding more GPUs as the answer to scaling challenges without analyzing the actual bottleneck
  • No awareness of KV cache memory as a scaling constraint (focuses only on GPU compute)
  • Uses GPU utilization as the autoscaling trigger without knowing about the lag problem
  • No concept of backpressure — believes queuing all requests is always better than rejecting some

Scenario · A B2B AI startup, currently 5,000 users, targeting 500,000 in 18 months

You Are the Principal Engineer. Design for 100x Growth.

Your AI analyst assistant runs on a fixed 4x A100 cluster. At current load, GPU utilization is 45%. You have been told to design the infrastructure for 100x growth without 100x infrastructure cost.

You analyze your current 4x A100 cluster at 45% average GPU utilization. Peak utilization hits 92% for 2 hours each day (9-11 AM EST when US users are most active). During those 2 hours, P95 latency spikes from 800ms to 4.2 seconds. The rest of the day, the cluster is underutilized.

What do you address first?

Quick check · Scaling AI Systems

Your AI system handles 5,000 concurrent users smoothly. At 8,000 concurrent users (a Product Hunt spike), the system fails. GPU utilization is at 95%. What is the most likely root cause?

Before you move on: can you answer these?

Why is KV cache memory exhaustion the most common scaling failure mode for LLM systems?

Each concurrent request holds KV cache proportional to its sequence length. As concurrent users grow, total KV cache demand grows. When VRAM is full, no new requests can start regardless of GPU compute availability. This is fundamentally different from traditional web scaling where memory per request is small and relatively constant.

What is the difference between reactive and predictive autoscaling for AI systems, and why does it matter?

Reactive: scale when a metric (GPU util) exceeds a threshold. Problem: GPU provisioning takes 2-10 minutes, so users experience degradation during the scale-up window. Predictive: scale on leading indicators (queue depth, traffic rate trends) before degradation occurs, and pre-provision before known spikes. AI needs predictive because GPU cold-start time is too long for reactive scaling to prevent user impact.
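One simple way to implement the pre-provisioning half of this answer is a scheduled capacity floor layered under the reactive policy. The peak window and instance counts below are assumptions for illustration (matching the lesson's 9-11 AM peak example), and the floor starts early because GPUs take minutes to warm:

```python
from datetime import datetime, time as dtime

# Known daily peak: 9-11 AM. Start the floor at 8:30 so capacity is
# warm BEFORE the spike — reactive scaling cannot catch up once it hits.
PEAK_START, PEAK_END = dtime(8, 30), dtime(11, 0)
PEAK_FLOOR, OFFPEAK_FLOOR = 8, 2  # illustrative instance counts

def capacity_floor(now: datetime) -> int:
    """Minimum instance count the autoscaler must maintain right now.
    Reactive scaling still adds capacity ABOVE this floor on queue depth."""
    t = now.time()
    return PEAK_FLOOR if PEAK_START <= t <= PEAK_END else OFFPEAK_FLOOR

print(capacity_floor(datetime(2024, 3, 5, 9, 15)))  # during peak → 8
print(capacity_floor(datetime(2024, 3, 5, 22, 0)))  # off-peak → 2
```

The floor handles the predictable spike; the reactive queue-depth trigger handles the unpredictable one. Neither alone is sufficient.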

When should you use the hybrid serving architecture (self-hosted + cloud API), and when is full cloud API preferable?

Hybrid: when monthly API costs exceed $10K AND you have MLOps team capacity. Below $10K/month, API-only is simpler. Hybrid provides 5-10x cost reduction on baseline load while keeping cloud API for burst flexibility. Full API-only remains preferable for: early stage (no MLOps capacity), data privacy not required, unpredictable traffic patterns.
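The $10K/month split point can be derived from a break-even calculation: self-hosting adds a fixed monthly cost (GPU reservations plus MLOps time) in exchange for a much lower marginal cost per request. The figures below are illustrative assumptions, not quoted prices:

```python
def monthly_cost_api(requests: int, api_cost=0.015) -> float:
    return requests * api_cost

def monthly_cost_self_hosted(requests: int, fixed=8000.0, marginal=0.001) -> float:
    # fixed: GPU reservations + amortized MLOps effort (illustrative)
    return fixed + requests * marginal

# Break-even volume: fixed / (api_cost - marginal)
break_even = 8000.0 / (0.015 - 0.001)
print(f"break-even at ~{break_even:,.0f} requests/month "
      f"(~${monthly_cost_api(int(break_even)):,.0f}/month in API spend)")
# → roughly 571k requests/month, i.e. high-single-digit $K of API spend
```

With these assumed numbers, the crossover lands near the $10K/month API spend the answer cites; plug in your own fixed and marginal costs to find your split point.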

From the books

Designing Data-Intensive Applications — Martin Kleppmann (2017)

Kleppmann's framework on backpressure is directly applicable to AI serving: systems should communicate capacity limits explicitly (rejection with 429) rather than implicitly (timeout after 30s). Fast failure allows clients to implement appropriate backoff strategies. Silent queueing is an anti-pattern that feels polite but results in worse outcomes for all users.

AI Engineering — Chapter 14: Production Infrastructure — Chip Huyen (2024)

Huyen's insight on the hybrid serving model: "The optimal architecture for most AI companies at growth stage is not fully self-hosted or fully API, but a hybrid where self-hosted handles predictable baseline load at low cost and managed APIs handle burst traffic at premium cost. The split point should be determined by cost analysis, not by engineering convenience."
