Architecture Decisions for 10x, 100x, and 1000x Growth
An AI system that works for 1,000 users often catastrophically fails at 100,000. The failure modes are different from scaling traditional web applications — LLMs are memory-hungry, stateful, and expensive in ways that break naive horizontal scaling.
Without deliberate architecture, LLM API costs grow super-linearly with users. The system falls over at peak traffic because GPU memory is exhausted, response latency degrades from 800ms to 8 seconds as the request queue backs up, and you cannot provision capacity fast enough to meet demand.
Designed well, the same system scales from 1,000 to 1,000,000 users with predictable cost growth: GPU resources run at 85%+ utilization during peak hours and scale down off-peak, and P95 latency stays below 1 second at any traffic level through intelligent request routing and caching.
In November 2023, a consumer AI app was featured on Product Hunt. Traffic went from 2,000 daily active users to 40,000 overnight. The engineering team had stress-tested to 5,000 concurrent users. At 8,000 concurrent users, the system started queuing requests. At 12,000, GPU memory was exhausted — the Llama-3-8B model running on 4x A100s ran out of KV cache space. New requests were being rejected with 503 errors.
The fallback: switch all traffic to the OpenAI API while the team scaled up GPU capacity. Cost: $0.03 per request × 40,000 users × 3 requests per session = $3,600 that day. Acceptable as an emergency. Three months later, still on OpenAI, still paying $3,600/day: $108,000/month they had not budgeted for.
The two compounding failures: (1) no autoscaling for GPU capacity — the cluster was fixed at 4x A100s, could not expand dynamically. (2) no request queue with backpressure — at 12,000 concurrent users, requests were rejected immediately rather than queued and served when capacity freed up. Both are solvable infrastructure problems that require designing for scale from the beginning.
Scaling AI systems is not the same as scaling web servers. You cannot just add more instances — GPU availability has minutes-long provisioning time (vs. seconds for CPU instances). KV cache memory is a hard constraint per request, not just CPU cycles. The architecture decisions you make at 1,000 users determine whether you survive 100,000.
Most engineers have intuitions about scaling built from web application experience. These intuitions fail in important ways when applied to LLM systems.
Wrong Intuition 1: Just add more instances. Web servers: add an instance in 30 seconds. GPU instances: 2-10 minutes to provision and warm up. At 12,000 concurrent users during a traffic spike, you need GPU capacity available *before* the spike, not reactive provisioning during it. AI systems need predictive autoscaling (scale before demand, based on leading indicators), not reactive autoscaling (scale after you're already bottlenecked).
Wrong Intuition 2: Memory usage is approximately constant. Web servers use roughly the same memory per request. LLMs use KV cache memory proportional to sequence length per *active* request. At 100 concurrent 2,000-token conversations, an 8B model needs 100 × 2,000 × cache_size = significant VRAM. When the KV cache is full, new requests cannot start. Memory management is a first-class scaling concern that does not exist in traditional web architecture.
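The arithmetic here is worth making concrete. A back-of-envelope sketch, assuming Llama-3-8B's published configuration (32 layers, 8 grouped-query KV heads, head dimension 128) and an fp16 cache:

```python
# KV cache sizing for an 8B-class model with grouped-query attention.
# Config figures assume Llama-3-8B: 32 layers, 8 KV heads, head_dim 128.

def kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for the separate K and V tensors at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()
conversation_tokens = 2_000
concurrent_users = 100

total_gib = per_token * conversation_tokens * concurrent_users / 2**30
print(f"{per_token / 1024:.0f} KiB of cache per token")   # prints "128 KiB of cache per token"
print(f"{total_gib:.1f} GiB of KV cache for {concurrent_users} concurrent chats")  # ~24.4 GiB
```

Nearly 25 GiB of VRAM for cache alone — before the ~16 GB of fp16 model weights — which is why concurrency, not compute, is often the first wall an 8B deployment hits.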
Wrong Intuition 3: Queuing requests is free. Adding a request queue decouples your web tier from your processing tier. For AI, long queues create a user experience problem: a user who submits a request and waits 45 seconds for a response will give up and reload. AI queues need smart backpressure: reject requests early (fast 429) when queue depth exceeds a threshold, rather than silently queuing thousands of requests that will eventually time out.
Wrong Intuition 4: Horizontal scaling solves everything. More web servers = more capacity. More GPU servers = more capacity, but also more cost and complexity. At some point, the right answer is not "more GPUs" but "more efficient use of GPUs" — through quantization, continuous batching, and caching. The sweet spot for AI systems is usually a mix: optimized serving on self-hosted GPUs for baseline load, cloud GPU API for burst capacity, and semantic caching to reduce the load that actually reaches GPUs.
Scaling AI systems requires infrastructure patterns that differ fundamentally from traditional web scaling.
```python
import asyncio
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Awaitable, Callable

# Level 1: Request queue with backpressure.
# Controls the flow of requests into the LLM backend instead of
# dropping them silently when the system is overloaded.

@dataclass
class QueuedRequest:
    request_id: str
    query: str
    submitted_at: float = field(default_factory=time.time)
    future: asyncio.Future = field(default_factory=asyncio.Future)

class BackpressureQueue:
    def __init__(self, max_queue_size: int = 500, max_wait_seconds: float = 30.0):
        # max_queue_size=500: when the queue exceeds 500 entries, reject
        # new requests fast (HTTP 429) instead of silently queuing them.
        self.queue: deque[QueuedRequest] = deque()
        self.max_size = max_queue_size
        self.max_wait = max_wait_seconds
        self.processing_count = 0

    async def submit(self, request_id: str, query: str) -> str:
        """Submit a request. Raises immediately if the queue is full.

        Fast rejection is better UX than a silent timeout: the user
        knows to retry instead of waiting 45 seconds for nothing.
        """
        if len(self.queue) >= self.max_size:
            raise ValueError(
                f"Service at capacity. Queue depth: {len(self.queue)}/{self.max_size}"
            )

        req = QueuedRequest(request_id=request_id, query=query)
        self.queue.append(req)

        try:
            # max_wait=30s: if a request waits longer than this, it has
            # effectively failed — better to fail fast.
            return await asyncio.wait_for(req.future, timeout=self.max_wait)
        except asyncio.TimeoutError:
            if req in self.queue:
                self.queue.remove(req)
            raise TimeoutError(
                f"Request {request_id} timed out after {self.max_wait}s in queue"
            )

    async def process_queue(self, handler: Callable[[str], Awaitable[str]]):
        """Worker loop that dispatches queued requests to the handler."""
        while True:
            if self.queue and self.processing_count < self._max_concurrent():
                req = self.queue.popleft()
                # Expire stale requests: if a request already sat in the
                # queue for max_wait, discard it rather than process it.
                if time.time() - req.submitted_at > self.max_wait:
                    req.future.set_exception(TimeoutError("Expired in queue"))
                    continue
                # Dispatch as a task so up to _max_concurrent() requests
                # run in parallel instead of serializing the loop.
                self.processing_count += 1
                asyncio.create_task(self._run(handler, req))
            else:
                await asyncio.sleep(0.01)

    async def _run(self, handler: Callable[[str], Awaitable[str]], req: QueuedRequest):
        try:
            req.future.set_result(await handler(req.query))
        except Exception as exc:
            req.future.set_exception(exc)
        finally:
            self.processing_count -= 1

    def _max_concurrent(self) -> int:
        return 20  # Max concurrent LLM requests per serving instance
```
```python
import time
from dataclasses import dataclass

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Level 2: Predictive autoscaling for GPU instances.
# React to leading indicators (queue depth, request-rate trend),
# not lagging ones like GPU utilization — by the time GPU util is
# maxed out, users are already waiting.

@dataclass
class ScalingPolicy:
    # Scale up at queue depth 100, NOT when the GPU is saturated.
    scale_up_queue_depth: int = 100          # Scale up when queue exceeds this
    scale_up_request_rate_pct: float = 0.80  # Scale up at 80% of capacity
    scale_down_utilization: float = 0.30     # Scale down below 30% GPU util
    min_instances: int = 2
    max_instances: int = 20
    # cooldown_minutes=5: prevent oscillation — give new instances time
    # to come online before re-evaluating.
    cooldown_minutes: int = 5

class GPUAutoscaler:
    def __init__(self, asg_name: str, policy: ScalingPolicy):
        self.asg = asg_name
        self.policy = policy
        self.last_scale_action = 0.0

    def get_current_metrics(self) -> dict:
        """Get queue depth and GPU utilization from CloudWatch."""
        response = cloudwatch.get_metric_statistics(
            Namespace="AI/Inference",
            MetricName="QueueDepth",
            Period=60,
            Statistics=["Average"],
            StartTime=time.time() - 120,
            EndTime=time.time(),
        )
        datapoints = response["Datapoints"]
        queue_depth = datapoints[0]["Average"] if datapoints else 0
        return {"queue_depth": queue_depth,
                "gpu_utilization": self._get_gpu_utilization()}

    def evaluate(self):
        """Evaluate scaling need and act if outside the cooldown window."""
        if time.time() - self.last_scale_action < self.policy.cooldown_minutes * 60:
            return  # Still in cooldown period

        metrics = self.get_current_metrics()
        current_count = self._get_instance_count()

        # Scale UP by +2 (aggressive, to clear the queue rapidly);
        # scale DOWN by -1 (conservative, to avoid thrashing).
        if (metrics["queue_depth"] > self.policy.scale_up_queue_depth or
                metrics["gpu_utilization"] > self.policy.scale_up_request_rate_pct):
            if current_count < self.policy.max_instances:
                new_count = min(current_count + 2, self.policy.max_instances)
                self._set_desired_capacity(new_count)
                self.last_scale_action = time.time()
                print(f"Scaled UP: {current_count} → {new_count} instances")
        elif metrics["gpu_utilization"] < self.policy.scale_down_utilization:
            if current_count > self.policy.min_instances:
                new_count = max(current_count - 1, self.policy.min_instances)
                self._set_desired_capacity(new_count)
                self.last_scale_action = time.time()
                print(f"Scaled DOWN: {current_count} → {new_count} instances")

    # _get_gpu_utilization, _get_instance_count, and _set_desired_capacity
    # wrap CloudWatch / Auto Scaling API calls and are elided here.
```
```python
from enum import Enum

import httpx
from openai import AsyncOpenAI

client_openai = AsyncOpenAI()

# Level 3: Hybrid routing — self-hosted baseline + cloud API burst.
# Route traffic between the self-hosted GPU cluster and a managed API
# based on current capacity: cheap self-hosted for baseline load,
# cloud API for burst overflow.

class RoutingTarget(Enum):
    SELF_HOSTED = "self_hosted"
    OPENAI_API = "openai_api"

class HybridInferenceRouter:
    """Routing strategy:
    - Self-hosted: baseline load (cost-efficient, ~$0.001/request)
    - OpenAI API: burst overflow (flexible, ~$0.015/request — 15x the cost)
    Decision: route to the API when self-hosted queue depth or GPU
    utilization exceeds a threshold.
    """

    def __init__(
        self,
        self_hosted_url: str = "http://vllm-cluster:8000",
        # Overflow to the API before self-hosted is maxed out (at 85%
        # utilization) to prevent user-visible degradation.
        burst_threshold_queue_depth: int = 200,
        burst_threshold_gpu_util: float = 0.85,
    ):
        self.self_hosted_url = self_hosted_url
        self.burst_threshold_queue = burst_threshold_queue_depth
        self.burst_threshold_gpu = burst_threshold_gpu_util

    async def route(self, prompt: str) -> tuple[str, RoutingTarget]:
        """Route a request to the optimal target based on current capacity."""
        metrics = await self._get_cluster_metrics()

        if (metrics["queue_depth"] > self.burst_threshold_queue or
                metrics["gpu_utilization"] > self.burst_threshold_gpu):
            # Self-hosted cluster is saturated — overflow to the API
            return await self._call_api(prompt), RoutingTarget.OPENAI_API
        return await self._call_self_hosted(prompt), RoutingTarget.SELF_HOSTED

    async def _call_self_hosted(self, prompt: str) -> str:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.self_hosted_url}/v1/completions",
                json={"prompt": prompt, "max_tokens": 512},
                timeout=30.0,
            )
            return response.json()["choices"][0]["text"]

    async def _call_api(self, prompt: str) -> str:
        # Use gpt-4o-mini for burst overflow — reserve larger models for
        # cases where quality is critical.
        response = await client_openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # _get_cluster_metrics polls the serving cluster's metrics endpoint
    # for queue depth and GPU utilization; elided here.
```
Scaling AI systems has distinct architectural phases, each with its own dominant challenge.
Scale Level 1: 1K → 10K users (Optimize Before Scaling) At this scale, the right answer is almost never "more hardware." Implement semantic caching (eliminate redundant queries), model routing (use cheaper models for simple tasks), and prompt compression (reduce token costs). A well-optimized system for 1K users can often handle 5-10K with no additional infrastructure. Profile first: where is the actual bottleneck? If it is GPU memory, optimize KV cache management. If it is API costs, add caching and routing. If it is latency, add streaming. Hardware is the last resort at this scale.
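A minimal sketch of the semantic-caching idea: cache responses keyed by query embedding and serve a hit when a new query is close enough. The `embed_fn` hook and the 0.95 threshold are placeholder assumptions — in practice you would plug in a sentence-embedding model and tune the threshold against false-hit rates:

```python
import math

class SemanticCache:
    """Serve cached responses for queries semantically close to one
    already answered, so redundant queries never reach the LLM."""

    def __init__(self, embed_fn, similarity_threshold=0.95):
        self.embed = embed_fn            # e.g. a sentence-embedding model
        self.threshold = similarity_threshold
        self.entries = []                # list of (embedding, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm if norm else 0.0

    def get(self, query):
        q = self.embed(query)
        best = max(self.entries, key=lambda e: self._cosine(q, e[0]), default=None)
        if best and self._cosine(q, best[0]) >= self.threshold:
            return best[1]               # cache hit: zero GPU cost
        return None                      # miss: caller invokes the LLM

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

The linear scan over `entries` is for illustration only; at production scale the lookup would be a vector index (FAISS, pgvector, or similar).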
Scale Level 2: 10K → 100K users (Queue and Autoscale) At 10K+ users, traffic spikes become existential threats. The key infrastructure: a request queue with backpressure (rejects requests fast when at capacity rather than queueing indefinitely), GPU autoscaling based on queue depth (not just CPU utilization), and a CDN layer for streaming responses (offloads streaming connection management from your serving cluster). The architecture is: load balancer → request queue → autoscaling GPU cluster → response streaming. This architecture survives 10x traffic spikes because it rejects gracefully at saturation rather than collapsing.
Scale Level 3: 100K → 1M users (Hybrid and Specialize) At this scale, no single serving setup is optimal. The architecture becomes hybrid: self-hosted GPU cluster for the high-volume, well-understood request types (fine-tuned for your specific workload), cloud API for burst overflow and novel query types, and specialized fast-path serving for real-time features (smaller model, dedicated cluster, strict SLA). The cost optimization at this scale is through specialization: run the cheapest model that handles each specific query type, reserve the most expensive model for the cases where quality demonstrably matters.
Three architectural decisions that work at 1,000 users and fail catastrophically at 10,000.
Senior engineers ask: "How do we handle 10x the traffic?" Principal engineers ask: "What is the cost-per-user trajectory, and what architectural decisions preserve unit economics as we scale?"
The principal insight is that scaling AI is a unit economics problem. The question is not just "can our system handle 1M users" but "at what per-user cost, and does the business model work at that cost?" An AI system serving 1M users at $0.02/user/day costs $20K/day — $600K/month. Whether that is sustainable depends on revenue per user. The architect's job is to design the system so cost scales sub-linearly with users through caching, routing, and optimization.
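A toy projection makes the unit-economics framing concrete. All rates here (requests per user, cache hit rate, per-request costs) are illustrative assumptions, not benchmarks:

```python
def daily_cost(users, requests_per_user=3, cache_hit_rate=0.0,
               cost_per_llm_request=0.007):
    """Daily serving cost: only cache misses reach the LLM."""
    llm_requests = users * requests_per_user * (1 - cache_hit_rate)
    return llm_requests * cost_per_llm_request

for users in (1_000, 100_000, 1_000_000):
    naive = daily_cost(users)
    optimized = daily_cost(users, cache_hit_rate=0.40,
                           cost_per_llm_request=0.004)  # cheaper routed model mix
    print(f"{users:>9,} users: ${naive:>8,.0f}/day naive, "
          f"${optimized:>8,.0f}/day with caching + routing")
```

Note that both curves are still linear in users; what makes cost scaling truly sub-linear in practice is that cache hit rates tend to rise with scale, because more users means more repeated queries.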
The frontier in 2025: mixture-of-experts (MoE) models at scale — models like Mixtral 8x7B that route each token through only 2 of its 8 experts per layer, delivering near-70B-class quality at roughly 13B active parameters of compute. MoE models fundamentally change the scaling math: you can serve near-state-of-the-art quality at much lower per-token compute. The infrastructure catch is that all expert weights must be resident in VRAM even though only 2 are active per token, so efficient MoE serving requires specialized memory management.
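The changed scaling math follows from one asymmetry: per-token compute tracks *active* parameters while VRAM tracks *total* parameters. A rough sketch using the figures published for Mixtral 8x7B (~46.7B total, ~12.9B active per token):

```python
# MoE serving math: compute scales with ACTIVE params, memory with TOTAL.
# Parameter counts are the published Mixtral 8x7B figures.
total_params_b = 46.7
active_params_b = 12.9

# Per-token forward compute is roughly 2 FLOPs per active parameter,
# so Mixtral costs about as much per token as a ~13B dense model.
vram_fp16_gb = total_params_b * 2   # 2 bytes/param: every expert resident

print(f"Compute per token ≈ a {active_params_b}B dense model")
print(f"Weights in fp16: ~{vram_fp16_gb:.0f} GB VRAM (all experts resident)")
```

The serving consequence: you pay 13B-class latency and energy per token, but you still need 70B-class VRAM to hold the model.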
The five-year arc: AI serving infrastructure will bifurcate into two tiers — cloud API (fully managed, premium pricing, maximum flexibility) and specialized hardware (custom ASICs, Groq-style inference chips, 10-100x more efficient than GPUs for specific model architectures). Engineers who understand both the algorithmic efficiency (which model for which task) and the hardware efficiency (which serving hardware for which traffic pattern) will design the systems that sustain AI businesses at scale.
Common questions:
Strong answer:
- Analyzes the actual bottleneck (is it KV cache, compute, network, or cost?) before recommending solutions
- Knows the scaling staircase: optimize → autoscale → hybrid → multi-region
- Understands why queue depth is a better autoscaling trigger than GPU utilization
- Has designed or operated a hybrid serving architecture and can discuss the cost-benefit analysis
Red flags:
- Recommends simply adding more GPUs without analyzing the actual bottleneck
- No awareness of KV cache memory as a scaling constraint (focuses only on GPU compute)
- Uses GPU utilization as the autoscaling trigger without knowing about the lag problem
- No concept of backpressure — believes queuing all requests is always better than rejecting some
Scenario · A B2B AI startup, currently 5,000 users, targeting 500,000 in 18 months
Your AI analyst assistant runs on a fixed 4x A100 cluster. At current load, GPU utilization is 45%. You have been told to design the infrastructure for 100x growth without 100x infrastructure cost.
You analyze your current 4x A100 cluster at 45% average GPU utilization. Peak utilization hits 92% for 2 hours each day (9-11 AM EST when US users are most active). During those 2 hours, P95 latency spikes from 800ms to 4.2 seconds. The rest of the day, the cluster is underutilized.
What do you address first?
Quick check · Scaling AI Systems
Key takeaways
Why is KV cache memory exhaustion the most common scaling failure mode for LLM systems?
Each concurrent request holds KV cache proportional to its sequence length. As concurrent users grow, total KV cache demand grows. When VRAM is full, no new requests can start regardless of GPU compute availability. This is fundamentally different from traditional web scaling where memory per request is small and relatively constant.
What is the difference between reactive and predictive autoscaling for AI systems, and why does it matter?
Reactive: scale when a metric (GPU util) exceeds a threshold. Problem: GPU provisioning takes 2-10 minutes, so users experience degradation during the scale-up window. Predictive: scale on leading indicators (queue depth, traffic rate trends) before degradation occurs, and pre-provision before known spikes. AI needs predictive because GPU cold-start time is too long for reactive scaling to prevent user impact.
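The cost of the reactive scale-up window is easy to quantify. A sketch with assumed numbers (5-minute provisioning, a linear traffic ramp, and scaling triggered only once capacity is exceeded):

```python
def degraded_requests(current_rps, spike_rps, provision_seconds=300,
                      capacity_rps=None):
    """Requests arriving above capacity while new GPUs provision,
    assuming traffic ramps linearly from current_rps to spike_rps."""
    capacity_rps = capacity_rps if capacity_rps is not None else current_rps
    overload = 0.0
    for t in range(provision_seconds):
        rps = current_rps + (spike_rps - current_rps) * t / provision_seconds
        overload += max(0.0, rps - capacity_rps)
    return int(overload)

# Reactive case: scaling starts only when capacity (50 rps) is already
# exceeded, so the full 5-minute window is spent over capacity.
print(degraded_requests(current_rps=50, spike_rps=200))  # prints 22425
```

Over 22,000 requests are degraded or rejected during a single reactive scale-up — which is the entire argument for scaling on leading indicators instead.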
When should you use the hybrid serving architecture (self-hosted + cloud API), and when is full cloud API preferable?
Hybrid: when monthly API costs exceed roughly $10K AND you have MLOps team capacity. Below $10K/month, API-only is simpler. Hybrid provides 5-10x cost reduction on baseline load while keeping the cloud API for burst flexibility. API-only remains preferable when you are early stage (no MLOps capacity), have no strict data-privacy or residency requirements, or face unpredictable traffic patterns.
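The break-even behind this rule of thumb is simple arithmetic. All dollar figures below are illustrative assumptions (API at ~$0.015/request; a minimal self-hosted GPU footprint at ~$6K/month fixed plus ~$0.001/request marginal):

```python
API_COST_PER_REQ = 0.015      # assumed blended API cost per request
SELF_HOSTED_FIXED = 6_000     # assumed monthly cost of a minimal GPU cluster
SELF_HOSTED_PER_REQ = 0.001   # assumed marginal cost (power, egress)

def breakeven_requests():
    # Self-hosted wins once the fixed cost is amortized:
    #   fixed + r * marginal < r * api  =>  r > fixed / (api - marginal)
    return SELF_HOSTED_FIXED / (API_COST_PER_REQ - SELF_HOSTED_PER_REQ)

r = breakeven_requests()
print(f"Break-even: ~{r:,.0f} requests/month "
      f"(~${r * API_COST_PER_REQ:,.0f}/month in API spend)")
```

Under these assumptions the pure hardware break-even lands well below $10K/month of API spend; the gap between that and the $10K rule of thumb is the MLOps engineering overhead the rule bakes in.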
From the books
Designing Data-Intensive Applications — Martin Kleppmann (2017)
Kleppmann's framework on backpressure is directly applicable to AI serving: systems should communicate capacity limits explicitly (rejection with 429) rather than implicitly (timeout after 30s). Fast failure allows clients to implement appropriate backoff strategies. Silent queueing is an anti-pattern that feels polite but results in worse outcomes for all users.
AI Engineering — Chapter 14: Production Infrastructure — Chip Huyen (2024)
Huyen's insight on the hybrid serving model: "The optimal architecture for most AI companies at growth stage is not fully self-hosted or fully API, but a hybrid where self-hosted handles predictable baseline load at low cost and managed APIs handle burst traffic at premium cost. The split point should be determined by cost analysis, not by engineering convenience."