Making LLMs Fast and Affordable at Production Scale
A single A100 GPU serving a 70B parameter model can handle about 10 concurrent requests at acceptable latency. At production scale (10,000+ requests/minute), naive inference costs $500K+/month and still cannot meet SLA requirements.
Your LLM API bills grow linearly with traffic. Response latency is 3-8 seconds, making real-time features impossible. You hit GPU memory limits at moderate traffic and cannot scale to meet demand spikes.
You serve 10-50x more traffic on the same hardware through batching and KV caching. Quantization cuts GPU memory by 50-75% with less than 3% quality degradation. The same compute budget that served 1M requests/day now serves 10-50M.
It is Q1 2024. A fast-growing AI startup has just crossed 1 million active users. Their AI writing assistant makes 4 API calls per user session. At GPT-4 pricing ($0.03/1K tokens, average 800 tokens per call), that is $96 per 1,000 users per day. At 1 million users, $96,000 per day. $2.9M per month, just in inference. Their Series B runway was 18 months. Inference alone would consume it in 6 months.
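The cost arithmetic above can be checked directly. This is a back-of-envelope sketch using the pricing and token counts stated in the text; the `sessions_per_user_per_day` parameter is an assumption (one session per user per day):

```python
# Back-of-envelope inference cost model for the scenario above.
# Assumptions (from the text): $0.03 per 1K tokens, 800 tokens/call, 4 calls/session.
PRICE_PER_1K_TOKENS = 0.03
TOKENS_PER_CALL = 800
CALLS_PER_SESSION = 4

def daily_cost(users: int, sessions_per_user_per_day: float = 1.0) -> float:
    tokens = users * sessions_per_user_per_day * CALLS_PER_SESSION * TOKENS_PER_CALL
    return tokens / 1000 * PRICE_PER_1K_TOKENS

print(daily_cost(1_000))      # $96/day per 1,000 users
print(daily_cost(1_000_000))  # $96,000/day, roughly $2.9M/month
```

Plugging in your own per-call token averages and pricing turns this into a quick sanity check for any projected bill.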
The obvious move: switch to a cheaper model. GPT-4o-mini costs 30x less. But their product manager ran an A/B test — GPT-4o-mini writing quality scored 15 points lower on user satisfaction. Users noticed.
The team spent two weeks doing what they should have done before launch: inference optimization. They implemented four changes: (1) response streaming (users see output immediately, reducing perceived latency by 70%); (2) semantic caching (repeat queries return cached responses; 34% of their queries were near-duplicates); (3) prompt compression with LLMLingua (average prompt cut from 800 to 520 tokens, a 35% cost reduction); (4) model routing (simple synonym suggestions go to GPT-4o-mini, complex paragraph rewrites to GPT-4o).
Net result: $800K/month inference bill dropped to $110K. Same GPT-4 quality for complex tasks. Actual P95 latency dropped from 4.2s to 1.8s (streaming + caching). These were not exotic ML techniques — they were engineering choices that should have been made at architecture design time.
Inference optimization is not premature optimization. It is the difference between a viable business model and one that burns through runway on GPU bills.
When engineers first think about making LLMs faster, they reach for the wrong levers. The most common intuitions: "use a smaller model" (sacrifices quality), "reduce max tokens" (limits output quality), "use faster hardware" (expensive, limited upside). These miss the highest-leverage optimization opportunities.
Wrong Intuition 1: Latency scales linearly with output length. Actual behavior: time to first token (TTFT) depends on prompt length and model size. Time per output token (TPOT) is nearly constant per token. A 1,000-token output takes about 10x longer than 100 tokens — but streaming makes this invisible to users. First-token latency is what users experience as "lag." Optimizing TTFT matters 3-5x more than optimizing total generation time for perceived performance.
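A minimal latency model makes the TTFT/TPOT split concrete. The numbers below are illustrative assumptions, not benchmarks:

```python
# Decompose generation latency into time-to-first-token (TTFT) and
# time-per-output-token (TPOT). With streaming, the user perceives TTFT;
# without it, the user waits for the entire generation.
def total_latency(ttft_s: float, tpot_s: float, n_tokens: int) -> float:
    return ttft_s + tpot_s * n_tokens

def perceived_latency(ttft_s: float, tpot_s: float, n_tokens: int, streaming: bool) -> float:
    return ttft_s if streaming else total_latency(ttft_s, tpot_s, n_tokens)

# Illustrative values: 400ms TTFT, 30ms/token, 500-token answer.
print(perceived_latency(0.4, 0.03, 500, streaming=False))  # 15.4s wait
print(perceived_latency(0.4, 0.03, 500, streaming=True))   # 0.4s to first token
```

The same generation feels almost 40x faster once streaming hides everything after the first token, which is why optimizing TTFT dominates perceived performance.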
Wrong Intuition 2: Batching requests hurts latency. Actual behavior: continuous batching (PagedAttention, used by vLLM) allows new requests to join a batch mid-generation, dramatically increasing GPU utilization without adding per-request latency. Naive batching does hurt latency. Continuous batching does not. Modern inference servers (vLLM, TGI, TensorRT-LLM) all implement this. Without continuous batching, GPU utilization is typically 20-40%. With it, 80-95%.
Wrong Intuition 3: Quantization degrades quality significantly. Actual behavior: AWQ (Activation-aware Weight Quantization) at 4-bit precision degrades benchmark performance by less than 2% on most tasks while cutting VRAM by 50-75%. A 70B model in FP16 requires 140GB VRAM (two A100s 80GB). In AWQ 4-bit, 35-40GB (fits on one A100). The same hardware serves 2x the model capacity. For most production applications, the quality difference between FP16 and AWQ 4-bit is below user detection threshold.
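The VRAM arithmetic behind those numbers is simple. This is a sketch covering weights only; real deployments add KV cache and activation overhead on top:

```python
# Approximate weight memory for an N-billion-parameter model at a given precision.
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70, 16))  # FP16: 140.0 GB of weights — two 80GB A100s
print(weight_memory_gb(70, 4))   # 4-bit: 35.0 GB — fits one A100 with headroom for KV cache
```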
The highest-leverage optimization opportunities are: (1) KV cache reuse — cache attention states for repeated prompt prefixes. (2) Semantic response caching — return cached answers for semantically similar queries. (3) Prompt compression — reduce input tokens by 30-50% using model-guided compression. (4) Model routing — use a small cheap model when it is sufficient, reserve the expensive model for complex tasks. These four techniques alone can reduce inference cost by 60-85% on typical production workloads.
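A rough model shows how these techniques compound: each one applies to the cost that survives the previous one. Every rate below is an assumption to replace with your own measurements:

```python
# Estimate the combined cost reduction from stacking optimizations.
def remaining_cost_fraction(cache_hit_rate: float,
                            prompt_compression: float,
                            routed_fraction: float,
                            cheap_model_discount: float) -> float:
    cost = 1.0
    cost *= (1 - cache_hit_rate)      # cache hits cost ~nothing
    cost *= (1 - prompt_compression)  # fewer input tokens per surviving call
    # The routed fraction pays only (1 - discount) of the full-model price.
    cost *= (1 - routed_fraction * cheap_model_discount)
    return cost

# Assumed rates: 34% cache hits, 35% prompt compression,
# half of surviving calls routed to a model ~95% cheaper.
frac = remaining_cost_fraction(0.34, 0.35, 0.50, 0.95)
print(f"{1 - frac:.0%} cost reduction")  # prints "77% cost reduction"
```

This simplified model treats prompt compression as a uniform discount on call cost (in reality it only shrinks input tokens), but it is enough to see why modest individual rates multiply into the 60-85% range.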
Inference optimization spans from simple API patterns to custom serving infrastructure. Here is the progression from prototype to production-grade.
```python
import hashlib
import json

import numpy as np
from openai import OpenAI
from redis import Redis

client = OpenAI()
redis = Redis(host="localhost", port=6379)

# Level 1: Semantic caching
# Cache LLM responses by semantic similarity of queries.
# 20-40% of production queries are near-duplicates — serve them from cache.
# Tune the threshold higher to avoid wrong cache hits, lower to increase hit rate.
SIMILARITY_THRESHOLD = 0.92  # Cosine similarity threshold for a cache hit

def get_embedding(text: str) -> list[float]:
    return client.embeddings.create(
        input=text, model="text-embedding-3-small"
    ).data[0].embedding

def semantic_cache_lookup(query: str) -> str | None:
    query_embedding = get_embedding(query)

    # Linear scan over cached queries is O(n) in cache size — use a
    # vector DB (Qdrant, Pinecone) for >10K cached queries.
    cached_keys = redis.keys("embed:*")
    for key in cached_keys:
        cached_data = json.loads(redis.get(key))
        cached_embedding = cached_data["embedding"]

        # Cosine similarity: values >0.92 indicate semantically equivalent
        # queries that are safe to serve from cache.
        similarity = np.dot(query_embedding, cached_embedding) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(cached_embedding)
        )

        if similarity >= SIMILARITY_THRESHOLD:
            return cached_data["response"]  # Cache hit

    return None  # Cache miss

def cached_completion(query: str, system_prompt: str) -> str:
    # Try the cache first
    cached = semantic_cache_lookup(query)
    if cached:
        return cached

    # Generate a fresh response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
        stream=False,
    ).choices[0].message.content

    # Store in cache with embedding. TTL=3600: expire entries after 1 hour —
    # balance freshness vs. hit rate for your use case.
    embed_key = f"embed:{hashlib.sha256(query.encode()).hexdigest()[:16]}"
    redis.setex(embed_key, 3600, json.dumps({
        "embedding": get_embedding(query),
        "response": response,
        "query": query,
    }))

    return response
```
```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Level 2: vLLM — production inference server with PagedAttention.
# PagedAttention manages the KV cache like OS virtual memory, eliminating
# memory fragmentation and enabling continuous batching.
# Result: 2-4x more throughput on the same hardware vs. naive inference.

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,       # Shard the model across 2 GPUs (for models too large for 1)
    dtype="bfloat16",             # BF16: numerically stable, good inference quality
    gpu_memory_utilization=0.90,  # Reserve 90% of GPU VRAM for weights + KV cache
    max_model_len=8192,           # Max context length
    enable_prefix_caching=True,   # Cache KV states for repeated system-prompt prefixes —
                                  # a big win when the system prompt is long
    quantization="awq",           # AWQ 4-bit: 50-75% memory reduction, <2% quality loss
)

engine = AsyncLLMEngine.from_engine_args(engine_args)

async def generate(prompt: str, max_tokens: int = 512) -> str:
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.9,
        max_tokens=max_tokens,
    )

    # vLLM batches automatically via continuous batching: unlike naive
    # batching, new requests join existing batches mid-flight instead of
    # waiting for the current batch to finish — no head-of-line blocking.
    results_generator = engine.generate(prompt, sampling_params, request_id=str(id(prompt)))

    final_output = None
    async for request_output in results_generator:
        final_output = request_output

    return final_output.outputs[0].text

async def benchmark_throughput():
    # With vLLM: 1,200-1,800 tokens/second on 2x A100 80GB (Llama-3-8B).
    # Without vLLM (naive HF transformers): 300-400 tokens/second on the same
    # hardware — a 4-5x throughput improvement from PagedAttention alone.
    prompts = [f"Prompt {i}: Write a paragraph about AI." for i in range(100)]
    tasks = [generate(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    return results
```
```python
from anthropic import Anthropic
from openai import OpenAI
from pydantic import BaseModel

client_openai = OpenAI()
client_anthropic = Anthropic()

# Level 3: Intelligent model routing.
# Route each query to the cheapest model that can handle it — the biggest
# single lever for cost reduction: 60-80% savings vs. always using the most
# expensive model.

class RouterDecision(BaseModel):
    model: str             # Which model to use
    reasoning: str         # Why this model was chosen
    complexity_score: int  # 1-5

ROUTING_SYSTEM_PROMPT = """Classify this query's complexity to determine the optimal model.
Score 1-5:
1-2 = Simple (factual lookup, formatting, short completion) -> use gpt-4o-mini
3 = Medium (multi-step reasoning, moderate length) -> use gpt-4o
4-5 = Complex (deep analysis, creative, long-form, nuanced) -> use claude-3-5-sonnet

Respond with JSON: model, reasoning, complexity_score"""

def route_query(user_query: str) -> tuple[str, str]:
    # Use a cheap model for the routing decision itself — a ~100-token
    # routing call finishes in <200ms and costs a fraction of the main call.
    routing_response = client_openai.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": ROUTING_SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
        response_format=RouterDecision,
    )

    decision = routing_response.choices[0].message.parsed
    return decision.model, decision.reasoning

def intelligent_completion(user_query: str, system_prompt: str) -> str:
    model, reason = route_query(user_query)

    if model == "claude-3-5-sonnet":
        # $3/1M input tokens — 20x more expensive than gpt-4o-mini,
        # justified only for complex tasks.
        response = client_anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            system=system_prompt,
            messages=[{"role": "user", "content": user_query}],
        )
        return response.content[0].text

    # gpt-4o-mini ($0.15/1M input tokens) for simple queries;
    # gpt-4o as the default for medium-complexity queries.
    openai_model = "gpt-4o-mini" if model == "gpt-4o-mini" else "gpt-4o"
    response = client_openai.chat.completions.create(
        model=openai_model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content
```
The right optimization depends on your traffic pattern, latency requirements, and cost constraints.
Scenario 1: High-Volume, Repetitive Queries (Caching First) A customer service AI handles 200K queries/day. Analysis shows: 40% of queries are near-duplicates (same questions about return policy, shipping times, product specs). Optimization: semantic caching in Redis. Cache hit rate: 38%. LLM API calls reduced from 200K to 124K/day. Cost reduction: 38% with zero quality impact (cached queries return identical responses). Implementation time: 3 days. This is the highest ROI optimization for any application with repetitive query patterns.
Scenario 2: Latency-Sensitive Real-Time App (Streaming + Smaller Model) A copilot feature needs to respond in under 1 second for inline suggestions. Problem: GPT-4o at P95 2.8 seconds is too slow. Optimization strategy: (1) Enable streaming — first token appears in 300-400ms, making the UI feel responsive even before the full response arrives. (2) Route all copilot suggestions to GPT-4o-mini (P95: 400ms) except complex multi-step rewrites. (3) Implement speculative decoding for the most common completion patterns. Result: P95 TTFT drops from 2.8s to 380ms. User engagement with copilot increases 40% after the latency improvement.
Scenario 3: Self-Hosted 70B Model (Quantization + vLLM) An enterprise customer requires on-premises deployment of Llama-3-70B for data privacy. Hardware: 4x A100 80GB GPUs. In FP16, Llama-3-70B requires 140GB VRAM — needs all 4 GPUs with near-zero KV cache headroom. With AWQ 4-bit quantization: 35-40GB model size. Fits on 1 GPU, leaving 3 for KV cache. vLLM with PagedAttention serves 8x more concurrent requests than naive transformers inference. Quality benchmarks: AWQ 4-bit Llama-3-70B vs. FP16 shows less than 2% degradation on domain benchmarks. The enterprise uses 1/4 the hardware, serves the same traffic.
These scenarios show how to sidestep the three wrong intuitions above: the optimization mistakes engineers make when they first encounter LLM performance problems.
Senior engineers ask: "How do we make this faster?" Principal engineers ask: "What is the unit economics of inference for this application, and how does it scale with our growth plan?"
The principal perspective is that inference optimization is a product economics problem. The question is not "what is the fastest possible serving setup" but "what is the cost per quality-adjusted output at production scale, and how do we make that economics work for our business model?"
The frontier in 2025: speculative decoding (using a small draft model to generate candidate tokens that a larger model verifies in parallel, achieving near-large-model quality at small-model speed), FlashAttention-3 (attention kernels that run 1.5-2x faster on H100s through hardware-aware algorithms), and continuous KV cache compression (compressing the KV cache during generation to fit a longer effective context on the same hardware).
The five-year arc: inference will get dramatically cheaper through hardware improvements (H100, H200, B200 progression), software improvements (attention algorithms, quantization), and architectural changes (mixture-of-experts models like Mixtral that activate only 1/8 of parameters per token). Engineers who understand the stack from attention mechanism to serving infrastructure will own the optimization decisions that determine product economics.
Common questions:
Strong answer:
- Analyzes query patterns first (what % are near-duplicates? what % are simple vs. complex?) before recommending optimizations
- Knows the semantic caching → model routing → quantization → serving optimization ladder
- Understands that vLLM is the minimum viable serving stack for any self-hosted production model
- Can calculate the break-even point between API costs and self-hosting TCO (hardware + infra team time)
Red flags:
- Recommends switching to a smaller model as the first inference optimization without analyzing the query complexity distribution
- Believes batching always improves latency (confuses throughput with latency)
- No awareness of vLLM or continuous batching for self-hosted serving
- Assumes quantization is safe without task-specific quality evaluation
Scenario · An AI productivity startup, 500K users, $400K/month inference bill
Your AI assistant makes 3 LLM calls per user session — a context-gathering call, a generation call, and a summary call. All three use GPT-4o at $5/1M tokens. Investors are asking about the path to profitability. Inference is 60% of your COGS.
You analyze your 3 LLM calls per session. The context-gathering call (average 200 tokens in, 150 tokens out) runs on every session, even for returning users who have visited before. The generation call (average 500 tokens in, 400 tokens out) handles the core AI feature. The summary call (average 800 tokens in, 100 tokens out) creates a session summary. All three use GPT-4o.
Which call should you optimize first?
Key takeaways
What is the difference between improving throughput vs. improving latency, and why do they require different interventions?
Throughput = queries per second (total capacity). Latency = time per individual query. Larger batch sizes improve throughput but hurt individual query latency (more waiting). Smaller batches improve latency but reduce throughput. Continuous batching (vLLM) improves both simultaneously by allowing requests to join batches mid-flight.
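Little's law ties the two quantities together: in any stable system, in-flight requests = arrival rate × average latency. A one-line sketch:

```python
# Little's law: L = lambda * W (in-flight requests = throughput * latency).
# It shows the trade-off directly: at fixed concurrency (GPU capacity),
# raising throughput requires lowering per-request latency, and vice versa.
def concurrency(throughput_rps: float, latency_s: float) -> float:
    return throughput_rps * latency_s

# A server sustaining 50 req/s at 2s average latency holds 100 requests in flight.
print(concurrency(50, 2.0))  # 100.0
```

Continuous batching effectively raises the sustainable concurrency for the same hardware, which is how it improves throughput without the latency penalty of naive batching.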
Why does semantic caching have higher ROI than model switching for most inference cost problems?
Semantic caching returns cached responses in <10ms with zero LLM cost — it eliminates the API call entirely for matching queries. Model switching reduces per-call cost by 2-30x but still makes the call. When 30-40% of queries are near-duplicates, caching eliminates 30-40% of costs; switching only reduces the remaining 60-70% by a fraction.
What should you test before deploying a quantized model to production?
Run your full task-specific eval suite (not just MMLU) against the quantized model. Pay special attention to tasks involving arithmetic, exact formatting, and multi-step reasoning — these are most sensitive to precision loss. Require that quantized model quality is within 5% of FP16 on all eval strata before deploying.
From the books
AI Engineering — Chapter 13: Inference Optimization — Chip Huyen (2024)
Huyen frames inference optimization as a latency-throughput-cost trilemma: you can optimize any two at the expense of the third. The practical insight is to identify which dimension is the binding constraint for your application (cost for startups, latency for real-time features, throughput for batch processing) and optimize primarily for that dimension.
Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al., UC Berkeley (2023)
The paper introducing PagedAttention and vLLM. Key insight: KV cache memory fragmentation (from variable sequence lengths) wastes 20-40% of GPU VRAM in naive implementations. PagedAttention manages KV cache in fixed-size pages (like OS virtual memory), eliminating fragmentation and enabling the continuous batching that makes vLLM 2-4x more throughput-efficient than naive serving.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Lin et al., MIT (2023)
AWQ's key insight: not all weights are equally important for quantization. Weights that correspond to high-activation channels (the "important" computations) should be preserved at higher precision, while low-activation weights can be aggressively quantized. This activation-aware approach achieves better quality than uniform 4-bit quantization (GPTQ) at equivalent compression ratios.