Making LLMs Fast and Affordable at Production Scale
A single A100 GPU serving a 70B parameter model can handle about 10 concurrent requests at acceptable latency. At production scale (10,000+ requests/minute), naive inference costs $500K+/month and still cannot meet SLA requirements.
Your LLM API bills grow linearly with traffic. Response latency is 3-8 seconds, making real-time features impossible. You hit GPU memory limits at moderate traffic and cannot scale to meet demand spikes.
You serve 10-50x more traffic on the same hardware through batching and KV caching. Quantization cuts GPU memory by 50-75% with less than 3% quality degradation. The same compute budget that served 1M requests/day now serves 10-50M.
It is Q1 2024. A fast-growing AI startup has just crossed 1 million active users. Their AI writing assistant makes 4 API calls per user session. At GPT-4 pricing ($0.03/1K tokens, average 800 tokens per call), that is $96 per 1,000 users per day. At 1 million users, $96,000 per day. $2.9M per month, just in inference. Their Series B runway was 18 months. Inference alone would consume it in 6 months.
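The cost arithmetic above can be checked directly. This is a back-of-envelope sketch using the pricing and token counts stated in the text; the `sessions_per_user_per_day` parameter is an assumption (one session per user per day):

```python
# Back-of-envelope inference cost model for the scenario above.
# Assumptions (from the text): $0.03 per 1K tokens, 800 tokens/call, 4 calls/session.
PRICE_PER_1K_TOKENS = 0.03
TOKENS_PER_CALL = 800
CALLS_PER_SESSION = 4

def daily_cost(users: int, sessions_per_user_per_day: float = 1.0) -> float:
    tokens = users * sessions_per_user_per_day * CALLS_PER_SESSION * TOKENS_PER_CALL
    return tokens / 1000 * PRICE_PER_1K_TOKENS

print(daily_cost(1_000))      # $96/day per 1,000 users
print(daily_cost(1_000_000))  # $96,000/day, roughly $2.9M/month
```

Plugging in your own per-call token averages and pricing turns this into a quick sanity check for any projected bill.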
The obvious move: switch to a cheaper model. GPT-4o-mini costs 30x less. But their product manager ran an A/B test — GPT-4o-mini writing quality scored 15 points lower on user satisfaction. Users noticed.
The team spent two weeks doing what they should have done before launch: inference optimization. They implemented four changes: (1) response streaming (users see output immediately, reducing perceived latency by 70%); (2) semantic caching (repeat queries return cached responses; 34% of their queries were near-duplicates); (3) prompt compression with LLMLingua (average prompt cut from 800 to 520 tokens, a 35% cost reduction); (4) model routing (simple synonym suggestions go to GPT-4o-mini, complex paragraph rewrites to GPT-4o).
Net result: $800K/month inference bill dropped to $110K. Same GPT-4 quality for complex tasks. Actual P95 latency dropped from 4.2s to 1.8s (streaming + caching). These were not exotic ML techniques — they were engineering choices that should have been made at architecture design time.
Inference optimization is not premature optimization. It is the difference between a viable business model and one that burns through runway on GPU bills.
When engineers first think about making LLMs faster, they reach for the wrong levers. The most common intuitions: "use a smaller model" (sacrifices quality), "reduce max tokens" (limits output quality), "use faster hardware" (expensive, limited upside). These miss the highest-leverage optimization opportunities.
Wrong Intuition 1: Latency scales linearly with output length. Actual behavior: time to first token (TTFT) depends on prompt length and model size. Time per output token (TPOT) is nearly constant per token. A 1,000-token output takes about 10x longer than 100 tokens — but streaming makes this invisible to users. First-token latency is what users experience as "lag." Optimizing TTFT matters 3-5x more than optimizing total generation time for perceived performance.
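A minimal latency model makes the TTFT/TPOT split concrete. The numbers below are illustrative assumptions, not benchmarks:

```python
# Decompose generation latency into time-to-first-token (TTFT) and
# time-per-output-token (TPOT). With streaming, the user perceives TTFT;
# without it, the user waits for the entire generation.
def total_latency(ttft_s: float, tpot_s: float, n_tokens: int) -> float:
    return ttft_s + tpot_s * n_tokens

def perceived_latency(ttft_s: float, tpot_s: float, n_tokens: int, streaming: bool) -> float:
    return ttft_s if streaming else total_latency(ttft_s, tpot_s, n_tokens)

# Illustrative values: 400ms TTFT, 30ms/token, 500-token answer.
print(perceived_latency(0.4, 0.03, 500, streaming=False))  # 15.4s wait
print(perceived_latency(0.4, 0.03, 500, streaming=True))   # 0.4s to first token
```

The same generation feels almost 40x faster once streaming hides everything after the first token, which is why optimizing TTFT dominates perceived performance.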
Wrong Intuition 2: Batching requests hurts latency. Actual behavior: continuous batching (PagedAttention, used by vLLM) allows new requests to join a batch mid-generation, dramatically increasing GPU utilization without adding per-request latency. Naive batching does hurt latency. Continuous batching does not. Modern inference servers (vLLM, TGI, TensorRT-LLM) all implement this. Without continuous batching, GPU utilization is typically 20-40%. With it, 80-95%.
Wrong Intuition 3: Quantization degrades quality significantly. Actual behavior: AWQ (Activation-aware Weight Quantization) at 4-bit precision degrades benchmark performance by less than 2% on most tasks while cutting VRAM by 50-75%. A 70B model in FP16 requires 140GB VRAM (two A100s 80GB). In AWQ 4-bit, 35-40GB (fits on one A100). The same hardware serves 2x the model capacity. For most production applications, the quality difference between FP16 and AWQ 4-bit is below user detection threshold.
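The VRAM arithmetic behind those numbers is simple. This is a sketch covering weights only; real deployments add KV cache and activation overhead on top:

```python
# Approximate weight memory for an N-billion-parameter model at a given precision.
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70, 16))  # FP16: 140.0 GB of weights — two 80GB A100s
print(weight_memory_gb(70, 4))   # 4-bit: 35.0 GB — fits one A100 with headroom for KV cache
```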
The highest-leverage optimization opportunities are: (1) KV cache reuse — cache attention states for repeated prompt prefixes. (2) Semantic response caching — return cached answers for semantically similar queries. (3) Prompt compression — reduce input tokens by 30-50% using model-guided compression. (4) Model routing — use a small cheap model when it is sufficient, reserve the expensive model for complex tasks. These four techniques alone can reduce inference cost by 60-85% on typical production workloads.
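A rough model shows how these techniques compound: each one applies to the cost that survives the previous one. Every rate below is an assumption to replace with your own measurements:

```python
# Estimate the combined cost reduction from stacking optimizations.
def remaining_cost_fraction(cache_hit_rate: float,
                            prompt_compression: float,
                            routed_fraction: float,
                            cheap_model_discount: float) -> float:
    cost = 1.0
    cost *= (1 - cache_hit_rate)      # cache hits cost ~nothing
    cost *= (1 - prompt_compression)  # fewer input tokens per surviving call
    # The routed fraction pays only (1 - discount) of the full-model price.
    cost *= (1 - routed_fraction * cheap_model_discount)
    return cost

# Assumed rates: 34% cache hits, 35% prompt compression,
# half of surviving calls routed to a model ~95% cheaper.
frac = remaining_cost_fraction(0.34, 0.35, 0.50, 0.95)
print(f"{1 - frac:.0%} cost reduction")  # prints "77% cost reduction"
```

This simplified model treats prompt compression as a uniform discount on call cost (in reality it only shrinks input tokens), but it is enough to see why modest individual rates multiply into the 60-85% range.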
Inference optimization spans from simple API patterns to custom serving infrastructure. Here is the progression from prototype to production-grade.
```python
import hashlib
import json

import numpy as np
from openai import OpenAI
from redis import Redis

client = OpenAI()
redis = Redis(host="localhost", port=6379)

# Level 1: Semantic caching
# Cache LLM responses by semantic similarity of queries.
# 20-40% of production queries are near-duplicates — serve them from cache.
# Tune the threshold higher to avoid wrong cache hits, lower to increase hit rate.
SIMILARITY_THRESHOLD = 0.92  # Cosine similarity threshold for a cache hit

def get_embedding(text: str) -> list[float]:
    return client.embeddings.create(
        input=text, model="text-embedding-3-small"
    ).data[0].embedding

def semantic_cache_lookup(query: str) -> str | None:
    query_embedding = get_embedding(query)

    # Linear scan over cached queries is O(n) in cache size — use a
    # vector DB (Qdrant, Pinecone) for >10K cached queries.
    cached_keys = redis.keys("embed:*")
    for key in cached_keys:
        cached_data = json.loads(redis.get(key))
        cached_embedding = cached_data["embedding"]

        # Cosine similarity: values >0.92 indicate semantically equivalent
        # queries that are safe to serve from cache.
        similarity = np.dot(query_embedding, cached_embedding) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(cached_embedding)
        )

        if similarity >= SIMILARITY_THRESHOLD:
            return cached_data["response"]  # Cache hit

    return None  # Cache miss

def cached_completion(query: str, system_prompt: str) -> str:
    # Try the cache first
    cached = semantic_cache_lookup(query)
    if cached:
        return cached

    # Generate a fresh response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
        stream=False,
    ).choices[0].message.content

    # Store in cache with embedding. TTL=3600: expire entries after 1 hour —
    # balance freshness vs. hit rate for your use case.
    embed_key = f"embed:{hashlib.sha256(query.encode()).hexdigest()[:16]}"
    redis.setex(embed_key, 3600, json.dumps({
        "embedding": get_embedding(query),
        "response": response,
        "query": query,
    }))

    return response
```
```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Level 2: vLLM — production inference server with PagedAttention.
# PagedAttention manages the KV cache like OS virtual memory, eliminating
# memory fragmentation and enabling continuous batching.
# Result: 2-4x more throughput on the same hardware vs. naive inference.

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,       # Shard the model across 2 GPUs (for models too large for 1)
    dtype="bfloat16",             # BF16: numerically stable, good inference quality
    gpu_memory_utilization=0.90,  # Reserve 90% of GPU VRAM for weights + KV cache
    max_model_len=8192,           # Max context length
    enable_prefix_caching=True,   # Cache KV states for repeated system-prompt prefixes —
                                  # a big win when the system prompt is long
    quantization="awq",           # AWQ 4-bit: 50-75% memory reduction, <2% quality loss
)

engine = AsyncLLMEngine.from_engine_args(engine_args)

async def generate(prompt: str, max_tokens: int = 512) -> str:
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.9,
        max_tokens=max_tokens,
    )

    # vLLM batches automatically via continuous batching: unlike naive
    # batching, new requests join existing batches mid-flight instead of
    # waiting for the current batch to finish — no head-of-line blocking.
    results_generator = engine.generate(prompt, sampling_params, request_id=str(id(prompt)))

    final_output = None
    async for request_output in results_generator:
        final_output = request_output

    return final_output.outputs[0].text

async def benchmark_throughput():
    # With vLLM: 1,200-1,800 tokens/second on 2x A100 80GB (Llama-3-8B).
    # Without vLLM (naive HF transformers): 300-400 tokens/second on the same
    # hardware — a 4-5x throughput improvement from PagedAttention alone.
    prompts = [f"Prompt {i}: Write a paragraph about AI." for i in range(100)]
    tasks = [generate(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    return results
```
```python
from anthropic import Anthropic
from openai import OpenAI
from pydantic import BaseModel

client_openai = OpenAI()
client_anthropic = Anthropic()

# Level 3: Intelligent model routing.
# Route each query to the cheapest model that can handle it — the biggest
# single lever for cost reduction: 60-80% savings vs. always using the most
# expensive model.

class RouterDecision(BaseModel):
    model: str             # Which model to use
    reasoning: str         # Why this model was chosen
    complexity_score: int  # 1-5

ROUTING_SYSTEM_PROMPT = """Classify this query's complexity to determine the optimal model.
Score 1-5:
1-2 = Simple (factual lookup, formatting, short completion) -> use gpt-4o-mini
3 = Medium (multi-step reasoning, moderate length) -> use gpt-4o
4-5 = Complex (deep analysis, creative, long-form, nuanced) -> use claude-3-5-sonnet

Respond with JSON: model, reasoning, complexity_score"""

def route_query(user_query: str) -> tuple[str, str]:
    # Use a cheap model for the routing decision itself — a ~100-token
    # routing call finishes in <200ms and costs a fraction of the main call.
    routing_response = client_openai.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": ROUTING_SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
        response_format=RouterDecision,
    )

    decision = routing_response.choices[0].message.parsed
    return decision.model, decision.reasoning

def intelligent_completion(user_query: str, system_prompt: str) -> str:
    model, reason = route_query(user_query)

    if model == "claude-3-5-sonnet":
        # $3/1M input tokens — 20x more expensive than gpt-4o-mini,
        # justified only for complex tasks.
        response = client_anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            system=system_prompt,
            messages=[{"role": "user", "content": user_query}],
        )
        return response.content[0].text

    # gpt-4o-mini ($0.15/1M input tokens) for simple queries;
    # gpt-4o as the default for medium-complexity queries.
    openai_model = "gpt-4o-mini" if model == "gpt-4o-mini" else "gpt-4o"
    response = client_openai.chat.completions.create(
        model=openai_model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content
```
The right optimization depends on your traffic pattern, latency requirements, and cost constraints.
Scenario 1: High-Volume, Repetitive Queries (Caching First) A customer service AI handles 200K queries/day. Analysis shows: 40% of queries are near-duplicates (same questions about return policy, shipping times, product specs). Optimization: semantic caching in Redis. Cache hit rate: 38%. LLM API calls reduced from 200K to 124K/day. Cost reduction: 38% with zero quality impact (cached queries return identical responses). Implementation time: 3 days. This is the highest ROI optimization for any application with repetitive query patterns.
Scenario 2: Latency-Sensitive Real-Time App (Streaming + Smaller Model) A copilot feature needs to respond in under 1 second for inline suggestions. Problem: GPT-4o at P95 2.8 seconds is too slow. Optimization strategy: (1) Enable streaming — first token appears in 300-400ms, making the UI feel responsive even before the full response arrives. (2) Route all copilot suggestions to GPT-4o-mini (P95: 400ms) except complex multi-step rewrites. (3) Implement speculative decoding for the most common completion patterns. Result: P95 TTFT drops from 2.8s to 380ms. User engagement with copilot increases 40% after the latency improvement.
Scenario 3: Self-Hosted 70B Model (Quantization + vLLM) An enterprise customer requires on-premises deployment of Llama-3-70B for data privacy. Hardware: 4x A100 80GB GPUs. In FP16, Llama-3-70B requires 140GB VRAM — needs all 4 GPUs with near-zero KV cache headroom. With AWQ 4-bit quantization: 35-40GB model size. Fits on 1 GPU, leaving 3 for KV cache. vLLM with PagedAttention serves 8x more concurrent requests than naive transformers inference. Quality benchmarks: AWQ 4-bit Llama-3-70B vs. FP16 shows less than 2% degradation on domain benchmarks. The enterprise uses 1/4 the hardware, serves the same traffic.
These scenarios show how to sidestep the three wrong intuitions above: the optimization mistakes engineers make when they first encounter LLM performance problems.
Senior engineers ask: "How do we make this faster?" Principal engineers ask: "What is the unit economics of inference for this application, and how does it scale with our growth plan?"
The principal perspective is that inference optimization is a product economics problem. The question is not "what is the fastest possible serving setup" but "what is the cost per quality-adjusted output at production scale, and how do we make that economics work for our business model?"
The frontier in 2025: speculative decoding (using a small draft model to generate candidate tokens that a larger model verifies in parallel, achieving near-large-model quality at small-model speed), FlashAttention-3 (attention kernels that run 1.5-2x faster on H100s through hardware-aware algorithms), and continuous KV cache compression (compressing the KV cache during generation to fit a longer effective context on the same hardware).
The five-year arc: inference will get dramatically cheaper through hardware improvements (H100, H200, B200 progression), software improvements (attention algorithms, quantization), and architectural changes (mixture-of-experts models like Mixtral that activate only 1/8 of parameters per token). Engineers who understand the stack from attention mechanism to serving infrastructure will own the optimization decisions that determine product economics.
Common questions:
Strong answer:
- Analyzes query patterns first (what % are near-duplicates? what % are simple vs. complex?) before recommending optimizations
- Knows the semantic caching → model routing → quantization → serving optimization ladder
- Understands that vLLM is the minimum viable serving stack for any self-hosted production model
- Can calculate the break-even point between API costs and self-hosting TCO (hardware + infra team time)
Red flags:
- Recommends switching to a smaller model as the first inference optimization without analyzing the query complexity distribution
- Believes batching always improves latency (confuses throughput with latency)
- No awareness of vLLM or continuous batching for self-hosted serving
- Assumes quantization is safe without task-specific quality evaluation
Scenario · An AI productivity startup, 500K users, $400K/month inference bill
Your AI assistant makes 3 LLM calls per user session — a context-gathering call, a generation call, and a summary call. All three use GPT-4o at $5/1M tokens. Investors are asking about the path to profitability. Inference is 60% of your COGS.
You analyze your 3 LLM calls per session. The context-gathering call (average 200 tokens in, 150 tokens out) runs on every session, even for returning users who have visited before. The generation call (average 500 tokens in, 400 tokens out) handles the core AI feature. The summary call (average 800 tokens in, 100 tokens out) creates a session summary. All three use GPT-4o.
Which call should you optimize first?
Key takeaways
What is the difference between improving throughput vs. improving latency, and why do they require different interventions?
Throughput = queries per second (total capacity). Latency = time per individual query. Larger batch sizes improve throughput but hurt individual query latency (more waiting). Smaller batches improve latency but reduce throughput. Continuous batching (vLLM) improves both simultaneously by allowing requests to join batches mid-flight.
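Little's law ties the two quantities together: in any stable system, in-flight requests = arrival rate × average latency. A one-line sketch:

```python
# Little's law: L = lambda * W (in-flight requests = throughput * latency).
# It shows the trade-off directly: at fixed concurrency (GPU capacity),
# raising throughput requires lowering per-request latency, and vice versa.
def concurrency(throughput_rps: float, latency_s: float) -> float:
    return throughput_rps * latency_s

# A server sustaining 50 req/s at 2s average latency holds 100 requests in flight.
print(concurrency(50, 2.0))  # 100.0
```

Continuous batching effectively raises the sustainable concurrency for the same hardware, which is how it improves throughput without the latency penalty of naive batching.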
Why does semantic caching have higher ROI than model switching for most inference cost problems?
Semantic caching returns cached responses in <10ms with zero LLM cost — it eliminates the API call entirely for matching queries. Model switching reduces per-call cost by 2-30x but still makes the call. When 30-40% of queries are near-duplicates, caching eliminates 30-40% of costs; switching only reduces the remaining 60-70% by a fraction.
What should you test before deploying a quantized model to production?
Run your full task-specific eval suite (not just MMLU) against the quantized model. Pay special attention to tasks involving arithmetic, exact formatting, and multi-step reasoning — these are most sensitive to precision loss. Require that quantized model quality is within 5% of FP16 on all eval strata before deploying.
From the books
AI Engineering — Chapter 13: Inference Optimization — Chip Huyen (2024)
Huyen frames inference optimization as a latency-throughput-cost trilemma: you can optimize any two at the expense of the third. The practical insight is to identify which dimension is the binding constraint for your application (cost for startups, latency for real-time features, throughput for batch processing) and optimize primarily for that dimension.
Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al., UC Berkeley (2023)
The paper introducing PagedAttention and vLLM. Key insight: KV cache memory fragmentation (from variable sequence lengths) wastes 20-40% of GPU VRAM in naive implementations. PagedAttention manages KV cache in fixed-size pages (like OS virtual memory), eliminating fragmentation and enabling the continuous batching that makes vLLM 2-4x more throughput-efficient than naive serving.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Lin et al., MIT (2023)
AWQ's key insight: not all weights are equally important for quantization. Weights that correspond to high-activation channels (the "important" computations) should be preserved at higher precision, while low-activation weights can be aggressively quantized. This activation-aware approach achieves better quality than uniform 4-bit quantization (GPTQ) at equivalent compression ratios.