The Physics of AI: Why Bigger is Not Always Better
Engineers default to "use the biggest newest model" because it feels safe. At prototype scale this is harmless. At production scale with millions of daily requests, it is a cost catastrophe — and the model you chose may not even be best for your task.
You spend 3x more on inference than necessary, ship a model competitors can beat with a cheaper alternative, and have no principled framework for the quarterly model evaluation cycle.
You can size models with the precision of an engineer, not the hope of a researcher. You make the case for or against any model choice in numbers, and you build a repeatable evaluation framework that becomes a competitive moat.
It is February 2023. The Llama team at Meta has just finished training. The headline: a 65B parameter model that beats GPT-3 (175B) on most benchmarks. Every ML engineer in the industry reads the same conclusion: "smaller models can beat larger models if trained longer." But the reason why took most people another six months to internalize.
The obvious explanation was "Meta used better data." The real explanation was Chinchilla scaling laws — a 2022 paper from DeepMind that demonstrated that most large language models were significantly undertrained. The optimal strategy was not to train the largest possible model for the compute budget, but to train a smaller model for much longer on much more data.
The teams that understood Chinchilla scaling laws before Llama dropped made a different strategic bet: they invested compute in more training tokens, not more parameters. Those that understood it after saw why a 65B model could beat 175B: the 65B model was trained on roughly 5x more data (1.4T tokens vs. GPT-3's 300B), which works out to more than 10x more tokens per parameter. This knowledge was worth billions in competitive advantage for any team that acted on it.
This is the fundamental pattern with scaling laws: the intuition that bigger is always better is systematically wrong in specific, predictable ways. Understanding those laws is the difference between informed model selection and expensive guessing.
This concept teaches you to select models with the precision of an engineer: using scaling laws to predict capability, cost analysis to set constraints, and capability filters to match the right model to each specific production task.
The most common scaling law intuition: larger models are always better. This is true on average across benchmarks and wrong in many specific deployment decisions. Here is why the intuition fails.
The Chinchilla result: DeepMind's 2022 paper trained over 400 models of varying sizes and compute budgets and found a consistent result: for a fixed compute budget, the best use of that compute is a smaller model trained on more data. Specifically, model size and training data should scale in roughly equal proportion, which works out to about 20 training tokens per parameter. GPT-3 (175B parameters) was trained on roughly 300B tokens, about 1.7 tokens per parameter. By the Chinchilla recipe, a compute-optimal 175B model would need around 3.5 trillion tokens, more than ten times what GPT-3 actually used.
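A back-of-the-envelope sketch makes the allocation concrete. It assumes the standard estimate of training compute C ≈ 6·N·D FLOPs and the ~20 tokens-per-parameter Chinchilla rule of thumb; both are approximations, not exact results from the paper.

```python
import math

# Back-of-the-envelope Chinchilla-optimal sizing. Assumptions: training compute
# C ~ 6 * N * D FLOPs (standard estimate) and a compute-optimal ratio of
# roughly 20 training tokens per parameter.
TOKENS_PER_PARAM = 20

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that are compute-optimal for a FLOP budget."""
    # C = 6 * N * D with D = 20 * N  =>  C = 120 * N**2
    n = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n

# GPT-3's approximate training budget: 175B params on 300B tokens
n_opt, d_opt = chinchilla_optimal(6 * 175e9 * 300e9)
print(f"compute-optimal: {n_opt / 1e9:.0f}B params on {d_opt / 1e12:.1f}T tokens")
# For GPT-3's own budget: roughly a 50B-parameter model on about 1T tokens
```

Note how close this lands to Llama's eventual shape: far fewer parameters than GPT-3, far more data.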
The inference cost problem: A model's benchmark score measures its capability ceiling. Its inference cost determines whether you can afford to reach that ceiling at production scale. GPT-4o might score 20% higher than Claude 3 Haiku on a reasoning benchmark. But at $2.50 vs. $0.25 per 1M input tokens (a 10x price difference), the same budget serves roughly an order of magnitude more traffic with Haiku. For tasks where both models score identically, such as simple classification, JSON extraction, and formatting, that cost difference is pure waste.
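To see the budget effect in numbers, here is a quick sketch. Prices come from the Level 2 pricing table later in this lesson (verify current rates); the $10K budget and 800-in/400-out token mix are illustrative assumptions.

```python
# How many requests a fixed monthly budget buys at a given price point.
# Prices in USD per 1M tokens; budget and token mix are illustrative.
BUDGET_USD = 10_000  # per month

def requests_per_budget(in_price: float, out_price: float,
                        in_tok: int = 800, out_tok: int = 400) -> float:
    per_request = (in_tok / 1e6) * in_price + (out_tok / 1e6) * out_price
    return BUDGET_USD / per_request

gpt4o = requests_per_budget(2.50, 10.00)   # GPT-4o
haiku = requests_per_budget(0.25, 1.25)    # Claude 3 Haiku
print(f"GPT-4o: {gpt4o:,.0f} requests/mo, Haiku: {haiku:,.0f} requests/mo")
# At this mix, Haiku serves roughly 8-9x more requests for the same budget
```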
The task-model mismatch: Models are evaluated on aggregate benchmarks that mix dozens of task types. A model that leads on MMLU might lose to a competitor on coding tasks. A model that wins on reasoning might lose on instruction following. The benchmark ranking is a portfolio average; your specific task might look completely different. The engineer who checks the specific task benchmark (HumanEval for code, IFEval for instruction following, GPQA for expert reasoning) makes a different model choice than the engineer who looks at the aggregate.
The correct mental model: model selection is a filtering problem, not a ranking problem. Start with your task requirements and cost constraints, then find the smallest model that passes both filters.
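A minimal sketch of the filter-then-pick-cheapest logic, with entirely hypothetical model names, scores, and thresholds:

```python
# Selection as filtering, not ranking. All candidates and numbers are
# hypothetical illustrations.
CANDIDATES = [
    # (name, task_accuracy, monthly_cost_usd, p95_latency_ms)
    ("small-model", 0.84, 500, 600),
    ("mid-model", 0.91, 3_000, 900),
    ("large-model", 0.94, 12_000, 1_900),
]

def select_model(min_accuracy: float, max_cost: float, max_p95_ms: int):
    viable = [
        m for m in CANDIDATES
        if m[1] >= min_accuracy and m[2] <= max_cost and m[3] <= max_p95_ms
    ]
    # Among models that pass every filter, take the cheapest, not the top-ranked
    return min(viable, key=lambda m: m[2], default=None)

print(select_model(0.90, 10_000, 1_000))  # the mid-sized model wins
print(select_model(0.95, 10_000, 1_000))  # None: relax a constraint or fine-tune
```

The second call illustrates the other useful property of filtering: when nothing passes, you learn which constraint to renegotiate instead of silently shipping the wrong model.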
Model selection exists at three levels of sophistication: the capability filter, the cost-latency filter, and the full production shootout. Each level builds on the previous.
```python
import anthropic
import openai

# Level 1: Capability filter — can the model do the task?
# Eliminates models that cannot reach minimum quality on YOUR task.
# Test each candidate on a representative sample from your actual use case.

# Start with a handful of candidates — there is no single best model across all tasks.
CANDIDATE_MODELS = [
    ("gpt-4o-mini", "openai"),
    ("gpt-4o", "openai"),
    ("claude-3-haiku-20240307", "anthropic"),
    ("claude-3-5-sonnet-20241022", "anthropic"),
]

def test_model_capability(model: str, provider: str, test_prompt: str) -> dict:
    """Run the model on a representative task and return the output for scoring."""
    if provider == "openai":
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": test_prompt}],
            temperature=0,  # deterministic output enables reproducible scoring
        )
        output = response.choices[0].message.content
    else:
        client = anthropic.Anthropic()
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            temperature=0,  # deterministic here too, for the same reason
            messages=[{"role": "user", "content": test_prompt}],
        )
        output = response.content[0].text

    return {"model": model, "output": output}

# Run against 50 representative queries from your production use case.
# Score each output against your task-specific rubric, not generic quality.
# Any model below 85% task accuracy is eliminated from consideration.
```
```python
from dataclasses import dataclass

# Level 2: Cost-latency filter — can we afford to run this at production scale?
# Apply after the capability filter: only models that PASS capability testing enter here.

@dataclass
class ModelPricing:
    input_cost_per_1m: float   # USD per 1M input tokens
    output_cost_per_1m: float  # USD per 1M output tokens
    p95_latency_ms: int        # observed p95 latency at moderate load

# Pricing as of early 2025 — verify current rates; these numbers go stale quickly.
MODEL_COSTS = {
    "gpt-4o-mini": ModelPricing(0.15, 0.60, 800),
    "gpt-4o": ModelPricing(2.50, 10.00, 1800),
    "claude-3-haiku-20240307": ModelPricing(0.25, 1.25, 600),
    "claude-3-5-sonnet-20241022": ModelPricing(3.00, 15.00, 2000),
}

@dataclass
class ServingConstraints:
    max_monthly_cost_usd: float
    p95_latency_ms: int
    daily_requests: int
    avg_input_tokens: int
    avg_output_tokens: int

def filter_by_cost_latency(
    passing_models: list[str],
    constraints: ServingConstraints,
) -> list[str]:
    """Keep only models that meet both the budget and the latency SLO."""
    affordable = []
    for model in passing_models:
        pricing = MODEL_COSTS[model]
        # Monthly cost at production scale is the number that matters,
        # not cost-per-call.
        input_cost = (constraints.avg_input_tokens / 1_000_000) * pricing.input_cost_per_1m
        output_cost = (constraints.avg_output_tokens / 1_000_000) * pricing.output_cost_per_1m
        monthly_cost = (input_cost + output_cost) * constraints.daily_requests * 30

        # Cost AND latency must both pass — a cheap model that is too slow is eliminated.
        cost_ok = monthly_cost <= constraints.max_monthly_cost_usd
        latency_ok = pricing.p95_latency_ms <= constraints.p95_latency_ms

        if cost_ok and latency_ok:
            affordable.append(model)
            print(f"{model}: PASS - monthly cost ${monthly_cost:,.0f}")
        else:
            print(f"{model}: ELIMINATED - cost ${monthly_cost:,.0f}/mo, latency {pricing.p95_latency_ms}ms")

    return affordable
```
```python
import asyncio
import time

from openai import AsyncOpenAI

# Level 3: The production shootout.
# Run all passing models on the full 200-question evaluation set at production
# concurrency — this is the definitive test. Score on the task-specific
# dimensions that matter for YOUR use case, not generic quality.

EVAL_DIMENSIONS = [
    "task_accuracy",      # Does it answer correctly?
    "format_compliance",  # Does it follow the output format?
    "latency_p95",        # Actual p95 at production concurrency
    "cost_per_correct",   # Cost divided by number of correct answers
]

async def shootout(
    models: list[str],
    eval_set: list[dict],
    concurrent_requests: int = 10,  # realistic production concurrency, not single-threaded
) -> dict[str, dict]:
    results = {}
    for model in models:
        client = AsyncOpenAI()
        # The semaphore caps in-flight requests so we measure the real
        # latency distribution under load.
        semaphore = asyncio.Semaphore(concurrent_requests)
        latencies: list[float] = []
        correct = 0
        format_ok = 0

        async def run_one(test_case):
            async with semaphore:
                start = time.time()
                response = await client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": test_case["prompt"]}],
                )
                latencies.append((time.time() - start) * 1000)
                return response.choices[0].message.content

        responses = await asyncio.gather(*(run_one(tc) for tc in eval_set))

        # Score each response with your task-specific scorers
        # (score_correctness and score_format are defined elsewhere).
        for response, test_case in zip(responses, eval_set):
            if score_correctness(response, test_case["expected"]):
                correct += 1
            if score_format(response, test_case["format"]):
                format_ok += 1

        # p95 latency: what the slowest 5% of users experience, not the average
        p95_latency = sorted(latencies)[int(len(latencies) * 0.95)]
        results[model] = {
            "accuracy": correct / len(eval_set),
            "format_compliance": format_ok / len(eval_set),
            "p95_latency_ms": p95_latency,
        }

    return results
```
The right model depends entirely on the specific task, volume, and constraints. Here are three production decisions where the same mental model produces different answers.
Scenario 1: Healthcare Startup (Accuracy Over Cost) A healthcare startup is building a clinical documentation assistant. Daily volume: 50,000 requests. Average request: 800 tokens in, 400 tokens out. The task: transform doctor notes into structured SOAP format. Budget: $15K/month. Claude 3.5 Sonnet passes capability evaluation at 93% SOAP compliance. GPT-4o-mini achieves 82% SOAP compliance, below the clinical accuracy requirement. Decision: Claude 3.5 Sonnet despite much higher cost (about $12.6K/month vs. roughly $540/month for mini at the Level 2 table rates). The accuracy constraint eliminates the cheaper option; you cannot trade clinical accuracy for cost savings.
Scenario 2: Social Platform (Latency Over Capability) A social platform needs to classify user-generated content into 12 categories for feed ranking. Volume: 5M classifications/day. Requirement: P95 latency < 200ms (user feed load time). GPT-4o classifies with 91% accuracy at 1,800ms p95 — too slow. GPT-4o-mini classifies with 88% accuracy at 600ms p95 — still too slow. A fine-tuned Llama-3-8B running on-premises classifies at 85% accuracy and 120ms p95 — the only option that meets the latency SLO. Decision: fine-tuned open-source model despite lower accuracy, because the latency constraint eliminates API-based models entirely.
Scenario 3: B2B SaaS (Routing Between Models) A contract management SaaS needs to handle two task types: simple clause classification ("is this an indemnification clause?") and complex risk analysis ("what are the liability implications of this clause?"). Volume: 20,000 requests/day, 60% simple / 40% complex. Decision: route simple tasks to GPT-4o-mini (0.4s, $0.005/request) and complex tasks to Claude 3.5 Sonnet (2s, $0.08/request). Monthly cost: $21,000 with routing, vs. $48,000 with Claude for everything, or $3,000 with mini for everything at unacceptable quality on complex analysis. The routing approach delivers near-premium quality at about 44% of the premium-only cost.
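Recomputing the monthly totals from the scenario's stated per-request costs ($0.005 simple, $0.08 complex) shows where the savings come from:

```python
# Recompute the routing scenario's monthly costs from its stated assumptions:
# 20,000 requests/day, 60% simple at $0.005/request, 40% complex at $0.08/request.
DAILY_REQUESTS = 20_000
SIMPLE_SHARE, COMPLEX_SHARE = 0.60, 0.40
COST_SIMPLE, COST_COMPLEX = 0.005, 0.08  # USD per request, scenario assumptions

routed = DAILY_REQUESTS * (SIMPLE_SHARE * COST_SIMPLE + COMPLEX_SHARE * COST_COMPLEX) * 30
all_premium = DAILY_REQUESTS * COST_COMPLEX * 30
all_cheap = DAILY_REQUESTS * COST_SIMPLE * 30

print(f"routed: ${routed:,.0f}/mo, all-premium: ${all_premium:,.0f}/mo, all-cheap: ${all_cheap:,.0f}/mo")
# routed: $21,000/mo, all-premium: $48,000/mo, all-cheap: $3,000/mo
```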
Three model selection mistakes that are seductive because they follow reasonable intuitions.
Senior engineers ask: "Which model should we use?" Principal engineers ask: "What is our model evaluation framework, and how do we make this decision systematically every quarter?"
The shift from senior to principal thinking is from one-time decision to repeatable process. A principal engineer does not pick a model — they design the capability-cost-latency evaluation framework that makes model selection a systematic engineering activity rather than an opinion.
The research frontier (2024-2025) shows that scaling laws are more nuanced than the original Chinchilla results suggested. For specific task types (coding, reasoning, instruction following), the optimal compute allocation differs from general language modeling. Mixture-of-experts (MoE) architectures like Mixtral and Gemini 1.5 achieve near-large-model quality with dramatically lower inference cost by activating only a fraction of parameters per token.
The five-year arc: model selection will increasingly involve choosing between dozens of specialized models (coding models, reasoning models, vision models) rather than one general-purpose model. The engineer who understands the capability profile of each model family — not just the aggregate benchmark ranking — will make better architectural decisions.
Common questions:
Strong answer:
- Immediately describes the three-filter framework (capability → cost → latency) as the model selection process
- Knows that production prompt lengths are often 3-5x longer than test prompts, and accounts for this in cost projections
- Has implemented model routing and can explain when it makes sense vs. single-model deployment
- Can articulate which task-specific benchmarks to use for different task types (HumanEval for code, IFEval for instruction following)
Red flags:
- Selects models based on benchmark rankings without task-specific evaluation
- Calculates cost at test volume rather than production volume
- Treats model selection as a one-time decision rather than a quarterly engineering activity
- Cannot explain the Chinchilla result or why smaller models can beat larger ones
Scenario · A legal tech startup, 30K law firm clients, Series B
You are building an AI assistant that reviews contracts for standard clause presence and flags unusual or missing clauses. The contracts average 8,000 tokens. You need P95 latency < 3 seconds, 90%+ accuracy on clause detection, and monthly cost under $25,000 at 50,000 contracts/day volume.
You have shortlisted four candidate models: GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, and Llama-3-70B self-hosted. Before calculating cost, you need to run a capability evaluation. You have 50 labeled contracts with known clause presence/absence. You run all four models on this eval set.
What score threshold should you use to eliminate models from the capability filter?
Quick check · Scaling Laws and Model Selection
Key takeaways
What did the Chinchilla scaling law reveal about how most LLMs were trained before 2022?
Most LLMs were significantly undertrained — they used too many parameters for the amount of training data. The optimal strategy is to scale model size and training data roughly equally. Smaller models trained longer on more data (like Llama) can outperform larger models trained on insufficient data.
Why does MMLU ranking not reliably predict performance on your specific production task?
MMLU measures aggregate performance across 57 academic subject areas. Your specific production task has a specific capability profile that may look nothing like the MMLU distribution. A model that ranks #3 on MMLU might rank #1 on your specific task and vice versa. Always evaluate on representative samples from your actual use case.
What is the correct order of the three model selection filters and why?
Capability first (does it meet quality threshold?), then cost (is it affordable at production volume?), then latency (does it meet P95 SLO?). This order matters: calculating cost for models that cannot meet quality requirements wastes time. Latency testing is expensive — do it last, only for models that pass capability and cost.
From the books
Training Compute-Optimal Large Language Models (Chinchilla) — Hoffmann et al., DeepMind (2022)
The paper that changed how frontier models are trained. The key result: most large models released before 2022 were significantly undertrained. Optimal training requires scaling parameters and data together. The insight that made Llama possible: a smaller model trained 5-10x longer on more data can outperform a larger model trained on insufficient data.
AI Engineering — Chapter 3: Model Selection — Chip Huyen (2024)
Huyen's framework: model selection is a filtering problem, not a ranking problem. Start with the capability constraint (can it do the task?), then the cost constraint (can we afford it at production volume?), then the latency constraint (is it fast enough for the use case?). Only models that pass all three filters are viable candidates.
Scaling Laws for Neural Language Models — Kaplan et al., OpenAI (2020)
The foundational scaling laws paper that showed model performance follows predictable power laws with respect to compute, model size, and training data. The original result: performance scales as a smooth function of each resource, with diminishing returns to any single axis. This paper is why engineers can make principled predictions about model capability before training.