The Physics of AI: Why Bigger is Not Always Better
Engineers default to "use the biggest newest model" because it feels safe. At prototype scale this is harmless. At production scale with millions of daily requests, it is a cost catastrophe — and the model you chose may not even be best for your task.
You spend 3x more on inference than necessary, ship a model competitors can beat with a cheaper alternative, and have no principled framework for the quarterly model evaluation cycle.
You can size models with the precision of an engineer, not the hope of a researcher. You make the case for or against any model choice in numbers, and you build a repeatable evaluation framework that becomes a competitive moat.
It is February 2023. The Llama team at Meta has just finished training. The headline: a 65B parameter model that beats GPT-3 (175B) on most benchmarks. Every ML engineer in the industry reads the same conclusion: "smaller models can beat larger models if trained longer." But the reason why took most people another six months to internalize.
The obvious explanation was "Meta used better data." The real explanation was Chinchilla scaling laws — a 2022 paper from DeepMind that demonstrated that most large language models were significantly undertrained. The optimal strategy was not to train the largest possible model for the compute budget, but to train a smaller model for much longer on much more data.
The teams that understood Chinchilla scaling laws before Llama dropped made a different strategic bet: they invested compute in more training tokens, not more parameters. Those that understood it after saw why a 65B model could beat 175B: the 65B model was trained on roughly 5x more data (1.4T tokens vs. GPT-3's 300B), which works out to more than 10x more tokens per parameter. This knowledge was worth billions in competitive advantage for any team that acted on it.
This is the fundamental pattern with scaling laws: the intuition that bigger is always better is systematically wrong in specific, predictable ways. Understanding those laws is the difference between informed model selection and expensive guessing.
This concept teaches you to select models with the precision of an engineer: using scaling laws to predict capability, cost analysis to set constraints, and capability filters to match the right model to each specific production task.
The most common scaling law intuition: larger models are always better. This is true on average across benchmarks and wrong in many specific deployment decisions. Here is why the intuition fails.
The Chinchilla result: DeepMind's 2022 paper trained over 400 models of varying sizes and compute budgets and found a consistent result: for a fixed compute budget, the best use of that compute is a smaller model trained on more data. Specifically, model size and training data should scale in roughly equal proportion, which works out to about 20 training tokens per parameter. GPT-3 (175B parameters) was trained on roughly 300B tokens, about 1.7 tokens per parameter. By the Chinchilla recipe, a compute-optimal 175B model would need around 3.5 trillion tokens, more than ten times what GPT-3 actually used.
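A back-of-the-envelope sketch makes the allocation concrete. It assumes the standard estimate of training compute C ≈ 6·N·D FLOPs and the ~20 tokens-per-parameter Chinchilla rule of thumb; both are approximations, not exact results from the paper.

```python
import math

# Back-of-the-envelope Chinchilla-optimal sizing. Assumptions: training compute
# C ~ 6 * N * D FLOPs (standard estimate) and a compute-optimal ratio of
# roughly 20 training tokens per parameter.
TOKENS_PER_PARAM = 20

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that are compute-optimal for a FLOP budget."""
    # C = 6 * N * D with D = 20 * N  =>  C = 120 * N**2
    n = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n

# GPT-3's approximate training budget: 175B params on 300B tokens
n_opt, d_opt = chinchilla_optimal(6 * 175e9 * 300e9)
print(f"compute-optimal: {n_opt / 1e9:.0f}B params on {d_opt / 1e12:.1f}T tokens")
# For GPT-3's own budget: roughly a 50B-parameter model on about 1T tokens
```

Note how close this lands to Llama's eventual shape: far fewer parameters than GPT-3, far more data.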
The inference cost problem: A model's benchmark score measures its capability ceiling. Its inference cost determines whether you can afford to reach that ceiling at production scale. GPT-4o might score 20% higher than Claude 3 Haiku on a reasoning benchmark. But at $2.50 vs. $0.25 per 1M input tokens (a 10x price difference), the same budget serves roughly an order of magnitude more traffic with Haiku. For tasks where both models score identically, such as simple classification, JSON extraction, and formatting, that cost difference is pure waste.
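To see the budget effect in numbers, here is a quick sketch. Prices come from the Level 2 pricing table later in this lesson (verify current rates); the $10K budget and 800-in/400-out token mix are illustrative assumptions.

```python
# How many requests a fixed monthly budget buys at a given price point.
# Prices in USD per 1M tokens; budget and token mix are illustrative.
BUDGET_USD = 10_000  # per month

def requests_per_budget(in_price: float, out_price: float,
                        in_tok: int = 800, out_tok: int = 400) -> float:
    per_request = (in_tok / 1e6) * in_price + (out_tok / 1e6) * out_price
    return BUDGET_USD / per_request

gpt4o = requests_per_budget(2.50, 10.00)   # GPT-4o
haiku = requests_per_budget(0.25, 1.25)    # Claude 3 Haiku
print(f"GPT-4o: {gpt4o:,.0f} requests/mo, Haiku: {haiku:,.0f} requests/mo")
# At this mix, Haiku serves roughly 8-9x more requests for the same budget
```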
The task-model mismatch: Models are evaluated on aggregate benchmarks that mix dozens of task types. A model that leads on MMLU might lose to a competitor on coding tasks. A model that wins on reasoning might lose on instruction following. The benchmark ranking is a portfolio average; your specific task might look completely different. The engineer who checks the specific task benchmark (HumanEval for code, IFEval for instruction following, GPQA for expert reasoning) makes a different model choice than the engineer who looks at the aggregate.
The correct mental model: model selection is a filtering problem, not a ranking problem. Start with your task requirements and cost constraints, then find the smallest model that passes both filters.
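A minimal sketch of the filter-then-pick-cheapest logic, with entirely hypothetical model names, scores, and thresholds:

```python
# Selection as filtering, not ranking. All candidates and numbers are
# hypothetical illustrations.
CANDIDATES = [
    # (name, task_accuracy, monthly_cost_usd, p95_latency_ms)
    ("small-model", 0.84, 500, 600),
    ("mid-model", 0.91, 3_000, 900),
    ("large-model", 0.94, 12_000, 1_900),
]

def select_model(min_accuracy: float, max_cost: float, max_p95_ms: int):
    viable = [
        m for m in CANDIDATES
        if m[1] >= min_accuracy and m[2] <= max_cost and m[3] <= max_p95_ms
    ]
    # Among models that pass every filter, take the cheapest, not the top-ranked
    return min(viable, key=lambda m: m[2], default=None)

print(select_model(0.90, 10_000, 1_000))  # the mid-sized model wins
print(select_model(0.95, 10_000, 1_000))  # None: relax a constraint or fine-tune
```

The second call illustrates the other useful property of filtering: when nothing passes, you learn which constraint to renegotiate instead of silently shipping the wrong model.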
Model selection exists at three levels of sophistication: the capability filter, the cost-latency filter, and the full production shootout. Each level builds on the previous.
```python
import anthropic
import openai

# Level 1: Capability filter — can the model do the task?
# Eliminates models that cannot reach minimum quality on YOUR task.
# Test each candidate on a representative sample from your actual use case.

# Start with a handful of candidates — there is no single best model across all tasks.
CANDIDATE_MODELS = [
    ("gpt-4o-mini", "openai"),
    ("gpt-4o", "openai"),
    ("claude-3-haiku-20240307", "anthropic"),
    ("claude-3-5-sonnet-20241022", "anthropic"),
]

def test_model_capability(model: str, provider: str, test_prompt: str) -> dict:
    """Run the model on a representative task and return the output for scoring."""
    if provider == "openai":
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": test_prompt}],
            temperature=0,  # deterministic output enables reproducible scoring
        )
        output = response.choices[0].message.content
    else:
        client = anthropic.Anthropic()
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            temperature=0,  # deterministic here too, for the same reason
            messages=[{"role": "user", "content": test_prompt}],
        )
        output = response.content[0].text

    return {"model": model, "output": output}

# Run against 50 representative queries from your production use case.
# Score each output against your task-specific rubric, not generic quality.
# Any model below 85% task accuracy is eliminated from consideration.
```
```python
from dataclasses import dataclass

# Level 2: Cost-latency filter — can we afford to run this at production scale?
# Apply after the capability filter: only models that PASS capability testing enter here.

@dataclass
class ModelPricing:
    input_cost_per_1m: float   # USD per 1M input tokens
    output_cost_per_1m: float  # USD per 1M output tokens
    p95_latency_ms: int        # observed p95 latency at moderate load

# Pricing as of early 2025 — verify current rates; these numbers go stale quickly.
MODEL_COSTS = {
    "gpt-4o-mini": ModelPricing(0.15, 0.60, 800),
    "gpt-4o": ModelPricing(2.50, 10.00, 1800),
    "claude-3-haiku-20240307": ModelPricing(0.25, 1.25, 600),
    "claude-3-5-sonnet-20241022": ModelPricing(3.00, 15.00, 2000),
}

@dataclass
class ServingConstraints:
    max_monthly_cost_usd: float
    p95_latency_ms: int
    daily_requests: int
    avg_input_tokens: int
    avg_output_tokens: int

def filter_by_cost_latency(
    passing_models: list[str],
    constraints: ServingConstraints,
) -> list[str]:
    """Keep only models that meet both the budget and the latency SLO."""
    affordable = []
    for model in passing_models:
        pricing = MODEL_COSTS[model]
        # Monthly cost at production scale is the number that matters,
        # not cost-per-call.
        input_cost = (constraints.avg_input_tokens / 1_000_000) * pricing.input_cost_per_1m
        output_cost = (constraints.avg_output_tokens / 1_000_000) * pricing.output_cost_per_1m
        monthly_cost = (input_cost + output_cost) * constraints.daily_requests * 30

        # Cost AND latency must both pass — a cheap model that is too slow is eliminated.
        cost_ok = monthly_cost <= constraints.max_monthly_cost_usd
        latency_ok = pricing.p95_latency_ms <= constraints.p95_latency_ms

        if cost_ok and latency_ok:
            affordable.append(model)
            print(f"{model}: PASS - monthly cost ${monthly_cost:,.0f}")
        else:
            print(f"{model}: ELIMINATED - cost ${monthly_cost:,.0f}/mo, latency {pricing.p95_latency_ms}ms")

    return affordable
```
```python
import asyncio
import time

from openai import AsyncOpenAI

# Level 3: The production shootout.
# Run all passing models on the full 200-question evaluation set at production
# concurrency — this is the definitive test. Score on the task-specific
# dimensions that matter for YOUR use case, not generic quality.

EVAL_DIMENSIONS = [
    "task_accuracy",      # Does it answer correctly?
    "format_compliance",  # Does it follow the output format?
    "latency_p95",        # Actual p95 at production concurrency
    "cost_per_correct",   # Cost divided by number of correct answers
]

async def shootout(
    models: list[str],
    eval_set: list[dict],
    concurrent_requests: int = 10,  # realistic production concurrency, not single-threaded
) -> dict[str, dict]:
    results = {}
    for model in models:
        client = AsyncOpenAI()
        # The semaphore caps in-flight requests so we measure the real
        # latency distribution under load.
        semaphore = asyncio.Semaphore(concurrent_requests)
        latencies: list[float] = []
        correct = 0
        format_ok = 0

        async def run_one(test_case):
            async with semaphore:
                start = time.time()
                response = await client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": test_case["prompt"]}],
                )
                latencies.append((time.time() - start) * 1000)
                return response.choices[0].message.content

        responses = await asyncio.gather(*(run_one(tc) for tc in eval_set))

        # Score each response with your task-specific scorers
        # (score_correctness and score_format are defined elsewhere).
        for response, test_case in zip(responses, eval_set):
            if score_correctness(response, test_case["expected"]):
                correct += 1
            if score_format(response, test_case["format"]):
                format_ok += 1

        # p95 latency: what the slowest 5% of users experience, not the average
        p95_latency = sorted(latencies)[int(len(latencies) * 0.95)]
        results[model] = {
            "accuracy": correct / len(eval_set),
            "format_compliance": format_ok / len(eval_set),
            "p95_latency_ms": p95_latency,
        }

    return results
```
The right model depends entirely on the specific task, volume, and constraints. Here are three production decisions where the same mental model produces different answers.
Scenario 1: Healthcare Startup (Accuracy Over Cost) A healthcare startup is building a clinical documentation assistant. Daily volume: 50,000 requests. Average request: 800 tokens in, 400 tokens out. The task: transform doctor notes into structured SOAP format. Budget: $15K/month. Claude 3.5 Sonnet passes capability evaluation at 93% SOAP compliance. GPT-4o-mini achieves 82% SOAP compliance, below the clinical accuracy requirement. Decision: Claude 3.5 Sonnet despite much higher cost (about $12.6K/month vs. roughly $540/month for mini at the Level 2 table rates). The accuracy constraint eliminates the cheaper option; you cannot trade clinical accuracy for cost savings.
Scenario 2: Social Platform (Latency Over Capability) A social platform needs to classify user-generated content into 12 categories for feed ranking. Volume: 5M classifications/day. Requirement: P95 latency < 200ms (user feed load time). GPT-4o classifies with 91% accuracy at 1,800ms p95 — too slow. GPT-4o-mini classifies with 88% accuracy at 600ms p95 — still too slow. A fine-tuned Llama-3-8B running on-premises classifies at 85% accuracy and 120ms p95 — the only option that meets the latency SLO. Decision: fine-tuned open-source model despite lower accuracy, because the latency constraint eliminates API-based models entirely.
Scenario 3: B2B SaaS (Routing Between Models) A contract management SaaS needs to handle two task types: simple clause classification ("is this an indemnification clause?") and complex risk analysis ("what are the liability implications of this clause?"). Volume: 20,000 requests/day, 60% simple / 40% complex. Decision: route simple tasks to GPT-4o-mini (0.4s, $0.005/request) and complex tasks to Claude 3.5 Sonnet (2s, $0.08/request). Monthly cost: $21,000 with routing, vs. $48,000 with Claude for everything, or $3,000 with mini for everything at unacceptable quality on complex analysis. The routing approach delivers near-premium quality at about 44% of the premium-only cost.
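Recomputing the monthly totals from the scenario's stated per-request costs ($0.005 simple, $0.08 complex) shows where the savings come from:

```python
# Recompute the routing scenario's monthly costs from its stated assumptions:
# 20,000 requests/day, 60% simple at $0.005/request, 40% complex at $0.08/request.
DAILY_REQUESTS = 20_000
SIMPLE_SHARE, COMPLEX_SHARE = 0.60, 0.40
COST_SIMPLE, COST_COMPLEX = 0.005, 0.08  # USD per request, scenario assumptions

routed = DAILY_REQUESTS * (SIMPLE_SHARE * COST_SIMPLE + COMPLEX_SHARE * COST_COMPLEX) * 30
all_premium = DAILY_REQUESTS * COST_COMPLEX * 30
all_cheap = DAILY_REQUESTS * COST_SIMPLE * 30

print(f"routed: ${routed:,.0f}/mo, all-premium: ${all_premium:,.0f}/mo, all-cheap: ${all_cheap:,.0f}/mo")
# routed: $21,000/mo, all-premium: $48,000/mo, all-cheap: $3,000/mo
```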
Three model selection mistakes that are seductive because they follow reasonable intuitions.
Senior engineers ask: "Which model should we use?" Principal engineers ask: "What is our model evaluation framework, and how do we make this decision systematically every quarter?"
The shift from senior to principal thinking is from one-time decision to repeatable process. A principal engineer does not pick a model — they design the capability-cost-latency evaluation framework that makes model selection a systematic engineering activity rather than an opinion.
The research frontier (2024-2025) shows that scaling laws are more nuanced than the original Chinchilla results suggested. For specific task types (coding, reasoning, instruction following), the optimal compute allocation differs from general language modeling. Mixture-of-experts (MoE) architectures like Mixtral and Gemini 1.5 achieve near-large-model quality with dramatically lower inference cost by activating only a fraction of parameters per token.
The five-year arc: model selection will increasingly involve choosing between dozens of specialized models (coding models, reasoning models, vision models) rather than one general-purpose model. The engineer who understands the capability profile of each model family — not just the aggregate benchmark ranking — will make better architectural decisions.
Common questions:
Strong answer:
- Immediately describes the three-filter framework (capability → cost → latency) as the model selection process
- Knows that production prompt lengths are often 3-5x longer than test prompts, and accounts for this in cost projections
- Has implemented model routing and can explain when it makes sense vs. single-model deployment
- Can articulate which task-specific benchmarks to use for different task types (HumanEval for code, IFEval for instruction following)
Red flags:
- Selects models based on benchmark rankings without task-specific evaluation
- Calculates cost at test volume rather than production volume
- Treats model selection as a one-time decision rather than a quarterly engineering activity
- Cannot explain the Chinchilla result or why smaller models can beat larger ones
Scenario · A legal tech startup, 30K law firm clients, Series B
You are building an AI assistant that reviews contracts for standard clause presence and flags unusual or missing clauses. The contracts average 8,000 tokens. You need P95 latency < 3 seconds, 90%+ accuracy on clause detection, and monthly cost under $25,000 at 50,000 contracts/day volume.
You have shortlisted four candidate models: GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, and Llama-3-70B self-hosted. Before calculating cost, you need to run a capability evaluation. You have 50 labeled contracts with known clause presence/absence. You run all four models on this eval set.
What score threshold should you use to eliminate models from the capability filter?
Quick check · Scaling Laws and Model Selection
Key takeaways
What did the Chinchilla scaling law reveal about how most LLMs were trained before 2022?
Most LLMs were significantly undertrained — they used too many parameters for the amount of training data. The optimal strategy is to scale model size and training data roughly equally. Smaller models trained longer on more data (like Llama) can outperform larger models trained on insufficient data.
Why does MMLU ranking not reliably predict performance on your specific production task?
MMLU measures aggregate performance across 57 academic subject areas. Your specific production task has a specific capability profile that may look nothing like the MMLU distribution. A model that ranks #3 on MMLU might rank #1 on your specific task and vice versa. Always evaluate on representative samples from your actual use case.
What is the correct order of the three model selection filters and why?
Capability first (does it meet quality threshold?), then cost (is it affordable at production volume?), then latency (does it meet P95 SLO?). This order matters: calculating cost for models that cannot meet quality requirements wastes time. Latency testing is expensive — do it last, only for models that pass capability and cost.
From the books
Training Compute-Optimal Large Language Models (Chinchilla) — Hoffmann et al., DeepMind (2022)
The paper that changed how frontier models are trained. The key result: most large models released before 2022 were significantly undertrained. Optimal training requires scaling parameters and data together. The insight that made Llama possible: a smaller model trained 5-10x longer on more data can outperform a larger model trained on insufficient data.
AI Engineering — Chapter 3: Model Selection — Chip Huyen (2024)
Huyen's framework: model selection is a filtering problem, not a ranking problem. Start with the capability constraint (can it do the task?), then the cost constraint (can we afford it at production volume?), then the latency constraint (is it fast enough for the use case?). Only models that pass all three filters are viable candidates.
Scaling Laws for Neural Language Models — Kaplan et al., OpenAI (2020)
The foundational scaling laws paper that showed model performance follows predictable power laws with respect to compute, model size, and training data. The original result: performance scales as a smooth function of each resource, with diminishing returns to any single axis. This paper is why engineers can make principled predictions about model capability before training.