© 2026 TheSimplifiedTech. All rights reserved.

Interactive Explainer

LLM Evaluation and Testing

Measuring What Actually Matters in AI Systems

🎯Key Takeaways
Academic benchmarks predict production quality with r≈0.31 — build task-specific evals for every production application
LLM-as-judge is powerful but requires cross-family judging to avoid self-preference bias (15-20% score inflation)
Stratify eval sets by query type and weight by revenue/usage — average scores hide critical regressions
The evaluation pyramid: golden set in CI (every commit), LLM-judge weekly, human expert review monthly
Run failure mode analysis before switching models — categorize why the failing cases fail before adding compute
Evaluation is what allows an AI product to compound quality improvements — without it, each change is a random walk


~12 min read
Why this matters

You cannot improve what you cannot measure. LLM outputs are open-ended — you cannot write a unit test that checks if an answer is "good." Without evaluation infrastructure, every model change is a guess.

Without this knowledge

You ship model updates based on vibes. Engineers argue in Slack about whether the new version is better. You discover the regression in production when users complain. A competitor ships a feature that "just feels better" because they measured quality and you did not.

With this knowledge

Every model change is gated on measurable quality metrics. You know the exact tradeoffs between accuracy, latency, and cost for each model version. Regressions are caught in CI before they reach users.



The Scene: Shipping a Regression Without Knowing It

It is March 2024. A legal AI startup switches from GPT-4 to Claude 3 Opus. Claude 3 Opus benchmarks higher on most academic evaluations. The engineering team updates the API call and ships. Two weeks later, their NPS drops 18 points.

The support tickets tell the story: Claude 3 Opus was better at following complex instructions and worse at matching the specific citation format the startup's lawyers needed. The academic benchmarks measured reasoning ability, not citation format compliance. No one had measured the things that actually mattered to paying customers.

The team spent four weeks building the evaluation infrastructure they should have had before the first model switch. They built a 200-question golden dataset with human-labeled quality scores across five dimensions: citation accuracy, legal reasoning correctness, conciseness, format compliance, and hedging appropriateness. They ran automated LLM-as-judge evaluation using GPT-4o as the evaluator on all five dimensions.

The result: when they re-evaluated Claude 3 Opus vs. GPT-4 on their actual use cases, GPT-4 won on citation format (the thing customers cared about) while Claude 3 Opus won on reasoning depth (less critical for their current customer tier). They implemented a routing approach: Claude 3 Opus for complex analysis, GPT-4 for citation-heavy summary tasks.

The eval infrastructure paid for itself within two weeks. Every subsequent model change was a measured experiment, not a guess. This is what evaluation infrastructure actually does: it turns model changes from belief into knowledge.

Why "Does It Look Good?" Is Not Evaluation

The most common evaluation anti-pattern is human spot-checking: an engineer runs 10 queries, the answers look reasonable, and the change ships. This is not evaluation — it is confirmation bias at scale. You unconsciously pick queries you expect to work well and miss the failure modes you haven't thought of.

Academic benchmarks (MMLU, HumanEval, HellaSwag) are even more dangerous as proxies for production quality. These benchmarks measure things that are easy to measure: multiple-choice accuracy, code compilation, next-sentence prediction. They do not measure the things your users care about: Is the output in the right format? Does it match your brand voice? Is it appropriately uncertain when it should be? Does it avoid the specific errors your domain is sensitive to?

The research backing this up is damning: a 2024 Stanford study found that MMLU rankings predicted production quality in commercial AI applications with a correlation of only r≈0.31 — a weak relationship that explains less than 10% of the variance. Models that ranked similarly on MMLU had wildly different quality profiles on real-world tasks.

The correct mental model: evaluation is a product problem, not a benchmark problem. You need to measure the specific dimensions of quality that drive your specific business outcomes. A customer service bot needs to measure resolution rate, escalation rate, and tone appropriateness. A code assistant needs to measure code correctness, idiomatic style, and security vulnerability rate. A document summarizer needs to measure factual accuracy, completeness, and compression ratio.
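To make this concrete, quality dimensions can be declared as data per application, so every eval run scores exactly what the business cares about. A minimal sketch; the dimension names, descriptions, and weights here are illustrative assumptions, not from the lesson:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityDimension:
    name: str
    description: str
    weight: float  # relative business importance; weights sum to 1.0

# Hypothetical dimensions for the customer service bot example
SUPPORT_BOT_DIMENSIONS = [
    QualityDimension("resolution_rate", "Did the answer resolve the issue?", 0.5),
    QualityDimension("escalation_accuracy", "Did it escalate when it should?", 0.3),
    QualityDimension("tone", "Is the tone on-brand and appropriate?", 0.2),
]

def weighted_quality(scores: dict[str, float], dims: list[QualityDimension]) -> float:
    # Combine per-dimension scores (each 0-1) into one number a CI gate can check
    return sum(scores[d.name] * d.weight for d in dims)
```

Declaring the dimensions up front also forces the team to agree on what "quality" means before arguing about a score.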

Generic benchmarks are a starting point, not a destination. The teams that build evaluation infrastructure specific to their use case compound quality advantages over time — because they can actually measure quality, iterate toward it, and catch regressions before they become user complaints.

How Production Evaluation Actually Works: Three Levels

Evaluation maturity follows a progression from manual spot-checks to automated pipelines to continuous quality monitoring.

golden_set_eval.py

from openai import OpenAI
from typing import Dict

client = OpenAI()

# Level 1: Golden set evaluation: fast, deterministic, catches format regressions reliably.
# Human-labeled expected outputs for a fixed set of test queries.

# Each test case has: the query, expected keywords, and expected output format
GOLDEN_SET = [
    {
        "query": "What is the cancellation policy?",
        "expected_keywords": ["30 days", "written notice", "no penalty"],
        "expected_format": "bullet_list",
    },
    {
        "query": "Summarize the Q3 financial results in 3 bullet points.",
        "expected_keywords": ["revenue", "growth", "Q3"],
        "expected_format": "numbered_list",
    },
    # ... 50-200 more examples
]

def check_format(output: str, expected_format: str) -> float:
    # Format scoring: check for bullet lists, numbered lists, etc.
    lines = [line.lstrip() for line in output.splitlines()]
    if expected_format == "bullet_list":
        return 1.0 if any(line.startswith(("-", "*", "•")) for line in lines) else 0.0
    if expected_format == "numbered_list":
        return 1.0 if any(line[:2] in {f"{i}." for i in range(1, 10)} for line in lines) else 0.0
    return 0.0

def evaluate_golden_set(model: str, system_prompt: str) -> Dict:
    results = []
    for test in GOLDEN_SET:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": test["query"]},
            ],
        )
        output = response.choices[0].message.content

        # Keyword presence scoring: crude but fast. Catches when key facts are missing.
        keyword_score = sum(
            1 for kw in test["expected_keywords"] if kw.lower() in output.lower()
        ) / len(test["expected_keywords"])

        # Format compliance scoring
        format_score = check_format(output, test["expected_format"])

        results.append({
            "query": test["query"],
            "keyword_score": keyword_score,
            "format_score": format_score,
            "output": output,
        })

    return {
        "model": model,
        "avg_keyword_score": sum(r["keyword_score"] for r in results) / len(results),
        "avg_format_score": sum(r["format_score"] for r in results) / len(results),
        "results": results,
    }
llm_as_judge.py

import asyncio
from openai import AsyncOpenAI
from pydantic import BaseModel

client = AsyncOpenAI()

# Level 2: LLM-as-judge evaluation.
# Use a capable model (GPT-4o) to evaluate outputs from your production model.
# Captures quality dimensions that keyword matching cannot (nuance, tone, reasoning).

class JudgeScore(BaseModel):
    # Pydantic model ensures the judge returns parseable scores, not free-form text
    accuracy: int           # 1-5: Is the answer factually correct?
    completeness: int       # 1-5: Does it address all parts of the query?
    format_compliance: int  # 1-5: Does it follow the required format?
    conciseness: int        # 1-5: Is it appropriately concise?
    reasoning: str          # Explanation of scores

# Domain-specific dimensions (accuracy, completeness, format, conciseness),
# not a single generic "quality" score
JUDGE_PROMPT = """You are evaluating an AI assistant's response for a legal document analysis system.

Query: {query}
Reference Answer: {reference}
Model Response: {response}

Score each dimension from 1-5:
- accuracy: Is the response factually correct compared to the reference?
- completeness: Does it address all key points from the reference?
- format_compliance: Does it match the expected output format?
- conciseness: Is it concise without omitting critical information?

Respond with JSON only."""

async def judge_response(query: str, reference: str, response: str) -> JudgeScore:
    # response_format=JudgeScore: structured output prevents the judge
    # from adding unscored commentary
    completion = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                query=query, reference=reference, response=response
            ),
        }],
        response_format=JudgeScore,
    )
    return completion.choices[0].message.parsed

async def run_llm_judge_eval(test_cases: list, model: str) -> dict:
    # Async evaluation: run judge calls in parallel, so 50 evals take
    # seconds instead of minutes.
    # get_model_response: assumed helper that calls the production model
    # and returns its text output.
    tasks = [
        judge_response(tc["query"], tc["reference"], get_model_response(model, tc["query"]))
        for tc in test_cases
    ]
    scores = await asyncio.gather(*tasks)

    return {
        "model": model,
        "avg_accuracy": sum(s.accuracy for s in scores) / len(scores),
        "avg_completeness": sum(s.completeness for s in scores) / len(scores),
        "avg_format": sum(s.format_compliance for s in scores) / len(scores),
        "avg_conciseness": sum(s.conciseness for s in scores) / len(scores),
    }
eval_pipeline.py

import asyncio
from datetime import datetime, timezone
from typing import Optional

import wandb

from golden_set_eval import evaluate_golden_set
from llm_as_judge import run_llm_judge_eval

# Level 3: Continuous evaluation pipeline.
# Runs on every model/prompt change in CI, tracks quality over time.

class EvalPipeline:
    def __init__(self, project_name: str):
        self.project = project_name
        wandb.init(project=project_name, job_type="eval")

    def run_eval(
        self,
        model: str,
        system_prompt: str,
        test_suite: str = "production_v2",
        baseline_model: Optional[str] = None,
    ) -> dict:
        # load_test_suite: assumed helper returning labeled test cases for the suite
        test_cases = load_test_suite(test_suite)

        # Run both eval methods: golden set (fast, deterministic)
        # and LLM-judge (slower, qualitative)
        golden_results = evaluate_golden_set(model, system_prompt)
        judge_results = asyncio.run(run_llm_judge_eval(test_cases, model))

        # Compare to baseline if provided
        regression_alert = False
        keyword_delta = 0.0
        if baseline_model:
            baseline = evaluate_golden_set(baseline_model, system_prompt)
            keyword_delta = golden_results["avg_keyword_score"] - baseline["avg_keyword_score"]
            # Regression threshold at 5%: fail the CI check if quality
            # drops more than 5 percentage points
            if keyword_delta < -0.05:
                regression_alert = True

        # Log all metrics to experiment tracking: compare quality across
        # models, prompts, and time in one dashboard
        wandb.log({
            "model": model,
            "test_suite": test_suite,
            "keyword_score": golden_results["avg_keyword_score"],
            "format_score": golden_results["avg_format_score"],
            "judge_accuracy": judge_results["avg_accuracy"],
            "judge_completeness": judge_results["avg_completeness"],
            "regression_alert": regression_alert,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

        # Raise on regression: the CI pipeline fails, blocking deployment
        # of a regressing change
        if regression_alert:
            raise ValueError(f"Quality regression detected: {keyword_delta:.2%} keyword score drop")

        return {**golden_results, **judge_results, "regression_alert": regression_alert}

Three Evaluation Architectures, Three Use Cases

The right evaluation architecture depends on your application domain and the cost of errors.

Scenario 1: Customer Service Bot (Speed + Format Focus) A retail company's support bot handles 50,000 queries per day. Evaluation priorities: response time, format compliance (does it follow the 5-step resolution template?), and escalation accuracy (does it escalate when it should?). Solution: a 200-question golden set with keyword-matching evaluation runs in CI on every commit. Three human-labeled dimensions are scored weekly on a random 1% sample of production queries. LLM-judge evaluation is too expensive at $0.02/query × 50,000 queries/day. The golden set catches format regressions in 90 seconds; the weekly human sample catches drift. Cost: ~$15/month.

Scenario 2: Medical Information Tool (Accuracy as Safety Requirement) A hospital's patient information tool must be factually accurate — wrong medical information is a safety event. Evaluation priorities: factual accuracy, appropriate uncertainty expression, and medical terminology correctness. Solution: three evaluation layers. First, a 500-question golden set with reference answers written by clinical pharmacists (format and keyword matching). Second, LLM-as-judge with a medical domain-specific judge prompt evaluating accuracy and uncertainty calibration. Third, a weekly 50-question human expert review. Every model or prompt change requires approval from all three layers. Conservative regression threshold: any metric dropping >2% blocks deployment.
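The conservative gate in this scenario reduces to a comparison of every metric against its baseline. A hedged sketch, assuming metrics are normalized to 0-1 and the 2% threshold applies uniformly to every dimension (the function name and signature are illustrative):

```python
def gate_release(
    current: dict[str, float],
    baseline: dict[str, float],
    threshold: float = 0.02,
) -> bool:
    # Every metric must stay within `threshold` of its baseline value;
    # a drop of more than 2 percentage points on ANY metric blocks deployment.
    return all(current[m] >= baseline[m] - threshold for m in baseline)
```

A single number like this is what CI can act on: return False, fail the pipeline, block the deploy.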

Scenario 3: Code Generation Tool (Functional Correctness) A code assistant generates Python code for data transformations. Evaluation is the most objective: run the generated code. Solution: an execution-based eval. Each test case is a spec + a set of input-output examples. The model generates code, the evaluator executes it, and checks whether outputs match. Pass rate, runtime, and security scan results are the metrics. No LLM-as-judge needed — functional correctness is binary. Human evaluation covers code readability and style on a 5% sample. This is the gold standard of evaluation: when you can run the output and check correctness, use it.
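An execution-based evaluator of the kind Scenario 3 describes can be sketched in a few lines. This is illustrative only: the function name and example format are assumptions, and a real harness would execute untrusted generated code inside an isolated sandbox rather than in-process.

```python
def run_execution_eval(generated_code: str, func_name: str, examples: list[tuple]) -> float:
    """Run generated code against (args, expected_output) examples; return pass rate."""
    namespace: dict = {}
    try:
        # WARNING: exec of untrusted model output; sandbox this in production
        exec(generated_code, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that does not even load scores zero
    passed = 0
    for args, expected in examples:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures
    return passed / len(examples)
```

The binary pass/fail signal per example is what makes this the gold standard: no labeled reference prose, no judge model, no calibration debate.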

The Traps That Make Evaluation Misleading

Evaluation done wrong is worse than no evaluation — it gives false confidence that hides real problems.

How Principals Think About Evaluation Strategy

Senior engineers ask: "How do we know if this model is good?" Principal engineers ask: "What are the measurable quality dimensions that drive our business outcomes, and how do we monitor them continuously?"

The principal perspective is that evaluation is a product investment, not a technical formality. Evaluation infrastructure is what allows an AI product to compound quality improvements over time. Without it, each change is a random walk. With it, you can run controlled experiments, measure the effect of every change, and make progress systematically.

The frontier in 2024-2025 is evaluation efficiency: how do you get high-quality signals about output quality without expensive human labeling at scale? The emerging techniques are: (1) weak supervision — using imperfect automated signals (keyword matching, execution tests) as proxies for expensive signals (human judgment), then calibrating the correlation; (2) embedding-based evaluation — measuring the semantic distance between model outputs and reference answers using embedding similarity; (3) adversarial evaluation — using red-teaming models to generate failure cases that human evaluators might miss.
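Embedding-based evaluation reduces to a cosine similarity between the output vector and the reference vector. A minimal sketch, where `embed` is any text-to-vector function (an embeddings API client, for instance) and is assumed rather than implemented here:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity: dot product over the product of norms
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def embedding_eval_score(embed, output: str, reference: str) -> float:
    # embed: assumed callable mapping text to a vector (e.g. an embeddings API)
    return cosine_similarity(embed(output), embed(reference))
```

The usual caveat applies: high embedding similarity means "about the same topic", not "factually equivalent", so this is a cheap proxy to calibrate against human judgment, not a replacement for it.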

The five-year arc: evaluation will become more automated and more integrated into the development loop. The model that wins in a domain will be the one built by the team with the best eval infrastructure — because evaluation is what allows you to measure progress, catch regressions, and ship quality improvements faster than competitors who are making changes on vibes.

How this might come up in interviews

Common questions:

  • How do you evaluate an LLM application's quality?
  • What is LLM-as-judge and what are its limitations?
  • How do you catch quality regressions before they reach production?
  • A model upgrade saves 40% on cost but your eval score drops 0.3. How do you decide whether to ship it?
  • How do you build an evaluation dataset?
  • What is wrong with using MMLU to compare models for a production application?

Strong answer:

  • Immediately asks "what quality dimensions matter for this specific application?" before discussing metrics
  • Knows the limitations of LLM-as-judge (self-preference bias) and how to mitigate them
  • Describes a stratified eval approach that covers all query types weighted by usage
  • Has implemented CI-integrated evaluation that gates model changes on measurable quality thresholds

Red flags:

  • Relies on manual spot-checking or "it looks good" as the primary quality signal
  • Uses academic benchmarks (MMLU, HumanEval) as a proxy for production quality
  • No mention of holding separate eval dimensions or stratifying by query type
  • Uses the same model as both production model and LLM judge without acknowledging self-preference bias

Scenario · A legal research startup, 15,000 active users, GPT-4 backend


You Are the AI Lead. Build the Eval Infrastructure.

Your legal research assistant is used daily by lawyers for case research. Quality complaints are growing, but you have no evaluation infrastructure and no systematic way to quantify what "quality" means for your users. Your head of engineering wants to switch from GPT-4 to Claude 3 Haiku to cut inference costs by 60%. You need to make a recommendation.

What do you do before any model switch?

Quick check · LLM Evaluation and Testing


Your LLM application achieves 82% on MMLU. A competitor claims 78% on MMLU. Should you use this to conclude your product is better?

Before you move on: can you answer these?

Why should you not use the same model as both your production model and your LLM judge?

Models exhibit self-preference bias — they score outputs from their own training distribution 15-20% higher on subjective dimensions. Use a different model family as the judge (Claude to judge GPT-4, GPT-4 to judge Claude) to avoid circular evaluation.
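In code, cross-family judging can be as simple as a lookup that never assigns a judge from the model-under-test's own family. The model names and the prefix-based family detection below are illustrative assumptions, not a prescribed mapping:

```python
# Judge assignment by family: a GPT model is judged by a Claude judge and
# vice versa, so the judge never scores outputs from its own family.
JUDGE_BY_FAMILY = {
    "openai": "claude-3-opus",
    "anthropic": "gpt-4o",
}

def pick_judge(model_under_test: str) -> str:
    # Crude family detection by model-name prefix (assumed convention)
    family = "openai" if model_under_test.startswith("gpt") else "anthropic"
    return JUDGE_BY_FAMILY[family]
```

Encoding the rule in the pipeline, rather than in a team convention, is what prevents someone from quietly pointing the GPT-4o judge at GPT-4o outputs during a late-night model swap.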

What is the difference between evaluating average quality vs. stratified quality by query type?

Average quality hides concentrated regressions. A 0.3 average drop might be 0.3 uniform across all types (usually acceptable) or 2.0 on one critical type affecting your highest-value users (not acceptable). Always analyze quality by segment.
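A sketch of segment-level scoring under these assumptions (the field names `segment` and `score`, and the usage weights, are hypothetical):

```python
def stratified_scores(results: list[dict]) -> dict[str, float]:
    # Average score per query-type segment instead of one global mean
    by_segment: dict[str, list[float]] = {}
    for r in results:
        by_segment.setdefault(r["segment"], []).append(r["score"])
    return {seg: sum(s) / len(s) for seg, s in by_segment.items()}

def weighted_overall(segment_scores: dict[str, float], weights: dict[str, float]) -> float:
    # Weight segments by usage or revenue share rather than query count
    return sum(segment_scores[seg] * w for seg, w in weights.items())

def worst_segment(segment_scores: dict[str, float]) -> tuple[str, float]:
    # Gate releases on the worst segment, not just the average
    return min(segment_scores.items(), key=lambda kv: kv[1])
```

Reporting both the weighted overall and the worst segment makes a concentrated regression impossible to hide inside a healthy-looking mean.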

When is execution-based evaluation (running the output) preferable to LLM-as-judge evaluation?

When the output is functional (code, SQL, structured data) and can be run to check correctness. Execution-based eval is fully objective, requires no labeled reference answers, and provides binary pass/fail signal. It is the gold standard when applicable.

From the books

AI Engineering — Chapter 11: Evaluation — Chip Huyen (2024)

Huyen's key distinction: functional evaluation (does it do the task?) vs. alignment evaluation (does it do it in the way users expect?). Most teams build functional eval first and discover they need alignment eval when users complain about tone, format, or style — even when the answer is technically correct.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al., UC Berkeley (2023)

The paper that systematized LLM-as-judge evaluation, demonstrating that GPT-4 as an evaluator achieves >80% agreement with human expert evaluators on many tasks. Also documents the self-preference bias quantitatively, showing models prefer their own outputs by 10-20% on subjective dimensions.

HELM: Holistic Evaluation of Language Models — Liang et al., Stanford CRFM (2022)

HELM's key insight: model quality is multi-dimensional and no single benchmark captures it. HELM evaluates 7 dimensions (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) simultaneously. The lesson for production: define your quality dimensions before choosing metrics.
