Evaluating LLM Applications

On this page

You changed a prompt and have no idea if it got better
The principle: evals are unit tests for non-deterministic output
The shape of an eval pipeline
Choosing a scoring method
A tiny eval harness
Offline vs online: you need both
Common mistakes that cost hours
Takeaways
Where to go next

TL;DR

Treat evals as unit tests for non-deterministic output. Build golden datasets, deterministic checks, and LLM-as-judge scoring, then add regression gates so you ship on evidence instead of vibes. Run both offline and online.

You changed a prompt and have no idea if it got better

Here is the most common moment in LLM development. You tweak a system prompt, add a line, reword an instruction, swap a model. You run it on the one example you happen to have open, the output looks nicer, and you ship it. Then a week later a user complains the bot got *worse* at something you never thought to test, and you have no way to prove whether your change helped, hurt, or did nothing at all.

This is vibes-based shipping, and almost everyone does it at first. The reason is simple: with normal code you write a test and it passes or fails. With an LLM the output can differ from run to run, there is often no single "correct" answer, and "better" is a judgment call. So people skip evaluation entirely, which makes it both the hardest and the most-skipped part of building real AI products.

This article is the antidote. You will build a tiny eval harness that scores your prompt's outputs against a fixed set of examples, so the next time you change something you get a *number*, not a feeling.

Who this is for

Anyone who has wired up an LLM call and now wants to change it *safely*: app developers, prompt engineers, and AI engineers. You should be comfortable reading Python and have made at least one API call to a model. No ML background needed. If you have not written a prompt yet, start with Prompt Engineering Fundamentals.

The principle: evals are unit tests for non-deterministic output

If you cannot measure whether a change made your LLM app better, you are not engineering, you are gambling with extra steps.
The eval mindset

The mental model that unlocks everything: an eval is a unit test for a system whose output you cannot predict exactly. A normal unit test asserts add(2, 2) == 4. You can't do that with an LLM, because the same input can yield different wording on every run. So instead of asserting *exact equality*, an eval asserts that *a property holds*: the answer contains the right account ID, the JSON parses, the tone is professional, the summary doesn't invent facts.

Classic testing	LLM eval equivalent
Unit test asserting exact output	Exact-match scorer (when there IS one right answer)
Test asserting a property (sorted, non-empty)	Heuristic scorer (regex, JSON-parses, length, contains-keyword)
Code reviewer judging readability	LLM-as-judge scoring tone, relevance, faithfulness
Test suite you run before every merge	Eval suite you run before every prompt change
CI failing the build on a regression	Regression gate failing the deploy on a score drop

Same instinct as classic testing, just adapted for output you can't pin down exactly.
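To make the contrast concrete, here is a minimal sketch of property-style assertions. The sample `output` string and the helper names `parses_as_json` and `contains_account_id` are illustrative assumptions, not part of any library.

```python
import json
import re

# Hypothetical model output; real wording would vary between runs.
output = '{"account_id": "ACC-1042", "reply": "Your refund was issued today."}'

def parses_as_json(text: str) -> bool:
    """Property: the output is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def contains_account_id(text: str, account_id: str) -> bool:
    """Property: the expected account ID appears as a whole token."""
    return re.search(rf"\b{re.escape(account_id)}\b", text) is not None

# An exact-equality assert would break on any rewording.
# Property asserts keep passing as long as the must-haves hold.
assert parses_as_json(output)
assert contains_account_id(output, "ACC-1042")
```

Neither check cares how the reply is phrased, which is what lets the same test survive a different wording on the next run.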

The shape of an eval pipeline

Before writing code, picture the flow. You start with a golden set: fixed inputs paired with a description of what a good answer looks like. Each input runs through your prompt, the output goes to one or more scorers, and the scores roll up into an aggregate that a regression gate compares against your last known-good run.

An eval pipeline: golden set → run prompt → layered scorers → aggregate score → regression gate.

1. Collect a golden set
Gather 20–50 real inputs your app will actually see. Pull them from logs, support tickets, or your own head. For each, write down what a good output must contain, not the exact wording, just the must-haves.

2. Run the prompt over every case
Loop through the golden set, call the model with your current prompt for each input, and capture the raw output. Pin temperature low (or 0) so runs are comparable.

3. Score each output
Send every output through your scorers. Start with cheap deterministic checks (does it contain the order ID? does the JSON parse?), then add an LLM-judge for the fuzzy stuff (is the tone right? is it faithful to the source?).

4. Aggregate and gate
Average the scores into a pass rate per scorer. Compare against your last saved baseline. If the new run drops below it, the gate fails and you investigate before shipping.
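Step 1 is easier to keep honest when the golden set lives in a file with a fixed shape. The sketch below assumes a JSON Lines layout (one case per line). The `GoldenCase` class, the `load_golden` helper, and the `must_not_include` field are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass, field

@dataclass
class GoldenCase:
    input: str
    must_include: list[str]  # facts a good answer needs
    must_not_include: list[str] = field(default_factory=list)  # forbidden phrases

def load_golden(jsonl_text: str) -> list[GoldenCase]:
    """Parse one JSON object per line, skipping blank lines."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        # Reject vague cases early: every case needs an input and must-haves.
        if not record.get("input") or not record.get("must_include"):
            raise ValueError(f"line {line_no}: needs 'input' and 'must_include'")
        cases.append(GoldenCase(
            input=record["input"],
            must_include=record["must_include"],
            must_not_include=record.get("must_not_include", []),
        ))
    return cases

# Inline sample so the sketch runs as-is; in practice, read a file from disk.
SAMPLE = """
{"input": "What's the refund window?", "must_include": ["30 days"]}
{"input": "Is my data sold to third parties?", "must_include": ["no"], "must_not_include": ["sorry"]}
"""

cases = load_golden(SAMPLE)
print(len(cases), "cases loaded")
```

Failing at load time when a case has no must-haves keeps unscorable cases out of the set.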

Choosing a scoring method

There is no single eval method; you layer several, cheapest first. Run deterministic checks on every case because they cost nothing, and reserve expensive methods (model judges, humans) for the questions cheap checks can't answer.

Method	What it measures	Cost	Signal quality
Exact match	Output equals expected string (classification labels, structured IDs)	Near zero	Perfect, but only where one right answer exists
Heuristic	Properties: regex, JSON-parses, contains keyword, length, no banned words	Near zero	Good for structure/format; blind to meaning
LLM-as-judge	Fuzzy qualities: relevance, tone, faithfulness, helpfulness	$ per case (extra model call)	Strong if the judge prompt is calibrated; biased if not
Human review	Anything subtle a model still gets wrong; ground truth	$$$ + slow	Highest, but doesn't scale; use to calibrate the others

Eval methods, cheapest to most expensive. Layer them, don't pick just one.

Always start deterministic

A surprising amount of "quality" is just structure: valid JSON, the right field present, no apology when none is needed. Heuristic scorers catch these for free and run in milliseconds. Only reach for an LLM-judge once the deterministic checks pass and you still need to grade meaning.
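A few deterministic scorers, and the "judge only after the cheap checks pass" rule, might be sketched like this. All function names here are assumptions for illustration, and `fake_judge` is a stand-in that returns a fixed score so the example runs without a model.

```python
import json
import re

def json_parses(output: str) -> float:
    """1.0 if the output is valid JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def contains_word(output: str, word: str) -> float:
    """Whole-word, case-insensitive match, so 'no' does not match 'know'."""
    pattern = rf"\b{re.escape(word)}\b"
    return 1.0 if re.search(pattern, output, re.IGNORECASE) else 0.0

def no_banned_phrases(output: str, banned: list[str]) -> float:
    """1.0 if none of the banned phrases appear (case-insensitive)."""
    lowered = output.lower()
    return 0.0 if any(p.lower() in lowered for p in banned) else 1.0

def score_layered(output: str, required_word: str, banned: list[str], judge_fn) -> dict:
    """Run cheap checks first; call the judge only if they all pass."""
    checks = {
        "contains_word": contains_word(output, required_word),
        "no_banned": no_banned_phrases(output, banned),
    }
    if min(checks.values()) < 1.0:
        checks["judge"] = None  # skipped: no point paying to grade a broken answer
    else:
        checks["judge"] = judge_fn(output)
    return checks

# Stand-in judge so the sketch runs offline; a real one would call a model.
fake_judge = lambda output: 0.9

print(score_layered("Yes, we ship to Germany.", "yes", ["sorry"], fake_judge))
print(score_layered("Sorry, I don't know.", "yes", ["sorry"], fake_judge))
```

The whole-word match matters for short keywords: a plain substring test for "no" would also pass on "know" or "not".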

A tiny eval harness

Here is the whole idea in about 60 lines of Python: a golden set, two scorers (one heuristic, one LLM-judge), and an aggregate with a regression gate. The model call is stubbed behind run_prompt and judge so you can drop in whatever provider you use.

eval_harness.py

python

from dataclasses import dataclass
from statistics import mean

# 1. The golden set: real inputs + what a good answer must contain.
GOLDEN = [
    {"input": "What's the refund window?", "must_include": "30 days"},
    {"input": "Do you ship to Germany?", "must_include": "yes"},
    {"input": "Is my data sold to third parties?", "must_include": "no"},
]

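# Provider stubs: replace the bodies with real API calls to your provider.
def call_model(system: str, user: str, temperature: float = 0) -> str:
    raise NotImplementedError("wire up your model provider here")

def judge(prompt: str) -> str:
    # Ideally a different (often stronger) model than the one under test.
    raise NotImplementedError("wire up your judge model here")

# Your app's prompt under test.
def run_prompt(user_input: str) -> str:
    system = "You are a concise support agent. Answer in one sentence."
    return call_model(system=system, user=user_input, temperature=0)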

# --- Scorers: each returns a float in [0, 1] ---

def heuristic_scorer(output: str, case: dict) -> float:
    # Deterministic: does the answer contain the required fact?
    return 1.0 if case["must_include"].lower() in output.lower() else 0.0

JUDGE_PROMPT = (
    "Score the ANSWER from 0-1 for how clearly and politely it "
    "resolves the QUESTION. Reply with ONLY the number.\n"
    "QUESTION: {q}\nANSWER: {a}"
)

def llm_judge_scorer(output: str, case: dict) -> float:
    raw = judge(JUDGE_PROMPT.format(q=case["input"], a=output))
    try:
        return max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        return 0.0  # un-parseable judge output counts as a failed case

SCORERS = {"heuristic": heuristic_scorer, "judge": llm_judge_scorer}

@dataclass
class Result:
    scores: dict          # scorer name -> mean score
    overall: float

def evaluate() -> Result:
    per_scorer = {name: [] for name in SCORERS}
    for case in GOLDEN:
        output = run_prompt(case["input"])
        for name, fn in SCORERS.items():
            per_scorer[name].append(fn(output, case))
    scores = {name: mean(vals) for name, vals in per_scorer.items()}
    return Result(scores=scores, overall=mean(scores.values()))

# --- Regression gate: fail if we dropped below the saved baseline ---
BASELINE = 0.85  # load this from your last green run, don't hardcode forever

if __name__ == "__main__":
    result = evaluate()
    print("per-scorer:", result.scores, "overall:", round(result.overall, 3))
    assert result.overall >= BASELINE, (
        f"REGRESSION: {result.overall:.3f} < baseline {BASELINE}"
    )
    print("PASS, safe to ship.")

Run this before and after a prompt change: the same three cases, the same scorers, two numbers you can compare. That single assert is your regression gate. Wire it into CI and a prompt edit that quietly breaks an answer can no longer reach production. (Python skips assert statements when run with the -O flag, so run the harness without it.)
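The comment on BASELINE suggests loading it from the last green run instead of hardcoding it. One way that could look is sketched below. The file name `baseline.json`, the helper names, and the per-scorer comparison are assumptions, not something the harness above prescribes.

```python
import json
import tempfile
from pathlib import Path

def load_baseline(path: Path) -> dict:
    """Return saved per-scorer scores, or {} on the very first run."""
    return json.loads(path.read_text()) if path.exists() else {}

def find_regressions(scores: dict, baseline: dict) -> list[str]:
    """List every scorer whose mean score dropped below its baseline."""
    return [
        f"{name}: {scores.get(name, 0.0):.3f} < baseline {old:.3f}"
        for name, old in baseline.items()
        if scores.get(name, 0.0) < old
    ]

def gate(scores: dict, path: Path) -> bool:
    """Fail on any regression; otherwise save this green run as the new baseline."""
    regressions = find_regressions(scores, load_baseline(path))
    if regressions:
        print("REGRESSION:", "; ".join(regressions))
        return False
    path.write_text(json.dumps(scores, indent=2))
    return True

# Demo in a temp directory: the first run sets the baseline, the second regresses.
with tempfile.TemporaryDirectory() as tmp:
    baseline_file = Path(tmp) / "baseline.json"
    first = gate({"heuristic": 1.0, "judge": 0.90}, baseline_file)
    second = gate({"heuristic": 0.67, "judge": 0.91}, baseline_file)
    print(first, second)
```

Comparing per scorer rather than on the overall mean keeps a rising judge score from hiding a falling heuristic score. Because the file is only written on a green run, a failing change cannot lower the baseline.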

Offline vs online: you need both

Everything above is offline eval: a fixed golden set you run on demand. It is fast, repeatable, and perfect for catching regressions before you ship. But a golden set only contains the inputs you *thought of*. Real users will phrase things you never imagined.

That's where online eval comes in: scoring a sample of *live production traffic*. You log real inputs and outputs, run the same scorers (plus lightweight signals like thumbs-up/down, did-the-user-retry, did-they-escalate-to-a-human), and watch the trend. Online eval tells you what offline can't: how the app behaves on the long tail of real questions. The two feed each other: failures you spot online become new cases in your golden set, which makes your offline gate smarter over time. Wiring this into production logging and dashboards is the subject of LLMOps: Productionizing LLM Apps.
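As a rough sketch of the online side, the snippet below samples logged requests and summarizes a scorer alongside retry and escalation signals. The log record fields (`id`, `output`, `retried`, `escalated`) and the hash-based sampling are assumptions made for this example. A real setup would read from your own logging pipeline.

```python
import hashlib

def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically pick roughly `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate

def online_summary(records: list[dict], scorer, rate: float = 1.0) -> dict:
    """Score sampled records and report lightweight behavioural signals."""
    sampled = [r for r in records if in_sample(r["id"], rate)]
    if not sampled:
        return {"n": 0}
    n = len(sampled)
    return {
        "n": n,
        "mean_score": sum(scorer(r["output"]) for r in sampled) / n,
        "retry_rate": sum(r["retried"] for r in sampled) / n,
        "escalation_rate": sum(r["escalated"] for r in sampled) / n,
    }

# Hypothetical log records; the field names are assumptions for this sketch.
LOGS = [
    {"id": "r1", "output": "Refunds are accepted within 30 days.",
     "retried": False, "escalated": False},
    {"id": "r2", "output": "I'm not sure.", "retried": True, "escalated": True},
]

# Reuse a cheap deterministic scorer on live outputs.
mentions_days = lambda out: 1.0 if "days" in out.lower() else 0.0

print(online_summary(LOGS, mentions_days, rate=1.0))
```

Hashing the request ID means the same request is always in or out of the sample, so reruns over the same logs give the same numbers.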

Common mistakes that cost hours

1. No dataset at all. Eyeballing one example is not evaluation. Without a fixed golden set every "improvement" is anecdote. Twenty real cases beat zero, so start there and grow the set as you find failures.
2. Judge bias. An LLM-judge inherits the quirks of its model: it can favor longer answers, reward confident tone over correctness, and rate its *own* outputs higher than a rival model's. Calibrate the judge against human labels on a handful of cases before you trust it.
3. Only offline. A green offline suite feels safe, but it only tests inputs you imagined. Skipping online eval means real-world failures stay invisible until a user reports them.
4. Scoring with the model you're testing. Using the same model as both the app *and* the judge bakes in shared blind spots. Use a different (often stronger) model to judge.
5. A vague golden answer. "A good response" is unscorable. Write the *must-haves* (the fact, the format, the forbidden phrases) so a scorer or a human can decide objectively.
6. Letting the baseline rot. A regression gate is only as honest as its baseline. Update it from your last green run, and never lower it to make a failing change pass.
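Calibrating a judge against human labels (mistake 2) can start as a simple agreement rate. This is a minimal sketch: the scores and labels are made-up illustration data, and the 0.5 pass threshold and the `agreement_rate` helper are assumptions.

```python
def agreement_rate(judge_scores: list[float], human_labels: list[bool],
                   threshold: float = 0.5) -> float:
    """Fraction of cases where the judge's pass/fail matches the human's."""
    if len(judge_scores) != len(human_labels):
        raise ValueError("need one human label per judge score")
    if not human_labels:
        raise ValueError("need at least one labelled case")
    matches = sum(
        (score >= threshold) == label
        for score, label in zip(judge_scores, human_labels)
    )
    return matches / len(human_labels)

# Made-up data for illustration: judge scores and human pass/fail labels.
judge_scores = [0.9, 0.8, 0.3, 0.7, 0.2]
human_labels = [True, True, False, False, False]

rate = agreement_rate(judge_scores, human_labels)
print(f"judge agrees with humans on {rate:.0%} of cases")
```

The cases where the two disagree are the ones worth reading by hand. They usually point at what to tighten in the judge prompt.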

Takeaways

The whole article in seven lines

Evals are unit tests for non-deterministic output: assert *properties*, not exact equality.
Build a golden set first: 20–50 real inputs + the must-haves of a good answer.
Layer scorers cheapest-first: exact match → heuristic → LLM-as-judge → human review.
Deterministic checks are free and catch most structure bugs, so always run them.
A regression gate (new score ≥ baseline) is what kills vibes-based shipping.
Offline eval catches regressions; online eval catches the long tail. You need both.
Watch for judge bias: calibrate against humans and never let the model grade itself.

Where to go next

Evaluation is the discipline that turns prompt-tweaking into engineering. The next step is putting the harness somewhere it runs automatically and watching real traffic.

Sharpen the thing you're evaluating: Prompt Engineering Fundamentals.
Run evals in CI and score live traffic: LLMOps: Productionizing LLM Apps.
See where eval fits in the broader role: the AI Engineer career path.

Check your understanding

1. What is the mental model the article uses for an eval?

2. What does the article call shipping a prompt change based on one example that looks nicer?

Frequently asked questions

How can I unit-test an LLM when the output changes every run?

An eval is a unit test for a system whose output you cannot predict exactly. Instead of asserting exact equality, you assert that a property holds, such as the answer contains the right account ID, the JSON parses, or the summary doesn't invent facts.

What is vibes-based shipping and why is it a problem?

Vibes-based shipping is tweaking a prompt, running it on the one example you have open, deciding it looks nicer, and shipping. The problem is you have no way to prove whether the change helped, hurt, or did nothing, and a regression can slip into something you never thought to test.

Do I need both offline and online evaluation?

Yes, you need both. Offline evals score your prompt's outputs against a fixed golden set before you ship, while online evaluation watches real production behavior. Each catches problems the other misses.

What is a golden set and how does it fit into an eval pipeline?

A golden set is a fixed collection of inputs paired with what a good answer looks like. Each input runs through your prompt, the output goes to one or more scorers, and the scores roll up into an aggregate that a regression gate compares against your last known-good run.

