You changed a prompt. Is it actually better, or does it just feel better? This is how to build real evals: golden datasets, deterministic checks, LLM-as-judge, and regression gates that kill vibes-based shipping.
You changed a prompt and have no idea if it got better
Here is the most common moment in LLM development. You tweak a system prompt, add a line, reword an instruction, swap a model. You run it on the one example you happen to have open, the output looks nicer, and you ship it. Then a week later a user complains the bot got *worse* at something you never thought to test, and you have no way to prove whether your change helped, hurt, or did nothing at all.
This is vibes-based shipping, and almost everyone does it at first. The reason is simple: with normal code you write a test, it passes or fails, done. With an LLM the output can differ on every run, there is often no single "correct" answer, and "better" is a judgment call. So people skip evaluation entirely, which makes it both the hardest and the most-skipped part of building real AI products.
This article is the antidote. You will build a tiny eval harness that scores your prompt's outputs against a fixed set of examples, so the next time you change something you get a *number*, not a feeling.
Who this is for
Anyone who has wired up an LLM call and now wants to change it *safely*: app developers, prompt engineers, and AI engineers. You should be comfortable reading Python and have made at least one API call to a model. No ML background needed. If you have not written a prompt yet, start with [Prompt Engineering Fundamentals](/blog/prompt-engineering-fundamentals).
The principle: evals are unit tests for non-deterministic output
If you cannot measure whether a change made your LLM app better, you are not engineering, you are gambling with extra steps.
The mental model that unlocks everything: an eval is a unit test for a system whose output you cannot predict exactly. A normal unit test asserts add(2, 2) === 4. You can't do that with an LLM, because the same input can yield different wording on every run. So instead of asserting *exact equality*, an eval asserts that *a property holds*: the answer contains the right account ID, the JSON parses, the tone is professional, the summary doesn't invent facts.
Unit test asserting exact output → Exact-match scorer (when there IS one right answer)
Test asserting a property (sorted, non-empty) → Heuristic scorer (regex, JSON-parses, length, contains-keyword)
Test suite you run before every merge → Eval suite you run before every prompt change
CI failing the build on a regression → Regression gate failing the deploy on a score drop
Same instinct as classic testing, just adapted for output you can't pin down exactly.
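To make the contrast concrete, here is a minimal sketch of a property check. The two sample outputs, the `ACCOUNT_ID` value, and the specific properties are made up for illustration. Both phrasings pass the property assertions, while an exact-equality assertion could accept only one of them.

```python
import re

# Hypothetical account ID the answer must mention (illustrative only).
ACCOUNT_ID = "ACC-10482"

# Two differently worded outputs a model might produce for the same input.
output_a = "Your account ACC-10482 is active and in good standing."
output_b = "Good news: account ACC-10482 is currently active."


def check_properties(output: str) -> dict:
    """Return a named pass/fail result for each property we care about."""
    sentence_ends = re.findall(r"[.!?](?:\s|$)", output)
    return {
        "mentions_account_id": ACCOUNT_ID in output,
        "single_sentence": len(sentence_ends) == 1,
        "under_120_chars": len(output) <= 120,
    }


for out in (output_a, output_b):
    results = check_properties(out)
    assert all(results.values()), results

# Exact equality would reject one of two equally acceptable answers.
assert output_a != output_b
```

Returning a dictionary of named results, rather than a single boolean, makes it easier to see which property failed when a case goes red.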
The shape of an eval pipeline
Before writing code, picture the flow. You start with a golden set: fixed inputs paired with a description of what a good answer looks like. Each input runs through your prompt, the output goes to one or more scorers, and the scores roll up into an aggregate that a regression gate compares against your last known-good run.
An eval pipeline: golden set → run prompt → layered scorers → aggregate score → regression gate.
1. Collect a golden set
Gather 20–50 real inputs your app will actually see. Pull them from logs, support tickets, or your own head. For each, write down what a good output must contain, not the exact wording, just the must-haves.
2. Run the prompt over every case
Loop through the golden set, call the model with your current prompt for each input, and capture the raw output. Pin temperature low (or 0) so runs are comparable.
3. Score each output
Send every output through your scorers. Start with cheap deterministic checks (does it contain the order ID? does the JSON parse?), then add an LLM-judge for the fuzzy stuff (is the tone right? is it faithful to the source?).
4. Aggregate and gate
Average the scores into a pass rate per scorer. Compare against your last saved baseline. If the new run drops below it, the gate fails and you investigate before shipping.
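Step 1 can be sketched as a small loader. The JSONL layout and the field names (`input`, `must_include`, `must_not_include`) are assumptions chosen for this example, not a standard format. The loader rejects cases that have no must-haves, since those cannot be scored.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    input: str
    must_include: tuple = ()
    must_not_include: tuple = ()


def parse_golden(jsonl_text: str) -> list:
    """Parse one JSON object per line into Case records, skipping blank lines."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        if not row.get("input"):
            raise ValueError(f"line {line_no}: missing 'input'")
        if not row.get("must_include") and not row.get("must_not_include"):
            raise ValueError(f"line {line_no}: no must-haves, case is unscorable")
        cases.append(Case(
            input=row["input"],
            must_include=tuple(row.get("must_include", ())),
            must_not_include=tuple(row.get("must_not_include", ())),
        ))
    return cases


# Inline sample data; in practice this text would come from a file you version.
SAMPLE = """
{"input": "What's the refund window?", "must_include": ["30 days"]}
{"input": "Do you ship to Germany?", "must_include": ["yes"], "must_not_include": ["sorry"]}
"""

golden = parse_golden(SAMPLE)
print(len(golden), "cases loaded")
```

Keeping the golden set in a plain text file next to the prompt means changes to either one show up in the same code review.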
Choosing a scoring method
There is no single "the" eval method. You layer several, cheapest first. Run deterministic checks on every case because they cost almost nothing, and reserve expensive methods (model judges, humans) for the questions cheap checks can't answer.
LLM-as-judge. Reliability: strong if the judge prompt is calibrated; biased if not.
Human review. Best for: anything subtle a model still gets wrong; ground truth. Cost: $$$ + slow. Reliability: highest, but doesn't scale; use it to calibrate the others.
Eval methods, cheapest to most expensive. Layer them, don't pick just one.
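One way to layer methods cheapest-first is to short-circuit: run the deterministic checks, and call the judge only for outputs that pass them. The sketch below assumes that design. `fake_judge` is a stand-in for a real model call, and the two checks are illustrative.

```python
from typing import Callable

judge_calls = 0  # counts how often the expensive scorer is invoked


def fake_judge(output: str) -> float:
    """Stand-in for an LLM-judge call; a real one would hit a model API."""
    global judge_calls
    judge_calls += 1
    return 0.9


def layered_score(
    output: str,
    cheap_checks: list,
    judge_fn: Callable[[str], float],
) -> float:
    """Return 0.0 as soon as a cheap check fails; otherwise ask the judge."""
    for check in cheap_checks:
        if not check(output):
            return 0.0
    return judge_fn(output)


checks = [
    lambda o: len(o.strip()) > 0,      # non-empty
    lambda o: "30 days" in o.lower(),  # required fact present
]

good = layered_score("Refunds are accepted within 30 days.", checks, fake_judge)
bad = layered_score("Please contact support.", checks, fake_judge)
print(good, bad, judge_calls)  # 0.9 0.0 1
```

The second output never reaches the judge, so the expensive call happens once instead of twice. On a large golden set, that difference adds up.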
Always start deterministic
A surprising amount of "quality" is just structure: valid JSON, the right field present, no apology when none is needed. Heuristic scorers catch these for free and run in milliseconds. Only reach for an LLM-judge once the deterministic checks pass and you still need to grade meaning.
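The structural checks mentioned above might look like the following sketch. The function names, the `order_id` field, and the apology pattern are illustrative assumptions; adapt them to whatever your own output format requires.

```python
import json
import re

# Matches "sorry", "apologize", "apologise", "apologies", "apology", etc.
APOLOGY = re.compile(r"\b(sorry|apolog\w*)\b", re.IGNORECASE)


def parses_as_json(output: str) -> float:
    """1.0 if the whole output is valid JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


def has_field(output: str, name: str) -> float:
    """1.0 if the output is a JSON object containing the given key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(data, dict) and name in data else 0.0


def no_apology(output: str) -> float:
    """1.0 if the output contains no apology wording."""
    return 0.0 if APOLOGY.search(output) else 1.0


sample = '{"order_id": "A123", "status": "shipped"}'
print(parses_as_json(sample), has_field(sample, "order_id"), no_apology(sample))
# 1.0 1.0 1.0
```

Each scorer returns a float in [0, 1], so these can sit in the same scorer table as a judge-based scorer.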
A tiny eval harness
Here is the whole idea in about 60 lines of Python: a golden set, two scorers (one heuristic, one LLM-judge), and an aggregate with a regression gate. The model calls are stubbed behind call_model and judge so you can drop in whatever provider you use.
eval_harness.py
```python
from dataclasses import dataclass
from statistics import mean

# 1. The golden set: real inputs + what a good answer must contain.
GOLDEN = [
    {"input": "What's the refund window?", "must_include": "30 days"},
    {"input": "Do you ship to Germany?", "must_include": "yes"},
    {"input": "Is my data sold to third parties?", "must_include": "no"},
]


# Provider stubs: replace both with real API calls for your provider.
def call_model(system: str, user: str, temperature: float = 0) -> str:
    raise NotImplementedError("connect this to your model provider")


def judge(prompt: str) -> str:
    raise NotImplementedError("connect this to your judge model")


# Your app's prompt under test.
def run_prompt(user_input: str) -> str:
    system = "You are a concise support agent. Answer in one sentence."
    return call_model(system=system, user=user_input, temperature=0)


# --- Scorers: each returns a float in [0, 1] ---
def heuristic_scorer(output: str, case: dict) -> float:
    # Deterministic: does the answer contain the required fact?
    # Plain substring match, so short targets like "no" can match by accident.
    return 1.0 if case["must_include"].lower() in output.lower() else 0.0


JUDGE_PROMPT = (
    "Score the ANSWER from 0-1 for how clearly and politely it "
    "resolves the QUESTION. Reply with ONLY the number.\n"
    "QUESTION: {q}\nANSWER: {a}"
)


def llm_judge_scorer(output: str, case: dict) -> float:
    raw = judge(JUDGE_PROMPT.format(q=case["input"], a=output))
    try:
        return max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        return 0.0  # unparseable judge output counts as a failure


SCORERS = {"heuristic": heuristic_scorer, "judge": llm_judge_scorer}


@dataclass
class Result:
    scores: dict  # scorer name -> mean score
    overall: float


def evaluate() -> Result:
    per_scorer = {name: [] for name in SCORERS}
    for case in GOLDEN:
        output = run_prompt(case["input"])
        for name, fn in SCORERS.items():
            per_scorer[name].append(fn(output, case))
    scores = {name: mean(vals) for name, vals in per_scorer.items()}
    return Result(scores=scores, overall=mean(scores.values()))


# --- Regression gate: fail if we dropped below the saved baseline ---
BASELINE = 0.85  # load this from your last green run, don't hardcode forever

if __name__ == "__main__":
    result = evaluate()
    print("per-scorer:", result.scores, "overall:", round(result.overall, 3))
    # Note: `python -O` strips assert statements, so run the gate without -O.
    assert result.overall >= BASELINE, (
        f"REGRESSION: {result.overall:.3f} < baseline {BASELINE}"
    )
    print("PASS, safe to ship.")
```
Run this before and after a prompt change: the same three cases, the same scorers, two numbers you can compare. That single assert is your regression gate. Wire it into CI, and a prompt edit that quietly breaks one of these answers fails the build instead of reaching production.
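A single overall number can hide a drop in one scorer when another scorer improves. The sketch below stores per-scorer scores in a JSON file and gates on each one separately. The file name `baseline.json`, the helper names, and the `tolerance` parameter are assumptions for illustration, not part of the harness above.

```python
import json
import tempfile
from pathlib import Path


def load_baseline(path: Path) -> dict:
    """Return saved per-scorer scores, or {} if no green run has been saved."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_baseline(path: Path, scores: dict) -> None:
    """Persist the scores of a green run so the next run can compare to them."""
    path.write_text(json.dumps(scores, indent=2, sort_keys=True))


def find_regressions(current: dict, baseline: dict, tolerance: float = 0.0) -> dict:
    """Map scorer name -> (baseline, current) for every scorer that dropped."""
    return {
        name: (old, current.get(name, 0.0))
        for name, old in baseline.items()
        if current.get(name, 0.0) < old - tolerance
    }


with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline.json"
    save_baseline(baseline_path, {"heuristic": 1.0, "judge": 0.7})

    # Same overall mean as the baseline (0.85), but the heuristic score fell.
    new_run = {"heuristic": 0.8, "judge": 0.9}
    regressions = find_regressions(new_run, load_baseline(baseline_path))
    print(regressions)  # {'heuristic': (1.0, 0.8)}

    if not regressions:
        save_baseline(baseline_path, new_run)  # only green runs move the baseline
```

A small `tolerance` can absorb run-to-run noise from the judge. Keep it small, though, because a generous tolerance is another way of lowering the bar.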
Offline vs online: you need both
Everything above is offline eval: a fixed golden set you run on demand. It is fast, repeatable, and well suited to catching regressions before you ship. But a golden set only contains the inputs you *thought of*. Real users will phrase things you never imagined.
That's where online eval comes in: scoring a sample of *live production traffic*. You log real inputs and outputs, run the same scorers (plus lightweight signals like thumbs-up/down, did-the-user-retry, did-they-escalate-to-a-human), and watch the trend. Online eval tells you what offline can't: how the app behaves on the long tail of real questions. The two feed each other. Failures you spot online become new cases in your golden set, which makes your offline gate smarter over time. Wiring this into production logging and dashboards is the subject of LLMOps: Productionizing LLM Apps.
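The sampling and signal tracking described above could be sketched like this. The event fields (`thumbs`, `retried`, `escalated`) and the hash-based sampling scheme are assumptions for illustration; your logging schema will differ.

```python
import hashlib
from collections import Counter


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically pick roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def signal_rates(events: list) -> dict:
    """Share of logged interactions showing each negative signal."""
    counts = Counter()
    for event in events:
        counts["thumbs_down"] += int(event.get("thumbs") == "down")
        counts["retried"] += int(bool(event.get("retried")))
        counts["escalated"] += int(bool(event.get("escalated")))
    n = len(events) or 1  # avoid dividing by zero on an empty log
    return {k: counts[k] / n for k in ("thumbs_down", "retried", "escalated")}


# Hypothetical logged interactions.
events = [
    {"id": "r1", "thumbs": "up"},
    {"id": "r2", "thumbs": "down", "retried": True},
    {"id": "r3", "retried": True, "escalated": True},
    {"id": "r4"},
]

print(signal_rates(events))
# {'thumbs_down': 0.25, 'retried': 0.5, 'escalated': 0.25}

to_score = [e for e in events if in_sample(e["id"], rate=0.5)]
```

Hashing the request ID, rather than calling a random number generator, means the same request is always in or out of the sample, which keeps reruns of the scoring job consistent.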
Common mistakes that cost hours
No dataset at all. Eyeballing one example is not evaluation. Without a fixed golden set every "improvement" is anecdote. Twenty real cases beat zero. Start there, and grow the set as you find failures.
Judge bias. An LLM-judge inherits the quirks of its model: it tends to favor longer answers, reward confident tone over correctness, and rate its *own* outputs higher than a rival model's. Calibrate the judge against human labels on a handful of cases before you trust it.
Only offline. A green offline suite feels safe, but it only tests inputs you imagined. Skipping online eval means real-world failures stay invisible until a user reports them.
Scoring with the model you're testing. Using the same model as both the app *and* the judge bakes in shared blind spots. Use a different (often stronger) model to judge.
A vague golden answer. "A good response" is unscorable. Write the *must-haves* (the fact, the format, the forbidden phrases) so that a scorer or a human can decide objectively.
Letting the baseline rot. A regression gate is only as honest as its baseline. Update it from your last green run, and never lower it to make a failing change pass.
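Calibrating a judge against human labels can start as a simple agreement check. The sketch below assumes the humans gave pass/fail labels (1 or 0) and that a judge score at or above a `threshold` counts as a pass. The function name, the threshold, and the sample numbers are made up for illustration.

```python
from statistics import mean


def calibration_report(
    judge_scores: list,
    human_labels: list,
    threshold: float = 0.5,
) -> dict:
    """Compare judge scores in [0, 1] with human pass/fail labels (1 or 0)."""
    if len(judge_scores) != len(human_labels) or not judge_scores:
        raise ValueError("need equal-length, non-empty lists")
    judge_pass = [1 if s >= threshold else 0 for s in judge_scores]
    pairs = list(zip(judge_pass, human_labels))
    return {
        "agreement": mean(int(j == h) for j, h in pairs),
        # Judge passed something the human failed: the dangerous direction.
        "false_passes": sum(1 for j, h in pairs if j == 1 and h == 0),
        "n": len(pairs),
    }


# Hypothetical scores for five outputs graded by both the judge and a human.
judge_scores = [0.9, 0.8, 0.2, 0.7, 0.4]
human_labels = [1, 0, 0, 1, 1]

report = calibration_report(judge_scores, human_labels)
print(report)  # agreement 0.6, 1 false pass, n = 5
```

Low agreement suggests revising the judge prompt or switching the judge model before its scores feed a gate. False passes deserve the closest look, since they are the cases where a bad answer would ship.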
Takeaways
The whole article in seven lines
Evals are unit tests for non-deterministic output: assert *properties*, not exact equality.
Build a golden set first: 20–50 real inputs + the must-haves of a good answer.
Layer scorers cheapest-first: exact match → heuristic → LLM-as-judge → human review.
Deterministic checks are nearly free and catch most structure bugs. Always run them.
A regression gate (new score ≥ baseline) is what kills vibes-based shipping.
Offline eval catches regressions; online eval catches the long tail. You need both.
Watch for judge bias: calibrate against humans and never let the model grade itself.
Where to go next
Evaluation is the discipline that turns prompt-tweaking into engineering. The next step is putting the harness somewhere it runs automatically and watching real traffic.