AI Engineering · 12 min read · Jun 2026

Evaluating LLM Applications

You changed a prompt. Is it actually better, or does it just feel better? This is how to build real evals: golden datasets, deterministic checks, LLM-as-judge, and regression gates that kill vibes-based shipping.

Tags: AI, LLM, Evaluation, Testing
Sri Balaji

Founder · TheSimplifiedTech

You changed a prompt and have no idea if it got better

Here is the most common moment in LLM development. You tweak a system prompt, add a line, reword an instruction, swap a model. You run it on the one example you happen to have open, the output looks nicer, and you ship it. Then a week later a user complains the bot got *worse* at something you never thought to test, and you have no way to prove whether your change helped, hurt, or did nothing at all.

This is vibes-based shipping, and almost everyone does it at first. The reason is simple: with normal code you write a test and it passes or fails. With an LLM the output can differ from run to run, there is often no single "correct" answer, and "better" is a judgment call. Evaluation is hard, so people skip it entirely, which makes it the most-skipped part of building real AI products.

This article is the antidote. You will build a tiny eval harness that scores your prompt's outputs against a fixed set of examples, so the next time you change something you get a *number*, not a feeling.

Who this is for

Anyone who has wired up an LLM call and now wants to change it *safely*: app developers, prompt engineers, and AI engineers. You should be comfortable reading Python and have made at least one API call to a model. No ML background needed. If you have not written a prompt yet, start with [Prompt Engineering Fundamentals](/blog/prompt-engineering-fundamentals).

The principle: evals are unit tests for non-deterministic output

If you cannot measure whether a change made your LLM app better, you are not engineering, you are gambling with extra steps.
The eval mindset

The mental model that unlocks everything: an eval is a unit test for a system whose output you cannot predict exactly. A normal unit test asserts add(2, 2) == 4. You can't do that with an LLM, because the same input can yield different wording on every run. So instead of asserting *exact equality*, an eval asserts that *a property holds*: the answer contains the right account ID, the JSON parses, the tone is professional, the summary doesn't invent facts.

| Classic testing | LLM eval equivalent |
| --- | --- |
| Unit test asserting exact output | Exact-match scorer (when there IS one right answer) |
| Test asserting a property (sorted, non-empty) | Heuristic scorer (regex, JSON-parses, length, contains-keyword) |
| Code reviewer judging readability | LLM-as-judge scoring tone, relevance, faithfulness |
| Test suite you run before every merge | Eval suite you run before every prompt change |
| CI failing the build on a regression | Regression gate failing the deploy on a score drop |

Same instinct as classic testing, just adapted for output you can't pin down exactly.
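To make the shift from exact equality to properties concrete, here is a minimal sketch. The account ID, the sample replies, and the helper names (check_support_reply, parses_as_json) are made up for illustration; they are not part of any library.

```python
import json
import re


def check_support_reply(output: str, account_id: str) -> dict:
    """Return one boolean per property instead of comparing to a fixed string."""
    return {
        "mentions_account": account_id in output,
        "not_empty": bool(output.strip()),
        # Hypothetical style rule: no apology when none is needed.
        "no_apology": re.search(r"\b(sorry|apologi[sz]e)\b", output, re.I) is None,
    }


def parses_as_json(output: str) -> bool:
    """Property check for structured output: does it parse at all?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


# Two differently worded answers: an equality assert could only accept one.
reply_a = "Account AC-1042 is active and your plan renews on the 3rd."
reply_b = "Your plan renews on the 3rd; account AC-1042 is in good standing."

assert reply_a != reply_b
assert all(check_support_reply(reply_a, "AC-1042").values())
assert all(check_support_reply(reply_b, "AC-1042").values())
```

Both replies pass because the checks describe what a good answer must have, not how it must be worded.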

The shape of an eval pipeline

Before writing code, picture the flow. You start with a golden set: fixed inputs paired with what a good answer looks like. Each input runs through your prompt, the output goes to one or more scorers, and the scores roll up into an aggregate that a regression gate compares against your last known-good run.

[Figure: An eval pipeline: golden set (inputs + expected) → run prompt (LLM under test) → layered scorers (exact/heuristic deterministic scorers, then LLM-as-judge graded by a model) → aggregate score (pass rate per scorer) → regression gate (≥ baseline?).]

  1. Collect a golden set. Gather 20–50 real inputs your app will actually see. Pull them from logs, support tickets, or your own head. For each, write down what a good output must contain: not the exact wording, just the must-haves.
  2. Run the prompt over every case. Loop through the golden set, call the model with your current prompt for each input, and capture the raw output. Pin temperature low (or 0) so runs are comparable.
  3. Score each output. Send every output through your scorers. Start with cheap deterministic checks (does it contain the order ID? does the JSON parse?), then add an LLM-judge for the fuzzy stuff (is the tone right? is it faithful to the source?).
  4. Aggregate and gate. Average the scores into a pass rate per scorer. Compare against your last saved baseline. If the new run drops below it, the gate fails and you investigate before shipping.
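Step 1 is easier to keep honest if the golden set lives in a file and is validated on load. The sketch below assumes one JSON object per line (JSONL) with the keys input and must_include; both the file format and the load_golden helper are illustrative choices, not a requirement.

```python
import json

REQUIRED_KEYS = {"input", "must_include"}  # assumed schema for this sketch


def load_golden(lines):
    """Parse JSONL lines into cases, rejecting cases a scorer could not grade."""
    cases = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # allow blank lines between cases
        case = json.loads(line)
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(f"line {lineno}: missing {sorted(missing)}")
        if not case["must_include"].strip():
            raise ValueError(f"line {lineno}: empty must_include is unscorable")
        cases.append(case)
    return cases


# In practice you would pass an open file; a string keeps the example runnable.
SAMPLE = """
{"input": "What's the refund window?", "must_include": "30 days"}
{"input": "Do you ship to Germany?", "must_include": "yes"}
"""

golden = load_golden(SAMPLE.splitlines())
print(len(golden), "cases loaded")
```

Failing at load time means a case with no must-haves never silently drags the score up or down.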

Choosing a scoring method

There is no single "the" eval method. You layer several, cheapest first. Run deterministic checks on every case because they cost almost nothing, and reserve expensive methods (model judges, humans) for the questions cheap checks can't answer.

| Method | What it measures | Cost | Signal quality |
| --- | --- | --- | --- |
| Exact match | Output equals expected string (classification labels, structured IDs) | Near zero | Perfect, but only where one right answer exists |
| Heuristic | Properties: regex, JSON-parses, contains keyword, length, no banned words | Near zero | Good for structure/format; blind to meaning |
| LLM-as-judge | Fuzzy qualities: relevance, tone, faithfulness, helpfulness | $ per case (extra model call) | Strong if the judge prompt is calibrated; biased if not |
| Human review | Anything subtle a model still gets wrong; ground truth | $$$ + slow | Highest, but doesn't scale; use to calibrate the others |

Eval methods, cheapest to most expensive. Layer them, don't pick just one.

Always start deterministic

A surprising amount of "quality" is just structure: valid JSON, the right field present, no apology when none is needed. Heuristic scorers catch these for free and run in milliseconds. Only reach for an LLM-judge once the deterministic checks pass and you still need to grade meaning.
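One way to sketch "deterministic first, judge only if needed" is below. The order-ID format, the banned phrases, the length limit, and the fake_judge stand-in are all assumptions for the example; swap in whatever your own app requires.

```python
import json
import re
from typing import Callable, Dict, Optional

BANNED = ("as an ai", "i cannot help")  # hypothetical forbidden phrases


def json_parses(output: str) -> float:
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


def has_order_id(output: str) -> float:
    # Assumed ID format for illustration: "ORD-" followed by six digits.
    return 1.0 if re.search(r"\bORD-\d{6}\b", output) else 0.0


def no_banned_phrases(output: str) -> float:
    lowered = output.lower()
    return 0.0 if any(phrase in lowered for phrase in BANNED) else 1.0


def within_length(output: str, max_chars: int = 400) -> float:
    return 1.0 if 0 < len(output.strip()) <= max_chars else 0.0


DETERMINISTIC = {
    "json_parses": json_parses,
    "has_order_id": has_order_id,
    "no_banned_phrases": no_banned_phrases,
    "within_length": within_length,
}


def score_layered(
    output: str, judge_fn: Callable[[str], float]
) -> Dict[str, Optional[float]]:
    """Run the free checks first; pay for the judge only if they all pass."""
    scores: Dict[str, Optional[float]] = {
        name: fn(output) for name, fn in DETERMINISTIC.items()
    }
    if all(score == 1.0 for score in scores.values()):
        scores["judge"] = judge_fn(output)
    else:
        scores["judge"] = None  # skipped: the structure is already wrong
    return scores


judge_calls = []


def fake_judge(output: str) -> float:
    """Stand-in for a real judge model call, so the sketch runs offline."""
    judge_calls.append(output)
    return 0.9


good = '{"order_id": "ORD-123456", "status": "shipped"}'
bad = "As an AI, I think your order shipped."

print(score_layered(good, fake_judge))
print(score_layered(bad, fake_judge))
print("judge calls made:", len(judge_calls))
```

Recording the skipped judge as None rather than 0.0 keeps "not graded" distinct from "graded badly" when you aggregate.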

A tiny eval harness

Here is the whole idea in about 60 lines of Python: a golden set, two scorers (one heuristic, one LLM-judge), and an aggregate with a regression gate. The model call is stubbed behind run_prompt and judge so you can drop in whatever provider you use.

eval_harness.py

```python
from dataclasses import dataclass
from statistics import mean

# 1. The golden set: real inputs + what a good answer must contain.
GOLDEN = [
    {"input": "What's the refund window?", "must_include": "30 days"},
    {"input": "Do you ship to Germany?", "must_include": "yes"},
    {"input": "Is my data sold to third parties?", "must_include": "no"},
]

# Provider stubs: replace both bodies with real API calls for your provider.
def call_model(system: str, user: str, temperature: float = 0.0) -> str:
    raise NotImplementedError("call the model under test here")

def judge(prompt: str) -> str:
    raise NotImplementedError("call your judge model here")

# Your app's prompt under test.
def run_prompt(user_input: str) -> str:
    system = "You are a concise support agent. Answer in one sentence."
    return call_model(system=system, user=user_input, temperature=0)

# --- Scorers: each returns a float in [0, 1] ---

def heuristic_scorer(output: str, case: dict) -> float:
    # Deterministic: does the answer contain the required fact?
    return 1.0 if case["must_include"].lower() in output.lower() else 0.0

JUDGE_PROMPT = (
    "Score the ANSWER from 0-1 for how clearly and politely it "
    "resolves the QUESTION. Reply with ONLY the number.\n"
    "QUESTION: {q}\nANSWER: {a}"
)

def llm_judge_scorer(output: str, case: dict) -> float:
    raw = judge(JUDGE_PROMPT.format(q=case["input"], a=output))
    try:
        # Clamp to [0, 1] in case the judge replies with something like 1.2.
        return max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        return 0.0  # un-parseable judge output counts as a failure

SCORERS = {"heuristic": heuristic_scorer, "judge": llm_judge_scorer}

@dataclass
class Result:
    scores: dict          # scorer name -> mean score
    overall: float

def evaluate() -> Result:
    per_scorer = {name: [] for name in SCORERS}
    for case in GOLDEN:
        output = run_prompt(case["input"])
        for name, fn in SCORERS.items():
            per_scorer[name].append(fn(output, case))
    scores = {name: mean(vals) for name, vals in per_scorer.items()}
    return Result(scores=scores, overall=mean(scores.values()))

# --- Regression gate: fail if we dropped below the saved baseline ---
BASELINE = 0.85  # load this from your last green run, don't hardcode forever

if __name__ == "__main__":
    result = evaluate()
    print("per-scorer:", result.scores, "overall:", round(result.overall, 3))
    assert result.overall >= BASELINE, (
        f"REGRESSION: {result.overall:.3f} < baseline {BASELINE}"
    )
    print("PASS, safe to ship.")
```

Run this before and after a prompt change: the same three cases, the same scorers, two numbers you can compare. That single assert is your regression gate. Wire it into CI, and a prompt edit that quietly breaks one of these answers fails the build instead of reaching production.
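The harness hardcodes BASELINE and notes that it should come from the last green run. A minimal sketch of that idea follows, assuming the baseline is stored as a small JSON file of per-scorer means (the same shape as result.scores). The file name, the tolerance value, and the helper names are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path


def check_gate(scores: dict, baseline_path: Path, tolerance: float = 0.02) -> list:
    """Return a list of regression messages; an empty list means the gate passes."""
    if not baseline_path.exists():
        return []  # first run: nothing to compare against yet
    baseline = json.loads(baseline_path.read_text())
    failures = []
    for name, old in baseline.items():
        new = scores.get(name, 0.0)  # a scorer that disappeared counts as 0
        if new < old - tolerance:
            failures.append(f"{name}: {new:.3f} < baseline {old:.3f}")
    return failures


def save_baseline(scores: dict, baseline_path: Path) -> None:
    """Call this only after a green run, never to make a failing change pass."""
    baseline_path.write_text(json.dumps(scores, indent=2))


# Demo in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "baseline.json"
    save_baseline({"heuristic": 1.0, "judge": 0.80}, path)

    within_noise = check_gate({"heuristic": 1.0, "judge": 0.79}, path)
    regression = check_gate({"heuristic": 0.67, "judge": 0.85}, path)

print("within noise:", within_noise)
print("regression:", regression)
```

Gating per scorer rather than on the overall mean is a design choice: in the second run the judge score went up while the heuristic score fell, and an average of the two could have hidden the broken answer.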

Offline vs online: you need both

Everything above is offline eval: a fixed golden set you run on demand. It is fast, repeatable, and well suited to catching regressions before you ship. But a golden set only contains the inputs you *thought of*. Real users will phrase things you never imagined.

That's where online eval comes in: scoring a sample of *live production traffic*. You log real inputs and outputs, run the same scorers (plus lightweight signals like thumbs-up/down, did-the-user-retry, did-they-escalate-to-a-human), and watch the trend. Online eval tells you what offline can't: how the app behaves on the long tail of real questions. The two feed each other. Failures you spot online become new cases in your golden set, which makes your offline gate smarter over time. Wiring this into production logging and dashboards is the subject of LLMOps: Productionizing LLM Apps.
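As a rough sketch of the online side, the code below picks a stable sample of requests and rolls the lightweight signals up into rates. The record fields (id, thumbs, retried, escalated) and the 5% default rate are assumptions; your logging schema will differ.

```python
import hashlib


def in_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def online_signals(records: list) -> dict:
    """Aggregate lightweight production signals into rates worth trending."""
    if not records:
        return {}
    rated = [r for r in records if r["thumbs"] is not None]
    return {
        # Share of thumbs-up among users who rated at all (None if nobody did).
        "thumbs_up_rate": (
            sum(r["thumbs"] for r in rated) / len(rated) if rated else None
        ),
        "retry_rate": sum(r["retried"] for r in records) / len(records),
        "escalation_rate": sum(r["escalated"] for r in records) / len(records),
    }


# Hypothetical logged records; in production these come from your request logs.
records = [
    {"id": "r1", "thumbs": True, "retried": False, "escalated": False},
    {"id": "r2", "thumbs": None, "retried": True, "escalated": False},
    {"id": "r3", "thumbs": False, "retried": True, "escalated": True},
    {"id": "r4", "thumbs": True, "retried": False, "escalated": False},
]

signals = online_signals(records)
print(signals)
```

Hashing the request ID instead of calling a random number generator means the same request is always in or out of the sample, which makes reruns and debugging reproducible. Records that retried or escalated are natural candidates for new golden-set cases.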

Common mistakes that cost hours

  1. No dataset at all. Eyeballing one example is not evaluation. Without a fixed golden set every "improvement" is anecdote. Twenty real cases beat zero: start there and grow the set as you find failures.
  2. Judge bias. An LLM-judge inherits the quirks of its model: it tends to favor longer answers, reward confident tone over correctness, and rate its *own* outputs higher than a rival model's. Calibrate the judge against human labels on a handful of cases before you trust it.
  3. Only offline. A green offline suite feels safe, but it only tests inputs you imagined. Skipping online eval means real-world failures stay invisible until a user reports them.
  4. Scoring with the model you're testing. Using the same model as both the app *and* the judge bakes in shared blind spots. Use a different (often stronger) model to judge.
  5. A vague golden answer. "A good response" is unscorable. Write the *must-haves* (the fact, the format, the forbidden phrases) so a scorer (or a human) can decide objectively.
  6. Letting the baseline rot. A regression gate is only as honest as its baseline. Update it from your last green run, and never lower it to make a failing change pass.
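Calibrating a judge against human labels (mistake 2) can start as simply as measuring how often the two agree on pass/fail. The sketch below uses invented scores and labels, and the 0.5 pass threshold is an assumption you would tune.

```python
def agreement(judge_scores: list, human_labels: list, threshold: float = 0.5) -> float:
    """Fraction of cases where the judge's pass/fail matches the human's."""
    if len(judge_scores) != len(human_labels):
        raise ValueError("need one human label per judge score")
    if not judge_scores:
        raise ValueError("need at least one labeled case")
    matches = sum(
        (score >= threshold) == label
        for score, label in zip(judge_scores, human_labels)
    )
    return matches / len(human_labels)


def disagreements(judge_scores: list, human_labels: list, threshold: float = 0.5) -> list:
    """Indices of cases to read by hand: the judge and the human disagree."""
    return [
        i
        for i, (score, label) in enumerate(zip(judge_scores, human_labels))
        if (score >= threshold) != label
    ]


# Invented data: judge scores for five outputs and a human's pass/fail verdicts.
judge_scores = [0.9, 0.8, 0.3, 0.7, 0.2]
human_labels = [True, True, False, False, False]

print("agreement:", agreement(judge_scores, human_labels))
print("review these cases:", disagreements(judge_scores, human_labels))
```

The disagreement list is usually more useful than the headline number: reading those cases shows whether the judge prompt, the threshold, or the human label is the thing to fix.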

Takeaways

The whole article in seven lines

  • Evals are unit tests for non-deterministic output: assert *properties*, not exact equality.
  • Build a golden set first: 20–50 real inputs + the must-haves of a good answer.
  • Layer scorers cheapest-first: exact match → heuristic → LLM-as-judge → human review.
  • Deterministic checks are nearly free and catch most structure bugs, so always run them.
  • A regression gate (new score ≥ baseline) is what kills vibes-based shipping.
  • Offline eval catches regressions; online eval catches the long tail. You need both.
  • Watch for judge bias: calibrate against humans and never let the model grade itself.

Where to go next

Evaluation is the discipline that turns prompt-tweaking into engineering. The next step is putting the harness somewhere it runs automatically and watching real traffic.
