Fine-Tuning LLMs: When and How

TL;DR

Fine-tuning changes how a model behaves, not what it knows, so reach for it on style and format, not facts. Learn when prompting or RAG wins instead, and how to build a dataset that actually moves the needle.

The trap everyone falls into

Your model keeps answering in the wrong tone. It ignores your JSON schema. It writes like a chatbot when you need a terse internal tool. So someone says the magic words: "let's just fine-tune it." Three weeks later you have a labeled dataset, a training bill, a checkpoint that is already stale, and the model still hallucinates yesterday's pricing, because fine-tuning was never the tool for that problem.

Fine-tuning is powerful and badly misunderstood. The single most useful thing you can learn about it is what it cannot do. Get that right and the rest is mechanical.

Who this is for

Engineers who can already call an LLM API and have shipped a prompt or two, maybe even a RAG pipeline, and are now wondering whether fine-tuning is the next step. No ML PhD required. We assume you know what a prompt is and can read JSON and YAML.

One principle, one analogy

Fine-tuning teaches a model how to behave. It does not teach it what is true today.
The rule that prevents 90% of wasted fine-tuning projects

Think of a base model as a brilliant new graduate. They reason well, write well, and know an enormous amount about the world up to the day they stopped reading. Fine-tuning is sending them to a finishing school for tone and craft, teaching them your house style, your output format, your domain's voice. What it is *not* is handing them this morning's newspaper. For fresh facts you put the newspaper on their desk at question time. That desk is your prompt, and the newspaper is retrieval.

Writing a clear brief before each task: prompt engineering (cheap, instant, changeable)

Giving them the relevant files for this task: RAG (fresh facts injected at query time)

Sending them to finishing school for your house style: fine-tuning (bakes behavior into the weights)

Expecting school to teach today's stock price: the mistake (weights are frozen at training time)

Same employee, three different interventions

The fine-tuning workflow

When fine-tuning *is* the right call, the path is the same every time. Curate a clean dataset, format it the way the trainer expects, train a small adapter (not the whole model), evaluate honestly, deploy, and then watch it in production. The diagram below is the loop; the steps after it walk one pass through it.

The fine-tuning loop: most of the value is in the first two boxes and the eval gate

1. Curate the dataset. Collect a few hundred to a few thousand high-quality input/output pairs that demonstrate exactly the behavior you want. Quality beats quantity: 500 clean examples will often outperform 5,000 noisy ones.
2. Format to the chat schema. Convert each pair into the JSONL message format your trainer expects (system / user / assistant turns). Consistency here is non-negotiable; one malformed field can skew the whole run.
3. Train a LoRA adapter. Freeze the base model and train a small set of low-rank adapter weights. For a dataset of this size, that typically takes minutes to a couple of hours on a single GPU, not a data-center job.
4. Evaluate against a held-out set. Score on examples the model never saw during training, plus an LLM-judge or human pass for quality. If it does not beat your prompt-only baseline, stop and do not deploy.
5. Deploy the adapter. Serve the small adapter on top of the base model. You can keep several adapters and swap them per use case without re-hosting the base.
6. Monitor and feed back. Watch for regressions and distribution drift in production, harvest fresh examples, and feed them into the next curation round.
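
The formatting step (step 2) can be sketched in a few lines of Python. This is a minimal sketch, not any particular trainer's loader: it assumes raw pairs arrive as (user text, label dict) tuples, and `SYSTEM_PROMPT`, `to_chat_example`, and `to_jsonl` are hypothetical names chosen for illustration.

```python
import json

# Assumption: one fixed system prompt is reused for every example.
SYSTEM_PROMPT = "You are a support triage bot. Reply only with JSON."

def to_chat_example(user_text, assistant_obj):
    """Build one chat-schema example from a raw (input, output) pair."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text.strip()},
            # The assistant turn is stored as a JSON *string*, so serialize it here.
            {"role": "assistant", "content": json.dumps(assistant_obj)},
        ]
    }

def to_jsonl(pairs):
    """Serialize pairs to JSONL text: one complete JSON object per line."""
    return "".join(json.dumps(to_chat_example(u, a)) + "\n" for u, a in pairs)

if __name__ == "__main__":
    pairs = [
        ("How do I reset my password?",
         {"category": "account", "priority": "low", "needs_human": False}),
    ]
    print(to_jsonl(pairs), end="")
```

Generating every line through one function is what keeps the format consistent; hand-edited JSONL is where stray fields and broken escaping usually creep in.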

Prompting vs RAG vs fine-tuning

These three are not competitors; they solve different problems and stack happily. The mistake is reaching for the most expensive one first. Try prompting. Then try RAG. Only then consider fine-tuning. Most teams never need the third.

Criterion	Prompting	RAG	Fine-tuning
Best for	Instructions, tone, simple format	Fresh / private facts	Consistent style, strict format, narrow domain
Freshness	Whatever you paste in	Live (updates with your data)	Frozen at training time
Cost	Near zero	Low (vector store + retrieval)	High (data prep + GPU + upkeep)
Effort to change	Edit a string	Update an index	Re-curate + re-train + re-eval
Time to first result	Minutes	Hours to days	Days to weeks

Reach for the cheapest tool that solves your actual problem

They compose

The strongest systems use all three: a fine-tuned model for reliable house-style output, RAG for the facts, and a tight prompt to orchestrate both. Fine-tuning is the last layer you add, not the first.
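
To make the composition concrete, here is a toy sketch of the prompt-plus-retrieval half. It assumes an in-memory list of fact strings and a naive word-overlap retriever (a real system would more likely use a vector store). `retrieve` and `build_messages` are hypothetical helpers, and the resulting messages would go to whichever model you serve, fine-tuned or not.

```python
def retrieve(question, documents, k=2):
    """Toy retriever: rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_messages(question, documents, system_prompt):
    """Inject retrieved facts into the prompt at query time."""
    facts = retrieve(question, documents)
    context = "\n".join(f"- {fact}" for fact in facts)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Facts:\n{context}\n\nQuestion: {question}"},
    ]

if __name__ == "__main__":
    # Made-up facts for illustration only.
    docs = [
        "The Pro plan costs 40 dollars per month.",
        "Support hours are 9 to 5.",
        "Refunds take 5 days.",
    ]
    for message in build_messages(
        "How much does the Pro plan cost per month?",
        docs,
        "Answer only from the facts provided.",
    ):
        print(message["role"], "->", message["content"])
```

The facts live in `docs`, not in the weights, so changing a price means editing data rather than re-training.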

What the data actually looks like

The dataset is the project. Each line is one self-contained example in the chat schema: a system message setting the role, the user input, and the assistant output you wish the model had produced. Below is a tiny slice of a support-triage dataset teaching the model to always reply with a fixed JSON structure.

train.jsonl

json

{"messages": [{"role": "system", "content": "You are a support triage bot. Reply only with JSON."}, {"role": "user", "content": "My invoice is wrong and I want a refund."}, {"role": "assistant", "content": "{\"category\": \"billing\", \"priority\": \"high\", \"needs_human\": true}"}]}
{"messages": [{"role": "system", "content": "You are a support triage bot. Reply only with JSON."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "{\"category\": \"account\", \"priority\": \"low\", \"needs_human\": false}"}]}
{"messages": [{"role": "system", "content": "You are a support triage bot. Reply only with JSON."}, {"role": "user", "content": "The app crashes every time I open a report."}, {"role": "assistant", "content": "{\"category\": \"bug\", \"priority\": \"high\", \"needs_human\": true}"}]}

Notice what these examples teach: format and routing behavior, not facts. None of them depend on information that changes day to day. That is the signature of a good fine-tuning dataset.
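
Because one malformed line can skew a run, it is worth checking the file before training. The sketch below is one possible validator for the triage examples above, not a trainer feature: `validate_line`, `EXPECTED_ROLES`, and `REQUIRED_KEYS` are hypothetical names, and the rule that every example has exactly one system, user, and assistant turn is an assumption taken from this sample.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]
REQUIRED_KEYS = {"category", "priority", "needs_human"}

def validate_line(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"line is not valid JSON: {exc.msg}"]

    messages = example.get("messages") if isinstance(example, dict) else None
    if not isinstance(messages, list):
        return ["missing 'messages' list"]

    roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
    if roles != EXPECTED_ROLES:
        return [f"unexpected roles: {roles}"]

    # The assistant content must itself be a JSON string with the fixed keys.
    try:
        reply = json.loads(messages[-1]["content"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return ["assistant content is not a JSON string"]
    if not isinstance(reply, dict) or set(reply) != REQUIRED_KEYS:
        return ["assistant JSON does not have exactly the expected keys"]
    return []

def validate_file(path):
    """Map 1-based line numbers to problems for every bad line in a JSONL file."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            problems = validate_line(line)
            if problems:
                report[number] = problems
    return report
```

Calling `validate_file("train.jsonl")` returns an empty dict when every line passes, which makes it easy to wire into a pre-training check.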

Training itself is mostly configuration. A LoRA run reads like this: a base model, a handful of hyperparameters, and the rank that controls how big your adapter is. The key names below are illustrative; the exact schema depends on your trainer.

lora.yaml

yaml

base_model: meta-llama/Llama-3.1-8B-Instruct
dataset: ./train.jsonl

lora:
  r: 16              # adapter rank, higher = more capacity, more params
  alpha: 32          # scaling factor, usually 2x r
  dropout: 0.05
  target_modules:    # which layers get adapters
    - q_proj
    - v_proj

training:
  epochs: 3          # too many = memorization / overfit
  learning_rate: 2.0e-4
  batch_size: 8
  eval_split: 0.1    # hold out 10%, never train on your eval set
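
How a trainer implements a hold-out such as `eval_split: 0.1` varies. One way to do it yourself, sketched here under the assumption that examples use the chat schema shown earlier, is to hash each example's user input. The assignment is then stable across runs, and duplicate inputs always land on the same side, which guards against leakage. `is_eval_example` and `split_examples` are hypothetical names.

```python
import hashlib

def is_eval_example(user_text, eval_split=0.1):
    """Deterministically assign an example to the eval set by hashing its input."""
    normalized = user_text.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < eval_split

def split_examples(examples, eval_split=0.1):
    """Split chat-schema examples into (train, held_out) lists."""
    train, held_out = [], []
    for example in examples:
        user_text = next(
            m["content"] for m in example["messages"] if m["role"] == "user"
        )
        target = held_out if is_eval_example(user_text, eval_split) else train
        target.append(example)
    return train, held_out
```

The split is only approximately 10% on small datasets, since each example is assigned independently; that is usually an acceptable trade for reproducibility.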

Why LoRA / PEFT instead of full fine-tuning

Full fine-tuning updates every weight in the model: huge memory, huge cost, and a fresh multi-gigabyte checkpoint each time. Parameter-efficient fine-tuning (PEFT) methods like LoRA freeze the base model and train a tiny set of low-rank adapter matrices, often under 1% of the parameters. You get most of the benefit for a fraction of the cost, the adapters are megabytes rather than gigabytes, and you can host one base model with many swappable adapters.
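
A quick back-of-the-envelope calculation shows where the "under 1%" figure comes from. For a weight matrix of shape d_out x d_in, LoRA trains two small matrices, A (r x d_in) and B (d_out x r), so it adds r * (d_in + d_out) parameters. The layer shapes and total size below are assumptions for an 8B Llama-style model with adapters on `q_proj` and `v_proj` at rank 16; check your own model's config before relying on them.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)

# Assumed shapes (not taken from the article): verify against your model config.
HIDDEN = 4096          # model width; q_proj maps HIDDEN -> HIDDEN
KV_DIM = 1024          # v_proj output width under grouped-query attention
LAYERS = 32            # number of transformer blocks
TOTAL_PARAMS = 8.0e9   # rough size of the frozen base model
RANK = 16

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
adapter_params = per_layer * LAYERS
fraction = adapter_params / TOTAL_PARAMS

print(adapter_params)            # 6815744
print(f"{fraction:.3%}")         # 0.085%
print(f"{adapter_params * 2 / 1e6:.1f} MB at 16-bit precision")
```

Under these assumptions the adapter is a few million parameters against roughly eight billion frozen ones, which is why it can be stored and swapped as a small file.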

Evaluating the result

A fine-tune that you did not evaluate is a liability, not an asset. The non-negotiable rule: always compare against your prompt-only baseline on data the model never trained on. If the fine-tune does not clearly beat the baseline, you just spent weeks to ship a regression risk.

Hold out a test set early. Split before you train and never let those examples leak into training.
Beat the baseline or stop. Run the same eval against plain prompting and against RAG. The fine-tune has to win to justify itself.
Measure what you actually fine-tuned for. Format compliance, tone match, schema validity, task-specific metrics, not a generic benchmark score.
Watch for regressions. A model tuned hard on one task often gets *worse* at everything else. Keep a small general-capability check in the loop.
Use an LLM judge for the fuzzy stuff, but spot-check the judge against human ratings so you trust it.
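
The "beat the baseline or stop" rule can be turned into a small gate. This is a sketch under stated assumptions: the metric is exact match on the parsed JSON reply, which suits the triage example but not free-text tasks, and `min_gain=0.05` is an arbitrary threshold you would choose yourself. `exact_match_rate` and `should_deploy` are hypothetical names.

```python
import json

def parse_or_none(text):
    """Parse model output as JSON; return None if it is not valid JSON."""
    try:
        return json.loads(text)
    except (TypeError, json.JSONDecodeError):
        return None

def exact_match_rate(outputs, references):
    """Share of held-out examples whose output parses and equals the reference."""
    if len(outputs) != len(references) or not references:
        raise ValueError("outputs and references must be non-empty and aligned")
    hits = 0
    for output, reference in zip(outputs, references):
        parsed = parse_or_none(output)
        if parsed is not None and parsed == json.loads(reference):
            hits += 1
    return hits / len(references)

def should_deploy(baseline_rate, finetuned_rate, min_gain=0.05):
    """Deploy only if the fine-tune beats the prompt-only baseline by a clear margin."""
    return finetuned_rate - baseline_rate >= min_gain

if __name__ == "__main__":
    # Made-up outputs for illustration only.
    references = [
        '{"category": "bug", "priority": "high", "needs_human": true}',
        '{"category": "account", "priority": "low", "needs_human": false}',
    ]
    baseline = ["Sure! Here is the JSON you asked for.", references[1]]
    finetuned = list(references)
    base_rate = exact_match_rate(baseline, references)
    tuned_rate = exact_match_rate(finetuned, references)
    print(base_rate, tuned_rate, should_deploy(base_rate, tuned_rate))
```

Running the same function over the prompt-only, RAG, and fine-tuned outputs keeps the comparison honest, because all three are scored on identical held-out examples with an identical metric.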

For the full discipline of measuring LLM systems (judges, regression suites, and offline-vs-online evals), see Evaluating LLM Applications.

Common mistakes that cost weeks

1. Fine-tuning for fresh facts. Weights freeze at training time. If the answer changes daily, you want RAG, not a fine-tune. This is the number-one wasted project.
2. Too little or dirty data. A few dozen inconsistent examples teach the model noise. Curate for quality; 500 clean pairs beat 5,000 messy ones.
3. No evaluation. Shipping a checkpoint because the loss curve looked nice, with no held-out comparison to the baseline. You have no idea if it is better.
4. Skipping prompting and RAG first. Reaching straight for the most expensive, slowest, hardest-to-change tool when a one-line prompt edit or a retrieval step would have solved it.
5. Overfitting on epochs. Training too long memorizes your examples and destroys generalization. Watch eval loss, not just training loss.
6. One model to rule them all. Cramming many unrelated tasks into one fine-tune. Prefer small, focused adapters you can swap per use case.
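
For mistake 5, "watch eval loss" usually means keeping the checkpoint from the epoch where held-out loss was lowest and stopping once it starts rising. A minimal sketch of that rule follows; `best_epoch` and `patience` are hypothetical names, and the loss values are made up for illustration.

```python
def best_epoch(eval_losses, patience=1):
    """Return the 1-based epoch to keep.

    Scanning stops once eval loss has failed to improve on the best value
    for `patience` epochs in a row (a simple early-stopping rule).
    """
    if not eval_losses:
        raise ValueError("need at least one eval loss")
    best_index, bad_epochs = 0, 0
    for index in range(1, len(eval_losses)):
        if eval_losses[index] < eval_losses[best_index]:
            best_index, bad_epochs = index, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_index + 1

if __name__ == "__main__":
    # Illustrative numbers: training loss would keep falling, eval loss turns up.
    eval_losses = [0.90, 0.62, 0.55, 0.58, 0.71]
    print(best_epoch(eval_losses))  # 3
```

A rising eval loss alongside a falling training loss is the usual sign that the adapter has started memorizing the training examples.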

Takeaways

The whole article in seven lines

Fine-tuning changes behavior (style, tone, format), never knowledge of fresh facts.
Try prompting first, then RAG, and only then fine-tuning. Most teams never need the third.
Use RAG for anything that changes; use fine-tuning for consistent house style and strict formats.
The dataset is the project: a few hundred clean, on-task examples beat thousands of noisy ones.
Use LoRA / PEFT: train tiny adapters, not the whole model. Cheaper, faster, swappable.
Always evaluate on a held-out set against your prompt-only baseline. No win, no deploy.
Then monitor in production and feed new examples into the next round.

Where to go next

Fine-tuning sits near the top of the AI engineering stack; reach for it after you have squeezed everything you can out of prompting and retrieval. Build the layers underneath it first.

Solve the freshness problem the right way: RAG Architecture Explained.
Build the eval discipline before you train anything: Evaluating LLM Applications.
See where fine-tuning fits in the broader role: the AI Engineer career path.

Frequently asked questions

What can fine-tuning actually do, and what can't it do?

Fine-tuning teaches a model how to behave, such as your tone, output format, and house style. It does not teach it what is true today, so it will not stop the model from hallucinating yesterday's pricing. For fresh facts, you use retrieval at question time instead.

Should I fine-tune, use RAG, or just improve my prompt?

These three are not competitors and they stack happily, so the rule is to try prompting first, then RAG, and only then consider fine-tuning. Most teams never need the third box, and the strongest systems combine all three.

What does fine-tuning training data look like?

The dataset is the project, and each line is one self-contained example in the chat schema: a system message setting the role, the user input, and the assistant output you wish the model had produced. Good examples teach format and routing behavior, not facts that change day to day.

Do I have to retrain the whole model?

No. When fine-tuning is the right call, the typical path trains a small adapter rather than the whole model, then you evaluate honestly, deploy, and watch it in production. Curating a clean dataset and formatting it the way the trainer expects comes first.
