Retrieval-Augmented Generation is how you make an LLM answer questions about *your* data, without retraining it. If you already understand APIs and databases, you're most of the way there.
You ship an internal chatbot on top of GPT-4 or Claude. A teammate asks it about the deploy runbook you updated last week. It confidently invents an answer: wrong commands, a service that was renamed two quarters ago, a flag that no longer exists. The model isn't broken. It simply has no idea your runbook exists, and it never will, because the knowledge it was trained on stops at a fixed cutoff date.
Who this is for
Backend and platform engineers who are comfortable with REST APIs and databases but haven't shipped an LLM feature yet. You don't need any ML background. If you can call an API and store rows in Postgres, you can build RAG.
Large Language Models like GPT-4 and Claude are trained once, on a fixed dataset with a knowledge cutoff. They don't know your company's internal docs, your product's latest changelog, or anything that happened after training. They also can't cite a source: they generate text that *sounds* right, which is exactly the failure mode you can't ship to users.
There are two ways to give a model new knowledge. You can fine-tune: keep training the model on your data so the knowledge is baked into the weights. It's expensive, slow to iterate, goes stale the moment your docs change, and the model still can't tell you *where* an answer came from. Or you can use RAG (Retrieval-Augmented Generation): leave the model alone, and at query time, look up the relevant documents and paste them into the prompt as context. The model then gives an answer grounded in text you handed it, and you know exactly which documents it used.
RAG is an open-book exam, not memorization. You don't make the model smarter; you let it look up the answer.
| Pattern you already know | RAG equivalent |
| --- | --- |
| A library catalog that finds books by topic, not exact title | Vector search over embeddings (similarity, not keyword match) |
| Handing a new hire the relevant runbook before they answer | Injecting retrieved chunks into the prompt as context |
| An open-book exam: the facts are in front of you | The LLM answers using only the supplied context |
| A cache in front of a slow source of truth | The vector DB serving precomputed, queryable knowledge |
RAG maps cleanly onto patterns you already use every day.
The picture: two flows, one shared store
RAG is really two pipelines that meet at a vector database. The first runs ahead of time and turns your documents into searchable vectors (indexing). The second runs on every user question and pulls the relevant vectors back out to feed the model (query). The vector DB is the shared node: indexing writes to it, and query reads from it.
Top row: indexing (offline). Bottom row: query (per request). Both meet at the vector DB.
One rule makes the whole thing work: the same embedding model must be used on both rows. If you index with one model and query with another, the vectors live in different coordinate spaces and similarity search returns garbage. Pick an embedder once and pin its version.
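One way to enforce that rule is to record which embedder built the index and refuse to query with anything else. The sketch below is a minimal illustration, not a prescribed design: `EmbedderSpec`, `check_compatible`, and the model name and version are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbedderSpec:
    """Identifies an embedding model (hypothetical shape, for illustration)."""
    name: str
    version: str
    dimensions: int


class EmbedderMismatchError(RuntimeError):
    """Raised when index-time and query-time embedders differ."""


def check_compatible(index_spec: EmbedderSpec, query_spec: EmbedderSpec) -> None:
    """Fail loudly instead of letting similarity search return garbage."""
    if index_spec != query_spec:
        raise EmbedderMismatchError(
            f"index built with {index_spec}, but queries use {query_spec}"
        )


# Stored once when the index is built, for example in a metadata table.
INDEX_SPEC = EmbedderSpec(name="example-embedder", version="2024-01", dimensions=384)

# Same spec on the query side: passes silently.
check_compatible(INDEX_SPEC, EmbedderSpec("example-embedder", "2024-01", 384))
```

Running a check like this at service startup turns a silent quality problem into an immediate, obvious error.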
The pipeline, step by step
1. Chunk. Split each document into ~500-token chunks. Whole documents are too big to embed meaningfully and too big to fit in a prompt; single sentences lose context. Chunking is the lever you'll tune most.
2. Embed (index time). Run each chunk through an embedding model to get a vector: a list of floats that encodes meaning. Similar text lands close together in vector space.
3. Store. Upsert each vector into the vector DB alongside its source text and metadata (doc id, URL, section) so you can cite it later. This whole pass is offline; rerun it when docs change.
4. Embed (query time). When a question arrives, embed it with the *same* model. Now the question is a vector in the same space as your chunks.
5. Retrieve. Ask the vector DB for the K nearest chunks by cosine similarity, typically K = 3 to 5. These are your candidate facts.
6. Generate. Paste the retrieved chunks into the prompt as context, instruct the model to answer using only that context, and return the response with citations from the chunk metadata.
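To make the Retrieve step concrete, here is a brute-force sketch of "K nearest chunks by cosine similarity" in plain Python. The three-dimensional vectors and chunk ids are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a real vector DB typically uses an approximate index rather than scoring every chunk.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float], index: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Score every chunk against the query and keep the k best."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" keyed by chunk id (hypothetical data).
index = {
    "deploy-runbook#rollback": [0.9, 0.1, 0.0],
    "deploy-runbook#flags": [0.7, 0.7, 0.0],
    "holiday-policy": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
print([chunk_id for chunk_id, _ in results])
# prints ['deploy-runbook#rollback', 'deploy-runbook#flags']
```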
The whole pipeline in code
Here is indexing and querying in one file. It's deliberately framework-free (no LangChain, no magic) so you can see every moving part. In production you'd swap the placeholders for a real embedder (e.g. an OpenAI or open-source embedding model) and a real vector store, but the shape never changes.
rag_pipeline.py

```python
# ---- Indexing (run offline, re-run when docs change) ----
def index_documents(docs: list[str]) -> None:
    for doc in docs:
        # 1. Split into ~500-token chunks
        for chunk in chunk_text(doc, size=500):
            # 2. Embed the chunk (SAME model as queries use)
            vector = embedder.embed(chunk)
            # 3. Store vector + source text for later citation
            vector_db.upsert(vector=vector, text=chunk)


# ---- Querying (run on every user question) ----
def answer_question(question: str) -> str:
    # 1. Embed the question
    query_vector = embedder.embed(question)
    # 2. Retrieve the K most similar chunks (cosine similarity)
    chunks = vector_db.similarity_search(query_vector, k=5)
    # 3. Generate, grounded ONLY in the retrieved context
    context = "\n\n".join(chunks)
    prompt = f"""Answer the question using ONLY the context below.
If the answer isn't in the context, say you don't know.

Context:
{context}

Question: {question}"""
    return llm.complete(prompt)
```
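The `chunk_text` placeholder in the pipeline above could be sketched like this. It packs whole paragraphs into chunks of roughly `size` tokens and only cuts mid-paragraph when a single paragraph is too long. The big assumption: one whitespace-separated word counts as one token, which a real tokenizer would not agree with exactly.

```python
from collections.abc import Iterator


def chunk_text(doc: str, size: int = 500) -> Iterator[str]:
    """Yield chunks of roughly `size` tokens, breaking on paragraph boundaries.

    Assumption for this sketch: one whitespace-separated word is one token.
    """
    current: list[str] = []
    for paragraph in doc.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        # Flush the running chunk if this paragraph would overflow it.
        if current and len(current) + len(words) > size:
            yield " ".join(current)
            current = []
        # A paragraph longer than `size` is cut into fixed-size pieces.
        while len(words) > size:
            yield " ".join(words[:size])
            words = words[size:]
        current.extend(words)
    if current:
        yield " ".join(current)


doc = "Stop the service.\n\nRun the migration.\n\nStart the service and check health."
for chunk in chunk_text(doc, size=6):
    print(chunk)
```

With `size=6`, the first two paragraphs fit in one chunk together and the third starts a new one, so related short paragraphs stay together instead of being split at an arbitrary token count.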
Pro tip
Notice the prompt's escape hatch: "If the answer isn't in the context, say you don't know." Without it, the model will paper over weak retrieval by hallucinating. This one line is the cheapest reliability win in RAG.
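The same prompt can also carry the citation metadata from the Store step. The sketch below numbers each retrieved chunk so the model can point at its sources; the `Chunk` shape, the extra "cite by number" instruction, and the example URL are assumptions for illustration, not part of the pipeline file above.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A retrieved chunk plus the metadata stored with it (hypothetical shape)."""
    text: str
    doc_id: str
    url: str


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Number each chunk so the model can refer to sources as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] ({chunk.doc_id})\n{chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using ONLY the context below.\n"
        "If the answer isn't in the context, say you don't know.\n"
        "Cite the sources you used by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def citations(chunks: list[Chunk]) -> list[str]:
    """Map each bracketed number back to a URL to show next to the answer."""
    return [f"[{i}] {chunk.url}" for i, chunk in enumerate(chunks, start=1)]


retrieved = [Chunk("Run deploy.sh with --canary first.", "deploy-runbook", "https://example.com/runbook")]
print(build_prompt("How do I deploy?", retrieved))
print(citations(retrieved))
```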
Vector databases: what to pick
The vector DB stores your embeddings and answers nearest-neighbor queries fast. The market looks crowded, but for backend engineers the decision is usually simpler than the marketing suggests.
| Database | Type | Best for | Watch out for |
| --- | --- | --- | --- |
| pgvector | Postgres extension | Teams already on Postgres; one less system to run | Tune indexes (HNSW) past a few million rows |
| Pinecone | Fully managed SaaS | Shipping fast with zero ops | Cost climbs steeply at scale; vendor lock-in |
| Qdrant | Open source (Rust) | High-throughput, self-hosted, rich filtering | You own the operational burden |
| Weaviate | Open source | Built-in hybrid (vector + keyword) search | Heavier to operate than pgvector |
| Chroma | Embedded / lightweight | Local prototyping and demos | Not a production store at scale |
The realistic shortlist for production RAG.
Pro tip
Don't over-engineer this. **pgvector on the Postgres you already run handles the vast majority of real-world RAG.** You keep your existing backups, monitoring, and operational know-how. Reach for a dedicated vector DB only when you have millions of documents or need advanced filtering pgvector can't do efficiently.
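For a sense of how little pgvector asks of you, here is a sketch of the schema and the nearest-neighbor query, held as Python strings so a Postgres driver can run them. The table name, column names, and the 1536 dimension are assumptions (the dimension must match whatever your embedder outputs), and the `%(name)s` placeholders assume a psycopg-style driver.

```python
# Schema for storing chunks and their embeddings with pgvector.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    doc_id    text NOT NULL,
    content   text NOT NULL,
    embedding vector(1536) NOT NULL
);

-- HNSW index built for cosine distance.
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine distance operator: smaller means more similar,
# so 1 - distance gives cosine similarity.
SEARCH_SQL = """
SELECT doc_id, content, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values: list[float]) -> str:
    """Format floats in pgvector's text form, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


# Parameters you would pass alongside SEARCH_SQL to the driver's execute call.
params = {"query": to_vector_literal([0.1, 0.2, 0.3]), "k": 5}
print(params["query"])
# prints [0.1,0.2,0.3]
```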
Where RAG breaks down
A naive RAG demo works on day one and disappoints in week two. The failures are predictable, and so are the fixes.
Bad chunking. Chunks too large drag in irrelevant text and dilute the signal; too small and each chunk loses the context that made it meaningful. Start at 256–1024 tokens and try *semantic* chunking that splits on headings and paragraphs instead of fixed token counts.
Retrieval misses. Pure vector search finds *conceptually* similar text but can whiff on exact terms: an error code, a product SKU, a flag name. Add hybrid search (vector similarity + BM25 keyword) so exact matches still surface.
Context overflow. You retrieved 20 chunks and blew past the model's context window, or buried the one good chunk in noise. Run a reranker to score candidates and keep only the top 3–5 before generation.
Mismatched embedders. Indexing with one model and querying with another silently returns nonsense. Pin one embedding model and version everywhere.
Stale index. RAG is only as fresh as your last indexing run. Wire re-indexing into the pipeline that publishes the docs, or it drifts out of date the same way fine-tuning does.
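Hybrid search needs some way to merge the vector ranking with the keyword ranking. One common option is reciprocal rank fusion (RRF), sketched below; it's a swapped-in technique chosen for illustration, not the only way to combine results. The chunk ids are hypothetical, and `k=60` is just a conventional default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical results for a question about one specific error code.
vector_hits = ["errors-overview", "retry-policy", "timeouts"]  # conceptually similar
keyword_hits = ["err-4012", "errors-overview"]                 # exact-term matches

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
# prints ['errors-overview', 'err-4012', 'retry-policy', 'timeouts']
```

The chunk that vector search missed entirely (`err-4012`) lands in second place, while the chunk both searches agreed on stays on top. The fused list is also a natural input for a reranker if you add one.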
Watch out
RAG doesn't fix a bad knowledge base. If your source docs are wrong, contradictory, or missing, retrieval will faithfully surface the wrong thing, now with the model's full confidence behind it. Garbage in, confident garbage out.
RAG vs fine-tuning: when to use each
These solve different problems, and the most common mistake is reaching for fine-tuning when you actually have a retrieval problem. Fine-tuning changes *how* the model talks; RAG changes *what facts* it has access to.
| Dimension | RAG | Fine-tuning |
| --- | --- | --- |
| Best for | Facts, citations, up-to-date knowledge | Style, tone, output format, domain phrasing |
| Knowledge freshness | Live: re-index and it's current | Frozen into weights until you retrain |
| Cost to update | Cheap (re-embed changed docs) | Expensive (a new training run) |
| Auditability | High: you can cite the source chunk | Low: knowledge is opaque in the weights |
| Iteration speed | Minutes | Hours to days |
Pick based on whether you're changing behavior or supplying knowledge.
Note
For most enterprise use cases (documentation Q&A, customer support, internal knowledge bases), **RAG is the right default**: cheaper, faster to iterate, and auditable. Reach for fine-tuning when you need a consistent voice or a strict output format, and remember you can do both: fine-tune for style, RAG for facts.
Takeaways
RAG in seven lines
LLMs are frozen at a training cutoff: they don't know your data and can't cite sources.
RAG retrieves relevant documents at query time and injects them into the prompt as context.
It's two pipelines meeting at a vector DB: indexing (offline) and query (per request).
Use the **same embedding model** on both sides, or similarity search breaks.
pgvector on your existing Postgres covers most real-world RAG; don't over-engineer the store.
The usual failures are bad chunking, retrieval misses, and context overflow, fixed by tuning chunks, hybrid search, and a reranker.
RAG for facts and freshness; fine-tuning for style and format. RAG is the right default.
Where to go next
You now have the mental model and a working pipeline. The natural next step is to go deeper into embeddings, evaluation, and shipping RAG to production, which is exactly the arc the AI Engineer path follows.
AI Engineer career path (the full track): embeddings, vector DBs, RAG, agents, and LLM evaluation from zero to production.
Build it for real next: swap the placeholders in rag_pipeline.py for pgvector and a hosted embedding model, then add hybrid search and a reranker once the basics work.