RAG Architecture Explained for Backend Engineers

On this page

The model is frozen, your data isn't
The picture: two flows, one shared store
The pipeline, step by step
The whole pipeline in code
Vector databases: what to pick
Where RAG breaks down
RAG vs fine-tuning: when to use each
Takeaways
Where to go next

TL;DR

Make an LLM answer questions about your own data without retraining it. If you know APIs and databases, you can build the full RAG pipeline, pick the right vector store, and know when fine-tuning beats retrieval instead.

The model is frozen, your data isn't

You ship an internal chatbot on top of GPT-4 or Claude. A teammate asks it about the deploy runbook you updated last week. It confidently invents an answer: wrong commands, a service that was renamed two quarters ago, a flag that no longer exists. The model isn't broken. It simply has no idea your runbook exists, and it never will: its training data stops at a fixed cutoff date and never included your internal docs.

Who this is for

Backend and platform engineers who are comfortable with REST APIs and databases but haven't shipped an LLM feature yet. You don't need any ML background. If you can call an API and store rows in Postgres, you can build RAG.

Large language models like GPT-4 and Claude are trained once, on a fixed dataset with a knowledge cutoff. They don't know your company's internal docs, your product's latest changelog, or anything that happened after training. They also can't cite a source: they generate text that *sounds* right, which is exactly the failure mode you can't ship to users.

There are two ways to give a model new knowledge. You can fine-tune: keep training the model on your data so the knowledge is baked into the weights. That is expensive, slow to iterate on, stale the moment your docs change, and the model still can't tell you *where* an answer came from. Or you can use RAG (Retrieval-Augmented Generation): leave the model alone and, at query time, look up the relevant documents and paste them into the prompt as context. The model then answers from text you handed it, and you know exactly which documents it used.

RAG is an open-book exam, not memorization. You don't make the model smarter; you let it look up the answer.

Familiar pattern	RAG equivalent
A library catalog that finds books by topic, not exact title	Vector search over embeddings (similarity, not keyword match)
Handing a new hire the relevant runbook before they answer	Injecting retrieved chunks into the prompt as context
An open-book exam: the facts are in front of you	The LLM answers using only the supplied context
A cache in front of a slow source of truth	The vector DB serving precomputed, queryable knowledge

RAG maps cleanly onto patterns you already use every day.

The picture: two flows, one shared store

RAG is really two pipelines that meet at a vector database. The first runs ahead of time and turns your documents into searchable vectors (indexing). The second runs on every user question and pulls the relevant vectors back out to feed the model (query). The vector DB is the shared node: written by indexing, read by query.

Top row: indexing (offline). Bottom row: query (per request). Both meet at the vector DB.

One rule makes the whole thing work: the same embedding model must be used on both rows. If you index with one model and query with another, the vectors live in different coordinate spaces and similarity search returns garbage. Pick an embedder once and pin its version.
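One way to enforce that rule is to record the embedder's identity next to the index and refuse to query on a mismatch. The sketch below is only an illustration: `EmbedderSpec`, `check_compatible`, and the model name and dimensions are hypothetical, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbedderSpec:
    """Identity of an embedding model (hypothetical helper, not a library type)."""
    name: str
    version: str
    dimensions: int


def check_compatible(index_spec: EmbedderSpec, query_spec: EmbedderSpec) -> None:
    """Raise if the query-time embedder differs from the one that built the index."""
    if index_spec != query_spec:
        raise ValueError(
            f"Embedder mismatch: index built with {index_spec}, "
            f"query uses {query_spec}. Re-index or switch models."
        )


# Stored alongside the index at indexing time (illustrative values).
INDEX_SPEC = EmbedderSpec(name="example-embedder", version="v1", dimensions=384)

# Same spec at query time: passes silently.
check_compatible(INDEX_SPEC, EmbedderSpec("example-embedder", "v1", 384))
```

Failing loudly at startup is usually preferable to discovering the mismatch through bad answers, since a mismatched embedder raises no error on its own.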

The pipeline, step by step

1. Chunk. Split each document into ~500-token chunks. Whole documents are too big to embed meaningfully and too big to fit in a prompt; single sentences lose context. Chunking is the lever you'll tune most.
2. Embed (index time). Run each chunk through an embedding model to get a vector: a list of floats that encodes meaning. Similar text lands close together in vector space.
3. Store. Upsert each vector into the vector DB alongside its source text and metadata (doc id, URL, section) so you can cite it later. This whole pass is offline; rerun it when docs change.
4. Embed (query time). When a question arrives, embed it with the *same* model. Now the question is a vector in the same space as your chunks.
5. Retrieve. Ask the vector DB for the K nearest chunks by cosine similarity, typically K = 3 to 5. These are your candidate facts.
6. Generate. Paste the retrieved chunks into the prompt as context, instruct the model to answer using only that context, and return the response with citations from the chunk metadata.
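Chunking and retrieval are the two steps with real logic in them, so here is a minimal sketch of each. It assumes whitespace-separated words as a rough stand-in for tokens (a real pipeline would count tokens with the embedder's own tokenizer) and adds a small overlap between chunks, which is a common choice but not something the steps above require. `chunk_words` and `top_k` are illustrative names.

```python
import math


def chunk_words(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `size` words; neighbors share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k vectors most similar to the query (brute force)."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]
```

The brute-force scan in `top_k` compares the query against every stored vector, which is fine for a prototype. A vector DB answers the same question with an index so it doesn't have to touch every row.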

The whole pipeline in code

Here is indexing and querying in one file. It's deliberately framework-free (no LangChain, no magic) so you can see every moving part. In production you'd swap the placeholders for a real embedder (e.g. an OpenAI or open-source embedding model) and a real vector store, but the shape stays the same.

rag_pipeline.py

```python
# Placeholders: chunk_text, embedder, vector_db, and llm stand in for your
# real chunker, embedding client, vector store, and LLM client.

# ---- Indexing (run offline, re-run when docs change) ----
def index_documents(docs: list[str]) -> None:
    for doc in docs:
        # 1. Split into ~500-token chunks
        for chunk in chunk_text(doc, size=500):
            # 2. Embed the chunk (SAME model as queries use)
            vector = embedder.embed(chunk)
            # 3. Store vector + source text for later citation
            vector_db.upsert(vector=vector, text=chunk)


# ---- Querying (run on every user question) ----
def answer_question(question: str) -> str:
    # 1. Embed the question
    query_vector = embedder.embed(question)

    # 2. Retrieve the K most similar chunks (cosine similarity).
    #    Assumes the store returns the chunk texts as strings.
    chunks = vector_db.similarity_search(query_vector, k=5)

    # 3. Generate, grounded ONLY in the retrieved context
    context = "\n\n".join(chunks)
    prompt = f"""Answer the question using ONLY the context below.
If the answer isn't in the context, say you don't know.

Context:
{context}

Question: {question}"""

    return llm.complete(prompt)
```

Pro tip

Notice the prompt's escape hatch: "If the answer isn't in the context, say you don't know." Without it, the model will paper over weak retrieval by hallucinating. This one line is the cheapest reliability win in RAG.
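The function above returns only the answer text. To return citations as well, retrieval has to hand back metadata along with each chunk. A minimal sketch, assuming each retrieved chunk is a small record with `text`, `doc_id`, and `url` fields (hypothetical names; use whatever your store actually returns):

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """One search hit plus the metadata stored with it (illustrative shape)."""
    text: str
    doc_id: str
    url: str


def build_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    """Number each chunk so the model can refer to sources as [1], [2], ..."""
    context = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources by their bracketed number.\n"
        "If the answer isn't in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def citations(chunks: list[RetrievedChunk]) -> list[str]:
    """Source list to return alongside the answer, numbered like the prompt."""
    return [f"[{i}] {c.doc_id} ({c.url})" for i, c in enumerate(chunks, start=1)]
```

Because the prompt and the source list share the same numbering, a `[2]` in the model's answer can be mapped back to a document without parsing anything else.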

Vector databases: what to pick

The vector DB stores your embeddings and answers nearest-neighbor queries fast. The market looks crowded, but for backend engineers the decision is usually simpler than the marketing suggests.

Database	Type	Best for	Watch out for
pgvector	Postgres extension	Teams already on Postgres who want one less system to run	Tune indexes (HNSW) past a few million rows
Pinecone	Fully managed SaaS	Shipping fast with zero ops	Cost climbs steeply at scale; vendor lock-in
Qdrant	Open source (Rust)	High-throughput, self-hosted, rich filtering	You own the operational burden
Weaviate	Open source	Built-in hybrid (vector + keyword) search	Heavier to operate than pgvector
Chroma	Embedded / lightweight	Local prototyping and demos	Not a production store at scale

The realistic shortlist for production RAG.

Pro tip

Don't over-engineer this. pgvector on the Postgres you already run handles the vast majority of real-world RAG. You keep your existing backups, monitoring, and operational know-how. Reach for a dedicated vector DB only when you have millions of documents or need advanced filtering pgvector can't do efficiently.
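For a sense of how little is involved, here is a sketch of the pgvector side written as SQL strings in Python. The table name, column names, and the 384-dimension vector are illustrative assumptions, and the `%s` placeholders assume a psycopg-style driver. `<=>` is pgvector's cosine-distance operator.

```python
# Schema: one row per chunk. Table and column names are illustrative.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    doc_id    text NOT NULL,
    content   text NOT NULL,
    embedding vector(384) NOT NULL  -- must match your embedder's dimensions
);
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> returns cosine distance: smaller means more similar.
SEARCH_SQL = """
SELECT doc_id, content, 1 - (embedding <=> %s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(vector: list[float]) -> str:
    """Format a Python list in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


def search_params(query_vector: list[float], k: int = 5) -> tuple[str, str, int]:
    """Parameters for SEARCH_SQL, in placeholder order."""
    literal = to_vector_literal(query_vector)
    return (literal, literal, k)
```

With a psycopg-style cursor this would run as `cursor.execute(SEARCH_SQL, search_params(query_vector))`. That call is left out of the block because it needs a live database.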

Where RAG breaks down

A naive RAG demo works on day one and disappoints in week two. The failures are predictable, and so are the fixes.

1. Bad chunking. Chunks that are too large drag in irrelevant text and dilute the signal; chunks that are too small lose the context that made them meaningful. Start somewhere in the 256–1024 token range and try *semantic* chunking that splits on headings and paragraphs instead of fixed token counts.
2. Retrieval misses. Pure vector search finds *conceptually* similar text but can whiff on exact terms: an error code, a product SKU, a flag name. Add hybrid search (vector similarity + BM25 keyword) so exact matches still surface.
3. Context overflow. You retrieved 20 chunks and blew past the model's context window, or buried the one good chunk in noise. Run a reranker to score candidates and keep only the top 3–5 before generation.
4. Mismatched embedders. Indexing with one model and querying with another silently returns nonsense. Pin one embedding model and version everywhere.
5. Stale index. RAG is only as fresh as your last indexing run. Wire re-indexing into the pipeline that publishes the docs, or the index drifts out of date the same way a fine-tuned model does.
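Hybrid search leaves you with two ranked lists, one from vector similarity and one from keyword search, that need merging. Reciprocal rank fusion (RRF) is one common way to do that. It is named here as an illustration, not as the method this article prescribes. The sketch assumes each search returns chunk ids in best-first order; the ids are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first lists of chunk ids into one ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is the commonly used default.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


vector_hits = ["c7", "c2", "c9"]   # from similarity search
keyword_hits = ["c4", "c7", "c1"]  # from BM25 / full-text search

# c7 ranks first because it appears in both lists.
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF uses only ranks, never raw scores, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.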

Watch out

RAG doesn't fix a bad knowledge base. If your source docs are wrong, contradictory, or missing, retrieval will faithfully surface the wrong thing, now with the model's full confidence behind it. Garbage in, confident garbage out.

RAG vs fine-tuning: when to use each

These solve different problems, and the most common mistake is reaching for fine-tuning when you actually have a retrieval problem. Fine-tuning changes *how* the model talks; RAG changes *what facts* it has access to.

Dimension	RAG	Fine-tuning
Best for	Facts, citations, up-to-date knowledge	Style, tone, output format, domain phrasing
Knowledge freshness	Live: re-index and it's current	Frozen into weights until you retrain
Cost to update	Cheap (re-embed changed docs)	Expensive (a new training run)
Auditability	High: you can cite the source chunk	Low: knowledge is opaque in the weights
Iteration speed	Minutes	Hours to days

Pick based on whether you're changing behavior or supplying knowledge.
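The "re-embed changed docs" entry in the table is cheap only if you can tell which docs changed. One simple approach, sketched here under the assumption that you store a content hash per document at indexing time, is to compare hashes before each run. `plan_reindex` and the dictionary shapes are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    docs: dict[str, str], indexed_hashes: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare current docs with the hashes saved at the last indexing run.

    docs maps doc id -> current text; indexed_hashes maps doc id -> saved hash.
    Returns (doc ids to re-embed, doc ids to delete from the index).
    """
    to_embed = [
        doc_id for doc_id, text in docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in docs]
    return to_embed, to_delete
```

New and edited docs both end up in the first list, because a missing hash never matches. Unchanged docs are skipped, which is what keeps the update measured in minutes.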

Note

For most enterprise use cases (documentation Q&A, customer support, internal knowledge bases), RAG is the right default: cheaper, faster to iterate, and auditable. Reach for fine-tuning when you need a consistent voice or a strict output format, and remember you can do both: fine-tune for style, RAG for facts.

Takeaways

RAG in seven lines

LLMs are frozen at a training cutoff; they don't know your data and can't cite sources.
RAG retrieves relevant documents at query time and injects them into the prompt as context.
It's two pipelines meeting at a vector DB: indexing (offline) and query (per request).
Use the same embedding model on both sides, or similarity search breaks.
pgvector on your existing Postgres covers most real-world RAG; don't over-engineer the store.
The usual failures are bad chunking, retrieval misses, and context overflow, fixed by tuning chunks, hybrid search, and a reranker.
RAG for facts and freshness; fine-tuning for style and format. RAG is the right default.

Where to go next

You now have the mental model and a working pipeline. The natural next step is to go deeper into embeddings, evaluation, and shipping RAG to production, which is exactly the arc the AI Engineer path follows.

AI Engineer career path: the full track covering embeddings, vector DBs, RAG, agents, and LLM evaluation, from zero to production.
Build it for real next: swap the placeholders in rag_pipeline.py for pgvector and a hosted embedding model, then add hybrid search and a reranker once the basics work.


Frequently asked questions

Why does an LLM make up answers about my internal docs?

Models like GPT-4 and Claude are trained once on a fixed dataset with a knowledge cutoff, so they have no idea your runbook or internal docs exist. They generate text that sounds right rather than admitting the gap, which is the failure mode you cannot ship to users.

When should I use RAG instead of fine-tuning?

Use RAG when your data changes or you need to cite sources, since it leaves the model alone and pastes relevant documents into the prompt at query time. Fine-tuning bakes data into the weights but is expensive, slow to iterate, goes stale the moment your docs change, and still cannot tell you where an answer came from.

What is the most important rule to get RAG working?

Use the same embedding model for both indexing and querying. If you index with one model and query with another, the vectors live in different coordinate spaces and similarity search returns garbage, so pick one embedder and pin its version.

How do I stop RAG from hallucinating when retrieval is weak?

Add an escape hatch to the prompt such as telling the model to say it does not know if the answer is not in the provided context. Without it, the model will paper over weak retrieval by inventing an answer.
