AI Engineering · 12 min read · Feb 2026

RAG Architecture Explained for Backend Engineers

Retrieval-Augmented Generation is how you make an LLM answer questions about *your* data, without retraining it. If you already understand APIs and databases, you're most of the way there.

AI · RAG · LLMs · Vector DBs

Sri Balaji

Founder · TheSimplifiedTech

The model is frozen, your data isn't

You ship an internal chatbot on top of GPT-4 or Claude. A teammate asks it about the deploy runbook you updated last week. It confidently invents an answer: wrong commands, a service that was renamed two quarters ago, a flag that no longer exists. The model isn't broken. It simply has no idea your runbook exists, and it never will, because the knowledge it was trained on stops at a fixed cutoff date.

Who this is for

Backend and platform engineers who are comfortable with REST APIs and databases but haven't shipped an LLM feature yet. You don't need any ML background. If you can call an API and store rows in Postgres, you can build RAG.

Large Language Models like GPT-4 and Claude are trained once, on a fixed dataset with a knowledge cutoff. They don't know your company's internal docs, your product's latest changelog, or anything that happened after training. They also can't cite a source; they generate text that *sounds* right, which is exactly the failure mode you can't ship to users.

There are two ways to give a model new knowledge. You can fine-tune: keep training the model on your data so the knowledge bakes into the weights. It's expensive, slow to iterate, goes stale the moment your docs change, and the model still can't tell you *where* an answer came from. Or you can use RAG (Retrieval-Augmented Generation): leave the model alone and, at query time, look up the relevant documents and paste them into the prompt as context. The model then answers grounded in text you handed it, and you know exactly which documents it used.

RAG is an open-book exam, not memorization. You don't make the model smarter; you let it look up the answer.

| Familiar pattern | RAG equivalent |
| --- | --- |
| A library catalog that finds books by topic, not exact title | Vector search over embeddings (similarity, not keyword match) |
| Handing a new hire the relevant runbook before they answer | Injecting retrieved chunks into the prompt as context |
| An open-book exam: the facts are in front of you | The LLM answers using only the supplied context |
| A cache in front of a slow source of truth | The vector DB serving precomputed, queryable knowledge |

RAG maps cleanly onto patterns you already use every day.

The picture: two flows, one shared store

RAG is really two pipelines that meet at a vector database. The first runs ahead of time and turns your documents into searchable vectors (indexing). The second runs on every user question and pulls the relevant vectors back out to feed the model (query). The vector DB is the shared node: written by indexing, read by query.

Documents (PDFs, Markdown, DB rows) --load--> Chunker (~500-token splits) --split--> Embedder (text → vector) --upsert--> Vector DB (stores embeddings)

User question (natural language) --embed--> Embedder (same model as index) --search--> Vector DB --top-K chunks--> LLM + context (prompt with chunks) --generate--> Grounded answer (cites sources)

Top row: indexing (offline). Bottom row: query (per request). Both meet at the vector DB.

One rule makes the whole thing work: the same embedding model must be used on both rows. If you index with one model and query with another, the vectors live in different coordinate spaces and similarity search returns garbage. Pick an embedder once and pin its version.
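
One way to enforce that rule is to record the embedder's identity next to the index and refuse to query when it doesn't match. The following is a minimal sketch, not a prescribed design: `EmbedderSpec`, `IndexManifest`, and `assert_compatible` are hypothetical names, and the model name and version are placeholders rather than a real provider's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbedderSpec:
    """Identity of an embedding model (all fields are illustrative)."""
    name: str
    version: str
    dimensions: int


@dataclass(frozen=True)
class IndexManifest:
    """Metadata written once at indexing time, stored beside the vectors."""
    embedder: EmbedderSpec


def assert_compatible(manifest: IndexManifest, query_embedder: EmbedderSpec) -> None:
    """Raise if the query-time embedder differs from the one used to index."""
    if manifest.embedder != query_embedder:
        raise ValueError(
            f"Index was built with {manifest.embedder}, "
            f"but queries use {query_embedder}; re-index or switch embedders."
        )


# Hypothetical model identity; substitute whatever embedder you pinned.
indexed_with = EmbedderSpec(name="example-embedder", version="2026-01", dimensions=384)
manifest = IndexManifest(embedder=indexed_with)

assert_compatible(manifest, indexed_with)  # same model: passes silently
```

Running this check at service startup turns a silent relevance problem into a loud configuration error.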

The pipeline, step by step

  1. Chunk. Split each document into ~500-token chunks. Whole documents are too big to embed meaningfully and too big to fit in a prompt; single sentences lose context. Chunking is the lever you'll tune most.
  2. Embed (index time). Run each chunk through an embedding model to get a vector: a list of floats that encodes meaning. Similar text lands close together in vector space.
  3. Store. Upsert each vector into the vector DB alongside its source text and metadata (doc id, URL, section) so you can cite it later. This whole pass is offline; rerun it when docs change.
  4. Embed (query time). When a question arrives, embed it with the *same* model. Now the question is a vector in the same space as your chunks.
  5. Retrieve. Ask the vector DB for the K nearest chunks by cosine similarity, typically K = 3 to 5. These are your candidate facts.
  6. Generate. Paste the retrieved chunks into the prompt as context, instruct the model to answer using only that context, and return the response with citations from the chunk metadata.
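
To make the Retrieve step concrete, here is a brute-force sketch of what "K nearest chunks by cosine similarity" computes. The function names and the three-dimensional toy vectors are illustrative assumptions; a vector DB produces the same kind of ranking, usually through an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_index = {
    "deploy-runbook": [0.9, 0.1, 0.0],
    "oncall-policy": [0.1, 0.9, 0.0],
    "lunch-menu": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
# The ids come back as "deploy-runbook" first, then "oncall-policy".
```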

The whole pipeline in code

Here is indexing and querying in one file. It's deliberately framework-free (no LangChain, no magic) so you can see every moving part. In production you'd swap the placeholders for a real embedder (e.g. an OpenAI or open-source embedding model) and a real vector store, but the shape never changes.

rag_pipeline.py

```python
# Placeholders: chunk_text, embedder, vector_db, and llm stand in for your
# chunker, embedding client, vector store, and LLM client.

# ---- Indexing (run offline, re-run when docs change) ----
def index_documents(docs: list[str]) -> None:
    for doc in docs:
        # 1. Split into ~500-token chunks
        for chunk in chunk_text(doc, size=500):
            # 2. Embed the chunk (SAME model as queries use)
            vector = embedder.embed(chunk)
            # 3. Store vector + source text for later citation
            vector_db.upsert(vector=vector, text=chunk)


# ---- Querying (run on every user question) ----
def answer_question(question: str) -> str:
    # 1. Embed the question
    query_vector = embedder.embed(question)

    # 2. Retrieve the K most similar chunks (cosine similarity)
    chunks = vector_db.similarity_search(query_vector, k=5)

    # 3. Generate, grounded ONLY in the retrieved context
    context = "\n\n".join(chunks)
    prompt = f"""Answer the question using ONLY the context below.
If the answer isn't in the context, say you don't know.

Context:
{context}

Question: {question}"""

    return llm.complete(prompt)
```
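
The pipeline calls a `chunk_text` placeholder without defining it. One possible version is sketched below, under two loud assumptions: whitespace-separated words stand in for tokens (a real tokenizer counts differently), and the paragraph-packing strategy is an illustrative choice, not the article's reference implementation.

```python
def chunk_text(text: str, size: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most `size` words.

    Words (whitespace-separated) approximate tokens here; swap in a real
    tokenizer if you need exact token budgets.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    chunks: list[str] = []
    current: list[str] = []  # words in the chunk being built

    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        # Start a new chunk if this paragraph would overflow the current one.
        if current and len(current) + len(words) > size:
            chunks.append(" ".join(current))
            current = []
        # A paragraph longer than `size` is cut into fixed-size pieces.
        while len(words) > size:
            chunks.append(" ".join(words[:size]))
            words = words[size:]
        current.extend(words)

    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Deploy with the release script.\n\nRoll back by redeploying the previous tag."
pieces = chunk_text(sample, size=8)  # each paragraph ends up in its own chunk
```

Keeping paragraphs intact where possible means a chunk rarely starts or ends mid-thought, which tends to help retrieval more than hitting an exact size.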

Pro tip

Notice the prompt's escape hatch: "If the answer isn't in the context, say you don't know." Without it, the model will paper over weak retrieval by hallucinating. This one line is the cheapest reliability win in RAG.
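
The escape hatch pairs naturally with citations. The sketch below shows one way to build a prompt that numbers each retrieved chunk so the answer can point back to its source. The `Chunk` shape, the function names, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A retrieved chunk plus the metadata needed to cite it (illustrative shape)."""
    text: str
    doc_id: str
    url: str


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Number each chunk so the model can cite it as [1], [2], ..."""
    numbered = "\n\n".join(
        f"[{i}] (source: {chunk.doc_id})\n{chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using ONLY the context below.\n"
        "Cite the bracketed number of every chunk you rely on.\n"
        "If the answer isn't in the context, say you don't know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )


def sources_for(chunks: list[Chunk]) -> dict[int, str]:
    """Map citation numbers back to URLs for display next to the answer."""
    return {i: chunk.url for i, chunk in enumerate(chunks, start=1)}


retrieved = [
    Chunk("Run the release script to deploy.", "runbook", "https://example.com/runbook"),
    Chunk("Roll back by redeploying the previous tag.", "rollback", "https://example.com/rollback"),
]
prompt = build_prompt("How do I deploy?", retrieved)
citations = sources_for(retrieved)
```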

Vector databases: what to pick

The vector DB stores your embeddings and answers nearest-neighbor queries fast. The market looks crowded, but for backend engineers the decision is usually simpler than the marketing suggests.

| Database | Type | Best for | Watch out for |
| --- | --- | --- | --- |
| pgvector | Postgres extension | Teams already on Postgres, one less system to run | Tune indexes (HNSW) past a few million rows |
| Pinecone | Fully managed SaaS | Shipping fast with zero ops | Cost climbs steeply at scale; vendor lock-in |
| Qdrant | Open source (Rust) | High-throughput, self-hosted, rich filtering | You own the operational burden |
| Weaviate | Open source | Built-in hybrid (vector + keyword) search | Heavier to operate than pgvector |
| Chroma | Embedded / lightweight | Local prototyping and demos | Not a production store at scale |

The realistic shortlist for production RAG.

Pro tip

Don't over-engineer this. **pgvector on the Postgres you already run handles the vast majority of real-world RAG.** You keep your existing backups, monitoring, and operational know-how. Reach for a dedicated vector DB only when you have millions of documents or need advanced filtering pgvector can't do efficiently.
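
For a sense of how little pgvector asks of you, here is a sketch of the schema and search query, kept as Python string constants so they can be passed to a Postgres driver. The table and column names are made up, `1536` is only an example dimension (it must match your embedder's output size), and the `%(name)s` placeholders assume a driver that accepts that style, such as psycopg.

```python
# SQL for a pgvector-backed chunk store, kept as Python constants.
# Assumptions: table/column names are illustrative, and 1536 is only an
# example dimension; it must equal the output size of the pinned embedder.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    doc_id    text NOT NULL,
    content   text NOT NULL,
    embedding vector(1536) NOT NULL
);

-- HNSW index using cosine distance
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# `<=>` is pgvector's cosine-distance operator: smaller means more similar.
SEARCH_SQL = """
SELECT doc_id, content
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values: list[float]) -> str:
    """Format floats as pgvector's text input form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


# Parameters you would pass alongside SEARCH_SQL when executing the query.
params = {"query": to_vector_literal([0.1, 0.2, 0.3]), "k": 5}
```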

Where RAG breaks down

A naive RAG demo works on day one and disappoints in week two. The failures are predictable, and so are the fixes.

  1. Bad chunking. Chunks too large drag in irrelevant text and dilute the signal; too small and each chunk loses the context that made it meaningful. Start somewhere in the 256–1024 token range and try *semantic* chunking that splits on headings and paragraphs instead of fixed token counts.
  2. Retrieval misses. Pure vector search finds *conceptually* similar text but can whiff on exact terms: an error code, a product SKU, a flag name. Add hybrid search (vector similarity + BM25 keyword) so exact matches still surface.
  3. Context overflow. You retrieved 20 chunks and blew past the model's context window, or buried the one good chunk in noise. Run a reranker to score candidates and keep only the top 3–5 before generation.
  4. Mismatched embedders. Indexing with one model and querying with another silently returns nonsense. Pin one embedding model and version everywhere.
  5. Stale index. RAG is only as fresh as your last indexing run. Wire re-indexing into the pipeline that publishes the docs, or it drifts out of date the same way fine-tuning does.
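
Hybrid search needs a way to merge the vector ranking with the keyword ranking. Reciprocal rank fusion is one common option, sketched below. It is named here as a swapped-in technique, not something the list above prescribes; the chunk ids are invented, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings of chunk ids into one.

    Each list contributes 1 / (k + rank) per id (rank starts at 1), so ids
    that rank well in several lists rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for a stable result.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from a vector search and a BM25 keyword search.
vector_hits = ["runbook-intro", "deploy-steps", "rollback"]
keyword_hits = ["error-E4021", "deploy-steps", "runbook-intro"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
top_chunks = fused[:3]  # keep only a few chunks for the prompt
```

Chunks that appear in both lists outrank the exact-match-only hit, yet that hit still makes the cut, which is the behavior you want for error codes and flag names.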

Watch out

RAG doesn't fix a bad knowledge base. If your source docs are wrong, contradictory, or missing, retrieval will faithfully surface the wrong thing, now with the model's full confidence behind it. Garbage in, confident garbage out.

RAG vs fine-tuning: when to use each

These solve different problems, and the most common mistake is reaching for fine-tuning when you actually have a retrieval problem. Fine-tuning changes *how* the model talks; RAG changes *what facts* it has access to.

| Dimension | RAG | Fine-tuning |
| --- | --- | --- |
| Best for | Facts, citations, up-to-date knowledge | Style, tone, output format, domain phrasing |
| Knowledge freshness | Live: re-index and it's current | Frozen into weights until you retrain |
| Cost to update | Cheap (re-embed changed docs) | Expensive (a new training run) |
| Auditability | High: you can cite the source chunk | Low: knowledge is opaque in the weights |
| Iteration speed | Minutes | Hours to days |

Pick based on whether you're changing behavior or supplying knowledge.
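
The "re-embed changed docs" cost stays low only if you can tell which docs changed. A minimal sketch, assuming you keep a content hash per document from the previous indexing run; `plan_reindex` and the sample documents are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(docs: dict[str, str], indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare current docs with what was indexed last time.

    Returns (doc ids to re-embed, doc ids to delete from the index).
    """
    to_embed = [
        doc_id for doc_id, text in docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in docs]
    return to_embed, to_delete


# Hashes saved by the previous indexing run.
previous = {
    "runbook": content_hash("Deploy with release.sh"),
    "faq": content_hash("Ask in #help"),
    "old-page": content_hash("Deprecated"),
}
# Documents as they exist now.
current = {
    "runbook": "Deploy with release.sh --env prod",  # edited
    "faq": "Ask in #help",                            # unchanged
    "changelog": "v2 shipped",                        # new
}
to_embed, to_delete = plan_reindex(current, previous)
# Only "runbook" and "changelog" need embedding; "old-page" is removed.
```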

Note

For most enterprise use cases (documentation Q&A, customer support, internal knowledge bases), **RAG is the right default**: cheaper, faster to iterate, and auditable. Reach for fine-tuning when you need a consistent voice or a strict output format, and remember you can do both: fine-tune for style, RAG for facts.

Takeaways

RAG in seven lines

  • LLMs are frozen at a training cutoff; they don't know your data and can't cite sources.
  • RAG retrieves relevant documents at query time and injects them into the prompt as context.
  • It's two pipelines meeting at a vector DB: indexing (offline) and query (per request).
  • Use the **same embedding model** on both sides, or similarity search breaks.
  • pgvector on your existing Postgres covers most real-world RAG; don't over-engineer the store.
  • The usual failures are bad chunking, retrieval misses, and context overflow; the fixes are chunk tuning, hybrid search, and a reranker.
  • RAG for facts and freshness; fine-tuning for style and format. RAG is the right default.

Where to go next

You now have the mental model and a working pipeline. The natural next step is to go deeper into embeddings, evaluation, and shipping RAG to production, which is exactly the arc the AI Engineer path follows.

  • AI Engineer career path: the full track covering embeddings, vector DBs, RAG, agents, and LLM evaluation, from zero to production.
  • Build it for real next: swap the placeholders in rag_pipeline.py for pgvector and a hosted embedding model, then add hybrid search and a reranker once the basics work.
