Interactive Explainer

RAG in Production

From Prototype to Reliable Retrieval-Augmented Generation

🎯Key Takeaways
Retrieval quality must be measured independently from generation quality — measure Precision@K and Recall@K before optimizing prompts
Hybrid retrieval (dense + BM25) outperforms pure semantic search by 15-25% and handles keyword/exact-match queries that dense search misses
Chunk size and boundaries must match document structure — arbitrary token-count chunking destroys semantic coherence
Cross-encoder rerankers provide 20-30% better relevance than cosine similarity at the cost of 150-200ms added latency
Every production RAG system needs a relevance threshold — return "I don't know" rather than hallucinate from irrelevant chunks
The knowledge base is a production dependency: it needs automated update pipelines, version tracking, and freshness monitoring

~14 min read
Why this matters

LLMs hallucinate facts and have stale knowledge cutoffs. Every production AI system needs a way to ground answers in real, current data.

Without this knowledge

Your chatbot confidently invents product specs, quotes policies that changed last quarter, and cites studies that do not exist. Users stop trusting it after the first bad answer.

With this knowledge

Your system retrieves the exact document chunks that answer each question, passes them as verified context to the model, and produces answers grounded in your actual data — with citations the user can verify.


The Scene: A Support Bot That Made Things Worse

It is Q4 2023. A fintech startup with 200,000 customers has just launched an AI support chatbot. The demo was flawless — trained on their product docs, it answered questions about account limits, fee structures, and dispute processes with impressive accuracy. The VP of Customer Success called it "a 10x improvement over our old FAQ page." The engineering team shipped it to production in six weeks.

Three weeks later, the head of support has a spreadsheet open. It has 847 rows. Each row is a ticket where the chatbot gave a wrong answer that a human support agent then had to manually correct. The chatbot was confidently telling users their transfer limit was $5,000 — it had been raised to $10,000 eight months earlier, but that update lived in a Confluence page the chatbot had never seen. It was quoting a fee schedule from a PDF uploaded during onboarding — that fee schedule was superseded by a new one three weeks before launch.

The obvious answer, in retrospect: "we should have used RAG." But the team *thought* they were using RAG. They had a vector database. They had embeddings. They had a retrieval step. What they had built was a prototype-grade retrieval system that could not survive contact with real documents, real queries, and real users.

RAG — Retrieval-Augmented Generation — is not a feature you add. It is an information retrieval engineering problem that happens to use a language model at the end. The fintech team learned this after three weeks of customer complaints and a $200K contract under threat because a compliance officer received an incorrect answer about regulatory requirements.

This concept teaches you to build retrieval that works — not just retrieval that demos well.

Why "Just Use Embeddings" Is Wrong

Here is the mental model most engineers carry into their first RAG project: embed your documents, embed the query, find the nearest neighbors, stuff them in the context window. It takes an afternoon to prototype. It demos beautifully on cherry-picked examples. It breaks on everything else.

Why does it break? Three reasons that expose three incorrect intuitions.

Incorrect Intuition 1: Semantic similarity = relevance. You search for "what is the cancellation policy" and retrieve chunks about "account termination procedures." Both are *semantically similar*. Neither answers the user's actual question about cancellation fees and notice periods. Embedding similarity captures topic proximity, not answer quality. A chunk that contains the words "cancel," "policy," and "account" ranks high — even if it's a blog post about *why* your cancellation policy exists, not what it actually *is*.

Incorrect Intuition 2: More context = better answers. You increase chunk size from 512 to 2048 tokens to "give the model more to work with." The model now receives six huge chunks. Four of them are irrelevant. The model dilutes the relevant signal with noise and produces a less accurate answer. There is a reason "Lost in the Middle" is a named phenomenon in LLM research — models systematically underweight context that appears in the middle of a long prompt.
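Given the "Lost in the Middle" finding, one practical mitigation is to reorder chunks before prompt assembly so the strongest evidence sits at the edges of the context rather than in strict rank order. A minimal sketch (relevance scores are assumed to come from whatever ranker you use):

```python
def order_for_context(chunks_with_scores):
    """Arrange chunks so the best ones sit at the edges of the prompt.

    Input: list of (chunk_text, relevance_score) pairs, any order.
    Output: chunk texts with top-ranked items at the start and end,
    weaker items in the middle, where models attend least.
    """
    ranked = sorted(chunks_with_scores, key=lambda x: x[1], reverse=True)
    front, back = [], []
    for i, (text, _) in enumerate(ranked):
        # Alternate placements: rank 1 at the front, rank 2 at the back, ...
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]
```

With scores 0.9, 0.8, 0.7, 0.6 this yields an ordering whose first and last positions hold the two highest-scoring chunks.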

Incorrect Intuition 3: Retrieval quality is hard to measure. Your team debates whether the answers "seem good" in Slack. No one measures retrieval recall (did you get the right chunk?) separately from generation quality (did the model use it correctly?). You optimize the wrong thing — usually the generation — while retrieval silently fails on 30% of queries.

The correct mental model: RAG is a pipeline with three independent failure modes. Chunking determines what can be retrieved. Retrieval determines what gets passed to the model. Generation determines how those chunks become answers. Each needs to be optimized and evaluated independently.

The teams that build reliable RAG systems treat it exactly like traditional information retrieval — with precision@K metrics, recall benchmarks, and ablation studies — before they ever worry about prompt engineering.
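As a sketch of what "treat it like IR" means in practice, here is how Precision@K and Recall@K can be computed against a small golden eval set. The query and chunk IDs are illustrative, not from a real system:

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for cid in top_k if cid in relevant_ids) / len(top_k)

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids[:k] if cid in relevant_ids)
    return hits / len(relevant_ids)

# A golden eval set: query -> IDs of chunks known to contain the answer.
# Built once by hand, then run on every retrieval change.
golden_set = {
    "what is the transfer limit?": {"doc_7_chunk_2"},
    "how are dispute fees charged?": {"doc_3_chunk_1", "doc_3_chunk_4"},
}
```

Running these metrics over the golden set on every chunking or retrieval change is what lets a team say "this change raised Recall@5 from 0.71 to 0.84" instead of debating whether answers "seem good" in Slack.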

How Production RAG Actually Works: Three Levels

Understanding RAG deeply means seeing through three lenses: what naive implementations do, what production systems do differently, and what expert architects optimize for.

naive_rag.py

# Level 1: the prototype. Works in demos, breaks in production.
from openai import OpenAI
from qdrant_client import QdrantClient

client = OpenAI()
vector_db = QdrantClient(":memory:")

def naive_rag(query: str, top_k: int = 3) -> str:
    # Embed the query.
    # text-embedding-3-small: 1536 dims, cheap but lower quality than -large.
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-small",
    ).data[0].embedding

    # Pure semantic search — misses exact keyword matches entirely.
    results = vector_db.search(
        collection_name="documents",
        query_vector=query_embedding,
        limit=top_k,
    )

    # Build context from results.
    # No deduplication, no reranking, no relevance threshold.
    context = "\n\n".join(r.payload["text"] for r in results)

    # Generate the answer.
    # No citation tracking — the model cannot tell the user WHERE it came from.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
production_rag.py

from openai import OpenAI
from qdrant_client import QdrantClient
from sentence_transformers import CrossEncoder

client = OpenAI()
vector_db = QdrantClient("http://qdrant:6333")

# Cross-encoder reranker: adds latency but 20-30% better relevance
# than cosine similarity.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def production_rag(
    query: str,
    top_k: int = 10,
    rerank_top_k: int = 3,
    score_threshold: float = 0.7,
) -> dict:
    # Hybrid retrieval: dense (semantic) + sparse (BM25 keyword)
    # finds what either alone misses.
    # text-embedding-3-large: 3072 dims, 5x cost but ~15% better retrieval quality.
    query_embedding = client.embeddings.create(
        input=query, model="text-embedding-3-large"
    ).data[0].embedding

    # Semantic search. score_threshold rejects low-confidence results
    # rather than hallucinating from irrelevant context.
    dense_results = vector_db.search(
        collection_name="documents",
        query_vector=("dense", query_embedding),
        limit=top_k,
        score_threshold=score_threshold,
    )

    # Sparse (keyword) search via BM25. encode_query_sparse is an
    # assumed helper that produces a sparse query vector.
    sparse_results = vector_db.search(
        collection_name="documents",
        query_vector=("sparse", encode_query_sparse(query)),
        limit=top_k,
    )

    # Reciprocal Rank Fusion (RRF): combines the two rankings without
    # needing score normalization. Assumed helper.
    fused = reciprocal_rank_fusion(dense_results, sparse_results)

    # Cross-encoder reranking operates on (query, document) pairs —
    # much more accurate than embedding similarity.
    candidates = [(query, r.payload["text"]) for r in fused[:top_k]]
    rerank_scores = reranker.predict(candidates)

    # Sort by rerank score, take top K.
    ranked = sorted(
        zip(fused[:top_k], rerank_scores),
        key=lambda x: x[1], reverse=True,
    )[:rerank_top_k]

    # Build structured context with citations: every chunk carries its
    # source URL for attribution in the response.
    chunks = []
    for result, score in ranked:
        chunks.append({
            "text": result.payload["text"],
            "source": result.payload["source_url"],
            "doc_id": result.payload["doc_id"],
            "relevance_score": float(score),
        })

    return {"chunks": chunks, "query": query}
chunking_strategy.py

import re
from typing import List

from langchain.text_splitter import RecursiveCharacterTextSplitter

class SmartChunker:
    """
    Chunking strategy that preserves semantic boundaries.
    Naive fixed-size chunking splits sentences, tables, and code blocks
    at arbitrary points — destroying the context the model needs.
    """

    # chunk_size=512 is a starting point; tune it to your document
    # structure. Note that RecursiveCharacterTextSplitter measures
    # characters by default; pass a token-counting length function
    # if you need token limits.
    # chunk_overlap=64: adjacent chunks share 64 characters so context
    # at boundaries is preserved.
    def __init__(self, chunk_size: int = 512, chunk_overlap: int = 64):
        # The splitter tries larger separators first
        # (paragraphs > lines > sentences > words).
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            # Split on semantic boundaries, in priority order.
            separators=["\n\n", "\n", ". ", " ", ""],
        )

    def chunk_document(self, doc: dict) -> List[dict]:
        chunks = []
        text = doc["content"]

        # Preserve tables as atomic units. Splitting a table destroys
        # all the relationships it encodes.
        for table in self.extract_tables(text):
            chunks.append({
                "text": table,
                "type": "table",
                "doc_id": doc["id"],
                "source_url": doc["url"],
            })
            text = text.replace(table, "[TABLE_REMOVED]")

        # Chunk the remaining text semantically.
        for chunk_text in self.splitter.split_text(text):
            # Skip near-empty chunks: they waste vector DB space and
            # pollute retrieval results.
            if len(chunk_text.strip()) < 50:
                continue
            chunks.append({
                "text": chunk_text,
                "type": "text",
                "doc_id": doc["id"],
                "source_url": doc["url"],
            })

        return chunks

    def extract_tables(self, text: str) -> List[str]:
        # Simple Markdown table detection.
        return re.findall(r'(\|.+\|\n(?:\|[-:]+\|\n)+(?:\|.+\|\n)*)', text)

Three Scenarios, Three Retrieval Failure Modes

RAG fails differently depending on what you're building. Here are three production scenarios with distinct failure modes and how to prevent them.

Scenario 1: The Customer Support Bot (Keyword Mismatch) A telecom company's RAG bot keeps failing on questions like "how do I cancel autopay?" The vector search returns chunks about "payment methods" and "billing cycles" — semantically related, but none of them actually describe the cancellation process. The fix: hybrid retrieval. Adding BM25 sparse search alongside dense embeddings immediately surfaces the chunk containing the exact phrase "disable automatic payment." Dense search finds *topics*; sparse search finds *exact terms*. Together, they handle the 15% of queries where users use domain-specific jargon that the embedding model doesn't cluster correctly.
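To make the dense-vs-sparse distinction concrete, here is a hand-rolled BM25 scorer applied to the telecom example. In production you would use your vector database's sparse index or a library, not this sketch, and the document texts below are invented for illustration:

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class TinyBM25:
    """Minimal BM25 (Okapi) scorer — enough to show why exact keyword
    matches rescue queries that dense retrieval fumbles."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.n = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.n
        # Document frequency per term, for the IDF weight.
        self.df = {}
        for d in self.docs:
            for term in set(d):
                self.df[term] = self.df.get(term, 0) + 1

    def score(self, query, idx):
        d = self.docs[idx]
        s = 0.0
        for term in tokenize(query):
            tf = d.count(term)
            if tf == 0:
                continue
            idf = math.log(1 + (self.n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            norm = tf + self.k1 * (1 - self.b + self.b * len(d) / self.avgdl)
            s += idf * tf * (self.k1 + 1) / norm
        return s

docs = [
    "To disable automatic payment, open Billing and turn off autopay.",
    "We accept several payment methods including cards and ACH.",
    "Your billing cycle starts on the first of each month.",
]
bm25 = TinyBM25(docs)
best = max(range(len(docs)), key=lambda i: bm25.score("how do I cancel autopay", i))
# best points at the chunk containing the exact term "autopay"
```

The exact-term match on "autopay" puts the right chunk first even though an embedding model might cluster it with generic "payment methods" and "billing cycle" chunks.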

Scenario 2: The Legal Document Assistant (Context Window Poisoning) A law firm's RAG system is retrieving excellent chunks — the right sections from the right contracts. But the summaries it generates are oddly vague and sometimes incorrect. Root cause: the context window has 12 chunks, but 8 of them are near-duplicates (different versions of the same clause from different contract drafts). The model averages across the noise. The fix: deduplication before generation. Add a maximum marginal relevance (MMR) filter that ensures retrieved chunks are both relevant *and* diverse. After filtering, 5 distinct, high-quality chunks produce significantly sharper summaries. The lesson: retrieval quantity is not retrieval quality.
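An MMR filter of the kind described here can be sketched in a few lines. `similarity` is assumed to be whatever pairwise similarity function your stack provides (for example cosine over chunk embeddings), and `lambda_` trades relevance against diversity:

```python
def mmr_select(candidates, similarity, k=5, lambda_=0.7):
    """Maximal Marginal Relevance: pick chunks that are relevant AND
    diverse, so near-duplicate clauses don't crowd out distinct ones.

    candidates: list of (chunk_id, relevance_score) pairs
    similarity: function(chunk_id_a, chunk_id_b) -> 0..1
    """
    selected = []
    pool = dict(candidates)
    while pool and len(selected) < k:
        def mmr_score(cid):
            # Penalize candidates similar to anything already picked.
            redundancy = max((similarity(cid, s) for s in selected), default=0.0)
            return lambda_ * pool[cid] - (1 - lambda_) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        del pool[best]
    return selected
```

With two near-duplicate clauses scoring 0.9 and 0.85, MMR picks the first and then skips the second in favor of a less similar chunk, which is exactly the behavior that sharpened the law firm's summaries.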

Scenario 3: The Internal Knowledge Base (Staleness) An engineering team builds a RAG system over their internal Confluence wiki. It works great for six months. Then engineers start noticing answers about internal APIs that reference deprecated endpoints. The problem: documents from 18 months ago are still in the vector database with equal weight to last week's documents. The fix: metadata-filtered retrieval. Add `last_modified` timestamps to every chunk's payload. At query time, apply a recency filter that down-weights (or excludes) chunks older than 90 days for queries about current system behavior. For historical queries ("how did we handle X before the migration?"), relax the filter. Retrieval must be context-aware, not uniform.
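In a vector database this policy would usually be pushed down as a metadata filter at query time; the post-hoc sketch below shows just the policy itself, assuming each chunk payload carries a `last_modified` unix timestamp:

```python
import time

NINETY_DAYS = 90 * 24 * 3600

def filter_by_recency(chunks, historical=False, now=None):
    """Down-select chunks for 'current system behavior' queries.

    chunks: list of dicts, each with a 'last_modified' unix timestamp.
    historical=True relaxes the filter for queries like
    "how did we handle X before the migration?".
    """
    if historical:
        return chunks
    cutoff = (now or time.time()) - NINETY_DAYS
    return [c for c in chunks if c["last_modified"] >= cutoff]
```

The same cutoff can be expressed as a range condition on the payload field in most vector databases, which avoids retrieving stale chunks in the first place.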

The Traps That Sink Production RAG Systems

Three failure patterns that kill RAG in production — each seductive because it feels like solving the right problem.

How Principals Think About RAG Architecture

Senior engineers ask: "Does retrieval work?" Principal engineers ask: "What is the right retrieval architecture for this specific data and query distribution?"

The shift is from RAG as a technique to RAG as an information retrieval system that happens to use a vector database. Principal architects spend more time on the data pipeline than the model integration: How is the knowledge base kept fresh? What happens when documents are updated — are all affected chunks re-embedded? Is the embedding model versioned separately from the application? Can you roll back to a previous embedding without re-embedding 10 million documents?

The research frontier (2024-2025) is moving toward adaptive retrieval: systems that decide at query time whether to retrieve, how many chunks to retrieve, and which retrieval strategy to use — based on query complexity, user intent, and confidence signals from a smaller router model. GraphRAG (from Microsoft Research) extends beyond chunk retrieval to entity-relationship graphs, enabling multi-hop reasoning across connected facts that flat vector search cannot handle.

The five-year arc is RAG evolving from a retrieval workaround for short context windows toward a deliberate knowledge architecture decision: when to encode knowledge into model weights (fine-tuning), when to retrieve it at query time (RAG), and when to maintain it in external memory systems (agents with tools). Engineers who understand the tradeoffs across all three will own the design decisions that determine AI product reliability.

How this might come up in interviews

Common questions:

  • Walk me through how you would build a production RAG system from scratch.
  • How do you evaluate a RAG system's quality?
  • What is hybrid retrieval and when should you use it over dense-only?
  • A RAG system is returning wrong answers. How do you diagnose the problem?
  • How do you keep the knowledge base current as documents change?
  • What is the "lost in the middle" problem and how do you address it?

Strong answer:

  • Immediately asks "What's the current Precision@K?" when debugging a RAG quality problem
  • Knows the difference between a bi-encoder (embedding model) and a cross-encoder (reranker) and when each applies
  • Has implemented a golden eval set for RAG quality measurement
  • Understands that the confidence threshold is a product decision (accuracy vs. coverage tradeoff), not just a technical one

Red flags:

  • Treats RAG as a single system rather than a three-layer pipeline with independent failure modes
  • No mention of evaluating retrieval quality separately from generation quality
  • Believes better prompting can fix retrieval failures
  • No plan for knowledge base freshness and document version management

Scenario · A Series B HR-tech startup with 50,000 enterprise users


You Are the AI Engineer. Fix the Production RAG System.

Your company's AI assistant answers questions from HR managers about company policies — benefits, leave, compliance. It uses RAG over a 10,000-document policy knowledge base. The system has been live for 60 days and NPS dropped from 72 to 44 last month.

User complaint data shows that 34% of questions receive wrong or outdated answers. Your head of product wants you to "improve the prompt." You pull a sample of 20 failed queries and manually look at what chunks were retrieved. In 16 of 20 cases, the retrieved chunks are not actually relevant to the query — they're topically adjacent but don't contain the answer. The prompt is fine. Retrieval is broken.

What do you do first?

Quick check · RAG in Production


Your RAG system has Precision@3 of 0.58. An engineer suggests improving the system prompt to handle irrelevant chunks better. What should you do?

Before you move on: can you answer these?

What is the difference between retrieval precision and generation quality, and why do you measure them separately?

Retrieval precision measures whether retrieved chunks are relevant (Precision@K). Generation quality measures whether the model produces a correct answer from those chunks. Separate measurement isolates which system is broken — you cannot fix a retrieval failure with a prompting change.

Why does hybrid retrieval outperform dense-only retrieval, and when does BM25 specifically help?

Dense retrieval excels at semantic similarity (finding concepts). BM25 excels at keyword matching (finding exact terms). Hybrid retrieval handles queries where users use domain-specific terminology, acronyms, or exact names that the embedding model generalizes away.

What happens to RAG quality when no relevant document exists in the knowledge base?

Vector search always returns K results regardless of relevance — there is no "no match" state. Without a confidence threshold, the model hallucinates from irrelevant chunks. Systems must detect the no-relevant-document case and respond with uncertainty rather than fabricating answers.
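A minimal version of that gate, assuming each chunk carries a reranker `relevance_score` (the 0.5 threshold is illustrative; where to set it is a product decision, trading accuracy against coverage):

```python
NO_ANSWER = ("I couldn't find this in the knowledge base. "
             "Please check with support or rephrase the question.")

def generate_answer(chunks):
    # Stand-in for the real LLM call; joins sources for illustration.
    return "Answer based on: " + ", ".join(c["source"] for c in chunks)

def answer_or_abstain(chunks, threshold=0.5):
    """Gate generation on retrieval confidence.

    Vector search always returns K results, so this gate, not the
    search itself, is what creates a 'no match' state.
    """
    confident = [c for c in chunks if c["relevance_score"] >= threshold]
    if not confident:
        return NO_ANSWER  # abstain instead of hallucinating
    return generate_answer(confident)
```

Logging how often the gate fires is also a useful freshness signal: a rising abstention rate often means the knowledge base is missing or outdated content.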

From the books

AI Engineering — Chapter 7: Retrieval-Augmented Generation — Chip Huyen (2024)

Huyen's framework distinguishes retrieval quality (can you find the right chunk?) from context quality (does the chunk contain enough information to answer the question?). Many RAG failures that appear to be retrieval failures are actually chunking failures — the right document exists but the relevant passage was split across two chunks. The fix is not better retrieval but better document preparation.

Lost in the Middle: How Language Models Use Long Contexts — Liu et al., Stanford (2023)

Empirical study showing that LLMs systematically underperform on information placed in the middle of long contexts, with models performing best on information at the beginning or end. Directly informs RAG prompt construction: the most important chunks should go first or last in the context window, not buried in the middle.

REALM: Retrieval-Augmented Language Model Pre-Training — Guu et al., Google Research (2020)

The foundational paper that introduced retrieval augmentation as a first-class component of language model architecture. Understanding REALM's formulation of retrieval as a latent variable explains why evaluation must cover retrieval and generation separately — they are independent systems with independent failure modes.
