AI Engineering · 12 min read · Jun 2026

Advanced RAG: Re-ranking, Hybrid Search & Query Transformation

Naive top-k retrieval is the #1 reason production RAG confidently answers wrong. Upgrade it with hybrid search, cross-encoder re-ranking, query rewriting, smarter chunking, and metadata filtering.

AI · RAG · Retrieval · Search

Sri Balaji

Founder · TheSimplifiedTech

The demo that retrieved the wrong doc with total confidence

I once shipped a RAG demo for a customer-support bot. Someone asked, "How do I cancel my annual plan?" The bot answered crisply and completely, except that its answer was about cancelling the monthly plan. Different refund window, different policy, totally wrong answer, delivered with the calm authority of a doc that scored 0.83 cosine similarity.

Nothing was "broken." The embedding model worked. The vector DB worked. The LLM worked. The problem was upstream of all of them: naive top-k vector retrieval handed the model the wrong chunk, and a great model can't reason its way out of bad context. Garbage in, confident garbage out.

This is the single most common failure in production RAG. The fix isn't a bigger model or a longer prompt; it's a better retriever. This article walks through the four upgrades that take RAG from "impressive demo" to "trustworthy in prod": hybrid search, re-ranking, query transformation, and disciplined chunking plus metadata filtering.

Who this is for

You've built a basic RAG pipeline (embed docs, store vectors, retrieve top-k, stuff into a prompt), and it works in the demo but hallucinates or misses on real questions. You know what an embedding is but haven't yet tuned retrieval. No ML PhD required; if you can read Python, you can ship every technique here.

The principle: retrieval quality caps answer quality

Your LLM can only be as right as the context you retrieve. Retrieval quality is the ceiling on answer quality: the model can fall short of that ceiling, but it can never exceed it.

If you remember one thing, make it that. Teams spend weeks tuning prompts and swapping models while the actual bottleneck is the chunk that never made it into the context window. Fix retrieval first; it has the highest leverage of anything in the pipeline.

| Analogy | RAG technique |
| --- | --- |
| A librarian who only matches the vibe of your question | Vector search (semantic, fuzzy, misses exact terms) |
| A librarian who only matches exact words on the spine | Keyword / BM25 search (precise, misses synonyms) |
| Asking both librarians, then merging their picks | Hybrid search |
| A senior analyst re-reading the shortlist to rank true relevance | Cross-encoder re-ranking |
| Rephrasing a vague request before searching | Query rewriting / expansion |

RAG is research, not magic. Each stage maps to how a good analyst finds an answer.

The picture: what an advanced RAG retriever actually does

Pipeline diagram: User query (raw, vague) → Query rewrite (expand / clarify) → Vector search (semantic) and BM25 / keyword search (lexical), in parallel → Merge candidates (~30 docs) → Cross-encoder re-rank (score true relevance) → Top-k context (best 3–5) → LLM (generate answer)

A user query is rewritten, retrieved two ways in parallel, merged, re-ranked, trimmed to a tight top-k, and only then sent to the LLM.

  1. Rewrite the query. Turn a vague, pronoun-heavy, or under-specified question into one (or several) clean search queries. "cancel annual plan" becomes "how to cancel an annual subscription and refund policy."
  2. Retrieve two ways in parallel. Run the rewritten query through vector search (semantic meaning) AND BM25 keyword search (exact terms like "annual", error codes, SKUs). Each catches what the other misses.
  3. Merge candidates. Combine both result sets into one wide pool, say 30 candidates, using Reciprocal Rank Fusion or a weighted score. Recall over precision here: you want the right doc to be present.
  4. Re-rank with a cross-encoder. Score each candidate against the query with a model that reads the query and document together. This is slow but accurate, and you only run it on ~30 docs, not the whole corpus.
  5. Trim to a tight top-k. Keep only the best 3–5 chunks. Less, but better. A clean, short context beats a long, noisy one every time.
  6. Generate. Hand the trimmed context to the LLM. Now the model is reasoning over the right material instead of guessing.
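
Step 1 above can be sketched as a small function that folds recent conversation turns into one standalone search query. This is a minimal sketch, not a definitive implementation: `rewrite_query`, `REWRITE_PROMPT`, and the `llm` callable (any function that takes a prompt string and returns the model's text) are hypothetical names, so plug in whatever LLM client you already use.

```python
from typing import Callable, Sequence

# Hypothetical prompt; tune the wording against your own eval set.
REWRITE_PROMPT = """Rewrite the user's latest question as one standalone search query.
Resolve pronouns and references using the conversation. Keep exact terms
(product names, plan names, error codes) unchanged. Return only the query.

Conversation:
{history}

Latest question: {question}
Standalone query:"""


def rewrite_query(
    question: str,
    history: Sequence[str],
    llm: Callable[[str], str],
) -> str:
    """Return a standalone search query, falling back to the raw question."""
    prompt = REWRITE_PROMPT.format(
        history="\n".join(history) or "(none)",
        question=question,
    )
    # Models often wrap the answer in whitespace or quotes; strip both.
    rewritten = llm(prompt).strip().strip('"')
    # If the model returns nothing usable, search with the original question.
    return rewritten or question


if __name__ == "__main__":
    # Stand-in for a real model call so the sketch runs offline.
    def fake_llm(prompt: str) -> str:
        return "how to cancel an annual subscription and refund policy"

    history = [
        "User: How do I cancel my monthly plan?",
        "Bot: Open the billing page and choose Cancel.",
    ]
    print(rewrite_query("and the annual one?", history, fake_llm))
```

The fallback matters in practice: a rewrite step that can return an empty string would otherwise turn an LLM hiccup into an empty search.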

How to upgrade naive RAG, one layer at a time

Don't rebuild everything at once. Add layers in order of impact and measure after each. The two highest-leverage moves are usually adding a re-ranker and going hybrid; together they can turn a pipeline that answers 60% of questions correctly into one that answers 90%, without touching the model.

  1. Start with an eval set. Write 30–50 real questions with known-correct source chunks. You cannot improve what you cannot measure, and "it feels better" is not a metric.
  2. Add BM25 alongside vectors (hybrid). Most vector DBs (Weaviate, Qdrant, Elasticsearch, pgvector + a keyword index) support both. Fuse the results.
  3. Add a cross-encoder re-ranker. Retrieve wide (top-30), re-rank, keep top-5. Biggest single quality jump for the least code.
  4. Add query rewriting/expansion. Use a small LLM call to clean up and expand the query before retrieval.
  5. Fix chunking and add metadata filters. Smaller, well-bounded chunks plus structured filters (product, version, date) cut whole categories of wrong-doc errors.
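
The eval set from step 1 only pays off once it produces a number you can compare before and after each change. A minimal sketch, assuming each eval item pairs a question with the IDs of its known-correct chunks and that `retrieve` is whatever function returns your pipeline's ranked chunk IDs. The function names and the toy data are illustrative.

```python
from typing import Callable, Iterable

EvalSet = list[tuple[str, set[str]]]        # (question, IDs of known-correct chunks)
Retriever = Callable[[str], Iterable[str]]  # question -> ranked chunk IDs


def hit_rate_at_k(eval_set: EvalSet, retrieve: Retriever, k: int = 5) -> float:
    """Fraction of questions with at least one correct chunk in the top k."""
    hits = 0
    for question, relevant in eval_set:
        top_k = list(retrieve(question))[:k]
        hits += any(chunk_id in relevant for chunk_id in top_k)
    return hits / len(eval_set)


def mean_reciprocal_rank(eval_set: EvalSet, retrieve: Retriever) -> float:
    """Average of 1 / rank of the first correct chunk (0 if none is retrieved)."""
    total = 0.0
    for question, relevant in eval_set:
        for rank, chunk_id in enumerate(retrieve(question), start=1):
            if chunk_id in relevant:
                total += 1 / rank
                break
    return total / len(eval_set)


if __name__ == "__main__":
    # Toy data: a fake retriever that returns canned rankings.
    canned = {
        "How do I cancel my annual plan?": ["monthly-cancel", "annual-cancel", "refunds"],
        "What is the refund window?": ["refunds", "annual-cancel"],
    }
    eval_set = [
        ("How do I cancel my annual plan?", {"annual-cancel"}),
        ("What is the refund window?", {"refunds"}),
    ]
    print(hit_rate_at_k(eval_set, canned.__getitem__, k=1))    # 0.5
    print(mean_reciprocal_rank(eval_set, canned.__getitem__))  # 0.75
```

Hit rate tells you whether the right chunk is present at all; MRR tells you how high it ranks. Re-run both after each layer you add.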

The four techniques and the problem each one fixes

| Technique | Problem it fixes |
| --- | --- |
| Hybrid search (vector + BM25) | Vector-only retrieval misses exact terms: names, codes, acronyms, "annual" vs "monthly". BM25 nails the lexical matches that semantic search blurs. |
| Cross-encoder re-ranking | Embedding similarity is approximate; the top chunk is often "related", not "relevant". A re-ranker reads query + doc together to score true relevance. |
| Query rewriting / expansion | Real questions are vague, conversational, or under-specified. Rewriting turns "and the annual one?" into a standalone, searchable query. |
| Metadata filtering | The right text exists but for the wrong product, version, or tenant. Structured filters scope retrieval before similarity ever runs. |

Each technique targets a specific, recognizable failure mode of naive top-k.
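
Metadata filtering is the one technique above that is easy to describe and easy to get backwards, so here is a minimal in-memory sketch of "filter first, then rank". `Chunk`, `cosine`, and `filtered_search` are hypothetical names and the two-dimensional embeddings are toy values. In a real system you would usually pass the filter to your vector store's own filter option rather than filtering in Python.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Chunk:
    text: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, chunks, filters, top_k=5):
    """Keep only chunks whose metadata matches every filter, THEN rank by similarity."""
    in_scope = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(in_scope, key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    chunks = [
        Chunk("Cancel a monthly plan from the billing page.", [0.9, 0.1], {"plan": "monthly"}),
        Chunk("Cancel an annual plan from the billing page.", [0.8, 0.2], {"plan": "annual"}),
    ]
    query_vec = [1.0, 0.0]
    # Unfiltered, the monthly chunk is the closest match for this query vector.
    print(filtered_search(query_vec, chunks, {})[0].text)
    # Scoped to the annual plan, the wrong-plan chunk can no longer win.
    print(filtered_search(query_vec, chunks, {"plan": "annual"})[0].text)
```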

Code: hybrid retrieve + cross-encoder re-rank

Here's the core of an advanced retriever: run vector and keyword search, fuse them with Reciprocal Rank Fusion, then re-rank the merged pool with a cross-encoder. This is framework-agnostic, near-real Python; swap in your own vector client and BM25 index.

retriever.py
python
from sentence_transformers import CrossEncoder

# A cross-encoder reads (query, doc) TOGETHER and scores relevance.
# Slower than embeddings, so we only run it on a small merged pool.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def reciprocal_rank_fusion(*ranked_lists, k=60):
    """Merge multiple ranked lists. Each doc's score is the sum of
    1 / (k + rank) across the lists it appears in, with rank starting
    at 1. Rank-based, so it fuses results without needing comparable
    raw scores."""
    scores = {}
    for ranked in ranked_lists:
        # start=1 so the top hit gets rank 1, as in the usual RRF formula.
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def hybrid_retrieve(query, vector_index, bm25_index, docs, top_k=5):
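    # 1. Retrieve WIDE from both sides (recall over precision here).
    # Assumes each .search() returns a ranked list of doc IDs and that
    # `docs` maps doc ID -> chunk text; adapt to your own clients.
    vector_hits = vector_index.search(query, limit=30)   # semantic
    keyword_hits = bm25_index.search(query, limit=30)     # lexical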

    # 2. Fuse the two ranked lists into one candidate pool.
    candidate_ids = reciprocal_rank_fusion(vector_hits, keyword_hits)[:30]

    # 3. Re-rank the pool: score (query, doc) pairs with the cross-encoder.
    pairs = [(query, docs[doc_id]) for doc_id in candidate_ids]
    rerank_scores = reranker.predict(pairs)

    ranked = sorted(
        zip(candidate_ids, rerank_scores),
        key=lambda x: x[1],
        reverse=True,
    )

    # 4. Return only the best few: tight context beats long context.
    return [docs[doc_id] for doc_id, _ in ranked[:top_k]]
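
To make the fusion step concrete, here is the Reciprocal Rank Fusion score written out, followed by a small worked example. The symbols are my own notation: $L_i$ is the $i$-th ranked list, $\mathrm{rank}_i(d)$ is the position of document $d$ in that list counted from 1, and $k$ is the smoothing constant (60 in the code above).

```latex
\[
\mathrm{RRF}(d) \;=\; \sum_{i \,:\, d \in L_i} \frac{1}{k + \mathrm{rank}_i(d)}
\]

% Worked example with k = 60.
% Doc A: rank 4 in the vector list, rank 5 in the BM25 list.
\[
\mathrm{RRF}(A) = \frac{1}{64} + \frac{1}{65} \approx 0.0310
\]
% Doc B: rank 1 in the vector list, absent from the BM25 list.
\[
\mathrm{RRF}(B) = \frac{1}{61} \approx 0.0164
\]
```

Doc A wins even though neither retriever ranked it first. That is the point of fusing: a document both retrievers agree on outranks one that only a single retriever liked.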

Why re-rank only 30, not the whole corpus

Cross-encoders are accurate but expensive: they run a full model forward pass per (query, doc) pair. Running that over a million chunks on every query is far too slow and costly to be practical. The trick is the funnel: cheap retrieval gets you from millions to ~30 candidates, then the expensive re-ranker sorts those 30. Best of both worlds.
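
A quick back-of-the-envelope calculation shows the size of the gap. The throughput below is a made-up figure for illustration only, not a benchmark of any model; substitute a number measured on your own hardware.

```python
def rerank_seconds(num_pairs: int, pairs_per_second: float) -> float:
    """Time to score num_pairs (query, doc) pairs at a given throughput."""
    return num_pairs / pairs_per_second


# ASSUMPTION: 1,000 pairs/second is an invented throughput for illustration;
# benchmark your own cross-encoder and hardware.
PAIRS_PER_SECOND = 1_000

whole_corpus = rerank_seconds(1_000_000, PAIRS_PER_SECOND)
shortlist = rerank_seconds(30, PAIRS_PER_SECOND)

print(f"Re-rank 1,000,000 chunks: {whole_corpus / 60:.1f} minutes per query")
print(f"Re-rank 30 candidates:    {shortlist * 1000:.0f} ms per query")
```

Whatever your real throughput is, the ratio stays the same: scoring the whole corpus costs roughly 33,000 times more per query than scoring a 30-document shortlist.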

Chunking: the silent killer nobody profiles

You can do everything above perfectly and still fail if your chunks are bad. Chunking decides what the smallest retrievable unit is, and if that unit is wrong, no amount of re-ranking saves you. It's the least glamorous and most overlooked lever in RAG.

The two classic mistakes: chunks that are too big (one chunk covers five topics, so similarity is diluted and the LLM drowns in irrelevant text) and chunks that are too small (a chunk loses the context that makes it meaningful: a number with no label, a step with no procedure).

  • Chunk on structure, not character count. Split on headings, sections, or paragraphs so each chunk is one coherent idea. Markdown/HTML structure is a gift; use it.
  • Add overlap (~10–20%). A small overlap between adjacent chunks keeps a sentence from being orphaned at a boundary.
  • Attach context to each chunk. Prepend the document title and section heading to the chunk text before embedding, so "Step 3" still knows which procedure it belongs to.
  • Right-size for your content. Dense reference docs want smaller chunks (200–400 tokens); narrative content tolerates larger ones. Test both against your eval set.
  • Store rich metadata per chunk: product, version, date, source URL, section. This is what powers metadata filtering and lets you cite sources.
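
The bullets above can be sketched as one small chunker for Markdown sources. This is an illustrative sketch under stated assumptions, not a production splitter: `chunk_markdown` and `_make_chunk` are hypothetical names, word count stands in for token count, overlap is approximated by repeating one whole paragraph, and a line starting with `#` inside a code sample would be mistaken for a heading.

```python
import re


def chunk_markdown(doc_title: str, markdown: str, max_words: int = 300) -> list[dict]:
    """Split on headings, then pack whole paragraphs into chunks of roughly
    max_words. Each chunk is prefixed with the document title and section
    heading so it stays meaningful on its own."""
    chunks = []
    # Split just before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    for section in sections:
        if not section.strip():
            continue
        first_line, _, body = section.partition("\n")
        if first_line.startswith("#"):
            heading = first_line.lstrip("#").strip()
        else:  # text that appears before the first heading
            heading, body = "", section
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]

        current: list[str] = []
        for para in paragraphs:
            words_so_far = sum(len(p.split()) for p in current)
            if current and words_so_far + len(para.split()) > max_words:
                chunks.append(_make_chunk(doc_title, heading, current))
                current = current[-1:]  # overlap: repeat the last paragraph
            current.append(para)
        if current:
            chunks.append(_make_chunk(doc_title, heading, current))
    return chunks


def _make_chunk(doc_title: str, heading: str, paragraphs: list[str]) -> dict:
    """Build the text to embed plus the metadata to store beside it."""
    prefix = " > ".join(part for part in (doc_title, heading) if part)
    return {
        "text": f"{prefix}\n\n" + "\n\n".join(paragraphs),
        "metadata": {"title": doc_title, "section": heading},
    }


if __name__ == "__main__":
    doc = "# Cancel a plan\n\nStep one.\n\nStep two.\n\n## Refunds\n\nRefund text."
    for chunk in chunk_markdown("Billing guide", doc):
        print(repr(chunk["text"]), chunk["metadata"])
```

Paragraphs are never cut in half, so a single paragraph longer than `max_words` is kept whole. Try a few values of `max_words` against your eval set rather than trusting the default.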

New to embeddings and the vector layer underneath all this? Start with Embeddings & Vector Search, then come back here.

Common mistakes that quietly tank production RAG

  1. Vector-only retrieval. You'll silently miss every query that hinges on an exact term: error codes, product names, "annual" vs "monthly". Always go hybrid.
  2. No re-ranker. Trusting raw cosine similarity to be relevance. The top vector hit is often merely "related". Re-ranking is the cheapest big win you're not using.
  3. Bad chunking. Fixed 1,000-character splits that cut sentences in half and mix topics. Garbage chunks make every downstream stage irrelevant.
  4. No metadata filtering. Letting a query match the right text for the wrong version or tenant. Filter the haystack before you search it.
  5. No eval set. Tuning by vibes. Without a labeled question→source set you can't tell whether a change helped, hurt, or did nothing, and you'll ship regressions blind.

Takeaways

The whole article in seven lines

  • Bad production RAG is almost always a retrieval problem, not a model problem.
  • Retrieval quality caps answer quality; fix retrieval before prompts or models.
  • Go hybrid: vector for meaning + BM25 for exact terms, fused together.
  • Add a cross-encoder re-ranker: retrieve wide (30), re-rank, keep the best 5.
  • Rewrite vague queries into clean, standalone search queries before retrieving.
  • Chunk on structure with overlap and context; attach metadata and filter on it.
  • Build a 30–50 question eval set first; you can't improve what you don't measure.

Where to go next

You now have the retriever upgrades that separate demo RAG from production RAG. The next steps are understanding the full pipeline these slot into, and proving your improvements with hard numbers instead of gut feel.
