Advanced RAG: Re-ranking, Hybrid Search & Query Transformation

On this page

The demo that retrieved the wrong doc with total confidence
The principle: retrieval quality caps answer quality
The picture: what an advanced RAG retriever actually does
How to upgrade naive RAG, one layer at a time
The four techniques and the problem each one fixes
Code: hybrid retrieve + cross-encoder re-rank
Chunking: the silent killer nobody profiles
Common mistakes that quietly tank production RAG
Takeaways
Where to go next

The demo that retrieved the wrong doc with total confidence

I once shipped a RAG demo for a customer-support bot. Someone asked, "How do I cancel my annual plan?" The bot answered crisply and completely, but about cancelling the monthly plan. Different refund window, different policy, totally wrong answer, delivered with the calm authority of a doc that scored 0.83 cosine similarity.

Nothing was "broken." The embedding model worked. The vector DB worked. The LLM worked. The problem was upstream of all of them: naive top-k vector retrieval handed the model the wrong chunk, and a great model can't reason its way out of bad context. Garbage in, confident garbage out.

This is the single most common failure in production RAG. The fix isn't a bigger model or a longer prompt; it's a better retriever. This article walks through the four upgrades that take RAG from "impressive demo" to "trustworthy in prod": hybrid search, re-ranking, query transformation, and disciplined chunking plus metadata filtering.

Who this is for

You've built a basic RAG pipeline (embed docs, store vectors, retrieve top-k, stuff into a prompt), and it works in the demo but hallucinates or misses on real questions. You know what an embedding is but haven't yet tuned retrieval. No ML PhD required; if you can read Python, you can ship every technique here.

The principle: retrieval quality caps answer quality

Your LLM can only be as right as the context you retrieve. Retrieval quality is the ceiling on answer quality: the model can fall below that ceiling, but it can never rise above it.

If you remember one thing, make it that. Teams spend weeks tuning prompts and swapping models while the actual bottleneck is the chunk that never made it into the context window. Fix retrieval first; it has the highest leverage of anything in the pipeline.

Analogy	RAG technique
A librarian who only matches the vibe of your question	Vector search (semantic, fuzzy, misses exact terms)
A librarian who only matches exact words on the spine	Keyword / BM25 search (precise, misses synonyms)
Asking both librarians, then merging their picks	Hybrid search
A senior analyst re-reading the shortlist to rank true relevance	Cross-encoder re-ranking
Rephrasing a vague request before searching	Query rewriting / expansion

RAG is research, not magic. Each stage maps to how a good analyst finds an answer.

The picture: what an advanced RAG retriever actually does

A user query is rewritten, retrieved two ways in parallel, merged, re-ranked, trimmed to a tight top-k, and only then sent to the LLM.

1. Rewrite the query. Turn a vague, pronoun-heavy, or under-specified question into one (or several) clean search queries. "cancel annual plan" becomes "how to cancel an annual subscription and refund policy."
2. Retrieve two ways in parallel. Run the rewritten query through vector search (semantic meaning) AND BM25 keyword search (exact terms like "annual", error codes, SKUs). Each catches what the other misses.
3. Merge candidates. Combine both result sets into one wide pool, say 30 candidates, using Reciprocal Rank Fusion or a weighted score. Recall over precision here: you want the right doc to be present.
4. Re-rank with a cross-encoder. Score each candidate against the query with a model that reads the query and document together. This is slow but accurate, and you only run it on ~30 docs, not the whole corpus.
5. Trim to a tight top-k. Keep only the best 3–5 chunks. Less, but better. A clean, short context beats a long, noisy one every time.
6. Generate. Hand the trimmed context to the LLM. Now the model is reasoning over the right material instead of guessing.
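Step 1 is the least standardized part of the pipeline, so here is a minimal sketch rather than a recipe. It assumes a hypothetical `complete` callable (prompt in, text out) standing in for whatever LLM client you use; the prompt wording and the `rewrite_query` name are illustrative and not taken from any library.

```python
from typing import Callable, Sequence

# Illustrative prompt; tune the wording against your own eval set.
REWRITE_PROMPT = """Rewrite the user's latest message as one standalone search query.
Resolve pronouns and references using the conversation. Return only the query.

Conversation:
{history}

Latest message: {message}
Standalone query:"""


def rewrite_query(
    message: str,
    history: Sequence[str],
    complete: Callable[[str], str],
) -> str:
    """Turn a conversational message into a standalone search query.

    `complete` is a hypothetical stand-in for your LLM client: it takes a
    prompt string and returns the model's text response.
    """
    prompt = REWRITE_PROMPT.format(history="\n".join(history), message=message)
    rewritten = complete(prompt).strip().strip('"')
    # Fall back to the original message if the model returns nothing usable.
    return rewritten or message


# Usage with a stub in place of a real model call.
def fake_llm(prompt: str) -> str:
    return "how to cancel an annual subscription and refund policy"


print(rewrite_query("and the annual one?", ["How do I cancel my monthly plan?"], fake_llm))
```

The fallback matters in practice: if the rewrite call fails or comes back empty, searching with the raw message is still better than searching with nothing.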

How to upgrade naive RAG, one layer at a time

Don't rebuild everything at once. Add layers in order of impact and measure after each. The two highest-leverage moves are usually adding a re-ranker and going hybrid; they routinely turn a 60%-correct pipeline into a 90% one without touching the model.

1. Start with an eval set. Write 30–50 real questions with known-correct source chunks. You cannot improve what you cannot measure, and "it feels better" is not a metric.
2. Add BM25 alongside vectors (hybrid). Most vector DBs (Weaviate, Qdrant, Elasticsearch, pgvector + a keyword index) support both. Fuse the results.
3. Add a cross-encoder re-ranker. Retrieve wide (top-30), re-rank, keep top-5. Biggest single quality jump for the least code.
4. Add query rewriting/expansion. Use a small LLM call to clean up and expand the query before retrieval.
5. Fix chunking and add metadata filters. Smaller, well-bounded chunks plus structured filters (product, version, date) cut whole categories of wrong-doc errors.
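Since every later step depends on the eval set from step 1, here is one way the measurement could look. This is a sketch under assumptions: the eval set is a dict from question to the set of correct chunk IDs, `retrieve` is whatever function wraps your retriever and returns chunk IDs best first, and the chunk IDs below are made up for illustration.

```python
from typing import Callable, Dict, List, Set


def evaluate_retriever(
    retrieve: Callable[[str], List[str]],
    eval_set: Dict[str, Set[str]],
    k: int = 5,
) -> Dict[str, float]:
    """Score a retriever against labeled questions.

    eval_set maps each question to the set of chunk IDs that correctly
    answer it. `retrieve` returns chunk IDs, best first.
    """
    hits = 0
    reciprocal_ranks = 0.0
    for question, relevant_ids in eval_set.items():
        top = retrieve(question)[:k]
        # 1-based rank of the first relevant chunk, if any made the top k.
        first = next(
            (i for i, cid in enumerate(top, start=1) if cid in relevant_ids),
            None,
        )
        if first is not None:
            hits += 1
            reciprocal_ranks += 1.0 / first
    n = len(eval_set)
    return {
        "hit_rate@k": hits / n if n else 0.0,
        "mrr@k": reciprocal_ranks / n if n else 0.0,
    }


# Toy data: hypothetical chunk IDs and a hard-coded retriever.
eval_set = {
    "How do I cancel my annual plan?": {"billing-annual-cancel"},
    "What is the refund window for monthly plans?": {"billing-monthly-refund"},
}
canned = {
    "How do I cancel my annual plan?": ["billing-monthly-cancel", "billing-annual-cancel"],
    "What is the refund window for monthly plans?": ["billing-monthly-refund"],
}

print(evaluate_retriever(canned.__getitem__, eval_set, k=5))
print(evaluate_retriever(canned.__getitem__, eval_set, k=1))
```

Hit rate tells you whether the right chunk made it into the context at all; mean reciprocal rank tells you how close to the top it landed. Run both before and after each layer you add.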

The four techniques and the problem each one fixes

Technique	Problem it fixes
Hybrid search (vector + BM25)	Vector-only misses exact terms, names, codes, "annual" vs "monthly", acronyms. BM25 nails the lexical matches that semantic search blurs.
Cross-encoder re-ranking	Embedding similarity is approximate; the top chunk is often "related" not "relevant". A re-ranker reads query + doc together to score true relevance.
Query rewriting / expansion	Real questions are vague, conversational, or under-specified. Rewriting turns "and the annual one?" into a standalone, searchable query.
Metadata filtering	The right text exists but for the wrong product, version, or tenant. Structured filters scope retrieval before similarity ever runs.

Each technique targets a specific, recognizable failure mode of naive top-k.
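Metadata filtering is the one row in the table with no model involved, so it is easy to sketch. The example below is an in-memory illustration only: the `Chunk` class, the `filter_by_metadata` helper, and the toy chunks are all hypothetical. In a real deployment you would normally pass the same conditions to your vector store's own filter mechanism so the filtering happens inside the index.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def filter_by_metadata(chunks: List[Chunk], **required: str) -> List[Chunk]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in required.items())
    ]


# Toy corpus: near-identical text that differs only in scope.
chunks = [
    Chunk("a1", "How to cancel an annual plan.", {"plan": "annual", "version": "v2"}),
    Chunk("m1", "How to cancel a monthly plan.", {"plan": "monthly", "version": "v2"}),
    Chunk("a0", "How to cancel an annual plan (old flow).", {"plan": "annual", "version": "v1"}),
]

# Scope first, then run similarity search over what is left.
scoped = filter_by_metadata(chunks, plan="annual", version="v2")
print([c.chunk_id for c in scoped])
```

With the filter applied, the monthly-plan chunk from the opening story cannot be retrieved for an annual-plan question, no matter how similar its embedding is.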

Code: hybrid retrieve + cross-encoder re-rank

Here's the core of an advanced retriever: run vector and keyword search, fuse them with Reciprocal Rank Fusion, then re-rank the merged pool with a cross-encoder. This is framework-agnostic, pseudo-real Python; swap in your own vector client and BM25 index.

retriever.py

```python
from sentence_transformers import CrossEncoder

# A cross-encoder reads (query, doc) TOGETHER and scores relevance.
# Slower than embeddings, so we only run it on a small merged pool.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def reciprocal_rank_fusion(*ranked_lists, k=60):
    """Merge multiple ranked lists of doc IDs. Each doc's score is the sum
    of 1 / (k + rank) across the lists it appears in, with rank starting
    at 1. Rank-based, so it fuses results without needing comparable raw
    scores."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def hybrid_retrieve(query, vector_index, bm25_index, docs, top_k=5, pool_size=30):
    """Return the top_k document texts for query.

    vector_index and bm25_index are placeholders for your own clients; each
    .search() is assumed to return doc IDs, best first. docs maps doc ID to text.
    """
    # 1. Retrieve WIDE from both sides (recall over precision here).
    vector_hits = vector_index.search(query, limit=pool_size)    # semantic
    keyword_hits = bm25_index.search(query, limit=pool_size)     # lexical

    # 2. Fuse the two ranked lists into one candidate pool.
    candidate_ids = reciprocal_rank_fusion(vector_hits, keyword_hits)[:pool_size]
    if not candidate_ids:
        return []  # nothing retrieved, so nothing to re-rank

    # 3. Re-rank the pool: score (query, doc) pairs with the cross-encoder.
    pairs = [(query, docs[doc_id]) for doc_id in candidate_ids]
    rerank_scores = reranker.predict(pairs)

    ranked = sorted(
        zip(candidate_ids, rerank_scores),
        key=lambda x: x[1],
        reverse=True,
    )

    # 4. Return only the best few: tight context beats long context.
    return [docs[doc_id] for doc_id, _ in ranked[:top_k]]
```
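To see why rank fusion favors documents that both retrievers agree on, it helps to work one tiny case by hand. The snippet below is a standalone variant of the fusion step that returns the scores instead of just the order. It uses 1-based ranks and k=60, and the doc IDs are invented for illustration.

```python
def rrf_scores(*ranked_lists, k=60):
    """Return {doc_id: RRF score}, where each list contributes 1 / (k + rank)
    and rank starts at 1."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# Hypothetical results for "How do I cancel my annual plan?"
vector_hits = ["monthly-cancel", "annual-cancel", "refund-faq"]
keyword_hits = ["annual-cancel", "annual-pricing", "monthly-cancel"]

scores = rrf_scores(vector_hits, keyword_hits)
fused = sorted(scores, key=scores.get, reverse=True)
for doc_id in fused:
    print(f"{doc_id:15s} {scores[doc_id]:.5f}")
```

Here `annual-cancel` scores 1/62 + 1/61 and `monthly-cancel` scores 1/61 + 1/63, so the annual doc wins even though vector search ranked the monthly doc first. Documents that appear in only one list (`annual-pricing`, `refund-faq`) fall to the bottom.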

Why re-rank only 30, not the whole corpus

Cross-encoders are accurate but expensive: they run a full model forward pass per (query, doc) pair. Running that over a million chunks for every query is far too slow to be practical. The trick is the funnel: cheap retrieval gets you from millions to ~30 candidates, then the expensive re-ranker sorts those 30. Best of both worlds.
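To put rough numbers on the funnel, here is a back-of-the-envelope calculation. The throughput figure is a made-up assumption for illustration, not a benchmark; measure your own model on your own hardware and plug that in.

```python
def rerank_seconds(num_pairs: int, pairs_per_second: float) -> float:
    """Rough time to score num_pairs (query, doc) pairs with a cross-encoder."""
    return num_pairs / pairs_per_second


# ASSUMPTION: 500 pairs/second is an invented throughput, for illustration only.
THROUGHPUT = 500.0

for pool in (30, 1_000, 1_000_000):
    print(f"{pool:>9,} pairs -> {rerank_seconds(pool, THROUGHPUT):,.2f} s per query")
```

Whatever the real throughput is, cost grows linearly with pool size, so a million-chunk pool costs tens of thousands of times more than a 30-chunk one. That ratio is what the funnel buys you.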

Chunking: the silent killer nobody profiles

You can do everything above perfectly and still fail if your chunks are bad. Chunking decides what the smallest retrievable unit is, and if that unit is wrong, no amount of re-ranking saves you. It's the least glamorous and most overlooked lever in RAG.

The two classic mistakes: chunks that are too big (one chunk covers five topics, so similarity is diluted and the LLM drowns in irrelevant text) and chunks that are too small (a chunk loses the context that makes it meaningful: a number with no label, a step with no procedure).

Chunk on structure, not character count. Split on headings, sections, or paragraphs so each chunk is one coherent idea. Markdown/HTML structure is a gift; use it.
Add overlap (~10–20%). A small overlap between adjacent chunks keeps a sentence from being orphaned at a boundary.
Attach context to each chunk. Prepend the document title and section heading to the chunk text before embedding, so "Step 3" still knows which procedure it belongs to.
Right-size for your content. Dense reference docs want smaller chunks (200–400 tokens); narrative content tolerates larger ones. Test both against your eval set.
Store rich metadata per chunk. Product, version, date, source URL, and section: this is what powers metadata filtering and lets you cite sources.
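Three of the guidelines above (chunk on structure, attach context, store metadata) fit in one short function. This is a minimal sketch assuming Markdown input with `#`-style headings; `chunk_markdown` and the sample document are illustrative. It deliberately leaves out overlap and size limits, and it does not special-case `#` lines inside fenced code blocks.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(doc_title: str, markdown: str, metadata: Dict[str, str]) -> List[dict]:
    """Split a Markdown document into one chunk per heading section.

    Each chunk's text is prefixed with the document title and section
    heading so it stays meaningful when retrieved on its own.
    """
    chunks: List[dict] = []
    heading = "Introduction"  # label for any text before the first heading
    body: List[str] = []

    def flush() -> None:
        # Emit the section collected so far, skipping empty ones.
        text = "\n".join(body).strip()
        if text:
            chunks.append({
                "text": f"{doc_title} > {heading}\n\n{text}",
                "metadata": {**metadata, "section": heading},
            })

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Billing
Intro text.

## Cancel an annual plan
Step 1. Open settings.

## Cancel a monthly plan
Step 1. Open settings.
"""

for chunk in chunk_markdown("Help Center", doc, {"product": "billing"}):
    print(chunk["metadata"]["section"], "|", repr(chunk["text"]))
```

The two "Step 1" chunks have identical bodies, but the prefixed heading makes them distinguishable to both the embedding model and the re-ranker.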

New to embeddings and the vector layer underneath all this? Start with Embeddings & Vector Search, then come back here.

Common mistakes that quietly tank production RAG

1. Vector-only retrieval. You'll silently miss every query that hinges on an exact term: error codes, product names, "annual" vs "monthly". Always go hybrid.
2. No re-ranker. Treating raw cosine similarity as relevance. The top vector hit is often merely "related". Re-ranking is the cheapest big win you're not using.
3. Bad chunking. Fixed 1,000-character splits that cut sentences in half and mix topics. Garbage chunks make every downstream stage irrelevant.
4. No metadata filtering. Letting a query match the right text for the wrong version or tenant. Filter the haystack before you search it.
5. No eval set. Tuning by vibes. Without a labeled question→source set you can't tell whether a change helped, hurt, or did nothing, and you'll ship regressions blind.

Takeaways

The whole article in seven lines

Bad production RAG is almost always a retrieval problem, not a model problem.
Retrieval quality caps answer quality, so fix retrieval before prompts or models.
Go hybrid: vector for meaning + BM25 for exact terms, fused together.
Add a cross-encoder re-ranker: retrieve wide (30), re-rank, keep the best 5.
Rewrite vague queries into clean, standalone search queries before retrieving.
Chunk on structure with overlap and context; attach metadata and filter on it.
Build a 30–50 question eval set first; you can't improve what you don't measure.

Where to go next

You now have the retriever upgrades that separate demo RAG from production RAG. The next steps are understanding the full pipeline these slot into, and proving your improvements with hard numbers instead of gut feel.

See the full pipeline: RAG Architecture Explained shows how ingestion, indexing, retrieval, and generation fit together end to end.
Go deeper on the vector layer: Embeddings & Vector Search covers what embeddings are and how similarity search actually works.
Measure your gains: Evaluating LLM Applications walks through building the eval set and metrics that prove a retrieval change helped.
Follow the full track: the AI Engineer career path sequences RAG, evaluation, and agents into a structured curriculum.
