Improving RAG
Retrieval Augmented Generation (RAG) augments the prompt with just-in-time data tailored to the query at hand.
A basic RAG pipeline chunks and embeds the source documents offline; at query time, it retrieves the chunks most similar to the query and passes them to the LLM as context.
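A minimal sketch of such a pipeline, using a toy bag-of-words "embedding" in place of a dense embedding model and a hypothetical `call_llm` placeholder in place of a real model API:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: token counts. A real pipeline uses a dense model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def call_llm(prompt: str) -> str:
    # Placeholder: swap in an actual LLM API call here.
    return f"[LLM answer based on a {len(prompt)}-char prompt]"

def rag_answer(query: str, chunks: list[str]) -> str:
    context = "\n".join(retrieve(query, chunks))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)  # single shot: no reflection, no retries

chunks = [
    "The warranty covers manufacturing defects for two years.",
    "Shipping within the EU takes three to five business days.",
]
print(rag_answer("How long is the warranty?", chunks))
```

Note the single LLM call at the end with no feedback loop, which is exactly the first limitation below.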
While this works for small datasets and straightforward queries, the basic approach has several limitations:
- Single-shot retrieval: The context is stuffed into the prompt, and the LLM is invoked once without reflecting on whether the response adequately answers the query.
- Lack of query planning: Complex queries (implicit data requests, summarization, comparisons etc.) require planning beyond simple semantic search in vector space.
- No tool use: The pipeline cannot augment the context with data from additional sources.
- Lost context: Chunking and embedding strip away (some) surrounding context, so retrieved chunks can lack the information needed to interpret them.
- Weak keyword matching: Embedding models capture semantic relationships only, making exact keyword matches unreliable.
At a high level, improving RAG involves fixing these shortcomings in both the retrieval and generation phases, though the improvements often come at the cost of increased overhead and latency. A few interesting approaches:
- Add more context and enrich retrieval: Adding more context/metadata to chunks/embeddings and using that metadata to re-rank results (e.g., Sentence Window Retriever, Auto-Merging Retriever)
- Hybrid Retrieval: Combining traditional search models (BM25, TF-IDF etc.) with RAG, as seen in Contextual Retrieval and Blended RAG.
- Agentic RAG: Adding a layer of intelligence to the query-answer phase. Approaches here range from simple agents that handle routing and tool use to ReAct-style loops and multi-agent systems that decompose complex queries and gather information iteratively.
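To make the enriched-retrieval idea concrete, here is a rough sketch of sentence-window retrieval: matching happens against individual sentences for precision, but the returned context includes the neighboring sentences. The token-overlap scorer is a toy stand-in for embedding similarity:

```python
import re

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score(query: str, sentence: str) -> float:
    # Jaccard token overlap; a real system compares dense embeddings.
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    s = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return len(q & s) / (len(q | s) or 1)

def sentence_window_retrieve(query: str, text: str, window: int = 1) -> str:
    sentences = split_sentences(text)
    best = max(range(len(sentences)), key=lambda i: score(query, sentences[i]))
    # Return the best-matching sentence plus `window` neighbors on each side.
    lo, hi = max(0, best - window), min(len(sentences), best + window + 1)
    return " ".join(sentences[lo:hi])

doc = "Alpha was founded in 2010. It makes solar panels. Revenue doubled last year."
print(sentence_window_retrieve("who makes solar panels?", doc))
```

The match lands on the middle sentence, but the LLM also receives the surrounding sentences that resolve what "It" refers to.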
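For hybrid retrieval, one common way to merge a keyword ranking with a semantic ranking is Reciprocal Rank Fusion (RRF). A sketch, where the two input rankings are assumed to come from BM25 and vector search respectively:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document's fused score is the sum of 1/(k + rank) over every
    # ranking it appears in; k=60 is a conventional damping constant.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.__getitem__, reverse=True)

keyword_ranking = ["doc3", "doc1", "doc2"]   # e.g. from BM25
semantic_ranking = ["doc1", "doc2", "doc3"]  # e.g. from vector search
print(rrf([keyword_ranking, semantic_ranking]))
```

A document that ranks reasonably well in both lists (here `doc1`) beats one that tops only a single list, which is the behavior that makes rank fusion robust to either retriever's blind spots.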
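And a minimal sketch of the agentic routing step: a router picks a tool per query before retrieval. The keyword rule, `vector_search`, and `sql_lookup` are illustrative stubs; in practice the routing decision would itself be an LLM call:

```python
from typing import Callable

def vector_search(query: str) -> str:
    # Stub for semantic retrieval over an embedded corpus.
    return f"[chunks semantically similar to: {query}]"

def sql_lookup(query: str) -> str:
    # Stub for querying a structured store.
    return f"[rows matching: {query}]"

TOOLS: dict[str, Callable[[str], str]] = {
    "semantic": vector_search,
    "structured": sql_lookup,
}

def route(query: str) -> str:
    # Toy rule: aggregate-style questions go to the structured store.
    keywords = ("how many", "total", "average", "count")
    return "structured" if any(kw in query.lower() for kw in keywords) else "semantic"

def agentic_answer(query: str) -> str:
    tool = route(query)
    context = TOOLS[tool](query)
    # A fuller ReAct loop would reflect on `context` and issue follow-up
    # tool calls before generating the final answer.
    return f"answer from {tool} tool using {context}"

print(agentic_answer("total revenue last quarter"))
```

Even this one routing decision already addresses two of the limitations above: tool use and query planning beyond pure semantic search.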
RAG and information retrieval remain challenging problems; it will be interesting to see how LLMs and new techniques keep reshaping this space in the coming months and years.