Improving RAG
Retrieval Augmented Generation (RAG) augments the prompt with just-in-time data tailored to the query at hand.
A basic RAG pipeline chunks and embeds the source documents offline; at query time, it retrieves the chunks most similar to the query and passes them to the LLM as context.
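A minimal sketch of such a pipeline, using a toy bag-of-words "embedding" in place of a dense embedding model and a hypothetical `call_llm` placeholder in place of a real model API:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: token counts. A real pipeline uses a dense model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def call_llm(prompt: str) -> str:
    # Placeholder: swap in an actual LLM API call here.
    return f"[LLM answer based on a {len(prompt)}-char prompt]"

def rag_answer(query: str, chunks: list[str]) -> str:
    context = "\n".join(retrieve(query, chunks))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)  # single shot: no reflection, no retries

chunks = [
    "The warranty covers manufacturing defects for two years.",
    "Shipping within the EU takes three to five business days.",
]
print(rag_answer("How long is the warranty?", chunks))
```

Note the single LLM call at the end with no feedback loop, which is exactly the first limitation below.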
While this works for small datasets and straightforward queries, the basic approach has several limitations:
- Single-shot retrieval: The context is stuffed into the prompt, and the LLM is invoked once without reflecting on whether the response adequately answers the query.
- Lack of query planning: Complex queries (implicit data requests, summarization, comparisons etc.) require planning beyond simple semantic search in vector space.
- No tool use: The pipeline cannot augment the context with data from additional sources.
- Lost context: Chunking and embedding strip away (some) surrounding context, so retrieved chunks can lack the information needed to interpret them.
- Weak keyword matching: Embedding models capture semantic relationships only, making exact keyword matches unreliable.
At a high level, improving RAG involves fixing these shortcomings in both the retrieval and generation phases, though the improvements often come at the cost of increased overhead and latency. A few interesting approaches:
- Add more context and enrich retrieval: Adding more context/metadata to chunks/embeddings and using that metadata to re-rank results (e.g., Sentence Window Retriever, Auto-Merging Retriever)
- Hybrid Retrieval: Combining traditional search models (BM25, TF-IDF etc.) with RAG, as seen in Contextual Retrieval and Blended RAG.
- Agentic RAG: Adding a layer of intelligence to the query-answer phase. Approaches here range from simple agents that handle routing and tool use to ReAct-style loops and multi-agent systems that decompose complex queries and gather information iteratively.
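To make the enriched-retrieval idea concrete, here is a rough sketch of sentence-window retrieval: matching happens against individual sentences for precision, but the returned context includes the neighboring sentences. The token-overlap scorer is a toy stand-in for embedding similarity:

```python
import re

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score(query: str, sentence: str) -> float:
    # Jaccard token overlap; a real system compares dense embeddings.
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    s = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return len(q & s) / (len(q | s) or 1)

def sentence_window_retrieve(query: str, text: str, window: int = 1) -> str:
    sentences = split_sentences(text)
    best = max(range(len(sentences)), key=lambda i: score(query, sentences[i]))
    # Return the best-matching sentence plus `window` neighbors on each side.
    lo, hi = max(0, best - window), min(len(sentences), best + window + 1)
    return " ".join(sentences[lo:hi])

doc = "Alpha was founded in 2010. It makes solar panels. Revenue doubled last year."
print(sentence_window_retrieve("who makes solar panels?", doc))
```

The match lands on the middle sentence, but the LLM also receives the surrounding sentences that resolve what "It" refers to.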
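For hybrid retrieval, one common way to merge a keyword ranking with a semantic ranking is Reciprocal Rank Fusion (RRF). A sketch, where the two input rankings are assumed to come from BM25 and vector search respectively:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document's fused score is the sum of 1/(k + rank) over every
    # ranking it appears in; k=60 is a conventional damping constant.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.__getitem__, reverse=True)

keyword_ranking = ["doc3", "doc1", "doc2"]   # e.g. from BM25
semantic_ranking = ["doc1", "doc2", "doc3"]  # e.g. from vector search
print(rrf([keyword_ranking, semantic_ranking]))
```

A document that ranks reasonably well in both lists (here `doc1`) beats one that tops only a single list, which is the behavior that makes rank fusion robust to either retriever's blind spots.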
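And a minimal sketch of the agentic routing step: a router picks a tool per query before retrieval. The keyword rule, `vector_search`, and `sql_lookup` are illustrative stubs; in practice the routing decision would itself be an LLM call:

```python
from typing import Callable

def vector_search(query: str) -> str:
    # Stub for semantic retrieval over an embedded corpus.
    return f"[chunks semantically similar to: {query}]"

def sql_lookup(query: str) -> str:
    # Stub for querying a structured store.
    return f"[rows matching: {query}]"

TOOLS: dict[str, Callable[[str], str]] = {
    "semantic": vector_search,
    "structured": sql_lookup,
}

def route(query: str) -> str:
    # Toy rule: aggregate-style questions go to the structured store.
    keywords = ("how many", "total", "average", "count")
    return "structured" if any(kw in query.lower() for kw in keywords) else "semantic"

def agentic_answer(query: str) -> str:
    tool = route(query)
    context = TOOLS[tool](query)
    # A fuller ReAct loop would reflect on `context` and issue follow-up
    # tool calls before generating the final answer.
    return f"answer from {tool} tool using {context}"

print(agentic_answer("total revenue last quarter"))
```

Even this one routing decision already addresses two of the limitations above: tool use and query planning beyond pure semantic search.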
RAG and information retrieval remain challenging problems; it will be interesting to see how LLMs and new techniques keep reshaping this space in the coming months and years.