How can we make our RAGs Better?

Ever wondered how to measure the success of a RAG model? Or how to improve it, and on which parameters, especially when there are so many models to choose from?

The metrics that come to mind first are the accuracy of the model, the speed of data retrieval, and the cost of the LLM architecture. As a user of these models, I look for the one that is most accurate with respect to my input context, fast, and cheap to run. If I had to pick the single most important metric, though, it would be accuracy: I can live with a 100 ms delay, but I won't accept a loss of accuracy, especially when the data is critical or I need a precise response.

How can we make our RAG system accurate?

One component we can control and enhance is the model itself. However, by the principle of GIGO (garbage in, garbage out), if the prompt or the data source is corrupt or not up to the mark, the user often gets garbage back.
The user input, the input data sources, and the retrieval prompt are components we cannot control, so what we are left with is improving the pipeline for better retrieval.

Pipeline Improvement (RAG System design)

The most common retrieval problems we face are:

Users making spelling mistakes or typos.
Users lacking the technical knowledge to know what to ask.
Missing important keywords in the prompt.

One of the simplest solutions offered to improve our pipeline is Query Rewriting.

Query Rewriting

Very often, when users write queries, they make typos or simple spelling mistakes. This can affect the outcome if the model isn't able to understand the context.

In that case, introducing a small LLM (a "mini-GPT") to improve the query can prove beneficial. It fixes typos and adds more context. Although it adds some latency and cost overhead, retrieval accuracy improves. This process is called query translation.
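Here is a minimal sketch of the idea in Python. The prompt wording, the `rewrite_query` function, and the `fake_llm` stand-in are all assumptions made for illustration; in a real pipeline `llm` would be a call to whichever small model you use.

```python
from typing import Callable

# Hypothetical prompt; tune the wording for your own model and domain.
REWRITE_PROMPT = (
    "Rewrite the user's search query. Fix spelling mistakes and typos, "
    "and add any missing keywords that are clearly implied. "
    "Return only the rewritten query.\n\nQuery: {query}"
)


def rewrite_query(query: str, llm: Callable[[str], str]) -> str:
    """Ask a small LLM to clean up the query; keep the original if the reply is empty."""
    rewritten = llm(REWRITE_PROMPT.format(query=query)).strip()
    return rewritten or query


# Stand-in for a real model call so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return "How does memoization work in React?"


print(rewrite_query("how memoizaton wrk in raect", fake_llm))
```

Passing the model in as a plain function keeps the rewriting step easy to swap out or disable when the extra latency is not worth it.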

Corrective RAGs (Iterative methods)

Even if the query is refined, the retrieved chunks may still lack the context needed for the query the user asked, or the context may differ from chunk to chunk. In that case, we can introduce a mini-GPT to evaluate whether the retrieved chunks are relevant; if they are not, the query is sent back through query translation and retrieval is repeated. This reduces the tendency towards out-of-context output and helps ensure correct data retrieval.
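The retrieve, grade, and retry loop could be sketched roughly like this. Everything here is illustrative: `corrective_retrieve`, the three callables it takes, and the toy data are assumed names, and the grader is a simple keyword check standing in for an LLM judgment.

```python
from typing import Callable, List


def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    rewrite: Callable[[str], str],
    max_rounds: int = 2,
) -> List[str]:
    """Retrieve, grade each chunk, and retry with a rewritten query if nothing passes."""
    current = query
    for _ in range(max_rounds + 1):
        chunks = retrieve(current)
        # Grade against the original question, not the rewritten one.
        kept = [c for c in chunks if is_relevant(query, c)]
        if kept:
            return kept
        current = rewrite(current)
    return []


# Toy stand-ins so the sketch runs without a vector store or a model.
docs = {
    "react memo": ["React.memo skips re-rendering when props are unchanged."],
    "memo": ["A memo is a short written note."],
}


def toy_retrieve(q: str) -> List[str]:
    return docs.get(q, [])


def toy_grader(question: str, chunk: str) -> bool:
    return "React" in chunk


def toy_rewrite(q: str) -> str:
    return "react memo"


print(corrective_retrieve("memo", toy_retrieve, toy_grader, toy_rewrite))
```

The `max_rounds` cap matters in practice: every extra round costs another retrieval plus grading calls, so the loop should give up rather than retry forever.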

Subquery Optimization

Instead of relying on a single query, the user query is first passed through a query translator which generates multiple subqueries or enhanced variants of the original intent.

Each of these subqueries is then used to retrieve chunks independently, leading to multiple retrieval paths. However, not all retrieved chunks are equally relevant: some may introduce noise or off-topic context because of semantic drift across subqueries.

To address this, a ranking mechanism is applied across all retrieved results. Chunks that consistently appear across multiple subqueries (for example, in the top-k results of several branches) are ranked higher, because repeated appearance indicates stronger relevance. For example, if chunks 1, 2, and 3 appear in all retrievals, they are prioritized (rank 1), while chunks appearing in fewer branches (like 4 and 5) are given a lower rank, and outliers (like 6) are dropped. This ranking-based consolidation helps in selecting the most coherent context (Top-2 or Top-k), reducing hallucination and ensuring that only the most contextually aligned information is passed forward for final generation.
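The counting described above can be sketched in a few lines of Python. The function name `rank_by_agreement` and the `min_branches` cutoff are assumptions for illustration; the branch contents mirror the example of chunks 1 to 6.

```python
from collections import Counter
from typing import List


def rank_by_agreement(branches: List[List[int]], min_branches: int = 2) -> List[int]:
    """Rank chunk ids by how many subquery branches retrieved them."""
    counts: Counter = Counter()
    for branch in branches:
        counts.update(set(branch))  # count each chunk once per branch
    # Drop outliers that appear in fewer than min_branches branches.
    kept = [cid for cid, n in counts.items() if n >= min_branches]
    # More branches first; ties are broken by chunk id for a stable order.
    return sorted(kept, key=lambda cid: (-counts[cid], cid))


# One list of retrieved chunk ids per subquery.
branches = [
    [1, 2, 3, 4],
    [1, 2, 3, 5, 4],
    [1, 2, 3, 5, 6],
]

ranked = rank_by_agreement(branches)
print(ranked)      # [1, 2, 3, 4, 5]
print(ranked[:2])  # Top-2 context: [1, 2]
```

This sketch only counts how many branches a chunk shows up in. It ignores the position of a chunk inside each branch, which a production system may also want to take into account.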

HyDE (Hypothetical Document Embeddings)

Even after improving query quality and applying techniques like subquery optimization, there are scenarios where retrieval still fails to return highly relevant chunks—especially when dealing with private data, domain-specific knowledge, or vague user queries. This happens because the query itself may lack the richness or structure required to match effectively with stored embeddings, leading to less accurate or partially relevant results.

To address this, HyDE introduces a different approach. Instead of directly embedding the user query, we first pass it through an LLM to generate a hypothetical answer or document that represents what the ideal response should look like. This generated document is then converted into a vector embedding and used for retrieval. Since this embedding is richer and more context-aware, it improves the chances of fetching high-quality, semantically aligned chunks, effectively shifting the system from query-based retrieval to answer-oriented retrieval and enhancing overall accuracy.
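To make the flow concrete, here is a small runnable sketch. It is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, `fake_llm` returns a canned hypothetical answer, and the `hyde_retrieve` name and the sample chunks are made up for the example.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(query: str, chunks: List[str], llm: Callable[[str], str], k: int = 1) -> List[str]:
    """Embed an LLM-written hypothetical answer instead of the raw query."""
    hypothetical = llm(f"Write a short passage that answers: {query}")
    q_vec = embed(hypothetical)
    return sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)[:k]


chunks = [
    "refunds are issued within 14 days of a cancelled order",
    "our office is closed on public holidays",
]


# Stand-in for a real model call so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return "a cancelled order is refunded and refunds are issued within days"


print(hyde_retrieve("money back?", chunks, fake_llm))
```

Note that the raw query "money back?" shares no words with either chunk, so in this toy setup it could not be matched directly, while the hypothetical answer overlaps heavily with the refund chunk.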

Conclusion

In the end, improving a RAG system is less about changing the model and more about refining the pipeline around it. Techniques like query rewriting, corrective RAG, subquery optimization, and HyDE help bridge the gap between user intent and retrieved context.

While latency and cost can be optimized, accuracy remains the most critical factor, especially when dealing with sensitive or precise information. A well-designed retrieval pipeline ensures that the model doesn’t just respond fast but responds right.
