Improving Real-World RAG Systems
Key Challenges & Practical Solutions
Dipanjan (DJ) Sarkar
Head of Community & Principal AI Scientist at Analytics Vidhya
Published Author, Google Developer Expert & Cloud Champion Innovator
Slides & Code
[Link]
Understanding RAG Systems
What is a RAG System?
[Diagram: raw files, APIs and databases are indexed into vector stores; a user submits a query, the system retrieves context, and returns a response]
RAG System Architecture - Data Indexing
RAG System Architecture - Search and Generation
RAG System Challenges & Practical Solutions
Key Failure or Pain Points in a RAG System
Source: Seven Failure Points When Engineering a Retrieval Augmented Generation System
Problem: Missing Content
• Missing Content means the relevant context
to answer the question is not present in the
database
• Leads to the model giving a wrong answer
and hallucinating
• End users end up being frustrated with
irrelevant or wrong responses
Solutions for Missing Content
• Better data cleaning using tools like [Link] to ensure we extract good quality data
• Better prompting to constrain the model to NOT answer the question if the context is irrelevant
• Agentic RAG with search tools to get live information for questions with no context data
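The "better prompting" fix can be as simple as a refusal clause in the prompt template. A minimal sketch (the exact wording is illustrative, not a prescribed template):

```python
# A grounding prompt that tells the model to refuse rather than hallucinate
# when the retrieved context does not contain the answer.
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the information needed, reply exactly:
"I don't know based on the available documents." Do not guess.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(context: str, question: str) -> str:
    """Fill the template; the resulting string is sent to any chat LLM."""
    return GROUNDED_PROMPT.format(context=context, question=question)
```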
Hands-on Demo
• Better Data Cleaning
• Better Prompting
• Agentic RAG with Tools
• Get the notebook from HERE
Problem: Missed Top Ranked
• Missed Top Ranked means context
documents don’t appear in the top retrieval
results
• Leads to the model not being able to answer the
question
• Documents to answer the question are
present but failed to get retrieved due to
poor retrieval strategy
Problem: Not in Context
• Not in Context means documents with the
answer are present during initial retrieval
but did not make it into the final context for
generating an answer
• Bad retrieval, reranking and consolidation
strategies lead to missing out on the right
documents in context
Problem: Not Extracted
• Not extracted means the LLM struggles to
extract the correct answer from the
provided context even if it has the answer
• This occurs when there is too much
unnecessary information, noise or
contradicting information in the context
Problem: Incorrect Specificity
• Output response is too vague and is not
detailed or specific enough
• Vague or generic queries might lead to not
getting the right context and response
• Wrong chunking or bad retrieval can lead to
this problem
Solutions for Missed Top Ranked, Not in Context & Incorrect Specificity
• Use Better Chunking Strategies
• Hyperparameter Tuning - Chunking & Retrieval
• Use Better Embedder Models
• Use Advanced Retrieval Strategies
• Use Context Compression Strategies
• Use Better Reranker Models
Experiment with Various Chunking Strategies
Splitter Type - Description
• RecursiveCharacterTextSplitter - Recursively splits text into chunks based on several defined characters. Tries to keep related pieces of text next to each other. LangChain's recommended way to start splitting text
• CharacterTextSplitter - Splits text based on a user-defined character. One of the simpler text splitters
• tiktoken - Splits text based on tokens using trained LLM tokenizers like GPT-4's
• spaCy - Splits text using the tokenizer from the popular NLP library spaCy
• SentenceTransformers - Splits text based on tokens using trained open LLM tokenizers available from the popular sentence-transformers library
• unstructured [Link] - The unstructured library allows various splitting and chunking strategies, including splitting text based on key sections and titles
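As a rough illustration of what these splitters do, here is a naive fixed-size chunker with overlap. Real splitters such as LangChain's RecursiveCharacterTextSplitter additionally respect separators like paragraphs and sentences, so treat this only as a sketch of the core idea:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking with overlap between
    consecutive chunks, so context is not cut off at chunk boundaries."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The overlap means the last `overlap` characters of each chunk reappear at the start of the next one, which helps retrieval when the answer straddles a boundary.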
Hyperparameter Tuning - Chunking & Retrieval
[Diagram: documents are split into chunks of size C and indexed in a vector DB; for each question, the top K chunks above similarity threshold S form the context the LLM uses to generate an answer, which is scored with eval metrics]
Search space:
• Chunk Size (C): 500, 1,000, 2,000
• Top-K (K): 5, 8, 10
• Similarity Threshold (S): 0.2, 0.3, 0.5
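A grid like this can be searched exhaustively. A minimal sketch, where `evaluate` is a stand-in for your own pipeline (re-index at chunk size C, retrieve top-K above threshold S, score answers on an eval set):

```python
from itertools import product

# Candidate values taken from the grid above.
C_VALUES = [500, 1000, 2000]   # chunk size
K_VALUES = [5, 8, 10]          # top-k retrieved chunks
S_VALUES = [0.2, 0.3, 0.5]     # similarity threshold

def tune(evaluate):
    """Return the (C, K, S) combination with the best eval metric.
    evaluate(c, k, s) -> float, higher is better; in practice it would
    run the full index/retrieve/generate/eval loop for that setting."""
    return max(product(C_VALUES, K_VALUES, S_VALUES),
               key=lambda params: evaluate(*params))
```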
Better Embedder Models - MTEB Leaderboard
Better Embedder Models - Experiment Yourself
[Diagram: an embedding model maps the text "hello world" to a vector, e.g. -0.027 -0.001 -0.020 ... -0.023]
• Newer embedder models are trained on more data and are often better
• Don't just go by benchmarks - use and experiment on your own data
• Do not use commercial models if data privacy is important
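One way to "experiment on your data" is to measure the retrieval hit rate of each candidate embedder on a small eval set. A sketch, where `embed` is any embedding function (e.g. a sentence-transformers model's `encode` method):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieval_hit_rate(embed, eval_pairs, corpus, k=3):
    """Fraction of (query, relevant_doc) pairs where the relevant doc
    lands in the top-k documents ranked by cosine similarity."""
    doc_vecs = {doc: embed(doc) for doc in corpus}
    hits = 0
    for query, relevant in eval_pairs:
        qv = embed(query)
        ranked = sorted(corpus, key=lambda d: cosine(qv, doc_vecs[d]),
                        reverse=True)
        hits += relevant in ranked[:k]
    return hits / len(eval_pairs)
```

Running this with two different embedders on the same `eval_pairs` gives a direct, task-specific comparison instead of relying on leaderboard numbers.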
Advanced Retrieval Strategies
• Semantic Similarity Thresholding
• Multi-query Retrieval
• Hybrid Search (Keyword + Semantic)
• Reranking
• Chained Retrieval
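Hybrid search is often implemented by fusing the keyword and semantic result lists, for example with reciprocal rank fusion. A minimal sketch (k=60 is a commonly used smoothing constant, not a retrieval top-k):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists (e.g. one from BM25 keyword search and
    one from vector search) into a single ranking. Each document scores
    1 / (k + rank) per list it appears in; higher total wins."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that rank well in either list float to the top, which is why hybrid search helps when a query is phrased with exact keywords the embedder underweights.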
Better Reranker Models
• Rerankers are fine-tuned cross-encoder
transformer models
• These models take in a (Query, Document) pair
and return a relevance score
• Models fine-tuned on more pairs and
released recently will usually be better
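The reranking step itself is simple once you have such a model. A sketch, where `score_fn` is a stand-in for a real cross-encoder (e.g. the `predict` method of a sentence-transformers `CrossEncoder`):

```python
def rerank(query, docs, score_fn, top_n=3):
    """Re-order retrieved documents by cross-encoder relevance and keep
    the best top_n. score_fn(query, doc) -> float, higher = more relevant."""
    ranked = sorted(docs, key=lambda d: score_fn(query, d), reverse=True)
    return ranked[:top_n]
```

Typically you over-retrieve (say top 20 by vector similarity) and then rerank down to the handful of documents that actually enter the prompt.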
Context Compression Strategies
• LLM prompt-based Context
Compression
• Extractor: Filters out content from context document
not related to query
• Filter: Filters out whole context documents not
related to query
• Microsoft LLMLingua Prompt
Compression
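To illustrate the extractor idea, here is a toy sentence-level filter based on word overlap with the query. A production system would use an LLM-based extractor or LLMLingua instead; this only shows the shape of the operation:

```python
def compress_context(question, document, min_overlap=1):
    """Keep only sentences sharing at least min_overlap words with the
    question; everything else is dropped before prompting the LLM."""
    q_words = set(question.lower().split())
    kept = []
    for sentence in document.split("."):
        overlap = len(q_words & set(sentence.lower().split()))
        if sentence.strip() and overlap >= min_overlap:
            kept.append(sentence.strip())
    return ". ".join(kept)
```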
Solutions for Missed Top Ranked, Not in Context, Not Extracted & Incorrect Specificity
• Effect of Embedder Models
• Advanced Retrieval Strategies
Hands-on Demo
• Chained Retrieval with Rerankers
• Context Compression Strategies
• Get the notebook from HERE
Problem: Wrong Format
• The output response is in the wrong format
• This happens when you tell the LLM to return the
response in a specific format, e.g. JSON, and it fails
to do so
Solutions for Wrong Format
• Powerful LLMs have native support for response formats, e.g. OpenAI supports JSON outputs
• Better Prompting and Output Parsers
• Structured Output Frameworks
Solutions for Wrong Format - Native LLM Support
Solutions for Wrong Format - Output Parsers & Better Prompting
• LangChain allows you to convert the raw LLM
response into a more consumable format by
using Output Parsers.
• There exists a variety of parsers including:
• String parser
• CSV parser
• Pydantic parser
• JSON parser
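The core job of an output parser can be sketched without any framework: strip the code fences models sometimes wrap JSON in, parse it, and validate the fields. The `Answer` schema below is hypothetical; LangChain's JSON/Pydantic parsers bundle the same steps with retry-on-failure logic:

```python
import json
from dataclasses import dataclass

@dataclass
class Answer:
    answer: str
    confidence: float

def parse_llm_json(raw: str) -> Answer:
    """Parse an LLM response expected to be JSON into a typed object,
    tolerating the ```json fences models often add around output."""
    cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
    data = json.loads(cleaned)
    return Answer(answer=str(data["answer"]),
                  confidence=float(data["confidence"]))
```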
Solutions for Wrong Format - Structured Output Frameworks
Solutions for Wrong Format
Hands-on Demo
• Native LLM Support
• Output Parsers
• Get the notebook from HERE
Problem: Incomplete
• Incomplete means the generated response only
partially answers the question
• This could be because of poorly worded questions,
lack of the right retrieved context, or bad LLM reasoning
Solutions for Incomplete
• Use Better LLMs like GPT-4o, Claude 3.5 or Gemini 1.5
• Build Agentic Systems with Tool Use if necessary
• Use Advanced Prompting Techniques like Chain-of-Thought, Self-Consistency
• Rewrite the User Query and Improve Retrieval - HyDE
HyDE - Hypothetical Document Embedding
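The HyDE idea fits in a few lines: instead of embedding the raw question, embed a hypothetical answer generated by the LLM and search with that. Here `generate`, `embed` and `search` are stand-ins for your LLM, embedder and vector store:

```python
def hyde_retrieve(question, generate, embed, search, k=5):
    """HyDE: ask the LLM to draft a plausible answer passage, embed that
    passage, and use its vector to query the store. The hypothetical text
    usually sits closer in embedding space to real answer documents than
    the short question does."""
    hypothetical_doc = generate(f"Write a short passage answering: {question}")
    return search(embed(hypothetical_doc), k=k)
```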
Other Practical Solutions from recent Research Papers
which actually work!
RAG vs. Long Context LLMs
• Long Context LLMs often outperform RAG but are very
expensive in compute and cost
• A hybrid approach uses an LLM to reflect on whether
the RAG answer is good enough, and routes to the
Long Context LLM only when it isn't
RAG vs Long Context LLMs - Self-Router RAG
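The self-router pattern can be sketched as follows; all three callables are stand-ins for a cheap RAG pipeline, an LLM judge, and an expensive long-context model:

```python
def self_router(question, rag_answer_fn, judge, long_context_answer_fn):
    """Try the cheap RAG pipeline first; only escalate to the expensive
    long-context model if the judge deems the RAG answer insufficient."""
    answer = rag_answer_fn(question)
    if judge(question, answer):
        return answer
    return long_context_answer_fn(question)
```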
Agentic Corrective RAG
• Step 1: Retrieve context documents from the vector database for the input query
• Step 2: Use an LLM to check if the retrieved documents are relevant to the input question
• Step 3: If all documents are relevant (Correct), no specific action is needed
• Step 4: If some or all documents are not relevant (Ambiguous or Incorrect), rephrase the query and search the web for relevant context information
• Step 5: Send the rephrased query and context documents to the LLM for response generation
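The five steps above can be sketched as one function; every argument is a stand-in callable (retriever, LLM grader, query rewriter, web search, generator):

```python
def corrective_rag(question, retrieve, grade, rewrite, web_search, generate):
    """Corrective-RAG loop: grade retrieved docs, and if any are judged
    irrelevant, rephrase the query and augment with web search results."""
    docs = retrieve(question)                              # Step 1
    relevant = [d for d in docs if grade(question, d)]     # Step 2
    if len(relevant) < len(docs) or not relevant:          # Steps 3-4
        question = rewrite(question)
        relevant += web_search(question)
    return generate(question, relevant)                    # Step 5
```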
Source: Corrective Retrieval Augmented Generation; [Link]
Agentic Corrective RAG
Source: [Link]
Agentic Self-Reflection RAG
Source: Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection; [Link]
Retrieval Augmented Fine-tuning (RAFT)
Source: RAFT: Adapting Language Model to Domain Specific RAG; [Link]
Recent LLMs to Check Hallucinations
• GPT-4o from OpenAI
• Lynx from PatronusAI
Source: Lynx: An Open Source Hallucination Evaluation Model; [Link]
Key Takeaways
• Build an evaluation dataset and always evaluate your RAG system
• RAG is still very much a retrieval problem
• Explore various chunking and retrieval strategies, don't stick to default settings
• Even with Long Context LLMs, RAG isn't going anywhere (for now)
• Agentic RAG systems and domain-specific fine-tuned RAG systems are the future