Abstract
Recent developments in Large Language Models (LLMs)
have highlighted critical limitations in factual accuracy,
knowledge timeliness, and reasoning groundedness.
Retrieval-Augmented Generation (RAG) systems have
emerged as a promising solution by incorporating external
knowledge repositories. This study presents a
comprehensive evaluation of four implemented RAG
paradigms—Naive RAG, Self-RAG, Adaptive RAG, and
Corrective RAG—employing four LLMs: Mixtral-8x7b,
Gemma2-9b, Llama-3-70b, and Qwen-2.5-32b,
supplemented by rigorous testing of diverse retrieval
techniques. Our investigation spans both Closed-Domain
(arXiv research paper) and Open-Domain (Wikipedia)
datasets. The experiments reveal that Self-RAG with
Llama-3-70b achieves superior performance in technical
contexts, while Naive RAG excels in general-domain tasks.
Advanced retrieval strategies, combining hierarchical
summative clustering and hybrid reranking, are shown to
further elevate retrieval accuracy and precision. Lastly, we
acknowledge key limitations of our experiments, including
constrained dataset sizes, reliance on automated metrics
that may be insufficient or biased, uniform embedding
approaches, limited LLM and benchmark diversity, and the
exclusion of graph-based structured RAG modalities. Future
work should pursue broader dataset inclusion, refined
evaluation frameworks with a semantic focus, diversified
embeddings, expanded LLM testing, more efficient
computing frameworks, and exploration of graph-based
RAG for multi-step reasoning. We believe
these findings could help address core LLM limitations,
namely hallucination and knowledge temporality, thereby
advancing RAG's reliability for knowledge-specific inquiry.
Our codebase is available in the following GitHub repository:
https://github.com/William-coder/rag_project.git.
Keywords: Natural Language Processing, Retrieval-
Augmented Generation, Large Language Models,
Contextual Retrieval, Question Answering