TALKDOC: QUERY BASED INFORMATION RETRIEVAL FROM THE DOCUMENTS
Submitted in partial fulfillment for the award of the degree of
by
23mca1047
November, 2024
CONTENTS
ABSTRACT…………………………………………………………………………i
ACKNOWLEDGEMENT…………………………………………………………ii
CONTENTS……………………………………………………………………….. iii
CHAPTER 1………………………………………………………………………...1
INTRODUCTION…………………………………………………………………..1
1.1 KEY TERMINOLOGIES……………………………………………………….. 2
1.1.1 GENERATIVE AI…………………………………………………………….. 2
1.1.2 PROMPT……………………………………………………………………… 3
1.1.3 LARGE LANGUAGE MODELS (LLM)…………………………………….. 3
1.1.4 RETRIEVAL AUGMENTED GENERATION (RAG)………………………. 3
1.1.5 ADVANCED RAG……………………………………………………………. 3
1.1.6 MULTIPLE QUESTION GENERATION…………………………………… 4
1.1.7 SUMMARISATION…………………………………………………………... 4
CHAPTER 2……………………………………………………………...………….5
LITERATURE REVIEW…………………………………………………………….5
2.1 INTRODUCTION………………………………………………………………..5
CHAPTER 3……………………………………………………………………….20
METHODOLOGY………………………………………………………………...20
3.1 DATA………………………………………………………………………….. 20
3.1.1 DATA PREPROCESSING…………………………………………………... 20
3.1.2 VECTORISATION…………………………………………………………...20
3.1.3 VECTORSTORE…………………………………………………………….. 20
3.2 SIMILARITY SEARCH………………………………………………………..21
3.3 USER INPUT………………………………………………………………….. 21
3.4 USING ETHICS CHECKER…………………………………………………... 22
3.5 USING LARGE LANGUAGE MODELS…………………………………….. 22
3.5.1 GETTING BETTER CONTEXT……………………………………………. 22
3.6 ANSWER GENERATION…………………………………………………….. 23
3.7 CONVERTING THE FINAL RESPONSE TO SPEECH……………………... 23
3.8 METHODS USED……………………………………………………………... 24
3.8.1 speak_output()………………………………………………………………...24
3.8.2 TEXT EXTRACTION METHODS…………………………………………. 25
3.8.3 get_file_create_retiever(file_path)…………………………………………… 25
3.8.4 get_audio()…………………………………………………………………… 26
3.8.5 get_output(big_chunks_retriever,question)………………………………….. 26
3.8.6 generate_questions(question)………………………………………………… 26
3.8.7 summarise(outputs)…………………………………………………………... 26
3.8.8 check_ethics(question)……………………………………………………….. 26
CHAPTER 4…………………………………………………………...…………..28
RESULTS AND DISCUSSION……………………………………………………28
CHAPTER 5……………………………………………………………………….35
REFERENCES……………………………………………………………………37
APPENDICES…………………………………………………………………….39
LIST OF FIGURES
3.1 MODEL ARCHITECTURE 27
4.1.1 COMPARISON OF METRICS FOR HUMAN EVALUATION ACROSS ALL THREE DOCS 31
4.1.2 COMPARISON OF METRICS FOR BERT SCORE FOR THE FIRST DOC 31
4.1.3 COMPARISON OF METRICS FOR BERT SCORE FOR THE SECOND DOC 32
4.1.4 COMPARISON OF METRICS FOR BERT SCORE FOR THE THIRD DOC 32
LIST OF TABLES
4.1.1 HUMAN EVALUATION 29
LIST OF ACRONYMS
Chapter 1
Introduction
The rapid advancements in artificial intelligence (AI) have transformed the way people
interact with and process information. Among the most revolutionary developments in AI
is Generative AI, which enables the creation of content such as text, images, videos, and
audio. These models, built on transformer-based architectures, are trained on vast datasets
to predict and generate the next element in a sequence, allowing them to produce coherent
outputs when given a prompt—a specific instruction, often in plain language, that guides
their behavior. At the forefront of generative AI are Large Language Models (LLMs),
which excel in natural language processing tasks like text generation. These models, such
as Meta's advanced Llama 3.1 70B, are equipped with billions of parameters, enabling them
to understand and generate text with unprecedented sophistication. Prior to the advent of
LLMs, generating meaningful and contextually relevant text responses was a challenging
task. However, with LLMs, this capability has become not only feasible but also highly
efficient.
To address the challenge of extracting information from lengthy documents, our work presents an advanced Retrieval Augmented Generation
(RAG) application that transforms how users engage with textual content. RAG is an
advanced AI framework that combines retrieval-based and generative approaches to
provide accurate, contextually relevant responses. RAG systems enhance LLMs by
grounding their answers with information retrieved from external sources, such as
databases, document repositories, or knowledge bases. In a typical RAG system, when a
user poses a query, the system first searches an external knowledge base for information
that closely matches the query. This could involve pulling relevant passages from stored
documents, databases, or indexed sources. The retrieved information, called "retrieval
results" or "context," is passed along with the user’s query to the LLM. This context helps
the LLM generate answers that are more accurate and grounded in current information.
The LLM then generates a response based on both the retrieved information and the
original query, producing a well-informed, contextual answer.
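The retrieve-then-generate flow described above can be illustrated with a short sketch. The embed() and call_llm() functions below are hypothetical stubs standing in for a real embedding model and LLM; only the control flow of a basic RAG query is shown, not the system's actual implementation.

```python
# Sketch of the basic RAG flow: embed the query, rank stored passages by
# cosine similarity, and pass the best matches to the LLM as context.
# embed() and call_llm() are hypothetical stubs, not real model calls.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stub: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(16)

def call_llm(prompt: str) -> str:
    # Stub: a real system would send the prompt to an LLM here.
    return f"[answer grounded in the retrieved context for: {prompt[-60:]}]"

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rag_answer(query: str, passages: list[str], top_k: int = 2) -> str:
    q_vec = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q_vec, embed(p)), reverse=True)
    context = "\n".join(ranked[:top_k])
    prompt = f"Use only this context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(rag_answer("What is RAG?", ["RAG grounds LLM answers in retrieved text.",
                                  "An unrelated passage about something else."]))
```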
The advanced RAG system allows users to upload documents in various formats, such as
.txt, .docx, .rtf and .pdf, creating a comprehensive and interactive knowledge base. By
harnessing the power of large language models (LLMs), the application provides users with
immediate, accurate responses to specific queries, bypassing the need to manually navigate
through extensive text. This application introduces a user-friendly chat interface that facilitates direct interaction with uploaded documents; no typing is needed, as the user simply talks to the system. Instead of dedicating hours to reading, users can ask precise questions
and receive tailored answers based on their needs. This functionality serves as a powerful
tool for students, enabling them to quickly locate key points from their study materials.
Similarly, professionals can efficiently retrieve vital information from detailed reports or
technical publications, significantly boosting productivity and decision-making.
Additionally, the inclusion of a text-to-speech feature enhances accessibility for users who
prefer auditory inputs or require alternative ways to process information. By integrating
both textual and auditory modes, the application caters to diverse learning preferences,
ensuring adaptability for a wide range of users. This approach highlights the potential of
artificial intelligence to revolutionize information retrieval and management. By offering
a solution that is both efficient and inclusive, it addresses the growing demand for tools
that simplify access to critical knowledge while meeting the diverse needs of modern users.
1.1 KEY TERMINOLOGIES
1.1.1 GENERATIVE AI
Generative AI models are trained on a huge corpus of data and learn to predict the next word, sound, or pixel in a sequence. Hence, they can generate content when given a prompt.
1.1.2 PROMPT
A prompt is an instruction to a generative AI model, often written in plain English. It guides the model on what to output.
efficient similarity search. When a child chunk gets a high similarity score, the parent chunk for that child is used to give the LLM better context so that it can produce a better output.
1.1.7 SUMMARISATION
Giving the LLM a large context tends to produce a detailed, lengthy output. This output can be summarised and the summary presented to the user, who can still view the full output if they wish.
Chapter 2
Literature Review
2.1 INTRODUCTION
Recent advancements in RAG focus on improving overall system accuracy through methods such as reranking and automated evaluation systems.
Gupta and Patel [1] study an extractive text summarizer built on ELMo embeddings, which, according to the authors' empirical study, works noticeably better than the TextRank method in terms of accuracy and relevance. According to the findings, the ELMo-based
method is superior at extracting the most important details from the text, producing
summaries that more accurately represent the original material. Future work includes
increasing the model’s accuracy and summarizing multiple documents at once to
generate generalized summaries.
Swami et al. [2] propose a thorough method for automating the resume evaluation
process, which is becoming increasingly important. The authors, Pratibha Swami and Vibha Pratap, highlight the difficulties experienced by recruiters, who frequently go through 50 to 300 resumes for every job opening, making the process labor-intensive and time-consuming. To solve this problem, the authors suggest
an end-to-end approach that streamlines the hiring process by summarizing resumes'
content in addition to classifying them into pertinent categories. The method starts with
a thorough text cleaning procedure that involves changing the text to a better encoding
scheme, getting rid of stop words and punctuation, and changing all of the text to lower
case to cut down on repetition. Because it gets the data ready for efficient analysis and
categorization, this preprocessing stage is essential. After cleaning, the authors use a
variety of machine learning models to categorize the resumes, such as Support Vector
Machines (SVM), multinomial Naïve Bayes, and Logistic Regression. Originally
selected for its ease of use and efficiency, the Naïve Bayes model's accuracy of about
40% leads the authors to investigate alternative models. The Support Vector Classifier
performs better than the others, attaining an accuracy of 70% to 75%, according to the
paper's comparative analysis of several classification models. The authors credit this
achievement to the SVM's capacity to identify dataset non-linearities, which is crucial
for efficient categorization. The significance of word frequency distribution in resumes
is also covered in the study. Each sentence's score is dependent on how frequently its
constituent terms occur. By using this scoring system, the model is able to identify the
most important lines and provide a succinct synopsis of the resume that omits
unnecessary filler language and highlights the most important information. The authors
stress the value of structured data analysis and visualization in addition to the
categorization and summarizing procedures, as these methods offer deeper insights into
the resumes. They mention well-known commercial resume extraction tools like Sovren
Resume/CV Parser and ALEX Resume Parser, but they point out that their method
provides a more customized solution by integrating classification and summarization
into a single program. The authors highlight the possible influence of their model on the
hiring process and argue that automating the review of resumes might greatly lessen
recruiters' workload and increase hiring efficiency. The authors hope to improve
recruiters' decision-making process by offering a tool that efficiently classifies and
condenses resumes, enabling them to concentrate on the most qualified applicants faster.
preserving their integrity. The authors also go over relevant research in the area, pointing
out that although numerous methods have been put out to increase the comprehensibility
of summaries, many fail to strike a balance between factuality and simplification. By
dynamically incorporating reliable external knowledge sources and using a reward-
based strategy to optimize the model for improved readability, the RAG-RLRC-LaySum
framework seeks to close this gap. In conclusion, the framework's potential to
democratize scientific knowledge by increasing public access to it and fostering greater
involvement with biological discoveries is highlighted. To further increase the accuracy
and applicability of the summaries, future research will concentrate on broadening the
framework's knowledge sources and improving the integration of domain-specific
knowledge. In order to ensure that the generated summaries successfully satisfy the
needs of non-expert audiences, the authors also acknowledge the necessity of human
evaluations to supplement automatic metrics. All things considered, the RAG-RLRC-LaySum framework is a major breakthrough in the field of biomedical text summarization, fusing generative and retrieval methods to create high-quality, easily understood summaries that help the general public comprehend intricate scientific research.
integrity. The authors recommend students to consider ChatISA as an additional
resource rather than a primary source for academic assignments and offer explicit
instructions for ethical use in order to reduce these hazards. In order to promote
accountability and transparency, the chatbot enables students to download a PDF of their
exchanges, which they can then reference in assignments. The paper also discusses how students are using generative AI more and more, citing research that shows a sizable portion of students routinely use generative AI tools and believe they improve their
learning. This emphasizes the necessity of continuing discussions on the effects of AI
in education between teachers and students. The study describes several strategies used
in ChatISA, such as interactive learning to successfully engage students and prompt
engineering to direct the AI's responses. User feedback, task performance gains,
relevance of produced questions, correctness of AI responses, and engagement levels
are some of the performance measures used to assess ChatISA's efficacy. These metrics
aid in determining areas for improvement and evaluating how effectively the chatbot
satisfies students' demands. The paper also covers the budget for ChatISA's upkeep, which is between $200 and $300 USD per month. This budget enables the adoption of
cutting-edge AI technologies while keeping expenses under control. The authors also
emphasize how important it is for educational institutions to modify their curricula in
order to integrate ChatISA and other AI tools. It is crucial to prepare students for jobs
that depend on AI as it becomes more and more common in the workplace. To ensure
that resources like ChatISA improve rather than diminish the educational process,
faculty members are urged to reconsider their teaching strategies and homework
assignments in order to promote critical thinking and creative work. In conclusion,
ChatISA is a major step forward in the integration of AI into education, offering students
useful tools to improve their education and get ready for the workforce. In order to make
sure AI technologies properly enhance student learning and engagement, the study also
emphasizes the significance of ethical issues, adaptability in teaching, and ongoing
evaluation.
systems. It starts by describing the shortcomings of existing PDF parsing techniques, such as PyPDF, which frequently fall short in correctly identifying document structure, resulting in fragmented and confusing content representations. This is especially challenging for complicated documents that frequently appear in professional contexts and contain tables, multi-column layouts, and other complex formatting. The authors stress that it is difficult for systems to extract useful information from standard untagged documents, such as PDFs, because, unlike tagged formats such as HTML or Word, they are not machine-readable. The study compares two RAG systems: a baseline system that uses PyPDF for
parsing and ChatDOC, which makes use of a sophisticated PDF parser intended to
maintain document structure. The authors assess the effect of parsing quality on the
precision of responses produced by the RAG systems through methodical tests using a
dataset of 188 texts from a variety of topics. They point out that ChatDOC performs
better than the baseline in about 47% of the assessed questions, ties in 38%, and fails in
just 15% of them. This performance is explained by ChatDOC's capacity to obtain
longer and more accurate text passages, which is essential for responding to extractive
queries that call for exact data. The findings' implications for the future of RAG systems
are also covered in the research. It is suggested that better PDF structure identification
can greatly increase retrieval, which would improve the integration of domain
information into large language models (LLMs). The authors contend that as LLMs are
largely trained on publicly accessible material, their efficacy in vertical applications
depends on integrating specialized information from professional publications.
According to the study's findings, improvements in PDF parsing technology, like those
found in ChatDOC, have the potential to completely transform RAG systems and
increase their usefulness in work settings where precise information retrieval is crucial.
In conclusion, the study highlights the significance of excellent PDF parsing in the RAG
framework and shows that the quality of responses produced by QA systems can be
significantly enhanced by the capacity to correctly parse and partition complicated
documents. The results support a move toward increasingly complex parsing techniques
that can manage the complexities of business documents, improving the general efficacy
of knowledge-based applications across a range of industries.
The work called "Enhancing Biomedical Question Answering with Parameter-Efficient
Fine-Tuning and Hierarchical Retrieval Augmented Generation" details the CUHK-AIH
team's work on task 12b in the three phases of A, A+, and B of the 12th BioASQ
Challenge. To enhance the performance of biomedical question answering (QA), Gao
et al. [6] provide a system called Corpus PEFT Searching (CPS), which combines a
hierarchical retrieval-based methodology with Parameter-Efficient Fine-Tuning
(PEFT). In Phase A, the system generates a list of documents based on the input question
by using the BM25 retriever to look for pertinent documents from the PubMed Central
(PMC) corpus. For the phases that follow, this stage acts as a foundation. In Phase B, the authors fine-tune a Large Language Model (LLM) using the BioASQ training set,
utilizing the PEFT approach to improve the model's capacity to produce precise
responses. The process of fine-tuning is essential because it enables the model to adjust
to the unique features of biological queries and responses. In order to develop a more
complete biomedical QA system, Phase A+ presents a novel hierarchical Retrieval-
Augmented Generation (RAG) technique that integrates the BM25 retriever with the
optimized LLM. With just the biological question provided—no pre-identified
snippets—this phase aims to test the system's performance directly. The authors point
out that by using an ensemble retriever that blends sparse and dense retrieval
approaches, the RAG method improves the search results from Phase A and makes sure
that the most pertinent data is used to produce replies. Using thoughtfully constructed
prompts, the system generates high-quality responses by processing the query along
with the most pertinent document chunks. The experimental setup is also covered in the
publication, along with the training and evaluation datasets, which are question-answer
pairs from the BioASQ tasks. To examine the effects of various model types and
retrieval strategies on performance, the authors carried out thorough ablation research.
The findings show that the hierarchical RAG method and PEFT both considerably
improve the model's performance in biomedical QA tasks. Comparing Phase A+ to the
challenge's top competitors, the authors' performance measures demonstrate that
although Phase A+ produced competitive outcomes, there was still opportunity for
improvement. In conclusion, the study shows how biomedical question answering
systems can be improved by fusing sophisticated retrieval techniques with effective
fine-tuning techniques. The results highlight how crucial it is to use big language models
and cutting-edge retrieval strategies to handle the challenges of retrieving biomedical
data and responding to inquiries. According to the authors, their method may be used as
a basis for creating increasingly complex biomedical QA systems, which could help a
number of applications in bioinformatics, medical informatics, and related domains. All
things considered, the report opens the door for further developments in this crucial field
of study by offering insightful information on the continuous attempts to increase the
precision and applicability of responses derived from biomedical literature.
In order to solve the issues of fragmented domain knowledge that spans across multiple
experts and sources, Kuo et al. [7] propose a customized Question Answering (QA)
system designed for the Taiwanese film industry. The study emphasizes the value of
bringing disparate expertise together, especially in domains like film where knowledge
is frequently fragmented and incomplete. The study investigates the possibility of
incorporating large language models (LLMs) to produce a more efficient QA system by
utilizing developments in Natural Language Processing (NLP) and open-source
platforms like LangChain. The study addresses the drawbacks of general-purpose AI
chatbots, such as ChatGPT and Gemini, which, despite their extensive knowledge,
frequently have trouble answering specific questions and may provide false
information—a condition known as AI hallucination. In contrast to current chatbots, the
QA system used in this study is intended to extract information from local documents,
offering industry professionals deeper insights. The authors note that relying on local records from a single institution, while advantageous for targeted insights, introduces bias because of the uniformity of the source material. This bias may cause the QA system to ignore more general trends while distorting its responses to particular problems, such as production costs. The study also discusses how the COVID-19 pandemic has affected the film business, influencing the questions asked and the way replies are scored and resulting in an overemphasis on pandemic-related content. A set
of questions divided into three categories according to understanding needs and
complexity is part of the research approach. Two methods are used to evaluate the
performance of the QA system: a scoring evaluator from the LangChain framework and
human assessment. The findings show that the QA system performs better than other
chatbots in terms of accuracy and dependability, especially when paired with GPT-4.
The system's limits in interpreting graphical data and the difficulties associated with
input translation, particularly with regard to proprietary words and Taiwanese movie
names, are acknowledged by the developers. Future research aiming at improving the
efficacy of the QA system is also included in the report. In order to reduce bias, this
entails investigating a wider variety of document resources and improving
preprocessing methods for local documents to enhance the model's capacity to extract
significant information. In order to create a fair and sophisticated QA system, the
authors support a more thorough depiction of a variety of subjects. Overall, by offering
a framework that not only enhances access to specialized information but also tackles
the shortcomings of current AI solutions, the research represents a significant
improvement in the field of QA systems, especially for the Taiwanese film industry.
The results highlight the need for creative approaches to the management and
application of specialized knowledge across disciplines, which will ultimately lead to a
more knowledgeable and effective industry.
One of the study's main conclusions is that, particularly for more difficult questions, the model's
performance is greatly impacted by the length of the response it produces. While
excessively long explanations can impair performance on easier issues, detailed answers
are typically advantageous for more difficult ones. In order to guarantee that the model
thoroughly engages with the question before attempting to solve it, the study
additionally emphasizes the significance of the prompt's placement and the need to
indicate the minimal word count. The study provides a number of examples that show
how QAP prompts the model to offer more thorough justifications, which enhances
reasoning and response accuracy. The ideal word count varies with the complexity of the topic, though; the study also points out that shorter prompts (like QAP25) can lead to partial responses, especially in mathematical tasks. The authors
contend that QAP optimizes its comprehension and reduces the possibility of
overlooking important details by forcing the model to interpret the query directly. This
strategy marks a change from only concentrating on the solution to improving the
model's understanding of the current problem. The study comes to the conclusion that
QAP is a promising zero-shot prompting technique that can greatly enhance LLM
performance on reasoning tasks. This opens the door for more investigation into how
best to construct prompts and how question analysis and model reasoning skills interact.
All things considered, the results point to QAP as a potentially useful tool for scholars
and professionals wishing to use LLMs for more efficient applications of reasoning and
problem-solving.
retraining. The generator, usually a sequence-to-sequence model such as BART, uses the
retrieved texts as more context. The generator generates logical and contextually relevant
outputs by utilizing both the input query and the documents that were retrieved. The authors
show that RAG outperforms conventional models that just use extractive techniques,
achieving state-of-the-art performance on a variety of open-domain question answering
tasks. Even in cases where the precise response is absent from the collected documents,
RAG can create solutions by utilizing the generative capabilities of models such as BART.
This adaptability enables RAG to produce more accurate and instructive outputs by
generating responses that are based on the context supplied by the recovered documents.
The benefits of integrating parametric and non-parametric memory are also highlighted in
the research, which demonstrates how the non-parametric component can direct the
generation process by offering particular knowledge that enhances the model's learnt
parameters. The authors also investigate RAG's potential in fact verification tasks, such as the
FEVER dataset, where it outperforms state-of-the-art algorithms that necessitate intricate
retrieval supervision. This suggests that knowledge-intensive activities can be successfully
completed by RAG without requiring complex engineering or domain-specific systems.
The significance of retrieval in improving the performance of generative models is
emphasized in the research, which also suggests that integrating retrieval techniques can
greatly increase the model's capacity to manage dynamic and changing knowledge. The
experimental findings in the research indicate how well RAG performs on a variety of
benchmarks and how it can combine the advantages of generative and retrieval-based
methods. The authors also offer insights into the training process, pointing out that RAG
may be improved on a variety of sequence-to-sequence tasks, enabling the generator and
retriever components to learn together. By optimizing both elements to function in unison,
this end-to-end training method improves the model's performance even more. The study
concludes by presenting RAG as a potent framework that successfully bridges the gap
between retrieval and generation for knowledge-intensive NLP jobs. RAG raises the bar
for performance in open-domain question answering and fact verification by utilizing the
advantages of pre-trained models such as BERT and BART and integrating a non-
parametric memory through a dense vector index. The results imply that this paradigm can
be expanded upon in future studies to investigate even more complex retrieval and
generation integrations, which could result in additional developments in the field of
natural language processing.
Liu et al. [10] examine how well language models perform, especially when it comes
to tasks like multi-document question answering (QA) and key-value retrieval that call for
the identification of pertinent information from lengthy input contexts. The authors draw
attention to a major problem with existing language models: depending on where pertinent
information is located in the input context, their performance might significantly
deteriorate. A U-shaped performance curve is revealed by the study through a series of
controlled trials, with models performing best when pertinent information is found at the
beginning (primacy bias) or end (recency bias) of the context. On the other hand, even for
models built to handle lengthy contexts, performance drastically deteriorates when
essential information is placed in the middle of the inputs. This result implies that
information that is not situated at the extremes of the input context is difficult for language
models to reliably access and use. The study uses a number of language models, such as
GPT-3.5-Turbo and Llama-2, and assesses how well they function in various scenarios,
such as changing the quantity of documents in the input context and their sequence. Even
base models without instruction fine-tuning exhibit this U-shaped performance pattern,
according to the data, indicating that existing architectures' inability to handle lengthy
contexts efficiently is a fundamental problem. The study also examines the effects of
human feedback-based reinforcement learning and instruction fine-tuning, discovering that
although these methods enhance overall performance, they do not completely eradicate the
performance decrease that is shown when pertinent information is inserted in the middle
of the context. The authors examine the implications of the empirical results for the creation
of upcoming long-context language models in addition to the findings themselves.
According to the authors, it is necessary to show that a model's performance is only slightly
impacted by the location of pertinent information in order to assert that it may use lengthy
input contexts. They provide fresh evaluation procedures that might be useful in
determining how well language models handle lengthy contexts. A useful case study on
open-domain question answering is also included in the article. It shows that model
performance saturates far before the recall of retrieved documents, suggesting that existing
models are ineffective at utilizing more retrieved data. All things considered, the study
highlights the need for more research to improve language models' capabilities and offers
insightful information about their limitations while processing lengthy input contexts. The
results highlight how crucial context placement is to model performance and imply that in
order to overcome these obstacles, model architecture and training techniques must be
improved. By making their code and assessment data publicly available, the authors hope
to encourage more research and comprehension of how language models can be enhanced
to make better use of lengthy contexts, which will ultimately lead to improvements in
applications for natural language processing. The study is a crucial step in creating more
resilient language models that can efficiently traverse and extract pertinent data from large
input contexts. This is becoming more and more crucial in real-world applications like
document summarization, search engines, and conversational interfaces.
Cormack et al. [11] provide a thorough assessment of Reciprocal Rank Fusion (RRF), a novel
technique for merging document rankings from various information retrieval (IR) systems.
In contrast to individual ranking systems and well-known techniques like Condorcet Fuse
and CombMNZ, the authors contend that RRF, a straightforward and unsupervised
methodology, consistently produces better results. The study's foundation is a collection of
tests carried out with a number of TREC (Text REtrieval Conference) datasets, which are
well-known industry standards for information retrieval. The authors detail the
methodology employed in their experiments, including the selection of datasets, the
ranking methods compared, and the performance metrics used to evaluate results. The
primary metric for assessing performance was Mean Average Precision (MAP), which
measures the accuracy of the ranked results. The authors conducted pilot experiments to
determine optimal parameters for RRF and to evaluate its performance against competing
methods. The results indicated that RRF outperformed Condorcet and CombMNZ in most
cases, with statistical tests confirming the significance of these findings. The study
emphasizes how RRF's efficacy stems from its capacity to leverage the variety of individual
ranks, which might raise the rating of documents that other approaches might miss. The
authors also investigated RRF's potential for developing a meta-learner that raises the lower
bound of what can be learned from the LETOR 3 dataset by ranking it higher than any
previously published approach. Because RRF combines rankings without taking into
account the arbitrary scores that certain ranking techniques yield, the results imply that it
is not only easier to use but also more efficient than Condorcet Fuse. The authors draw the
conclusion that RRF offers a strong substitute for enhancing document retrieval results and
constitutes a substantial breakthrough in the field of information retrieval. The significance
of RRF in utilizing the advantages of numerous ranking systems while avoiding the
complications of supervised learning techniques is emphasized in the study. All things
considered, the study offers insightful information about RRF's efficacy and its uses in a
range of IR tasks, supporting the idea that more straightforward approaches frequently
result in superior performance in real-world situations. The authors argue that RRF should
be investigated further and incorporated into current IR frameworks since it may improve
search engine and other retrieval system performance. The paper makes a compelling
argument for RRF's adoption in the information retrieval sector by showcasing its benefits
through thorough experimentation and statistical analysis, opening the door for more study
and advancement in this subject.
Es et al. [12] propose RAGAS (Retrieval Augmented Generation Assessment), a methodology for the reference-free evaluation of Retrieval Augmented Generation (RAG) systems, which combine language model generation and information retrieval. Conventional evaluation techniques frequently use datasets that
have been annotated by humans, which can take a lot of time and resources. In order to
overcome this constraint, RAGAS concentrates on three key quality factors that are essential for evaluating the effectiveness of RAG systems: faithfulness, answer relevance, and
context relevance. The gpt-3.5-turbo-16k model from OpenAI, which enables the
extraction and analysis of statements from generated answers, is used by the authors to
build a fully automated evaluation method. The methodology breaks down larger
statements into more concise, targeted assertions to calculate faithfulness, making sure that
the responses' claims may be deduced from the context they give. While context relevance
evaluates the significance of the setting in connection to the question, response relevance
evaluates how well the generated answer answers the query. To help the language model
retrieve pertinent sentences from the context and produce assertions that accurately reflect
the faithfulness of the response, the authors employ particular prompts. The difficulties of
creating effective prompts are also covered in the paper because the way questions and
circumstances are phrased might have an impact on the evaluation's quality. The authors
point to earlier research in the area, pointing out that although some studies have looked
into using prompts for evaluation, they frequently did not show any discernible advantages.
By offering an organized strategy for assessing RAG systems without the use of human
references, RAGAS seeks to advance these techniques. The WikiEval dataset, which calls
for models to compare pairs of responses or context fragments, was used in the authors'
studies. They show that RAGAS can successfully capture the caliber of generated replies
by reporting that the correctness of their suggested metrics agrees with human annotators.
The findings show that the automated measures and human evaluations coincide quite well,
especially when it comes to faithfulness and context relevance. Concluding, RAGAS offers
a self-contained, reference-free methodology that can expedite the assessment process,
marking a substantial development in the evaluation of RAG systems. RAGAS offers a
strong framework for assessing the efficacy of retrieval-augmented generation techniques
by concentrating on important quality factors and utilizing the power of big language
models. In addition to advancing our knowledge of RAG systems, this work lays the
groundwork for further studies in automated evaluation techniques, which will eventually
improve the creation and use of more accurate and dependable language models. Given
that RAG systems are becoming more and more popular in a variety of applications, such
as conversational agents, summarization, and question answering, the authors stress the
significance of their findings for the larger area of natural language processing.
Chapter 3
Methodology
3.1 DATA
The data is provided by the user, who supplies the path to a file in .pdf, .docx, .txt, or .rtf format. There is no limit on page count or file size. The data can contain images and tables, but images are not processed, so a text-only file gives the best output.
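The report's text extraction methods are described later in Section 3.8.2; a minimal sketch of how the supported formats might be read is given below. It assumes the pypdf and python-docx packages, and the project's own extraction code may use different libraries.

```python
# Sketch of per-format text extraction (assumes pypdf and python-docx;
# the project's own extraction methods may differ).
from pathlib import Path

from pypdf import PdfReader     # pip install pypdf
from docx import Document       # pip install python-docx

def extract_text(file_path: str) -> str:
    suffix = Path(file_path).suffix.lower()
    if suffix == ".pdf":
        reader = PdfReader(file_path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        doc = Document(file_path)
        return "\n".join(p.text for p in doc.paragraphs)
    if suffix in (".txt", ".rtf"):
        # .rtf is read as plain text here; a dedicated RTF parser could be used instead.
        return Path(file_path).read_text(encoding="utf-8", errors="ignore")
    raise ValueError(f"Unsupported file type: {suffix}")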
3.1.2 VECTORISATION
The child chunks are vectorised. Vectorisation converts each chunk of text into an embedding vector. There are a number of methods for this, such as word2vec, but here an advanced transformer-based embedding model, Google's embedding-001, is used to create vectors for the child chunks.
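A minimal sketch of how the child chunks could be embedded with embedding-001 through LangChain is shown below. It assumes the langchain-google-genai package and a GOOGLE_API_KEY environment variable; the exact wiring in the project may differ.

```python
# Sketch: embedding child chunks with Google's embedding-001 model via
# LangChain (assumes langchain-google-genai and a GOOGLE_API_KEY).
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

child_chunks = ["First child chunk of the document.", "Second child chunk."]
chunk_vectors = embeddings.embed_documents(child_chunks)   # one vector per chunk
query_vector = embeddings.embed_query("What does the document say about X?")
```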
3.1.3 VECTORSTORE
Once the embeddings are generated, they need to be stored in a vectorstore so that similarity search can be performed between the user's query embedding and the stored embedding vectors of the child chunks from the uploaded document. Facebook AI Similarity Search (FAISS) was chosen because it builds indexes for the vectors efficiently; it handles large files well and retrieval is fast. Unlike other popular options such as ChromaDB, which stores vectors in SQLite, and Pinecone, which is a dedicated cloud vector database, FAISS keeps the index in memory, which makes it both fast and economical.
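A minimal sketch of building the in-memory FAISS index over the child chunks via LangChain, assuming the langchain-community and faiss-cpu packages (the chunk strings and embedding model here are placeholders):

```python
# Sketch: storing the child-chunk embeddings in an in-memory FAISS index and
# running a similarity search against the user's query.
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
child_chunks = ["First child chunk of the document.", "Second child chunk."]

vectorstore = FAISS.from_texts(child_chunks, embeddings)
hits = vectorstore.similarity_search("What does the document say about X?", k=4)
for doc in hits:
    print(doc.page_content)
```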
The speech_recognition library in Python is used to detect speech input from a microphone. By default, the device's built-in microphone is used, so if an external microphone is attached, the voice input device may need to be changed in the system settings if it is not switched automatically.
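A minimal sketch of capturing speech from the default microphone with the speech_recognition library (PyAudio must be installed for sr.Microphone); the project's own recording code may differ.

```python
# Sketch: recording a spoken query from the default microphone with the
# speech_recognition library.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:              # uses the system's default input device
    recognizer.adjust_for_ambient_noise(source)
    print("recording started")
    audio = recognizer.listen(source)        # stops after a pause in speech
print("Done recording")
```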
question asked by the user. Instead of just one query, which may be vague, three queries are searched; the additional questions may carry better information than the original query, so the similarity search yields better results. Initially, the Llama 3.1 8B model was used instead of Llama 3.1 70B to make question generation faster, since the 70B model is slower at processing and using a smaller model for sub-tasks speeds up the system. However, because the 8B model does not match the reasoning capabilities of Llama 3.1 70B, the 70B model was ultimately used. Multiple questions are generated only for shorter queries: if the user's question has fewer words than a threshold value, the Llama 3.1 70B model generates two more questions.
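A minimal sketch of this step is shown below. The word threshold, the prompt wording, and the generate() callable are assumptions; the project passes the query to Llama 3.1 70B, which is abstracted here as any function that maps a prompt to text.

```python
# Sketch: generate two additional questions only when the user's query is
# shorter than a word threshold. The threshold and prompt wording are
# assumptions; `generate` stands in for the Llama 3.1 70B call.
WORD_THRESHOLD = 8  # hypothetical value

def generate_questions(question: str, generate) -> list[str]:
    if len(question.split()) >= WORD_THRESHOLD:
        return [question]                     # longer queries are used as-is
    prompt = ("Write two alternative questions that ask for the same information "
              f"as the following question, one per line:\n{question}")
    extra = [q.strip() for q in generate(prompt).splitlines() if q.strip()][:2]
    return [question] + extra
```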
Using the win32com library, Windows components can be controlled from a Python script. The library provides a layer of connection between Python and the Component Object Model (COM) architecture, which is used by various Windows applications. Hence, with win32com, tasks in various Windows applications can be automated, and the media player can be opened to speak out the text generated by the Large Language Model.
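One common COM object for speech on Windows is SAPI.SpVoice, shown in the sketch below; since the report describes a media-player route, the exact COM component used by the project may differ.

```python
# Sketch: speaking text through a COM object with win32com (pywin32).
# SAPI.SpVoice is one common choice; the project's exact COM component may differ.
import win32com.client

def speak(text: str) -> None:
    voice = win32com.client.Dispatch("SAPI.SpVoice")
    voice.Speak(text)

speak("This is the answer generated by the language model.")
```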
From 3.5 to 3.11, the application runs in a loop until the user specifically commands it to break out of the loop, at which point the program terminates.
This entire process gives the user the feel of talking to a human rather than just typing questions and reading generated text, as is done in many other applications. User experience is an important factor in the popularity of an application, since it determines how many users find the application comfortable to use, and interactivity is one of its core parts. No explicit user interface was created; the application is simply run from the command prompt. The user gives the path of the file, and vital information, such as whether the file was saved and how many characters it contains, is displayed. The user is prompted to start speaking with a “recording started” message and can end the recording by pressing the Ctrl+C keys, after which a “Done recording” message is shown. Both the questions asked and the answers spoken are displayed on screen as well, in case the user wants to read them or copy and paste them anywhere. To enhance interactivity, the program runs in a loop: after it finishes speaking, the user can start speaking again, and the conversation continues until the user mentions “bye” in their speech. The script detects the user saying “bye” and terminates the program by thanking the user for trying it out. The user can mention the word “bye” within a longer sentence, and the script can still detect it among the other words.
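A minimal sketch of this conversational loop is given below; the get_audio, get_output, and speak callables are passed in as parameters and stand in for the project's own methods, and the exact wiring is an assumption.

```python
# Sketch of the conversational loop: record a question, answer it, speak the
# answer, and stop when the user says "bye". The helper callables stand in
# for the project's own methods.
def conversation_loop(retriever, get_audio, get_output, speak):
    while True:
        question = get_audio()                          # speech converted to text
        print("Q:", question)
        if "bye" in question.lower().split():           # "bye" anywhere in the sentence
            speak("Thank you for trying out the program.")
            break
        answer = get_output(retriever, question)
        print("A:", answer)
        speak(answer)
```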
check_ethics(), generate_questions(), get_output() and summarise(). speak_output() is
the primary method and acts as the entry point to the program.
3.8.4 get_audio()
The get_audio() method starts and ends the recording of the user's query (PyAudio is used for recording), converts it into text using the Base model of OpenAI's Whisper, and returns the converted text.
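The transcription half of get_audio() can be sketched as follows, assuming the recorded audio has been saved to a WAV file and the openai-whisper package is installed; the file name is a placeholder.

```python
# Sketch: transcribing the recorded query with Whisper's Base model.
import whisper

model = whisper.load_model("base")
result = model.transcribe("query.wav")   # path to the recorded audio (assumed)
question_text = result["text"]
```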
3.8.5 get_output(big_chunks_retriever,question)
The get_output(big_chunks_retriever, question) method takes the retriever and the user's query, fetches the relevant chunks, and passes them to the LLM; the LLM's generated output is then returned.
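A minimal sketch of this retrieval-plus-generation step is shown below; it assumes the langchain-groq package, a GROQ_API_KEY, and a Groq-hosted Llama 3.1 70B model id, and the prompt wording is an assumption.

```python
# Sketch: fetch the relevant (parent) chunks for the query and pass them,
# with the question, to the LLM. Model id and prompt wording are assumptions.
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama-3.1-70b-versatile")

def get_output(big_chunks_retriever, question: str) -> str:
    docs = big_chunks_retriever.invoke(question)              # relevant chunks
    context = "\n\n".join(d.page_content for d in docs)
    prompt = (f"Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm.invoke(prompt).content
```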
3.8.6 generate_questions(question)
This method takes in the original query asked by the user and generates two additional
questions similar to what the user asked and returns a list containing all three questions.
3.8.7 summarise(outputs)
This method is only called when there are two additional questions generated. The list
containing all the responses from the LLM is passed to the summarise method. The method
returns the summary string of all the responses in the list.
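A minimal sketch of summarise(), with the LLM call abstracted as a generate() callable and the prompt wording assumed:

```python
# Sketch: combine the individual LLM responses and ask the LLM for one
# concise summary. `generate` stands in for the LLM call.
def summarise(outputs: list[str], generate) -> str:
    combined = "\n\n".join(outputs)
    prompt = f"Summarise the following answers into one concise response:\n\n{combined}"
    return generate(prompt)
```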
3.8.8 check_ethics(question)
The check_ethics(question) method takes the query asked by the user and classifies it as either ethical and harmless or unethical and harmful. It returns a string containing the classification value (1 for ethical and harmless, 2 for unethical and harmful) and a reason, separated by a double pipe symbol (||).
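A minimal sketch of check_ethics(), again with the LLM call abstracted as a generate() callable and the prompt wording assumed; the reply is split on the double pipe described above.

```python
# Sketch: ask the LLM to label the query as 1 (ethical and harmless) or
# 2 (unethical and harmful) and split its "label||reason" reply.
def check_ethics(question: str, generate) -> tuple[int, str]:
    prompt = ("Classify the question below as 1 (ethical and harmless) or "
              "2 (unethical and harmful). Reply strictly as '<label>||<reason>'.\n"
              f"Question: {question}")
    label, _, reason = generate(prompt).partition("||")
    return int(label.strip()), reason.strip()
```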
Figure 3.1. Model Architecture
Chapter 4
Results and Discussion
4.1 RESULTS
We primarily relied on human evaluation and BERT Score to evaluate our application.
BERT Score needs the ground truth (the actual expected response) as the base to compare
with the model’s generated response. Hence, a human is required to generate the ground
truth answers. Three scores are taken from BERT Score to evaluate how well the RAG
system performs:
● Precision: The ratio of relevant results to the total results returned by the LLM.
● Recall: The ratio of relevant results returned by the LLM to the total relevant
results.
● F1-Score: The harmonic mean of Precision and Recall.
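These three scores can be computed with the bert-score package; a minimal sketch for one document's question set is shown below. The candidate and reference strings are placeholders, not the actual evaluation data.

```python
# Sketch: computing BERT Score Precision, Recall and F1 for one document's
# questions (assumes pip install bert-score). Strings are placeholders.
from bert_score import score

candidates = ["Generated answer 1", "Generated answer 2", "Generated answer 3"]
references = ["Ground-truth answer 1", "Ground-truth answer 2", "Ground-truth answer 3"]

P, R, F1 = score(candidates, references, lang="en")
print(f"Precision {P.mean().item():.3f}  Recall {R.mean().item():.3f}  F1 {F1.mean().item():.3f}")
```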
Three documents were selected, and three questions were prepared for each of them.
Document 1: The paper by Alkhalaf et al. [13], which contains a rich mix of different techniques; it was helpful for testing whether all the techniques were properly extracted and presented to the user.
Document 2: An English translation of the Bhagavad Gita. It is a simple text-only file and was useful for testing whether the RAG system can answer questions about topics that were not directly present in the text.
Document 3: The paper by Cormack et al. [11] was the third document used. The focus was to see if tabular data can be retrieved efficiently.
Humans evaluated the system on:
● Quality of Response:
○ Factual Accuracy: Assess the correctness of the information presented in the response. (scale: 1-5, 1 being not at all accurate, 5 being highly accurate)
○ Relevance: Evaluate how well the response addresses the query and the
provided context. (scale: 1-5, 1 being not at all relevant, 5 being highly
relevant)
○ Comprehensiveness: Determine if the response covers all relevant aspects
of the query. (scale: 1-5, 1 being not at all comprehensive, 5 being highly
comprehensive)
○ Coherence: Judge the logical flow and overall coherence of the generated text. (scale: 1-5, 1 being not at all coherent, 5 being highly coherent)
○ Conciseness: Assess the brevity and clarity of the response. (scale: 1-5, 1
being not at all concise, 5 being highly concise)
● Quality of Retrieval:
○ Relevance: Evaluate the relevance of retrieved documents to the query.
(scale: 1-5, 1 being not at all relevant, 5 being highly relevant)
○ Diversity: Assess the diversity of information sources used in the response.
(scale: 1-5, 1 being not at all diverse, 5 being highly diverse)
○ Completeness: Determine if all relevant information sources were retrieved.
Also consider how much of the information was retrieved. (scale: 1-5, 1
being not at all complete, 5 being complete)
● Other Factors:
○ Bias: Assess the presence of bias in the generated response, such as gender, race, or cultural bias. (scale: 0 or 1, 1 meaning bias is present, 0 meaning no bias is present)
○ Fairness: Evaluate the fairness of the system in terms of its treatment of
different user groups. (scale: 1-5, 1 being not at all fair, 5 being fair)
○ Toxicity: Determine the level of toxicity or offensive language in the
generated text. (scale: 1-5, 1 being not at all toxic, 5 being highly toxic)
Three different documents were used as the knowledge base, with three questions each, so a total of nine questions across three documents were used for human evaluation.
Metric                 Document 1   Document 2   Document 3
Factual Accuracy            5            5            5
Relevance                   4            4            5
Comprehensiveness           5            5            5
Coherence                   3            4            4
Conciseness                 5            5            4
Retrieval Relevance         4            4            3
Diversity                   5            5            4
Completeness                5            5            5
Bias                        0            0            0
Fairness                    5            5            5
Toxicity                    1            1            1
Table 4.1.1. Human Evaluation
For BERT Score, the same three documents and the same three questions per document were used. The table below shows the average precision, recall, and F1 Score for each document.
Figure 4.1.1. Comparison of metrics for human evaluation across all three documents
Figure 4.1.2. Comparison of metrics for BERT Score for the first document
Figure 4.1.3. Comparison of metrics for BERT Score for the second document
Figure 4.1.4. Comparison of metrics for BERT Score for the third document
Chapter 5
With all these in place, Talkdoc isn't perfect. It might be slow at times due to its dependency on free cloud platforms, such as Groq, where the model used (Llama 3.1 70B) is hosted. It would be faster if a paid service were used or the models were run locally, which would increase interactivity by decreasing the delay in response. The spoken voice generated by the system is very robotic. There are solutions that generate more human-sounding voices which could be used in future iterations, but they would be costly as well. The multiple question
generation can also be improved if the Large Language Model knows the
context of the uploaded document. It can then generate similar questions from
the user question tailored to the file uploaded which would further enhance
question generation. Since the entire backbone of the system is a RAG system,
getting a summary or the overall themes of the document is not very accurate. In further
iterations, different methods can be created to summarize or discuss the themes
in the document when the user asks for it. Furthermore, introducing a reranker
might yield better outputs.
REFERENCES
[1] Gupta, H., & Patel, M. (2020, October). Study of extractive text summarizer using the
elmo embedding. In 2020 Fourth International Conference on I-SMAC (IoT in Social,
Mobile, Analytics and Cloud)(I-SMAC) (pp. 829-834). IEEE.
[2] Swami, P., & Pratap, V. (2022, May). Resume classifier and summarizer. In 2022
International Conference on Machine Learning, Big Data, Cloud and Parallel Computing
(COM-IT-CON) (Vol. 1, pp. 220-224). IEEE.
[3] Ji, Y., Li, Z., Meng, R., Sivarajkumar, S., Wang, Y., Yu, Z., ... & He, D. (2024). RAG-
RLRC-LaySum at BioLaySumm: Integrating Retrieval-Augmented Generation and
Readability Control for Layman Summarization of Biomedical Texts. arXiv preprint
arXiv:2405.13179.
[4] Megahed, F. M., Chen, Y. J., Ferris, J. A., Resatar, C., Ross, K., Lee, Y., & Jones-
Farmer, L. A. (2024). ChatISA: A Prompt-Engineered Chatbot for Coding, Project
Management, Interview and Exam Preparation Activities. arXiv preprint
arXiv:2407.15010.
[6] Gao, Y., Zong, L., & Li, Y. (2024). Enhancing biomedical question answering with
parameter-efficient fine-tuning and hierarchical retrieval augmented generation. CLEF
Working Notes.
[7] Kuo, E. C., & Su, Y. H. (2024, June). Assembling Fragmented Domain Knowledge: A
LLM-Powered QA System for Taiwan Cinema. In 2024 IEEE Congress on Evolutionary
Computation (CEC) (pp. 1-8). IEEE.
[8] Yugeswardeenoo, D., Zhu, K., & O'Brien, S. (2024). Question-Analysis Prompting
Improves LLM Performance in Reasoning Tasks. arXiv preprint arXiv:2407.03624.
[9] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D.
(2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in
Neural Information Processing Systems, 33, 9459-9474.
[10] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P.
(2024). Lost in the middle: How language models use long contexts. Transactions of the
Association for Computational Linguistics, 12, 157-173.
[11] Cormack, G. V., Clarke, C. L., & Buettcher, S. (2009, July). Reciprocal rank fusion
outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd
international ACM SIGIR conference on Research and development in information
retrieval (pp. 758-759).
[12] Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). Ragas: Automated
evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217.
[13] Alkhalaf, M., Yu, P., Yin, M., & Deng, C. (2024). Applying generative AI with
retrieval augmented generation to summarize and extract key clinical information from
electronic health records. Journal of Biomedical Informatics, 104662.
Appendices
Setting important parameters, defining the LLM and speak variables and creating the
functions to read the files
Defining the method to get the query as audio and convert it into text
Defining a method to get the output from the LLM
Defining a method to generate additional and similar questions to the user query
multiple questions