
A project report on

TALKDOC: QUERY BASED INFORMATION


RETRIEVAL FROM THE DOCUMENTS

Submitted in partial fulfillment for the award of the degree of

Master of Computer Applications

by

SATYAJEET NARAYAN (23MCA1047)

SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

November, 2024
ABSTRACT

The work focuses on creating an advanced Retrieval Augmented Generation (RAG) application that enables users to input documents in multiple formats (.txt, .docx, and .pdf) in order to build an interactive knowledge base. A large language model (LLM) can then draw on this knowledge base to give users immediate access to pertinent information without requiring them to manually sort through lengthy documents. Academic papers, reports, and contracts are just a few of the many reading materials that professionals and students deal with on a daily basis in today's hectic world. Because many of these documents are long and dense, it can be difficult for readers to study them carefully or remember all the information, which frequently leads to insufficient understanding or, worse, forgetting crucial details soon after reading. By providing a chat interface through which users can interact directly with the document, the application aims to mitigate these difficulties. Instead of spending hours reading, users can ask targeted queries and receive precise, needs-based responses. This can be a very helpful study tool for students, enabling them to swiftly access important material from notes or textbooks, while professionals can use it to find pertinent information in lengthy reports or technical publications, increasing productivity and efficiency. A text-to-speech feature further increases the program's adaptability: users who prefer auditory output can have the responses read aloud to them, which aids learning and information absorption. This multimodal approach improves accessibility for a range of learning requirements and styles by combining visual engagement through text queries with auditory responses.

Keywords - Retrieval Augmented Generation, large language model, artificial intelligence, multimodal, text-to-speech

CONTENTS

ABSTRACT…………………………………………………………………………i

ACKNOWLEDGEMENT…………………………………………………………ii

CONTENTS……………………………………………………………………….. iii

CHAPTER 1………………………………………………………………………...1

INTRODUCTION…………………………………………………………………..1
1.1 KEY TERMINOLOGIES……………………………………………………….. 2
1.1.1 GENERATIVE AI…………………………………………………………….. 2
1.1.2 PROMPT……………………………………………………………………… 3
1.1.3 LARGE LANGUAGE MODELS (LLM)…………………………………….. 3
1.1.4 RETRIEVAL AUGMENTED GENERATION (RAG)………………………. 3
1.1.5 ADVANCED RAG……………………………………………………………. 3
1.1.6 MULTIPLE QUESTION GENERATION…………………………………… 4
1.1.7 SUMMARISATION…………………………………………………………... 4

CHAPTER 2……………………………………………………………...………….5
2.1 INTRODUCTION………………………………………………………………..5

2.2 LITERATURE REVIEW……………………………………………………….. 5

CHAPTER 3……………………………………………………………………….20

METHODOLOGY………………………………………………………………...20
3.1 DATA………………………………………………………………………….. 20
3.1.1 DATA PREPROCESSING…………………………………………………... 20
3.1.2 VECTORISATION…………………………………………………………...20
3.1.3 VECTORSTORE…………………………………………………………….. 20
3.2 SIMILARITY SEARCH………………………………………………………..21
3.3 USER INPUT………………………………………………………………….. 21
3.4 USING ETHICS CHECKER…………………………………………………... 22
3.5 USING LARGE LANGUAGE MODELS…………………………………….. 22
3.5.1 GETTING BETTER CONTEXT……………………………………………. 22
3.6 ANSWER GENERATION…………………………………………………….. 23
3.7 CONVERTING THE FINAL RESPONSE TO SPEECH……………………... 23
3.8 METHODS USED……………………………………………………………... 24

3.8.1 speak_output()………………………………………………………………...24
3.8.2 TEXT EXTRACTION METHODS…………………………………………. 25
3.8.3 get_file_create_retiever(file_path)…………………………………………… 25
3.8.4 get_audio()…………………………………………………………………… 26
3.8.5 get_output(big_chunks_retriever,question)………………………………….. 26
3.8.6 generate_questions(question)………………………………………………… 26
3.8.7 summarise(outputs)…………………………………………………………... 26
3.8.8 check_ethics(question)……………………………………………………….. 26

CHAPTER 4…………………………………………………………...…………..28

RESULTS AND DISCUSSION………...………………………………………... 28


4.1 RESULTS……………………………………………………………………… 28
4.2 RESULT SCREENSHOTS……………………………………………………. 32

CHAPTER 5……………………………………………………………………….35

CONCLUSION AND FUTURE WORK………………………………………...35

REFERENCES……………………………………………………………………37

APPENDICES…………………………………………………………………….39

APPENDIX 1: SOURCE CODE…………………………………………………..39

LIST OF FIGURES
3.1 MODEL ARCHITECTURE 27

4.1.1 COMPARISON OF HUMAN EVALUATION FOR ALL THREE DOCS 31

4.1.2 COMPARISON OF METRICS FOR BERT SCORE FOR THE FIRST DOC 31

4.1.3 COMPARISON OF METRICS FOR BERT SCORE FOR THE SECOND DOC 32

4.1.4 COMPARISON OF METRICS FOR BERT SCORE FOR THE THIRD DOC 32

LIST OF TABLES

4.1.1 HUMAN EVALUATION 29

4.1.2 BERT-SCORE EVALUATION 30

LIST OF ACRONYMS

RAG Retrieval Augmented Generation


LLM Large Language Model
BERT Bidirectional Encoder Representations from Transformers
TTS Text-To-Speech
PEFT Parameter-Efficient Fine-Tuning
LoRA Low-Rank Adaptation

Chapter 1

Introduction
The rapid advancements in artificial intelligence (AI) have transformed the way people
interact with and process information. Among the most revolutionary developments in AI
is Generative AI, which enables the creation of content such as text, images, videos, and
audio. These models, built on transformer-based architectures, are trained on vast datasets
to predict and generate the next element in a sequence, allowing them to produce coherent
outputs when given a prompt—a specific instruction, often in plain language, that guides
their behavior. At the forefront of generative AI are Large Language Models (LLMs),
which excel in natural language processing tasks like text generation. These models, such
as Meta's advanced Llama 3.1 70B, are equipped with billions of parameters, enabling them
to understand and generate text with unprecedented sophistication. Prior to the advent of
LLMs, generating meaningful and contextually relevant text responses was a challenging
task. However, with LLMs, this capability has become not only feasible but also highly
efficient.

In today’s fast-paced environment, both students and professionals encounter an


overwhelming volume of reading materials, including academic papers, technical reports,
contracts, and textbooks. These documents often contain crucial information but tend to be
lengthy and dense, making it challenging to extract and retain essential details. Traditional
methods of reading and analyzing such materials can be time-intensive and inefficient,
often leading to incomplete understanding or the unintentional neglect of critical
information.

To address these issues, our work presents an advanced Retrieval Augmented Generation
(RAG) application that transforms how users engage with textual content. RAG is an
advanced AI framework that combines retrieval-based and generative approaches to
provide accurate, contextually relevant responses. RAG systems enhance LLMs by
grounding their answers with information retrieved from external sources, such as
databases, document repositories, or knowledge bases. In a typical RAG system, when a
user poses a query, the system first searches an external knowledge base for information
that closely matches the query. This could involve pulling relevant passages from stored
documents, databases, or indexed sources. The retrieved information, called "retrieval
results" or "context," is passed along with the user’s query to the LLM. This context helps
the LLM generate answers that are more accurate and grounded in current information.
The LLM then generates a response based on both the retrieved information and the
original query, producing a well-informed, contextual answer.

The advanced RAG system allows users to upload documents in various formats, such as
.txt, .docx, .rtf and .pdf, creating a comprehensive and interactive knowledge base. By
harnessing the power of large language models (LLMs), the application provides users with
immediate, accurate responses to specific queries, bypassing the need to manually navigate
through extensive text. This application introduces a user-friendly chat interface that
facilitates direct interaction with uploaded documents, without typing any question, just by
talking to the system. Instead of dedicating hours to reading, users can ask precise questions
and receive tailored answers based on their needs. This functionality serves as a powerful
tool for students, enabling them to quickly locate key points from their study materials.
Similarly, professionals can efficiently retrieve vital information from detailed reports or
technical publications, significantly boosting productivity and decision-making.

Additionally, the inclusion of a text-to-speech feature enhances accessibility for users who
prefer auditory inputs or require alternative ways to process information. By integrating
both textual and auditory modes, the application caters to diverse learning preferences,
ensuring adaptability for a wide range of users. This approach highlights the potential of
artificial intelligence to revolutionize information retrieval and management. By offering
a solution that is both efficient and inclusive, it addresses the growing demand for tools
that simplify access to critical knowledge while meeting the diverse needs of modern users.

1.1 KEY TERMINOLOGIES


1.1.1 GENERATIVE AI
Generative AI is the branch of AI that deals with the generation of content such as text, images, video, and audio. Generative AI models are transformer-based models trained on huge corpora of data; they learn to predict the next word, sound, or pixel in a sequence and can therefore generate content when given a prompt.

1.1.2 PROMPT
A prompt is an instruction to a generative AI model, often written in plain English, that guides the model on what to output.

1.1.3 LARGE LANGUAGE MODELS (LLM)


LLMs are models that use machine learning to perform natural language processing tasks such as text generation. Before the introduction of LLMs, generating coherent text responses from a language model was very difficult. The LLM of choice here is Llama 3.1 70B, an advanced LLM from Meta with 70 billion parameters.

1.1.4 RETRIEVAL AUGMENTED GENERATION (RAG)


RAG is an AI framework used to retrieve information from an external knowledge base (the uploaded documents in our case) in order to ground the LLM so that it produces the most up-to-date and relevant response.

1.1.5 ADVANCED RAG


To provide the best and most accurate response to the user, a popular method called parent-document retrieval is followed. If the document has more characters than a set threshold, there are two levels of chunking. In the first level, the document is split into parent chunks of at least 2000 characters each. From each parent chunk, multiple child chunks of at least 400 characters are created. These child chunks are converted to vectors, stored in a vector store, and used for similarity search. If, during similarity search, a child chunk matches the user's query, the parent chunk containing that child is passed to the LLM. The reason for this is that a large document can cover multiple topics, and even a single parent chunk may contain more than one topic. Splitting each parent into multiple child chunks introduces granularity: the topics in each child chunk are narrow and concentrated, which makes the similarity search more efficient. When a child chunk receives a high similarity score, the parent chunk of that child is used as richer context for the LLM so that it can give a better output.
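
The child-to-parent lookup can be illustrated with a minimal, library-free sketch; the character-based splitting below is only for illustration, while the actual implementation described in Chapter 3 uses LangChain's text splitters:

def split(text, size):
    # Cut the text into consecutive chunks of roughly `size` characters.
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_chunk_index(document, parent_size=2000, child_size=400):
    # Build parent chunks, child chunks, and a map from each child to its parent.
    parents = split(document, parent_size)
    children, child_to_parent = [], {}
    for parent_id, parent in enumerate(parents):
        for child in split(parent, child_size):
            child_to_parent[len(children)] = parent_id
            children.append(child)
    return parents, children, child_to_parent

# After similarity search returns the index of the best-matching child chunk,
# the enclosing parent chunk is what gets passed to the LLM as context:
# context = parents[child_to_parent[best_child_index]]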

1.1.6 MULTIPLE QUESTION GENERATION


The user's question can sometimes be very general or vague, in which case the similarity search does not work well. To tackle this, an LLM is used to generate two more questions from the user's single question. With more, better-formulated questions, the similarity search produces good results even for general queries.

1.1.7 SUMMARISATION
Giving a large context to the LLM produces a detailed, long output, which is then summarized, and the summary is presented to the user. If they wish, users can still view the entire output.

Chapter 2

Literature Review

2.1 INTRODUCTION
Recent advancements in RAG focus on improving overall system accuracy by implementing methods such as reranking and by using automated testing systems.

2.2 LITERATURE REVIEW


Gupta et al. [1] tackle the urgent problem of handling the massive amount of textual
data produced in a variety of fields, including social media, news, and education.
Effective summarizing approaches are becoming more and more important as the
amount of information available online continues to expand tremendously. Text
summarizing is the act of distilling long texts into concise, logical summaries that
preserve the most important details, making it simpler to understand and retrieve
pertinent information. In contrast to abstractive summarization, which creates new
phrases that express the same meaning, the study focuses on extractive text
summarization, which chooses and gathers the most important lines from a source text
to produce a succinct summary. As a fundamental stage in every natural language
processing (NLP) work, the authors stress the significance of text preprocessing, which
entails cleaning and getting the text ready for analysis. Four essential processes make
up their methodical approach to extractive summarization: text preparation, feature
extraction using ELMo embedding, cosine similarity scoring of phrases, and finally,
choosing the best-ranked sentences to create the summary. By using bidirectional Long
Short-Term Memory (LSTM) networks, the contextual embedding method known as
ELMo, or Embeddings from Language Models, determines a word's meaning depending
on its context inside a phrase. This makes it possible to comprehend word semantics and
syntax more deeply, which is essential for producing summaries of superior quality. The
suggested ELMo-based summarizing model is compared in the research to more
conventional techniques like the TextRank algorithm, which uses graph-based sentence
ranking. Higher F-measure and recall scores show that the ELMo embedding model
works noticeably better than the TextRank method in terms of accuracy and relevance,
according to the authors' empirical study. According to the findings, the ELMo-based
method is superior at extracting the most important details from the text, producing
summaries that more accurately represent the original material. Future work includes
increasing the model’s accuracy and summarizing multiple documents at once to
generate generalized summaries.

Swami et al. [2] propose a thorough method for automating the resume evaluation
process, which is becoming more and more important. The difficulties experienced by
recruiters, who frequently go through 50 to 300 resumes for every job opening, are
highlighted by the writers Pratibha Swami and Vibha Pratap. This makes the process
labor-intensive and time-consuming. In order to solve this problem, the authors suggest
an end-to-end approach that streamlines the hiring process by summarizing resumes'
content in addition to classifying them into pertinent categories. The method starts with
a thorough text cleaning procedure that involves changing the text to a better encoding
scheme, getting rid of stop words and punctuation, and changing all of the text to lower
case to cut down on repetition. Because it gets the data ready for efficient analysis and
categorization, this preprocessing stage is essential. After cleaning, the authors use a
variety of machine learning models to categorize the resumes, such as Support Vector
Machines (SVM), multinomial Naïve Bayes, and Logistic Regression. Originally
selected for its ease of use and efficiency, the Naïve Bayes model's accuracy of about
40% leads the authors to investigate alternative models. The Support Vector Classifier
performs better than the others, attaining an accuracy of 70% to 75%, according to the
paper's comparative analysis of several classification models. The authors credit this
achievement to the SVM's capacity to identify dataset non-linearities, which is crucial
for efficient categorization. The significance of word frequency distribution in resumes
is also covered in the study. Each sentence's score is dependent on how frequently its
constituent terms occur. By using this scoring system, the model is able to identify the
most important lines and provide a succinct synopsis of the resume that omits
unnecessary filler language and highlights the most important information. The authors
stress the value of structured data analysis and visualization in addition to the
categorization and summarizing procedures, as these methods offer deeper insights into
the resumes. They mention well-known commercial resume extraction tools like Sovren
Resume/CV Parser and ALEX Resume Parser, but they point out that their method
provides a more customized solution by integrating classification and summarization
into a single program. The authors highlight the possible influence of their model on the
hiring process and argue that automating the review of resumes might greatly lessen
recruiters' workload and increase hiring efficiency. The authors hope to improve
recruiters' decision-making process by offering a tool that efficiently classifies and
condenses resumes, enabling them to concentrate on the most qualified applicants faster.

Ji et al. [3] introduce the RAG-RLRC-LaySum framework, which uses sophisticated


Natural Language Processing (NLP) techniques to produce understandable summaries
of intricate scientific research for lay audiences. To improve the factual correctness and
readability of the generated summaries, the framework combines Reinforcement
Learning for Readability Control (RLRC) with Retrieval-Augmented Generation
(RAG). The incoming biomedical articles are first processed by the Longformer
Encoder-Decoder (LED) model, which produces an initial summary. A knowledge
retrieval system that draws on outside resources, like Wikipedia, to give more context
and factual information is then used to enhance this synopsis. Only the most relevant
material is included in the final summary after a neural re-ranker assesses the importance
of the recovered passages. In order to create writing that is readable by non-specialists,
the framework places a strong emphasis on readability by utilizing measures such as the
Flesch-Kincaid Grade Level and Dale-Chall Readability Score. By establishing a reward
mechanism that incentivizes the creation of summaries that satisfy predetermined
readability standards, the RLRC technique further maximizes readability. Evaluations
on publicly accessible datasets like PLOS and eLife show how successful the RAG-
RLRC-LaySum architecture is, outperforming more conventional language models like
Plain Gemini. The findings show improvements in ROUGE-2 relevance scores,
improvements in factual correctness, and a notable rise in reading scores. The study
addresses the difficulties associated with automatic summarization in the scientific field
and emphasizes the significance of simplifying intricate biomedical texts while
preserving their integrity. The authors also go over relevant research in the area, pointing
out that although numerous methods have been put out to increase the comprehensibility
of summaries, many fail to strike a balance between factuality and simplification. By
dynamically incorporating reliable external knowledge sources and using a reward-
based strategy to optimize the model for improved readability, the RAG-RLRC-LaySum
framework seeks to close this gap. In conclusion, the framework's potential to
democratize scientific knowledge by increasing public access to it and fostering greater
involvement with biological discoveries is highlighted. To further increase the accuracy
and applicability of the summaries, future research will concentrate on broadening the
framework's knowledge sources and improving the integration of domain-specific
knowledge. In order to ensure that the generated summaries successfully satisfy the
needs of non-expert audiences, the authors also acknowledge the necessity of human
evaluations to supplement automatic metrics. All things considered, the RAG-RLRC-
LaySum framework is a major breakthrough in the field of biomedical text summarization,
fusing generative and retrieval methods to create excellent, easily understood summaries
that help the general public comprehend intricate scientific research.

Megahed et al. [4] introduce ChatISA, a cutting-edge chatbot created to improve


information systems and analytics students' learning experiences. Coding Companion,
Project Coach, Exam Ally, and Interview Mentor are the four primary modules that
make up ChatISA. Each module is designed to help students with a variety of academic
and professional activities. While the Project Coach assists with project management,
the Coding Companion helps with programming queries. The Interview Mentor mimics
interview situations using resumes and job descriptions, while the Exam Ally creates
practice questions to help students get ready for tests. The Task-Technology Fit Theory,
which holds that technology works best when its features match the tasks users must
complete, served as the foundation for ChatISA's design. The GPT-4o model is one of
the sophisticated large language models (LLMs) that the chatbot uses to generate more
human-friendly text. The authors emphasize the significance of ethical considerations
while integrating AI tools in education. It draws attention to the dangers of abuse, such
as when students depend on the chatbot to complete tasks dishonestly, undermining academic integrity. To reduce these risks, the authors recommend that students treat ChatISA as a supplementary resource rather than a primary source for academic assignments and offer explicit instructions for its ethical use. In order to promote
accountability and transparency, the chatbot enables students to download a PDF of their
exchanges, which they can then reference in assignments. The paper also mentions how students are using generative AI more and more, citing research showing that a sizable portion of students routinely use generative AI tools and think they improve their
learning. This emphasizes the necessity of continuing discussions on the effects of AI
in education between teachers and students. The study describes several strategies used
in ChatISA, such as interactive learning to successfully engage students and prompt
engineering to direct the AI's responses. User feedback, task performance gains,
relevance of produced questions, correctness of AI responses, and engagement levels
are some of the performance measures used to assess ChatISA's efficacy. These metrics
aid in determining areas for improvement and evaluating how effectively the chatbot
satisfies students' demands. The budget for ChatISA's upkeep, which is between $200–
300 USD each month, is also covered in the paper. This budget enables the adoption of
cutting-edge AI technologies while keeping expenses under control. The authors also
emphasize how important it is for educational institutions to modify their curricula in
order to integrate ChatISA and other AI tools. It is crucial to prepare students for jobs
that depend on AI as it becomes more and more common in the workplace. To ensure
that resources like ChatISA improve rather than diminish the educational process,
faculty members are urged to reconsider their teaching strategies and homework
assignments in order to promote critical thinking and creative work. In conclusion,
ChatISA is a major step forward in the integration of AI into education, offering students
useful tools to improve their education and get ready for the workforce. In order to make
sure AI technologies properly enhance student learning and engagement, the study also
emphasizes the significance of ethical issues, adaptability in teaching, and ongoing
evaluation.

In the context of professional knowledge-based question answering (QA), Lin [5]


discusses how important efficient PDF parsing is to improving the functionality of RAG systems. It starts by describing the shortcomings of existing PDF parsing techniques, such as PyPDF, which frequently fall short in correctly identifying document structure, resulting in fragmented and confusing content representations. This is especially challenging for the complicated documents that frequently appear in professional contexts and contain tables, multi-column layouts, and other complex formatting. The authors stress that it is difficult for systems to extract useful information from standard untagged documents such as PDFs, because they are not machine-readable in the way that tagged formats such as HTML or Word are. The study compares two RAG systems: a baseline system that uses PyPDF for
parsing and ChatDOC, which makes use of a sophisticated PDF parser intended to
maintain document structure. The authors assess the effect of parsing quality on the
precision of responses produced by the RAG systems through methodical tests using a
dataset of 188 texts from a variety of topics. They point out that ChatDOC performs
better than the baseline in about 47% of the assessed questions, ties in 38%, and fails in
just 15% of them. This performance is explained by ChatDOC's capacity to obtain
longer and more accurate text passages, which is essential for responding to extractive
queries that call for exact data. The findings' implications for the future of RAG systems
are also covered in the research. It is suggested that better PDF structure identification
can greatly increase retrieval, which would improve the integration of domain
information into large language models (LLMs). The authors contend that as LLMs are
largely trained on publicly accessible material, their efficacy in vertical applications
depends on integrating specialized information from professional publications.
According to the study's findings, improvements in PDF parsing technology, like those
found in ChatDOC, have the potential to completely transform RAG systems and
increase their usefulness in work settings where precise information retrieval is crucial.
In conclusion, the study highlights the significance of excellent PDF parsing in the RAG
framework and shows that the quality of responses produced by QA systems can be
significantly enhanced by the capacity to correctly parse and partition complicated
documents. The results support a move toward increasingly complex parsing techniques
that can manage the complexities of business documents, improving the general efficacy
of knowledge-based applications across a range of industries.

The work called "Enhancing Biomedical Question Answering with Parameter-Efficient
Fine-Tuning and Hierarchical Retrieval Augmented Generation" details the CUHK-AIH
team's work on task 12b in the three phases of A, A+, and B of the 12th BioASQ
Challenge. To enhance the performance of biomedical question answering (QA), Gao
et al. [6] provide a system called Corpus PEFT Searching (CPS), which combines a
hierarchical retrieval-based methodology with Parameter-Efficient Fine-Tuning
(PEFT). In Phase A, the system generates a list of documents based on the input question
by using the BM25 retriever to look for pertinent documents from the PubMed Central
(PMC) corpus. For the phases that follow, this stage acts as a foundation. In Phase B, the authors fine-tune a Large Language Model (LLM) on the BioASQ training set, utilizing the PEFT approach to improve the model's capacity to produce precise responses. The process of fine-tuning is essential because it enables the model to adjust
to the unique features of biological queries and responses. In order to develop a more
complete biomedical QA system, Phase A+ presents a novel hierarchical Retrieval-
Augmented Generation (RAG) technique that integrates the BM25 retriever with the
optimized LLM. With just the biological question provided—no pre-identified
snippets—this phase aims to test the system's performance directly. The authors point
out that by using an ensemble retriever that blends sparse and dense retrieval
approaches, the RAG method improves the search results from Phase A and makes sure
that the most pertinent data is used to produce replies. Using thoughtfully constructed
prompts, the system generates high-quality responses by processing the query along
with the most pertinent document chunks. The experimental setup is also covered in the
publication, along with the training and evaluation datasets, which are question-answer
pairs from the BioASQ tasks. To examine the effects of various model types and
retrieval strategies on performance, the authors carried out thorough ablation research.
The findings show that the hierarchical RAG method and PEFT both considerably
improve the model's performance in biomedical QA tasks. Comparing Phase A+ to the
challenge's top competitors, the authors' performance measures demonstrate that
although Phase A+ produced competitive outcomes, there was still opportunity for
improvement. In conclusion, the study shows how biomedical question answering systems can be improved by fusing sophisticated retrieval techniques with effective fine-tuning techniques. The results highlight how crucial it is to use big language models
and cutting-edge retrieval strategies to handle the challenges of retrieving biomedical
data and responding to inquiries. According to the authors, their method may be used as
a basis for creating increasingly complex biomedical QA systems, which could help a
number of applications in bioinformatics, medical informatics, and related domains. All
things considered, the report opens the door for further developments in this crucial field
of study by offering insightful information on the continuous attempts to increase the
precision and applicability of responses derived from biomedical literature.

In order to solve the issues of fragmented domain knowledge that spans across multiple
experts and sources, Kuo et al. [7] propose a customized Question Answering (QA)
system designed for the Taiwanese film industry. The study emphasizes the value of
bringing disparate expertise together, especially in domains like film where knowledge
is frequently fragmented and incomplete. The study investigates the possibility of
incorporating large language models (LLMs) to produce a more efficient QA system by
utilizing developments in Natural Language Processing (NLP) and open-source
platforms like LangChain. The study addresses the drawbacks of general-purpose AI chatbots such as ChatGPT and Gemini, which, despite their extensive knowledge, frequently have trouble answering specific questions and may provide false information, a condition known as AI hallucination. In contrast to current chatbots, the QA system used in this study is intended to extract information from local documents, offering industry professionals deeper insights. The authors note that utilizing local records from a single institution, while advantageous for targeted insights, generates bias because of the uniformity of the source material. This bias may cause the QA system to ignore more general trends while distorting its responses to particular problems, such as production costs. The study also discusses how the COVID-19 pandemic has affected the film business, which in turn has affected the questions asked and the way replies are scored, resulting in an overemphasis on pandemic-related content. A set of questions divided into three categories according to comprehension needs and complexity is part of the research approach. Two methods are used to evaluate the
performance of the QA system: a scoring evaluator from the LangChain framework and
human assessment. The findings show that the QA system performs better than other
chatbots in terms of accuracy and dependability, especially when paired with GPT-4.
The system's limits in interpreting graphical data and the difficulties associated with
input translation, particularly with regard to proprietary words and Taiwanese movie
names, are acknowledged by the developers. Future research aiming at improving the
efficacy of the QA system is also included in the report. In order to reduce bias, this
entails investigating a wider variety of document resources and improving
preprocessing methods for local documents to enhance the model's capacity to extract
significant information. In order to create a fair and sophisticated QA system, the
authors support a more thorough depiction of a variety of subjects. Overall, by offering
a framework that not only enhances access to specialized information but also tackles
the shortcomings of current AI solutions, the research represents a significant
improvement in the field of QA systems, especially for the Taiwanese film industry.
The results highlight the need for creative approaches to the management and
application of specialized knowledge across disciplines, which will ultimately lead to a
more knowledgeable and effective industry.

In order to improve the performance of large language models (LLMs) in reasoning


tasks, Zhu et al. [8] examine a unique prompting technique known as Question Analysis
Prompting (QAP). Even with recent advances, LLMs continue to fall short of human reasoning abilities, especially when it comes to arithmetic and commonsense reasoning problems. The study explores whether urging the model to analyze the question itself
can give better results than traditional methods, which frequently entail asking the model
to generate detailed calculations. Before attempting to solve the question, QAP asks the
model to describe it in a certain number of words (n). The length and level of depth of
the response are determined by the value of n. Using a variety of datasets, such as
GSM8K, AQuA, SAT, and StrategyQA, the study assesses QAP on two well-known
LLMs, GPT-3.5 Turbo and GPT-4 Turbo. Particularly on the AQuA and SAT datasets,
the results show that QAP works better than other cutting-edge prompting techniques
like chain-of-thought (CoT), Plan and Solve Prompting (PS+), and Take A Deep Breath
(TADB). In 75% of the tests, QAP is regularly ranked in the top two prompts. One of
the study's main conclusions is that, particularly for more difficult questions, the model's
performance is greatly impacted by the length of the response it produces. While
excessively long explanations can impair performance on easier issues, detailed answers
are typically advantageous for more difficult ones. In order to guarantee that the model
thoroughly engages with the question before attempting to solve it, the study
additionally emphasizes the significance of the prompt's placement and the need to
indicate the minimal word count. The study provides a number of examples that show
how QAP prompts the model to offer more thorough justifications, which enhances
reasoning and response accuracy. The ideal word count varies according to the
complexity of the topic, though, since it also points out that shorter prompts (like
QAP25) can lead to partial responses, especially in mathematical activities. The authors
contend that QAP optimizes its comprehension and reduces the possibility of
overlooking important details by forcing the model to interpret the query directly. This
strategy marks a change from only concentrating on the solution to improving the
model's understanding of the current problem. The study comes to the conclusion that
QAP is a promising zero-shot prompting technique that can greatly enhance LLM
performance on reasoning tasks. This opens the door for more investigation into how
best to construct prompts and how question analysis and model reasoning skills interact.
All things considered, the results point to QAP as a potentially useful tool for scholars
and professionals wishing to use LLMs for more efficient applications of reasoning and
problem-solving.

Lewis et al. [9] introduce Retrieval-Augmented Generation (RAG), a unique method


that combines a parametric memory model with a non-parametric memory component to
improve the performance of pre-trained language models. In order to handle knowledge-
intensive natural language processing (NLP) jobs, where access to and utilization of
external knowledge are essential, RAG was created. A retriever and a generator are the two
primary parts of the architecture. A bi-encoder architecture is used by the retriever, which
is based on the Dense Passage Retriever (DPR), to extract pertinent documents from a vast
corpus, like Wikipedia, in response to a query. The model can access a large amount of
data thanks to its effective retrieval method, which eliminates the need for intensive
retraining. The generator, usually a sequence-to-sequence model such as BART, uses the
retrieved texts as more context. The generator generates logical and contextually relevant
outputs by utilizing both the input query and the documents that were retrieved. The authors
show that RAG outperforms conventional models that just use extractive techniques,
achieving state-of-the-art performance on a variety of open-domain question answering
tasks. Even in cases where the precise response is absent from the collected documents,
RAG can create solutions by utilizing the generative capabilities of models such as BART.
This adaptability enables RAG to produce more accurate and instructive outputs by
generating responses that are based on the context supplied by the recovered documents.
The benefits of integrating parametric and non-parametric memory are also highlighted in
the research, which demonstrates how the non-parametric component can direct the
generation process by offering particular knowledge that enhances the model's learnt
parameters. The authors also investigate RAG's potential in fact verification tasks, such as the
FEVER dataset, where it outperforms state-of-the-art algorithms that necessitate intricate
retrieval supervision. This suggests that knowledge-intensive activities can be successfully
completed by RAG without requiring complex engineering or domain-specific systems.
The significance of retrieval in improving the performance of generative models is
emphasized in the research, which also suggests that integrating retrieval techniques can
greatly increase the model's capacity to manage dynamic and changing knowledge. The
experimental findings in the research indicate how well RAG performs on a variety of
benchmarks and how it can combine the advantages of generative and retrieval-based
methods. The authors also offer insights into the training process, pointing out that RAG
may be improved on a variety of sequence-to-sequence tasks, enabling the generator and
retriever components to learn together. By optimizing both elements to function in unison,
this end-to-end training method improves the model's performance even more. The study
concludes by presenting RAG as a potent framework that successfully bridges the gap
between retrieval and generation for knowledge-intensive NLP jobs. RAG raises the bar
for performance in open-domain question answering and fact verification by utilizing the
advantages of pre-trained models such as BERT and BART and integrating a non-
parametric memory through a dense vector index. The results imply that this paradigm can
be expanded upon in future studies to investigate even more complex retrieval and
generation integrations, which could result in additional developments in the field of
natural language processing.

Hewitt et al. [10] examine how well language models perform, especially when it comes
to tasks like multi-document question answering (QA) and key-value retrieval that call for
the identification of pertinent information from lengthy input contexts. The authors draw
attention to a major problem with existing language models: depending on where pertinent
information is located in the input context, their performance might significantly
deteriorate. A U-shaped performance curve is revealed by the study through a series of
controlled trials, with models performing best when pertinent information is found at the
beginning (primacy bias) or end (recency bias) of the context. On the other hand, even for
models built to handle lengthy contexts, performance drastically deteriorates when
essential information is placed in the middle of the inputs. This result implies that
information that is not situated at the extremes of the input context is difficult for language
models to reliably access and use. The study uses a number of language models, such as
GPT-3.5-Turbo and Llama-2, and assesses how well they function in various scenarios,
such as changing the quantity of documents in the input context and their sequence. Even
base models without instruction fine-tuning exhibit this U-shaped performance pattern,
according to the data, indicating that existing architectures' inability to handle lengthy
contexts efficiently is a fundamental problem. The study also examines the effects of
human feedback-based reinforcement learning and instruction fine-tuning, discovering that
although these methods enhance overall performance, they do not completely eradicate the
performance decrease that is shown when pertinent information is inserted in the middle
of the context. The authors examine the implications of the empirical results for the creation
of upcoming long-context language models in addition to the findings themselves.
According to the authors, it is necessary to show that a model's performance is only slightly
impacted by the location of pertinent information in order to assert that it may use lengthy
input contexts. They provide fresh evaluation procedures that might be useful in
determining how well language models handle lengthy contexts. A useful case study on
open-domain question answering is also included in the article. It shows that model
performance saturates far before the recall of retrieved documents, suggesting that existing
models are ineffective at utilizing more retrieved data. All things considered, the study
highlights the need for more research to improve language models' capabilities and offers
insightful information about their limitations while processing lengthy input contexts. The
results highlight how crucial context placement is to model performance and imply that in
order to overcome these obstacles, model architecture and training techniques must be
improved. By making their code and assessment data publicly available, the authors hope
to encourage more research and comprehension of how language models can be enhanced
to make better use of lengthy contexts, which will ultimately lead to improvements in
applications for natural language processing. The study is a crucial step in creating more
resilient language models that can efficiently traverse and extract pertinent data from large
input contexts. This is becoming more and more crucial in real-world applications like
document summarization, search engines, and conversational interfaces.

Clarke et al. [11] provide a thorough assessment of Reciprocal Rank Fusion (RRF), a novel
technique for merging document rankings from various information retrieval (IR) systems.
In contrast to individual ranking systems and well-known techniques like Condorcet Fuse
and CombMNZ, the authors contend that RRF, a straightforward and unsupervised
methodology, consistently produces better results. The study's foundation is a collection of
tests carried out with a number of TREC (Text REtrieval Conference) datasets, which are
well-known industry standards for information retrieval. The authors detail the
methodology employed in their experiments, including the selection of datasets, the
ranking methods compared, and the performance metrics used to evaluate results. The
primary metric for assessing performance was Mean Average Precision (MAP), which
measures the accuracy of the ranked results. The authors conducted pilot experiments to
determine optimal parameters for RRF and to evaluate its performance against competing
methods. The results indicated that RRF outperformed Condorcet and CombMNZ in most
cases, with statistical tests confirming the significance of these findings. The study
emphasizes how RRF's efficacy stems from its capacity to leverage the variety of individual
ranks, which might raise the rating of documents that other approaches might miss. The
authors also investigated RRF's potential for developing a meta-learner that raises the lower
bound of what can be learned from the LETOR 3 dataset by ranking it higher than any
previously published approach. Because RRF combines rankings without taking into
account the arbitrary scores that certain ranking techniques yield, the results imply that it
is not only easier to use but also more efficient than Condorcet Fuse. The authors draw the
conclusion that RRF offers a strong substitute for enhancing document retrieval results and
constitutes a substantial breakthrough in the field of information retrieval. The significance
of RRF in utilizing the advantages of numerous ranking systems while avoiding the
complications of supervised learning techniques is emphasized in the study. All things
considered, the study offers insightful information about RRF's efficacy and its uses in a
range of IR tasks, supporting the idea that more straightforward approaches frequently
result in superior performance in real-world situations. The authors argue that RRF should
be investigated further and incorporated into current IR frameworks since it may improve
search engine and other retrieval system performance. The paper makes a compelling
argument for RRF's adoption in the information retrieval sector by showcasing its benefits
through thorough experimentation and statistical analysis, opening the door for more study
and advancement in this subject.
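
The fusion rule itself is compact; the sketch below shows the standard RRF scoring formula with the commonly used smoothing constant k = 60, applied to two made-up rankings:

def reciprocal_rank_fusion(rankings, k=60):
    # Each document's fused score is the sum of 1 / (k + rank) over every
    # ranking in which it appears; k damps the effect of any single top rank.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: two retrieval systems rank three documents differently.
fused_order = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d3", "d1", "d2"]])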

James et al. [12] propose RAGAS (Retrieval Augmented Generation Assessment), a methodology for the reference-free evaluation of Retrieval Augmented Generation (RAG) systems, which combine language model generation and information retrieval. Conventional evaluation techniques frequently rely on datasets annotated by humans, which can take a lot of time and resources. To overcome this constraint, RAGAS concentrates on three quality factors that are essential for evaluating the effectiveness of RAG systems: faithfulness, answer relevance, and
context relevance. The gpt-3.5-turbo-16k model from OpenAI, which enables the
extraction and analysis of statements from generated answers, is used by the authors to
build a fully automated evaluation method. The methodology breaks down larger
statements into more concise, targeted assertions to calculate faithfulness, making sure that
the responses' claims may be deduced from the context they give. While context relevance
evaluates the significance of the setting in connection to the question, response relevance
evaluates how well the generated answer answers the query. To help the language model
retrieve pertinent sentences from the context and produce assertions that accurately reflect
the faithfulness of the response, the authors employ particular prompts. The difficulties of
creating effective prompts are also covered in the paper because the way questions and
circumstances are phrased might have an impact on the evaluation's quality. The authors
point to earlier research in the area, pointing out that although some studies have looked
into using prompts for evaluation, they frequently did not show any discernible advantages.
By offering an organized strategy for assessing RAG systems without the use of human
references, RAGAS seeks to advance these techniques. The WikiEval dataset, which calls
for models to compare pairs of responses or context fragments, was used in the authors'
studies. They show that RAGAS can successfully capture the caliber of generated replies
by reporting that the correctness of their suggested metrics agrees with human annotators.
The findings show that the automated measures and human evaluations coincide quite well,
especially when it comes to faithfulness and context relevance. Concluding, RAGAS offers
a self-contained, reference-free methodology that can expedite the assessment process,
marking a substantial development in the evaluation of RAG systems. RAGAS offers a
strong framework for assessing the efficacy of retrieval-augmented generation techniques
by concentrating on important quality factors and utilizing the power of big language
models. In addition to advancing our knowledge of RAG systems, this work lays the
groundwork for further studies in automated evaluation techniques, which will eventually
improve the creation and use of more accurate and dependable language models. Given
that RAG systems are becoming more and more popular in a variety of applications, such
as conversational agents, summarization, and question answering, the authors stress the
significance of their findings for the larger area of natural language processing.

Chapter 3

Methodology

3.1 DATA
The data is provided by the user, who supplies the path to a file in .pdf, .docx, .txt, or .rtf format. There is no size limit in terms of pages or file size. The data can contain images and tables, but images are not processed, so a text-only file gives the best output.

3.1.1 DATA PREPROCESSING


An advanced RAG technique called parent-document retrieval is used at the core of the system. The entire text file is chunked on the basis of paragraphs, new lines, and punctuation marks, using LangChain's RecursiveCharacterTextSplitter. The parent chunks created are approximately 2000 characters long, and each parent chunk has multiple children of around 400 characters each. This is done because a big document covers a variety of topics, and even a chunk of approximately 2000 characters (a few paragraphs of text) can still span more than one topic. A child chunk of around 400 characters, however, is usually concentrated on a single topic. This produces more refined chunks that are easier and faster to use for similarity search, since smaller chunks can be indexed more efficiently and searched faster.
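
A minimal sketch of the two splitters, assuming LangChain's RecursiveCharacterTextSplitter (the import path varies with the LangChain version, and the separator list and zero overlap are illustrative choices):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Parent chunks of about 2000 characters; child chunks of about 400 characters.
# The splitter tries paragraph breaks first, then new lines, then sentence ends.
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000, chunk_overlap=0, separators=["\n\n", "\n", ". ", " "]
)
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400, chunk_overlap=0, separators=["\n\n", "\n", ". ", " "]
)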

3.1.2 VECTORISATION
The child chunks are vectorised. Vectorisation converts each chunk of text into an embedding, i.e., a dense numeric vector that captures its meaning. There are a number of ways to do this, such as word2vec, but here an advanced transformer-based embedding model, Google's embedding-001, is used to create the vectors for the child chunks.
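
Assuming the langchain-google-genai integration is installed and a GOOGLE_API_KEY is set, the embedding model can be wired up roughly as follows (a sketch, not the exact code from the appendix):

from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Google's embedding-001 model converts each child chunk into a dense vector.
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Example: embed one chunk and inspect the dimensionality of the vector.
vector = embeddings.embed_query("A sample child chunk of around 400 characters.")
print(len(vector))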

3.1.3 VECTORSTORE

Once the embeddings are generated, they need to be stored in a vector store so that similarity search can be performed between the embedding of the user's query and the stored embedding vectors of the child chunks from the uploaded document. Facebook AI Similarity Search (FAISS) was preferred because it builds indexes for the vectors efficiently, works well for large files, and retrieves results quickly. Unlike other popular options such as ChromaDB, which uses SQLite to store the vectors, and Pinecone, which is a dedicated cloud vector database, FAISS keeps the index in memory, which makes it both fast and economical.
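
One way to combine the splitters, the embeddings, and FAISS is LangChain's ParentDocumentRetriever; the sketch below assumes that class and an in-memory store for the parent chunks, reuses parent_splitter and child_splitter from the Section 3.1.1 sketch, and uses documents as a stand-in for the user's file loaded as LangChain Document objects:

import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# An empty FAISS index sized to the embedding dimensionality of embedding-001.
dim = len(embeddings.embed_query("probe"))
vectorstore = FAISS(
    embedding_function=embeddings,
    index=faiss.IndexFlatL2(dim),
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

# Child chunks are embedded into the FAISS index; full parent chunks are kept
# in a plain in-memory store and returned as LLM context when a child matches.
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    parent_splitter=parent_splitter,   # from the Section 3.1.1 sketch
    child_splitter=child_splitter,
)
retriever.add_documents(documents)     # `documents`: the user's file as Document objects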

3.2 SIMILARITY SEARCH


FAISS creates clusters of similar vectors based on a quantization process. Each cluster is represented by a centroid, and the index stores information about these centroids and the vectors assigned to each cluster. When a similarity search is performed, the most relevant cluster is identified first (a coarse-grained search); then, within the identified cluster, a fine-grained search using cosine similarity finds the nearest neighbors, i.e., the best-matching vectors. The number of best matches returned can be chosen. Once the matching child chunks are retrieved, the respective parent chunk for each matched child chunk is fetched and passed to the Large Language Model (LLM).
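
The coarse-then-fine search described here corresponds to FAISS's IVF family of indexes. The standalone sketch below uses made-up sizes and random vectors, with inner product on normalized vectors standing in for cosine similarity:

import numpy as np
import faiss

dim, n_vectors, n_clusters = 768, 10000, 100           # illustrative sizes
vectors = np.random.rand(n_vectors, dim).astype("float32")
faiss.normalize_L2(vectors)                             # inner product == cosine similarity

quantizer = faiss.IndexFlatIP(dim)                      # holds the cluster centroids
index = faiss.IndexIVFFlat(quantizer, dim, n_clusters, faiss.METRIC_INNER_PRODUCT)
index.train(vectors)                                    # k-means style quantization step
index.add(vectors)

index.nprobe = 5                                        # coarse step: probe the 5 nearest clusters
query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 4)                    # fine step: best 4 matches within them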

3.3 USER INPUT


The user provides a voice input, which is then converted into text using OpenAI's Whisper Base model. Whisper is a machine learning model for speech recognition and transcription by OpenAI. It is a transformer-based encoder-decoder model that can automatically detect the language of the input audio and translate non-English speech into English, and it is robust enough to handle accents, background noise, and technical language effectively. It is available in different sizes, such as Tiny, Base, Medium, and Large. Tiny is the fastest, but its speech-to-text conversion can be inconsistent. Base works well for most cases and is fast enough to run on a CPU-only machine. Medium copes better with accents but is slower than Base. Large is the most accurate model but also the slowest. The speech_recognition library in Python is used to capture the speech input from a microphone. By default, the device's microphone is used, so if an external microphone is attached, the voice input device may need to be changed in the system settings if it is not switched automatically.
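
A sketch of the capture-and-transcribe step, assuming the SpeechRecognition and openai-whisper packages; the temporary file name and the silence-based stop are illustrative (the actual application ends recording on Control+C):

import speech_recognition as sr
import whisper

recognizer = sr.Recognizer()
with sr.Microphone() as source:          # the system's default input device
    print("recording started")
    audio = recognizer.listen(source)    # here recording stops on silence
    print("Done recording")

# Write the captured audio to a temporary WAV file and transcribe it.
with open("query.wav", "wb") as f:
    f.write(audio.get_wav_data())

model = whisper.load_model("base")       # Tiny, Base, Medium, and Large also exist
question = model.transcribe("query.wav")["text"]
print(question)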

3.4 USING ETHICS CHECKER


A custom function checks whether the question or query asked by the user is ethical and harmless, using the Llama 3.1 70B model as the reasoning engine. The model outputs only one of two labels, ethical and harmless (classified as 1) or unethical and harmful (classified as 2), along with a reason for the classification. Other models such as Llama 3.1 8B and Llama 2 7B were tried, but their misclassification rate was high, so after thorough testing the Llama 3.1 70B model was finalized.
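
A sketch of such a checker; the ChatGroq wrapper and model name below are stand-ins for whichever hosted Llama 3.1 70B endpoint the project actually uses, and the prompt wording is illustrative:

from langchain_groq import ChatGroq   # assumption: any hosted Llama 3.1 70B endpoint works

llm = ChatGroq(model="llama-3.1-70b-versatile", temperature=0)

def check_ethics(question):
    # Ask the model to label the query as 1 (ethical, harmless) or 2 (unethical, harmful).
    prompt = (
        "Classify the following question strictly as '1' if it is ethical and harmless "
        "or '2' if it is unethical or harmful, followed by a one-line reason.\n"
        "Question: " + question
    )
    return llm.invoke(prompt).content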

3.5 USING LARGE LANGUAGE MODELS


The Large Language Model (LLM) is a crucial part of the entire system. It is responsible for producing the final output from the user's query, the retrieved chunks, and a predefined prompt. The LLM of choice is the Llama 3.1 70B model, developed by Meta, with 70 billion parameters and trained on a huge amount of data from the internet. It has enhanced reasoning capabilities compared with its predecessors Llama 3 and Llama 2. It is based on the transformer architecture and is a decoder-only model. Since it is a decoder-only model, a separate encoder is used to embed the text; this is where Google's embedding-001 model comes into the picture.

3.5.1 GETTING BETTER CONTEXT


The user might ask a vague or very simple question. In that case, the similarity search
will not yield good results and may return generic or noisy chunks. To tackle this
problem, the Llama 3.1 70B model is used to generate two more questions related to the
original question asked by the user. Instead of a single vague query, three queries are
searched; the two additional questions are often phrased more informatively than the
original, so the similarity search retrieves better-matching vectors. Initially, the
Llama 3.1 8B model was used for this step to make question generation faster, since
Llama 3.1 70B is slower to process and using a smaller model for sub-tasks speeds up the
system. However, the smaller model does not match the reasoning capabilities of Llama 3.1
70B, so Llama 3.1 70B was used in the end. Additional questions are only generated for
short queries: only when the user's question contains fewer words than a threshold value
does the Llama 3.1 70B model generate two more questions.
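
A minimal sketch of this step is given below; the word threshold, the prompt wording and
the model identifier are assumptions made for illustration.

from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate

WORD_THRESHOLD = 8   # hypothetical value; the report does not state the exact threshold
llm = ChatGroq(model="llama-3.1-70b-versatile")

gen_prompt = ChatPromptTemplate.from_template(
    "Generate two alternative questions that ask for the same information as: {question}\n"
    "Return one question per line."
)

def generate_questions(question):
    if len(question.split()) >= WORD_THRESHOLD:
        return [question]                        # long queries are searched as-is
    extra = (gen_prompt | llm).invoke({"question": question}).content
    return [question] + [q.strip() for q in extra.splitlines() if q.strip()]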

3.6 ANSWER GENERATION


Once the user asks a question, it is converted to text using OpenAI's Whisper Base model.
If the question is shorter than the threshold value, two similar questions are generated
with the Llama 3.1 70B model, a similarity search is performed for all three questions,
and the relevant child chunks are retrieved. The parent chunks of the retrieved child
chunks, the questions and the prompt are passed to the Llama 3.1 70B model, which
produces a detailed answer for each question; the answers are then summarized into the
final output that the user receives. If the question is not shorter than the threshold
value, a similarity search between the user's query embedding and the stored embeddings
is performed, the relevant chunks are retrieved, and the parent chunks of the retrieved
child chunks are passed to the Llama 3.1 70B model along with the question and the prompt
for the final answer generation.
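
The listing below sketches the core of this generation step for a single question,
assuming the ParentDocumentRetriever built earlier; the prompt wording is an assumption,
and the multi-question and summarisation branch is omitted for brevity.

from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate

llm = ChatGroq(model="llama-3.1-70b-versatile")

answer_prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def get_output(big_chunks_retriever, question):
    parent_docs = big_chunks_retriever.invoke(question)        # child-chunk match -> parent chunk
    context = "\n\n".join(doc.page_content for doc in parent_docs)
    return (answer_prompt | llm).invoke({"context": context, "question": question}).content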

3.7 CONVERTING THE FINAL RESPONSE TO SPEECH


The response from the model is converted to speech (text-to-speech). The Dispatch class
from the win32com package is used to convert the generated text into speech. win32com is
a Python package through which various Windows applications and components can be
controlled from a Python script; it provides a bridge between Python and the Component
Object Model (COM) architecture used by many Windows applications. Using win32com,
Windows components can therefore be automated, and here the Windows speech component is
invoked to speak out the text generated by the Large Language Model.
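
A minimal text-to-speech sketch using win32com is shown below; it assumes the
SAPI.SpVoice COM component provides the voice, which is one common way Dispatch is used
for speech on Windows.

from win32com.client import Dispatch

def speak(text):
    voice = Dispatch("SAPI.SpVoice")   # COM object exposing the Windows text-to-speech engine
    voice.Speak(text)                  # blocks until the sentence has been spoken

speak("The uploaded document has been indexed. Please ask your question.")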

The steps described in Sections 3.3 to 3.7 run in a loop until the user specifically
commands the program to break out of the loop, at which point the program terminates.

This entire process gives the user the feel of talking to a human rather than typing
questions and reading generated text, as is done in many other applications. User
experience is an important factor in an application's popularity, since it determines how
many users find the application comfortable to use, and interactivity is one of its core
parts. No explicit graphical user interface was created; instead, the application is run
from the command prompt. The user simply provides the path of the file, and vital
information, such as whether the file was saved and how many characters it contains, is
displayed. The user is prompted to start speaking with a "recording started" message and
can end the recording by pressing Ctrl+C, after which a "Done recording" message is
shown. Both the questions asked and the answers spoken are also displayed on screen, in
case the user wants to read them or copy them elsewhere. To enhance interactivity, the
program runs in a loop: after it finishes speaking, the user can speak again, and the
conversation continues until the user says "bye". The script detects the word "bye" even
when it appears within a longer sentence, thanks the user for trying out the program, and
terminates.
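
The simplified sketch below illustrates this loop and the "bye" detection. It reuses the
method names from Section 3.8 and assumes they are defined as described there; the ethics
check and multi-question generation are omitted for brevity.

def run(file_path):
    retriever = get_file_create_retiever(file_path)    # build the knowledge base once
    while True:
        question = get_audio()                          # "recording started" ... Ctrl+C ... "Done recording"
        if "bye" in question.lower().split():           # detect "bye" anywhere in the sentence
            speak("Thank you for trying out the program!")
            break
        print("Q:", question)
        answer = get_output(retriever, question)
        print("A:", answer)
        speak(answer)                                   # the loop continues after speaking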

3.8 METHODS USED


3.8.1 speak_output()
The method speak_output() is used to take the file path and voice input from the user.
speak_output() calls multiple methods, like get_audio(), get_file_create_retiever(),
check_ethics(), generate_questions(), get_output() and summarise(). speak_output() is
the primary method and acts as the entry point to the program.

3.8.2 TEXT EXTRACTION METHODS


Four different methods are used to extract text from different file types; a possible
implementation of each is sketched after this list:
● extract_text_from_pdf(file_path) :
○ The file path is passed as an input to the extract_text_from_pdf() method
when the file is a PDF file. This method extracts texts from the pdf page by
page, concatenates them and returns the text string.
● extract_text_from_docx(file_path) :
○ The file path is passed as an input to the extract_text_from_docx() method
when the file is of .docx type. This method extracts texts from each
paragraph of the document, concatenates them and returns the text string.
● extract_text_from_txt(file_path) :
○ The file path is passed as an input to the extract_text_from_txt() method
when the file is of .txt type. This method reads the file and returns the text
string.
● extract_text_from_rtf(file_path) :
○ The file path is passed as an input to the extract_text_from_rtf() method
when the file is of .rtf type. This method loads the .rtf file, uses the rtf_to_text()
function from the striprtf library to read the loaded document and returns
the text string. striprtf is a library that converts .rtf files to simple Python
strings.
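
The sketch below shows possible implementations of these four helpers. striprtf is named
in the report, while pypdf and python-docx are assumed choices for the PDF and .docx
cases.

from pypdf import PdfReader
from docx import Document
from striprtf.striprtf import rtf_to_text

def extract_text_from_pdf(file_path):
    reader = PdfReader(file_path)
    # Extract page by page and concatenate into one string.
    return "".join(page.extract_text() or "" for page in reader.pages)

def extract_text_from_docx(file_path):
    # Concatenate the text of every paragraph in the document.
    return "\n".join(p.text for p in Document(file_path).paragraphs)

def extract_text_from_txt(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

def extract_text_from_rtf(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        return rtf_to_text(f.read())    # striprtf converts .rtf markup to a plain Python string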
3.8.3 get_file_create_retiever(file_path)
The method get_file_create_retiever() takes in the file path and chooses one of the above
methods to extract text from the file based on its extension. It then saves the extracted
string as a text file, replaces escape-sequence characters like \t and \n with spaces,
creates the parent and child chunks, embeds the child chunks, creates a vector store to
hold the embeddings and a document store to hold the documents, and finally creates the
ParentDocumentRetriever.
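
A sketch of this construction is given below under stated assumptions: Chroma as the
vector store, an in-memory store for the parent documents, and illustrative chunk sizes;
the dispatching extract_text() helper is hypothetical and stands in for the four
extraction methods above.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def get_file_create_retiever(file_path):
    text = extract_text(file_path).replace("\t", " ").replace("\n", " ")  # hypothetical dispatch helper
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)     # parent chunks passed to the LLM
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)       # child chunks that get embedded
    vectorstore = Chroma(
        collection_name="talkdoc",
        embedding_function=GoogleGenerativeAIEmbeddings(model="models/embedding-001"),
    )
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=InMemoryStore(),          # document store holding the full parent chunks
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    retriever.add_documents([Document(page_content=text)])
    return retriever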

3.8.4 get_audio()
The get_audio() method starts and ends the recording of the user's query (PyAudio is used
for recording), converts it into text using OpenAI's Whisper Base model and returns the
converted text.

3.8.5 get_output(big_chunks_retriever,question)
The method get_output(big_chunks_retriever, question) takes in the retriever and the
user's query, retrieves the relevant chunks and passes them to the LLM, and returns the
final output generated by the LLM.

3.8.6 generate_questions(question)
This method takes in the original query asked by the user and generates two additional
questions similar to what the user asked and returns a list containing all three questions.

3.8.7 summarise(outputs)
This method is only called when there are two additional questions generated. The list
containing all the responses from the LLM is passed to the summarise method. The method
returns the summary string of all the responses in the list.

3.8.8 check_ethics(question)
The check_ethics(question) method takes in the query asked by the user and classifies it
into either ethical and harmless or unethical and harmful. It returns a string containing the
classification value (1 for ethical and harmless or 2 for unethical and harmful) and a reason
for the same separated by double pipe symbol (||).

Figure 3.1. Model Architecture

Chapter 4
Results and Discussion
4.1 RESULTS
We primarily relied on human evaluation and BERT Score to evaluate our application.
BERT Score needs the ground truth (the actual expected response) as the base to compare
with the model’s generated response. Hence, a human is required to generate the ground
truth answers. Three scores are taken from BERT Score to evaluate how well the RAG
system performs (a short usage sketch follows the list):
● Precision: The ratio of relevant results to the total results returned by the LLM.
● Recall: The ratio of relevant results returned by the LLM to the total relevant
results.
● F1-Score: The harmonic mean of Precision and Recall.
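
A short usage sketch of the bert-score package with hypothetical candidate and reference
strings is shown below; in the actual evaluation the references were the human-written
ground-truth answers.

from bert_score import score

candidates = ["The RAG system retrieves parent chunks for each matched child chunk."]
references = ["Parent chunks are fetched for every matching child chunk before generation."]

P, R, F1 = score(candidates, references, lang="en")   # one value per candidate-reference pair
print(f"Precision={P.mean().item():.4f}  Recall={R.mean().item():.4f}  F1={F1.mean().item():.4f}")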
Three documents were selected and three questions were prepared for each of them.

Document 1: The paper by Alkhalaf [13], which contains a rich mix of different
techniques; it was helpful for testing whether all the techniques were properly extracted
and presented to the user.
Document 2: An English translation of the Bhagavad Gita. It is a simple text-only file
and was useful for testing whether the RAG system can answer questions about topics that
are not stated directly in the text.
Document 3: The paper by Cormack [11]. The focus here was on whether tabular data could
be retrieved efficiently.
Humans evaluated the system on:
● Quality of Response:
○ Factual Accuracy: Assess the correctness of the information presented in
the response. (scale: 1-5, 1 being not at all accurate, 5 being highly
accurate)
○ Relevance: Evaluate how well the response addresses the query and the
provided context. (scale: 1-5, 1 being not at all relevant, 5 being highly
relevant)

○ Comprehensiveness: Determine if the response covers all relevant aspects
of the query. (scale: 1-5, 1 being not at all comprehensive, 5 being highly
comprehensive)
○ Coherence: Judge the logical flow and overall coherence of the generated
text. (scale: 1-5, 1 being not at all coherent, 5 being highly coherent)

○ Conciseness: Assess the brevity and clarity of the response. (scale: 1-5, 1
being not at all concise, 5 being highly concise)
● Quality of Retrieval:
○ Relevance: Evaluate the relevance of retrieved documents to the query.
(scale: 1-5, 1 being not at all relevant, 5 being highly relevant)
○ Diversity: Assess the diversity of information sources used in the response.
(scale: 1-5, 1 being not at all diverse, 5 being highly diverse)
○ Completeness: Determine if all relevant information sources were retrieved.
Also consider how much of the information was retrieved. (scale: 1-5, 1
being not at all complete, 5 being complete)
● Other Factors:
○ Bias: Assess the presence of bias in the generated response, such as gender,
race, or cultural bias. (scale: 0 or 1, 1 meaning bias is present, 0 meaning
no bias is present)
○ Fairness: Evaluate the fairness of the system in terms of its treatment of
different user groups. (scale: 1-5, 1 being not at all fair, 5 being fair)
○ Toxicity: Determine the level of toxicity or offensive language in the
generated text. (scale: 1-5, 1 being not at all toxic, 5 being highly toxic)

Three different documents were used as the knowledge base, with three questions each, so
a total of nine questions across three documents were used for human evaluation.

Metric                 DOC 1   DOC 2   DOC 3
Factual Accuracy         5       5       5
Relevance                4       4       5
Comprehensiveness        5       5       5
Coherence                3       4       4
Conciseness              5       5       4
Retrieval Relevance      4       4       3
Diversity                5       5       4
Completeness             5       5       5
Bias                     0       0       0
Fairness                 5       5       5
Toxicity                 1       1       1
4.1.1 Human Evaluation

For BERT Score, the same three documents and the same three questions per document were
used. The table below shows the average precision, recall, and F1 Score for each
document.

Metric              DOC 1    DOC 2    DOC 3
Average Precision   0.8964   0.8658   0.8658
Average Recall      0.8830   0.8421   0.8421
Average F1 Score    0.8896   0.8538   0.8538


4.1.2 BERT-Score Evaluation

Figure 4.1.1 Comparison of metrics for human evaluation across all three documents

Figure 4.1.2 Comparison of metrics for BERT Score for the first document

Figure 4.1.3 Comparison of metrics for BERT Score for the second document

Figure 4.1.4 Comparison of metrics for BERT Score for the third document

4.2 RESULT SCREENSHOTS

Chapter 5

Conclusion and Future Work

Efficient and user-friendly solutions for extracting pertinent information are required
due to the growing amount and complexity of digital materials across multiple areas.
TalkDoc is an interactive, voice-driven document assistant that combines cutting-edge
methods in generative AI, Retrieval-Augmented Generation (RAG) and voice processing to
meet this demand. By enabling users to upload documents, ask questions vocally and
receive succinct spoken responses, TalkDoc not only saves time but also provides a
seamless experience for engaging with dense information.

TalkDoc creates a dependable system for precise and contextual responses with features
including multi-level chunking for refined retrieval, question generation to manage
ambiguous queries, and an ethics verification chain to keep responses morally and legally
sound. The integration of parent-document retrieval and multi-question generation ensures
that even broad questions yield precise, well-informed responses, enhancing user trust
and utility.

Even with all this in place, TalkDoc is not perfect. It can be slow at times because it
depends on free cloud platforms, such as Groq, where the model used (Llama 3.1 70B) is
hosted. It would be faster if a paid service were used or the models were run locally,
which would increase interactivity by reducing the response delay. The spoken voice
generated by the system is quite robotic; solutions that generate more human-sounding
voices could be adopted in future iterations, although they are costlier. Multiple-
question generation could also be improved if the Large Language Model knew the context
of the uploaded document, since it could then generate similar questions tailored to the
uploaded file. Because the backbone of the system is a RAG pipeline, producing summaries
or themes of the whole document is not very accurate; in further iterations, dedicated
methods can be added to summarize or discuss the themes of the document when the user
asks for them. Furthermore, introducing a reranker might yield better outputs.

REFERENCES
[1] Gupta, H., & Patel, M. (2020, October). Study of extractive text summarizer using the
elmo embedding. In 2020 Fourth International Conference on I-SMAC (IoT in Social,
Mobile, Analytics and Cloud)(I-SMAC) (pp. 829-834). IEEE.

[2] Swami, P., & Pratap, V. (2022, May). Resume classifier and summarizer. In 2022
International Conference on Machine Learning, Big Data, Cloud and Parallel Computing
(COM-IT-CON) (Vol. 1, pp. 220-224). IEEE.

[3] Ji, Y., Li, Z., Meng, R., Sivarajkumar, S., Wang, Y., Yu, Z., ... & He, D. (2024). RAG-
RLRC-LaySum at BioLaySumm: Integrating Retrieval-Augmented Generation and
Readability Control for Layman Summarization of Biomedical Texts. arXiv preprint
arXiv:2405.13179.

[4] Megahed, F. M., Chen, Y. J., Ferris, J. A., Resatar, C., Ross, K., Lee, Y., & Jones-
Farmer, L. A. (2024). ChatISA: A Prompt-Engineered Chatbot for Coding, Project
Management, Interview and Exam Preparation Activities. arXiv preprint
arXiv:2407.15010.

[5] Lin, D. (2024). Revolutionizing retrieval-augmented generation with enhanced PDF
structure recognition. arXiv preprint arXiv:2401.12599.

[6] Gao, Y., Zong, L., & Li, Y. (2024). Enhancing biomedical question answering with
parameter-efficient fine-tuning and hierarchical retrieval augmented generation. CLEF
Working Notes.

[7] Kuo, E. C., & Su, Y. H. (2024, June). Assembling Fragmented Domain Knowledge: A
LLM-Powered QA System for Taiwan Cinema. In 2024 IEEE Congress on Evolutionary
Computation (CEC) (pp. 1-8). IEEE.

[8] Yugeswardeenoo, D., Zhu, K., & O'Brien, S. (2024). Question-Analysis Prompting
Improves LLM Performance in Reasoning Tasks. arXiv preprint arXiv:2407.03624.

[9] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D.
(2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in
Neural Information Processing Systems, 33, 9459-9474.

[10] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P.
(2024). Lost in the middle: How language models use long contexts. Transactions of the
Association for Computational Linguistics, 12, 157-173.

[11] Cormack, G. V., Clarke, C. L., & Buettcher, S. (2009, July). Reciprocal rank fusion
outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd
international ACM SIGIR conference on Research and development in information
retrieval (pp. 758-759).

[12] Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). Ragas: Automated
evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217.

[13] Alkhalaf, M., Yu, P., Yin, M., & Deng, C. (2024). Applying generative AI with
retrieval augmented generation to summarize and extract key clinical information from
electronic health records. Journal of Biomedical Informatics, 104662.

Appendices

APPENDIX 1: SOURCE CODE

Importing the necessary libraries

Setting important parameters, defining the LLM and speak variables and creating the
functions to read the files

Defining the method to get the query as audio and convert it into text

Defining the method to create the retriever

Defining a method to get the output from the LLM

Defining a method to generate additional and similar questions to the user query

Defining a function to summarize the multiple responses generated by the LLM for
multiple questions

Defining a function to check if the user asked an ethical and harmless question

Defining the function that acts as the gateway to the program

