
MINISTRY OF EDUCATION AND TRAINING

HANOI UNIVERSITY

GRADUATION THESIS

HANU CHATBOT BACKEND:
UNLEASHING THE CAPABILITIES OF CHATGPT
IN DOMAIN-SPECIFIC KNOWLEDGE

HANU CHATBOT BACKEND:
UNLEASHING CHATGPT
IN DOMAIN-SPECIFIC KNOWLEDGE

Supervisor: Dr. Nguyễn Xuân Thắng

Student: Đặng Quỳnh Trang
Student ID: 2001040208
Specialization: Software Engineering
Faculty: Information Technology

Hanoi, 2024


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY

HANU CHATBOT BACKEND:


UNLEASHING CHATGPT
IN DOMAIN-SPECIFIC KNOWLEDGE

Supervisor: Dr. Nguyễn Xuân Thắng


Student: Đặng Quỳnh Trang
Student ID: 2001040208
Specialization: Software Engineering
Faculty: Information Technology

Hanoi, 2024
DECLARATION OF AUTHORSHIP

I, Dang Quynh Trang, student ID 2001040208 at Hanoi University, declare that this
dissertation titled “HANU Chatbot Backend: Unleashing ChatGPT in Domain-
specific Knowledge” is my own original work and that all sources have been
appropriately acknowledged and referenced. This dissertation has not been
submitted in whole or in part for consideration for any other degree or qualification
at this or any other university.

Signature: __________________________________________________________

Date: ______________________________________________________________

ACKNOWLEDGEMENT

First of all, I would like to express my sincere gratitude to my supervisor, Dr.
Nguyen Xuan Thang, Dean of the Faculty of Information Technology, Hanoi
University, for his valuable guidance, encouragement, and steadfast support
throughout the thesis writing process. His insightful feedback and expertise
contributed significantly to the design of this work.
I am also deeply grateful to the faculty and staff of Hanoi University for providing
me with the necessary resources, knowledge, and opportunities to explore the
fascinating fields of artificial intelligence and natural language processing.
My sincere gratitude also goes to my family and friends, who continue to inspire and
support me. Their belief in me was my motivation during difficult times.
Finally, I would like to thank the open-source community and researchers for their
contributions, whose work has laid the foundation for advances in language models,
vector databases, and conversational artificial intelligence.
Thank you sincerely!

TABLE OF CONTENTS

DECLARATION OF AUTHORSHIP ........................................................................ i


ACKNOWLEDGEMENT ......................................................................................... ii
LIST OF FIGURES.....................................................................................................v
ABSTRACT ............................................................................................................. vii
1. INTRODUCTION...................................................................................................1
1.1. Background and rationale ................................................................................1
1.2. Problem statement ............................................................................................2
1.3. Scope and limitations .......................................................................................3
1.4. Expected outcomes...........................................................................................5
1.5. Research methodology .....................................................................................6
1.6. Dissertation organization .................................................................................8
2. CHAPTER 1: LITERATURE REVIEW ..............................................................10
2.1. Conversational AI in education ......................................................................10
2.2. Large Language AI Models ...........................................................................11
2.3. Integrating domain knowledge .......................................................................13
2.4. Gap analysis and solution...............................................................................14
3. CHAPTER 2: CHATGPT .....................................................................................16
3.1. Introduction to ChatGPT ................................................................................16
3.2. How ChatGPT works .....................................................................................17
3.3. ChatGPT applications ....................................................................................18
3.4. Strengths and weaknesses ..............................................................................19
4. CHAPTER 3: SYSTEM ARCHITECTURE AND DESIGN ...............................22
4.1. System requirements ......................................................................................22
4.1.1. Functional requirements ..........................................................................22
4.1.2. Non-functional requirements ..................................................................22
4.2. System architecture ........................................................................................23
4.3. Database schema ............................................................................................24

4.3.1. Corpus storage schema ............................................................................24
4.3.2. Vector database schema ..........................................................................25
4.4. Pipeline workflows.........................................................................................26
4.4.1. System workflows ...................................................................................26
4.4.2. Backend workflows .................................................................................28
5. CHAPTER 4: IMPLEMENTATION AND DISCUSSION .................................30
5.1. Platforms and tools .........................................................................................30
5.1.1. Python programming language and libraries ..........................................30
5.1.2. PostgreSQL vector database ...................................................................31
5.1.3. GitHub version control system ...............................................................32
5.2. Document injection and preprocessing pipeline ............................................32
5.3. Embedding generating ...................................................................................35
5.4. Vector database ..............................................................................................37
5.5. Vector searching and retrieving .....................................................................42
5.6. API development ............................................................................................44
5.7. ChatGPT integration ......................................................................................45
5.8. Result and discussion .....................................................................................49
6. CONCLUSION .....................................................................................................55
REFERENCES ..........................................................................................................57

LIST OF FIGURES

Figure 1: A CSV file storing documents about HANU educational program ..........25
Figure 2: A table about HANU educational program in Vector Database ...............26
Figure 3: System workflows of the HANU chatbot ..................................................27
Figure 4: Backend workflows of the HANU chatbot ...............................................28
Figure 5: Code snippet – Get HANU documents from a CSV file ...........................32
Figure 6: Code snippet – Get the number of tokens from a string ............................33
Figure 7: Code snippet – Chunk HANU documents into smaller units ....................34
Figure 8: Code snippet – Combine metadata from HANU documents ....................35
Figure 9: Code snippet – Initialize OpenAI client ....................................................36
Figure 10: Code snippet – Get embedding from HANU documents ........................36
Figure 11: Code snippet – Embed HANU documents (full steps) ............................36
Figure 12: Code snippet – Create a normal database ................................................37
Figure 13: Code snippet – Create vector extension within a database .....................38
Figure 14: Code snippet – Create ivfflat index within a table ..................................38
Figure 15: Code snippet – Create a vector table (full steps) .....................................39
Figure 16: Code snippet – Load training data into the vector table ..........................40
Figure 17: Code snippet – Prepare for chatbot training ............................................41
Figure 18: Code snippet – Load corpus into vector database (full steps) .................41
Figure 19: Code snippet – Prepare data for two chatbots .........................................42
Figure 20: Code snippet – Vector search ..................................................................43
Figure 21: Code snippet – Initialize Flask application .............................................44
Figure 22: Code snippet – Implement API endpoint for relevant documents ..........44
Figure 23: Code snippet – Run the chatbot ...............................................................45
Figure 24: Code snippet – Get response from GPT-3.5 model .................................46
Figure 25: Code snippet – Define body message for OpenAI API ...........................47
Figure 26: Code snippet – Implement API endpoint for complete response ............48
Figure 27: Chatbot training process (logging screen) ...............................................49

Figure 28: Example of a POST request to get the relevant documents ....................50
Figure 29: HANU chatbot preview ...........................................................................52
Figure 30: Comparison between Hanu Chatbot and ChatGPT performance ............53

ABSTRACT

In the digital age, easy access to knowledge is becoming increasingly important, and the requirements placed on it increasingly sophisticated. Although ChatGPT, today's leading conversational AI technology, can answer users' questions in a natural and friendly way, its knowledge remains limited by the scope of its initial training data. The solution proposed in this thesis is a backend chatbot system that combines the capabilities of ChatGPT with a rich and reliable source of knowledge drawn from the documents of lecturers and staff at Hanoi University. Students thereby gain a "virtual assistant" that accompanies them throughout their learning, advising, and research, acting as a natural dialogue partner for questions in each specific field, grounded in the university's trusted sources of knowledge. The thesis focuses on researching and applying the key technology components involved: tools that convert text content into a numerical representation (a vector, or embedding), a database optimized for storing and searching similar vectors, and a method for integrating domain-specific documents into ChatGPT. The ultimate goal is a smoothly operating digital knowledge ecosystem that supports in-depth communication and dialogue while affirming the important role of artificial intelligence applications in education in improving the quality of training for future generations.

1. INTRODUCTION

1.1. Background and rationale

The expansion of information technology toward automation has brought about a significant evolution in how information is delivered to consumers. To meet the growing demand for new and specialized knowledge, educational institutions are trying to develop effective methods to connect traditional knowledge banks with the ever-changing learning needs of their students.
Meanwhile, recent advances in Artificial Intelligence (AI), particularly in Natural Language Processing (NLP), are reshaping the way people interact and work with digital information. Notable examples include Large Language Models (LLMs) such as ChatGPT, Gemini, and Claude. These models are based on deep learning mechanisms that enable them to analyze and generate natural language text at scale that reads as if written by a human, creating engaging and user-friendly chat experiences. According to Gan et al. (2023), integrating LLMs into education can improve the quality of education and the learning experience, since such integrations may support adaptive assessment, intelligent tutoring, individualized learning, and other features.
Such models are highly praised by teachers and specialists, who find them effective tools in education; nevertheless, they still lack domain-specific expertise because their training data does not comprehensively cover every field. In addition, as stated by Hadi et al. (2023), when biased training data is used to construct Language Models (LMs), these LMs may inadvertently exhibit prejudice. Specifically, LLMs may be biased toward generating more "interesting" or fluent outputs, which increases the likelihood of hallucinations.
Therefore, this dissertation aims to provide a solution that integrates LLMs, such as
ChatGPT, with a repository of Hanoi University discipline-specific information. The
primary purpose is to develop a dependable backend system to connect ChatGPT to
Hanoi University’s credible resources. This integration will empower users to engage
in natural language conversations and receive accurate, context-relevant responses
drawn from the trusted and authoritative resources of Hanoi University.

1.2. Problem statement

While introducing AI models like ChatGPT into educational environments can yield
promising results, we still need to overcome a number of obstacles to exploit their full
potential. Hadi et al. (2023) claimed that LLMs may be biased and result in
detrimental or erroneous outcomes. The model might not be able to learn to generalize
to new scenarios if the dataset is incomplete. Furthermore, the model will be trained to
incorporate any bias or inaccuracy present in the dataset. Therefore, when answering
questions related to specific fields or actual professional opinions, LLMs may respond
vaguely with shallow or irrelevant answers (Hadi et al., 2023).
At Hanoi University and other educational institutions, providing in-depth, up-to-date, and reliable knowledge to students, faculty, and staff is vital, as it underpins both academic excellence and the proper functioning of the institution. Lectures, textbooks, and policy documents, although valuable for disseminating knowledge, offer a static way of absorbing information and pay little attention to the individual needs of learners.
Besides, knowledge evolution occurs regularly in academic disciplines and
institutional policies. Therefore, it is important to have a responsive and adaptable
knowledge transfer channel to meet the rapid development of knowledge in many
different fields.
In fact, classical ways of bringing domain knowledge into LLMs, such as refining or consolidating knowledge, can be resource-intensive and may not scale efficiently to large, mixed data sets. Typically, these techniques are
performed during training. This complicates updating the model's knowledge base if
new information arises or existing information changes.
Because of this, there is an urgent need for a comprehensive approach to fill the gap
between the outstanding capabilities of LLMs and the specialized knowledge produced
by Hanoi University. This proposed solution can not only provide users with accurate
data relevant to the specific context but also promote easy absorption of that
information while engaging in conversations with natural language, like having a
personal consultant.
Addressing these challenges matters because overcoming them would unleash AI-based conversational agents in educational contexts. Such chatbots can spare users the hassle of searching across multiple sources by letting them obtain accurate and up-to-date information with ease. This approach is also expected to reduce teachers' workload, help address unequal access to education for underserved communities, and open up further research on how subjective cognitive beliefs are transferred to AI systems.

1.3. Scope and limitations

A major part of this dissertation is dedicated to developing a powerful backend infrastructure that unifies the abilities of large language models such as ChatGPT with Hanoi University's domain knowledge. The key purpose of the system is to provide an integrated program through which users can converse freely and obtain accurate, source-based responses grounded in the contextualized material Hanoi University has developed.
The proposed solution is to use a high-performance backend infrastructure to enable
domain-specific document retrieval. These include strategies for converting text
documents into digital representations (called “vectors” or “embeddings”) and the
implementation of an efficient vector storage and retrieval system. Following this, the
backend supports Application Programming Interfaces (APIs), allowing clients to
obtain the resources needed and produce credible answers to their questions.
Within the scope of this work, several key concerns are addressed:
• Document Ingestion and Processing Pipeline: The domain-specific
documents are gathered and stored in well-structured storage. The documents'
text will then be processed into small components (chunks) prepared for the
embedding process.
• Embedding Generator: This component undertakes the responsibility of
turning text content into numerical vector representations. In essence, these
embeddings preserve the semantic relations and contextual sense of the given texts, enabling search and retrieval based on meaning rather than exact wording.
• Vector Database: This part of the system is designed specifically for storing
and efficiently extracting high-dimensional embeddings. This element is
essential for enabling quick and correct similarity searches, which facilitate the
identification of relevant parts of documents according to users' queries.
• Vector Search and Retrieval: This algorithm is applied directly to link the
user questions with the corpus content most likely relevant to them. Thus, it is
ensured that the answer provided matches the question given.
• API Development: A RESTful API is introduced as the mediator connecting the subsystems. It not only handles transactions between different system components but also facilitates communication with the front end. When needed, it retrieves pertinent documents from the backend database and transfers them to the client side, allowing the front-end application to customize the instruction sent to the GPT model based on the retrieved documents.
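To make the components above concrete, the following minimal sketch traces the same flow end to end: chunking a document, generating embeddings, and retrieving the chunks most similar to a query. It is an illustration only, assuming the OpenAI Python package (v1.x) with an API key in the environment; the chunk size, embedding model, and the file name hanu_program.txt are hypothetical placeholders, not the project's actual configuration.

    # Illustrative sketch of the chunk -> embed -> search flow (not the thesis's exact code).
    import numpy as np
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def chunk(text: str, max_words: int = 200) -> list[str]:
        # Split a document into roughly fixed-size word chunks for embedding.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    def embed(texts: list[str]) -> np.ndarray:
        # Convert text chunks into numerical vector representations (embeddings).
        resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
        return np.array([item.embedding for item in resp.data])

    def top_k(query: str, chunks: list[str], vectors: np.ndarray, k: int = 3) -> list[str]:
        # Return the k chunks whose embeddings are most similar (cosine) to the query.
        q = embed([query])[0]
        sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
        return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

    corpus = open("hanu_program.txt", encoding="utf-8").read()  # hypothetical source file
    pieces = chunk(corpus)
    vectors = embed(pieces)
    print(top_k("What are the graduation requirements?", pieces, vectors))

In the full system, the embeddings are persisted in the vector database and queried through the API rather than held in memory as in this sketch.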
The innovative project has a primary objective, which is to enhance the quality of
education for all students by allowing them access to accurate domain information
from a chatbot that plays the role of a “virtual tutor”. Additionally, this system holds
great potential for minimizing workloads on teachers, giving knowledge to
underprivileged communities, and making it possible for scientists to research how
perspectives can be integrated into an AI system, among other things. To conclude,
these efforts underscore AI's position as a driver of new concepts in higher education
and a driver of students' better learning experiences.
The dissertation’s scope is comprehensive, but it is important to recognize some
boundaries. When it comes to practical implementation, most of my work has focused
on building the backend system, which is made up of elements such as data scraping,
text pre-processing, vector storage and retrieval, and API creation. The ChatGPT
response completions will be handled by the front-end.
Furthermore, since our system is at its outset and its knowledge base has been trained
solely with resources provided by Hanoi University, this might make it less useful for
broader use cases where more general knowledge about different factors is required.
Nonetheless, the architecture of our system remains scalable and adaptable to include
other knowledge domains in the future.
Another limitation is that the quality of the generated output depends essentially on how accurate and representative the ingested corpus is. Although efforts will be made to ensure reliable and broad coverage of knowledge, the system may at times encounter gaps or misalignments in its dataset.
The initial stages of establishing the backend system have been completed, although it remains a work in progress aimed at closing the gap between the enormous capabilities of LLMs and the vast domain-specific knowledge bases of academic institutions. By addressing the issues involved in applying specialized expertise to dialogue AI systems, this study contributes to the dissemination of information as well as the advancement of pedagogical technology.

1.4. Expected outcomes

This topic is expected to provide a powerful and scalable backend system to effectively integrate ChatGPT with the specialized knowledge of Hanoi University.
We therefore expect the system to enable users to participate in natural language
conversations and access accurate information that meets their needs. The system will
act primarily as a virtual tutor and source of contextual information for students,
teachers, and others involved and will provide personalized support to them in their
learning process.
With a view to advancing AI in a continuously evolving setting, this study examines how domain-specific knowledge can be incorporated into LLMs. Specifically, by integrating platforms like ChatGPT with the knowledge resources of schools, we can enrich educational systems, creating an atmosphere that promotes collective learning while using digital technologies to help students obtain relevant advice and information.
The innovative project has a primary objective, which is to enhance the quality of
education for all students by allowing them access to accurate domain information
from a chatbot that plays the role of a virtual tutor. Additionally, this system holds
great potential for minimizing workloads on teachers, giving knowledge to
underprivileged communities, and making it possible for scientists to research how
perspectives can be integrated into an AI system, among other things. Thus, these
efforts underscore AI's position as a driver of new concepts in higher education and a
driver of students' better learning experiences.
Finally, the successful realization of this dissertation will demonstrate that AI can be at the forefront of innovation in higher education. By fully tapping into conversational agents like ChatGPT and integrating them with institutional domain assets, we can lift the quality of instruction, encourage collaborative learning experiences, and equip students with the tools and skills to succeed in an increasingly digital world.

1.5. Research methodology

A thorough research methodology will be used in order to achieve the goals of this
dissertation and provide a strong backend structure for incorporating ChatGPT with
domain-specific expertise from Hanoi University. This method will encompass a
mixture of both theoretical and empirical approaches, relying on well-known
principles and techniques from different fields, including natural language processing,
information retrieval, database systems, and software engineering.
Firstly, the study will begin with an extensive literature review that critically examines
past works in areas of conversational AI, LLMs, and how to integrate domain
knowledge into AI systems. The review aims to identify gaps within the existing state-
of-the-art methods and then inform our proposed solution design.
Next, a rigorous requirements engineering process will define the functional and non-functional requirements for the backend system. In this phase, we will work closely with our mentor and potential end-users to ensure that the system aligns with their expectations.
The design of a detailed system architecture based on the proposed specifications will
comprise such components as corpus processing pipelines, vector storage and retrieval
mechanisms, and API implementation. The idea behind this is to make sure that the
system can grow while still remaining maintainable and flexible, following industrial
standards.
The next phase will involve creating the backend in appropriate languages,
frameworks, and tools. By employing the Waterfall method, a comprehensive
requirements document will be established upfront, outlining all functionalities and
features of the software. This method adheres to a sequential development process
where each phase (requirement gathering, design, development, testing, and
deployment) must be completed before progressing to the next. This structured
approach ensures a clear roadmap and minimizes the need for significant changes
during later stages.
At the same time, data collection and curation activities will be set up to retrieve
academic documents, among others, from Hanoi University’s archives. These
documents would then undergo preprocessing, where they are transformed into vector
representations through modern natural language processing technologies like word
embeddings or transformer models.
Testing and evaluation strategies shall be used to ensure the reliability and accuracy of
the system. Response relevance, coherence, and accuracy are some of the key metrics
for measuring performance that will be evaluated so that areas for improvement or
refinement can be identified.
Throughout this research, a collaborative approach will be taken, involving domain experts such as technical advisers and potential end-users. The results will be shared through regular progress reports and presentations and disseminated through peer-reviewed publications to benefit other researchers working in similar areas.
As a result, the dissertation will be completed with the writing of a research paper that
details how the project was done in terms of its methodologies, techniques, and
findings. This paper will be reviewed thoroughly by peers and later published in
reputable academic journals or conferences, thus improving knowledge in dialogue AI
and domain knowledge fusion.

1.6. Dissertation organization

In this dissertation, I will explain why there is a need for the integration of ChatGPT
with domain-specific knowledge from Hanoi University while also giving a
comprehensive overview of all other aspects concerning the proposed architecture.
The first chapter serves as an entry point through which the reader can understand the context in which this study is conducted. It explains why integrating LLMs such as ChatGPT with subject-matter expertise can support teaching and learning in higher education, and it articulates the challenges and limitations of existing approaches in a problem statement. It also states what this study includes and excludes, establishing the boundaries within which the research was done, and outlines the outcomes expected from implementing the proposed solution, thereby indicating its intended impact and contributions.
Following this, a comprehensive review of the current literature and connected work
about conversational AI, LLMs, and techniques that enable integration of domain
knowledge in AI systems is provided. The chapter reviews the strengths and
weaknesses of existing approaches, identifies research gaps, and provides a theoretical
base for the proposed solution. It serves as a strong basis for subsequent chapters and,
hence, indicates the novelty and importance of the research undertaken.
Then, the dissertation focuses on ChatGPT, the large language model upon which the proposed solution is built. It describes how ChatGPT works, its underlying architecture, and the extent to which it can generate realistic text. The discussion explores possible uses for ChatGPT along with its strengths and weaknesses, explaining why it should be complemented with domain-specific knowledge to enhance its performance in area-specific situations.

At this point, we present the detailed design and architecture of our proposed backend infrastructure, outlining the functional and non-functional system requirements derived from stakeholder analysis and requirements engineering. The overall system architecture is described, encompassing the
various components such as the storage schema, text-processing pipelines, vector
storage and retrieval mechanisms, and API implementation. Design principles,
patterns, and best practices employed in the development process are also discussed.
Later on, we provide an extensive account of the implementation phase, covering the platforms, tools, and technologies used for the backend system: how to ingest and curate a corpus of domain-specific documents from Hanoi University, implement text-processing pipelines, design vector storage and retrieval mechanisms, and build an API that allows seamless integration with ChatGPT in our front-end application. The results obtained from the evaluation and testing phases analyze the system's performance against key metrics such as response relevance, coherence, and accuracy. This includes a critical discussion of the findings, limitations, weaknesses,
and scope for future research.
Finally, the dissertation concludes by summarizing the key findings and contributions
of this thesis work, highlighting the significance of the proposed solution to advancing
conversational AI with domain knowledge integration. This chapter goes back to the
research objectives and assesses whether they have been achieved or not. The
implications for further research are also provided, therefore suggesting possible ways
of addressing these limitations identified during the research process. A
comprehensive list of references is included to support the research findings and
comply with acceptable academic standards.

2. CHAPTER 1: LITERATURE REVIEW

2.1. Conversational AI in education

According to IBM (n.d.), chatbots and virtual agents are among several tools that fall
under the umbrella term Conversational AI, which refers to any system that can be
spoken to or written to by a user. They recognize speech and text inputs, determine
their meanings in multiple languages, and use vast amounts of data, natural language
processing, and machine learning to mimic human conversation. Indeed, NLP is
combined with machine learning in conversational AI, so these NLP methods keep
feeding into the machine learning processes for further improvement of AI algorithms
(IBM, n.d.).
One key advantage of using conversational AI in teaching is that it provides constant
support, which can answer students’ questions or objections anytime, even outside
regular school hours. For instance, AI chatbots can respond to inquiries from learners
by giving them the necessary feedback on their assignment submission. Additionally,
they may also guide people through reading materials while making recommendations
based on personal preferences indicated by one’s learning style at a given level
(Labadze et al., 2023). Such an approach would prove useful mainly among distance
education participants with different timetables by ensuring the availability of
educational resources.
The other significant advantage of conversational AI in education is its ability to
lessen the workload of teachers. This frees them from mundane tasks and allows them
to focus on more complex assignments and high-level cognitive discussions. In
accordance with Kamalov et al. (2023), AI can streamline various aspects of
education, such as creating content, grading tasks, or formulating questions based on
given prompts, all of which otherwise demand time and effort from teachers.
However, it should be duly noted that the efficiency of conversational AI for learning
greatly relies on the quality and breadth of its knowledge base. As highlighted by
Dempere et al. (2023), those engaging with AI-driven educational experiences should
recognize that these systems operate based on existing data and are often unable to
adapt well to new challenges without sufficient training examples. AI could find it
difficult to overcome the kind of hitherto unheard-of difficulties that humans would
run across when exploring space, for instance.
This calls for integrating a deep understanding of specific academic or professional domains into conversational AI systems so that they can provide accurate, context-related information and engage in meaningful conversations within those fields. The current boundaries of conversational AI could be surpassed by
incorporating more advanced techniques in natural language processing together with
curated knowledge bases, thus transforming it into a potent instrument for knowledge
sharing and collaborative learning across educational institutions.

2.2. Large Language AI Models

As defined by IBM (n.d.), LLMs are trained on very large amounts of data to generate human-like text and other types of content. They can reason about and learn from context, generate coherent and relevant sentences, translate text between languages, summarize text, answer questions (from general conversation to frequently asked questions), and even help with creative writing or code generation. There is no doubt that LLMs are a major breakthrough in the field of NLP and AI (IBM, n.d.).
In 2024, LLM development is booming. OpenAI's GPT-4 remains the leader in creative text generation, producing remarkably realistic and imaginative content. Google has joined the race with two strong competitors, PaLM 2 and Gemini: PaLM 2 is a versatile model that tackles a wide range of tasks, while Gemini focuses on giving informative answers. On the efficiency side, Meta's LLaMA 2 performs well without consuming excessive resources. Meanwhile, Claude 2, developed by Anthropic, places greater emphasis on safety and reliability, making it suitable for business use. In addition, Cohere specializes in particular domains, while Falcon handles multiple languages better than most other LLMs to date. With such a wide range of models, each with its own specialty and all evolving rapidly, the industry is set for tremendous growth
The performance boom among LLMs lately begs the question: What drives their
exceptional abilities? Naveed et al. (2024) claimed that LLMs process both data and
language via architectural advancements together with training methods and context
length enhancements. They employ architectures based on transformers, more
computational power, and extensive amounts of training data to comprehend and
generate human-like languages. Moreover, LLMs use pre-training coupled with fine-
tuning techniques for specific task adaptation across different areas, which makes
them capable of performing numerous NLP tasks. The working principles behind
LLMs include training them on huge datasets using high-performance computing
resources, which enables learning intricate structures or connections within languages;
then these models can be adjusted towards certain applications so that they can
understand and produce text outputs in various domains like education, science,
mathematics, law, finance, healthcare, and robotics, among others (Naveed et al.,
2024).
However powerful they may be, there are still areas outside the training domain where
LLMs struggle with domain-specific knowledge. As pointed out by Mohamadi et al.
(2023), the training data for these models might not adequately cover certain subject
areas, leading to responses that lack depth, accuracy, or context-specific relevance
when dealing with inquiries from specialized domains. According to Bang et al.
(2023), ChatGPT is an unreliable reasoner with an average accuracy of 63.41% over
10 distinct reasoning categories, including logical reasoning, non-textual reasoning,
and commonsense reasoning. The authors also stated that ChatGPT experiences issues with
hallucinations.
As they advance and gain more capabilities, LLMs will find their way into different
fields such as education, healthcare, or customer service, which is likely to lead to an
increase in diversification across sectors. Nonetheless, it is crucial to overcome the
problem of incorporating domain knowledge so that these models can give accurate
context-aware responses for specialized queries.

2.3. Integrating domain knowledge

To unleash the full potential of LLMs so that they can offer precise and context-aware information across different domains, integrating knowledge specific to those areas is essential.
Typically, knowledge consolidation in conversational AI systems has been achieved
through rule-based systems or limited databases, which tend to give rigid and narrow
responses. Nevertheless, LLMs have brought about more sophisticated methods of
infusing these models with domain expertise.
One prominent method is fine-tuning, where a pre-trained language model is further
trained using data from a particular field. By training on far more instances than can fit
in the prompt, fine-tuning outperforms few-shot learning, the process of learning a
task through demonstrations, and enables users to perform better on a variety of
activities (OpenAI, n.d.). In other words, by fine-tuning domain-relevant datasets,
LLMs can acquire the specialized vocabulary, terminology, and contextual
understanding necessary for effective communication and information retrieval within
that domain. As a result, we will not have to include as many examples in the prompt
after a model is optimized.
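As a concrete illustration of this route (not the approach ultimately taken in this thesis), OpenAI's chat fine-tuning accepts JSONL training examples in the chat-message format; the sketch below writes one such example and submits a job with the openai Python package (v1.x). The example content, file name, and base model are placeholders.

    # Illustrative fine-tuning sketch: one chat-format training example plus job submission.
    import json
    from openai import OpenAI

    client = OpenAI()

    example = {
        "messages": [
            {"role": "system", "content": "You are an academic advisor for Hanoi University."},
            {"role": "user", "content": "How do I register for elective courses?"},
            {"role": "assistant", "content": "Placeholder answer drawn from HANU guidance."},
        ]
    }

    # Each line of the JSONL file is one complete training conversation.
    with open("hanu_finetune.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

    # Upload the training file and launch a fine-tuning job on a chat model.
    upload = client.files.create(file=open("hanu_finetune.jsonl", "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=upload.id, model="gpt-3.5-turbo")
    print(job.id)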
Another technique is knowledge-grounding, where relevant information is fetched
from external knowledge bases and incorporated into the model’s responses. In
particular, as stated by Berger (2023), a trigger, like a user request or instruction, is the
first step in a simple Retrieval-Augmented Generation (RAG) model. This trigger is
sent to a retrieval function, which uses the query to retrieve pertinent content. After
that, the retrieved content, the input prompt, and the question itself are combined within the LLM's context window, leaving enough room for the model's response. Finally, the LLM uses this combined input and retrieved content to create an output. Grounding in real-world applications is important, as this
straightforward but efficient method frequently produces outstanding outcomes.
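A minimal sketch of this RAG loop is shown below, assuming the relevant document chunks have already been fetched by a vector search; the prompt wording and model name are illustrative, and the function name is not taken from the thesis's codebase.

    # Illustrative RAG step: ground the model's answer in retrieved document chunks.
    from openai import OpenAI

    client = OpenAI()

    def answer_with_context(question: str, retrieved_chunks: list[str]) -> str:
        context = "\n\n".join(retrieved_chunks)
        messages = [
            {"role": "system",
             "content": "Answer using only the provided university documents. "
                        "If the answer is not in them, say you do not know."},
            {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
        ]
        # The retrieved content and the question share the context window,
        # leaving room for the model's response.
        resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
        return resp.choices[0].message.content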
Hybrid methods that combine fine-tuning and knowledge-grounding have also been
investigated with the aim of benefiting from both approaches. To start with, a retriever
provides fast access to large amounts of new external data. Furthermore, fine-tuning
makes the model behavior highly specialized for a given domain. Finally, the
generator produces outputs by utilizing both refined domain knowledge as well as
external context (MatrixFlows, n.d.).
While these techniques show promise for integrating domain knowledge into LLMs,
they also present several challenges. RAG thrives on real-time external data, ideal for
tasks like summarizing news articles; however, it may add latency and necessitate
database upkeep. In contrast, fine-tuning excels at specializing LLMs for specific
domains but relies on static internal data and significant training resources. The hybrid
approach, although seeking to enable access to both internal and external data sources,
inherits the latency of RAG and the resource demands of fine-tuning, requiring
maintenance of both the model and the data repository.
Despite these challenges, the integration of domain knowledge into LLMs is a critical
area of research, as it holds the key to unlocking the full potential of conversational AI
systems in specialized domains.

2.4. Gap analysis and solution

While the literature offers insights into conversational AI, LLMs, and techniques for domain knowledge integration, it still leaves much to be desired, as there is no complete solution that brings all these components together seamlessly. Approaches so far have either concentrated on rule-based or limited-domain chatbots, or have relied on resource-intensive fine-tuning or knowledge-grounding techniques, which may not scale well over vast arrays of diverse knowledge.
One major missing link is the lack of a comprehensive, education-specific solution that incorporates domain-specific knowledge from academic institutions to improve learning experiences and provide students with reliable information. The majority of current approaches do not take into account
unique challenges and requirements posed within educational contexts, like seamless
institutional resource integration, personalized learning experiences, and the ability to
hold meaningful dialogues across various academic disciplines.
Also, we found another basic shortfall: the absence of scalable and efficient backend
infrastructure for integrating ChatGPT-like LLMs with ever-changing domain-specific
knowledge bases obtained from trusted sources such as academic institutions or
subject matter experts. Even though they look promising, fine-tuning and knowledge-
based methods usually take place during the training phase, which means that
adjusting the model’s knowledge base with new information or updated knowledge is
difficult.
This dissertation aims to fill these gaps by creating a strong backend system for
connecting ChatGPT to a database of specialized resources from Hanoi University;
investigating effective ways of consuming, processing, and retrieving domain-specific
information; and implementing mechanisms necessary for seamless integration with
OpenAI APIs. With this kind of infrastructure, we can achieve accurate responses in
natural language conversations that are relevant to context, education-oriented, and
institutionalized.

3. CHAPTER 2: CHATGPT

3.1. Introduction to ChatGPT

ChatGPT is an artificial intelligence program developed by OpenAI and first launched on November 30, 2022. The "GPT" in ChatGPT stands for "Generative Pre-trained Transformer," the cutting-edge technology developed by OpenAI. It is related to InstructGPT, a model designed to follow instructions and provide detailed responses (OpenAI, n.d.). The release of ChatGPT marked an important milestone in the field of artificial intelligence, especially natural language processing (NLP) and conversational AI, enabling machines to understand and generate human-like text responses.
The history of ChatGPT traces back to the development of the original GPT model, which laid the groundwork for subsequent iterations. The era of large-scale pre-trained models began with the release of GPT-1 in 2018. In contrast to prior natural language models, which were based on supervised learning, GPT-1 introduced a novel "semi-supervised" learning approach: the model was first pre-trained on unlabeled data and then fine-tuned on a small portion of labeled data to achieve generalization. The following year, 2019, GPT-2 was released as a new version pre-trained without supervision on larger datasets, which contributed significantly to text generation and generalization (Hua et al., 2023).
However, it was the launch of GPT-3 that truly propelled ChatGPT into the spotlight. Released in 2020, GPT-3 was designed to handle the majority of natural language processing tasks. Two years later, InstructGPT (GPT-3.5) came out, producing lower-risk outputs; it is an improved version of GPT-3 trained with Reinforcement Learning from Human Feedback (RLHF) (OpenAI, n.d.). In November of the same year, a chatbot called ChatGPT, built on InstructGPT and enhanced with a multi-round chat capability, surprised people around the world. In March 2023, OpenAI made GPT-4 accessible to customers on the waitlist and to ChatGPT Plus subscribers in a limited, text-only mode, although the model itself can work with both text and images. Despite its restricted availability, GPT-4 has drawn attention for outperforming its predecessor.
The accuracy of the GPT models has improved consistently from GPT-1 to ChatGPT. ChatGPT can accomplish various tasks such as text summarization, fiction writing, error diagnosis, paper writing, knowledge querying, and role-playing. Additionally, ChatGPT is a multilingual system that can
comprehend both computer languages and languages that have limited resources.
Furthermore, ChatGPT is knowledgeable in practically every field, including finance,
biology, business, healthcare, and mathematics. Some individuals have great
expectations for ChatGPT because of its exceptional text production, processing, and
understanding capabilities. They believe it will encourage social development.

3.2. How ChatGPT works

As stated by Gewirtz (2023), ChatGPT operates using a sophisticated AI model called a transformer architecture. This is a kind of deep learning neural network, a complex, multi-layered method that loosely mimics the structure of the human brain and enables ChatGPT to understand and generate human language. The "transformer" refers to the model's ability to transform input text into relevant and coherent responses. This process relies on the vast amounts of text data seen during the training phase, which allow the model to learn the patterns and nuances of language.
The process starts with pre-training the model on a large dataset containing texts of many kinds. The model learns the context and the connections between words, so that it can produce sentences in which the next word makes sense. After that, the model is fine-tuned through a supervised learning process in which it is presented with specific tasks and examples, with the ultimate goal of enhancing its accuracy and relevance.
According to the official OpenAI website, ChatGPT uses a technique called Reinforcement Learning from Human Feedback (RLHF), meaning that the model has learned from examples provided by human trainers. In this stage, human trainers interact with the model, playing both sides of a conversation, and provide feedback on the model's responses, guiding it to produce more helpful and accurate outputs. The resulting dialogue corpus was then blended with the InstructGPT dataset, reformatted into dialogue form. This step is crucial, as it helps the model understand the subtleties of human conversation, including tone, intent, and the flow of dialogue.
It's interesting to note that OpenAI's ChatGPT is built to remember previously asked
questions. This means it can gather context from prior questions users have asked and
use it to inform future conversations with them. Users can also request reworks and
revisions, and it will relate to what they were talking about previously. It makes
interaction with the AI feel more like a conversation (Guinness, 2023).
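When building on the OpenAI API rather than the ChatGPT web app, this conversational memory is something the developer reproduces by resending the earlier turns with every request. The sketch below shows the idea with the openai Python package (v1.x); the model name is illustrative.

    # Illustrative sketch: "memory" via resending the conversation history each turn.
    from openai import OpenAI

    client = OpenAI()
    history = [{"role": "system", "content": "You are a helpful assistant."}]

    def ask(user_message: str) -> str:
        history.append({"role": "user", "content": user_message})
        resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
        reply = resp.choices[0].message.content
        history.append({"role": "assistant", "content": reply})  # keep context for later turns
        return reply

    print(ask("Explain vector embeddings in one sentence."))
    print(ask("Now rephrase that for a first-year student."))  # relies on the stored context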
The result is a model that can engage in conversations across various topics, answer
questions, and even perform creative tasks like writing and code generation.
Furthermore, it can create essays, articles, and poetry. This capability to produce
human-like answers to a multitude of inquiries was among the primary reasons why
the community of users reached 100 million within the first two months after the app
was launched (Wilson, 2024).
However, despite its advantages, ChatGPT has limitations. It can occasionally give plausible-sounding but illogical or inaccurate responses. It is also sensitive to small variations in input phrasing, and, owing to biases in the training data, its answers are often verbose and overuse certain phrases (OpenAI, n.d.). Continuous updates and improvements are part of the model's lifecycle to enhance its performance.

3.3. ChatGPT applications

ChatGPT has a wide range of applications. One of its key applications is in education. According to Fraiwan and Khasawneh (2023), by analyzing student data, creating tailored learning paths, recommending resources, and providing assignment feedback, ChatGPT can offer a personalized learning experience. With ChatGPT, teachers can also construct tests, develop discussion topics, grade assignments, and more easily assist students with special needs and language learners. As this project demonstrates, ChatGPT can likewise be integrated into the education sector as a dedicated chatbot, tailored to each school's needs, that answers questions about the school's programs and services.

Another important field in which ChatGPT could be a great help is software
engineering. In this field, the ChatGPT model can offer several functionalities to
assist engineers in enhancing their effectiveness at work. Some of the most common
abilities that can be listed are code generation, debugging, and software testing.
Moreover, ChatGPT can help with tasks like documentation creation, collaboration,
and natural language processing. Studies show that language models are useful for
predicting variable types, recognizing frequent code snippets, locating software
mistakes, and enhancing code quality (Fraiwan & Khasawneh, 2023).
According to Bahrini et al. (2023), in healthcare ChatGPT supports rapid symptom checking and identification, provides personalized health advice, and can help improve patient outcomes. Furthermore, in biotechnology and medicine, it increases efficiency, reduces costs, and improves the health of patients or animals. However, the field faces legal, regulatory, ethical, and privacy difficulties, such as data protection and security, as well as potential errors and biases arising from the system.
Finally, thanks to its conversational features, ChatGPT is a popular choice for chatbots and personal assistants. By helping users remember appointments, meetings, and other important occasions, ChatGPT not only helps users stay organized and prepared for their daily activities, but its ability to set reminders for events also helps them stay on top of their tasks without having to constantly check their calendars (Lowery & Borghi, 2024).

3.4. Strengths and weaknesses

ChatGPT has a number of strengths worth discussing. The first advantage of ChatGPT lies in its ability to maintain a conversation that closely mimics human interaction (Rampton, 2023). This feature allows the model to understand context, remember previous queries, and provide coherent and relevant responses, making it an excellent tool for customer service, content creation, and even educational purposes.
Another notable advantage of ChatGPT is its versatility. The model can be used on various platforms and integrated with different software, which makes it a flexible solution for many industries. Moreover, thanks to the enormous body of data it was trained on, ChatGPT can not only summarize information from multiple sources but also generate new content, adapt information quickly, and make broad connections (AVID Open Access, n.d.).
Additionally, the responses generated by ChatGPT are crafted in a polished and easily
digestible manner, enhancing readability and user experience. This user-friendly
approach makes ChatGPT a preferred choice for individuals and businesses seeking
efficient and effective communication solutions.
However, despite its benefits, ChatGPT has a number of drawbacks. As stated by
OpenAI (n.d.), ChatGPT can occasionally give responses that seem reasonable but are
actually inaccurate or meaningless. Several difficulties have been involved in the
process of trying to address this issue. First of all, in reinforcement learning (RL)
training, there is currently no clear source of truth to guide the model's learning
process. Second, trying to train the model to be more conservative results in the
rejection of queries that could have been answered correctly. And last but not least,
supervised training can fool the model because the appropriate response is determined
by the model's knowledge rather than the trainer's expertise.
Another drawback of ChatGPT is that it is sensitive to small changes in input phrasing or to repeated prompts. For instance, when a user enters a question, the chatbot may claim not to know the answer, yet when the question is slightly rephrased, the model can answer correctly. Furthermore, responses are often verbose and frequently mention that the system is a language model trained by OpenAI. These problems stem from biases in the training data, which favor longer answers, as well as from known issues with over-optimization.
Ideally, when faced with an ambiguous question, the model would ask for clarification; in reality, current models usually try to guess the user's intention instead. Moreover, despite efforts to train the model to reject inappropriate requests, there are cases where it may still respond to harmful instructions or exhibit biased behavior.
One more disadvantage of ChatGPT is that the model can fabricate citations. When asked for the sources of its responses, it may provide references that look reliable but are in fact invented. It is crucial to understand that AI can generate responses confidently even without valid data, much like a person experiencing hallucinations may speak confidently without logical reasoning. Attempting to locate these sources through Google or the library will yield no results (Rozear & Park, 2023).

4. CHAPTER 3: SYSTEM ARCHITECTURE AND DESIGN

4.1. System requirements

For the successful development of a resilient back-end infrastructure that interoperates
with the Hanoi University domain-knowledge data source, the first step is to determine
an extensive list of key requirements that guide the design, implementation, and
evaluation of the system. These requirements stem from a careful examination of
stakeholder needs, technical limitations, and the overall mission of the project.

4.1.1. Functional requirements

• The system has the capability of ingesting and processing a mix of documents from
Hanoi University in various domains.
• The system applies sophisticated text-to-numbers methods to generate contextual
numerical representations (embeddings) that capture semantic meaning.
• The system should provide a purpose-built database optimized for efficiently storing
and retrieving high-dimensional embeddings, ensuring that only the document chunks
relevant to the user's query are retrieved.
• Algorithms and techniques are required for measuring the similarity between
embeddings and returning the documents relevant to the user's question.
• A robust RESTful API that allows seamless communication with frontend
applications should be implemented by this system.

4.1.2. Non-functional requirements

• The system will provide an interface through which administrators can manage the
knowledge bases and end-users can interact with conversational agents seamlessly.

• The system needs to be developed with downtime minimization in mind and has to
give precise and consistent replies, which will not only ensure a positive experience
for users but also curb negative behaviors by ensuring reliability.
• The design should be modular and extensible to accommodate additional components
or the adaptation of the system to new domains in future use cases.
• Even under high-load conditions, the system has to offer quick and efficient responses
that do not keep users waiting too long.
• The codebase should be documented well enough to meet industry quality standards
and guidelines so that future developers can improve or upgrade it.
These requirements ensure that the backend structure complies with the project targets
and stakeholders’ expectations. Continual validation and testing against the
requirements mentioned above will be necessary for the smooth development of the
solution and for assuring the best outcome.

4.2. System architecture

The backend architecture is modular and scalable; the main components are linked
together as a framework in which Hanoi University's knowledge is integrated with
ChatGPT to run seamlessly. The overall architecture can be divided into two main
subsystems: Chatbot FE and Chatbot BE.
The Chatbot FE, acting as the user interface, handles both text-based and
conversational interactions between users and the system. It keeps user inputs in
memory, displays responses, and tracks the flow of the conversation. The FE interacts
with the BE through a RESTful API, transmitting user queries and receiving relevant
document chunks or generated responses.
The Chatbot BE is the core of the infrastructure, comprising the following key
components:
• Document ingestion and processing pipeline: This component takes care of the intake
and preprocessing of Hanoi University's rich corpus of documents, both academic and
administrative in nature. It supports the CSV format, performs text extraction, and
partitions large documents into small chunks that are easy to manage.
• Embedding generator: This module converts textual content from document chunks
and user queries into high-dimensional vector representations (embeddings). These
embeddings encode the semantic relationships in the text and its subtle contextual
details.
• Vector database: This database is optimized for high-dimensional embeddings and
uses Approximate Nearest Neighbor (ANN) search algorithms and indexing strategies
for fast similarity computations.
• Vector search and retrieval: This component runs the similarity search over the stored
embeddings. Using the query embedding and cosine similarity or other distance
metrics, it identifies the most semantically relevant document chunks and assigns their
scores.
• API development: This layer acts as the central interface, handling communication
between the backend components and the chatbot interface.
The modular architecture of the system is designed in such a way that it can be scaled,
maintained, and extended so that it can handle the growing amount of data, varied
requirements, and integration of new components or functionalities in the future.

4.3. Database schema

4.3.1. Corpus storage schema

The CSV files serve as the medium through which we integrate the domain-specific
corpus of Hanoi University, a collection of texts already prepared and stored within
our system. These CSV files support the structured gathering of records from the
various university departments and sources. They act as a staging repository, ensuring
that the numerous documents present at the university remain cohesive and accessible.

Figure 1: A CSV file storing documents about HANU educational program
As illustrated in Figure 1, the CSV schema mirrors the database table design,
consisting of the following fields:
• Title: The informative heading of the document.
• Summary: The succinct outline that reflects the core content of the
document. This should use keywords, tags, or abbreviations.
• Content: The contents of the document in terms of the textual substance.
• URL: The unique URL to find and retrieve the source document.
• Contributor: The people who have created the document or who have
contributed to its content.
Such a model suits a university knowledge repository well, making it possible to
provide academics with the intellectual content they need.

4.3.2. Vector database schema

For the underlying structure of the vector database, a strong schema is needed to store
the resulting embeddings together with the documents from the relevant domain. This
ensures that, when the vector search algorithm is run, the expected result is obtained,
i.e., similar vectors along with their corresponding document content.
Figure 2: A table about HANU educational program in Vector Database
Figure 2 shows the core table of the vector database, housing the following key attributes:
• id (INTEGER): The unifying identifier for every document segment.
• content (TEXT): The combined text of the documents, including the “Title”
of the original document, a concise “Summary”, a segment of textual
“Content”, the source “URL”, and information about the document's
“Contributor”.
• embedding (VECTOR): The high-dimensional embedding that represents the
semantic meaning of the content chunk.
Together, the coupling of the structured CSV files with a tailored vector database
architecture creates a strong framework for handling and querying the university's
domain-specific knowledge assets. It makes it possible for LLMs such as ChatGPT to
power a conversational AI connected to a vast, well-arranged, and easily accessible
pool of institutional knowledge.

4.4. Pipeline workflows

4.4.1. System workflows

The system's comprehensive workflow, encompassing both backend and frontend
components, is depicted in Figure 3, page 27.

Figure 3: System workflows of the HANU chatbot
As shown in Figure 3, there is a workflow that connects ChatGPT and Hanoi
University's knowledge corpus stored in CSV files.
The first primary task is preprocessing at the back-end side, where an in-built data
ingestion pipeline is used for consuming the sequence of CSV files with appropriate
preprocessing steps. The chunking pipeline (1) separates large texts into smaller,
easier-to-digest pieces, which limits memory usage and balances token consumption.
Each text chunk is paired with its metadata and then passed to the text embedding
generator component (2), which uses OpenAI's APIs to convert the textual data into a
multi-dimensional vector representation. The created embeddings, together with the
concatenated document metadata, are systematically stored in a vector database (3)
designed for handling high-dimensional data. Vector-specific functionalities and
indexing mechanisms built into the database allow for fast and easy retrieval.
On the client side, when a user submits a query (4), a POST request (5) is sent to the
backend server via the developed RESTful API.
The backend workflow then mirrors the embedding generation process (6) applied to
the corpus documents, transforming the user's query into a query embedding.
Afterward, it examines the vector database by performing the similarity search (7) so
that the query embedding can be used to determine the relevant document chunks
based on their semantic proximity.
Subsequently, the front end is fed with the relevant pieces (8) that were returned. The
user's query, document chunks, and any other context (i.e., previously stored response
history) from Session Storage (9) are then processed by the API of OpenAI (10). The
client communicates with the ChatGPT integration layer, which creates sophisticated
answers that are tailored to the question asked (11). The present context of the
interaction is then saved in Session Storage (12) for further recall, and the response is
finally shown to the user (13).
The core of the system is thus a seamless workflow spanning the backend and frontend:
document ingestion, text processing, embedding creation, vector storage and retrieval,
API communication, and context management. It serves as a bridge between
ChatGPT's text-generation ability and the domain-specific knowledge resources of
Hanoi University, letting users engage in a lively dialogue that supports their learning
and consulting needs.

4.4.2. Backend workflows

The backend does not work independently but rather in a series of intertwined
workflows to facilitate integration between ChatGPT and Hanoi University's domain-
specific knowledge corpus. In other words, several processes take place for efficient
communication within the system: document ingestion, text processing, embedding
generation, vector storage and retrieval, and API communication.

Figure 4: Backend workflows of the HANU chatbot


As depicted in Figure 4, the backend system manages the following tasks:

• Step 1: Collecting and chunking documents. The HANU documents are
not used as they are but are pre-processed into smaller units (chunks) that can
easily be used to generate responses.
• Step 2: Generating embeddings. Each chunk is passed through an embedding
model to generate a vector representation that captures its semantic meaning.
• Step 3: Storing embeddings. These vectors are stored in a dedicated vector
database for easy retrieval when needed.
• Step 4: Processing user questions. When a user asks a question, a POST
request is sent to the backend server. The user's question goes through the same
process that leads to the formation of query embedding.
• Step 5: Searching for relevant documents. The system queries the vector
database, performing an ANN search for embeddings that are highly similar to the
query embedding. The relevant documents whose embeddings score the highest
similarity are then retrieved.
• Step 6: Returning relevant documents. Relevant documents are returned
through the system to the chatbot's front end based on the API channel. These
chunks will be used by the front-end application to return a comprehensive
response back to the user based on their query.
These workflows thus work together, complementing each other in such a way that
they serve as a bridge between ChatGPT's natural language capabilities and Hanoi
University's rich domain-specific knowledge assets. Thus, all potential users are able
to have rich conversations that are highly contextualized with regard to their academic
or professional needs.

5. CHAPTER 4: IMPLEMENTATION AND DISCUSSION

5.1. Platforms and tools

The backend infrastructure takes advantage of an innovative combination of platforms
and tools, all meant to guarantee efficiency as well as robust performance. The
subsequent sections highlight the technologies used throughout the implementation
process.

5.1.1. Python programming language and libraries

Python is used as the main development platform for the backend system. It is a
versatile, widely adopted programming language whose rich ecosystem of libraries and
frameworks, combined with its simplicity and readability, makes it easier to create the
NLP, data manipulation, API, and other components of the system.
To be more specific, the project leveraged several outstanding Python libraries, each
playing a crucial role in various aspects of the system:
• OpenAI: The openai library enabled smooth and direct communication
between the backend system and powerful language models of OpenAI. The
purpose of this library is to make it possible for the system to efficiently embed
textual data into numerical representations that encapsulate semantic
information.
• Psycopg2: Since the backend system uses PostgreSQL for robust data storage and
management, the psycopg2 library provided a Python interface well suited to
interacting with the PostgreSQL database. This library allowed the creation of safe
connections, the execution of SQL queries, and the extraction and manipulation of
data within PostgreSQL.
• Pgvector: The pgvector library served as a core component in connecting the
backend system to the PostgreSQL database through the vector extension. This
integration was key to implementing a system in which document embeddings are the
basic building blocks of its functionality.
• Flask: The Flask framework acted as the keystone and mediator between the user
interface and the back-end system, allowing information to flow between them easily.
Creating API endpoints, handling HTTP requests and responses, and integrating
smoothly with the other system components were all simplified thanks to Flask's
simple and modular design.
• Pandas: Among the many Python libraries available, pandas, which specializes in
data manipulation and analysis, stood out as the one handling the complicated data
structures within the project. Its intuitive data structures and data-manipulation
functions made data transformation operations faster and easier.
These key libraries, along with others, can be installed using well-known package
managers like pip and Conda, which requires several installation commands and
library imports throughout the development process.
With Python and this set of supporting libraries and frameworks, the architecture links
the different modules together and performs complex data operations, resulting in a
reliable and scalable backend infrastructure for sharing data between the university
community and the language model.

5.1.2. PostgreSQL vector database

Robust data management is achieved through the careful integration of PostgreSQL,
one of the most powerful, dependable, and widely used DBMSs across industries.
The project implements the pgvector extension within the PostgreSQL instance in
order to handle the storage and retrieval of complex, high-dimensional document
embeddings competently. This extension equips PostgreSQL with capabilities designed
specifically for vector data administration, so that operations run smoothly and
performance is improved.

Also, to increase data retrieval efficiency, the project utilizes the ivfflat indexing
method. Ivfflat (an inverted file index with flat compression) can improve query
searchability and accuracy over the embeddings, speed up retrieval, and enable the
quick and efficient discovery of documents that are thematically relevant to the corpus.
These actions serve as the pillars of data management, which assures the reliability
and efficiency of the system.

5.1.3. GitHub version control system

Cooperative development, as well as code management and version tracking, is
supported through the use of GitHub, one of the most popular version control
platforms (VCS). Not only does GitHub make it possible to immediately exchange,
consume, and adjust the source code, but it also serves as a centralized repository for
storing and governing the code base, making it possible to trace the progress of
development by following changes, fixes, and upgrades over time.

5.2. Document ingestion and preprocessing pipeline

To efficiently handle the mixture of different documents from Hanoi University, a
chunking pipeline is introduced. It splits the text obtained from the available document
repository into smaller units, which lets the system process large documents with ease
before feeding them to the embedding model.
The data ingestion process commences with the execution of the “collect_data”
function, as shown in Figure 5.

Figure 5: Code snippet – Get HANU documents from a CSV file


This method (Figure 5) loads the knowledge-specialized documents from the dataset
within a specific CSV file. These collections are gathered from many sources
(websites, documents, and announcements from the Hanoi University campus) and
then organized according to the designed schema. Within the current chatbot scope, we
prepare two document domains: “educational_program” and “public_administration”.
Only the relevant columns are loaded (i.e., Title, Summary, Content, URL,
Contributor), ensuring that only the necessary information is considered.
The next building block is the “get_num_tokens_from_string” function, which counts
the number of tokens contained in a given text string, as depicted in Figure 6.

Figure 6: Code snippet – Get the number of tokens from a string


Since LLMs measure the length of input data in tokens, the size of the chunks must be
determined appropriately using this helper function (Figure 6). After researching the
existing Python libraries for token handling, the tiktoken library was chosen to encode
text using the specified encoding scheme, “cl100k_base” (a vocabulary of roughly
100,000 tokens used by OpenAI's recent chat and embedding models). The length of
the encoded string is then obtained with the built-in len() function.
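A possible implementation of this helper, based on the tiktoken usage described above, is sketched below.

    import tiktoken

    def get_num_tokens_from_string(text: str, encoding_name: str = "cl100k_base") -> int:
        # Encode the text with the chosen tokenizer and count the resulting tokens.
        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))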
Concluding the pipeline, the “get_chunked_data” function (Figure 7, page 34) acts as
the conductor of the chunking process, receiving the corpus dataset (from the
“collect_data” function) and the maximum token limit as inputs and returning fully
chunked data with approximately equal token lengths.

Figure 7: Code snippet – Chunk HANU documents into smaller units

According to Figure 7, this function begins by looping through all the documents to
check whether the length of their content exceeds the allocated token length (in this
scenario, set at 400 tokens, approximately equivalent to 300 words). Documents
within this limit can be directly appended to the chunk list. Conversely, documents
surpassing the token limit undergo segmentation into smaller chunks based on an ideal
chunk size (ideal_size), derived as a fraction of the maximum token limit. Chunks are
formed iteratively until their token count reaches the ideal size or the end of the
content is reached. Finally, the function returns a DataFrame containing the chunked
data, with the same columns as the input data.
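A simplified sketch of this chunking logic is given below; it reuses the token counter above, and the sentence-based splitting as well as the 0.75 fraction used for the ideal chunk size are assumptions made for illustration only.

    import pandas as pd

    def get_chunked_data(data: pd.DataFrame, max_tokens: int = 400) -> pd.DataFrame:
        chunks = []
        ideal_size = int(max_tokens * 0.75)  # assumed fraction of the maximum token limit
        for _, row in data.iterrows():
            content = str(row["Content"])
            if get_num_tokens_from_string(content) <= max_tokens:
                chunks.append(row.to_dict())
                continue
            # Split long documents on sentence boundaries and accumulate sentences
            # until a chunk approaches the ideal size.
            current = ""
            for sentence in content.split(". "):
                candidate = f"{current} {sentence}".strip()
                if current and get_num_tokens_from_string(candidate) > ideal_size:
                    record = row.to_dict()
                    record["Content"] = current
                    chunks.append(record)
                    current = sentence
                else:
                    current = candidate
            if current:
                record = row.to_dict()
                record["Content"] = current
                chunks.append(record)
        return pd.DataFrame(chunks, columns=data.columns)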
This chunking pipeline allows the system to effectively process and manage large and
diverse documents from Hanoi University's data warehouse. Dividing large documents
into smaller parts makes it easier to organize and retrieve relevant excerpts as well as
to generate suitable input for the embedding model.
5.3. Embedding generation

The embedding generation pipeline makes use of the capabilities of OpenAI's
embedding models, specifically the “text-embedding-3-small” model, which is
designed for generating high-quality embeddings from text input. This model scores
62.3% on the MTEB benchmark, accepts inputs of up to 8,191 tokens, and is priced at
roughly 62,500 pages per dollar (OpenAI, n.d.).
In particular, the code snippets provided (Figures 8, 9, 10, and 11) illustrate the
functions involved in this process. It starts by combining all the necessary information
from each gathered document into a single string that can be used as the input to the
embedding model or the vector database, as depicted in Figure 8.

Figure 8: Code snippet – Combine metadata from HANU documents


Figure 8 introduces the “combine_all” function, which combines the values from all
columns (“Title”, “Summary”, “Content”, “URL”, and “Contributor”) into a single
string. This combined string stores the metadata associated with each chunk, providing
additional context for future retrieval. In contrast, “combine_text_only” combines only
the values from the “Title”, “Summary”, and “Content” columns (excluding the “URL”
and “Contributor” columns). This combined text is the primary input for the embedding
model, as it contains the core textual content that needs to be semantically represented.
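A hedged sketch of these two helpers, operating on a single DataFrame row, might look as follows; the exact separator strings are assumptions.

    import pandas as pd

    def combine_all(row: pd.Series) -> str:
        # Keep every schema field so the stored chunk carries its full metadata.
        return (f"Title: {row['Title']}; Summary: {row['Summary']}; "
                f"Content: {row['Content']}; URL: {row['URL']}; "
                f"Contributor: {row['Contributor']}")

    def combine_text_only(row: pd.Series) -> str:
        # Only the core textual fields are embedded; URL and Contributor are left out.
        return f"Title: {row['Title']}; Summary: {row['Summary']}; Content: {row['Content']}"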
Next, to generate embeddings, it is essential to create an OpenAI client object with
security in mind, as shown in Figure 9, page 36.
Figure 9: Code snippet – Initialize OpenAI client
As captured by Figure 9, the dotenv library is employed to retrieve the API key (an
account on the OpenAI platform is required to obtain one) from environment variables.
This allows the API key, stored securely in a .env file, to be used for interacting with
the OpenAI API through the client object.
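Assuming the modern openai Python client (version 1.x) and the python-dotenv package, the initialization could be sketched as:

    import os
    from dotenv import load_dotenv
    from openai import OpenAI

    load_dotenv()  # reads OPENAI_API_KEY from the local .env file
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))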
Following this, the “get_embedding” function (Figure 10) is coded to accept a text and
a model parameter as inputs, with the default being “text-embedding-3-small”.

Figure 10: Code snippet – Get embedding from HANU documents


This method (Figure 10) first replaces newline characters (\n) with spaces (‘ ’) to
ensure proper tokenization. It then invokes the OpenAI API client's
“embeddings.create” method, passing the input text and the specified embedding
model. The resulting embedding is extracted and returned with a length of 256
dimensions (instead of the default 1,536 dimensions, for more efficient retrieval).
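Under those assumptions, the embedding helper could be sketched as below; the client object comes from the initialization above, and the dimensions parameter reduces the output to 256 values.

    def get_embedding(text: str, model: str = "text-embedding-3-small") -> list:
        # Newlines can hurt tokenization, so they are replaced with spaces first.
        text = text.replace("\n", " ")
        response = client.embeddings.create(input=[text], model=model, dimensions=256)
        return response.data[0].embedding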
Lastly, leveraging the introduced methods, the “process_data” function (Figure 11)
takes the chunked corpus data and the output file path as inputs, orchestrating the
embedding generation pipeline.

Figure 11: Code snippet – Embed HANU documents (full steps)


As described in Figure 11, two new columns are created in the dataset. The
“Combined” column stores the output of the “combine_all” function, carrying the
metadata from all columns, while the “Text” column holds the output of the
“combine_text_only” function, consisting of only the core textual content of the
documents. The embedding generation process is then applied to the “Text” column,
with the resulting embeddings stored in the “Embedding” column. Finally, the updated
DataFrame is saved to a new CSV output file for later review and reference.
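A compact sketch of this orchestration step, reusing the helpers sketched earlier, might be:

    def process_data(chunked_df, output_path: str):
        chunked_df = chunked_df.copy()
        chunked_df["Combined"] = chunked_df.apply(combine_all, axis=1)    # full metadata
        chunked_df["Text"] = chunked_df.apply(combine_text_only, axis=1)  # core text only
        chunked_df["Embedding"] = chunked_df["Text"].apply(get_embedding) # one API call per chunk
        chunked_df.to_csv(output_path, index=False)  # keep a copy for later review
        return chunked_df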
By separating the concerns of core text and metadata, the system guarantees that the
embeddings represent the true semantic relationships of the core content while
preserving the metadata for later retrieval.

5.4. Vector database

The chatbot system utilizes a specialized vector database to store and retrieve the
embeddings generated from the chunked corpus. After careful consideration, we
decided to leverage PostgreSQL, a widely adopted database management system,
along with the pgvector extension and ivfflat indexing scheme, for handling vector
data.
Consequently, we must implement the core functions responsible for setting up the
vector database and ingesting into it the embeddings obtained from the preprocessing
step. This begins with the creation of a plain database within the PostgreSQL instance
(Figure 12), which will later be equipped as a fully functional vector database.

Figure 12: Code snippet – Create a normal database


As is evident from Figure 12, the “create_database” function serves as the primary tool
for creating a new database within the PostgreSQL instance. It first verifies whether
the database already exists and, if not, proceeds to create it. Given that PostgreSQL
lacks native support for the “IF NOT EXISTS” option within the “CREATE
DATABASE” SQL command, a subquery is implemented to fulfill this requirement,
checking whether a database with the same name already exists. Furthermore, the
function ensures that all privileges on the freshly created database are granted to the
“postgres” user, the default administrative user configured during PostgreSQL
installation.
Based on this, the “init_database” function initializes the database by creating a
connection to the default “postgres” database and invoking the “create_database”
function with the desired database name, later set to “hanu_chatbot”.
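A minimal sketch of these two functions is shown below; the connection credentials (user, password, host) are placeholders rather than the project's actual configuration.

    import psycopg2

    def create_database(cur, db_name: str) -> None:
        # PostgreSQL has no IF NOT EXISTS for CREATE DATABASE, so check pg_database first.
        cur.execute("SELECT 1 FROM pg_database WHERE datname = %s", (db_name,))
        if cur.fetchone() is None:
            cur.execute(f'CREATE DATABASE "{db_name}"')
        cur.execute(f'GRANT ALL PRIVILEGES ON DATABASE "{db_name}" TO postgres')

    def init_database(db_name: str = "hanu_chatbot") -> None:
        conn = psycopg2.connect(dbname="postgres", user="postgres",
                                password="postgres", host="localhost")
        conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
        with conn.cursor() as cur:
            create_database(cur, db_name)
        conn.close()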
At the core of our database setup lies the “create_extension” function, which allows us
to transform our database into a fully functional vector database (Figure 13).

Figure 13: Code snippet – Create vector extension within a database


Figure 13 shows that this process comprises two steps: creating the database extension
with the SQL statement “CREATE EXTENSION vector” and registering the vector
data type with the connection through the “register_vector” function from the pgvector
library. The database is then ready to store and search vector data, enabling it to
perform operations on the stored embeddings.
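Those two steps can be sketched as:

    from pgvector.psycopg2 import register_vector

    def create_extension(conn) -> None:
        with conn.cursor() as cur:
            cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.commit()
        # Registering the type lets psycopg2 send and receive vector values transparently.
        register_vector(conn)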
Following the database configuration, the “create_index” function represents the
second crucial step (Figure 14).

Figure 14: Code snippet – Create ivfflat index within a table

Figure 14 depicts how the “create_index” function establishes the ivfflat index on the
“embedding” column. To be more specific, the “lists” parameter in the “CREATE
INDEX” statement specifies the number of lists used in the index structure. Each list
in the ivfflat index represents a cluster of embeddings. During the indexing process,
embeddings are grouped into these lists based on their proximity to certain centroids;
in this case, the index organizes embeddings into 100 distinct lists. Designed for
expedited ANN searches on embeddings, this index ensures swift retrieval of document
chunks that closely match a given query embedding.
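A sketch of this indexing step, assuming cosine distance is the metric used at query time, could be:

    def create_index(conn, table_name: str, lists: int = 100) -> None:
        with conn.cursor() as cur:
            cur.execute(
                f"CREATE INDEX IF NOT EXISTS {table_name}_embedding_idx "
                f"ON {table_name} USING ivfflat (embedding vector_cosine_ops) "
                f"WITH (lists = {lists})"  # 100 clusters, as described above
            )
        conn.commit()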
Sequentially, creating an index necessitates the prior existence of a table. This is where
the “create_table” and “init_table” functions come into play (Figure 15).

Figure 15: Code snippet – Create a vector table (full steps)


As shown in Figure 15, the “create_table” function is responsible for creating a table
within the database to store the document chunks along with their corresponding
embeddings. The table schema includes columns for the primary key (id), the
document content (text), and the embedding vector (vector(256)), with 256 being the
number of vector dimensions.
On this basis, a vector-equipped table is set up by the “init_table” method (Figure 15).
The function first establishes a connection to the designated database and then
configures the essential extensions to facilitate efficient handling of vector data. It
proceeds to create a table within the database for the storage of document segments
and their corresponding embeddings, and finally builds the index on the embedding
column to enhance retrieval efficiency.
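The table creation and the combined initialization step could be sketched as follows; the credentials are again placeholders, and the helper names match the sketches above.

    def create_table(conn, table_name: str) -> None:
        with conn.cursor() as cur:
            cur.execute(
                f"CREATE TABLE IF NOT EXISTS {table_name} ("
                f"id SERIAL PRIMARY KEY, content TEXT, embedding vector(256))"
            )
        conn.commit()

    def init_table(db_name: str, table_name: str) -> None:
        conn = psycopg2.connect(dbname=db_name, user="postgres",
                                password="postgres", host="localhost")
        create_extension(conn)   # enable pgvector and register the vector type
        create_table(conn, table_name)
        create_index(conn, table_name)
        conn.close()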
With the vector database and vector table now set up, the next momentous step is to
load the processed data into the database. This involves transferring the pre-processed
documents, along with their corresponding vectors, into the designated tables.
For that reason, the “load_data” function is responsible for ingesting the embeddings
generated from the chunked corpus into the database table (Figure 16).

Figure 16: Code snippet – Load training data into the vector table
As presented in Figure 16, the function reads the embeddings saved in the CSV file
(produced by the “process_data” function). It then constructs a list of tuples containing
the document content and the corresponding embedding and inserts them into the
specified table using a bulk insert operation (cur.executemany).
From there, the “store_data” function (Figure 16) acts as a wrapper around the
“load_data” function. It takes the file path containing the embeddings, the database
name, and the table name as input and calls “load_data” with the appropriate
parameters before closing the connection to the database.
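The loading step could be sketched as below; parsing the “Embedding” column with ast.literal_eval is an assumption that follows from storing the embeddings as text in the intermediate CSV file, and the credentials are placeholders.

    import ast
    import numpy as np
    import pandas as pd

    def load_data(conn, table_name: str, csv_path: str) -> None:
        df = pd.read_csv(csv_path)
        rows = [(row["Combined"], np.array(ast.literal_eval(row["Embedding"])))
                for _, row in df.iterrows()]
        with conn.cursor() as cur:
            # Bulk insert of (content, embedding) pairs.
            cur.executemany(
                f"INSERT INTO {table_name} (content, embedding) VALUES (%s, %s)", rows)
        conn.commit()

    def store_data(csv_path: str, db_name: str, table_name: str) -> None:
        conn = psycopg2.connect(dbname=db_name, user="postgres",
                                password="postgres", host="localhost")
        register_vector(conn)  # needed so numpy arrays map onto the vector column
        load_data(conn, table_name, csv_path)
        conn.close()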
Having all the necessary methods in place, we can now start importing the training
dataset into the vector database. This chatbot training process begins with initializing
all the necessary infrastructure, such as media folders and databases, as illustrated in
Figure 17.

Figure 17: Code snippet – Prepare for chatbot training


Initially, the function (Figure 17) sets up the necessary infrastructure to store and
manage the embeddings obtained from the corpus processing phase. It creates a
dedicated folder (../documents/embedded_data) to store the embeddings in CSV format
for future reference. Additionally, it calls the “init_database” function to create the
database and initialize the necessary tables and indexes.
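A minimal sketch of this preparation step could be the following; the function name prepare_training is hypothetical.

    import os

    def prepare_training(db_name: str = "hanu_chatbot") -> None:
        # Folder for the intermediate CSV files that hold the generated embeddings.
        os.makedirs("../documents/embedded_data", exist_ok=True)
        init_database(db_name)  # create the database before any tables are added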
Afterward, the domain-specific documents collected from Hanoi University’s assets
are pre-processed and loaded into the vector database, following the schema design
clarified earlier. Most of the declared methods prove their worth within this pipeline,
as Figure 18 illustrates.

Figure 18: Code snippet – Load corpus into vector database (full steps)
In accordance with Figure 18, the “load_corpus” function orchestrates the entire
process of ingesting the corpus into the vector database. Taking the database name and
the chatbot name (representing the corpus domain) as input, the function initializes a
dedicated vector table for storing the domain embeddings, then retrieves all the CSV
files containing the corpus data from the specified directory. For each detected file, it
calls the “collect_data” and “get_chunked_data” functions to collect and split the
stored data into smaller units. The chunked data is then embedded by the
“process_data” function and stored in a new CSV file for future reference. Finally, the
“store_data” function is invoked to ingest the embeddings and documents from the
generated CSV file into the corresponding table in the vector database.
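Combining the previously sketched helpers, the orchestration could look roughly like this; the folder layout used in the glob pattern is an assumption based on the description above, not the project's actual directory structure.

    import glob
    import os

    def load_corpus(db_name: str, chatbot_name: str) -> None:
        init_table(db_name, chatbot_name)
        # Assumed layout: one folder per knowledge domain, one CSV file per document batch.
        for csv_file in glob.glob(f"../documents/{chatbot_name}/*.csv"):
            raw = collect_data(csv_file)
            chunked = get_chunked_data(raw, max_tokens=400)
            out_path = os.path.join("../documents/embedded_data",
                                    os.path.basename(csv_file))
            process_data(chunked, out_path)              # chunks + embeddings -> CSV
            store_data(out_path, db_name, chatbot_name)  # CSV -> vector table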
Consequently, the HANU chatbot’s bank of domain-oriented knowledge can grow
steadily using the commands listed in Figure 19.

Figure 19: Code snippet – Prepare data for two chatbots


As can be observed in Figure 19, the “load_corpus” function is invoked twice, once
for the “educational_program” and once for the “public_administration” corpus,
ingesting the respective embeddings into the vector database. Other domains of
knowledge can be added following the same principle, as long as all documents are
gathered in a series of CSV files in a specific folder (with the folder name matching
the database name). This ensures the scalability of the system when future requests to
extend or adjust the domain-specific knowledge in the database are made. As the
university's knowledge assets evolve, the system can accommodate these changes by
ingesting updated documents and deleting old ones, ensuring that the conversational AI
remains up to date and accurately reflects the institution's domain-specific expertise.
In short, by leveraging this implementation, the system can efficiently store and
retrieve embeddings from the chunked corpus, enabling fast similarity searches and
the identification of the most relevant document chunks in response to user queries.

5.5. Vector searching and retrieving

The backbone of the proposed system's ability to deliver accurate and relevant
responses lies in its efficient vector searching and retrieval capabilities. Once the
embeddings generated from the chunked corpus are stored in the specialized vector

database, the system can leverage these embeddings to perform similarity searches,
identifying the most relevant document chunks in response to user queries.
From there, the “vector_search” function (Figure 20) orchestrates the process of
retrieving the top-ranked relevant document chunks based on their semantic similarity
to a given user query.

Figure 20: Code snippet – Vector search


According to Figure 20, the algorithm leverages the power of PostgreSQL's SQL
queries and the pgvector extension to perform the similarity search on the stored
embeddings. The query starts with a Common Table Expression (CTE) named “temp”,
which calculates the cosine similarity between the “query_embedding” (the embedding
of the user’s question) and the embeddings of all document segments in the specified
table. The cosine similarity is computed, as recommended by OpenAI (n.d.), using the
“cosine_distance” function provided by the pgvector extension, and the result is
subtracted from 1 to obtain the similarity score (higher values indicate greater
similarity). The query then filters out low-similarity results (those whose score is lower
than 0.3, i.e., whose distance is high), orders the remaining rows by descending
similarity score, and returns the 15 top-ranked relevant document chunks.
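A sketch of the search function, following the description above, could be the following; the query embedding is assumed to be a numpy array registered via pgvector.

    def vector_search(conn, table_name: str, query_embedding,
                      top_k: int = 15, min_score: float = 0.3):
        sql = f"""
            WITH temp AS (
                SELECT content,
                       1 - cosine_distance(embedding, %s::vector) AS similarity
                FROM {table_name}
            )
            SELECT content, similarity
            FROM temp
            WHERE similarity >= %s
            ORDER BY similarity DESC
            LIMIT %s
        """
        with conn.cursor() as cur:
            # Return (content, similarity) pairs for the most relevant chunks.
            cur.execute(sql, (query_embedding, min_score, top_k))
            return cur.fetchall()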
Using this vector searching approach, the system can efficiently identify the most
semantically relevant document chunks to a given user query, enabling the delivery of
accurate and context-specific responses.

5.6. API development

In order to promote smooth interaction between system components, a RESTful API
was developed using the Flask web framework in Python. This API acts as a mediator
that handles user interactions and orchestrates the retrieval of data from the backend
systems.
Within the scope of this project, we aim to launch the backend on localhost port 8080,
allowing interaction with the chatbot UI on port 3000.

Figure 21: Code snippet – Initialize Flask application


Accordingly, Figure 21 demonstrates the steps to initialize the Flask application and
enable Cross-Origin Resource Sharing (CORS) so that requests sent from the frontend
application's address (http://localhost:3000) are allowed. The connection to the vector
database is also established for further use.
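Assuming the flask-cors package for CORS and a shared psycopg2 connection (credentials are placeholders), the initialization could be sketched as:

    import numpy as np
    import psycopg2
    from flask import Flask, request, jsonify
    from flask_cors import CORS
    from pgvector.psycopg2 import register_vector

    app = Flask(__name__)
    CORS(app, origins=["http://localhost:3000"])  # allow the chatbot UI to call the API

    # One shared connection to the vector database.
    conn = psycopg2.connect(dbname="hanu_chatbot", user="postgres",
                            password="postgres", host="localhost")
    register_vector(conn)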
Next, the @app.route decorator defines a route for handling POST requests to the
“/hanu-chatbot/educational-program” endpoint, as described in Figure 22.

Figure 22: Code snippet – Implement API endpoint for relevant documents
This endpoint (Figure 22) is responsible for retrieving relevant document chunks from
the educational_program corpus based on the user's query. The request data is
extracted using request.get_json(), and the question parameter is obtained from the
JSON payload. If a question is provided in the request, the “get_embedding” function
is called to generate an embedding vector for the user's question. The “vector_search”
function is then invoked to retrieve the most relevant document chunks. Finally, a
JSON response is returned, containing the list of these chunks under the
“relevant_docs” key.
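A sketch of this endpoint, reusing the helpers from the previous sections, might look like:

    @app.route("/hanu-chatbot/educational-program", methods=["POST"])
    def educational_program():
        data = request.get_json()
        question = data.get("question") if data else None
        if not question:
            return jsonify({"error": "No question provided"}), 400
        query_embedding = np.array(get_embedding(question))
        results = vector_search(conn, "educational_program", query_embedding)
        # Return only the content of each matching chunk.
        return jsonify({"relevant_docs": [content for content, _ in results]})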
Lastly, as detailed in Figure 23, the API is designed to run locally on “localhost” with
port 8080 when executed as the “main” script.

Figure 23: Code snippet – Run the chatbot
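The entry point itself is short; a sketch is:

    if __name__ == "__main__":
        app.run(host="localhost", port=8080)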


Eventually, we connect the backend chatbot with the frontend client on the local
network, where the HANU chatbot can demonstrate its full capabilities. With this
RESTful API implementation, the backend system can communicate with frontend
applications, so the interfaces are able to submit user queries and receive appropriate
document fragments in response. The API operates as a bridge, hiding the complexity
of the back-end technology, such as the vector database and retrieval processes, and
providing a standard interface for interaction.
The modular design of the API is also helpful, as it makes the system easy to expand
and modify. Further endpoints can be added to handle different corpora or
functionalities, such as the already mentioned “public_administration” corpus. This
flexibility, in turn, ensures that the system can be scaled up and adapted to new
environments as well as to new fields of knowledge or new requirements that might
emerge in the future.
Generally, API development is a key element of the system through its capabilities of
smooth information exchange, modularity, and scalability. It is a core factor in the
effectiveness of the conversational AI experience within the HANU chatbot.

5.7. ChatGPT integration

We chose to implement the ChatGPT call on the front end to avert a possible
bottleneck on the backend server. However, it is also possible to perform this
integration in the backend, and there are advantages and disadvantages to consider for
each option.
Integrating ChatGPT on the backend could be achieved by defining a function that
obtains the answer from ChatGPT, in which case the backend must take care of the
entire process of integrating with OpenAI's API, including preparing the context and
the relevant documents used to generate the answer.
Firstly, it is essential to set up a method that lets us take advantage of the text
generation power of the chosen LLM. Given our need for a model with a large context
window of 16,385 tokens and improved formatting capabilities, particularly for
non-English languages, we opted for the “gpt-3.5-turbo-0125” model, as shown in
Figure 24. This model can receive up to 15 input segments of 400–500 tokens each
(400 tokens of document content, plus whatever remains for the metadata specifying
title, summary, URL, and contributor). These are sent together with up to five recent
responses (context saved in session storage on the client) of up to 1,500 tokens each,
and the model returns up to 1,500 response tokens. Therefore, by limiting the length of
each chunk and sending a reasonable number of chunks and context items to the LLM,
we keep the number of tokens within the tolerance threshold of the selected model
(gpt-3.5-turbo-0125, with an allowed quantity of 16,385 tokens). This distribution can
be made more comfortable by using more advanced but costly models, such as GPT-4.

Figure 24: Code snippet – Get response from GPT-3.5 model


As reflected in Figure 24, the “get_completion_from_messages” function interacts
with the OpenAI API by passing the prepared list of messages to the
client.chat.completions.create method. It specifies the model used (“gpt-3.5-turbo-
0125”), the temperature controlling the randomness of the response (0 means no
over-creative answers are preferred), and the maximum number of tokens allowed in
the response (set to 1,500). The function returns the response generated by the GPT-3.5
model.
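A sketch of this wrapper around the chat completions API is shown below; it reuses the client object initialized earlier.

    def get_completion_from_messages(messages, model="gpt-3.5-turbo-0125",
                                     temperature=0, max_tokens=1500):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,   # 0 discourages overly creative answers
            max_tokens=max_tokens,     # cap the length of the generated reply
        )
        return response.choices[0].message.content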
Leveraging this ability, the “get_answer” function (Figure 25, page 47) is responsible
for preparing the necessary data to be used by the model.

Figure 25: Code snippet – Define body message for OpenAI API
According to Figure 25, the function begins by generating an embedding for the user's
question using the “get_embedding” function (Figure 10, page 36). This embedding is
then used to perform a vector search in the database to retrieve the most relevant
documents based on vector similarity. Next, it constructs the appropriate system
message and assistant message based on the question, the context, and the relevant
documents provided. The system message contains instructions for ChatGPT, guiding
it to refer to the context and relevant documents when formulating the response, to use
the same language as the question, to respond concisely and with a technically credible
tone, and to handle currency conversions if necessary. The user message consists of the
question itself, while the assistant message includes the context and relevant
documents to be referred to by the model when generating the response. These
messages are finally passed to the “get_completion_from_messages” function to obtain
the expected response.
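A condensed sketch of this preparation step is given below; the exact wording of the system instructions is an assumption, shortened from the description above.

    def get_answer(question: str, context: list, table_name: str) -> str:
        query_embedding = np.array(get_embedding(question))
        relevant_docs = [content for content, _ in
                         vector_search(conn, table_name, query_embedding)]

        system_message = (
            "You are the HANU chatbot. Answer using the given context and relevant "
            "documents, reply in the same language as the question, and keep a "
            "concise, technically credible tone."
        )
        assistant_message = f"Context: {context}\nRelevant documents: {relevant_docs}"
        messages = [
            {"role": "system", "content": system_message},
            {"role": "assistant", "content": assistant_message},
            {"role": "user", "content": question},
        ]
        return get_completion_from_messages(messages)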

Accordingly, the endpoint’s function can be slightly modified to deliver back the
response obtained from the model, as depicted in Figure 26.

Figure 26: Code snippet – Implement API endpoint for complete response
Thus, as shown in Figure 26, the answer obtained from calling the “get_answer”
method can be returned directly to the client instead of a list of relevant documents, as
before.
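In that backend-integration variant, the endpoint sketched earlier would change roughly as follows; the context field in the request body is an assumption.

    @app.route("/hanu-chatbot/educational-program", methods=["POST"])
    def educational_program_answer():
        data = request.get_json()
        question = data.get("question") if data else None
        if not question:
            return jsonify({"error": "No question provided"}), 400
        # Return the finished answer instead of the raw document chunks.
        answer = get_answer(question, data.get("context", []), "educational_program")
        return jsonify({"answer": answer})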
Generally, backend integration brings benefits like centralized logic, which is easier to
maintain, and better security, as API keys are protected in the backend and accessed
securely, preventing their exposure on the front end. Nevertheless, this method can hit
performance bottlenecks while handling a massive number of requests and complex
queries; hence, users may encounter delays or timeout errors. Another issue is
scalability: if the number of users and requests increases, scaling the backend alone to
process the additional load becomes more complex and costly than scaling the front
end.
In our project, where the HANU chatbot is a public AI for educational purposes, used
to look up information and get consultancy, we have chosen to offload the
computationally intensive task of producing natural responses with ChatGPT to the
client side, thereby relieving the bottleneck in the backend. Security measures focus on
the safekeeping of the OpenAI API key and the overall reliability of the application.
This way of programming leads to a cleaner architecture in which the backend
processes data requests and the frontend renders content and manages user
interactions.
Notably, the choice between backend and frontend integration should be considered in
light of many factors: performance requirements, scalability demands, security
concerns, and the application architecture in general. Each approach has its pros and
cons, and it is up to the project to make the decision based on its specific requirements
and restrictions.

5.8. Result and discussion

The execution of the proposed system led to fruitful outcomes, showing that combining
ChatGPT with Hanoi University's domain-specific knowledge yields an effective tool.
The query processing and vector search mechanisms were assessed in detail, and the
integration of ChatGPT's text generation was thoroughly measured.
After training the chatbot, we can review the comprehensive pipeline, which covers
storage preparation, data ingestion, data preprocessing, data embedding, and data
storing, in the logging screen shown in Figure 27.

Figure 27: Chatbot training process (logging screen)


This training process (Figure 27) equipped the chatbot with in-depth knowledge that
cannot be found anywhere except at Hanoi University. As a result, our vector database
currently includes the following tables:

• “educational_program” table: contains 333 document instances related to various
educational curricula and lecturers at Hanoi University (briefly depicted in Figure 2,
page 26).
• “public_administration” table: contains 107 document instances ranging
from administrative procedures, admission and tuition fee information, to
announcements within the university.
On the other hand, the result after submitting the user’s question to the backend
endpoints is demonstrated in Figure 28.

Figure 28: Example of a POST request to get the relevant documents


To achieve this result, the Postman application, renowned for its capability for creating
and testing requests, was used to craft a POST request containing the user's question
within its body and direct it towards the backend endpoints to retrieve a list of relevant
documents. In this case, the endpoint
http://localhost:8080/hanu-chatbot/educational-program was utilized so as to obtain
information related to the educational program (i.e., the general knowledge block) of
Hanoi University.
To assess the performance of the vector search algorithms, various tests were
conducted on different document sets and query types. The findings are as follows:

• In all cases, when sending requests to backend endpoints, results are always
successfully returned in less than two seconds, considering the databases have
more than 300 instances.
• Testing on directly relevant questions (50 sample questions related directly to the
documents’ chunked segments): in 100% of the cases, the directly relevant chunks
were found in the “relevant_docs” list.
• Testing on indirectly relevant questions (50 sample questions indirectly
related to the documents, mentioning keywords but not directly asking about
them): In 100% of the cases, some relevant documents were discovered,
although they did not present the needed information.
• Testing on completely irrelevant questions (50 sample questions not relevant at all,
without important keywords like “credits”, “faculty”, “tuition fee”, etc.): in 20% of the
cases, some documents still appeared in the “relevant_docs” list, typically those
containing similar keywords in their content, such as “university” or “information
technology”. In the remaining cases, no documents were returned.
• In fields where similar content occurs regularly (e.g., information related to general
knowledge blocks, credits, faculties, or teachers), the “relevant_docs” list sometimes
ranked the expected documents lower than irrelevant documents containing more
overlapping keywords (and therefore a higher similarity score). However, the expected
documents did not fall off the list entirely.
To sum up, the vector search components performed remarkably well in retrieving
relevant documents based on the provided queries, even when the queries were
indirectly related to the document content.
After the list of relevant documents is delivered to the client, the front-end application
will integrate these documents and the recent context (the chatbot’s previous
responses) into the instructions sent to ChatGPT, triggering its text generation function
to ship the response to the user, as visualized in Figure 29, page 52.

Figure 29: HANU chatbot preview
The integration of the retrieved relevant documents with GPT-3.5 text generation
capabilities was evaluated through various query scenarios:
• Testing on directly relevant questions (50 sample questions related directly to
the documents’ chunked content): Over 90% of the cases presented the
expected response; other cases showed some mistakes like language
mismatches and wrong calculations.
• Testing on indirectly relevant questions (50 sample questions indirectly
related to the documents’ chunked content, meaning that information has to be
synthesized from multiple passages to draw conclusions, like asking about the
differences between two subjects): 60% of the cases showed the expected
responses. Others provided responses in different directions, which did not
really answer the question.
• Testing on completely irrelevant questions (50 sample questions not relevant
at all, without important keywords): Over 90% of the cases behaved as expected
(common responses from ChatGPT); others triggered language mismatches or
wrong interpretations.
• Testing on the context-based questions (50 sample pairs of related questions):
50% of the cases generated the expected answers, taking into account the
context from previous questions and responses. The remaining cases were
unable to provide satisfactory answers.

All things considered, the resulting response not only provided domain-specific
knowledge but also allowed for natural language conversations, enabling users to ask
follow-up questions and receive context-aware responses.
In order to assess the overall system's performance, a comparative evaluation was
conducted between the responses of ChatGPT and the HANU chatbot (Figure 30).

Figure 30: Comparison between Hanu Chatbot and ChatGPT performance


As compared in Figure 30, while ChatGPT's responses lack specificity and accuracy
due to its reliance on original training data, the HANU chatbot can tap into the curated
knowledge base, ensuring responses are grounded in the institution's expertise and
guidelines.
Taking everything into account, the system has brought promising results,
demonstrating the effectiveness of integrating ChatGPT with Hanoi University's
specialized knowledge.
On the one hand, the vector database and vector search algorithm work very well in
retrieving related documents based on the question provided. They have made it
possible to retrieve relevant documents accurately and with high efficiency. Even
when the question given is indirectly related to the document content, the system can
still identify and rank relevant documents appropriately.
On the other hand, integrating the relevant documents with ChatGPT's text-generation
capabilities, the system has yielded wonderful results in a variety of query scenarios.
By making effective use of context and domain knowledge, it provides relevant and
insightful answers even when faced with vague or marginally relevant questions. This
ability is especially valuable in academic settings, where discussions often involve
nuanced concepts and cross-domain connections. Chatbots have enabled scholars to
experience natural language conversations and receive tailored responses based on
trusted sources of domain-specific knowledge. This feature enhances the learning
experience, fostering an environment where students can explore complex topics
through intuitive dialogue, like chatting with a versatile mentor.
However, it should be noted that in certain cases, such as fields where similar content
frequently appears, the system sometimes ranks related documents lower than
unrelated ones that share more keywords. This is normal behavior for vector search
and can be mitigated by reducing the dataset entropy or by explicitly adding tags to
each instance of the dataset.
Through system testing, we also observed some performance bugs, similar to those
commonly seen in the sporadic operation of GPT-3.5 (language mismatches,
miscalculations, and misunderstandings). However, the relevance, coherence, and
accuracy of the responses could be significantly increased if a more advanced and
computationally intensive model such as GPT-4 were used.
In general, the resulting system not only provides domain-specific knowledge but also
enables natural language conversations, allowing users to ask follow-up questions and
receive contextual answers. A comparative evaluation of ChatGPT and HANU chatbot
responses highlighted the chatbot's ability to exploit a curated knowledge base,
ensuring responses are based on organizational expertise and guidance rather than on
ChatGPT's original training data alone.

6. CONCLUSION

The dissertation presents an idea for a new approach: integrating LLMs such as
ChatGPT with Hanoi University's domain-aligned information resources. This
innovative solution will establish a bridge between ChatGPT as a cutting-edge chatbot
and domain-specific information, supported by sophisticated backend infrastructure.
At the core of this system is OpenAI's embedding model, which converts knowledge
content into high-dimensional vector embeddings stored in an optimized vector
database. This setup allows quick identification of semantically equivalent document
fragments based on user queries. The system, in turn, uses a RESTful API to retrieve
information snippets that are supplemented with user context, enabling it to craft
tailored responses for academic and professional recipients.
The chatbot, thus, plays a significant role in improving the learning experience, which
in turn enhances knowledge acquisition for students. It also acts as an all-around
assistant for scholars, helping to ease workload while improving information
accessibility.
Moreover, the system not only advances domain-oriented AI technology but also
offers insight for research investigating how specialized knowledge can be brought
into LLMs through structured processes and infrastructure. This deepens our
understanding of the interplay between AI and human capabilities, which is critical as
this research-intensive, rapidly evolving field shifts toward technology-led skill
development.
There are a few improvements that might help ensure a successful future for the
HANU chatbot system as well as the expansion of its operational scope. Considering
data security, authentication and authorization should be implemented later on. Plus, it
is anticipated that historical conversations will be stored in the vector database to
increase context-based responses. Furthermore, the system would be able to acquire
various inputs such as images, PDF files, and other forms of multimedia, which would
widen the knowledge scope, thereby elevating the user experience.

This dissertation is both pioneering and forward-looking within the wider domain of
education-oriented AI technology. It delves into a novel approach to infusing specific
domain knowledge into large language models through specialized processes and
architectures. The study not only advances our understanding of the interplay between
AI and human cognition but also lays a strong foundation for the future growth of this
swiftly evolving field.

REFERENCES

AVID Open Access. (n.d.). A SWOT Analysis of ChatGPT. Retrieved from


https://avidopenaccess.org/resource/a-swot-analysis-of-chatgpt/#1674243400330-
b39704a3-7ffe
Bahrini, A., Khamoshifar, M., Abbasimehr, H., Riggs, R. J., Esmaeili, M.,
Majdabadkohne, R. M., & Pasehvar, M. (2023). ChatGPT: Applications,
Opportunities, and Threats. Retrieved from
https://doi.org/10.48550/arXiv.2304.09103
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu,
T., Chung, W., Do, Q. V., Xu, Y., & Fung, P. (2023). A Multitask, Multilingual,
Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and
Interactivity. ACL Anthology. Retrieved from
https://doi.org/10.18653/v1/2023.ijcnlp-main.45
Berger, E. (2023). Grounding LLMs. Retrieved from
https://techcommunity.microsoft.com/t5/fasttrack-for-azure/grounding-llms/ba-
p/3843857
Dempere, J., Modugu, K., Hesham, A., & Ramasamy, L. K. (2023). The impact of
ChatGPT on higher education. Frontiers in Education, 8. Retrieved from
https://doi.org/10.3389/feduc.2023.1206936
Fraiwan, M., & Khasawneh, N. (2023). A Review of ChatGPT Applications in
Education, Marketing, Software Engineering, and Healthcare: Benefits,
Drawbacks, and Research Directions. Retrieved from
https://doi.org/10.48550/arXiv.2305.00237
Gan, W., Qi, Z., Wu, J., & Lin, J. C. W. (2023). Large Language Models in
Education: Vision and Opportunities. IEEE International Conference on Big Data
(BigData), 4776-4785. doi: 10.1109/BigData59044.2023.10386291.
Gewirtz, D. (2023). How does ChatGPT actually work?. ZDNET. Retrieved from
https://www.zdnet.com/article/how-does-chatgpt-work/
Guinness, H. (2023). How does ChatGPT work?. Zapier. Retrieved from
https://zapier.com/blog/how-does-chatgpt-work/
Hadi, M. U., Al-Tashi, Q., Qureshi, R., Shah, A., Muneer, A., Irfan, M., Zafar, A.,
Shaikh, M., Akhtar, N., Wu, J., & Mirjalili, S. (2023). A Survey on Large
Language Models: Applications, Challenges, Limitations, and Practical Usage.
Retrieved from https://doi.org/10.36227/techrxiv.23589741
Heuer, M., Lewandowski, T., Kučević, E., Hellmich, J., Raykhlin, M., Blum, S., &
Böhmann, T. (2023). Towards Effective Conversational Agents: A Prototype-
based Approach for Facilitating their Evaluation and Improvement. Retrieved
from
https://researchgate.net/publication/368791250_Towards_Effective_Conversation
al_Agents_A_Prototype_based_Approach_for_Facilitating_their_Evaluation_and
_Improvement
Hua, S., Jin, S., & Jiang, S. (2023). The Limitations and Ethical Considerations of
ChatGPT. Data Intelligence, 6(1), 1-38. Retrieved from
https://doi.org/10.1162/dint_a_00243
IBM. (n.d.). What are large language models (LLMs)?. Retrieved from
https://www.ibm.com/topics/large-language-models
IBM. (n.d.). What is conversational AI?. Retrieved from
https://www.ibm.com/topics/conversational-ai
Kamalov, F., Calonge, D. S., & Gurrib, I. (2023). New era of artificial intelligence in
education: Towards a sustainable multifaceted revolution. Sustainability, 15(16),
12451. Retrieved from https://doi.org/10.3390/su151612451
Labadze, L., Grigolia, M., & Machaidze, L. (2023). Role of AI chatbots in education:
A systematic literature review. International Journal of Educational Technology in
Higher Education, 20(1). Retrieved from https://doi.org/10.1186/s41239-023-
00426-1
Lowery, M., & Borghi, S. (2024). 15 Benefits of ChatGPT (+8 Disadvantages).
Semrush Blog. Retrieved from https://www.semrush.com/blog/benefits-chatgpt/
MatrixFlows. (n.d.). RAG, Fine-tuning or Both? A Complete Framework for Choosing
the Right Strategy. Retrieved from https://www.matrixflows.com/blog/retrieval-augmented-generation-rag-finetuning-hybrid-framework-for-choosing-right-strategy
Mohamadi, S., Mujtaba, G., Doretto, G., & Adjeroh, D. (2023). ChatGPT in the Age of
Generative AI and Large Language Models: A Concise Survey. Retrieved from
https://www.researchgate.net/publication/372248458_ChatGPT_in_the_Age_of_
Generative_AI_and_Large_Language_Models_A_Concise_Survey
Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N.,
Barnes, N., & Mian, A. (2024). A comprehensive overview of large language
models. Retrieved from https://doi.org/10.48550/arXiv.2307.06435
OpenAI. (n.d.). Embeddings. Retrieved from
https://platform.openai.com/docs/guides/embeddings/embedding-
models?ref=timescale.com
OpenAI. (n.d.). Introducing ChatGPT - OpenAI. Retrieved from
https://openai.com/blog/chatgpt/
OpenAI. (n.d.). Fine-tuning. Retrieved from
https://platform.openai.com/docs/guides/fine-tuning
OpenAI. (n.d.). What is ChatGPT?. Retrieved from
https://help.openai.com/en/articles/6783457-what-is-chatgpt
Rozear, H., & Park, S. (2023). ChatGPT and Fake Citations. Duke University
Libraries Blogs. Retrieved from
https://blogs.library.duke.edu/blog/2023/03/09/chatgpt-and-fake-citations/
Wilson, M. (2024). ChatGPT explained – everything you need to know about the AI
chatbot. TechRadar. Retrieved from https://www.techradar.com/news/chatgpt-
explained