2001040208 - Đặng Quỳnh Trang - Quỳnh Trang Đặng
Hanoi, 2024
DECLARATION OF AUTHORSHIP
I, Dang Quynh Trang, student ID 2001040208 at Hanoi University, declare that this
dissertation titled “HANU Chatbot Backend: Unleashing ChatGPT in Domain-
specific Knowledge” is my own original work and that all sources have been
appropriately acknowledged and referenced. This dissertation has not been
submitted in whole or in part for consideration for any other degree or qualification
at this or any other university.
Signature: __________________________________________________________
Date: ______________________________________________________________
ACKNOWLEDGEMENT
TABLE OF CONTENTS
4.3.1. Corpus storage schema ............................................................................24
4.3.2. Vector database schema ..........................................................................25
4.4. Pipeline workflows.........................................................................................26
4.4.1. System workflows ...................................................................................26
4.4.2. Backend workflows .................................................................................28
5. CHAPTER 4: IMPLEMENTATION AND DISCUSSION .................................30
5.1. Platforms and tools .........................................................................................30
5.1.1. Python programming language and libraries ..........................................30
5.1.2. PostgreSQL vector database ...................................................................31
5.1.3. GitHub version control system ...............................................................32
5.2. Document injection and preprocessing pipeline ............................................32
5.3. Embedding generation ...................................................................35
5.4. Vector database ..............................................................................................37
5.5. Vector searching and retrieving .....................................................................42
5.6. API development ............................................................................................44
5.7. ChatGPT integration ......................................................................................45
5.8. Result and discussion .....................................................................................49
6. CONCLUSION .....................................................................................................55
REFERENCES ..........................................................................................................57
LIST OF FIGURES
Figure 1: A CSV file storing documents about HANU educational program ..........25
Figure 2: A table about HANU educational program in Vector Database ...............26
Figure 3: System workflows of the HANU chatbot ..................................................27
Figure 4: Backend workflows of the HANU chatbot ...............................................28
Figure 5: Code snippet – Get HANU documents from a CSV file ...........................32
Figure 6: Code snippet – Get the number of tokens from a string ............................33
Figure 7: Code snippet – Chunk HANU documents into smaller units ....................34
Figure 8: Code snippet – Combine metadata from HANU documents ....................35
Figure 9: Code snippet – Initialize OpenAI client ....................................................36
Figure 10: Code snippet – Get embedding from HANU documents ........................36
Figure 11: Code snippet – Embed HANU documents (full steps) ............................36
Figure 12: Code snippet – Create a normal database ................................................37
Figure 13: Code snippet – Create vector extension within a database .....................38
Figure 14: Code snippet – Create ivfflat index within a table ..................................38
Figure 15: Code snippet – Create a vector table (full steps) .....................................39
Figure 16: Code snippet – Load training data into the vector table ..........................40
Figure 17: Code snippet – Prepare for chatbot training ............................................41
Figure 18: Code snippet – Load corpus into vector database (full steps) .................41
Figure 19: Code snippet – Prepare data for two chatbots .........................................42
Figure 20: Code snippet – Vector search ..................................................................43
Figure 21: Code snippet – Initialize Flask application .............................................44
Figure 22: Code snippet – Implement API endpoint for relevant documents ..........44
Figure 23: Code snippet – Run the chatbot ...............................................................45
Figure 24: Code snippet – Get response from GPT-3.5 model .................................46
Figure 25: Code snippet – Define body message for OpenAI API ...........................47
Figure 26: Code snippet – Implement API endpoint for complete response ............48
Figure 27: Chatbot training process (logging screen) ...............................................49
Figure 28: Example of a POST request to get the relevant documents ....................50
Figure 29: HANU chatbot preview ...........................................................................52
Figure 30: Comparison between Hanu Chatbot and ChatGPT performance ............53
ABSTRACT
In the digital age, easy access to knowledge is becoming increasingly important, and users' requirements are increasingly sophisticated. Although ChatGPT, today's leading conversational artificial intelligence technology, has an impressive ability to answer users' questions in a natural and friendly way, its knowledge is still limited by the scope of its initial training data. The solution proposed in this thesis is to build a backend chatbot system that combines the capabilities of ChatGPT with a rich and reliable source of knowledge drawn from the documents of lecturers and staff at Hanoi University. Students will then have a "virtual assistant" to accompany them throughout their learning, consulting, and research: a natural dialogue partner for questions in each specific field, grounded in the university's trusted sources of knowledge. The thesis focuses on researching and applying the key technology components, such as tools that convert text content into numerical representations (embedding vectors), a database optimized for storing and searching similar vectors, and a method for integrating domain-specific documents into ChatGPT. The ultimate goal is a digital knowledge ecosystem that operates smoothly, meets the need for in-depth communication and dialogue, and affirms the important role of artificial intelligence applications in education in improving the quality of training for future generations.
1. INTRODUCTION
The primary purpose of this dissertation is to develop a dependable backend system to connect ChatGPT to
Hanoi University’s credible resources. This integration will empower users to engage
in natural language conversations and receive accurate, context-relevant responses
drawn from the trusted and authoritative resources of Hanoi University.
While introducing AI models like ChatGPT into educational environments can yield
promising results, we still need to overcome a number of obstacles to exploit their full
potential. Hadi et al. (2023) claimed that LLMs may be biased and result in
detrimental or erroneous outcomes. The model might not be able to learn to generalize
to new scenarios if the dataset is incomplete. Furthermore, the model will be trained to
incorporate any bias or inaccuracy present in the dataset. Therefore, when answering
questions related to specific fields or actual professional opinions, LLMs may respond
vaguely with shallow or irrelevant answers (Hadi et al., 2023).
At Hanoi University and other educational institutions, providing in-depth, up-to-date, and reliable knowledge to students, faculty, and staff is vital, as it not only supports academic excellence but also ensures the proper functioning of the institution. Lectures, textbooks, and policy documents, although valuable in disseminating knowledge, are a static channel for absorbing information and do not pay enough attention to the individual needs of learners.
Besides, knowledge evolution occurs regularly in academic disciplines and
institutional policies. Therefore, it is important to have a responsive and adaptable
knowledge transfer channel to meet the rapid development of knowledge in many
different fields.
In fact, classical ways of bringing domain knowledge into LLMs, such as refining or
consolidating knowledge, can be resource-intensive and may not result in an efficient
process when applied to large and mixed data sets. Typically, these techniques are
performed during training. This complicates updating the model's knowledge base if
new information arises or existing information changes.
Because of this, there is an urgent need for a comprehensive approach to fill the gap
between the outstanding capabilities of LLMs and the specialized knowledge produced
by Hanoi University. This proposed solution can not only provide users with accurate
data relevant to the specific context but also promote easy absorption of that
information while engaging in conversations with natural language, like having a
personal consultant.
Addressing these challenges matters because overcoming them would allow AI-based conversational agents to be fully deployed in educational contexts. Such chatbots can eliminate the hassle of seeking information from various sources by letting users obtain more accurate and up-to-date information with
ease. Besides, it is expected that this approach will decrease the workload of teachers,
help to fix the problem of unequal access to education for underserved communities,
and at the same time create a possibility for further research on how subjective
cognitive beliefs are transferred to AI systems.
• Embedding Generator: This component undertakes the responsibility of
turning text content into numerical vector representations. In essence, these
embeddings preserve semantic relations and contextual sense for the given
texts, thus giving way to searching and retrieval based on the same meanings
instead of actual words.
• Vector Database: This part of the system is designed specifically for storing
and efficiently extracting high-dimensional embeddings. This element is
essential for enabling quick and correct similarity searches, which facilitate the
identification of relevant parts of documents according to users' queries.
• Vector Search and Retrieval: This algorithm is applied directly to link the
user questions with the corpus content most likely relevant to them. Thus, it is
ensured that the answer provided matches the question given.
• API Development: The feature of the RESTful API is introduced to make it
a mediator connecting the subsystems. It not only allows transactions between
different system components but also facilitates communication with the front
end. When needed, it activates processes that retrieve pertinent documents from
the backend database and transfer them to the client side. This feature allows
the front-end application to customize the instruction to get a response from the
GPT model based on the pulled documents.
The innovative project has a primary objective, which is to enhance the quality of
education for all students by allowing them access to accurate domain information
from a chatbot that plays the role of a “virtual tutor”. Additionally, this system holds
great potential for minimizing workloads on teachers, giving knowledge to
underprivileged communities, and making it possible for scientists to research how
perspectives can be integrated into an AI system, among other things. To conclude,
these efforts underscore AI's position as a driver of new concepts in higher education
and a driver of students' better learning experiences.
The dissertation’s scope is comprehensive, but it is important to recognize some
boundaries. When it comes to practical implementation, most of my work has focused
on building the backend system, which is made up of elements such as data scraping,
text pre-processing, vector storage and retrieval, and API creation. The ChatGPT
response completions will be handled by the front-end.
Furthermore, since our system is at its outset and its knowledge base has been trained
solely with resources provided by Hanoi University, this might make it less useful for
broader use cases where more general knowledge about different factors is required.
Nonetheless, the architecture of our system remains scalable and adaptable to include
other knowledge domains in the future.
The other problem is that the output generated is essentially driven by how accurate
and representative the consumed corpus is. Although attempts will be made to obtain
reliability and vastness of knowledge, there are times when the system may find gaps
or misalignments in its dataset.
The initial stages of establishing the backend system have been completed, although it is still a work in progress aimed at bridging the gap between the enormous capabilities of LLMs and the vast domain-specific knowledge base of academic
institutions. By providing solutions for issues related to applying subjective expertise
to dialogue AI systems, this study contributes to making a difference in the
dissemination of information as well as the advancement of pedagogical technology.
atmosphere that promotes collective learning while utilizing digital technologies to
support students in obtaining relevant advice and information.
The innovative project has a primary objective, which is to enhance the quality of
education for all students by allowing them access to accurate domain information
from a chatbot that plays the role of a virtual tutor. Additionally, this system holds
great potential for minimizing workloads on teachers, giving knowledge to
underprivileged communities, and making it possible for scientists to research how
perspectives can be integrated into an AI system, among other things. Thus, these
efforts underscore AI's position as a driver of new concepts in higher education and a
driver of students' better learning experiences.
Finally, the successful realization of this dissertation will serve as a reminder that AI
has the ability to be at the forefront of making innovation in higher learning happen.
We can lift the quality of instruction, encourage collaborative learning experiences,
and provide students with tools and skills to succeed in an increasingly digitalized
world by fully tapping into conversational agents like ChatGPT and integrating them
with institutional domain assets.
A thorough research methodology will be used in order to achieve the goals of this
dissertation and provide a strong backend structure for incorporating ChatGPT with
domain-specific expertise from Hanoi University. This method will encompass a
mixture of both theoretical and empirical approaches, relying on well-known
principles and techniques from different fields, including natural language processing,
information retrieval, database systems, and software engineering.
Firstly, the study will begin with an extensive literature review that critically examines
past works in areas of conversational AI, LLMs, and how to integrate domain
knowledge into AI systems. The review aims to identify gaps within the existing state-
of-the-art methods and then inform our proposed solution design.
Next, a rigorous requirements engineering process will be carried out, aimed at defining functional as well as non-functional requirements for the backend system. In
this phase, we are going to work closely with our mentor and potential end-users so as
to ensure that their expectations align well with the system.
The design of a detailed system architecture based on the proposed specifications will
comprise such components as corpus processing pipelines, vector storage and retrieval
mechanisms, and API implementation. The idea behind this is to make sure that the
system can grow while still remaining maintainable and flexible, following industrial
standards.
The next phase will involve creating the backend in appropriate languages,
frameworks, and tools. By employing the Waterfall method, a comprehensive
requirements document will be established upfront, outlining all functionalities and
features of the software. This method adheres to a sequential development process
where each phase (requirement gathering, design, development, testing, and
deployment) must be completed before progressing to the next. This structured
approach ensures a clear roadmap and minimizes the need for significant changes
during later stages.
At the same time, data collection and curation activities will be set up to retrieve
academic documents, among others, from Hanoi University’s archives. These
documents would then undergo preprocessing, where they are transformed into vector
representations through modern natural language processing technologies like word
embeddings or transformer models.
Testing and evaluation strategies shall be used to ensure the reliability and accuracy of
the system. Response relevance, coherence, and accuracy are some of the key metrics
for measuring performance that will be evaluated so that areas for improvement or
refinement can be identified.
During the course of this research work, a collaborative approach would be created by
involving domain experts, including technical advisers and potential end-users. The
results from this research will also be presented through regular progress reports and
presentations while also being disseminated through peer-reviewed publications to
benefit other scientists working in similar areas.
As a result, the dissertation will be completed with the writing of a research paper that
details how the project was done in terms of its methodologies, techniques, and
findings. This paper will be reviewed thoroughly by peers and later published in
reputable academic journals or conferences, thus improving knowledge in dialogue AI
and domain knowledge fusion.
In this dissertation, I will explain why there is a need for the integration of ChatGPT
with domain-specific knowledge from Hanoi University while also giving a
comprehensive overview of all other aspects concerning the proposed architecture.
The first chapter serves as an entry point through which the reader can understand the
context within which this study is conducted. The chapter explains why integrating LLMs such as ChatGPT with subject-matter expertise can support teaching and learning in higher education, and it articulates, in a problem statement, the challenges and limitations of existing approaches. In addition, it also involves
stating what this study excludes so as to clearly identify its limits of inquiry as well as
marking what it includes, i.e., establishing the boundaries within which my research
was done. Also included are predictions on likely results that would emerge from the
implementation of the proposed solution to this problem, thereby indicating the
expected impact and contributions to be made.
Following this, a comprehensive review of the current literature and connected work
about conversational AI, LLMs, and techniques that enable integration of domain
knowledge in AI systems is provided. The chapter reviews the strengths and
weaknesses of existing approaches, identifies research gaps, and provides a theoretical
base for the proposed solution. It serves as a strong basis for subsequent chapters and,
hence, indicates the novelty and importance of the research undertaken.
Then, this dissertation focuses on ChatGPT, the giant language model upon which this
proposed solution is built. It shows how ChatGPT actually works, its underlying
architecture, and to what extent it can generate realistic text. The discussion explores
possible uses for ChatGPT along with its strong points as well as weak spots, thus
attesting to why it should incorporate domain-specific knowledge to enhance its
performance in area-specific situations.
At this point, we present the detailed design and architecture of our proposed backend
infrastructure. This implies outlining system requirements derived from the
stakeholder analysis phase through requirements engineering that are both functional
and non-functional. The overall system architecture is described, encompassing the
various components such as the storage schema, text-processing pipelines, vector
storage and retrieval mechanisms, and API implementation. Design principles,
patterns, and best practices employed in the development process are also discussed.
Later on, we provide an extensive account of the implementation phase, including all
backend systems’ platforms, tools, and technologies used. For example, how to ingest
and curate a corpus of domain-specific documents from Hanoi University, implement
text-processing pipelines, design vector storage and retrieval mechanisms, and
generate an API that allows for seamless integration with ChatGPT in our front-end
application. The results obtained from the evaluation and testing phases analyze the system's performance against key metrics such as response relevance, coherence, and accuracy. That includes a critical discussion of the findings, limitations, weaknesses,
and scope for future research.
Finally, the dissertation concludes by summarizing the key findings and contributions
of this thesis work, highlighting the significance of the proposed solution to advancing
conversational AI with domain knowledge integration. This chapter goes back to the
research objectives and assesses whether they have been achieved or not. The
implications for further research are also provided, therefore suggesting possible ways
of addressing these limitations identified during the research process. A
comprehensive list of references is included to support the research findings and
comply with acceptable academic standards.
2. CHAPTER 1: LITERATURE REVIEW
According to IBM (n.d.), chatbots and virtual agents are among several tools that fall
under the umbrella term Conversational AI, which refers to any system that can be
spoken to or written to by a user. They recognize speech and text inputs, determine
their meanings in multiple languages, and use vast amounts of data, natural language
processing, and machine learning to mimic human conversation. Indeed, NLP is
combined with machine learning in conversational AI, so these NLP methods keep
feeding into the machine learning processes for further improvement of AI algorithms
(IBM, n.d.).
One key advantage of using conversational AI in teaching is that it provides constant
support, which can answer students’ questions or objections anytime, even outside
regular school hours. For instance, AI chatbots can respond to inquiries from learners
by giving them the necessary feedback on their assignment submission. Additionally,
they may also guide people through reading materials while making recommendations
based on personal preferences indicated by one’s learning style at a given level
(Labadze et al., 2023). Such an approach would prove useful mainly among distance
education participants with different timetables by ensuring the availability of
educational resources.
The other significant advantage of conversational AI in education is its ability to
lessen the workload of teachers. This frees them from mundane tasks and allows them
to focus on more complex assignments and high-level cognitive discussions. In
accordance with Kamalov et al. (2023), AI can streamline various aspects of
education, such as creating content, grading tasks, or formulating questions based on
given prompts, all of which otherwise demand time and effort from teachers.
However, it should be duly noted that the efficiency of conversational AI for learning
greatly relies on the quality and breadth of its knowledge base. As highlighted by
Dempere et al. (2023), those engaging with AI-driven educational experiences should
recognize that these systems operate based on existing data and are often unable to
adapt well to new challenges without sufficient training examples. AI could find it
difficult to overcome the kind of hitherto unheard-of difficulties that humans would
run across when exploring space, for instance.
This implies the integration of deep understanding about specific academic or
professional domains into conversational AI systems so they may give accurate
context-related information as well as engage in meaningful conversations within
those fields. The current boundaries of conversational AI could be surpassed by
incorporating more advanced techniques in natural language processing together with
curated knowledge bases, thus transforming it into a potent instrument for knowledge
sharing and collaborative learning across educational institutions.
As defined by IBM (n.d.), LLMs are trained on large amounts of data to generate human-like text and other types of content. They are able to reason over and learn from context, generate coherent and relevant sentences, translate text between languages, summarize text, answer questions (from general conversation to the most frequently asked questions),
and even help in creative writing or code generation tasks. There is no doubt that
LLMs are a major breakthrough in the field of NLP and AI (IBM, n.d.).
In 2024, new LLMs keep appearing. OpenAI's GPT-4 remains a leader in creative text generation, producing remarkably realistic and imaginative content. Google has also joined the race with two strong competitors, PaLM 2 and Gemini: PaLM 2 is a versatile model that tackles a wide range of tasks, while Gemini focuses on giving informative answers. Concerning efficiency, Meta's LLaMA 2 performs well without requiring excessive resources, while Claude 2, developed by Anthropic, emphasizes safety and reliability, making it suitable for business use. In addition, Cohere specializes in enterprise and domain-focused applications, and Falcon offers strong multilingual support as an innovative open option. With such a wide range of models, each with its own specialty, the field is likely to keep growing and evolving rapidly.
The performance boom among LLMs lately begs the question: What drives their
exceptional abilities? Naveed et al. (2024) claimed that LLMs process both data and
language via architectural advancements together with training methods and context
length enhancements. They employ architectures based on transformers, more
computational power, and extensive amounts of training data to comprehend and
generate human-like languages. Moreover, LLMs use pre-training coupled with fine-
tuning techniques for specific task adaptation across different areas, which makes
them capable of performing numerous NLP tasks. The working principles behind
LLMs include training them on huge datasets using high-performance computing
resources, which enables learning intricate structures or connections within languages;
then these models can be adjusted towards certain applications so that they can
understand and produce text outputs in various domains like education, science,
mathematics, law, finance, healthcare, and robotics, among others (Naveed et al.,
2024).
However powerful they may be, there are still areas outside the training domain where
LLMs struggle with domain-specific knowledge. As pointed out by Mohamadi et al.
(2023), the training data for these models might not adequately cover certain subject
areas, leading to responses that lack depth, accuracy, or context-specific relevance
when dealing with inquiries from specialized domains. According to Bang et al.
(2023), ChatGPT is an unreliable reasoner with an average accuracy of 63.41% over
10 distinct reasoning categories, including logical reasoning, non-textual reasoning,
and commonsense reasoning. They also reported that ChatGPT experiences issues with hallucinations.
As they advance and gain more capabilities, LLMs will find their way into different
fields such as education, healthcare, or customer service, which is likely to lead to an
increase in diversification across sectors. Nonetheless, it is crucial to overcome the
problem of incorporating domain knowledge so that these models can give accurate
context-aware responses for specialized queries.
2.3. Integrating domain knowledge
In order to unleash the full potential of LLMs so that they can offer exact and context-
aware information across different domains, the integration of knowledge specific to
those areas is essential.
Typically, knowledge consolidation in conversational AI systems has been achieved
through rule-based systems or limited databases, which tend to give rigid and narrow
responses. Nevertheless, LLMs have brought about more sophisticated methods of
infusing these models with domain expertise.
One prominent method is fine-tuning, where a pre-trained language model is further
trained using data from a particular field. By training on far more instances than can fit
in the prompt, fine-tuning improves on few-shot learning (learning a task from a handful of demonstrations) and yields better results on a wide variety of tasks (OpenAI, n.d.). In other words, by fine-tuning on domain-relevant datasets,
LLMs can acquire the specialized vocabulary, terminology, and contextual
understanding necessary for effective communication and information retrieval within
that domain. As a result, we will not have to include as many examples in the prompt
after a model is optimized.
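To make this concrete, a minimal sketch of how such a fine-tuning job could be launched with the OpenAI Python client is given below. The training file name and base model are illustrative assumptions, and this is not the approach implemented in this thesis's backend.

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Upload a hypothetical JSONL file of chat-formatted domain examples.
training_file = client.files.create(
    file=open("hanu_domain_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job on an assumed base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)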
Another technique is knowledge-grounding, where relevant information is fetched
from external knowledge bases and incorporated into the model’s responses. In
particular, as stated by Berger (2023), a trigger, like a user request or instruction, is the
first step in a simple Retrieval-Augmented Generation (RAG) model. This trigger is
sent to a retrieval function, which uses the query to retrieve pertinent content. After
that, the obtained content, the input prompt, and the question itself are all combined
back into the LLM's context window. Enough room is left for the response of the
model. Finally, the combined input and retrieved content are then used by the LLM to
create an output. Grounding in real-world applications is important, as this
straightforward but efficient method frequently produces outstanding outcomes.
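This flow can be summarized in a short sketch. The "retrieve_relevant_chunks" helper below is a hypothetical placeholder for the retrieval function; only the order of operations (trigger, retrieval, combined context, generation) mirrors the description above.

from openai import OpenAI

client = OpenAI()

def answer_with_rag(question, retrieve_relevant_chunks):
    # 1. Trigger: the user's request is used as the retrieval query.
    chunks = retrieve_relevant_chunks(question)

    # 2. The retrieved content and the question are combined back into the
    #    model's context window, leaving room for the response.
    context = "\n\n".join(chunks)
    messages = [
        {"role": "system",
         "content": "Answer using only the provided context.\n\n" + context},
        {"role": "user", "content": question},
    ]

    # 3. The LLM uses the combined input to create an output.
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    return completion.choices[0].message.content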
Hybrid methods that combine fine-tuning and knowledge-grounding have also been
investigated with the aim of benefiting from both approaches. To start with, a retriever
provides fast access to large amounts of new external data. Furthermore, fine-tuning
makes the model behavior highly specialized for a given domain. Finally, the
generator produces outputs by utilizing both refined domain knowledge as well as
external context (MatrixFlows, n.d.).
While these techniques show promise for integrating domain knowledge into LLMs,
they also present several challenges. RAG thrives on real-time external data, ideal for
tasks like summarizing news articles; however, it may add latency and necessitate
database upkeep. In contrast, fine-tuning excels at specializing LLMs for specific
domains but relies on static internal data and significant training resources. The hybrid
approach, although seeking to enable access to both internal and external data sources,
inherits the latency of RAG and the resource demands of fine-tuning, requiring
maintenance of both the model and the data repository.
Despite these challenges, the integration of domain knowledge into LLMs is a critical
area of research, as it holds the key to unlocking the full potential of conversational AI
systems in specialized domains.
knowledge bases obtained from trusted sources such as academic institutions or
subject matter experts. Even though they look promising, fine-tuning and knowledge-
based methods usually take place during the training phase, which means that
adjusting the model’s knowledge base with new information or updated knowledge is
difficult.
This dissertation aims to fill these gaps by creating a strong backend system for
connecting ChatGPT to a database of specialized resources from Hanoi University;
investigating effective ways of consuming, processing, and retrieving domain-specific
information; and implementing mechanisms necessary for seamless integration with
OpenAI APIs. With this kind of infrastructure, we can achieve accurate responses in
natural language conversations that are relevant to context, education-oriented, and
institutionalized.
3. CHAPTER 2: CHATGPT
accurate outputs. A dialogue corpus was collected and then blended with the InstructGPT dataset, which was reformatted into a dialogue style. This step is crucial as it helps the model understand the subtleties of human conversation, including tone, intent, and the flow of dialogue.
It's interesting to note that OpenAI's ChatGPT is built to remember previously asked
questions. This means it can gather context from prior questions users have asked and
use it to inform future conversations with them. Users can also request reworks and
revisions, and it will relate to what they were talking about previously. It makes
interaction with the AI feel more like a conversation (Guinness, 2023).
The result is a model that can engage in conversations across various topics, answer
questions, and even perform creative tasks like writing and code generation.
Furthermore, it can create essays, articles, and poetry. This capability to produce
human-like answers to a multitude of inquiries was among the primary reasons why
the community of users reached 100 million within the first two months after the app
was launched (Wilson, 2024).
However, despite its advantages, ChatGPT has limitations. It can occasionally give illogical or inaccurate responses that nonetheless seem plausible. Partly because of biases in the training data, the model is also sensitive to small variations in input phrasing, tends to be verbose, and overuses certain terms (OpenAI, n.d.). Continuous updates and
improvements are part of the model's lifecycle to enhance its performance.
Another important field in which ChatGPT could be a great help is software
engineering. In this field, the ChatGPT model can offer several functionalities to
assist engineers in enhancing their effectiveness at work. Some of the most common
abilities that can be listed are code generation, debugging, and software testing.
Moreover, ChatGPT can help with tasks like documentation creation, collaboration,
and natural language processing. Studies show that language models are useful for
predicting variable types, recognizing frequent code snippets, locating software
mistakes, and enhancing code quality (Fraiwan & Khasawneh, 2023).
According to Bahrini et al. (2023), in healthcare, ChatGPT promotes rapid symptom
checking and identification, provides personalized health advice, and can help improve patient outcomes. Furthermore, in biotechnology and medicine, it increases
efficiency, reduces costs, and improves the health of patients or animals. However, the
field faces legal, regulatory, ethical, and privacy difficulties, such as data protection and security, as well as potential errors and biases in the system.
Finally, thanks to its conversational features, ChatGPT is a popular choice for chatbots and personal assistants. By helping users remember appointments, meetings, and other important occasions, ChatGPT not only helps users stay organized and prepared for their daily activities, but, with its ability to set reminders for certain events, it also helps them stay on top of their tasks without having to constantly check their calendar (Lowery & Borghi, 2024).
ChatGPT has a number of strengths worth discussing. The first advantage of ChatGPT lies in its ability to maintain a conversation that closely mimics human interaction (Rampton, 2023). This feature allows the model to understand context, remember previous queries, and provide coherent and relevant responses, making it an excellent tool for customer service, content creation, and even educational purposes.
Another notable advantage of ChatGPT is its versatility. The model can be used on various platforms and integrated with different software, which makes it a flexible solution for many industries. Moreover, thanks to the enormous amount of data it was trained on,
ChatGPT not only could summarize information from multiple sources but could also
generate new content, adapt information quickly, and make broad connections (AVID
Open Access, n.d.).
Additionally, the responses generated by ChatGPT are crafted in a polished and easily
digestible manner, enhancing readability and user experience. This user-friendly
approach makes ChatGPT a preferred choice for individuals and businesses seeking
efficient and effective communication solutions.
However, despite its benefits, ChatGPT has a number of drawbacks. As stated by
OpenAI (n.d.), ChatGPT can occasionally give responses that seem reasonable but are
actually inaccurate or meaningless. Several difficulties have been involved in the
process of trying to address this issue. First of all, in reinforcement learning (RL)
training, there is currently no clear source of truth to guide the model's learning
process. Second, trying to train the model to be more conservative results in the
rejection of queries that could have been answered correctly. And last but not least,
supervised training can fool the model because the appropriate response is determined
by the model's knowledge rather than the trainer's expertise.
Another drawback of ChatGPT is that it is sensitive to small changes in input phrasing or to repeated prompts. For instance, when a user enters a question, the chatbot may claim not to know the answer; yet when the question is slightly rephrased, the model can give the correct answer. Furthermore, responses are often verbose and frequently mention that the system is a language model trained by OpenAI. These problems occur due to biases in the training data, which favor longer answers, as well as known concerns about over-optimization.
Ideally, when faced with a confusing question, the model would ask for clarification. In reality, however, current models usually try to predict the user's intention instead. Despite efforts to train the model to reject inappropriate requests, there are still cases where it responds to harmful instructions or exhibits biased behavior.
One more disadvantage of using ChatGPT is that the model can fabricate citations.
When asked for the sources of its responses, it may give references that seem reliable, but they are actually fabricated. It's crucial to understand that AI can generate
responses confidently even without valid data, much like how a person influenced by
hallucinations may speak confidently without logical reasoning. Attempting to locate
these sources through Google or the library will result in no findings (Rozear & Park,
2023).
4. CHAPTER 3: SYSTEM ARCHITECTURE AND DESIGN
• The system needs to be developed with minimal downtime, and it has to give precise and consistent replies, which will not only ensure a positive experience for users but also build trust through reliability.
• The design should be modular and extensible to accommodate additional components or the adaptation of the system to new domains in future use cases.
• Even under high-load conditions, the system has to offer quick and efficient responses that do not keep users waiting too long.
• The codebase should be well documented and meet the quality standards and guidelines common in the IT sphere, so that future developers can improve or upgrade it.
These requirements ensure that the backend structure complies with the project targets
and stakeholders’ expectations. Continual validation as well as testing against all the
requirements that we have mentioned above will be necessary for the smooth
development of the solution as well as the assurance of the best outcome.
The backend architecture is modular and scalable; the main components are linked
together as a framework in which Hanoi University's knowledge is integrated with
ChatGPT to run seamlessly. The overall architecture can be divided into two main
subsystems: Chatbot FE and Chatbot BE.
The Chatbot FE, acting as the user interface, is able to handle both text-based and
conversational interactions between users and the system. It captures user inputs, displays responses, and keeps track of the conversation flow. The FE interacts with the BE via a RESTful API, transmitting user queries and receiving relevant document chunks or generated responses.
The Chatbot BE is the core of the infrastructure, comprising the following key
components:
• Document ingestion and processing pipeline: This component takes care of the intake and preprocessing of Hanoi University's rich corpus of documents, both academic and administrative in nature. It supports the CSV format and performs text extraction as well as the partitioning of large documents into smaller, manageable pieces.
• Embedding generator: This module converts textual content from document
parts and user queries into high-dimensional vector representations
(embeddings). These embeddings encode the semantic relationships in the text
and the subtle contextual details.
• Vector database: This database is optimized for high-dimensional embeddings and hence uses Approximate Nearest Neighbor (ANN) search algorithms and indexing strategies for fast similarity computations.
• Vector search and retrieval: This component runs the similarity search over previously stored embeddings. Using the query embedding and semantic similarity or other distance metrics, it identifies the most semantically relevant document chunks and assigns them scores.
• API development: The synchronization layer acts like the central interface,
and it handles communication among the backend components and the chatbot
interface.
The modular architecture of the system is designed in such a way that it can be scaled,
maintained, and extended so that it can handle the growing amount of data, varied
requirements, and integration of new components or functionalities in the future.
The CSV files serve as the medium through which we integrate Hanoi University's domain-specific corpus, a collection of texts already prepared and stored within our system. These CSV files support the structured gathering of expert records from the various university departments and sources. They act as a staging repository, ensuring that the university's numerous documents remain cohesive and accessible.
Figure 1: A CSV file storing documents about HANU educational program
As illustrated in Figure 1, the CSV schema mirrors the database table design,
consisting of the following fields:
• Title: The informative heading of the document.
• Summary: The succinct outline that reflects the core content of the
document. This should use keywords, tags, or abbreviations.
• Content: The contents of the document in terms of the textual substance.
• URL: The unique URL to find and retrieve the source document.
• Contributor: The people who have created the document or who have
contributed to its content.
Such a schema suits a university's document library well, making it possible to provide academics with the relevant intellectual content they need.
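As an illustration, a CSV file with this schema could be loaded with pandas as sketched below; the file name is an assumption, while the column names mirror the fields listed above.

import pandas as pd

# Columns mirror the CSV schema described above.
COLUMNS = ["Title", "Summary", "Content", "URL", "Contributor"]

# The file name is an assumed example for the educational-program domain.
documents = pd.read_csv("educational_program.csv", usecols=COLUMNS)
print(documents.head())  # one row per source document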
For the underlying structure of the vector database, a solid schema is needed to store the resulting embeddings together with the domain documents they were generated from. This ensures that when the vector search algorithm is run, the expected result is obtained, i.e., similar vectors along with their corresponding document content.
Figure 2: A table about HANU educational program in Vector Database
Figure 2 shows the core table of the vector database, which houses the following key attributes:
• id (INTEGER): The unifying identifier for every document segment.
• content (TEXT): The combined text of the documents, including the “Title”
of the original document, a concise “Summary”, a segment of textual
“Content”, the source “URL”, and information about the document's
“Contributor”.
• embedding (VECTOR): The high-dimensional embedding that represents the
semantic meaning of the content chunk.
Together, the coupling of the structured CSV file with a tailored vector database
architecture creates a strong framework for handling and querying the university's
domain-specific knowledge assets. It incorporates the possibility of utilizing LLMs
such as ChatGPT that could facilitate a conversational AI to connect to a vast, well-
arranged, and easily accessible pool of institutional knowledge.
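A table matching this schema could be declared as sketched below. The table name, the SERIAL identifier, and the 1536-dimension vector (the output size of OpenAI's "text-embedding-3-small" model) are assumptions rather than the project's exact definition.

import psycopg2

# DDL matching the schema in Figure 2; names and dimensions are assumed.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS educational_program (
    id        SERIAL PRIMARY KEY,
    content   TEXT,
    embedding VECTOR(1536)
);
"""

conn = psycopg2.connect(dbname="hanu_chatbot", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute(DDL)
conn.close()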
Figure 3: System workflows of the HANU chatbot
As shown in Figure 3, there is a workflow that connects ChatGPT and Hanoi
University's knowledge corpus stored in CSV files.
The first primary task is preprocessing at the back-end side, where an in-built data
ingestion pipeline is used for consuming the sequence of CSV files with appropriate
preprocessing steps. The chunking pipeline (1) starts to separate the large texts into
smaller and easier-to-digest pieces, which in turn limits the size of the memory usage
and balances token usage. Each text chunk, together with its metadata, is then passed to the text embedding generator component (2), which
uses OpenAI's APIs to convert the textual data into a multi-dimensional vector
representation. The created embeddings and the concatenation of documents’ metadata
are systematically stored in a vector database (3) designed for the handling of high-
dimensional data. Vector-specific functionalities and indexing mechanisms built into
the database allow for fast and easy retrieval.
On the client side, when a user submits a query (4), a POST request (5) is sent to the
backend server via the RESTful API that was developed.
The backend workflow then mirrors the embedding generation process (6) applied to
the corpus documents, transforming the user's query into a query embedding.
Afterward, it examines the vector database by performing the similarity search (7) so
that the query embedding can be used to determine the relevant document chunks
based on their semantic proximity.
Subsequently, the front end is fed with the relevant pieces (8) that were returned. The
user's query, document chunks, and any other context (i.e., previously stored response
history) from Session Storage (9) are then processed by the API of OpenAI (10). The
client communicates with the ChatGPT integration layer, which creates sophisticated
answers that are tailored to the question asked (11). The present context of the
interaction is then saved in Session Storage (12) for further recall, and the response is
finally shown to the user (13).
The core of the system is a seamless workflow spanning both the backend and the frontend: document ingestion, text processing, embedding creation, vector storage and retrieval, API communication, and context management. It serves as a medium
between ChatGPT's text-generating ability and the domain-specific knowledge
resource of Hanoi University, as the users can embark on a lively dialogue that would
help them in their learning and consulting.
The backend does not work independently but rather in a series of intertwined
workflows to facilitate integration between ChatGPT and Hanoi University's domain-
specific knowledge corpus. In other words, several processes take place for efficient
communication within the system: document ingestion, text processing, embedding
generation, vector storage and retrieval, and API communication.
• Step 1: Collecting and chunking documents. The HANU documents are
not used as they are but are pre-processed into smaller units (chunks) that can
easily be used to generate responses.
• Step 2: Generating embeddings. Each chunk is taken through an embedding
model to generate a vector representation that stores the meaning behind the
semantics it captures.
• Step 3: Storing embeddings. These vectors are stored in a dedicated vector
database for easy retrieval when needed.
• Step 4: Processing user questions. When a user asks a question, a POST
request is sent to the backend server. The user's question goes through the same
process that leads to the formation of query embedding.
• Step 5: Searching for relevant documents. The system queries the vector database, performing an ANN search for embeddings that are highly similar to the query embedding. The relevant documents whose embeddings score the highest similarity are then retrieved.
• Step 6: Returning relevant documents. Relevant documents are returned
through the system to the chatbot's front end based on the API channel. These
chunks will be used by the front-end application to return a comprehensive
response back to the user based on their query.
These workflows thus work together, complementing each other in such a way that
they serve as a bridge between ChatGPT's natural language capabilities and Hanoi
University's rich domain-specific knowledge assets. Thus, all potential users are able
to have rich conversations that are highly contextualized with regard to their academic
or professional needs.
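A compressed sketch of steps 4 to 6, written as a single Flask endpoint, is given below. The route, database, and table names are illustrative assumptions rather than the project's actual code, which is presented in the next chapter.

import numpy as np
import psycopg2
from flask import Flask, jsonify, request
from openai import OpenAI
from pgvector.psycopg2 import register_vector

app = Flask(__name__)
client = OpenAI()

@app.route("/relevant-documents", methods=["POST"])
def relevant_documents():
    question = request.get_json()["question"]

    # Step 4: the question goes through the same embedding model as the corpus.
    response = client.embeddings.create(
        model="text-embedding-3-small", input=question
    )
    query_embedding = np.array(response.data[0].embedding)

    # Step 5: ANN similarity search with pgvector ("<->" orders by distance).
    conn = psycopg2.connect(dbname="hanu_chatbot", user="postgres")
    register_vector(conn)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM educational_program "
            "ORDER BY embedding <-> %s LIMIT 5;",
            (query_embedding,),
        )
        chunks = [row[0] for row in cur.fetchall()]
    conn.close()

    # Step 6: return the relevant chunks to the front end via the API.
    return jsonify({"documents": chunks})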
5. CHAPTER 4: IMPLEMENTATION AND DISCUSSION
Python is used as the main development platform for the backend system: a versatile, widely adopted programming language whose rich ecosystem of libraries and frameworks, together with its simplicity and readability, makes it easier to build the NLP components, data manipulation routines, API endpoints, and other parts of the system.
To be more specific, the project leveraged several outstanding Python libraries, each
playing a crucial role in various aspects of the system:
• OpenAI: The openai library enabled smooth and direct communication
between the backend system and powerful language models of OpenAI. The
purpose of this library is to make it possible for the system to efficiently embed
textual data into numerical representations that encapsulate semantic
information.
• Psycopg2: Since the PostgreSQL database was used by the backend system
for strong data storage and management, the psycopg2 library provided an
interface for Python that was well suited for interacting with the PostgreSQL
database. This library allowed the creation of safe connections, the execution of SQL queries, and the extraction and manipulation of data stored in PostgreSQL.
• Pgvector: The pgvector library served as a core component in connecting the
backend system to the PostgreSQL database through the vector extension. This
integration was the key to success in the implementation of such a system,
where the document embeddings are the basic components of its functionality.
• Flask: The Flask framework acted as the keystone mediating between the user interface and the backend system, allowing information to flow between them more easily. Creating API endpoints, handling HTTP requests and responses, and integrating smoothly with the other system components were all simplified thanks to Flask's simple and modular design.
• Pandas: Among the many Python libraries, pandas, which specializes in data manipulation and analysis, handled the more complicated data structures within the project. Its intuitive data structures and data manipulation functions made data transformation operations faster and easier.
These key libraries, along with others, can be installed using well-known package managers such as pip or Conda; the corresponding installation commands and library imports are then used throughout the development process.
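For reference, a typical set of imports for these libraries is shown below, together with an assumed installation command; exact package versions are not prescribed by this thesis.

# A possible installation command (assumed, not prescriptive):
#   pip install openai psycopg2-binary pgvector flask pandas
from openai import OpenAI                      # OpenAI embedding and chat models
import psycopg2                                # PostgreSQL driver
from pgvector.psycopg2 import register_vector  # vector type support for psycopg2
from flask import Flask                        # RESTful API framework
import pandas as pd                            # CSV handling and data manipulation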
With Python and this set of supporting libraries and frameworks, the architecture links the different modules together, performs the required data operations, and provides a reliable and scalable backend infrastructure for exchanging data between the university community and the language models.
Also, to increase data retrieval efficiency, the project utilizes the ivfflat indexing method. Ivfflat stands for "inverted file with flat compression", an index type that speeds up approximate similarity search over the embeddings, improves retrieval speed, and enables the quick and efficient identification of documents that are thematically relevant to a query.
These actions serve as the pillars of data management, which assures the reliability
and efficiency of the system.
chatbot scope, we prepare two document domains: “educational_program” and
“public_administration”. Only the relevant columns are loaded (i.e., Title, Summary,
Content, URL, Contributor), ensuring only the necessary information is considered.
The next construction is the “get_num_tokens_from_string” function, which specifies
the number of tokens contained in a given text string, as depicted in Figure 6.
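A possible implementation of such a helper, assuming the "tiktoken" tokenizer that matches OpenAI's models (an assumption, since Figure 6 itself is not reproduced here), is sketched below.

import tiktoken

def get_num_tokens_from_string(text, encoding_name="cl100k_base"):
    # Count the tokens produced for the given text with the chosen encoding.
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

# Roughly 400 tokens corresponds to about 300 English words.
print(get_num_tokens_from_string("Hanoi University chatbot backend"))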
Figure 7: Code snippet – Chunk HANU documents into smaller units
According to Figure 7, this function begins by looping through all the documents to
check whether the length of their content exceeds the allocated token length (in this
scenario, set at 400 tokens, approximately equivalent to 300 words). Documents
within this limit can be directly appended to the chunk list. Conversely, documents
surpassing the token limit undergo segmentation into smaller chunks based on an ideal
chunk size (ideal_size), derived as a fraction of the maximum token limit. Chunks are formed iteratively until their token count reaches the ideal size or the end of the content is reached. Finally, the function returns a DataFrame containing the chunked data,
with the same columns as the data input.
This chunking pipeline allows the system to effectively process and manage large and
diverse documents from Hanoi University's data warehouse. Dividing large papers into
smaller parts makes it easier to organize and retrieve relevant excerpts as well as to
generate suitable content input for the embedding model.
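A simplified version of such a chunking routine, reusing the token counter sketched earlier and assuming a 400-token limit with an ideal chunk size derived from it, might look as follows; it is a sketch, not the exact code of Figure 7.

import pandas as pd

MAX_TOKENS = 400                       # limit described above
IDEAL_SIZE = int(MAX_TOKENS // 1.33)   # assumed fraction of the maximum limit

def chunk_documents(documents):
    rows = []
    for _, doc in documents.iterrows():
        if get_num_tokens_from_string(str(doc["Content"])) <= MAX_TOKENS:
            rows.append(doc.to_dict())  # short documents are kept whole
            continue
        # Longer documents are split word by word until a piece reaches
        # the ideal chunk size, then a new piece is started.
        piece = []
        for word in str(doc["Content"]).split():
            piece.append(word)
            if get_num_tokens_from_string(" ".join(piece)) >= IDEAL_SIZE:
                chunk = doc.to_dict()
                chunk["Content"] = " ".join(piece)
                rows.append(chunk)
                piece = []
        if piece:                       # remaining tail of the document
            chunk = doc.to_dict()
            chunk["Content"] = " ".join(piece)
            rows.append(chunk)
    return pd.DataFrame(rows, columns=documents.columns)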
5.3. Embedding generation
The embedding generation pipeline makes use of the capabilities of the OpenAI
embedding models, specifically the “text-embedding-3-small” model, which is
designed for generating high-quality embeddings from text input. This model scores 62.3% on the MTEB evaluation benchmark, offers roughly 62,500 pages per dollar, and accepts inputs of up to 8,191 tokens (OpenAI, n.d.).
In particular, the code snippets provided (Figures 8, 9, 10, and 11) illustrate the
functions joining in this process. It starts by combining all the necessary information
from each document gathered into a single string that can be used as the input to the
embedding model or the vector database, as depicted in Figure 8.
documents. The embeddings generation process is then applied to the “Text” column,
with the resulting embeddings stored in the “Embedding” column. Finally, the updated
DataFrame is saved to a new CSV output file for later review and reference.
Through splitting the concerns of core text and metadata, the system guarantees that
the embeddings represent the true semantic relationships of the core content while
preserving the metadata for later retrieval.
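A condensed sketch of this pipeline is given below: the metadata fields are combined into a single "Text" string, embedded with the "text-embedding-3-small" model, and saved with the resulting "Embedding" column. The output file name is an assumption, and whether the embedding input includes the metadata or only the core content is simplified here.

import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def combine_metadata(row):
    # Merge the document fields into a single string for embedding and storage.
    return (f"Title: {row['Title']}\nSummary: {row['Summary']}\n"
            f"Content: {row['Content']}\nURL: {row['URL']}\n"
            f"Contributor: {row['Contributor']}")

def get_embedding(text, model="text-embedding-3-small"):
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding

def embed_documents(chunks):
    chunks = chunks.copy()
    chunks["Text"] = chunks.apply(combine_metadata, axis=1)
    chunks["Embedding"] = chunks["Text"].apply(get_embedding)
    chunks.to_csv("educational_program_embeddings.csv", index=False)  # assumed path
    return chunks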
The chatbot system utilizes a specialized vector database to store and retrieve the
embeddings generated from the chunked corpus. After careful consideration, we
decided to leverage PostgreSQL, a widely adopted database management system,
along with the pgvector extension and ivfflat indexing scheme, for handling vector
data.
Consequently, we implement the core functions responsible for setting up the vector database and for ingesting into it the embeddings obtained during preprocessing. This begins with the creation of a regular database within the PostgreSQL instance (Figure 12), which is later equipped to act as a fully functional vector database.
As shown in Figure 12, the “create_database” function checks whether the database already exists and, if not, proceeds to create it. Given that PostgreSQL lacks native support for an “IF NOT EXISTS” option within the “CREATE DATABASE” SQL command, a subquery that checks whether a database with the same name already exists is used to fulfill this requirement. Furthermore, the function ensures that all privileges on the freshly created database are granted to the “postgres” user, the default administrative user configured during PostgreSQL installation.
Based on this, the “init_database” function initializes the database by opening a connection to the default “postgres” database and invoking the “create_database” function with the desired database name, later set to “hanu_chatbot”.
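A minimal sketch of these two functions using psycopg2, assuming a local PostgreSQL instance and default “postgres” credentials; the actual connection settings in Figure 12 may differ.

import psycopg2

def create_database(conn, db_name: str) -> None:
    """Create db_name if it does not already exist and grant privileges to postgres."""
    with conn.cursor() as cur:
        # PostgreSQL has no CREATE DATABASE IF NOT EXISTS, so check pg_database first.
        cur.execute("SELECT 1 FROM pg_database WHERE datname = %s", (db_name,))
        if cur.fetchone() is None:
            cur.execute(f'CREATE DATABASE "{db_name}"')
        cur.execute(f'GRANT ALL PRIVILEGES ON DATABASE "{db_name}" TO postgres')

def init_database(db_name: str = "hanu_chatbot") -> None:
    """Connect to the default postgres database and ensure db_name exists."""
    conn = psycopg2.connect(dbname="postgres", user="postgres",
                            password="postgres", host="localhost", port=5432)
    conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
    try:
        create_database(conn, db_name)
    finally:
        conn.close()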
At the core of our database setup lies the “create_extension” function, which allows us
to transform our database into a fully functional vector database (Figure 13).
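With pgvector installed on the server, enabling it for a database is a single statement; a sketch of what “create_extension” might look like, under the same connection assumptions as above, follows.

import psycopg2

def create_extension(db_name: str) -> None:
    """Enable the pgvector extension so vector columns and operators become available."""
    conn = psycopg2.connect(dbname=db_name, user="postgres",
                            password="postgres", host="localhost", port=5432)
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.close()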
Figure 14 (page 38) depicts how the “create_index” function establishes the ivfflat index on the “embedding” column. More specifically, the “lists” parameter of the “CREATE INDEX” statement specifies the number of lists used in the index structure. Each list in an ivfflat index represents a cluster of embeddings; during indexing, embeddings are grouped into these lists based on their proximity to certain centroids. In this case, the index organizes the embeddings into 100 distinct lists. Designed for expedited approximate nearest-neighbour (ANN) searches over the embeddings, this index ensures swift retrieval of the document chunks that most closely match a given query embedding.
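A sketch of the corresponding index statement is shown below; the cosine operator class (vector_cosine_ops) and the index name are assumptions, while lists = 100 follows the text.

import psycopg2

def create_index(db_name: str, table_name: str, lists: int = 100) -> None:
    """Build an ivfflat index over the embedding column for approximate NN search."""
    conn = psycopg2.connect(dbname=db_name, user="postgres",
                            password="postgres", host="localhost", port=5432)
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(
            f"CREATE INDEX IF NOT EXISTS {table_name}_embedding_idx "
            f"ON {table_name} USING ivfflat (embedding vector_cosine_ops) "
            f"WITH (lists = {lists})"
        )
    conn.close()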
Naturally, creating an index requires the table to exist first. This is where the “create_table” and “init_table” functions come into play (Figure 15).
As Figure 15 shows, the “init_table” function first enables the extension for efficient handling of vector data. Proceeding from this, it creates a table within the database for storing document segments and their corresponding embeddings. Finally, it builds the index on the embedding column to enhance retrieval efficiency.
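A sketch of these two functions, assuming a minimal two-column schema (content plus a 1,536-dimensional embedding, the output size of text-embedding-3-small) and reusing the helpers sketched above.

import psycopg2

def create_table(db_name: str, table_name: str) -> None:
    """Create the table that stores document chunks and their embeddings."""
    conn = psycopg2.connect(dbname=db_name, user="postgres",
                            password="postgres", host="localhost", port=5432)
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(
            f"CREATE TABLE IF NOT EXISTS {table_name} ("
            "id SERIAL PRIMARY KEY, "
            "content TEXT, "
            "embedding vector(1536))"  # text-embedding-3-small returns 1536 dimensions
        )
    conn.close()

def init_table(db_name: str, table_name: str) -> None:
    """Enable pgvector, create the chunk table, and build its ivfflat index."""
    create_extension(db_name)
    create_table(db_name, table_name)
    create_index(db_name, table_name)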
With the vector database and vector table now set up, the next major step is to load the processed data into the database. This involves transferring the pre-processed documents, along with their corresponding vectors, into the designated tables.
To this end, the “load_data” function is responsible for ingesting the embeddings generated from the chunked corpus into the database table (Figure 16).
Figure 16: Code snippet – Load training data into the vector table
As presented in Figure 16, the function reads the embeddings saved in the CSV file produced earlier by the “process_data” function. It then constructs a list of tuples containing each chunk's content and its corresponding embedding and inserts them into the specified table using a bulk insert operation (cur.executemany).
Building on this, the “store_data” function (Figure 16) acts as a wrapper around “load_data”. It takes the path of the file containing the embeddings, the database name, and the table name as input, calls “load_data” with the appropriate parameters, and closes the connection to the database afterwards.
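A sketch of this pair, assuming the processed CSV stores the combined text in a “Text” column and the embedding list in an “Embedding” column, as described earlier; pgvector accepts the bracketed “[x1, x2, ...]” string form directly.

import pandas as pd
import psycopg2

def load_data(conn, csv_path: str, table_name: str) -> None:
    """Bulk-insert (content, embedding) pairs from a processed CSV into the table."""
    df = pd.read_csv(csv_path)
    # The Embedding column round-trips through CSV as "[x1, x2, ...]", which pgvector parses.
    rows = [(row["Text"], row["Embedding"]) for _, row in df.iterrows()]
    with conn.cursor() as cur:
        cur.executemany(
            f"INSERT INTO {table_name} (content, embedding) VALUES (%s, %s)", rows
        )
    conn.commit()

def store_data(csv_path: str, db_name: str, table_name: str) -> None:
    """Open a connection, delegate to load_data, and close the connection."""
    conn = psycopg2.connect(dbname=db_name, user="postgres",
                            password="postgres", host="localhost", port=5432)
    try:
        load_data(conn, csv_path, table_name)
    finally:
        conn.close()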
With all the necessary methods in place, we can now start importing the training dataset into the vector database. This chatbot training process begins with initializing all the necessary infrastructure, such as media folders and databases, as illustrated in Figure 17.
Figure 18: Code snippet – Load corpus into vector database (full steps)
As shown in Figure 18, the “load_corpus” function orchestrates the entire process of ingesting the corpus into the vector database. Taking the database name and the chatbot name (representing the corpus domain) as input, the function initializes a dedicated vector table for storing the domain embeddings, then retrieves all the CSV files containing the corpus data from the specified directory. For each detected file, the method calls the “collect_data” and “get_chunked_data” functions to collect the stored data and split it into smaller units. The chunked data is then embedded by the “process_data” function and stored in a new CSV file for future reference. Finally, the “store_data” function is invoked to ingest the embeddings and documents from the generated CSV file into the corresponding table in the vector database.
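A sketch of this orchestration, reusing the helpers sketched earlier; the directory layout (a media/<domain> folder of CSV files) and the “_embedded” output suffix are assumptions rather than the original implementation.

import glob
import os

def load_corpus(db_name: str, chatbot_name: str, data_dir: str = "media") -> None:
    """Ingest every CSV file of a corpus domain into its dedicated vector table."""
    table_name = chatbot_name                     # e.g. "educational_program"
    init_table(db_name, table_name)
    for csv_path in glob.glob(os.path.join(data_dir, chatbot_name, "*.csv")):
        raw = collect_data(csv_path)              # load only the relevant columns
        chunked = get_chunked_data(raw)           # split long documents into chunks
        output_csv = csv_path.replace(".csv", "_embedded.csv")
        process_data(chunked, output_csv)         # embed and persist for reference
        store_data(output_csv, db_name, table_name)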
Consequently, the HANU chatbot’s bank of domain-oriented knowledge can grow incrementally using the commands listed in Figure 19.
The backbone of the proposed system's ability to deliver accurate and relevant
responses lies in its efficient vector searching and retrieval capabilities. Once the
embeddings generated from the chunked corpus are stored in the specialized vector
database, the system can leverage these embeddings to perform similarity searches,
identifying the most relevant document chunks in response to user queries.
From there on, the “vector_search” function (Figure 20) orchestrates the process of
retrieving the top-ranked relevant document chunks based on their semantic similarity
to the given user query.
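The retrieval itself reduces to an ORDER BY over pgvector's cosine-distance operator (<=>); the sketch below assumes the two-column schema used earlier and a top_k of 15, matching the number of input segments described in the ChatGPT integration later in this chapter. The production function presumably also applies a similarity threshold, since completely irrelevant queries often return no documents at all, as discussed in the results.

import psycopg2

def vector_search(db_name: str, table_name: str,
                  query_embedding: list[float], top_k: int = 15) -> list[str]:
    """Return the top_k stored chunks closest to the query embedding."""
    conn = psycopg2.connect(dbname=db_name, user="postgres",
                            password="postgres", host="localhost", port=5432)
    try:
        with conn.cursor() as cur:
            cur.execute(
                f"SELECT content FROM {table_name} "
                "ORDER BY embedding <=> %s::vector LIMIT %s",
                (str(query_embedding), top_k),
            )
            return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()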
5.6. API development
Figure 22: Code snippet – Implement API endpoint for relevant documents
This endpoint (Figure 22) is responsible for retrieving relevant document chunks from the educational_program corpus based on the user's query. The request data is extracted using request.get_json(), and the question parameter is obtained from the JSON payload. If a question is provided in the request, the “get_embedding” function is called to generate an embedding vector for the user's question. Then, the “vector_search” function is invoked to retrieve the most relevant document chunks. Finally, a JSON response is returned, containing the list of these chunks under the “relevant_docs” key.
Lastly, as detailed in Figure 23, the API is designed to run locally on “localhost” with
port 8080 when executed as the “main” script.
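Pulling these pieces together, a Flask endpoint along the following lines would match the behaviour described; the route path and error handling are assumptions, while the “relevant_docs” key, “localhost”, and port 8080 follow the text.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/educational_program", methods=["POST"])
def educational_program_docs():
    """Return the document chunks most relevant to the question in the JSON payload."""
    data = request.get_json()
    question = data.get("question") if data else None
    if not question:
        return jsonify({"error": "No question provided"}), 400
    query_embedding = get_embedding(question)
    relevant_docs = vector_search("hanu_chatbot", "educational_program", query_embedding)
    return jsonify({"relevant_docs": relevant_docs})

if __name__ == "__main__":
    app.run(host="localhost", port=8080)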
We chose to integrate ChatGPT at the front end in order to avert a potential bottleneck in the back-end server. However, the integration could also be performed in the backend, and both options carry advantages and disadvantages.
Integrating ChatGPT in the backend would involve defining a function for obtaining the answer from ChatGPT, in which the backend takes care of the entire process of calling OpenAI's API, including preparing the context and the relevant documents that will be used to generate an answer.
Firstly, it is essential to set up a method that lets us take advantage of the text generation capability of the chosen LLM. Given our need for a model with a large context window of 16,385 tokens and improved formatting capabilities, particularly for non-English languages, we opted for the “gpt-3.5-turbo-0125” model, as shown in Figure 24. The model receives up to 15 input segments of 400–500 tokens each (400 tokens of document content, plus whatever metadata fits, namely the title, summary, URL, and contributor). These segments are sent together with up to five recent responses (context saved in the client's session storage) of up to 1,500 tokens each, and the model returns up to 1,500 response tokens. By limiting the length of each chunk and sending a reasonable number of chunks and context turns to the LLM, we keep the total token count within the 16,385-token context window of GPT-3.5 Turbo. This budget becomes more comfortable with more advanced but costlier models such as GPT-4.
Figure 25: Code snippet – Define body message for OpenAI API
According to Figure 25, the function begins by generating an embedding for the user's question using the “get_embedding” function (Figure 10, page 36). This embedding is then used to perform a vector search in the database and retrieve the most relevant documents based on vector similarity. Next, the function constructs the appropriate system and assistant messages from the question, the context, and the retrieved documents. The system message contains instructions for ChatGPT: refer to the context and the relevant documents when formulating the response, use the same language as the question, respond concisely and in a technically credible tone, and handle currency conversions where necessary. The user message consists of the question itself, while the assistant message carries the context and relevant documents for the model to draw on when generating the response. Finally, these messages are passed to the “get_completion_from_message” function to obtain the expected response.
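A sketch of this flow, reusing the OpenAI client and the helpers from earlier sketches; the wording of the system prompt is only a paraphrase of the instructions described above, and the temperature setting is an assumption.

def get_completion_from_message(messages, model: str = "gpt-3.5-turbo-0125",
                                max_tokens: int = 1500) -> str:
    """Send the prepared messages to the chat model and return its reply text."""
    response = client.chat.completions.create(
        model=model, messages=messages, max_tokens=max_tokens, temperature=0
    )
    return response.choices[0].message.content

def get_answer(question: str, context: list[str]) -> str:
    """Retrieve relevant chunks and ask the model for a grounded answer."""
    query_embedding = get_embedding(question)
    relevant_docs = vector_search("hanu_chatbot", "educational_program", query_embedding)
    system_message = (
        "You are the HANU chatbot. Answer using the provided context and relevant "
        "documents, reply in the same language as the question, keep a concise and "
        "technically credible tone, and convert currencies when necessary."
    )
    messages = [
        {"role": "system", "content": system_message},
        {"role": "assistant",
         "content": "Context:\n" + "\n".join(context)
                    + "\nRelevant documents:\n" + "\n".join(relevant_docs)},
        {"role": "user", "content": question},
    ]
    return get_completion_from_message(messages)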
Accordingly, the endpoint’s function can be slightly modified to deliver back the
response obtained from the model, as depicted in Figure 26.
Figure 26: Code snippet – Implement API endpoint for complete response
Thus, as shown in Figure 26, the answer obtained from calling the “get_answer” method is returned directly to the client, instead of the list of relevant documents returned before.
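A sketch of the adjusted endpoint, again with an assumed route path; the recent context is expected from the client, since it is kept in the client's session storage.

@app.route("/educational_program/answer", methods=["POST"])
def educational_program_answer():
    """Return the generated answer instead of the raw list of relevant chunks."""
    data = request.get_json()
    question = data.get("question") if data else None
    if not question:
        return jsonify({"error": "No question provided"}), 400
    context = data.get("context", [])   # recent responses supplied by the client
    return jsonify({"answer": get_answer(question, context)})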
Generally, backend integration brings benefits such as centralized logic, which is easier to maintain, and better security, since the API keys stay protected on the server instead of being exposed at the front end. Nevertheless, this approach can hit performance bottlenecks when handling a massive number of requests or complex queries, so users may encounter delays or timeout errors. Scalability is another concern: as the number of users and requests grows, the backend alone must be scaled to absorb the additional load, which is more complex and costly than distributing that work to the front end.
In our project, where the HANU chatbot is a public AI assistant for educational purposes such as looking up information and obtaining consultancy, we have chosen to offload the computationally intensive task of producing natural responses with ChatGPT to the client side, thereby relieving the bottleneck in the backend. Security measures then centre on safeguarding the OpenAI API key and the overall reliability of the application. This approach leads to a cleaner architecture in which the back end processes data requests and the front end renders content and manages user interactions.
Notably, the choice between backend and frontend integration should be weighed against factors such as performance requirements, scalability demands, security concerns, and the overall application architecture. Each approach has its pros and cons, and the decision ultimately rests on the project's specific requirements and constraints.
The execution of the proposed system led to fruitful outcomes, showing that combining ChatGPT with Hanoi University's domain-specific knowledge yields an effective tool. The query processing and vector search mechanisms were assessed in detail, and the integration of ChatGPT's text generation was measured thoroughly.
After training the chatbot, we can review the comprehensive pipeline, which covers
storage preparation, data ingestion, data preprocessing, data embedding, and data
storing, in the logging screen shown in Figure 27.
• “educational_program” table: contains 333 document instances related to various educational curricula and lecturers at Hanoi University (briefly depicted in Figure 2, page 26).
• “public_administration” table: contains 107 document instances ranging
from administrative procedures, admission and tuition fee information, to
announcements within the university.
On the other hand, the result after submitting the user’s question to the backend
endpoints is demonstrated in Figure 28.
• In all cases, when sending requests to the backend endpoints, results are always returned successfully in less than two seconds, even though the database tables hold more than 300 instances.
• Testing on directly relevant questions (50 sample questions related directly to the documents’ chunked segments): in 100% of the cases, the expected, on-point chunks were found in the “relevant_docs” list.
• Testing on indirectly relevant questions (50 sample questions indirectly
related to the documents, mentioning keywords but not directly asking about
them): In 100% of the cases, some relevant documents were discovered,
although they did not present the needed information.
• Testing on completely irrelevant questions (50 sample questions not relevant
at all, without important keywords like “credits”, “faculty”, “tuition fee”, etc.):
In 20% of the cases, some documents still appeared in the “relevant_docs” list, typically those containing similar keywords in their content, such as “university” or “information technology”. In the remaining cases, no documents were returned.
• In domains where similar content occurs frequently (e.g., information related to general knowledge blocks, credits, faculties, and teachers), the “relevant_docs” list sometimes ranked the expected documents lower than less relevant documents that contained more overlapping keywords (and hence a higher similarity score). However, the expected documents never dropped off the list entirely.
To sum up, the vector search components performed remarkably well in retrieving
relevant documents based on the provided queries, even when the queries were
indirectly related to the document content.
After the list of relevant documents is delivered to the client, the front-end application integrates these documents and the recent context (the chatbot’s previous responses) into the instructions sent to ChatGPT, triggering its text generation to deliver the response to the user, as visualized in Figure 29, page 52.
Figure 29: HANU chatbot preview
The integration of the retrieved relevant documents with GPT-3.5 text generation
capabilities was evaluated through various query scenarios:
• Testing on directly relevant questions (50 sample questions related directly to
the documents’ chunked content): Over 90% of the cases presented the
expected response; other cases showed some mistakes like language
mismatches and wrong calculations.
• Testing on indirectly relevant questions (50 sample questions indirectly
related to the documents’ chunked content, meaning that information has to be
synthesized from multiple passages to draw conclusions, like asking about the
differences between two subjects): 60% of the cases showed the expected
responses. Others provided responses in different directions, which did not
really answer the question.
• Testing on completely irrelevant questions (50 sample questions not relevant
at all, without important keywords): Over 90% of the cases behaved as expected
(common responses from ChatGPT); others triggered language mismatches or
wrong interpretations.
• Testing on the context-based questions (50 sample pairs of related questions):
50% of the cases generated the expected answers, taking into account the
context from previous questions and responses. The remaining cases were
unable to provide satisfactory answers.
All things considered, the resulting response not only provided domain-specific
knowledge but also allowed for natural language conversations, enabling users to ask
follow-up questions and receive context-aware responses.
In order to assess the overall system's performance, a comparative evaluation was
conducted between the responses of ChatGPT and the HANU chatbot (Figure 30).
6. CONCLUSION
The dissertation presents an idea for a new approach: integrating LLMs such as
ChatGPT with Hanoi University's domain-aligned information resources. This
innovative solution will establish a bridge between ChatGPT as a cutting-edge chatbot
and domain-specific information, supported by sophisticated backend infrastructure.
At the core of this system is OpenAI's embedding model, which converts knowledge
content into high-dimensional vector embeddings stored in an optimized vector
database. This setup allows quick identification of semantically similar document fragments based on user queries. The system then exposes a RESTful API through which these information snippets are retrieved and supplemented with user context, enabling it to craft tailored responses for academic and professional audiences.
The chatbot, thus, plays a significant role in improving the learning experience, which
in turn enhances knowledge acquisition for students. It also acts as an all-around
assistant for scholars, helping to ease workload while improving information
accessibility.
Moreover, the system not only advances domain-oriented AI technology but also offers insight into how specialized knowledge can be brought into LLMs through structured processes and infrastructure. This deepens our understanding of the interplay between AI and human capabilities, which is critical as this research-intensive, rapidly evolving field shifts toward technology-led skill development.
A few improvements could help secure the future of the HANU chatbot system and expand its operational scope. With regard to data security, authentication and authorization should be implemented later on. In addition, historical conversations are expected to be stored in the vector database to improve context-based responses. Furthermore, the system could be extended to accept inputs such as images, PDF files, and other forms of multimedia, widening its knowledge scope and thereby elevating the user experience.
This dissertation is both pioneering and forward-looking within the wider domain of education-oriented AI technology. It explores a novel approach to infusing specialized domain knowledge into large language models through dedicated processes and architectures. The study not only advances our understanding of the interplay between AI and human cognition but also lays a strong foundation for the future growth of this swiftly developing field.
REFERENCES
Guinness, H. (2023). How does ChatGPT work?. Zapier. Retrieved from
https://zapier.com/blog/how-does-chatgpt-work/
Hadi, M. U., Al-Tashi, Q., Qureshi, R., Shah, A., Muneer, A., Irfan, M., Zafar, A.,
Shaikh, M., Akhtar, N., Wu, J., & Mirjalili, S. (2023). A Survey on Large
Language Models: Applications, Challenges, Limitations, and Practical Usage.
Retrieved from https://doi.org/10.36227/techrxiv.23589741
Heuer, M., Lewandowski, T., Kučević, E., Hellmich, J., Raykhlin, M., Blum, S., &
Böhmann, T. (2023). Towards Effective Conversational Agents: A Prototype-
based Approach for Facilitating their Evaluation and Improvement. Retrieved
from
https://researchgate.net/publication/368791250_Towards_Effective_Conversation
al_Agents_A_Prototype_based_Approach_for_Facilitating_their_Evaluation_and
_Improvement
Hua, S., Jin, S., & Jiang, S. (2023). The Limitations and Ethical Considerations of
ChatGPT. Data Intelligence, 6(1), 1-38. Retrieved from
https://doi.org/10.1162/dint_a_00243
IBM. (n.d.). What are large language models (LLMs)?. Retrieved from
https://www.ibm.com/topics/large-language-models
IBM. (n.d.). What is conversational AI?. Retrieved from
https://www.ibm.com/topics/conversational-ai
Kamalov, F., Calonge, D. S., & Gurrib, I. (2023). New era of artificial intelligence in
education: Towards a sustainable multifaceted revolution. Sustainability, 15(16),
12451. Retrieved from https://doi.org/10.3390/su151612451
Labadze, L., Grigolia, M., & Machaidze, L. (2023). Role of AI chatbots in education:
A systematic literature review. International Journal of Educational Technology in
Higher Education, 20(1). Retrieved from https://doi.org/10.1186/s41239-023-
00426-1
Lowery, M., & Borghi, S. (2024). 15 Benefits of ChatGPT (+8 Disadvantages).
Semrush Blog. Retrieved from https://www.semrush.com/blog/benefits-chatgpt/
MatrixFlows. (n.d.). RAG, Fine-tuning or Both? A Complete Framework for Choosing
the Right Strategy. Retrieved from https://www.matrixflows.com/blog/retrieval-
augmented-generation-rag-finetuning-hybrid-framework-for-choosing-right-
strategy
Mohamadi, S., Mujtaba, G., Doretto, G., & Adjeroh, D. (2023). ChatGPT in the Age of
Generative AI and Large Language Models: A Concise Survey. Retrieved from
https://www.researchgate.net/publication/372248458_ChatGPT_in_the_Age_of_
Generative_AI_and_Large_Language_Models_A_Concise_Survey
Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N.,
Barnes, N., & Mian, A. (2024). A comprehensive overview of large language
models. Retrieved from https://doi.org/10.48550/arXiv.2307.06435
OpenAI. (n.d.). Embeddings. Retrieved from
https://platform.openai.com/docs/guides/embeddings/embedding-
models?ref=timescale.com
OpenAI. (n.d.). Introducing ChatGPT - OpenAI. Retrieved from
https://openai.com/blog/chatgpt/
OpenAI. (n.d.). Fine-tuning. Retrieved from
https://platform.openai.com/docs/guides/fine-tuning
OpenAI. (n.d.). What is ChatGPT?. Retrieved from
https://help.openai.com/en/articles/6783457-what-is-chatgpt
Rozear, H., & Park, S. (2023). ChatGPT and Fake Citations. Duke University
Libraries Blogs. Retrieved from
https://blogs.library.duke.edu/blog/2023/03/09/chatgpt-and-fake-citations/
Wilson, M. (2024). ChatGPT explained – everything you need to know about the AI
chatbot. TechRadar. Retrieved from https://www.techradar.com/news/chatgpt-
explained