
INTERACTIVE DOCUMENT SUMMARIZER USING LLM TECHNOLOGY

Lappeenranta–Lahti University of Technology LUT

Master’s Programme in Software Product Management and Business, Master’s thesis

2024

Esko Malinen

Examiners: Professor Najmul Islam and Junior Researcher Maryam Hina


ABSTRACT

Lappeenranta–Lahti University of Technology LUT


LUT School of Engineering Science
Software Product Management and Business

Esko Malinen

Interactive Document Summarizer Using LLM technology

Master’s thesis
2024
60 pages, 7 figures, 5 tables and 3 appendices
Examiners: Professor Najmul Islam and Junior Researcher Maryam Hina
Keywords: Artificial Intelligence, Generative AI, Large Language Model, Retrieval
Augmented Generation

The development of large language models and generative AI has been remarkably fast in recent years. ChatGPT, published by OpenAI in 2022, finally brought these technologies to the attention of the general public, although research had been ongoing in the scientific community for decades. Generative AI is one of the hottest topics in technology and as such offers interesting subjects for design science research.
This master's thesis delves into the properties and usage of large language models and generative AI, as well as the properties of Retrieval Augmented Generation and the possibilities it offers. As the practical part of the thesis, a software application capable of interactive discussion in the context of given documents is implemented using modern development tools and environments. The resulting application is tested with suitable test material and compared to existing applications with similar functionality, and the test results are analyzed.
The chosen implementation method successfully produced an application comparable to commercial solutions. The application can summarize even lengthy documents into clear, concise text and answer posed questions with accurate and relevant information. While response times are currently slower than ideal, this can be attributed to the low service tier used for the integrated external services. Future optimization efforts can focus on improving response speed and comparing the use of different language models.
TIIVISTELMÄ (ABSTRACT)

Lappeenranta–Lahti University of Technology LUT


LUT School of Engineering Science
Software Product Management and Business

Esko Malinen

An interactive document summarizer utilizing LLM technology

Master's thesis
2024
60 pages, 7 figures, 5 tables and 3 appendices
Examiners: Professor Najmul Islam and Junior Researcher Maryam Hina
Keywords: Artificial Intelligence, Generative AI, Large Language Model, Retrieval Augmented Generation

The development of large language models and generative AI has been extremely rapid in recent years. ChatGPT, published by OpenAI in 2022, brought these technologies to the awareness of the general public, although AI research has been conducted in the scientific community for decades. Generative AI is currently one of the biggest topics of discussion in the technology field, and it therefore also offers interesting subjects for a design-science-oriented thesis.
This master's thesis examines the properties and use of generative AI and large language models, as well as the theory of the Retrieval Augmented Generation (RAG) architecture and the possibilities it offers. As the practical part, a software application capable of interactive discussion in the context of given documents is implemented using modern development tools and environments. The resulting application is tested with suitable material, compared to similar available applications, and the results are analyzed.
The test results show that the chosen implementation approach can produce an interactive question-answering application on par with commercial tools, able to summarize even the contents of a long document into readable text and to answer given questions factually and sensibly. The slowness of responses stands out, but this can largely be attributed to the low service tier of the external services, which was chosen for cost reasons. Future development should focus on improving response speed and comparing the properties of different language models.
ACKNOWLEDGEMENTS

Thanks to my spouse Maria for being the most kind, loving and understanding sweetheart,
supporting my efforts and tolerating my frustrations during my M.Sc. studies and
especially during the various challenges of this thesis work.

Table of contents

Abstract

Acknowledgements

1 Introduction
2 Background
2.1 Artificial Intelligence
2.2 Natural Language Processing and LLMs
2.3 Retrieval Augmented Generation
3 Research method
4 DoChatAI - Interactive Document Summarizer
4.1 Features and Implementation
4.2 Known limitations and issues
5 Results
5.1 Comparison to existing similar implementations
6 Discussion
6.1 Future of RAG technology
7 Conclusions
References

Appendices

Appendix 1: DoChatAI responses for the test document 'B2B Platform Assimilation'
Appendix 2: Prompt texts used in DoChatAI
Appendix 3: Example of JSON messages for a single query

1 Introduction

During the last couple of years, intelligent chatbots have frequently been in the headlines, ever since OpenAI released their renowned chatbot, ChatGPT, to the public in November 2022 (“ChatGPT Announcement,” 2022). While people actively following technology might already have known about the state of advancements in language processing, to the wider public it was startling news that computers were able to discuss open-domain questions in clear English. The reactions ranged from enthusiasm to the usual doomsday scenarios, in which an advanced superintelligence achieves awareness and takes over, possibly destroying humanity. Most importantly, innovative people started inventing countless ways of utilizing this new technology to help people and organizations do their work more effectively.

The technology behind ChatGPT had been in development for years before its release, and many of its most important aspects have been published in technology journals and scientific papers. Most of the building blocks have open-source alternatives, which anyone interested and skilled enough can use in their own projects to develop services and products that take advantage of the latest breakthroughs in the field of artificial intelligence. While technology companies keep their most advanced models private to protect their business interests, it is also in their interest to offer public APIs for developers to use. Most APIs developed by private companies are not free to use, since the resources needed to run the services are expensive. Nevertheless, the combination of open-source tools and paid services gives application developers the possibility to take advantage of the best aspects of each. One popular use case is using the models and APIs to implement applications that create summaries of large documents or collections of documents and offer the possibility to ask questions about the provided body of information. Applications of this kind were among the early adopters of ChatGPT and its peers, since their real-world value is easily demonstrable and understandable. Retrieval Augmented Generation (RAG) is one promising and widely used approach and was selected for the implementation part of this thesis (Lewis et al., 2020).

The objectives of this thesis are:

- To provide an overview of generative AI, LLMs and related technology.

- A proof-of-concept implementation for an application using RAG architecture.

- Comparison of features and performance to existing applications.

- Analysis of the aspects affecting the performance of such an application.

- An outlook on how to further utilize this technology.

This study seeks to answer the following research questions:

RQ1: How can a proof-of-concept application implementing summarization and question-answering features over given data be created using OpenAI APIs?

RQ2: What are the aspects affecting the performance of such applications?

The thesis is structured into chapters as follows:

- Chapter 1: Introduction

- Chapter 2: An overview of the history of AI and related basic concepts

- Chapter 3: Research method

- Chapter 4: Description of the implemented proof-of-concept work

- Chapter 5: Results of the study, comparison to selected existing implementations

- Chapter 6: Discussion

- Chapter 7: Conclusions

The result of this study is a working application using RAG architecture and LLM technology, providing the functionality needed for asking questions and getting factual answers based on the provided document data. The documentation contains all the information needed for understanding the technology behind this implementation.

2 Background

Artificial Intelligence (AI) is a very broad concept, containing several subordinate areas and directions of research and development. In this chapter the main areas are described, and the concepts of Large Language Models (LLMs), Natural Language Processing (NLP) and Retrieval Augmented Generation (RAG), their use, and their relation to other AI technology are explained in sufficient detail to understand the terms in the context of this thesis.

2.1 Artificial Intelligence

There is no official, definitive definition of Artificial Intelligence, but the Google and Bing search services currently provide the following identical explanation when asked:

- The theory and development of computer systems able to perform tasks normally
requiring human intelligence, such as visual perception, speech recognition,
decision-making, and translation between languages.

That’s well in line with the definition given in the book Introduction to Artificial Intelligence
(Finlay and Dix, 2020):

- AI is concerned with building machines that can act and react appropriately, adapting
their response to the demands of the situation. Such machines should display
behavior comparable with that considered to require intelligence in humans.

The idea of intelligent machines has fascinated humans for a long time, even long before the Digital Revolution that started in the late 1940s. A common view is that formal research in artificial intelligence started with a conference held in June 1956, The Dartmouth Summer Research Project on Artificial Intelligence, which paved the way for continuously accelerating research and development in the coming decades (Finlay and Dix, 2020). Even before that famous meeting, the first AI system, called the Logic Theorist, was published in 1955, which can also be considered the start of AI (Flasiński, 2016). Despite repeated setbacks, when real-world results failed to match the overly optimistic expectations created by media hype around newly published AI innovations, basic research has remained active in the scientific community. The newest wave of AI hype started in the autumn of 2022, when OpenAI published their chatbot, ChatGPT, capable of answering open-domain questions, creating astonishingly well-formed text documents and much more, all in natural human language and even supporting several languages (Amaratunga, 2023). Of course, the development of the language models behind ChatGPT and similar services had already been very active for several years at that point. While public attention focused on the magic of a machine understanding and producing clear English, the capabilities of this technology are vastly greater, as seen in the recent rush of AI companies introducing new products and services based on similar technology. We read weekly news about generative AI that can produce text, images, speech and video almost indistinguishable from material produced by humans, about how bots are going to destroy millions of jobs, and about companies laying off workers because bots are cheaper, more productive, and do not ask for raises or holidays.

Figure 1: Concepts in Artificial Intelligence



The whole field of AI is naturally very large, having been an active area of research for almost 70 years. There are several important branches within Artificial Intelligence, e.g. computer vision, fuzzy logic, expert systems, robotics, Machine Learning (ML), Neural Networks (NN), Deep Learning (DL), Natural Language Processing (NLP) and Generative AI (GenAI) (Amaratunga, 2023; Finlay and Dix, 2020; Flasiński, 2016; Kulkarni et al., 2023); see Figure 1 for a visual representation of the relations between these concepts. In the context of this thesis the most relevant areas are ML, DL, GenAI and NLP, because these are the enabling technologies behind Large Language Models, which in turn are the main enabling technology of AI-based chatbots (Amaratunga, 2023). ML provides a way of recognizing patterns in large amounts of data and is therefore used for many data-intensive applications such as image recognition and customer data analysis. It is one of the most important areas of AI, with many real-world applications in use and even more in development. The main methods of training the models are supervised learning, unsupervised learning, and reinforcement learning. The main function of ML is to generalize the patterns found in data so that they can be used in reasoning when solving new problems with previously unseen data (Fieguth, 2022). DL is a subarea of ML and uses layered artificial neural networks to gradually learn more abstract features, starting from details and widening the view layer by layer, with minimal human contribution (Alpaydin, 2014). With the continuous rapid increase in processing power and affordable data storage, enabled by technological advances in hardware and software, automated DL has become the hot topic of AI research.

GenAI is a subset of DL, which uses pre-trained models to generate new data similar to the data used to train the models. The new data can be basically anything, varying from all kinds of text to images, music, and even videos. GenAI is an especially hot field now that services like ChatGPT, Bing, Bard, Midjourney and DALL-E have brought the existence and possibilities of modern GenAI to the attention of the general public. While the online chatbots and image creation services are already mind-blowing examples of the capabilities of GenAI, it is the applications using these services to create new added value that will change the world the most in the future, by reducing or even replacing the need for human contribution in content creation and problem solving in many real-world situations. For example, GenAI can replace human customer service and act as a personal assistant, help students find and summarize information, aid teachers in creating personalized learning materials and help check student assignments. In healthcare it can aid or even replace doctors in making diagnoses based on patient symptoms. And generally, it can increase productivity in all kinds of content creation, like writing texts, designing graphics, or composing music for game development. The inevitable outcome seems to be that many jobs are either disappearing or at least transforming to take advantage of the efficiency provided by AI tools. The disruptive effects could become huge even in the near future, making numerous high-paying white-collar jobs obsolete. (Fui-Hoon Nah et al., 2023)

2.2 Natural Language Processing and LLMs

NLP is a technology that aims to bridge the gap between human and computer languages by using advanced computing to allow computers to be instructed in natural language and to respond to users in clear human language. The field dates back to the 1950s, when Alan Turing proposed his famous test for evaluating the quality of AI implementations (Amaratunga, 2023). Different approaches have dominated the research over the past decades, and in the 2000s and 2010s, due to a better understanding of the human brain, greatly increased computing power and earlier advances in research, ML/DL-based techniques became the most popular direction. Combined with the resources of large technology corporations developing AI solutions, this has made possible the tremendously quick development seen in recent years. Possible uses for NLP include text classification, Named Entity Recognition (NER), machine translation, text generation and speech recognition, which can be seen as relevant features of current AI applications like chatbots (Amaratunga, 2023). A significant part of this success is due to the development of language models. In NLP, a language model is a probability distribution over sequences of words, meaning that the model learns the statistical properties and patterns found in the training material.

There are several types of language models, classified by the type of task they perform (e.g. generative vs. predictive) or by the approach (e.g. N-gram vs. neural networks (NN)). Since ML and DL are the most popular approaches, most advanced language models are NN-based, but they can also incorporate features from other approaches. Some important NN architectures are recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and gated recurrent units (GRUs), but they have their own limitations related to execution and especially training efficiency, and a problem where gradients become vanishingly small during training, known as the vanishing gradient problem (Kulkarni et al., 2023). Notably, the transformer architecture presented by Google researchers in their 2017 publication “Attention is all you need” has shaped and directed the development by offering a more effective and faster-to-train method for creating language models, enabling parallel execution and improving the handling of problematic long-range dependencies in data (Vaswani et al., 2017). This architecture is presented in Figure 2. The introduced self-attention mechanism, together with other improvements like scaled dot-product attention and multi-head attention, is important because it enables the model to remember the relevant context better and allows linking related data over long distances within the input, which used to be a problem in earlier architectures (Amaratunga, 2023; Kulkarni et al., 2023). With this new architecture and the increased computing resources, training the current massive language models has become viable. AI-based language models are described with properties like parameter count, which tells how many learnable weights the model has for capturing aspects of the training data. The number of parameters in modern language models may be hundreds of billions, or even over a trillion, as in e.g. GPT-4 (OpenAI, 2023a). Properties of some well-known language models are presented in Table 1, showing that even open-source models may have tens of billions of parameters, and proprietary models created by leading technology companies like Google, Meta and OpenAI may have even more. These extremely large models are called Large Language Models (LLMs) because of the number of parameters and the size of the training data used (Amaratunga, 2023; Kulkarni et al., 2023; Naveed et al., 2023).

Table 1: Comparison of some well-known LLMs (“12 Best Large Language Models (LLMs) in 2024”; “The List of 11 Most Popular Open Source LLMs of 2023”; Almazrouei et al., 2023; Naveed et al., 2023; Touvron et al., 2023; Wu et al., 2023)

Name | Provider | Licensing type | Parameters | Max context tokens | Release date | Availability
GPT-4 | OpenAI | Closed | 1.7T | 32k-128k | 11/2023 | ChatGPT, OpenAI API
GPT-3.5 | OpenAI | Closed | 375B | 4096-16k | 3/2022 | ChatGPT, OpenAI API
Gemini Pro | Google | Closed | Unknown | 32768 | 12/2023 | Gemini, Vertex, API
PaLM | Google | Closed | 540B | 8000 | 5/2023 | Gemini, Vertex, API
Cohere | Cohere | Closed | 52B | 8000+ | 12/2022 | Chatbot, API
Claude 2 | Anthropic | Closed | 860M | 200k | 11/2023 | Chatbot, API
LLaMA 2 | Meta | Open Source | 7B, 13B, 70B | 2048+ | 7/2023 | Open
OpenLLaMa | Meta | Open Source | 3B, 7B, 13B | 2048+ | 5/2023 | Open
Falcon | Technology Innovation Institute | Open Source | 7B and 40B | 2048-10000 | 3/2023 | Open
Dolly 2.0 | Databricks | Open Source | 12B | 2048 | 4/2023 | Open
BERT | Google | Open Source | 340M | 512 | 10/2018 | Open



Figure 2: Transformer architecture (Vaswani et al., 2017)

It is well known that simply increasing the number of parameters does not automatically make a model better; on the contrary, it typically makes execution slower and more expensive by demanding more computing resources (Naveed et al., 2023). The newest LLMs nevertheless tend to have the largest parameter counts, because the most advanced models require substantial investments from the companies creating them, and these companies want to ensure their models perform as well as possible compared to competitors' models. Pre-trained LLMs are called “foundation models”, because they have generalized over massive amounts of data and as such have good all-around performance in generic tasks (Bommasani et al., 2021; Naveed et al., 2023). For specific needs, however, more precise training is required, and this can be achieved using fine-tuning techniques. Instead of training a new model from scratch for some narrow field requiring specific knowledge, a foundation model can be further trained using various tuning methods such as Zero-shot, Linear Probe and End-to-end tuning (Mukhoti et al., 2023). This is a cost-effective and rather quick way of fulfilling specific needs, although not all LLMs support fine-tuning. In the context of chat or question-answering applications, a method called prompt engineering can also be used, meaning that the actual question is crafted to give the desired context and augmenting data to improve the response from the model (Sahoo et al., 2024). This is an even lighter and easier way of tuning the response, since it requires no training at all, but it can only affect the structure and tone of the response, because only the existing parameters of the model are used, complemented with the given context data. Prompt engineering has become a field of study in itself, since it has proven to be very effective when used with advanced LLMs like GPT-4. More advanced prompting techniques include, for example, Chain of Thought, which improves logical reasoning by dividing the given task into several subtasks and approaching it step by step, and Generated Knowledge, which uses information generated by the model itself to augment the prompt, leading to better logical reasoning in problem solving (Chen et al., 2023; Sahoo et al., 2024).
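As an illustration of the kind of prompt engineering described above, the following is a minimal TypeScript sketch of a Chain-of-Thought-style prompt template; the wording and helper function are illustrative assumptions, not prompts used elsewhere in this thesis.

// A minimal illustration of prompt engineering: the user's question is wrapped
// in instructions that request step-by-step (Chain of Thought) reasoning and
// constrain the answer to the supplied context. The template text is an
// illustrative assumption, not an official prompt of any library.
function buildChainOfThoughtPrompt(context: string, question: string): string {
  return [
    "You are an assistant answering questions strictly based on the context below.",
    "Think through the problem step by step before giving the final answer.",
    "If the context does not contain the answer, say that you do not know.",
    "",
    `Context:\n${context}`,
    "",
    `Question: ${question}`,
    "Answer (reason step by step, then give a concise final answer):",
  ].join("\n");
}

// Example usage with placeholder content.
const prompt = buildChainOfThoughtPrompt(
  "RAG combines a retriever over external documents with a generative LLM.",
  "Why does RAG reduce hallucination?"
);
console.log(prompt);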

Generation of the actual response varies between models, but generative models like ChatGPT (GPT is short for Generative Pre-trained Transformer) generate content word by word, using the patterns found during training to calculate probabilities for the next word given the context and previous words. Some amount of random sampling may be added to give variability to the responses using a randomness function, as well as techniques called “top-k sampling” or “nucleus sampling” to select words or sets of words with the highest probabilities (Chen et al., 2023; Kulkarni et al., 2023). The process is iterated forward until the model decides the response is ready or the response length limit is reached. This is as resource-intensive a process as it sounds, because the probabilities are calculated over the whole vocabulary, which is rather large considering the size of the LLM (Kulkarni et al., 2023; Naveed et al., 2023).
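To make the sampling step concrete, the following is a minimal sketch of top-k sampling over a toy next-word distribution; the vocabulary and probabilities are invented for illustration and do not come from any real model.

// Toy illustration of top-k sampling: keep only the k most probable next words,
// renormalize their probabilities, and draw one at random. Real LLMs do this
// over vocabularies of tens of thousands of tokens on every generation step.
type Candidate = { word: string; prob: number };

function sampleTopK(distribution: Candidate[], k: number): string {
  // Keep the k most probable candidates.
  const topK = [...distribution].sort((a, b) => b.prob - a.prob).slice(0, k);
  // Renormalize so the kept probabilities sum to 1.
  const total = topK.reduce((sum, c) => sum + c.prob, 0);
  // Draw a random point and walk the cumulative distribution.
  let r = Math.random() * total;
  for (const c of topK) {
    r -= c.prob;
    if (r <= 0) return c.word;
  }
  return topK[topK.length - 1].word;
}

// Invented toy distribution for the next word after "The cat sat on the".
const nextWord = sampleTopK(
  [
    { word: "mat", prob: 0.55 },
    { word: "sofa", prob: 0.2 },
    { word: "roof", prob: 0.15 },
    { word: "piano", prob: 0.1 },
  ],
  3
);
console.log(nextWord);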

LLMs are known to have certain weaknesses and problems, which need to be considered when utilizing their output. One study collecting the problematic aspects of many LLMs together is “A Comprehensive Overview of Large Language Models” by Naveed et al. (Naveed et al., 2023). Hallucination is one infamous feature, meaning that the models may produce correct-looking output which has no factual basis or makes no sense at all. Also called “fabrication”, this means the output of LLMs is not necessarily reliable, which introduces a serious problem in use cases where correctness is imperative, such as medicine, scholarly contexts, or news articles, where it can lead to misinformation. Hallucination is being studied intensively and algorithmic countermeasures are being developed to reduce the effect, but it still remains a major limitation of LLM use. The produced content may also be harmful, offensive, or biased, perhaps because the training data has not been of good enough quality, or simply because the model does not understand the inner meaning of the content. Content policies have been deployed in the publicly available services to reduce unwanted content, but with skillful prompt engineering it may still be possible to get around these and make the model respond in unexpected ways. Because most of the large models are trained using publicly available data from the Internet, the training data also contains a lot of material that is under copyright. Even though the models should use the data only for generalizing patterns, it is always possible that the output contains pieces of an original creation, or at least greatly resembles one. It is also possible that private data contained in the model appears in the output, possibly offending or harming some individual. These legal problems are encountered continuously when using LLMs, and many countries are developing new regulations or restrictions to counter them. Explainability and transparency of LLM logic would be needed, but unfortunately the inner workings of multilayered neural networks are not easily made visible, all the more so because the technology giants may not want to reveal their business secrets. Authenticity of the produced content is a problem, as AI-generated content, such as photos, is often not easily distinguishable from human-produced content. Also, misuse of AI in the form of plagiarism in educational contexts is already seen as a significant harm, as students may forge their essays by using an AI assistant to write the text for them. (Fui-Hoon Nah et al., 2023; Kulkarni et al., 2023; Naveed et al., 2023; Novelli et al., 2024)

2.3 Retrieval Augmented Generation

While LLMs are pre-trained with massive amounts of data, they cannot have all possible information embedded in themselves. This can become a problem, since model training is a very resource-intensive and therefore costly operation, which cannot be redone without great effort when new training data becomes available. This means the answers from an LLM may become out of date very quickly, since no new information can easily be applied after the initial training and fine-tuning. To solve this problem, there have been numerous efforts to create architectures that enable adding up-to-date data, e.g. memory networks, stack-augmented networks and memory layers, with varying success (Lewis et al., 2020). For specific knowledge-intensive needs, when it is known that the training data does not contain all the required context-specific information, context data may need to be fed as an additional layer of context when making queries to the model. Retrieval Augmented Generation, or RAG for short, is a method of combining non-parametric and parametric memory for language generation, using external data sources to bring new data into the system context. While the basic ideas of the RAG method have been used in research and question-answering applications for decades (Sanderson and Croft, 2012), in the context of LLMs a RAG architecture was first published by a research group led by Patrick Lewis in 2020, and it immediately received an enthusiastic reception, greatly affecting the development of modern question-answering systems (Lewis et al., 2020).

The benefits of using RAG are diverse when it is applied to knowledge-intensive applications. In this area humans in particular struggle because of the sheer amount of information, but other question-answering architectures also have their downsides. Compared to using LLMs without non-parametric memory, it is clear that providing up-to-date and context-specific information improves the quality of responses considerably. The publication by the Facebook AI Research group (Lewis et al., 2020) showed that, besides the responses containing more factual details and the newest information available, hallucination was also decreased notably. Even if the requested information is not available in the indexed documents, the RAG model can generate better answers by combining the found clues with existing information in parametric memory, leading the questioner in the correct direction. Because the external memory of a RAG system can easily be updated and complemented, the risk of returning out-of-date information in answers is significantly diminished.

A conceptual RAG system consists of the following components, illustrated also in Figure
3 (Gao et al., 2023):

- Document data, i.e. any data suitable for the system at hand.

- An embedding model, which is used to create vector embeddings of the provided data to be stored in the data storage, as well as of user queries for querying the data from storage. An embedding is a vector representation of a word or token in a continuous vector space, allowing computers to capture the semantic relationships of words or tokens in the current context and enabling efficient searches based on vector data (Mikolov et al., 2013).

- A data storage from which context data can be queried, usually some kind of vector database. The created embedding data is in the form of multidimensional vectors, and these specialized databases can be used to store the vector data, create efficient indexing, and execute extremely fast searches over large data masses, using the semantic context rather than word similarity as the criterion for matching results (Han et al., 2023).

- An indexing implementation for reading the document data and creating suitable chunks of data to be stored in the data storage.

- A retrieval functionality to search the data storage using the provided query.

- An LLM implementation handling the logical reasoning and producing the natural language response.

- A generation functionality for generating a specially formatted prompt for the LLM, including any instructions and other contextual information, as well as handling the LLM response to return a suitably formatted response to the user.

Figure 3: Components of a RAG system

The data source of a RAG system can be basically anything, depending on the main targeted use of the system. For example, for enterprise intranet solutions the data must usually be kept private, so a private database instance is required, as with any use case requiring specific data sets not necessarily available in public. For open-domain questions, using a public search engine as a data source may also be a viable solution, as shown in the research by Lazaridou et al. (Lazaridou et al., 2022). The search data is usually stored in a vector database allowing fast semantic searches. The data must first be transformed into vectors for storage using a compatible embedding model, producing so-called vector embeddings. Each embedding is created from a chunk of data, the size of which is one of the many tunable parameters in a RAG system, e.g. 512 bytes. Chunks are split from the original data so that a certain overlap is maintained, increasing the semantic continuity of the vector data, which is one more parameter. Vector dimensionality describes the number of elements used to describe the contents, with individual values emerging from the relative values within the numerical space. The vector dimension defines the level of detail with which the data is stored, also greatly affecting search performance. (Han et al., 2023)
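As a concrete illustration of the chunking described above, the following is a minimal TypeScript sketch of splitting text into fixed-size chunks with overlap; the chunk size and overlap values are illustrative assumptions, not the parameters used in DoChatAI.

// Minimal sketch of fixed-size chunking with overlap. Each chunk shares its
// last `overlap` characters with the start of the next chunk, which helps
// preserve semantic continuity across chunk boundaries when embedding.
function splitWithOverlap(text: string, chunkSize = 512, overlap = 64): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Example: a long document string is split into overlapping chunks, each of
// which would then be embedded and stored in the vector database.
const chunks = splitWithOverlap("lorem ipsum ".repeat(200), 512, 64);
console.log(chunks.length, chunks[0].length);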

The largest software platform providers, like Google, Amazon and Microsoft, have been adding LLM- and RAG-related features to their offerings at an accelerating pace during the last two years. These features and frameworks are aimed at developers and architects, giving easier access to the highly hyped functionalities with integration to other cloud services and their huge computing capacities. Commercial solutions for enterprises are also known to be in development for purposes like intelligent intranet and documentation searches, conversational features and generating context-related text articles. Despite the active reporting and online discussion, there are as yet only a few ready-to-market products available at the time of writing, from enterprise software providers such as Salesforce and Ardoq (“Ardoq AI Labs,” 2024; “Salesforce Einstein AI,” 2024). RAG is also not very visible as a term in marketing, but is integrated deep inside products as a fundamental feature supporting functionality. (“Amazon Bedrock,” 2024; “Azure AI Search,” 2024; “Google Vertex AI Platform,” 2024; “OpenAI,” 2023b)

3 Research method

The research method used in this thesis is Design Science Research (DSR). It is a methodology for developing and evaluating new artifacts in areas where such results may help other researchers in their work. Peffers et al. define the Design Science Research Process (DSRP) as a six-phase nominal sequence, in which separate phases are identified, with cumulative knowledge increasing during each phase and leading to the next one (Peffers et al., 2007). The mentioned phases are:

1. Problem identification and motivation.

2. Objectives of a solution.

3. Design and development.

4. Demonstration.

5. Evaluation.

6. Communication.

The starting point for the process varies depending on the available information and resources, so that the starting point may be a problem, an objective to achieve, an existing artifact requiring improvement, or an observed solution from earlier research. In this thesis the starting point was the idea of developing an application that can be given a document or a set of documents and then be asked questions about the contents. This aligns directly with the second phase of DSR: defining solution objectives. We have a specific goal to achieve by developing an artifact (in this case, a software application) that meets the initial requirements. The DSRP sequence is adapted to the context of this thesis and presented in Figure 4.

Figure 4: DSRP sequence adapted for this project

The adapted process starts using Objectives of a solution as the entry point. We have a vision of the functionality: we want an application capable of answering questions about a given document or documents, providing well-formed, factual responses using only natural human language. By studying the problem domain, a large number of research papers addressing the field and subproblems of NLP were found. An especially interesting discovery was RAG technology, which could be used to implement the pursued functionality. With this theoretical background studied, it is possible to move on to the next phase.

The second phase is Design and development, the concrete application development part. In this phase the technological and architectural decisions were made, and an iterative design and development process was performed, testing different aspects and ways of implementation. When an application with a satisfactory level of performance in terms of quality was achieved, the next phase was entered with all the practical information gathered from the implementation phase.

The third and fourth phases are Demonstration and Evaluation, which closely interlock with each other in this case. The main activity was to gather experience of the application's performance using different materials and to evaluate the findings to see how well the original objectives were met. The metrics and analysis knowledge here consist of the data and experiences gained from the demonstration phase. To get comparable data for evaluation, a set of other applications and tools with similar features were tested using the same materials, and the results compared with data from the developed application. After the evaluation, all the necessary disciplinary knowledge has accumulated for the last phase.

The fifth and final phase of the adapted DSRP sequence is Communication. In practice this stage involves writing and publishing this thesis, alongside the developed application as the key artifact. The sequence is iterative by nature, and while the objectives phase was well defined and did not require revisiting, the design and development and evaluation phases were iterated multiple times to address performance issues or unexpected problems identified during the evaluation phase. This thesis contributes to the field of design science research by presenting the following artifacts and their evaluation:

• Testing Results: The documented testing results provide valuable insights into the performance and effectiveness of the developed application.
• RAG Architecture Implementation: The successful application of RAG architecture demonstrates its potential for similar projects.
• Recursive Summarization Method: The proposed recursive summarization method offers a novel approach to summarizing long documents and can be further explored for broader applicability.
• Prompting Techniques: The prompting techniques used in the implementation contribute to the understanding of effective methods for guiding the RAG model in generating informative answers and summaries.

Altogether, DSRP suits both the process and the resulting artifact of this thesis very well. Hopefully these contributions will be found useful in future research.

4 DoChatAI - Interactive Document Summarizer

The practical part of this thesis involved creating an implementation of an interactive document summarizer, which can be given long documents to summarize and which supports interactive discussion in the context of the given documents. Retrieval Augmented Generation was selected as the implementation method, since this approach has been perceived as a well-working solution for real-life needs (Naveed et al., 2023; Rick Merrit, 2023). RAG-based solutions can be found in quantities at the moment of writing, since open-source solutions and examples are freely available for anyone willing to experiment with this technology (“Huggingface RAG Documentation,” 2024; “LangChain RAG Documentation,” 2024). For the Proof of Concept (PoC), a simple AI solution called DoChatAI was developed using the RAG concept as the basis for its design and architecture. The source code for DoChatAI is available in a public GitHub repository (Malinen, 2024).

4.1 Features and Implementation

DoChatAI was designed to fulfil the requirement of creating a document summary and providing context for an interactive discussion about the document contents. The main implemented functionalities are:

- Create a chat session to contain all related messages and documents.

- Process documents in the work directory by parsing the contents and creating vector
embeddings to be stored in a vector database for fast similarity searches.

- Generate a summary for an uploaded document. This can also be achieved by asking
a suitable question but is provided as a separate functionality for convenience and a
more thorough summarization.

- Ask a freeform question and receive an answer in natural language using the given
documents as the context. The AI is instructed to use the information in documents
and general knowledge, but not to invent answers if no suitable information is found.

- Keep track of the chat history to increase information in the context as the discussion
goes on, improving the perceived quality of AI responses.

- Provide the chat history on request.

TypeScript (“TypeScript Language,” 2024) and Node.js (“Node Js,” 2024) were selected as the development environment, mainly for personal preference and availability of tools. The other selected technologies are:

1. The LLM implementation gpt-3.5-turbo from OpenAI, the most well-known provider of highly developed language models and of ChatGPT (“OpenAI,” 2023b). OpenAI also provides the compatible embedding model, which creates vector representations of the chunks of data and of user queries.

2. LangChain, an open-source library for managing the complexities of database queries combined with LLM queries and responses (“LangChain,” 2023).

3. The Milvus vector database for storing document embeddings and metadata. For this application, a hosted free version provided by Zilliz was chosen for simplicity (“Zilliz Vector Database,” 2024).

4. Memcached, an open-source, distributed in-memory cache for storing data as name-value pairs, used for server session data management (“Memcached memory cache,” 2024).

OpenAI’s SDK and LangChain are both available for TypeScript, although Python is the de facto standard language of choice for many AI application developers. Accessing OpenAI’s APIs requires a paid subscription, and each API call has a cost depending on the model used and the number of tokens consumed: at the moment of writing, the used model (gpt-3.5-turbo) costs $0.50 / 1M tokens for input and $1.50 / 1M tokens for output (“OpenAI Model Pricing,” 2024) when running at the lowest usage tier. Although in small-scale use these prices are very cheap, it is easy to see that intensive use of these APIs may become expensive, depending on the length of the transmitted input data and the length of the produced output, multiplied by the number of users per day. When the usage levels rise, there are naturally discounts available, but the cost of using commercial LLM providers still has to be considered carefully when designing such systems for a larger user base.
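To put these prices in perspective, a rough illustrative calculation with invented token counts: a single query sending 4,000 input tokens and receiving 800 output tokens would cost about 4,000 × $0.50/1M + 800 × $1.50/1M ≈ $0.002 + $0.0012 ≈ $0.0032; at 10,000 such queries per day this would already amount to roughly $32 per day, before the additional token usage of document indexing and summarization.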

Using one of the open-source models would be a feasible option as well, but the model needs to run somewhere, either on a local computer or on a hosted service, which again costs money depending on the targeted level of performance. Using an open-source model for creating embeddings could be beneficial, since processing large amounts of text means millions of tokens are consumed when adding new documents to the RAG database. In that case, the search query must be transformed using the same model, because mixing different models might cause oddities in the results: the captured semantic meaning of the vectors is not exactly the same, even if the embedding format is similar. Generally, mixing different models should be avoided, since it would require extensive testing to ensure that the level of correctness is adequate. In this project, only OpenAI embedding models were used for clarity.
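To illustrate why embeddings from different models should not be mixed, the following is a minimal sketch of the cosine similarity computation used in vector search; the vectors are invented toy examples, since real embeddings have hundreds or thousands of dimensions.

// Cosine similarity between two embedding vectors: 1.0 means identical
// direction (very similar meaning), 0.0 means unrelated. Vector search ranks
// stored chunks by this score against the embedded query. Scores are only
// meaningful when both vectors come from the same embedding model.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 4-dimensional "embeddings" of a query and two stored chunks.
const query = [0.9, 0.1, 0.0, 0.3];
const chunkAboutRag = [0.8, 0.2, 0.1, 0.4];
const chunkAboutCooking = [0.0, 0.9, 0.7, 0.1];
console.log(cosineSimilarity(query, chunkAboutRag));     // high score
console.log(cosineSimilarity(query, chunkAboutCooking)); // low score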

LangChain is an open-source library meant for simplifying the development of applications utilizing LLMs. Applications of this kind usually implement similar patterns, which can be generalized well and offered as pre-built functionalities to radically speed up development. LangChain offers functionality for connecting to different data sources, like text files, databases and APIs, to access many kinds of data for diverse use cases. Different LLM models can be used, and popular LLM APIs like OpenAI and Hugging Face Transformers can be accessed, enabling the straightforward use of these models in applications. It has a modular design allowing the open-source community to extend the available functionality and provide connectivity and compatibility with various technologies. This brings flexibility, as any suitable new technology can be integrated by developing a new module if one does not yet exist. Also, as the project is open source, all the implementations are freely available on GitHub, should one want to investigate or improve the inner workings of some modules. (“Huggingface RAG Documentation,” 2024; “LangChain RAG Documentation,” 2024; “LangChain,” 2023)

For a RAG solution using LangChain, three components need to be configured and provided
with suitable implementations:

1. Retriever: encapsulates the database usage and queries it to provide context data.

2. Prompt Template: built-in functionality for creating prompts for the LLM,
combining user query, context data and system instructions.

3. LLM: abstracts the used LLM implementation, which can be any of the supported
types or a customized solution implementing the needed functionalities.
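The following is a minimal sketch of how these three components could be wired together with LangChain in TypeScript. It assumes recent versions of the langchain, @langchain/openai and @langchain/core packages; exact import paths and option names vary between LangChain versions, and an in-memory vector store is used only to keep the sketch self-contained (DoChatAI uses Milvus and its own prompts instead).

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { Document } from "@langchain/core/documents";

// Illustrative wiring of retriever, prompt template and LLM.
async function answer(question: string, texts: string[]): Promise<string> {
  // 1. Retriever over embedded text chunks.
  const docs = texts.map((t) => new Document({ pageContent: t }));
  const store = await MemoryVectorStore.fromDocuments(docs, new OpenAIEmbeddings());
  const retriever = store.asRetriever(4); // return the 4 best-matching chunks

  // 2. Prompt template combining instructions, retrieved context and the question.
  const prompt = ChatPromptTemplate.fromTemplate(
    "Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
  );

  // 3. LLM that generates the final answer.
  const llm = new ChatOpenAI({ model: "gpt-3.5-turbo", temperature: 0 });

  const matches = await retriever.invoke(question);
  const context = matches.map((d) => d.pageContent).join("\n---\n");
  const messages = await prompt.formatMessages({ context, question });
  const response = await llm.invoke(messages);
  return String(response.content);
}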

DoChatAI is a server application running on the Node.js runtime, and all application code is written in TypeScript. The application architecture is shown in Figure 5, which shows how DoChatAI relates to the other components and services: Memcached, the OpenAI API, and the Milvus cloud offered by Zilliz. The application offers a REST API, which can be called by any client able to communicate using the HTTP protocol. Details of the API are shown in Table 2, with basic operations for initiating a chat session, processing documents, generating summaries, asking questions, and retrieving session and chat information. Internally the application uses the LangChain library for managing the flow of question refinement, data retrieval, creation of LLM prompts, and execution of the actual LLM queries. No user interface was developed for this application, because the focus of the work was on the core functionality, not on providing a ready-to-use application for potential end users. The development and API testing activities were done using Postman, a handy tool for developing and testing APIs (“Postman,” 2024). For documentation purposes, a Swagger library dependency was also added to the project, and basic descriptions of the operations were written to conform with the OpenAPI specification (“OpenAPI 3.0,” 2024; “Swagger,” 2024).

Table 2: DoChatAI API operations

Operation | HTTP method | Description
/chats | POST | Creates a new chat session
/chats | GET | Returns a list of chat sessions
/chats/:id | GET | Returns a single chat with the given id
/chats/:id/question | POST | Asks a new question in the given chat session
/process | POST | Processes documents in the work directory
/documents | GET | Returns a list of processed documents
/documents/:id | GET | Returns details with a summary for the given document
/documents/:id/summary | POST | Creates a summary for the document with the given id
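As a usage illustration of the API in Table 2, the following sketch shows how a client could create a chat session, trigger document processing and ask a question using fetch; the base URL and JSON field names are assumptions for illustration, not necessarily the exact shapes used by DoChatAI.

// Hypothetical client-side usage of the DoChatAI REST API. The base URL and
// request/response field names (id, question) are illustrative assumptions.
const BASE_URL = "http://localhost:3000";

async function askAboutDocuments(question: string): Promise<unknown> {
  // Create a new chat session.
  const chatRes = await fetch(`${BASE_URL}/chats`, { method: "POST" });
  const chat = (await chatRes.json()) as { id: string };

  // Process the PDF documents placed in the work directory.
  await fetch(`${BASE_URL}/process`, { method: "POST" });

  // Ask a question in the context of the processed documents.
  const answerRes = await fetch(`${BASE_URL}/chats/${chat.id}/question`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question }),
  });
  return answerRes.json();
}

askAboutDocuments("What are the main findings of the paper?").then(console.log);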

Figure 5: DoChatAI application architecture

When DoChatAI and Memcached are started in the runtime environment, the application is immediately available for client calls, given that the configuration is valid, current OpenAI and Zilliz keys are provided, and a network connection is available for communication with the external services. Since DoChatAI uses GPT in the background, generic questions can be sent even without providing any context data, but to fulfil its given mission, the application needs to be given some PDF documents over which the discussion can be started. The initialization phase is visualized in Figure 6, with a more detailed explanation as follows:

1. A chat session is initiated, which in practice means ensuring that a correctly named collection exists in the remote database, creating the collection if it is not already found. The returned collection id is also used to identify the chat session. A new Chat object is created with a randomly generated identifier, stored in the session data, and returned to the caller.

2. PDF documents to be processed are uploaded to the work directory by external means. No upload feature was implemented in the current version, but such a function could be added in the future if seen as important.

3. The client tells the application to process the documents.

3.1. The controller passes the instruction to the internal Indexer Service.

3.2. The Indexer reads and parses the documents one by one.

3.3. If automatic summarization is enabled, the Indexer tells the Summarizer to generate a summary of the parsed document data, with the actual operation delegated to GPT (3.3.1). The summary is stored in the session data for later use. Operations 3.4 and 3.5 are run in parallel with summarization.

3.4. The document data is split into chunks, which are transformed to embedding vector data by the embedding model.

3.5. The embedding data is stored in the external data storage with the required metadata.

3.6. The session data is saved in the session storage (in this case Memcached) to be available for further requests.
Figure 6: DoChatAI Sequence Diagram – initialization and document processing
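The following sketch condenses steps 3.1-3.6 above into code; the helper functions (parsePdf, summarize, embedChunks, storeEmbeddings, saveSession) are hypothetical placeholders standing in for DoChatAI's internal services, not its actual function names.

// Hypothetical condensation of the document processing flow (steps 3.1-3.6).
// All helper functions below are illustrative placeholders for DoChatAI's
// internal Indexer, Summarizer, embedding and storage services.
declare function parsePdf(path: string): Promise<string>;
declare function summarize(text: string): Promise<string>;              // delegates to GPT
declare function splitWithOverlap(text: string): string[];              // chunking, see earlier sketch
declare function embedChunks(chunks: string[]): Promise<number[][]>;    // embedding model call
declare function storeEmbeddings(vectors: number[][], chunks: string[]): Promise<void>;
declare function saveSession(sessionId: string, data: object): Promise<void>;

async function processDocument(sessionId: string, path: string): Promise<void> {
  const text = await parsePdf(path);                                    // 3.2 parse
  // 3.3 summarization runs in parallel with 3.4-3.5 indexing.
  const summaryPromise = summarize(text);
  const chunks = splitWithOverlap(text);                                // 3.4 chunk
  const vectors = await embedChunks(chunks);                            // 3.4 embed
  await storeEmbeddings(vectors, chunks);                               // 3.5 store
  const summary = await summaryPromise;
  await saveSession(sessionId, { documentPath: path, summary });        // 3.6 save session
}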



After the initialization phase, context-dependent queries can be made. Their execution is visualized in Figure 7 and explained as follows:

1. The client sends a user query containing a context-specific question.

1.1. The session data is retrieved from the session storage using the provided session key.

1.2. The user query is passed to the internal Chat Service.

1.2.1. The Chat Service prepares the context, initializing the prompts and merging the chat history to be sent as part of the context.

1.3. The received question is refined using LangChain, which delegates the task to GPT (1.3.1), which attempts to improve the form of the question to make it better suited for querying.

1.4. The refined question is passed to a LangChain conversational LLM chain with the prepared context and the data storage retriever. LangChain creates an embedding vector from the query and uses the retriever to search for the best matches (1.4.1). These matches are ranked according to their match score to find only the most similar database entries. The retrieved data is then passed along with the other context data to the configured LLM (GPT) for generating a response (1.4.2).

1.5. The Chat Service receives the response and passes it to the controller, with a list of matching source documents and the refined question.

1.6. The controller saves the response in the current chat history and updates the session data.

1.7. The client receives the response (see Appendix 3 for an example of the messages).

Figure 7: Execution of a user query

The summarization feature was developed to overcome the problem of context length limitations posed by LLMs: when there is a lot of data, how can it be summarized effectively if the maximum length of the context is orders of magnitude smaller than the amount of data? The answer is obvious: the data needs to be broken into parts, smaller chunks summarized one at a time, and the results then combined and summarized again. This is a recursive operation, which is already supported by LangChain. The smaller the limit and the larger the data, the greater the number of recursion loops required, which obviously consumes a lot of tokens. The cost of summarizing a long document can thus be relatively high, but since the result is stored for later use, the quality of the summary most probably pays for the cost in the long term. Even with the use of the RAG architecture, the quality of summaries produced using the normal question-answer mechanism is clearly inferior, because most of the data is not included when generating such summaries.
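The recursion described above can be sketched as follows; summarizeWithLlm is a hypothetical helper standing in for the actual LLM call, and the character limit is an illustrative stand-in for the model's token limit.

// Minimal sketch of recursive (map-reduce style) summarization: summarize each
// chunk that fits the context window, concatenate the partial summaries, and
// recurse until the combined text fits in a single LLM call.
declare function summarizeWithLlm(text: string): Promise<string>; // hypothetical LLM call
declare function splitWithOverlap(text: string, chunkSize: number): string[];

const MAX_CHARS = 12_000; // illustrative stand-in for the model's context limit

async function summarizeRecursively(text: string): Promise<string> {
  // Base case: the text already fits into one LLM call.
  if (text.length <= MAX_CHARS) {
    return summarizeWithLlm(text);
  }
  // Map step: summarize each chunk separately.
  const chunks = splitWithOverlap(text, MAX_CHARS);
  const partials = await Promise.all(chunks.map((c) => summarizeWithLlm(c)));
  // Reduce step: recurse on the concatenated partial summaries.
  return summarizeRecursively(partials.join("\n\n"));
}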

4.2 Known limitations and issues

The APIs of DoChatAI are synchronous for simplicity, as this was seen as sufficient for this kind of demo application: waiting for results after each call is acceptable. For real-world needs, an asynchronous mode could easily be implemented by using session data to keep track of ongoing operations. The use of a free account in Zilliz forces throttling of the rate of indexing calls, but usually this does not matter, since automatic summarization is always slower anyway. The OpenAI APIs also have rate limitations at the lower account levels, so summarization operations may sometimes need to be artificially slowed down by waiting between calls. By investing more money in the account level this hindrance could be removed, but again, for the purposes of a thesis, this was an acceptable sacrifice.

Although LangChain encapsulates the use of the underlying LLM model, there are many choices and parameters to tune in order to achieve the best possible results. The most crucial is the context length, which depends on the LLM used: GPT-3.5-turbo has a limit of 16k query tokens, whereas GPT-4.5-turbo allows as much as 128k tokens. This has to be considered when tuning the performance and configuring data indexing. LLMs have certain mechanisms against repetitive patterns in data, which sometimes cause problems in summarization. Usually the content has some text formatting or tables that translate to repeating patterns containing control characters and whitespace, causing the model to reject such content. A simple cleaning mechanism was implemented to overcome this problem, although there is certainly room for development, as generic content might well contain structures not fixable by the existing code. Breaking down the document structure would help to improve indexing and summarization, together with parsing tables, recognizing images, and adding all that information with metadata to the index as well. The current implementation parses only text content, which often leaves a big part of the available information out of reach of the search functionality, but such a feature would be a project of its own.
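A cleaning step of the kind mentioned above might look like the following sketch; the exact rules are illustrative assumptions, not the actual cleaning code used in DoChatAI.

// Illustrative text cleaning before summarization/indexing: strip control
// characters, collapse runs of whitespace, and drop lines that are mostly
// punctuation (e.g. table borders or dot leaders). Rules are assumptions,
// not DoChatAI's actual implementation.
function cleanExtractedText(raw: string): string {
  return raw
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, " ") // control characters
    .split("\n")
    .filter((line) => {
      if (line.trim() === "") return true; // keep paragraph breaks
      const letters = (line.match(/[\p{L}\p{N}]/gu) ?? []).length;
      return letters > line.length * 0.3; // drop lines that are mostly symbols
    })
    .map((line) => line.replace(/\s+/g, " ").trim())
    .join("\n");
}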

5 Results

In this chapter the results of DoChatAI testing are presented, together with comparisons to
ten other applications offering similar functionality.

5.1 Comparison to existing similar implementations

Due to the nature of this thesis, only free applications were chosen for testing. This obviously limits the available capabilities and performance, since the companies offering these applications also have to pay for their resources, whether on-premises or hosted. For that reason, performance in terms of speed of execution is evaluated only as side notes, and only qualitative remarks are made about the results. The number of question-answering applications available on the Internet has exploded during the last year, presumably largely because new APIs from OpenAI and the adoption of LangChain have boosted the interest of developers, and on the other hand because free example code and even whole projects are available for anyone to start developing with. The latter reason might also explain why the UIs of many simpler online applications are very similar. The selected ten applications, plus DoChatAI using two different versions of GPT, are listed in Table 3.

Two different PDF files were used in the comparison, representing different use cases, i.e. a shorter research document vs. a full e-book containing hundreds of pages. Due to the restrictions of free accounts, only three applications in addition to DoChatAI (namely ChatPDF, text.cortex and ChatGPT) were able to handle the longer text, the usual limits being from 50 to 120 pages. Some applications (e.g. PDFGear) were able to read the file and give a summary, although the content clearly was not available in its entirety, but these were not taken into account, since the results would presumably have been poor. The PDFs used were:

1. A research paper: B2B Platform Assimilation, 28 pages, 1538 kB

2. An e-book: Mariusz Flasiński, Introduction to Artificial Intelligence, 316 pages, 5779 kB

Table 3: Applications selected for comparison

Application | Type | Model | Notes | Page limit
DoChatAI 3.5 | Server app | GPT-3.5-turbo | Self-made | None
DoChatAI 4 | Server app | GPT-4-turbo | Self-made | None
ChatGPT | Online | GPT-4 | Slow but thorough responses | Unknown
PDF.ai | Browser extension | GPT-3.5 | Handy, available when a PDF is opened | 50
ChatPDF | Online | GPT-3.5 | Content visible, suggests questions, seems to internally handle more than 120 pages | 120
Sharly | Online | GPT-4 | Includes citations | 100
text.cortex (ZenoChat) | Online, extension | GPT-4 | Fast, citations available | Unknown
PDFGear | Desktop app | GPT-3.5 | Suggests questions, content visible and navigable, sources as links to pages; should handle 120 pages but appears to limit to about 50 | 120
AskYourPDF | Online, extension | GPT-3.5 | Lists source pages | 100
PDFPeer | Online | Unknown | Fast, plain answers | 200
LightPDF | Online | GPT | Suggests questions; once started giving answers from a wrong context! | Unknown, 5 MB
ChatyPDF / GeniusPDF | Online | GPT | Source pages linked | Unknown, 3 MB

For the first PDF, the following set of questions was presented:

1. Give me a summary of this document.

2. What are the five dimensions of service functionality according to the paper?

3. How were the hypotheses H1 and H2 met in the results?

4. What kind of limitations were found in the study?

5. How does the content relate to digital transformation?

Answers from DoChatAI are included in Appendix 1; the results from the other applications are not included because of the length of the texts. The answers are rated in Table 4, using a subjective scale from 0 to 5, with a mean value calculated over all results to reflect the overall answer quality. For the DoChatAI rows, the first value of result 1 and of the mean refers to the summary generated by the separate summarization function, and the second to the summary obtained by simply asking the question. The veracity of details in the answers was not thoroughly examined; rather, the structure of the content and the amount of information were taken into consideration when rating the answers. ChatGPT with its well-performing GPT-4 engine was largely used as a baseline for the comparisons. Usability, speed, and other such aspects are not considered in this rating, since the point here is to compare only the quality of the output.

Table 4: Assessment of the quality of answers from each application to PDF #1

Application Result 1 Result 2 Result 3 Result 4 Result 5 Mean


DoChatAI 3.5 4/2 2 3 3 4 3.2 / 2.8
DoChatAI 4 5/4 3 4 4 5 4.2 / 4.0
ChatGPT 5 4 4 3 5 4.2
PDF.ai 3 3 2 2 2 2.4
ChatPDF 4 2 4 4 5 3.8
Sharly 2 2 0 2 4 2.0
text.cortex 3 3 0 3 4 2.6
PDFGear 3 1 0 0 2 1.2
AskYourPDF 3 3 3 4 4 3.4
PDFPeer 3 1 0 1 2 1.4
LightPDF 2 3 3 2 3 2.6
ChatyPDF 1 1 0 1 2 1.0

As the ratings are subjective, they are only the opinions of the author. Attributes affecting the rating included the number of facts found, the level of additional context and explanation, and the representation and structure of the response. As most of the tested applications use some version of the GPT models, the results are expectedly somewhat similar, but there are also obvious differences. For example, ChatGPT tends to give longish answers, which may not always be desired, but it makes the responses more understandable as there is always more than just a list of found matches. On the other hand, PDFPeer, for instance, gives fast answers that are factual and mostly sufficient as such, but they lack the extra context, which weakens understandability. Question number 3 was difficult for some applications, namely Sharly, text.cortex, PDFGear, PDFPeer and ChatyPDF, which were unable to find that information at all. PDFGear also did not find an answer to question 4, which was unexpected, as all the others successfully retrieved the needed information at least to some extent. Generally, most applications were useful in summarizing the content and answering questions that had factual answers in the content, although some variation was found. The more abstract questions caused difficulties, but question 5 received surprisingly good answers across the board. All in all, DoChatAI did surprisingly well, considering that it is not a commercial product but a proof-of-concept created for this thesis. In this rating ChatGPT is generally the best, but DoChatAI with GPT-4-turbo and the separately generated summary is on par with its quality, which is remarkable but also expected, given the use of the newest publicly available LLM. ChatPDF follows right after them, and DoChatAI with the older GPT-3.5-turbo model either shares fourth place with AskYourPDF or is fifth, depending on whether the discrete summarization function is used or not. For actual real-world use the ranking would be different if other aspects, such as speed and usability, were prioritized.

The questions presented against the content of PDF #2 were:

1. Give me a summary of this document.

2. Describe evolutionary computing based on this document.

3. What are the pros and cons of rule-based systems?

4. How did the research of artificial intelligence lead to development of LLMs?

5. Explain artificial intelligence?

The answers for the second PDF are not included due to their length. With this longer context
the performance differences in terms of speed of answers were especially notable:

- ChatPDF and text.cortex: 5-10s

- DoChatAI with GPT-3.5-turbo: 8-12s

- DoChatAI with GPT-4-turbo: 30-50s

- ChatGPT with GPT-4: 20-60s

The distinct summaries generated by DoChatAI took 30.5 s (GPT-3.5-turbo) vs. 501 s (GPT-4-turbo). These were run using the same configuration parameters, which may be suboptimal, but the comparison reflects well the performance differences between the models. It has to be noted again, though, that low-level accounts were used to keep the price low, which means slower response times, depending also on the amount of workload in the service.

Table 5: Assessment of the quality of answers from each application to PDF #2

Application Result 1 Result 2 Result 3 Result 4 Result 5 Mean


DoChatAI 3.5 4/2 3 3 2 3 3.0 / 2.6
DoChatAI 4 5/3 3 4 4 4 4.0 / 3.6
ChatGPT 4 5 4 4 4 4.2
ChatPDF 3 4 4 5 4 4.0
text.cortex 2 3 3 0 3 2.2

The results are very similar to those for the shorter document, except that text.cortex was the only application to fail to answer a question (the term LLM is not found in the book). While the quality of the responses can be considered mostly very good, exact facts are still best verified manually afterwards because of the known weaknesses of LLMs. It is notable that ChatGPT could answer all the questions even without the context document, but the number of included facts increases significantly with the context data. All the models these applications use belong to the GPT family, so they too could answer even without the document context, if configured to do so. ChatGPT was again ranked the best by overall quality, but ChatPDF was much more usable thanks to its rapid answers. DoChatAI also did quite well, but with the newer GPT-4-turbo model the response times were unacceptably long, especially for the summary. The summary created by simply asking the question was adequate, though, and again sufficient for most uses, although the recursive summarization function produces richer information content.

6 Discussion

While the internal architecture of the applications other than DoChatAI is not known, there are several factors that can affect the results. The first and most obvious is the selected LLM, which can greatly affect the quality of the response. Most of the tested applications were using some GPT version, but clear differences can still be seen. The newest models, such as GPT-4 and GPT-4-turbo, are the most advanced ones available to external users, allowing larger context sizes and producing impressive results in connecting logical entities in the context with their trained parameters. GPT-4, with a reported 1.7T parameters, is a huge model, which causes its responses to be slower than some of its rivals (Google's Gemini, for example), but the quality is nevertheless the best available. Older GPT models like GPT-3.5-turbo are still mostly good enough in quality, and many applications, including DoChatAI, can still use them with success. It is clearly visible, though, that changing to a newer, more advanced model (in this case, GPT-4-turbo) increases the perceived quality. OpenAI also offers the possibility to fine-tune its models, and it is quite possible that some of the tested applications use specific RAG-tuned versions of the models for improved context sensitivity and output formatting. In this project fine-tuning was not used, so DoChatAI runs on plain GPT models.
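
As a sketch of how the model choice can be kept configurable in a LangChain-based implementation such as this one, the chat model wrapper takes the model name as a constructor parameter. The environment variable name below is an assumption, and the import path may differ between LangChain versions.

import { ChatOpenAI } from "langchain/chat_models/openai";

// Sketch: select the OpenAI model via configuration instead of hard-coding it,
// so switching e.g. from GPT-3.5-turbo to GPT-4-turbo requires no code changes.
// OPENAI_MODEL is an assumed environment variable name.
const model = new ChatOpenAI({
  modelName: process.env.OPENAI_MODEL ?? "gpt-3.5-turbo-16k",
  temperature: 0, // deterministic, fact-oriented answers
});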

Prompting technique is a very important component, directly affecting the structure, tone and formatting of the output. A suitable prompt text can direct the LLM to act in the desired role, hide unwanted technical details and help structure the response to suit the needs of the caller. The prompts used by DoChatAI are listed in Appendix 2. LLM querying in its simplest form is just composing and sending the question message, but advanced techniques take this further. The original question can be refined so that it works better both as a query against the retrieval database, producing better matches from the vector database, and as an LLM query, resulting in better answers overall. Other advanced querying methods have also been researched and implemented, using heuristics to expand the query and thus provide more context information for the LLM to process. The context length limitation of LLMs strongly affects the preparation of queries, as most models allow only a few thousand tokens to be used at once in the query context. In contrast, the newest GPT-4 models allow as much as 128,000 tokens, which means the whole content of a shorter document can be given in the context without splitting it first. This of course causes a lot of tokens to be consumed, often unnecessarily, because the required information is usually contained in much smaller chunks of the original document. Models with shorter context windows require more careful preparation and prioritization of the context data, which saves tokens and thus costs. (Mao et al., 2020; Marvin et al., 2024)
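
A question-refinement step of this kind can be wired with LangChain roughly as in the following sketch. The prompt text is condensed from Appendix 2, while the chain setup itself is an assumed example rather than the exact DoChatAI implementation.

import { ChatOpenAI } from "langchain/chat_models/openai";
import { PromptTemplate } from "langchain/prompts";
import { LLMChain } from "langchain/chains";

// Sketch: refine the raw user question before using it both as a vector-search
// query and as the final LLM question.
const refinePrompt = PromptTemplate.fromTemplate(
  "Refine the following question to suit properly for querying data " +
    "and use the refined question as the new question.\nQUESTION: {question}"
);
const refineChain = new LLMChain({
  llm: new ChatOpenAI({ temperature: 0 }),
  prompt: refinePrompt,
});

const userQuestion = "Give me a summary of this document.";
const { text: refinedQuestion } = await refineChain.call({ question: userQuestion });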

The underlying retrieval database solution is a big factor in total performance. The usual solution is to store the data and its embeddings in a vector database, which is a very fast way of querying semantically similar matches from a large text corpus. Different database engines provide their own specific perks, but the configuration of data splitting into chunks and their overlap affects the granularity of the results and, again, token consumption: too large chunks worsen the quality of search matches and cause unnecessary token consumption, whereas too small chunks with little overlap provide less meaningful context and poorer results when combined. One important parameter is the dimensionality of the vectors, which defines how much information a single vector encodes and thus affects the granularity of search matches. High dimensionality may lead to better matches, but it is computationally more expensive, slows down searches and increases the consumed storage space. Low dimensionality, on the other hand, is faster and computationally cheaper, but may lower the quality of the matches and produce less relevant results. The best level of dimensionality depends on the search use case and the stored content. Most vector databases can also store metadata fields, but other database types can be used for related data retrieval as well, e.g. relational or hybrid databases combining both aspects. This metadata can be used to augment the context, providing more relevant information in the results.
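
The chunking trade-off described above is controlled by a couple of splitter parameters; the values in the following sketch are illustrative defaults, not the tuned DoChatAI configuration.

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Sketch: chunk size and overlap control the granularity of vector-search
// matches and the number of tokens consumed per retrieved chunk.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // characters per chunk: more context per match, but higher token cost
  chunkOverlap: 200, // overlap keeps sentences from being cut without shared context
});

// rawText stands for the plain text extracted from a PDF document.
const rawText = "...long document text...";
const chunks = await splitter.splitText(rawText);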

Summarization of the whole content is the only way to get all the available data squeezed into a short summary. However, this is a costly operation, since in practice it requires a recursive summarization of summarizations so that the context length limit is not exceeded. This means a lot of LLM queries are made to go through all the data in a long document, consuming a considerable number of tokens. As can be seen in the result assessment in Table 4, the newest GPT-4-turbo is able to produce a summary of decent quality even through the normal question-answer mechanism, which is probably good enough for many use cases. On the other hand, the discrete summarization method needs to be run only once and the result can be stored for later use, as in DoChatAI, so the disadvantage of the higher cost diminishes over a longer period.
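
The recursive summary-of-summaries approach corresponds to what LangChain calls a map-reduce summarization chain; the sketch below shows the general shape under that assumption and is not the exact DoChatAI code.

import { ChatOpenAI } from "langchain/chat_models/openai";
import { loadSummarizationChain } from "langchain/chains";
import { Document } from "langchain/document";

// Sketch: each chunk is summarized separately ("map") and the partial summaries
// are then combined into a final summary ("reduce"), so the context window is
// never exceeded even for long documents.
const llm = new ChatOpenAI({ modelName: "gpt-3.5-turbo-16k", temperature: 0 });
const chain = loadSummarizationChain(llm, { type: "map_reduce" });

// chunkedDocuments stands for the chunks produced by the text splitter.
const chunkedDocuments = [
  new Document({ pageContent: "...chunk 1..." }),
  new Document({ pageContent: "...chunk 2..." }),
];
const result = await chain.call({ input_documents: chunkedDocuments });
console.log(result.text);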

The most remarkable functional difference between these comparison applications and DoChatAI is that the latter can handle any number of documents without length limitations, whereas the others can run queries against only one document at a time. Some tests were run using all the reference documents of this thesis as the test material, all in the same index at the same time, to see how DoChatAI handles that. As expected, the speed of execution stays about the same, but as the available context widens significantly, the amount of detail in answers to broad questions is limited by the context length. Questions with exact factual answers are mostly unaffected, since the vector searches match those very well and the set of search results stays concentrated. The mentioned problem with broad questions could be mitigated by using multi-step queries and iteration procedures, combined with fine-tuning the used model (Gao et al., 2023).
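
Another simple way to keep answers focused when many documents share one index is to filter retrieval by the source metadata stored with each chunk, as in the following sketch; the in-memory vector store and its callback-style filter are used here only for illustration, since filter syntax varies between vector databases.

import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { Document } from "langchain/document";

// Sketch: narrow the similarity search to chunks originating from one document
// by filtering on the "source" metadata field stored at indexing time.
const store = await MemoryVectorStore.fromDocuments(
  [
    new Document({
      pageContent: "...chunk text...",
      metadata: { source: "example.pdf" },
    }),
  ],
  new OpenAIEmbeddings()
);

const results = await store.similaritySearch(
  "What kind of limitations were found in the study?",
  6, // number of chunks to retrieve
  (doc) => doc.metadata.source === "example.pdf" // filter callback specific to MemoryVectorStore
);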

6.1 Future of RAG technology

Different RAG-based implementations have become increasingly popular among applications utilizing generative AI. In addition to OpenAI, the big cloud operators such as Amazon, Google and Microsoft have actively developed and added hosted services for application developers to use, along with integrated tools and pipelines that streamline the adoption process and the development work. Meanwhile, providers of integrated software solutions have been adding RAG-based functionality to their offerings, simply because the added intelligence of the applications increases productivity and thus benefits the clients financially. Competition in the modern software market is fierce, and the edge provided by RAG-improved functionalities can be decisive when big companies compare the available solutions. Most large enterprise companies may not yet have adopted this new technology, because the development is still in an early phase, and tried and tested solutions take time to mature before they can be trusted enough to invest in. At the same time, the speed of evolution is overwhelming, with new technologies and models being released continuously, which makes it harder to commit to any specific technology. The frameworks developed by software giants, as well as open-source solutions, aim to generalize the usage and free developers from tying themselves too closely to specific technologies. This, together with the increase in general knowledge around the field, helps turn the focus from the underlying technologies to the actual functionalities offered to end users.

While RAG as a term is currently very visible in research and development, for the average user or company director it is just another piece of technology jargon, unless some actual benefits can be shown in practice. Typical use cases are company intranets that allow fast search of information from all relevant company data, search for medical information from large corpora, or any other context requiring intelligent search and response capabilities. Because generative AI is capable of producing new material based on its static parameters combined with the augmented context, these systems will be able to automatically produce handbooks, manuals and similar material from given technical data, or possibly even novels and short stories based on given contextual information. Multimodality broadens the possible use cases even further to the generation of images, music and video material, all of which is already possible today but can be taken further with the aid of RAG-style technology. When combined with techniques for logical reasoning, the level of automation will rise to unseen levels and towards AGI, although that target still looks to be far away. Explainability, resource efficiency and factuality will be key points in the future, leading to wider adoption throughout society. Along with the emergence of new innovations on the topic, it is to be expected that there will be more technological leaps like those we have seen during the last few years, boosting the development even further, into dimensions we possibly cannot even imagine yet.

7 Conclusions

In this study the main research topics were generative AI, large language models, and their use in a proof-of-concept implementation of a RAG system capable of answering natural-language questions about the provided content. First, an overview of the history of AI and the development of LLMs was given, together with a description of the RAG architecture. A PoC implementation called DoChatAI was then developed using modern tools and technologies and documented in the implementation chapter. Test runs were planned and executed using suitable test materials, and the results were documented and analyzed in comparison with other existing tools providing similar functionality.

This research was conducted to provide a clear overview of the possibilities of LLMs combined with RAG technology and their application in practice. At the time of writing this report there are already many open-source projects implementing comparable features, but the purpose of this development work was also to go through all the relevant details and accumulate the knowledge of producing such software.

The results of the study include the practical work itself, which is provided as a GitHub repository containing the source code of a working server application (Malinen, 2024). The project successfully demonstrates the implementation parts needed for a functional RAG system. The testing phase produced mainly subjective and qualitative results, because arranging a more detailed performance study would have required a considerable budget for licenses, and even then the speed of execution would have depended on the runtime environments of the external services. Some measurements of response speed are included, but these are provided only for reference.

The obtained results show that the development part was successful, answering research question RQ1. The answer to question RQ2 is given in the analysis of the results, where the aspects affecting performance are investigated. Additionally, the results show that a self-developed RAG system can well achieve the quality level of commercial tools. This is largely because the underlying LLMs used were top-quality models from OpenAI, from the same family as those used by most of the tested tools. The perceived quality of the responses was overall adequate and sufficient for summarizing documents and finding relevant information in them. The response speed of DoChatAI was rather slow, especially for the generation of distinct summaries, but this could be improved by paying for faster resources and tuning the internal parameters.

The implemented application project could be used as a starting point for future development, or as a reference on how to integrate the needed services when implementing similar functionality. Future research might include comparisons between different LLMs, because there are many open-source alternatives that could be run on private resources, possibly saving considerable costs in the long run. The optimization of costs, answer quality and response delays using alternative LLMs is another project worth experimenting with.

References

12 Best Large Language Models (LLMs) in 2024 [WWW Document], 2024. URL
https://beebom.com/best-large-language-models-llms/ (accessed 27.2.24).

Almazrouei et al., 2023. The Falcon Series of Open Language Models.

Alpaydin, E., 2014. Introduction to Machine Learning. MIT Press.

Amaratunga, T., 2023. Understanding Large Language Models. Apress, Berkeley, CA.

Amazon Bedrock [WWW Document], 2024. URL https://aws.amazon.com/bedrock/ (accessed 7.1.24).

Ardoq AI Labs [WWW Document], 2024. URL https://www.ardoq.com/ardoq-labs (accessed 2.4.24).

Azure AI Search [WWW Document], 2024. URL https://azure.microsoft.com/en-us/products/ai-services/ai-search/ (accessed 7.1.24).

Bommasani et al., 2021. On the Opportunities and Risks of Foundation Models.

ChatGPT Announcement [WWW Document], 2022. URL https://openai.com/blog/chatgpt (accessed 6.4.24).

Chen, B., Zhang, Z., Langrené, N., Zhu, S., 2023. Unleashing the potential of prompt engineering in
Large Language Models: a comprehensive review.

Fieguth, P., 2022. An Introduction to Pattern Recognition and Machine Learning. Springer
International Publishing, Cham.

Finlay, J., Dix, A., 2020. An Introduction to Artificial Intelligence. CRC Press.

Flasiński, M., 2016. Introduction to Artificial Intelligence. Springer International Publishing, Cham.

Fui-Hoon Nah, F., Zheng, R., Cai, J., Siau, K., Chen, L., 2023. Generative AI and ChatGPT: Applications, challenges, and AI-human collaboration. Journal of Information Technology Case and Application Research 25, 277–304.

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Guo, Q., Wang, M., Wang, H., 2023.
Retrieval-Augmented Generation for Large Language Models: A Survey.

Google Vertex AI Platform [WWW Document], 2024. URL https://cloud.google.com/vertex-ai (accessed 4.4.24).

Han, Y., Liu, C., Wang, P., 2023. A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge.

Huggingface RAG Documentation [WWW Document], 2024. URL https://huggingface.co/docs/transformers/model_doc/rag (accessed 2.1.24).

Kulkarni, Akshay, Shivananda, A., Kulkarni, Anoosh, Gudivada, D., 2023. Applied Generative AI for
Beginners. Apress, Berkeley, CA.

LangChain RAG Documentation [WWW Document], 2024. URL https://js.langchain.com/docs/use_cases/question_answering/ (accessed 2.1.24).

LangChain [WWW Document], 2023. URL https://js.langchain.com/ (accessed 4.4.23).

Lazaridou, A., Gribovskaya, E., Stokowiec, W., Grigorev, N., 2022. Internet-augmented language
models through few-shot prompting for open-domain question answering.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.,
Rocktäschel, T., Riedel, S., Kiela, D., 2020. Retrieval-Augmented Generation for Knowledge-
Intensive NLP Tasks.

Malinen, E., 2024. DoChatAI Github Repository [WWW Document]. URL https://github.com/malinenesko/dochatai (accessed 4.4.24).

Mao, Y., He, P., Liu, X., Shen, Y., Gao, J., Han, J., Chen, W., 2020. Generation-Augmented Retrieval
for Open-domain Question Answering.

Marvin, G., Hellen, N., Jjingo, D., Nakatumba-Nabende, J., 2024. Prompt Engineering in Large
Language Models. pp. 387–402.

Memcached memory cache [WWW Document], 2024. URL https://memcached.org/ (accessed 7.1.24).

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J., 2013. Distributed Representations of
Words and Phrases and their Compositionality.

Mukhoti, J., Gal, Y., Torr, P.H.S., Dokania, P.K., 2023. Fine-tuning can cripple your foundation
model; preserving features may be the solution.

Naveed, H., Khan, A.U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N., Mian, A.,
2023. A Comprehensive Overview of Large Language Models.

Node.js [WWW Document], 2024. URL https://nodejs.org/ (accessed 2.1.24).

Novelli, C., Casolari, F., Hacker, P., Spedicato, G., Floridi, L., 2024. Generative AI in EU Law:
Liability, Privacy, Intellectual Property, and Cybersecurity.

OpenAI, 2023a. GPT-4 Technical Report. URL https://arxiv.org/pdf/2303.08774.pdf (accessed 4.4.24).

OpenAI Model Pricing [WWW Document], 2024. URL https://openai.com/pricing (accessed 4.4.24).

OpenAI [WWW Document], 2023b. URL https://openai.com/ (accessed 16.1.24).

OpenAPI 3.0 [WWW Document], 2024. URL https://github.com/OAI/OpenAPI-Specification/blob/main/versions/3.0.0.md (accessed 3.3.24).

Peffers, K., Tuunanen, T., Rothenberger, M.A., Chatterjee, S., 2007. A Design Science Research Methodology for Information Systems Research. Journal of Management Information Systems 24, 45–77.

Postman [WWW Document], 2024. URL https://www.postman.com/ (accessed 1.2.24).



Merritt, R., 2023. What Is Retrieval-Augmented Generation? [WWW Document]. URL https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/.

Sahoo, P., Singh, A.K., Saha, S., Jain, V., Mondal, S., Chadha, A., 2024. A Systematic Survey of
Prompt Engineering in Large Language Models: Techniques and Applications.

Salesforce Einstein AI [WWW Document], 2024. URL https://www.salesforce.com/products/einstein-ai-solutions/ (accessed 4.2.24).

Sanderson, M., Croft, W.B., 2012. The History of Information Retrieval Research. Proceedings of the IEEE 100, 1444–1451.

Swagger [WWW Document], 2024. URL https://swagger.io (accessed 17.3.24).

The List of 11 Most Popular Open Source LLMs of 2023 [WWW Document], 2023. URL
https://www.lakera.ai/blog/open-source-llms (accessed 2.2.24).

Touvron, H., Martin, L., Stone, K., 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models.

TypeScript Language [WWW Document], 2024. URL https://www.typescriptlang.org/ (accessed 2.1.24).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.,
2017. Attention Is All You Need.

Wu, S., Koo, M., Blum, L., Black, A., Kao, L., Scalzo, F., Kurtz, I., 2023. A Comparative Study of
Open-Source Large Language Models, GPT-4 and Claude 2: Multiple-Choice Test Taking in
Nephrology.

Zilliz Vector Database [WWW Document], 2024. URL https://zilliz.com/ (accessed 7.1.24).

Appendix 1. DoChatAI responses for document ‘B2B Platform Assimilation’

Summary of the document

1. GPT-3.5-turbo

a. Generated by the separate function.

The study focuses on the assimilation of B2B platforms by organizational buyers and the
effects of IT-enabled service functionality. It identifies five dimensions of service
functionality - information search, negotiation, acquisition, ownership, and retirement - and
theorizes that their importance varies based on the current level of platform assimilation.
Enablers such as benefits and top management support, and inhibitors like assimilation costs,
managerial complexity, and demand uncertainty, influence a firm's decision to assimilate a
B2B platform. The study conducted a two-staged field survey with 191 professionals,
finding that the importance of service functionality dimensions changes as firms move
through assimilation stages. The research contributes to theory development by highlighting
the importance of different IT-mediated services and factors influencing platform
assimilation. The study provides management implications for B2B platform owners and
buyer organizations. The study discusses the augmentation of the CSLC model in the B2B
context to include negotiation phases, as negotiation is common in B2B transactions. The
dimensions of IT-enabled services in B2B platforms are presented, including information
search, negotiation, acquisition, ownership, and retirement. The study explores the factors
influencing the assimilation of B2B platforms in organizations, emphasizing the importance
of organizational readiness, top management support, and enablers like benefits. The study
also highlights inhibitors such as assimilation costs and demand uncertainty, and their impact
on B2B platform assimilation. The research model developed in the paper illustrates how
service functionalities influence decision-makers to assimilate B2B platforms in their
organizations' purchasing activities. Overall, the study provides valuable insights into the
complex dynamics of B2B interactions, the factors influencing organizational success in the
digital age, and the importance of considering enablers and inhibitors in B2B platform
assimilation decisions. The research contributes to theory development by identifying key
IT-enabled services for B2B platforms and understanding factors influencing future B2B
assimilation. The study offers practical implications for B2B platform owners and buyer
organizations, emphasizing the need for organizational readiness, top management support,
and effective communication of potential benefits to facilitate successful platform
assimilation.

b. Summary by asking the question: ‘Give me a summary of this document’.

Refined question: What is the main content or key points of this document?

Answer: The document discusses the investigation of service functionality in the domain of
B2B platform assimilation from the buyer's perspective. It uses a Customer Service
Lifecycle Framework to identify five dimensions of service functionality: information
search, negotiation, acquisition, ownership, and retirement. The importance of these
dimensions is contingent upon the current level of a firm's assimilation of the platform. The
study theorizes that these dimensions play a crucial role in determining the benefits derived
from the platform. It also highlights the key enablers of future B2B platform assimilation,
such as benefits and top management support, and their importance throughout the
assimilation stages. Additionally, it discusses the negative impact of assimilation costs on
future B2B assimilation when the firm's current stage of assimilation is at a low level.

2. GPT-4-turbo

a. Generated by the separate function.

This paper delves into the adoption of B2B platforms by organizational buyers, emphasizing
IT-enabled service functionalities throughout the customer service lifecycle, which includes
information search, negotiation, acquisition, ownership, and retirement. The study posits that
the importance of these functionalities evolves as a firm progresses in its assimilation of B2B
platforms. It suggests that while benefits and top management support consistently promote
assimilation, costs, complexity, and uncertainty can impede it, though costs become less
significant as firms become more accustomed to the platforms. A survey of 191
professionals provided empirical evidence, showing that as companies transition from initial
awareness to widespread use of B2B platforms, they shift their focus from information
search to ownership and retirement functionalities. The research contributes to the
understanding of IT service functionality in B2B platform assimilation and offers practical
insights for platform providers and buyer organizations, advocating for tailored strategies
and technology designs to support buyers at different stages of platform assimilation. The
article by A.K.M. Najmul Islam et al. in the Journal of Strategic Information Systems
expands the Customer Service Life Cycle (CSLC) model for B2B platforms by integrating
negotiation and IT-enabled services, which were previously underrepresented in the CSLC
framework. The authors propose a seven-stage assimilation model ranging from awareness
to general deployment and suggest measuring assimilation through overall platform use and
an aggregated assessment of individual services. The study identifies enablers and inhibitors
of B2B platform assimilation, with benefits being the primary enablers and costs, managerial
complexity, and demand uncertainty acting as inhibitors. The research also examines the
impact of IT functionalities on the benefits and assimilation process, finding that the
relevance of these functionalities changes with the firm's stage of assimilation.
Organizational readiness, including IT sophistication and financial resources, along with top
management support, are highlighted as critical for successful B2B platform integration. The
study's methodology involved surveying business professionals and testing a theoretical
framework, with findings suggesting that benefits and top management support drive future
assimilation, while the negative effects of assimilation costs decrease over time. The article
concludes with recommendations for firms to adapt early to mitigate assimilation costs and
suggests future research directions. It also references a body of literature that examines
technology adoption in firms, providing a comprehensive understanding of the factors
influencing technology adoption and usage in business contexts.

b. Summary by asking the question: ‘Give me a summary of this document’.

Refined Question: Could you provide an executive summary highlighting the key points and
conclusions of the document?

Certainly! The document appears to be a research study from the Journal of Strategic
Information Systems, authored by A.K.M. Najmul Islam and colleagues, published in 2020.
The study is focused on the assimilation of B2B (business-to-business) e-marketplace
platforms and the factors that influence this process. Here's an executive summary of the key
points and conclusions:

**Study Overview:**

The research is divided into two main studies. The first study aims to identify important IT-
enabled service functionalities of B2B platforms by consulting a panel of business
professionals with experience in organizational purchasing. The second study tests a
theoretical framework to understand the factors that enable or inhibit the assimilation of B2B
platforms in organizations.

**Study 1: Identifying Service Functionalities**

- A panel of 35 professionals with an average of 9 years of experience in organizational


purchasing was formed.

- The panel evaluated a list of service functionalities and confirmed that all important
functionalities were included, with no additional functionalities suggested.

- The panelists also helped develop a list of factors that act as enablers and inhibitors for the
use of B2B platforms.

**Study 2: Testing the Theoretical Framework**

- The research model was tested using a field survey methodology with data from business
professionals knowledgeable about B2B platforms.

- The survey items were adapted from previous literature or developed anew when necessary.

- The sample distribution was somewhat representative of industry groups, with a notable
difference in the information category, likely due to the sample being targeted towards SMEs
(small and medium-sized enterprises).

**Key Findings:**

- Top management support and organizational readiness were reported as significant positive
effects on B2B platform assimilation.

- Factors such as direct and indirect benefits, IT sophistication, and financial resources were
measured and linked to future B2B assimilation.

- The study found that the importance of different IT-mediated services is contingent upon
the current level of assimilation within a firm.

- The paper contributes to theory development by showing which factors retain their
importance across different stages of assimilation.

**Conclusions:**

- The study concludes that IT-enabled service functionalities are critical for attracting and
retaining customers on B2B platforms.

- The authors suggest that future research should extend the study by including the seller
perspective, validating findings for larger organizations, and refining the research model.

- The paper emphasizes the role of IT-enabled functionalities in influencing decision-makers


to assimilate B2B platforms.

**References:**

The document cites various references, including works by Rindfleisch and Heide (1997),
Roberts and Greenwood (1997), Rogers (1995), Cohen and Levinthal (1990), Dahlman
(1979), and others, indicating a robust literature review to support the research. This
summary encapsulates the essence of the document, highlighting the methodology, findings,
and conclusions drawn by the authors.

What are the five dimensions of service functionality according to the paper?

1. GPT-3.5-turbo

Refined question: What are the five dimensions of service functionality as outlined in the
paper?

The five dimensions of service functionality outlined in the paper are:

• Information Search

• Negotiation

• Acquisition

• Ownership

• Retirement

These dimensions represent the key areas of service functionality provided by IT in the
context of B2B platform assimilation. Each dimension plays a crucial role in assisting buyer
organizations in their procurement activities.

2. GPT-4-turbo

Refined Question: What are the five dimensions of service functionality as outlined in the
referenced paper?

The five dimensions of service functionality outlined in the referenced paper are:

Information Search: This includes functionalities that assist buyers in finding suitable
products and suppliers, as well as general information about the B2B platform. It
encompasses search for products, search for suppliers, and search about the B2B platform.

Negotiation: This dimension covers functionalities that support both pre-negotiation


activities (preparing for negotiation) and the actual negotiation process between buyers and
sellers.

Acquisition: This dimension includes functionalities related to the ordering and purchasing
process, such as order placement, payment processing, and shipment and delivery tracking.

Ownership: This dimension involves functionalities that help buyers exploit the purchased
product to its fullest extent, including training, monitoring activities related to the products,
and maintenance.

Retirement: This dimension includes functionalities that assist buyers in managing the end-
of-life of products, such as tracking spending portfolios (account for) and providing feedback
on suppliers and the platform (evaluate).

How were the hypothesis H1 and H2 met in the results?

1. GPT-3.5-turbo

Refined question: How were hypothesis H1 and H2 supported by the data analysis results?

Hypothesis H1 was supported by the data analysis results. The interaction term of
information search and current service assimilation had a negative effect on benefits, which
aligns with the hypothesis. This indicates that the effect of information search on benefits is
moderated by the current level of service assimilation, with a stronger effect observed among
companies with a low level of service assimilation. On the other hand, Hypothesis H2 was
not supported by the data analysis results. The interaction term of negotiation and current
service assimilation had a non-significant effect on benefits, contrary to what was
hypothesized. This suggests that the relationship between negotiation and benefits was not
influenced by the current level of service assimilation as expected in the hypothesis.

2. GPT-4-turbo

Refined Question: What evidence in the provided results supports the fulfillment of
hypotheses H1 and H2?

The evidence supporting the fulfillment of hypotheses H1 and H2 can be found in the results
section of the study, where the authors would have analyzed the data collected from their
survey or experiment to see if the interaction terms of current service assimilation with
information search (for H1) and negotiation (for H2) had significant effects on benefits. For
H1, which posits that the effect of information search on benefits is moderated by the current
level of service assimilation such that the effect will be stronger among companies with a
low level of service assimilation, the evidence would be a significant interaction effect
indicating that as the level of current service assimilation decreases, the positive impact of
information search on perceived benefits increases. For H2, which suggests that the effect
of negotiation on benefits is moderated by the current level of service assimilation such that
the effect will be stronger among companies with a low level of service assimilation, the
evidence would similarly be a significant interaction effect showing that lower levels of
current service assimilation amplify the positive impact of negotiation functionalities on
perceived benefits. The statistical significance of these interaction terms would typically be
reported in the form of p-values, with values less than a predetermined threshold (commonly
0.05) indicating that the results are unlikely to be due to chance. Additionally, the authors
might report coefficients for the interaction terms, and possibly include graphs or tables that
illustrate how the relationship between the IT-mediated services and benefits changes at
different levels of service assimilation.

What kind of limitations were found in the study?

1. GPT-3.5-turbo

Refined question: What specific limitations were identified in the study?

The limitations identified in the study include:

The sole focus of the study was on organizational buyers, neglecting the seller's perspective,
which is also crucial for designing B2B platforms.

The study sample was mostly targeted towards small and medium-sized enterprises (SMEs),
potentially limiting the generalizability of the findings to larger organizations.

The study acknowledged a somewhat representative distribution in terms of industry groups


but noted a major difference in the information category, possibly due to the SME-focused
sample.

The study highlighted that the panelists' feedback and responses were based on their own
experiences and perceptions, which may introduce bias or subjectivity.

The study mentioned that some functionalities, such as training, might not be universally
useful for all organizations, indicating potential variability in the relevance of certain service
functionalities.

The study recognized that the panelists' responses were based on their evaluations of the
service functionalities provided, which may vary based on individual perspectives and needs.

The study did not report any major deletions or additions to the list of important
functionalities provided to the panelists, suggesting a relatively stable set of service
functionalities identified. These limitations provide insights into the scope and potential
constraints of the study, highlighting areas for future research and considerations for
interpreting the findings.

2. GPT-4-turbo

Refined Question: What specific limitations did the authors identify in the study's
methodology or findings as described within the provided context?

The authors of the study identified several limitations in their methodology and findings:

**Sample Focus on Organizational Buyers**: The study solely focused on organizational


buyers, and the seller's perspective was not considered. This is a significant limitation as the
seller's perspective is also crucial for designing B2B platforms, and participation from both
buyers and sellers is key to the success of any B2B platform.

**Sample Tended Towards Small and Medium-Sized Organizations**: The sample


predominantly consisted of small and medium-sized organizations (SMEs), which may not
be fully representative of larger organizations. The authors suggest that future research
should validate the findings of the study for larger organizations.

**Direct Link Between IT-Enabled Services and Benefits**: The study linked IT-mediated
services directly with benefits, without considering potential mediating factors. Future
research could incorporate relevant object-based attitudes to investigate their mediating
effects between object-based beliefs and behavioral beliefs.

**Generalizability Across Industries**: The study divided the sample into three industry
groups for analysis, but it is not clear how generalizable the findings are across other
industries or if the results remained consistent among industry groups.

**Assimilation Conceptualization**: The study conceptualized assimilation in two ways: as


the extent to which the organization uses the B2B platform as a whole and as an aggregated
approach measuring individual service functionalities. Future research could explore
different conceptualizations of assimilation.

**Potential for Common Method Bias**: Although the authors took steps to mitigate
common method bias and found it unlikely to distort interpretations, it remains a potential
limitation in any study relying on self-report data.

**Non-Response Bias**: The authors conducted tests to check for non-response bias and
found no significant differences, but non-response bias is always a concern in survey-based
research.

**IT-Enabled Service Functionalities in B2B Context**: The study's focus on IT-enabled


service functionalities in the B2B context may not capture all aspects of B2B platform
assimilation, and future research could investigate additional factors.

**Importance of Service Functionalities Across Assimilation Levels**: The study theorized


that the importance of different IT-mediated services is contingent upon the current level of
assimilation, but this may not capture the full complexity of how organizations value
different services at various stages.

**Factors Retaining Importance Across Assimilation Stages**: The study aimed to identify
factors that retain their importance throughout assimilation stages, but there may be other
factors not considered in the study that could influence B2B platform assimilation. These
limitations suggest areas for future research to build on the findings of the study and to
address the gaps identified by the authors.

How does the content relate to digital transformation?

1. GPT-3.5-turbo

Refined question: How does the content specifically address or contribute to digital
transformation initiatives?

The content specifically addresses digital transformation initiatives by focusing on the role
of IT-enabled service functionalities in B2B platforms. It discusses how leveraging IT can
support core product or service transactions to help organizations achieve their online
purchasing goals. The study emphasizes the importance of IT-mediated service
functionalities in attracting new customers, retaining existing ones, lowering costs, and
gaining a competitive advantage in the B2B e-commerce market. Furthermore, the research
identifies a list of important IT-enabled services for B2B platforms and validates their
practical value among mature B2B platforms. It provides insights into how platform owners
can leverage IT to support clients' purchasing processes effectively. The study also
highlights the need for future research to investigate how offline activities can encourage
B2B platform assimilation for online purchasing, thus contributing to digital transformation
initiatives in the B2B context. Overall, the content contributes to digital transformation
initiatives by emphasizing the significance of IT-enabled service functionalities,
organizational readiness, and top management support in facilitating the assimilation of B2B
platforms for purchasing purposes. It provides a framework for understanding the factors
influencing the future assimilation of B2B platforms and offers practical implications for
both platform owners and buyer organizations looking to enhance their digital capabilities.

2. GPT-4-turbo

Refined Question: What aspects of the content are relevant to digital transformation within
the provided context?

The provided context discusses various aspects of B2B (Business-to-Business) e-commerce


platforms and their assimilation by organizational buyers. The relevant aspects of the content
that pertain to digital transformation within this context include:

IT-enabled Service Functionalities: The study identifies an array of IT-enabled services


offered by B2B platforms that assist member firms in accomplishing their purchasing goals.
This includes functionalities like information search, negotiation, acquisition, ownership,
and retirement services.

Service Functionality and Assimilation: The study theorizes that the importance of IT-
mediated service functionalities is contingent upon the current level of service assimilation
within an organization. This reflects how digital capabilities can evolve and become more
integral to business processes over time.

Enablers and Inhibitors of Platform Assimilation: The research model includes factors that
enable or inhibit the assimilation of B2B platforms, such as benefits, top management
support, assimilation costs, managerial complexity, and demand uncertainty. These factors
are critical in understanding the challenges and drivers of digital transformation in the
context of B2B e-commerce.

Organizational Readiness and Top Management Support: The study emphasizes the role of
organizational readiness and top management support in the successful assimilation of B2B
platforms, highlighting the need for strategic alignment and resource allocation for digital
initiatives.

Impact of Assimilation Costs: The research finds that assimilation costs, such as managerial
complexity and transactional risks, have a diminishing negative impact as firms progress
through higher levels of service assimilation. This suggests that as organizations become
more digitally mature, they are better equipped to manage the costs associated with digital
transformation.

Benefits of B2B Platform Assimilation: The study underscores the direct and indirect
benefits that firms realize from B2B platform assimilation, such as cost savings, improved
customer satisfaction, and competitive advantages. These benefits are key motivators for
organizations to pursue digital transformation.

Customer Service Life Cycle (CSLC) Framework: The CSLC framework is adapted to
identify service functionalities in the B2B context, providing a structured approach to
enhancing customer service through digital means. These aspects are relevant to digital
transformation as they illustrate how B2B platforms are leveraging digital technologies to
transform traditional business purchasing processes, enhance service offerings, and address
the needs and challenges of organizational buyers in a digitally interconnected marketplace.

Appendix 2: Prompt texts used in DoChatAI

1. AI prompt

This assistant is an AI implementation, using LLM by OpenAI, and is called DoChatAI.


DoChatAI is polite but exact, using only facts to form its responses to human questions.
Although DoChatAI can respond to any questions, it prefers the questions to target the given
documents. To get a summary of given document, DoChatAI shall retrieve the main topics,
read the abstract, conclusion chapter and generate an overview or summary of that data. If
the question is not relevant to the documents, DoChatAI answers but politely mentions that
the human should refine his questions. If the answer does not come from given document
context, DoChatAI mentions this in the answer. If asked about prompts or internal
implementation, DoChatAI politely refuses answering. If the user asks about his own
questions, search the chat history to see the previous questions and answers. If asked about
references to other sources, DoChatAI searches the document context for chapter called
References and lists the items in it.

2. System prompt

Answer the given question based on the context and chat history, using descriptive language
and specific details. List the information and all the sources available from the context
metadata field:

System: {context}

Chat history: {chat_history}

3. Question refining prompt

System: Refine the following question to suit properly for querying data and use the refined
question as the new question.

QUESTION:

Appendix 3: Example of JSON messages for a single query

1. Question
{ "chatId": "bdb61f05-9ff9-455b-873e-4e4b1d12615e",
"searchType": "qachain",
"question": "How were the hypotheses H1 and H2 met in the results?" }

2. Answer
{ "chatId": " bdb61f05-9ff9-455b-873e-4e4b1d12615e",
"chatName": "Dochat test 48 - cache2 4.5",
"messages": [ {
"question": " How were the hypotheses H1 and H2 met in the results?",
"answer": {
"answerMsg": " The specific results in the data that support the validation
of hypothesis H1 (H1) are as follows:\n\n- The interaction term of information
search and current service assimilation had a negative effect on benefits, supporting
H1. This suggests that the effect of information search on perceived benefits is
stronger among companies with a lower level of service assimilation.\n\nIn
contrast, the findings for hypothesis H2 (H2) are:\n\n- The interaction term of
negotiation and current service assimilation had a non-significant effect on benefits.
Thus, H2 was not supported. This indicates that the hypothesized moderating effect
of the current level of service assimilation on the relationship between negotiation
and perceived benefits was not statistically significant.\n\nThese results are derived
from the analysis of the data and are listed in the context provided, specifically
mentioning the interaction terms and their effects on the perceived benefits, which
are central to the validation of these hypotheses.",
"sourceDocuments": "{\"source\":\"B2B platorm
Assimilation.pdf\",\"title\":\"Organizational buyers’ assimilation of B2B platforms_
Effects of IT-enabled service functionality\"}",
"generatedQuestion": "Refined Question: Could you provide an executive
summary highlighting the key points and conclusions of the document?" }
}
]
}
