Master's Thesis, Esko Malinen
2024
Esko Malinen
Master’s thesis
2024
60 pages, 7 figures, 5 tables and 3 appendices
Examiners: Professor Najmul Islam and Junior Researcher Maryam Hina
Keywords: Artificial Intelligence, Generative AI, Large Language Model, Retrieval
Augmented Generation
The development of large language models and generative AI has been astonishingly fast in recent years. ChatGPT, published by OpenAI in 2022, finally brought these technologies to the attention of the public, although research in the scientific community had been ongoing for decades. Generative AI is one of the hottest topics in technology and as such offers interesting subjects for design science research.
This master’s thesis examines the properties and usage of large language models and generative AI, as well as the properties of Retrieval Augmented Generation and the possibilities it offers. As the practical part of the thesis, a software application capable of interactive discussions in the context of given documents is implemented using modern development tools and environments. The resulting application is tested with suitable test material and compared to existing applications with similar functionality, and the test results are analyzed.
The chosen implementation method successfully produced an application comparable to commercial solutions. The application can summarize even lengthy documents into clear, concise text and answer questions with accurate and relevant information. While response times are currently slower than ideal, this can be attributed to the low service tiers used for the integrated external services. Future optimization efforts can focus on improving response speed and on comparing the use of different language models.
Thanks to my spouse Maria for being the most kind, loving and understanding sweetheart,
supporting my efforts and tolerating my frustrations during my M.Sc. studies and
especially during the various challenges of this thesis work.
Table of contents
Abstract
Acknowledgements
1 Introduction
2 Background
  2.1 Artificial Intelligence
  2.2 Natural Language Processing and LLMs
  2.3 Retrieval Augmented Generation
3 Research method
4 DoChatAI - Interactive Document Summarizer
  4.1 Features and Implementation
  4.2 Known limitations and issues
5 Results
  5.1 Comparison to existing similar implementations
6 Discussion
  6.1 Future of RAG technology
7 Conclusions
References
Appendices
Appendix 1: DoChatAI responses for the test document ‘B2B Platform Assimilation’
Appendix 2: Prompt texts used in DoChatAI
Appendix 3: Example of JSON-messages for a single query
1 Introduction
Intelligent chatbots have been in the headlines very often during the last couple of years, ever since OpenAI released its renowned chatbot, ChatGPT, to the public in November 2022 (“ChatGPT Announcement,” 2022). While people actively following technology might already have known about the state of advancements in language processing, to the wider public it was startling news that computers could discuss open-domain questions in clear English. The reactions ranged from enthusiasm to the usual doomsday scenarios, in which an advanced superintelligence achieves awareness and takes over, possibly destroying humanity. Most importantly, innovative people started inventing countless ways of utilizing this new technology to help people and organizations do their work more effectively.
The technology behind ChatGPT had been in development for years before its publication, and many of its most important aspects have been described in technology journals and scientific papers. Most of the building blocks have open-source alternatives, which anyone interested and skilled enough can take into use in their own projects to develop services and products that take advantage of the latest breakthroughs in the field of artificial intelligence. While technology companies keep their most advanced models private to protect their business interests, it is also in their interest to offer public APIs for developers to use. Most APIs developed by private companies are not free to use, since the resources needed to run the services are expensive. Nevertheless, the combination of open-source tools and paid services gives application developers the possibility to take advantage of the best aspects of each. One popular use case is using the models and APIs to implement applications that create summaries of large documents or collections of documents and offer the possibility to ask questions about this provided body of information. This class of applications was among the early adopters of ChatGPT and the like, since its real-world value is easily demonstrable and understandable. Retrieval Augmented Generation (RAG) is one promising and widely used approach and was selected for the implementation part of this thesis (Lewis et al., 2020).
RQ2: What are the aspects affecting the performance of such applications?
- Chapter 1: Introduction
- Chapter 5: Discussion
- Chapter 6: Conclusions
The result of this study is a working application using RAG architecture and LLM technology, providing the functionality needed for asking questions and getting factual answers based on the provided document data. The documentation contains all the information needed to understand the technology behind this implementation.
2 Background
Artificial Intelligence (AI) is a very broad concept, containing several subordinate areas and directions of research and development. In this chapter the main areas are described, and the concepts of Large Language Models (LLMs), Natural Language Processing (NLP) and Retrieval Augmented Generation (RAG) are explained. Their use and relation to other AI technology are covered in sufficient detail to understand the terms in the context of this thesis.
2.1 Artificial Intelligence
There is no official, definitive definition of Artificial Intelligence, but the Google and Bing search services currently provide the same explanation when asked:
- The theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.
That’s well in line with the definition given in the book Introduction to Artificial Intelligence
(Finlay and Dix, 2020):
- AI is concerned with building machines that can act and react appropriately, adapting
their response to the demands of the situation. Such machines should display
behavior comparable with that considered to require intelligence in humans.
The idea of intelligent machines has always fascinated humans, even long before the Digital Revolution that started in the late 1940s. The start of formal research in artificial intelligence is commonly placed at a conference held in June 1956, The Dartmouth Summer Research Project on Artificial Intelligence, which paved the way for continuously accelerating research and development in the coming decades (Finlay and Dix, 2020). Even before that famous meeting, the first AI system, called Logic Theorist, was published in 1955, which can also be considered the start of AI (Flasiński, 2016).
Despite repeated setbacks when real-world results failed to match the overly optimistic expectations exaggerated by media hype around newly published AI innovations, basic research has remained active in the scientific community. The newest wave of AI hype started in the autumn of 2022, when OpenAI published its chatbot, ChatGPT, capable of answering open-domain questions, creating astonishingly well-formed text documents and much more, all in natural human language and even in several languages (Amaratunga, 2023). Of course, the development of the language models behind ChatGPT and similar services had already been very active for several years at that point. While public attention focused on the magic of a machine understanding and producing clear English, the capabilities of this technology are vastly greater, as the recent rush of AI companies introducing new products and services based on similar technology has shown. We read news weekly about generative AI that can produce texts, images, speech and videos almost indistinguishable from material produced by humans, about how bots are going to destroy millions of jobs, and how companies are laying off workers because bots are cheaper, more productive and do not ask for raises or holidays.
The whole field of AI is naturally very large, having been an active area of research for almost 70 years. There are several important branches within Artificial Intelligence, e.g. computer vision, fuzzy logic, expert systems, robotics, Machine Learning (ML), Neural Networks (NN), Deep Learning (DL), Natural Language Processing (NLP) and Generative AI (GenAI) (Amaratunga, 2023; Finlay and Dix, 2020; Flasiński, 2016; Kulkarni et al., 2023); see Figure 1 for a visual representation of the relations between these concepts. In the context of this thesis the most relevant areas are ML, DL, GenAI and NLP, because these are the enabling technologies behind Large Language Models, which in turn are the main enabling technology of AI-based chatbots (Amaratunga, 2023). ML provides a way of recognizing patterns in large amounts of data and is therefore used for many data-intensive applications such as image recognition and customer data analysis. It is one of the most important areas of AI, with many real-world applications in use and even more in development. The main methods of training models are supervised learning, unsupervised learning and reinforcement learning. The main function of ML is to generalize the patterns found in the data so that they can be used for reasoning about new problems, with data never seen before (Fieguth, 2022). DL is a subarea of ML and uses layered artificial neural networks to gradually learn more abstract features, starting from details and widening the view layer by layer, with minimal human contribution (Alpaydin, 2014). With the continuous rapid increase in processing power and affordable data storage, enabled by technological advances in hardware and software, automated DL has become the hot topic of AI research.
GenAI is a subset of DL that uses pre-trained models to generate new data similar to the data used to train the models. The new data can be almost anything, from all kinds of text to images, music and even videos. GenAI is an especially hot field right now, as services like ChatGPT, Bing, Bard, Midjourney and DALL-E have brought the existence and possibilities of modern GenAI to the attention of the general public. While online chatbots and image creation services are already mind-blowing examples of the capabilities of GenAI, it is the applications using these services to create new added value that will change the world the most, by reducing or even replacing the need for human contribution in content creation and problem solving in many real-world situations. For example, GenAI can replace human customer service and act as a personal assistant, help students find and summarize information, aid teachers in creating personalized learning materials and help check student assignments. In healthcare it can aid or even replace doctors in making diagnoses based on patient symptoms. More generally, it can increase productivity in all kinds of content creation, such as writing texts, designing graphics or composing music for game development. The inevitable outcome seems to be that many jobs will either disappear or at least transform to take advantage of the efficiency provided by AI tools. The disruptive effects could become huge even in the near future, making numerous well-paying white-collar jobs obsolete. (Fui-Hoon Nah et al., 2023)
2.2 Natural Language Processing and LLMs
NLP is a technology that aims to bridge the gap between human and computer languages by using advanced computing to allow computers to be instructed in natural language and to respond to users in clear human language. The field dates back to the 1950s, when Alan Turing proposed his famous test for evaluating the quality of AI implementations (Amaratunga, 2023). Over the past decades different approaches have dominated the research, and in the 2000s and 2010s, thanks to a better understanding of the human brain, greatly increased computing power and earlier advances in research, ML/DL-based techniques became the most popular direction. Combined with the resources of large technology corporations developing AI solutions, this has made possible the tremendously quick development seen in recent years. Possible uses for NLP include text classification, Named Entity Recognition (NER), machine translation, text generation and speech recognition, which can be seen as relevant features of current AI applications such as chatbots (Amaratunga, 2023). A significant part of this success is due to the development of language models. In NLP, a language model is a probability distribution over sequences of words, meaning that the model learns the statistical properties and patterns found in the training material.
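In its most common autoregressive form, this can be written as the product of conditional next-word probabilities, which the model estimates from the training data:

    P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})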
There are several types of language models, classified by the type of task they perform (e.g. generative vs. predictive) or by the approach (e.g. N-gram vs. neural networks (NN)). Since ML and DL are the most popular approaches, most advanced language models are NN-based, but they can also incorporate features from other approaches. Some important NN architectures are recurrent neural networks (RNNs), long short-term memory (LSTM) networks and gated recurrent units (GRUs), but these have their own limitations related to execution and especially training efficiency, as well as the vanishing gradient problem, in which the training signal fades as it is propagated through many layers or time steps (Kulkarni et al., 2023). Notably, the transformer architecture presented by Google researchers in their 2017 publication “Attention is all you need” has shaped and directed the development by offering a more effective and faster-to-train method for creating language models, enabling parallel execution and improving the handling of problematic long-range dependencies in data (Vaswani et al., 2017). This architecture is presented in Figure 2. The introduced self-attention mechanism, in addition to other improvements such as scaled dot-product attention and multi-head attention, is important because it enables the model to keep the relevant context in view and allows linking of related data over long distances within the input, which used to be a problem in earlier architectures (Amaratunga, 2023; Kulkarni et al., 2023). With this new architecture and the increased computing resources, training of the current massive language models has become viable.
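For reference, the scaled dot-product attention at the core of the transformer computes, for query, key and value matrices Q, K and V with key dimension d_k (Vaswani et al., 2017):

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V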
AI-based language models are described with properties such as parameter count, which gives the number of learnable weights the model adjusts based on the training data. The number of parameters in modern language models may be hundreds of billions, or even over a trillion, as is reported for GPT-4 (OpenAI, 2023a). Properties of some well-known language models are presented in Table 1, showing that even open-source models may have tens of billions of parameters and that proprietary models created by leading technology companies like Google, Meta and OpenAI may have even more. These extremely large models are called Large Language Models (LLMs) because of their number of parameters and the size of the training data used (Amaratunga, 2023; Kulkarni et al., 2023; Naveed et al., 2023).
Table 1: Comparison of some well-known LLMs (“12 Best Large Language Models (LLMs) in 2024”; “The List of 11 Most Popular Open Source LLMs of 2023”; Almazrouei et al., 2023; Naveed et al., 2023; Touvron et al., 2023; Wu et al., 2023)

Name        Provider                          Licensing type   Parameters      Max context tokens   Release date   Availability
GPT-4       OpenAI                            Closed           1.7T            32k-128k             11/2023        ChatGPT, OpenAI API
GPT-3.5     OpenAI                            Closed           375B            4096-16k             3/2022         ChatGPT, OpenAI API
Gemini Pro  Google                            Closed           Unknown         32768                12/2023        Gemini, Vertex, API
LLaMA 2     Meta                              Open Source      7B, 13B, 70B    2048+                7/2023         Open
OpenLLaMa   Meta                              Open Source      3B, 7B, 13B     2048+                5/2023         Open
Falcon      Technology Innovation Institute   Open Source      7B and 40B      2048-10000           3/2023         Open
It is a well-known fact that just increasing the number of parameters does not automatically make a model better; on the contrary, it typically makes execution slower and more expensive by demanding more computing resources (Naveed et al., 2023). The newest LLMs nevertheless typically have the largest dimensions in terms of these numbers, because the most advanced models require substantial investments from the companies creating them, and those companies want to ensure their models perform as well as possible compared to the competitors’ models. Pre-trained LLMs are called “foundation models”, because they have generalized massive amounts of data and as such perform well in generic tasks (Bommasani et al., 2021; Naveed et al., 2023). For specific needs, however, more precise training is needed, and this can be achieved using fine-tuning techniques. Instead of training a new model from scratch for some narrow field requiring specific knowledge, a foundation model can be further trained using various tuning methods such as Zero-shot, Linear Probe and End-to-end tuning (Mukhoti et al., 2023). This is a cost-effective and rather quick way of fulfilling specific needs; however, not all LLMs support
How the actual response is generated varies between models, but generative models like ChatGPT (GPT is short for Generative Pre-trained Transformer) generate content word by word, using the patterns found during training to calculate probabilities for the next word given the context and the previous words. Some amount of random sampling may be added to give variability to the responses, together with techniques called “top-k sampling” and “nucleus sampling” that select the words or sets of words with the highest probabilities (Chen et al., 2023; Kulkarni et al., 2023). The process is iterated until the model decides the response is complete or the response length limit is reached. This is as resource-intensive a process as it sounds, because the probabilities are calculated over the whole vocabulary, which is rather large considering the size of the LLM (Kulkarni et al., 2023; Naveed et al., 2023).
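As an illustration of this decoding step, the following sketch shows how top-k and nucleus (top-p) filtering could be combined with random sampling over a next-token distribution. It is a simplified toy example, not the actual decoding code of any particular model, and it assumes a non-empty distribution.

    // Toy next-token sampler combining top-k and nucleus (top-p) filtering.
    interface Candidate { token: string; prob: number; }

    function sampleNextToken(dist: Candidate[], k: number, p: number): string {
      // Sort by probability and keep at most k candidates (top-k).
      const sorted = [...dist].sort((a, b) => b.prob - a.prob).slice(0, k);
      // Keep the smallest prefix whose cumulative probability reaches p (nucleus).
      const nucleus: Candidate[] = [];
      let cumulative = 0;
      for (const c of sorted) {
        nucleus.push(c);
        cumulative += c.prob;
        if (cumulative >= p) break;
      }
      // Renormalize over the remaining candidates and draw one token at random.
      const total = nucleus.reduce((sum, c) => sum + c.prob, 0);
      let r = Math.random() * total;
      for (const c of nucleus) {
        r -= c.prob;
        if (r <= 0) return c.token;
      }
      return nucleus[nucleus.length - 1].token;
    }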
LLMs are known to have certain weaknesses and problems, which need to be considered when using their output. One study collecting the problematic aspects of many LLMs is “A Comprehensive Overview of Large Language Models” (Naveed et al., 2023). Hallucination is one infamous feature, meaning that a model may produce correct-looking output that has no factual basis or makes no sense at all. Also called “fabrication”, this means the output of LLMs is not necessarily reliable, which poses a serious problem in use cases where correctness is imperative, such as medicine, scholarly work, or news writing, where it can lead to misinformation. Hallucination is being studied extensively, and algorithmic countermeasures are being developed to reduce the effect, but it still remains a major limitation of LLM use. The produced content may also be harmful, offensive or biased, perhaps because the quality of the training data has not been good enough or simply because the model does not understand the deeper meaning of the content. Content policies have been deployed in the publicly available services to reduce unwanted content, but with skillful prompt engineering it may still be possible to get around them and make the model respond in unexpected ways. Because most of the large models are trained using publicly available data from the Internet, the training data also contains a lot of material normally under copyright. Even though the models should use the data only for generalizing patterns, it is always possible that the output contains pieces of an original creation, or at least greatly resembles one. Private data may also be contained in the model and surface in the output, possibly offending or harming some individual. These legal problems are encountered continuously when using LLMs, and many countries are developing new laws or restrictions to counter them. Explainability and transparency of LLM logic would be needed, but unfortunately the inner workings of multilayered neural networks are not easily made visible, all the more so because the technology giants may not want to reveal their technological business secrets. Authenticity of the produced content is a problem because AI-generated content, such as photos, is often not easily distinguishable from human-produced content. Misuse of AI in the form of plagiarism in education is also already seen as a significant harm, as students may forge their essays by having an AI assistant write the text for them. (Fui-Hoon Nah et al., 2023; Kulkarni et al., 2023; Naveed et al., 2023; Novelli et al., 2024)
2.3 Retrieval Augmented Generation
While LLMs are pre-trained with massive amounts of data, they cannot have all possible information embedded in themselves. This can become a problem, since model training is a very resource-intensive and therefore costly operation, which cannot be redone without great effort whenever new training data becomes available. As a result, the answers from an LLM may become out of date very quickly, since no new information can easily be incorporated after the initial training and fine-tuning. To solve this problem, there have been numerous efforts to create architectures that enable adding up-to-date data, e.g. memory networks, stack-augmented networks and memory layers, with varying success (Lewis et al., 2020). For specific knowledge-intensive needs, when it is known that the training data does not contain all the needed context-specific information, context data may need to be fed as an additional layer of context when querying the model. Retrieval Augmented Generation, or RAG for short, is a method of combining non-parametric and parametric memory for language generation, using external data sources to bring new data into the system context. While the basic ideas behind RAG have been used in research and question-answering applications for decades (Sanderson and Croft, 2012), in the context of LLMs a RAG architecture was first published by a research group led by Patrick Lewis in 2020, and it immediately received an enthusiastic reception, greatly affecting the development of modern question-answering systems (Lewis et al., 2020).
The benefits of using RAG are diverse when it is applied to knowledge-intensive applications. In this area humans in particular struggle because of the sheer amount of information, but other question-answering architectures also have their downsides. Compared to using LLMs without non-parametric memory, it is clear that providing up-to-date and context-specific information improves the quality of responses considerably. The publication by the Facebook AI Research group (Lewis et al., 2020) showed that besides the responses including more factual details and the newest information available, hallucination was also decreased notably. Even if the requested information is not available in the indexed documents, the RAG model can generate better answers by combining the found clues with existing information in parametric memory, leading the questioner in the correct direction. Because the external memory of a RAG system can be easily updated and complemented, the risk of returning out-of-date information in answers is significantly diminished.
A conceptual RAG system consists of the following components, illustrated also in Figure 3 (Gao et al., 2023):
- Document data, i.e. any data suitable for the system at hand.
- A data storage from which context data can be queried, usually some kind of vector database. The created embedding data is in the form of multidimensional vectors, and these specialized databases can store the vector data, create efficient indexes, and execute extremely fast searches over large data masses, using semantic context rather than word similarity as the criterion for matching results (Han et al., 2023).
- An indexing implementation for reading the document data and creating suitable chunks of data to be stored in the data storage.
- A retrieval functionality for searching the data storage using the provided query.
- An LLM implementation handling the logical reasoning and producing the natural language response.
- A generation functionality for generating a specially formatted prompt for the LLM, including any instructions and other contextual information, as well as handling the LLM response to return a suitably formatted answer to the user.
The data source of a RAG system can be almost anything, depending on the main targeted use of the system. For enterprise intranet solutions, for example, the data must usually be kept private, so a private database instance is required; the same holds for any use case requiring specific data sets not necessarily publicly available. For open-domain questions, using a public search engine as the data source may also be a viable solution, as shown in the research by Lazaridou et al. (Lazaridou et al., 2022). The searchable data is usually stored in a vector database allowing fast semantic searches. The data must first be transformed into vectors using a compatible embedding model, producing so-called vector embeddings. Each embedding is created from a chunk of data, the size of which is one of the many tunable parameters in a RAG system, e.g. 512 bytes. Chunks are split from the original data so that a certain overlap is maintained, increasing the semantic continuity of the vector data; the overlap is one more parameter. Vector dimensionality describes the number of elements used to describe the contents, individual values emerging from the relative values within the numerical space. The vector dimension defines the level of detail with which the data is stored and also greatly affects search performance. (Han et al., 2023)
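The chunking step described above can be sketched roughly as follows; the chunk size and overlap values are illustrative defaults, not the ones used in DoChatAI. Each resulting chunk would then be passed to the embedding model and stored with its vector.

    // Split text into fixed-size chunks with an overlap between consecutive chunks,
    // so that sentences crossing a chunk boundary still share some context.
    function splitIntoChunks(text: string, chunkSize = 512, overlap = 64): string[] {
      const chunks: string[] = [];
      const step = chunkSize - overlap;
      for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        if (start + chunkSize >= text.length) break; // last chunk reached the end
      }
      return chunks;
    }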
The largest software platform providers, such as Google, Amazon and Microsoft, have been adding LLM- and RAG-related features to their offerings at an accelerating pace during the last two years. These features and frameworks are aimed at developers and architects, giving easier access to the highly hyped functionalities with integration to other cloud services and their huge computing capacities. Commercial solutions for enterprises are also known to be in development for purposes such as intelligent intranet and documentation search, conversational features and generation of context-related text articles. Despite the active reporting and online discussion, there are still only a few ready-to-market products available at the time of writing, from enterprise software providers such as Salesforce and Ardoq (“Ardoq AI Labs,” 2024; “Salesforce Einstein AI,” 2024). RAG is also not very visible as a term in marketing, but rather integrated deep inside products as a fundamental feature supporting functionality. (“Amazon Bedrock,” 2024; “Azure AI Search,” 2024; “Google Vertex AI Platform,” 2024; “OpenAI,” 2023b)
3 Research method
The research method used in this thesis is Design Science Research (DSR). It is a methodology for developing and evaluating new artifacts in areas where such results may help other researchers in their work. Peffers et al. define the Design Science Research Process (DSRP) as a six-phased nominal sequence, in which separate phases are identified, with knowledge accumulating during each one and leading to the next (Peffers et al., 2007). The phases are:
1. Problem identification and motivation.
2. Objectives of a solution.
3. Design and development.
4. Demonstration.
5. Evaluation.
6. Communication.
The starting point of the process depends on the available information and resources: it may be a problem, an objective to achieve, an existing artifact requiring improvement, or a solution observed in earlier research. In this thesis the starting point was the idea of developing an application that can be given a document or a set of documents and then asked questions about their contents. This aligns directly with the second phase of DSR: defining solution objectives. We have a specific goal to achieve by developing an artifact (in this case, a software application) that meets the initial requirements. The DSRP sequence is adapted to the context of this thesis and presented in Figure 4.
The adapted process uses Objectives of a solution as the entry point. We have a vision of the functionality: we want an application capable of answering questions about a given document or documents with well-formed, factual responses, using only natural human language. By studying the problem domain, a large number of research papers addressing the field and the subproblems of NLP were found. An especially interesting discovery was RAG technology, which could be used to implement the pursued functionality. With this theoretical background studied, it is possible to move on to the next phase.
The second phase is Design and development, the concrete application development part. In this phase the technological and architectural choices were made and an iterative design and development process was carried out, testing different aspects and ways of implementation. When an application with a satisfactory level of quality was achieved, the next phase was entered with all the practical information gathered from the implementation phase.
The third and fourth phases are Demonstration and Evaluation, which in this case closely interlock with each other. The main activity was to gather experience of the application's performance using different materials and to evaluate the findings to see how well the original objectives are met. The metrics and analysis knowledge here consists of the data and experience obtained from the demonstration phase. To get comparable data for the evaluation, a set of other applications and tools with similar features was tested using the same materials, and the results were compared with the data from the developed application. After the evaluation, all the disciplinary knowledge needed for the last phase has been accumulated.
The fifth and final phase of the adapted DSRP sequence is Communication. In practice this stage involves writing and publishing this thesis alongside the developed application as the key artifact. The sequence is iterative by nature, and while the objectives phase was well defined and did not require revisiting, the design and development and evaluation phases were iterated multiple times to address performance issues or unexpected problems identified during evaluation. This thesis contributes to the field of design science research by presenting the following artifacts and their evaluation:
• Testing results: the documented testing results provide insights into the performance and effectiveness of the developed application.
• RAG architecture implementation: the successful application of RAG architecture demonstrates its potential for similar projects.
• Recursive summarization method: the proposed recursive summarization method offers a novel approach to summarizing long documents and can be further explored for broader applicability.
Altogether, DSRP suits both the process and the resulting artifact of this thesis very well. Hopefully these contributions will be found useful in future research.
4 DoChatAI - Interactive Document Summarizer

4.1 Features and Implementation
DoChatAI was designed to fulfil the requirement of creating a document summary and providing context for an interactive discussion about the document contents. The implemented main functionalities are:
- Process documents in the work directory by parsing their contents and creating vector embeddings to be stored in a vector database for fast similarity searches.
- Generate a summary for an uploaded document. A summary can also be obtained by asking a suitable question, but this is provided as a separate functionality for convenience and for a more thorough summarization.
- Ask a freeform question and receive an answer in natural language using the given documents as the context. The AI is instructed to use the information in the documents and general knowledge, but not to invent answers if no suitable information is found.
- Keep track of the chat history to add information to the context as the discussion goes on, improving the perceived quality of the AI responses.
TypeScript (“TypeScript Language,” 2024) and Node.js (“Node Js,” 2024) were selected as the development environment, mainly for personal preference and availability of tools. Other selected technologies are:
1. OpenAI GPT models and embeddings, accessed through the OpenAI API.
2. LangChain library for orchestrating the retrieval and generation flow.
3. Milvus vector database for storing document embeddings and metadata. For this application, a hosted free version provided by Zilliz was chosen for simplicity. (“Zilliz Vector Database,” 2024)
OpenAI’s SDK and LangChain are both available for TypeScript, although Python is the de facto standard language of choice for many AI application developers. Accessing OpenAI’s APIs requires a paid subscription, and each API call costs something, depending on the model used and the number of tokens consumed: at the time of writing, the model used (gpt-3.5-turbo) costs $0.50 / 1M tokens for input and $1.50 / 1M tokens for output at the lowest usage tier (“OpenAI Model Pricing,” 2024). Although in small-scale use these prices are very low, it is easy to see that intensive use of these APIs may become expensive, depending on the length of the transmitted input data and the length of the produced output, multiplied by the number of users per day. When usage levels rise there are naturally discounts available, but the cost of using commercial LLM providers still has to be considered carefully when designing such systems for a larger user base.
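As a purely hypothetical example with the prices quoted above, a single query sending 3,000 input tokens (question, instructions and retrieved context) and receiving 500 output tokens would cost roughly 3,000/1,000,000 × $0.50 + 500/1,000,000 × $1.50 ≈ $0.0023; at 10,000 such queries per day the cost would already be around $23 per day, before any summarization runs. The figures are illustrative only, but they show how quickly token-based pricing accumulates.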
Using one of the open-source models would be a feasible option as well, but the model needs to run somewhere, either on a local computer or a hosted service, which again costs money depending on the targeted level of performance. Using an open-source model for creating embeddings could be beneficial, since processing large amounts of text means millions of tokens are used when adding new documents to the RAG database. In that case the search query must be transformed using the same model, because mixing different models might cause oddities in the results: the captured semantic meaning of the vectors is not exactly the same, even if the embedding format is similar. Generally, mixing different models should be avoided, since it would require extensive testing to ensure an adequate level of correctness. In this project only OpenAI embedding models were used, for clarity.
For a RAG solution using LangChain, three components need to be configured and provided with suitable implementations (a minimal sketch of their wiring follows the list):
1. Retriever: encapsulates the database and queries it to provide context data.
2. Prompt template: built-in functionality for creating prompts for the LLM, combining the user query, context data and system instructions.
3. LLM: abstracts the LLM implementation used, which can be any of the supported types or a customized solution implementing the needed functionality.
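The sketch below shows one way these three components could be wired together. It is illustrative only: import paths and option names differ between LangChain versions, an in-memory vector store stands in for Milvus, and the prompt text is a placeholder rather than the one used by DoChatAI (see Appendix 2).

    // Illustrative LangChain JS wiring of retriever, prompt template and LLM.
    import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
    import { PromptTemplate } from "@langchain/core/prompts";
    import { Document } from "@langchain/core/documents";
    import { MemoryVectorStore } from "langchain/vectorstores/memory";

    const chunks = ["...first text chunk...", "...second text chunk..."]; // pre-split text
    const question = "What is the main topic of the document?";

    // 1. Retriever: a vector store of embedded chunks, wrapped as a retriever.
    const store = await MemoryVectorStore.fromDocuments(
      chunks.map((text) => new Document({ pageContent: text })),
      new OpenAIEmbeddings()
    );
    const retriever = store.asRetriever(4); // return the four best-matching chunks

    // 2. Prompt template: combines instructions, retrieved context and the question.
    const prompt = PromptTemplate.fromTemplate(
      "Answer using only the context below. If the answer is not there, say so.\n\n" +
        "Context:\n{context}\n\nQuestion: {question}"
    );

    // 3. LLM: the chat model producing the final answer.
    const llm = new ChatOpenAI({ modelName: "gpt-3.5-turbo", temperature: 0 });

    const docs = await retriever.getRelevantDocuments(question);
    const answer = await llm.invoke(
      await prompt.format({
        context: docs.map((d) => d.pageContent).join("\n\n"),
        question,
      })
    );
    console.log(answer.content);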
DoChatAI is a server application running on the Node.js runtime; all application code is written in TypeScript. The application architecture is shown in Figure 5, which shows how DoChatAI relates to the other components and services: Memcached, the OpenAI API, and the Milvus cloud offered by Zilliz. The application offers a REST API, which can be called by any client able to communicate using the HTTP protocol. Details of the API are shown in Table 2, with basic operations for initiating a chat session, processing documents, generating summaries, asking questions, and retrieving session and chat information. Internally the application uses the LangChain library for managing the flow of question refinement, data retrieval, creation of LLM prompts, and execution of the actual LLM queries. No user interface was developed for this application, because the focus of the work was on the core functionality, not on providing a ready-to-use application for potential end users. The development and API testing activities were done using Postman, a handy tool for developing and testing APIs (“Postman,” 2024). For documentation purposes, a Swagger library dependency was also added to the project and basic descriptions of the operations were written to conform with the OpenAPI specification (“OpenAPI 3.0,” 2024; “Swagger,” 2024).
When DoChatAI and Memcached are started in the runtime environment, the application is immediately available for client calls, provided that the configuration is valid, current OpenAI and Zilliz keys are supplied, and a network connection is available for communication with the external services. Since DoChatAI uses GPT in the background, generic questions can be sent even without providing any context data, but to fulfil its given mission the application needs to be given some PDF documents over which the discussion can be started. The initialization phase is visualized in Figure 6, with a more detailed explanation as follows:
1. A chat session is initiated, which in practice means ensuring that a correctly named collection exists in the remote database, creating the collection if it is not already found. The returned collection id is also used to identify the chat session. A new Chat object is created with a randomly generated identifier, stored in the session data, and returned to the caller.
2. PDF documents to be processed are uploaded to the work directory by external means. No upload feature was implemented in the current version, but such a function could be added in the future if deemed important.
3.4. Document data is split into chunks, which are transformed into embedding vector data by the embedding model.
3.5. Embedding data is stored in the external data storage with the required metadata.
3.6. Session data is saved in the session storage (in this case Memcached) to be available for further requests.
After the initialization phase, context-dependent queries can be made. Their execution is visualized in Figure 7 and explained below; a simplified sketch of the question-refinement step follows the list:
1.1. The session data is retrieved from the session storage using the provided session key.
1.2.1. The Chat Service prepares the context, initializing the prompts and merging the chat history to be sent as part of the context.
1.3. The received question is refined using LangChain, which delegates the task to GPT (1.3.1); the model attempts to improve the form of the question to make it better suited for querying.
1.4. The refined question is passed to a LangChain conversational LLM chain together with the prepared context and the data storage retriever. LangChain creates an embedding vector from the query and uses the retriever to search for the best matches (1.4.1). These matches are ranked according to their match score so that only the most similar database entries are kept. The retrieved data is then passed along with the other context data to the configured LLM (GPT) for generating a response (1.4.2).
1.5. The Chat Service receives the response and passes it to the controller, with a list of matching source documents and the refined question.
1.6. The controller saves the response in the current chat history and updates the session data.
1.7. The client receives the response (see Appendix 3 for an example of the messages).
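The question-refinement step (1.3) can be illustrated with the following simplified sketch, in which one extra LLM call rewrites the user's question into a standalone form using the chat history. The prompt wording and the injected askLlm helper are illustrative assumptions, not the actual DoChatAI code or prompts (see Appendix 2).

    // Rewrite a follow-up question into a standalone question using the chat history,
    // so that both the vector search and the final LLM query get the full context.
    async function refineQuestion(
      askLlm: (prompt: string) => Promise<string>,
      history: { question: string; answer: string }[],
      question: string
    ): Promise<string> {
      if (history.length === 0) return question; // nothing to merge yet
      const historyText = history
        .map((turn) => `User: ${turn.question}\nAssistant: ${turn.answer}`)
        .join("\n");
      const prompt =
        "Given the conversation below, rewrite the last question so that it can be " +
        "understood without the conversation.\n\n" +
        `${historyText}\n\nLast question: ${question}\n\nStandalone question:`;
      return (await askLlm(prompt)).trim();
    }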
The summarization feature was developed to overcome the problem of the context length limitations posed by LLMs: when there is a lot of data, how can we effectively summarize it if the maximum length of the context is orders of magnitude smaller than the amount of data? The answer is obvious: the data must be broken into parts, smaller chunks summarized one at a time, and the results then combined and summarized again. This is a recursive operation, which is already supported by LangChain. The smaller the limit and the larger the data, the more recursion loops are required, which obviously consumes lots of tokens. The cost of summarizing a long document can thus be relatively high, but since the result is stored for later use, the quality of the summary most probably pays for the cost in the long term. Even with the RAG architecture, the quality of summaries produced using the normal question-answer mechanism is clearly inferior, because most of the data is not included when generating such summaries.
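The recursive (map-reduce style) summarization described above can be sketched roughly as follows. LangChain provides a ready-made implementation of this pattern, so the function below only illustrates the idea; it reuses the splitIntoChunks helper sketched earlier and assumes each partial summary is shorter than its input, so the recursion eventually terminates.

    // Summarize arbitrarily long text under a context-length limit by summarizing
    // chunks, concatenating the partial summaries and recursing until the result fits.
    async function summarizeRecursively(
      askLlm: (prompt: string) => Promise<string>,
      text: string,
      maxChars: number
    ): Promise<string> {
      if (text.length <= maxChars) {
        return askLlm(`Summarize the following text:\n\n${text}`);
      }
      const partials: string[] = [];
      for (const chunk of splitIntoChunks(text, maxChars, 0)) {
        partials.push(await askLlm(`Summarize the following text:\n\n${chunk}`));
      }
      // Combine the partial summaries and summarize them again.
      return summarizeRecursively(askLlm, partials.join("\n\n"), maxChars);
    }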
4.2 Known limitations and issues
The APIs of DoChatAI are synchronous for simplicity, as this was seen as sufficient for this kind of demo application: waiting for results after each call is acceptable. For real-world needs an asynchronous mode could easily be implemented by using session data to keep track of ongoing operations. The use of a free Zilliz account forces throttling of the rate of indexing calls, but this usually does not matter, since the automatic summarization is always slower anyway. The OpenAI APIs also have their rate limits at the lower account tiers, so summarization operations may sometimes need to be artificially slowed down by waiting between calls. By paying for a higher account tier this hindrance could be removed, but for the purposes of a thesis this was an acceptable sacrifice.
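In practice the artificial slow-down mentioned above is just a pause between successive API calls, roughly as in the sketch below; summarizeChunk is a hypothetical helper making one API call, and the 1000 ms delay is an illustrative value rather than a measured one.

    // Pause between successive calls to stay under the rate limits of lower account tiers.
    const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

    async function summarizeAllThrottled(
      chunks: string[],
      summarizeChunk: (chunk: string) => Promise<string>
    ): Promise<string[]> {
      const summaries: string[] = [];
      for (const chunk of chunks) {
        summaries.push(await summarizeChunk(chunk)); // one external API call
        await sleep(1000); // wait between calls to avoid hitting the rate limit
      }
      return summaries;
    }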
Although LangChain encapsulates the use of the underlying LLM, there are a lot of choices and parameters to tune to achieve the best possible results. The most crucial is the context length, which depends on the LLM used: GPT-3.5-turbo has a limit of 16k query tokens, whereas GPT-4-turbo allows as much as 128k tokens. This has to be considered when tuning performance and configuring data indexing. LLMs have certain mechanisms against repetitive patterns in data, which sometimes cause problems in summarization. Often the content has text formatting or tables that translate into repeating patterns containing control characters and whitespace, causing the model to reject such content. A simple cleaning mechanism was implemented to overcome this problem, although there is certainly room for development, as generic content might well contain structures not fixable by the existing code. Breaking down the document structure would help improve indexing and summarization, together with parsing tables, recognizing images, and adding all that information with metadata to the index. The current implementation parses only text content, which often leaves a large part of the available information out of reach of the search functionality, but such a feature would be a project of its own.
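The cleaning mechanism is essentially a normalization pass over the extracted text; a simplified version of the idea (not the exact DoChatAI code) could look like this:

    // Collapse runs of whitespace and strip control characters left behind by PDF
    // extraction, so that repetitive patterns do not trip the LLM's repetition checks.
    function cleanExtractedText(text: string): string {
      return text
        .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, " ") // control characters
        .replace(/[ \t]{2,}/g, " ")                                // runs of spaces and tabs
        .replace(/(\r?\n){3,}/g, "\n\n")                           // excessive blank lines
        .trim();
    }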
5 Results
In this chapter the results of DoChatAI testing are presented, together with comparisons to
ten other applications offering similar functionality.
5.1 Comparison to existing similar implementations
Due to the nature of this thesis, only free applications were chosen for testing. This obviously limits the available capabilities and performance, since the companies offering these applications also have to pay for their resources, whether on-premises or hosted. For that reason, performance in terms of execution speed is evaluated only as a side note, and only qualitative remarks are made about the results. The number of question-answering applications available on the Internet has exploded during the last year, presumably largely because the new APIs from OpenAI and the adoption of LangChain have boosted developer interest, and also because free example code and even whole projects are available for anyone to start developing with. The latter reason might also explain why the UIs of many simpler online applications are very similar. The selected ten applications, plus DoChatAI using two different versions of GPT, are listed in Table 3.
Two different PDF files were used in the comparison, representing different use cases: a shorter research paper versus a full e-book containing hundreds of pages. Due to the restrictions of the free accounts, only three applications in addition to DoChatAI (namely ChatPDF, text.cortex and ChatGPT) were able to handle the longer text, the usual limits being 50 to 120 pages. Some applications (e.g. PDFGear) were able to read the file and give a summary even though the content clearly was not available in its entirety, but these were not taken into account, since the results would presumably have been poor. The PDFs used were:
For the first PDF, the following set of questions were presented:
2. What are the five dimensions of service functionality according to the paper?
Answers from DoChatAI are included in Appendix 1; results from the other applications are not included because of the length of the texts. The answers are rated in Table 4, using a subjective scale from 0 to 5, with a mean value calculated from all results to reflect overall answer quality. The veracity of the details in the answers was not thoroughly examined; rather, the structure of the content and the amount of information were taken into consideration when rating the answers. ChatGPT, with its well-performing GPT-4 engine, was largely used as the baseline for comparisons. Usability, speed and other such aspects are not considered in this rating, since the point here is to compare only the quality of the output.
As the ratings are subjective, they are only the opinions of the author. Contributing attributes were the number of facts found, the level of additional context and explanation, and the presentation and structure of the response. As most of the tested applications use some version of the GPT models, it is expected that the results are somewhat similar, but there are also obvious differences. For example, ChatGPT tends to give longish answers, which may not always be desired, but it makes the responses more understandable, as there is always more than just a list of found matches. On the other hand, PDFPeer, for example, gives fast answers that are factual and mostly sufficient as such, but they lack the extra context, which weakens understandability. Question number 3 was difficult for some applications, namely Sharly, text.cortex, PDFGear, PDFPeer and ChatyPDF, which were unable to find that information at all. PDFGear also did not find an answer to question 4, which was unexpected, as all the others successfully retrieved the needed information at least to some extent. Generally, most applications were useful in summarizing the content and answering questions that had factual answers in the content, although some variation was found. The more abstract questions caused difficulties, but question 5 received surprisingly good answers across the board.
All in all, DoChatAI did surprisingly well, considering that it is not a commercial product but just a proof-of-concept created for this thesis. In this rating ChatGPT is generally the best, but DoChatAI with GPT-4-turbo and a specifically generated summary is on par with its quality, which is remarkable but also expected, given the use of the newest publicly available LLM. ChatPDF follows right after them, and DoChatAI with the older GPT-3.5-turbo model either shares fourth place with AskYourPDF or is fifth, depending on whether the discrete summarization function is used or not. For actual real-world use the ranking order would be different if other kinds of aspects, such as speed and usability, were prioritized.
The answers for the second PDF are not included due to their length. With this longer context the performance differences in terms of answer speed were especially notable: the distinct summaries generated by DoChatAI took 30.5 s (GPT-3.5-turbo) versus 501 s (GPT-4-turbo). These were run using the same configuration parameters, which may be suboptimal, but the comparison reflects well the performance differences between the models. It has to be noted again, though, that low-tier accounts were used to keep the price low, which means slower response times, depending also on the amount of load on the service.
The results are very similar to those for the shorter document, except that text.cortex was the only application that failed to answer one question (the term LLM is not found in the book). While the quality of the responses can be considered mostly very good, exact facts are still best verified manually afterwards because of the known weaknesses of LLMs. It is notable that ChatGPT could answer all the questions well even without the context document, but the number of included facts increases significantly with context data. All the models these applications use belong to the GPT family, so they too could answer even without document context, if configured to do so. ChatGPT was ranked the best by overall quality again, but ChatPDF was much more usable thanks to its rapid answers. DoChatAI also did quite well, but with the newer GPT-4-turbo model the response times were totally unacceptable, especially for the summary. The summary created by asking a question was rather adequate, though, and again sufficient for most uses, although the recursive summarization functionality produces richer information content.
6 Discussion
While the internal architecture of the applications other than DoChatAI is not known, there are several factors that can affect the results. The first and most obvious is the selected LLM, which can greatly affect the quality of the response. Most of the tested applications were using some GPT version, but clear differences can still be seen. The newest models, such as GPT-4 and GPT-4-turbo, are the most advanced available to external users, allowing greater context sizes and producing impressive results in terms of connecting logical entities in the context with their learned parameters. GPT-4, with its reported 1.7T parameters, is a huge model, and this makes its responses slower than some of its rivals (Google's Gemini, for example), but its quality is nevertheless the best available. Older GPT models such as GPT-3.5-turbo are still mostly good enough in quality, and many applications, including DoChatAI, can still use them with success. It is clearly visible, though, that changing to a newer, more advanced model (in this case, GPT-4-turbo) increases the perceived quality. OpenAI also offers the possibility to fine-tune its models, and it is quite possible that some of the tested applications use RAG-tuned versions of the models for improved context sensitivity and output formatting. In this project fine-tuning was not used, so DoChatAI runs on plain GPT models.
The prompting technique is a very important component, directly affecting the structure, tone and formatting of the output. A suitable prompt text can direct the LLM to act in the wanted role, hide unwanted technical details and help structure the response to suit the needs of the caller. The prompts used by DoChatAI are listed in Appendix 2. LLM querying in its simplest form is just composing and sending the question message, but advanced techniques take this further. The original question can be refined so that it works better both as a query to the retrieval database, yielding better matches from the vector database, and as an LLM query, resulting in better answers overall. Other advanced querying methods have also been researched and implemented, using heuristics to expand the query and thus providing more context information for the LLM to process. The context length limitation of LLMs strongly affects the preparation of queries, as most models allow only a few thousand tokens to be used at once in the query context. In contrast, the newest GPT-4 models allow as much as 128,000 tokens, which means the whole content of a shorter document can be given in the context without splitting it first. This of course causes a lot of tokens to be consumed, often unnecessarily, because the required information is usually contained in much smaller chunks of the original document. Models with shorter context windows require more careful preparation and prioritization of the context data, saving tokens and thus costs. (Mao et al., 2020; Marvin et al., 2024)
The underlying retrieval database solution is a big factor in total performance. The usual solution is to store the data and embeddings in a vector database, which offers a very fast way of querying semantically similar matches from a large text corpus. Different database engines provide their own specific perks, but the configuration of how data is split into chunks, and how much they overlap, affects the granularity of the results and again token consumption: chunks that are too big worsen the quality of search matches and cause unnecessary token consumption, while chunks that are too small, with little overlap, carry less meaningful context and produce weaker results when combined. One important parameter is the dimensionality of the vectors, which defines how many numeric components each vector has and thus affects the granularity of search matches. High dimensionality may lead to better matches, but it is computationally more expensive and slows down searches, while also increasing the storage space consumed. Low dimensionality, on the other hand, is faster and computationally cheaper, but may lower the quality of matches and produce less relevant results. The best level of dimensionality depends on the use case of the search and the stored content. Most vector databases can also store metadata fields, but other database types can also be used for related data retrieval, e.g. relational databases or hybrid databases combining both aspects. This metadata can be used to augment the context, providing more relevant information in the results.
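Similarity search in a vector database is typically based on a distance measure such as cosine similarity between the query vector and the stored chunk vectors; the dimensionality discussed above is simply the length of these vectors, which is why higher dimensionality costs more per comparison. A minimal sketch of the measure itself:

    // Cosine similarity between two embedding vectors of equal dimensionality.
    function cosineSimilarity(a: number[], b: number[]): number {
      let dot = 0, normA = 0, normB = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }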
Summarization of the whole content is the only way to get all the available data squeezed into a short summary. However, this is a costly operation, since in practice it requires a recursive summarization of summarizations so that the context length limit is not exceeded. This means many LLM queries are needed to go through all the data in a long document, consuming a considerable number of tokens. As can be seen in the result assessment in Table 4, the newest GPT-4-turbo is able to produce a summary of decent quality even through the normal question-answer mechanism, which is probably good enough for many use cases. On the other hand, the discrete summarization method needs to be run only once and the result can be stored for later use, as in DoChatAI, so the disadvantage of higher cost diminishes over a longer period.
The most notable functional difference between these comparison applications and DoChatAI is that the latter can handle any number of documents without length limitations, whereas the others can run queries against only one document at a time. Some tests were run using all the reference documents of this thesis as the test material, all in the same index at the same time, to see how DoChatAI handles such a load. As expected, the speed of execution stays about the same, but as the available context widens significantly, the amount of detail in answers to broad questions is limited by the context length. Questions with exact factual answers are mostly unaffected, since the vector searches match those very well and the set of search results remains concentrated. The problem with broad questions could be mitigated by using multi-step queries and iteration procedures, combined with fine-tuning the used model (Gao et al., 2023).
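A minimal sketch of such a multi-step procedure is shown below. The helpers ask_llm and retrieve_chunks are hypothetical placeholders for the LLM call and the vector search described earlier, so this only illustrates the general idea of query decomposition discussed by Gao et al. (2023), not their exact method.

def ask_llm(prompt: str) -> str:
    # Placeholder for a single LLM request; assumed to be provided elsewhere.
    raise NotImplementedError

def retrieve_chunks(query: str, k: int = 4) -> list[str]:
    # Placeholder for a vector search returning the k most similar chunks.
    raise NotImplementedError

def answer_broad_question(question: str) -> str:
    # Step 1: let the model split the broad question into narrower sub-questions.
    sub_questions = ask_llm(
        "Split the following question into at most five focused sub-questions, "
        "one per line:\n" + question
    ).splitlines()

    # Step 2: retrieve context and answer each sub-question separately, keeping
    # the intermediate answers short to stay within the context length limit.
    partial_answers = []
    for sub_question in (q.strip() for q in sub_questions if q.strip()):
        context = "\n".join(retrieve_chunks(sub_question))
        partial_answers.append(
            ask_llm("Context:\n" + context + "\n\nAnswer briefly: " + sub_question)
        )

    # Step 3: synthesize the partial answers into one final answer.
    combined = "\n".join(partial_answers)
    return ask_llm("Using these partial answers:\n" + combined + "\n\nAnswer: " + question)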
While RAG as a term is currently very visible in research and development, for the average user or company director it is just another piece of technology jargon unless some actual benefits can be shown in practice. Typical use cases are company intranets that allow fast search of information from all relevant company data, search for medical information from large corpora, or any other context requiring intelligent search and response capabilities. Because generative AI is capable of producing new material based on its static parameters combined with augmented context, these systems will be able to automatically produce handbooks, manuals and similar material from given technical data, or possibly even novels and short stories based on given contextual information. Multimodality broadens the possible use cases even further to the generation of images, music and video material, which are all already possible today but can be taken further with the aid of RAG-style technology. When combined with techniques for logical reasoning, automation will rise to unprecedented levels and move towards AGI, although that target still appears to be far away. Explainability, resource efficiency and factuality will be key points in the future, leading to wider adoption throughout society. As new innovations emerge around the topic, it is to be expected that there will be further technological leaps like those seen in recent years, boosting development even further, into dimensions we possibly cannot even imagine yet.
7 Conclusions
In this study the main research topics were generative AI, Large Language Models, and their use in a proof-of-concept implementation of a RAG system capable of answering natural language questions about the provided content. First, an overview of the history of AI and the development of LLMs was given, together with a description of the RAG architecture. A proof-of-concept implementation called DoChatAI was developed using modern tools and technologies and documented in the subsequent chapter. Test runs were planned and executed using suitable test materials, and the results were documented and analyzed in comparison with other existing tools providing similar functionality.
This research was conducted to provide a clear overview of the possibilities of LLMs combined with RAG technology and of their application in practice. At the time of writing this report there are already many open-source projects implementing comparable features, but the purpose of this development work was also to work through all the relevant details and accumulate the knowledge required to produce such software.
The results of the study include the practical work itself, which is provided as a GitHub repository containing the source code of a working server application (Malinen, 2024). This project successfully demonstrates the implementation parts needed for a functional RAG system. The testing phase produced mainly subjective and qualitative results, because arranging a more detailed performance study would have required a considerable budget for licenses, and even then the speed of execution would have depended on the runtime environments of the external services. Some measurements of response speed are included, but these are provided for reference only.
The obtained results demonstrate the success of the development part, answering research question RQ1. The answer to research question RQ2 is given in the analysis of the results, where the aspects affecting performance are investigated. Additionally, the results show that a self-developed RAG system can well reach the quality level of commercial tools. This is largely because the underlying LLMs were top-quality models from OpenAI, from the same family as those used by most of the tested tools. The perceived quality of responses was overall adequate and sufficient for summarizing documents and finding relevant information in them. The response speed of DoChatAI was rather slow, especially for the generation of distinct summaries, but this could be improved by paying for faster resources and tuning internal parameters.
The implemented application project could be used as a starting point for future development, or as a reference for how to integrate the needed services when implementing similar functionality. Future research might include comparisons between different LLMs, because there are many open-source alternatives that could be run on private resources, potentially saving substantial costs in the long run. Optimizing costs, answer quality and response latency using alternative LLMs is another project worth experimenting with.
References
12 Best Large Language Models (LLMs) in 2024 [WWW Document], 2024. URL
https://beebom.com/best-large-language-models-llms/ (accessed 27.2.24).
Amaratunga, T., 2023. Understanding Large Language Models. Apress, Berkeley, CA.
Chen, B., Zhang, Z., Langrené, N., Zhu, S., 2023. Unleashing the potential of prompt engineering in
Large Language Models: a comprehensive review.
Fieguth, P., 2022. An Introduction to Pattern Recognition and Machine Learning. Springer
International Publishing, Cham.
Finlay, J., Dix, A., 2020. An Introduction to Artificial Intelligence. CRC Press.
Flasiński, M., 2016. Introduction to Artificial Intelligence. Springer International Publishing, Cham.
Fui-Hoon Nah, F., Zheng, R., Cai, J., Siau, K., Chen, L., 2023. Generative AI and ChatGPT: Applications, challenges, and AI-human collaboration. Journal of Information Technology Case and Application Research 25, 277–304.
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Guo, Q., Wang, M., Wang, H., 2023.
Retrieval-Augmented Generation for Large Language Models: A Survey.
Han, Y., Liu, C., Wang, P., 2023. A Comprehensive Survey on Vector Database: Storage and
Retrieval Technique, Challenge.
Kulkarni, Akshay, Shivananda, A., Kulkarni, Anoosh, Gudivada, D., 2023. Applied Generative AI for
Beginners. Apress, Berkeley, CA.
Lazaridou, A., Gribovskaya, E., Stokowiec, W., Grigorev, N., 2022. Internet-augmented language
models through few-shot prompting for open-domain question answering.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.,
Rocktäschel, T., Riedel, S., Kiela, D., 2020. Retrieval-Augmented Generation for Knowledge-
Intensive NLP Tasks.
Mao, Y., He, P., Liu, X., Shen, Y., Gao, J., Han, J., Chen, W., 2020. Generation-Augmented Retrieval
for Open-domain Question Answering.
Marvin, G., Hellen, N., Jjingo, D., Nakatumba-Nabende, J., 2024. Prompt Engineering in Large
Language Models. pp. 387–402.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J., 2013. Distributed Representations of
Words and Phrases and their Compositionality.
Mukhoti, J., Gal, Y., Torr, P.H.S., Dokania, P.K., 2023. Fine-tuning can cripple your foundation
model; preserving features may be the solution.
Naveed, H., Khan, A.U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N., Mian, A.,
2023. A Comprehensive Overview of Large Language Models.
Novelli, C., Casolari, F., Hacker, P., Spedicato, G., Floridi, L., 2024. Generative AI in EU Law:
Liability, Privacy, Intellectual Property, and Cybersecurity.
Peffers, K., Tuunanen, T., Rothenberger, M.A., Chatterjee, S., 2007. A Design Science Research Methodology for Information Systems Research. Journal of Management Information Systems 24, 45–77.
Sahoo, P., Singh, A.K., Saha, S., Jain, V., Mondal, S., Chadha, A., 2024. A Systematic Survey of
Prompt Engineering in Large Language Models: Techniques and Applications.
Sanderson, M., Croft, W.B., 2012. The History of Information Retrieval Research. Proceedings of the IEEE 100, 1444–1451.
The List of 11 Most Popular Open Source LLMs of 2023 [WWW Document], 2023. URL
https://www.lakera.ai/blog/open-source-llms (accessed 2.2.24).
Touvron, H., Martin, L., Stone, K., 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.,
2017. Attention Is All You Need.
Wu, S., Koo, M., Blum, L., Black, A., Kao, L., Scalzo, F., Kurtz, I., 2023. A Comparative Study of
Open-Source Large Language Models, GPT-4 and Claude 2: Multiple-Choice Test Taking in
Nephrology.
Zilliz Vector Database [WWW Document], 2024. URL https://zilliz.com/ (accessed 7.1.24).
1. GPT-3.5-turbo
The study focuses on the assimilation of B2B platforms by organizational buyers and the
effects of IT-enabled service functionality. It identifies five dimensions of service
functionality - information search, negotiation, acquisition, ownership, and retirement - and
theorizes that their importance varies based on the current level of platform assimilation.
Enablers such as benefits and top management support, and inhibitors like assimilation costs,
managerial complexity, and demand uncertainty, influence a firm's decision to assimilate a
B2B platform. The study conducted a two-staged field survey with 191 professionals,
finding that the importance of service functionality dimensions changes as firms move
through assimilation stages. The research contributes to theory development by highlighting
the importance of different IT-mediated services and factors influencing platform
assimilation. The study provides management implications for B2B platform owners and
buyer organizations. The study discusses the augmentation of the CSLC model in the B2B
context to include negotiation phases, as negotiation is common in B2B transactions. The
dimensions of IT-enabled services in B2B platforms are presented, including information
search, negotiation, acquisition, ownership, and retirement. The study explores the factors
influencing the assimilation of B2B platforms in organizations, emphasizing the importance
of organizational readiness, top management support, and enablers like benefits. The study
also highlights inhibitors such as assimilation costs and demand uncertainty, and their impact
on B2B platform assimilation. The research model developed in the paper illustrates how
service functionalities influence decision-makers to assimilate B2B platforms in their
organizations' purchasing activities. Overall, the study provides valuable insights into the
complex dynamics of B2B interactions, the factors influencing organizational success in the
digital age, and the importance of considering enablers and inhibitors in B2B platform
assimilation decisions. The research contributes to theory development by identifying key
IT-enabled services for B2B platforms and understanding factors influencing future B2B
assimilation. The study offers practical implications for B2B platform owners and buyer
organizations, emphasizing the need for organizational readiness, top management support,
and effective communication of potential benefits to facilitate successful platform
assimilation.
Refined question: What is the main content or key points of this document?
Answer: The document discusses the investigation of service functionality in the domain of
B2B platform assimilation from the buyer's perspective. It uses a Customer Service
Lifecycle Framework to identify five dimensions of service functionality: information
search, negotiation, acquisition, ownership, and retirement. The importance of these
dimensions is contingent upon the current level of a firm's assimilation of the platform. The
study theorizes that these dimensions play a crucial role in determining the benefits derived
from the platform. It also highlights the key enablers of future B2B platform assimilation,
such as benefits and top management support, and their importance throughout the
assimilation stages. Additionally, it discusses the negative impact of assimilation costs on
future B2B assimilation when the firm's current stage of assimilation is at a low level.
2. GPT-4-turbo
This paper delves into the adoption of B2B platforms by organizational buyers, emphasizing
IT-enabled service functionalities throughout the customer service lifecycle, which includes
information search, negotiation, acquisition, ownership, and retirement. The study posits that
the importance of these functionalities evolves as a firm progresses in its assimilation of B2B
platforms. It suggests that while benefits and top management support consistently promote
assimilation, costs, complexity, and uncertainty can impede it, though costs become less
significant as firms become more accustomed to the platforms. A survey of 191
professionals provided empirical evidence, showing that as companies transition from initial
awareness to widespread use of B2B platforms, they shift their focus from information
search to ownership and retirement functionalities. The research contributes to the
understanding of IT service functionality in B2B platform assimilation and offers practical
insights for platform providers and buyer organizations, advocating for tailored strategies
and technology designs to support buyers at different stages of platform assimilation. The
article by A.K.M. Najmul Islam et al. in the Journal of Strategic Information Systems
expands the Customer Service Life Cycle (CSLC) model for B2B platforms by integrating
negotiation and IT-enabled services, which were previously underrepresented in the CSLC
framework. The authors propose a seven-stage assimilation model ranging from awareness
to general deployment and suggest measuring assimilation through overall platform use and
an aggregated assessment of individual services. The study identifies enablers and inhibitors
of B2B platform assimilation, with benefits being the primary enablers and costs, managerial
complexity, and demand uncertainty acting as inhibitors. The research also examines the
impact of IT functionalities on the benefits and assimilation process, finding that the
relevance of these functionalities changes with the firm's stage of assimilation.
Organizational readiness, including IT sophistication and financial resources, along with top
management support, are highlighted as critical for successful B2B platform integration. The
study's methodology involved surveying business professionals and testing a theoretical
framework, with findings suggesting that benefits and top management support drive future
assimilation, while the negative effects of assimilation costs decrease over time. The article
concludes with recommendations for firms to adapt early to mitigate assimilation costs and
suggests future research directions. It also references a body of literature that examines
technology adoption in firms, providing a comprehensive understanding of the factors
influencing technology adoption and usage in business contexts.
Refined Question: Could you provide an executive summary highlighting the key points and
conclusions of the document?
Certainly! The document appears to be a research study from the Journal of Strategic
Information Systems, authored by A.K.M. Najmul Islam and colleagues, published in 2020.
The study is focused on the assimilation of B2B (business-to-business) e-marketplace
platforms and the factors that influence this process. Here's an executive summary of the key
points and conclusions:
**Study Overview:**
The research is divided into two main studies. The first study aims to identify important IT-
enabled service functionalities of B2B platforms by consulting a panel of business
professionals with experience in organizational purchasing. The second study tests a
theoretical framework to understand the factors that enable or inhibit the assimilation of B2B
platforms in organizations.
- The panel evaluated a list of service functionalities and confirmed that all important
functionalities were included, with no additional functionalities suggested.
- The panelists also helped develop a list of factors that act as enablers and inhibitors for the
use of B2B platforms.
- The research model was tested using a field survey methodology with data from business
professionals knowledgeable about B2B platforms.
- The survey items were adapted from previous literature or developed anew when necessary.
- The sample distribution was somewhat representative of industry groups, with a notable
difference in the information category, likely due to the sample being targeted towards SMEs
(small and medium-sized enterprises).
**Key Findings:**
- Top management support and organizational readiness were reported as significant positive
effects on B2B platform assimilation.
- Factors such as direct and indirect benefits, IT sophistication, and financial resources were
measured and linked to future B2B assimilation.
- The study found that the importance of different IT-mediated services is contingent upon
the current level of assimilation within a firm.
- The paper contributes to theory development by showing which factors retain their
importance across different stages of assimilation.
**Conclusions:**
- The study concludes that IT-enabled service functionalities are critical for attracting and
retaining customers on B2B platforms.
- The authors suggest that future research should extend the study by including the seller
perspective, validating findings for larger organizations, and refining the research model.
**References:**
The document cites various references, including works by Rindfleisch and Heide (1997),
Roberts and Greenwood (1997), Rogers (1995), Cohen and Levinthal (1990), Dahlman
(1979), and others, indicating a robust literature review to support the research. This
summary encapsulates the essence of the document, highlighting the methodology, findings,
and conclusions drawn by the authors.
What are the five dimensions of service functionality according to the paper?
1. GPT-3.5-turbo
Refined question: What are the five dimensions of service functionality as outlined in the
paper?
• Information Search
• Negotiation
• Acquisition
• Ownership
• Retirement
These dimensions represent the key areas of service functionality provided by IT in the
context of B2B platform assimilation. Each dimension plays a crucial role in assisting buyer
organizations in their procurement activities.
2. GPT-4-turbo
Refined Question: What are the five dimensions of service functionality as outlined in the
referenced paper?
The five dimensions of service functionality outlined in the referenced paper are:
Information Search: This includes functionalities that assist buyers in finding suitable
products and suppliers, as well as general information about the B2B platform. It
encompasses search for products, search for suppliers, and search about the B2B platform.
Acquisition: This dimension includes functionalities related to the ordering and purchasing
process, such as order placement, payment processing, and shipment and delivery tracking.
Ownership: This dimension involves functionalities that help buyers exploit the purchased
product to its fullest extent, including training, monitoring activities related to the products,
and maintenance.
Retirement: This dimension includes functionalities that assist buyers in managing the end-
of-life of products, such as tracking spending portfolios (account for) and providing feedback
on suppliers and the platform (evaluate).
1. GPT-3.5-turbo
Refined question: How were hypothesis H1 and H2 supported by the data analysis results?
Hypothesis H1 was supported by the data analysis results. The interaction term of
information search and current service assimilation had a negative effect on benefits, which
aligns with the hypothesis. This indicates that the effect of information search on benefits is
moderated by the current level of service assimilation, with a stronger effect observed among
companies with a low level of service assimilation. On the other hand, Hypothesis H2 was
not supported by the data analysis results. The interaction term of negotiation and current
service assimilation had a non-significant effect on benefits, contrary to what was
hypothesized. This suggests that the relationship between negotiation and benefits was not
influenced by the current level of service assimilation as expected in the hypothesis.
2. GPT-4-turbo
Refined Question: What evidence in the provided results supports the fulfillment of
hypotheses H1 and H2?
The evidence supporting the fulfillment of hypotheses H1 and H2 can be found in the results
section of the study, where the authors would have analyzed the data collected from their
survey or experiment to see if the interaction terms of current service assimilation with
information search (for H1) and negotiation (for H2) had significant effects on benefits. For
H1, which posits that the effect of information search on benefits is moderated by the current
level of service assimilation such that the effect will be stronger among companies with a
low level of service assimilation, the evidence would be a significant interaction effect
indicating that as the level of current service assimilation decreases, the positive impact of
information search on perceived benefits increases. For H2, which suggests that the effect
of negotiation on benefits is moderated by the current level of service assimilation such that
the effect will be stronger among companies with a low level of service assimilation, the
evidence would similarly be a significant interaction effect showing that lower levels of
current service assimilation amplify the positive impact of negotiation functionalities on
perceived benefits. The statistical significance of these interaction terms would typically be
reported in the form of p-values, with values less than a predetermined threshold (commonly
0.05) indicating that the results are unlikely to be due to chance. Additionally, the authors
might report coefficients for the interaction terms, and possibly include graphs or tables that
illustrate how the relationship between the IT-mediated services and benefits changes at
different levels of service assimilation.
1. GPT-3.5-turbo
The sole focus of the study was on organizational buyers, neglecting the seller's perspective,
which is also crucial for designing B2B platforms.
The study sample was mostly targeted towards small and medium-sized enterprises (SMEs),
potentially limiting the generalizability of the findings to larger organizations.
The study highlighted that the panelists' feedback and responses were based on their own
experiences and perceptions, which may introduce bias or subjectivity.
The study mentioned that some functionalities, such as training, might not be universally
useful for all organizations, indicating potential variability in the relevance of certain service
functionalities.
The study recognized that the panelists' responses were based on their evaluations of the
service functionalities provided, which may vary based on individual perspectives and needs.
The study did not report any major deletions or additions to the list of important
functionalities provided to the panelists, suggesting a relatively stable set of service
functionalities identified. These limitations provide insights into the scope and potential
constraints of the study, highlighting areas for future research and considerations for
interpreting the findings.
2. GPT-4-turbo
Refined Question: What specific limitations did the authors identify in the study's
methodology or findings as described within the provided context?
The authors of the study identified several limitations in their methodology and findings:
**Direct Link Between IT-Enabled Services and Benefits**: The study linked IT-mediated
services directly with benefits, without considering potential mediating factors. Future
research could incorporate relevant object-based attitudes to investigate their mediating
effects between object-based beliefs and behavioral beliefs.
**Generalizability Across Industries**: The study divided the sample into three industry
groups for analysis, but it is not clear how generalizable the findings are across other
industries or if the results remained consistent among industry groups.
**Potential for Common Method Bias**: Although the authors took steps to mitigate
common method bias and found it unlikely to distort interpretations, it remains a potential
limitation in any study relying on self-report data.
**Non-Response Bias**: The authors conducted tests to check for non-response bias and
found no significant differences, but non-response bias is always a concern in survey-based
research.
assimilation, but this may not capture the full complexity of how organizations value
different services at various stages.
**Factors Retaining Importance Across Assimilation Stages**: The study aimed to identify
factors that retain their importance throughout assimilation stages, but there may be other
factors not considered in the study that could influence B2B platform assimilation. These
limitations suggest areas for future research to build on the findings of the study and to
address the gaps identified by the authors.
1. GPT-3.5-turbo
Refined question: How does the content specifically address or contribute to digital
transformation initiatives?
The content specifically addresses digital transformation initiatives by focusing on the role
of IT-enabled service functionalities in B2B platforms. It discusses how leveraging IT can
support core product or service transactions to help organizations achieve their online
purchasing goals. The study emphasizes the importance of IT-mediated service
functionalities in attracting new customers, retaining existing ones, lowering costs, and
gaining a competitive advantage in the B2B e-commerce market. Furthermore, the research
identifies a list of important IT-enabled services for B2B platforms and validates their
practical value among mature B2B platforms. It provides insights into how platform owners
can leverage IT to support clients' purchasing processes effectively. The study also
highlights the need for future research to investigate how offline activities can encourage
B2B platform assimilation for online purchasing, thus contributing to digital transformation
initiatives in the B2B context. Overall, the content contributes to digital transformation
initiatives by emphasizing the significance of IT-enabled service functionalities,
organizational readiness, and top management support in facilitating the assimilation of B2B
platforms for purchasing purposes. It provides a framework for understanding the factors
influencing the future assimilation of B2B platforms and offers practical implications for
both platform owners and buyer organizations looking to enhance their digital capabilities.
2. GPT-4-turbo
Refined Question: What aspects of the content are relevant to digital transformation within
the provided context?
Service Functionality and Assimilation: The study theorizes that the importance of IT-
mediated service functionalities is contingent upon the current level of service assimilation
within an organization. This reflects how digital capabilities can evolve and become more
integral to business processes over time.
Enablers and Inhibitors of Platform Assimilation: The research model includes factors that
enable or inhibit the assimilation of B2B platforms, such as benefits, top management
support, assimilation costs, managerial complexity, and demand uncertainty. These factors
are critical in understanding the challenges and drivers of digital transformation in the
context of B2B e-commerce.
Organizational Readiness and Top Management Support: The study emphasizes the role of
organizational readiness and top management support in the successful assimilation of B2B
platforms, highlighting the need for strategic alignment and resource allocation for digital
initiatives.
Impact of Assimilation Costs: The research finds that assimilation costs, such as managerial
complexity and transactional risks, have a diminishing negative impact as firms progress
through higher levels of service assimilation. This suggests that as organizations become
more digitally mature, they are better equipped to manage the costs associated with digital
transformation.
Benefits of B2B Platform Assimilation: The study underscores the direct and indirect
benefits that firms realize from B2B platform assimilation, such as cost savings, improved
customer satisfaction, and competitive advantages. These benefits are key motivators for
organizations to pursue digital transformation.
Customer Service Life Cycle (CSLC) Framework: The CSLC framework is adapted to
identify service functionalities in the B2B context, providing a structured approach to
enhancing customer service through digital means. These aspects are relevant to digital
transformation as they illustrate how B2B platforms are leveraging digital technologies to
transform traditional business purchasing processes, enhance service offerings, and address
the needs and challenges of organizational buyers in a digitally interconnected marketplace.
Appendix 2: Prompt texts used in DoChatAI
1. AI prompt
2. System prompt
Answer the given question based on the context and chat history, using descriptive language
and specific details. List the information and all the sources available from the context
metadata field:
System: {context}
System: Refine the following question to suit properly for querying data and use the refined
question as the new question.
QUESTION:
Appendix 3: Example of JSON messages for a single query
1. Question
{ "chatId": "bdb61f05-9ff9-455b-873e-4e4b1d12615e",
"searchType": "qachain",
"question": "How were the hypotheses H1 and H2 met in the results?" }
2. Answer
{ "chatId": " bdb61f05-9ff9-455b-873e-4e4b1d12615e",
"chatName": "Dochat test 48 - cache2 4.5",
"messages": [ {
"question": " How were the hypotheses H1 and H2 met in the results?",
"answer": {
"answerMsg": " The specific results in the data that support the validation
of hypothesis H1 (H1) are as follows:\n\n- The interaction term of information
search and current service assimilation had a negative effect on benefits, supporting
H1. This suggests that the effect of information search on perceived benefits is
stronger among companies with a lower level of service assimilation.\n\nIn
contrast, the findings for hypothesis H2 (H2) are:\n\n- The interaction term of
negotiation and current service assimilation had a non-significant effect on benefits.
Thus, H2 was not supported. This indicates that the hypothesized moderating effect
of the current level of service assimilation on the relationship between negotiation
and perceived benefits was not statistically significant.\n\nThese results are derived
from the analysis of the data and are listed in the context provided, specifically
mentioning the interaction terms and their effects on the perceived benefits, which
are central to the validation of these hypotheses.",
"sourceDocuments": "{\"source\":\"B2B platorm
Assimilation.pdf\",\"title\":\"Organizational buyers’ assimilation of B2B platforms_
Effects of IT-enabled service functionality\"}",
"generatedQuestion": "Refined Question: Could you provide an executive
summary highlighting the key points and conclusions of the document?" }
},
]
}