Article
Evaluating Retrieval-Augmented Generation Models for
Financial Report Question and Answering
Ivan Iaroshev 1 , Ramalingam Pillai 1 , Leandro Vaglietti 1 and Thomas Hanne 2, *
Abstract: This study explores the application of retrieval-augmented generation (RAG) to improve
the accuracy and reliability of large language models (LLMs) in the context of financial report analysis.
The focus is on enabling private investors to make informed decisions by enhancing the question-
and-answering capabilities regarding the half-yearly or quarterly financial reports of banks. The
study adopts a Design Science Research (DSR) methodology to develop and evaluate an RAG system
tailored for this use case. The study conducts a series of experiments to explore models in which
different RAG components are used. The aim is to enhance context relevance, answer faithfulness,
and answer relevance. The results indicate that model one (OpenAI ADA and OpenAI GPT-4)
achieved the highest performance, showing robust accuracy and relevance in its responses. Model three
(MiniLM Embedder and OpenAI GPT-4) scored significantly lower, indicating the importance of
high-quality components. The evaluation also revealed that well-structured reports result in better
RAG performance than less coherent reports. Qualitative questions received higher scores than the
quantitative ones, demonstrating the RAG’s proficiency in handling descriptive data. In conclusion, a tailored RAG system can aid investors by providing accurate and contextually relevant information from financial reports, thereby enhancing decision making.
Keywords: retrieval-augmented generation; large language models (LLMs); financial reports; question
and answering with LLM
1. Introduction
Since the inception of Artificial Intelligence (AI), researchers have pursued the ambitious goal of developing machines capable of reading, writing, and conversing like humans. This occurred in the domain of natural language processing (NLP). According to Zhao et al. [1], NLP has a rich history: from the development of statistical language models (SLMs) to the rise of neural language models (NLMs), linguistic capabilities have significantly increased, enabling AI systems to understand and generate complex language patterns [1].
In recent years, NLP has witnessed remarkable progress with the introduction of large language models (LLMs). These models, which were extensively trained on vast amounts of textual data, demonstrated unprecedented proficiency in generating human-like text and performing language-based tasks accurately [2]. However, despite their advancements, LLMs still face several limitations. Burtsev et al. [3] outline three main limitations of LLMs. First, they struggle with complex reasoning, hindering their ability to draw accurate conclusions. Second, their knowledge or expertise is limited to training data, which may lead to failure in providing relevant information. Third, they may produce inaccurate outputs due to a lack of understanding of prompts, which makes them marginally helpful.
The rise of LLMs has also sparked interest in their application to specific tasks, prompting the emergence of an approach called retrieval-augmented generation (RAG). This
approach was first introduced by Guu et al. [4] and Lewis et al. [5]. RAG was devised
to extend the capabilities of LLMs beyond conventional training data. By integrating a
specialized body of knowledge, RAG enables LLMs to provide more accurate responses
to user queries. In essence, RAG comprises two distinct phases: retrieval and generation.
During the retrieval phase, defined sources (e.g., indexed documents) are analyzed to
extract relevant information that is aligned with the user’s prompt or question. The re-
trieved information is then seamlessly integrated with the user’s prompt and forwarded to
the language model. In the generative phase, the LLM leverages the augmented prompt
and internal understanding to craft a tailored response that addresses the user’s query
effectively [6,7].
RAG is gaining momentum for its potential to improve LLM-generated responses by
grounding the models in such external sources, thereby reducing issues such as inconsis-
tency and hallucination [8,9].
To achieve this, a thesis statement and main research question were formulated, supplemented
by additional sub-questions.
Thesis Statement: The effectiveness of RAG in enhancing context relevance, answer
faithfulness, and answer relevance for analyzing half-yearly or quarterly reports can be
evaluated through empirical experiments with different model configurations.
Main Research Question: How can RAG be adapted to increase the context relevance,
answer faithfulness, and answer relevance of LLM conclusions about several types of
financial reports?
1. How can the retrieval component of RAG be set up to extract relevant information
from half-yearly or quarterly reports?
2. What strategies can be employed to ensure that the generation component of RAG
produces answers to the questions posed about half-yearly or quarterly reports?
3. How can the effectiveness of the RAG system for analyzing half-yearly or quarterly
reports in the banking sector be reliably evaluated and validated?
4. How accurately do RAG-generated responses represent the information extracted
from the half-yearly or quarterly reports of banking institutions?
2. Literature Review
In this review, RAG systems, including evaluation metrics, methods, and frameworks, are
briefly discussed. In addition, a research gap is identified that requires further investigation.
Methodologically, the literature review was initiated with extensive searches on Google
Scholar using broad keywords such as “Retrieval Augmented Generation” and “RAG” to
capture general papers on the topic. As the investigation progressed, the search terms were
refined to focus on specific aspects of RAG, including “Evaluation methods of RAG”. This
approach includes both forward and backward searches to ensure the comprehensive cov-
erage of the relevant literature. While priority was given to peer-reviewed academic papers,
a selection of non-peer-reviewed papers with substantial citations and reputable authors
was also integrated, acknowledging the rapid advancements in LLM and RAG research.
For instance, the Retrieval-Augmented Generation Benchmark (RGB) [8] evaluates four essential
abilities for RAG: noise robustness, negative rejection, information integration, and coun-
terfactual robustness. However, this approach focuses on question-and-answering tasks to
measure RAG performance. Other approaches, such as those discussed by Lyu et al. [16],
support a wider range of RAG applications but are more challenging to implement.
3. Research Design
This section presents the research methodology used to address the research questions
outlined in Section 1.2 and delineates the data sources and collection methods employed.
The research tackles a real-world challenge by developing a novel and innovative
artifact. Hence, Design Science Research (DSR), as proposed by Hevner and Chatterjee [29],
is recommended. DSR involves the creation and evaluation of an artifact. In this context, it
entails the establishment of an RAG system, along with the adaptation of its components,
such as the embedding model or LLM. These adaptations will be tailored to varying model
configurations, which will be evaluated. This application aims to tackle an important
problem by addressing the lack of a straightforward RAG approach tailored for the use
case of private investors analyzing half-yearly or quarterly bank reports.
3.1.2. Suggestion
The second step of DSR involves determining the type of artifact that could solve the
current problem [29]. In this case, the proposed artifact is an RAG system and its corre-
sponding implementation. This RAG system should be designed to facilitate the analysis
of the half-yearly and quarterly reports of banks. It is essential to note that the RAG system,
respectively, the retrieval and generation components, should be able to encompass various
scenarios, thereby allowing for comprehensive evaluation and adaptation as needed.
3.1.3. Development
After defining the problem and identifying the artifact type, as per Hevner and
Chatterjee [29], the Development phase should concentrate on designing and creating an
artifact that offers a solution to the defined problem. Sub-questions one and two (see Table 1), outlined in Section 1.2, are integral to this phase.
In this phase, the focus is on developing and implementing an RAG system. Establishing an RAG system requires the definition and selection of certain components. Figure 1 shows the simplified architecture.
Figure 1. RAG system according to [6].
The flow is outlined based on the following 9 steps (identified by the circled number in Figure 1) based on [6]:
1. Begin by gathering essential reports for the RAG system’s operation, specifically targeting half-yearly and quarterly reports from banks. Further insights into the data collection procedures are provided in Section 3.2.
2. Break down the collected data into manageable chunks to streamline information retrieval and improve efficiency by avoiding the processing of entire documents. This segmentation ensures that each data chunk remains focused on a specific topic, which increases the likelihood of retrieving relevant information for user queries. Subsequently, these segmented data are transformed into vector representations known as embeddings, which capture the semantic essence of the text.
3. The resulting embeddings are stored in a dedicated vector database to facilitate the efficient retrieval of pertinent information, transcending traditional word-to-word comparison methods.
4. The user query formulation is initiated.
5. Upon entry into the system, the user query is converted to an embedding or vector
representation. To ensure consistency, the same model is used for both document and
query embeddings.
6. Use the transformed query to search the vector database and perform comparisons
with document embeddings.
7. The most relevant text chunks are retrieved, and a contextual framework is established
to address the user query.
8. The retrieved text chunks are integrated with the original user query to provide a
unified prompt for LLM.
9. The unified prompt, which comprises the retrieved text chunks and the original user
query, is sent to the LLM. The LLM utilizes its advanced natural language processing
capabilities and extensive knowledge to generate coherent responses tailored to
address the user’s queries effectively, leveraging the additional context provided by
the text chunks to enhance the accuracy and depth of the responses.
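To make these steps more concrete, the following minimal sketch illustrates steps 2 through 9 in Python. It uses a small sentence-transformer embedder and a FAISS index (two of the candidate components listed below) and is an illustrative outline rather than the implementation evaluated in this study; the sample chunks are invented, and the final LLM call is left as a placeholder.

```python
# Minimal sketch of the RAG flow (steps 2-9): embed chunks, store them in a vector
# index, embed the query with the same model, retrieve the best-matching chunks,
# and assemble an augmented prompt for the LLM.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Step 2: assume the reports have already been split into focused chunks (invented examples).
chunks = [
    "Net interest income for H1 2023 increased, driven by higher policy rates.",
    "The bank names rising funding costs and credit risk as key challenges for H2 2023.",
]

# Steps 2-3: embed the chunks and store the vectors in a FAISS index.
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(chunk_vecs.shape[1])  # inner product equals cosine on normalized vectors
index.add(np.asarray(chunk_vecs, dtype="float32"))

# Steps 4-6: embed the user query with the same model and search the index.
query = "Which risks does the bank highlight for the second half of the year?"
query_vec = np.asarray(embedder.encode([query], normalize_embeddings=True), dtype="float32")
scores, ids = index.search(query_vec, 2)

# Steps 7-8: combine the retrieved chunks with the original query into one prompt.
context = "\n".join(chunks[i] for i in ids[0])
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)

# Step 9: send `prompt` to the chosen LLM (e.g., GPT-4o or Gemini 1.5 Pro) to generate the answer.
```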
Therefore, for the development of a newly built RAG system, the essential components include an embedding model, a vector database, an LLM, and a comprehensive orchestration tool to manage the entire system seamlessly. The following tools were identified as potential setups:
• Embedding Model: MiniLMEmbedder [30] and BedRockEmbeddings [31];
• Vector Database: FAISS [32] and Chroma [33];
• LLM: GPT-4o [34], Llama 3 [35], and Gemini 1.5 Pro [36];
• Orchestrator: LangChain [37] and LlamaIndex [38].
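As a rough illustration of how an orchestrator ties these components together, the sketch below wires a HuggingFace embedder and a FAISS store through LangChain. Import paths and method names differ between LangChain releases, and the placeholder chunks are invented, so this is an outline under those assumptions rather than the configuration used in the experiments.

```python
# Rough orchestration sketch: LangChain coordinating an embedding model and a vector store.
# Import paths vary across LangChain versions (langchain vs. langchain_community).
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

report_chunks = [
    "Operating income for Q2 2023 was supported by higher net interest margins.",
    "Cost discipline kept the cost/income ratio broadly stable versus the prior quarter.",
]  # invented placeholder chunks

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_texts(report_chunks, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

# The retrieved documents would then be combined with the user question in the prompt of the
# selected LLM (e.g., GPT-4o, Llama 3, or Gemini 1.5 Pro) by a question-answering chain.
docs = retriever.get_relevant_documents("How did total operating income develop in Q2 2023?")
```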
Furthermore, there exists an alternative approach involving the use of a pre-built RAG
system that can be readily deployed. One such system is the Verba RAG by Weaviate. In
this case, it is also possible to create different model configurations by using different em-
bedding models (e.g., OpenAI ADA, MiniLMEmbedder) and LLMs (e.g., GPT-4o, Gemini
1.5 Pro) [39,40].
In any case, the goal is to create three distinct technical model configurations, each
using different components. For each model configuration, a selection of quarterly and
half-yearly reports is gathered from various banks, as detailed in Section 3.2. Using these
documents, querying and answering via the RAG system will be conducted. Ten questions
per bank are formulated for this purpose and tested with each model configuration.
3.1.5. Conclusions
In the final step of the DSR process, the focus is on discussing the findings and evalu-
ating their generalizability, as well as considering future aspects, such as open questions
and plans for further development [29].
Thus, the aim is to illuminate the potential of the specific RAG system for broader im-
plementation, identify avenues for continued development and enhancement, and address
any open questions regarding its functionality, model configuration, output, or evaluation.
4. Solution Development
The following sections provide the research findings according to the main Design
Science Research phases of Hevner and Chatterjee [29].
Read-Chunk Manager: The component (https://weaviate.io/blog/verba-open-source-rag-app#chunker-manager-breaking-data-up, accessed on 10 June 2024) receives a list of strings representing uploaded documents. In our case, it handles the reports collected from various banks. The Read-Chunk Manager takes a list of documents and breaks each document’s text into smaller segments. For this use case, the Read-Chunk Manager divides each document into chunks of 100-word tokens with a 50-token overlap. This method will remain consistent across all the tests. This approach is based on preliminary testing, which clearly indicated that larger chunk sizes (e.g., 250- and 400-word tokens) negatively affected the quality of the final output of the proposed RAG system when applied to the financial reports.
Embedding Manager: The Embedding Manager (https://weaviate.io/blog/verba-open-source-rag-app#embedding-manager-vectorizing-data, accessed on 10 June 2024) receives a list of documents and embeds them as vectors into Weaviate as the relevant database. It is also used to retrieve chunks and documents from Weaviate. The specific embedding model used will vary according to the model configuration described in Section 4.2.1.
Retrieve Manager: The Retrieve Manager (https://weaviate.io/blog/verba-open-source-rag-app#retrieve-manager-finding-the-context, accessed on 10 June 2024) communicates with the Embedding Manager to retrieve chunks and apply custom logic. It returns a list of chunks. For this use case, the “WindowRetriever” is employed for all the tests, which retrieves relevant chunks and their surrounding context using a combination of semantic and keyword search (hybrid approach).
Generation Manager: The Generation Manager (https://weaviate.io/blog/verba-open-source-rag-app#generation-manager-writing-the-answer, accessed on 10 June 2024) uses a list of chunks and a query to generate an answer. Then, it returns a string as the answer. Like the Embedding Manager, the specific generating model used will vary according to the model configuration described in Section 4.2.1.
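For illustration, the sliding-window chunking used by the Read-Chunk Manager above (100-word tokens with a 50-token overlap) can be approximated as in the sketch below; the whitespace tokenization is a simplification of Verba's actual chunker.

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word-token chunks (simplified sketch)."""
    tokens = text.split()          # naive whitespace tokenization, for illustration only
    step = chunk_size - overlap    # a 50-token stride yields a 50-token overlap between chunks
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```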
Figure 2. Illustration of the Verba RAG architecture based on [39].
To address research questions one and two of Section 1.2, the RAG system will be developed, and three model
configurations will be created, each using different components. The setup and design of
the experiments are outlined in the following sections.
4.2.2. Data
For each model configuration, two quarterly and two half-yearly reports from the five
selected banks are used, as described in Section 3.2. To introduce complexity and simulate
real-world challenges, reports from three additional banks are included.
4.2.3. Questions
Each bank will be subjected to ten specific questions under each of the three model configurations.
Seven of these questions are general, applicable across all the banks, and designed to
evaluate the system’s general capabilities. These general questions are a mix of quantitative
(e.g., financial metrics like revenue or profit) and qualitative types (e.g., risks, challenges,
and trends), but all are questions that a private investor could ask. They were developed it-
eratively, starting with simple queries and refining them through experiments to determine
the optimal length and style. We ended up with relatively long questions that accommodate
different wordings, such as revenue, total income, and gross earnings, as banks often use
these terms synonymously. These questions can be found in Appendix A.
The remaining three questions are tailored to each individual bank to assess the
system’s ability to handle specific and detailed inquiries.
4.2.4. Experiments
The experiments were conducted by providing the RAG system with ten questions
for each bank and model configuration. To ensure precise analysis and response, these
questions were submitted individually in a single session rather than in batches. Each
question, answer, and the retrieved context were compiled into an Excel file. The complete file includes data for five banks for three model configurations, with ten questions per model configuration.
The RAG system was set up according to the identified requirements, developing three
distinct model configurations with different components. This setup addresses research
questions 1 and 2 from Section 1.2, particularly defining the retrieval and generation
components. On the other hand, the defined and developed setup of the RAG system and
the experiments form the basis for answering research questions three and four.
Scale Definition
1: The retrieved context is irrelevant to the question, the generated response is inaccurate and inconsistent, and it fails to address the question.
2: The retrieved context has limited relevance and contains several inaccuracies, the generated response exhibits notable inconsistencies, and it only partially addresses the given question.
3: The retrieved context is somewhat relevant but has some inaccuracies, the generated response is generally consistent with minor errors, and it adequately addresses the question.
4: The retrieved context is relevant and mostly accurate, the generated response is consistent with minor or no errors, and it effectively addresses the question.
5: The retrieved context is highly relevant and accurate, the generated response is entirely faithful to the context with no errors, and it thoroughly addresses the question.
A total of 750 points are possible for each model configuration, as there are five banks
with 10 questions for each model, with five points possible per question in three different
categories. The maximum number of points per model configuration is calculated as
follows: 5 × 10 × 5 × 3 = 750 points.
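For illustration, if the per-question scores were exported to a flat table (hypothetical columns: model, bank, question, metric, score), the totals and averages per model configuration could be computed as in the sketch below; this is not part of the study's tooling.

```python
import pandas as pd

# Hypothetical export of the evaluation sheet: one row per (model, bank, question, metric),
# with `score` in the range 1-5.
scores = pd.read_csv("evaluation_scores.csv")

# Per model configuration: 5 banks x 10 questions x 3 metrics = 150 scores,
# so the maximum total is 150 x 5 = 750 points.
per_model = scores.groupby("model")["score"].agg(total="sum", average="mean")
print(per_model)
```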
Table 6. Evaluation results: three model configurations (best results are indicated in bold).
Model one (OpenAI ADA and OpenAI GPT-4o) achieved the highest overall score
(552 of 750 points). With an average score of 3.7 out of 5, this model configuration falls
between the 3- and 4-point definitions on the scale presented in Section 5.1. Thus, the
retrieved context was generally relevant and mostly accurate, and the generated responses
were consistent with minor errors to address the posed questions. The lowest scores were
assigned to “Context Relevance”, often due to retrieving over 10–15 chunks, many of which
were not pertinent. However, even with irrelevant chunks, the LLM still provided mostly
accurate answers. This highlights GPT-4o’s robustness in handling excess information.
Model two (OpenAI ADA and Gemini 1.5 Pro) received a slightly lower overall score.
The “Context Relevance” score remained nearly stable due to using the same embedding
model. However, the scores for “Answer Faithfulness” and “Answer Relevance” were
slightly lower, reflecting the impact of using a different LLM. This observation aligns with
benchmarks from the LLM Leaderboard of the Large Model Systems Organization [44],
where GPT-4o was rated with 1287 points compared to Gemini 1.5 Pro with 1266 points. In
addition, the Massive Multitask Language Understanding (MMLU) benchmark developed
by Hendrycks et al. [45] obtained a score of 88.7% for GPT-4o and 81.9% for Gemini
1.5 Pro [46]. These benchmarks indicate a slight performance edge for GPT-4o, which was
reflected in the evaluation results.
Model three (MiniLM Embedder and OpenAI GPT-4o) demonstrated a significantly
lower overall score of 438 points, which was more than 20% lower than model one and
lower than model two. Changing the embedding model from OpenAI’s ADA to the
MiniLM Embedder resulted in lower scores for all the metrics, which is not unexpected as
the new model is much smaller than ADA. The “Context Relevance” score was considerably
lower because much of the retrieved context was not relevant to the posed questions. We
observed that this model configuration often retrieved fewer chunks than the previous
models, which also affected “Answer Faithfulness” and “Answer Relevance”. Without
sufficient (quantity) and relevant (quality) context, the LLM struggled to generate accurate
answers. This underscores the importance of using a high-quality embedding model to
retrieve relevant information.
In summary, to answer research questions three and four from Section 1.2, models one
and two demonstrated a more accurate representation of the necessary information, while
model three lagged behind. The evaluation approach and setup were effective for reliably
evaluating and validating these experiments.
(e.g., line charts and bar charts) with minimal written text. On the contrary, HSBC’s reports
are text-heavy with well-explained tables and minimal charts. This layout makes it easier
for retrievers to gather the relevant context. Consequently, the HSBC reports are more
suitable for the RAG system.
Table 7. Results of evaluation—total points per bank (best results are indicated in bold).
Table 8 presents the results for each of the seven standards and three individual
questions. The highest scores were obtained on individual questions. This indicates that
when specific questions and wording align with the report, the RAG system performs well
across all three metrics, thereby providing accurate answers.
Table 8. Results of evaluation—total points per question (best results are indicated in bold).
The qualitative questions also received higher average scores than the quantitative
questions. It was found that when multiple similar numbers appeared in a report, the RAG
sometimes had difficulties handling them accurately. For example, question 4 asked for the
total assets held by the bank at the end of H1 2023. Banks typically report various types
of assets, such as risk-weighted assets, customer assets, or high-quality liquid assets. If
the RAG system selected the wrong number, it received a low score. However, minimal hallucination was observed; instead, the primary issue was the selection of a number that does appear in the report but does not answer the question. On the other hand, for qualitative questions such as market
trends, the RAG system often produced at least a partially correct answer, which resulted
in higher scores.
6. Conclusions
This section consolidates our research findings, assesses the success of addressing the
main research question, and presents the key insights gained from the study. Practical
implications, limitations, and the potential future developments of the suggested artifact
are discussed.
The study focused on the main research question: “How can the RAG be adapted
to increase the context relevance, answer faithfulness, and answer relevance of LLM’s
conclusions about several types of financial reports?”. By developing an RAG system and
exploring three technical RAG model configurations employing different components (em-
bedding models and LLMs), variations in these metrics were identified. The development
and evaluation process led us to conclude that the first model configuration was particu-
larly effective. Ultimately, the findings suggest that we can empower private investors to
make informed decisions when provided with accurate answers derived by such an RAG
system from relevant bank reports.
Although the main research question was addressed successfully, the study revealed
several areas for improvement. A significant limitation was the RAG system’s difficulty in
processing complex PDF layouts. Future research should aim to optimize the system by
integrating more robust components, such as computer vision models, to handle visual
information more effectively. Moreover, adding a domain-specific repository for finan-
cial terminology would enhance context understanding and mitigate issues arising from
different terminology used by different banks for similar concepts (e.g., assets).
Future research holds various promising avenues to provide investors with even
more reliable insights. Expanding the system to include a broader spectrum of banks
is a critical aspect. In addition, exploring more advanced evaluation methods could
enhance the robustness and accuracy of the assessment process, thereby improving RAG’s
overall efficacy.
By addressing these limitations and exploring the outlined research directions, the
RAG systems’ capability to analyze financial reports and deliver valuable insights to private
investors could be enhanced.
Appendix A
References
1. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language
Models. arXiv 2023, arXiv:2303.18223.
2. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language
Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
3. Burtsev, M.; Reeves, M.; Job, A. The Working Limitations of Large Language Models. MIT Sloan Manag. Rev. 2024, 65, 8–10.
4. Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M. REALM: Retrieval-Augmented Language Model Pre-Training. In Proceedings
of the International Conference on Machine Learning, PMLR, Vienna, Austria, 13–18 July 2020; pp. 3929–3938.
5. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T. Retrieval-
Augmented Generation for Knowledge-Intensive NLP Tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
6. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Guo, Q.; Wang, M.; et al. Retrieval-Augmented Generation for
Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997.
7. IBM Research. What Is Retrieval-Augmented Generation? Available online: https://research.ibm.com/blog/retrieval-
augmented-generation-RAG (accessed on 9 February 2021).
8. Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the
AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 17754–17762.
9. Shuster, K.; Poff, S.; Chen, M.; Kiela, D.; Weston, J. Retrieval Augmentation Reduces Hallucination in Conversation. arXiv 2021,
arXiv:2104.07567.
10. Anil, R.; Borgeaud, S.; Wu, Y.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A
family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805.
11. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al.
Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
12. He, H.; Zhang, H.; Roth, D. Rethinking with Retrieval: Faithful Large Language Model Inference. arXiv 2022, arXiv:2301.00303.
13. Shen, X.; Chen, Z.; Backes, M.; Zhang, Y. In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT. arXiv
2023, arXiv:2304.08979.
14. Zhang, Y.; Li, Y.; Cui, L.; Cai, D.; Liu, L.; Fu, T.; Huang, X.; Zhao, E.; Zhang, Y.; Chen, Y.; et al. Siren’s Song in the AI Ocean: A
Survey on Hallucination in Large Language Models. arXiv 2023, arXiv:2309.01219.
15. Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; Van Den Driessche, G.B.; Lespiau, J.-B.; Damoc, B.;
Clark, A. Improving language models by retrieving from trillions of tokens. In Proceedings of the International Conference on
Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 2206–2240.
16. Lyu, Y.; Li, Z.; Niu, S.; Xiong, F.; Tang, B.; Wang, W.; Wu, H.; Liu, H.; Xu, T.; Chen, E.; et al. CRUD-RAG: A Comprehensive
Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models. arXiv 2024, arXiv:2401.17043.
17. Es, S.; James, J.; Espinosa-Anke, L.; Schockaert, S. RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv
2023, arXiv:2309.15217.
18. Saad-Falcon, J.; Khattab, O.; Potts, C.; Zaharia, M. ARES: An Automated Evaluation Framework for Retrieval-Augmented
Generation Systems. arXiv 2023, arXiv:2311.09476.
19. Ragas. Core Concepts. 2023. Available online: https://docs.ragas.io/en/latest/concepts/index.html (accessed on 10 July 2024).
20. TruLens. RAG Triad. 2024. Available online: https://truera.com/ai-quality-education/generative-ai-rags/what-is-the-rag-triad/
(accessed on 10 July 2024).
21. Besbes, A. A 3-Step Approach to Evaluate a Retrieval Augmented Generation (RAG). Towards Data Science. Available online:
https://towardsdatascience.com/a-3-step-approach-to-evaluate-a-retrieval-augmented-generation-rag-5acf2aba86de (accessed
on 23 November 2023).
22. Besbes, A. Quickly Evaluate Your RAG without Manually Labeling Test Data. Towards Data Science. Available online:
https://towardsdatascience.com/quickly-evaluate-your-rag-without-manually-labeling-test-data-43ade0ae187a (accessed on 21
December 2023).
23. Frenchi, C. Evaluating RAG: Using LLMs to Automate Benchmarking of Retrieval Augmented Generation Systems. Willow Tree
Apps. Available online: https://www.willowtreeapps.com/craft/evaluating-rag-using-llms-to-automate-benchmarking-of-
retrieval-augmented-generation-systems (accessed on 1 December 2023).
24. Leal, M.; Frenchi, C. Evaluating Truthfulness: Benchmarking LLM Accuracy. Willow Tree Apps. Available online: https:
//www.willowtreeapps.com/craft/evaluating-truthfulness-a-deeper-dive-into-benchmarking-llm-accuracy (accessed on 21
September 2023).
25. Nguyen, R. LlamaIndex: How to Evaluate Your RAG (Retrieval Augmented Generation) Applications. Better Programming.
Available online: https://betterprogramming.pub/llamaindex-how-to-evaluate-your-rag-retrieval-augmented-generation-
applications-2c83490f489 (accessed on 1 October 2023).
26. Sarmah, B.; Zhu, T.; Mehta, D.; Pasquali, S. Towards reducing hallucination in extracting information from financial reports using
Large Language Models. arXiv 2023, arXiv:2310.10760.
27. Yepes, A.J.; You, Y.; Milczek, J.; Laverde, S.; Li, R. Financial Report Chunking for Effective Retrieval Augmented Generation. arXiv
2024, arXiv:2402.05131.
28. Zhang, B.; Yang, H.; Zhou, T.; Ali Babar, M.; Liu, X.-Y. Enhancing Financial Sentiment Analysis via Retrieval Augmented Large
Language Models. In Proceedings of the Fourth ACM International Conference on AI in Finance, Brooklyn, NY, USA, 27–29
November 2023; pp. 349–356.
29. Hevner, A.; Chatterjee, S. (Eds.) Design Science Research in Information Systems. In Design Research in Information Systems: Theory
and Practice; Springer: Berlin/Heidelberg, Germany, 2010; pp. 9–22. [CrossRef]
30. HuggingFace. MiniLMEmbedder. Available online: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 (accessed
on 18 January 2024).
31. LangChain. Bedrock Embeddings. 2023. Available online: https://python.langchain.com/v0.1/docs/integrations/text_
embedding/bedrock/ (accessed on 10 July 2024).
32. Meta. Faiss: A Library for Efficient Similarity Search. Available online: https://engineering.fb.com/2017/03/29/data-
infrastructure/faiss-a-library-for-efficient-similarity-search/ (accessed on 29 March 2017).
33. Chroma. Chroma Docs. 2024. Available online: https://docs.trychroma.com/getting-started (accessed on 10 July 2024).
34. OpenAI. OpenAI Platform. 2024. Available online: https://platform.openai.com/docs/overview (accessed on 10 July 2024).
35. Meta. Meta Llama 3. 2024. Available online: https://llama.meta.com/llama3/ (accessed on 10 July 2024).
36. Google DeepMind. Gemini Pro 1.5. Available online: https://deepmind.google/technologies/gemini/pro/ (accessed on 20 May
2024).
37. LangChain. Introduction to LangChain. 2023. Available online: https://python.langchain.com/v0.1/docs/get_started/
introduction/ (accessed on 10 July 2024).
38. LlamaIndex. LlamaIndex Docs. 2024. Available online: https://docs.llamaindex.ai/en/stable/ (accessed on 10 July 2024).
39. Weaviate. Verba Docs. GitHub. 2024. Available online: https://github.com/weaviate/Verba (accessed on 10 July 2024).
40. Weaviate. Verba—Demo Tool. 2024. Available online: https://verba.weaviate.io/ (accessed on 10 July 2024).
41. OpenAI. Hello GPT-4o. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 13 May 2024).
42. Google. Gemini 1.5 Pro Now Available in 180+ Countries. Available online: https://developers.googleblog.com/en/gemini-15
-pro-now-available-in-180-countries-with-native-audio-understanding-system-instructions-json-mode-and-more/ (accessed on
9 April 2024).
43. Google. Gemini 1.5 Pro Updates. Available online: https://blog.google/technology/developers/gemini-gemma-developer-
updates-may-2024/ (accessed on 14 May 2024).
44. Large Model Systems Organization. LLM Leaderboard. Available online: https://chat.lmsys.org/?leaderboard (accessed on 6
June 2024).
45. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language
Understanding. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event,
Austria, 3–7 May 2021; pp. 11260–11285.
46. Papers with Code. Multi-task Language Understanding on MMLU. 2024. Available online: https://paperswithcode.com/sota/
multi-task-language-understanding-on-mmlu (accessed on 10 July 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.