
IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues


Diji Yang, University of California Santa Cruz, Santa Cruz, USA ([email protected])
Jinmeng Rao, Mineral.ai, Mountain View, USA ([email protected])
Kezhen Chen∗, Together AI, Mountain View, USA ([email protected])
Xiaoyuan Guo∗, Google, Mountain View, USA ([email protected])
Yawen Zhang, Mineral.ai, Mountain View, USA ([email protected])
Jie Yang∗, Cybever, Mountain View, USA ([email protected])
Yi Zhang, University of California Santa Cruz, Santa Cruz, USA ([email protected])
ABSTRACT
Although Retrieval-Augmented Generation (RAG) paradigms can use external knowledge to enhance and ground the outputs of Large Language Models (LLMs) to mitigate generative hallucinations and static knowledge base problems, they still suffer from limited flexibility in adopting Information Retrieval (IR) systems with varying capabilities, constrained interpretability during the multi-round retrieval process, and a lack of end-to-end optimization. To address these challenges, we propose a novel LLM-centric approach, IM-RAG, that integrates IR systems with LLMs to support multi-round RAG through learning Inner Monologues (IM, i.e., the human inner voice that narrates one's thoughts). During the IM process, the LLM serves as the core reasoning model (i.e., Reasoner) to either propose queries to collect more information via the Retriever or to provide a final answer based on the conversational context. We also introduce a Refiner that improves the outputs from the Retriever, effectively bridging the gap between the Reasoner and IR modules with varying capabilities and fostering multi-round communications. The entire IM process is optimized via Reinforcement Learning (RL), where a Progress Tracker is incorporated to provide mid-step rewards, and the answer prediction is further separately optimized via Supervised Fine-Tuning (SFT). We conduct extensive experiments with the HotPotQA dataset, a popular benchmark for retrieval-based, multi-step question answering. The results show that our approach achieves state-of-the-art (SOTA) performance while providing high flexibility in integrating IR modules as well as strong interpretability exhibited in the learned inner monologues.

CCS CONCEPTS
• Information systems → Question answering; Language models.

KEYWORDS
retrieval augmented generation, inner monologue, large language models, question answering, multi-round retrieval

ACM Reference Format:
Diji Yang, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Jie Yang, and Yi Zhang. 2024. IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24), July 14–18, 2024, Washington, DC, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3626772.3657760

∗ Work done at Mineral.ai.

This work is licensed under a Creative Commons Attribution International 4.0 License.
SIGIR '24, July 14–18, 2024, Washington, DC, USA
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0431-4/24/07.
https://doi.org/10.1145/3626772.3657760

1 INTRODUCTION
Large Language Models (LLMs) have demonstrated impressive capabilities in language understanding and generation [5, 30, 44]; however, there are two major challenges: generative hallucination [50] and static knowledge [18]. While LLMs possess a deep understanding of human language and can generate creative responses, they lack the ability to verify facts or access up-to-date information [1, 28]. To mitigate such issues, integrating Information Retrieval (IR) systems with LLMs has become an increasingly promising direction. IR systems complement LLMs by retrieving timely and relevant information, enhancing the factuality of responses. The synergy between LLMs and IR systems, Retrieval-Augmented Generation (RAG) [28, 40], improves the ability of LLMs and powers generative AI products like ChatGPT, Bard, and Bing, showcasing the power and future potential of combining IR systems and LLMs for more accurate and reliable responses.
There are two typical paradigms for improving RAG systems: joint training vs. training different components separately. The first paradigm involves joint training of LLMs and retrievers on knowledge-intensive tasks, enhancing the retrieval capabilities of language models [13].


[Figure 1 shows a worked example: for the question "Substorm was described in qualitative terms by a scientist nominated for what seven times?", the Reasoner issues the sub-queries "Which scientist is known for describing substorms?" and "Kristian Birkeland was nominated seven times for what?", the Retriever and Refiner surface the Kristian Birkeland passages, and the Reasoner answers "Nobel Prize".]
Figure 1: The Inner Monologue (IM) process in IM-RAG. Given a user-posed question, the Reasoner first determines if it has
enough information to provide an answer. If not, it acts as a Questioner, proposing a query to request more information. The
query is then directed to the Retriever, which searches for relevant documents in the knowledge source. Subsequently, the
Refiner refines the retrieved documents to highlight the most pertinent information, which is then returned to the Reasoner.
This iterative process may continue over multiple rounds until the Reasoner believes it has gathered enough information, at
which point it becomes an Answerer and generates a final answer. This IM process provides valuable insights into the reasoning
process, enabling humans to understand how the system arrived at its conclusions.

For example, Guu et al. [10] did joint training of an LLM and a retriever's semantic embedding, and their approach has shown promising results. However, it lacks interpretability because the communication between LLMs and retrievers relies on complex deep-learning gradient propagation and cross-attention between IR embedding models and LLMs. Furthermore, this training approach is very computationally expensive, and it is hard or costly to retrain the retriever's semantic embedding as LLMs change or learn. The second paradigm improves the LLM and/or the IR engines separately. Most prior work in this paradigm focuses on improving the LLM (LLM-centric), either through prompting or fine-tuning LLM parameters [19, 29, 33]. The prompting-based approach provides simplicity and flexibility without incurring extra training costs and allows the integration of black-box LLMs and search engines through API calls. However, it suffers from the lack of end-to-end optimization of the whole system. For example, effort spent on improving the LLM's search query rewriting/generation module may not lead to better retrieval performance, as the improvement is not well tailored for the specific search engine used. Besides, a static LLM generation module may not perform well when fed with both relevant and irrelevant documents. In contrast, a training-based approach collects and utilizes human-annotated interaction records between LLMs and IR modules, and then uses them to supervise LLMs in learning how to better utilize and interact with IR modules. Although this approach has shown better performance than the prompting-based approach on simple image-to-text retrieval-based visual question-answering tasks [23], it requires a significant amount of labeled training data as well as substantial training costs. For complex problems that require multi-step reasoning and multi-round retrieval, training data with human-labeled multi-round search records can be expensive to collect, and the effectiveness of this method is unclear. In this work, we mainly focus on improving the LLM-centric paradigm, considering its performance, flexibility, and interpretability.
Recently, IMMO [52] trained an LLM and a vision-language model to have Inner Monologues (i.e., Question-Answering (QA) dialogues); their results show that the learned IM performs explicit multi-step reasoning and does well on complex visual question-answering problems while remaining explainable.
Motivated by IMMO, we adapt the concept of IM to RAG to enable LLMs to do multi-round retrieval, as we believe learning IM could also be beneficial for the communication and collaboration between LLM and IR modules. Prior cognitive science studies suggest that human Inner Monologue encompasses a broader spectrum of mental processes beyond QA dialogues, including abstract concepts and symbolic representations [8, 47]. Thus, in this paper, we extend IM communication beyond the format of QA dialogues in natural language and further generalize IM to involve more formats that are more appropriate for RAG systems (e.g., ranking results and returning scalar scores). This leads to a novel LLM-centric framework, IM-RAG, that integrates LLMs with IR to support context-aware multi-round interactive retrieval through learning IM.


In our framework, the LLM (i.e., Reasoner) acts as the mastermind of IM-RAG, smoothly switching between two crucial roles during the multi-round communication. When additional information is needed, it becomes a Questioner, crafting new queries based on the conversational context to acquire more relevant documents from the Retriever (i.e., a search engine); when enough information is gathered, it automatically transitions to an Answerer, summarizes search results for the original user query, and sends the final response to the user. To better adapt a search engine to an LLM, we add a Refiner component after the Retriever. This component learns to refine retrieved documents (e.g., reranking or reformatting) to meet the needs of the LLM. This helps the LLM's reasoning process and facilitates the interaction with the Retriever, as it bridges the gap between LLMs and retrievers. With the Refiner as a learnable adapter, one can switch or add more IR modules without worrying much about changes in IR module capabilities and output formats. A Progress Tracker is introduced to track the multi-round retrieval progress, so that the LLM can switch its role from Questioner to Answerer. We use RL to optimize the IM interaction between the LLM and the Retriever, with multi-round retrieval progress as reward signals. Figure 1 shows one example of how our IM-RAG system solves complex question-answering problems through multi-round retrieval. We summarize our contributions as follows:
• Inspired by IMMO, we introduce a novel approach, IM-RAG, that connects LLMs and IR modules for context-aware multi-round RAG through learning IM. The IM learning process can be optimized via RL without intermediate human annotations. The learning process enables the key components of a RAG system (query generation, result ranking, answer generation, etc.) to be trained to match the capability of the other components. Thus, the whole RAG system is optimized.
• Our work offers a solution that provides flexibility in adopting IR modules and LLMs with varying capabilities, as well as interpretability for multi-round retrieval.
• We demonstrate the efficacy of our approach on the HotPotQA dataset [54], a popular knowledge-intensive multi-hop question-answering dataset, and our approach achieves SOTA performance.

2 RELATED WORKS
Retrieval-Augmented Generation for LLMs. Language models often face challenges such as generating hallucinations or being constrained by static knowledge bases. RAG has been identified as a potential solution to tackle these challenges, offering reliable grounding and the flexibility to access various external knowledge bases. One paradigm of RAG is to jointly train language models and retrievers on knowledge-intensive tasks [10, 13, 20, 22]. For example, REALM [10] models retrieved documents as latent variables and jointly pretrains a BERT-style language model and a neural retriever through Masked Language Modeling (MLM). Atlas [13] demonstrates that joint training can also bring strong few-shot learning capabilities on a wide range of knowledge-intensive tasks. RA-DIT [22] proposes a dual instruction tuning method to retrofit language models with retrieval capabilities and achieves SOTA performance on many knowledge-intensive zero-shot and few-shot learning benchmarks. With the rise of LLMs, building LLM-centric systems emerges as another popular paradigm of RAG, where an LLM acts as a core reasoning model, and other models and tools (including retrievers such as search engines and neural retrievers) are integrated with the LLM through prompting or training. For example, HuggingGPT [39] and Chameleon [24] prompt LLMs with tool descriptions and usage examples to accomplish various complex reasoning tasks by composing various tools. Though these prompt-based methods offer flexible plug-and-play solutions, they are hard to optimize end to end. Other works, such as ToolFormer [35], train LLMs on filtered and sampled API calls to teach LLMs how to use a variety of tools. These training-based methods can be supervised while requiring a large amount of training data and providing limited interpretability for multi-round retrieval. Our work focuses on enhancing the multi-round retrieval capabilities of LLM-centric systems through IM learning, which can be optimized end-to-end without heavy training data curation costs while providing high flexibility and interpretability.
Question Answering. The evolution of Question-Answering (QA) research, particularly within the realm of information retrieval, has been significantly influenced by initiatives like the Text Retrieval Conference (TREC) QA track in the early 2000s. Traditional approaches to open-domain QA usually include a retriever that finds relevant documents and a reader that processes retrieved documents to generate answer candidates. Extensive research has been done on how to improve retriever-based approaches, such as iterative approaches that sequentially update search queries at each iteration. Most of those approaches do not change the retriever or the reader. Recently, Zhu et al. [60] model the iterative retrieval and answer process as a partially observed Markov decision process, with carefully designed actions and states of the agents, and train each component of the system. Ma et al. [25] propose to chain together carefully designed skills or modules, each specialized in a specific type of information-processing task, for question answering; one such skill is retrieval based on a query expanded with previous-hop evidence for multi-round retrieval. Our proposed research is motivated by the success of prior research on iterative retrieval, while we are more focused on enhancing the ability of large-scale language models, and we propose a novel iterative retrieval solution that is more general and explainable based on the strength of LLMs.
Inner Monologue. Recent studies have demonstrated the significant potential of LLM-centric systems in reasoning, planning, fact-checking, and knowledge management through carefully crafted chain-of-thought prompts, facilitating multi-agent collaboration [12, 49, 53]. As a cognitive process, Inner Monologue (i.e., self-talk conducted through the internal stream of thoughts) has recently been recognized as an efficient prompting strategy for LLM-centric systems [3, 12, 48, 52]. For example, by leveraging environmental feedback, Huang et al. [12] apply IM to an LLM-centric system to enable grounded closed-loop feedback for robot planning and reasoning. Zhou et al. [59] design and add IM to enable LLMs to better understand and use communication skills. IMMO [52] proposes that natural language QA dialogues between an LLM and a Vision-Language Model (VLM) can serve as a form of IM, which can be further optimized end-to-end via RL. However, this QA-based IM is restrictive, as it only facilitates interactions among models capable of processing and responding in QA formats.


[Figure 2 diagram: on the IR side, the Retriever and Refiner return top-k refined passages and scores; on the reasoning side, the Reasoner (an LLM) switches between the Questioner and Answerer roles; the Progress Tracker scores retrieval progress. The Questioner is trained with Reinforcement Learning and the Answerer with Supervised Learning, connected through multi-round Inner Monologue that ends with the final prediction for the task.]

Figure 2: Overview of IM-RAG framework. It involves four main components: a Reasoner, a Retriever, a Refiner, and a Progress
Tracker. The Reasoner is responsible for core reasoning, switching its role between Questioner (learning to propose queries to
request relevant documents via the Retriever) and Answerer (learning to predict a final answer based on the conversational
context). The Refiner improves the retrieved documents via rephrasing or reranking and passes the top-k highlighted documents
to both the Progress Tracker for predicting progress scores and the Reasoner for further reasoning. The training of Questioner
happens during the RL stage, where the progress scores are used as rewards. The training of Answerer happens during the SFT
stage, where the original questions, learned IM with refined top-k documents at each turn, and ground truth answers are used
as finetuning examples.

In the field of IR, many traditional IR modules' inputs and outputs may not form QA pairs or even natural language. In this work, we further extend the IM within LLM-centric systems to any form of communication between the "Reasoner" and the "Retriever" (e.g., lists of text chunks, ranking results, or scalar scores), either structured or unstructured, to provide high flexibility for communication and room for optimization. A "Refiner" is added after the "Retriever" to refine any form of output into a desired format and length for LLMs. Our approach is anticipated to be a versatile framework that facilitates collaboration between components in LLM-centric systems.

3 METHODOLOGY
In this section, we first briefly review the IMMO process [52], which shares a similar learning framework with our approach. Then, we present IM-RAG as well as the rationales behind its design.

3.1 Review of IMMO
IMMO tackles commonsense visual question-answering tasks by leveraging the LLM's rich common-sense knowledge in conjunction with the VLM's image-understanding capabilities. During the learning stage, the LLM engages in a dialogue with the VLM in natural language format, which is the IM process in the system. After multiple turns of conversation, the LLM gathers enough information and provides a final answer. The whole IM process is optimized through Proximal Policy Optimization (PPO) [36] based on the correctness of the final answer and penalized by the Kullback–Leibler (KL) divergence between the updated and the initial policy [14]. This approach does not require human-annotated multi-round conversations for RL and only uses the correctness of the final answer as the reward signal. Although IMMO achieves impressive performance, the lack of mid-step rewards makes it difficult to optimize the behavior at each step during the overall multi-step reasoning process. Additionally, the QA-based IM used in IMMO can be restrictive. It is important to recognize that in an LLM-centric system, various interactions, such as communications with retrievers, do not always rely on natural language dialogues. In our work, we broaden the form and use of IM to include information retrieval. Our approach introduces mid-step rewards to provide more detailed and precise feedback at each step during the RL process, improving the system's capability in multi-round interactive retrieval.

3.2 The IM-RAG Approach
IM-RAG, as depicted in Figure 2, is an LLM-centric system which consists of four components: a Reasoner, a Retriever, a Refiner, and a Progress Tracker. The components are connected through multi-round Inner Monologues. Below, we first illustrate the design of each component, then describe the training process of our approach.


3.2.1 Reasoner. As shown in Figure 2, the Reasoner serves as the core reasoning component in the IM-RAG framework with two key responsibilities: (1) Questioning: crafting search queries to acquire relevant documents iteratively through IR; (2) Answering: providing the final answer to the initial question based on the multi-round interaction between the Reasoner and the Retriever (i.e., the Inner Monologues within IM-RAG). For these two responsibilities, we introduce two distinct parameter-efficient adapters to specialize each capability during the learning process. Specifically, we added two LoRA [11] adapters to the same base LLM, namely the Questioner and the Answerer. We first train the Questioner through its multi-round IM with the Retriever via reinforcement learning. During this RL stage, the Questioner learns how to decompose a complex task (e.g., a question that requires multi-step retrieval and reasoning) into a series of simpler sub-queries. The sub-queries depend on the previous communication context, which can include the sub-query and the retrieved documents from the previous step, as well as the original question. We then train the Answerer through Supervised Fine-Tuning (SFT) to directly answer the original question. During the SFT stage, the Answerer leverages the IM learned from the RL stage and provides a correct answer. The detailed training strategies of the two adapters are illustrated in Sections 3.2.5 and 3.2.6, respectively.

3.2.2 Retriever. As shown in Figure 2, the purpose of the Retriever component in IM-RAG is to accurately retrieve relevant documents given search queries from the Reasoner during the IM process. The specific architecture of the Retriever and its knowledge resources can be flexible depending on the task or dataset. Conceptually, most existing search engines, dense retrievers, or matching algorithms can be directly adopted into the IM-RAG framework as the Retriever. There are two reasons behind this design: (1) all the components in IM-RAG are fully decoupled, which makes IM-RAG an efficient plug-and-play solution; (2) the Refiner component (introduced below) is able to refine a variety of outputs from different IR modules into content of a desired format and length, which gives more freedom in the selection of the Retriever.

3.2.3 Refiner. As illustrated in Figure 2, we introduced a Refiner component in IM-RAG to enhance the inner monologue process, particularly the multi-round conversations between the Reasoner and the Retriever. The Refiner serves as a post-processor for the Retriever's outputs. Its introduction is driven by two primary motivations. First, the outputs from various IR modules differ in format and length, which might not be ideally suited as contextual prompts for LLMs. The Refiner addresses this by rephrasing and standardizing these outputs into concise, well-formatted passages. Second, the varying capabilities of different IR modules can lead to unfiltered or unranked results, which can limit their utility. The Refiner improves these results by reranking and filtering, making sure only the important information stands out. In essence, the Refiner provides flexibility in the choice of IR modules and ensures their compatibility with the Reasoner, effectively bridging the gap between the Retriever and the Reasoner and streamlining the IM process.

3.2.4 Progress Tracker. RL algorithms such as PPO are inherently plagued by optimization inefficiencies when the search space is huge [36]. One way to mitigate these inefficiencies is to provide well-designed mid-step rewards during the multi-round process [21, 45]. Thus, we introduce a Progress Tracker component in IM-RAG to provide a reward score based on retrieval progress at each turn. When the accumulated score exceeds a certain threshold, it indicates that the Reasoner has acquired sufficient information and should give a final answer. In practice, the scoring design of the Progress Tracker can be flexible, varying across different tasks, retrievers, and datasets. This flexibility may include a neural reward model [30] or a discrete reward function [52]. In IM-RAG, we introduce a soft distance-score design based on cosine similarity, which provides robust reward signals while maintaining simplicity.
Denote the top-1 passage from the Refiner at the i-th turn as p^r_i, and let {p_1, p_2, ..., p_n} be the list of golden support passages (SP), where n is the length of SP. The closest passage to p^r_i can be found by cosine similarity. For brevity, the cos function shown in Equations 1 and 2 includes the operation of encoding a passage into the embedding space.

$p_{closest} = \arg\max_{p \in SP} \cos(p^r_i, p)$  (1)

$d_i = 1 - \cos(p^r_i, p_{closest})$  (2)

The distance score d_i indicates the quality of p^r_i, which is bound to the query q_i. Since p_closest is considered to have been (attempted to be) retrieved, it is removed from SP. By updating the list of passages that have not yet been retrieved, dependencies are set between IM turns. The distance score of subsequent turns will partially depend on all preceding actions.
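The following Python sketch illustrates this distance score using the sentence-transformers library; the encoder checkpoint and function names are illustrative assumptions, not the paper's exact implementation.

```python
from sentence_transformers import SentenceTransformer, util

# Any semantic-search encoder can be plugged in here; the paper's exact checkpoint is not assumed.
encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

def distance_score(refined_passage: str, support_passages: list[str]) -> tuple[float, int]:
    """Eq. (1)-(2): distance of the top refined passage to the closest remaining gold passage."""
    p_emb = encoder.encode(refined_passage, convert_to_tensor=True)
    sp_emb = encoder.encode(support_passages, convert_to_tensor=True)
    sims = util.cos_sim(p_emb, sp_emb)[0]     # cosine similarity to every gold passage in SP
    closest = int(sims.argmax())              # index of p_closest (Eq. 1)
    d_i = 1.0 - float(sims[closest])          # distance score d_i (Eq. 2)
    return d_i, closest

# Per Section 3.2.4, the caller removes support_passages[closest] before the next turn,
# so later distance scores depend on all preceding retrieval actions.
```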
Algorithm 1 Reinforcement Learning for Questioner training
Dataset: (Question Q, Support passages SP, Ground Truth G) tuples
Inner Monologue: an empty list IM to store inner monologues
Questioner: LoRA weights of a pre-trained large language model
Retriever: a pre-defined search system
Z: pre-defined number of training epochs
1: for epoch = 1 to Z do
2:   Define the Questioner as the active model M
3:   Sample (Q, SP, G) from the dataset
4:   while Questioner ← {Eq. 6} do
5:     q ← M(Q, IM)
6:     ps ← Retriever(q, D)
7:     pr ← Refiner(q, ps)
8:     IM = IM + q + pr
9:     p_closest ← {Eq. 1}
10:    d ← {Eq. 2}
11:    Remove p_closest from SP
12:  end while
13:  A_f = M(Q, IM)
14:  R ← {Eq. 4}
15:  PPO updates M using reward R
16: end for

3.2.5 Questioner Training. The overall training procedure is shown in Algorithm 1. For a given question Q, we use the Questioner to generate the queries. The training starts with initializing the Questioner LoRA as the active model M, an empty list IM to store the inner monologues, and a data sample of the (question, golden support passages list, ground truth answer) tuple as (Q, SP, G) from the dataset. The multi-round IM process starts when the Progress Tracker receives the question, as described in Line 4.


The Questioner first generates a search query q, and then the Retriever returns a long list of passages p_s based on similarity search within the given document corpus D. Based on the retrieved information and the initial question, the Refiner selects the most relevant top-k passages as p_r. The IM storage is then updated with the search query and p_r. Following the working flow of the Progress Tracker described above, Lines 9 to 11 conclude one round of IM by calculating the distance score d and updating the SP list. This multi-round process continues until the Progress Tracker determines that SP is empty. After all necessary information has been gathered, to complete the IM process, the Questioner also provides the final prediction A_f. In the open-format QA task, we consider both A_f and the ground-truth answer G as sequences of tokens. Thus, as shown in Equation 3, the precision and recall of the predicted answer can be used to calculate the F1 score.

$r = F_1(A_f, G)$  (3)

From i rounds of Inner Monologue, the Progress Tracker collects i distance scores. As part of the final reward, 1 − d_i is used to reflect the quality of the i-th round of retrieval in a continuous space. We introduce a discount factor γ < 1 to emphasize the importance of the preceding searches. Inheriting from IMMO, the reward also includes the KL divergence, with a predefined weight α, between the updated Questioner M and its starting point M_0 [14, 61]. The final reward is a non-discrete number that depends on both the IM quality (distance score) and the answer quality (correctness score). The Questioner LoRA is updated by the PPO algorithm driven by the reward function shown in Equation 4.

$\mathcal{R} = \Big(\sum_{i=1}^{n} \gamma^i (1 - d_i)\Big) + r - \alpha\,KL(\mathcal{M}, \mathcal{M}_0)$  (4)
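As a concrete reading of Equations 3 and 4, the sketch below computes a token-level F1 in the style of the HotPotQA script and combines it with the discounted mid-step distance rewards and the KL penalty; the γ and α values are placeholders, and the full answer normalization of the official script is omitted.

```python
import collections

def answer_f1(prediction: str, ground_truth: str) -> float:
    """Eq. (3): token-level F1 between predicted and gold answers (the official HotPotQA
    normalization, e.g. article and punctuation stripping, is omitted here)."""
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    common = collections.Counter(pred) & collections.Counter(gold)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision, recall = num_same / len(pred), num_same / len(gold)
    return 2 * precision * recall / (precision + recall)

def final_reward(distance_scores: list[float], r: float, kl: float,
                 gamma: float = 0.9, alpha: float = 0.1) -> float:
    """Eq. (4): discounted mid-step retrieval rewards plus answer correctness minus the KL penalty.
    gamma and alpha are illustrative values; the paper does not state the exact settings here."""
    retrieval_term = sum(gamma ** i * (1.0 - d) for i, d in enumerate(distance_scores, start=1))
    return retrieval_term + r - alpha * kl
```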
3.2.6 Answerer Training. After the Questioner has been trained, it has learned to perform a reasonable IM and thus to obtain valid supporting evidence from the IR module. As discussed, the goal of asking meaningful questions differs from final question answering. Thus, we define an Answerer that specializes in the QA capability and is exclusively responsible for providing the final answer.
In most datasets or tasks, the final answers are provided, and the multi-round retrieval (IM) information can be acquired by the well-trained Questioner. Therefore, we have sufficient data to support supervised learning. Following the instruction fine-tuning technique [6, 43], the training data can be prepared as a combination of the Initial Question, Inner Monologue, and Final Answer. The training objective for the Answerer LoRA is to perform next-token prediction over this corpus.
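A minimal sketch of how such a fine-tuning example might be assembled is shown below; the prompt wording and field names are illustrative placeholders (the paper adapts templates from prior work [43, 52]).

```python
def build_answerer_example(question: str,
                           inner_monologue: list[tuple[str, list[str]]],
                           final_answer: str) -> dict:
    """Combine Initial Question + learned IM (per-turn query and refined top-k passages)
    + Final Answer into one instruction-tuning record for the Answerer LoRA."""
    lines = []
    for turn, (query, passages) in enumerate(inner_monologue, start=1):
        lines.append(f"[Turn {turn}] Query: {query}")
        lines.append(f"[Turn {turn}] Refined passages: " + " ".join(passages))
    prompt = (
        "Use the search records below to answer the question.\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}\nAnswer:"
    )
    # Next-token prediction is applied over prompt + completion during SFT.
    return {"prompt": prompt, "completion": " " + final_answer}
```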
4 EXPERIMENT
In this section, we introduce the task and data in the experiment, the implementation and training details of our IM-RAG approach, the baseline approaches we compared with, and the experiment results verified with statistical significance.

4.1 Task and Data
IM-RAG targets the multi-hop retrieval question-answering task. In this kind of task, the knowledge needed to solve the problem usually exists in multiple passages from a given document corpus. For the experiment, we test IM-RAG on HotPotQA, which is a widely used open-domain multi-hop QA dataset.
HotPotQA involves providing a system with a set of related documents and a question that requires reasoning across these documents to arrive at an answer. The input consists of the question and the list of supporting documents, while the output is the answer to the question, which can range in form from text spans from the documents, to a yes/no response, to a sentence. Additionally, HotPotQA provides a document corpus that includes all introductory paragraphs from English Wikipedia 2017. The task is to identify the supporting facts within the document corpus that lead to the answer. We follow the original data split to conduct the experiment and report the result on the dev set, following the community convention on this dataset. The evaluation is done by the official script from HotPotQA, which computes EM (Exact Match) and F1 score between the predicted answer and the ground-truth answer label. Besides, since the related supporting documents are provided as a list, the retrieval result can also be evaluated by EM and F1. This setup encourages the development of models that are not only adept at extracting answers but also capable of understanding the context and performing multi-hop reasoning. As our system is designed for final task completion, we focus more on the evaluation of the final answer.

4.2 Implementation Details
Below, we provide the implementation details of IM-RAG, which follow the approach design illustrated in Section 3.2.

4.2.1 Reasoner. Following the design from Section 3.2.1, we utilize a large pretrained language model as the Reasoner in IM-RAG. Specifically, we use the 7B version of Vicuna-1.5 [4] as the base LLM, which is an open-source LLM fine-tuned from LLaMA-2 [44] with supervised instruction fine-tuning on 125K high-quality user-shared conversations collected from ShareGPT [38]. Building upon the base LLM, we add and finetune two LoRA adapters as the Questioner and the Answerer, respectively. As discussed in Section 3.2.1, this design allows the capabilities of the Questioner and the Answerer to be separately learned while fully reusing the same base LLM.
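A rough sketch of this two-adapter setup with the HuggingFace PEFT library is shown below; the LoRA hyperparameters and target modules are illustrative assumptions, not the paper's reported configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")  # placeholder hyperparameters

# The first adapter ("questioner") wraps the frozen base model; the second shares the same base.
model = get_peft_model(base, lora_cfg, adapter_name="questioner")
model.add_adapter("answerer", lora_cfg)

model.set_adapter("questioner")   # active during the RL stage
# ... PPO updates only touch the questioner LoRA weights ...
model.set_adapter("answerer")     # active during the SFT stage
```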
4.2.2 Retriever. Following the Dense Passage Retrieval (DPR) approach [16], we index 5.2 million supporting documents using Sentence-Transformer [34] embeddings fine-tuned for semantic search on a question-to-document matching task. We use the FAISS library [15] to facilitate rapid similarity search, averaging 0.061 seconds per query in a GPU environment. Due to the flexibility of our approach, the Retriever can be replaced with stronger search engines or fine-tuned to further boost IR performance; based on the experiments on the HotPotQA dataset, our current Retriever setting already meets the accuracy, speed, and scalability requirements of our approach.
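The sketch below shows the general shape of such a dense index built with sentence-transformers and FAISS; the encoder checkpoint, corpus loading, and top-k value are placeholders rather than the paper's exact setup.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # placeholder semantic-search encoder

def build_index(passages: list[str]) -> faiss.Index:
    emb = encoder.encode(passages, normalize_embeddings=True, show_progress_bar=True)
    index = faiss.IndexFlatIP(emb.shape[1])   # inner product on unit vectors = cosine similarity
    index.add(np.asarray(emb, dtype="float32"))
    return index

def retrieve(index: faiss.Index, passages: list[str], query: str, k: int = 20) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [passages[i] for i in ids[0]]
```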


Method | Multi-rounds | RAG¹ | Training | Passage EM | EM | F1
GPT-3.5 | No | LLM-centric | Prompt | N/A | 31.0 | 37.1
REACT [55] | Yes | LLM-centric | Prompt | - | 35.1 | -
TPRR [57] | Yes | Jointly Train | SFT | 86.2 | 67.3 | 80.1
AISO [60] | Yes | Jointly Train | RL | 88.2 | 68.1 | 80.9
COS [25] | Yes | Jointly Train | SFT | 88.9 | 68.2 | 81.0
RAG (no IM) | No | LLM-centric | SFT | 36.2 | 31.2 | 41.2
IM-RAG | Yes | LLM-centric | RL+SFT | 83.4 | 68.4 | 82.5

Table 1: Results on HotPotQA. The results were categorized into three groups based on training data and the type of RAG paradigm.
¹ The RAG categorization follows our definition in Section 2.

4.2.3 Refiner. Given the experimental design, where the output from the Retriever is a list of Wikipedia introductory paragraphs retrieved by FAISS from HotPotQA, the primary goal of the Refiner is to rerank this list, prioritizing the supporting facts. Given the effectiveness and rapid deployability of LLM rerankers, as demonstrated in previous works [32, 42], we employ the checkpoint of RankVicuna [31], an LLM pretrained for listwise document reranking. The reasons for selecting RankVicuna are as follows: (1) As a pre-trained LLM, RankVicuna allows us to effortlessly harness its language comprehension and zero-shot capabilities for ranking tasks across various documents, eliminating the need for additional fine-tuning. (2) Ke et al. [17] highlighted a significant gap between retrievers and LLMs, which often impedes their communication, and proposed adding a seq2seq model to enhance the output of retrievers. We found that RankVicuna, as a variant of the fine-tuned Vicuna LLMs, matches the size and base capabilities of the Reasoner (also a Vicuna LLM), effectively bridging the gap and facilitating the overall IM process.
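The sketch below shows the generic shape of such a listwise reranking step (building a ranking prompt and parsing the returned permutation); it is not RankVicuna's actual interface, and the prompt wording is purely illustrative.

```python
def listwise_rerank_prompt(query: str, passages: list[str]) -> str:
    """Number the candidate passages and ask the reranker LLM for a relevance ordering."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Query: {query}\n{numbered}\n"
        "Rank the passages above from most to least relevant to the query. "
        "Reply with the passage numbers only, e.g. 3 > 1 > 2."
    )

def apply_ranking(llm_output: str, passages: list[str], k: int = 5) -> list[str]:
    """Parse the LLM's ordering and keep the top-k passages for the Reasoner's context."""
    order = [int(tok) - 1 for tok in llm_output.replace(">", " ").split() if tok.isdigit()]
    valid = [i for i in order if 0 <= i < len(passages)]
    return [passages[i] for i in valid[:k]]
```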
4.2.4 Progress Tracker. As discussed in Section 3.2, the design of the Progress Tracker can be flexible across different tasks. In HotPotQA, as the ground-truth supporting documents are provided, we implemented the Progress Tracker in a heuristic way. Specifically, given the list of ground-truth documents SP and a retrieved document p_i, we compute the cosine similarity between p_i and each element in SP in the Sentence-Transformer embedding space. The distance to the closest one is recorded as the distance score d_i for training, as described in Section 3.2. Moreover, this document is then considered retrieved, so it is removed from SP and not involved in the next-turn comparison. This design provides dependencies across IM turns and encourages the Reasoner to search for new documents. In addition to the SP status mentioned in Questioner training (Section 3.2.5), the switch between the Questioner and the Answerer is also controlled by an empirically selected threshold φ_r for the accumulated distance reward score D over multiple turns, as well as a preset maximum number of turns N_max (see Equations 5 and 6). If D is below the threshold φ_r, the Reasoner continues in the role of the Questioner and crafts a new query for retrieval. Conversely, once enough information has been collected or N_max has been reached, the Reasoner switches to the Answerer to provide a final answer to the question. In the experiment, we set φ_r to 0.3 and N_max to 3.

$D = \sum_{i=1}^{n} \gamma^i (1 - d_i)$  (5)

$Reasoner = \begin{cases} Questioner, & \text{if } D \le \phi_r \text{ and } i < N_{max} \\ Answerer, & \text{if } D > \phi_r \text{ or } i = N_{max} \end{cases}$  (6)
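Read procedurally, Equations 5 and 6 amount to the small decision rule sketched below; φ_r = 0.3 and N_max = 3 follow the reported settings, while γ is a placeholder value.

```python
def accumulated_reward(distance_scores: list[float], gamma: float = 0.9) -> float:
    """Eq. (5): discounted accumulated distance reward D over the turns so far."""
    return sum(gamma ** i * (1.0 - d) for i, d in enumerate(distance_scores, start=1))

def select_role(distance_scores: list[float], turn: int,
                phi_r: float = 0.3, n_max: int = 3) -> str:
    """Eq. (6): keep questioning until enough evidence is accumulated or the turn budget runs out."""
    D = accumulated_reward(distance_scores)
    if D <= phi_r and turn < n_max:
        return "questioner"
    return "answerer"
```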
4.3 Training Details
Following previous works [9, 43, 52], the RL training of the Questioner is supported by the Transformer Reinforcement Learning (TRL) library [46], and the SFT of the Answerer is supported by the HuggingFace instruction fine-tuning pipeline [51]. All the hyperparameters follow the default settings from StackLLaMA [2] and Alpaca [43]. With Parameter-Efficient Fine-Tuning (PEFT) [26] support, in a 4× NVIDIA A100 GPU environment, the Questioner (RL) and the Answerer (SFT) are trained for 6 and 10 epochs, respectively. The instruction prompt is modified from the template provided by previous works [43, 52].
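A minimal sketch of the Questioner's PPO update with the TRL interface (as popularized by StackLLaMA) is given below; the rollout collection that produces prompts, generated sub-queries, and episode rewards is omitted, and the model name and batch sizes are placeholders.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "lmsys/vicuna-7b-v1.5"                                          # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)       # Questioner (LoRA-wrapped in practice)
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)   # frozen reference for the KL term

ppo_trainer = PPOTrainer(PPOConfig(batch_size=4, mini_batch_size=1), policy, ref_policy, tokenizer)

def ppo_update(prompts: list[str], sub_queries: list[str], rewards: list[float]):
    """One PPO step: prompts are IM contexts, sub_queries are the generated queries,
    and rewards are the episode-level scalars from Eq. (4)."""
    query_tensors = [tokenizer(p, return_tensors="pt").input_ids[0] for p in prompts]
    response_tensors = [tokenizer(q, return_tensors="pt").input_ids[0] for q in sub_queries]
    return ppo_trainer.step(query_tensors, response_tensors, [torch.tensor(r) for r in rewards])
```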


4.4 Baselines
We compared IM-RAG with three groups of baseline approaches. The first group relies on the power of the LLM and can be used in a plug-and-play manner with other available similar models or APIs. GPT-3.5 delivers QA results without connecting to an external knowledge base; we provide 4-shot in-context examples as instruction for the LLM. REACT [55], as one of the early RAG works, chains LLMs with search engines via prompting and in-context examples. It is a simple yet effective approach with good zero-shot performance.
We also include several well-performing, representative works on the HotPotQA dataset. It is important to note that our focus is on enhancing the LLM-centric system rather than developing a comprehensive QA system; the inclusion of these works primarily serves as a performance reference. AISO [60] models the QA task as a Markov decision process (MDP) trained with Reinforcement Learning, whose action space includes the selection of different retrieval algorithms and the answer generation. This sophisticated system achieves promising results; however, it is expensive to adapt this training-from-scratch system to a new domain. Instead of a complex MDP, IM-RAG uses the LLM as the policy network, so it can be easily optimized for a new domain by policy-based learning methods [30, 41]. Another noteworthy work is Chain-of-Skills (CoS) [25], which employs manually designed domain-specific retrieval skills (such as entity linking and expanded-query retrieval) for QA tasks. These carefully designed skills significantly improve the performance of language models; however, domain knowledge may be required to design new skills when adapting to a new domain. Specifically, CoS learns how to use skills through a multi-task pre-training phase, which needs to be redone when the domain or skill set changes. AISO has a similar challenge. In addition, both AISO and CoS are inherently tied to predefined IR systems, which means that plugging in other custom search modules or knowledge bases is not straightforward. In general, both approaches rely heavily on domain expertise for system design and require retraining when the design changes.
The last baseline, RAG (no IM), shares a similar structure and model selection with IM-RAG; the only difference is that it does not support multi-round retrieval due to the absence of the IM process. This baseline uses the initial question as the retrieval query to obtain the documents that are needed for supervised training of the Answerer.

4.5 Results
The results are reported in Table 1. Compared to the prompting-based approaches, IM-RAG gains significant improvements while retaining flexibility. Previous work pointed out that ChatGPT falls short in ensuring factuality in complex QA problems [58]. In our comparison, GPT-3.5 lagged behind RAG (no IM) by 0.2% and 4.1% on EM and F1 scores, respectively. REACT, powered by PaLM-540B [5], shows strong zero-shot capability; however, due to the limited task-specific optimization, it does not have an advantage in performance over the approaches with training.
Compared to the second group of works, which are usually tied to predefined IR systems, IM-RAG has better flexibility in IR module selection. In our comparison, IM-RAG outperformed the previous best-performing model by a 1.9% relative gain on F1 score. On the other hand, IM-RAG lagged behind others in the second group on retrieval metrics like Passage EM because our focus was not on fine-tuning the IR module. However, the LLM's rich pre-training knowledge tolerates imperfect retrieval information and compensates for it in the final QA result.
For the last baseline, with the same model selection and system design, IM-RAG outperforms the RAG (no IM) baseline by a huge margin (82.5% vs. 41.2%) in terms of F1 score. We claim that multi-round retrieval is the key to the success of the IM-RAG framework.

Model Comparison | p-Value | Significance
IM-RAG vs. no-IM | < 0.001 | Yes
IM-RAG vs. GPT-3.5 | < 0.001 | Yes
IM-RAG vs. no-SFT | 0.008 | Yes
IM-RAG vs. no-Refiner | < 0.001 | Yes

Table 2: McNemar test results comparing IM-RAG with other LLM-based methods. All tests show that the IM-RAG result is statistically significant.

Questioner (RL) | Answerer (SFT) | Refiner | EM | F1
✗ | ✓ | ✓ | Error | Error
✓ | ✗ | ✓ | 63.9 | 77.9
✓ | ✓ | ✗ | 35.5 | 48.3
✓ | ✓ | ✓ | 68.4 | 82.5

Table 3: Ablation study on each component of IM-RAG. Error indicates that the system fails to work under the given setting.

Significance Test. In this study, we employed McNemar's test [27] using Statsmodels [37] to statistically evaluate the performance improvements of our IM-RAG model compared to two baseline approaches mentioned in Section 4.4 (no-IM and GPT-3.5) and two results from the ablation study (no-SFT and no-Refiner) on HotPotQA². The test is conducted on the predictions following the EM (0, 1) measurement. This non-parametric test is particularly suited for binary labels on paired nominal data. As reported in Table 2, the test results indicate that the IM-RAG model demonstrates a statistically significant improvement in performance over all the above-mentioned approaches.
² Limited by available resources, we were unable to obtain prediction files of other baselines. Therefore, we performed significance tests only for the above methods.
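For reference, the paired test can be reproduced along the lines of the sketch below, which builds the 2×2 contingency table from per-question EM (0/1) outcomes and calls Statsmodels' implementation of McNemar's test; variable names are illustrative.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(em_system_a: list[int], em_system_b: list[int]) -> float:
    """Paired McNemar test on per-question EM (0/1) outcomes of two systems on the same questions."""
    a, b = np.asarray(em_system_a), np.asarray(em_system_b)
    table = [[int(np.sum((a == 1) & (b == 1))), int(np.sum((a == 1) & (b == 0)))],
             [int(np.sum((a == 0) & (b == 1))), int(np.sum((a == 0) & (b == 0)))]]
    return mcnemar(table, exact=False, correction=True).pvalue
```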
5 ABLATION STUDY AND ANALYSIS
In this section, we conduct an ablation study to investigate and analyze how different training strategies and components impact the performance of IM-RAG, and we outline the limitations of IM-RAG.

5.1 The Impact of Training Strategy
The complete training process of IM-RAG includes reinforcement learning as well as supervised learning. Thus, we report two ablation experiments in this section to reveal their respective impacts. As shown in Table 3, we first remove the RL training for the Questioner. The plan is to have the LLM engage in multi-round retrieval via prompting and in-context examples. This approach can be regarded as "prompting the Inner Monologue". After collecting the queries and the retrieved documents, we train the Answerer LoRA in the same way as described in Section 3.2.6. However, in our experiments, we were unable to control the LLM (Vicuna-7B) to output in the desired format. Under the zero-shot scenario, for a large number of data points, the LLM generates irrelevant content or does not provide the query. Potential solutions would be to use a more powerful language model (e.g., GPT-4 or LLaMA2-70B) or a more sophisticated prompt design. However, the former requires huge computational resources, whereas the latter requires more effort from humans.
Another set of experiments focused on the effects of supervised fine-tuning. As shown in Algorithm 1, since the Questioner training originally includes providing the final prediction, we can simply remove the Answerer LoRA and record the Questioner's response after completing the retrieval as the prediction. Under the same experimental configuration, the Questioner LoRA obtained a 77.9% F1 score, a 4.6% decrease from 82.5% (the full version of IM-RAG). As explained in Section 3.2, asking for supporting facts and answering based on retrieved information require two different abilities.

Assigning the tasks to two models (or two LoRAs in our design) simplifies the challenge, resulting in improved performance.

5.2 Necessity of the Refiner
As discussed in Section 3.2, the purpose of the Refiner is to improve the output of the Retriever, which effectively bridges the gap between the Reasoner and the Retriever and fosters the IM process. To better understand the necessity of the Refiner, we conduct an ablation study to explore how the Refiner impacts the performance of IM-RAG. In the experiment design on HotPotQA, the Refiner plays the role of a re-ranker that highlights the most relevant passages. As a comparison, we run another experiment where we simply use the top-5 passages provided by the Retriever at each turn without involving the Refiner for further refinement.
As shown in Table 3, with all other settings consistent, removing the Refiner leads to a 14.2% performance drop (68.3% vs. 82.5%) in terms of the F1 score. This result can be attributed to the gap between the IR module and the LLM [17]. As introduced in Section 3.2, in the process of learning IM, the Reasoner actively proposes queries at each turn to acquire more relevant documents from the Retriever. However, there exists a gap between the Reasoner and the Retriever, specifically in the format, length, and importance of the retrieved documents compared to the expected context for the Reasoner. Such a gap may not only give the Reasoner a "hard time" in figuring out the most relevant information from the retrieved documents, but also hinder the Progress Tracker from giving a positive reward that guides the IM learning via RL. In cases where a large training corpus exists, the Reasoner might be able to learn how to fill the gap through intensive training, but this is more costly and less efficient. Therefore, we conclude that the Refiner is a necessary component to bridge the gap and facilitate IM learning.

6 DISCUSSION
This section discusses situations in which IM-RAG applies as well as those in which it does not.

Task. IM-RAG benefits from the rich language ability of the pretrained LLM and excels in capturing dynamic information and then performing context-aware multi-round retrieval. Thus, it specializes in multi-hop retrieval and generation tasks. However, the performance of IM-RAG in single-step accurate retrieval and in complex real-world environments is unclear.

IR Dependency. The modular design makes IM-RAG easy to apply to customized tasks. Depending on the retrieval scenario or domain, the IR module in Figure 2 can be replaced by other well-designed search engines or dense retrievers.

Data Requirement. For migration to a new task, the most challenging aspect is the preparation and acquisition of the data required by the Progress Tracker. During training, the retrieval-quality signals provided by the Progress Tracker directly guide the optimization of the strategy. In our experiments, the Progress Tracker used the ground-truth retrieval results provided by the training set. However, in cases where more resources are available (e.g., search logs from real users), the Progress Tracker can provide better guidance for the training of the Reasoner. In contrast, when the available resources cannot support the Progress Tracker in providing retrieval scores, IM-RAG will be stuck in the massive language (action) space and thus unable to optimize, because it can hardly reach a positive reward.

Inference Efficiency. Similar to other LLM-based RAG work [13, 42], IM-RAG generally has higher inference latency than traditional IR systems [7, 56]. As a result, it is difficult for IM-RAG to meet the speed requirement in contexts where a fast response is necessary; conversely, the LLM brings decent reasoning ability as well as generative results.

7 LIMITATION AND FUTURE WORKS
This work demonstrates promising results in utilizing Inner Monologue to solve traditional information retrieval tasks; however, the potential of the IM-RAG framework has not been fully explored. As discussed above, an important advantage of this framework is the reinforcement of the model's reasoning ability through outcome supervision. Compared to employing supervised learning to teach models to do Chain-of-Thought reasoning, this approach helps models find superior solutions, i.e., the reasoning path that is better suited to their own system capabilities. However, due to RL's optimization difficulties on language models, this work uses final-result supervision along with another strong reward signal, i.e., the human-labeled golden document is treated as the target answer for each round of retrieval. This signal serves as a fine-grained guide during training, yet it sets an upper limit on IM retrieval. We expect that this problem can be solved in the future by better Progress Tracker designs, such as pretraining a neural network to provide retrieval signals directly, without the supervision of golden documents from humans. Following the idea of RLHF [30], using a large number of human annotations to train a reward model to act as the Progress Tracker is a promising approach. However, this design may only be available to institutions with the resources to do so.

8 CONCLUSION
We present IM-RAG, a novel approach inspired by inner monologues, which connects LLMs and IR to accomplish complex reasoning tasks through context-aware multi-round interactive retrieval. During multi-round conversations, the LLM serves as the core reasoning model, either crafting new queries for the retriever based on the conversational context or generating a final response when enough information has been collected. The retrieved documents are modified (reformatted, re-ranked, filtered, etc.) by the Refiner to better match the needs of the LLM. The whole process can be optimized end-to-end via RL, using the feedback from the Progress Tracker and final-answer correctness as reward signals. The results on HotPotQA show that IM-RAG achieves SOTA performance in multi-step reasoning. This enables the RAG system to perform human-like multi-round reasoning and retrieval with high flexibility and interpretability.
While this is a first step towards learning how to conduct inner monologue between LLMs and retrievers, as with all preliminary research, it comes with certain limitations. The dataset we used may not reflect the subtle and sometimes non-linear nature of human inner monologue, potentially limiting the model's ability to learn and handle highly complex, abstract, or creative reasoning tasks.


REFERENCES
[1] Qingyao Ai, Ting Bai, Zhao Cao, Yi Chang, Jiawei Chen, Zhumin Chen, Zhiyong Cheng, Shoubin Dong, Zhicheng Dou, Fuli Feng, et al. 2023. Information Retrieval Meets Large Language Models: A Strategic Report from Chinese IR Community. AI Open 4 (2023), 80–90.
[2] Edward Beeching, Younes Belkada, Kashif Rasul, Lewis Tunstall, Leandro von Werra, Nazneen Rajani, and Nathan Lambert. 2023. StackLLaMA: An RL Fine-tuned LLaMA Model for Stack Exchange Question and Answering. https://doi.org/10.57967/hf/0513
[3] K Cherney. 2023. Everything to Know About Your Internal Monologue.
[4] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
[5] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, et al. 2022. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311 (2022).
[6] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).
[7] Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, and Tat-Seng Chua. 2005. Question answering passage retrieval using dependency relations. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Salvador, Brazil) (SIGIR '05). Association for Computing Machinery, New York, NY, USA, 400–407. https://doi.org/10.1145/1076034.1076103
[8] Charles Fernyhough and Anna Borghi. 2023. Inner speech as language process and cognitive tool. Trends in Cognitive Sciences 27 (09 2023). https://doi.org/10.1016/j.tics.2023.08.014
[9] Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. 2022. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate.
[10] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International Conference on Machine Learning. PMLR, 3929–3938.
[11] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
[12] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. 2022. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608 (2022).
[13] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022).
[14] Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato, Richard E Turner, and Douglas Eck. 2017. Sequence tutor: Conservative fine-tuning of sequence generation models with KL-control. In International Conference on Machine Learning. PMLR, 1645–1654.
[15] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547.
[16] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020).
[17] Zixuan Ke, Weize Kong, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. 2024. Bridging the Preference Gap between Retrievers and LLMs. arXiv preprint arXiv:2401.06954 (2024).
[18] Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2021. Internet-augmented dialogue generation. arXiv preprint arXiv:2107.07566 (2021).
[24] Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. 2023. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842 (2023).
[25] Kaixin Ma, Hao Cheng, Yu Zhang, Xiaodong Liu, Eric Nyberg, and Jianfeng Gao. 2023. Chain-of-Skills: A Configurable Model for Open-Domain Question Answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 1599–1618. https://doi.org/10.18653/v1/2023.acl-long.89
[26] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. 2022. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. https://github.com/huggingface/peft.
[27] Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 2 (June 1947), 153–157. https://doi.org/10.1007/bf02295996
[28] Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. 2023. Augmented language models: a survey. arXiv preprint arXiv:2302.07842 (2023).
[29] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 (2021).
[30] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[31] Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023. RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models. arXiv preprint arXiv:2309.15088 (2023).
[32] Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023. RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! arXiv:2312.02724 (2023).
[33] Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083 (2023).
[34] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084
[35] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761 (2023).
[36] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
[37] Skipper Seabold and Josef Perktold. 2010. statsmodels: Econometric and statistical modeling with Python. In 9th Python in Science Conference.
[38] ShareGPT. 2023. https://sharegpt.com/.
[39] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI tasks with ChatGPT and its friends in HuggingFace. arXiv preprint arXiv:2303.17580 (2023).
[40] Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567 (2021).
[41] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing
[19] Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grig- Systems 33 (2020), 3008–3021.
orev. 2022. Internet-augmented language models through few-shot prompting [42] Weiwei Sun, Lingyong Yan, Xinyu Ma, Pengjie Ren, Dawei Yin, and Zhaochun
for open-domain question answering. arXiv preprint arXiv:2203.05115 (2022). Ren. 2023. Is ChatGPT Good at Search? Investigating Large Language Models as
[20] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Re-Ranking Agent. arXiv preprint arXiv:2304.09542 (2023).
Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, [43] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos
et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An
Advances in Neural Information Processing Systems 33 (2020), 9459–9474. Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_
[21] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, alpaca.
Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s [44] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne
Verify Step by Step. arXiv preprint arXiv:2305.20050 (2023). Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal
[22] Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv
Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al. 2023. Ra-dit: preprint arXiv:2302.13971 (2023).
Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352 [45] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel,
(2023). Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving
[23] Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan math word problems with process-and outcome-based feedback. arXiv preprint
Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. 2023. Llava-plus: Learning to use arXiv:2211.14275 (2022).
tools for creating multimodal agents. arXiv preprint arXiv:2311.05437 (2023). [46] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan
Thrush, and Nathan Lambert. 2020. TRL: Transformer Reinforcement Learning.

739
IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues SIGIR ’24, July 14–18, 2024, Washington, DC, USA
