Here’s a section-by-section summary of the document:
Abstract :
This section introduces ReAct as an approach that allows LLMs to generate reasoning traces and task-specific actions in an interleaved manner. This synergy enables reasoning to help the model induce, track, and update action plans and handle exceptions, while actions allow the model to interface with external sources like knowledge bases or environments. ReAct has shown effectiveness on various language and decision-making tasks like HotpotQA, FEVER, ALFWorld, and WebShop, outperforming state-of-the-art baselines while improving human interpretability and trustworthiness by overcoming issues like hallucination and error propagation.
1 Introduction :
- Human Intelligence Analogy: The paper begins by drawing a parallel between ReAct and human intelligence, which seamlessly combines task-oriented actions with verbal reasoning (inner speech) for self-regulation, strategization, and maintaining working memory. Examples like cooking illustrate how reasoning helps track progress, handle exceptions, and identify when external information is needed, while actions support reasoning.
- Limitations of Prior LLM Approaches: Prior work on LLMs studied reasoning (e.g., Chain-of-Thought/CoT prompting) and acting (e.g., action plan generation) as separate topics. CoT reasoning is described as a static black box that is not grounded in the external world, leading to issues like fact hallucination and error propagation. Conversely, acting-focused approaches don’t extensively use LLMs to reason abstractly or maintain a working memory to support actions.
- Introducing ReAct: ReAct is presented as a general paradigm that prompts LLMs to generate both verbal reasoning traces and actions in an interleaved manner. This allows for dynamic reasoning to create, maintain, and adjust high-level plans (reason to act), and to interact with external environments (e.g., Wikipedia) to incorporate additional information into reasoning (act to reason).
- Empirical Evaluation & Benefits: ReAct has been empirically evaluated on HotpotQA, FEVER, ALFWorld, and WebShop. It outperforms vanilla action-generation models and is competitive with CoT on QA and fact verification, especially when combined with CoT. On interactive decision-making tasks (ALFWorld, WebShop), one- or two-shot ReAct prompting outperforms imitation and reinforcement learning methods by 34% and 10% absolute success rate, respectively. Beyond performance, ReAct enhances model interpretability, trustworthiness, and diagnosability, as humans can inspect reasoning traces and distinguish internal knowledge from external information.
HotpotQA:
- Task type: Multi-hop Reasoning
- Description: HotpotQA is a question-answering dataset that requires models to answer questions involving multiple reasoning steps. Unlike traditional single-passage QA systems, the answers in HotpotQA usually span multiple documents or information sources, so the model must reason over several steps to reach the correct answer.
- Challenge: Integrating information from different sources and linking it to answer complex questions.
FEVER:
- Task type: Fact Verification
- Description: FEVER is a fact-verification task in which the model must judge whether a given statement is true, false, or cannot be concluded from the provided evidence (Not Enough Information). The task is grounded in a large text corpus (e.g., Wikipedia) against which facts are verified.
- Challenge: The model must process large amounts of evidence and accurately assess veracity, which requires efficient information retrieval and reasoning.
ALFWorld:
- Task type: Interactive Task Reasoning
- Description: ALFWorld is an agent-oriented task framework that simulates an agent interacting with its environment. The dataset contains tasks in simulated environments that the agent completes by executing a sequence of operations; the scenarios typically involve both reasoning and acting (e.g., navigation, using objects).
- Challenge: The complexity of reasoning, decision-making, and planning, combined with the need for the agent to execute a sequence of actions to complete the task.
WebShop:
- Task type: Goal-Oriented Task
- Description: WebShop is a simulated shopping task in which the model carries out tasks on a simulated e-commerce website, typically involving searching for, comparing, and selecting products. The goal is to help the user find a suitable product.
- Challenge: The model must understand the user's needs, make decisions within the simulated environment, and achieve the goal through sound reasoning and actions.
- Key Contributions: The paper summarizes its contributions as introducing ReAct, showcasing its advantages in few-shot learning, analyzing the importance of acting in reasoning tasks and of reasoning in interactive tasks, and exploring finetuning potential.
2 ReAct: Synergizing Reasoning + Acting :
- Augmented Action Space: The core idea of ReAct is to augment the agent’s action space A with a “language space” L, i.e. Â = A ∪ L, where an action â ∈ L is a thought or reasoning trace. These thoughts do not directly affect the external environment but compose useful information by reasoning over the current context and updating it to support future reasoning or acting.
- Types of Thoughts: Useful thoughts can include: decomposing task goals and creating action plans, injecting commonsense knowledge, extracting important parts from observations, tracking progress, transitioning between action plans, and handling exceptions.
- Prompting Setup: ReAct uses a frozen large language model (e.g., PaLM-540B) prompted with few-shot in-context examples that are human trajectories of interleaved actions, thoughts, and environment observations. For reasoning tasks, thoughts and actions alternate densely, while for decision-making tasks, thoughts appear sparsely and asynchronously, as decided by the LLM (a minimal sketch of this loop follows this list).
- Unique Features: ReAct is highlighted for being intuitive and easy to design (human annotators simply type thoughts), general and flexible across diverse tasks and reasoning needs, performant and robust with strong generalization from few examples, and human-aligned and controllable, offering interpretable processes and allowing human editing of thoughts.
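To make the interleaved thought-action-observation loop concrete, here is a minimal sketch of one ReAct episode. It is not the paper’s released code: `call_llm`, `env`, and `FEW_SHOT_EXEMPLARS` are placeholder names for a frozen-LLM completion call, a task environment, and the human-written trajectories used as the few-shot prompt.

```python
from typing import Optional

# Placeholder: human-written trajectories of interleaved
# Thought / Action / Observation steps used as the few-shot prompt.
FEW_SHOT_EXEMPLARS = "..."

def react_episode(question: str, env, call_llm, max_steps: int = 8) -> Optional[str]:
    """Run one ReAct episode: alternate thoughts/actions until finish[] or budget."""
    prompt = FEW_SHOT_EXEMPLARS + f"\nQuestion: {question}\n"
    for step in range(1, max_steps + 1):
        # Ask the frozen LLM for the next thought and action, stopping before it
        # would hallucinate the observation on its own.
        completion = call_llm(prompt + f"Thought {step}:", stop=[f"\nObservation {step}:"])
        prompt += f"Thought {step}:{completion}\n"

        # The completion is expected to end with an action line such as
        # "Action 1: search[Colorado orogeny]" or "Action 1: finish[1,800 to 7,000 ft]".
        action = completion.split(f"Action {step}:")[-1].strip()
        if action.startswith("finish["):
            return action[len("finish["):-1]  # the model's final answer

        # Execute the action in the external environment (act to reason) and feed
        # the observation back into the context for the next thought (reason to act).
        observation = env.step(action)
        prompt += f"Observation {step}: {observation}\n"
    return None  # no answer within the step budget
```

For decision-making tasks the same loop applies, except the prompt lets the model emit either a domain action or a thought at each step, which is how the sparse, asynchronous thoughts described above arise.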
3 Knowledge-Intensive Reasoning Tasks :
- Setup: ReAct was evaluated on HotpotQA (multi-hop question answering) and FEVER (fact verification). In a “question-only” setup, models had to rely on internal knowledge or an external Wikipedia API. The API provided search[entity], lookup[string], and finish[answer] actions, designed to simulate human interaction with Wikipedia (a toy dispatcher for these actions is sketched after this list).
- Methods: ReAct prompting used manually composed few-shot exemplars with dense thought-action-observation steps for various reasoning purposes (e.g., decomposing questions, extracting information, commonsense reasoning, guiding search, synthesizing answers). Baselines included Standard prompting, Chain-of-Thought (CoT), CoT with Self-Consistency (CoT-SC), and Acting-only (Act). Strategies were also proposed to combine ReAct and CoT-SC, letting the model switch between the two based on confidence or failure so as to leverage both internal and external knowledge. Additionally, finetuning experiments were conducted on smaller LLMs using ReAct-generated trajectories.
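To picture the action space from the Setup bullet, here is a toy dispatcher for search[entity], lookup[string], and finish[answer]. The in-memory CORPUS and the class and method names are assumptions made so the sketch runs offline; the paper’s actual environment queries live Wikipedia.

```python
import re

# Stand-in for the Wikipedia API: entity title -> page text. Assumption only,
# so the sketch runs without network access.
CORPUS = {
    "Colorado orogeny": "The Colorado orogeny was an episode of mountain building "
                        "in Colorado and surrounding areas. It extended into areas "
                        "now called the Central Plains.",
}

class WikiToolEnv:
    """Toy version of the three actions exposed to the model on HotpotQA / FEVER."""

    def __init__(self):
        self.sentences = []  # sentences of the last searched page
        self.cursor = 0      # position for successive lookup[] calls

    def step(self, action: str) -> str:
        match = re.match(r"(search|lookup|finish)\[(.*)\]", action)
        if not match:
            return "Invalid action."
        name, arg = match.groups()
        if name == "finish":
            return f"Episode finished. Answer: {arg}"
        if name == "search":
            if arg in CORPUS:
                self.sentences = CORPUS[arg].split(". ")
                self.cursor = 0
                return ". ".join(self.sentences[:5])  # first few sentences of the page
            return f"Could not find {arg}. Similar entities: {list(CORPUS)[:5]}"
        # lookup[string]: return the next sentence containing the string (like Ctrl+F).
        for i in range(self.cursor, len(self.sentences)):
            if arg.lower() in self.sentences[i].lower():
                self.cursor = i + 1
                return self.sentences[i]
        return f"No more results for {arg}."

# Example: WikiToolEnv().step("search[Colorado orogeny]")
```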
Results & Observations:
- ReAct outperforms Act on both HotpotQA and FEVER, demonstrating the value of reasoning in guiding actions and synthesizing answers.
- ReAct vs. CoT: ReAct showed less hallucination (6% vs. 14% false positives) and was more grounded, fact-driven, and trustworthy due to its access to an external knowledge base. However, CoT sometimes had lower reasoning error rates, while ReAct could get stuck in repetitive loops or fail due to non-informative search results.
- Combined Methods: The combination of ReAct and CoT-SC performed best for prompting LLMs, consistently outperforming CoT-SC alone (a back-off sketch appears after this list).
- Finetuning: When finetuned with additional data, ReAct became the best method, outperforming all prompting methods, indicating it teaches a more generalizable skill of accessing information via reasoning and acting. ReAct also excels at obtaining up-to-date knowledge from the web, correcting outdated dataset labels (e.g., the hotel room count example).
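The ReAct + CoT-SC combination described under Methods amounts to two back-off rules. Below is a hedged sketch that reuses the placeholder `react_episode` from the earlier loop sketch and assumes a `cot_answers(question, n)` helper that samples n Chain-of-Thought answers; the sample count and threshold are illustrative rather than the paper’s exact settings.

```python
from collections import Counter

def react_to_cot_sc(question, env, call_llm, cot_answers, n_samples=21):
    """ReAct -> CoT-SC: if ReAct fails to finish within its step budget
    (e.g. it loops on uninformative searches), fall back to self-consistency
    voting over sampled CoT answers."""
    answer = react_episode(question, env, call_llm)
    if answer is not None:
        return answer
    votes = Counter(cot_answers(question, n_samples))
    return votes.most_common(1)[0][0]

def cot_sc_to_react(question, env, call_llm, cot_answers, n_samples=21):
    """CoT-SC -> ReAct: if the majority CoT answer wins less than half the
    votes (low internal confidence), fall back to a ReAct episode that can
    consult external knowledge."""
    votes = Counter(cot_answers(question, n_samples))
    answer, count = votes.most_common(1)[0]
    if count > n_samples / 2:
        return answer
    return react_episode(question, env, call_llm) or answer
```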
4 Decision Making Tasks :
- Domains: ReAct was tested on ALFWorld (a text-based game requiring navigation and interaction in a simulated household, with tasks like “examine paper under desklamp”) and WebShop (an online shopping environment). These environments demand planning over long horizons with sparse rewards, making reasoning crucial.
- Thought Patterns: For ALFWorld, ReAct prompts included sparse thoughts to decompose goals, track subgoal completion, determine the next subgoal, and apply commonsense knowledge about where objects are likely located. For WebShop, ReAct prompts added reasoning to determine what to explore, when to buy, and which product options are relevant.
- Performance: ReAct significantly outperformed Act and strong baselines like BUTLER (imitation learning) on ALFWorld. On WebShop, ReAct achieved a 10% absolute improvement in success rate over previous best methods, including IL and IL+RL.
- Value of Internal Reasoning: ReAct’s flexible and sparse reasoning traces were shown to be superior to Inner Monologue (IM)-style prompting (ReAct-IM), which relies on dense external feedback but often fails due to a lack of high-level goal decomposition and commonsense reasoning.
- Human-in-the-Loop: ReAct also enables human-in-the-loop behavior correction, where simple edits to the model’s thoughts can drastically change its behavior to align with human intent and lead to task success, offering a new form of human-machine collaboration (a thought-editing sketch follows this list).
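This human-in-the-loop correction can be pictured as editing one thought line in the running trajectory and letting the agent continue from the corrected context. A minimal sketch follows, assuming thoughts appear as "> think:" lines in an ALFWorld-style trace; the helper name is hypothetical, not the paper’s implementation.

```python
def edit_last_thought(trajectory: str, new_thought: str) -> str:
    """Overwrite the most recent '> think:' line in an ALFWorld-style trajectory.

    A human inspects the interpretable trace, spots a wrong belief (e.g. a
    hallucinated object location), and rewrites only that thought; earlier
    actions and observations are left untouched.
    """
    lines = trajectory.splitlines()
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].lstrip().startswith("> think:"):
            lines[i] = f"> think: {new_thought}"
            break
    return "\n".join(lines)

# The agent then resumes the ReAct loop from the edited context, so a one-line
# change to a thought can redirect all of its subsequent actions, e.g.:
# corrected = edit_last_thought(trajectory, "The mug is more likely to be in a cabinet or on the countertop.")
```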
Has anyone combined ReAct with RLHF? While ReAct is thinking and acting, human feedback could be provided to tell it whether its reasoning is correct, guiding its next action.
5 Related Work :
- Language Models for Reasoning: This section contextualizes ReAct within previous works on LLM reasoning, such as Chain-of-Thought (CoT) and its variants (least-to-most, zero-shot, self-consistency), as well as more sophisticated architectures like Selection-Inference and STaR. ReAct distinguishes itself by integrating model actions and corresponding observations into a coherent input stream for more accurate reasoning and to tackle interactive tasks.
- Language Models for Decision Making: The paper compares ReAct to LLMs used as policy models in interactive environments, including WebGPT, chatbots (BlenderBot, Sparrow), and embodied agents (SayCan, Inner Monologue). ReAct is highlighted for explicitly modeling the thinking/reasoning procedure and learning policies more cheaply compared to methods relying on expensive human feedback or datasets. ReAct also builds on Inner Monologue but offers more flexible reasoning traces.
6 Conclusion :
The conclusion reiterates ReAct as a simple yet effective method for synergizing reasoning and acting in LLMs, leading to superior performance and interpretable decision traces. It acknowledges limitations like input length limits for in-context learning but notes promising finetuning results. Future work includes scaling ReAct with multi-task training and combining it with reinforcement learning.
- Acknowledgments : Standard acknowledgment section.
- Reproducibility Statement : Notes that the main experiments used PaLM-540B (not openly accessible) but that prompts, additional GPT-3 experiments (which consistently outperformed PaLM-540B), and code for GPT-3 ReAct prompting are included to enhance reproducibility.
- Ethics Statement : Discusses the potential dangers of LLMs interacting with external environments (e.g., looking up inappropriate information, harmful actions) and outlines how the experiments minimized these risks by limiting interactions to specific, safe websites (Wikipedia, WebShop) and action spaces without dangerous actions. It advises researchers to be aware of such risks in future work.
- Appendices : These sections provide additional details and examples, including GPT-3 experiment results, a demonstration of ReAct obtaining up-to-date knowledge, an example of human-in-the-loop behavior correction in ALFWorld, detailed experiment parameters, and full prompt examples and trajectories for HotpotQA, FEVER, ALFWorld, and WebShop. They also include further analysis of success and failure modes for ReAct and CoT.