Offline Reinforcement Learning For LLM Multi-Step Reasoning

* Equal contribution. † Corresponding author.

Abstract

Reinforcement learning (RL) is essential for quickly adapting large language models (LLMs) to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse rewards. In this work, we propose OREO (Offline REasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous work on maximum entropy reinforcement learning, it jointly learns a policy model and a value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide tree search for free, which can further boost performance at test time.¹

¹ Code available at: https://github.com/jwhj/OREO

1 Introduction

Large Language Models (LLMs) are increasingly applied to complex tasks requiring multi-step reasoning, such as mathematical problem solving (Uesato et al., 2022; Shao et al., 2024; Hendrycks et al., 2021) and embodied agent control (Wang et al., 2023; Huang et al., 2022; Shridhar et al., 2020; Xiang et al., 2024). Improving the multi-step reasoning ability of LLMs with reinforcement learning (RL) has gained significant interest, as it offers the potential for self-improvement and learning without relying on human-labeled trajectories. However, many popular RL algorithms require costly online data collection, either by generating language on-the-fly or interacting with an environment. For instance, tuning LLMs with Proximal Policy Optimization (PPO, Schulman et al., 2017) is often prohibitively expensive for most users, which limits practical applications (Hu et al., 2023).

In contrast, offline RL methods, such as Direct Preference Optimization (DPO, Rafailov et al., 2024b), provide a more practical approach for aligning LLMs with human preferences. These methods enable practitioners to tune models using pre-existing datasets, eliminating the need for live interaction or data generation. However, attempts to enhance LLMs' multi-step reasoning abilities with DPO may deliver results close to or even worse than simpler methods like SFT (Yuan et al., 2024; Chen et al., 2024b). Additionally, DPO requires pairwise preference data. In multi-step reasoning tasks, however, data normally consists of independent trajectories with sparse rewards indicating success or failure. A common alternative is to extract correct trajectories from offline datasets and use them for supervised fine-tuning (Zelikman et al., 2022; Aksitov et al., 2023; Dong et al., 2023; Paulus et al., 2024). While this approach is simple and often effective, it fails to fully exploit the offline dataset's potential—particularly the opportunity to learn from failure experience and enhance model robustness (Kumar et al., 2022).
In this paper, we introduce OREO (Offline REasoning Optimization), an offline RL algorithm designed to enhance LLMs' multi-step reasoning capabilities. Building on insights from the extensive literature on maximum entropy RL (Ziebart, 2010; Nachum et al., 2017; Haarnoja et al., 2017), especially Path Consistency Learning (Nachum et al., 2017), OREO jointly learns a policy model and a value function by optimizing the soft Bellman Equation. OREO can leverage unpaired data with only sparse rewards and enables finer-grained credit assignment, which is especially critical as the correctness of reasoning trajectories often depends on a few key tokens. Additionally, OREO can be extended into an iterative framework for online exploration. The trained value function is also directly available to guide step-level beam search at inference time, further boosting performance.

We demonstrate the effectiveness of our approach on both math reasoning (GSM8K, MATH) and embodied agent control (ALFWorld) tasks. It consistently outperforms baseline methods, including rejection sampling, DPO, and KTO, across different model sizes. Notably, we train a 1.5B model to achieve 52.5% accuracy on the MATH dataset using only the original training set. Moreover, iterative OREO steadily improves model performance with additional training rounds, whereas baseline methods like rejection sampling exhibit signs of saturation. The value function learned by our method proves highly effective in guiding beam search for math reasoning tasks and selecting the best-of-K actions in embodied agent control, which results in up to a 17.9% relative improvement over greedy decoding on the MATH dataset.

2 Related Work

2.1 Reinforcement Learning for LLM

Reinforcement Learning (RL) has become a standard approach in the post-training stage of LLMs. A widely adopted method, known as reinforcement learning from human feedback (RLHF) (Ziegler et al., 2019; Ouyang et al., 2022), is designed to align LLM responses more closely with human preferences. Traditional RL methods, such as Proximal Policy Optimization (PPO) (Schulman et al., 2017), have been extensively used in LLM post-training (Achiam et al., 2023; Team et al., 2023; Dubey et al., 2024). Alternative approaches, such as rejection-sampling-based methods (Dong et al., 2023; Gulcehre et al., 2023; Zelikman et al., 2022; Hoffman et al., 2024), preference-based RL (Rafailov et al., 2024b; Xiong et al., 2024; Ethayarajh et al., 2024), and REINFORCE-like RL (Williams, 1992; Shao et al., 2024; Li et al., 2023; Ahmadian et al., 2024), have recently gained traction in the LLM literature.

Maximum-entropy RL (Ziebart, 2010; Haarnoja et al., 2017) aims to maximize the weighted sum of the accumulated reward and the policy entropy. Notable algorithms such as path-consistency learning (PCL) (Nachum et al., 2017) and soft actor-critic (SAC) (Haarnoja et al., 2018) effectively utilize this framework. Recent works (Guo et al., 2021; Richemond et al., 2024; Liu et al., 2024a) revealed a strong connection between maximum-entropy RL and the RLHF objective, indicating a promising direction to fine-tune LLMs with soft Q-learning-based algorithms. Concurrent with our work, Liu et al. (2024a) leverages the SAC framework and derives a similar algorithm for LLM multi-step reasoning. Our method differs in the derivation approach, the inclusion of a KL regularization term, and the exploration of several loss variants. Notably, we provide deeper empirical insights by incorporating diverse domains, the iterative training setting, and value-guided tree search. More discussions and empirical comparisons can be found in Appendix C.

2.2 LLM Reasoning

As an emergent ability of model scale, it has been shown that LLMs are able to generate intermediate reasoning steps to solve complex problems, known as "scratchpad" (Nye et al., 2021) or chain-of-thought (Wei et al., 2022; Kojima et al., 2022). Recent efforts have enhanced LLM reasoning through supervised fine-tuning (Yue et al., 2023, 2024; Yu et al., 2023; Luo et al., 2023; Hao et al., 2024b). When human-annotated reasoning trajectories are unavailable, rejection-sampling-based methods have proven effective. Among them, Self-Taught Reasoner (STaR) (Zelikman et al., 2022) generates rationales and fine-tunes on those leading to correct answers. Singh et al. (2023) further proposes an iterative approach based on expectation maximization. Recently, the application of RL algorithms to improve LLM reasoning has gained increasing interest (Aksitov et al., 2023; Gou et al., 2023; Dong et al., 2024; Havrilla et al., 2024; Shao et al., 2024; Zhao et al., 2024), but direct applications of DPO are not always successful (Yuan et al., 2024; Chen et al., 2024b) or efficient, as practitioners have to specifically collect pairwise preference data (Chen et al., 2024a; Song et al., 2024). Our method addresses these limitations of DPO in reasoning with a principled solution.
Another line of work aims to train a Process Reward Model (PRM) to provide finer-grained feedback for RL. It is typically trained with Monte Carlo rollouts (Wang et al., 2024a,b; Luo et al., 2024; Zhang et al., 2024), which is a special case of the value function learned through our method. We show that our value function enables test-time scaling (Hao et al., 2023; Snell et al., 2024; Wu et al., 2024; Brown et al., 2024; Yao et al., 2024; Hao et al., 2024a; Liu et al., 2024b) to further boost reasoning performance through tree search.

3 Preliminaries

3.1 MDP for LLM Reasoning

We define the Markov Decision Process (MDP) for LLM reasoning. At each time step, a new token is generated as the action a_t. The state is represented as a token sequence. For reasoning tasks that do not involve interactions with the environment, s_t records the context for LLMs, i.e., s_t = (x_0, ..., x_L, y_0, ..., y_{t-1}), where (x_0, ..., x_L) is the input prompt and (y_0, ..., y_{t-1}) is the sequence of generated tokens up to step t-1. The transition function f for these tasks deterministically updates the state as s_{t+1} = f(s_t, a_t) = s_t | a_t, where | denotes concatenation.

For tasks requiring interaction with an external environment, like embodied agent control, the state and transition function are slightly different: if a_t is the final token of the agent's response (e.g., "go to desk 1"), then s_{t+1} = f(s_t, a_t) = s_t | a_t | next observation.

The reward function r(s_t, a_t) is generally defined for every state-action pair to provide feedback throughout the generation process. However, in this work, we focus on the challenging case where the reward is non-zero only at the terminal step T, reflecting the correctness of the reasoning chain, or whether the task is successfully accomplished.

Following the standard setup in Reinforcement Learning with Human Feedback (RLHF) (Ouyang et al., 2022; Rafailov et al., 2024b), a KL-regularization term is introduced to encourage the learned policy to remain close to a reference policy while optimizing for rewards. Therefore, the optimal policy π* can be described as follows:

\pi^* = \arg\max_\pi \; \mathbb{E}_{(s_0,\ldots,s_T)\sim\rho_\pi} \left[ \sum_{t=0}^{T} \left( r(s_t, a_t) - \beta \log \frac{\pi(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} \right) \right]    (1)

where π_ref is the reference policy, ρ_π is the state-action trajectory distribution generated by following policy π, and β controls the strength of the regularization. Typically, π_ref is a pre-trained LLM followed by supervised fine-tuning. The discount factor γ is normally omitted in the RLHF setting.

3.2 Soft Bellman Equation

Entropy-regularized reinforcement learning (Ziebart, 2010; Nachum et al., 2017; Haarnoja et al., 2017) augments the standard reward maximization objective with an entropy term to encourage exploration and improve the robustness of the learned policy. The KL-regularized objective in Eq. (1) has a strong connection to entropy-regularized RL, as the Kullback-Leibler (KL) divergence between two distributions can be decomposed into a cross-entropy term and an entropy term, i.e., D_KL(π(·|s) ∥ π_ref(·|s)) = E_π[−log π_ref(a|s)] − E_π[−log π(a|s)].

Adapting the well-established theory of entropy-regularized RL to our setting, we first define the value function V^π of a policy, which quantifies the expected KL-regularized reward of policy π from any given state:

V^\pi(s_t) = \mathbb{E}_{(a_t, s_{t+1}, \ldots)\sim\rho_\pi} \left[ \sum_{l=0}^{T-t} \left( r(s_{t+l}, a_{t+l}) - \beta \log \frac{\pi(a_{t+l} \mid s_{t+l})}{\pi_{\mathrm{ref}}(a_{t+l} \mid s_{t+l})} \right) \right]    (2)

Compared to the value function in standard RL, which only includes expected rewards, the above definition incorporates an additional KL regularization term: \beta \log \frac{\pi(a_{t+l} \mid s_{t+l})}{\pi_{\mathrm{ref}}(a_{t+l} \mid s_{t+l})}.

Theorem 1 The optimal policy and its value function satisfy the soft Bellman Equation:

V^*(s_t) - V^*(s_{t+1}) = r(s_t, a_t) - \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}    (3)

where s_{t+1} = f(s_t, a_t).

Building on Nachum et al. (2017) and Haarnoja et al. (2017), we extend their theorem with a lightweight derivation tailored to our setting, with the proof provided in Appendix B.

This equation characterizes the relationship between the optimal policy and its value function, providing a theoretical basis for our proposed method. When β = 0, the equation degenerates to the Bellman equation in standard RL. Importantly, when the soft Bellman Equation is always satisfied, the policy and the value function are guaranteed to be the optimal ones:

Theorem 2 If a policy π(a|s) and state value function V(s) satisfy the consistency property (3) for all states s and actions a (where s' = f(s, a)), then π = π* and V = V*.

Similarly, the proof is a simple extension of Nachum et al. (2017). Based on Theorem 2, our proposed method OREO aims to learn both a policy model π_θ and a value model V_ϕ towards the optimal policy and value function. This is achieved by minimizing the violation of the soft Bellman consistency property. A more formal description of our method is presented in Section 4.
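To make the consistency property concrete, the snippet below computes the per-step residual of Eq. (3) for a sampled trajectory, given per-state value estimates and token log-probabilities under the policy and the frozen reference model. It is a minimal PyTorch-style sketch of the quantity that a consistency-based objective would penalize, not the exact OREO loss; the tensor shapes, variable names, default β, and the squared-error reduction are our own assumptions.

```python
import torch

def soft_bellman_residual(values, rewards, logp_policy, logp_ref, beta=0.1):
    """Per-step violation of the soft Bellman equation (Eq. 3).

    values:      shape (T+1,), estimates V(s_0), ..., V(s_T)
    rewards:     shape (T,), zero everywhere except the terminal step
    logp_policy: shape (T,), log pi(a_t | s_t) under the learned policy
    logp_ref:    shape (T,), log pi_ref(a_t | s_t) under the frozen reference model
    """
    # Eq. (3): V(s_t) - V(s_{t+1}) should equal r(s_t, a_t) - beta * log(pi / pi_ref).
    lhs = values[:-1] - values[1:]
    rhs = rewards - beta * (logp_policy - logp_ref)
    return lhs - rhs

# Toy example: a trajectory of 4 generated tokens with a sparse terminal reward of 1.0.
T = 4
residual = soft_bellman_residual(
    values=torch.randn(T + 1),
    rewards=torch.tensor([0.0, 0.0, 0.0, 1.0]),
    logp_policy=torch.randn(T),
    logp_ref=torch.randn(T),
)
penalty = residual.pow(2).mean()  # one natural way to penalize inconsistency
```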
reward exists. To apply DPO to these tasks, previous work has to collect pairwise data on reasoning tasks (Song et al., 2024; Yuan et al., 2024), which is an inefficient use of offline data. (2) No credit assignment: Relaxing the soft Bellman Equation from every time step to the entire trajectory loses the granularity of credit assignment, which is especially critical in multi-step reasoning tasks where correctness often depends on a few key tokens.

4 OREO: Offline Reasoning Optimization
4.2 Loss Variants

In addition to our OREO learning objective, we present two variants: step-level OREO and response-level OREO.

In step-level OREO, an action is considered to be an entire reasoning step instead of a single generated token, for example, "In May, Natalia sold 48 / 2 = 24 clips." In the context of language models, the probability of taking an action a = (t_1 t_2 ... t_k) is π(a|s) = ∏_{i=1}^{k} p(t_i | s t_1 t_2 ... t_{i-1}), where t_i denotes the i-th token of the action and p denotes the language model. The step-level OREO objective can thus be modified accordingly. This objective can also be grounded in the token-level MDP.

Response-level OREO aims to mimic the behavior of DPO. Instead of enforcing the consistency property at each time step, the objective considers only the initial state, i.e.,

L^{\mathrm{resp}}_{\pi}(\theta) = \left( V_\phi(s_0) - R_0 + \beta \sum_{i \ge 0} \log \frac{\pi_\theta(a_i \mid s_i)}{\pi_{\mathrm{ref}}(a_i \mid s_i)} \right)^2 + \alpha L_{\mathrm{reg}}.    (10)

4.3 Iterative OREO

Previous works have shown that offline LLM fine-tuning methods can be applied iteratively to improve model performance (Pang et al., 2024; Song et al., 2024; Xiong et al., 2024). After each iteration, a new dataset is collected by using the updated policy model to generate responses or explore the environment, and this dataset is used for further training.

4.4 Test-Time Search with Value Function

Recently, inference-time scaling (Hao et al., 2024a; Snell et al., 2024; Wu et al., 2024) has received significant research attention. One notable approach is the use of Process Reward Models (PRMs), which evaluate whether a reasoning step is correct. At inference time, rather than decoding the reasoning chain autoregressively from the policy model, one can conduct a tree search (e.g., beam search) guided by the PRM. Our method provides a value model for free, which estimates the expected future reward and can be directly used to guide beam search.

In fact, previous PRM methods (Wang et al., 2024a,b; Luo et al., 2024) train their models using Monte Carlo rollouts, which are essentially similar to the objective used for training the value function in our approach (Eq. 8). Our principled formulation removes the need for the extensive heuristic designs commonly required in prior works.

We implement step-level beam search for math reasoning tasks. At each step, we maintain a set of B candidate partial trajectories. For each candidate, we generate B potential next reasoning steps. From the resulting B^2 candidates, we retain the B with the highest values.

For embodied agent tasks, where environment dynamics are unknown, beam search is not applicable. Instead, we sample K actions at each step and select the action with the highest value.
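The sketch below mirrors the step-level beam search and best-of-K selection described above. The callables `propose_steps`, `value`, `is_finished`, and `sample_action` are hypothetical stand-ins for step-level decoding with the policy, the learned value model Vϕ, a termination check, and action sampling; this is an illustration, not the released implementation.

```python
from typing import Callable, List

def step_level_beam_search(
    prompt: str,
    propose_steps: Callable[[str], List[str]],  # returns B candidate next reasoning steps
    value: Callable[[str], float],              # value model V_phi applied to a partial trajectory
    is_finished: Callable[[str], bool],
    beam_size: int = 4,
    max_steps: int = 16,
) -> str:
    """Maintain B partial trajectories; expand each into B next steps; keep the top B by value."""
    beams = [prompt]
    for _ in range(max_steps):
        candidates = []
        for traj in beams:
            if is_finished(traj):
                candidates.append(traj)  # completed trajectories carry over unchanged
                continue
            for step in propose_steps(traj):  # up to B proposals per beam -> B^2 candidates
                candidates.append(traj + step)
        candidates.sort(key=value, reverse=True)
        beams = candidates[:beam_size]
        if all(is_finished(t) for t in beams):
            break
    return beams[0]

def best_of_k_action(state: str,
                     sample_action: Callable[[str], str],
                     value: Callable[[str], float],
                     k: int = 5) -> str:
    """For embodied agents: sample K candidate actions and take the one the value model prefers."""
    actions = [sample_action(state) for _ in range(k)]
    return max(actions, key=lambda a: value(state + a))
```

A larger beam size trades additional value-model calls for a better chance of keeping a correct partial solution in the beam.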
5 Experiments

In this section, we evaluate our method OREO on math reasoning and embodied agent tasks. We also demonstrate that the value function trained alongside the policy can further improve model performance at test time through step-level beam search or choosing the best-of-K action.

Datasets and Evaluation Metric. We adopt the GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) datasets for the task of math reasoning. GSM8K is a dataset of grade school math problems. It contains 7473 training problems and 1319 test problems. MATH consists of competition-level math problems, with a training set of size 7500 and a test set of size 5000. All problems in these datasets are labeled with step-by-step ground-truth solutions. We use the script from DeepSeekMath² to extract the final answer from a solution and evaluate its correctness.

² https://github.com/deepseek-ai/DeepSeek-Math/tree/main/evaluation/eval

We adopt ALFWorld (Shridhar et al., 2020) for the task of embodied agent control. ALFWorld provides interactive TextWorld environments for household tasks. Each task is labeled with an expert trajectory. However, these data do not contain any reasoning process. Song et al. (2024) annotates 3119 ALFWorld training trajectories with rationales for each step, allowing model training with ReAct-style (Yao et al., 2022) prompting. The evaluation set of ALFWorld contains 140 tasks in seen environments and 134 tasks in unseen environments. We evaluate the success rates of agents in completing the tasks within 40 steps.

Base Models. For the math reasoning task, we select Qwen2.5-Math-1.5B (Yang et al., 2024) and DeepSeekMath-7B-Instruct (Shao et al., 2024) as our base models. For the embodied agent task, we use MiniCPM-2B-dpo-bf16 (Hu et al., 2024) as the base model.

Baseline Methods. In addition to supervised fine-tuning, we compare our method against three other baselines:

• Rejection Sampling: This method uses the successful trajectories in the offline dataset to supervise the policy model. Despite its simplicity, rejection sampling proves to be effective in many reasoning tasks. It is also known as STaR (Zelikman et al., 2022), RAFT (Dong et al., 2023), ReST (Gulcehre et al., 2023), and ReST-EM (Singh et al., 2023).

• DPO (Rafailov et al., 2024b) uses offline preference data to solve reasoning tasks (Pang et al., 2024; Song et al., 2024).

• KTO (Ethayarajh et al., 2024) is a popular variant of DPO utilizing the Kahneman-Tversky model of human utility and is able to work on unpaired data.

We leave implementation details to Appendix A.

5.1 Main Results

We present the experimental results on mathematical reasoning in Table 1. Consistent with prior research (Yuan et al., 2024; Pang et al., 2024), we observe that while DPO provides marginal improvements over the SFT checkpoint used for its initialization, simpler methods such as rejection sampling often outperform DPO. In contrast, OREO demonstrates consistent superiority over all baselines across both datasets (GSM8K and MATH). This improvement is also observed universally across models in the Qwen and DeepSeekMath families. Specifically, for Qwen2.5-Math-1.5B, OREO achieves a 5.2% relative improvement over SFT on GSM8K and a 10.5% improvement on MATH. For DeepSeekMath 7B, despite the SFT checkpoint being heavily tuned with 776K samples (Shao et al., 2024), OREO still delivers meaningful improvements, with relative gains of 3.6% on GSM8K and 5.1% on MATH. These results highlight the robustness and effectiveness of our approach across different models and datasets.

The experimental results on ALFWorld, an embodied control task, are presented in Table 2. OREO outperforms all baselines in both settings. Interestingly, rejection sampling performs well in seen environments within ALFWorld. However, its improvement is marginal in unseen settings, whereas OREO achieves a significant 17.7% relative improvement over the baseline. Compared to SFT, which only learns from successful experience, OREO effectively leverages the failed trajectories, which results in more generalizable capabilities.

We evaluate different variants of the OREO objective on the math reasoning task. As shown in Table 5, the response-level objective variant performs worse than the token-level objective. This variant treats all actions in the trajectories uniformly, making it challenging to properly assign the sparse reward to individual tokens. This limitation also sheds light on the suboptimal performance of DPO, as it struggles with credit assignment. In contrast, our method explicitly trains a value function, enabling better credit assignment and improved performance. The step-level objective, on the other hand, performs comparably to the token-level objective and even slightly outperforms it on GSM8K. This may be due to the value function's limited accuracy at each step, which introduces noise into policy learning. Despite the slight performance gap, we adopt the token-level objective as the standard objective in our main experiments due to its simpler implementation (eliminating the need to segment reasoning steps). Nonetheless, step-level policy optimization remains an intriguing avenue for future exploration.

5.2 Iterative OREO

Figure 2 illustrates the performance of various algorithms on the math reasoning task across multiple iterations. OREO demonstrates steady and consistent improvements in accuracy over three iterations, showcasing its robustness in leveraging iterative training. While baseline methods also benefit from collecting additional data during each iteration, their performance consistently lags behind that of OREO. Notably, rejection sampling shows signs of saturation by the third iteration, with diminishing performance gains. In contrast, OREO continues to improve, likely due to its ability to effectively learn from failed trajectories. The updated policy model in each new iteration may be able to explore novel failure patterns and incorporate these insights into the learning process. This potentially explains why OREO benefits more from multiple iterations compared to rejection sampling.
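For reference, the iterative setting of Sections 4.3 and 5.2 can be summarized by the loop below; `rollout`, `terminal_reward`, and `train_offline` are hypothetical callables standing in for generation with the current policy, the sparse outcome reward, and one round of offline OREO training.

```python
from typing import Callable, List, Tuple

def iterative_oreo(
    prompts: List[str],
    rollout: Callable[[str], str],                    # generate a trajectory with the current policy
    terminal_reward: Callable[[str], float],          # sparse outcome reward (correctness / task success)
    train_offline: Callable[[List[Tuple[str, float]]], None],  # one round of offline training
    num_iterations: int = 3,
) -> None:
    """Alternate on-policy data collection and offline training, keeping failed trajectories as well."""
    dataset: List[Tuple[str, float]] = []
    for _ in range(num_iterations):
        for prompt in prompts:
            trajectory = rollout(prompt)
            dataset.append((trajectory, terminal_reward(trajectory)))  # failures are kept, not discarded
        train_offline(dataset)  # assumed to update the policy and value model that `rollout` uses
```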
Methods          | Qwen 1.5B        | DeepSeekMath 7B
                 | GSM8K    MATH    | GSM8K    MATH
SFT³             | 73.5     47.5    | 82.9     46.8
Rej. Sampling    | 74.9     50.3    | 83.6     47.2
DPO              | 74.4     49.2    | 82.4     47.2
KTO              | 73.4     48.3    | 82.5     46.9
OREO (Ours)      | 77.3     52.5    | 85.9     49.2

Table 1: Results on GSM8K and MATH. OREO yields higher accuracies than the baselines across both datasets and model sizes.

³ The DeepSeekMath-7B-Instruct model is already supervised fine-tuned (Shao et al., 2024), so the SFT results are directly adopted from their paper.

Methods          | Unseen | Seen
SFT              | 67.2   | 62.9
Rej. Sampling    | 68.7   | 79.3
DPO              | 69.4   | 64.3
OREO (Ours)      | 79.1   | 80.7

Table 2: Success rates in ALFWorld. OREO consistently outperforms all baselines.

Figure 1: Case studies on the implicit and explicit value functions. Correct reasoning steps are shown in green, while incorrect ones are shown in red. Higher advantages predicted by the value functions are highlighted in yellow. Ideally, a good value function should predict a higher advantage for the correct reasoning step.
5.3 Implicit vs. Explicit Value Functions

In DPO, the policy model is viewed as an implicit value function (Rafailov et al., 2024a). However, our results in Section 5.1 have demonstrated that OREO benefits from explicitly parameterizing a separate value function. In this section, we present case studies to compare the explicit value function V_ϕ and the implicit value function derived from π_θ, aiming to provide an intuitive understanding of their differences.

Our setting is shown in Figure 1: given a problem and an existing reasoning chain prefix, we evaluate different possible continuations of the next reasoning step, with different choices of value functions. Assuming the token indices for the next reasoning step range from i to j, the advantage function derived from the value model V_ϕ is:

A_\phi = V_\phi(s_j) - V_\phi(s_i),    (11)

which quantifies the contribution of the new reasoning step to the expected reward. If the new step introduces an error, the estimated value of the resulting state s_j will be lower than that of the previous state, resulting in a negative advantage.
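A small sketch of how Eq. (11) can be used in practice: the advantage of a candidate next step is simply the change in the value model's estimate before and after appending it, and candidates can be ranked by that advantage. The `value` callable is a stand-in for Vϕ applied to the concatenated text; it is illustrative only.

```python
from typing import Callable, Iterable, List

def step_advantage(prefix: str, step: str, value: Callable[[str], float]) -> float:
    """Advantage of appending one reasoning step (Eq. 11): A_phi = V_phi(s_j) - V_phi(s_i)."""
    return value(prefix + step) - value(prefix)

def rank_candidate_steps(prefix: str, candidates: Iterable[str],
                         value: Callable[[str], float]) -> List[str]:
    """Order candidate next steps by estimated advantage; erroneous steps should
    receive negative advantages and fall to the bottom of the ranking."""
    return sorted(candidates, key=lambda step: step_advantage(prefix, step, value), reverse=True)
```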
Figure 4: Success rates in ALFWorld when choosing best-of-K actions. Sampling 5 actions and choosing the one with the largest value gains a significant improvement in success rates.

model (Feng et al., 2023; Silver et al., 2017; Snell et al., 2024). In the context of LLM reasoning, the gap between A_θ and A_ϕ may be attributed to the softmax bottleneck (Yang et al., 2017). Specifically, predictions for π_θ(a_t | s_t) across different actions a_t are generated from the same final hidden state, differing only through a linear language-model head followed by a softmax. In contrast, V_ϕ(s_{t+1}) takes the entire text sequence as input to the transformer network, enabling a richer representation. Exploring the gap between policy and value networks presents an intriguing direction for future research.
to improve the model performance. We evaluate our method in GSM8K, MATH, and ALFWorld, demonstrating a consistent improvement compared to previous offline RLHF methods like DPO.

References

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün, and Sara Hooker. 2024. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740.

Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, et al. 2023. Rest meets react: Self-improvement for multi-step reasoning llm agent. arXiv preprint arXiv:2312.10003.

Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. 2024. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.

Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. 2024a. Step-level value preference optimization for mathematical reasoning. arXiv preprint arXiv:2406.10858.

Huayu Chen, Guande He, Lifan Yuan, Ganqu Cui, Hang Su, and Jun Zhu. 2024b. Noise contrastive alignment of language models with explicit rewards. arXiv preprint arXiv:2402.05369.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.

Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. 2023. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179.

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2023. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738.

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. 2023. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998.

Han Guo, Bowen Tan, Zhengzhong Liu, Eric P Xing, and Zhiting Hu. 2021. Efficient (soft) q-learning for text generation with limited good data. arXiv preprint arXiv:2106.07704.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361. PMLR.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Preprint, arXiv:1801.01290.

Shibo Hao, Yi Gu, Haotian Luo, Tianyang Liu, Xiyan Shao, Xinyuan Wang, Shuhua Xie, Haodi Ma, Adithya Samavedhi, Qiyue Gao, et al. 2024a. Llm reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models. arXiv preprint arXiv:2404.05221.

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992.

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. 2024b. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.

Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, and Roberta Raileanu. 2024. Teaching large language models to reason with reinforcement learning. arXiv preprint arXiv:2403.04642.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.

Matthew Douglas Hoffman, Du Phan, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, and Rif A Saurous. 2024. Training chain-of-thought via latent-variable inference. Advances in Neural Information Processing Systems, 36.

Jian Hu, Li Tao, June Yang, and Chandler Zhou. 2023. Aligning language models with offline reinforcement learning from human feedback. arXiv preprint arXiv:2308.12050.

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. 2024. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395.

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. 2022. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pages 9118–9147. PMLR.

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. 2024. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213.

Aviral Kumar, Joey Hong, Anikait Singh, and Sergey Levine. 2022. When should we prefer offline reinforcement learning over behavioral cloning? arXiv preprint arXiv:2204.05618.

Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. 2023. Remax: A simple, effective, and efficient method for aligning large language models. arXiv preprint arXiv:2310.10505.

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. arXiv preprint arXiv:2305.20050.

Guanlin Liu, Kaixuan Ji, Renjie Zheng, Zheng Wu, Chen Dun, Quanquan Gu, and Lin Yan. 2024a. Enhancing multi-step reasoning abilities of language models through direct q-function optimization. arXiv preprint arXiv:2410.09302.

Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. 2024b. Don't throw away your value model! Generating more preferable text with value-guided monte-carlo tree search decoding. In First Conference on Language Modeling.

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.

Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, et al. 2024. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592.

Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. 2017. Bridging the gap between value and policy based reinforcement learning. Advances in Neural Information Processing Systems, 30.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. 2024. Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733.

Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, and Yuandong Tian. 2024. Advprompter: Fast adaptive adversarial prompting for llms. arXiv preprint arXiv:2404.16873.

Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. 2024a. From r to q*: Your language model is secretly a q-function. arXiv preprint arXiv:2404.12358.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024b. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.

Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, et al. 2024. Offline regularised reinforcement learning for large language models alignment. arXiv preprint arXiv:2405.19107.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2020. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, et al. 2023. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. 2024. Trial and error: Exploration-based trajectory optimization for llm agents. arXiv preprint arXiv:2403.02502.

Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, and Chuang Gan. 2024. Easy-to-hard generalization: Scalable alignment beyond human supervision. arXiv preprint arXiv:2403.09472.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. 2024a. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439.

Zihan Wang, Yunxuan Li, Yuexin Wu, Liangchen Luo, Le Hou, Hongkun Yu, and Jingbo Shang. 2024b. Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision. arXiv preprint arXiv:2402.02658.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.

Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. 2024. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724.

Jiannan Xiang, Tianhua Tao, Yi Gu, Tianmin Shu, Zirui Wang, Zichao Yang, and Zhiting Hu. 2024. Language models meet world models: Embodied experiences enhance language models. Advances in Neural Information Processing Systems, 36.
Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. 2024. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint. In Forty-first International Conference on Machine Learning.

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. 2017. Breaking the softmax bottleneck: A high-rank rnn language model. arXiv preprint arXiv:1711.03953.

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. 2023. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.

Brian D Ziebart. 2010. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University.

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
because the GSM8K dataset already splits reasoning steps with line breaks.

A.3 Hyperparameters

The batch size is set to 128 for all experiments.

SFT. The Qwen2.5-Math-1.5B model is trained on the GSM8K and MATH datasets for 3 epochs with a learning rate of 2 × 10^-5. The DeepSeekMath-7B-Instruct model is already instruction fine-tuned, so we do not perform any additional SFT. The MiniCPM-2B-dpo-bf16 model is trained on the annotated dataset from Song et al. (2024) for 2 epochs with a learning rate of 2 × 10^-5.

B Proof Sketch of Theorem 1

The RLHF objective can be rewritten as

J_{\mathrm{RLHF}}(\pi) = \mathbb{E}_{\tau\sim\pi}\left[ \sum_{t=0}^{T-1} \Big( r_{\mathrm{task}}(s_t, a_t) + \beta \log \pi_{\mathrm{ref}}(a_t \mid s_t) + \beta H(\pi(\cdot \mid s_t)) \Big) \right].    (13)

So it can be viewed as maximum-entropy RL (Haarnoja et al., 2017) with reward r_task(s_t, a_t) + β log π_ref(a_t|s_t). Previous works in maximum-entropy RL (Haarnoja et al., 2017) show that the optimal policy satisfies

\beta \log \pi^*(a_t \mid s_t) = Q^*(s_t, a_t) - V^*(s_t),    (14)
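To connect Eq. (14) back to Theorem 1, one further step suffices. The derivation below is our own addition (a sketch, assuming the deterministic transition s_{t+1} = f(s_t, a_t) from Section 3.1, no discounting, and the standard soft Bellman backup for the augmented reward \tilde r(s_t, a_t) = r_{\mathrm{task}}(s_t, a_t) + \beta \log \pi_{\mathrm{ref}}(a_t \mid s_t)):

Q^*(s_t, a_t) = \tilde r(s_t, a_t) + V^*(s_{t+1}).

Substituting this into Eq. (14) gives

\beta \log \pi^*(a_t \mid s_t) = r_{\mathrm{task}}(s_t, a_t) + \beta \log \pi_{\mathrm{ref}}(a_t \mid s_t) + V^*(s_{t+1}) - V^*(s_t),

and rearranging yields

V^*(s_t) - V^*(s_{t+1}) = r_{\mathrm{task}}(s_t, a_t) - \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)},

which is the soft Bellman Equation of Theorem 1 (Eq. 3), with r there denoting the task reward.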
of the method to better understand its properties. Along with our open-sourced code, we believe our work can help the community explore best practices for LLM training. (3) Our work presents comprehensive experiments and analyses, including experiments on embodied agent control, the iterative training setting, value-guided tree search, and a comparison between implicit and explicit value functions.

We present an empirical comparison of the two methods here. Liu et al. (2024a) used Qwen2-7B-Instruct and Gemma-1.1-7B-it in their experiments. As their work is not yet open-sourced, we reproduce DQO and experiment on two math reasoning datasets. We use Qwen2.5-Math-1.5B as the base model and use the same training data. We report the results in Table 3.

D Safeguard Statement

In this paper, we primarily focus on math reasoning tasks and embodied agent control tasks in a household simulator, posing no significant ethical or harmful concerns. We recognize that future research on broader applications of multi-step reasoning may pose a risk of misuse, and we recommend careful consideration of all aspects of safety before it is applied in the real world.