Offline Reinforcement Learning For LLM Multi-Step Reasoning

* Equal contribution. † Corresponding author.

Abstract

Reinforcement learning (RL) is essential for quickly adapting large language models (LLMs) to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse rewards. In this work, we propose OREO (Offline REasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous work on maximum entropy reinforcement learning, it jointly learns a policy model and a value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide tree search for free, which can further boost performance at test time.¹

¹ Code available at: https://github.com/jwhj/OREO

1 Introduction

Large Language Models (LLMs) are increasingly applied to complex tasks requiring multi-step reasoning, such as mathematical problem solving (Uesato et al., 2022; Shao et al., 2024; Hendrycks et al., 2021) and embodied agent control (Wang et al., 2023; Huang et al., 2022; Shridhar et al., 2020; Xiang et al., 2024). Improving the multi-step reasoning ability of LLMs with reinforcement learning (RL) has gained significant interest, as it offers the potential for self-improvement and learning without relying on human-labeled trajectories. However, many popular RL algorithms require costly online data collection, either by generating language on-the-fly or interacting with an environment. For instance, tuning LLMs with Proximal Policy Optimization (PPO, Schulman et al., 2017) is often prohibitively expensive for most users, which limits practical applications (Hu et al., 2023).

In contrast, offline RL methods, such as Direct Preference Optimization (DPO, Rafailov et al., 2024b), provide a more practical approach for aligning LLMs with human preferences. These methods enable practitioners to tune models using pre-existing datasets, eliminating the need for live interaction or data generation. However, attempts to enhance LLMs' multi-step reasoning abilities with DPO may deliver results close to or even worse than simpler methods like SFT (Yuan et al., 2024; Chen et al., 2024b). Additionally, DPO requires pairwise preference data. In multi-step reasoning tasks, however, data normally consists of independent trajectories with sparse rewards indicating success or failure. A common alternative is to extract correct trajectories from offline datasets and use them for supervised fine-tuning (Zelikman et al., 2022; Aksitov et al., 2023; Dong et al., 2023; Paulus et al., 2024). While this approach is simple and often effective, it fails to fully exploit the offline dataset's potential—particularly the opportunity to learn from failure experience and enhance model robustness (Kumar et al., 2022).
In this paper, we introduce OREO (Offline REasoning Optimization), an offline RL algorithm designed to enhance LLMs' multi-step reasoning capabilities. Building on insights from the extensive literature on maximum entropy RL (Ziebart, 2010; Nachum et al., 2017; Haarnoja et al., 2017), especially Path Consistency Learning (Nachum et al., 2017), OREO jointly learns a policy model and a value function by optimizing the soft Bellman Equation. OREO can leverage unpaired data with only sparse rewards and enables finer-grained credit assignment, which is especially critical as the correctness of reasoning trajectories often depends on a few key tokens. Additionally, OREO can be extended into an iterative framework for online exploration. The trained value function is also directly available to guide step-level beam search at inference time, further boosting performance.

We demonstrate the effectiveness of our approach on both math reasoning (GSM8K, MATH) and embodied agent control (ALFWorld) tasks. It consistently outperforms baseline methods, including rejection sampling, DPO, and KTO, across different model sizes. Notably, we train a 1.5B model to achieve 52.5% accuracy on the MATH dataset using only the original training set. Moreover, iterative OREO steadily improves model performance with additional training rounds, whereas baseline methods like rejection sampling exhibit signs of saturation. The value function learned by our method proves highly effective in guiding beam search for math reasoning tasks and selecting the best-of-K actions in embodied agent control, which results in up to a 17.9% relative improvement over greedy decoding on the MATH dataset.

2 Related Work

2.1 Reinforcement Learning for LLM

Reinforcement Learning (RL) has become a standard approach in the post-training stage of LLMs. A widely adopted method, known as reinforcement learning from human feedback (RLHF) (Ziegler et al., 2019; Ouyang et al., 2022), is designed to align LLM responses more closely with human preferences. Traditional RL methods, such as Proximal Policy Optimization (PPO) (Schulman et al., 2017), have been extensively used in LLM post-training (Achiam et al., 2023; Team et al., 2023; Dubey et al., 2024). Alternative approaches, such as rejection-sampling-based methods (Dong et al., 2023; Gulcehre et al., 2023; Zelikman et al., 2022; Hoffman et al., 2024), preference-based RL (Rafailov et al., 2024b; Xiong et al., 2024; Ethayarajh et al., 2024), and REINFORCE-like RL (Williams, 1992; Shao et al., 2024; Li et al., 2023; Ahmadian et al., 2024), have recently gained traction in the LLM literature.

Maximum-entropy RL (Ziebart, 2010; Haarnoja et al., 2017) aims to maximize the weighted sum of the accumulated reward and the policy entropy. Notable algorithms such as path-consistency learning (PCL) (Nachum et al., 2017) and soft actor-critic (SAC) (Haarnoja et al., 2018) effectively utilize this framework. Recent works (Guo et al., 2021; Richemond et al., 2024; Liu et al., 2024a) revealed a strong connection between maximum-entropy RL and the RLHF objective, indicating a promising direction to fine-tune LLMs with soft Q-learning-based algorithms. Concurrent with our work, Liu et al. (2024a) leverages the SAC framework and derives a similar algorithm for LLM multi-step reasoning. Our method differs in the derivation approach, the inclusion of a KL regularization term, and the exploration of several loss variants. Notably, we provide deeper empirical insights by incorporating diverse domains, the iterative training setting, and value-guided tree search. More discussions and empirical comparisons can be found in Appendix C.

2.2 LLM Reasoning

As an emergent ability of model scale, it has been shown that LLMs are able to generate intermediate reasoning steps to solve complex problems, known as "scratchpad" (Nye et al., 2021) or chain-of-thought (Wei et al., 2022; Kojima et al., 2022). Recent efforts have enhanced LLM reasoning through supervised fine-tuning (Yue et al., 2023, 2024; Yu et al., 2023; Luo et al., 2023; Hao et al., 2024b). When human-annotated reasoning trajectories are unavailable, rejection-sampling-based methods have proven effective. Among them, Self-Taught Reasoner (STaR) (Zelikman et al., 2022) generates rationales and fine-tunes on those leading to correct answers. Singh et al. (2023) further proposes an iterative approach based on expectation maximization. Recently, the application of RL algorithms to improve LLM reasoning has gained increasing interest (Aksitov et al., 2023; Gou et al., 2023; Dong et al., 2024; Havrilla et al., 2024; Shao et al., 2024; Zhao et al., 2024), but direct applications of DPO are not always successful (Yuan et al., 2024; Chen et al., 2024b) or efficient, as practitioners have to specifically collect pairwise preference data (Chen et al., 2024a; Song et al., 2024). Our method addresses these limitations of DPO in reasoning with a principled solution.
Another line of work aims to train a Process Reward Model (PRM) to provide finer-grained feedback for RL. It is typically trained with Monte Carlo rollouts (Wang et al., 2024a,b; Luo et al., 2024; Zhang et al., 2024), which is a special case of the value function learned through our method. We show that our value function enables test-time scaling (Hao et al., 2023; Snell et al., 2024; Wu et al., 2024; Brown et al., 2024; Yao et al., 2024; Hao et al., 2024a; Liu et al., 2024b) to further boost reasoning performance through tree search.

3 Preliminaries

3.1 MDP for LLM Reasoning

We define the Markov Decision Process (MDP) for LLM reasoning. At each time step, a new token is generated as the action a_t. The state is represented as a token sequence. For reasoning tasks that do not involve interactions with the environment, s_t records the context for LLMs, i.e., s_t = (x_0, ..., x_L, y_0, ..., y_{t-1}), where (x_0, ..., x_L) is the input prompt and (y_0, ..., y_{t-1}) is the sequence of generated tokens up to step t-1. The transition function f for these tasks deterministically updates the state as s_{t+1} = f(s_t, a_t) = s_t | a_t, where | denotes concatenation.

For tasks requiring interaction with an external environment, like embodied agent control, the state and transition function are slightly different: if a_t is the final token of the agent's response (e.g., "go to desk 1"), then s_{t+1} = f(s_t, a_t) = s_t | a_t | next observation.

The reward function r(s_t, a_t) is generally defined for every state-action pair to provide feedback throughout the generation process. However, in this work, we focus on the challenging case where the reward is non-zero only at the terminal step T, reflecting the correctness of the reasoning chain, or whether the task is successfully accomplished.

Following the standard setup in Reinforcement Learning with Human Feedback (RLHF) (Ouyang et al., 2022; Rafailov et al., 2024b), a KL-regularization term is introduced to encourage the learned policy to remain close to a reference policy while optimizing for rewards. Therefore, the optimal policy π* can be described as follows:

\pi^* = \arg\max_\pi \; \mathbb{E}_{(s_0,\ldots,s_T)\sim\rho_\pi} \left[ \sum_{t=0}^{T} \left( r(s_t, a_t) - \beta \log \frac{\pi(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} \right) \right]    (1)

where π_ref is the reference policy, ρ_π is the state-action trajectory distribution generated by following policy π, and β controls the strength of the regularization. Typically, π_ref is a pre-trained LLM followed by supervised fine-tuning. The discount factor γ is normally omitted in the RLHF setting.

3.2 Soft Bellman Equation

Entropy-regularized reinforcement learning (Ziebart, 2010; Nachum et al., 2017; Haarnoja et al., 2017) augments the standard reward maximization objective with an entropy term to encourage exploration and improve the robustness of the learned policy. The KL-regularized objective in Eq. (1) has a strong connection to entropy-regularized RL, as the Kullback-Leibler (KL) divergence between two distributions can be decomposed into a cross-entropy term and an entropy term, i.e., D_KL(π(·|s) ∥ π_ref(·|s)) = E_π[−log π_ref(a|s)] − E_π[−log π(a|s)].

Adapting the well-established theory of entropy-regularized RL to our setting, we first define the value function V^π of a policy, which quantifies the expected KL-regularized reward of policy π from any given state:

V^\pi(s_t) = \mathbb{E}_{(a_t, s_{t+1}, \ldots)\sim\rho_\pi} \left[ \sum_{l=0}^{T-t} \left( r(s_{t+l}, a_{t+l}) - \beta \log \frac{\pi(a_{t+l} \mid s_{t+l})}{\pi_{\mathrm{ref}}(a_{t+l} \mid s_{t+l})} \right) \right]    (2)

Compared to the value function in standard RL, which only includes expected rewards, the above definition incorporates an additional KL regularization term: \beta \log \frac{\pi(a_{t+l} \mid s_{t+l})}{\pi_{\mathrm{ref}}(a_{t+l} \mid s_{t+l})}.

Theorem 1 The optimal policy and its value function satisfy the soft Bellman Equation:

V^*(s_t) - V^*(s_{t+1}) = r(s_t, a_t) - \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}    (3)

where s_{t+1} = f(s_t, a_t).

Building on Nachum et al. (2017) and Haarnoja et al. (2017), we extend their theorem with a lightweight derivation tailored to our setting, with the proof provided in Appendix B.

This equation characterizes the relationship between the optimal policy and its value function, providing a theoretical basis for our proposed method. When β = 0, the equation degenerates to the Bellman equation in standard RL. Importantly, when the soft Bellman Equation is always satisfied, the policy and the value function are guaranteed to be the optimal ones:

Theorem 2 If a policy π(a|s) and state value function V(s) satisfy the consistency property (3) for all states s and actions a (where s' = f(s, a)), then π = π* and V = V*.

Similarly, the proof is a simple extension of Nachum et al. (2017). Based on Theorem 2, our proposed method OREO aims to learn both a policy model π_θ and a value model V_ϕ towards the optimal policy and value function. This is achieved by minimizing the violation of the soft Bellman consistency property. A more formal description of our method is presented in Section 4.
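To make the consistency property concrete, the snippet below computes the per-step residual of Eq. (3) for a sampled trajectory, given per-state value estimates and token log-probabilities under the policy and the frozen reference model. It is a minimal PyTorch-style sketch of the quantity that a consistency-based objective would penalize, not the exact OREO loss; the tensor shapes, variable names, default β, and the squared-error reduction are our own assumptions.

```python
import torch

def soft_bellman_residual(values, rewards, logp_policy, logp_ref, beta=0.1):
    """Per-step violation of the soft Bellman equation (Eq. 3).

    values:      shape (T+1,), estimates V(s_0), ..., V(s_T)
    rewards:     shape (T,), zero everywhere except the terminal step
    logp_policy: shape (T,), log pi(a_t | s_t) under the learned policy
    logp_ref:    shape (T,), log pi_ref(a_t | s_t) under the frozen reference model
    """
    # Eq. (3): V(s_t) - V(s_{t+1}) should equal r(s_t, a_t) - beta * log(pi / pi_ref).
    lhs = values[:-1] - values[1:]
    rhs = rewards - beta * (logp_policy - logp_ref)
    return lhs - rhs

# Toy example: a trajectory of 4 generated tokens with a sparse terminal reward of 1.0.
T = 4
residual = soft_bellman_residual(
    values=torch.randn(T + 1),
    rewards=torch.tensor([0.0, 0.0, 0.0, 1.0]),
    logp_policy=torch.randn(T),
    logp_ref=torch.randn(T),
)
penalty = residual.pow(2).mean()  # one natural way to penalize inconsistency
```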
reward exists. To apply DPO to these tasks, previous work has to collect pairwise data on reasoning tasks (Song et al., 2024; Yuan et al., 2024), which is an inefficient use of offline data. (2) No credit assignment: Relaxing the soft Bellman Equation from every time step to the entire trajectory loses the granularity of credit assignment, which is especially critical in multi-step reasoning tasks where correctness often depends on a few key tokens.

4 OREO: Offline Reasoning Optimization
4.2 Loss Variants

In addition to our OREO learning objective, we present two variants: step-level OREO and response-level OREO.

In step-level OREO, an action is considered to be an entire reasoning step instead of a single generated token, for example, "In May, Natalia sold 48 / 2 = 24 clips." In the context of language models, the probability of taking an action a = (t_1 t_2 ... t_k) is π(a|s) = ∏_{i=1}^{k} p(t_i | s t_1 t_2 ... t_{i-1}), where t_i denotes the i-th token of the action and p denotes the language model. The step-level OREO objective can thus be modified accordingly. This objective can also be grounded in the token-level MDP.

Response-level OREO aims to mimic the behavior of DPO. Instead of enforcing the consistency property at each time step, the objective considers only the initial state, i.e.,

L^{\mathrm{resp}}_{\pi}(\theta) = \left( V_\phi(s_0) - R_0 + \beta \sum_{i \ge 0} \log \frac{\pi_\theta(a_i \mid s_i)}{\pi_{\mathrm{ref}}(a_i \mid s_i)} \right)^2 + \alpha L_{\mathrm{reg}}.    (10)

4.3 Iterative OREO

Previous works have shown that offline LLM fine-tuning methods can be applied iteratively to improve model performance (Pang et al., 2024; Song et al., 2024; Xiong et al., 2024). After each iteration, a new dataset is collected by using the updated policy model to generate responses or explore the environment, and this dataset is used for further training.

4.4 Test-Time Search with Value Function

Recently, inference-time scaling (Hao et al., 2024a; Snell et al., 2024; Wu et al., 2024) has received significant research attention. One notable approach is the use of Process Reward Models (PRMs), which evaluate whether a reasoning step is correct. At inference time, rather than decoding the reasoning chain autoregressively from the policy model, one can conduct a tree search (e.g., beam search) guided by the PRM. Our method provides a value model for free, which estimates the expected future reward and can be directly used to guide beam search.

In fact, previous PRM methods (Wang et al., 2024a,b; Luo et al., 2024) train their models using Monte Carlo rollouts, which are essentially similar to the objective used for training the value function in our approach (Eq. 8). Our principled formulation removes the need for the extensive heuristic designs commonly required in prior works.

We implement step-level beam search for math reasoning tasks. At each step, we maintain a set of B candidate partial trajectories. For each candidate, we generate B potential next reasoning steps. From the resulting B^2 candidates, we retain the B with the highest values.

For embodied agent tasks, where environment dynamics are unknown, beam search is not applicable. Instead, we sample K actions at each step and select the action with the highest value.
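The sketch below mirrors the step-level beam search and best-of-K selection described above. The callables `propose_steps`, `value`, `is_finished`, and `sample_action` are hypothetical stand-ins for step-level decoding with the policy, the learned value model Vϕ, a termination check, and action sampling; this is an illustration, not the released implementation.

```python
from typing import Callable, List

def step_level_beam_search(
    prompt: str,
    propose_steps: Callable[[str], List[str]],  # returns B candidate next reasoning steps
    value: Callable[[str], float],              # value model V_phi applied to a partial trajectory
    is_finished: Callable[[str], bool],
    beam_size: int = 4,
    max_steps: int = 16,
) -> str:
    """Maintain B partial trajectories; expand each into B next steps; keep the top B by value."""
    beams = [prompt]
    for _ in range(max_steps):
        candidates = []
        for traj in beams:
            if is_finished(traj):
                candidates.append(traj)  # completed trajectories carry over unchanged
                continue
            for step in propose_steps(traj):  # up to B proposals per beam -> B^2 candidates
                candidates.append(traj + step)
        candidates.sort(key=value, reverse=True)
        beams = candidates[:beam_size]
        if all(is_finished(t) for t in beams):
            break
    return beams[0]

def best_of_k_action(state: str,
                     sample_action: Callable[[str], str],
                     value: Callable[[str], float],
                     k: int = 5) -> str:
    """For embodied agents: sample K candidate actions and take the one the value model prefers."""
    actions = [sample_action(state) for _ in range(k)]
    return max(actions, key=lambda a: value(state + a))
```

A larger beam size trades additional value-model calls for a better chance of keeping a correct partial solution in the beam.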
5 Experiments

In this section, we evaluate our method OREO on math reasoning and embodied agent tasks. We also demonstrate that the value function trained alongside the policy can further improve model performance at test time through step-level beam search or choosing the best-of-K action.

Datasets and Evaluation Metric. We adopt the GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) datasets for the task of math reasoning. GSM8K is a dataset of grade school math problems. It contains 7473 training problems and 1319 test problems. MATH consists of competition-level math problems, with a training set of size 7500 and a test set of size 5000. All problems in these datasets are labeled with step-by-step ground-truth solutions. We use the script from DeepSeekMath² to extract the final answer from a solution and evaluate its correctness.

² https://github.com/deepseek-ai/DeepSeek-Math/tree/main/evaluation/eval

We adopt ALFWorld (Shridhar et al., 2020) for the task of embodied agent control. ALFWorld provides interactive TextWorld environments for household tasks. Each task is labeled with an expert trajectory. However, these data do not contain any reasoning process. Song et al. (2024) annotates 3119 ALFWorld training trajectories with rationales for each step, allowing model training with ReAct-style (Yao et al., 2022) prompting. The evaluation set of ALFWorld contains 140 tasks in seen environments and 134 tasks in unseen environments. We evaluate the success rates of agents in completing the tasks within 40 steps.

Base Models. For the math reasoning task, we select Qwen2.5-Math-1.5B (Yang et al., 2024) and DeepSeekMath-7B-Instruct (Shao et al., 2024) as our base models. For the embodied agent task, we use MiniCPM-2B-dpo-bf16 (Hu et al., 2024) as the base model.

Baseline Methods. In addition to supervised fine-tuning, we compare our method against three other baselines:

• Rejection Sampling: This method uses the successful trajectories in the offline dataset to supervise the policy model. Despite its simplicity, rejection sampling proves to be effective in many reasoning tasks. It is also known as STaR (Zelikman et al., 2022), RAFT (Dong et al., 2023), ReST (Gulcehre et al., 2023), and ReST-EM (Singh et al., 2023).

• DPO (Rafailov et al., 2024b) uses offline preference data to solve reasoning tasks (Pang et al., 2024; Song et al., 2024).

• KTO (Ethayarajh et al., 2024) is a popular variant of DPO utilizing the Kahneman-Tversky model of human utility and is able to work on unpaired data.

We leave implementation details to Appendix A.

5.1 Main Results

We present the experimental results on mathematical reasoning in Table 1. Consistent with prior research (Yuan et al., 2024; Pang et al., 2024), we observe that while DPO provides marginal improvements over the SFT checkpoint used for its initialization, simpler methods such as rejection sampling often outperform DPO. In contrast, OREO demonstrates consistent superiority over all baselines across both datasets (GSM8K and MATH). This improvement is also observed universally across models in the Qwen and DeepSeekMath families. Specifically, for Qwen2.5-Math-1.5B, OREO achieves a 5.2% relative improvement over SFT on GSM8K and a 10.5% improvement on MATH. For DeepSeekMath 7B, despite the SFT checkpoint being heavily tuned with 776K samples (Shao et al., 2024), OREO still delivers meaningful improvements, with relative gains of 3.6% on GSM8K and 5.1% on MATH. These results highlight the robustness and effectiveness of our approach across different models and datasets.

The experimental results on ALFWorld, an embodied control task, are presented in Table 2. OREO outperforms all baselines in both settings. Interestingly, rejection sampling performs well in seen environments within ALFWorld. However, its improvement is marginal in unseen settings, whereas OREO achieves a significant 17.7% relative improvement over the baseline. Compared to SFT, which only learns from successful experience, OREO effectively leverages the failed trajectories, which results in more generalizable capabilities.

We evaluate different variants of the OREO objective on the math reasoning task. As shown in Table 5, the response-level objective variant performs worse than the token-level objective. This variant treats all actions in the trajectories uniformly, making it challenging to properly assign the sparse reward to individual tokens. This limitation also sheds light on the suboptimal performance of DPO, as it struggles with credit assignment. In contrast, our method explicitly trains a value function, enabling better credit assignment and improved performance. The step-level objective, on the other hand, performs comparably to the token-level objective and even slightly outperforms it on GSM8K. This may be due to the value function's limited accuracy at each step, which introduces noise into policy learning. Despite the slight performance gap, we adopt the token-level objective as the standard objective in our main experiments due to its simpler implementation (eliminating the need to segment reasoning steps). Nonetheless, step-level policy optimization remains an intriguing avenue for future exploration.

5.2 Iterative OREO

Figure 2 illustrates the performance of various algorithms on the math reasoning task across multiple iterations. OREO demonstrates steady and consistent improvements in accuracy over three iterations, showcasing its robustness in leveraging iterative training. While baseline methods also benefit from collecting additional data during each iteration, their performance consistently lags behind that of OREO. Notably, rejection sampling shows signs of saturation by the third iteration, with diminishing performance gains. In contrast, OREO continues to improve, likely due to its ability to effectively learn from failed trajectories. The updated policy model in each new iteration may be able to explore novel failure patterns and incorporate these insights into the learning process. This potentially explains why OREO benefits more from multiple iterations compared to rejection sampling.
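For reference, the iterative setting of Sections 4.3 and 5.2 can be summarized by the loop below; `rollout`, `terminal_reward`, and `train_offline` are hypothetical callables standing in for generation with the current policy, the sparse outcome reward, and one round of offline OREO training.

```python
from typing import Callable, List, Tuple

def iterative_oreo(
    prompts: List[str],
    rollout: Callable[[str], str],                    # generate a trajectory with the current policy
    terminal_reward: Callable[[str], float],          # sparse outcome reward (correctness / task success)
    train_offline: Callable[[List[Tuple[str, float]]], None],  # one round of offline training
    num_iterations: int = 3,
) -> None:
    """Alternate on-policy data collection and offline training, keeping failed trajectories as well."""
    dataset: List[Tuple[str, float]] = []
    for _ in range(num_iterations):
        for prompt in prompts:
            trajectory = rollout(prompt)
            dataset.append((trajectory, terminal_reward(trajectory)))  # failures are kept, not discarded
        train_offline(dataset)  # assumed to update the policy and value model that `rollout` uses
```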
Methods          | Qwen 1.5B        | DeepSeekMath 7B
                 | GSM8K    MATH    | GSM8K    MATH
SFT³             | 73.5     47.5    | 82.9     46.8
Rej. Sampling    | 74.9     50.3    | 83.6     47.2
DPO              | 74.4     49.2    | 82.4     47.2
KTO              | 73.4     48.3    | 82.5     46.9
OREO (Ours)      | 77.3     52.5    | 85.9     49.2

Table 1: Results on GSM8K and MATH. OREO yields higher accuracies than the baselines across both datasets and model sizes.

³ The DeepSeekMath-7B-Instruct model is already supervised fine-tuned (Shao et al., 2024), so the SFT results are directly adopted from their paper.

Methods          | Unseen | Seen
SFT              | 67.2   | 62.9
Rej. Sampling    | 68.7   | 79.3
DPO              | 69.4   | 64.3
OREO (Ours)      | 79.1   | 80.7

Table 2: Success rates in ALFWorld. OREO consistently outperforms all baselines.

Figure 1: Case studies on the implicit and explicit value functions. Correct reasoning steps are shown in green, while incorrect ones are shown in red. Higher advantages predicted by the value functions are highlighted in yellow. Ideally, a good value function should predict a higher advantage for the correct reasoning step.
5.3 Implicit vs. Explicit Value Functions

In DPO, the policy model is viewed as an implicit value function (Rafailov et al., 2024a). However, our results in Section 5.1 have demonstrated that OREO benefits from explicitly parameterizing a separate value function. In this section, we present case studies to compare the explicit value function V_ϕ and the implicit value function derived from π_θ, aiming to provide an intuitive understanding of their differences.

Our setting is shown in Figure 1: given a problem and an existing reasoning chain prefix, we evaluate different possible continuations of the next reasoning step, with different choices of value functions. Assuming the token indices for the next reasoning step range from i to j, the advantage function derived from the value model V_ϕ is:

A_\phi = V_\phi(s_j) - V_\phi(s_i),    (11)

which quantifies the contribution of the new reasoning step to the expected reward. If the new step introduces an error, the estimated value of the resulting state s_j will be lower than that of the previous state, resulting in a negative advantage.
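A small sketch of how Eq. (11) can be used in practice: the advantage of a candidate next step is simply the change in the value model's estimate before and after appending it, and candidates can be ranked by that advantage. The `value` callable is a stand-in for Vϕ applied to the concatenated text; it is illustrative only.

```python
from typing import Callable, Iterable, List

def step_advantage(prefix: str, step: str, value: Callable[[str], float]) -> float:
    """Advantage of appending one reasoning step (Eq. 11): A_phi = V_phi(s_j) - V_phi(s_i)."""
    return value(prefix + step) - value(prefix)

def rank_candidate_steps(prefix: str, candidates: Iterable[str],
                         value: Callable[[str], float]) -> List[str]:
    """Order candidate next steps by estimated advantage; erroneous steps should
    receive negative advantages and fall to the bottom of the ranking."""
    return sorted(candidates, key=lambda step: step_advantage(prefix, step, value), reverse=True)
```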
Figure 4: Success rates in ALFWorld when choosing best-of-K actions. Sampling 5 actions and choosing the one with the largest value gains a significant improvement in success rates.

model (Feng et al., 2023; Silver et al., 2017; Snell et al., 2024). In the context of LLM reasoning, the gap between A_θ and A_ϕ may be attributed to the softmax bottleneck (Yang et al., 2017). Specifically, predictions for π_θ(a_t | s_t) across different actions a_t are generated from the same final hidden state, differing only through a linear language-model head followed by a softmax. In contrast, V_ϕ(s_{t+1}) takes the entire text sequence as input to the transformer network, enabling a richer representation. Exploring the gap between policy and value networks presents an intriguing direction for future research.
to improve the model performance. We evaluate our method in GSM8K, MATH, and ALFWorld, demonstrating a consistent improvement compared to previous offline RLHF methods like DPO.

References

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün, and Sara Hooker. 2024. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740.

Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, et al. 2023. Rest meets react: Self-improvement for multi-step reasoning llm agent. arXiv preprint arXiv:2312.10003.

Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. 2024. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.

Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. 2024a. Step-level value preference optimization for mathematical reasoning. arXiv preprint arXiv:2406.10858.

Huayu Chen, Guande He, Lifan Yuan, Ganqu Cui, Hang Su, and Jun Zhu. 2024b. Noise contrastive alignment of language models with explicit rewards. arXiv preprint arXiv:2402.05369.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.

Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. 2023. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179.

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2023. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738.

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. 2023. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998.

Han Guo, Bowen Tan, Zhengzhong Liu, Eric P Xing, and Zhiting Hu. 2021. Efficient (soft) q-learning for text generation with limited good data. arXiv preprint arXiv:2106.07704.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361. PMLR.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Preprint, arXiv:1801.01290.

Shibo Hao, Yi Gu, Haotian Luo, Tianyang Liu, Xiyan Shao, Xinyuan Wang, Shuhua Xie, Haodi Ma, Adithya Samavedhi, Qiyue Gao, et al. 2024a. Llm reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models. arXiv preprint arXiv:2404.05221.

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992.

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. 2024b. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.

Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, and Roberta Raileanu. 2024. Teaching large language models to reason with reinforcement learning. arXiv preprint arXiv:2403.04642.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.

Matthew Douglas Hoffman, Du Phan, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, and Rif A Saurous. 2024. Training chain-of-thought via latent-variable inference. Advances in Neural Information Processing Systems, 36.

Jian Hu, Li Tao, June Yang, and Chandler Zhou. 2023. Aligning language models with offline reinforcement learning from human feedback. arXiv preprint arXiv:2308.12050.

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. 2024. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395.

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. 2022. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pages 9118–9147. PMLR.

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. 2024. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213.

Aviral Kumar, Joey Hong, Anikait Singh, and Sergey Levine. 2022. When should we prefer offline reinforcement learning over behavioral cloning? arXiv preprint arXiv:2204.05618.

Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. 2023. Remax: A simple, effective, and efficient method for aligning large language models. arXiv preprint arXiv:2310.10505.

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. arXiv preprint arXiv:2305.20050.

Guanlin Liu, Kaixuan Ji, Renjie Zheng, Zheng Wu, Chen Dun, Quanquan Gu, and Lin Yan. 2024a. Enhancing multi-step reasoning abilities of language models through direct q-function optimization. arXiv preprint arXiv:2410.09302.

Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. 2024b. Don't throw away your value model! Generating more preferable text with value-guided monte-carlo tree search decoding. In First Conference on Language Modeling.

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.

Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, et al. 2024. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592.

Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. 2017. Bridging the gap between value and policy based reinforcement learning. Advances in Neural Information Processing Systems, 30.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. 2024. Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733.

Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, and Yuandong Tian. 2024. Advprompter: Fast adaptive adversarial prompting for llms. arXiv preprint arXiv:2404.16873.

Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. 2024a. From r to q*: Your language model is secretly a q-function. arXiv preprint arXiv:2404.12358.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024b. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.

Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, et al. 2024. Offline regularised reinforcement learning for large language models alignment. arXiv preprint arXiv:2405.19107.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2020. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, et al. 2023. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. 2024. Trial and error: Exploration-based trajectory optimization for llm agents. arXiv preprint arXiv:2403.02502.

Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, and Chuang Gan. 2024. Easy-to-hard generalization: Scalable alignment beyond human supervision. arXiv preprint arXiv:2403.09472.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. 2024a. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439.

Zihan Wang, Yunxuan Li, Yuexin Wu, Liangchen Luo, Le Hou, Hongkun Yu, and Jingbo Shang. 2024b. Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision. arXiv preprint arXiv:2402.02658.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.

Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. 2024. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724.

Jiannan Xiang, Tianhua Tao, Yi Gu, Tianmin Shu, Zirui Wang, Zichao Yang, and Zhiting Hu. 2024. Language models meet world models: Embodied experiences enhance language models. Advances in Neural Information Processing Systems, 36.
Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. 2024. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint. In Forty-first International Conference on Machine Learning.

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. 2017. Breaking the softmax bottleneck: A high-rank rnn language model. arXiv preprint arXiv:1711.03953.

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. 2023. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.

Brian D Ziebart. 2010. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University.

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
because the GSM8K dataset already splits reasoning steps with line breaks.

A.3 Hyperparameters

The batch size is set to 128 for all experiments.

SFT. The Qwen2.5-Math-1.5B model is trained on the GSM8K and MATH datasets for 3 epochs with a learning rate of 2 × 10^-5. The DeepSeekMath-7B-Instruct model is already instruction fine-tuned, so we do not perform any additional SFT. The MiniCPM-2B-dpo-bf16 model is trained on the annotated dataset from Song et al. (2024) for 2 epochs with a learning rate of 2 × 10^-5.

B Proof Sketch of Theorem 1

The RLHF objective can be rewritten as

J_{\mathrm{RLHF}}(\pi) = \mathbb{E}_{\tau\sim\pi}\left[ \sum_{t=0}^{T-1} \Big( r_{\mathrm{task}}(s_t, a_t) + \beta \log \pi_{\mathrm{ref}}(a_t \mid s_t) + \beta H(\pi(\cdot \mid s_t)) \Big) \right].    (13)

So it can be viewed as maximum-entropy RL (Haarnoja et al., 2017) with reward r_task(s_t, a_t) + β log π_ref(a_t|s_t). Previous works in maximum-entropy RL (Haarnoja et al., 2017) show that the optimal policy satisfies

\beta \log \pi^*(a_t \mid s_t) = Q^*(s_t, a_t) - V^*(s_t),    (14)
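To connect Eq. (14) back to Theorem 1, one further step suffices. The derivation below is our own addition (a sketch, assuming the deterministic transition s_{t+1} = f(s_t, a_t) from Section 3.1, no discounting, and the standard soft Bellman backup for the augmented reward \tilde r(s_t, a_t) = r_{\mathrm{task}}(s_t, a_t) + \beta \log \pi_{\mathrm{ref}}(a_t \mid s_t)):

Q^*(s_t, a_t) = \tilde r(s_t, a_t) + V^*(s_{t+1}).

Substituting this into Eq. (14) gives

\beta \log \pi^*(a_t \mid s_t) = r_{\mathrm{task}}(s_t, a_t) + \beta \log \pi_{\mathrm{ref}}(a_t \mid s_t) + V^*(s_{t+1}) - V^*(s_t),

and rearranging yields

V^*(s_t) - V^*(s_{t+1}) = r_{\mathrm{task}}(s_t, a_t) - \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)},

which is the soft Bellman Equation of Theorem 1 (Eq. 3), with r there denoting the task reward.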
of the method to better understand its properties. Along with our open-sourced code, we believe our work can help the community explore best practices for LLM training. (3) Our work presents comprehensive experiments and analyses, including experiments on embodied agent control, the iterative training setting, value-guided tree search, and a comparison between implicit and explicit value functions.

We present an empirical comparison of the two methods here. Liu et al. (2024a) used Qwen2-7B-Instruct and Gemma-1.1-7B-it in their experiments. As their work is not yet open-sourced, we reproduce DQO and experiment on two math reasoning datasets. We use Qwen2.5-Math-1.5B as the base model and use the same training data. We report the results in Table 3.

D Safeguard Statement

In this paper, we primarily focus on math reasoning tasks and embodied agent control tasks in a household simulator, posing no significant ethical or harmful concerns. We recognize that future research on broader applications of multi-step reasoning may pose a risk of misuse, and we recommend careful consideration of all aspects of safety before it is applied in the real world.