
Teaching Large Language Models to Reason with Reinforcement Learning


Alex Havrilla1,2,∗ , Yuqing Du4 , Sharath Chandra Raparthy1 , Christoforos Nalmpantis1 , Jane
Dwivedi-Yu1 , Maksym Zhuravinskyi3 , Eric Hambro1,∗∗ , Sainbayar Sukhbaatar1 , Roberta Raileanu1
1 Meta, 2 Georgia Institute of Technology, 3 StabilityAI, 4 UC Berkeley

∗ Work done during Meta internship, ∗∗ Work done while at Meta

Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for align-
ing LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance
of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization
(PPO), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse
and dense rewards provided to the LLM both heuristically and via a learned reward model. We
additionally start from multiple model sizes and initializations both with and without supervised
fine-tuning (SFT) data. Overall, we find all algorithms perform comparably, with Expert Iteration
performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is
similar to that of PPO, requiring at most on the order of 10^6 samples to converge from a pretrained
checkpoint. We investigate why this is the case, concluding that during RL training models fail to
explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a
trade-off between maj@1 and pass@96 metric performance during SFT training, and how RL training,
conversely, improves both simultaneously. We then conclude by discussing the implications of our
findings for RLHF and the future role of RL in LLM fine-tuning.

Date: March 8, 2024


Correspondence: Alex Havrilla at [email protected]

1 Introduction
The reasoning abilities of large language models (LLMs) are rapidly improving as measured by their performance
on numerous math, science and code benchmarks (Cobbe et al., 2021; Hendrycks et al., 2021b; Sawada et al.,
2023; Liang et al., 2022; Srivastava et al., 2022; Rein et al., 2023; Mialon et al., 2023; Chollet, 2019; Mishra
et al., 2022; Hendrycks et al., 2021a; Austin et al., 2021; Patel et al., 2021; Gao et al., 2021). Simultaneously,
Reinforcement Learning from Human Feedback (RLHF) (Bai et al., 2022; Ziegler et al., 2019; Ouyang et al.,
2022) and instruction fine-tuning (Wei et al., 2021; Mishra et al., 2021) have made significant progress in
aligning LLMs with human preferences. Improvements in model instructability have further increased apparent
model capability by making complex behaviors more accessible via instruction prompting. This has led to a
number of increasingly sophisticated prompting strategies augmenting LLM reasoning capabilities such as
Chain-of-Thought (Wei et al., 2022) or Tree-of-Thoughts (Yao et al., 2023).
Previous work in reinforcement learning (RL) such as AlphaGo (Silver et al., 2017), AlphaStar (Vinyals et al.,
2019), and OpenAI Dota 2 (Berner et al., 2019) demonstrates that RL techniques can be used to train neural
networks capable of sophisticated planning and reasoning in game environments. Cicero (Bakhtin et al., 2022)
in particular succeeds in combining an RL trained planning agent with a dialogue fine-tuned LLM to achieve
nearly super-human performance in the board game Diplomacy. Given these previous successes and the
inherent interactive nature of problem solving, applying RL to LLM reasoning seems a natural next step. In
this paper, we study how ideas from RL can be used to improve the reasoning capabilities of LLMs across a
variety of reward schemes and model initializations.
We begin by comparing the performance of different RL algorithms on reasoning tasks τ defined as a
distribution of question-answer tuples (Q, A). The task τ can be extended to define a Markov Decision Process
(MDP) 4-tuple (S, A, P_a, R_a) where tokens serve as both actions and accumulated state, with deterministic
dynamics. By default we use a sparse reward of +1 if the final answer is correct, but also experiment with
dense rewards matching intermediate steps in a reference solution and rewards synthetically generated using a
reward model. We evaluate models with 7B and 13B parameters both starting from supervised fine-tuned
(SFT) checkpoints and pre-trained checkpoints. We report four metrics assessing model performance on a
task-specific test set: 1) maj@1 score computed by greedily sampling once per question, 2) maj@96 score
computed by sampling K = 96 times per question and majority voting on the final answer, 3) rerank@96
score computed by sampling K = 96 times and choosing the final answer using an Outcome-Based Reward
Model (ORM), and 4) pass@96 score computed by sampling the model K = 96 times and taking the best
result according to the ground truth answer.
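To make these metrics concrete, below is a minimal Python sketch of how maj@K, rerank@K, and pass@K can be scored for a single question once K final answers (and ORM scores, for reranking) have been extracted; the helper names and example values are illustrative, not taken from our evaluation code.

```python
from collections import Counter

def maj_at_k(answers: list[str], truth: str) -> bool:
    """maj@K: majority vote over the K sampled final answers."""
    vote, _ = Counter(answers).most_common(1)[0]
    return vote == truth

def rerank_at_k(answers: list[str], orm_scores: list[float], truth: str) -> bool:
    """rerank@K: pick the answer whose solution the ORM scores highest."""
    best = max(range(len(answers)), key=lambda i: orm_scores[i])
    return answers[best] == truth

def pass_at_k(answers: list[str], truth: str) -> bool:
    """pass@K: credit if any of the K samples reaches the ground truth."""
    return any(a == truth for a in answers)

# Example with K = 3 sampled answers for one question whose answer is "72".
answers, scores = ["72", "64", "72"], [0.81, 0.35, 0.77]
print(maj_at_k(answers, "72"), rerank_at_k(answers, scores, "72"), pass_at_k(answers, "72"))
```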
We find that overall the simplest method, Expert Iteration (EI) (Anthony et al., 2017), performs best across
all metrics for most reward setups and model initializations. Surprisingly, EI is nearly as sample efficient as
more sophisticated algorithms like Proximal Policy Optimization (PPO), both requiring only a few thousand
samples to converge even when initialized from a pretrained checkpoint. We also observe the gap between
pretrained model performance and SFT model performance significantly shrinks (< 10% gap on GSM8K)
after RL fine-tuning, with larger models having a smaller gap. Additionally, previous work identified a tradeoff
between test time maj@1 performance and pass@96 performance during supervised fine-tuning (Cobbe et al.,
2021), with continued training increasing maj@1 score at the expense of pass@96 score. We identify the
limited diversity of the dataset as a core reason for this. We show that RL fine-tuning can improve both
metrics simultaneously due to the fact that RL generates its own data during training, resulting in a more
diverse set of examples to learn from.
We then discuss why EI and return conditioned RL are competitive with PPO, suggesting two principal
factors. Firstly, the reasoning tasks we consider have entirely deterministic dynamics: a setting in which
direct behavior cloning and return-conditioned RL are known to do well (Brandfonbrener et al., 2022). In
contrast, PPO often succeeds in environments with a high degree of stochasticity (Bhargava et al., 2023).
Second, we identify a lack of sophisticated exploration carried out by models during RL fine-tuning. This
limitation significantly impacts any performance or sample complexity advantages PPO may have when
fine-tuning the pretrained model. We come to this conclusion from a number of observations, noting in
particular quickly saturating pass@96 scores early in RL training. We conclude with a discussion of the
impacts of our observations on RLHF and the future of LLM fine-tuning via RL.
In summary we make the following contributions:
• A comprehensive study of PPO fine-tuning of LLMs on reasoning tasks using different types of rewards,
model sizes and initializations.
• A comparison to expert iteration and return-conditioned RL from which we find expert iteration reliably
attains the best performance and competitive sample complexity across the board.
• A discussion of the implications of our findings for RLHF and the future of RL fine-tuning for LLMs,
identifying exploration as a major limiting factor.

2 Related Work
LLM Reasoning: State-of-the-art large language models (OpenAI, 2023; Touvron et al., 2023; Bai et al., 2022;
Chowdhery et al., 2022) demonstrate increasingly impressive abilities on hard reasoning tasks as studied by
a wide range of math, science, and code benchmarks (Cobbe et al., 2021; Hendrycks et al., 2021b; Sawada
et al., 2023; Liang et al., 2022; Srivastava et al., 2022; Rein et al., 2023; Mialon et al., 2023; Chollet, 2019;
Mishra et al., 2022; Hendrycks et al., 2021a; Austin et al., 2021; Patel et al., 2021; Gao et al., 2021). Chain
of thought (CoT) (Wei et al., 2022) and related techniques (Chen et al., 2022; Yao et al., 2023; Besta et al.,
2023) have emerged as dominant methods significantly boosting LLM performance on these types of tasks.
CoT methods allow LLMs to defer giving their final answer by first generating a “chain of thought” involving
the intermediate computations needed to correctly solve the problem.
Another line of work combines base LLM reasoning capabilities with planning and search algorithms to further
boost performance on a wide range of tasks (Yao et al., 2023; Besta et al., 2023; Ye et al., 2022; Yao et al.,
2022; Dohan et al., 2022). Tree of thought (Yao et al., 2023), for example, combines LLMs with a breadth-first
search algorithm, relying on the LLM to both propose actions and evaluate state. Other works combine LLMs
with tools (Schick et al., 2023; Qin et al., 2023; Zhou et al., 2023a), further boosting reasoning capability.
Combining GPT-4 with a Python code interpreter for generation and self-verification achieves an impressive
84% on the hard MATH benchmark (Hendrycks et al., 2021a; Zhou et al., 2023a).
Other works focus on LLMs for mathematical reasoning in natural language (Cobbe et al., 2021; Lewkowycz
et al., 2022; Azerbayev et al., 2023; Lightman et al., 2023; Patel et al., 2021; Zhu et al., 2023; Rafailov
et al., 2023). Particularly relevant to our study is Cobbe et al. (2021) which fine-tunes GPT-3 on supervised
math word problem (MWP) reasoning traces. In addition they train solution verifiers called Outcome-Based
Reward Models (ORMs) which predict the probability of correctly solving a question Q given a prefix of
intermediate steps Pi = (S1, ..., Si), i.e. p(is_correct(A) | Q, Pi), where A is a solution with prefix Pi. Process-
based reward models (PRMs) (Uesato et al., 2022; Lightman et al., 2023) can also be trained to instead look
at the step-level accuracy of solutions. More recent work (Luo et al., 2023) utilizes a PRM distilled from
GPT-4 feedback as a reward signal during PPO.
RL for LLM fine-tuning: Reinforcement Learning from Human Feedback (RLHF) is perhaps the most
well-known application of RL techniques for fine-tuning LLMs. RLHF (Christiano et al., 2017; Ziegler et al.,
2019; Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022; Glaese et al., 2022; Peng et al., 2021;
Ramamurthy et al., 2022) most often works by training a reward model to capture human preferences over
a task τ . The reward model is then used to score LLM responses to prompts from the task after which
policy improvement is performed. PPO is most often used (Ouyang et al., 2022; Bai et al., 2022), but several
recent works including ReST (Gulcehre et al., 2023), Reward-Ranked Fine-tuning (Dong et al., 2023), and
AlpacaFarm (Dubois et al., 2023) all demonstrate that simply fine-tuning on high-return responses with the
standard cross-entropy loss can attain comparable performance. We broadly refer to this class of algorithms
as Expert Iteration.
A large body of work studying RL for LLM fine-tuning also exists outside of the RLHF sphere. Work on text
games (Yao et al., 2020; Ammanabrolu and Riedl, 2019) and other interactive textual environments (Zhou
et al., 2023b; Carta et al., 2023) seeks to ground LLMs via interaction and RL. RL has also been applied to
improving model performance on controllable generation and question answering tasks (Lu et al., 2022; Liu
et al., 2022). Various forms of expert iteration have also been applied to improve LLM reasoning capabilities
(Huang et al., 2022; Yuan et al., 2023; Zelikman et al., 2022; Uesato et al., 2022). For example “Scaling
Relationship on Learning Mathematical Reasoning with Large Language Models” (Yuan et al., 2023) applies
a single round of expert iteration across multiple model sizes on GSM8K. They observe sizeable gains in all
metrics for smaller models, with gains diminishing for larger models. A related body of work studies RL
for code generation (Le et al., 2022; Shen et al., 2023; Rozière et al., 2023). Shen et al. (2023) in particular
reports a huge increase in StarCoder’s (Li et al., 2023) maj@1 performance after a single round of expert
iteration, jumping from ∼30% to ∼60%.
Despite all the above work, it remains unclear exactly what factors account for the biggest impact during RL
fine-tuning due to wide variance in tasks, pretraining data, supervised fine-tuning data, RL algorithm used,
and the reward source. Our work conducts a thorough analysis of all these factors to understand exactly how
different algorithms compare when applied to improving LLM reasoning capability. As a result we are able to
identify key bottlenecks to further LLM improvement via RL and provide a discussion on promising future
directions.

3 Methods
Reasoning as an RL problem
We study the performance and sample complexity requirements for various RL algorithms when fine-tuning
LLMs on reasoning tasks. We consider Expert Iteration (EI) (Anthony et al., 2017), Proximal Policy
Optimization (PPO) (Schulman et al., 2017), and Return-Conditioned RL (RCRL) (Brandfonbrener et al.,
2022) as representative algorithms from the RL literature. In general, the goal of all RL algorithms is to
maximize the expected future return E_{A∼π(Q), (Q,·)∈τ}[R(A)] of a student policy π on task τ. We call the highest-
return policy the optimal policy π∗. Each of our chosen algorithms goes about finding π∗ in a different way.

PPO is an example of an online RL algorithm. Online algorithms engage in both an exploration phase and a
policy improvement phase which updates πθ using data generated during the exploration phase. PPO is also
an on-policy algorithm which samples model rollouts during exploration from the student policy πθ being
trained. During policy improvement, the student πθ updates its parameters via gradient descent by directly
maximizing for reward with the objective

 
J(θ) = E_t[ min( r_t(θ) Â_t , clip(r_t(θ), 1 − ϵ, 1 + ϵ) Â_t ) ],   where r_t(θ) = π_θ(a_t | s_t) / π_old(a_t | s_t)

and Â_t estimates the advantage, i.e. the difference between Q(s, a) (the expected return after taking action a at
state s) and the value V(s) (the expected return at state s).
In practice, for PPO we sample 1024 rollouts at a time with a temperature of 0.7 and N = 4 rollouts per
question. Training is then run on these samples for K = 4 PPO epochs with a batch size of 256. Additionally,
we train using LoRA (Hu et al., 2021) with r = 128. Training is run for 4000 gradient steps. The best
checkpoint is then selected via performance on a validation set.
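As a concrete reference, the following is a minimal PyTorch sketch of the clipped surrogate objective above, operating on per-token log-probabilities and advantage estimates; it omits the value loss, KL penalty, and LoRA details of our actual setup, and the tensor values are purely illustrative.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Negative clipped PPO objective J(θ) for a batch of token-level actions."""
    ratio = torch.exp(logp_new - logp_old)                      # r_t(θ) = π_θ / π_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # minimize the negative objective

# Toy usage: five actions with stale rollout log-probs and advantage estimates Â_t.
logp_old = torch.tensor([-1.2, -0.7, -2.1, -0.9, -1.5])
logp_new = (logp_old + 0.1 * torch.randn(5)).requires_grad_(True)
advantages = torch.tensor([0.5, -0.3, 1.2, 0.0, -0.8])
ppo_clip_loss(logp_new, logp_old, advantages).backward()        # gradients flow into logp_new
```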
Expert iteration is also online but more off-policy than PPO. An initial expert policy approximation π̂0∗
is sampled on the entire train set K times per question before any policy improvement. The π̂0∗ is often
constructed using repeated sampling from an initial policy π0 . For example, AlphaZero (Silver et al., 2017)
and subsequent work (Schick et al., 2023) combine π0 with Monte Carlo Tree Search. Sampling π̂0∗ constructs
an initial set of rollouts D1 which are then distilled back into a policy π1 via a standard cross-entropy
loss: Σ_{τ∈D1} Σ_{t=1}^{H} −log(π_θ(a_t | s_t)). This process can be repeated to construct policy πi fine-tuned on dataset
Di = Ri ∪ Di−1, where Ri corresponds to the exploration done by πi−1.
In our setting we construct an approximation to the optimal policy π̂∗ by rejection sampling our student
policy πθ. After generating K samples S1, ..., SK for a question Q, we construct D1 by filtering out all (Q, Si)
pairs with return below a threshold T. De-duplication is then performed on the remaining samples.
In practice, during the expert iteration exploration phase we sample each question in the train set K = 96
times with temperature T = 1.0. To construct the training set we filter out incorrect solutions and duplicates.
Importantly, fine-tuning is then done from the pretrained base model with the same hyperparameters as SFT.
This is repeated until performance on a validation set saturates.
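The loop below sketches this procedure in Python; `sample_fn`, `is_correct_fn`, and `finetune_fn` are placeholders for generation, answer checking, and cross-entropy fine-tuning (they are not functions from our codebase), and the structure simply mirrors the sample-filter-distill cycle described above.

```python
def ei_exploration_round(sample_fn, is_correct_fn, questions, k=96, temperature=1.0):
    """One exploration round: sample K solutions per question, keep only the correct
    ones (return above the threshold), and de-duplicate exact string matches."""
    dataset = []
    for q in questions:
        solutions = {sample_fn(q, temperature) for _ in range(k)}      # set() de-duplicates
        dataset.extend((q, s) for s in solutions if is_correct_fn(q, s))
    return dataset

def expert_iteration(finetune_fn, sample_fn, is_correct_fn, questions, rounds=2):
    """Alternate exploration and distillation. Each distillation restarts from the
    pretrained base model on the accumulated dataset D_i = R_i ∪ D_{i-1}."""
    data = []
    for _ in range(rounds):
        data += ei_exploration_round(sample_fn, is_correct_fn, questions)
        sample_fn = finetune_fn(data)   # returns a sampler for the newly distilled policy π_i
    return sample_fn
```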
Return Conditioned RL Return conditioned RL algorithms seek to train policies conditioned on both the
current state s and desired return R when sampling an action. This is motivated by a desire to learn return
conditionable policies which can change depending on the desired return. Best performance can then be
sampled by conditioning on the highest possible return.
We consider an offline version of this class of algorithms similar to a decision transformer (Chen et al.,
2021). A training dataset D is constructed by generating state, action, return trajectories τ = ((s_t, a_t, g_t))_{t=1}^{H}.
Training is done by predicting the action given state and return: Σ_{τ∈D} Σ_{t=1}^{H} −log(π_θ(a_t | s_t, g_t)). In practice,
we construct D by sampling solutions S = (S1, ..., SL), where each Si is an intermediate step, from our best
EI-trained policy πEI given a question Q. We generate return labels for each step Si by sampling πEI K many
times from the prefix Pi = (S1, ..., Si). This results in binary labels l1, ..., lK evaluating the correctness of the
generated final answers. Si is then labeled as “[GOOD]” if the average return (1/K) Σ_{k=1}^{K} l_k ≥ T, and otherwise
is labeled as “[BAD]”. Typically we set T = 0.5. We then filter the dataset to ensure a balanced number of correct
and incorrect solutions. See Section F in the appendix for more details about the step-label generating process.
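A short sketch of the step-labeling procedure is given below; `sample_completion_fn` and `is_correct_fn` stand in for rollout generation from a prefix and final-answer checking, and are assumptions rather than functions from our code.

```python
def label_steps(sample_completion_fn, is_correct_fn, question, steps, k=96, threshold=0.5):
    """Label each intermediate step S_i as "[GOOD]" or "[BAD]" by rolling out K
    completions from the prefix (S_1, ..., S_i) and averaging final-answer correctness."""
    labeled = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        completions = [sample_completion_fn(question, prefix) for _ in range(k)]
        avg_return = sum(is_correct_fn(question, c) for c in completions) / k
        labeled.append(("[GOOD]" if avg_return >= threshold else "[BAD]", steps[i - 1]))
    return labeled
```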
Outcome-Based Reward Modeling Multiple works (Cobbe et al., 2021; Uesato et al., 2022) train Outcome-
Based Reward Models (ORMs) as verifiers of candidate solutions to word problems. The ORM can then be
used to rerank multiple candidate solutions generated by a student model, significantly boosting performance.
Training data for the ORM is generated using a student policy π by sampling K solutions per question Q in
the task dataset. The ORM is trained as a classifier by predicting the probability of reaching the correct final
answer p(is_correct(A) | Q, Pi) from an intermediate sequence of steps Pi = (S1, ..., Si), Pi ⊆ A = (S1, ..., SL).
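The sketch below illustrates the two ingredients: prefix-level training examples that inherit the correctness of the final answer, and a binary classification head on top of an LLM hidden state. The head and dimensions are simplified stand-ins, not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

def orm_examples(question, steps, final_correct):
    """Every prefix P_i = (S_1, ..., S_i) of a sampled solution gets the label of
    whether the solution's final answer turned out to be correct."""
    return [((question, steps[:i]), float(final_correct)) for i in range(1, len(steps) + 1)]

class ORMHead(nn.Module):
    """Predicts p(is_correct(A) | Q, P_i) from an encoding of the question + prefix."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.score(hidden_state)).squeeze(-1)

# Toy usage with fake 16-dim encodings of two prefixes (one correct, one incorrect).
orm = ORMHead(hidden_dim=16)
probs = orm(torch.randn(2, 16))
loss = nn.functional.binary_cross_entropy(probs, torch.tensor([1.0, 0.0]))
```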

                 maj@1        maj@96       rerank@96†   pass@96
                 7B    13B    7B    13B    7B    13B    7B    13B
SFT              0.41  0.48   0.47  0.53   0.54  0.68   0.72  0.84
EIn              0.48  0.53   0.55  0.59   0.64  0.71   0.8   0.88
ORM EIn          0.48  0.53   0.54  0.58   0.65  0.71   0.81  0.87
ORM RCRL         0.45  0.51   0.5   0.56   0.54  0.69   0.73  0.83
Sparse PPO       0.44  0.51   0.49  0.55   0.58  0.67   0.77  0.85
Dense PPO        0.43  0.50   0.47  0.54   0.53  0.65   0.71  0.81
Sparse ORM PPO   0.46  0.51   0.51  0.55   0.59  0.67   0.79  0.83
Dense ORM PPO    0.46  0.51   0.52  0.55   0.59  0.67   0.76  0.83
Llemma∗          0.40  0.62   0.54  0.69   N/A          N/A
RFT              0.47  0.54   0.58  0.65   N/A          N/A
WizardMath       0.55  0.64   N/A          N/A          N/A
GPT-3∗∗          0.2   0.31   N/A   0.39   0.55  0.71   N/A
GPT-4∗∗∗         0.91         N/A          N/A          N/A

Table 1 Results when initializing from SFT. EIn denotes n rounds of expert iteration until convergence with n = 2 for
7B and n = 2 for 13B. † Note all reranking is done using an ORM trained with samples from EIn . Results from other
works are included on the bottom for reference. N/A stands for not available. ∗ Llemma results reported for 7B/34B
sizes without fine-tuning. ∗∗ GPT-3 results reported for 7B/175B sizes. ∗∗∗ GPT-4 size unknown.

4 Experiments
We conduct our evaluations on GSM8K and SVAMP (Patel et al., 2021): two math word problem benchmarks.
In addition, on GSM8K we consider two data regimes: first with SFT data and then without SFT data. We
evaluate all models using greedy sampling (maj@1) accuracy as well as majority vote at 96 samples (maj@96),
ORM-based reranking at 96 samples (rerank@96), and best-of-96-samples (pass@96) accuracy. Unless otherwise
specified, test-time sampling is done greedily for maj@1 and with a temperature of 0.7 otherwise. We sample
the RCRL models one step/line at a time, conditioning on the “[GOOD]” token. We note that while the notion of
a “step” is not clearly defined in general, in our case we can simply regard each step as ending with a sentence
or newline. All experiments are done using instruction-tuned Llama-2 7B and Llama-2 13B models.

4.1 Results with SFT Initialization


When given access to SFT data, we first supervise fine-tune Llama-2 models for 4 epochs with a global batch
size of 128 and an initial lr of 2e-5 decayed to 2e-7 with a cosine warmup schedule. We call the resulting
models SFT. When fine-tuning with PPO we initialize using this checkpoint. In contrast, for both EI and
RCRL we generate data with the SFT checkpoint but reset training to start from the pretrained base model.
Similarly to Zelikman et al. (2022), we find this model resetting is crucial for achieving best performance.
Results for both 7B and 13B models are reported in Table 1.
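For reference, a minimal sketch of the stated learning-rate schedule (cosine decay from 2e-5 to 2e-7 after a warmup phase; the warmup length here is our assumption, not a value from the experiments):

```python
import math

def sft_lr(step, total_steps, warmup_steps=100, lr_max=2e-5, lr_min=2e-7):
    """Linear warmup to lr_max, then cosine decay to lr_min over the remaining steps."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```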
Expert iteration achieves the best performance with competitive sample complexity
Surprisingly, we find EI achieves the best performance with a maj@1 accuracy of 0.485 and 0.53 on 7B and 13B
models respectively. For both model sizes the best greedy accuracy is achieved after n = 2 expert iterations
(see Fig. 2), after which performance plateaus. In total, EI gives a sizable improvement of around 7% over
the SFT baseline. Similar gains can be seen in maj@96, rerank@96, and pass@96 scores.
PPO models underperform EI, with ORM guided PPO giving the biggest improvement of around 5% over the
SFT baseline. Again, maj@96, rerank@96, and pass@96 accuracies show similar improvements. Interestingly,
despite further training on top of the SFT initialization, PPO models retain competitive rerank@96 and
pass@96 scores compared to the regression we see after further supervised fine-tuning. We believe this is
due to the relatively more diverse nature of the exploration dataset used to update the model.
Finally, RCRL models under-perform EI models despite training on EI generated data with an even balance between ‘[GOOD]’ and ‘[BAD]’ step labels.

Figure 1 Sample complexities of SFT initialized models on GSM8K. EI achieves better performance than PPO with the same order of magnitude of samples.
Figure 2 Accuracy of EI models on GSM8K test vs. number of iterations. Performance plateaus for SFT initialized models after two iterations. The pretrained checkpoints converge after four iterations.

This matches similar results from Du et al. (2023), which use only sparse labels for the entire rollout.
Further, when sampling the RCRL model unconditionally, the model often generates perfectly valid steps
following a ‘[BAD]’ label, resulting in a correct final answer. These results suggest RCRL models are not
correctly learning what constitutes a ‘[GOOD]’ versus a ‘[BAD]’ step, and are unable to usefully incorporate
information from partially correct solutions at train time. An ablation (see Sec. A of the appendix) on the
ratio of positive to negative labels finds a balanced ratio yields the worst performance, with increasing the
amount of positive data leading to better results.
In Figure 1 we plot the number of model rollouts against model performance in log-scale. PPO models achieve
their best accuracies after around 60,000 rollouts while EI models train with an order of magnitude more.
However, the resulting train time in both cases is about a day. This is largely due to the memory requirements
of PPO, which result in lower rollout throughput and smaller mini-batch sizes at train time. Additionally, in
the SFT case we did not experiment with reducing the number of samples from K = 96 per question for EI.
However, we expect this number can be significantly reduced without impacting performance. For a more
thorough investigation of sample complexity requirements, see Figure 5.
Extra guidance from ORMs or dense rewards provides little benefit Overall, the ORM slightly improves
PPO performance and negligibly impacts EI performance. For both algorithms it provides an improvement in
terms of sample complexity. However, this does not change final performance. See Figures 3 and 4, which plot
performance against the number of model rollouts for different reward regimes.
Giving dense rewards at best provides no extra benefit to performance, whether given heuristically or
via the ORM. Giving a heuristic dense reward even slightly harms model performance relative to the sparse
setting. Recall we give intermediate rewards by comparing intermediate model-generated steps to the reference
solution. This likely encourages overfitting to exact solutions in the train set, limiting solution diversity.
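One way such a heuristic could be implemented is sketched below: a small per-step bonus for generated steps that exactly match the corresponding reference step, added to the sparse final-answer reward. The bonus value and the exact-match criterion are illustrative assumptions rather than the precise scheme used in our experiments.

```python
def heuristic_dense_reward(model_steps, reference_steps, final_correct, step_bonus=0.1):
    """Bonus for each generated step matching the reference step at the same position,
    plus the sparse +1 reward if the final answer is correct."""
    matched = sum(1 for gen, ref in zip(model_steps, reference_steps)
                  if gen.strip() == ref.strip())
    return step_bonus * matched + (1.0 if final_correct else 0.0)
```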
RL improves maj@1 accuracy without impacting pass@96 performance Looking at the pass@96 accuracies
more closely, we see most similarly sized models are within 3% of the best result. This demonstrates that, with
enough sampling, most models are able to solve a very similar range of problems. Further, while the pass@96
accuracy of our best EI model initially seems much higher than the SFT checkpoint, this is only because the
SFT checkpoint has undergone much more training on a less diverse dataset. Simply supervised fine-tuning
for half as many steps results in a checkpoint with maj@1 = 0.36 but pass@96 = 0.76. This further suggests
RL training mostly impacts maj@1 accuracy without significantly improving on a pass@n accuracy which can
be achieved with a light amount of supervised fine-tuning.
The proximity of pass@96 accuracies among most models is in sharp contrast to the rerank@96 performance.
Here we find EI models enjoy around a 5% lead over other models. At first glance this seems contradictory
given the relatively similar pass@96 performance.
Figure 3 maj@1 scores of EI and ORM aided EI models over the course of training. The ORM improves sample efficiency but not performance.
Figure 4 maj@1 scores of PPO and ORM guided PPO models over the course of training. As with EI models, the ORM improves sample efficiency but not performance.

                 maj@1        maj@n        rerank@n†    pass@n
                 7B    13B    7B    13B    7B    13B    7B    13B
Prompted         0.05  0.03   0.14  0.18   0.17  0.24   0.22  0.27
EIn              0.31  0.4    0.35  0.47   0.39  0.63   0.45  0.83
ORM EI           0.28  0.37   0.33  0.43   0.37  0.59   0.42  0.76
Sparse PPO       0.32  0.41   0.37  0.48   0.41  0.65   0.5   0.83
Sparse ORM PPO   0.29  0.38   0.34  0.44   0.4   0.62   0.49  0.81
Dense ORM PPO    0.29  0.39   0.35  0.45   0.41  0.64   0.5   0.82

Table 2 Results for 7B/13B models when not using SFT initialization on GSM8K. Sparse PPO performs slightly
better than EI in this setting. † Note all reranking is done using an ORM trained with samples from the EIn model.

However, we believe a non-trivial percentage of this gap is due to overfitting of the ORM to the EI model,
which was used to generate its training data.

4.2 Results with no SFT Initialization


We now consider the case when no SFT data is available for training. For questions in both SVAMP and
GSM8K we give pretrained models access to a two-shot prompt with samples drawn from the GSM8K
validation set. For EI models, we remove these prompts after the first round of exploration, instead relying on
the generated SFT data. As in the case with SFT data, we run both algorithms until performance saturates.
For PPO this happens after 250 steps on SVAMP and 1000 steps on GSM8K. For EI, this happens after n = 5
rounds of exploration and distillation. Results on both datasets are reported in Tables 2 and 3.
EI achieves the best performance overall Even without SFT data, EI achieves the best performance on
SVAMP, improving 7B/13B pretrained greedy model accuracies by over 50 percentage points, from 0.06/0.05 to
0.58/0.69, respectively. PPO performs slightly better than EI on GSM8K, improving from 0.05/0.03 to 0.31/0.4. Both
algorithms achieve comparable pass@96 scores across model sizes, further supporting our observations from
the SFT regime that EI mostly improves maj@1 scores relative to PPO. The prompted 13B model on GSM8K
even attains 0.83 pass@96 accuracy which is close to the 0.84 pass@96 score achieved by the SFT model,
despite having no access to SFT data itself.
EI has the same sample complexity as PPO As before we plot the reward versus number of model rollouts
for PPO and EI in Figures 5 and 6. On GSM8K PPO models attain their best maj@1 accuracies after only
30,000 rollouts and on SVAMP even less. Surprisingly, EI models have the same sample complexity as PPO on SVAMP, requiring more samples to converge but also converging to a much higher accuracy.

                 maj@1        maj@n        rerank@n†    pass@n
                 7B    13B    7B    13B    7B    13B    7B    13B
Prompted         0.06  0.05   0.2   0.25   0.24  0.29   0.3   0.36
EIn              0.58  0.69   0.6   0.75   0.62  0.78   0.70  0.93
Sparse PPO       0.44  0.51   0.55  0.66   0.58  0.73   0.72  0.89
Sparse ORM PPO   0.43  0.51   0.52  0.64   0.54  0.71   0.65  0.85
Dense ORM PPO    0.44  0.52   0.51  0.63   0.55  0.73   0.67  0.85

Table 3 Results for 7B/13B models when not using SFT initialization on SVAMP. EIn denotes the best EI model
after n iterations. EI outperforms PPO.

Figure 5 Sample complexities on GSM8K from pretrained initialization.
Figure 6 Sample complexities on SVAMP. Surprisingly, EI appears nearly as sample efficient as PPO.

EI still appears to have higher sample complexity on GSM8K; however, as noted before, this may be due to
oversampling each prompt during the exploration phase. To test this, we reduce the number of samples per
prompt in each round of EI from K = 96 to K = 4. The resulting EI models require more iterations to converge
but far fewer total samples, converging to an accuracy only a few percentage points lower than with K = 96
samples per prompt. With K = 4 rollouts per prompt, EI has the same sample complexity as PPO on GSM8K.
This is a particularly surprising finding when compared to the performance of EI and PPO on more classical
RL problems where a neural network is trained from scratch, settings in which PPO often enjoys far better sample complexity.
One major difference here is the initialization of our student from a pretrained model, which
imparts a very strong bias on the kind of behaviors and exploration encountered during RL training. Both
the extremely small sample complexity and the comparability of EI and PPO in this setting provide more
evidence that models are not truly engaging in complex exploration, but instead primarily drawing on what
they already know from the pre-training phase.

4.3 Implementation Details


It is well known that RL training can be quite sensitive to architectural and hyperparameter choices. This is
even more so the case for LLM fine-tuning. In this section we ablate and discuss the factors we found most
important in our tasks.
PPO model architecture and training parameters To save memory we use a joint architecture for the PPO
policy and value heads. We found it important to use a relatively large value branch (L=4 transformer
layers) and detach the gradients coming from the value branch to the policy trunk. Without detachment we
found value gradients interfere with policy gradients, as similarly observed in Stiennon et al. (2020), causing
instability with a big update to either branch. See Figure 7, which compares the maj@1 score of a student with a
large value branch and detached value gradients versus the default.
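A minimal PyTorch sketch of such a joint architecture is shown below; the trunk, layer sizes, and value-branch shape are toy stand-ins for the actual 7B/13B models, but it illustrates the key detail of detaching the trunk output before the value branch.

```python
import torch
import torch.nn as nn

class PolicyValueModel(nn.Module):
    """Shared trunk with a policy head and a deeper value branch. The value branch
    reads a detached copy of the trunk output, so value gradients never reach the trunk."""
    def __init__(self, trunk: nn.Module, hidden_dim: int, vocab_size: int, value_layers: int = 4):
        super().__init__()
        self.trunk = trunk
        self.policy_head = nn.Linear(hidden_dim, vocab_size)
        self.value_branch = nn.Sequential(
            *[nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
              for _ in range(value_layers)],
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor):
        h = self.trunk(x)                          # (batch, seq, hidden)
        logits = self.policy_head(h)               # policy gradients flow into the trunk
        values = self.value_branch(h.detach())     # detach: no value gradient into the trunk
        return logits, values.squeeze(-1)

# Toy usage: an identity trunk standing in for the LLM backbone.
model = PolicyValueModel(nn.Identity(), hidden_dim=16, vocab_size=100)
logits, values = model(torch.randn(2, 8, 16))
```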

Figure 7 maj@1 performance of PPO fine-tuned models against architectural changes. Note, we initialize training from a 7B SFT model with maj@1 = 0.29.
Figure 8 Best K of N sampling parameters versus maj@1 score during training. K=4, N=4 yields a fast runtime and best performance.

Low rank adaptation (LoRA) (Hu et al., 2021) with rank r = 128 helped significantly to further stabilize full-
layer fine-tuning while still maintaining performance (Sun et al., 2023). A large enough batch size (BS = 256)
and a small lr = 1e-6 also helped with stabilization. We additionally experimented with a partial fine-tune of
only the top M layers. This saved memory but at the cost of a few percentage points of performance.
We also found a non-trivial KL penalty of 0.05 to be critical for preventing model collapse after more than a
hundred gradient updates. This is in contrast to Bai et al. (2022) who do not see a significant need for the
KL constraint. We attribute its importance here to the somewhat unnatural distribution of text found in
the reasoning tasks, which consist of broken natural language and computations enclosed in <<x+y=z>> tags.
For tasks with distributions closer to pure natural language dialogue, such as those considered in Bai et al.
(2022), the KL constraint seems less necessary.
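For reference, one standard way to apply such a penalty is per-token reward shaping against the initial policy, sketched below; this is the common RLHF formulation, and the exact shaping used in our training code may differ in detail.

```python
import torch

def kl_shaped_rewards(task_rewards: torch.Tensor, logp_policy: torch.Tensor,
                      logp_ref: torch.Tensor, beta: float = 0.05) -> torch.Tensor:
    """Subtract beta * (log π_θ - log π_ref) per token from the (usually sparse) task reward."""
    return task_rewards - beta * (logp_policy - logp_ref)

# Toy usage: a 4-token completion with a sparse +1 reward on the final token.
task_r = torch.tensor([0.0, 0.0, 0.0, 1.0])
shaped = kl_shaped_rewards(task_r,
                           logp_policy=torch.tensor([-1.0, -0.5, -2.0, -0.3]),
                           logp_ref=torch.tensor([-1.1, -0.6, -1.8, -0.4]))
```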
Sampling parameters affect exploration We found the best temperature to use for good exploration during
PPO training heavily depends on the initialization. When starting from an SFT checkpoint we choose T
= 0.7. However, sampling on a high temperature when starting from the pretrained prompted model often
results in collapse. In these cases we choose a low temperature (T = 0.2). Potentially better results for PPO
could likely be achieved by annealing the exploration temperature over the course of training. We similarly
experimented with the sampling temperature used during exploration in EI, ultimately deciding on T = 1.0
to maximize solution diversity without sampling too many degenerate solutions.
We also experimented with best K of N (KoN) sampling during PPO training to promote more solution
diversity. In this setup the K highest-reward samples of N rollouts from a single prompt are kept for training
and the rest are discarded. Choosing parameters K ≪ N prioritizes high-reward samples and discards low-
reward ones, resulting in a training distribution more similar to the curated EI dataset.
However, one important consideration is the impact of the K/N ratio on training time and sample complexity,
with smaller ratios taking proportionally longer. For example, K=1, N=8 takes 8 times as long as the default
K=1, N=1. Further, we ultimately found little benefit to small K/N ratios, with most configurations yielding
decreased performance over K=1, N=1. In practice we found setting K=4, N=4 worked best. See Figure 8
which compares the performance of various choices of K and N.
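A sketch of this filtering step, with `sample_fn` and `reward_fn` as placeholder callables for rollout generation and scoring (not functions from our codebase):

```python
def best_k_of_n(sample_fn, reward_fn, prompt, k=4, n=4, temperature=0.7):
    """Draw N rollouts for a prompt and keep the K with the highest reward for training.
    With K = N (the setting we found best) nothing is discarded; K << N skews the
    training distribution toward high-reward samples, closer to the curated EI dataset."""
    rollouts = [sample_fn(prompt, temperature) for _ in range(n)]
    rollouts.sort(key=lambda r: reward_fn(prompt, r), reverse=True)
    return rollouts[:k]
```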
Model size and initialization affect exploration We found both the quality of the student initialization and
the size of the student significantly affected the type of exploration engaged in during training. In particular
larger models engaged in more diverse exploration while models with worse generalization engaged in less
diverse exploration (see Appendix Section B). This in turn directly impacts model performance when trained
on exploration data, with models engaging in more diverse exploration improving more from RL training.

        maj@1   maj@96   rerank@96   pass@96
SFT2    0.36    0.45     0.53        0.76
SFT4    0.41    0.47     0.54        0.72
PPO2    0.43    0.48     0.59        0.8
PPO4    0.44    0.49     0.58        0.77

Table 4 Results for full supervised fine-tune (SFT4 ), half supervised fine-tune (SFT2 ) and their PPO fine-tunes.
Fine-tuning for only two epochs gets pass@96 = 0.76. This decreases to 0.72 with two additional epochs of fine-tuning.

To further examine the observations about overfitting, we supervise fine-tune a Llama-2-7B model for half as
many steps as the SFT model reported in Table 1. We call the model trained for four epochs SFT4 and the
model trained for two epochs SFT2. Despite half the training, SFT2 has similar rerank@96 and superior
pass@96 scores to SFT4, with the main difference being the maj@1 accuracies. When sampled K = 96 times on
each train prompt, SFT2 produces on average 3.7 unique correct solutions compared to SFT4, which produces
2.9 unique correct solutions. We also find SFT2 benefits significantly more from RL fine-tuning than SFT4,
jumping from maj@1 = 0.36 to maj@1 = 0.43. It is important to note some of this improvement also happens
with continued SFT training, however at the cost of model output diversity and pass@96 performance.
We believe RL fine-tuning is less prone to overfitting when compared to static SFT fine-tuning precisely
because of the exploration process which generates its own training data. This results in more diverse
solution paths than the SFT training set, ameliorating overfitting. This is also in line with recent work that found
RLHF to result in better (out-of-distribution) generalization than SFT on summarization and instruction
following tasks (Kirk et al., 2023). This benefit can be seen for both PPO and EI, which have almost 10%
pass@96 improvement over continued SFT (yet a much smaller pass@96 improvement over a light SFT). To
support this hypothesis we plot the solution accuracies and diversities of EI models over each iteration in
Figures 10 and 12, respectively. Figure 12 also shows larger models generate more diverse solutions.

5 Discussion and Conclusions


Our study resulted in the following findings:
1. All the tested RL algorithms perform similarly on reasoning tasks, with Expert Iteration performing
best in most cases.
2. Both EI and PPO converge relatively quickly even without supervised fine-tuning, requiring only ∼60,000
model rollouts.
3. Neither algorithm benefits significantly from ORM guidance or a denser reward.
4. EI and PPO fine-tuning simultaneously improve maj@1 and pass@n scores, in contrast with SFT.
The improvement of both maj@1 and pass@n performance noted above is due to the ability of online RL
algorithms to dynamically grow diverse sets of training examples via synthetic data generation. This allows
for longer training/more gradient updates on the same model without adversely impacting output diversity
and pass@n scores. In contrast, SFT training occurs on a static dataset. This limits how much training can
occur before maj@1 overfit occurs and output diversity suffers. However, RL training does not significantly
improve pass@n score beyond what can be achieved with light supervised fine-tuning. This suggests even
with RL training our best models are not discovering solutions beyond what can be discovered with (light)
supervised fine-tuning given the same rollout budget.
This observation, taken together with the fast convergence of both online algorithms and the low impact of
ORM guidance and dense rewards, suggests models are not engaging in a significant amount of exploration
beyond pretraining/SFT data. Regardless of the type of algorithm used or the quality of the reward, all
student models engage in similar exploration, resulting in similar performance.

Crucial in our setting is the usage of a pretrained model imparting a strong exploration prior. Without such a
prior, exploration in a high-dimensional textual action space would be impossible. However, this prior also
appears to constrain the exploration engaged in at the beginning of training, with additional SFT training
only making things worse. We view the discovery of new techniques encouraging complex, rich exploration of
reasoning problems as fundamental to progress in LLM reasoning capability. More sophisticated prompting
strategies such as Tree of Thought (Yao et al., 2023) and combining LLM generative abilities with evolutionary
algorithms (Lehman et al., 2022) have already begun to make progress in this direction.
In addition to the limited exploration noted above, we also note reasoning environments are entirely deter-
ministic. This is a setting in which EI and RCRL algorithms are already known to work well theoretically
(Brandfonbrener et al., 2022). PPO enjoys more of an advantage in environments with a high degree of stochasticity.
We also note prior work in RLHF finds PPO outperforms EI-type approaches in human preference satisfaction
and instruction following (Gulcehre et al., 2023; Dubois et al., 2023; Kirk et al., 2023). Importantly, in our
setting we always have a reliable ground truth reward to optimize. However, in RLHF, models must optimize
against an unreliable reward model, often resulting in over-optimization (Gao et al., 2022). The relatively
superior performance of PPO over EI on RLHF tasks versus reasoning tasks suggests PPO better mitigates
such over-optimization. This is not too surprising since PPO training penalizes student models for diverging
from the initial policy via both its clipped objective and an additional KL constraint. In contrast, EI training
has no such protection built in.

References
Prithviraj Ammanabrolu and Mark Riedl. Playing text-adventure games with graph-based deep reinforcement learning.
In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 3557–3565, Minneapolis, Minnesota, June 2019. Association for Computational
Linguistics. doi: 10.18653/v1/N19-1358. URL https://aclanthology.org/N19-1358.
Thomas W. Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In
Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/CorpusID:19449905.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang,
Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models.
ArXiv, abs/2108.07732, 2021. URL https://api.semanticscholar.org/CorpusID:237142385.
Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng,
Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. ArXiv, abs/2310.10631,
2023. URL https://api.semanticscholar.org/CorpusID:264172303.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, Andy Jones, Anna Chen, Anna
Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez,
Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, E Perez, Jamie Kerr, Jared Mueller, Jeff Ladish, J Landau,
Kamal Ndousse, Kamile Lukovsiūtė, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noem’i
Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El
Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, T. J. Henighan, Tristan Hume,
Sam Bowman, Zac Hatfield-Dodds, Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B.
Brown, and Jared Kaplan. Constitutional ai: Harmlessness from ai feedback. ArXiv, abs/2212.08073, 2022. URL
https://api.semanticscholar.org/CorpusID:254823489.
Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan
Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis,
Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak,
Alexander Wei, David Wu, Hugh Zhang, and Markus Zijlstra. Human-level play in the game of Diplomacy by
combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022. doi: 10.1126/science.
ade9097. URL https://www.science.org/doi/abs/10.1126/science.ade9097.
Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David
Farhi, Quirin Fischer, Shariq Hashme, Christopher Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub W.
Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter,
Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale

deep reinforcement learning. ArXiv, abs/1912.06680, 2019. URL https://api.semanticscholar.org/CorpusID:
209376771.
Maciej Besta, Nils Blach, Alevs Kubivcek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann,
Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate
problems with large language models. ArXiv, abs/2308.09687, 2023. URL https://api.semanticscholar.org/
CorpusID:261030303.
Prajjwal Bhargava, Rohan Chitnis, Alborz Geramifard, Shagun Sodhani, and Amy Zhang. Sequence modeling is a robust
contender for offline reinforcement learning. ArXiv, abs/2305.14550, 2023. URL https://api.semanticscholar.
org/CorpusID:258866105.
David Brandfonbrener, Alberto Bietti, Jacob Buckman, Romain Laroche, and Joan Bruna. When does return-
conditioned supervised learning work for offline reinforcement learning? ArXiv, abs/2206.01079, 2022. URL
https://api.semanticscholar.org/CorpusID:249282285.
Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer. Grounding
large language models in interactive environments with online reinforcement learning. ArXiv, abs/2302.02662, 2023.
URL https://api.semanticscholar.org/CorpusID:256615643.
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, P. Abbeel, A. Srinivas, and Igor
Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Neural Information Processing
Systems, 2021. URL https://api.semanticscholar.org/CorpusID:235294299.
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling
computation from reasoning for numerical reasoning tasks. ArXiv, abs/2211.12588, 2022.
François Chollet. On the measure of intelligence. ArXiv, abs/1911.01547, 2019. URL https://api.semanticscholar.
org/CorpusID:207870692.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko,
Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily
Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari,
Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcı́a,
Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret
Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai,
Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr
Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Dı́az, Orhan Firat, Michele Catasta,
Jason Wei, Kathleen S. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language
modeling with pathways. J. Mach. Learn. Res., 24:240:1–240:113, 2022. URL https://api.semanticscholar.org/
CorpusID:247951931.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning
from human preferences. Advances in neural information processing systems, 30, 2017.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry
Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math
word problems. ArXiv, abs/2110.14168, 2021. URL https://api.semanticscholar.org/CorpusID:239998651.
David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk
Michalewski, Rif A. Saurous, Jascha Narain Sohl-Dickstein, Kevin Murphy, and Charles Sutton. Language model
cascades. ArXiv, abs/2207.10342, 2022.
Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and T. Zhang.
Raft: Reward ranked finetuning for generative foundation model alignment. ArXiv, abs/2304.06767, 2023. URL
https://api.semanticscholar.org/CorpusID:258170300.
Yuqing Du, Alexander Havrilla, Sainbayar Sukhbaatar, Pieter Abbeel, and Roberta Raileanu. A study on improving
reasoning in language models. In I Can’t Believe It’s Not Better Workshop: Failure Modes in the Age of Foundation
Models, 2023. URL https://openreview.net/forum?id=tCZFmDyPFm.
Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang,
and Tatsunori Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback.
ArXiv, abs/2305.14387, 2023. URL https://api.semanticscholar.org/CorpusID:258865545.

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey
Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang,
Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021. URL
https://doi.org/10.5281/zenodo.5371628.
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International
Conference on Machine Learning, 2022. URL https://api.semanticscholar.org/CorpusID:252992904.
Amelia Glaese, Nathan McAleese, Maja Trkebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh,
Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang,
Ramona Comanescu, Fan Yang, A. See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez
Elias, Richard Green, Sovna Mokr’a, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel,
William S. Isaac, John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey
Irving. Improving alignment of dialogue agents via targeted human judgements. ArXiv, abs/2209.14375, 2022. URL
https://api.semanticscholar.org/CorpusID:252596089.
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya
Siddhant, Alexa Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, A. Doucet, Orhan Firat, and Nando
de Freitas. Reinforced self-training (rest) for language modeling. ArXiv, abs/2308.08998, 2023. URL https:
//api.semanticscholar.org/CorpusID:261031028.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob
Steinhardt. Measuring mathematical problem solving with the math dataset, 2021a.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Xiaodong Song, and
Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. ArXiv, abs/2103.03874, 2021b.
URL https://api.semanticscholar.org/CorpusID:232134851.
J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora:
Low-rank adaptation of large language models. ArXiv, abs/2106.09685, 2021. URL https://api.semanticscholar.
org/CorpusID:235458009.
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language
models can self-improve. ArXiv, abs/2210.11610, 2022.
Minqi Jiang, Edward Grefenstette, and Tim Rocktäschel. Prioritized level replay. In International Conference on
Machine Learning, 2020. URL https://api.semanticscholar.org/CorpusID:222208809.
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette,
and Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity. arXiv preprint
arXiv:2310.06452, 2023.
Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C. H. Hoi. Coderl: Mastering code
generation through pretrained models and deep reinforcement learning. ArXiv, abs/2207.01780, 2022. URL
https://api.semanticscholar.org/CorpusID:250280117.
Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O. Stanley. Evolution through
large models. 2022. URL https://api.semanticscholar.org/CorpusID:249848020.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh,
Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and
Vedant Misra. Solving quantitative reasoning problems with language models. ArXiv, abs/2206.14858, 2022. URL
https://api.semanticscholar.org/CorpusID:250144408.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone,
Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier
Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade,
Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo
Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan
Zhang, Nourhan Fahmy, Urvashi Bhattacharyya, W. Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim
Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey
Schoelkopf, Jana Ebert, Tri Dao, Mayank Mishra, Alexander Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan
Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz
Ferrandis, Sean M. Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder: may the
source be with you! ArXiv, abs/2305.06161, 2023. URL https://api.semanticscholar.org/CorpusID:258588247.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak
Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian
Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, E. Zelikman, Esin
Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel J. Orr, Lucia
Zheng, Mert Yüksekgönül, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri S. Chatterji, O. Khattab, Peter Henderson,
Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard,
Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic
evaluation of language models. ArXiv, abs/2211.09110, 2022. URL https://api.semanticscholar.org/CorpusID:
263423935.
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John
Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. ArXiv, abs/2305.20050, 2023. URL
https://api.semanticscholar.org/CorpusID:258987659.
Jiacheng Liu, Skyler Hallinan, Ximing Lu, Pengfei He, Sean Welleck, Hannaneh Hajishirzi, and Yejin Choi. Rainier:
Reinforced knowledge introspector for commonsense question answering. ArXiv, abs/2210.03078, 2022. URL
https://api.semanticscholar.org/CorpusID:252735191.
Ximing Lu, Sean Welleck, Liwei Jiang, Jack Hessel, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin
Choi. Quark: Controllable text generation with reinforced unlearning. ArXiv, abs/2205.13636, 2022. URL
https://api.semanticscholar.org/CorpusID:249152301.
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jian-Guang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin,
Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models
via reinforced evol-instruct. ArXiv, abs/2308.09583, 2023. URL https://api.semanticscholar.org/CorpusID:
261030818.
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann André LeCun, and Thomas Scialom. Gaia: a
benchmark for general ai assistants. 2023. URL https://api.semanticscholar.org/CorpusID:265351664.
Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural
language crowdsourcing instructions. In Annual Meeting of the Association for Computational Linguistics, 2021.
URL https://api.semanticscholar.org/CorpusID:237421373.
Swaroop Mishra, Pan Lu, and A. Kalyan. Lila: A unified benchmark for mathematical reasoning. 2022. URL
https://api.semanticscholar.org/CorpusID:257405677.
OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023. URL https://api.semanticscholar.org/CorpusID:
257532815.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens,
Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models
to follow instructions with human feedback. ArXiv, abs/2203.02155, 2022. URL https://api.semanticscholar.
org/CorpusID:246426909.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems?,
2021.
Xiangyu Peng, Siyan Li, Sarah Wiegreffe, and Mark Riedl. Inferring the reader: Guiding automated story generation
with commonsense reasoning, 2021.
Yujia Qin, Shi Liang, Yining Ye, Kunlun Zhu, Lan Yan, Ya-Ting Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian,
Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Marc H. Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun.
Toolllm: Facilitating large language models to master 16000+ real-world apis. ArXiv, abs/2307.16789, 2023. URL
https://api.semanticscholar.org/CorpusID:260334759.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct
preference optimization: Your language model is secretly a reward model. ArXiv, abs/2305.18290, 2023. URL
https://api.semanticscholar.org/CorpusID:258959321.
Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage,
Hannaneh Hajishirzi, and Yejin Choi. Is reinforcement learning (not) for natural language processing?: Benchmarks,
baselines, and building blocks for natural language policy optimization. ArXiv, abs/2210.01241, 2022. URL
https://api.semanticscholar.org/CorpusID:252693405.
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael,
and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. ArXiv, abs/2311.12022, 2023. URL
https://api.semanticscholar.org/CorpusID:265295009.
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu,
Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton
Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis
Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code,
2023.
Tim Salimans and Richard J. Chen. Learning montezuma’s revenge from a single demonstration. ArXiv, abs/1812.03381,
2018. URL https://api.semanticscholar.org/CorpusID:54463584.
Tomohiro Sawada, Daniel Paleka, Alex Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay,
Kshitij Gupta, and Aran Komatsuzaki. Arb: Advanced reasoning benchmark for large language models. ArXiv,
abs/2307.13692, 2023. URL https://api.semanticscholar.org/CorpusID:260155126.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda,
and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. ArXiv, abs/2302.04761,
2023. URL https://api.semanticscholar.org/CorpusID:256697342.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization
algorithms. ArXiv, abs/1707.06347, 2017. URL https://api.semanticscholar.org/CorpusID:28695052.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual
data. ArXiv, abs/1511.06709, 2015. URL https://api.semanticscholar.org/CorpusID:15600925.
Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang
Zhao, Yuenan Guo, and Qianxiang Wang. Pangu-coder2: Boosting large language models for code with ranking
feedback. ArXiv, abs/2307.14936, 2023. URL https://api.semanticscholar.org/CorpusID:260202985.
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot,
L. Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering
chess and shogi by self-play with a general reinforcement learning algorithm. ArXiv, abs/1712.01815, 2017. URL
https://api.semanticscholar.org/CorpusID:33081038.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R.
Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal,
Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia
Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Annasaheb Rahane,
Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmuller, Andrew M.
Dai, Andrew La, Andrew Kyle Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta,
Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun
Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakacs,
B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartlomiej Bojanowski, Batuhan Ozyurt, Behnam Hedayatnia,
Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Stephen Howald, Bryan
Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan
Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian
Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera, Clemencia Siro, Colin Raffel,
Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Daniel H Garrette, Dan Hendrycks, Dan Kilman, Dan Roth,
Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle R. Perszyk, Danny Hernandez,
Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep
Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra,
Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus
Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth P. Donoway, Ellie Pavlick, Emanuele Rodolà,
Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan J. Jerzak, Ethan
Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed,
Francesca Happé, François Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán
Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz,
Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar,
Henry Shevlin, Hinrich Schutze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap
Jumelet, Jack Geissinger, John Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James
Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom,
Jascha Narain Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer
Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Oluwadara Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang,
Jane W Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jorg Frohberg,
Jos Rozen, José Hernández-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua Tenenbaum, Joshua S.
Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja
Markert, Kaustubh D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Wallace Mathewson, Kristen Chiafullo, Ksenia
Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan,
Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Luca Lam, Lucy Noble, Ludwig
Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lutfi Kerem Şenel, Maarten Bosma, Maarten Sap, Maartje
ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria
Jose Ram’irez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt,
Matthias Hagen, Mátyás Schubert, Medina Baitemirova, Melody Arnaud, Melvin Andrew McElrath, Michael A.
Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michal Swkedrowski, Michele
Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Monica Tiwari,
Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, T MukundVarma, Nanyun Peng, Nathan A. Chi,
Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita
Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha Iyer, Noah Constant, Noah Fiedel,
Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares,
Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter
Chang, Peter Eckersley, Phu Mon Htut, Pi-Bei Hwang, P. Milkowski, Piyush S. Patil, Pouya Pezeshkpour, Priti Oli,
Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker,
Ramon Risco, Raphael Milliere, Rhythm Garg, Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers,
Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan Lebras, Rosanne Liu, Rowan Jacobs, Rui
Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M.
Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman,
Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean
Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi S. Hamdan, Sharon
Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar,
Shubham Toshniwal, Shyam Upadhyay, Shyamolima Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi,
Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan
Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven T Piantadosi, Stuart M. Shieber,
Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali,
Tatsunori Hashimoto, Te-Lin Wu, Theo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius
Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar
Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Venkatesh
Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William
Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh,
Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov,
Yu Hou, Yu Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zi Fu Wang, Zijie J. Wang, Zirui Wang, and Ziyi
Wu. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. 2022. URL
https://api.semanticscholar.org/CorpusID:263625818.
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario
Amodei, and Paul Christiano. Learning to summarize from human feedback. ArXiv, abs/2009.01325, 2020. URL
https://api.semanticscholar.org/CorpusID:221665105.
Simeng Sun, Dhawal Gupta, and Mohit Iyyer. Exploring the impact of low-rank adaptation on the performance, efficiency,
and regularization of rlhf. ArXiv, abs/2309.09055, 2023. URL https://api.semanticscholar.org/CorpusID:
261884455.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen,
Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj
Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez,
Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril,
Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor
Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan
Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams,
Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan
Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and
fine-tuned chat models, 2023.
Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, L. Wang, Antonia Creswell, Geoffrey
Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. ArXiv,
abs/2211.14275, 2022. URL https://api.semanticscholar.org/CorpusID:254017497.
Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H.
Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka,
Aja Huang, L. Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander Sasha Vezhnevets, Rémi Leblond,
Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom Le Paine, Caglar Gulcehre,
Ziyun Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver
Smith, Tom Schaul, Timothy P. Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver.
Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575:350 – 354, 2019. URL
https://api.semanticscholar.org/CorpusID:204972004.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai,
and Quoc V. Le. Finetuned language models are zero-shot learners. ArXiv, abs/2109.01652, 2021. URL https:
//api.semanticscholar.org/CorpusID:237416585.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai hsin Chi, F. Xia, Quoc Le, and Denny Zhou.
Chain of thought prompting elicits reasoning in large language models. ArXiv, abs/2201.11903, 2022.
Shunyu Yao, Rohan Rao, Matthew J. Hausknecht, and Karthik Narasimhan. Keep calm and explore: Language models
for action generation in text-based games. ArXiv, abs/2010.02903, 2020. URL https://api.semanticscholar.org/
CorpusID:222142129.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing
reasoning and acting in language models. ArXiv, abs/2210.03629, 2022. URL https://api.semanticscholar.org/
CorpusID:252762395.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan.
Tree of thoughts: Deliberate problem solving with large language models. ArXiv, abs/2305.10601, 2023. URL
https://api.semanticscholar.org/CorpusID:258762525.
Anbang Ye, Christopher Cui, Taiwei Shi, and Mark O. Riedl. Neural story planning. ArXiv, abs/2212.08718, 2022.
URL https://api.semanticscholar.org/CorpusID:254854533.
Zheng Yuan, Hongyi Yuan, Cheng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on
learning mathematical reasoning with large language models. ArXiv, abs/2308.01825, 2023. URL https://api.
semanticscholar.org/CorpusID:260438790.
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning, 2022.
Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie
Zhan, and Hongsheng Li. Solving challenging math word problems using gpt-4 code interpreter with code-based
self-verification. ArXiv, abs/2308.07921, 2023a. URL https://api.semanticscholar.org/CorpusID:260900008.
Wei Zhou, Xiangyu Peng, and Mark O. Riedl. Dialogue shaping: Empowering agents through npc interaction. ArXiv,
abs/2307.15833, 2023b. URL https://api.semanticscholar.org/CorpusID:260333931.
Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Yongfeng Huang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang.
Solving math word problems via cooperative reasoning induced language models. Association for Computational
Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.245. URL https://doi.org/10.18653%2Fv1%2F2023.acl-long.
245.
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and
Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
Figure 9 Accuracy of EI models on GSM8K test vs. number of iterations. EI scratch models use no SFT initialization.

Figure 10 Accuracy of EI models on GSM8K test vs. number of iterations. K = 4 samples per prompt are used to construct a fine-tuning dataset for the next round.

A RCRL Label Balance


We also experiment with different proportions of ‘[GOOD]’ and ‘[BAD]’ labels in the RCRL training data. This is
motivated by a desire to make better use of abundant negative data, which is much easier to generate than its
positive counterpart. Ideally, better teaching the student what not to do with this data would increase the
number of valid solutions. Recall that by default we balance the number of positive and negative samples.
We conduct experiments with Llama-2 7B on GSM8K without any SFT data. We apply only one round of Expert
Iteration (K = 1 sample per question), producing a student model we refer to as EI-minimal. Note that in this setting we
only provide ‘[GOOD]’ and ‘[BAD]’ labels for entire solutions, rather than providing labels at the step level.
Results are reported in Table 5.

Model         positive:negative ratio    GSM8K (maj@1)
EI-minimal    -                          0.17
RCRL          100:1                      0.18
RCRL          10:1                       0.18
RCRL          1:1                        0.15

Table 5 RCRL without SFT, using different proportions of positive and negative samples. As we increase the proportion
of negative samples, performance generally decreases. At best, we only see very marginal gains using RCRL. Note:
EI-minimal refers to running EI for one iteration, with K = 1 per question.

We find the best performance is achieved when the amount of positive training data greatly outweighs the amount
of negative data. In these cases, our RCRL models’ maj@1 score slightly exceeds the maj@1 score of the
data-generating EI-minimal model. Yet when we balance the amount of positive and negative training data,
performance degrades. This suggests our 7B student does not effectively learn from the provided
negative demonstrations. We suspect either a larger model or an easier task would give better results.
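As a concrete illustration, the sketch below shows one way such an RCRL fine-tuning set with a configurable positive:negative label ratio could be assembled from sampled solutions. It is a minimal sketch of the solution-level labeling described above; the `samples` format, the `build_rcrl_dataset` name, and the prompt layout are illustrative assumptions rather than the exact implementation.

```python
import random

def build_rcrl_dataset(samples, ratio=(100, 1), seed=0):
    """Assemble solution-level RCRL training examples.

    `samples` is a list of dicts with keys 'question', 'solution', and
    'correct' (bool, from comparing the final answer to ground truth).
    `ratio` = (positive : negative) controls the label balance.
    """
    rng = random.Random(seed)
    positives = [s for s in samples if s["correct"]]
    negatives = [s for s in samples if not s["correct"]]

    # Keep all positives; subsample negatives to the requested ratio.
    n_neg = min(len(negatives), max(1, len(positives) * ratio[1] // ratio[0]))
    negatives = rng.sample(negatives, n_neg)

    dataset = []
    for s in positives + negatives:
        label = "[GOOD]" if s["correct"] else "[BAD]"
        # The label is prepended so the model can be conditioned on
        # "[GOOD]" at inference time.
        dataset.append({
            "prompt": f"{s['question']}\n{label}\n",
            "completion": s["solution"],
        })
    rng.shuffle(dataset)
    return dataset
```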

B EI Improvement across Iterations


Figures 9 and 10 plot the maj@1 score of models versus rounds of expert iteration. On both datasets the
score is monotonically increasing until convergence after at most four rounds. Models initialized from an
SFT checkpoint converge faster than their pretrained counterparts. Each round of expert iteration samples
K × num_train rollouts, with the longest-running training loop generating at most 5 × 4 × 7000 ≈ 10^6 samples.
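For reference, one round of expert iteration can be sketched as follows. This is an illustrative sketch rather than the exact training loop; `generate`, `is_correct`, and `finetune` are hypothetical stand-ins for the sampling, answer-checking, and supervised fine-tuning machinery.

```python
from typing import Callable, Iterable, List, Tuple

def expert_iteration_round(
    model,
    prompts: Iterable[str],
    generate: Callable[[object, str], str],
    is_correct: Callable[[str, str], bool],
    finetune: Callable[[object, List[Tuple[str, str]]], object],
    K: int = 4,
):
    """One round of expert iteration: sample K solutions per prompt, keep the
    correct ones (deduplicated by exact string match), and fine-tune on them."""
    new_data: List[Tuple[str, str]] = []
    for prompt in prompts:
        solutions = {generate(model, prompt) for _ in range(K)}
        new_data.extend((prompt, s) for s in solutions if is_correct(prompt, s))
    return finetune(model, new_data)
```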
Figure 11 Diversity of GSM8K model output over rounds of EI. (No SFT)

Figure 12 Diversity of SVAMP model output over rounds of EI. K = 96 samples are used per prompt. Positive
diversity measures diversity in the subset of solutions with a correct final answer.

Figures 11 and 12 report the diversity of solutions across rounds of expert iteration, as measured by two
separate metrics of solution uniqueness. Exact diversity checks for equality between two solutions using an
exact string match. Trace diversity checks for equality between two solutions by first extracting the trace of each
solution, i.e. the sequence of intermediate calculations used to reach the final answer, and then performing an
exact match on this trace representation.
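The following minimal sketch illustrates how these two metrics can be computed. The regular expression used to extract intermediate calculations assumes GSM8K-style `a op b = c` steps and is an illustrative choice, not necessarily the exact implementation.

```python
import re
from typing import List, Tuple

# Intermediate calculations of the form "a <op> b = c" (GSM8K-style steps).
CALC_RE = re.compile(r"\d+(?:\.\d+)?\s*[-+*/]\s*\d+(?:\.\d+)?\s*=\s*\d+(?:\.\d+)?")

def exact_diversity(solutions: List[str]) -> float:
    """Fraction of sampled solutions that are unique as raw strings."""
    return len({s.strip() for s in solutions}) / max(1, len(solutions))

def trace(solution: str) -> Tuple[str, ...]:
    """The trace of a solution: its sequence of intermediate calculations."""
    return tuple(expr.replace(" ", "") for expr in CALC_RE.findall(solution))

def trace_diversity(solutions: List[str]) -> float:
    """Fraction of sampled solutions whose calculation trace is unique."""
    return len({trace(s) for s in solutions}) / max(1, len(solutions))
```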
Solution diversity increases then decreases over training. Overall, both measures of solution diversity increase
for both model sizes over the first two rounds of expert iteration. After the first two rounds, trace
diversity appears to plateau and in some cases slightly decrease. Exact diversity continues to increase for
13B, but not at the same rate as during the first two rounds. The largest increases in solution diversity over
the first two rounds also coincide with the largest gains in maj@1 performance. This lends evidence to
the intuition that a high-performing student is able to generate many correct yet unique solutions to
the same problem. Further, we see during later rounds of expert iteration that while maj@1 score improves
slightly, diversity suffers. This provides further evidence that training is beginning to overfit to maj@1 score, in
the process reducing both pass@n and solution diversity. We see the same behavior on both datasets.
Larger models generate more diverse solutions. The above figures also demonstrate that the 13B model produces
significantly more diverse outputs than the 7B model. This is true during every round of fine-tuning, with the
gap growing wider as more training is done. Interestingly, the 13B model appears to produce an exactly unique
solution with every sample after four rounds of expert iteration. However, its trace diversity peaks after two
rounds, indicating the 13B model tends to introduce semantic diversity without changing the underlying computational
structure of a solution.

Figure 13 Diversity of GSM8K SFT model output.

C Sample Complexities
In this section we plot the sample complexity curves on each benchmark accompanying the results in Section 4. Figures
14 and 15 report results on GSM8K without supervised fine-tuning. Figures 16 and 17 report results on
SVAMP.

Figure 14 Sample complexity of default versus ORM guided EI students on GSM8K (no SFT). The ORM improves sample complexity initially but ultimately underperforms using only the ground truth.

Figure 15 Sample complexity of default versus ORM guided PPO students on GSM8K (no SFT). As in EI, the ORM initially improves maj@1 score over using only ground truth rewards but eventually underperforms.

As in the SFT case, using an ORM to guide EI and PPO on prompted GSM8K models does somewhat
reduce sample complexity but does not improve best performance (if anything, the ORM reward slightly hurts
converged maj@1 score). We see the same story when providing a dense ORM reward, which further decreases
sample complexity but at the cost of final converged performance. Our best results still come from using only
the ground truth score. We suspect the performance degradation introduced by the ORM reward could be
alleviated with a larger reward model. However, we do not believe using a larger model would improve over
just the ground truth reward. Similar results are seen for SVAMP.
Figure 16 Sample complexity of default versus ORM guided EI students on SVAMP.

Figure 17 Sample complexity of default versus ORM guided PPO students on SVAMP.

Figure 18 maj@1 scores for Prioritized Level Replay (PLR) and Backtracking techniques compared to default PPO
and SFT.

D Curriculum Learning for RL


In addition to vanilla PPO we experiment with backtracking (Salimans and Chen, 2018) and Prioritized Level
Replay (PLR) (Jiang et al., 2020) as algorithms from the curriculum learning literature. Such algorithms
aim to construct a “curriculum” of subproblems, with the model ideally learning to generalize from easier
subproblems to harder subproblems.
Backtracking in particular is a natural choice, as it relies on using high-quality supervised trajectories to
improve exploration of the solution space. This is done by sampling the student policy π on the partially
complete solution (Q, P_i), where P_i is a sequence of intermediate ground truth steps (S_1, ..., S_i). The algorithm
proceeds by setting an initial threshold τ_0 ∈ (0, 1) which represents how far back from the final answer to
initialize partial solutions. By default we use τ_0 = 0.9. Then, for each problem Q which can be solved from
P_i, we remove the last step S_i and condition on P_{i−1} the next time Q is sampled.
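The following sketch illustrates this backtracking schedule, assuming access to ground-truth step sequences; `solve` is a hypothetical stand-in for running the PPO policy from a given prefix, and τ_0 is interpreted here as the fraction of ground-truth steps initially provided.

```python
from typing import Callable, Dict, List

def backtracking_curriculum(
    problems: Dict[str, List[str]],            # question -> ground-truth steps [S_1, ..., S_n]
    solve: Callable[[str, List[str]], bool],   # run the student from (Q, prefix); True if solved
    tau0: float = 0.9,
    num_epochs: int = 10,
) -> Dict[str, int]:
    """Backtracking curriculum: keep a per-problem prefix length; whenever the
    student solves a problem from prefix P_i, back up to P_{i-1} next time."""
    # tau0 is interpreted as the fraction of ground-truth steps initially given.
    prefix_len = {q: int(tau0 * len(steps)) for q, steps in problems.items()}
    for _ in range(num_epochs):
        for q, steps in problems.items():
            i = prefix_len[q]
            if solve(q, steps[:i]) and i > 0:
                prefix_len[q] = i - 1  # condition on a shorter prefix next time
    return prefix_len
```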
PLR does not rely on access to SFT data, instead heuristically prioritizing problems with high “learning
potential”, estimated by the average absolute advantage. Prioritizing problems in this way lets the
model focus on problems that are neither too easy nor too hard, making efficient use of its exploration
budget. We initialize the student with a supervised fine-tuned Llama-2 7B on GSM8K. Results are reported
in Figure 19.
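A minimal sketch of this prioritization, assuming priority scores given by the average absolute advantage of recent rollouts, is shown below; the function and argument names are illustrative rather than the exact implementation.

```python
import numpy as np

def plr_sample_prompts(prompt_ids, abs_advantages, batch_size, temperature=1.0, seed=0):
    """Sample a training batch, preferring prompts with high "learning
    potential", estimated here by the average absolute advantage of each
    prompt's recent rollouts (1.0 for prompts not yet seen)."""
    rng = np.random.default_rng(seed)
    scores = np.array([
        float(np.mean(abs_advantages[p])) if abs_advantages.get(p) else 1.0
        for p in prompt_ids
    ])
    # A softmax over scores turns learning potential into sampling probabilities.
    logits = scores / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(prompt_ids, size=batch_size, replace=False, p=probs)
```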
Figure 19 maj@1 scores on GSM8K for Prioritized Level Replay (PLR) and Backtracking techniques compared to
default PPO and SFT.

Overall, we find neither method exceeds the performance of default PPO. We hypothesize this is due to the
limited exploration the model engages in from the start, a consequence of both pretraining and supervised fine-tuning.
We speculate better results might be achieved on a harder dataset with more intermediate steps, particularly
when using backtracking.

E Data augmentation
We additionally experimented with generating synthetic (Q, A) training pairs via an approach inspired by
backtranslation (Sennrich et al., 2015). We assume access to a supervised fine-tuning dataset D of (Q, A)
pairs and train a Q → A model M_{Q→A} as our usual student model. We call this model the verifier. We can
also utilize D to train models of the form M_{A→Q} and M_{A→A}, which map answers to questions and answers
to answers, respectively. We train M_{A→Q} simply by fine-tuning the pretrained model M to predict p(Q|A)
where (Q, A) ∼ D. We call the combination of M_{A→A} and M_{A→Q} the generator. We construct a training set
for M_{A→A} as follows: for each A with (Q, A) ∈ D, we randomly sample three other answers A_1, A_2, A_3 from D
which act as a conditional prompt. We then train M_{A→A} by maximizing the likelihood p(A | A_1, A_2, A_3).
We sample M_{A→A} on each ground truth answer A ∈ D a total of K = 8 times, constructing a synthetic set of
answers A_synth. We then use our backwards model M_{A→Q} to produce a question for each synthetic answer
A ∈ A_synth. This forms a synthetic dataset D_synth of (Q, A) pairs. Finally, for each synthetic (Q, A) pair, we sample our
student model M_{Q→A} K = 20 times on the question and check whether the student model's final answer
agrees with the “intended” final answer. We refer to the percentage of student-generated solutions recovering
the intended final answer as the score of a synthetic (Q, A) pair. We plot the distribution of scores in Figure
20.
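The generation and scoring pipeline can be sketched as follows; `sample_a2a`, `sample_a2q`, `sample_q2a`, and `final_answer` are hypothetical stand-ins for sampling from M_{A→A}, M_{A→Q}, M_{Q→A} and for answer extraction, and the exact conditioning format for M_{A→A} is an assumption.

```python
from typing import Callable, List, Tuple

def score_synthetic_pairs(
    answers: List[str],                 # ground-truth answers A from D
    sample_a2a: Callable[[str], str],   # M_{A→A}: answer prompt -> new synthetic answer
    sample_a2q: Callable[[str], str],   # M_{A→Q}: answer -> question
    sample_q2a: Callable[[str], str],   # M_{Q→A}: question -> full solution (student)
    final_answer: Callable[[str], str], # extract the final answer from a solution/answer
    K_gen: int = 8,
    K_score: int = 20,
) -> List[Tuple[str, str, float]]:
    """Generate synthetic (Q, A) pairs and score each by how often the
    student's sampled solutions recover the intended final answer."""
    scored = []
    for a in answers:
        for _ in range(K_gen):
            synth_a = sample_a2a(a)          # conditioning format is an assumption
            synth_q = sample_a2q(synth_a)
            target = final_answer(synth_a)
            hits = sum(
                final_answer(sample_q2a(synth_q)) == target for _ in range(K_score)
            )
            scored.append((synth_q, synth_a, hits / K_score))
    return scored

# Keep only pairs whose score lies in the neighborhood (1/2 - tau, 1/2 + tau):
# kept = [(q, a) for q, a, s in scored if abs(s - 0.5) < tau]
```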
We see that the majority of synthetic pairs, over 50,000, never have their solutions recovered by the student
M_{Q→A}. This is either because (a) the student is too weak to solve the question or (b) the question is impossible
to solve. Either way, we likely do not want to include these pairs as new training data for the student. Similarly, we
likely do not want to include questions which are always solved by the student, i.e. those with score = 1, as
they are too easy. Additionally, we should be wary of questions with a small score in the range (0, ϵ): we
expect many of these to have been solved incorrectly while still arriving at the correct final answer, so we
should also exclude such problems from our training dataset.
We expect the highest quality pairs (Q, A) to have a score in the neighborhood (1/2 − τ, 1/2 + τ). These questions
should be neither too hard nor too easy for our student. Table 6 shows the performance of student models
fine-tuned on a combination of ground truth data and synthetically generated data with scores in the range
(1/2 − τ, 1/2 + τ). All models are trained for five epochs with an initial learning rate of 2e-5 cosine-decayed to 2e-7. Llama-2
7B is used as the pretrained base model.
Figure 20 Scores of synthetically backwards-generated (Q, A) pairs. Note: the score refers to the percentage of times
the forward student model M_{Q→A} recovers the intended final answer.

                 maj@1
τ = 0.1          0.38
τ = 0.2          0.36
τ = 0.3          0.34
SFT              0.41

Table 6 Performance of models trained with various amounts of synthetic data vs. the SFT baseline. Note: τ
represents the size of the neighborhood of scores around 1/2 that are not filtered out.

Unfortunately, it seems introducing any amount of synthetically generated data degrades performance. When
manually inspecting the synthetically generated (Q, A) pairs, it becomes clear why: there is an extremely
high number of false positives. Consider the following example of a synthetic pair:

Question: “A school of 100 musicians goes on a skiing trip. 40% are beginners, 30% are intermediate, and 50% are advanced. How many people went on the skiing trip?”
Answer: “There are 100 * 0.4 = 40 beginner skiiers. There are 100 * 0.3 = 30 intermediate skiiers. There are 100 * 0.5 = 50 advanced skiiers. Therefore there are 40 + 30 + 50 = 120 skiiers total.”

This is an example of a low-quality sample we do not want in our training data. Ideally, such a sample would
have a score of 0, since the technically correct answer is 100, not 120. However, the SFT M_{Q→A} student we
use to construct a score for each (Q, A) sample computes the final answer as 120 a staggering 47% of the time.
The verifier makes exactly the same mistakes the M_{A→A} model made when constructing the question,
likely because they were trained on similar distributions.
We suspect using a larger model more capable of detecting these sort of trivial non-solutions would do
substantially better at generating backwards synthetic data. Similarly, employing separate models as the
generator and verifier may reduce the probability of both making the same mistakes, improving the reliability
of the score for each pair. We leave this as future work.

F RCRL Step-label Generating Process


Another natural candidate which could be used to identify mistakes at each step is a Process Based Reward
Model (PRM) (Lightman et al., 2023). A PRM estimates the probability of correctness of a step S_i,
p(S_i correct | Q, S_1, S_2, ..., S_i), independently of its impact on the final answer. However, this would be
expensive, requiring the collection of human-annotated samples. Instead, we propose to approximate the optimal
value function V* of the reasoning task. V* corresponds to the value function of the optimal policy, which is
able to successfully solve the reasoning task from any logically valid intermediate state S_j. Such an optimal
value function would have V*(Q, S_1, ..., S_i) = 1 for a solution prefix with no mistakes, and V*(Q, S_1, ..., S_i) = 0
if the prefix already contains a mistake which will result in an incorrect final answer.
Note, however, that V* does not exactly correspond to a PRM. This is because a partial solution S_1, ..., S_i with a mistake at step j ≠ i and
a valid terminal step S_i will have V*(Q, S_1, ..., S_i) = 0 and PRM(Q, S_1, ..., S_i) = 1. To make this distinction
clear, we call the models we train to directly approximate V* stepwise ORMs, or SORMs.
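In practice, V* can be approximated without human annotation by Monte-Carlo rollouts from each solution prefix, with the current student standing in for the optimal policy. The sketch below illustrates this; `generate_completion` and `is_correct_final_answer` are hypothetical stand-ins for the student sampler and the answer checker.

```python
from typing import Callable, List

def estimate_step_values(
    question: str,
    steps: List[str],
    generate_completion: Callable[[str], str],          # continue a prefix to a full solution
    is_correct_final_answer: Callable[[str, str], bool],
    K: int = 8,
) -> List[float]:
    """Monte-Carlo estimate of V*(Q, S_1..S_i) for each prefix: the fraction
    of K sampled completions from that prefix reaching a correct final answer.
    Thresholding these values (e.g. > 0) yields per-step [GOOD]/[BAD] labels."""
    values = []
    for i in range(1, len(steps) + 1):
        prefix = question + "\n" + "\n".join(steps[:i])
        hits = sum(
            is_correct_final_answer(question, prefix + generate_completion(prefix))
            for _ in range(K)
        )
        values.append(hits / K)
    return values
```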