DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI
Abstract
Figure 1 | Benchmark performance of DeepSeek-R1, reported as Accuracy / Percentile (%) on AIME 2024 (Pass@1), Codeforces (Percentile), GPQA Diamond (Pass@1), MATH-500 (Pass@1), MMLU (Pass@1), and SWE-bench Verified (Resolved).
Contents

1 Introduction
  1.1 Contributions
  1.2 Summary of Evaluation Results
2 Approach
  2.1 Overview
  2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
    2.2.1 Reinforcement Learning Algorithm
    2.2.2 Reward Modeling
    2.2.3 Training Template
    2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
  2.3 DeepSeek-R1: Reinforcement Learning with Cold Start
    2.3.1 Cold Start
    2.3.2 Reasoning-oriented Reinforcement Learning
    2.3.3 Rejection Sampling and Supervised Fine-Tuning
    2.3.4 Reinforcement Learning for all Scenarios
  2.4 Distillation: Empower Small Models with Reasoning Capability
3 Experiment
  3.1 DeepSeek-R1 Evaluation
  3.2 Distilled Model Evaluation
4 Discussion
  4.1 Distillation v.s. Reinforcement Learning
  4.2 Unsuccessful Attempts
5 Conclusion, Limitations, and Future Work
A Contributions and Acknowledgments
1. Introduction
In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and
evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap
towards Artificial General Intelligence (AGI).
Recently, post-training has emerged as an important component of the full training pipeline.
It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt
to user preferences, all while requiring relatively minimal computational resources compared with
pre-training. In the context of reasoning capabilities, OpenAI’s o1 (OpenAI, 2024b) series models
were the first to introduce inference-time scaling by increasing the length of the Chain-of-
Thought reasoning process. This approach has achieved significant improvements in various
reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge
of effective test-time scaling remains an open question for the research community. Several prior
works have explored various approaches, including process-based reward models (Lightman
et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024),
and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh
et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning
performance comparable to OpenAI’s o1 series models.
In this paper, we take the first step toward improving language model reasoning capabilities
using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop
reasoning capabilities without any supervised data, focusing on their self-evolution through
a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ
GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning.
During training, DeepSeek-R1-Zero naturally emerges with numerous powerful and interesting
reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits superb performance
on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to
71.0%, and with majority voting, the score further improves to 86.7%, matching the performance
of OpenAI-o1-0912.
However, DeepSeek-R1-Zero encounters challenges such as poor readability and language
mixing. To address these issues and further enhance reasoning performance, we introduce
DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training
pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the
DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-
Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection
sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains
such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model.
After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking
into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to
as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.
We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-
32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying
RL on it. This demonstrates that the reasoning patterns discovered by larger base models are cru-
cial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey
et al., 2024) series. Notably, our distilled 14B model outperforms state-of-the-art open-source
QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a
new record on the reasoning benchmarks among dense models.
1.1. Contributions

Post-Training: Large-Scale Reinforcement Learning on the Base Model

• We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) reasoning for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advances in this area.

Distillation: Smaller Models Can Be Powerful Too

• We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared with the reasoning patterns discovered through RL on small models. The open-source DeepSeek-R1, as well as its API, will benefit the research community in distilling better small models in the future.
• Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results show that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. In addition, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results significantly outperform previous open-source models and are comparable to o1-mini. We open-source the distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on the Qwen2.5 and Llama3 series to the community.

1.2. Summary of Evaluation Results
• Reasoning tasks: (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly
surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%,
performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2)
On coding-related tasks, DeepSeek-R1 demonstrates expert-level performance in code competition
tasks, achieving a 2,029 Elo rating on Codeforces and outperforming 96.3% of human participants in
the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than
DeepSeek-V3, which could help developers in real world tasks.
• Knowledge: On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-
R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores
of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its
performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1
surpasses other closed-source models, demonstrating its competitive edge in educational
tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3,
demonstrating its capability in handling fact-based queries. A similar trend is observed
where OpenAI-o1 surpasses GPT-4o on this benchmark.
• Others: DeepSeek-R1 also excels in a wide range of tasks, including creative writing,
general question answering, editing, summarization, and more. It achieves an impressive
length-controlled win-rate of 87.6% on AlpacaEval 2.0 and a win-rate of 92.3% on ArenaHard,
showcasing its strong ability to intelligently handle non-exam-oriented queries.
Additionally, DeepSeek-R1 demonstrates outstanding performance on tasks requiring
long-context understanding, substantially outperforming DeepSeek-V3 on long-context
benchmarks.
2. Approach
2.1. Overview
Previous work has heavily relied on large amounts of supervised data to enhance model
performance. In this study, we demonstrate that reasoning capabilities can be significantly
improved through large-scale reinforcement learning (RL), even without using supervised
fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with
the inclusion of a small amount of cold-start data. In the following sections, we present: (1)
DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data, and
(2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of
long Chain-of-Thought (CoT) examples, and (3) the distillation of reasoning capability from
DeepSeek-R1 into small dense models.
2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

Reinforcement learning has demonstrated significant effectiveness in reasoning tasks, as evidenced by our previous works (Shao et al., 2024; Wang et al., 2023). However, these works depended heavily on supervised data, which are time-intensive to gather. In this section, we explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process. We start with a brief overview of our RL algorithm, followed by the presentation of some exciting results, and hope this provides the community with valuable insights.

2.2.1. Reinforcement Learning Algorithm

Group Relative Policy Optimization In order to save the training costs of RL, we adopt Group
Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is
typically the same size as the policy model, and estimates the baseline from group scores instead.
Specifically, for each question 𝑞, GRPO samples a group of outputs { 𝑜1 , 𝑜2 , · · · , 𝑜𝐺 } from the old
policy 𝜋𝜃𝑜𝑙𝑑 and then optimizes the policy model 𝜋𝜃 by maximizing the following objective:
\[
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\big[q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\big]\,
\frac{1}{G}\sum_{i=1}^{G}\left( \min\!\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}\, A_i,\ \mathrm{clip}\!\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\!\left(\pi_{\theta} \,\|\, \pi_{ref}\right) \right), \tag{1}
\]
\[
\mathbb{D}_{KL}\!\left(\pi_{\theta} \,\|\, \pi_{ref}\right) = \frac{\pi_{ref}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - \log\frac{\pi_{ref}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - 1, \tag{2}
\]
where \(\varepsilon\) and \(\beta\) are hyper-parameters, and \(A_i\) is the advantage, computed using a group of rewards \(\{r_1, r_2, \ldots, r_G\}\) corresponding to the outputs within each group:
\[
A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \cdots, r_G\})}{\mathrm{std}(\{r_1, r_2, \cdots, r_G\})}. \tag{3}
\]
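To make the objective concrete, the following is a minimal PyTorch-style sketch of Equations (1)-(3) for a single group of outputs, not the authors' implementation: the tensor names (logp_new, logp_old, logp_ref for per-output log-probabilities) and the default values of eps and beta are illustrative assumptions.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Sketch of the GRPO objective (Eqs. 1-3) for one group of G sampled outputs.

    logp_new, logp_old, logp_ref: [G] log-probabilities of each output o_i under
    the current, old, and reference policies. rewards: [G] scalar rewards r_i.
    """
    # Eq. (3): group-normalized advantage, no critic model needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio pi_theta / pi_theta_old, with the clipped surrogate of Eq. (1).
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    surrogate = torch.min(unclipped, clipped)

    # Eq. (2): estimator of D_KL(pi_theta || pi_ref).
    log_ref_ratio = logp_ref - logp_new
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1

    # The objective is maximized, so return its negation averaged over the group.
    return -(surrogate - beta * kl).mean()
```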
A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user
with the answer. The reasoning process and answer are enclosed within <think> </think> and
<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant:
Table 1 | Template for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning
question during training.
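As a small illustration of how the Table 1 template is used, the sketch below substitutes a reasoning question for the prompt placeholder; the helper name build_prompt is an assumption for illustration, not the released training code, while the template string itself is taken verbatim from Table 1.

```python
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think> <answer> answer "
    "here </answer>. User: {prompt}. Assistant:"
)

def build_prompt(question: str) -> str:
    # Replace the placeholder with the specific reasoning question, as in Table 1.
    return R1_ZERO_TEMPLATE.format(prompt=question)

print(build_prompt("If a > 1, what is the sum of the real solutions of sqrt(a - sqrt(a + x)) = x?"))
```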
2.2.2. Reward Modeling

The reward is the source of the training signal, which decides the optimization direction of RL.
To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two
types of rewards:
• Accuracy rewards: The accuracy reward model evaluates whether the response is correct.
For example, in the case of math problems with deterministic results, the model is required
to provide the final answer in a specified format (e.g., within a box), enabling reliable
rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be
used to generate feedback based on predefined test cases.
• Format rewards: In addition to the accuracy reward model, we employ a format reward
model that requires the model to place its thinking process between ‘<think>’ and ‘</think>’
tags.
We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero,
because we find that the neural reward model may suffer from reward hacking in the large-scale
reinforcement learning process, and retraining the reward model needs additional training
resources and it complicates the whole training pipeline.
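A minimal sketch of such a rule-based reward follows, assuming boxed final answers for math problems and a simple tag check for the format reward; the exact matching rules, weights, and helper names are illustrative assumptions rather than the production reward system.

```python
import re

def format_reward(response: str) -> float:
    # Reward responses that wrap the thinking process in <think> ... </think>
    # followed by an <answer> ... </answer> block.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # For math problems with deterministic results, require the final answer in a
    # specified format (here: \boxed{...}) so correctness can be checked by rule.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Combine the two rule-based signals; a plain sum is one simple choice.
    return accuracy_reward(response, ground_truth) + format_reward(response)
```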
2.2.3. Training Template

To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base model to adhere to our specified instructions. As shown in Table 1, this template requires DeepSeek-R1-Zero to first produce a reasoning process, followed by the final answer. We intentionally limit our constraints to this structural format, avoiding any content-specific biases (such as mandating reflective reasoning or promoting particular problem-solving strategies) to ensure that we can accurately observe the model's natural progression during the RL process.
2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero

Model              | AIME 2024         | MATH-500 | GPQA Diamond | LiveCodeBench | CodeForces
                   | pass@1    cons@64 | pass@1   | pass@1       | pass@1        | rating
OpenAI-o1-mini     | 63.6      80.0    | 90.0     | 60.0         | 53.8          | 1820
OpenAI-o1-0912     | 74.4      83.3    | 94.8     | 77.3         | 63.4          | 1843
DeepSeek-R1-Zero   | 71.0      86.7    | 95.9     | 73.3         | 50.0          | 1444

Table 2 | Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.
Figure 2 | AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample
16 responses and calculate the overall average accuracy to ensure a stable evaluation.
DeepSeek-R1-Zero thus attains robust reasoning capabilities without the need for any supervised
fine-tuning data. This is a noteworthy achievement, as it underscores the model’s ability to
learn and generalize effectively through RL alone. Additionally, the performance of DeepSeek-
R1-Zero can be further augmented through the application of majority voting. For example,
when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero’s performance
escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912. The
ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without
majority voting, highlights its strong foundational capabilities and its potential for further
advancements in reasoning tasks.
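The majority-voting (cons@64) number can be reproduced conceptually with a sketch like the one below, which tallies the final answers of the sampled responses and scores the most frequent one; how final answers are extracted from responses is an assumption left outside the sketch.

```python
from collections import Counter

def consensus_at_k(sampled_answers, ground_truth):
    """Majority voting over k extracted final answers (e.g. cons@64).

    Returns 1.0 if the most common answer matches the ground truth, else 0.0.
    """
    most_common_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if most_common_answer == ground_truth else 0.0

# Example: 64 samples where the correct answer "42" wins the vote.
answers = ["42"] * 40 + ["41"] * 24
print(consensus_at_k(answers, "42"))  # 1.0
```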
Figure 3 | The average response length of DeepSeek-R1-Zero on the training set during the RL
process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.
As shown in Figure 3, the thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process. This improvement is not the result of external adjustments
but rather an intrinsic development within the model. DeepSeek-R1-Zero naturally acquires the
ability to solve increasingly complex reasoning tasks by leveraging extended test-time compu-
tation. This computation ranges from generating hundreds to thousands of reasoning tokens,
allowing the model to explore and refine its thought processes in greater depth.
One of the most remarkable aspects of this self-evolution is the emergence of sophisticated
behaviors as the test-time computation increases. Behaviors such as reflection—where the model
revisits and reevaluates its previous steps—and the exploration of alternative approaches to
problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead
emerge as a result of the model’s interaction with the reinforcement learning environment. This
spontaneous development significantly enhances DeepSeek-R1-Zero’s reasoning capabilities,
enabling it to tackle more challenging tasks with greater efficiency and accuracy.
Question: If $a > 1$, then the sum of the real solutions of $\sqrt{a - \sqrt{a + x}} = x$ is equal to

Response: <think>
To solve the equation $\sqrt{a - \sqrt{a + x}} = x$, let's start by squaring both $\cdots$
$\left(\sqrt{a - \sqrt{a + x}}\right)^2 = x^2 \implies a - \sqrt{a + x} = x^2.$
Rearrange to isolate the inner square root term:
$(a - x^2)^2 = a + x \implies a^2 - 2ax^2 + (x^2)^2 = a + x \implies x^4 - 2ax^2 - x + (a^2 - a) = 0$
...
Wait, wait. Wait. That's an aha moment I can flag here.
Let's reevaluate this step-by-step to identify if the correct sum can be $\cdots$
We started with the equation:
$\sqrt{a - \sqrt{a + x}} = x$
First, let's square both sides:
$a - \sqrt{a + x} = x^2 \implies \sqrt{a + x} = a - x^2$
Next, I could square both sides again, treating the equation: $\cdots$
...

Table 3 | An interesting "aha moment" of an intermediate version of DeepSeek-R1-Zero. The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning.

This moment is not only an "aha moment" for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The "aha moment" serves as a powerful reminder of the potential of reinforcement learning to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.
2.3. DeepSeek-R1: Reinforcement Learning with Cold Start

Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can
reasoning performance be further improved or convergence accelerated by incorporating a small
amount of high-quality data as a cold start? 2) How can we train a user-friendly model that
not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong
general capabilities? To address these questions, we design a pipeline to train DeepSeek-R1. The
pipeline consists of four stages, outlined as follows.
2.3.1. Cold Start

Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from
the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data
to fine-tune the model as the initial RL actor. To collect such data, we have explored several
approaches: using few-shot prompting with a long CoT as an example, directly prompting
models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-
Zero outputs in a readable format, and refining the results through post-processing by human
annotators.
In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as
the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data
include:
• Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable
for reading. Responses may mix multiple languages or lack markdown formatting to
highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1,
we design a readable pattern that includes a summary at the end of each response and
filters out responses that are not reader-friendly. Here, we define the output format as
|special_token|<reasoning_process>|special_token|<summary>, where the reasoning
process is the CoT for the query, and the summary is used to summarize the reasoning
results. A minimal parsing sketch of this format is shown after this list.
• Potential: By carefully designing the pattern for cold-start data with human priors, we
observe better performance than DeepSeek-R1-Zero. We believe iterative training is
a better way for reasoning models.
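Below is the parsing sketch referenced above, assuming |special_token| is a literal delimiter string; the actual special token and any post-processing applied to the cold-start data are not specified here, so treat this purely as an illustration of the reasoning/summary layout.

```python
SPECIAL_TOKEN = "|special_token|"  # placeholder; the real delimiter is a model special token

def parse_cold_start_output(text: str):
    """Split a cold-start response of the form
    |special_token|<reasoning_process>|special_token|<summary>
    into its reasoning process (CoT) and final summary."""
    parts = text.split(SPECIAL_TOKEN)
    # parts[0] is whatever precedes the first delimiter (usually empty).
    if len(parts) < 3:
        raise ValueError("response does not follow the expected cold-start format")
    reasoning_process, summary = parts[1], parts[2]
    return reasoning_process.strip(), summary.strip()

cot, summary = parse_cold_start_output(
    "|special_token|First, factor the expression ...|special_token|The answer is 12."
)
```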
2.3.2. Reasoning-oriented Reinforcement Learning

After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale
reinforcement learning training process as employed in DeepSeek-R1-Zero. This phase focuses
on enhancing the model’s reasoning capabilities, particularly in reasoning-intensive tasks such
as coding, mathematics, science, and logic reasoning, which involve well-defined problems with
clear solutions. During the training process, we observe that CoT often exhibits language mixing,
particularly when RL prompts involve multiple languages. To mitigate the issue of language
mixing, we introduce a language consistency reward during RL training, which is calculated
as the proportion of target language words in the CoT. Although ablation experiments show
that such alignment results in a slight degradation in the model’s performance, this reward
aligns with human preferences, making it more readable. Finally, we combine the accuracy of
reasoning tasks and the reward for language consistency by directly summing them to form the
final reward. We then apply RL training on the fine-tuned model until it achieves convergence
on reasoning tasks.
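The sketch below illustrates the language-consistency reward described above, computed as the proportion of target-language words in the CoT and then summed with the task reward; the word-level language check (a crude Unicode heuristic) and the function names are assumptions, not the actual implementation.

```python
def language_consistency_reward(cot_text: str, target_lang: str = "en") -> float:
    """Proportion of target-language words in the chain of thought.

    Crude heuristic: a word counts as English if it contains no CJK characters,
    and as Chinese otherwise. Real language identification would be more involved;
    this only illustrates how the reward is defined.
    """
    words = cot_text.split()
    if not words:
        return 0.0

    def is_target(word: str) -> bool:
        has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in word)
        return not has_cjk if target_lang == "en" else has_cjk

    return sum(is_target(w) for w in words) / len(words)

def final_reward(task_accuracy: float, cot_text: str, target_lang: str = "en") -> float:
    # The accuracy reward and the language-consistency reward are directly summed.
    return task_accuracy + language_consistency_reward(cot_text, target_lang)
```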
2.3.3. Rejection Sampling and Supervised Fine-Tuning

When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model's capabilities in writing, role-playing, and other general-purpose tasks. Specifically, we generate the data and fine-tune the model as described below.

Reasoning data We curate reasoning prompts and generate reasoning trajectories by perform-
ing rejection sampling from the checkpoint from the above RL training. In the previous stage,
we only included data that could be evaluated using rule-based rewards. However, in this stage,
we expand the dataset by incorporating additional data, some of which use a generative reward
model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment.
Additionally, because the model output is sometimes chaotic and difficult to read, we have
filtered out chains of thought with mixed languages, long paragraphs, and code blocks. For
each prompt, we sample multiple responses and retain only the correct ones. In total, we collect
about 600k reasoning related training samples.
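A sketch of the rejection-sampling loop described in this paragraph is shown below, under the assumption that sample_response, is_correct, and is_readable stand in for the RL checkpoint's sampler, the rule-based or DeepSeek-V3 generative verifier, and the readability filters (mixed language, long paragraphs, code blocks); none of these helpers come from the released code.

```python
def rejection_sample(prompts, sample_response, is_correct, is_readable, n_samples=16):
    """Collect SFT data by sampling several responses per prompt and keeping
    only those judged correct that also pass the readability filters."""
    sft_data = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = sample_response(prompt)        # sample from the RL checkpoint
            if not is_correct(prompt, response):      # rule-based or generative judgment
                continue
            if not is_readable(response):             # drop mixed-language / chaotic CoT
                continue
            sft_data.append({"prompt": prompt, "response": response})
    return sft_data
```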
Non-Reasoning data For non-reasoning data, such as writing, factual QA, self-cognition,
and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of
DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential
chain-of-thought before answering the question by prompting. However, for simpler queries,
such as “hello”, we do not provide a CoT in response. In the end, we collected a total of
approximately 200k training samples that are unrelated to reasoning.
We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about
800k samples.
2.3.4. Reinforcement Learning for all Scenarios

To further align the model with human preferences, we implement a secondary reinforcement
learning stage aimed at improving the model’s helpfulness and harmlessness while simultane-
ously refining its reasoning capabilities. Specifically, we train the model using a combination
of reward signals and diverse prompt distributions. For reasoning data, we adhere to the
methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the
learning process in math, code, and logical reasoning domains. For general data, we resort to
reward models to capture human preferences in complex and nuanced scenarios. We build
upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and train-
ing prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the
assessment emphasizes the utility and relevance of the response to the user while minimizing
interference with the underlying reasoning process. For harmlessness, we evaluate the entire
response of the model, including both the reasoning process and the summary, to identify and
mitigate any potential risks, biases, or harmful content that may arise during the generation
process. Ultimately, the integration of reward signals and diverse data distributions enables us
to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
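A schematic sketch of how the reward signals described above could be dispatched by data type follows; every helper passed in (rule_based_reward, preference_reward_model, split_summary) stands in for components the paper does not specify, and the final summation is just one simple way to integrate the two preference signals, so read this as an illustration of the routing logic rather than the actual pipeline.

```python
def combined_reward(sample, response,
                    rule_based_reward, preference_reward_model, split_summary):
    """Route a training sample to the appropriate reward signal.

    sample: dict with at least "type", "prompt", and (for reasoning data)
    "ground_truth". The three callables are hypothetical stand-ins.
    """
    if sample["type"] == "reasoning":
        # Math / code / logic data: rule-based reward, as in DeepSeek-R1-Zero.
        return rule_based_reward(sample["prompt"], response, sample["ground_truth"])

    # General data: preference reward models capture human preferences.
    _reasoning, summary = split_summary(response)
    helpfulness = preference_reward_model(sample["prompt"], summary)    # summary only
    harmlessness = preference_reward_model(sample["prompt"], response)  # full response
    return helpfulness + harmlessness
```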
2.4. Distillation: Empower Small Models with Reasoning Capability

To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly
fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using
the 800k samples curated with DeepSeek-R1, as detailed in §2.3.3. Our findings indicate that
this straightforward distillation method significantly enhances the reasoning abilities of smaller
models. The base models we use here are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-
14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. We select Llama-3.3 because its
reasoning capability is slightly better than that of Llama-3.1.
For distilled models, we apply only SFT and do not include an RL stage, even though
incorporating RL could substantially boost model performance. Our primary goal here is to
demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL
stage to the broader research community.
3. Experiment
Benchmarks We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema
et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), and CMMLU (Li et al.,
2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al.,
2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI,
2024d), Aider 1 , LiveCodeBench (Jain et al., 2024) (2024-08 – 2025-01), Codeforces 2 , Chinese
National High School Mathematics Olympiad (CNMO 2024)3 , and American Invitational Math-
ematics Examination 2024 (AIME 2024) (MAA, 2024). In addition to standard benchmarks, we
also evaluate our models on open-ended generation tasks using LLMs as judges. Specifically, we
adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li
et al., 2024), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. Here, we
only feed the final summary into the evaluation to avoid length bias. For distilled models, we
report representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and
LiveCodeBench.
Baselines We conduct comprehensive evaluations against several strong baselines, including DeepSeek-V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, and OpenAI-o1-1217. Since accessing the OpenAI-o1-1217 API is challenging in mainland China, we report its performance based on official reports. For distilled models, we also compare against the open-source model QwQ-32B-Preview (Qwen, 2024a).

Evaluation Setup We set the maximum generation length to 32,768 tokens for the models.
We found that using greedy decoding to evaluate long-output reasoning models results in
higher repetition rates and significant variability across different checkpoints. Therefore, we
default to pass@𝑘 evaluation (Chen et al., 2021) and report pass@1 using a non-zero temperature.
Specifically, we use a sampling temperature of 0.6 and a top- 𝑝 value of 0.95 to generate 𝑘
responses (typically between 4 and 64, depending on the test set size) for each question. Pass@1
is then calculated as
\[
\text{pass@1} = \frac{1}{k} \sum_{i=1}^{k} p_i,
\]
where 𝑝𝑖 denotes the correctness of the 𝑖-th response. This method provides more reliable
performance estimates. For AIME 2024, we also report consensus (majority vote) results (Wang
et al., 2022) using 64 samples, denoted as cons@64.
1 https://aider.chat
2 https://codeforces.com
3 https://www.cms.org.cn/Home/comp/comp/cid/12.html
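The pass@1 estimator above reduces to averaging per-sample correctness, as in the sketch below; the sampling settings (temperature 0.6, top-p 0.95, k between 4 and 64) are taken from the text, while the function name and the example correctness list are illustrative.

```python
def pass_at_1(correctness):
    """pass@1 = (1/k) * sum_i p_i, where p_i in {0, 1} marks whether the
    i-th sampled response (temperature 0.6, top-p 0.95) is correct."""
    k = len(correctness)
    return sum(correctness) / k

# Example: 4 sampled responses per question, 3 of them correct.
print(pass_at_1([1, 1, 0, 1]))  # 0.75
```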
3.1. DeepSeek-R1 Evaluation
Table 4 | Comparison between DeepSeek-R1 and other representative models.
DeepSeek-R1 avoids introducing length bias during GPT-based evaluations, further solidifying
its robustness across multiple tasks.
On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217,
surpassing other models by a large margin. A similar trend is observed on coding algorithm
tasks, such as LiveCodeBench and Codeforces, where reasoning-focused models dominate these
benchmarks. On engineering-oriented coding tasks, OpenAI-o1-1217 outperforms DeepSeek-R1
on Aider but achieves comparable performance on SWE Verified. We believe the engineering
performance of DeepSeek-R1 will improve in the next version, as the amount of related RL
training data currently remains very limited.
3.2. Distilled Model Evaluation

Model                          | AIME 2024         | MATH-500 | GPQA Diamond | LiveCodeBench | CodeForces
                               | pass@1    cons@64 | pass@1   | pass@1       | pass@1        | rating
GPT-4o-0513                    | 9.3       13.4    | 74.6     | 49.9         | 32.9          | 759
Claude-3.5-Sonnet-1022         | 16.0      26.7    | 78.3     | 65.0         | 38.9          | 717
OpenAI-o1-mini                 | 63.6      80.0    | 90.0     | 60.0         | 53.8          | 1820
QwQ-32B-Preview                | 50.0      60.0    | 90.6     | 54.5         | 41.9          | 1316
DeepSeek-R1-Distill-Qwen-1.5B  | 28.9      52.7    | 83.9     | 33.8         | 16.9          | 954
DeepSeek-R1-Distill-Qwen-7B    | 55.5      83.3    | 92.8     | 49.1         | 37.6          | 1189
DeepSeek-R1-Distill-Qwen-14B   | 69.7      80.0    | 93.9     | 59.1         | 53.1          | 1481
DeepSeek-R1-Distill-Qwen-32B   | 72.6      83.3    | 94.3     | 62.1         | 57.2          | 1691
DeepSeek-R1-Distill-Llama-8B   | 50.4      80.0    | 89.1     | 49.0         | 39.6          | 1205
DeepSeek-R1-Distill-Llama-70B  | 70.0      86.7    | 94.5     | 65.2         | 57.5          | 1633

Table 5 | Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks.
As shown in Table 5, simply distilling DeepSeek-R1’s outputs enables the efficient DeepSeek-
R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-
reasoning models like GPT-4o-0513 across the board. DeepSeek-R1-14B surpasses QwQ-32B-
Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly
exceed o1-mini on most benchmarks. These results demonstrate the strong potential of distilla-
tion. Additionally, we found that applying RL to these distilled models yields significant further
gains. We believe this warrants further exploration and therefore present only the results of the
simple SFT-distilled models here.
4. Discussion
4.1. Distillation v.s. Reinforcement Learning

In Section 3.2, we can see that by distilling DeepSeek-R1, the small model can achieve impressive
results. However, there is still one question left: can the model achieve comparable performance
through the large-scale RL training discussed in the paper without distillation?
To answer this question, we conduct large-scale RL training on Qwen-32B-Base using math,
code, and STEM data, training for over 10K steps, resulting in DeepSeek-R1-Zero-Qwen-32B. The
experimental results, shown in Table 6, demonstrate that the 32B base model, after large-scale
RL training, achieves performance on par with QwQ-32B-Preview but falls well short of DeepSeek-R1-Distill-Qwen-32B.

Model                         | AIME 2024         | MATH-500 | GPQA Diamond | LiveCodeBench
                              | pass@1    cons@64 | pass@1   | pass@1       | pass@1
QwQ-32B-Preview               | 50.0      60.0    | 90.6     | 54.5         | 41.9
DeepSeek-R1-Zero-Qwen-32B     | 47.0      60.0    | 91.6     | 55.0         | 40.2
DeepSeek-R1-Distill-Qwen-32B  | 72.6      83.3    | 94.3     | 62.1         | 57.2

Table 6 | Comparison of distilled and RL models on reasoning-related benchmarks.

Therefore, we can draw two conclusions. First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.

4.2. Unsuccessful Attempts
In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along
the way. We share our failure experiences here to provide insights, but this does not imply that
these approaches are incapable of developing effective reasoning models.
Process Reward Model (PRM) PRM is a reasonable method to guide the model toward better
approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al.,
2023). However, in practice, PRM has three main limitations that may hinder its ultimate suc-
cess. First, it is challenging to explicitly define a fine-grained step in general reasoning. Second,
determining whether the current intermediate step is correct is a challenging task. Automated
annotation using models may not yield satisfactory results, while manual annotation is not con-
ducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward
hacking (Gao et al., 2022), and retraining the reward model needs additional training resources
and it complicates the whole training pipeline. In conclusion, while PRM demonstrates a good
ability to rerank the top-N responses generated by the model or assist in guided search (Snell
et al., 2024), its advantages are limited compared to the additional computational overhead it
introduces during the large-scale reinforcement learning process in our experiments.
Monte Carlo Tree Search (MCTS) Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Sil-
ver et al., 2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time
compute scalability. This approach involves breaking answers into smaller parts to allow the
model to explore the solution space systematically. To facilitate this, we prompt the model to
generate multiple tags that correspond to specific reasoning steps necessary for the search. For
training, we first use collected prompts to find answers via MCTS guided by a pre-trained value
model. Subsequently, we use the resulting question-answer pairs to train both the actor model
and the value model, iteratively refining the process.
However, this approach encounters several challenges when scaling up the training. First,
unlike chess, where the search space is relatively well-defined, token generation presents an
exponentially larger search space. To address this, we set a maximum extension limit for each
node, but this can lead to the model getting stuck in local optima. Second, the value model
directly influences the quality of generation since it guides each step of the search process.
Training a fine-grained value model is inherently difficult, which makes it challenging for the
model to iteratively improve. While AlphaGo’s core success relied on training a value model to
progressively enhance its performance, this principle proves difficult to replicate in our setup
due to the complexities of token generation.
In conclusion, while MCTS can improve performance during inference when paired with a
pre-trained value model, iteratively boosting model performance through self-search remains a
significant challenge.
5. Conclusion, Limitations, and Future Work

In this work, we share our journey in enhancing model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across a variety of tasks. DeepSeek-R1 is more powerful, leveraging cold-start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on a range of tasks.

In the future, we plan to invest in research on DeepSeek-R1 in the following directions.

• General capability: Currently, the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn conversation, complex role-playing, and JSON output. Going forward, we plan to explore how CoT can be leveraged to enhance tasks in these areas.
• Language mixing: DeepSeek-R1 is currently optimized for Chinese and English, which may result in language-mixing issues when handling queries in other languages. For instance, DeepSeek-R1 might use English for reasoning and responses even when the query is in a language other than English or Chinese. We aim to address this limitation in future updates.
References
AI@Meta. Llama 3.1 model card, 2024. URL https://github.com/meta-llama/llama-m
odels/blob/main/models/llama3_1/MODEL_CARD.md.
Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3
-5-sonnet.
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda,
N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin,
B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet,
F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss,
A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse,
A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage,
M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and
W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.
URL https://arxiv.org/abs/2107.03374.
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A
multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint
arXiv:2305.08322, 2023.
N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica.
Livecodebench: Holistic and contamination free evaluation of large language models for code.
CoRR, abs/2403.07974, 2024. URL https://doi.org/10.48550/arXiv.2403.07974.
S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, and M. Faruqui.
Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. CoRR,
abs/2409.12941, 2024. doi: 10.48550/ARXIV.2409.12941. URL https://doi.org/10.485
50/arXiv.2409.12941.
A. Kumar, V. Zhuang, R. Agarwal, Y. Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop,
R. Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv
preprint arXiv:2409.12917, 2024.
H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measur-
ing massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212,
2023.
T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica. From
crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv
preprint arXiv:2406.11939, 2024.
H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman,
I. Sutskever, and K. Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
B. Y. Lin. ZeroEval: A Unified Framework for Evaluating Language Models, July 2024. URL
https://github.com/WildEval/ZeroEval.
MAA. American invitational mathematics examination - aime. In American Invitational
Mathematics Examination - AIME 2024, February 2024. URL https://maa.org/math
-competitions/american-invitational-mathematics-examination-aime.
OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/.
OpenAI. Learning to reason with llms, 2024b. URL https://openai.com/index/learnin
g-to-reason-with-llms/.
OpenAI. Introducing SimpleQA, 2024c. URL https://openai.com/index/introducing
-simpleqa/.
OpenAI. Introducing SWE-bench verified we’re releasing a human-validated subset of swe-
bench that more, 2024d. URL https://openai.com/index/introducing-swe-bench
-verified/.
Qwen. Qwq: Reflect deeply on the boundaries of the unknown, 2024a. URL https://qwenlm
.github.io/blog/qwq-32b-preview/.
Qwen. Qwen2.5: A party of foundation models, 2024b. URL https://qwenlm.github.io/b
log/qwen2.5.
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman.
GPQA: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath:
Pushing the limits of mathematical reasoning in open language models. arXiv preprint
arXiv:2402.03300, 2024.
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre,
D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and
shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815,
2017a. URL http://arxiv.org/abs/1712.01815.
D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker,
M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and
D. Hassabis. Mastering the game of go without human knowledge. Nat., 550(7676):354–359,
2017b. doi: 10.1038/NATURE24270. URL https://doi.org/10.1038/nature24270.
C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more
effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.033
14.
T. Trinh, Y. Wu, Q. Le, H. He, and T. Luong. Solving olympiad geometry without human
demonstrations. Nature, 2024. doi: 10.1038/s41586-023-06747-5.
P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: A label-
free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935,
2023.
Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li,
M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and
challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024.
URL https://doi.org/10.48550/arXiv.2406.01574.
C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint, 2024.
H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao,
Q. Zhu, D. Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan. Deepseek-prover-v1.5: Harnessing
proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024. URL
https://arxiv.org/abs/2408.08152.
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following
evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
Appendix
A. Contributions and Acknowledgments
Ruyi Chen Y.X. Wei
Shanghao Lu Yang Zhang
Shangyan Zhou Yanhong Xu
Shanhuang Chen Yao Li
Shengfeng Ye Yao Zhao
Shiyu Wang Yaofeng Sun
Shuiping Yu Yaohui Wang
Shunfeng Zhou Yi Yu
Shuting Pan Yichao Zhang
S.S. Li Yifan Shi
Shuang Zhou Yiliang Xiong
Shaoqing Wu Ying He
Shengfeng Ye Yishi Piao
Tao Yun Yisong Wang
Tian Pei Yixuan Tan
Tianyu Sun Yiyang Ma*
T. Wang Yiyuan Liu
Wangding Zeng Yongqiang Guo
Wen Liu Yuan Ou
Wenfeng Liang Yuduan Wang
Wenjun Gao Yue Gong
Wenqin Yu* Yuheng Zou
Wentao Zhang Yujia He
W.L. Xiao Yunfan Xiong
Wei An Yuxiang Luo
Xiaodong Liu Yuxiang You
Xiaohan Wang Yuxuan Liu
Xiaokang Chen Yuyang Zhou
Xiaotao Nie Y.X. Zhu
Xin Cheng Yanping Huang
Xin Liu Yaohui Li
Xin Xie Yi Zheng
Xingchao Liu Yuchen Zhu
Xinyu Yang Yunxian Ma
Xinyuan Li Ying Tang
Xuecheng Su Yukun Zha
Xuheng Lin Yuting Yan
X.Q. Li Z.Z. Ren
Xiangyue Jin Zehui Ren
Xiaojin Shen Zhangli Sha
Xiaosha Chen Zhe Fu
Xiaowen Sun Zhean Xu
Xiaoxiang Wang Zhenda Xie
Xinnan Song Zhengyan Zhang
Xinyi Zhou Zhewen Hao
Xianzu Wang Zhicheng Ma
Xinxia Shan Zhigang Yan
Y.K. Li Zhiyu Wu
Y.Q. Wang Zihui Gu
Zijia Zhu Zhen Huang
Zijun Liu* Zhipeng Xu
Zilin Li Zhongyu Zhang
Ziwei Xie Zhen Zhang
Ziyang Song
Zizheng Pan
Within each role, authors are listed alphabetically by the first name. Names marked with *
denote individuals who have departed from our team.