DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI
Abstract
Figure 1 | Benchmark performance of DeepSeek-R1, reported as Accuracy / Percentile (%) on AIME 2024 (Pass@1), Codeforces (Percentile), GPQA Diamond (Pass@1), MATH-500 (Pass@1), MMLU (Pass@1), and SWE-bench Verified (Resolved).
Contents

1 Introduction
  1.1 Contributions
  1.2 Summary of Evaluation Results
2 Approach
  2.1 Overview
  2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
    2.2.1 Reinforcement Learning Algorithm
    2.2.2 Reward Modeling
    2.2.3 Training Template
    2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
  2.3 DeepSeek-R1: Reinforcement Learning with Cold Start
    2.3.1 Cold Start
    2.3.2 Reasoning-oriented Reinforcement Learning
    2.3.3 Rejection Sampling and Supervised Fine-Tuning
    2.3.4 Reinforcement Learning for all Scenarios
  2.4 Distillation: Empower Small Models with Reasoning Capability
3 Experiment
  3.1 DeepSeek-R1 Evaluation
  3.2 Distilled Model Evaluation
4 Discussion
  4.1 Distillation v.s. Reinforcement Learning
  4.2 Unsuccessful Attempts
5 Conclusion, Limitations, and Future Work
A Contributions and Acknowledgments
1. Introduction
In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and
evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap
towards Artificial General Intelligence (AGI).
Recently, post-training has emerged as an important component of the full training pipeline.
It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt
to user preferences, all while requiring relatively minimal computational resources compared with
pre-training. In the context of reasoning capabilities, OpenAI’s o1 (OpenAI, 2024b) series models
were the first to introduce inference-time scaling by increasing the length of the Chain-of-
Thought reasoning process. This approach has achieved significant improvements in various
reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge
of effective test-time scaling remains an open question for the research community. Several prior
works have explored various approaches, including process-based reward models (Lightman
et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024),
and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh
et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning
performance comparable to OpenAI’s o1 series models.
In this paper, we take the first step toward improving language model reasoning capabilities
using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop
reasoning capabilities without any supervised data, focusing on their self-evolution through
a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ
GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning.
During training, DeepSeek-R1-Zero naturally emerges with numerous powerful and interesting
reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits superb performance
on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to
71.0%, and with majority voting, the score further improves to 86.7%, matching the performance
of OpenAI-o1-0912.
However, DeepSeek-R1-Zero encounters challenges such as poor readability and language
mixing. To address these issues and further enhance reasoning performance, we introduce
DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training
pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the
DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-
Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection
sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains
such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model.
After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking
into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to
as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.
We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-
32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying
RL on it. This demonstrates that the reasoning patterns discovered by larger base models are cru-
cial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey
et al., 2024) series. Notably, our distilled 14B model outperforms state-of-the-art open-source
QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a
new record on the reasoning benchmarks among dense models.
1.1. Contributions

Post-Training: Large-Scale Reinforcement Learning on the Base Model

• We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) reasoning for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advances in this area.

Distillation: Smaller Models Can Be Powerful Too

• We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared with the reasoning patterns discovered through RL on small models. The open-source DeepSeek-R1, as well as its API, will benefit the research community in distilling better small models in the future.
• Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results show that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. In addition, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results significantly outperform previous open-source models and are comparable to o1-mini. We open-source the distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on the Qwen2.5 and Llama3 series to the community.

1.2. Summary of Evaluation Results
• Reasoning tasks: (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly
surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%,
performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2)
On coding-related tasks, DeepSeek-R1 demonstrates expert-level performance in code competition
tasks, achieving a 2,029 Elo rating on Codeforces and outperforming 96.3% of human participants in
the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than
DeepSeek-V3, which could help developers in real world tasks.
• Knowledge: On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-
R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores
of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its
performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1
surpasses other closed-source models, demonstrating its competitive edge in educational
tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3,
demonstrating its capability in handling fact-based queries. A similar trend is observed
where OpenAI-o1 surpasses GPT-4o on this benchmark.
• Others: DeepSeek-R1 also excels in a wide range of tasks, including creative writing,
general question answering, editing, summarization, and more. It achieves an impressive
length-controlled win-rate of 87.6% on AlpacaEval 2.0 and a win-rate of 92.3% on ArenaHard,
showcasing its strong ability to intelligently handle non-exam-oriented queries.
Additionally, DeepSeek-R1 demonstrates outstanding performance on tasks requiring
long-context understanding, substantially outperforming DeepSeek-V3 on long-context
benchmarks.
2. Approach
2.1. Overview
Previous work has heavily relied on large amounts of supervised data to enhance model
performance. In this study, we demonstrate that reasoning capabilities can be significantly
improved through large-scale reinforcement learning (RL), even without using supervised
fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with
the inclusion of a small amount of cold-start data. In the following sections, we present: (1)
DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data, and
(2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of
long Chain-of-Thought (CoT) examples, and (3) the distillation of reasoning capability from
DeepSeek-R1 into small dense models.
2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

Reinforcement learning has demonstrated significant effectiveness in reasoning tasks, as evidenced by our previous works (Shao et al., 2024; Wang et al., 2023). However, these works depended heavily on supervised data, which are time-intensive to gather. In this section, we explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process. We start with a brief overview of our RL algorithm, followed by the presentation of some exciting results, and hope this provides the community with valuable insights.

2.2.1. Reinforcement Learning Algorithm

Group Relative Policy Optimization In order to save the training costs of RL, we adopt Group
Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is
typically the same size as the policy model, and estimates the baseline from group scores instead.
Specifically, for each question 𝑞, GRPO samples a group of outputs { 𝑜1 , 𝑜2 , · · · , 𝑜𝐺 } from the old
policy 𝜋𝜃𝑜𝑙𝑑 and then optimizes the policy model 𝜋𝜃 by maximizing the following objective:
\[
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\big[q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\big]\,
\frac{1}{G}\sum_{i=1}^{G}\left( \min\!\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}\, A_i,\ \mathrm{clip}\!\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\!\left(\pi_{\theta} \,\|\, \pi_{ref}\right) \right), \tag{1}
\]
\[
\mathbb{D}_{KL}\!\left(\pi_{\theta} \,\|\, \pi_{ref}\right) = \frac{\pi_{ref}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - \log\frac{\pi_{ref}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - 1, \tag{2}
\]
where \(\varepsilon\) and \(\beta\) are hyper-parameters, and \(A_i\) is the advantage, computed using a group of rewards \(\{r_1, r_2, \ldots, r_G\}\) corresponding to the outputs within each group:
\[
A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \cdots, r_G\})}{\mathrm{std}(\{r_1, r_2, \cdots, r_G\})}. \tag{3}
\]
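To make the objective concrete, the following is a minimal PyTorch-style sketch of Equations (1)-(3) for a single group of outputs, not the authors' implementation: the tensor names (logp_new, logp_old, logp_ref for per-output log-probabilities) and the default values of eps and beta are illustrative assumptions.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Sketch of the GRPO objective (Eqs. 1-3) for one group of G sampled outputs.

    logp_new, logp_old, logp_ref: [G] log-probabilities of each output o_i under
    the current, old, and reference policies. rewards: [G] scalar rewards r_i.
    """
    # Eq. (3): group-normalized advantage, no critic model needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio pi_theta / pi_theta_old, with the clipped surrogate of Eq. (1).
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    surrogate = torch.min(unclipped, clipped)

    # Eq. (2): estimator of D_KL(pi_theta || pi_ref).
    log_ref_ratio = logp_ref - logp_new
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1

    # The objective is maximized, so return its negation averaged over the group.
    return -(surrogate - beta * kl).mean()
```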
A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user
with the answer. The reasoning process and answer are enclosed within <think> </think> and
<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant:
Table 1 | Template for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning
question during training.
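As a small illustration of how the Table 1 template is used, the sketch below substitutes a reasoning question for the prompt placeholder; the helper name build_prompt is an assumption for illustration, not the released training code, while the template string itself is taken verbatim from Table 1.

```python
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think> <answer> answer "
    "here </answer>. User: {prompt}. Assistant:"
)

def build_prompt(question: str) -> str:
    # Replace the placeholder with the specific reasoning question, as in Table 1.
    return R1_ZERO_TEMPLATE.format(prompt=question)

print(build_prompt("If a > 1, what is the sum of the real solutions of sqrt(a - sqrt(a + x)) = x?"))
```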
2.2.2. Reward Modeling

The reward is the source of the training signal, which decides the optimization direction of RL.
To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two
types of rewards:
• Accuracy rewards: The accuracy reward model evaluates whether the response is correct.
For example, in the case of math problems with deterministic results, the model is required
to provide the final answer in a specified format (e.g., within a box), enabling reliable
rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be
used to generate feedback based on predefined test cases.
• Format rewards: In addition to the accuracy reward model, we employ a format reward
model that requires the model to place its thinking process between ‘<think>’ and ‘</think>’
tags.
We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero,
because we find that the neural reward model may suffer from reward hacking in the large-scale
reinforcement learning process, and retraining the reward model needs additional training
resources and it complicates the whole training pipeline.
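A minimal sketch of such a rule-based reward follows, assuming boxed final answers for math problems and a simple tag check for the format reward; the exact matching rules, weights, and helper names are illustrative assumptions rather than the production reward system.

```python
import re

def format_reward(response: str) -> float:
    # Reward responses that wrap the thinking process in <think> ... </think>
    # followed by an <answer> ... </answer> block.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # For math problems with deterministic results, require the final answer in a
    # specified format (here: \boxed{...}) so correctness can be checked by rule.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Combine the two rule-based signals; a plain sum is one simple choice.
    return accuracy_reward(response, ground_truth) + format_reward(response)
```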
2.2.3. Training Template

To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base model to adhere to our specified instructions. As shown in Table 1, this template requires DeepSeek-R1-Zero to first produce a reasoning process, followed by the final answer. We intentionally limit our constraints to this structural format, avoiding any content-specific biases (such as mandating reflective reasoning or promoting particular problem-solving strategies) to ensure that we can accurately observe the model's natural progression during the RL process.
2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero

Model              | AIME 2024         | MATH-500 | GPQA Diamond | LiveCodeBench | CodeForces
                   | pass@1    cons@64 | pass@1   | pass@1       | pass@1        | rating
OpenAI-o1-mini     | 63.6      80.0    | 90.0     | 60.0         | 53.8          | 1820
OpenAI-o1-0912     | 74.4      83.3    | 94.8     | 77.3         | 63.4          | 1843
DeepSeek-R1-Zero   | 71.0      86.7    | 95.9     | 73.3         | 50.0          | 1444

Table 2 | Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.
Figure 2 | AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample
16 responses and calculate the overall average accuracy to ensure a stable evaluation.
DeepSeek-R1-Zero thus attains robust reasoning capabilities without the need for any supervised
fine-tuning data. This is a noteworthy achievement, as it underscores the model’s ability to
learn and generalize effectively through RL alone. Additionally, the performance of DeepSeek-
R1-Zero can be further augmented through the application of majority voting. For example,
when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero’s performance
escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912. The
ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without
majority voting, highlights its strong foundational capabilities and its potential for further
advancements in reasoning tasks.
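The majority-voting (cons@64) number can be reproduced conceptually with a sketch like the one below, which tallies the final answers of the sampled responses and scores the most frequent one; how final answers are extracted from responses is an assumption left outside the sketch.

```python
from collections import Counter

def consensus_at_k(sampled_answers, ground_truth):
    """Majority voting over k extracted final answers (e.g. cons@64).

    Returns 1.0 if the most common answer matches the ground truth, else 0.0.
    """
    most_common_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if most_common_answer == ground_truth else 0.0

# Example: 64 samples where the correct answer "42" wins the vote.
answers = ["42"] * 40 + ["41"] * 24
print(consensus_at_k(answers, "42"))  # 1.0
```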
Figure 3 | The average response length of DeepSeek-R1-Zero on the training set during the RL
process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.
As shown in Figure 3, the thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process. This improvement is not the result of external adjustments
but rather an intrinsic development within the model. DeepSeek-R1-Zero naturally acquires the
ability to solve increasingly complex reasoning tasks by leveraging extended test-time compu-
tation. This computation ranges from generating hundreds to thousands of reasoning tokens,
allowing the model to explore and refine its thought processes in greater depth.
One of the most remarkable aspects of this self-evolution is the emergence of sophisticated
behaviors as the test-time computation increases. Behaviors such as reflection—where the model
revisits and reevaluates its previous steps—and the exploration of alternative approaches to
problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead
emerge as a result of the model’s interaction with the reinforcement learning environment. This
spontaneous development significantly enhances DeepSeek-R1-Zero’s reasoning capabilities,
enabling it to tackle more challenging tasks with greater efficiency and accuracy.
Question: If $a > 1$, then the sum of the real solutions of $\sqrt{a - \sqrt{a + x}} = x$ is equal to

Response: <think>
To solve the equation $\sqrt{a - \sqrt{a + x}} = x$, let's start by squaring both $\cdots$
$\left(\sqrt{a - \sqrt{a + x}}\right)^2 = x^2 \implies a - \sqrt{a + x} = x^2.$
Rearrange to isolate the inner square root term:
$(a - x^2)^2 = a + x \implies a^2 - 2ax^2 + (x^2)^2 = a + x \implies x^4 - 2ax^2 - x + (a^2 - a) = 0$
...
Wait, wait. Wait. That's an aha moment I can flag here.
Let's reevaluate this step-by-step to identify if the correct sum can be $\cdots$
We started with the equation:
$\sqrt{a - \sqrt{a + x}} = x$
First, let's square both sides:
$a - \sqrt{a + x} = x^2 \implies \sqrt{a + x} = a - x^2$
Next, I could square both sides again, treating the equation: $\cdots$
...

Table 3 | An interesting "aha moment" of an intermediate version of DeepSeek-R1-Zero. The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning.

This moment is not only an "aha moment" for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The "aha moment" serves as a powerful reminder of the potential of reinforcement learning to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.
2.3. DeepSeek-R1: Reinforcement Learning with Cold Start

Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can
reasoning performance be further improved or convergence accelerated by incorporating a small
amount of high-quality data as a cold start? 2) How can we train a user-friendly model that
not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong
general capabilities? To address these questions, we design a pipeline to train DeepSeek-R1. The
pipeline consists of four stages, outlined as follows.
2.3.1. Cold Start

Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from
the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data
to fine-tune the model as the initial RL actor. To collect such data, we have explored several
approaches: using few-shot prompting with a long CoT as an example, directly prompting
models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-
Zero outputs in a readable format, and refining the results through post-processing by human
annotators.
In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as
the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data
include:
• Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable
for reading. Responses may mix multiple languages or lack markdown formatting to
highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1,
we design a readable pattern that includes a summary at the end of each response and
filters out responses that are not reader-friendly. Here, we define the output format as
|special_token|<reasoning_process>|special_token|<summary>, where the reasoning
process is the CoT for the query, and the summary is used to summarize the reasoning
results. A minimal parsing sketch of this format is shown after this list.
• Potential: By carefully designing the pattern for cold-start data with human priors, we
observe better performance than DeepSeek-R1-Zero. We believe iterative training is
a better way for reasoning models.
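Below is the parsing sketch referenced above, assuming |special_token| is a literal delimiter string; the actual special token and any post-processing applied to the cold-start data are not specified here, so treat this purely as an illustration of the reasoning/summary layout.

```python
SPECIAL_TOKEN = "|special_token|"  # placeholder; the real delimiter is a model special token

def parse_cold_start_output(text: str):
    """Split a cold-start response of the form
    |special_token|<reasoning_process>|special_token|<summary>
    into its reasoning process (CoT) and final summary."""
    parts = text.split(SPECIAL_TOKEN)
    # parts[0] is whatever precedes the first delimiter (usually empty).
    if len(parts) < 3:
        raise ValueError("response does not follow the expected cold-start format")
    reasoning_process, summary = parts[1], parts[2]
    return reasoning_process.strip(), summary.strip()

cot, summary = parse_cold_start_output(
    "|special_token|First, factor the expression ...|special_token|The answer is 12."
)
```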
2.3.2. Reasoning-oriented Reinforcement Learning

After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale
reinforcement learning training process as employed in DeepSeek-R1-Zero. This phase focuses
on enhancing the model’s reasoning capabilities, particularly in reasoning-intensive tasks such
as coding, mathematics, science, and logic reasoning, which involve well-defined problems with
clear solutions. During the training process, we observe that CoT often exhibits language mixing,
particularly when RL prompts involve multiple languages. To mitigate the issue of language
mixing, we introduce a language consistency reward during RL training, which is calculated
as the proportion of target language words in the CoT. Although ablation experiments show
that such alignment results in a slight degradation in the model’s performance, this reward
aligns with human preferences, making it more readable. Finally, we combine the accuracy of
reasoning tasks and the reward for language consistency by directly summing them to form the
final reward. We then apply RL training on the fine-tuned model until it achieves convergence
on reasoning tasks.
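The sketch below illustrates the language-consistency reward described above, computed as the proportion of target-language words in the CoT and then summed with the task reward; the word-level language check (a crude Unicode heuristic) and the function names are assumptions, not the actual implementation.

```python
def language_consistency_reward(cot_text: str, target_lang: str = "en") -> float:
    """Proportion of target-language words in the chain of thought.

    Crude heuristic: a word counts as English if it contains no CJK characters,
    and as Chinese otherwise. Real language identification would be more involved;
    this only illustrates how the reward is defined.
    """
    words = cot_text.split()
    if not words:
        return 0.0

    def is_target(word: str) -> bool:
        has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in word)
        return not has_cjk if target_lang == "en" else has_cjk

    return sum(is_target(w) for w in words) / len(words)

def final_reward(task_accuracy: float, cot_text: str, target_lang: str = "en") -> float:
    # The accuracy reward and the language-consistency reward are directly summed.
    return task_accuracy + language_consistency_reward(cot_text, target_lang)
```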
2.3.3. Rejection Sampling and Supervised Fine-Tuning

When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model's capabilities in writing, role-playing, and other general-purpose tasks. Specifically, we generate the data and fine-tune the model as described below.

Reasoning data We curate reasoning prompts and generate reasoning trajectories by perform-
ing rejection sampling from the checkpoint from the above RL training. In the previous stage,
we only included data that could be evaluated using rule-based rewards. However, in this stage,
we expand the dataset by incorporating additional data, some of which use a generative reward
model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment.
Additionally, because the model output is sometimes chaotic and difficult to read, we have
filtered out chains of thought with mixed languages, long paragraphs, and code blocks. For
each prompt, we sample multiple responses and retain only the correct ones. In total, we collect
about 600k reasoning related training samples.
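A sketch of the rejection-sampling loop described in this paragraph is shown below, under the assumption that sample_response, is_correct, and is_readable stand in for the RL checkpoint's sampler, the rule-based or DeepSeek-V3 generative verifier, and the readability filters (mixed language, long paragraphs, code blocks); none of these helpers come from the released code.

```python
def rejection_sample(prompts, sample_response, is_correct, is_readable, n_samples=16):
    """Collect SFT data by sampling several responses per prompt and keeping
    only those judged correct that also pass the readability filters."""
    sft_data = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = sample_response(prompt)        # sample from the RL checkpoint
            if not is_correct(prompt, response):      # rule-based or generative judgment
                continue
            if not is_readable(response):             # drop mixed-language / chaotic CoT
                continue
            sft_data.append({"prompt": prompt, "response": response})
    return sft_data
```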
Non-Reasoning data For non-reasoning data, such as writing, factual QA, self-cognition,
and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of
DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential
chain-of-thought before answering the question by prompting. However, for simpler queries,
such as “hello”, we do not provide a CoT in response. In the end, we collected a total of
approximately 200k training samples that are unrelated to reasoning.
We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about
800k samples.
2.3.4. Reinforcement Learning for all Scenarios

To further align the model with human preferences, we implement a secondary reinforcement
learning stage aimed at improving the model’s helpfulness and harmlessness while simultane-
ously refining its reasoning capabilities. Specifically, we train the model using a combination
of reward signals and diverse prompt distributions. For reasoning data, we adhere to the
methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the
learning process in math, code, and logical reasoning domains. For general data, we resort to
reward models to capture human preferences in complex and nuanced scenarios. We build
upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and train-
ing prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the
assessment emphasizes the utility and relevance of the response to the user while minimizing
interference with the underlying reasoning process. For harmlessness, we evaluate the entire
response of the model, including both the reasoning process and the summary, to identify and
mitigate any potential risks, biases, or harmful content that may arise during the generation
process. Ultimately, the integration of reward signals and diverse data distributions enables us
to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
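A schematic sketch of how the reward signals described above could be dispatched by data type follows; every helper passed in (rule_based_reward, preference_reward_model, split_summary) stands in for components the paper does not specify, and the final summation is just one simple way to integrate the two preference signals, so read this as an illustration of the routing logic rather than the actual pipeline.

```python
def combined_reward(sample, response,
                    rule_based_reward, preference_reward_model, split_summary):
    """Route a training sample to the appropriate reward signal.

    sample: dict with at least "type", "prompt", and (for reasoning data)
    "ground_truth". The three callables are hypothetical stand-ins.
    """
    if sample["type"] == "reasoning":
        # Math / code / logic data: rule-based reward, as in DeepSeek-R1-Zero.
        return rule_based_reward(sample["prompt"], response, sample["ground_truth"])

    # General data: preference reward models capture human preferences.
    _reasoning, summary = split_summary(response)
    helpfulness = preference_reward_model(sample["prompt"], summary)    # summary only
    harmlessness = preference_reward_model(sample["prompt"], response)  # full response
    return helpfulness + harmlessness
```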
2.4. Distillation: Empower Small Models with Reasoning Capability

To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly
fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using
the 800k samples curated with DeepSeek-R1, as detailed in §2.3.3. Our findings indicate that
this straightforward distillation method significantly enhances the reasoning abilities of smaller
models. The base models we use here are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-
14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. We select Llama-3.3 because its
reasoning capability is slightly better than that of Llama-3.1.
For distilled models, we apply only SFT and do not include an RL stage, even though
incorporating RL could substantially boost model performance. Our primary goal here is to
demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL
stage to the broader research community.
3. Experiment
Benchmarks We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema
et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), and CMMLU (Li et al.,
2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al.,
2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI,
2024d), Aider 1 , LiveCodeBench (Jain et al., 2024) (2024-08 – 2025-01), Codeforces 2 , Chinese
National High School Mathematics Olympiad (CNMO 2024)3 , and American Invitational Math-
ematics Examination 2024 (AIME 2024) (MAA, 2024). In addition to standard benchmarks, we
also evaluate our models on open-ended generation tasks using LLMs as judges. Specifically, we
adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li
et al., 2024), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. Here, we
only feed the final summary into the evaluation to avoid length bias. For distilled models, we
report representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and
LiveCodeBench.
Baselines We conduct comprehensive evaluations against several strong baselines, including DeepSeek-V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, and OpenAI-o1-1217. Since accessing the OpenAI-o1-1217 API is challenging in mainland China, we report its performance based on official reports. For distilled models, we also compare against the open-source model QwQ-32B-Preview (Qwen, 2024a).

Evaluation Setup We set the maximum generation length to 32,768 tokens for the models.
We found that using greedy decoding to evaluate long-output reasoning models results in
higher repetition rates and significant variability across different checkpoints. Therefore, we
default to pass@𝑘 evaluation (Chen et al., 2021) and report pass@1 using a non-zero temperature.
Specifically, we use a sampling temperature of 0.6 and a top- 𝑝 value of 0.95 to generate 𝑘
responses (typically between 4 and 64, depending on the test set size) for each question. Pass@1
is then calculated as
\[
\text{pass@1} = \frac{1}{k} \sum_{i=1}^{k} p_i,
\]
where 𝑝𝑖 denotes the correctness of the 𝑖-th response. This method provides more reliable
performance estimates. For AIME 2024, we also report consensus (majority vote) results (Wang
et al., 2022) using 64 samples, denoted as cons@64.
1 https://aider.chat
2 https://codeforces.com
3 https://www.cms.org.cn/Home/comp/comp/cid/12.html
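The pass@1 estimator above reduces to averaging per-sample correctness, as in the sketch below; the sampling settings (temperature 0.6, top-p 0.95, k between 4 and 64) are taken from the text, while the function name and the example correctness list are illustrative.

```python
def pass_at_1(correctness):
    """pass@1 = (1/k) * sum_i p_i, where p_i in {0, 1} marks whether the
    i-th sampled response (temperature 0.6, top-p 0.95) is correct."""
    k = len(correctness)
    return sum(correctness) / k

# Example: 4 sampled responses per question, 3 of them correct.
print(pass_at_1([1, 1, 0, 1]))  # 0.75
```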
3.1. DeepSeek-R1 Evaluation
Table 4 | Comparison between DeepSeek-R1 and other representative models.
DeepSeek-R1 avoids introducing length bias during GPT-based evaluations, further solidifying
its robustness across multiple tasks.
On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217,
surpassing other models by a large margin. A similar trend is observed on coding algorithm
tasks, such as LiveCodeBench and Codeforces, where reasoning-focused models dominate these
benchmarks. On engineering-oriented coding tasks, OpenAI-o1-1217 outperforms DeepSeek-R1
on Aider but achieves comparable performance on SWE Verified. We believe the engineering
performance of DeepSeek-R1 will improve in the next version, as the amount of related RL
training data currently remains very limited.
3.2. Distilled Model Evaluation

Model                          | AIME 2024         | MATH-500 | GPQA Diamond | LiveCodeBench | CodeForces
                               | pass@1    cons@64 | pass@1   | pass@1       | pass@1        | rating
GPT-4o-0513                    | 9.3       13.4    | 74.6     | 49.9         | 32.9          | 759
Claude-3.5-Sonnet-1022         | 16.0      26.7    | 78.3     | 65.0         | 38.9          | 717
OpenAI-o1-mini                 | 63.6      80.0    | 90.0     | 60.0         | 53.8          | 1820
QwQ-32B-Preview                | 50.0      60.0    | 90.6     | 54.5         | 41.9          | 1316
DeepSeek-R1-Distill-Qwen-1.5B  | 28.9      52.7    | 83.9     | 33.8         | 16.9          | 954
DeepSeek-R1-Distill-Qwen-7B    | 55.5      83.3    | 92.8     | 49.1         | 37.6          | 1189
DeepSeek-R1-Distill-Qwen-14B   | 69.7      80.0    | 93.9     | 59.1         | 53.1          | 1481
DeepSeek-R1-Distill-Qwen-32B   | 72.6      83.3    | 94.3     | 62.1         | 57.2          | 1691
DeepSeek-R1-Distill-Llama-8B   | 50.4      80.0    | 89.1     | 49.0         | 39.6          | 1205
DeepSeek-R1-Distill-Llama-70B  | 70.0      86.7    | 94.5     | 65.2         | 57.5          | 1633

Table 5 | Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks.
As shown in Table 5, simply distilling DeepSeek-R1’s outputs enables the efficient DeepSeek-
R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-
reasoning models like GPT-4o-0513 across the board. DeepSeek-R1-14B surpasses QwQ-32B-
Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly
exceed o1-mini on most benchmarks. These results demonstrate the strong potential of distilla-
tion. Additionally, we found that applying RL to these distilled models yields significant further
gains. We believe this warrants further exploration and therefore present only the results of the
simple SFT-distilled models here.
4. Discussion
4.1. Distillation v.s. Reinforcement Learning

In Section 3.2, we can see that by distilling DeepSeek-R1, the small model can achieve impressive
results. However, there is still one question left: can the model achieve comparable performance
through the large-scale RL training discussed in the paper without distillation?
To answer this question, we conduct large-scale RL training on Qwen-32B-Base using math,
code, and STEM data, training for over 10K steps, resulting in DeepSeek-R1-Zero-Qwen-32B. The
experimental results, shown in Table 6, demonstrate that the 32B base model, after large-scale
RL training, achieves performance on par with QwQ-32B-Preview but falls well short of DeepSeek-R1-Distill-Qwen-32B.

Model                         | AIME 2024         | MATH-500 | GPQA Diamond | LiveCodeBench
                              | pass@1    cons@64 | pass@1   | pass@1       | pass@1
QwQ-32B-Preview               | 50.0      60.0    | 90.6     | 54.5         | 41.9
DeepSeek-R1-Zero-Qwen-32B     | 47.0      60.0    | 91.6     | 55.0         | 40.2
DeepSeek-R1-Distill-Qwen-32B  | 72.6      83.3    | 94.3     | 62.1         | 57.2

Table 6 | Comparison of distilled and RL models on reasoning-related benchmarks.

Therefore, we can draw two conclusions. First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.

4.2. Unsuccessful Attempts
In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along
the way. We share our failure experiences here to provide insights, but this does not imply that
these approaches are incapable of developing effective reasoning models.
Process Reward Model (PRM) PRM is a reasonable method to guide the model toward better
approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al.,
2023). However, in practice, PRM has three main limitations that may hinder its ultimate suc-
cess. First, it is challenging to explicitly define a fine-grained step in general reasoning. Second,
determining whether the current intermediate step is correct is a challenging task. Automated
annotation using models may not yield satisfactory results, while manual annotation is not con-
ducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward
hacking (Gao et al., 2022), and retraining the reward model needs additional training resources
and it complicates the whole training pipeline. In conclusion, while PRM demonstrates a good
ability to rerank the top-N responses generated by the model or assist in guided search (Snell
et al., 2024), its advantages are limited compared to the additional computational overhead it
introduces during the large-scale reinforcement learning process in our experiments.
Monte Carlo Tree Search (MCTS) Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Sil-
ver et al., 2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time
compute scalability. This approach involves breaking answers into smaller parts to allow the
model to explore the solution space systematically. To facilitate this, we prompt the model to
generate multiple tags that correspond to specific reasoning steps necessary for the search. For
training, we first use collected prompts to find answers via MCTS guided by a pre-trained value
model. Subsequently, we use the resulting question-answer pairs to train both the actor model
and the value model, iteratively refining the process.
However, this approach encounters several challenges when scaling up the training. First,
unlike chess, where the search space is relatively well-defined, token generation presents an
exponentially larger search space. To address this, we set a maximum extension limit for each
node, but this can lead to the model getting stuck in local optima. Second, the value model
directly influences the quality of generation since it guides each step of the search process.
Training a fine-grained value model is inherently difficult, which makes it challenging for the
model to iteratively improve. While AlphaGo’s core success relied on training a value model to
progressively enhance its performance, this principle proves difficult to replicate in our setup
due to the complexities of token generation.
In conclusion, while MCTS can improve performance during inference when paired with a
pre-trained value model, iteratively boosting model performance through self-search remains a
significant challenge.
5. Conclusion, Limitations, and Future Work

In this work, we share our journey in enhancing model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across a variety of tasks. DeepSeek-R1 is more powerful, leveraging cold-start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on a range of tasks.

In the future, we plan to invest in research on DeepSeek-R1 in the following directions.

• General capability: Currently, the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn conversation, complex role-playing, and JSON output. Going forward, we plan to explore how CoT can be leveraged to enhance tasks in these areas.
• Language mixing: DeepSeek-R1 is currently optimized for Chinese and English, which may result in language-mixing issues when handling queries in other languages. For instance, DeepSeek-R1 might use English for reasoning and responses even when the query is in a language other than English or Chinese. We aim to address this limitation in future updates.
References
AI@Meta. Llama 3.1 model card, 2024. URL https://github.com/meta-llama/llama-m
odels/blob/main/models/llama3_1/MODEL_CARD.md.
Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3
-5-sonnet.
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda,
N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin,
B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet,
F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss,
A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse,
A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage,
M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and
W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.
URL https://arxiv.org/abs/2107.03374.
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A
multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint
arXiv:2305.08322, 2023.
N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica.
Livecodebench: Holistic and contamination free evaluation of large language models for code.
CoRR, abs/2403.07974, 2024. URL https://doi.org/10.48550/arXiv.2403.07974.
S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, and M. Faruqui.
Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. CoRR,
abs/2409.12941, 2024. doi: 10.48550/ARXIV.2409.12941. URL https://doi.org/10.485
50/arXiv.2409.12941.
A. Kumar, V. Zhuang, R. Agarwal, Y. Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop,
R. Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv
preprint arXiv:2409.12917, 2024.
H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measur-
ing massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212,
2023.
T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica. From
crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv
preprint arXiv:2406.11939, 2024.
H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman,
I. Sutskever, and K. Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
B. Y. Lin. ZeroEval: A Unified Framework for Evaluating Language Models, July 2024. URL
https://github.com/WildEval/ZeroEval.
MAA. American invitational mathematics examination - aime. In American Invitational
Mathematics Examination - AIME 2024, February 2024. URL https://maa.org/math
-competitions/american-invitational-mathematics-examination-aime.
OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/.
OpenAI. Learning to reason with llms, 2024b. URL https://openai.com/index/learnin
g-to-reason-with-llms/.
OpenAI. Introducing SimpleQA, 2024c. URL https://openai.com/index/introducing
-simpleqa/.
OpenAI. Introducing SWE-bench verified we’re releasing a human-validated subset of swe-
bench that more, 2024d. URL https://openai.com/index/introducing-swe-bench
-verified/.
Qwen. Qwq: Reflect deeply on the boundaries of the unknown, 2024a. URL https://qwenlm
.github.io/blog/qwq-32b-preview/.
Qwen. Qwen2.5: A party of foundation models, 2024b. URL https://qwenlm.github.io/b
log/qwen2.5.
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman.
GPQA: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath:
Pushing the limits of mathematical reasoning in open language models. arXiv preprint
arXiv:2402.03300, 2024.
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre,
D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and
shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815,
2017a. URL http://arxiv.org/abs/1712.01815.
D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker,
M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and
D. Hassabis. Mastering the game of go without human knowledge. Nat., 550(7676):354–359,
2017b. doi: 10.1038/NATURE24270. URL https://doi.org/10.1038/nature24270.
C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more
effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.033
14.
T. Trinh, Y. Wu, Q. Le, H. He, and T. Luong. Solving olympiad geometry without human
demonstrations. Nature, 2024. doi: 10.1038/s41586-023-06747-5.
P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: A label-
free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935,
2023.
Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li,
M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and
challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024.
URL https://doi.org/10.48550/arXiv.2406.01574.
C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint, 2024.
H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao,
Q. Zhu, D. Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan. Deepseek-prover-v1.5: Harnessing
proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024. URL
https://arxiv.org/abs/2408.08152.
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following
evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
Appendix
A. Contributions and Acknowledgments
Ruyi Chen Y.X. Wei
Shanghao Lu Yang Zhang
Shangyan Zhou Yanhong Xu
Shanhuang Chen Yao Li
Shengfeng Ye Yao Zhao
Shiyu Wang Yaofeng Sun
Shuiping Yu Yaohui Wang
Shunfeng Zhou Yi Yu
Shuting Pan Yichao Zhang
S.S. Li Yifan Shi
Shuang Zhou Yiliang Xiong
Shaoqing Wu Ying He
Shengfeng Ye Yishi Piao
Tao Yun Yisong Wang
Tian Pei Yixuan Tan
Tianyu Sun Yiyang Ma*
T. Wang Yiyuan Liu
Wangding Zeng Yongqiang Guo
Wen Liu Yuan Ou
Wenfeng Liang Yuduan Wang
Wenjun Gao Yue Gong
Wenqin Yu* Yuheng Zou
Wentao Zhang Yujia He
W.L. Xiao Yunfan Xiong
Wei An Yuxiang Luo
Xiaodong Liu Yuxiang You
Xiaohan Wang Yuxuan Liu
Xiaokang Chen Yuyang Zhou
Xiaotao Nie Y.X. Zhu
Xin Cheng Yanping Huang
Xin Liu Yaohui Li
Xin Xie Yi Zheng
Xingchao Liu Yuchen Zhu
Xinyu Yang Yunxian Ma
Xinyuan Li Ying Tang
Xuecheng Su Yukun Zha
Xuheng Lin Yuting Yan
X.Q. Li Z.Z. Ren
Xiangyue Jin Zehui Ren
Xiaojin Shen Zhangli Sha
Xiaosha Chen Zhe Fu
Xiaowen Sun Zhean Xu
Xiaoxiang Wang Zhenda Xie
Xinnan Song Zhengyan Zhang
Xinyi Zhou Zhewen Hao
Xianzu Wang Zhicheng Ma
Xinxia Shan Zhigang Yan
Y.K. Li Zhiyu Wu
Y.Q. Wang Zihui Gu
Zijia Zhu Zhen Huang
Zijun Liu* Zhipeng Xu
Zilin Li Zhongyu Zhang
Ziwei Xie Zhen Zhang
Ziyang Song
Zizheng Pan
Within each role, authors are listed alphabetically by the first name. Names marked with *
denote individuals who have departed from our team.