DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI [email protected]
1. Introduction
In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution
(Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial
General Intelligence (AGI).
Recently, post-training has emerged as an important component of the full training pipeline. It has
been shown to enhance accuracy on reasoning tasks, align models with social values, and adapt them to user
preferences, all while requiring relatively minimal computational resources compared with pre-training. In the
context of reasoning capabilities, OpenAI’s o1 (OpenAI, 2024b) series models were the first to introduce
inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This
approach has achieved significant improvements in various reasoning tasks, such as mathematics,
coding, and scientific reasoning. However, the challenge of effective test-time scaling remains an open
question for the research community. Several prior works have explored various approaches, including
process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023),
reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search
and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). However, none of these methods
has achieved general reasoning performance comparable to OpenAI’s o1 series models.
In this paper, we take the first step toward improving language model reasoning capabilities using
pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning
capabilities without any supervised data, focusing on their self-evolution through a pure RL process.
Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the
RL framework to improve model performance in reasoning. During training, numerous powerful and
intriguing reasoning behaviors naturally emerged in DeepSeek-R1-Zero. After thousands of RL steps,
DeepSeek-R1-Zero exhibits superb performance on reasoning benchmarks. For instance, the pass@1
score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further
improves to 86.7%, matching the performance of OpenAI-o1-0912.
However, DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing.
To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which
incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin
by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we
perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we
create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data
from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the
DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional
RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint
referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.
We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-32B
(Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it.
This demonstrates that the reasoning patterns discovered by larger base models are crucial for
improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey et al., 2024)
series. Notably, our distilled 14B model outperforms state-of-the-art open-source QwQ-32B-Preview
(Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a new record on the
reasoning benchmarks among dense models.
1.1. Contributions
Post-Training: Large-Scale Reinforcement Learning on the Base Model
• We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a
preliminary step. This approach allows the model to explore chain-of-thought (CoT) reasoning for solving
complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero
demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking
a significant milestone for the research community. Notably, it is the first open research to
validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the
need for SFT. This breakthrough paves the way for future advancements in this area.
• We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages
aimed at discovering improved reasoning patterns and aligning with human preferences, as well
as two SFT stages that serve as the seed for the model’s reasoning and non-reasoning capabilities.
We believe the pipeline will benefit the industry by creating better models.
Distillation: Smaller Models Can Be Powerful Too
• We demonstrate that the reasoning patterns of larger models can be distilled into smaller
models, resulting in better performance compared to the reasoning patterns discovered
through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the
research community to distill better smaller models in the future.
• Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that
are widely used in the research community. The evaluation results demonstrate that the
distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-
Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-
R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on
LiveCodeBench. These results significantly outperform previous open-source models and are
comparable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints
based on Qwen2.5 and Llama3 series to the community.
1.2. Summary of Evaluation Results
• Reasoning tasks: (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly
surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing
on par with OpenAI-o1-1217 and significantly outperforming other models. (2) On coding-related
tasks, DeepSeek-R1 demonstrates expert-level performance in code competition tasks, achieving a 2,029 Elo
rating on Codeforces and outperforming 96.3% of human participants in the competition. For
engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could
help developers in real-world tasks.
• Knowledge: On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1
achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of 90.8% on
MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its performance is slightly
below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-
source models, demonstrating its competitive edge in educational tasks. On the factual
benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in
handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses GPT-4o on this
benchmark.
• Others: DeepSeek-R1 also excels in a wide range of tasks, including creative writing, general
question answering, editing, summarization, and more. It achieves an impressive length-
controlled win-rate of 87.6% on AlpacaEval 2.0 and a win-rate of 92.3% on ArenaHard, showcasing
its strong ability to intelligently handle non-exam-oriented queries. Additionally, DeepSeek-R1
demonstrates outstanding performance on tasks requiring long-context understanding,
substantially outperforming DeepSeek-V3 on long-context benchmarks.
2. Approach
2.1. Overview
Previous work has heavily relied on large amounts of supervised data to enhance model performance.
In this study, we demonstrate that reasoning capabilities can be significantly improved through large-
scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start.
Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start
data. In the following sections, we present: (1) DeepSeek-R1-Zero, which applies RL directly to the base
model without any SFT data; (2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned
with thousands of long Chain-of-Thought (CoT) examples; and (3) distillation of the reasoning capability
from DeepSeek-R1 into small dense models.
2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

2.2.1. Reinforcement Learning Algorithm

Group Relative Policy Optimization  In order to save the training costs of RL, we adopt Group Relative
Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically the same
size as the policy model and estimates the baseline from group scores instead. Specifically, for each
question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then
optimizes the policy model $\pi_\theta$ by maximizing the following objective:

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right]
\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\ \mathrm{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\ 1-\varepsilon,\ 1+\varepsilon\right) A_i\right) - \beta\, \mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right)\right),
\tag{1}
$$

where $\varepsilon$ and $\beta$ are hyper-parameters and $A_i$ is the group-relative advantage, computed by normalizing each output's reward against the mean and standard deviation of the rewards within its group.
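To make the objective concrete, the following is a minimal sketch of how the group-relative part of GRPO can be computed for a single question. It assumes sequence-level log-probabilities, the reward-normalization form of the advantage described above, and illustrative values for the hyper-parameters; it is a sketch rather than the production training code.

```python
import numpy as np

def grpo_objective(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Sketch of the GRPO objective for one question.

    logp_new, logp_old, logp_ref: log-probabilities of each sampled output o_i
    under the current policy, the sampling (old) policy, and the frozen
    reference policy; rewards: scalar reward r_i for each output.
    """
    logp_new, logp_old, logp_ref, rewards = map(
        np.asarray, (logp_new, logp_old, logp_ref, rewards))

    # Group-relative advantage: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped importance-sampling ratio against the old policy.
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)

    # Unbiased KL estimate toward the reference policy (Shao et al., 2024).
    ref_ratio = np.exp(logp_ref - logp_new)
    kl = ref_ratio - np.log(ref_ratio) - 1.0

    # Maximize the mean surrogate minus the KL penalty over the G outputs.
    return float(np.mean(surrogate - beta * kl))

# Example: a group of G = 4 sampled outputs for one prompt.
obj = grpo_objective(logp_new=[-4.8, -5.1, -6.0, -5.5],
                     logp_old=[-5.0, -5.0, -6.2, -5.4],
                     logp_ref=[-5.2, -5.3, -6.1, -5.6],
                     rewards=[1.0, 0.0, 0.0, 1.0])
print(f"GRPO objective: {obj:.4f}")
```

Because the baseline comes from the group statistics rather than a learned critic, no value network of the policy's size needs to be trained or stored.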
Table 1 | Template for DeepSeek-R1-Zero. During training, the placeholder prompt is replaced with the
specific reasoning question.
2.2.2. Reward Modeling
The reward is the source of the training signal, which decides the optimization direction of RL. To train
DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards:
• Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For
example, in the case of math problems with deterministic results, the model is required to
provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based
verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate
feedback based on predefined test cases.
• Format rewards: In addition to the accuracy reward model, we employ a format reward model
that requires the model to put its thinking process between ‘<think>’ and ‘</think>’ tags.
We do not apply an outcome or process neural reward model in developing DeepSeek-R1-Zero,
because we find that neural reward models may suffer from reward hacking in the large-scale
reinforcement learning process, and retraining the reward model requires additional training resources
and complicates the whole training pipeline.
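As an illustration of the rule-based design, the sketch below combines an accuracy check and a format check for a math-style prompt. The boxed-answer convention follows the description above, while the exact matching rules, the regular expressions, and the equal weighting of the two signals are assumptions made for this example.

```python
import re

def format_reward(response: str) -> float:
    # Reward the required structure: a <think>...</think> block followed by the answer.
    return 1.0 if re.fullmatch(r"(?s)\s*<think>.+?</think>.+", response) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # Rule-based check for math problems with deterministic results:
    # the final answer must appear in a LaTeX-style boxed wrapper and match exactly.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # The two rule-based signals are combined additively in this sketch.
    return accuracy_reward(response, ground_truth) + format_reward(response)

resp = "<think>2 + 2 = 4, so the answer is 4.</think> The answer is \\boxed{4}."
print(total_reward(resp, "4"))  # -> 2.0
```

For code problems the accuracy term would instead come from compiling and running the response against predefined test cases, which is not shown here.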
2.2.3. Training Template

To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base
model to adhere to our specified instructions. As depicted in Table 1, this template requires DeepSeek-
R1-Zero to first produce a reasoning process, followed by the final answer. We intentionally limit our
constraints to this structural format, avoiding any content-specific biases—such as mandating reflective
reasoning or promoting particular problem-solving strategies—to ensure that we can accurately
observe the model’s natural progression during the RL process.
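The snippet below gives an illustrative template in the spirit of Table 1: it constrains only the structure (reasoning first, inside tags, followed by the answer) and injects the question through a placeholder. The wording is a paraphrase for illustration rather than the verbatim prompt.

```python
# Illustrative stand-in for the Table 1 template: structural constraints only,
# no content-specific hints such as mandated reflection or particular strategies.
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and "
    "the Assistant solves it. The Assistant first thinks about the reasoning "
    "process in the mind and then provides the user with the answer. The "
    "reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively.\n"
    "User: {prompt}\n"
    "Assistant:"
)

print(R1_ZERO_TEMPLATE.format(prompt="What is the sum of the first 10 primes?"))
```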
2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero

Figure 2 | AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample 16
responses and calculate the overall average accuracy to ensure a stable evaluation.

As shown in Figure 2, RL training steadily drives DeepSeek-R1-Zero to attain robust reasoning capabilities without the need for any supervised fine-
tuning data. This is a noteworthy achievement, as it underscores the model’s ability to learn and
generalize effectively through RL alone. Additionally, the performance of DeepSeek-R1-Zero can be
further augmented through the application of majority voting. For example, when majority voting is
employed on the AIME benchmark, DeepSeek-R1-Zero’s performance escalates from 71.0% to 86.7%,
thereby exceeding the performance of OpenAI-o1-0912. The ability of DeepSeek-R1-Zero to achieve
such competitive performance, both with and without majority voting, highlights its strong
foundational capabilities and its potential for further advancements in reasoning tasks.
Figure 3 | The average response length of DeepSeek-R1-Zero on the training set during the RL process.
DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.
As depicted in Figure 3, the thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process. This improvement is not the result of external adjustments but
rather an intrinsic development within the model. DeepSeek-R1-Zero naturally acquires the ability to
solve increasingly complex reasoning tasks by leveraging extended test-time computation. This
computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model
to explore and refine its thought processes in greater depth.
One of the most remarkable aspects of this self-evolution is the emergence of sophisticated
behaviors as the test-time computation increases. Behaviors such as reflection—where the model
revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-
solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a
result of the model’s interaction with the reinforcement learning environment. This spontaneous
development significantly enhances DeepSeek-R1-Zero’s reasoning capabilities, enabling it to tackle
more challenging tasks with greater efficiency and accuracy.
Aha Moment of DeepSeek-R1-Zero A particularly intriguing phenomenon observed during the training
of DeepSeek-R1-Zero is the occurrence of an “aha moment”. This moment, as illustrated in Table 3,
occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate
more thinking time to a problem by reevaluating its initial approach. This behavior is not only a
testament to the model’s growing reasoning abilities but also a captivating example of how
reinforcement learning can lead to unexpected and sophisticated outcomes.
This moment is not only an “aha moment” for the model but also for the researchers observing its
behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching
the model how to solve a problem, we simply provide it with the right incentives, and it
autonomously develops advanced problem-solving strategies. The “aha moment” serves as a powerful
reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way
for more autonomous and adaptive models in the future.
Table 3 | An interesting “aha moment” of an intermediate version of DeepSeek-R1-Zero.

Question: If $a > 1$, then the sum of the real solutions of $\sqrt{a - \sqrt{a + x}} = x$ is equal to

Response: <think>
To solve the equation $\sqrt{a - \sqrt{a + x}} = x$, let's start by squaring both sides $\cdots$
$\Rightarrow a - \sqrt{a + x} = x^2.$
Rearrange to isolate the inner square root term:
$(a - x^2)^2 = a + x \Rightarrow a^2 - 2ax^2 + (x^2)^2 = a + x \Rightarrow x^4 - 2ax^2 - x + (a^2 - a) = 0$
...
Wait, wait. Wait. That's an aha moment I can flag here.
Let's reevaluate this step-by-step to identify if the correct sum can be $\cdots$
We started with the equation:
$\sqrt{a - \sqrt{a + x}} = x$
Drawback of DeepSeek-R1-Zero  Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and
autonomously develops unexpected and powerful reasoning behaviors, it faces several issues. For
instance, DeepSeek-R1-Zero struggles with challenges such as poor readability and language mixing. To
make reasoning processes more readable and share them with the open community, we explore
DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data.

2.3. DeepSeek-R1: Reinforcement Learning with Cold Start

Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can reasoning
performance be further improved or convergence accelerated by incorporating a small amount of high-
quality data as a cold start? 2) How can we train a user-friendly model that not only produces clear and
coherent Chains of Thought (CoT) but also demonstrates strong general capabilities? To address these
questions, we design a pipeline to train DeepSeek-R1. The pipeline consists of four stages, outlined as
follows.

2.3.1. Cold Start

Unlike DeepSeek-R1-Zero, to prevent the early unstable cold-start phase of RL training from the base
model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the
model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot
prompting with a long CoT as an example, directly prompting models to generate detailed answers with
reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the
results through post-processing by human annotators.

In this work, we collect thousands of cold-start data points to fine-tune DeepSeek-V3-Base as the
starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold-start data
include:
• Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable for
reading. Responses may mix multiple languages or lack markdown formatting to highlight
answers for users. In contrast, when creating cold-start data for DeepSeek-R1,
we design a readable pattern that includes a summary at the end of each response and filters out
responses that are not reader-friendly. Here, we define the output format as
|special_token|<reasoning_process>|special_token|<summary>, where the reasoning process
is the CoT for the query, and the summary is used to summarize the reasoning results (a parsing sketch follows this list).
• Potential: By carefully designing the pattern for cold-start data with human priors, we observe
better performance compared with DeepSeek-R1-Zero. We believe iterative training is a better way
to build reasoning models.
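The sketch below shows how such a two-part output can be split back into its reasoning process and user-facing summary. The literal delimiter string used here is a placeholder, since the actual special token is not spelled out above.

```python
SPECIAL_TOKEN = "|special_token|"  # placeholder; the real delimiter token is an assumption

def split_cold_start_output(text: str) -> tuple[str, str]:
    """Split a response of the form
    |special_token|<reasoning_process>|special_token|<summary>
    into its reasoning process and summary."""
    parts = [p for p in text.split(SPECIAL_TOKEN) if p.strip()]
    if len(parts) != 2:
        raise ValueError("response does not follow the expected two-part format")
    reasoning, summary = parts
    return reasoning.strip(), summary.strip()

example = ("|special_token|First compare the two offers step by step ..."
           "|special_token|Offer B is cheaper overall.")
reasoning, summary = split_cold_start_output(example)
print(summary)
```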
2.3.2. Reasoning-oriented Reinforcement Learning

After fine-tuning DeepSeek-V3-Base on the cold-start data, we apply the same large-scale reinforcement
learning training process as employed in DeepSeek-R1-Zero. This phase focuses on enhancing the
model’s reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics,
science, and logical reasoning, which involve well-defined problems with clear solutions. During the
training process, we observe that CoT often exhibits language mixing, particularly when RL prompts
involve multiple languages. To mitigate the issue of language mixing, we introduce a language
consistency reward during RL training, which is calculated as the proportion of target language words
in the CoT. Although ablation experiments show that such alignment results in a slight degradation in
the model’s performance, this reward aligns with human preferences, making it more readable. Finally,
we combine the accuracy of reasoning tasks and the reward for language consistency by directly
summing them to form the final reward. We then apply RL training on the fine-tuned model until it
achieves convergence on reasoning tasks.
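A minimal sketch of this reward combination is given below. The word-level ASCII test is a stand-in for a real language identifier, and the unweighted sum mirrors the direct summation described above; both are simplifying assumptions.

```python
def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
    """Fraction of whitespace-separated words in the CoT that belong to the
    target language. A word counts as English here if it is pure ASCII; a
    production system would use a proper language identifier."""
    words = cot.split()
    if not words:
        return 0.0
    if target_lang == "en":
        in_target = [w for w in words if w.isascii()]
    else:
        # e.g. treat non-ASCII tokens as the target for Chinese prompts
        in_target = [w for w in words if not w.isascii()]
    return len(in_target) / len(words)

def final_reward(accuracy: float, cot: str, target_lang: str = "en") -> float:
    # The accuracy and language-consistency signals are combined by direct summation.
    return accuracy + language_consistency_reward(cot, target_lang)

print(final_reward(1.0, "First, 3 * 7 = 21, 所以 the answer is 21.", "en"))
```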
2.3.3. Rejection Sampling and Supervised Fine-Tuning

Reasoning data  We curate reasoning prompts and generate reasoning trajectories by performing
rejection sampling from the checkpoint obtained from the above RL training. In the previous stage, we only
included data that could be evaluated using rule-based rewards. However, in this stage, we expand the
dataset by incorporating additional data, some of which use a generative reward model by feeding the
ground-truth and model predictions into DeepSeek-V3 for judgment. Additionally, because the model
output is sometimes chaotic and difficult to read, we have filtered out chains of thought with mixed
languages, long paragraphs, and code blocks. For each prompt, we sample multiple responses and
retain only the correct ones. In total, we collect about 600k reasoning-related training samples.
Non-Reasoning data For non-reasoning data, such as writing, factual QA, self-cognition, and translation,
we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3. For certain
non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering
the question by prompting. However, for simpler queries, such as “hello”, we do not provide a CoT in
response. In the end, we collected a total of approximately 200k training samples that are unrelated to
reasoning.
We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k
samples.
2.3.4. Reinforcement Learning for all Scenarios

To further align the model with human preferences, we implement a secondary reinforcement learning
stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its
reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse
prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero,
which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning
domains. For general data, we resort to reward models to capture human preferences in complex and
nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of
preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary,
ensuring that the assessment emphasizes the utility and relevance of the response to the user while
minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the
entire response of the model, including both the reasoning process and the summary, to identify and
mitigate any potential risks, biases, or harmful content that may arise during the generation process.
Ultimately, the integration of reward signals and diverse data distributions enables us to train a model
that excels in reasoning while prioritizing helpfulness and harmlessness.
2.4. Distillation: Empower Small Models with Reasoning Capability

To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-
tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k
samples curated with DeepSeek-R1, as detailed in §2.3.3. Our findings indicate that this straightforward
distillation method significantly enhances the reasoning abilities of smaller models. The base models
we use here are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B,
and Llama-3.3-70B-Instruct. We select Llama-3.3 because its reasoning capability is slightly better than
that of Llama-3.1.
For distilled models, we apply only SFT and do not include an RL stage, even though incorporating
RL could substantially boost model performance. Our primary goal here is to demonstrate the
effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader
research community.
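As a rough illustration of what this distillation amounts to in practice, the sketch below builds one supervised example from a prompt and a DeepSeek-R1-generated response, masking the prompt tokens so the loss falls only on the response. The toy token ids, padding scheme, and the use of -100 as an ignore label are conventions assumed for this example rather than details taken from the paper.

```python
def build_sft_example(prompt_ids, response_ids, pad_id=0, max_len=4096):
    """Build one SFT example from a (prompt, teacher response) pair.

    The loss is taken only on the response tokens, so prompt positions get the
    ignore label -100 (the usual cross-entropy ignore_index convention)."""
    input_ids = (prompt_ids + response_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]
    # Right-pad to a fixed length for batching; padded labels are also ignored.
    pad = max_len - len(input_ids)
    input_ids = input_ids + [pad_id] * pad
    labels = labels + [-100] * pad
    return {"input_ids": input_ids, "labels": labels}

# Toy token ids standing in for a tokenized prompt and a teacher-generated answer.
example = build_sft_example(prompt_ids=[11, 12, 13],
                            response_ids=[21, 22, 23, 24], max_len=10)
print(example["labels"])  # [-100, -100, -100, 21, 22, 23, 24, -100, -100, -100]
```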
3. Experiment
Benchmarks  We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024),
MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), CMMLU (Li et al., 2023), IFEval (Zhou et
al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI, 2024c),
C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider 1, LiveCodeBench (Jain et al.,
2024) (2024-08 – 2025-01), Codeforces 2, the Chinese National High School Mathematics Olympiad (CNMO
2024) 3, and the American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024). In addition
to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as
judges. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and
Arena-Hard (Li et al., 2024), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. Here,
we only feed the final summary to evaluation to avoid the length bias. For distilled models, we report
representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and LiveCodeBench.
Evaluation Prompts  Following the setup in DeepSeek-V3, standard benchmarks such as MMLU, DROP,
GPQA Diamond, and SimpleQA are evaluated using prompts from the simple-evals framework. For
MMLU-Redux, we adopt the ZeroEval prompt format (Lin, 2024) in a zero-shot setting. In terms of
MMLU-Pro, C-Eval and CLUE-WSC, since the original prompts are few-shot, we slightly modify the
prompt to the zero-shot setting. The CoT in few-shot may hurt the performance of DeepSeek-R1. Other
datasets follow their original evaluation protocols with default prompts provided by their creators. For
code and math benchmarks, the HumanEval-Mul dataset covers eight mainstream programming
languages (Python, Java, C++, C#, JavaScript, TypeScript, PHP, and Bash). Model performance on
LiveCodeBench is evaluated using CoT format, with data collected between August 2024 and January
2025. The Codeforces dataset is evaluated using problems from 10 Div.2 contests along with expert-
crafted test cases, after which the expected ratings and percentages of competitors are calculated. SWE-
Bench Verified results are obtained via the agentless framework (Xia et al., 2024). Aider-related
benchmarks are measured using a "diff" format. DeepSeek-R1 outputs are capped at a maximum of
32,768 tokens for each benchmark.
Baselines We conduct comprehensive evaluations against several strong baselines, including DeepSeek-
V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, and OpenAI-o1-1217. Since accessing the
OpenAI-o1-1217 API is challenging in mainland China, we report its performance based on official
reports. For distilled models, we also compare the open-source model QwQ-32B-Preview (Qwen,
2024a).
Evaluation Setup We set the maximum generation length to 32,768 tokens for the models. We found
that using greedy decoding to evaluate long-output reasoning models results in higher repetition rates
and significant variability across different checkpoints. Therefore, we default to pass@𝑘 evaluation
(Chen et al., 2021) and report pass@1 using a non-zero temperature. Specifically, we use a sampling
temperature of 0.6 and a top-𝑝 value of 0.95 to generate 𝑘 responses (typically between 4 and 64,
depending on the test set size) for each question. Pass@1 is then calculated as

$$
\text{pass@1} = \frac{1}{k} \sum_{i=1}^{k} p_i,
$$

where $p_i$ denotes the correctness of the $i$-th response. This method provides more reliable performance
estimates. For AIME 2024, we also report consensus (majority vote) results (Wang et al., 2022) using 64
samples, denoted as cons@64.

1 https://aider.chat
2 https://codeforces.com
3 https://www.cms.org.cn/Home/comp/comp/cid/12.html
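The two reported metrics can be computed per question as in the sketch below; the sampled answers are hypothetical, and extraction of a final answer from each response is assumed to have already happened.

```python
from collections import Counter

def pass_at_1(correct_flags):
    """pass@1 = (1/k) * sum_i p_i over the k sampled responses."""
    return sum(correct_flags) / len(correct_flags)

def cons_at_k(answers, ground_truth):
    """Majority-vote (consensus) accuracy: take the most frequent extracted
    answer across the k samples and compare it to the reference."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return float(majority_answer == ground_truth)

# k = 8 sampled answers for one AIME-style problem (temperature 0.6, top-p 0.95).
answers = ["204", "204", "17", "204", "204", "36", "204", "204"]
flags = [a == "204" for a in answers]
print(pass_at_1(flags))           # 0.75
print(cons_at_k(answers, "204"))  # 1.0
```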
3.1. DeepSeek-R1 Evaluation
For education-oriented knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond,
DeepSeek-R1 demonstrates superior performance compared to DeepSeek-V3. This improvement is
primarily attributed to enhanced accuracy in STEM-related questions, where significant gains are
achieved through large-scale reinforcement learning. Additionally, DeepSeek-R1 excels on FRAMES, a
long-context-dependent QA task, showcasing its strong document analysis capabilities. This highlights
the potential of reasoning models in AI-driven search and data analysis tasks. On the factual benchmark
SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based
queries. A similar trend is observed where OpenAI-o1 surpasses GPT-4o on this benchmark. However,
DeepSeek-R1 performs worse than DeepSeek-V3 on the Chinese SimpleQA benchmark, primarily due
to its tendency to refuse answering certain queries after safety RL. Without safety RL, DeepSeek-R1
could achieve an accuracy of over 70%.
DeepSeek-R1 also delivers impressive results on IFEval, a benchmark designed to assess a model’s
ability to follow format instructions. These improvements can be linked to the inclusion of instruction-
following data during the final stages of supervised fine-tuning (SFT) and RL training. Furthermore,
remarkable performance is observed on AlpacaEval2.0 and ArenaHard, indicating DeepSeek-R1’s
strengths in writing tasks and open-domain question answering. Its significant outperformance of
DeepSeek-V3 underscores the generalization benefits of large-scale RL, which not only boosts reasoning
capabilities but also improves performance across diverse domains. Moreover, the summary lengths
generated by DeepSeek-R1 are concise, with an average of 689 tokens on ArenaHard and 2,218
characters on AlpacaEval 2.0. This indicates that DeepSeek-R1 avoids introducing length bias during GPT-
based evaluations, further solidifying its robustness across multiple tasks.
On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217, surpassing
other models by a large margin. A similar trend is observed on coding algorithm tasks, such as
LiveCodeBench and Codeforces, where reasoning-focused models dominate these benchmarks. On
engineering-oriented coding tasks, OpenAI-o1-1217 outperforms DeepSeek-R1 on Aider but achieves
comparable performance on SWE Verified. We believe the engineering performance of DeepSeek-R1
will improve in the next version, as the amount of related RL training data currently remains very
limited.
3.2. Distilled Model Evaluation

As shown in Table 5, simply distilling DeepSeek-R1’s outputs enables the efficient DeepSeek-R1-7B
(i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models
like GPT-4o-0513 across the board. DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation
metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most
benchmarks. These results demonstrate the strong potential of distillation. Additionally, we found that
applying RL to these distilled models yields significant further gains. We believe this warrants further
exploration and therefore present only the results of the simple SFT-distilled models here.
4. Discussion
4.1. Distillation vs. Reinforcement Learning
In Section 3.2, we see that by distilling DeepSeek-R1, the small model can achieve impressive
results. However, there is still one question left: can the model achieve comparable performance
through the large-scale RL training discussed in this paper without distillation?
To answer this question, we conduct large-scale RL training on Qwen-32B-Base using math, code,
and STEM data, training for over 10K steps, resulting in DeepSeek-R1-Zero-Qwen-32B. The experimental
results, shown in Table 6, demonstrate that the 32B base model, after large-scale RL training, achieves
performance on par with QwQ-32B-Preview, while the distilled DeepSeek-R1-Distill-Qwen-32B performs
significantly better across all benchmarks.
Table 6 | Comparison of distilled and RL models on reasoning-related benchmarks.

Model                          AIME 2024  AIME 2024  MATH-500  GPQA Diamond  LiveCodeBench
                               pass@1     cons@64    pass@1    pass@1        pass@1
QwQ-32B-Preview                50.0       60.0       90.6      54.5          41.9
DeepSeek-R1-Zero-Qwen-32B      47.0       60.0       91.6      55.0          40.2
DeepSeek-R1-Distill-Qwen-32B   72.6       83.3       94.3      62.1          57.2
4.2. Unsuccessful Attempts

In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along the
way. We share our failure experiences here to provide insights, but this does not imply that these
approaches are incapable of developing effective reasoning models.
Process Reward Model (PRM) PRM is a reasonable method to guide the model toward better
approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023).
However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is
challenging to explicitly define a fine-grained step in general reasoning. Second, determining whether
the current intermediate step is correct is a challenging task. Automated annotation using models may
not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a
model-based PRM is introduced, it inevitably leads to reward hacking (Gao et al., 2022), and retraining
the reward model requires additional training resources and complicates the whole training pipeline.
In conclusion, while PRM demonstrates a good ability to rerank the top-N responses generated by the
model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the
additional computational overhead it introduces during the large-scale reinforcement learning process
in our experiments.
Monte Carlo Tree Search (MCTS) Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Silver et al.,
2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time compute scalability.
This approach involves breaking answers into smaller parts to allow the model to explore the solution
space systematically. To facilitate this, we prompt the model to generate multiple tags that correspond
to specific reasoning steps necessary for the search. For training, we first use collected prompts to find
answers via MCTS guided by a pre-trained value model. Subsequently, we use the resulting question-
answer pairs to train both the actor model and the value model, iteratively refining the process.
However, this approach encounters several challenges when scaling up the training. First, unlike
chess, where the search space is relatively well-defined, token generation presents an exponentially
larger search space. To address this, we set a maximum extension limit for each node, but this can lead
to the model getting stuck in local optima. Second, the value model directly influences the quality of
generation since it guides each step of the search process. Training a fine-grained value model is
inherently difficult, which makes it challenging for the model to iteratively improve. While AlphaGo’s
core success relied on training a value model to progressively enhance its performance, this principle
proves difficult to replicate in our setup due to the complexities of token generation.
In conclusion, while MCTS can improve performance during inference when paired with a pre-
trained value model, iteratively boosting model performance through self-search remains a significant
challenge.
References
M. Chen, J. Tworek, H. Jun, Q. Yuan, et al. Evaluating large language models trained on code, 2021. URL https://arxiv.org/abs/2107.03374.
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten,
A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. Alphazero-like tree-search can
guide large language model decoding and training, 2024. URL https://arxiv.org/abs/2309.17179.
L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization, 2022. URL
https://arxiv.org/abs/2210.10760.
A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao,
X. Du, M. R. G. Madani, C. Barale, R. McHardy, J. Harris, J. Kaddour, E. van Krieken, and P. Minervini. Are
we done with MMLU? CoRR, abs/2406.04127, 2024. URL https://doi.org/10.48550/arXiv.2406.04127.
Google. Our next-generation model: Gemini 1.5, 2024. URL https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024.
Y. He, S. Li, J. Liu, Y. Tan, W. Wang, H. Huang, X. Bu, H. Guo, C. Hu, B. Zheng, et al. Chinese SimpleQA:
A Chinese factuality evaluation for large language models. arXiv preprint arXiv:2411.07140, 2024.
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive
multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level
multi-discipline chinese evaluation suite for foundation models. arXiv preprint
arXiv:2305.08322, 2023.
N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica.
Livecodebench: Holistic and contamination free evaluation of large language models for code. CoRR,
abs/2403.07974, 2024. URL https://doi.org/10.48550/arXiv.2403.07974.
S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, and M. Faruqui. Fact, fetch,
and reason: A unified evaluation of retrieval-augmented generation. CoRR, 2024.
H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive
multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212,
2023.
T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica. From crowdsourced
data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. arXiv preprint, 2024.
B. Y. Lin. ZeroEval: A Unified Framework for Evaluating Language Models, July 2024. URL
https://github.com/WildEval/ZeroEval.
MAA. American Invitational Mathematics Examination - AIME. In American Invitational Mathematics Examination - AIME 2024, 2024. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime.
OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/.
OpenAI. Introducing SWE-bench Verified: we're releasing a human-validated subset of SWE-bench, 2024d. URL https://openai.com/index/introducing-swe-bench-verified/.
Qwen. QwQ: Reflect deeply on the boundaries of the unknown, 2024a. URL https://qwenlm.github.io/blog/qwq-32b-preview/.
Qwen. Qwen2.5: A party of foundation models, 2024b. URL https://qwenlm.github.io/blog/qwen2.5.
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A
graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the
limits of mathematical reasoning in open language models. arXiv preprint
arXiv:2402.03300, 2024.
T. Trinh, Y. Wu, Q. Le, H. He, and T. Luong. Solving olympiad geometry without human demonstrations.
Nature, 2024. doi: 10.1038/s41586-023-06747-5.
J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving
math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-Shepherd: A label-free step-
by-step verifier for LLMs in mathematical reasoning. arXiv preprint arXiv:2312.08935, 2023.
X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency
improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K.
Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. MMLU-Pro: A more robust and challenging multi-task
language understanding benchmark. CoRR, abs/2406.01574, 2024. URL https://doi.org/10.48550/arXiv.2406.01574.
C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents.
arXiv preprint, 2024.
H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao, Q. Zhu, D.
Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan. Deepseek-prover-v1.5: Harnessing proof assistant
feedback for reinforcement learning and monte-carlo tree search, 2024. URL
https://arxiv.org/abs/2408.08152.
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation
for large language models. arXiv preprint arXiv:2311.07911, 2023.
Appendix
A. Contributions and Acknowledgments
Core Contributors Hui Li
Daya Guo Jianzhong Guo
Dejian Yang Jiashi Li
Haowei Zhang Jingchang Chen
Junxiao Song Ruoyu Jingyang Yuan
Zhang Jinhao Tu
Runxin Xu Junjie Qiu
Qihao Zhu Junlong Li
Shirong Ma Peiyi J.L. Cai
Wang Jiaqi Ni
Xiao Bi Jian Liang
Xiaokang Zhang Jin Chen Kai
Xingkai Yu Dong
Yu Wu Kai Hu*
Z.F. Wu Kaichao You
Zhibin Gou Kaige Gao
Zhihong Shao Kang Guan
Zhuoshu Li Ziyi Kexin Huang
Gao Kuai Yu
Lean Wang
Contributors Lecong Zhang
Aixin Liu Bing Liang Zhao
Xue Litong Wang
Bingxuan Wang Liyue Zhang
Bochao Wu Lei Xu
Bei Feng Leyi Xia
Chengda Lu Mingchuan Zhang
Chenggang Zhao Minghua Zhang
Chengqi Deng Minghui Tang
Chong Ruan Mingxu Zhou
Damai Dai Meng Li
Deli Chen Miaojun Wang
Dongjie Ji Mingming Li
Erhang Li Ning Tian
Fangyun Lin Panpan Huang
Fucong Dai Peng Zhang
Fuli Luo* Qiancheng Wang
Guangbo Hao Qinyu Chen
Guanting Chen Qiushi Du
Guowei Li Ruiqi Ge*
H. Zhang Ruisong Zhang
Hanwei Xu Ruizhe Pan
Honghui Ding Runji Wang
Huazuo Gao R.J. Chen
Hui Qu R.L. Jin
Ruyi Chen Y.X. Wei
Shanghao Lu Yang Zhang
Shangyan Zhou Yanhong Xu
Shanhuang Chen Yao Li
Shengfeng Ye Yao Zhao
Shiyu Wang Yaofeng Sun
Shuiping Yu Yaohui Wang
Shunfeng Zhou Yi Yu
Shuting Pan Yichao Zhang
S.S. Li Yifan Shi
Shuang Zhou Yiliang Xiong
Shaoqing Wu Ying He
Shengfeng Ye Yishi Piao
Tao Yun Yisong Wang
Tian Pei Yixuan Tan
Tianyu Sun Yiyang Ma*
T. Wang Yiyuan Liu
Wangding Zeng Yongqiang Guo
Wen Liu Yuan Ou
Wenfeng Liang Yuduan Wang
Wenjun Gao Yue Gong
Wenqin Yu* Yuheng Zou
Wentao Zhang Yujia He
W.L. Xiao Yunfan Xiong
Wei An Xiaodong Yuxiang Luo
Liu Yuxiang You
Xiaohan Wang Yuxuan Liu
Xiaokang Chen Yuyang Zhou
Xiaotao Nie Y.X. Zhu
Xin Cheng Yanping Huang
Xin Liu Yaohui Li
Xin Xie Yi Zheng
Xingchao Liu Yuchen Zhu
Xinyu Yang Yunxian Ma
Xinyuan Li Ying Tang
Xuecheng Su Yukun Zha
Xuheng Lin Yuting Yan
X.Q. Li Z.Z. Ren
Xiangyue Jin Zehui Ren
Xiaojin Shen Zhangli Sha
Xiaosha Chen Zhe Fu
Xiaowen Sun Zhean Xu
Xiaoxiang Wang Zhenda Xie
Xinnan Song Zhengyan Zhang
Xinyi Zhou Zhewen Hao
Xianzu Wang Zhicheng Ma
Xinxia Shan Zhigang Yan
Y.K. Li Zhiyu Wu
Y.Q. Wang Zihui Gu
Zijia Zhu Zhen Huang
Zijun Liu* Zhipeng Xu
Zilin Li Zhongyu Zhang
Ziwei Xie Zhen Zhang
Ziyang Song
Zizheng Pan
Within each role, authors are listed alphabetically by the first name. Names marked with * denote
individuals who have departed from our team.