DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Contents
Post-Training: Large-Scale Reinforcement Learning on the Base Model
Distillation: Smaller Models Can Be Powerful Too
1.2 Summary of Evaluation Results
2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
2.2.1 Reinforcement Learning Algorithm
Group Relative Policy Optimization
2.2.2 Reward Modeling
2.2.3 Training Template
Performance of DeepSeek-R1-Zero
Self-evolution Process of DeepSeek-R1-Zero
Aha Moment of DeepSeek-R1-Zero
Drawback of DeepSeek-R1-Zero
2.3 DeepSeek-R1: Reinforcement Learning with Cold Start
2.3.2 Reasoning-oriented Reinforcement Learning
2.3.3 Rejection Sampling and Supervised Fine-Tuning
2.3.4 Reinforcement Learning for all Scenarios
2.4 Distillation: Empower Small Models with Reasoning Capability
3.1 DeepSeek-R1 Evaluation
3.2 Distilled Model Evaluation
4.1 Distillation vs. Reinforcement Learning
4.2 Unsuccessful Attempts
Process Reward Model (PRM)
Monte Carlo Tree Search (MCTS)