Reasoning

The document discusses advancements in large language models (LLMs) with a focus on Chain of Thought (CoT) reasoning, which improves problem-solving by generating step-by-step solutions. It highlights OpenAI's o1 series, which enhances reasoning capabilities and performance on various benchmarks, and introduces DeepSeek-R1, a new model utilizing a multihead latent attention mechanism and reinforcement learning. The document notes the lack of detailed information on the reinforcement learning methods used by OpenAI and the exploration of a pure RL approach in DeepSeek-R1-Zero.


LLMs with Reasoning

Recall Chain of Thought Reasoning:


• In the prompt, tell the LLM to “generate a step-by-step solution” or to
“think through the answer” (a minimal prompt sketch follows this list).
• The resulting output is called a Chain of Thought (CoT).
• Provides a limited form of reasoning.
• “Let’s Verify” evaluates multiple CoTs using a process-reward model.
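A minimal sketch of this style of CoT prompting; the wording and the call_llm placeholder are illustrative assumptions, not taken from the slides:

```python
# Minimal sketch of Chain-of-Thought prompting.
# `call_llm` is a placeholder for whatever LLM client you use (hypothetical).

def build_cot_prompt(question: str) -> str:
    """Wrap a question with an instruction that elicits step-by-step reasoning."""
    return (
        "Solve the following problem. "
        "Think through the answer and generate a step-by-step solution, "
        "then state the final answer on its own line.\n\n"
        f"Problem: {question}"
    )

if __name__ == "__main__":
    prompt = build_cot_prompt("A train travels 120 km in 1.5 hours. What is its average speed?")
    print(prompt)  # send this string to your LLM client, e.g. response = call_llm(prompt)
```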
OpenAI o1 series: September 2024
• Much longer chains of thought: “Inference time thinking”.
• Self-evaluation, backtracking
• Gave significant improvements on math, science, and coding benchmarks
• OpenAI said it used RL but gave no details
• Provided only summaries of CoT to users
• Many research groups were making conjectures or trying to reverse-engineer it.
DeepSeek-R1: January 2025
• Employs DeepSeek-V3 as base model:
• Multi-head Latent Attention (a KV-cache compression trick)
• DeepSeek MoE architecture
• ~671B total parameters (~37B activated per token)
• Employs RL
• Most likely similar to how OpenAI o1 employed RL
• Paper first explores DeepSeek-R1-Zero
• Pure RL approach, no SFT
• Term “Zero” inspired by AlphaGo Zero
DeepSeek-R1-Zero: Prompt
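Below is a paraphrased sketch of the R1-Zero-style training template described in the DeepSeek-R1 paper; the exact wording is an approximation, not a verbatim copy of the slide:

```python
# Paraphrase of the R1-Zero-style training template from the DeepSeek-R1 paper.
# The wording here is an approximation, not the slide's exact text.

R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The User asks a question and the "
    "Assistant solves it. The Assistant first thinks about the reasoning process "
    "and then provides the answer. The reasoning process and answer are enclosed "
    "within <think> </think> and <answer> </answer> tags, respectively.\n"
    "User: {question}\n"
    "Assistant:"
)

print(R1_ZERO_TEMPLATE.format(question="What is 17 * 24?"))
```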
DeepSeek-R1-Zero: RL setup
As with RLHF, formulate an episodic RL problem:
• initial state of episode = prompt
• action space = vocabulary
• action = selected token in vocabulary
• state = prompt + all response tokens generated so far
• policy = LLM which takes the state as input and outputs a probability distribution over the
vocabulary. The policy is parameterized by the LLM parameters θ, which RL updates.
• episode ends when the <|endoftext|> token is sampled from the policy distribution
• rewards are zero except the final reward of the episode: +1 or -1 depending on whether the
final answer is correct or not (a reward-function sketch follows this slide)
• goal: max_θ E_{x~D’, y~π_θ}[ R ], where D’ is some fixed set of math/coding questions
• Quiz: what does the environment do in this MDP?
Solve with a variant of PPO.
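A minimal sketch of this terminal, rule-based reward, assuming answers are wrapped in <answer> tags as in the R1-Zero template; extract_final_answer and the exact-match rule are illustrative assumptions, not the paper's implementation:

```python
import re

def extract_final_answer(response: str) -> str:
    """Pull the text inside <answer>...</answer>; illustrative parsing, not the paper's exact rule."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else ""

def terminal_reward(response: str, ground_truth: str) -> float:
    """+1 if the final answer matches the reference, -1 otherwise; all earlier steps get reward 0."""
    return 1.0 if extract_final_answer(response) == ground_truth.strip() else -1.0

print(terminal_reward("<think>17*24 = 408</think> <answer>408</answer>", "408"))  # 1.0
```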
GRPO (Group Relative Policy Optimization)
• Select a question q from the question set.
• Use the current policy π_θold to generate G episodes (CoTs with a final answer).
• Determine the reward r_i (= +1 or -1) for each episode i.
• Calculate the advantage A_i for each episode as:
      A_i = ( r_i − mean(r_1, …, r_G) ) / std(r_1, …, r_G)
• Take several gradient steps, updating θ at each step, using the clipped surrogate objective:
      J(θ) = (1/G) Σ_i min( ρ_i(θ)·A_i , clip(ρ_i(θ), 1−ε, 1+ε)·A_i ),  where ρ_i(θ) = π_θ(o_i | q) / π_θold(o_i | q)
• After the several steps, set θold ← θ, and repeat with a new question.

• GRPO only differs from PPO in how the advantage A_i is defined (a code sketch follows).
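A minimal numeric sketch of the two GRPO-specific pieces just listed (group-normalized advantages and the clipped surrogate). The log-probabilities below are toy per-episode numbers; a real implementation would use token-level log-probs from the LLM:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each episode's reward by the group mean/std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards against a zero-variance group

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped objective evaluated per episode (to be maximized)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Toy example: 4 sampled CoTs for one question, two of them correct.
adv = grpo_advantages([+1, -1, +1, -1])
print(adv)  # positive for correct episodes, negative for incorrect ones
print(clipped_surrogate([-1.2, -0.9, -1.1, -1.0], [-1.0, -1.0, -1.0, -1.0], adv))
```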

GRPO in Practice (i.e., in open-source code)
• Only take one gradient step when updating θ!
• After generating episodes, do exactly one update
• No multiple inner optimization epochs between episode generations
• In this case, there is no need for the clipping! (at the single step π_θ = π_θold, so the ratio is 1 and the clip never activates)
GRPO Simply Becomes Normalized Policy Gradient
Initialize θ in the policy network (no critic network is needed)
Repeat:
   Use policy π_θ to obtain G episodes for a question q:
      (q, a_1^i, a_2^i, …, a_T^i), i = 1,…,G
   Calculate the returns r_i, i = 1,…,G
   Calculate GRPO advantage estimates:
      A_i = ( r_i − mean(r_1, …, r_G) ) / std(r_1, …, r_G)
   Update the policy with one gradient step:
      θ ← θ + α ∇_θ (1/G) Σ_i A_i Σ_t log π_θ(a_t^i | s_t^i)
Equivalently: one SGD step on the loss L(θ) = −(1/G) Σ_i A_i Σ_t log π_θ(a_t^i | s_t^i) (a PyTorch sketch of this follows).
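A toy PyTorch sketch of this single-update loop. The 8-token categorical "policy" and the hand-made episodes stand in for a real LLM, so everything below is illustrative rather than DeepSeek's implementation:

```python
import torch

# `seq_logps[i]` stands for sum_t log pi_theta(a_t^i | s_t^i) of episode i; here it comes
# from a toy logit vector rather than a real LLM.

torch.manual_seed(0)
theta = torch.randn(8, requires_grad=True)         # toy "policy parameters" over 8 tokens
episodes = [[1, 3, 3], [2, 2], [1, 5, 7], [0, 4]]  # toy sampled token sequences (G = 4)
rewards = torch.tensor([1.0, -1.0, 1.0, -1.0])     # terminal rewards of the 4 episodes

adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # GRPO advantages

logp = torch.log_softmax(theta, dim=0)             # token log-probabilities
seq_logps = torch.stack([logp[torch.tensor(ep)].sum() for ep in episodes])

loss = -(adv * seq_logps).mean()                   # advantage-weighted negative log-likelihood
loss.backward()
with torch.no_grad():
    theta -= 0.1 * theta.grad                      # one gradient step; then resample episodes
print(loss.item())
```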
[Figure slide: majority voting]
Learns on its own to generate long CoTs
Non-linear thinking
• DeepSeek-R1-Zero acquires the ability to solve increasingly complex
reasoning tasks by leveraging extended test-time computation.
• Behaviors such as reflection—where the model revisits
and reevaluates its previous steps—and the exploration
of alternative approaches to problem-solving arise
spontaneously.
Aha moment
DeepSeek-R1: Four-Stage Pipeline
1) Cold Start
• A small set of long-CoT examples is used to SFT the base model (V3) before applying RL.
• The data is collected through few-shot prompting, readable outputs from
DeepSeek-R1-Zero, and human post-processing.
• Provides improved readability and early stability during RL training.
2) Reasoning-oriented Reinforcement Learning
• Same large-scale RL as R1-Zero, but starting from the fine-tuned checkpoint.
• Focus on coding, math, and logic tasks.
• A language consistency reward is added to reduce language mixing in CoTs,
favoring outputs in a single language (see the sketch below).
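A minimal sketch of what such a language-consistency bonus might look like; the target-language-fraction heuristic, the weight, and the additive combination are assumptions for illustration, not the paper's exact rule:

```python
def language_consistency_bonus(cot_tokens, is_target_language, weight=0.1):
    """Reward the fraction of CoT tokens written in the target language (illustrative heuristic)."""
    if not cot_tokens:
        return 0.0
    frac = sum(is_target_language(tok) for tok in cot_tokens) / len(cot_tokens)
    return weight * frac

def total_reward(accuracy_reward, cot_tokens, is_target_language):
    """Combine the correctness reward with the language bonus (simple sum, as an assumption)."""
    return accuracy_reward + language_consistency_bonus(cot_tokens, is_target_language)

# Toy check: an ASCII-only predicate over a mostly-English CoT.
is_english = lambda tok: all(ord(c) < 128 for c in tok)
print(total_reward(1.0, ["First", "compute", "17*24", "然后", "check"], is_english))
```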
DeepSeek-R1 (without the Zero)
3) Rejection Sampling and Supervised Fine-Tuning
• After RL convergence, the model’s outputs are sampled and filtered to create new
supervised fine-tuning (SFT) data (a filtering sketch follows this stage):
• Reasoning data: 600K examples with correct answers and readable CoTs.
• Non-reasoning data: 200K examples for writing, QA, and translation.
• The model is fine-tuned on the combined ~800K-example dataset for two epochs.
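A minimal sketch of the rejection-sampling step described above; sample_responses, is_correct, and is_readable are placeholder assumptions standing in for generation with the RL checkpoint and the paper's filters:

```python
def rejection_sample_sft_data(questions, sample_responses, is_correct, is_readable, k=4):
    """For each question, sample k responses and keep only correct, readable ones as SFT pairs."""
    sft_pairs = []
    for q in questions:
        for response in sample_responses(q, k):  # placeholder for generation with the RL checkpoint
            if is_correct(q, response) and is_readable(response):
                sft_pairs.append({"prompt": q, "completion": response})
    return sft_pairs

# Toy usage with stub functions standing in for the real model and checkers.
answers = {"17*24": "408", "3+4": "7"}
stub_sampler = lambda q, k: [f"<think>reasoning here</think> <answer>{answers[q]}</answer>"] * k
stub_correct = lambda q, r: answers[q] in r
stub_readable = lambda r: "<think>" in r
print(len(rejection_sample_sft_data(list(answers), stub_sampler, stub_correct, stub_readable)))
```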
4) Reinforcement Learning for All Scenarios
• A second round of RL
• Rewards = correctness + helpfulness + harmlessness
• Helpfulness is judged on the final summary; harmlessness is assessed over the full
response.
• This stage ensures the model is strong in both reasoning and general capabilities.
Distillation: Empowering Small Models
• Using the 800K samples curated from R1, SFT smaller base models:
• No RL employed
• Empower small models with reasoning
Distillation versus RL for small models

• Distillation beats RL on smaller models


Comparison with Process Reward Model
Recall PRM is used in “Let’s Verify”
DeepSeek authors argue:
• It is challenging to explicitly define a fine-grained step in general
reasoning.
• Determining whether the current intermediate step is correct
requires manual annotation, which does not scale.

The authors also tried Monte Carlo Tree Search (MCTS), as employed in AlphaGo, without success.
The paper examines more closely:
• Pre-trained base model
• GRPO
DeepSeek V3 already has “aha moment”

V3 and Qwen possibly include question/answer pairs with long CoTs in their pretraining corpora
Dr GRPO
