Reasoning

The document discusses advancements in large language models (LLMs) with a focus on Chain of Thought (CoT) reasoning, which improves problem-solving by generating step-by-step solutions. It highlights OpenAI's o1 series, which enhances reasoning capabilities and performance on various benchmarks, and introduces DeepSeek-R1, a new model utilizing a multihead latent attention mechanism and reinforcement learning. The document notes the lack of detailed information on the reinforcement learning methods used by OpenAI and the exploration of a pure RL approach in DeepSeek-R1-Zero.


LLMs with Reasoning

Recall Chain of Thought Reasoning:


• In the prompt, tell the LLM to “generate a step-by-step solution” or to
“think through the answer” (a minimal prompt sketch follows this list).
• The resulting output is called a Chain of Thought (CoT).
• Provides a limited form of reasoning.
• “Let’s Verify” evaluates multiple CoTs using a process-reward model.
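A minimal sketch of this style of CoT prompting; the wording and the call_llm placeholder are illustrative assumptions, not taken from the slides:

```python
# Minimal sketch of Chain-of-Thought prompting.
# `call_llm` is a placeholder for whatever LLM client you use (hypothetical).

def build_cot_prompt(question: str) -> str:
    """Wrap a question with an instruction that elicits step-by-step reasoning."""
    return (
        "Solve the following problem. "
        "Think through the answer and generate a step-by-step solution, "
        "then state the final answer on its own line.\n\n"
        f"Problem: {question}"
    )

if __name__ == "__main__":
    prompt = build_cot_prompt("A train travels 120 km in 1.5 hours. What is its average speed?")
    print(prompt)  # send this string to your LLM client, e.g. response = call_llm(prompt)
```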
OpenAI o1 series: September 2024
• Much longer chains of thought: “Inference time thinking”.
• Self-evaluation, backtracking
• Gave significant improvements on math, science, and coding benchmarks
• OpenAI said it used RL but gave no details
• Provided only summaries of CoT to users
• Many research groups were making conjectures or trying to reverse-engineer it.
DeepSeek-R1: January 2025
• Employs DeepSeek-V3 as base model:
• Multi-head Latent Attention (a KV-cache compression trick)
• DeepSeek MoE architecture
• ~671B total parameters (~37B activated per token)
• Employs RL
• Most likely similar to how OpenAI o1 employed RL
• Paper first explores DeepSeek-R1-Zero
• Pure RL approach, no SFT
• Term “Zero” inspired by AlphaGo Zero
DeepSeek-R1-Zero: Prompt
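Below is a paraphrased sketch of the R1-Zero-style training template described in the DeepSeek-R1 paper; the exact wording is an approximation, not a verbatim copy of the slide:

```python
# Paraphrase of the R1-Zero-style training template from the DeepSeek-R1 paper.
# The wording here is an approximation, not the slide's exact text.

R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The User asks a question and the "
    "Assistant solves it. The Assistant first thinks about the reasoning process "
    "and then provides the answer. The reasoning process and answer are enclosed "
    "within <think> </think> and <answer> </answer> tags, respectively.\n"
    "User: {question}\n"
    "Assistant:"
)

print(R1_ZERO_TEMPLATE.format(question="What is 17 * 24?"))
```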
DeepSeek-R1-Zero: RL setup
As with RLHF, formulate an episodic RL problem:
• initial state of episode = prompt
• action space = vocabulary
• action = selected token in vocabulary
• state = prompt + all response tokens generated so far
• policy = LLM which takes the state as input and outputs a probability distribution over the
vocabulary. The policy is parameterized by the LLM parameters θ, which RL updates.
• episode ends when the <|endoftext|> token is sampled from the policy distribution
• rewards are zero except the final reward of the episode: +1 or -1 depending on whether the
final answer is correct or not (a reward-function sketch follows this slide)
• goal: max_θ E_{x~D’, y~π_θ}[ R ], where D’ is some fixed set of math/coding questions
• Quiz: what does the environment do in this MDP?
Solve with a variant of PPO.
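A minimal sketch of this terminal, rule-based reward, assuming answers are wrapped in <answer> tags as in the R1-Zero template; extract_final_answer and the exact-match rule are illustrative assumptions, not the paper's implementation:

```python
import re

def extract_final_answer(response: str) -> str:
    """Pull the text inside <answer>...</answer>; illustrative parsing, not the paper's exact rule."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else ""

def terminal_reward(response: str, ground_truth: str) -> float:
    """+1 if the final answer matches the reference, -1 otherwise; all earlier steps get reward 0."""
    return 1.0 if extract_final_answer(response) == ground_truth.strip() else -1.0

print(terminal_reward("<think>17*24 = 408</think> <answer>408</answer>", "408"))  # 1.0
```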
GRPO (Group Relative Policy Optimization)
• Select a question q from the question set.
• Use the current policy π_θold to generate G episodes (CoTs with a final answer).
• Determine the reward r_i (= +1 or -1) for each episode i.
• Calculate the advantage A_i for each episode as:
      A_i = ( r_i − mean(r_1, …, r_G) ) / std(r_1, …, r_G)
• Take several gradient steps, updating θ at each step, using the clipped surrogate objective:
      J(θ) = (1/G) Σ_i min( ρ_i(θ)·A_i , clip(ρ_i(θ), 1−ε, 1+ε)·A_i ),  where ρ_i(θ) = π_θ(o_i | q) / π_θold(o_i | q)
• After the several steps, set θold ← θ, and repeat with a new question.

• GRPO only differs from PPO in how the advantage A_i is defined (a code sketch follows).
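A minimal numeric sketch of the two GRPO-specific pieces just listed (group-normalized advantages and the clipped surrogate). The log-probabilities below are toy per-episode numbers; a real implementation would use token-level log-probs from the LLM:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each episode's reward by the group mean/std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards against a zero-variance group

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped objective evaluated per episode (to be maximized)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Toy example: 4 sampled CoTs for one question, two of them correct.
adv = grpo_advantages([+1, -1, +1, -1])
print(adv)  # positive for correct episodes, negative for incorrect ones
print(clipped_surrogate([-1.2, -0.9, -1.1, -1.0], [-1.0, -1.0, -1.0, -1.0], adv))
```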

GRPO in Practice (i.e., in open-source code)
• Only take one gradient step when updating θ!
• After generating episodes, do exactly one update
• No multiple inner optimization epochs between episode generations
• In this case, there is no need for the clipping! (at the single step π_θ = π_θold, so the ratio is 1 and the clip never activates)
GRPO Simply Becomes Normalized Policy Gradient
Initialize θ in the policy network (no critic network is needed)
Repeat:
   Use policy π_θ to obtain G episodes for a question q:
      (q, a_1^i, a_2^i, …, a_T^i), i = 1,…,G
   Calculate the returns r_i, i = 1,…,G
   Calculate GRPO advantage estimates:
      A_i = ( r_i − mean(r_1, …, r_G) ) / std(r_1, …, r_G)
   Update the policy with one gradient step:
      θ ← θ + α ∇_θ (1/G) Σ_i A_i Σ_t log π_θ(a_t^i | s_t^i)
Equivalently: one SGD step on the loss L(θ) = −(1/G) Σ_i A_i Σ_t log π_θ(a_t^i | s_t^i) (a PyTorch sketch of this follows).
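A toy PyTorch sketch of this single-update loop. The 8-token categorical "policy" and the hand-made episodes stand in for a real LLM, so everything below is illustrative rather than DeepSeek's implementation:

```python
import torch

# `seq_logps[i]` stands for sum_t log pi_theta(a_t^i | s_t^i) of episode i; here it comes
# from a toy logit vector rather than a real LLM.

torch.manual_seed(0)
theta = torch.randn(8, requires_grad=True)         # toy "policy parameters" over 8 tokens
episodes = [[1, 3, 3], [2, 2], [1, 5, 7], [0, 4]]  # toy sampled token sequences (G = 4)
rewards = torch.tensor([1.0, -1.0, 1.0, -1.0])     # terminal rewards of the 4 episodes

adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # GRPO advantages

logp = torch.log_softmax(theta, dim=0)             # token log-probabilities
seq_logps = torch.stack([logp[torch.tensor(ep)].sum() for ep in episodes])

loss = -(adv * seq_logps).mean()                   # advantage-weighted negative log-likelihood
loss.backward()
with torch.no_grad():
    theta -= 0.1 * theta.grad                      # one gradient step; then resample episodes
print(loss.item())
```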
[Figure slide: majority voting]
Learns on its own to generate long CoTs
Non-linear thinking
• DeepSeek-R1-Zero acquires the ability to solve increasingly complex
reasoning tasks by leveraging extended test-time computation.
• Behaviors such as reflection—where the model revisits
and reevaluates its previous steps—and the exploration
of alternative approaches to problem-solving arise
spontaneously.
Aha moment
DeepSeek-R1: Four-Stage Pipeline
1) Cold Start
• A small set of long-CoT examples is used to SFT the base model (V3) before applying RL.
• The data is collected through few-shot prompting, readable outputs from
DeepSeek-R1-Zero, and human post-processing.
• Provides improved readability and early stability during RL training.
2) Reasoning-oriented Reinforcement Learning
• Same large-scale RL as R1-Zero, but starting from the fine-tuned checkpoint.
• Focus on coding, math, and logic tasks.
• A language consistency reward is added to reduce language mixing in CoTs,
favoring outputs in a single language (see the sketch below).
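A minimal sketch of what such a language-consistency bonus might look like; the target-language-fraction heuristic, the weight, and the additive combination are assumptions for illustration, not the paper's exact rule:

```python
def language_consistency_bonus(cot_tokens, is_target_language, weight=0.1):
    """Reward the fraction of CoT tokens written in the target language (illustrative heuristic)."""
    if not cot_tokens:
        return 0.0
    frac = sum(is_target_language(tok) for tok in cot_tokens) / len(cot_tokens)
    return weight * frac

def total_reward(accuracy_reward, cot_tokens, is_target_language):
    """Combine the correctness reward with the language bonus (simple sum, as an assumption)."""
    return accuracy_reward + language_consistency_bonus(cot_tokens, is_target_language)

# Toy check: an ASCII-only predicate over a mostly-English CoT.
is_english = lambda tok: all(ord(c) < 128 for c in tok)
print(total_reward(1.0, ["First", "compute", "17*24", "然后", "check"], is_english))
```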
DeepSeek-R1 (without the Zero)
3) Rejection Sampling and Supervised Fine-Tuning
• After RL convergence, the model’s outputs are sampled and filtered to create new
supervised fine-tuning (SFT) data (a filtering sketch follows this stage):
• Reasoning data: 600K examples with correct answers and readable CoTs.
• Non-reasoning data: 200K examples for writing, QA, and translation.
• The model is fine-tuned on the combined ~800K-example dataset for two epochs.
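A minimal sketch of the rejection-sampling step described above; sample_responses, is_correct, and is_readable are placeholder assumptions standing in for generation with the RL checkpoint and the paper's filters:

```python
def rejection_sample_sft_data(questions, sample_responses, is_correct, is_readable, k=4):
    """For each question, sample k responses and keep only correct, readable ones as SFT pairs."""
    sft_pairs = []
    for q in questions:
        for response in sample_responses(q, k):  # placeholder for generation with the RL checkpoint
            if is_correct(q, response) and is_readable(response):
                sft_pairs.append({"prompt": q, "completion": response})
    return sft_pairs

# Toy usage with stub functions standing in for the real model and checkers.
answers = {"17*24": "408", "3+4": "7"}
stub_sampler = lambda q, k: [f"<think>reasoning here</think> <answer>{answers[q]}</answer>"] * k
stub_correct = lambda q, r: answers[q] in r
stub_readable = lambda r: "<think>" in r
print(len(rejection_sample_sft_data(list(answers), stub_sampler, stub_correct, stub_readable)))
```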
4) Reinforcement Learning for All Scenarios
• A second round of RL
• Rewards = correctness + helpfulness + harmlessness
• Helpfulness is judged on the final summary; harmlessness is assessed over the full
response.
• This stage ensures the model is strong in both reasoning and general capabilities.
Distillation: Empowering Small Models
• Using the 800K samples curated from R1, SFT smaller base models:
• No RL employed
• Empower small models with reasoning
Distillation versus RL for small models

• Distillation beats RL on smaller models


Comparison with Process Reward Model
Recall PRM is used in “Let’s Verify”
DeepSeek authors argue:
• It is challenging to explicitly define a fine-grained step in general
reasoning.
• Determining whether the current intermediate step is correct
requires manual annotation, which does not scale.

The authors also tried Monte Carlo Tree Search (MCTS), as employed in AlphaGo, without success.
The paper examines more closely:
• Pre-trained base model
• GRPO
DeepSeek V3 already has “aha moment”

V3 and Qwen possibly include question/answer pairs with long CoTs in their pretraining corpora
Dr GRPO
