[ICLR 2026] DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage

This is the official repository for the ICLR 2026 paper "DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage".

DIVA-GRPO is a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) specifically designed for Multimodal Large Language Models (MLLMs). It dynamically assesses problem difficulty, generates tailored variants, and computes local/global advantages with reward-range-based rescaling to mitigate reward sparsity and advantage vanishing.

✨ Main Contributions

Our paper introduces several key innovations to solve reward sparsity and advantage vanishing in MLLM reinforcement learning:

Difficulty-Adaptive Variant Generation: We propose a dynamic difficulty assessment mechanism that adaptively samples semantically consistent variants (text, image, and reasoning hints) to ensure stable reward variance regardless of problem difficulty.
Joint Local-Global Advantage Estimation: We introduce a two-step balancing strategy utilizing batch z-score normalization and difficulty-weighted scaling to prevent global advantages from dominating optimization.
RRB-Rescaling (Reward-Range-Based Rescaling): A novel technique that scales advantages based on actual reward variability, effectively preventing unreasonable advantage inflation and accelerating convergence.
Exceptional Efficiency & Performance: Evaluated on Qwen2.5-VL-7B, our method achieves state-of-the-art (SOTA) performance across six mainstream multimodal reasoning benchmarks. It reduces the number of required training steps by a factor of more than 2.55 and delivers a 1.76x end-to-end wall-clock speedup.
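
To make the two failure modes concrete, here is a minimal sketch of the standard group-relative advantage used by GRPO (reward minus group mean, divided by group standard deviation). It shows advantages vanishing when every rollout gets the same reward, and being inflated when rewards differ only slightly. The `range_rescaled_advantages` helper is a hypothetical illustration of scaling by the observed reward range. It is an assumption for this example, not the RRB-Rescaling formula from the paper, and `max_range` and `eps` are likewise illustrative names.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standard group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def range_rescaled_advantages(rewards, max_range=1.0, eps=1e-6):
    """Hypothetical rescaling (assumption, not the paper's formula):
    shrink advantages in proportion to the observed reward range."""
    scale = (max(rewards) - min(rewards)) / max_range
    return [a * scale for a in grpo_advantages(rewards, eps)]


# Advantage vanishing: identical rewards carry no learning signal.
vanished = grpo_advantages([1.0, 1.0, 1.0, 1.0])

# Advantage inflation: a 0.01 reward gap is normalized to roughly +/-1.
inflated = grpo_advantages([0.50, 0.51])

# With range-based scaling, the same gap yields a proportionally small advantage.
rescaled = range_rescaled_advantages([0.50, 0.51])
```

In this sketch the rescaled advantages stay close to the size of the actual reward gap, which is the intuition behind tying advantage magnitude to reward variability.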

🏆 Supported Benchmarks

The model has been extensively evaluated and achieves strong performance on the following multimodal mathematical and scientific benchmarks:

MathVista
MathVerse
MathVision
OlympiadBench
WeMath
MMK12-test

🛠️ Installation

Create a conda environment and activate it:

conda create -n diva python=3.10
conda activate diva

Install the dependencies (requires CUDA and PyTorch):

pip install -r requirements.txt
# Install the framework in editable mode
pip install -e .

📊 Dataset Preparation & Augmentation

Our training relies on dynamically augmented datasets with difficulty variants and reasoning "think steps". We use the R1-ShareVL-52K dataset as our base.

1. Download Base Dataset

Download the base dataset from Hugging Face and save it locally as a Parquet file (e.g., data/r1_sharevl_52k.parquet).

2. Generate Variants and Think Steps

We provide a multiprocessing script that calls Azure OpenAI (or another LLM backend, such as GPT-o3 or Qwen-Plus) to generate the variants and reasoning steps.

First, export your API credentials:

export AZURE_OPENAI_KEY="your_api_key_here"
export AZURE_OPENAI_ENDPOINT="your_endpoint_here"
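
Before starting a long augmentation run, it can help to confirm that both variables are visible to Python. The snippet below is a small optional sketch, not part of the repository; `missing_credentials` and `REQUIRED_VARS` are hypothetical names introduced here.

```python
import os

# The two variables exported above.
REQUIRED_VARS = ("AZURE_OPENAI_KEY", "AZURE_OPENAI_ENDPOINT")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```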

Next, run the augmentation script. You can adjust the --workers argument based on your API rate limits:

python verl/difficulty_variation/augment_dataset.py \
    --input data/r1_sharevl_52k.parquet \
    --output data/r1_sharevl_52k_augmented.parquet \
    --workers 8

Note: This script (augment_dataset.py) uses a listener-worker architecture with IPC queues, so generated data is saved safely and incrementally to a single Parquet file even when many workers run concurrently.
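
The listener-worker pattern mentioned in the note can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in augment_dataset.py: it writes JSON Lines instead of Parquet to stay within the standard library, `fake_generate` is a hypothetical stand-in for the LLM call, and it uses the `fork` start method, which is available on Linux and macOS but not on Windows.

```python
import json
import multiprocessing as mp

SENTINEL = None  # signals "no more items" on both queues


def fake_generate(record):
    """Hypothetical stand-in for the LLM call that produces a variant."""
    return {"id": record["id"], "variant": record["question"].upper()}


def worker(task_queue, result_queue):
    """Process records until the sentinel arrives; send results to the listener."""
    while True:
        record = task_queue.get()
        if record is SENTINEL:
            break
        result_queue.put(fake_generate(record))


def listener(result_queue, output_path):
    """The only process that writes to the output file."""
    with open(output_path, "a", encoding="utf-8") as f:
        while True:
            item = result_queue.get()
            if item is SENTINEL:
                break
            f.write(json.dumps(item) + "\n")
            f.flush()  # save incrementally so finished work survives a crash


def augment(records, output_path, num_workers=2):
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    task_queue, result_queue = ctx.Queue(), ctx.Queue()

    writer = ctx.Process(target=listener, args=(result_queue, output_path))
    writer.start()
    workers = [
        ctx.Process(target=worker, args=(task_queue, result_queue))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()

    for record in records:
        task_queue.put(record)
    for _ in workers:
        task_queue.put(SENTINEL)  # one stop signal per worker

    for w in workers:
        w.join()
    result_queue.put(SENTINEL)  # all results are queued; stop the listener
    writer.join()
```

Because only the listener opens the output file, workers never contend for it, and each result is flushed to disk as soon as it arrives.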

🏃‍♂️ Training

We provide example scripts to launch the training process using Ray and vLLM. To train Qwen2.5-VL-7B-Instruct with the DIVA-GRPO algorithm, please update the paths in the script and run:

bash examples/main_exp/ZSCORENORM_WAN_RRBLOCAL_RRBGLOBAL_5000_k=0.1.sh

Key Hyperparameters Configuration

trainer.weighted_advantage_k=0.1: The sensitivity parameter for difficulty-weighted scaling (found to be optimal in our ablations).
Z-Score Norm: Applies batch-level z-score normalization separately to local and global advantages (ZSCORENORM_WAN).
RRB-Rescaling: Reward-Range-Based Rescaling prevents inflated advantages from minor reward differences (RRBLOCAL_RRBGLOBAL).
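
As a rough illustration of why the two advantage streams are normalized separately, the sketch below applies batch-level z-score normalization to hypothetical local and global advantages before adding them. The equal-weight sum and the names `zscore` and `combine` are assumptions made for this example. The difficulty-weighted scaling controlled by `trainer.weighted_advantage_k` is not modeled here.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-6):
    """Batch-level z-score: zero mean and (approximately) unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combine(local_adv, global_adv):
    """Normalize each stream on its own, then sum (equal weights are an assumption)."""
    return [l + g for l, g in zip(zscore(local_adv), zscore(global_adv))]


# Hypothetical batch: global advantages are 50x larger in raw scale.
local_adv = [0.1, -0.1, 0.1, -0.1]
global_adv = [5.0, 5.0, -5.0, -5.0]

raw_sum = [l + g for l, g in zip(local_adv, global_adv)]
balanced = combine(local_adv, global_adv)
```

In the raw sum, the sign of every entry is decided by the global term alone. After separate normalization, both streams contribute on the same scale, so samples where they disagree end up near zero.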

🤝 Acknowledgments

This project is built upon the excellent open-source verl framework. We express our gratitude to the authors for their foundational work.
