This repository accompanies our study on how to teach Large Multimodal Models (LMMs) new skills without degrading existing capabilities. We analyze where to tune (vision encoder, projector, LLM) and, within the LLM, contrast self‑attention projection (SA Proj.) vs MLP updates under a controlled skill‑acquisition protocol. We provide a reproducible evaluation suite (target vs held‑out benchmarks), modular training scripts for component‑wise tuning, and a few optional mitigation baselines. The code builds upon LLaVA‑OneVision/NeXT.
- Component-wise Tuning Recipes: SA Proj. and MLP Gate&Up options that achieve strong learning with minimal forgetting across tasks and model families.
- Controlled Evaluation: Five target skills and eight held‑out benchmarks measuring both learning (target gains) and forgetting (held‑out drops).
- Modular Experiment Scripts: Reproducible, configurable pipelines to tune specific components (vision, projector, LLM, SA Proj., MLP) and compare recipes.
- Optional Baselines/Mitigations: LoRA (PEFT), Learning without Forgetting (distillation), WiSE‑FT, and MoE provided as reference implementations.
- Flexible Data Handling: Utilities to fetch/convert datasets into the required format.
- Analysis Tools: Scripts to probe output‑distribution shift and aggregate evaluation results.
We found that tuning only the SA Proj. or the MLP Gate&Up layers is effective for adapting large multimodal models with minimal forgetting. You can do this in just a few lines by filtering parameter names. Example:
```python
# 1) Freeze everything first
for p in model.parameters():
    p.requires_grad = False

# 2a) Unfreeze ONLY self-attention projections (SA Proj.) in the LLM
sa_keys = (
    "self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj"
)
for name, p in model.named_parameters():
    if any(k in name for k in sa_keys):
        p.requires_grad = True

# 2b) (Alternative) Unfreeze ONLY MLP Gate & Up; keep Down frozen.
#     Comment out 2a and use this block instead.
gate_up_keys = ("mlp.gate_proj", "mlp.up_proj")
down_keys = ("mlp.down_proj",)  # trailing comma keeps this a tuple, not a plain string
for name, p in model.named_parameters():
    if any(k in name for k in gate_up_keys):
        p.requires_grad = True
    elif any(k in name for k in down_keys):
        p.requires_grad = False  # explicit for clarity

# ... proceed to wrap with your Trainer/optimizer, etc.
```

Tip: the repository's built-in flags implement the same logic internally and are robust across model families. Use the snippet when you want to prototype custom selections inline.
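Before training, it can help to verify that only the intended modules will receive gradient updates. A minimal sanity check, assuming `model` is the already-loaded LMM from the snippet above:

```python
# Count the parameters that will be updated and spot-check their names.
trainable = [(name, p.numel()) for name, p in model.named_parameters() if p.requires_grad]
n_trainable = sum(count for _, count in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {n_trainable:,} / {n_total:,} parameters "
      f"({100.0 * n_trainable / n_total:.2f}%)")
for name, _ in trainable[:8]:  # the first few names should all be SA Proj. (or Gate/Up) weights
    print(name)
```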
- Clone the repository:

  ```bash
  git clone https://github.com/jessemelpolio/LMM_CL.git
  cd LMM_CL
  ```

- Create and activate an environment:

  ```bash
  conda create -n lmm_cl python=3.10 -y
  conda activate lmm_cl
  ```

- Install dependencies:

  - Editable install with training extras, plus the bundled evaluation toolkit:

    ```bash
    pip install --upgrade pip  # Enable PEP 660 support.
    pip install -e ".[train]"
    pip install -e ./lmms-eval
    ```

  - Alternatively, start from the provided environment spec and then install the packages above:

    ```bash
    conda env create -f environment.yaml
    conda activate lmm_cl  # environment name from the YAML
    pip install -e ".[train]" && pip install -e ./lmms-eval
    ```
- Configure access (recommended):

  - For Hugging Face datasets:

    ```bash
    huggingface-cli login
    ```

  - For PixMo-Count (Flickr): obtain API credentials (see the Dataset Preparation Guide). Prefer environment variables over hardcoding keys in code, as sketched below.
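For example, a preparation script can read the Flickr credentials from the environment at run time. The variable names below are hypothetical placeholders; use whatever names the Dataset Preparation Guide specifies:

```python
import os

# Hypothetical variable names; substitute the ones documented in the guide.
flickr_key = os.environ.get("FLICKR_API_KEY")
flickr_secret = os.environ.get("FLICKR_API_SECRET")
if not (flickr_key and flickr_secret):
    raise RuntimeError("Set FLICKR_API_KEY and FLICKR_API_SECRET before fetching PixMo-Count images.")
```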
Detailed instructions for downloading and preparing the datasets used in our experiments are provided in a separate document.
This guide covers the setup for datasets such as CUB-200, PixMo-Count, PathVQA, TextVQA, and more.
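As a rough sketch of what "preparing" means here: training data is typically converted into LLaVA-style conversation records. The schema below is an assumption based on common LLaVA-NeXT conventions, not necessarily the exact format this repository expects, so defer to the guide:

```python
import json

def to_llava_record(sample_id, image_path, question, answer):
    """Convert one VQA-style pair into an assumed LLaVA conversation entry."""
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }

# Toy example; the actual conversion utilities live under utils/.
records = [to_llava_record("pathvqa_0001", "images/pathvqa_0001.jpg",
                           "What organ is shown in the image?", "liver")]
with open("pathvqa_train.json", "w") as f:
    json.dump(records, f, indent=2)
```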
We provide a suite of shell scripts to run various continual learning experiments. For detailed instructions on how to configure and run these experiments, please refer to our experiments guide.
This guide explains how to run different CL strategies, from simple sequential fine-tuning to more advanced methods like LoRA, LwF, and MoE.
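For orientation, the LoRA baseline follows the standard PEFT recipe of attaching low-rank adapters to selected modules. The configuration below is an illustrative sketch; the hyperparameters and target modules are assumptions, not the values hard-coded in the repository's scripts:

```python
from peft import LoraConfig, get_peft_model

# `model` is the already-loaded LMM (see the freezing snippet in the overview above).
# Attach LoRA adapters to the LLM's attention projections (illustrative settings).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights should be trainable
```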
We use the lmms-eval toolkit to evaluate target‑task learning and held‑out‑task forgetting in a standardized, reproducible way:
- Learning (target): Improvement on newly learned skills
- Forgetting (held‑out): Average drop on general benchmarks
- Overall: Trade‑offs across recipes and components
- Automatic Integration: All experiment scripts automatically evaluate models after each training stage
- Comprehensive Benchmarks: Commonly evaluates on CUB-200, PixMo-Count, TextVQA, PathVQA, AI2D, ChartQA, DocVQA, InfoVQA, RealWorldQA, SEEDBench, ScienceQA, MM-Star, TimeClock, and optionally CocoClock/OpenImgClock/OCRBench/CountBench, depending on the script's `ALL_BENCHMARKS`
- Custom Tasks: Includes custom lmms-eval tasks for the continual learning datasets
- Detailed Results: Saves both aggregated metrics and per-sample results for analysis
- Standalone Evaluation: Can evaluate any checkpoint independently of training
Example evaluation command:
```bash
accelerate launch --num_processes=2 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=/path/to/checkpoint,conv_template=qwen_1_5,model_name=llava_qwen,vocab_size=152064 \
    --tasks cub200,pixmocount,textvqa,pathvqa \
    --batch_size 1 --log_samples --output_path ./results
```

See the Running Experiments Guide for the exact benchmarks string used by each script and how to customize it.
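To turn per-benchmark accuracies into the learning/forgetting summary described above, one simple aggregation looks like the sketch below; the repository's own results-processing scripts in utils/ may differ:

```python
def summarize(before, after, target_tasks):
    """Average gain on target tasks and average drop on held-out benchmarks.

    `before` and `after` map benchmark name -> accuracy (percent),
    measured before and after training on the target skill.
    """
    gains = [after[t] - before[t] for t in target_tasks]
    drops = [before[t] - after[t] for t in before if t not in target_tasks]
    return {
        "learning_avg_gain": sum(gains) / len(gains),
        "forgetting_avg_drop": sum(drops) / len(drops),
    }

# Toy numbers, purely illustrative:
before = {"cub200": 30.0, "textvqa": 65.0, "ai2d": 80.0}
after = {"cub200": 70.0, "textvqa": 64.5, "ai2d": 79.0}
print(summarize(before, after, target_tasks=["cub200"]))
```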
Here is an overview of the key directories in this project:
- `llava/`: Core code for the LLaVA model, training, and inference logic.
- `scripts/all_experiments/`: Scripts and configurations for running continual learning experiments (tune parts, LoRA, LwF, MoE, WiSE‑FT, etc.).
  - `*.yaml`: Dataset configuration files.
- `utils/`: Utility scripts for dataset preparation, results processing, and model manipulation.
- `docs/`: Detailed documentation, including guides for data preparation and running experiments.
- `lmms-eval/`: The evaluation framework used to benchmark model performance.
- This project is built upon the LLaVA-NeXT repository. We thank the original authors for their great work.
- We thank the contributors to the lmms-eval framework for their support on the evaluation side.
If you find our work useful for your research, please cite:
BibTeX:
```bibtex
@article{zhu2025teachlmm,
  title   = {How To Teach Large Multimodal Models New Skills?},
  author  = {Zhu, Zhen and Gong, Yiming and Xiao, Yao and Liu, Yaoyao and Hoiem, Derek},
  journal = {arXiv preprint arXiv:2510.08564},
  year    = {2025}
}
```