IEEE TPAMI 2026
Haiyang Mei
Pengyu Zhang
Mike Zheng Shou✉️
Show Lab, National University of Singapore
- 1 overview
- 2 installation
- 3.1 download checkpoint
- 3.2 demo use
- 3.3 testing
- 3.4 evaluation
- 3.5 training
- 3.6 web annotation tool
- 4 acknowledgements
- 5 citation
- 6 license
- 7 contact
SAM-I2V++ is an enhanced version of SAM-I2V (CVPR 2025), a training-efficient method that upgrades the image-based SAM for promptable video segmentation. It achieves over 93% of SAM 2.1’s performance while requiring only 0.2% of its training cost.
SAM-I2V++ takes an input video and extracts frame features via an image encoder enhanced by a temporal feature integrator to capture dynamic context. These features are processed by a memory selective associator and memory prompt generator to manage historical information and generate target prompts. A prompt encoder incorporates optional user inputs (e.g., masks, points, boxes). Finally, the mask decoder produces segmentation masks for each frame, enabling user-guided and memory-conditioned promptable video segmentation.
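For intuition only, the data flow described above can be sketched as a toy PyTorch module. Every class, layer, and shape below is an illustrative placeholder rather than the actual SAM-I2V++ implementation, and the prompt encoder for user inputs is omitted for brevity.

```python
import torch
import torch.nn as nn

class ToyI2VPipeline(nn.Module):
    """Toy stand-in showing the per-frame data flow, not the real architecture."""
    def __init__(self, dim=256):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)         # frame -> features
        self.temporal_integrator = nn.Conv2d(dim, dim, kernel_size=3, padding=1)  # add dynamic context
        self.memory_prompt_generator = nn.Linear(dim, dim)                        # memory -> target prompt
        self.mask_decoder = nn.Conv2d(dim, 1, kernel_size=1)                      # fused features -> mask logits

    def forward(self, frame, memory):
        feats = self.temporal_integrator(self.image_encoder(frame))
        # "Memory selective associator" stand-in: average the stored frame features.
        selected = torch.stack(memory).mean(0) if memory else torch.zeros_like(feats)
        # "Memory prompt generator" stand-in: a global prompt vector from the selected memory.
        prompt = self.memory_prompt_generator(selected.mean(dim=(-2, -1)))
        mask_logits = self.mask_decoder(feats + selected + prompt[..., None, None])
        return mask_logits, feats

pipe = ToyI2VPipeline()
memory = []                                   # grows as frames are processed
for frame in torch.randn(4, 1, 3, 256, 256):  # four dummy frames
    mask_logits, feats = pipe(frame, memory)
    memory.append(feats.detach())             # write the new frame back into memory
```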
Our implementation uses python==3.11, torch==2.5.0 and torchvision==0.20.0. Please follow the instructions here to install both PyTorch and TorchVision dependencies. You can install SAM-I2VPP on a GPU machine using:
```bash
git clone https://github.com/showlab/SAM-I2VPP.git && cd SAM-I2VPP
conda create -n sam-i2vpp python=3.11
conda activate sam-i2vpp
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu121
pip install -e .
```
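After installation, a quick sanity check can confirm the environment; the `i2vpp` package name is taken from the demo imports below, so adjust it if your checkout differs.

```python
import torch
import torchvision

print("torch:", torch.__version__)              # expected: 2.5.0
print("torchvision:", torchvision.__version__)  # expected: 0.20.0
print("CUDA available:", torch.cuda.is_available())

# `pip install -e .` should make the package importable:
import i2vpp  # noqa: F401
```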
First, we need to download the SAM-I2VPP checkpoint. It can be downloaded from:
- [sam-i2vpp_8gpu.pt] [ Google Drive ] [ OneDrive ] [ BaiduDisk ]
- [sam-i2vpp_32gpu.pt] [ Google Drive ] [ OneDrive ] [ BaiduDisk ]
Both models were trained in 26 hours using 24GB GPUs. The first model (sam-i2vpp_8gpu.pt) was trained with 8 GPUs, while the second model (sam-i2vpp_32gpu.pt) was trained with 32 GPUs and offers better performance.
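Before running the demo, it can be worth confirming that the downloaded checkpoint sits where the demo expects it and loads cleanly. The path below is taken from the demo snippet; the checkpoint's internal layout is not documented here, so the print is only a rough inspection.

```python
import os
import torch

ckpt_path = "./checkpoints/sam-i2vpp_32gpu.pt"   # path used by the demo below
assert os.path.isfile(ckpt_path), f"checkpoint not found at {ckpt_path}"

# Load on CPU only to verify the file is readable (i.e., not a truncated download).
ckpt = torch.load(ckpt_path, map_location="cpu")
print(type(ckpt), list(ckpt)[:5] if isinstance(ckpt, dict) else "")
```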
SAM-I2V++ can be used for promptable video segmentation in just a few lines, as shown below. The snippet builds a video predictor with APIs to, for example, add prompts and propagate masklets throughout a video. As with SAM 2, SAM-I2V++ supports video inference on multiple objects and uses an inference state to keep track of the interactions in each video.
```python
import torch
from i2vpp.build_i2vpp import build_i2vpp_video_predictor

checkpoint = "./checkpoints/sam-i2vpp_32gpu.pt"
model_cfg = "./i2vpp/configs/i2vpp-infer.yaml"
predictor = build_i2vpp_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)

    # add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(state, <your_prompts>)

    # propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
```
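For concreteness, a point prompt could look like the following. The keyword arguments mirror the SAM 2 video-predictor API, which SAM-I2V++ is expected to follow; treat the exact signature as an assumption and check the code if it differs.

```python
import numpy as np

# Hypothetical example: one positive click at (x=210, y=350) on frame 0 for object id 1.
# Keyword names follow the SAM 2 video predictor; verify against the local code.
points = np.array([[210, 350]], dtype=np.float32)
labels = np.array([1], dtype=np.int32)  # 1 = positive click, 0 = negative click

frame_idx, object_ids, masks = predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=points,
    labels=labels,
)
```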
We provide instructions for testing on the SAV-Test dataset.
(a) Please refer to sav_dataset/README.md for detailed instructions on how to download and prepare the SAV-Test dataset before testing.
(b) Prepare the 'mask_info' for ease of testing via:
python tools/save_gt_mask_multiprocess.py
Or you can directly download the preprocessed 'mask_info' here.
(c) Run the inference script:
cd test_pvs
sh semi_infer.sh
(d) Run the evaluation script:
sh semi_eval.sh
(a) Please refer to sav_dataset/README.md for detailed instructions on how to download and prepare the SAV-Train dataset. There are 50,583 training videos in total (train/txt/sav_train_list.txt).
(b) Following SAM 2, we train the model on mixed video and image data. Download the SA-1B dataset and sample a subset of images, as the full dataset is too large to use in its entirety. We randomly sample 10k images (train/txt/sa1b_10k_train_list.txt) to train SAM-I2V++; a minimal sketch of building such a subset list is given after the training steps below.
(c) Download the SAM 1 model to be upgraded (i.e., TinySAM) and place it at checkpoints/tinysam.pth.
(d) Train the model:
- Single node with 8 GPUs:
sh train.sh
- Multi-node, with 8 GPUs per node (e.g., 4x8 = 32 GPUs):
sh multi_node_train_4_nodes.sh
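As referenced in step (b), here is a minimal sketch of how a 10k-image subset list could be generated. The SA-1B directory layout and file naming below are assumptions, and the authors' actual sampling procedure may differ.

```python
import random
from pathlib import Path

# Assumption: SA-1B images have been extracted under ./data/sa1b/ as *.jpg files.
sa1b_root = Path("./data/sa1b")
all_images = sorted(p.name for p in sa1b_root.glob("*.jpg"))

random.seed(0)                       # fix the seed for a reproducible subset
subset = random.sample(all_images, 10_000)

Path("train/txt").mkdir(parents=True, exist_ok=True)
Path("train/txt/sa1b_10k_train_list.txt").write_text("\n".join(subset) + "\n")
```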
To use the web annotation tool on a remote server, connect with SSH port forwarding and launch the tool on the server:
ssh -L 5000:127.0.0.1:5000 username@serverip
python tools/web_annotation_tool.py
The tool should then be reachable at http://127.0.0.1:5000 in your local browser.
Our implementation builds upon SAM 2 and reuses essential modules from its official codebase.
If you use SAM-I2V++ in your research, please use the following BibTeX entry.
```bibtex
@article{Mei_2026_TPAMI,
  author  = {Mei, Haiyang and Zhang, Pengyu and Shou, Mike Zheng},
  title   = {SAM-I2V++: Efficiently Upgrading SAM for Promptable Video Segmentation},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
  year    = {2026},
}
```
Please see LICENSE.
E-Mail: Haiyang Mei ([email protected])

