
SAM-I2V++:
Efficiently Upgrading SAM for Promptable Video Segmentation

IEEE TPAMI 2026

Haiyang Mei    Pengyu Zhang    Mike Zheng Shou✉️
Show Lab, National University of Singapore

  

  1. Overview
  2. Installation
  3.1 Download Checkpoint
  3.2 Demo Use
  3.3 Testing
  3.4 Evaluation
  3.5 Training
  3.6 Web Annotation Tool
  4. Acknowledgements
  5. Citation
  6. License
  7. Contact

1. Overview

SAM-I2V++ is an enhanced version of SAM-I2V (CVPR 2025), a training-efficient method that upgrades the image-based SAM for promptable video segmentation. It achieves over 93% of SAM 2.1’s performance while requiring only 0.2% of its training cost.

SAM-I2V++ takes an input video and extracts frame features via an image encoder enhanced by a temporal feature integrator to capture dynamic context. These features are processed by a memory selective associator and memory prompt generator to manage historical information and generate target prompts. A prompt encoder incorporates optional user inputs (e.g., masks, points, boxes). Finally, the mask decoder produces segmentation masks for each frame, enabling user-guided and memory-conditioned promptable video segmentation.
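The per-frame data flow can be sketched as follows. This is a minimal illustrative sketch only; the module and function names below are placeholders standing in for the components described above, not the actual SAM-I2V++ API.

# Illustrative sketch of the per-frame inference flow described above.
# All module names here are placeholders, not the real SAM-I2V++ API.
def segment_video(frames, user_prompts, model):
    memory_bank = []     # features and masks of already-processed frames
    outputs = []
    for t, frame in enumerate(frames):
        feat = model.image_encoder(frame)                                 # frame features
        feat = model.temporal_feature_integrator(feat, memory_bank)       # dynamic context
        selected = model.memory_selective_associator(feat, memory_bank)   # relevant history
        memory_prompt = model.memory_prompt_generator(selected)           # target prompt from memory
        user_prompt = model.prompt_encoder(user_prompts.get(t))           # optional clicks/boxes/masks
        mask = model.mask_decoder(feat, memory_prompt, user_prompt)
        memory_bank.append((feat, mask))
        outputs.append(mask)
    return outputs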

2. Installation

Our implementation uses python==3.11, torch==2.5.0 and torchvision==0.20.0. Please follow the instructions here to install both PyTorch and TorchVision dependencies. You can install SAM-I2VPP on a GPU machine using:

git clone https://github.com/showlab/SAM-I2VPP.git && cd SAM-I2VPP
conda create -n sam-i2vpp python=3.11
conda activate sam-i2vpp
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu121
pip install -e .
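An optional sanity check (not part of the official setup) to confirm that the expected versions are installed and a CUDA device is visible:

python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"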

3. Getting Started

3.1 Download Checkpoint

First, we need to download a SAM-I2VPP checkpoint, which can be obtained from:

Both models were trained in 26 hours using 24GB GPUs. The first model (sam-i2vpp_8gpu.pt) was trained with 8 GPUs, while the second model (sam-i2vpp_32gpu.pt) was trained with 32 GPUs and offers better performance.

3.2 Demo Use

SAM-I2V++ can be used for promptable video segmentation in just a few lines, as shown below. The video predictor exposes APIs to, for example, add prompts and propagate masklets throughout a video. As with SAM 2, SAM-I2V++ supports video inference on multiple objects and uses an inference state to keep track of the interactions in each video.

import torch
from i2vpp.build_i2vpp import build_i2vpp_video_predictor

checkpoint = "./checkpoints/sam-i2vpp_32gpu.pt"
model_cfg = "./i2vpp/configs/i2vpp-infer.yaml"
predictor = build_i2vpp_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
  state = predictor.init_state(<your_video>)

  # add new prompts and instantly get the output on the same frame
  frame_idx, object_ids, masks = predictor.add_new_points_or_box(state, <your_prompts>)

  # propagate the prompts to get masklets throughout the video
  for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
    ...
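For concreteness, a filled-in version of the placeholders might look like the following. The keyword arguments assume the predictor mirrors SAM 2's video API (a directory of JPEG frames or a video file for init_state, and per-object point prompts with positive/negative labels); adjust them if the actual signatures differ. The video path is only an example.

import numpy as np
import torch
from i2vpp.build_i2vpp import build_i2vpp_video_predictor

predictor = build_i2vpp_video_predictor(
    "./i2vpp/configs/i2vpp-infer.yaml", "./checkpoints/sam-i2vpp_32gpu.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="./videos/example")  # assumed example path

    # one positive click (label 1) on the target object in frame 0
    points = np.array([[480, 270]], dtype=np.float32)
    labels = np.array([1], dtype=np.int32)
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1, points=points, labels=labels)

    # propagate the prompt and collect a binary mask per object per frame
    video_segments = {}
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        video_segments[frame_idx] = {
            obj_id: (masks[i] > 0.0).cpu().numpy()
            for i, obj_id in enumerate(object_ids)
        }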

3.3 Testing

We provide instructions for testing on the SAV-Test dataset.

(a) Please refer to the sav_dataset/README.md for detailed instructions on how to download and prepare the SAV-Test dataset before testing.

(b) Prepare the 'mask_info' files to ease testing via:

python tools/save_gt_mask_multiprocess.py

Or you can directly download the preprocessed 'mask_info' here.

(c) Run the inference script:

cd test_pvs
sh semi_infer.sh

3.4 Evaluation

Run the evaluation script:

sh semi_eval.sh

3.5 Training

(a) Please refer to the sav_dataset/README.md for detailed instructions on how to download and prepare the SAV-Train dataset. In total, there are 50,583 training videos (train/txt/sav_train_list.txt).

(b) Following SAM 2, we train the model on mixed video and image data. Download the SA-1B dataset and sample a subset of images, as the full dataset is too large to use in its entirety. We randomly sample 10k images (train/txt/sa1b_10k_train_list.txt) to train SAM-I2V++ (see the sketch after step (d) below for regenerating such a subset list).

(c) Download the SAM 1 model (i.e., TinySAM) to be upgraded and place it at checkpoints/tinysam.pth.

(d) Train the model:

  • Single node with 8 GPUs:
sh train.sh
  • Multi-node, with 8 GPUs per node (e.g., 4x8=32 GPUs):
sh multi_node_train_4_nodes.sh
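
As mentioned in step (b), the 10k-image SA-1B subset used for training is provided in train/txt/sa1b_10k_train_list.txt. If you want to regenerate a custom subset yourself, a minimal sketch is given below; it assumes the SA-1B images sit as .jpg files under a single directory, and the exact layout and list format expected by the training code may differ.

import random
from pathlib import Path

# Minimal sketch: sample a random 10k-image subset of SA-1B and write the
# file names to a list file. Paths and list format are assumptions, not the
# official preprocessing script.
sa1b_root = Path("./data/sa1b")                      # assumed SA-1B image directory
images = sorted(p.name for p in sa1b_root.glob("*.jpg"))

random.seed(0)                                       # reproducible sampling
subset = random.sample(images, 10_000)

with open("train/txt/sa1b_10k_custom_list.txt", "w") as f:
    f.write("\n".join(subset) + "\n")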

3.6 Web Annotation Tool

If the tool runs on a remote server, first forward its port to your local machine, then launch the tool and open http://localhost:5000 in your browser:

ssh -L 5000:127.0.0.1:5000 username@serverip
python tools/web_annotation_tool.py

4. Acknowledgements

Our implementation builds upon SAM 2 and reuses essential modules from its official codebase.

5. Citation

If you use SAM-I2V++ in your research, please use the following BibTeX entry.

@article{Mei_2026_TPAMI,
    author    = {Mei, Haiyang and Zhang, Pengyu and Shou, Mike Zheng},
    title     = {SAM-I2V++: Efficiently Upgrading SAM for Promptable Video Segmentation},
    journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
    year      = {2026},
}

6. License

Please see LICENSE.

7. Contact

E-Mail: Haiyang Mei ([email protected])
