
This repo contains the official code for the paper "Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG"
- 🔥 We propose RAP, a training-free framework designed to enhance Multimodal Large Language Models' (MLLMs) ability to process high-resolution images effectively.
[2025.06.28] We added a demo script, play.py, for inference on a single image.
[2025.06.07] Our paper was accepted to ICML 2025 as an Oral paper (Top 1%)! 🎉
[2025.05.05] RAP code is available!
[2025.03.04] We released the ArXiv paper. 🚀
High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). To overcome the limitations of existing methods, this paper shifts away from prior dedicated heuristic approaches and revisits the most fundamental idea for HR perception: enhancing the long-context capability of MLLMs, driven by recent advances in long-context techniques such as retrieval-augmented generation (RAG) for general LLMs. Towards this end, this paper presents the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context using the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks demonstrate the significant effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43% improvement on HR-Bench 4K.
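To make the core idea concrete, here is a minimal, self-contained Python sketch of the retrieve-and-fuse step. It is an illustration rather than the code in this repo: the per-crop retrieval scores are taken as an input, and the grid-preserving paste is a simplified stand-in for the Spatial-Awareness Layout; RE-Search is omitted.

```python
# Toy illustration of RAP's retrieve-and-fuse step (NOT the repo implementation):
# split a high-resolution image into crops, keep the K crops with the highest
# retrieval scores, and paste them onto a canvas that preserves their relative
# row/column order (a simplified stand-in for the Spatial-Awareness Layout).
from PIL import Image

def split_into_crops(image, grid=8):
    """Return a list of ((row, col), crop) tiles from a grid x grid partition."""
    w, h = image.size
    cw, ch = w // grid, h // grid
    return [((r, c), image.crop((c * cw, r * ch, (c + 1) * cw, (r + 1) * ch)))
            for r in range(grid) for c in range(grid)]

def fuse_with_layout(selected):
    """Paste the selected crops onto a canvas, keeping their relative order."""
    rows = sorted({r for (r, _), _ in selected})
    cols = sorted({c for (_, c), _ in selected})
    cw, ch = selected[0][1].size
    canvas = Image.new("RGB", (len(cols) * cw, len(rows) * ch))
    for (r, c), crop in selected:
        canvas.paste(crop, (cols.index(c) * cw, rows.index(r) * ch))
    return canvas

def rap_canvas(image, scores, k=16, grid=8):
    """scores[i] is the retrieval score (e.g., from a visual retriever) of crop i."""
    crops = split_into_crops(image, grid)
    top_k = [crop for crop, _ in
             sorted(zip(crops, scores), key=lambda x: x[1], reverse=True)[:k]]
    return fuse_with_layout(top_k)
```

In the full framework, RE-Search then chooses the number of retained crops adaptively from the model's confidence and the retrieval scores instead of fixing k.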
- Clone this repository and navigate into the codebase
git clone https://github.com/DreamMr/RAP.git
cd RAP
- Install Packages
conda create -n RAP python=3.10 -y
conda activate RAP
pip install -e .
In this repo, we implement RAP with the LLaVA-OneVision (ov) series and VisRAG-Ret. You can either download these checkpoints manually beforehand or let them be fetched automatically when the `from_pretrained` method in transformers is called.
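If you want to fetch the checkpoints ahead of time, a sketch along the following lines works; the Hugging Face repo IDs below are examples and may differ from the ones configured in this codebase.

```python
# Optional: pre-fetch checkpoints instead of relying on the automatic download
# triggered by from_pretrained. The repo IDs are examples only; check the model
# configuration in this codebase for the IDs it actually uses.
from huggingface_hub import snapshot_download

snapshot_download("lmms-lab/llava-onevision-qwen2-0.5b-ov",
                  local_dir="./checkpoints/llava-onevision-qwen2-0.5b-ov")
snapshot_download("openbmb/VisRAG-Ret",
                  local_dir="./checkpoints/VisRAG-Ret")
```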
Download the benchmark TSV files and copy them into your LMUData directory:
export LMUData=YOUR_DATASET_PATH
cp vstar.tsv $LMUData
cp hr_bench_4k_single.tsv $LMUData
cp hr_bench_8k_single.tsv $LMUData
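As an optional sanity check (not part of the repo), the following snippet verifies that the TSV files are visible under the LMUData path before you launch the evaluation scripts:

```python
# Optional sanity check: confirm the benchmark TSVs are present in $LMUData.
import os

lmu_dir = os.environ.get("LMUData", "")
for name in ("vstar.tsv", "hr_bench_4k_single.tsv", "hr_bench_8k_single.tsv"):
    path = os.path.join(lmu_dir, name)
    print(f"{path}: {'found' if os.path.isfile(path) else 'missing'}")
```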
cd scripts
bash run_llava_ov_hrbench.sh
Note: The official HR-Bench evaluation uses Cyclic Permutation, so to improve evaluation efficiency we adopt a two-stage approach: 1) for each image and query, we first use RAP to obtain the key image crops; 2) we then replace the original images with the crops obtained in step 1) as input.
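The toy outline below sketches this two-stage flow; `rap_fn` and `evaluate_fn` are hypothetical placeholders, and the actual implementation is in the scripts under `scripts/`.

```python
# Illustrative outline of the two-stage evaluation (toy stubs, not the actual
# scripts): rap_fn and evaluate_fn stand in for RAP crop retrieval and the
# cyclic-permutation evaluator, respectively.
def stage1_extract_crops(samples, rap_fn):
    """Run RAP once per (image, question) and cache the fused key-crop image."""
    return {s["index"]: rap_fn(s["image"], s["question"]) for s in samples}

def stage2_evaluate(samples, cached_images, evaluate_fn):
    """Replace each original image with its cached RAP output, then run the
    standard evaluation so RAP is not recomputed for every permutation."""
    for s in samples:
        s["image"] = cached_images[s["index"]]
    return evaluate_fn(samples)
```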
cd scripts
bash run_llava_ov_vstar.sh
For comparison, we also provide evaluation code without RAP (vanilla inference).
cd scripts
bash run_llava_ov_vanilla.sh
Note: If an OOM (out-of-memory) error occurs during evaluation, please try reducing the number of `workers` (in `rap/inference.py`, line 107) and the `max_batch_size` (in `rap/vlm/base.py`, line 23).
We provide a demo script, play.py, that runs inference on any given image-question pair. The first command below runs the vanilla model; the second enables RAP via the --use_rap flag.
python play.py --model llava_onevision_qwen2_0.5b_ov --image_path ./demo.jpg --input "What's the color of the umbrella?"
python play.py --model llava_onevision_qwen2_0.5b_ov --image_path ./demo.jpg --use_rap --input "What's the color of the umbrella?"
- Wenbin Wang: [email protected]
If you use RAP in your research, please cite our work:
@inproceedings{wangretrieval,
title={Retrieval-Augmented Perception: High-resolution Image Perception Meets Visual RAG},
author={Wang, Wenbin and Jing, Yongcheng and Ding, Liang and Wang, Yingjie and Shen, Li and Luo, Yong and Du, Bo and Tao, Dacheng},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://arxiv.org/abs/2503.01222}
}
- VLMEvalKit: Our codebase is built upon VLMEvalKit.