VisionRAG is an implementation of multi-modal RAG based on the approach introduced in *ColPali: Efficient Document Retrieval with Vision Language Models*.
ColPali offers a novel method for document retrieval using vision language models. This project demonstrates how vision-based embeddings can simplify and enhance RAG systems, making them more versatile and easier to implement for a wide range of document types.
- Direct embedding of document screenshots
- No need for OCR or complex preprocessing
- Handles multi-modal content (text, images, charts, tables)
- Streamlined retrieval and ranking process
- Built on ColPali 2's efficient embedding technique
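The ranking step above relies on late interaction (MaxSim): each query-token embedding is compared against every image-patch embedding of a page, the best match per token is kept, and the per-token maxima are summed. A minimal pure-Python sketch with toy vectors (no model involved, just the scoring rule):

```python
def maxsim_score(query_emb, page_emb):
    """Late-interaction score: for each query-token vector, take its best
    dot product against all page-patch vectors, then sum over tokens."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in page_emb) for q in query_emb)

# Toy example: 2 query tokens, two candidate pages with 3 patch vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.0, 0.0]]
page_b = [[0.1, 0.1], [0.2, 0.1], [0.3, 0.0]]

# Rank pages by score, best first; page_a matches both query tokens better.
ranked = sorted([("a", page_a), ("b", page_b)],
                key=lambda kv: maxsim_score(query, kv[1]), reverse=True)
```

In the real system the query-token and patch vectors come from the ColPali model; the scoring rule itself is this simple.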
We benchmarked indexing speed on affordable GPUs; the embedding computation runs on the GPU:
| GPU | Batch Size | Speed (s/iteration) |
|---|---|---|
| NVIDIA A10G | 4 | 2.67 |
| NVIDIA L4 | 4 | 3.60 |
| NVIDIA T4 | 4 | 4.55 |
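At batch size 4, these per-iteration times translate directly into page throughput (a quick arithmetic check, nothing GPU-specific):

```python
# Pages indexed per second = batch_size / seconds_per_iteration.
benchmarks = {"A10G": 2.67, "L4": 3.60, "T4": 4.55}  # s/iteration at batch=4
batch_size = 4

for gpu, sec_per_iter in benchmarks.items():
    pages_per_sec = batch_size / sec_per_iter
    print(f"{gpu}: {pages_per_sec:.2f} pages/s "
          f"(~{pages_per_sec * 3600:.0f} pages/hour)")
```

Even the slowest card here indexes roughly 3,000 pages per hour, which is enough for many document collections.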
*Example query: "What is the model architecture and what is adaptive visual encoding?" (the scaled and dot-product similarity heatmap images are shown in the repository).*
What does this heatmap tell us?
- The heatmap shows areas of high attention (bright spots) and low attention (darker areas) for a specific token.
- The model seems to understand and focus on the relevant parts of the image that discuss or illustrate the adaptive visual encoding concept.
- The spread of attention indicates how precisely the model can identify the relevant areas. In this case, the attention seems to be spread across relevant diagrams and text, suggesting a good understanding.
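Such a heatmap is derived from the same token-patch similarities used for scoring: take one query token's embedding, dot it with every patch embedding, and reshape the flat result to the image's patch grid. A minimal sketch with toy numbers (the real patch grid size and embedding dimension depend on the model; the min-max normalisation here is one common choice, not necessarily the notebook's exact formula):

```python
def token_heatmap(token_vec, patch_embs, grid_w):
    """Similarity of one query token to each image patch, as grid rows."""
    sims = [sum(t * p for t, p in zip(token_vec, patch_vec))
            for patch_vec in patch_embs]
    # Min-max normalise to [0, 1] so bright = high attention.
    lo, hi = min(sims), max(sims)
    sims = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in sims]
    # Reshape the flat patch list into grid rows.
    return [sims[i:i + grid_w] for i in range(0, len(sims), grid_w)]

# Toy 2x2 patch grid with 2-dim embeddings.
token = [1.0, 0.0]
patches = [[0.9, 0.0], [0.1, 0.5], [0.5, 0.2], [0.3, 0.3]]
heat = token_heatmap(token, patches, grid_w=2)
```

Overlaying `heat` on the page image (e.g. with matplotlib's `imshow` and some transparency) gives the kind of visualisation described above.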
For more information about this approach, see the ColPali paper cited above.

This project aims to:

- Implement a multi-modal RAG system using ColPali's approach
- Demonstrate the efficiency and versatility of this approach
- Provide a notebook with a step-by-step guide on using ColPali to index documents and on passing the retrieved page images to a vision-language model to generate answers
- Show how to generate heatmaps to inspect what the model sees
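The notebook's end-to-end flow (index → retrieve → generate) can be sketched with stubs; the fixed embedding vectors, the toy scorer, and the `ask_vlm` helper below are hypothetical stand-ins for the ColPali model and the vision-language model, not the real API:

```python
# 1) Index: one multi-vector embedding per page screenshot
#    (stubbed with fixed toy vectors; the real ones come from ColPali).
index = {
    "page1.png": [[0.9, 0.1], [0.1, 0.9]],
    "page2.png": [[0.1, 0.1], [0.2, 0.0]],
}

# 2) Retrieve: toy scorer = best dot product between any query-token
#    vector and any page-patch vector; pick the top-scoring page.
def score(q_emb, p_emb):
    return max(sum(a * b for a, b in zip(q, p)) for q in q_emb for p in p_emb)

query_emb = [[1.0, 0.0]]  # stub for the embedded user question
best = max(index, key=lambda name: score(query_emb, index[name]))

# 3) Generate: hand the winning page image to a VLM (stubbed).
def ask_vlm(image_path, question):
    return f"[VLM answer grounded in {image_path}] {question}"

print(ask_vlm(best, "What is adaptive visual encoding?"))
```

The key design point is that only the retrieval step touches the index; the generation step sees a single page image plus the question, so any vision-language model can be swapped in.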