VisionRAG is an implementation of multi-modal RAG based on the approach introduced in *ColPali: Efficient Document Retrieval with Vision Language Models*.
ColPali offers a novel method for document retrieval using vision language models. This project demonstrates how vision-based embeddings can simplify and enhance RAG systems, making them more versatile and easier to implement for a wide range of document types.
- Direct embedding of document screenshots
- No need for OCR or complex preprocessing
- Handles multi-modal content (text, images, charts, tables)
- Streamlined retrieval and ranking process
- Built on ColPali 2's efficient embedding technique
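The ranking step above relies on late interaction (MaxSim): each query-token embedding is compared against every image-patch embedding of a page, the best match per token is kept, and the per-token maxima are summed. A minimal pure-Python sketch with toy vectors (no model involved, just the scoring rule):

```python
def maxsim_score(query_emb, page_emb):
    """Late-interaction score: for each query-token vector, take its best
    dot product against all page-patch vectors, then sum over tokens."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in page_emb) for q in query_emb)

# Toy example: 2 query tokens, two candidate pages with 3 patch vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.0, 0.0]]
page_b = [[0.1, 0.1], [0.2, 0.1], [0.3, 0.0]]

# Rank pages by score, best first; page_a matches both query tokens better.
ranked = sorted([("a", page_a), ("b", page_b)],
                key=lambda kv: maxsim_score(query, kv[1]), reverse=True)
```

In the real system the query-token and patch vectors come from the ColPali model; the scoring rule itself is this simple.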
We benchmarked indexing speed on affordable GPUs; the embedding computation runs on the GPU:
| GPU | Batch Size | Speed (s/iteration) |
|---|---|---|
| NVIDIA A10G | 4 | 2.67 |
| NVIDIA L4 | 4 | 3.60 |
| NVIDIA T4 | 4 | 4.55 |
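At batch size 4, these per-iteration times translate directly into page throughput (a quick arithmetic check, nothing GPU-specific):

```python
# Pages indexed per second = batch_size / seconds_per_iteration.
benchmarks = {"A10G": 2.67, "L4": 3.60, "T4": 4.55}  # s/iteration at batch=4
batch_size = 4

for gpu, sec_per_iter in benchmarks.items():
    pages_per_sec = batch_size / sec_per_iter
    print(f"{gpu}: {pages_per_sec:.2f} pages/s "
          f"(~{pages_per_sec * 3600:.0f} pages/hour)")
```

Even the slowest card here indexes roughly 3,000 pages per hour, which is enough for many document collections.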
*Example query: "What is the model architecture and what is adaptive visual encoding?" (the scaled and dot-product similarity heatmap images are shown in the repository).*
What does this heatmap tell us?
- The heatmap shows areas of high attention (bright spots) and low attention (darker areas) for a specific token.
- The model seems to understand and focus on the relevant parts of the image that discuss or illustrate the adaptive visual encoding concept.
- The spread of attention indicates how precisely the model can identify the relevant areas. In this case, the attention seems to be spread across relevant diagrams and text, suggesting a good understanding.
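Such a heatmap is derived from the same token-patch similarities used for scoring: take one query token's embedding, dot it with every patch embedding, and reshape the flat result to the image's patch grid. A minimal sketch with toy numbers (the real patch grid size and embedding dimension depend on the model; the min-max normalisation here is one common choice, not necessarily the notebook's exact formula):

```python
def token_heatmap(token_vec, patch_embs, grid_w):
    """Similarity of one query token to each image patch, as grid rows."""
    sims = [sum(t * p for t, p in zip(token_vec, patch_vec))
            for patch_vec in patch_embs]
    # Min-max normalise to [0, 1] so bright = high attention.
    lo, hi = min(sims), max(sims)
    sims = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in sims]
    # Reshape the flat patch list into grid rows.
    return [sims[i:i + grid_w] for i in range(0, len(sims), grid_w)]

# Toy 2x2 patch grid with 2-dim embeddings.
token = [1.0, 0.0]
patches = [[0.9, 0.0], [0.1, 0.5], [0.5, 0.2], [0.3, 0.3]]
heat = token_heatmap(token, patches, grid_w=2)
```

Overlaying `heat` on the page image (e.g. with matplotlib's `imshow` and some transparency) gives the kind of visualisation described above.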
For more information about this approach, see the ColPali paper cited above.

This project aims to:

- Implement a multi-modal RAG system using ColPali's approach
- Demonstrate the efficiency and versatility of this approach
- Provide a notebook with a step-by-step guide on using ColPali to index documents and on passing the retrieved page images to a vision-language model to generate answers
- Show how to generate heatmaps to inspect what the model sees
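The notebook's end-to-end flow (index → retrieve → generate) can be sketched with stubs; the fixed embedding vectors, the toy scorer, and the `ask_vlm` helper below are hypothetical stand-ins for the ColPali model and the vision-language model, not the real API:

```python
# 1) Index: one multi-vector embedding per page screenshot
#    (stubbed with fixed toy vectors; the real ones come from ColPali).
index = {
    "page1.png": [[0.9, 0.1], [0.1, 0.9]],
    "page2.png": [[0.1, 0.1], [0.2, 0.0]],
}

# 2) Retrieve: toy scorer = best dot product between any query-token
#    vector and any page-patch vector; pick the top-scoring page.
def score(q_emb, p_emb):
    return max(sum(a * b for a, b in zip(q, p)) for q in q_emb for p in p_emb)

query_emb = [[1.0, 0.0]]  # stub for the embedded user question
best = max(index, key=lambda name: score(query_emb, index[name]))

# 3) Generate: hand the winning page image to a VLM (stubbed).
def ask_vlm(image_path, question):
    return f"[VLM answer grounded in {image_path}] {question}"

print(ask_vlm(best, "What is adaptive visual encoding?"))
```

The key design point is that only the retrieval step touches the index; the generation step sees a single page image plus the question, so any vision-language model can be swapped in.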