
midroid/VisionRAG


VisionRAG

VisionRAG is an implementation of multi-modal RAG that builds on the approach introduced in ColPali: Efficient Document Retrieval with Vision Language Models.

ColPali Architecture

🔍 Overview

ColPali offers a groundbreaking method for document retrieval using vision language models. This project aims to demonstrate how visual-based embedding can simplify and enhance RAG systems, making them more versatile and easier to implement for a wide range of document types.

Key Features of ColPali:

  • Direct embedding of document screenshots
  • No need for OCR or complex preprocessing
  • Handles multi-modal content (text, images, charts, tables)
  • Streamlined retrieval and ranking process
  • Built on ColPali 2's efficient embedding technique
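The ColPali paper scores a query against a page with ColBERT-style late interaction: each query token is matched to its most similar page patch, and those maxima are summed. The following is a minimal pure-Python sketch of that scoring rule, not the notebook's code. The function names (`maxsim_score`, `rank_pages`) and the toy vectors are illustrative assumptions; a real pipeline would take the embeddings from the model.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], page_patches: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token take the best-matching
    page patch, then sum those maxima over all query tokens."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(query_tokens: Sequence[Vector], pages: Sequence[Sequence[Vector]]) -> List[int]:
    """Return page indices sorted from most to least relevant."""
    scores = [maxsim_score(query_tokens, patches) for patches in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy data (made up): 2 query tokens, 2 pages with 3 and 2 patch vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[0.9, 0.1], [0.2, 0.8], [0.0, 0.0]],  # has a good patch for each query token
    [[0.5, 0.0], [0.4, 0.1]],              # weak match for both tokens
]
ranking = rank_pages(query, pages)
```

Because every page is reduced to a bag of patch vectors, text, charts, and tables are all scored the same way, which is why no OCR step is needed.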

How fast is the indexing?

We measured indexing speed on several affordable GPUs, with the embedding step running on the GPU:

| GPU | Batch size | Speed (s/iteration) |
|---|---|---|
| NVIDIA A10G | 4 | 2.67 |
| NVIDIA L4 | 4 | 3.60 |
| NVIDIA T4 | 4 | 4.55 |
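To turn these figures into a rough time budget, the sketch below converts seconds per iteration into pages per second and a total indexing time. It assumes that one iteration embeds one batch of 4 pages, which the table does not state explicitly, and the 1000-page corpus is a hypothetical example.

```python
import math

# Seconds per iteration at batch size 4, copied from the table above.
SECONDS_PER_ITERATION = {"A10G": 2.67, "L4": 3.60, "T4": 4.55}
BATCH_SIZE = 4  # assumption: one iteration processes one batch of 4 pages


def pages_per_second(gpu: str) -> float:
    """Throughput implied by the measured iteration time."""
    return BATCH_SIZE / SECONDS_PER_ITERATION[gpu]


def estimated_indexing_seconds(num_pages: int, gpu: str) -> float:
    """Total time for num_pages, rounding up to whole batches."""
    iterations = math.ceil(num_pages / BATCH_SIZE)
    return iterations * SECONDS_PER_ITERATION[gpu]


for gpu in SECONDS_PER_ITERATION:
    minutes = estimated_indexing_seconds(1000, gpu) / 60
    print(f"{gpu}: {pages_per_second(gpu):.2f} pages/s, ~{minutes:.1f} min per 1000 pages")
```

Under that assumption, 1000 pages need 250 iterations, so about 11 minutes on the A10G and about 19 minutes on the T4.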

Interpretability

Query: What is the model architecture and what is adaptive visual encoding?

[Figure: page images with the attention heatmap for this query; surviving image label: "Scaled and Dot"]

What does this heatmap tell us?

  • The heatmap shows areas of high attention (bright spots) and low attention (darker areas) for a specific token.
  • The model seems to understand and focus on the relevant parts of the image that discuss or illustrate the adaptive visual encoding concept.
  • The spread of attention indicates how precisely the model can identify the relevant areas. In this case, the attention seems to be spread across relevant diagrams and text, suggesting a good understanding.
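One common way to build a per-token heatmap like this is to take the dot product between a single query-token embedding and every page-patch embedding, rescale the values to [0, 1], and lay them out on the patch grid. The sketch below shows that idea in pure Python. It is an assumption about the general technique, not a copy of the notebook's method, and the function name and the 2x2 grid are illustrative.

```python
from typing import List, Sequence


def token_heatmap(
    query_token: Sequence[float],
    patch_embeddings: Sequence[Sequence[float]],
    grid_rows: int,
    grid_cols: int,
) -> List[List[float]]:
    """Similarity of one query token to every patch, min-max scaled to [0, 1]
    and reshaped to the (grid_rows, grid_cols) patch layout."""
    if len(patch_embeddings) != grid_rows * grid_cols:
        raise ValueError("number of patches must equal grid_rows * grid_cols")
    sims = [sum(q * p for q, p in zip(query_token, patch)) for patch in patch_embeddings]
    lo, hi = min(sims), max(sims)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    scaled = [(s - lo) / span for s in sims]
    return [scaled[r * grid_cols:(r + 1) * grid_cols] for r in range(grid_rows)]


# Toy example (made-up vectors): 4 patches arranged as a 2x2 grid.
token = [1.0, 0.0]
patches = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.25, 0.0]]
heatmap = token_heatmap(token, patches, grid_rows=2, grid_cols=2)
```

The resulting grid can be upsampled to the page size and overlaid on the image, so that bright cells mark the patches most similar to the chosen token.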

📚 Resources

For more information about this innovative approach:

🎯 Project Goals

  1. Implement a multi-modal RAG system using ColPali's approach
  2. Demonstrate the efficiency and versatility of this approach

🚀 Getting Started

  1. The notebook provides a step-by-step walkthrough of how to index documents with ColPali and then pass the retrieved page image to a vision-language model to generate answers.

  2. The notebook also shows how to generate heatmaps to check what the model sees.
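The flow described in these steps can be sketched as three stages: embed every page once, rank pages against the question, and hand the best page images to a vision-language model. The code below only illustrates that control flow. Every callable (`embed_page`, `embed_query`, `score`, `generate`) is a hypothetical stand-in so the sketch runs without a model; the notebook remains the reference for the real calls.

```python
from typing import Callable, Dict, List, Tuple

Embedding = List[List[float]]  # one vector per page patch or per query token


def build_index(pages: List[str], embed_page: Callable[[str], Embedding]) -> List[Tuple[str, Embedding]]:
    """Embed each page image once and keep it next to its identifier."""
    return [(page, embed_page(page)) for page in pages]


def answer(question, index, embed_query, score, generate, top_k: int = 1) -> str:
    """Retrieve the top_k pages for the question and pass them to the generator."""
    q = embed_query(question)
    ranked = sorted(index, key=lambda item: score(q, item[1]), reverse=True)
    top_pages = [page for page, _ in ranked[:top_k]]
    return generate(question, top_pages)


# Hypothetical stand-ins for the retriever model and the vision-language model.
TOY_VECTORS: Dict[str, Embedding] = {"page-1.png": [[1.0, 0.0]], "page-2.png": [[0.0, 1.0]]}


def embed_page(page: str) -> Embedding:
    return TOY_VECTORS[page]


def embed_query(question: str) -> Embedding:
    return [[0.0, 1.0]]


def score(q: Embedding, d: Embedding) -> float:
    # Sum over query tokens of the best dot product with any page patch.
    return sum(max(sum(a * b for a, b in zip(qt, p)) for p in d) for qt in q)


def generate(question: str, pages: List[str]) -> str:
    return f"answer based on {pages}"


index = build_index(list(TOY_VECTORS), embed_page)
result = answer("What is adaptive visual encoding?", index, embed_query, score, generate)
```

Keeping the model calls behind plain callables means the retriever and the answering model can be swapped independently.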

About

A novel multi-modality (Vision) RAG architecture


Languages

  • Jupyter Notebook 100.0%