Unveiling the PDF Content Query System: Intelligent Document Search
A Streamlit-Powered Solution for Efficient PDF Content Retrieval
Authors:
Muhammad Awais
Muhammad Asaad Areeb
Muhammad Osama Tahir
Muhammad Ahzam Ejaz
“Transforming how we interact with PDF documents through intelligent search.”
Submitted to: Dr. Mohseen Ali
Course: Deep Learning
Date: June 24, 2025
A Smarter Way to Query PDFs
What We Built: Our app allows users to query PDF content using natural language
instead of keywords.
Solution Highlights:
▶ Uses AI to interpret user queries and match them with content.
▶ Handles scanned documents using visual embeddings.
▶ Summarizes information in a user-friendly response.
Key Features:
▶ Upload single or multiple PDFs and manage them easily.
▶ AI understands the context, not just keywords.
▶ Pages are processed visually, so matching works even when the layout is poor.
User Benefit: Time-saving, intuitive, and effective for complex document exploration.
How It Works: The Multi-Agent Framework
Architecture Overview: Inspired by human-like workflows, the app is broken into
specialized agents. Each one is responsible for a specific task.
Agents and Their Roles:
▶ Document Processor: Transforms PDFs into searchable formats using
embeddings.
▶ Query Processor: Converts user input into an embedding and finds the
best-matching pages.
▶ Answer Generator: Uses Gemini AI to extract and summarize content visually.
▶ Manager Agent: Handles storage, duplicate detection, and cleanup.
Tech Stack: Python, Streamlit, PyTorch, PyMuPDF, ColPali (vision model), Gemini
AI.
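The division of labor above can be sketched as a small pipeline. This is a minimal illustration only: every class and method name is hypothetical, the embedding and answer steps are injected as plain callables standing in for ColPali and Gemini, and the Manager Agent is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# All names here are hypothetical; the slides do not show the real interfaces.
Embedding = List[float]


@dataclass
class DocumentProcessor:
    """Turns page images into embeddings (ColPali in the real system)."""
    embed_page: Callable[[bytes], Embedding]

    def process(self, pages: List[bytes]) -> List[Embedding]:
        return [self.embed_page(page) for page in pages]


@dataclass
class QueryProcessor:
    """Embeds the query and ranks stored pages by a similarity score."""
    embed_query: Callable[[str], Embedding]
    score: Callable[[Embedding, Embedding], float]

    def top_k(self, query: str, index: List[Embedding], k: int) -> List[int]:
        q = self.embed_query(query)
        ranked = sorted(
            range(len(index)),
            key=lambda i: self.score(q, index[i]),
            reverse=True,
        )
        return ranked[:k]


@dataclass
class AnswerGenerator:
    """Produces an answer from the matched pages (Gemini in the real system)."""
    generate: Callable[[str, List[bytes]], str]


@dataclass
class Pipeline:
    """Wires the three agents together around a shared page index."""
    docs: DocumentProcessor
    queries: QueryProcessor
    answers: AnswerGenerator
    pages: List[bytes] = field(default_factory=list)
    index: List[Embedding] = field(default_factory=list)

    def add_document(self, pages: List[bytes]) -> None:
        self.pages.extend(pages)
        self.index.extend(self.docs.process(pages))

    def ask(self, query: str, k: int = 3) -> str:
        hits = self.queries.top_k(query, self.index, k)
        return self.answers.generate(query, [self.pages[i] for i in hits])
```

Because each agent only depends on the callables it is given, any one of them can be swapped or tested in isolation.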
System Diagram
Turning PDFs into Searchable Data
Step 1: Document Processing – Converts the PDF into image pages and extracts
embeddings for each page.
Technical Flow:
▶ Each page is rendered as an image using PyMuPDF.
▶ ColPali, a vision-based model, processes the image to generate embeddings.
▶ Embeddings are cached to prevent redundant computation.
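A minimal sketch of this flow, under stated assumptions: `render_pages` uses PyMuPDF to rasterize pages, while `EmbeddingCache` and its `embed` callable are hypothetical stand-ins for the ColPali step. Keying the cache by the SHA-256 of the page image is an assumption; the slides do not say how the real cache is keyed.

```python
import hashlib
from typing import Callable, Dict, List


def render_pages(pdf_path: str, dpi: int = 150) -> List[bytes]:
    """Render every page of a PDF to PNG bytes with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the cache works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_pixmap(dpi=dpi).tobytes("png") for page in doc]


class EmbeddingCache:
    """Stores one embedding per distinct page image (hypothetical helper)."""

    def __init__(self, embed: Callable[[bytes], List[float]]):
        self._embed = embed  # stand-in for the ColPali forward pass
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the model actually ran

    def get(self, page_png: bytes) -> List[float]:
        # Assumption: identical image bytes always yield the same embedding.
        key = hashlib.sha256(page_png).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(page_png)
        return self._store[key]
```

With this shape, re-uploading an unchanged page costs one hash computation instead of one model call.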
Why It Matters:
▶ Enables matching based on layout, structure, and visual content.
▶ Makes scanned documents accessible.
▶ Optimized for speed using GPU support.
Precision Search with Vision Embeddings
Step 2: Query Processing – This component converts your natural-language query
into an embedding that can be compared directly with the stored page embeddings.
Search Flow:
▶ The query is converted to an embedding with the same model used for the pages.
▶ Compared against all stored page embeddings using cosine similarity.
▶ Returns top-k relevant pages based on similarity.
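The search flow above can be sketched in plain Python. The function names are hypothetical, and the sketch assumes one vector per page and per query; ColPali itself emits one vector per image patch, so this presumes some pooling step that the slides do not describe.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_pages(
    query_vec: Sequence[float],
    page_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (page_index, score) pairs for the k most similar pages."""
    scored = [
        (i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(page_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Raising `k` returns more candidate pages at the cost of passing more images to the answer step.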
Why It’s Effective:
▶ Handles fuzzy or approximate matches.
▶ Great for long documents with varied language.
▶ k-value can be adjusted for deeper results.
Transforming Matches into Meaningful Answers
Step 3: Answer Generation – Converts retrieved visual matches into human-friendly
responses.
Workflow:
▶ Selected pages are passed to Gemini AI along with the query.
▶ Input is a base64-encoded page image and a natural-language prompt.
▶ Output is a summarized answer with context.
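A sketch of how such an input might be assembled. The JSON layout follows the public Gemini `generateContent` REST request format (text and `inline_data` parts); the function name and prompt wording are assumptions, and no network call is made here.

```python
import base64
import json
from typing import Dict, List

# Hypothetical prompt; the slides do not show the wording actually used.
PROMPT_TEMPLATE = (
    "Answer the question using only the attached page images, "
    "and mention the page the answer comes from.\n\nQuestion: {question}"
)


def build_gemini_request(question: str, page_pngs: List[bytes]) -> Dict:
    """Build a generateContent-style body with one text part and N image parts."""
    parts: List[Dict] = [{"text": PROMPT_TEMPLATE.format(question=question)}]
    for png in page_pngs:
        parts.append(
            {
                "inline_data": {
                    "mime_type": "image/png",
                    # Base64 text is ASCII-safe, so it can live inside JSON.
                    "data": base64.b64encode(png).decode("ascii"),
                }
            }
        )
    return {"contents": [{"parts": parts}]}


def to_json(body: Dict) -> str:
    """Serialize the request body for sending over HTTP."""
    return json.dumps(body)
```

The returned dictionary would then be posted to the model endpoint together with the API key supplied in the sidebar.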
Example Interaction:
▶ Q: “When was the contract signed?”
▶ A: “The contract was signed in June 2023, as shown on page 5.”
Advantage: Eliminates the need to read through long documents for a simple answer.
Streamlined Document Management
Manager Agent: Keeps the system organized and efficient.
Responsibilities:
▶ Stores document metadata like name, size, and upload date.
▶ Uses SHA256 hashing to prevent duplicate uploads.
▶ Allows users to delete or replace documents as needed.
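These responsibilities can be illustrated with a small in-memory registry. `DocumentRecord` and `DocumentRegistry` are hypothetical names, and hashing the full file contents is an assumption about how the duplicate check is done.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class DocumentRecord:
    name: str
    size: int  # bytes
    uploaded_at: datetime
    sha256: str


class DocumentRegistry:
    """In-memory metadata store that rejects byte-identical uploads."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, DocumentRecord] = {}

    def add(self, name: str, data: bytes) -> Optional[DocumentRecord]:
        """Register a file; return None if the same content already exists."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._by_hash:
            return None
        record = DocumentRecord(
            name=name,
            size=len(data),
            uploaded_at=datetime.now(timezone.utc),
            sha256=digest,
        )
        self._by_hash[digest] = record
        return record

    def delete(self, sha256: str) -> bool:
        """Remove a document by hash; return True if something was removed."""
        return self._by_hash.pop(sha256, None) is not None

    def records(self) -> List[DocumentRecord]:
        return list(self._by_hash.values())
```

Because the key is the content hash rather than the filename, a renamed copy of the same PDF is still detected as a duplicate.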
Importance:
▶ Avoids redundancy and confusion.
▶ Ensures a smooth experience even with many documents.
▶ Forms the backbone for future cloud integration.
Intuitive Interface for Document Search
Frontend Built with Streamlit: Clean, fast, and reactive UI.
User Flow:
▶ Tab 1: Upload PDFs and manage the file list.
▶ Tab 2: Ask a question and view matched pages with answers.
▶ Sidebar: Customize settings like top-k results or Gemini API key.
Features:
▶ Alerts for missing keys, unsupported formats, and no results.
▶ Visual indicators for loading, success, and error states.
▶ Designed to be usable even for non-technical users.
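A rough sketch of how the two tabs, the sidebar, and the alerts could fit together in Streamlit. Widget labels, the `collect_alerts` helper, and the placeholder result count are assumptions; retrieval and answering are not wired in.

```python
from typing import List, Sequence

SUPPORTED_EXTENSIONS = (".pdf",)


def collect_alerts(api_key: str, filenames: Sequence[str], n_results: int) -> List[str]:
    """Return user-facing warnings for the three alert cases (hypothetical helper)."""
    alerts: List[str] = []
    if not api_key.strip():
        alerts.append("Gemini API key is missing.")
    for name in filenames:
        if not name.lower().endswith(SUPPORTED_EXTENSIONS):
            alerts.append(f"Unsupported format: {name}")
    if n_results == 0:
        alerts.append("No matching pages found.")
    return alerts


def render_app() -> None:
    """Lay out the sidebar and the two tabs."""
    import streamlit as st  # imported lazily; only needed when the UI runs

    api_key = st.sidebar.text_input("Gemini API key", type="password")
    top_k = st.sidebar.slider("Top-k results", min_value=1, max_value=10, value=3)

    upload_tab, query_tab = st.tabs(["Upload", "Query"])
    with upload_tab:
        files = st.file_uploader(
            "PDF files", type=["pdf"], accept_multiple_files=True
        )
        names = [f.name for f in (files or [])]
        st.write(f"{len(names)} document(s) loaded")
    with query_tab:
        question = st.text_input("Ask a question")
        if question:
            # Placeholder: n_results=0 because retrieval is not wired in here.
            for message in collect_alerts(api_key, names, n_results=0):
                st.warning(message)
            st.caption(f"Would retrieve the top {top_k} pages.")
```

Saved as a script with a final `render_app()` call, this would be started with `streamlit run` followed by the script name.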
Built for Performance and Scalability
Performance Features:
▶ Embedding cache avoids re-computation.
▶ PyTorch’s DataLoader enables fast batch processing.
▶ Asynchronous tasks reduce UI wait time.
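The batching and background-work ideas can be shown with the standard library alone. This sketch deliberately swaps in a plain batching helper for PyTorch's DataLoader and a thread pool for the asynchronous tasks; all function names are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    pages: Sequence[T],
    embed_batch: Callable[[List[T]], List[R]],
    batch_size: int = 4,
) -> List[R]:
    """Run the model once per batch instead of once per page."""
    results: List[R] = []
    for batch in batched(pages, batch_size):
        results.extend(embed_batch(batch))
    return results


def submit_embedding_job(
    executor: ThreadPoolExecutor,
    pages: Sequence[T],
    embed_batch: Callable[[List[T]], List[R]],
    batch_size: int = 4,
) -> Future:
    """Start embedding in a worker thread so the caller is not blocked."""
    return executor.submit(embed_in_batches, pages, embed_batch, batch_size)
```

The caller can keep the interface responsive and collect the embeddings later through the returned future's `result()` method.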
Scalability Considerations:
▶ Agent modularity allows for parallel processing.
▶ Code supports future cloud-based deployment.
▶ Optimized for thousands of pages.
Robust Design: Fails gracefully and logs detailed errors for debugging.
Empowering Industries with Intelligent Search
Use Cases:
▶ Academia: Quickly locate references and definitions.
▶ Legal Sector: Extract clauses, dates, and key terms from contracts.
▶ Corporate: Audit reports, HR docs, or compliance files.
Why It’s Needed:
▶ Massive time savings for high-volume workflows.
▶ Enhanced accuracy compared to manual review.
▶ Democratizes document search for non-engineers.
The Road Ahead for PDF Content Query
Planned Enhancements:
▶ Extend support to DOCX, scanned images, and PPTX.
▶ Integrate multiple AI models (Claude, GPT-4, etc.).
▶ Improve semantic parsing of long multi-part questions.
Scalability Plans:
▶ Offload embedding and query processes to the cloud.
▶ Integrate with Google Drive and cloud buckets.
▶ Introduce multilingual OCR and summarization.
Redefining PDF Interaction with AI
Key Takeaways:
▶ Makes unstructured PDFs interactive and searchable.
▶ Modular agents offer high performance and easy maintenance.
▶ Ready for integration into workflows across domains.
Final Thought: Our system is a step toward intelligent document interfaces—fast,
accurate, and human-centric.
Thank You!
Questions? We’re happy to answer.