usnistgov/aptafind

AptaFind: Aptamer Intelligence Extraction Pipeline

🧬 Automated extraction of aptamer data from scientific literature

AptaFind is a complete pipeline for extracting aptamer sequences, binding affinities, and experimental conditions from PDF research papers. It uses a coarse-to-fine approach combining semantic search, regex pattern matching, and local LLMs for verification.

Features

  • Semantic PDF Processing: Converts PDFs to structured, searchable chunks
  • Smart Retrieval: Uses ColBERT embeddings for relevant passage discovery
  • Pattern Extraction: Regex-based extraction of sequences, Kd values, and conditions
  • LLM Verification: Local transformer models verify and link extracted data
  • Knowledge Base: SQLite database with full-text search capabilities
  • Multiple Export Formats: CSV, JSON, FASTA, and HTML reports
  • No External AI APIs: All model inference runs locally using transformers; a network connection is needed only for online search

💎 Quick Start

Installation

# Clone the repository
git clone https://github.com/usnistgov/aptafind.git
cd aptafind

# Install dependencies
pip install -r requirements.txt

Basic Usage

Build local PDF database

# Process PDFs through complete pipeline
python aptafind.py process <pdf folder>

Run GUI (recommended)

chmod +x ./start_app.sh && ./start_app.sh

Run via CLI

# Search the resulting database
python aptafind.py search --target thrombin --max-kd 100

# View database statistics
python aptafind.py stats

Pipeline Overview

AptaFind runs a pipeline of five stages, numbered 0 through 4, that follows the Minimum Agentic Flow principle: "Vector search/LLMs find, regex captures, LLMs verify."

LM processing is kept to a minimum and reserved for contextual tasks, which limits the potential for hallucination.

Stage 0: Document Preparation

  • Converts PDFs to markdown using marker-pdf or pymupdf4llm
  • Creates semantic chunks (1000 tokens) preserving context
  • Maintains section metadata (Methods, Results, etc.)
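
To make the chunking step concrete, here is a minimal sketch of overlapping windows, using the 1000-token chunk size above and the overlap_size of 200 from the sample config.yaml. The function name `chunk_tokens` is hypothetical, and whitespace splitting stands in for whatever tokenizer AptaFind actually uses.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 1000, overlap: int = 200) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace-separated words approximate tokens.
text = " ".join(f"w{i}" for i in range(2500))
chunks = chunk_tokens(text.split())
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

Each window starts 800 tokens after the previous one, so a passage that straddles a chunk boundary still appears intact in at least one chunk.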

Stage 1: Coarse Retrieval (Intelligence Layer)

  • Builds ColBERT search index using RAGatouille
  • Multi-query retrieval with aptamer-specific queries
  • Local LLM relevance scoring (permissive threshold)
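
The retrieval stage can be pictured as "run several queries, merge the hits, then keep what the relevance scorer accepts". The sketch below is not AptaFind's code: `multi_query_retrieve`, `filter_relevant`, and the fake index are hypothetical stand-ins for the ColBERT index and the local LLM scorer, and treating the threshold of 6 as inclusive is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Hit = Tuple[str, float]  # (chunk_id, retrieval score)


def multi_query_retrieve(search: Callable[[str], List[Hit]],
                         queries: Iterable[str]) -> Dict[str, float]:
    """Run every query and keep the best score seen for each chunk."""
    best: Dict[str, float] = {}
    for query in queries:
        for chunk_id, score in search(query):
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    return best


def filter_relevant(chunk_ids: Iterable[str],
                    score_relevance: Callable[[str], int],
                    threshold: int = 6) -> List[str]:
    """Keep chunks whose relevance score meets the (assumed inclusive) threshold."""
    return [c for c in chunk_ids if score_relevance(c) >= threshold]


# Stand-ins for the search index and the LLM scorer (made-up values).
FAKE_INDEX = {
    "thrombin aptamer sequence": [("c1", 0.9), ("c2", 0.4)],
    "thrombin Kd binding affinity": [("c2", 0.7), ("c3", 0.5)],
}
FAKE_RELEVANCE = {"c1": 9, "c2": 6, "c3": 3}

merged = multi_query_retrieve(lambda q: FAKE_INDEX.get(q, []), FAKE_INDEX)
kept = filter_relevant(merged, FAKE_RELEVANCE.__getitem__)
print(merged)  # {'c1': 0.9, 'c2': 0.7, 'c3': 0.5}
print(kept)    # ['c1', 'c2']
```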

Stage 2: Pattern Detection (Regex Layer)

  • Extracts DNA/RNA sequences (20-100 nucleotides)
  • Captures binding affinities (Kd, IC50, EC50) with unit conversion
  • Identifies experimental conditions (pH, temperature, buffer)
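
As an illustration of the regex layer, the following sketch matches 20-100 nucleotide runs and converts Kd values to nM. The patterns, the helper names, and the sample sentence (including its sequence) are assumptions made up for this example; AptaFind's actual patterns are likely broader.

```python
import re

# 20-100 contiguous uppercase nucleotide letters (DNA or RNA alphabet).
SEQ_RE = re.compile(r"\b[ACGTU]{20,100}\b")

# Matches forms such as "Kd = 25.3 nM", "Kd of 1.2 µM", "KD: 300 pM".
KD_RE = re.compile(
    r"\bK[dD]\s*(?:=|:|of|was|is)?\s*(\d+(?:\.\d+)?)\s*(pM|nM|[uµμ]M|mM)"
)

# Multipliers that convert each unit to nanomolar.
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6}


def find_sequences(text):
    """Return candidate nucleotide sequences found in the text."""
    return SEQ_RE.findall(text)


def find_kd_nm(text):
    """Return every Kd value in the text, converted to nM."""
    values = []
    for number, unit in KD_RE.findall(text):
        unit = unit.replace("µ", "u").replace("μ", "u")  # both micro glyphs
        values.append(float(number) * TO_NM[unit])
    return values


# Made-up sentence for demonstration only.
sample = (
    "The aptamer 5'-GGTTGGTGTGGTTGGAAAAAGGTTGG-3' bound thrombin with a "
    "Kd of 25.3 nM; a truncated variant showed Kd = 1.2 µM."
)
print(find_sequences(sample))
print(find_kd_nm(sample))
```

Regex alone cannot tell which Kd belongs to which sequence, which is why the pipeline hands linkage to the verification stage.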

Stage 3: Verification & Linkage (Intelligence Layer)

  • LLM verification of extracted sequences
  • Links sequences with binding measurements and conditions
  • Deduplication and confidence scoring
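
Deduplication against a similarity threshold could look roughly like this. The sketch uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure and the 0.95 value from the sample config.yaml; the measure AptaFind actually uses is not stated here, and the sequences are made up.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; autojunk is disabled because sequences are repetitive."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def deduplicate(sequences, threshold=0.95):
    """Keep a sequence only if it is not near-identical to one already kept."""
    kept = []
    for seq in sequences:
        if all(similarity(seq, k) < threshold for k in kept):
            kept.append(seq)
    return kept


# Made-up 40-nt sequences: b differs from a only in its final base.
a = "ACGTTGCAAGGCTTAGCCGATATCGGATCCAAGCTTGGCA"
b = "ACGTTGCAAGGCTTAGCCGATATCGGATCCAAGCTTGGCC"
c = "TTTTTTTTTTGGGGGGGGGGTTTTTTTTTTGGGGGGGGGG"

print(similarity(a, b))       # 0.975
print(deduplicate([a, b, c]))  # keeps a and c, drops b
```

The first occurrence wins in this sketch; a real pipeline would more plausibly keep the record with the highest confidence.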

Stage 4: Knowledge Base Construction

  • SQLite database with full-text search
  • Multiple export formats
  • Performance metrics and validation
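
For readers unfamiliar with SQLite full-text search, the sketch below shows the general idea with an FTS5 virtual table and a Kd filter similar to `--max-kd`. The table name, columns, and rows are hypothetical and are not AptaFind's schema; it also assumes the local SQLite build includes FTS5, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: text columns are indexed, kd_nm is stored but not indexed.
conn.execute(
    "CREATE VIRTUAL TABLE aptamers "
    "USING fts5(sequence, target, paper_title, kd_nm UNINDEXED)"
)

# Made-up rows for demonstration only.
conn.executemany(
    "INSERT INTO aptamers VALUES (?, ?, ?, ?)",
    [
        ("GGTTGGTGTGGTTGGAAAAAGGTTGG", "thrombin", "example paper one", 25.3),
        ("ACGUACGUACGUACGUACGUACGU", "VEGF", "example paper two", 0.8),
        ("TTGGTTGGTTGGTTGGTTGGTTGG", "thrombin", "example paper three", 140.0),
    ],
)


def search(conn, query, max_kd=None):
    """Full-text search, optionally restricted to Kd values at or below max_kd (nM)."""
    sql = "SELECT sequence, target, kd_nm FROM aptamers WHERE aptamers MATCH ?"
    params = [query]
    if max_kd is not None:
        sql += " AND CAST(kd_nm AS REAL) <= ?"
        params.append(max_kd)
    return conn.execute(sql, params).fetchall()


print(search(conn, "thrombin"))
print(search(conn, "thrombin", max_kd=100))
```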

⌨️ CLI Commands

Full Pipeline

python aptafind.py process <pdf_directory>              # Complete pipeline
python aptafind.py process ut_pdfs/ --target VEGF      # Target-specific extraction

Individual Stages

python aptafind.py ingest ut_pdfs/                     # Stage 0: PDF processing
python aptafind.py retrieve --target thrombin          # Stage 1: Semantic search
python aptafind.py extract data/candidates/candidates.json  # Stage 2: Pattern extraction
python aptafind.py verify data/extractions/extractions_candidates.json  # Stage 3: LLM verification
python aptafind.py build data/verified/verified_*.json  # Stage 4: Database construction

Database Operations

python aptafind.py search --query "DNA aptamer"        # Full-text search
python aptafind.py search --target thrombin --max-kd 50  # Filtered search
python aptafind.py search --min-confidence high --type DNA  # Advanced filtering
python aptafind.py stats                               # Database statistics

⚙️ Configuration

Edit config.yaml to customize:

# LLM Configuration - Direct Transformers
llm:
  model: "microsoft/DialoGPT-medium"  # or "google/flan-t5-base"
  device: "auto"  # auto-detect GPU/CPU
  load_in_8bit: true
  temperature: 0.1

# Document Processing
processing:
  chunk_size: 1000  # tokens per chunk
  overlap_size: 200
  pdf_tool: "marker-pdf"  # or "pymupdf4llm"

# Confidence Thresholds
thresholds:
  relevance_score: 6  # permissive (7 = strict)
  sequence_similarity: 0.95  # for deduplication

Example Output

{
  "sequence": "ATCGATCGATCGATCG",
  "sequence_type": "DNA",
  "target": "thrombin",
  "kd_nM": 25.3,
  "conditions": {
    "pH": 7.4,
    "temperature_c": 25,
    "buffer": "PBS"
  },
  "confidence": "high",
  "source": {
    "paper_title": "Novel thrombin aptamers...",
    "section": "Results"
  }
}
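
A record like the one above maps naturally onto the FASTA export format. This is a minimal sketch of such a conversion; the function name and the header layout are assumptions, not the format AptaFind writes.

```python
import json


def record_to_fasta(record: dict, index: int = 1, width: int = 60) -> str:
    """Render one extraction record as a FASTA entry (hypothetical header layout)."""
    header = (
        f">aptamer_{index}|target={record['target']}"
        f"|type={record['sequence_type']}|kd_nM={record['kd_nM']}"
    )
    seq = record["sequence"]
    # FASTA bodies are conventionally wrapped at a fixed line width.
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header, *lines]) + "\n"


record = json.loads("""
{
  "sequence": "ATCGATCGATCGATCG",
  "sequence_type": "DNA",
  "target": "thrombin",
  "kd_nM": 25.3
}
""")

print(record_to_fasta(record), end="")
# >aptamer_1|target=thrombin|type=DNA|kd_nM=25.3
# ATCGATCGATCGATCG
```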

Research Applications

  • Aptamer Discovery: Find existing sequences for specific targets
  • Binding Analysis: Compare affinities across different conditions
  • Literature Mining: Extract data from large paper collections
  • Database Construction: Build searchable aptamer repositories
  • Experimental Planning: Identify optimal conditions and controls

Performance Tips

  1. GPU Acceleration: Install PyTorch with CUDA for faster LLM inference
  2. PDF Quality: Higher quality PDFs yield better text extraction
  3. Batch Processing: Process multiple papers together for efficiency
  4. Target-Specific: Use --target flag for focused extraction

Contributing

This is a rapid-iteration research tool. Contributions welcome for:

  • Additional LLM model support
  • Improved sequence pattern recognition
  • Enhanced metadata extraction
  • Performance optimizations

License

See License file.


AptaFind: From literature to knowledge base in minutes, not days!

About

A local, lightweight, LM-driven bioscience data miner
