🧬 Automated extraction of aptamer data from scientific literature
AptaFind is a complete pipeline for extracting aptamer sequences, binding affinities, and experimental conditions from PDF research papers. It uses a coarse-to-fine approach combining semantic search, regex pattern matching, and local LLMs for verification.
- Semantic PDF Processing: Converts PDFs to structured, searchable chunks
- Smart Retrieval: Uses ColBERT embeddings for relevant passage discovery
- Pattern Extraction: Regex-based extraction of sequences, Kd values, and conditions
- LLM Verification: Local transformer models verify and link extracted data
- Knowledge Base: SQLite database with full-text search capabilities
- Multiple Export Formats: CSV, JSON, FASTA, and HTML reports
- No External AI APIs: Runs completely locally using transformers (requires network connection for online search)
# Clone the repository
git clone http://github.com/usnistgov/aptafind.git
cd aptafind
# Install dependencies
pip install -r requirements.txt
Build local PDF database
# Process PDFs through complete pipeline
python aptafind.py process <pdf folder>
Run GUI (recommended)
chmod +x ./start_app.sh && ./start_app.sh
Run via CLI
# Search the resulting database
python aptafind.py search --target thrombin --max-kd 100
# View database statistics
python aptafind.py stats
AptaFind uses a 4-stage pipeline adhering to the Minimum Agentic Flow principle: "Vector search/LLMs find, regex captures, LLMs verify"
LLM processing is kept to a minimum and used only for contextual tasks, to limit the potential for hallucination.
- Converts PDFs to markdown using marker-pdf or pymupdf4llm
- Creates semantic chunks (1000 tokens) preserving context
- Maintains section metadata (Methods, Results, etc.)
- Builds ColBERT search index using RAGatouille
- Multi-query retrieval with aptamer-specific queries
- Local LLM relevance scoring (permissive threshold)
- Extracts DNA/RNA sequences (20-100 nucleotides)
- Captures binding affinities (Kd, IC50, EC50) with unit conversion
- Identifies experimental conditions (pH, temperature, buffer)
- LLM verification of extracted sequences
- Links sequences with binding measurements and conditions
- Deduplication and confidence scoring
- SQLite database with full-text search
- Multiple export formats
- Performance metrics and validation
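The pattern-extraction bullets above (sequences of 20-100 nucleotides, Kd values with unit conversion) can be illustrated with a minimal sketch. This is not AptaFind's actual code: the regular expressions, the function names `find_sequences` and `find_kd_nm`, and the choice of nM as the common unit are assumptions made for illustration, and the example sequence is made up.

```python
import re

# Assumed pattern: an uninterrupted run of 20-100 nucleotide letters
# that is not embedded in a longer word.
SEQ_RE = re.compile(r"(?<![A-Za-z])[ACGTU]{20,100}(?![A-Za-z])")

# Assumed pattern: phrases such as "Kd = 25.3 nM", "Kd of 2 µM", "KD: 500 pM".
KD_RE = re.compile(
    r"\bK[dD]\s*(?:=|:|of|was|is)?\s*(\d+(?:\.\d+)?)\s*(pM|nM|[uµμ]M|mM)"
)

# Conversion factors to nanomolar.
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6}


def find_sequences(text):
    """Return (sequence, type) pairs for candidate aptamer sequences."""
    hits = []
    for match in SEQ_RE.finditer(text):
        seq = match.group(0)
        if "T" in seq and "U" in seq:
            continue  # mixed T/U is ambiguous, so this sketch skips it
        hits.append((seq, "RNA" if "U" in seq else "DNA"))
    return hits


def find_kd_nm(text):
    """Return all Kd values found in the text, converted to nM."""
    values = []
    for number, unit in KD_RE.findall(text):
        unit = unit.replace("µ", "u").replace("μ", "u")  # normalize micro sign
        values.append(float(number) * TO_NM[unit])
    return values


passage = (
    "The aptamer GGTTGGTGTGGTTGGACGTACGTA bound thrombin "
    "with Kd = 2.5 µM at pH 7.4."
)
print(find_sequences(passage))
print(find_kd_nm(passage))
```

Regex matching alone will pick up primers and other non-aptamer oligonucleotides, which is why a separate verification stage follows extraction in the pipeline.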
python aptafind.py process <pdf_directory> # Complete pipeline
python aptafind.py process ut_pdfs/ --target VEGF # Target-specific extraction
python aptafind.py ingest ut_pdfs/ # Stage 0: PDF processing
python aptafind.py retrieve --target thrombin # Stage 1: Semantic search
python aptafind.py extract data/candidates/candidates.json # Stage 2: Pattern extraction
python aptafind.py verify data/extractions/extractions_candidates.json # Stage 3: LLM verification
python aptafind.py build data/verified/verified_*.json # Stage 4: Database construction
python aptafind.py search --query "DNA aptamer" # Full-text search
python aptafind.py search --target thrombin --max-kd 50 # Filtered search
python aptafind.py search --min-confidence high --type DNA # Advanced filtering
python aptafind.py stats # Database statistics
Edit config.yaml to customize:
# LLM Configuration - Direct Transformers
llm:
  model: "microsoft/DialoGPT-medium" # or "google/flan-t5-base"
  device: "auto" # auto-detect GPU/CPU
  load_in_8bit: true
  temperature: 0.1
# Document Processing
processing:
  chunk_size: 1000 # tokens per chunk
  overlap_size: 200
  pdf_tool: "marker-pdf" # or "pymupdf4llm"
# Confidence Thresholds
thresholds:
  relevance_score: 6 # permissive (7 = strict)
  sequence_similarity: 0.95 # for deduplication
{
  "sequence": "ATCGATCGATCGATCG",
  "sequence_type": "DNA",
  "target": "thrombin",
  "kd_nM": 25.3,
  "conditions": {
    "pH": 7.4,
    "temperature_c": 25,
    "buffer": "PBS"
  },
  "confidence": "high",
  "source": {
    "paper_title": "Novel thrombin aptamers...",
    "section": "Results"
  }
}
- Aptamer Discovery: Find existing sequences for specific targets
- Binding Analysis: Compare affinities across different conditions
- Literature Mining: Extract data from large paper collections
- Database Construction: Build searchable aptamer repositories
- Experimental Planning: Identify optimal conditions and controls
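As a rough illustration of the binding-analysis use case, the sketch below groups records by buffer and compares median Kd values. It assumes records exported as JSON follow the output format shown earlier (`target`, `kd_nM`, `conditions.buffer`). The helper name `median_kd_by_buffer` and the sample records are hypothetical and not part of AptaFind.

```python
from collections import defaultdict
from statistics import median


def median_kd_by_buffer(records, target):
    """Group Kd values (nM) for one target by buffer and take the median."""
    groups = defaultdict(list)
    for rec in records:
        if rec.get("target") != target or rec.get("kd_nM") is None:
            continue  # skip other targets and records without a Kd
        buffer = rec.get("conditions", {}).get("buffer", "unknown")
        groups[buffer].append(rec["kd_nM"])
    return {buf: median(vals) for buf, vals in groups.items()}


# Hypothetical records in the documented output format.
records = [
    {"target": "thrombin", "kd_nM": 10.0, "conditions": {"buffer": "PBS"}},
    {"target": "thrombin", "kd_nM": 30.0, "conditions": {"buffer": "PBS"}},
    {"target": "thrombin", "kd_nM": 5.0, "conditions": {"buffer": "Tris"}},
    {"target": "VEGF", "kd_nM": 1.0, "conditions": {"buffer": "PBS"}},
]
print(median_kd_by_buffer(records, "thrombin"))
```

The median is used here instead of the mean because reported Kd values for the same target can span orders of magnitude across papers.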
- GPU Acceleration: Install PyTorch with CUDA for faster LLM inference
- PDF Quality: Higher quality PDFs yield better text extraction
- Batch Processing: Process multiple papers together for efficiency
- Target-Specific: Use the --target flag for focused extraction
This is a rapid-iteration research tool. Contributions welcome for:
- Additional LLM model support
- Improved sequence pattern recognition
- Enhanced metadata extraction
- Performance optimizations
See License file.
AptaFind: From literature to knowledge base in minutes, not days!
