A comprehensive machine learning pipeline for detecting and classifying biotic interactions in scientific text using multiple state-of-the-art NLP models.
This project implements various classification approaches for identifying biotic interactions (predation, parasitism, pollination, etc.) from scientific literature passages. The system supports multiple model architectures and includes tools for training, evaluation, and inference.
- **Multiple Model Architectures** (see the loading sketch after this list):
  - Transformer models (DistilBERT, BioBERT, RoBERTa, PubMedBERT)
  - Support Vector Machines (SVM)
  - Random Forest
  - Logistic Regression
  - BERT variants
  - LLaMA models
  - LUKE (entity-aware transformers)
- **Advanced Training Features**:
  - 5-fold stratified cross-validation
  - Active learning for efficient annotation
  - Precision-recall threshold optimization
  - High-precision mode for reduced false positives
  - External validation on curated test sets
- **Production-Ready APIs**:
  - RESTful API endpoints for each model
  - Batch prediction support
  - Real-time inference
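Any of the transformer backbones listed above can be loaded through the Hugging Face transformers API; a minimal sketch (the checkpoint IDs are the usual public ones, not pinned by this repo):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Swap in any listed backbone via its checkpoint ID,
# e.g. "dmis-lab/biobert-v1.1" or "roberta-base"
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("The fox hunts mice in the meadow.", return_tensors="pt")
logits = model(**inputs).logits  # two scores: no interaction vs. interaction
```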
```
classifier/
├── src/                                 # Source code
│   ├── models/                          # Model training scripts
│   │   ├── transformer_classifier.py    # Main transformer training pipeline
│   │   ├── svm_classifier.py            # SVM implementation
│   │   ├── random_forest_classifier.py  # Random Forest classifier
│   │   ├── bert_classifier.py           # BERT fine-tuning
│   │   ├── llama_classifier.py          # LLaMA model training
│   │   └── luke_classifier.py           # LUKE entity-aware model
│   ├── api/                             # API endpoints
│   │   ├── biomedbert_api.py
│   │   ├── svm_api.py
│   │   ├── lr_api.py
│   │   └── test_api.py
│   ├── data/                            # Data processing
│   │   ├── preprocessing.py
│   │   ├── data_building.py
│   │   ├── retrieve_positives.py
│   │   ├── retrieve_negatives.py
│   │   └── retrieve_random.py
│   └── utils/                           # Utilities
│       ├── prediction.py
│       ├── evaluation.py
│       └── classification_details.py
├── data/
│   ├── training/                        # Training datasets
│   ├── evaluation/                      # Test/validation sets
│   ├── processed/                       # Processed data
│   └── taxonomies/                      # Species taxonomy data
├── models/                              # Trained models (git-ignored)
├── results/                             # Outputs
│   ├── cv_results/                      # Cross-validation metrics
│   ├── predictions/                     # Model predictions
│   └── figures/                         # Visualizations
├── notebooks/                           # Jupyter notebooks
└── scripts/                             # Utility scripts
```
```bash
git clone <your-repo-url>
cd classifier

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install -r requirements.txt

# Download the spaCy language model
python -m spacy download en_core_web_sm
```

Train all transformer models with cross-validation:

```bash
python src/models/transformer_classifier.py --model all --epochs 3 --cv_folds 5
```

Train a specific model:

```bash
python src/models/transformer_classifier.py --model biobert --epochs 5
```

Enable active learning:

```bash
python src/models/transformer_classifier.py --model BiomedBERT --active_learning --al_iterations 10
```

Optimize for high precision:

```bash
python src/models/transformer_classifier.py --model distilbert --target_precision 0.8
```

Train the SVM baseline:

```bash
python src/models/svm_classifier.py
```

Run predictions on the evaluation set:

```bash
python src/utils/prediction.py --model biomedbert --input data/evaluation/eval_100.tsv
```

Start a model API server:

```bash
python src/api/biomedbert_api.py
```

Results on the external test set (eval_100.tsv):
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| BiomedBERT | TBD | TBD | TBD |
| BioBERT | TBD | TBD | TBD |
| RoBERTa | TBD | TBD | TBD |
| DistilBERT | TBD | TBD | TBD |
| SVM | TBD | TBD | TBD |
See results/cv_results/ for detailed metrics.
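Once prediction files are generated, the table's metrics can be computed with scikit-learn. A sketch, assuming a tab-separated file with gold `label` and model `prediction` columns (the real eval_100.tsv schema may differ):

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Column names are assumptions; adjust to the actual evaluation schema
df = pd.read_csv("data/evaluation/eval_100.tsv", sep="\t")
p, r, f1, _ = precision_recall_fscore_support(
    df["label"], df["prediction"], average="binary"
)
print(f"Precision={p:.3f}  Recall={r:.3f}  F1={f1:.3f}")
```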
- training_data_cleaned.csv: Cleaned training set with labeled passages
- Format: `passage,label`, where label is 0 (no interaction) or 1 (interaction); see the cross-validation sketch after this list
- eval_100.tsv: Hand-curated test set (100 samples)
- BiotXBench annotations: Additional quality-checked annotations
- GloBI datasets: Global Biotic Interactions passages
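The 5-fold stratified cross-validation used during training can be reproduced on this format with scikit-learn. A sketch, assuming the `passage,label` CSV layout described above (the file path is inferred from the project layout):

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("data/training/training_data_cleaned.csv")

# Stratification keeps the 0/1 label ratio consistent across folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(df["passage"], df["label"])):
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} validation")
```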
The transformer classifier supports uncertainty-based active learning:
1. Start with a small labeled dataset (default: 500 samples)
2. Train an initial model
3. Select the most uncertain samples from the unlabeled pool (sketched after these steps)
4. Iteratively expand the training set
5. Track performance improvement over iterations
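Step 3 is standard uncertainty sampling: passages whose predicted positive-class probability lies closest to the 0.5 decision boundary are queued for annotation first. A minimal sketch, assuming a `predict_proba`-style score array (names here are illustrative, not the repo's actual internals):

```python
import numpy as np

def select_uncertain(probs, pool_texts, k=100):
    """Return the k passages whose positive-class probability is nearest 0.5."""
    probs = np.asarray(probs)
    # Smaller distance to 0.5 = higher model uncertainty
    order = np.argsort(np.abs(probs - 0.5))
    return [pool_texts[i] for i in order[:k]]

# Hypothetical usage with scores from the current model on the unlabeled pool:
# probs = model.predict_proba(pool_texts)[:, 1]
# to_annotate = select_uncertain(probs, pool_texts, k=100)
```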
Example:

```bash
python src/models/transformer_classifier.py \
    --model biobert \
    --active_learning \
    --al_iterations 10
```

For high-precision applications (e.g., automated curation), the system can optimize decision thresholds:

```bash
python src/models/transformer_classifier.py \
    --model BiomedBERT \
    --target_precision 0.9
```

This finds the optimal probability threshold to achieve the target precision while maximizing recall.
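A common way to derive such a threshold from validation scores is scikit-learn's precision-recall curve; the sketch below is one such approach, not necessarily the script's exact internals:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, y_scores, target=0.9):
    """Pick the threshold that meets the target precision with the most recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # precision/recall have one more entry than thresholds; drop the last point
    meets_target = precision[:-1] >= target
    if not meets_target.any():
        raise ValueError("Target precision unreachable on this validation set")
    # Among qualifying thresholds, keep the one preserving the most recall
    best = np.argmax(np.where(meets_target, recall[:-1], -1.0))
    return thresholds[best]
```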
POST /predict

```json
{
  "text": "The fox hunts mice in the meadow."
}
```

Response:

```json
{
  "prediction": 1,
  "confidence": 0.94,
  "label": "interaction"
}
```

POST /predict_batch

```json
{
  "texts": ["sentence 1", "sentence 2", ...]
}
```
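Any HTTP client can hit these endpoints; below is a sketch using requests (the host and port are assumptions and depend on how the API script is configured):

```python
import requests

API = "http://localhost:5000"  # assumed default; check the API script's settings

single = requests.post(f"{API}/predict",
                       json={"text": "The fox hunts mice in the meadow."})
print(single.json())  # e.g. {"prediction": 1, "confidence": 0.94, "label": "interaction"}

batch = requests.post(f"{API}/predict_batch",
                      json={"texts": ["Bees pollinate clover flowers.",
                                      "The study was conducted in 2019."]})
print(batch.json())
```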