AffilGood provides tools and annotated datasets to improve the accuracy of attributing scientific works to research organizations, especially in multilingual and complex contexts. The framework addresses key challenges in institution name disambiguation through a modular pipeline approach.
This is the official repository for the paper "AffilGood: Building reliable institution name disambiguation tools to improve scientific literature analysis", presented at the Scholarly Document Processing (SDP) 2024 workshop at the ACL 2024 conference. The slides used in the presentation are available here.
- Modular Pipeline Architecture: Separate components for span identification, named entity recognition, entity linking, and metadata normalization
- Multilingual Support: Models trained on data in multiple languages
- Advanced Entity Linking: Multiple linking strategies that combine retrievers and reranking mechanisms
- Multiple Data Sources: Support for ROR, WikiData, and custom data sources
- Location Normalization: Integration with OpenStreetMap for standardizing geographic data
- Language Processing: Automatic language detection and translation capabilities
- Performance Optimization: Caching mechanisms and batch processing for efficient handling of large datasets
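The caching and batch-processing idea in the last bullet can be sketched as follows. This is a minimal illustration, not AffilGood's internal implementation: `batched`, `process_with_cache`, and `fake_process` are hypothetical names, and the sketch assumes the processing function returns one result per input string, in the same order.

```python
from typing import Callable, Dict, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_with_cache(
    affiliations: List[str],
    process_batch: Callable[[List[str]], List[dict]],
    cache: Dict[str, dict],
    batch_size: int = 32,
) -> List[dict]:
    """Process each distinct string once, in batches, and reuse cached results."""
    # Keep first-seen order while dropping duplicates and already-cached strings.
    pending = [a for a in dict.fromkeys(affiliations) if a not in cache]
    for batch in batched(pending, batch_size):
        for text, result in zip(batch, process_batch(batch)):
            cache[text] = result
    # Map results back to the original order, duplicates included.
    return [cache[a] for a in affiliations]


# Stand-in for a real pipeline call (an assumption for this sketch);
# it records how many strings each call receives.
calls: List[int] = []


def fake_process(batch: List[str]) -> List[dict]:
    calls.append(len(batch))
    return [{"raw_text": text} for text in batch]


cache: Dict[str, dict] = {}
inputs = ["Univ. A, Spain", "Univ. B, USA", "Univ. A, Spain"]
results = process_with_cache(inputs, fake_process, cache, batch_size=2)
print(len(results), calls)
```

In practice, `fake_process` could be swapped for a call into the pipeline, provided that call satisfies the one-result-per-input assumption stated above.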
For more detailed information about using and extending AffilGood, check out our documentation:
- Getting Started - Installation and first steps
- Modules Reference - Detailed reference for classes and methods
- Entity Linking - Guide to entity linking capabilities
- Data Sources - Available data sources and customization
- Language Processing - Multilingual support and translation
- Customization - Extending the pipeline with custom components
- Performance - Optimization and scaling strategies
- Usage Examples - Code examples for different scenarios
- Technical Overview - In-depth explanation of architecture
- Contribution Guide - Guidelines for contributing
We recommend installing AffilGood in editable mode to allow development and live code changes:
git clone https://github.com/sirisacademic/affilgood.git
cd affilgood
pip install -e .
⚠️ Note: Installing without the -e flag (editable mode) may result in import errors due to how nested modules are organized.
from affilgood import AffilGood
# Initialize with default settings
affil_good = AffilGood()
# Or customize components
affil_good = AffilGood(
    span_separator='',  # Empty separator: use model-based span identification
    span_model_path='SIRIS-Lab/affilgood-span-multilingual',  # Custom span model
    ner_model_path='SIRIS-Lab/affilgood-NER-multilingual',  # Custom NER model
    entity_linkers=['Whoosh', 'DenseLinker'],  # Use multiple linkers
    return_scores=True,  # Return confidence scores with predictions
    metadata_normalization=True,  # Enable location normalization
    verbose=False,  # Set to True for detailed logging
    device=None  # Auto-detect device (CPU or CUDA)
)
# Process affiliation strings
affiliations = [
    "Granges Terragrisa SL, Paratge de La Gleva, Camí de Burrissola s/n, E-08508 Les Masies de Voltregà (Barcelona), Catalonia, Spain",
    "Treuman Katz Center for Pediatric Bioethics, Seattle Children's Research Institute, Seattle, WA, USA"
]
# Full pipeline processing (span identification, NER, normalization, entity linking)
results = affil_good.process(affiliations)
# Or use individual components
spans = affil_good.get_span(affiliations)
entities = affil_good.get_ner(spans)
normalized = affil_good.get_normalization(entities)
linked = affil_good.get_entity_linking(normalized)
print(linked)
The repository is structured as follows:
affilgood/
├── __init__.py                        # Package initialization
├── affilgood.py                       # Main AffilGood class implementation
├── span_identification/               # Span identification module
│   ├── span_identifier.py             # Model-based span identification
│   ├── simple_span_identifier.py      # Character-based span splitter
│   └── noop_span_identifier.py        # Pass-through identifier for pre-segmented data
├── ner/                               # Named Entity Recognition module
│   └── ner.py                         # NER implementation
├── entity_linking/                    # Entity linking module
│   ├── entity_linker.py               # Main entity linking orchestrator
│   ├── base_linker.py                 # Base class for entity linkers
│   ├── whoosh_linker.py               # Whoosh-based entity linker
│   ├── s2aff_linker.py                # S2AFF-based entity linker
│   ├── dense_linker.py                # Dense retrieval-based entity linker
│   ├── base_reranker.py               # Base class for rerankers
│   ├── direct_pair_reranker.py        # Direct pair matching reranker
│   ├── llm_reranker.py                # LLM-based reranker for candidate selection
│   ├── constants.py                   # Constants for entity linking
│   ├── wikidata_dump_generator.py     # WikiData integration
│   ├── llm_translator.py              # Translation capabilities
│   └── __init__.py                    # Data source registry and handlers
├── metadata_normalization/            # Metadata normalization module
│   └── normalizer.py                  # Location and country normalization
└── utils/                             # Utility functions
    ├── data_manager.py                # Data loading and caching
    ├── text_utils.py                  # Text processing utilities
    └── translation_mappings.py        # Institution name translation mappings
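As a rough illustration of the difference between a character-based span splitter and the pass-through case listed under span_identification/, the sketch below splits a raw string on a fixed separator. It is not the code in simple_span_identifier.py: `split_spans` is a hypothetical helper, the sample string is made up for the example, and treating an empty separator as "do not split" is an assumption of this sketch.

```python
from typing import List


def split_spans(text: str, separator: str) -> List[str]:
    """Split a raw affiliation string on a fixed separator (hypothetical helper)."""
    if not separator:
        # No separator given: return the text unchanged as a single span.
        return [text]
    # Drop empty pieces and surrounding whitespace left over from the split.
    return [part.strip() for part in text.split(separator) if part.strip()]


# Made-up input holding two affiliations separated by a semicolon.
raw = (
    "Seattle Children's Research Institute, Seattle, WA, USA; "
    "University of Washington, Seattle, WA, USA"
)
spans = split_spans(raw, ";")
for span in spans:
    print(span)
```

A fixed separator only works when the input uses it consistently, which is presumably why a model-based identifier is also provided.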
AffilGood uses several pre-trained models available on Hugging Face:
- 🤗 SIRIS-Lab/affilgood-NER-multilingual - Multilingual NER model
- 🤗 SIRIS-Lab/affilgood-span-multilingual - Multilingual span model
- 🤗 SIRIS-Lab/affilgood-NER - English NER model
- 🤗 SIRIS-Lab/affilgood-SPAN - English span model
- 🤗 SIRIS-Lab/affilgood-affilRoBERTa - RoBERTa adapted for affiliation data
- 🤗 SIRIS-Lab/affilgood-affilXLM - XLM-RoBERTa adapted for affiliation data
Note: These results may be outdated, as the pipeline is under active development and new features are still being added.
AffilGood achieves state-of-the-art performance on institution name disambiguation tasks compared to existing systems:
| Model | MA | FA | NRMO | S2AFF* | CORDIS | ETERe | ETERm |
|---|---|---|---|---|---|---|---|
| ElasticSearch | .545 | .407 | .470 | .515 | .751 | .855 | .847 |
| OpenAlex | .394 | .118 | .769 | .871🥇 | .648 | .859 | .852 |
| S2AFF | .546 | .367 | .617 | .785 | .649 | .668 | .720 |
| AffRo | .452 | .408 | .558 | .726 | .641 | .709 | .617 |
| AffilGoodNERm + S2AFFLinker | .596 | .685 | .762 | .841 | .827 | .887 | .863 |
| AffilGoodNER + S2AFFLinker | .579 | .685 | .758 | .850 | .839 | .895 | .855 |
| AffilGoodNERm + Elastic | .690 | .587 | .747 | .640 | .849 | .887 | .894 |
| AffilGoodNER + Elastic | .649 | .610 | .755 | .648 | .855 | .893 | .881 |
| AffilGoodNERm + Elastic+qLLM | .710🥇 | .721 | .774🥇 | .790 | .881 | .936🥇 | .916🥇 |
| AffilGoodNER + Elastic+qLLM | .653 | .747🥇 | .767 | .799 | .891🥇 | .936🥇 | .909 |
If you use AffilGood in your research, please cite our paper:
@inproceedings{duran-silva-etal-2024-affilgood,
title = "{A}ffil{G}ood: Building reliable institution name disambiguation tools to improve scientific literature analysis",
author = "Duran-Silva, Nicolau and
Accuosto, Pablo and
Przyby{\l}a, Piotr and
Saggion, Horacio",
editor = "Ghosal, Tirthankar and
Singh, Amanpreet and
Waard, Anita and
Mayr, Philipp and
Naik, Aakanksha and
Weller, Orion and
Lee, Yoonjoo and
Shen, Shannon and
Qin, Yanxia",
booktitle = "Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.sdp-1.13",
pages = "135--144",
}
We welcome contributions to the AffilGood project! Instead of a single main branch, we use two branches:
- develop: Development and default branch for new features and bug fixes.
- main: Production branch used to deploy the server components to the production environment.
Please follow our Contribution Guidelines to participate in this project.
For further information, please contact [email protected].
This work is distributed under the Apache License, Version 2.0.
If you see an error like:
ImportError: ...libstdc++.so.6: version `GLIBCXX_3.4.32' not found
This means your system is using an outdated version of the C++ standard library (libstdc++.so.6), or a conflicting version from Anaconda.
Ensure you're using the correct library version. You can override Anaconda's version by setting this before running Python:
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
Or temporarily launch Python with a clean environment:
LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu python
You can update the system library via:
sudo apt update
sudo apt install libstdc++6
Check that the required version is available:
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX_3.4.32
Alternatively, reinstall hnswlib from source:
pip uninstall hnswlib
pip install --no-binary :all: hnswlib
This ensures the library is built using your system's standard C++ runtime.
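The `strings ... | grep GLIBCXX_3.4.32` check can also be run from Python, which is handy when diagnosing the problem inside the same interpreter or environment that raises the ImportError. The sketch below is a minimal equivalent under one assumption: that finding the version tag as raw bytes in the library file is a good enough signal, as it is for the grep check. `has_symbol_version` is a hypothetical helper name.

```python
from pathlib import Path


def has_symbol_version(library_path: str, version: str = "GLIBCXX_3.4.32") -> bool:
    """Return True if the version tag appears as raw bytes in the library file."""
    data = Path(library_path).read_bytes()
    return version.encode("ascii") in data


# Path taken from the commands above; it may differ on other distributions.
lib = "/usr/lib/x86_64-linux-gnu/libstdc++.so.6"
if Path(lib).exists():
    print(lib, "provides GLIBCXX_3.4.32:", has_symbol_version(lib))
else:
    print("Library not found at", lib)
```

Like the grep check, this is a substring match, so it confirms that the tag is present in the file but does not verify which copy of libstdc++ the Python process actually loads.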
