AffilGood provides annotated datasets and tools to improve the accuracy of attributing scientific works to research organizations, especially in multilingual and complex contexts.

nagelea/affilgood

AffilGood Library 🔍

AffilGood provides tools and annotated datasets to improve the accuracy of attributing scientific works to research organizations, especially in multilingual and complex contexts. The framework addresses key challenges in institution name disambiguation through a modular pipeline approach.

AffilGood Pipeline

Publication

This is the official repository for the paper "AffilGood: Building reliable institution name disambiguation tools to improve scientific literature analysis", published at the Scholarly Document Processing (SDP) 2024 Workshop, held at ACL 2024. Slides used in the presentation are available here.

🌟 Key Features

  • Modular Pipeline Architecture: Separate components for span identification, named entity recognition, entity linking, and metadata normalization
  • Multilingual Support: Models trained on data in multiple languages
  • Advanced Entity Linking: Multiple linking strategies that combine retrievers with reranking mechanisms
  • Multiple Data Sources: Support for ROR, WikiData, and custom data sources
  • Location Normalization: Integration with OpenStreetMap for standardizing geographic data
  • Language Processing: Automatic language detection and translation capabilities
  • Performance Optimization: Caching mechanisms and batch processing for efficient handling of large datasets

📚 Documentation

For more detailed information about using and extending AffilGood, check out our documentation:

🛠️ Installation

We recommend installing AffilGood in editable mode to allow development and live code changes:

git clone https://github.com/sirisacademic/affilgood.git
cd affilgood
pip install -e .

⚠️ Note: Installing without -e (editable mode) may result in import errors due to how nested modules are organized.
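After installing, it can help to confirm that the package actually resolves in the active environment before running the pipeline. The following is a minimal sketch using only the Python standard library. The helper name check_importable is hypothetical and not part of AffilGood; hnswlib is included only because the Troubleshooting section mentions it.

```python
from importlib.util import find_spec


def check_importable(module_names):
    """Return {module_name: bool} telling whether each module can be located.

    Top-level modules are located without being imported, so this is cheap
    and does not trigger any model loading.
    """
    status = {}
    for name in module_names:
        try:
            status[name] = find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises if a parent package is missing or the
            # module is in an inconsistent state; treat both as "not found".
            status[name] = False
    return status


if __name__ == "__main__":
    for name, found in check_importable(["affilgood", "hnswlib"]).items():
        print(f"{name}: {'found' if found else 'NOT found'}")
```

A "found" result only means the module can be located; import errors caused by missing native libraries (see Troubleshooting) still surface at import time.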

🚀 Quick Start

from affilgood import AffilGood

# Initialize with default settings
affil_good = AffilGood()

# Or customize components
affil_good = AffilGood(
    span_separator='',  # Use model-based span identification
    span_model_path='SIRIS-Lab/affilgood-span-multilingual',  # Custom span model
    ner_model_path='SIRIS-Lab/affilgood-NER-multilingual',  # Custom NER model
    entity_linkers=['Whoosh', 'DenseLinker'],  # Use multiple linkers
    return_scores=True,  # Return confidence scores with predictions
    metadata_normalization=True,  # Enable location normalization
    verbose=False,  # Set to True for detailed logging
    device=None  # Auto-detect device (CPU or CUDA)
)

# Process affiliation strings
affiliations = [
    "Granges Terragrisa SL, Paratge de La Gleva, Camí de Burrissola s/n, E-08508 Les Masies de Voltregà (Barcelona), Catalonia, Spain",
    "Treuman Katz Center for Pediatric Bioethics, Seattle Children's Research Institute, Seattle, WA, USA"
]

# Full pipeline processing (span identification, NER, normalization, entity linking)
results = affil_good.process(affiliations)

# Or use individual components
spans = affil_good.get_span(affiliations)
entities = affil_good.get_ner(spans)
normalized = affil_good.get_normalization(entities)
linked = affil_good.get_entity_linking(normalized)

print(linked)
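For large inputs, one option is to drop duplicate affiliation strings and feed the pipeline in fixed-size chunks. The sketch below is an illustration, not part of the AffilGood API: chunked and process_in_chunks are hypothetical helpers, the chunk size of 256 is arbitrary, and the code assumes only that the callable passed in (for example affil_good.process) accepts a list of strings and returns an iterable of results. The library may already batch internally, in which case this wrapper adds little.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_chunks(
    process: Callable[[List[str]], Iterable],
    affiliations: Iterable[str],
    chunk_size: int = 256,
) -> Tuple[List[str], list]:
    """Deduplicate the inputs, then call `process` once per chunk.

    Returns the deduplicated inputs (first-seen order) and the
    concatenated outputs of every `process` call.
    """
    unique = list(dict.fromkeys(affiliations))  # dedupe, keep order
    results: list = []
    for chunk in chunked(unique, chunk_size):
        results.extend(process(list(chunk)))
    return unique, results


# Hypothetical usage with the object created above:
# unique, results = process_in_chunks(affil_good.process, affiliations)
```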

📦 Project Structure

The repository is structured as follows:

affilgood/
├── __init__.py                    # Package initialization
├── affilgood.py                   # Main AffilGood class implementation
├── span_identification/           # Span identification module
│   ├── span_identifier.py         # Model-based span identification
│   ├── simple_span_identifier.py  # Character-based span splitter
│   └── noop_span_identifier.py    # Pass-through identifier for pre-segmented data
├── ner/                           # Named Entity Recognition module
│   └── ner.py                     # NER implementation
├── entity_linking/                # Entity linking module
│   ├── entity_linker.py           # Main entity linking orchestrator
│   ├── base_linker.py             # Base class for entity linkers
│   ├── whoosh_linker.py           # Whoosh-based entity linker
│   ├── s2aff_linker.py            # S2AFF-based entity linker
│   ├── dense_linker.py            # Dense retrieval-based entity linker
│   ├── base_reranker.py           # Base class for rerankers
│   ├── direct_pair_reranker.py    # Direct pair matching reranker
│   ├── llm_reranker.py            # LLM-based reranker for candidate selection
│   ├── constants.py               # Constants for entity linking
│   ├── wikidata_dump_generator.py # WikiData integration
│   ├── llm_translator.py          # Translation capabilities
│   └── __init__.py                # Data source registry and handlers
├── metadata_normalization/        # Metadata normalization module
│   └── normalizer.py              # Location and country normalization
└── utils/                         # Utility functions
    ├── data_manager.py            # Data loading and caching
    ├── text_utils.py              # Text processing utilities
    └── translation_mappings.py    # Institution name translation mappings

🤗 Pre-trained Models

AffilGood uses several pre-trained models available on Hugging Face:

📊 Performance

Note: These results may be outdated, as the pipeline is under active development and new features are still being added.

AffilGood achieves state-of-the-art performance on institution name disambiguation tasks compared to existing systems:

Model                           MA       FA       NRMO     S2AFF*   CORDIS   ETERe    ETERm
ElasticSearch                   .545     .407     .470     .515     .751     .855     .847
OpenAlex                        .394     .118     .769     .871🔥   .648     .859     .852
S2AFF                           .546     .367     .617     .785     .649     .668     .720
AffRo                           .452     .408     .558     .726     .641     .709     .617
AffilGoodNERm + S2AFFLinker     .596     .685     .762     .841     .827     .887     .863
AffilGoodNER + S2AFFLinker      .579     .685     .758     .850     .839     .895     .855
AffilGoodNERm + Elastic         .690     .587     .747     .640     .849     .887     .894
AffilGoodNER + Elastic          .649     .610     .755     .648     .855     .893     .881
AffilGoodNERm + Elastic+qLLM    .710🔥   .721     .774🔥   .790     .881     .936🔥   .916🔥
AffilGoodNER + Elastic+qLLM     .653     .747🔥   .767     .799     .891🔥   .936🔥   .909

🔥 marks the best score in each column.
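To make the comparison easier to check, the table can be transcribed into a small data structure and the top system per dataset computed directly. This is only a convenience sketch over the numbers printed above; the names COLUMNS, SCORES, and best_per_column are illustrative and do not come from the AffilGood code base.

```python
COLUMNS = ["MA", "FA", "NRMO", "S2AFF*", "CORDIS", "ETERe", "ETERm"]

# Scores transcribed from the table, one list per system, in COLUMNS order.
SCORES = {
    "ElasticSearch":                [.545, .407, .470, .515, .751, .855, .847],
    "OpenAlex":                     [.394, .118, .769, .871, .648, .859, .852],
    "S2AFF":                        [.546, .367, .617, .785, .649, .668, .720],
    "AffRo":                        [.452, .408, .558, .726, .641, .709, .617],
    "AffilGoodNERm + S2AFFLinker":  [.596, .685, .762, .841, .827, .887, .863],
    "AffilGoodNER + S2AFFLinker":   [.579, .685, .758, .850, .839, .895, .855],
    "AffilGoodNERm + Elastic":      [.690, .587, .747, .640, .849, .887, .894],
    "AffilGoodNER + Elastic":       [.649, .610, .755, .648, .855, .893, .881],
    "AffilGoodNERm + Elastic+qLLM": [.710, .721, .774, .790, .881, .936, .916],
    "AffilGoodNER + Elastic+qLLM":  [.653, .747, .767, .799, .891, .936, .909],
}


def best_per_column(scores, columns):
    """Map each column to (top score, list of systems reaching it)."""
    best = {}
    for index, column in enumerate(columns):
        top = max(row[index] for row in scores.values())
        winners = [name for name, row in scores.items() if row[index] == top]
        best[column] = (top, winners)
    return best


if __name__ == "__main__":
    for column, (top, winners) in best_per_column(SCORES, COLUMNS).items():
        print(f"{column:7s} {top:.3f}  {', '.join(winners)}")
```

On these numbers, an AffilGood configuration has the top score on every dataset except S2AFF*, where OpenAlex leads.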

📝 Citation

If you use AffilGood in your research, please cite our paper:

@inproceedings{duran-silva-etal-2024-affilgood,
    title = "{A}ffil{G}ood: Building reliable institution name disambiguation tools to improve scientific literature analysis",
    author = "Duran-Silva, Nicolau  and
      Accuosto, Pablo  and
      Przyby{\l}a, Piotr  and
      Saggion, Horacio",
    editor = "Ghosal, Tirthankar  and
      Singh, Amanpreet  and
      Waard, Anita  and
      Mayr, Philipp  and
      Naik, Aakanksha  and
      Weller, Orion  and
      Lee, Yoonjoo  and
      Shen, Shannon  and
      Qin, Yanxia",
    booktitle = "Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.sdp-1.13",
    pages = "135--144",
}

🙋‍♀️ Contributing

We welcome contributions to the AffilGood project! Instead of a single main branch, we use two branches:

  • develop: Development and default branch for new features and bug fixes.
  • main: Production branch used to deploy the server components to the production environment.

Please follow our Contribution Guidelines to participate in this project.

📫 Contact

For further information, please contact [email protected].

⚖️ License

This work is distributed under the Apache License, Version 2.0.

🧪 Troubleshooting

❗ Issue: ImportError when using hnswlib

If you see an error like:

ImportError: ...libstdc++.so.6: version `GLIBCXX_3.4.32' not found

This means your system is using an outdated version of the C++ standard library (libstdc++.so.6), or a conflicting version from Anaconda.

✅ Solution 1: Use the system version of libstdc++

Ensure you're using the correct library version. You can override Anaconda's version by setting this before running Python:

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6

Or set the library search path for a single Python session only:

LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu python

✅ Solution 2: Update libstdc++

You can update the system library via:

sudo apt update
sudo apt install libstdc++6

Check that the required version is available:

strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX_3.4.32
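The same check can be done from Python, which is convenient when strings is not installed or when comparing the system library against the copy bundled with a Conda environment. This is a minimal sketch: has_glibcxx_version is a hypothetical helper, and it relies on the assumption that symbol version names are stored as NUL-terminated strings inside the shared library, which is how ELF string tables work.

```python
from pathlib import Path


def has_glibcxx_version(library_path, version="GLIBCXX_3.4.32"):
    """Return True if the shared library file contains the version string.

    The NUL terminator is included in the search so that, for example,
    GLIBCXX_3.4.3 does not match GLIBCXX_3.4.32.
    """
    data = Path(library_path).read_bytes()
    return version.encode("ascii") + b"\x00" in data


# Example usage (path taken from the commands above; adjust for your system):
# print(has_glibcxx_version("/usr/lib/x86_64-linux-gnu/libstdc++.so.6"))
```

Running it against both the system library and the one inside the active environment shows which copy lacks the required version.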

✅ Solution 3: Rebuild hnswlib using your current compiler

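pip uninstall hnswlib
pip install --no-cache-dir --no-binary hnswlib hnswlib

This builds hnswlib from source against your system's C++ standard library. Restricting --no-binary to hnswlib avoids compiling its dependencies from source as well, and --no-cache-dir keeps pip from reusing a previously cached wheel.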
