UniProt 2025

The document discusses updates to the UniProt Knowledgebase (UniProtKB) aimed at providing high-quality, non-redundant protein sequences and functional information. It highlights ongoing efforts in manual curation, community involvement, and the use of machine learning techniques to enhance data accuracy and retrieval. Additionally, it notes the database's recognition as a Global Core Biodata Resource and outlines plans to limit unreviewed sequences while improving user access to protein data.

Nucleic Acids Research, 2025, 53, D609–D617

https://doi.org/10.1093/nar/gkae1010
Advance access publication date: 18 November 2024
Database issue

UniProt: the Universal Protein Knowledgebase in 2025


The UniProt Consortium 1,2,3,4,*
1 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
2 Protein Information Resource, Georgetown University Medical Center, 2115 Wisconsin Ave NW, G1 level, Suite 040A, Washington, DC 20007, USA
3 Protein Information Resource, University of Delaware, Ammon-Pinizzotto Biopharmaceutical Innovation Building, Suite 147B, 590 Avenue 1743, Newark, DE 19713, USA
4 SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
* To whom correspondence should be addressed. Sandra Orchard. Tel: +44 1223 494675; Email: [email protected]

Abstract
The aim of the UniProt Knowledgebase (UniProtKB; https://www.uniprot.org/) is to provide users with a comprehensive, high-quality and freely
accessible set of protein sequences annotated with functional information. In this publication, we describe ongoing changes to our production
pipeline to limit the sequences available in UniProtKB to high-quality, non-redundant reference proteomes. We continue to manually curate the
scientific literature to add the latest functional data and use machine learning techniques. We also encourage community curation to ensure
key publications are not missed. We provide an update on the automatic annotation methods used by UniProtKB to predict information for
unreviewed entries describing unstudied proteins. Finally, updates to the UniProt website are described, including a new tab linking protein
to genomic information. In recognition of its value to the scientific community, the UniProt database has been awarded Global Core Biodata
Resource status.

Graphical abstract

Received: September 12, 2024. Revised: October 14, 2024. Editorial Decision: October 15, 2024. Accepted: October 16, 2024
© The Author(s) 2024. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

The UniProt suite of databases (https://www.uniprot.org/) serves as a leading global data resource for protein sequence and functional information (1). The long-standing critical role of the data resource in underpinning molecular biology was recognized when UniProt became one of the first databases awarded Global Core Biodata Resource status (https://globalbiodata.org) in December 2022. The central UniProt Knowledgebase (UniProtKB) is a curated resource. It comprises both a reviewed set of protein entries (UniProtKB/Swiss-Prot), where each record contains a summary of the experimentally verified, or computationally predicted, functional information, added and evaluated by an expert biocurator, and the unreviewed set (UniProtKB/TrEMBL). In the latter, entries are computationally annotated by automated systems. The UniRef databases cluster sequence sets at various levels of sequence identity and the UniProt Archive (UniParc) delivers a complete set of known unique sequences, including historical obsolete sequences. If a UniParc entry sequence is not included in UniProtKB, the reason for the exclusion of that sequence is provided (e.g. pseudo-gene). In addition, each UniParc record contains a reference to its source database(s) with accession and version numbers, tagged with its status in that database, indicating if the sequence still exists or has been deleted in the source database.

Historically, UniProtKB contained all protein sequences deposited by the research community in any International Nucleotide Sequence Database Collaboration (INSDC) member resource (2) in addition to proteomes derived from selected genomes annotated by either Ensembl (3) or RefSeq (4). In the future, as the rate of whole genome sequencing increases and global projects aim to sequence across the whole biodiversity of species and, in some cases, multiple strains, sexes or variants of a single species, this will become increasingly difficult to maintain and for our users to navigate and comprehend. We describe here our response to managing the increased number of genomes now available in the public domain, and their effect on our production pipelines. In addition, we outline our work to improve the efficiency of information retrieval and summarization from the scientific literature and also from collaborating repositories, and the enhanced representation of these data in UniProtKB records.

Progress and new developments
Managing the sequence space
In May 2024, the INSDC contained 2.4 billion nucleotide sequences, many with associated translations to protein sequences made by the submitter, which form the basis of gene calls by groups such as Ensembl or RefSeq. Most of these sequences are highly redundant, i.e. identical or near identical in sequence content, derived from a single organism which has been sequenced many times or from the sequencing of very closely related species. UniProtKB would not supply an optimal service to our users if we included all of the sequences available from INSDC or downstream annotation sources. Redundant sequences do not add significantly to information content, and their presence both slows down and compromises the output of services such as BLAST (5); they are also costly to process and to perform repetitive computation upon. UniProt release 2024_04 contains approximately 246 million sequence records in UniProtKB. This represents a relatively conservative increase in sequence number over the last 2 years (6), reflecting measures already put in place to limit the volume of data entering the Knowledgebase. As data are automatically imported from the INSDC, we filter to exclude redundant copies of proteomes from the same organism (identified by taxonomic identifier) and increasingly remove clinical isolates where, again, multiple copies of the same genome are sequenced, such as happened during the recent pandemic. Imports from Ensembl, RefSeq and other genome annotation platforms such as WormBase ParaSite (7) are still largely manually screened as import candidates. We continue to assess the quality of the remaining proteomes, using BUSCO (8) and the complete proteome detector (9). We then implement additional quality checks based on criteria developed by RefSeq, supplemented by further internally created checks. Only the highest quality, nonredundant proteomes remain to be submitted to our process for automatically selecting reference proteomes, i.e. a representative of each cluster of proteomes grouped by overall sequence similarity. We additionally manually select and maintain a number of key reference proteomes, thus ensuring that those derived from model organisms or selected by the community as important for a particular area of study are tagged and remain the reference for their cluster group. To ensure stability in proteome selection, the pipeline is weighted so that, once selected, a proteome remains representative for its group unless a related proteome which is significantly improved in quality (in terms of genome standard, coverage and annotation) becomes part of the cluster. Such a proteome is then evaluated as a replacement for the existing representative proteome.

It is our intention in 2025 to limit the unreviewed sequences in UniProtKB/TrEMBL to only those derived from reference proteomes, unless there is significant additional functional information associated with an entry, for example a 3D structure, or there is a request from the community that a set of entries be included. This will initially result in a drop in the number of unreviewed entries in UniProtKB, which will then increase as biodiverse sequencing projects lead to the creation of novel proteome clusters. Redundant proteomes will continue to be available for search and download in UniProt through UniParc and we will work to improve user access to this resource. It should be noted that the selection of viral reference proteomes is a separate process. The small size of their proteome means that our redundancy pipeline is ineffective, so viral reference proteomes are selected manually, in collaboration with the International Committee on Taxonomy of Viruses (10).

Expert curation
The extraction and evaluation of information relating to protein sequence, structure and function from the scientific literature and the addition of these data to the relevant UniProtKB/Swiss-Prot reviewed records is of central importance to the operation of the UniProt data resource. Expert curators target key publications and create or update records pertaining to human and model organism biology, including enzymology, host–pathogen interactions and many other areas critical to the studies of biodiversity and human health and disease. Over the last few years, a particular effort has been made to retrofit enzyme and transporter annotation using the Rhea knowledgebase of biochemical reactions, which uses the ChEBI ontology to represent reactants (11,12). At the time of writing, UniProtKB includes annotations to 12,501 Rhea reactions, which are linked to 28,259,857 UniProtKB protein sequence records, including 231,709 reviewed protein sequence records in UniProtKB/Swiss-Prot. We also actively track publications describing new enzymatic reactions or orphan reactions. One example of this is the identification of the flavin reductase BLVRB (https://www.uniprot.org/uniprotkb/P30043/entry) as a moonlighting enzyme. Previously known to be involved in fetal heme catabolism in the liver, a recent publication showed that BLVRB also catalyzes the S-nitrosylation of cysteine residues of specific target proteins, including INSR and IRS1, thus inhibiting insulin signaling (13). Whilst the protein S-nitrosylation modification had been known for many years to act as a key mediator of signal transduction, and is described in more than 60 human entries in UniProt (https://www.uniprot.org/uniprotkb?query=(keyword:KW-0702)), the enzymes that mediate this modification were previously largely unknown. We curated the publication in UniProtKB/Swiss-Prot and provided functional information in the form of human-readable text summaries and structured vocabularies, such as the Gene Ontology (GO) (14), Rhea or ChEBI (15). We created new reactions to describe the multistep transnitrosylation reactions in the BLVRB entry and described protein S-nitrosylation sites and their effect on target entries [HMOX2 (P30519), INSR (P06213) and IRS1 (P35568)]. The use of the ChEBI ontology to describe chemical structures for both enzyme reactions and post-translational modifications (PTMs) provides a direct link between enzymes and their targets, thereby increasing the interoperability of UniProt. UniProt curators are also highly involved in the curation of biological systems using Gene Ontology–Causal Activity Model(s) [GO-CAM(s)] (15,16) in order to describe the flow between molecular functions. We curated a GO-CAM to describe the BLVRB-catalyzed reactions (https://www.alliancegenome.org/gene/HGNC:1063#function---go-annotations) and are currently working on the import and display of GO-CAM models on the UniProt website.

In a separate project, we have been proactively collaborating with research teams working in the area of antimicrobial drug resistance (AMR), focusing on mechanisms of drug resistance developed by the World Health Organization Class I target 'ESKAPE' organisms (Escherichia coli, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa and Enterobacteriaceae). We are identifying and updating key proteins playing direct roles in AMR: beta-lactamases (Figure 1), carbapenemases, efflux pumps and ABC transporters are currently being targeted for annotation, with efforts expanding in the near future to include members of two-component signalling systems, quorum sensing proteins and those involved in biofilm formation.

Figure 1. The annotation of an extended-spectrum beta-lactamase (P28585) which confers resistance to penicillins.

As the volume and diversity of the scientific literature increases, we are increasingly turning to machine learning (ML) techniques to assist the curation process with selection of key papers and data extraction. Existing ML frameworks such as LitSuggest (17) are used by UniProt curators to identify papers on subjects such as amino acid variants linked to human disease, enzymatic reactions and protein complexes. UniProt members have collaborated with the PubMed team at NCBI on the development of a new training and benchmarking dataset, EnzChemRED, to support the development of Natural Language Processing models to assist enzyme curation (18,19). EnzChemRED can boost the ability of BERT (Bidirectional Encoder Representations from Transformers)-based and GPT (Generative pre-trained transformer) models to extract novel enzyme–substrate relationships from the literature.

The potential use of large-language models (LLMs) to generate text directly from the scientific literature is currently being actively evaluated by UniProt curators. We are aiming to achieve an LLM-generated 'first-pass' summarization of relevant publications to add value to unreviewed entries and supplement existing information in UniProtKB/Swiss-Prot records. A number of pilot projects are currently underway evaluating the precision and recall of existing models and also the potential of LLMs to act as a curator-assist tool, increasing the throughput of publications which can be evaluated in the expert curation process.

Community curation
As previously described (6), we continue to reach out to the broader scientific community, asking researchers to alert us to entries requiring an update and novel publications carrying key pieces of information. Submitters are directed to a simple form which enables them to enter information from a publication and select appropriate annotation categories to which information can be added. A 'Batch Submission' option enables upload of a collection of publications and annotations for one or more entries. As of release 2024_04, there have been 3967 submissions enabling update of 3504 proteins with 1744 publications (https://community.uniprot.org/bbsub/STATS.html). Each of these submissions has been reviewed by a UniProt curator and submitted information is now visible on the UniProtKB webpages for each updated entry (Figure 2). Contributors are identified by their ORCID and a link to their ORCID landing page validates their expertise. We are now looking to actively solicit community submissions by identifying relevant papers through text-mining and reaching out to the authors to help evaluate the text-mining results and suggest further annotations which can be extracted from the publications. We are also exploring the use of LLMs to generate draft annotations that can be sent to paper authors for review.

Figure 2. Community submitted information is now added to relevant UniProtKB entries.

Automatic annotation and ML
Adding functional information to the body of unreviewed entries in UniProtKB/TrEMBL remains an important task. These records are enriched with functional information by annotation transfer systems, which use the protein classification tool InterPro to group sequences at superfamily, family and subfamily levels, and to predict the occurrence of functional domains and important sites. The rule-based computational annotation UniRule system (20) uses these groupings to transfer experimentally verified annotations onto unstudied proteins, adding properties such as protein name, functional annotation, catalytic activity, pathway, GO terms and subcellular location. We continue to expand this rule set, most recently by evaluating and adding rules based on NCBIfam signatures (21).

The Association-Rule-Based Annotator system for automatic classification and annotation of UniProtKB proteins (9) has been further developed, most recently through the incorporation of novel PANTHER signatures (22). This immediately generated 9141 new rules and 119,579,654 new predictions for over 20 million new protein sequences. We are additionally collaborating with Google Research, who have developed ProtNLM, an LLM that accurately predicts descriptions of protein function directly from a protein's amino acid sequence (23). Initially, a single model was pre-trained, using the T5 framework, across the UniRef50 2018_03 dataset of ∼30 million diverse unlabelled protein sequences, then fine-tuned using labelled protein domain sequences from Pfam (24). This
has since been extended to encompass the output of multiple models. To date, ProtNLM has been used to annotate more than 28 million proteins (2024_04) previously labelled as 'uncharacterized' with a functionally relevant name and we are actively exploring options with Google to extend this annotation to additional fields.

We continue to make our data more accessible to ML algorithms, to act as both training and test sets. We now make raw embeddings (per-protein and per-residue) available for UniProtKB/Swiss-Prot and some reference proteomes of model organisms. These have been generated using the bio_embeddings tool (https://github.com/sacdallago/bio_embeddings), prottrans_t5_xl_u50 model, and can be retrieved from our Downloads page. Per-protein embeddings for UniProtKB can also be retrieved from the UniProtKB search results download function. We have also reached out to researchers to help us with our efforts to develop novel algorithms for use in the UniProtKB production pipeline and have led or participated in a number of community challenges. The UniProt metal-binding challenge invited the ML community to create computational methods to predict metal-binding sites across the whole of UniProtKB. This problem was chosen as currently ∼17% of curated proteins have annotated metal-binding site residues whereas only 3% of unreviewed entries have these predicted, giving a large search space against which successful predictions could potentially be made. We provided both a training set and a test set of 1 million proteins. Unfortunately, although several groups participated in this challenge, all predictions contained a high number of false positives. More successful was our participation in a CAFA (Critical Assessment of Functional Annotation) challenge on the Kaggle platform, in which participants were asked to predict the function of a set of proteins, as indicated by the use of GO terms (www.kaggle.com/competitions/cafa-5-protein-function-prediction/). UniProt curators selected and contributed to the annotation of a training set of annotated proteins and also provided a curated test set to assist in the evaluation stage. The number of entrants was significantly higher, and a good success rate was achieved, with the possibility of embedding some of the successful code into future UniProt pipeline development.

Data integration
UniProtKB works with many other groups to access and integrate large-scale datasets, mapping these data onto the appropriate protein sequence records and displaying the mappings via the ProtVista Feature Viewer (25). All data shown are downloadable via File Transfer Protocol and Application Programming Interfaces (APIs). We have worked to improve our support for the mass spectrometry (MS) community (26). The existing UniProt MS peptide identification pipeline, by which MS-based proteomic data deposited into ProteomeXchange databases (27,28) are reprocessed, scored and used to verify the existence of each protein, is being upgraded to meet the stringency of the HUPO Human Proteome Project (HPP) 3.0 guidelines (29). UniProtKB now actively supports the HUPO HPP, which aims to provide MS evidence for the existence of each protein in the human proteome (Omenn et al. 2024, in preparation). All proteins identified by the qualifying number of unique peptides of a set minimum length are annotated with the keyword 'proteomics identification', and the protein existence level is set to 'experimental evidence at protein level'.

We have additionally collaborated with PRIDE (29), PeptideAtlas (27,28) and the University of Liverpool to enable the integration of PTMs from these resources into UniProtKB records (30). Filtering ensures that only high-quality datasets are imported and reanalysed; these datasets are stored in PRIDE and assigned a unique ID (PXD), thereby facilitating user access, traceability and reusability. Modified sites are assigned a confidence score based on their false localization rate across multiple datasets to reflect the strength of evidence available. Data are integrated into UniProtKB and visualized in the ProtVista Feature Viewer
both in the PTM/processing section of the protein entry page and in the Feature Viewer in site-centric format. Modified peptide data are visualized in the proteomics section of the Feature Viewer, and both site-centric and peptide-centric data are accessible via the proteins API. At the time of writing, rice (Oryza sativa subsp. japonica) and Plasmodium falciparum proteomic PTM data are available, with Saccharomyces cerevisiae, mouse (Mus musculus) and human (Homo sapiens) data currently being processed.

Website
We continue to develop the UniProt website with an emphasis on providing the user with an effective search, easy navigation and excellent responsiveness whilst further improving the tools dashboard and the programmatic access interface (API). The website has been built using a modular approach, separating the front end (web user interface) from the back end (API). We have added additional views to the website, first moving the variant view from its previous position embedded in the disease and variant section of the website to a separate tab to increase the overall speed of website loading. Second, we have created a genomics tab (Figure 3) which enables the user to link protein data to the corresponding genomic coordinates. Genomic coordinates of UniProtKB proteins are imported from the corresponding nucleotide sequence data in the INSDC. For those imported from Ensembl, Ensembl Genomes, RefSeq and WormBase ParaSite, the genomic coordinates are also accessed from those resources. For each UniProtKB entry, representing the product(s) of one gene, details of the genomic assembly, chromosomal location of the gene and strand are given. Each protein isoform is listed with its corresponding genomic location, the total number and the genomic coordinates of the exons creating each isoform and the corresponding ranges of encoded amino acids.

Figure 3. The Genomics Tab view for the human Tyrosine-protein kinase Lck protein (P06239).

Visualizing proteins in the context of protein complexes
Many proteins are functionally operative only as a member of a protein complex, a set of polypeptide chains that interact with each other and/or with a nucleic acid. Mutations in genes encoding different proteins within the same complex cause similar diseases (locus heterogeneity). In order to present the user with a representation of a protein in its functional environment, we have collaborated with the Complex Portal (31) to provide structured, machine-readable descriptions of complexes using UniProtKB identifiers and a range of shared reference ontologies, such as the GO and ChEBI. To date, a total of almost 5000 manually curated complexes have been released, with UniProtKB curators contributing significantly to that number. Complex Portal entries are cross-referenced in UniProtKB and we have added the ComplexViewer (32) to the UniProt website. This provides a dynamic, interactive, responsive 2D visualization of each complex, enabling users to see its topology and stoichiometry (when known) and zoom in on binding regions down to the residue level (Figure 4). Interactive links enable users to easily move between the UniProtKB entries describing complex components or to learn more about the full assembly in the Complex Portal.

Figure 4. Visualization of the RNase H2 complex as seen in the UniProt entry for the Ribonuclease H2 subunit A protein (O75792).

Conclusion
The UniProt database will undergo some fundamental changes over the next 1–2 years, as we adapt our data processing pipelines to address the increase in the amount of whole genome sequencing data which is now being produced on an almost industrial scale. At the same time, we wish to ensure our users enjoy the same quality of service we have offered them over the previous decades. We aim to present a reference proteome for each taxonomic grouping to the research community. Whilst we will continue to evaluate the scientific literature and summarize key information on proteins and their functions, new ML tools and language models mean we are approaching how we do so in a very different way. However, we will ensure that we maintain the high-quality, expert curation for which UniProtKB is known. We additionally encourage community contributions and now make those visible directly on the webpage of the relevant entries. We will continue to develop our methods for the computational annotation of unreviewed proteins and will increasingly work with the ML community through collaborations and challenges to enable this.

UniProtKB acts as a central hub of information integrating data from many external resources. We supply a number of APIs enabling users to access UniProt data or incorporate this into their own resources (www.uniprot.org/help/programmatic_access). We proactively collaborate with other data producers and data (re)users and are always looking for opportunities to extend such interactions. We greatly value the feedback and annotation updates from our user community. Please send your comments and suggestions via the contact link on the UniProt website (https://www.uniprot.org/contact).

Data availability
UniProt releases are published every 8 weeks. We provide customizable views and downloads in a range of formats via the website, and file sets at the FTP site (www.uniprot.org/downloads), and supply users with a number of different options for computational access to the data (www.uniprot.org/help/programmatic_access). These include the website RESTful Application Programming Interface (API); stable URLs that can be bookmarked, linked and reused; the SPARQL API, which allows users to perform complex queries across all UniProt data and also other resources that provide a SPARQL endpoint; and a Java API.

Funding
National Human Genome Research Institute (NHGRI) [HG002273]; Office of the Director, NIH OD [U24HG007822]; National Institute of Allergy and Infectious Diseases (NIAID); National Institute on Aging (NIA); National Institute of General Medical Sciences (NIGMS) [R35GM141873]; National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK); National Eye Institute (NEI); National Cancer Institute (NCI); National Heart, Lung, and Blood Institute (NHLBI); Biotechnology and Biological Sciences Research Council (BBSRC) [BB/T015608/1]; National Science Foundation's Directorate for Biological Sciences [BB/X002179/1]; Open Targets; Horizon 2020 - Research and Innovation Framework Programme [825575]; State Secretariat for Education, Research and Innovation (SERI); European Molecular Biology Laboratory (EMBL) Australia. Funding for open access charge: National Institutes of Health.

Conflict of interest statement
None declared.
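
The RESTful API listed under Data availability can be illustrated with a short Python sketch that assembles a search URL without sending any request. It rests on assumptions this article does not state: the endpoint https://rest.uniprot.org/uniprotkb/search and its query, fields, format and size parameters (the documentation at www.uniprot.org/help/programmatic_access is authoritative), and the query fields organism_id and reviewed. The helper name build_search_url is hypothetical. The example query reuses the S-nitrosylation keyword KW-0702 cited under Expert curation.

```python
from urllib.parse import urlencode

# Assumed base URL of the UniProtKB search endpoint; the article itself
# only points to www.uniprot.org/help/programmatic_access.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, fields=("accession", "protein_name"), fmt="tsv", size=25):
    """Return a UniProtKB search URL (hypothetical helper, no network access).

    query  -- search expression, e.g. "(keyword:KW-0702)"
    fields -- columns to return (assumed field names)
    fmt    -- response format such as "tsv" or "json"
    size   -- number of results per page
    """
    params = {
        "query": query,
        "fields": ",".join(fields),
        "format": fmt,
        "size": size,
    }
    # urlencode percent-encodes parentheses and colons and joins the pairs with "&".
    return f"{BASE_URL}?{urlencode(params)}"


# Reviewed human entries carrying the S-nitrosylation keyword (KW-0702).
url = build_search_url(
    "(keyword:KW-0702) AND (organism_id:9606) AND (reviewed:true)"
)
print(url)
```

The resulting URL can be passed to any HTTP client. Keeping URL construction separate from fetching makes the query easy to inspect and bookmark, in line with the stable URLs mentioned above.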

References

1. Lussi,Y.C., Magrane,M., Martin,M.J., Orchard,S. and UniProt Consortium (2023) Searching and navigating UniProt databases. Curr. Protoc., 3, e700.
2. Arita,M., Karsch-Mizrachi,I. and Cochrane,G. (2021) The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res., 49, D121–D124.
3. Harrison,P.W., Amode,M.R., Austine-Orimoloye,O., Azov,A.G., Barba,M., Barnes,I., Becker,A., Bennett,R., Berry,A., Bhai,J., et al. (2024) Ensembl 2024. Nucleic Acids Res., 52, D891–D899.
4. Sayers,E.W., Beck,J., Bolton,E.E., Brister,J.R., Chan,J., Comeau,D.C., Connor,R., DiCuccio,M., Farrell,C.M., Feldgarden,M., et al. (2024) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 52, D33–D43.
5. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
6. UniProt Consortium (2023) UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res., 51, D523–D531.
7. Howe,K.L., Bolt,B.J., Shafie,M., Kersey,P. and Berriman,M. (2017) WormBase ParaSite—a comprehensive resource for helminth genomics. Mol. Biochem. Parasitol., 215, 2–10.
8. Manni,M., Berkeley,M.R., Seppey,M., Simão,F.A. and Zdobnov,E.M. (2021) BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol., 38, 4647–4654.
9. UniProt Consortium (2021) UniProt: the Universal Protein Knowledgebase in 2021. Nucleic Acids Res., 49, D480–D489.
EnzChemRED, a rich enzyme chemistry relation extraction dataset [Data set]. Zenodo, https://zenodo.org/records/11067998.
20. MacDougall,A., Volynkin,V., Saidi,R., Poggioli,D., Zellner,H., Hatton-Ellis,E., Joshi,V., O’Donovan,C., Orchard,S., Auchincloss,A.H., et al. (2020) UniRule: a unified rule resource for automatic annotation in the UniProt Knowledgebase. Bioinformatics, 36, 4643–4648.
21. Haft,D.H., DiCuccio,M., Badretdin,A., Brover,V., Chetvernin,V., O’Neill,K., Li,W., Chitsaz,F., Derbyshire,M.K., Gonzales,N.R., et al. (2018) RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res., 46, D851–D860.
22. Thomas,P.D., Ebert,D., Muruganujan,A., Mushayahama,T., Albou,L.-P. and Mi,H. (2021) PANTHER: making genome-scale phylogenetics accessible to all. Protein Sci., 31, 8–22.
23. Paysan-Lafosse,T., Blum,M., Chuguransky,S. and Grego,T. (2023) InterPro in 2022. Nucleic Acids Res., 51, D418–D427.
24. Brevdo,E. (2023) ProtNLM: Model-based Natural Language Protein Annotation. https://ebrevdo.github.io.
25. Salazar,G.A., Luciani,A., Watkins,X., Kandasaamy,S., Rice,D.L., Blum,M., Bateman,A. and Martin,M. (2023) Nightingale: web components for protein feature visualization. Bioinform. Adv., 3, vbad064.
26. Bowler-Barnett,E.H., Fan,J., Luo,J., Magrane,M., Martin,M.J., Orchard,S. and UniProt Consortium (2023) UniProt and mass spectrometry-based proteomics—a 2-way working relationship. Mol. Cell. Proteomics, 22, 100591.
27. Deutsch,E.W., Lane,L., Overall,C.M., Bandeira,N., Baker,M.S., Pineau,C., Moritz,R.L., Corrales,F., Orchard,S., Van Eyk,J.E., et al. (2019) Human Proteome Project Mass Spectrometry Data Interpretation Guidelines 3.0. J. Proteome Res., 18, 4108–4116.
28. Deutsch,E.W., Bandeira,N., Perez-Riverol,Y., Sharma,V., Carver,J.J.,
10. Lefkowitz,E.J., Dempsey,D.M., Hendrickson,R.C., Orton,R.J.,
Mendoza,L., Kundu,D.J., Wang,S., Bandla,C., Kamatchinathan,S.,
Siddell,S.G. and Smith,D.B. (2018) Virus taxonomy: the database
et al. (2023) The ProteomeXchange consortium at 10 years: 2023
of the International Committee on Taxonomy of Viruses (ICTV).
update. Nucleic Acids Res., 51, D1539–D1548.
Nucleic Acids Res., 46, D708–D717.
29. Perez-Riverol,Y., Bai,J., Bandla,C., García-Seisdedos,D.,
11. Morgat,A., Lombardot,T., Coudert,E., Axelsen,K., Neto,T.B.,
Hewapathirana,S., Kamatchinathan,S., Kundu,D.J., Prakash,A.,
Gehant,S., Bansal,P., Bolleman,J., Gasteiger,E., de Castro,E., et al.
Frericks-Zipper,A., Eisenacher,M., et al. (2022) The PRIDE
(2020) Enzyme annotation in UniProtKB using Rhea.
database resources in 2022: a hub for mass spectrometry-based
Bioinformatics, 36, 1896–1901.
proteomics evidences. Nucleic Acids Res., 50, D543–D552.
12. Bansal,P., Morgat,A., Axelsen,K.B., Muthukrishnan,V., Coudert,E.,
30. Kalyuzhnyy,A., Eyers,P.A., Eyers,C.E., Bowler-Barnett,E.,
Aimo,L., Hyka-Nouspikel,N., Gasteiger,E., Kerhornou,A.,
Martin,M.J., Sun,Z., Deutsch,E.W. and Jones,A.R. (2022) Profiling
Neto,T.B., et al. (2022) Rhea, the reaction knowledgebase in
the human phosphoproteome to estimate the true extent of protein
2022. Nucleic Acids Res., 50, D693–D700.
phosphorylation. J. Proteome Res., 21, 1510–1524.
13. Zhou,H.-L., Grimmett,Z.W., Venetos,N.M., Stomberski,C.T.,
31. Meldal,B.H.M., Perfetto,L., Combe,C., Lubiana,T., Ferreira
Qian,Z., McLaughlin,P.J., Bansal,P.K., Zhang,R., Reynolds,J.D.,
Cavalcante,J.V., Bye-A-Jee,H., Waagmeester,A., Del-Toro,N.,
Premont,R.T., et al. (2023) An enzyme that selectively
Shrivastava,A., Barrera,E., et al. (2023) Complex Portal 2022:
S-nitrosylates proteins to regulate insulin signaling. Cell, 186,
new curation frontiers. Nucleic Acids Res., 50, D578–D586.
5812–5825.
32. Combe,C.W., Sivade,M.D., Hermjakob,H., Heimbach,J.,
14. Gene Ontology Consortium (2021) The Gene Ontology resource:
Meldal,B.H.M., Micklem,G., Orchard,S. and Rappsilber,J. (2017)
enriching a GOld mine. Nucleic Acids Res., 49, D325–D334.
ComplexViewer: visualization of curated macromolecular
15. Hastings,J., Owen,G., Dekker,A., Ennis,M., Kale,N.,
complexes. Bioinformatics, 33, 3673–3675.
Muthukrishnan,V., Turner,S., Swainston,N., Mendes,P. and
Steinbeck,C. (2016) ChEBI in 2016: improved services and an
expanding collection of metabolites. Nucleic Acids Res., 44,
D1214–D1219. Appendix
16. Thomas,P.D., Hill,D.P., Mi,H., Osumi-Sutherland,D., Van
Auken,K., Carbon,S., Balhoff,J.P., Albou,L.-P., Good,B., Gaudet,P., The UniProt Consortium
et al. (2019) Gene Ontology Causal Activity Modeling Alex Bateman, Maria-Jesus Martin, Sandra Orchard, Michele
(GO-CAM) moves beyond GO annotations to structured Magrane, Aduragbemi Adesina, Shadab Ahmad, Emily H.
descriptions of biological functions and systems. Nat. Genet., 51, Bowler-Barnett, Hema Bye-A-Jee, David Carpentier, Paul
1429–1433. Denny, Jun Fan, Penelope Garmiri, Leonardo Jose da Costa
17. Allot,A., Lee,K., Chen,Q., Luo,L. and Lu,Z. (2021) LitSuggest: a Gonzales, Abdulrahman Hussein, Alexandr Ignatchenko,
web-based system for literature recommendation and curation
Giuseppe Insana, Rizwan Ishtiaq, Vishal Joshi, Dushyanth
using machine learning. Nucleic Acids Res., 49, W352–W358.
Jyothi, Swaathi Kandasaamy, Antonia Lock, Aurelien Luciani,
18. Lai,P.-T., Coudert,E., Aimo,L., Axelsen,K., Breuza,L., de Castro,E.,
Feuermann,M., Morgat,A., Pourcel,L., Pedruzzi,I., et al. (2024) Jie Luo, Yvonne Lussi, Juan Sebastian Martinez Marin, Pedro
EnzChemRED, a rich enzyme chemistry relation extraction Raposo, Daniel L. Rice, Rafael Santos, Elena Speretta, James
dataset. Scientific Data, 11, 982. Stephenson, Prabhat Totoo, Nidhi Tyagi, Nadya Urakova,
19. Lai,P.-T., Coudert,E., Aimo,L., Axelsen,K., Breuza,L., de Castro,E., Preethi Vasudev, Kate Warner, Supun Wijerathne, Conny
Feuermann,M., Morgat,A., Pourcel,L., Pedruzzi,I., et al. (2024) Wing-Heng Yu and Rossana Zaru at the EMBL-European
Bioinformatics Institute; Alan J. Bridge, Lucila Aimo, Ghislaine Argoud-Puy, Andrea H. Auchincloss, Kristian B. Axelsen, Parit Bansal, Delphine Baratin, Teresa M. Batista Neto, Marie-Claude Blatter, Jerven T. Bolleman, Emmanuel Boutet, Lionel Breuza, Blanca Cabrera Gil, Cristina Casals-Casas, Kamal Chikh Echioukh, Elisabeth Coudert, Beatrice Cuche, Edouard de Castro, Anne Estreicher, Maria L. Famiglietti, Marc Feuermann, Elisabeth Gasteiger, Pascale Gaudet, Sebastien Gehant, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz, Chantal Hulo, Nevila Hyka-Nouspikel, Florence Jungo, Arnaud Kerhornou, Philippe Le Mercier, Damien Lieberherr, Patrick Masson, Anne Morgat, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Lucille Pourcel, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole Redaschi, Catherine Rivoire, Christian J. A. Sigrist, Karin Sonesson, Shyamala Sundaram and Anastasia Sveshnikova at the SIB Swiss Institute of Bioinformatics; Cathy H. Wu, Cecilia N. Arighi, Chuming Chen, Yongxing Chen, Hongzhan Huang, Kati Laiho, Minna Lehvaslaiho, Peter McGarvey, Darren A. Natale, Karen Ross, C. R. Vinayaka, Yuqi Wang and Jian Zhang at the Protein Information Resource.

Received: September 12, 2024. Revised: October 14, 2024. Editorial Decision: October 15, 2024. Accepted: October 16, 2024
© The Author(s) 2024. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse,
distribution, and reproduction in any medium, provided the original work is properly cited.
