Ensembl Genome Database Project
Content
Description: Ensembl
Contact
Research center: European Bioinformatics Institute
Primary citation: Yates, et al. (2020)[1]
Access
Website: www.ensembl.org (http://www.ensembl.org/)
History
The human genome consists of three billion base pairs, which code for approximately 20,000–25,000 genes.
However, the genome alone is of little use unless the locations and relationships of individual genes can be
identified. One option is manual annotation, whereby a team of scientists tries to locate genes using
experimental data from scientific journals and public databases. However, this is a slow, painstaking task.
The alternative, known as automated annotation, is to use the power of computers to do the complex
pattern-matching of protein to DNA.[5][6] The Ensembl project was launched in 1999 in response to the
imminent completion of the Human Genome Project, with the initial goals of automatically annotating the
human genome, integrating this annotation with available biological data, and making all this knowledge
publicly available.[2]
In the Ensembl project, sequence data are fed into the gene annotation system (a collection of software
"pipelines" written in Perl) which creates a set of predicted gene locations and saves them in a MySQL
database for subsequent analysis and display. Ensembl makes these data freely accessible to the world
research community. All the data and code produced by the Ensembl project are available to download,[7]
and there is also a publicly accessible database server allowing remote access. In addition, the Ensembl
website provides computer-generated visual displays of much of the data.
Over time the project has expanded to include additional species (including key model organisms such as
mouse, fruit fly and zebrafish) as well as a wider range of genomic data, including genetic variations and
regulatory features. Since April 2009, a sister project, Ensembl Genomes, has extended the scope of
Ensembl into invertebrate metazoa, plants, fungi, bacteria, and protists, focusing on providing taxonomic
and evolutionary context to genes, whilst the original project continues to focus on vertebrates.[8][9]
As of 2020, Ensembl supported over 50,000 genomes across the Ensembl and Ensembl Genomes
databases. It also added new features such as Rapid Release (https://rapid.ensembl.org/index.html),
a website designed to make genome annotation data available to users more quickly, and
COVID-19 (https://covid-19.ensembl.org/index.html), a website providing access to the SARS-CoV-2
reference genome.
Externally produced data can also be added to the display by uploading a suitable file in one of the
supported formats, such as BAM, BED, or PSL.
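As an illustration of what such an upload file might contain, the following minimal sketch writes features as six-column BED lines. It assumes the usual BED convention of tab-separated fields with a 0-based start and an exclusive end, and converts from 1-based inclusive coordinates; the function names and the sample feature are hypothetical and are not part of Ensembl.

```python
def bed_line(chrom, start_1based, end_1based, name=".", score=0, strand="."):
    """Format one feature as a six-column BED line.

    Input coordinates are 1-based and inclusive; BED uses a 0-based start
    and an exclusive end, so only the start is shifted by one.
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    if strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-' or '.'")
    fields = [chrom, str(start_1based - 1), str(end_1based), name, str(score), strand]
    return "\t".join(fields)


def bed_file(features):
    """Join several features (tuples of bed_line arguments) into file content."""
    return "".join(bed_line(*feature) + "\n" for feature in features)


# Hypothetical feature on chromosome 21, given in 1-based coordinates.
content = bed_file([("21", 1000, 2000, "featA", 0, "+")])
```

The resulting text could be saved to a file with a .bed extension before uploading.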
Graphics are generated using a suite of custom Perl modules based on GD, the standard Perl graphics
display library.
This software can be used to access the public MySQL database, avoiding the need to download enormous
datasets. Users can also retrieve data from the MySQL database with direct SQL queries, but this
requires extensive knowledge of the current database schema.
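The kind of schema knowledge involved can be sketched with a toy example. The code below does not connect to the Ensembl server; it builds an in-memory SQLite stand-in with two heavily simplified tables, named after the gene and seq_region tables of the Ensembl core schema, and joins them. The column subset, the gene identifiers and the helper names are assumptions made for illustration, and the real schema is larger and can change between releases.

```python
import sqlite3

# Simplified stand-ins for two core-schema tables (illustrative subset only).
TOY_SCHEMA = """
CREATE TABLE seq_region (seq_region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY,
    stable_id TEXT,
    biotype TEXT,
    seq_region_id INTEGER,
    seq_region_start INTEGER,
    seq_region_end INTEGER
);
"""

# A gene row stores only a numeric seq_region_id, so finding genes on a named
# chromosome needs a join: exactly the kind of detail a user must know.
# Note: "?" is SQLite placeholder syntax; MySQL drivers typically use "%s".
GENES_ON_REGION = """
SELECT g.stable_id, g.seq_region_start, g.seq_region_end
FROM gene AS g
JOIN seq_region AS sr ON sr.seq_region_id = g.seq_region_id
WHERE sr.name = ? AND g.biotype = ?
ORDER BY g.seq_region_start
"""


def make_toy_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(TOY_SCHEMA)
    conn.executemany("INSERT INTO seq_region VALUES (?, ?)", [(1, "21"), (2, "X")])
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "GENE_B", "protein_coding", 1, 5000, 9000),
            (2, "GENE_A", "protein_coding", 1, 1000, 2000),
            (3, "GENE_C", "lncRNA", 1, 3000, 4000),
            (4, "GENE_D", "protein_coding", 2, 100, 200),
        ],
    )
    return conn


def genes_on_region(conn, region_name, biotype="protein_coding"):
    """Return (stable_id, start, end) tuples for genes on one named region."""
    return conn.execute(GENES_ON_REGION, (region_name, biotype)).fetchall()
```

Against the real server the same join would be issued through a MySQL client rather than sqlite3, and the query would need checking against the schema of the release in use.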
Large datasets can be retrieved using the BioMart data-mining tool. It provides a web interface for
downloading datasets using complex queries.
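BioMart queries can also be expressed as XML documents. The sketch below only builds such a document and a request URL; it does not contact any server. The martservice address, the dataset name hsapiens_gene_ensembl, and the filter and attribute names in the usage lines are assumptions based on common BioMart usage and should be checked against the live service; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint of the Ensembl BioMart web service.
MARTSERVICE_URL = "https://www.ensembl.org/biomart/martservice"
XML_PROLOG = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'


def build_biomart_query(dataset, attributes, filters=None):
    """Return a BioMart XML query selecting `attributes` from `dataset`.

    `filters` maps filter names to values that restrict the rows returned.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",          # tab-separated output
        "header": "0",               # no header row
        "uniqueRows": "1",           # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in (filters or {}).items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return XML_PROLOG + ET.tostring(query, encoding="unicode")


def biomart_url(xml_query):
    """Return a GET URL carrying the XML query as the `query` parameter."""
    return MARTSERVICE_URL + "?" + urlencode({"query": xml_query})


# Example: gene IDs and names for human chromosome 21 (names are assumptions).
xml_query = build_biomart_query(
    "hsapiens_gene_ensembl",
    ["ensembl_gene_id", "external_gene_name"],
    {"chromosome_name": "21"},
)
url = biomart_url(xml_query)
```

Fetching the URL, for example with urllib.request.urlopen, would be expected to return tab-separated text if the dataset and attribute names are valid.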
Finally, there is an FTP server (http://ftp.ensembl.org/) which can be used to download entire MySQL
databases as well as some selected data sets in other formats.
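Downloads from this server are organised by release. The following sketch builds directory URLs under the assumption that the layout is pub/release-N/<data type>/<species>/; that layout, the example release number and the function name are assumptions and should be verified by browsing the server.

```python
# Assumed root of the public download area.
FTP_ROOT = "https://ftp.ensembl.org/pub"


def release_dir(release, data_type, species=None):
    """Build the URL of a download directory for one release.

    `data_type` is a top-level folder such as "mysql" or "gtf" (assumed names);
    `species` is converted to lower case with underscores, e.g. "homo_sapiens".
    """
    parts = [FTP_ROOT, f"release-{int(release)}", data_type]
    if species:
        parts.append(species.strip().lower().replace(" ", "_"))
    return "/".join(parts) + "/"


# Hypothetical examples: database dumps and human gene annotation files.
mysql_dumps = release_dir(110, "mysql")
human_gtf = release_dir(110, "gtf", "Homo sapiens")
```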
Current species
The annotated genomes include most fully sequenced vertebrates and selected model organisms. All of
them are eukaryotes; there are no prokaryotes. As of 2022, 271 species are registered,
including:[11]
Species
Chordata
Primates: Angola colobus, black-capped squirrel monkey, black snub-nosed monkey, bonobo, bushbaby,
capuchin, chimpanzee, common marmoset, Coquerel's sifaka, crab-eating macaque, drill, human,
macaque, mouse lemur, gelada, gibbon, golden snub-nosed monkey, gorilla, greater bamboo lemur,
green monkey, Ma's night monkey, olive baboon, orangutan, pig-tailed macaque, sooty mangabey,
tarsier, Ugandan red colobus
Monotremes: Platypus
Open source/mirrors
All data produced by the Ensembl project are open access and all software is open source; both are freely
available to the scientific community under a CC BY 4.0 license. Currently, the Ensembl website is
mirrored at four different locations worldwide to improve the service.
US East (Amazon AWS) (https://useast.ensembl.org/index.html): cloud-based mirror on the East Coast of
the United States
See also
List of sequenced eukaryotic genomes
List of biological databases
Sequence analysis
Sequence profiling tool
Sequence motif
UCSC Genome Browser
ENCODE
References
1. Yates A. D.; et al. (January 2020). "Ensembl 2020"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7145704). Nucleic Acids Research. 48 (D1): D682–D688.
doi:10.1093/nar/gkz966 (https://doi.org/10.1093%2Fnar%2Fgkz966). PMC 7145704
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7145704). PMID 31691826
(https://pubmed.ncbi.nlm.nih.gov/31691826).
2. Hubbard, T. (1 January 2002). "The Ensembl genome database project"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC99161). Nucleic Acids Research. 30 (1): 38–41.
doi:10.1093/nar/30.1.38 (https://doi.org/10.1093%2Fnar%2F30.1.38). PMC 99161
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC99161). PMID 11752248
(https://pubmed.ncbi.nlm.nih.gov/11752248).
3. Flicek P, Amode MR, Barrell D, et al. (November 2010). "Ensembl 2011"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3013672). Nucleic Acids Research. 39 (Database issue):
D800–D806. doi:10.1093/nar/gkq1064 (https://doi.org/10.1093%2Fnar%2Fgkq1064).
PMC 3013672 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3013672). PMID 21045057
(https://pubmed.ncbi.nlm.nih.gov/21045057).
4. Flicek P, Aken BL, Ballester B, et al. (January 2010). "Ensembl's 10th year"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808936). Nucleic Acids Research. 38 (Database issue):
D557–62. doi:10.1093/nar/gkp972 (https://doi.org/10.1093%2Fnar%2Fgkp972). PMC 2808936
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808936). PMID 19906699
(https://pubmed.ncbi.nlm.nih.gov/19906699).
5. Davis, Charles Patrick (29 March 2021). "Medical definition of Genome Annotation"
(https://web.archive.org/web/20210614173351/https://www.medicinenet.com/genome_annotation/definition.htm).
Archived from the original (https://www.medicinenet.com/genome_annotation/definition.htm)
on 14 June 2021. Retrieved 7 August 2022.
6. Curwen, Val; Eyras, Eduardo; Andrews, T. Daniel; Clarke, Laura; Mongin, Emmanuel;
Searle, Steven M. J.; Clamp, Michele (May 2004). "The Ensembl automatic gene annotation
system" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC479124). Genome Research. 14 (5):
942–950. doi:10.1101/gr.1858004 (https://doi.org/10.1101%2Fgr.1858004). ISSN 1088-9051
(https://www.worldcat.org/issn/1088-9051). PMC 479124
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC479124). PMID 15123590
(https://pubmed.ncbi.nlm.nih.gov/15123590).
7. Ruffier, Magali; Kähäri, Andreas; Komorowska, Monika; Keenan, Stephen; Laird, Matthew;
Longden, Ian; Proctor, Glenn; Searle, Steve; Staines, Daniel; Taylor, Kieron; Vullo,
Alessandro; Yates, Andrew; Zerbino, Daniel; Flicek, Paul (January 2017). "Ensembl core
software resources: storage and programmatic access for DNA sequence and genome
annotation" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467575). Database. 2017 (1):
bax020. doi:10.1093/database/bax020 (https://doi.org/10.1093%2Fdatabase%2Fbax020).
PMC 5467575 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467575). PMID 28365736
(https://pubmed.ncbi.nlm.nih.gov/28365736).
8. Hubbard, T. J. P.; Aken, B. L.; Ayling, S.; Ballester, B.; Beal, K.; Bragin, E.; Brent, S.; Chen, Y.;
Clapham, P.; Clarke, L.; Coates, G. (January 2009). "Ensembl 2009"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2686571). Nucleic Acids Research. 37 (Database issue):
D690–697. doi:10.1093/nar/gkn828 (https://doi.org/10.1093%2Fnar%2Fgkn828). ISSN 1362-4962
(https://www.worldcat.org/issn/1362-4962). PMC 2686571
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2686571). PMID 19033362
(https://pubmed.ncbi.nlm.nih.gov/19033362).
9. Howe, Kevin L.; Contreras-Moreira, Bruno; De Silva, Nishadi; Maslen, Gareth; Akanni,
Wasiu; Allen, James; Alvarez-Jarreta, Jorge; Barba, Matthieu; Bolser, Dan M.; Cambell,
Lahcen; Carbajo, Manuel (8 January 2020). "Ensembl Genomes 2020-enabling non-vertebrate
genomic research" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6943047).
Nucleic Acids Research. 48 (D1): D689–D695. doi:10.1093/nar/gkz890
(https://doi.org/10.1093%2Fnar%2Fgkz890). ISSN 1362-4962 (https://www.worldcat.org/issn/1362-4962).
PMC 6943047 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6943047). PMID 31598706
(https://pubmed.ncbi.nlm.nih.gov/31598706).
10. Stabenau A, McVicker G, Melsopp C, Proctor G, Clamp M, Birney E (February 2004). "The
Ensembl Core Software Libraries" (http://genome.cshlp.org/content/14/5/929.full). Genome
Research. 14 (5): 929–933. doi:10.1101/gr.1857204 (https://doi.org/10.1101%2Fgr.1857204).
PMC 479122 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC479122). PMID 15123588
(https://pubmed.ncbi.nlm.nih.gov/15123588).
11. "Species List" (https://uswest.ensembl.org/info/about/species.html). uswest.ensembl.org.
Retrieved 5 August 2022.
External links
Official website (http://www.ensembl.org/)
Vega (http://vega.sanger.ac.uk)
Pre-Ensembl (http://pre.ensembl.org)
Ensembl genomes (http://www.ensemblgenomes.org)
UCSC Genome Browser (http://genome.ucsc.edu)
NCBI (https://www.ncbi.nlm.nih.gov/)
Ensembl: Browsing chordate genomes on EBI Train OnLine
(http://www.ebi.ac.uk/training/online/course/ensembl-browsing-chordate-genomes)