BD0 Introduction 2per
Introduction to
Biological Databases
Simon Tomlinson
Introduction
Biological databases are organized collections of biological data,
typically accessible by computational means. They are reservoirs
of biological knowledge, generally stable across time and often focused
on a particular biological domain.
19/09/2023
Course In Detail
The course content falls into the following areas:
Reading List
Textbook
• No single textbook covers the whole course. SQL will be an important technology, and a useful
book for it is:
• Learning SQL: Generate, Manipulate, and Retrieve Data, Alan Beaulieu
General Reading
• Thessen, Anne E., and David J. Patterson. "Data issues in the life sciences." ZooKeys 150 (2011): 15.
• Sharma, Parva Kumar, and Inderjit Singh Yadav. "Biological databases and their application."
Bioinformatics. Academic Press, 2022. 17-31.
• Hassani-Pak, Keywan, and Christopher Rawlings. "Knowledge discovery in biological databases for
revealing candidate genes linked to complex phenotypes." Journal of integrative bioinformatics 14.1
(2017).
Specific weekly reading lists will be provided for every topic covered.
Assessment
• In-course assessment (50%) and exam (50%)
• This is the second year of this course, so there is one past exam paper.
Example exam questions will be provided later in the course, and
there will also be a revision session.
• Introduction to web site and database design for drug discovery (BICH11007, SBS)
Design a database and query it for drug data
• Molecular Modelling and Database Mining (PGBI11023, SBS)
Query a molecular modelling database
General database design
• Using R for Data Science, Functional Genomic Technologies
Querying and accessing data within databases using R
• Bioinformatics Programming and System Management
Building simple databases in SQL, query using Python
Setup
• In the introductory week(s) we will use web examples.
• After that we will switch to using Unix and R. Server accounts will be
provided for this, but you can also use your own laptop.
• We will go through setup procedures next week.
• This week we will only use a web browser...
My Contact details
• I am the course organizer and the lecturer
• If you’d like to contact me, please use email, as I’m often not in my office. Please
put “BD” at the start of the subject line so that emails relating
to this course can easily be identified.
Dr Simon Tomlinson
Senior Lecturer/Group Leader
Centre for Regenerative Medicine
Institute for Regeneration and Repair
School of Biological Sciences
University of Edinburgh
email: [email protected]
• Available at www.ensembl.org
Pick a Match...
Match (click)
Lower part
Assembly
Seems Simple?
• Ensembl is actually an extremely complicated data resource
• It is probably the most complex data resource we will use on the course
• But the complexity is partly hidden: you can perform simple queries
relatively easily, and help is available
• The problem for us in bioinformatics or data science is that
we need to work at genome scale, not look at single genes
• At that scale the complexities become very important
• Our purpose this week is not to fully understand Ensembl but to use
Ensembl to map out the challenges to understanding any database
• Not every system offers such a rich set of interfaces, but Ensembl can be accessed in all of these ways
• In this design all the interfaces “see” exactly the same versions of the data stored in the database
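As one illustration of programmatic access, the sketch below builds a request to Ensembl's public REST service for the Trp53 gene ID used later in this lecture. The helper names `lookup_url` and `fetch_gene` are made up for this example, and the server address and `lookup/id` endpoint are assumed to be available as written. `fetch_gene` needs network access, so it is defined but not called here.

```python
import json
import urllib.request

# Assumption: Ensembl's public REST service is reachable at this address.
REST_SERVER = "https://rest.ensembl.org"


def lookup_url(stable_id: str) -> str:
    """Build the URL that asks the REST service about one stable ID."""
    return f"{REST_SERVER}/lookup/id/{stable_id}?content-type=application/json"


def fetch_gene(stable_id: str) -> dict:
    """Fetch the record as a dictionary (requires network access)."""
    with urllib.request.urlopen(lookup_url(stable_id)) as response:
        return json.load(response)


url = lookup_url("ENSMUSG00000059552")
print(url)
```

Pasting the same URL into a browser should return the same gene record that the web pages display, which is the point of the shared-database design.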
• All this seems ‘trivial’ because we could search through the list of matches
and pick out the “correct” one by eye
• But suppose we have 10,000 queries to make automatically. To get the
required results back we need to use the correct query term for each gene,
otherwise we risk pulling back the wrong gene information
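A minimal sketch of why the query term matters at scale. The three records are taken from the result tables later in this lecture; the two helper functions are hypothetical and stand in for a loose text search versus an exact lookup.

```python
# Gene records (stable ID, gene name) copied from the result tables in this lecture.
GENES = [
    ("ENSMUSG00000074637", "Sox2"),
    ("ENSMUSG00000105265", "Sox2ot"),
    ("ENSMUSG00000059552", "Trp53"),
]


def loose_search(term):
    """Return every record whose name contains the term, like a free-text search."""
    return [gene for gene in GENES if term.lower() in gene[1].lower()]


def exact_lookup(stable_id):
    """Return the single record with this stable ID, or None if absent."""
    for gene in GENES:
        if gene[0] == stable_id:
            return gene
    return None


# A person can choose between two candidates; an automated script cannot.
print(loose_search("Sox2"))
print(exact_lookup("ENSMUSG00000074637"))
```

The loose search returns both Sox2 and Sox2ot, while the lookup by identifier returns exactly one record.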
Stable Identifiers
• You may notice that Ensembl labels the mouse Trp53 gene with the identifier
ENSMUSG00000059552.
• This identifier is unique within Ensembl (although it may carry different version
numbers)
• In a way we are thinking about this annotation in the reverse direction to Ensembl’s design.
Ensembl takes the assembly and maps annotation onto it, creating gene
identifiers. These genes are then mapped to known genes.
• Identifiers are fixed to “gene” sequences in the genome, but the gene names they
map to might change as knowledge grows or the nomenclature changes.
• So these stable gene IDs (related to accession numbers) are constant between
versions of Ensembl and unique within the database. Ensembl has other IDs to
represent proteins or transcripts, for example.
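The structure of a stable identifier can be sketched with a small parser. The layout assumed here ("ENS", an optional three-letter species prefix, a feature letter, 11 digits, an optional ".version") is a simplification, only four feature letters are handled, and the version suffix in the second call is made up for illustration.

```python
import re

# Assumed layout: "ENS" + optional species prefix + feature letter + 11 digits
# + optional ".version". Only four feature letters are handled in this sketch.
STABLE_ID = re.compile(r"^ENS([A-Z]{3})?([GTPE])(\d{11})(?:\.(\d+))?$")
FEATURES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}


def parse_stable_id(text):
    """Split a stable ID into its parts; raise ValueError if it does not fit."""
    match = STABLE_ID.match(text)
    if match is None:
        raise ValueError(f"not a recognised stable ID: {text!r}")
    species, feature, number, version = match.groups()
    return {
        "id": f"ENS{species or ''}{feature}{number}",  # version-free form
        "species_prefix": species,
        "feature": FEATURES[feature],
        "version": int(version) if version else None,
    }


print(parse_stable_id("ENSMUSG00000059552"))
print(parse_stable_id("ENSMUSG00000059552.14")["id"])  # ".14" is a made-up version
```

Stripping the version before comparing identifiers is a common precaution when combining data from different releases.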
• The gene name is a foreign key in the record, with the value
“Trp53”
• We can use this key to go to the MGI nomenclature to get other
useful information such as aliases
• Of course, we can make these queries ourselves or they can be
made through the web interface, but they all work in the same way
[Figure labels: MySQL, Ensembl]
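The foreign-key idea can be sketched with two small tables. SQLite (from the Python standard library) stands in for MySQL here, the table and column names are hypothetical rather than Ensembl's or MGI's real schema, and the alias list is illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE nomenclature (
        gene_name TEXT PRIMARY KEY,
        aliases   TEXT
    );
    CREATE TABLE gene (
        stable_id TEXT PRIMARY KEY,
        gene_name TEXT REFERENCES nomenclature(gene_name)  -- the foreign key
    );
    -- The alias list is illustrative only, not an export from MGI.
    INSERT INTO nomenclature VALUES ('Trp53', 'p53, Tp53');
    INSERT INTO gene VALUES ('ENSMUSG00000059552', 'Trp53');
""")

# Follow the foreign key from the gene record to the nomenclature record.
row = conn.execute(
    """
    SELECT g.stable_id, n.gene_name, n.aliases
    FROM gene AS g
    JOIN nomenclature AS n ON n.gene_name = g.gene_name
    WHERE g.stable_id = ?
    """,
    ("ENSMUSG00000059552",),
).fetchone()
print(row)
```

The join is the step that a web interface performs behind the scenes when it shows aliases next to a gene record.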
Ensembl-Gene Locations
• ENSMUSG00000059552, which maps to the Trp53 gene, also has a
location in the genome: 11:69471185-69482699:1
• This is on chromosome 11, starting at 69471185 and ending at 69482699,
on the positive chromosomal strand
• Almost all genes can be mapped to a unique location on the assembly
• Other Ensembl features (promoter, enhancer, gene, transcript etc.) can
be mapped similarly
[Figure: simplified model of the gene on chr11, spanning 69471185 to 69482699]
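A location string of this form can be split into its parts with a few lines of code. This is a sketch: the `Location` type and `parse_location` function are names invented here, and the length calculation assumes that both the start and end positions are included in the feature.

```python
from typing import NamedTuple


class Location(NamedTuple):
    chromosome: str
    start: int
    end: int
    strand: int  # 1 = positive strand, -1 = negative strand

    @property
    def length(self) -> int:
        # Assumes start and end are both included in the feature.
        return self.end - self.start + 1


def parse_location(text: str) -> Location:
    """Parse 'chromosome:start-end:strand', e.g. '11:69471185-69482699:1'."""
    chromosome, span, strand = text.split(":")
    start, end = span.split("-")
    return Location(chromosome, int(start), int(end), int(strand))


trp53 = parse_location("11:69471185-69482699:1")
print(trp53, trp53.length)
```

The chromosome is kept as text rather than a number so that names such as X, Y or MT are handled in the same way.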
Search with these IDs:
ENSMUSG00000047751
ENSMUSG00000074637
ENSMUSG00000024406
ENSMUSG00000055148
ENSMUSG00000003032
ENSMUSG00000022346
ENSMUSG00000037169
Trp53
ENSMUSG00000105265
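The search list mixes stable IDs with one gene name, and a batch query normally has to treat these as different kinds of input. The sketch below separates them; the function name is hypothetical and the check is deliberately narrow (mouse gene IDs only).

```python
QUERIES = [
    "ENSMUSG00000047751", "ENSMUSG00000074637", "ENSMUSG00000024406",
    "ENSMUSG00000055148", "ENSMUSG00000003032", "ENSMUSG00000022346",
    "ENSMUSG00000037169", "Trp53", "ENSMUSG00000105265",
]


def looks_like_mouse_gene_id(term):
    """True for 'ENSMUSG' followed by exactly 11 digits."""
    prefix = "ENSMUSG"
    digits = term[len(prefix):]
    return term.startswith(prefix) and len(digits) == 11 and digits.isdigit()


# Split the list so that each kind of term can be sent to the right filter.
stable_ids = [q for q in QUERIES if looks_like_mouse_gene_id(q)]
names = [q for q in QUERIES if not looks_like_mouse_gene_id(q)]
print(len(stable_ids), names)
```

A query that filters on stable ID alone would silently skip the entry given as a name, so it is worth checking the input list before submitting it.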
1st Attempt
Ensembl Gene ID     Chromosome  Gene Start (bp)  Gene End (bp)  Gene Biotype    Gene Name
ENSMUSG00000003032  4           55527143         55532466       protein_coding  Klf4
ENSMUSG00000022346  15          61985391         61990374       protein_coding  Myc
ENSMUSG00000024406  17          35506018         35510776       protein_coding  Pou5f1
ENSMUSG00000037169  12          12936096         12941914       protein_coding  Mycn
ENSMUSG00000047751  7           139943789        139945112      protein_coding  Utf1
ENSMUSG00000055148  8           72319033         72321656       protein_coding  Klf2
ENSMUSG00000059552  11          69580359         69591873       protein_coding  Trp53
ENSMUSG00000074637  3           34650005         34652461       protein_coding  Sox2
2nd Attempt
Gene stable ID      Gene start (bp)  Gene end (bp)  Chromosome  Gene name  Gene type
ENSMUSG00000003032  55527143         55532466       4           Klf4       protein_coding
ENSMUSG00000022346  61857240         61862223       15          Myc        protein_coding
ENSMUSG00000024406  35816915         35821669       17          Pou5f1     protein_coding
ENSMUSG00000037169  12986094         12991915       12          Mycn       protein_coding
ENSMUSG00000047751  139523702        139525025      7           Utf1       protein_coding
ENSMUSG00000055148  73072877         73075500       8           Klf2       protein_coding
ENSMUSG00000059552  69471185         69482699       11          Trp53      protein_coding
ENSMUSG00000074637  34704554         34706610       3           Sox2       protein_coding
ENSMUSG00000105265  34158419         34736768       3           Sox2ot     lncRNA
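The two attempts can be compared mechanically by stable ID. The sketch below copies the coordinates from both tables and reports where they differ. It does not say why they differ; a different assembly or release is one plausible cause, but that is an assumption here.

```python
# (chromosome, start, end) per stable ID, copied from the two result tables.
FIRST = {
    "ENSMUSG00000003032": ("4", 55527143, 55532466),
    "ENSMUSG00000022346": ("15", 61985391, 61990374),
    "ENSMUSG00000024406": ("17", 35506018, 35510776),
    "ENSMUSG00000037169": ("12", 12936096, 12941914),
    "ENSMUSG00000047751": ("7", 139943789, 139945112),
    "ENSMUSG00000055148": ("8", 72319033, 72321656),
    "ENSMUSG00000059552": ("11", 69580359, 69591873),
    "ENSMUSG00000074637": ("3", 34650005, 34652461),
}
SECOND = {
    "ENSMUSG00000003032": ("4", 55527143, 55532466),
    "ENSMUSG00000022346": ("15", 61857240, 61862223),
    "ENSMUSG00000024406": ("17", 35816915, 35821669),
    "ENSMUSG00000037169": ("12", 12986094, 12991915),
    "ENSMUSG00000047751": ("7", 139523702, 139525025),
    "ENSMUSG00000055148": ("8", 73072877, 73075500),
    "ENSMUSG00000059552": ("11", 69471185, 69482699),
    "ENSMUSG00000074637": ("3", 34704554, 34706610),
    "ENSMUSG00000105265": ("3", 34158419, 34736768),
}

# IDs present in both attempts, grouped by whether the coordinates agree.
shared = sorted(set(FIRST) & set(SECOND))
changed = [gene_id for gene_id in shared if FIRST[gene_id] != SECOND[gene_id]]
unchanged = [gene_id for gene_id in shared if FIRST[gene_id] == SECOND[gene_id]]
# IDs returned only by the second attempt.
only_in_second = sorted(set(SECOND) - set(FIRST))

print("changed:", changed)
print("unchanged:", unchanged)
print("only in second attempt:", only_in_second)
```

Because the stable IDs are the same in both tables, they can serve as the join key even though the coordinates and column order differ.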