19/09/2023

Introduction to
Biological Databases
Simon Tomlinson

Introduction
Biological databases are organized collections of biological data,
typically accessible by computational means. They are reservoirs
of biological knowledge, generally stable across time and often focused
on a particular biological domain.

• This course is about those databases: how data is stored, accessed,
retrieved and converted to a form that can be used in biological
analysis.


Course In Detail
The course covers the following areas:

• Different types of database from flat files to relational formats


• Searching and querying databases, SQL and XML interchange formats
• Advanced interfaces- web, software, mart, REST etc
• Correctness, database normalisation, performance, versioning etc

A survey of biological databases


• Example databases from 10 different biological domains
• Design principles for each database
• Query and data retrieval using SQL, R and Python
• Advanced topics, eg meta databases, big data

Databases Used on this Course


Databases from 10 different biological domains
• Genomic databases [eg Ensembl]
• Nucleic acid databases [eg GenBank]
• Pathway and metabolic databases [eg Reactome, KEGG]
• Taxonomic databases [eg Gene Ontology]
• Protein interaction databases [eg BioGRID]
• High-throughput sequencing databases [eg GEO, ArrayExpress]
• Imaging databases [eg OMERO]
• Protein and proteomic databases [eg Swiss-Prot]
• Model organism databases [eg FlyBase]
• Genomic feature databases [eg JASPAR]
• Meta databases, large-scale databases and big data [eg InterMine]


Programming Languages Used


• Databases can generally be explored through web interfaces, but this
works best for small queries eg single gene queries
• Bioinformaticians, Data Scientists or others performing data analysis
often need to work at a much larger scale
• So we also use programming languages to access databases. This
makes it possible to perform more complex and larger-scale queries
and to integrate the results directly into the environment being used
• Examples will use SQL, R and Python
• We will also discuss web technologies such as REST, which act as interfaces
between the database and the software that queries it
• No prior programming experience is required!!!
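
To give a feel for what programmatic access looks like, here is a minimal Python sketch that builds a request URL for a gene lookup over REST. The host rest.ensembl.org and the lookup/symbol endpoint are not named in these slides; they are assumed from the public Ensembl REST documentation and should be checked there. The function names are illustrative only.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of the public Ensembl REST server (not given in the slides).
REST_SERVER = "https://rest.ensembl.org"


def lookup_url(species, symbol):
    """Build the URL for a gene-symbol lookup. No network access happens here."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    return REST_SERVER + path + "?content-type=application/json"


def fetch_gene(species, symbol):
    """Fetch the gene record as a dict. Requires internet access."""
    with urllib.request.urlopen(lookup_url(species, symbol), timeout=30) as resp:
        return json.load(resp)


# Only the URL is built here; call fetch_gene() yourself when you are online.
print(lookup_url("mus_musculus", "Trp53"))
```

The same function can be called in a loop over thousands of gene names, which is exactly what a web browser cannot do conveniently.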

Reading List
Textbook
• No single textbook covers the whole course content. But SQL will be an important technology and
for this a useful book is
• Learning SQL: Generate, Manipulate, and Retrieve Data, Alan Beaulieu

General Reading
• Thessen, Anne E., and David J. Patterson. "Data issues in the life sciences." ZooKeys 150 (2011): 15.
• Sharma, Parva Kumar, and Inderjit Singh Yadav. "Biological databases and their application."
Bioinformatics. Academic Press, 2022. 17-31.
• Hassani-Pak, Keywan, and Christopher Rawlings. "Knowledge discovery in biological databases for
revealing candidate genes linked to complex phenotypes." Journal of integrative bioinformatics 14.1
(2017).

Specific, weekly, reading lists will be provided for every topic covered.


Assessment
• In-course assessment (50%) and exam (50%)
• This is the second year of this course so there is one past exam.
Example exam questions will be provided later in the course and
there will also be a revision session

• You will have at least four weeks to complete the in-course
assessment. Detailed guidance will be provided in later weeks.

What Will You Learn On This Course?


• You will learn about the different databases
• You will learn about the different designs of these databases and how
to use this knowledge to exploit databases
• You will learn to query biological databases

On this course we explore the general concepts of biological
databases using the ten chosen examples. We do this in such a way
that the skills learned on the course can be adapted to explore the many
thousands of other databases that are also available online.


Overlap with Existing ‘Database’ Courses

The focus of this course is very much on biological databases: their common design principles and
how we can extract knowledge given the design. Given this focus, the expected overlap with existing
‘database’ courses will be small, a maximum of 5-10%. A modest amount of course overlap is
desirable as it allows integration of knowledge between courses.

• Introduction to web site and database design for drug discovery (BICH11007, SBS)
    Design a database and query it for drug data
• Molecular Modelling and Database Mining (PGBI11023, SBS)
    Query a molecular modelling database
    General database design
• Using R for Data Science, Functional Genomic Technologies
    Querying and accessing data within databases using R
• Bioinformatics Programming and System Management
    Building simple databases in SQL, query using Python

General Course Design


• Each week there will be a lecture and a practical session
• In the lecture we will review a topic and then this will be followed by
a practical session on this topic
• There will be a summary of each topic at the end of the topic
• Teaching material will be available on Learn


Setup
• In the Introductory week(s) we will use web examples
• After that we will switch to using Unix and R. Server accounts will be
provided for this, but you can also use your own laptop.
• We will go through setup procedures next week.
• This week we will only use a web browser...


Attendance & Advice


• A register will be taken for attendance each week
• It is important to attend each class, but if for some reason you cannot
attend, make sure you catch up as soon as possible afterwards
• Learn the course contents as you go along week by week
• Remember that we have 30 hours of taught time but 70 hours set
aside for personal study on this course. Make use of this study time!
• Coursework deadlines can be difficult to manage. Make sure you give
an appropriate amount of time to each piece of coursework. If you
are struggling with coursework, let someone know.


My Contact details
• I am the course organizer and the lecturer
• If you’d like to contact me please use email as I’m often not in my office. Please
put “BD” at the start of the subject line of your email so that emails relating
to this course can easily be identified.

Dr Simon Tomlinson
Senior Lecturer/Group Leader
Centre for Regenerative Medicine
Institute for Regeneration and Repair
School of Biological Sciences
University of Edinburgh
email: [email protected]


An Example Database -Ensembl

• Available at www.ensembl.org


The Ensembl Genomic Database

• This database is a collection of genomic information


• Basically genomic sequence is used to generate
an assembly
• Annotation is then mapped onto this assembly
• This information can then be queried...


Searching for a Gene

Mouse [an example organism]

Trp53 [an example gene]


Pick a Match...

Match (click)


Ensembl Record for Trp53 (top)


Lower part

Gene model (exons, introns etc)

Assembly


Click on Any Gene

More detailed information


Ensembl Has Extensive Help Available

Help from here


Seems Simple?
• Ensembl is actually an extremely complicated data resource
• It is probably the most complex data resource we will use on the course
• But the complexity is a bit hidden- you can perform simple queries
relatively easily and help is available
• The problem for us in bioinformatics or data science
• We need to work at genome scale- not look at single genes
• The complexities are very important!!

• Our purpose this week is not to fully understand Ensembl but to use
Ensembl to map out the challenges to understanding any database


First detail- what is Ensembl?


• It is what you obtain when you access www.ensembl.org
• But is the page we access “the Ensembl database”? No!
• The web page is an interface to the underlying database
• So can we say we “accessed Ensembl from www.ensembl.org”? Yes!
• Ensembl is also a project that builds interfaces such as the web page
as well as maintaining the underlying database.


Simplified Ensembl Overall Design


Query Page / Result Page: web pages served by the interface
    (client-side web: HTML, JavaScript etc)
Ensembl Web Interface: overall web interface/web site
    (SQL to the database and results back)
MySQL Ensembl: contains all the Ensembl data organized in
several MySQL databases. We will return to MySQL
in a later class!


Design is Modular- Adding Other Interfaces


Interfaces: Web Interface, Programming Languages, Direct SQL, BioMart Interface

All of these connect to the MySQL Ensembl Database

• Not every system offers such a rich set of interfaces, but Ensembl can be accessed in all of these ways
• In this design all the interfaces “see” exactly the same versions of the data stored in the database


Copies, Mirrors, Versions and Archives


• New versions of Ensembl are released periodically. Today we are using
Ensembl Release 110 (July 2023)
• Releases update the annotation, but may also bring in a new genome assembly
(fragments are assembled into genomic sequence and co-ordinates) and new
software
• The underlying MySQL database can be copied to different locations. As long as
the versions match, queries against a copy will return the same results as queries
against the main database
• Mirrors of the whole Ensembl site are available, which duplicate all of Ensembl at
another location: https://www.ensembl.org/info/about/mirrors.html
• Ensembl has a range of archives:
https://www.ensembl.org/info/website/archives/index.html
These are working copies of earlier releases, so old annotation and software can
still be accessed if required


But What About the Data?


• We cannot explore every possible source of data in Ensembl- this
would take a whole course in itself
• However, the complexity and richness of this resource is what makes
it worth the effort to be able to query in the first place
• So on the course we extract general principles from examples. If we
can understand how the system works for one query, we can make
similar queries. This approach is more powerful if we also have some idea
of how the overall system works.


Searching for a Gene -Revisited


• We put in “Mouse” as a species
• But what is “Mouse”?
• Obviously it is the species, but it
is not an exact scientific species
name.


Ensembl Mouse- Strains & Similar Names


• “Mouse” is a short name for the
precise species name, Mus musculus
• There is actually a reference strain,
C57BL/6J, that is used for the assembly
• Annotations for other strains are then imported
onto the reference
• Note that in our search, we would not have
searched “Mouse Lemur” as this is not from
the species Mus musculus.

So precise names and their meanings are
very important if you want to get the correct
results!


We searched for the gene “Trp53”


• Trp53 is a standard gene name defined
by the mouse nomenclature committee
(http://www.informatics.jax.org/mgihome/nomen/)
• This committee standardized the naming, so one gene
has one single name and only one gene has that name
• But in practice old names are still used, and so gene
names also have ‘aliases’
• Trp53 is actually the orthologue of the human
cancer gene TP53, which still gets called by the alias p53
• p53 refers to the protein for both mouse and human
• Note that nomenclature is set by different committees in
mouse and human
• Note that if you search for the protein/alias name as a
gene name you find lots of related genes
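
The alias problem can be sketched in a few lines of Python. The alias table below is hand-made for illustration and holds only the Trp53/TP53/p53 relationship described above; a real lookup would draw on the nomenclature resources themselves. The function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical alias table for illustration only.
# Official symbol -> aliases still seen in practice.
ALIASES = {
    "Trp53": ["p53"],  # mouse gene
    "TP53": ["p53"],   # human orthologue
}


def build_alias_index(aliases):
    """Map each lower-cased name (official or alias) to the official symbols it may mean."""
    index = defaultdict(set)
    for official, names in aliases.items():
        index[official.lower()].add(official)
        for name in names:
            index[name.lower()].add(official)
    return index


def resolve(name, index):
    """Return the sorted official symbols that a query name could refer to."""
    return sorted(index.get(name.lower(), set()))


index = build_alias_index(ALIASES)
print(resolve("Trp53", index))  # an official name gives a single match
print(resolve("p53", index))    # an alias is ambiguous between species
```

Matching is case-insensitive here by choice, because aliases are written inconsistently; an official name resolves to one symbol while the alias resolves to more than one.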


So in this simple query...


• I used a gene name that I knew matched the formal gene nomenclature for
mouse (note most labs still call this gene p53)
• I used a simple species name, knowing that it mapped to the
Mus musculus reference strain used by Ensembl (C57BL/6J)

• All this seems ‘trivial’ because we could search through the list of matches
and pick out the “correct” one
• But suppose we have 10,000 queries to make automatically. If we want to
get the required results back, we need to use the correct query name,
otherwise we risk pulling back the wrong gene information


Stable Identifiers
• You may notice that Ensembl refers to the mouse Trp53 gene as
ENSMUSG00000059552. This identifier uniquely identifies this gene in Ensembl
(although it may carry different version numbers)
• In a way we are thinking about this annotation in reverse to Ensembl’s design.
Ensembl takes the assembly and maps annotation onto it, creating gene
identifiers. These genes are then mapped to known genes.
• Identifiers are fixed to “gene” sequences in the genome, but the gene names they
map to might change as knowledge grows or the nomenclature changes.
• So these stable gene IDs (related to accession numbers) are constant between
versions of Ensembl and unique within the database. Ensembl has other IDs to
represent proteins or transcripts, for example.
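
A stable ID can be taken apart programmatically. The sketch below assumes the layout described in Ensembl's documentation: "ENS", an optional species prefix (MUS for mouse, none for human), a feature-type letter, an 11-digit number and an optional ".version" suffix. Only four feature types are handled, and the version number used in the example is made up.

```python
import re

# Assumed layout: ENS + species prefix + feature type + 11 digits + optional .version
STABLE_ID = re.compile(r"^ENS([A-Z]*?)(G|T|P|E)(\d{11})(?:\.(\d+))?$")

# Only these four feature types are handled in this sketch.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}


def parse_stable_id(stable_id):
    """Split a stable ID into species prefix, feature type, number and version."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError("not a recognised stable ID: {!r}".format(stable_id))
    species, feature, number, version = match.groups()
    return {
        "species_prefix": species,  # "" for human, "MUS" for mouse
        "feature": FEATURE_TYPES[feature],
        "number": number,
        "version": int(version) if version else None,
    }


print(parse_stable_id("ENSMUSG00000059552"))
print(parse_stable_id("ENSMUSG00000059552.1"))  # ".1" is a made-up version
```

Checking identifiers like this before sending a large batch of queries is a cheap way to catch gene names that have slipped into a list of IDs.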


Unique Identifiers and SQL queries


[Diagram: Query Page / Result Page, Ensembl Web Interface, MySQL Ensembl database]

• Remember that Ensembl is built from a MySQL database
• Accessions in this case can act as SQL primary keys
• So we can uniquely identify records using the Gene ID primary key
• A query would look something like

select * from genetable where GeneID = 'ENSMUSG00000059552';

• The gene name is a foreign key in the record with the value of
“Trp53”
• We can use this key to go to the MGI nomenclature to get other
useful information such as aliases
• Of course, we can make these queries ourselves or they can be
made through the web interface, but they all work in the same way
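
The primary key and foreign key idea can be tried out without a MySQL server. This sketch uses Python's built-in sqlite3 module with a toy schema: genetable and GeneID come from the query above, while the nomenclature table and the other column names are invented for illustration. The real Ensembl schema is considerably more complex.

```python
import sqlite3

# In-memory toy database; table layout is hypothetical, not the Ensembl schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nomenclature (
    GeneName TEXT PRIMARY KEY,
    Alias    TEXT
);
CREATE TABLE genetable (
    GeneID   TEXT PRIMARY KEY,
    GeneName TEXT REFERENCES nomenclature(GeneName),
    Chr      TEXT
);
""")
conn.executemany("INSERT INTO nomenclature VALUES (?, ?)", [
    ("Trp53", "p53"),
    ("Sox2", None),
])
conn.executemany("INSERT INTO genetable VALUES (?, ?, ?)", [
    ("ENSMUSG00000059552", "Trp53", "11"),
    ("ENSMUSG00000074637", "Sox2", "3"),
])

# Look up one record by its primary key.
row = conn.execute(
    "SELECT * FROM genetable WHERE GeneID = ?", ("ENSMUSG00000059552",)
).fetchone()

# Follow the foreign key (the gene name) to the nomenclature table.
alias = conn.execute(
    "SELECT n.Alias FROM genetable g "
    "JOIN nomenclature n ON n.GeneName = g.GeneName "
    "WHERE g.GeneID = ?",
    ("ENSMUSG00000059552",),
).fetchone()[0]

print(row, alias)
```

The "?" placeholders keep the query text separate from the values, which matters once the IDs come from a file rather than being typed by hand.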


Ensembl-Gene Locations
• ENSMUSG00000059552, which maps to the Trp53 gene, also has a
location in the genome: 11:69471185-69482699:1
• This is on chromosome 11, starting at 69471185 and ending at 69482699
on the positive chromosomal strand
• Almost all genes can be mapped to a unique location on the assembly
• Other Ensembl features (promoters, enhancers, genes, transcripts etc) can
be similarly mapped

[Diagram: chr11, 69471185 to 69482699 (*simplified model)]
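
A location string of this form is easy to split into its parts. The sketch below assumes the chromosome:start-end:strand layout shown above and treats the coordinates as 1-based and inclusive when computing a length, which is an assumption on my part rather than something stated in these slides. The names are illustrative.

```python
from typing import NamedTuple


class Location(NamedTuple):
    chromosome: str
    start: int
    end: int
    strand: int  # 1 for the positive strand, -1 for the negative strand


def parse_location(text):
    """Parse 'chromosome:start-end:strand' into a Location."""
    chromosome, span, strand = text.split(":")
    start, end = span.split("-")
    return Location(chromosome, int(start), int(end), int(strand))


def length(loc):
    """Length in base pairs, assuming 1-based inclusive coordinates."""
    return loc.end - loc.start + 1


trp53 = parse_location("11:69471185-69482699:1")
print(trp53, length(trp53))
```

The chromosome is kept as a string because not every chromosome name is a number (X, Y and MT, for example).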


BioMart Ensembl Interface

• BioMart offers an interface to the Ensembl database that allows
detailed queries using multiple search terms, so we can search with a
list of genes, for example. See
http://www.ensembl.org/biomart/martview/

• Here I have selected mouse genes and
the latest version of Ensembl
• Filters are used to restrict matches
to a list of IDs
• Attributes are what you’d like back,
eg gene names
• Set the Filters and Attributes and then
click the Results button to get a results
file to download and load into a
spreadsheet
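
The Filters and Attributes idea can be mimicked on a small in-memory list. This is a toy illustration of the concept only, not the BioMart API; the function name and field names are invented, and the three records are copied from the results tables later in these slides.

```python
# Toy records; values taken from the results tables in these slides.
GENES = [
    {"gene_id": "ENSMUSG00000003032", "chromosome": "4",
     "name": "Klf4", "biotype": "protein_coding"},
    {"gene_id": "ENSMUSG00000074637", "chromosome": "3",
     "name": "Sox2", "biotype": "protein_coding"},
    {"gene_id": "ENSMUSG00000105265", "chromosome": "3",
     "name": "Sox2ot", "biotype": "lncRNA"},
]


def mart_query(records, filters, attributes):
    """Keep records that pass every filter and return only the requested attributes."""
    results = []
    for record in records:
        # A filter maps a field name to the set of values allowed for that field.
        if all(record[field] in allowed for field, allowed in filters.items()):
            results.append(tuple(record[name] for name in attributes))
    return results


rows = mart_query(
    GENES,
    filters={"gene_id": {"ENSMUSG00000074637", "ENSMUSG00000105265"}},
    attributes=["name", "biotype"],
)
print(rows)
```

Filters decide which rows come back and attributes decide which columns, which is the same split as WHERE and SELECT in SQL.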


Using BioMart to Annotate Genes

• Search with these IDs:

ENSMUSG00000047751
ENSMUSG00000074637
ENSMUSG00000024406
ENSMUSG00000055148
ENSMUSG00000003032
ENSMUSG00000022346
ENSMUSG00000037169
Trp53
ENSMUSG00000105265

• Obtain their gene name, Ensembl ID, start, stop, chromosome and whether they are protein or RNA
coding
• Compare to my table that follows. Are there any differences, and why do you think this is?


1st Attempt
Ensembl Gene ID      Chr   Gene Start (bp)   Gene End (bp)   Gene Biotype     Gene Name
ENSMUSG00000003032   4     55527143          55532466        protein_coding   Klf4
ENSMUSG00000022346   15    61985391          61990374        protein_coding   Myc
ENSMUSG00000024406   17    35506018          35510776        protein_coding   Pou5f1
ENSMUSG00000037169   12    12936096          12941914        protein_coding   Mycn
ENSMUSG00000047751   7     139943789         139945112       protein_coding   Utf1
ENSMUSG00000055148   8     72319033          72321656        protein_coding   Klf2
ENSMUSG00000059552   11    69580359          69591873        protein_coding   Trp53
ENSMUSG00000074637   3     34650005          34652461        protein_coding   Sox2


2nd Attempt
Gene stable ID       Gene start (bp)   Gene end (bp)   Chromosome   Gene name   Gene type
ENSMUSG00000003032   55527143          55532466        4            Klf4        protein_coding
ENSMUSG00000022346   61857240          61862223        15           Myc         protein_coding
ENSMUSG00000024406   35816915          35821669        17           Pou5f1      protein_coding
ENSMUSG00000037169   12986094          12991915        12           Mycn        protein_coding
ENSMUSG00000047751   139523702         139525025       7            Utf1        protein_coding
ENSMUSG00000055148   73072877          73075500        8            Klf2        protein_coding
ENSMUSG00000059552   69471185          69482699        11           Trp53       protein_coding
ENSMUSG00000074637   34704554          34706610        3            Sox2        protein_coding
ENSMUSG00000105265   34158419          34736768        3            Sox2ot      lncRNA
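
Comparing two result sets by eye does not scale, so here is a small Python sketch that does the comparison for the two attempts. The start and end values are copied from the two tables; the function name and the way the output is grouped are my own choices. It reports what differs and leaves the question of why to you.

```python
# Gene ID -> (start, end), copied from the 1st and 2nd attempt tables.
FIRST = {
    "ENSMUSG00000003032": (55527143, 55532466),
    "ENSMUSG00000022346": (61985391, 61990374),
    "ENSMUSG00000024406": (35506018, 35510776),
    "ENSMUSG00000037169": (12936096, 12941914),
    "ENSMUSG00000047751": (139943789, 139945112),
    "ENSMUSG00000055148": (72319033, 72321656),
    "ENSMUSG00000059552": (69580359, 69591873),
    "ENSMUSG00000074637": (34650005, 34652461),
}
SECOND = {
    "ENSMUSG00000003032": (55527143, 55532466),
    "ENSMUSG00000022346": (61857240, 61862223),
    "ENSMUSG00000024406": (35816915, 35821669),
    "ENSMUSG00000037169": (12986094, 12991915),
    "ENSMUSG00000047751": (139523702, 139525025),
    "ENSMUSG00000055148": (73072877, 73075500),
    "ENSMUSG00000059552": (69471185, 69482699),
    "ENSMUSG00000074637": (34704554, 34706610),
    "ENSMUSG00000105265": (34158419, 34736768),
}


def compare(first, second):
    """Group gene IDs by whether their coordinates match between two result sets."""
    shared = [gene for gene in first if gene in second]
    unchanged = sorted(g for g in shared if first[g] == second[g])
    changed = sorted(g for g in shared if first[g] != second[g])
    only_second = sorted(set(second) - set(first))
    return unchanged, changed, only_second


unchanged, changed, only_second = compare(FIRST, SECOND)
print("same coordinates:", unchanged)
print("different coordinates:", changed)
print("only in 2nd attempt:", only_second)
```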
