L01 Solved

Computational Biology

Introduction to Computational Biology


2024/2025

LAB#1 – Biological Databases

The goal of this Practical Lesson is to introduce the different types of biological information
present in several resources and databases, namely (A) NCBI, (B) GenBank, (C) UniProt, (D)
BRENDA, and (E) KEGG.
These exercises should help you to browse the available information and guide you through the
main procedure. You are invited to further explore them and check additional features.

Discussion and Reflection:


• Students should discuss how these databases complement each other.
• Students should reflect on how these resources might be used in future computational
biology research.

NCBI
NCBI (National Center for Biotechnology Information) at https://www.ncbi.nlm.nih.gov/
offers a wide range of resources that are crucial for research in bioinformatics, genomics, and
other areas of computational biology. These resources collectively support a wide array of
bioinformatics tasks, from literature searches to complex genomic analyses.

Here is a brief overview of some key databases:


1. GenBank:
o Purpose: A comprehensive genetic sequence database and an annotated
collection of all publicly available DNA sequences. GenBank is part of the
International Nucleotide Sequence Database Collaboration, which comprises
the DNA DataBank of Japan (DDBJ), the European Molecular Biology
Laboratory (EMBL), and GenBank at NCBI.
o Use: Widely used by researchers for genetic analysis and comparison.
2. Gene:
o Purpose: A database focusing on genomes that have been completely
sequenced and that have an active research community to contribute gene-
specific data.
o Use: Explore gene-specific data, such as sequences, phenotypes, pathways, and
related genetic disorders.
3. Nucleotide:
o Purpose: A repository of nucleotide sequences from a variety of sources,
including GenBank, RefSeq, and others.
o Use: Retrieve DNA and RNA sequences for genes, genomes, and transcripts
across different species.
4. Protein:
o Purpose: A database that includes protein sequence records from a variety of
sources, including GenPept, RefSeq, Swiss-Prot, PIR, PRF, and PDB.
o Use: Access protein sequences, domain structures, functional annotations, and
links to related literature.
5. SNP (Single Nucleotide Polymorphism):
o Purpose: A database of genetic variation, including single nucleotide
polymorphisms and other variations.
o Use: Study genetic diversity, associations with diseases, and population
genetics.
6. dbGaP (Database of Genotypes and Phenotypes):
o Purpose: Contains data from studies that investigate the interaction between
genotypes and phenotypes in humans.
o Use: Access data for genetic association studies, including clinical information
and genotypic data.
7. Taxonomy:
o Purpose: A database that provides information on the classification and
nomenclature of organisms.
o Use: Explore taxonomic data for various species, including evolutionary
relationships and hierarchy.
8. OMIM (Online Mendelian Inheritance in Man):
o Purpose: A catalog of human genes and genetic disorders.
o Use: Research information on the relationship between genes and diseases,
particularly for Mendelian disorders.
9. PubMed:
o Purpose: A comprehensive database of biomedical literature, including
articles from life sciences journals and online books.
o Use: Search for scientific papers, review articles, and clinical studies relevant
to a particular topic or gene.
10. Bookshelf:
o Purpose: A collection of freely accessible, full-text books and documents in
life science and healthcare.
o Use: Reference textbooks, reports, and guidelines for in-depth understanding
of biological concepts.

A key tool:
1. BLAST (Basic Local Alignment Search Tool):
o Purpose: A tool for comparing nucleotide or protein sequences against
databases to find regions of similarity.
o Use: Identify homologous sequences, annotate genes, and find evolutionary
relationships.

Genomes
The GenBank database is available at https://www.ncbi.nlm.nih.gov/genbank/ and is hosted by the
National Center for Biotechnology Information (NCBI), which also hosts other relevant
databases.
I) GenBank data
Search for one specific ID, e.g., AB001981:
a. How many genes are contained in this entry?
b. For which organism?
c. What information is available in sections HEADER and FEATURES?

HEADER contains general information about the entry, such as the organism and publications;
FEATURES describes specific regions of the DNA sequence, for example CDS (Coding
DNA Sequence); and ORIGIN contains the actual nucleotide sequence.

ORGANISM Columba livia
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Archosauria;
Dinosauria; Saurischia; Theropoda; Coelurosauria; Aves; Neognathae; Columbiformes;
Columbidae; Columba.
There are two CDS, and therefore two genes: the alpha-A and alpha-D globin genes.
Click on the first CDS, which normally starts with a START codon and ends with a STOP codon
(see genetic code). What happens? Which interval is highlighted? Check the intron-exon (what
are these?) structure.
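
The START/STOP rule and the intron-exon structure can also be checked with a few lines of code. The following is a minimal sketch, assuming the standard genetic code (ATG as start; TAA, TAG and TGA as stops) and GenBank-style 1-based, inclusive exon intervals. The toy sequence and its exon coordinates are hypothetical and are not taken from AB001981.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def splice(sequence, exons):
    """Join exon intervals given as 1-based, inclusive (start, end) pairs."""
    return "".join(sequence[start - 1:end] for start, end in exons)


def looks_like_cds(cds):
    """True if cds starts with ATG, ends with a stop codon, and has a length divisible by 3."""
    cds = cds.upper()
    return (
        len(cds) % 3 == 0
        and cds.startswith("ATG")
        and cds[-3:] in STOP_CODONS
    )


# Hypothetical toy gene: two exons (uppercase) separated by one intron (lowercase).
toy_gene = "ATGGCC" + "gtaagtcag" + "AAATGA"
toy_exons = [(1, 6), (16, 21)]

coding = splice(toy_gene, toy_exons)
print(coding, looks_like_cds(coding))
```

For a real entry, the exon intervals would be read from the join(...) location of the CDS feature.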

Source: Wikipedia.org

Click on “FASTA” at the top of the page. What is the structure of this file?

What happened to the genes? Which part of the original entry was converted?
Only the ORIGIN block was converted to FASTA format; we can no longer identify the gene
positions without further external information.
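
To make the FASTA structure concrete (one header line starting with ">", followed by one or more sequence lines), here is a minimal parsing sketch in plain Python. The function name parse_fasta and the toy record are hypothetical and serve only as illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record: the sequence is wrapped over two lines.
example = ">toy_record example sequence\nACGTAC\nGTTA\n"
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Note that nothing in the header or the sequence lines records where the CDS features were, which is why the gene positions are lost.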
On the previous page, click on Send to:

Save the FASTA file to your desktop (or the whole record, if you chose the GenBank format).
Hint: change the filename to AB001981.fasta.
Open it with any text editor (be aware of possible problems with line and paragraph breaks!)
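
The line-break problems mentioned above usually come from different end-of-line conventions (CRLF on Windows, LF on Unix-like systems, CR on very old Mac files). A minimal sketch of how they could be normalized before further processing; the function name and the sample content are hypothetical.

```python
def normalize_newlines(text):
    """Convert CRLF and lone CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


raw = ">toy\r\nACGT\r\nTTGA\r\n"  # hypothetical file content with CRLF endings
clean = normalize_newlines(raw)
print(clean.split("\n"))
```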
Finally, click on Graphics for an interactive visualization of the CDS. Explore the available
features such as zoom, and links for external information and tools.

II) GenBank search
Search entries that contain the terms “human” and “insulin”
How many results do you obtain? (Solution: 18111) Check the retrieved entries: do they all
correspond to human sequences? Or to insulin? Give examples.
By default, each term is searched in all fields of every entry. To filter the results
efficiently, use Advanced Search.

How many entries do you obtain now? (Solution: 5548)

Try to change the Field corresponding to “insulin”, e.g., “Keyword”, “Protein Name”, etc.
Which entry corresponds to the human insulin?

In FEATURES we can check additional information:
• SOURCE: /map="11p15.5" indicates that the sequence lies on chromosome 11, more
precisely on the short arm (p), in region 1, band 5, sub-band 5.

https://ghr.nlm.nih.gov/primer/howgeneswork/genelocation
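
A cytogenetic location such as 11p15.5 can be split into its parts mechanically. The sketch below assumes the simple form chromosome + arm + region/band digits + optional sub-band; ranges such as 11p15.5-p15.4 are not handled, and the function name parse_locus is hypothetical. The two digits before the dot are kept together and are read as region 1, band 5.

```python
import re

# chromosome (number, X or Y), arm (p or q), region+band digits, optional sub-band
LOCUS = re.compile(r"^(\d+|X|Y)([pq])(\d+)(?:\.(\d+))?$")


def parse_locus(locus):
    """Split a simple cytogenetic location into its components."""
    match = LOCUS.match(locus)
    if match is None:
        raise ValueError(f"unsupported location: {locus!r}")
    chromosome, arm, region_band, sub_band = match.groups()
    return {
        "chromosome": chromosome,
        "arm": "short" if arm == "p" else "long",
        "region_band": region_band,
        "sub_band": sub_band,  # None when no sub-band is given
    }


print(parse_locus("11p15.5"))
```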

We can use the field codes directly in our “Search Builder”; they are listed at:
https://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
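
The same field-tagged queries can be assembled in code and sent to NCBI through the E-utilities ESearch endpoint. This is a minimal sketch that only builds the query string and the request URL, without contacting NCBI. The helper names tagged and esearch_url are hypothetical, and the choice of the Organism and Title fields is an assumption made for illustration (the exercise does not fix which fields give the quoted result counts).

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, field):
    """Attach an Entrez field tag to a term, quoting multi-word terms."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def esearch_url(query, db="nuccore", retmax=20):
    """Build (but do not send) an ESearch request URL for the given query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


query = " AND ".join([tagged("human", "Organism"), tagged("insulin", "Title")])
print(query)
print(esearch_url(query))
```

Here nuccore is the E-utilities name of the Nucleotide database; opening the printed URL in a browser returns the matching record identifiers as XML.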

We can combine logical operators NOT, AND, OR (Venn diagrams), but be careful not to
exclude potentially relevant hits.
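
The logical operators behave like set operations, which is what the Venn diagrams depict. A small sketch with Python sets; the accession labels are hypothetical placeholders for the hits of three separate searches.

```python
# Hypothetical hit lists of three single-term searches.
human = {"A1", "A2", "A3", "A4"}
insulin = {"A3", "A4", "A5"}
receptor = {"A4", "A6"}

both = human & insulin                      # human AND insulin
either = human | insulin                    # human OR insulin
filtered = (human & insulin) - receptor     # human AND insulin NOT receptor

print(sorted(both), sorted(either), sorted(filtered))
```

The last line shows the risk of NOT: entry A4 is dropped although it matches both wanted terms, simply because it also mentions the excluded one.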

III) Additional exercises


Find the following genes
1. Insulin of the Rat and of the Mouse
Hint: (rat [organism] OR mouse [organism]) AND insulin [keyword]
Select the following 4 genes and save the files:
1. Rat insulin-I (ins-1) gene
5,425 bp linear DNA
J00747.1 GI:204956
2. Rat insulin II gene (ins-2) with two introns
2,852 bp linear DNA
J00748.1 GI:204958
3. Mouse preproinsulin gene I
1,384 bp linear DNA
X04725.1 GI:52712
4. Mouse preproinsulin gene II
2,408 bp linear DNA
X04724.1 GI:52714

2. alpha-globin of organism Capra hircus
Hint: "capra hircus" [organism] AND "alpha globin" [title]
HBAI and HBAII

3. alpha-globin of all ruminants
ORGANISM Capra hircus Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae;
Caprinae; Capra.
Hint: Ruminantia [organism] AND "alpha globin" [title]
16 entries

4. Normal p53
Hint: p53 [protein name] AND human [organism]
Still too many? Add … NOT isolate [title]
Find only complete sequences: … AND complete [title]
Human p53 (TP53) gene, complete cds
20,303 bp linear DNA
Accession: U94788.1 GI: 3041866
Why is this gene important?
Note – You can also explore the Taxonomy Browser by clicking on the species.

V) External tools
Use http://www.bioinformatics.org/sms2/ to manipulate the GenBank entry saved previously to a
file. Explore tools such as: GenBank to FASTA; GenBank Feature Extractor; GenBank Trans
Extractor.

Proteins and Metabolism
This part will guide you through the databases UniProt (http://www.uniprot.org/), BRENDA
(http://www.brenda-enzymes.info/), and KEGG (http://www.genome.jp/kegg/), which allow
searching for proteins (including enzymes) and metabolic pathways.

I) UniProt database
We will work with the UniProt Knowledgebase (UniProtKB), in particular with Swiss-Prot,
whose entries are annotated and curated manually.
Search for “human insulin”. How many entries do you find? How many correspond to Swiss-
Prot (“show only reviewed”)? Which one is the “correct” one?
Like before, we should limit our search:
Restrict term “human” to organism; restrict term “insulin” to protein name (Reduces to
~203 entries)
In the “Query” box, exclude terms such as insulin-like and receptors using the Boolean search
as before: (protein_name:insulin) AND (organism_id:9606) NOT
(protein_name:receptor) NOT (protein_name:insulin-like) AND (reviewed:true)
(further reduces to ~16 entries).
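
The Boolean query above can also be submitted through the UniProt REST service instead of the web form. A minimal sketch that only builds the request URL for the UniProtKB search endpoint, without sending it; the helper name uniprot_search_url and the chosen format and size values are illustrative assumptions.

```python
from urllib.parse import urlencode

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"

# The query string from the exercise, unchanged.
query = (
    "(protein_name:insulin) AND (organism_id:9606) "
    "NOT (protein_name:receptor) NOT (protein_name:insulin-like) "
    "AND (reviewed:true)"
)


def uniprot_search_url(query, fmt="fasta", size=25):
    """Build (but do not send) a UniProtKB search URL."""
    return UNIPROT_SEARCH + "?" + urlencode({"query": query, "format": fmt, "size": size})


url = uniprot_search_url(query)
print(url)
```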
Click on P01308 and explore that entry.
How many references do you find? Why is this protein extensively studied?
In section Function, see Gene Ontology (GO) annotations, in particular the table with entries
Molecular Function and Biological process. See also Subcellular location.
Where do you find this protein in the cell? Is there any relationship with its function?

Click on the positions 1-24: that region will appear highlighted:

Insulin consists of two chains (A and B); the other peptides (the signal peptide and the
connecting C-peptide) are cleaved off during processing.
There is also information about the protein secondary structure: Helix (alpha-helix), Strand
(beta-strand) and Turn (click on a structure element to highlight it, or hover over it with the mouse).

The entry also provides links to other databases (Cross-references), such as GenBank (e.g. click
on J00265) and PDBe (e.g. click on 1B9E).
Advanced Search
Find proteins secreted by cells, with experimental confidence level and further limiting your
searches:

(or)
(cc_scl_term_exp:SL-0243) AND (existence:1) AND (length:[1 TO 80]) AND
(fragment:false) AND (organism_id:9606)
Download results as FASTA formatted files.
>sp|A0A0C5B5G6|MOTSC_HUMAN Mitochondrial-derived peptide MOTS-c OS=Homo sapiens OX=9606 GN=MT-RNR1 PE=1 SV=1
MRWQEMGYIFYPRKLR
>sp|O15263|DFB4A_HUMAN Defensin beta 4A OS=Homo sapiens OX=9606 GN=DEFB4A PE=1 SV=1
MRVLYLLFSFLFIFLMPLPGVFGGIGDPVTCLKSGAICHPVFCPRRYKQIGTCGLPGTKC
CKKP
>sp|P00995|ISK1_HUMAN Serine protease inhibitor Kazal-type 1 OS=Homo sapiens OX=9606 GN=SPINK1 PE=1 SV=2
MKVTGIFLLSALALLSLSGNTGADSLGREAKCYNELNGCTKIYDPVCGTDGNTYPNECVL
CFENRKRQTSILIQKSGPC
>sp|P02808|STAT_HUMAN Statherin OS=Homo sapiens OX=9606 GN=STATH PE=1 SV=2
MKFLVFAFILALMVSMIGADSSEEKFLRRIGRFGYGYGPYQPVPEQPLYPQPYQPQYQQY
TF
>sp|P02814|SMR3B_HUMAN Submaxillary gland androgen-regulated protein 3B OS=Homo sapiens OX=9606 GN=SMR3B PE=1 SV=2
MKSLTWILGLWALAACFTPGESQRGPRGPYPPGPLAPPQPFGPGFVPPPPPPPYGPGRIP
PPPPAPYGPGIFPPPPPQP
>sp|P0DMC3|ELA_HUMAN Apelin receptor early endogenous ligand OS=Homo sapiens OX=9606 GN=APELA PE=1 SV=1
MRFQQFLFAFFIFIMSLLLISGQRPVNLTMRRKLRKHNCLQRRCMPLHSRVPFP
(…)
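
Each header line in the download follows the UniProt FASTA convention: database (sp for Swiss-Prot), accession, entry name, protein name, then OS (organism), OX (taxon ID), GN (gene), PE (protein existence) and SV (sequence version). A minimal sketch that parses the first record above and checks it against the search criteria; the regular expression is an assumption that covers headers of this shape only, and the function name is hypothetical.

```python
import re

HEADER = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.+?)\s+OS=(?P<organism>.+?)\s+OX=(?P<taxon_id>\d+)"
    r"(?:\s+GN=(?P<gene>\S+))?\s+PE=(?P<existence>\d)\s+SV=(?P<version>\d+)$"
)


def parse_uniprot_header(line):
    """Return the fields of a UniProt FASTA header line as a dictionary."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError("not a recognised UniProt FASTA header")
    return match.groupdict()


# First record of the download shown above.
header = (
    ">sp|A0A0C5B5G6|MOTSC_HUMAN Mitochondrial-derived peptide MOTS-c "
    "OS=Homo sapiens OX=9606 GN=MT-RNR1 PE=1 SV=1"
)
sequence = "MRWQEMGYIFYPRKLR"

fields = parse_uniprot_header(header)
print(fields["accession"], fields["gene"], fields["existence"], len(sequence))
```

PE=1 and OX=9606 in the header correspond to the terms (existence:1) and (organism_id:9606) of the query, and the sequence length lies within the requested range of 1 to 80.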

II) BRENDA
Search pyruvate decarboxylase.
Click on EC number (format EC A.B.C.D with hierarchical classification) and the symbols
https://www.brenda-enzymes.org/enzyme.php?ecno=4.1.1.1

Search by EC-1.2.3.4
Explore the full hierarchy through Home → Explorer → Enzyme Classification
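
The hierarchical meaning of an EC number A.B.C.D can be illustrated by splitting it into its four levels; the first digit selects one of the seven top-level enzyme classes. This is a minimal sketch with hypothetical function names, and it accepts fully specified EC numbers only.

```python
# Top-level classes of the Enzyme Commission nomenclature.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def ec_levels(ec_number):
    """Return the four hierarchy levels, e.g. ['4', '4.1', '4.1.1', '4.1.1.1']."""
    parts = ec_number.replace("EC", "").strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected the form A.B.C.D, got {ec_number!r}")
    return [".".join(parts[:depth]) for depth in range(1, 5)]


def ec_class(ec_number):
    """Return the name of the top-level class of an EC number."""
    return EC_CLASSES[int(ec_levels(ec_number)[0])]


print(ec_levels("EC 4.1.1.1"), ec_class("EC 4.1.1.1"))
```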

Search for 2.7.1.1 – what type of enzymes have you found and what is their role? (Hint:
hexokinase)

III) KEGG database


Search “glycolysis” at http://www.genome.jp/kegg/
In KEGG PATHWAY, the full Description and the metabolic network (Pathway Map) appear;
click on the map and choose an organism (change pathway type button), e.g. Homo sapiens:
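
Choosing an organism corresponds to a change in the pathway identifier: the five-digit number stays the same, while the prefix changes from map (reference pathway) to the KEGG organism code, such as hsa for Homo sapiens. A minimal sketch of that convention, together with the URL pattern of the KEGG REST "get" operation; the function names are hypothetical and no request is sent.

```python
import re

# Prefix (map, ko, ec, rn or an organism code) followed by a five-digit number.
PATHWAY_ID = re.compile(r"^([a-z]{2,4})(\d{5})$")


def for_organism(pathway_id, organism_code):
    """Swap the prefix of a pathway ID, e.g. map00010 -> hsa00010."""
    match = PATHWAY_ID.match(pathway_id)
    if match is None:
        raise ValueError(f"not a pathway identifier: {pathway_id!r}")
    return organism_code + match.group(2)


def kegg_get_url(entry_id):
    """Build (but do not send) a KEGG REST URL that retrieves one entry."""
    return "https://rest.kegg.jp/get/" + entry_id


human_glycolysis = for_organism("map00010", "hsa")
print(human_glycolysis, kegg_get_url(human_glycolysis))
```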

Each reaction carries information about the corresponding enzyme(s), shown as EC numbers on
the edges, for example 2.7.1.40:

…and also about the corresponding metabolites, for example Phosphoenolpyruvate:

In http://www.genome.jp/kegg/pathway.html click on 6. Human Diseases.


In colorectal cancer, the full interaction network is described:

The pathway entry has links to Disease:
Information about drugs is also available in 7. Drug Development, for example, Penicillins
(http://www.genome.jp/kegg/pathway/map/map07011.html):
