Chapter 1. Introduction and biological databases
19/2/2025
Definition of bioinformatics
• Bioinformatics is an interdisciplinary research area at the interface between
computer science and biological science
• Bioinformatics involves the technology that uses computers for storage,
retrieval, manipulation and distribution of information related to biological
macromolecules including DNA, RNA and protein
Central dogma in molecular biology
3 Bioinformatics and functional genomics
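The flow of information in the central dogma (DNA to RNA to protein) can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical, and the codon table is deliberately partial: it covers only the codons used in this example, not the full genetic code.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "AUG": "M",  # methionine, start codon
    "GCU": "A",  # alanine
    "AAA": "K",  # lysine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop codon
}

def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding (sense) strand: replace T with U."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical 15-nucleotide coding sequence, for illustration only.
mrna = transcribe("ATGGCTAAATGGTAA")
protein = translate(mrna)
print(mrna, protein)
```

A real analysis would use a complete codon table, for example from an established library, rather than this five-codon subset.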
History of DNA sequencing
• The structure of DNA was described in 1953 by Watson and Crick
• The first nucleic acid sequence, a yeast alanine tRNA, was read in 1965
History of DNA sequencing
• Rapid DNA sequencing was developed by Fred Sanger in 1977
History of genome sequencing
• Bacteriophage PhiX174
  • First sequenced genome, by Sanger sequencing
  • DNA genome consists of 5386 nucleotides and 11 genes
  • Published in 1977
• Haemophilus influenzae
  • First sequenced free-living organism
  • DNA genome consists of 1.8 million nucleotides and 1800 genes
  • Published in 1995
6 [Link] PDB-101: Molecule of the Month: Bacteriophage phiX174
History of genome sequencing
• Saccharomyces cerevisiae
  • First sequenced eukaryote
  • Genome consists of 12 million nucleotides and 6000 genes
  • Published in 1996, took 7 years to finish
• Homo sapiens
  • The Human Genome Project
  • Genome consists of approximately 3.25 billion nucleotides and 21000 genes
  • Initiated in 1990, finished 13 years later
  • Joint effort by 200 research groups, cost estimated at $3 billion
  • First gapless human genome published in March 2022
Definition of bioinformatics
• Bioinformatics deals with massive amounts of nucleotide and amino acid sequence data.
Subfields of bioinformatics
9 Essential bioinformatics
Bioinformatics software: two cultures
Command line
Bioinformatics vs. computational biology
• Bioinformatics (computational molecular biology) is limited to sequence, structural, and functional analysis of genes and genomes (DNA) and their corresponding products (RNA, proteins)
• Computational biology encompasses all biological areas that involve computation, e.g., mathematical modeling of ecosystems and population dynamics, and does not necessarily involve biological macromolecules
Reproducible research in bioinformatics
• A workflow should be well documented in a lab notebook or electronic lab notebook
• Information stored on a computer should be well organized
• Data should be made available to others, with some exceptions for sensitive data
• Metadata is important (e.g., the location from which a bacterium was isolated)
• Databases used in a bioinformatics analysis should be documented; the version number and date of access should also be recorded
• Software should be documented
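One lightweight way to record which database release and software versions an analysis used is to write a small provenance file next to the results. The sketch below is only an illustration: the function name, the field names and the example values (database name, version string, tool name) are assumptions, not a community standard.

```python
import json
from datetime import date

def provenance_record(database: str, version: str, accessed: date, software: dict) -> str:
    """Serialize database and software versions as a JSON string."""
    record = {
        "database": database,
        "database_version": version,
        "date_accessed": accessed.isoformat(),
        "software": software,
    }
    return json.dumps(record, indent=2, sort_keys=True)

# Hypothetical values, for illustration only.
text = provenance_record(
    "ExampleDB", "release 1", date(2025, 2, 19), {"example_aligner": "1.0.0"}
)
print(text)
```

Saving this text alongside the analysis output keeps the database version and access date attached to the results they describe.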
Biological databases
• Database: a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria
• A database consists of computer hardware and software for data management
• Entry: a record in the database, containing a number of fields that hold the actual data items (values)
Types of databases
• Flat file format: a long text file that contains many entries separated by a delimiter such as a vertical bar (|). Within each entry, fields are separated by tabs or commas (,). The whole database is in effect a single table.
• To search a flat file, a computer has to read through the entire file. Searching becomes more efficient once the data are organized in a data structure (a data management system).
• Two types of data management systems:
  • Relational databases
  • Object-oriented databases
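The difference between scanning a flat file and using a data structure can be sketched in a few lines of Python. The flat-file contents, IDs and function names below are made up for illustration; a real data management system does considerably more than build one lookup table.

```python
# A tiny flat file: entries separated by "|", fields separated by commas.
# IDs, organism labels and sequences are invented for this example.
FLAT_FILE = "ID001,organism A,ATGC|ID002,organism B,GGCA|ID003,organism C,TTAG"

def parse_entries(text):
    """Split the flat file into entries, then each entry into fields."""
    return [entry.split(",") for entry in text.split("|") if entry]

def linear_search(entries, entry_id):
    """Scan every entry until the ID matches; cost grows with file size."""
    for fields in entries:
        if fields[0] == entry_id:
            return fields
    return None

def build_index(entries):
    """Map each ID to its entry so later lookups avoid a full scan."""
    return {fields[0]: fields for fields in entries}

entries = parse_entries(FLAT_FILE)
index = build_index(entries)
print(linear_search(entries, "ID003"))  # found by reading through the file
print(index["ID003"])                   # found directly through the index
```

Both lookups return the same entry; the index simply trades some memory and a one-time build step for faster repeated searches.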
Relational databases
• Relational databases use a set of tables to organize data
[Figure: example relational table with labels for relation, entity, field, attribute and value]
Relational databases
• Relational databases use a set of tables to organize data
• Relational databases can be created and queried with Structured Query Language (SQL)
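As a minimal sketch of a relational design, the example below uses Python's built-in sqlite3 module to create two linked tables and query them with SQL. The table names, column names and rows are hypothetical and chosen only to show how a shared key relates one table to another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE gene (
    gene_id     INTEGER PRIMARY KEY,
    symbol      TEXT NOT NULL,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
);
""")

# Invented rows, for illustration only.
conn.executemany("INSERT INTO organism VALUES (?, ?)",
                 [(1, "organism A"), (2, "organism B")])
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                 [(10, "geneX", 1), (11, "geneY", 1), (12, "geneZ", 2)])

# Join the two tables through the shared organism_id column.
rows = conn.execute(
    "SELECT gene.symbol FROM gene "
    "JOIN organism ON gene.organism_id = organism.organism_id "
    "WHERE organism.name = ? ORDER BY gene.symbol",
    ("organism A",),
).fetchall()
print(rows)
conn.close()
```

Splitting the data into an organism table and a gene table avoids repeating the organism name in every gene row, which is the main point of using a set of tables instead of one flat table.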
Object-oriented database
• Object-oriented databases store data as objects that are linked by a set of pointers defining predetermined relationships between objects
• Object-oriented databases can be implemented in programming languages such as C++
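The idea of objects linked by pointers can be sketched with plain Python classes, where object references play the role of pointers. This is not an object-oriented database system, only an in-memory illustration; the class names and example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Protein:
    name: str

@dataclass
class Gene:
    symbol: str
    # References to the Protein objects this gene encodes.
    products: List[Protein] = field(default_factory=list)

@dataclass
class Organism:
    name: str
    # References to the Gene objects that belong to this organism.
    genes: List[Gene] = field(default_factory=list)

# Invented objects, for illustration only.
protein = Protein("protein X")
gene = Gene("geneX", products=[protein])
organism = Organism("organism A", genes=[gene])

# Follow the links organism -> gene -> protein without any table join.
product_names = [p.name for g in organism.genes for p in g.products]
print(product_names)
```

Navigation follows the predefined links directly, which contrasts with the relational approach of matching key values across tables.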
Centralized databases
GenBank
Biological databases
• Microorganisms and cell lines:
  • BacDive (DSMZ): [Link]
• 16S rRNA genes:
  • Ribosomal Database Project: [Link]
  • SILVA ribosomal RNA database project: [Link]
  • Greengenes database: [Link]
  • Earth Microbiome Project: [Link]
• Protein:
  • [Link]
  • Protein Data Bank: [Link]
Types of data stored in databases
Information retrieval from biological databases
• NCBI developed and maintains Entrez, a biological database retrieval system that allows text-based searches ([Link]) across several types of data:
  • Sequences
  • Structures
  • Taxonomy
  • Abstracts
  • Full papers
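Text-based Entrez searches can also be issued programmatically through NCBI's E-utilities web interface. The sketch below only builds the search URL and sends no request; the helper name and the example query are assumptions, while `db`, `term` and `retmax` are ESearch parameters.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a text query against one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

# Hypothetical query: nucleotide records for one organism, at most 5 hits.
url = esearch_url("nucleotide", "Haemophilus influenzae[Organism]", retmax=5)
print(url)
```

Keeping URL construction separate from the network call makes the query easy to log, which also helps with documenting how data were retrieved.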
Information retrieval from biological databases
RefSeq database
• Freely available, non-redundant, curated database of nucleotide, genome and protein sequences that provides a single entry for each biological molecule from major organisms
GenBank sequence format
• Search output is a flat file that contains 3 sections: header, features and sequence entry
• Each field has a unique identifier for easy indexing by computer software
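The three-part layout of a GenBank flat file can be made concrete with a small parser sketch. It assumes the usual keywords that open each part (`FEATURES` for the feature table, `ORIGIN` for the sequence, `//` as the record terminator). The record below is a simplified, hypothetical example rather than a real entry, and the function name is illustrative.

```python
# Simplified, hypothetical record; real entries carry many more fields.
RECORD = """\
LOCUS       EXAMPLE01   12 bp    DNA
DEFINITION  Hypothetical example record.
ACCESSION   EXAMPLE01
FEATURES             Location/Qualifiers
     source          1..12
     CDS             1..12
ORIGIN
        1 atggctaaat aa
//
"""

def split_genbank(record):
    """Return header lines, feature lines and the sequence of one record."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
            continue  # the ORIGIN line itself holds no sequence
        elif line.startswith("//"):
            break  # end of record
        sections[current].append(line)
    # Sequence lines carry position numbers and spaces; keep letters only.
    sequence = "".join(
        ch for ln in sections["sequence"] for ch in ln if ch.isalpha()
    )
    return sections["header"], sections["features"], sequence

header, features, sequence = split_genbank(RECORD)
print(len(header), len(features), sequence)
```

For real data, an established parser is a safer choice than this sketch, which ignores multi-line fields and feature qualifiers.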
GenBank sequence format - Header
GenBank sequence format - Features
GenBank sequence format - Sequence
FASTA format
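In FASTA format, each record starts with a header line beginning with ">" and is followed by one or more lines of sequence. A minimal parsing sketch, with a hypothetical function name and made-up records, might look like this:

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the sequence lines of each record into one string.
    return {name: "".join(parts) for name, parts in records.items()}

# Invented records, for illustration only.
FASTA_TEXT = """\
>seq1 hypothetical example
ATGGCTAAA
TGGTAA
>seq2 another hypothetical example
MAKW
"""

records = parse_fasta(FASTA_TEXT)
print(records)
```

The same layout serves for both nucleotide and protein sequences, which is why the sketch does not check the alphabet.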
UniProt database
Protein Data Bank