Overview of Bioinformatics Databases

The document provides an overview of bioinformatics databases, focusing on nucleic acid and protein sequence databases. It explains the structure and function of database management systems (DBMS), the classification of biological databases, and highlights key databases such as NCBI, GenBank, EMBL, and PDB. Additionally, it discusses the importance of these databases in storing and making biological data accessible for scientific research.

Uploaded by

ckittu009

Introduction to Bioinformatics Databases: Nucleic Acid Databases
What is a Database?
• General:
• A database is any collection of related data.
• A computerized archive used to store and
organize data in such a way that information can
be retrieved easily.
• A database is a collection of interrelated data
stored together without harmful and unnecessary
redundancy (duplicate data) to serve multiple
applications.
• Retrieving data is called firing a query.
DATABASE SYSTEM
• A database system is an integrated collection of related
files, along with the details of their definition,
interpretation, manipulation and maintenance.
• A database system is built around the data and is run
by software called a DBMS (Database Management System).
• A database system protects the data from unauthorized
access.
• A database management system (DBMS) is a
collection of programs that enables users to create
and maintain a database.
What Does a DBMS Do?
Database management systems provide several
functions in addition to simple file management:
• allow concurrency
• control security
• maintain data integrity
• provide for backup and recovery
• control redundancy
• allow data independence
• provide non-procedural query language
• perform automatic query optimization
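Two of the functions listed above, maintaining data integrity and supporting recovery, can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name `entry`, its columns and the accession strings are hypothetical and chosen only for illustration; they are not taken from any real database.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry ("
    " accession TEXT PRIMARY KEY,"            # no duplicate accessions allowed
    " length INTEGER CHECK (length > 0))"     # length must be positive
)

def insert_batch(connection, rows):
    """Insert all rows or none: the batch is rolled back on any violation."""
    try:
        with connection:  # commits on success, rolls back on an exception
            connection.executemany("INSERT INTO entry VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False

ok = insert_batch(conn, [("ACC0001", 100), ("ACC0002", 250)])
# The second batch repeats ACC0001, so the whole batch is rejected.
rejected = insert_batch(conn, [("ACC0003", 80), ("ACC0001", 999)])
count = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
print(ok, rejected, count)
```

The point of the sketch is that the rules live in the DBMS rather than in each application: the duplicate is refused by the database itself, and the partially inserted batch leaves no trace.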

What is a relational database?
• A database that treats all of its data as a
collection of relations (tables of rows and columns)
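As a rough illustration of a relation and of "firing a query" against it, here is a small sketch with Python's sqlite3 module. The table `entry`, its columns and the three rows are invented for the example; the query states what is wanted and leaves the how to the DBMS, which is what a non-procedural query language means.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One relation (table); every row has the same named attributes.
conn.execute(
    "CREATE TABLE entry ("
    " accession TEXT PRIMARY KEY,"
    " organism TEXT NOT NULL,"
    " length INTEGER NOT NULL)"
)
# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [
        ("ACC0001", "Homo sapiens", 1200),
        ("ACC0002", "Mus musculus", 900),
        ("ACC0003", "Homo sapiens", 450),
    ],
)

# Declarative query: human entries, longest first.
rows = conn.execute(
    "SELECT accession FROM entry WHERE organism = ? ORDER BY length DESC",
    ("Homo sapiens",),
).fetchall()
print(rows)
```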
Biological databases: why?
• Need for storing and communicating
large datasets has grown
• Make biological data available to
scientists.
• To make biological data available in
computer-readable form.
Different classifications of
databases
• Type of data
– nucleotide sequences
– protein sequences
– protein sequence patterns or motifs
– macromolecular 3D structure
– gene expression data
– metabolic pathways
Different classifications of databases….

• Primary or derived databases
– Primary databases: experimental results submitted
directly into the database
– Secondary databases: results of analysis of
primary databases
– Aggregates of many databases
• Links to other data items
• Combination of data
• Consolidation of data
Different classifications of databases….

• Availability
– Publicly available, no restrictions
– Available, but with copyright
– Accessible, but not downloadable
– Academic, but not freely available
– Proprietary, commercial; possibly free for
academics
NCBI and Entrez

• NCBI (National Center for Biotechnology Information) hosts
one of the largest and most comprehensive collections of
databases; it belongs to the NIH, the National
Institutes of Health (USA)
• Entrez is the search engine of NCBI
• Search for :
genes, proteins, genomes, structures, diseases,
publications and more.
• [Link]
Primary Databases
• These databases contain the raw sequence
data produced and submitted by
researchers worldwide.

• Nucleic acid
– EMBL
– GenBank
– DDBJ (DNA Data Bank of Japan)
• Protein
– PIR
– MIPS
– SWISS-PROT
– TrEMBL
– NRL-3D
Nucleotide sequence databases
• EMBL, GenBank, and DDBJ are the three
primary nucleotide sequence
databases
• EMBL [Link]/embl/
• GenBank
[Link]/Genbank/
• DDBJ [Link]
• Together they constitute the International Nucleotide
Sequence Database Collaboration (INSDC).
GenBank
• An annotated collection of all publicly
available nucleotide and protein sequences
• Set up in 1979 at LANL (Los Alamos National Laboratory).
• Maintained since 1992 by NCBI (Bethesda).
• [Link]
GenBank file format
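The slide showed the flat-file layout as an image. As a rough sketch of that layout: a record consists of keyword lines that start in the first column (LOCUS, DEFINITION, ACCESSION, ...), a sequence section that begins at ORIGIN, and a terminating `//` line. The record below is made up for illustration, and the parser deliberately ignores continuation lines and the FEATURES table; a real project would normally use an established parser.

```python
# A made-up minimal record; EXAMPLE01 is not a real accession.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   EXAMPLE01
ORIGIN
        1 atggccatta gtaatgggcc gctg
//
"""

def parse_genbank(text):
    """Collect top-level keyword fields and the sequence of one record."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines mix position numbers, spaces and bases.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():           # keyword starts in column 1
            keyword, _, value = line.partition(" ")
            record[keyword] = value.strip()
    return record

record = parse_genbank(SAMPLE_RECORD)
print(record["ACCESSION"], len(record["sequence"]))
```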
EMBL Nucleotide Sequence Database
• An annotated collection of all publicly available
nucleotide and protein sequences
• Created in 1980 at the European Molecular
Biology Laboratory in Heidelberg.
• Maintained since 1994 by EBI (Cambridge).
• [Link]
DDBJ – DNA Data Bank of Japan
• An annotated collection of all publicly available
nucleotide and protein sequences
• Started in 1984 at the National Institute of
Genetics (NIG) in Mishima.
• Still maintained at this institute by a team led by
Takashi Gojobori.
• [Link]
Databases related to
Genomics
Contain information on genes, gene location
(mapping), gene nomenclature and links to
sequence databases;
Exist for most organisms important for life
science research;
Examples: OMIM, GDB (human), MGD
(mouse), FlyBase (Drosophila), SGD (yeast),
MaizeDB (maize), SubtiList ([Link]), etc.
Other NCBI nucleic acids DBs
• EST database: A collection of expressed sequence tags, or short, single-pass sequence
reads from mRNA (cDNA).
• HomoloGene: A gene homology tool that compares nucleotide sequences between pairs
of organisms in order to identify putative orthologs.
• HTG database: A collection of high-throughput genome sequences from large-scale
genome sequencing centers, including unfinished and finished sequences.
• SNPs database: A central repository for both single-base nucleotide substitutions and
short deletion and insertion polymorphisms.
• RefSeq: A database of non-redundant reference sequence standards, including genomic
DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within
NCBI and with external groups, support data-gathering efforts.
Nucleic acid structure databases
• NDB: Nucleic acid-containing structures
[Link]
• NTDB: Thermodynamic data for nucleic acids
[Link]
• RNABase: RNA-containing structures from PDB and NDB
[Link]
• SCOR: Structural classification of RNA: RNA motifs by
structure, function and tertiary interactions
[Link]
Protein Sequence Databases
Protein Information Resource (PIR)
One of the first biological sequence databases was probably the book
"Atlas of Protein Sequence and Structure" by Margaret Dayhoff and
colleagues, first published in 1965. It contained the protein sequences
determined at the time, and new editions of the book were published
until 1978. It became the foundation of the PIR database.

[Link]
Protein Databases
• SWISS-PROT: Annotated sequence database
• TrEMBL: Database of translated EMBL nucleotide sequences
• InterPro: Integrated resource for protein families, domains
and functional sites
• CluSTr: Offers an automatic classification of SWISS-PROT
and TrEMBL
• IPI: A non-redundant human proteome set constructed from
SWISS-PROT, TrEMBL, Ensembl and RefSeq
• GOA: Provides assignments of gene products to the Gene
Ontology (GO) resource
• Proteome Analysis: Statistical and comparative analysis of
the predicted proteomes of fully sequenced organisms
• Protein Profiles: Tables of SWISS-PROT and TrEMBL entries
and alignments for the protein families of the Protein Profile
• IntEnz: The Integrated relational Enzyme database (IntEnz) will
contain enzyme data approved by the Nomenclature Committee

Reference site : [Link]/Databases/[Link]


Swiss-Prot
• A protein sequence database which strives
to provide a high level of annotation:
* the function of a protein
* domain structure
* post-translational modifications
* variants
• One entry for each protein
• Complete, Curated, Non-redundant and
cross-referenced with 34 other databases
UniProt: [Link]
• The Universal Protein Resource (UniProt) is the
world's most comprehensive catalog of
information on proteins. It is a central repository
of protein sequence and function created by
joining the information contained in Swiss-Prot,
TrEMBL, and PIR.
• It offers BLAST searches, sequence alignment,
retrieval of sequences by identifier, and ID mapping
from other databases such as GenBank, EMBL,
DDBJ, etc.
TrEMBL (Translation of EMBL)
• Created in 1996 as a computer-annotated supplement to
SWISS-PROT, because manual annotation cannot keep up with
the flow of data.
• Contains translations of all coding sequences (CDS) in
EMBL that are not yet in SWISS-PROT.
• Has 2 main sections:
1. SP-TrEMBL: contains entries that will eventually be
incorporated into SWISS-PROT, but that have not yet
been manually annotated.
2. REM-TrEMBL: contains sequences that are not destined
to be included in SWISS-PROT; these include
immunoglobulins and T-cell receptors, synthetic and
patented sequences, and codon translations that do not
encode real proteins.
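Translating a coding sequence, the basic operation behind TrEMBL entries, can be sketched as follows. This is only an illustration using the standard genetic code: it assumes the CDS is given in frame from its start codon and says nothing about how the real TrEMBL pipeline handles alternative codes, partial sequences or annotation. The input sequence is invented.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_cds(cds):
    """Translate an in-frame DNA coding sequence up to the first stop codon."""
    cds = cds.upper().replace("U", "T")      # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":                # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate_cds("ATGGCCTAA"))  # ATG -> M, GCC -> A, TAA -> stop
```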
Structure Databases
• MSD: The Macromolecular Structure Database, a relational
database representation of clean Protein Data Bank (PDB) data
• 3DSeq: 3D sequence alignment server; annotation of the
alignments between sequence databases and the PDB
• FSSP: Based on exhaustive all-against-all 3D structure
comparison of protein structures currently in the Protein Data Bank
(PDB)
• DALI: Fold classification based on structure-structure
assignments
• 3Dee: Database of protein domain definitions wherein
the domains have been clustered on sequence and
structural similarity
• NDB: Nucleic Acid Structure Database
Protein DataBank (PDB)
• Important in solving real problems in
molecular biology
• Protein Databank
– PDB established in 1971 at Brookhaven National
Laboratory (BNL)
– Sole international repository of macromolecular
structure data
– Later moved to the Research Collaboratory
for Structural Bioinformatics (RCSB)

[Link]
PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C.[Link]) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR [Link],[Link] 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH [Link],[Link],[Link],[Link] 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF [Link]. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………
PDB (cont.)
SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87
SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
…….
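The ATOM records above can be read programmatically. The sketch below splits each line on whitespace, which works for the records exactly as printed on this slide (no chain identifier, trailing `12CA` and line number). Real PDB files are fixed-column, so a robust parser would slice by column position instead; the class and function names here are hypothetical.

```python
import math
from typing import NamedTuple, Optional

class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float

def parse_atom_line(line: str) -> Optional[Atom]:
    """Parse one whitespace-separated ATOM line as shown on the slide."""
    fields = line.split()
    if len(fields) < 10 or fields[0] != "ATOM":
        return None
    return Atom(
        int(fields[1]), fields[2], fields[3], int(fields[4]),
        float(fields[5]), float(fields[6]), float(fields[7]),
        float(fields[8]), float(fields[9]),
    )

n_atom = parse_atom_line("ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89")
ca_atom = parse_atom_line("ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90")

# Distance in angstroms between the backbone N and CA atoms of TRP 5.
distance = math.dist(n_atom[4:7], ca_atom[4:7])
print(round(distance, 2))
```

The computed N to CA distance comes out close to 1.47 angstroms, in line with a typical covalent bond length, which is a quick sanity check that the coordinate columns were read correctly.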
Databases related to Proteomics
• Contain information obtained by 2D-PAGE: master
images of the gels and description of identified
proteins
• Examples: SWISS-2DPAGE (Two-dimensional
polyacrylamide gel electrophoresis database),
ECO2DBASE, Maize-2DPAGE, Sub2D,
Cyano2DBase, etc.
• Format: composed of image and text files
• Most 2D-PAGE databases are “federated” and
use SWISS-PROT as a master index
• Mass Spectrometry (MS) database
Munich Information Center for
Protein Sequences (MIPS)
• A research centre hosted at the Institute for
Bioinformatics (IBI) at Neuherberg, Germany.
• Focuses on the systematic analysis of
genome information, including the development and
application of bioinformatics methods in genome
annotation, gene expression analysis and proteomics.
• MIPS supports and maintains a set of
generic databases as well as the systematic
comparative analysis of microbial, fungal, and plant
genomes.

07/03/25 17:09
The Institute for Genomic Research (TIGR)
• Maintained by The Center for the
Advancement of Genomics (TCAG)
• Its database is TDB
• TDB provides a substantial suite of
databases containing DNA and protein
sequence, gene expression, cellular role,
protein family information, and taxonomic
data for microbes, plants and humans.
HOVERGEN: Homologous Vertebrate Genes Database
• HOVERGEN is a database of homologous vertebrate genes.
• It allows one to select sets of homologous genes among vertebrate species,
and to visualize multiple alignments and phylogenetic trees.
• Thus HOVERGEN is particularly useful for comparative sequence analysis,
phylogeny and molecular evolution studies.
• Divided into 2 parts:
[Link] contains the protein sequences
2. HOVERGENDNA contains the associated nucleotide sequences.
• The database contains all vertebrate protein sequences from the UniProt
Knowledgebase (Swiss-Prot and TrEMBL)
The Arabidopsis Information
Resource TAIR
• TAIR maintains a database of genetic and molecular
biology data for the model higher plant Arabidopsis
thaliana.
• Data available from TAIR includes the complete genome
sequence along with gene structure, gene product
information, metabolism, gene expression, DNA and
seed stocks, genome maps, genetic and physical
markers, publications, and information about the
Arabidopsis research community.
• It is an up-to-date database that is updated every 2
weeks

PlasmoDB: a functional genomic
database for malaria parasites
• PlasmoDB ([Link]) is a functional genomic
database for Plasmodium spp. that provides a resource
for data analysis and visualization in a gene-by-gene or
genome-wide scale.
• The latest release, PlasmoDB 5.5, contains numerous
new data types from several broad categories—
annotated genomes, evidence of transcription,
proteomics evidence, protein function evidence,
population biology and evolution.

ECDC (European Centre for
Disease Prevention and Control)
• The European Centre for Disease Prevention
and Control (ECDC) was established in 2005. It
is an EU agency aimed at strengthening
Europe's defences against infectious diseases.
• ECDC publishes scientific and technical reports
on various issues related to communicable
diseases prevention and control, including
comprehensive reports from key technical and
scientific meetings.

Other Databases
• KEGG (Kyoto Encyclopedia of Genes and Genomes) – for
pathways
• GeneCards – A database of human genes, their products
and their involvement in diseases. It is a secondary
database that links to many other databases.
– An all-in-one database of human genes (a project of the
Weizmann Institute)
– Attempts to integrate as many databases and publications,
and as much of the available knowledge, as possible
• There are many databases available for microarray,
SAGE, EST and SNP data.
FASTA Format
• A popular and commonly used sequence format

>Seq1
ALVLRARLATGPATGCTRTARARLATGALVLRARLATGPARARLATGPATGCTRTARA
RLATGALVLRARRLATGPATGCTRRLATGPATGCTRRARLATGPATGCTRTARARLAT
GALVLRAR
>Seq2
TGCTRTARARLATGALVLRARLATGPARARALVLRARLATGPATGCTRTARATGALVL
RARLATGPARARALVLRARLATG
>Seq3
……..
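A FASTA file like the one above can be read with a few lines of code. The following is a minimal sketch rather than a full parser: it assumes each record starts with a `>` header line, joins the wrapped sequence lines that follow, and uses the header text as the record name. The function name and the short test sequences are invented for illustration.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()     # "> Seq1" and ">Seq1" both give "Seq1"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may wrap over several lines
    return {name: "".join(parts) for name, parts in records.items()}

example = """> Seq1
ALVLRARLATGP
ATGCTRTARA
>Seq2
TGCTRTARARLATG
"""
sequences = parse_fasta(example)
for name, sequence in sequences.items():
    print(name, len(sequence))
```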
Identifiers and Accession numbers

• Identifier: a string of letters and digits that is generally
“understandable”
– Example: TPIS_CHICK (triose phosphate isomerase from
chicken (Gallus gallus)) in Swiss-Prot
– The identifier can change (based on the curator)
• Accession code: a string of letters and digits that
uniquely identifies an entry in its database.
– The accession number for TPIS_CHICK in Swiss-Prot is
P00940
– The accession number should not change!
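The two kinds of label can usually be told apart by their shape. The sketch below uses the accession-number pattern documented by UniProt; the entry-name pattern (a short mnemonic, an underscore, then a species code) is a simplified assumption for illustration, and the function name is hypothetical.

```python
import re

# Accession pattern as documented by UniProt (6 or 10 characters).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Simplified assumption: NAME_SPECIES, e.g. TPIS_CHICK.
ENTRY_NAME_RE = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")

def classify(token):
    """Return 'accession', 'identifier' or 'unknown' for a Swiss-Prot style label."""
    if ACCESSION_RE.fullmatch(token):
        return "accession"
    if ENTRY_NAME_RE.fullmatch(token):
        return "identifier"
    return "unknown"

for label in ("TPIS_CHICK", "P00940", "not a label"):
    print(label, "->", classify(label))
```

Because the accession is the stable key, it is the safer value to store in notes, scripts and cross-references; the readable identifier is better suited to display.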
Google scholar
[Link]

Exercise
• Retrieve all publications in which the first
author is: Mayrose I and the last author is:
Pupko T
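The slide does not say which search tool the exercise is meant for. Assuming PubMed is used, the search can be written with PubMed's first-author and last-author field tags, `[1au]` and `[lastau]`. The sketch below only builds the query string and an NCBI E-utilities `esearch` URL; it does not send any request, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_author_query(first_author, last_author):
    """Combine PubMed first-author and last-author field tags with AND."""
    return f"{first_author}[1au] AND {last_author}[lastau]"

def build_esearch_url(term, db="pubmed"):
    """Return the esearch URL for a query term (nothing is fetched here)."""
    return f"{EUTILS_ESEARCH}?{urlencode({'db': db, 'term': term})}"

term = build_author_query("Mayrose I", "Pupko T")
url = build_esearch_url(term)
print(term)
print(url)
```

The same query string can be pasted directly into the PubMed search box, which is probably the simplest way to do the exercise by hand.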

The MOST important of all

[Link] (or any search engine)

And always remember:

[Link] –
Read the manual!!

Help!
• Read the Help section
• Read the FAQ section
• Google the question!

