Primary database
Nucleic acid sequence databases
EMBL
GenBank/NCBI
DDBJ
Protein sequence databases
UniProt/Swiss-Prot
TrEMBL
iProClass
EMBL – European Molecular
Biology Laboratory
Established in 1980 at the EMBL laboratories in
Heidelberg, Germany
First DNA sequence database
Nucleotide sequence database from the
European Bioinformatics Institute (EBI)
It includes sequences from direct author
submissions and genome sequencing groups,
and from the scientific literature and patent
applications
This database is produced in an international
collaboration with DDBJ and GenBank
Each of the three groups collects sequence
data worldwide, and all new database entries
are exchanged between the groups on a daily
basis
Taxonomic Division – each entry belongs
to exactly one taxonomic division
Code Division
PHG Bacteriophage
ENV Environmental sample
FUN Fungal
HUM Human
INV Invertebrate
MAM Mammals
VRT Vertebrate
MUS Mus musculus
PLN Plant
PRO Prokaryotes
ROD Rodent
SYN Synthetic
TGN Transgenic
UNC Unclassified
Structure of an entry
Each entry in the database is composed of
lines
Each line begins with a two-character line
code, which indicates the type of information
contained in the line
EMBL Structure
ID – identification
AC – Accession number
DT – date
DE – Description
KW – keyword
OS – organism species
OC – organism classification
OG – Organelle
RN – reference number
RP – reference position
RA – reference author
RT – reference title
RL – Reference location
DR – database cross reference
CC – comments
FH – feature header
FT – feature table
XX – spacer line
SQ – sequence header
// – termination line
Line structure
Each line begins with a two-character line
type code
This code is always followed by three blanks
So the actual information in each line begins
in character position 6
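The fixed layout above can be sketched in a few lines of Python. This is a minimal illustration, not an official parser; the function name `split_line` is a hypothetical helper introduced here.

```python
def split_line(line: str) -> tuple[str, str]:
    """Split one EMBL flat-file line into (line code, content)."""
    code = line[:2]               # two-character line type code
    content = line[5:].rstrip()   # information begins in character position 6
    return code, content


code, content = split_line("KW   hemoglobin.")
print(code)     # KW
print(content)  # hemoglobin.
```

Lines that carry only a code, such as XX or //, simply yield an empty content string.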
ID – identification
First line of the entry
Format of the ID line is
ID <1>; SV <2>; <3>; <4>; <5>; <6>; <7> BP.
<1> Primary accession number
<2> Sequence version number
<3> Topology: 'circular' or 'linear'
<4> Molecule type
<5> Data class
<6> Taxonomic division
<7> Sequence length
E.g. ID M85050; SV 1; linear; mRNA; STD; INV; 1353 BP.
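As a rough sketch, the seven semicolon-separated tokens of the ID line can be pulled apart as follows. The function name `parse_id_line` and the dictionary keys are assumptions made for illustration; the example line is the one shown above, written with the three blanks after the line code.

```python
def parse_id_line(line: str) -> dict:
    """Parse an EMBL ID line into its seven fields."""
    # Drop the line code and blanks, then the trailing full stop.
    body = line[5:].rstrip().rstrip(".")
    fields = [f.strip() for f in body.split(";")]
    accession, version, topology, moltype, dataclass, division, length = fields
    return {
        "accession": accession,
        "version": int(version.split()[1]),   # "SV 1" -> 1
        "topology": topology,
        "molecule_type": moltype,
        "data_class": dataclass,
        "division": division,
        "length": int(length.split()[0]),     # "1353 BP" -> 1353
    }


entry = parse_id_line("ID   M85050; SV 1; linear; mRNA; STD; INV; 1353 BP.")
print(entry["accession"], entry["division"], entry["length"])
```

A line with a different number of tokens raises a `ValueError` at the unpacking step, which is a convenient signal that the line does not follow this layout.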
AC – accession number
Accession number lines list the accession
numbers associated with the entry
E.g. AC M85050; S46826;
Secondary accession numbers allow
tracking of data.
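A short sketch of how the primary and secondary accession numbers might be separated, assuming the first number on the line is the primary one. `parse_ac_line` is a hypothetical helper name.

```python
def parse_ac_line(line: str) -> tuple[str, list[str]]:
    """Return (primary accession, list of secondary accessions)."""
    numbers = [n.strip() for n in line[5:].split(";") if n.strip()]
    return numbers[0], numbers[1:]


primary, secondary = parse_ac_line("AC   M85050; S46826;")
print(primary)    # M85050
print(secondary)  # ['S46826']
```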
DT – Date
Date lines show when an entry first appeared
in the database and when it was last updated
Each entry contains two DT lines
DT DD-MON-YYYY (Rel. #, Created)
DT DD-MON-YYYY (Rel. #, Last updated, Version #)
E.g. DT 20-DEC-1990 (Rel. 26, Created)
E.g. DT 25-MAR-2001 (Rel. 67, Last updated,
Version 33)
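The two DT lines can be read with a small regular expression. This is a sketch under the assumption that each DT line sits on a single physical line; `parse_dt_line`, `MONTHS` and `DT_PATTERN` are names introduced here for illustration. Month names are mapped by hand so the result does not depend on the system locale.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

DT_PATTERN = re.compile(
    r"^DT   (\d{2})-([A-Z]{3})-(\d{4}) \(Rel\. (\d+), (Created|Last updated)")


def parse_dt_line(line: str) -> tuple[date, int, str]:
    """Return (date, release number, event) for one DT line."""
    match = DT_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a DT line: {line!r}")
    day, month, year, release, event = match.groups()
    return date(int(year), MONTHS[month], int(day)), int(release), event


print(parse_dt_line("DT   20-DEC-1990 (Rel. 26, Created)"))
print(parse_dt_line("DT   25-MAR-2001 (Rel. 67, Last updated, Version 33)"))
```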
DE – Description
Lines contain general descriptive
information about the sequence stored
It includes
The designation of the genes for which the
sequence codes
The region of the genome from which it is
derived
E.g. DE Human hemoglobin DNA with a
deletion causing Indian delta-beta thalassemia.
KW – Keywords
Used to generate cross-reference indexes of the
sequences based on functional, structural and
other categories
E.g. KW hemoglobin.
OS – Organism species
Line specifies the preferred scientific name of
the organism
OS Genus species (name)
E.g. OS Pseudoterranova decipiens (cod worm)
E.g. OS Homo sapiens (human)
OC – organism classification
Line contains the taxonomic classification of the source
organism
The classification is listed top-down as nodes in a taxonomic
tree
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;
OC Homo.
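Because the classification spans several OC lines, a reader has to join them before splitting on semicolons. A minimal sketch, with `parse_oc_lines` as a hypothetical helper name:

```python
def parse_oc_lines(lines: list[str]) -> list[str]:
    """Join consecutive OC lines into one top-down list of taxonomic nodes."""
    text = " ".join(line[5:].strip() for line in lines if line.startswith("OC"))
    # The last node ends with a full stop instead of a semicolon.
    return [node.strip() for node in text.rstrip(".").split(";") if node.strip()]


lineage = parse_oc_lines([
    "OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;",
    "OC   Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;",
    "OC   Homo.",
])
print(lineage[0], "...", lineage[-1])  # Eukaryota ... Homo
```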
OG – Organelle
Line indicates the sub-cellular location of non-nuclear
sequences
E.g. OG Mitochondrion
The reference
RN, RP, RA, RT, RL
DR - Database cross reference
Line cross-references other databases that
contain information related to the entry
FH – Feature Header
Key and Location
FT – Feature Table
Source - organism name
CDS – Coding sequence
mRNA – messenger RNA
SQ – sequence header
Line marks the beginning of the sequence and
gives a summary of its composition
E.g. SQ Sequence 2337 BP; 942 A; 462 C; 401 G; 529 T; 3 other;
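The SQ line carries its own consistency check: the per-base counts should add up to the stated length (942 + 462 + 401 + 529 + 3 = 2337 in the example). A sketch of that check, with `parse_sq_line` as a hypothetical helper name:

```python
import re


def parse_sq_line(line: str) -> tuple[int, dict]:
    """Return (sequence length, base counts) from an SQ line."""
    length = int(re.search(r"Sequence (\d+) BP;", line).group(1))
    counts = {label: int(number)
              for number, label in re.findall(r"(\d+) (A|C|G|T|other);", line)}
    return length, counts


length, counts = parse_sq_line(
    "SQ   Sequence 2337 BP; 942 A; 462 C; 401 G; 529 T; 3 other;")
print(length == sum(counts.values()))  # True
```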
// - Terminator
Marks the end of the entry
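Since every entry ends with a // line, a flat file holding many entries can be cut into individual entries by watching for that marker. The sketch below assumes the whole file fits in memory; `split_entries` is a hypothetical helper, and the second accession (X00000) is a made-up placeholder used only to show two entries.

```python
def split_entries(text: str) -> list[list[str]]:
    """Split flat-file text into entries, each a list of lines without '//'."""
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith("//"):
            if current:
                entries.append(current)
            current = []
        else:
            current.append(line)
    return entries  # lines after the last '//' are treated as incomplete


flat_file = "AC   M85050;\nXX\n//\nAC   X00000;\n//\n"
for entry in split_entries(flat_file):
    print(entry[0])
```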
NCBI – National Center for
Biotechnology Information
The NCBI was established on November 4, 1988,
through legislation sponsored by Senator
Claude Pepper, as a division of the
National Library of Medicine (NLM) at the
National Institutes of Health (NIH)
Nucleotide database – GenBank
The DNA database from NCBI incorporates
sequences from publicly available sources,
from direct author submissions and from
large-scale sequencing projects
Sequence data are submitted to GenBank
by individual scientists from around the
world and by large centers involved in the
Human Genome Project
GenBank is an international collaborative
project with partners located at
the European Bioinformatics Institute in the
United Kingdom and
the National Institute of Genetics in Japan
The increasing size of the database has
made it convenient to split GenBank into
smaller, discrete divisions
Division Sequence Subset
PRI Primate
ROD Rodent
MAM Mammalian
VRT Vertebrate
INV Invertebrate
PLN Plant, Fungal, Algae
BCT Bacteria
RNA Structural RNA
VRL Viral
PHG Bacteriophage
SYN Synthetic
UNA Unannotated
EST Expressed Sequence Tag
PAT Patent
STS Sequence Tagged Sites
GSS Genome Survey Sequence
HTG High Throughput Genomic sequence
ENV Environmental sampling sequence
The structure of GenBank
Entries
Each entry consists of
a number of keywords,
relevant associated sub-keywords and
an optional feature table
Its end is indicated by a "//" terminator
The positioning of these elements on any given
line is important
Keywords begin in column 1,
sub-keywords begin in column 3 and
a code defining part of the feature table begins in
column 6
Any line beginning with a blank character is
considered a continuation from the keyword
or sub-keyword
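The column rules above can be turned into a small classifier. This is only a sketch of the rules as stated on this slide: any indentation other than columns 1, 3 and 6 is treated as a continuation line, which is an assumption, and `classify_line` is a hypothetical helper name.

```python
def classify_line(line: str) -> str:
    """Classify a GenBank flat-file line by the column where its text starts."""
    if not line.strip():
        return "blank"
    if line.startswith("//"):
        return "terminator"
    indent = len(line) - len(line.lstrip(" "))
    if indent == 0:
        return "keyword"        # text starts in column 1
    if indent == 2:
        return "sub-keyword"    # text starts in column 3
    if indent == 5:
        return "feature key"    # text starts in column 6
    return "continuation"       # assumption: any other indentation


for sample in ["LOCUS       HUMCYCLOX", "  ORGANISM  Homo sapiens", "     CDS", "//"]:
    print(classify_line(sample))
```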
Keywords include LOCUS, DEFINITION,
ACCESSION, NID, KEYWORDS, SOURCE,
REFERENCE, FEATURES, BASE COUNT
and ORIGIN
Locus
Includes a short label for the entry that may suggest
the function of the sequence
E.g. HUMCYCLOX
suggests a human cyclooxygenase
Cyclooxygenase (COX) is an enzyme that is responsible
for formation of important biological mediators called
prostanoids
Other relevant facts
Number of bases
Source of sequence data (mRNA)
Section of database (PRI) and
Date of submission
Definition
Contains a concise description of the sequence
(in this example Homo sapiens cyclooxygenase)
Accession
Gives an accession number, a unique constant
code assigned to each entry
NID
Supplies a nucleotide identifier (g181253)
Keywords
Introduces a list of short phrases, assigned by
the author, describing gene products and other
relevant information about the entry
In this example cyclooxygenase, prostaglandin.
Source
Provides information on the tissue from which
the data have been derived (here umbilical vein)
Sub-keyword ORGANISM gives the
biological classification of the source organism
Here Homo sapiens, Eukaryota, etc.
Reference
Indicates the portion of the sequence data to
which the cited literature refers
Sub-keywords AUTHORS, TITLE and JOURNAL
provide a structure for the citation
MEDLINE is a pointer to an online medical
literature information resource, which
allows the abstract of the given article to be
viewed.
Features
Describes properties of the sequence
in detail
'db_xref' links to other databases
taxon:9606 - a taxonomic database
PID:g181254 - a protein sequence database
5'UTR – 5' untranslated region
CDS – coding sequence
3'UTR – 3' untranslated region
polyA_signal – polyadenylation signal sequence
Base count
Provides the frequency of occurrence of the
different base types in sequence
E.g. 1010A, 712C, 633G and 1032T
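The base count is simply a tally of each base in the sequence. A minimal sketch using the standard library; `base_count` is a hypothetical helper and the short sequence is made up for illustration.

```python
from collections import Counter


def base_count(sequence: str) -> dict:
    """Count occurrences of A, C, G and T in a nucleotide sequence."""
    counts = Counter(sequence.upper())
    return {base: counts[base] for base in "ACGT"}


print(base_count("acgtacgtaa"))  # {'A': 4, 'C': 2, 'G': 2, 'T': 2}
```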
Origin
Location of the first base of the sequence
within the genome
Entry is terminated by the // marker
DDBJ – DNA Data Bank of Japan
The DNA Data Bank of Japan began in 1986 at the National
Institute of Genetics (NIG) with the endorsement of the
Ministry of Education, Science, Sports and Culture
DDBJ functions as one of the international
DNA databases, together with EBI in Europe and NCBI in
the USA
DDBJ collaborates with the other two databanks by
exchanging data and information over the Internet
and by regularly holding two meetings
The International DNA Databanks Advisory Meeting
The International DNA Databanks Collaborative Meeting
The structure of a DDBJ file is exactly the same as the
GenBank file format
It contains keywords, sub-keywords, a feature table
and a terminator
SAKURA is a nucleotide sequence data submission
system through the WWW server at DDBJ
Using this system you can interactively enter and
submit nucleotide sequences, functions and
features of the sequences
MGS – Mass Genome Submission for Genome
sequences
[Figure: data exchange among the three databanks. GenBank (NCBI, NIH; searched via Entrez), EMBL (EBI; searched via SRS) and DDBJ (CIB, NIG; searched via getentry) each receive submissions and updates.]