Overview of Major Nucleic Acid Databases

Uploaded by

utkarsh Gupta

Primary databases

Nucleic acid sequence databases

EMBL
GenBank (NCBI)
DDBJ

Protein sequence databases

UniProt/Swiss-Prot
TrEMBL
iProClass
EMBL – European Molecular Biology Laboratory
Established in 1980 at the EMBL laboratories in Heidelberg, Germany, as the first DNA sequence database
Nucleotide sequence database maintained by the European Bioinformatics Institute (EBI)
It includes sequences from direct author submissions, genome sequencing groups, the scientific literature and patent applications
The database is produced in an international collaboration with DDBJ and GenBank
Each of the three groups collects sequence data worldwide, and all new database entries are exchanged between the groups on a daily basis
Taxonomic division: each entry belongs to exactly one taxonomic division

Code  Division
PHG   Bacteriophage
ENV   Environmental sample
FUN   Fungal
HUM   Human
INV   Invertebrate
MAM   Mammal
VRT   Vertebrate
MUS   Mus musculus
PLN   Plant
PRO   Prokaryote
ROD   Rodent
SYN   Synthetic
TGN   Transgenic
UNC   Unclassified
Structure of an entry
Each entry in the database is composed of lines
Each line begins with a two-character line code, which indicates the type of information contained in the line
EMBL Structure
ID – identification
AC – Accession number
DT – date
DE – Description
KW – keyword
OS – organism species
OC – organism classification
OG – Organelle
RN – reference number
RP – reference position
RA – reference author
RT – reference title
RL – Reference location
DR – database cross reference
CC – comments
FH – feature header
FT – feature table
XX – spacer line
SQ – sequence header
// – termination line
Line structure
Each line begins with a two-character line type code
This code is always followed by three blanks, so the actual information in each line begins in character position 6
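The fixed layout described above lends itself to simple slicing. The following is a minimal Python sketch, assuming each line carries its code in the first two characters; the function name split_embl_line is illustrative and not part of any EMBL tool.

```python
def split_embl_line(line):
    """Split one EMBL flat-file line into (line code, content)."""
    line = line.rstrip("\n")
    code = line[:2]       # two-character line type code (positions 1-2)
    content = line[5:]    # information begins at character position 6
    return code, content.strip()


code, content = split_embl_line("DE   Human hemoglobin DNA")
print(code, "|", content)
```

Lines that carry only a code, such as XX or //, simply yield an empty content string.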
ID – Identification
First line of the entry
Format of the ID line is
ID   <1>; SV <2>; <3>; <4>; <5>; <6>; <7> BP.
<1> Primary accession number
<2> Sequence version number
<3> Topology: ‘circular’ or ‘linear’
<4> Molecule type
<5> Data class
<6> Taxonomic division
<7> Sequence length
E.g. ID   M85050; SV 1; linear; mRNA; STD; INV; 1353 BP.
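Because the ID line is a semicolon-separated list of seven fields, it can be unpacked field by field. The sketch below is one possible reading of the format, assuming the example line is written with three blanks after the line code; parse_id_line and the dictionary keys are names chosen here for illustration.

```python
def parse_id_line(line):
    """Parse an EMBL ID line into a dictionary of its seven fields."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    # Drop the line code and the trailing full stop, then split on ';'
    fields = [f.strip() for f in line[5:].rstrip(". \n").split(";")]
    if len(fields) != 7:
        raise ValueError("expected 7 semicolon-separated fields")
    accession, version, topology, molecule, data_class, division, length = fields
    return {
        "accession": accession,
        "version": int(version.split()[1]),    # "SV 1" -> 1
        "topology": topology,
        "molecule_type": molecule,
        "data_class": data_class,
        "division": division,
        "length_bp": int(length.split()[0]),   # "1353 BP" -> 1353
    }


entry = parse_id_line("ID   M85050; SV 1; linear; mRNA; STD; INV; 1353 BP.")
print(entry["accession"], entry["division"], entry["length_bp"])
```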
AC – Accession number
The AC line lists the accession numbers associated with the entry
E.g. AC   M85050; S46826;
The first number is the primary accession number; any secondary accession numbers allow tracking of data, for example when entries are merged or split
DT – Date
The DT lines show when an entry first appeared in the database and when it was last updated
Each entry contains two DT lines
DT   DD-MON-YYYY (Rel. #, Created)
DT   DD-MON-YYYY (Rel. #, Last updated, Version #)
E.g. DT   20-DEC-1990 (Rel. 26, Created)
E.g. DT   25-MAR-2001 (Rel. 67, Last updated, Version 33)
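The DT examples above follow a regular pattern that a short regular expression can capture. This is a rough Python sketch under the assumption that dates always use three-letter English month abbreviations in upper case; parse_dt_line and MONTHS are illustrative names.

```python
import re
from datetime import date

MONTHS = {m: i for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

DT_PATTERN = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4}) \(Rel\. (\d+), (Created|Last updated)")


def parse_dt_line(line):
    """Return (date, release number, event) from an EMBL DT line."""
    m = DT_PATTERN.match(line)
    if m is None:
        raise ValueError("not a recognised DT line")
    day, mon, year, release, event = m.groups()
    return date(int(year), MONTHS[mon], int(day)), int(release), event


print(parse_dt_line("DT   20-DEC-1990 (Rel. 26, Created)"))
print(parse_dt_line("DT   25-MAR-2001 (Rel. 67, Last updated, Version 33)"))
```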
DE – Description
The DE lines contain general descriptive information about the stored sequence
This includes
the designation of the genes for which the sequence codes and
the region of the genome from which it is derived
E.g. DE   Human hemoglobin DNA with a deletion causing Indian delta-beta thalassemia.
KW – Keywords
Used to generate cross-reference indexes of the sequences based on functional, structural and other categories
E.g. KW   hemoglobin.
OS – Organism species
The OS line specifies the preferred scientific name of the organism, optionally followed by its common name in parentheses
OS   Genus species (name)
E.g. OS   Pseudoterranova decipiens (cod worm)
E.g. OS   Homo sapiens (human)
OC – Organism classification
The OC lines contain the taxonomic classification of the source organism
The classification is listed top-down as nodes in a taxonomic tree
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;
OC   Homo.
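Since the classification spans several OC lines, the nodes have to be joined before the top-down order can be read off. A minimal Python sketch, assuming nodes are separated by semicolons and the final line ends with a full stop, as in the example above; parse_lineage is an illustrative name.

```python
def parse_lineage(oc_lines):
    """Join OC lines and return the taxonomic nodes in top-down order."""
    text = " ".join(line[5:].strip() for line in oc_lines
                    if line.startswith("OC"))
    text = text.rstrip(".")   # the last OC line ends with a full stop
    return [node.strip() for node in text.split(";") if node.strip()]


oc_lines = [
    "OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;",
    "OC   Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;",
    "OC   Homo.",
]
lineage = parse_lineage(oc_lines)
print(lineage[0], "->", lineage[-1])
```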
OG – Organelle
The OG line indicates the sub-cellular location of non-nuclear sequences
E.g. OG   Mitochondrion
The reference
Each reference block consists of the RN, RP, RA, RT and RL lines
DR – Database cross reference
The DR line cross-references other databases that contain information related to the entry
FH – Feature header
Gives the column headings of the feature table: Key and Location
FT – Feature table
Feature keys include
source – organism name
CDS – coding sequence
mRNA – messenger RNA
SQ – Sequence header
The SQ line marks the beginning of the sequence and gives a summary of its composition
E.g. SQ   Sequence 2337 BP; 942 A; 462 C; 401 G; 529 T; 3 other;
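The summary in the SQ line can be checked for internal consistency: the per-base counts should add up to the declared length. The following Python sketch assumes the layout shown in the example; parse_sq_line is an illustrative name.

```python
import re


def parse_sq_line(line):
    """Return (declared length, {base: count}) from an EMBL SQ line."""
    length = int(re.search(r"Sequence (\d+) BP", line).group(1))
    counts = {base: int(n)
              for n, base in re.findall(r"(\d+) (A|C|G|T|other)\b", line)}
    return length, counts


length, counts = parse_sq_line(
    "SQ   Sequence 2337 BP; 942 A; 462 C; 401 G; 529 T; 3 other;")
# The counts in the example sum to the declared length
print(length, sum(counts.values()))
```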
// – Terminator
Marks the end of the entry
NCBI – National Center for Biotechnology Information
Legislation sponsored by Claude Pepper established the NCBI on November 4, 1988 as a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH)
Nucleotide database – GenBank
GenBank, the DNA database from NCBI, incorporates sequences from publicly available sources, mainly direct author submissions and large-scale sequencing projects
Sequence data are submitted to GenBank by individual scientists around the world and by large centers involved in the Human Genome Project
GenBank is an international collaborative project with partners located at
the European Bioinformatics Institute in the United Kingdom and
the National Institute of Genetics in Japan
The increasing size of the database has made it convenient to split GenBank into smaller, discrete divisions
Division  Sequence subset

PRI       Primate
ROD       Rodent
MAM       Mammalian
VRT       Vertebrate
INV       Invertebrate
PLN       Plant, fungal, algal
BCT       Bacterial
RNA       Structural RNA
VRL       Viral
PHG       Bacteriophage
SYN       Synthetic
UNA       Unannotated
EST       Expressed sequence tag
PAT       Patent
STS       Sequence tagged site
GSS       Genome survey sequence
HTG       High-throughput genomic sequence
ENV       Environmental sampling sequence
The structure of GenBank entries
Each entry consists of
a number of keywords,
relevant associated sub-keywords and
an optional feature table
Its end is indicated by a “//” terminator
The positioning of these elements on any given line is important
Keywords begin in column 1,
sub-keywords begin in column 3 and
a code defining part of the feature table begins in column 6
Any other line beginning with blank characters is considered a continuation of the preceding keyword or sub-keyword
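The column rules above are enough to classify a line by counting its leading blanks. This Python sketch follows those rules as stated on the slide; the function name classify_genbank_line and the sample lines are illustrative and not taken from a real entry.

```python
def classify_genbank_line(line):
    """Classify a GenBank flat-file line by the column where its text starts."""
    if not line.strip():
        return "blank"
    if line.startswith("//"):
        return "terminator"
    indent = len(line) - len(line.lstrip(" "))
    if indent == 0:
        return "keyword"        # text starts in column 1
    if indent == 2:
        return "sub-keyword"    # text starts in column 3
    if indent == 5:
        return "feature key"    # text starts in column 6
    return "continuation"       # deeper indentation continues the previous item


# Illustrative lines only; the content is made up for the example
samples = [
    "DEFINITION  Example sequence.",
    "  ORGANISM  Homo sapiens",
    "            Eukaryota; Metazoa;",
    "     CDS             1..100",
    "//",
]
for s in samples:
    print(classify_genbank_line(s))
```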
Keywords include LOCUS, DEFINITION, ACCESSION, NID, KEYWORDS, SOURCE, REFERENCE, FEATURES, BASE COUNT and ORIGIN
Locus
Includes a short label for the entry that may suggest the function of the sequence
E.g. HUMCYCLOX suggests a human cyclooxygenase
Cyclooxygenase (COX) is an enzyme that is responsible for the formation of important biological mediators called prostanoids
The LOCUS line also gives other relevant facts:
Number of bases
Source of sequence data (mRNA)
Section of the database (PRI) and
Date of last modification
Definition
Contains a concise description of the sequence (in this example, Homo sapiens cyclooxygenase)
Accession
Gives the accession number, a unique, constant code assigned to each entry
NID
Supplies a nucleotide identifier (g181253)
Keywords
Introduces a list of short phrases, assigned by the author, describing gene products and other relevant information about the entry
In this example: cyclooxygenase, prostaglandin
Source
Provides information on the tissue from which the data have been derived (here, umbilical vein)
The sub-keyword ORGANISM gives the biological classification of the source organism
Here: Homo sapiens, Eukaryota, etc.
Reference
Indicates the portion of the sequence data to which the cited literature refers
The sub-keywords AUTHORS, TITLE and JOURNAL provide a structure for the citation
MEDLINE is a pointer to an online medical literature information resource, which allows the abstract of the given article to be viewed
Features
Describes properties of the sequence in detail
‘db_xref’ links to other databases
taxon:9606 – a taxonomic database
PID:g181254 – a protein sequence database
5’UTR – 5’ untranslated region
CDS – coding sequence
3’UTR – 3’ untranslated region
polyA_signal – polyadenylation signal
Base count
Provides the frequency of occurrence of the different base types in the sequence
E.g. 1010 A, 712 C, 633 G and 1032 T
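A base count of this kind can be reproduced directly from the sequence text. Below is a small Python sketch; base_count is an illustrative name and the short sequence is a made-up example, not the entry discussed above.

```python
from collections import Counter


def base_count(sequence):
    """Count occurrences of A, C, G and T in a nucleotide sequence."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


# Made-up ten-base sequence used only to demonstrate the function
print(base_count("atggcgtacc"))
```

The function ignores case, so it works on the lower-case sequence text that follows the ORIGIN keyword as well as on upper-case input.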
Origin
Gives the location of the first base of the sequence within the genome
The entry is terminated by the // marker
DDBJ – DNA Data Bank of Japan
The DNA Data Bank of Japan began in 1986 at the National Institute of Genetics (NIG) with the endorsement of the Ministry of Education, Science, Sports and Culture
DDBJ functions as one of the three international DNA databases, together with EBI in Europe and NCBI in the USA
DDBJ collaborates with the other two databanks by exchanging data and information over the Internet and by regularly holding two meetings:
The International DNA Databanks Advisory Meeting
The International DNA Databanks Collaborative Meeting
The structure of a DDBJ file is exactly the same as the GenBank file format
It contains keywords, sub-keywords, a feature table and a terminator
SAKURA is a nucleotide sequence data submission system available through the WWW server at DDBJ
Using this system you can interactively enter and submit nucleotide sequences together with the functions and features of the sequences
MGS – Mass Genome Submission, for genome sequences
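[Figure: data exchange among the three databases. GenBank (NCBI, NIH), EMBL (EBI) and DDBJ (CIB, NIG) each receive submissions and updates and share them with one another; entries are retrieved through Entrez (NCBI), SRS (EBI) and getentry (DDBJ).]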