02-B-Sequence Presentation and File Formats

The document provides an overview of various sequence presentation and file formats, focusing on GenBank and SwissProt records. It details the structure of sequence records, including fields such as locus name, accession numbers, definitions, and features like genes and coding sequences. Additionally, it outlines the format and components of PDB records, emphasizing the importance of unique identifiers and the organization of sequence data.

Sequence presentation and file formats

Wilson Nandolo
[email protected]
+265993375505
GenBank sequence format
• 1 The LOCUS field consists of five different subfields, described below: 1a - 1e

1a Locus Name
• The locus name is a tag for grouping similar sequences.
• The first two or three letters usually designate the organism. In this case, HS stands for Homo sapiens.
• The last several characters are associated with another group designation, such as the gene product. In this example, the last three characters represent the gene symbol, HFE.
• Currently, the only requirement for assigning a locus name to a record is that it is unique.

1b Number of base pairs

1c Molecule Type - Type of molecule that was sequenced
• All sequence data in an entry must be of the same type.

1d GenBank Division
• There are different GenBank divisions.
• In this example, PRI stands for primate sequences.
• Some other divisions include ROD (rodent sequences), MAM (other mammalian sequences), PLN (plant, fungal, and algal sequences), and BCT (bacterial sequences).

1e Modification Date - Date of the most recent modification made to the record
• The date of first public release is not available in the sequence record.
• This information can be obtained only by contacting NCBI at
[email protected].
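The five LOCUS subfields (1a - 1e) can be pulled apart with a few lines of Python. This is only a minimal sketch: the helper name `parse_locus` and the `Locus` tuple are made up for illustration, and the sample line uses an assumed length and date rather than values taken from the slides.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str           # 1a locus name
    length_bp: int      # 1b number of base pairs
    molecule_type: str  # 1c molecule type
    division: str       # 1d GenBank division
    modified: str       # 1e modification date


def parse_locus(line: str) -> Locus:
    """Split a LOCUS line into the five subfields 1a - 1e."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"not a recognizable LOCUS line: {line!r}")
    # Division and date are read from the end of the line, so an optional
    # extra token (for example a topology such as "linear") is tolerated.
    return Locus(tokens[1], int(tokens[2]), tokens[4], tokens[-2], tokens[-1])


# Illustrative line only: the length and the date are assumed values.
sample = "LOCUS       HSHFE      12146 bp    DNA             PRI       23-JUL-1999"
locus = parse_locus(sample)
print(locus.name, locus.length_bp, locus.molecule_type, locus.division, locus.modified)
```

Splitting on whitespace keeps the sketch short; a production parser would more likely rely on an established library than on hand-written token positions.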
2 DEFINITION - Brief description of the sequence
• The description may include the source organism name, gene or protein name, or a designation as untranscribed or untranslated sequence (e.g., a promoter region).
• For sequences containing a coding region (CDS), the definition field may also contain a “completeness” qualifier such as "complete CDS" or "exon 1."
3 ACCESSION - Unique identifier assigned to a complete sequence record
• This number never changes, even if the record is modified.
• An accession number is a combination of letters and numbers, usually in the format of one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456).
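The two usual accession layouts described above can be checked with a regular expression. The sketch below covers only those two patterns (the slide says "usually", so other layouts exist), and `looks_like_accession` is a hypothetical helper name.

```python
import re

# One letter + five digits (M12345) or two letters + six digits (AC123456).
ACCESSION_RE = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def looks_like_accession(text: str) -> bool:
    """Return True if text matches one of the two usual accession layouts."""
    return ACCESSION_RE.fullmatch(text) is not None


for candidate in ["M12345", "AC123456", "Z92910", "M1234"]:
    print(candidate, looks_like_accession(candidate))
```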

4 VERSION - Identification number assigned to a single, specific sequence in the database
• This number is in the format “accession.version.”
• If any changes are made to the sequence data, the version part of the number will increase by one.
• For example, U12345.1 becomes U12345.2.
• A version number of Z92910.1 for this HFE sequence indicates that the sequence data have not been altered since the original submission.
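The "accession.version" rule can be illustrated in a few lines. This is a sketch under the assumption that the identifier always contains exactly the accession, a dot, and an integer version; the function names are invented for the example.

```python
from typing import Tuple


def split_version(identifier: str) -> Tuple[str, int]:
    """Split 'accession.version' into its accession and integer version."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"expected accession.version, got {identifier!r}")
    return accession, int(version)


def next_version(identifier: str) -> str:
    """Identifier the record would carry after one change to the sequence data."""
    accession, version = split_version(identifier)
    return f"{accession}.{version + 1}"


print(split_version("Z92910.1"))
print(next_version("U12345.1"))
```

Note that the accession part stays fixed while only the number after the dot moves, which is why the accession alone always finds the record.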

5 GI - Also a sequence identification number
• Whenever a sequence is changed, the version number is increased and a new GI is assigned.
• If a nucleotide sequence record contains a protein translation of the sequence, the translation will have its own GI number.
The RefSeq accession number format and molecule types

Accession    Molecule type
NC_xxxxxx    Complete genomic molecule
NG_xxxxxx    Genomic region
NM_xxxxxx    mRNA
NP_xxxxxx    Protein
NR_xxxxxx    RNA
NT_xxxxxx    Computed genomic contig
XM_xxxxxx    Computed mRNA
XP_xxxxxx    Computed protein
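The prefix table lends itself to a simple lookup. The sketch below covers only the eight prefixes listed above; the accession strings used to try it out are placeholders, not real records.

```python
REFSEQ_PREFIXES = {
    "NC": "Complete genomic molecule",
    "NG": "Genomic region",
    "NM": "mRNA",
    "NP": "Protein",
    "NR": "RNA",
    "NT": "Computed genomic contig",
    "XM": "Computed mRNA",
    "XP": "Computed protein",
}


def refseq_molecule_type(accession: str) -> str:
    """Map a RefSeq accession to its molecule type via the two-letter prefix."""
    prefix, sep, _ = accession.partition("_")
    if not sep or prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"unknown RefSeq prefix in {accession!r}")
    return REFSEQ_PREFIXES[prefix]


# Placeholder accessions, used only to exercise the lookup.
print(refseq_molecule_type("NM_123456"))
print(refseq_molecule_type("XP_123456"))
```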
6 KEYWORDS - A keyword can be any word or phrase used to describe the sequence
• Keywords are not taken from a controlled vocabulary.
• Notice that in this record the keyword "haemochromatosis" employs the British spelling rather than the American "hemochromatosis."
• Many records have no keywords.
• A period is placed in this field for records without keywords.
7 SOURCE
• Usually contains an abbreviated or common name of the source organism.

8 ORGANISM - The scientific name (usually genus and species) and phylogenetic lineage
• See the NCBI Taxonomy Homepage for more information about the classification scheme used to construct taxonomic lineages.

9 REFERENCE - Citations of publications by the sequence authors that support the information presented in the sequence record
• Several references may be included in one record.
• References are automatically sorted from the oldest to the newest.
• Cited publications are searchable by author, article or publication title, journal title, or MEDLINE unique identifier (UID).
• The UID links the sequence record to the MEDLINE record.
9 REFERENCE
• If the REFERENCE TITLE contains the words "Direct Submission," contact information for the submitter(s) is provided.

The FEATURES table
• A feature is simply an annotation that describes a portion of the sequence.
• Each feature includes a location (sequence location or interval) and one or several qualifiers.
• Clicking on the feature name will open a record for the sequence interval identified in the feature location.
• A list of features can be found at http://www.ncbi.nlm.nih.gov/collab/FT/

SOURCE - An obligatory feature
• The source gives the length of the entire sequence, the scientific name of the source organism, and the Taxon ID number.
• Other types of information that the submitter may include in this field are chromosome number, map location, clone, and strain identification.

GENE
• Sequence portion that delineates the beginning and end of a gene.

EXON
• Sequence segment that contains an exon.
• Exons may contain portions of 5' and 3' UTRs (untranslated regions).
• The name of the gene to which the exon belongs and the exon number are provided.

CDS - Sequence of nucleotides that code for amino acids of the protein product (coding sequence)
• The CDS begins with the first nucleotide of the start codon and ends with the third nucleotide of the stop codon.
• This feature includes the translation into amino acids and may also contain gene name, gene product function, link to protein sequence record, and cross-references to other database entries.
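The start and end rule for a CDS can be made concrete with a toy example. The sketch below assumes 1-based, inclusive feature coordinates, an ATG start codon, and the three standard stop codons; the sequence and intervals are invented, and the function names are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def extract_cds(sequence: str, intervals: List[Tuple[int, int]]) -> str:
    """Join 1-based, inclusive (start, end) intervals into one coding sequence."""
    return "".join(sequence[start - 1:end] for start, end in intervals)


def looks_like_complete_cds(cds: str) -> bool:
    """Check: whole codons, starts with ATG, ends with a stop codon."""
    cds = cds.upper()
    return len(cds) % 3 == 0 and cds[:3] == "ATG" and cds[-3:] in STOP_CODONS


# Invented toy sequence: two exons (uppercase) around a lowercase intron.
toy = "CCATGGCTgtaagtTTTTAAGG"
cds = extract_cds(toy, [(3, 8), (15, 20)])
print(cds, looks_like_complete_cds(cds))
```

The first interval starts on the A of the start codon and the last interval ends on the third base of the stop codon, matching the rule stated on the slide.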

INTRON
• Transcribed but spliced-out parts.
• The intron number is shown.

polyA_signal - Identifies the sequence portion required for endonuclease cleavage of an mRNA transcript
• The consensus sequence for the polyA signal is AATAAA.
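Scanning a sequence for the consensus hexamer is a plain substring search. The following sketch reports 1-based positions, as feature tables do; the toy sequence is invented, and a match on the consensus alone does not prove that a site is functional.

```python
from typing import List


def find_polya_signals(sequence: str, signal: str = "AATAAA") -> List[int]:
    """Return the 1-based start position of every match, overlaps included."""
    seq = sequence.upper()
    positions = []
    index = seq.find(signal)
    while index != -1:
        positions.append(index + 1)  # convert 0-based index to 1-based position
        index = seq.find(signal, index + 1)
    return positions


toy = "GGCAATAAAGTTCCAATAAATT"  # invented sequence
print(find_polya_signals(toy))
```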

BASE COUNT & ORIGIN
• Base Count gives the total number of adenine (A), cytosine (C), guanine (G), and thymine (T) bases in the sequence.
• Origin contains the sequence data, which begins on the line immediately below the field title.
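The BASE COUNT values can be recomputed from the ORIGIN block. This sketch assumes ORIGIN lines consist of a position number followed by space-separated lowercase sequence; the two toy lines are invented, and both helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def parse_origin(lines: Iterable[str]) -> str:
    """Drop the position numbers and spaces, keeping only the sequence letters."""
    return "".join(ch for line in lines for ch in line if ch.isalpha())


def base_count(sequence: str) -> Dict[str, int]:
    """Count A, C, G and T in the sequence (case-insensitive)."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


origin_lines = [  # invented toy ORIGIN block
    "        1 aaccggttaa",
    "       11 gt",
]
sequence = parse_origin(origin_lines)
print(sequence, base_count(sequence))
```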

Sequence formats: FASTA format
SwissProt records
⚫ ID identification line
− ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
ID CRAM_CRAAB STANDARD; PRT; 46 AA.
− Format for the ENTRY_NAME: NAME_SPECIES (up to 10 characters); here: Crambin (Crambe abyssinica)
− For a number of organisms (16), SPECIES is a recognizable name: HUMAN, MOUSE, CHICK, BOVIN, YEAST, ECOLI, ...
− N.B. The ID can change, e.g. the serotonin receptors have been given a new nomenclature.
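The ID line layout shown above can be unpacked as follows. This is a sketch that assumes the older three-part layout on the slide (entry name and data class; molecule type; length); `parse_id_line` is a made-up helper name.

```python
from typing import Dict, Union


def parse_id_line(line: str) -> Dict[str, Union[str, int]]:
    """Parse 'ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.'"""
    if not line.startswith("ID "):
        raise ValueError(f"not an ID line: {line!r}")
    body = line[2:].strip().rstrip(".")
    head, molecule_type, length = [part.strip() for part in body.split(";")]
    entry_name, data_class = head.split()
    name, _, species = entry_name.partition("_")  # NAME_SPECIES
    return {
        "entry_name": entry_name,
        "name": name,
        "species": species,
        "data_class": data_class,
        "molecule_type": molecule_type,
        "length_aa": int(length.split()[0]),
    }


record_id = parse_id_line("ID CRAM_CRAAB STANDARD; PRT; 46 AA.")
print(record_id)
```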
SwissProt records
⚫ AC accession number
AC P01542;
The AC is unique: the name, the sequence, everything can change, but the AC stays the same.

⚫ DT deposition date
DT 21-JUL-1986 (Rel. 01, Created)
DT 30-MAY-2000 (Rel. 39, Last sequence update)
DT 30-MAY-2000 (Rel. 39, Last annotation update)
1) You cannot see what the last annotation update was
2) No depositor record (implicit: the author of the first reference)
SwissProt records
⚫ DE description
DE CRAMBIN.
DE 6-phosphofructo-2-kinase 1 (EC 2.7.1.105) (Phosphofructokinase 2 I)
1) General descriptive information
2) Free-format

⚫ GN gene name
GN THI2.
⚫ OS & OC & OG (Organism Species; Organism Classification; OrGanelle)
OS Crambe abyssinica (Abyssinian crambe).
OC Eukaryota; Viridiplantae; Embryophyta; Tracheophyta; Spermatophyta;
OC Magnoliophyta; eudicotyledons; Rosidae; eurosids II; Brassicales;
OC Brassicaceae; Crambe.
SwissProt records
⚫ RN References
RN [1]
RP SEQUENCE.
RX MEDLINE; 82046542.
RA Teeter M.M., Mazer J.A., L'Italien J.J.;
RT "Primary structure of the hydrophobic plant protein crambin.";
RL Biochemistry 20:5437-5443(1981).

⚫ CC Comments or notes
CC -!- FUNCTION: THE FUNCTION OF THIS HYDROPHOBIC PLANT SEED PROTEIN
CC IS NOT KNOWN.
CC -!- MISCELLANEOUS: TWO ISOFORMS EXISTS, A MAJOR FORM PL (SHOWN HERE)
CC AND A MINOR FORM SI.
CC -!- SIMILARITY: BELONGS TO THE PLANT THIONIN FAMILY.
SwissProt records
⚫ DR Database Cross Reference
DR PIR; A01805; KECX.
DR PDB; 1CRN; 16-APR-87.
DR PDB; 1CBN; 31-JAN-94.
DR PDB; 1CCM; 31-OCT-93.
DR PDB; 1CCN; 31-JAN-94.
DR PDB; 1CNR; 31-AUG-94.
DR PDB; 1AB1; 12-AUG-97.
DR INTERPRO; IPR001010; -.
DR PFAM; PF00321; plant_thionins; 1.
DR PRINTS; PR00287; THIONIN.
DR PROSITE; PS00271; THIONIN; 1.

⚫ KW Keyword
Not standardized (under control of depositor)
KW Thionin; 3D-structure.
SwissProt records
⚫ FT Feature table data
FT DISULFID 3 40
FT DISULFID 4 32
FT DISULFID 16 26
FT VARIANT 22 22 P -> S (IN ISOFORM SI).
FT VARIANT 25 25 L -> I (IN ISOFORM SI).
FT STRAND 2 3
FT HELIX 7 16
FT TURN 17 19
FT HELIX 23 30
FT TURN 31 31
FT STRAND 33 34
FT TURN 42 43
Feature table
⚫ Other features: post-translational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics reported in the cited references. Sequence conflicts between references are also included.
FT CONFLICT 33 33 MISSING (IN REF. 2).
FT MUTAGEN 123 123 G->R,L,M: DNA BINDING LOST.
FT MOD_RES 11 11 PHOSPHORYLATION (BY PKC).
FT LIPID 1 1 MYRISTATE.
FT CARBOHYD 103 103 GLUCOSYLGALACTOSE.
FT METAL 87 87 COPPER (POTENTIAL).
FT BINDING 14 14 HEME (COVALENT).
FT PROPEP 27 28 ACTIVATION PEPTIDE.
FT DOMAIN 22 788 EXTRACELLULAR (POTENTIAL).
FT ACT_SITE 193 193 ACCEPTS A PROTON DURING CATALYSIS.
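Each FT line carries a feature key, a start position, an end position, and an optional description, so the lines can be read into small records. The sketch below works on the whitespace-separated lines as they appear on the slides; real files use fixed columns and may contain uncertain positions, which this hypothetical `parse_ft_line` helper does not handle.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    key: str
    start: int
    end: int
    description: str


def parse_ft_line(line: str) -> Feature:
    """Parse 'FT KEY FROM TO [description]' as printed on the slides."""
    parts = line.split(None, 4)
    if len(parts) < 4 or parts[0] != "FT":
        raise ValueError(f"not a feature line: {line!r}")
    description = parts[4].strip() if len(parts) > 4 else ""
    return Feature(parts[1], int(parts[2]), int(parts[3]), description)


def disulfide_pairs(features: List[Feature]) -> List[Tuple[int, int]]:
    """Collect the residue pairs joined by DISULFID features."""
    return [(f.start, f.end) for f in features if f.key == "DISULFID"]


ft_lines = [
    "FT DISULFID 3 40",
    "FT DISULFID 4 32",
    "FT DISULFID 16 26",
    "FT VARIANT 22 22 P -> S (IN ISOFORM SI).",
    "FT HELIX 7 16",
]
features = [parse_ft_line(line) for line in ft_lines]
print(disulfide_pairs(features))
```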
SwissProt records
⚫ SQ sequence header
SQ SEQUENCE 46 AA; 4736 MW; 919E68AF159EF722 CRC64;

⚫ Sequence data
TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN

⚫ // termination line
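A quick consistency check is to compare the length declared in the SQ header with the sequence data that follows it. The sketch below does only that; it does not recompute the molecular weight or the CRC64 checksum, and the helper name is hypothetical.

```python
from typing import Dict, Union


def parse_sq_header(line: str) -> Dict[str, Union[str, int]]:
    """Parse 'SQ SEQUENCE <n> AA; <mw> MW; <checksum> CRC64;'."""
    fields = [field.strip() for field in line.split(";") if field.strip()]
    if len(fields) != 3 or not fields[0].startswith("SQ"):
        raise ValueError(f"not an SQ header: {line!r}")
    return {
        "length_aa": int(fields[0].split()[2]),
        "molecular_weight": int(fields[1].split()[0]),
        "crc64": fields[2].split()[0],
    }


header = parse_sq_header("SQ SEQUENCE 46 AA; 4736 MW; 919E68AF159EF722 CRC64;")
# Sequence data lines are grouped in blocks of 10 residues; drop the spaces.
residues = "TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN".replace(" ", "")
print(header["length_aa"], len(residues), header["length_aa"] == len(residues))
```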
PDB records
⚫ Filename = accession number = PDB code
1) The filename has 4 positions (often 1 digit & 3 letters, e.g. 1CRN)
2) Be aware: 0HYK means that entry HYK does not contain coordinates

⚫ HEADER
describes molecule & gives deposition date
HEADER PLANT SEED PROTEIN 30-APR-81 1CRN 1CRND 1

⚫ COMPND
name of molecule
COMPND CRAMBIN 1CRN 4

⚫ SOURCE
organism
SOURCE ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED 1CRN 5
PDB records
⚫ AUTHOR
The depositor
AUTHOR W.A.HENDRICKSON,M.M.TEETER 1CRN 6

⚫ JRNL
JRNL AUTH M.BLABER,X.-J.ZHANG,B.W.MATTHEWS 111L 10
JRNL TITL STRUCTURAL BASIS OF ALPHA-HELIX PROPENSITY AT TWO 111L 11
JRNL TITL 2 SITES IN T4 LYSOZYME 111L 12
JRNL REF SCIENCE V. 260 1637 1993 111L 13
JRNL REFN ASTM SCIEAS US ISSN 0036-8075 038 111L 14

⚫ REMARK
Not standardized: many different REMARK records & subrecords!
REMARK 1 REFERENCE 3 1CRNC 10
REMARK 1 AUTH M.M.TEETER,W.A.HENDRICKSON 1CRN 16
REMARK 1 TITL HIGHLY ORDERED CRYSTALS OF THE PLANT SEED PROTEIN 1CRN 17
REMARK 1 TITL 2 CRAMBIN 1CRN 18
REMARK 1 REF J.MOL.BIOL. V. 127 219 1979 1CRN 19
REMARK 1 REFN ASTM JMOBAK UK ISSN 0022-2836 070 1CRN 20
REMARK 2 1CRN 21
REMARK 2 RESOLUTION. 1.5 ANGSTROMS. 1CRN 22
PDB records
⚫ SEQRES
Sequence of the protein
Be aware: 3D coordinates are not always present for all the amino acids in SEQRES!
SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51
SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52
SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53
SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54
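SEQRES lists residues by three-letter code, so converting them to one-letter codes makes the chain comparable with the SwissProt sequence data. The sketch assumes the old layout printed on the slide (no chain identifier, with the entry code and line number as the last two tokens) and only the 20 standard amino acids.

```python
from typing import Iterable

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_one_letter(lines: Iterable[str]) -> str:
    """Concatenate the residues of SEQRES lines as one-letter codes."""
    letters = []
    for line in lines:
        tokens = line.split()
        if tokens[0] != "SEQRES":
            continue
        # tokens: SEQRES, serial, residue count, residues..., entry code, line no.
        for residue in tokens[3:-2]:
            letters.append(THREE_TO_ONE[residue])
    return "".join(letters)


seqres_lines = [
    "SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51",
    "SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52",
    "SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53",
    "SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54",
]
chain = seqres_to_one_letter(seqres_lines)
print(len(chain), chain)
```

Compared with the SwissProt sequence data for crambin shown earlier, this chain differs at position 25 (I instead of L), which is one of the positions listed in the FT VARIANT lines.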

⚫ HET & FORMUL
Metals, cofactors, ions, etc.
HET NAD A 1 44 NAD CO-ENZYME 4MDH 219
HET SUL A 2 5 SULFATE 4MDH 220
HET NAD B 1 44 NAD CO-ENZYME 4MDH 221
HET SUL B 2 5 SULFATE 4MDH 222
FORMUL 3 NAD 2(C21 H28 N7 O14 P2) 4MDH 223
FORMUL 4 SUL 2(O4 S1) 4MDH 224
FORMUL 5 HOH *471(H2 O1) 4MDH 225
PDB records
⚫ HELIX/SHEET/TURN
Secondary structure elements as provided by the crystallographer (subjective)
HELIX 1 H1 ILE 7 PRO 19 1 3/10 CONFORMATION RES 17,19 1CRN 55
SHEET 2 S1 2 CYS 32 ILE 35 -1 1CRN 58
TURN 1 T1 PRO 41 TYR 44 1CRN 59

⚫ SSBOND
disulfide bridges
SSBOND 1 CYS 3 CYS 40 1CRN 60
SSBOND 2 CYS 4 CYS 32 1CRN 61

⚫ CRYST1, ORIGX1, ORIGX2, ORIGX3, SCALE1, SCALE2, SCALE3
Crystallographic parameters
CRYST1 40.960 18.650 22.520 90.00 90.77 90.00 P 21 2 1CRN 63
ORIGX1 1.000000 0.000000 0.000000 0.00000 1CRN 64
ORIGX2 0.000000 1.000000 0.000000 0.00000 1CRN 65
ORIGX3 0.000000 0.000000 1.000000 0.00000 1CRN 66
SCALE1 .024414 0.000000 -.000328 0.00000 1CRN 67
SCALE2 0.000000 .053619 0.000000 0.00000 1CRN 68
SCALE3 0.000000 0.000000 .044409 0.00000 1CRN 69
PDB records
⚫ ATOM
one line for each atom with its unique name and its x,y,z coordinates
ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79 1CRN 70
ATOM 2 CA THR 1 16.967 12.784 4.338 1.00 10.80 1CRN 71
ATOM 3 C THR 1 15.685 12.755 5.133 1.00 9.19 1CRN 72
ATOM 4 O THR 1 15.268 13.825 5.594 1.00 9.85 1CRN 73
ATOM 5 CB THR 1 18.170 12.703 5.337 1.00 13.02 1CRN 74
ATOM 6 OG1 THR 1 19.334 12.829 4.463 1.00 15.06 1CRN 75
ATOM 7 CG2 THR 1 18.150 11.546 6.304 1.00 14.23 1CRN 76
ATOM 8 N THR 2 15.115 11.555 5.265 1.00 7.81 1CRN 77
ATOM 9 CA THR 2 13.856 11.469 6.066 1.00 8.31 1CRN 78
ATOM 10 C THR 2 14.164 10.785 7.379 1.00 5.80 1CRN 79
ATOM 11 O THR 2 14.993 9.862 7.443 1.00 6.94 1CRN 80

⚫ TER record terminates the amino acid chain
ATOM 325 OD1 ASN 46 11.982 4.849 15.886 1.00 11.00 1CRN 394
ATOM 326 ND2 ASN 46 13.407 3.298 15.015 1.00 10.32 1CRN 395
ATOM 327 OXT ASN 46 12.703 4.973 10.746 1.00 7.86 1CRN 396
TER 328 ASN 46 1CRN 397
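Because every ATOM line gives an atom name and x, y, z coordinates, simple geometry follows directly from the records. The sketch below splits on whitespace, which works for the lines as printed on these slides; real PDB files are fixed-column and usually carry a chain identifier, so a column-based parser or an established library would be the safer choice there. The names `Atom`, `parse_atom_line`, and `distance` are invented for the example.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse a whitespace-separated ATOM line in the layout shown on the slide."""
    tokens = line.split()
    if len(tokens) < 8 or tokens[0] != "ATOM":
        raise ValueError(f"not an ATOM line: {line!r}")
    return Atom(int(tokens[1]), tokens[2], tokens[3], int(tokens[4]),
                float(tokens[5]), float(tokens[6]), float(tokens[7]))


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the units of the coordinates."""
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)


n = parse_atom_line("ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79 1CRN 70")
ca = parse_atom_line("ATOM 2 CA THR 1 16.967 12.784 4.338 1.00 10.80 1CRN 71")
print(f"N-CA distance in THR 1: {distance(n, ca):.3f}")
```

For these two atoms the distance comes out at roughly 1.5 Å.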
PDB records
⚫ HETATM
atomic coordinate records for atoms within “HET & FORMUL” lines (metals, cofactors, ions, …) and for water molecules
HETATM 5158 AP NAD B 1 42.641 30.361 41.284 1.00 26.73 4MDH5495
HETATM 5159 AO1 NAD B 1 43.440 31.570 40.868 1.00 20.69 4MDH5496
HETATM 5160 AO2 NAD B 1 41.161 30.484 41.376 1.00 33.73 4MDH5497

HETATM 5207 O HOH 0 15.379 1.907 3.295 1.00 58.12 4MDH5544
HETATM 5208 O HOH 1 58.861 0.984 17.024 1.00 37.58 4MDH5545
HETATM 5209 O HOH 2 24.384 1.184 74.398 1.00 35.92 4MDH5546
End of presentation
