Applied Bioinformatics in Genetics

This document provides an introduction to searching and exploring sequence data from the GenBank database. It describes key fields and features contained in GenBank entries, including the LOCUS, DEFINITION, ACCESSION, and FEATURES fields. It also explains how to retrieve a sequence from GenBank using an accession number or locus name, and highlights some display options for viewing sequence records. The goal is to familiarize students with the structure and content of GenBank entries.

VIETNAM NATIONAL UNIVERSITY HCMC

INTERNATIONAL UNIVERSITY
SCHOOL OF BIOTECHNOLOGY

APPLIED BIOINFORMATICS for


MOLECULAR GENETICS

Prepared by: Nguyen Minh Thanh, PhD.

HCMC - 2013
LAB 1 – TEXT SEARCH FROM ONLINE DATABASES
Introduction
The GenBank sequence database is an annotated collection of all publicly available nucleotide
sequences and their protein translations. It is produced at the National Center for
Biotechnology Information (NCBI) as part of an international collaboration with the European
Molecular Biology Laboratory (EMBL) Data Library at the European Bioinformatics Institute (EBI)
and the DNA Data Bank of Japan (DDBJ). These three databases exchange data constantly and are
almost identical as far as sequence data goes, but they use different formats for their header
information. A large section of these databases is made up of expressed sequence tags (ESTs),
which are mass-sequenced cDNAs, often uncharacterized and containing sequencing errors. Because of
the volume and poor quality of the data, ESTs are kept in a separate section of the database.
GenBank and its collaborators receive sequences produced in laboratories throughout the world from
more than 100,000 distinct organisms. GenBank continues to grow at an exponential rate, doubling
every 10 months. It is built from direct submissions by individual laboratories, as well as from
bulk submissions by large-scale sequencing centers.

Among protein databases, SWISS-PROT contains the highest quality information: its sequences are
all curated and annotated manually. However, because this process takes time, SWISS-PROT is not
as up to date as automatically generated databases such as GenPept, which contains the translated
coding regions from GenBank.

Nucleotide and protein sequence databases are growing extremely rapidly. These sequence
databases store not only the sequence data, but also a lot of reference information such as the
name of the gene or the protein, the organism of origin, who reported it and where in the literature,
important sequence features etc. Each sequence is stored in the database as a separate ENTRY,
together with the reference information organized in a series of fields. Field names vary from
database to database. For example, the general description of a sequence is stored in the
DEFINITION field in GenBank and in the DE field in EMBL. Each database entry can be identified by
an entry code (also called locus or ID code) or a unique accession number. The ID code or LOCUS in
GenBank is sometimes descriptive of the function of the sequence. The accession number is a non-
descriptive number. The accession number is given to a sequence when it is first entered into the
database and refers to the sequence itself, whereas the sequence that an ID code/LOCUS points to
may change as more data is added to a particular entry.

Retrieval from a database using LOCUS/Accession number – Text search

Exercise 1
• Go to the GenBank page.

• Select the database Nucleotide

• Enter humsomi as the code to search for, then click on Search or Enter.

• Click on the link labelled FASTA in the upper left of the page to get a FASTA-formatted
sequence.

• Save the sequence as humsoma in your folder.

Field Descriptions

1. The LOCUS field consists of five different subfields:


Locus Name (HUMSOMI) - The locus name is a tag for grouping similar sequences. The first two or
three letters usually designate the organism. In this case HUM stands for Homo sapiens. The last
several characters are associated with another group designation, such as gene product. In this
example, the last four characters represent the gene symbol, SOMI. Currently, the only requirement
for assigning a locus name to a record is that it be unique.

Sequence Length (2667 bp) - The total number of nucleotide base pairs (or amino acid residues) in
the sequence record. Nucleotide sequence length can range from 50 bp to 350 kb.

Molecule Type (DNA) - Type of molecule that was sequenced. All sequence data must come from a
single molecule type. Some examples of molecule type include genomic DNA, mRNA (cDNA),
genomic RNA, and ribosomal RNA.

GenBank Division (PRI) - There are 16 different GenBank divisions. In this example, PRI stands for
primate sequences. Some other divisions include ROD (rodent sequences), MAM (other mammal
sequences), PLN (plant, fungal, and algal sequences), and BCT (bacterial sequences).

Modification Date (13-JAN-1995) - Date of most recent modification made to the record. The date of
first public release is not available in the sequence record. This information can be obtained only by
contacting NCBI at info@[Link].
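The five LOCUS subfields described above can be pulled apart with a few lines of Python. This is only a minimal sketch: the sample line is assembled from the values quoted in the text (its spacing is illustrative), the function name `parse_locus` is hypothetical, and the optional topology token is an assumption about what real records may contain.

```python
def parse_locus(line):
    """Split a nucleotide LOCUS line into the five subfields described above."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    fields = tokens[1:]
    # The sequence length is the token just before the unit ("bp").
    unit_index = fields.index("bp")
    # Assumption: real records may also carry a topology token; drop it if present.
    rest = [t for t in fields[unit_index + 1:]
            if t.lower() not in ("linear", "circular")]
    return {
        "locus_name": fields[0],
        "length_bp": int(fields[unit_index - 1]),
        "molecule_type": rest[0],
        "division": rest[1],
        "modification_date": rest[2],
    }


# Line assembled from the values quoted in the text; spacing is illustrative.
SAMPLE_LOCUS = "LOCUS       HUMSOMI      2667 bp    DNA     PRI       13-JAN-1995"
print(parse_locus(SAMPLE_LOCUS))
```

The sketch handles nucleotide records only; protein records use a different unit and have no molecule type, so they would need separate handling.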

2. DEFINITION

Brief description of the sequence. The description may include source organism name, gene or
protein name, or function of a noncoding sequence (e.g., a promoter region). For sequences
containing a coding region (CDS), the definition field may also contain a completeness qualifier such
as "complete CDS" or "exon 1," indicating sequence information pertaining only to a gene's first
exon.

3. ACCESSION (J00306)

Unique identifier assigned to a complete sequence record. This number never changes, even if the
record is modified. An accession number is a combination of letters and numbers that are usually in
the format of one letter followed by five digits (e.g., M12345) or two letters followed by six digits
(e.g., AC123456).

4. VERSION (J00306.1)

Identification number assigned to a single, specific sequence in the database. This number is in the
format [Link]. If any changes are made to the sequence data, the version part of the
number will increase by one. For example U12345.1 becomes U12345.2. A version number of
J00306.1 for this HUM sequence indicates that the sequence data has not been altered since its
original submission.
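As a small illustration of the ACCESSION and VERSION conventions described above, the following sketch checks an identifier against the two accession formats mentioned in the text and splits off the version suffix. The names `ACCESSION_RE` and `split_version` are hypothetical, and the pattern covers only the two formats quoted here; other accession formats exist.

```python
import re

# Only the two formats named in the text: 1 letter + 5 digits, or 2 letters + 6 digits.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def split_version(identifier):
    """Return (accession, version); version is None when no suffix is present."""
    accession, _, version = identifier.partition(".")
    if not ACCESSION_RE.match(accession):
        raise ValueError(f"unexpected accession format: {accession!r}")
    return accession, (int(version) if version else None)


print(split_version("J00306.1"))   # ('J00306', 1)
print(split_version("U12345.2"))   # ('U12345', 2)
print(split_version("AC123456"))   # ('AC123456', None)
```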

5. GI (338287)

Also a sequence identification number. Whenever a sequence is changed, the version number is
increased and a new GI is assigned. If a nucleotide sequence record contains a protein translation of
the sequence, the translation will have its own GI number.

6. KEYWORDS

A keyword can be any word or phrase used to describe the sequence. Keywords are not based on
any controlled vocabulary. For many records, no keywords are included. A period is placed in this
field for records without keywords.

7. SOURCE (human)

Usually contains an abbreviated or common form of the source organism's name.

8. ORGANISM (Homo sapiens)

Source organism's formal scientific name (usually genus and species) and phylogenetic lineage. See
the NCBI Taxonomy Homepage for more information about the classification scheme used to
construct the organism's lineage.

9. REFERENCE

Citations of publications by sequence authors that support information presented in the sequence
record. Several references may be included in one record. References are automatically sorted so
that the oldest are always listed first. Cited publications listed as references are searchable by
author, article or publication title, journal title, or MEDLINE unique identifier (UID). The UID links to
the reference's MEDLINE record.

10. FEATURES

In a sequence record, a list of sequence features follows the references. A feature is simply an
annotation that describes a portion of the sequence. An alphabetical list of features can be found in
Appendix III: Feature Keys Reference of the DDBJ/EMBL/GenBank Feature Table. Each feature
includes a location (sequence interval to which the feature refers) and one or several qualifiers.
Clicking on the feature name will open a record for the sequence interval identified in the feature
location.

The following features are included in the sample HUMSOMI sequence record J00306:

source - The source feature must be included in each sequence record. The source gives the length
of the entire sequence, the scientific name of the source organism, and the Taxon ID number. Other
types of information that the submitter may include in this field are chromosome number, map
location, and clone or strain identification.

exon - Sequence segment that codes for a portion of spliced mRNA, rRNA, or tRNA. An exon may
contain a portion of mRNA's 5' UTR (untranslated region) or 3' UTR, in addition to part of the coding
sequence. The name of the gene to which the exon belongs and exon number are provided.

gene - Sequence portion that encodes a specific functional product.

CDS - Sequence of nucleotides that code for amino acids of the protein product (coding sequence).
The CDS begins with the start codon's first nucleotide and ends with the stop codon's third
nucleotide. This feature includes the coding sequence's amino acid translation and may also contain
gene name, gene product function, link to protein sequence record, and cross-references to other
database entries.

intron - Segment of noncoding sequence that is transcribed but removed from the transcript by
splicing together the exons (sequence portions) on either side of it.

polyA_signal - Identifies the sequence portion required for endonuclease cleavage of an mRNA
transcript. Consensus sequence for the polyA signal is AATAAA.
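To make the polyA_signal feature concrete, the sketch below scans a sequence for the AATAAA consensus and reports 1-based start positions, as feature locations are 1-based. The function name `find_motif` and the toy sequence are made up for illustration; a hit in a real record is only a candidate signal, not an annotated feature.

```python
def find_motif(sequence, motif="AATAAA"):
    """Return 1-based start positions of every (possibly overlapping) motif hit."""
    seq = sequence.upper()
    motif = motif.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)          # convert 0-based index to 1-based
        start = seq.find(motif, start + 1)   # step by one to allow overlaps
    return positions


# Hypothetical toy sequence, not taken from any database record.
toy = "ggcctaataaagcatttgcaataaatt"
print(find_motif(toy))  # [6, 20]
```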

11. ORIGIN

Origin contains the sequence data, which begins on the line immediately below the field title.

Exploring display options for sequence records

In the left corner near the top of each record in the Entrez Nucleotide database is the Display drop-
down box. The Display setting described in the previous section of this tutorial is the default or
GenBank display. NCBI also provides several other formats for viewing a sequence record.

- The Summary display will bring up the sequence's Accession number and an abbreviated
description. The GI List is another brief format that lists the GI (GenInfo identification) number for
the sequence.

- The ASN.1 display will bring up a computer-readable data format known as the Abstract Syntax
Notation 1 form. XML (Extensible Markup Language), GBSeqXML, and TinySeqXML are other
computer-readable formats.

- The FASTA format consists of a single line of descriptive text called the definition line followed by
the sequence characters. The FASTA format of a sequence can be used as input for sequence
analysis tools such as NCBI's BLAST.
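Because the FASTA layout is so simple (one definition line starting with ">", then sequence lines), it can be read without any special library. The following is a minimal sketch; `parse_fasta` is a hypothetical helper and the two records are invented toy data.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)               # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical toy records used only to exercise the parser.
toy = """>toy_seq_1 hypothetical example record
ATGGCTACC
GGTTAA
>toy_seq_2 another made-up record
ATGCCC
"""
for definition, sequence in parse_fasta(toy):
    print(definition, len(sequence))
```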

Exercise 2
• Go to the GenBank page.

• Select the database Nucleotide

• Enter X59263 as the accession number to search for and click on Search or Enter. This entry
contains an E. coli alcohol dehydrogenase gene sequence.

Exercise 3
Text search can retrieve more than one code at a time.

• Go to the GenBank page.

• Select the database protein

• Type the following entry codes in the entry box, separated by OR, and click on Search or
Enter.

Ins_human OR ins_bovin OR ins_crilo OR ins_cavpo OR ins1_xenla

• Save the sequences as follows:

Ins_human into humins

Ins_bovin into bovins

Ins_crilo into hamins

Ins_cavpo into guiins

Ins1_xenla into frogins

Retrieval from a database using Text Search Terms

Exercise 4: Standard query


• Go to the GenBank page.

• Select the database protein

• Type steroid as the query in the text box, and then click on Search or Enter.

Exercise 5: Searching for more than one word


• Go to the GenBank page.

• Select the database protein

• Type steroid receptor as the query in the text box, and then click on Search or Enter.

• Note the number of results you get.

Exercise 6: AND, OR and NOT


• Go to the GenBank page.

• Select the database protein

• Type steroid receptor as the query in the text box, and then insert AND, OR or NOT between
steroid and receptor (run one search for each operator).

• Click Search or Enter.

• Note the number of results you get.

Exercise 7: Advanced search

The different search methods can be combined to generate complex queries. The following example
retrieves the human steroid receptors in the database that are not androgen or mineralocorticoid
receptors.

• Go to the GenBank page.

• Select the database protein, and select Advanced search

• Type steroid receptor in the first text box

• Enter human in the second text box and change the menu immediately before this text box
to AND.

• Enter androgen in the third text box and change the menu immediately before this text box
to NOT.
• Enter mineralocorticoid in the fourth text box and change the menu immediately before this
text box to NOT.

• Click Search or Enter.

• Note the number of results you get.
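The combined query built in the form above can also be written as a single search string. The sketch below assembles such a string from terms and Boolean operators; `build_query` is a hypothetical helper, and wrapping multi-word terms in parentheses is an assumption made here to keep them together, not a documented requirement.

```python
def build_query(first_term, *clauses):
    """Join terms with Boolean operators; clauses are (operator, term) pairs."""
    def wrap(term):
        # Assumption: parenthesise multi-word terms so they stay grouped.
        return f"({term})" if " " in term else term

    query = wrap(first_term)
    for operator, term in clauses:
        op = operator.upper()
        if op not in {"AND", "OR", "NOT"}:
            raise ValueError(f"unsupported operator: {operator!r}")
        query = f"{query} {op} {wrap(term)}"
    return query


print(build_query("steroid receptor",
                  ("AND", "human"),
                  ("NOT", "androgen"),
                  ("NOT", "mineralocorticoid")))
# (steroid receptor) AND human NOT androgen NOT mineralocorticoid
```

The same helper covers the OR search of Exercise 3, for example `build_query("INS_HUMAN", ("OR", "INS_BOVIN"))`.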

Searching all databases for a term

Exercise 8:
• Go to the NCBI page.

• Select All Databases

• Type haemochromatosis as the query in the text box.

• Click Search or Enter.

Multiple entries are identified, including DNA and protein sequences, PubMed references, a
PDB structure and a reference to Online Mendelian Inheritance in Man (OMIM).

Exercise 9: Refining sequence searches
• Go to the NCBI page.

• Select Nucleotide, search for MHC

• Now search for one of the species listed below, using Advanced search to combine
the two searches.

• Save the sequences as the set file MHC

Kangaroo Shark Gorilla Echidna Sloth

Panther Tiger Elk Dolphin Cow

Elephant Tortoise Domestic cat Guinea pig

Questions:
1. Explain the difference in gene structure between the results of the humsomi search
(Exercise 1) and the X59263 search (Exercise 2).

2. Look through the results of Exercise 9 and list the species with and without an MHC
sequence identified. For species with an MHC sequence, explain the reason. For species
without one, try to explain why it was not picked up.

3. Complete a search of NM_003227 and answer the following points:

a. Gene name
b. Gene length
c. Organism of origin
d. Number of introns and exons. If introns or exons were not reported, give the reason.

LAB 2 – SEQUENCE DATABASE COMPARISONS
This section deals with some of the utilities at NCBI for searching the various databases for
sequences similar to a query sequence. Database similarity searching is based on sequence
alignment. An exhaustive database search program works by taking every sequence in the database,
aligning it with the query sequence, calculating the score for this alignment, and finally selecting
those database sequences with the highest alignment scores. These best-matching database ‘hits’
are listed, together with their alignments with the query sequence. In practice, most programs do
not align the query fully against every database sequence, but take ‘shortcuts’ so that the search
finishes in a reasonable amount of time.
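The exhaustive strategy described above (score every database sequence, then rank) can be sketched in a few lines. This is a toy illustration, not how BLAST works internally: `local_score` is a plain Smith-Waterman score with a linear gap penalty, the scoring values are arbitrary assumptions, and the three-sequence "database" is invented.

```python
def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def exhaustive_search(query, database, top=3):
    """Score the query against every entry and return the best (score, name) pairs."""
    scored = [(local_score(query, seq), name) for name, seq in database.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top]


# Hypothetical toy database.
toy_db = {"seq_a": "TTACGTACGTTT", "seq_b": "GGGGGGGG", "seq_c": "ACGTTCGT"}
for score, name in exhaustive_search("ACGTACGT", toy_db):
    print(name, score)
```

Every comparison fills a full dynamic-programming table, which is why this approach becomes too slow for large databases and heuristic shortcuts are used instead.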

BLAST
BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs (blastn, blastp,
blastx, tblastn, tblastx) designed to explore all of the available sequence databases regardless of
whether the query is protein or DNA.

Exercise 1: Basic Blast search

• Go to the NCBI page, and then select BLAST

• Click on protein Blast (Blastp)

• Select a histone acetyltransferase protein (this protein is involved in chromatin remodelling
and thereby in the regulation of gene expression) from a human gene for the BLAST search (a
guide to selecting this protein follows the steps of Exercise 1) and paste your
FASTA-formatted protein sequence, including the first line with the name on it, into the search
box labelled ‘Enter accession number(s), gi(s), or FASTA sequence(s)’

• Use all the default parameters, and click on BLAST

• Then, set up an additional blastp with the same input sequence, but choose the SWISS-PROT
database and keep the default values for the other parameters

• Click on BLAST

Searching a histone acetyltransferase

• Open a new window, then go to NCBI or GenBank

• Enter NP_008998 (accession number for histone acetyltransferase of Homo sapiens) in the
text box

• A database entry from the GenBank is displayed, click FASTA (right before the blue line that
separates the heading and the entry information) to copy the protein sequence in a FASTA
format.

The results of the Exercise 1 search will appear on your screen in the following format:

The first part you see is the header information which describes your query sequence as well as the
database that was searched.

The second part is a graphic summary. The first red bar is the query sequence. The other red bars
indicate where, relative to the query sequence, the matches are located.

The next part of the results is the data that explains the degree of similarity between the database
sequences and the query. This part lists the homologous sequences that align with your query
sequence, one per line. Have a look at a single entry in the list. By clicking on the Accession
number you go straight to the database entry of that sequence. Click on the value under the Score
column to bring up the alignment of the database sequence and the query sequence.

The fundamental unit of a BLAST result is the high-scoring segment pair (HSP).
An HSP consists of two sequence fragments of arbitrary but equal length whose alignment is locally
maximal and for which the alignment score meets or exceeds a threshold or cut-off score; the set
of HSPs may be empty if the cut-off score is sufficiently high. Each HSP consists of a segment from
the query sequence and one from a database sequence.

E-values (Expect values) help to evaluate the results of the search. The E-value of a hit is the
number of alignments with an equal or better score that would be expected to occur by chance in a
database of the same size; for small values it is close to the probability that the similarity is a
chance occurrence. The lower the E-value, the more likely it is that the similarity reflects a
biological relationship. As a rule of thumb, a match with an E-value lower than 0.001 (for a DNA
search) or 0.02 (for a protein search) is considered statistically significant.
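For readers who want to see where an E-value comes from, the sketch below uses the commonly quoted relation between a normalised bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m and n are the query and database lengths, and converts E to a probability of at least one chance hit with P = 1 - e^(-E). The function names are hypothetical, and real BLAST uses length-corrected ("effective") values for m and n, so the numbers will not match a BLAST report exactly.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of seeing at least one such chance hit (Poisson assumption)."""
    return -math.expm1(-e_value)


# Hypothetical lengths chosen so the arithmetic is easy to follow.
e = expect_value(bit_score=10, query_length=32, database_length=32)
print(e)            # 1.0
print(p_value(e))   # about 0.632
print(p_value(0.001))
```

For small E the two quantities are nearly identical, which is why the E-value is often read informally as a probability.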

The last part is the actual homologous sequences aligned with the query sequence. The format of
the alignment is: the top line is the query sequence (Query), the bottom line is the database
sequence (Sbjct), and the middle line is a consensus that shows wherever the two sequences are
identical.

Exercise 2: Filtered (limited) blast search

• Set up Blastp with the query sequence of histone acetyltransferase

• In the Organism box, type Drosophila melanogaster and use all the default parameters in
this instance

• Click on BLAST

• Once the search has finished, have a look at the format of the alignments:

Amino acid match: ‘letter’ = identical, ‘ ‘ = no match, ‘+’ = similar amino acid, ‘-‘ = gap
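The middle line of a BLAST-style protein alignment can be reproduced from the two aligned rows using the symbols listed above. In this sketch `SIMILAR_GROUPS` is a deliberately simplified, hypothetical stand-in: BLAST itself prints '+' where the substitution matrix score is positive, not from fixed groups. The aligned pair is invented.

```python
# Hypothetical, simplified similarity groups used only for this sketch.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]


def midline(query, subject):
    """Build the middle line: letter = identical, '+' = similar, ' ' = otherwise."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    out = []
    for q, s in zip(query.upper(), subject.upper()):
        if q == "-" or s == "-":
            out.append(" ")                  # gap column: nothing on the midline
        elif q == s:
            out.append(q)
        elif any(q in group and s in group for group in SIMILAR_GROUPS):
            out.append("+")
        else:
            out.append(" ")
    return "".join(out)


query_row = "MKV-LDE"
sbjct_row = "MRVALEE"
print(query_row)
print(midline(query_row, sbjct_row))
print(sbjct_row)
```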

Exercise 3:

• Set up Blastp with the query sequence of mpi (mannose-6-phosphate isomerase, NP_002426)

• In the Organism box, enter the scientific name of the following species and use all the
default parameters in this instance

o Homo sapiens neanderthalensis


o Homo sp. Altai
o Pan troglodytes
o Pan paniscus
o Gorilla gorilla
o Pongo abelii

• Click on BLAST

• Note the number of hits you get and their values

Exercise 4: Non-standard Blast search

Non-standard searches are those that require translation of the database and/or the query
sequence prior to searching for similarities. These searches can be used to identify potential
translated regions of a genomic sequence (by comparing it against a protein database) or to
identify the potential cDNA sequence from a peptide sequence.

Please note: These programs take some time to run.

• Set up Blastx with the query sequence of bsmp (HE962377), then click on BLAST

• Set up tBlastn with the query sequence of mpi (NP_002426), then click on BLAST

Questions:
1. Using the results of Exercise 2, explain whether a chromatin remodelling system is present
in the insect.

2. List the species that do not have a sequence homologous to the query of Exercise 3. How
might you explain this in terms of evolution?

3. Set up a BLAST search for the unknown sequence provided on Blackboard. Identify the potential
gene for this unknown sequence, its function and the organism that carries this gene.

LAB 3 – SEQUENCE ALIGNMENT
Two sequence comparison
The programs described in this section compare two sequences in order to estimate their similarity
and align them to determine possible homologies.

Exercise 1: (running the program – Water-Protein)

• Go to [Link]

• Use water to align the two protein sequences humins and bovins. Use the default
parameters.

• Once done, have a look at the alignment

Note that the Percent Similarity is higher than the Percent Identity. This is due to the similarity
value taking into account the number of aligned amino acids that have similar properties. When
aligning two nucleotide sequences the Percent Similarity is identical to the Percent Identity.

Perfect matches are indicated by bars (|), conservative pairings by colons (:) and
semi-conservative pairings by dots (.).
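The relationship between Percent Identity and Percent Similarity can be checked by counting columns of an alignment. In the sketch below the similarity groups are a hypothetical simplification (water derives similarity from its scoring matrix), percentages are taken over the full alignment length including gap columns, and both aligned pairs are invented.

```python
# Hypothetical simplified groups; water itself uses a substitution matrix.
PROTEIN_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]


def alignment_stats(row_a, row_b, similar_groups=()):
    """Return percent identity, similarity and gaps over the alignment length."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    identical = similar = gaps = 0
    for x, y in zip(row_a.upper(), row_b.upper()):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            identical += 1
            similar += 1                      # identical residues also count as similar
        elif any(x in g and y in g for g in similar_groups):
            similar += 1
    length = len(row_a)
    return {
        "identity": 100.0 * identical / length,
        "similarity": 100.0 * similar / length,
        "gaps": 100.0 * gaps / length,
    }


print(alignment_stats("MKV-LDE", "MRVALEE", PROTEIN_GROUPS))   # protein: similarity > identity
print(alignment_stats("ACGTAC", "ACGTTC"))                     # nucleotide: the two are equal
```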

Exercise 2: (running the program – Water-Nucleotide)

• Use water to align humsomi (J00306) with NM_012659. Use the default parameters.

• Once done, have a look at the alignment

• Note that the Percent Similarity and Percent Identity are the same.

Exercise 3: Stretcher

In some cases it is possible to use stretcher to align sequences of very different lengths. The
following exercise demonstrates the use of gap penalties to align a genomic DNA sequence with the
coding regions of the sequence.

In such a case, the expected output is an alignment containing few, but very long, gaps (these
gaps correspond to the introns in the genomic sequence). For the program to identify these gaps
correctly, the gap opening penalty must stay high enough that few gaps are created, while the gap
extension penalty must be set lower, so that the program can insert the very long gaps.
• Use stretcher to align humsomi with BC032625, first with the default parameters and then with
the following parameters: Gap open = 10 and Gap extend = 1

• Look at the alignment noting the number of mismatches in the first and second parts of the
alignment. The second alignment has been geared towards aligning a genomic sequence to a
cDNA sequence.
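The effect of the two gap parameters is easiest to see with a little arithmetic. The sketch below assumes one common affine convention, in which the first gapped position costs the opening penalty and every further position costs the extension penalty; the exact convention used by stretcher should be checked in its documentation. The values 16 and 4 are illustrative only and are not claimed to be the program defaults.

```python
def gap_cost(length, gap_open, gap_extend):
    """Affine gap cost: gap_open for the first position, gap_extend for each further one."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


intron = 200  # hypothetical intron length in bases

# Illustrative higher penalties versus the values used in the exercise.
print(gap_cost(intron, gap_open=16, gap_extend=4))   # 812
print(gap_cost(intron, gap_open=10, gap_extend=1))   # 209

# With a low extension penalty one long gap is cheaper than four short ones.
print(gap_cost(200, 10, 1), 4 * gap_cost(50, 10, 1))  # 209 236
```

A cheap extension makes a single intron-sized gap affordable, while the opening penalty still discourages scattering many small gaps through the exons.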

Exercise 4: Single Nucleotide Polymorphism Locating program

There are many alleles of the HLA-B gene. These differ from each other by a small number of point
mutations or SNPs. The program diffseq allows you to identify where the point mutations are
located. Large sequences can be used as input provided that they share an overlapping region.
Locating SNPs is important in identifying polymorphic markers for mapping
studies.

• Using the HLA-B alleles HLA-B07021 (GenBank: AH010802) and HLA-B4101 (GenBank:
AH010727) run the program diffseq: [Link]

• Use default parameters

• In this instance both sequences are just exon 2, so the overlap covers the entire sequence.

• All regions of interest (differences within the overlap) are then listed. In the first case there
is a 4 bp region that differs between the two sequences, and in the second there is a single
nucleotide change. The end of the result indicates the number of single nucleotide differences and
how many of them are transitions and how many are transversions.
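The transition/transversion count reported at the end of the output follows a simple rule: a change within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any purine-pyrimidine change is a transversion. The sketch below applies that rule to two already aligned sequences; `classify_snps` is a hypothetical helper, not part of diffseq, and the sequences are invented.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snps(seq_a, seq_b):
    """List (position, base_a, base_b, kind) for each single-base difference."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    snps = []
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue                          # skip matches, gaps and ambiguity codes
        same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
        snps.append((pos, a, b, "transition" if same_class else "transversion"))
    return snps


# Hypothetical aligned toy sequences.
for snp in classify_snps("ACGTACGT", "ACATACCT"):
    print(snp)
# (3, 'G', 'A', 'transition')
# (7, 'G', 'C', 'transversion')
```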

Exercise 5:

• Using diffseq, compare the two MICA alleles 0701 and 001 (GenBank: AY750850 and
AY204547)

• Use default parameters

• Note that in addition to the four SNPs there is a 3bp insertion in one of the sequences. Notice how
this affects the overlap end co-ordinates.

Multiple sequence analysis


This section covers the programs for creating and formatting multiple sequence alignments, as well
as programs which can analyse such alignments to display the similarity between sequences. A
multiple sequence alignment is an alignment of more than two sequences which can be used for
displaying the relationships between a group of related sequences, or used as input for other
techniques such as phylogeny inference and profile analysis.

Exercise 6: ClustalW – Protein

• Start ClustalW/Clustal Omega ([Link]) and select the set file ins
as the input file

• Use default parameters

• Click on Submit to start the alignment

Interpretation of the output (below)

Colours on protein alignments

This protein-only option colours the residues according to their physicochemical properties:
Residue    Colour    Property
AVFPMILW   RED       Small (small+ hydrophobic ([Link] -Y))
DE         BLUE      Acidic
RK         MAGENTA   Basic - H
STYHCNGQ   GREEN     Hydroxyl + sulfhydryl + amine + G
Others     Grey      Unusual amino/imino acids etc

The consensus symbols in the alignment

* (asterisk) indicates positions which have a single, fully conserved residue.


: (colon) indicates conservation between groups of strongly similar properties - scoring > 0.5 in the
Gonnet PAM 250 matrix.
. (period) indicates conservation between groups of weakly similar properties - scoring =< 0.5 in the
Gonnet PAM 250 matrix.
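The consensus line can be rebuilt column by column from an alignment. In this sketch the "strong" and "weak" residue groups are reproduced from memory of the Clustal documentation and should be verified against it before being relied on; the function names and the three aligned toy sequences are hypothetical.

```python
# Assumed Clustal residue groups (verify against the Clustal documentation).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
        "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"
    if any(residues <= set(group) for group in STRONG):
        return ":"
    if any(residues <= set(group) for group in WEAK):
        return "."
    return " "


def consensus_line(aligned_rows):
    """Build the consensus line for equal-length aligned rows."""
    return "".join(column_symbol("".join(col)) for col in zip(*aligned_rows))


# Hypothetical toy alignment.
rows = ["MKSAW", "MRTCG", "MKSAW"]
for row in rows:
    print(row)
print(consensus_line(rows))
```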

Exercise 7: ClustalW – Nucleotide

• Use ClustalW, change to nucleotide and select the set file MHC as input file

• Use default parameters

• Click on Submit to start the alignment

• Have a look at the output

Exercise 8: BioEdit (reference only)

The program BioEdit allows a sequence to be added to an existing alignment. This is useful because
one may wish to build up a multiple alignment gradually, choosing different parameters manually. In
addition, this program also allows you to edit a multiple sequence alignment and then use the edited
version as input to additional analysis programs.

Questions:
1. Compare the two protein sequences humins and ratins (GenBank EHB15713.1). What are their
identity, similarity, gaps and score?

2. Align the protein sequence humins with the common chimpanzee (Pan troglodytes) sequence
(GenBank NP_001008996). Report the number of identical amino acids and the number of
amino acids with similar physical/chemical properties.

3. Align the actin genes of the white-leg shrimp (Litopenaeus vannamei) (GenBank AF300705) and
the giant freshwater prawn (Macrobrachium rosenbergii) (GenBank AY626840). Note the
numbers of transversion and transition SNPs and their positions.

LAB 4 – LABORATORY METHODS: RESTRICTION MAPPING &
PRIMER DESIGN
Restriction mapping
A restriction map describes the location of the cleavage sites of restriction endonucleases within
a DNA fragment. A restriction enzyme recognizes a specific sequence in a DNA molecule
and cuts the DNA strand. Restriction mapping programs such as Restrict, NEBcutter and
RestrictionMapper can be used to predict the fragment sizes resulting from a restriction digest and
to draw restriction maps. These programs can be very helpful in the planning of experiments
involving procedures such as plasmid construction and site-directed mutagenesis.

There are hundreds of known enzymes called restriction endonucleases that cleave DNA at very
specific sites. For example the enzyme BamHI recognizes the sequence GGATCC and cuts the DNA
between the two G's. If just one base is changed in the sequence (say GGTTCC) then the enzyme will
not cut the DNA.

Restriction enzymes can also be classified by the numbers of bases in the recognition sequence. The
numbers of bases will determine the frequency of that specific sequence in an average DNA
sequence. For example,

DpnI recognizes a 4 bp sequence, which would occur once every 4^4 = 256 bp on average.

PvuI recognizes a 6 bp sequence, which would occur once every 4^6 = 4,096 bp on average.

NotI recognizes an 8 bp sequence, which would occur once every 4^8 = 65,536 bp on average.
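The figures above follow from a simple model: if the four bases are equally frequent and independent, a specific n-base site is expected once every 4^n positions. The sketch below reproduces the three values and estimates how many sites a sequence of a given length would carry; the function names are hypothetical, and real genomes deviate from this model because base composition is not uniform.

```python
def expected_spacing(site_length, alphabet_size=4):
    """Average distance between occurrences of one specific site (uniform-base model)."""
    return alphabet_size ** site_length


def expected_site_count(sequence_length, site_length):
    """Rough expected number of sites in a sequence of the given length."""
    return sequence_length / expected_spacing(site_length)


for enzyme, n in (("DpnI", 4), ("PvuI", 6), ("NotI", 8)):
    print(enzyme, n, expected_spacing(n))

# Expected number of 6 bp sites in a 2667 bp sequence such as HUMSOMI.
print(round(expected_site_count(2667, 6), 3))
```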

A DNA sequence can be run through a program that will identify these sites in the DNA. The results
from this program will show all of the known sites in a given DNA sequence that are cut by
restriction enzymes. This "restriction map" is very useful in designing cloning strategies, and in
developing diagnostic assays.
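What such a program does can be sketched for a single enzyme: find every occurrence of the recognition sequence, place the cut at the known offset within it, and report the fragment lengths. The example uses BamHI (GGATCC, cutting between the two G's as described earlier); `find_sites` and `digest_linear` are hypothetical helpers, only the top strand of a linear molecule is considered, and the input sequence is invented.

```python
def find_sites(sequence, site):
    """Return 0-based start indices of every occurrence of the recognition site."""
    seq, site = sequence.upper(), site.upper()
    hits, start = [], seq.find(site)
    while start != -1:
        hits.append(start)
        start = seq.find(site, start + 1)
    return hits


def digest_linear(sequence, site, cut_offset):
    """Cut a linear sequence cut_offset bases into each site; return the fragments."""
    seq = sequence.upper()
    cuts = [start + cut_offset for start in find_sites(seq, site)]
    edges = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(edges, edges[1:])]


# Hypothetical toy sequence containing two BamHI sites.
toy = "AAGGATCCTTTTGGATCCA"
fragments = digest_linear(toy, "GGATCC", cut_offset=1)   # G^GATCC
print(fragments)
print([len(f) for f in fragments])   # [3, 10, 6]
```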

Exercise 1:

• Use the file humsoma as input for NEBcutter

[Link]

2. Paste a nucleotide sequence for analysis. You can cut and paste either a plain DNA sequence or a
FASTA-formatted sequence from your local computer into the box, or alternatively you can use the
“Browse” button under ‘Local sequence file’ to select a file on your computer. If you know the
GenBank accession number for the sequence, you can enter it directly under ‘GenBank number’. If
not, you can use the “Browse GenBank” button to search GenBank (Fig. 1.46).

3. Indicate whether the sequence is linear or circular. Plasmids are circular. If a
GenBank file is used containing a circular sequence, NEBcutter will automatically recognize that.

4. Choose the type of enzymes to be used, for example those ‘which are available in NEB’ or ‘user defined’.

5. Click on ‘Submit’ button to analyze your sequence by NEBcutter.

6. Other options (Optional)

Exploration

• The result is displayed with information like the AT and GC content, a graphical output
representing the location of the enzymes’ sites on the sequence etc.

• Under ‘Main options’ click ‘Custom digest’ to view the enzymes that have restriction sites on the
sequence provided in a table form

• Choose the enzymes of interest (AccI and BccI), preferably single cutters, and click the ‘Digest’
button. This displays the sites of the chosen enzyme(s) alone

• To view the restriction pattern on gel, click ‘View gel’ found on the ‘Main options’ menu.
• To view enzymes & cut positions, click ‘Enzymes & sites’ found on the ‘List’ menu
• To view enzymes & fragment lengths, click ‘Fragments’ found on the ‘List’ menu

Primer design
This section demonstrates Primer3, which is used to design PCR primers for amplifying a region of a
known sequence.

The polymerase chain reaction (PCR) has become one of the most widely used techniques in
molecular biology. In its basic form, this technique allows the rapid amplification of small amounts of
DNA through multiple cycles of denaturation to single stranded DNA, annealing with a pair of
oligonucleotide primers and synthesis of the second DNA strand using a thermostable DNA
polymerase. One of the major drawbacks of this method is its sensitivity to experimental artefacts,
some of which can be minimized through careful experimental planning. In particular the
oligonucleotide primers should be carefully designed to minimize the likelihood of side reactions
such as primer-dimer formation and spurious annealing, which can result in decreased efficiency and
unwanted side products.

Exercise 2:

• Select the program Primer3Plus ([Link]bin/primer3plus/[Link])

• Use the file humsoma as the input

• Click on Pick Primers

Exercise 3:

Primer3 can also be used to test your own set of primers. For example, HSHFE is the GenBank locus
for the genomic sequence of HFE (accession number: Z92910), a non-classical class I MHC protein
and the candidate gene for hereditary haemochromatosis. Exon 4 in the sequence (nucleotides
6494-6769) codes for the alpha-3 domain of the HFE protein.

• Design primers to amplify the sequence which encodes the alpha-3 domain and test the
primers against the HSHFE sequence

• To specify where the primers must amplify within the target sequence, you need to put the
exon start and stop in [Targets]

• Click on Pick Primers

• Then experiment with <Excluded regions>, {Included region}, and combinations of selected
regions

Target sequence

6481 ctttcctgtc aagtgcctcc tttggtgaag gtgacacatc atgtgacctc ttcagtgacc
6541 actctacggt gtcgggcctt gaactactac ccccagaaca tcaccatgaa gtggctgaag
6601 gataagcagc caatggatgc caaggagttc gaacctaaag acgtattgcc caatggggat
6661 gggacctacc agggctggat aaccttggct gtaccccctg gggaagagca gagatatacg
6721 tgccaggtgg agcacccagg cctggatcag cccctcattg tgatctgggg tatgtgactg
6781 atgagagcca ggagctgaga aaatctattg ggggttgaga ggagtgcctg aggaggtaat
6841 tatggcagtg agatgaggat ctgctctttg ttaggggatg ggctgagggt ggcaatcaaa
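
Primer3 describes a target as a start position plus a length rather than as start and stop. The sketch below converts the exon 4 coordinates accordingly and shows how a target can be marked with square brackets inside the pasted sequence. It assumes that the Targets field follows the start,length convention of Primer3's SEQUENCE_TARGET tag and that positions are counted from 1; check both against the Primer3Plus help page. The function names are hypothetical.

```python
def to_target(start, stop):
    """Convert inclusive start/stop coordinates to a 'start,length' string."""
    if stop < start:
        raise ValueError("stop must not be smaller than start")
    return f"{start},{stop - start + 1}"


def mark_target(seq, start, stop, first_index=1):
    """Wrap the target region in [ ] inside the sequence text.

    first_index is the coordinate of the first base of seq (1 for a full entry).
    """
    i = start - first_index
    j = stop - first_index + 1
    return seq[:i] + "[" + seq[i:j] + "]" + seq[j:]


exon4_target = to_target(6494, 6769)
toy_marked = mark_target("acgtacgtac", 3, 6)  # toy sequence, not HSHFE
print(exon4_target)
print(toy_marked)
```

When only an excerpt is pasted, such as the one above that starts at position 6481, the coordinates must be shifted, for example by calling mark_target with first_index=6481.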

Exercise 4:

In this exercise you will specify a particular region within a sequence for which primers are to be
designed. The sequence is GC-rich, so you will see the effect of high G+C content on the primer
search.

• Search the sequence with GenBank accession number Z30589 for suitable primers for positions
800-1600. Change the target length and product size parameters accordingly.

o The main problem is the GC content of the region being amplified.

• Try the search again after increasing the max primer %GC (to 75), and the max product %GC
(to 75).

o With the relaxed %GC limits, primers can now be found. Since higher GC increases the Tm, you
will need to allow for this change in the reaction itself (increase annealing time for instance).
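
The link between GC content and Tm can be seen with a quick calculation. The sketch below uses the Wallace rule of thumb (2 degrees per A/T and 4 degrees per G/C), which is only a rough approximation for short oligonucleotides and is not the thermodynamic model Primer3 itself applies. The two 20-mers are invented examples.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer):
    """Rule-of-thumb Tm in degrees C: 2 per A/T plus 4 per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Invented 20-mers: one AT-rich, one GC-rich.
at_rich = "ATATTAGCATTAATGCATTA"
gc_rich = "GCGGCCGCATGCGGCGCCGC"
for name, primer in (("AT-rich", at_rich), ("GC-rich", gc_rich)):
    print(name, gc_percent(primer), wallace_tm(primer))
```

Two primers of the same length can therefore differ widely in estimated Tm, which is why a GC-rich region fails the default %GC and Tm limits.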

Questions:
1. Using the crustacean hyperglycaemic hormone gene (AF372657) as the input, create a restriction
map with 3 restriction enzymes: AseI, BsaAI, EcoRI. Report: the number of cut sites, recognition
sequences, positions of cut sites, fragment sizes, and types of fragment ends.

2. Using Humsomi (J00306) as the input, design primers to amplify the exons only. Report:
images of the primer settings & the first primer pair (including the highlighted primers &
target sequence)

LAB 5 – GENE DISCOVERY
Genomic sequence data is generated by sequencing projects worldwide much faster than it can be
fully analysed. There is an increasing need for computer software able to identify features of
interest in a DNA sequence, including regions of abnormal composition, unusual repeats etc., with a
special emphasis on protein coding regions and their control sequences.

Searching for genes and ‘interesting’ patterns can involve a range of bioinformatics methods. For
example, the first step in analysing a new DNA sequence would be to use this sequence and its
protein translation as queries to search sequence databases and identify potential homologues
which may provide information on the function of the sequence. The DNA can also be used to search
the EST (Expressed Sequence Tag) database, which contains a large number of mass-sequenced
reverse transcribed RNAs. If a cDNA is found with a strong similarity to a region in the genomic
sequence, this region is likely to be coding. It is also possible to analyse de novo sequences from
non-model species using the statistical programs demonstrated in this section.

Codon usage statistics


Exercise 1: Calculate a codon usage statistic

• Start chips ([Link] and select ecoliad (accession number: x59263) as the input sequence

Chips calculates Frank Wright's Nc statistic for a nucleotide sequence. This is the "effective number
of codons used in a gene sequence" (ref 1), and is a simple measure of synonymous codon usage
bias. Nc quantifies how far the codon usage of a gene departs from equal usage of synonymous
codons. Nc is easily calculated from codon usage data alone and is independent of gene length and
amino acid composition. Nc can take values from 20, in the case of extreme bias where one codon is
exclusively used for each amino acid, to 61 when the use of alternative synonymous codons is
equally likely. Nc thus provides an intuitively meaningful measure of the extent of codon preference
in a gene. Low values therefore indicate a strong codon bias, and high values indicate a low bias (and
possibly a non-coding region).

In this example there is really no codon usage bias.
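
To make the two limits of Nc concrete, the sketch below computes the statistic from the commonly stated form of Wright's formula, Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, where each F is the average codon homozygosity, (n*sum(p^2) - 1)/(n - 1), over the amino acids with 2, 3, 4 or 6 synonymous codons. This is an illustrative implementation under the standard genetic code, not the chips source code. Its handling of missing amino acids (it simply raises an error) is a simplification, and the input is an artificial sequence built for the test.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def effective_number_of_codons(cds):
    """Sketch of Wright's Nc for an in-frame coding sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")

    # Group the observed codon counts by amino acid.
    by_aa = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            by_aa[aa].append(counts[codon])

    # Homozygosity F per amino acid, collected by degeneracy class.
    f_by_class = defaultdict(list)
    for aa_counts in by_aa.values():
        degeneracy, n = len(aa_counts), sum(aa_counts)
        if degeneracy == 1 or n < 2:
            continue  # Met and Trp offer no choice; F is undefined for n < 2
        sum_p2 = sum((c / n) ** 2 for c in aa_counts)
        f_by_class[degeneracy].append((n * sum_p2 - 1) / (n - 1))

    nc = 2.0  # Met and Trp contribute one codon each
    for degeneracy, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = f_by_class.get(degeneracy)
        mean_f = sum(values) / len(values) if values else 0.0
        if mean_f <= 0:
            raise ValueError(f"cannot estimate F for {degeneracy}-fold amino acids")
        nc += n_amino_acids / mean_f
    return min(nc, 61.0)  # values above 61 are truncated to the stated maximum


# Artificial input: one codon per amino acid, each used twice (extreme bias).
first_codon = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        first_codon.setdefault(aa, codon)
biased = "".join(codon * 2 for codon in first_codon.values())
print(effective_number_of_codons(biased))
```

For this artificial sequence every F equals 1, so the result is the lower limit of 20 described above.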

Exercise 2: Create a codon usage table using cusp

• Start cusp ([Link] and select ecoliad as the input sequence

The output file contains a table showing the values from which the codon preference statistics were
calculated. For each codon, it lists the frequency of occurrence relative to all the codons in the
organism (All) and to the other codons encoding the same amino acid (Family). The codon usage table
gives for each codon: (i) the sequence of the codon, (ii) the encoded amino acid, (iii) the proportion of
usage of the codon among its redundant set, i.e. the set of codons which code for this codon's amino
acid, (iv) the expected number of codons, given the input sequence(s), per 1000 bases, and (v) the
observed number of codons in the input sequences.
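
The columns of such a table are straightforward to derive from codon counts. The sketch below builds a comparable table for a short made-up coding sequence. It is not the cusp program itself. In particular, the frequency column here is scaled per 1000 codons counted, which is an assumption about how that column is normalized. The function name is hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage_table(cds):
    """Return rows of (codon, amino acid, fraction, per_thousand, count)."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    total = sum(counts.values())

    # Total count for each synonymous family (all codons of one amino acid).
    family_totals = Counter()
    for codon, n in counts.items():
        family_totals[CODON_TABLE[codon]] += n

    rows = []
    for codon in sorted(CODON_TABLE):
        aa, n = CODON_TABLE[codon], counts[codon]
        family = family_totals[aa]
        fraction = n / family if family else 0.0
        per_thousand = 1000.0 * n / total if total else 0.0
        rows.append((codon, aa, fraction, per_thousand, n))
    return rows


# Made-up five-codon sequence: ATG GCT GCT GCC TAA.
table = {row[0]: row for row in codon_usage_table("ATGGCTGCTGCCTAA")}
for codon in ("ATG", "GCT", "GCC", "TAA"):
    print(table[codon])
```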

Exercise 3: Six frame translation

• Use the input ecoliad and run the program sixpack ([Link]
bin/emboss/sixpack)

The program sixpack presents the amino acid translation using all six frames (3 forward and 3
reverse). It also writes a file containing the open reading frames that are larger than the specified
minimum size (default 1 base, showing all possible open reading frames). These open reading frames
are written as protein sequences in the default output sequence format. An open reading frame is
defined in this program as any possible translation between two STOP codons.
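
The stop-to-stop definition can be illustrated with a few lines of Python. The sketch below translates a short invented sequence in all six frames and splits each translation at stop codons. It assumes the standard genetic code. The labelling of the reverse frames is a simple convention chosen here and may not match sixpack's numbering, and the minimum length is counted in amino acids rather than bases.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate from the first base; unknown codons become 'X'."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def six_frames(seq):
    """Map frame labels (1, 2, 3, -1, -2, -3) to protein strings."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


def stop_to_stop_orfs(protein, min_length=1):
    """Split a frame translation at stop codons ('*')."""
    return [part for part in protein.split("*") if len(part) >= min_length]


# Invented 15-base sequence, for illustration only.
frames = six_frames("ATGAAATAGGGCTAA")
for label in (1, 2, 3, -1, -2, -3):
    print(label, frames[label], stop_to_stop_orfs(frames[label]))
```

Note that under this definition an open reading frame does not need to begin with a start codon, which is why very short minimum sizes return many fragments.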

Exercise 4: Tcode

• Use the input ecoliad and run the program Tcode ([Link]
bin/emboss/tcode)

The program identifies the regions that are most likely to be coding regions. Tcode tests DNA
sequences for protein coding regions using an algorithm which looks for simple and universal
differences between protein-coding and non-coding DNA. The program slides a window of user-
selectable size over the DNA sequence. For each window the TESTCODE statistic is applied. The
results can be output as a text report or displayed graphically. The text output reports each window
as "Coding", "Noncoding" or "No opinion". Entries marked "No opinion" have a TESTCODE value that
falls between the maximum and minimum values required to report a region as noncoding or
coding. For the graphical plot, all points above a green horizontal line are determined to be coding
regions. Those below a red line are determined to be noncoding. Points between the red and green
lines are "no opinion" ones.

• Re-run the program – this time increasing the size of the sliding window – notice how the
figure changes.
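
The effect of the window size can be explored with a toy version of the procedure. The sketch below slides a window along a sequence, scores each window and reports one of three labels using an upper and a lower threshold. Important: the score used here is plain GC fraction, a stand-in chosen only to keep the example short. It is not the TESTCODE statistic and has no value for predicting coding regions. The thresholds, function names and test sequence are likewise invented.

```python
def gc_fraction(window):
    """Stand-in score: fraction of G and C bases (NOT the TESTCODE statistic)."""
    return (window.count("G") + window.count("C")) / len(window)


def classify_windows(seq, window=200, step=3, low=0.45, high=0.55, score=gc_fraction):
    """Return (start, end, value, label) for each window, positions 1-based."""
    seq = seq.upper()
    calls = []
    for start in range(0, len(seq) - window + 1, step):
        value = score(seq[start:start + window])
        if value >= high:
            label = "Coding"
        elif value <= low:
            label = "Noncoding"
        else:
            label = "No opinion"
        calls.append((start + 1, start + window, value, label))
    return calls


# Invented 40-base sequence: an AT-only half followed by a GC-only half.
toy = "AT" * 10 + "GC" * 10
small = classify_windows(toy, window=20, step=10)
large = classify_windows(toy, window=40, step=10)
print(small)
print(large)
```

With the smaller window the two halves of the toy sequence receive different labels. With the larger window they are averaged into a single "No opinion" call, which is the kind of smoothing to look for when the window size is increased in Tcode.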

Graphic ORF program

The ORF Finder (Open Reading Frame Finder) is a graphical analysis tool which finds all open reading
frames of a selectable minimum size in a query sequence or in a sequence already in the database.
This tool identifies all open reading frames using the standard or alternative genetic codes.

Exercise 5:

• Use the input ecoliad and run the program ORF Finder
([Link]

The figure shows the reading frames for the forward (+1, +2, +3) and the reverse (-1, -2, -3)
strands. The blue bars indicate open reading frames, and the white bars indicate regions that lack a
start codon marking the beginning of an ORF. The adjacent columns give additional information on
each ORF: the positions of its start and end, and its length.

• Select the long ORF in the +2 reading frame. The reading frame is then translated into
protein
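
In contrast to the stop-to-stop definition used by sixpack, an ORF in this sense starts at a start codon. The sketch below lists ATG-to-stop ORFs in all six frames of a short invented sequence. It is a simplified illustration, not the ORF Finder algorithm: it assumes the standard genetic code, accepts only ATG as a start, reports only the first ATG before each stop, and uses its own convention for reverse-strand frame labels and coordinates (start larger than end on the input strand).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def orfs_in_strand(strand_seq, min_length):
    """Yield (offset, start, end) in 0-based, end-exclusive strand coordinates."""
    for offset in range(3):
        start = None
        for i in range(offset, len(strand_seq) - 2, 3):
            codon = strand_seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_length:
                    yield offset, start, i + 3
                start = None


def find_orfs(seq, min_length=30):
    """Return (frame, start, end, length) with 1-based input-strand positions."""
    seq = seq.upper()
    n = len(seq)
    reverse = seq.translate(COMPLEMENT)[::-1]
    found = []
    for offset, start, end in orfs_in_strand(seq, min_length):
        found.append((f"+{offset + 1}", start + 1, end, end - start))
    for offset, start, end in orfs_in_strand(reverse, min_length):
        # Convert reverse-strand coordinates back to input-strand positions.
        found.append((f"-{offset + 1}", n - start, n - end + 1, end - start))
    return found


# Invented 16-base sequences; the second is the reverse complement of the first.
forward_hit = find_orfs("GGATGAAACCCTAGTT", min_length=9)
reverse_hit = find_orfs("AACTAGGGTTTCATCC", min_length=9)
print(forward_hit)
print(reverse_hit)
```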

GENSCAN

GenScan is a general-purpose gene identification program which analyses genomic DNA sequences
from a variety of organisms including human, other vertebrates, invertebrates and plants. For each
sequence, the program determines the most likely “parse” (gene structure) under a probabilistic
model of the gene structural and compositional properties of the genomic DNA for the given
organism. This set of exons/genes is then printed to an output file together with the corresponding
predicted peptide sequences. Unlike the majority of other currently available gene prediction
programs, the model treats the most general case, in which the sequence may contain no genes, one
gene, or multiple genes on either or both DNA strands. Partial genes as well as complete genes
are considered.

Exercise 6:

• Use humsomi as the input for the program GenScan ([Link]

Explanation:

Gn.Ex: gene number, exon number

Type: Init = Initial exon, Intr = Internal exon, Term = Terminal exon, PlyA = poly-A signal

S: DNA strand (+ = input strand, - = opposite strand)

Begin: beginning of exon or signal (numbered on input strand)

End: end point of exon or signal (numbered on input strand)

Len: length of exon or signal (bp)

Fr: reading frame

Ph: net phase of exon

I/Ac: initiation signal or acceptor splice site score

Do/T: donor splice site or termination signal score

CodRg: coding region score

P: probability of exon (sum over all parses containing exon)

Tscr: exon score (depends on length, I/Ac, Do/T and CodRg scores)

Genefinder ([Link]

These programs are available on the Softberry website. However, as this is a commercial site, only a
limited number of free analyses are permitted from each location per day. These programs
are more specialized: they are designed for specific organisms.

Exercise 7:

• In the Gene Finding in Eukaryota menu on the left hand side of the page is the option to
select the program FGENESH.
• Use humsoma as the test sequence
• Take note of where the predicted exons are located

The TSS (Position of transcription start (TATA-box position)) was not identified using
the GRAILEXP program

• View the PDF file that is available


• Redo the analysis using FGENES. Are the same exons & their locations predicted?

Exercise 8:

The programs TSSG & TSSW under the Promoters & functional option are used to predict the
location of Pol II (eukaryotic protein-coding gene) promoters. They both work on the same principle
but use different databases of transcription factors. TSSG uses the older and no longer supported
Transcription Factor Database of Ghosh, whereas TSSW uses the more recent and still active
TRANSFAC of Edgar Wingender.

• In the SEARCH FOR MOTIFS/promoters & functional menu is the option to select the
program TSSW.
• Use humsoma as the test sequence

The output starts by summarizing the location of the predicted promoters and their TATA boxes. The
quality of each predicted promoter is indicated by the LDF (linear discriminant function) score.
Promoters with an LDF over the indicated threshold are reported. The output concludes with a list of
the transcription factor binding sites which were found in the predicted promoter regions. For each
site, the position, DNA strand (+ or -), the name of the site in TRANSFAC and the sequence which
matched the consensus sequence are given.

Questions

1. Search NCBI for NG011676. Report the features of this gene, including mRNA &
CDS.

2. Using the Homo sapiens growth hormone gene (NG011676) as the input, run GenScan.
Compare the results with the information on this gene from NCBI.

3. Using the Homo sapiens growth hormone gene (NG011676) as the input, run Genefinder.
Compare the results with the information on this gene from NCBI.
