Prepared By : Ankur Gajendra Meshram
Applications of Genomics and Proteomics
Course No. :- PBTEL-463
Credits :- 03 (2+1)
Course Title :- Applications of Genomics and Proteomics
Course Teacher :-
Student’s Name :-
College Name :- M. G. College of Agri. Biotechnology, Pokharni, Nanded
COURSE SYLLABUS OUTLINE
Lec. No. Topics Weightage (%)
Unit : 1
1. Introduction to genomics and proteomics, terminology, History 02
2. Genomics of Arabidopsis 03
3. Genomics of rice, 03
4. Genomics of tomato, 03
5. Genomics of pigeon pea, 03
6. Genomics of wheat; 03
7. Introduction to Transcriptomics and techniques involved in its analysis 02
8. Principles, methods, types, procedure and application of DNA 04
chips/microarray in transcript analysis
9. Forward and reverse genetic approaches 03
10. Types of mutation, types of mutagen, mutant and its application in 03
functional genomics
11. Principle and mechanism of RNAi in functional genomics 02
12. Applications of RNAi Technology in crop improvement 04
13. Mutation breeding, 03
14. Principle and application of site-directed mutagenesis with its mechanism 04
15. Transposon tagging : its principle, procedure, mechanism and 03
application in functional genomics
16. Transient gene expression by VIGS and FACS : Principle, procedure and 04
application of VIGS in genomics
17. Principle, procedure and application of FACS in genomics 02
18. Introduction, components and applications of targeted genome editing 04
technologies : CRISPR, TALENs etc.
PBTEL-463 (Applications of Genomics and Proteomics) 1
Unit : 2
19. Proteomics study in relation to bioinformatics and different components 02
of proteomics
20. Proteome analysis by MALDI-TOF 03
21. Study of ExPASy proteome tools for protein analysis 03
22. Structural analysis of protein 03
23. Protein 3D structure modeling by different modules and its procedure : 04
Homology modelling and crystallography
24. Study of protein-protein interaction and techniques involved in the 04
interactive study of protein
25. Principle, analysis and mechanism of FRET and its application in proteomics 03
26. Yeast two-hybrid system for analysis of protein-protein interaction at 04
molecular level
27. Principle, mechanism and application of co-immunoprecipitation in 04
proteomics
28. Success case study on application of genomics and proteomics in health 03
and industry
29. Introduction to Metabolomics and ionomics and techniques involved in 03
metabolite analysis
30. Procedure, steps, components involved in metabolite analysis 03
31. Principle and application of Nuclear Magnetic Resonance Spectroscopy 03
(NMR), Mass Spectrometry (MS) in metabolite analysis
32. Application of genomics and proteomics in crop improvement 03
Total : 100
PRACTICAL EXERCISES
Exercise No. Title
1. Principle and procedure of SDS-PAGE
2. Principle and procedure of 2D electrophoresis
3. Protein analysis and characterization through HPLC
4. Specialized crop-based genomic resources : databases and analysis of genomics and
proteomics of crops
5. TAIR
6. Study and analysis of the Gramene database
7. Study and analysis of the GrainGenes database
8. Study and analysis of the MaizeDB database
9. Study and analysis of the Phytozome database
10. Study and analysis of the CerealDB database
11. Study and analysis of the CitrusDB database
12. Study and analysis of the miRBase database
UNIT : 1
1.1 INTRODUCTION TO GENOMICS & PROTEOMICS
From the Greek 'gen', meaning “become, create, creation, birth”, and subsequent variants : genealogy,
genesis, genetics, genic, genomere, genotype, genus etc. While the word genome (from the German
‘Genom’, attributed to Hans Winkler) was in use in English as early as 1926, the term genomics was
coined by Thomas Roderick, a geneticist at the Jackson Laboratory (Bar Harbor, Maine), over beer at a
meeting held in Maryland on the mapping of the human genome in 1986.
In the fields of molecular biology and genetics, a genome is all genetic material of an organism. It
consists of DNA (or RNA in RNA viruses). The genome includes both the genes (the coding regions) and
the noncoding DNA, as well as mitochondrial DNA and chloroplast DNA. The study of the genome is
called genomics.
Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping,
and editing of genomes. A genome is an organism’s complete set of DNA, including all of its genes. In
contrast to genetics, which refers to the study of individual genes and their roles in inheritance,
genomics aims at the collective characterization and quantification of all of an organism’s genes, their
interrelations and influence on the organism.
The field also includes studies of intragenomic (within the genome) phenomena such as epistasis (effect
of one gene on another), pleiotropy (one gene affecting more than one trait), heterosis (hybrid vigour),
and other interactions between loci and alleles within the genome.
Scope And Importance of Genomics :-
1) Target identification and validation : integrative analysis of molecular data at scale, coupling
genetic, epigenetic, gene expression profiling, proteomic, metabolomic, phenotypic trait
measurements to disease diagnosis and clinical outcome data to generate hypotheses on
molecular etiology of diseases in service of identification or validation of novel therapeutic
targets.
2) Biomarker discovery : utilizing genetic and genomic data derived from cell lines, animal models,
human disease tissues and PBMC to develop preclinical or clinical biomarkers for target
engagement, pharmacodynamics, drug response, prognosis, and patient stratification; applying
genomic profiling in clinical trials to identify early response markers to predict clinical end
points.
3) Pharmacogenomics : identify associations between germline SNPs, somatic mutations, gene
expression and other molecular alterations and drug responses.
4) Toxicogenomics : integrative analysis of genomic, histopathology, and clinical chemistry data to
develop predictive toxicology biomarkers in preclinical 4-day, 14-day and 30-day studies and
clinical studies.
5) Understanding drug mechanisms of action (MoA) : applying genomic profiling to de-convolute
targets and delineate MoA of non-selective drugs or drugs from phenotypic screening.
6) Characterization of mechanisms of acquired resistance : analysis of genetic and genomic data
derived from preclinical isogenic models or clinical patient samples to study the mechanisms of
acquired resistance.
7) Selection of disease-relevant experimental models : comparative analysis of genetic and
genomic data to assess and select cell line and animal models in drug discovery that best
represent the disease indications.
8) Developing drug combination strategies : analysis of genetic and genomic data to identify
synthetic lethality genes as drug combination targets; computational analysis to understand
gene regulatory networks to develop combination strategies that target parallel pathways or
reverse drug resistance.
9) Drug repurposing : applying in silico approaches to identify new disease indications for existing
drugs.
Applications of Genomics in Crop Improvement :-
1) Molecular Breeding or Marker Assisted Selection (MAS) :
Crop improvement relies on the identification of desirable genes and superior genotypes possessing
such genes. Selection of such genes and genotypes is facilitated by MAS. MAS refers to the process of
indirect selection of desirable genes or traits through direct selection for morphological, biochemical or
DNA-based/molecular markers. Breeders scan new varieties to directly select for the presence of the
markers and thereby indirectly select for the desired genes. In this way, MAS obviates the need for
phenotypic selection, which can be difficult, time consuming, costly and influenced by the environment.
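The logic of this indirect selection is simple enough to sketch in a few lines. In the toy example below (plant names, genotype calls and the marker itself are hypothetical illustrations), only plants carrying the marker allele linked to the desired gene advance to field evaluation:

```python
# Minimal sketch of marker-assisted selection (MAS).
# Genotype calls at a DNA marker tightly linked to a target gene:
# 'A' = allele linked to the desired gene, 'a' = alternative allele.
population = {
    "plant-01": "AA",
    "plant-02": "Aa",
    "plant-03": "aa",
    "plant-04": "Aa",
}

def select_by_marker(genotypes, desired_allele="A"):
    """Keep plants carrying at least one copy of the marker allele."""
    return [name for name, call in genotypes.items() if desired_allele in call]

selected = select_by_marker(population)
print(selected)  # marker-positive plants advance without phenotyping
```

Real MAS pipelines of course work from lab-generated marker data (SSR, SNP, etc.) rather than hand-typed calls, but the selection step reduces to exactly this filter.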
2) DNA / Molecular markers :
DNA markers can be defined as differences in nucleotide sequence of different individuals that can be
used to mark a genomic region, tag a gene or distinguish individuals from each other. Many different
classes of molecular markers are available today. These have been elaborately described in many good
reviews.
3) Mapping and tagging of genes :
A large number of monogenic and polygenic genomic loci for various traits have been identified in many
plants and are currently being exploited by breeders. Tagging of useful genes responsible for conferring
resistance to plant pathogens, synthesis of plant hormones, drought tolerance and a variety of other
important developmental pathway genes is a major target. Molecular markers have facilitated
construction of physical and genetic maps of genes/QTL on the genome. Table 1 gives a summary of the
traits for which MAS is being applied in different crops.
4) Identification/DNA Fingerprinting :
The ability to identify individual plants is at the core of many applications. The use of DNA markers in
cultivar identification is particularly important for the protection of proprietary germplasm. In India, the
National Research Centre on DNA Fingerprinting, rechristened as the Division of Genomic Resources,
National Bureau of Plant Genetic Resources, Indian Council of Agricultural Research, has been entrusted
with the responsibility of fingerprinting released varieties and important germplasm in crops of national
importance. A total of 2,146 accessions of different crops have been fingerprinted at the centre using
different molecular marker systems. A wide array of markers has been used for fingerprinting in
different crops. The markers chosen for the purpose of fingerprinting depend on factors like their
availability, genomic coverage, cost effectiveness and reproducibility. DNA markers have also been used
to confirm purity in hybrid cultivars where the maintenance of high levels of genetic purity is essential.
Marker based identification has also been used to check adulteration of commercial medicinal plants.
Sex identification in some dioecious plants is also facilitated by the use of molecular markers.
5) Diversity analysis of germplasm and heterosis breeding :
Heterosis refers to the presence of superior phenotypes in hybrids relative to their inbred parents with
respect to traits such as growth rate, reproductive success and yield. The theory of quantitative genetics
postulates a positive correlation between parental genetic distance and degree of heterosis.
Conventionally, the selection of such parents was based on a combination of phenotypic assessments,
pedigree information and breeding records. Now, molecular markers are also used for this purpose. A
genome wide assessment of genetic diversity using molecular markers makes parental selection more
efficient. Efforts have been made to construct haplotype blocks on the basis of molecular markers which
have been successfully used to predict hybrid performance.
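The marker-based assessment of parental diversity described above can be sketched with a toy distance measure. The marker profiles below are hypothetical 0/1 calls, and the simple mismatch fraction stands in for the richer coefficients (e.g. Rogers' or Nei's distance) used in practice:

```python
# Illustrative marker-based genetic distance between two candidate
# parents for a heterotic cross (simple matching distance over
# biallelic markers; profiles are hypothetical).
parent_a = [0, 1, 1, 0, 1, 0, 1, 1]
parent_b = [1, 1, 0, 0, 0, 0, 1, 0]

def matching_distance(a, b):
    """Fraction of marker loci at which the two parents differ."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)

d = matching_distance(parent_a, parent_b)
print(d)  # larger distance -> more promising cross, per the theory above
```

Genome-wide panels simply extend this to thousands of loci, which is what makes marker-based parental selection more efficient than pedigree records alone.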
6) Introgression/Backcrossing and gene pyramiding :
Commercial elite cultivars can be improved for a desirable trait (such as resistance to a specific disease)
that exists in a distantly related (wild-type) genotype but is lacking in the commercially grown variety. This is
achieved by gene introgression which involves crossing the elite cultivar with the donor plant, followed
by repeated backcrossing of the progeny with the recipient line, while selecting simultaneously for the
desirable allele in each generation (foreground selection). This takes about six or more generations, but
the use of DNA markers can effectively shorten this duration by reducing the number of backcrosses
required. MAS allows recovery of the maximum proportion of recurrent-parent genomic regions at the
non-target loci (background selection) and thus helps minimize linkage drag. This method is termed
marker-assisted backcrossing (MABC). MAS has also been widely utilized for gene pyramiding.
Pyramiding is the accumulation of several desired alleles into a single line or cultivar (background). This
is often cited as one of the major applications of MAS, since gene pyramiding through conventional
plant breeding is difficult.
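Pyramiding with markers amounts to requiring the desired allele at every target locus simultaneously. In the sketch below, the locus names are borrowed from well-known rice bacterial blight resistance genes purely for illustration; the lines and genotype calls are hypothetical:

```python
# Sketch of marker-based gene pyramiding: retain only plants carrying
# the desired allele ('R') at every target locus.
targets = {"Xa21": "R", "xa13": "R", "Xa4": "R"}  # locus -> desired allele

plants = {
    "line-1": {"Xa21": "R", "xa13": "R", "Xa4": "R"},
    "line-2": {"Xa21": "R", "xa13": "S", "Xa4": "R"},
    "line-3": {"Xa21": "R", "xa13": "R", "Xa4": "S"},
}

pyramided = [
    name for name, calls in plants.items()
    if all(calls[locus] == allele for locus, allele in targets.items())
]
print(pyramided)  # only lines stacking all three resistance alleles
```

This is why MAS is so valuable here: stacked resistance genes often cannot be distinguished phenotypically (each alone confers resistance), but markers identify the individuals carrying the full stack.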
PROTEOMICS :-
The word “proteome” denotes the complete protein pool of an organism encoded by its genome. In
broader terms, proteomics is the large-scale study of this total protein content of a cell or organism.
Proteomics helps in understanding of alteration in protein expression during different stages of life cycle
or under stress condition. Likewise, Proteomics helps in understanding the structure and function of
different proteins as well as protein-protein interactions of an organism. A minor defect in either protein
structure, its function or alteration in expression pattern can be easily detected using proteomics
studies. This is important with regards to drug development and understanding various biological
processes, as proteins are the most favorable targets for various drugs.
History :-
The first studies of proteins that could be regarded as proteomics began in 1975, after the introduction
of the two-dimensional gel and mapping of the proteins from the bacterium Escherichia coli.
The word proteome is a blend of the words “protein” and “genome”, and was coined by Marc Wilkins in
1994 while he was a Ph.D. student at Macquarie University. Macquarie University also founded the first
dedicated proteomics laboratory in 1995.
Steps in Proteomic Analysis :-
The following steps are involved in analysis of proteome of an organism as shown in Fig. 1 :
1) Purification of proteins : This step involves extraction of protein samples from whole cell, tissue
or sub cellular organelles followed by purification using density gradient centrifugation,
chromatographic techniques (exclusion, affinity etc.)
2) Separation of proteins : 2D gel electrophoresis is applied for separation of proteins on the basis
of their isoelectric points in one dimension and molecular weight on the other. Spots are
detected using fluorescent dyes or radioactive probes.
3) Identification of proteins : The separated protein spots on gel are excised and digested in gel by
a protease (e.g. trypsin). The eluted peptides are identified using mass spectrometry.
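The identification step above is classically done by peptide-mass fingerprinting: observed peptide masses are matched, within a tolerance, against the theoretical tryptic-peptide masses of candidate proteins. All protein names and masses in this sketch are hypothetical illustrations:

```python
# Sketch of peptide-mass fingerprinting for the identification step.
theoretical = {
    "protein-X": [501.3, 842.5, 1045.6, 1310.7],  # predicted tryptic masses (Da)
    "protein-Y": [488.2, 922.4, 1100.9],
}
observed = [842.51, 1045.58, 1310.72]  # masses measured by the spectrometer
tolerance = 0.05  # Da

def count_matches(peaks, peptides, tol):
    """Number of observed peaks explained by a protein's peptide list."""
    return sum(any(abs(p - m) <= tol for m in peptides) for p in peaks)

best = max(theoretical,
           key=lambda prot: count_matches(observed, theoretical[prot], tolerance))
print(best)  # candidate whose tryptic peptides best explain the spectrum
```

Search engines such as Mascot or SEQUEST implement far more sophisticated scoring, but the core matching logic is this mass comparison.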
Fig 1. Overview of steps involved in proteomic analysis
Applications of Proteomics :-
1) Post-Translational Modifications :
Proteomics studies offer certain unique capabilities, such as the ability to analyze post-translational
modifications of proteins. These modifications include phosphorylation, glycosylation and sulphation, as
well as other modifications involved in the maintenance of the structure of a protein.
These modifications are very important for the activity, solubility and localization of proteins in the cell.
Determination of protein modifications is much more difficult than the identification of proteins.
For identification, only a few peptides from protease cleavage are required, followed by database
matching against known peptide sequences. For determination of a modification, however, much more
material is needed, because all peptides that do not have the expected molecular mass must be
analyzed further.
For example, during protein phosphorylation events, phosphopeptides are ~80 Da heavier than their
unmodified counterparts. They give rise to a specific fragment ion (PO3−, mass 79), bind to metal
resins, and are recognized by specific antibodies; the phosphate group can later be removed by
phosphatases (Clauser et al. 1999; Colledge and Scott, 1999). The protein of interest (the post-translationally
modified protein) can thus be detected by Western blotting with antibodies that recognize only the active
state of the molecule, or by 32P-labelling. These spots can then be identified by mass spectrometry.
2) Protein-Protein Interactions :
A major contribution of proteomics is the development of protein-interaction maps of the cell, which
are of immense value for understanding cell biology. Knowledge of when a particular protein is
expressed, its level of expression and, finally, its interaction with other proteins to form an
intermediate that performs a specific biological function is now becoming available.
These intermediates can be exploited for therapeutic purposes also. An attractive way to study the
protein-protein interactions is to purify the entire multi-protein complex by affinity based methods
using GST-fusion proteins, antibodies, peptides etc.
The yeast two-hybrid system has emerged as a powerful tool to study protein-protein interactions
(Haynes and Yates, 2000). According to Pandey and Mann (2000), it is a genetic method based on the
modular structure of transcription factors: bringing the DNA-binding domain into close proximity with
the activation domain induces increased transcription of a set of genes.
The yeast hybrid system uses ORFs fused to the DNA binding or activation domain of GAL4 such that
increased transcription of a reporter gene results when the proteins encoded by two ORFs interact in
the nucleus of the yeast cell. One of the main consequences of this is that once a positive interaction is
detected, simply sequencing the relevant clones identifies the ORF. For this reason it is a generic
method that is simple and amenable to high throughput screening of protein-protein interactions.
Phage display is a method where bacteriophage particles are made to express either a peptide or
protein of interest fused to a capsid or coat protein. It can be used to screen for peptide epitopes,
peptide ligands, enzyme substrate or single chain antibody fragments.
Another important method to detect protein-protein interactions involves the use of fluorescence
resonance energy transfer (FRET) between fluorescent tags on interacting proteins. FRET is a non-
radioactive process whereby energy from an excited donor fluorophore is transferred to an acceptor
fluorophore. After excitation of the first fluorophore, FRET is detected either by emission from the
second fluorophore using appropriate filters or by alteration of the fluorescence lifetime of the donor.
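Because FRET efficiency falls off with the sixth power of donor-acceptor distance, it acts as a "molecular ruler" for interactions. A minimal sketch of the standard relation E = 1/(1 + (r/R0)^6), where R0 is the Förster radius at which transfer is 50% efficient (the ~5 nm default below is an assumption typical of common fluorophore pairs):

```python
# Distance dependence of FRET efficiency: E = 1 / (1 + (r/R0)**6).
# R0 (Forster radius) of 5.0 nm is an illustrative, pair-dependent value.

def fret_efficiency(r_nm, r0_nm=5.0):
    """Energy-transfer efficiency for donor-acceptor separation r (nm)."""
    return 1.0 / (1.0 + (r_nm / r0_nm) ** 6)

print(fret_efficiency(5.0))            # exactly 0.5 at the Forster radius
print(round(fret_efficiency(10.0), 3)) # efficiency is negligible by ~10 nm
```

This steep distance dependence is why detectable FRET between two tagged proteins is taken as evidence that they are in molecular contact, not merely co-localized.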
A proteomics strategy of increasing importance involves the localization of proteins in cells as a
necessary first step towards understanding protein function in complex cellular networks. The discovery
of GFP (green fluorescent protein) and the development of its spectral variants has opened the door to
analysis of proteins in living cells by use of the light microscope.
Large-scale approaches of localizing GFP-tagged proteins in cells have been performed in the genetically
amenable yeast S. pombe (Ding et al. 2000) and in Drosophila (Morin et al. 2001). To localize proteins in
mammalian cells, a strategy was developed that enables the systematic GFP tagging of ORFs from novel
full-length cDNAs that are identified in genome projects.
3) Protein Expression Profiling :
The largest application of proteomics continues to be protein expression profiling. The expression levels
of a protein sample could be measured by 2-DE or other novel technique such as isotope coded affinity
tag (ICAT). Using these approaches the varying levels of expression of two different protein samples can
also be analyzed.
This application of proteomics would be helpful in identifying the signaling mechanisms as well as
disease specific proteins. With the help of 2-DE several proteins have been identified that are
responsible for heart diseases and cancer (Celis et al. 1999). Proteomics helps in distinguishing cancer
cells from non-cancerous cells by the presence of differentially expressed proteins.
The technique of Isotope Coded Affinity Tag has developed new horizons in the field of proteomics. This
involves the labeling of two different proteins from two different sources with two chemically identical
reagents that differ in their masses due to isotope composition (Gygi et al. 1999). The biggest advantage
of this technique is the elimination of protein quantitation by 2-DE. Therefore, a large amount of
protein sample can be used to enrich low-abundance proteins.
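The quantitation logic of ICAT-style labeling is straightforward: each peptide appears as a heavy/light pair in the spectrum, and the intensity ratio gives relative expression between the two samples. Peptide names and intensities below are hypothetical:

```python
# Sketch of ICAT-style relative quantitation from heavy/light pairs.
# Intensities are hypothetical spectrometer readouts for each label.
peptide_pairs = {
    "peptide-1": {"light": 12000.0, "heavy": 24500.0},
    "peptide-2": {"light": 8000.0, "heavy": 7900.0},
}

ratios = {
    name: pair["heavy"] / pair["light"]
    for name, pair in peptide_pairs.items()
}
for name, ratio in ratios.items():
    # ratio ~1 -> unchanged; ~2 -> twofold higher in the "heavy" sample
    print(name, round(ratio, 2))
```

Because the two labels are chemically identical, the pair co-elutes and ionizes equivalently, which is what makes this ratio a fair measure of relative abundance.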
Different methods have been used to probe genomic sets of proteins for biochemical activity. One
method is called a biochemical genomics approach, which uses parallel biochemical analysis of a
proteome comprised of pools of purified proteins in order to identify proteins and the corresponding
ORFs responsible for a biochemical activity.
The second approach for analyzing genomic sets of proteins is the use of functional protein microarrays,
in which individually purified proteins are separately spotted on a surface such as a glass slide and then
analyzed for activity. This approach has huge potential for rapid high-throughput analysis of proteomes
and other large collections of proteins, and promises to transform the field of biochemical analysis.
4) Molecular Medicine :
With the help of the information available through clinical proteomics, several drugs have been
designed. Clinical proteomics aims to discover proteins of medical relevance in order to identify
potential targets for pharmaceutical development, markers for disease diagnosis or staging, and markers
for risk assessment in both medical and environmental studies. Proteomic technologies will play an
important role in drug
discovery, diagnostics and molecular medicine because of the link between genes, proteins and disease.
As researchers study defective proteins that cause particular diseases, their findings will help develop
new drugs that either alter the shape of a defective protein or mimic a missing one. Already, many of
the best-selling drugs today either act by targeting proteins or are proteins themselves. Advances in
proteomics may help scientists eventually create medications that are “personalized” for different
individuals to be more effective and have fewer side effects. Current research is looking at protein
families linked to disease including cancer, diabetes and heart disease.
1.2 GENOMICS OF ARABIDOPSIS
Thale cress, Arabidopsis thaliana, is a member of one of the largest families of flowering plants, the
Brassicaceae, to which mustards, radishes and cabbages also belong. A. thaliana is thought to have
originated in Central Asia and spread from there throughout Eurasia. During the last glaciation, A.
thaliana was confined to the southern limit of its range, and after the ice retreated, much of Europe was
recolonized by different populations, resulting in complex admixture patterns. Today, A. thaliana occurs
throughout the Northern Hemisphere, mostly in temperate regions, from the mountains of North Africa
to the Arctic Circle. Like many other European plants, it has also invaded North America, most probably
during historic times.
The ascendancy of A. thaliana to become one of the most popular species in basic plant research,
despite its lack of economic value, is due to the favorable genetics of this plant. It has a diploid genome
of only about 125 to 150 Mb distributed over five chromosomes, with fewer than 30,000 protein-coding
genes. The ease with which it can be stably transformed is unsurpassed by any other multicellular
organism. Moreover, as flowering plants only appeared about 100 million years ago, they are all
relatively closely related. Indeed, key aspects of plant physiology such as flowering are highly conserved
between economically important grasses such as rice and A. thaliana.
A. thaliana was the first plant species for which a genome sequence became available. This initial
sequence was from a single inbred strain (accession), and was of very high quality, with each
chromosome represented by merely two contigs, one for each arm. In addition to functional analyses,
the 120 Mb reference sequence of the Columbia (Col-0) accession proved to be a boon for evolutionary
and ecological genetics. A particular advantage in this respect is that the species is mostly self-fertilizing,
and most strains collected from the wild are homozygous throughout the genome. This distinguishes A.
thaliana from other model organisms such as the mouse or the fruit fly. In these systems, inbred strains
have been derived, but they do not represent any individuals actually found in nature.
The Genome of Arabidopsis :-
In December of 2000, the Arabidopsis research community announced a major accomplishment: the
completion of the sequence of a flowering plant. For the first time, we have in hand the sequence of all
of the genes necessary for a plant to function, knowledge unprecedented in the history of science.
Additionally, this sequence is freely available to every member of the scientific community. Below is a
summary of the major findings described in the groundbreaking paper by The Arabidopsis Genome
Initiative, “Analysis of the genome sequence of the flowering plant Arabidopsis thaliana”.
The genome of Arabidopsis :
• Contains about 125 megabases of sequence
• Encodes approximately 25,500 genes
• Contains a similar number of gene functional classifications as other sequenced eukaryotic
genomes (Drosophila melanogaster and Caenorhabditis elegans).
• Has 35% unique genes
• Has 37.5% genes that exist as members of large gene families (families of 5 or more members)
• Shows evidence of ancient polyploidy: an estimated 58-60% of the Arabidopsis genome exists as
large segmental duplications.
Analysis of the sequence of the Arabidopsis genome tells us that the genome of a higher plant is similar
in several important ways to the genome of other sequenced multicellular organisms. It also points out
several important differences, which may not be too surprising, considering that plants differ in many
important ways from the animals whose genomes have been analyzed. Plants are autotrophic: they
require only light, water, air and minerals to survive. They can therefore be expected to have genes that
animals do not have, encoding the proteins and enzymes involved in plant-specific processes, including
the complex process of photosynthesis.
• Arabidopsis centromeric regions, although largely heterochromatic, overall contain at least 47
expressed genes.
• Arabidopsis contains several classes of proteins that are used in animal systems for processes
not present in the plant, underscoring the idea that evolution makes use of the tools it is given
to accomplish different tasks in different organisms.
• Plants have evolved a host of signal transduction machinery, perhaps to enable them to cope with
their sessile nature.
• Arabidopsis genome contains genes encoding RNA polymerase subunits not seen in other
eukaryotic organisms.
• Arabidopsis has genes unique to plants – approximately 150 unique protein families were found,
including 16 unique families of transcription factors.
• Arabidopsis has many gene families common to plants and animals which have been greatly
expanded in plants – for instance, Arabidopsis contains 10-fold as many aquaporin (water
channel) proteins as any other sequenced organism.
A first-generation haplotype map (HapMap) for A. thaliana :-
From a first set of 96 strains, 20 maximally diverse strains were chosen for much denser
polymorphism discovery using array-based resequencing. This led to the identification of about one
single nucleotide polymorphism (SNP) for every 200 bp of the genome, constituting one quarter or so of
all SNPs estimated to be present. In addition, regions that are missing or highly divergent in at least one
accession encompass about a quarter of the reference genome.
Fig 1. Arabidopsis thaliana Genome Map
The progress made with genome-wide association (GWA) mapping in humans during the past three
years has been nothing short of phenomenal, and bodes well for applying association mapping to A.
thaliana. As in humans, linkage disequilibrium (LD), which is the basis for GWA studies, decays over
about 10 kb, the equivalent of two average genes. That the average LD in Arabidopsis is not so different
from that in humans might seem surprising, given the selfing nature of A. thaliana, but it reflects the
fact that outcrossing is not that rare, and that this species apparently has a large effective population
size. A 250 k SNP chip (containing 250,000 probes), corresponding to approximately one SNP every 480
bp, has been produced, and should predict some 90% of all non-singleton SNPs. A collection of over
6,000 A. thaliana accessions, both from stock centers and recent collections has been assembled, and a
subset of 1,200 genetically diverse strains will be interrogated with the 250 k SNP chip, providing a
fantastic resource for GWA studies in this species.
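The quoted chip density follows from simple arithmetic: dividing the ~120 Mb reference genome by the 250,000 probes gives the average inter-SNP spacing.

```python
# Average inter-SNP spacing on the 250 k chip: genome size / probe count.
genome_bp = 120_000_000   # ~120 Mb Col-0 reference sequence
probes = 250_000          # SNP probes on the 250 k chip

spacing = genome_bp / probes
print(int(spacing))  # -> 480 bp between SNPs on average
```

Since linkage disequilibrium in A. thaliana decays over about 10 kb, one SNP every ~480 bp comfortably oversamples each LD block, which is why the chip can tag some 90% of non-singleton SNPs.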
The A. thaliana 1001 Genomes project :-
Together with partners from around the world, we have initiated a project with the goal of describing
the whole-genome sequence variation in 1,001 accessions of A. thaliana. The current technological
revolution in sequencing means that it is now feasible and inexpensive to sequence large numbers of
genomes. Indeed, a 1000 Genomes Project for humans was announced in January 2008, and the first
results of this initiative are very encouraging. It builds, in a manner similar to the A. thaliana project, on
previous HapMap information, but because of the greater complexity and repetitiveness of human
genomes, much of the initial effort for the human project will go towards comparing the feasibility of
different approaches. In contrast, even short reads of the A. thaliana sequence, such as those produced
by the first generation of Illumina’s Genome Analyzer instrument, have already been proved to support
not only the discovery of SNPs, but also of short to medium-size indels, including the detection of
sequences not present in the reference genome.
We are proposing a hierarchical strategy to sequence the species-wide genome of A. thaliana. The first
aspect of this approach is to make use of different technologies and different depths of sequencing
coverage. A small number of genome sequences that approach the quality of the original Col-0
reference will be generated by exploiting mostly technologies such as Roche’s 454 platform, which
generates longer reads, in combination with libraries of different insert sizes, allowing long-range
assembly. A much larger number of genomes will be sequenced with a less expensive technology such
as Illumina’s Genome Analyzer or Applied Biosystems’ SOLiD and with only a single type of clone library.
For this set of accessions, local haplotype similarity will be exploited in combination with information
from the reference genomes to deduce the complete sequence, using methods similar to those employed
in inbred strains of mice. The power of this approach is in the large number of accessions that can be
sequenced. For example, even if a particular haplotype is only present at 1% frequency, and each of the
1,001 strains is only sequenced at 8× coverage, there would still be on average 80 reads for each site in
this haplotype.
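The read-depth argument above is simple expectation arithmetic: the number of strains carrying a haplotype times the per-strain coverage gives the expected total reads over each site of that haplotype.

```python
# Expected read depth over a rare haplotype in the 1001 Genomes design.
strains = 1001
haplotype_frequency = 0.01   # haplotype present in only 1% of strains
coverage_per_strain = 8      # 8x sequencing depth per strain

carriers = strains * haplotype_frequency          # ~10 carrier strains
expected_reads = carriers * coverage_per_strain   # ~80 reads per site
print(round(expected_reads))
```

This pooling of evidence across carriers is what makes the imputation strategy viable even for low-coverage, short-read sequencing of individual accessions.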
The second aspect of the hierarchical approach will be the sampling of ten individuals from ten
populations each in ten geographic regions throughout Eurasia, plus at least one North African accession
(10 × 10 × 10 + 1). We expect individuals from the same region to show more extensive haplotype
sharing than is observed in worldwide samples, which will be advantageous for the imputation strategy
discussed above. An argument that might be raised against this approach is the strong population
structure it entails, but we note that it is probably impossible to sample accessions in a manner that
avoids population structure completely, and that our strategy will allow us to address questions of local
adaptation, which are of great interest to evolutionary scientists. The output of the 1001 Genomes
project will be a generalized genome sequence that encompasses every A. thaliana accession analysed
as a special case. It will comprise a mosaic of variable haplotypes such that every genome can be aligned
completely against it.
The main motivation for the 1001 Genomes project is, however, to enable GWA studies in this species.
The seeds from the 1,001 accessions will be freely available from the Arabidopsis stock centers, and
each accession can be grown and phenotyped by scientists from all over the world, in as many
environments as desired. Importantly, because an unlimited supply of genetically identical individuals
will be available for each accession, even subtle phenotypes and ones that are highly sensitive to the
microenvironment, which is often difficult to control, can be measured with high confidence. The
phenotypes will include morphological analyses, such as plant stature, growth and flowering;
investigations of plant content, such as metabolites and ions; responses to the abiotic environment,
such as resistance to drought or salt stress; or resistance to disease caused by a host of prokaryotic and
eukaryotic pathogens, from microbes to insects and nematodes. In the last case, a particularly exciting
prospect is the ability to identify plant genes that mediate the effects of individual pathogen proteins,
which are normally delivered as a complex mix to the plant, as is being done in the Effectoromics
project, which has the aim of “understanding host plant susceptibility and resistance by indexing and
deploying obligate pathogen effectors”. The value of being able to correlate many different phenotypes,
including genome-wide phenotypes, has already been beautifully demonstrated for the Drosophila
Genetic Reference Panel, and we expect similar dividends for the A. thaliana project.
1.3 GENOMICS OF RICE
Rice has the simplest of the monocotyledonous genomes analysed to date, which is a considerable
advantage when assessing chromosome structure and behaviour. The draft genome sequences for
japonica (Goff et al., 2002) and indica (Yu et al., 2002) rice have been published recently. These
sequences cover 92–93 % of the genome and predict 40,000–50,000 genes, the largest number in any
organism sequenced to date. The International Rice Genome Sequencing Project
(http://rgp.dna.affrc.go.jp/cgi-bin/statusdb/status.pl) aims to complete full and accurate genome
sequencing by the end of 2002. These projects generate the resources needed to assess basic genome
structure.
Basic Features of the Rice Genome :-
1) Genome size and relationships between cereal genomes :-
The rice genome (Oryza sativa; AA genome) is composed of 12 chromosomes (2n = 24) and has a total
length of 430 Mb (megabase; 1,000,000 base pairs), corresponding to about 1500 cM (centiMorgan, a genetic unit of length measured by the crossing-over frequency in genetic recombination at meiosis). Thus, in rice, each cM corresponds to a mean of about 290 kb, and the 1500 cM genetic length corresponds to an average of 15 crossovers per meiosis. This ratio varies amongst species and with genome size; rice has a relatively small
genome (430 Mb) compared with that of other common cereals [maize, 3000 Mb; barley, 3500 Mb;
wheat (a hexaploid species with A, B and D genomes), 7000 Mb]. However, in contrast to genome size,
the genetic distance of these plants is very similar and lies between 1200 and 1500 cM. Thus,
recombination occurs at much the same frequency in meiosis in each cereal, even though the genome
sizes differ by more than eight-fold. One reason for the similar numbers of recombinations could be that
they reflect a similar length of gene-rich regions; the remaining stretches of the genome being gene-rare
regions, each with its unique size and representing heterochromatic blocks suppressed for
recombination. Wheat, for example, has been reported to possess large blocks of
centromeric/pericentromeric heterochromatin, with gene-rich regions located on distal ends of
chromosomes. This type of large genome, possessing long heterochromatic regions having a low
recombination frequency, is consistent with wheat's similar genetic distance to that of other cereals
with smaller genomes. The rice genome can be expected to possess gene-rich regions that are similar to
those of other cereals but with a smaller proportion of repetitive sequences. Accordingly, a highly
conserved order of genes has been identified on chromosomes of rice and other cereal species, i.e.
genome synteny is evident among members of the Poaceae. This synteny was first recognized in rice
and maize and between rice and wheat in the genetic maps. Later, the evidence was expanded to the
level of physical maps. These results revealed that even on the physical map, gene order and distances
between genes are much the same among the cereal genomes. Rice has become the primary target for
cereal genome analysis because of its compact genome and high synteny with other cereal genomes,
together with its great economic value.
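The genome-size versus genetic-length comparison above reduces to a simple ratio. The sketch below uses the approximate figures quoted in the text, taking ~1500 cM as the genetic length for each cereal (the text gives a 1200–1500 cM range).

```python
# Genome sizes (Mb) and genetic lengths (cM) are the approximate figures
# quoted in the text; ~1500 cM is taken for each cereal.
cereals = {
    # name: (genome size in Mb, genetic length in cM)
    "rice":   (430, 1500),
    "maize":  (3000, 1500),
    "barley": (3500, 1500),
    "wheat":  (7000, 1500),
}

# Physical DNA per unit of genetic distance: kb per cM = (Mb * 1000) / cM.
kb_per_cm = {name: mb * 1000 / cm for name, (mb, cm) in cereals.items()}
for name, ratio in kb_per_cm.items():
    print(f"{name:6s}: {ratio:7.0f} kb per cM")

# Rice works out to ~290 kb/cM while wheat is ~4,700 kb/cM: similar genetic
# lengths (hence similar crossover numbers per meiosis) despite a >8-fold
# range in genome size.
```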
2) Genome structure and organization :-
Almost the whole genome of rice has been sequenced by the Syngenta group and the Chinese
sequencing group. Analysis of those draft sequences for the number and functional categories of
predicted genes indicates that rice has 40,000–50,000 functional genes in about 15,000 distinct gene families. Surprisingly, this number of genes is larger than that known for any other organism sequenced to date, and is double the number estimated for Arabidopsis thaliana. Even in A. thaliana, duplication has occurred randomly in most regions of the chromosomes, resulting in the chimeric tetraploid nature of the genome. As in Arabidopsis, the large number of genes in rice might reflect a polyploid genome.
Maize also has a tetraploid structure closely related to the rice genome, suggesting a contribution of
polyploidization to the genome evolution of this clade of species.
Analyses of approx. 10,000 expressed sequence tags (ESTs) and over 15,000 full-length cDNAs indicate the presence of many single-copy genes and various gene families in rice. Approximately 66,700 unique ESTs have been mapped recently: they cover almost 80 % of the genome, but most map to
distal parts of chromosomes. Typical of the redundant sequence families are 18S-25S and 5S ribosomal
RNA genes, which are repeated hundreds of times and clustered in certain regions of particular
chromosomes. Several gene families, such as ribosomal protein genes and histone genes, are scattered
over all 12 chromosomes. Numerous protein kinase genes were identified in the whole genome and
many disease resistance genes belonging to protein kinase families are clustered on several
chromosomes. It has yet to be determined whether there is a mechanism that controls gene order and
location on the chromosomes. If so, it would play a role not only in chromosome configuration but also
in the functional organization of the genome and might well be involved in genome evolution.
About 50 % of the rice genome, including some parts of open reading frames (ORFs), is made up of repetitive sequences. Precise identification of many kinds of repetitive sequences has been carried out using the full genome draft sequences, revealing that the rice genome contains approx. 38 Mb of long and 150 Mb of short repetitive DNA, including more than 48,000 di-, tri- and tetra-nucleotide simple sequence repeats (SSRs). The shortest repeats are categorized into microsatellites or SSRs (units of less than 10 bp), minisatellites (units of less than 40 bp) and satellite DNAs (units of several hundred bp). Different sub-units of various microsatellites have been calculated to repeat 5,700–10,000 times in the genome and to be highly scattered. One typical microsatellite (CCCTAAA) is the telomere repeat unit, which is tandemly arrayed in blocks ranging in size from several to dozens of kilobases on
every chromosome end. A large proportion of the moderately repeated sequences comprises
transposon- and other mobile DNA-related sequences. DNA transposons of the En/Spm family, several
types of retrotransposons of LINE, SINE, RIRE, TOS, MITE, and various other kinds of degenerate
sequences including solo-LTR have been found. One member of the TOS type of retrotransposon, Tos17,
was recently revealed to transpose to various genome positions when cells were subjected to stress
conditions, such as cell culture. Tos17 also proved to be very useful in enabling thousands of insertion
mutant lines to be generated. Several genes isolated from the tagged mutants by PCR screening for
pooled DNAs of Tos17 transposed lines have been reported. This is a powerful means by which to
construct a functional genomic system to search for many genes by both phenotypic screening and DNA
sequence analysis. Distribution and transposition of endogenous retrotransposon sequences may
contribute to the organization of genome structure and evolution, but this remains purely speculative.
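The repeat bookkeeping above can be restated in a short sketch. The Mb figures are those quoted in the text, and the unit-length thresholds simply re-express the classification given there (microsatellite < 10 bp units, minisatellite < 40 bp units, satellite DNA = several hundred bp units).

```python
# Figures quoted in the text: 430 Mb genome, ~38 Mb long and ~150 Mb short
# repetitive DNA.
GENOME_MB = 430
LONG_REPEAT_MB, SHORT_REPEAT_MB = 38, 150

repeat_fraction = (LONG_REPEAT_MB + SHORT_REPEAT_MB) / GENOME_MB
print(f"repetitive fraction: {repeat_fraction:.0%}")  # ~44% of the genome

def classify_repeat_unit(unit_bp):
    """Classify a tandem-repeat family by the length of its repeat unit."""
    if unit_bp < 10:
        return "microsatellite (SSR)"
    if unit_bp < 40:
        return "minisatellite"
    return "satellite DNA"

print(classify_repeat_unit(7))    # the 7 bp CCCTAAA telomere repeat unit
print(classify_repeat_unit(168))  # the 168 bp RCS2 centromere unit
```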
3) The Centromere :-
Centromeres mapped on the chromosomes :-
Centromeres are highly organized functional components necessary to maintain chromosome integrity
during cell division. In rice, the first successful cytological identification of centromeres was by Kurata
and Omura (1978) using a novel method of chromosome preparation. However, on the genetic map, the
positions of centromeres have not been defined on all 12 rice chromosomes. Attempts have been made
to locate centromeres on the genetic map by dosage analysis of RFLP markers using secondary trisomics
and telotrisomics. More precise mapping on the high-density genetic map could locate five of the centromeres to a single locus or within a 1 cM interval, but the remaining seven centromeres could not be restricted to less than 1 cM (Harushima et al., 1998). One or two loci that carry multiple DNA
markers are located on all candidate centromere regions of each chromosome. These clustered markers
could not be segregated from each other in the population of 186 F2 plants used for analysis. The
markers are known to be distributed on several YAC clones larger than 1 Mb. The minimum physical
lengths at single centromere loci were calculated to be 1390, 2160, 1610 and 1220 kb for chromosomes
1, 7, 9 and 11, respectively, by YAC physical mapping (http://rgp.dna.affrc.go.jp/public
data/physicalmap2001/YACall2001.html). These centromere regions should correspond well to the
centromeric heterochromatic blocks detected cytologically and are probably suppressed for
recombination along a 1 Mb length, as is already known for other organisms (Frary et al., 1996;
Copenhaver et al., 1999).
Centromere constitution with repetitive DNAs :-
In many organisms, such as fruit fly, human, A. thaliana and maize, centromeres have been shown to be
composed of long stretches of short tandem repeats and various other kinds of repetitive sequences as
shown in Fig 1. In rice, centromere-specific repetitive sequences and several kinds of repeat units have
been described. These are shown in Table 1 together with other plant centromeric repetitive sequences.
Aragón-Alcaide et al. (1996) first isolated a repetitive sequence, CCS1, commonly located on the
centromeres of most cereal plants (e.g. wheat, barley, maize and rice). Dong et al. (1998) identified a
BAC clone that possessed seven kinds of repeat elements, all of which were localized on centromeres.
Estimations of copy numbers in the genome showed that six of them are present in 50–300 copies. The RCS2
sequence of a 168 bp unit is repeated in over 5000 copies in the rice genome. RCS2 was also shown to
be arranged as short tandem repeat blocks of various sizes of up to 150 kb in length.
Another attempt to isolate centromere repeat sequences was made by targeting the centromere
protein-binding box (CENP-B). This isolated an RCE1 sequence that was a 1.9 kb unit, tandemly arrayed
with intervening sequences. On the other hand, retro-transposon studies in rice presented evidence that
most copies of gypsy-type retrotransposons, such as RIRE3, RIRE7 and RIRE8, are clustered on
centromeres though they are truncated and modified. All centromeric repeat units, other than the RCS2
short tandem repeat, were shown to be a part of gypsy-type retrotransposon sequences of RIRE3, RIRE7
and RIRE8. Using the information shown in Fig. 1 and Table 1, together with sequence and distribution
information for many centromere repetitive sequences, it becomes possible to deduce the
compositional feature of centromere structure. Accumulated results reveal that the centromere of each
organism is composed of a region of genus-specific highly repetitive short tandem repeats, each about
180 bp, and of long stretches of mixed, moderately repetitive transposon-related sequences common to
family members.
Fig 1. Schematic illustration of centromere structure for animal and plant species. All
centromeres have short tandem repeats (approx. 170 bp unit length) of different, but
extensive, length, and transposon-like moderately repeated sequences surrounding the
tandem repeat sequences. Orange and green blocks represent short tandem repeat arrays and
other colour blocks are moderately repetitive sequences. Diagrams are for
Schizosaccharomyces pombe, Drosophila melanogaster, Homo sapiens, Arabidopsis thaliana
and Zea mays.
Table 1. Centromere repetitive sequences in plants
Next steps for the study of Genome Organization :-
Many features of the structure and function of the rice genome have now been identified. Several tools
and experimental systems have also been established that facilitate structural and functional genome
analyses; several have been introduced in this Botanical Briefing. The genus Oryza has 22 wild rice
species with nine genomes (AA, BB, CC, BBCC, CCDD, EE, FF, GG and HHJJ). This is in addition to the two
widely cultivated species Oryza sativa (AA genome) and O. glaberrima (AA genome) (Vaughan, 1994).
Following the completion of genome sequencing of O. sativa, genome diversity within the genus will be
one of the next targets for genomic analysis. These studies will necessarily include problems of genome
organization and comparisons at a chromosome level, sequence level, functional level and evolutionary
level. Many mechanisms are likely to be involved in the organization of genome functions and in maintaining the complicated programmes of genome organization. Accordingly, resolving them will require novel breakthroughs in molecular biological, cytogenetical, biochemical and genetic methods.
Detailed analyses of interactions between homologous, non-homologous and/or homoeologous
chromosomes in meiosis and comparisons of centromere organization among them can be expected to
reveal principles for genome functioning at the chromosome level. Experimental insertion of artificial
chromosomes into transgenic plant cells should be a powerful tool by which to clarify essential and
supportive elements for centromere function and to identify active factors involved in chromosome
organization.
1.4 GENOMICS OF TOMATO
Tomato (Solanum lycopersicum L., formerly Lycopersicon esculentum Miller) is an economically
important crop worldwide, and a preeminent model system for genetic studies in plants. It is also the
most intensively investigated Solanaceous species, with simple diploid genetics, a short generation time,
routine transformation technology, and availability of rich genetic and genomic resources. It has a
diploid genome with 12 chromosome pairs and a genome size of 950 Mb encoding approximately
35,000 genes that are largely sequestered in contiguous euchromatic regions. Several resources are
available for genetic/genomic research in tomato including the following :
1) tomato wild species and mutant collections;
2) marker collections;
3) F2 synteny mapping population and permanent recombinant inbred (RI) mapping populations;
4) BAC libraries and an advanced physical map;
5) TILLING populations; and
6) tomato microarrays, gene silenced tomato lines, and VIGS libraries (for transient silencing).
Until recently, tomato genomics largely relied on molecular marker analysis and functional analysis of gene-sets. However, for a better understanding of the structural and functional aspects of its genome, the following high-throughput technologies are also being utilized :
• RNA transcription and protein analysis,
• screening of posttranslational modifications and protein-protein interactions, and
• discovery of metabolic networks.
The information generated by large-scale genome sequencing can lead to a major revolution in the
understanding of tomato biology.
The International Solanaceae Genome Project (SOL) was established to develop a network of knowledge
on the Solanaceae family and to coordinate the research efforts of different groups from around the
world. The Solanaceae Genomics Network website (SGN; http://www.sgn.cornell.edu) was created to
facilitate distribution of genomic information for tomato in particular and for Solanaceous species in
general in a comparative genomic context. The challenge facing SOL in the coming years is to develop
methodologies that will enable genomic information to be associated with phenotypes of interest for
crop improvement. The framework for organizing these data is the highly conserved genetic map of the
Solanaceae that will allow the information basis to be extended beyond individual species.
Progress in tomato research will depend on our ability to tie together the independent components into
higher-order complexity with multiple dimensions. Multidisciplinary research efforts, involving the
increased input of chemistry, physics, statistics, mathematics, and computing sciences, are becoming
increasingly crucial for the success of such an approach.
A) Structural Genomics :-
1) Molecular markers :
Beginning in the 1980s, different types of molecular markers have been developed in tomato. Among
crop species, tomato is one of the richest in the number and type of these genetic markers, including
restriction fragment length polymorphisms (RFLPs), simple sequence repeats (SSRs), cleaved amplified
polymorphic sequence (CAPS), amplified fragment length polymorphisms (AFLPs), and single nucleotide
polymorphism (SNP). Chronologically, RFLPs were the first markers developed. Currently, more than
1000 RFLPs have been mapped on the 12 tomato chromosomes. A subset of RFLP markers has been
converted into PCR-based markers through sequencing of their ends. These sequences are available
from the SGN Database, thus allowing specific primers for PCR reaction to be designed. Other PCR-based
markers were developed both as random markers, such as random amplified polymorphic DNA (RAPD),
AFLPs, and locus-specific markers, such as SSRs, CAPS, and conserved ortholog sets (COSs); and many of
them have been mapped onto the high-density tomato genetic map.
Given the huge number of markers that have been set up for tomato using different methods, a
database collecting the different datasets is available at the SGN website. Indeed, all information for
more than 15,000 markers is collected in the SGN, where a specific tool for “marker search” allows
markers to be located on the map. Markers can be searched by name, chromosome position, mapping
population, and BAC associations (if they have been associated with BAC from the tomato sequencing
project by hybridization with overgo probes or computationally by BLAST comparisons). Some of them
have also been grouped into collections for organizational purposes or because they are part of a
particular project. So, it is possible to select either COS (markers that have been mapped in both tomato
and Arabidopsis) or COSII markers (markers that have been mapped in several Asterid species, including
several Solanaceous species). Other groups comprise known function genes (KFG), or EST-derived (TM)
markers.
2) Genetic and physical maps :
Genetic mapping of morphological traits in tomato started at the beginning of the last century, and by 1973
a total of 257 morphological and disease resistance markers had been mapped. By the 1990s, tomato
had become one of the first plants for which RFLPs were used to generate a high-density linkage map.
Later several genetic maps using PCR-based markers were developed and integrated with the RFLP
maps, as reviewed by Labate et al. The first PCR-based reference genetic map covering the entire
tomato genome was reported by Frary et al. for a population derived from the cross S. lycopersicum × S.
pennellii.
The Solanaceae is the first family of flowering plants for which comparative mapping was conducted. As
a result, several genetic maps not only for tomato genome, but also for the genomes of other
Solanaceous crops are now available at the SGN site. Comparative genome analysis showed that tomato
and potato genomes differ in only five paracentric inversions, whereas the tomato and pepper genomes
differ in numerous rearrangements including several translocations as well as both pericentric and
paracentric inversions. More recently, Doganlar et al. have shown that eggplant and tomato genomes
are differentiated by 28 rearrangements, which could be explained by 23 paracentric inversions and five
translocations. These data suggest that paracentric inversions have been the primary mechanism for
chromosome evolution in the Solanaceae.
3) QTL mapping and exploitation of natural biodiversity :
The high-density RFLP linkage map developed for tomato facilitated extensive mapping of qualitative
traits such as various disease resistance genes, for example. This allowed tomato breeders to use
marker-assisted selection (MAS) for variety improvement. Furthermore, tomato was the first species for
which a whole genome molecular linkage map was used to identify quantitative trait loci (QTL), leading
to an understanding of the genetic basis of numerous quantitative traits including morphology, yield,
fruit quality, fruit primary and secondary metabolites, as well as resistance to a variety of abiotic and
biotic stresses. The QTL mapping studies conducted by de Vicente and Tanksley and by Eshed and Zamir
using mapping populations derived from interspecific tomato crosses provided stronger evidence that
despite the inferior phenotype, unadapted germplasm could also be used as a source of complementary
positive alleles that can result in favorable transgressive phenotypes once incorporated in the cultivated
background.
B) Functional Genomics :-
In order to understand the function of specific genes and their role in metabolic pathways, as well as to
identify the key steps in their coregulation mechanisms, several approaches have been exploited,
including mutagenesis, genetic transformation, and transcriptome analysis.
1) Insertional mutagenesis :
Both classical and insertional mutageneses have been used in tomato. Indeed, together with barley,
Arabidopsis, and maize, tomato was the focus of early, extensive mutagenesis programs. In a paper
published in 1964, Hans Stubbe reviewed over 250 tomato mutants arising from the seminal work of the
Gatersleben group. To date, over 600 characterized monogenic mutations are available in a variety of
genetic backgrounds at the Tomato Genetics Resource Center (http://tgrc.ucdavis.edu). More recently,
an extensive mutant population consisting of 6000 EMS-induced and 7000 fast neutron-induced mutant
lines has been obtained. This population is probably saturating. For instance, extensive allelic tests
confirmed that all the wiry mutants with 3 to 7 alleles present in TGRC are represented in the
population. Two new wiry loci have also been described in the collection, each with 10 alleles. A detailed
phenotypic description of the mutants is available online (http://zamir.sgn.cornell.edu/mutants).
Insertional mutagenesis systems exploiting exogenous transposon systems have also been described in
tomato. Nevertheless, these systems, some of which utilize the Micro-Tom cultivar, have not yielded
saturating mutant collections and have thus not been utilized extensively. Highly efficient protocols for
transformation of Micro-Tom have been described, which may serve as a tool for extensive T-DNA
mutagenesis programs as well.
2) Gene silencing (RNAi and VIGS) :
Strategies for gene silencing have also been widely used as a tool for functional genomics research in
tomato. Indeed, tomato fruit ripening was one of the early systems in which both sense and antisense
silencing were found to be effective. More recently, RNA interference (RNAi) and virus-induced gene
silencing (VIGS) have also been successfully used as functional genomics tools in tomato. Interestingly,
the use of RNAi can remain confined to the fruit, thus making fruit-specific silencing of genes possible.
Similarly, VIGS has been described in tomato roots and fruits although the extent to which silencing
remains confined to these organs has not been extensively investigated. Several viral vectors have been
used, including Tobacco rattle virus (TRV), Tomato yellow leaf curl China virus isolate, and potato virus X.
Of these, TRV displays the widest host range, allowing silencing in several Solanum species, as well as in
non-Solanaceous species like opium poppy and Arabidopsis.
3) Transient expression of exogenous genes :
Transient expression of exogenous genes has also been achieved through several transient
transformation techniques, such as particle bombardment or agroinfiltration. Recently, an agroinjection
technique was developed for tomato fruits, which allows the functional analysis of several genes in fruits
in a short time. This technique has been used both for expression of exogenous genes and for TRV-
induced gene silencing.
4) Transcriptional profiling :
Finally, transcriptional profiling is being widely explored, since the extensive EST collection available in tomato [4] has allowed the design of several microarray platforms. The most widely used to date have been Tom1, a cDNA-based microarray containing probes for approximately 8,000 independent genes, and Tom2, a long oligonucleotide-based microarray containing probes for approximately 11,000 independent genes. Both these microarrays are already available from BTI (http://bti.cornell.edu/CGEP/CGEP.html), and soon Tom2 will also be available from the EU-SOL project (http://www.eu-sol.net). The third array is an Affymetrix GeneChip, which contains probe sets for approximately 9,000 independent genes (http://www.affymetrix.com/products/arrays/specific/tomato.affx). As the tomato genome project progresses, a comprehensive, public tomato microarray
platform will become indispensable.
1.5 GENOMICS OF PIGEON PEA
Pigeonpea, a member of the family Fabaceae, is one of the important food legumes cultivated in tropical and subtropical regions. Due to its inherent ability to withstand harsh environments, it plays a critical role in ensuring the sustainability of subsistence agriculture. Furthermore, plasticity in maturity duration imparts greater adaptability across a variety of cropping systems. In the post-genomics era, the importance of pigeonpea is further evident from the fact that it has emerged as the first non-industrial legume crop for which the whole genome sequence has been completed. Sequencing revealed 605.78 Mb of assembled and anchored sequence against the predicted 833 Mb genome, representing 72.8 % of the whole genome. In order to perform genetic and genomic analyses, various
molecular markers like random amplified polymorphic DNA (RAPD), restriction fragment length
polymorphism (RFLP), amplified fragment length polymorphism (AFLP), simple sequence repeat (SSR),
diversity array technology (DArT), single feature polymorphism (SFP), and single nucleotide
polymorphism (SNP) were employed. So far, four transcriptome assemblies have been constructed, and different sets of EST-SSRs have been developed and validated in a panel of diverse pigeonpea genotypes. An extensive survey of BAC-end sequences (BESs) provided 3,072 BES-SSRs, which were further used for linkage analysis and trait mapping. To make the available linkage information more useful, six intra-specific genetic maps were joined into a single consensus genetic map, providing map positions for a total of 339 SSR markers.
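The consensus-map construction mentioned above can be sketched in miniature: a marker shared between individual maps is placed at the mean of its per-map positions, and markers are then ordered. Real pipelines (e.g. MergeMap or JoinMap) also resolve marker-order conflicts between maps; the marker names and positions below are hypothetical, for illustration only.

```python
from collections import defaultdict

# Toy consensus-map construction: average the positions of shared markers
# across individual genetic maps, then sort by consensus position.
def consensus_map(maps):
    """maps: list of {marker: position_cM}; returns [(marker, mean_cM), ...]."""
    positions = defaultdict(list)
    for genetic_map in maps:
        for marker, pos in genetic_map.items():
            positions[marker].append(pos)
    merged = {marker: sum(p) / len(p) for marker, p in positions.items()}
    return sorted(merged.items(), key=lambda kv: kv[1])

# Hypothetical marker names and positions, for illustration only.
map_a = {"SSR1": 0.0, "SSR2": 12.5, "SSR4": 30.0}
map_b = {"SSR1": 0.0, "SSR3": 8.0, "SSR4": 28.0}
print(consensus_map([map_a, map_b]))
# -> [('SSR1', 0.0), ('SSR3', 8.0), ('SSR2', 12.5), ('SSR4', 29.0)]
```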
Genome Size :-
Pigeonpea is a diploid crop with chromosome number 2n = 2x = 22. The various karyotype studies
conducted in pigeonpea have concluded that all the wild relatives of pigeonpea carry the same number
of chromosomes. After soybean, pigeonpea became the second member of the Phaseoloid clade for which a draft genome sequence has become available; based on K-mer statistics, the entire genome size was estimated to be 833.07 Mb.
Genomic Resources :-
1) Mapping Populations :
Availability of large segregating populations is an essential requirement for molecular tagging of traits of
interest. Several types of bi-parental mapping populations, such as F2, backcross (BC1F1), recombinant inbred lines (RILs), near-isogenic lines (NILs) and doubled haploid (DH) lines, are being employed for genetic
map construction and trait mapping. Based on morphological and molecular diversity and targeting the
trait segregation a series of mapping populations were generated in pigeonpea under phase I of
pigeonpea genomic initiative (PGI). A total of 25 F2 mapping populations were reported in pigeonpea
segregating for several traits such as resistance to sterility mosaic disease (SMD), Fusarium wilt (FW),
waterlogging and fertility restoration (Rf). Most of these populations have reached the RIL stage and are
being deployed for multi-location trials. Details on these mapping populations have been provided by
Varshney et al. (2010). Of these mapping populations, an inter-specific F2 mapping population (ICP 28 × ICPW 94) was chosen for constructing a high-density reference genetic map for pigeonpea. Apart from the PGI, a few more mapping populations were developed at various national agricultural research centers (see Table 1).
Table 1. Trait mapping in pigeonpea
2) Molecular Markers :
A wide range of DNA markers has been employed in pigeonpea, including RAPD, RFLP, AFLP, SSR, DArT, SFP and SNP. All these marker systems have been used for a variety of applications, e.g. estimation of genetic diversity and construction of genetic maps. Initially, SSRs were preferred over other marker systems due to the unavailability of SNPs and to several advantages such as higher abundance, co-dominant and multi-allelic nature, and ease of scoring. In pigeonpea, SSRs were
generated through :
• enriched library
• in silico expressed sequence tags (ESTs) mining and
• surveying BAC-end sequences and whole genome sequence.
The first set of SSRs comprising ten SSRs in pigeonpea was developed by Burns et al. (2001) using CA and
CT repeat-enriched libraries. However, development of SSRs through enriched libraries remains time-consuming and low-throughput. In this context, sequencing of BAC ends and mining them for SSRs has provided a potential alternative for large-scale SSR discovery.
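In silico SSR mining of the kind applied to ESTs and BAC-end sequences can be sketched with a regular expression. The dinucleotide-only search and the five-repeat threshold below are illustrative assumptions, not the criteria used in the pigeonpea studies cited above.

```python
import re

# Dinucleotide SSRs: a 2 bp motif repeated at least 5 times in tandem.
# Both the motif-length restriction and the 5-repeat threshold are
# illustrative choices for this sketch.
DI_SSR = re.compile(r"([ACGT]{2})\1{4,}")

def find_di_ssrs(seq):
    """Return (motif, repeat_count, start_index) for each dinucleotide SSR."""
    hits = []
    for m in DI_SSR.finditer(seq):
        motif = m.group(1)
        if motif[0] == motif[1]:  # skip homopolymer runs such as AAAA...
            continue
        hits.append((motif, len(m.group(0)) // 2, m.start()))
    return hits

demo = "GGATTCACACACACACACGGT" + "AAAAAAAAAA" + "CTCTCTCTCTCTGG"
print(find_di_ssrs(demo))  # -> [('CA', 6, 5), ('CT', 6, 31)]
```

The same idea, extended to tri- and tetra-nucleotide motifs and run over thousands of sequences, is what turns an EST or BES collection into a large SSR marker catalogue.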
3) BAC Libraries :
BAC libraries harbor large inserts of DNA ranging from 100 to 350 kb with an average insert size of 150
kb. The large size of DNA inserts ensures better coverage of the genome. These offer several advantages
like ease of handling, high stability, non-chimeric nature and better transformation efficiency over other
vectors such as yeast artificial chromosomes (YACs) and cosmids. BAC libraries represent a key
genomic resource extensively used for:
• physical map construction,
• comparative genome analysis via searching for macrosyntenic blocks across species,
• map-based or positional cloning to isolate genes/ QTLs responsible for economically important
traits,
• large scale DNA marker discovery through BAC-end sequencing, and
• assembling of raw sequence reads into genome assembly for an organism.
In pulses, several BAC libraries have been reported or are being constructed, e.g. for chickpea, lentil,
pigeonpea, mungbean, cowpea, field pea and common bean. In pigeonpea, two BAC libraries were
constructed using the HindIII and BamHI restriction enzymes, each composed of 34,560 clones. The
average insert size of the HindIII library was 120 kb, while the BamHI library had an average insert size
of 115 kb. Together these clones represented ~11x coverage of the pigeonpea genome. The sequences
adjacent to the cloning sites, known as BAC-end sequences (BESs), are potential resources for
identifying minimally overlapping clones. With this perspective, 50,000 randomly selected BAC clones
were targeted for end sequencing, which generated a set of 88,860 high-quality BESs.
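The coverage figure follows from the standard calculation: total insert DNA divided by genome size. A sketch assuming a pigeonpea genome size of ~833 Mb (an assumption not stated in the text; the exact value shifts the estimate somewhat):

```python
def bac_coverage(n_clones, avg_insert_kb, genome_mb):
    """Genome coverage of a BAC library: total cloned DNA / genome size."""
    total_kb = n_clones * avg_insert_kb
    return total_kb / (genome_mb * 1000.0)

# The two pigeonpea libraries from the text, 34,560 clones each,
# with 120 kb (HindIII) and 115 kb (BamHI) average inserts:
cov = bac_coverage(34560, 120, 833) + bac_coverage(34560, 115, 833)
# cov is roughly 10x, in the same ballpark as the ~11x quoted.
```

Higher coverage increases the chance that any given genomic region is represented by at least one clone.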
4) Genetic Maps :
Saturated genetic maps have been constructed for several legumes like chickpea, cowpea, common
bean, soybean etc. Till 2010, no genetic map was available for pigeonpea due to the non-availability of
ample genomic resources, such as molecular markers and segregating mapping populations, a situation
exacerbated by low genetic variation in the Cajanus primary gene pool. Following the large-scale
development of BES-SSR and DArT markers, first-generation genetic maps were constructed for an F2
population derived from the inter-specific cross ICP 28 (C. cajan) x ICPW 94 (C. scarabaeoides). The
SSR-based genetic map covered a total map length of 930.9 cM with 239 loci and an average inter-marker
distance of 3.8 cM. In parallel, DArT-based genotyping of this parental combination provided a set of
388 polymorphic markers. However, the coupling and repulsion phases of the polymorphic markers
resulted in the development of paternal- and maternal-specific genetic maps with 172 and 122 unique
loci, respectively.
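The average inter-marker distance quoted above follows from simple arithmetic on map length and locus number; a sketch (conventions differ on whether to divide by the number of loci or of intervals, which explains small discrepancies with the 3.8 cM figure):

```python
def avg_marker_spacing(map_length_cm, n_loci):
    """Average spacing between adjacent loci on a linkage map (cM).

    n_loci markers define (n_loci - 1) intervals along the map.
    """
    return map_length_cm / (n_loci - 1)

# The pigeonpea SSR map: 930.9 cM spanned by 239 loci.
spacing = avg_marker_spacing(930.9, 239)  # ~3.9 cM
```

Denser maps (smaller spacing) give more precise QTL localization and better support for map-based cloning.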
5) Trait Mapping :
Trait mapping is one of the important pre-requisites for predicting phenotype from genotype. Compared
with some other legumes like chickpea and common bean, not much progress has been witnessed in the
area of trait mapping in pigeonpea. Earlier, an inadequate supply of DNA polymorphisms and the lack of
saturated genetic maps posed obstacles to undertaking QTL analysis in pigeonpea. Despite this, some
traits, such as tolerance to SMD and FW and ideal plant type, were chosen for
mapping using bulked segregant analysis (BSA). BSA was performed using DNA from extreme
phenotypes from segregating F2 populations. The first instance of QTL analysis was reported by Gnanesh
et al. (2011) to tag SMD resistance in pigeonpea. This study reported the existence of major- as well as
minor-effect QTLs imparting resistance against SMD. The investigation included two F2 mapping
populations, which were subjected to linkage and QTL analysis. The results indicated the occurrence of
six QTLs (designated qSMD1-6) explaining phenotypic variation in the range of 8.3-24.72 % (Table 1).
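The BSA logic described above can be reduced to comparing allele frequencies of a marker between the two extreme bulks: a marker unlinked to the trait shows similar frequencies in both bulks, while a linked marker shows a large difference. A minimal sketch with hypothetical F2 genotype data:

```python
def bulk_allele_freq(genotypes):
    """Frequency of the 'A' allele in a bulk of F2 genotypes
    coded as 'AA', 'AB', or 'BB'."""
    a_count = sum(g.count('A') for g in genotypes)
    return a_count / (2 * len(genotypes))

def bsa_delta(resistant_bulk, susceptible_bulk):
    """Allele-frequency difference between extreme bulks; values near
    +/-1 suggest linkage to the trait, values near 0 suggest no linkage."""
    return bulk_allele_freq(resistant_bulk) - bulk_allele_freq(susceptible_bulk)

# Hypothetical marker tightly linked to SMD resistance:
delta = bsa_delta(['AA'] * 9 + ['AB'], ['BB'] * 8 + ['AB'] * 2)  # -> 0.85
```

In practice many markers are screened this way, and those with large deltas are followed up on the full segregating population.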
1.6 GENOMICS OF WHEAT
Wheat (Triticum aestivum L.), with a large genome (~16,000 Mb) and a high proportion (∼80%) of
repetitive sequences, has been a difficult crop for genomics research. However, the availability of
extensive cytogenetic stocks has been an asset that facilitated significant progress in wheat genomics
research in recent years. For instance, fairly dense molecular maps (both genetic and physical maps) and a large
in recent years. For instance, fairly dense molecular maps (both genetic and physical maps) and a large
set of ESTs allowed genome-wide identification of gene-rich and gene-poor regions as well as QTL
including eQTL. The availability of markers associated with major economic traits also allowed
development of major programs on marker-assisted selection (MAS) in some countries, and facilitated
map-based cloning of a number of genes/QTL. Resources for functional genomics including TILLING and
RNA interference (RNAi) along with some new approaches like epigenetics and association mapping are
also being successfully used for wheat genomics research. BAC/BIBAC libraries for the subgenome D and
some individual chromosomes have also been prepared to facilitate sequencing of gene space.
Wheat is adapted to temperate regions of the world and was one of the first crops to be domesticated
some 10000 years ago. At the cytogenetics level, common wheat is known to have three subgenomes
(each subgenome has 7 chromosomes, making n = 21) that are organized in seven homoeologous
groups, each homoeologous group has three closely related chromosomes, one from each of the three
related subgenomes. The diploid progenitors of the A, B, and D subgenomes have been identified,
although there has always been a debate regarding the progenitor of B genome. It has also been found
that common wheat behaves much like a diploid organism during meiosis, but its genome can tolerate
aneuploidy because of the presence of triplicate genes. These features, along with the availability of a
large number of aneuploids [particularly a complete set of monosomics, a set of 42 compensating
nullisomic-tetrasomics and a complete set of 42 ditelocentrics developed by Sears] and more than 400
segmental deletion lines [developed later by Endo and Gill], greatly facilitated wheat genomics
research.
TILLING in Wheat :-
Recently, Targeting Induced Local Lesions IN Genomes (TILLING) was developed as a reverse-genetics
approach to take advantage of DNA sequence information and investigate the functions of specific genes.
TILLING was initially developed for the model plant Arabidopsis thaliana, which has a fully sequenced
diploid genome, and has now also been used successfully in the complex allohexaploid genome of wheat,
once considered the most challenging candidate for reverse genetics.
To demonstrate the utility of TILLING for the complex genome of bread wheat, Slade et al. created
TILLING libraries in both bread and durum wheat and targeted the waxy locus, a well-characterized gene
in wheat encoding granule-bound starch synthase I (GBSSI). Loss of all copies of this gene results in the
production of waxy starch (lacking amylose). Production of waxy wheat by traditional breeding was
difficult due to a lack of genetic variation at one of the waxy loci. However, targeting the waxy loci by
TILLING using locus-specific PCR primers led to the identification of 246 alleles (196 in hexaploid and
50 in tetraploid wheat) among 1,920 lines (1,152 hexaploid and 768 tetraploid). This made available
novel genetic diversity at waxy loci and provided a way for allele mining in important germplasm of
wheat. The approach also allowed evaluation of a triple homozygous mutant line containing mutations
in two waxy loci (in addition to a naturally occurring deletion of the third locus) and exhibiting a near
waxy phenotype.
Another example of on-going research using TILLING in wheat is the development of EMS mutagenised
populations of T. aestivum (cv. Cadenza, 4200 lines, cv. Paragon, 6000 lines), T. durum (cv. Cham1, 4,200
lines), and T. monococcum (Accession DV92, 3000 lines) under the Wheat Genetic Improvement
Network (WGIN; funded by Defra and BBSRC in the UK and by the EU Optiwheat programme). The aim
of this program is to search for novel variant alleles of the Rht-B1c, RAR-1, SGT-1, and NPR-1 genes.
The above examples provide proof-of-concept for TILLING other genes, whose mutations may be
desired in wheat or other crops. However, homoeolog-specific primers are required in order to identify
new alleles via TILLING in wheat. In the case of waxy, the three homoeologous sequences
were already known, which facilitated primer design, but TILLING of other genes may require cloning
and sequencing of these specific genes in order to develop homoeolog-specific target primers.
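The requirement for homoeolog-specific primers amounts to checking that a candidate primer matches only one of the three homoeologous copies. A toy sketch with made-up sequences standing in for the A-, B-, and D-genome waxy homoeologues (real primer design must also consider melting temperature, primer length, and PCR conditions):

```python
def is_homoeolog_specific(primer, target, other_homoeologs):
    """A primer is homoeolog-specific if its sequence occurs in the target
    homoeologue but in none of the other subgenome copies."""
    return primer in target and all(primer not in h for h in other_homoeologs)

# Toy sequences differing at one diagnostic position (hypothetical):
wx_a = "ATGGCTGCACTCGTTACG"
wx_b = "ATGGCTGCGCTCGTTACG"
wx_d = "ATGGCTGCTCTCGTTACG"

ok = is_homoeolog_specific("GCACTCGTT", wx_a, [wx_b, wx_d])  # True
```

A primer that spans a position where the three homoeologues differ, as here, amplifies only one subgenome copy, so TILLING mutations can be assigned unambiguously.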
Molecular Maps of Wheat Genome :-
1) Molecular genetic maps :-
Although some initial efforts toward mapping molecular markers on the wheat genome were made
during the late 1980s, systematic construction of molecular maps in wheat started only in 1990, with the
organization of the International Triticeae Mapping Initiative (ITMI), which coordinated the construction
of molecular maps of the wheat genome. Individual groups prepared maps for chromosomes belonging to
each of the seven different homoeologous groups. A detailed account on mapping of chromosomes of
individual homoeologous groups and that of the whole wheat genome is available elsewhere; an
updated version is available at GrainGenes (http://wheat.pw.usda.gov/), and summarized in Table 1.
Integrated or composite maps involving more than one type of molecular markers have also been
prepared in wheat (particularly the SSR, AFLP, SNP, and DArT markers (see Table 1)). Consensus maps,
where map information from multiple genomes or multiple maps was merged into a single
comprehensive map, were also prepared in wheat. On these maps, classical and newly identified genes
of economic importance are being placed to facilitate marker-assisted selection (MAS). Many genes
controlling a variety of traits (both qualitative and quantitative) have already been tagged/mapped using
a variety of molecular markers. The density of wheat genetic maps was improved with the development
of microsatellite (SSR) markers leading to construction of SSR maps of wheat. Later, Somers et al. added
more SSR markers to these earlier maps and prepared a high-density SSR consensus map. At present,
>2500 mapped genomic SSR (gSSR) markers are available in wheat, which will greatly facilitate the
preparation of high-density genetic maps, allowing key recombination events to be identified in breeding
populations and genes to be fine-mapped. In addition to gSSRs, more than 300 EST-SSRs could also be
placed on the genetic map of the wheat genome. However, more markers are still needed, particularly for
the preparation of high-density physical maps for gene cloning. The availability of a number of molecular
markers, each associated with an individual trait, will also facilitate marker-assisted selection (MAS)
during plant breeding.
PBTEL-463 (Applications of Genomics and Proteomics) 24
Table 1. A list of some important molecular maps developed in wheat.
2) Molecular marker-based physical maps :-
Molecular markers in bread wheat have also been used for the preparation of physical maps, which
were then compared with the available genetic maps involving the same markers. These maps allowed
comparisons between genetic and physical distances to give information about variations in
recombination frequencies and cryptic structural changes (if any) in different regions of individual
chromosomes. Several methods have been employed for the construction of physical maps.
a) Deletion mapping :
In wheat, physical mapping of genes to individual chromosomes began with the development of
aneuploids, which led to mapping of genes to individual chromosomes. Later, deletion lines of wheat
chromosomes developed by Endo and Gill were extensively used as a tool for physical mapping of
molecular markers. Using these deletion stocks, genes for morphological characters were also mapped
to physical segments of wheat chromosomes directly in case of unique and genome specific markers or
indirectly in case of duplicate or triplicate loci through the use of intergenomic polymorphism between
the A, B, and D subgenomes (see Table 2 for details of available physical maps). In addition to physical
mapping of genomic SSRs, ESTs and EST-SSRs were also subjected to physical mapping (see Table 2). As a
part of this effort, a major project (funded by National Science Foundation, USA) on mapping of ESTs in
wheat was successfully completed by a consortium of 13 laboratories in USA leading to physical
mapping of ∼16000 EST loci.
Table 2. Deletion-based physical maps of common wheat.
b) In silico physical mapping :
As many as 16000 wheat EST loci assigned to deletion bins, as mentioned above, constitute a useful
source for in silico mapping, so that markers with known sequences can be mapped to wheat
chromosomes through sequence similarity with mapped EST loci available in the GrainGenes database
(http://wheat.pw.usda.gov/GG2/blast.shtml). Using the above approach, Parida et al. were able to map
157 SSR containing wheat unique sequences (out of 429 class I unigene-derived microsatellites (UGMS)
markers developed in wheat) to chromosome bins. These bin-mapped UGMS markers provide valuable
information for a targeted mapping of genes for useful traits, for comparative genomics, and for
sequencing of gene-rich regions of the wheat genome. Another set of 672 loci belonging to 275 EST-SSRs
of wheat and rye was assigned to individual bins through in silico and wet-lab approaches by Mohan et
al. A few cDNA clones associated with QTL for FHB resistance in wheat were also successfully mapped
using the in silico approach.
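The in silico assignment described above amounts to matching a marker's sequence against the set of deletion-bin-mapped ESTs. A minimal sketch with hypothetical sequences and bin names (real pipelines run BLAST against the GrainGenes EST set rather than exact substring matching):

```python
def in_silico_bin_map(marker_seq, bin_mapped_ests):
    """Assign a marker to chromosome bins whose mapped EST shares
    sequence with it (exact substring here, as a stand-in for BLAST)."""
    return [bin_id for bin_id, est in bin_mapped_ests.items()
            if marker_seq in est or est in marker_seq]

# Hypothetical bin-mapped ESTs, keyed by deletion-bin identifier:
ests = {"3BL-0.5": "TTGACCGGTACGATTGCA",
        "1DS-0.9": "CCGTAGGCTTAACGGA"}

bins = in_silico_bin_map("CGGTACGATT", ests)  # -> ["3BL-0.5"]
```

A marker matching a single bin is placed directly; matches to multiple bins flag duplicated loci that need wet-lab confirmation.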
c) Radiation-hybrid mapping :
Radiation hybrid (RH) mapping was first described by Goss and Harris and was initially used by Cox et al.
for physical mapping in animals/humans. In wheat, the approach has been used at North Dakota State
University (NDSU), utilizing addition and substitution of individual D-genome chromosomes into
tetraploid durum wheat. For RH mapping of chromosome 1D, a durum wheat alien substitution line for
chromosome 1D (DWRH-1D), harboring the nuclear-cytoplasmic compatibility gene scsae, was used.
These RH lines initially
allowed detection of 88 radiation-induced breaks involving 39 1D specific markers. Later, this 1D RH map
was further expanded to a resolution of one break every 199 kb of DNA, utilizing 378 markers. Using the
same approach, construction of radiation hybrid map for chromosome 3B is currently in progress.
3) BAC-based physical maps :-
BAC-based physical map of wheat D genome is being constructed using the diploid species, Aegilops
tauschii, with the aim to identify and map genes and later sequence the gene-rich regions (GRRs). For
this purpose, a large number of BACs were first fingerprinted and assembled into contigs. Fingerprint
contigs (FPCs) and the data related to physical mapping of the D genome are available in the database
(http://wheat.pw.usda.gov/PhysicalMapping/index.html). BACs belonging to chromosome 3B are also
being fingerprinted (with a few BACs already anchored to wheat bins), and a whole-genome BAC-based
physical map of hexaploid wheat is proposed to be constructed under the aegis of IWGSC in its pilot
studies.
IN SITU HYBRIDIZATION STUDIES IN WHEAT :-
In bread wheat, in situ hybridization (ISH) involving radioactively labeled probes was initially used to
localize repetitive DNA sequences, rRNA and alien DNA segments. Later, fluorescence in situ
hybridization (FISH), multicolor FISH (McFISH, simultaneous detection of more than one probe), and
genome in situ hybridization (GISH, total genomic DNA as probe) were used in several studies. FISH with
some repeated sequences as probes was used for identification of individual chromosomes. FISH was
also utilized to physically map the rRNA multigene family, RFLP markers, and unique sequences, and for
detecting and locating alien chromatin introgressed into wheat.
A novel high-resolution FISH strategy using super-stretched flow-sorted chromosomes was also used
(extended DNA fibre-FISH) to fine map DNA sequences and to confirm integration of transgenes into the
wheat genome.
Recently, BACs were also utilized as probes for so-called BAC-FISH, which helped not only in the
discrimination between the three subgenomes, but also in the identification of intergenomic
translocations, molecular cytogenetic markers, and individual chromosomes. BAC-FISH also helped in
localization of genes (BACs carrying genes) and in studying genome evolution and organization among
wheat and its relatives.
MAP-BASED CLONING IN WHEAT :-
In wheat, a number of genes for important traits, including disease resistance, vernalization response,
grain protein content, free-threshing habit, and tolerance to abiotic stresses, have recently been cloned or
are likely to be cloned via map-based cloning (see Table 3). The first genes to be isolated
from wheat by map-based cloning were three resistance genes against fungal diseases: leaf rust (Lr21
and Lr10) and powdery mildew (Pm3b). A candidate gene for the Q locus, which confers the
free-threshing character of domesticated wheat, was also cloned. This gene influences many other
domestication-related traits like glume shape and tenacity, rachis fragility, plant height, spike length,
and ear-emergence time. Another important QTL, Gpc-B1, associated with increased grain protein, zinc,
and iron content, has been cloned and will contribute to breeding wheat with enhanced nutritional value.
The cloning of three genes for vernalization response (VRN1, VRN2, VRN3) helped in postulating a
hypothetical model summarizing interactions among these three genes.
Table 3. Genes already cloned or likely to be cloned through map-based cloning in wheat.
1.7 TRANSCRIPTOMICS
The transcriptome is the complete set of RNAs transcribed from the genome in a specific tissue or cell
type, at a given developmental stage and/or under a certain physiological condition. Once a genome has
been sequenced, transcriptome analysis allows us to understand expression of the genome at the
transcription level, providing information on gene structure, regulation of gene expression, gene product
function, and genome dynamics. Transcriptome analysis will further reveal the regulatory networks of
biological processes and eventually provide guidance in disease diagnosis, clinical therapy, and crop
improvement.
Transcriptomics is the study of RNA, a single-stranded nucleic acid, which was not clearly distinguished
from DNA until Francis Crick formulated the central dogma in 1958: the idea that genetic information is
transcribed from DNA to RNA and then translated from RNA into protein. In 1961, Jacob and Monod
proposed a model in which the protein-coding gene is transcribed into a special short-lived intermediate
associated with the ribosome, which they designated messenger RNA (mRNA). In 1958, together with
the central dogma of molecular biology, Crick put forward the "adaptor" hypothesis to explain how the
mRNA template directs protein synthesis. In this hypothesis, Crick predicted that each amino acid is
first attached to its own "adaptor", which fits onto the mRNA template by base-pairing and thus carries
the amino acid to a specific site on the RNA template. A short, stable RNA, transfer RNA (tRNA), was
later identified as the predicted "adaptor". Shortly thereafter, ribosomal RNA (rRNA), involved in protein
synthesis, was purified.
Steps of Transcriptomic Analysis :-
1) Quantifying the transcript :
The transcriptional response of the genome varies across tissues and under different physiological
conditions or environmental stimuli. Discovering differentially expressed genes was one of the earliest
goals of transcriptome analysis. Expressed sequence tag based methods (EST, SAGE), hybridization-based
gene microarray/chip technology, and NGS-based RNA sequencing (RNA-seq) were developed to scan
the transcriptome quickly and identify differentially expressed genes. Many key genes in various
developmental, physiological, or pathological processes were identified by these means.
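The core comparison behind such differential-expression screens is a fold change between conditions; production pipelines additionally normalize for library size and test statistical significance (e.g. DESeq2-style models). A minimal sketch of the core metric only:

```python
import math

def log2_fold_change(count_a, count_b, pseudocount=1):
    """Log2 fold change of a gene's read counts between two conditions.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return math.log2((count_a + pseudocount) / (count_b + pseudocount))

# A gene with 255 reads in stressed tissue vs 31 in control:
lfc = log2_fold_change(255, 31)  # 256/32 -> 3.0, i.e. 8-fold up
```

Genes whose |log2 fold change| exceeds a chosen threshold (and passes a significance test) are reported as differentially expressed.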
2) Defining the gene structure and RNA metabolism :
A gene is usually defined as a genetic unit; in the context of transcription, we also refer to a gene as a
transcription unit. From the gene to the functionally mature RNA, multiple steps of transcription and
post-transcriptional processing take place in the cell. Transcription consists of initiation (forming the
transcription initiation complex), elongation, pausing (transcription complexes stall just downstream of
the transcription start site), and termination. By combining traditional biochemical and molecular
biology techniques with NGS, transcription events can be observed globally at higher throughput and
greater precision (Table 1).
NGS Method | Conventional Method | Description
RNA-seq | EST, SAGE, microarray, CAGE | Quantify and characterize the transcriptome
Small RNA-seq | miRNA microarray | Characterize small non-coding RNAs
deepCAGE, nanoCAGE, PEAT, CAGEscan, PRO-cap | CAGE, 5′ RACE | Map the 5′ end of mRNA / transcription start site
PRO-seq, GRO-seq | Nuclear run-on | Detect nascent RNA
3P-seq, PAS-seq | 3′ RACE | Detect alternative polyadenylation
BRIC-seq | NA | Measure half-life of RNA transcripts
PAR-CLIP, Argonaute HITS-CLIP, Argonaute CLIP-seq | NA | Detect Argonaute-associated RNA to predict miRNA targets
PARE-seq, degradome-seq | Modified 5′ RACE | Map the 5′ end of RNA degradation products to predict miRNA targets
STRT, SMART-seq | SMART/template switch | Single-cell or low-RNA-input transcriptome analysis
CEL-seq | In vitro transcription based cDNA amplification | Single-cell or low-RNA-input transcriptome analysis
Table 1. RNA-seq based methods
To understand the RNA transcript structure and its promoter, mapping the transcription start site (TSS)
is required. Taking advantage of the cap structure at the mRNA 5′ end, the cap analysis of gene
expression (CAGE) method was developed to sequence 5′ ends using Sanger sequencing. The method
was improved when NGS took the place of Sanger sequencing. Paired-end analysis of TSSs (PEAT),
deepCAGE, nanoCAGE and CAGEscan reveal the TSS of each gene precisely. Similarly, the precision
nuclear run-on and sequencing (PRO-cap) method allows detection of the TSSs of nascent RNAs.
Digital quantification by regular RNA-seq represents only the steady state of an RNA species and does
not reflect the dynamic process of RNA metabolism. To determine RNA biogenesis rates, genome-wide
nuclear run-on and sequencing (GRO-seq) and its improved version PRO-seq succeeded in monitoring
nascent mRNA globally at very high resolution (single-nucleotide for PRO-seq). These two methods,
combining RNA-seq and the nuclear run-on assay, provide not only the rates of transcription initiation
and elongation but also RNA polymerase pausing positions. Kwak et al. found that transcription pausing
and elongation activation occur widely in the Drosophila genome.
3) Studying the post-transcriptional processing :
The maturation of RNA involves a series of steps such as 5′ capping, splicing, 3′ cleavage and
polyadenylation. The human genome contains fewer than 30,000 protein-coding genes, yet human cells
can produce more than 80,000 different proteins, mainly because the RNA precursor of one gene can
generate different mature RNA molecules through post-transcriptional processing such as alternative
splicing. With paired-end sequencing and improved read length and depth, the majority of
intron-containing coding genes were found to have more than one isoform. Hence, when quantifying the
expression of a gene it is important to consider the abundance of each isoform rather than only the sum
over isoforms.
RNA editing is a form of post-transcriptional processing in which RNA sequence alterations are
introduced, such as uridine insertion/deletion and A-to-I changes. These alterations may change the
amino acid sequence of a protein, splice sites within the RNA precursor, or the seed sequence of
miRNAs. By comparing RNA-seq results to the reference genome sequence, Park et al. found 500–3000
RNA editing events in a given cell type after filtering out polymorphisms and somatic mutations.
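The comparison performed in such studies can be illustrated at its simplest: scanning aligned RNA and reference sequences for A-to-G mismatches, the signature of A-to-I editing (inosine is read as G by sequencers). A toy sketch on hypothetical sequences (real analyses work on read pileups and must filter known SNPs and somatic mutations, as Park et al. did):

```python
def candidate_editing_sites(ref, rna):
    """Positions where the RNA sequence differs from the genomic
    reference by A->G, the hallmark of A-to-I editing."""
    return [i for i, (r, q) in enumerate(zip(ref, rna))
            if r == 'A' and q == 'G']

# Two hypothetical A->G mismatches at positions 2 and 7:
sites = candidate_editing_sites("TTACGATAG", "TTGCGATGG")  # -> [2, 7]
```

Any other mismatch type (e.g. C-to-T) would be ignored here; in real data such sites point instead to sequencing errors or genomic variants.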
mRNA 3′ end processing involves endonucleolytic cleavage and the addition of multiple adenosines. The
mRNA products of many genes have more than one 3′ cleavage or polyadenylation site, a phenomenon
known as alternative polyadenylation (APA). APA may affect the length of the protein-coding sequence
or the 3′ untranslated region, thereby regulating mRNA translation efficiency and/or half-life. The
RNA-seq based methods PAS-seq and 3P-seq, together with specific bioinformatics analyses, revealed
that APA is an evolutionarily conserved mechanism of gene regulation.
Degradation is an important step in RNA metabolism. Tani et al. developed 5′-bromo-uridine
immunoprecipitation chase-deep sequencing (BRIC-seq) to determine the half-lives of RNAs by
sequencing pulse-labeled RNA, finding many non-coding RNAs with short half-lives.
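Half-life estimation from such pulse-chase data assumes first-order decay: given the fraction of labeled transcript remaining after a chase time, the half-life follows directly. A sketch with hypothetical numbers:

```python
import math

def rna_half_life(t_hours, frac_remaining):
    """Half-life under first-order decay: N(t) = N0 * 2**(-t / t_half),
    so t_half = t * ln(2) / -ln(N(t)/N0)."""
    return t_hours * math.log(2) / -math.log(frac_remaining)

# If 25% of a labeled transcript remains after a 4 h chase:
t_half = rna_half_life(4, 0.25)  # -> 2.0 h
```

Real BRIC-seq analyses fit the decay curve over several chase time points per transcript rather than a single pair, which is more robust to measurement noise.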
4) Discovering and characterizing the non-coding RNA :
Tremendous progress has been made in characterizing regulatory non-coding RNAs recently. Unbiased
transcriptome analysis allowed the discovery of numerous previously unknown RNA transcripts.
Short non-coding RNAs usually include miRNA, siRNA, and piRNA, with lengths shorter than 35 nt.
Construction of a short non-coding RNA cDNA library starts with purification of 15–35 base RNAs on a
denaturing polyacrylamide gel, followed by ligation of 3′ and 5′ adapters, and finishes with reverse
transcription. At present, more than 20,000 miRNA genes have been cloned from ~200 species.
It is therefore increasingly important to identify each miRNA's target mRNAs in order to elucidate its
function. Two methods emerged based on the fact that miRNA interacts with, and cleaves, its target
mRNA through Argonaute protein. Argonaute cross-linking immunoprecipitation and sequencing
(CLIP-seq), designed to immunoprecipitate the Argonaute-RNA complex, allows sequencing of
Argonaute-associated RNAs, while degradome-seq, or parallel analysis of RNA ends (PARE), sequences
the 5′ ends of miRNA-directed cleavage products of target mRNAs. Together with bioinformatics
analysis, the miRNA:mRNA regulatory network can be reconstituted.
lncRNAs, especially antisense transcripts, were recovered by strand-specific RNA-seq. This modification
of RNA-seq preserves the strand information of the sequenced transcripts and therefore allows antisense
ncRNAs to be distinguished from sense coding transcripts. Strand-specific sequencing of polyA RNA or
the rRNA-depleted RNA fraction showed that the human genome can express more than 10,000
lncRNAs. Defining their differential expression in different tissues and/or under different physiological
conditions will help elucidate the functions of lncRNAs. Finding the regulatory targets of each lncRNA
will be a further challenge.
5) Checking the genome by RNA-seq :
A huge amount of transcriptome data has been generated from medical research, especially cancer
research. Besides comparing gene expression in normal and pathological conditions, changes in the
genome sequence, such as somatic mutations in diseased tissues (point mutations, insertions and
deletions), can also be learned. Gene fusions often indicate genomic rearrangements such as
translocation, deletion, and inversion, and gene fusions revealed by RNA-seq may provide extra
functional hints compared with those found by genome sequencing.
Although genome sequencing costs continue to decline, many non-model organisms of interest still lack
a reference genome sequence. Through paired-end, long-read RNA-seq and de novo assembly, both
quantification and transcript structure information can be obtained for an unsequenced genome. Such
transcriptome analysis also helps to annotate genomes that are being, or will be, sequenced.
Technology for transcriptome analysis :-
Techniques have evolved over almost 20 years, from the initial expressed sequence tag (EST) strategy to
gene chips, and now to RNA-seq. Transcriptome analysis has become cost effective, with higher
throughput, better sensitivity, and less starting RNA.
1) EST and microarray :
Sanger sequencing of EST or cDNA libraries provided information for genome annotation in the early
days of genome research. Due to limitations of throughput and cost, quantitative transcriptome analysis
was impossible with EST methods alone. With serial analysis of gene expression (SAGE) and CAGE,
multiple 3′ and 5′ cDNA ends, respectively, were concatenated into a single clone, so multiple sequence
tags could be recovered from one Sanger sequencing reaction; this overcame those limits and made
quantitative analysis possible. However, due to the high cost of Sanger sequencing and the difficulty of
mapping short (~20 bp) tags to the genome, CAGE and SAGE were soon replaced by DNA microarrays.
The DNA microarray or chip method is based on nucleic acid hybridization. Fluorescently labeled
cDNAs are incubated with oligonucleotide probes on the chip, and the abundance of each RNA is then
determined by measuring fluorescence intensity. High-density gene chips allowed relatively low-cost
gene expression profiling. Specific microarrays were designed according to the purpose of the
experiment, such as arrays to detect different isoforms arising from alternative splicing. In addition, the
genome tiling array is an unbiased design that requires no prior knowledge of genome transcription: a set
of overlapping oligonucleotide probes detects whole-genome expression at a resolution of up to a few
nucleotides. However, for large genomes, tiling arrays are expensive. Another limiting factor of
hybridization methodology is high background, because it cannot distinguish RNA molecules sharing
high sequence similarity.
2) RNA-seq :
Compared with Sanger sequencing, the core of NGS is massively parallel sequencing. Developments in
nanotechnology make it possible to sequence hundreds of thousands of DNA molecules simultaneously.
The prototype of NGS was massively parallel signature sequencing (MPSS), which applies four rounds
of restriction enzyme digestion and ligation reactions to determine the nucleotide sequence of cDNA
ends, generating a 17–20 bp sequence as the fingerprint of the corresponding RNA. MPSS was used to
digitize the transcriptome quantitatively, with the capacity to produce more than 100,000 signatures at a
time. However, due to the nature of the digestion and ligation reactions, a large fraction of the sequence
signatures obtained are not long enough to be unique fingerprints of RNA molecules.
Overcoming the limits of MPSS, Illumina, Roche, Life Technologies, and other companies developed
their own platforms with considerable improvements in throughput, read length, and sequencing
accuracy. Based on these platforms, RNA-seq became the most convenient and cost-effective tool for
transcriptome analysis. Briefly, all or part of the RNA transcripts (e.g., the polyA RNA or small RNA
fraction) are purified and reverse transcribed into cDNAs, which are subjected to massively parallel
sequencing. By analyzing millions to billions of 25–500 bp sequence tags, the transcriptome can be
studied qualitatively and quantitatively. In addition, RNA-seq is approximately unbiased, even without
prior knowledge of genomic information. Because of its single-base resolution, RNA-seq's background
noise is very low compared with hybridization-based technology, and its linear detection range spans
several orders of magnitude, at least one order higher than that of DNA chips.
3) Advances in RNA-seq :
In order to accurately reveal the transcriptome of complex biological tissues or precious samples, low-RNA-input
(even single-cell) RNA-seq techniques have been developed using direct RNA sequencing or RNA
amplification methods. Tang et al. reported the first single-cell transcriptome analysis, in which the authors
reverse-transcribed polyA RNA from a single mouse blastomere lysate. The cDNA was subjected to PCR
amplification with primers annealing to the anchoring sequences introduced during generation
of the 1st- and 2nd-strand cDNA. The STRT (single-cell tagged reverse transcription) and SMART-seq
(switching mechanism at the 5' end of the RNA transcript sequencing) methods took advantage of
the few extra cytosines added by MMLV reverse transcriptase to amplify the cDNAs transcribed from a
single cell. An in vitro transcription amplification strategy was applied in the more recent CEL-seq (cell
expression by linear amplification and sequencing), which introduces the T7 promoter into the 5' end of
the 1st-strand cDNA and directs a linear, less biased amplification. In addition to spike-in RNA
molecules, a barcoding strategy was implemented in single-cell RNA-seq to overcome the bias introduced
by cDNA amplification: each single RNA molecule is uniquely barcoded. Therefore,
multiple reads of a specific RNA molecule carrying the same barcode are treated as amplification
redundancy and counted only once.
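The barcode-counting logic above can be sketched in a few lines of Python. The gene names, barcodes, and read tuples here are purely hypothetical; real pipelines also correct for sequencing errors in the barcodes themselves:

```python
from collections import defaultdict

def count_molecules(reads):
    """Collapse amplification duplicates: reads sharing the same
    (gene, molecular barcode) pair derive from one original RNA
    molecule and are counted only once."""
    umis_per_gene = defaultdict(set)
    for gene, umi in reads:
        umis_per_gene[gene].add(umi)
    return {gene: len(umis) for gene, umis in umis_per_gene.items()}

# Hypothetical reads: geneA was amplified unevenly during cDNA PCR
reads = [("geneA", "ACGT"), ("geneA", "ACGT"), ("geneA", "TTGC"),
         ("geneB", "GGCA")]
print(count_molecules(reads))  # -> {'geneA': 2, 'geneB': 1}
```

Two reads of geneA carry the same barcode, so they collapse to a single molecule, removing the PCR amplification bias.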
Over the past few years, RNA-seq technology has been widely used in transcriptomics studies, not only
because of its advantages over previously prevalent methodologies but also because of its rapid
evolution. To serve different research purposes, the preparation of the cDNA library has been
modified into different forms, such as strand-specific RNA-seq. Combined with conventional molecular
biology and biochemistry methods (Table 1), RNA-seq has been applied to study different aspects of the
transcriptome, such as deepCAGE and CAP-seq to map TSSs, GRO-seq and PRO-seq to detect nascent
RNA, and small RNA-seq, degradome-seq and Argonaute CLIP-seq to characterize miRNA targets.
Bioinformatics analysis :-
It has become increasingly obvious that bioinformatics analysis is a significant part of transcriptome
research. The challenge it faces is almost as big as that of the experimental procedure, which
includes RNA purification, cDNA library construction, and high-throughput sequencing. The difficulty of
analysis comes not only from the massive amounts of data and the errors introduced by sequencing
experiments, but also from the enormous complexity of the transcriptome.
A typical RNA-seq data analysis can be summarized as follows :
• Perform quality control for raw RNA-seq data. Low-quality sequence tags produced from library
construction or sequencing process are trimmed away by the software provided by the
sequencing platform.
• For cases with a reference genome, map millions of short reads to the reference genome,
determine the position of each RNA transcript in the genome, calculate the expression level of
each transcript, and then find differentially expressed genes across the samples.
All of the above processes are carried out by corresponding pipelines, with representative tools including
Bowtie to map reads to the reference genome, TopHat to identify splice junctions, Cufflinks to test for
differential expression, and so on. For cases without a reference genome, de novo transcriptome
assembly is performed from the short RNA-seq reads, and all assembled contigs are then subjected to
functional annotation, which requires extensive computing resources.
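The expression-level calculation in the mapping-based workflow above is commonly reported as FPKM (fragments per kilobase of transcript per million mapped reads). A minimal sketch of the formula; the read counts and depths below are illustrative, not from any real dataset, and tools such as Cufflinks compute this internally:

```python
def fpkm(read_count, transcript_len_bp, total_mapped_reads):
    """FPKM normalizes a transcript's read count by transcript length
    (per kilobase) and by sequencing depth (per million mapped reads),
    so transcripts of different lengths from libraries of different
    depths become comparable."""
    return read_count * 1e9 / (transcript_len_bp * total_mapped_reads)

# 500 reads on a 2 kb transcript, out of 10 million mapped reads
print(fpkm(500, 2000, 10_000_000))  # -> 25.0
```

The 1e9 factor combines the per-kilobase (1e3) and per-million-reads (1e6) normalizations into one constant.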
1.8 FORWARD AND REVERSE GENETICS
Forward genetics starts with the identification of an interesting mutant phenotype and seeks to find the
genetic basis of that phenotype or trait. Automated DNA sequencing generates large volumes of
genomic sequence data relatively rapidly. The main objective of forward genetics is to discover the
function of the genes that are defective in mutants. Genomics approaches to forward genetics are :
• Insertional mutagenesis
• Genetic mapping
• Expression analysis using microarrays
• Candidate gene approach
• Exome sequencing
1) Insertional Mutagenesis :-
Insertional mutagenesis is mutagenesis of DNA by the insertion of one or more bases. Insertional
mutations can occur naturally, mediated by virus or transposon, or can be artificially created for
research purposes in the lab.
2) Genetic mapping :-
In genetic mapping, the locus of the gene responsible for the trait of interest is identified. The first step
in mapping studies is to find markers linked to the trait: physical linkage leads to co-inheritance of
markers, while recombination events break these associations. Appropriate mapping populations are
developed, the parents are screened for marker polymorphism, and the mapping population is genotyped.
Linkage analysis is then performed to find the recombination frequencies between markers,
which in turn lead to fine mapping of the location of the gene of interest.
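The linkage-analysis step above can be illustrated with a toy calculation. The progeny scores below are hypothetical, and real analyses use dedicated mapping software with LOD scores rather than raw counts:

```python
def recombination_frequency(genotypes):
    """Fraction of progeny in which the two marker alleles appear in a
    recombinant (non-parental) configuration. Here 'AB'/'ab' are the
    parental gamete types and 'Ab'/'aB' are recombinant."""
    recombinant = sum(1 for g in genotypes if g in ("Ab", "aB"))
    return recombinant / len(genotypes)

# Hypothetical testcross progeny for two linked markers
progeny = ["AB"] * 45 + ["ab"] * 43 + ["Ab"] * 6 + ["aB"] * 6
print(recombination_frequency(progeny))  # -> 0.12
```

A frequency of 0.12 corresponds to roughly 12 map units (cM) between the two markers; the smaller the frequency, the tighter the linkage.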
3) Expression analysis using Microarray :-
Microarray is a very promising technology for identifying genes in transcription-deficient mutants.
The approach of using phenotypically similar mutants minimizes the number of candidate genes for
sequencing by reducing the number of genes that are only secondarily affected by the mutation. Both cDNA
and Affymetrix microarray platforms can successfully pinpoint the genes that are up- or down-regulated
by induced or natural mutation events.
4) Candidate gene approach :-
This approach is appropriate for plants where mutant collections, represented by multiple independent
mutant alleles, are available. The major difficulty with this approach lies in choosing a potential
candidate gene for the mutation. The principal proof that a candidate gene is responsible for the
observed phenotype comes from comparative sequence analysis of all available mutant alleles at the
particular locus.
5) Exome sequencing :-
Exome sequencing is a powerful method for selectively sequencing the coding regions of the genome.
It can be combined with target-enrichment strategies, which make it possible to selectively
capture genomic regions of interest from a DNA sample prior to sequencing.
Fig 1. Forward and reverse genetics
Reverse genetics :-
Reverse genetics seeks to find what phenotypes arise as a result of particular genetic sequences. It
starts with a known gene, e.g., one identified by genomic sequencing, and alters its function by
transgenic technology. The major goal of reverse genetics is to determine gene function through
targeted modulation of gene activity.
Genomics approaches to reverse genetics :
• Gene silencing
• TILLING (Targeting Induced Local Lesions in Genomes)
1) Gene silencing :-
Gene silencing using double-stranded RNA is known as RNA interference (RNAi); together with gene
knockdown using Morpholino oligos, it has made it possible to disturb gene expression at will. RNAi
creates a specific knockdown effect without actually mutating the DNA of interest: it acts by directing
cellular systems to degrade the target messenger RNA (mRNA).
2) TILLING (Targeting Induced Local Lesions in Genomes) :-
TILLING is based on the use of a mismatch-specific endonuclease (CEL I), which finds mutations in a
target gene through heteroduplex formation. The technique involves PCR amplification of the target
gene using fluorescently labeled primers and formation of DNA heteroduplexes between wild-type and
mutant alleles (the PCR products corresponding to the mutant and wild-type alleles are heated and then
slowly cooled), followed by endonuclease digestion that specifically cleaves at the site of an EMS-induced
mismatch. One of the greatest benefits of the TILLING approach is that it does not involve the genetic
manipulations that result in Genetically Modified Organisms (GMOs), which are not legal for agricultural
applications in many countries. In EcoTILLING, genomic DNA is used to identify naturally occurring
polymorphisms with the same endonuclease (CEL I). These techniques are independent of genome size,
ploidy level and reproductive system of plants. They can be applied not only to model organisms but also to
economically important crops.
Fig 2. Schematic representation of TILLING and Eco-TILLING
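The heteroduplex logic behind TILLING can be sketched as a simple sequence comparison. The sequences below are hypothetical, and in the real assay the mismatch is revealed by CEL I cleavage and gel electrophoresis, not in software:

```python
def heteroduplex_mismatches(wild_type, mutant):
    """Positions at which a wild-type/mutant heteroduplex carries a
    mismatch -- the sites a mismatch-specific endonuclease such as
    CEL I would cleave, revealing the lesion as shorter fragments."""
    return [i for i, (w, m) in enumerate(zip(wild_type, mutant)) if w != m]

wt = "ATGGCGTTACCGA"
mu = "ATGGCATTACCGA"   # hypothetical EMS-induced G -> A transition
print(heteroduplex_mismatches(wt, mu))  # -> [5]
```

The cleavage position (here index 5) corresponds to the fragment sizes observed on the gel, which together pinpoint the induced lesion within the amplicon.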
1.9 MUTATION BREEDING
The utilization of induced mutations in crop improvement is called mutation breeding. In mutation
breeding, desirable mutations are induced in crop plants with the use of physical or chemical mutagens.
The variability generated through induced mutations is either released as a new variety or used as a
parent in subsequent hybridization programmes. Treating biological material with mutagens to
induce mutation is called mutagenesis. If radiation is used as the mutagen to induce
mutation in crop plants, the exposure of the biological organism to the radiation is called irradiation.
A mutation breeding programme should be clearly planned and should be large enough, with sufficient
facilities, to screen a large population.
Steps in mutation breeding :-
(1) Objectives of the programme :
• Mutation breeding should have well defined and clear cut objectives.
(2) Selection of the varieties for mutagen treatment :
• The variety selected should be the best variety available
(3) Part of the plant to be treated :
• Seeds, pollen, vegetative propagules, or sometimes the complete plant are treated with the mutagen.
• The selection of the plant part varies with the crop.
• Seeds are the best part in sexually reproducing plants.
• Seed treatment is, in effect, treatment of the embryo.
(4) Dose of mutagen :
• Mutagen treatment reduces the germination, growth rate, vigour and fertility of the organism.
• It also increases the frequency of chromosomal changes and of mitotic and meiotic
irregularities in the organism.
• All of these damages increase with the dose of mutagen and the duration of exposure.
• Thus, the dose should be optimized for a maximum success rate.
• The dose and treatment duration vary with the crop, the plant part and the type of mutagen used.
• The optimum dose is the dose at which the maximum frequency of mutation occurs with
minimum killing of the organism.
• The optimum dose of a mutagen is commonly expressed in terms of the LD50.
• LD50 : the dose of mutagen which kills 50% of the treated individuals.
• The LD50 varies with the crop plant and the type of mutagen used.
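The LD50 is estimated from a pilot kill curve. A minimal linear-interpolation sketch; the dose values are hypothetical, and real studies fit a probit or logit model to the survival data:

```python
def ld50(doses, survival):
    """Interpolate the dose at which survival drops to 50%, given
    increasing doses and the fraction of treated individuals that
    survived each dose."""
    points = list(zip(doses, survival))
    for (d1, s1), (d2, s2) in zip(points, points[1:]):
        if s1 >= 0.5 >= s2:
            # linear interpolation between the two bracketing doses
            return d1 + (s1 - 0.5) * (d2 - d1) / (s1 - s2)
    raise ValueError("50% survival is not bracketed by the data")

# Hypothetical gamma-ray kill curve (dose in Gy, fraction surviving)
print(ld50([100, 200, 300, 400], [0.9, 0.7, 0.4, 0.1]))  # ~266.7 Gy
```

Survival falls through 50% between 200 Gy (70%) and 300 Gy (40%), so the interpolated LD50 lands about two-thirds of the way through that interval.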
(5) Giving mutagen treatment :
• M1 : the generation produced directly from mutagen-treated plant parts.
• M2, M3 and M4 are the subsequent generations derived from M1, M2 and M3, respectively.
• M2, M3 and M4 are produced by selfing or clonal propagation.
(6) Handling mutagen treated population :
• Mutagen treatment of seeds and vegetative propagules produces chimeras.
• A mutation usually occurs in only a small sector of the plant part, such as part of a seed or meristem.
• One or more clonal or sexual generations with selection are necessary to obtain a stable mutant
phenotype.
• Mutant alleles are generally recessive. Dominant mutations do occur; however, the chance of a
dominant mutation is very low.
• In sexually reproducing plants, both dominant and recessive mutations are useful.
• In clonally propagated plants, however, the dominant mutations are the beneficial ones.
Application / Advantages of induced mutations in crop improvements :-
• Mutation breeding can be used for both oligogenic and polygenic traits in plants.
• It improves morphological and physiological characters of cultivated crops.
• Mutation breeding can improve the disease resistance of crop plants.
• Induced mutations can induce desirable mutant alleles in crop plants.
• Mutation breeding can be used to improve the specific characters of a well-adapted high
yielding variety.
• Quantitative characters of crop plants, including yield, can be improved by induced
mutations.
• F1 hybrids obtained from an intervarietal cross can be treated with a mutagen to increase
variability.
• Mutation breeding can be effective in eliminating an undesirable character from a crop variety.
Limitations / Disadvantages of Mutation Breeding :-
• The frequency of desirable mutations is very low (about 0.1% of total mutations).
• The breeder has to screen a large population to select a desirable mutation.
• Desirable mutations are commonly associated with undesirable side effects.
• Mutations often produce pleiotropic effects.
• Mutations in quantitative traits are usually in a direction away from the selection history of the
parent variety.
• There may be problems in registering mutant varieties in many parts of the world.
• Most mutations are recessive, and their effects are not expressed due to the dominance of
the allelic counterpart.
Achievements of Mutation Breeding :-
• A large number of crop varieties have been produced by mutation breeding all over the world.
• A brief list is given below :
➢ Cereals : 350 varieties
➢ Legumes : 62 varieties
➢ Fruits : 40 varieties
➢ Ornamentals : 462 varieties
• Among seed plants :
➢ Rice : 278 varieties
➢ Barley : 229 varieties
➢ Wheat : 113 varieties
• China has produced : 281 varieties (Top position)
• India has produced : 116 varieties (Second position)
• USSR has produced : 82 varieties (Third position)
• Japan has produced : 65 varieties (Fourth position)
Mutation breeding in India :-
• Up to 1990, 219 mutant varieties of crop plants had been produced in India.
• Of these, 116 are seed-propagated and 103 vegetatively propagated plants.
• The number of varieties of crop plants produced by mutation breeding in India are given below :
➢ Rice : 24 varieties
➢ Barley : 12 varieties
➢ Cotton : 8 varieties
➢ Ground nut : 8 varieties
• Crop varieties produced in India by Mutation Breeding :
➢ Rice : Jagannath
➢ Wheat : NP836
➢ Sugar cane : Co 8152, Co 8153
➢ Cotton : Indore 2
➢ Jute : JRO 514, JRO 412
• In rice, Jagannath is a gamma-ray-induced semi-dwarf mutant of the tall cultivar T141.
• Jagannath has improved resistance to lodging, higher yield and better response to fertilizers than
its parent.
• In wheat, NP836 is an awned mutant of the awnless variety NP799.
• Sugarcane Co8152 is a gamma induced mutant from Co527.
• Co8152 has 40% more yield than the parent.
MUTAGENESIS :-
Mutagenesis is defined as a stable change in the genetic information of an organism brought about by
physical or chemical mutagens. The approach was pioneered by Charlotte Auerbach, the first woman
scientist to study the effects of chemical mutagens.
Mutagenesis immediately found use as a genetic tool for inducing mutations in specific ways, which in
turn can be used to determine the phenotype of the organism and the function of genes, down to
individual nucleotides.
There are different types of Mutagenesis like :
1) Directed Mutagenesis
2) Random Mutagenesis
3) Insertional Mutagenesis
• Signature Tagged Mutagenesis
• Virus insertional Mutagenesis
4) PCR Mutagenesis
• Site Directed Mutagenesis
• Mismatched Mutagenesis
• 5’Add On Mutagenesis
• Cassette Mutagenesis
1) Directed Mutagenesis :-
Directed mutagenesis is defined as a deliberate change in the amino acid coding at the DNA level. A
characterized 3D structure of the protein, obtained by X-ray crystallography or other analytical
procedures, helps in determining which amino acids of the protein should be changed to attain a specific
property. However, this information is not available for most proteins, and hence a trial-and-error
strategy is used to make changes in the nucleotides that yield a particular change in the protein. The
encoded protein is then tested for the desired change.
Different types of Directed Mutagenesis are :
• Oligonucleotide Directed Mutagenesis With M13 DNA
• Oligonucleotide directed Mutagenesis with Plasmid DNA.
• PCR Amplified oligonucleotide Directed Mutagenesis
Advantages :
• The mutation rate is high.
• Any desired mutation can be induced.
• Systematic and detailed investigation of the targeted mutation can be made.
2) Random Mutagenesis :-
Random mutagenesis is also known as directed evolution or molecular breeding. In this approach the
gene is non-specifically changed at one or more codons to produce a mixture of mutated genes, which
can then be selected and screened for genes with the desired catalytic activity. The oligonucleotide primer
is a heterogeneous set of DNA sequences, producing a series of mutations in a defined portion of the
target gene.
Advantages :
• Knowledge of the role of each amino acid in the working of the protein is not required.
• Since a range of mutants is produced in this process, some interesting and useful proteins may
be generated.
3) Insertional Mutagenesis :-
Insertional mutagenesis is produced by the insertion of one or more bases, and can arise :
• naturally,
• mediated by a virus or transposon (signature-tagged mutagenesis), or
• artificially in the lab for research purposes.
a) Signature-Tagged Mutagenesis : used to study the function of genes using transposons. A
transposon such as the Drosophila melanogaster P element is made to integrate randomly into the genome of
an organism. The mutants are then screened for any unusual phenotype. If such a phenotype is found,
it suggests that the transposon has inactivated the gene related to that phenotype. Since
the sequence of the transposon is known, the whole genome can be sequenced to identify the gene, or PCR
can be used to amplify the specific gene.
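Because the transposon sequence is known, locating the disrupted gene reduces to finding the transposon end in the genomic sequence and reading off the flanking host DNA. A toy sketch with made-up sequences (real pipelines align sequencing reads rather than use exact string search):

```python
def insertion_flank(genome, transposon_end, flank=10):
    """Find the known transposon end in the genome sequence and return
    the host DNA immediately upstream of the insertion, which
    identifies the disrupted locus."""
    i = genome.find(transposon_end)
    if i == -1:
        return None          # transposon end not present
    return genome[max(0, i - flank): i]

# Hypothetical genome with a transposon end "CATGATGAAA" inserted
genome = "GGCTTACCGGAATTC" + "CATGATGAAA" + "TTAGGC"
print(insertion_flank(genome, "CATGATGAAA"))  # -> "ACCGGAATTC"
```

The recovered flanking sequence is what a PCR primer or a genome-database search would use to name the inactivated gene.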
b) Virus Insertional Mutagenesis : caused by a replication-competent virus. The virus integrates near a
cellular gene such as myc, which is generally turned off in a cell; when turned on, it pushes the
cell into G1 of the cell cycle, which also allows the viral genes to be replicated. Latent tumors are
produced after many rounds of replication at the site of viral integration. E.g., avian leukosis virus causes
disease by insertional mutagenesis.
Advantages :
• Can produce easily tractable Mutations.
• Can produce large number of mutants at low cost and high speed
4) PCR Mutagenesis :-
PCR mutagenesis is used to change the nucleotide sequence of DNA. The method can be used to alter
amino acids to test the function of domains in a protein, and to assess the function of a promoter.
Different ways of introducing mutations in PCR are :
a) Site-Directed Mutagenesis : as the name suggests, this introduces a mutation at a specific location on the
DNA strand. The primers used are altered at one or more nucleotides.
b) Mismatched Mutagenesis : focuses on a single amino acid; it is useful for checking a
missense mutation in a known disease gene, or for determining the function of a particular amino acid
in the protein.
c) 5' Add-On Mutagenesis : produced by adding a new sequence or chemical group to the 5' end of the
PCR product. The primers are designed so that the 3' end matches the sequence of the PCR product,
while the 5' end carries the novel addition, for example :
• a suitable restriction site,
• a functional sequence (e.g., a promoter sequence), or
• a modified nucleotide carrying a labeled group, such as biotin or a fluorophore.
d) Cassette Mutagenesis : used to introduce multiple mutations into a DNA sequence. Using blunt-ended
DNA at the site of mutation, a 3-base-pair direct terminal repeat is created. The mutagenic codon
cassette has head-to-head SapI sites, allowing removal of all of the DNA except the mutated sequence.
Advantages :
• PCR-based methods allow specific mutations to be studied in DNA, which in turn illuminates
different aspects of protein function.
Limitations :
• Mutated DNA is hard to replicate, as the competent cells required are expensive.
• Screening is tedious.
• Requires sequencing to confirm mutation.
• Specific primers are required.
SITE DIRECTED MUTAGENESIS :-
The first site-directed mutagenesis (SDM) experiment was performed in 1974 in the laboratory
of Charles Weissmann. The induced mutation was a GC-to-AT change; however, it was inserted at a
random position.
Before Weissmann's study, chemical analogs and radiation were used as mutagens to create alterations
in DNA sequences; however, those changes were random and not site-specific.
In 1978, Michael Smith and Clyde Hutchison performed site-directed mutagenesis using the primer
extension method, achieving artificial mutagenesis at a specific location for the very first time in
the history of genetics.
Meaning and Definition :-
• Site-directed mutagenesis : Site – a specific location in the genome; Directed – performed by
artificial techniques; Mutagenesis – introduction of a mutation.
• "Site-directed mutagenesis is an artificial (in vitro) technique for introducing a mutation or
alteration into a target (known) DNA sequence."
• In simpler words: by using artificial techniques such as PCR, a mutation or alteration can be
inserted into the target DNA sequence of interest.
• Because the mutation is intentionally created at a specific location in an oligonucleotide
sequence, the method is also called oligonucleotide-specific mutagenesis or site-specific
mutagenesis. The purpose of introducing a mutation is to study the function of a gene.
Importance of SDM :-
• Site-directed mutagenesis can be used to remove restriction sites. Restriction digestion is a
process in which DNA carrying the recognition site for a particular restriction endonuclease is
cleaved into fragments.
• After mutagenesis, the modified sequence is transferred into a plasmid, and restriction digestion is
performed again to validate whether the recognition site has been removed.
• If the DNA of interest still contained the restriction site, the restriction endonuclease would cleave
the target DNA as well as the rest of the plasmid DNA, indicating experimental failure.
• SDM is also used to analyze the structure and function of proteins. The method is powerful
enough to probe the function of the protein encoded by a gene.
• For example, suppose a gene encodes a wild-type protein required in a metabolic
pathway of an organism. By mutating this protein-coding gene, the effect of the absence of
that protein on the metabolic pathway can be studied.
• Also, one important function of site-directed mutagenesis is in the study of gene editing or gene
manipulation. Through gene manipulation, the activity of a particular protein can be determined
as well.
• Besides these important applications, the method is often practiced to study protein-protein
interaction, enzyme alteration, and enzyme silencing studies.
Principle of SDM :-
❖ The basic principle of site-directed mutagenesis is simple: DNA primers carrying the desired
mutation are synthesized artificially and used to amplify the gene of interest in a polymerase
chain reaction.
❖ A high-fidelity DNA polymerase extends the growing DNA strand, incorporating the new mutation.
The gene of interest is then transferred into a plasmid for further downstream research.
❖ Mutations such as single-nucleotide changes (SNPs) and deletions or insertions of a few
nucleotides are commonly introduced by SDM. In the final step, DNA sequencing screens for the
presence or absence of the mutation in the DNA.
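The primer-design principle can be sketched as follows. The gene sequence, mutation position, and flank length are arbitrary examples; a real design would also check melting temperature and GC content:

```python
def mutagenic_primer(template, position, new_base, flank=10):
    """Design a forward mutagenic primer: the substituted base sits in
    the middle of the primer, flanked on both sides by sequence that
    matches the template exactly so the primer still anneals."""
    primer = list(template[position - flank: position + flank + 1])
    primer[flank] = new_base.lower()   # lowercase marks the mutation
    return "".join(primer)

gene = "ATGGCTAGCTTACGGATCCGTACGTAGCTAGGCTA"  # hypothetical target
print(mutagenic_primer(gene, 17, "A"))  # -> GCTTACGGATaCGTACGTAGC
```

Placing the mismatch centrally, as in the principle above, leaves ten perfectly matched bases on either side to anchor annealing during the PCR.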
Basic Mechanism of SDM :-
Step 1 : Plasmid Preparation
• The gene is carried in a plasmid, with the target site for mutation identified.
Step 2 : Temperature Cycling
• Denature the plasmid and anneal the oligonucleotide primers containing the desired mutation.
• Using the non-strand-displacing action of PfuTurbo DNA polymerase, extend and incorporate
the mutagenic primers, resulting in nicked circular strands.
Step 3 : Digestion
• Digest the methylated, nonmutated parental DNA template with Dpn I
Step 4 : Transformation
• Transform the circular, nicked dsDNA into super-competent cells
• After transformation the supercompetent cells repair the nicks in the mutated plasmid
Different Techniques of SDM :-
1) Oligonucleotide-Directed Mutagenesis with M13 DNA
2) Oligonucleotide-Directed Mutagenesis with Plasmid DNA
3) PCR based mutagenesis
• PCR-Amplified Oligonucleotide-Directed Mutagenesis
• Error-Prone PCR
• Random Mutagenesis with Degenerate Oligonucleotide Primers
4) Random Insertion/Deletion Mutagenesis
5) DNA Shuffling
6) Mutant Proteins with Unusual Amino Acids
A) Oligonucleotide-directed mutagenesis with M13 DNA :-
1) Single-stranded bacteriophage M13 DNA (the M13 + strand), carrying a cloned gene, is taken.
2) It is annealed with a complementary synthetic oligonucleotide containing one mismatched base.
3) With the oligonucleotide as the primer, DNA synthesis is catalyzed by the Klenow fragment of E.
coli DNA polymerase I.
4) Synthesis continues until the entire strand is copied. The newly synthesized DNA strand is
circularized by T4 DNA ligase.
5) The ligation reaction mixture is used to transform E. coli.
6) Both the target DNA with its original sequence and the mutated sequence are present in the
progeny M13 phage.
7) After the double-stranded form of M13 is isolated, the mutated gene is excised by digestion
with restriction enzymes and then spliced onto an E. coli plasmid expression vector.
8) For further study, the altered protein is expressed in and purified from the E. coli cells.
As per the diagram, only 50% of the progeny M13 phage are expected to carry the mutated form of
the target.
In practice, only around 1% of the plaques actually contain phage carrying the mutated gene.
Consequently, the oligonucleotide-directed mutagenesis method has been modified in several ways to
enrich for the mutant phage plaques that can be obtained.
Modification :-
Introduce the M13 viral vector carrying the gene that is to be mutagenized into an E. coli strain that has
two defective enzymes of DNA metabolism.
1) The target DNA is cloned into the double-stranded replicative form of bacteriophage M13,
which is then used to transform a dut ung strain of E. coli.
2) The dut mutation causes the intracellular level of dUTP to be elevated; the high level of this
nucleotide leads to the incorporation of occasional uracil residues (U) into the newly synthesized
DNA. The ung mutation prevents the removal of any incorporated uracil residues.
3) Following in vitro oligonucleotide-directed mutagenesis, the double-stranded M13 vector with
the mutated DNA is introduced into wild-type E. coli.
4) The wild-type ung gene product removes any uracil residues from the parental strand, so a
significant portion of the parental strand is degraded.
5) The mutated strand remains intact because it does not contain uracil. It serves as a template for
DNA replication, thereby enriching the yield of M13 bacteriophage carrying the mutated gene.
Drawbacks of using the M13 DNA for carrying out SDM :-
• There is a need to subclone a target gene from a plasmid into M13 and then, after mutagenesis,
clone it back into a plasmid.
• An additional step of transforming enzyme-defective E. coli is necessary to enrich the yield.
• Lengthy process involving multiple steps.
B) Oligonucleotide-Directed Mutagenesis with Plasmid DNA :-
1) The target DNA is inserted into the multiple cloning site (MCS) of the vector pALTER.
2) Plasmid DNA is isolated from E. coli cells and alkaline-denatured.
3) The mutagenic oligonucleotide, the ampicillin resistance (Ampr) oligonucleotide, and the
tetracycline sensitivity (Tets) oligonucleotide are annealed.
4) The oligonucleotides act as primers for DNA synthesis by T4 DNA polymerase, with the original
strand as the template.
5) The gaps between the synthesized pieces of DNA are sealed by T4 DNA ligase.
6) The reaction mixture is used to transform E. coli host cells, and cells that are Ampr and Tets are
selected.
• With this procedure, about 90% of the selected transformants have the specified mutation in
the target gene.
• In the remaining transformants, the target gene is unchanged because the oligonucleotide did
not anneal to the target gene or it was bypassed during DNA synthesis.
• The cells with the specified mutation in the target gene are identified by DNA hybridization.
• All of the plasmids, host bacterial strains, enzymes, oligonucleotides (other than the one needed
to alter the target gene), and buffers for this method are sold as a kit, facilitating its widespread
use.
C) PCR based mutagenesis :-
(i) PCR-Amplified Oligonucleotide-Directed Mutagenesis :-
• For this method of mutagenesis, no special plasmid vectors are required; any plasmid up to
approximately 10 kb in length is acceptable.
• For point mutations, the nucleotide changes are introduced in the middle of the primer
sequence.
• To create deletion mutations, primers must border the region of target DNA to be deleted on
both sides and be perfectly matched to their annealing (or template) sequences.
• To create mutations with long insertions, a stretch of mismatched nucleotides is added to the 5′
end of one or both primers, while for mutations with short insertions, a stretch of nucleotides is
designed in the middle of one of the primers.
Overview of the basic methodology to introduce point mutations, insertions, or deletions
into DNA cloned into a plasmid.
In all of these procedures, the only absolute requirements are that :
1) the nucleotide sequence of the target DNA must be known.
2) the 5′ ends of the primers must be phosphorylated.
Following PCR amplification, the linear DNA is circularized by ligation with T4 DNA ligase.
The circularized plasmid DNA is then used to transform E. coli by any standard procedure.
This protocol yields a very high frequency of plasmids with the desired mutation. Screening three or
four clones by sequencing the target DNA should be sufficient to find the desired mutation.
(ii) Error-Prone PCR :-
• This is a very powerful method for random mutagenesis useful for the construction of a library
of mutants.
• With DNA up to 10 kb in size, it is possible to vary the number of alterations per gene from
about 1 to about 20 by modifying the DNA template concentration.
• When error-prone PCR is performed using Taq DNA polymerase, which lacks proofreading
activity, the error rate may be increased :
a) by adding Mn2+ or by increasing the concentration of Mg2+,
b) by adding unequal amounts of the four deoxynucleoside triphosphates to the reaction
buffer, or
c) by using other thermostable DNA polymerases that are error-prone even in the absence
of Mn2+ and with balanced amounts of the four deoxynucleoside triphosphates.
• Following error-prone PCR, the randomly mutagenized DNA is cloned into expression vectors
and screened for altered or improved protein activity.
• The DNA from those clones that encode the desired activity is isolated and sequenced so that
the relevant changes to the target DNA may be elaborated.
• Error-prone PCR has been used to create enzymes with improved solvent and temperature
stability and with enhanced specific activity.
Drawback :-
Since errors are typically introduced into DNA at no more than one or two per 1,000 nucleotides, only
single nucleotides are replaced within a triplet codon, yielding only a limited number of amino acid
changes.
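The substitution process described above can be imitated in a few lines. This is a toy model of error-prone PCR, not a simulation of polymerase chemistry; the random gene sequence, error rate, and seed are arbitrary illustrations:

```python
import random

BASES = "ACGT"

def error_prone_copy(template: str, error_rate: float,
                     rng: random.Random) -> str:
    """Copy a DNA template, substituting a random different base at each
    position with probability `error_rate` — a toy model of low-fidelity PCR."""
    return "".join(
        rng.choice([b for b in BASES if b != base])
        if rng.random() < error_rate else base
        for base in template
    )

# Arbitrary 1,000-nt "gene"; ~1-2 substitutions per 1,000 nt, as in the text.
rng = random.Random(42)
gene = "".join(rng.choice(BASES) for _ in range(1000))
mutant = error_prone_copy(gene, 0.0015, rng)
print(sum(a != b for a, b in zip(gene, mutant)))  # number of substitutions
```

Because only single-base substitutions are made, each codon usually changes by one nucleotide at most — which is exactly the drawback noted above.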
(iii) Random Mutagenesis with Degenerate Oligonucleotide Primers :-
• Investigators seldom know which specific nucleotide changes need to be introduced into a
cloned gene to modify the properties of the target protein.
• Consequently, they must use methods that generate all the possible amino acid changes at one
particular site.
• For example, oligonucleotide primers can be synthesized with any of the four nucleotides at
defined positions. This pattern of sequence degeneracy is generally achieved by programming
an automated DNA synthesis reaction to add a low level (usually a few percent) of each of the
three alternative nucleotides each time a particular nucleotide is added to the chain.
In this way, the oligonucleotide primer preparation contains a heterogeneous set of DNA sequences that
will generate a series of mutations that are clustered in a defined portion of the target gene.
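The doping scheme described above can be sketched as follows. The primer sequence, doping level, and choice of degenerate positions are illustrative assumptions:

```python
import random

REFERENCE = "ATGGCTAAAGAAGGT"   # hypothetical 15-nt primer sequence

def doped_primer(reference: str, doping: float,
                 degenerate_positions: set, rng: random.Random) -> str:
    """Synthesize one primer: at each degenerate position, one of the three
    alternative bases is incorporated with probability `doping` (a few
    percent); otherwise the reference base is added to the chain."""
    out = []
    for i, base in enumerate(reference):
        if i in degenerate_positions and rng.random() < doping:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

# Dope positions 6-8 at 5% per position and inspect the variant pool.
rng = random.Random(0)
pool = [doped_primer(REFERENCE, 0.05, {6, 7, 8}, rng) for _ in range(1000)]
variants = {p for p in pool if p != REFERENCE}
print(len(variants))
```

The pool is mostly wild-type primer with a minority of variants clustered at the doped positions — the heterogeneous primer preparation the text describes.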
Mechanism :-
1) A target gene is inserted into a plasmid between two unique restriction endonuclease sites.
2) The left and right portions of the target DNA are amplified separately by PCR.
3) The amplified fragments are purified, denatured to make them single stranded, and then
reannealed.
4) Regions of overlap form between the complementary, mutation-carrying
oligonucleotides. The single-stranded regions are made double stranded with DNA polymerase,
and then the entire fragment is amplified by PCR.
5) The resultant product is digested with restriction endonucleases A and B and then cloned into a
vector that has been digested with the same enzymes.
This approach has two advantages.
1) Detailed information regarding the roles of particular amino acid residues in the functioning of
the protein is not required.
2) Unexpected mutants encoding proteins with a range of interesting and useful properties may be
generated because the introduced changes are not limited to one amino acid.
D) Random Insertion/Deletion Mutagenesis :-
• As an alternative to error-prone PCR, researchers have developed the technique of random
insertion/deletion mutagenesis.
• With this approach, it is possible to delete a small number of nucleotides at random positions
along the gene and, at the same time, insert either specific or random sequences into that
position.
1) An isolated gene fragment with different restriction endonuclease sites at each end is ligated at
one end to a short nonphosphorylated linker that leaves a small gap in the DNA. The gap is a
consequence of the fact that the 5′ nucleotide from the linker is not phosphorylated and
therefore cannot be ligated to an adjacent 3′-OH group.
2) After restriction enzyme digestion that creates compatible sticky ends, the gene fragment is
cyclized with T4 DNA ligase to create a circular double-stranded gene fragment with a nick in the
antisense strand.
3) The nicked strand is degraded by digestion with the enzyme T4 DNA polymerase (which has
exonuclease activity).
4) The single-stranded DNA is randomly cleaved at single positions by treating it with a cerium(IV)–
ethylenediaminetetraacetic acid (EDTA) complex.
5) The linear single-stranded DNAs are ligated to a linker (containing several additional nucleotides
selected for insertion at one end), and the entire mutagenesis library is PCR amplified.
6) The linkers are removed by restriction enzyme digestion.
7) The constructs are made blunt ended by filling in the single stranded overhangs using the
Klenow fragment of E. coli DNA polymerase I and then cyclized again by T4 DNA ligase.
8) The amplified products are digested with appropriate restriction enzymes, cloned into a plasmid
vector, and then tested for activity.
With this approach, it is possible to insert any small DNA fragment (carried on a linker) into the
randomly cleaved single-stranded DNA, with the result that a much greater number of modified genes
may be generated than by error-prone PCR.
The mutations that are developed by this procedure may be used to select protein variants with a wide
range of activities.
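The net outcome of the steps above — a few bases deleted at a random position, with a chosen linker sequence inserted in their place — can be sketched as a toy model. The gene and linker sequences are arbitrary placeholders:

```python
import random

def random_indel(gene: str, n_delete: int, insert_seq: str,
                 rng: random.Random) -> str:
    """Delete n_delete bases at a random position and place insert_seq in
    their stead — the net outcome of one round of the procedure above."""
    pos = rng.randrange(len(gene) - n_delete + 1)
    return gene[:pos] + insert_seq + gene[pos + n_delete:]

# Arbitrary gene; deleting 3 nt and inserting a 3-nt linker keeps the length
# (and reading frame) unchanged, though the method allows them to differ.
rng = random.Random(7)
gene = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"
library = [random_indel(gene, 3, "CAT", rng) for _ in range(10)]
print(len(library), len(library[0]))  # → 10 36
```

Repeating the call builds a library in which the insertion lands at a different random position in each clone, mirroring the random cleavage step.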
E) DNA Shuffling :-
• DNA shuffling is a method for in vitro recombination of homologous genes invented by
W. P. C. Stemmer.
• It is done in the hope that some of the hybrid proteins will have unique properties or activities
that were not encoded in any of the original sequences.
• Also, some of the hybrid proteins may combine important attributes of two or more of the
original proteins, e.g., high activity and thermostability.
Method 1 :-
• The genes to be recombined are randomly fragmented by DNaseI, and fragments of the desired
size are purified from an agarose gel.
• These fragments are then reassembled using cycles of denaturation, annealing, and extension
by a polymerase.
• Recombination occurs when fragments from different parents anneal at a region of high
sequence identity.
• Following this reassembly reaction, PCR amplification with primers is used to generate full-
length chimeras suitable for cloning into an expression vector.
Method 2 :-
• The simplest way to shuffle portions of similar genes is
through the use of common restriction enzyme sites.
• Digestion of two or more of the DNAs that encode the
native forms of similar proteins with one or more
restriction enzymes that cut the DNAs in the same
place, followed by ligation of the mixture of DNA
fragments, can potentially generate a large number of
hybrids.
Method 3 :-
Another way to shuffle DNA involves -
• Combining several members of a gene family, fragmenting the mixed DNA with
deoxyribonuclease I (DNase I), selecting smaller DNA fragments
• Followed by PCR amplifying these fragments.
• During PCR, gene fragments from different members of a gene family cross-prime each other
after DNA fragments bind to one another in regions of high homology/complementarity.
• The final full-length products are obtained by including “terminal primers” in the PCR. After 20
to 30 PCR cycles, a panel of hybrid (full-length) DNAs will be established.
• The hybrid DNAs are then used to create a library that can be screened for the desired activity.
Although DNA shuffling works well with gene families—it is sometimes called molecular breeding—or
with genes from different families that nevertheless have a high degree of homology, the technique is
not especially useful when proteins have little or no homology.
Thus, the DNAs must be very similar to one another or the PCR will not proceed.
To remedy this situation and combine the genes of dissimilar proteins, several variations of the DNA
shuffling protocol have been developed.
Mechanism of DNA Shuffling :-
1) Different DNAs (shown in different colors) are mixed together, partially digested with DNase I,
and blunt ended by digestion with T4 DNA polymerase.
2) They are then size fractionated.
3) The fragments are ligated with synthetic hairpin DNAs to form extended hairpins.
4) Restriction enzymes digest and remove the hairpin ends to generate sticky ends, and the
fragments are then ligated into plasmid vectors.
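The outcome of shuffling — chimeric genes stitched together from two homologous parents — can be sketched as random crossovers between aligned sequences. This is a toy model of the product, not of the annealing mechanism itself; the parent sequences are arbitrary:

```python
import random

def shuffle_two(parent_a: str, parent_b: str, n_crossovers: int,
                rng: random.Random) -> str:
    """Build a chimera by switching template at random crossover points — a
    toy model of fragments from two homologous parents reassembling into a
    single full-length gene."""
    assert len(parent_a) == len(parent_b)
    points = sorted(rng.sample(range(1, len(parent_a)), n_crossovers))
    chimera, template, prev = [], parent_a, 0
    for p in points + [len(parent_a)]:
        chimera.append(template[prev:p])
        template = parent_b if template is parent_a else parent_a
        prev = p
    return "".join(chimera)

# Arbitrary pair of similar "parent" genes.
rng = random.Random(3)
parent_a = "ATGGCTAAAGAAGGTACCTTGAGC"
parent_b = "ATGGCAAAAGAGGGTACGTTGAGT"
print(shuffle_two(parent_a, parent_b, 2, rng))
```

Running this over a whole library of parent pairs yields a panel of hybrids analogous to the full-length chimeras recovered after 20 to 30 PCR cycles.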
F) Mutant Proteins with Unusual Amino Acids :-
Schematic representation of the production of a protein with a modified (nonstandard amino acid) side
chain. The start codon is highlighted in green, and the stop codons are in red. The inserted amino acid
analogue is shown in blue.
• Any protein can be altered by substituting one amino acid for another using directed
mutagenesis.
• There are only 20 amino acids that are normally used in protein synthesis; hence, this becomes a
limitation in generating mutant diversity.
• One way to increase the diversity of the proteins formed after mutagenesis is to introduce
synthetic amino acids with unique side chains at specific sites. To do this, E. coli was engineered
to produce both a novel transfer RNA (tRNA) that is not recognized by any of the existing E. coli
aminoacyl-tRNA synthetases but nevertheless functions in translation, and a new aminoacyl-tRNA
synthetase that aminoacylates only that novel tRNA.
Role of site-directed mutagenesis in CRISPR-Cas9 :-
• In modern times, various methods and ready-to-use kits for site-directed mutagenesis are
available. However, one of the most important merits of site-directed mutagenesis is in
gene editing, especially with CRISPR-Cas9.
• Any point mutation can be introduced in vivo into the genome of a model organism with the
help of the CRISPR-Cas9 system.
• In CRISPR-Cas9, Cas9 is the nuclease that cleaves the DNA. Once it
induces a double-stranded break, the mutation is introduced through homology-directed
repair.
Applications of Site-directed mutagenesis :-
1) Site-directed mutagenesis helps to improve the quality of a protein by removing harmful
elements from it. Such genetically modified proteins have high commercial value.
2) The tool is adopted in the study of gene characteristics. Gene properties, the protein encoded
by the gene, and post-translational modifications are studied.
3) The method is a first choice in gene synthesis and gene-editing technology.
4) It is used in cloning as well.
5) By mutating the promoter or the regulatory regions of a gene, one can construct a map of the
regulatory elements of that gene.
6) It is also useful in the screening of SNPs.
1.10 TRANSPOSON TAGGING
Gene tagging refers to inserting a known DNA sequence into, or into the vicinity of, a gene of
interest so that the gene can be located with the help of the inserted sequence.
Types of Gene Tagging :-
• Marker based gene tagging
• T-DNA tagging
• Transposon tagging
• Epitope tagging
Transposon :-
These are discrete sequences in the genome that are mobile and are able to transport themselves within
the genome. There are two main categories of transposable elements :
1) Class I or Retroelements :
• Copy and paste mechanism
• Move via an RNA intermediate
• Used for tagging in mammals and yeast
• E.g. – Ty1, Copia, Gypsy, LINEs, SINEs
2) Class II or DNA-type Elements :
• Cut and paste mechanism
• Used for tagging in plants, bacteria, Drosophila and C. elegans
• E.g. – IS elements, Ac/Ds elements, Sleeping Beauty, Mu elements, En/Spm elements.
Transposon Tagging :-
• Transposon tagging describes the isolation of novel genes using transposable elements as tags.
• The transposon sequence is used to identify the flanking sequences after insertional
mutagenesis.
• The strategy was initially developed to clone the Drosophila white locus (Bingham et al., 1981).
• In plants, this strategy was first used to identify and clone the bronze1 (bz1) locus in maize
(Fedoroff et al., 1984).
Transposons used in Tagging :-
Plants :
• Ac/Ds (Activator/Dissociation) – Maize (Zea mays)
• En/Spm (Enhancer/Suppressor- mutator) – Maize (Zea mays)
• Mu (Mutator) – Maize (Zea mays)
• Tam3 – Snapdragon (Antirrhinum majus)
Animals :
• Sleeping Beauty – Salmon (Salmo salar)
• Tol2 – Medaka (Oryzias latipes)
Approaches of Transposon Tagging :-
• Targeted gene tagging : Tagging of a specific gene for which the mutant is already available.
• Non Targeted gene tagging : Tagging any random gene(s) and then studying the resultant
mutant.
1) Directed Gene Tagging :
• Directed tagging identifies transposon-induced alleles by crossing transposon-active plants
with a reference homozygous mutant.
• The mutable alleles are separated from the reference allele by crossing to a standard
line (hybrid, inbred or tester).
• To identify the co-segregating transposon, the mutable allele is backcrossed to the standard line,
and the backcrossed progeny is selfed.
2) Non-directed Gene Tagging :
• In this technique one has to screen an M2 population, as compared with the F1 in targeted gene
tagging.
• The main advantage of this technique is that it can also be used to study lethal or infertile
mutants, ultimately identifying the gene responsible, whereas targeted tagging is restricted to
mutants in genes that are not essential for the plant to complete its life cycle.
• Transposon-active stocks are crossed with a standard line and the resulting progeny is self-
pollinated.
• The self-pollinated progeny is screened for recessive mutants, and segregating populations are
generated by crossing with the standard line to study co-segregation of the transposon with
the mutant phenotype.
• The most common method for co-segregation analysis is the amplification of insertion
mutagenized sites (AIMS) protocol (Frey et al., 1998).
Identification of Transposon Tagging :-
1) RFLP followed by Southern Hybridization :
• The restriction enzyme employed should be one that does not cut within the transposon used.
• Gel electrophoresis
• Analysis by Southern blotting.
• Probes directed against the transposon are used.
Results :
• Heterologous system : band present in mutant and absent in wild type
• Endogenous system : bands found in both mutant and wild type.
2) Inverse-PCR :
• Ligation of fragment ends
• Amplifying flanking region of transposon
• Flanking region used as probe in wild type DNA
Confirmation of Transposon Tagging :-
1) Complementation Test :
• The mutant is again transformed with the functional copy of the gene for which the mutant is
expected.
Result :
• Function restored : mutation by tagged gene.
• Function not restored : mutation by some other reason.
2) Revertants :
• The mutant obtained is either selfed (endogenous system) or crossed with mutant transgenic
line with autonomous transposable element (heterologous system).
Result :
• Revertants obtained : Gene confirmed
• Revertants not obtained : mutation by some other reason.
Transposon Tagging in Tomato Plant :-
• The Cf-9 gene was known to provide resistance against the leaf mould fungus Cladosporium
fulvum in tomato.
• The Cf-9 gene had already been mapped on the short arm of chromosome 1.
• Avr9 genes were characterised in race 5 of the fungus C. fulvum.
• Plant defenses are often activated by specific interaction between the product of a disease
resistance (R) gene in the plant and the product of a corresponding avirulence (Avr) gene in the
pathogen. Without either of these genes, plant defenses are not activated and infection by the
pathogen is permitted.
Materials Required :
• A transgenic tomato line carrying a Ds element located 3 centimorgans from the Cf-9 locus
• A stable line containing genetically unlinked Ac (sAc), itself incapable of transposition
• A line homozygous for Cf-9
• A line homozygous for Avr9
Mechanism :-
1) A total of at least 37 independent Ds insertions into Cf-9 were recovered. Of these, 28 were
mapped to the same 3-kb region of the tomato genome.
2) All stable mutants were susceptible to race 5 of C. fulvum.
3) A correlation was found between multiple independent mutations of Cf-9 and multiple
independent Ds insertions in a defined region.
4) In plants which carried sAc in the heterozygous state, revertants were found, confirming that
the gene tagged was correct.
5) In a separate example in maize, the Y1 allele is responsible for yellow endosperm color, while
y1 results in white color.
6) The Mu3 transposon was used to tag the Y1 allele.
Fig. Restriction Endonuclease Map of the Y1-mum and Y1 Alleles
Conclusion :-
• Initially only transposon sequence is required
• Used to identify and clone numerous genes having visible phenotypes.
• Aid in mutation studies
• Any specific as well as random genes can be targeted by this method
• Can be used to produce an allelic series of a gene.
1.11 VIGS AND FACS
• VIGS can be defined as the silencing of endogenous plant genes initiated by recombinant viral
vectors.
• Virus-induced gene silencing (VIGS) in plants takes place if there is sequence similarity between
the virus and either a transgene or an endogenous nuclear gene (Lindbo et al., 1993; Kumagai et
al., 1995).
• Many of the reported examples of pathogen-derived resistance are probably manifestations of
transgene-induced silencing targeted against viral RNA (Baulcombe, 1996b).
• According to this idea, gene silencing would be activated naturally in virus-infected plants and
artificially in transgenic plants when the transgene or its RNA is perceived as part of a virus.
Objectives of VIGS :-
• Virus-induced gene silencing (VIGS) is a tool to suppress the expression level of a gene of
interest in plants.
• To describe the resistance event against viral infection.
Procedure of VIGS :-
1) Cloning a 200–1300 bp cDNA fragment from a plant gene of interest into a DNA copy of the
genome of an RNA virus.
2) Transfecting the plant with this construct using Agrobacterium.
3) Double-stranded RNA from the viral genome, including sequence from the gene of interest, is
formed during viral replication.
4) The double-stranded RNA molecules are degraded into siRNA molecules by the plant Dicer-like
enzymes.
5) The method is limited only by the host range of the virus used; TRV and ALSV are the most
common vectors.
VIGS vectors :-
Vectors for VIGS are standard binary Ti plasmid-derived vectors used for Agrobacterium tumefaciens-
mediated plant transformation, into which part of a viral genome is inserted. The VIGS vector is a
recombinant virus engineered to carry a piece of an endogenous gene from the host. During
infection with the modified vector, the host’s defense reaction is induced against the cloned host
gene; the resulting loss-of-function phenotype makes it possible to identify the function of the gene.
Around 37 VIGS vectors have been developed for gene function studies in dicotyledonous plants, but far
fewer are available for studies with monocotyledonous plants. For example, the tobacco rattle virus
(TRV)-derived vector is one of the most widely used systems in dicots because of its broad host range
and ability to infect meristematic tissue. Bean pod mottle virus (BPMV)-derived vectors have been
developed to induce VIGS in soybean and common bean. A cabbage leaf curl virus (CLCV)-derived
vector has been developed to trigger siRNA-mediated silencing in Arabidopsis.
To date only a few RNA viruses and one DNA virus have been modified as VIGS vectors for
monocotyledonous plant species, as indicated in Table 1. Of these, Barley stripe mosaic virus (BSMV)-
based VIGS has been applied for functional genomics in barley and wheat; Brome mosaic virus (BMV)
in rice, barley and maize; Bamboo mosaic virus (BaMV) and its satellite RNA in Nicotiana benthamiana
and Brachypodium distachyon; and Rice tungro bacilliform virus in rice. In these systems, silencing was
observed to spread systemically throughout the entire plant. Reduced levels of photoprotective
carotenoids lead to the rapid destruction of chlorophyll by photo-oxidation, which results in a white
leaf phenotype that can be easily followed visually.
Table 1 : The most frequently used VIGS Vectors in monocotyledonous plant species.
Molecular mechanism of VIGS :-
The major steps in VIGS include engineering the appropriate viral vector to incorporate fragments of
the host genes that are targeted to be silenced, infecting the appropriate plant hosts, and silencing the
target genes as part of the defense mechanism of the plant against virus infection. VIGS vectors are
constructed by cloning a fragment of the plant target gene that yields efficient siRNAs.
The size of the inserted fragment of the target endogenous gene usually affects the efficiency of VIGS.
Several VIGS vectors have the capacity to carry a fragment of between 150 and 800 bp. VIGS vectors
can fail to induce gene silencing if the insert fragment is greater than 1500 bp.
Even though several studies indicate that an insertion as short as 23 bp can induce VIGS, fragments of
200 to 350 bp in length are mostly preferred because they induce higher silencing efficiency; an
improperly sized gene fragment can induce non-target silencing, producing an inappropriate
phenotype. Besides, the orientation of the inserted gene fragment is another very important factor
that can affect the efficiency of VIGS: higher silencing efficiency is usually induced by a reverse-
oriented insertion than by a forward-oriented insertion.
The silencing efficiency is significantly enhanced if the target fragment is constructed as a hairpin
structure. Selection of the target gene is important for VIGS. Different candidate fragments can be used
for silencing of a specific gene. If the target gene belongs to a gene family, some sequences may contain
domains conserved between different genes in the family, and a fragment of the target gene that
shares more than 23 bp of homology with other genes in the family can result in the degradation of
non-target genes; in this case, a very specific fragment needs to be selected. A fragment from the UTR
region is usually most appropriate for specificity, whereas conserved domains should be chosen when
functional complementation by genes from the same family must be avoided. The efficiency of gene
silencing may also be affected by the type of inoculation method. The recombinant virus is most
frequently introduced into plant cells through Agrobacterium tumefaciens-mediated transient
expression, in vitro-transcribed RNA inoculation, or direct DNA inoculation.
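The fragment-selection rules above (preferred insert size of 200–350 bp; no stretch of more than 23 bp shared with non-target genes) can be sketched as a simple screen. The sequences used below are placeholders, not real genes:

```python
def shared_23mers(fragment: str, off_target: str) -> bool:
    """True if the candidate fragment shares any 23-nt stretch with an
    off-target sequence — enough homology to risk silencing that gene too."""
    kmers = {fragment[i:i + 23] for i in range(len(fragment) - 22)}
    return any(off_target[i:i + 23] in kmers
               for i in range(len(off_target) - 22))

def suitable_vigs_fragment(fragment: str, off_targets: list) -> bool:
    """Apply both rules of thumb: 200-350 bp insert size, and no >= 23 bp
    homology to non-target members of the gene family."""
    if not 200 <= len(fragment) <= 350:
        return False
    return not any(shared_23mers(fragment, t) for t in off_targets)

# Placeholder sequences: a 250-bp candidate and an unrelated off-target.
print(suitable_vigs_fragment("AC" * 125, ["G" * 100]))  # → True
```

A candidate drawn from a UTR would typically pass the 23-mer check against its paralogs, while one drawn from a conserved domain would fail it, matching the guidance in the text.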
After the recombinant virus is introduced into plant cells, the transgene is amplified along with the
viral RNA by either an endogenous or a viral RNA-dependent RNA polymerase (RdRp) enzyme,
generating dsRNA molecules. This dsRNA is the triggering molecule of PTGS. Dicer-like proteins cleave
these viral dsRNAs into short interfering RNA (siRNA) duplexes approximately 21–24 nucleotides in
length. These siRNAs are in turn incorporated as single-stranded RNA molecules into the RNA-induced
silencing complex (RISC), which screens for and destroys RNAs complementary to the siRNA; the
antisense strand of the siRNA thus targets homologous RNA for degradation. Interaction of viral siRNA
with target RNA can lead to endonucleolytic cleavage and translational inhibition of cognate RNAs.
This leads to post-transcriptional gene silencing.
Fig 1. Major steps in VIGS. (A) Organization of vectors, (B) Mode of
delivery, (C) Molecular mechanism.
As indicated in Figures 1 and 2, the virus-derived silencing signal is further amplified and spreads
systemically throughout the plant. siRNAs of about 21 nucleotides in length are assumed to mediate
the short-range transport, and the RNA-dependent RNA polymerase (RDR) is required for long-range
transport, possibly by amplifying the silencing signal. This systemic spread of the silencing signal occurs
regardless of the successful movement of the virus particles in the plant. When VIGS is applied to a
susceptible plant, the host plant’s target gene mRNA is degraded in large portions of the plant.
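The dicing and RISC-targeting steps can be sketched as a toy string-matching model. The 27-nt viral insert is hypothetical, and real Dicer products are not evenly spaced as they are here:

```python
def dice(dsrna: str, sirna_len: int = 21) -> list:
    """Cleave a dsRNA (represented by one strand) into consecutive
    siRNA-length pieces — a crude stand-in for Dicer-like processing."""
    return [dsrna[i:i + sirna_len]
            for i in range(0, len(dsrna) - sirna_len + 1, sirna_len)]

def risc_targets(mrna: str, sirnas: list) -> list:
    """Positions on the mRNA where a guide siRNA matches exactly,
    marking sites for RISC-mediated cleavage."""
    return sorted(mrna.find(s) for s in sirnas if s in mrna)

viral_insert = "AUGGCUAAAGAAGGUCCUAGGACUUAA"   # hypothetical 27-nt insert
sirnas = dice(viral_insert)
host_mrna = "GGG" + viral_insert + "CCC"       # target shares the insert
print(risc_targets(host_mrna, sirnas))         # → [3]
```

Because the host mRNA contains the same sequence as the viral insert, the siRNA guides find it and mark it for cleavage — the essence of why inserting a host gene fragment into the virus silences the host gene.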
Virus-induced gene silencing (VIGS) is a quick and robust method to assess the function of a gene by
transient post-transcriptional gene silencing, and it does not affect the next generation. The degree of
silencing efficiency in the live plant can be monitored by observing discoloration and photobleaching
of plant parts.
Fig 2. Over view of VIGS in plants and systemic spread of silencing signal.
VIGS applications in functional genomics :-
The first RNA virus used as a silencing vector was Tobacco mosaic virus (TMV). Transcripts of
recombinant virus carrying a sequence from the phytoene desaturase (PDS) gene were produced in
vitro and inoculated onto Nicotiana benthamiana plants to successfully silence PDS. Many VIGS
investigations have been made in the wild tobacco species Nicotiana benthamiana, whose high
susceptibility to virus infection supports efficient gene silencing. Subsequently, a VIGS vector based on
potato virus X (PVX) was developed that is more stable than the TMV-based vector. However, PVX has
a more limited host range than TMV. Both TMV- and PVX-based vectors are excluded from the growing
points or meristems of their hosts, which precludes effective silencing of genes in those tissues.
In an earlier study, a VIGS vector based on the DNA virus tomato golden mosaic virus (TGMV) was used
to successfully silence a meristem gene in Nicotiana benthamiana. This TGMV-based vector has also
been used to silence a non-meristematic gene as well as a foreign transgene. The limitations of host
range and meristem exclusion were overcome with the development of VIGS vectors based on tobacco
rattle virus (TRV). TRV is able to spread more vigorously throughout the entire plant, including
meristem tissue, yet the overall symptoms of infection are mild compared with other viruses. The
function of wheat starch regulator 1 (TaRSR1) in regulating the synthesis of grain storage starch was
determined using the barley stripe mosaic virus-induced gene silencing (BSMV-VIGS) method in field
experiments. Chlorotic stripes appeared at 15 days after anthesis on the wheat spikes in which TaRSR1
had been silenced by BSMV-VIGS, at which time the transcription levels of the TaRSR1 gene were
significantly decreased.
VIGS has been successfully used to silence genes in tomato and strawberry fruits even after the fruits
were detached from the plant. This method is particularly useful if a specific target gene, to be silenced
in the reproductive organs, is involved in basic metabolic processes during the early stages of plant
growth, so that deformities in vegetative plant growth are avoided. Pepper huasteco yellow vein virus
(PHYVV)-derived vectors were used to demonstrate the involvement of three genes (Comt, pAmt, and
Kas) in the biosynthesis of capsaicinoids, which are responsible for the pungent taste of the chilli
pepper fruits of Capsicum species and are expressed differentially in placenta tissue. Non-pungent
chilli peppers were successfully produced at high efficiency with this VIGS approach.
Major challenges in VIGS application :-
Despite improvements in the protocols used for VIGS in several plant species, several limitations of
VIGS remain unaddressed. Mainly, the requirement of VIGS to initiate a viral infection restricts its
application in certain varieties of crop plants. Several viral resistance genes are known in cultivated
varieties of crops such as bean, cucumber, pea, pepper, potato and tomato; these confer resistance
against certain viruses and thus make VIGS with vectors derived from those viruses ineffective.
Besides, to minimize the potential hazards and prevent unintentional escape of the VIGS vectors into
the environment, appropriate biosafety precautions need to be practiced. Although it has been shown
that siRNAs and triggers of silencing can be transferred from one organism to another, there is no
report indicating that this actually occurs in VIGS. A more realistic scenario is the transfer of silencing
from plants to silencing-competent soil organisms.
However, there is no experimental indication yet that this transfer is possible in VIGS. A transfer of
silencing between plants would also be undesirable; although not yet ruled out by experiments,
mechanical transfer of silencing from plant to plant may also be possible.
VIGS cannot be used in some plant species because of the lack of appropriate VIGS vectors. This can be
overcome in two ways. One way is to use heterologous gene sequences, wherever possible, from plant
species that are recalcitrant to VIGS to silence genes in a closely related, VIGS-amenable species. The
other option is to develop a new vector, specific to the desired plant species, from a suitable virus that
infects that species. Besides, developing broad-host-range VIGS vectors will be more useful. For
instance, Tobacco rattle virus (TRV) and Apple latent spherical virus (ALSV) vectors can provoke VIGS in
many plant species.
Lack of an efficient method for virus vector delivery is another challenge in application of VIGS in plants.
Agrobacterium mediated delivery of binary VIGS vectors is efficiently used in many dicot plants.
However, for plants that are recalcitrant to Agrobacterium mediated transformation, VIGS can be
induced by the virus sap inoculation method, RNA transcript inoculation or DNA bombardment. Uneven
or localized VIGS resulting in a lack of silencing in certain tissues is mainly the result of ineffective virus
movement. This can be addressed by maintaining environmental conditions favoring systemic virus
movement. Furthermore, appropriate virus vectors that have the ability to spread systemically in the
host plant without deleting the insert should be chosen and such vectors should not have a strong
silencing suppressor.
Reporter gene expression along with the expression of VIGS vectors can be useful for visualizing the
silenced tissues. VIGS vectors that produce severe symptoms in host plants should be avoided;
selection of an appropriate virus–host system in which viral symptoms are less obvious is important.
Besides, gene silencing in many plants is affected by target gene position, insert length and
orientation. For optimum VIGS, insert lengths should be in the range of 200 bp to 350 bp.
Future prospects of VIGS application :-
The application of VIGS has now become more versatile. However, studies related to heritable and long
duration VIGS have yet to be extended to other genes, apart from marker genes, in a wide range of plant
species. The potential application of VIGS in crop improvement has yet to be realized. In light of recent
advances in VIGS technology, the prospects of using VIGS for various applications in modern plant
biology are promising.
PTGS achieved by VIGS vectors can be used for both genetic engineering and molecular breeding aimed
at crop improvement. Reduction or alteration of the flowering time of certain genotypes or of
indeterminate cultivars, wild relatives or inbred lines can be achieved by using viral vectors.
Silencing of a negative regulator of flowering in a late flowering genotype can help to match flowering
time, enabling crossing with an early flowering genotype. This can also facilitate early and uniform
flowering needed for crossing in indeterminate growth genotypes and reduce hurdles related to
pollination time. Progeny plants can be made virus free using methods such as heat or freeze shock and
the elimination of viruses can also reverse the PTGS of target genes. In the case of viruses that do not
invade meristematic tissue, virus free plants can be developed by meristem tip culture.
Virus free plants can also be identified among the progeny plants because seed transmission of the virus
is not always 100% effective. However, wherever possible, a non-seed transmitted VIGS vector can be
used to avoid virus transmission to the next generation. Results from VIGS mediated forward genetics
screens can also be used by breeders to identify the genes important for a plant process and for quick
validation of putative candidate genes during map-based cloning of important traits in crop plants.
FLUORESCENCE ACTIVATED CELL SORTING (FACS) :-
• Fluorescence-activated cell sorting (FACS) is a specialized type of flow cytometry.
• It provides a method for sorting a heterogeneous mixture of cells into two or more containers,
one cell at a time, based upon the specific light scattering and fluorescent characteristics of each
cell.
• It provides fast, objective and quantitative recording of signals from individual cells as well as
physical separation of cells of particular interest.
• The first cell sorter was invented by Mack Fulwyler in 1965.
• The technique was expanded in the late 1960s by Bonner, Sweet, Hulett and Herzenberg; Len
Herzenberg coined the term “FACS”.
• Commercial machines were introduced by Becton Dickinson Immunocytometry Systems in the
early 1970s.
• It can separate one fluorescent cell from a population of 1000 unlabelled cells.
Principle of FACS :-
➢ One or more beams of light (usually laser light) are directed onto a hydrodynamically focused
stream of fluid.
➢ A number of detectors are placed at the intersection of the stream with the light beam, to
detect scattered light (forward scatter or FSC, in line with the light beam, and side scatter or
SSC, perpendicular to it) and one or more fluorescent detectors.
➢ Each suspended particle (from 0.2 to 150 micrometers) passing through the beam scatters the
light, and fluorescent chemicals found in the particle or attached to it may be excited, emitting
light at a longer wavelength than the light source.
➢ This combination of scattered and fluorescent light is recorded by the detectors and, by
analyzing changes in brightness at each detector, information about the physical and chemical
structure of each individual particle is obtained.
➢ FSC correlates with the cell volume and SSC depends on the inner complexity of the particle
(shape of the nucleus, the amount and type of cytoplasmic granules, or the membrane
roughness).
➢ Modern flow cytometers are able to analyze several thousand particles every second, in “real
time”, and can actively isolate particles having specified properties.
Instrumentation of FACS :-
❖ Fluidics – directs the liquid stream containing particles through the focused laser beam.
❖ Vibrating nozzle – breaks the cell suspension into fine droplets.
❖ Laser – light source for illumination and excitation of the particles.
❖ Collection optics and filters – detect light signals coming from particles.
❖ Electronics/computer – convert light signals to voltage and digital output.
1) Fluidics :-
• The fluidic system is used to transport particles from a random three-dimensional sample
suspension to an orderly stream of particles.
• The fluidic system often uses air pressure regulation for stable operation and consists of at least
one sheath line and a sample line feeding the flow cell.
• As the sample enters the flow cell chamber, the outer, faster-flowing sheath fluid
hydrodynamically focuses the sample fluid into a narrow core region within the jet and presents
a single file of particles to the excitation sources.
• This geometry provides increased positioning accuracy at the laser interrogation point for
consistent excitation irradiance and greatly reduces particle blockage of the flow cell.
2) Optics :-
• Optics are central to flow cytometry for the illumination of stained and unstained particles and
for the detection of scatter and fluorescent light signals.
• Commonly used are lamps (mercury, xenon); high-power water-cooled lasers (argon, krypton,
dye laser); low-power air-cooled lasers (argon (488 nm), red-HeNe (633 nm), green-HeNe, HeCd
(UV)); diode lasers (blue, green, red, violet) resulting in light signals.
3) Optical Filters :-
• Once the fluorescence light from a cell has been captured by the collection optics, the spectral
component of interest for each stain must be separated spatially for detection.
• This separation of wavelengths is achieved using dichroic (45 degree) and emission (normal
incidence) filters.
• Longpass filters permit longer wavelength transmission, while shortpass filters allow shorter
wavelength transmission.
• Bandpass filters only allow a selected wavelength band of interest to be transmitted while
blocking unwanted wavelengths.
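The three filter types above can be sketched as simple wavelength predicates. This is an illustrative Python sketch, not instrument software; the cutoff and band values below are hypothetical examples.

```python
# Illustrative sketch (not from the source): modeling the three common
# optical filter types as wavelength predicates. Cutoff values are
# hypothetical examples.

def longpass(cutoff_nm):
    """Transmit wavelengths longer than the cutoff."""
    return lambda wl: wl > cutoff_nm

def shortpass(cutoff_nm):
    """Transmit wavelengths shorter than the cutoff."""
    return lambda wl: wl < cutoff_nm

def bandpass(center_nm, width_nm):
    """Transmit only a band of width_nm centred on center_nm."""
    half = width_nm / 2
    return lambda wl: center_nm - half <= wl <= center_nm + half

lp520 = longpass(520)       # e.g. pass emission above 520 nm
bp530 = bandpass(530, 30)   # e.g. a "530/30" bandpass filter

print(lp520(575))   # True: 575 nm passes a 520 nm longpass filter
print(bp530(575))   # False: 575 nm lies outside the 515-545 nm band
```

In a real cytometer the dichroic (45°) filters split the light path spatially; the predicate view only captures which wavelengths each detector ultimately sees.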
4) Detectors :-
• Silicon photodiodes and photomultiplier tubes (PMTs).
5) Electronics :-
• As a particle of interest passes through the focus, fluoresces and is detected by a photodetector,
an electrical pulse is generated and presented to the signal processing electronics.
• An amplification system – linear or logarithmic
• A computer for analysis of the signals.
Process of FACS Operation :-
➢ The cell suspension containing the cells labeled with fluorescent dye is directed into a thin
stream so that all cells pass in a single file.
➢ The stream emerges from a nozzle vibrating at some 40,000 cycles per second, which breaks
the stream into 40,000 droplets per second.
➢ Laser beam is directed at the stream just before it breaks up into droplets.
➢ As each labeled cell passes through the beam, its resulting fluorescence is detected by a
photocell.
➢ If the cell is fluorescent, the droplet containing it is given a positive or negative charge.
➢ The droplets retain this charge as they pass between a pair of charged metal plates.
➢ Positively charged droplets are attracted by the negatively charged plate and vice versa.
➢ Uncharged droplets do not deviate; they pass straight into a third container and are discarded
later.
➢ A cell scatters light in all directions when it passes through the laser.
➢ Two types of scatter are measured : Forward Scatter & Side Scatter.
1) Forward Scatter :
• Low-angle scattering (up to 20° from the beam axis).
• Detector converts intensity of light into voltage.
• A blocking bar is placed in front of the detector.
• In the absence of a cell, light falls on the blocking bar and no voltage is produced.
• When a cell passes through the beam, scattered light falls on the detector and a voltage is
measured.
2) Side Scatter :
• Scattered at a larger angle (~90° from the beam axis).
• Caused by granularity and complexity inside the cell.
• Focused to the side scatter detection system and detected by separate detectors.
• Detection is the same as for the forward scatter detectors.
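The droplet-charging and deflection steps described above can be sketched as a toy simulation. The label names and the mapping of labels to charges below are invented for illustration; a real sorter decides from measured fluorescence signals.

```python
# Minimal sketch of the FACS sort logic: fluorescent droplets are
# charged and deflected toward a plate of opposite sign, uncharged
# droplets fly straight into a waste container. Labels are illustrative.

def sort_droplets(droplets):
    """Route each droplet to a container by the charge it was given."""
    containers = {"left": [], "right": [], "waste": []}
    for cell in droplets:
        if cell == "green":      # fluorescent: droplet charged +ve
            containers["left"].append(cell)   # attracted to -ve plate
        elif cell == "red":      # fluorescent: droplet charged -ve
            containers["right"].append(cell)  # attracted to +ve plate
        else:                    # no fluorescence: no charge
            containers["waste"].append(cell)  # passes straight through
    return containers

stream = ["green", "none", "red", "green", "none"]
bins = sort_droplets(stream)
print(bins["left"])   # ['green', 'green']
print(bins["waste"])  # ['none', 'none']
```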
Fluorescent Detection :-
Fluorochrome molecules undergo two processes : Excitation (absorb light at a given wavelength) and
Emission (re-emit light at a longer wavelength).
1) Light is absorbed by the fluorochrome and its electrons become excited.
2) Excited electrons migrate from the resting (ground) state to an excited state.
3) Within about 10 nanoseconds the electron releases some of the absorbed energy as heat and
falls to a lower, more stable excited level.
4) The electron then returns to the ground state, releasing the remaining energy as fluorescence.
• Absorbed energy (E excitation) > released energy (E emission).
• Stokes Shift = E excitation − E emission.
• The value of the Stokes shift determines the quality of a fluorochrome : the higher the Stokes
shift, the easier the detection.
• Fluorescent probes directly target the cells of interest, so they are readily chosen for
flow cytometric analysis.
• More parameters can be detected at one time if more fluorochromes are used.
• Tandem dyes are generally used for detection, e.g. Alexa Fluor 488, Phycoerythrin, APC-Cy7.
• Fluorophores are attached to the cell surface or inside the cell.
• Fluorescent signals are emitted when cell passes through laser light (635 nm, 488 nm).
• Signals are collected by various mirrors and filters and detected by Photodiodes or PMTs.
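The excitation/emission relations and the Stokes shift above can be made concrete with a short calculation using E = hc/λ. The dye wavelengths below are approximate values for a fluorescein (FITC)-like dye, used only for illustration.

```python
# Sketch of the fluorescence energetics described above. Photon energy
# E = h*c / wavelength, so a longer emission wavelength corresponds to
# a lower emitted energy. Dye values are approximate, for illustration.

H = 6.626e-34   # Planck constant, J*s
C = 2.998e8     # speed of light, m/s

def photon_energy_j(wavelength_nm):
    return H * C / (wavelength_nm * 1e-9)

ex_nm, em_nm = 495.0, 519.0        # roughly fluorescein (FITC)
e_ex = photon_energy_j(ex_nm)
e_em = photon_energy_j(em_nm)

stokes_shift_nm = em_nm - ex_nm    # shift expressed as wavelength
stokes_shift_j = e_ex - e_em       # shift expressed as energy

print(stokes_shift_nm)             # 24.0 (nm)
print(e_ex > e_em)                 # True: absorbed energy > emitted energy
```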
Quantifying FACS Data :-
FACS data collected by the computer are commonly displayed as a two-colour dot plot :
• The X-axis plots the intensity of green fluorescence while the Y-axis plots the intensity of red
fluorescence.
• Each black dot represents an individual cell; rather than counting dots, one looks at the relative
density of dots in each quadrant.
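Reading quadrant densities from such a dot plot can be sketched as follows; the gating thresholds and intensity values below are made up for illustration.

```python
# Sketch (thresholds and data invented) of classifying cells from a
# two-colour dot plot: each cell is a (green, red) intensity pair and
# is assigned to a quadrant by comparing against gating thresholds.

def quadrant(green, red, g_gate=100.0, r_gate=100.0):
    if green >= g_gate and red >= r_gate:
        return "double positive"
    if green >= g_gate:
        return "green only"
    if red >= r_gate:
        return "red only"
    return "double negative"

cells = [(250, 30), (40, 300), (180, 220), (20, 10)]
counts = {}
for g, r in cells:
    q = quadrant(g, r)
    counts[q] = counts.get(q, 0) + 1

print(counts)
# {'green only': 1, 'red only': 1, 'double positive': 1, 'double negative': 1}
```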
1.12 GENOME EDITING SYSTEMS
Genome editing, or genome engineering, or gene editing, is a type of genetic engineering in which DNA
is inserted, deleted, modified or replaced in the genome of a living organism. Unlike early genetic
engineering techniques that randomly insert genetic material into a host genome, genome editing
targets the insertions to site-specific locations.
History of Genome Editing :-
Genome editing was pioneered in the 1990s, before the advent of the common current nuclease-based
gene editing platforms, however, its use was limited by low efficiencies of editing. Genome editing with
engineered nucleases, i.e. all three major classes of these enzymes—zinc finger nucleases (ZFNs),
transcription activator-like effector nucleases (TALENs) and engineered meganucleases—were selected
by Nature Methods as the 2011 Method of the Year. The CRISPR-Cas system was selected by Science as
2015 Breakthrough of the Year.
As of 2015 four families of engineered nucleases were used : meganucleases, zinc finger nucleases
(ZFNs), transcription activator-like effector-based nucleases (TALEN), and the clustered regularly
interspaced short palindromic repeats (CRISPR/Cas9) system. Nine genome editors were available as of
2017.
In 2018, the common methods for such editing used engineered nucleases, or “molecular scissors”.
These nucleases create site-specific double-strand breaks (DSBs) at desired locations in the genome. The
induced double-strand breaks are repaired through nonhomologous end-joining (NHEJ) or homologous
recombination (HR), resulting in targeted mutations (‘edits’).
In May 2019, lawyers in China reported, in light of the purported creation by Chinese scientist He Jiankui
of the first gene-edited humans, the drafting of regulations that anyone manipulating the human
genome by gene-editing techniques, like CRISPR, would be held responsible for any related adverse
consequences. A cautionary perspective on the possible blind spots and risks of CRISPR and related
biotechnologies has been recently discussed, focusing on the stochastic nature of cellular control
processes.
Fig 1. The different generations of nucleases used for genome editing and the DNA
repair pathways used to modify target DNA.
Process of Genome Editing :-
A) Double strand break repair :
A common form of Genome editing relies on the concept of DNA double stranded break (DSB) repair
mechanics. There are two major pathways that repair DSB; non-homologous end joining (NHEJ) and
homology directed repair (HDR). NHEJ uses a variety of enzymes to directly join the DNA ends while the
more accurate HDR uses a homologous sequence as a template for regeneration of missing DNA
sequences at the break point. This can be exploited by creating a vector with the desired genetic
elements within a sequence that is homologous to the flanking sequences of a DSB. This will result in the
desired change being inserted at the site of the DSB. While HDR based gene editing is similar to the
homologous recombination based gene targeting, the rate of recombination is increased by at least
three orders of magnitude.
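The HDR idea above (a donor whose homology arms match the flanks of the DSB, so that the desired insert is copied into the break point) can be sketched on toy sequences. The sequences are invented, and real homology arms are typically hundreds of base pairs long.

```python
# Toy sketch of HDR-based editing: the donor carries the insert flanked
# by sequences homologous to the regions on either side of the DSB, so
# repair copies the insert into the break point. Sequences invented.

def hdr_repair(genome, cut_site, insert, arm_len=4):
    """Simulate HDR with a donor whose arms match the cut flanks."""
    left_arm = genome[cut_site - arm_len:cut_site]
    right_arm = genome[cut_site:cut_site + arm_len]
    donor = left_arm + insert + right_arm
    # repair resynthesises across the break from the donor template,
    # leaving the insert between the two homology arms
    edited = genome[:cut_site] + insert + genome[cut_site:]
    return edited, donor

genome = "AACCGGTTAACCGGTT"
edited, donor = hdr_repair(genome, cut_site=8, insert="TAG")
print(edited)  # AACCGGTTTAGAACCGGTT
```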
B) Engineered nucleases :
The key to genome editing is creating a DSB at a specific point within the genome. Commonly used
restriction enzymes are effective at cutting DNA, but generally recognize and cut at multiple sites. To
overcome this challenge and create site-specific DSBs, four distinct classes of nucleases have been
discovered and bioengineered to date. These are the meganucleases, zinc finger nucleases (ZFNs),
transcription activator-like effector nucleases (TALENs) and the clustered regularly interspaced short
palindromic repeats (CRISPR/Cas9) system.
1) Meganucleases :
Meganucleases, discovered in the late 1980s, are enzymes in the endonuclease family which are
characterized by their capacity to recognize and cut large DNA sequences (from 14 to 40 base pairs). The
most widespread and best known meganucleases are the proteins in the LAGLIDADG family, which owe
their name to a conserved amino acid sequence.
Meganucleases, found commonly in microbial species, have the unique property of having very long
recognition sequences (>14bp) thus making them naturally very specific. However, there is virtually no
chance of finding the exact meganuclease required to act on a chosen specific DNA sequence. To
overcome this challenge, mutagenesis and high throughput screening methods have been used to
create meganuclease variants that recognize unique sequences. Others have been able to fuse various
meganucleases and create hybrid enzymes that recognize a new sequence. Yet others have attempted
to alter the DNA-interacting amino acids of the meganuclease to design sequence-specific meganucleases
in a method named rationally designed meganuclease. Another approach involves using computer
models to try to predict as accurately as possible the activity of the modified meganucleases and the
specificity of the recognized nucleic sequence.
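A back-of-envelope calculation shows why long recognition sites make meganucleases naturally specific, and why the exact enzyme for a chosen sequence is unlikely to exist: assuming a random genome with uniform base composition, an n-bp site is expected about L / 4^n times in L bp (ignoring strand and edge effects).

```python
# Sketch of the specificity argument above: expected number of random
# occurrences of an n-bp recognition site in a genome of L bp, under a
# uniform-composition assumption (strand and edge effects ignored).

def expected_sites(genome_len_bp, site_len_bp):
    return genome_len_bp / (4 ** site_len_bp)

human_genome = 3.2e9   # ~3.2 Gbp

print(expected_sites(human_genome, 6))    # 781250.0: a typical 6-bp cutter
print(expected_sites(human_genome, 18))   # ~0.047: an 18 bp meganuclease-like site
```

With an 18 bp site the expected count falls below one, which is why a natural meganuclease matching a chosen target almost never exists and variants must be engineered.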
A large bank containing several tens of thousands of protein units has been created. These units can be
combined to obtain chimeric meganucleases that recognize the target site, thereby providing research
and development tools that meet a wide range of needs (fundamental research, health, agriculture,
industry, energy, etc.). These include the industrial-scale production of two meganucleases able to cleave
the human XPC gene; mutations in this gene result in Xeroderma pigmentosum, a severe monogenic
disorder that predisposes the patients to skin cancer and burns whenever their skin is exposed to UV
rays.
Meganucleases have the benefit of causing less toxicity in cells than methods such as Zinc finger
nuclease (ZFN), likely because of more stringent DNA sequence recognition; however, the construction
of sequence-specific enzymes for all possible sequences is costly and time-consuming, as one is not
benefiting from combinatorial possibilities that methods such as ZFNs and TALEN-based fusions utilize.
2) Zinc finger nucleases (ZFN) :
As opposed to meganucleases, the concept behind ZFNs and TALEN technology is based on a non-
specific DNA cutting catalytic domain, which can then be linked to specific DNA sequence recognizing
peptides such as zinc fingers and transcription activator-like effectors (TALEs). The first step to this was
to find an endonuclease whose DNA recognition site and cleaving site were separate from each other, a
situation that is not the most common among restriction enzymes. Once this enzyme was found, its
cleaving portion could be separated which would be very non-specific as it would have no recognition
ability. This portion could then be linked to sequence recognizing peptides that could lead to very high
specificity.
Zinc finger motifs occur in several transcription factors. The zinc ion, found in 8% of all human proteins,
plays an important role in the organization of their three-dimensional structure. In transcription factors,
it is most often located at the protein-DNA interaction sites, where it stabilizes the motif. The C-terminal
part of each finger is responsible for the specific recognition of the DNA sequence.
The recognized sequences are short, made up of around 3 base pairs, but by combining 6 to 8 zinc
fingers whose recognition sites have been characterized, it is possible to obtain specific proteins for
sequences of around 20 base pairs. It is therefore possible to control the expression of a specific gene. It
has been demonstrated that this strategy can be used to promote a process of angiogenesis in animals.
It is also possible to fuse a protein constructed in this way with the catalytic domain of an endonuclease
in order to induce a targeted DNA break, and therefore to use these proteins as genome engineering
tools.
The method generally adopted for this involves associating two DNA binding proteins – each containing
3 to 6 specifically chosen zinc fingers – with the catalytic domain of the FokI endonuclease which need
to dimerize to cleave the double-strand DNA. The two proteins recognize two DNA sequences that are a
few nucleotides apart. Linking the two zinc finger proteins to their respective sequences brings the two
FokI domains closer together. FokI requires dimerization to have nuclease activity and this means the
specificity increases dramatically as each nuclease partner would recognize a unique DNA sequence. To
enhance this effect, FokI nucleases have been engineered that can only function as heterodimers.
Several approaches are used to design specific zinc finger nucleases for the chosen sequences. The most
widespread involves combining zinc-finger units with known specificities (modular assembly). Various
selection techniques, using bacteria, yeast or mammal cells have been developed to identify the
combinations that offer the best specificity and the best cell tolerance. Although the direct genome-
wide characterization of zinc finger nuclease activity has not been reported, an assay that measures the
total number of double-strand DNA breaks in cells found that only one to two such breaks occur above
background in cells treated with zinc finger nucleases with a 24 bp composite recognition site and
obligate heterodimer FokI nuclease domains.
The heterodimer functioning nucleases would avoid the possibility of unwanted homodimer activity and
thus increase specificity of the DSB. Although the nuclease portions of both ZFNs and TALEN constructs
have similar properties, the difference between these engineered nucleases is in their DNA recognition
peptide. ZFNs rely on Cys2-His2 zinc fingers and TALEN constructs on TALEs. Both of these DNA
recognizing peptide domains have the characteristic that they are naturally found in combinations in
their proteins. Cys2-His2 Zinc fingers typically happen in repeats that are 3 bp apart and are found in
diverse combinations in a variety of nucleic acid interacting proteins such as transcription factors. Each
finger of the zinc finger domain is not completely independent, as the binding capacity of one finger is
impacted by its neighbor. TALEs, on the other hand, are found in repeats with a one-to-one recognition
ratio between the amino acids and the recognized nucleotide pairs. Because both zinc fingers and TALEs
happen in repeated patterns, different combinations can be tried to create a wide variety of sequence
specificities. Zinc fingers have been more established in these terms and approaches such as modular
assembly (where Zinc fingers correlated with a triplet sequence are attached in a row to cover the
required sequence), OPEN (low-stringency selection of peptide domains vs. triplet nucleotides followed
by high-stringency selections of peptide combination vs. the final target in bacterial systems), and
bacterial one-hybrid screening of zinc finger libraries among other methods have been used to make site
specific nucleases.
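The modular-assembly approach can be sketched as a lookup: the target is split into 3-bp triplets and each triplet is covered by a finger module of known specificity. The archive below contains hypothetical module names, not real characterized fingers, and the simple 5'-to-3' ordering ignores context effects between neighbouring fingers.

```python
# Sketch of zinc-finger modular assembly (module names hypothetical):
# each 3-bp triplet of the target is matched to an archived finger.

FINGER_ARCHIVE = {          # triplet -> hypothetical finger module id
    "GGA": "ZF-a", "TGC": "ZF-b", "AAA": "ZF-c", "GCT": "ZF-d",
}

def assemble_zfp(target):
    """Cover a 3n-bp target with finger modules, listed 5' to 3'."""
    if len(target) % 3:
        raise ValueError("target length must be a multiple of 3")
    triplets = [target[i:i + 3] for i in range(0, len(target), 3)]
    return [FINGER_ARCHIVE[t] for t in triplets]

# a 12-bp target needs four fingers; 6-8 fingers reach ~18-24 bp
print(assemble_zfp("GGATGCAAAGCT"))  # ['ZF-a', 'ZF-b', 'ZF-c', 'ZF-d']
```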
Zinc finger nucleases are research and development tools that have already been used to modify a range
of genomes, in particular by the laboratories in the Zinc Finger Consortium. The US company Sangamo
BioSciences uses zinc finger nucleases to carry out research into the genetic engineering of stem cells
and the modification of immune cells for therapeutic purposes. Modified T lymphocytes are currently
undergoing phase I clinical trials to treat a type of brain tumor (glioblastoma) and in the fight against
AIDS.
3) Transcription Activator-like Effector Nuclease (TALEN) :
Transcription activator-like effector nucleases (TALENs) are specific DNA-binding proteins that feature
an array of 33 or 34-amino acid repeats. TALENs are artificial restriction enzymes designed by fusing the
DNA cutting domain of a nuclease to TALE domains, which can be tailored to specifically recognize a
unique DNA sequence. These fusion proteins serve as readily targetable “DNA scissors” for gene editing
applications, enabling targeted genome modifications such as sequence insertion,
deletion, repair and replacement in living cells. The DNA-binding domains, which can be designed to
bind any desired DNA sequence, come from TAL effectors, DNA-binding proteins secreted by the plant
pathogen Xanthomonas spp. TAL effectors consist of repeated domains, each of which contains a
highly conserved sequence of 34 amino acids and recognizes a single DNA nucleotide within the target
site. The nuclease can create double strand breaks at the target site that can be repaired by error-prone
non-homologous end-joining (NHEJ), resulting in gene disruptions through the introduction of small
insertions or deletions. Each repeat is conserved, with the exception of the so-called repeat variable di-
residues (RVDs) at amino acid positions 12 and 13. The RVDs determine the DNA sequence to which the
TALE will bind. This simple one-to-one correspondence between the TALE repeats and the
corresponding DNA sequence makes the process of assembling repeat arrays to recognize novel DNA
sequences straightforward. These TALEs can be fused to the catalytic domain from a DNA nuclease, FokI,
to generate a transcription activator-like effector nuclease (TALEN). The resultant TALEN constructs
combine specificity and activity, effectively generating engineered sequence-specific nucleases that bind
and cleave DNA sequences only at pre-selected sites. The TALEN target recognition system is based on
an easy-to-predict code. TAL nucleases are specific to their target due in part to the length of their
30-plus base pair binding site. TALEN editing can be performed within a 6 base pair range of any single
nucleotide in the entire genome.
TALEN constructs are used in a similar way to designed zinc finger nucleases, and have three advantages
in targeted mutagenesis : (1) DNA binding specificity is higher, (2) off-target effects are lower, and (3)
construction of DNA-binding domains is easier.
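The one-to-one RVD code described above can be sketched directly. The four canonical RVDs and the bases they specify (NI→A, HD→C, NG→T, NN→G) are the commonly used mapping in TALE design; the simple table below ignores the fact that NN also tolerates A.

```python
# Sketch of the TALE repeat-variable di-residue (RVD) code: each repeat
# carries one RVD at positions 12-13, and each RVD specifies one base.

RVD_CODE = {"NI": "A", "HD": "C", "NG": "T", "NN": "G"}

def rvds_to_target(rvds):
    """Predict the DNA sequence bound by an array of TALE repeats."""
    return "".join(RVD_CODE[r] for r in rvds)

repeats = ["NI", "HD", "NG", "NN", "HD", "NI"]
print(rvds_to_target(repeats))  # ACTGCA
```

This one-to-one correspondence is exactly what makes assembling repeat arrays for new target sequences straightforward compared with zinc fingers, where neighbouring fingers influence one another.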
4) Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) :
CRISPRs (Clustered Regularly Interspaced Short Palindromic Repeats) are genetic elements that bacteria
use as a kind of acquired immunity to protect against viruses. They consist of short sequences that
originate from viral genomes and have been incorporated into the bacterial genome. Cas (CRISPR
associated proteins) process these sequences and cut matching viral DNA sequences. By introducing
plasmids containing Cas genes and specifically constructed CRISPRs into eukaryotic cells, the eukaryotic
genome can be cut at any desired position.
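How a target site is chosen can be sketched for the common SpCas9 case, in which the ~20-nt protospacer must sit immediately 5' of an NGG PAM. The sketch scans only one strand of an invented sequence; real guide-design tools also check the reverse strand and score off-targets.

```python
# Sketch of SpCas9 target selection on one strand: find each NGG PAM
# and take the 20 nt immediately 5' of it as the protospacer.

def find_protospacers(seq, guide_len=20):
    """Return (protospacer, pam, position) for each NGG PAM on one strand."""
    hits = []
    for i in range(guide_len, len(seq) - 2):
        pam = seq[i:i + 3]
        if pam[1:] == "GG":                      # N-G-G pattern
            hits.append((seq[i - guide_len:i], pam, i))
    return hits

# invented test sequence: 20 nt followed by a TGG PAM
seq = "A" * 20 + "TGG" + "C" * 10
for proto, pam, pos in find_protospacers(seq):
    print(proto, pam, pos)
```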
Applications of Genome Editing :-
As of 2012 efficient genome editing had been developed for a wide range of experimental systems
ranging from plants to animals, often beyond clinical interest, and was becoming a standard
experimental strategy in research labs. The recent generation of rat, zebrafish, maize and tobacco ZFN-
mediated mutants and the improvements in TALEN-based approaches testify to the significance of the
methods, and the list is expanding rapidly. Genome editing with engineered nucleases will likely
contribute to many fields of life sciences from studying gene functions in plants and animals to gene
therapy in humans. For instance, the field of synthetic biology which aims to engineer cells and
organisms to perform novel functions, is likely to benefit from the ability of engineered nuclease to add
or remove genomic elements and therefore create complex systems. In addition, gene functions can be
studied using stem cells with engineered nucleases.
Listed below are some specific tasks this method can carry out :
• Targeted gene mutation
• Gene therapy
• Creating chromosome rearrangement
• Study gene function with stem cells
• Transgenic animals
• Endogenous gene labeling
• Targeted transgene addition
1) Targeted gene modification in animals :
The combination of recent discoveries in genetic engineering, particularly gene editing and the latest
improvement in bovine reproduction technologies (e.g. in vitro embryo culture) allows for genome
editing directly in fertilized oocytes using synthetic highly specific endonucleases. RNA-guided
endonucleases : clustered regularly interspaced short palindromic repeats associated Cas9
(CRISPR/Cas9) are a new tool, further increasing the range of methods available. In particular,
CRISPR/Cas9-engineered endonucleases allow the use of multiple guide RNAs for simultaneous
knockouts (KO) in one step by cytoplasmic direct injection (CDI) into mammalian zygotes.
Furthermore, gene editing can be applied to certain types of fish in aquaculture such as Atlantic salmon.
Gene editing in fish is currently experimental, but the possibilities include growth, disease resistance,
sterility, controlled reproduction, and colour. Selecting for these traits can allow for a more sustainable
environment and better welfare for the fish.
AquAdvantage salmon is a genetically modified Atlantic salmon developed by AquaBounty Technologies.
The growth hormone-regulating gene in the Atlantic salmon is replaced with the growth hormone-
regulating gene from the Pacific Chinook salmon and a promoter sequence from the ocean pout.
Thanks to the parallel development of single-cell transcriptomics, genome editing and new stem cell
models we are now entering a scientifically exciting period where functional genetics is no longer
restricted to animal models but can be performed directly in human samples. Single-cell gene
expression analysis has resolved a transcriptional road-map of human development from which key
candidate genes are being identified for functional studies. Using global transcriptomics data to guide
experimentation, the CRISPR based genome editing tool has made it feasible to disrupt or remove key
genes in order to elucidate function in a human setting.
2) Targeted gene modification in plants :
Genome editing using Meganuclease, ZFNs, and TALEN provides a new strategy for genetic manipulation
in plants and are likely to assist in the engineering of desired plant traits by modifying endogenous
genes. For instance, site-specific gene addition in major crop species can be used for ‘trait stacking’
whereby several desired traits are physically linked to ensure their co-segregation during the breeding
processes. Progress in such cases has recently been reported in Arabidopsis thaliana and Zea mays. In
Arabidopsis thaliana, using ZFN-assisted gene targeting, two herbicide-resistance genes (tobacco
acetolactate synthase SuRA and SuRB) were introduced into the SuR loci, with up to 2% of transformed
cells carrying mutations. In Zea mays, disruption of the target locus was achieved by ZFN-induced DSBs and the
resulting NHEJ. ZFN was also used to drive herbicide-tolerance gene expression cassette (PAT) into the
targeted endogenous locus IPK1 in this case. Such genome modification observed in the regenerated
plants has been shown to be inheritable and was transmitted to the next generation. A potentially
successful example of the application of genome editing techniques in crop improvement can be found
in banana, where scientists used CRISPR/Cas9 editing to inactivate the endogenous banana streak virus
in the B genome of banana (Musa spp.) to overcome a major challenge in banana breeding.
In addition, TALEN-based genome engineering has been extensively tested and optimized for use in
plants. TALEN fusions have also been used by a U.S. food ingredient company, Calyxt, to improve the
quality of soybean oil products and to increase the storage potential of potatoes.
Several optimizations need to be made in order to improve editing plant genomes using ZFN-mediated
targeting. There is a need for reliable design and subsequent test of the nucleases, the absence of
toxicity of the nucleases, the appropriate choice of the plant tissue for targeting, the routes of induction
of enzyme activity, the lack of off-target mutagenesis, and a reliable detection of mutated cases.
A common delivery method for CRISPR/Cas9 in plants is Agrobacterium-based transformation. T-DNA is
introduced directly into the plant genome by a T4SS mechanism. Cas9 and gRNA-based expression
cassettes are cloned into Ti plasmids, which are transformed into Agrobacterium for plant application. To
improve Cas9 delivery in live plants, viruses are being used for more effective transgene delivery.
CRISPR GENE EDITING :-
CRISPR (which is an acronym for clustered regularly interspaced short palindromic repeats) is a family of
DNA sequences found in the genomes of prokaryotic organisms such as bacteria and archaea. These
sequences are derived from DNA fragments of bacteriophages that had previously infected the
prokaryote. They are used to detect and destroy DNA from similar bacteriophages during subsequent
infections. Hence these sequences play a key role in the antiviral (i.e. anti-phage) defense system of
prokaryotes and provide a form of acquired immunity. CRISPRs are found in approximately 50% of
sequenced bacterial genomes and nearly 90% of sequenced archaea.
Cas9 (or “CRISPR-associated protein 9”) is an enzyme that uses CRISPR sequences as a guide to recognize
and cleave specific strands of DNA that are complementary to the CRISPR sequence. Cas9 enzymes
together with CRISPR sequences form the basis of a technology known as CRISPR-Cas9 that can be used
to edit genes within organisms. This editing process has a wide variety of applications including basic
biological research, development of biotechnological products, and treatment of diseases. The
development of the CRISPR-Cas9 genome editing technique was recognized by the Nobel Prize in
Chemistry in 2020 which was awarded to Emmanuelle Charpentier and Jennifer Doudna.
Evolution of CRISPR :-
The cas genes in the adaptor and effector modules of the CRISPR-Cas system are believed to have
evolved from two different ancestral modules. A transposon-like element called casposon encoding the
Cas1-like integrase and potentially other components of the adaptation module was inserted next to the
ancestral effector module, which likely functioned as an independent innate immune system. The highly
conserved cas1 and cas2 genes of the adaptor module evolved from the ancestral module while a
variety of class 1 effector cas genes evolved from the ancestral effector module. The evolution of these
various class 1 effector module cas genes was guided by various mechanisms, such as duplication
events. On the other hand, each type of class 2 effector module arose from subsequent independent
insertions of mobile genetic elements. These mobile genetic elements took the place of the multiple
gene effector modules to create single gene effector modules that produce large proteins which
perform all the necessary tasks of the effector module. The spacer regions of CRISPR-Cas systems are
taken directly from foreign mobile genetic elements and thus their long term evolution is hard to trace.
The non-random evolution of these spacer regions has been found to be highly dependent on the
environment and the particular foreign mobile genetic elements it contains.
CRISPR/Cas can immunize bacteria against certain phages and thus halt transmission. For this reason,
Koonin described CRISPR/Cas as a Lamarckian inheritance mechanism. This was disputed by a critic who
noted, “We should remember [Lamarck] for the good he contributed to science, not for things that
resemble his theory only superficially. Indeed, thinking of CRISPR and other phenomena as Lamarckian
only obscures the simple and elegant way evolution really works”. But as more recent studies have been
conducted, it has become apparent that the acquired spacer regions of CRISPR-Cas systems are indeed a
form of Lamarckian evolution because they are genetic mutations that are acquired and then passed on.
On the other hand, the evolution of the Cas gene machinery that facilitates the system evolves through
classic Darwinian evolution.
Mechanism of CRISPR :-
CRISPR-Cas immunity is a natural process of bacteria and archaea. CRISPR-Cas prevents bacteriophage
infection, conjugation and natural transformation by degrading foreign nucleic acids that enter the cell.
1) Spacer acquisition :
When a microbe is invaded by a bacteriophage, the first stage of the immune response is to capture
phage DNA and insert it into a CRISPR locus in the form of a spacer. Cas1 and Cas2 are found in both
types of CRISPR-Cas immune systems, which indicates that they are involved in spacer acquisition.
Mutation studies confirmed this hypothesis, showing that removal of cas1 or cas2 stopped spacer
acquisition, without affecting CRISPR immune response.
Multiple Cas1 proteins have been characterised and their structures resolved. Cas1 proteins have
diverse amino acid sequences. However, their crystal structures are similar and all purified Cas1 proteins
are metal-dependent nucleases/integrases that bind to DNA in a sequence-independent manner.
Representative Cas2 proteins have been characterised and possess either single-stranded RNA (ssRNA)-
or double-stranded DNA (dsDNA)-specific endoribonuclease activity.
Fig 2. The stages of CRISPR immunity for each of the three major types of adaptive immunity.
1) Acquisition begins by recognition of invading DNA by Cas1 and Cas2 and cleavage of a protospacer.
2) The protospacer is ligated to the direct repeat adjacent to the leader sequence and
3) single strand extension repairs the CRISPR and duplicates the direct repeat. The crRNA processing and
interference stages occur differently in each of the three major CRISPR systems.
4) The primary CRISPR transcript is cleaved by cas genes to produce crRNAs.
5) In type I systems Cas6e/Cas6f cleave at the junction of ssRNA and dsRNA formed by hairpin loops in
the direct repeat. Type II systems use a trans-activating (tracr) RNA to form dsRNA, which is cleaved
by Cas9 and RNase III. Type III systems use a Cas6 homolog that does not require hairpin loops in the
direct repeat for cleavage.
6) In type II and type III systems secondary trimming is performed at either the 5’ or 3’ end to produce
mature crRNAs.
7) Mature crRNAs associate with Cas proteins to form interference complexes.
8) In type I and type II systems, interactions between the protein and PAM sequence are required for
degradation of invading DNA. Type III systems do not require a PAM for successful degradation and in
type III-A systems base pairing occurs between the crRNA and mRNA rather than the DNA, targeted
by type III-B systems.
In the I-E system of E. coli Cas1 and Cas2 form a complex where a Cas2 dimer bridges two Cas1 dimers.
In this complex Cas2 performs a non-enzymatic scaffolding role, binding double-stranded fragments of
invading DNA, while Cas1 binds the single-stranded flanks of the DNA and catalyses their integration into
CRISPR arrays. New spacers are usually added at the beginning of the CRISPR next to the leader
sequence, creating a chronological record of viral infections. In E. coli, a histone-like protein called
integration host factor (IHF), which binds to the leader sequence, is responsible for the accuracy of this
integration. IHF also enhances integration efficiency in the type I-F system of Pectobacterium
atrosepticum but in other systems different host factors may be required.
Protospacer adjacent motifs :
Bioinformatic analysis of regions of phage genomes that were excised as spacers (termed protospacers)
revealed that they were not randomly selected but instead were found adjacent to short (3–5 bp) DNA
sequences termed protospacer adjacent motifs (PAM). Analysis of CRISPR-Cas systems showed PAMs to
be important for type I and type II, but not type III systems during acquisition. In type I and type II
systems, protospacers are excised at positions adjacent to a PAM sequence, with the other end of the
spacer cut using a ruler mechanism, thus maintaining the regularity of the spacer size in the CRISPR
array. The conservation of the PAM sequence differs between CRISPR-Cas systems and appears to be
evolutionarily linked to Cas1 and the leader sequence.
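The PAM-adjacent excision with a ruler mechanism described above can be sketched in a few lines of code. This is a minimal illustration, not the machinery of any particular system: the `AAG` PAM motif and the 32-bp spacer length are hypothetical placeholders.

```python
import re

def find_protospacers(phage_dna, pam="AAG", spacer_len=32):
    """Return candidate protospacers lying immediately 5' of each PAM.

    The 'ruler mechanism' in the text is modelled by taking a fixed-length
    spacer measured from the PAM; the PAM motif and spacer length here
    are illustrative, not those of any real CRISPR-Cas system.
    """
    hits = []
    for m in re.finditer(pam, phage_dna):
        start = m.start() - spacer_len
        if start >= 0:  # protospacer must fit upstream of the PAM
            hits.append(phage_dna[start:m.start()])
    return hits
```

Because every spacer is cut to the same length from the PAM, the regularity of spacer sizes in the array falls out of the ruler automatically.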
New spacers are added to a CRISPR array in a directional manner, occurring preferentially, but not
exclusively, adjacent to the leader sequence. Analysis of the type I-E system from E. coli demonstrated
that the first direct repeat adjacent to the leader sequence, is copied, with the newly acquired spacer
inserted between the first and second direct repeats.
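The leader-proximal insertion with duplication of the first direct repeat can be modelled directly. In this sketch the array is a simple list and the sequences are invented for illustration; real integration is catalysed by the Cas1-Cas2 complex described above.

```python
def acquire_spacer(array, new_spacer):
    """Insert a new spacer at the leader end of a CRISPR array.

    The array is modelled as a list alternating direct repeats and
    spacers, leader-proximal end first. As observed in the E. coli
    type I-E system, the leader-adjacent repeat is copied and the new
    spacer is placed between the first and second repeats.
    """
    repeat = array[0]                    # the first direct repeat
    return [repeat, new_spacer] + array  # duplicated repeat + new spacer

array = ["GTGTTCC", "spacer1", "GTGTTCC", "spacer2", "GTGTTCC"]
array = acquire_spacer(array, "spacerNEW")
# leader-proximal order is now: repeat, spacerNEW, repeat, spacer1, ...
```

Because each new spacer lands at the leader end, reading the array from the leader outward recovers the chronological record of infections mentioned earlier.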
The PAM sequence appears to be important during spacer insertion in type I-E systems. That sequence
contains a strongly conserved final nucleotide (nt) adjacent to the first nt of the protospacer. This nt
becomes the final base in the first direct repeat. This suggests that the spacer acquisition machinery
generates single-stranded overhangs in the second-to-last position of the direct repeat and in the PAM
during spacer insertion. However, not all CRISPR-Cas systems appear to share this mechanism as PAMs
in other organisms do not show the same level of conservation in the final position. It is likely that in
those systems, a blunt end is generated at the very end of the direct repeat and the protospacer during
acquisition.
Insertion variants :
Analysis of Sulfolobus solfataricus CRISPRs revealed further complexities to the canonical model of
spacer insertion, as one of its six CRISPR loci inserted new spacers randomly throughout its CRISPR array,
as opposed to inserting closest to the leader sequence.
Multiple CRISPRs contain many spacers to the same phage. The mechanism that causes this
phenomenon was discovered in the type I-E system of E. coli. A significant enhancement in spacer
acquisition was detected where spacers already target the phage, even when those spacers mismatch the
protospacer.
This ‘priming’ requires the Cas proteins involved in both acquisition and interference to interact with
each other. Newly acquired spacers that result from the priming mechanism are always found on the
same strand as the priming spacer. This observation led to the hypothesis that the acquisition machinery
slides along the foreign DNA after priming to find a new protospacer.
2) crRNA Processing / Biogenesis :
CRISPR-RNA (crRNA), which later guides the Cas nuclease to the target during the interference step,
must be generated from the CRISPR sequence. The crRNA is initially transcribed as part of a single long
transcript encompassing much of the CRISPR array. This transcript is then cleaved by Cas proteins to
form crRNAs. The mechanism to produce crRNAs differs among CRISPR/Cas systems. In type I-E and type
I-F systems, the proteins Cas6e and Cas6f respectively, recognize stem-loops created by the pairing of
identical repeats that flank the crRNA. These Cas proteins cleave the longer transcript at the edge of the
paired region, leaving a single crRNA along with a small remnant of the paired repeat region.
Type III systems also use Cas6, however their repeats do not produce stem-loops. Cleavage instead
occurs by the longer transcript wrapping around the Cas6 to allow cleavage just upstream of the repeat
sequence.
Type II systems lack the Cas6 gene and instead utilize RNase III for cleavage. Functional type II systems
encode an extra small RNA that is complementary to the repeat sequence, known as a trans-activating
crRNA (tracrRNA). Transcription of the tracrRNA and the primary CRISPR transcript results in base
pairing and the formation of dsRNA at the repeat sequence, which is subsequently targeted by RNase III
to produce crRNAs. Unlike the other two systems the crRNA does not contain the full spacer, which is
instead truncated at one end.
CrRNAs associate with Cas proteins to form ribonucleoprotein complexes that recognize foreign nucleic
acids. CrRNAs show no preference between the coding and non-coding strands, which is indicative of an
RNA-guided DNA-targeting system. The type I-E complex (commonly referred to as Cascade) requires
five Cas proteins bound to a single crRNA.
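The repeat-anchored cleavage that releases mature crRNAs from the long primary transcript can be sketched as string processing. The 8-nt repeat handle mimics the type I-E remnant left after Cas6e cleavage, but the sequences and handle length here are illustrative assumptions.

```python
def process_pre_crrna(transcript, repeat, handle_len=8):
    """Cleave a primary CRISPR transcript into mature crRNAs.

    Cas6-like cleavage is modelled as a cut inside each repeat, leaving
    the last `handle_len` bases of the repeat attached 5' of each spacer
    (the small 'remnant of the paired repeat region' in the text; the
    8-nt handle mimics type I-E but is illustrative here).
    """
    spacers = [s for s in transcript.split(repeat) if s]
    handle = repeat[-handle_len:]          # repeat remnant kept on each crRNA
    return [handle + s for s in spacers]
```

Each returned crRNA carries a spacer plus a portion of repeat, which is exactly the feature used later to distinguish self from non-self during interference.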
3) Interference :
During the interference stage in type I systems the PAM sequence is recognized on the crRNA-
complementary strand and is required along with crRNA annealing. In type I systems correct base
pairing between the crRNA and the protospacer signals a conformational change in Cascade that recruits
Cas3 for DNA degradation.
Type II systems rely on a single multifunctional protein, Cas9, for the interference step. Cas9 requires
both the crRNA and the tracrRNA to function and cleaves DNA using its dual HNH and RuvC/RNaseH-like
endonuclease domains. Base pairing between the PAM and the phage genome is required in type II
systems. However, the PAM is recognized on the same strand as the crRNA (the opposite strand to type
I systems).
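As a sketch of type II target recognition, the following scans one strand for a crRNA-spacer match followed by an NGG PAM (the S. pyogenes Cas9 motif) and reports the blunt-cut position 3 bp upstream of the PAM. It deliberately ignores the tracrRNA, mismatch tolerance, and the reverse strand.

```python
import re

def cas9_sites(genome, spacer, pam=r"[ACGT]GG"):
    """Find Cas9 cleavage sites: a protospacer matching the crRNA
    spacer, immediately followed by an NGG PAM (S. pyogenes Cas9).

    Returns 0-based positions of the blunt cut, 3 bp upstream of the
    PAM. Only the strand shown is scanned in this sketch.
    """
    sites = []
    for m in re.finditer(re.escape(spacer) + pam, genome):
        sites.append(m.start() + len(spacer) - 3)  # cut 3 bp 5' of the PAM
    return sites
```

Note how the PAM sits in the target DNA, not in the spacer itself: a chromosome-borne CRISPR array lacks PAMs next to its spacers, which is one reason the cell's own array is not cleaved.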
Type III systems, like type I systems, require six or seven Cas proteins binding to crRNAs. The type III systems
analysed from S. solfataricus and P. furiosus both target the mRNA of phages rather than phage DNA
genome, which may make these systems uniquely capable of targeting RNA-based phage genomes. Type
III systems were also found to target DNA in addition to RNA using a different Cas protein in the
complex, Cas10. The DNA cleavage was shown to be transcription dependent.
The mechanism for distinguishing self from foreign DNA during interference is built into the crRNAs and
is therefore likely common to all three systems. Throughout the distinctive maturation process of each
major type, all crRNAs contain a spacer sequence and some portion of the repeat at one or both ends. It
is the partial repeat sequence that prevents the CRISPR-Cas system from targeting the chromosome as
base pairing beyond the spacer sequence signals self and prevents DNA cleavage. RNA-guided CRISPR
enzymes are classified as type V restriction enzymes.
Identification of CRISPR :-
CRISPRs are widely distributed among bacteria and archaea and show some sequence similarities. Their
most notable characteristic is their repeating spacers and direct repeats. This characteristic makes
CRISPRs easily identifiable in long sequences of DNA, since the number of repeats decreases the
likelihood of a false positive match.
Analysis of CRISPRs in metagenomic data is more challenging, as CRISPR loci do not typically assemble,
due to their repetitive nature or through strain variation, which confuses assembly algorithms. Where
many reference genomes are available, polymerase chain reaction (PCR) can be used to amplify CRISPR
arrays and analyse spacer content. However, this approach yields information only for specifically
targeted CRISPRs and for organisms with sufficient representation in public databases to design reliable
polymerase chain reaction (PCR) primers. Degenerate repeat-specific primers can be used to amplify
CRISPR spacers directly from environmental samples; amplicons containing two or three spacers can be
then computationally assembled to reconstruct long CRISPR arrays.
The alternative is to extract and reconstruct CRISPR arrays from shotgun metagenomic data. This is
computationally more difficult, particularly with second generation sequencing technologies (e.g. 454,
Illumina), as the short read lengths prevent more than two or three repeat units appearing in a single
read. CRISPR identification in raw reads has been achieved using purely de novo identification or by
using direct repeat sequences in partially assembled CRISPR arrays from contigs (overlapping DNA
segments that together represent a consensus region of DNA) and direct repeat sequences from
published genomes as a hook for identifying direct repeats in individual reads.
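The "repeats at regular spacing" signal that makes CRISPR arrays easy to spot can be captured by a naive k-mer scan. This is only a sketch of the de novo idea; real tools refine repeat boundaries and handle sequencing error, and all thresholds below are illustrative.

```python
from collections import defaultdict

def find_direct_repeats(seq, k=10, min_copies=3, min_gap=15, max_gap=60):
    """Naive de novo CRISPR detector: report k-mers that recur at least
    `min_copies` times separated by spacer-sized gaps.

    Overlapping runs (e.g. homopolymers) are rejected because their
    gaps fall outside the plausible spacer-length window.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    hits = {}
    for kmer, pos in positions.items():
        gaps = [b - a - k for a, b in zip(pos, pos[1:])]
        if len(pos) >= min_copies and all(min_gap <= g <= max_gap for g in gaps):
            hits[kmer] = pos
    return hits
```

Requiring several evenly spaced copies is what drives down the false-positive rate noted in the text: a single chance match cannot satisfy the recurrence constraint.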
Applications of CRISPR :-
1) CRISPR gene editing :
CRISPR technology has been applied in the food and farming industries to engineer probiotic cultures
and to immunize industrial cultures (for yogurt, for instance) against infections. It is also being used in
crops to enhance yield, drought tolerance and nutritional value.
By the end of 2014 some 1000 research papers had been published that mentioned CRISPR. The
technology had been used to functionally inactivate genes in human cell lines and cells, to study Candida
albicans, to modify yeasts used to make biofuels and to genetically modify crop strains. Hsu and his
colleagues state that the ability to manipulate the genetic sequences allows for reverse engineering that
can positively affect biofuel production. CRISPR can also be used to change mosquitoes so they cannot
transmit diseases such as malaria. CRISPR-based approaches utilizing Cas12a have recently been utilized
in the successful modification of a broad number of plant species.
2) CRISPR as diagnostic tool :
CRISPR-associated nucleases have been shown to be useful as molecular testing tools due to their ability
to specifically target nucleic acid sequences in a high background of non-target sequences. In 2016, the
Cas9 nuclease was used to deplete unwanted nucleotide sequences in next-generation sequencing
libraries while requiring only 250 picograms of initial RNA input. Beginning in 2017, CRISPR associated
nucleases were also used for direct diagnostic testing of nucleic acids, down to single molecule
sensitivity.
TALEN GENE EDITING :-
Transcription activator-like effector nucleases (TALEN) are restriction enzymes that can be engineered to
cut specific sequences of DNA. They are made by fusing a TAL effector DNA-binding domain to a DNA
cleavage domain (a nuclease which cuts DNA strands). Transcription activator-like effectors (TALEs) can
be engineered to bind to practically any desired DNA sequence, so when combined with a nuclease,
DNA can be cut at specific locations. The restriction enzymes can be introduced into cells, for use in
gene editing or for genome editing in situ, a technique known as genome editing with engineered
nucleases. Alongside zinc finger nucleases and CRISPR/Cas9, TALEN is a prominent tool in the field of
genome editing.
1) TALE DNA-binding domain :-
TAL effectors are proteins that are secreted by Xanthomonas bacteria via their type III secretion system
when they infect plants. The DNA binding domain contains a repeated highly conserved 33–34 amino
acid sequence with divergent 12th and 13th amino acids. These two positions, referred to as the Repeat
Variable Diresidue (RVD), are highly variable and show a strong correlation with specific nucleotide
recognition. This straightforward relationship between amino acid sequence and DNA recognition has
allowed for the engineering of specific DNA-binding domains by selecting a combination of repeat
segments containing the appropriate RVDs. Notably, slight changes in the RVD and the incorporation of
“nonconventional” RVD sequences can improve targeting specificity.
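The one-RVD-per-base cipher described above makes TALE design essentially a translation exercise. The sketch below uses the canonical code (NI→A, HD→C, NG→T, NN→G); it omits the "nonconventional" RVDs mentioned in the text and the N-terminal requirement for a 5' T that real designs must respect.

```python
# Canonical RVD-to-base cipher; each RVD occupies positions 12-13
# of one 33-34 amino acid TALE repeat.
RVD_FOR_BASE = {"A": "NI", "C": "HD", "T": "NG", "G": "NN"}

def design_rvds(target):
    """Translate a target DNA sequence into the repeat-variable
    diresidues (RVDs) of a TALE array that would recognise it."""
    return [RVD_FOR_BASE[base] for base in target.upper()]
```

One repeat per base is exactly why TALE engineering is simpler than zinc fingers, which must be selected against nucleotide triplets.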
2) DNA cleavage domain :-
The non-specific DNA cleavage domain from the end of the FokI endonuclease can be used to construct
hybrid nucleases that are active in a yeast assay. These reagents are also active in plant cells and in
animal cells. Initial TALEN studies used the wild-type FokI cleavage domain, but some subsequent TALEN
studies also used FokI cleavage domain variants with mutations designed to improve cleavage specificity
and cleavage activity. The FokI domain functions as a dimer, requiring two constructs with unique DNA
binding domains for sites in the target genome with proper orientation and spacing. Both the number of
amino acid residues between the TALE DNA binding domain and the FokI cleavage domain and the
number of bases between the two individual TALEN binding sites appear to be important parameters for
achieving high levels of activity.
Engineering TALEN constructs :-
The simple relationship between amino acid sequence and DNA recognition of the TALE binding domain
allows for the efficient engineering of proteins. In this case, artificial gene synthesis is problematic
because of improper annealing of the repetitive sequence found in the TALE binding domain. One
solution to this is to use a publicly available software program (DNA Works) to calculate oligonucleotides
suitable for assembly in a two step PCR oligonucleotide assembly followed by whole gene amplification.
A number of modular assembly schemes for generating engineered TALE constructs have also been
reported. Both methods offer a systematic approach to engineering DNA binding domains that is
conceptually similar to the modular assembly method for generating zinc finger DNA recognition
domains.
Mechanism of TALENs :-
TALEN can be used to edit genomes by inducing double-strand breaks (DSB), which cells respond to with
repair mechanisms.
Non-homologous end joining (NHEJ) directly ligates DNA from either side of a double-strand break
where there is very little or no sequence overlap for annealing. This repair mechanism induces errors in
the genome via indels (insertion or deletion), or chromosomal rearrangement; any such errors may
render the gene products coded at that location non-functional. Because this activity can vary
depending on the species, cell type, target gene, and nuclease used, it should be monitored when
designing new systems. A simple heteroduplex cleavage assay can be run which detects any difference
between two alleles amplified by PCR. Cleavage products can be visualized on simple agarose gels or
slab gel systems.
Alternatively, DNA can be introduced into a genome through NHEJ in the presence of exogenous double-
stranded DNA fragments.
Homology directed repair can also introduce foreign DNA at the DSB as the transfected double-stranded
sequences are used as templates for the repair enzymes.
Fig 3. Workflow of genome editing of Your Favorite Gene (YFG) using TALEN. The target
sequence is identified, a corresponding TALEN sequence is engineered and inserted into
a plasmid. The plasmid is inserted into the target cell where it is translated to produce
the functional TALEN, which enters the nucleus and binds and cleaves the target
sequence. Depending on the application, this can be used to introduce an error (to knock
out a target gene) or to introduce a new DNA sequence into the target gene.
Applications of TALENs :-
TALEN has been used to efficiently modify plant genomes, creating economically important food crops
with favorable nutritional qualities. They have also been harnessed to develop tools for the production
of biofuels. In addition, it has been used to engineer stably modified human embryonic stem cell and
induced pluripotent stem cell (IPSCs) clones and human erythroid cell lines, to generate knockout C.
elegans, knockout rats, knockout mice, and knockout zebrafish. Moreover, the method can be used to
generate knockin organisms. Wu et al. obtained Sp110 knockin cattle using TALEN nickases to confer
increased resistance to tuberculosis. This approach has also been used to generate knockin rats by
TALEN mRNA microinjection in one-cell embryos.
TALEN has also been utilized experimentally to correct the genetic errors that underlie disease. For
example, it has been used in vitro to correct the genetic defects that cause disorders such as sickle cell
disease, Xeroderma pigmentosum, and epidermolysis bullosa. Recently, it was shown that TALEN can be
used as tools to harness the immune system to fight cancers; TALEN-mediated targeting can generate T
cells that are resistant to chemotherapeutic drugs and show anti-tumor activity.
In theory, the genome-wide specificity of engineered TALEN fusions allows for correction of errors at
individual genetic loci via homology-directed repair from a correct exogenous template. In reality,
however, the in situ application of TALEN is currently limited by the lack of an efficient delivery
mechanism, unknown immunogenic factors, and uncertainty in the specificity of TALEN binding.
Another emerging application of TALEN is its ability to combine with other genome engineering tools,
such as meganucleases. The DNA binding region of a TAL effector can be combined with the cleavage
domain of a meganuclease to create a hybrid architecture combining the ease of engineering and highly
specific DNA binding activity of a TAL effector with the low site frequency and specificity of a
meganuclease.
In comparison to other genome editing techniques TALEN falls in the middle in terms of difficulty and
cost. Unlike ZFNs, TALEN recognizes single nucleotides. It is far more straightforward to engineer
interactions between TALEN DNA binding domains and their target nucleotides than it is to create
interactions between ZFNs and their target nucleotide triplets. On the other hand, CRISPR relies on
ribonucleotide complex formation instead of protein/DNA recognition. gRNAs can target nearly any
sequence in the genome and they can be cheaply produced, thus making CRISPR more efficient and less
expensive than both TALEN and ZFN. TALEN is ultimately 200 times more expensive than CRISPR and
takes several months more to perform.
UNIT : 2
2.1 PROTEOMICS STUDY IN RELATION TO BIOINFORMATICS
Proteomics is a fundamental science towards which many disciplines direct their efforts. Proteins play a
key role in biological function, and their study makes it possible to understand the mechanisms
underlying many biological events (human and animal diseases, factors that influence plant and
bacterial growth). Because the investigative approach is complex and involves various technologies, a
large amount of data is produced. Proteomics has evolved rapidly and is now in a phase of unparalleled
growth, reflected in the amount of data generated by each experiment. This approach has provided, for
the first time, unprecedented opportunities to address the biology of humans, animals, plants and
micro-organisms at the system level. Bioinformatics applied to proteomics offers the management,
elaboration and integration of these huge amounts of data.
Thus, the role of bioinformatics is fundamental in order to reduce the analysis time and to provide
statistically significant results. To process data efficiently, new software packages and algorithms are
continuously being developed to improve protein identification, characterization and quantification in
terms of high-throughput and statistical accuracy. However, many limitations exist concerning
bioinformatic spectral data elaboration. In particular, for the analysis of plant proteins extensive data
elaboration is necessary due to the lack of structural information in the proteomic and genomic public
databases.
Bioinformatics in proteomics :-
Mass spectrometry has become a very important tool in proteomics: it has made rapid progress as an
analytical technique, particularly over the last decade, with many new types of hardware being
introduced. Moreover, constant improvements have increased MS sensitivity, selectivity and mass-
measurement accuracy. The principles of mass spectrometry can be summarized by the following four
functions of the mass spectrometer:
a) peptide ionization;
b) analysis of peptide ions according to their mass/charge ratio (m/z) values;
c) acquisition of ion mass data;
d) measurement of relative ion abundance.
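Function (b), m/z analysis, rests on a simple relationship for protonated peptide ions: a peptide of neutral mass M observed as [M + zH]^z+ appears at (M + z·mH)/z. The sketch below computes this; the example mass is arbitrary.

```python
PROTON = 1.007276  # mass of a proton, in Da

def mz(neutral_mass, charge):
    """m/z of a positive peptide ion [M + zH]^z+ as seen by the mass
    analyzer: neutral mass plus z proton masses, divided by z."""
    return (neutral_mass + charge * PROTON) / charge

# A 2000 Da peptide appears near m/z 2001 singly charged
# and near m/z 1001 doubly charged.
```

This is why the same peptide produces several peaks in an ESI spectrum: each charge state maps the one neutral mass to a different m/z.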
Ionization is fundamental as the physics of MS relies upon the molecule of interest being charged,
resulting in the formation of positive ions, and, depending on the ionization method, fragment ions.
These ion species are visualized according to their corresponding m/z ratio(s), and their masses
assigned. Finally, the measurement of relative ion abundance, based on either peak height or peak area
of sample(s) and internal standard(s), leads to a semi-quantitative result.
Typical procedure for proteome analysis :-
The proteome data elaboration procedure differs depending on the study target. In general, studies can
be qualitative, to characterize the organism's expressed proteome, or quantitative, to detect potential
biomarkers related to disease or other properties of the organism.
The principal proteomics studies are :
• Full proteomics (qualitative);
• Functional proteomics (relative quantitation studies);
• Post translational modification functional proteomics (qualitative and relative quantitation
studies)
Data elaboration for full proteome analysis :-
In full proteomics analysis the proteins are usually extracted and qualitatively identified. These studies
are usually performed in order to understand what proteins are expressed by the genome of the
organism of interest. The general analytical scheme is reported in Figure 2.
Basically, after protein separation, mainly through gel electrophoresis or other separation approaches
(liquid chromatography, etc.), proteins are identified by means of mass spectrometric techniques. Two
kinds of data-processing algorithms can be employed, depending on the analytical technology used to
analyze the proteins. The two approaches are:
1) Bottom up approach. It is used to identify the protein of interest after enzymatic or chemical
digestion;
2) Top down approach. In this case proteins are not digested but directly analyzed by mass
spectrometric approaches;
In the former case (bottom-up), proteins are digested by means of an enzymatic or chemical reaction,
and the specific peptides produced are then analyzed to identify the protein of interest. These results
can be obtained using a mass analyzer that can operate in two modes:
a) full scan peptide mass fingerprint (MS) and
b) tandem mass spectrometry (MS/MS).
In peptide mass fingerprinting, the mass/charge (m/z) ratios of the peptides are obtained using a high-
resolution, mass-accurate analyzer. The set of accurately measured m/z ratios of the detected peptides
is checked against the theoretical values generated by virtual digestion of the proteins present in a
known database. A list of candidate proteins is thus obtained, each with a statistical identification
score correlated with the number of peptides detected per protein and the peptide mass accuracy.
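The database-matching step can be sketched as follows. The score here is simply a count of observed masses matched within a tolerance; real search engines (e.g. Aldente, mentioned later) use probabilistic scores, and the names and masses below are invented for illustration.

```python
def pmf_score(observed, candidates, tol=0.01):
    """Rank database proteins by peptide-mass-fingerprint matching.

    `observed` is a list of measured peptide masses (Da); `candidates`
    maps a protein name to the peptide masses of its virtual digest.
    Score = number of observed masses matched within `tol` Da.
    """
    def matches(masses):
        return sum(any(abs(o - t) <= tol for t in masses) for o in observed)
    ranked = sorted(candidates, key=lambda name: matches(candidates[name]),
                    reverse=True)
    return [(name, matches(candidates[name])) for name in ranked]
```

Tightening `tol` is the software counterpart of the analyzer's mass accuracy: the more accurate the measurement, the fewer database proteins match by chance.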
Fig. 2. General analytical scheme of Full proteomic analysis.
2.2 EXPASY TOOLS FOR PROTEIN ANALYSIS
Protein identification and analysis software performs a central role in the investigation of proteins from
two-dimensional (2-D) gels and mass spectrometry. For protein identification, the user matches certain
empirically acquired information against a protein database to define a protein as already known or as
novel. For protein analysis, information in protein databases can be used to predict certain properties
about a protein, which can be useful for its empirical investigation. The two processes are thus
complementary. Although there are numerous programs available for those applications, we have
developed a set of original tools with a few main goals in mind. Specifically, these are :
1) To utilize the extensive annotation available in the Swiss-Prot database wherever possible, in
particular the position-specific annotation in the Swiss-Prot feature tables to take into account
posttranslational modifications and protein processing.
2) To develop tools specifically, but not exclusively, applicable to proteins prepared by two-
dimensional gel electrophoresis and peptide mass fingerprinting experiments.
3) To make all tools available on the World-Wide Web (WWW), and freely usable by the scientific
community.
Analysis tools include Compute pI/Mw, a tool for predicting protein isoelectric point (pI) and molecular
weight (Mw); ProtParam, to calculate various physicochemical parameters; PeptideMass, a tool for
theoretically cleaving proteins and calculating the masses of their peptides and any known cellular or
artifactual posttranslational modifications; PeptideCutter, to predict cleavage sites of proteases or
chemicals in protein sequences; ProtScale, for amino acid scale representation, such as hydrophobicity
plots.
Protein identification tools include TagIdent, a tool that lists proteins within a user-specified pI and Mw
region, and allows proteins to be identified through the use of short “sequence tags” up to six amino
acids long; AACompIdent, a program that identifies proteins by virtue of their amino acid (AA)
compositions, sequence tags, pI, and Mw; AACompSim, a program that matches the theoretical AA
composition of proteins against the Swiss-Prot database to find similar proteins; MultiIdent, a
combination of other tools mentioned above that accepts multiple data types to achieve identification,
including protein pI, Mw, species of interest, AA composition, sequence tag, and peptide masses; and
Aldente, a powerful peptide mass fingerprinting (PMF) identification tool.
Protein characterization tools in the context of PMF experiments include FindMod, to predict
posttranslational modifications and single-amino acid substitutions; GlycoMod, a tool to predict the
possible compositions for glycan structures, or compositions of glycans attached to glycoproteins;
FindPept, to predict peptides resulting from unspecific proteolytic cleavage, protease autolysis, and
keratin contaminants; and BioGraph to visualize the results of the ExPASy identification and
characterization tools.
The tools described here are accessible through the ExPASy WWW server, from the tools page,
http://www.expasy.org/tools/ (see Fig. 1). In addition to the tools maintained by the ExPASy team, this
page contains links to many analysis and prediction programs provided on Web sites all over the world.
The “local” ExPASy tools can be distinguished by the small ExPASy logo preceding their name. They are
continually under development and thus may change with time. We document new features of tools in
the “What’s new on ExPASy” Web page at http://www.expasy.org/history.html. Feedback and
suggestions from users of the tools is very much appreciated and can be sent by e-mail to
[email protected]. Detailed documentation for each of the programs is available from the Web site.
A) Single-Protein Analysis Tools on the ExPASy Server :-
1) Compute pI/Mw Tool :
This tool (http://www.expasy.org/tools/pi_tool.html) calculates the estimated pI and Mw of a specified
Swiss-Prot/TrEMBL entry or a user-entered AA sequence. These parameters are useful if you want to
know the approximate region of a 2-D gel where a protein may be found.
To use the program, enter one or more Swiss-Prot/TrEMBL identification names (e.g., LACB_BOVIN) or
accession numbers (e.g., P02754) into the text field, and select the “click here to compute pI/Mw”
button. If one entry is specified, you will be asked to specify the protein’s domain of interest for which
the pI and mass should be computed. The domain can be selected from the hypertext list of features
shown, if any, or by numerically specifying the domain start and end points.
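The calculation behind this tool can be approximated in a few lines: molecular weight is the sum of average residue masses plus one water, and pI is found by bisecting on the net charge computed from side-chain and terminal pKa values. The masses and pKa values below are standard textbook figures and will differ slightly from the exact tables ExPASy uses, so results are approximate.

```python
AVG_MASS = {  # average amino acid residue masses, Da
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132}
PKA = {"D": 3.9, "E": 4.07, "C": 8.18, "Y": 10.46,   # acidic side chains
       "H": 6.04, "K": 10.54, "R": 12.48,            # basic side chains
       "Nterm": 8.2, "Cterm": 3.65}                  # chain termini

def mol_weight(seq):
    return sum(AVG_MASS[a] for a in seq) + 18.0153   # + one water

def net_charge(seq, ph):
    pos = [PKA["Nterm"]] + [PKA[a] for a in seq if a in "HKR"]
    neg = [PKA["Cterm"]] + [PKA[a] for a in seq if a in "DECY"]
    return (sum(1 / (1 + 10 ** (ph - p)) for p in pos)
            - sum(1 / (1 + 10 ** (p - ph)) for p in neg))

def isoelectric_point(seq):
    lo, hi = 0.0, 14.0
    for _ in range(60):          # bisect: net charge decreases with pH
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if net_charge(seq, mid) > 0 else (lo, mid)
    return round((lo + hi) / 2, 2)
```

These two numbers together place a protein on a 2-D gel: pI sets its position in the first (isoelectric focusing) dimension and Mw in the second (SDS-PAGE) dimension.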
2) ProtParam Tool :
ProtParam (http://www.expasy.org/tools/protparam.html) computes various physico-chemical
properties that can be deduced from a protein sequence. No additional information is required about
the protein under consideration. The protein can either be specified as a Swiss-Prot/TrEMBL accession
number or ID, or in the form of a raw sequence. White space and numbers are ignored. If you provide
the accession number of a Swiss-Prot/TrEMBL entry, you will be prompted with an intermediary page
that allows you to select the portion of the sequence on which you would like to perform the analysis.
The choice includes a selection of mature chains or peptides and domains from the Swiss-Prot feature
table (which can be chosen by clicking on the positions), as well as the possibility to enter start and end
position in two boxes. By default (i.e., if you leave the two boxes empty) the complete sequence will be
analyzed.
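As an illustration, two of the simpler ProtParam-style quantities can be computed directly from the sequence. This sketch covers only percent composition and charged-residue counts; the real tool also reports extinction coefficients, instability index, GRAVY, and more:

```python
# Two ProtParam-style calculations computed directly from the sequence:
# percent amino acid composition and counts of charged residues.
from collections import Counter

def aa_composition(seq):
    """Percent composition of each amino acid in the sequence."""
    counts = Counter(seq)
    n = len(seq)
    return {aa: round(100.0 * c / n, 1) for aa, c in sorted(counts.items())}

def charged_residues(seq):
    """Counts of negatively (D, E) and positively (K, R) charged residues,
    in the style of ProtParam's summary."""
    neg = sum(seq.count(aa) for aa in 'DE')
    pos = sum(seq.count(aa) for aa in 'KR')
    return neg, pos
```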
3) PeptideMass :
The PeptideMass tool (http://www.expasy.org/tools/peptide-mass.html) is designed to assist in peptide-
mapping experiments, and in the interpretation of peptide-mass fingerprinting (PMF) results and other
mass-spectrometry data. It cleaves in silico a user-specified protein sequence or a mature protein in the
Swiss-Prot/TrEMBL databases with an enzyme or reagent of choice, to generate peptides. Masses of the
peptides are then calculated and displayed. If a protein from Swiss-Prot has annotations that describe
discrete posttranslational modifications (specifically acetylation, amidation, biotinylation, C-
mannosylation, formylation, farnesylation, γ-carboxy glutamic acid, geranyl-geranylation, lipoyl groups,
N-acyl glycerides, methylation, myristoylation, NAD, O-GlcNAc, palmitoylation, phosphorylation,
pyridoxal phosphate, pyrrolidone carboxylic acid, or sulfation), the masses of these modifications will be
considered in peptide mass calculations. Post-translational modifications can also be specified along
with a user-entered sequence that is not in Swiss-Prot or TrEMBL. Guidelines for the input format of
posttranslational modifications (PTMs) are accessible directly from the PeptideMass input form. The
mass effects of artifactual protein modifications such as the oxidation of methionine or acrylamide
adducts on cysteine residues can also be considered. The program can supply warnings where peptide
masses may be subject to change from protein isoforms, database conflicts, or mRNA splicing variation.
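The in-silico digest at the heart of PeptideMass can be sketched as follows. This uses the usual simplified trypsin rule (cleave after K or R, except before P), no missed cleavages, and no posttranslational modifications:

```python
import re

# In-silico tryptic digest with monoisotopic [M+H]+ peptide masses,
# in the spirit of PeptideMass. Simplified trypsin rule: cleave after
# K or R unless the next residue is P; no missed cleavages, no PTMs.

MONO = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
        'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
        'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
        'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
        'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931}
WATER, PROTON = 18.010565, 1.007276

def tryptic_peptides(seq):
    """Split after K/R unless followed by P (zero-width regex split)."""
    return [p for p in re.split(r'(?<=[KR])(?!P)', seq) if p]

def peptide_mh(peptide):
    """Monoisotopic [M+H]+ mass: residue masses + water + one proton."""
    return round(sum(MONO[aa] for aa in peptide) + WATER + PROTON, 4)
```

For example, `tryptic_peptides('AKRPGKW')` yields `['AK', 'RPGK', 'W']`, since the K-P bond after the arginine blocks cleavage at that site.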
4) PeptideCutter :
PeptideCutter (http://www.expasy.org/tools/peptidecutter/) predicts cleavage sites of proteases or
chemicals in a protein sequence. Protease digestion can be useful if one wants to carry out experiments
on a portion of a protein, separate the domains in a protein, remove a tag protein when expressing a
fusion protein, or make sure that the protein under investigation is not sensitive to endogenous
proteases. One or several reagents can be selected from a list of (currently) 33 proteases and chemicals.
The protein sequence can be entered in the form of a Swiss-Prot/TrEMBL accession number, a raw
sequence, or a sequence in FASTA format, in one-letter amino acid code. Letters that do not correspond
to an amino acid code (B, J, O, U, X, or Z) will cause an error message, and the user is required to correct
the input. Please note that only one sequence can be entered at a time.
5) ProtScale :
ProtScale (http://www.expasy.org/tools/protscale.html) allows computation and representation (in the
form of a 2-D plot) of the profile produced by any amino acid scale on a selected protein. ProtScale can
be used with 50 predefined scales taken from the literature. The scale values for the 20 amino acids,
as well as a literature reference, are provided on ExPASy for each of these scales. To generate data for a
plot, the protein sequence is scanned with a sliding window of a given size. At each position, the mean
scale value of the amino acids within the window is calculated, and that value is plotted for the midpoint
of the window.
You can set several parameters that control the computation of a scale profile, such as the window size,
the weight variation model, the window edge relative weight value, and scale normalization.
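The sliding-window scan described above can be sketched with the Kyte-Doolittle hydropathy scale, one of the predefined scales ProtScale offers. This sketch uses equal weights inside the window; ProtScale additionally supports edge-weighting models and normalization:

```python
# ProtScale-style sliding-window scan using the Kyte-Doolittle
# hydropathy scale. The mean scale value inside each window is
# assigned to the window's midpoint position.

KYTE_DOOLITTLE = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
                  'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
                  'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
                  'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2}

def scale_profile(seq, window=9, scale=KYTE_DOOLITTLE):
    """(midpoint position, mean scale value) for each window position."""
    profile = []
    for i in range(len(seq) - window + 1):
        mean = sum(scale[aa] for aa in seq[i:i + window]) / window
        profile.append((i + window // 2, round(mean, 3)))
    return profile
```

Sustained stretches of high values in such a profile are the classic signature of membrane-spanning segments.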
Fig. 1. The ExPASy tools page, http://www.expasy.org/tools/. All underlined text represents hypertext
links, which, when selected with a computer mouse, take the user to the corresponding page for the
chosen tool. The tools whose names are preceded by a small ExPASy logo are maintained by the ExPASy
team; all other links lead to external servers.
B) Protein Identification and Characterization Tools on ExPASy :-
1) TagIdent Tool :
The TagIdent tool (http://www.expasy.org/tools/tagident.html) serves three main purposes. Firstly, it can
create lists of proteins from one or more organisms that are within a user-specified pI or Mw range. This
is useful to find proteins from the database that may be in a region of interest on a 2-D gel. Secondly,
the program can identify proteins from 2-D gels by virtue of their estimated pI and Mw, and a short
protein “sequence tag” of up to six amino acids. The sequence tag can be derived from protein N-
termini, C-termini, or internally, and generated by chemical- or mass-spectrometric sequencing
techniques. As sequence tags are highly specific (e.g., there are 160,000 different combinations of four
amino acid sequence tags) they represent a form of protein identification that is useful for organisms
that are molecularly well defined and have a relatively small number of proteins (e.g., Escherichia coli or
Saccharomyces cerevisiae). Interestingly, we have shown that C-terminal sequence tags are more
specific than N-terminal tags; however, it remains technically more difficult to generate high-quality C-
terminal protein sequence data. Thirdly, the sequence tag can be used, together with a very precise
protein mass obtained from mass spectrometry, to identify a protein after peptide fragmentation. One
can also specify terms to limit the search to a range of organisms or to a specific organism (species).
Additionally, Swiss-Prot keywords can also be used in the identification procedure.
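The core TagIdent search logic amounts to filtering by pI window, Mw window, and tag substring. In the sketch below the "database" entries are invented for illustration; real queries run against Swiss-Prot/TrEMBL:

```python
# TagIdent-style filter over a toy in-memory "database".
# The entries below are invented purely for this example.

TOY_DB = [
    # (identifier, pI, Mw, sequence)
    ('PROT1_ECOLI', 5.2, 18400.0, 'MKTAYIAKQR'),
    ('PROT2_ECOLI', 9.1, 18650.0, 'MSERVKQLLN'),
    ('PROT3_YEAST', 5.3, 52000.0, 'MDSKTAGHWQ'),
]

def tagident(db, pi, pi_tol, mw, mw_tol, tag=None):
    """Keep entries inside the pI and Mw windows whose sequence contains
    the (up to six residue) sequence tag, if one is supplied."""
    hits = []
    for ident, db_pi, db_mw, seq in db:
        if abs(db_pi - pi) <= pi_tol and abs(db_mw - mw) <= mw_tol:
            if tag is None or tag in seq:
                hits.append(ident)
    return hits
```

Even with a loose pI window, adding a short tag such as 'KQR' collapses the candidate list to a single protein, which is exactly why tags are so discriminating for well-defined small proteomes.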
2) AACompIdent Tool :
The AACompIdent tool (http://www.expasy.org/tools/aacomp/) can identify proteins by their amino
acid (AA) composition. The program matches the percent empirically measured AA composition of an
unknown protein against the theoretical percent AA compositions of proteins in the Swiss-Prot/TrEMBL
database. A score, which represents the degree of difference between the composition of the unknown
protein and a protein in the database, is calculated for each database entry by the sum of the squared
difference between the percent AA composition for all amino acids of the unknown protein and the
database entry. All proteins in the database are then ranked according to their score, from lowest (best
match) to highest (worst match). Estimated protein pI and Mw, as well as species of interest and
keyword, can also be used in the identification procedure.
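The scoring rule described above (sum of squared composition differences, lowest score = best match) can be written directly. The database compositions in the usage example are toy values:

```python
# AACompIdent-style score: sum of squared differences between percent
# amino acid compositions. Lower scores indicate better matches.

def composition_score(query_pct, db_pct):
    """Sum over the 20 amino acids of (query% - database%)^2."""
    aas = 'ACDEFGHIKLMNPQRSTVWY'
    return sum((query_pct.get(aa, 0.0) - db_pct.get(aa, 0.0)) ** 2
               for aa in aas)

def rank_by_composition(query_pct, database):
    """Rank database entries from lowest (best) to highest (worst) score."""
    scored = [(composition_score(query_pct, pct), name)
              for name, pct in database.items()]
    return sorted(scored)
```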
3) AACompSim Tool :
The AACompSim tool (http://www.expasy.org/tools/aacsim/) allows the theoretical AA composition of
one protein in the Swiss-Prot database to be compared to proteins from one or all species in the
database. This serves two main purposes :
• first, to allow the simulation of matching undertaken for identification purposes with
AACompIdent;
• second, to allow the detection of weak similarities between proteins by comparison of their
compositions rather than sequences, as explored by Hobohm and Sander.
To use AACompSim, first select the constellation of amino acids you wish to work with. If you wish to
simulate matching undertaken with empirical data, you should specify constellation 2. To match against
the database for detecting protein similarities, you should use all 20 amino acids in constellation 0. Then
specify an e-mail address where the results can be sent, the Swiss-Prot identification name (e.g.,
IPIA_TOBAC) or accession number (e.g., Q03198) of the protein you would like to compare against the
database, and the Swiss-Prot abbreviation for the species to match against (e.g., SALTY for Salmonella
typhimurium). A document containing a full list of all Swiss-Prot species and their organism codes can be
found at http://www.expasy.org/cgi-bin/speclist. If desired, matching can be done against all species in
the database by specifying “ALL.” Finally, select the “Search” button to submit the query to the ExPASy
server. Results will be sent to your e-mail address. AACompSim will return three lists of proteins, similar
to those from AACompIdent.
4) MultiIdent Tool :
Proteins can be identified by virtue of their peptide masses alone, but frequently other data are needed
to provide high-confidence identification. The same is true for protein identification with AA
composition. Following our earlier observations that high-confidence protein identification can be
achieved with a combination of peptide mass and AA composition data, we have developed the protein
identification tool MultiIdent (http://www.expasy.org/tools/multiident/). This tool uses parameters of
protein species, estimated pI and Mw, keyword, AA composition, sequence tag, and PMF data to
achieve protein identification. Currently, the program works by first generating a set of proteins in the
database with AA compositions close to the unknown protein, as for AACompIdent. Theoretical peptide
masses from the proteins in this set are then matched with the peptide masses of the unknown protein
to find the number of peptides in common (number of “hits”). Three types of lists are produced in the
results :
• first, a list where proteins from the database are ranked according to their AA composition
score;
• second, a list where proteins are ranked according to the number of peptide hits they showed
with the unknown protein; and
• third, a list that shows only proteins present in both of the above lists, ranked according to an
integrated AA and peptide hit score.
In all these lists, protein pI, Mw, species of origin, and Swiss-Prot keyword can be used as in
AACompIdent to increase the specificity of searches.
5) Aldente Tool :
Aldente (Advanced Large-scale iDENTification Engine, http://www.expasy.org/tools/aldente/) is a tool
that allows the identification of proteins using peptide-mass fingerprinting data.
Experimentally measured, user-specified peptide masses are compared with the theoretical peptides
calculated for all proteins in the Swiss-Prot/TrEMBL databases. Iso-electric point, molecular weight, and
a species (or group of species) can be specified in order to restrict the number of candidate proteins and
reduce false-positive matches. The main features of Aldente are :
• Use of a robust method (the Hough transform) to determine the deviation function of the mass
spectrometer and to resolve peptide match ambiguities. In particular, the method is relatively
insensitive to noise.
• Tuneable score parametrization: the user can choose which parameters contribute to the score,
and in what proportion.
• Extensive use of the annotations (protein mature form, posttranslational modifications,
alternative splicing) in Swiss-Prot/TrEMBL, offering a degree of protein characterization as part
of the identification procedure.
• Consideration of user-defined chemical amino acid modifications (oxidation of methionine,
acrylamide adducts on cysteine residues, alkylation products on cysteine residues), and the
possibility to define their contribution to the score.
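A deliberately simplified version of the peptide-matching step can be sketched as below. Note this uses a fixed mass tolerance, whereas Aldente itself fits the spectrometer's deviation function with a Hough transform:

```python
# Simplified PMF matcher: count experimental peaks that fall within a
# fixed tolerance of a candidate protein's theoretical peptide masses,
# then rank candidates by hit count. Aldente replaces the fixed
# tolerance with a Hough-transform fit of the mass deviation function.

def count_hits(experimental, theoretical, tol_da=0.2):
    """Number of experimental peaks matching some theoretical mass."""
    return sum(any(abs(m - t) <= tol_da for t in theoretical)
               for m in experimental)

def rank_candidates(experimental, candidates, tol_da=0.2):
    """Rank candidate proteins by number of peptide hits, best first."""
    scored = [(count_hits(experimental, masses, tol_da), name)
              for name, masses in candidates.items()]
    return sorted(scored, reverse=True)
```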
6) FindMod Tool :
The FindMod, GlycoMod, and FindPept tools are used to identify the origin of peptide masses obtained
by PMF that are not matched by protein identification tools such as Aldente. They also take into account
posttranslational modifications annotated in Swiss-Prot or supplied by the user, and chemical
modifications of peptides. It is quite common for PMF tools not to be able to find matching theoretical
peptides for a few of the less intense peaks that were detected and submitted to the identification
process.
FindMod (http://www.expasy.org/tools/findmod/) is a program for de novo discovery of protein PTM or
single-amino acid substitutions. It examines PMF results of known proteins for the presence of more
than 20 types of PTMs of discrete mass, such as acetylation, amidation, biotin, C-mannosylation,
deamidation, N-acyl diglyceride cysteine (tripalmitate), FAD, farnesylation, formylation, geranyl-geranyl,
γ-carboxy-glutamic acid, O-GlcNAc, hydroxylation, lipoyl, methylation, myristoylation, palmitoylation,
phosphorylation, pyridoxal phosphate, pyrrolidone carboxylic acid, and sulfation.
This is done by looking at mass differences between experimentally determined peptide masses and
theoretical peptide masses calculated from a specified protein sequence. If a mass difference
corresponds to a known PTM not already annotated in Swiss-Prot, rules are applied that examine the
sequence of the peptide of interest and make predictions as to what amino acid in the peptide is likely
to carry the modification. The same method is applied when predicting potential amino acid
substitutions.
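The mass-difference reasoning described above can be sketched in a few lines. The delta masses below are standard monoisotopic shifts for a handful of the PTMs listed; FindMod covers many more and additionally applies sequence rules to localize the modified residue:

```python
# FindMod-style reasoning: if an unmatched peak exceeds a theoretical
# peptide mass by the mass of a known PTM, that peptide may carry the
# modification. Standard monoisotopic delta masses for a few PTMs:

PTM_DELTAS = {
    'phosphorylation': 79.96633,
    'acetylation': 42.01057,
    'methylation': 14.01565,
    'hydroxylation': 15.99491,
}

def suggest_ptm(observed, theoretical, tol_da=0.02):
    """Return PTMs whose delta mass explains observed - theoretical."""
    diff = observed - theoretical
    return [name for name, delta in PTM_DELTAS.items()
            if abs(diff - delta) <= tol_da]
```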
7) GlycoMod and GlycanMass Tools :
Protein glycosylation is one of the most common and most complex post-translational modifications.
Although the problem of predicting glycosylation from peptide-mass fingerprinting data is in principle
the same as the one addressed by FindMod, the complexity and heterogeneity (the high number of
possible combinations of monosaccharides forming glycan structures) made it necessary to conceive a
separate tool, specializing in glycan structures and glycopeptides, GlycoMod.
GlycoMod (http://www.expasy.org/tools/glycomod/) finds all possible compositions of a glycan
structure from its experimentally determined mass. It may be used to calculate the possible
compositions of free or derivatized glycan structures, or compositions of glycans attached to
glycoproteins and glycopeptides. The motivation and use of the tool are quite similar to FindMod. As
there has been a recent book chapter devoted entirely to the use of GlycoMod, we will not detail its use
here.
GlycanMass (http://www.expasy.org/tools/glycomod/glycanmass.html) allows the user to calculate the
mass of a glycan from its monosaccharide composition. Available elements to build the oligosaccharide
are hexose, HexNAc, deoxyhexose, NeuAc, NeuGc, pentose, sulfate, phosphate, KDN, and HexA. The
user has the possibility to specify whether the monosaccharide residues are underivatized,
permethylated, or peracetylated, and whether to use average or monoisotopic mass values.
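The GlycanMass calculation itself is a straightforward sum. The sketch below handles only the free, underivatized, monoisotopic case; the residue masses are the standard monoisotopic values:

```python
# GlycanMass-style calculation for a free, underivatized glycan:
# sum of monosaccharide residue masses plus one water (reducing end).
# Standard monoisotopic residue masses (Da):

GLYCAN_MONO = {
    'Hex': 162.05282, 'HexNAc': 203.07937, 'dHex': 146.05791,
    'NeuAc': 291.09542, 'NeuGc': 307.09033, 'Pent': 132.04226,
    'KDN': 250.06887, 'HexA': 176.03209,
    'sulfate': 79.95682, 'phosphate': 79.96633}
WATER = 18.010565

def glycan_mass(composition):
    """Monoisotopic mass of a free, underivatized glycan.
    composition: e.g. {'Hex': 5, 'HexNAc': 2}."""
    return round(sum(GLYCAN_MONO[m] * n for m, n in composition.items())
                 + WATER, 4)
```

A single hexose gives 180.0634 Da, the monoisotopic mass of glucose, as a quick sanity check.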
8) FindPept Tool :
FindPept (http://www.expasy.org/tools/findpept/) is designed to predict peptides resulting from the
following causes : unspecific proteolytic cleavage, missed cleavage, protease autolysis, and keratin
contaminants.
Unspecific cleavage is the process by which peptides whose termini do not correspond to the cleavage
specificity rules implemented in computer programs are produced by proteolysis. These rules are often
simplistic and reflect our incomplete understanding of the specificity of certain enzymes. Other causes
include contamination with other proteases (e.g., trypsin usually contains traces of chymotrypsin),
biological processes such as protein degradation, or a change in enzyme specificity over time.
9) BioGraph Tool :
BioGraph (http://www.expasy.org/tools/BiographApplet/) is a Java applet that aims at providing ExPASy
users with an interactive interface to visualize results of some proteomics tools. BioGraph is therefore
accessible from Aldente, FindMod, or FindPept results by clicking on the “BioGraph” button.
This viewer is composed of three main components, or panels :
• first, the “Title panel,” intended to give general information about the source program;
• then the “Tool results panel,” to summarize the source program results and interact with the
spectrum; and
• lastly, the “Spectrum manipulation panel,” to interactively visualize the user-entered spectrum.
2.3 STRUCTURAL ANALYSIS OF PROTEIN
Several methods are currently used to determine the structure of a protein, including X-ray
crystallography, NMR spectroscopy, and electron microscopy. Each method has advantages and
disadvantages. In each of these methods, the scientist uses many pieces of information to create the
final atomic model. Primarily, the scientist has some kind of experimental data about the structure of
the molecule. For X-ray crystallography, this is the X-ray diffraction pattern. For NMR spectroscopy, it is
information on the local conformation and distance between atoms that are close to one another. In
electron microscopy, it is an image of the overall shape of the molecule.
In most cases, this experimental information is not sufficient to build an atomic model from scratch.
Additional knowledge about the molecular structure must be added. For instance, we often already
know the sequence of amino acids in a protein, and we know the preferred geometry of atoms in a
typical protein (for example, the bond lengths and bond angles). This information allows the scientist to
build a model that is consistent with both the experimental data and the expected composition and
geometry of the molecule.
When looking at PDB entries, it is always good to be a bit critical. Keep in mind that the structures in the
PDB archive are determined using a balanced mixture of experimental observation and knowledge-
based modeling. It often pays to take a little extra time to confirm for yourself that the experimental
evidence for a particular structure supports the model as represented and the scientific conclusions
based on the model.
2.3.1 X-RAY CRYSTALLOGRAPHY :-
Protein X-ray crystallography is a technique used to obtain the three-dimensional structure of a
particular protein by X-ray diffraction of its crystallized form. This three-dimensional structure is crucial
to determining a protein’s functionality. Crystallization creates a lattice that aligns millions of identical
protein molecules, which makes the data collection far more sensitive. It is like taking a stack of paper,
measuring the width of the stack with a ruler, and dividing that width by the number of pages to obtain
the thickness of a single sheet. By this averaging technique, the noise level is reduced
and the signal to noise ratio increases. The specificity of the protein’s active sites and binding sites is
completely dependent on the protein’s precise conformation. X-ray crystallography can reveal the
precise three-dimensional positions of most atoms in a protein molecule because the wavelength of
X-rays is comparable to the length of covalent bonds, and it therefore currently provides the best
visualization of protein structure. It was the X-ray crystallography of Rosalind E. Franklin that made it
possible for J. D. Watson and F. H. C. Crick to work out the double-helix structure of DNA.
X-ray crystallography can reveal the detailed three-dimensional structures of thousands of proteins.
X-ray crystallography is used to investigate molecular structures through the growth of solid crystals of
the molecules they study. Crystallographers aim high-powered X-rays at a tiny crystal containing trillions
of identical molecules. The crystal scatters the X-rays onto an electronic detector. The electronic
detector is the same type used to capture images in a digital camera. After each blast of X-rays, lasting
from a few seconds to several hours, the researchers precisely rotate the crystal by entering its desired
orientation into the computer that controls the X-ray apparatus. This enables the scientists to capture in
three dimensions how the crystal scatters, or diffracts, X-rays. The intensity of each diffracted ray is fed
into a computer, which uses a mathematical equation to calculate the position of every atom in the
crystallized molecule. The result is a three-dimensional digital image of the molecule.
Crystallographers measure the distances between atoms in angstroms. X-rays are the perfect “rulers”
for measuring such distances: the X-rays used by crystallographers have wavelengths of approximately
0.5 to 1.5 angstroms, just the right size to measure the distance between atoms in a molecule. That is
why X-rays are used.
Components of X-ray crystallography :-
The three components needed to complete an X-ray crystallography analysis are a protein crystal, a
source of x-rays and a detector.
Steps involved in Crystallographic Analysis :-
1) Crystallizing Protein of Interest :
The process begins by crystallizing the protein of interest. Crystallization fixes the orientation of all the
protein molecules with respect to one another while still maintaining their biologically active
conformations – a requirement for X-ray diffraction. The protein must be precipitated out of, or
extracted from, solution. The rule of thumb is to use as pure a protein preparation as possible, since
purity favours the growth of well-ordered crystals with a uniform surface charge distribution, which
scatter better. Four critical steps are taken to achieve protein crystallization :
• Purify the protein. Determine its purity; if it is not sufficiently pure (usually >99%), it must
undergo further purification.
• Precipitate the protein. This is usually done by dissolving it in an appropriate solvent (a water-
buffer solution with an organic precipitant such as 2-methyl-2,4-pentanediol). If the protein is
insoluble in the water-buffer or water-organic mixture, a detergent such as sodium lauryl sulfate
must be added.
• The solution has to be brought to supersaturation (condensing the protein from the rest of the
solvent forming condensation nuclei). This is done by adding a salt to the concentrated solution
of the protein, reducing its solubility and allowing the protein to form a highly organized crystal
(this process is referred to as salting out). Other methods include batch crystallization, liquid-
liquid crystallization, vapor diffusion, and dialysis.
• Let the crystals grow. Once nucleation has occurred, the nuclei seed the growth of full-sized
crystals.
2) X-rays are generated :
For the next step, x-rays are generated and directed toward the crystallized protein. X-rays can be
generated in four different ways,
• By bombarding a metal source with a beam of high-energy electrons,
• By exposing a substance to a primary beam of X-rays to create a secondary beam of X-ray
fluorescence,
• From a radioactive decay process that generates X-rays (Gamma rays are indistinguishable from
X-rays), and
• From a synchrotron (a cyclotron with an electric field at constant frequency) radiation source.
The first and last methods utilize the phenomenon of bremsstrahlung: an accelerating (or decelerating)
charge gives off radiation.
Then, the x-rays are shot at the protein crystal resulting in some of the x-rays going through the crystal
and the rest being scattered in various directions. The scattering of x-rays is also known as “x-ray
diffraction”. Such scattering results from the interaction of electric and magnetic fields of the radiation
with the electrons in the atoms of the crystal.
3) Creation of electron density map :
The final step involves creating an electron density map based on the measured intensities of the
diffraction pattern on the film. A Fourier Transform can be applied to the intensities on the film to
reconstruct the electron density distribution of the crystal. In this case, the Fourier Transform takes the
spatial arrangement of the electron density and gives out the spatial frequency (how closely spaced the
atoms are) in the form of the diffraction pattern on the x-ray film. An everyday example of the Fourier
Transform is the music equalizer on a music player. Instead of displaying the actual music waveform,
which is difficult to visualize, the equalizer displays the intensity of various bands of frequencies.
Through the Fourier Transform, the electron density distribution is illustrated as a series of parallel
shapes and lines stacked on top of each other (contour lines), like a terrain map. The mapping gives a
three-dimensional representation of the electron densities observed through the x-ray crystallography.
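The Fourier synthesis described above can be illustrated with a toy one-dimensional example. The structure-factor coefficients below are invented purely for demonstration; a real electron density map is reconstructed from thousands of measured reflections, whose phases must first be estimated separately:

```python
import cmath

# Toy 1-D Fourier synthesis: given structure factors F_h (amplitude and
# phase for each reflection index h), the electron density at fractional
# coordinate x is the sum of the corresponding waves. The coefficients
# in F below are invented for illustration only.

def density(x, structure_factors):
    """rho(x) = sum over h of F_h * exp(-2*pi*i*h*x), real part."""
    return sum((f * cmath.exp(-2j * cmath.pi * h * x)).real
               for h, f in structure_factors.items())

# Made-up Fourier coefficients; F_h = F_-h keeps the density real.
F = {0: 2.0, 1: 1.0, -1: 1.0, 2: 0.5, -2: 0.5}
profile = [round(density(x / 10, F), 3) for x in range(10)]
```

Sampling `density` on a grid, as `profile` does, is exactly how the contoured electron density map is produced, just in one dimension instead of three.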
When interpreting the electron density map, resolution needs to be taken into account. A resolution of
5Å – 10Å can reveal the structure of polypeptide chains, 3Å – 4Å of groups of atoms, and 1Å – 1.5Å of
individual atoms. The resolution is limited by the degree of order in the crystal and, for proteins, is typically about 2 Å.
Advantages of X-ray crystallography :-
Some of the advantages of X-ray crystallography are that the technique can obtain an atomic-resolution
structure, and that the structure in crystal form is generally assumed to be the same as the structure in
solution. Another advantage is that an atomic structure contains a huge amount of data pertaining to
the crystallized, purified protein; the structure of a protein can provide more information than studying
its niche in the cellular environment.
Drawbacks of X-ray crystallography :-
Some of the drawbacks of X-ray crystallography are that the sample needs to be in solid crystalline
form, must be present in a large enough quantity to be studied, and is often destroyed by the X-ray
radiation used to study it. This means that samples in the gas or liquid state cannot be analyzed by
X-ray crystallography. Also, rare or hard-to-synthesize samples may be difficult to study, because there
may not be enough of the sample for the radiation to provide a clear image. Thirdly, studying biological
samples can be problematic because the radiation used to study the samples is most likely going to
harm or destroy the living tissues.
2.3.2 HOMOLOGY MODELLING :-
Homology modeling, also known as comparative modeling of protein, refers to constructing an atomic-
resolution model of the “target” protein from its amino acid sequence and an experimental three-
dimensional structure of a related homologous protein (the “template”). Homology modeling relies on
the identification of one or more known protein structures likely to resemble the structure of the query
sequence, and on the production of an alignment that maps residues in the query sequence to residues
in the template sequence. It has been shown that protein structures are more conserved than protein
sequences amongst homologues, but sequences falling below 20% sequence identity can have very
different structures.
Homology modeling can produce high-quality structural models when the target and template are
closely related, which has inspired the formation of a structural genomics consortium dedicated to the
production of representative experimental structures for all classes of protein folds. The chief
inaccuracies in homology modeling, which worsen with lower sequence identity, derive from errors in
the initial sequence alignment and from improper template selection. Like other methods of structure
prediction, current practice in homology modeling is assessed in a biennial large-scale experiment
known as the Critical Assessment of Techniques for Protein Structure Prediction, or CASP.
Steps in Homology Model Production :-
The homology modeling procedure can be broken down into four sequential steps : template selection,
target-template alignment, model construction, and model assessment. The first two steps are often
essentially performed together, as the most common methods of identifying templates rely on the
production of sequence alignments; however, these alignments may not be of sufficient quality because
database search techniques prioritize speed over alignment quality. These processes can be performed
iteratively to improve the quality of the final model, although quality assessments that are not
dependent on the true target structure are still under development.
Optimizing the speed and accuracy of these steps for use in large-scale automated structure prediction
is a key component of structural genomics initiatives, partly because the resulting volume of data will be
too large to process manually and partly because the goal of structural genomics requires providing
models of reasonable quality to researchers who are not themselves structure prediction experts.
1-2) Template selection and sequence alignment :-
The critical first step in homology modeling is the identification of the best template structure, if indeed
any are available. The simplest method of template identification relies on serial pairwise sequence
alignments aided by database search techniques such as FASTA and BLAST. More sensitive methods
based on multiple sequence alignment – of which PSI-BLAST is the most common example – iteratively
update their position-specific scoring matrix to successively identify more distantly related homologs.
This family of methods has been shown to produce a larger number of potential templates and to
identify better templates for sequences that have only distant relationships to any solved structure.
Protein threading, also known as fold recognition or 3D-1D alignment, can also be used as a search
technique for identifying templates to be used in traditional homology modeling methods. Recent CASP
experiments indicate that some protein threading methods such as RaptorX indeed are more sensitive
than purely sequence (profile)-based methods when only distantly-related templates are available for
the proteins under prediction. When performing a BLAST search, a reliable first approach is to identify
hits with a sufficiently low E-value, which are considered sufficiently close in evolution to make a reliable
homology model. Other factors may tip the balance in marginal cases; for example, the template may
have a function similar to that of the query sequence, or it may belong to a homologous operon.
However, a template with a poor E-value should generally not be chosen, even if it is the only one
available, since it may well have a wrong structure, leading to the production of a misguided model. A
better approach is to submit the primary sequence to fold-recognition servers or, better still, consensus
meta-servers which improve upon individual fold-recognition servers by identifying similarities
(consensus) among independent predictions.
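The sequence-identity check implied above (homology models degrade sharply below roughly 20% identity) can be sketched as a small helper. The alignment itself is assumed to come from BLAST or a similar tool, with '-' marking gaps:

```python
# Percent sequence identity over a given pairwise alignment: identical
# aligned positions divided by aligned (non-gap) positions. A common
# quick check on template suitability for homology modeling.

def percent_identity(aln_query, aln_template):
    """Both strings must be the same length; '-' marks a gap."""
    assert len(aln_query) == len(aln_template)
    aligned = ident = 0
    for q, t in zip(aln_query, aln_template):
        if q != '-' and t != '-':
            aligned += 1
            if q == t:
                ident += 1
    return 100.0 * ident / aligned if aligned else 0.0
```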
3) Model generation :-
Given a template and an alignment, the information contained therein must be used to generate a
three-dimensional structural model of the target, represented as a set of Cartesian coordinates for each
atom in the protein. Three major classes of model generation methods have been proposed.
Fragment assembly :
The original method of homology modeling relied on the assembly of a complete model from conserved
structural fragments identified in closely related solved structures. For example, a modeling study of
serine proteases in mammals identified a sharp distinction between “core” structural regions conserved
in all experimental structures in the class, and variable regions typically located in the loops where the
majority of the sequence differences were localized. Thus unsolved proteins could be modeled by first
constructing the conserved core and then substituting variable regions from other proteins in the set of
solved structures. Current implementations of this method differ mainly in the way they deal with
regions that are not conserved or that lack a template. The variable regions are often constructed with
the help of fragment libraries.
Segment matching :
The segment-matching method divides the target into a series of short segments, each of which is
matched to its own template fitted from the Protein Data Bank. Thus, sequence alignment is done over
segments rather than over the entire protein. Selection of the template for each segment is based on
sequence similarity, comparisons of alpha carbon coordinates, and predicted steric conflicts arising from
the van der Waals radii of the divergent atoms between target and template.
Satisfaction of spatial restraints :
The most common current homology modeling method takes its inspiration from calculations required
to construct a three-dimensional structure from data generated by NMR spectroscopy. One or more
target-template alignments are used to construct a set of geometrical criteria that are then converted to
probability density functions for each restraint. Restraints applied to the main protein internal
coordinates – protein backbone distances and dihedral angles – serve as the basis for a global
optimization procedure that originally used conjugate gradient energy minimization to iteratively refine
the positions of all heavy atoms in the protein.
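The restraint-satisfaction idea can be sketched as a toy optimization that nudges points toward invented target distances; real programs optimize probability density functions over many restraint types, so this is only an illustration:

```python
import math

def refine(coords, restraints, steps=2000, lr=0.01):
    """Toy restraint satisfaction: nudge 3D points toward target
    pairwise distances, a stand-in for optimizing spatial restraints."""
    coords = [list(p) for p in coords]
    for _ in range(steps):
        for i, j, target in restraints:
            dx = [coords[j][k] - coords[i][k] for k in range(3)]
            d = math.sqrt(sum(c * c for c in dx)) or 1e-9
            err = d - target  # positive if the pair is too far apart
            for k in range(3):
                step = lr * err * dx[k] / d
                coords[i][k] += step  # move i toward/away from j
                coords[j][k] -= step  # move j symmetrically
    return coords

# Three "atoms" with equal target separations (values are invented)
start = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
restraints = [(0, 1, 3.8), (1, 2, 3.8), (0, 2, 3.8)]
model = refine(start, restraints)  # pairwise distances now close to 3.8
```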
4) Model assessment :-
Assessment of homology models without reference to the true target structure is usually performed
with two methods: statistical potentials or physics-based energy calculations. Both methods produce an
estimate of the energy (or an energy-like analog) for the model or models being assessed; independent
criteria are needed to determine acceptable cutoffs. Neither of the two methods correlates
exceptionally well with true structural accuracy, especially on protein types underrepresented in the
PDB, such as membrane proteins.
Statistical potentials are empirical methods based on observed residue-residue contact frequencies
among proteins of known structure in the PDB. They assign a probability or energy score to each
possible pairwise interaction between amino acids and combine these pairwise interaction scores into a
single score for the entire model. Some such methods can also produce a residue-by-residue assessment
that identifies poorly scoring regions within the model, though the model may have a reasonable score
overall. These methods emphasize the hydrophobic core and solvent-exposed polar amino acids often
present in globular proteins. Examples of popular statistical potentials include ProSA and DOPE.
Statistical potentials are more computationally efficient than energy calculations.
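A minimal sketch of how a statistical potential scores a model; the contact energies below are invented for illustration and are not ProSA or DOPE values:

```python
# Invented pairwise pseudo-energies (illustrative only)
CONTACT_ENERGY = {
    frozenset(["LEU", "ILE"]): -0.8,  # hydrophobic core contact: favourable
    frozenset(["LEU", "LEU"]): -0.7,
    frozenset(["LYS", "GLU"]): -0.5,  # salt bridge: favourable
    frozenset(["LYS", "LYS"]): +0.6,  # buried like charges: unfavourable
}

def score_model(contacts):
    """Sum pseudo-energies over all observed residue-residue contacts;
    a lower (more negative) total suggests a more native-like model.
    Unknown pairs contribute zero."""
    return sum(CONTACT_ENERGY.get(frozenset(pair), 0.0) for pair in contacts)

# Contacts extracted (hypothetically) from a homology model
contacts = [("LEU", "ILE"), ("LYS", "GLU"), ("LYS", "LYS")]
total = score_model(contacts)  # -0.8 - 0.5 + 0.6 = -0.7
```

A residue-by-residue version of this sum is what lets such methods flag poorly scoring regions inside a model that scores reasonably overall.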
2.3.3 NMR SPECTROSCOPY :-
• NMR spectroscopy may be used to determine the structure of proteins.
• The protein is purified, placed in a strong magnetic field, and then probed with radio waves.
• A distinctive set of observed resonances may be analyzed to give a list of atomic nuclei that are
close to one another, and to characterize the local conformation of atoms that are bonded
together.
• This list of restraints is then used to build a model of the protein that shows the location of each
atom.
• The technique is currently limited to small or medium proteins, since large proteins present
problems with overlapping peaks in the NMR spectra.
A major advantage of NMR spectroscopy is that it provides information on proteins in solution, as
opposed to those locked in a crystal or bound to a microscope grid, and thus, NMR spectroscopy is the
premier method for studying the atomic structures of flexible proteins. A typical NMR structure will
include an ensemble of protein structures, all of which are consistent with the observed list of
experimental restraints. The structures in this ensemble will be very similar to each other in regions with
strong restraints, and very different in less constrained portions of the chain. Presumably, these areas
with fewer restraints are the flexible parts of the molecule, and thus do not give a strong signal in the
experiment.
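The agreement between ensemble members can be quantified per position. A minimal sketch with made-up coordinates (a simplified per-position spread, in the spirit of an RMSF calculation):

```python
import math

def per_position_spread(ensemble):
    """For each position, the RMS deviation of each model's coordinate
    from the ensemble mean; large values flag flexible, weakly
    restrained regions of the chain."""
    n_models = len(ensemble)
    n_pos = len(ensemble[0])
    spreads = []
    for p in range(n_pos):
        mean = [sum(m[p][k] for m in ensemble) / n_models for k in range(3)]
        msd = sum(
            sum((m[p][k] - mean[k]) ** 2 for k in range(3)) for m in ensemble
        ) / n_models
        spreads.append(math.sqrt(msd))
    return spreads

# Two toy models: position 0 is tightly restrained, position 1 is not
ensemble = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (8.0, 0.0, 0.0)],
]
spreads = per_position_spread(ensemble)  # spread at position 1 is larger
```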
In the PDB archive, you will typically find two types of coordinate entries for NMR structures. The first
includes the full ensemble from the structural determination, with each structure designated as a
separate model. The second type of entry is a minimized average structure. These files attempt to
capture the average properties of the molecule based on the different observations in the ensemble.
You can also find a list of restraints that were determined by the NMR experiment. These include things
like hydrogen bonds and disulfide linkages, distances between hydrogen atoms that are close to one
another, and restraints on the local conformation and stereochemistry of the chain.
2.3.4 3D ELECTRON MICROSCOPY (3DEM) :-
• Electron microscopy, frequently referred to as 3DEM, is also used to determine 3D structures of
large macromolecular assemblies.
• A beam of electrons and a system of electron lenses is used to image the biomolecule directly.
• Several tricks are required to obtain a 3D structure from 2D projection images produced by
transmission electron microscopes.
• The most commonly used technique today involves imaging of many thousands of different
single particles preserved in a thin layer of non-crystalline ice (cryo-EM).
• Provided these views show the molecule in myriad different orientations, a computational
approach akin to that used for computerized axial tomography or CAT scans in medicine will
yield a 3D mass density map.
• With a sufficient number of single particles, the 3DEM maps can then be interpreted by fitting
an atomic model of the macromolecule into the map, just as macromolecular crystallographers
interpret their electron density maps.
• In a restricted number of cases, electron diffraction from 2D or 3D crystals or helical assemblies
of biomolecules can be used to determine 3D structures with an electron microscope, using an
approach very similar to that of X-ray crystallography.
• Finally, 3DEM techniques are gaining prominence in studying biological assemblies inside cryo-
preserved cells and tissues using electron tomography.
• This method involves recording images at different tilt angles and averaging the images across
multiple copies of the biological assembly in situ.
2.3.5 SERIAL FEMTOSECOND CRYSTALLOGRAPHY (SFX) :-
• New technology, termed serial femtosecond crystallography, is revolutionizing the methods of
X-ray crystallography.
• A free electron X-ray laser (XFEL) is used to create pulses of radiation that are extremely short
(lasting only femtoseconds) and extremely bright.
• A stream of tiny crystals (nanometers to micrometers in size) is passed through the beam, and
each X-ray pulse produces a diffraction pattern from a crystal, often burning it up in the process.
• A full data set is compiled from as many as tens of thousands of these individual diffraction
patterns.
• The method is very powerful because it allows scientists to study molecular processes that occur
over very short time scales, such as the absorption of light by biological chromophores.
2.4 PROTEIN – PROTEIN INTERACTION
Protein–protein interactions (PPIs) are physical contacts of high specificity established between two or
more protein molecules as a result of biochemical events steered by interactions that include
electrostatic forces, hydrogen bonding and the hydrophobic effect. Many are physical contacts with
molecular associations between chains that occur in a cell or in a living organism in a specific
biomolecular context.
Proteins rarely act alone as their functions tend to be regulated. Many molecular processes within a cell
are carried out by molecular machines that are built from numerous protein components organized by
their PPIs. These physiological interactions make up the so-called interactome of the organism, while
aberrant PPIs are the basis of multiple aggregation-related diseases, such as Creutzfeldt–Jakob and
Alzheimer’s diseases.
PPIs have been studied with many methods and from different perspectives : biochemistry, quantum
chemistry, molecular dynamics, signal transduction, among others. All this information enables the
creation of large protein interaction networks – similar to metabolic or genetic/epigenetic networks –
that empower the current knowledge on biochemical cascades and molecular etiology of disease, as
well as the discovery of putative protein targets of therapeutic interest.
Examples of PPIs :-
1) Signal transduction :
The activity of the cell is regulated by extracellular signals. Signal propagation inside and/or along the
interior of cells depends on PPIs between the various signaling molecules. The recruitment of signaling
pathways through PPIs is called signal transduction and plays a fundamental role in many biological
processes and in many diseases including Parkinson’s disease and cancer.
2) Membrane transport :
A protein may carry another protein (for example, from the cytoplasm to the nucleus or vice versa, as
in the case of the nuclear pore importins).
3) Cell metabolism :
In many biosynthetic processes enzymes interact with each other to produce small compounds or other
macromolecules.
4) Muscle contraction :
Physiology of muscle contraction involves several interactions. Myosin filaments act as molecular
motors and, by binding to actin, enable filament sliding. Furthermore, members of the skeletal muscle
lipid droplet-associated protein family associate with other proteins, such as the activator of adipose
triglyceride lipase and its coactivator comparative gene identification-58, to regulate lipolysis in
skeletal muscle.
Types of Protein – Protein Interaction :-
1) Homo-oligomers and hetero-oligomers :
Homo-oligomers are macromolecular complexes constituted by only one type of protein subunit.
Protein subunit assembly is guided by the establishment of non-covalent interactions in the quaternary
structure of the protein. Disruption of homo-oligomers in order to return to the initial individual
monomers often requires denaturation of the complex. Several enzymes, carrier proteins, scaffolding
proteins, and transcriptional regulatory factors carry out their functions as homo-oligomers. Distinct
protein subunits interact in hetero-oligomers, which are essential to control several cellular functions.
The importance of the communication between heterologous proteins is even more evident during cell
signaling events and such interactions are only possible due to structural domains within the proteins.
2) Stable interactions and transient interactions :
Stable interactions involve proteins that interact for a long time, forming part of permanent complexes
as subunits in order to carry out functional roles. This is usually the case for homo-oligomers (e.g.
cytochrome c) and some hetero-oligomeric proteins, such as the subunits of ATPase. On the other hand, a
protein may interact briefly and in a reversible manner with other proteins in only certain cellular
contexts – cell type, cell cycle stage, external factors, presence of other binding proteins, etc. – as it
happens with most of the proteins involved in biochemical cascades. These are called transient
interactions. For example, some G protein-coupled receptors only transiently bind to Gi/o proteins when
they are activated by extracellular ligands, while some Gq-coupled receptors, such as muscarinic
receptor M3, pre-couple with Gq proteins prior to the receptor-ligand binding. Interactions between
intrinsically disordered protein regions and globular protein domains (i.e. MoRFs) are transient
interactions.
3) Covalent and non-covalent :
Covalent interactions are those with the strongest association and are formed by disulphide bonds or
electron sharing. While rare, these interactions are determinant in some posttranslational
modifications, such as ubiquitination and SUMOylation. Non-covalent bonds are usually established
during transient interactions by the combination of weaker bonds, such as hydrogen bonds, ionic
interactions, van der Waals forces, or hydrophobic bonds.
Experimental Methods of PPIs :-
There are a multitude of methods to detect PPIs. Each of the approaches has its own strengths and
weaknesses, especially with regard to the sensitivity and specificity of the method. The most
conventional and widely used high-throughput methods are yeast two-hybrid screening and affinity
purification coupled to mass spectrometry.
1) Yeast two-hybrid screening :
This system was first described in 1989 by Fields and Song, using Saccharomyces cerevisiae as the
biological model. Yeast two-hybrid allows the identification of pairwise PPIs (binary method) in vivo, in
which the two proteins are tested for a direct biophysical interaction. The Y2H is based on the functional
reconstitution of the yeast transcription factor Gal4 and subsequent activation of a selective reporter
such as His3. To test two proteins for interaction, two protein expression constructs are made : one
protein (X) is fused to the Gal4 DNA-binding domain (DB) and a second protein (Y) is fused to the Gal4
activation domain (AD). In the assay, yeast cells are transformed with these constructs. Transcription of
reporter genes does not occur unless bait (DB-X) and prey (AD-Y) interact with each other and form a
functional Gal4 transcription factor. Thus, the interaction between proteins can be inferred by the
presence of the products resultant of the reporter gene expression. In cases in which the reporter gene
expresses enzymes that allow the yeast to synthesize essential amino acids or nucleotides, yeast growth
under selective media conditions indicates that the two proteins tested are interacting. Recently,
software to detect and prioritize protein interactions was published.
Despite its usefulness, the yeast two-hybrid system has limitations. It uses yeast as the main host
system, which can be a problem when studying proteins that contain mammalian-specific
post-translational modifications. The number of PPIs identified is usually low because of a high
false-negative rate; the method also underrepresents membrane proteins, for example.
In initial studies that utilized Y2H, proper controls for false positives (e.g. when DB-X activates the
reporter gene without the presence of AD-Y) were frequently not done, leading to a higher than normal
false positive rate. An empirical framework must be implemented to control for these false positives.
Limitations in the lower coverage of membrane proteins have been overcome by the emergence of
yeast two-hybrid variants, such as the membrane yeast two-hybrid (MYTH) and the split-ubiquitin
system, which are not limited to interactions that occur in the nucleus, and by the bacterial two-hybrid
system, performed in bacteria.
2) Affinity purification coupled to mass spectrometry :
Affinity purification coupled to mass spectrometry mostly detects stable interactions and thus better
indicates functional in vivo PPIs. This method starts by purification of the tagged protein, which is
expressed in the cell usually at in vivo concentrations, and its interacting proteins (affinity purification).
One of the most advantageous and widely used methods to purify proteins with a very low
contaminating background is tandem affinity purification, developed by Bertrand Seraphin and
Matthias Mann and their respective colleagues. PPIs can then be quantitatively and qualitatively
analysed by mass spectrometry
using different methods : chemical incorporation, biological or metabolic incorporation (SILAC), and
label-free methods. Furthermore, network theory has been used to study the whole set of identified
protein-protein interactions in cells.
Fig. Principle of tandem affinity purification
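The network-theory view mentioned above can be sketched minimally: identified pairwise interactions are collected into an undirected graph, and simple graph measures (here, degree) pick out hub proteins. The interaction pairs below are hypothetical:

```python
from collections import defaultdict

def build_network(interactions):
    """Collect pairwise interactions into an undirected adjacency map."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network

# Hypothetical binary interactions (e.g. from Y2H or AP-MS screens)
pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("C", "D")]
network = build_network(pairs)
hub = max(network, key=lambda protein: len(network[protein]))  # "A"
```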
3) Nucleic acid programmable protein array :
This system was first developed by LaBaer and colleagues in 2004, using an in vitro transcription and
translation system. A DNA template encoding the gene of interest fused to GST is immobilized on a
solid surface : anti-GST antibody and biotinylated plasmid DNA are bound to an
aminopropyltriethoxysilane (APTES)-coated slide, with BSA used to improve the binding efficiency of
the DNA, and the biotinylated plasmid DNA is bound by avidin. The protein is then synthesized using a
cell-free expression system, i.e. rabbit reticulocyte lysate (RRL), and the newly made protein is
captured by the anti-GST antibody bound on the slide. To test a protein-protein interaction, the target
protein cDNA and the query protein cDNA are immobilized on the same coated slide. Using the in vitro
transcription and translation system, the target and query proteins are synthesized from the same
extract. The target protein is bound to the array by the antibody coated on the slide, and the query
protein, tagged with a hemagglutinin (HA) epitope, is used to probe the array. The interaction between
the two proteins is thus visualized with an antibody against HA.
4) Intragenic complementation :
When multiple copies of a polypeptide encoded by a gene form a complex, this protein structure is
referred to as a multimer. When a multimer is formed from polypeptides produced by two different
mutant alleles of a particular gene, the mixed multimer may exhibit greater functional activity than the
unmixed multimers formed by each of the mutants alone. In such a case, the phenomenon is referred to
as intragenic complementation (also called inter-allelic complementation). Intragenic complementation
has been demonstrated in many different genes in a variety of organisms including the fungi Neurospora
crassa, Saccharomyces cerevisiae and Schizosaccharomyces pombe; the bacterium Salmonella
typhimurium; the bacteriophage T4; an RNA virus; and humans. In such studies, numerous
mutations defective in the same gene were often isolated and mapped in a linear order on the basis of
recombination frequencies to form a genetic map of the gene. Separately, the mutants were tested in
pairwise combinations to measure complementation. An analysis of the results from such studies led to
the conclusion that intragenic complementation, in general, arises from the interaction of differently
defective polypeptide monomers to form a multimer. Genes that encode multimer-forming
polypeptides appear to be common. One interpretation of the data is that polypeptide monomers are
often aligned in the multimer in such a way that mutant polypeptides defective at nearby sites in the
genetic map tend to form a mixed multimer that functions poorly, whereas mutant polypeptides
defective at distant sites tend to form a mixed multimer that functions more effectively. Direct
interaction of two nascent proteins emerging from nearby ribosomes appears to be a general
mechanism for homo-oligomer (multimer) formation. Hundreds of protein oligomers were identified
that assemble in human cells by such an interaction. The most prevalent form of interaction is between
the N-terminal regions of the interacting proteins. Dimer formation appears to be able to occur
independently of dedicated assembly machines. The intermolecular forces likely responsible for self-
recognition and multimer formation were discussed by Jehle.
5) Other potential methods :
Diverse techniques to identify PPIs have been emerging along with technology progression. These
include co-immunoprecipitation, protein microarrays, analytical ultracentrifugation, light scattering,
fluorescence spectroscopy, luminescence-based mammalian interactome mapping (LUMIER),
resonance-energy transfer systems, mammalian protein–protein interaction trap, electro-switchable
biosurfaces, protein-fragment complementation assay, as well as real-time label-free measurements by
surface plasmon resonance, and calorimetry.
Förster (Fluorescence) Resonance Energy Transfer (FRET) :-
Fluorescence Resonance Energy Transfer (FRET) is a technique used to gauge the distance between
two chromophores, called a donor-acceptor pair. A limitation of FRET is that the transfer process is
effective only when the separation of the donor-acceptor pair is smaller than about 10 nanometers.
Because FRET is so strongly distance-dependent, it has become a popular tool to measure the dynamic
activities of biological molecules at the nanoscale.
Introduction :-
FRET is the acronym of Förster (Fluorescence) Resonance Energy Transfer. Förster energy transfer is
the phenomenon in which an excited donor transfers energy (not an electron) to an acceptor group
through a non-radiative process. The process is highly distance-dependent, which makes it possible to
probe biological structures. One common application is simply to measure the distance between two
positions of interest on a large molecule, generally a biological macromolecule, by attaching
appropriate donor and acceptor groups to it. If the molecule involves only one donor and one acceptor
group, the distance between them can be measured easily, provided there is no conformational
change during the process. If the molecule undergoes a large conformational change, one may instead
follow the dynamics between two sites on the macromolecule, as in studies of protein interactions.
Today, this technique is widely applied in many fields such as single-molecule experiments, molecular
motors, biosensors and DNA mechanical movements. Because of this built-in distance sensitivity, FRET
is also called the “Spectroscopic Ruler”.
Theory :-
Figure 1. Schematic Diagram of Förster Resonance Energy Transfer
The theoretical analysis was developed by Theodor Förster. This non-radiative transfer mechanism is
summarized schematically in Figure 1. A donor group (D) is excited by a photon and then relaxes to the
lowest excited singlet state, S1 (by Kasha’s rule). If the acceptor group is not too far away, the energy
released when the electron returns to the ground state (S0) may simultaneously excite the acceptor
group. This non-radiative process is referred to as “resonance”. After excitation, the excited acceptor
emits a photon and returns to the ground state, provided no other quenching states intervene.
The resonance mechanism is associated with the Coulombic interaction between electrons. The
Coulombic interaction can therefore act over longer donor-acceptor distances than exchange-mediated
energy transfer (Dexter energy transfer), which requires overlap of the wave functions. The Coulombic
interaction requires only overlap of the spectra, i.e., a matching of the transition energies (resonance).
Figure 2 gives a sense of what the resonance mechanism is. (Note that the HOMO-LUMO gap does not
equal the energy difference between the ground state and the lowest S1 excited state of the molecule.)
Figure 2 : Schematic diagram of Coulombic Interactions
Factors that affect FRET :-
The FRET efficiency (E) is the quantum yield of the energy-transfer transition, i.e., the fraction of donor
excitation events that result in energy transfer. The FRET efficiency is given by

E = kET / (kET + kf + Σ ki)

where
• kET is the rate of FRET,
• kf is the rate of radiative relaxation (i.e., fluorescence), and
• ki are the non-radiative relaxation rates (e.g., internal conversion, intersystem crossing, external
conversion, etc.).
Within a point dipole-dipole approximation, the FRET efficiency can be related to the donor-acceptor
distance via

E = 1 / (1 + (r/R0)^6)

where
• r is the distance between the donor and acceptor chromophores, and
• R0 is the characteristic distance (the Förster distance or Förster radius) at which the transfer
efficiency is 50%.
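The inverse-sixth-power relation can be evaluated directly; a minimal sketch:

```python
def fret_efficiency(r, r0):
    """E = 1 / (1 + (r / R0)**6) for donor-acceptor separation r and
    Förster radius R0, both in the same units (e.g. nm)."""
    return 1.0 / (1.0 + (r / r0) ** 6)

# With R0 = 5 nm: efficiency is 50% at r = R0 by definition,
# near-complete at half that distance, and almost zero at twice it
e_half = fret_efficiency(5.0, 5.0)   # 0.5
e_near = fret_efficiency(2.5, 5.0)   # ~0.98
e_far = fret_efficiency(10.0, 5.0)   # ~0.02
```

The steepness of this curve around r = R0 is what makes FRET a “spectroscopic ruler” on the 1–10 nm scale.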
Overlap of spectrum :-
To enhance the FRET efficiency, the donor group should absorb and emit photons efficiently, i.e., it
should have a high extinction coefficient and a high quantum yield. The overlap of the emission
spectrum of the donor with the absorption spectrum of the acceptor means that the energy released
as the excited donor returns to the ground state can excite the acceptor group; this energy matching is
called the resonance phenomenon. Thus, the greater the spectral overlap, the better a donor can
transfer energy to the acceptor. The overlap integral, J(λ), between the donor and the acceptor
quantifies this spectral overlap, as shown in Figure 3.
Figure 3 : Schematic diagram of Spectral Overlap
The overlap integral is given by

J(λ) = ∫ FD(λ) εA(λ) λ^4 dλ

where
• FD(λ) is the normalized emission spectrum of the donor,
• εA(λ) is the molar absorption (extinction) coefficient of the acceptor, and
• λ is the wavelength.
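A numerical sketch of the overlap integral, using a simple trapezoidal rule over made-up spectra:

```python
def overlap_integral(lams, f_d, eps_a):
    """Trapezoidal estimate of J = ∫ F_D(λ) ε_A(λ) λ^4 dλ, where lams
    are wavelengths, f_d the normalized donor emission spectrum, and
    eps_a the acceptor molar extinction coefficient at each wavelength."""
    y = [f * e * lam ** 4 for lam, f, e in zip(lams, f_d, eps_a)]
    return sum(
        0.5 * (y[i] + y[i + 1]) * (lams[i + 1] - lams[i])
        for i in range(len(lams) - 1)
    )

# Made-up spectra on a coarse wavelength grid (nm)
lams = [450.0, 460.0, 470.0, 480.0, 490.0]
f_d = [0.0, 0.2, 0.6, 0.2, 0.0]          # donor emission (normalized shape)
eps_a = [0.0, 1e4, 3e4, 5e4, 6e4]        # acceptor extinction (M^-1 cm^-1)
j = overlap_integral(lams, f_d, eps_a)   # > 0 where the spectra overlap
```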
Förster Distance (R0) :-
A long R0 gives a high FRET efficiency. Based on Förster’s analysis, R0 is a function of the quantum
yield of the donor chromophore ΦD, the spectral overlap of donor and acceptor J(λ), the relative
orientation of the transition dipoles κ², and the refractive index of the medium n:

R0^6 ∝ κ² n^-4 ΦD J(λ)
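A commonly quoted numerical form of this relation gives R0 in ångströms when J(λ) is in M⁻¹ cm⁻¹ nm⁴; the input values below are illustrative assumptions, not from the text:

```python
def forster_radius(kappa2, n, phi_d, j):
    """R0 in ångströms via R0 = 0.211 * (κ² n^-4 Φ_D J)^(1/6),
    a commonly quoted empirical form with J in M^-1 cm^-1 nm^4."""
    return 0.211 * (kappa2 * n ** -4 * phi_d * j) ** (1.0 / 6.0)

# Illustrative inputs: κ² = 2/3 (randomly oriented dipoles), n = 1.4,
# donor quantum yield 0.5, overlap integral 1e15 M^-1 cm^-1 nm^4
r0 = forster_radius(2.0 / 3.0, 1.4, 0.5, 1e15)  # a few tens of Å
```

Note the sixth root: because of it, R0 is quite insensitive to moderate errors in κ², n, ΦD or J.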
Experimental Confirmation of FRET :-
The inverse sixth-power distance dependence of Förster resonance energy transfer was experimentally
confirmed by Wilchek, Edelhoch and Brand using tryptophyl peptides. Stryer, Haugland and Yguerabide
also experimentally demonstrated the theoretical dependence of Förster resonance energy transfer on
the overlap integral, using a fused indolosteroid as a donor and a ketone as an acceptor. However,
many discrepancies between particular experiments and the theory have been observed, because the
theory is approximate in character and yields overestimated distances of 50–100 ångströms.
Methods to measure FRET Efficiency :-
1) Sensitized emission :
One method of measuring FRET efficiency is to measure the variation in acceptor emission intensity.
When the donor and acceptor are in proximity (1–10 nm) due to the interaction of the two molecules,
the acceptor emission will increase because of the intermolecular FRET from the donor to the acceptor.
For monitoring protein conformational changes, the target protein is labeled with a donor and an
acceptor at two loci. When a twist or bend of the protein changes the distance or relative orientation
of the donor and acceptor, a change in FRET is observed. If a molecular interaction or a protein
conformational change is dependent on ligand binding, this FRET technique is applicable to fluorescent
indicators for ligand detection.
2) Photobleaching FRET :
FRET efficiencies can also be inferred from the photobleaching rates of the donor in the presence and
absence of an acceptor. This method can be performed on most fluorescence microscopes; one simply
shines the excitation light (of a frequency that will excite the donor but not the acceptor significantly) on
specimens with and without the acceptor fluorophore and monitors the donor fluorescence (typically
separated from acceptor fluorescence using a bandpass filter) over time. The timescale is that of
photobleaching, which is seconds to minutes, with the fluorescence in each curve given by

F(t) = background + constant · e^(−t/τpb)

where τpb is the photobleaching decay time constant, which depends on whether the acceptor is
present or not. Since photobleaching consists in the permanent inactivation of excited fluorophores,
resonance energy transfer from an excited donor to an acceptor fluorophore prevents the
photobleaching of that donor fluorophore, and thus high FRET efficiency leads to a longer
photobleaching decay time constant :

E = 1 − τpb / τ′pb

where τ′pb and τpb are the photobleaching decay time constants of the donor in the presence and in
the absence of the acceptor, respectively. (Notice that the fraction is the reciprocal of that used for
lifetime measurements.)
This technique was introduced by Jovin in 1989. Its use of an entire curve of points to extract the time
constants can give it accuracy advantages over the other methods. Also, the fact that time
measurements are over seconds rather than nanoseconds makes it easier than fluorescence lifetime
measurements, and because photobleaching decay rates do not generally depend on donor
concentration (unless acceptor saturation is an issue), the careful control of concentrations needed for
intensity measurements is not needed. It is, however, important to keep the illumination the same for
the with- and without-acceptor measurements, as photobleaching increases markedly with more
intense incident light.
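A minimal sketch of the efficiency estimate from photobleaching time constants (illustrative values):

```python
def fret_from_photobleaching(tau_pb, tau_pb_with_acceptor):
    """E = 1 - τpb / τ'pb : FRET de-excites the donor and so protects
    it from photobleaching, lengthening the decay time constant when
    the acceptor is present (τ'pb >= τpb)."""
    return 1.0 - tau_pb / tau_pb_with_acceptor

# Illustrative photobleaching time constants (seconds)
e = fret_from_photobleaching(10.0, 40.0)  # 0.75
```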
3) Lifetime measurements :
FRET efficiency can also be determined from the change in the fluorescence lifetime of the donor. The
lifetime of the donor will decrease in the presence of the acceptor. Lifetime measurements of the FRET-
donor are used in fluorescence-lifetime imaging microscopy (FLIM).
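The standard relation, with τD the lifetime of the donor alone and τDA its lifetime in the presence of the acceptor, is E = 1 − τDA/τD; a minimal sketch with illustrative lifetimes:

```python
def fret_from_lifetime(tau_da, tau_d):
    """E = 1 - τDA / τD : the donor lifetime shortens from τD to τDA
    when the acceptor opens an extra de-excitation pathway."""
    return 1.0 - tau_da / tau_d

# Illustrative lifetimes (ns): donor alone 4.0, with acceptor 1.0
e = fret_from_lifetime(1.0, 4.0)  # 0.75
```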
Applications of FRET :-
The applications of fluorescence resonance energy transfer (FRET) have expanded tremendously in the
last 25 years, and the technique has become a staple in many biological and biophysical fields. FRET can
be used as a spectroscopic ruler to measure distance and detect molecular interactions in a number of
systems and has applications in biology and biochemistry.
1) Proteins :
FRET is often used to detect and track interactions between proteins. Additionally, FRET can be used to
measure distances between domains in a single protein by tagging different regions of the protein with
fluorophores and measuring emission to determine distance. This provides information about protein
conformation, including secondary structures and protein folding. This extends to tracking functional
changes in protein structure, such as conformational changes associated with myosin activity. Applied in
vivo, FRET has been used to detect the location and interactions of cellular structures including integrins
and membrane proteins.
2) Membranes :
FRET can be used to observe membrane fluidity, movement and dispersal of membrane proteins,
membrane lipid-protein and protein-protein interactions, and successful mixing of different membranes.
FRET is also used to study formation and properties of membrane domains and lipid rafts in cell
membranes and to determine surface density in membranes.
3) Chemosensory :
FRET-based probes can detect the presence of various molecules: the probe’s structure is affected by
small molecule binding or activity, which can turn the FRET system on or off. This is often used to detect
anions, cations, small uncharged molecules, and some larger biomacromolecules as well. Similarly, FRET
systems have been designed to detect changes in the cellular environment due to such factors as pH,
hypoxia, or mitochondrial membrane potential.
4) Signaling pathways :
Another use for FRET is in the study of metabolic or signaling pathways. For example, FRET and BRET
have been used in various experiments to characterize G-protein coupled receptor activation and
consequent signaling mechanisms. Other examples include the use of FRET to analyze such diverse
processes as bacterial chemotaxis and caspase activity in apoptosis.
2.5 CO-IMMUNOPRECIPITATION
Co-Immunoprecipitation (Co-IP) was developed from the immunoprecipitation technique, with which
Co-IP shares the fundamental principle of the specific antigen-antibody reaction. The
immunoprecipitation (IP) technique is used to isolate an individual protein : using an antibody that is
specific for a particular protein, the target protein can be isolated from a crude lysate of a plant or
animal tissue or another biological reagent.
Immunoprecipitation of intact protein complexes is known as Co-IP; it can pull an entire protein
complex out of solution and thereby identify unknown members of the complex. Co-IP is a powerful
technique that is used regularly by molecular biologists to analyze protein–protein interactions.
The first step of co-IP is to prepare a lysate from cells or tissue samples that express the target protein.
In order to preserve intact protein-protein interactions, the lysis buffer should use non-ionic detergents
(e.g., NP-40, Triton X-100). The target proteins are then captured from the total lysate by specific
antibodies. The resultant immunocomplexes (composed of antibody, protein of interest (antigen), and
antigen-associated proteins) can be precipitated using a resin (e.g., agarose, sepharose, or magnetic
beads) that is conjugated with IgG-binding Protein A/G. The third step is washing : irrelevant,
non-binding proteins are removed by a series of washes. The bound proteins are then eluted and
analyzed by SDS-PAGE/immunoblotting and/or mass spectrometry.
Principle of Co-immunoprecipitation :-
Co-IP shares the fundamental principle of the specific antigen-antibody reaction with IP. It helps determine whether two proteins interact under near-physiological conditions in vitro. The Co-IP principle is illustrated schematically in Fig. 1.
The known protein (antigen) is termed the bait protein, and the protein it interacts with is called the
prey protein. The standard Co-IP protocol is the same as that described for IP, and actually any system
designed for IP should also work for Co-IP.
Because the cells are completely lysed under non-denaturing conditions, proteins that are bound together remain associated. Therefore, if you use an anti-X antibody to precipitate protein X by Co-IP, you can also recover the other proteins that interact with protein X in situ.
Co-IP is applied to test whether two known proteins bind each other in cells, or to find a new protein
that interacts with a known protein.
Reagents and buffers :-
❖ PBS
❖ RIPA (RadioImmunoPrecipitation Assay) Lysis buffer
❖ protein A/G-agarose beads
❖ Specific antibody (MAb or PAb)
Procedure / Protocol / Method / Mechanism :-
I) Transfection :
1) Transfect plasmids expressing the proteins of interest, following the transfection protocol.
2) 48 h after transfection, harvest the cells with a cell scraper and pellet them by centrifugation
at 1000 g for 6 min at 4 ℃.
3) Gently resuspend the cells in ice-cold PBS. Centrifuge at 1000 g for 6 min at 4 ℃ and discard
the supernatant.
4) Repeat step 3 three times.
II) Lysis of Resuspended Cells :
5) Gently resuspend the cells in ice-cold lysis buffer (add protease and phosphatase inhibitors
immediately before use).
6) Incubate the samples on ice for 30 min, inverting the tubes periodically during this period.
7) Centrifuge at 15,000 g for 10 min at 4 ℃ and transfer the supernatant to a new 1.5 ml tube.
8) Measure protein concentration.
III) Addition of Antibody-Immobilized resin :
9) Using a pipette tip, transfer the IgG-crosslinked resin to an EP tube containing 1 ml PBS.
10) Pellet the resin by centrifugation at 6000 g for 30 s and aspirate the supernatant.
11) Repeat step 10 three times.
12) Resuspend the resin in ice-cold lysis buffer.
13) Transfer 750 μg of total protein to a new tube and adjust the volume to 200 μl with ice-cold
lysis buffer.
14) Add 50 μl of the prepared resin.
15) Incubate the tubes on a tube rotator at a slow speed for 1 h at 4 ℃.
16) Centrifuge at 6000 g for 30 s at 4 ℃ and transfer 200 μl of the supernatant to a new tube.
Save the pelleted resins to use as negative controls.
17) Dilute the antibody to the proper concentration with lysis buffer.
18) Add the diluted antibody to the lysate prepared in step 16 and incubate the tube on a rotator
for 1 h (or overnight) at 4 ℃.
19) Prepare Protein G-immobilized resin as described in steps 9-12. (Protein G is often considered a
more universal IgG binding protein than is Protein A, but different species, and subtypes of
species, do vary in their binding to these proteins.)
20) Add 50 μl of the prepared resin to each reaction tube in step 18 and incubate on a rotator for 1h
at 4 ℃.
21) Centrifuge at 6000 g for 30 s at 4 ℃. Transfer 100 μl of the supernatant to a new tube and
aspirate the remaining supernatant.
22) Resuspend the resin in 500 μl ice-cold lysis buffer, centrifuge at 6000 g for 30 s at 4 ℃, and
aspirate the supernatant.
23) Repeat step 22 three times.
IV) Analysis :
24) Resuspend the resin-bound immune complexes in 20 μl of 2× Laemmli buffer, boil for 5 min, and
analyze by SDS-PAGE/immunoblotting.
Note : In this Co-IP protocol the antibody is first bound to the Protein A/G-agarose beads and then mixed with the antigen. This gives a lower yield than the alternative order but avoids co-elution of the antibodies. If a high yield of the target protein matters more than non-specific binding, you can instead mix the antibody with the protein sample before adding the Protein A/G-agarose beads; the antibodies are then co-eluted with the target protein and may interfere with western blot detection.
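The lysate dilution arithmetic in steps 8 and 13 (measure the concentration, then bring 750 μg of total protein to 200 μl with lysis buffer) can be sketched as a small helper. The function name and the example concentration of 5 μg/μl are illustrative, not part of the protocol:

```python
def coip_dilution(total_protein_ug, lysate_conc_ug_per_ul, final_volume_ul=200.0):
    """Volumes for step 13: take `total_protein_ug` of protein from the
    lysate (concentration measured in step 8) and top up to
    `final_volume_ul` with ice-cold lysis buffer."""
    lysate_ul = total_protein_ug / lysate_conc_ug_per_ul
    if lysate_ul > final_volume_ul:
        raise ValueError("lysate too dilute; concentrate it or raise the final volume")
    return lysate_ul, final_volume_ul - lysate_ul

# For a lysate measured at 5 ug/ul: 150 ul lysate + 50 ul buffer.
lysate_ul, buffer_ul = coip_dilution(750, 5.0)
```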
Fig 1. Schematic of the co-immunoprecipitation procedure.
Advantages of Co-immunoprecipitation :-
1) Proteins captured in a typical Co-IP retain their post-translational modifications and native
conformation.
2) In Co-IP, proteins interact under non-denaturing, near-physiological conditions.
Disadvantages of Co-immunoprecipitation :-
1) Low-affinity protein interactions may not be detected.
2) An apparent interaction may be bridged by a third protein rather than being direct.
3) To choose an appropriate antibody, the target protein must be correctly predicted; otherwise
Co-IP will not give a positive result.
2.6 METABOLOMICS AND IONOMICS
Metabolomics is the scientific study of chemical processes involving metabolites, the small molecule
substrates, intermediates and products of cell metabolism. Specifically, metabolomics is the “systematic
study of the unique chemical fingerprints that specific cellular processes leave behind”, the study of
their small-molecule metabolite profiles. The metabolome represents the complete set of metabolites in
a biological cell, tissue, organ or organism, which are the end products of cellular processes. Messenger
RNA (mRNA), gene expression data and proteomic analyses reveal the set of gene products being
produced in the cell, data that represents one aspect of cellular function. Conversely, metabolic profiling
can give an instantaneous snapshot of the physiology of that cell, and thus, metabolomics provides a
direct “functional readout of the physiological state” of an organism. One of the challenges of systems
biology and functional genomics is to integrate genomic, transcriptomic, proteomic, and metabolomic
information to provide a better understanding of cellular biology.
DNA ------------> Transcriptome ------------> Proteome ------------> Metabolome
(Genomics)        (Transcriptomics)           (Proteomics)           (Metabolomics)
Fig. The central dogma of biology showing the flow of information from DNA to
the phenotype. Associated with each stage is the corresponding systems biology
tool, from genomics to metabolomics.
History :-
The concept that individuals might have a “metabolic profile” reflected in the makeup of their
biological fluids was introduced by Roger Williams in the late 1940s, who used paper
chromatography to suggest that characteristic metabolic patterns in urine and saliva were
associated with diseases such as schizophrenia. However, it was only through technological advancements in the 1960s
and 1970s that it became feasible to quantitatively (as opposed to qualitatively) measure metabolic
profiles. The term “metabolic profile” was introduced by Horning et al. in 1971 after they demonstrated
that gas chromatography-mass spectrometry (GC-MS) could be used to measure compounds present in
human urine and tissue extracts. The Horning group, along with that of Linus Pauling and Arthur B.
Robinson led the development of GC-MS methods to monitor the metabolites present in urine through
the 1970s.
General Terminologies in Metabolomics :-
1) Metabolome :-
The metabolome refers to the complete set of small-molecule (<1.5 kDa) metabolites (such as metabolic
intermediates, hormones and other signaling molecules, and secondary metabolites) to be found within
a biological sample, such as a single organism. The word was coined in analogy with transcriptomics and
proteomics; like the transcriptome and the proteome, the metabolome is dynamic, changing from
second to second. Although the metabolome can be defined readily enough, it is not currently possible
to analyse the entire range of metabolites by a single analytical method.
The first metabolite database (called METLIN) for searching fragmentation data from tandem mass
spectrometry experiments was developed by the Siuzdak lab at The Scripps Research Institute in 2005.
METLIN contains over 450,000 metabolites and other chemical entities, each compound having
experimental tandem mass spectrometry data. In 2006, the Siuzdak lab also developed the first
algorithm to allow for the nonlinear alignment of mass spectrometry metabolomics data. Called XCMS,
where the “X” constitutes any chromatographic technology, it has since (2012) been developed as an
online tool and as of 2019 (with METLIN) has over 30,000 registered users.
2) Metabolites :-
Metabolites are the substrates, intermediates and products of metabolism. Within the context of
metabolomics, a metabolite is usually defined as any molecule less than 1.5 kDa in size. However, there
are exceptions to this depending on the sample and detection method. For example, macromolecules
such as lipoproteins and albumin are reliably detected in NMR-based metabolomics studies of blood
plasma. In plant-based metabolomics, it is common to refer to “primary” and “secondary” metabolites.
A primary metabolite is directly involved in normal growth, development, and reproduction. A
secondary metabolite is not directly involved in those processes, but usually has important ecological
function. Examples include antibiotics and pigments. By contrast, in human-based metabolomics, it is
more common to describe metabolites as being either endogenous (produced by the host organism) or
exogenous. Metabolites of foreign substances such as drugs are termed xenometabolites.
3) Metabonomics :-
Metabonomics is defined as “the quantitative measurement of the dynamic multiparametric metabolic
response of living systems to pathophysiological stimuli or genetic modification”. The word origin is
from the Greek metabon meaning change and nomos meaning a rule set or set of laws. This approach
was pioneered by Jeremy Nicholson at Murdoch University and has been used in toxicology, disease
diagnosis and a number of other fields. Historically, the metabonomics approach was one of the first
methods to apply the scope of systems biology to studies of metabolism.
4) Exometabolomics :-
Exometabolomics, or “metabolic footprinting”, is the study of extracellular metabolites. It uses many
techniques from other subfields of metabolomics, and has applications in biofuel development,
bioprocessing, determining drugs’ mechanism of action, and studying intercellular interactions.
Techniques involved in Metabolite Analysis :-
1) Mass Spectrometry Techniques :
Gas chromatography-mass spectrometry (GC-MS) is a chromatographic technique applied to study
metabolites which have a low boiling point and which will be present in the gas phase at the
temperature range 50-350°C. These metabolites can have a low boiling point in their biologically native
form, or the boiling point of a metabolite can be decreased through a chemical alteration known as
chemical derivatization. This approach increases the number of metabolites that can be detected in a
biological sample and is commonly applied in metabolomics.
The separation mechanism in GC-MS and LC-MS is similar as metabolites absorb and desorb from a
stationary phase. However in GC-MS the stationary phase is coated on the inside surface of a hollow
silica capillary and the mobile phase is a gas, for example helium, rather than a liquid as is applied in LC-
MS. Gas chromatography (GC) columns are silica capillaries of 10-60 m in length and an internal
diameter of 100-500 μm. The stationary phase is siloxane (or a similar chemical composition) with
varying percentages of chemical moieties. Samples are vaporized and introduced on to the GC column
by the mobile phase. The chromatographic separation is optimized via adjustment of the oven
temperature, gas flow rate and stationary phase composition. One strategy is commonly used for the
formation of ions for mass spectrometric analysis – electron impact ionization.
Electron impact ionization is a high energy process where electrons bombard molecules in a vacuum
and release an electron from the molecule to form a positively charged ion. Electron impact ionization
provides significant fragmentation of the molecular ion and produces a reproducible fragmentation
pattern that can be employed for the identification of metabolites by searching mass spectral libraries.
GC-MS and LC-MS can be complementary techniques and detect complementary sets of metabolites.
A third analytical technique that integrates a separation process with mass spectrometry detection is
capillary electrophoresis-mass spectrometry (CE-MS) that is sometimes called capillary zone
electrophoresis-mass spectrometry. CE-MS involves the separation of ionic species in the liquid phase
via the application of high voltages and is usually coupled to an electrospray mass spectrometer.
Capillary electrophoresis separates metabolites based on their electrophoretic mobility in a liquid
electrolyte solution operating in an electric field. Electrophoretic mobility is dependent on the charge
and size of the metabolite and so separation of metabolites with different sizes and/or charges is
possible. All metabolites have to be charged to allow any mobility to occur.
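As a rough illustration of why mobility depends on charge and size, the Stokes-law approximation for a spherical ion, μ = q / (6πηr), can be sketched as follows. The function name, the default water viscosity, and the 0.3 nm radius are illustrative assumptions, not values from the source:

```python
import math

def electrophoretic_mobility(charge_c, radius_m, viscosity_pa_s=1.0e-3):
    """Stokes-law estimate mu = q / (6 * pi * eta * r) for a spherical
    ion in a dilute electrolyte (eta defaults to water, ~1 mPa*s)."""
    return charge_c / (6 * math.pi * viscosity_pa_s * radius_m)

e = 1.602e-19                               # elementary charge, C
mu = electrophoretic_mobility(e, 0.3e-9)    # singly charged, ~0.3 nm radius
# mu comes out on the order of 1e-8 m^2 V^-1 s^-1, the typical
# magnitude for small ions in aqueous solution: higher charge or
# smaller radius gives higher mobility, hence separation in CE.
```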
2) Nuclear Magnetic Resonance Spectroscopy :
Nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry are the two analytical
instruments most frequently applied in metabolomics research. NMR spectroscopy applies the magnetic
properties of atomic nuclei in a metabolite. Only some atoms are NMR active and include 1H, 13C and 31P;
proton (1H) NMR spectroscopy is the most frequently applied in metabolomics. The technique operates
by placing a liquid sample in a small internal diameter tube (for example a 5 mm tube), or occasionally a
piece of tissue is studied directly using a special sample holder. The sample is pulsed with a range of
radio frequencies covering all possible energies required for exciting the selected type of nuclei.
The nuclei absorb energy at different radio frequencies depending on their chemical environment and
then the release of this energy is measured, forming what’s called a free induction decay (FID). This FID
is converted from a time domain data set to a frequency domain – using a Fourier transformation – and
an NMR spectrum is constructed as the chemical shift (effectively the absorption energy) plotted against
peak intensity.
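The time-domain-to-frequency-domain conversion described above can be illustrated with a simulated FID and a Fourier transform; the two frequencies, the decay constant and the sampling rate are arbitrary illustrative values, not data from a real spectrometer:

```python
import numpy as np

# Simulate a free induction decay: two nuclei resonating at different
# frequencies (chemical shifts), both decaying exponentially (T2 relaxation).
fs = 1000.0                          # sampling rate, Hz
t = np.arange(0, 1.0, 1 / fs)        # 1 s acquisition, 1000 points
fid = (np.exp(2j * np.pi * 50 * t)
       + 0.5 * np.exp(2j * np.pi * 120 * t)) * np.exp(-t / 0.2)

# Fourier transform: time-domain FID -> frequency-domain spectrum
spectrum = np.fft.fft(fid)
freqs = np.fft.fftfreq(len(fid), d=1 / fs)

# The two resonances appear as peaks at 50 Hz and 120 Hz; the larger
# amplitude component dominates.
peak = freqs[np.argmax(np.abs(spectrum))]
```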
A metabolite that the NMR operator wishes to measure might have each proton in that metabolite
present in a different local magnetic environment and therefore each proton will absorb a slightly
different radio frequency. These differences are observed as different chemical shifts in the NMR
spectrum. Typically, a single metabolite will produce several peaks at different chemical shifts, with
some metabolites producing many peaks, for example glucose has ca. 30 peaks. Therefore a sample
containing many metabolites will provide a very complex set of overlapping peaks, similar to that
observed for direct infusion mass spectrometry. Complex data processing methods can be applied to
deconvolve the signals from each metabolite.
A typical NMR metabolomics analysis can consist of a few NMR experiments to increase the information
acquired from each sample, including 1-dimensional 1H NMR and 2-dimensional 1H, 1H J-resolved NMR.
A typical analysis time for one sample is 20 minutes, although a basic 1H NMR experiment can be
recorded in less than 5 minutes.
NMR spectroscopy is commonly applied for untargeted metabolomic studies. Moreover, unlike mass
spectrometry, the absolute concentration of each metabolite can be measured from an NMR spectrum.
This is possible because the signal produced by a single 1H nucleus is not dependent on the magnetic
environment, which allows a single intensity calibration to be applied to all metabolites.
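A minimal sketch of this single-calibration quantitation, assuming an internal standard of known concentration and proton count (all names and numbers below are illustrative):

```python
def nmr_concentration(peak_integral, n_protons, k):
    """The NMR signal per 1H nucleus is environment-independent, so
    integral = k * concentration * n_protons, with one calibration
    constant k serving every metabolite in the spectrum."""
    return peak_integral / (n_protons * k)

# Calibrate k from a hypothetical internal standard: 9 equivalent 1H
# at 1.0 mM giving an integral of 90 -> k = 10 per (mM * 1H).
k = 90.0 / (9 * 1.0)

# A metabolite peak with 3 equivalent 1H (e.g. a CH3 group), integral 45:
metabolite_mM = nmr_concentration(45.0, 3, k)
```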
3) Vibrational Spectroscopy :
Vibrational spectroscopy techniques such as Fourier transform–infrared (FT-IR) spectroscopy and Raman
spectroscopy have been used to analyse metabolic changes in biological samples. The techniques
operate by passing infrared light (FT-IR) or monochromatic visible/near-infrared light (Raman)
through a sample, or sometimes allowing the light to reflect from the sample, before detection. The
approaches predominantly measure the vibrations and rotations of bonds associated with different
chemical functional groups, resulting from the interaction of the sample with the light.
These techniques typically lack the specificity to detect each metabolite separately; instead,
specific parts of a molecule absorb the light at specific wavelengths. For example, in FT-IR, C-H
stretching vibrations characteristic of fatty acid chains are observed in the wavenumber range
3100-2800 cm-1, and the region between 1800 and 1500 cm-1 is dominated by the amide I and amide II
bands, indicating the predominance of either alpha-helix or beta-sheet structures. A metabolic
fingerprint is detected as a single absorption spectrum collected for each sample and consisting of
information for many metabolites. This is similar to the data produced by direct infusion mass
spectrometry, where a single mass spectrum rather than an absorption spectrum is collected for each
sample. The metabolic fingerprints produced in these approaches lack the sensitivity of mass
spectrometry but are a useful tool for high-throughput screening. The analysis time of FT-IR is
approximately one minute per sample, although this can be lower depending on the instrument
parameters applied.
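The band assignments quoted above can be expressed as a simple lookup; the function name is illustrative, and only the two wavenumber ranges mentioned in the text are covered:

```python
def assign_ftir_band(wavenumber_cm1):
    """Map a wavenumber (cm^-1) onto the two FT-IR band assignments
    quoted in the text; everything else is left unassigned."""
    if 2800 <= wavenumber_cm1 <= 3100:
        return "C-H stretch (fatty acid chains)"
    if 1500 <= wavenumber_cm1 <= 1800:
        return "amide I / amide II (protein secondary structure)"
    return "unassigned"

band = assign_ftir_band(2920)   # a typical C-H stretching wavenumber
```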
4) Imaging techniques :
All of the techniques described above typically require a sample extraction and preparation method
for the analysis of cells and tissues; this provides a single homogeneous sample and no information
on the spatial distribution of metabolites across cellular organelles or across the different cell
types in a tissue. However, the location of metabolites can provide important information on how
and why biological mechanisms operate: for example, whether a drug is concentrated in one area of a
tissue related to its mode of action or is distributed across the whole tissue. Imaging techniques
can be applied as a complementary approach to all of the non-imaging techniques discussed.
Mass spectral imaging is one of the most common set of techniques applied and includes different
ionization methods :
• MALDI - matrix-assisted laser desorption ionization
• DESI - desorption electrospray ionization
• SIMS - secondary ion mass spectrometry
All of these techniques separate the sample into multiple pixels, and a mass spectrum is acquired
for each pixel using one of the ionization methods named above. The distribution of a metabolite
can then be mapped by visualizing the response of the metabolite ion in each pixel. The size of
each pixel defines the spatial resolution: a smaller pixel size means that data for more pixels can
be acquired and that the spatial resolution is higher.
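The per-pixel mapping described above can be sketched with a toy data set, where each pixel holds an (m/z → intensity) record and one metabolite ion is mapped across the grid; the m/z values and grid size are invented purely for illustration:

```python
import numpy as np

# Toy mass-spectral image: each pixel stores an {m/z: intensity} record.
pixels = [[{301.1: float(i + j), 180.2: 1.0} for j in range(3)]
          for i in range(3)]

target_mz, tol = 301.1, 0.01    # ion of the metabolite to map, with m/z tolerance
image = np.array([[sum(v for mz, v in px.items() if abs(mz - target_mz) <= tol)
                   for px in row]
                  for row in pixels])
# `image` now holds the ion response per pixel; acquiring more, smaller
# pixels would give a higher-resolution map of the same distribution.
```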
Other techniques can also be applied for imaging cells and tissues. For example, FT-IR and Raman
spectroscopy can be applied for imaging where an absorption or transmission spectrum is acquired for
each pixel instead of a mass spectrum. The distribution of a metabolite or class of metabolites can be
visually described by plotting specific wavelengths related to the metabolite or metabolite class.
All of these imaging techniques provide unique and complementary data, but the collection of
imaging data commonly requires longer analysis times, of hours compared to minutes for the other
techniques.
Steps Involved in Metabolite Analysis :-
Plant metabolomes can be very diverse. Because of the complexity and divergent physicochemical
properties of the cellular metabolome, a combination of two or more metabolomic strategies (outlined
in Table 1) may be considered to achieve a comprehensive coverage of the plant metabolome.
Furthermore, in any metabolomic approach, a broad metabolic picture is achieved through the
combination of multiparallel and complementary analytical systems, including the use of various
extraction protocols.
Metabolomics : Holistic quantification and identification of all metabolites within an organism or
a biological system, under a given set of conditions. This state is currently unrealisable with any
single or combination of metabolomic approaches.
Metabonomics : This term is normally used in non-plant systems and generally refers to the
quantitative detection of endogenous metabolites that are dynamically altered within a living
system in response to pathophysiological stimuli or genetic modification. Tissues and biofluids are
commonly used for these analyses.
Metabolic / Metabolite profiling : The identification and quantification of metabolites related
through their metabolic pathway(s) or similarities in their chemistry.
Targeted metabolite analysis (metabolite target analysis) : Qualitative and quantitative analysis
of one or a few pre-defined metabolites related to a specific metabolic reaction. Such an approach
relies on optimized metabolite extraction, separation and detection.
Metabolite fingerprinting : Rapid and high-throughput methods where global metabolite profiles are
obtained from crude samples or simple cellular extracts. In general, metabolites are neither
quantified nor identified.
Metabolite footprinting : The measurement of metabolites secreted from the intracellular complement
of an organism (or biological system) into its extracellular medium or matrix. This approach is
commonly used in microbial metabolomics.
Table 1 : Strategies for metabolomic analysis
A metabolomic analysis comprises three main experimental stages : (1) preparation of the sample, (2)
acquisition of the data using analytical methods and (3) data mining using chemometric methods
followed by compound identification. These steps are crucially interrelated and, as illustrated in
Figure 2, may each consist of a series of sub-steps. The resulting analysed data from the various
experimental phases form the basis for meaningful biochemical interpretation.
Sample preparation, the critical step of transforming the sample into a solution that can be
analysed, makes a vital contribution in defining the array of metabolite classes to be covered. This step involves a
series of different experimental stages: selection and harvesting of samples, drying or enzyme
quenching procedures, extraction of metabolites and preparation of the samples for analysis. The
selection of plant material depends mainly on the biological question that the researcher seeks to
investigate. Throughout this step, care must be taken to avoid the introduction of any form of unwanted
variability that would significantly affect the outcome of the analysis. Sample degradation (thermal,
oxidative or enzymological) and contamination are major factors leading to variations during this step.
Various enzyme quenching methods include drying, treatment with acid, use of enzyme inhibitors or
high concentrations of organic solvents.
Figure 2 : Flowchart for plant metabolomic studies. The three main steps of a
metabolomic analysis are sample preparation, data acquisition and data mining.
A data handling pipeline is established from data acquisition to data mining.
These three steps are interrelated and lead to the biochemical interpretations.
Plant metabolites are structurally diverse, forming a highly complex spectrum of compounds of different
size, solubility, volatility, polarity, quantity and stability. Any extraction method would certainly produce
an inherently multidimensional sample arising from the chemical and physical differences of the
constituents. Several methods may be employed to extract metabolites; the choice of method depends
on a variety of factors, such as the physicochemical properties of the target metabolites, the
biochemical composition of the system under investigation and the properties of the solvent to be used.
Some of the common extraction methods include solvent extraction, supercritical fluid extraction,
sonication and solid phase extraction. However, no comprehensive extraction technique exists for the
recovery of all classes of compounds with high reproducibility and robustness. Thus, for a
comprehensive coverage of different classes of metabolites, extraction methods may be used in
combination.
Data acquisition (sample analysis) follows the sample preparation step and requires advanced
analytical techniques, as the extreme complexity of metabolomic samples makes it technologically
impossible to separate, quantify and identify every metabolite within a biological sample. A range of
analytical platforms are employed in metabolomic studies (separately or in combination), and each
platform has its own advantages and limitations, either in selectivity or sensitivity (Table 2). The choice
of the analytical platform depends mainly on the study at hand, taking into consideration the class of
compounds, their chemical and physical properties, and their concentration levels. As such, the
techniques most often used are nuclear magnetic resonance spectroscopy and chromatography coupled
to MS.
Table 2 : Some standard techniques used in metabolomic analysis. In general, one technology is not
sufficient for the analysis of all compounds, but any form of separation will inherently introduce a bias
towards the analytes being detected.
Applications of Metabolomics :-
Toxicity assessment/toxicology by metabolic profiling (especially of urine or blood plasma samples)
detects the physiological changes caused by toxic insult of a chemical (or mixture of chemicals). In many
cases, the observed changes can be related to specific syndromes, e.g. a specific lesion in liver or kidney.
This is of particular relevance to pharmaceutical companies wanting to test the toxicity of potential drug
candidates : if a compound can be eliminated before it reaches clinical trials on the grounds of adverse
toxicity, it saves the enormous expense of the trials.
For functional genomics, metabolomics can be an excellent tool for determining the phenotype caused
by a genetic manipulation, such as gene deletion or insertion. Sometimes this can be a sufficient goal in
itself—for instance, to detect any phenotypic changes in a genetically modified plant intended for
human or animal consumption. More exciting is the prospect of predicting the function of unknown
genes by comparison with the metabolic perturbations caused by deletion/insertion of known genes.
Such advances are most likely to come from model organisms such as Saccharomyces cerevisiae and
Arabidopsis thaliana. The Cravatt laboratory at The Scripps Research Institute has recently applied this
technology to mammalian systems, identifying the N-acyltaurines as previously uncharacterized
endogenous substrates for the enzyme fatty acid amide hydrolase (FAAH) and the monoalkylglycerol
ethers (MAGEs) as endogenous substrates for the uncharacterized hydrolase KIAA1363.
Metabologenomics is a novel approach to integrate metabolomics and genomics data by correlating
microbial-exported metabolites with predicted biosynthetic genes. This bioinformatics-based pairing
method enables natural product discovery at larger scale by refining non-targeted metabolomic
analyses to identify small molecules with related biosynthesis and to focus on those whose
structures may not previously have been known.
Fluxomics is a further development of metabolomics. The disadvantage of metabolomics is that it only
provides the user with steady-state level information, while fluxomics determines the reaction rates of
metabolic reactions and can trace metabolites in a biological system over time.
Nutrigenomics is a generalized term which links genomics, transcriptomics, proteomics and
metabolomics to human nutrition. In general a metabolome in a given body fluid is influenced by
endogenous factors such as age, sex, body composition and genetics as well as underlying pathologies.
The large bowel microflora are also a very significant potential confounder of metabolic profiles and
could be classified as either an endogenous or exogenous factor. The main exogenous factors are diet
and drugs. Diet can then be broken down to nutrients and non-nutrients. Metabolomics is one means to
determine a biological endpoint, or metabolic fingerprint, which reflects the balance of all these forces
on an individual’s metabolism.
IONOMICS :-
Ionomics is the measurement of the total elemental composition of an organism to address biological
problems. Questions within physiology, ecology, evolution, and many other fields can be investigated
using ionomics, often coupled with bioinformatics and other genetic tools. Observing an organism’s
ionome is a powerful approach to the functional analysis of its genes and gene networks.
Information about the physiological state of an organism can also be revealed indirectly through its
ionome, for example iron deficiency in a plant can be identified by looking at a number of other
elements, rather than iron itself. A more typical example is in a blood test, where a number of
conditions involving nutrition or disease may be inferred from testing this single tissue for sodium,
potassium, iron, chlorine, zinc, magnesium, calcium and copper.
In practice, the total elemental composition of an organism is rarely determined. The number and type
of elements measured are limited by the available instrumentation, the assumed value of the element in
question, and the added cost of measuring each additional element. Also, a single tissue may be
measured instead of the entire organism, as in the example given above of a blood test, or in the case of
plants, the sampling of just the leaves or seeds. These are simply issues of practicality.
Various techniques may be fruitfully used to measure elemental composition. Among the best are
Inductively-Coupled Plasma Optical Emission Spectroscopy (ICP-OES), Inductively-Coupled Plasma Mass
Spectrometry (ICP-MS), X-Ray Fluorescence (XRF), synchrotron-based microXRF, and Neutron activation
analysis (NAA). This latter technique has been applied to perform ionomics in the study of breast cancer,
colorectal cancer and brain cancer. High-throughput ionomic phenotyping has created the need for data
management systems to collect, organize and share the collected data with researchers worldwide.
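Raw ICP-MS counts are typically converted to concentrations against analytical standards and normalized to sample weight. A highly simplified one-point sketch follows; the function name, the numbers, and the one-point calibration itself are illustrative assumptions (real pipelines use multi-point calibration curves and drift correction):

```python
def normalize_icpms(raw_counts, standard_counts, standard_conc_ppm, sample_weight_g):
    """One-point calibration against an analytical standard, then
    normalization to the (calculated) sample weight."""
    conc_ppm = raw_counts * (standard_conc_ppm / standard_counts)
    return conc_ppm / sample_weight_g

# Illustrative numbers: 50,000 counts for an element in the sample;
# a 1.0 ppm standard gave 100,000 counts; 0.01 g calculated weight.
element_per_g = normalize_icpms(50_000, 100_000, 1.0, 0.01)
```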
Why ionome ?
• Living systems are supported and sustained by the genome through the action of the
transcriptome, proteome, metabolome, and ionome, the four basic biochemical pillars of
functional genomics.
• These pillars represent the sum of all the expressed genes, proteins, metabolites, and elements
within an organism.
• The dynamic response and interaction of these biochemical ‘‘omes’’ defines how a living system
functions, and its study, systems biology, is now one of the biggest challenges in the life
sciences.
• Because the ionome is involved in such a broad range of important biological phenomena,
including electrophysiology, signaling, enzymology, osmoregulation, and transport, its study
promises to yield new and significant biological insight.
What area does it cover ?
• Lahner and colleagues first described the ionome to include all the metals, metalloids, and
nonmetals present in an organism (Lahner et al., 2003), extending the term metallome (Outten
and O’Halloran, 2001; Williams, 2001; Szpunar, 2004) to include biologically significant
nonmetals such as nitrogen, phosphorus, sulfur, selenium, chlorine, and iodine.
• It is important to note here that the boundaries between the ionome, metabolome, and
proteome are blurred.
• Compounds containing the nonmetals phosphorus, sulfur, or nitrogen, for example, would fall
within both the ionome and metabolome, and metals such as zinc, copper, manganese, and iron
in metalloproteins would fall within the proteome, or metalloproteome as it has been described
(Szpunar, 2004).
• The elements measured in the ionome will be determined by their biological importance or
environmental relevance, in conjunction with their amenability to quantitation.
Ionome flow chart details :-
• Figure 1. High-throughput ionomics. Putative mutants and wild-type Arabidopsis plants are
grown together with known ionomic mutants, used as positive controls, under standardized
conditions.
• Plants are uniformly sampled, digested in concentrated nitric acid, diluted, and analyzed for
numerous elements using ICP-MS.
• Raw ICP-MS data are normalized using analytical standards and calculated weights based on
wild- type plants (Lahner et al., 2003).
• Data are processed using custom tools and stored in a searchable, World Wide Web-accessible
database.
• Ionomic analysis can also be applied to other plants with available genetic resources, including
rice and maize.
• Elements in the Periodic Table highlighted in black boxes represent those elements analyzed
during our ionomic analyses using ICP-MS, elements highlighted in green are essential for plant
growth, and those in red represent nonessential trace elements.
• The table represents Arabidopsis (Col-0) shoot and seed ionomes, with all elements presented as
mg g⁻¹ dry weight. Data represent the average shoot concentrations from 60 individual plants and
seed from 12 individuals ± SD as a percentage of the average (%RSD); all plants were grown as
described by Lahner et al. (2003).
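The standard-and-weight normalization described in the workflow above can be sketched as simple arithmetic. This is only an illustrative one-point calibration with invented numbers; real runs use multi-point standards, drift correction, and the calculated-weight approach of Lahner et al. (2003).

```python
# Illustrative sketch of the ICP-MS normalization step: raw counts are
# calibrated against an analytical standard, then expressed per gram of
# dry tissue. All numbers and function names are hypothetical.

def counts_to_concentration(raw_counts, standard_counts, standard_conc):
    """Convert raw ICP-MS counts to solution concentration (ug/L)
    via a single-point analytical standard."""
    return raw_counts / standard_counts * standard_conc

def normalize_to_dry_weight(solution_conc_ug_per_l, dilution_volume_l, dry_weight_g):
    """Express element content as ug per g dry tissue."""
    return solution_conc_ug_per_l * dilution_volume_l / dry_weight_g

# Example: a hypothetical Fe measurement on a 5 mg shoot sample in 10 mL
conc = counts_to_concentration(raw_counts=52_000, standard_counts=100_000,
                               standard_conc=100.0)
fe_per_g = normalize_to_dry_weight(conc, dilution_volume_l=0.010,
                                   dry_weight_g=0.005)
print(round(conc, 1), round(fe_per_g, 1))  # 52.0 104.0
```

In the real pipeline the dry weight of each mutant plant is not weighed directly but estimated from correlations observed in wild-type plants, which is what makes the workflow high-throughput.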
Natural variation – Arabidopsis ionome :-
• Natural variation in Arabidopsis seed and shoot phosphate accumulation is known to exist
between the Ler and Cvi accessions (Bentsink et al., 2003), and for potassium, sodium, calcium,
magnesium, iron, manganese, zinc, and phosphorus in seeds of numerous ecotypes
(Vreugdenhil et al., 2004).
• Analyses of Ler/Cvi recombinant inbred lines revealed quantitative trait loci (QTL) that explain
between 10% and 79% of this variation for the different elements (Vreugdenhil et al., 2004).
• Natural variation in several Arabidopsis ecotypes has also been observed for shoot caesium.
Inductively Coupled Plasma – Optical Emission Spectroscopy (ICP-OES) technology :-
• The ICP is designed to generate a plasma, a gas in which atoms are present in the ionized state.
• To generate the plasma, a silica torch is used, situated within the water- or argon-cooled coil of a
radio-frequency (RF) generator. A flowing gas (the plasma gas), typically argon (Ar), is introduced
into the torch, and the radio-frequency field ionizes the gas, making it electrically
conductive.
• The plasma is maintained by the inductive heating of the flowing gas. The plasma, at up to 8000
K, is insulated both electrically and thermally from the instrument, and maintained in position
by a flow of cooling argon gas (coolant gas).
• The sample to be analyzed, as an aerosol, is carried into the plasma by a third argon gas stream
(carrier gas).
• A nebulizer in the instrument transforms the aqueous sample into an aerosol. The sample is
pumped into the nebulizer via a peristaltic pump where it is converted into an aerosol, which
passes into the spray chamber with the carrier argon gas.
• In the spray chamber the finest sample droplets are swept into the plasma while the large
sample droplets settle out and run to waste.
• On introduction into the plasma, atoms in the sample are ionized, generally into singly charged
positive ions. Once ionized, the analyte atoms are detected using either an optical emission
spectrometer or a mass spectrometer.
Inductively Coupled Plasma Mass Spectrometry (ICP-MS) vs. ICP-OES :-
• An advantage of ICP-MS over ICP-OES is that it allows a smaller sample size, owing to its greater
sensitivity.
• Although ICP-OES is less sensitive than ICP-MS, some of this sensitivity is won back by the
robustness of ICP-OES in more concentrated sample matrices.
• Whereas ICP-MS struggles with sample matrices containing more than about 0.1% dissolved solids,
ICP-OES can handle up to about 3% dissolved solids.
• A drawback of ICP-MS is that the formation of polyatomic ionic species in the plasma can
interfere with the measurement of particular elements; e.g., 40Ar16O+ interferes with the
determination of 56Fe.
• An alternative approach to the removal or reduction of interfering polyatomic ions is to use a
single-collector magnetic-sector high-resolution ICP-MS (HR-ICP-MS).
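As a minimal numeric illustration of handling a polyatomic interference such as 40Ar16O+ on 56Fe, the sketch below subtracts an ArO+ background estimated from an acid blank. This is not how collision cells or HR-ICP-MS resolve the interference physically; the function name and counts are invented for illustration only.

```python
# Hypothetical blank-subtraction correction for the ArO+ interference
# at m/z 56. A blank (acid with no sample) carries the plasma-derived
# ArO+ signal but no Fe, so subtracting it approximates the Fe-only counts.

def corrected_fe56(sample_counts_mz56, blank_counts_mz56):
    """Subtract the ArO+ background (estimated from an acid blank)
    from the total signal measured at m/z 56; clamp at zero."""
    return max(sample_counts_mz56 - blank_counts_mz56, 0)

print(corrected_fe56(80_000, 12_000))  # 68000
```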
ICP-MS Analysis :-
• Within its eLaboratory portal, PiiMS divides the Purdue Ionomics pipeline into four process stages,
defined as Planting, Harvesting, Drying, and MS Analysis.
• These processes can be generalized as experimental subject, sample acquisition, sample
preparation, and sample analysis. Samples are first prepared for ICP-MS analysis by drying and
digestion in acid.
• The concentrations of various elements in the sample are then quantified using ICP-MS. Within
the eManagement portal, there are tools to define the list of elements to be analyzed, providing
full flexibility in the analysis.
• Once the ICP-MS analyst is satisfied with the quality of the data, the data are released into the
database for general searching and visualization.
Purdue Ionomics Information Management System (PiiMS) :-
• Website: www.purdue.edu/dp/ionomics
• PiiMS currently contains data on shoot concentrations of P, Ca, K, Mg, Cu, Fe, Zn, Mn, Co, Ni, B,
Se, Mo, Na, As, and Cd in over 60,000 shoot tissue samples of Arabidopsis (Arabidopsis
thaliana), including ethyl methanesulfonate, fast-neutron, and defined T-DNA mutants, as well as
natural accessions and populations of recombinant inbred lines, from over 800 separate
experiments, representing over 1,000,000 fully quantitative elemental concentrations.
• Using Web services, we also plan to integrate the PiiMS ionomics dataset with other Arabidopsis
resources.
• Finally, we are taking the basic architectural principles of PiiMS and generalizing them across
other organisms, including rice (Oryza sativa) and yeast (Saccharomyces cerevisiae), as well as
other omics technologies, including proteomics.
Applications of Ionomics :-
1) Once ionomic QTL have been identified, genomic tools available for A. thaliana, and to some
extent rice and maize, can be used to locate the genes that underlie these QTL and thus describe
the traits at a molecular level.
2) Genes responsible for QTL that control Na accumulation have been identified in rice and A.
thaliana; interestingly, the responsible gene was found to be the Na⁺ transporter HKT1 in both species.
3) Loudet et al. recently identified the gene that controls a major QTL for sulfate accumulation in A.
thaliana: adenosine 5′-phosphosulfate reductase, a central enzyme in sulfate assimilation.
4) Researchers are also well on the way to identifying the gene that controls a major QTL for seed
P content in A. thaliana, which has currently been narrowed down to only 13 open reading
frames.