Bio Info Merged

Uploaded by

Arsh Arora

Introduction to bioinformatics

Subha Narayan Rath


• Nucleotide sequence databases: 6 × 10^11 bases (600 Gbp)
• Human genome: 3 × 10^9 bp
• So the databases hold about 200 human-genome equivalents of data.
• The database of macromolecular structures: 100 000 entries with full 3D coordinates of proteins, nucleic acids, etc.
• Phenotype= genotype+ environment+ life history+ epigenetics
• Alleles are different forms or sequences of the same gene
• Homozygosity and heterozygosity…
• Bioinformatics is defined as the application of tools of
computation and analysis to the capture and interpretation of
biological data. It is an interdisciplinary field, which harnesses
computer science, mathematics, physics, and biology

• (Molecular) bioinformatics is conceptualizing biology in terms of
• molecules (in the sense of physical chemistry) and
• applying “informatics techniques” (derived from disciplines such as applied maths, computer science and statistics)
• to understand and organize the information associated with these molecules, on a large scale.
• Outside of the double helix (the backbone) = phosphoric acid + deoxyribose sugar
• Inside = 4 nitrogenous bases, purines (A, G) or pyrimidines (C, T); the bases determine the code of the genes
• Nucleotide = phosphoric acid + deoxyribose + base
• Three successive bases = a code word (codon)
• Transcription: DNA > RNA
• 1. GENETIC REGULATION
• Promoter (TATA box or TATAAAA)
• TF: + or – transcription factors
• Enhancers
• Hormones: signals from outside the cells
• 2. ENZYME REGULATION
• Enzyme inhibition (negative feedback control)
• Enzyme activation (e.g. cAMP activates glycogen breakdown to form more ATP)
• E.g. purine and pyrimidine formation uses both of the above mechanisms
• DNA > mRNA > Protein
• Strands in the double helix are anti-parallel; each strand has a direction, with a 3′ end and a 5′ end (named for positions on the deoxyribose ring)
• In transcription of DNA to RNA and in translation of mRNA, the RNA is synthesized and read in the 5′ to 3′ direction
• In eukaryotes, protein formation requires splicing, i.e. removal of non-coding regions
• Several proteins from same gene: by mixing and matching of exons
• Other types of RNA such as siRNA, microRNA, piwi-interacting RNAs: control translation
• Triplets of code from DNA act as cipher for protein code (as fig)
• To understand the function of the entire human genome: ENCODE (Encyclopedia of DNA Elements)
• https://www.nature.com/collections/aghcdefffg/
• About 80% of the human genome can be ascribed some function, whereas the roughly 23,000 protein-coding genes previously thought to matter make up only about 1.5% of the genome
• The rest was called junk DNA by some
• Variable splicing means number of proteins are not limited to these genes
• Two ways in which non-coding regions can have function:
• 1. Sequence-dependent physical interactions within chromatin that either expose the DNA to protein ligands or block it from them
• 2. If transcribed to RNA: functions such as regulation of transcription
• 75% of human genome is transcribed
• Mapping and dictionary of regulatory sites: many to one i.e. many proteins can bind to
same regulatory sites
• A sketch of structure of regulatory network
• Mapping of exposed sites of chromatin, which are unprotected from DNase1 cleavage:
these are regulatory sites near genes for binding of regulators of expression
• 200-400 Amino acids length
• Exons and introns: some introns are involved in regulation and some are considered junk
• Proteins and structural RNAs vary a lot in their 3d structure
• For each protein or peptide sequence: there is a stable native state adopted
spontaneously
• The paradigm is: (and this is focus of bioinformatics)
• DNA sequence determines protein sequence
• That determines protein structure
• That determines protein function
• Regulatory mechanisms like control of expression patterns
• Cell RNA content: Transcriptome
• DNA methylation patterns
• Splice variants and post-translational modifications of proteins in any cell
• Patterns in protein-protein interaction, DNA-Protein interaction with finding of exact
regions binding for it
• Integration of individual regulatory steps into networks
Genome organization

Subha Narayan Rath


Genome, transcriptome, proteomes
• Genome of a typical bacterium like E. coli: 4.6 × 10^6 bp in a single DNA molecule
• If extended it would be about 2 mm long, yet it fits into a cell of about 0.001 mm diameter
• Human cells contain 23 pairs of chromosomes; the genome size is 3223 × 10^6 bp
• Transcriptome: all the RNA content of the cell
• Proteome: all the protein content of the cell. Humans have only about 20 000 protein-coding genes, but the number of distinct proteins is far larger
Problem 3.1
• The overall base composition of the E. coli genome is A=T=49.2%. In a random sequence of 4 639 221 bp (the length of the E. coli genome) with these proportions, what is the expected number of occurrences of the sequence CTAG?
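One way to approach Problem 3.1 is sketched below. It rests on two assumptions that are not stated in the slide: "A=T=49.2%" is read as A and T together making up 49.2% of the genome (so each is 24.6%, and G and C are 25.4% each), and bases are treated as independent. The function name and variable names are illustrative only.

```python
def expected_occurrences(motif, genome_length, base_freq):
    """Expected count of `motif` in a random sequence with independent bases."""
    probability = 1.0
    for base in motif:
        probability *= base_freq[base]
    # A motif of length k can start at (genome_length - k + 1) positions.
    positions = genome_length - len(motif) + 1
    return positions * probability


at_fraction = 0.492  # ASSUMPTION: A + T together are 49.2% of all bases
freq = {
    "A": at_fraction / 2,
    "T": at_fraction / 2,
    "G": (1 - at_fraction) / 2,
    "C": (1 - at_fraction) / 2,
}

expected_ctag = expected_occurrences("CTAG", 4_639_221, freq)
print(f"Expected CTAG occurrences: {expected_ctag:.0f}")
```

Under these assumptions the expectation comes out at roughly 1.8 × 10^4 occurrences.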
Genes
• They may appear in either strand of DNA
• In bacteria: the functional unit of genetic sequence is
• 3N nucleotides representing
• N amino acids of a protein
• In eukaryotes: one gene is split into separate segments in genetic DNA
• EXON: Expressed region
• INTRON: Intervening region
• Cellular machinery splices together the initial mRNA to make the product
• Control mechanism may turn genes ON or OFF
• Or regulate gene expression more finely
• Cascade of controls respond to conc. of nutrients, to stress, or to control cell cycle
Control regions of DNA lie near the segments coding for proteins
• They contain signal sequences that serve as binding sites for molecules that cause transcription, like TFs
• Or they bind regulatory molecules that block transcription
• Bacterial genes: there are OPERONS in line with the genes
• In Eukaryotes: epigenetic signals like
• DNA methylation
• Histone modification
• …they direct tissue specific expression of developmentally regulated genes
• DNA methylation is stable during tissue differentiation surviving cell division
• Reversible chemical modification of histones renders the transcription sites more or less accessible
• So the genome is like 3.2 Gb of data in a mass storage device
Proteomics
• 2 methods can measure:
• High resolution 2D PAGE
• Mass spectrometric techniques

• Previously, new protein sequence data were determined by direct sequencing of proteins; now they come from translation of DNA sequences, but the latter are always hypothetical unless experimentally proven
• The pattern recognition programs that do this are subject to 3 types of errors:
• 1. Protein sequences might be missed entirely or incorrectly spliced; exons may be combined in different ways in different tissues, which can't be predicted. If mRNA is edited before translation, it can't be known.
• 2. No clue for quaternary structure, prosthetic group binding, patterns of disulphide bridges
• 3. Post-translational modifications: covalent alterations within a cell, addition of a ligand, or cleavage of a protein to an active form
• Inteins are protein segments with self-splicing activity, in contrast to cleavage carried out by proteases
Transcriptomics
• Many RNA transcripts are not protein coding
• Transient: mRNA
• Stable: rRNA
• 1. RNA seq methods by RNA to cDNA and sequencing methods or real time PCR
• 2. RNA can be sequenced directly
BLAST = Basic local alignment search tool
• Rapidly compares a query sequence to a database of subject sequences
• Generates alignments between them, the quality of which is measured by the ALIGNMENT SCORE
• Return alignments that pass user defined score and statistical significance thresholds
• BLAST uses local alignment to find high scoring segment pairs (HSP) between two
sequences
• BLAST HIT: A Subject sequence that is aligned to the query
https://blast.ncbi.nlm.nih.gov/Blast.cgi
Common blast programs
BLASTN is for:
• Mapping oligonucleotides to genome
• Comparing DNA from closely related species
• Aligning expressed sequence tags to a genome
BLASTP is for:
• Exploring protein function
• Initial discovery for conserved domains
BLASTX
• Nucleotide query is translated into all 6 reading frames
• 3 reading frames in + strand
• 3 reading frames in – strand
• Each reading frame is compared to a protein database
• It is used for:
• Gene finding in genomic DNA (Annotations)
• Annotating ESTs
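The six-frame translation that BLASTX applies to a nucleotide query can be sketched as follows. This is a minimal illustration, not BLAST's own code: it assumes the standard genetic code, an input containing only A, C, G and T, and function names chosen here for the example.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Translate three frames on the + strand and three on the - strand."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATGGCC").items():
    print(name, protein)
```

Each of the six protein strings would then be compared to the protein database; TBLASTN applies the same translation to the database side instead of the query.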
TBLASTN
• Query is a protein sequence
• Nucleotide database is translated into 6 RFs
• The query is then compared to each RF
• It is used for
• Mapping a protein to genome database
• Finding ESTs that map to a protein sequence
• Finding RNA Seq reads that map to a protein sequence
TBLASTX
• BOTH query and subject database (both are nucleotides) are converted to 6 RFs and
compared
• It is best used for:
• Comparing the nucleotide sequence from distantly related species
• Identify coding regions in ESTs
• Sensitive but expensive
HOUSE-KEEPING GENES to read too!!

Polymerase Chain Reaction (PCR)
• PCR
• History of PCR
• Thermal cycler
• Components of PCR
• Three basic steps
• PCR program in thermal cycler
• General guidelines for primer.
• Application of PCR
• Advantages and disadvantages of PCR
What is PCR?

• PCR is an exponentially progressing synthesis of the defined target DNA sequences in vitro.

Why “Polymerase”?

• It is called “polymerase” because the only enzyme used in this reaction is DNA polymerase.

Why “Chain”?

• It is called “chain” because the products of the first reaction become substrates of the following one, and so on.
The "Reaction" Components

1) Target DNA - contains the sequence to be amplified.

2) Pair of primers - oligonucleotides that define the sequence to be amplified. They are complementary to the 3' ends of each of the sense and anti-sense strands of the DNA target and are needed to initiate DNA synthesis.

3) dNTPs - deoxynucleotide triphosphates: the DNA building blocks

4) Thermostable DNA polymerase - enzyme that catalyzes the reaction

5) Mg++ ions - cofactor of the enzyme

6) Buffer solution - maintains pH and ionic strength of the reaction solution suitable for the activity of the enzyme
PCR (Polymerase chain reaction)

• A technique to make many copies of a specific DNA region in vitro.

• Primer mediated enzymatic amplification of specially cloned or genomic DNA sequences.

• Possible to generate thousands to millions of copies of a particular section of DNA from a very small amount of DNA.

• Common tool used in medical and biological research labs.

Principle of PCR

• It is based on the enzymatic replication of DNA, where a short segment of DNA is amplified by primer-mediated enzymatic synthesis. DNA polymerase synthesises new strands of DNA complementary to the template DNA. The DNA polymerase can add a nucleotide only to a pre-existing 3′-OH group. Therefore, a primer is required. Nucleotides are then added to the 3′ end of the primer by the DNA polymerase.
History of PCR

• The great mind behind PCR was the American biochemist Kary Banks Mullis.

• He developed PCR in 1985 and was awarded the Nobel Prize in Chemistry in 1993 for his pioneering work.

• The PCR machine is otherwise called a thermocycler.

• 1983: Kary Mullis, a scientist working for the Cetus Corporation, was driving along US Route 101 in northern California when he came up with the idea for the polymerase chain reaction.

• In 1985, Cetus Corp. scientists isolated thermostable Taq polymerase (from T. aquaticus), which revolutionized PCR, and the method was introduced to the scientific community at a conference in October.

• Cetus rewarded Kary Mullis with a $10,000 bonus for his invention.

• Later, during a corporate reorganization, Cetus sold the patent for the PCR process to the pharmaceutical company Hoffmann-LaRoche for $300 million.


Three basic steps

1. Denaturation of ds DNA template (96°C)

Heat the reaction strongly to separate, or denature, the DNA strands. This provides single-stranded template for the next step.

2. Annealing of primers (55-65°C)

Cool the reaction so the primers can bind to their complementary sequences on the single-stranded template DNA.

3. Extension of ds DNA molecules (72°C)

Raise the reaction temperature so Taq polymerase extends the primers, synthesizing new strands of DNA.
Denaturation
• The reaction mixture is heated to 90-98 °C so that the ds DNA is denatured into single strands by disrupting the hydrogen bonds between complementary bases.
• Duration of this step is 1-2 min.
• Typical temperature: 92-94 °C.
• Double-stranded DNA melts → single-stranded DNA.
Annealing
• The reaction mixture is cooled to 45-60 °C.
• Primers are jiggling around caused by ???????
• Primers base pair with the complementary sequence in the DNA.
• Hydrogen bonds reform.
• Annealing is a fancy word for renaturing.
• Temperature: ~45-70 °C (dependent on the melting temperature of the expected duplex).
Extension
DNA polymerase binds to the annealed primers and extends the DNA at the 3' end of the chain.
The temperature is now shifted to 72 °C, which is ideal for the polymerase.
Primers are extended by joining the bases complementary to the template strands.
➤ In the elongation step the polymerase adds dNTPs in the 5' to 3' direction, reading the template from 3' to 5'; the bases added are complementary to the template.
➤ The first cycle is now over and the next cycle begins; because the PCR machine is an automated thermocycler, the same cycle is repeated up to 30-40 times.
Temperature: ~72 °C
Time: 0.5-3 min
General Guidelines for primers
1. Length:
• Shorter primers tend to anneal to non-target sequences of the DNA template.
• A short primer may be sufficient for a simple template such as a small plasmid.
• But a long primer may be required when using eukaryotic genomic DNA as template. In practice, 20-30 nucleotides is generally satisfactory.
2. Mismatches:
• Primers do not need to match the template completely.
• It is often beneficial to have C or G as the 3' terminal nucleotide, which makes the binding of the 3' end of the primer to the template more stable.
3. Melting Temperature Tm:
Melting temperature is the temperature at which one half of the DNA duplex will dissociate and become single stranded. Typically the annealing temperature is about 3-5 degrees Celsius below the Tm of the primers used.
Primers with melting temperatures in the range of 52-58°C generally produce the best results. Primers with melting temperatures above 65°C have a tendency for secondary annealing.
Tm can be calculated from the following formula:
Tm = (4 × [G+C]) + (2 × [A+T]) °C, where [G+C] and [A+T] are the numbers of those bases in the primer
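The formula above (often called the Wallace rule) is easy to apply in code. The sketch below is illustrative: the function name and the example primer are made up, and the 5 °C offset for the annealing temperature is taken from the 3-5 °C range quoted above.

```python
def wallace_tm(primer):
    """Estimate primer Tm in degrees C as 4*(G+C) + 2*(A+T)."""
    p = primer.upper()
    gc = p.count("G") + p.count("C")
    at = p.count("A") + p.count("T")
    return 4 * gc + 2 * at


example_primer = "ATGCATGCATGCATGCATGC"  # hypothetical 20-mer, 10 G/C and 10 A/T
tm = wallace_tm(example_primer)
annealing = tm - 5  # annealing temperature a few degrees below Tm
print(f"Tm = {tm} C, suggested annealing temperature = {annealing} C")
```

This rule of thumb is intended for short oligonucleotides; other estimation methods exist for longer primers and are not covered here.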
4. Internal Secondary Structure:
Should be avoided, so that the primer does not fold back on itself and become unavailable to bind to the template.
5. Primer-Primer Annealing:
Also important to avoid the two primers being able to anneal to each other.
Extension by DNA polymerase of two self-annealed primers leads to
formation of a primer dimer.
6. G/C content:
Ideally a primer should have a near random mix of nucleotides, a 50% G/C
content.
Templates for PCR
• Body Fluids (Blood, CSF, Synovial, Sputum, Semen, Menstrual blood, Stool, Urine etc).
• Tissues
• Dried blood
• Semen stains
• Vaginal swabs
• Single hair
• Fingernail scrapings
• Insects in Amber
• Egyptian mummies
• Buccal Swab
• Toothbrushes
• Microorganisms (Bacteria, Fungi, Virus etc)
Things to try if PCR does not work
A) If no product (of correct size) produced:
1 Check DNA quality.
2 Reduce annealing temperature.
3 Increase magnesium concentration.
4 Add dimethylsulphoxide (DMSO) to assay (at around 10%).
5 Use different thermostable enzyme.
6 Throw out primers - make new stocks.
B) If extra spurious product bands present:
1 Increase annealing temperature
2 Reduce magnesium concentration
3 Reduce number of cycles
4 Try different enzyme
Variations of the PCR
Colony PCR
Nested PCR
Multiplex PCR
AFLP PCR (Amplified fragment length polymorphism)
Hot Start PCR
In Situ PCR
Inverse PCR
Asymmetric PCR
Reverse Transcriptase PCR
Allele specific PCR
Real time/Quantitative PCR
ARMS PCR (Amplification Refractory Mutation System)
Methyl Specific PCR
TaqMan™ Sequence Detection System
- Several types of chemistries have been developed for this direct detection of PCR-copied sequences. One of the most popular is the TaqMan™ system illustrated here.

- In the TaqMan™ system, as each new copy of the target sequence is made, a hybridization probe which binds to the sequence is simultaneously hydrolyzed by the polymerase enzyme.

- This causes two fluorescent dyes at either end of the probe to become separated and eliminates the Q dye quenching effect on the fluorescence of the reporter or R dye.

- For each new copy of the sequence that is made, the fluorescence of one reporter dye molecule becomes detectable by the instrument.
QPCR Growth Curves

[Figure: qPCR growth curves for 10 ng, 1 ng, 100 pg, 10 pg, 1 pg and no DNA; asterisks (*) mark the cycle thresholds]

- The quantitative capability of this system stems from the direct correlation that has been shown between the starting number of target sequence copies in the sample and the number of amplification cycles required for the instrument to first detect an increase in reporter dye fluorescence associated with the generation of new copies.

- The cycle numbers where the reporter dye fluorescence curves cross a threshold value (red line near the bottom of the figure) that is significantly above the background fluorescence (purple line at the very bottom) are automatically reported by real time PCR instruments.

*Cycle Threshold: Cycle # at which growth curve = 30 fluorescence units (significantly above background)
Calculation of target organism cells in test samples from TaqMan assay cycle threshold results using the comparative cycle threshold method
_______________________________________________________________________________________
Target cells   Sample       CT      ΔCT                     Measured cells in test sample
in sample      type                 (CT,test − CT,calib)    (2^(−ΔCT) × cells in calibrator)
_______________________________________________________________________________________
20000          Calibrator   19.8    ----                    ----
Unknown        Test         22.9    3.1                     0.11 × 20000 = 2200
Unknown        Test         26.2    6.4                     0.012 × 20000 = 240
_______________________________________________________________________________________
assuming amplification efficiency = 2

- Information from the standard curve and results from a single calibrator sample
containing known target cell numbers - that is extracted and run with the test
samples - can also be used to determine target cell numbers in the test samples
using a simple calculation called the comparative cycle threshold method as
illustrated here.
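The comparative cycle threshold calculation in the table can be reproduced with a few lines of code. This is a minimal sketch with an illustrative function name; it uses the table's values and its assumed amplification efficiency of 2. The table rounds the factor 2^(−ΔCT) before multiplying, so its results (2200 and 240) differ slightly from the unrounded values computed here.

```python
def cells_from_ct(ct_test, ct_calibrator, calibrator_cells, efficiency=2.0):
    """Comparative CT method: cells = calibrator_cells * efficiency ** (-delta_CT)."""
    delta_ct = ct_test - ct_calibrator
    return calibrator_cells * efficiency ** (-delta_ct)


CALIBRATOR_CT = 19.8      # from the table
CALIBRATOR_CELLS = 20000  # from the table

for ct in (22.9, 26.2):
    estimate = cells_from_ct(ct, CALIBRATOR_CT, CALIBRATOR_CELLS)
    print(f"CT = {ct}: about {estimate:.0f} cells")
```

A higher CT means more cycles were needed to cross the threshold, and therefore fewer starting copies: each additional cycle corresponds to a two-fold lower starting amount when the efficiency is 2.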
Application of PCR
Medical Application
• Genetic testing for presence of genetic disease mutations.
• Detection of disease causing genes in suspected parents who act as carrier.
• Study of alteration to oncogenes may help in customization of therapy.
• Can also be used as part of a sensitive test for tissue typing, vital to organ transplantation, and for genotyping of embryos.
• Helps to monitor the gene therapy.
Infectious Disease Application
• Analyzing clinical specimens for the presence of infectious agents, including HIV,
hepatitis, malaria, tuberculosis etc.
• Detection of new virulent subtypes of organism that is responsible for epidemics
Forensic Application
• Can be used as a tool in genetic fingerprinting.
• This technology can identify any one person from millions of others in case of
crime scene, paternity testing etc.
Research and Molecular Genetics
• Helps to compare the genomes of two organisms and identify the difference
between them.
• In phylogenetic analysis, minute quantities of DNA from any source, such as fossilized material, hair, bones or mummified tissues, can be amplified.
• In the Human Genome Project, with the aim of completely mapping and understanding all the genes of human beings.
Advantages of PCR
• Automated, fast, reliable (reproducible) results.
• Contained (less chance of contamination).
• High output.
• Sensitive.
• Broad uses.
• Defined, easy to follow protocols.
More Cycles = More DNA
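The relationship between cycle number and product can be sketched numerically. The snippet assumes ideal behaviour, in which every template is copied in every cycle (efficiency 1.0, i.e. perfect doubling); real reactions plateau, so this is an upper bound. The function name and parameters are illustrative.

```python
def copies_after(cycles, start_copies=1, efficiency=1.0):
    """Copies after a number of cycles; efficiency 1.0 means perfect doubling."""
    return start_copies * (1 + efficiency) ** cycles


for n in (10, 20, 30):
    print(f"{n} cycles: {copies_after(n):.0f} copies from one template")
```

With perfect doubling, 30 cycles turn a single template into about a billion copies, which is why very small samples are enough.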
Sample problem: PCR in forensics
Suppose that you are working in a forensics lab. You have just received a DNA sample from a hair left at
a crime scene, along with DNA samples from three possible suspects. Your job is to examine a
particular genetic marker and see whether any of the three suspects matches the hair DNA for this
marker.
The marker comes in two alleles, or versions. One contains a single repeat (brown region below), while
the other contains two copies of the repeat. In a PCR reaction with primers that flank the repeat region,
the first allele produces a 200 bp DNA fragment, while the second produces a 300 bp DNA fragment:

You perform PCR on the four DNA samples and visualize the results by gel electrophoresis, as shown
below:

Which suspect's DNA matches the DNA from the crime scene at this marker?
Suspect 3
• Humans are diploid, meaning that they have two copies of most of their DNA. Thus, there will be two copies of the marker
we are examining in each of the DNA samples.

• If a person has two different alleles of the marker (is heterozygous), two different-sized bands ( 200 bp and 300 bp) will be
amplified during PCR. These will appear as bands of DNA in the gel at the 200 bp and 300 bp locations.

• If a person has two copies of the same allele (is homozygous), only one band will be amplified during PCR. If the person is
homozygous for the 200 bp allele, only a 200 bp band will be visible on the gel. Similarly, if the person is homozygous for
the 300 bp allele, only a 300 bp band will be visible on the gel.

• The marker genotypes of the DNA samples are:

• Crime scene DNA: homozygous 200 bp allele

• Suspect 1: homozygous 300 bp allele

• Suspect 2: heterozygous

• Suspect 3: homozygous 200 bp allele

• Both the DNA sample from the crime scene and the DNA sample from suspect 3 are homozygous for the 200 bp version of the marker. That is, the two samples match for this marker.
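The comparison in this sample problem amounts to matching sets of band sizes. The sketch below encodes the genotypes listed above as sets of fragment lengths in bp; the dictionary layout and function name are illustrative choices, not part of the original problem.

```python
# Band sizes (bp) seen on the gel for each sample, as listed above.
samples = {
    "crime scene": {200},
    "suspect 1": {300},
    "suspect 2": {200, 300},
    "suspect 3": {200},
}


def matching_suspects(samples, reference="crime scene"):
    """Return the samples whose band pattern equals that of the reference."""
    target = samples[reference]
    return [name for name, bands in samples.items()
            if name != reference and bands == target]


print(matching_suspects(samples))
```

A match at one marker only fails to exclude a suspect; the sketch compares a single marker, as the problem does.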
(A) Gradient temperature control of one single block with heating
and cooling elements at each end. (B) Applied Biosystems
VeriFlex Block with “better-than-gradient” temperature control,
featuring three separate independent blocks and individual
heating and cooling elements for each block.
Difference between primers and probes

Primers are the starting point of the polymerase chain reaction (PCR) with single-stranded DNA/RNA.
Primers are usually designed to bind to a specific DNA/RNA sequence. In the cell, RNA primers are
the starting point of DNA replication.

qPCR probes describe DNA sequences similar to primers that are typically labelled with a fluorophore
as signalling molecules (molecular marker). The use of these probes allows for the quantification of
specific DNA sequences present in a sample (image 1).
Make sure it’s the right length
The specificity of a primer depends on its length.
• Primers, such as PCR primers, should be designed with a length of 18 to 24 nucleotides for ideal
amplification.
• Long primers have a slower hybridisation rate and a lower chance of annealing to the intended target
sequence.
• Slower hybridisation produces inadequate specificity and inadequate binding to the target sequence,
whereas faster hybridisation rates result in high target concentrations and maximum binding to the target
sequence.

The risk of slower hybridisation increases when a primer is longer than 30 nucleotides.
Even though long primers have a higher level of specificity than short primers, they are less efficient during the
annealing phase and produce less amplicon yield. Due to the build-up of by-products and the loss of components
necessary for DNA synthesis during a PCR, more cycles can result in less efficient outcomes. Short primers,
however, anneal to their target sequence more effectively and need fewer PCR cycles for amplicon generation
compared to long primers (Wu et al., 2010).
Unlike primers, the optimal length of probes is highly target-specific. The length generally selected by experts is
between 15 and 30 nucleotides. However, when longer probes are used instead of shorter ones, fewer probes per
gene may be required.

Choose the optimal melting temperature (Tm) range


Melting temperature (Tm) plays an integral part in both PCR and qPCR experiments. This is the temperature at which half of the DNA duplexes (double-stranded DNA) have separated into single strands.

Moreover, Tm also determines the annealing temperature Ta, where primer binding occurs at the highest efficiency and specificity. Ta affects the final amount of PCR and qPCR product yield.

The optimal melting temperature for maintenance of primer specificity is 54°C or higher (54°C to 65°C). The Ta of a primer is usually set below its Tm, typically by 2-5°C.

When a primer is designed, its Tm should not be above 65°C, as it increases the risk of secondary annealing.

During a PCR, Tm describes the temperature at which primers have annealed to 50% of the target sequences and the other 50% of target sequences are free. Both states are in equilibrium.

The Tm of a PCR can vary based on buffer composition, metal ion concentration, pH, and additives such as DMSO.
Choose the appropriate percentage of GC content
• The GC content of primers and probes is the percentage of guanine (G) and cytosine (C) in the primer. It is
recommended to keep the GC content between 40% to 60% when designing a primer. A primer with a length
of 20 nucleotides should contain 8 to 12 Gs or Cs.

• The reason for this GC content range is simple. When primers and probes anneal to their target sequence, GC
base pairs form three hydrogen bonds and adenine (A) and thymine (T) form two hydrogen bonds (image).
• As three hydrogen bonds are stronger than two, the separation of G and C requires more energy (in the form of heat) than the separation of A and T.

• A higher GC content in the primer will lead to stronger binding of the single DNA strands (ssDNA), resulting in a higher Tm. This could be an issue during PCR and qPCR, as higher GC content can cause mismatches and the formation of primer-dimers, i.e. the hybridisation of two primers with each other.

• If the GC content of the primers is less than 40%, their lengths may need to be increased in order to maintain the optimal Tm.
GC clamp
• The term “GC clamp” refers to the presence of Gs or Cs in the last five nucleotides at the 3′ end of primers.
• The benefit of a GC clamp is that it promotes complete primer binding.
• However, the presence of more than 3 G’s or C’s at the 3′ end of a primer can lead to non-specific binding
and false-positive results.
• For probes, the ideal GC content is between 35% and 60% to promote probe specificity and avoid false-
positive results.
• Probes should not contain a G at the 5′ end, as this could interfere with the fluorescence from the reporter
molecule (fluorophore) that is attached to this end.
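The GC content and GC clamp guidelines above can be checked mechanically. The sketch below uses the thresholds quoted in the text (40-60% GC, no more than 3 G/C among the last five 3′ bases); requiring at least one G or C in the clamp is an assumption made here, and the function names and example primer are illustrative.

```python
def gc_content(seq):
    """Percentage of G and C bases in the sequence."""
    s = seq.upper()
    return 100 * (s.count("G") + s.count("C")) / len(s)


def gc_clamp_count(primer):
    """Number of G or C bases among the last five bases at the 3' end."""
    tail = primer.upper()[-5:]
    return tail.count("G") + tail.count("C")


def check_primer(primer):
    """Apply the GC guidelines quoted above to a primer written 5' to 3'."""
    gc = gc_content(primer)
    clamp = gc_clamp_count(primer)
    return {
        "gc_percent": gc,
        "gc_ok": 40 <= gc <= 60,
        "clamp_count": clamp,
        "clamp_ok": 1 <= clamp <= 3,  # ASSUMPTION: at least one G/C is wanted
    }


print(check_primer("ATGCATGCATGCATGCATGC"))  # hypothetical 20-mer
```

For a 20-nucleotide primer, the 40-60% range corresponds to the 8 to 12 Gs or Cs mentioned above.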
Avoid secondary structures and primer-dimers
• Primer-dimers and hairpin loops are two types of secondary structures that can form during PCRs and qPCRs.
• Primer-dimers are formed due to the presence of complementary sequences within a utilised primer or
complementary sequences shared by two primers.
• This is represented by the parameter “self-complementarity” in primer design tools. The two types of primer
dimers are:
• Self-dimer refers to, for example, the hybridisation of two forward primers to each other due to intra-primer
homology.
• Cross-dimer refers to the hybridisation of the forward and reverse primer due to inter-primer homology.
• Primer-dimers prevent primers from annealing to the target sequence. Hence, most of the final PCR product is
just the amplification of the primers themselves rather than the amplicon.
• Hairpins are formed due to the intramolecular interaction of
the primer.
• Here, two regions of three or more nucleotides within the primer
are complementary to each other. When they anneal, a hairpin is
formed.
• The probability of forming a hairpin is represented by the
parameter “self 3′ complementarity”.
• Hairpins can impact the amplification step and lead to non-
specific amplicons or even no amplicon yield.
• Significantly, hairpins can also form when the annealing
temperature is too low.
• For both parameters (self-complementarity and self 3′-complementarity), the lower the number, the better.

• However, secondary structures can be avoided by adjusting the annealing temperature (in most cases, an
increase in temperature), avoiding cross homology, and changing the primers or DNA concentration. The
DNA concentration should be balanced with the number of cycles required in the reaction to enable the best
possible results.

• Runs of three or more Cs or Gs at the 3′ ends of primers may promote mispriming at G- or C-rich sequences (because of stability of annealing), and should be avoided.

• The 3′ ends of the two primers should not be complementary to each other.
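A simple screen for complementary 3′ ends is sketched below. It only tests whether the last k bases of two primers could pair perfectly in antiparallel orientation; the window of k = 4 bases, the function names and the example primers are assumptions made for illustration. Primer design tools score complementarity in more detail than this.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def three_prime_complementary(primer_a, primer_b, k=4):
    """True if the last k bases of the two primers can pair perfectly."""
    # Both primers are written 5'->3'; antiparallel pairing of the two 3' ends
    # means one tail equals the reverse complement of the other tail.
    return primer_a.upper()[-k:] == reverse_complement(primer_b[-k:])


forward = "TTTTAGTC"  # hypothetical primers, for illustration only
reverse = "CCCCGACT"
print(three_prime_complementary(forward, reverse))  # cross-dimer risk
print(three_prime_complementary(forward, forward))  # self-dimer risk
```

Calling the function with the same primer twice gives a crude self-dimer check; calling it with the forward and reverse primers gives a crude cross-dimer check.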


Protein structure and how to retrieve information from archives

Subha Narayan Rath


• Exploring protein function
• Initial discovery for conserved domains
• However, can we predict protein structure and function based on amino acid
sequence??
• Each protein folds to a unique 3D structure based on its amino acid sequence
• Protein structure is closely related to its function
• Protein structure prediction is a grand challenge of computational biology
• 1. Protein information resource (PIR) and associated database
• A. PIRSF: Sequence family classification
• B. iProClass: integrated protein knowledgebase
• C. iProLINK: integrated protein literature, information and knowledge
• 2. SWISS-PROT (from Swiss institute of bioinformatics, Geneva, Switzerland)
• Also contains bioinformatics tools and links called Expert Protein Analysis System
(ExPASy….www.expasy.org)
• PROSITE is a set of signature patterns characteristic of protein families
• 3. TrEMBL
• Amino acid sequence is not inferable from the gene sequence, because
• A. ambiguity in splicing
• B. information about ligands, disulphide bridges, subunit associations, post-
translational modifications, effects of mRNA editing etc. can’t be known from gene
sequence
• E.g. https://www.uniprot.org/
• Patterns of conservation identify features that nature has retained (PROSITE signatures are examples)
• R. Doolittle suggested: two full-length protein sequences (>100 residues) that have 25% or more identical residues in an optimal alignment are likely to be related
• < 15%: doubtful similarity
• 18-25%: twilight zone, where there might be similarity, indicated for example by the appearance of PROSITE consensus patterns
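The rule of thumb above can be applied to a pair of already-aligned sequences as sketched below. Percent identity is computed here over aligned columns without gaps, which is one common convention and an assumption of this sketch; the function names are illustrative, and ranges the rule does not mention are reported as not covered.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over aligned columns that contain no gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    matches = sum(x == y for x, y in pairs)
    return 100 * matches / len(pairs)


def doolittle_category(identity, length):
    """Classify an identity value using the thresholds quoted above."""
    if identity >= 25 and length > 100:
        return "likely related"
    if 18 <= identity < 25:
        return "twilight zone"
    if identity < 15:
        return "doubtful"
    return "not covered by the rule of thumb"


print(percent_identity("ACDE-F", "ACDFGF"))  # toy alignment
print(doolittle_category(30, 150))
```

The toy alignment is far too short for the rule to apply; it only shows how the gap column is skipped when counting identities.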
• Sequence-oriented databases are: InterPro, Pfam, COG
• Structure-oriented databases are: SCOP, CATH
• They archive, annotate, distribute a set of atomic coordinates
• Worldwide Protein DataBank (wwPDB)
• Others: Protein Data Bank in Europe (at the EBI, UK) and Protein Data Bank Japan (based at Osaka University)
• www.wwpdb.org
• It contains also structures of nucleic acids, carbohydrates in addition to proteins
• CCDC: Cambridge crystallographic Data Centre archives the structure of small
molecules
• BioMagResBank at University of Wisconsin: archives protein structures determined by
NMR
• wwPDB keeps the data from X-ray structure determinations
• What protein and from which species
• Who solved the structure with reference
• Experimental details of how the structure was solved, such as the resolution of the X-ray structure determination
• Amino acid sequence
• Atomic coordinates (starting with ATOM)
• Additional molecules including cofactors, inhibitors, water molecules (keyword
HETATM identifies coordinates of these)
• Assignment of secondary structure: helices and sheets
• Disulphide bridges
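The ATOM and HETATM records mentioned above can be read with plain string slicing, because the legacy PDB text format uses fixed columns (record name in columns 1-6, x, y and z in columns 31-38, 39-46 and 47-54). The sketch below builds two records with made-up coordinates purely for illustration; the helper names are hypothetical, and real entries would be downloaded from the archive instead.

```python
def make_record(rec, serial, name, res, chain, resseq, x, y, z):
    """Build a minimal fixed-column PDB coordinate line (illustration only)."""
    return (f"{rec:<6}{serial:>5} {name:<4} {res:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_coordinates(lines):
    """Split coordinates into polymer atoms (ATOM) and other molecules (HETATM)."""
    atoms, hetatms = [], []
    for line in lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        (atoms if record == "ATOM" else hetatms).append(xyz)
    return atoms, hetatms


example_lines = [
    "HEADER    EXAMPLE WITH MADE-UP COORDINATES",
    make_record("ATOM", 1, "CA", "GLY", "A", 1, 1.0, 2.5, -3.25),
    make_record("HETATM", 2, "O", "HOH", "A", 101, 10.0, 0.0, 4.125),
]
atoms, hetatms = parse_coordinates(example_lines)
print(len(atoms), "ATOM record(s);", len(hetatms), "HETATM record(s)")
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real entries can touch when values are wide.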
• Protein details can be found in RCSB homepage, a part of PDB
• https://www.rcsb.org/
• 1TRZ and different parts of that
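The ATOM and HETATM keywords mentioned above can be illustrated with a short parser. This is a minimal sketch, assuming the fixed-column layout of the legacy PDB text format; the helper names and the coordinate values are hypothetical (they are not taken from entry 1TRZ), and atom names are left-justified here for simplicity.

```python
def make_record(record, serial, name, resname, chain, resseq, x, y, z):
    """Format one coordinate line in fixed PDB columns (record 1-6, serial 7-11,
    atom name 13-16, residue name 18-20, chain 22, residue number 23-26,
    x 31-38, y 39-46, z 47-54)."""
    return (f"{record:<6}{serial:>5} {name:<4} {resname:>3} {chain}"
            f"{resseq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_coordinates(pdb_text):
    """Collect ATOM (polymer) and HETATM (cofactor, inhibitor, water) records."""
    atoms = []
    for line in pdb_text.splitlines():
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue  # skip header, sequence and other record types
        atoms.append({
            "record": record,
            "atom": line[12:16].strip(),
            "residue": line[17:20].strip(),
            "chain": line[21],
            "resseq": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        })
    return atoms


# Hypothetical three-atom example: two protein atoms and one water oxygen.
sample = "\n".join([
    "HEADER    HYPOTHETICAL EXAMPLE",
    make_record("ATOM", 1, "N", "GLY", "A", 1, 0.0, 1.0, 2.0),
    make_record("ATOM", 2, "CA", "GLY", "A", 1, 1.5, 2.25, -3.0),
    make_record("HETATM", 3, "O", "HOH", "A", 101, 10.0, 11.0, 12.0),
])
for atom in parse_coordinates(sample):
    print(atom["record"], atom["residue"], atom["atom"], atom["xyz"])
```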
• The properties of amino acids
vary with the R group
(side chain)
• 20 amino acids and their
properties
• Polymerization of amino
acids during synthesis
produces the primary
structure
• Linear and ordered
• 1D
• Sequence of amino acids
• Written from amino end to carboxyl end
by convention
• This linear structure is neither functional
nor stable; so it has to do folding
• Non-linear
• 3D
• Localized to regions of amino acid
chain
• Formed and stabilized by
• Hydrogen bonding
• Van der Waals interactions
• Electrostatic forces
• Occurs in cytosol (up to 60% of bulk
water or 40% of water of hydration)
• Due to interaction of secondary (2°) structure and
solvents
• It’s non-linear and 3D
• In cytosol due to close proximity to other
folded and packed proteins
• Involve interaction of tertiary structure
elements of separate protein molecules
• Non-linear and 3D
• Class = secondary structure composition
• Could be all α, all β, α/β, α+β
• Motif = small specific combination of
secondary structure elements e.g. β –α – β
loop
• Fold is architecture, the overall shape
and orientation of secondary structures,
ignoring connectivity between the
structures
• E.g. alpha/beta barrel
• Fold families: categorizations that take
into account topology and previous
subsets as well as empirical or biological
properties
• Superfamilies: above plus it includes
evolutionary and ancestral properties
• Primary (amino acid sequence)
• Secondary (alpha helix and beta sheet)
• Tertiary (3D structure formed by
assembly of secondary structures)
• Quaternary (more than one polypeptide
chain)
Alignment of pair of
sequences and
phylogenetic trees

Subha Narayan Rath


• Given 2 or more sequences:
• Measure their similarity
• Determine the residue-residue correspondence
• Observe the patterns of conservation and variability
• Infer evolutionary relationships
• Sequence alignment is the identification of residue-residue correspondences.
• A mutual alignment of more than 2 sequences is called multiple sequence alignment
• A dot plot is a tool in bioinformatics to compare two sequences and show the regions of
close similarity in a visually understandable way.
• Sliding window size: noisy vs smooth (default is 10)
• Window size changes with goal of analysis
– size of average exon
– size of average protein structural element
– size of gene promoter
– size of enzyme active site
• Cut-off value by statistics and Z score
• Direction of movement through the plot: diagonal (aligned residues), horizontal or vertical (a gap in one sequence)
• Repeated domains
• Conserved domains
• Exons and introns
• Terminators
• Frameshifts
• Low-complexity regions
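The sliding-window idea behind a dot plot can be sketched in a few lines. This is a simplified illustration, not the dotmatcher implementation: the function names are hypothetical, and the cut-off is a plain count of identities per window rather than a statistically derived Z score.

```python
def dot_plot(seq_a, seq_b, window=10, threshold=8):
    """Return the set of (i, j) pairs where the windows starting at position i of
    seq_a and position j of seq_b share at least `threshold` identical residues."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(1 for k in range(window) if seq_a[i + k] == seq_b[j + k])
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(dots, n_rows, n_cols):
    """Draw the plot as text: '*' marks a window pair above the cut-off."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(n_cols))
        for i in range(n_rows)
    )


# A sequence with an internal repeat compared against itself:
# the main diagonal plus off-diagonal lines for the repeat.
seq = "ACGTTGCAACGTTGCA"
window = 4
dots = dot_plot(seq, seq, window=window, threshold=4)
print(render(dots, len(seq) - window + 1, len(seq) - window + 1))
```

A small window with a strict threshold gives a noisy but detailed plot; a larger window smooths it, which is why the window size is chosen to match the feature of interest.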

https://www.bioinformatics.nl/cgi-bin/emboss/dotmatcher
Arrangement of domains as described in
Swiss-Prot entry

Drosophila melanogaster SLIT protein against itself


• ANACON – Contact analysis of dot plots.
• D-Genies– Specializes in interactive whole genome dotplots of large genomes
• Dotlet – Provides a program allowing you to construct a dot plot with your own sequences.
• dotmatcher– Web tool to generate dot plots (and part of the EMBOSS suite).
• Dotplot – easy (educational) HTML5 tool to generate dot plots from RNA sequences.
• dotplot – R package to rapidly generate dot plots as either traditional or ggplot graphics.
• Dotter– Stand alone program to generate dot plots.
• JDotter – Java version of Dotter.
• Flexidot – Customizable and ambiguity-aware dotplot suite for aesthetics, batch analyses and printing (implemented in
Python).
• Gepard – Dot plot tool suitable for even genome scale.
• Genomdiff – An open source Java dot plot program for viruses.
• LAST for whole-genome “split-alignment”.
• lastz and laj – Programs to prepare and visualize genomic alignments.
• yass – Web-based tool to generate (both forward and reverse complement) dot plots from genomic alignments.
• seqinr – R package to generate dot plots.
• SynMap – An easy to use, web-based tool to generate dotplots for many species with access to an extensive genome
database. Offered by the comparative genomics platform CoGe.
• UGENE Dot Plot viewer – Opensource dot plot visualizer.
• Given 2 strings, there are two measures of the distance between them.
• A. Hamming distance: defined between two strings of equal length, is the number of
positions with mismatching characters
• B. Levenshtein, or edit, distance: defined between two strings of equal or unequal length, is the
minimal number of “edit operations” required to change one string into the other.
• An edit operation could be a deletion, insertion, or substitution of a single character in either sequence
• A given sequence of edit operations induces a unique alignment, but not vice versa!!!
• In molecular biology certain changes are more likely to occur than others, e.g. amino acid
substitutions of similar sizes, so different edit operations are given variable weights
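The two distance measures can be sketched as follows. This is a minimal illustration with unit cost for every edit operation; the variable weights mentioned above are not modelled.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def levenshtein(a: str, b: str) -> int:
    """Minimal number of single-character deletions, insertions and substitutions
    needed to change a into b (each operation costs 1)."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or keep if identical
            ))
        prev = curr
    return prev[-1]


print(hamming("GATTACA", "GACTATA"))
print(levenshtein("AGCT", "AGT"))
```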
• What about similarity scoring system??
• Transition vs transversion mutations: transitions, being the more common group, get the higher score.
• To measure the relative probability of any particular substitution, first we find the
relative frequencies of changes in pairs of aligned homologous sequences and based
on that we can make a scoring matrix for substitutions.
• A common change should score HIGHER than a rare one
• A measure of sequence divergence is the PAM: 1 PAM = 1 Percent Accepted Mutation
• Two sequences 1 PAM apart have 99% identical residues; collecting
statistics on such pairs produces the 1 PAM substitution matrix
• Powers of this matrix are used for more divergent sequences
• The PAM250 level (250% of expected change, or 250 substitutions per 100 amino acids)
corresponds to 20% overall sequence identity.
• The occurrence of reversions, either directly or via other changes, produces an apparent slowdown of
mutation rates
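Raising a 1 PAM matrix to a power can be illustrated with a toy example. This is a sketch under loud assumptions: the alphabet has only 3 hypothetical residue types (not 20), and the substitution probabilities are invented, so the resulting numbers do not reproduce the real PAM250 values.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in matrix]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical 1 PAM matrix: each residue stays unchanged with probability 0.99.
pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
pam250 = mat_pow(pam1, 250)
print([round(p, 3) for p in pam250[0]])
```

Although 250 substitutions per 100 positions have occurred, the probability of finding the original residue does not drop to zero: repeated changes at the same site and reversions keep a residual identity.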
• The matrix expresses scores as log-odds values:
• Score of mutation i to j = log10
(observed i to j mutation
rate / mutation rate expected from
amino acid frequencies)
• The numbers are multiplied by 10 to
avoid decimals
• The probability of 2 independent
mutations is the product of their
individual probabilities; hence their
log-odds scores are added.
• A positive score indicates related sequences or
a conservative substitution
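The log-odds formula can be checked with a few numbers. This is a minimal sketch; the mutation rates below are invented for illustration and are not taken from any published matrix.

```python
import math


def log_odds_score(observed_rate: float, expected_rate: float) -> int:
    """10 * log10(observed / expected), rounded to the nearest integer."""
    return round(10 * math.log10(observed_rate / expected_rate))


# Hypothetical rates: a substitution seen twice as often as expected by chance,
# one seen exactly as often, and one seen half as often.
scores = [
    log_odds_score(0.02, 0.01),
    log_odds_score(0.01, 0.01),
    log_odds_score(0.005, 0.01),
]
print(scores)

# Because probabilities of independent events multiply, the log scores of the
# aligned positions are simply added to score a whole alignment.
print(sum(scores))
```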
BLOSUM matrix of Henikoff and
Henikoff
• Uses the much larger amount of data available
now
• The name means BLOcks SUbstitution Matrix; it is
based on the BLOCKS database (representing
known protein families) of aligned protein
sequences
• From families of closely related proteins
alignable without gaps, they calculated
the ratio of the number of observed pairs of
amino acids at any position to the number
expected from overall amino acid
frequencies
• Sequences with identities higher than a
threshold are grouped to avoid overweighting
close relatives, e.g. BLOSUM62 is commonly
used, where the matrix is built using
sequences with no more than 62% identity
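The observed-to-expected ratio described above can be sketched on a toy block. This is an illustration under stated assumptions: the block is hypothetical and uses only two residue types, the clustering at an identity threshold is omitted, and scores are scaled as 2 × log2 of the ratio and rounded, as is usual for BLOSUM-style matrices.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def blosum_like_scores(block):
    """Log-odds scores from an ungapped block of aligned sequences.
    Only residue pairs that are actually observed receive a score."""
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())
    observed = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # Frequency of each residue among the observed pairs.
    residue_freq = defaultdict(float)
    for (a, b), q in observed.items():
        if a == b:
            residue_freq[a] += q
        else:
            residue_freq[a] += q / 2
            residue_freq[b] += q / 2

    scores = {}
    for (a, b), q in observed.items():
        if a == b:
            expected = residue_freq[a] ** 2
        else:
            expected = 2 * residue_freq[a] * residue_freq[b]
        scores[(a, b)] = round(2 * math.log2(q / expected))
    return scores


# Hypothetical block of four aligned sequences, three columns, no gaps.
toy_block = ["AAC", "AAC", "ACC", "AAA"]
print(blosum_like_scores(toy_block))
```

In this toy block the mismatch pair scores lower than either identical pair, which is the behaviour a substitution matrix is meant to capture.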
Scoring insertions and deletions
(substitution matrix) or gap weighting
• In addition to the substitution matrix, there is a way of gap weighting too
• Aligning DNA sequences: CLUSTAL-W is recommended
• Aligning protein sequences: BLOSUM62 is recommended

Sequence | Matrix/protocol recommended | Gap initiation | Gap extension | Match | Mismatch
DNA | CLUSTAL-W | 10 | 0.1 | 1 | 0
Protein | BLOSUM62 | 11 | 1 | Matrix | Matrix
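The gap initiation and gap extension values in the table combine into a single gap penalty. The sketch below assumes one common convention, in which every gap position (including the first) pays the extension cost on top of a one-time initiation cost; some programs charge extension only from the second position onward, so treat the formula as an assumption.

```python
def affine_gap_penalty(length: int, initiation: float, extension: float) -> float:
    """Penalty for a gap of `length` positions: initiation + extension * length."""
    if length <= 0:
        return 0.0
    return initiation + extension * length


# Values from the table above.
print(affine_gap_penalty(3, initiation=11, extension=1))     # protein, BLOSUM62
print(affine_gap_penalty(5, initiation=10, extension=0.1))   # DNA, CLUSTAL-W
```

A high initiation cost with a low extension cost favours a few long gaps over many short ones.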


Computing the alignment of two
sequences
• The algorithm used for this is dynamic programming, which is very important for molecular biology
• Guarantee: it gives an optimal global alignment
• Problem 1: many alignments may give the same optimal score
• Problem 2 (technical): the time required to align two sequences of lengths n and m is proportional to n * m, as this
is the size of the edit matrix that must be filled in
• Variations of the dynamic programming method:
• 1. entire sequence to entire sequence: global match
• 2. region of one sequence to entire other sequence: local match
• 3. region of one to region of another: motif match
• A typical approximation approach takes a small integer k and finds all instances of each k-tuple
of residues in the probe sequence that are found in database sequences
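The global match variant can be sketched with the classic dynamic programming recurrence. This is a minimal illustration, assuming a simple scoring scheme (match +1, mismatch -1, linear gap cost -2) instead of a substitution matrix with affine gaps; the function name is hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Fill the (len(a)+1) x (len(b)+1) matrix, then trace back one optimal
    global alignment. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Each cell depends only on its three neighbours: n * m steps in total.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Traceback from the bottom-right corner; ties are broken in a fixed order,
    # so only one of possibly many optimal alignments is reported.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = global_align("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```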
Dynamic programming: for BLAST search
• How to travel from start to finish by passing through Hyderabad
[Figure: route map with Kashmir, Hyderabad and Kanyakumari]
• 6+6 only!!
• However, detailed algorithm can be read from the book!!
Multiple sequence alignment and
database searching
• Searching a database for homologues of known protein is a central theme of
bioinformatics
• There are 3 important methods
• A. Profiles
• B. PSI-BLAST
• C. Hidden Markov models (HMM)
• The goal is to find sequences with high sensitivity and high specificity.
A. Profiles
• Profiles express the patterns inherent in an MSA of a set of homologous sequences
• They help in following:
• A. greater accuracy in alignment of distantly related sequences
• B. set of residues that are highly conserved are likely to be part of active site and give
clues to function
• C. Identification of other homologous sequences
• D. set of residues which are of little conservation are in surface loops and used for
vaccine design
• E. Most structure prediction methods rely on the profiles
Matching MSA of thioredoxins, positions 25-30…
What is the score of VDFSAE??
[Figure legend: amino acid colors]
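Scoring a peptide against a profile can be sketched as follows. This is an illustration under loud assumptions: the six-column alignment is invented (it is not the thioredoxin alignment from the figure), the background is a uniform 1/20 per amino acid, and a pseudocount of 1 is added so that unseen residues do not give log(0). Other scoring schemes exist.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(msa, pseudocount=1.0):
    """One dict per alignment column: log2(observed frequency / background)."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, peptide):
    """Sum of the position-specific scores of the peptide's residues."""
    if len(peptide) != len(profile):
        raise ValueError("peptide length must equal the number of profile columns")
    return sum(column[aa] for column, aa in zip(profile, peptide))


# Hypothetical ungapped alignment of five six-residue segments.
toy_msa = ["VDFSAE", "VDFWAE", "VDFSAT", "IDFSAE", "VEFSAE"]
toy_profile = build_profile(toy_msa)
print(round(profile_score(toy_profile, "VDFSAE"), 2))
print(round(profile_score(toy_profile, "WWWWWW"), 2))
```

A peptide that follows the conserved pattern scores well above one that does not, and fully conserved columns contribute the most.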
B. PSI-BLAST
• It is a program that searches the data bank for sequences similar to a query sequence
• It derives pattern information from a multiple sequence alignment of initial hits
• And reprobes the database using the pattern
• Then it repeats the process, fine tuning the pattern in successive cycles
• Very powerful: in picking distant relationships
• The E-value threshold is usually 0.005
• The program builds a position-specific scoring matrix (PSSM)
• The matrix can be used as an alternative to the input
sequence and substitution matrix in a BLAST search
• The procedure usually converges
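The search, rebuild, repeat cycle can be sketched with a toy model. This is not the PSI-BLAST algorithm itself: the database is a hypothetical list of ungapped six-residue strings, the profile is a plain table of column frequencies, and a fixed score threshold stands in for the E-value cut-off.

```python
from collections import Counter


def column_frequencies(sequences):
    """Frequency of each residue in each column of equal-length sequences."""
    profile = []
    for column in zip(*sequences):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def profile_score(profile, sequence):
    """Sum over positions of the frequency of the sequence's residue there."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, sequence))


def iterative_search(query, database, threshold, max_rounds=10):
    """Repeat: build a profile from the current hits, rescan the database,
    stop when the set of hits no longer changes."""
    hits = {query}
    history = []
    for _ in range(max_rounds):
        profile = column_frequencies(sorted(hits))
        new_hits = {s for s in database if profile_score(profile, s) >= threshold}
        new_hits.add(query)
        history.append(new_hits)
        if new_hits == hits:
            break
        hits = new_hits
    return hits, history


# Hypothetical database; the third entry matches the query at only 4 of 6 positions.
toy_db = ["ACDEFG", "ACDEFH", "ACDQFH", "ACNQFH", "WYNQPH"]
final_hits, rounds = iterative_search("ACDEFG", toy_db, threshold=4.5)
print(sorted(final_hits), len(rounds))
```

The more distant sequence is missed in the first round and picked up in the second, once the profile has absorbed the variation seen in the first hits.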
C. Hidden Markov models
• It is a computational structure for describing subtle patterns that define families of
homologous sequences
• 1. distant relative prediction
• 2. prediction of protein folding patterns

• HMMs are more general than profiles:


• 1. possibility of introducing gaps into the generated sequence with position specific
gap penalties
• 2. HMMs carry out the alignment and the assignment of probabilities together
HMM..output sequence from succession of
match state and insert state
• For each residue position of an MSA, the HMM contains a match state (m) and a delete state (d)
• Insert states (i) appear between residue positions and at the beginning and at the end
• The probability of each match state is position-dependent, and it emits a residue
• A delete state skips a column and opens a gap; moving to another delete state from the previous
delete state extends the gap
• Insert state: causes a new residue that does not correspond to a position in the alignment table
to appear in the emitted sequence
• It is not possible to traverse the network without passing through an m or d state at each position
HMM can detect distant
homologues
• Only the current state influences the choice of the successor
• The system has no memory of the history
• The succession of amino acids emitted forms the output and is visible
• The state sequence that generates the characters remains internal to the system, that is
hidden
• By probability distribution associated with the individual states the system models the
patterns inherent in a family of sequences
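How a profile HMM emits a sequence while the state path stays hidden can be sketched with a small generator. This is a simplified illustration: the emission probabilities are invented, and a single insert probability and a single delete probability are used everywhere, whereas a real profile HMM has position-specific transition probabilities (including separate gap opening and extension).

```python
import random


def emit_sequence(match_emissions, insert_alphabet, p_insert=0.1, p_delete=0.1, rng=None):
    """Walk through the model once. Returns (emitted sequence, hidden state path).
    Each choice depends only on the current position, not on earlier history."""
    rng = rng or random.Random()
    sequence, path = [], []
    position = 0
    n_positions = len(match_emissions)
    while True:
        # Insert state at this position: may repeat, emits an extra residue.
        while rng.random() < p_insert:
            sequence.append(rng.choice(insert_alphabet))
            path.append(f"I{position}")
        if position == n_positions:
            break
        position += 1
        if rng.random() < p_delete:
            path.append(f"D{position}")      # delete state: emits nothing
        else:
            residues, weights = zip(*match_emissions[position - 1].items())
            sequence.append(rng.choices(residues, weights=weights)[0])
            path.append(f"M{position}")      # match state: emits one residue
    return "".join(sequence), path


# Hypothetical three-position model.
toy_emissions = [{"V": 0.8, "I": 0.2}, {"D": 0.9, "E": 0.1}, {"F": 1.0}]
emitted, hidden_path = emit_sequence(
    toy_emissions, "ACDEFGHIKLMNPQRSTVWY", rng=random.Random(0)
)
print(emitted)       # the visible output
print(hidden_path)   # the state sequence, normally hidden
```

Every position is visited exactly once through either its match or its delete state, which mirrors the rule that the network cannot be traversed while skipping both.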
Structural
bioinformatics and
drug discovery

Subha Narayan Rath


• 1. great variety of structure and functions for all functions of cells and tissues
• 2. One place of synthesis, i.e. the ribosome, using only 20 different units
• 3. They fold spontaneously to the active native 3D state, which is encoded just by the amino acid
sequence
• At the end: helices, strands and turns (HST) assemble to make the 3D structure, which is more informative
• Similar structure < > similar function
• PDB database is the main repository for 3d biological macromolecular structure data
• Source:
• 1. crystal structure
• 2. NMR models
• 3. others
• 1. rotation and translation
• 2. color specific parts of molecules
• 3. labelling of residues and atoms
• 4. Geometrical measurements (distances and angles)
• 5. schematic representations, and comparison and alignment of structures
• 1. stick and ball model
• 2. space filled model
• 3. Backbone: only connecting the C-alpha atoms
• 4. Schematic: helix: cylinder and strand: arrow
• 5. surface
• 1. PyMOL
• 2. RasMol
• 3. Chimera etc.
• https://pymol.org/2/
• First download and install
• PyMOL tutorial
• http://www.protein.osaka-u.ac.jp/rcsfp/supracryst/suzuki/jpxtal/Katsutani/en/interface.php
• Examination of atomic interactions
• Examination of secondary structures
• Buried/exposed regions
• Analysis of ligands
• Structural alignment
• Structural classification
• Secondary structure prediction or structure prediction
• Molecular docking
• Molecular dynamics
• Sequence
• Type and number of secondary structures (HST)
• Structural arrangement of secondary structures
• Attributes of individual amino acids
• Distances between amino acids
• All β
• All α
• α/β: β-α-β super secondary structures are
present, could be linear or barrels
• α+β: both are separated at different parts of
molecules
• The most common classification databases
are:
• 1. SCOP
• 2. CATH
• Prediction of secondary structure is
feasible and mostly uses machine
learning algorithms
• It is a bridge between linear and 3D
structures
• 1. Homology modelling: by aligning
proteins of known structure (by
SWISS-MODEL)
• 2. Fold recognition (by known protein
folds)
• 3. Ab initio method of modelling
• Rotation is permitted around the N-Cα and Cα-C single bonds of all residues (except proline)
• The angles around these bonds (φ and ψ) and the angle of rotation around the peptide bond (ω) define the
conformation of residues
• Peptide bonds are mostly planar with ω = 180 degrees, as they are in the trans state
• The principle that two atoms can’t occupy the same space limits the values of conformational angles
• The allowed ranges of φ, ψ with ω = 180 fall into defined regions in a graph called the
Ramachandran plot
• Solid lines delimit energetically preferred regions of the angles; regions outside the broken
lines are sterically disallowed
• Most amino acids fall into the right-handed helix or beta regions
• Glycine has additional conformations and can form a left-handed helix
• Only a few are forced into energetically less favorable states
• Many but not all turns are short, surface-exposed regions that contain charged or polar
residues
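The conformational angles plotted in a Ramachandran plot are dihedral (torsion) angles defined by four consecutive atoms. The sketch below computes such an angle from coordinates; the points are invented test geometry, and for a real residue φ would use the atoms C(i-1), N, Cα, C and ψ the atoms N, Cα, C, N(i+1).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p1, p2, p3, p4):
    """Torsion angle in degrees (range -180 to 180) around the bond p2-p3."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n2 = _cross(b2, b3)
    x = _dot(_cross(b1, b2), n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


# Invented geometry: central bond along the z axis.
p1, p2, p3 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
print(dihedral(p1, p2, p3, (1.0, 0.0, 1.0)))    # cis arrangement
print(dihedral(p1, p2, p3, (0.0, 1.0, 1.0)))    # quarter turn
print(dihedral(p1, p2, p3, (-1.0, 0.0, 1.0)))   # trans arrangement
```

The trans case corresponds to the ω = 180 degrees of a planar peptide bond.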
• 1. Size: glycine is the smallest, with only H as its side chain; phenylalanine contains a benzene ring and is one
of the largest
• 2. Electric charge: acidic amino acids are negatively charged while basic ones are
positively charged
• 3. Polarity: polar side chains form hydrogen bonds to others and to water; others are
electrically neutral and, having unfavorable interactions with water, are hydrophobic
• 4. Shape and rigidity

• E.g. D & E are similar and L & I are similar
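The idea that some amino acids are similar can be sketched as a lookup of property groups. The grouping below is one common textbook classification and is an assumption here; other sources assign borderline residues (for example H, C, Y, G, P) differently.

```python
# One-letter codes grouped by side-chain character (assumed grouping).
PROPERTY_GROUPS = {
    "acidic (negatively charged)": set("DE"),
    "basic (positively charged)": set("KRH"),
    "polar uncharged": set("STNQCY"),
    "hydrophobic": set("AVLIMFWP"),
    "smallest (glycine)": set("G"),
}


def group_of(residue: str) -> str:
    """Name of the property group that contains the residue."""
    for name, members in PROPERTY_GROUPS.items():
        if residue in members:
            return name
    raise ValueError(f"unknown amino acid code: {residue!r}")


def similar(a: str, b: str) -> bool:
    """Two residues count as similar when they share a property group."""
    return group_of(a) == group_of(b)


print(similar("D", "E"), similar("L", "I"), similar("D", "L"))
```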


Gene expression and
regulation

Subha Narayan Rath


• Proteomics is the study of the distribution and interactions of proteins in time and
space in a cell or organism
• Transcriptomics is the study of all RNA molecules in a tissue
• High-throughput methods of data analysis, microarray analysis and mass
spectrometry, give a large-scale picture of proteins in living tissues
• They are active in controlling: transcription and translation
• Goal of systems biology is the synthesis of genomic, transcriptomics, proteomics and
other data into an integrated picture of structure, dynamics, and logic of living tissues
• 2 techniques show the distribution of RNA showing indirectly proteins in cells:
Microarray technique and RNAseq (high throughput sequencing of RNAs in a sample)
• Mass spectrometry for proteomics: Not discussed!!
Type of test | Probe (synthesized and immobilized material in the chip) | Target (the sample which is extracted, labelled & then tested) | Comments
One-to-one test | Oligo with a complementary sequence | One oligonucleotide with a known sequence | Hybridization
Many-to-one test | One probe with complementary sequence | To find the query oligo in a mixture, spread the mixture out and test each component of the mixture | Northern or Southern blot
Many-to-many test | A set of oligos are synthesized, one complementary to each sequence of query | To detect many oligos in a mixture, they are prepared with different colored fluorescent tags for different | Microarrays where DNA oligomers are affixed to known locations on a rigid support in a regular 2D
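The complementarity between probe and target that underlies all three test types can be sketched as a string check. This is a simplified model: only perfect Watson-Crick complementarity counts as hybridization, and the probe names and sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the strand that pairs with `seq`, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def binding_probes(target: str, probes: dict) -> list:
    """Names of the probes whose perfect complement occurs in the target."""
    return [name for name, probe in probes.items()
            if reverse_complement(probe) in target]


# Hypothetical many-to-one style check: two probes tested against one target.
toy_probes = {"p1": "TAAGCC", "p2": "TTTTTT"}
print(reverse_complement("ATGC"))
print(binding_probes("ATGGCTTACG", toy_probes))
```

On a real chip, hybridization also tolerates some mismatches and depends on temperature and salt, which this check ignores.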
• DNA microarrays analyze
• 1. the mRNAs in a cell to reveal the expression patterns of proteins; or
• 2. genomic DNA to reveal absent or mutated genes

• The following answers can be found by DNA microarrays


• A. Integrated characterization of cellular activity: what proteins are present at what
amounts and where exactly?
• B. Measuring expressed genes help to identify which genes are causative for diseases
• DNA microarrays or DNA chips are devices for checking a sample simultaneously for
the presence of many sequences
• Distributed on small wafer of glass or nylon typically 2cm square
• Spot size ~ 150 µm in diameter
• DNA chip: 400 000 probe oligomers (larger than total genes)
• The chip is scanned to get the data in computer-readable form
• Affymetrix and Illumina: 25-mer probes synthesized in situ vs. multiple copies of single
50-mer probes attached to a microbead
• 1. Expression chips: immobilized oligos are cDNA samples (20-80bp) from mRNAs of
known genes; Target samples are mixture of mRNAs of normal or diseased tissue
• 2. Genomic hybridization: gains or losses of genes or changes in copy number. Target
sequences fixed on chips are large pieces of genomic DNA, 500-5000bp long; probe
mixture contain genomic DNA from normal or diseased states
• 3. Mutation microarray analysis: one looks for SNPs
• Precision is low and hence it is semi-quantitative
• mRNA levels detected by the array, do not reflect protein level
• Even yield in RT to make cDNA may be non-uniform
• MIAME: Minimum information about a Microarray Experiment: describe the contents and
formats of the information to be recorded in the experiment
• European Bioinformatics Institute: ArrayExpress:
https://www.ebi.ac.uk/biostudies/arrayexpress/studies
• US NCBI hosts GENE EXPRESSION OMNIBUS: https://www.ncbi.nlm.nih.gov/geo/geo2r/
• Princeton University microarray database: https://puma.princeton.edu/
• Microarray database of plants:
• Steps are:
• 1. collection of samples and
isolation of mRNA
• 2. Labelled cDNA
• 3. Hybridization
• 4. Scanning and analysis
• Color and intensity of the
fluorescence reflect the extent of
hybridization
• One gene may correspond to 30-40
spots, so the data are highly redundant

https://microbenotes.com/dna-microarray/
• By image processing, checking internal
controls, dealing with missing data, selecting
reliable measurements, putting the results in
consistent scales
• A change by a factor of 1.5-2 is considered significant; each row or
column is considered as a vector
• Two approaches for analysis:
• 1. comparisons focused on genes by
comparing rows
• 2. comparison focused on different samples
by comparing columns
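The two ways of reading an expression table, by rows (genes) and by columns (samples), can be sketched on a toy matrix. All gene names, sample names and intensities below are invented, and the 2-fold cut-off is just the upper end of the range quoted above.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


# Hypothetical table: rows are genes, columns are samples.
samples = ["normal", "treated_1", "treated_2"]
expression = {
    "geneA": [100.0, 210.0, 400.0],
    "geneB": [50.0, 105.0, 200.0],
    "geneC": [300.0, 290.0, 310.0],
}

# Fold change of each gene between the last and the first sample.
fold_change = {gene: row[-1] / row[0] for gene, row in expression.items()}
changed = [g for g, fc in fold_change.items() if fc >= 2.0 or fc <= 0.5]
print(changed)

# 1. Comparison focused on genes: correlate two rows.
print(round(pearson(expression["geneA"], expression["geneB"]), 3))

# 2. Comparison focused on samples: correlate two columns.
columns = list(zip(*expression.values()))
print(round(pearson(columns[0], columns[1]), 3))
```

Genes whose rows correlate strongly are candidates for co-regulation, while correlated columns indicate samples in a similar state.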

https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/05_counting_reads.html
• About GEO2R: https://www.ncbi.nlm.nih.gov/geo/info/geo2r.html
• GEO2R: https://www.ncbi.nlm.nih.gov/geo/geo2r/
• Volcano plot: shows differentially expressed genes by plotting statistical significance
against the magnitude of change (fold change)
• Mean difference plot: also shows differentially expressed genes
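The coordinates used in these two plots can be sketched without any plotting library. This is an illustration under stated assumptions: the expression means and p-values are invented, and the cut-offs (2-fold change, p below 0.05) are common choices rather than settings taken from GEO2R.

```python
import math


def volcano_point(mean_control, mean_case, p_value):
    """x = log2 fold change, y = -log10(p-value)."""
    return math.log2(mean_case / mean_control), -math.log10(p_value)


def mean_difference_point(mean_control, mean_case):
    """x = average log2 expression, y = log2 fold change."""
    a, b = math.log2(mean_control), math.log2(mean_case)
    return (a + b) / 2, b - a


def is_hit(point, min_abs_log2_fc=1.0, max_p=0.05):
    """A gene is flagged when it is both strongly changed and significant."""
    log2_fc, neg_log10_p = point
    return abs(log2_fc) >= min_abs_log2_fc and neg_log10_p >= -math.log10(max_p)


# Hypothetical genes: (mean in control, mean in case, p-value).
toy_genes = {
    "geneA": (100.0, 400.0, 0.001),
    "geneB": (100.0, 110.0, 0.5),
    "geneC": (200.0, 50.0, 0.01),
}
for name, (control, case, p) in toy_genes.items():
    point = volcano_point(control, case, p)
    print(name, [round(v, 2) for v in point], is_hit(point))
```

Genes in the upper left and upper right corners of a volcano plot are the ones that are both significant and strongly changed.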
