
Phylogenetic Analysis1

This document provides an overview of phylogenetic analysis and describes key concepts: 1. Phylogenetic analysis involves constructing tree diagrams called phylogenies to represent the evolutionary relationships between taxa based on genetic or physical traits. 2. Common methods for phylogenetic analysis include characterizing taxa, inferring phylogenetic trees, and analyzing trait evolution. 3. Important goals of phylogenetic analysis are determining evolutionary histories and close relatives of organisms.

Uploaded by

Panku Pankaj
Copyright
© All Rights Reserved

PHYLOGENETIC ANALYSIS

OUTLINE OF THE LECTURE

1. Introducing some of the terminology of Phylogenetics.

2. Introducing some of the most commonly used methods for phylogenetic analysis.

3. Explaining how to construct phylogenetic trees.


What is Phylogenetic Analysis?
 Phylogenetics is the study of the evolutionary history of living organisms, using tree-like diagrams to represent the pedigrees of these organisms.

 The tree branching patterns representing the evolutionary divergence are referred to as a phylogeny.

Phylogenetic analysis has two major components:

 Phylogeny inference or “tree building” – the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.)

 Character and rate analysis – using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest.
Studying Phylogenetics

 Fossil records – morphological information; available only for certain species; data can be fragmentary; morphological traits are ambiguous; the fossil record is nonexistent for microorganisms

 Molecular data (molecular fossils) – more numerous than fossils, easier to obtain, favored for reconstruction of the evolutionary history
Why should we perform it?

FEW EXAMPLES

 Which species are the closest living relatives of modern humans?
 Did the infamous Florida Dentist infect his patients with HIV?
 What were the origins of specific transposable elements?
 Plus countless others…
Common Phylogenetic Tree Terminology

[Figure: a rooted tree with five terminal nodes, A to E. Labels: branches or lineages; terminal nodes, which represent the taxa (genes, populations, species, etc.) used to infer the phylogeny; internal nodes or divergence points, which represent hypothetical ancestors of the taxa; and the ancestral node or root of the tree.]
 Monophyletic (clade) – a group of taxa that are derived from a single ancestral species and that includes all descendants of that ancestor. Two taxa that arise from the same node are sister taxa.
 Polyphyletic – a group whose members were derived from
two or more ancestors.
 Paraphyletic – a taxon that excludes some members that share
a common ancestor with members included in the taxon.
Tree Topology: branching pattern

 dichotomy – all branches bifurcate
 polytomy – result of a taxon giving rise to more than two descendants, or an unresolved phylogeny (the exact order of bifurcations cannot be determined)
 unrooted – no knowledge of a common ancestor; shows the relative relationships of taxa but no direction of an evolutionary path
 rooted – has a root representing the common ancestor, so it shows the direction of evolution and is more informative
Finding a true tree is difficult


Each unrooted tree theoretically can be rooted anywhere
along any of its branches

[Figure: example unrooted trees for four, five and six taxa]

# Taxa    # Unrooted trees    x    # Roots    =    # Rooted trees
3         1                        3               3
4         3                        5               15
5         15                       7               105
6         105                      9               945
7         945                      11              10,395
8         10,395                   13              135,135
9         135,135                  15              2,027,025
.         .                        .               .
.         .                        .               .
30        ~8.69 x 10^36            57              ~4.95 x 10^38

(2N - 5)!! = # unrooted trees for N taxa
(2N - 3)!! = # rooted trees for N taxa
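The counts in the table follow from the double factorial. A minimal sketch, assuming strictly bifurcating trees; the function names are illustrative and not taken from any package:

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees: (2N - 5)!!"""
    return double_factorial(2 * n_taxa - 5)


def rooted_trees(n_taxa):
    """Number of distinct rooted bifurcating trees: (2N - 3)!!"""
    return double_factorial(2 * n_taxa - 3)


for n in (3, 4, 5, 8, 9):
    # An unrooted tree of N taxa has 2N - 3 branches, each a possible root position.
    print(n, unrooted_trees(n), 2 * n - 3, rooted_trees(n))
```

Multiplying the number of unrooted trees by the number of branches (2N - 3) gives the number of rooted trees, which is the "x # Roots =" step in the table.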


Rooting the tree
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.

[Figure: an unrooted tree of taxa A, B, C and D with a root position marked, and the rooted tree obtained from it]

Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

Now, try it again with the root at another position:

[Figure: the same unrooted tree with the root at a different position, and the resulting rooted tree]

Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.
An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees.

[Figure: the unrooted tree 1 with taxa A, B, C and D and its five branches numbered 1 to 5, followed by the rooted trees 1a to 1e obtained by placing the root on each branch]

These trees show five different evolutionary relationships among the taxa!
There are two major ways to root trees:
By outgroup:
Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can be either species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).

By midpoint or distance:
Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.

[Figure: a tree of taxa A, B, C and D with branch lengths 10, 3, 2, 2 and 5. d(A,D) = 10 + 3 + 5 = 18; midpoint = 18 / 2 = 9]
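Midpoint rooting can be sketched in a few lines. The branch lengths below are the ones in the figure, but the topology (A and B on one internal node, C and D on the other) and the internal node names X and Y are assumptions made for illustration:

```python
from itertools import combinations

# Assumed topology: (A:10, B:2) joined to (C:2, D:5) by an internal branch of length 3.
EDGES = [("A", "X", 10), ("B", "X", 2), ("X", "Y", 3), ("C", "Y", 2), ("D", "Y", 5)]


def build_graph(edges):
    """Store the tree as a symmetric adjacency map: node -> {neighbor: branch length}."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def path(graph, start, goal, seen=None):
    """Return the unique path between two nodes of a tree as a list of nodes."""
    seen = (seen or set()) | {start}
    if start == goal:
        return [start]
    for nxt in graph[start]:
        if nxt not in seen:
            rest = path(graph, nxt, goal, seen)
            if rest:
                return [start] + rest
    return None


def midpoint_root(edges):
    """Return (taxon 1, taxon 2, distance, (node, node, offset)) for the midpoint root."""
    graph = build_graph(edges)
    leaves = [n for n in graph if len(graph[n]) == 1]

    def length(p):
        return sum(graph[a][b] for a, b in zip(p, p[1:]))

    # The two most distant taxa define the path on which the root is placed.
    longest = max((path(graph, a, b) for a, b in combinations(leaves, 2)), key=length)
    total = length(longest)
    remaining = total / 2
    for a, b in zip(longest, longest[1:]):
        if remaining <= graph[a][b]:
            return longest[0], longest[-1], total, (a, b, remaining)
        remaining -= graph[a][b]


result = midpoint_root(EDGES)
print(result)
```

Under the assumed topology the most distant pair is A and D at distance 18, so the root falls 9 units from A, on the branch leading to A.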
Gene phylogeny vs. species phylogeny
 Main objective of building phylogenetic trees based on molecular
sequences: reconstruct the evolutionary history of the species involved.
 A gene phylogeny only describes the evolution of that particular gene
or encoded protein. This sequence may evolve more or less rapidly
than other genes in the genome.
 The evolution of a particular sequence does not necessarily correlate
with the evolutionary path of the species.
 Branching point in a species tree – the speciation event
 Branching point in a gene tree – may reflect a speciation event or another event, such as a gene duplication
 The two events may or may not coincide.
 To obtain a species phylogeny, phylogenetic trees from a variety of
gene families need to be constructed to give an overall assessment of
the species evolution.
Forms of tree representation

 phylogram – branch lengths represent the amount of evolutionary divergence
 cladogram – external taxa line up neatly, only the topology matters
 ultrametric – an ultrametric tree is an additive tree which can be rooted so that all paths from the root to a leaf have the same length
Newick format
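Newick format writes a tree as nested parentheses, with optional branch lengths after a colon and a terminating semicolon. A minimal sketch of a reader follows; the example tree is hypothetical, and the parser assumes plain labels without quotes, comments or whitespace:

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, branch_length, children) tuples."""
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                        # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                    # skip ","
                children.append(node())
            pos += 1                        # skip ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]              # empty for unlabeled internal nodes
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    return node()


def leaf_names(tree):
    """List the terminal node names from left to right."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


# Hypothetical example: A and B are sister taxa, as are C and D.
tree = parse_newick("((A:10,B:2):3,(C:2,D:5));")
print(leaf_names(tree))
```

For real data a tested parser from an established phylogenetics library is the safer choice; this sketch only shows how the parentheses map onto the tree structure.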
Phenetics versus Cladistics

 Cladistics can be defined as the study of the pathways of evolution. In other words, cladists are interested in such questions as: how many branches there are among a group of organisms; which branch connects to which other branch; and what is the branching sequence. A tree-like network that expresses such ancestor-descendant relationships is called a cladogram. Thus, a cladogram refers to the topology of a rooted phylogenetic tree.

 Phenetics is the study of relationships among a group of organisms on the basis of the degree of similarity between them, be that similarity molecular, phenotypic, or anatomical. A tree-like network expressing phenetic relationships is called a phenogram.
A consensus tree
 combining the nodes of several trees:
 strict consensus – all conflicting nodes are collapsed into polytomies
 majority rule – among the conflicting nodes, those found in more than 50% of the trees are retained, whereas the remaining nodes are collapsed into polytomies
ASSUMPTIONS
Procedure
1. Choice of molecular markers
2. Multiple sequence alignment
3. Choice of a model of evolution
4. Determine a tree building method
5. Assess tree reliability
1. Choice of molecular markers
 Nucleotide or protein sequence data?
 Nucleic acid (NA) sequences evolve more rapidly than protein sequences.
 They can be used for studying very closely related
organisms, e.g. for evolutionary analysis of different
individuals within a population, noncoding regions of
mtDNA are often used.
 Evolution of more divergent organisms – either slowly
evolving NA (e.g., rRNA) or protein sequences.
 Deepest level (e.g., relationships between bacteria and
eukaryotes) – conserved protein sequences
 NA sequences: good if sequences are closely related, reveal
synonymous/nonsynonymous substitutions
Positive and negative selection
 synonymous substitution – nucleotide changes in a
sequence not resulting in amino acid sequence changes
(genetic code degeneracy, 3rd codon position)
 nonsynonymous substitution – nucleotide changes in a sequence that change the amino acid sequence
 positive selection – nonsynonymous substitution rate > synonymous substitution rate
 certain parts of the protein are undergoing active mutations that may contribute to the evolution of new function
 negative selection – synonymous > nonsynonymous
 neutral changes at the AA level; the protein sequence is critical enough that changes to it are not tolerated
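The distinction between the two kinds of substitution can be illustrated by comparing two aligned coding sequences codon by codon. This is a minimal sketch that assumes the standard genetic code, in-frame sequences without gaps, and hypothetical example sequences. It returns raw counts of differing codons, not the normalized substitution rates used in real selection tests:

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_substitutions(seq_a, seq_b):
    """Count differing codons as synonymous or nonsynonymous."""
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1       # same amino acid, e.g. a third-position change
        else:
            nonsynonymous += 1    # the encoded amino acid changes
    return synonymous, nonsynonymous


# Hypothetical sequences: GCT -> GCC keeps Ala, AAA -> AGA changes Lys to Arg.
counts = count_substitutions("ATGGCTAAA", "ATGGCCAGA")
print(counts)
```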
2. MSA
 Critical step
 Multiple state-of-the-art alignment programs (e.g., T-Coffee, Praline, Poa, …) should be used.
 The alignment results from multiple sources should be
inspected and compared carefully to identify the most
reasonable one.
 Automatic sequence alignments almost always contain
errors and should be further edited or refined if necessary
– manual editing!
3. Model of evolution
 A simple measure of the divergence of two sequences is the number of substitutions in the alignment; the distance between two sequences is then the proportion of sites that differ.
 If A was replaced by C, was it A → C or A → T → G → C (multiple substitutions)?
 Back mutation: G → C → G.
 Parallel mutations – both sequences mutate into e.g., T at the
same time.
 All of this obscures the estimation of the true evolutionary
distances between sequences.
 This effect is known as homoplasy and must be corrected.
 Statistical models infer the true evolutionary distances
between sequences.
Model of evolution

FOR NUCLEIC ACIDS
• JUKES-CANTOR MODEL
• KIMURA MODEL

FOR PROTEINS (Substitution matrices)
• PAM
• BLOSUM
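As an illustration of how such a model corrects the observed distance, the Jukes-Cantor formula converts the proportion of differing sites p into an estimated number of substitutions per site, d = -3/4 ln(1 - 4p/3). A minimal sketch with hypothetical function names and example sequences:

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("correction is undefined for p >= 0.75 (saturation)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("AAGT", "AAGC")   # one difference in four sites
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the observed proportion, because some substitutions are hidden by multiple, back and parallel mutations.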
4. TREE BUILDING METHODS

Classification of phylogenetic inference methods

                              COMPUTATIONAL METHOD
DATA TYPE     Optimality criterion                         Clustering algorithm
Characters    MAXIMUM PARSIMONY, MAXIMUM LIKELIHOOD
Distances     MINIMUM EVOLUTION, FITCH-MARGOLIASH (FM)     UPGMA, NEIGHBOR-JOINING
Tree building methods: Two major categories.

 Distance-based methods.
 Based on the amount of dissimilarity between pairs of sequences, computed on the basis of sequence alignment.

 Character-based methods.
 Based on discrete characters, which are molecular sequences from individual taxa.
Distance based methods
 Calculate evolutionary distances dAB between sequences using one of the evolutionary models.
 Construct a distance matrix – distances between all pairs
of taxa.
 Based on the distance scores, construct a phylogenetic
tree.
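The first two steps can be sketched as follows. The four sequences are the ones used in the maximum parsimony worked example of this lecture; the uncorrected proportion of differences is used as the distance here only to keep the sketch short, and any model-corrected distance could be passed in instead:

```python
from itertools import combinations

ALIGNMENT = {
    "1": "AAGAGTGCA",
    "2": "AGCCGTGCG",
    "3": "AGATATCCA",
    "4": "AGAGATCCG",
}


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which the two sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b) / len(seq_a)


def distance_matrix(alignment, distance=p_distance):
    """Distances between all pairs of taxa, keyed by (taxon, taxon)."""
    return {
        (x, y): distance(alignment[x], alignment[y])
        for x, y in combinations(alignment, 2)
    }


matrix = distance_matrix(ALIGNMENT)
for pair, d in matrix.items():
    print(pair, round(d, 3))
```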
Distance based- Clustering method

 UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
 Hierarchical, agglomerative clustering, also known as average linkage (the simplest method)
 Produces rooted tree (most phylogenetic methods
produce unrooted tree).
 Basic assumption of the UPGMA method: all taxa
evolve at a constant rate, they are equally distant from
the root, implying that a molecular clock is in effect.
 However, real data rarely meet this assumption. Thus,
UPGMA often produces erroneous tree topologies.
FAST SPEED
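A minimal sketch of the UPGMA clustering loop, returning only the topology as nested tuples. The taxon names and distances are hypothetical and were chosen to be clock-like, and ties are broken by whichever pair is encountered first:

```python
from itertools import combinations


def upgma(names, dist):
    """Average-linkage clustering; dist maps frozenset({a, b}) to a distance."""
    clusters = {name: 1 for name in names}   # cluster label -> number of taxa in it
    d = dict(dist)
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to every other cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical clock-like distances between four taxa.
DISTANCES = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("A", "D")): 6,
    frozenset(("B", "C")): 6, frozenset(("B", "D")): 6, frozenset(("C", "D")): 4,
}

tree = upgma(["A", "B", "C", "D"], DISTANCES)
print(tree)
```

The last join defines the root, which is why UPGMA produces a rooted tree.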
NEIGHBOR JOINING
• A little like UPGMA: builds a tree using stepwise reduced distance matrices
• Difference: does not assume a molecular clock
• Uses a correction step to account for unequal evolutionary rates between sequences
• Fast and can give good results
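The correction step can be sketched for the first join only. The sketch uses the standard neighbor-joining criterion Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where r is the sum of a taxon's distances to all others; the five-taxon matrix is hypothetical example data:

```python
def nj_first_join(d):
    """Return the first pair joined by neighbor joining and its two branch lengths."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            # Corrected distance: penalizes pairs that are far from everything else.
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    i, j = best_pair
    # Branch lengths from each joined taxon to the new internal node.
    limb_i = d[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    limb_j = d[i][j] - limb_i
    return best_pair, limb_i, limb_j


# Hypothetical distances between taxa a, b, c, d, e (indices 0 to 4).
MATRIX = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

pair, limb_a, limb_b = nj_first_join(MATRIX)
print(pair, limb_a, limb_b)
```

In this example taxa d and e have the smallest raw distance (3), yet a and b are joined first because the correction takes the distances to all other taxa into account. The two branch lengths also differ, which UPGMA would not allow.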

[Figure: equations used, original distance matrix, corrected distance matrix, and the star tree for taxa A, B, C and D]
Optimality based methods
 Clustering methods produce a single tree.
 There is no criterion for judging how this tree compares to alternative trees.
 Optimality based methods have a well-defined algorithm to compare all possible tree topologies and select a tree that best fits the actual evolutionary distance matrix.

FITCH-MARGOLIASH
Minimal deviation between the distances calculated along all branches and the distances in the original dataset.

MINIMUM EVOLUTION
Tree with the minimal overall branch length.

EXHAUSTIVE METHODS: SLOW COMPUTATION: CAN’T BE USED FOR LARGE DATA SETS
Distance based – pros and cons
 Clustering
 Fast, can handle large datasets
 Not guaranteed to find the best tree
 The actual sequence information is lost when all the sequence variation is reduced to a single value. Hence, ancestral sequences at internal nodes cannot be inferred.
 UPGMA – assumes a constant rate of evolution of the sequences in all branches of the tree (molecular clock assumption)
 NJ – does not assume that the rate of evolution is the same in all branches of the tree
 NJ is slower but better than UPGMA
 Single tree

 Exhaustive tree searching (FM)
 better accuracy, prohibitive for more than 12 taxa
 Gives many alternative trees

OVERALL ADVANTAGE
Make use of a large number of substitution models to correct the distances.

LIMITATION
Actual sequence information is lost.
Character based methods
 Also called discrete methods
 Based directly on the sequence characters
 They count mutational events accumulated on the
sequences and may therefore avoid the loss of information
when characters are converted to distances.
 Evolutionary dynamics of each character can be studied
 Ancestral sequences can also be inferred.
 The two most popular character-based approaches:
maximum parsimony (MP) and maximum likelihood
(ML) methods.

MORE ADVANCED SLOWER MOST ACCURATE


Character based methods
Maximum parsimony
 Based on Occam’s razor.
 William of Occam, 14th century.
 The simplest explanation is probably the correct one.
 This is because the simplest explanation requires the fewest
assumptions and the fewest leaps of logic.
 A tree with the least number of substitutions is probably
the best to explain the differences among the taxa under
study.

 Similar to ME

NO: Inconsistencies Ambiguities Redundancies


Character based methods
Maximum parsimony
A worked example

1 2 3 4 5 6 7 8 9
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G

To save computing time, only a small number of sites that have the richest phylogenetic
information are used in tree determination.

informative site – sites that have at least two different kinds of characters, each
occurring at least twice
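The definition can be checked directly against the alignment above. A minimal sketch (the function name is illustrative):

```python
from collections import Counter

# The four sequences of the worked example, nine sites each.
ALIGNMENT = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]


def informative_sites(alignment):
    """Return (site number, column) for every parsimony-informative site."""
    sites = []
    for index, column in enumerate(zip(*alignment), start=1):
        counts = Counter(column)
        # Informative: at least two character states, each seen at least twice.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append((index, "".join(column)))
    return sites


print(informative_sites(ALIGNMENT))
```

Only sites 5, 7 and 9 qualify, with the patterns GGAA, GGCC and AGAG. The other sites are either invariant or have at most one character state that occurs twice.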
How many possible unrooted trees?

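$N_U = \frac{(2n - 5)!}{2^{n-3}\,(n - 3)!}$

Informative sites only:

     1 2 3
1    G G A
2    G G G
3    A C A
4    A C G

[Figure: the three possible unrooted trees for four taxa. Tree I groups taxa 1 and 2 against 3 and 4; Tree II groups 1 and 3 against 2 and 4; Tree III groups 1 and 4 against 2 and 3.]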
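Informative site I: GGAA
[Figure: Trees I, II and III with the states G, G, A, A placed on taxa 1 to 4 and the inferred states at the internal nodes]

Informative site II: GGCC
[Figure: Trees I, II and III with the states G, G, C, C placed on taxa 1 to 4 and the inferred states at the internal nodes]

Informative site III: AGAG
[Figure: Trees I, II and III with the states A, G, A, G placed on taxa 1 to 4 and the inferred states at the internal nodes]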
Number of substitutions required at each informative site:

Site           TREE I    TREE II    TREE III
GGAA           1         2          2
GGCC           1         2          2
AGAG           2         1          2
Tree length    4         5          6

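[Figure: Tree I, the most parsimonious tree (length 4). Taxa 1 (GGA) and 2 (GGG) join at an internal node inferred as GGA; taxa 3 (ACA) and 4 (ACG) join at an internal node inferred as ACA. Two changes fall on the internal branch and one on each of the branches leading to taxa 2 and 4.]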
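The tree lengths in the table can be reproduced with Fitch's counting procedure, a standard way to find the minimum number of changes on a given tree. This is a sketch, not necessarily how the slides derived the numbers. Each unrooted four-taxon tree is written as a nested pair, and the root placement does not affect the count:

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes) for one site."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0              # terminal node: its observed state
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


TREES = {
    "I": ((1, 2), (3, 4)),
    "II": ((1, 3), (2, 4)),
    "III": ((1, 4), (2, 3)),
}
SITES = ["GGAA", "GGCC", "AGAG"]   # the three informative sites, taxa 1 to 4


def tree_length(tree, sites):
    """Sum the minimum number of changes over all sites."""
    return sum(fitch(tree, dict(zip((1, 2, 3, 4), site)))[1] for site in sites)


lengths = {name: tree_length(tree, SITES) for name, tree in TREES.items()}
print(lengths)
```

Tree I requires the fewest substitutions and is therefore the maximum parsimony tree for these data.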
Character based methods
Weighted parsimony
 The parsimony method discussed so far is unweighted because it treats all mutations as equivalent.

 This may be an oversimplification; mutations of some sites are known to occur less frequently than others, for example, transversions versus transitions, functionally important sites versus neutral sites.

 A weighting scheme takes into account the different kinds of mutations.
Character based methods
Branch-and-bound

 The parsimony method examines all possible tree topologies to find the maximally parsimonious tree.
 This is an exhaustive search method, and it is expensive. Number of unrooted trees:
 N = 10 … 2,027,025
 N = 20 … 2.22 × 10^20
 Branch-and-bound
 Rationale: a maximally parsimonious tree must be equal to or shorter than the distance-based tree.
 First build a distance tree using NJ or UPGMA.
 Compute the minimum number of substitutions for this tree.
 The resulting number defines the upper bound to which any other trees are compared.
 I.e., when you build a parsimonious tree, you stop growing it when its length exceeds the upper bound.
Heuristic methods
 When the number of taxa exceeds 20, even branch-and-bound becomes computationally infeasible.
 Then, heuristic search can be applied.
 Both exhaustive search and branch-and-bound methods lead to the optimum tree.
 Heuristic search is not guaranteed to find the optimum tree and may return a suboptimal one (compare to BLAST, which is also heuristic).
Character based methods
Maximum parsimony – pros and cons

 Intuitive - its assumptions are easily understood


 The character-based method is able to provide evolutionary
information about the sequence characters, such as information
regarding homoplasy and ancestral states.
 It tends to produce more accurate trees than the distance-based
methods when sequence divergence is low because this is the
circumstance when the parsimony assumption of rarity in
evolutionary changes holds true.
 When sequence divergence is high, tree estimation by MP can be
less effective, because the original parsimony assumption no
longer holds.
 Estimation of branch lengths may also be erroneous because MP
does not employ substitution models to correct for multiple
substitutions.
Character based methods
Maximum likelihood – ML
 Uses probabilistic models to choose a best tree that has the
highest probability (likelihood) of reproducing the observed data.
 ML is an exhaustive method that searches every possible tree
topology and considers every position in an alignment, not just
informative sites.
 By employing a particular substitution model that has probability
values of residue substitutions, ML calculates the total likelihood
of ancestral sequences evolving to internal nodes and eventually
to existing sequences.
 It sometimes also incorporates parameters that account for rate
variations across sites.
 The tree with the highest ML value is considered to be the most
preferred.
COMPUTATIONALLY HEAVY, SLOW, GOOD RESULTS
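The likelihood principle can be illustrated on the simplest possible case: two aligned sequences and a single parameter, the distance between them. This sketch assumes the Jukes-Cantor model and hypothetical counts (100 sites, 25 differences), and it searches a grid of distances rather than tree topologies. A real ML program applies the same idea to whole trees:

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d for two sequences under the Jukes-Cantor model."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay    # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * decay    # probability of one specific different base
    # 0.25 is the equilibrium frequency of the base in the first sequence.
    return (n_sites - n_diff) * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def ml_distance(n_sites, n_diff, step=0.001, upper=3.0):
    """Grid search for the distance with the highest likelihood."""
    grid = [step * k for k in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


best = ml_distance(n_sites=100, n_diff=25)
analytic = -0.75 * math.log(1.0 - 4.0 * 0.25 / 3.0)   # Jukes-Cantor corrected distance
print(best, round(analytic, 4))
```

The distance that maximizes the likelihood agrees with the Jukes-Cantor corrected distance to within the grid spacing.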
Character based methods
ML – pros and cons
 Based on well-founded statistics instead of a medieval
philosophy.
 More robust, uses the full sequence information, not just
informative sites.
 Employs substitution model – strength, but also weakness
(choosing wrong model leads to incorrect tree).
 Accurately reconstructs the relationships between
sequences that have been separated for a long time.
 Very time consuming, considerably more than MP which
is itself more time consuming than clustering methods.
 Distance methods only give one tree, while parsimony analyses many
trees and may suggest multiple, equally likely trees, none of which is
necessarily the right one.
 The maximum likelihood method does not suffer from these limitations. It
uses the entire sequence information, it analyses many trees and proposes
the tree with the highest likelihood.
 The method, however, is not suitable for large datasets due to its CPU-intensive nature.

 When you have a powerful computer and a not too large dataset the
maximum likelihood method is the preferred method both for DNA as
well as for protein data.
5. ASSESS THE TREE RELIABILITY

After the tree has been constructed, it is important to statistically evaluate the tree reliability.

HOW RELIABLE IS THE TREE?
BOOTSTRAPPING or JACKKNIFING: repeatedly resample the data from the original data sets.

WHETHER IT IS BETTER THAN OTHER POSSIBLE TREES?
CONVENTIONAL STATISTICAL TESTS
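The resampling step of bootstrapping can be sketched as follows: alignment columns are drawn with replacement to build pseudo-alignments of the original length. The alignment is the one from the parsimony worked example; the number of replicates and the random seed are arbitrary choices:

```python
import random

ALIGNMENT = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are applied to every sequence, so columns stay intact.
    return ["".join(seq[c] for c in columns) for seq in alignment]


rng = random.Random(0)   # fixed seed so that the run is repeatable
replicates = [bootstrap_alignment(ALIGNMENT, rng) for _ in range(100)]
print(replicates[0])
```

A tree is then built from each replicate with the same method as the original tree, and the support for a branch is the percentage of replicate trees that contain it. Jackknifing differs in that it samples a subset of the columns without replacement.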
Phylogeny packages
Download

https://www.megasoftware.net/
Alignment
Phylogeny
Bootstrap
