Phylogenetic Analysis
FEW EXAMPLES
[Figure: rooted tree of taxa A to E. Terminal nodes represent the taxa (genes, populations, species, etc.) used to infer the phylogeny. Branches are lineages. Internal nodes are divergence points and represent hypothetical ancestors of the taxa. The ancestral node is the root of the tree.]
Monophyletic (clade) – a group of taxa derived from a single
ancestral species, including all of its descendants (e.g., sister taxa).
Polyphyletic – a group whose members are derived from
two or more ancestors.
Paraphyletic – a taxon that excludes some members that share
a common ancestor with the members included in the taxon.
Tree Topology: branching pattern
Each unrooted tree theoretically can be rooted anywhere
along any of its branches
# Taxa   # Unrooted trees   x   # Roots   =   # Rooted trees
3        1                      3             3
4        3                      5             15
5        15                     7             105
6        105                    9             945
7        945                    11            10,395
8        10,395                 13            135,135
9        135,135                15            2,027,025
...      ...                    ...           ...
30       ~8.69 x 10^36          57            ~4.95 x 10^38
[Figure: unrooted trees for 3, 4, and 5 taxa (A to E), and an unrooted tree shown beside the rooted tree obtained by placing a root on one of its branches.]
[Figure: unrooted tree 1 for taxa A, B, C, and D, with its branches numbered 1 to 5, and the five rooted trees (1a to 1e) obtained by placing the root on each of those branches in turn.]
These trees show five different evolutionary relationships among the taxa!
There are two major ways to root trees:
By outgroup:
Uses taxa (the “outgroup”) that are
known to fall outside the group of
interest (the “ingroup”). Requires some
prior knowledge about the relationships
among the taxa. The outgroup can be either
other species (e.g., birds to root a
mammalian tree) or earlier gene
duplicates (e.g., α-globins
to root β-globins).
By midpoint or distance:
Roots the tree at the midway point
between the two most distant taxa in the
tree, as determined by branch lengths.
Assumes that the taxa are evolving in a
clock-like manner. This assumption is
built into some of the distance-based tree
building methods.
Example: d(A,D) = 10 + 3 + 5 = 18; midpoint = 18 / 2 = 9
[Figure: unrooted tree of taxa A, B, C, and D with branch lengths 10, 3, 2, 2, and 5.]
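The midpoint arithmetic from the example, d(A,D) = 18 and midpoint = 9, can be turned into a small calculation that also reports which branch the root lands on. This is only a sketch: the function name `midpoint_on_path` is hypothetical, it assumes the two most distant taxa have already been identified, and it assumes the branches on the path from A to D are listed in order (10, then 3, then 5).

```python
def midpoint_on_path(branch_lengths):
    """Locate the midpoint of a path given its branch lengths in order.

    Returns (branch_index, offset): the 0-based index of the branch that
    contains the midpoint and the distance from the start of that branch.
    """
    half = sum(branch_lengths) / 2
    walked = 0.0
    for index, length in enumerate(branch_lengths):
        if walked + length >= half:
            return index, half - walked
        walked += length
    raise ValueError("path must contain at least one branch of positive length")


# Path from A to D in the example: 10 + 3 + 5 = 18 (order assumed).
branch_index, offset = midpoint_on_path([10, 3, 5])
print(branch_index, offset)
```

With these lengths the root falls on the first branch of the path, 9 units from A, that is, 1 unit before the first internal node.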
Gene phylogeny vs. species phylogeny
Main objective of building phylogenetic trees based on molecular
sequences: reconstruct the evolutionary history of the species involved.
A gene phylogeny only describes the evolution of that particular gene
or encoded protein. This sequence may evolve more or less rapidly
than other genes in the genome.
The evolution of a particular sequence does not necessarily correlate
with the evolutionary path of the species.
Branching point in a species tree – the speciation event
Branching point in a gene tree – which event?
The two events may or may not coincide.
To obtain a species phylogeny, phylogenetic trees from a variety of
gene families need to be constructed to give an overall assessment of
the species evolution.
Forms of tree representation
Tree building methods: two major categories, by data type.
Distance-based: Fitch-Margoliash (FM), neighbor-joining
Character-based: maximum parsimony, maximum likelihood
[Figure: neighbor-joining. Panels: equations used; original distance matrix; corrected distance matrix; star tree of taxa A, B, C, and D being resolved into joined pairs, with branch lengths 0.15 and 0.25 shown.]
Optimality based methods
Clustering methods produce a single tree.
There is no criterion for judging how this tree compares
with alternative trees.
Optimality based methods use a well-defined algorithm
to compare all possible tree topologies and select the tree
that best fits the actual evolutionary distance matrix.
FITCH-MARGOLIASH (FM)
Minimal deviation between the distances calculated along all branches
and the distances in the original dataset. Similar to ME.
MINIMUM EVOLUTION (ME)
Tree with the minimal overall branch length.
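The two optimality criteria can be written as scoring functions. The sketch below is illustrative only: the names `fm_score` and `tree_length` are hypothetical, the 1/d² weighting is one common form of the Fitch-Margoliash criterion (an assumption, not taken from the slides), and the distances are made-up values rather than real data.

```python
def fm_score(observed, on_tree):
    """Weighted squared deviation between observed pairwise distances and
    the distances measured along the branches of a candidate tree.
    Each term is weighted by 1 / observed**2 (assumed weighting)."""
    return sum(
        ((observed[pair] - on_tree[pair]) / observed[pair]) ** 2
        for pair in observed
    )


def tree_length(branch_lengths):
    """Minimum-evolution criterion: the sum of all branch lengths."""
    return sum(branch_lengths)


# Hypothetical distances for three taxa, for illustration only.
observed = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}
candidate = {("A", "B"): 1.0, ("A", "C"): 4.0, ("B", "C"): 4.0}
print(fm_score(observed, candidate))  # lower is a better fit
print(tree_length([1, 2, 3]))         # lower is preferred under ME
```

Under either criterion, every candidate topology is scored and the tree with the lowest score is kept.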
Character based methods
Maximum parsimony
A worked example
Site   1 2 3 4 5 6 7 8 9
Seq 1  A A G A G T G C A
Seq 2  A G C C G T G C G
Seq 3  A G A T A T C C A
Seq 4  A G A G A T C C G
To save computing time, only the small number of sites that carry the richest phylogenetic
information are used in tree determination.
Informative site – a site that has at least two different kinds of characters, each
occurring at least twice
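The definition of an informative site can be checked mechanically against the four-sequence alignment. The following is a minimal sketch; the function name `informative_sites` is hypothetical, and sites are numbered from 1 to match the alignment header.

```python
from collections import Counter


def informative_sites(alignment):
    """Return the 1-based positions of parsimony-informative sites:
    columns with at least two different characters that each occur
    at least twice."""
    sites = []
    for position, column in enumerate(zip(*alignment), start=1):
        counts = Counter(column)
        repeated = [char for char, n in counts.items() if n >= 2]
        if len(repeated) >= 2:
            sites.append(position)
    return sites


alignment = [
    "AAGAGTGCA",  # sequence 1
    "AGCCGTGCG",  # sequence 2
    "AGATATCCA",  # sequence 3
    "AGAGATCCG",  # sequence 4
]
print(informative_sites(alignment))  # [5, 7, 9]
```

Sites 5, 7, and 9 carry the column patterns GGAA, GGCC, and AGAG, which are the three informative sites used in the worked example.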
How many possible unrooted trees?
N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}
Informative sites only:
Site   1 2 3
Seq 1  G G A
Seq 2  G G G
Seq 3  A C A
Seq 4  A C G
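The formula for N_U is easy to evaluate directly. The sketch below uses hypothetical function names and also multiplies by the number of branches available for rooting, 2n - 3, to obtain the number of rooted trees.

```python
from math import factorial


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n taxa (n >= 3):
    N_U = (2n - 5)! / (2**(n - 3) * (n - 3)!)."""
    if n < 3:
        raise ValueError("the formula applies to n >= 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def rooted_trees(n):
    """Each unrooted tree has 2n - 3 branches, and each branch can hold the root."""
    return unrooted_trees(n) * (2 * n - 3)


for n in (4, 8, 9):
    print(n, unrooted_trees(n), rooted_trees(n))
```

For the four sequences in the worked example, n = 4 gives three unrooted trees, which is why exactly three topologies have to be compared.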
The three possible unrooted trees for sequences 1 to 4:
Tree I: ((1,2),(3,4))
Tree II: ((1,3),(2,4))
Tree III: ((1,4),(2,3))
Informative site I: GGAA
Informative site II: GGCC
Informative site III: AGAG
[Figures: each informative site mapped onto Trees I, II, and III, with the inferred character states at the internal nodes.]
Site          Tree I   Tree II   Tree III
GGAA          1        2         2
GGCC          1        2         2
AGAG          2        1         2
Tree length   4        5         6
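[Figure: Tree I, the shortest tree (length 4), with sequences GGA and GGG on one side and ACA and ACG on the other: 2 changes on the internal branch and 1 change on each of two terminal branches.]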
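The per-site change counts and tree lengths for Trees I, II, and III can be reproduced with a small-parsimony count. The sketch below uses Fitch's set-based algorithm, which is one standard way to count the minimum number of changes and is not necessarily the procedure used to draw the slides. Each unrooted tree is written as a nested tuple rooted on its internal branch, which does not affect the count; the function names are hypothetical.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for one site on a binary tree.

    `tree` is a leaf label or a pair (left, right); `states` maps each
    leaf label to its character at this site.
    """
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_changes + right_changes
    # No shared state: one substitution is needed at this node.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, site_patterns):
    """Sum the minimum number of changes over all informative sites."""
    total = 0
    for pattern in site_patterns:
        states = {leaf: char for leaf, char in enumerate(pattern, start=1)}
        total += fitch(tree, states)[1]
    return total


trees = {
    "I": ((1, 2), (3, 4)),
    "II": ((1, 3), (2, 4)),
    "III": ((1, 4), (2, 3)),
}
sites = ["GGAA", "GGCC", "AGAG"]
lengths = {name: tree_length(tree, sites) for name, tree in trees.items()}
print(lengths)  # {'I': 4, 'II': 5, 'III': 6}
```

Tree I has the smallest length and is therefore the tree that maximum parsimony selects.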
Character based methods
Weighted parsimony
The parsimony method discussed so far is unweighted
because it treats all mutations as equivalent.
When you have a powerful computer and a dataset that is not too large,
maximum likelihood is the preferred method for both DNA and
protein data.
5. ASSESS THE TREE RELIABILITY
After the tree has been constructed, it is important to statistically evaluate
its reliability.
BOOTSTRAPPING
JACKKNIFING
or
CONVENTIONAL STATISTICAL TESTS
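Bootstrapping resamples the alignment columns with replacement, rebuilds a tree from each pseudo-alignment, and counts how often each grouping reappears. The sketch below covers only the resampling step; the function name `bootstrap_replicate` is hypothetical, the seed is arbitrary, and tree building for each replicate is left to whichever method is in use.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Build one pseudo-alignment by sampling columns with replacement.
    The replicate has the same number of sequences and sites as the input."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


alignment = [
    "AAGAGTGCA",
    "AGCCGTGCG",
    "AGATATCCA",
    "AGAGATCCG",
]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Jackknifing follows the same pattern, except that each replicate drops a fraction of the sites instead of sampling them with replacement.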
https://www.megasoftware.net/
[Figure: MEGA workflow: Alignment, Phylogeny, Bootstrap]