Agenda for today
• Phylogenic tree construction
– Introduction to phylogenic trees
∗ Ultrametricality
∗ Additive distance
– Distance-based approaches
– Parsimony
• Phylogenic tree building and multi-sequence alignment
1
Phylogenic trees
• Rooted trees, tracing evolutionary divergence
• Without loss of generality, assume binary branching
• Strings at the leaves of trees
• Internal nodes labeled or unlabeled
– Labels may hypothesize ancestor strings
• Nodes or links in tree may have scores
– Scores may represent distance or time
2
Clade of Apes
3
Ultrametric trees
• Ultrametric trees have real numbers at the internal nodes
• Numbers at the nodes must strictly decrease along every path from the root to a leaf
• For our purposes, strings at the leaves of the tree
• Pairwise scores between leaves:
number associated with the lowest-common-ancestor
• Defines an ultrametric (symmetric) matrix of pairwise distances
– Diagonals zero; off-diagonal positive
• Can construct a unique ultrametric tree efficiently from a given
ultrametric matrix (see Gusfield)
4
Example ultrametric matrix/tree
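    A  B  C  D  E
A   0  9  9  6  4
B      0  4  9  9
C         0  9  9
D            0  6
E               0
Tree, in Newick-style notation with the number at each internal node written after its closing parenthesis:
(((A,E)4,D)6,(B,C)4)9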
• Variant: min-ultrametric trees are strictly increasing down the tree
– (just a sign change, same procedures apply)
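The example can be checked mechanically. The sketch below encodes the example tree as nested tuples (a hypothetical representation chosen only for illustration), reads each pairwise score off the lowest common ancestor, and tests the standard three-point characterization of an ultrametric (in every triple of leaves, the two largest distances are equal). The function names are assumptions, not part of the lecture.

```python
from itertools import combinations

# Example tree as nested tuples (number_at_node, left, right); leaves are strings.
TREE = (9, (6, (4, "A", "E"), "D"), (4, "B", "C"))

def leaves(node):
    """Return the leaf names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def lca_distances(node, dist=None):
    """Pairwise score of two leaves = number at their lowest common ancestor."""
    if dist is None:
        dist = {}
    if isinstance(node, str):
        return dist
    number, left, right = node
    for x in leaves(left):
        for y in leaves(right):
            dist[frozenset((x, y))] = number
    lca_distances(left, dist)
    lca_distances(right, dist)
    return dist

def is_ultrametric(dist, names):
    """Three-point condition: in every triple the two largest distances tie."""
    for x, y, z in combinations(names, 3):
        triple = sorted(dist[frozenset(p)] for p in ((x, y), (x, z), (y, z)))
        if triple[1] != triple[2]:
            return False
    return True

DIST = lca_distances(TREE)
NAMES = sorted(leaves(TREE))
print(is_ultrametric(DIST, NAMES))
```

The distances recovered this way should agree with the matrix entries of the example, e.g. the pair A, E meets at the node labeled 4.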
5
Molecular clock theory
• Ultrametric trees generally involve time since divergence
• The Molecular clock theory of Zuckerkandl and Pauling states
– For any given protein, accepted mutations in the amino acid
sequence occur at a constant rate (from Gusfield)
– Accepted means not impacting function
– Hence number of changes is proportional to time
– (rate varies depending on protein)
• Evidence of mutation can be based on sequence edits
• Most real data is not ultrametric – assumptions too strong
6
Additive-distance trees
• Relaxes some assumptions on the constancy of the rate
• Scores are labeled on links rather than internal nodes of the tree
• Distance between two leaves (strings) is the sum of scores on links
between the leaves
• We can move away from straight phylogenic trees and allow strings
to label internal nodes
– “Compact” additive-distance trees introduce no additional nodes
beyond leaves
7
Example additive-distance tree
    A  B  C  D  E
A   0  9  9  6  4
B      0  4  9  9
C         0  9  9
D            0  6
E               0
Tree, in Newick-style notation with edge lengths:
(((A:2,E:2):1,D:3):2,(B:2,C:2):2)
• Ultrametric matrices can be represented with additive-distance trees
• $O(n^2)$ algorithms for building additive-distance trees from $n \times n$
matrices
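As a check on the example, leaf-to-leaf distances can be recomputed by summing link scores along the connecting path. This is a minimal sketch, assuming the tree is stored as lists of (child, edge_length) pairs; that representation and the function names are illustrative choices, not the lecture's.

```python
from itertools import combinations

# Each internal node is a list of (child, edge_length) pairs; leaves are strings.
AE = [("A", 2), ("E", 2)]
AED = [(AE, 1), ("D", 3)]
BC = [("B", 2), ("C", 2)]
ROOT = [(AED, 2), (BC, 2)]

def leaf_depths(node, offset=0):
    """Map each leaf below `node` to its path length from `node`, plus `offset`."""
    if isinstance(node, str):
        return {node: offset}
    depths = {}
    for child, length in node:
        depths.update(leaf_depths(child, offset + length))
    return depths

def leaf_distances(node, dist=None):
    """Distance between two leaves = sum of edge lengths on the path between them."""
    if dist is None:
        dist = {}
    if isinstance(node, str):
        return dist
    per_child = [leaf_depths(child, length) for child, length in node]
    # Leaves in different child subtrees are joined through this node.
    for left, right in combinations(per_child, 2):
        for x, dx in left.items():
            for y, dy in right.items():
                dist[frozenset((x, y))] = dx + dy
    for child, _ in node:
        leaf_distances(child, dist)
    return dist

DIST = leaf_distances(ROOT)
```

For instance, the path from A to D uses the links of length 2, 1, and 3, giving 6 as in the matrix.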
8
Building phylogenic trees
• Given a set of sequences, how can we build a phylogenic tree from
those sequences?
• Two main approaches
– Distance-based approaches (minimize distance)
– Parsimony (fewest required changes)
• Distance-based approaches are well suited to ultrametric and
additive distance trees
– Many problems are not so well behaved
• Parsimony does not make such assumptions
9
Simple distance-based tree building
• Standard approach (in use since the late 1950s) involves agglomerative
pairwise distance-based clustering
• Unweighted pair group method using arithmetic averages,
aka “UPGMA”
• Builds binary tree by iteratively merging two closest clusters
• Initialize with each string as a cluster unto itself
• Cluster distances are based on pairwise distances
$$d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$
10
Efficient re-calculation of pairwise distances
• When two clusters Ci, Cj are merged into Ck , must calculate
pairwise distances to other remaining clusters
• Efficient re-calculation possible:
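$$
\begin{aligned}
d(C_k, C_l) &= \frac{1}{|C_k||C_l|} \sum_{x \in C_k} \sum_{y \in C_l} d(x, y) \\
&= \frac{1}{(|C_i| + |C_j|)|C_l|} \left( \sum_{x \in C_i} \sum_{y \in C_l} d(x, y) + \sum_{x \in C_j} \sum_{y \in C_l} d(x, y) \right) \\
&= \frac{|C_l||C_i|\, d(C_i, C_l) + |C_l||C_j|\, d(C_j, C_l)}{(|C_i| + |C_j|)|C_l|} \\
&= \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}
\end{aligned}
$$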
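The identity can be sanity-checked numerically. The sketch below compares the direct average over all cross-cluster pairs with the size-weighted update, using a few leaf distances chosen for illustration; the function names are assumptions.

```python
def average_distance(ci, cj, d):
    """Direct cluster distance: mean of all cross-cluster pairwise distances."""
    return sum(d[frozenset((x, y))] for x in ci for y in cj) / (len(ci) * len(cj))

def merged_distance(size_i, size_j, d_il, d_jl):
    """Size-weighted update for C_k = C_i + C_j against another cluster C_l."""
    return (size_i * d_il + size_j * d_jl) / (size_i + size_j)

# Illustrative leaf distances between {B, C, D} and {A, E}.
D = {frozenset(p): v for p, v in {
    ("B", "A"): 8, ("B", "E"): 6,
    ("C", "A"): 8, ("C", "E"): 6,
    ("D", "A"): 5, ("D", "E"): 3,
}.items()}

CI, CJ, CL = ["B", "C"], ["D"], ["A", "E"]
DIRECT = average_distance(CI + CJ, CL, D)
UPDATED = merged_distance(len(CI), len(CJ),
                          average_distance(CI, CL, D),
                          average_distance(CJ, CL, D))
```

Both routes give the same value, but the update needs only the two stored cluster distances and the cluster sizes rather than all leaf pairs.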
11
Clustering by distance
• If the distance matrix is ultrametric, UPGMA is the
right approach
• If, however, the distance matrix is not ultrametric,
but additive, then there are better clustering methods
than UPGMA
• Why not join the closest as UPGMA suggests?
– Because the two closest strings need not share a parent node in the tree
– Additivity is a very different requirement from minimum
distance
12
UPGMA clustering of additive-distance matrix
    A  B  C  D  E
A   0  8  8  5  4
B      0  2  5  6
C         0  5  6
D            0  3
E               0
Tree after merging the closest pair: (B,C)
• Find the lowest score in the matrix
• Merge columns/rows
• Update distances
$$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$
13
UPGMA clustering of additive-distance matrix
     A  BC  D  E
A    0   8  5  4
BC       0  5  6
D           0  3
E              0
Tree after merging the closest pair: (B,C) and (E,D)
• Find the lowest score in the matrix
• Merge columns/rows
• Update distances
$$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$
14
UPGMA clustering of additive-distance matrix
     A  BC   DE
A    0   8  4.5
BC       0  5.5
DE            0
Tree after merging the closest pair: (A,(E,D)) and (B,C)
• Find the lowest score in the matrix
• Merge columns/rows
• Update distances
$$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$
15
Additive matrix
    A  B  C  D  E
A   0  8  8  5  4
B      0  2  5  6
C         0  5  6
D            0  3
E               0
Truth (Newick-style notation with edge lengths):
(((A:3,E:1):1,D:1):2,(B:1,C:1):1)
What UPGMA found:
((A,(E,D)),(B,C))
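The UPGMA run on this matrix can be reproduced with a short script. This is a minimal sketch under stated assumptions: clusters are represented as nested tuples, ties (none occur here) go to the first pair found, and the function name `upgma` is illustrative.

```python
from itertools import combinations

def upgma(names, pairwise):
    """Run UPGMA; return the merges as (new_cluster, distance) in order."""
    sizes = {name: 1 for name in names}            # cluster -> number of leaves
    dist = {frozenset(p): v for p, v in pairwise.items()}
    merges = []
    while len(sizes) > 1:
        # Closest pair of current clusters (first one wins on ties).
        i, j = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        k = (i, j)                                  # merged cluster as a nested tuple
        merges.append((k, dist[frozenset((i, j))]))
        ni, nj = sizes.pop(i), sizes.pop(j)
        for l in sizes:                             # size-weighted distance update
            dist[frozenset((k, l))] = (
                ni * dist[frozenset((i, l))] + nj * dist[frozenset((j, l))]
            ) / (ni + nj)
        sizes[k] = ni + nj
    return merges

PAIRWISE = {
    ("A", "B"): 8, ("A", "C"): 8, ("A", "D"): 5, ("A", "E"): 4,
    ("B", "C"): 2, ("B", "D"): 5, ("B", "E"): 6,
    ("C", "D"): 5, ("C", "E"): 6, ("D", "E"): 3,
}
MERGES = upgma("ABCDE", PAIRWISE)
for cluster, distance in MERGES:
    print(cluster, distance)
```

The second merge joins D with E because 3 is the smallest remaining entry, whereas in the true tree E's sibling is A; this is the mismatch the slide illustrates.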
16
Neighbor joining in additive distance
• Two key ideas in modifying clustering for additive trees
• “Normalize” distance by average distances over set of leaves $L$
$$D(x, y) = d(x, y) - \frac{\sum_{z \in L} \left( d(x, z) + d(y, z) \right)}{|L| - 2}$$
• Change distance recalculation after merging clusters
• If $C_k = C_i \cup C_j$ then for all nodes $m$
$$d(k, m) = \frac{1}{2}\left( d(i, m) + d(j, m) - d(i, j) \right)$$
17
Additive neighbor joining
Distances d:
    A  B  C  D  E
A   0  8  8  5  4
B      0  2  5  6
C         0  5  6
D            0  3
E               0
→ Normalized distances D (|L| = 5):
       A       B       C       D       E
A      0   -7.33   -7.33   -9.33  -10.67
B              0  -12      -8      -7.33
C                      0   -8      -7.33
D                              0   -9.33
E                                      0
• Find the lowest score in the matrix
• Merge columns/rows
• Update distances
$$d(k, m) = \frac{1}{2}\left( d(i, m) + d(j, m) - d(i, j) \right)$$
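The first neighbor-joining step can be sketched as follows: compute the normalized distances from the raw matrix, pick the lowest entry, and recompute distances to the merged node. The function names `normalized` and `merge` and the nested-tuple name for the merged node are assumptions made for illustration.

```python
from itertools import combinations

def normalized(names, d):
    """D(x, y) = d(x, y) - (sum over z of d(x, z) + d(y, z)) / (|L| - 2)."""
    row = {x: sum(d[frozenset((x, z))] for z in names if z != x) for x in names}
    return {
        frozenset((x, y)): d[frozenset((x, y))] - (row[x] + row[y]) / (len(names) - 2)
        for x, y in combinations(names, 2)
    }

def merge(names, d, i, j):
    """Replace i and j by k = (i, j) using d(k, m) = (d(i, m) + d(j, m) - d(i, j)) / 2."""
    k = (i, j)
    rest = [m for m in names if m != i and m != j]
    new_d = {frozenset((x, y)): d[frozenset((x, y))] for x, y in combinations(rest, 2)}
    for m in rest:
        new_d[frozenset((k, m))] = 0.5 * (
            d[frozenset((i, m))] + d[frozenset((j, m))] - d[frozenset((i, j))]
        )
    return rest + [k], new_d

NAMES = list("ABCDE")
D_RAW = {frozenset(p): v for p, v in {
    ("A", "B"): 8, ("A", "C"): 8, ("A", "D"): 5, ("A", "E"): 4,
    ("B", "C"): 2, ("B", "D"): 5, ("B", "E"): 6,
    ("C", "D"): 5, ("C", "E"): 6, ("D", "E"): 3,
}.items()}

D_NORM = normalized(NAMES, D_RAW)
I, J = sorted(min(D_NORM, key=D_NORM.get))   # lowest normalized score
NAMES2, D2 = merge(NAMES, D_RAW, I, J)
```

Here the lowest normalized score belongs to the pair B, C, so those two leaves are merged first and the remaining distances are recomputed from the raw (not normalized) matrix.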
18
Additive neighbor joining
Distances d:
     A  BC  D  E
A    0   7  5  4
BC       0  4  5
D           0  3
E              0
→ Normalized distances D (|L| = 4):
      A   BC    D    E
A     0   -9   -9  -10
BC         0  -10   -9
D               0   -9
E                    0
• Two possible merges
• Additive-distance tree (unlike ultrametric) not necessarily unique
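The tie can be found programmatically. The sketch below recomputes the normalized scores for the four remaining nodes and lists every pair attaining the minimum; the helper name `tied_best_pairs` and the use of `"BC"` as a plain label for the merged node are illustrative assumptions.

```python
import math
from itertools import combinations

NAMES = ["A", "BC", "D", "E"]
DIST = {frozenset(p): v for p, v in {
    ("A", "BC"): 7, ("A", "D"): 5, ("A", "E"): 4,
    ("BC", "D"): 4, ("BC", "E"): 5, ("D", "E"): 3,
}.items()}

def tied_best_pairs(names, d):
    """Return all pairs whose normalized distance equals the minimum."""
    row = {x: sum(d[frozenset((x, z))] for z in names if z != x) for x in names}
    score = {
        (x, y): d[frozenset((x, y))] - (row[x] + row[y]) / (len(names) - 2)
        for x, y in combinations(names, 2)
    }
    low = min(score.values())
    return [pair for pair, value in score.items() if math.isclose(value, low)]

TIES = tied_best_pairs(NAMES, DIST)
print(TIES)
```

Either tied pair is a valid next merge; which one is taken determines how the resulting tree is drawn.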
19
Multiple trees (basically unrooted)
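Two rootings of the same additive tree, in Newick-style notation with edge lengths:
(((A:3,E:1):1,D:1):2,(B:1,C:1):1)
(A:2,(E:1,(D:1,(B:1,C:1):3):1):1)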
20
Internal nodes
• Just finding the tree-topology is not necessarily the end product
• What about labels on the internal nodes?
• A phylogenic tree is hypothesizing a point of divergence
– There was an ancestor string at that point
– Can we hypothesize the string in addition to the point of divergence?
• One method is “maximum parsimony” or just “parsimony”
• Sort of an Occam’s razor approach: hypothesize as few mutations
as necessary
21
Parsimony
• Parsimony methods are generally presented for a given tree
– Multiple trees can be compared, but searching over all possible
trees is generally intractable
• For the current discussion, assume that we are given a tree
– Perhaps derived via iterative pairwise alignment
• Since we have a tree, assume that we have a multi-sequence alignment
consistent with that tree
• Given the tree and multi-sequence alignment of leaves, parsimony
looks for the minimum substitutions/mutations over the tree
22
Phylogenic Alignment problem
• Given a phylogenic tree T , the phylogenic alignment problem is:
– Label the internal nodes of T such that the overall distance of
the alignment is minimized
– Overall distance is the sum of all parent/child distances
– Usually different “sites” in the string are modeled independently
• Minimum mutation problem (Fitch-Hartigan)
– Phylogenic Alignment problem when given multi-sequence
alignment of the leaves
– Efficient dynamic programming when given tree T
23
Continue with additive tree example
    A  B  C  D  E
A   0  8  8  5  4
B      0  2  5  6
C         0  5  6
D            0  3
E               0
Tree, in Newick-style notation with edge lengths:
(((A:3,E:1):1,D:1):2,(B:1,C:1):1)
A: CATG-AAG D: G-AG-ATT
B: G-CATCCT E: C--G-AGT
C: G-GATGCT
24
Minimum mutation dynamic programming
D: G−AG−ATT   B: G−CATCCT   C: G−GATGCT
A: CATG−AAG   E: C−−G−AGT
25
Minimum mutation dynamic programming
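Cost table for the parent of A and E (rows: symbol at the node; columns: alignment positions 1 to 8):
C: 02222222
T: 22122221
A: 21222012
G: 22202211
−: 21120222
D: G−AG−ATT   B: G−CATCCT   C: G−GATGCT
A: CATG−AAG   E: C−−G−AGT
26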
Minimum mutation dynamic programming
Cost table for the parent of B and C (rows: symbol at the node; columns: alignment positions 1 to 8):
C: 22122102
T: 22220220
A: 22202222
G: 02122122
−: 20222222
D: G−AG−ATT   B: G−CATCCT   C: G−GATGCT
A: CATG−AAG   E: C−−G−AGT
27
Minimum mutation dynamic programming
Cost table for the parent of (A,E) and D (rows: symbol at the node; columns: alignment positions 1 to 8):
C: 13322233
T: 23222221
A: 22222023
G: 13302222
−: 21220233
D: G−AG−ATT   B: G−CATCCT   C: G−GATGCT
A: CATG−AAG   E: C−−G−AGT
28
Minimum mutation dynamic programming
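Cost tables (rows: symbol at the node; columns: alignment positions 1 to 8)
Root:
C: 23422233
T: 33421331
A: 33412233
G: 13412233
−: 31421343
Parent of (A,E) and D:
C: 13322233
T: 23222221
A: 22222023
G: 13302222
−: 21220233
Parent of B and C:
C: 22122102
T: 22220220
A: 22202222
G: 02122122
−: 20222222
Parent of A and E:
C: 02222222
T: 22122221
A: 21222012
G: 22202211
−: 21120222
D: G−AG−ATT   B: G−CATCCT   C: G−GATGCT
A: CATG−AAG   E: C−−G−AGT
29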
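The tables can be recomputed bottom-up. The sketch below assumes what the tables themselves suggest: each alignment column is handled independently, every substitution (including to or from a gap) costs 1, and a node's cost for symbol s is the sum over its two children of the cheapest child symbol plus the substitution cost. The function names are illustrative, and ASCII `-` stands in for the gap symbol.

```python
ALPHABET = "CTAG-"
INF = float("inf")

def leaf_table(seq):
    """Per column: cost 0 for the observed symbol, infinity for any other."""
    return [{s: (0 if s == c else INF) for s in ALPHABET} for c in seq]

def combine(left, right):
    """Cost table of a parent from its two children's tables (unit substitution cost)."""
    def best(child_col, s):
        # Cheapest way to hang this child below a parent labeled s.
        return min(child_col[t] + (s != t) for t in ALPHABET)
    return [
        {s: best(l, s) + best(r, s) for s in ALPHABET}
        for l, r in zip(left, right)
    ]

def row(table, s):
    """Render one table row as a digit string, as on the slides."""
    return "".join(str(col[s]) for col in table)

SEQ = {"A": "CATG-AAG", "B": "G-CATCCT", "C": "G-GATGCT",
       "D": "G-AG-ATT", "E": "C--G-AGT"}

AE = combine(leaf_table(SEQ["A"]), leaf_table(SEQ["E"]))
AED = combine(AE, leaf_table(SEQ["D"]))
BC = combine(leaf_table(SEQ["B"]), leaf_table(SEQ["C"]))
ROOT = combine(AED, BC)

# Minimum number of mutations: sum over columns of the cheapest root symbol.
TOTAL = sum(min(col.values()) for col in ROOT)
print(TOTAL)
```

Each `combine` call costs a constant amount of work per column for a fixed alphabet, so the whole pass is linear in the number of tree nodes times the alignment length.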
Minimum mutation backtrace
Root: G−−ATCCT (in each column, a symbol with minimum cost in the root table)
Cost tables still to be traced: parent of (A,E) and D; parent of B and C; parent of A and E
D: G−AG−ATT   B: G−CATCCT   C: G−GATGCT
A: CATG−AAG   E: C−−G−AGT
30
Minimum mutation backtrace
Root: G−−ATCCT
Parent of (A,E) and D: G−−G−ACT
Cost tables still to be traced: parent of B and C; parent of A and E
D: G−AG−ATT   B: G−CATCCT   C: G−GATGCT
A: CATG−AAG   E: C−−G−AGT
31
Minimum mutation backtrace
Root: G−−ATCCT
Parent of (A,E) and D: G−−G−ACT
Parent of B and C: G−−ATCCT
Cost table still to be traced: parent of A and E
D: G−AG−ATT   B: G−CATCCT   C: G−GATGCT
A: CATG−AAG   E: C−−G−AGT
32
Minimum mutation backtrace
Root: G−−ATCCT
Parent of (A,E) and D: G−−G−ACT
Parent of B and C: G−−ATCCT
Parent of A and E: C−−G−AAT
D: G−AG−ATT   B: G−CATCCT   C: G−GATGCT
A: CATG−AAG   E: C−−G−AGT
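The overall distance of this labeling can be tallied directly as the sum of parent/child distances. The sketch below assumes unit cost per differing column; the node names `root`, `AED`, `BC`, and `AE` are hypothetical labels for the internal nodes, and ASCII `-` stands in for the gap symbol.

```python
def hamming(x, y):
    """Number of alignment columns in which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

LABELS = {
    "root": "G--ATCCT",
    "AED": "G--G-ACT",   # parent of (A,E) and D
    "BC": "G--ATCCT",    # parent of B and C
    "AE": "C--G-AAT",    # parent of A and E
    "A": "CATG-AAG", "B": "G-CATCCT", "C": "G-GATGCT",
    "D": "G-AG-ATT", "E": "C--G-AGT",
}

# Parent/child links of the tree.
EDGES = [("root", "AED"), ("root", "BC"), ("AED", "AE"), ("AED", "D"),
         ("BC", "B"), ("BC", "C"), ("AE", "A"), ("AE", "E")]

PER_EDGE = {(p, c): hamming(LABELS[p], LABELS[c]) for p, c in EDGES}
TOTAL = sum(PER_EDGE.values())
print(TOTAL)
```

Several columns have ties during the backtrace, so other labelings with the same total exist; the one shown is one optimal choice.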
33
Minimum mutation without tree
• For a given tree, we have seen that this problem has an efficient
dynamic programming solution
• No efficient algorithm for exploring all possible phylogenic trees
for a set of strings
• However, for a given order of strings at the leaves, a variant of the
CYK algorithm could be used
– Typically used with context-free grammars
– $O(k^3 n)$ for $k$ strings and an $n$-column multi-sequence alignment
– Could result in any binary tree shape over the ordered set
34
Iterative approaches to MSA and tree building
• Having trees helps with multi-sequence alignment
• Having multi-sequence alignment helps with trees
• Iterative approaches can be used. One example:
1. Build a multi-sequence alignment using iterative alignment
2. Build an initial phylogenic tree using neighbor joining
3. Perform minimum mutation phylogenic alignment
4. Build new multi-sequence alignment using full phylogenic tree
5. Throw away strings on internal nodes, preserving MSA
6. Go back to step 2, using the new multi-sequence alignment
35