Ultrametricity

The document provides an agenda and overview for discussing phylogenetic tree construction and multi-sequence alignment. It includes introductions to phylogenetic trees, ultrametricity, additive distance approaches, and parsimony. It also covers distance-based tree building using UPGMA clustering and neighbor joining approaches for additive distance matrices.

Uploaded by

Masooma Raza
Copyright
© All Rights Reserved

Agenda for today

• Phylogenic tree construction
– Introduction to phylogenic trees
∗ Ultrametricity
∗ Additive distance
– Distance-based approaches
– Parsimony
• Phylogenic tree building and multi-sequence alignment

1
Phylogenic trees

• Rooted trees, tracing evolutionary divergence
• Without loss of generality, assume binary branching
• Strings at the leaves of trees
• Internal nodes labeled or unlabeled
– Labels may hypothesize ancestor strings
• Nodes or links in tree may have scores
– Scores may represent distance or time

2
Clade of Apes

3
Ultrametric trees

• Ultrametric trees have real numbers at the internal nodes
• Numbers at the nodes must strictly decrease along every path from
the root down to a leaf
• For our purposes, strings at the leaves of the tree
• Pairwise score between two leaves:
the number associated with their lowest common ancestor
• Defines an ultrametric (symmetric) matrix of pairwise distances
– Diagonals zero; off-diagonal positive
• Can construct a unique ultrametric tree efficiently from a given
ultrametric matrix (see Gusfield)

4
Example ultrametric matrix/tree

     A   B   C   D   E
A    0   9   9   6   4
B        0   4   9   9
C            0   9   9
D                0   6
E                    0

Tree (numbers at the internal nodes, leaves A to E):

9
├─ 6
│  ├─ 4
│  │  ├─ A
│  │  └─ E
│  └─ D
└─ 4
   ├─ B
   └─ C

• Variant: min-ultrametric trees are strictly increasing down the tree
– (just a sign change, same procedures apply)
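
One way to test whether a matrix is ultrametric is the three-point condition, a standard characterization that is not stated on the slide: in every triple of leaves, the two largest pairwise distances must be equal. The sketch below applies it to the example matrix; the names `UPPER`, `dist`, and `is_ultrametric` are illustrative only.

```python
from itertools import combinations

# Example matrix from the slide, stored as its upper triangle.
LEAVES = ["A", "B", "C", "D", "E"]
UPPER = {
    ("A", "B"): 9, ("A", "C"): 9, ("A", "D"): 6, ("A", "E"): 4,
    ("B", "C"): 4, ("B", "D"): 9, ("B", "E"): 9,
    ("C", "D"): 9, ("C", "E"): 9,
    ("D", "E"): 6,
}


def dist(x, y):
    """Symmetric lookup with a zero diagonal."""
    if x == y:
        return 0
    return UPPER[(x, y)] if (x, y) in UPPER else UPPER[(y, x)]


def is_ultrametric(leaves, d):
    """Three-point condition: the two largest distances in every triple are equal."""
    for x, y, z in combinations(leaves, 3):
        _, middle, largest = sorted([d(x, y), d(x, z), d(y, z)])
        if middle != largest:
            return False
    return True


print(is_ultrametric(LEAVES, dist))  # True for the example matrix
```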

5
Molecular clock theory

• Ultrametric trees generally involve time since divergence
• The Molecular clock theory of Zuckerkandl and Pauling states
– For any given protein, accepted mutations in the amino acid
sequence occur at a constant rate (from Gusfield)
– Accepted means not impacting function
– Hence number of changes is proportional to time
– (rate varies depending on protein)
• Evidence of mutation can be based on sequence edits
• Most real data is not ultrametric – assumptions too strong

6
Additive-distance trees

• Relaxes some assumptions on the constancy of the rate
• Scores are labeled on links rather than internal nodes of the tree
• Distance between two leaves (strings) is the sum of scores on links
between the leaves
• We can move away from straight phylogenic trees and allow strings
to label internal nodes
– “Compact” additive-distance trees introduce no additional nodes
beyond leaves

7
Example additive-distance tree

     A   B   C   D   E
A    0   9   9   6   4
B        0   4   9   9
C            0   9   9
D                0   6
E                    0

Tree (link scores shown on the branches):

root
├─2─ •
│    ├─1─ •
│    │    ├─2─ A
│    │    └─2─ E
│    └─3─ D
└─2─ •
     ├─2─ B
     └─2─ C

• Ultrametric matrices can be represented with additive-distance trees
• $O(n^2)$ algorithms for building additive-distance trees from $n \times n$
matrices
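
As a check on the example, leaf-to-leaf distances can be recomputed by summing link scores along the path between two leaves. This is a minimal sketch; the internal node names `root`, `n1`, `n2`, `n3` are hypothetical labels for the unlabeled nodes of the example tree.

```python
# Links of the example tree as (parent, child, score).
EDGES = [
    ("root", "n1", 2), ("root", "n2", 2),
    ("n1", "n3", 1), ("n1", "D", 3),
    ("n3", "A", 2), ("n3", "E", 2),
    ("n2", "B", 2), ("n2", "C", 2),
]
PARENT = {child: (parent, score) for parent, child, score in EDGES}


def distances_to_ancestors(node):
    """Map the node and each of its ancestors to the summed link scores from the node."""
    dists = {node: 0}
    total = 0
    while node in PARENT:
        parent, score = PARENT[node]
        total += score
        dists[parent] = total
        node = parent
    return dists


def leaf_distance(x, y):
    """Sum of link scores on the path between x and y."""
    up_x, up_y = distances_to_ancestors(x), distances_to_ancestors(y)
    # With positive scores, the lowest common ancestor gives the smallest sum.
    return min(up_x[n] + up_y[n] for n in up_x if n in up_y)


print(leaf_distance("A", "D"))  # 6, matching the matrix entry for A and D
```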
8
Building phylogenic trees

• Given a set of sequences, how can we build a phylogenic tree from
those sequences?
• Two main approaches
– Distance-based approaches (minimize distance)
– Parsimony (fewest required changes)
• Distance-based approaches are well suited to ultrametric and
additive distance trees
– Many problems are not so well behaved
• Parsimony does not make such assumptions

9
Simple distance-based tree building

• Standard approach (around since late 50s) involves agglomerative
pairwise distance-based clustering
• Unweighted pair group method using arithmetic averages,
aka “UPGMA”
• Builds binary tree by iteratively merging two closest clusters
• Initialize with each string as a cluster unto itself
• Cluster distances are based on pairwise distances

$$d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$
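
The cluster distance is simply the mean of all cross-cluster pairwise distances. A small sketch follows, using the five-leaf matrix from the UPGMA walkthrough later in these slides as sample data; the function and variable names are illustrative.

```python
from itertools import product

# Pairwise distances (upper triangle) from the UPGMA example later in the deck.
PAIRS = {
    ("A", "B"): 8, ("A", "C"): 8, ("A", "D"): 5, ("A", "E"): 4,
    ("B", "C"): 2, ("B", "D"): 5, ("B", "E"): 6,
    ("C", "D"): 5, ("C", "E"): 6,
    ("D", "E"): 3,
}


def d(x, y):
    """Symmetric lookup with a zero diagonal."""
    if x == y:
        return 0.0
    return float(PAIRS[(x, y)] if (x, y) in PAIRS else PAIRS[(y, x)])


def cluster_distance(ci, cj):
    """Average of d(x, y) over all x in ci and y in cj."""
    total = sum(d(x, y) for x, y in product(ci, cj))
    return total / (len(ci) * len(cj))


print(cluster_distance({"B", "C"}, {"D", "E"}))  # 5.5
```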

10
Efficient re-calculation of pairwise distances

• When two clusters $C_i$, $C_j$ are merged into $C_k$, must calculate
pairwise distances to other remaining clusters
• Efficient re-calculation possible:

$$
\begin{aligned}
d(C_k, C_l) &= \frac{1}{|C_k|\,|C_l|} \sum_{x \in C_k} \sum_{y \in C_l} d(x, y) \\
&= \frac{1}{(|C_i| + |C_j|)\,|C_l|} \left( \sum_{x \in C_i} \sum_{y \in C_l} d(x, y) + \sum_{x \in C_j} \sum_{y \in C_l} d(x, y) \right) \\
&= \frac{|C_l|\,|C_i|\, d(C_i, C_l) + |C_l|\,|C_j|\, d(C_j, C_l)}{(|C_i| + |C_j|)\,|C_l|} \\
&= \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}
\end{aligned}
$$
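
The last line of the derivation needs only the two old cluster sizes and their distances to $C_l$, not the underlying pairwise distances. A minimal sketch of that update (the function name `merged_distance` is illustrative), checked against a direct average over the six cross pairs:

```python
def merged_distance(size_i, d_il, size_j, d_jl):
    """Size-weighted average of the two old cluster distances to C_l."""
    return (size_i * d_il + size_j * d_jl) / (size_i + size_j)


# Merge {A} (size 1) with {D, E} (size 2), measured against {B, C},
# with d({A}, {B, C}) = 8 and d({D, E}, {B, C}) = 5.5.
updated = merged_distance(1, 8.0, 2, 5.5)

# Direct average over the cross pairs A-B, A-C, D-B, D-C, E-B, E-C.
direct = (8 + 8 + 5 + 5 + 6 + 6) / 6

print(abs(updated - direct) < 1e-12)  # True
```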

11
Clustering by distance

• If the distance matrix is ultrametric, UPGMA is the right approach
• If, however, the distance matrix is not ultrametric but additive,
then there are better clustering methods than UPGMA
• Why not join the closest pair, as UPGMA suggests?
– Because the two closest strings need not be siblings under a
common parent node in the true tree
– Additivity is a very different requirement from minimum distance

12
UPGMA clustering of additive-distance matrix

     A   B   C   D   E
A    0   8   8   5   4
B        0   2   5   6
C            0   5   6
D                0   3
E                    0

Merge chosen at this step: B with C

• Find the lowest score in the matrix
• Merge columns/rows
• Update distances

$$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$
13
UPGMA clustering of additive-distance matrix

      A  BC   D   E
A     0   8   5   4
BC        0   5   6
D             0   3
E                 0

Merges so far: (B, C); merge chosen at this step: D with E

• Find the lowest score in the matrix
• Merge columns/rows
• Update distances

$$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$
14
UPGMA clustering of additive-distance matrix

       A   BC   DE
A      0    8  4.5
BC          0  5.5
DE               0

Merges so far: (B, C), (D, E); merge chosen at this step: A with (D, E)

• Find the lowest score in the matrix
• Merge columns/rows
• Update distances

$$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$
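
The three steps can be put together as a short loop. This is a sketch under simple assumptions (clusters as frozensets, ties broken by first occurrence), not a tuned implementation; it reproduces the merge order of the walkthrough.

```python
from itertools import combinations

LEAVES = ["A", "B", "C", "D", "E"]
PAIRS = {
    ("A", "B"): 8, ("A", "C"): 8, ("A", "D"): 5, ("A", "E"): 4,
    ("B", "C"): 2, ("B", "D"): 5, ("B", "E"): 6,
    ("C", "D"): 5, ("C", "E"): 6,
    ("D", "E"): 3,
}


def upgma(leaves, pairs):
    """Return the merges as (cluster_a, cluster_b, distance), in merge order."""
    clusters = [frozenset([x]) for x in leaves]
    # Distances between clusters, keyed by the unordered pair of clusters.
    dist = {
        frozenset((frozenset([x]), frozenset([y]))): float(v)
        for (x, y), v in pairs.items()
    }
    merges = []
    while len(clusters) > 1:
        # Find the lowest score among the current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merges.append((a, b, dist[frozenset((a, b))]))
        merged = a | b
        # Update distances with the size-weighted average.
        for c in clusters:
            if c not in (a, b):
                dist[frozenset((merged, c))] = (
                    len(a) * dist[frozenset((a, c))]
                    + len(b) * dist[frozenset((b, c))]
                ) / (len(a) + len(b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges


for a, b, score in upgma(LEAVES, PAIRS):
    print(sorted(a), sorted(b), score)
```

The first three merges come out as B with C at 2.0, D with E at 3.0, and A with {D, E} at 4.5, matching the matrices above.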
15
Additive matrix

     A   B   C   D   E
A    0   8   8   5   4
B        0   2   5   6
C            0   5   6
D                0   3
E                    0

Truth:

root
├─2─ •
│    ├─1─ •
│    │    ├─3─ A
│    │    └─1─ E
│    └─1─ D
└─1─ •
     ├─1─ B
     └─1─ C

What UPGMA found:

root
├─ •
│  ├─ A
│  └─ •
│     ├─ E
│     └─ D
└─ •
   ├─ B
   └─ C
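
That this matrix is additive but not ultrametric can be confirmed with two standard tests that the slides do not spell out: the three-point condition for ultrametricity and the four-point condition for additivity. A sketch, with illustrative function names:

```python
from itertools import combinations

LEAVES = ["A", "B", "C", "D", "E"]
PAIRS = {
    ("A", "B"): 8, ("A", "C"): 8, ("A", "D"): 5, ("A", "E"): 4,
    ("B", "C"): 2, ("B", "D"): 5, ("B", "E"): 6,
    ("C", "D"): 5, ("C", "E"): 6,
    ("D", "E"): 3,
}


def d(x, y):
    """Symmetric lookup with a zero diagonal."""
    if x == y:
        return 0
    return PAIRS[(x, y)] if (x, y) in PAIRS else PAIRS[(y, x)]


def is_ultrametric(leaves):
    """Three-point condition: the two largest distances in every triple are equal."""
    for x, y, z in combinations(leaves, 3):
        _, middle, largest = sorted([d(x, y), d(x, z), d(y, z)])
        if middle != largest:
            return False
    return True


def is_additive(leaves):
    """Four-point condition: of the three pairing sums, the two largest are equal."""
    for w, x, y, z in combinations(leaves, 4):
        sums = sorted([d(w, x) + d(y, z), d(w, y) + d(x, z), d(w, z) + d(x, y)])
        if sums[1] != sums[2]:
            return False
    return True


print(is_additive(LEAVES), is_ultrametric(LEAVES))  # True False
```

The triple A, D, E already breaks the three-point condition (distances 5, 4, 3), which is why merging the closest pair misleads UPGMA here.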

16
Neighbor joining in additive distance

• Two key ideas in modifying clustering for additive trees
• “Normalize” distance by average distances over set of leaves $L$

$$D(x, y) = d(x, y) - \frac{\sum_{z \in L} \left( d(x, z) + d(y, z) \right)}{|L| - 2}$$

• Change distance recalculation after merging clusters
• If $C_k = C_i \cup C_j$ then for all nodes $m$

$$d(k, m) = \frac{1}{2} \left( d(i, m) + d(j, m) - d(i, j) \right)$$
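
The normalized score can be computed from row sums, since the sum over $z$ of $d(x, z) + d(y, z)$ is the row sum for $x$ plus the row sum for $y$. A sketch on the five-leaf additive matrix from the preceding slides (names are illustrative):

```python
LEAVES = ["A", "B", "C", "D", "E"]
PAIRS = {
    ("A", "B"): 8, ("A", "C"): 8, ("A", "D"): 5, ("A", "E"): 4,
    ("B", "C"): 2, ("B", "D"): 5, ("B", "E"): 6,
    ("C", "D"): 5, ("C", "E"): 6,
    ("D", "E"): 3,
}


def d(x, y):
    """Symmetric lookup with a zero diagonal."""
    if x == y:
        return 0
    return PAIRS[(x, y)] if (x, y) in PAIRS else PAIRS[(y, x)]


def normalized(leaves):
    """D(x, y) = d(x, y) - (row_sum(x) + row_sum(y)) / (|L| - 2) for each pair."""
    n = len(leaves)
    row = {x: sum(d(x, z) for z in leaves) for x in leaves}
    return {
        (x, y): d(x, y) - (row[x] + row[y]) / (n - 2)
        for i, x in enumerate(leaves)
        for y in leaves[i + 1:]
    }


scores = normalized(LEAVES)
best = min(scores, key=scores.get)
print(best, scores[best])  # ('B', 'C') -12.0
```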

17
Additive neighbor joining

Distances d:

     A   B   C   D   E
A    0   8   8   5   4
B        0   2   5   6
C            0   5   6
D                0   3
E                    0

Normalized distances D:

         A       B       C       D       E
A        0   -7.33   -7.33   -9.33  -10.67
B                0     -12      -8   -7.33
C                        0      -8   -7.33
D                                0   -9.33
E                                        0

• Find the lowest score in the matrix
• Merge columns/rows
• Update distances

$$d(k, m) = \frac{1}{2} \left( d(i, m) + d(j, m) - d(i, j) \right)$$
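
Applying the update to the lowest-scoring pair, B and C, gives the reduced distance matrix used in the next step. A minimal sketch; `key` and `nj_merge` are illustrative helper names, and the new node is simply called `BC`.

```python
def key(x, y):
    """Unordered pair, so d(x, y) and d(y, x) share one entry."""
    return frozenset((x, y))


DIST = {
    key("A", "B"): 8, key("A", "C"): 8, key("A", "D"): 5, key("A", "E"): 4,
    key("B", "C"): 2, key("B", "D"): 5, key("B", "E"): 6,
    key("C", "D"): 5, key("C", "E"): 6,
    key("D", "E"): 3,
}


def nj_merge(nodes, dist, i, j, name):
    """Replace i and j by a new node using d(k, m) = (d(i, m) + d(j, m) - d(i, j)) / 2."""
    rest = [m for m in nodes if m not in (i, j)]
    # Keep every distance that involves neither merged node.
    new_dist = {k: v for k, v in dist.items() if i not in k and j not in k}
    for m in rest:
        new_dist[key(name, m)] = 0.5 * (
            dist[key(i, m)] + dist[key(j, m)] - dist[key(i, j)]
        )
    return rest + [name], new_dist


nodes, reduced = nj_merge(["A", "B", "C", "D", "E"], DIST, "B", "C", "BC")
for m in ["A", "D", "E"]:
    print("BC", m, reduced[key("BC", m)])
```

This prints distances 7.0, 4.0, and 5.0 from BC to A, D, and E respectively.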
18
Additive neighbor joining

Distances d:

       A   BC    D    E
A      0    7    5    4
BC          0    4    5
D                0    3
E                     0

Normalized distances D:

       A   BC    D    E
A      0   -9   -9  -10
BC          0  -10   -9
D                0   -9
E                     0

• Two possible merges: A with E, or BC with D (both score -10)
• Additive-distance tree (unlike ultrametric) not necessarily unique
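
The tie can be found mechanically by recomputing the normalized scores on the reduced matrix and collecting every pair that attains the minimum. A sketch, with the reduced distances hard-coded and an illustrative function name:

```python
NODES = ["A", "BC", "D", "E"]
REDUCED = {
    ("A", "BC"): 7, ("A", "D"): 5, ("A", "E"): 4,
    ("BC", "D"): 4, ("BC", "E"): 5,
    ("D", "E"): 3,
}


def lowest_pairs(nodes, pairs):
    """Return all pairs with the minimum normalized score, plus that score."""
    def d(x, y):
        if x == y:
            return 0
        return pairs[(x, y)] if (x, y) in pairs else pairs[(y, x)]

    n = len(nodes)
    row = {x: sum(d(x, z) for z in nodes) for x in nodes}
    score = {
        (x, y): d(x, y) - (row[x] + row[y]) / (n - 2)
        for i, x in enumerate(nodes)
        for y in nodes[i + 1:]
    }
    lowest = min(score.values())
    return [p for p, s in score.items() if s == lowest], lowest


print(lowest_pairs(NODES, REDUCED))  # ([('A', 'E'), ('BC', 'D')], -10.0)
```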

19
Multiple trees (basically unrooted)

Tree 1:

root
├─2─ •
│    ├─1─ •
│    │    ├─3─ A
│    │    └─1─ E
│    └─1─ D
└─1─ •
     ├─1─ B
     └─1─ C

Tree 2:

root
├─2─ A
└─1─ •
     ├─1─ E
     └─1─ •
          ├─1─ D
          └─3─ •
               ├─1─ B
               └─1─ C

20
Internal nodes

• Just finding the tree topology is not necessarily the end product
• What about labels on the internal nodes?
• A phylogenic tree is hypothesizing a point of divergence
– There was an ancestor string at that point
– Can we hypothesize the string in addition to the point of
divergence?
• One method is “maximum parsimony” or just “parsimony”
• Sort of an Occam’s razor approach: hypothesize as few mutations
as necessary

21
Parsimony

• Parsimony methods are generally presented for a given tree
– Multiple trees can be compared, but searching over all possible
trees is generally intractable
• For the current discussion, assume that we are given a tree
– Perhaps derived via iterative pairwise alignment
• Since we have a tree, assume that we have a multi-sequence
alignment consistent with that tree
• Given the tree and multi-sequence alignment of leaves, parsimony
looks for the minimum substitutions/mutations over the tree

22
Phylogenic Alignment problem

• Given a phylogenic tree $T$, the phylogenic alignment problem is:
– Label the internal nodes of $T$ such that the overall distance of
the alignment is minimized
– Overall distance is the sum of all parent/child distances
– Usually different “sites” in the string are modeled independently
• Minimum mutation problem (Fitch-Hartigan)
– Phylogenic Alignment problem when given multi-sequence
alignment of the leaves
– Efficient dynamic programming when given tree T

23
Continue with additive tree example

     A   B   C   D   E
A    0   8   8   5   4
B        0   2   5   6
C            0   5   6
D                0   3
E                    0

root
├─2─ •
│    ├─1─ •
│    │    ├─3─ A
│    │    └─1─ E
│    └─1─ D
└─1─ •
     ├─1─ B
     └─1─ C

A: CATG-AAG    D: G-AG-ATT
B: G-CATCCT    E: C--G-AGT
C: G-GATGCT

24
Minimum mutation dynamic programming

D: G−AG−ATT    B: G−CATCCT    C: G−GATGCT

A: CATG−AAG    E: C−−G−AGT

25
Minimum mutation dynamic programming

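D: G−AG−ATT    B: G−CATCCT    C: G−GATGCT

Cost table at the parent of A and E (one row per candidate symbol at that
node, one digit per alignment column):

C: 02222222
T: 22122221
A: 21222012
G: 22202211
−: 21120222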
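
The table for the parent of A and E can be reproduced by counting, for each candidate symbol and each column, how many of the two leaves differ from that symbol. The sketch below assumes unit cost per mismatch (gaps included), which is what the digits suggest, and writes the gap as ASCII `-`.

```python
ALPHABET = "CTAG-"


def leaf_pair_table(left, right):
    """Per candidate symbol: one cost digit per column, counting mismatching leaves."""
    return {
        s: "".join(str((a != s) + (b != s)) for a, b in zip(left, right))
        for s in ALPHABET
    }


table = leaf_pair_table("CATG-AAG", "C--G-AGT")
for symbol in ALPHABET:
    print(f"{symbol}: {table[symbol]}")
```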

A: CATG−AAG    E: C−−G−AGT

26
Minimum mutation dynamic programming

Cost table at the parent of B and C:

C: 22122102
T: 22220220
A: 22202222
G: 02122122
−: 20222222

27
Minimum mutation dynamic programming

Cost table at the parent of D and the (A, E) node:

C: 13322233
T: 23222221
A: 22222023
G: 13302222
−: 21220233

28
Minimum mutation dynamic programming
Cost table at the root:

C: 23422233
T: 33421331
A: 33412233
G: 13412233
−: 31421343
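
The whole bottom-up pass can be sketched as one recursion per alignment column: the cost of putting symbol s at a node is the sum, over its children, of the cheapest child symbol t plus one if t differs from s. The unit substitution cost (gaps included) and the nested-tuple encoding of the tree are assumptions of this sketch; the gap is written as ASCII `-`.

```python
ALPHABET = "CTAG-"
INF = float("inf")

LEAVES = {
    "A": "CATG-AAG", "B": "G-CATCCT", "C": "G-GATGCT",
    "D": "G-AG-ATT", "E": "C--G-AGT",
}
# Tree from the slides: ((A, E), D) on one side of the root, (B, C) on the other.
TREE = ((("A", "E"), "D"), ("B", "C"))


def column_costs(node, col):
    """Map each symbol to the minimum mutations below node if node carries it."""
    if isinstance(node, str):  # leaf: only its observed symbol is allowed
        return {s: (0 if LEAVES[node][col] == s else INF) for s in ALPHABET}
    costs = {s: 0 for s in ALPHABET}
    for child in node:
        child_costs = column_costs(child, col)
        for s in ALPHABET:
            costs[s] += min(child_costs[t] + (s != t) for t in ALPHABET)
    return costs


def table(node):
    """Per symbol, the list of costs across all alignment columns."""
    n_cols = len(LEAVES["A"])
    cols = [column_costs(node, c) for c in range(n_cols)]
    return {s: [col[s] for col in cols] for s in ALPHABET}


root = table(TREE)
total = sum(min(root[s][c] for s in ALPHABET) for c in range(len(LEAVES["A"])))
print(total)  # 14 mutations over the whole tree
```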

29
Minimum mutation backtrace

Root: G−−ATCCT (a minimum-cost symbol in each column of the root table)
30
Minimum mutation backtrace

Root: G−−ATCCT

Parent of D and the (A, E) node: G−−G−ACT
31
Minimum mutation backtrace

Root: G−−ATCCT

Parent of D and the (A, E) node: G−−G−ACT
Parent of B and C: G−−ATCCT
32
Minimum mutation backtrace

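Root: G−−ATCCT
├─ Parent of D and the (A, E) node: G−−G−ACT
│  ├─ Parent of A and E: C−−G−AAT
│  │  ├─ A: CATG−AAG
│  │  └─ E: C−−G−AGT
│  └─ D: G−AG−ATT
└─ Parent of B and C: G−−ATCCT
   ├─ B: G−CATCCT
   └─ C: G−GATGCT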
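
The overall distance of this labeling is the sum of parent/child distances. Assuming each link is scored by the number of differing columns (a unit-cost Hamming distance, which is an assumption of this sketch), the total can be checked as follows; `n1`, `n2`, `n3` are hypothetical names for the internal nodes and the gap is written as ASCII `-`.

```python
def hamming(a, b):
    """Number of alignment columns in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


LABELS = {
    "root": "G--ATCCT",
    "n1": "G--G-ACT",   # parent of D and the (A, E) node
    "n2": "G--ATCCT",   # parent of B and C
    "n3": "C--G-AAT",   # parent of A and E
    "A": "CATG-AAG", "B": "G-CATCCT", "C": "G-GATGCT",
    "D": "G-AG-ATT", "E": "C--G-AGT",
}
LINKS = [
    ("root", "n1"), ("root", "n2"),
    ("n1", "n3"), ("n1", "D"),
    ("n3", "A"), ("n3", "E"),
    ("n2", "B"), ("n2", "C"),
]

total = sum(hamming(LABELS[p], LABELS[c]) for p, c in LINKS)
print(total)  # 14
```

The total of 14 equals the sum of the column minima at the root, so the backtraced labels achieve the minimum for this tree.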
33
Minimum mutation without tree

• For a given tree, we have seen that this problem has an efficient
dynamic programming solution
• No efficient algorithm is known for exploring all possible phylogenic
trees for a set of strings
• However, for a given order of strings at the leaves, a variant of the
CYK algorithm could be used
– Typically used with context-free grammars
– $O(k^3 n)$ for $k$ strings and an $n$-column multi-sequence alignment
– Could result in any binary tree shape over the ordered set

34
Iterative approaches to MSA and tree building

• Having trees helps with multi-sequence alignment
• Having multi-sequence alignment helps with trees
• Iterative approaches can be used. One example:
1. Build a multi-sequence alignment using iterative alignment
2. Build an initial phylogenic tree using neighbor joining
3. Perform minimum mutation phylogenic alignment
4. Build a new multi-sequence alignment using the full phylogenic tree
5. Throw away strings on internal nodes, preserving the MSA
6. Go back to step 2, using the new multi-sequence alignment

35
