Phylogenetic Analysis: Based On Two Talks, by
[Figure: a rooted tree of taxa A, B, C and D with its parts labeled.]
Branches or Lineages
Ancestral Node or ROOT of the Tree
Internal Nodes or Divergence Points (represent hypothetical ancestors of the taxa)
Based on lectures by C-B Stewart, and by Tal Pupko
Terminal Nodes (A, B, C, D): represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data:
Which species are the closest living relatives of modern humans?
Did the infamous Florida Dentist infect his patients with HIV?
What were the origins of specific transposable elements?
Plus countless others...
[Figure: two trees of humans and the great apes (chimpanzees, bonobos, gorillas, orangutans), with divergence times labeled 14 MYA and 15-30 MYA.]
Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas.
The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.
Yes:
The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.
[Figure: phylogenetic tree of HIV sequences labeled DENTIST, Patient D, Patient F and local controls 2, 3, 9 and 35; two items in the figure are marked "No". From Ou et al. (1992) and Page & Holmes (1998).]
A few examples of what can be learned from character analysis using phylogenies as analytical frameworks:
When did specific episodes of positive Darwinian selection occur during evolutionary history?
Which genetic changes are unique to the human lineage?
What was the most likely geographical location of the common ancestor of the African apes and humans?
Plus countless others...
The number of unrooted trees grows faster than exponentially with the number of taxa
# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
...           ...
30            ~8.69 x 10^36
[Figure: example unrooted trees drawn for taxa A, B, C, D and E.]
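The growth described above can be checked with the standard double-factorial count of unrooted bifurcating trees, (2N - 5)!! = 1 x 3 x 5 x ... x (2N - 5). The following is a minimal sketch; the function name is illustrative and does not come from the lectures.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted, strictly bifurcating trees for n_taxa labelled taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Taxon number k can be attached to any of the 2(k - 1) - 3 = 2k - 5
    # branches of a tree that already holds k - 1 taxa.
    for k in range(4, n_taxa + 1):
        count *= 2 * k - 5
    return count


for n in (3, 4, 5, 6, 7, 8, 9, 10, 30):
    print(n, count_unrooted_trees(n))
```

Each added taxon multiplies the count by a factor that itself grows, which is why the increase outpaces any fixed exponential.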
Inferring evolutionary relationships between the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:
[Figure: first example, an unrooted tree with the root position marked, and the resulting rooted tree.]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.
[Figure: second example, an unrooted tree of taxa A, B, C and D with the root position marked, and the resulting rooted tree.]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.
An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees
[Figure: the unrooted tree 1 (taxa A, B, C, D) with its branches numbered 1 to 5, and the rooted trees 1a, 1b, 1c, 1d and 1e obtained from it. Leaf orders shown: B A C D; A B C D; A B C D; C D A B; D C A B.]
These trees show five different evolutionary relationships among the taxa!
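The five rootings can be enumerated mechanically by placing a root on each branch in turn. The sketch below assumes a hypothetical unrooted tree in which A and B join at internal node X and C and D join at internal node Y; the node names and helper functions are illustrative only. Each rooted tree is printed in Newick notation.

```python
# Hypothetical unrooted four-taxon tree: A and B attach to X, C and D attach to Y.
EDGES = [("A", "X"), ("B", "X"), ("X", "Y"), ("C", "Y"), ("D", "Y")]


def build_adjacency(edges):
    """Map each node to the list of its neighbours."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    return adjacency


def subtree(adjacency, node, parent):
    """Newick string for the part of the tree hanging below node, away from parent."""
    children = [subtree(adjacency, nxt, node) for nxt in adjacency[node] if nxt != parent]
    if not children:
        return node  # a terminal node (taxon)
    return "(" + ",".join(sorted(children)) + ")"


def root_on_edge(adjacency, u, v):
    """Place the root on branch u-v and return the rooted tree."""
    sides = sorted([subtree(adjacency, u, v), subtree(adjacency, v, u)])
    return "(" + ",".join(sides) + ");"


ADJACENCY = build_adjacency(EDGES)
ROOTED = [root_on_edge(ADJACENCY, u, v) for u, v in EDGES]
for newick in ROOTED:
    print(newick)
```

Sorting the children gives each rooted tree one canonical spelling, so the five strings can be compared directly.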
outgroup
By midpoint or distance:
Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.
[Figure: example tree with branch lengths 10, 3, 2 and 5; the path from A to D crosses the branches of length 10, 3 and 5.]
d(A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
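Midpoint rooting can be sketched in a few lines. The tree below is an assumption made for illustration: the path lengths 10, 3 and 5 between A and D match the worked example, while the topology, the internal node names X and Y, and the length of the branch to C are hypothetical.

```python
from itertools import combinations

# Hypothetical branch lengths; only the A-to-D path (10 + 3 + 5) follows the example.
EDGES = {("A", "X"): 10, ("B", "X"): 2, ("X", "Y"): 3, ("C", "Y"): 2, ("D", "Y"): 5}
LEAVES = ["A", "B", "C", "D"]


def build_adjacency(edges):
    adjacency = {}
    for (u, v), length in edges.items():
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))
    return adjacency


def find_path(adjacency, node, goal, parent=None):
    """List of (from, to, length) steps from node to goal, or None if unreachable."""
    if node == goal:
        return []
    for nxt, length in adjacency[node]:
        if nxt == parent:
            continue
        rest = find_path(adjacency, nxt, goal, node)
        if rest is not None:
            return [(node, nxt, length)] + rest
    return None


def midpoint_root(edges, leaves):
    """Return the most distant pair, their distance, and where the root falls."""
    adjacency = build_adjacency(edges)
    paths = {pair: find_path(adjacency, *pair) for pair in combinations(leaves, 2)}
    pair = max(paths, key=lambda p: sum(step[2] for step in paths[p]))
    total = sum(step[2] for step in paths[pair])
    walked = 0
    for u, v, length in paths[pair]:
        if walked + length >= total / 2:
            # The root sits on branch u-v, this far from u.
            return pair, total, (u, v, total / 2 - walked)
        walked += length


print(midpoint_root(EDGES, LEAVES))
```

With these lengths the root lands on the branch leading to A, 9 units from A.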
Each unrooted tree theoretically can be rooted anywhere along any of its branches
[Figure: example unrooted trees for taxa A to E.]
# Taxa    # Unrooted trees    x # Roots    = # Rooted trees
3         1                   3            3
4         3                   5            15
5         15                  7            105
6         105                 9            945
7         945                 11           10,395
8         10,395              13           135,135
9         135,135             15           2,027,025
...       ...                 ...          ...
30        ~8.69 x 10^36       57           ~4.95 x 10^38
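The relationship between the two counts can be sketched as follows, assuming strictly bifurcating trees: an unrooted tree of N taxa has 2N - 3 branches, and each branch is one possible root position. The function name is illustrative.

```python
def count_rooted_trees(n_taxa: int) -> int:
    """Rooted bifurcating trees = unrooted trees x number of branches (root positions)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    unrooted = 1
    for k in range(4, n_taxa + 1):
        unrooted *= 2 * k - 5  # branches available when taxon k is added
    root_positions = 2 * n_taxa - 3  # branches in an unrooted tree of n_taxa taxa
    return unrooted * root_positions


for n in (3, 4, 5, 6, 7, 8, 9):
    print(n, 2 * n - 3, count_rooted_trees(n))
```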
[Figure: classification of tree-building methods by DATA TYPE (characters or distances); UPGMA and NEIGHBOR-JOINING are listed under distances.]
Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.
            Species A   Species B   Species C   Species D   Species E
Species A   ----        0.23        0.87        0.73        0.59
Species B   0.20        ----        0.59        1.12        0.89
Species C   0.50        0.40        ----        0.17        0.61
Species D   0.45        0.55        0.15        ----        0.31
Species E   0.40        0.50        0.40        0.25        ----
Example 2: Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
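To make the transformation from sequences to distances concrete, here is a minimal sketch that computes the observed proportion of differing sites and the Kimura 2-parameter distance, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The two sequences are made up for illustration, and the code assumes gap-free DNA sequences of equal length.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def site_differences(seq1, seq2):
    """Return (transition proportion P, transversion proportion Q)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # A change within purines or within pyrimidines is a transition.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions / len(seq1), transversions / len(seq1)


def p_distance(seq1, seq2):
    """Observed proportion of sites that differ."""
    p, q = site_differences(seq1, seq2)
    return p + q


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter estimate of substitutions per site."""
    p, q = site_differences(seq1, seq2)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


SEQ_1 = "AGCTAGCTAG"  # made-up sequences
SEQ_2 = "AGCTGGCTAC"
print(p_distance(SEQ_1, SEQ_2), k2p_distance(SEQ_1, SEQ_2))
```

The corrected distance is larger than the observed one because it allows for multiple substitutions at the same site.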
Computational methods for finding optimal trees:
Exact algorithms: "Guarantee" to find the optimal or "best" tree for the method of choice. Two types used in tree building:
Exhaustive search: Evaluates all possible unrooted trees, choosing the one with the best score for the method.
Branch-and-bound search: Eliminates the parts of the search tree that only contain suboptimal solutions.
Exact searches become increasingly difficult, and eventually impossible, as the number of taxa increases:
# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
...           ...
30            ~8.69 x 10^36
[Figure: example unrooted trees drawn for taxa A, B, C, D and E.]
Heuristic search algorithms are input-order dependent and can get stuck in local minima or maxima
[Figure: search landscape; the goal is to find the GLOBAL MAXIMUM.]
Rerunning heuristic searches using different input orders of taxa can help find global minima or maxima
[Figure: search landscapes showing a local maximum next to the GLOBAL MAXIMUM, and a local minimum next to the GLOBAL MINIMUM.]
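The effect of the starting point can be illustrated with a toy hill-climbing search. This is only a sketch: the one-dimensional score landscape is made up, and real tree searches move between tree topologies rather than list positions.

```python
# Made-up score landscape: a local maximum at index 2, the global maximum at index 6.
LANDSCAPE = [1, 3, 5, 4, 2, 6, 9, 7, 3]


def hill_climb(landscape, start):
    """Move to the better neighbour until no neighbour improves the score."""
    position = start
    while True:
        neighbours = [i for i in (position - 1, position + 1) if 0 <= i < len(landscape)]
        best = max(neighbours, key=lambda i: landscape[i])
        if landscape[best] <= landscape[position]:
            return position  # stuck: a local (possibly global) maximum
        position = best


def restart_search(landscape, starts):
    """Rerun the search from several starting points and keep the best result."""
    return max((hill_climb(landscape, s) for s in starts), key=lambda i: landscape[i])


print(hill_climb(LANDSCAPE, 0))              # one start can stop at a local maximum
print(restart_search(LANDSCAPE, [0, 4, 8]))  # several starts reach the global maximum here
```

Restarts improve the odds of reaching the global optimum but do not guarantee it.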
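[Figure: classification of tree-building methods by COMPUTATIONAL METHOD (optimality criterion or clustering algorithm) and DATA TYPE (characters or distances); UPGMA and NEIGHBOR-JOINING are clustering algorithms that use distances.]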
Parsimony methods:
Optimality criterion: The most-parsimonious tree is the one that requires the fewest evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.
Advantages:
Are simple, intuitive, and logical (many can be done with pencil and paper).
Can be used on molecular and non-molecular (e.g., morphological) data.
Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy).
Can be used for character analysis (can infer the exact substitutions) and rate analysis.
Can be used to infer the sequences of the extinct (hypothetical) ancestors.
Disadvantages:
Are simple, intuitive, and logical (derived from Medieval logic, not statistics!).
Can be fooled by high levels of homoplasy (same events).
Can become positively misleading in the Felsenstein Zone.
[See Stewart (1993) for a simple explanation of parsimony analysis, and Swofford et al. (1996) for a detailed explanation of various parsimony methods.]
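As an illustration of the optimality criterion, the minimum number of substitutions at one site on a fixed rooted tree can be counted with Fitch's algorithm, a standard method that is named here as an example and is not taken from the lectures. The trees are written as nested tuples and the character states are made up.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes) for one site.

    tree is either a taxon name or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The two sides can agree: no extra change is needed at this node.
        return shared, left_cost + right_cost
    # No state in common: one change must occur below this node.
    return left_set | right_set, left_cost + right_cost + 1


SITE = {"A": "A", "B": "A", "C": "G", "D": "G"}  # made-up states at one site
TREE_1 = (("A", "B"), ("C", "D"))
TREE_2 = (("A", "C"), ("B", "D"))

print(fitch(TREE_1, SITE)[1])  # changes required on the first tree
print(fitch(TREE_2, SITE)[1])  # changes required on the second tree
```

For this site the first tree needs one change and the second needs two, so parsimony prefers the first. A full analysis sums such counts over all sites.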
There is a more efficient way than going over all possible trees. It is called BRANCH AND BOUND, a general technique in computer science that can be applied to phylogeny.
Naive approach (say, for 5 points): You have 5 starting points. For each starting point you have 4 possible next steps. For each combination of starting point and first step, you have 3 possible second steps, etc. Altogether we have 5*4*3*2*1 = 5! = 120 possible solutions:
(1,2,3,4,5) (1,2,3,5,4) (1,2,4,3,5) (1,2,4,5,3) (1,2,5,3,4) ...
We can go over the list and find the one giving the best score.
[Figure: search tree over the orderings of the 5 points.]
However, for 15 points, for example, there are 15! = 1,307,674,368,000 possible solutions. The number of solutions grows too fast for this to be practical.
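The naive approach can be sketched as an exhaustive enumeration. The example assumes, for illustration only, that the five points lie on a line at made-up positions and that the score of an ordering is the total distance travelled, with lower being better.

```python
import math
from itertools import permutations

POSITIONS = [0, 3, 1, 4, 2]  # made-up positions of the 5 points on a line


def path_length(order):
    """Total distance travelled when visiting the points in the given order."""
    return sum(abs(POSITIONS[a] - POSITIONS[b]) for a, b in zip(order, order[1:]))


def exhaustive_search():
    """Score every ordering and keep the best one."""
    orderings = list(permutations(range(len(POSITIONS))))
    best = min(orderings, key=path_length)
    return best, path_length(best), len(orderings)


best_order, best_length, n_orderings = exhaustive_search()
print(best_order, best_length, n_orderings)
print(math.factorial(15))  # orderings that would have to be scored for 15 points
```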
A greedy heuristic: start from a random point, go to the closest unvisited point, then to its closest unvisited point, and so on. This approach doesn't work so well (but a reasonably close heuristic, based on simulated annealing, will be presented in a couple of lectures).
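A small made-up example shows why the greedy rule can fail. The four points are assumed to lie on a line, the score is the total distance travelled, and both searches start from the same point so that the comparison is fair.

```python
from itertools import permutations

POSITIONS = [0, 1, -2, 3]  # made-up positions of 4 points on a line


def path_length(order):
    return sum(abs(POSITIONS[a] - POSITIONS[b]) for a, b in zip(order, order[1:]))


def nearest_neighbour(start):
    """Greedy rule: always move to the closest point not yet visited."""
    order = [start]
    unvisited = set(range(len(POSITIONS))) - {start}
    while unvisited:
        here = order[-1]
        # Ties are broken by index so the result is deterministic.
        nxt = min(unvisited, key=lambda i: (abs(POSITIONS[i] - POSITIONS[here]), i))
        order.append(nxt)
        unvisited.remove(nxt)
    return order


def best_from(start):
    """Exhaustive search over all orderings that begin at start."""
    others = [i for i in range(len(POSITIONS)) if i != start]
    return min(([start] + list(rest) for rest in permutations(others)), key=path_length)


greedy = nearest_neighbour(0)
optimal = best_from(0)
print(greedy, path_length(greedy))
print(optimal, path_length(optimal))
```

The greedy path takes the cheap first step and then pays for a long final jump, so it ends up longer than the best ordering from the same start.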
[Figure: the search tree again, with one node annotated as follows.]
Score here already 16: no point in expanding the rest of the subtree
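The pruning rule can be sketched as a depth-first search that abandons a partial solution as soon as its score is no better than the best complete solution found so far. The example assumes made-up points on a line and a score equal to the distance travelled (lower is better). It illustrates the general technique only and is not the MP-BNB procedure from the slides.

```python
POSITIONS = [0, 3, 1, 4, 2]  # made-up positions of the 5 points on a line


def branch_and_bound():
    """Return (best ordering, its length, number of search-tree nodes visited)."""
    n = len(POSITIONS)
    best = {"order": None, "length": float("inf")}
    visited = 0

    def extend(order, length):
        nonlocal visited
        visited += 1
        if length >= best["length"]:
            return  # bound: this subtree cannot beat the best solution so far
        if len(order) == n:
            best["order"], best["length"] = order, length
            return
        for nxt in range(n):
            if nxt not in order:
                step = abs(POSITIONS[nxt] - POSITIONS[order[-1]])
                extend(order + [nxt], length + step)

    for start in range(n):
        extend([start], 0)
    return best["order"], best["length"], visited


order, length, visited = branch_and_bound()
print(order, length, visited)
```

The full search tree for 5 points has 5 + 20 + 60 + 120 + 120 = 325 nodes. Pruning lets the search skip part of it while still returning an optimal ordering, because a pruned subtree can only contain solutions that are no better than one already found.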
[Figure sequence: MP-BNB, successive frames of a branch-and-bound search tree. Nodes carry numeric labels between 30 and 69, the example trees are drawn with leaf orders such as 1 4 2 3 and 1 3 4 2, and one step is annotated "4 is added to branch 1."]
And Now
Maximum Parsimony is Computationally Intractable
Felsenstein's Dynamic Programming Algorithm for tiny maximum likelihood