Phylogenetic Analysis: Based On Two Talks, by

1. Phylogenetic analysis involves inferring evolutionary relationships between taxa through phylogeny inference and character analysis on phylogenetic trees.
2. Phylogenetic trees diagram the evolutionary relationships between taxa and can be used to study traits, geographic origins, and transmission of diseases and genetic elements.
3. Inferring evolutionary relationships requires rooting unrooted trees, which can be done using outgroups or midpoint rooting methods. Each unrooted tree can have multiple rooted representations.
Copyright
© Attribution Non-Commercial (BY-NC)
Phylogenetic Analysis

based on two talks, by

Caro-Beth Stewart, Ph.D.


Department of Biological Sciences, University at Albany, SUNY

and

Tal Pupko, Ph.D.
Faculty of Life Science, Tel-Aviv University
Based on lectures by C-B Stewart, and by Tal Pupko

What is phylogenetic analysis and why should we perform it?


Phylogenetic analysis has two major components:
1. Phylogeny inference or "tree building": the inference of the branching orders, and ultimately the evolutionary relationships, between "taxa" (entities such as genes, populations, species, etc.)
2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest

Common Phylogenetic Tree Terminology


[Figure: a rooted tree with terminal taxa A, B, C, and D, labeled as follows.]

Terminal Nodes: Represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
Branches or Lineages
Internal Nodes or Divergence Points (represent hypothetical ancestors of the taxa)
Ancestral Node or ROOT of the Tree

Phylogenetic trees diagram the evolutionary relationships between the taxa


[Figure: a rooted tree whose tips are, from top to bottom, Taxon B, Taxon C, Taxon A, Taxon D, Taxon E.]

The horizontal dimension either can have no scale (for "cladograms"), can be proportional to genetic distance or amount of change (for "phylograms" or "additive trees"), or can be proportional to time (for "ultrametric trees" or true evolutionary trees).
The vertical dimension has no meaning: neither the spacing between the taxa nor the order in which they appear from top to bottom carries information.

((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses


These say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
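The nested-parentheses notation can be read mechanically. The following is a minimal Python sketch, not part of the original talks, that parses the string above into nested tuples and lists the clade under every internal node. The function names `parse_newick` and `clades` are illustrative, and the parser assumes plain taxon names without branch lengths.

```python
def parse_newick(text):
    """Parse nested parentheses such as '((A,(B,C)),(D,E))' into nested tuples.

    Assumption: taxon names only, no branch lengths or node labels.
    """
    text = text.strip().rstrip(";")
    pos = 0

    def peek():
        # Return the current character, or "" at the end of the string.
        return text[pos] if pos < len(text) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1                      # skip '('
            children = [parse_node()]
            while peek() == ",":
                pos += 1                  # skip ','
                children.append(parse_node())
            if peek() != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1                      # skip ')'
            return tuple(children)
        start = pos
        while peek() not in ("", "(", ")", ","):
            pos += 1
        return text[start:pos]            # a taxon name

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return tree


def clades(tree):
    """Return the set of taxa under every internal node, innermost first."""
    found = []

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.append(members)
        return members

    leaves(tree)
    return found


tree = parse_newick("((A,(B,C)),(D,E))")
for clade in clades(tree):
    print(sorted(clade))
```

The clades {B, C}, {A, B, C}, and {D, E} appear in the output, while {A, B} does not, which matches the reading given above.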

A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data:
Which species are the closest living relatives of modern humans?
Did the infamous Florida Dentist infect his patients with HIV?
What were the origins of specific transposable elements?
Plus countless others...

Which species are the closest living relatives of modern humans?


[Figure: two trees relating humans, chimpanzees, bonobos, gorillas, and orangutans, one for each of the views described below. Time-scale labels: 14 MYA and 15-30 MYA.]

Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas.

The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.


Did the Florida Dentist infect his patients with HIV?


Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People:

[Figure: tree with tips DENTIST, Patient C, Patient A, Patient G, Patient B, Patient E, Patient A, DENTIST, Local control 2, Local control 3, Patient F, Local control 9, Local control 35, Local control 3, Patient D. Part of the tree is marked "Yes" and two places are marked "No".]

Yes:
The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.

From Ou et al. (1992) and Page & Holmes (1998)

A few examples of what can be learned from character analysis using phylogenies as analytical frameworks:

When did specific episodes of positive Darwinian selection occur during evolutionary history?
Which genetic changes are unique to the human lineage?
What was the most likely geographical location of the common ancestor of the African apes and humans?
Plus countless others...

The number of unrooted trees increases in a greater than exponential manner with number of taxa
[Figure: example unrooted trees drawn with taxa A to E.]

# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
. . . .       . . . .
30            ~8.69 x 10^36

(2N - 5)!! = # unrooted trees for N taxa
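The count can be checked with a short Python sketch (not from the original slides; the function name `num_unrooted_trees` is illustrative). It uses the fact that an unrooted tree with k - 1 taxa has 2(k - 1) - 3 branches, so the k-th taxon can be attached in that many places, and the running product is exactly (2N - 5)!!.

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa >= 3, i.e. (2N - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1  # there is exactly one unrooted tree for three taxa
    for k in range(4, n_taxa + 1):
        branches = 2 * (k - 1) - 3  # branches in a tree with k - 1 taxa
        count *= branches           # taxon k can be attached to any of them
    return count


for n in (3, 4, 5, 6, 7, 8, 9, 10, 30):
    print(n, f"{num_unrooted_trees(n):,}")
```

Python integers have arbitrary precision, so the value for 30 taxa is computed exactly rather than approximated.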


Inferring evolutionary relationships between the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:

[Figure: an unrooted tree with the root position marked, and the rooted tree that results.]

Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

Now, try it again with the root at another position:


[Figure: the same unrooted tree with the root at a different position, and the resulting rooted tree with tips A, B, C, D.]

Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees

[Figure: the unrooted tree 1 with taxa A, B, C, D and its five possible root positions numbered 1 to 5, followed by the five resulting rooted trees 1a, 1b, 1c, 1d, and 1e.]

These trees show five different evolutionary relationships among the taxa!

There are two major ways to root trees:


By outgroup:
Uses taxa (the "outgroup") that are known to fall outside of the group of interest (the "ingroup"). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).

[Figure: a tree rooted on the branch leading to the outgroup.]

By midpoint or distance:
Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.

[Figure: an unrooted tree with taxa A, C, and D labeled and branch lengths 10, 3, 2, and 5.]

d (A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
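The midpoint arithmetic can be sketched in Python as follows. This is an illustration rather than the authors' procedure: it assumes that the path from A to D consists of branches of length 10, 3, and 5 in that order (as the sum above suggests), and the function name `midpoint_on_path` is hypothetical.

```python
def midpoint_on_path(branch_lengths):
    """Locate the midpoint of a path between the two most distant taxa.

    branch_lengths lists the branches along the path, starting at one taxon.
    Returns (path length, index of the branch holding the midpoint,
    distance of the midpoint from the start of that branch).
    """
    total = sum(branch_lengths)
    half = total / 2
    walked = 0.0
    for index, length in enumerate(branch_lengths):
        if walked + length >= half:
            return total, index, half - walked
        walked += length
    raise ValueError("path must contain at least one branch")


# Assumed path A -> D from the slide: 10 + 3 + 5 = 18, midpoint at 9.
total, branch, offset = midpoint_on_path([10, 3, 5])
print(total, branch, offset)
```

Under that assumption the root falls on the first branch (the one leading to A), 9 units from A and 1 unit before the first internal node.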

Each unrooted tree theoretically can be rooted anywhere along any of its branches

[Figure: example unrooted trees drawn with taxa A to E.]

# Taxa     # Unrooted trees    x    # Roots    =    # Rooted trees
3          1                        3               3
4          3                        5               15
5          15                       7               105
6          105                      9               945
7          945                      11              10,395
8          10,395                   13              135,135
9          135,135                  15              2,027,025
. . . .    . . . .                  . . . .         . . . .
30         ~8.69 x 10^36            57              ~4.95 x 10^38

(2N - 3)!! = # rooted trees for N taxa
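The relation between the columns can be verified with a small Python sketch (illustrative names, not from the talks): an unrooted tree for N taxa has 2N - 3 branches, each of which is a possible root position, so the number of rooted trees is the number of unrooted trees times 2N - 3.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (or 2)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_trees(n_taxa):
    """Return (# unrooted trees, # root positions, # rooted trees) for n_taxa >= 3."""
    unrooted = double_factorial(2 * n_taxa - 5)
    roots = 2 * n_taxa - 3            # one possible root per branch
    rooted = unrooted * roots
    # Consistency check against the closed form (2N - 3)!!.
    assert rooted == double_factorial(2 * n_taxa - 3)
    return unrooted, roots, rooted


for n in (3, 4, 5, 6, 7, 8, 9, 30):
    unrooted, roots, rooted = count_trees(n)
    print(n, f"{unrooted:.3g}", roots, f"{rooted:.3g}")
```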



Molecular phylogenetic tree building methods:


Are mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available today, each having strengths and weaknesses. Most can be classified as follows:
                              COMPUTATIONAL METHOD
DATA TYPE       Optimality criterion        Clustering algorithm

Characters      PARSIMONY
                MAXIMUM LIKELIHOOD

Distances       MINIMUM EVOLUTION           UPGMA
                LEAST SQUARES               NEIGHBOR-JOINING

Types of data used in phylogenetic inference:


Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.
Taxa         Characters
Species A    ATGGCTATTCTTATAGTACG
Species B    ATCGCTAGTCTTATATTACA
Species C    TTCACTAGACCTGTGGTCCA
Species D    TTGACCAGACCTGTGGTCCG
Species E    TTGACCAGTTCTCTAGTTCG

Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.
             A       B       C       D       E
Species A    ----    0.23    0.87    0.73    0.59
Species B    0.20    ----    0.59    1.12    0.89
Species C    0.50    0.40    ----    0.17    0.61
Species D    0.45    0.55    0.15    ----    0.31
Species E    0.40    0.50    0.40    0.25    ----

Example 1 (below the diagonal): Uncorrected "p" distance (=observed percent sequence difference)

Example 2 (above the diagonal): Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
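As an illustration of the transformation from characters to distances, the Python sketch below computes both example distances for Species A and Species B from the alignment above. It is a sketch, not the software used for the slides; the function name `p_and_k2p` is illustrative, and the Kimura 2-parameter distance is computed with the standard formula d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions.

```python
import math

PURINES = {"A", "G"}  # the other two bases, C and T, are pyrimidines


def p_and_k2p(seq1, seq2):
    """Return (uncorrected p distance, Kimura 2-parameter distance).

    Assumes two aligned DNA sequences of equal length without gaps.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    n = len(seq1)
    p = (transitions + transversions) / n
    P, Q = transitions / n, transversions / n
    # math.log raises ValueError if the sequences are too divergent
    # for the correction to be defined.
    k2p = -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
    return p, k2p


species_a = "ATGGCTATTCTTATAGTACG"
species_b = "ATCGCTAGTCTTATATTACA"
p, k2p = p_and_k2p(species_a, species_b)
print(f"p = {p:.2f}, K2P = {k2p:.2f}")
```

For this pair there are four differences in 20 sites (one transition, three transversions), which reproduces the 0.20 and 0.23 entries of the matrix.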

Computational methods for finding optimal trees:

Exact algorithms: "Guarantee" to find the optimal or "best" tree for the method of choice. Two types used in tree building:
Exhaustive search: Evaluates all possible unrooted trees, choosing the one with the best score for the method.
Branch-and-bound search: Eliminates the parts of the search tree that only contain suboptimal solutions.

Heuristic algorithms: Approximate or "quick-and-dirty" methods that attempt to find the optimal tree for the method of choice, but cannot guarantee to do so. Heuristic searches often operate by "hill-climbing" methods.

Exact searches become increasingly difficult, and eventually impossible, as the number of taxa increases:
[Figure: example unrooted trees drawn with taxa A to E.]

# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
. . . .       . . . .
30            ~8.69 x 10^36

(2N - 5)!! = # unrooted trees for N taxa


Heuristic search algorithms are input order dependent and can get stuck in local minima or maxima
Rerunning heuristic searches using different input orders of taxa can help find global minima or maxima

[Figure: two score landscapes. "Search for global maximum" shows a local maximum and the GLOBAL MAXIMUM; "Search for global minimum" shows a local minimum and the GLOBAL MINIMUM.]

Classification of phylogenetic inference methods

                              COMPUTATIONAL METHOD
DATA TYPE       Optimality criterion        Clustering algorithm

Characters      PARSIMONY
                MAXIMUM LIKELIHOOD

Distances       MINIMUM EVOLUTION           UPGMA
                LEAST SQUARES               NEIGHBOR-JOINING

Parsimony methods:
Optimality criterion: The "most-parsimonious" tree is the one that requires the fewest evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.

Advantages:
Are simple, intuitive, and logical (many possible by "pencil-and-paper").
Can be used on molecular and non-molecular (e.g., morphological) data.
Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy).
Can be used for character (can infer the exact substitutions) and rate analysis.
Can be used to infer the sequences of the extinct (hypothetical) ancestors.

Disadvantages:
Are simple, intuitive, and logical (derived from "Medieval logic", not statistics!)
Can be fooled by high levels of homoplasy ("same" events).
Can become positively misleading in the "Felsenstein Zone".

[See Stewart (1993) for a simple explanation of parsimony analysis, and Swofford et al. (1996) for a detailed explanation of various parsimony methods.]
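The optimality criterion can be illustrated by counting the minimum number of changes that a fixed tree requires. The Python sketch below uses Fitch's counting procedure, a standard method for unordered characters that the slides do not name, so it should be read as one possible way to score a tree. The alignment, the two candidate trees, and the function names are hypothetical, and trees are assumed to be strictly bifurcating nested tuples.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree is a taxon name or a pair (left, right); states maps taxon -> base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # The children disagree: at least one change on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical toy data: four taxa, three sites.
alignment = {"A": "AAT", "B": "ACT", "C": "GAC", "D": "GCC"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

With these toy data the first tree needs 4 changes and the second needs 5, so the first is the more parsimonious of the two. The second site favors the second tree, which is a small example of homoplasy under the first tree.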

Branch and Bound

Tal Pupko, Tel-Aviv University


There are many trees...


We cannot go over all the trees. We will try to find a way to find the best tree. There are approximate solutions. But what if we want to make sure we find the global maximum?

There is a more efficient way than just going over all possible trees. It is called BRANCH AND BOUND, a general technique in computer science that can be applied to phylogeny.

BRANCH AND BOUND


To exemplify the BRANCH AND BOUND (BNB) method, we will use an example not connected to evolution. Later, when the general BNB method is understood, we will see how to apply this method to finding the MP tree. We will present the traveling salesperson path problem (TSP).


THE TSP PROBLEM (especially adapted to Israel).


A guard has to visit n check-points whose location on a map is known. The problem is to find the shortest path that goes through all points exactly once (no need to come back to starting point).

Naïve approach (say, for 5 points): You have 5 starting points. For each such starting point you have 4 next steps. For each such combination of starting point and first step, you have 3 possible second steps, etc. Altogether we have 5*4*3*2*1 = 5! possible solutions.

THE TSP TREE


[Figure: the TSP search tree. Node labels by level: 1 2 3 4 5; then 245, 145, 125, 124; then 45, 25, 24; then 54, 52, 42.]

THE SHP NAÏVE APPROACH


Each solution can be represented as a permutation:

(1,2,3,4,5)
(1,2,3,5,4)
(1,2,4,3,5)
(1,2,4,5,3)
(1,2,5,3,4)
...

We can go over the list and find the one giving the best score (here, the shortest path).

THE SHP NAÏVE APPROACH

However, for 15 points, for example, there are 15! = 1,307,674,368,000 possible solutions. The number of solutions increases too fast for this approach to be practical.

A TSP GREEDY HEURISTIC

Start from a random point. Go to the closest point. Go to its closest point, etc. This approach doesn't work so well

(but a reasonably close heuristic, based on simulated annealing, will be presented in a couple of lectures.)
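The greedy heuristic, and the reason it can fail, can be seen in a small Python sketch. The four check-point coordinates are hypothetical and chosen so that the greedy choice is misleading; the function names are illustrative, and "closest point" is taken to mean the closest point not yet visited. An exhaustive search over all permutations is included only for comparison.

```python
import itertools
import math


def path_length(points, order):
    """Total length of the path that visits the points in the given order."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))


def greedy_path(points, start):
    """Repeatedly move to the closest point that has not been visited yet."""
    unvisited = set(range(len(points))) - {start}
    order = [start]
    while unvisited:
        here = points[order[-1]]
        nearest = min(unvisited, key=lambda i: (math.dist(here, points[i]), i))
        order.append(nearest)
        unvisited.remove(nearest)
    return order


def exhaustive_path(points):
    """Naive approach: try every permutation and keep the shortest path."""
    return min(itertools.permutations(range(len(points))),
               key=lambda order: path_length(points, order))


# Hypothetical check-points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (-1.5, 0.0), (3.0, 0.0)]
greedy = greedy_path(points, start=0)
best = exhaustive_path(points)
print(greedy, path_length(points, greedy))
print(list(best), path_length(points, best))
```

Starting from point 0, the greedy path has length 7.5, while the shortest path over all starting points has length 4.5.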

BNB SOLUTION TO SHP


[Figure: the TSP search tree (node labels by level: 1 2 3 4 5; then 245, 145, 125, 124; then 45, 25, 24; then 54, 52, 42), with one partial path marked.]

Shortest path found so far = 15

Score here already 16: no point in expanding the rest of the subtree
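The pruning rule can be sketched in Python as a depth-first search over partial paths. This is a minimal illustration under stated assumptions, not the lecturers' implementation: the check-point coordinates are hypothetical, the function name `shortest_path_bnb` is illustrative, and a partial path is abandoned as soon as its length reaches the best complete path found so far (valid because adding steps can only make a path longer).

```python
import math


def shortest_path_bnb(points):
    """Branch and bound for the shortest path through all points (no return).

    Returns (best order, its length, number of search-tree nodes visited).
    """
    n = len(points)
    best = {"length": math.inf, "order": None, "visited": 0}

    def extend(order, length, unvisited):
        best["visited"] += 1
        if length >= best["length"]:
            return                       # bound: this subtree cannot improve
        if not unvisited:
            best["length"], best["order"] = length, list(order)
            return
        here = points[order[-1]]
        # Branch: try the nearest remaining point first to find good bounds early.
        for nxt in sorted(unvisited, key=lambda i: math.dist(here, points[i])):
            order.append(nxt)
            extend(order, length + math.dist(here, points[nxt]), unvisited - {nxt})
            order.pop()

    for start in range(n):
        extend([start], 0.0, frozenset(range(n)) - {start})
    return best["order"], best["length"], best["visited"]


# Hypothetical check-points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (-1.5, 0.0), (3.0, 0.0)]
order, length, visited = shortest_path_bnb(points)
print(order, length, visited)
```

The result is still guaranteed to be optimal, because only subtrees that cannot beat the current best are skipped; the worst-case running time remains exponential.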

Back to finding the MP tree


Finding the MP tree is NP-Hard (will see shortly)
BNB helps, though it is still exponential


The MP search tree


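[Figure: the MP search tree. The root is the unrooted tree of taxa 1, 2, and 3. Each node on the next level adds taxon 4 to one of its three branches (e.g., "4 is added to branch 1"). Each node on the level below adds taxon 5 to one of the branches of a four-taxon tree (e.g., "5 is added to branch 2").]

There are 5 branches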

The MP search tree


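[Figure: the MP search tree, with the parsimony score of each partial or complete tree.]

Three-taxon tree: 30
4 is added to branch 1: 43
The other two placements of 4: 55 and 39
5 is added below 43: 52 54 52 53 58
5 is added below 55: 61 56 59 61 69
5 is added below 39: 53 51 42 47 47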

MP-BNB

[Figure: the MP search tree with its scores, searched subtree by subtree from left to right while the best record is updated.]

The subtree below 43 is searched first: Best (minimum) value = 52
The subtree below 55 is not expanded, since 55 is already worse than the best record of 52.
In the subtree below 39 the best record improves: 52, then 51, then 42

Total # trees visited: 14

Best TREE. MP score = 42

Order of Evaluation Matters


Evaluate all 3 first

[Figure: the search tree with root 30 and the three four-taxon trees 39, 43, and 55; only the subtree below 39 (53 51 42 47 47) is expanded.]

The bound after searching this subtree will be 42.

Total trees visited: 9
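Both visit counts can be reproduced with a short Python sketch that runs branch and bound on the scores shown in the slides. The nested-tuple encoding of the search tree and the function name are illustrative. The sketch relies on the property that adding a taxon never lowers the parsimony score, so the score of a partial tree is a lower bound for everything below it.

```python
# Each node is (score, children); scores are taken from the search tree above.
SEARCH_TREE = (30, [
    (43, [(52, []), (54, []), (52, []), (53, []), (58, [])]),
    (55, [(61, []), (56, []), (59, []), (61, []), (69, [])]),
    (39, [(53, []), (51, []), (42, []), (47, []), (47, [])]),
])


def branch_and_bound(root, best_first):
    """Return (best score, number of trees visited).

    best_first=False explores children in the order given;
    best_first=True scores all children of a node and expands the best first.
    """
    state = {"best": float("inf"), "visited": 0}

    def visit(node):
        score, children = node
        state["visited"] += 1
        if score >= state["best"]:
            return                      # bound: nothing below can be better
        if not children:
            state["best"] = score       # a complete tree with a new record
            return
        if best_first:
            children = sorted(children, key=lambda child: child[0])
        for child in children:
            visit(child)

    visit(root)
    return state["best"], state["visited"]


print(branch_and_bound(SEARCH_TREE, best_first=False))
print(branch_and_bound(SEARCH_TREE, best_first=True))
```

In the given order the search visits 14 trees, and expanding the most promising four-taxon tree first reduces this to 9; both find the same MP score of 42.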

And Now

Maximum Parsimony is Computationally Intractable
Felsenstein's Dynamic Programming Algorithm for tiny maximum likelihood
and more, time permitting
