Phylogenetics Workshop 
Part I : Introduction 
De Landtsheer Sébastien, University of Luxemburg 
Ahead of the BeNeLux Bioinformatics Conference 2011
Outline of the Workshop 
Part I : 
• General introduction 
• Alignments 
• Distance-based methods 
Part II : 
• Maximum likelihood trees 
• Bayesian trees 
Part III : 
• Advanced Bayesian phylogenetics 
• Hypothesis testing
Outline of Part I 
• General introduction : what is 
phylogenetics ? 
• Basic DNA alignment algorithm 
• Distance matrices 
• Distance-based tree inference methods
Software featured in Part I 
• Seaview (http://pbil.univ-lyon1.fr/software/seaview.html) 
• BioEdit (http://www.mbio.ncsu.edu/bioedit/bioedit.html) 
• MEGA (http://www.megasoftware.net/) 
• FigTree (http://tree.bio.ed.ac.uk/software/figtree/)
What is Phylogenetics ? 
• Classification of living species into 
categories 
• Study of characters → states 
• Underlying assumption of evolution 
(cladogram / dendrogram)
What is Phylogenetics 
• Characters : 
– Morphological 
– Biochemical 
– Genetic 
• States : 
– Continuous 
– Discontinuous
Different types of Phylogenetic trees 
• Phylogenetic tree : graphical representation of 
our hypothesis about the evolution of a group of 
organisms 
• Can represent different quantities (time/genetic 
distance) and be displayed in different ways 
• There are several possible methods, and there 
is no single method that is best
Phylogenetic trees jargon 
Internal 
branches 
Root 
(if there is one) 
Node 
Terminal 
branches 
Leaves or 
Tips or 
OTUs
Properties of Phylogenetic trees 
• Rooted vs Unrooted
Properties of Phylogenetic trees 
• The real face of unrooted trees 
=
Properties of Phylogenetic trees 
• The real face of unrooted trees : undirected 
= 
Multiple possibilities for rooting the tree
Properties of Phylogenetic trees 
• Where to place the root ? 
– Midpoint rooting : the root is placed halfway between the two most distantly 
related taxa on the tree. Intuitive, but more often than not it 
is wrong 
– Outgroup : root with one distantly related taxon that is uncontroversially outside the group under study 
• Marsupial for eutherian study 
• Treeshrew for primate study 
• SIV for HIV study
Properties of Phylogenetic trees 
• How to root unrooted trees ? 
1) Midpoint rooting 
= 
Assumes that the rates of evolution have stayed roughly constant
Properties of Phylogenetic trees 
• How to root unrooted trees ? 
2) Using an outgroup 
= 
Problem : it is difficult to find a proper outgroup 
(unambiguously outside the group under study, but still not too distant)
Properties of Phylogenetic trees 
• Rooted trees tell a story (directed) 
Most Recent Common Ancestor (MRCA)
Properties of Phylogenetic trees 
• Swapping branches around a node does not change the tree : only horizontal distance matters 
=
Properties of Phylogenetic trees 
• Many topologies are always possible : 
Number of possible rooted trees for n sequences 
$N_{rooted}(n) = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$ 
2 sequences: 1 
3 sequences: 3 
4 sequences: 15 
5 sequences: 105 
6 sequences: 945 
7 sequences: 10395 
8 sequences: 135135 
9 sequences: 2027025 
10 sequences: 34459425 
52 sequences: > $10^{80}$ (roughly the number of particles in the universe)
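The counts in this list can be checked with a few lines of Python. This is only a sketch of the formula for rooted, bifurcating trees; the function name `rooted_tree_count` is made up for the illustration.

```python
from math import factorial

def rooted_tree_count(n: int) -> int:
    """Number of rooted, bifurcating trees for n >= 2 labelled sequences."""
    if n < 2:
        raise ValueError("need at least two sequences")
    # (2n-3)! / (2^(n-2) * (n-2)!), always an integer
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in range(2, 11):
    print(n, "sequences:", rooted_tree_count(n))
```

Because Python integers have arbitrary precision, the same function can be called for 50 or more sequences without overflow.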
DNA alignments 
• Aligning two sequences: the Needleman–Wunsch 
algorithm 
– Construct a similarity matrix 
– Assign similarity scores based on an arbitrary scoring system 
– Find the best GLOBAL alignment between the two sequences, i.e. the 
maximum number of residues from one sequence that can be 
aligned with the other one
DNA alignments 
A T G T A C C G T 
0 0 0 0 0 0 0 0 0 0 
T 0 
G 0 
A 0 
C 0 
T 0 
C 0 
G 0 
T 0
DNA alignments 
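• The score in one cell is the maximum of different 
possibilities : 
– 0 (a floor that has no effect in the example that follows, where no score is negative) 
– The upper left cell plus the value of the similarity between the 
two residues 
– The upper cell plus the value of a gap (in the upper sequence) 
– The left cell plus the value of a gap (in the left sequence) 
$H_{i,j} = \max \{ H_{i-1,j-1} + s(a_i, b_j),\; H_{i,j-1} + P_g(k),\; H_{i-1,j} + P_g(k) \}$ 
where $s(a_i, b_j)$ is the similarity score of the two residues and $P_g(k)$ the penalty for a gap of length k. 
There is a penalty for gap opening and for gap extension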
DNA alignments 
• For the example we will use the following scoring matrix : 
– Identity : +1 
– Gap : 0 
• In real life ClustalW uses different scoring matrices 
depending on the alphabet (amino acids or DNA) and can be set to use 
word matches (k-tuples). All parameters are editable
DNA alignments 
A T G T A C C G T 
0 0 0 0 0 0 0 0 0 0 
T 0 0 
G 0 
A 0 
C 0 
T 0 
C 0 
G 0 
T 0
DNA alignments 
A T G T A C C G T 
0 0 0 0 0 0 0 0 0 0 
T 0 0 1 1 1 1 1 1 1 1 
G 0 
A 0 
C 0 
T 0 
C 0 
G 0 
T 0
DNA alignments 
A T G T A C C G T 
0 0 0 0 0 0 0 0 0 0 
T 0 0 1 1 1 1 1 1 1 1 
G 0 0 1 
A 0 1 1 
C 0 1 1 
T 0 1 2 
C 0 1 2 
G 0 1 2 
T 0 1 2
DNA alignments 
A T G T A C C G T 
0 0 0 0 0 0 0 0 0 0 
T 0 0 1 1 1 1 1 1 1 1 
G 0 0 1 2 2 2 2 2 2 2 
A 0 1 1 2 2 3 3 3 3 3 
C 0 1 1 2 2 3 4 4 4 4 
T 0 1 2 2 3 3 4 4 4 5 
C 0 1 2 2 3 3 4 5 5 5 
G 0 1 2 3 3 3 4 5 6 6 
T 0 1 2 3 4 4 4 5 6 7
DNA alignments 
• Final alignment (traced back from the bottom-right cell, score 7) : 
A T G T A C - C G T 
- T G - A C T C G T
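The matrix filling and traceback can be sketched in Python as follows. This is a minimal illustration, not ClustalW's implementation: it uses the example scoring (identity +1, gap 0) and assumes that a mismatch also scores 0, which the slides do not state. When several alignments reach the same score, the traceback may return a different one from the alignment shown above.

```python
def needleman_wunsch(top, left, match=1, mismatch=0, gap=0):
    """Global alignment of two sequences with a linear gap score.

    Returns the filled score matrix and the two aligned strings.
    mismatch=0 is an assumption made for this sketch.
    """
    rows, cols = len(left) + 1, len(top) + 1
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):              # first column: only gaps
        H[i][0] = H[i - 1][0] + gap
    for j in range(1, cols):              # first row: only gaps
        H[0][j] = H[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if left[i - 1] == top[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,   # align the two residues
                          H[i - 1][j] + gap,     # gap in the top sequence
                          H[i][j - 1] + gap)     # gap in the left sequence
    # Traceback from the bottom-right cell to the top-left cell
    i, j = rows - 1, cols - 1
    aln_top, aln_left = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if left[i - 1] == top[j - 1] else mismatch
            if H[i][j] == H[i - 1][j - 1] + s:
                aln_top.append(top[j - 1])
                aln_left.append(left[i - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and H[i][j] == H[i - 1][j] + gap:
            aln_top.append("-")
            aln_left.append(left[i - 1])
            i -= 1
        else:
            aln_top.append(top[j - 1])
            aln_left.append("-")
            j -= 1
    return H, "".join(reversed(aln_top)), "".join(reversed(aln_left))

H, aln_top, aln_left = needleman_wunsch("ATGTACCGT", "TGACTCGT")
print("score:", H[-1][-1])
print(aln_top)
print(aln_left)
```

With a gap score of 0 the result is simply the largest number of identical residues that can be put in the same column; realistic settings use negative gap and mismatch scores.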
DNA alignments 
• More advanced alignment methods include : 
– T-COFFEE computes a multiple alignment that is consistent with the 
pairwise alignment scores computed from a variety of sources. 
Computationally intensive (not good for big datasets) 
– MUSCLE is an iterative refinement algorithm. Very fast 
– MAFFT uses the fast Fourier transform to detect homologous 
regions. Very fast 
– Genetic algorithms (e.g. SAGA) generate a population of 
alignments that evolves according to selection and crossing. 
Very slow, but they allow custom scoring functions. They need to 
be run several times (stochastic) 
– Hidden Markov models (HMMs) used to be inaccurate methods. 
They are better now but still slow and difficult to use
DNA alignments 
• Good practice for alignments : 
– Use a variety of algorithms 
– Align at the nucleotide but also at the amino acid level 
(TranslatorX or manually) 
– Compare the different outputs 
– Check manually : 
• Consistency with the given ORF (frame-shifts) 
• Sequencing errors 
– The alignment can also be seen as a hypothesis, 
therefore it needs to make sense from the biological 
point of view : genes have to be HOMOLOGS (share 
ancestry)
Building trees with distance methods 
• The distance between 2 sequences can be calculated in 
different ways: 
– number of differences 
– according to a substitution model 
• The clustering can be achieved in different ways: 
– UPGMA 
– Neighbor-joining 
– (Parsimony)
Building trees with distance methods 
• Building a UPGMA tree with the number of differences : 
1. Calculate the pairwise distance matrix 
A B C D E F 
A 0 1 3 6 7 10 
B 1 0 3 6 7 10 
C 3 3 0 5 6 9 
D 6 6 5 0 1 7 
E 7 7 6 1 0 8 
F 10 10 9 7 8 0
Building trees with distance methods 
• Building a UPGMA tree with the number of differences : 
2. Group the 2 most closely related sequences 
A B C D E F 
A 0 1 3 6 7 10 
B 1 0 3 6 7 10 
C 3 3 0 5 6 9 
D 6 6 5 0 1 7 
E 7 7 6 1 0 8 
F 10 10 9 7 8 0 
Tree so far : A and B are joined, with branch lengths of 0.5 each
Building trees with distance methods 
• Building a UPGMA tree with the number of differences : 
3. Recalculate the distance matrix and take the next smallest distance 
A/B C D E F 
A/B 0 3 6 7 10 
C 3 0 5 6 9 
D 6 5 0 1 7 
E 7 6 1 0 8 
F 10 9 7 8 0 
Tree so far : A and B are joined (branch lengths 0.5 each) ; D and E are joined (branch lengths 0.5 each)
Building trees with distance methods 
• Building a UPGMA tree with the number of differences : 
3. Recalculate the distance matrix and take the next smallest distance 
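A/B C D/E F 
A/B 0 3 6.5 10 
C 3 0 5.5 9 
D/E 6.5 5.5 0 7.5 
F 10 9 7.5 0 
Tree so far : A and B joined (0.5 each), D and E joined (0.5 each) ; C joins (A,B) with a branch of length 1.5, and the internal branch leading to (A,B) has length 1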
Building trees with distance methods 
• Building a UPGMA tree with the number of differences : 
3. Recalculate the distance matrix and take the next smallest distance 
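A/B/C D/E F 
A/B/C 0 6 9.5 
D/E 6 0 7.5 
F 9.5 7.5 0 
Tree so far : (A,B,C) joins (D,E) at height 3 ; the internal branch leading to (A,B,C) has length 1.5 and the one leading to (D,E) has length 2.5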
Building trees with distance methods 
• Building a UPGMA tree with the number of differences : 
3. Recalculate the distance matrix and take the next smallest distance 
A/B/C/D/E F 
A/B/C/D/E 0 8.5 
F 8.5 0 
Final tree : F joins (A,B,C,D,E) at height 4.25 ; F gets a branch of length 4.25 and the internal branch leading to (A,B,C,D,E) has length 1.25
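The clustering steps of this worked example can be reproduced with a short Python sketch. The function name `average_linkage` and its `weight_by_size` switch are invented for the illustration. By default it averages the two merged rows directly, which is what the matrices in the example do (for instance (6.5 + 5.5) / 2 = 6). Strict UPGMA weights each row by the number of sequences in the cluster, which is what `weight_by_size=True` does, and gives slightly different values once clusters of unequal size are merged.

```python
def average_linkage(names, matrix, weight_by_size=False):
    """Agglomerative clustering on a distance matrix.

    Returns a list of (cluster_a, cluster_b, node_height) in merge order.
    """
    size = {name: 1 for name in names}    # cluster label -> number of tips
    d = {(a, b): matrix[i][j]
         for i, a in enumerate(names)
         for j, b in enumerate(names) if i < j}

    def dist(a, b):
        return d[(a, b)] if (a, b) in d else d[(b, a)]

    merges = []
    while len(size) > 1:
        a, b = min(d, key=d.get)          # closest pair (first found on ties)
        height = d[(a, b)] / 2            # each side gets half the distance
        new = a + "/" + b
        w_a, w_b = (size[a], size[b]) if weight_by_size else (1, 1)
        new_d = {c: (w_a * dist(a, c) + w_b * dist(b, c)) / (w_a + w_b)
                 for c in size if c not in (a, b)}
        # drop every distance involving a or b, then add the merged cluster
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        for c, v in new_d.items():
            d[(new, c)] = v
        size[new] = size.pop(a) + size.pop(b)
        merges.append((a, b, height))
    return merges

names = ["A", "B", "C", "D", "E", "F"]
matrix = [[0, 1, 3, 6, 7, 10],
          [1, 0, 3, 6, 7, 10],
          [3, 3, 0, 5, 6, 9],
          [6, 6, 5, 0, 1, 7],
          [7, 7, 6, 1, 0, 8],
          [10, 10, 9, 7, 8, 0]]
for a, b, height in average_linkage(names, matrix):
    print(f"join {a} and {b} at height {height}")
```

The branch length above a cluster is the height of the new node minus the height of the cluster's own node, e.g. 3 - 0.5 = 2.5 for (D,E).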
Building trees with distance methods 
• Assumption of the UPGMA method : constant rate of evolution 
across time and for all branches. This assumption is frequently 
violated in real-life datasets, and UPGMA can therefore return the 
wrong tree. 
• How can we relax this assumption ? We calculate the total 
divergence for each tip and compute a corrected distance matrix 
• Starting from a star-like tree, we create branches to minimize the 
length of the tree and agglomeratively join the closest neighbors 
=> Neighbor-joining
Building trees with distance methods 
• Building a Neighbor-Joining tree with the number of differences 
[Figure : the TRUE topology, where B has accumulated 4 times as 
many mutations as A since their divergence. Terminal branch lengths : 
A 1, B 4, C 2, D 3, E 2, F 4 ; the other branches drawn have length 1]
Building trees with distance methods 
• Building a Neighbor-Joining tree with the number of differences 
[Figure : the same tree and branch lengths as on the previous slide] 
A B C D E F 
A 0 5 4 7 6 8 
B 5 0 7 10 9 11 
C 4 7 0 7 6 8 
D 7 10 7 0 5 9 
E 6 9 6 5 0 8 
F 8 11 8 9 8 0 
UPGMA would cluster A and C together (distance 4) because B is more 
distant from A (distance 5)
Building trees with distance methods 
• A global divergence is calculated by summing all distances, and a 
new distance matrix is computed 
A B C D E F 
A 0 5 4 7 6 8 
B 5 0 7 10 9 11 
C 4 7 0 7 6 8 
D 7 10 7 0 5 9 
E 6 9 6 5 0 8 
F 8 11 8 9 8 0 
Div 30 42 32 38 34 44 
A B C D E F 
A 0 -13 -11.5 -10 -10 -10.5 
B -13 0 -11.5 -10 -10 -10.5 
C -11.5 -11.5 0 -10.5 -10.5 -11 
D -10 -10 -10.5 0 -13 -11.5 
E -10 -10 -10.5 -13 0 -11.5 
F -10.5 -10.5 -11 -11.5 -11.5 0 
Div(A) = Σi dist(A,i) = 5+4+7+6+8 = 30 
Div(B) = Σi dist(B,i) = 5+7+10+9+11 = 42 
Div(C) = Σi dist(C,i) = 32 
Div(D) = Σi dist(D,i) = 38 
Div(E) = Σi dist(E,i) = 34 
Div(F) = Σi dist(F,i) = 44 
$M(i,j) = \mathrm{dist}(i,j) - \frac{\mathrm{Div}(i) + \mathrm{Div}(j)}{N-2}$, with N = 6 sequences here 
M(A,B) = 5-(30+42)/4 = -13 
M(A,C) = 4-(30+32)/4=-11.5 
etc…
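The divergences and the corrected matrix can be recomputed with the sketch below. The function names are invented for the illustration. The branch-length formula in `nj_branch_lengths` is the usual Neighbor-Joining one and is not given on the slides; it is included as an assumption because it recovers the branch lengths 1 and 4 for A and B.

```python
def nj_corrected_matrix(dist):
    """Return (Div, M) with M(i,j) = dist(i,j) - (Div(i) + Div(j)) / (N - 2)."""
    n = len(dist)
    div = [sum(row) for row in dist]
    m = [[0.0 if i == j else dist[i][j] - (div[i] + div[j]) / (n - 2)
          for j in range(n)]
         for i in range(n)]
    return div, m

def nj_branch_lengths(dist, div, i, j):
    """Lengths of the branches from tips i and j to their new common node
    (standard NJ formula, an assumption not shown on the slides)."""
    n = len(dist)
    length_i = dist[i][j] / 2 + (div[i] - div[j]) / (2 * (n - 2))
    return length_i, dist[i][j] - length_i

names = ["A", "B", "C", "D", "E", "F"]
dist = [[0, 5, 4, 7, 6, 8],
        [5, 0, 7, 10, 9, 11],
        [4, 7, 0, 7, 6, 8],
        [7, 10, 7, 0, 5, 9],
        [6, 9, 6, 5, 0, 8],
        [8, 11, 8, 9, 8, 0]]
div, m = nj_corrected_matrix(dist)
print("Div:", dict(zip(names, div)))
print("M(A,B) =", m[0][1], " M(A,C) =", m[0][2])
print("branches A, B:", nj_branch_lengths(dist, div, 0, 1))
```

The pair with the smallest M value is joined first; here A/B and D/E tie at -13, whereas the raw distances would have favoured A and C.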
Building trees with distance methods 
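• Starting with a star-like tree, the nodes are created sequentially 
[Figure : star-like tree of A, B, C, D, E and F, followed by the same tree after A and B have been joined with branch lengths 1 and 4] 
…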
Advantages and disadvantages of 
the Neighbor-Joining method 
• Fast method that will always produce a reasonable tree. Always 
produces the same tree if the same alignment is used 
• Relaxes the most unrealistic assumption of UPGMA 
• Long-branch attraction : two taxa with similar convergent 
properties (increased GC content or high evolutionary rates) will 
tend to group together
How to test the reliability of trees ? 
• One popular method : BOOTSTRAPPING 
– New alignments are generated at random from the original one, by drawing 
positions (columns) with replacement 
– The new alignments have the same length as the original, but a slightly different 
composition (i.e. some positions are represented 
more than once and some positions are omitted) 
– Tree reconstruction is applied to each new alignment 
– Each cluster of the original tree is checked to see how often it 
occurs in the bootstrapped trees. The more often a group appears, the 
higher the bootstrap value supporting that node
How to test the reliability of trees ? 
• Bootstrapping example : 1) The Data 
x y 
1 0.969977 
2 1.744463 
3 3.073277 
4 4.510589 
5 5.471489 
6 5.599175 
7 7.03988 
8 7.812655 
9 8.913299 
10 9.971481 
11 9.98552 
12 10.24078 
13 10.59902 
14 12.61131 
15 12.63132 
16 13.83974 
17 16.03453 
18 17.27271 
19 19.25622 
20 19.26901 
[Figure : scatter plot "Original Data", Y against X (both axes from 0 to 20), with the fitted line y = 0.9176x + 0.2072, R² = 0.9794]
How to test the reliability of trees ? 
• Bootstrapping example : 2) Resampling
How to test the reliability of trees ? 
• Bootstrapping example : 3) Analyse the Resamples
How to test the reliability of trees ? 
• Bootstrapping example : 4) Assess the reliability of the original 
estimates with the dispersion of the estimates of the resamples 
[Figure : scatter plot "Original Data + Bootstraps", Y against X (both axes from 0 to 20)]
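The four steps of this regression example can be sketched in Python with the standard library only. The data are the 20 (x, y) pairs of step 1; the number of replicates and the random seed are arbitrary choices made for the illustration.

```python
import random

xs = list(range(1, 21))
ys = [0.969977, 1.744463, 3.073277, 4.510589, 5.471489, 5.599175, 7.03988,
      7.812655, 8.913299, 9.971481, 9.98552, 10.24078, 10.59902, 12.61131,
      12.63132, 13.83974, 16.03453, 17.27271, 19.25622, 19.26901]

def fit_line(x, y):
    """Least-squares slope and intercept of y against x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bootstrap_slopes(x, y, replicates=1000, seed=1):
    """Refit the line on data points drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < replicates:
        idx = [rng.randrange(n) for _ in range(n)]    # 2) resampling
        if len(set(idx)) < 2:                         # cannot fit a line
            continue
        slopes.append(fit_line([x[i] for i in idx],   # 3) analyse resample
                               [y[i] for i in idx])[0])
    return slopes

slope, intercept = fit_line(xs, ys)
slopes = sorted(bootstrap_slopes(xs, ys))
print(f"original fit: y = {slope:.4f}x + {intercept:.4f}")
# 4) dispersion of the resampled estimates (central 95% of the slopes)
print("95% interval for the slope:", slopes[25], "to", slopes[974])
```

A narrow interval around the original slope indicates that the estimate does not depend on a few particular data points.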
How to test the reliability of trees ? 
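• BOOTSTRAPPING : 
Taxon A : ATGCGAGTTTAGCAG 
Taxon B : ATGCGAGCTTAACTG 
Taxon C : ATACTAGCTTAGCTG 
Taxon D : ATGCTATCTTAGGTG 
[Figure : the original alignment and four resampled alignments s1 to s4, each with the tree inferred from it] 
Support across the 4 resampled alignments : 
A+B : 4/4 = 100% 
C+D : 3/4 = 75% 
A+B+C+D : 4/4 = 100% 
[Figure : final tree of A, B, C and D annotated with the bootstrap values 100, 100 and 75]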
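For an alignment, resampling means drawing columns with replacement. The sketch below does this for the four taxa shown above and turns counts such as 3/4 into a percentage; the function names are invented for the illustration and the tree-building step itself is left out.

```python
import random

alignment = {
    "A": "ATGCGAGTTTAGCAG",
    "B": "ATGCGAGCTTAACTG",
    "C": "ATACTAGCTTAGCTG",
    "D": "ATGCTATCTTAGGTG",
}

def bootstrap_alignment(aln, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(next(iter(aln.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in aln.items()}

def clade_support(found_in_replicates):
    """Percentage of replicate trees in which a given group was found."""
    return 100.0 * sum(found_in_replicates) / len(found_in_replicates)

rng = random.Random(0)                       # arbitrary seed
for k in range(1, 5):
    replicate = bootstrap_alignment(alignment, rng)
    print(f"s{k}:", replicate)
print("C+D support:", clade_support([True, True, True, False]), "%")
```

In practice several hundred replicates are generated and a tree is inferred from each with the same method as for the original alignment.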
Genetic distances 
• A multitude of forces act on sequences (mutation, selection, drift) 
and therefore two sequences coming from a common ancestor will 
diverge with time 
• The problem with counting the number of differences (p-distance) is 
that it does not take into account multiple substitutions at the same 
site 
• Therefore we need to model the substitution process 
=> time-homogeneous, stationary, continuous-time Markov process
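As a point of reference, the p-distance is just the proportion of aligned positions that differ. A minimal sketch, using taxa A and B from the bootstrapping slide and ignoring the treatment of gaps (a simplification made for this example):

```python
def p_distance(seq_a, seq_b):
    """Proportion of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)

taxon_a = "ATGCGAGTTTAGCAG"
taxon_b = "ATGCGAGCTTAACTG"
print(p_distance(taxon_a, taxon_b))   # 3 differences over 15 positions
```

Every hidden substitution (double, back or convergent) is invisible to this count, which is why a substitution model is needed.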
Genetic distances 
Example : double substitution 
ATGTCTTTG ATGTCGTTG 
ATGTCATTG 
* * 
ATGTCATTG ATGTCATTG 
p-distance = 1 but 2 substitutions occurred !
Genetic distances 
Example : back-mutation 
ATGTCTTTG ATGTCATTG 
ATGTCATTG 
* * 
ATGTCATTG ATGTCATTG 
p-distance = 0 but 2 substitutions occurred !
Genetic distances 
Example : convergence 
ATGTCTTTG ATGTCTTTG 
ATGTCATTG 
* 
ATGTCATTG ATGTCTTTG 
p-distance = 0 but 2 substitutions occurred ! 
*
Genetic distances 
• How does the p-distance correlate with speciation time ? 
When we look at the divergence of proteins in distantly related 
organisms, we expect a linear relation (i.e. the more distant the 
organisms, the fewer identities they share) 
=> correct in principle, but we always underestimate the genetic 
distance if we only count the number of differences
Genetic distances 
• How does the p-distance correlate with speciation time ? 
[Figure : observed p difference (vertical axis, 0 to 0.8) plotted over a horizontal axis running from 0 to 3] 
Non-linear relation because of multiple, parallel, and back-substitutions
Genetic distances 
• How to model sequence evolution ? (Jukes and Cantor, 1969) 
– All possible substitutions have the same probability 
– All 4 nucleotides have the same frequency = 25% 
– The chance for a particular substitution is a simple function of time 
– The chance for a nucleotide to not change is therefore a decreasing 
function of time 
– Two random sequences (diverged for an infinite time) will still have 25% 
identity (there are only 4 nucleotides)
Genetic distances 
• The JC69 matrix : 
from \ to   A             C             G             T 
A           1/4 + 3/4 X   1/4 - 1/4 X   1/4 - 1/4 X   1/4 - 1/4 X 
C           1/4 - 1/4 X   1/4 + 3/4 X   1/4 - 1/4 X   1/4 - 1/4 X 
G           1/4 - 1/4 X   1/4 - 1/4 X   1/4 + 3/4 X   1/4 - 1/4 X 
T           1/4 - 1/4 X   1/4 - 1/4 X   1/4 - 1/4 X   1/4 + 3/4 X 
$X = e^{-\mu t}$ 
Sums of columns = sums of rows : the rate of appearance of 
nucleotides is the same as the rate of disappearance (nucleotides are at equilibrium)
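The matrix can be built numerically to see its two limits: at t = 0 nothing has changed, and after a very long time every nucleotide has probability 1/4. This is a small sketch with an invented function name; μ and t are the symbols of the formula for X.

```python
from math import exp

def jc69_probabilities(mu, t):
    """JC69 matrix P[from][to] with X = exp(-mu * t)."""
    x = exp(-mu * t)
    same = 0.25 + 0.75 * x        # diagonal : no net change
    other = 0.25 - 0.25 * x       # any one of the three other nucleotides
    bases = "ACGT"
    return {a: {b: (same if a == b else other) for b in bases} for a in bases}

for t in (0.0, 0.5, 100.0):
    row = jc69_probabilities(1.0, t)["A"]
    print(t, {base: round(p, 4) for base, p in row.items()})
```

Each row sums to 1, since a nucleotide must end up as one of the four bases.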
Genetic distances 
• How to model sequence evolution ? (Jukes and Cantor, 1969) 
– Example : we count 20 differences between two 100bp-long sequences 
• p = 20/100 = 0.2 
• d = -3/4 * ln( 1 - 4/3 * p ) = -3/4 * ln( 0.7333 ) ≈ 0.233 
• => about 23 substitutions are expected per 100 sites, so roughly 3 mutations have occurred that we do not see, because 
they occurred at a position where another mutation had already occurred 
– Does this now efficiently model the substitution process ?
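The correction in this example is a one-line function. The sketch below only restates the JC69 formula; the function name is invented for the illustration, and the check on p reflects the fact that the logarithm is undefined once p reaches 0.75.

```python
from math import log

def jc69_distance(p):
    """JC69-corrected distance for an observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC69 correction")
    return -0.75 * log(1 - 4 * p / 3)

p = 20 / 100
d = jc69_distance(p)
print(f"p = {p}, d = {d:.4f}, hidden substitutions per 100 sites = {100 * (d - p):.1f}")
```

The correction grows quickly with p: the closer the observed proportion gets to 0.75, the less information the sequences carry about their true distance.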
Genetic distances 
• How to model sequence evolution ? Some facts 
– Transitions (purine ↔ purine : A ↔ G ; pyrimidine ↔ pyrimidine : T ↔ C) are more likely than transversions (purine ↔ pyrimidine)
Genetic distances 
• How to model sequence evolution ? Some facts 
– Not all positions evolve at the same rate : 
the chance for an amino acid change is 
different for the third position than for the 
other positions
Genetic distances 
• How to model sequence evolution ? Some facts 
– Not all positions evolve at the same rate : 
some codons are under strong purifying 
selection, while others are under 
diversifying selection 
=> they do not evolve at the same rate
Genetic distances 
• How to model sequence evolution ? 
– Better models have been designed to take into account the individuality 
of each substitution rate. 
– Rate heterogeneity models take into account the differences between 
positions. Some positions are allowed to evolve faster than others 
– Genomes have their own nucleotide compositions (GC-content)
Genetic distances 
• Some models of nucleotide substitution (a to f are the six substitution rates ; A, C, G, T are the nucleotide frequencies) 
- JC69 : a=b=c=d=e=f ; A=C=G=T=1/4 
- K80 : b=e, a=c=d=f ; A=C=G=T=1/4 
- HKY85 : b=e, a=c=d=f ; A ≠ C ≠ G ≠ T 
- TN93 : b, e, a=c=d=f ; A ≠ C ≠ G ≠ T 
- GTR : a, b, c, d, e, f ; A ≠ C ≠ G ≠ T 
More models are possible (12-parameters, codons) but are generally not used
Genetic distances 
• Site heterogeneity models 
– Usually described as a Gamma distribution (discretized in 4 – 10 
categories) 
– An arbitrary proportion of invariant sites is sometimes added 
[Figure : Gamma density curves for k=1, k=1.5, k=3 and k=5, plotted over the range 0 to 5 (vertical axis 0 to 1)]
Genetic distances 
• Which model to choose ? 
– The simplest models make unrealistic assumptions 
– Why don’t we always choose the most complex models ? 
• Difficult to compute 
• Parameter values are difficult to estimate from the data 
• Danger of overfitting
HOLY DATA 
MODEL 
HYPOTHESIS
Genetic distances 
• Overfitting 
[Figure : scatter plot "Measurement of some phenomenon", Output (0 to 90) against Input (0 to 10)]
Genetic distances 
• Overfitting 
R² = 1 !?
Genetic distances 
• How to choose the appropriate model ? 
– Likelihood ratio tests for nested models (more on that later) 
– Many information criteria have been designed (more on that later also) 
– Trial and error, depending on the dataset 
• more data -> more complex model 
• better data -> more complex model 
• Literature search
File formats (flat files) 
• FASTA (.fas, .fst, .fasta) 
Most common sequence format, no header 
>seq1 
ATCGTGCATACGAGCT 
>seq2 
ATCGTGCATACGACGT 
>seq3 
ATCGTGCATACGAAGT
File formats (flat files) 
• NEXUS (.nex) 
Contains blocks with sequence and tree information 
#NEXUS 
Begin Data; 
Dimensions ntax=3 nchar=16; 
Format datatype=Nucleotide gap=-; 
[insert comment here] 
Matrix 
seq1 ATCGTGCATACGAGCT 
seq2 ATCGTGCATACGACGT 
seq3 ATCGTGCATACGAAGT 
; 
End;
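Converting between the two flat-file formats is mostly a matter of rearranging text. The sketch below reads the FASTA example and writes a minimal NEXUS data block; the function names are invented for the illustration, and it assumes aligned sequences of equal length and names without spaces.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()       # header line : sequence name
            records[name] = ""
        elif name is not None:
            records[name] += line         # sequences may span several lines
    return records

def to_nexus(records):
    """Write a minimal NEXUS data block for aligned nucleotide sequences."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same (aligned) length")
    lines = ["#NEXUS",
             "Begin Data;",
             f"Dimensions ntax={len(records)} nchar={lengths.pop()};",
             "Format datatype=Nucleotide gap=-;",
             "Matrix"]
    lines += [f"{name} {seq}" for name, seq in records.items()]
    lines += [";", "End;"]
    return "\n".join(lines)

fasta_text = """>seq1
ATCGTGCATACGAGCT
>seq2
ATCGTGCATACGACGT
>seq3
ATCGTGCATACGAAGT
"""
print(to_nexus(parse_fasta(fasta_text)))
```

Programs such as Seaview and MEGA do this conversion through their export menus; the sketch is only meant to show how little separates the two formats.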
Practicals : Phylogenetics Part I 
1. Download the file « PrimatesNuc_1.txt». Open it and identify its format. Rename it 
with the correct extension. 
2. Load the file in BioEdit and run a multiple alignment (select all sequences then click 
« Accessory application -> ClustalW multiple alignment »). Save the resulting file 
3. Load the original file in Seaview and check alignment options (Align -> Alignment 
options). Select ClustalW2 and run a multiple alignment (Align all). Save the resulting 
file. Then, reload the original data, change the option to Muscle and run the alignment 
again. Save this file too 
4. To generate a consistency-based alignment with T-COFFEE, access the web page 
http://www.tcoffee.org/, submit the original data and save the resulting alignment 
5. To generate an alignment with MAFFT, access the web page 
http://mafft.cbrc.jp/alignment/server/index.html, submit the original data with the 
default options, and save the resulting alignment 
6. Now we can compare the alignments obtained by the different methods. Access the 
web page http://bibiserv.techfak.uni-bielefeld.de/altavist/, select option 2 for 
comparing two alignments and compare the different alignments you produced. Which 
alignment is the most different ? Which are the most similar ? Can you guess why ? 
Open the alignments in BioEdit and spot the differences.
Practicals : Phylogenetics Part I 
7. Open MEGA. Import the MAFFT alignment. Open it as « analyse », consider it 
« nucleotides », « coding sequence », with the standard genetic code. Press F4 to 
open the alignment explorer. Try the different options in the « Statistics » menu. 
8. In the « Models » menu, select « Find Best DNA/Protein Model (ML) ». Leave the 
default options and run. Which model has the best likelihood ? Which model is the 
most appropriate ? 
9. Go to the « Distances » menu and select « Compute pairwise distances ». Now 
select the proper options for this analysis (substitution model and site heterogeneity 
model). Are chimps closer to humans or to gorillas ? (you might need to export the 
data to Excel) 
10. Go to the « Phylogeny » menu and select « Construct/Test UPGMA Tree ». Leave 
the default options and compute. Does the human/chimp/gorilla clustering fit with 
your knowledge ? Redo the analysis with appropriate options (substitution model 
and site heterogeneity model). Does it get any better ? 
11. Go to the « Phylogeny » menu and select « Construct/Test Neighbor-Joining Tree ». 
Select the appropriate options and compute the tree. Do chimps cluster with 
humans or gorillas ? Be prepared to explain your answer 
12. Try the same with the ClustalW alignment. Draw your own conclusions
Practicals : Phylogenetics Part I 
13. Which of these 4 unrooted trees does not have the same topology as the 3 other 
ones ?

More Related Content

PPTX
Secondary protein structure prediction
PPTX
Sequence alig Sequence Alignment Pairwise alignment:-
PPTX
Express sequence tags
PDF
MEGA (Molecular Evolutionary Genetics Analysis)
PPT
PPTX
Protein Threading
PPTX
Structure alignment methods
Secondary protein structure prediction
Sequence alig Sequence Alignment Pairwise alignment:-
Express sequence tags
MEGA (Molecular Evolutionary Genetics Analysis)
Protein Threading
Structure alignment methods

What's hot (20)

PDF
Gene prediction strategies
PDF
Protein structure classification/domain prediction: SCOP and CATH (Bioinforma...
PDF
Gene prediction method
PDF
PPT
Analysis of gene expression
PPTX
gene prediction programs
DOCX
UniProt
PPTX
Sequence homology search and multiple sequence alignment(1)
PDF
Bioinformatics data mining
PPT
methods for protein structure prediction
PPTX
Chou fasman algorithm for protein structure prediction
PDF
Phylogenetic analysis
PPTX
Isolation, purification and characterisation of protein
PPTX
Types of genomics ppt
PPTX
Electrophoretic mobility shift assay
PPTX
PPTX
Map based cloning
PPTX
MULTIPLE SEQUENCE ALIGNMENT
PPTX
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
Gene prediction strategies
Protein structure classification/domain prediction: SCOP and CATH (Bioinforma...
Gene prediction method
Analysis of gene expression
gene prediction programs
UniProt
Sequence homology search and multiple sequence alignment(1)
Bioinformatics data mining
methods for protein structure prediction
Chou fasman algorithm for protein structure prediction
Phylogenetic analysis
Isolation, purification and characterisation of protein
Types of genomics ppt
Electrophoretic mobility shift assay
Map based cloning
MULTIPLE SEQUENCE ALIGNMENT
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
Ad

Viewers also liked (16)

PPTX
Algorithm research project neighbor joining
DOC
Distance
PPTX
Molecular phylogenetics
PPTX
Distance based method
PPTX
2015 bioinformatics phylogenetics_wim_vancriekinge
PPT
Phylogenetics2
PPTX
PDF
Introduction to Probabilistic Models for Bioinformatics
PDF
BIS2C. Biodiversity and the Tree of Life. 2014. L4. Inferring Phylogenetic Trees
PPT
Phylogeny
PPS
Phylogenetic tree
PPT
What is a phylogenetic tree
PPT
Phylogenetic analysis
PPT
Phylogenetic trees
PDF
Phylogenetics Analysis in R
PPTX
Parsimony analysis
Algorithm research project neighbor joining
Distance
Molecular phylogenetics
Distance based method
2015 bioinformatics phylogenetics_wim_vancriekinge
Phylogenetics2
Introduction to Probabilistic Models for Bioinformatics
BIS2C. Biodiversity and the Tree of Life. 2014. L4. Inferring Phylogenetic Trees
Phylogeny
Phylogenetic tree
What is a phylogenetic tree
Phylogenetic analysis
Phylogenetic trees
Phylogenetics Analysis in R
Parsimony analysis
Ad

Similar to Phylogenetics1 (20)

PDF
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES
PDF
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...
PDF
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...
PDF
EVE161: Microbial Phylogenomics - Class 4 - Phylogeny
PPTX
human phylogetic contrution of evolution tree.pptx
PPT
Plant Molecular Systematics Phylogenetics.ppt
PPT
phylogenetics (1)...............................ppt
PPT
distance based phylogenetics-methodology
PPTX
Phylogenetic tree construction
PPTX
PPTX
Tools in phylogeny
PPT
6238578.ppt
PPTX
BTC 506 Phylogenetic Analysis.pptx
PPTX
Msa & rooted/unrooted tree
PDF
Multiple sequence alignment
PPTX
Presentation about phylogenetic tree and its construction methods.
PPTX
Upgma
PPT
Multiple Sequence Alignment-just glims of viewes on bioinformatics.
PPTX
BioINfo.pptx
PPTX
Virus Sequence Alignment and Phylogenetic Analysis 2019
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...
EVE161: Microbial Phylogenomics - Class 4 - Phylogeny
human phylogetic contrution of evolution tree.pptx
Plant Molecular Systematics Phylogenetics.ppt
phylogenetics (1)...............................ppt
distance based phylogenetics-methodology
Phylogenetic tree construction
Tools in phylogeny
6238578.ppt
BTC 506 Phylogenetic Analysis.pptx
Msa & rooted/unrooted tree
Multiple sequence alignment
Presentation about phylogenetic tree and its construction methods.
Upgma
Multiple Sequence Alignment-just glims of viewes on bioinformatics.
BioINfo.pptx
Virus Sequence Alignment and Phylogenetic Analysis 2019

Recently uploaded (20)

PPT
2011 HCRP presentation-final.pptjrirrififfi
PPTX
Stats annual compiled ipd opd ot br 2024
PPTX
DAA UNIT 1 for unit 1 time compixity PPT.pptx
PPTX
Overview_of_Computing_Presentation.pptxxx
PPTX
cyber row.pptx for cyber proffesionals and hackers
PPTX
inbound6529290805104538764.pptxmmmmmmmmm
PDF
PPT nikita containers of the company use
PPTX
1.Introduction to orthodonti hhhgghhcs.pptx
PDF
9 FinOps Tools That Simplify Cloud Cost Reporting.pdf
PDF
General category merit rank list for neet pg
PPTX
research framework and review of related literature chapter 2
PDF
book-34714 (2).pdfhjkkljgfdssawtjiiiiiujj
PPTX
9 Bioterrorism.pptxnsbhsjdgdhdvkdbebrkndbd
PDF
Q1-wK1-Human-and-Cultural-Variation-sy-2024-2025-Copy-1.pdf
PPTX
DIGITAL DESIGN AND.pptx hhhhhhhhhhhhhhhhh
PPTX
Fkrjrkrkekekekeekkekswkjdjdjddwkejje.pptx
PPTX
Bussiness Plan S Group of college 2020-23 Final
PDF
Teal Blue Futuristic Metaverse Presentation.pdf
PPTX
Chapter security of computer_8_v8.1.pptx
PPTX
Sistem Informasi Manejemn-Sistem Manajemen Database
2011 HCRP presentation-final.pptjrirrififfi
Stats annual compiled ipd opd ot br 2024
DAA UNIT 1 for unit 1 time compixity PPT.pptx
Overview_of_Computing_Presentation.pptxxx
cyber row.pptx for cyber proffesionals and hackers
inbound6529290805104538764.pptxmmmmmmmmm
PPT nikita containers of the company use
1.Introduction to orthodonti hhhgghhcs.pptx
9 FinOps Tools That Simplify Cloud Cost Reporting.pdf
General category merit rank list for neet pg
research framework and review of related literature chapter 2
book-34714 (2).pdfhjkkljgfdssawtjiiiiiujj
9 Bioterrorism.pptxnsbhsjdgdhdvkdbebrkndbd
Q1-wK1-Human-and-Cultural-Variation-sy-2024-2025-Copy-1.pdf
DIGITAL DESIGN AND.pptx hhhhhhhhhhhhhhhhh
Fkrjrkrkekekekeekkekswkjdjdjddwkejje.pptx
Bussiness Plan S Group of college 2020-23 Final
Teal Blue Futuristic Metaverse Presentation.pdf
Chapter security of computer_8_v8.1.pptx
Sistem Informasi Manejemn-Sistem Manajemen Database

Phylogenetics1

  • 1. Phylogenetics Workshop Part I : Introduction De Landtsheer Sébastien, University of Luxemburg Ahead of the BeNeLux Bioinformatics Conference 2011
  • 2. Outline of the Workshop Part I : • General introduction • Alignments • Distance-based methods Part II : • Maximum likelihood trees • Bayesian trees Part III : • Advanced bayesian phylogenetics • Hypothesis testing
  • 3. Outline of Part I • General introduction : what is phylogenetics ? • Basic DNA alignment algorithm • Distance matrices • Distance-based tree inference methods
  • 4. Software featured in Part I • Seaview (http://pbil.univ-lyon1.fr/software/seaview.html) • BioEdit (http://www.mbio.ncsu.edu/bioedit/bioedit.html) • MEGA (http://www.megasoftware.net/) • FigTree (http://tree.bio.ed.ac.uk/software/figtree/)
  • 5. What is Phylogenetics ? • Classification of living species into categories • Study of characters → states • Underlying assumption of evolution (cladogram / dendrogram)
  • 6. What is Phylogenetics • Characters : – Morphological – Biochemical – Genetic • States : – Continuous – Discontinous
  • 7. Different types of Phylogenetic trees • Phylogenetic tree : graphical representation of our hypothesis about the evolution of a group of organisms • Can represent different quantities (time/genetic distance) and be displayed in different ways • There are several possible methods, and there is no single method that is best
  • 8. Phylogenetic trees jargon Internal branches Root (if there is) Node Terminal branches Leaves or Tips or OTUs
  • 9. Properties of Phylogenetic trees • Rooted vs Unrooted
  • 10. Properties of Phylogenetic trees • The real face of unrooted trees =
  • 11. Properties of Phylogenetic trees • The real face of unrooted trees : undirected = Multiple possibilities for rooting the tree
  • 12. Properties of Phylogenetic trees • Where to place the root ? – Midpoint rooting : equally distant from the two most distantly related taxa on the tree. Makes sense but more often than not it is wrong – Outgroup : using one distantly related taxon (uncontroversial) • Marsupial for eutherian study • Treeshrew for primate study • SIV for HIV study
  • 13. Properties of Phylogenetic trees • How to root unrooted trees ? 1) Midpoint rooting = Assumes that the rates of evolution have stayed +/- constant
  • 14. Properties of Phylogenetic trees • How to root unrooted trees ? 2) Using an outgroup = Problem : difficult to find the proper outgroup (not ambiguous choice but still not too distant)
  • 15. Properties of Phylogenetic trees • Rooted trees tell a story (directed) Most Recent Common Ancestor (MRCA)
  • 16. Properties of Phylogenetic trees • Branch swapping : only horizontal distance matters =
  • 17. Properties of Phylogenetic trees • Many topologies are always possible : Number of possible rooted trees for n sequences = (2n-3)! / (2n-2 (n-2))! 2 sequences: 1 3 sequences: 3 4 sequences: 15 5 sequences: 105 6 sequences: 954 7 sequences: 10395 8 sequences: 135135 9 sequences: 2027025 10 sequences: 34459425 51 sequences: >1080 (nb of particles in the universe)
  • 18. DNA alignments • Aligning two sequences: the Needleman–Wunsch algorithm – Construct a similarity matrix – Assign similarity scores based on an arbitrary scoring system – Find the best GLOBAL alignment between two sequences = the maximum number of residues from one sequence that can be aligned with the other one
  • 19. DNA alignments • Empty similarity matrix (first row and first column set to 0) :
             A  T  G  T  A  C  C  G  T
          0  0  0  0  0  0  0  0  0  0
       T  0
       G  0
       A  0
       C  0
       T  0
       C  0
       G  0
       T  0
  • 20. DNA alignments • The score in one cell is the maximum of different possibilities : – 0 (this floor belongs to local alignment ; it makes no difference with the non-negative scores used in the example) – The upper left cell plus the value of the similarity between the two residues – The upper cell plus the value of a gap (in the upper sequence) – The left cell plus the value of a gap (in the left sequence)
      $H_{i,j} = \max\{\, H_{i-1,j-1} + s(a_i, b_j),\; H_{i,j-1} + P_g(k),\; H_{i-1,j} + P_g(k) \,\}$
      There is a penalty for gap opening and for gap extension
  • 21. DNA alignments • For the example we will use the following scoring matrix : – Identity : +1 – Gap : 0 (a mismatch also scores 0, as the matrix values that follow imply) • In real life ClustalW uses different scoring matrices depending on the sequence type (amino acids or DNA) and can be set to use word matches (k-tuples). All parameters are editable
  • 22. DNA alignments • Filling the matrix, first cell : T (left sequence) against A (top sequence) is a mismatch, so the cell scores 0
  • 23. DNA alignments • First row filled (T against A T G T A C C G T) : 0 1 1 1 1 1 1 1 1
  • 24. DNA alignments • First two columns filled (rows T G A C T C G T) : column A = 0 0 1 1 1 1 1 1 ; column T = 1 1 1 1 2 2 2 2
  • 25. DNA alignments • Completed matrix :
             A  T  G  T  A  C  C  G  T
          0  0  0  0  0  0  0  0  0  0
       T  0  0  1  1  1  1  1  1  1  1
       G  0  0  1  2  2  2  2  2  2  2
       A  0  1  1  2  2  3  3  3  3  3
       C  0  1  1  2  2  3  4  4  4  4
       T  0  1  2  2  3  3  4  4  4  5
       C  0  1  2  2  3  3  4  5  5  5
       G  0  1  2  3  3  3  4  5  6  6
       T  0  1  2  3  4  4  4  5  6  7
  • 27. DNA alignments • Final alignment (7 identities) :
       A T G T A C - C G T
       - T G - A C T C G T
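The matrix fill and traceback described on slides 18 to 27 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the workshop's own code: the function name `needleman_wunsch` is hypothetical, the gap cost is a single linear value (no separate opening and extension penalties), and mismatches score 0 like gaps. When several alignments reach the same score, the one returned depends on the tie-breaking order in the traceback, so it need not be identical to the alignment shown on slide 27.

```python
def needleman_wunsch(top, left, match=1, mismatch=0, gap=0):
    """Global alignment with a linear gap cost.

    Returns (matrix, aligned_top, aligned_left).
    """
    rows, cols = len(left) + 1, len(top) + 1
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: leading gaps
        H[i][0] = H[i - 1][0] + gap
    for j in range(1, cols):          # first row: leading gaps
        H[0][j] = H[0][j - 1] + gap

    def score(i, j):
        return match if left[i - 1] == top[j - 1] else mismatch

    # Fill: each cell is the best of diagonal, upper and left neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(H[i - 1][j - 1] + score(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)

    # Traceback from the bottom-right cell to the top-left cell.
    out_top, out_left = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + score(i, j):
            out_top.append(top[j - 1])
            out_left.append(left[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            out_top.append("-")       # gap in the top sequence
            out_left.append(left[i - 1])
            i -= 1
        else:
            out_top.append(top[j - 1])
            out_left.append("-")      # gap in the left sequence
            j -= 1
    return H, "".join(reversed(out_top)), "".join(reversed(out_left))


if __name__ == "__main__":
    H, a, b = needleman_wunsch("ATGTACCGT", "TGACTCGT")
    for row in H:
        print(" ".join(f"{v:2d}" for v in row))
    print(a)
    print(b)
```

With identity = +1 and everything else = 0, the bottom-right cell is simply the largest number of residues that can be matched in order, which is why the slide describes the result as "the maximum number of residues from one sequence that can be aligned with the other one".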
  • 28. DNA alignments • More sophisticated alignment methods include : – T-COFFEE computes a multiple alignment that is consistent with pairwise alignment scores computed from a variety of sources. Computationally intensive (not good for big datasets) – MUSCLE is an iterative refinement algorithm. Very fast – MAFFT uses the fast Fourier transform to detect homologous regions. Very fast – Genetic algorithms (e.g. SAGA) generate a population of alignments that evolves through selection and crossing. Very slow, but custom scoring functions can be defined. Need to be run several times (stochastic) – Hidden Markov models (HMMs) used to be inaccurate. They are better now but still slow and difficult to use
  • 29. DNA alignments • Good practice for alignments : – Use a variety of algorithms – Align at the nucleotide level but also at the amino acid level (TranslatorX or manually) – Compare the different outputs – Check manually : • Consistency with the given ORF (frame-shifts) • Sequencing errors – The alignment can also be seen as a hypothesis, therefore it needs to make sense from the biological point of view : genes have to be HOMOLOGS (share ancestry)
  • 30. Building trees with distance methods • The distance between 2 sequences can be calculated in different ways: – number of differences – according to a substitution model • The clustering can be achieved in different ways: – UPGMA – Neighbor-joining – (Parsimony)
  • 31. Building trees with distance methods • Building a UPGMA tree with the number of differences : 1. Calculate the pairwise distance matrix
             A   B   C   D   E   F
         A   0   1   3   6   7  10
         B   1   0   3   6   7  10
         C   3   3   0   5   6   9
         D   6   6   5   0   1   7
         E   7   7   6   1   0   8
         F  10  10   9   7   8   0
  • 32. Building trees with distance methods • Building a UPGMA tree with the number of differences : 2. Group the 2 most closely related sequences
             A   B   C   D   E   F
         A   0   1   3   6   7  10
         B   1   0   3   6   7  10
         C   3   3   0   5   6   9
         D   6   6   5   0   1   7
         E   7   7   6   1   0   8
         F  10  10   9   7   8   0
      Tree so far : A and B joined, branch lengths 0.5 and 0.5
  • 33. Building trees with distance methods • Building a UPGMA tree with the number of differences : 3. Recalculate the distance matrix and take the next smallest distance
             A/B   C   D   E   F
         A/B   0   3   6   7  10
         C     3   0   5   6   9
         D     6   5   0   1   7
         E     7   6   1   0   8
         F    10   9   7   8   0
      Tree so far : A and B joined (0.5, 0.5) ; D and E joined (0.5, 0.5)
  • 34. Building trees with distance methods • Building a UPGMA tree with the number of differences : 3. Recalculate the distance matrix and take the next smallest distance
              A/B    C    D/E    F
         A/B   0     3    6.5   10
         C     3     0    5.5    9
         D/E   6.5   5.5  0      7.5
         F    10     9    7.5    0
      Tree so far : A and B joined (0.5, 0.5) ; D and E joined (0.5, 0.5) ; C joins A/B (branch to C 1.5, branch to the A/B node 1)
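  • 35. Building trees with distance methods • Building a UPGMA tree with the number of differences : 3. Recalculate the distance matrix and take the next smallest distance
                A/B/C   D/E   F
         A/B/C   0      6     9.5
         D/E     6      0     7.5
         F       9.5    7.5   0
      (these cluster distances are the simple average of the two merged rows ; weighting by cluster size, as UPGMA strictly does, gives 6.17 and 9.67)
      Tree so far : A/B/C joins D/E (branch to the A/B/C node 1.5, branch to the D/E node 2.5)
  • 36. Building trees with distance methods • Building a UPGMA tree with the number of differences : 3. Recalculate the distance matrix and take the next smallest distance
                    A/B/C/D/E   F
         A/B/C/D/E   0          8.5
         F           8.5        0
      (8.8 when weighting by cluster size)
      Final tree : F joins A/B/C/D/E (branch to F 4.25, branch to the A/B/C/D/E node 1.25)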
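The clustering loop of slides 31 to 36 can be sketched in Python. This is a minimal sketch with a hypothetical function name `upgma`; it assumes the textbook UPGMA definition, where the distance between two clusters is the mean over all pairs of their members. For that reason the last two merges come out at distances 6.17 and 8.8 rather than the 6 and 8.5 shown on the slides, which average the two merged rows directly.

```python
from itertools import combinations


def upgma(names, D):
    """Return the list of merges as (cluster_name, node_height) tuples.

    names: list of tip labels; D: symmetric matrix (list of lists) of distances.
    """
    clusters = [(i,) for i in range(len(names))]   # each cluster = tuple of tip indices
    merges = []

    def avg(c1, c2):
        # Mean distance over all pairs of tips, one from each cluster.
        return sum(D[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Pick the closest pair of clusters (first one found in case of ties).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg(clusters[p[0]], clusters[p[1]]))
        d = avg(clusters[i], clusters[j])
        merged = tuple(sorted(clusters[i] + clusters[j]))
        # Under a constant rate, the new node sits at half the cluster distance.
        merges.append(("".join(names[k] for k in merged), d / 2))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


if __name__ == "__main__":
    names = list("ABCDEF")
    D = [[0, 1, 3, 6, 7, 10],
         [1, 0, 3, 6, 7, 10],
         [3, 3, 0, 5, 6, 9],
         [6, 6, 5, 0, 1, 7],
         [7, 7, 6, 1, 0, 8],
         [10, 10, 9, 7, 8, 0]]
    for name, height in upgma(names, D):
        print(f"{name}: node height {height:.3f}")
```

The branch length leading to a cluster is the height of the new node minus the height of the cluster's own node, which is how the 1 and 1.5 on slide 34 arise from heights 1.5 and 0.5.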
  • 37. Building trees with distance methods • Assumption of the UPGMA method : constant rate of evolution across time and for all branches. This assumption is frequently violated in real-life datasets and therefore the UPGMA can find a wrong tree. • How can we relax this assumption ? We calculate the total divergence for each tip and compute a corrected distance matrix • Starting from a star-like tree, we create branches to minimize the length of the tree and agglomeratively join the closest neighbors => Neighbor-joining
  • 38. Building trees with distance methods • Building a Neighbor-Joining tree with the number of differences • TRUE topology, where B has accumulated 4 times as many mutations as A since their divergence
      [Figure : true tree for A to F with branch lengths ; the branch to A has length 1 and the branch to B has length 4]
  • 39. Building trees with distance methods • Building a Neighbor-Joining tree with the number of differences
      [Figure : the same true tree with branch lengths]
             A   B   C   D   E   F
         A   0   5   4   7   6   8
         B   5   0   7  10   9  11
         C   4   7   0   7   6   8
         D   7  10   7   0   5   9
         E   6   9   6   5   0   8
         F   8  11   8   9   8   0
      UPGMA would cluster A and C together because B is more distant
  • 40. Building trees with distance methods • A global divergence is calculated by summing all distances, and a new distance matrix is computed
               A   B   C   D   E   F
         A     0   5   4   7   6   8
         B     5   0   7  10   9  11
         C     4   7   0   7   6   8
         D     7  10   7   0   5   9
         E     6   9   6   5   0   8
         F     8  11   8   9   8   0
         Div  30  42  32  38  34  44
      Div(A) = Σi dist(A,i) = 5+4+7+6+8 = 30
      Div(B) = Σi dist(B,i) = 5+7+10+9+11 = 42
      Div(C) = Σi dist(C,i) = 32
      Div(D) = Σi dist(D,i) = 38
      Div(E) = Σi dist(E,i) = 34
      Div(F) = Σi dist(F,i) = 44
      $M(i,j) = \mathrm{dist}(i,j) - \dfrac{\mathrm{Div}(i) + \mathrm{Div}(j)}{N-2}$, with N = 6 sequences
      M(A,B) = 5-(30+42)/4 = -13
      M(A,C) = 4-(30+32)/4 = -11.5 etc…
               A      B      C      D      E      F
         A     0    -13    -11.5  -10    -10    -10.5
         B   -13      0    -11.5  -10    -10    -10.5
         C   -11.5  -11.5    0    -10.5  -10.5  -11
         D   -10    -10    -10.5    0    -13    -11.5
         E   -10    -10    -10.5  -13      0    -11.5
         F   -10.5  -10.5  -11    -11.5  -11.5    0
  • 41. Building trees with distance methods • Starting with a star-like tree, the nodes are created sequentially
      [Figure : star tree joining A, B, C, D, E and F ; after the first step A and B are joined with branch lengths 1 and 4] …
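The divergence values and the corrected matrix M on this slide can be recomputed with the sketch below. The function name `nj_first_step` is hypothetical. The branch-length formula used for the joined pair, dist(i,j)/2 + (Div(i) - Div(j)) / (2(N-2)), is the standard neighbor-joining expression and is not stated on the slide; it is included here as an assumption so that the branch lengths 1 and 4 for A and B can be checked.

```python
def nj_first_step(names, D):
    """First neighbor-joining step: divergences, corrected matrix M,
    the pair to join, and the branch lengths from that pair to the new node."""
    n = len(names)
    div = [sum(row) for row in D]                  # Div(i) = sum of distances to all others
    M = {}
    for i in range(n):
        for j in range(i + 1, n):
            M[(i, j)] = D[i][j] - (div[i] + div[j]) / (n - 2)
    i, j = min(M, key=M.get)                       # smallest corrected distance (first on ties)
    # Standard NJ branch lengths (assumption: not given on the slide).
    len_i = D[i][j] / 2 + (div[i] - div[j]) / (2 * (n - 2))
    len_j = D[i][j] - len_i
    return div, M, (names[i], names[j]), (len_i, len_j)


if __name__ == "__main__":
    names = list("ABCDEF")
    D = [[0, 5, 4, 7, 6, 8],
         [5, 0, 7, 10, 9, 11],
         [4, 7, 0, 7, 6, 8],
         [7, 10, 7, 0, 5, 9],
         [6, 9, 6, 5, 0, 8],
         [8, 11, 8, 9, 8, 0]]
    div, M, pair, lengths = nj_first_step(names, D)
    print("Div:", div)
    print("join:", pair, "branch lengths:", lengths)
```

Although A and C have the smallest raw distance (4), the correction by total divergence makes A/B (and D/E) the smallest entries of M, which is how neighbor-joining avoids the UPGMA mistake of grouping A with C.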
  • 42. Advantages and disadvantages of the Neighbor-Joining method • Fast method that will always produce a reasonable tree. Always produces the same tree if the same alignment is used • Relaxes the most unrealistic assumptions of UPGMA • Long-branch attraction : two taxa with similar converging properties (increased GC content or high evolutionary rates) will tend to group together
  • 43. How to test the reliability of trees ? • One popular method : BOOTSTRAPPING – Randomly generate new alignments from the original one, by drawing positions with replacement – The new alignments will have the same length, but a slightly different composition than the original one (i.e. some positions will be represented more than once and some positions will be omitted) – Tree reconstruction is applied to these new alignments – The clusters in the original tree are examined to see how often they occur in the bootstrapped trees. The more often a group appears, the higher the bootstrap value supporting that node
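The resampling step described above, drawing alignment positions with replacement, can be sketched as follows. This is an illustrative sketch: the function name `bootstrap_alignment` is made up, the alignment is held as a plain dictionary of equal-length strings, and the four short sequences are only there as example input. Tree building on each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    rng: a random.Random instance, so that runs can be made reproducible.
    """
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same length")
    # Draw column indices with replacement; the same columns are used for every taxon.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


if __name__ == "__main__":
    alignment = {
        "A": "ATGCGAGTTTAGCAG",
        "B": "ATGCGAGCTTAACTG",
        "C": "ATACTAGCTTAGCTG",
        "D": "ATGCTATCTTAGGTG",
    }
    rng = random.Random(1)
    for k in range(3):
        print(f"replicate {k + 1}")
        for name, seq in bootstrap_alignment(alignment, rng).items():
            print(f"  {name} {seq}")
```

Sampling whole columns, rather than shuffling each sequence separately, keeps the homology relationships between taxa intact within each resampled position.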
  • 44. How to test the reliability of trees ? • Bootstrapping example : 1) The Data (x, y) :
      (1, 0.969977) (2, 1.744463) (3, 3.073277) (4, 4.510589) (5, 5.471489)
      (6, 5.599175) (7, 7.03988) (8, 7.812655) (9, 8.913299) (10, 9.971481)
      (11, 9.98552) (12, 10.24078) (13, 10.59902) (14, 12.61131) (15, 12.63132)
      (16, 13.83974) (17, 16.03453) (18, 17.27271) (19, 19.25622) (20, 19.26901)
      [Figure : scatter plot "Original Data" with fitted line y = 0.9176x + 0.2072, R² = 0.9794 ; X and Y axes from 0 to 20]
  • 45. How to test the reliability of trees ? • Bootstrapping example : 2) Resampling
  • 46. How to test the reliability of trees ? • Bootstrapping example : 3) Analyse the Resamples
  • 47. How to test the reliability of trees ? • Bootstrapping example : 4) Assess the reliability of the original estimates with the dispersion of the estimates of the resamples
      [Figure : scatter plot "Original Data + Bootstraps" ; X and Y axes from 0 to 20]
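The four steps of the regression example (data, resampling, analysis of the resamples, dispersion of the estimates) can be sketched in plain Python. The 20 data points are those listed on slide 44; the function names `fit_line` and `bootstrap_slopes`, the number of replicates and the random seed are assumptions made for illustration.

```python
import random

XS = list(range(1, 21))
YS = [0.969977, 1.744463, 3.073277, 4.510589, 5.471489,
      5.599175, 7.03988, 7.812655, 8.913299, 9.971481,
      9.98552, 10.24078, 10.59902, 12.61131, 12.63132,
      13.83974, 16.03453, 17.27271, 19.25622, 19.26901]


def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def bootstrap_slopes(x, y, n_rep, rng):
    """Refit the line on n_rep resamples of the (x, y) pairs, drawn with replacement."""
    n = len(x)
    slopes = []
    for _ in range(n_rep):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [x[i] for i in idx]
        by = [y[i] for i in idx]
        if len(set(bx)) < 2:      # a line cannot be fitted to a single x value
            continue
        slopes.append(fit_line(bx, by)[0])
    return slopes


if __name__ == "__main__":
    slope, intercept = fit_line(XS, YS)
    print(f"original fit: y = {slope:.4f}x + {intercept:.4f}")
    slopes = sorted(bootstrap_slopes(XS, YS, 1000, random.Random(42)))
    print(f"bootstrap slopes: min {slopes[0]:.3f}, max {slopes[-1]:.3f}")
```

The spread of the bootstrap slopes plays the same role as bootstrap values on a tree: a narrow spread suggests the original estimate is well supported by the data.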
  • 48. How to test the reliability of trees ? • BOOTSTRAPPING :
      Taxon A : ATGCGAGTTTAGCAG
      Taxon B : ATGCGAGCTTAACTG
      Taxon C : ATACTAGCTTAGCTG
      Taxon D : ATGCTATCTTAGGTG
      [Figure : original alignment and resampled alignments s1, s2, s3 and s4, each with its tree of A, B, C and D]
      A+B : 4/4 = 100%
      C+D : 3/4 = 75%
      A+B+C+D : 4/4 = 100%
      [Figure : original tree of A, B, C and D annotated with bootstrap values 100, 100 and 75]
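Counting how often each group of the original tree reappears in the bootstrapped trees can be sketched as below. This is an illustrative sketch: the function name `clade_support` is hypothetical, each tree is reduced to the set of clades it contains, and the four replicate trees are invented so that they reproduce the percentages on the slide (the real replicate trees are only visible in the slide's figure).

```python
def clade_support(reference_clades, replicate_trees):
    """Percentage of replicate trees that contain each reference clade.

    A clade is a frozenset of taxon names; a tree is a set of clades.
    """
    n = len(replicate_trees)
    return {clade: 100.0 * sum(clade in tree for tree in replicate_trees) / n
            for clade in reference_clades}


if __name__ == "__main__":
    AB, CD = frozenset("AB"), frozenset("CD")
    ABC, ABCD = frozenset("ABC"), frozenset("ABCD")

    # Hypothetical replicates: three recover ((A,B),(C,D)), one gives (((A,B),C),D).
    replicates = [
        {AB, CD, ABCD},
        {AB, CD, ABCD},
        {AB, CD, ABCD},
        {AB, ABC, ABCD},
    ]
    for clade, pct in clade_support([AB, CD, ABCD], replicates).items():
        print("+".join(sorted(clade)), f"{pct:.0f}%")
```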
  • 49. Genetic distances • A multitude of forces act on sequences (mutation, selection, drift) and therefore two sequences coming from a common ancestor will diverge with time • The problem with counting the number of differences (p-distance) is that it does not take into account multiple substitutions at the same site • Therefore we need to model the substitution process => time-homogeneous, stationary, continuous-time Markov process
  • 50. Genetic distances Example : double substitution ATGTCTTTG ATGTCGTTG ATGTCATTG * * ATGTCATTG ATGTCATTG 1 observed difference (p-distance = 1/9) but 2 substitutions occurred !
  • 51. Genetic distances Example : back-mutation ATGTCTTTG ATGTCATTG ATGTCATTG * * ATGTCATTG ATGTCATTG p-distance = 0 but 2 substitutions occurred !
  • 52. Genetic distances Example : convergence ATGTCTTTG ATGTCTTTG ATGTCATTG * * ATGTCATTG ATGTCTTTG p-distance = 0 but 2 substitutions occurred !
  • 53. Genetic distances • How does the p-distance correlate with speciation time ? When we look at the divergence of proteins in distantly related organisms, we expect a linear relation (i.e. the more distant the organisms, the fewer identities they share) => correct, but we always underestimate the genetic distance if we only count the number of differences
  • 54. Genetic distances • How does the p-distance correlate with speciation time ?
      [Figure : observed p difference (y axis, 0 to 0.8) plotted over an x axis running from 0 to 3 ; the curve flattens]
      Non-linear relation because of multiple, parallel, and back-substitutions
  • 55. Genetic distances • How to model sequence evolution ? (Jukes and Cantor, 1969) – All possible substitutions have the same probability – All 4 nucleotides have the same frequency = 25% – The chance for a particular substitution is a simple function of time – The chance for a nucleotide to not change is therefore a decreasing function of time – Two random sequences (diverged for an infinite time) will still have 25% identity (there are only 4 nucleotides)
  • 56. Genetic distances • The JC69 matrix, with $X = e^{-\mu t}$ :
         from \ to   A            C            G            T
         A           1/4 + 3/4 X  1/4 - 1/4 X  1/4 - 1/4 X  1/4 - 1/4 X
         C           1/4 - 1/4 X  1/4 + 3/4 X  1/4 - 1/4 X  1/4 - 1/4 X
         G           1/4 - 1/4 X  1/4 - 1/4 X  1/4 + 3/4 X  1/4 - 1/4 X
         T           1/4 - 1/4 X  1/4 - 1/4 X  1/4 - 1/4 X  1/4 + 3/4 X
      Sums of columns = sums of rows : the rate of appearance of nucleotides is the same as the rate of disappearance (nucleotides are at equilibrium)
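The matrix above can be generated and checked numerically. This is a small sketch with a hypothetical function name `jc69_matrix`; it uses the parametrisation of the slide, X = exp(-μt), and simply verifies that every row sums to 1 and that all entries tend to 1/4 for long times.

```python
import math


def jc69_matrix(mu, t):
    """JC69 transition probabilities P[from][to] after time t, with X = exp(-mu * t)."""
    x = math.exp(-mu * t)
    same = 0.25 + 0.75 * x      # probability of observing the same nucleotide
    diff = 0.25 - 0.25 * x      # probability of observing one particular other nucleotide
    return {a: {b: (same if a == b else diff) for b in "ACGT"} for a in "ACGT"}


if __name__ == "__main__":
    for t in (0.0, 0.5, 5.0, 50.0):
        P = jc69_matrix(mu=1.0, t=t)
        row_sum = sum(P["A"].values())
        print(f"t={t:5.1f}  P(A->A)={P['A']['A']:.4f}  "
              f"P(A->C)={P['A']['C']:.4f}  row sum={row_sum:.4f}")
```

At t = 0 the matrix is the identity (nothing has changed yet), and for large t every entry approaches 0.25, which is the 25% identity expected between two random sequences.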
  • 57. Genetic distances • How to model sequence evolution ? (Jukes and Cantor, 1969) – Example : we count 20 differences between two 100bp-long sequences • $d = -\tfrac{3}{4}\,\ln\!\left(1 - \tfrac{4}{3}\,p\right)$ • p = 0.2 • d = 0.233 • => about 3 mutations have occurred that we do not see, because they occurred at a position where another mutation had already occurred – Does this now efficiently model the substitution process ?
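The worked example can be checked with a short Python sketch. The function names `p_distance` and `jc69_distance` are chosen for illustration; the correction formula is the one given on the slide, and the check that p stays below 0.75 is an added safeguard, since the logarithm is undefined beyond that point.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of positions that differ between two aligned sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (same length)")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)


def jc69_distance(p):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4/3 * p)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    p = 20 / 100                       # 20 differences over 100 bp
    d = jc69_distance(p)
    print(f"p = {p:.3f}, d = {d:.4f}")
    print(f"estimated substitutions over 100 bp: {100 * d:.1f}")
```

The correction grows quickly as p approaches 0.75, which reflects the saturation shown in the p-distance plot: beyond a certain divergence the observed differences carry very little information about the true distance.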
  • 58. Genetic distances • How to model sequence evolution ? Some facts – Transitions (purine to purine, or pyrimidine to pyrimidine) are more likely than transversions (purine to pyrimidine or the reverse) ; purines : A, G ; pyrimidines : T, C
  • 59. Genetic distances • How to model sequence evolution ? Some facts – Not all positions evolve at the same rate : a substitution at the third codon position is less likely to change the amino acid than one at the first or second position
  • 60. Genetic distances • How to model sequence evolution ? Some facts – Not all positions evolve at the same rate : some codons are under strong purifying selection, while others are under diversifying selection => they do not evolve at the same rate
  • 61. Genetic distances • How to model sequence evolution ? – Better models have been designed to take into account the individuality of each substitution rate. – Rate heterogeneity models take into account the inter-position differences. Some positions are allowed to evolve faster than others – Genomes have their own nucleotide compositions (GC-content)
  • 62. Genetic distances • Some models of nucleotide substitution (a to f : substitution rates ; A, C, G, T : base frequencies)
      - JC69 : a=b=c=d=e=f ; A=C=G=T=1/4
      - K80 : b=e, a=c=d=f ; A=C=G=T=1/4
      - HKY85 : b=e, a=c=d=f ; A ≠ C ≠ G ≠ T
      - TN93 : b, e, a=c=d=f ; A ≠ C ≠ G ≠ T
      - GTR : a, b, c, d, e, f ; A ≠ C ≠ G ≠ T
      More models are possible (12-parameters, codons) but are generally not used
  • 63. Genetic distances • Site heterogeneity models – Usually described as a Gamma distribution (discretized in 4 – 10 categories) – An arbitrary proportion of invariant sites is sometimes added
      [Figure : Gamma distributions for shape parameter k = 1, 1.5, 3 and 5 ; x axis from 0 to 5, y axis from 0 to 1]
  • 64. Genetic distances • Which model to choose ? – The simplest models make unrealistic assumptions – Why don’t we always choose the most complex models ? • Difficult to compute • Parameter values difficult to get from the data • Danger of overfitting
  • 65. HOLY DATA MODEL HYPOTHESIS
  • 66. Genetic distances • Overfitting
      [Figure : scatter plot "Measurement of some phenomenon" ; Input (x axis, 0 to 10) against Output (y axis, 0 to 90)]
  • 67. Genetic distances • Overfitting R² = 1 !?
  • 68. Genetic distances • How to choose the appropriate model ? – Likelihood ratio tests for nested models (more on that later) – Many information criteria have been designed (more on that later also) – Trial and error, depending on the dataset • more data -> more complex model • better data -> more complex models • Literature search
  • 69. File formats (flat files) • FASTA (.fas, .fst, .fasta) Most common sequence format, no header
      >seq1
      ATCGTGCATACGAGCT
      >seq2
      ATCGTGCATACGACGT
      >seq3
      ATCGTGCATACGAAGT
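Reading this format takes only a few lines of Python. The following is a minimal sketch with a hypothetical function name `parse_fasta`; it assumes that each record starts with a ">" line, that the text after ">" is the sequence name, and that a sequence may be wrapped over several lines. Dedicated libraries handle more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()       # everything after '>' is the name
            if name in records:
                raise ValueError(f"duplicate sequence name: {name!r}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            records[name].append(line)    # sequences may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


if __name__ == "__main__":
    example = """>seq1
ATCGTGCATACGAGCT
>seq2
ATCGTGCATACGACGT
>seq3
ATCGTGCATACGAAGT
"""
    for name, seq in parse_fasta(example).items():
        print(name, len(seq), seq)
```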
  • 70. File formats (flat files) • NEXUS (.nex) Contains blocks with sequence and tree information
      #NEXUS
      Begin Data;
      Dimensions ntax=3 nchar=16;
      Format datatype=Nucleotide gap=-;
      [insert comment here]
      Matrix
      seq1 ATCGTGCATACGAGCT
      seq2 ATCGTGCATACGACGT
      seq3 ATCGTGCATACGAAGT
      ;
      End;
  • 71. Practicals : Phylogenetics Part I
      1. Download the file « PrimatesNuc_1.txt ». Open it and identify its format. Rename it with the correct extension.
      2. Load the file in BioEdit and run a multiple alignment (select all sequences then click « Accessory application -> ClustalW multiple alignment »). Save the resulting file.
      3. Load the original file in Seaview and check the alignment options (Align -> Alignment options). Select ClustalW2 and run a multiple alignment (Align all). Save the resulting file. Then reload the original data, change the option to Muscle and run the alignment again. Save this file too.
      4. To generate a consistency-based alignment with T-COFFEE, access the web page http://www.tcoffee.org/, submit the original data and save the resulting alignment.
      5. To generate an alignment with MAFFT, access the web page http://mafft.cbrc.jp/alignment/server/index.html, submit the original data with the default options, and save the resulting alignment.
      6. Now we can compare the alignments obtained by the different methods. Access the web page http://bibiserv.techfak.uni-bielefeld.de/altavist/, select option 2 for comparing two alignments and compare the different alignments you produced. Which alignment is the most different ? Which are the most similar ? Can you guess why ? Open the alignments in BioEdit and spot the differences.
  • 72. Practicals : Phylogenetics Part I
      7. Open MEGA. Import the MAFFT alignment. Open it as « analyse », consider it « nucleotides », « coding sequence », with the standard genetic code. Press F4 to open the alignment explorer. Try the different options in the « Statistics » menu.
      8. In the « Models » menu, select « Find Best DNA/Protein Model (ML) ». Leave the default options and run. Which model has the best likelihood ? Which model is the most appropriate ?
      9. Go to the « Distances » menu and select « Compute pairwise distances ». Now select the proper options for this analysis (substitution model and site heterogeneity model). Are chimps closer to humans or to gorillas ? (you might need to export the data to Excel)
      10. Go to the « Phylogeny » menu and select « Construct/Test UPGMA Tree ». Leave the default options and compute. Does the human/chimp/gorilla clustering fit with your knowledge ? Redo the analysis with appropriate options (substitution model and site heterogeneity model). Does it get any better ?
      11. Go to the « Phylogeny » menu and select « Construct/Test Neighbor-Joining Tree ». Select the appropriate options and compute the tree. Do chimps cluster with humans or gorillas ? Being able to explain the result is important.
      12. Try the same with the ClustalW alignment. Draw some conclusions for yourself.
  • 73. Practicals : Phylogenetics Part I 13. Which of these 4 unrooted trees does not have the same topology as the 3 other ones ?