GenAlEx Tutorials-Part 1
Introduction to Population Genetic
Analysis
Based on material provided at the national graduate workshop An Introduction to
Genetic Analysis for Populations Studies offered by Rod Peakall and Peter Smouse
at the Australian National University, Canberra, Australia, July 2009.
Table of Contents
About this Tutorial Module
  Goals of the Tutorial
About the Software GenAlEx
  Software Instructions
  Understanding Abridged GenAlEx Instructions
Genetic Marker Analysis
Scoring Codominant STR DNA profiles
  Ex 1.1 Scoring Microsatellite DNA profiles
  Ex 1.2 Calculating Allele Frequency
    Box 1.1 Allele Frequency for Codominant Data
  Ex 1.3 No. of Alleles, Heterozygosity & Fixation Index
    Box 1.2 Heterozygosity and the Fixation Index
  Ex 1.4 Partitioning Genetic Diversity
    Box 1.3 Genetic Diversity Within and Among Populations
  Ex 1.5 Calculating F-statistics
    Q 1.5 Questions
    Box 1.4 F-Statistics
    Box 1.5 The Magnitude of FST
Getting Started in GenAlEx
  Before you Start
  Installation
  Loading GenAlEx in Excel Pre-2007
  Optimizing Font Size for GenAlEx in Excel Pre-2007
  Loading GenAlEx in Excel 2007 onwards
  Optimizing Font Size for GenAlEx in Excel 2007
Understanding GenAlEx Data Formats
  Input
  Output
  Sample Labels
  Data Parameters and Labels
    Parameter locations
  Data Formats
    Format for codominant data
    Format for dominant, haploid or sequence data
    Format for geographic data
  Missing Data
Using Create to Learn about GenAlEx Data Formats
  Ex 1.6 Using Create with Auto Pop Size
  Ex 1.7 Using Create with Variable Pop Sizes
  Ex 1.8 Using Create with Other Data Types
  Ex 1.9 Using Template as a Starting Point for Data Entry
GenAlEx Data Parameters
  Ex 1.10 Getting Population Parameters
  Using Data to Work Efficiently
Data Exploration and Allele Frequencies
  Ex 1.11 Plots of Allele Frequency
    Q 1.11 Questions
  Ex 1.12 Heterozygosity, F-statistics and Allelic Patterns
    Q 1.12 Questions
GenAlEx Tutorials Part 1
Peakall and Smouse (2009)
1.2
Shannon Diversity Indices in Population Genetics
  Ex 1.13 Hand Calculation of Shannon's Indices
    Q 1.13 Questions
    Box 1.6 Shannon's Information Indices
Nei Genetic Distance
  Ex 1.14 Hand Calculation of Nei's Genetic Distance
    Box 1.7 Nei's Genetic Identity and Distance
Pairwise Population Genetic Analysis
  Ex 1.15 Pairwise Fst and Nei Genetic Distances
    Q 1.15 Questions
  Ex 1.16 Pairwise calculation of Shannon's Indices
    Q 1.16 Questions
Principal Coordinate Analysis (PCA)
  Ex 1.17 Steps for Performing PCA
    Q 1.17 Questions
Hardy-Weinberg Equilibrium
  Ex 1.18 Testing for Hardy-Weinberg Equilibrium
    Box 1.8 Chi-square for Hardy-Weinberg Equilibrium (HWE)
    Q 1.18 Questions
Putting It All Together
  Ex 1.19 Revision: F-statistics in Glycine and Caladenia
    Q 1.19 Questions
  Ex 1.20 Bringing the Genetics and Ecology Together
    Box 1.9 The case of Glycine clandestina
    Box 1.10 The case of Caladenia tentaculata
    Box 1.11 Estimation of Outcrossing Rates in Plants
    Q 1.20 Questions
References and Further Reading
Glossary - Some Important Definitions
Glossary - Genetic markers
About this Tutorial Module
This GenAlEx tutorial module is based on material provided at a two-day national graduate
workshop entitled An Introduction to Genetic Analysis for Populations Studies offered by Rod
Peakall (Australian National University) and Peter Smouse (Rutgers University, USA) at the
Australian National University in July 2009. We are also pleased to include as an appendix, an
overview on Shannon Diversity analysis by Bill Sherwin (University of New South Wales) who
contributed a guest lecture to our workshop.
This tutorial is intended to provide a brief refresher course in frequency-based population genetic
statistics and to introduce students to the software GenAlEx.
This tutorial module is provided free for personal use by registered users of the software package
GenAlEx. This document and associated data files must not be used for any other purpose,
including teaching in any undergraduate or graduate course, without express permission of the
authors. While every effort has been taken to ensure the accuracy of this document, supporting data
files and the software package GenAlEx, we are unable to take responsibility for unintentional
errors or software problems that may be encountered by users. We regret that we are also unable to
provide individualized support.
Rod Peakall and Peter Smouse, Dec 2009
Professor Rod Peakall
Evolution, Ecology and Genetics
Research School of Biology
The Australian National University
Canberra ACT 0200 Australia
Email: [email protected]
Professor Peter Smouse
Department of Ecology, Evolution and Natural Resources
Rutgers University, Cook College
New Brunswick NJ 08901-8551 USA
Email: [email protected]
Goals of the Tutorial
1. To describe the procedures for scoring codominant genetic markers such as microsatellites.
2. To demonstrate by way of hand calculations the basic statistical procedures for
   frequency-based within and among population genetic analysis.
3. To introduce the software GenAlEx and outline important information about installation, data
   formats and operation of the software.
4. To demonstrate the basic statistical procedures for frequency-based within and among
   population genetic analysis including Allele Frequency, Heterozygosity, F-statistics, Nei
   Genetic Distance and Shannon Diversity Indices.
5. To explore the biological interpretation of the statistics described for some real data sets.
About the Software GenAlEx
GenAlEx - Genetic Analysis in Excel (Peakall and Smouse 2006) is designed as a user-friendly
package with an intuitive and consistent interface that allows users to analyse a wide range of
population genetic data within a software environment with which most users will have some
familiarity (MS Excel). GenAlEx is now widely used by university teachers at both undergraduate
and graduate levels in Australia, North America, South America, and Europe. The software also
offers a wide range of analysis options for researchers, including some spatial analysis options not
available elsewhere. Options for exporting data to a wide range of other population genetic
packages are also provided. The software is used by more than 5000 registered users, representing
more than 60 countries. The paper describing the software was cited more than 650 times in the
period 2006 to 2009.
Peakall, R. and Smouse P.E. (2006) GENALEX 6: genetic analysis in Excel. Population genetic
software for teaching and research. Molecular Ecology Notes. 6, 288-295.
Freely available from The Australian National University, Canberra, Australia.
http://www.anu.edu.au/BoZo/GenAlEx/
Note the official reference refers to the software package in lower caps: GENALEX 6. Please use this
text format for publication purposes. Here we will continue to use the original text format for the
software: GenAlEx.
Software Instructions
Throughout this text, instructions for using GenAlEx are provided in abbreviated form. For
consistency, the same text styles as used in the GenAlEx 6 Guide have been adopted here:
Menu name (e.g. GenAlEx)
Menu option (e.g. Distance)
Menu suboption (e.g. Genetic)
Dialog box name (e.g. Genetic Distance Options)
Dialog box option (e.g. Binary)
Tips are written in italics.
Understanding Abridged GenAlEx Instructions
Full Procedure for Calculating Genetic Distance
1. Choose the option Distance from the GenAlEx menu, and then select Genetic from the submenu.
2. Ensure the locus and sample parameters are correct in the Genetic Distance Options dialog box.
3. Select the appropriate Distance Calculation, and output options required (see below).
4. Enter Title and Worksheet Prefix then click Ok. Genetic distance is output to sheet [GD].
Abridged Procedure for Calculating Genetic Distance
Choose Distance->Genetic then select the appropriate Distance Calculation in the Genetic
Distance Options dialog box.
Note that the abridged instructions omit the prompt about entering a Title and Worksheet Prefix.
However, it is strongly recommended that you take advantage of this feature, which is provided to
help users keep track of their data analysis. In later sections of the course, instructions may be
further abbreviated as students become more familiar with GenAlEx.
Genetic Marker Analysis
Broadly speaking, population genetic analysis proceeds along one of two pathways: frequency-based
analysis and distance-based analysis. In this introductory course we will restrict our attention
to frequency-based analyses. In these analyses an estimate of allele frequencies is the basis for
most downstream calculations. For codominant data, frequency-based analyses include F-statistics,
Nei's genetic distance, and Shannon diversity indices that are introduced in this first section of the
course. Other allele frequency-based options such as population assignment procedures, estimates
of genotypic probabilities, probabilities of identity, probabilities of exclusion and pairwise
relatedness estimates will be covered in the main section of the workshop. A subset of these
frequency-based analyses is also applicable to haploid and binary data.
By contrast to frequency-based analyses, genetic distance-based analyses are relatively new. For
these analyses the starting point is the conversion of genetic data into a pairwise
individual-by-individual genetic distance matrix. Distance matrices can be calculated for all kinds
of genetic data including codominant, haploid and binary genetic markers, and DNA sequences. Once a
genetic distance matrix is calculated, further extensive genetic analysis can be performed including:
Analysis of Molecular Variance (AMOVA); Principal Coordinates Analysis (PCA); UPGMA and
Neighbor Joining Tree building; Mantel Tests; Spatial Autocorrelation analyses; and TwoGener.
The main section of the course will explore many of these genetic analysis options.
The first step for both frequency-based and distance-based genetic analysis is the scoring of the
DNA profiles.
Scoring Codominant STR DNA profiles
In practice, before you begin scoring any microsatellite or STR DNA profiles it is important to
inspect the results for a good number of samples (>20, often many more). This will allow you to get a
sense of the general patterns and identify any potential artifacts that might be incorrectly scored.
1. Based on your initial inspection of the fragment sizes across multiple samples, and your
   knowledge of the nucleotide repeat structure, define the allele series as integers. For di- and
   tetra-nucleotide repeats start the allele series as either odd or even (whichever most closely
   matches fragment sizes) and then stick with the series chosen (unless microvariants are
   confirmed to disrupt the allele series).
2. For each DNA profile, identify the alleles and label the allele size(s). Note that some STR
   profiles show PCR artifacts such as stutter patterns that need to be identified and excluded.
3. For each DNA profile, assign the genotype score based on the allele sizes in the allele series.
4. List the genotypes in a table for downstream analysis.
The allele series and some scored genotypes are illustrated here for a tetra-nucleotide codominant
microsatellite or STR locus D18 with repeat motif [AGAA]n that is widely used in human forensics.
Allele series = 297, 301, 305, 309, 313, 317, 321, 325, 329
Genotype
297 301
297 325
297 313
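The scoring steps above can be sketched in code. The following is a minimal illustration only, not part of GenAlEx: it assumes that raw fragment sizes arrive as decimal numbers from a fragment analyser, and the function names, the 1.5 bp tolerance and the example peak sizes are all hypothetical choices made for this sketch.

```python
# Allele series for locus D18 as listed above (tetra-nucleotide repeat, 4 bp steps).
ALLELE_SERIES = [297, 301, 305, 309, 313, 317, 321, 325, 329]


def call_allele(fragment_size, series=ALLELE_SERIES, tolerance=1.5):
    """Return the allele in the series nearest to a measured fragment size.

    Returns None when no allele lies within `tolerance` bp (an assumed
    threshold), so that possible artifacts or microvariants are flagged
    for manual inspection instead of being forced into the series.
    """
    nearest = min(series, key=lambda allele: abs(allele - fragment_size))
    if abs(nearest - fragment_size) > tolerance:
        return None
    return nearest


def score_genotype(peak_sizes):
    """Convert one or two measured peaks into a sorted diploid genotype.

    A single peak is scored as a homozygote. Any peak that cannot be
    assigned to the series makes the whole genotype None (needs review).
    """
    calls = [call_allele(size) for size in peak_sizes]
    if len(calls) not in (1, 2) or None in calls:
        return None
    if len(calls) == 1:
        calls = calls * 2
    return tuple(sorted(calls))


# Hypothetical raw peak sizes for three samples.
print(score_genotype([296.8, 301.3]))  # two peaks close to 297 and 301
print(score_genotype([313.1]))         # one peak, scored as a homozygote
print(score_genotype([303.0, 313.0]))  # 303.0 is 2 bp from both 301 and 305
```

Returning None for an off-series peak, instead of rounding it to the nearest allele, mirrors the advice to identify and exclude artifacts before assigning a genotype.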
Ex 1.1 Scoring Microsatellite DNA profiles
Microsatellite genotypes at the locus TT for 20 samples of bush rats are shown below. Locus TT
contains a tetra-nucleotide repeat (AAAG)n. Ten alleles are known at the locus with an inferred
repeat range of (AAAG)6 to (AAAG)18.
Step 1. Inspect the DNA profiles of multiple samples to identify putative alleles and
        determine allele sizes.
Step 2. List the series of expected allele sizes as integers based on your inspection of
        multiple samples and knowledge of the locus sequence. Enter the list of alleles in
        the table below.
Allele Series: 109, 117, 125, 129, 133, 137, 141, 145, 149
Step 3. On the DNA profiles label the alleles with their integer size in base pairs.
Step 4. Score and record the genotypes as the allele sizes in the table below.
Note that the table below is divided into 3 parts. The first 3 rows provide information about the
parameters for the data set. When we begin using the computer software GenAlEx, this
information will become important. Do not worry about it until then! The second block in the
table is for the genotypes belonging to samples from the first population P1. The third block
is for samples from the second population P2.
1         20               10     10
Scoring                    P1     P2
Sample    Pop     TT       TT
RF009     P1      109      137
RF010     P1      117      137
RF011     P1      109      117
RF012     P1      117      117
RF014     P1      145      145
RF482     P1      141      149
RF486     P1      109      125
RF488     P1      125      137
RF489     P1      117      137
RF493     P1      117      129
RF495     P2      117      133
RF528     P2      133      141
RF529     P2      129      141
RF530     P2      109      117
RF531     P2      125      137
RF538     P2      117      137
RF539     P2      109      109
RF602     P2      125      137
RF603     P2      125      137
RF614     P2      125      137
Ex 1.2 Calculating Allele Frequency
This exercise continues from Ex 1.1.
Step 1. Based on your scored genotypes in the table above, calculate allele frequencies for
        population 1 (P1), population 2 (P2) and the total (P1 and P2). Record your
        answers in the table below.
Allele counts and allele frequency (locus TT)
Allele    P1 Count   P1 Freq   P2 Count   P2 Freq   Total Count   Total Freq
109       3          0.150     3          0.150     6             0.150
117       6          0.300     3          0.150     9             0.225
125       2          0.100     4          0.200     6             0.150
129       1          0.050     1          0.050     2             0.050
133       0          0.000     2          0.100     2             0.050
137       4          0.200     5          0.250     9             0.225
141       1          0.050     2          0.100     3             0.075
145       2          0.100     0          0.000     2             0.050
149       1          0.050     0          0.000     1             0.025
Box 1.1 Allele Frequency for Codominant Data
$$\mathrm{Freq}_{\mathrm{Allele}\ x} = \frac{2N_{xx} + N_{xy}}{2N}$$
Calculated locus by locus, where Nxx is the number of homozygotes for allele X (XX), Nxy is
the number of heterozygotes containing the allele X (Y can be any other allele), and N is the number
of samples. Can also be determined simply by direct count of the proportion of different alleles.
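As a cross-check on the hand calculation, the two routes described in Box 1.1 (the formula and the direct count) can be sketched in a few lines. This is an illustrative sketch, not GenAlEx code: the P1 genotypes are transcribed from the Ex 1.1 scoring table, and the function names are hypothetical.

```python
from collections import Counter

# Population P1 genotypes at locus TT, transcribed from the Ex 1.1 scoring table.
P1 = [(109, 137), (117, 137), (109, 117), (117, 117), (145, 145),
      (141, 149), (109, 125), (125, 137), (117, 137), (117, 129)]


def frequency_by_formula(genotypes, x):
    """Box 1.1 formula: (2 * Nxx + Nxy) / (2 * N) for allele x."""
    n = len(genotypes)
    n_xx = sum(1 for a, b in genotypes if a == x and b == x)    # XX homozygotes
    n_xy = sum(1 for a, b in genotypes if (a == x) != (b == x))  # heterozygotes carrying x
    return (2 * n_xx + n_xy) / (2 * n)


def frequencies_by_count(genotypes):
    """Direct count: proportion of each allele among the 2N allele copies."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = 2 * len(genotypes)
    return {allele: counts[allele] / total for allele in sorted(counts)}


direct = frequencies_by_count(P1)
for allele, freq in direct.items():
    # Both routes should agree for every allele.
    assert abs(freq - frequency_by_formula(P1, allele)) < 1e-12
    print(allele, f"{freq:.3f}")
```

For allele 117 in P1 there is one homozygote and four heterozygotes, so the formula gives (2 + 4) / 20 = 0.300, matching the direct count of 6 copies out of 20.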
Ex 1.3 No. of Alleles, Heterozygosity & Fixation Index
This exercise continues from Ex 1.1 and 1.2.
Step 1. Based on the genotypes and allele frequencies for P1 and P2 calculate the number
        of different alleles Na, the observed heterozygosity Ho, the expected heterozygosity He
        and the Fixation Index F. Show your calculations in the space below and summarise your
        answers in the table.
Pop   Locus   N     Na    Ho      He      F
P1    TT      10          0.800   0.820    0.024
P2    TT      10          0.900   0.830   -0.084
Box 1.2 Heterozygosity and the Fixation Index
$$H_o = \frac{\text{No. of Hets}}{N}$$
Where Ho is the observed heterozygosity, i.e. the proportion of N samples that are heterozygous at a
given locus.
$$H_e = 1 - \sum p_i^2$$
Where He is the expected heterozygosity, i.e. the proportion of heterozygosity expected under
random mating and pi is the allele frequency of the i-th allele.
$$F = \frac{H_e - H_o}{H_e}$$
The Fixation Index F (also called the Inbreeding Coefficient) exhibits values ranging from -1 to +1.
Values close to zero are expected under random mating, while substantial positive values indicate
inbreeding or undetected null alleles. Negative values indicate excess of heterozygosity, due to
negative assortative mating, or selection for heterozygotes.
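The quantities in Box 1.2 can be computed directly from the scored genotypes. The sketch below is for checking hand calculations only and is not GenAlEx's implementation: the genotypes are transcribed from the Ex 1.1 scoring table, and the function name and dictionary keys are hypothetical.

```python
from collections import Counter

# Genotypes at locus TT, transcribed from the Ex 1.1 scoring table.
P1 = [(109, 137), (117, 137), (109, 117), (117, 117), (145, 145),
      (141, 149), (109, 125), (125, 137), (117, 137), (117, 129)]
P2 = [(117, 133), (133, 141), (129, 141), (109, 117), (125, 137),
      (117, 137), (109, 109), (125, 137), (125, 137), (125, 137)]


def diversity_summary(genotypes):
    """Return N, Na, Ho, He and F for one population at one locus."""
    n = len(genotypes)
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    freqs = [count / (2 * n) for count in counts.values()]
    na = len(counts)                                   # number of different alleles
    ho = sum(1 for a, b in genotypes if a != b) / n    # observed heterozygosity
    he = 1.0 - sum(p * p for p in freqs)               # expected heterozygosity
    f = (he - ho) / he                                 # fixation index
    return {"N": n, "Na": na, "Ho": ho, "He": he, "F": f}


for name, pop in (("P1", P1), ("P2", P2)):
    s = diversity_summary(pop)
    print(name, s["N"], s["Na"],
          f"{s['Ho']:.3f}", f"{s['He']:.3f}", f"{s['F']:.3f}")
```

With these genotypes the sketch reproduces Ho = 0.800, He = 0.820, F = 0.024 for P1 and Ho = 0.900, He = 0.830, F = -0.084 for P2, and it counts 8 different alleles in P1 and 7 in P2.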
Ex 1.4 Partitioning Genetic Diversity
In Ex 1.3 you calculated the observed and expected heterozygosity for each of the bush rat
populations P1 and P2. Here we continue with this data set.
Step 1. For ease of calculations transcribe your data and answers from Ex 1.2 and Ex 1.3
        into the reorganized table below.
Allele Frequency
Allele    P1      P2      Total
109       0.150   0.150   0.150
117       0.300   0.150   0.225
125       0.100   0.200   0.150
129       0.050   0.050   0.050
133       0.000   0.100   0.050
137       0.200   0.250   0.225
141       0.050   0.100   0.075
145       0.100   0.000   0.050
149       0.050   0.000   0.025
Heterozygosity
          P1      P2      Total
Ho        0.800   0.900   0.850
He        0.820   0.830   0.840
Mean Ho   0.850
Mean He   0.825
HT        0.840
Step 2. Calculate the mean Ho as the average of Ho across P1 and P2 and enter the value in
        the table.
Step 3. Calculate the mean He as the average of He across P1 and P2 and enter the value in
        the table.
Step 4. Calculate HT as the expected heterozygosity of the total (using the total allele
        frequencies) and enter the value in the table.
Box 1.3 Genetic Diversity Within and Among Populations
For codominant genetic data at a single locus, the total genetic diversity (heterozygosity) can be
divided into within and among populations as follows (based on Hartl and Clark 1989, with some
modification of notation):
$\bar{H}_o$ = Observed heterozygosity averaged across subpopulations.
$\bar{H}_e$ = Expected heterozygosity averaged across subpopulations.
$H_T$ = Total expected heterozygosity (calculated as if all the subpopulations were pooled).
$$\bar{H}_o = \frac{1}{k}\sum_{s=1}^{k} H_{o,s}$$
Where $H_{o,s}$ is the observed heterozygosity in subpopulation s, and k is the number of
subpopulations.
$$H_{e,s} = 1 - \sum_{i=1}^{h} p_{i,s}^2$$
$$\bar{H}_e = \frac{1}{k}\sum_{s=1}^{k} H_{e,s}$$
Where $H_{e,s}$ is the expected heterozygosity within subpopulation s, and $p_{i,s}$ is the frequency
of the i-th allele in subpopulation s. The sum of squared allele frequencies runs over alleles
i = 1 to h, where h is the number of alleles.
$$H_T = 1 - \sum_{i=1}^{h} p_{Ti}^2$$
Where $H_T$ is the total expected heterozygosity, and $p_{Ti}$ is the frequency of allele i over the
total population. If subpopulation sample sizes are equal then $p_{Ti} = \bar{p}_i$, where
$\bar{p}_i$ is the frequency of allele i averaged over the subpopulations of equal size.
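The partition in Box 1.3 can be checked numerically for the bush rat data. This is a minimal sketch under stated assumptions: the allele frequencies and Ho values are copied from the Ex 1.4 table, the two samples are of equal size (so the pooled frequency of each allele is the simple average across subpopulations), and all variable and function names are hypothetical.

```python
# Allele frequencies at locus TT for each subpopulation, from the Ex 1.4 table.
FREQS = {
    "P1": {109: 0.150, 117: 0.300, 125: 0.100, 129: 0.050, 133: 0.000,
           137: 0.200, 141: 0.050, 145: 0.100, 149: 0.050},
    "P2": {109: 0.150, 117: 0.150, 125: 0.200, 129: 0.050, 133: 0.100,
           137: 0.250, 141: 0.100, 145: 0.000, 149: 0.000},
}
# Observed heterozygosity per subpopulation, from the same table.
HO = {"P1": 0.800, "P2": 0.900}


def expected_heterozygosity(freqs):
    """He = 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


k = len(FREQS)  # number of subpopulations
mean_ho = sum(HO.values()) / k
mean_he = sum(expected_heterozygosity(f) for f in FREQS.values()) / k

# Equal sample sizes are assumed here, so the total frequency of each allele
# is the average of the subpopulation frequencies.
alleles = sorted(FREQS["P1"])
pooled = {a: sum(FREQS[pop][a] for pop in FREQS) / k for a in alleles}
h_t = expected_heterozygosity(pooled)

print(f"Mean Ho = {mean_ho:.3f}")
print(f"Mean He = {mean_he:.3f}")
print(f"HT      = {h_t:.3f}")
```

With unequal sample sizes the pooled frequencies would need to be weighted by sample size instead of averaged.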
Ex 1.5 Calculating F-statistics
This exercise is a continuation of Ex 1.4. Using the values for Mean Ho, Mean He and HT we can
easily calculate Wright's F-statistics by the formulae in Box 1.4.
Mean Ho    0.850
Mean He    0.825
HT         0.840
FIS       -0.030
FIT       -0.012
FST        0.018
Step 1. Calculate FIS, show your working out below and enter your answer in the table.
Step 2. Calculate FIT, show your working out below and enter your answer in the table.
Step 3. Calculate FST, show your working out below and enter your answer in the table.
Step 4. Check that your answers fit the relationship between FIS, FIT and FST shown at the
        bottom of Box 1.4. Answer questions 1 and 2.
Q 1.5 Questions
1. Did you detect genetic differentiation between P1 and P2? Is this differentiation
   significant?
   No. This is a trick question to make students think about how they could test for a
   significant difference.
2. Did you expect to obtain negative values for FIS and FIT? If not, why not? How
   might you explain this outcome?
   No. But the values are close to zero and probably just reflect the small sample size.
Tip: You can check your hand calculations using GenAlEx by following the instructions outlined in
Ex. 1.9. If you are a new user of GenAlEx, please continue to read the essential background to
GenAlEx and complete Ex 1.6 to 1.8 first.
Box 1.4 F-Statistics
Perhaps the most widely reported statistics in population genetics are Wright's F-statistics (Wright
1946, 1951, 1965). One way to calculate these statistics is to use the partition of genetic diversity
(heterozygosity) described in Box 1.3 as the starting point.
It may come as a surprise to learn that differences within versus among subpopulations can be
characterised by F-statistics, since these statistics are normally associated with inbreeding.
However, this is possible because population subdivision is associated with inbreeding-like effects
viz. excess homozygosity (reduction of heterozygosity).
FIS = The inbreeding coefficient within individuals relative to the subpopulation. It measures the
reduction in heterozygosity of an individual due to non-random mating within its subpopulation.
$$F_{IS} = \frac{\bar{H}_e - \bar{H}_o}{\bar{H}_e}$$
FIT = The inbreeding coefficient within individuals relative to the total. This statistic takes into
account the effects of both non-random mating within subpopulations and genetic differentiation
among the subpopulations.
$$F_{IT} = \frac{H_T - \bar{H}_o}{H_T}$$
FST = The inbreeding coefficient within subpopulations relative to the total. This statistic provides a
measure of the genetic differentiation between subpopulations. That is, the proportion of the total
genetic diversity (heterozygosity) that is distributed among the subpopulations. FST is almost always
greater than or equal to zero. If all subpopulations are in Hardy-Weinberg equilibrium with the
same allele frequencies, FST = 0. Note that FST as calculated in this way is equivalent to GST.
$$F_{ST} = \frac{H_T - \bar{H}_e}{H_T}$$
F-statistics are related according to the following equation:
$$(1 - F_{IS})(1 - F_{ST}) = (1 - F_{IT})$$
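The three ratios in Box 1.4, and the identity that links them, can be verified with a short calculation. This sketch uses the Mean Ho, Mean He and HT values from the Ex 1.5 table; the function name is hypothetical and the code is an illustration for checking hand calculations, not GenAlEx's implementation.

```python
def f_statistics(mean_ho, mean_he, h_t):
    """Return (FIS, FIT, FST) from mean Ho, mean He and HT."""
    f_is = (mean_he - mean_ho) / mean_he
    f_it = (h_t - mean_ho) / h_t
    f_st = (h_t - mean_he) / h_t
    return f_is, f_it, f_st


# Values from the Ex 1.5 table for the bush rat locus TT.
f_is, f_it, f_st = f_statistics(mean_ho=0.850, mean_he=0.825, h_t=0.840)

print(f"FIS = {f_is:.3f}")
print(f"FIT = {f_it:.3f}")
print(f"FST = {f_st:.3f}")

# The identity holds because (1 - FIS) = mean Ho / mean He and
# (1 - FST) = mean He / HT, so their product is mean Ho / HT = (1 - FIT).
assert abs((1 - f_is) * (1 - f_st) - (1 - f_it)) < 1e-12
```

Note that the identity holds for the unrounded values; checking it with values rounded to three decimal places will only agree approximately.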
Box 1.5 The Magnitude of FST
In practice, FST is rarely larger than 0.5 and often very much less. Wright (1978) proposed for the
simple two-allele systems that he studied that values of FST > 0.25 are taken to mean very great
differentiation between subpopulations; the range 0.15 to 0.25 indicates moderate differentiation;
while differentiation is not negligible if FST is 0.05 or less. However, the interpretation of the
magnitude of FST is more complex than simple reference to this quantitative guide. Hedrick (1999)
has shown that with modern hypervariable markers characterized by many alleles, FST values can be
considerably lower than for genetic markers with very few alleles. Therefore, in modern population
genetic procedures a more important question is whether we can detect significant genetic
differentiation (FST > 0) or not, and whether this differentiation is biologically meaningful.
Procedures such as AMOVA allow for such statistical tests.
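To make the idea of testing FST > 0 concrete, here is a simple permutation sketch. It is not AMOVA and not the procedure GenAlEx uses; it is an assumed, simplified illustration in which individuals are shuffled between the two populations and FST (as defined in Box 1.4) is recomputed each time. The genotypes are transcribed from the Ex 1.1 scoring table; the function names, the number of permutations and the random seed are hypothetical choices.

```python
import random

# Genotypes at locus TT, transcribed from the Ex 1.1 scoring table.
P1 = [(109, 137), (117, 137), (109, 117), (117, 117), (145, 145),
      (141, 149), (109, 125), (125, 137), (117, 137), (117, 129)]
P2 = [(117, 133), (133, 141), (129, 141), (109, 117), (125, 137),
      (117, 137), (109, 109), (125, 137), (125, 137), (125, 137)]


def expected_het(genotypes):
    """He = 1 minus the sum of squared allele frequencies."""
    alleles = [allele for genotype in genotypes for allele in genotype]
    n = len(alleles)
    return 1.0 - sum((alleles.count(a) / n) ** 2 for a in set(alleles))


def fst(pops):
    """FST = (HT - mean He) / HT, with HT taken from the pooled sample."""
    mean_he = sum(expected_het(pop) for pop in pops) / len(pops)
    h_t = expected_het([genotype for pop in pops for genotype in pop])
    return (h_t - mean_he) / h_t


def permutation_p_value(pop_a, pop_b, n_perm=999, seed=1):
    """Share of shuffled data sets with FST at least as large as observed."""
    rng = random.Random(seed)
    observed = fst([pop_a, pop_b])
    pooled = pop_a + pop_b
    hits = 0
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)  # reassign individuals to populations at random
        if fst([shuffled[:len(pop_a)], shuffled[len(pop_a):]]) >= observed:
            hits += 1
    # The observed data set is counted as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)


observed, p_value = permutation_p_value(P1, P2)
print(f"Observed FST = {observed:.3f}, permutation P = {p_value:.3f}")
```

A large P value would mean that an FST as large as the observed one arises readily when individuals are assigned to populations at random, which is the sense in which differentiation can be present numerically yet not significant.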
Getting Started in GenAlEx
Before you Start
Before you can work with GenAlEx you need the following:
1. Scored genetic data
2. Knowledge of whether the data are binary (haploid), binary (diploid), haploid or codominant
3. For spatial genetic analysis you also require geographic data for individuals and populations
4. Microsoft Excel installed on your computer.
Installation
GenAlEx is provided as an Excel Add-in, a compiled module and its associated GenAlEx menu.
Your downloaded file may initially be in the zipped format. Use the extract option to unzip the
download and save the files to a dedicated folder of your choice. You can work with GenAlEx
directly from this folder.
Tip: Versions of GenAlEx 6.1 onwards offer full compatibility with Excel 2007. This includes the
ability to take advantage of the substantially increased number of columns from 256 pre-Excel 2007
to 16,384 columns in Excel 2007.
For versions of GenAlEx 6.3 onwards, users are given the choice of installing either GenAlEx
6.3.xla or GenAlEx 6.3 for 2007.xla. Both versions will run in Excel 2007, but if you wish to take
advantage of full compatibility with Excel 2007 you should install the Excel 2007 specific option.
There are different instructions for getting GenAlEx up and running in Excel 2007 versus earlier
Excel versions. Choose the instructions below that match your version of Excel.
Notes for Macintosh Users
Unfortunately GenAlEx is unable to run in Excel 2008 on the Macintosh. This is because Microsoft
removed the ability of Excel 2008 to run Visual Basic for Applications (VBA), the macro language
of Microsoft Office. This has considerably reduced the cross-platform compatibility that
characterised previous versions of Microsoft Office. GenAlEx does run in Excel 2003 (and earlier)
on the Macintosh. To take advantage of the features of Excel 2007 you will either need to run your
analysis on a PC or run Excel 2007 in Windows on an Intel-based Mac.
Loading GenAlEx in Excel Pre-2007
1. Copy the GenAlEx Add-in (e.g. GenAlEx 6.3.xla) to your choice of location on your
   computer. This should preferably be in a dedicated folder.
2. Launch MS Excel. Choose Open from the File menu, locate the GenAlEx Add-in, then click
   the OK button.
Tip: Alternatively, you can launch GenAlEx and Excel simultaneously by clicking directly on the
GenAlEx Add-in file.
3. Depending on the settings of your Excel program, Excel may warn you that GenAlEx contains
   macros. Click the Enable button to proceed.
4. In a few seconds the GenAlEx splash screen will appear (click to hide it), and the GenAlEx
   menu will appear in the Excel menu bar, just before the Help menu.
5. If you do not see the security warning dialog box and/or GenAlEx does not launch, you may
   see the message below instead. In this case, continue to step 6.
6. From the Excel menu Tools, choose Options. At the options dialog box click the Macro
   Security button on the Security tab. In the next dialog box, choose Medium for the security
   level. Now return to step 1 to launch GenAlEx.
Optimizing Font Size for GenAlEx in Excel Pre-2007
GenAlEx output is optimized for a font size of 10 pt. If it is not already set to this value, you
should therefore set the Excel default font size to 10 pt.
Note that this setting does not override the font settings embedded in existing worksheets. To
ensure GenAlEx output is in 10 pt, you may wish to copy data (or data sheets) to a new workbook that
has been created after the Excel default font is set to 10 pt as outlined below.
Step 1.
From the Excel menu Tools, choose Options.
Step 2.
At the options dialog box click the General tab. Set the standard font size to 10 pt.
Loading GenAlEx in Excel 2007 onwards
1.
Launch Excel 2007.
2.
Open the GenAlEx Add-in (e.g. GenAlEx 6.3 for Excel 2007.xla) via the Microsoft Office
Button (top left of screen).
3.
When prompted by the Security Notice, choose Enable Macros.
4.
Shortly the GenAlEx splash screen will appear.
5.
Click the Add-Ins tab to show the Add-Ins ribbon. The GenAlEx menu will appear on the
right, along with any other installed Add-Ins you may be running. When using GenAlEx you
may find it convenient to turn off Minimize Ribbon so that the GenAlEx menu is always
accessible when the Add-Ins ribbon is shown. You can access this option by right-clicking the
ribbon.
Optimizing Font Size for GenAlEx in Excel 2007
GenAlEx output is optimized for a font size of 10 pt. If it is not already set to this value, you
should therefore set the Excel default font size to 10 pt.
Note that this setting does not override the font settings embedded in existing worksheets. To
ensure GenAlEx output is in 10 pt, you may wish to copy data (or data sheets) to a new workbook that
has been created after the Excel default font is set to 10 pt as outlined below.
1.
Click the Microsoft Office button (top left of screen), then Excel Options.
2.
Set the default Font Size for new workbooks to 10 pt under the option When
creating new workbooks.
Understanding GenAlEx Data Formats
Input
Input consists of raw data or distance matrices in appropriate GenAlEx format (see below). In order
to proceed with an analysis the worksheet containing the data must be activated (visible as the
current sheet). Some analyses and procedures take several worksheets as input. Unless otherwise
explained, these need to be placed starting on the left hand side (LHS) of the workbook, in the order
1 to n.
Wherever possible, GenAlEx offers two options to help users keep track of data and analysis
output. In the initial Data Parameter dialog box for statistical procedures, the user may provide a
worksheet prefix to help identify the output of a particular analysis, and a title for the output that
can provide specific details of the analysis being performed. This title will appear at the top of each
output worksheet. It is strongly recommended that both these options be used.
Output
GenAlEx can generate many worksheets in routine analysis, so the ability to create and manipulate
new workbooks and new worksheets within workbooks is particularly important. Each worksheet
output by GenAlEx is given a name dependent on the analysis performed. This is particularly useful
in analyses that have multiple worksheet outputs. In this document (and the GenAlEx 6 guide)
worksheet names are identified using square brackets e.g. [GD]. A user-defined prefix may be
added to the worksheet name for further clarity.
Output of GenAlEx worksheets is designed so that the raw data or other input worksheet is always
at the extreme left hand side (LHS) of the workbook. Thus, output worksheets for most menu
options will appear to the right hand side (RHS) of the raw data worksheet. However, Genetic
Distance outputs will appear to the LHS of the raw data, as the distance matrix is used as input for
subsequent analyses.
Graphs are output in standard Excel format and may need to be resized in order to see all the
information. All graphs can be edited using standard Excel functions.
Sample Labels
To obtain maximum benefit from GenAlEx, each sample should ideally be given a unique numerical
identifier. Sample names may carry an alpha character prefix, but this must be the same for all
samples in a single dataset. In this case it is important to know that, when sorting on alphanumeric
data, GenAlEx uses the Excel sort-order rules, sorting character by character (e.g. A11 will come
after A100). For ease of sorting, we recommend that the format A001 to A199 be used when using
prefixes.
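The effect of character-by-character sorting can be illustrated outside Excel. The following is a minimal Python sketch, in which Python's plain string sort stands in for the Excel rules and `pad_label` is a hypothetical helper (not part of GenAlEx) that assumes a label made of one alphabetic prefix followed by digits.

```python
def pad_label(label: str, width: int = 3) -> str:
    """Zero-pad the numeric part of a label, e.g. 'A11' -> 'A011'.

    Hypothetical helper: assumes an alphabetic prefix followed by digits only.
    """
    prefix = label.rstrip("0123456789")
    number = label[len(prefix):]
    return prefix + number.zfill(width)


labels = ["A2", "A100", "A11"]

# Character-by-character comparison: 'A100' sorts before 'A11', which sorts before 'A2'.
unpadded_order = sorted(labels)

# After zero-padding, text order and numeric order agree.
padded_order = sorted(pad_label(x) for x in labels)
```

With padded labels the sorted order follows the sample numbers, which is the reason for recommending a fixed-width format.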
Data Parameters and Labels
Data parameters and labels are crucial for telling GenAlEx how to read and analyse the data.
GenAlEx stores all parameters and labels in rows 1, 2 and 3 of the data worksheets. Columns A and
B are used for sample and population labels respectively. Actual data begins in Cell C4 of a
worksheet.
Data parameters and labels may be entered in GenAlEx in several ways:
1.
A worksheet containing data may be manually formatted to provide appropriate parameters.
2.
The Template option in the GenAlEx menu may be used to provide parameters through a
dialog box, creating a formatted worksheet into which the data are then entered (see section
below for further instructions).
3.
The Parameters option in the GenAlEx menu may be used to obtain the relevant parameter
values from an existing dataset and insert them into their appropriate location (see section
below for further instructions). This option requires that your data are bounded by blank
columns and rows.
4.
On initiating an analysis, GenAlEx prompts for the relevant parameters in a dialog box.
Changing parameters in this box provides an easy way to select subsets of data for analysis.
Parameter locations
Essential parameters are inserted into Row 1. They are: No. Loci (cell A1); No. Samples (cell B1);
No. Populations (cell C1); The size of each population (cell D1 to cell n1).
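As an illustration of the Row 1 layout described above, the short Python sketch below assembles the values that would go into cells A1, B1, C1 and D1 onwards. The function name `parameter_row` and the example numbers are assumptions made for illustration only.

```python
def parameter_row(n_loci: int, pop_sizes: list[int]) -> list[int]:
    """Return the Row 1 values: No. Loci, No. Samples, No. Populations, then each pop size."""
    n_samples = sum(pop_sizes)   # total samples is the sum of the population sizes
    n_pops = len(pop_sizes)
    return [n_loci, n_samples, n_pops, *pop_sizes]


# Made-up dataset: 3 loci and populations of 10, 12 and 8 samples.
row1 = parameter_row(3, [10, 12, 8])
```

Here `row1` holds 3 (A1), 30 (B1), 3 (C1), then 10, 12 and 8 (D1 to F1).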
Data Formats
GenAlEx accepts 3 types of numerically-coded data:
1.
Codominant data with 2 columns per locus.
2.
Dominant, Haploid (including Haplotypes), or Sequence data coded numerically with 1
column per locus/base.
3.
Geographic data with 2 columns for X and Y coordinates.
Tip: GenAlEx also allows you to work with DNA sequences in 2 different formats. However, for
most analyses the sequence needs to be coded numerically by options provided in GenAlEx. After
conversion to numeric format, sequence data are treated like all other haploid data.
Format for codominant data
Codominant data are presented as two columns per locus as in the figure below. Alleles may be
simply numerically-coded (1, 2, 3 etc). Alternatively, and preferably for microsatellite data, alleles
may be coded as their integer size in base pairs (bp), or as the inferred number of simple sequence
repeats. These last two formats are essential for calculation of the distance measure, RST. There is a
limit of 999 numerically-coded alleles. Codominant alleles need not be numbered consecutively.
Example of codominant, numerically-coded data, with regional parameters.
In this example the 4 populations are split into 2 regions with Pops 1 & 2 in Region 1 and Pops 3 &
4 in Region 2. Note the regional parameters are only required for AMOVA.
Example of codominant microsatellite data, with genotypes by fragment size.
Format for dominant, haploid or sequence data
Dominant, haploid (including haplotypes) or sequence data are presented as a single column per
locus. Haploid data can be coded numerically from 1 to n, or each may be represented by multiple
variable sites (columns 1 to n), with multiple states. For sequence or SNP data the bases are
numerically coded as follows: A=1, C=2, G=3, T=4, :=5, -=5, all other characters = 0. GenAlEx
provides several options for the import of sequence data and auto conversion to numbers.
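The numeric coding of bases can be sketched in a few lines of Python. This is only an illustration of the coding rule stated above, not the GenAlEx import routine; the function name `encode_sequence` is hypothetical, and treating lower-case letters like upper-case ones is an assumption.

```python
# Coding stated in the text: A=1, C=2, G=3, T=4, ':'=5, '-'=5, anything else = 0.
BASE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4, ":": 5, "-": 5}


def encode_sequence(seq: str) -> list[int]:
    """Convert a DNA string to one numeric code per base (one column per base)."""
    # Assumption: lower-case input is coded the same as upper-case.
    return [BASE_CODES.get(base, 0) for base in seq.upper()]


codes = encode_sequence("ACGT-N")
```

In this example the ambiguous base N falls under "all other characters" and is coded 0.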
Example of dominant, or binary data.
Example of sequence data, coded numerically at multiple variable sites.
Example of haplotype data, with individual haplotypes coded numerically.
These haplotypes correspond to the sequences shown in the previous example.
Format for geographic data
For convenience, both geographic and genetic distances can be calculated in a single analysis.
Coordinates can be entered as either integer or decimal numbers.
X and Y coordinates may be read by GenAlEx from two different formats.
1. X / Y data are located in the same worksheet as the genetic data, and separated from the genetic
data by a single blank column. This format is used by GenAlEx for various analyses, including
Genetic Distance, Clonal and TwoGener.
Example of geographic data after genetic data.
2.
In a separate worksheet, in columns C and D. In this case, the sample and population labels in
columns A & B will correspond exactly to those for the genetic data. This format is also
appropriate if only geographic distances are required. This format is required for analyses
such as the 2D Spatial autocorrelation.
Example of geographic data in columns 3 & 4.
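To make the idea of a geographic distance matrix concrete, the Python sketch below computes pairwise distances from X and Y coordinates. It assumes straight-line (Euclidean) distance on planar coordinates, which is an assumption made here for illustration and not a description of how GenAlEx performs the calculation; the coordinates are made up.

```python
import math


def geographic_distances(coords):
    """Return a symmetric matrix of straight-line distances between samples.

    coords is a list of (x, y) pairs, one per sample.
    """
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            dist[i][j] = dist[j][i] = d   # fill both halves of the matrix
    return dist


# Made-up coordinates for three samples.
matrix = geographic_distances([(0, 0), (3, 4), (6, 8)])
```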
Missing Data
Virtually all GenAlEx options handle missing data. However, missing data can be particularly
problematic for pairwise distance-based analyses such as AMOVA, Mantel and spatial
autocorrelation. Therefore, a unique option for interpolating missing individual-by-individual
pairwise distances is provided. This action will insert the average genetic distances for each
population level pairwise contrast e.g. within Pop. 1, or between Pop. 1 and Pop. 2. Nonetheless, in
order to avoid excessive bias, large numbers of missing data for individual-based distance
calculations should be minimized.
Codominant and Haploid missing data are coded as 0. Missing Binary data are coded as -1.
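The coding of missing codominant data as 0 can be seen in a small allele frequency calculation. The Python sketch below uses made-up genotypes and a hypothetical function name; it assumes that a genotype is skipped entirely when either allele is 0, which may differ from the handling inside GenAlEx.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies at one codominant locus.

    genotypes is a list of (allele1, allele2) pairs, i.e. the two columns of a locus.
    Assumption: a genotype with a 0 in either column is treated as missing.
    """
    counts = Counter()
    for a1, a2 in genotypes:
        if a1 == 0 or a2 == 0:
            continue                      # skip missing data
        counts.update([a1, a2])
    total = sum(counts.values())
    return {allele: n / total for allele, n in sorted(counts.items())}


# Made-up genotypes; the last sample is missing at this locus.
freqs = allele_frequencies([(190, 194), (194, 194), (192, 194), (0, 0)])
```

The frequencies are based on the six scored alleles only, so the missing sample does not contribute a spurious "allele 0".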
Using Create to Learn about GenAlEx Data Formats
In this section you will use the Create menu option to learn about GenAlEx data formats. This menu
provides options to create random examples of all GenAlEx data formats, both Genetic and
Geographic. These datasets are useful for exploring the range of GenAlEx procedures.
Create dialog box on a PC
Create dialog box on a Macintosh
Ex 1.6 Using Create with Auto Pop Size
In this first exercise we will take advantage of the Auto Pop Size feature in GenAlEx that will
automatically generate even pop sizes, for the number of samples and populations you specify.
Step 1.
Before you proceed, randomly choose a set of numbers within the specified range
as follows, and record them below:
The number of loci (suggested range 1 to 10) =
The number of samples (suggested range 10 to 40, and evenly divisible by the
number of pops chosen below) =
The number of populations (suggested range 2 to 10) =
The number of alleles (suggested range 4 to 9) =
Step 2.
With a workbook open, choose the option Create from the GenAlEx menu, and select
the Codominant submenu.
Step 3.
In the Create Data Parameters dialog box enter the number (#) of loci, # samples, #
populations and # Alleles, as chosen above.
Step 4.
Check the Auto Pop Size and XY Coords options on the dialog box.
Step 5.
Inspect the data sheet generated by GenAlEx. By reference to the numbers you
jotted down, identify the location of the parameters in the data sheet and study the
format for the genotypes and XY coordinates.
Ex 1.7 Using Create with Variable Pop Sizes
In this second exercise you will be given the option to manually enter variable pop sizes.
Step 1.
Choose a new set of numbers as for Ex 1.6. Also choose a set of pop sizes that add
to the total number of samples you have chosen. Jot down the numbers you have
chosen, then proceed.
Step 2.
With a workbook open, choose the option Create from the GenAlEx menu, and select
the Codominant submenu.
Step 3.
In the Create Data Parameters dialog box enter the number (#) of loci, # samples, #
populations and # Alleles required.
Step 4.
Enter the size of each pop in the edit box below Pop. Size, and add to the
population list using the Add Pops option.
Step 5.
Uncheck the default Auto Pop Size and XY Coords options on the dialog box.
Step 6.
Inspect the data sheet generated by GenAlEx. By reference to the numbers you
jotted down, identify the location of the parameters in the data sheet and study the
format for the genotypes and XY coordinates.
Ex 1.8 Using Create with Other Data Types
Now that you are up and running, use the Create option to generate some random
demonstration data for other types of GenAlEx data formats, such as haploid or binary.
Tip: The Create option is a great way to troubleshoot the occasional data analysis problem you
might encounter in GenAlEx. Suppose you have a large data set that GenAlEx is unable to analyse.
Often you will be given a warning by GenAlEx to check your data and parameters. Sometimes such a
problem is associated with something unusual about your data, rather than the parameters. To
check whether or not this is the case, simply use the Create option to generate a data set of the
same size (No. Samples, No. Pops, Pop Sizes, No. Loci etc). Now perform the required analysis in
GenAlEx. If the created data set runs, check your own data carefully. Look for things like missing
data for all samples in a population at a specific locus. Such a case might trigger an error. Check
for unusual data values that might be typos etc.
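One of the checks suggested in the tip, looking for a population in which every sample is missing at a locus, can be sketched in Python as follows. The function name, the data layout (a dictionary of loci) and the example values are assumptions for illustration; the check itself is done by eye or with Excel tools in practice.

```python
def all_missing_cells(pops, genotypes_by_locus):
    """Return (population, locus) pairs where every sample is missing (coded 0).

    pops: population label for each sample, in sample order.
    genotypes_by_locus: {locus_name: [(allele1, allele2), ...]} in the same order.
    """
    problems = []
    for locus, genotypes in genotypes_by_locus.items():
        scored = {pop: False for pop in pops}
        for pop, (a1, a2) in zip(pops, genotypes):
            if a1 != 0 and a2 != 0:
                scored[pop] = True        # at least one scored genotype in this pop
        problems += [(pop, locus) for pop, ok in scored.items() if not ok]
    return problems


pops = ["Pop1", "Pop1", "Pop2", "Pop2"]
data = {
    "LocusA": [(1, 2), (2, 2), (1, 1), (0, 0)],
    "LocusB": [(3, 3), (3, 4), (0, 0), (0, 0)],
}
flagged = all_missing_cells(pops, data)
```

In this made-up example only Pop2 at LocusB is flagged, because Pop2 still has one scored genotype at LocusA.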
Ex 1.9 Using Template as a Starting Point for Data Entry
The Create option is provided in GenAlEx primarily for users to create examples of the formats
of the various data types that can be analysed by the software. If you are entering small data
sets by hand, and you know how many samples, and from which populations they come, you
can take advantage of the Template option.
Here we return to the microsatellite data set scored in Ex 1.1. First you will use the Template
option to set up the data for entry from the completed table in Ex 1.1. Once your data have
been entered you can use GenAlEx to check your answers to the hand calculations in Ex 1.2
and 1.5.
Step 1.
Open a new Excel document. Now use the GenAlEx option Template->Codominant to
quickly setup the table for data entry. What parameters will you use for this data
set? How many samples? How many populations? How many loci?
The Template dialog box and the template
created shown to the left
Step 2.
Enter the genotypes for all samples from Ex 1.1. Name the worksheet Ex 1.1 Data
and save the workbook with the name Ex 1.1 Rats.xls.
Step 3.
To check your hand calculations using GenAlEx, first activate the worksheet
containing the Ex 1.1 Data. Next, choose Frequency from the GenAlEx menu then
select Freq by Pop, Het, Fstat & Poly by Locus and Step by Step in the Codominant
Frequency Options dialog box.
Step 4.
Compare your hand calculations with GenAlEx. How did you go?
GenAlEx Data Parameters
Parameters submenu on PC
Parameters submenu on a Macintosh
The Parameters option provides a quick and easy way to obtain the necessary GenAlEx parameters
from a pre-existing dataset, and insert them in their correct location. Data must be in standard
GenAlEx format, with samples in column 1, population labels in column 2 (or in col. 1), and data
starting in cell C4. The dataset needs to be bounded below by an empty row and to the right by an
empty column, as GenAlEx uses empty cells to identify the data limits. All samples per population
must have the same population label, and be in a contiguous block. For each menu sub-option,
GenAlEx will interrogate the chosen column(s) and insert the corresponding parameters in their
correct locations. An option to insert the header rows into an unformatted dataset is also provided.
Always remember that the Parameters menu option requires:
1.
Data to be in standard GenAlEx format, with sample codes in column 1, population labels in
column 2, and data starting in cell C4.
2.
Data to be bounded by an empty row below the last sample and an empty column at the right
of the last locus entry.
3.
All samples within a population must have the same population label, and be in a contiguous
block.
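The contiguous-block rule can be illustrated with a short Python sketch that derives the parameters from the population label column alone. This is a hedged illustration of the rule, not the GenAlEx Parameters routine; `derive_parameters` and its return layout are hypothetical.

```python
from itertools import groupby


def derive_parameters(pop_labels, n_loci):
    """Derive the Row 1 parameters from the population label column.

    Raises ValueError if the samples of a population are not in one contiguous block.
    """
    # groupby collects consecutive equal labels, i.e. contiguous blocks.
    blocks = [(label, len(list(group))) for label, group in groupby(pop_labels)]
    names = [label for label, _ in blocks]
    if len(names) != len(set(names)):
        raise ValueError("samples of a population must form one contiguous block")
    sizes = [size for _, size in blocks]
    return {"loci": n_loci, "samples": sum(sizes), "pops": len(sizes),
            "pop_sizes": dict(blocks)}


params = derive_parameters(["A", "A", "B", "B", "B"], n_loci=4)
```

If the labels were ordered A, B, A, the same label would start two blocks and the function would raise an error, which mirrors why unsorted population labels give wrong population counts.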
Ex 1.10 Getting Population Parameters
Real genetic data sets may be imported into GenAlEx directly from genotyping software or
other data sources. In these cases, GenAlEx can determine the parameters for you provided
you follow the rules for GenAlEx formats. In this exercise you are provided with a real
codominant genetic data set from a study of bush rats by Peakall and Lindenmayer (2006).
Step 1.
Open the workbook called Ex 1.10 Bush Rats Raw Data, read the info provided then
activate the worksheet containing the data.
Step 2.
Inspect the data provided. Note it has not yet been formatted for GenAlEx analysis.
Step 3.
Convert the data into codominant GenAlEx format.
Hint: This may require the addition and deletion of rows or columns.
Step 4.
Using GenAlEx, automatically obtain parameters for the data set, and record your
answers below:
The number of codominant loci =
The number of samples =
The number of populations =
The name and number of samples in the first population =
The name and number of samples in the last population =
Using Data to Work Efficiently
The Data menu option offers several commands for quickly manipulating your dataset. In all cases,
data must be in appropriate GenAlEx format (including parameters). Two useful options are:
Sort on Sample: Sorts the entire dataset on the sample label (in Column A).
Sort on Pop: Sorts the entire dataset on the population label (in Column B).
Data Exploration and Allele Frequencies
Broadly speaking, population genetic analysis proceeds along one of two pathways: frequency-based analysis and distance-based analysis. For frequency-based analyses an estimate of allele
frequencies is the basis for most downstream calculations. For codominant data, frequency-based
analyses include F-statistics, Nei's genetic distance, population assignment procedures, estimates of
genotypic probabilities, probabilities of identity, probabilities of exclusion and pairwise relatedness
estimates, among others. A subset of these frequency-based analyses is also applicable to haploid
and binary data.
By contrast to frequency-based analyses, genetic distance-based analyses are relatively new. For
these analyses the starting point is the conversion of genetic data into a pairwise individual-by-individual genetic distance matrix. Distance matrices can be calculated for all kinds of genetic data
including codominant, haploid and binary genetic markers, and DNA sequences. Once a
genetic distance matrix is calculated, further extensive genetic analysis can be performed including:
Analysis of Molecular Variance (AMOVA); Principal Coordinates Analysis (PCA); UPGMA and
Neighbor Joining Tree building; Mantel Tests; Spatial Autocorrelation analyses; and TwoGener.
Genetic data exploration is the first step of any population genetic analysis. GenAlEx provides
some powerful graphic tools to aid this important first step. The calculation of allele frequencies
and various summary statistics such as the number of different alleles, observed and expected
heterozygosity (or equivalent diversity estimates for haploid data) represent critical baseline
statistics that should be reported in every population genetic study. However, even before further
analysis, let alone publication of results, inspecting the outcomes of allele frequencies and summary
statistics is important for identifying problems that might be attributable to incorrect scoring of your
DNA profiles, or errors in data entry. If you find unexpected results at this stage, and you can rule
out error, this data exploration step can also reveal interesting genetic patterns that might provide
unexpected insights into the biology of your study species.
Ex 1.11 Plots of Allele Frequency
Preliminary genetic data exploration and the calculation of allele frequency are intertwined.
The Frequency menu option in the GenAlEx menu is the entry point for this analysis. The small
data set in this exercise is drawn from the plant Glycine clandestina, an Australian native
relative of the soybean. This species has an unusual reproductive biology - it produces two
kinds of flowers: Normal 'Open pollinated' flowers and 'Closed or cleistogamous' flowers. The
open flowers are typical of pea flowers in general requiring insect pollinators for seed set. The
'Closed or Cleistogamous' flowers are adapted to self pollination, regularly producing seed
without the aid of pollinators. The seeds of the species lack an obvious dispersal mechanism
and it appears most seeds fall close to the parent plant. Data are provided for 4 populations,
two from within Canberra (Aranda and Taylor) and two from the Brindabella range (Brind and
Franklin) some 50 km west of Canberra.
Step 1.
Return to Excel and open the workbook Ex 1.11 Glycine, then activate the Data
worksheet. Next, choose Frequency from the GenAlEx menu. You will first be
prompted with the Allele Frequency Data Parameters dialog box. Click OK. Next select
Freq by Pop and check the options Graph by Locus and Graph by Pop for each Locus
in the Codominant Frequency Options dialog box.
Step 2.
Inspect the outcomes in the three worksheets with suffixes AFP, AGF and AGP and
answer the questions below. Two example graphs generated by the analysis are
shown below.
Example of Allele Frequency Graph by Locus
[Figure: Allele Frequency for satt478. Frequency (0.00 to 1.00) of alleles 154, 157, 160, 163, 166, 169 and 175 at locus satt478 in the populations Aranda, Taylor, Brind and Franklin.]
Example of Allele Frequency Graph by Pop
[Figure: Allele Frequency at sat040 for Aranda (n=14). Allele 190: 18%; allele 192: 25%; allele 194: 57%.]
Q 1.11 Questions
1.
Based on the allele frequency patterns, do you predict there is genetic
differentiation among the Glycine populations? How might you test this prediction?
Ex 1.12 Heterozygosity, F-statistics and Allelic Patterns
Inspection of the population allele frequency graphs for the Glycine example reveals some
interesting patterns among populations and loci. The Frequency menu option offers other tools
for data exploration and analysis including various Heterozygosity estimates and F-statistics
analysis (via allele frequency).
Step 1.
Return to the workbook Ex 1.11 Glycine, re-activate the Data worksheet, then
choose Frequency from the GenAlEx menu. You will first be prompted with the Allele
Frequency Data Parameters dialog box. Click OK. Next, uncheck the options previously
run in Ex. 1.9 then check Het, Fstat & Poly by Pop, Het, Fstat & Poly by Locus, Allelic
Patterns and Graph Pattern in the Codominant Frequency Options dialog box.
Step 2.
Inspect the outcomes in the three worksheets with suffixes HFP, HFL and APT. The
allelic patterns graph generated by this analysis is shown below (with minor
additional modification available in Excel for all GenAlEx graphs).
Example of an Allelic Patterns Graph
[Figure: Allelic Patterns across Populations. X-axis: Populations (Aranda, Taylor, Brind, Franklin). Axes: Mean (0.0 to 10.0) and Heterozygosity (0.0 to 0.9). Series: Na, Na Freq. >= 5%, Ne, I, No. Private Alleles, No. LComm Alleles (<=25%), No. LComm Alleles (<=50%), He.]
Q 1.12 Questions
Based on your inspection of the results in the worksheet HFP:
1.
Summarize the findings for observed versus expected heterozygosities?
2.
What do you conclude about the extent of inbreeding?
3.
Did you detect genetic differentiation among the populations? Is the differentiation
significant?
No. This is a trick question to make students think about how they could test for
significant difference.
Based on your inspection of the results in the worksheet APT:
4.
Briefly describe the key allelic patterns that are revealed across the four
populations. What biological factors might explain the patterns observed?
Shannon Diversity Indices in Population Genetics
Shannon's diversity index for information theory (Shannon 1948) has been widely employed in
ecology but has been less widely used in population genetics. In a recent series of studies, Sherwin
et al. (2006) and Rossetto et al. (2008) have shown both by computer simulation and for real data
sets that Shannon's indices offer some ideal statistical properties for measuring biological
information across multiple scales from genes to landscapes. In particular, the capacity to apply the
indices at multiple scales is unique among the commonly employed population statistics.
Furthermore, Shannon's mutual information index SHUA not only provides a convenient measure of
differentiation among populations, but it can be readily converted to the log-likelihood contingency
test G statistic, enabling a convenient chi-square based statistical test for allele frequency differences
at each locus for each pairwise combination of populations. Finally, for diploid species with large
estimated effective population size, SHUA can be converted to an estimate of Nm (Number of
Migrants).
Tip: For more extensive background to Shannon Diversity see Box 1.6 and Appendix 1.1.
Ex 1.13 Hand Calculation of Shannon's Indices
Shannon's indices are remarkably straightforward to calculate by hand, requiring only
knowledge of the allele frequencies and the sample sizes. In this exercise we will work through
the steps for calculating these indices drawing on a subset of data from a study of the plant
Glycine clandestina introduced in earlier exercises. In this case microsatellite genotype data
are provided for the locus AG48 for 10 samples each in two Canberra populations, Aranda and
Taylor.
Step 1.
The raw genotype data are shown below. Inspect the data and answer question 1
before proceeding.
Step 2.
Calculate wt1 = ct1/(ct1+ct2) = _____ and wt2 = ct2/(ct1+ct2) = _____ where ct1 =
20 and ct2 = 20 (2 x No. of Samples in each Pop).
Step 3.
Allele frequencies for the two populations have been calculated for you in the table
below. Calculate the weighted mean frequency for each allele as pi1 x wt1 + pi2 x
wt2.
Tip: In this case because sample sizes are the same the weighted mean is simply the arithmetic
mean.
Allele   Aranda   Taylor   Mean     -pi1 Log2 pi1   -pi2 Log2 pi2   -pi Log2 pi
         pi1      pi2      pi       (Col X)         (Col Y)         (Col Z)
266      0.050    0.000    0.025    0.216           0.000           0.133
268      0.000    0.050    0.025    0.000           0.216           0.133
270      0.000    0.850    0.425    0.000           0.199           0.525
272      0.550    0.100    0.325    0.474           0.332           0.527
280      0.400    0.000    0.200    0.529           0.000           0.464
Sum      1.000    1.000    1.000    1.219           0.748           1.782
Step 4.
To calculate SHA1 for the Aranda pop, first compute -1 x pi1 x Log2 pi1 for each allele
and enter the values in Col X, then sum these values across the 5 alleles.
Step 5.
To calculate SHA2 for the Taylor pop, first compute -1 x pi2 x Log2 pi2 for each allele
and enter the values in Col Y, then sum these values across the 5 alleles.
Step 6.
To calculate SHU compute -1 x pi x Log2 pi for each allele, where pi is the weighted mean
frequency, and enter the values in Col Z, then sum these values across the 5 alleles.
Tip: On a calculator, to obtain the log2 of the value pi you need to calculate log(pi)/log(2). If you are
using Excel, use the function LOG(number, base); in this case enter =LOG(pi, 2).
Step 7.
Calculate SHUA by the formula SHUA = SHU - wt1 x SHA1 - wt2 x SHA2.
Step 8.
Calculate G by the formula G=1.3863 x SHUA x (ct1+ct2) where ct1 = 20 and ct2 = 20
(2 x No. of Samples in each Pop).
Step 9.
Calculate the DF as the (No. of populations - 1) x (No. of alleles compared - 1) = (2 - 1) x (5 - 1) = _____.
Step 10.
Look up the Chi-Square Probability for the G-test given the degrees of freedom DF
using the table provided in the section on Hardy-Weinberg Equilibrium later in this
module. Record your answers from Step 7 to Step 10 in the table below.
Tip: When using Excel you can easily calculate the Chi-Square probability using the function
CHIDIST(x,deg_freedom), in this case enter =CHIDIST(G,DF).
Shannon Statistic   Value
SHA1                1.219
SHA2                0.748
SHU                 1.782
SHUA                0.799
G                   44.290
DF                  4
Chi-Sq Prob         0.000
Step 11.
Check your answers using GenAlEx. The data can be found in the workbook Ex 1.13
Glycine by Hand.xls. Choose Shannon->Pairwise Pops. When prompted by the Shannon
Analysis Options dialog box choose By Locus, Output for Each Locus, Output Pairwise
Matrices as Table, Step by Step and Log Base 2.
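For readers who prefer to script the hand calculation, Steps 2 to 10 can be sketched in Python as below. The variable names are chosen for this illustration only. The helper `chi_square_tail` is a hypothetical stand-in for Excel's CHIDIST and uses a closed form that holds only for even degrees of freedom, which is sufficient here because DF = 4.

```python
import math

# Allele frequencies at locus AG48 for alleles 266, 268, 270, 272, 280.
p1 = [0.050, 0.000, 0.000, 0.550, 0.400]   # Aranda
p2 = [0.000, 0.050, 0.850, 0.100, 0.000]   # Taylor
ct1 = ct2 = 20                              # 2 x 10 samples per population


def shannon(freqs):
    """-sum(p * log2 p), treating 0 * log2(0) as 0."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)


wt1 = ct1 / (ct1 + ct2)                                   # Step 2
wt2 = ct2 / (ct1 + ct2)
p_mean = [a * wt1 + b * wt2 for a, b in zip(p1, p2)]      # Step 3

sh_a1 = shannon(p1)                                       # Step 4
sh_a2 = shannon(p2)                                       # Step 5
sh_u = shannon(p_mean)                                    # Step 6
sh_ua = sh_u - wt1 * sh_a1 - wt2 * sh_a2                  # Step 7
g = 1.3863 * sh_ua * (ct1 + ct2)                          # Step 8
df = (2 - 1) * (len(p1) - 1)                              # Step 9


def chi_square_tail(x, dof):
    """Upper-tail chi-square probability; closed form valid for even dof only."""
    if dof % 2:
        raise ValueError("this sketch handles even degrees of freedom only")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))


prob = chi_square_tail(g, df)                             # Step 10
```

Rounded to three decimals, the values agree with the table of Shannon statistics, and the probability is far below 0.001.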
Q 1.13 Questions
1.
Based on your inspection of raw genotypes shown above, summarise the patterns
of allele frequency differences between the two populations.
2.
Do you predict there will be a significant difference in allele frequencies between
the two populations?
3.
Summarise and interpret the outcomes of the G-test for allele frequency
differences.
Box 1.6 Shannon's Information Indices
Based on Sherwin et al. (2006) Mol. Ecol. 15, 2857-2869, see also Appendix 1.1.
In general, for a specific locus in a given population, Shannon's Allele Information index ${}^{S}H_{A}$ is
calculated as:
$$ {}^{S}H_{A} = -\sum_{i} p_{i} \log_{2} p_{i} $$
At this specific locus across multiple populations we consider each pairwise combination of
populations in turn, calculating ${}^{S}H_{A}$ for each of the two populations:
$$ {}^{S}H_{A1} = -\sum_{i} p_{i1} \log_{2} p_{i1} \quad \text{and} \quad {}^{S}H_{A2} = -\sum_{i} p_{i2} \log_{2} p_{i2} $$
where $p_{i1}$ and $p_{i2}$ are the frequencies of the ith allele at the locus in question for the specified population
(1 or 2).
Shannon's Total Information index across each pair of populations is calculated as:
$$ {}^{S}H_{U} = -\sum_{i} \bar{p}_{i} \log_{2} \bar{p}_{i} $$
where $\bar{p}_{i}$ is the average weighted frequency of the ith allele for each pair of populations:
$$ \bar{p}_{i} = p_{i1} \, wt_{1} + p_{i2} \, wt_{2} $$
with
$$ wt_{1} = \frac{ct_{1}}{ct_{1} + ct_{2}} \quad \text{and} \quad wt_{2} = \frac{ct_{2}}{ct_{1} + ct_{2}} $$
and ct = the total allele count at the locus for the respective populations.
Finally, Shannon's Mutual Information index ${}^{S}H_{UA}$ is calculated for each pair of populations as:
$$ {}^{S}H_{UA} = {}^{S}H_{U} - wt_{1} \, {}^{S}H_{A1} - wt_{2} \, {}^{S}H_{A2} $$
Shannon's Mutual Information index can now be used to compute the log-likelihood contingency
test statistic G as:
$$ G = 1.3863 \, {}^{S}H_{UA} \, (ct_{1} + ct_{2}) $$
With degrees of freedom DF calculated as the (number of populations compared - 1) x (number of
alleles compared - 1).
For diploid species with effective population sizes > 500, estimates of Nm among pairs of
populations can be computed as:
$$ Nm = \left( \frac{0.156}{{}^{S}H_{UA}} \right) $$
Nei Genetic Distance
Ex 1.14 Hand Calculation of Nei's Genetic Distance
While FST is perhaps the most widely used measure of genetic differentiation among
populations, another frequently used estimate of the genetic difference among populations is
Nei's Genetic Distance D. In this exercise we will utilise the same data set as for Ex 1.13.
Record your answers in the Table below.
Step 1.
Allele frequencies for the two Glycine populations are shown below. Calculate
squared allele frequencies for each allele in population x and population y.
Step 2.
Calculate Jx as the sum of the squared allele frequencies for population x. Calculate
Jy as the sum of the squared allele frequencies for population y.
Allele   Aranda (x)   Taylor (y)   x^2     y^2     xy
266      0.050        0.000        0.003   0.000   0.000
268      0.000        0.050        0.000   0.003   0.000
270      0.000        0.850        0.000   0.723   0.000
272      0.550        0.100        0.303   0.010   0.055
280      0.400        0.000        0.160   0.000   0.000
Sum      1.000        1.000        0.465   0.735   0.055
                                   (Jx)    (Jy)    (Jxy)
Nei I    0.094
Nei D    2.364
Step 3.
For each allele calculate the product of allele frequency in population x and
population y. Calculate Jxy as the sum of the products of allele frequency.
Step 4.
Calculate Nei I as Jxy/(JxJy)^0.5.
Step 5.
Calculate Nei D as -Ln(I). Record values in the table above.
Step 6.
Check your answers using the Frequency option in GenAlEx. The data can be found in
the workbook Ex 1.13 Glycine by Hand.xls. In the Codominant Frequency Options
dialog box check Frequency by Pop, Nei Distance and Step by Step.
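As a cross-check on the hand calculation, the steps above can be sketched in Python. This is a minimal single-locus illustration, not GenAlEx's implementation: the function name and the dictionary input format are assumptions, while the allele frequencies are those given for Aranda and Taylor in the table.

```python
import math


def nei_identity_distance(freq_x, freq_y):
    """Return (Jx, Jy, Jxy, Nei I, Nei D) for one locus.

    freq_x, freq_y: dicts mapping allele -> frequency; alleles absent
    from a population are treated as frequency 0.
    """
    alleles = set(freq_x) | set(freq_y)
    jx = sum(freq_x.get(a, 0.0) ** 2 for a in alleles)
    jy = sum(freq_y.get(a, 0.0) ** 2 for a in alleles)
    jxy = sum(freq_x.get(a, 0.0) * freq_y.get(a, 0.0) for a in alleles)
    identity = jxy / math.sqrt(jx * jy)
    # D is undefined (infinite) when the populations share no alleles (I = 0)
    distance = -math.log(identity)
    return jx, jy, jxy, identity, distance


aranda = {266: 0.050, 272: 0.550, 280: 0.400}
taylor = {268: 0.050, 270: 0.850, 272: 0.100}

jx, jy, jxy, nei_i, nei_d = nei_identity_distance(aranda, taylor)
print(f"Jx={jx:.3f} Jy={jy:.3f} Jxy={jxy:.3f} I={nei_i:.3f} D={nei_d:.3f}")
```

The only allele shared by the two populations is 272, so Jxy reduces to the single product 0.550 x 0.100, which is why identity is so low for this locus.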
Box 1.7 Nei's Genetic Identity and Distance
$$\text{Nei } I = \frac{J_{xy}}{\sqrt{J_x J_y}}$$
Where $J_{xy} = \sum_{i=1}^{k} p_{ix} p_{iy}$, $J_x = \sum_{i=1}^{k} p_{ix}^2$, and $J_y = \sum_{i=1}^{k} p_{iy}^2$.
Here I is Nei's Genetic Identity, and pix and piy are the frequencies of the i-th allele in
populations x and y. For multiple loci, Jxy, Jx and Jy are calculated by summing over all loci and
alleles and dividing by the number of loci. These average values are then used to calculate I.
$$\text{Nei } D = -\ln(I)$$
Nei's genetic identity ranges from 0 to 1. Consequently, Nei's Genetic Distance ranges from 0 to
infinity (Nei 1972, 1978). Note an unbiased estimate of Nei's I and Nei's D is also available in
GenAlEx. Hedrick (2000) suggests this correction may give spurious results when homozygosity is
low and sample size is small. This unbiased estimator may also give slightly negative values for
Nei's Unbiased Genetic distance, which should be interpreted as zero.
Pairwise Population Genetic Analysis
Ex 1.15 Pairwise Fst and Nei Genetic Distances
FST, when reported as a single statistic over loci and populations, provides an estimate of
average differentiation. However, by exploring the patterns of differentiation among each pair
of populations you can learn more about the genetic relationships than is evident from the
average FST value on its own. Within GenAlEx both pairwise FST and Nei Genetic Distance can
be readily computed for each pairwise combination of populations and summarized as a
matrix. Both options are offered in GenAlEx via the Frequency menu.
Step 1.
Return to the workbook Ex 1.11 Glycine and re-activate the Data worksheet.
Choose Frequency from the GenAlEx menu, click Uncheck All, then choose Nei
Distance, Pairwise Fst, Output Pairwise Matrix, Output Labeled Pairwise Matrix and
Output Pairwise Matrix as Table from the Codominant Frequency Options dialog box.
Tip: You might find it easier to move a copy of the data sheet from Ex 1.11 into a new workbook
called Ex. 1.15. Right-click on the worksheet tab at the left-hand corner of the workbook and use
the Move or Copy option to achieve this task.
Step 2.
Inspect the outcomes in the four worksheets with suffixes NeiP, NeiL, FstP and FstL,
in order to understand the nature of the output.
Step 3.
Now based on the summary worksheets NeiT and FstT, answer the questions below.
Q 1.15 Questions
Based on your inspection of the results in the worksheet NeiT:
1.
Describe the genetic relationships among the populations indicated by the Nei
Genetic distances. Which pairs of populations are genetically most similar?
Based on your inspection of the results in the worksheet FstT:
2.
Describe the genetic relationships among the populations indicated by the pairwise
FST values. Which pairs of populations are genetically most similar?
3.
Compare the pairwise FST results with the pairwise Nei D. What do you conclude?
4.
How would you test for a correlation between the pairwise FST results and the
pairwise Nei D?
Ex 1.16 Pairwise calculation of Shannon's Indices
Shannon's diversity indices can also be computed at each locus for
each pairwise combination of populations. Here we use the data from Ex 1.11 (and Ex 1.15)
for Shannon analysis with a focus on the mean values over loci.
Step 1.
Return to the workbook Ex 1.11 Glycine and re-activate the Data worksheet.
Choose Shannon->Pairwise Pops. When prompted by the Shannon Analysis Options
dialog box choose Output for Each Locus and Output Pairwise Matrices as Table.
Tip: You might find it easier to move a copy of the data sheet from Ex 1.11 into a new workbook
called Ex. 1.16. Right-click on the worksheet tab at the left-hand bottom of the workbook and use
the Move or Copy option to achieve this task.
Q 1.16 Questions
Based on your inspection of the results in the worksheet SH:
1.
Describe the genetic relationships among the populations indicated by the Shannon
Mutual Information Index sHUA. Which pairs of populations are genetically similar?
2.
Complete the table below, drawing on the outcomes of analysis from Ex 1.15 and Ex
1.16. Do the different pairwise population statistics reveal similar genetic patterns?
Summary of pairwise population values of FST, Nei Distance and Shannon's Mutual
Information index among pairs of the four Glycine populations.
Pop1     Pop2       FST     Nei D   Nei I   sHUA
Aranda   Taylor     0.061   0.208   0.813   0.337
Aranda   Brind      0.136   1.012   0.364   0.531
Taylor   Brind      0.163   1.231   0.292   0.655
Aranda   Franklin   0.208   1.633   0.195   0.672
Taylor   Franklin   0.238   2.032   0.131   0.856
Brind    Franklin   0.047   0.217   0.805   0.333
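One simple way to explore whether two pairwise statistics tell the same story is to correlate them across population pairs. The sketch below computes an ordinary Pearson correlation between the pairwise FST and Nei D values from the summary table; the function name is an assumption made for illustration. Note that the coefficient alone is not a significance test: pairwise values share populations and so are not independent, which is why a permutation approach such as a Mantel test is normally used to assess significance.

```python
import math

# Pairwise values from the summary table, in the same population-pair order
fst = [0.061, 0.136, 0.163, 0.208, 0.238, 0.047]
nei_d = [0.208, 1.012, 1.231, 1.633, 2.032, 0.217]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(fst, nei_d)
print(f"r = {r:.3f}")
```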
Notes:
Principal Coordinate Analysis (PCA)
Even a matrix as small as the 4x4 Nei Genetic distance matrix from our Glycine analysis (shown
below) can be a little difficult to read and interpret. Larger matrices become impossible to interpret.
Ideally, what we need is a way of visualizing the patterns of genetic relationship contained in such a
matrix. Principal Coordinate Analysis (PCA) provides such a tool.
PCA is a multivariate technique that allows one to find and plot the major patterns within a
multivariate data set (e.g. multiple loci and multiple samples). The mathematics is complex, but in
essence PCA is a process by which the major axes of variation are located within a
multidimensional data set. Each successive axis explains proportionately less of the total variation,
such that when there are distinct groups, the first 2 or 3 axes will typically reveal most of the
separation among groups.
The Pairwise Nei Genetic Distance Matrix Among the 4 Glycine Populations
           Aranda   Taylor   Brind    Franklin
Aranda     0.000
Taylor     0.208    0.000
Brind      1.012    1.231    0.000
Franklin   1.633    2.032    0.217    0.000
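To give a feel for what happens inside such an analysis, the sketch below extracts the first axis from the Nei distance matrix above using classical (Gower) principal coordinate analysis: the squared distances are double-centred and the dominant eigenvector is found by power iteration. This is an assumption-laden illustration in pure Python, not the algorithm GenAlEx runs, and the GenAlEx options may scale the axes differently. The function names are hypothetical.

```python
import math

labels = ["Aranda", "Taylor", "Brind", "Franklin"]
nei_d = [
    [0.000, 0.208, 1.012, 1.633],
    [0.208, 0.000, 1.231, 2.032],
    [1.012, 1.231, 0.000, 0.217],
    [1.633, 2.032, 0.217, 0.000],
]


def double_centre(dist):
    """Gower transformation: centre -0.5 * d^2 by row, column and grand means."""
    n = len(dist)
    a = [[-0.5 * d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in a]  # equals column means (symmetric matrix)
    grand = sum(row_mean) / n
    return [[a[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
            for i in range(n)]


def first_axis(b, iterations=500):
    """Dominant eigenvalue and scaled coordinates by power iteration."""
    n = len(b)
    v = [1.0] + [0.0] * (n - 1)  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(x * y for x, y in zip(v, bv))  # Rayleigh quotient
    coords = [math.sqrt(eigenvalue) * x for x in v]
    return eigenvalue, coords


eigenvalue, axis1 = first_axis(double_centre(nei_d))
for name, coord in zip(labels, axis1):
    print(f"{name:9s} {coord: .3f}")
```

On this matrix the first axis places Aranda and Taylor on one side of the origin and Brind and Franklin on the other, mirroring the two small distances (0.208 and 0.217) in the matrix.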
Ex 1.17 Steps for Performing PCA
In this course we will leave the complex mathematics behind PCA to GenAlEx (although we will
return to PCA in a later module). For now all that is needed is an appropriately formatted
distance matrix. Here we will use PCA to visualize the genetic relationships revealed by the Nei
D, FST and sHUA analysis of the 4 Glycine populations in Ex 1.15 and Ex 1.16.
Step 1.
Return to the workbook Ex 1.11 Glycine (or the renamed Ex. 1.15 and Ex. 1.16) and
move a copy of the NeiP, FstP and SHuaP worksheets to a new workbook. Name the
workbook Ex 1.17 PCA.
Step 2.
Activate the worksheet NeiP. Note that for PCA a distance matrix in GenAlEx format
is required with the matrix diagonal starting in the cell A4, and parameters as
shown.
Step 3.
Choose PCA from the GenAlEx menu, accept the default options of TriDistance
Matrix, Covariance-Standardized and Data Labels in the PCA Options dialog box.
Step 4.
Inspect the outcomes of the PCA analysis and answer question 1 below.
Step 5.
Repeat step 3 in order to produce a PCA plot from the worksheets FstP and SHuaP,
then answer questions 2 and 3.
Q 1.17 Questions
1.
Summarize the outcomes of your PCA analysis of Nei Genetic Distance in words.
How well does this PCA plot represent the original data? (Hint: check the
percentage of variation explained by the first 2 axes)
2.
Compare the outcomes of the three PCA analyses. Do they reveal a similar pattern?
3.
Do your PCA plots suggest regional genetic structure in Glycine? How would you
test for this pattern?
Hardy-Weinberg Equilibrium
For codominant genotypes at a single locus, and for a single population, we can determine whether
the observed tallies of genotypes are consistent with the expectations under random mating by
performing a Chi-Square Test of Hardy-Weinberg Equilibrium.
Before one conducts any statistical test it is important to understand the null (H0) and alternative
hypotheses (H1). Typically in biology the null hypothesis concerns the condition of No
Difference.
In the case of tests for Hardy-Weinberg Equilibrium:
H0 = No departure from random mating expectations (F = 0), i.e. the population is randomly mating.
H1 = Departure from random mating expectations (F ≠ 0), i.e. the population is not randomly mating.
Note that more sophisticated statistical methods than the Chi-Square Test are available for testing
for Hardy-Weinberg Equilibrium. These procedures are offered by software packages such as
GenePop and Arlequin and are recommended for final publication purposes. GenAlEx offers
options for exporting your data to these packages.
Ex 1.18 Testing for Hardy-Weinberg Equilibrium
Small but real data sets for two plant examples with contrasting reproductive systems, Glycine
clandestina and Caladenia tentaculata, are shown below. Complete the steps 1 to 8 for each
species to determine whether or not they conform to Hardy-Weinberg Equilibrium then answer
questions 1 to 3. For simplicity, steps 1 to 4 have already been completed for you.
Step 1.
Determine the number of samples.
Step 2.
Determine the number of alleles, Na.
Step 3.
Count the numbers of each genotype.
Step 4.
Calculate allele frequencies.
Step 5.
Estimate the expected number of each genotype by multiplying the sample size of the
population by the expected genotype frequency, either p² for a homozygous genotype
or 2pq for a heterozygous genotype.
Step 6.
Test for conformity with HWE expectations by calculating the Chi-squared statistic
X².
Step 7.
Determine the degrees of freedom as DF = [Na(Na-1)]/2
Step 8.
Given the calculated Chi-squared value and the degrees of freedom, estimate the
probability of the observed numbers deviating as far from the expected numbers by
chance alone from the table below.
If the probability of obtaining the observed Chi-squared value (given the degrees of
freedom) is greater than 0.05 (P in the range 0.05 to 1.0), the result is NOT
statistically significant and we accept the null hypothesis H0 = The population is
mating randomly.
If the probability of obtaining the observed Chi-squared value (given the degrees of
freedom) is less than 0.05 (in the range 0 < P < 0.05), we conclude that the result
is statistically significant, and we reject the null hypothesis H0, in favour of H1 =
The population is NOT mating randomly.
Step 9.
Record your answer in the tables below, and then answer the questions.
Step 10.
Check your hand calculations using GenAlEx. The data are provided in Ex 1.18 HWE
Glycine.xls and Ex 1.18 HWE Caladenia.xls.
Table showing critical values of X².
       Upper-tail Probability
DF     0.05     0.01     0.005    0.001
1      3.841    6.635    7.879    10.828
2      5.991    9.210    10.597   13.816
3      7.815    11.345   12.838   16.266
4      9.488    13.277   14.860   18.467
You can use this table, given the degrees of freedom, to estimate the upper-tail probability for
your calculated Chi-Square value. For example if DF=1 and your Chi-Square value is 5.5, the P
value is less than 0.05, but greater than 0.01.
Tip: When using Excel you can easily calculate the Chi-Square probability using the function
CHIDIST(x,deg_freedom), in this case enter =CHIDIST(X2,DF).
Box 1.8 Chi-square for Hardy-Weinberg Equilibrium (HWE)
$$X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
Where the summation over the k genotypes is based on Oi, the observed number of individuals of
the i-th genotype, and Ei, the expected number for the i-th genotype. Ei is calculated as the sample
size multiplied by the expected genotype frequency, either p² for a homozygous genotype or 2pq
for a heterozygous genotype.
Degrees of freedom for the Chi-Squared test can be calculated one of two ways:
DF = (No. of genotype classes)-Na
or
DF = [Na(Na-1)]/2, where Na is the number of alleles at the locus.
The second formula is more convenient when there are a large number of alleles as is frequently the
case with genetic markers such as microsatellites or STRs.
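The chi-square calculation in this box can be sketched in Python as follows. This is a minimal illustration rather than GenAlEx's own routine: the function name and the genotype-count dictionary are assumptions, allele frequencies are estimated from the genotype counts themselves, and the example uses the Glycine genotype counts from Ex 1.18 (7, 5 and 18 individuals). The result is compared with the 0.05 critical values rather than converted to an exact probability.

```python
from itertools import combinations_with_replacement


def hwe_chi_square(genotype_counts):
    """Chi-square test statistic and DF for HWE at one locus.

    genotype_counts: dict mapping (allele_a, allele_b) -> number of
    individuals (a hypothetical input format for this sketch).
    """
    n = sum(genotype_counts.values())

    # Each individual contributes two alleles
    allele_counts = {}
    for (a, b), count in genotype_counts.items():
        allele_counts[a] = allele_counts.get(a, 0) + count
        allele_counts[b] = allele_counts.get(b, 0) + count
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}

    chi2 = 0.0
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        if a == b:
            expected = n * freqs[a] ** 2              # homozygote: p^2
            observed = genotype_counts.get((a, a), 0)
        else:
            expected = n * 2 * freqs[a] * freqs[b]    # heterozygote: 2pq
            observed = (genotype_counts.get((a, b), 0)
                        + genotype_counts.get((b, a), 0))
        chi2 += (observed - expected) ** 2 / expected

    na = len(freqs)
    df = na * (na - 1) // 2
    return chi2, df


# 0.05 critical values for DF = 1 to 3
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}

glycine = {(1, 1): 7, (1, 2): 5, (2, 2): 18}
chi2, df = hwe_chi_square(glycine)
verdict = "significant" if chi2 > CRITICAL_05[df] else "ns"
print(f"X2 = {chi2:.3f}, DF = {df}, {verdict} at P = 0.05")
```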
Allele Frequencies Glycine clandestina
Sample size = 30
Pop: Mt Taylor
Allele   SATT373
1        0.317
2        0.683

Expected genotype frequency
     1         2
1    ____
2    ____      ____

Observed (= Observed genotype counts)
     1         2
1    7
2    5         18

Expected (= Expected genotype frequency x samples = 30)
     1         2
1    3.008
2    12.990    ____

Observed - Expected
     1         2
1    ____
2    ____      ____

(Observed - Expected)^2
     1         2
1    ____
2    ____      ____

(Observed - Expected)^2/Expected
     1         2
1    ____
2    ____      ____

ChiSquare ____   DF ____   Prob ____

Allele Frequencies Caladenia tentaculata
Sample size = 100
Pop: Pop1
Allele No.   MDH1
1            0.220
2            0.705
3            0.075

Expected genotype frequency
     1         2         3
1    0.048
2    0.310     0.497
3    0.033     0.106     0.006

Observed (= Observed genotype counts)
     1         2         3
1    5
2    34        48
3    0         11        2

Expected (= Expected genotype frequency x samples = 100)
     1         2         3
1    4.840
2    31.020    49.703
3    3.300     10.575    0.563

Observed - Expected
     1         2         3
1    0.160
2    2.980     -1.702
3    -3.300    0.425     1.438

(Observed - Expected)^2
     1         2         3
1    0.026
2    8.880     2.899
3    10.890    0.181     2.066

(Observed - Expected)^2/Expected
     1         2         3
1    0.005
2    0.286     0.058
3    3.300     0.017     3.674

ChiSquare 7.341   DF 3   Prob 0.062 ns
Q 1.18 Questions
1.
Summarise your findings for Glycine clandestina. Which hypothesis, H0 or H1 is
supported?
H1 not in HWE
2.
Summarise your findings for Caladenia tentaculata. Which hypothesis, H0 or H1 is
supported?
H0 in HWE
Putting It All Together
Ex 1.19 Revision: F-statistics in Glycine and Caladenia
Allele frequencies and observed heterozygosity are shown for two populations of Glycine
clandestina and two populations of Caladenia tentaculata in the tables below. We will use this
exercise to revise many of the formulae we have learnt so far. Only minimal instructions are
provided; if in doubt, please refer to earlier exercises.
Glycine clandestina
Allele Frequencies
Allele    Aranda   Taylor   Total
266       0.050    0.000    0.025
268       0.000    0.050    0.025
270       0.000    0.850    0.425
272       0.550    0.100    0.325
280       0.400    0.000    0.200
Heterozygosity and F statistics
          Aranda   Taylor   Total
Ho        0.300    0.100    0.200
He        0.535    0.265    0.673
Mean F
Mean Ho   0.200
Mean He   0.400
HT        0.673
FIS       0.500
FIT       0.703
FST       0.405

Caladenia tentaculata
Allele Frequencies
Allele    W1       W2       Total
-         0.021    0.146    0.083
-         0.896    0.792    0.844
-         0.083    0.063    0.073
Heterozygosity and F statistics
          W1       W2       Total
Ho        0.208    0.333    0.271
He        0.190    0.348    0.276
Mean F
Mean Ho   0.271
Mean He   0.269
HT        0.276
FIS       -0.006
FIT       0.018
FST       0.024
Step 1.
Inspect the allele frequencies for both the Glycine and Caladenia data and answer
questions 1 and 2.
Step 2.
Calculate F-statistics for Glycine and Caladenia showing full hand calculations
below. Record your answers in the table above, then answer the remaining
questions below.
Step 3.
Check your hand calculations using GenAlEx. The data are provided in the
workbooks Ex 1.19 Glycine Fstats.xls and Ex 1.19 Caladenia Fstats.xls.
Q 1.19 Questions
1.
Based on your inspection of the allele frequencies in Glycine, how much genetic
differentiation do you predict? Explain why.
2.
Based on your inspection of the allele frequencies in Caladenia, how much genetic
differentiation do you predict? Explain why.
3.
Based on your calculations for Glycine, what do you conclude about the extent of
genetic differentiation between the two populations Aranda and Taylor?
By reference to Box 1.6 and the FST of 0.4 it is clear that the extent of differentiation is
very great!
4.
What do you conclude about the extent of inbreeding within the two populations of
Glycine?
Also large given FIS is 0.5.
5.
Based on your calculations for Caladenia, what do you conclude about the extent of
genetic differentiation between the two populations W1 and W2?
6.
What do you conclude about the extent of inbreeding within the two populations of
Caladenia?
Negligible, given FIS is approximately zero (-0.006).
7.
Are the biological conclusions you draw from the F-statistics analysis of Glycine and
Caladenia the same as for the HWE tests? Explain your answer.
For Caladenia F= -0.002
For Glycine F= 0.615
Ex 1.20 Bringing the Genetics and Ecology Together
Throughout these exercises we have been working with data from three different species, the
bush rat, Rattus fuscipes and two plant species Glycine clandestina and Caladenia tentaculata.
For the two plant species it is now time to reveal a little more about their biology. By
combining ecology and genetics we can frequently discover new insights not evident from
either ecology or genetic studies alone. In addition, genetic results can help us test predictions
from our ecological knowledge, and vice versa. In the boxes below a brief summary of what we
know about the biology of the two plant species is provided. Read these summaries before
proceeding.
Box 1.9 The case of Glycine clandestina
Glycine clandestina is a native relative of the soybean. This species has an unusual reproductive
biology - it produces two kinds of flower: Normal 'Open pollinated' flowers and 'Closed or
cleistogamous' flowers. The open flowers are typical of pea flowers in general requiring insect
pollinators for seed set. The 'Closed or cleistogamous' flowers are adapted to self pollination,
regularly producing seed without the aid of pollinators. The seeds of the species lack an obvious
dispersal mechanism and it appears most seeds will fall close to the parent plant.
Box 1.10 The case of Caladenia tentaculata
Caladenia tentaculata, the green spider orchid, is exclusively pollinated by sexually attracted male
thynnine wasps. The orchid, like many other Australian orchids, exploits the reproductive behavior
of thynnine wasps by mimicking the sex pheromones of the female wasp. Pollination occurs when
male wasps attempt copulation (pseudocopulation) with the labellum (the modified 3rd petal of
orchids). After pollination, wasps immediately leave the patch, rather than visiting additional
orchids. As a consequence of this behavior, pollen movements approximate a linear distribution,
with a mean dispersal distance of 17 m (max = 58 m). This is among the largest mean pollen
dispersal distances known for herbaceous plants (Peakall and Beattie 1996). The seeds of the
species, like orchids in general, are minute and wind dispersed. However, we presently know little
about the extent of seed dispersal in this and other orchids.
Step 1.
Given what you now know about the biology of these two plant species, draw up a
series of qualitative genetic predictions (summary in words) in the table below.
Briefly justify these predictions below the table.
Step 2.
Collate your answers from previous exercises in the summary table of statistical
outcomes.
Step 3.
To complement your statistical summary, calculate the outcrossing rate t as
outlined in Box 1.11.
Step 4.
Briefly summarize the key findings below the summary table. Then answer the
questions that follow.
Box 1.11 Estimation of Outcrossing Rates in Plants
Typically the estimation of outcrossing rates in plants involves a genetic analysis of the genotypes
of mother and offspring across multiple loci followed by a formal mating system analysis. In the
absence of such an analysis, a simple transformation of the Fixation Index F can provide an
estimate of the outcrossing rate t:
$$t = \frac{1 - F}{1 + F}$$
This transformation assumes no selection between fertilisation and the stage at which the samples
were analysed for the estimate of F.
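The transformation is simple enough to check by hand, but a small Python sketch makes it easy to try different values of F. The function name is an assumption made for illustration.

```python
def outcrossing_rate(f):
    """Outcrossing rate t = (1 - F) / (1 + F) from the Fixation Index F."""
    if f <= -1:
        raise ValueError("F must be greater than -1")
    return (1 - f) / (1 + f)


for f in (0.0, 0.5, 1.0):
    print(f"F = {f:.1f} -> t = {outcrossing_rate(f):.2f}")
```

An F of zero corresponds to complete outcrossing (t = 1) and an F of one to complete selfing (t = 0).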
Q 1.20 Questions
1.
Summarize in words your predictions in the table below, then justify your answer.
Statistic
Glycine clandestina
Caladenia tentaculata
HWE
No
Yes
Zero
FIS
Zero
FST
Large
Small
Near one
Near zero
Justification:
2.
Summarise the statistical outcomes in the table below, then list your key findings.
Statistic   Glycine clandestina          Caladenia tentaculata
HWE         X²=11.343, P = 0.001 ****    X²=7.341, P = 0.062 ns
            0.444 to 0.62                -0.009 to 0.042
FIS         0.500                        -0.0006
FST         0.405                        0.024
t
Key Findings:
3.
Based on your findings in Glycine clandestina, how important is the contribution of
the 'Closed flowers' to reproductive success. Explain your answer using the
statistics you have calculated to back up your case.
If we assume that only the closed flowers are able to self, then the outcrossing rate
estimate t at an FIS of 0.5 is 0.33 or 33%. Thus selfing accounts for 67% of all matings.
4.
What do you conclude about the extent of seed dispersal in Glycine clandestina?
Explain your answer.
Given the very great differentiation, it is reasonable to conclude that seed
dispersal is limited.
5.
What do you conclude about the outcrossing rate in Caladenia tentaculata? Given
selfing is possible in this system (i.e. the plant is self-compatible) how can you
explain the result? Use the statistics you have calculated to support your case.
Outcrossing rate estimate t at the FIS of approximately 0.0 is 1.0 or 100%. The
pollinator's avoidance of multiple flower visits within patches and the longer-distance
pollen dispersal documented for the species are consistent with this result. There may
also be late-acting inbreeding depression (post seed set).
6.
What do you conclude about the extent of seed dispersal in Caladenia tentaculata?
Explain your answer using the statistics you have calculated to support your case.
The very low FST of 0.02 indicates limited genetic differentiation, which in turn is
consistent with extensive seed dispersal.
References and Further Reading
Note that for a more extensive literature on these topics, please see the appendices provided with
GenAlEx: Freely available from the Australian National University, Canberra, Australia.
http://www.anu.edu.au/BoZo/GenAlEx/
Brown AHD and Weir BS (1983) Measuring genetic variability in plant populations, in Isozymes in Plant
Genetics and Breeding, Part A, (Tanksley SD, Orton TJ, Editors). Elsevier Science Publ.: Amsterdam.
p. 219-239.
Conner JK and Hartl DL (2004) A Primer of Ecological Genetics, Sunderland, Massachusetts: Sinauer
Associates, Inc.
Frankham R, Ballou JD and Briscoe DA (2002) Introduction to Conservation Genetics, Cambridge
University Press: Cambridge.
Frankham R, Ballou JD and Briscoe DA (2004) A Primer of Conservation Genetics, Cambridge: Cambridge
University Press.
Hartl DL (2000) A Primer of Population Genetics 3rd Ed, Sunderland, Massachusetts: Sinauer Associates,
Inc.
Hartl DL and Clark AG (1997) Principles of Population Genetics 3rd Ed, Sunderland, Massachusetts:
Sinauer Associates, Inc.
Hedrick PW (2000) Genetics of Populations 2nd Ed, Boston: Jones and Bartlett.
Nei M (1972) Genetic distance between populations. American Naturalist, 106, 283-292.
Nei M (1978) Estimation of average heterozygosity and genetic distance from a small number of individuals.
Genetics, 89, 583-590.
Peakall R and Beattie AJ (1996) Ecological and genetic consequences of pollination by sexual deception in
the orchid Caladenia tentactulata. Evolution, 50, 2207-2220.
Peakall R, Ruibal M and Lindenmayer DB (2003) Spatial autocorrelation analysis offers new insights into
gene flow in the Australian bush rat, Rattus fuscipes. Evolution, 57, 1182-1195.
Peakall R and Smouse PE (2006) GENALEX 6: genetic analysis in Excel. Population genetic software for
teaching and research. Molecular Ecology Notes, 6, 288-295.
Peakall R and Lindenmayer DB (2006) Genetic insights into population recovery following experimental
perturbation in a fragmented landscape. Biological Conservation, 132, 520-532.
Peakall R, Ebert D, Cunningham R and Lindenmayer DB (2006) Mark-recapture by genetic tagging reveals
restricted movements by bush rats, Rattus fuscipes, in a fragmented landscape. Journal of Zoology,
268, 207-216.
Rossetto M, Kooyman R, Sherwin W and Jones R (2008) Dispersal limitation, rather than bottlenecks or
habitat specificity, can restrict the distribution of rare and endangered rainforest trees. American
Journal of Botany, 95, 321-329.
Sherwin WB, Jabot F, Rush R and Rossetto M (2006) Measurement of biological information with
applications from genes to landscapes. Molecular Ecology, 15, 2857-2869.
Shannon CE (1948) A mathematical theory of communication. The Bell System Technical Journal, 27, 379-423, 623-656.
Weir BS (1990) Genetic Data Analysis, Sunderland, Massachusetts: Sinauer Ass. Inc.
Wright S (1946) Isolation by distance under diverse systems of mating. Genetics, 31, 39-59.
Wright S (1951) The genetical structure of populations. Annals of Eugenics, 15, 323-354.
Wright S (1965) The interpretation of population structure by F-Statistics with special regard to systems of
mating. Evolution, 19, 395-420.
Wright S (1978) Evolution and the Genetics of Populations. Variability within and among natural
populations. Vol 4. The University of Chicago Press, Chicago.
Glossary - Some Important Definitions
Allele: One of two or more alternative forms of a given gene or non-coding region of DNA.
Codominant: Both alleles in a diploid organism are visualized by a genetic marker system such that
homozygous and heterozygous genotypes are detected. At the phenotypic level, the gene products
of both alleles are expressed.
Dominant: Only one allele in a diploid organism is visualized by a genetic marker system such that
only two genotypes are detected, either band presence or band absence. At the phenotypic level the
gene product of only one allele is detected.
DNA: Deoxyribonucleic acid (DNA). Ribbons of sugars and phosphates held together in two
opposite strands by 4 different bases or nucleotides: Adenine (A), Guanine (G), Cytosine (C) and
Thymine (T). Sequences of nucleotides make up genes.
DNA sequence: The sequence of DNA bases at a given locus.
DNA profile: Bands or genetic fingerprint produced by a genetic marker.
Electrophoresis: Migration of particles under the influence of an electric field. In the context of
genetics, electrophoresis separates protein and DNA molecules of different size in a gel matrix that
is subject to an electric field.
Genetic Marker: Any genetic character that can be measured and quantified. Most often genetic
markers are visualized using laboratory procedures that detect variation either directly at the DNA
level or indirectly via the products of DNA transcription and translation such as for allozyme or
morphological characters.
Genotype: The set of alleles within an organism. In a narrower sense the alleles observed at a
particular locus or loci. cf. Phenotype.
Heterozygosity: The proportion of heterozygous individuals at a locus, or heterozygous loci in an
individual. Approximates genetic variance.
Heterozygote: An individual with two different alleles at a given locus.
Homozygote: An individual with two identical alleles at a given locus.
Locus: A specific position on the homologous chromosomes. Includes any identifiable coding
(genes) and non coding region of the chromosome (pl. Loci).
PCR: Polymerase chain reaction.
Phenotype: The characteristics or appearance of an organism influenced by both the environment
and genotype of the organism. In a narrower sense the characteristics displayed by a particular locus
or loci cf. Genotype.
Polymorphism: The presence of one or more alternative forms at a given locus or loci = genetic
variation. All genetic variation reflects variation in the sequence of nucleotides. For example, at a
given locus, genetic variation can be represented by: (1) variation in the bases e.g. CGTACG vs
CGAAAG, (2) variation in DNA length due to an insertion of nucleotides e.g. CGTACG vs
CGTATATATATATACG, or (3) variation in DNA length due to a deletion of nucleotides e.g.
CGTACG vs CGCG (TA deleted).
Restriction enzyme: An enzyme that cuts DNA at short specific sequences. Each enzyme has a
unique cutting site.
Glossary - Genetic markers
AFLPs (Amplified Fragment Length Polymorphisms): A method that reveals fragment length
polymorphism by PCR. First, genomic DNA is cut with two different restriction enzymes to
produce short DNA fragments. Next, adapters of known DNA sequence are ligated to the ends of
the cut fragments. Selective PCR of the genomic fragments is then achieved using
primers that match the known adapter sequence plus additional 'selective' nucleotides.
Electrophoresis of the fragments produces a multi-locus profile or DNA fingerprint with
polymorphisms apparent as either band presence or absence. Fluorescent or radioactive methods are
used to visualize the fragments.
Allozymes: Alternate forms of enzymes encoded by different alleles at the same locus. Allozymes
are prepared by homogenising tissue to produce a solution of proteins that is electrophoresed
through a gel. Specific enzyme products are then visualized by a specific reaction. Alleles with
different charges have different mobilities.
PCR-based genetic markers: Genetic markers produced via the amplification of DNA by the
polymerase chain reaction.
RAPDs (Random amplified polymorphic DNA): An arbitrarily primed PCR method that uses
primers of known but arbitrary sequence, usually 10 base pairs long, to serve as both forward and
reverse primers. Typically the amplified DNA fragments are resolved by low resolution agarose
electrophoresis and staining with ethidium bromide. A multi-locus profile or DNA fingerprint with
polymorphisms apparent as either band presence or absence is produced.
RFLPs (Restriction fragment length polymorphisms): Polymorphisms at specific sites in the DNA
sequence revealed by the following method: DNA is cut with restriction enzymes, electrophoresed,
blotted to a membrane and probed with radioactive DNA. Depending on the probe, single-locus or
multi-locus profiles will be produced.
SNPs (Single Nucleotide Polymorphisms): Single base changes at a specific position in the genome,
in most cases with two alleles. SNPs represent the most common form of DNA variation in the
genome, and the analysis of a set of linked nuclear SNPs (haplotypes) provides an essentially
inexhaustible source of stable polymorphic markers. An array of new methods is rapidly being
developed for the routine screening of SNPs.
STRs (Short Tandem Repeats) or SSR (Simple Sequence Repeats) or Microsatellites: Tandem
repeats of very short nucleotide motifs (1-6 bases long) eg: [(CA)17] or [(AAT)10] obtained by
STS-PCR amplification using specific primers. Typically high resolution electrophoresis is
required. A single-locus codominant genetic marker is produced. The standard genetic marker in
human forensics and widely used in population genetics.
STS-PCR (Sequence-tagged-site PCR): A PCR method that uses two different specific primers,
complementary to opposite strands of conserved DNA, to amplify the intervening sequence. A
single-locus codominant genetic marker is produced.