Julian Hofrichter • Jürgen Jost • Tat Dat Tran
Information Geometry
and Population Genetics
The Mathematical Structure
of the Wright-Fisher Model
Julian Hofrichter
Mathematik in den Naturwissenschaften
Max-Planck-Institut
Leipzig, Germany

Jürgen Jost
Mathematik in den Naturwissenschaften
Max Planck Institut
Leipzig, Germany
Preface
Population genetics is concerned with the distribution of alleles, that is, variants at
a genetic locus, in a population and the dynamics of such a distribution across
generations under the influences of genetic drift, mutations, selection, recombination
and other factors [57]. The Wright–Fisher model is the basic model of mathematical
population genetics. It was introduced and studied by Ronald Fisher, Sewall Wright,
Motoo Kimura and many other people. The basic idea is very simple. The alleles
in the next generation are drawn from those of the current generation by random
sampling with replacement. When this process is iterated across generations, then
by random drift, asymptotically, only a single allele will survive in the population.
Once this allele is fixed in the population, the dynamics becomes stationary. This
effect can be countered by mutations that might restore some of those alleles that
had disappeared. Or it can be enhanced by selection that might give one allele an
advantage over the others, that is, a higher chance of being drawn in the sampling
process. When the alleles are distributed over several loci, then in a sexually
recombining population, there may also exist systematic dependencies between the
allele distributions at different loci. It turns out that rescaling the model, that is,
letting the population size go to infinity and the time steps go to 0, leads to partial
differential equations, called the Kolmogorov forward (or Fokker–Planck) and the
Kolmogorov backward equation. These equations are well suited for investigating
the asymptotic dynamics of the process. This is what many people have investigated
before us and what we also study in this book.
So, what can we contribute to the subject? Well, in spite of its simplicity,
the model leads to a very rich and beautiful mathematical structure. We uncover
this structure in a systematic manner and apply it to the model. While many
mathematical tools, from stochastic analysis, combinatorics, and partial differential
equations, have been applied to the Wright–Fisher model, we bring in a geometric
perspective. More precisely, information geometry, the geometric approach to
parametric statistics pioneered by Amari and Chentsov (see, for instance, [4, 20]
and for a treatment that also addresses the mathematical problems for continuous
sample spaces [9]), studies the geometry of probability distributions. And as a
remarkable coincidence, here we meet Ronald Fisher again. The basic concept
and statistical physicists who want to see how concepts from geometry, partial
differential equations (Kolmogorov or Fokker–Planck equations) and statistical
mechanics (entropy, free energy) can be developed and applied to one of the most
important mathematical models in biology; bioinformaticians who want to acquire
a theoretical background in population genetics; and biologists who are not afraid
of abstract mathematical models and want to understand the formal structure of
population genetics.
Our book consists essentially of three parts. The first two chapters introduce
the basic Wright–Fisher model (random genetic drift) and its generalizations
(mutation, selection, recombination). The next few chapters introduce and explore
the geometry behind the model. We first introduce the basic concepts of information
geometry and then look at the Kolmogorov equations and their moments. The
geometric structure will provide us with a systematic perspective on recombination.
And we can utilize moment-generating and free energy functionals as powerful
computational tools. We also explore the large deviation theory of the Wright–Fisher
model. Finally, in the last part, we develop hierarchical schemes for the
construction of global solutions in Chaps. 8 and 9 and present various applications in
Chap. 10. Most of those applications are known from the literature, but our unifying
perspective lets us obtain them in a more transparent and systematic manner.
From a different perspective, the first four chapters contain general material, a
description of the Wright–Fisher model, an introduction to information geometry,
and the derivation of the Kolmogorov equations. The remaining five chapters
contain our investigation of the mathematical aspects of the Wright–Fisher model,
the geometry of recombination, the free energy functional of the model and its
properties, and hierarchical solutions of the Kolmogorov forward and backward
equations.
This book contains the results of the theses of the first [60] and the third
author [113] written at the Max Planck Institute for Mathematics in the Sciences
in Leipzig under the direction of the second author, as well as some subsequent
work. Following the established custom in the mathematical literature, the authors
are listed in the alphabetical order of their names. In the beginning, there will be
some overlap with the second author’s textbook Mathematical Methods in Biology
and Neurobiology [73]. Several of the findings presented in this book have been
published in [61–64, 114–118].
The research leading to these results has received funding from the European
Research Council under the European Union’s Seventh Framework Programme
(FP7/2007–2013)/ERC grant agreement no. 267087. The first and the third authors
have also been supported by the IMPRS “Mathematics in the Sciences”.
We would like to thank Nihat Ay for a number of inspiring and insightful
discussions.
Contents
1 Introduction
1.1 The Basic Setting
1.2 Mutation, Selection and Recombination
1.3 Literature on the Wright–Fisher Model
1.4 Synopsis
2 The Wright–Fisher Model
2.1 The Wright–Fisher Model
2.2 The Multinomial Distribution
2.3 The Basic Wright–Fisher Model
2.4 The Moran Model
2.5 Extensions of the Basic Model
2.6 The Case of Two Alleles
2.7 The Poisson Distribution
2.8 Probabilities in Population Genetics
2.8.1 The Fixation Time
2.8.2 The Fixation Probabilities
2.8.3 Probability of Having (k + 1) Alleles (Coexistence)
2.8.4 Heterozygosity
2.8.5 Loss of Heterozygosity
2.8.6 Rate of Loss of One Allele in a Population Having (k + 1) Alleles
2.8.7 Absorption Time of Having (k + 1) Alleles
2.8.8 Probability Distribution at the Absorption Time of Having (k + 1) Alleles
2.8.9 Probability of a Particular Sequence of Extinction
2.9 The Kolmogorov Equations
2.10 Looking Forward and Backward in Time
2.11 Notation and Preliminaries
2.11.1 Notation for Random Variables
2.11.2 Moments and the Moment Generating Functions
10 Applications
10.1 The Case of Two Alleles
10.1.1 The Absorption Time
10.1.2 Fixation Probabilities and Probability of Coexistence of Two Alleles
10.1.3 The αth Moments
10.1.4 The Probability of Heterozygosity
10.2 The Case of n + 1 Alleles
10.2.1 The Absorption Time for Having k + 1 Alleles
10.2.2 The Probability Distribution of the Absorption Time for Having k + 1 Alleles
10.2.3 The Probability of Having Exactly k + 1 Alleles
10.2.4 The αth Moments
10.2.5 The Probability of Heterozygosity
10.2.6 The Rate of Loss of One Allele in a Population Having k + 1 Alleles
10.3 Applications of the Hierarchical Solution
10.3.1 The Rate of Loss of One Allele in a Population Having Three Alleles
Bibliography
Index
Chapter 1
Introduction
\[
\sum_{i=0}^{n} p_i = 1. \tag{1.1.1}
\]
The relationship between the deterministic concept of a frequency and the stochastic
concept of a probability of course requires some clarification, and this will be
addressed below, through the passage to a continuum limit.¹
¹ In a certain sense, we shall sidestep the real issue, and in this text, we do not enter
into the issue of objective and subjective probabilities.
The population is evolving in time, and members pass on genes to their offspring,
and the allele frequencies $p_i$ then change in time through the mechanisms of
selection, mutation and recombination. In the simplest case, one has a population
with nonoverlapping generations. That means that we have a discrete time index $t$,
and for the transition from $t$ to $t + 1$, the population $V_t$ produces a new population
$V_{t+1}$. More precisely, members of $V_t$ can give birth to offspring that inherit their
alleles. This process involves potential sources of randomness. Most basically, the
parents for each offspring are randomly chosen, and therefore, the transition from
the allele pool of one generation to that of the next defines a random process. In
particular, we shall see the effects of random genetic drift. Mutation means that
an allele may change to another value in the transition from parent to offspring.
Selection means that the chances of producing offspring vary depending on the value
of the allele in question, as some alleles may be fitter than others. Recombination
takes place in sexual reproduction, that is, when each member of the population has
two parents. It is then determined by chance which allele value she inherits when
the two parents possess different alleles at the locus in question. Depending on how
loci from the two parents are combined, this may introduce correlations between the
allele values at different loci.
Here is a remark which is perhaps obvious, but which illuminates how the
biological process is translated into a mathematical one. As already indicated, in
the simplest case we have a single genetic locus. In the diploid case, each individual
carries two alleles at this locus. These alleles could be different or identical, but
for the basic process of creating offspring, this is irrelevant. In the diploid case,
for each individual of the next generation, two parents are chosen from the current
generation, and the individual inherits one allele from each parent. That allele then is
randomly chosen from the two that parent carries. The parents are chosen randomly
from the population, and we sample with replacement. That means that when a
parent has produced an offspring it is put back into the population so that it has
the chance to be chosen for the production of further offspring. To be precise,
we also allow for the possibility that one and the same parent is chosen twice for
the production of an individual offspring. In such a case, that offspring would not
have two different parents, but would get both its alleles from a single parent, and
according to the procedure, then even the same allele of that parent could be chosen
twice. (Of course, when the population size N becomes large—and eventually, we
shall let it tend to infinity—, the probability that this happens becomes exceedingly
small.) But then, formally, we can look at the population of 2N alleles instead of
that of N individuals. The rule for the process then simply says that the next allele
generation is produced by sampling with replacement from the current one. In other
words, instead of considering a diploid population with N members, we can look
at a haploid one with 2N participants. That is, for producing an allele in the next
generation, we randomly choose one parent in the current population of 2N alleles,
and that then will be the offspring allele. Thus, we have the process of sampling
with replacement in a population of size 2N. The situation changes, however, when
the individuals possess several loci, and the transmission of the alleles at different
loci may be correlated through restrictions on the possible recombinations. In that
case, we need to distinguish between gametes and zygotes, and the details of the
process will depend on whether we recombine gametes or zygotes, that is, whether
we perform recombination after or before sampling. This will be explained and
addressed in Chap. 5.
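The reduction to a haploid pool of 2N alleles that is resampled with replacement can be sketched in a few lines of Python. This is only an illustration: the function names, the pool size and the seed are assumptions chosen for the example and do not come from the text.

```python
import random


def next_generation(alleles, rng):
    """Draw the next allele generation by sampling with replacement from the current one."""
    return [rng.choice(alleles) for _ in alleles]


def generations_until_fixation(alleles, rng, max_generations=100_000):
    """Resample until one allele type remains; return (number of generations, surviving allele)."""
    for generation in range(max_generations):
        if len(set(alleles)) == 1:
            return generation, alleles[0]
        alleles = next_generation(alleles, rng)
    raise RuntimeError("no fixation within max_generations")


if __name__ == "__main__":
    rng = random.Random(0)      # fixed seed, used only for reproducibility
    N = 10                      # N diploid individuals, hence 2N = 20 alleles
    pool = [0] * N + [1] * N    # two allele types at equal frequency
    t, survivor = generations_until_fixation(pool, rng)
    print(f"allele {survivor} is fixed after {t} generations")
```

The pool size stays constant at 2N in every generation, and once a single allele type remains, further resampling cannot change the pool.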
Since we want to adopt a stochastic model, in line with the conceptual structure
of evolutionary biology, the future frequencies become probabilities, that is, instead
of saying that a fraction $p_i$ of the $2N$ alleles in the population has the value $i$, we
shall rather say that the probability of finding the allele $i$ at the locus in question is
$p_i$. While these probabilities express stochastic effects, they will then change in time
according to deterministic rules.
Although we start with a finite population with a discrete time dynamics,
subsequently, we shall pass to the limit of an infinite population. In order to
compensate for the growing size, we shall make the time steps shorter and pass to
continuous time. Obviously, we shall choose the scaling between population size
and time carefully, and we shall obtain a parabolic differential equation for the
deterministic dynamics of the probabilities in the continuum limit.
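To see why population size and time step have to be scaled together, one can compute the exact one-generation moments for two alleles. If the current frequency is x, the next count is binomial with parameters 2N and x, so the frequency change has mean 0 and variance x(1 - x)/(2N). The sketch below takes the time step to be 1/(2N) and checks that the variance per unit of rescaled time equals x(1 - x), independently of 2N. The function name and the sample values are illustrative assumptions.

```python
from math import comb


def offspring_frequency_moments(two_n, x):
    """Exact mean and variance of Y / 2N, where Y ~ Binomial(2N, x) is the next allele count."""
    pmf = [comb(two_n, y) * x**y * (1 - x) ** (two_n - y) for y in range(two_n + 1)]
    mean = sum(p * y / two_n for y, p in enumerate(pmf))
    var = sum(p * (y / two_n - mean) ** 2 for y, p in enumerate(pmf))
    return mean, var


if __name__ == "__main__":
    x = 0.3
    for two_n in (10, 100, 500):
        mean, var = offspring_frequency_moments(two_n, x)
        dt = 1 / two_n  # time step chosen so that 2N * dt = 1
        print(two_n, round(mean, 6), round(var / dt, 6), round(x * (1 - x), 6))
```

With a different scaling the variance per unit time would either vanish or blow up as 2N grows, which is why the choice has to be made carefully.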
gain of abstraction makes a mathematical analysis possible which in the end will
yield insights of biological value.
We consider a population $V_t$ that is changing in discrete time $t$ with nonoverlapping
generations, that is, the population $V_{t+1}$ consists of the offspring of
the members of $V_t$. There is no spatial component here, that is, everything is
independent of the location of the members of the population. In particular, the
issue of migration does not arise in this model.
Moreover, we shall keep the population size constant from generation to
generation.
While we consider sexual reproduction, we only consider monoecious or, in a
different terminology, hermaphrodite populations, that is, they do not have separate
sexes, and so, any individual can pair with any other to produce offspring. We
also assume random mating, that is, individuals get paired at random to produce
offspring.
The reproduction process is formally described as follows. For each individual
in generation $t + 1$, we sample the generation $t$ to choose its one or two parents. The
simplest case is to take sampling with replacement. This means that the number of
offspring an individual can foster is only limited by the size of the next generation.
If we took sampling without replacement, each individual could only produce one
offspring. This would not lead to a satisfactory model. Of course, one could limit
the maximal number of offspring of individuals, but we shall not pursue this option.
Each individual in the population is represented by its genotype $\xi$. We assume
that the genetic loci of the different members of the population are in one-to-one
correspondence with each other. Thus, we have loci $\alpha = 1, \dots, k$. In the haploid
case, at each locus, there can be one of $n_\alpha + 1$ possible alleles. Thus, a genotype is
of the form $\xi = (\xi_1, \dots, \xi_k)$, where $\xi_\alpha \in \{0, 1, \dots, n_\alpha\}$. In the diploid case, at each
locus, there are two alleles, which could be the same or different. We are interested
in the distribution of genotypes in the population and how that distribution changes
over time through the effects of mutation, selection, and recombination.
The trivial case is that each member of $V_t$ by itself, that is, without recombination,
produces one offspring that is identical to itself. In that case, nothing changes in
time. This baseline situation can then be varied in three respects:
1. The offspring is not necessarily identical to the parent (mutation).
2. The number of offspring an individual produces or may be expected to produce
varies with that individual’s genotype (selection).
3. Each individual has two parents, and its genotype is assembled from the
genotypes of its parents (sexual recombination).
Item 2 leads to a naive concept of fitness as the realized or the expected number
of offspring. Fitness is a difficult concept; in particular, it is not clear what the unit
of fitness is, whether it is the allele or the genotype or the ancestor of a lineage, or in
groups of interacting individuals even some higher order unit (see for instance the
analysis and discussion in [70]). Item 3 has two aspects:
(a) Each allele is taken from one of the parents in the haploid case. In the diploid
case, each parent produces gametes, which means that she chooses one of her
two alleles at the locus in question and gives it to the offspring. Of course,
this choice is made for each offspring, so that different descendants can carry
different alleles.
(b) Since each individual has many loci that are linearly arranged on chromosomes,
alleles at neighboring loci are in general not passed on independently.
The purpose of the model is to understand how the three mechanisms of mutation,
selection and recombination change the distribution of genotypes in the population
over time. In the present treatise, item 3, that is, recombination, will be discussed in
more detail than the other two.
These three mechanisms are assumed to be independent of each other. For
instance, the mutation rates do not favour fitter alleles.
For the purpose of the model, a population is considered as a distribution
of genotypes. Probability distributions then describe the composition of future
populations. More precisely, $p_t(\xi)$ is the probability that an individual in generation
$t$ carries the genotype $\xi$. The model should then express the dynamics of the
probability distribution $p_t$ in time $t$.
For mutations, we consider a matrix $M = (m_{\xi\eta})$, where $\xi, \eta$ range over the
possible genotypes and $m_{\xi\eta}$ is the probability that genotype $\xi$ mutates to genotype $\eta$.
In the most basic version, the mutation probability $m_{\xi\eta}$ depends only on the number
$d(\xi, \eta)$ ($d$ standing for distance, of course) of loci at which $\xi$ and $\eta$ carry different
alleles. Thus, in this basic version, we assume that a mutation occurs at each locus
with a uniform rate $m$, independently of the particular allele found at that locus.
Thus, when the allele $i$ at the locus $\alpha$ mutates, it can turn into any of the $n_\alpha$ other
alleles that could occur at that locus. Again, we assume that the probabilities are
equal, and so, it then mutates with probability $m/n_\alpha$ into the allele $j \neq i$. In the simplest
case, there are only $n + 1 = 2$ alleles possible at each locus. In this case,
\[
m_{\xi\eta} = m^{d(\xi,\eta)} (1 - m)^{k - d(\xi,\eta)}. \tag{1.2.1}
\]
When the number $n + 1$ of alleles is arbitrary, but still the same at each site, we
have instead
\[
m_{\xi\eta} = \left(\frac{m}{n}\right)^{d(\xi,\eta)} \left(1 - \frac{m}{n}\right)^{k - d(\xi,\eta)}. \tag{1.2.2}
\]
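For the two-allele case, the mutation matrix can be tabulated directly. The sketch below assumes k loci with alleles 0 and 1 that mutate independently with probability m per locus, so that the probability of passing between two genotypes at distance d is m^d (1 - m)^(k - d). The function names and the values k = 3, m = 0.01 are illustrative assumptions.

```python
from itertools import product


def hamming_distance(xi, eta):
    """Number of loci at which the two genotypes carry different alleles."""
    return sum(a != b for a, b in zip(xi, eta))


def mutation_probability(xi, eta, m):
    """Two alleles per locus: m^d * (1 - m)^(k - d), with d the number of differing loci."""
    k = len(xi)
    d = hamming_distance(xi, eta)
    return m**d * (1 - m) ** (k - d)


def mutation_matrix(k, m):
    """All 2^k binary genotypes and the matrix of mutation probabilities between them."""
    genotypes = list(product((0, 1), repeat=k))
    matrix = [[mutation_probability(xi, eta, m) for eta in genotypes] for xi in genotypes]
    return genotypes, matrix


if __name__ == "__main__":
    genotypes, M = mutation_matrix(k=3, m=0.01)
    for xi, row in zip(genotypes, M):
        print(xi, [round(v, 6) for v in row], "row sum:", round(sum(row), 12))
```

Each row sums to 1 because there are exactly C(k, d) genotypes at distance d from a given one, so the row sum is the binomial expansion of (m + (1 - m))^k.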
In contrast to mutation, recombination is a binary operation, that is, an operation
that takes two parent genotypes $\eta, \zeta$ as arguments to produce one offspring genotype
$\xi$. Here, a genotype consists of a linear sequence of $k$ sites occupied by particular
alleles. We consider the case of monoecious individuals with haploid genotypes for
the moment. An offspring is then formed through recombination by choosing at
each locus the allele that one of the parents carries there. When the two parents
carry different alleles at the locus in question, we have to decide by a selection rule
which one to choose. This selection rule is represented by a mask $\mu$, a binary string
of length $k$. An entry 1 at position $\alpha$ means that the allele is taken from the first
parent, say $\eta$, and a 0 signifies that the allele is taken from the second parent, say $\zeta$.
Each genotype is simply described by a string of length $k$, and for $k = 6$, the mask
100100 produces from the parents $\eta = \eta_1 \dots \eta_6$ and $\zeta = \zeta_1 \dots \zeta_6$ the offspring
$\xi = \eta_1 \zeta_2 \zeta_3 \eta_4 \zeta_5 \zeta_6$. The recombination operator
\[
R = \sum_{\mu} p_r(\mu)\, C(\mu) \tag{1.2.3}
\]
is then expressed in terms of the recombination schemes $C(\mu)$ for the masks $\mu$
and the probabilities $p_r(\mu)$ for those masks. In the simplest case, all the possible $2^k$
masks are equally probable, and consequently, at each locus, the offspring obtains
an allele from either parent with probability 1/2, independently of the choices at the
other loci. Thus, this case reduces to the consideration of $k$ independent loci.
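The action of a recombination mask is easy to make concrete. The following sketch represents a mask as a string of the characters 1 and 0 and a genotype as a tuple of allele labels. The function names and the labels e1, ..., z6 are placeholders chosen for this example.

```python
import random
from itertools import product


def recombine(mask, first, second):
    """Apply a mask: an entry '1' takes the allele from `first`, an entry '0' from `second`."""
    if not (len(mask) == len(first) == len(second)):
        raise ValueError("mask and genotypes must have the same length")
    return tuple(a if bit == "1" else b for bit, a, b in zip(mask, first, second))


def all_masks(k):
    """All 2^k binary masks of length k."""
    return ["".join(bits) for bits in product("01", repeat=k)]


def random_offspring(first, second, rng):
    """Simplest case: every mask is equally probable."""
    return recombine(rng.choice(all_masks(len(first))), first, second)


if __name__ == "__main__":
    first = ("e1", "e2", "e3", "e4", "e5", "e6")
    second = ("z1", "z2", "z3", "z4", "z5", "z6")
    print(recombine("100100", first, second))
    print(random_offspring(first, second, random.Random(0)))
```

Choosing uniformly among all masks is equivalent to flipping an independent fair coin at every locus, which is the reduction to k independent loci described above.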
Dependencies between sites arise in the so-called cross-over models (see for
example [11]). Here, the linear arrangement of the sites is important. Only masks
of the form $\mu_c = 11 \dots 100 \dots 0$ are permitted. For such a mask, at the first $a(\mu_c)$
sites, the allele from the first parent is chosen, and at the remaining $k - a(\mu_c)$ sites,
the one from the second parent. As $a$ can range from 0 to $k$, we then have $k + 1$
possible such masks $\mu_c$, and we may wish to assume again that each of those is
equally probable.
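The cross-over masks can be enumerated in the same spirit. In this sketch a mask is a string of a ones followed by k - a zeros. The helper names are illustrative assumptions.

```python
def crossover_masks(k):
    """Masks with a leading ones followed by k - a zeros, for a = 0, ..., k."""
    return ["1" * a + "0" * (k - a) for a in range(k + 1)]


def apply_mask(mask, first, second):
    """An entry '1' takes the allele from `first`, an entry '0' from `second`."""
    return tuple(a if bit == "1" else b for bit, a, b in zip(mask, first, second))


if __name__ == "__main__":
    first, second = ("A", "A", "A", "A"), ("b", "b", "b", "b")
    for mask in crossover_masks(4):
        print(mask, "".join(apply_mask(mask, first, second)))
```

Under such masks, alleles at neighboring sites are separated only when the single cross-over point falls between them, which is how the linear arrangement of sites produces dependencies.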
In the diploid case, each individual carries two alleles at each locus, one from
each parent. We think of this as two strings of alleles. It is then randomly decided
which of the two strings of each parent is given to any particular offspring.
Therefore, formally, the scheme can be reduced to the haploid case with suitable
masks, but as we shall discuss in Chap. 5, there will arise a further distinction, that
between gametes and zygotes.
With recombination alone, some alleles may disappear from the populations,
and in fact, as we shall study in detail below, with probability 1, in the long
term, only one allele will survive at each site. This is due to random genetic drift,
that is, because the parents that produce offspring are randomly selected from the
population. Thus, it may happen that no carrier of a particular allele is chosen at
a given time or that none of the chosen recombination masks preserves that allele
when the mating partner carries a different allele at the locus under consideration.
That would then lead to the ultimate extinction of that allele. However, when
mutations may occur, an allele that is not present in the population at time t may
reappear at some later time. Of course, mutation might also produce new alleles that
have not been present in the population before, and this is a main driver of biological
evolution.
For these introductory purposes, we do not discuss the order in which the
mutation and recombination operators should be applied. In fact, in most models
this is irrelevant.
Finally, we include selection. This means that we shall modify the assumptions
that individuals in generation t are randomly selected with equal probabilities as
parents of individuals in generation $t + 1$. Formally, this means that we need to
change the sampling rule for the parents of the next generation. The sampling
probability for an individual to become a parent for the next generation should
now depend on its fitness, that is, on its genotype, according to the naive fitness
notion employed here. Thus, there is a probability distribution $p_s(\xi)$ on the space of
genotypes $\xi$. Again, the simplest assumption is that in the haploid case, each allele
at each locus has a fitness value, independently of which other alleles are present
at other loci. In the diploid case, each pair of alleles at a locus would have a fitness
value, again independently of the situation at other loci. Of course, in general one
should consider fitness functions depending in a less trivial manner on the genotype.
Also, in general, the fitness of an individual will depend on the composition of the
population, but we shall not address this important aspect here.
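A minimal sketch of such a modified sampling rule for a haploid population with one locus follows. It assumes that the probability of being chosen as a parent is proportional to a fixed fitness value per allele. This proportional weighting, the function names and the fitness values are assumptions made for illustration, not a rule taken from the text.

```python
import random


def sampling_distribution(population, fitness):
    """Parent-sampling probabilities proportional to fitness (an illustrative assumption)."""
    weights = [fitness[allele] for allele in population]
    total = sum(weights)
    return [w / total for w in weights]


def next_generation_with_selection(population, fitness, rng):
    """Sample parents with replacement, weighted by fitness instead of uniformly."""
    probs = sampling_distribution(population, fitness)
    return rng.choices(population, weights=probs, k=len(population))


if __name__ == "__main__":
    rng = random.Random(0)                # fixed seed, used only for reproducibility
    population = [0] * 10 + [1] * 10      # 20 alleles, two types
    fitness = {0: 1.0, 1: 1.5}            # hypothetical fitness values
    for _ in range(5):
        population = next_generation_with_selection(population, fitness, rng)
        print(sum(population) / len(population))  # frequency of allele 1
```

With equal fitness values the rule reduces to the uniform sampling with replacement of the basic model.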
The preceding was needed to set the stage. However, everything said so far
is fairly standard and can be found in the introduction of any book on mathematical
population genetics. We shall now turn to the mathematical structures underlying the
processes of allele dynamics. Here, we shall develop a more abstract mathematical
framework than utilized before in population genetics.
Let us first outline our strategy. Since we want to study dynamics of probability
distributions, we shall first study the geometry of the space of probability distributions,
in order to gain a geometric description and interpretation of our dynamics.
For the dynamics itself, it will be expedient to turn to a continuum limit by suitably
rescaling population size $2N$ and generation time $\delta t$ in such a way that $2N \to \infty$,
but $2N\,\delta t = 1$. This will lead to Kolmogorov type backward and forward partial
differential equations for the probability distributions. This means that in the limit,
the probability density $f(p, s, x, t) := \frac{\partial^n}{\partial x^1 \cdots \partial x^n} P(X(t) \le x \mid X(s) = p)$ with $s < t$ will
satisfy
\[
\frac{\partial}{\partial t} f(p, s, x, t) = \frac{1}{2} \sum_{i,j=1}^{n} \frac{\partial^2}{\partial x^i \partial x^j} \left( x^i (\delta^i_j - x^j) f(p, s, x, t) \right) - \sum_{i=1}^{n} \frac{\partial}{\partial x^i} \left( b^i(x, t) f(p, s, x, t) \right), \tag{1.2.4}
\]
\[
-\frac{\partial}{\partial s} f(p, s, x, t) = \frac{1}{2} \sum_{i,j=1}^{n} p^i (\delta^i_j - p^j) \frac{\partial^2}{\partial p^i \partial p^j} f(p, s, x, t) + \sum_{i=1}^{n} b^i(p, s) \frac{\partial}{\partial p^i} f(p, s, x, t), \tag{1.2.5}
\]
where the second order terms arise from random genetic drift, which therefore is
seen as the most important mechanism, whereas the first order terms with their
coefficients $b^i$ incorporate the effects of the other evolutionary forces.
Again, this is standard in the population genetics literature since its original
introduction by Wright and its systematic investigation by Kimura. We shall develop
a geometric framework that will interpret the coefficients of the second order terms
as the inverse of the Fisher metric of mathematical statistics. Among other things,
this will enable us to find explicit solutions of these equations which, importantly,
are valid across loss of allele events. In particular, we can then determine all
quantities of interest, like the expected extinction times of alleles in the population,
in a more general and systematic manner than so far known in the literature.
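The relation between the second order coefficients and the Fisher metric can be checked numerically. The sketch below assumes the common coordinate expression of the Fisher metric on the simplex, g_ij = δ_ij / x^i + 1 / x^0 with x^0 = 1 - (x^1 + ... + x^n). Normalization conventions differ by constant factors, and the convention used in the book is not visible in this excerpt. The code verifies that the matrix with entries x^i (δ_ij - x^j) is the inverse of this g. The function names and the sample point are illustrative.

```python
def drift_coefficients(x):
    """Matrix with entries x^i * (delta_ij - x^j) for frequencies x = (x^1, ..., x^n)."""
    n = len(x)
    return [[x[i] * ((1.0 if i == j else 0.0) - x[j]) for j in range(n)] for i in range(n)]


def fisher_metric(x):
    """Assumed convention: g_ij = delta_ij / x^i + 1 / x^0, where x^0 = 1 - sum(x)."""
    n = len(x)
    x0 = 1.0 - sum(x)
    return [[(1.0 / x[i] if i == j else 0.0) + 1.0 / x0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    x = (0.2, 0.3, 0.1)  # interior point of the simplex, with x^0 = 0.4
    for row in matmul(fisher_metric(x), drift_coefficients(x)):
        print([round(v, 12) for v in row])
```

The check only makes sense in the interior of the simplex, since g is undefined as soon as one of the frequencies vanishes.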
1.3 Literature on the Wright–Fisher Model
In this section, we discuss some of the literature on the Wright–Fisher model. Our
treatment here is selective, for several reasons. First, there are simply too many
papers to list them all and to discuss and compare their relevant contributions.
Second, we may have overlooked some papers. Third, our intention is to develop a
new and systematic approach for the Wright–Fisher model, based on the geometric
as opposed to the stochastic or analytical structure of the model. This approach
can unify many previous results and develop them from a general perspective, and
therefore, we did not delve so deeply into some of the different methods that have
been applied to the Wright–Fisher model since its inception.
Actually, there exist some monographs on population genetics with a systematic
mathematical treatment of the Wright–Fisher model that also contain extensive
bibliographies, in particular [15, 33, 39], and the reader will find there much useful
information that we do not repeat here.
But let us first recall the history of the Wright–Fisher model (as opposed to
other population genetics models, cf. for example [17, 18] for a branching process
model). The Wright–Fisher model was initially presented implicitly by Ronald
Fisher in [46] and explicitly by Sewall Wright in [125]—hence the name. A third
person with decisive contributions to the model was Motoo Kimura. In 1945,
Wright approximated the discrete process by a diffusion process that is continuous
in space and time (continuous process, for short) and that can be described by a
Fokker–Planck equation. By solving this Fokker–Planck equation derived from the
Wright–Fisher model, Kimura then obtained an exact solution for the Wright–Fisher
model in the case of two alleles in 1955 (see [79]). Shortly afterwards, Kimura [78]
produced an approximation for the solution of the Wright–Fisher model in the multi-
allele case, and in [80], he obtained an exact solution of this model for three alleles
and concluded that this can be generalized to arbitrarily many alleles. This yields
more information about the Wright–Fisher model as well as the corresponding
continuous process. We also mention the monograph [24] where Kimura’s theory
is systematically developed. Kimura’s solution, however, is not entirely satisfactory.
For one thing, it depends on very clever algebraic manipulations so that the general
mathematical structure is not very transparent, and this makes generalizations very
difficult. Also, Kimura’s approach is local in the sense that it does not naturally
incorporate the transitions resulting from the (irreversible) loss of one or more
alleles in the population. Therefore, for instance the integral of his probability
density function on its domain need not be equal to 1. Baxter et al. [14] developed
a scheme that is different from Kimura’s; it uses separation of variables and works
for an arbitrary number of alleles.
While the original model of Wright and Fisher works with a finite population in
discrete time, many mathematical insights into its behavior are derived from its diffusion
approximation that passes to the limit of an infinite population in continuous
time. As indicated, the potential of the diffusion approximation had been realized
already by Wright and, in particular, by Kimura. The diffusion approximation
also makes an application of the general theory of strongly continuous semigroups
and Markov processes possible, and this then led to a more systematic approach
(cf. [43, 119]). In this framework, the diffusion approximation for the multi-allele
Wright–Fisher model was derived by Ethier and Nagylaki [36–38], and a proof of
convergence of the Markov chain to the diffusion process can be found in [34, 56].
Mathematicians then derived existence and uniqueness results for solutions of the
diffusion equations from the theory of strongly continuous semigroups [34, 36, 77]
or martingale theory (see, for example [109, 110]). Here, however, we shall not
appeal to the general theory of stochastic processes in order to derive the diffusion
approximation, but rather proceed directly within our geometric framework.
As the diffusion operator of the diffusion approximation becomes degenerate
at the boundary, the analysis at the boundary becomes difficult, and this issue
is not addressed by the aforementioned results, but was dealt with by more
specialized approaches. An alternative to those methods and results, some of which
we shall discuss shortly, is the recent approach of Epstein and Mazzeo [29–31], which
systematically treats singular boundary behavior of the type arising in the
Wright–Fisher model with tools from the regularity theory of partial differential equations.
We shall also return to their work in a moment, but we first want to identify
the source of the difficulties. This is the possibility that alleles get lost from the
population by random drift. As it turns out, this is ultimately inevitable: in the
basic model, in the absence of mutations or particular balancing selective effects,
it will happen almost surely as time goes to infinity. This is the key issue,
and the full structure of the Wright–Fisher model and its diffusion approximation
is only revealed when one can connect the dynamics before and after the loss of an
allele, or in analytic terms, if one can extend the process from the interior of the
probability simplex to all its boundary strata. In particular, this is needed to preserve
the normalization of the probability distribution. In geometric terms, we have an
evolution process on a probability simplex. The boundary strata of that simplex
correspond to the vanishing of some of the probabilities. In biological terms, when a
probability vanishes, the corresponding allele has disappeared from the population.
As long as there is more than one allele left, the probabilities continue to evolve.
Thus, we get not only a flow in the interior of the simplex, but also flows within all
the boundary strata. The key issue then is to connect these flows in an analytical,
geometric, or stochastic manner.
Before going into further details, however, we should point out that the diffusion
approximation leads to two different partial differential equations, the Kolmogorov
forward or Fokker–Planck equation on one hand and the Kolmogorov backward
equation on the other hand. While these two equations are connected by a duality
Some ideas from statistical mechanics are already contained in the free fitness
function introduced by Iwasa [67] as a consequence of H-theorems. Such ideas will
be developed here within the modern theory of free energy functionals. A different
approach from statistical mechanics which can also produce explicit formulae
involves master equations for probability distributions; they have been applied to
the Moran model [89] of population genetics in [65]. That model will be briefly
described in Sect. 2.4.
Large deviation theory has been systematically applied to the Wright–Fisher
model by Papangelou [96–100], although this is usually not mentioned in the
literature. In Chap. 7, we can build upon his work.
As already mentioned, the Kolmogorov equations of the Wright–Fisher model
are not accessible to standard stochastic theory, because of their boundary behavior.
In technical terms, the square root of the coefficients of the second order terms of
the operators is not Lipschitz continuous up to the boundary. As a consequence, in
particular the uniqueness of solutions to the above Kolmogorov backward equations
may not be derived from standard results.
In this situation, Epstein and Mazzeo [29–31] have developed PDE techniques to
tackle the issue of solving PDEs on a manifold with corners that degenerate at the
boundary with the same leading terms as the Kolmogorov backward equation (1.2.5)
for the Wright–Fisher model in the closure of the probability simplex in
$(\overline{\Delta}_n)_{-\infty} = \overline{\Delta}_n \times (-\infty, 0)$. Such an analysis had been started by Feller [43] (and essentially also
[42]), who had considered equations of the form
$$\frac{\partial}{\partial t} f(x,t) = x \frac{\partial^2}{\partial x^2} f(x,t) + b \frac{\partial}{\partial x} f(x,t) \quad \text{for } x \geq 0 \tag{1.3.1}$$
with $b \geq 0$, that is, equations that have the same singularity at the boundary
x D 0 as the Fokker–Planck or Kolmogorov forward equation of the simplest
type of the Wright–Fisher model. Feller could compute the fundamental solution
for this problem and thereby analyze the local behavior near the boundary. In
particular, the case where $b \to 0$ is subtle; in biological terms, this corresponds
to the transition from a setting with mutation to one without, and without mutation,
the boundary becomes absorbing. For more recent work in this direction, see for
instance [21]. In any case, this approach, which focuses on the precise local analysis
at the boundary, requires only a particular type of asymptotics near the
boundary, and can therefore apply general tools from analysis, should be contrasted
with Kimura’s, who looked for global solutions as expansions in
eigenfunctions, an approach that needs the precise algebraic structure of the equations.
Epstein and Mazzeo [29, 30] then take up the local approach and develop it much
further. A main achievement of their analysis is the identification of the appropriate
function spaces. These are anisotropic Schauder spaces. In [31], they develop a
different PDE approach and derive and apply a Moser type Harnack inequality,
that is, probably the most powerful general tool of PDE theory for studying the
regularity of solutions of partial differential equations. According to general results
in PDE theory, such a Harnack inequality follows when the underlying metric and
1.4 Synopsis
We now briefly describe, in somewhat informal terms, our approach and results.
Again, we begin with the case of a single locus. As already indicated, we consider
the relative frequencies or probabilities $p^0, \dots, p^n$ on the set $\{0, 1, \dots, n\}$ of possible
associated to our iterated sampling from the multinomial distribution. In fact, the
Kolmogorov equations can naturally be interpreted as diffusion equations w.r.t. the
Fisher metric. One should note, however, that the Kolmogorov equations are not
in divergence form, and therefore, they do not constitute the natural heat equation
for the Fisher metric, or in other words, they do not model Brownian motion
for the Fisher metric. They rather have to be interpreted in terms of the dually
affine connections of Amari and Chentsov that we mentioned earlier. From that
perspective, entropy functions emerge as potentials. In particular, this will provide
us with a beautiful geometric approach to the exit times of the process, that is,
the expected times of allele losses from the population. When considering so-
called exponential families (called Gibbs distributions in statistical mechanics),
information geometry also naturally connects with the basic quantities of statistical
mechanics. These are entropy and free energy. As is well known in statistical
mechanics, the free energy functional and its derivatives encode all the moments
of a process. We shall make systematic use of this powerful scheme, and also
indicate some connections to recent research in stochastic analysis. In Chap. 7, we
shall explore large deviation principles in the context of the Wright–Fisher model.
Moreover, the geometric structure behind the Kolmogorov equations will also guide
our analysis of the transitions between the different boundary strata of the simplex.
This will constitute our main technical achievement.
As discussed, the key is the degeneracy at the boundary of the Kolmogorov
equations. While from an analytical perspective, this presents a profound difficulty
for obtaining boundary regularity of the solutions of the equations, from a biological
or geometric perspective, this is very natural because it corresponds to the loss
of some alleles from the population in finite time by random drift. And from
a stochastic perspective, this has to happen almost surely. For the Kolmogorov
forward equation, in Chap. 8, we gain a global solution concept from the equations
for the moments of the process, which incorporate the dynamics on the entire
simplex, including all its boundary strata. This also involves the duality between
the Kolmogorov forward equation and the Kolmogorov backward equation. In
Chap. 9, we then develop a careful notion of hierarchically extended solutions of
the Kolmogorov backward equation, and we show their uniqueness both in the time
dependent and in the stationary case. The stationary case is described by an elliptic
equation whose solutions arise from the time dependent equation as time goes to
infinity.² The stationary equation is important because, for instance, the expected
times of allele loss are solutions of an inhomogeneous stationary equation. From
our information geometric perspective, as already mentioned, we can interpret these
solutions most naturally in terms of entropies.
² In fact, one might be inclined to say that time goes to minus infinity in the backward case, because
this corresponds to the infinite past. With this time convention, however, the Kolmogorov backward
equation is not parabolic. When we change the direction of time, it becomes parabolic, and we can
then speak of time going to infinity. This is mathematically natural, although not compatible with the
biological interpretation.
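The remark that expected times of allele loss solve an inhomogeneous stationary equation has an elementary discrete counterpart, which may help to fix ideas. The following Python sketch is an illustration only and is not taken from the text: it assumes the two-allele Wright–Fisher chain on $2N$ alleles with binomial transition probabilities, and the function names `transition_matrix` and `expected_absorption_times` are hypothetical. For the discrete chain, the expected time $t(\eta)$ until one of the two alleles is lost satisfies $t(\eta) = 1 + \sum_y P(\eta, y)\, t(y)$ for $0 < \eta < 2N$ with $t(0) = t(2N) = 0$, which is a linear system with a constant inhomogeneity.

```python
from fractions import Fraction
from math import comb


def transition_matrix(two_N):
    """P[eta][y] = P(Y_{m+1} = y | Y_m = eta) for two alleles (binomial sampling)."""
    return [[comb(two_N, y) * Fraction(eta, two_N) ** y
             * Fraction(two_N - eta, two_N) ** (two_N - y)
             for y in range(two_N + 1)]
            for eta in range(two_N + 1)]


def expected_absorption_times(two_N):
    """Solve t(eta) = 1 + sum_y P[eta][y] t(y) on the states 1..2N-1,
    with t(0) = t(2N) = 0, by exact Gauss-Jordan elimination."""
    P = transition_matrix(two_N)
    interior = list(range(1, two_N))
    size = len(interior)
    # Augmented system (I - Q) t = 1, where Q is P restricted to the interior.
    rows = [[(Fraction(1) if a == b else Fraction(0)) - P[a][b] for b in interior]
            + [Fraction(1)] for a in interior]
    for col in range(size):
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(size):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [v - factor * w for v, w in zip(rows[r], rows[col])]
    return [Fraction(0)] + [rows[k][size] for k in range(size)] + [Fraction(0)]
```

Exact rational arithmetic is used so that the defining equation can be checked without rounding error; for larger populations a floating-point linear solver would be the more practical choice.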
In Chap. 10, we shall explore how the schemes developed in this book, namely
the moment equations and free energy schemes, information geometry, and the
expansions of solutions of the Kolmogorov equations in terms of Gegenbauer
polynomials, will provide us with computational tools for deriving formulas for basic
quantities of interest in population genetics.
We mainly focus on the basic Wright–Fisher model in the absence of additional
effects like selection or mutation. Nevertheless, we shall describe, in line with
the standard literature, how this will modify the equations. Also, in Sect. 6.1, we
shall systematically apply the moment generating function and energy functional
method to those issues. The issue of recombination will be treated in more detail
in Chap. 5 because here our geometric approach on one hand leads to an important
simplification of Kimura’s original treatment and on the other hand also provides
general insight into the geometry of linkage equilibria.
Chapter 2
The Wright–Fisher Model
The Wright–Fisher model considers the effects of sampling for the distribution of
alleles across discrete generations. Although the model is usually formulated for
diploid populations, and some of the interesting effects occurring in generalizations
depend on that diploidy, the formal scheme emerges already for haploid populations.
In the basic version, with which we start here, there is a single genetic locus that
can be occupied by different alleles, that is, alternative variants of a gene.¹ In the
haploid case, it is occupied by a single allele, whereas in the diploid case, there are
two alleles at the locus. Biologically, diploidy expresses the fact that one allele is
inherited from the mother and the other from the father. However, the distinction
between female and male individuals is irrelevant for the basic model. In biological
terminology, we thus consider monoecious (hermaphrodite) individuals. Inheritance
is then symmetric between the parents, without a distinction between fathers and
mothers. Consequently, it does not matter from which parent an allele is inherited,
and there will be no effective difference between the two alleles at a site, that is, their
order is not relevant. Even in the case of dioecious individuals, one might still make
the simplifying assumption that it does not matter whether an allele is inherited
from the mother or the father. While there do exist biological counterexamples, one
might argue that for mathematical population genetics, this could be considered as
a secondary or minor effect only. Nevertheless, it would not be overly difficult to
extend the theory presented here to also include such effects.
Generalizations will be discussed subsequently, and we start with the simplest
case. In particular, for the moment, we assume that there are no selective differences
between these alleles and no mutations. These assumptions will be relaxed later,
after we have understood the basic model.
¹ Obviously, the term “gene” is used here in a way that abstracts from most biological details.
In order to have our conventions best adapted to the diploid case, we consider
a population of 2N alleles. In the haploid case, we are thus dealing with 2N
individuals, each carrying a single allele, whereas in the diploid case, we have N
individuals carrying two alleles each.
For each of these alleles, there are $n + 1$ possibilities. We begin with the simplest
case, $n = 1$, where we have two types of alleles $A_0, A_1$. In the diploid case, an
individual can be a homozygote of type $A_0A_0$ or $A_1A_1$ or a heterozygote of type
$A_0A_1$ or $A_1A_0$; we do not care about the order of the alleles and therefore
identify the latter two types. The population reproduces in discrete time steps. In the
haploid case, each allele in generation $m + 1$ is randomly and independently chosen
from the allele population of generation $m$. In the diploid case, each individual in
generation $m + 1$ inherits one allele from each of its parents. When a parent is a
heterozygote, each allele is chosen with probability $1/2$. Here, for each individual
in generation $m + 1$, two parents in generation $m$ are chosen at random. All the
choices are independent of each other. Thus, the alleles in generation $m + 1$ are
chosen by random sampling with replacement from the ones in generation $m$. In this
model, the two parents of any particular individual might be identical (that is, in
biological terminology, selfing is possible), but of course, the probability for that to
occur goes to zero like $\frac{1}{N}$ when the population size increases. Also, each individual
in generation $m$ may foster any number of offspring between 0 and $N$ in generation
$m + 1$ and thereby contribute between 0 and $2N$ alleles.
In any case, the model is not concerned with the lineage of any particular
individual, but rather with the relative frequencies of the two alleles in each
generation. Even though the diploid case appears more complicated than the haploid
one, at this stage, the two are formally equivalent, because in either case the 2N
alleles present in generation $m + 1$ are randomly and independently sampled from
those in generation $m$. In fact, from a mathematical point of view, the individuals
play no role, and we are simply dealing with multinomial sampling in a population
of $2N$ alleles belonging to $n + 1$ different classes. The only reason at this stage to
talk about the diploid case is that that case will offer more interesting perspectives
for generalization below.
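The sampling step just described can be sketched in a few lines of Python. This is a minimal illustration and not part of the original development: the function names `next_generation` and `allele_counts`, and the representation of the population as a list of allele labels, are assumptions made for the example.

```python
import random
from collections import Counter


def next_generation(alleles, rng):
    """Sample a new generation of the same size, with replacement.

    `alleles` is a list of 2N allele labels (e.g. 0 for A_0, 1 for A_1).
    Each allele of generation m+1 is drawn independently and uniformly
    from the alleles of generation m.
    """
    return [rng.choice(alleles) for _ in alleles]


def allele_counts(alleles, n_types):
    """Return the number of copies of each allele type 0, ..., n_types - 1."""
    counts = Counter(alleles)
    return [counts.get(i, 0) for i in range(n_types)]


if __name__ == "__main__":
    rng = random.Random(0)             # fixed seed, chosen arbitrarily
    generation = [0] * 12 + [1] * 8    # hypothetical start: 2N = 20 alleles
    for m in range(5):
        print(m, allele_counts(generation, 2))
        generation = next_generation(generation, rng)
```

Only the list of $2N$ alleles enters the update, which reflects the remark that the individuals themselves play no role at this stage.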
The quantity of interest therefore is the number² $Y_m$ of alleles $A_0$ in the population
at time $m$. This number then varies between 0 and $2N$. The distribution of allele
numbers thus follows the binomial distribution. When $n > 1$, the principle remains
the same, but we need to work more generally with the multinomial distribution. We
shall now discuss the basic properties of that distribution.
² The random variable $Y$ will carry two different indices in the course of our text. Sometimes, the
index $m$ is chosen to indicate the generation time, but at other occasions, we rather use the index
$2N$ for the number of alleles in the population, that is, more shortly, (twice) the population size.
2.2 The Multinomial Distribution
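and hence
$$\operatorname{Var}(Y_1^i) = p^i (1 - p^i), \qquad \operatorname{Cov}(Y_1^i, Y_1^j) = -p^i p^j \quad \text{for } i \neq j. \tag{2.2.2}$$
For $2N$ independent draws, this gives
$$E(Y_{2N}^i) = 2N p^i, \qquad \operatorname{Var}(Y_{2N}^i) = 2N p^i (1 - p^i), \qquad \operatorname{Cov}(Y_{2N}^i, Y_{2N}^j) = -2N p^i p^j \quad \text{for } i \neq j. \tag{2.2.3}$$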
By the same kind of reasoning, we also get
$$E\big((Y_{2N}^i)^{\alpha}\big) = O(2N) \tag{2.2.4}$$
for all other moments (where $\alpha$ is a multi-index with $|\alpha| \geq 3$ whose convention will
be explained below in Sect. 2.11).
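The first and second moments in (2.2.3) can be confirmed by brute-force enumeration for a small example. The following Python sketch is an illustration under stated assumptions: the helper names `multinomial_pmf` and `exact_moments` are hypothetical, and exact rational arithmetic is used so that the comparison involves no rounding.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def multinomial_pmf(counts, probs):
    """Exact probability of `counts` under multinomial sampling with `probs`."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)      # each division is exact
    weight = Fraction(1)
    for c, p in zip(counts, probs):
        weight *= p ** c
    return coeff * weight


def exact_moments(total, probs):
    """Return (means, covariance matrix) by summing over all outcomes."""
    k = len(probs)
    outcomes = [y for y in product(range(total + 1), repeat=k) if sum(y) == total]
    pmf = {y: multinomial_pmf(y, probs) for y in outcomes}
    mean = [sum(pmf[y] * y[i] for y in outcomes) for i in range(k)]
    cov = [[sum(pmf[y] * (y[i] - mean[i]) * (y[j] - mean[j]) for y in outcomes)
            for j in range(k)] for i in range(k)]
    return mean, cov
```

Calling `exact_moments(4, [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)])` corresponds to $2N = 4$ and three allele types; the off-diagonal covariance entries come out negative, in agreement with the sign in (2.2.3).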
We also point out the following obvious lumping lemma.
Lemma 2.2.1 Consider a map
$$\ell \colon \Sigma_n \to \Sigma_m, \qquad (p^0, \dots, p^n) \mapsto (q^0, \dots, q^m)$$
$$\text{with } q^j = \sum_{i = i_{j-1}+1, \dots, i_j} p^i \quad \text{where } i_0 = -1,\ i_m = n, \tag{2.2.5}$$
that is, we lump the alleles $A_{i_{j-1}+1}, \dots, A_{i_j}$ into the single super-allele $B_j$. Then the
random variable $Z_{2N}^j$ that records multinomial sampling from $\Sigma_m$ is given by
$$Z_{2N}^j = \sum_{i = i_{j-1}+1, \dots, i_j} Y_{2N}^i. \tag{2.2.6}$$
□
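The lumping lemma can be checked exactly on a small example. In the following Python sketch, which is an illustration and not part of the original text, the blocks of consecutive allele indices are passed explicitly as tuples; the names `lump` and `lumped_distribution` are hypothetical. The code pushes the multinomial law of $Y_{2N}$ forward under the block sums and the result can then be compared with the multinomial law for the lumped probabilities.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product
from math import factorial


def multinomial_pmf(counts, probs):
    """Exact probability of `counts` under multinomial sampling with `probs`."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)      # each division is exact
    weight = Fraction(1)
    for c, p in zip(counts, probs):
        weight *= p ** c
    return coeff * weight


def lump(values, blocks):
    """Sum the entries of `values` over each block of indices."""
    return tuple(sum(values[i] for i in block) for block in blocks)


def lumped_distribution(total, probs, blocks):
    """Law of Z = lump(Y) when Y is multinomial with parameters (total, probs)."""
    law = defaultdict(Fraction)     # Fraction() is zero
    for y in product(range(total + 1), repeat=len(probs)):
        if sum(y) == total:
            law[lump(y, blocks)] += multinomial_pmf(y, probs)
    return dict(law)
```

For instance, with four alleles and `blocks = [(0, 1), (2, 3)]`, the alleles $A_0, A_1$ form one super-allele and $A_2, A_3$ the other.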
2.3 The Basic Wright–Fisher Model
For the Wright–Fisher model, we simply iterate this process across several generations.
Thus, we introduce a discrete time $m$ and let this time $m$ now be the subscript
for $Y$ instead of the $2N$ that we had employed so far to indicate the total number of
alleles present in the population. Instead of the absolute probabilities of multinomial
sampling, we now need to consider the transition probabilities.
That is, when we know what the allele distribution at time $m$ is and when we
multinomially sample from that distribution, we want to know the probabilities
for the resulting distribution at time $m + 1$. Moreover, we want to know not only the
expectation values for the numbers of alleles (which remain constant in time) and
the variances and covariances (which grow in time in the sense that if we start at
time 0 and want to know the distribution at time $m$, the formulas in (2.2.3) acquire a
factor $m$), but also the entire distribution of allele frequencies.
We recall that we have $n + 1$ possible alleles $A_0, \dots, A_n$ at a given locus, still in a
diploid population of fixed size $N$. There are therefore $2N$ alleles in the population
in any generation, so it is sufficient to focus on the number $Y_m = (Y_m^1, \dots, Y_m^n)$ of
alleles $A_1, \dots, A_n$ at generation time $m$. Assume that $Y_0 = \eta_0 = (\eta_0^1, \dots, \eta_0^n)$
and that, as before, the alleles in generation $m + 1$ are derived by sampling with
replacement from the alleles of generation $m$. Thus, the transition probability is
given by the multinomial formula
$$P(Y_{m+1} = y \mid Y_m = \eta) = \frac{(2N)!}{(y^0)!\,(y^1)! \cdots (y^n)!} \prod_{i=0}^{n} \left( \frac{\eta^i}{2N} \right)^{y^i}, \tag{2.3.1}$$
where
$$\eta, y \in S_n^{(2N)} = \left\{ \xi = (\xi^1, \dots, \xi^n) : \xi^i \in \{0, 1, \dots, 2N\},\ \sum_{i=1}^{n} \xi^i \leq 2N \right\}$$
and
$$\eta^0 = 2N - |\eta| = 2N - \eta^1 - \dots - \eta^n; \qquad y^0 = 2N - |y| = 2N - y^1 - \dots - y^n.$$
and
$$P(Y_{m+1} = y \mid Y_m = \eta) = \frac{(2N)!}{(y^0)!\,(y^1)! \cdots (y^{j-1})!\,(y^{j+1})! \cdots (y^n)!} \prod_{i \neq j} \left( \frac{\eta^i}{2N} \right)^{y^i} \tag{2.3.3}$$
for $y^j = 0$. Thus, whenever allele $j$ disappears from the population, we simply get
the same process with one fewer allele. Iteratively, we can let $n$ alleles disappear so
that only one allele remains which will then live on forever.
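The multinomial transition probability and the behavior after the loss of an allele can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than part of the text: the function names `states` and `transition_probability` are hypothetical, a state is stored as the tuple of counts of $A_1, \dots, A_n$, and the count of $A_0$ is recovered as the remainder up to $2N$.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def states(n, two_N):
    """All tuples of n non-negative integer counts whose sum is at most 2N."""
    return [s for s in product(range(two_N + 1), repeat=n) if sum(s) <= two_N]


def transition_probability(y, eta, two_N):
    """P(Y_{m+1} = y | Y_m = eta) for multinomial sampling of 2N alleles."""
    y_full = (two_N - sum(y),) + tuple(y)          # prepend the count of A_0
    eta_full = (two_N - sum(eta),) + tuple(eta)    # prepend the count of A_0
    coeff = factorial(two_N)
    for c in y_full:
        coeff //= factorial(c)                     # each division is exact
    prob = Fraction(coeff)
    for c, e in zip(y_full, eta_full):
        prob *= Fraction(e, two_N) ** c            # a zero count contributes 0 ** 0 == 1
    return prob
```

With $2N = 4$ and three alleles, a state such as `(0, 2)` has lost allele $A_1$. The probability of moving from it to `(1, 1)` is zero, while the probability of staying at `(0, 2)` equals that of the two-allele process staying at `(2,)`, which is the reduction to a process with one fewer allele described above.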
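Returning to the general case, we then also have the probability
$$P(Y_{m+1} = y \mid Y_0 = \eta_0) = \sum_{\eta_1, \dots, \eta_m} P(Y_{m+1} = y \mid Y_m = \eta_m)\, P(Y_m = \eta_m \mid Y_{m-1} = \eta_{m-1}) \cdots P(Y_1 = \eta_1 \mid Y_0 = \eta_0),$$
assuming that the process started with the allele distribution $\eta_0$ at time 0.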
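The sum over intermediate states can be evaluated by repeatedly applying the one-step transition probabilities. The following Python sketch does this for the two-allele case ($n = 1$); it is an illustration under stated assumptions, and the function names are hypothetical. It can be used to confirm that the mean of the allele count stays constant, and to compare the variance with the closed form $(2N)^2 p(1-p)\big(1 - (1 - \tfrac{1}{2N})^m\big)$ for $p = \eta_0 / 2N$, a standard identity for this chain that is close to $m \cdot 2N p(1-p)$ as long as $m$ is small compared with $2N$.

```python
from fractions import Fraction
from math import comb


def transition_matrix(two_N):
    """P[eta][y] = P(Y_{m+1} = y | Y_m = eta) for two alleles (binomial sampling)."""
    return [[comb(two_N, y) * Fraction(eta, two_N) ** y
             * Fraction(two_N - eta, two_N) ** (two_N - y)
             for y in range(two_N + 1)]
            for eta in range(two_N + 1)]


def distribution_after(m, eta0, two_N):
    """Law of Y_m given Y_0 = eta0, summing over all intermediate states."""
    P = transition_matrix(two_N)
    law = [Fraction(0)] * (two_N + 1)
    law[eta0] = Fraction(1)
    for _ in range(m):
        law = [sum(law[eta] * P[eta][y] for eta in range(two_N + 1))
               for y in range(two_N + 1)]
    return law


def mean_and_variance(law):
    """Mean and variance of the allele count under the distribution `law`."""
    mean = sum(y * p for y, p in enumerate(law))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(law))
    return mean, var
```

The entries `law[0]` and `law[-1]` of the returned list carry the probability that one of the two alleles has already been lost by time $m$; this mass increases with $m$.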
From (2.2.3), we have