
Information Geometry (2024) 7 (Suppl 1):S211–S227

https://doi.org/10.1007/s41884-022-00086-6

SURVEY PAPER

Geometry and applied statistics

Paul Marriott
Department of Statistics and Actuarial Science, University of Waterloo, 200 University Ave West, Waterloo, ON N2L 3G1, Canada
[email protected]

Received: 5 September 2022 / Revised: 25 November 2022 / Accepted: 26 November 2022 / Published online: 6 December 2022
Communicated by Shinto Eguchi.
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2022
Abstract
We take a very high level overview of the relationship between Geometry and Applied
Statistics 50 years from the birth of Information Geometry. From that date we look
both backwards and forwards. We show that Geometry has always been part of the
statistician’s toolbox and how it played a vital role in the evolution of Statistics in the
last 50 years.

Keywords Applied statistics · Information geometry

1 Introduction

This review of Information Geometry (IG), written 50 years after Čensov's groundbreaking
paper, [1], takes an applied statistical perspective. One audience is statisti-
cians who are intrigued by the links between Geometry and Statistics. It is also to be
hoped that experts in IG will be interested in seeing their own subject relative to the
history of Statistics as a whole.
We use the term ‘Geometry’ in a very broad sense. The term includes the basic
intuitive concepts of geometry: invariance, linearity and affineness, convexity, orthog-
onality, size, distance and shape. These will be combined with more technical
mathematical tools developed in the theory of geometry, particularly differential and
convex geometry.
Applied Statistics is going through a period of intense development driven, at least
partly, by ever faster and cheaper computing, new types of data with associated new
statistical questions, and major new theoretical developments. Nevertheless Geometry,
in the general sense described above, can be found in these new developments as well as

in Statistics’ past. This paper tries to illustrate these connections which are sometimes
explicit and sometimes more hidden.
Geometric ideas have always been used in Statistics, see Sect. 2, and the structure
of IG has its roots in these applications. In Sect. 3 we see the way that IG formalises
these key early ideas into a mature methodology. In Sect. 4 we look at some of the most
important new developments in Statistics and see, through a set of key case studies,
how Geometry keeps its place at the core of applied statistical thought. We close, in
Sect. 5, with some thoughts on cases where statistical and geometric intuitions do not
always agree. In order to keep the discussion as non-technical as possible, but still
be fairly self contained, definitions and key technical references have been put in the
Appendix.

2 Geometry in statistics: hiding in plain sight

In order to highlight the foundational links between Geometry and Applied Statistics it
is helpful to first look backwards from Čensov’s 1972 paper [1] to the very beginnings
of Statistics as a discipline.
Core statistical techniques, such as ordinary least squares (OLS) and the analysis
of variance (ANOVA), have their roots in Euclidean geometry, see Definition 1 in the
Appendix. They are motivated by the idea of minimising a distance from a point to an
affine subset of Euclidean space. Whether we assign priority for the discovery of the
method of least squares to Gauss or Legendre [2, 3] they were both brilliant geometers
and probabilists. It is the combination of geometry and probability laws that gives
these methods their power, [4]. Beran notes in [5] that Gauss’s two justifications of the
use of Euclidean geometry in least squares can be seen either in terms of maximum
likelihood or through ideas of risk and the Gauss-Markov theorem.
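
The Euclidean picture can be checked directly. Below is a minimal Python sketch, using simulated (hypothetical) data, verifying that the OLS fit is the orthogonal projection of the response onto the column space of the design matrix: the residual is perpendicular to that subspace and squared lengths decompose in Pythagorean fashion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                        # design: col(X) is the affine subset
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimises ||y - X beta||^2
y_hat = X @ beta_hat                               # orthogonal projection of y onto col(X)
residual = y - y_hat

# Orthogonality: the residual is perpendicular to every column of X
print(np.allclose(X.T @ residual, 0.0, atol=1e-8))             # True
# Pythagoras: ||y||^2 = ||y_hat||^2 + ||residual||^2
print(np.isclose(y @ y, y_hat @ y_hat + residual @ residual))  # True
```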
It is important to recognize that the choice of geometry being Euclidean is an
assumption that needs careful consideration.

Example 1 Gauss’s 1795 use of least squares concerned a geodetic survey. The data
involved both measures of distance on the surface of the earth—hence on a sphere—
and also arc length measurements in degrees latitude, [3]. Hence strictly speaking the
underlying geometry is not Euclidean but involves spherical geometry. However on
the scale of the data a local Euclidean approximation is appropriate. Thus even this
early work hints at ideas from manifold based geometry, see Definition 2.

2.1 Fisher and geometry

It could be argued that Applied Statistics really only started to become a mature
discipline after the work of Fisher. For example he was influential in recognising the
importance of randomisation in experimental design, the use of the likelihood function,
formalising inference in regression and the use of ANOVA.
Primarily Fisher was a scientist, [6, 7], whose thinking about Statistics was driven
by his experience designing and analysing the results of experiments. However it
seems clear that he also had a very strong geometric intuition. A well-known example


of this being his discussion of Student’s t-statistic, [8], where he states: ‘It is, however,
of interest to notice that the form establishes itself instantly, when the distribution of
the sample is viewed geometrically’.
His development of the analysis of variance also can be seen, and taught, in a highly
geometric way involving orthogonal projections and squared distances in Euclidean
space, with degrees of freedom being dimensions and squared distances being cali-
brated by the appropriate χ²-distribution.
While Fisher clearly recognised the link between the normal distribution and
Euclidean geometry his motivation for using geometry seems to have a different
source than Gauss. It came more from randomisation in experimental design which
he regarded as vital both to remove bias and to quantify error, [9]. Box [6] writes that
Fisher thought that ‘given that randomization had been carried out inferences should
be made from the appropriate randomization distribution; to which, however, stan-
dard normal theory often provided an adequate approximation.’ Also, according to
[7], his analysis of variance was always made conditional on randomisation. Hall, in
[9], comments on the importance of Fisher’s visualisation of Euclidean spaces formu-
lating his intuition about experimental design and randomisation. She states ‘Fisher’s
own view of the randomness inherent in sampling was a geometric one; it began with
considering n observations as the n coordinates of a point in Euclidean n-space.’
The underlying geometry is coming from the group of permutations which, in the
randomised experimental context, is a real part of the design and not derived from
modelling assumptions.
Another example of Fisher’s geometric understanding concerns invariance. Accord-
ing to [10] invariance was one of his motivations to develop maximum likelihood
methods, [11]. He noted that least squares methods, such as curve fitting by minimising
expressions of the form $\int (y(x) - f(x;\theta))^2 \, dx$, were not invariant to non-linear
transformations of x. However Fisher noted that the maximum likelihood estimate is
invariant to such transformations. It may be that Fisher was the first to recognise the
importance of invariance since Aldrich, [10], states that ‘Noninvariance is important
in this paper [11] and in Fisher 1912–1922 generally but it was not important in the
statistical literature he knew.’

2.2 Models and geometry

Aside from its invariance properties, Fisher championed maximum likelihood methods
over, say, Pearson's method of moments on the grounds of efficiency, [11]. This
led to a more complete geometric treatment in Rao's 1945 paper, [12]. As is well-known,
Rao recognised that the Fisher information—Definition 2—is an important
object in Differential Geometry: the Riemannian metric tensor, [12]. Under regularity
conditions a parametric statistical model, M := {f(x|θ) | θ ∈ Θ}, can be thought
of as a smooth manifold, Definition 2. Rao understood that the Fisher information
can define invariant measures of length and angles on the tangent space, and invariant
geodesic distances between distributions in the family f(x|θ). This geometry was one
of the motivating factors at the start of the development of Information Geometry as
described in the seminal work of Amari, [13].


Table 1 Stevens' classes of data and associated geometry

Scale      Basic empirical operation   Mathematical group     Permissible statistics
Nominal    Determination of equality   Permutation group      Counts, mode
Ordinal    Inequality                  Isotonic group         Median
Interval   Equality of intervals       General linear group   Mean, standard deviation
Ratio      Equality of ratios          Similarity group       Coefficient of variation

2.3 Data and geometry

Consider one of the examples in the Box and Cox paper on transformations, [14] which
is discussed below.
Example 2 Table 1 in [14] shows survival times of animals in a 3 × 4 factorial experiment
whose factors are types of poisons and treatments. The data are in units of 10
hours and are recorded to two decimal places. The paper proposes that the analysis
is much more straightforward when the response is on the reciprocal scale, t_{ij}^{-1}, with
an interpretation as the 'rate of dying'. Furthermore, we note that in either choice, t_{ij}
or t_{ij}^{-1}, there is a constraint that observations have to be strictly positive—hence the
underlying geometry has a boundary so it is not strictly a Euclidean space. There is
also an intrinsic level of precision due to the recording system.
In general choices of scale and details of the measurement system are associated with
implied geometric structure. It is therefore important to understand how conclusions
of the analysis depend on such choices. Understanding what is invariant under choices
of representation is at the very heart of geometric thought. Indeed Klein, in his famous
1892 Erlangen program, defines geometry as the study of invariants of groups of
transformations, [15].
Data, and the sample space, can carry geometric structure intrinsic to the way they are
defined. Hand writes in [16]: 'The term 'data' includes the numerical values recorded
and also the meaning of the numbers and the variables—the context in which the data
arose'; this matters even more fundamentally from the point of view of using Statistics
to answer questions. Hand also notes: 'a measurement scale can constrain the sort of
questions which it is sensible to ask of a particular set of data—and hence can limit
the scientific questions which can be posed'.
In 1946 Stevens categorised data into classes by scale: nominal, ordinal, interval,
and ratio, [17]. He was clearly viewing these in geometric terms since, as part of the
definition, there are associated groups of allowed transformations and hence associated
geometries. Table 1 below is a summary of [17, Table 1] and uses the terminology
therein.
The list in Table 1 is not exhaustive and other geometric structures are possible,
particularly in the multivariate case. An interesting example is compositional data,
see [18, 19]. The data are vectors of non-negative components which sum to one; for
example, in Economics the budget shares of household costs are the data points.
Such data have a natural closed simplicial geometric structure. In these papers Aitchison
makes the same point as Stevens about the suitability of the data analytic tools used:

Table 2 Initial and final ratings on disability of 121 stroke patients

Initial/final    A    B    C    D    E
E               11   23   12   15    5
D                9   10    4    1    0
C                6    4    4    0    0
B                4    5    0    0    0
A                5    0    0    0    0

'almost a century ago Pearson (1897) warned us to beware of naive interpretations of
correlations of his product-moment correlation for compositional data'.
Count data in categorical data analysis, [20], can also be seen to lie within a closed
simplex after conditioning on the sample size. If, however, the sample size is considered
random then a product Poisson model might be a suitable tool and the sample space
would be a closed cone, [21].
Understanding the support set of a random variable is an essential specification of
the geometry of the sample space and a vital part of any data analysis. This is perhaps
clearest in categorical data analysis, [20]. A count of zero in a particular category can
have two quite distinct interpretations. First, it is zero in the sample we have but need
not be zero in all possible samples. Alternatively, it must be zero in all cases.
The first is called a sampling zero, while the second is a structural zero. The geometry
of these cases is quite different as discussed in [22, 23].

Example 3 An early example of such problems can be found in [24]. Table 2 shows
data on patients in a hospital rated in increasing severity according to their physical
disability following a stroke on a five point scale. The paper states: ‘as none of the
patients had a second stroke their score on the second examination could only be the
same or better than on the original examination’. It is for this contextual reason that
there are structural zeros in this table.

Having seen that meaning and context can define geometrical structures on a sample
space we end this subsection by considering transformations of these structures and
hence transformations of the Geometry.
In the context of normal linear regression models Box and Cox, [14], point out that
these models have four key assumptions which enable them to be useful in applied
analysis: (i) a simple linear structure for the systematic part of the model
E(Y|X); (ii) constancy of error variance; (iii) normality; and (iv) independence of
observations. They show that often the first three properties can be jointly improved
by a choice of a non-linear transformation of the response, Y, of the form

$$ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \lambda \neq 0, \\[4pt] \log y & \lambda = 0. \end{cases} \tag{1} $$

Therefore on the sample space there is a choice of representation and hence geometry.
In this representation λ is also selected to make results interpretable as is shown in
Example 2 above.
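
As an illustrative sketch, SciPy implements maximum likelihood selection of λ over the family in Expression (1) via scipy.stats.boxcox; the positive 'response' data below are simulated and hypothetical, not those of [14].

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
t = rng.gamma(shape=2.0, scale=0.3, size=48)   # hypothetical positive response data

# scipy chooses lambda by maximising the profile log-likelihood over family (1)
y_transformed, lam_hat = stats.boxcox(t)
print(f"estimated lambda: {lam_hat:.2f}")

# The transformation itself, matching Expression (1)
def box_cox(y, lam):
    return np.log(y) if lam == 0 else (y**lam - 1) / lam
```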


Other well-known examples of transformations of sample space come from multivariate
analysis, [25]. Many of the methods in this area work under the assumption of
Euclidean geometry and include: principal component, factor, canonical correlation,
cluster and discriminant analysis.

2.4 Complementary dual geometries

One of the defining features of IG is the interplay between different geometries. The
leading example of this in Applied Statistics is that of the two commonly used ‘dual’
parameterizations associated with exponential families. We write an exponential fam-
ily in the general form as
$$ f(x|\theta) = \nu(x)\,\exp\left( \sum_{i=1}^{p} s_i(x)\,\theta_i - \psi(\theta) \right), \tag{2} $$

where ν(x) is a positive measure and (s_1(x), ..., s_p(x)) is the vector of canonical
statistics. In applied work these are, as a class, extremely important due to their excel-
lent inferential properties and applicability.
The geometry of exponential families, [26, 27], can be understood through the
relationship between the dual parameters θ and μ := (E_θ[s_1(X)], ..., E_θ[s_p(X)]),
which are defined up to an affine transformation. Both can be usefully thought of as
having affine structure in their own right, giving two 'dual' affine geometries. The
strongly related exponential dispersion families inherit this 'dual' parameter system
and, of course, form the workhorse of Applied Statistics through generalised linear
models, [28].
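
As a small numerical sketch of this duality, take the Poisson family as an assumed example: in canonical form the cumulant function is ψ(θ) = exp(θ), so the dual (mean) parameter is μ = ψ′(θ) = exp(θ), which should equal E_θ[s(X)] for the canonical statistic s(x) = x.

```python
import numpy as np
from scipy.special import gammaln

# Poisson in canonical form: log f(x|theta) = x*theta - psi(theta) - log(x!),
# with cumulant function psi(theta) = exp(theta)
theta = 0.7
mu_dual = np.exp(theta)        # dual (mean) parameter: psi'(theta) = exp(theta)

xs = np.arange(0, 60)          # truncation of the countable support; ample here
pmf = np.exp(xs * theta - np.exp(theta) - gammaln(xs + 1))

# Check mu = E_theta[s(X)] with s(x) = x, by direct summation of the pmf
print(np.isclose(pmf.sum(), 1.0), np.isclose(pmf @ xs, mu_dual))   # True True
```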

2.5 Distances

Distance and distance-like functions have always played a major role in statistical
analysis. For example, in the classroom we often explain that OLS minimises
a goodness-of-fit based 'distance' from the data to a low-dimensional linear model.
This idea of measuring the 'distance' between a representation of the data and a
model is seen in many of the following key historical examples: Pearson's χ² statistic
based on a sum of squares, the Mahalanobis distance, [29], the information theory
based Kullback–Leibler divergence, [30], and the Hellinger distance, [31]. In terms of
applied statistical practice, for example in model selection in linear modelling, the
Akaike Information Criterion (AIC), [32], is one of the most important applications of
divergence functions, see Definition 5.
Cressie and Read [33] point out that many important goodness-of-fit statistics can
be written in a way that contrasts observed frequencies X_i with expected frequencies
μ_i across k categories via

$$ 2n\,I^{\lambda}(X; \mu) = \frac{2}{\lambda(\lambda+1)} \sum_{i=1}^{k} X_i \left[ \left( \frac{X_i}{\mu_i} \right)^{\lambda} - 1 \right], \tag{3} $$


here using the notation of [21]. In this representation λ ∈ ℝ: the case λ = 1 gives
Pearson's χ², while λ = 0—after taking limits—gives the log-likelihood ratio statistic.
The family also includes the Freeman-Tukey and Neyman's χ² statistics.
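
SciPy exposes exactly this family through scipy.stats.power_divergence, whose lambda_ argument is the λ of Expression (3). A minimal sketch, with hypothetical counts:

```python
from scipy.stats import power_divergence

observed = [28, 52, 40, 30]           # hypothetical counts across k = 4 categories
expected = [37.5, 37.5, 37.5, 37.5]   # equal-probability null

# lambda_ selects the member of family (3); lambda_ = 0 is handled as the limit
for lam, name in [(1, "Pearson chi^2"), (0, "log-likelihood ratio"),
                  (-0.5, "Freeman-Tukey"), (-2, "Neyman chi^2")]:
    stat, pval = power_divergence(observed, f_exp=expected, lambda_=lam)
    print(f"{name:22s} statistic = {stat:.3f}, p = {pval:.3f}")
```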

3 Information geometry and statistics

Section 2 has shown that, even before the notional birth of Information Geometry
50 years ago, Geometry has played a key role in Statistics. It is helpful to separate
foundational geometric ideas from more technical geometric tools. In the first group
we include ideas such as linearity, invariance, projection, orthogonality, distance, con-
vexity, shape and visualisation. These have always been in the statistician’s toolbox.
In the second group we have tensors, manifolds, tangent spaces, affine connections,
divergence and contrast functions, polytopes, cones and convex sets. It could be argued
that after Čensov’s paper we see more applications of these powerful tools comple-
menting the foundational geometric thinking that has always been important. It is the
combination of the two that resulted in Information Geometry as we now understand
it.
We will not, in this paper, attempt to give a detailed overview of all of the develop-
ment of IG. However the following list provides a rich source of information for the
interested reader: [13, 21, 26, 34–43]. Rather than attempting to be comprehensive,
we highlight some of the key features of the structure of IG that have, and continue
to have, impact in Statistics. In particular for details of the key structures of IG, see
Appendix Definition 3.
From the point of view of parametric statistical modelling, IG combines Rao's
observation that, under regularity, parametric models can be viewed as Riemannian
manifolds with Efron's definition of statistical curvature, [44], which showed
the possibility of other geometries on parametric models; the combination of these
ideas resulted in Amari's dual affine geometries, [13].
structures can be found in [45], with key definitions found in the Appendix.
The most immediate impact of the formal development of IG in Statistics was in
the area of likelihood based asymptotics, [46–51]. In the next section Case Study 4
discusses an example of the importance of this.
This development of IG coincided with another change which revolutionised Statis-
tics and gave rise to what we now call Data Science. This change was an exponential
growth in computing power. We consider these changes and the role geometry has in
modern Statistics in the next section.

4 The last 50 years: case studies

Statistics, as a discipline, has changed dramatically during our 50 year window. Gelman
and Vehtari's recent paper, [52], lists eight ideas that they consider the most
important developments in this period: Overparameterized Models and Regularization,
Exploratory Data Analysis, Robust Inference, Bootstrapping, Bayesian Multilevel
Models, Generic Computation Algorithms, Causal Inference, and Adaptive Decision
Analysis.
As we have said one of the most important drivers of change has been the exponential
growth of computing power leading to both new solutions to old problems and, with
the growth of big-data, new problems. This results in a complex relationship between
Statistics, Machine Learning and Data Science which has developed in the 50 year
timeframe, see [53] for a review.
In this section we select topics from Gelman and Vehtari’s list and show where
geometry has made an impact. We make no attempt to be comprehensive, rather we
use a set of case studies where geometry and geometric thinking are shown to be as
important today as they have been in the past.
The first two case studies share a fundamental geometric theme, that of the impor-
tance of geometric intuition. However this needs care. In Case Study 1 intuition about
a Pythagorean relationship helps motivate and explain one of the most surprising and
powerful results in statistical theory in the last 50 years. However in Case Study 2 we
see that our 3-dimensional trained intuition needs to be retrained when working in the
high-dimensional context of modern Data Science.

Case Study 1 In Efron’s introduction to the reprint of James and Stein’s (1961) paper,
[54], he describes the main result as ‘the most striking theorem of post-war mathemat-
ical statistics’. This result was the inadmissibility of the unbiased estimator μ̂ S = X
as an estimator of μ ∈ Rk when X ∼ N(μ, Ik×k ) for k ≥ 3. This follows since the
James-Stein estimator
 
k−2

μ JS
:= 1 − X
X 2

has smaller risk. Efron goes on to say the ‘result was, and sometimes still is, considered
paradoxical.’
We can link this result to ‘Regularisation’ on Gelman and Vehtari’s list since, among
other things, it started statisticians considering shrinkage estimators and allowing bias
in estimation when it reduces risk. There are interesting geometric aspects to many
regularization methods, for example [55, Figures 2 and 3], but here we focus on the
geometry of the James-Stein estimate itself. In this we follow the approach of [5] and
[56] who show that simple geometry, applied in the right way, gives real insight into
why the method works.
Beran writes, in [5], that despite claims about the result's paradoxical nature, for
James and Stein it is the 'asymptotic geometry of quadratic loss in high dimensions that
makes Stein estimation transparent'. In essence the intuition comes from noting that
the normalised quadratic loss for an estimate μ̂ is (1/k)‖μ̂ − μ‖² and the normalised risk
is (1/k)E‖μ̂ − μ‖². Assuming that lim_{k→∞} ‖μ‖²/k = a² < ∞ then we have, approximately
for large k,

$$ \frac{\sum_{i=1}^{k} \mu_i^2}{k} \approx a^2, \qquad \frac{\sum_{i=1}^{k} (X_i - \mu_i)^2}{k} \approx 1, \qquad \frac{\sum_{i=1}^{k} X_i^2}{k} \approx 1 + a^2. $$


Fig. 1 The asymptotic geometry of the James-Stein estimator

This gives, in a Euclidean asymptotic geometry, the Pythagorean triangle shown in
Figure 1, with vertices O, A and B where

$$ \|OA\|^2 = a^2, \qquad \|AB\|^2 = 1, \qquad \|OB\|^2 = 1 + a^2. $$

It is then geometrically clear that risk can be reduced by shrinking the point B,
the unbiased estimate, towards O by the shrinkage of B to C. We are not, of course,
claiming that the Pythagorean geometry of Figure 1 is a proof. Rather this example
shows the importance of geometric insight which can turn a highly complex and
surprising result into something ‘transparent’, [5].
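
The risk reduction is also easy to see by simulation. The sketch below uses a hypothetical true mean vector and, for numerical stability, the positive-part variant of the shrinkage factor; the normalised risk of the unbiased estimator is close to 1 while the James-Stein risk is markedly smaller.

```python
import numpy as np

rng = np.random.default_rng(2)
k, reps = 50, 2000
mu = rng.normal(size=k)                 # a fixed, hypothetical true mean vector

X = mu + rng.normal(size=(reps, k))     # X ~ N(mu, I) in each replication
norms2 = np.sum(X**2, axis=1)
shrink = np.maximum(0.0, 1 - (k - 2) / norms2)   # positive-part James-Stein factor
mu_js = shrink[:, None] * X

risk_mle = np.mean(np.sum((X - mu)**2, axis=1)) / k
risk_js = np.mean(np.sum((mu_js - mu)**2, axis=1)) / k
print(f"normalised risk: unbiased = {risk_mle:.3f}, James-Stein = {risk_js:.3f}")
```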

One of Tukey’s most important contributions, around 50 years ago, to the devel-
opment of Statistics is his emphasis on the importance of Exploratory Data Analysis
(EDA), [57]. The list from [52] includes EDA as one of the most important develop-
ments in the time period under consideration.
The advent of fast and cheap computing has allowed EDA to be truly exploratory
allowing visualisation and exploration of data in real time. It has also given rise to
new forms of EDA including so-called Topological Data Analysis, [58]. In that paper
Wasserman argues that geometric concepts such as shape and connectivity are impor-
tant tools.
In the next case study, though, we review the geometric consequences of the fact that
data can now be extremely high dimensional. In Case Study 1 the emphasis was on the
intuitive power of geometric thought. However, the geometry of higher dimensions
can be quite different from the intuition provided by our experience with lower dimensions.
While a sample of standard normally distributed points in ℝ^k fills a ball centred at the
origin, with density decaying exponentially in the squared distance, it is also true that
for large k most of the data is contained in a narrow shell at squared distance
approximately k from the origin. This can be seen statistically by considering the χ²_k
distribution of the squared distances from the origin, and geometrically by the fact that
most of the volume of a ball in ℝ^k is contained in a narrow shell at its boundary.
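
A short simulation makes the shell phenomenon concrete: for increasing k it estimates the fraction of standard normal points whose distance from the origin lies within 10% of √k.

```python
import numpy as np

rng = np.random.default_rng(3)
for k in (2, 10, 100, 1000):
    X = rng.normal(size=(10_000, k))   # standard normal sample in R^k
    r2 = np.sum(X**2, axis=1)          # squared distances ~ chi^2_k
    # distance within 10% of sqrt(k)  <=>  r2 in (0.81 k, 1.21 k)
    in_shell = np.mean((r2 > 0.81 * k) & (r2 < 1.21 * k))
    print(f"k = {k:5d}: fraction in the shell = {in_shell:.3f}")
```

For k = 2 the fraction is small, while for k = 1000 essentially all of the sample lies in the narrow shell.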

Case Study 2 As argued by Donoho and Tanner, [59], the high dimensional geometry of
clusters of points can be highly non-intuitive. In particular it can exhibit phase-change
behaviour which has important consequences in EDA, linear modelling and signal

processing. These phase-change results are connected to results from high dimensional
Combinatorial Geometry.
Consider the case of X_1, ..., X_n independent N(0, I_{d×d}) random variables, for
large n and d but where d/n is a fixed fraction. In this case, typically, all points
are extreme, lying on the boundary of their convex hull. Furthermore, line segments
joining any pair of points also lie only on the boundary of the convex hull and do not
intersect the relative interior. If we generalise this to faces defined by sets of size k
there is a sudden phase-change as k passes a threshold determined by n and d. Below
the threshold the faces typically do not intersect the interior of the convex hull, while
just above the threshold they typically do.
This high-dimensional geometric property is important in EDA, model selection,
robustness and compressed sensing in high dimensions. For example, consider model
selection in sparse regression problems determined by (k, p, n), where there are k
'useful' predictors out of a total of p, and there are n observations. In the sparse
case we have k < n < p. Convex geometry predicts a phase change as k passes
a geometrically determined threshold, where we can have very good model selection
just below the threshold but very poor selection just above it. The geometric threshold
determines when regression can be a powerful tool and when it will almost certainly
fail.

The next two case studies are linked by a simple geometric object, the Influence
function, see Definition 4. This can be seen as a functional derivative defined relative
to the (−1)-affine structure, see Definition 1. These two case studies each lie in one
of Gelman and Vehtari’s important developments: Robustness and Bootstrapping.
As discussed above one of Fisher’s motivations for using the maximum likelihood
estimate was that it could be optimally efficient as measured by the Fisher information.
However ideas of robustness developed by Tukey, Huber, Hampel and others—see [60]
for a historical perspective—caused a major shock to the discipline by showing that
the results of many ‘optimal’ procedures are highly sensitive to ‘small’ changes in
assumptions. The following case study illustrates the importance of IG tools.

Case Study 3 The paper [61] considers the trade-off between efficiency and robustness
in parametric estimation. The Hellinger divergence between models p(x) and q(x)
with support S which, for simplicity, is assumed to be a countable set, is defined by

$$ D(p, q) := \frac{1}{2} \sum_{x \in S} \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2. $$

Further, consider a parametric model p(x|β) and the empirical distribution d(x); then
the minimum Hellinger divergence estimate is defined to be
β̂_H := arg min_β D(p(x|β), d(x)). Lindsay argues that minimum Hellinger divergence
estimation has excellent properties in terms of both efficiency and robustness.
The paper [61] shows that the minimum Hellinger divergence estimate and the
maximum likelihood estimate have, to first order, the same information matrix and
the same influence curves, and this implies that they should share the same trade-off
between efficiency and robustness. Indeed this is true of all of the divergences
defined by Cressie and Read in Expression (3). However the paper goes on to argue
that the influence function, being an infinitesimal derivative, only captures first order
behaviour and ignores the second, and higher, order curvature properties. Lindsay
argues that the residual adjustment formula, [61, Page 1086], captures more of the
underlying geometry and shows that the Hellinger based estimator has considerably
better robustness properties than the MLE.
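
A minimal sketch of minimum Hellinger divergence estimation, assuming a Poisson model and hypothetical data contaminated by a single gross outlier; the MLE is dragged by the outlier while the Hellinger estimate is barely moved.

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
data = np.concatenate([rng.poisson(2.0, size=100), [25]])   # one gross outlier
support = np.arange(data.max() + 30)                        # truncated countable support
d = np.bincount(data, minlength=len(support)) / len(data)   # empirical pmf d(x)

def hellinger(beta):
    p = poisson.pmf(support, beta)
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(d))**2)

beta_H = minimize_scalar(hellinger, bounds=(0.1, 10), method="bounded").x
beta_MLE = data.mean()                                      # MLE for the Poisson mean
print(f"MLE = {beta_MLE:.2f}, minimum Hellinger = {beta_H:.2f}")
```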
Computer intensive methods for inference, with the Jackknife and the Bootstrap
being key examples, have had a major impact in the last 50 years. Given that their
computational costs are now very often negligible and that inference can be done on
many different linear and non-linear functionals of distributions it is not surprising
that they are very popular. The higher-order asymptotic justification of their good
performance can be explained with IG tools.
Case Study 4 The derivative of a functional such as a mean, median or variance of a
distribution—that is, the influence function—plays an important role in understanding
the properties of Efron's Bootstrap. The IG links with Jackknifing, in terms of its
asymptotic behaviour, were explored in Amari's 1985 monograph, [13, Section 5.6].
Furthermore, the geometric links between the infinitesimal Jackknife, the Delta method
and the influence function are well-known, see Efron's 1982 book [62] and [63].

Aside from its practical utility and computational simplicity, a very attractive feature
of some versions of the Bootstrap, such as the Bootstrap-t and accelerated methods,
is their excellent higher-order asymptotic behaviour. This is described in [64]. Section
9 of that paper explains the links with the IG influenced theory of higher-order
likelihood inference found in [46–49]. DiCiccio and Efron emphasise the strong connections
between the geometry and the higher order performance of the Bootstrap
in the exponential family case, but for other families—such as curved exponential
families—it might be more appropriate to study conditional sampling behaviour, [65,
66]. We also note the utility of using cumulant based expansions due to their tensorial
transformation properties, [67]. Also, for more general links with non-parametric
methods, the geometric ideas behind Stein's least favourable direction, [68], and the
Bootstrap are of great interest.
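
To make the link concrete, the following sketch compares, for the mean of hypothetical data, the standard error implied by the influence function (the delta method, or infinitesimal jackknife) with a nonparametric bootstrap standard error; to first order the two agree.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(size=60)   # hypothetical data
n = len(x)

# Influence function of the mean at the empirical distribution: IF(x_i) = x_i - xbar
infl = x - x.mean()
se_delta = np.sqrt(np.sum(infl**2)) / n    # delta-method / infinitesimal jackknife SE

# Nonparametric bootstrap SE of the mean
boot_means = np.array([rng.choice(x, size=n, replace=True).mean()
                       for _ in range(5000)])
se_boot = boot_means.std()

print(f"influence-function SE = {se_delta:.4f}, bootstrap SE = {se_boot:.4f}")
```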
We finish this non-systematic and highly selective tour with a combination of two
of Gelman and Vehtari’s highlighted areas: Bayesian Multilevel Models and Generic
Computation Algorithms.
The advent of cheap computing opened the door for the application of ideas from
Bayesian statistics through Markov Chain Monte Carlo (MCMC). One of the reasons
that this made such a big impact in Applied Statistics was the way that accessible
software, such as BUGS, [69], allowed practitioners to easily use these methods with
a very shallow learning curve. The following case study looks at the next generation
of these developments, Stan and Hamiltonian MCMC, which have strong geometric
foundations.
Case Study 5 Stan, [70], named after Stanislaw Ulam, is a programming language
which implements gradient-based MCMC including the methods of Hamiltonian
Monte Carlo (HMC). For a general conceptual introduction to HMC see [71]. Here
we focus on HMC’s differential geometric links, see [72] for more detail.

In general an implementation of an MCMC algorithm tries to generate a Markov
chain which explores all of the high probability regions of its domain without being
too localised and hence inefficient. The trade-offs needed for a basic Metropolis-
Hastings algorithm are well-known to users. If the proposals make jumps which are too
large they move outside the high probability region and are rejected, while too small
jumps are inefficient. In some sense what is needed is a tool which can move freely
around the high probability regions to act as a ‘proposal’ distribution. Understanding
measure preserving maps from the parameter space to itself would therefore be very
useful. As a simple example imagine a posterior which is spherically symmetric. Being
allowed to flow quickly around the spheres of constant density would be a good way
to efficiently explore high probability regions.
HMC takes its inspiration from Physics and Hamiltonian dynamics, where the phase
space of a system describes its (generalized) positions and (generalized) momenta.
In this setting the Hamiltonian defines flows in phase space which correspond to
the time evolution of the system. This idea has been generalised to abstract symplectic
manifolds. These are even-dimensional—due to needing both position and
momentum—and have a closed symplectic form representing the Hamiltonian dynamics.
These flows can be constructed to be measure preserving, hence they have the
desired properties discussed above. As a practical point, since the geometry of symplectic
manifolds is extremely well studied in Physics, [70] states that 'the numerical
solution of Hamiltonian flow is a well-researched subject and many efficient integrators
are available.' Thus there is a form of technology transfer to the advantage of
HMC implementations.
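
As an illustration of the mechanics only (a sketch, not Stan's implementation), the following code runs HMC on an assumed standard normal target on ℝ². The leapfrog integrator approximates the Hamiltonian flow, and a Metropolis correction accounts for the discretisation error; the exact flow would conserve the Hamiltonian and every proposal would be accepted.

```python
import numpy as np

rng = np.random.default_rng(6)

def grad_neg_log_post(q):      # assumed target: pi(q) = N(0, I) on R^2
    return q                   # gradient of the potential U(q) = 0.5 * q.q

def hmc_step(q, eps=0.1, L=20):
    # One HMC transition: resample momentum, integrate with leapfrog,
    # then Metropolis accept/reject to correct the discretisation error.
    p = rng.normal(size=q.shape)
    q_new, p_new = q.copy(), p.copy()
    for _ in range(L):                       # leapfrog: volume preserving
        p_new -= 0.5 * eps * grad_neg_log_post(q_new)
        q_new += eps * p_new
        p_new -= 0.5 * eps * grad_neg_log_post(q_new)
    H_old = 0.5 * q @ q + 0.5 * p @ p                  # Hamiltonian = U + kinetic
    H_new = 0.5 * q_new @ q_new + 0.5 * p_new @ p_new
    return q_new if np.log(rng.uniform()) < H_old - H_new else q

q, samples = np.zeros(2), []
for _ in range(2000):
    q = hmc_step(q)
    samples.append(q)
print(np.mean(samples, axis=0), np.std(samples, axis=0))   # approx [0 0] and [1 1]
```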

5 Discussion

As we have seen in general the relationship between Geometry and Statistics has been
very productive. However there are cases where a geometer and statistician would
have very different reflexes.
Consider, for example, the common statistical practice of separating the parameters
of a model into ‘interest’ and ‘nuisance’. This does not have an equivalent notion in
pure differential geometry. Parameters in a geometric sense are merely labels which
are an arbitrary way of describing a geometric object. Parameters do not always have
an interpretation of course. Breiman, [73], argues for the importance of the predictive
power of a statistical model and, in this context, rejects any direct interpretation.
However interpretation is relevant when we are using, in the terminology of [74],
substantive models which are connected directly with subject matter considerations.
In this case it is common that a parameter has a real world interpretation, for example
a population mean or an odds ratio. This meaning can even exist across different
models for the same statistical problem. Cox emphasizes in [75] the importance of
parameters ‘keeping their physical interpretation’ when comparing models. This is
again noticeably different from the pure geometrical view of a parameter.
We also note that not all statistical models are Riemannian manifolds. There are
examples of very simple models, two component mixtures of Poisson distributions
for example, where the Fisher information is infinite, [76]. Furthermore, there are

many cases of useful models which do not have a constant dimension. This includes
mixture models but also closures of exponential families, [26, 77–80]. An example of
the importance of boundaries is the way that they determine the limits of asymptotic
approximations, [81].

Appendix A: Definitions

In this section we aim to give intuition more than precision; for more detail please
see the cited references.

Definition 1 (a) An affine space (X, V, +) consists of a set, X, a vector space, V, and
a translation operator +, which for each v ∈ V defines a map +_v : x ↦ x + v
such that there is a unique translation between any pair of points in X.
(b) An N-dimensional Euclidean space is an affine space (ℝ^N, V, +) whose vector
space, V, has an inner product ⟨·, ·⟩, [37, Section 3.2].
(c) There are two important affine spaces in IG: the exponential, or (+1), and the mixture,
or (−1). Murray and Rice [37] define the (+1)-affine structure on the space of
positive measures, while [82] defines a (−1)-affine structure on the space of unit
measures on a given set.
(d) The intersection of positive and unit measures is, of course, the set of probability
measures, thus this space inherits both affine structures. However, we note boundaries,
where either positivity (−1) or finiteness (+1) fails, which are important in
understanding the underlying geometry of 'distribution space'.

Definition 2 (a) A smooth manifold, M, of dimension p is locally diffeomorphic to
an open subset of ℝ^p.
(b) A parametric model, M := {f(x|θ) | θ ∈ Θ}, lies inside the distribution space
which inherits the affine structures of Definition 1. Under regularity it is then a
smooth sub-manifold. For example the set {log f(x|θ) | θ ∈ Θ} lies in the (+1)-affine
space as a smooth sub-manifold.
(c) The tangent space, or local linearisation, to M at θ, denoted by TM_θ, is then the
affine subspace spanned by

$$ \left\{ \frac{\partial}{\partial \theta_i} \log f(X;\theta) \;;\; i = 1, \ldots, p \right\}. $$

(d) The tangent space TM_θ has a metric structure defined by the Fisher information
via the p × p matrix with ij component

$$ \mathrm{Cov}\left( \frac{\partial}{\partial \theta_i} \log f(X;\theta),\; \frac{\partial}{\partial \theta_j} \log f(X;\theta) \right) = E\left[ -\frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \log f(X;\theta) \right]. \tag{A1} $$

Definition 3 The affine structures defined above can also be characterized by the
differential geometric tool of an affine connection ∇, [40, p. 17]. There is a one dimensional
family of such connections defined by

$$ \nabla^{(\alpha)} = \frac{1+\alpha}{2}\,\nabla^{(+1)} + \frac{1-\alpha}{2}\,\nabla^{(-1)}, \tag{A2} $$

[40, p. 33], for α ∈ ℝ. Here the α = ±1 connections agree with the affine structures
defined in Definition 1. The α = 0 connection is also of interest since it is the Levi-Civita
connection [37, p. 115] associated with the Fisher information.
The relationship between dual connections and the metric is encoded in the duality
relationship

$$ X\langle Y, Z \rangle_F = \langle \nabla^{(\alpha)}_X Y,\, Z \rangle_F + \langle Y,\, \nabla^{(-\alpha)}_X Z \rangle_F, \tag{A3} $$

where X, Y and Z are smooth vector fields, and ⟨·, ·⟩_F corresponds to the Fisher
information, [40, p. 51]. From this relationship we have two fundamental results: the
dual flatness theorem, [13, Thm 3.2, p. 72], and the Pythagoras theorem, [13, Thm
3.9, p. 91].

Definition 4 Consider a sample space X, and let δ_x(·) be the singular distribution with
unit mass at x ∈ X. If T is a functional on the space of distributions over X then its
influence function is

$$ IF_F(x) := \lim_{\epsilon \to 0} \frac{T\left((1-\epsilon)F + \epsilon\,\delta_x\right) - T(F)}{\epsilon}. \tag{A4} $$

This is a derivative of the functional T evaluated at F along a line segment in the mixture
affine space.
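
A direct numerical check of (A4): for T the mean, moving along the mixture line segment towards δ_x gives IF_F(x) = x − E_F[X]. The sketch below uses a small, hypothetical empirical distribution F.

```python
import numpy as np

F = np.array([1.0, 2.0, 6.0])            # atoms of an empirical F (equal weights)
T = lambda probs, atoms: probs @ atoms   # the mean as a functional of a distribution

def influence(x, eps=1e-6):
    # Contaminate F with a point mass at x and differentiate along the
    # mixture line segment (1 - eps) F + eps delta_x, as in (A4)
    atoms = np.append(F, x)
    probs_F = np.append(np.full(len(F), 1 / len(F)), 0.0)
    delta_x = np.append(np.zeros(len(F)), 1.0)
    mixed = (1 - eps) * probs_F + eps * delta_x
    return (T(mixed, atoms) - T(probs_F, atoms)) / eps

for x in (0.0, 3.0, 10.0):
    print(x, influence(x), x - F.mean())  # the last two agree: IF(x) = x - E_F[X]
```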

Definition 5 On a model M := {f(x|θ) | θ ∈ Θ} a divergence function D : M × M → ℝ
is a smooth function with the properties: (i) D(θ_1, θ_2) ≥ 0 for all θ_1, θ_2; (ii)
D(θ_1, θ_2) = 0 if and only if θ_1 = θ_2; (iii) D is locally quadratic and agrees with the
Fisher information, [21].
Acknowledgements I would like to thank Qingyuan Zhao for information on the background of Fisher’s
work and Frank Critchley for many helpful comments as the paper was prepared.

Data availability All data generated or analysed during this study are included in this published article.

Declarations
Conflict of interest The author is on the Editorial Board of Information Geometry. The author states there
are no other conflicts of interest.

References
1. Čensov, N.N.: Statistical decision rules and optimal inference. Transl. Math. Monogr. 53 (1972)
2. Plackett, R.L.: Studies in the history of probability and statistics. XXIX: The discovery of the method of
least squares. Biometrika 59(2), 239–251 (1972)
3. Stigler, S.M.: Gauss and the invention of least squares. Ann. Stat., 465–474 (1981)


4. Taylor, J.: The geometry of least squares in the 21st century. Bernoulli 19(4), 1449–1464 (2013)
5. Beran, R.: The unbearable transparency of Stein estimation. In: Nonparametrics and Robustness in
Modern Statistical Inference and Time Series Analysis, p. 25 (2010)
6. Box, G.E.: Science and statistics. J. Am. Stat. Assoc. 71(356), 791–799 (1976)
7. Box, J.F.: R.A. Fisher and the design of experiments, 1922–1926. Am. Stat. 34(1), 1–7 (1980)
8. Fisher, R.A.: Frequency distribution of the values of the correlation coefficient in samples from an
indefinitely large population. Biometrika 10(4), 507–521 (1915)
9. Hall, N.S.: R.A. Fisher and his advocacy of randomization. J. Hist. Biol. 40(2), 295–325 (2007)
10. Aldrich, J.: RA Fisher and the making of maximum likelihood 1912–1922. Stat. Sci. 12(3), 162–176
(1997)
11. Fisher, R.A.: On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond.
Ser. A 222(594–604), 309–368 (1922)
12. Rao, C.R.: Information and the accuracy attainable in the estimation of statistical parameters. In:
Breakthroughs in Statistics, pp. 235–247. Springer, London (1992)
13. Amari, S.: Differential Geometric Methods in Statistics. Lect. Notes Stat. Springer, Berlin (1985)
14. Box, G.E., Cox, D.R.: An analysis of transformations. J. R. Stat. Soc. Ser. B (Methodological) 26(2),
211–243 (1964)
15. Klein, F.: A comparative review of recent researches in geometry. Bull. Am. Math. Soc. 2(10), 215–249
(1893)
16. Hand, D.J.: Deconstructing statistical questions. J. R. Stat. Soc. Ser. A (Statistics in Society) 157(3),
317–338 (1994)
17. Stevens, S.S.: On the theory of scales of measurement. Science 103(2684), 677–680 (1946)
18. Aitchison, J.: The statistical analysis of compositional data. J. R. Stat. Soc. Ser. B (Methodological)
44(2), 139–160 (1982)
19. Aitchison, J.: Principles of compositional data analysis. Lecture Notes-Monograph Series, pp. 73–81
(1994)
20. Agresti, A.: Categorical Data Analysis. Wiley, London (2003)
21. Kass, R.E., Vos, P.W.: Geometrical Foundations of Asymptotic Inference. Wiley, London (2011)
22. Geyer, C.J.: Likelihood inference in exponential families and directions of recession. Electron. J. Stat.
3, 259–289 (2009)
23. Rinaldo, A., Fienberg, S.E., Zhou, Y.: On the geometry of discrete exponential families with application
to exponential random graph models. Electron. J. Stat. 3, 446–484 (2009)
24. Bishop, Y.M., Fienberg, S.E.: Incomplete two-dimensional contingency tables. Biometrics, 119–128
(1969)
25. Kent, M., Bibby, J., Mardia, K.: Multivariate Analysis, Probability and Mathematical Statistics. Else-
vier, Oxford (2006)
26. Barndorff-Nielsen, O.E.: Information and Exponential Families in Statistical Theory, p. 238. Wiley,
London (1978)
27. Efron, B.: The geometry of exponential families. Ann. Stat., 362–376 (1978)
28. Nelder, J.A., Wedderburn, R.W.: Generalized linear models. J. R. Stat. Soc. Ser. A (General) 135(3),
370–384 (1972)
29. Mahalanobis, P.C.: On the Generalized Distance in Statistics. National Institute of Science of India
(1936)
30. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
31. Hellinger, E.: Neue begründung der theorie quadratischer formen von unendlichvielen veränderlichen.
J. Die Reine Angew. Math. 1909(136), 210–271 (1909)
32. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19(6),
716–723 (1974)
33. Cressie, N., Read, T.R.: Multinomial goodness-of-fit tests. J. R. Stat. Soc. Ser. B (Methodol.) 46(3),
440–464 (1984)
34. Eguchi, S.: A differential geometric approach to statistical inference on the basis of contrast functionals.
Hiroshima Math. J. 15(2), 341–391 (1985)
35. Amari, S.-I., Barndorff-Nielsen, O.E., Kass, R., Lauritzen, S., Rao, C.: Differential geometry in statis-
tical inference. IMS Lecture Notes-Monograph Series, p. 240 (1987)
36. Dodson, C.T.: Geometrization of Statistical Theory: Proceedings of the GST Workshop, University of
Lancaster Department of Mathematics, 28–31 October 1987. ULDM Publications, London (1987)
37. Murray, M.K., Rice, J.W.: Differential Geometry and Statistics. Routledge, London (2017)


38. Marriott, P., Salmon, M.: Applications of Differential Geometry to Econometrics. Cambridge Univer-
sity Press, Cambridge (2000)
39. Marriott, P., Vos, P.: On the global geometry of parametric models and information recovery. Bernoulli
10(4), 639–649 (2004)
40. Amari, S.-I., Nagaoka, H.: Methods of Information Geometry, vol. 191. American Mathematical Soc,
New York (2007)
41. Vos, P.W., Marriott, P.: Geometry in statistics. Wiley Interdiscip. Rev. Comput. Stat. 2(6), 686–694
(2010)
42. Nielsen, F., Bhatia, R.: Matrix Information Geometry. Springer, New York (2013)
43. Nielsen, F.: Geometric Theory of Information. Springer, New York (2014)
44. Efron, B.: Defining the curvature of a statistical problem (with applications to second order efficiency).
Ann. Stat., 1189–1242 (1975)
45. Critchley, F., Marriott, P.: Information geometry and its applications: an overview. Comput. Inf. Geom.,
1–31 (2017)
46. Barndorff-Nielsen, O.E.: Inference on full or partial parameters based on the standardized signed log
likelihood ratio. Biometrika 73(2), 307–322 (1986)
47. Cox, D.R., Reid, N.: Parameter orthogonality and approximate conditional inference. J. R. Stat. Soc.
Ser. B (Methodol.) 49(1), 1–18 (1987)
48. Pierce, D.A., Peters, D.: Practical use of higher order asymptotics for multiparameter exponential
families. J. R. Stat. Soc. Ser. B (Methodol.) 54(3), 701–725 (1992)
49. McCullagh, P., Tibshirani, R.: A simple method for the adjustment of profile likelihoods. J. R. Stat.
Soc. Ser. B (Methodol.) 52(2), 325–344 (1990)
50. Barndorff-Nielsen, O., Blaesild, P.: Exponential models with affine dual foliations. Ann. Stat., 753–769
(1983)
51. Barndorff-Nielsen, O.E., Koudou, A.E.: Cuts in natural exponential families. Theory Probab. Appl.
40(2), 220–229 (1996)
52. Gelman, A., Vehtari, A.: What are the most important statistical ideas of the past 50 years? J. Am. Stat.
Assoc. 116(536), 2087–2097 (2021)
53. Donoho, D.: 50 years of data science. J. Comput. Gr. Stat. 26(4), 745–766 (2017)
54. James, W., Stein, C.: Estimation with quadratic loss. In: Breakthroughs in Statistics, pp. 443–460.
Springer, London (1992)
55. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.)
58(1), 267–288 (1996)
56. Brown, L.D., Zhao, L.H.: A geometrical explanation of Stein shrinkage. Stat. Sci. 27(1), 24–30 (2012)
57. Hoaglin, D.C.: John W. Tukey and data analysis. Stat. Sci., 311–318 (2003)
58. Wasserman, L.: Topological data analysis. Annu. Rev. Stat. Appl. 5, 501–532 (2018)
59. Donoho, D., Tanner, J.: Observed universality of phase transitions in high-dimensional geometry, with
implications for modern data analysis and signal processing. Philos. Trans. R. Soc. A Math. Phys. Eng.
Sci. 367(1906), 4273–4293 (2009)
60. Stigler, S.M.: The changing history of robustness. Am. Stat. 64(4), 277–281 (2010)
61. Lindsay, B.G.: Efficiency versus robustness: the case for minimum Hellinger distance and related
methods. Ann. Stat. 22(2), 1081–1114 (1994)
62. Efron, B.: The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, New York (1982)
63. Efron, B.: Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods.
Biometrika 68(3), 589–599 (1981)
64. DiCiccio, T.J., Efron, B.: Bootstrap confidence intervals. Stat. Sci. 11(3), 189–228 (1996)
65. Barndorff-Nielsen, O.E., Cox, D.R.: Asymptotic Techniques for Use in Statistics. Chapman and Hall,
London (1989)
66. Cox, D.R., Barndorff-Nielsen, O.E.: Inference and Asymptotics, vol. 52. CRC Press, London (1994)
67. McCullagh, P.: Tensor Methods in Statistics. Chapman and Hall/CRC, London (2018)
68. Stein, C., et al.: Efficient nonparametric testing and estimation. In: Proceedings of the Third Berkeley
Symposium on Mathematical Statistics and Probability, vol. 1, pp. 187–195 (1956)
69. Lunn, D., Jackson, C., Best, N., Thomas, A., Spiegelhalter, D.: The Bugs Book. A Practical Introduction
to Bayesian Analysis. Chapman Hall, London (2013)
70. Stan Development Team and others: Stan modeling language users guide and reference manual. Tech-
nical report (2016)
71. Betancourt, M.: A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv (2017)


72. Betancourt, M., Byrne, S., Livingstone, S., Girolami, M.: The geometric foundations of Hamiltonian
Monte Carlo. Bernoulli 23(4A), 2257–2298 (2017)
73. Breiman, L.: Statistical modeling: The two cultures (with comments and a rejoinder by the author).
Stat. Sci. 16(3), 199–231 (2001)
74. Cox, D.R.: Role of models in statistical analysis. Stat. Sci. 5(2), 169–174 (1990)
75. Cox, D.R.: Comment on ‘Assessment of local influence’ by R. D. Cook. J. R. Stat. Soc. Ser. B
(Methodol.), 133–169 (1986)
76. Li, P., Chen, J., Marriott, P.: Non-finite Fisher information and homogeneity: an em approach.
Biometrika 96(2), 411–426 (2009)
77. Brown, L.D.: Fundamentals of statistical exponential families with applications in statistical decision
theory. IMS Lecture Notes-monograph series (1986)
78. Lauritzen, S.L.: Graphical Models. Oxford University Press, Oxford (1996)
79. Csiszár, I., Matus, F.: Closures of exponential families. Ann. Probab. 33(2), 582–600 (2005)
80. Critchley, F., Marriott, P.: Computational information geometry in statistics: theory and practice.
Entropy 16, 2454–2471 (2014)
81. Anaya-Izquierdo, K., Critchley, F., Marriott, P.: When are first-order asymptotics adequate? a diagnos-
tic. Stat 3(1), 17–22 (2014)
82. Marriott, P.: On the local geometry of mixture models. Biometrika 89(1), 77–93 (2002)

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.

