Density Estimation Algorithms
E. G. TABAK
Courant Institute of Mathematical Sciences
AND
CRISTINA V. TURNER
FaMAF, Universidad Nacional de Córdoba
Abstract
A new methodology for density estimation is proposed. The methodology, which
builds on the one developed in [17], normalizes the data points through the
composition of simple maps. The parameters of each map are determined through
the maximization of a local quadratic approximation to the log-likelihood.
Various candidates for the elementary maps of each step are proposed; criteria for
choosing one include robustness, computational simplicity and good behavior in
high-dimensional settings. A good choice is that of localized radial expansions,
which depend on a single parameter: all the complexity of arbitrary, possibly
convoluted probability densities can be built through the composition of such
simple maps.
1 Introduction
A central problem in the analysis of data is density estimation: given a set
of independent observations x j , j = 1, . . . , m, estimate its underlying probability
distribution. This article is concerned with the case in which x is a continuous,
possibly multidimensional variable, typically in Rn , and its distribution is specified
by a probability density ρ(x). Among the many uses of density estimation are its
application to classification, clustering and dimensional reduction, as well as more
field-specific applications such as medical diagnosis, option pricing and weather
prediction [2, 7, 14].
Parametric density estimation is often based on maximal likelihood: a family
of candidate densities is proposed, ρ(x; β ), where β denotes parameters from an
admissible set A. Then these parameters are chosen so as to maximize the log-
likelihood L of the available observations:
(1.1)   β = arg max_{β ∈ A} L = ∑_{j=1}^{m} log ρ(x_j; β).
This approach enjoys wide applicability, yet it suffers from the arbitrariness in the choice of the para-
metric family and number of parameters involved. Ideally, the form of the density
function would emerge from the data, not from arbitrary a priori choices, unless
these are guided by a deeper knowledge of the processes originating the probabil-
ity distribution under study.
The simplest methodology for non-parametric density estimation is the his-
togram [18], whereby space is divided into regular bins, and the estimated density
within each bin is assigned a uniform value, proportional to the number of ob-
servations that fall within. Histogram estimates are not smooth and suffer greatly
from the curse of dimensionality. A smoother version, first developed in [13] and
[12], uses a sum of kernel functions centered at each observation, with a bandwidth
adapted to the level of resolution desired. Particular kernels have been devised to
handle properties of the target distributions; for instance, when these are known to
have support only in the positive half-line, Gamma kernels have been proposed as
a substitute for their more widely used symmetric Gaussian counterpart [4]. Many
kernels, including the Gaussian, can be conceptualized as smoothers arising from
a diffusion process [3], paving the road for a unified, systematic treatment of band-
width selection and boundary bias.
In non-parametric estimation, one must be careful not to over-resolve the den-
sity, for which one needs to calibrate the smoothing parameters to the data [11].
The most universal methodology for this is cross-validation [6], in which the avail-
able data are partitioned into subsets, used alternatively for training and out-of-
sample testing of the estimation procedure. A related algorithm is the bootstrap
[8], which creates training and testing populations by drawing samples with re-
placement from the available data points.
An alternative methodology for non-parametric density estimation was devel-
oped in [17], based on normalizing flows in feature-space. Other procedures based
on normalization include the exploratory projection pursuit [5], a methodology
originally developed for the visualization of high-dimensional data, which normal-
izes selected small-dimensional cross-sections of the space of features, and copula–
based density estimation [15, 10], which normalizes the one-dimensional marginal
distribution of each individual feature, and then couples these marginals through
a multidimensional copula, typically Gaussian or Archimedean. Normalizing the
data x_j means finding a map y(x) such that the y_j = y(x_j) have a prescribed distribution
µ(y), for which we shall adopt here the isotropic Gaussian
(1.2)   µ(y) = N(0, I_n) = (2π)^{−n/2} e^{−||y||²/2}.
If such a map is known, then the probability density ρ(x) underlying the original
data is given by

(1.3)   ρ(x) = J_y(x) µ(y(x)),

where J_y(x) is the Jacobian of the map y(x) evaluated at the point x. In view of
(1.3), density estimation can be rephrased as the search for a normalizing map.
There is more than semantics to this rephrasing: normalizing the data is often
a goal per se. It allows us, for instance, to compare observations from different
datasets, to define robust metrics in phase-space, and to use standard statistical
tools, often applicable only to normal distributions. More important for us here,
however, is that it leads to the development of a novel family of density-estimation
techniques.
Combining (1.1) and (1.3), the estimation problem becomes the maximization over the map's parameters β of

(1.4)   L = ∑_{j=1}^{m} [ log J_{y_β}(x_j) − (1/2) ||y_β(x_j)||² ],

where we have omitted from the log-likelihood L the β-independent term −(n/2) log(2π).
In particular, if y(x) is chosen among all linear functions of the form

(1.5)   y_β(x) = A(x − b),    β = (A, b),  A ∈ R^{n×n},  b ∈ R^n,

then the output of the maximization in (1.4) is

(1.6)   b = x̄,    A = Σ^{−1/2},

where x̄ = (1/m) ∑_{j=1}^{m} x_j is the empirical mean and Σ = (1/m) ∑_{j=1}^{m} (x_j − x̄)(x_j − x̄)^t the empirical
covariance matrix of the data. In other words, a linear choice for y_β(x) yields the standard
normalization procedure of subtracting the mean and dividing by the square-root
of the covariance matrix. In terms of density estimation, it yields the Gaussian

(1.7)   ρ(x) = (2π)^{−n/2} |Σ|^{−1/2} e^{−(1/2)(x − x̄)^t Σ^{−1} (x − x̄)}.
Yet this, like all parametric procedures, suffers from the extra structure it imposes on
the data by assuming that it has an underlying probability density of a particular
form.
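As an illustration of this linear special case, the following sketch (in Python; not code from the paper, and the array names and the use of NumPy are our own choices) subtracts the empirical mean, whitens with Σ^{−1/2}, and evaluates the resulting Gaussian estimate (1.7):

import numpy as np

def gaussian_normalize(X):
    """Linear normalization y = A(x - b) with b = mean, A = Sigma^(-1/2).

    X: (m, n) array of observations. Returns the normalized points and a
    callable evaluating the Gaussian density estimate (1.7).
    """
    m, n = X.shape
    b = X.mean(axis=0)
    Xc = X - b
    Sigma = Xc.T @ Xc / m                       # empirical covariance matrix
    # inverse square root of Sigma through its eigendecomposition
    w, V = np.linalg.eigh(Sigma)
    A = V @ np.diag(w ** -0.5) @ V.T
    Y = Xc @ A.T                                # normalized data
    det_Sigma = np.prod(w)
    Sigma_inv = np.linalg.inv(Sigma)

    def rho(x):
        # Gaussian density (1.7) evaluated at the rows of x
        z = np.atleast_2d(x) - b
        quad = np.einsum('ij,jk,ik->i', z, Sigma_inv, z)
        return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * det_Sigma)

    return Y, rho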
One way to approach the algorithm proposed in [17] is to factor the map y(x)
into N parametric maps φ_{β_i}(z):

(1.8)   y_N(x) = φ_{β_N} ∘ φ_{β_{N−1}} ∘ ⋯ ∘ φ_{β_1}(x),
since the composition of many simple maps can be made arbitrarily complex, thus
overcoming the limitations of parametric maps. If this is considered as a function
yβ (x), depending on the indexed family of parameters β = (β1 . . . βN ) on which to
perform the maximization in (1.4), then we have just complicated matters without
resolving any issue. Yet the following two realizations help us move forward:
passive Lagrangian markers that move with the flow but do not influence it, since
they are not included in the likelihood function.
(1.11)   β = ν ∇_β L |_{β=0},

and ε ≪ 1 prescribed. This simple formula for the learning rate guarantees that
the size kβ k of all steps is bounded by ε and decreases near a maximum of L. It
was proved in [17] that the composition of such one-dimensional maps suffices to
guarantee convergence to arbitrary distributions ρ(x), based on the fact that two
distributions with the same marginals in all directions are necessarily identical.
This procedure was further developed in [1] to address clustering and classification
problems.
Yet the procedure just described suffers from some computational drawbacks:
• Exploring all directions through one-dimensional maps requires a number
of steps that grows exponentially with the dimension of phase-space. In
many applications, such as to microarray data, this dimension can be very
large. Moreover, performing random rotations (i.e., orthogonal transformations)
in high dimensions is costly.
• In order to have a smooth ascent process, the step-size ε needs to be small,
hence requiring the algorithm to perform a large number of steps to reach
convergence.
In this paper, we address both of these issues. On the one hand, we propose ele-
mentary transformations that do not deteriorate when the dimensionality of phase-
space grows, the simplest and most effective of which is based on radial expan-
sions. On the other, we exploit the fact that the elementary transformations have a
very simple analytical form to go beyond straightforward gradient descent, and in-
stead maximize in each step the local quadratic approximation to the log-likelihood
in terms of the parameters β . This allows us to take much larger steps, and hence
reduces significantly the total number of steps that the algorithm requires.
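A minimal sketch of the resulting iteration is given below (Python, in our own notation; the helper elementary_step, which applies one building block of the kind developed in Section 3 with its parameter chosen by the local quadratic maximization, is an assumption of this sketch, not code from the paper):

import numpy as np

def normalize(Y, logJ, n_steps, elementary_step):
    """Drive preconditioned data toward N(0, I) by composing simple maps.

    Y: (m, n) array of (already preconditioned) points; logJ: per-point
    log-Jacobian accumulated so far; elementary_step(Y, logJ) applies one
    elementary map and returns the updated points and log-Jacobians.
    """
    for _ in range(n_steps):
        Y, logJ = elementary_step(Y, logJ)
    m, n = Y.shape
    # current density estimate at the original observations:
    # rho(x) = J_y(x) * mu(y(x)), with mu the isotropic Gaussian (1.2)
    log_rho = logJ - 0.5 * np.sum(Y ** 2, axis=1) - 0.5 * n * np.log(2.0 * np.pi)
    return Y, log_rho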
On the negative side, it never picks points away from the observations, so it may be inef-
fective at reducing over-estimated densities at points far from the observed set. The
latter choice, on the other hand, will sample all points proportionally to their cur-
rent estimated density, so it will detect and help correct points with over-estimated
probability, yet it may fail to sample points in areas with under-estimated proba-
bility density, so these may never be corrected. We have implemented a balanced
solution, whereby we randomly alternate between the two sampling methodologies
just described.
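A possible implementation of this balanced alternation (a sketch under our own conventions; the paper does not prescribe the mixing probability, taken here to be 1/2):

import numpy as np

def sample_center(Y, rng=np.random.default_rng()):
    """Pick the center x0 of the next elementary map.

    With probability 1/2, x0 is a randomly chosen (partially normalized)
    observation; otherwise it is drawn from the target Gaussian N(0, I),
    i.e. proportionally to the current estimated density in y-space.
    """
    m, n = Y.shape
    if rng.random() < 0.5:
        return Y[rng.integers(m)]
    return rng.standard_normal(n)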
• For data points far from the origin, the gradient of their likelihood under a
normal distribution could reach machine zero, at which point the algorithm
will lack any guidance as to how to move them to improve their likelihood.
• Movements in the bulk may require a coarse resolution, as measured by
the length scale α, at odds with the finer one needed for a more detailed
resolution of the probability density.
• We may have some a priori knowledge of a family of distributions that
should capture much of the data’s variability. Using this to do first a simple
parametric estimation may save much computational time.
• In some cases, we might be interested in how much the actual distribution
differs from a conventional one, such as the log-normal for investment re-
turns. Then we can first do a fit to the conventional distribution, and then
quantify the extent and nature of the subsequent maps.
This first set of maps can be thought of as a preconditioning step of the algo-
rithm, which only differs from the subsequent steps in the form of the proposed
maps or in the scale adopted. Two preconditioning steps that we include by default
in the algorithm are subtracting the mean of the data,
(2.6)   x → x − µ,    µ = (1/m) ∑_{j=1}^{m} x_j,

and dividing by the average standard deviation:

(2.7)   x → x/σ,    σ = ( (1/(mn)) ∑_{j=1}^{m} ||x_j||² )^{1/2},

with corresponding initial estimation

(2.8)   ρ_0(x) = (2πσ²)^{−n/2} e^{−||x − µ||²/(2σ²)}.
Proposing a general Gaussian as in (1.7) is not generally advisable in high dimen-
sions, unless the sample size m is big enough to allow for a robust estimation of
the covariance matrix Σ.
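A sketch of these two default preconditioning steps and of the initial estimate (2.8) they induce (Python; the function name and interface are our own):

import numpy as np

def precondition(X):
    """Default preconditioning: subtract the mean (2.6), divide by the
    average standard deviation (2.7), and record the initial Gaussian
    estimate (2.8) and the per-point log-Jacobian of the map."""
    m, n = X.shape
    mu = X.mean(axis=0)
    sigma = np.sqrt(np.sum((X - mu) ** 2) / (m * n))
    Y = (X - mu) / sigma
    logJ = np.full(m, -n * np.log(sigma))       # Jacobian of x -> (x - mu)/sigma
    # initial estimate rho_0 at the observations, equation (2.8)
    log_rho0 = -0.5 * n * np.log(2 * np.pi * sigma ** 2) \
               - 0.5 * np.sum((X - mu) ** 2, axis=1) / sigma ** 2
    return Y, logJ, log_rho0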
Another generally applicable preconditioning candidate consists of carrying out
a few steps of the regular core procedure, but with coarser resolution, i.e. with
larger n p in (2.1). More generally, we can have a value of n p that decreases mono-
tonically throughout the procedure, from an initial coarse value to the finest reso-
lution desired or allowed by the data, thus blurring the boundary between precon-
ditioning and the algorithm’s core.
In specific cases, where a family of probability densities of specific form ρ0 (x, β )
is known or conjectured to provide a sensible fit of the data, and a map y(x, β ) is
known such that
(2.9) ρ0 (x, β ) = J y (x)µ(y(x, β )),
then the preconditioning step should consist of a parametric fit of these parame-
ters β followed by the map. The popular procedure of taking the log of a series
of returns fits within this framework, where the conjecture ρ0 (x) is a log-normal
distribution.
Often ρ(x) has bounded or semi-infinite support, which may be known even
though ρ itself is not. For instance, some components of x may be known to be
positive or, if x denotes geographical location, ρ(x) may be known to vanish over
seas or in other unpopulated areas [16]. When this is the case, it may be convenient
to perform as preconditioning a first map that fills out all of space, such as
(2.10)   x → erf^{−1}(1 − 2e^{−x})
for one-dimensional data with x ≥ 0. The advantage of such a preconditioning step
goes beyond moving the data toward Gaussianity: it also guarantees that the esti-
mated ρest (x) will vanish outside the support of ρ(x).
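For one-dimensional data supported on x ≥ 0, a sketch of the map (2.10) together with the logarithm of its derivative, needed for the Jacobian bookkeeping (Python, using scipy.special.erfinv; the closed-form derivative is our own computation, not stated in the text):

import numpy as np
from scipy.special import erfinv

def fill_half_line(x):
    """Map data supported on x > 0 onto the whole real line, as in (2.10):
    y = erfinv(1 - 2*exp(-x)). Returns y and log(dy/dx), needed to keep
    track of the Jacobian of the composed normalizing map.

    The derivative dy/dx = sqrt(pi) * exp(y**2 - x) follows from
    d/du erfinv(u) = (sqrt(pi)/2) * exp(erfinv(u)**2).
    """
    x = np.asarray(x, dtype=float)
    y = erfinv(1.0 - 2.0 * np.exp(-x))
    log_dydx = 0.5 * np.log(np.pi) + y ** 2 - x
    return y, log_dydx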
The elementary maps that we propose are localized radial expansions around a center x_0,

(3.1)   y(x) = x_0 + (1 + β f(r)) (x − x_0),    r = ||x − x_0||,

depending on the single parameter β, positive for local expansions and negative
for contractions.
A typical localization function f is given by

(3.2)   f(r) = (1/α) · erf(r/α) / (r/α),

where r = ||x − x_0||. Another choice is

(3.3)   f(r) = 1/(α + r).
Even though the two are similar in shape, each choice has its advantages: the
former is smoother and more localized, while the latter is faster to compute and,
more importantly, the corresponding map (3.1) can be inverted in closed form,
yielding
(x − x_0)/r = (y − x_0)/s,

where s = ||y − x_0|| and

r = (s − (α + β))/2 + ( [(s − (α + β))/2]² + α s )^{1/2}.
This is useful in a number of applications that involve finding the inverse x(y) of
the normalizing map y(x): producing synthetic extra sample points x_j from ρ(x),
for instance, can be achieved by obtaining samples y_j from the Gaussian µ(y) and
writing x_j = x(y_j).
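A sketch of this inversion for the choice f(r) = 1/(α + r) (Python; the function name, the vectorized interface and the guard against division by zero at the center are ours):

import numpy as np

def invert_radial_map(Y, x0, alpha, beta):
    """Invert y = x0 + (1 + beta*f(r))(x - x0) with f(r) = 1/(alpha + r),
    using the closed form above: s = |y - x0| and
    r = (s - (alpha+beta))/2 + sqrt(((s - (alpha+beta))/2)**2 + alpha*s)."""
    d = Y - x0
    s = np.linalg.norm(d, axis=-1, keepdims=True)
    h = 0.5 * (s - (alpha + beta))
    r = h + np.sqrt(h ** 2 + alpha * s)
    # points at the center map to themselves; avoid 0/0
    scale = np.where(s > 0, r / np.maximum(s, 1e-300), 1.0)
    return x0 + scale * d

Samples from ρ(x) can then be produced by drawing y_j from N(0, I_n) and pulling them back through the composed inverses, applied in reverse order.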
Still one more choice is

(3.4)   f(r) = (1/α)(1 − r/α)²  for r < α,    f(r) = 0 otherwise.
This has the advantage of its compact support, which permits the easy superposi-
tion of various such maps simultaneously. All three families require β > −α for
the maps to be one–to–one; the last one requires also that β < 3α. Figure 3.1
compares the three functions, for x0 = 0 and α = 1.
The map in (3.1) has Jacobian

J = (1 + β f)^{n−1} (1 + β( f + r f′ ))

and corresponding log-likelihood function

L = ∑_j log(ρ(x_j)) = ∑_j { −(1/2)[ ||x_0||² + 2(x_0, (1 + β f)(x_j − x_0)) + ((1 + β f) r_j)² ]
        + (n − 1) log(1 + β f) + log(1 + β( f + r_j f′ )) },

where f and f′ are evaluated at r_j = ||x_j − x_0||.
Then

∂L/∂β |_{β=0} = ∑_j { ( n − (x_0, x_j − x_0) − r_j² ) f + r_j f′ }
Figure 3.1. Three radial building blocks. The upper panels display
f(|x|), the lower ones x f(|x|). On the left, a smooth, analytic block
based on the error function; in the center, one with algebraic decay (and
closed-form inversion); on the right, one with compact support.
and

∂²L/∂β² |_{β=0} = − ∑_j { (n + r_j²) f² + 2 r_j f f′ + (r_j f′)² } < 0,
so we may replace L by its quadratic approximation at β = 0, yielding the following
approximation to the maximizer:

β = − ( ∂L/∂β |_{β=0} ) / ( ∂²L/∂β² |_{β=0} ).
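A sketch of one such step for the compactly supported profile (3.4), following the derivative formulas above (Python; the clipping that enforces the invertibility constraint −α < β < 3α and the handling of the degenerate case are our own safeguards):

import numpy as np

def radial_step(Y, logJ, x0, alpha):
    """One elementary map: localized radial expansion around x0 with the
    compactly supported profile (3.4), its parameter beta chosen by
    maximizing the local quadratic approximation to the log-likelihood."""
    m, n = Y.shape
    d = Y - x0
    r = np.linalg.norm(d, axis=1)
    f = np.where(r < alpha, (1.0 - r / alpha) ** 2 / alpha, 0.0)
    fp = np.where(r < alpha, -2.0 * (1.0 - r / alpha) / alpha ** 2, 0.0)

    # first and second derivatives of L at beta = 0
    dL = np.sum((n - d @ x0 - r ** 2) * f + r * fp)
    d2L = -np.sum((n + r ** 2) * f ** 2 + 2.0 * r * f * fp + (r * fp) ** 2)
    beta = -dL / d2L if d2L < 0 else 0.0
    # keep the map one-to-one: -alpha < beta < 3*alpha
    beta = np.clip(beta, -0.9 * alpha, 2.9 * alpha)

    # apply the map (3.1) and update the per-point log-Jacobian
    factor = 1.0 + beta * f
    Y_new = x0 + factor[:, None] * d
    logJ_new = logJ + (n - 1) * np.log(factor) + np.log(1.0 + beta * (f + r * fp))
    return Y_new, logJ_new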
A second family of building blocks applies a one-dimensional stretching of this kind to each coordinate separately, y^i(x) = x_0^i + (1 + β_i f_i)(x^i − x_0^i) with f_i = f(|x^i − x_0^i|), and one parameter β_i per coordinate. In this case

∂L/∂β_i |_{β=0} = ∑_j { f_i + (x_j^i − x_0^i) f_i′ − x_0^i f_i (x_j^i − x_0^i) − f_i (x_j^i − x_0^i)² }
and

∂²L/∂β_i² |_{β=0} = − ∑_j { [ f_i + (x_j^i − x_0^i) f_i′ ]² + f_i² (x_j^i − x_0^i)² },

so we must pick

β_i = − ( ∂L/∂β_i |_{β=0} ) / ( ∂²L/∂β_i² |_{β=0} ).
This family of maps is not isotropic, since it privileges the coordinate axes.
To restore isotropy, one can rotate the axes every time-step, through a random
orthogonal matrix. With this extra ingredient, this building block agrees with the
one originally implemented in [17]; the only differences are the specific form of
the stretching function, which in [17] was a more complex function depending on
three parameters per dimension, and the maximization procedure, which is carried
out here through a local quadratic approximation, not by first-order ascent of the
log-likelihood.
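One standard way to generate such a random orthogonal matrix is through the QR factorization of a matrix with independent Gaussian entries (a sketch; the sign correction makes the draw uniform over the orthogonal group):

import numpy as np

def random_rotation(n, rng=np.random.default_rng()):
    """Return a random n x n orthogonal matrix, used to rotate the axes
    between steps so that the coordinate-wise maps privilege no direction.
    Rotating the data leaves the target N(0, I) and the Jacobian unchanged."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    # fix the column signs so the distribution is uniform (Haar measure)
    return Q * np.sign(np.diag(R))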
where

L_i^j(x) = [ ∂f/∂x^i − f(x) x^i ] (x^j − x_0^j) + f(x) δ_i^j

and

Q_{ij}^{kl}(x) = f(x)² [ δ_i^l δ_j^k + δ_i^k (x^j − x_0^j)(x^l − x_0^l) ] + 2 f(x) δ_k^j (∂f/∂x^i)(x^l − x_0^l)
        + (∂f/∂x^i)(∂f/∂x^k)(x^j − x_0^j)(x^l − x_0^l).
Hence maximizing over A the quadratic approximation to the log-likelihood L =
∑_m log(ρ(x_m)) yields the system

∑_{kl} [ ∑_m ( Q_{ij}^{kl}(x_m) + Q_{kl}^{ij}(x_m) ) ] A_k^l = − ∑_m L_i^j(x_m).
Notice that this building block requires much more computational work than
the isotropic expansions, hence its use would only be justified if it yielded better
accuracy in a much smaller number of steps. We found in the experiments below
that this is not typically the case, so we conclude that simpler maps, with only a
handful of parameters β –such as the single one for the radial expansions– are to
be preferred.
4 Examples
In this section, we use some synthetic examples to illustrate the procedure and
to compare the efficiency of the various building blocks proposed above. In all
examples, we have used for preconditioning only the two steps in (2.6)-(2.7), which
re-center the observations at x = 0 and stretch them isotropically so as to produce
a unit average standard deviation.
As a first example, consider the two-dimensional probability density displayed
in figure 4.1, given by
(4.1)   ρ(x, y) ∝ e^{−θ²/2} e^{−(1/2)((r − 1)/0.1)²},
where r and θ are the radius and angle in polar coordinates: a distribution con-
centrated in a small neighborhood of the unit circle, with maximal density at
(x, y) = (1, 0). Such a distribution, with thin support and pronounced curvature,
would be hard to capture with any parametric approach. Yet the proposed algo-
rithm does a very good job, as shown in figure 4.2.
For the experiment displayed in figures 4.1 and 4.2, we have taken a sample of
size m = 1000, used the radial expansion in (3.1) with f (r) from (3.2), and adopted
a value n_p = 500 for the calculation of the length-scale α in (2.1). The Kullback-
Leibler divergence [9] between the exact and the estimated distributions, displayed
in the last panel of figure 4.1, is given by

(4.2)   KL = ∫ log( ρ_ex(y) / ρ_est(y) ) ρ_ex(y) dy,
which is integrated numerically on the same grid used for the plots, a set of points
carried passively by the algorithm, where ρest is known. Another possibility, much
more efficient in high dimensions, is to estimate KL through Monte Carlo simula-
tion:

(4.3)   KL ≈ (1/N) ∑_{j=1}^{N} [ log(ρ_ex(x_j)) − log(ρ_est(x_j)) ],

with the sample points x_j drawn from ρ_ex.
This also reveals the connection between the Kullback-Leibler divergence between
estimated and exact densities and the log-likelihood of the estimated density, which
the algorithm ascends.
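A sketch of this Monte Carlo estimate (Python; the interface, with callables returning log-densities and samples supplied by the caller, is our own choice):

import numpy as np

def kl_monte_carlo(samples_exact, log_rho_exact, log_rho_est):
    """Estimate KL(rho_exact || rho_est) as in (4.3): average, over samples
    drawn from the exact density, of log(rho_exact) - log(rho_est)."""
    x = np.asarray(samples_exact)
    return np.mean(log_rho_exact(x) - log_rho_est(x))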
Experimenting with the other building blocks proposed above yields entirely
similar results. We conclude that the radial expansions are to be preferred, since
their use is much less computationally intensive. Moreover, the simplicity of the
radial expansions brings in an extra degree of robustness, as revealed by a much
smaller sensitivity to the choice of n p , the only free parameter of the algorithm.
Next we compare the procedure developed here with Kernel density estima-
tion, the most popular non-parametric methodology in use [18]. We have adopted
Gaussian kernels of the form
(4.4)   K_h(x, y) = (2π)^{−n/2} h^{−n} e^{−(1/2) ||(y − x)/h||²},
and proposed the estimate
(4.5)   ρ(y) = (1/m) ∑_{j=1}^{m} K_h(x_j, y).
Hence each observation x j contributes to the local density within a neighborhood
whose size scales with h. This bandwidth plays a similar role to the n p of our
procedure, the typical number of points affected by each map.
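For reference, the kernel estimate (4.4)-(4.5) used in this comparison can be written compactly as follows (a sketch; a production code would typically rely on a dedicated kernel-density routine):

import numpy as np

def kde_gaussian(X, query, h):
    """Gaussian kernel density estimate (4.4)-(4.5): the average over the
    m observations X (shape (m, n)) of isotropic Gaussians of bandwidth h
    centered at each observation, evaluated at the query points (q, n)."""
    m, n = X.shape
    diff = query[:, None, :] - X[None, :, :]          # shape (q, m, n)
    sq = np.sum(diff ** 2, axis=2) / h ** 2
    norm = (2.0 * np.pi) ** (n / 2) * h ** n
    return np.mean(np.exp(-0.5 * sq), axis=1) / norm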
Figure 4.3 displays the results of applying both procedures, at different band-
widths, to a sample with m = 500 points from the distribution in (4.1). We have
picked two values for h and n p , one too large, slightly under-resolving the distribu-
tion, and one too small, slightly over-resolving it. Comparing the results from the
two procedures, we can make the following observations:
• Both procedures are robust, capturing the main features of the probabil-
ity density ρ(x), whereas most parametric approaches would have done
poorly.
• The mapping procedure yields smoother and tighter density profiles, and
correspondingly smaller values of the Kullback–Leibler divergence be-
tween the exact and estimated densities.
The computational costs of both procedures are comparable: estimating the density
at q points requires m × q evaluations of the kernel and ns × q map applications
respectively. Since the number ns of iterations before convergence scales with
the number m of observations, these two numbers of evaluations are of the same
order. The mapping procedure has the additional cost of determining the optimal
parameter β for each step, but this is comparatively unimportant when q is much
larger than m.
Beyond the comparison of effectiveness, which depends on the actual problem
in hand, one can describe the main differences between the two procedures:
• The estimated density is expressed in terms of the sum of kernel functions
in one case and of the composition of elementary maps in the other.
• In the implementations discussed here, Kernel density estimation is explicit
and deterministic, while there is a stochastic element to the choice of the
centers for the elementary maps.
• Kernel density estimation is conceptually simpler, while the normalizing
maps have a richer structure and more versatility.
• The kernels provide just an estimated density, while the new procedure also
produces a normalizing map. This can be used for a variety of purposes,
such as sampling.
Figures 4.4 and 4.5 show another two-dimensional experiment. In this case, the
proposed density is the mixture of two anisotropic Gaussians, and, for illustration,
the building block utilized is the general localized linear transformation in (3.6).
Notice in figure 4.5 a feature associated with the dual nature of the algorithm:
since the normalizing procedure cannot fully eliminate the gap between the two
Gaussians without over-resolving, as shown in the three bottom panels, the corre-
sponding density estimation, displayed in the top panels, cannot fully separate the
two. A kernel estimator would also be unable to fully separate the two components,
but here the reason would be more straightforward: a bandwidth h small enough to
separate them would over-resolve the estimation, particularly at the less populated
tails of the distribution.
The examples above are two-dimensional to facilitate their display, yet the full
power of the algorithm manifests itself in high-dimensional situations. Thus we
consider next the equal-weight mixture of two n-dimensional normal distributions,
centered at x = ±2e1 . Here we have used a sample of m = 1000 points, n p =
500, and again the isotropic radial expansion in (3.1,3.2). Figure 4.6 compares
the evolution of the Kullback-Leibler divergence between the exact and estimated
density in dimensions n = 2, n = 5 and n = 10. In order to enable a meaningful
radial expansions. The value of n = 10 is beyond the largest one might have hoped
to resolve with a sample of size m = 1000, since 2^{10} = 1024: one has on average one
observation per the 10-dimensional equivalent of a quadrant! Thus it is surprising
that the algorithm resolves this density so well.
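The sample for this experiment can be generated as follows (a sketch; the helper name and the fixed seed are ours):

import numpy as np

def mixture_sample(m, n, rng=np.random.default_rng(0)):
    """Draw m points in R^n from the equal-weight mixture of two isotropic
    normal distributions centered at +2*e1 and -2*e1."""
    centers = np.zeros((m, n))
    centers[:, 0] = np.where(rng.random(m) < 0.5, 2.0, -2.0)
    return centers + rng.standard_normal((m, n))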
5 Conclusions
We have developed a methodology for non-parametric density estimation. Based
on normalizing flows, the new procedure improves on the one developed in [17],
in that it is more robust and efficient in high dimensions, and ascends the log-
likelihood function through larger steps, based on a quadratic approximation rather
than gradient ascent. It requires only one external parameter, n p , with a clear in-
terpretation: the level of resolution sought, measured in number of observations
per localized feature of the estimated density. We have found that the simplest el-
ementary transformations, such as localized radial expansions, are also the most
efficient and robust building blocks from which to form the map that normalizes
the data points.
Density estimation appears often in applications as a tool for more specific
tasks. One advantage of the methodology developed here is its flexibility, which
allows for easy adaptation to such tasks. Thus, in [1], we have adapted the al-
gorithm from [17], a direct ancestor to the one in this paper, to do classification
and clustering. Along similar lines, projects under way employ variations of the
methodology proposed here to perform tasks as varied as medical diagnosis, re-
lating behavioral traits to neuron classes in worms, Monte Carlo simulation, time
series analysis, estimation of risk-neutral measures, and transportation theory. It is
in the context of these more specific procedures that examples with real data make the most sense.
Bibliography
[1] Agnelli, J. P. ; Cadeiras, M. ; Tabak, E. G. ; Turner, C. V. ; Vanden-Eijnden, E. Clustering and
classification through normalizing flows in feature space. SIAM MMS 8 (2010), 1784–1802.
[2] Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
[3] Botev, Z. I.; Grotowski, J. F. ; Kroese, D. P. Kernel density estimation via diffusion. Annals of
Statistics 38 (2010), 2916–2957.
[4] Chen, S. X. Probability density function estimation using Gamma kernels. Ann. Inst. Statist.
Math. 52 (2000), 471–480.
[5] Friedman, J. Exploratory projection pursuit. J. Amer. Statist. Assoc. 82 (1987), 249–266.
[6] Hall, P.; Racine, J.; Li, Q. Cross-validation and the estimation of conditional probability densi-
ties. J. Amer. Statist. Assoc. 99 (2004), 1015–1026.
[7] Hastie, T.; Tibshirani, R.; Friedman, J. The elements of statistical learning. Springer, 2001.
[8] Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selec-
tion. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence 2
(1995), 1137–1143.
[9] Kullback S. ; Leibler, R. A. On information and sufficiency. Annals of Math. Statistics 22
(1951), 79–86.
[10] Nelsen, R. B. An Introduction to Copulas. Springer, 1999.
[11] Park, B. U.; Marron, J. S. Comparison of data-driven bandwidth selectors. J. Amer. Statist.
Assoc. 85 (1990), 66–72.
[12] Parzen, E. On estimation of a probability density function and mode. Annals of Math. Stat. 33
(1962), 1065–1076.
[13] Rosenblatt, M. Remarks on some nonparametric estimates of a density function. Annals of
Math. Stat. 27 (1956), 832–837.
[14] Silverman, B.W. Density Estimation for Statistics and Data Analysis, London: Chapman &
Hall/CRC, 1998.
[15] Sklar, A. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris
8 (1959), 229–231.
[16] Smith, L. M.; Keegan, M. S.; Wittman, T.; Mohler, G. O.; Bertozzi, A. L. Improving Density
Estimation by Incorporating Spatial Information. EURASIP Journal on Advances in Signal
Processing, 2010 (2010), 1–12.
[17] Tabak, E. ; Vanden-Eijnden, E. Density estimation by dual ascent of the log-likelihood. Comm.
Math. Sci. 8 (2010), 217-233.
[18] Wasserman, L., All of nonparametric statistics. Springer, 2006.