Density Estimation Algorithms
E. G. TABAK
Courant Institute of Mathematical Sciences
AND
CRISTINA V. TURNER
FaMAF, Universidad Nacional de Córdoba
Abstract
A new methodology for density estimation is proposed. The methodology, which
builds on the one developed in [17], normalizes the data points through the
composition of simple maps. The parameters of each map are determined through
the maximization of a local quadratic approximation to the log-likelihood.
Various candidates for the elementary maps of each step are proposed; criteria for
choosing one include robustness, computational simplicity and good behavior in
high-dimensional settings. A good choice is that of localized radial expansions,
which depend on a single parameter: all the complexity of arbitrary, possibly
convoluted probability densities can be built through the composition of such
simple maps.
1 Introduction
A central problem in the analysis of data is density estimation: given a set
of independent observations x j , j = 1, . . . , m, estimate its underlying probability
distribution. This article is concerned with the case in which x is a continuous,
possibly multidimensional variable, typically in Rn , and its distribution is specified
by a probability density ρ(x). Among the many uses of density estimation are its
application to classification, clustering and dimensional reduction, as well as more
field-specific applications such as medical diagnosis, option pricing and weather
prediction [2, 7, 14].
Parametric density estimation is often based on maximal likelihood: a family
of candidate densities is proposed, ρ(x; β ), where β denotes parameters from an
admissible set A. Then these parameters are chosen so as to maximize the log-
likelihood L of the available observations:
(1.1)   β = arg max_{β ∈ A} L = ∑_{j=1}^{m} log ρ(x_j; β).
This approach enjoys wide applicability, yet it suffers from the arbitrariness in the choice of the para-
metric family and number of parameters involved. Ideally, the form of the density
function would emerge from the data, not from arbitrary a priori choices, unless
these are guided by a deeper knowledge of the processes originating the probabil-
ity distribution under study.
The simplest methodology for non-parametric density estimation is the his-
togram [18], whereby space is divided into regular bins, and the estimated density
within each bin is assigned a uniform value, proportional to the number of ob-
servations that fall within. Histogram estimates are not smooth and suffer greatly
from the curse of dimensionality. A smoother version, first developed in [13] and
[12], uses a sum of kernel functions centered at each observation, with a bandwidth
adapted to the level of resolution desired. Particular kernels have been devised to
handle properties of the target distributions; for instance, when these are known to
have support only in the positive half-line, Gamma kernels have been proposed as
a substitute for their more widely used symmetric Gaussian counterpart [4]. Many
kernels, including the Gaussian, can be conceptualized as smoothers arising from
a diffusion process [3], paving the road for a unified, systematic treatment of band-
width selection and boundary bias.
In non-parametric estimation, one must be careful not to over-resolve the den-
sity, for which one needs to calibrate the smoothing parameters to the data [11].
The most universal methodology for this is cross-validation [6], in which the avail-
able data are partitioned into subsets, used alternatively for training and out-of-
sample testing of the estimation procedure. A related algorithm is the bootstrap
[8], which creates training and testing populations by drawing samples with re-
placement from the available data points.
An alternative methodology for non-parametric density estimation was devel-
oped in [17], based on normalizing flows in feature-space. Other procedures based
on normalization include the exploratory projection pursuit [5], a methodology
originally developed for the visualization of high-dimensional data, which normal-
izes selected small-dimensional cross-sections of the space of features, and copula–
based density estimation [15, 10], which normalizes the one-dimensional marginal
distribution of each individual feature, and then couples these marginals through
a multidimensional copula, typically Gaussian or Archimedean. Normalizing the
data x_j means finding a map y(x) such that the y_j = y(x_j) have a prescribed distribution
µ(y), for which we shall adopt here the isotropic Gaussian
(1.2)   µ(y) = N(0, I_n) = (2π)^{−n/2} e^{−||y||²/2}.
If such a map is known, then the probability density ρ(x) underlying the original
data is given by

(1.3)   ρ(x) = J_y(x) µ(y(x)),

where J_y(x) is the Jacobian of the map y(x) evaluated at the point x. In view of
(1.3), density estimation can be rephrased as the search for a normalizing map.
There is more than semantics to this rephrasing: normalizing the data is often
a goal per se. It allows us, for instance, to compare observations from different
datasets, to define robust metrics in phase-space, and to use standard statistical
tools, often applicable only to normal distributions. More important for us here,
however, is that it leads to the development of a novel family of density-estimation
techniques.
Combining (1.1) and (1.3), the estimation problem becomes the maximization over the map's parameters β of

(1.4)   L = ∑_{j=1}^{m} [ log J_{y_β}(x_j) − (1/2) ||y_β(x_j)||² ],

where we have omitted from the log-likelihood L the β-independent term −(n/2) log(2π).
In particular, if y(x) is chosen among all linear functions of the form

(1.5)   y_β(x) = A(x − b),    β = (A, b),  A ∈ R^{n×n},  b ∈ R^n,

then the output of the maximization in (1.4) is

(1.6)   b = x̄,    A = Σ^{−1/2},

where x̄ = (1/m) ∑_{j=1}^{m} x_j is the empirical mean and Σ = (1/m) ∑_{j=1}^{m} (x_j − x̄)(x_j − x̄)^t the empirical
covariance matrix of the data. In other words, a linear choice for y_β(x) yields the standard
normalization procedure of subtracting the mean and dividing by the square-root
of the covariance matrix. In terms of density estimation, it yields the Gaussian

(1.7)   ρ(x) = (2π)^{−n/2} |Σ|^{−1/2} e^{−(1/2)(x − x̄)^t Σ^{−1} (x − x̄)}.
Yet this, like all parametric procedures, suffers from the extra structure it imposes on
the data by assuming that it has an underlying probability density of a particular
form.
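As an illustration of this linear special case, the following sketch (in Python; not code from the paper, and the array names and the use of NumPy are our own choices) subtracts the empirical mean, whitens with Σ^{−1/2}, and evaluates the resulting Gaussian estimate (1.7):

import numpy as np

def gaussian_normalize(X):
    """Linear normalization y = A(x - b) with b = mean, A = Sigma^(-1/2).

    X: (m, n) array of observations. Returns the normalized points and a
    callable evaluating the Gaussian density estimate (1.7).
    """
    m, n = X.shape
    b = X.mean(axis=0)
    Xc = X - b
    Sigma = Xc.T @ Xc / m                       # empirical covariance matrix
    # inverse square root of Sigma through its eigendecomposition
    w, V = np.linalg.eigh(Sigma)
    A = V @ np.diag(w ** -0.5) @ V.T
    Y = Xc @ A.T                                # normalized data
    det_Sigma = np.prod(w)
    Sigma_inv = np.linalg.inv(Sigma)

    def rho(x):
        # Gaussian density (1.7) evaluated at the rows of x
        z = np.atleast_2d(x) - b
        quad = np.einsum('ij,jk,ik->i', z, Sigma_inv, z)
        return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * det_Sigma)

    return Y, rho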
One way to approach the algorithm proposed in [17] is to factor the map y(x)
into N parametric maps φ_{β_i}(z):

(1.8)   y_N(x) = φ_{β_N} ∘ φ_{β_{N−1}} ∘ ⋯ ∘ φ_{β_1}(x),
since the composition of many simple maps can be made arbitrarily complex, thus
overcoming the limitations of parametric maps. If this is considered as a function
yβ (x), depending on the indexed family of parameters β = (β1 . . . βN ) on which to
perform the maximization in (1.4), then we have just complicated matters without
resolving any issue. Yet the following two realizations help us move forward:
passive Lagrangian markers that move with the flow but do not influence it, since
they are not included in the likelihood function.
(1.11)   β = ν ∇_β L |_{β=0},

and ε ≪ 1 prescribed. This simple formula for the learning rate guarantees that
the size kβ k of all steps is bounded by ε and decreases near a maximum of L. It
was proved in [17] that the composition of such one-dimensional maps suffices to
guarantee convergence to arbitrary distributions ρ(x), based on the fact that two
distributions with the same marginals in all directions are necessarily identical.
This procedure was further developed in [1] to address clustering and classification
problems.
Yet the procedure just described suffers from some computational drawbacks:
• Exploring all directions through one-dimensional maps requires a number
of steps that grows exponentially with the dimension of phase-space. In
many applications, such as to microarray data, this dimension can be very
large. Moreover, performing random rotations (i.e., orthogonal transformations)
in high dimensions is costly.
• In order to have a smooth ascent process, the step-size ε needs to be small,
hence requiring the algorithm to perform a large number of steps to reach
convergence.
In this paper, we address both of these issues. On the one hand, we propose ele-
mentary transformations that do not deteriorate when the dimensionality of phase-
space grows, the simplest and most effective of which is based on radial expan-
sions. On the other, we exploit the fact that the elementary transformations have a
very simple analytical form to go beyond straightforward gradient descent, and in-
stead maximize in each step the local quadratic approximation to the log-likelihood
in terms of the parameters β . This allows us to take much larger steps, and hence
reduces significantly the total number of steps that the algorithm requires.
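A minimal sketch of the resulting iteration is given below (Python, in our own notation; the helper elementary_step, which applies one building block of the kind developed in Section 3 with its parameter chosen by the local quadratic maximization, is an assumption of this sketch, not code from the paper):

import numpy as np

def normalize(Y, logJ, n_steps, elementary_step):
    """Drive preconditioned data toward N(0, I) by composing simple maps.

    Y: (m, n) array of (already preconditioned) points; logJ: per-point
    log-Jacobian accumulated so far; elementary_step(Y, logJ) applies one
    elementary map and returns the updated points and log-Jacobians.
    """
    for _ in range(n_steps):
        Y, logJ = elementary_step(Y, logJ)
    m, n = Y.shape
    # current density estimate at the original observations:
    # rho(x) = J_y(x) * mu(y(x)), with mu the isotropic Gaussian (1.2)
    log_rho = logJ - 0.5 * np.sum(Y ** 2, axis=1) - 0.5 * n * np.log(2.0 * np.pi)
    return Y, log_rho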
On the negative side, it never picks points away from the observations, so it may be inef-
fective at reducing over-estimated densities at points far from the observed set. The
latter choice, on the other hand, will sample all points proportionally to their cur-
rent estimated density, so it will detect and help correct points with over-estimated
probability, yet it may fail to sample points in areas with under-estimated proba-
bility density, so these may never be corrected. We have implemented a balanced
solution, whereby we randomly alternate between the two sampling methodologies
just described.
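A possible implementation of this balanced alternation (a sketch under our own conventions; the paper does not prescribe the mixing probability, taken here to be 1/2):

import numpy as np

def sample_center(Y, rng=np.random.default_rng()):
    """Pick the center x0 of the next elementary map.

    With probability 1/2, x0 is a randomly chosen (partially normalized)
    observation; otherwise it is drawn from the target Gaussian N(0, I),
    i.e. proportionally to the current estimated density in y-space.
    """
    m, n = Y.shape
    if rng.random() < 0.5:
        return Y[rng.integers(m)]
    return rng.standard_normal(n)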
• For data points far from the origin, the gradient of their likelihood under a
normal distribution could reach machine zero, at which point the algorithm
will lack any guidance as to how to move them to improve their likelihood.
• Movements in the bulk may require a coarse resolution, as measured by
the length scale α, at odds with the finer one needed for a more detailed
resolution of the probability density.
• We may have some a priori knowledge of a family of distributions that
should capture much of the data’s variability. Using this to do first a simple
parametric estimation may save much computational time.
• In some cases, we might be interested in how much the actual distribution
differs from a conventional one, such as the log-normal for investment re-
turns. Then we can first do a fit to the conventional distribution, and then
quantify the extent and nature of the subsequent maps.
This first set of maps can be thought of as a preconditioning step of the algo-
rithm, which only differs from the subsequent steps in the form of the proposed
maps or in the scale adopted. Two preconditioning steps that we include by default
in the algorithm are subtracting the mean of the data,
(2.6)   x → x − µ,    µ = (1/m) ∑_{j=1}^{m} x_j,

and dividing by the average standard deviation:

(2.7)   x → x/σ,    σ = ( (1/(mn)) ∑_{j=1}^{m} ||x_j||² )^{1/2},

with corresponding initial estimation

(2.8)   ρ_0(x) = (2πσ²)^{−n/2} e^{−||x − µ||²/(2σ²)}.
Proposing a general Gaussian as in (1.7) is not generally advisable in high dimen-
sions, unless the sample size m is big enough to allow for a robust estimation of
the covariance matrix Σ.
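A sketch of these two default preconditioning steps and of the initial estimate (2.8) they induce (Python; the function name and interface are our own):

import numpy as np

def precondition(X):
    """Default preconditioning: subtract the mean (2.6), divide by the
    average standard deviation (2.7), and record the initial Gaussian
    estimate (2.8) and the per-point log-Jacobian of the map."""
    m, n = X.shape
    mu = X.mean(axis=0)
    sigma = np.sqrt(np.sum((X - mu) ** 2) / (m * n))
    Y = (X - mu) / sigma
    logJ = np.full(m, -n * np.log(sigma))       # Jacobian of x -> (x - mu)/sigma
    # initial estimate rho_0 at the observations, equation (2.8)
    log_rho0 = -0.5 * n * np.log(2 * np.pi * sigma ** 2) \
               - 0.5 * np.sum((X - mu) ** 2, axis=1) / sigma ** 2
    return Y, logJ, log_rho0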
Another generally applicable preconditioning candidate consists of carrying out
a few steps of the regular core procedure, but with coarser resolution, i.e. with
larger n p in (2.1). More generally, we can have a value of n p that decreases mono-
tonically throughout the procedure, from an initial coarse value to the finest reso-
lution desired or allowed by the data, thus blurring the boundary between precon-
ditioning and the algorithm’s core.
In specific cases, where a family of probability densities of specific form ρ0 (x, β )
is known or conjectured to provide a sensible fit of the data, and a map y(x, β ) is
known such that
(2.9) ρ0 (x, β ) = J y (x)µ(y(x, β )),
then the preconditioning step should consist of a parametric fit of these parame-
ters β followed by the map. The popular procedure of taking the log of a series
of returns fits within this framework, where the conjecture ρ0 (x) is a log-normal
distribution.
Often ρ(x) has bounded or semi-infinite support, which may be known even
though ρ itself is not. For instance, some components of x may be known to be
positive or, if x denotes geographical location, ρ(x) may be known to vanish over
seas or in other unpopulated areas [16]. When this is the case, it may be convenient
to perform as preconditioning a first map that fills out all of space, such as
(2.10)   x → erf^{−1}(1 − 2e^{−x})
for one-dimensional data with x ≥ 0. The advantage of such a preconditioning step
goes beyond moving the data toward Gaussianity: it also guarantees that the esti-
mated ρest (x) will vanish outside the support of ρ(x).
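For one-dimensional data supported on x ≥ 0, a sketch of the map (2.10) together with the logarithm of its derivative, needed for the Jacobian bookkeeping (Python, using scipy.special.erfinv; the closed-form derivative is our own computation, not stated in the text):

import numpy as np
from scipy.special import erfinv

def fill_half_line(x):
    """Map data supported on x > 0 onto the whole real line, as in (2.10):
    y = erfinv(1 - 2*exp(-x)). Returns y and log(dy/dx), needed to keep
    track of the Jacobian of the composed normalizing map.

    The derivative dy/dx = sqrt(pi) * exp(y**2 - x) follows from
    d/du erfinv(u) = (sqrt(pi)/2) * exp(erfinv(u)**2).
    """
    x = np.asarray(x, dtype=float)
    y = erfinv(1.0 - 2.0 * np.exp(-x))
    log_dydx = 0.5 * np.log(np.pi) + y ** 2 - x
    return y, log_dydx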
The elementary maps that we propose are localized radial expansions around a center x_0,

(3.1)   y(x) = x_0 + (1 + β f(r)) (x − x_0),    r = ||x − x_0||,

depending on the single parameter β, positive for local expansions and negative
for contractions.
A typical localization function f is given by

(3.2)   f(r) = (1/α) · erf(r/α) / (r/α),

where r = ||x − x_0||. Another choice is

(3.3)   f(r) = 1/(α + r).
Even though the two are similar in shape, each choice has its advantages: the
former is smoother and more localized, while the latter is faster to compute and,
more importantly, the corresponding map (3.1) can be inverted in closed form,
yielding
(x − x_0)/r = (y − x_0)/s,

where s = ||y − x_0|| and

r = (s − (α + β))/2 + ( [(s − (α + β))/2]² + α s )^{1/2}.
This is useful in a number of applications that involve finding the inverse x(y) of
the normalizing map y(x): producing synthetic extra sample points x_j from ρ(x),
for instance, can be achieved by obtaining samples y_j from the Gaussian µ(y) and
writing x_j = x(y_j).
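A sketch of this inversion for the choice f(r) = 1/(α + r) (Python; the function name, the vectorized interface and the guard against division by zero at the center are ours):

import numpy as np

def invert_radial_map(Y, x0, alpha, beta):
    """Invert y = x0 + (1 + beta*f(r))(x - x0) with f(r) = 1/(alpha + r),
    using the closed form above: s = |y - x0| and
    r = (s - (alpha+beta))/2 + sqrt(((s - (alpha+beta))/2)**2 + alpha*s)."""
    d = Y - x0
    s = np.linalg.norm(d, axis=-1, keepdims=True)
    h = 0.5 * (s - (alpha + beta))
    r = h + np.sqrt(h ** 2 + alpha * s)
    # points at the center map to themselves; avoid 0/0
    scale = np.where(s > 0, r / np.maximum(s, 1e-300), 1.0)
    return x0 + scale * d

Samples from ρ(x) can then be produced by drawing y_j from N(0, I_n) and pulling them back through the composed inverses, applied in reverse order.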
Still one more choice is

(3.4)   f(r) = (1/α)(1 − r/α)²  for r < α,    f(r) = 0 otherwise.
This has the advantage of its compact support, which permits the easy superposi-
tion of various such maps simultaneously. All three families require β > −α for
the maps to be one–to–one; the last one requires also that β < 3α. Figure 3.1
compares the three functions, for x0 = 0 and α = 1.
The map in (3.1) has Jacobian

J = (1 + β f)^{n−1} (1 + β( f + r f′ ))

and corresponding log-likelihood function

L = ∑_j log(ρ(x_j)) = ∑_j { −(1/2)[ ||x_0||² + 2(x_0, (1 + β f)(x_j − x_0)) + ((1 + β f) r_j)² ]
        + (n − 1) log(1 + β f) + log(1 + β( f + r_j f′ )) },

where f and f′ are evaluated at r_j = ||x_j − x_0||.
Then

∂L/∂β |_{β=0} = ∑_j { ( n − (x_0, x_j − x_0) − r_j² ) f + r_j f′ }
Figure 3.1. Three radial building blocks. The upper panels display
f(|x|), the lower ones x f(|x|). On the left, a smooth, analytic block
based on the error function; in the center, one with algebraic decay (and
closed-form inversion); on the right, one with compact support.
and

∂²L/∂β² |_{β=0} = − ∑_j { (n + r_j²) f² + 2 r_j f f′ + (r_j f′)² } < 0,
so we may replace L by its quadratic approximation at β = 0, yielding the following
approximation to the maximizer:

β = − ( ∂L/∂β |_{β=0} ) / ( ∂²L/∂β² |_{β=0} ).
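A sketch of one such step for the compactly supported profile (3.4), following the derivative formulas above (Python; the clipping that enforces the invertibility constraint −α < β < 3α and the handling of the degenerate case are our own safeguards):

import numpy as np

def radial_step(Y, logJ, x0, alpha):
    """One elementary map: localized radial expansion around x0 with the
    compactly supported profile (3.4), its parameter beta chosen by
    maximizing the local quadratic approximation to the log-likelihood."""
    m, n = Y.shape
    d = Y - x0
    r = np.linalg.norm(d, axis=1)
    f = np.where(r < alpha, (1.0 - r / alpha) ** 2 / alpha, 0.0)
    fp = np.where(r < alpha, -2.0 * (1.0 - r / alpha) / alpha ** 2, 0.0)

    # first and second derivatives of L at beta = 0
    dL = np.sum((n - d @ x0 - r ** 2) * f + r * fp)
    d2L = -np.sum((n + r ** 2) * f ** 2 + 2.0 * r * f * fp + (r * fp) ** 2)
    beta = -dL / d2L if d2L < 0 else 0.0
    # keep the map one-to-one: -alpha < beta < 3*alpha
    beta = np.clip(beta, -0.9 * alpha, 2.9 * alpha)

    # apply the map (3.1) and update the per-point log-Jacobian
    factor = 1.0 + beta * f
    Y_new = x0 + factor[:, None] * d
    logJ_new = logJ + (n - 1) * np.log(factor) + np.log(1.0 + beta * (f + r * fp))
    return Y_new, logJ_new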
A second family of building blocks applies a one-dimensional stretching of this kind to each coordinate separately, y^i(x) = x_0^i + (1 + β_i f_i)(x^i − x_0^i) with f_i = f(|x^i − x_0^i|), and one parameter β_i per coordinate. In this case

∂L/∂β_i |_{β=0} = ∑_j { f_i + (x_j^i − x_0^i) f_i′ − x_0^i f_i (x_j^i − x_0^i) − f_i (x_j^i − x_0^i)² }
and

∂²L/∂β_i² |_{β=0} = − ∑_j { [ f_i + (x_j^i − x_0^i) f_i′ ]² + f_i² (x_j^i − x_0^i)² },

so we must pick

β_i = − ( ∂L/∂β_i |_{β=0} ) / ( ∂²L/∂β_i² |_{β=0} ).
This family of maps is not isotropic, since it privileges the coordinate axes.
To restore isotropy, one can rotate the axes every time-step, through a random
orthogonal matrix. With this extra ingredient, this building block agrees with the
one originally implemented in [17]; the only differences are the specific form of
the stretching function, which in [17] was a more complex function depending on
three parameters per dimension, and the maximization procedure, which is carried
out here through a local quadratic approximation, not by first-order ascent of the
log-likelihood.
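One standard way to generate such a random orthogonal matrix is through the QR factorization of a matrix with independent Gaussian entries (a sketch; the sign correction makes the draw uniform over the orthogonal group):

import numpy as np

def random_rotation(n, rng=np.random.default_rng()):
    """Return a random n x n orthogonal matrix, used to rotate the axes
    between steps so that the coordinate-wise maps privilege no direction.
    Rotating the data leaves the target N(0, I) and the Jacobian unchanged."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    # fix the column signs so the distribution is uniform (Haar measure)
    return Q * np.sign(np.diag(R))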
where

L_i^j(x) = [ ∂f/∂x^i − f(x) x^i ] (x^j − x_0^j) + f(x) δ_i^j

and

Q_{ij}^{kl}(x) = f(x)² [ δ_i^l δ_j^k + δ_i^k (x^j − x_0^j)(x^l − x_0^l) ] + 2 f(x) δ_k^j (∂f/∂x^i)(x^l − x_0^l)
        + (∂f/∂x^i)(∂f/∂x^k)(x^j − x_0^j)(x^l − x_0^l).
Hence maximizing over A the quadratic approximation to the log-likelihood L =
∑_m log(ρ(x_m)) yields the system

∑_{kl} [ ∑_m ( Q_{ij}^{kl}(x_m) + Q_{kl}^{ij}(x_m) ) ] A_k^l = − ∑_m L_i^j(x_m).
Notice that this building block requires much more computational work than
the isotropic expansions, hence its use would only be justified if it yielded better
accuracy in a much smaller number of steps. We found in the experiments below
that this is not typically the case, so we conclude that simpler maps, with only a
handful of parameters β –such as the single one for the radial expansions– are to
be preferred.
4 Examples
In this section, we use some synthetic examples to illustrate the procedure and
to compare the efficiency of the various building blocks proposed above. In all
examples, we have used for preconditioning only the two steps in (2.6)-(2.7), which
re-center the observations at x = 0 and stretch them isotropically so as to produce
a unit average standard deviation.
As a first example, consider the two-dimensional probability density displayed
in figure 4.1, given by
(4.1)   ρ(x, y) ∝ e^{−θ²/2} e^{−(1/2)((r − 1)/0.1)²},
where r and θ are the radius and angle in polar coordinates: a distribution con-
centrated in a small neighborhood of the unit circle, with maximal density at
(x, y) = (1, 0). Such a distribution, with thin support and pronounced curvature,
would be hard to capture with any parametric approach. Yet the proposed algo-
rithm does a very good job, as shown in figure 4.2.
For the experiment displayed in figures 4.1 and 4.2, we have taken a sample of
size m = 1000, used the radial expansion in (3.1) with f (r) from (3.2), and adopted
a value n_p = 500 for the calculation of the length-scale α in (2.1). The Kullback-
Leibler divergence [9] between the exact and the estimated distributions, displayed
in the last panel of figure 4.1, is given by

(4.2)   KL = ∫ log( ρ_ex(y) / ρ_est(y) ) ρ_ex(y) dy,
which is integrated numerically on the same grid used for the plots, a set of points
carried passively by the algorithm, where ρest is known. Another possibility, much
more efficient in high dimensions, is to estimate KL through Monte Carlo simula-
tion:

(4.3)   KL ≈ (1/N) ∑_{j=1}^{N} [ log(ρ_ex(x_j)) − log(ρ_est(x_j)) ],

with the sample points x_j drawn from ρ_ex.
This also reveals the connection between the Kullback-Leibler divergence between
estimated and exact densities and the log-likelihood of the estimated density, which
the algorithm ascends.
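A sketch of this Monte Carlo estimate (Python; the interface, with callables returning log-densities and samples supplied by the caller, is our own choice):

import numpy as np

def kl_monte_carlo(samples_exact, log_rho_exact, log_rho_est):
    """Estimate KL(rho_exact || rho_est) as in (4.3): average, over samples
    drawn from the exact density, of log(rho_exact) - log(rho_est)."""
    x = np.asarray(samples_exact)
    return np.mean(log_rho_exact(x) - log_rho_est(x))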
Experimenting with the other building blocks proposed above yields entirely
similar results. We conclude that the radial expansions are to be preferred, since
their use is much less computationally intensive. Moreover, the simplicity of the
radial expansions brings in an extra degree of robustness, as revealed by a much
smaller sensitivity to the choice of n p , the only free parameter of the algorithm.
Next we compare the procedure developed here with Kernel density estima-
tion, the most popular non-parametric methodology in use [18]. We have adopted
Gaussian kernels of the form
(4.4)   K_h(x, y) = (2π)^{−n/2} h^{−n} e^{−(1/2) ||(y − x)/h||²},
and proposed the estimate
(4.5)   ρ(y) = (1/m) ∑_{j=1}^{m} K_h(x_j, y).
Hence each observation x j contributes to the local density within a neighborhood
whose size scales with h. This bandwidth plays a similar role to the n p of our
procedure, the typical number of points affected by each map.
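For reference, the kernel estimate (4.4)-(4.5) used in this comparison can be written compactly as follows (a sketch; a production code would typically rely on a dedicated kernel-density routine):

import numpy as np

def kde_gaussian(X, query, h):
    """Gaussian kernel density estimate (4.4)-(4.5): the average over the
    m observations X (shape (m, n)) of isotropic Gaussians of bandwidth h
    centered at each observation, evaluated at the query points (q, n)."""
    m, n = X.shape
    diff = query[:, None, :] - X[None, :, :]          # shape (q, m, n)
    sq = np.sum(diff ** 2, axis=2) / h ** 2
    norm = (2.0 * np.pi) ** (n / 2) * h ** n
    return np.mean(np.exp(-0.5 * sq), axis=1) / norm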
Figure 4.3 displays the results of applying both procedures, at different band-
widths, to a sample with m = 500 points from the distribution in (4.1). We have
picked two values for h and n p , one too large, slightly under-resolving the distribu-
tion, and one too small, slightly over-resolving it. Comparing the results from the
two procedures, we can make the following observations:
• Both procedures are robust, capturing the main features of the probabil-
ity density ρ(x), whereas most parametric approaches would have done
poorly.
• The mapping procedure yields smoother and tighter density profiles, and
correspondingly smaller values of the Kullback–Leibler divergence be-
tween the exact and estimated densities.
The computational costs of both procedures are comparable: estimating the density
at q points requires m × q evaluations of the kernel and ns × q map applications
respectively. Since the number ns of iterations before convergence scales with
the number m of observations, these two numbers of evaluations are of the same
order. The mapping procedure has the additional cost of determining the optimal
parameter β for each step, but this is comparatively unimportant when q is much
larger than m.
Beyond the comparison of effectiveness, which depends on the actual problem
in hand, one can describe the main differences between the two procedures:
• The estimated density is expressed in terms of the sum of kernel functions
in one case and of the composition of elementary maps in the other.
• In the implementations discussed here, Kernel density estimation is explicit
and deterministic, while there is a stochastic element to the choice of the
centers for the elementary maps.
• Kernel density estimation is conceptually simpler, while the normalizing
maps have a richer structure and more versatility.
• The kernels provide just an estimated density, while the new procedure also
produces a normalizing map. This can be used for a variety of purposes,
such as sampling.
Figures 4.4 and 4.5 show another two-dimensional experiment. In this case, the
proposed density is the mixture of two anisotropic Gaussians, and, for illustration,
the building block utilized is the general localized linear transformation in (3.6).
Notice in figure 4.5 a feature associated with the dual nature of the algorithm:
since the normalizing procedure cannot fully eliminate the gap between the two
Gaussians without over-resolving, as shown in the three bottom panels, the corre-
sponding density estimation, displayed in the top panels, cannot fully separate the
two. A kernel estimator would also be unable to fully separate the two components,
but here the reason would be more straightforward: a bandwidth h small enough to
separate them would over-resolve the estimation, particularly at the less populated
tails of the distribution.
The examples above are two-dimensional to facilitate their display, yet the full
power of the algorithm manifests itself in high-dimensional situations. Thus we
consider next the equal-weight mixture of two n-dimensional normal distributions,
centered at x = ±2e1 . Here we have used a sample of m = 1000 points, n p =
500, and again the isotropic radial expansion in (3.1,3.2). Figure 4.6 compares
the evolution of the Kullback-Leibler divergence between the exact and estimated
density in dimensions n = 2, n = 5 and n = 10. In order to enable a meaningful
radial expansions. The value of n = 10 is beyond the largest one might have hoped
to resolve with a sample of size m = 1000, since 2^{10} = 1024: one has on average one
observation per the 10-dimensional equivalent of a quadrant! Thus it is surprising
that the algorithm resolves this density so well.
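The sample for this experiment can be generated as follows (a sketch; the helper name and the fixed seed are ours):

import numpy as np

def mixture_sample(m, n, rng=np.random.default_rng(0)):
    """Draw m points in R^n from the equal-weight mixture of two isotropic
    normal distributions centered at +2*e1 and -2*e1."""
    centers = np.zeros((m, n))
    centers[:, 0] = np.where(rng.random(m) < 0.5, 2.0, -2.0)
    return centers + rng.standard_normal((m, n))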
5 Conclusions
We have developed a methodology for non-parametric density estimation. Based
on normalizing flows, the new procedure improves on the one developed in [17],
in that it is more robust and efficient in high dimensions, and ascends the log-
likelihood function through larger steps, based on a quadratic approximation rather
than gradient ascent. It requires only one external parameter, n p , with a clear in-
terpretation: the level of resolution sought, measured in number of observations
per localized feature of the estimated density. We have found that the simplest el-
ementary transformations, such as localized radial expansions, are also the most
efficient and robust building blocks from which to form the map that normalizes
the data points.
Density estimation appears often in applications as a tool for more specific
tasks. One advantage of the methodology developed here is its flexibility, which
allows for easy adaptation to such tasks. Thus, in [1], we have adapted the al-
gorithm from [17], a direct ancestor to the one in this paper, to do classification
and clustering. Along similar lines, projects under way employ variations of the
methodology proposed here to perform tasks as varied as medical diagnosis, re-
lating behavioral traits to neuron classes in worms, Monte Carlo simulation, time
series analysis, estimation of risk-neutral measures, and transportation theory. It is
in the context of these more specific procedures that examples with real data make the most sense.
Bibliography
[1] Agnelli, J. P. ; Cadeiras, M. ; Tabak, E. G. ; Turner, C. V. ; Vanden-Eijnden, E. Clustering and
classification through normalizing flows in feature space. SIAM MMS 8 (2010), 1784–1802.
[2] Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
[3] Botev, Z. I.; Grotowski, J. F. ; Kroese, D. P. Kernel density estimation via diffusion. Annals of
Statistics 38 (2010), 2916–2957.
[4] Chen, S. X. Probability density function estimation using Gamma kernels. Ann. Inst. Statist.
Math. 52 (2000), 471–480.
[5] Friedman, J. Exploratory projection pursuit. J. Amer. Statist. Assoc. 82 (1987), 249–266.
[6] Hall, P.; Racine, J.; Li, Q. Cross-validation and the estimation of conditional probability densi-
ties. J. Amer. Statist. Assoc. 99 (2004), 1015–1026.
[7] Hastie, T.; Tibshirani, R.; Friedman, J. The elements of statistical learning. Springer, 2001.
[8] Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selec-
tion. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence 2
(1995), 1137–1143.
[9] Kullback S. ; Leibler, R. A. On information and sufficiency. Annals of Math. Statistics 22
(1951), 79–86.
[10] Nelsen, R. B. An Introduction to Copulas. Springer, 1999.
[11] Park, B. U.; Marron, J. S. Comparison of data-driven bandwidth selectors. J. Amer. Statist.
Assoc. 85 (1990), 66–72.
[12] Parzen, E. On estimation of a probability density function and mode. Annals of Math. Stat. 33
(1962), 1065–1076.
[13] Rosenblatt, M. Remarks on some nonparametric estimates of a density function. Annals of
Math. Stat. 27 (1956), 832–837.
[14] Silverman, B.W. Density Estimation for Statistics and Data Analysis, London: Chapman &
Hall/CRC, 1998.
[15] Sklar, A. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris
8 (1959), 229–231.
[16] Smith, L. M.; Keegan, M. S.; Wittman, T.; Mohler, G. O.; Bertozzi, A. L. Improving Density
Estimation by Incorporating Spatial Information. EURASIP Journal on Advances in Signal
Processing, 2010 (2010), 1–12.
[17] Tabak, E. ; Vanden-Eijnden, E. Density estimation by dual ascent of the log-likelihood. Comm.
Math. Sci. 8 (2010), 217-233.
[18] Wasserman, L., All of nonparametric statistics. Springer, 2006.