However, new complex modelling and data paradigms are seriously challeng-
ing these established methodologies. First, the restriction of traditional MCMC to
reversible Markov chains is a serious limitation. It is now well understood both the-
oretically (Bierkens (2016), Chen and Hwang (2013), Duncan, Lelièvre and Pavli-
otis (2016), Hwang, Hwang-Ma and Sheu (1993), Rey-Bellet and Spiliopoulos
(2015)) and heuristically (Neal (1998)) that nonreversible chains offer potentially massive advantages over their reversible counterparts. The need to escape reversibility and to create momentum that aids mixing throughout the state space is well known, and it motivates a number of modern MCMC methods, including the popular Hamiltonian MCMC (Duane et al. (1987)).
A second major obstacle to the application of MCMC for Bayesian inference
is the need to process potentially massive data-sets. Since MH algorithms in their
pure form require a likelihood evaluation—and thus processing the full data-set—
at every iteration, it can be impractical to carry out large numbers of MH iterations.
This has led to a range of alternatives that use sub-samples of the data at each
iteration (Ma, Chen and Fox (2015), Maclaurin and Adams (2014), Quiroz, Villani
and Kohn (2015), Welling and Teh (2011)), or that partition the data into shards,
run MCMC on each shard and then attempt to combine the information from these
different MCMC runs (Li, Srivastava and Dunson (2017), Neiswanger, Wang and
Xing (2013), Scott et al. (2016), Wang and Dunson (2013)). However, most of
these methods introduce some form of approximation error, so that the final sample
will be drawn from some approximation to the posterior, and the quality of the
approximation can be impossible to evaluate. As an exception the Firefly algorithm
(Maclaurin and Adams (2014)) samples from the exact distribution of interest (but
see the comment below).
This paper introduces the multidimensional Zig-Zag sampling algorithm (ZZ)
and its variants. These methods overcome the restrictions of the lifted Markov
chain approach of Turitsyn, Chertkov and Vucelja (2011) as they do not depend
on the introduction of momentum generating quantities. They are also amenable
to the use of sub-sampling ideas. The dynamics of the Zig-Zag process depend on the target distribution only through the gradient of the logarithm of the target density. For Bayesian applications this gradient is a sum over data points, and so it is easy to estimate unbiasedly using sub-sampling. Moreover, Zig-Zag with Sub-Sampling (ZZ-SS) retains the exactness of
the required invariant distribution. Furthermore, if we also use control variate ideas
to reduce the variance of our sub-sampling estimator of the gradient, the resulting
Zig-Zag with Control Variates (ZZ-CV) algorithm has remarkable super-efficient
scaling properties for large data sets.
We will call an algorithm super-efficient if it is able to generate independent samples from the target distribution more efficiently than drawing independent samples directly from the target distribution, where each such draw costs an evaluation of all the data. The only situation we are aware of where super-efficient sampling can be implemented is with simple conjugate models, where the likelihood function has a low-dimensional summary statistic which can be evaluated at cost O(n), where n is the number
of observations, after which we can obtain independent samples from the poste-
rior distribution at a cost of O(1) by using the functional form of the posterior
distribution. The ZZ-CV can replicate this computational efficiency: After a pre-
computation of O(n), we are able to obtain independent samples at a cost of O(1).
In this sense it contrasts with the Firefly algorithm (Maclaurin and Adams (2014)), whose effective sample size (ESS) per datum decreases approximately as 1/n, where n is the size of the data, so that the gains of this algorithm do not increase with n; see Bouchard-Côté, Vollmer and Doucet (2017), Section 4.6.
This breakthrough is based upon the Zig-Zag process, a continuous-time piecewise deterministic Markov process (PDMP). Given a $d$-dimensional differentiable target density $\pi$, Zig-Zag is a continuous-time nonreversible stochastic process with continuous, piecewise linear trajectories on $\mathbb{R}^d$. It moves with constant velocity $\theta \in \{-1, +1\}^d$ until one of the velocity components switches sign. The event time and the choice of which direction to reverse are controlled by a collection of state-dependent switching rates $(\lambda_i)_{i=1}^d$, which in turn are constrained via an identity (2) which ensures that $\pi$ is a stationary distribution for the process. The process is intrinsically constructed in continuous time, and it can easily be simulated using standard Poisson thinning arguments, as we shall see in Section 3.
The use of PDMPs such as the Zig-Zag process is an exciting and mostly unexplored area in MCMC. The first occurrence of a PDMP for sampling purposes is in the computational physics literature (Peters and De With (2012)); in one dimension that method coincides with the Zig-Zag process. In Bouchard-Côté, Vollmer
and Doucet (2017), this method is given the name Bouncy Particle Sampler. In
multiple dimensions the Zig-Zag process and Bouncy Particle Sampler (BPS) are
different processes: Both are PDMPs which move along straight line segments,
but the Zig-Zag process changes direction in only a single component at each
switch, whereas the Bouncy Particle Sampler reflects the full direction vector in
the level curves of the density function. As we will see in Section 2.4, this differ-
ence has a beneficial effect on the ergodic properties of the Zig-Zag process. The
one-dimensional Zig-Zag process is analysed in detail in, for example, Bierkens
and Roberts (2017), Fontbona, Guérin and Malrieu (2012), Fontbona, Guérin and
Malrieu (2016), Monmarché (2014).
Since the first version of this paper was conceived, several other related theoretical and methodological papers have already appeared. In particular, we mention here results on exponential ergodicity of the BPS (Deligiannidis, Bouchard-Côté and Doucet (2017)) and ergodicity of the multidimensional Zig-Zag process
(Bierkens, Roberts and Zitt (2017)). The Zig-Zag process has the advantage that
it is ergodic under very mild conditions, which in particular means that we are
not required to choose a refreshment rate. At the same time, the BPS seems more
“natural”, in that it tries to minimise the bounce rate and the change in direction
at bounces, and it may be more efficient for this reason. However, it is a challenge
to make a direct comparison in efficiency of the two methods since the efficiency
depends both on the computational effort per unit of continuous time of the respec-
tive algorithms, as well as the mixing time of the underlying processes. Therefore,
we expect analysing the relative efficiency of PDMP based algorithms to be an
important area of continued research for years to come.
A continuous-time sequential Monte Carlo algorithm for scalable Bayesian inference with big data, the SCALE algorithm, is given in Pollock et al. (2016). Advantages that Zig-Zag has over SCALE are that it avoids the issue of controlling the stability of importance weights and that it is simpler to implement. The SCALE algorithm, on the other hand, is well adapted to parallel computing architectures and has particularly simple scaling properties for big data.
1.1. Notation. For a topological space $X$, let $\mathcal{B}(X)$ denote the Borel $\sigma$-algebra. We write $\mathbb{R}_+ := [0, \infty)$. If $h : \mathbb{R}^d \to \mathbb{R}$ is differentiable, then $\partial_i h$ denotes the function $\xi \mapsto \partial h(\xi)/\partial \xi_i$. We equip $E := \mathbb{R}^d \times \{-1, +1\}^d$ with the product topology of the Euclidean topology on $\mathbb{R}^d$ and the discrete topology on $\{-1, +1\}^d$. Elements in $E$ will often be denoted by $(\xi, \theta)$ with $\xi \in \mathbb{R}^d$ and $\theta \in \{-1, +1\}^d$. For $g : E \to \mathbb{R}$ differentiable in its first argument, we will use $\partial_i g$ to denote the function $(\xi, \theta) \mapsto \partial g(\xi, \theta)/\partial \xi_i$, $i = 1, \ldots, d$.
2.2. Invariant distribution. The most important aspect of the Zig-Zag process
is that in many cases the switching rates are directly related to an easily identifiable
invariant distribution. Let $C^1(\mathbb{R}^d)$ denote the space of continuously differentiable functions on $\mathbb{R}^d$. For $\theta \in \{-1, +1\}^d$ and $i \in \{1, \ldots, d\}$, let $F_i[\theta] \in \{-1, +1\}^d$ denote the binary vector obtained by flipping the $i$-th component of $\theta$, that is,
$$(F_i[\theta])_j = \begin{cases} \theta_j & \text{if } j \neq i, \\ -\theta_j & \text{if } j = i. \end{cases}$$
We introduce the following assumption.
FIG. 1. Top two rows: example trajectories of the canonical Zig-Zag process. In (a) and (b) the horizontal axis shows time and the vertical axis the $\Xi$-coordinate of the 1D process. In (c) and (d), the trajectories in $\mathbb{R}^2$ of $(\Xi_1, \Xi_2)$ are plotted. Bottom row: Zig-Zag process (e) and MALA (f) for a Cauchy target with both processes started in the tail.
Throughout this paper, we will refer to $\Psi$ as the negative log density. Let $\mu_0$ denote the measure on $\mathcal{B}(E)$ such that, for $A \in \mathcal{B}(\mathbb{R}^d)$ and $\theta \in \{-1, +1\}^d$, $\mu_0(A \times \{\theta\}) = \mathrm{Leb}(A)$, with $\mathrm{Leb}$ denoting Lebesgue measure on $\mathbb{R}^d$.
THEOREM 2.2. Suppose Assumption 2.1 holds. Let $\mu$ denote the probability distribution on $E$ such that $\mu$ has Radon–Nikodym derivative
$$(3)\qquad \frac{d\mu}{d\mu_0}(\xi, \theta) = \frac{\exp(-\Psi(\xi))}{Z}, \quad (\xi, \theta) \in E,$$
where $Z = \int_E \exp(-\Psi) \, d\mu_0$. Then the Zig-Zag process $(\Xi, \Theta)$ with switching rates $(\lambda_i)_{i=1}^d$ has invariant distribution $\mu$.
2.3. Zig-Zag process for Bayesian inference. One application of the Zig-Zag
process is as an alternative to MCMC for sampling from posterior distributions in
Bayesian statistics. We show here that it is straightforward to derive a class of Zig-
Zag processes that have a given posterior distribution as their invariant distribution.
The dynamics of the Zig-Zag process only depend on knowing the posterior den-
sity up to a constant of proportionality.
To keep notation consistent with that used for the Zig-Zag process, let ξ ∈ Rd
denote a vector of continuous parameters. We are given a prior density function
for $\xi$, which we denote by $\pi_0(\xi)$, and observations $x^{1:n} = (x^1, \ldots, x^n)$. Our model for the data defines a likelihood function $L(x^{1:n} \mid \xi)$. Thus the posterior density function is
$$\pi(\xi) \propto \pi_0(\xi) L(x^{1:n} \mid \xi).$$
We can write $\pi(\xi)$ in the form of the previous section,
$$\pi(\xi) = \frac{1}{Z} \exp(-\Psi(\xi)), \quad \xi \in \mathbb{R}^d,$$
where $\Psi(\xi) = -\log \pi_0(\xi) - \log L(x^{1:n} \mid \xi)$ and $Z = \int_{\mathbb{R}^d} \exp(-\Psi(\xi)) \, d\xi$ is the unknown normalising constant. Now, assuming that $\log \pi_0(\xi)$ and $\log L(x^{1:n} \mid \xi)$ are both continuously differentiable with respect to $\xi$, from (4) a Zig-Zag process with rates
$$\lambda_i(\xi, \theta) = \left(\theta_i \partial_i \Psi(\xi)\right)^+$$
will have the posterior density $\pi(\xi)$ as the marginal of its invariant distribution. We call the process with these rates the Canonical Zig-Zag process for the negative log density $\Psi$. As explained in Proposition 2.3, we can construct a family of Zig-Zag processes with the same invariant distribution by choosing any set of functions $\gamma_i(\xi, \theta)$, for $i = 1, \ldots, d$, which take nonnegative values and for which $\gamma_i(\xi, \theta) = \gamma_i(\xi, F_i[\theta])$, and setting
$$\lambda_i(\xi, \theta) = \left(\theta_i \partial_i \Psi(\xi)\right)^+ + \gamma_i(\xi, \theta) \quad \text{for } i = 1, \ldots, d.$$
The intuition here is that λi (ξ, θ ) is the rate at which we transition from θ to Fi [θ ].
The condition γi (ξ, θ ) = γi (ξ, Fi [θ ]) means that we increase by the same amount
both the rate at which we will transition from θ to Fi [θ ] and vice versa. As our
invariant distribution places the same probability of being in a state with velocity
θ as that of being in state Fi [θ ], these two changes in rate cancel out in terms of
their effect on the invariant distribution. Changing the rates in this way does impact
the dynamics of the process, with larger γi values corresponding to more frequent
changes in the velocity of the Zig-Zag process, and we would expect the resulting
process to mix more slowly.
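To fix ideas, the following is a minimal sketch (not the authors' implementation) of the switching-rate computation just described; the functions grad_psi and gamma are user-supplied placeholders assumed here, with grad_psi returning the gradient of the negative log density $\Psi$.

```python
import numpy as np

def switching_rates(xi, theta, grad_psi, gamma=None):
    """Zig-Zag rates: lambda_i(xi, theta) = (theta_i * dPsi/dxi_i)^+ + gamma_i(xi, theta).

    grad_psi(xi) -> array of shape (d,): gradient of the negative log density Psi.
    gamma(xi, theta) -> array of shape (d,) of nonnegative excess switching rates
    satisfying gamma_i(xi, theta) = gamma_i(xi, F_i[theta]); gamma = None gives
    the canonical Zig-Zag process.
    """
    rates = np.maximum(theta * grad_psi(xi), 0.0)  # canonical part
    if gamma is not None:
        rates = rates + gamma(xi, theta)           # excess switching intensity
    return rates
```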
Under the assumption that the Zig-Zag process has the desired invariant distri-
bution and is ergodic, it follows from the Birkhoff ergodic theorem that for any
bounded continuous function $f : E \to \mathbb{R}$,
$$\lim_{t \to \infty} \frac{1}{t} \int_0^t f(\Xi(s), \Theta(s)) \, ds = \int_E f \, d\mu,$$
for any initial condition (ξ, θ ) ∈ E. Sufficient conditions for ergodicity will be dis-
cussed in the following section. Taking γ to be positive and bounded everywhere
ensures ergodicity, as will be established in Theorem 2.10.
2.4. Ergodicity of the Zig-Zag process. We have established in Section 2.2 that for any continuously differentiable, positive density $\pi$ on $\mathbb{R}^d$ a Zig-Zag process can be constructed that has $\pi$ as its marginal stationary density. In order for ergodic averages $\frac{1}{T} \int_0^T f(\Xi(s)) \, ds$ of the Zig-Zag process to converge asymptotically to $\pi(f)$, we further require $(\Xi(t), \Theta(t))$ to be ergodic, that is, to admit a unique invariant distribution.
Ergodicity is directly related to the requirement that $(\Xi(t), \Theta(t))$ is irreducible, that is, the state space is not reducible into components which are each invariant for the process $(\Xi(t), \Theta(t))$. For the one-dimensional Zig-Zag process, (exponential) ergodicity has already been established under mild conditions (Bierkens and Roberts (2017)). As we discuss below, irreducibility, and thus ergodicity, can be established for large classes of multidimensional target distributions, such as i.i.d. Gaussian distributions, and also if the switching rates $\lambda_i(\xi, \theta)$ are positive for all $i = 1, \ldots, d$ and $(\xi, \theta) \in E$.
Let $P^t((\xi, \theta), \cdot)$ be the transition kernel of the Zig-Zag process with initial condition $(\xi, \theta)$. A function $f : E \to \mathbb{R}$ is called norm-like if $\lim_{\|\xi\| \to \infty} f(\xi, \theta) = \infty$ for all $\theta \in \{-1, +1\}^d$. Let $\|\cdot\|_{\mathrm{TV}}$ denote the total variation norm on the space of signed measures. First, we consider the one-dimensional case.
PROPOSITION 2.5 (Bierkens and Roberts (2017), Theorem 5). Suppose Assumption 2.4 holds. Then there exists a function $f : E \to [1, \infty)$ which is norm-like such that the Zig-Zag process is $f$-exponentially ergodic, that is, there exist constants $\kappa > 0$ and $0 < \rho < 1$ such that
$$\|P^t((\xi, \theta), \cdot) - \pi\|_{\mathrm{TV}} \le \kappa f(\xi, \theta) \rho^t \quad \text{for all } (\xi, \theta) \in E \text{ and } t \ge 0.$$
Suppose $P(x, dy)$ is the transition kernel of a Markov chain on a state space $E$. We say that the Markov chain associated to $P$ is mixing if there exists a probability distribution $\pi$ on $E$ such that
$$\lim_{k \to \infty} \|P^k(x, \cdot) - \pi\|_{\mathrm{TV}} = 0 \quad \text{for all } x \in E.$$
For any continuous-time Markov process with family of transition kernels $P^t(x, dy)$, we can consider the associated time-discretized process, which is a Markov chain with transition kernel $Q(x, dy) := P^\delta(x, dy)$ for a fixed $\delta > 0$. The value of $\delta$ will be of no significance in our use of this construction.
EXAMPLE 2.9. Continuing Example 2.6, consider the simple case in which $\pi$ is of product form with each $\pi_i$ a centered Gaussian density function with variance $\sigma_i^2$. It follows from Proposition 2.8 and Example 2.6 that the multidimensional canonical Zig-Zag process (i.e., the Zig-Zag process with $\gamma_i \equiv 0$) is mixing. This is different from the Bouncy Particle Sampler (Bouchard-Côté, Vollmer and Doucet (2017)), which is not ergodic for an i.i.d. Gaussian without "refreshments" of the momentum variable.
The proof of this result consists of a Girsanov change of measure with respect to
a Zig-Zag process targeting an i.i.d. standard normal distribution, which we know
to be irreducible. The irreducibility then carries over to the Zig-Zag process with
the stated switching rates. A detailed proof can be found in the Supplementary
Material.
Algorithm 1.
1. $(T^0, \Xi^0, \Theta^0) := (0, \xi, \theta)$.
2. For $k = 1, 2, \ldots$
   (a) Define $m_i(t) := \lambda_i(\Xi^{k-1} + \Theta^{k-1} t, \Theta^{k-1})$ for $t \ge 0$ and $i = 1, \ldots, d$.
   (b) For $i = 1, \ldots, d$, let $(M_i)$ denote computational bounds for $(m_i)$.
   (c) Draw $\tau_1, \ldots, \tau_d$ such that $\mathbb{P}(\tau_i \ge t) = \exp(-\int_0^t M_i(s) \, ds)$.
   (d) $i_0 := \arg\min_{i=1,\ldots,d} \{\tau_i\}$ and $\tau := \tau_{i_0}$.
   (e) $(T^k, \Xi^k) := (T^{k-1} + \tau, \Xi^{k-1} + \Theta^{k-1} \tau)$.
   (f) With probability $m_{i_0}(\tau)/M_{i_0}(\tau)$, set $\Theta^k := F_{i_0}[\Theta^{k-1}]$; otherwise set $\Theta^k := \Theta^{k-1}$.
3.2. Example: Globally bounded log density gradient. If there are constants $c_i > 0$ such that $\sup_{\xi \in \mathbb{R}^d} |\partial_i \Psi(\xi)| \le c_i$, $i = 1, \ldots, d$, then we can use the global upper bounds $M_i(t) = c_i$ for $t \ge 0$. Indeed, for $(\xi, \theta) \in E$,
$$\lambda_i(\xi, \theta) = \left(\theta_i \partial_i \Psi(\xi)\right)^+ \le |\partial_i \Psi(\xi)| \le c_i.$$
Algorithm 1 may be used with $M_i \equiv c_i$ for $i = 1, \ldots, d$ at every iteration. This situation arises with heavy-tailed distributions, for example, if $\pi$ is Cauchy, then $\Psi(\xi) = \log(1 + \xi^2)$, and consequently $\lambda(\xi, \theta) = \left(\frac{2\theta\xi}{1+\xi^2}\right)^+ \le 1$.
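To make this concrete, here is a minimal sketch (an illustrative reconstruction, not the authors' code) of Algorithm 1 in one dimension for this Cauchy example, using the constant computational bound $M \equiv 1$:

```python
import numpy as np

rng = np.random.default_rng(1)

def cauchy_rate(xi, theta):
    # Canonical switching rate for the standard Cauchy target:
    # lambda(xi, theta) = (theta * dPsi/dxi)^+ with Psi(xi) = log(1 + xi^2).
    return max(theta * 2.0 * xi / (1.0 + xi * xi), 0.0)

def zigzag_cauchy(t_max, xi=0.0, theta=1.0, bound=1.0):
    """Simulate the 1D Zig-Zag skeleton for a Cauchy target up to time t_max,
    using Poisson thinning with the constant computational bound M = 1."""
    t = 0.0
    skeleton = [(t, xi, theta)]
    while t < t_max:
        tau = rng.exponential(1.0 / bound)   # proposed event time from rate M
        t += tau
        xi += theta * tau                    # deterministic linear motion
        if rng.uniform() < cauchy_rate(xi, theta) / bound:
            theta = -theta                   # accept the switch (thinning step)
        skeleton.append((t, xi, theta))
    return skeleton

skeleton = zigzag_cauchy(t_max=1000.0)
print(len(skeleton), skeleton[-1])
```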
3.3. Example: Negative log density with dominated Hessian. Another important case is when there exists a positive definite matrix $Q \in \mathbb{R}^{d \times d}$ which dominates the Hessian $H(\xi)$ in the positive definite ordering of matrices for every $\xi \in \mathbb{R}^d$. Here, $H(\xi) = (\partial_i \partial_j \Psi(\xi))_{i,j=1}^d$ denotes the Hessian of $\Psi$.
Denote the Euclidean inner product in $\mathbb{R}^d$ by $\langle \cdot, \cdot \rangle$. For $p \in [1, \infty]$, the $\ell^p$-norm on $\mathbb{R}^d$ and the induced matrix norms are both denoted by $\|\cdot\|_p$. For symmetric matrices $S, T \in \mathbb{R}^{d \times d}$, we write $S \preceq T$ if $\langle v, Sv \rangle \le \langle v, Tv \rangle$ for every $v \in \mathbb{R}^d$, or in words, if $T$ dominates $S$ in the positive definite ordering. The key assumption is that $H(\xi) \preceq Q$ for all $\xi \in \mathbb{R}^d$, where $Q \in \mathbb{R}^{d \times d}$ is positive definite. In particular, if $\|H(\xi)\|_2 \le c$ for all $\xi$, then this holds for $Q = cI$. We let $(e_i)_{i=1}^d$ denote the canonical basis vectors in $\mathbb{R}^d$.
For an initial value $(\xi, \theta) \in E$, we move along the trajectory $t \mapsto \xi(t) := \xi + \theta t$. Let $a_i$ denote an upper bound for $\theta_i \partial_i \Psi(\xi)$, $i = 1, \ldots, d$, and let $b_i := \sqrt{d}\,\|Q e_i\|_2$. For general symmetric matrices $S, T$ with $S \preceq T$, we have for any $v, w \in \mathbb{R}^d$ that
$$(6)\qquad \langle v, S w \rangle \le \|v\|_2 \|S w\|_2 \le \|v\|_2 \|T w\|_2.$$
Applying this inequality, we obtain for $i = 1, \ldots, d$
$$\theta_i \partial_i \Psi(\xi(t)) = \theta_i \partial_i \Psi(\xi) + \int_0^t \sum_{j=1}^d \partial_i \partial_j \Psi(\xi(s))\, \theta_j \, ds \le a_i + \int_0^t \langle H(\xi(s)) e_i, \theta \rangle \, ds \le a_i + \int_0^t \|Q e_i\|_2 \|\theta\|_2 \, ds = a_i + b_i t.$$
Algorithm 2.
1. $(T^0, \Xi^0, \Theta^0) := (0, \xi, \theta)$.
2. $a_i := \theta_i \partial_i \Psi(\xi)$, $i = 1, \ldots, d$.
3. $b_i := \sqrt{d}\,\|Q e_i\|_2$, $i = 1, \ldots, d$.
4. For $k = 1, 2, \ldots$
   (a) Draw $\tau_i$ such that $\mathbb{P}(\tau_i \ge t) = \exp(-\int_0^t (a_i + b_i s)^+ \, ds)$, $i = 1, \ldots, d$.
   (b) $i_0 := \arg\min_{i \in \{1,\ldots,d\}} \tau_i$ and $\tau := \tau_{i_0}$.
   (c) $(T^k, \Xi^k, \Theta^k) := (T^{k-1} + \tau, \Xi^{k-1} + \Theta^{k-1} \tau, \Theta^{k-1})$.
   (d) $a_i := a_i + b_i \tau$, $i = 1, \ldots, d$.
   (e) With probability $\left(\Theta_{i_0}^{k-1} \partial_{i_0} \Psi(\Xi^k)\right)^+ / (a_{i_0})^+$, set $\Theta^k := F_{i_0}[\Theta^{k-1}]$; otherwise $\Theta^k := \Theta^{k-1}$.
   (f) $a_{i_0} := \Theta_{i_0}^{k-1} \partial_{i_0} \Psi(\Xi^k)$ (re-using the earlier computation).
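Step 4(a) amounts to simulating the first event of an inhomogeneous Poisson process whose rate is the affine bound $(a + b s)^+$, which can be done exactly by inverting the integrated rate. A minimal sketch (assuming $b \ge 0$, as produced by the construction above):

```python
import math
import random

def first_event_time(a, b, u=None):
    """Sample T with P(T >= t) = exp(-integral_0^t (a + b*s)^+ ds), assuming b >= 0.
    This is the inversion used in step 4(a) of Algorithm 2 (affine bounds)."""
    if u is None:
        u = random.random()
    e = -math.log(u)                     # Exp(1) variate
    if b == 0.0:
        return math.inf if a <= 0.0 else e / a
    if a >= 0.0:
        # solve a*t + b*t^2/2 = e for t
        return (-a + math.sqrt(a * a + 2.0 * b * e)) / b
    # rate is zero until t0 = -a/b, then grows linearly: solve b*(t - t0)^2/2 = e
    t0 = -a / b
    return t0 + math.sqrt(2.0 * e / b)
```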
with $(E^j)_{j=1}^n$ continuous functions mapping $\mathbb{R}^d$ into $\mathbb{R}^d$. The motivation for considering such a class of density functions is the problem of sampling from a posterior distribution for big data. The key feature of such posteriors is that they can be written as the product of a large number of terms. For example, consider the simplest example of this, where we have $n$ independent data points $(x^j)_{j=1}^n$ and for which the likelihood function is $L(\xi) = \prod_{j=1}^n f(x^j \mid \xi)$, for some probability density or probability mass function $f$. In this case, we can write the negative log
4.1. Main idea. Let $(\xi(t))_{t \ge 0}$ denote a linear trajectory originating in $(\xi, \theta) \in E$, so $\xi(t) = \xi + \theta t$. Define a collection of switching rates along the trajectory $(\xi(t))$ by
$$m_i^j(t) := \left(\theta_i E_i^j(\xi(t))\right)^+, \quad i = 1, \ldots, d, \ j = 1, \ldots, n, \ t \ge 0.$$
We will make use of computational bounds $(M_i)$ as before, which this time bound $(m_i^j)$ uniformly. Let $M_i : \mathbb{R}_+ \to \mathbb{R}_+$ be continuous and satisfy
$$(9)\qquad m_i^j(t) \le M_i(t) \quad \text{for all } i = 1, \ldots, d, \ j = 1, \ldots, n, \text{ and } t \ge 0.$$
We will generate random times according to the computational upper bounds $(M_i)$ as before. However, we now use a two-step approach to deciding whether to switch or not at the generated times. As before, for $i = 1, \ldots, d$ let $(\tau_i)_{i=1}^d$ be simulated random times for which $\mathbb{P}(\tau_i \ge t) = \exp(-\int_0^t M_i(s) \, ds)$, and let $i_0 := \arg\min_{i \in \{1,\ldots,d\}} \tau_i$ and $\tau := \tau_{i_0}$. Then switch component $i_0$ of $\theta$ with probability $m_{i_0}^J(\tau)/M_{i_0}(\tau)$, where $J \in \{1, \ldots, n\}$ is drawn uniformly at random, independently of $\tau$. This "sub-sampling" procedure is detailed in Algorithm 3. Depending on the choice of $E_i^j$, we will refer to this algorithm as Zig-Zag with Sub-Sampling (ZZ-SS, Section 4.2) or ZZ-CV (Section 4.3).
Algorithm 3.
1. $(T^0, \Xi^0, \Theta^0) := (0, \xi, \theta)$.
2. For $k = 1, 2, \ldots$
   (a) Define $m_i^j(t) := \left(\Theta_i^{k-1} E_i^j(\Xi^{k-1} + \Theta^{k-1} t)\right)^+$ for $t \ge 0$, $i = 1, \ldots, d$ and $j = 1, \ldots, n$.
   (b) For $i = 1, \ldots, d$, let $(M_i)$ denote computational bounds for $(m_i^j)$, satisfying (9).
   (c) Draw $\tau_1, \ldots, \tau_d$ such that $\mathbb{P}(\tau_i \ge t) = \exp(-\int_0^t M_i(s) \, ds)$.
   (d) $i_0 := \arg\min_{i=1,\ldots,d} \tau_i$ and $\tau := \tau_{i_0}$.
   (e) $(T^k, \Xi^k) := (T^{k-1} + \tau, \Xi^{k-1} + \Theta^{k-1} \tau)$.
   (f) Draw $J \sim \mathrm{Uniform}(\{1, \ldots, n\})$.
   (g) With probability $m_{i_0}^J(\tau)/M_{i_0}(\tau)$, set $\Theta^k := F_{i_0}[\Theta^{k-1}]$; otherwise $\Theta^k := \Theta^{k-1}$.
$$= \frac{1}{n} \sum_{j=1}^n \theta_i E_i^j(\xi) = \theta_i \partial_i \Psi(\xi).$$
By Theorem 2.2, the Zig-Zag process has the stated invariant distribution.
4.2. Zig-Zag with Sub-Sampling (ZZ-SS) for globally bounded log density gradient. A straightforward application of sub-sampling is possible if we have (8) with $\nabla \Psi^j$ globally bounded, so that there exist positive constants $(c_i)$ such that
$$(11)\qquad |\partial_i \Psi^j(\xi)| \le c_i, \quad i = 1, \ldots, d, \ j = 1, \ldots, n, \ \xi \in \mathbb{R}^d.$$
In this case, we may take
$$E_i^j := \partial_i \Psi^j \quad \text{and} \quad M_i(t) := c_i, \quad i = 1, \ldots, d, \ j = 1, \ldots, n, \ t \ge 0,$$
so that (9) is satisfied. The corresponding version of Algorithm 3 will be called Zig-Zag with Sub-Sampling (ZZ-SS).
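A minimal sketch of ZZ-SS as just described (Algorithm 3 with constant bounds $M_i \equiv c_i$ and $E_i^j = \partial_i \Psi^j$); the function grad_term and its calling convention are assumptions of this sketch, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def zigzag_ss(grad_term, n_data, c, t_max, xi0, theta0):
    """Sketch of ZZ-SS: Algorithm 3 with constant bounds M_i = c_i.

    grad_term(j, xi) -> array of shape (d,): gradient of the j-th term Psi^j at xi
    (hypothetical interface assumed here);
    c: array of shape (d,) with |dPsi^j/dxi_i| <= c_i for all j and xi.
    """
    c = np.asarray(c, dtype=float)
    xi, theta = np.array(xi0, float), np.array(theta0, float)
    t, skeleton = 0.0, [(0.0, xi.copy(), theta.copy())]
    while t < t_max:
        taus = rng.exponential(1.0 / c)         # proposed times from the constant bounds
        i0 = int(np.argmin(taus))
        tau = taus[i0]
        t += tau
        xi += theta * tau                       # deterministic linear motion
        j = rng.integers(n_data)                # uniform sub-sample of one datum
        m = max(theta[i0] * grad_term(j, xi)[i0], 0.0)
        if rng.uniform() < m / c[i0]:           # thinning with the sub-sampled rate
            theta[i0] = -theta[i0]
        skeleton.append((t, xi.copy(), theta.copy()))
    return skeleton
```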
4.3. Zig-Zag with Control Variates (ZZ-CV). Suppose again that $\Psi$ admits the representation (8), and further suppose that the derivatives $(\partial_i \Psi^j)$ are globally and uniformly Lipschitz, that is, there exist constants $(C_i)_{i=1}^d$ such that for some $p \in [1, \infty]$ and all $i = 1, \ldots, d$, $j = 1, \ldots, n$ and $\xi_1, \xi_2 \in \mathbb{R}^d$,
$$(12)\qquad |\partial_i \Psi^j(\xi_1) - \partial_i \Psi^j(\xi_2)| \le C_i \|\xi_1 - \xi_2\|_p.$$
$$M_i(t) := a_i + b_i t, \quad t \ge 0, \ i = 1, \ldots, d,$$
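To illustrate how the control variates enter, the following sketch constructs sub-sampled gradient estimators around a fixed reference point $\hat\xi$. The specific form, $E_i^j(\xi) = \partial_i \Psi(\hat\xi) + \partial_i \Psi^j(\xi) - \partial_i \Psi^j(\hat\xi)$, is an assumption made for this sketch; it is chosen so that averaging over $j$ recovers $\partial_i \Psi(\xi)$ exactly, which is the property required of the $E_i^j$ in Section 4.1, and it requires only an $O(n)$ pre-computation at $\hat\xi$.

```python
import numpy as np

def make_zzcv_estimator(grad_term, n_data, xi_ref):
    """Sketch of a control-variate gradient estimator of the type used by ZZ-CV.

    Assumptions of this sketch: Psi(xi) = (1/n) * sum_j Psi^j(xi), and
    grad_term(j, xi) returns the gradient of Psi^j at xi (hypothetical interface).
    The estimator E^j(xi) = grad Psi(xi_ref) + grad Psi^j(xi) - grad Psi^j(xi_ref)
    averages over j to the exact gradient grad Psi(xi).
    """
    # O(n) pre-computation: per-term gradients and the full gradient at xi_ref
    ref_grads = np.array([grad_term(j, xi_ref) for j in range(n_data)])
    full_ref_grad = ref_grads.mean(axis=0)

    def estimator(j, xi):
        # unbiased (over uniform j) estimate of the full gradient at xi
        return full_ref_grad + grad_term(j, xi) - ref_grads[j]

    return estimator
```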
5.1. Scaling of Zig-Zag sampling (ZZ). First, let us obtain a Taylor expansion of the switching rate for $\xi$ close to $\hat\xi$. We have
$$\partial_{\xi_i} \Psi(\xi) = -\sum_{j=1}^n \partial_{\xi_i} \log f(x^j \mid \xi) = -\underbrace{\sum_{j=1}^n \partial_{\xi_i} \log f(x^j \mid \hat\xi)}_{=0} - \sum_{j=1}^n \sum_{k=1}^d \partial_{\xi_i} \partial_{\xi_k} \log f(x^j \mid \hat\xi)\,(\xi_k - \hat\xi_k) + O\left(\|\xi - \hat\xi\|^2\right).$$
The first term vanishes by the definition of the MLE. Expressed in terms of $\phi$, the switching rates are
$$\left(\theta_i \partial_{\xi_i} \Psi(\xi(\phi))\right)^+ = \underbrace{\left(-\frac{\theta_i}{\sqrt{n}} \sum_{j=1}^n \sum_{k=1}^d \partial_{\xi_i} \partial_{\xi_k} \log f(x^j \mid \hat\xi)\, \phi_k + O\left(\frac{\|\phi\|^2}{n}\right)\right)^+}_{O(\sqrt{n})}.$$
With respect to the coordinate $\phi$, the canonical Zig-Zag process has constant speed $\sqrt{n}$ in each coordinate and, by the above computation, a switching rate of $O(\sqrt{n})$. After a rescaling of the time parameter by a factor $\sqrt{n}$, the process in the $\phi$-coordinate becomes a Zig-Zag process with unit speed in every direction and switching rates
$$\left(-\frac{\theta_i}{n} \sum_{j=1}^n \sum_{k=1}^d \partial_{\xi_i} \partial_{\xi_k} \log f(x^j \mid \hat\xi)\, \phi_k\right)^+ + O\left(n^{-1/2}\right).$$
As $n \to \infty$, these rates converge to $\left(\theta_i \sum_{k=1}^d i(\xi_0)_{ik}\, \phi_k\right)^+$, where $i(\xi_0)$ denotes the expected information. These switching rates correspond to the limiting Gaussian distribution with covariance matrix $i(\xi_0)^{-1}$.
In this limiting Zig-Zag process, all dependence on n has vanished. Starting
from equilibrium, we require a time interval of O(1) (in the rescaled time) to ob-
tain an essentially independent sample. In the original time scale, this corresponds
to a time interval of O(n−1/2 ). As long as the computational bound in the Zig-
Zag algorithm is O(n1/2 ), this can be achieved using O(1) proposed switches.
The computational cost for every proposed switch is O(n), because the full data
(x i )ni=1 needs to be processed in the computation of the true switching rate at the
proposed switching time.
We conclude that the computational complexity of the Zig-Zag (ZZ) algo-
rithm per independent sample is O(n), provided that the computational bound
is O(n1/2 ). This is the best we can expect for any standard Monte Carlo algorithm
[where we will have a O(1) number of iterations, but each iteration is O(n) in
computational cost].
To compare, if the computational bound is O(nα ) for some α > 1/2, then we re-
quire O(nα−1/2 ) proposed switches before we have simulated a total time interval
of length O(n−1/2 ), so that, with a complexity of O(n) per proposed switching
time, the Zig-Zag algorithm has total computational complexity O(nα+1/2 ). So,
for example, with global bounds we have that the computational bound is O(n)
[as each term in the log density is O(1)], and hence ZZ will have total computa-
tional complexity of O(n3/2 ).
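The accounting above can be condensed into a single line; with $\alpha$ denoting the growth exponent of the computational bound (so the bound is $O(n^\alpha)$, $\alpha \ge 1/2$), the cost per essentially independent sample is
$$\underbrace{O\!\left(n^{\alpha - 1/2}\right)}_{\text{proposed switches}} \times \underbrace{O(n)}_{\text{cost per proposal}} = O\!\left(n^{\alpha + 1/2}\right),$$
which recovers $O(n)$ for $\alpha = 1/2$ and $O(n^{3/2})$ for the global bounds with $\alpha = 1$.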
5.2. Scaling of Zig-Zag with Control Variates (ZZ-CV). Now we will study the limiting behaviour as $n \to \infty$ of ZZ-CV, introduced in Section 4.3. In determining the computational bounds, we take $p = 2$ (e.g., in (12)) for simplicity. Also for simplicity, assume that $\xi \mapsto \partial_{\xi_i} \log f(x^j \mid \xi)$ has Lipschitz constant $k_i$ (independent of $j = 1, \ldots, n$) and write $C_i = n k_i$, so that (12) is satisfied. In practice, there may be a logarithmic increase with $n$ in the Lipschitz constants $k_i$, as we have to take a global bound in $n$. For the present discussion, we ignore such logarithmic factors.
We find that $|E_i^j(\xi)| = O(n^{1/2})$ under the stationary distribution.
By slowing down the Zig-Zag process in $\phi$-space by a factor $\sqrt{n}$, the continuous-time process generated by ZZ-CV will approach a limiting Zig-Zag process with a certain switching rate of $O(1)$. In general, this switching rate will depend on the way that $\hat\xi$ is obtained. To simplify the exposition, in the following computation we assume that $\hat\xi$ is equal to the MLE. Rescaling by $n^{-1/2}$, and developing a Taylor approximation around $\hat\xi$,
$$n^{-1/2} E_i^j(\xi) = n^{-1/2}\left(\partial_{\xi_i} \Psi^j(\xi) - \partial_{\xi_i} \Psi^j(\hat\xi)\right) = n^{-1/2}\left(-n\,\partial_{\xi_i} \log f(x^j \mid \xi) + n\,\partial_{\xi_i} \log f(x^j \mid \hat\xi)\right)$$
$$= -n^{1/2} \sum_{k=1}^d \partial_{\xi_i} \partial_{\xi_k} \log f(x^j \mid \hat\xi)\,(\xi_k - \hat\xi_k) + O\left(n^{1/2} \|\xi - \hat\xi\|^2\right) = -\sum_{k=1}^d \partial_{\xi_i} \partial_{\xi_k} \log f(x^j \mid \hat\xi)\, \phi_k + O\left(n^{-1/2}\right).$$
By Theorem 4.1, the rescaled effective switching rate for ZZ-CV is given by
$$\lambda_i(\phi, \theta) := n^{-1/2} \lambda_i(\xi(\phi), \theta) = \frac{1}{n^{3/2}} \sum_{j=1}^n \left(\theta_i E_i^j(\xi(\phi))\right)^+ = \frac{1}{n} \sum_{j=1}^n \left(-\theta_i \sum_{k=1}^d \partial_{\xi_i} \partial_{\xi_k} \log f(x^j \mid \hat\xi)\, \phi_k + O\left(n^{-1/2}\right)\right)^+$$
$$\to \mathbb{E}\left(-\theta_i \sum_{k=1}^d \partial_{\xi_i} \partial_{\xi_k} \log f(X \mid \xi_0)\, \phi_k\right)^+,$$
6.1. Sampling and integration along Zig-Zag trajectories. There are essentially two different ways of using the Zig-Zag skeleton points which we obtain by using, for example, Algorithms 1, 2 or 3.
The first possible approach is to collect a number of samples along the trajectories. Suppose we have simulated the Zig-Zag process up to time $\tau > 0$, and we wish to collect $m$ samples. This can be achieved by setting $t_i = i\tau/m$ and setting $\Xi^i := \Xi(t_i)$ for $i = 1, \ldots, m$, with the continuous-time trajectory $(\Xi(t))$ defined by linear interpolation between the skeleton points.
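A minimal sketch of this extraction step (the array layout is an assumption of the sketch, not a prescription from the paper):

```python
import numpy as np

def samples_from_skeleton(times, positions, m):
    """Draw m evenly spaced samples Xi(t_i), t_i = i*tau/m, from a Zig-Zag skeleton,
    using the piecewise linear trajectory between skeleton points.
    times: increasing array of skeleton times (T^0..T^K);
    positions: array of skeleton positions (Xi^0..Xi^K), shape (K+1, d)."""
    times = np.asarray(times)
    positions = np.asarray(positions)
    tau = times[-1]
    t_grid = tau * np.arange(1, m + 1) / m
    # index of the skeleton point at or before each sample time
    idx = np.searchsorted(times, t_grid, side="right") - 1
    idx = np.clip(idx, 0, len(times) - 2)
    dt = t_grid - times[idx]
    slopes = (positions[idx + 1] - positions[idx]) / (times[idx + 1] - times[idx])[:, None]
    return positions[idx] + slopes * dt[:, None]
```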
6.2. Beating one ESS per epoch. We use the term epoch as a unit of com-
putational cost, corresponding to the number of iterations required to evaluate
the complete gradient of log π . This means that for the basic Zig-Zag algorithm
(without sub-sampling), an epoch consists of exactly one iteration, and for the
sub-sampled variants of the Zig-Zag algorithm, an epoch consists of n iterations.
The CPU running times per epoch of the various algorithms we consider are
equal up to a constant factor. To assess the scaling of various algorithms, we
use ESS per epoch. The notion of ESS is discussed in the Supplementary Ma-
terial [Bierkens, Fearnhead and Roberts (2019), Section 2]. Consider any classical
MCMC algorithm based upon the Metropolis–Hastings acceptance rule. Since ev-
ery iteration requires an evaluation of the full density function to compute the
acceptance probability, we have that the ESS per epoch for such an algorithm is
bounded from above by one. Similar observations apply to all other known MCMC
algorithms capable of sampling asymptotically from the exact target distribution.
There do exist several conceptual innovations based on the idea of sub-
sampling, which have some theoretical potential to overcome the fundamental lim-
itation of one ESS per epoch sketched above.
The Pseudo-Marginal Method (PMM, Andrieu and Roberts (2009)) is based
upon using a positive unbiased estimator for a possibly unnormalized density. Ob-
taining an unbiased estimator of a product is much more difficult than obtaining
one for a sum. Furthermore, it has been shown to be impossible to construct an
estimator that is guaranteed to be positive without other information about the
product, such as a bound on the terms in the product (Jacob and Thiery (2015)).
Therefore, the PMM does not apply in a straightforward way to vanilla MCMC in
Bayesian inference.
In the Supplementary Material (Bierkens, Fearnhead and Roberts (2019), Sec-
tion 3) we analyse the scaling of Stochastic Gradient Langevin Dynamics (SGLD,
Welling and Teh (2011)) in an analogous fashion to the analysis of ZZ and ZZ-CV
in Section 5. From this analysis, we conclude that it is in general not possible to implement SGLD in such a way that the ESS per epoch (ESSpE) has a larger order of magnitude than O(1). We compare SGLD to Zig-Zag in the experiments of Sections 6.3 and 6.5.
FIG. 2. Mean square error (MSE) in the first and second moment as a function of the number of epochs, based on n = 100 or n = 10,000 observations, for a one-dimensional Gaussian posterior distribution (Section 6.3). Displayed are SGLD (green), ZZ-CV (magenta), ZZ-soCV (dark magenta) and ZZ (black). The displayed dots represent averages over experiments based on randomly generated data from the true posterior distribution. Parameter values (see Bierkens, Fearnhead and Roberts (2019), Section 4) are $\xi_0 = 1$ (the true value of the mean parameter), $\rho = 1$, $\sigma = 1$ and $c_1 = 1$, $c_2 = 1/100$ (for the SGLD parameters; see the Supplementary Material, Bierkens, Fearnhead and Roberts (2019), Section 3). The value of $\hat\xi$ for ZZ-soCV is based on a sub-sample of size $m = n/10$, so that it will not be equal to the exact maximizer of the posterior. For an honest comparison, trajectories of all algorithms have initial condition equal to $\xi_{\mathrm{MAP}}$.
presence of bias in SGLD. This bias does not appear in the different versions of
Zig-Zag sampling, agreeing with the theoretical result that ergodic averages over
Zig-Zag trajectories are consistent. Furthermore, we see a significant relative in-
crease in efficiency for ZZ-(so)CV over basic ZZ when the number of observations
is increased, agreeing with the scaling results of Section 5. A poor choice of refer-
ence point (as in ZZ-soCV) is seen to have only a small effect on the efficiency.
Combined with a flat prior distribution, this induces a posterior distribution for $\xi$ given observations of $(x^j, y^j)$ for $j = 1, \ldots, n$; see the Supplementary Material for implementation details (Bierkens, Fearnhead and Roberts (2019), Section 5).
The results of this experiment are shown in Figure 3. In both of the plots of
ESS per epoch (see (a) and (c)), the best linear fit for ZZ-CV has slope approx-
imately 0.95, which is in close agreement with the scaling analysis of Section 5.
The other algorithms have roughly a horizontal slope, corresponding to a linear
scaling with the size of the data. We conclude that, among the algorithms tested,
ZZ-CV is the only algorithm for which the ESS per CPU second is approximately
constant as a function of the size of the data (see Figure 3(b) and (d)). Furthermore
ZZ-CV obtains an ESSpE which is roughly linearly increasing with the number of
observations n (see Figure 3(a) and (c)), whereas the other versions of the Zig-Zag
algorithms, and MALA, have an ESSpE which is approximately constant with re-
spect to n. These statements apply regardless of the dimensionality of the problem.
FIG. 3. Log-log plots of the experimentally observed dependence of ESS per epoch (ESSpE) and ESS per second (ESSpS) with respect to the first coordinate $\xi_1$, as a function of the number of observations n, in the case of (2-D and 16-D) Bayesian logistic regression (Section 6.4). Data is randomly generated based on true parameter values $\xi_0 = (1, 2)$ (2-D) and $\xi_0 = (1, \ldots, 1)$ (16-D). Trajectories all start in the true parameter value $\xi_0$. Plotted are mean and standard deviation over 10 experiments, along with the best linear fit. Displayed are MALA (tuned to have optimal acceptance ratio, green), Zig-Zag with global bound (red), Zig-Zag with Lipschitz bound (black), ZZ-SS using global bound (blue) and ZZ-CV (magenta), all run for $10^5$ epochs. As reference point for ZZ-CV we compute the posterior mode numerically, the cost of which is negligible compared to the MCMC. The experiments are carried out in R with C++ implementations of all algorithms.
Material (Bierkens, Fearnhead and Roberts (2019), Section 6) how to obtain com-
putational bounds for the Zig-Zag and ZZ-CV algorithms, which may serve as an
illustration on how to obtain such bounds in settings beyond those described in
Sections 3.3 and 4.3.
In Figure 4, we compare trace plots for the Zig-Zag algorithms (ZZ, ZZ-CV) to
trace plots for Stochastic Gradient Langevin Dynamics (SGLD) and the Consensus
Algorithm (Scott et al. (2016)). SGLD and Consensus are seen to be strongly bi-
ased, whereas ZZ and ZZ-CV target the correct distribution. However, this comes
at a cost: ZZ-CV loses much of its efficiency in this situation (due to the combina-
tion of lack of posterior contraction and unbounded Hessian); in particular it is not
FIG. 4. Trace plots of several algorithms (blue) and density contour plots for the nonidentifiable logistic regression example of Section 6.5. In this example the number of observations is n = 1000. Data is randomly generated from the model with true parameter satisfying $\xi_1 + \xi_2^2 = -1$. The prior is a two-dimensional standard normal distribution. Due to the unbounded Hessian, and because SGLD is not corrected by a Metropolis–Hastings accept/reject step, the stepsize of SGLD needs to be set to a very small value (compared, e.g., to what would be required for MALA) in order to prevent explosion of the trajectory; still, the algorithm exhibits a significant asymptotic bias.
super-efficient. The use of multiple reference points may alleviate this problem; see also the discussion in Section 7.
Acknowledgments. The authors are grateful for helpful comments from ref-
erees, the Editor and the Associate Editor which have improved the paper. Fur-
thermore, the authors acknowledge Matthew Moores (University of Warwick) for
SUPPLEMENTARY MATERIAL
Supplement to “The Zig-Zag process and super-efficient sampling for
Bayesian analysis of big data” (DOI: 10.1214/18-AOS1715SUPP; .pdf). Math-
ematics of the Zig-Zag process, scaling of SGLD, details on the experiments in-
cluding how to obtain computational bounds.
REFERENCES
ANDERSON, D. F. (2007). A modified next reaction method for simulating chemical systems with time dependent propensities and delays. J. Chem. Phys. 127 214107.
ANDRIEU, C. and ROBERTS, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist. 37 697–725. MR2502648
BARDENET, R., DOUCET, A. and HOLMES, C. (2017). On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res. 18 1515–1557. MR3670492
BIERKENS, J. (2016). Non-reversible Metropolis–Hastings. Stat. Comput. 26 1213–1228. MR3538633
BIERKENS, J. (2017). Computer experiments accompanying J. Bierkens, P. Fearnhead and G. Roberts, The Zig-Zag process and super-efficient sampling for Bayesian analysis of big data. Available at https://github.com/jbierkens/zigzag-experiments. Date accessed: 20-10-2017.
BIERKENS, J., FEARNHEAD, P. and ROBERTS, G. (2019). Supplement to "The Zig-Zag process and super-efficient sampling for Bayesian analysis of big data." DOI:10.1214/18-AOS1715SUPP.
BIERKENS, J. and ROBERTS, G. (2017). A piecewise deterministic scaling limit of lifted Metropolis–Hastings in the Curie–Weiss model. Ann. Appl. Probab. 27 846–882. MR3655855
BIERKENS, J., ROBERTS, G. O. and ZITT, P.-A. (2017). Ergodicity of the zigzag process. Preprint. Available at arXiv:1712.09875.
BOUCHARD-CÔTÉ, A., VOLLMER, S. J. and DOUCET, A. (2017). The bouncy particle sampler: A non-reversible rejection-free Markov chain Monte Carlo method. J. Amer. Statist. Assoc. To appear. Available at arXiv:1510.02451.
CHEN, T.-L. and HWANG, C.-R. (2013). Accelerating reversible Markov chains. Statist. Probab. Lett. 83 1956–1962. MR3079029
DELIGIANNIDIS, G., BOUCHARD-CÔTÉ, A. and DOUCET, A. (2017). Exponential ergodicity of the bouncy particle sampler. Preprint. Available at arXiv:1705.04579.
DUANE, S., KENNEDY, A. D., PENDLETON, B. J. and ROWETH, D. (1987). Hybrid Monte Carlo. Phys. Lett. B 195 216–222.
DUBEY, K. A., REDDI, S. J., WILLIAMSON, S. A., POCZOS, B., SMOLA, A. J. and XING, E. P. (2016). Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems 1154–1162.
DUNCAN, A. B., LELIÈVRE, T. and PAVLIOTIS, G. A. (2016). Variance reduction using nonreversible Langevin samplers. J. Stat. Phys. 163 457–491. MR3483241
FEARNHEAD, P., BIERKENS, J., POLLOCK, M. and ROBERTS, G. O. (2018). Piecewise deterministic Markov processes for continuous-time Monte Carlo. Statist. Sci. 33 386–412. MR3843382
FONTBONA, J., GUÉRIN, H. and MALRIEU, F. (2012). Quantitative estimates for the long-time behavior of an ergodic variant of the telegraph process. Adv. in Appl. Probab. 44 977–994. MR3052846
FONTBONA, J., GUÉRIN, H. and MALRIEU, F. (2016). Long time behavior of telegraph processes under convex potentials. Stochastic Process. Appl. 126 3077–3101. MR3542627
GIBSON, M. A. and BRUCK, J. (2000). Efficient exact stochastic simulation of chemical systems with many species and many channels. J. Phys. Chem. A 104 1876–1889.
HASTINGS, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97–109. MR3363437
HWANG, C.-R., HWANG-MA, S.-Y. and SHEU, S. J. (1993). Accelerating Gaussian diffusions. Ann. Appl. Probab. 3 897–913. MR1233633
JACOB, P. E. and THIERY, A. H. (2015). On nonnegative unbiased estimators. Ann. Statist. 43 769–784. MR3319143
JOHNSON, R. A. (1970). Asymptotic expansions associated with posterior distributions. Ann. Math. Stat. 41 851–864. MR0263198
LEWIS, P. A. W. and SHEDLER, G. S. (1979). Simulation of nonhomogeneous Poisson processes by thinning. Nav. Res. Logist. Q. 26 403–413. MR0546120
LI, C., SRIVASTAVA, S. and DUNSON, D. B. (2017). Simple, scalable and accurate posterior interval estimation. Biometrika 104 665–680. MR3694589
MA, Y.-A., CHEN, T. and FOX, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems 2917–2925.
MACLAURIN, D. and ADAMS, R. P. (2014). Firefly Monte Carlo: Exact MCMC with subsets of data. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence. AUAI Press, Arlington, VA.
METROPOLIS, N., ROSENBLUTH, A. W., ROSENBLUTH, M. N., TELLER, A. H. and TELLER, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21 1087.
MONMARCHÉ, P. (2014). Hypocoercive relaxation to equilibrium for some kinetic models via a third order differential inequality. Available at arXiv:1306.4548.
NEAL, R. M. (1998). Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation. In Learning in Graphical Models 205–228. Springer, Berlin.
NEISWANGER, W., WANG, C. and XING, E. (2013). Asymptotically exact, embarrassingly parallel MCMC. Available at arXiv:1311.4780.
PAKMAN, A., GILBOA, D., CARLSON, D. and PANINSKI, L. (2016). Stochastic bouncy particle sampler. Preprint. Available at arXiv:1609.00770.
PETERS, E. A. J. F. and DE WITH, G. (2012). Rejection-free Monte Carlo sampling for general potentials. Phys. Rev. E (3) 85 1–5.
POLLOCK, M., FEARNHEAD, P., JOHANSEN, A. M. and ROBERTS, G. O. (2016). The scalable Langevin exact algorithm: Bayesian inference for big data. Available at arXiv:1609.03436.
QUIROZ, M., VILLANI, M. and KOHN, R. (2015). Speeding up MCMC by efficient data subsampling. Riksbank Research Paper Series 121.
REY-BELLET, L. and SPILIOPOULOS, K. (2015). Irreversible Langevin samplers and variance reduction: A large deviations approach. Nonlinearity 28 2081–2103. MR3366637
ROBERTS, G. O. and TWEEDIE, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2 341–363. MR1440273
SCOTT, S. L., BLOCKER, A. W., BONASSI, F. V., CHIPMAN, H. A., GEORGE, E. I. and MCCULLOCH, R. E. (2016). Bayes and big data: The consensus Monte Carlo algorithm. Int. J. Manag. Sci. Eng. Manag. 11 78–88.
TURITSYN, K. S., CHERTKOV, M. and VUCELJA, M. (2011). Irreversible Monte Carlo algorithms for efficient sampling. Phys. D 240 410–414.
VOLLMER, S. J., ZYGALAKIS, K. C. and TEH, Y. W. (2016). Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. J. Mach. Learn. Res. 17 1–48. MR3555050
WANG, X. and DUNSON, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. Available at arXiv:1312.4605.
WELLING, M. and TEH, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) 681–688.
J. BIERKENS
DELFT INSTITUTE OF APPLIED MATHEMATICS
VAN MOURIK BROEKMANWEG 6
2628 XE DELFT
THE NETHERLANDS
E-MAIL: [email protected]

P. FEARNHEAD
DEPARTMENT OF MATHEMATICS AND STATISTICS
FYLDE COLLEGE
LANCASTER UNIVERSITY
LANCASTER, LA1 4YF
UNITED KINGDOM
E-MAIL: [email protected]

G. ROBERTS
DEPARTMENT OF STATISTICS
UNIVERSITY OF WARWICK
COVENTRY CV4 7AL
UNITED KINGDOM
E-MAIL: [email protected]