
The Annals of Statistics
2019, Vol. 47, No. 3, 1288–1320
https://doi.org/10.1214/18-AOS1715
© Institute of Mathematical Statistics, 2019

THE ZIG-ZAG PROCESS AND SUPER-EFFICIENT SAMPLING FOR
BAYESIAN ANALYSIS OF BIG DATA¹

BY JORIS BIERKENS, PAUL FEARNHEAD AND GARETH ROBERTS

Delft University of Technology, Lancaster University and University of Warwick
Standard MCMC methods can scale poorly to big data settings due to the
need to evaluate the likelihood at each iteration. There have been a number
of approximate MCMC algorithms that use sub-sampling ideas to reduce this
computational burden, but with the drawback that these algorithms no longer
target the true posterior distribution. We introduce a new family of Monte
Carlo methods based upon a multidimensional version of the Zig-Zag process
of [Ann. Appl. Probab. 27 (2017) 846–882], a continuous-time piecewise de-
terministic Markov process. While traditional MCMC methods are reversible
by construction (a property which is known to inhibit rapid convergence), the
Zig-Zag process offers a flexible nonreversible alternative which we observe
to often have favourable convergence properties. We show how the Zig-Zag
process can be simulated without discretisation error, and give conditions for
the process to be ergodic. Most importantly, we introduce a sub-sampling
version of the Zig-Zag process that is an example of an exact approximate
scheme, that is, the resulting approximate process still has the posterior as
its stationary distribution. Furthermore, if we use a control-variate idea to re-
duce the variance of our unbiased estimator, then the Zig-Zag process can
be super-efficient: after an initial preprocessing step, essentially independent
samples from the posterior distribution are obtained at a computational cost
which does not depend on the size of the data.

1. Introduction. The importance of Markov chain Monte Carlo techniques


in Bayesian inference shows no signs of diminishing. However, all commonly
used methods are variants on the Metropolis–Hastings (MH) algorithm (Hastings
(1970), Metropolis et al. (1953)) and rely on innovations which date back over 60
years. All MH algorithms simulate realisations from a discrete reversible ergodic
Markov chain with invariant distribution π which is (or is closely related to) the
target distribution, that is, the posterior distribution in a Bayesian context. The MH
algorithm gives a beautifully simple though flexible recipe for constructing such
Markov chains, requiring only local information about π (typically pointwise eval-
uations of π and, perhaps, its derivative at the current and proposed new locations)
to complete each iteration.

Received July 2016; revised March 2018.


1 Supported by EPSRC Grants EP/D002060/1 (CRiSM) and EP/K014463/1 (iLike).
MSC2010 subject classifications. Primary 65C60; secondary 65C05, 62F15, 60J25.
Key words and phrases. MCMC, nonreversible Markov process, piecewise deterministic Markov
process, stochastic gradient Langevin dynamics, sub-sampling, exact sampling.

However, new complex modelling and data paradigms are seriously challeng-
ing these established methodologies. First, the restriction of traditional MCMC to
reversible Markov chains is a serious limitation. It is now well understood both the-
oretically (Bierkens (2016), Chen and Hwang (2013), Duncan, Lelièvre and Pavli-
otis (2016), Hwang, Hwang-Ma and Sheu (1993), Rey-Bellet and Spiliopoulos
(2015)) and heuristically (Neal (1998)) that nonreversible chains offer potentially
massive advantages over reversible counterparts. The need to escape reversibility,
and create momentum to aid mixing throughout the state space is certainly well
known, and motivates a number of modern MCMC methods, including the popu-
lar Hamiltonian MCMC (Duane et al. (1987)).
A second major obstacle to the application of MCMC for Bayesian inference
is the need to process potentially massive data-sets. Since MH algorithms in their
pure form require a likelihood evaluation—and thus processing the full data-set—
at every iteration, it can be impractical to carry out large numbers of MH iterations.
This has led to a range of alternatives that use sub-samples of the data at each
iteration (Ma, Chen and Fox (2015), Maclaurin and Adams (2014), Quiroz, Villani
and Kohn (2015), Welling and Teh (2011)), or that partition the data into shards,
run MCMC on each shard and then attempt to combine the information from these
different MCMC runs (Li, Srivastava and Dunson (2017), Neiswanger, Wang and
Xing (2013), Scott et al. (2016), Wang and Dunson (2013)). However, most of
these methods introduce some form of approximation error, so that the final sample
will be drawn from some approximation to the posterior, and the quality of the
approximation can be impossible to evaluate. As an exception the Firefly algorithm
(Maclaurin and Adams (2014)) samples from the exact distribution of interest (but
see the comment below).
This paper introduces the multidimensional Zig-Zag sampling algorithm (ZZ)
and its variants. These methods overcome the restrictions of the lifted Markov
chain approach of Turitsyn, Chertkov and Vucelja (2011) as they do not depend
on the introduction of momentum generating quantities. They are also amenable
to the use of sub-sampling ideas. The dynamics of the Zig-Zag process depends
on the target distribution through the gradient of the logarithm of the target. For
Bayesian applications this is a sum, and is easy to estimate unbiasedly using sub-
sampling. Moreover, Zig-Zag with Sub-Sampling (ZZ-SS) retains the exactness of
the required invariant distribution. Furthermore, if we also use control variate ideas
to reduce the variance of our sub-sampling estimator of the gradient, the resulting
Zig-Zag with Control Variates (ZZ-CV) algorithm has remarkable super-efficient
scaling properties for large data sets.
We will call an algorithm super-efficient if it is able to generate independent
samples from the target distribution at a higher efficiency than if we were to draw in-
dependently from the target distribution at the cost of evaluating all data. The only
situation we are aware of where we can implement super-efficient sampling is with
simple conjugate models, where the likelihood function has a low-dimensional
summary statistic which can be evaluated at cost O(n), where n is the number

of observations, after which we can obtain independent samples from the poste-
rior distribution at a cost of O(1) by using the functional form of the posterior
distribution. The ZZ-CV can replicate this computational efficiency: After a pre-
computation of O(n), we are able to obtain independent samples at a cost of O(1).
In this sense it contrasts with the Firefly algorithm (Maclaurin and Adams (2014))
which has an ESS per datum which decreases approximately as 1/n where n is
the size of the data, so that the gains of this algorithm do not increase with n; see
Bouchard-Côté, Vollmer and Doucet (2017), Section 4.6.
This breakthrough is based upon the Zig-Zag process, a continuous-time piecewise deterministic Markov process (PDMP). Given a $d$-dimensional differentiable target density $\pi$, Zig-Zag is a continuous-time nonreversible stochastic process with continuous, piecewise linear trajectories on $\mathbb{R}^d$. It moves with constant velocity $\Theta \in \{-1, 1\}^d$ until one of the velocity components switches sign. The event time and the choice of which direction to reverse are controlled by a collection of state-dependent switching rates $(\lambda_i)_{i=1}^d$, which in turn are constrained via an identity (2) which ensures that $\pi$ is a stationary distribution for the process. The process is intrinsically constructed in continuous time, and it can be easily simulated using standard Poisson thinning arguments, as we shall see in Section 3.
The use of PDMPs such as the Zig-Zag processes is an exciting and mostly un-
explored area in MCMC. The first occurrence of a PDMP for sampling purposes
is in the computational physics literature (Peters and De With (2012)), which in
one dimension coincides with the Zig-Zag process. In Bouchard-Côté, Vollmer
and Doucet (2017), this method is given the name Bouncy Particle Sampler. In
multiple dimensions the Zig-Zag process and Bouncy Particle Sampler (BPS) are
different processes: Both are PDMPs which move along straight line segments,
but the Zig-Zag process changes direction in only a single component at each
switch, whereas the Bouncy Particle Sampler reflects the full direction vector in
the level curves of the density function. As we will see in Section 2.4, this differ-
ence has a beneficial effect on the ergodic properties of the Zig-Zag process. The
one-dimensional Zig-Zag process is analysed in detail in, for example, Bierkens
and Roberts (2017), Fontbona, Guérin and Malrieu (2012), Fontbona, Guérin and
Malrieu (2016), Monmarché (2014).
Since the first version of this paper was conceived, several other related theoretical and methodological papers have already appeared. In particular, we men-
tion here results on exponential ergodicity of the BPS (Deligiannidis, Bouchard-
Côté and Doucet (2017)) and ergodicity of the multidimensional Zig-Zag process
(Bierkens, Roberts and Zitt (2017)). The Zig-Zag process has the advantage that
it is ergodic under very mild conditions, which in particular means that we are
not required to choose a refreshment rate. At the same time, the BPS seems more
“natural”, in that it tries to minimise the bounce rate and the change in direction
at bounces, and it may be more efficient for this reason. However, it is a challenge
to make a direct comparison in efficiency of the two methods since the efficiency

depends both on the computational effort per unit of continuous time of the respec-
tive algorithms, as well as the mixing time of the underlying processes. Therefore,
we expect analysing the relative efficiency of PDMP based algorithms to be an
important area of continued research for years to come.
A continuous-time sequential Monte Carlo algorithm for scalable Bayesian in-
ference with big data, the SCALE algorithm, is given in Pollock et al. (2016).
Advantages that Zig-Zag has over SCALE are that it avoids the issue of controlling
the stability of importance weights, and that it is simpler to implement. On the other hand, the
SCALE algorithm is well adapted to the use of parallel computing architectures,
and has particularly simple scaling properties for big data.

1.1. Notation. For a topological space $X$, let $\mathcal{B}(X)$ denote the Borel $\sigma$-algebra. We write $\mathbb{R}_+ := [0, \infty)$. If $h : \mathbb{R}^d \to \mathbb{R}$ is differentiable, then $\partial_i h$ denotes the function $\xi \mapsto \partial h(\xi)/\partial \xi_i$. We equip $E := \mathbb{R}^d \times \{-1, +1\}^d$ with the product topology of the Euclidean topology on $\mathbb{R}^d$ and the discrete topology on $\{-1, +1\}^d$. Elements of $E$ will often be denoted by $(\xi, \theta)$ with $\xi \in \mathbb{R}^d$ and $\theta \in \{-1, +1\}^d$. For $g : E \to \mathbb{R}$ differentiable in its first argument, we will use $\partial_i g$ to denote the function $(\xi, \theta) \mapsto \partial g(\xi, \theta)/\partial \xi_i$, $i = 1, \dots, d$.

2. The Zig-Zag process. The Zig-Zag process is a continuous-time Markov process whose trajectories lie in the space $E = \mathbb{R}^d \times \{-1, +1\}^d$ and will be denoted by $(\Xi(t), \Theta(t))_{t \geq 0}$. They can be described as follows: at random times, a single component of $\Theta(t)$ flips. In between these switches, $\Xi(t)$ is linear with $\frac{d\Xi(t)}{dt} = \Theta(t)$. The rates at which the flips in $\Theta(t)$ occur are time inhomogeneous: the $i$th component of $\Theta$ switches at rate $\lambda_i(\Xi(t), \Theta(t))$, where $\lambda_i : E \to \mathbb{R}_+$ for $i = 1, \dots, d$ are continuous functions.

2.1. Construction. For a given $(\xi, \theta) \in E$, we may construct a trajectory $(\Xi, \Theta)$ of the Zig-Zag process with initial condition $(\xi, \theta)$ as follows:
• Let $(T^0, \Xi^0, \Theta^0) := (0, \xi, \theta)$.
• For $k = 1, 2, \dots$
  – Let $\xi^k(t) := \Xi^{k-1} + \Theta^{k-1} t$, $t \geq 0$.
  – For $i = 1, \dots, d$, let $\tau_i^k$ be distributed according to
    $$\mathbb{P}(\tau_i^k \geq t) = \exp\left(-\int_0^t \lambda_i(\xi^k(s), \Theta^{k-1})\, ds\right).$$
  – Let $i_0 := \arg\min_{i \in \{1, \dots, d\}} \tau_i^k$ and let $T^k := T^{k-1} + \tau_{i_0}^k$.
  – Let $\Xi^k := \xi^k(T^k)$.
  – Let
    $$\Theta^k(i) := \begin{cases} \Theta^{k-1}(i) & \text{if } i \neq i_0, \\ -\Theta^{k-1}(i) & \text{if } i = i_0. \end{cases}$$

This procedure defines a sequence of skeleton points $(T^k, \Xi^k, \Theta^k)_{k=0}^\infty$ in $\mathbb{R}_+ \times E$, which correspond to the time and position at which the direction of the process changes. The trajectory $\xi^k(t)$ represents the position of the process at time $T^{k-1} + t$ until time $T^k$, for $0 \leq t \leq T^k - T^{k-1}$. The time until the next skeleton event is characterized as the smallest time of a set of events in $d$ simultaneous point processes, where each point process corresponds to switching of a different component of the velocity. For the $i$th of these processes, events occur at rate $\lambda_i(\xi^k(s), \Theta^{k-1})$, and $\tau_i^k$ is defined to be the time to the first event for the $i$th component. The component for which the earliest event occurs is $i_0$. This defines $\tau_{i_0}^k$, the time between the $(k-1)$th and $k$th skeleton point, and the component, $i_0$, of the velocity that switches.
The piecewise deterministic trajectories $(\Xi(t), \Theta(t))$ are now obtained as
$$(\Xi(t), \Theta(t)) := (\Xi^k + \Theta^k (t - T^k), \Theta^k) \quad \text{for } t \in [T^k, T^{k+1}), \; k = 0, 1, 2, \dots.$$
Since the switching rates are continuous, and hence bounded on compact sets, and $\Xi$ will travel a finite distance within any finite time interval, within any bounded time interval there will be finitely many switches almost surely.
The above procedure provides a mathematical construction of a Markov process as well as the outline of an algorithm which simulates this process. The only step in this procedure which presents a computational challenge is the simulation of the random times $(\tau_i^k)$, and a significant part of this paper will consider obtaining these in a numerically efficient way.
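To make this construction concrete, the following minimal sketch (in Python, with function names of our own choosing) simulates the skeleton of the one-dimensional canonical Zig-Zag process for a standard Gaussian target. In this special case the switching time along each linear segment can be drawn exactly by inverting its cumulative rate; for general targets this step is replaced by the thinning approach of Section 3.

```python
import numpy as np

# One-dimensional canonical Zig-Zag for a standard Gaussian target:
# Psi(xi) = xi^2 / 2, so the switching rate is lambda(xi, theta) = (theta * xi)^+.
# Along xi(t) = xi + theta * t the rate equals (a + t)^+ with a = theta * xi (theta^2 = 1),
# and the switching time can be drawn by exact inversion of the cumulative rate.

def draw_switch_time(a, rng):
    """Draw tau with P(tau >= t) = exp(-int_0^t (a + s)^+ ds)."""
    y = rng.exponential()                  # y = -log(U)
    if a >= 0:
        return -a + np.sqrt(a * a + 2.0 * y)
    return -a + np.sqrt(2.0 * y)           # rate is zero until t = -a

def zigzag_gaussian_skeleton(xi0, theta0, num_switches, seed=0):
    rng = np.random.default_rng(seed)
    T, Xi, Theta = [0.0], [xi0], [theta0]
    for _ in range(num_switches):
        tau = draw_switch_time(Theta[-1] * Xi[-1], rng)
        T.append(T[-1] + tau)
        Xi.append(Xi[-1] + Theta[-1] * tau)
        Theta.append(-Theta[-1])            # in one dimension every event flips the velocity
    return np.array(T), np.array(Xi), np.array(Theta)

if __name__ == "__main__":
    T, Xi, Theta = zigzag_gaussian_skeleton(xi0=1.0, theta0=1, num_switches=10000)
    print(T[:5], Xi[:5])
```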
Figure 1 displays trajectories of the Zig-Zag process for several examples of
invariant distributions. The name of the process derives from the zig-zag nature of the paths that the process produces. Figure 1 shows an important difference in the output of the Zig-Zag process, as compared to a discrete-time MCMC algorithm: the output of the Zig-Zag process is a continuous-time sample path. The bottom row of plots in Figure 1
also gives a comparison to a reversible MCMC algorithm, Metropolis Adjusted
Langevin (MALA, Roberts and Tweedie (1996)), and demonstrates an advantage
of a nonreversible sampler: it can cope better with a heavy tailed target. This is
most easily seen if we start the process out in the tail, as in the figure. The velocity
component of the Zig-Zag process enables it to quickly return to the mode of the
distribution, whereas the reversible algorithm behaves like a random walk in the
tails, and takes much longer to return to the mode.

2.2. Invariant distribution. The most important aspect of the Zig-Zag process is that in many cases the switching rates are directly related to an easily identifiable invariant distribution. Let $C^1(\mathbb{R}^d)$ denote the space of continuously differentiable functions on $\mathbb{R}^d$. For $\theta \in \{-1, +1\}^d$ and $i \in \{1, \dots, d\}$, let $F_i[\theta] \in \{-1, +1\}^d$ denote the binary vector obtained by flipping the $i$th component of $\theta$, that is,
$$F_i[\theta]_j = \begin{cases} \theta_j & \text{if } i \neq j, \\ -\theta_j & \text{if } i = j. \end{cases}$$
We introduce the following assumption.

FIG. 1. Top two rows: example trajectories of the canonical Zig-Zag process. In (a) and (b), the horizontal axis shows time and the vertical axis the $\Xi$-coordinate of the 1D process. In (c) and (d), the trajectories in $\mathbb{R}^2$ of $(\Xi_1, \Xi_2)$ are plotted. Bottom row: Zig-Zag process (e) and MALA (f) for a Cauchy target, with both processes started in the tail.

ASSUMPTION 2.1. For some function $\Psi \in C^1(\mathbb{R}^d)$ satisfying
$$(1)\qquad \int_{\mathbb{R}^d} \exp(-\Psi(\xi))\, d\xi < \infty,$$
we have
$$(2)\qquad \lambda_i(\xi, \theta) - \lambda_i(\xi, F_i[\theta]) = \theta_i \partial_i \Psi(\xi) \quad \text{for all } (\xi, \theta) \in E, \; i = 1, \dots, d.$$

Throughout this paper, we will refer to $\Psi$ as the negative log density. Let $\mu_0$ denote the measure on $\mathcal{B}(E)$ such that, for $A \in \mathcal{B}(\mathbb{R}^d)$ and $\theta \in \{-1, +1\}^d$, $\mu_0(A \times \{\theta\}) = \mathrm{Leb}(A)$, with Leb denoting Lebesgue measure on $\mathbb{R}^d$.

THEOREM 2.2. Suppose Assumption 2.1 holds. Let $\mu$ denote the probability distribution on $E$ such that $\mu$ has Radon–Nikodym derivative
$$(3)\qquad \frac{d\mu}{d\mu_0}(\xi, \theta) = \frac{\exp(-\Psi(\xi))}{Z}, \quad (\xi, \theta) \in E,$$
where $Z = \int_E \exp(-\Psi)\, d\mu_0$. Then the Zig-Zag process $(\Xi, \Theta)$ with switching rates $(\lambda_i)_{i=1}^d$ has invariant distribution $\mu$.

The proof is located in Section 1 of the Supplementary Material (Bierkens, Fearnhead and Roberts (2019)). We see that under the invariant distribution of the Zig-Zag process, $\xi$ and $\theta$ are independent, with $\xi$ having density proportional to $\exp(-\Psi(\xi))$ and $\theta$ having a uniform distribution on the points in $\{-1, +1\}^d$.
For $a \in \mathbb{R}$, let $(a)^+ := \max(0, a)$ and $(a)^- := \max(0, -a)$ denote the positive and negative parts of $a$, respectively. We will often use the trivial identity $a = (a)^+ - (a)^-$ without comment. The following result characterizes the switching rates for which (2) holds.

PROPOSITION 2.3. Suppose $\lambda : E \to \mathbb{R}_+^d$ is continuous. Then Assumption 2.1 is satisfied if and only if there exists a continuous function $\gamma : E \to \mathbb{R}_+^d$ such that for all $i = 1, \dots, d$ and $(\xi, \theta) \in E$, $\gamma_i(\xi, \theta) = \gamma_i(\xi, F_i[\theta])$ and, for $\Psi \in C^1(\mathbb{R}^d)$ satisfying (1),
$$(4)\qquad \lambda_i(\xi, \theta) = (\theta_i \partial_i \Psi(\xi))^+ + \gamma_i(\xi, \theta).$$

The proof is located in Section 1 of the Supplementary Material (Bierkens, Fearnhead and Roberts (2019)).

2.3. Zig-Zag process for Bayesian inference. One application of the Zig-Zag
process is as an alternative to MCMC for sampling from posterior distributions in
Bayesian statistics. We show here that it is straightforward to derive a class of Zig-
Zag processes that have a given posterior distribution as their invariant distribution.

The dynamics of the Zig-Zag process only depend on knowing the posterior den-
sity up to a constant of proportionality.
To keep notation consistent with that used for the Zig-Zag process, let $\xi \in \mathbb{R}^d$ denote a vector of continuous parameters. We are given a prior density function for $\xi$, which we denote by $\pi_0(\xi)$, and observations $x^{1:n} = (x^1, \dots, x^n)$. Our model for the data defines a likelihood function $L(x^{1:n}|\xi)$. Thus the posterior density function is
$$\pi(\xi) \propto \pi_0(\xi) L(x^{1:n}|\xi).$$
We can write $\pi(\xi)$ in the form of the previous section,
$$\pi(\xi) = \frac{1}{Z} \exp(-\Psi(\xi)), \quad \xi \in \mathbb{R}^d,$$
where $\Psi(\xi) = -\log \pi_0(\xi) - \log L(x^{1:n}|\xi)$, and $Z = \int_{\mathbb{R}^d} \exp(-\Psi(\xi))\, d\xi$ is the unknown normalising constant. Now assuming that $\log \pi_0(\xi)$ and $\log L(x^{1:n}|\xi)$ are both continuously differentiable with respect to $\xi$, from (4) a Zig-Zag process with rates
$$\lambda_i(\xi, \theta) = (\theta_i \partial_i \Psi(\xi))^+$$
will have the posterior density $\pi(\xi)$ as the marginal of its invariant distribution. We call the process with these rates the Canonical Zig-Zag process for the negative log density $\Psi$. As explained in Proposition 2.3, we can construct a family of Zig-Zag processes with the same invariant distribution by choosing any set of functions $\gamma_i(\xi, \theta)$, for $i = 1, \dots, d$, which take nonnegative values and for which $\gamma_i(\xi, \theta) = \gamma_i(\xi, F_i[\theta])$, and setting
$$\lambda_i(\xi, \theta) = (\theta_i \partial_i \Psi(\xi))^+ + \gamma_i(\xi, \theta) \quad \text{for } i = 1, \dots, d.$$
The intuition here is that λi (ξ, θ ) is the rate at which we transition from θ to Fi [θ ].
The condition γi (ξ, θ ) = γi (ξ, Fi [θ ]) means that we increase by the same amount
both the rate at which we will transition from θ to Fi [θ ] and vice versa. As our
invariant distribution places the same probability of being in a state with velocity
θ as that of being in state Fi [θ ], these two changes in rate cancel out in terms of
their effect on the invariant distribution. Changing the rates in this way does impact
the dynamics of the process, with larger γi values corresponding to more frequent
changes in the velocity of the Zig-Zag process, and we would expect the resulting
process to mix more slowly.
Under the assumption that the Zig-Zag process has the desired invariant distribution and is ergodic, it follows from the Birkhoff ergodic theorem that for any bounded continuous function $f : E \to \mathbb{R}$,
$$\lim_{t \to \infty} \frac{1}{t} \int_0^t f(\Xi(s), \Theta(s))\, ds = \int_E f\, d\mu,$$
for any initial condition $(\xi, \theta) \in E$. Sufficient conditions for ergodicity will be discussed in the following section. Taking $\gamma$ to be positive and bounded everywhere ensures ergodicity, as will be established in Theorem 2.10.
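To illustrate how these rates are assembled in practice, here is a minimal sketch (with hypothetical helper names; this is not code from the paper) that computes the canonical switching rates, plus optional extra rates $\gamma$, from user-supplied gradients of the log prior and log likelihood.

```python
import numpy as np

# Minimal sketch: canonical Zig-Zag switching rates
# lambda_i(xi, theta) = (theta_i * d_i Psi(xi))^+ + gamma_i(xi, theta),
# for a Bayesian posterior with Psi(xi) = -log pi_0(xi) - log L(x | xi).

def canonical_rates(xi, theta, grad_log_prior, grad_log_lik, gamma=None):
    """grad_log_prior(xi) and grad_log_lik(xi) return d-dimensional gradients."""
    grad_psi = -grad_log_prior(xi) - grad_log_lik(xi)   # gradient of the negative log density
    rates = np.maximum(theta * grad_psi, 0.0)           # (theta_i * d_i Psi(xi))^+
    if gamma is not None:
        rates = rates + gamma(xi, theta)                 # extra rates; must satisfy
    return rates                                         # gamma_i(xi, theta) = gamma_i(xi, F_i[theta])

# Example: Gaussian likelihood N(xi, 1) with a N(0, 10^2) prior, data x.
x = np.random.default_rng(1).normal(1.0, 1.0, size=100)
grad_log_prior = lambda xi: -xi / 100.0
grad_log_lik = lambda xi: np.sum(x - xi) * np.ones(1)
print(canonical_rates(np.array([0.5]), np.array([1.0]), grad_log_prior, grad_log_lik))
```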

2.4. Ergodicity of the Zig-Zag process. We have established in Section 2.2 that for any continuously differentiable, positive density $\pi$ on $\mathbb{R}^d$ a Zig-Zag process can be constructed that has $\pi$ as its marginal stationary density. In order for ergodic averages $\frac{1}{T} \int_0^T f(\Xi(s))\, ds$ of the Zig-Zag process to converge asymptotically to $\pi(f)$, we further require $(\Xi(t), \Theta(t))$ to be ergodic, that is, to admit a unique invariant distribution.
Ergodicity is directly related to the requirement that $(\Xi(t), \Theta(t))$ is irreducible, that is, the state space is not reducible into components which are each invariant for the process $(\Xi(t), \Theta(t))$. For the one-dimensional Zig-Zag process, (exponential) ergodicity has already been established under mild conditions (Bierkens and Roberts (2017)). As we discuss below, irreducibility, and thus ergodicity, can be established for large classes of multidimensional target distributions, such as i.i.d. Gaussian distributions, and also if the switching rates $\lambda_i(\xi, \theta)$ are positive for all $i = 1, \dots, d$ and $(\xi, \theta) \in E$.
Let $P^t((\xi, \theta), \cdot)$ be the transition kernel of the Zig-Zag process with initial condition $(\xi, \theta)$. A function $f : E \to \mathbb{R}$ is called norm-like if $\lim_{\|\xi\| \to \infty} f(\xi, \theta) = \infty$ for all $\theta \in \{-1, +1\}^d$. Let $\|\cdot\|_{TV}$ denote the total variation norm on the space of signed measures. First, we consider the one-dimensional case.

ASSUMPTION 2.4. Suppose $d = 1$ and there exists $\xi_0 > 0$ such that:
(i) $\inf_{\xi \geq \xi_0} \lambda(\xi, +1) > \sup_{\xi \geq \xi_0} \lambda(\xi, -1)$, and
(ii) $\inf_{\xi \leq -\xi_0} \lambda(\xi, -1) > \sup_{\xi \leq -\xi_0} \lambda(\xi, +1)$.

PROPOSITION 2.5 (Bierkens and Roberts (2017), Theorem 5). Suppose Assumption 2.4 holds. Then there exists a norm-like function $f : E \to [1, \infty)$ such that the Zig-Zag process is $f$-exponentially ergodic, that is, there exist constants $\kappa > 0$ and $0 < \rho < 1$ such that
$$\|P^t((\xi, \theta), \cdot) - \pi\|_{TV} \leq \kappa f(\xi, \theta) \rho^t \quad \text{for all } (\xi, \theta) \in E \text{ and } t \geq 0.$$

EXAMPLE 2.6. As an example of fundamental importance, which will also be used in the proof of Theorem 2.10, consider a one-dimensional Gaussian distribution. For simplicity, let $\pi(\xi)$ be centred, $\pi(\xi) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp(-\frac{\xi^2}{2\sigma^2})$ for some $\sigma > 0$. According to (4), the switching rates take the form
$$\lambda(\xi, \theta) = (\theta \xi / \sigma^2)^+ + \gamma(\xi), \quad (\xi, \theta) \in E.$$
As long as $\gamma$ is bounded from above, Assumption 2.4 is satisfied. In particular, this holds if $\gamma$ is equal to a nonnegative constant.

REMARK 2.7. We say a probability density function $\pi$ is of product form if $\pi(\xi) = \prod_{i=1}^d \pi_i(\xi_i)$, where $\pi_i : \mathbb{R} \to (0, \infty)$ are one-dimensional probability density functions. When its target density is of product form, the Zig-Zag process is the concatenation of independent Zig-Zag processes. In this case, the negative log density is of the form $\Psi(\xi) = \sum_{i=1}^d \Psi_i(\xi_i)$ and the switching rate for the $i$th component of $\theta$ is
$$(5)\qquad \lambda_i(\xi, \theta) = (\theta_i \Psi_i'(\xi_i))^+ + \gamma_i(\xi).$$
As long as $\gamma_i(\xi) = \gamma_i(\xi_i)$, so that $\gamma_i(\xi)$ only depends on the $i$th coordinate of $\xi$, the switching rate of coordinate $i$ is independent of the other coordinates $\xi_j$, $j \neq i$. It follows that the switches of the $i$th coordinate can be generated by a one-dimensional time inhomogeneous Poisson process, which is independent of the switches in the other coordinates. As a consequence, the $d$-dimensional Zig-Zag process $(\Xi(t), \Theta(t)) = (\Xi_1(t), \dots, \Xi_d(t), \Theta_1(t), \dots, \Theta_d(t))$ consists of a combination of $d$ independent Zig-Zag processes $(\Xi_i(t), \Theta_i(t))$, $i = 1, \dots, d$.

Suppose $P(x, dy)$ is the transition kernel of a Markov chain on a state space $E$. We say that the Markov chain associated to $P$ is mixing if there exists a probability distribution $\pi$ on $E$ such that
$$\lim_{k \to \infty} \|P^k(x, \cdot) - \pi\|_{TV} = 0 \quad \text{for all } x \in E.$$
For any continuous-time Markov process with family of transition kernels $P^t(x, dy)$, we can consider the associated time-discretized process, which is a Markov chain with transition kernel $Q(x, dy) := P^\delta(x, dy)$ for a fixed $\delta > 0$. The value of $\delta$ will be of no significance in our use of this construction.

PROPOSITION 2.8. Suppose $\pi$ is of product form and $\lambda : E \to \mathbb{R}_+^d$ admits the representation (5) with $\gamma_i(\xi)$ only depending on $\xi_i$, $i = 1, \dots, d$. Furthermore, suppose that for every $i = 1, \dots, d$, the one-dimensional time-discretized Zig-Zag process corresponding to switching rate $\lambda_i$ is mixing in $\mathbb{R} \times \{-1, +1\}$. Then the time-discretized $d$-dimensional Zig-Zag process with switching rates $(\lambda_i)$ is mixing. In particular, the multidimensional Zig-Zag process admits a unique invariant distribution.

PROOF. This follows from the decomposition of the $d$-dimensional Zig-Zag process as $d$ one-dimensional Zig-Zag processes and Lemma 1.1 in the Supplementary Material. □

EXAMPLE 2.9. Continuing Example 2.6, consider the simple case in which $\pi$ is of product form with each $\pi_i$ a centred Gaussian density function with variance $\sigma_i^2$. It follows from Proposition 2.8 and Example 2.6 that the multidimensional canonical Zig-Zag process (i.e., the Zig-Zag process with $\gamma_i \equiv 0$) is mixing. This is different from the Bouncy Particle Sampler (Bouchard-Côté, Vollmer and Doucet (2017)), which is not ergodic for an i.i.d. Gaussian without "refreshments" of the momentum variable.

We now show that strict positivity of the rates ensures ergodicity.

THEOREM 2.10. Suppose $\lambda : E \to (0, \infty)^d$, in particular $\lambda_i(\xi, \theta)$ is positive for all $i = 1, \dots, d$ and $(\xi, \theta) \in E$. Then there exists at most a single invariant measure for the Zig-Zag process with switching rate $\lambda$.

The proof of this result consists of a Girsanov change of measure with respect to
a Zig-Zag process targeting an i.i.d. standard normal distribution, which we know
to be irreducible. The irreducibility then carries over to the Zig-Zag process with
the stated switching rates. A detailed proof can be found in the Supplementary
Material.

REMARK 2.11. Based on numerous experiments, we conjecture that the canonical multidimensional Zig-Zag process is ergodic in general under only mild conditions. A detailed investigation of ergodicity will be the subject of a forthcoming paper (Bierkens, Roberts and Zitt (2017)).

3. Implementation. As mentioned earlier, the main computational challenge is an efficient simulation of the random times $\tau_i^k$ introduced in Section 2.1. We will focus on simulation by means of Poisson thinning.

PROPOSITION 3.1 (Poisson thinning, Lewis and Shedler (1979)). Let $m : \mathbb{R}_+ \to \mathbb{R}_+$ and $M : \mathbb{R}_+ \to \mathbb{R}_+$ be continuous such that $m(t) \leq M(t)$ for $t \geq 0$. Let $\tau^1, \tau^2, \dots$ be the increasing finite or infinite sequence of points of a Poisson process with rate function $(M(t))_{t \geq 0}$. For all $i$, delete the point $\tau^i$ with probability $1 - m(\tau^i)/M(\tau^i)$. Then the remaining points, say $\widetilde{\tau}^1, \widetilde{\tau}^2, \dots$, form a nonhomogeneous Poisson process with rate function $(m(t))_{t \geq 0}$.
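As a simple illustration of Proposition 3.1, the following sketch (our own helper, specialised to a constant bound $M(t) \equiv M$) thins a homogeneous Poisson process of rate $M$ to obtain points with rate $m(t)$:

```python
import numpy as np

def thinned_poisson(m, M, t_max, seed=0):
    """Simulate a nonhomogeneous Poisson process with rate m(t), where m(t) does not
    exceed the constant M, on [0, t_max], by thinning a rate-M homogeneous process."""
    rng = np.random.default_rng(seed)
    points, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / M)       # next proposed point, at rate M
        if t > t_max:
            break
        if rng.uniform() < m(t) / M:        # keep the point with probability m(t)/M
            points.append(t)
    return np.array(points)

# Example: rate m(t) = 1 + sin(t)^2, bounded above by M = 2.
pts = thinned_poisson(lambda t: 1.0 + np.sin(t) ** 2, 2.0, t_max=50.0)
print(len(pts))
```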

Now for a given initial point $(\xi, \theta) \in E$, let $m_i(t) := \lambda_i(\xi + \theta t, \theta)$ for $i = 1, \dots, d$, and suppose we have available continuous functions $M_i(t)$ such that $m_i(t) \leq M_i(t)$ for $i = 1, \dots, d$ and $t \geq 0$. We call these $(M_i)_{i=1}^d$ computational bounds for $(m_i)_{i=1}^d$. We can use Proposition 3.1 to obtain the first switching times $(\tau_i^1)_{i=1}^d$ from a (theoretically infinite) collection of proposed switching times $(\tau_i^1, \tau_i^2, \dots)_{i=1}^d$ given the initial point $(\xi, \theta)$, and use the obtained skeleton point at time $\tau^1 := \min_{i \in \{1, \dots, d\}} \tau_i^1$ as a new initial point (which is allowed by the strong Markov property), with the component $i_0 = \arg\min_{i \in \{1, \dots, d\}} \tau_i^1$ of $\theta$ switched.
The strong Markov property of the Zig-Zag process simplifies the computational procedure further: we can draw for each component $i = 1, \dots, d$ the first proposed switching time $\tau_i := \tau_i^1$, determine $i_0 := \arg\min_{i \in \{1, \dots, d\}} \tau_i$ and decide whether the appropriate component of $\theta$ is switched at this time with probability $m_{i_0}(\tau)/M_{i_0}(\tau)$, where $\tau := \tau_{i_0}$. Then, since $\tau$ is a stopping time for the Markov process, we can use the obtained point of the Zig-Zag process at time $\tau$ as new starting point, regardless of whether we switch a component of $\theta$ at the obtained skeleton point. A full computational procedure for simulating the Zig-Zag process is given by Algorithm 1.
Algorithm 1: Zig-Zag sampling (ZZ)

Input: initial condition $(\xi, \theta) \in E$.
Output: a sequence of skeleton points $(T^k, \Xi^k, \Theta^k)_{k=0}^\infty$.

1. $(T^0, \Xi^0, \Theta^0) := (0, \xi, \theta)$.
2. For $k = 1, 2, \dots$
   (a) Define $m_i(t) := \lambda_i(\Xi^{k-1} + \Theta^{k-1} t, \Theta^{k-1})$ for $t \geq 0$ and $i = 1, \dots, d$.
   (b) For $i = 1, \dots, d$, let $(M_i)$ denote computational bounds for $(m_i)$.
   (c) Draw $\tau_1, \dots, \tau_d$ such that $\mathbb{P}(\tau_i \geq t) = \exp(-\int_0^t M_i(s)\, ds)$.
   (d) $i_0 := \arg\min_{i=1,\dots,d} \{\tau_i\}$ and $\tau := \tau_{i_0}$.
   (e) $(T^k, \Xi^k) := (T^{k-1} + \tau, \Xi^{k-1} + \Theta^{k-1} \tau)$.
   (f) With probability $m_{i_0}(\tau)/M_{i_0}(\tau)$, set $\Theta^k := F_{i_0}[\Theta^{k-1}]$; otherwise set $\Theta^k := \Theta^{k-1}$.
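The following Python sketch is one possible rendering of Algorithm 1 in the special case of constant computational bounds $M_i \equiv c_i$ (the function and example are ours; the user must supply rates $\lambda_i$ and valid constants $c_i$, as in the globally bounded setting of Section 3.2).

```python
import numpy as np

def zigzag_constant_bounds(lam, c, xi0, theta0, num_events, seed=0):
    """Algorithm 1 with constant computational bounds M_i = c_i.
    lam(xi, theta) returns the vector of switching rates lambda_i(xi, theta);
    c is a vector with lambda_i <= c_i along all trajectories."""
    rng = np.random.default_rng(seed)
    T, Xi, Theta = [0.0], [np.array(xi0, float)], [np.array(theta0, float)]
    for _ in range(num_events):
        taus = rng.exponential(1.0 / np.asarray(c))     # proposed switching times, one per component
        i0 = int(np.argmin(taus))
        tau = taus[i0]
        xi_new = Xi[-1] + Theta[-1] * tau               # move along the straight line
        theta_new = Theta[-1].copy()
        if rng.uniform() < lam(xi_new, Theta[-1])[i0] / c[i0]:
            theta_new[i0] = -theta_new[i0]              # accept the proposed switch
        T.append(T[-1] + tau); Xi.append(xi_new); Theta.append(theta_new)
    return np.array(T), np.array(Xi), np.array(Theta)

# Example: product of standard Cauchy densities, Psi(xi) = sum_i log(1 + xi_i^2),
# so lambda_i(xi, theta) = (2 theta_i xi_i / (1 + xi_i^2))^+, bounded by c_i = 1.
lam = lambda xi, th: np.maximum(2.0 * th * xi / (1.0 + xi ** 2), 0.0)
T, Xi, Theta = zigzag_constant_bounds(lam, c=np.ones(2), xi0=[0.0, 0.0],
                                      theta0=[1.0, -1.0], num_events=20000)
print(Xi[-1])
```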

3.1. Computational bounds. We now come to the important issue of obtaining computational bounds for the Zig-Zag process, that is, useful upper bounds for the switching rates $(m_i)$. If we can compute the inverse function $G_i(y) := \inf\{t \geq 0 : H_i(t) \geq y\}$ of $H_i : t \mapsto \int_0^t M_i(s)\, ds$, we can simulate $\tau_1, \dots, \tau_d$ using the CDF inversion technique, which involves drawing i.i.d. uniform random variables $U_1, \dots, U_d$ and setting $\tau_i := G_i(-\log U_i)$, $i = 1, \dots, d$.
Let us ignore the subscript $i$ for a moment. Examples of computational bounds are piecewise affine bounds of the form $M : t \mapsto (a + bt)^+$, with $a, b \in \mathbb{R}$, and constant bounds $M : t \mapsto c$ for $c \geq 0$. It is also possible to simulate using the combined rate $M : t \mapsto \min(c, (a + bt)^+)$. In these cases, $H(t) = \int_0^t M(s)\, ds$ is piecewise linear or quadratic and nondecreasing, so we can obtain an explicit expression for the inverse function $G$.
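For instance, a minimal sketch of this inversion for the affine bound $M(t) = (a + bt)^+$ with $b > 0$, together with the constant-bound case, could look as follows (helper names are ours):

```python
import numpy as np

def inverse_affine(a, b, y):
    """Return G(y) = inf{t >= 0 : H(t) >= y} where H(t) = int_0^t (a + b*s)^+ ds, with b > 0."""
    if a >= 0:
        # H(t) = a*t + b*t^2/2; solve the quadratic for t >= 0.
        return (-a + np.sqrt(a * a + 2.0 * b * y)) / b
    # The rate is zero until t = -a/b, after which H(t) = b*(t + a/b)^2/2.
    return -a / b + np.sqrt(2.0 * y / b)

def inverse_constant(c, y):
    """Return G(y) for the constant bound M(t) = c > 0, i.e., H(t) = c*t."""
    return y / c

# Draw tau with P(tau >= t) = exp(-H(t)) for a = -1, b = 2, via tau = G(-log U).
u = np.random.default_rng(0).uniform()
print(inverse_affine(-1.0, 2.0, -np.log(u)))
```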
The computational bounds are directly related to the algorithmic efficiency of Zig-Zag sampling. From Algorithm 1, it is clear that for every simulated time $\tau$ a single component of $\lambda$ needs to be evaluated, which corresponds by (4) to the evaluation of a single component of the gradient of the negative log density $\Psi$. The magnitude of the computational bounds, $(M_i)$, will determine how far the Zig-Zag process will have moved in the state space before a new evaluation of a component of $\lambda$ is required, and we will pay close attention to the scaling of $M_i$ with respect to the number of available observations in a Bayesian inference setting.

3.2. Example: Globally bounded log density gradient. If there are constants $c_i > 0$ such that $\sup_{\xi \in \mathbb{R}^d} |\partial_i \Psi(\xi)| \leq c_i$, $i = 1, \dots, d$, then we can use the global upper bounds $M_i(t) = c_i$ for $t \geq 0$. Indeed, for $(\xi, \theta) \in E$,
$$\lambda_i(\xi, \theta) = (\theta_i \partial_i \Psi(\xi))^+ \leq |\partial_i \Psi(\xi)| \leq c_i.$$
Algorithm 1 may be used with $M_i \equiv c_i$ for $i = 1, \dots, d$ at every iteration. This situation arises with heavy-tailed distributions; for example, if $\pi$ is Cauchy, then $\Psi(\xi) = \log(1 + \xi^2)$, and consequently $\lambda(\xi, \theta) = \left(\frac{2\theta\xi}{1+\xi^2}\right)^+ \leq 1$.

3.3. Example: Negative log density with dominated Hessian. Another important case is when there exists a positive definite matrix $Q \in \mathbb{R}^{d \times d}$ which dominates the Hessian $H_\Psi(\xi)$ in the positive definite ordering of matrices for every $\xi \in \mathbb{R}^d$. Here, $H_\Psi(\xi) = (\partial_i \partial_j \Psi(\xi))_{i,j=1}^d$ denotes the Hessian of $\Psi$.
Denote the Euclidean inner product in $\mathbb{R}^d$ by $\langle \cdot, \cdot \rangle$. For $p \in [1, \infty]$, the $\ell^p$-norm on $\mathbb{R}^d$ and the induced matrix norm are both denoted by $\|\cdot\|_p$. For symmetric matrices $S, T \in \mathbb{R}^{d \times d}$, we write $S \preceq T$ if $\langle v, Sv \rangle \leq \langle v, Tv \rangle$ for every $v \in \mathbb{R}^d$, or in words, if $T$ dominates $S$ in the positive definite ordering. The key assumption is that $H_\Psi(\xi) \preceq Q$ for all $\xi \in \mathbb{R}^d$, where $Q \in \mathbb{R}^{d \times d}$ is positive definite. In particular, if $\|H_\Psi(\xi)\|_2 \leq c$ for all $\xi$, then this holds for $Q = cI$. We let $(e_i)_{i=1}^d$ denote the canonical basis vectors in $\mathbb{R}^d$.
For an initial value $(\xi, \theta) \in E$, we move along the trajectory $t \mapsto \xi(t) := \xi + \theta t$. Let $a_i$ denote an upper bound for $\theta_i \partial_i \Psi(\xi)$, $i = 1, \dots, d$, and let $b_i := \sqrt{d}\, \|Q e_i\|_2$. For general symmetric matrices $S, T$ with $S \preceq T$, we have for any $v, w \in \mathbb{R}^d$ that
$$(6)\qquad \langle v, Sw \rangle \leq \|v\|_2 \|Sw\|_2 \leq \|v\|_2 \|Tw\|_2.$$
Applying this inequality, we obtain for $i = 1, \dots, d$
$$\theta_i \partial_i \Psi(\xi(t)) = \theta_i \partial_i \Psi(\xi) + \theta_i \int_0^t \sum_{j=1}^d \partial_i \partial_j \Psi(\xi(s))\, \theta_j \, ds \leq a_i + \int_0^t \bigl|\langle H_\Psi(\xi(s)) e_i, \theta \rangle\bigr|\, ds \leq a_i + \int_0^t \|Q e_i\|_2 \|\theta\|_2 \, ds = a_i + b_i t.$$
It thus follows that
$$\lambda_i(\xi(t), \theta) = (\theta_i \partial_i \Psi(\xi(t)))^+ \leq (a_i + b_i t)^+.$$
Hence the general Zig-Zag algorithm may be applied taking
$$M_i(t) := (a_i + b_i t)^+, \quad t \geq 0, \; i = 1, \dots, d,$$
with $a_i$ and $b_i$ as specified above. A complete procedure for Zig-Zag sampling for a log density with dominated Hessian is provided in Algorithm 2.

Algorithm 2: Zig-Zag sampling for a log density with dominated Hessian

Input: initial condition $(\xi, \theta) \in E$.
Output: a sequence of skeleton points $(T^k, \Xi^k, \Theta^k)_{k=0}^\infty$.

1. $(T^0, \Xi^0, \Theta^0) := (0, \xi, \theta)$.
2. $a_i := \theta_i \partial_i \Psi(\xi)$, $i = 1, \dots, d$.
3. $b_i := \sqrt{d}\, \|Q e_i\|_2$, $i = 1, \dots, d$.
4. For $k = 1, 2, \dots$
   (a) Draw $\tau_i$ such that $\mathbb{P}(\tau_i \geq t) = \exp(-\int_0^t (a_i + b_i s)^+\, ds)$, $i = 1, \dots, d$.
   (b) $i_0 := \arg\min_{i \in \{1, \dots, d\}} \tau_i$ and $\tau := \tau_{i_0}$.
   (c) $(T^k, \Xi^k, \Theta^k) := (T^{k-1} + \tau, \Xi^{k-1} + \Theta^{k-1} \tau, \Theta^{k-1})$.
   (d) $a_i := a_i + b_i \tau$, $i = 1, \dots, d$.
   (e) With probability $(\Theta^{k-1}_{i_0} \partial_{i_0} \Psi(\Xi^k))^+ / (a_{i_0})^+$, set $\Theta^k := F_{i_0}[\Theta^{k-1}]$; otherwise set $\Theta^k := \Theta^{k-1}$.
   (f) $a_{i_0} := \Theta^k_{i_0} \partial_{i_0} \Psi(\Xi^k)$ (re-using the earlier computation).

REMARK 3.2. It is also possible to apply inequality (6) in such a way as to obtain the estimate
$$\langle H_\Psi(\xi(s)) e_i, \theta \rangle = \langle e_i, H_\Psi(\xi(s)) \theta \rangle \leq \|e_i\|_2 \|Q\theta\|_2 = \|Q\theta\|_2.$$
This requires us to compute $Q\theta$ whenever $\theta$ changes [a computation of $O(d)$].
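As a small illustration of how the quantities in Algorithm 2 can be initialised, the sketch below (our own helper names; the gradient of $\Psi$ and the dominating matrix $Q$ are assumed to be supplied by the user) computes the coefficients $a_i$ and $b_i$ of the affine bounds:

```python
import numpy as np

def hessian_bound_coeffs(grad_psi, Q, xi, theta):
    """Compute a_i = theta_i * d_i Psi(xi) and b_i = sqrt(d) * ||Q e_i||_2, so that
    M_i(t) = (a_i + b_i t)^+ bounds the switching rate along xi + theta*t whenever
    the Hessian of Psi is dominated by Q (Section 3.3)."""
    a = theta * grad_psi(xi)
    b = np.sqrt(len(xi)) * np.linalg.norm(Q, axis=0)   # ||Q e_i||_2 is the i-th column norm
    return a, b

# Example with Q = c*I (spectral norm of the Hessian bounded by c):
d, c = 3, 2.0
grad_psi = lambda xi: c * xi                            # e.g., Psi(xi) = c * ||xi||^2 / 2
a, b = hessian_bound_coeffs(grad_psi, c * np.eye(d), np.ones(d), np.array([1.0, -1.0, 1.0]))
print(a, b)
```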

4. Big data Bayesian inference by means of error-free sub-sampling. Throughout this section, we assume the derivatives of $\Psi$ admit the representation
$$(7)\qquad \partial_i \Psi(\xi) = \frac{1}{n} \sum_{j=1}^n E_i^j(\xi), \quad i = 1, \dots, d, \; \xi \in \mathbb{R}^d,$$
with $(E^j)_{j=1}^n$ continuous functions mapping $\mathbb{R}^d$ into $\mathbb{R}^d$. The motivation for considering such a class of density functions is the problem of sampling from a posterior distribution for big data. The key feature of such posteriors is that they can be written as the product of a large number of terms. For example, consider the simplest example of this, where we have $n$ independent data points $(x^j)_{j=1}^n$ and for which the likelihood function is $L(\xi) = \prod_{j=1}^n f(x^j|\xi)$, for some probability density or probability mass function $f$. In this case, we can write the negative log density associated with the posterior distribution as an average
$$(8)\qquad \Psi(\xi) = \frac{1}{n} \sum_{j=1}^n \Psi^j(\xi), \quad \xi \in \mathbb{R}^d,$$
where $\Psi^j(\xi) = -\log \pi_0(\xi) - n \log f(x^j|\xi)$, and we could choose $E_i^j(\xi) = \partial_i \Psi^j(\xi)$. It is crucial that every $E_i^j$ is a factor $O(n)$ cheaper to evaluate than the full derivative $\partial_i \Psi(\xi)$.
We will describe two successive improvements over the basic Zig-Zag sampling (ZZ) algorithm, specifically tailored to the situation in which (7) is satisfied. The first improvement consists of a sub-sampling approach where we need to calculate only one of the $E_i^j$s at each simulated time, rather than the sum of all $n$ of them. This sub-sampling approach (referred to as Zig-Zag with Sub-Sampling, ZZ-SS) comes at the cost of an increased computational bound. Our second improvement is to use control variates to reduce this bound, resulting in the Zig-Zag with Control Variates (ZZ-CV) algorithm.

4.1. Main idea. Let $(\xi(t))_{t \geq 0}$ denote a linear trajectory originating in $(\xi, \theta) \in E$, so $\xi(t) = \xi + \theta t$. Define a collection of switching rates along the trajectory $(\xi(t))$ by
$$m_i^j(t) := (\theta_i E_i^j(\xi(t)))^+, \quad i = 1, \dots, d, \; j = 1, \dots, n, \; t \geq 0.$$
We will make use of computational bounds $(M_i)$ as before, which this time bound $(m_i^j)$ uniformly. Let $M_i : \mathbb{R}_+ \to \mathbb{R}_+$ be continuous and satisfy
$$(9)\qquad m_i^j(t) \leq M_i(t) \quad \text{for all } i = 1, \dots, d, \; j = 1, \dots, n, \text{ and } t \geq 0.$$
We will generate random times according to the computational upper bounds $(M_i)$ as before. However, we now use a two-step approach to deciding whether or not to switch at the generated times. As before, for $i = 1, \dots, d$ let $(\tau_i)_{i=1}^d$ be simulated random times for which $\mathbb{P}(\tau_i \geq t) = \exp(-\int_0^t M_i(s)\, ds)$, and let $i_0 := \arg\min_{i \in \{1, \dots, d\}} \tau_i$ and $\tau := \tau_{i_0}$. Then switch component $i_0$ of $\theta$ with probability $m_{i_0}^J(\tau)/M_{i_0}(\tau)$, where $J \in \{1, \dots, n\}$ is drawn uniformly at random, independently of $\tau$. This "sub-sampling" procedure is detailed in Algorithm 3. Depending on the choice of $E_i^j$, we will refer to this algorithm as Zig-Zag with Sub-Sampling (ZZ-SS, Section 4.2) or Zig-Zag with Control Variates (ZZ-CV, Section 4.3).

THEOREM 4.1. Algorithm 3 generates a skeleton of a Zig-Zag process with switching rates given by
$$(10)\qquad \lambda_i(\xi, \theta) = \frac{1}{n} \sum_{j=1}^n (\theta_i E_i^j(\xi))^+, \quad i = 1, \dots, d, \; (\xi, \theta) \in E,$$
and invariant distribution $\mu$ given by (3).

Algorithm 3: Zig-Zag with Sub-Sampling (ZZ-SS) / Zig-Zag with Control Variates (ZZ-CV)

Input: initial condition $(\xi, \theta) \in E$.
Output: a sequence of skeleton points $(T^k, \Xi^k, \Theta^k)_{k=0}^\infty$.

1. $(T^0, \Xi^0, \Theta^0) := (0, \xi, \theta)$.
2. For $k = 1, 2, \dots$
   (a) Define $m_i^j(t) := (\Theta_i^{k-1} E_i^j(\Xi^{k-1} + \Theta^{k-1} t))^+$ for $t \geq 0$, $i = 1, \dots, d$ and $j = 1, \dots, n$.
   (b) For $i = 1, \dots, d$, let $(M_i)$ denote computational bounds for $(m_i^j)$, satisfying (9).
   (c) Draw $\tau_1, \dots, \tau_d$ such that $\mathbb{P}(\tau_i \geq t) = \exp(-\int_0^t M_i(s)\, ds)$.
   (d) $i_0 := \arg\min_{i=1,\dots,d} \tau_i$ and $\tau := \tau_{i_0}$.
   (e) $(T^k, \Xi^k) := (T^{k-1} + \tau, \Xi^{k-1} + \Theta^{k-1} \tau)$.
   (f) Draw $J \sim \mathrm{Uniform}(\{1, \dots, n\})$.
   (g) With probability $m_{i_0}^J(\tau)/M_{i_0}(\tau)$, set $\Theta^k := F_{i_0}[\Theta^{k-1}]$; otherwise set $\Theta^k := \Theta^{k-1}$.

PROOF. Conditional on $\tau$, the probability that component $i_0$ of $\theta$ is switched at time $\tau$ is seen to be
$$\mathbb{E}_J\bigl[m_{i_0}^J(\tau)/M_{i_0}(\tau)\bigr] = \frac{\frac{1}{n}\sum_{j=1}^n m_{i_0}^j(\tau)}{M_{i_0}(\tau)} = \frac{m_{i_0}(\tau)}{M_{i_0}(\tau)},$$
where
$$m_i(t) := \frac{1}{n} \sum_{j=1}^n m_i^j(t) = \frac{1}{n} \sum_{j=1}^n (\theta_i E_i^j(\xi(t)))^+, \quad i = 1, \dots, d, \; t \geq 0.$$
By Proposition 3.1, we thus have an effective switching rate $\lambda_i$ for switching the $i$th component of $\theta$ given by (10). Finally, we verify that the switching rates $(\lambda_i)$ given by (10) satisfy (2). Indeed,
$$\lambda_i(\xi, \theta) - \lambda_i(\xi, F_i[\theta]) = \frac{1}{n} \sum_{j=1}^n \left[(\theta_i E_i^j(\xi))^+ - (\theta_i E_i^j(\xi))^-\right] = \frac{1}{n} \sum_{j=1}^n \theta_i E_i^j(\xi) = \theta_i \partial_i \Psi(\xi).$$
By Theorem 2.2, the Zig-Zag process has the stated invariant distribution. □

The important advantage of using Zig-Zag in combination with sub-sampling is that at every iteration of the algorithm we only have to evaluate a single component of $E^j$, which reduces the algorithmic complexity by a factor $O(n)$. However, this may come at a cost. First, the computational bounds $(M_i)$ may have to be increased, which in turn will increase the algorithmic complexity of simulating the Zig-Zag sampler. Also, the dynamics of the Zig-Zag process will change, because the actual switching rates of the process are increased. This increases the diffusivity of the continuous-time Markov process, and affects the mixing properties in a negative way.

4.2. Zig-Zag with Sub-Sampling (ZZ-SS) for globally bounded log density gradient. A straightforward application of sub-sampling is possible if we have (8) with $\nabla \Psi^j$ globally bounded, so that there exist positive constants $(c_i)$ such that
$$(11)\qquad |\partial_i \Psi^j(\xi)| \leq c_i, \quad i = 1, \dots, d, \; j = 1, \dots, n, \; \xi \in \mathbb{R}^d.$$
In this case, we may take
$$E_i^j := \partial_i \Psi^j \quad \text{and} \quad M_i(t) := c_i, \quad i = 1, \dots, d, \; j = 1, \dots, n, \; t \geq 0,$$
so that (9) is satisfied. The corresponding version of Algorithm 3 will be called Zig-Zag with Sub-Sampling (ZZ-SS).
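A minimal sketch of ZZ-SS with constant bounds follows (our own function names; the per-observation gradients $\partial_i \Psi^j$ and constants $c_i$ satisfying (11) are assumed to be supplied by the user):

```python
import numpy as np

def zigzag_subsampling(grad_psi_j, n, c, xi0, theta0, num_events, seed=0):
    """Zig-Zag with Sub-Sampling (ZZ-SS) using constant bounds M_i = c_i.
    grad_psi_j(j, xi) returns the d-vector of partial derivatives of Psi^j at xi,
    and |d_i Psi^j(xi)| must not exceed c_i for all i, j, xi (condition (11))."""
    rng = np.random.default_rng(seed)
    T, Xi, Theta = [0.0], [np.array(xi0, float)], [np.array(theta0, float)]
    for _ in range(num_events):
        taus = rng.exponential(1.0 / np.asarray(c))
        i0 = int(np.argmin(taus)); tau = taus[i0]
        xi_new = Xi[-1] + Theta[-1] * tau
        theta_new = Theta[-1].copy()
        j = rng.integers(n)                                    # uniform sub-sample index
        rate_j = max(Theta[-1][i0] * grad_psi_j(j, xi_new)[i0], 0.0)
        if rng.uniform() < rate_j / c[i0]:                     # switch with probability m^j/M
            theta_new[i0] = -theta_new[i0]
        T.append(T[-1] + tau); Xi.append(xi_new); Theta.append(theta_new)
    return np.array(T), np.array(Xi), np.array(Theta)
```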

4.3. Zig-Zag with Control Variates (ZZ-CV). Suppose again that $\Psi$ admits the representation (8), and further suppose that the derivatives $(\partial_i \Psi^j)$ are globally and uniformly Lipschitz, that is, there exist constants $(C_i)_{i=1}^d$ such that for some $p \in [1, \infty]$ and all $i = 1, \dots, d$, $j = 1, \dots, n$ and $\xi_1, \xi_2 \in \mathbb{R}^d$,
$$(12)\qquad |\partial_i \Psi^j(\xi_1) - \partial_i \Psi^j(\xi_2)| \leq C_i \|\xi_1 - \xi_2\|_p.$$
To use these Lipschitz bounds, we need to choose a reference point $\xi^\star$ in $\xi$-space, so that we can bound the derivative of the log density based on how close we are to this reference point. Now if we choose any fixed reference point, $\xi^\star \in \mathbb{R}^d$, we can use a control variate idea to write
$$\partial_i \Psi(\xi) = \partial_i \Psi(\xi^\star) + \frac{1}{n} \sum_{j=1}^n \bigl(\partial_i \Psi^j(\xi) - \partial_i \Psi^j(\xi^\star)\bigr), \quad \xi \in \mathbb{R}^d, \; i = 1, \dots, d.$$
This suggests using
$$E_i^j(\xi) := \partial_i \Psi(\xi^\star) + \partial_i \Psi^j(\xi) - \partial_i \Psi^j(\xi^\star), \quad \xi \in \mathbb{R}^d, \; i = 1, \dots, d, \; j = 1, \dots, n.$$
The reason for defining $E_i^j(\xi)$ in this manner is to try and reduce its variability as we vary $j$. By the Lipschitz condition, we have $|E_i^j(\xi)| \leq |\partial_i \Psi(\xi^\star)| + C_i \|\xi - \xi^\star\|_p$, and thus the variability of the $E_i^j(\xi)$s will be small if (1) the reference point $\xi^\star$ is close to the mode of the posterior and (2) $\xi$ is close to $\xi^\star$. Under standard asymptotics, we expect a draw from the posterior for $\xi$ to be $O_p(n^{-1/2})$ from the posterior mode. Thus if we have a procedure for finding a reference point $\xi^\star$ which is within $O(n^{-1/2})$ of the posterior mode, then this would ensure $\|\xi - \xi^\star\|_2$ is $O_p(n^{-1/2})$ if $\xi$ is drawn from the posterior. For such a choice of $\xi^\star$, we would have $\partial_i \Psi(\xi^\star)$ of $O_p(n^{1/2})$.
Using the Lipschitz condition, we can now obtain computational bounds of $(m_i)$ for a trajectory $\xi(t) := \xi + \theta t$ originating in $(\xi, \theta)$. Define
$$M_i(t) := a_i + b_i t, \quad t \geq 0, \; i = 1, \dots, d,$$
where $a_i := (\theta_i \partial_i \Psi(\xi^\star))^+ + C_i \|\xi - \xi^\star\|_p$ and $b_i := C_i d^{1/p}$. Then (9) is satisfied. Indeed, using Lipschitz continuity of $y \mapsto (y)^+$,
$$m_i^j(t) = (\theta_i E_i^j(\xi + \theta t))^+ = \bigl(\theta_i \partial_i \Psi(\xi^\star) + \theta_i \partial_i \Psi^j(\xi + \theta t) - \theta_i \partial_i \Psi^j(\xi^\star)\bigr)^+$$
$$\leq (\theta_i \partial_i \Psi(\xi^\star))^+ + |\partial_i \Psi^j(\xi) - \partial_i \Psi^j(\xi^\star)| + |\partial_i \Psi^j(\xi + \theta t) - \partial_i \Psi^j(\xi)|$$
$$\leq (\theta_i \partial_i \Psi(\xi^\star))^+ + C_i\bigl(\|\xi - \xi^\star\|_p + t\|\theta\|_p\bigr) = M_i(t).$$

Implementing this scheme requires some preprocessing of the data. First, we need a way of choosing a suitable reference point $\xi^\star$, for example by finding a value close to the mode using an approximate or exact numerical optimization routine. The complexity of this operation will be $O(n)$. Once we have found such a reference point, we have a one-off $O(n)$ cost of calculating $\partial_i \Psi(\xi^\star)$ for each $i = 1, \dots, d$. However, once we have paid this upfront computational cost, the resulting Zig-Zag sampler can be super-efficient. This is discussed in more detail in Section 5, and demonstrated empirically in Section 6. The version of Algorithm 3 resulting from this choice of $E_i^j$ and $M_i$ will be called Zig-Zag with Control Variates (ZZ-CV).
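The following sketch (hypothetical helper names) shows the ZZ-CV-specific ingredients: the one-off preprocessing at the reference point $\xi^\star$, the control-variate estimator $E^j$, and the affine bound coefficients $a_i$, $b_i$; drawing proposed switching times from $M_i(t) = a_i + b_i t$ can then reuse the affine inversion of Section 3.1.

```python
import numpy as np

def zzcv_precompute(grad_psi_j, n, xi_star):
    """One-off O(n) preprocessing: full gradient of Psi at the reference point xi_star."""
    return sum(grad_psi_j(j, xi_star) for j in range(n)) / n

def zzcv_estimator(grad_psi_j, grad_star, xi_star, j, xi):
    """Control-variate estimator E^j(xi) = grad Psi(xi_star) + grad Psi^j(xi) - grad Psi^j(xi_star)."""
    return grad_star + grad_psi_j(j, xi) - grad_psi_j(j, xi_star)

def zzcv_bound_coeffs(grad_star, xi_star, xi, theta, C, p=2):
    """Affine bounds M_i(t) = a_i + b_i t, valid under the Lipschitz condition (12):
    a_i = (theta_i * d_i Psi(xi_star))^+ + C_i * ||xi - xi_star||_p,  b_i = C_i * d^(1/p)."""
    d = len(xi)
    dist = np.linalg.norm(xi - xi_star, ord=p)
    a = np.maximum(theta * grad_star, 0.0) + C * dist
    b = C * d ** (1.0 / p)
    return a, b
```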

REMARK 4.2. When choosing $p \geq 1$, there will be a trade-off between the magnitude of $C_i$ and of $\|\xi - \xi^\star\|_p$, which may influence the scaling of Zig-Zag sampling with dimension. We will see in Section 6.3 that for i.i.d. Gaussian components, the choice $p = \infty$ is optimal. When the situation is less clear, choosing the Euclidean norm ($p = 2$) is a reasonable choice.

5. Scaling analysis. In this section, we provide an informal scaling argument for canonical Zig-Zag, and for Zig-Zag with control variates and sub-sampling. For the moment, fix $n \in \mathbb{N}$ and consider a posterior with negative log density
$$\Psi(\xi) = -\sum_{j=1}^n \log f(x^j \mid \xi),$$
where the $x^j$ are drawn i.i.d. from $f(x^j \mid \xi_0)$. Let $\widehat{\xi}$ denote the maximum likelihood estimator (MLE) for $\xi$ based on data $x^1, \dots, x^n$. Introduce the coordinate transformation
$$\phi(\xi) = \sqrt{n}(\xi - \widehat{\xi}), \qquad \xi(\phi) = \frac{1}{\sqrt{n}}\phi + \widehat{\xi}.$$
As $n \to \infty$, the posterior distribution in terms of $\phi$ will converge to a multivariate Gaussian distribution with mean 0 and covariance matrix given by the inverse of the expected information $i(\xi_0)$; see, for example, Johnson (1970).

5.1. Scaling of Zig-Zag sampling (ZZ). First, let us obtain a Taylor expansion of the switching rate for $\xi$ close to $\widehat{\xi}$. We have
$$\partial_{\xi_i} \Psi(\xi) = -\sum_{j=1}^n \partial_{\xi_i} \log f(x^j \mid \xi) = \underbrace{-\sum_{j=1}^n \partial_{\xi_i} \log f(x^j \mid \widehat{\xi})}_{=0} - \sum_{j=1}^n \sum_{k=1}^d \partial_{\xi_i} \partial_{\xi_k} \log f(x^j \mid \widehat{\xi})\,(\xi_k - \widehat{\xi}_k) + O(\|\xi - \widehat{\xi}\|^2).$$
The first term vanishes by the definition of the MLE. Expressed in terms of $\phi$, the switching rates are
$$(\theta_i \partial_{\xi_i} \Psi(\xi(\phi)))^+ = \underbrace{\frac{1}{\sqrt{n}}\left(-\theta_i \sum_{j=1}^n \sum_{k=1}^d \partial_{\xi_i} \partial_{\xi_k} \log f(x^j \mid \widehat{\xi})\, \phi_k\right)^+}_{O(\sqrt{n})} + O\left(\frac{\|\phi\|^2}{n}\right).$$
With respect to the coordinate $\phi$, the canonical Zig-Zag process has constant speed $\sqrt{n}$ in each coordinate, and by the above computation, a switching rate of $O(\sqrt{n})$. After a rescaling of the time parameter by a factor $\sqrt{n}$, the process in the $\phi$-coordinate becomes a Zig-Zag process with unit speed in every direction and switching rates
$$\left(-\frac{\theta_i}{n} \sum_{j=1}^n \sum_{k=1}^d \partial_{\xi_i} \partial_{\xi_k} \log f(x^j \mid \widehat{\xi})\, \phi_k\right)^+ + O(n^{-1/2}).$$
If we let $n \to \infty$, the switching rates converge almost surely to those of a Zig-Zag process with switching rates
$$\lambda_i(\phi, \theta) = \bigl(\theta_i (i(\xi_0)\phi)_i\bigr)^+,$$

where i(θ0 ) denotes the expected information. These switching rates correspond
to the limiting Gaussian distribution with covariance matrix i(θ0 )−1 .
In this limiting Zig-Zag process, all dependence on $n$ has vanished. Starting from equilibrium, we require a time interval of $O(1)$ (in the rescaled time) to obtain an essentially independent sample. In the original time scale, this corresponds to a time interval of $O(n^{-1/2})$. As long as the computational bound in the Zig-Zag algorithm is $O(n^{1/2})$, this can be achieved using $O(1)$ proposed switches. The computational cost for every proposed switch is $O(n)$, because the full data $(x^j)_{j=1}^n$ needs to be processed in the computation of the true switching rate at the proposed switching time.
We conclude that the computational complexity of the Zig-Zag (ZZ) algorithm per independent sample is $O(n)$, provided that the computational bound is $O(n^{1/2})$. This is the best we can expect for any standard Monte Carlo algorithm [where we will have an $O(1)$ number of iterations, but each iteration is $O(n)$ in computational cost].
To compare, if the computational bound is $O(n^\alpha)$ for some $\alpha > 1/2$, then we require $O(n^{\alpha - 1/2})$ proposed switches before we have simulated a total time interval of length $O(n^{-1/2})$, so that, with a complexity of $O(n)$ per proposed switching time, the Zig-Zag algorithm has total computational complexity $O(n^{\alpha + 1/2})$. So, for example, with global bounds we have that the computational bound is $O(n)$ [as each term in the log density is $O(1)$], and hence ZZ will have total computational complexity of $O(n^{3/2})$.

EXAMPLE 5.1 (Dominated Hessian). Consider Algorithm 2 in the one-dimensional case, with the second derivative of $\Psi$ bounded from above by $Q > 0$. We have $Q = O(n)$ as $\Psi$ is the sum of $n$ terms of $O(1)$. The value of $b$ is kept fixed at the value $b = Q = O(n)$. Next, $a$ is given initially as
$$a = \theta \Psi'(\xi) \leq \underbrace{\theta \Psi'(\widehat{\xi})}_{=0} + \underbrace{Q}_{O(n)} \underbrace{|\xi - \widehat{\xi}|}_{O(n^{-1/2})} = O(n^{1/2}),$$
and increased by $b\tau$ until a switch happens and $a$ is reset to $\theta \Psi'(\xi)$. Because of the initial value for $a$, switches will occur at rate $O(n^{1/2})$, so that $\tau$ will be $O(n^{-1/2})$, and the value of $a$ will remain $O(n^{1/2})$. Hence the magnitude of the computational bound $M(t) = (a + bt)^+$ is $O(n^{1/2})$.

5.2. Scaling of Zig-Zag with Control Variates (ZZ-CV). Now we will study the limiting behaviour as $n \to \infty$ of ZZ-CV introduced in Section 4.3. In determining the computational bounds, we take $p = 2$ for simplicity, for example, in (12). Also for simplicity, assume that $\xi \mapsto \partial_{\xi_i} \log f(x^j \mid \xi)$ has Lipschitz constant $k_i$ (independent of $j = 1, \dots, n$) and write $C_i = n k_i$, so that (12) is satisfied. In practice, there may be a logarithmic increase with $n$ in the Lipschitz constants $k_i$, as we have to take a global bound in $n$. For the present discussion, we ignore such logarithmic factors. We assume reference points $\xi^\star$ for growing $n$ are determined in such a way that $\|\xi^\star - \widehat{\xi}\|_2$ is $O(n^{-1/2})$. For definiteness, suppose there exists a $d$-dimensional random variable $Z$ such that $n^{1/2}(\xi^\star - \widehat{\xi}) \to Z$ in distribution, with the randomness in $Z$ independent of $(x^j)_{j=1}^\infty$.
We can look at ZZ-CV with respect to the scaled coordinate $\phi$ as $n \to \infty$. Denote the reference point for the rescaled parameter as $\phi^\star := \sqrt{n}(\xi^\star - \widehat{\xi})$.
The essential quantities to consider are the switching rate estimators $E_i^j$. We estimate
$$|E_i^j(\xi)| = |\partial_{\xi_i}\Psi(\xi^\star) + \partial_{\xi_i}\Psi^j(\xi) - \partial_{\xi_i}\Psi^j(\xi^\star)| = |\partial_{\xi_i}\Psi(\xi^\star) - \partial_{\xi_i}\Psi(\widehat{\xi}) + \partial_{\xi_i}\Psi^j(\xi) - \partial_{\xi_i}\Psi^j(\xi^\star)| \leq \underbrace{C_i}_{O(n)} \underbrace{\|\xi^\star - \widehat{\xi}\|}_{O(n^{-1/2})} + \underbrace{C_i}_{O(n)} \underbrace{\|\xi - \xi^\star\|}_{O(n^{-1/2})}.$$
We find that $|E_i^j(\xi)| = O(n^{1/2})$ under the stationary distribution.

By slowing down the Zig-Zag process in $\phi$-space by a factor $n$, the continuous-time process generated by ZZ-CV will approach a limiting Zig-Zag process with a certain switching rate of $O(1)$. In general, this switching rate will depend on the way that $\xi^\star$ is obtained. To simplify the exposition, in the following computation we assume $\xi^\star = \widehat{\xi}$. Rescaling by $n^{-1/2}$, and developing a Taylor approximation around $\widehat{\xi}$,
$$n^{-1/2} E_i^j(\xi) = n^{-1/2}\bigl(\partial_{\xi_i}\Psi^j(\xi) - \partial_{\xi_i}\Psi^j(\widehat{\xi})\bigr) = n^{-1/2}\bigl(-n\,\partial_{\xi_i}\log f(x^j \mid \xi) + n\,\partial_{\xi_i}\log f(x^j \mid \widehat{\xi})\bigr)$$
$$= -n^{1/2}\sum_{k=1}^d \partial_{\xi_i}\partial_{\xi_k}\log f(x^j \mid \widehat{\xi})\,(\xi_k - \widehat{\xi}_k) + O\bigl(n^{1/2}\|\xi - \widehat{\xi}\|^2\bigr) = -\sum_{k=1}^d \partial_{\xi_i}\partial_{\xi_k}\log f(x^j \mid \widehat{\xi})\,\phi_k + O(n^{-1/2}).$$
By Theorem 4.1, the rescaled effective switching rate for ZZ-CV is given by
$$\lambda_i(\phi, \theta) := n^{-1/2}\lambda_i(\xi(\phi), \theta) = \frac{1}{n^{3/2}}\sum_{j=1}^n \bigl(\theta_i E_i^j(\xi(\phi))\bigr)^+ = \frac{1}{n}\sum_{j=1}^n \left(-\theta_i \sum_{k=1}^d \partial_{\xi_i}\partial_{\xi_k}\log f(x^j \mid \widehat{\xi})\,\phi_k + O(n^{-1/2})\right)^+$$
$$\to \mathbb{E}\left(-\theta_i \sum_{k=1}^d \partial_{\xi_i}\partial_{\xi_k}\log f(X \mid \xi_0)\,\phi_k\right)^+,$$

where $\mathbb{E}$ denotes expectation with respect to $X$, with density $f(\cdot \mid \xi_0)$, and the convergence is a consequence of the law of large numbers. If $\xi^\star$ is not exactly equal to $\widehat{\xi}$, the limiting form of $\lambda_i(\phi, \theta)$ will be different, but the important point is that it will be $O(1)$, which follows from the bound on $|E_i^j|$ above.
Just as with ZZ, the rescaled Zig-Zag process underlying ZZ-CV converges to a limiting Zig-Zag process with switching rate $\lambda_i(\phi, \theta)$. Since the computational bounds of ZZ-CV are $O(n^{1/2})$, a completely analogous reasoning to the one for the ZZ algorithm above (Section 5.1) leads to the conclusion that $O(1)$ proposed switches are required to obtain an independent sample. However, in contrast with the ZZ algorithm, the ZZ-CV algorithm is designed in such a way that the computational cost per proposed switch is $O(1)$.
We conclude that the computational complexity of the ZZ-CV algorithm is O(1)
per independent sample. This provides a factor n increase in efficiency over stan-
dard MCMC algorithms, resulting in an asymptotically unbiased algorithm for
which the computational cost of obtaining an independent sample does not de-
pend on the size of the data.

5.3. Remarks. The arguments above assume we are at stationarity, and how quickly the two algorithms converge is not immediately clear. Note, however, that for sub-sampling Zig-Zag it is possible to choose the reference point $\xi^\star$ as the starting point, thus avoiding many of the issues surrounding convergence.
In some sense, the good computational scaling of ZZ-CV is leveraging the
asymptotic normality of the posterior, but in such a way that ZZ-CV always sam-
ples from the true posterior. Thus when the posterior is close to Gaussian it will be
quick; when it is far from Gaussian it may well be slower but will still be “correct”.
This is fundamentally different from other algorithms (e.g., Bardenet, Doucet and
Holmes (2017), Neiswanger, Wang and Xing (2013), Scott et al. (2016)) that utilise
the asymptotic normality in terms of justifying their approximation to the poste-
rior. Such algorithms are accurate if the posterior is close to Gaussian, but may be
inaccurate otherwise, and it is often impossible to quantify the size of the approx-
imation in practice.

6. Examples and experiments.

6.1. Sampling and integration along Zig-Zag trajectories. There are essen-
tially two different ways of using the Zig-Zag skeleton points which we obtain by
using, for example, Algorithms 1, 2 or 3.
The first possible approach is to collect a number of samples along the trajectories. Suppose we have simulated the Zig-Zag process up to time $\tau > 0$, and we wish to collect $m$ samples. This can be achieved by setting $t_i = i\tau/m$, and setting $\Xi^i := \Xi(t_i)$ for $i = 1, \dots, m$, with the continuous-time trajectory $(\Xi(t))$ defined as in Section 2.1. In order to approximate $\pi(f)$ numerically for some function $f : \mathbb{R}^d \to \mathbb{R}$ of interest, we can use the usual ergodic average
$$\widehat{\pi}(f) := \frac{1}{m} \sum_{i=1}^m f(\Xi^i).$$
We can also estimate posterior quantiles by using the quantiles of the sample $\Xi^1, \dots, \Xi^m$, as with standard MCMC output. An issue with this approach is that we have to decide on the number, $m$, of samples we wish to use. Whilst the more samples we use, the greater the accuracy of our approximation to $\pi(f)$, this comes at an increased computational and storage cost. The trade-off in choosing an appropriate value for $m$ is equivalent to the choice of how much to thin output from a standard MCMC algorithm.
It is important that one does not make the mistake of using the switching points
of the Zig-Zag process as samples, as these points are not distributed according
to π . In particular, the switching points are biased towards the tails of the target
distribution.
An alternative approach is intrinsically related to the continuous-time and piecewise linear nature of the Zig-Zag trajectories. This approach consists of continuous-time integration of the Zig-Zag process. By the continuous-time ergodic theorem, for $f$ as above, $\pi(f)$ can be estimated as
$$\widehat{\pi}(f) = \frac{1}{\tau} \int_0^\tau f(\Xi(s))\, ds.$$
Since the output of the Zig-Zag algorithms consists of a finite number of skeleton points $(T^i, \Xi^i, \Theta^i)_{i=0}^k$, we can express this as
$$\widehat{\pi}(f) = \frac{1}{T^k} \sum_{i=1}^k \int_{T^{i-1}}^{T^i} f\bigl(\Xi^{i-1} + \Theta^{i-1}(s - T^{i-1})\bigr)\, ds.$$
Due to the piecewise linearity of $\Xi(t)$, in many cases these integrals can be computed exactly, for example, for the moments, $f(x) = x^p$, $p \in \mathbb{R}$. In cases where the integral cannot be computed exactly, numerical quadrature rules can be applied. An advantage of this method is that we do not have to make an arbitrary decision on the number of samples to extract from the trajectory.
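Both post-processing approaches are easy to implement from the skeleton; the sketch below (our own helper names) extracts equally spaced samples by linear interpolation and computes the exact time-average of $f(x) = x$ along a one-dimensional piecewise linear trajectory.

```python
import numpy as np

def samples_from_skeleton(T, Xi, m):
    """Extract m equally spaced samples Xi(t_i), t_i = i*tau/m, from a 1D skeleton (T, Xi);
    between skeleton points the trajectory is linear, so interpolation is exact."""
    t = np.arange(1, m + 1) * T[-1] / m
    return np.interp(t, T, Xi)

def time_average_first_moment(T, Xi):
    """Exact continuous-time average of f(x) = x along the piecewise linear trajectory:
    the integral over each segment equals the segment midpoint times its duration."""
    dt = np.diff(T)
    midpoints = 0.5 * (Xi[:-1] + Xi[1:])
    return np.sum(dt * midpoints) / T[-1]
```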

6.2. Beating one ESS per epoch. We use the term epoch as a unit of com-
putational cost, corresponding to the number of iterations required to evaluate
the complete gradient of log π . This means that for the basic Zig-Zag algorithm
(without sub-sampling), an epoch consists of exactly one iteration, and for the
sub-sampled variants of the Zig-Zag algorithm, an epoch consists of n iterations.
The CPU running times per epoch of the various algorithms we consider are
equal up to a constant factor. To assess the scaling of various algorithms, we
use ESS per epoch. The notion of ESS is discussed in the Supplementary Ma-

terial [Bierkens, Fearnhead and Roberts (2019), Section 2]. Consider any classical
MCMC algorithm based upon the Metropolis–Hastings acceptance rule. Since ev-
ery iteration requires an evaluation of the full density function to compute the
acceptance probability, we have that the ESS per epoch for such an algorithm is
bounded from above by one. Similar observations apply to all other known MCMC
algorithms capable of sampling asymptotically from the exact target distribution.
There do exist several conceptual innovations based on the idea of sub-
sampling, which have some theoretical potential to overcome the fundamental lim-
itation of one ESS per epoch sketched above.
The Pseudo-Marginal Method (PMM, Andrieu and Roberts (2009)) is based
upon using a positive unbiased estimator for a possibly unnormalized density. Ob-
taining an unbiased estimator of a product is much more difficult than obtaining
one for a sum. Furthermore, it has been shown to be impossible to construct an
estimator that is guaranteed to be positive without other information about the
product, such as a bound on the terms in the product (Jacob and Thiery (2015)).
Therefore, the PMM does not apply in a straightforward way to vanilla MCMC in
Bayesian inference.
In the Supplementary Material (Bierkens, Fearnhead and Roberts (2019), Sec-
tion 3) we analyse the scaling of Stochastic Gradient Langevin Dynamics (SGLD,
Welling and Teh (2011)) in an analogous fashion to the analysis of ZZ and ZZ-CV
in Section 5. From this analysis, we conclude that it is in general not possible to
implement SGLD in such a way that the ESSpE has a larger order of magnitude
than O(1). We compare SGLD to Zig-Zag in experiments of Sections 6.3 and 6.5.

6.3. Mean of a Gaussian distribution. Consider the illustrative problem of es-


timating the mean of a Gaussian distribution. This problem has the advantage that
it allows for an analytical solution which can be compared with the numerical
solutions obtained by Zig-Zag sampling and other methods. Conditional on a one-
dimensional parameter ξ, the data are assumed to be i.i.d. from N(ξ, σ^2). Furthermore, a N(0, 1/ρ^2) prior on ξ is specified. Data are generated from the true distribution N(ξ_0, σ^2) for some fixed ξ_0. For a detailed description of the experiment
and computational bounds, see Section 4 of the Supplementary Material.
In this experiment, we compare the mean square error (MSE) for several al-
gorithms, namely basic Zig-Zag (ZZ), Zig-Zag with Control Variates (ZZ-CV),
Zig-Zag with Control Variates with a “sub-optimal” reference point (ZZ-soCV)
and Stochastic Gradient Langevin Dynamics (SGLD). SGLD is implemented with
fixed step size, as is usually done in practice; see, for example, Vollmer, Zygalakis
and Teh (2016), with the added benefit that it makes the comparison with the Zig-
Zag algorithms more straightforward. Here, in basic Zig-Zag we pretend that every
iteration requires the evaluation of n observations (whereas in practice, we can pre-
compute ξ_MAP).
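The analytical solution referred to above is the conjugate Gaussian posterior; the following minimal sketch (R, illustrative function name) gives the exact posterior moments that serve as the reference values for the MSE.

```r
# Sketch: exact posterior for the conjugate Gaussian model of Section 6.3,
# with prior N(0, 1/rho^2) and i.i.d. likelihood N(xi, sigma^2).
exact_posterior <- function(x, sigma, rho) {
  prec <- rho^2 + length(x) / sigma^2      # posterior precision
  mu   <- sum(x) / sigma^2 / prec          # posterior mean
  list(mean = mu, var = 1 / prec, second_moment = 1 / prec + mu^2)
}
```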
FIG. 2. Mean square error (MSE) in the first and second moment as a function of the number of epochs, based on n = 100 or n = 10,000 observations, for a one-dimensional Gaussian posterior distribution (Section 6.3). Displayed are SGLD (green), ZZ-CV (magenta), ZZ-soCV (dark magenta), ZZ (black). The displayed dots represent averages over experiments based on randomly generated data from the true distribution. Parameter values (see Bierkens, Fearnhead and Roberts (2019), Section 4) are ξ_0 = 1 (the true value of the mean parameter), ρ = 1, σ = 1 and c_1 = 1, c_2 = 1/100 (for the SGLD parameters; see the Supplementary Material, Bierkens, Fearnhead and Roberts (2019), Section 3). The reference value of ξ used by ZZ-soCV is based on a sub-sample of size m = n/10, so that it will not be equal to the exact maximizer of the posterior. For an honest comparison, trajectories of all algorithms have initial condition equal to ξ_MAP.

Results for this experiment are displayed in Figure 2. The MSE for the second moment using SGLD does not decrease beyond a fixed value, indicating the
presence of bias in SGLD. This bias does not appear in the different versions of
Zig-Zag sampling, agreeing with the theoretical result that ergodic averages over
Zig-Zag trajectories are consistent. Furthermore, we see a significant relative in-
crease in efficiency for ZZ-(so)CV over basic ZZ when the number of observations
is increased, agreeing with the scaling results of Section 5. A poor choice of refer-
ence point (as in ZZ-soCV) is seen to have only a small effect on the efficiency.

6.4. Logistic regression. In this numerical experiment, we compare how the ESS per epoch (ESSpE) and ESS per second grow with the number of observa-
tions n for several Zig-Zag algorithms and the MALA algorithm when applied to
a logistic regression problem. Conditional on a d-dimensional parameter ξ and
given d-dimensional covariates x^j ∈ R^d, where j = 1, . . . , n, and with x_1^j = 1 for all j, the binary variable y^j ∈ {0, 1} has distribution
$$\mathbb{P}\bigl(y^j = 1 \mid x_1^j, \ldots, x_d^j, \xi_1, \ldots, \xi_d\bigr) = \frac{1}{1 + \exp\bigl(-\sum_{i=1}^{d} \xi_i x_i^j\bigr)}.$$
Combined with a flat prior distribution, this induces a posterior distribution for ξ given observations of (x^j, y^j) for j = 1, . . . , n; see the Supplementary Material for implementational details (Bierkens, Fearnhead and Roberts (2019), Section 5).
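The Zig-Zag switching rates for this model are driven by the partial derivatives of the log-posterior; with the flat prior these take a simple form, sketched below in R (illustrative names, not the Rcpp implementation of Bierkens (2017)).

```r
# Sketch: gradient of the log-posterior for the logistic regression model of
# Section 6.4 with a flat prior. X is the n x d design matrix (first column 1),
# y is the 0/1 response vector; returns the d-vector of partial derivatives.
grad_log_post <- function(xi, X, y) {
  p <- 1 / (1 + exp(-as.vector(X %*% xi)))   # P(y^j = 1 | x^j, xi)
  as.vector(t(X) %*% (y - p))                # sum_j x_i^j (y^j - p_j)
}
```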
The results of this experiment are shown in Figure 3. In both of the plots of
ESS per epoch (see (a) and (c)), the best linear fit for ZZ-CV has slope approx-
imately 0.95, which is in close agreement with the scaling analysis of Section 5.
The other algorithms show an approximately horizontal fit (slope close to zero), corresponding to an overall cost that scales linearly with the size of the data. We conclude that, among the algorithms tested,
ZZ-CV is the only algorithm for which the ESS per CPU second is approximately
constant as a function of the size of the data (see Figure 3(b) and (d)). Furthermore
ZZ-CV obtains an ESSpE which is roughly linearly increasing with the number of
observations n (see Figure 3(a) and (c)), whereas the other versions of the Zig-Zag
algorithms, and MALA, have an ESSpE which is approximately constant with re-
spect to n. These statements apply regardless of the dimensionality of the problem.

6.5. A nonidentifiable logistic regression example with unbounded Hessian.
In a further experiment, we consider one-dimensional data (x^j, y^j), for j = 1, . . . , n, x^j ∈ R, y^j ∈ {0, 1}, which we assume for illustrational purposes to be generated from a logistic model where
$$\mathbb{P}\bigl(y^j = 1 \mid x^j, \xi_1, \xi_2\bigr) = \frac{1}{1 + \exp\bigl(-(\xi_1 + \xi_2^2)\, x^j\bigr)}.$$
The model is nonidentifiable since two parameters ξ, η correspond to the same model as long as ξ_1 + ξ_2^2 = η_1 + η_2^2. This leads to a sharply ridged probability density function, reminiscent of density functions concentrated along lower dimensional submanifolds, which often arise in Bayesian inference problems.
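A minimal sketch (R, illustrative names) of the resulting log-posterior, which makes the ridge along ξ_1 + ξ_2^2 = constant easy to visualise on a grid, is given below; the two-dimensional standard normal prior matches the setup of Figure 4.

```r
# Sketch: log-posterior for the nonidentifiable model of Section 6.5 with a
# standard normal prior on (xi1, xi2); the likelihood is constant along curves
# xi1 + xi2^2 = constant, producing a sharply ridged posterior density.
log_post <- function(xi, x, y) {
  eta <- (xi[1] + xi[2]^2) * x            # linear predictor, nonidentifiable in xi
  sum(y * eta - log1p(exp(eta))) - 0.5 * sum(xi^2)
}
```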
In this case, the Hessian of the log density is unbounded, so that we cannot use the standard framework for the Zig-Zag algorithms. It is discussed in the Supplementary Material (Bierkens, Fearnhead and Roberts (2019), Section 6) how to obtain computational bounds for the Zig-Zag and ZZ-CV algorithms, which may serve as an illustration of how to obtain such bounds in settings beyond those described in Sections 3.3 and 4.3.

FIG. 3. Log-log plots of the experimentally observed dependence of ESS per epoch (ESSpE) and ESS per second (ESSpS) with respect to the first coordinate Ξ_1, as a function of the number of observations n in the case of (2-D and 16-D) Bayesian logistic regression (Section 6.4). Data is randomly generated based on true parameter values ξ_0 = (1, 2) (2-D) and ξ_0 = (1, . . . , 1) (16-D). Trajectories all start in the true parameter value ξ_0. Plotted are mean and standard deviation over 10 experiments, along with the best linear fit. Displayed are MALA (tuned to have optimal acceptance ratio, green), Zig-Zag with global bound (red), Zig-Zag with Lipschitz bound (black), ZZ-SS using global bound (blue) and ZZ-CV (magenta), all run for 10^5 epochs. As reference point for ZZ-CV we compute the posterior mode numerically, the cost of which is negligible compared to the MCMC. The experiments are carried out in R with C++ implementations of all algorithms.
In Figure 4, we compare trace plots for the Zig-Zag algorithms (ZZ, ZZ-CV) to
trace plots for Stochastic Gradient Langevin Dynamics (SGLD) and the Consensus
Algorithm (Scott et al. (2016)). SGLD and Consensus are seen to be strongly bi-
ased, whereas ZZ and ZZ-CV target the correct distribution. However, this comes
at a cost: ZZ-CV loses much of its efficiency in this situation (due to the combina-
tion of lack of posterior contraction and unbounded Hessian); in particular, it is not super-efficient. The use of multiple reference points may alleviate this problem; see also the discussion in Section 7.

FIG. 4. Trace plots of several algorithms (blue) and density contour plots for the nonidentifiable logistic regression example of Section 6.5. In this example the number of observations is n = 1000. Data is randomly generated from the model with true parameter satisfying ξ_1 + ξ_2^2 = −1. The prior is a two-dimensional standard normal distribution. Due to the unbounded Hessian, and because SGLD is not corrected by a Metropolis–Hastings accept/reject step, the step size of SGLD needs to be set to a very small value (compared, e.g., to what would be required for MALA) in order to prevent explosion of the trajectory; still, the algorithm exhibits a significant asymptotic bias.

7. Discussion. We have introduced the multidimensional Zig-Zag process and shown that it can be used as an alternative to standard MCMC algorithms. The
advantages of the Zig-Zag process are that it is a nonreversible process, and thus
has the potential to mix better than standard reversible MCMC algorithms, and that
we can use sub-sampling ideas when simulating the process and still be guaranteed
to sample from the true target distribution of interest. We have shown that it is
possible to implement sub-sampling with control-variates in a way that we can
have super-efficient sampling from a posterior: the cost per effective sample size
is sub-linear in the number of data points. We believe the latter aspect will be
particularly useful for applications where the computational cost of calculating the
likelihood for a single data point is high.
As such, the Zig-Zag process holds substantial promise. However, being a com-
pletely new method, there are still substantial challenges in implementation which
will need to be overcome for Zig-Zag to reach the levels of popularity of standard
discrete-time MCMC. The key challenges to implementing the Zig-Zag efficiently
are:
1. to simulate from the relevant time-inhomogeneous Poisson process; and
2. to find reasonable centering points before commencing the MCMC algorithm itself, in order to realise the advantages of Zig-Zag for large datasets.
For the first of these challenges, we have shown how this can be achieved through
bounding the rate of the Poisson process, but the overall efficiency of the simu-
lation algorithm then depends on how tight these bounds are. In Section 3.1, we
describe efficient ways to carry this out. Moreover, as pointed out by a reviewer,
there is a substantial literature on simulating stochastic processes that involve sim-
ulating such time-inhomogeneous Poisson processes (Anderson (2007), Gibson
and Bruck (2000)). Ideas from this literature could be leveraged both to extend the
class of models for which we can simulate the Zig-Zag process, and also to make
implementation of simulation algorithms more efficient.
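As a concrete illustration of the first challenge, here is a minimal sketch (R, illustrative names) of the classical thinning construction of Lewis and Shedler (1979) for drawing the first event of an inhomogeneous Poisson process whose rate λ(t) is dominated by a constant bound; the Zig-Zag switching times are obtained in the same spirit, with tighter (e.g., affine) bounds replacing the constant one.

```r
# Sketch: first event of a Poisson process with rate lambda(t), simulated by
# thinning a homogeneous process with constant dominating rate lambda_bar
# (lambda(t) <= lambda_bar must hold for all t).
first_event_by_thinning <- function(lambda, lambda_bar) {
  t <- 0
  repeat {
    t <- t + rexp(1, rate = lambda_bar)       # propose from the bounding process
    if (runif(1) < lambda(t) / lambda_bar)    # accept with prob. lambda(t)/lambda_bar
      return(t)
  }
}
```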
The second challenge applies when using the ZZ-CV algorithm to obtain super-
efficiency for big data as discussed in Section 4.3. Although in our experience
finding appropriate centering points is rarely a serious problem, it is difficult to
give a prescriptive recipe for this step.
On the face of it, these challenges may limit the practical applicability of Zig-
Zag, at least in the short term. With that in mind, we have released an R/Rcpp pack-
age for logistic regression, as well as the code which reproduces the experiments
of Section 6 (Bierkens (2017)).
In addition, while Zig-Zag is an exact approximate simulation method, there
are various short-cuts to speed it up at the expense of the introduction of an ap-
proximation. For instance, there are already ideas of approximately simulating
the continuous-time dynamics, through approximate bounds on the Poisson rate (Pakman et al. (2016)). These ideas can lead to efficient simulation of the Zig-Zag
process for a wide class of models, albeit with the loss of exactness. Understanding
the errors introduced by such an approach is an open area.
The most exciting aspect of the Zig-Zag process is the super-efficiency we ob-
serve when using sub-sampling with control variates. Already this idea has been
adapted and shown to apply to other recent continuous-time MCMC algorithms
(Fearnhead et al. (2018), Pakman et al. (2016)). We have shown in Section 6.5 that Zig-
Zag can be applied effectively within highly non-Gaussian examples where rival
approximate methods such as SGLD and the Consensus Algorithm are seriously
biased. So there is no intrinsic reason to expect Zig-Zag to rely on the target dis-
tribution being close to Gaussian, although posterior contraction and the ability to
find tight Poisson process rate bounds play important roles as we saw in our exam-
ples. There is much to learn about how the efficiency of Zig-Zag depends on the
statistical properties of the posterior distribution. However, unlike its approximate
competitors, Zig-Zag will still remain an exact approximate method whatever the
structure of the target distribution.
In truly “big data” settings, in principle we still need to process all the data once: although a suitable reference point can be determined using a subset of the data, we do need to evaluate the full gradient of the log density once at this reference point, and this computation is O(n). This operation, however, is much easier to parallelize than MCMC is, and after this, approximately independent samples can be obtained at a cost of O(1) each. Thus, if we wish to obtain k approximately independent samples, the computational cost of ZZ-CV is O(k + n) while
the complexity of traditional MCMC algorithms is O(kn). This is confirmed by
the experiment in Section 6.4.
The control-variate construction we present in this paper is just one, possibly the simplest, implementation of this idea. There are natural extensions to deal with, for example, multimodal posteriors or situations where we do not have posterior concentration for all parameters. The simplest of these involves using multiple reference points, monitoring the computational bound we obtain within the ZZ-CV algorithm, and switching to a different algorithm when we stray so far from a reference point that this bound becomes too large. More sophisticated approaches include using the ideas from Dubey et al. (2016), where we introduce a reference point for each data point and update the reference points for data within the sub-sample at each iteration of the algorithm. This would allow the gradient estimate around which we center our control-variate estimator to depend on the recent history of the Zig-Zag process, and thus it could remain accurate even if we explore multiple modes or the tails of the target distribution.

Acknowledgments. The authors are grateful for helpful comments from ref-
erees, the Editor and the Associate Editor which have improved the paper. Fur-
thermore, the authors acknowledge Matthew Moores (University of Warwick) for
helpful advice on implementing the Zig-Zag algorithms as an R package using Rcpp.

SUPPLEMENTARY MATERIAL
Supplement to “The Zig-Zag process and super-efficient sampling for
Bayesian analysis of big data” (DOI: 10.1214/18-AOS1715SUPP; .pdf). Math-
ematics of the Zig-Zag process, scaling of SGLD, details on the experiments in-
cluding how to obtain computational bounds.

REFERENCES
ANDERSON, D. F. (2007). A modified next reaction method for simulating chemical systems with time dependent propensities and delays. J. Chem. Phys. 127 214107.
ANDRIEU, C. and ROBERTS, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist. 37 697–725. MR2502648
BARDENET, R., DOUCET, A. and HOLMES, C. (2017). On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res. 18 1515–1557. MR3670492
BIERKENS, J. (2016). Non-reversible Metropolis–Hastings. Stat. Comput. 26 1213–1228. MR3538633
BIERKENS, J. (2017). Computer experiments accompanying J. Bierkens, P. Fearnhead and G. Roberts, the Zig-Zag process and super-efficient sampling for Bayesian analysis of big data. Available at https://github.com/jbierkens/zigzag-experiments. Date accessed: 20-10-2017.
BIERKENS, J., FEARNHEAD, P. and ROBERTS, G. (2019). Supplement to “The Zig-Zag process and super-efficient sampling for Bayesian analysis of big data.” DOI:10.1214/18-AOS1715SUPP.
BIERKENS, J. and ROBERTS, G. (2017). A piecewise deterministic scaling limit of lifted Metropolis–Hastings in the Curie–Weiss model. Ann. Appl. Probab. 27 846–882. MR3655855
BIERKENS, J., ROBERTS, G. O. and ZITT, P.-A. (2017). Ergodicity of the zigzag process. Preprint. Available at arXiv:1712.09875.
BOUCHARD-CÔTÉ, A., VOLLMER, S. J. and DOUCET, A. (2017). The bouncy particle sampler: A non-reversible rejection-free Markov chain Monte Carlo method. J. Amer. Statist. Assoc. To appear. Available at arXiv:1510.02451.
CHEN, T.-L. and HWANG, C.-R. (2013). Accelerating reversible Markov chains. Statist. Probab. Lett. 83 1956–1962. MR3079029
DELIGIANNIDIS, G., BOUCHARD-CÔTÉ, A. and DOUCET, A. (2017). Exponential ergodicity of the bouncy particle sampler. Preprint. Available at arXiv:1705.04579.
DUANE, S., KENNEDY, A. D., PENDLETON, B. J. and ROWETH, D. (1987). Hybrid Monte Carlo. Phys. Lett. B 195 216–222.
DUBEY, K. A., REDDI, S. J., WILLIAMSON, S. A., POCZOS, B., SMOLA, A. J. and XING, E. P. (2016). Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems 1154–1162.
DUNCAN, A. B., LELIÈVRE, T. and PAVLIOTIS, G. A. (2016). Variance reduction using nonreversible Langevin samplers. J. Stat. Phys. 163 457–491. MR3483241
FEARNHEAD, P., BIERKENS, J., POLLOCK, M. and ROBERTS, G. O. (2018). Piecewise deterministic Markov processes for continuous-time Monte Carlo. Statist. Sci. 33 386–412. MR3843382
FONTBONA, J., GUÉRIN, H. and MALRIEU, F. (2012). Quantitative estimates for the long-time behavior of an ergodic variant of the telegraph process. Adv. in Appl. Probab. 44 977–994. MR3052846
FONTBONA, J., GUÉRIN, H. and MALRIEU, F. (2016). Long time behavior of telegraph processes under convex potentials. Stochastic Process. Appl. 126 3077–3101. MR3542627
GIBSON, M. A. and BRUCK, J. (2000). Efficient exact stochastic simulation of chemical systems with many species and many channels. J. Phys. Chem. A 104 1876–1889.
HASTINGS, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97–109. MR3363437
HWANG, C.-R., HWANG-MA, S.-Y. and SHEU, S. J. (1993). Accelerating Gaussian diffusions. Ann. Appl. Probab. 3 897–913. MR1233633
JACOB, P. E. and THIERY, A. H. (2015). On nonnegative unbiased estimators. Ann. Statist. 43 769–784. MR3319143
JOHNSON, R. A. (1970). Asymptotic expansions associated with posterior distributions. Ann. Math. Stat. 41 851–864. MR0263198
LEWIS, P. A. W. and SHEDLER, G. S. (1979). Simulation of nonhomogeneous Poisson processes by thinning. Nav. Res. Logist. Q. 26 403–413. MR0546120
LI, C., SRIVASTAVA, S. and DUNSON, D. B. (2017). Simple, scalable and accurate posterior interval estimation. Biometrika 104 665–680. MR3694589
MA, Y.-A., CHEN, T. and FOX, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems 2917–2925.
MACLAURIN, D. and ADAMS, R. P. (2014). Firefly Monte Carlo: Exact MCMC with subsets of data. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence. AUAI Press, Arlington, VA.
METROPOLIS, N., ROSENBLUTH, A. W., ROSENBLUTH, M. N., TELLER, A. H. and TELLER, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21 1087.
MONMARCHÉ, P. (2014). Hypocoercive relaxation to equilibrium for some kinetic models via a third order differential inequality. Available at arXiv:1306.4548.
NEAL, R. M. (1998). Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation. In Learning in Graphical Models 205–228. Springer, Berlin.
NEISWANGER, W., WANG, C. and XING, E. (2013). Asymptotically exact, embarrassingly parallel MCMC. Available at arXiv:1311.4780.
PAKMAN, A., GILBOA, D., CARLSON, D. and PANINSKI, L. (2016). Stochastic bouncy particle sampler. Preprint. Available at arXiv:1609.00770.
PETERS, E. A. J. F. and DE WITH, G. (2012). Rejection-free Monte Carlo sampling for general potentials. Phys. Rev. E (3) 85 1–5.
POLLOCK, M., FEARNHEAD, P., JOHANSEN, A. M. and ROBERTS, G. O. (2016). The scalable Langevin exact algorithm: Bayesian inference for big data. Available at arXiv:1609.03436.
QUIROZ, M., VILLANI, M. and KOHN, R. (2015). Speeding up MCMC by efficient data subsampling. Riksbank Research Paper Series 121.
REY-BELLET, L. and SPILIOPOULOS, K. (2015). Irreversible Langevin samplers and variance reduction: A large deviations approach. Nonlinearity 28 2081–2103. MR3366637
ROBERTS, G. O. and TWEEDIE, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2 341–363. MR1440273
SCOTT, S. L., BLOCKER, A. W., BONASSI, F. V., CHIPMAN, H. A., GEORGE, E. I. and MCCULLOCH, R. E. (2016). Bayes and big data: The consensus Monte Carlo algorithm. Int. J. Manag. Sci. Eng. Manag. 11 78–88.
TURITSYN, K. S., CHERTKOV, M. and VUCELJA, M. (2011). Irreversible Monte Carlo algorithms for efficient sampling. Phys. D 240 410–414.
VOLLMER, S. J., ZYGALAKIS, K. C. and TEH, Y. W. (2016). Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. J. Mach. Learn. Res. 17 1–48. MR3555050
WANG, X. and DUNSON, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. Available at arXiv:1312.4605.
WELLING, M. and TEH, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) 681–688.

J. BIERKENS
DELFT INSTITUTE OF APPLIED MATHEMATICS
VAN MOURIK BROEKMANWEG 6
2628 XE DELFT
THE NETHERLANDS
E-MAIL: [email protected]

P. FEARNHEAD
DEPARTMENT OF MATHEMATICS AND STATISTICS
FYLDE COLLEGE
LANCASTER UNIVERSITY
LANCASTER, LA1 4YF
UNITED KINGDOM
E-MAIL: [email protected]

G. ROBERTS
DEPARTMENT OF STATISTICS
UNIVERSITY OF WARWICK
COVENTRY CV4 7AL
UNITED KINGDOM
E-MAIL: [email protected]
