WP Improving The Robustness of Backtesting I Edmond Lezmi
Document for the exclusive attention of professional clients, investment services providers and any other professional of the financial industry
Improving the Robustness of Trading Strategy
Backtesting with Boltzmann Machines
and Generative Adversarial Networks
Abstract
Jules Roche
Jules Roche joined Amundi in July 2019 in the Quantitative
Research department as an intern. He worked on financial
applications of generative models such as GAN and
Boltzmann Machines. He is currently a graduate student
at Ecole Nationale des Ponts et Chaussées (ENPC) in Paris
where he studied applied mathematics and computer
science. He will be joining the Master of Finance at MIT
in September 2020.
Thierry Roncalli
Thierry Roncalli joined Amundi as Head of Quantitative
Research in November 2016. Prior to that, he was Head of
Research and Development at Lyxor Asset Management
(2009-2016), Head of Investment Products and Strategies
at SGAM AI, Société Générale (2005-2009), and Head
of Risk Analytics at the Operational Research Group
of Crédit Agricole SA (2004-2005). From 2001 to 2003,
he was also Member of the Industry Technical Working
Group on Operational Risk (ITWGOR). Thierry began
his professional career at Crédit Lyonnais in 1999 as a
financial engineer. Before that, Thierry was a researcher
at the University of Bordeaux and then a Research
Fellow at the Financial Econometrics Research Centre of
Cass Business School. During his five years of academic
career, he also served as a consultant on option pricing
models for different banks.
Since February 2017, he has been a Member of the Scientific Advisory
Board of AMF, the French Securities & Financial Markets
Regulator, and he was a Member of the Group of Economic
Advisers (GEA), ESMA’s Committee for Economic and
Market Analysis (CEMA), European Securities and Markets
Authority, from 2014 to 2018. Thierry is also Adjunct Professor
of Economics at the University of Evry, Department of
Economics. He holds a PhD in Economics from the University
of Bordeaux, France. He is the author of numerous academic
articles in scientific reviews and has published several
books on risk and asset management. His last two books
are “Introduction to Risk Parity and Budgeting” published
in 2013 by Chapman & Hall and translated into Chinese in
2016 by China Financial Publishing House (410 pages), and
“Handbook of Financial Risk Management” published in
2020 by Chapman & Hall (1142 pages).
Jiali Xu
Jiali Xu joined Amundi in 2018 as a quantitative research
analyst within the Multi-Asset Quantitative Research team.
Prior to that, he was a quantitative analyst in the Risk
Analytics and Solutions team at Société Générale between
2014 and 2018. He graduated from Ecole des Ponts ParisTech
and also holds a master's degree in Financial Mathematics
from the University of Paris-Est Marne-la-Vallée.
Improving the Robustness of Backtesting with Generative Models
1 Introduction
In machine learning, we generally distinguish two types of statistical modeling (Jebara,
2004). The generative approach models the unconditional probability distribution given a
set of observable variables. The discriminative approach models the conditional probability
distribution given a set of observable variables and a target variable. In the first case,
generative models can be used to simulate samples that reproduce the statistical properties
of a training dataset. In the second case, discriminative models can be used to predict the
target variable for new examples. More specifically, generative models can be used to learn
the underlying probability distributions over data manifolds. The objective of these models
is to estimate the statistical and correlation structure of real data and simulate synthetic
data with a probability distribution, which is close to the real one.
In finance, we generally observe only one sample path of market prices. For instance, if
we would like to build a trading strategy on the S&P 500 index, we can backtest the strategy
using the historical values of the S&P 500 index. We can then measure the performance and
the risk of this investment strategy by computing annualized return, volatility, Sharpe ratio,
maximum drawdown, etc. In this case, it is extremely difficult to assess the robustness of
the strategy, since we can consider that the historical sample of the S&P 500 index is one
realization of the unknown stochastic process. Therefore, portfolio managers generally split
the study period into two subperiods: the ‘in-sample’ period and the ‘out-of-sample’ period.
The objective is to calibrate the parameters of the trading strategy with one subperiod and
measure the financial performance with the other period in order to reduce the overfitting
bias. However, if the out-of-sample approach is appealing, it is limited for two main reasons.
First, by splitting the study period into two subperiods, the calibration procedure is performed
with fewer observations, and does not generally take into account the most recent
period. Second, the validation step is done using only one sample path. Again, we observe
only one realization of the risk/return statistics. Of course, we could use different splitting
methods, but we know that these bootstrap techniques are not well-adapted to time series
and must be reserved for modeling random variables. For stochastic processes, statisticians
prefer to consider Monte Carlo methods. Nevertheless, financial time series are difficult to
model, because they exhibit non-linear autocorrelations, fat tails, heteroscedasticity, regime
switching and non-stationary properties (Cont, 2001).
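The risk/return statistics mentioned above, and the in-sample/out-of-sample split, can be sketched as follows. This is only an illustration: the function name `backtest_stats`, the 260 periods per year, the zero risk-free rate and the synthetic price path are assumptions, not part of the original study.

```python
import numpy as np

def backtest_stats(prices, periods_per_year=260):
    """Annualized return, volatility, Sharpe ratio and maximum drawdown of a price path.

    Assumptions: zero risk-free rate, geometric annualization, 260 periods per year.
    """
    prices = np.asarray(prices, dtype=float)
    returns = prices[1:] / prices[:-1] - 1.0
    n_years = len(returns) / periods_per_year
    ann_return = (prices[-1] / prices[0]) ** (1.0 / n_years) - 1.0
    ann_vol = returns.std(ddof=1) * np.sqrt(periods_per_year)
    running_max = np.maximum.accumulate(prices)
    max_drawdown = float(np.max(1.0 - prices / running_max))
    return {
        "return": ann_return,
        "volatility": ann_vol,
        "sharpe": ann_return / ann_vol,
        "max_drawdown": max_drawdown,
    }

# Synthetic price path standing in for an index history (not real data)
rng = np.random.default_rng(0)
prices = 100.0 * np.cumprod(1.0 + rng.normal(0.0003, 0.01, size=2600))

split = len(prices) // 2
in_sample = backtest_stats(prices[:split])       # calibration subperiod
out_of_sample = backtest_stats(prices[split:])   # validation subperiod
```

Each call returns a single value per statistic, which is exactly the limitation discussed above: one sample path gives one realization of each risk/return measure.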
In this article, we are interested in generative models in order to obtain several
training/validation sets. The underlying idea is then to generate artificial but realistic financial
market prices. The choice of generative model and market data leads naturally to the
concept of market generator introduced by Kondratyev and Schwarz (2019). If the market
generator is able to replicate the probability distribution of the original market data,
we can then backtest quantitative investment strategies on several financial markets. The
backtesting procedure is then largely improved, since we obtain a probability distribution
of performance and risk statistics, and not only one value. Therefore, we can reduce the
in-sample property of the backtest procedure and the overfitting bias of the parameters that
define the trading strategy. However, the challenge is not simple, since the market generator
must be sufficiently robust and flexible in order to preserve the uni-dimensional statistical
properties of the original financial time series, but also the multi-dimensional dependence
structure.
This paper is organized as follows. Section Two reviews the more promising generative
models that may be useful in finance. In particular, we focus on restricted Boltzmann
machines, generative adversarial networks and Wasserstein distance models. In Section
Three, we apply these models in the context of trading strategies. We first consider the
joint simulation of S&P 500 and VIX indices. We then build an equity/bond risk parity
strategy with an investment universe of six futures contracts. Finally, Section Four offers
some concluding remarks.
2 Generative models
In this section, we consider two main approaches. The first one is based on a restricted
Boltzmann machine (RBM), while the second one uses a conditional generative adversarial
network (GAN). Both are stochastic artificial neural networks that learn the probability
distribution of a real data sample. However, the objective function strongly differs. Indeed,
RBMs consider a log-likelihood maximization problem, while the framework of GANs corresponds
to a minimax two-player game. In the first case, the difficulty lies in the gradient
approximation of the log-likelihood function. In the second case, the hard task is to find
a learning process algorithm that solves the two-player game. This section also presents
an extension of generative adversarial networks, which is called a convolutional Wasserstein
model.
Figure 1: The schema of an RBM with m visible units and n hidden units
[Figure: hidden layer H_1, H_2, ..., H_j, ..., H_n connected to visible layer V_1, V_2, V_3, ..., V_i, ..., V_m]
where a = (a_i) and b = (b_j) are the two vectors of bias terms associated with V and H
and W = (w_{i,j}) is the matrix of weights associated with the edges between V and H. The
normalizing constant Z is the partition function and ensures the overall distribution sums
to one³. It follows that the marginal probability distributions of the visible and hidden unit
states are:
$$P(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}$$
and:
$$P(h) = \frac{1}{Z} \sum_{v} e^{-E(v,h)}$$
The underlying idea of a Bernoulli RBM is to learn the unconditional probability distribution
P(v) of the observable data.
Conditional Distributions According to Long and Servedio (2010), the partition function
Z is intractable in the case of Bernoulli RBMs, since its calculation requires summing
$2^{m+n}$ elements. Therefore, the probability distribution P(v) is also intractable when m + n
increases. However, we can take advantage of the bipartite graph structure
of the RBM. Indeed, there are no connections between two different units in the same layer.
Given h, the visible units v_i are therefore conditionally independent of one another, and
likewise the hidden units h_j given v. It follows that:
$$P(v \mid h) = P(v_1, v_2, \ldots, v_m \mid h) = \prod_{i=1}^{m} P(v_i \mid h)$$
and:
$$P(h \mid v) = P(h_1, h_2, \ldots, h_n \mid v) = \prod_{j=1}^{n} P(h_j \mid v)$$
With these properties, we can find some useful results that help when computing the gradient
of the log-likelihood function on page 10. For instance, we can show that⁴:
$$\sum_{h} P(h \mid v)\, h_j = P(h_j = 1 \mid v)$$
3 We have $Z = \sum_{v,h} e^{-E(v,h)}$.
4 See Appendix A.2.1 on page 56.
A neural network perspective of RBMs In Appendix A.2.2 on page 56, we show that:
$$P(v_i = 1 \mid h) = \sigma\left(a_i + \sum_{j=1}^{n} w_{i,j} h_j\right) \tag{2}$$
and:
$$P(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i=1}^{m} w_{i,j} v_i\right) \tag{3}$$
where σ(x) is the sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Thus, a Bernoulli RBM can be considered as a stochastic artificial neural network, meaning
that the nodes and edges correspond to neurons and synaptic connections. For a given
vector v, h is obtained as follows:
$$h_j = f\left(\sum_{i=1}^{m} w_{i,j} v_i + b_j\right)$$
According to Fischer and Igel (2014), “an RBM can [then] be reinterpreted as a standard
feed-forward neural network with one layer of nonlinear processing units”.
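Equations (2) and (3) translate directly into two matrix-vector products followed by a sigmoid. The sketch below is a minimal NumPy illustration; the function names, the layer sizes and the random initialization are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prob_h_given_v(v, W, b):
    # Equation (3): P(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i)
    return sigmoid(b + v @ W)

def prob_v_given_h(h, W, a):
    # Equation (2): P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j)
    return sigmoid(a + h @ W.T)

rng = np.random.default_rng(0)
m, n = 6, 3                                   # visible and hidden units
W = rng.normal(0.0, 0.1, size=(m, n))         # w_ij links V_i and H_j
a, b = np.zeros(m), np.zeros(n)

v = rng.integers(0, 2, size=m).astype(float)  # a binary visible state
p_h = prob_h_given_v(v, W, b)
h = (rng.random(n) < p_h).astype(float)       # Bernoulli draw of the hidden state
p_v = prob_v_given_h(h, W, a)
```

Because the units of a layer are conditionally independent given the other layer, the whole layer is handled in one vectorized call rather than unit by unit.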
Training process Let θ = (a, b, W) be the set of parameters to estimate. The objective is
to find a value of θ such that $P_\theta(v) \approx P_{\mathrm{data}}(v)$. Since the log-likelihood function $\ell(\theta \mid v)$ of
the input vector v is defined as $\ell(\theta \mid v) = \log P_\theta(v)$, the Bernoulli RBM model is trained in
order to maximize the log-likelihood function of a training set of N samples $v^{(1)}, \ldots, v^{(N)}$:
$$\theta^\star = \arg\max_{\theta} \sum_{s=1}^{N} \ell\left(\theta \mid v^{(s)}\right) \tag{4}$$
where:
$$\ell(\theta \mid v) = \log\left(\frac{\sum_{h} e^{-E(v,h)}}{Z}\right) = \log \sum_{h} e^{-E(v,h)} - \log \sum_{v',h} e^{-E(v',h)} \tag{5}$$
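Hinton (2002) proposed using the gradient ascent method with the following update rule between
iteration steps t and t + 1:
$$\theta^{(t+1)} = \theta^{(t)} + \eta^{(t)} \frac{\partial}{\partial\theta}\left(\sum_{s=1}^{N} \ell\left(\theta^{(t)} \mid v^{(s)}\right)\right) = \theta^{(t)} + \eta^{(t)} \Delta\theta^{(t)}$$
where $\eta^{(t)}$ is the learning rate parameter, $\Delta\theta^{(t)} = \sum_{s=1}^{N} \nabla_{\theta^{(t)}}\left(v^{(s)}\right)$ and $\nabla_\theta(v)$ is the gradient
vector given in Appendix A.2.3 on page 57.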
Gibbs sampling Ackley et al. (1985) and Hinton and Sejnowski (1986) showed that the
expectation over P (v) can be approximated by Gibbs sampling, which belongs to the family
of MCMC algorithms. The goal of Gibbs sampling is to simulate correlated random variables
by using a Markov chain. Usually, we initialize the Gibbs sampling with a random vector and
the algorithm updates one variable iteratively, based on its conditional distribution given
the state of the remaining variables. After a sufficiently large number of sampling steps, we
get the unbiased samples from the joint probability distribution. Formally, Gibbs sampling
of the joint probability distribution of n random variables X = (X1 , X2 , . . . , Xn ) consists in
sampling xi ∼ P (Xi | X−i = x−i ) iteratively.
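$$\begin{aligned}
&\vdots \\
x_i^{(k)} &\sim P\left(X_i \mid x_1^{(k)}, \ldots, x_{i-1}^{(k)}, x_{i+1}^{(k-1)}, \ldots, x_n^{(k-1)}\right) \\
&\vdots \\
x_n^{(k)} &\sim P\left(X_n \mid x_1^{(k)}, x_2^{(k)}, \ldots, x_{n-1}^{(k)}\right)
\end{aligned}$$
end for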
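As a concrete instance of this scheme, consider a bivariate standard Gaussian vector with correlation ρ, for which the conditional distributions are known in closed form: $X_1 \mid X_2 = x_2 \sim \mathcal{N}(\rho x_2, 1 - \rho^2)$, and symmetrically for $X_2$. The sketch below is only an illustration; the target distribution, the burn-in length and the function name are assumptions chosen for the example.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_steps, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate Gaussian with correlation rho."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.standard_normal(2)          # random initial state
    cond_std = np.sqrt(1.0 - rho**2)
    samples = np.empty((n_steps, 2))
    for k in range(burn_in + n_steps):
        x1 = rng.normal(rho * x2, cond_std)  # x1^(k) ~ P(X1 | x2^(k-1))
        x2 = rng.normal(rho * x1, cond_std)  # x2^(k) ~ P(X2 | x1^(k))
        if k >= burn_in:
            samples[k - burn_in] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_steps=20_000)
empirical_rho = np.corrcoef(samples, rowvar=False)[0, 1]
```

Successive draws are autocorrelated, so the chain needs many more steps than an independent sampler to reach the same accuracy on `empirical_rho`.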
Let us consider a Bernoulli RBM as a Markov random field defined by the random
variables V = (V1 , V2 , . . . , Vm ) and H = (H1 , H2 , . . . , Hn ). Since an RBM is a bipartite
undirected graph, we can take advantage of conditional independence properties between
the variables in the same layer. At each step, we jointly sample the states of all variables in
one layer, as shown in Figure 2. Thus, Gibbs sampling for an RBM consists in alternating⁵
between sampling a new state h for all hidden units based on P(h | v) and sampling a
new state v for all visible units based on P(v | h). Let $v^{(k)} = (v_1^{(k)}, v_2^{(k)}, \ldots, v_m^{(k)})$ and
$h^{(k)} = (h_1^{(k)}, h_2^{(k)}, \ldots, h_n^{(k)})$ denote the states of the visible layer and the hidden layer at time
step k. For each unit, we rely on the fact that the conditional probabilities $P(v_i^{(k)} \mid h^{(k)})$
and $P(h_j^{(k)} \mid v^{(k-1)})$ are easily tractable according to Equations (2) and (3). We start by
initializing the state of the visible units and we can choose a binary random vector for the
first time step. At time step k, here are the steps of the algorithm:
the following probability $p(v_i^{(k)}) = P(v_i^{(k)} \mid h^{(k)})$. The state of the ith unit is then
simulated according to the Bernoulli distribution $\mathcal{B}(p(v_i^{(k)}))$.
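[Figure 2: Gibbs sampling chain alternating between the hidden layer (H_1, H_2, ..., H_n) and the visible layer (V_1, V_2, ..., V_m): $P(h^{(1)} \mid v^{(0)})$, $P(v^{(1)} \mid h^{(1)})$, $P(h^{(2)} \mid v^{(1)})$, $P(v^{(2)} \mid h^{(2)})$]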
We note Pmodel (v) = Pθ (v). Since Pdata is independent of θ, maximizing the log-likelihood
function (5) is equivalent to minimizing the Kullback-Leibler divergence between Pdata and
Pmodel :
should have the same distribution as $P_{\mathrm{data}}$. In this case, $\mathrm{KL}(P^{(0)} \,\|\, P^{(\infty)})$ is equal to 0
and the difference $\mathrm{KL}(P^{(0)} \,\|\, P^{(\infty)}) - \mathrm{KL}(P^{(k)} \,\|\, P^{(\infty)})$ is also equal to 0. Thus, according
to Hinton (2002), we can find the optimal parameter $\theta^\star$ by minimizing the difference
$\mathrm{KL}(P^{(0)} \,\|\, P^{(\infty)}) - \mathrm{KL}(P^{(k)} \,\|\, P^{(\infty)})$ instead of minimizing directly $\mathrm{KL}(P^{(0)} \,\|\, P^{(\infty)})$ in
Equation (7). This quantity is called a contrastive divergence since we compare two KL divergence
measures. Therefore, we can rewrite the objective function in Equation (7) as follows:
Again, we can use the gradient descent method to find the minimum of the objective function.
In this case, the update rule between iteration steps t and t + 1 is:
$$\theta^{(t+1)} = \theta^{(t)} - \eta^{(t)} \frac{\partial\, \mathrm{CD}^{(k)}\left(\theta^{(t)}\right)}{\partial\theta}$$
where the gradient vector $\partial_\theta \mathrm{CD}^{(k)}\left(\theta^{(t)}\right)$ is given in Appendix A.2.4 on page 59. Thus,
Algorithm (2) summarizes the k-step contrastive divergence algorithm for calibrating $\theta^\star$.
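A mini-batch version of the k-step contrastive divergence update can be sketched as follows. The code uses the textbook CD-k approximation (data statistics minus statistics after k Gibbs steps), which is assumed here to match the gradient of Appendix A.2.4; the function name, the toy dataset and the hyperparameters are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(V0, W, a, b, k=1, lr=0.01, rng=None):
    """One CD-k update of (W, a, b) on a mini-batch V0 of binary rows (batch x m)."""
    rng = np.random.default_rng() if rng is None else rng
    ph0 = sigmoid(b + V0 @ W)                  # P(h = 1 | v^(0)), data statistics
    Vk = V0
    for _ in range(k):                         # k steps of block Gibbs sampling
        ph = sigmoid(b + Vk @ W)
        H = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(a + H @ W.T)
        Vk = (rng.random(pv.shape) < pv).astype(float)
    phk = sigmoid(b + Vk @ W)                  # P(h = 1 | v^(k)), model statistics
    batch = V0.shape[0]
    W = W + lr * (V0.T @ ph0 - Vk.T @ phk) / batch
    a = a + lr * (V0 - Vk).mean(axis=0)
    b = b + lr * (ph0 - phk).mean(axis=0)
    return W, a, b

# Toy dataset: the first visible unit is always on, the others always off
rng = np.random.default_rng(0)
V = np.zeros((64, 4))
V[:, 0] = 1.0
W = rng.normal(0.0, 0.01, size=(4, 3))
a, b = np.zeros(4), np.zeros(3)
for _ in range(100):
    W, a, b = cd_k_update(V, W, a, b, k=1, lr=0.1, rng=rng)
```

On this toy dataset the visible bias of the first unit drifts upward and the others downward, which is the expected direction when the model learns that only the first unit is active.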
where a_i and b_j are bias terms associated with visible variables V_i and hidden variables H_j,
w_{i,j} is the weight associated with the edge between V_i and H_j, and σ_i is a new parameter
associated with V_i. Following the same calculation as for Equation (3), we can show that the
conditional probability is equal to:
$$P(h_j = 1 \mid v) = \mathrm{sigmoid}\left(b_j + \sum_{i=1}^{m} w_{i,j} \frac{v_i}{\sigma_i^2}\right) \tag{9}$$
As we have previously seen, we can maximize the log-likelihood function in order to train
a Gaussian-Bernoulli RBM with m visible units and n hidden units. As all visible units are
6 See Appendix A.1.3 on page 53.
[Figure: conditional RBM with hidden layer H_1, H_2, ..., H_n, visible layer V_1, V_2, ..., V_m and a conditional layer C_1, C_2, ..., C_d (d × m units)]
A conditional RBM contains both undirected and directed connections in the graph.
Thus, it cannot be defined as an MRF or Bayesian network. However, conditionally on c_t,
we can consider both visible and hidden layers as an undirected graph and we can compute
P(v_t, h_t | c_t) instead of computing P(v_t, h_t, c_t) in order to still take advantage of undirected
graph properties. Thus, according to the Hammersley-Clifford theorem, the conditional
probability distribution has the following form:
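$$P(v_t, h_t \mid c_t) = \frac{1}{Z} e^{-\tilde{E}_g(v_t, h_t, c_t)} = \frac{1}{Z} \exp\left(-\sum_{i=1}^{m} \tilde{E}_i(v_{t,i}, c_t) - \sum_{j=1}^{n} \tilde{E}_j(h_{t,j}, c_t) - \sum_{i=1}^{m} \sum_{j=1}^{n} \tilde{E}_{i,j}(v_{t,i}, h_{t,j}, c_t)\right)$$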
7 A conditional meta-unit is then composed of m values: $C_k = (V_{t-k,1}, \ldots, V_{t-k,m})$.
where Z is the partition function. Taylor et al. (2011) proposed a form of energy function
that is an extension of the Gaussian-Bernoulli RBM energy:
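$$\begin{aligned}
\tilde{E}_i(v_{t,i}, c_t) &= \frac{(v_{t,i} - \tilde{a}_{t,i})^2}{2\sigma_i^2} \\
\tilde{E}_j(h_{t,j}, c_t) &= -\tilde{b}_{t,j}\, h_{t,j} \\
\tilde{E}_{i,j}(v_{t,i}, h_{t,j}, c_t) &= -w_{i,j} \frac{v_{t,i}\, h_{t,j}}{\sigma_i^2}
\end{aligned}$$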
where $\tilde{a}_t = a + c_t Q^\top$ and $\tilde{b}_t = b + c_t P^\top$ are dynamic bias terms with respect to a and b.
Thus, the energy function $\tilde{E}_g$ corresponds to a Gaussian-Bernoulli RBM energy function $E_g$
in which the constant biases a and b are replaced by the dynamic biases $\tilde{a}_t$ and $\tilde{b}_t$. By updating these terms
in Equations (9) and (10), we obtain:
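$$P(h_j = 1 \mid v_t, c_t) = \mathrm{sigmoid}\left(\tilde{b}_{t,j} + \sum_{i=1}^{m} w_{i,j} \frac{v_{t,i}}{\sigma_i^2}\right) \tag{11}$$
and:
$$V_{t,i} \mid h_t, c_t \sim \mathcal{N}\left(\tilde{a}_{t,i} + \sum_{j=1}^{n} w_{i,j} h_{t,j},\; \sigma_i^2\right) \tag{12}$$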
Remark 2. As for Gaussian-Bernoulli RBMs, we can use the k-step contrastive divergence
algorithm for the training process with some modifications of the gradient vector⁸.
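One Gibbs step of the conditional RBM, following Equations (11) and (12), can be sketched as below. The shapes are assumptions inferred from the definitions of the dynamic biases: c_t is a vector of length d × m, Q is m × (d m) and P is n × (d m). The function name and the numerical values are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crbm_gibbs_step(v_t, c_t, W, a, b, P, Q, sigma, rng):
    """Sample h_t given (v_t, c_t), then a new v_t given (h_t, c_t)."""
    a_dyn = a + c_t @ Q.T                            # dynamic visible bias, shape (m,)
    b_dyn = b + c_t @ P.T                            # dynamic hidden bias, shape (n,)
    p_h = sigmoid(b_dyn + (v_t / sigma**2) @ W)      # Equation (11)
    h_t = (rng.random(p_h.shape) < p_h).astype(float)
    mean_v = a_dyn + W @ h_t                         # mean in Equation (12)
    v_new = rng.normal(mean_v, sigma)                # sigma is a standard deviation
    return v_new, h_t, p_h

rng = np.random.default_rng(0)
m, n, d = 2, 4, 3                                    # visible units, hidden units, lags
W = rng.normal(0.0, 0.1, size=(m, n))
a, b = np.zeros(m), np.zeros(n)
P = rng.normal(0.0, 0.1, size=(n, d * m))
Q = rng.normal(0.0, 0.1, size=(m, d * m))
sigma = np.ones(m)
c_t = rng.standard_normal(d * m)                     # flattened past values of the visible layer
v_t = rng.standard_normal(m)
v_new, h_t, p_h = crbm_gibbs_step(v_t, c_t, W, a, b, P, Q, sigma, rng)
```

Setting P and Q to zero recovers the Gaussian-Bernoulli RBM step with constant biases.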
According to Goodfellow et al. (2014), “a generative adversarial process trains two models:
a generative model G that captures the data distribution, and a discriminative model
D that estimates the probability that a sample came from the training data rather than G.
The training procedure for G is to maximize the probability of D making a mistake. This
framework corresponds to a minimax two-player game. In the space of arbitrary functions
G and D, a unique solution exists, with G recovering the training data distribution and D
equal to 1/2 everywhere”. Therefore, a generative adversarial model consists of two neural
networks. The first neural network simulates a new data sample, and is interpreted as a
data generation process. The second neural network is a classifier. The input data are both
the real and simulated samples. If the generative model has done a good job, the discriminative
model is unable to know whether an observation comes from the real dataset or the
simulated dataset.
where X denotes the data space and Θ_G defines the parameter space including generator
weights and biases that will be optimized. The generator G helps to simulate data x_0. The
discriminative model is defined as follows:
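$$D : \left\{ \begin{array}{rcl} (\mathcal{X}, \Theta_D) & \longrightarrow & [0, 1] \\ (x, \theta_d) & \longmapsto & p = D(x; \theta_d) \end{array} \right. \tag{14}$$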
where x corresponds to the set of simulated data x0 and training (or real) data x1 . In this
approach, the statistical model of X corresponds to:
The probability that the observation comes from the real (or true) data is equal to p1 =
D (x1 ; θd ), whereas the probability that the observation is simulated is given by p0 =
D (x0 ; θd ). If the model (15) is wrong and does not reproduce the statistical properties
of the real data, the classifier has no difficulty in separating the simulated data from the
real data, and we obtain p0 ≈ 0 and p1 ≈ 1. Otherwise, if the model is valid, we must verify
that:
$$p_0 \approx p_1 \approx \frac{1}{2}$$
The main issue of GANs is the specification of the two functions G and D and the
estimation of the parameters θg and θd associated with G and D. For the first step, Goodfellow
et al. (2014) proposed to use two multi-layer neural networks G and D, whereas they consider
the following cost function for the second step:
In other words, the discriminator is trained in order to maximize the probability of correctly
distinguishing historical samples from simulated samples. The objective is to obtain a good
classification model since the maximum value C (θg , θd ) with respect to θd is reached when:
D (x1 ; θd ) = 1
D (x0 ; θd ) = 0
In the meantime, the generator is trained in order to minimize the probability that the
discriminator is able to perform a correct classification or equivalently to maximize the
probability to fool the discriminator, since the minimum value C (θg , θd ) with respect to θg
is reached when:
D (x0 ; θd ) = 1
x0 ∼ G (z; θg )
Remark 3. In Appendix A.5 on page 72, we show that the cost function is related to the
binary cross-entropy measure or the opposite of the log-likelihood function of the logit model.
Moreover, the cost function can be interpreted as a φ-divergence measure as explained in
Appendix A.3.1 on page 63.
where C (max) (θd | θg ) corresponds to the cost function by assuming that θg is given.
where C (min) (θg | θd ) corresponds to the cost function by assuming that θd is given.
The two-stage approach is repeated until convergence by setting θ_g to the value $\hat{\theta}_g^{(\min)}$ calculated
at the minimization step and θ_d to the value $\hat{\theta}_d^{(\max)}$ calculated at the maximization step.
The cycle sequence is given in Figure 4. A drawback of this approach is the computational
time. Indeed, it requires solving two optimization problems at each iteration.
[Figure 4: cycle sequence: maximizing $C^{(\max)}(\theta_d \mid \hat{\theta}_g^{(\min)})$ gives $\hat{\theta}_d^{(\max)}$, then minimizing $C^{(\min)}(\theta_g \mid \hat{\theta}_d^{(\max)})$ gives $\hat{\theta}_g^{(\min)}$]
Therefore, Goodfellow et al. (2014) proposed another two-stage approach, which converges
more rapidly. The underlying idea is not to estimate the optimal generator at each
iteration. The objective is, rather, to improve the generative model at each iteration, such
that the new estimated model is better than the previous one. The convergence is only needed
for the discriminator step (Goodfellow et al., 2014, Proposition 2). Moreover, the authors
applied a mini-batch sampling in order to reduce the computational time. It follows that
the cost functions become:
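$$C^{(\max)}(\theta_d \mid \theta_g) \approx \frac{1}{m} \sum_{i=1}^{m} \left[\log D\left(x_1^{(i)}; \theta_d\right) + \log\left(1 - D\left(x_0^{(i)}; \theta_d\right)\right)\right]$$
and:
$$C^{(\min)}(\theta_g \mid \theta_d) \approx c + \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}; \theta_g\right); \theta_d\right)\right)$$
where m is the size of the mini-batch sample and c is a constant that does not depend on
the generator parameters θ_g:
$$c = \frac{1}{m} \sum_{i=1}^{m} \log D\left(x_1^{(i)}; \theta_d\right)$$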
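Given the discriminator outputs on a mini-batch, the two cost functions reduce to simple averages. The sketch below assumes that `d_real` and `d_fake` already hold the values $D(x_1^{(i)}; \theta_d)$ and $D(x_0^{(i)}; \theta_d)$; the function names are illustrative.

```python
import numpy as np

def cost_max(d_real, d_fake):
    """Mini-batch estimate of C_max(theta_d | theta_g)."""
    return np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def cost_min(d_real, d_fake):
    """Mini-batch estimate of C_min(theta_g | theta_d), written as c + generator term."""
    c = np.mean(np.log(d_real))              # does not depend on theta_g
    return c + np.mean(np.log(1.0 - d_fake))

# A discriminator that cannot separate real from simulated data outputs 1/2
d_half = np.full(8, 0.5)
value_at_equilibrium = cost_max(d_half, d_half)
```

Both functions return the same number for the same inputs. They differ only in which parameters are treated as variable, and when the discriminator outputs 1/2 everywhere the value is −log 4.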
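    Sample m examples $x^{(1)}, \ldots, x^{(m)}$ from the data distribution $P_{\mathrm{data}}(x)$
    Compute the gradient vector of the maximization problem with respect to the parameter $\theta_d$:
    $$\Delta\theta_d^{(j,k)} \leftarrow \nabla_{\theta_d} \left(\frac{1}{m} \sum_{i=1}^{m} \left[\log D\left(x_1^{(i)}; \theta_d\right) + \log\left(1 - D\left(x_0^{(i)}; \theta_d\right)\right)\right]\right) \Bigg|_{\theta_d = \theta_d^{(j,k-1)}}$$
  end for
  $\theta_d^{(j)} \leftarrow \theta_d^{(j, n_{\max})}$
  Sample m random noise vectors $z^{(1)}, \ldots, z^{(m)}$ from the prior distribution $P_{\mathrm{noise}}(z)$
  Compute the gradient vector of the minimization problem with respect to the parameter $\theta_g$:
  $$\Delta\theta_g^{(j)} \leftarrow \nabla_{\theta_g} \left(\frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}; \theta_g\right); \theta_d^{(j)}\right)\right)\right) \Bigg|_{\theta_g = \theta_g^{(j-1)}}$$
  Update the generator parameters $\theta_g$ using a backpropagation learning rule. For instance,
  in the case of the steepest descent method, we obtain:
  $$\theta_g^{(j)} \leftarrow \theta_g^{(j-1)} - \eta_g \cdot \Delta\theta_g^{(j)}$$
end for
return $\hat{\theta}_g \leftarrow \theta_g^{(n_{\min})}$ and $\hat{\theta}_d \leftarrow \theta_d^{(n_{\min})}$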
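The alternating structure of this algorithm can be illustrated on a one-dimensional toy problem where the gradients are available in closed form. Everything specific in the sketch is an assumption made for the example: the logistic discriminator D(x) = sigmoid(w x + c), the affine generator G(z) = μ + s z, the steepest ascent rule for the discriminator, the learning rates and the Gaussian stand-in data. It is not the network architecture used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_toy_gan(data, n_min=200, n_max=5, m=64, eta_d=0.05, eta_g=0.05, seed=0):
    """Alternating mini-batch updates for a one-dimensional toy GAN.

    Hypothetical models: D(x) = sigmoid(w * x + c) and G(z) = mu + s * z.
    """
    rng = np.random.default_rng(seed)
    w, c = 0.1, 0.0                                # discriminator parameters theta_d
    mu, s = 0.0, 1.0                               # generator parameters theta_g
    for j in range(n_min):
        # Discriminator: n_max steepest-ascent steps on C_max(theta_d | theta_g)
        for k in range(n_max):
            x0 = mu + s * rng.standard_normal(m)   # simulated mini-batch
            x1 = rng.choice(data, size=m)          # real mini-batch
            d1, d0 = sigmoid(w * x1 + c), sigmoid(w * x0 + c)
            w += eta_d * np.mean((1.0 - d1) * x1 - d0 * x0)
            c += eta_d * np.mean((1.0 - d1) - d0)
        # Generator: one steepest-descent step on C_min(theta_g | theta_d)
        z = rng.standard_normal(m)
        d0 = sigmoid(w * (mu + s * z) + c)
        grad_mu = np.mean(-d0 * w)                 # d/dmu of log(1 - D(G(z)))
        grad_s = np.mean(-d0 * w * z)              # d/ds of log(1 - D(G(z)))
        mu -= eta_g * grad_mu
        s -= eta_g * grad_s
    return (mu, s), (w, c)

data = np.random.default_rng(1).normal(3.0, 1.0, size=5000)
(mu, s), (w, c) = train_toy_gan(data)
```

With such a simple discriminator the generator can only be pushed towards the location of the data, and the two players may oscillate rather than settle, which is one reason practical implementations rely on richer networks and careful tuning.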
10 For instance, we can define the labels in the following way: $v_t = \mathbb{1}\{\varepsilon_t > \epsilon\} - \mathbb{1}\{\varepsilon_t < -\epsilon\}$ where
$\varepsilon_t = y_t - m_t$ and $\epsilon > 0$ is a threshold.
where F (P, Q) denotes the Fréchet class of all joint distributions F (x, y) whose marginals
are equal to P and Q. In the case p = 1, the Wasserstein distance represents the cost of the
optimal transport problem¹¹ (Villani, 2008):
because F(x, y) describes how much mass must be moved in order to transport one distribution
to another (Arjovsky et al., 2017a,b). The Kantorovich-Rubinstein duality described in
Villani (2008) allows us to rewrite Equation (17) as follows:
where ϕ is a 1-Lipschitz function, such that $|\varphi(x) - \varphi(y)| \leq \|x - y\|$ for all (x, y).
In the case of a generative model, we would like to check whether sample and generated
data follow the same distribution. Therefore, we obtain P = Pdata and Q =
Pmodel. In the special case of a GAN model, Pmodel (x; θg ) is given by G (Pnoise (z) ; θg ).
Arjovsky et al. (2017a,b) demonstrated that if G (z; θg ) is continuous with respect to θg ,
then W (Pdata , Pmodel ) is continuous everywhere and differentiable almost everywhere. This
result implies that a GAN based on the Wasserstein distance can be trained until it has
converged to its optimal solution, contrary to basic GANs, where the training loss was
bounded and not continuous¹². Moreover,
Gulrajani et al. (2017) adapted Equation (18) in order to recover the two-player min-max
game:
The Wasserstein GAN (WGAN) solves the same min-max optimization problem as previously:
$$\hat{\theta}_g^{(\min)} = \arg\min_{\theta_g \in \Theta_G} C^{(\min)}(\theta_g \mid \theta_d)$$
and we have:
$$\nabla_{\theta_g} C(\theta_g, \theta_d) = -\mathbb{E}\left[\nabla_{\theta_g} \varphi\left(G(z; \theta_g); \theta_d\right)\right]$$
We recall that the discriminator ϕ needs to satisfy the Lipschitz property, otherwise the loss
gradient will explode¹³. A first alternative proposed by Arjovsky et al. (2017a,b) is to clip
the weights into a closed interval [−c, c]. However, this method is limited because the gradient
can vanish and weights can saturate. A second alternative proposed by Gulrajani et al.
(2017) is to focus on the properties of the optimal discriminator gradient. They showed that
the optimal solution ϕ⋆ has a gradient norm 1 almost everywhere under P and Q. Therefore,
they proposed to add a regularization term to the cost function in order to constrain the
gradient norm to converge to 1. This gradient penalty leads to a new cost function:
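$$C_{\mathrm{regularized}}(\theta_g, \theta_d; \lambda) = C(\theta_g, \theta_d) + \lambda\, \mathbb{E}\left[\left(\left\|\nabla_x \varphi(x_2; \theta_d)\right\|_2 - 1\right)^2 \mid x_2 \sim P_{\mathrm{mixture}}\right]$$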
where Pmixture is a mixture distribution of P and Q. More precisely, Gulrajani et al. (2017)
proposed to sample x2 as follows: x2 = αx1 + (1 − α) x0 where α ∼ U[0,1] , x0 ∼ Q and
x1 ∼ P.
Remark 6. Gulrajani et al. (2017) found that a good value of the coefficient λ is 10.
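The gradient penalty term can be sketched with a small critic whose gradient with respect to x is available analytically. The one-hidden-layer critic ϕ(x) = u · tanh(W x + b) and the function names are assumptions made for the illustration; in practice the gradient would come from automatic differentiation.

```python
import numpy as np

def critic(x, W, b, u):
    # Hypothetical critic phi(x) = u . tanh(W x + b), one value per row of x
    return np.tanh(x @ W.T + b) @ u

def critic_grad_x(x, W, b, u):
    # Analytic gradient of phi with respect to x, one row per sample
    t = np.tanh(x @ W.T + b)
    return ((1.0 - t**2) * u) @ W

def gradient_penalty(x1, x0, W, b, u, lam=10.0, rng=None):
    """lambda * E[(||grad_x phi(x2)||_2 - 1)^2] with x2 = alpha x1 + (1 - alpha) x0."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = rng.uniform(size=(x1.shape[0], 1))       # alpha ~ U[0, 1], one per sample
    x2 = alpha * x1 + (1.0 - alpha) * x0
    grad_norm = np.linalg.norm(critic_grad_x(x2, W, b, u), axis=1)
    return lam * np.mean((grad_norm - 1.0) ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
b = rng.normal(size=5)
u = rng.normal(size=5)
x_real = rng.normal(size=(16, 3))                    # stands for x1 ~ P
x_fake = rng.normal(size=(16, 3))                    # stands for x0 ~ Q
penalty = gradient_penalty(x_real, x_fake, W, b, u, lam=10.0, rng=rng)
```

The penalty vanishes only when the critic has unit gradient norm at every interpolated point, which is the property of the optimal solution recalled above.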
name ϕ.
14 This class of CNNs is called deep convolutional generative adversarial networks (DCGANs).
[Figure: raw inputs, filtered inputs]
$$n_t' = \frac{n_t - n_k + 2 n_p}{n_s} + 1 \tag{19}$$
This type of convolution will be used to build the discriminator that should output a single
scalar. In this case, down-sampling real or fake time series becomes essential.
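Equation (19) is easy to evaluate for a stack of layers. In the sketch below, the function name is illustrative and the integer (floor) division is an assumption about how non-divisible cases are handled, which the formula itself leaves open.

```python
def conv_output_length(n_t, n_k, n_s=1, n_p=0):
    """Output length of a strided convolution, Equation (19), with floor division."""
    return (n_t - n_k + 2 * n_p) // n_s + 1

# Down-sampling a series of length 64 with kernel 4, stride 2 and padding 1
lengths = [64]
for _ in range(4):
    lengths.append(conv_output_length(lengths[-1], n_k=4, n_s=2, n_p=1))
```

With these settings each layer halves the length of the series, so a few layers are enough to reduce the input to a handful of values before the final scalar output.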
The term transpose comes from the fact that convolution is in fact a matrix operation.
When we perform a convolution, the input matrix is flattened into an $n_t n_x$-dimensional vector. All
the parameters (stride, kernel settings or padding) are encoded into the weight convolution
matrix belonging to $\mathcal{M}_{n_t n_x, n_t'}(\mathbb{R})$. Therefore, the output data is obtained by computing
the product between the input vector and the convolution matrix. Performing a transpose
convolution is equivalent to taking the transpose of this convolution matrix.
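The matrix view can be made concrete in the simplest setting. The sketch below assumes a single feature (n_x = 1), no padding and the cross-correlation convention common in deep learning libraries; the function name and the kernel are illustrative.

```python
import numpy as np

def conv_matrix(kernel, n_t, n_s=1):
    """Convolution matrix of shape (n_t, n_out) for one feature and no padding."""
    n_k = len(kernel)
    n_out = (n_t - n_k) // n_s + 1
    C = np.zeros((n_t, n_out))
    for j in range(n_out):
        C[j * n_s : j * n_s + n_k, j] = kernel   # kernel shifted by the stride
    return C

x = np.arange(8.0)                               # input series, n_t = 8
kernel = np.array([1.0, 0.0, -1.0])
C = conv_matrix(kernel, n_t=8, n_s=1)
y = x @ C                                        # convolution: length 8 -> 6
x_up = y @ C.T                                   # transposed convolution: length 6 -> 8
```

The transposed matrix restores the input length but not the input values, which is why the transposed convolution is an up-sampling operator rather than an inverse.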
3 Financial applications
3.1 Application of RBMs to generate synthetic financial time series
A typical financial time series of length T may be described by a real-valued vector y =
(y1 , ..., yT ). As we have introduced in Sections 2.1.1 and 2.1.2 on page 9, Bernoulli RBMs
take binary vectors as input for training and Gaussian RBMs take the data with unit variance
as input in order to ignore the parameter σ. Therefore, data preprocessing is necessary
before any training process of RBMs. For a Bernoulli RBM, each value of yt needs to be
converted to an N -digit binary vector using the algorithm proposed by Kondratyev and
Schwarz (2019). The underlying idea consists in discretizing the distribution of the training
dataset and representing them with binary numbers. For instance, N may be set to 16 and
a Bernoulli RBM, which takes a one-dimensional time series as training dataset, will have
16 visible layer units. In the same way, the visible layer will have 16 × n neurons in the case
of an n-dimensional time series. Moreover, samples generated by a Bernoulli RBM after
Gibbs sampling are also binary vectors and we need another algorithm to transform binary
vectors into real values. These transformation algorithms between real values and binary
vectors are detailed in Appendix A.7 on page 77. For a Gaussian RBM, we need only to
normalize data to unit variance and scale generated samples after Gibbs sampling.
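The idea of converting real values to N-digit binary vectors and back can be illustrated with a uniform grid. This is only a stand-in under stated assumptions: the algorithm of Appendix A.7 is not reproduced here, and the uniform discretization between `y_min` and `y_max`, the function names and the bit ordering are choices made for the example.

```python
import numpy as np

def to_binary(y, y_min, y_max, n_bits=16):
    """Map real values to n_bits-digit binary vectors on a uniform grid."""
    levels = 2**n_bits - 1
    scaled = (np.asarray(y, dtype=float) - y_min) / (y_max - y_min) * levels
    k = np.clip(np.rint(scaled), 0, levels).astype(np.int64)
    shifts = np.arange(n_bits - 1, -1, -1)       # most significant bit first
    return ((k[:, None] >> shifts) & 1).astype(np.uint8)

def from_binary(bits, y_min, y_max):
    """Inverse map from binary vectors back to real values."""
    n_bits = bits.shape[1]
    weights = 2 ** np.arange(n_bits - 1, -1, -1, dtype=np.int64)
    k = bits.astype(np.int64) @ weights
    return y_min + k / (2**n_bits - 1) * (y_max - y_min)

y = np.array([-1.0, 0.0, 0.25, 1.0])
bits = to_binary(y, y_min=-1.0, y_max=1.0)       # shape (4, 16)
y_back = from_binary(bits, y_min=-1.0, y_max=1.0)
```

With 16 digits the round-trip error is at most half a grid step, so the discretization is negligible compared with the sampling noise of the generated data.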
For the training process of RBMs, we use the contrastive divergence algorithm CD(1) to
estimate the log-likelihood gradient and the learning rate η (t) is set to 0.01. All models are
trained using mini-batch gradient descent with batch size 500 and we apply 100 000 epochs
to ensure that models are well-trained. A larger mini-batch size yields a more stable
gradient estimate at each step, but it also uses more memory and takes longer to
train the model.
After having trained the RBMs, we may use these models to generate synthetic (or
simulated) samples to match training samples. Theoretically, a well-trained RBM should be
able to transform random noise into samples that have the same distribution as the training
dataset after doing Gibbs sampling for a sufficiently long time. Therefore, the trained RBM
is fed by samples drawn from the Gaussian distribution N (0, 1) as input data. We then
perform a large number of forward and backward passes through the network to ensure
convergence between the probability distribution of generated samples and the probability
distribution of the training sample. In the particular case of Bernoulli RBMs, after the last
backward pass from the hidden layer to the visible layer, we need to transform generated
binary vectors into real-valued vectors.
[Figure: four histogram panels (vertical axis: Frequency; horizontal axis from −7.5 to 7.5)]
given in Figure 7. We set a strong negative correlation −60% between the Gaussian mixture
model and the Student’s t distribution and a strong positive correlation 50% between
the two Gaussian distributions. Moreover, the Gaussian distribution in the fourth dimension
has a more complicated correlation structure¹⁶ than the Gaussian distribution in the
third dimension, which is independent from the Gaussian mixture model and the Student’s
t distribution, i.e. the first and second dimensions. Here, the challenge is to learn both the
marginal distributions and the copula function.
16 The correlations are equal to 30% and −20% with respect to the first and second dimensions.
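The correlation structure described above can be assembled into a matrix and checked for validity. The sketch below only simulates correlated Gaussian scores with that matrix; mapping them to the Gaussian mixture and Student's t marginals would require the marginal parameters, which are not given in this section, so that step is left out. The ordering of the dimensions follows the text.

```python
import numpy as np

# Order: (1) Gaussian mixture, (2) Student's t, (3) Gaussian, (4) Gaussian
rho = np.array([
    [ 1.0, -0.6, 0.0,  0.3],
    [-0.6,  1.0, 0.0, -0.2],
    [ 0.0,  0.0, 1.0,  0.5],
    [ 0.3, -0.2, 0.5,  1.0],
])

eigenvalues = np.linalg.eigvalsh(rho)            # all positive for a valid matrix
L = np.linalg.cholesky(rho)

rng = np.random.default_rng(0)
z = rng.standard_normal((10_000, 4)) @ L.T       # correlated Gaussian scores
empirical = np.corrcoef(z, rowvar=False)
```

The empirical matrix of the simulated scores is close to `rho`, which gives a reference point for the kind of correlation comparison carried out on the generated samples.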
Before implementing the training process, we need to update the parameters for the
RBMs to accommodate multi-dimensional input. In the case of the Bernoulli RBM, each value
should be transformed into a 16-digit binary vector and we need to concatenate them to
form a 64-digit binary vector. So, the visible layer of the Bernoulli RBM has 64 neurons.
For the Gaussian RBM, we have simply 4 visible layer units. According to our experience,
we need to set a large number of hidden layer units in order to learn marginal distributions
and the correlation structure at the same time. Therefore, we choose 256 hidden layer units
for the Bernoulli RBM and 128 hidden layer units for the Gaussian RBM. We also recall
that the number of epochs is set to 100 000 in order to ensure that models are well-trained.
After the training process, we perform 1 000 steps of the Gibbs sampling on a 4-dimensional
random noise to generate 10 000 samples with the same size as the training dataset. These
simulated samples are expected to have not only the same marginal distributions but also
the same correlation structure as the training dataset.
Bernoulli RBM Figures 8 and 9 compare the histograms and QQ-plots of training samples
and of samples generated after 1 000 Gibbs sampling steps in the case of the Bernoulli
RBM. According to these figures, the Bernoulli RBM learns each marginal distribution of
the training samples fairly well. However, we also find that the Bernoulli RBM focuses on
the extreme values in the training dataset instead of the whole tail of the distribution,
and this phenomenon is more evident for heavy-tailed distributions. For example,
in the case of the Student’s t distribution (Panel (b) in Figure 9), the Bernoulli RBM tries
to learn the behavior of extreme values, but ignores the rest of the distribution tails.
Comparing the results for the two Gaussian distributions (Panels (c) and (d) in Figure 9),
we notice that the Gaussian distribution in the fourth dimension, which has a more
complicated correlation structure with the other dimensions, is not learned as well as the
Gaussian distribution in the third dimension.
We have replicated the previous simulation 50 times¹⁷ and we compare the average of
the mean, the standard deviation, the 1st percentile and the 99th percentile between training
and simulated samples in Table 1. Moreover, in the case of simulated samples, we indicate
the confidence interval by ± one standard deviation. These statistics show that
the Bernoulli RBM tends to overestimate the standard deviation and the absolute value of
the 1st and 99th percentiles, especially in the case of the Student’s t distribution. This
means that the Bernoulli RBM is sensitive to extreme values and that the learned probability
distribution is more heavy-tailed than the empirical probability distribution of the training dataset.
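The summary statistics of such a Monte Carlo exercise can be computed along the following lines. This is a minimal sketch: `toy_sampler` is a hypothetical stand-in for a trained generative model that returns 10 000 four-dimensional samples per replication.

```python
import numpy as np

def summary(x):
    """Mean, standard deviation, 1st and 99th percentiles of each column of x."""
    return np.vstack([
        x.mean(axis=0),
        x.std(axis=0, ddof=1),
        np.percentile(x, 1, axis=0),
        np.percentile(x, 99, axis=0),
    ])

def monte_carlo_summary(sampler, n_rep=50):
    """Average and standard deviation of the statistics over n_rep replications."""
    reps = np.stack([summary(sampler(seed)) for seed in range(n_rep)])
    return reps.mean(axis=0), reps.std(axis=0, ddof=1)

def toy_sampler(seed, n=10_000, d=4):
    # Hypothetical stand-in for a trained model: independent standard normal draws.
    return np.random.default_rng(seed).standard_normal((n, d))

avg, spread = monte_carlo_summary(toy_sampler)
for name, a, s in zip(["Mean", "Std", "1st pct", "99th pct"], avg, spread):
    print(name, np.round(a, 3), "±", np.round(s, 3))
```

Each row of `avg` and `spread` corresponds to one statistic and each column to one dimension, which is the layout of the "Simulated" columns in the tables of this section.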
In Figure 10, we compare the empirical correlation matrix¹⁸ of the training dataset
with the average of the correlation matrices computed over the 50 Monte Carlo
replications. We notice that the Bernoulli RBM does not perfectly capture the correlation
structure, since the correlation coefficients are smaller in magnitude than those of
the empirical correlation matrix. For instance, the correlation between the first and second
dimensions is equal to −57% for the training data, but only −31% for the simulated data
on average¹⁹.
17 Each replication corresponds to 10 000 generated samples of the four random variables, and differs
[Figures 8 and 9: histograms (frequency) and QQ-plots (generated versus training samples) for
the Bernoulli RBM. Panels: (a) Gaussian mixture model; (b) Student’s t-distribution with ν = 4;
(c) and (d) Gaussian distributions.]
[Table 1: training versus Bernoulli RBM simulated samples]
Statistic             Dimension 1                      Dimension 2
                      Training   Simulated             Training   Simulated
Mean                   0.303      0.319 (± 0.039)      −0.006     −0.054 (± 0.029)
Standard deviation     2.336      2.385 (± 0.124)       1.409      1.540 (± 0.044)
1st percentile        −5.512     −5.380 (± 0.121)      −3.590     −5.008 (± 0.905)
99th percentile        4.071      4.651 (± 0.143)       3.862      4.255 (± 0.374)

Statistic             Dimension 3                      Dimension 4
                      Training   Simulated             Training   Simulated
Mean                  −0.002      0.071 (± 0.039)      −0.063      0.033 (± 0.043)
Standard deviation     1.988      2.044 (± 0.028)       1.982      2.037 (± 0.030)
1st percentile        −4.691     −4.987 (± 0.167)      −4.895     −5.227 (± 0.244)
99th percentile        4.677      4.918 (± 0.190)       4.431      4.938 (± 0.322)
Figure 10: Comparison between the empirical correlation matrix and the average correlation
matrix of Bernoulli RBM simulated data
Gaussian RBM Let us now consider a Gaussian RBM with 4 visible layer units and 128
hidden layer units using the same simulated training dataset. After the training process,
synthetic samples are again generated using Gibbs sampling with 1 000 steps between the
visible and hidden layers. Histograms and QQ-plots of training and generated samples are
reported in Figures 11 and 12. We notice that the Gaussian RBM works well for the Gaussian
mixture model and the two Gaussian distributions (Panels (a), (c) and (d) in Figure 12).
Again, we observe that the Gaussian distribution with the simplest correlation structure is
easier to learn than the Gaussian distribution with a more complicated correlation structure.
However, the Gaussian RBM fails to learn heavy-tailed distributions such as the Student’s
t distribution, since the model tends to ignore all values in the distribution tails.
Comparing Table 1 for the Bernoulli RBM and Table 2 for the Gaussian RBM, we notice
that the Gaussian RBM generally underestimates the standard deviation and the absolute
value of the 1st and 99th percentiles. This means that it is challenging to generate
leptokurtic probability distributions with Gaussian RBMs.
[Figures 11 and 12: histograms (frequency) and QQ-plots (generated versus training samples) for
the Gaussian RBM. Panels: (a) Gaussian mixture model; (b) Student’s t-distribution with ν = 4;
(c) Gaussian distribution N (0, 2); (d) Gaussian distribution N (0, 2).]

[Table 2: training versus Gaussian RBM simulated samples]
Statistic             Dimension 1                      Dimension 2
                      Training   Simulated             Training   Simulated
Mean                   0.303      0.325 (± 0.037)      −0.006     −0.006 (± 0.023)
Standard deviation     2.336      2.232 (± 0.021)       1.409      1.355 (± 0.020)
1st percentile        −5.512     −4.992 (± 0.121)      −3.590     −3.054 (± 0.078)
99th percentile        4.071      4.167 (± 0.078)       3.862      3.240 (± 0.105)

Statistic             Dimension 3                      Dimension 4
                      Training   Simulated             Training   Simulated
Mean                  −0.002      0.064 (± 0.029)      −0.063      0.046 (± 0.031)
Standard deviation     1.988      1.942 (± 0.023)       1.982      1.910 (± 0.026)
1st percentile        −4.691     −4.382 (± 0.109)      −4.895     −4.512 (± 0.122)
99th percentile        4.677      4.572 (± 0.109)       4.431      4.303 (± 0.139)

In Figure 13, we also compare the empirical correlation matrix of the training dataset
and the average of the correlation matrices computed with 50 Monte Carlo replications.
Compared to the Bernoulli RBM, the Gaussian RBM captures the correlation structure of
the training dataset much better. This is particularly true for the largest correlation values.
For instance, the correlation between the first and second dimensions is equal to −57% for
the training data, −31% for the Bernoulli RBM simulated data and −48% for the Gaussian
RBM simulated data.
Figure 13: Comparison between the empirical correlation matrix and the average correlation
matrix of Gaussian RBM simulated data
Summary of the results According to our tests, we consider that both Bernoulli and
Gaussian well-trained RBMs can transform random noise series into the joint distribution of
a training dataset. But they have their own characteristics:
• A Bernoulli RBM is very sensitive to the extreme values of the training dataset and has a
tendency to overestimate the tails of the distribution. On the contrary, a Gaussian RBM has
a tendency to underestimate the tails of the probability distribution and faces difficulties
in learning leptokurtic probability distributions.
• Gaussian RBMs may capture the correlation structure of the training dataset more
accurately than Bernoulli RBMs.
In practice, we may apply the normal score transformation to the training dataset, such
that each marginal has a standard normal distribution. For that, we rank the
values of each dimension from the lowest to the highest, map these ranks to a uniform
distribution and apply the inverse Gaussian cumulative distribution function. We then
train the RBM with these transformed values. After the training process, the marginals
of samples generated by well-trained RBMs will follow a standard normal distribution. By
using the inverse transformation, we can generate synthetic samples with the same marginal
distributions as those of the training dataset, and a correlation structure that is close to
the empirical correlation matrix of the training data. In this case, we may consider Gaussian
RBMs as an alternative to the bootstrap sampling approach.
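The normal score transformation and its inverse can be sketched as follows. The rank-to-uniform mapping rank / (n + 1) and the use of empirical quantiles for the inverse are assumptions made for this illustration; other conventions are possible. The function names are ours.

```python
import numpy as np
from scipy import stats

def normal_score(x):
    """Rank-based transform of each column of x to standard normal scores."""
    n = x.shape[0]
    ranks = stats.rankdata(x, axis=0)      # ranks 1..n within each column
    u = ranks / (n + 1)                    # uniform scores strictly inside (0, 1)
    return stats.norm.ppf(u)

def inverse_normal_score(z, x_train):
    """Map normal scores back to the empirical quantiles of the training data."""
    u = stats.norm.cdf(z)
    return np.column_stack([
        np.quantile(x_train[:, j], u[:, j]) for j in range(x_train.shape[1])
    ])

rng = np.random.default_rng(0)
x_train = rng.standard_t(df=4, size=(1000, 2))   # toy heavy-tailed data
z = normal_score(x_train)                        # what the RBM would be trained on
x_back = inverse_normal_score(z, x_train)        # back to the original marginals
print(np.round(z.mean(axis=0), 3), np.round(z.std(axis=0), 3))
```

In a market-generator setting, `inverse_normal_score` would be applied to the samples produced by the trained model rather than to `z` itself, so that the synthetic data recover the empirical marginals of the training set.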
Figure 14: Historical prices of the S&P 500 and VIX indices
[Panels: S&P 500 index; VIX index.]
Normal score transformation In order to verify the learning quality of Gaussian RBMs
using the normal score transformation, we run 50 Monte Carlo simulations after the training
process and, for each simulation, we generate 5 354 simulated observations from different
random noise series. We then calculate the statistics of these samples as we have done in
the previous section. According to Table 3, we find that although the Gaussian RBM still
slightly underestimates the standard deviation, the 1st percentile and the 99th percentile,
the learning quality is much improved. In addition, the average correlation coefficient
between daily returns of the S&P 500 and VIX indices over 50 Monte Carlo simulations is equal
to −76%, which is close to the empirical correlation of −71%. Therefore, we
consider that training Gaussian RBMs with the normal score transformation may be used
as an alternative to bootstrap sampling.
[Figure: autocorrelation function by lag (lags 1 to 19).]
Table 3: Comparison between the training sample and Gaussian RBM simulated samples
using the normal score transformation
samples by Gaussian and conditional RBMs. These results clearly show that a conditional
RBM captures the autocorrelation of the training dataset well, which is not the case for
a traditional Gaussian RBM.
[Figures: autocorrelation by lag of the training samples and of the samples generated by the
Gaussian and conditional RBMs. Panels: S&P 500 index; VIX index.]
Market generator Based on the above results, we consider that conditional RBMs can be
used as a market generator (Kondratyev and Schwarz, 2019). In the example of the S&P 500
and VIX indices, we have trained the models with all historical observations and, for each
date, we have used the values of the last 20 days as a memory vector for the conditional
layer. In this approach, the normal score transformation ensures the learning of the marginal
distributions, whereas the conditional RBM ensures the learning of the correlation structure
and the time dependence. After having calibrated the training process, we may choose a
date as the starting point. The values of this date and its last 20 days are then passed to
the trained RBM model and, after sufficient steps of Gibbs sampling, we get a sample for
the next day. We then feed this new sample and the updated memory vector to the model
in order to generate a sample for the following day. Iteratively, the conditional RBM can
generate a multi-dimensional time series that has the same patterns as those of the training
data in terms of marginal distribution, correlation structure and time dependence.
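The iterative procedure described above can be sketched with a rolling memory window. The function `toy_sampler` below is a hypothetical stand-in for the trained conditional RBM followed by its Gibbs sampling steps; only the roll-forward logic is illustrated, not the model itself.

```python
import numpy as np

def generate_path(sample_next, history, n_days, rng):
    """Roll a one-step conditional sampler forward for n_days.

    sample_next(memory, rng) returns the next-day vector given a memory
    array of shape (n_lags, n_series)."""
    memory = np.array(history, dtype=float)          # copy of the last n_lags observations
    path = np.empty((n_days, memory.shape[1]))
    for t in range(n_days):
        x_next = sample_next(memory, rng)
        path[t] = x_next
        memory = np.vstack([memory[1:], x_next])     # drop the oldest day, append the new one
    return path

def toy_sampler(memory, rng):
    # Hypothetical model: noisy AR(1) on the most recent observation.
    return 0.2 * memory[-1] + rng.standard_normal(memory.shape[1])

rng = np.random.default_rng(0)
history = rng.standard_normal((20, 2))               # 20-day memory, 2 series
path = generate_path(toy_sampler, history, n_days=250, rng=rng)
print(path.shape)
```

Replacing `toy_sampler` with a call to a trained conditional model, and applying the inverse normal score transformation to `path`, would give synthetic returns on the original scale.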
For instance, Figure 18 shows three time series of 250 trading days starting at the date
22/11/2000, which are generated by the trained conditional RBM. We notice that these
three simulated time series have the same characteristics as the historical prices represented
by the dashed line. The negative correlation between S&P 500 and VIX indices is also
learned and we can see clearly that there exists a positive autocorrelation in each simulated
one-dimensional time series.
Figure 18: Comparison between scaled historical prices of S&P 500 and VIX indices and
synthetic time series generated by the conditional RBM
[x-axis: time step; series: Simulation 1, Simulation 2, Simulation 3 and Real.]
traditional GAN models using the cross-entropy loss measure. Therefore, in this first
study, we test Wasserstein GAN models on the simulated multi-dimensional dataset, as we have
done for Bernoulli and Gaussian RBMs. We recall that we simulate 10 000 samples of
4-dimensional data with different marginal distributions (Gaussian mixture model, Student’s
t distribution and two Gaussian distributions), as shown in Figure 6, and that the
training dataset has a correlation structure simulated by a Gaussian copula with the
correlation matrix given in Figure 7. Here, the objective of the test is to check whether
a well-trained Wasserstein GAN model can learn the marginal distributions and the copula
function of the training samples.
Data preprocessing In practice, GAN models require the dimension and the range of the
generator output to match those of the discriminator input. To address this
issue, there are two possible ways:
• We can use a complex activation function for the last layer of the generator in order
to ensure that the generated samples have the same range as the training samples for
the discriminator (Clevert et al., 2015).
• We can modify the range of the training samples by applying a preprocessing function
and choose a usual activation function (Wiese et al., 2020).
In our study, we consider the second approach by applying the MinMax scaling
function²¹ to the training samples for the discriminator. As a result, the scaled input data
for the discriminator take their values in [0, 1] and we may choose the sigmoid activation
function for the last layer of the generator to ensure that the outputs of the generator also
take their values in [0, 1]. In addition, we need to simulate the random noise used as input
for the generator. As the random noise plays the role of latent variables, we are free
to choose its distribution. For instance, we use the Gaussian distribution z ∼ N (0, 1) in
this study.
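A minimal sketch of this preprocessing is given below, assuming the usual column-wise definition (x − min) / (max − min) of MinMax scaling. The helper names are ours and the generator is replaced by a sigmoid applied to random pre-activations.

```python
import numpy as np

def minmax_fit(x):
    """Column-wise minimum and maximum of the training data."""
    return x.min(axis=0), x.max(axis=0)

def minmax_scale(x, lo, hi):
    """Map training data to [0, 1] before feeding the discriminator."""
    return (x - lo) / (hi - lo)

def minmax_inverse(y, lo, hi):
    """Map generator outputs in [0, 1] back to the original scale."""
    return lo + y * (hi - lo)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x_train = rng.standard_t(df=4, size=(1000, 4))   # toy training data
lo, hi = minmax_fit(x_train)
x_scaled = minmax_scale(x_train, lo, hi)

# Stand-in for the generator's last layer: sigmoid outputs lie in (0, 1).
fake_scaled = sigmoid(rng.standard_normal((1000, 4)))
fake = minmax_inverse(fake_scaled, lo, hi)
print(x_scaled.min(), x_scaled.max(), fake.min() >= lo.min(), fake.max() <= hi.max())
```

One consequence of this design is that generated values can never fall outside the range observed in the training data, since the sigmoid output is bounded by 0 and 1.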
of the discriminator and the generator, the activation function f (x) corresponds to a leaky
rectified linear unit (or RELU) function with α = 0.5:
$$
f_{\alpha}(x) =
\begin{cases}
\alpha x & \text{if } x < 0 \\
x & \text{otherwise}
\end{cases}
$$
According to Arjovsky et al. (2017a), the Wasserstein GAN model should be trained with
the RMSProp optimizer using a small learning rate. Therefore, we set the learning rate to
η = 1 × 10⁻⁴ and use mini-batch gradient descent with batch size 500. Table 4 summarizes the
setting of the Wasserstein GAN model implementation using Python and the TensorFlow
library. After the training process, we generate 10 000 samples in order to compare the
marginal distributions and the correlation structure with the training dataset.
Results Figures 19 and 20 compare the histograms and QQ-plots of training samples
and generated samples. According to these figures, we observe that the Wasserstein GAN
model learns each marginal distribution of the training samples very well, even better than
Bernoulli and Gaussian RBMs. In particular, the Wasserstein GAN model fits heavy-tailed
distributions more accurately than RBMs, as in the case of the Student’s t distribution (Panel
(b) in Figure 20).
As we have done for RBMs, we replicate the previous simulation 50 times and we compare
the average of the mean, the standard deviation, the 1st percentile and the 99th percentile
between training and simulated samples in Table 5. Moreover, in the case of simulated
samples, we indicate the confidence interval by ± one standard deviation. These statistics
show that the Wasserstein GAN model estimates the standard deviation, the 1st percentile
and the 99th percentile of each marginal distribution very well, especially for the
Gaussian mixture model and the Student’s t distribution in the first and second dimensions.
One shortcoming is that the Wasserstein GAN model slightly overestimates the value of the
99th percentile for the two Gaussian distributions. Compared with the results shown in
Table 1 for the Bernoulli RBM and Table 2 for the Gaussian RBM, we notice that the
Wasserstein GAN model learns the marginal distributions of the training dataset better and
does not have a tendency to systematically underestimate or overestimate the tails of the
probability distribution, as RBMs do. However, there exists a small bias between the
empirical mean computed with 50 Monte Carlo replications and the mean of the training
dataset.
[Figures 19 and 20: histograms (frequency) and QQ-plots (generated versus training samples) for
the Wasserstein GAN model. Panels: (a) Gaussian mixture model; (b) Student’s t-distribution
with ν = 4; (c) Gaussian distribution N (0, 2); (d) Gaussian distribution N (0, 2).]

[Table 5: training versus Wasserstein GAN simulated samples]
Statistic             Dimension 1                      Dimension 2
                      Training   Simulated             Training   Simulated
Mean                   0.303      0.427 (± 0.027)      −0.006     −0.026 (± 0.013)
Standard deviation     2.336      2.361 (± 0.014)       1.409      1.379 (± 0.019)
1st percentile        −5.512     −5.539 (± 0.099)      −3.590     −3.638 (± 0.116)
99th percentile        4.071      4.103 (± 0.030)       3.862      4.073 (± 0.181)

Statistic             Dimension 3                      Dimension 4
                      Training   Simulated             Training   Simulated
Mean                  −0.002      0.199 (± 0.021)      −0.063      0.062 (± 0.020)
Standard deviation     1.988      2.002 (± 0.013)       1.982      1.958 (± 0.016)
1st percentile        −4.691     −4.717 (± 0.080)      −4.895     −4.821 (± 0.106)
99th percentile        4.677      4.954 (± 0.066)       4.431      4.865 (± 0.082)

In Figure 21, we also compare the empirical correlation matrix of the training dataset
and the average of the correlation matrices computed with 50 Monte Carlo replications.
We notice that the Wasserstein GAN model captures the correlation structure very well.
For each value in the correlation matrix, the difference is less than 5%. Compared to the
Bernoulli RBM and the Gaussian RBM, the Wasserstein GAN model captures the correlation
structure of the training dataset much better. For instance, the correlation between the
first and second dimensions is equal to −57% for the training data, −53% for the Wasserstein
GAN simulated data, −48% for the Gaussian RBM simulated data and −31% for the
Bernoulli RBM simulated data. If we consider the first and fourth dimensions, these figures
become 29% for the training data, 24% for the Wasserstein GAN simulated data, 22% for
the Gaussian RBM simulated data and 20% for the Bernoulli RBM simulated data.
Figure 21: Comparison between the empirical correlation matrix and the average correlation
matrix of WGAN simulated data
Summary of the results According to our tests, the Wasserstein GAN model performs
very well in the task of learning the joint distribution of our simulated training dataset.
Compared with the results of Bernoulli and Gaussian RBMs, the Wasserstein GAN model
has several advantages:
1. The Wasserstein GAN model is less sensitive to the extreme values of the training dataset
and does not have a tendency to underestimate or overestimate the tails of the probability
distribution.
2. We do not need to apply a complex transformation to the input data of the Wasserstein
GAN model in data preprocessing. In our study, we just use a MinMax scaling function,
whereas we recall that we need the binary transformation for the Bernoulli RBM and the
normal score transformation for the Gaussian RBM.
3. The Wasserstein GAN model may capture the correlation structure of the training
dataset more accurately than Bernoulli and Gaussian RBMs.
Structure and training of CDCWGAN models In this study on financial time series,
the training samples for the discriminator correspond to an $n_x$-dimensional time series
representing historical returns over $n_t$ days. Therefore, inputs are represented by a matrix
belonging to $\mathcal{M}_{n_t, n_x}(\mathbb{R})$. In addition, each training sample is
conditioned by the past values of returns over $n_h$ days, which are represented by a matrix
belonging to $\mathcal{M}_{n_h, n_x}(\mathbb{R})$. In practice, we concatenate these two
matrices to form a matrix belonging to $\mathcal{M}_{n_h + n_t, n_x}(\mathbb{R})$
and this matrix is fed to the discriminator as input. In this study, the inputs of
the discriminator are downsampled using 4 convolutional layers with number of filters
{16, 32, 64, 128}. We set the kernel length to 3 and the stride to 2 and we choose to use
the leaky RELU function with α = 0.5 for each convolutional layer. For the last layer of
the discriminator, we again choose a 1-neuron dense layer and use the hyperbolic tangent
activation function for this neuron as in the case of the traditional Wasserstein GAN model.
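The construction of the concatenated discriminator inputs can be illustrated as follows. This is a sketch under the assumption that training windows overlap and advance by one day at a time, which the text does not specify; the function name `make_windows` is ours.

```python
import numpy as np

def make_windows(returns, n_h, n_t):
    """Stack all consecutive blocks of n_h + n_t rows of `returns`.

    The result has shape (n_samples, n_h + n_t, n_x): the first n_h rows of each
    block are the conditioning history, the last n_t rows are the sample itself."""
    windows = np.lib.stride_tricks.sliding_window_view(returns, n_h + n_t, axis=0)
    return np.moveaxis(windows, -1, 1)   # (n_samples, n_x, w) -> (n_samples, w, n_x)

rng = np.random.default_rng(0)
returns = rng.standard_normal((300, 2))          # toy returns: 300 days, n_x = 2
blocks = make_windows(returns, n_h=20, n_t=5)
history, target = blocks[:, :20], blocks[:, 20:]
print(blocks.shape, history.shape, target.shape)
```

`sliding_window_view` returns a read-only view, so the blocks should be copied (for example with `np.array(blocks)`) before any in-place scaling.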
For the generator, we modify the original method of Mariani et al. (2019) and we borrow
the idea of the conditional layer from the conditional RBM. The generator of our CDCWGAN
model takes two inputs: a 100-dimensional random noise vector and the $n_h$ past values,
which are concatenated into an $n_h \times n_x$-dimensional vector. We then feed these two
vectors to the first layer of the generator, which is a dense layer, and we reshape the
output to a two-dimensional $n_t \times 256$ matrix before passing it to the second layer.
We construct the rest of the generator using 3 convolutional layers with number of filters
{256, 64, 2}. The kernel length is set to 3 and the stride is set to 2. We also use a leaky
RELU function with α = 0.5 for each convolutional layer and, since the last convolutional
layer directly gives the simulated scenarios, we choose the sigmoid activation function
for this layer in order to get values in the range [0, 1].
For this conditional deep convolutional Wasserstein GAN model, we again use the
RMSProp optimizer with a small learning rate. Indeed, the learning rate is set to
η = 1 × 10⁻⁴ and the batch size is set to 500. Table 6 summarizes the setting of the model
implementation using Python and the TensorFlow library.
Learning joint distribution and time dependence In this study, we choose to set
the length of the training data window $n_t$ to 5, which means that the generator of the model
generates at each time step the scenarios for the next 5 days. In addition, we also use
the values of the last 20 days as a long memory for the input data of the model in order to
compare with the results of the conditional RBM. After the training process, we generate
samples of consecutive time steps to verify the quality of learning the joint distribution of
daily returns of the S&P 500 index and the VIX index. We run 50 Monte Carlo simulations and,
for each simulation, we generate 5 354 simulated observations from different random noise
series. According to Table 7, we find that the CDCWGAN model also learns the
joint distribution of the S&P 500 index and the VIX index fairly well. In addition, the average
correlation coefficient between daily returns of the two indices over 50 Monte Carlo
replications is equal to −71%, which is exactly the figure of the empirical correlation. In
Figure 22, we clearly observe that a CDCWGAN model captures the autocorrelation of the
training dataset well, as in the case of the conditional RBM shown in Figure 17.
Table 7: Comparison between the training sample and simulated samples using the condi-
tional deep convolutional Wasserstein GAN model
[Figure 22: autocorrelation by lag of the training samples and of the samples generated by the
CDCWGAN model. Panels: S&P 500 index; VIX index.]
Market generator Based on the above results, we consider that the CDCWGAN model can
also be used as a market generator. In the example of the S&P 500 and VIX indices, we have
trained the model with all historical observations using the method proposed by Mariani et
al. (2019), and the generator of this method generates scenarios over several consecutive
days. After having calibrated the training process, we may choose a date as the starting
point and generate new samples for several days. We then update the memory vector and
generate samples for the following days. Iteratively, the CDCWGAN model can generate
a multi-dimensional time series that has the same patterns as those of the training data in
terms of marginal distribution, correlation structure and time dependence. Compared with
the conditional RBM that we have studied, we do not need the normal score transformation
to ensure the learning of the marginal distributions, and the convolutional layers in the
generator and the discriminator may help us to extract more features from the training dataset.
As in Figure 18 for the conditional RBM, Figure 23 shows three time series of 250 trading
days starting on 22/11/2000, which are generated by the trained CDCWGAN model. We notice
that these three simulated time series clearly exhibit positive autocorrelation in each
dimension and capture the negative correlation between the S&P 500 and VIX indices very
well. Compared with the historical prices represented by the dashed line, these three
simulated time series have the same characteristics.
Figure 23: Comparison between scaled historical prices of S&P 500 and VIX indices and
synthetic time series generated by the CDCWGAN model
[x-axis: time step; series: Simulation 1, Simulation 2, Simulation 3 and Real.]
true real time series becomes one realization of its probability distribution. Suppose for
example that the maximum drawdown of the backtest is equal to 10%, and that the live
investment strategy has a maximum drawdown of 20%. Does it mean that the investment
strategy has been overfitted? Certainly yes if there is zero probability of
experiencing a maximum drawdown of 20% when the strategy is backtested with generative
models. Definitely not if some samples of the generative models have produced a maximum
drawdown larger than 20%.
22 The maximum drawdown corresponds to a loss. For example, if the maximum drawdown is equal to
10%, the investor may face a maximum loss of 10%. This is why the maximum drawdown is expressed by a
positive value.
23 If ξ (x) is greater than 3, this indicates that the strategy has a high skewness risk.
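The reasoning about the maximum drawdown can be made concrete with a small sketch: compute the maximum drawdown of each simulated backtest and estimate the share of simulations that reach a given level. The simulated net asset values below are toy data drawn from i.i.d. Gaussian returns (a hypothetical stand-in for backtests run on a market generator), and the function names are ours.

```python
import numpy as np

def max_drawdown(nav):
    """Largest peak-to-trough loss of a NAV path, expressed as a positive fraction."""
    running_max = np.maximum.accumulate(nav)
    return np.max(1.0 - nav / running_max)

def exceedance_probability(sim_navs, threshold):
    """Share of simulated paths whose maximum drawdown is at least `threshold`."""
    mdd = np.array([max_drawdown(path) for path in sim_navs])
    return np.mean(mdd >= threshold)

# Toy simulated backtests (assumption): 1 000 paths of 500 daily returns.
rng = np.random.default_rng(0)
returns = rng.normal(0.0003, 0.01, size=(1000, 500))
sim_navs = 100.0 * np.cumprod(1.0 + returns, axis=1)

p = exceedance_probability(sim_navs, threshold=0.20)
print(f"Estimated probability of a maximum drawdown of at least 20%: {p:.1%}")
```

If `p` is zero across a large number of simulations, a live drawdown of 20% points to overfitting; if it is materially positive, the live outcome is compatible with the simulated distribution.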
Figure 25: Comparison between cumulative performance of the risk parity strategy using
synthetic time series generated by the bootstrap sampling and conditional RBM models
[Panels: Bootstrap sampling; Conditional RBM. The real backtest is labelled Real.]
Figure 25 shows the cumulative performance of ten risk parity backtests using synthetic
time series generated by the two methods. Among these simulations, we notice that the
cumulative performance of the risk parity strategy using synthetic time series generated by
the bootstrap sampling method is more centered around the real time series of the backtest.
This phenomenon reflects a drawback of the traditional bootstrap sampling method,
which cannot capture the time dependence of the risk parity strategy. As shown in Figure 26,
our risk parity strategy has a first-lag positive autocorrelation of 20%, and we compare the
average autocorrelation function over 500 Monte Carlo simulations generated by the two
approaches. We notice that the bootstrap sampling method cannot replicate this positive
autocorrelation, whereas the conditional RBM can. This property is very important when we
backtest meta-strategies, that is, strategies built on top of an existing strategy. For
instance, if we want to design a stop-loss strategy for our risk parity strategy, we have to
tune several parameters. If we use the real historical time series
to calibrate the stop-loss parameters, we might fall into the trap of overfitting because we
only have one real historical time series. If we instead use time series generated by
the bootstrap sampling method, we will not find the appropriate values of the stop-loss
parameters, since autocorrelation plays a key role in the stop-loss strategy, as explained
by Kaminski and Lo (2014). Using the conditional RBM as a market generator to generate
time series is then a better way to find the appropriate parameters for the meta-strategy
and to manage the out-of-sample robustness.
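The loss of time dependence under i.i.d. bootstrap sampling can be checked on a toy example. The sketch below simulates an AR(1) series with a first-lag autocorrelation of about 20% (a hypothetical stand-in for the strategy returns), resamples it with replacement and compares the first-lag autocorrelations.

```python
import numpy as np

def autocorr(x, lag=1):
    """Sample autocorrelation of x at the given lag."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

rng = np.random.default_rng(0)
n, phi = 5000, 0.2

# Toy AR(1) series: x[t] = phi * x[t-1] + eps[t]
eps = rng.standard_normal(n)
x = np.empty(n)
x[0] = eps[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

# i.i.d. bootstrap: draw observations with replacement, ignoring their order.
boot = rng.choice(x, size=n, replace=True)

print(f"original: {autocorr(x):.3f}, bootstrap: {autocorr(boot):.3f}")
```

The resampled series keeps the marginal distribution of `x` but its autocorrelation is close to zero, which is why parameters of a stop-loss rule calibrated on such paths can be misleading.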
Figure 26: Autocorrelation function of the risk parity strategy using synthetic time series
generated by the bootstrap sampling and conditional RBM methods
[Panels: Bootstrap sampling; Conditional RBM. x-axis: lag (1 to 19); y-axis: autocorrelation.]
Figure 27: Histogram of the maximum drawdown of the risk parity strategy using synthetic
time series generated by the bootstrap sampling and conditional RBM methods
[Panels: Bootstrap sampling; Conditional RBM. x-axis: maximum drawdown (2% to 10%); y-axis: frequency.]
influence of the difference between realized volatility, we plot in Figure 28 the probability
distribution of the skew measure. In this case, we notice that the tail of the probability
distribution generated by the conditional RBM is thinner than that generated by the bootstrap
sampling method, but contains several extremely severe scenarios. The skew measure of the real
risk parity strategy from January 2018 to December 2019 is equal to 1.05, and this value
corresponds to the 66.3% quantile of the probability distribution generated by the
bootstrap sampling method and to the 70.1% quantile of the probability distribution generated
by the conditional RBM.
Figure 28: Histogram of the skew measure of the risk parity strategy using synthetic time
series generated by the bootstrap sampling and the conditional RBM methods
[Panels: Bootstrap sampling; Conditional RBM. x-axis: skew measure; y-axis: frequency.]
using 4 dense layers with the structure S_G = {100, 50, 10, 6}. We use a leaky ReLU function
with α = 0.5 for each dense layer and a sigmoid activation function for the output layer.
The discriminator also takes two inputs: a 6-dimensional vector of current daily returns
of futures contracts and a 20 × 6-dimensional vector of past values. We then construct
the discriminator using 4 dense layers with the structure S_D = {100, 50, 10, 1}. For the
first three dense layers, the activation function is a leaky ReLU function with
α = 0.5, whereas the output layer uses a tanh activation function.
To avoid the problem of outliers, we also apply the normal score transformation, as in the
case of conditional RBMs, before applying the MinMax scaling function to the input data. In
Figure 29, we compare the distributions of the skew measure of the risk parity strategy using
synthetic time series generated by the conditional RBM and conditional Wasserstein GAN
models. We notice that these two distributions are similar, except that the conditional RBM
produces a few extreme scenarios. We recall that the value of
the skew measure of the real risk parity strategy from January 2018 to December 2019 is
equal to 1.05. This value corresponds to the 70.1% quantile of the probability
distribution generated by the conditional RBM and to the 72.9% quantile of the probability
distribution generated by the conditional Wasserstein GAN.
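To make the layer sizes concrete, the sketch below runs one forward pass through a discriminator with four dense layers of sizes 100, 50, 10 and 1, leaky ReLU (α = 0.5) on the first three layers and tanh on the output. It is only an illustration: the weights are random and untrained, and the initialisation scheme, the concatenation of the two inputs and all function names are assumptions that are not specified in the text.

```python
import math
import random

def leaky_relu(x, alpha=0.5):
    """Leaky ReLU with the slope alpha = 0.5 quoted in the text."""
    return x if x > 0.0 else alpha * x

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (the initialisation scheme is an assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_discriminator(rng, n_assets=6, n_lags=20, sizes=(100, 50, 10, 1)):
    """Stack dense layers with output sizes 100, 50, 10 and 1."""
    n_in = n_assets + n_lags * n_assets  # 6 current returns + 20 x 6 past values
    layers = []
    for n_out in sizes:
        layers.append(init_layer(n_in, n_out, rng))
        n_in = n_out
    return layers

def discriminator_forward(layers, current, past):
    """Leaky ReLU on the first three layers, tanh on the output layer."""
    x = list(current) + list(past)  # concatenating both inputs is an assumption
    last = len(layers) - 1
    for k, (weights, biases) in enumerate(layers):
        activation = math.tanh if k == last else leaky_relu
        x = [activation(sum(w * x_i for w, x_i in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x[0]

rng = random.Random(0)
layers = build_discriminator(rng)
current = [rng.random() for _ in range(6)]       # stand-in for scaled current returns
past = [rng.random() for _ in range(20 * 6)]     # stand-in for scaled past values
score = discriminator_forward(layers, current, past)
shapes = [(len(w), len(w[0])) for w, _ in layers]
print(shapes, round(score, 4))
```

The printed shapes show the 126-dimensional input being reduced to a single score in [-1, 1].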
Figure 29: Histogram of the skew measure of the risk parity strategy using synthetic time
series generated by the conditional RBM and Wasserstein GAN models
[Panels: Conditional RBM (left), Conditional Wasserstein GAN (right); x-axis: skew measure, y-axis: Frequency]
contracts. In order to have a high-quality market generator, we believe that we should train
a conditional RBM or Wasserstein GAN model not only with the time series of the assets that
compose the investment portfolio, but also with those of several market regime indicators
such as the VIX index.
Figure 30: Histogram of the skew measure of the risk parity strategy using synthetic time
series generated by two conditional RBMs
[Panels: Trained with returns of assets (left), Trained with returns of assets and VIX index (right); the value of the real strategy is marked "Real" in each panel; x-axis: skew measure, y-axis: Frequency]
4 Conclusion
In this article, we explore the use of generative models for improving the robustness of trading
strategies. We consider two approaches for simulating financial time series. The first one is
based on restricted Boltzmann machines, whereas the second approach corresponds to the
family of generative adversarial networks, including Wasserstein distance models. Given a
historical sample of financial data, we show how to generate new samples of financial data
using these techniques. These new samples are called synthetic or fake
financial time series, and the objective is to preserve the statistical properties of the original
sample. By statistical properties, we mean the statistical moments of each univariate time
series, the stochastic dependence between the different variables that compose the
multi-dimensional random vector, and also the time dependence between the observations. If we
consider the financial time series as a multi-dimensional data matrix, the challenge is then
to model both the row and column stochastic structures.
There are few satisfactory methods for simulating non-Gaussian multi-dimensional financial
time series. For instance, the bootstrap method does not preserve the cross-correlation
between the different variables. The copula method is better, but it must use the
techniques of conditional augmented data in order to reproduce the autocorrelation functions.
Restricted Boltzmann machines and generative adversarial networks have been successful
for generating complex data with non-linear dependence structures. By applying them to
financial market data, our first results are encouraging and show that these new alternative
techniques may help to simulate non-Gaussian multi-dimensional financial time series.
This is particularly true when we consider the backtesting of trading strategies. In this
case, RBMs and GANs may be used for estimating the probability distribution of the
performance and risk statistics of the backtest. This opens the door to a new field of research for
improving the risk management of quantitative investment strategies.
References
Ackley, D., Hinton, G.E., and Sejnowski, T.J. (1985), A Learning Algorithm for Boltz-
mann Machines, Cognitive Science, 9(1), pp. 147-169.
Arjovsky, M., Chintala, S., and Bottou, L. (2017a), Wasserstein GAN, arXiv,
1701.07875.
Barbu, V., and Precupanu, T. (2012), Convexity and Optimization in Banach spaces,
Fourth edition, Springer Monographs in Mathematics, Springer.
Bengio, Y., and Delalleau, O. (2009), Justifying and Generalizing Contrastive Diver-
gence, Neural Computation, 21(6), pp. 1601-1621.
Cho, K., Ilin, A., and Raiko, T. (2011), Improved Learning of Gaussian-Bernoulli Restricted
Boltzmann Machines, in Proceedings of the Twentieth International Conference on
Artificial Neural Networks (ICANN) 2011, pp. 10-17.
Chu, C., Blanchet, J., and Glynn, P. (2019), Probability Functional Descent: A Uni-
fying Perspective on GANs, Variational Inference, and Reinforcement Learning, arXiv,
1901.10691.
Clevert, D., Unterthiner, T., and Hochreiter, S. (2015), Fast and Accurate Deep
Network Learning by Exponential Linear Units (ELUs), arXiv, 1511.07289.
Cont, R. (2001), Empirical Properties of Asset Returns: Stylized Facts and Statistical
Issues, Quantitative Finance, 1(2), pp. 223-236.
Cui, Z., Chen, W., and Chen, Y. (2016), Multi-scale Convolutional Neural Networks for
Time Series Classification, arXiv, 1603.06995.
Denton, E.L., Chintala, S., Szlam, A., and Fergus, R. (2015), Deep Generative Image
Models Using a Laplacian Pyramid of Adversarial Networks, in Cortes, C., Lawrence,
N.D., Lee, D.D., Sugiyama, M., and Garnett, R. (eds), Advances in Neural Information
Processing Systems, 28, pp. 1486-1494.
Dumoulin, V., and Visin, F. (2016), A Guide to Convolution Arithmetic for Deep Learning,
arXiv, 1603.07285.
Fernholz, L.T. (2012), Von Mises Calculus for Statistical Functionals, Lecture Notes in
Statistics, 19, Springer.
Givens, C.R., and Shortt, R.M. (1984), A Class of Wasserstein Metrics for Probability
Distributions, Michigan Mathematical Journal, 31(2), pp. 231-240.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., and Bengio, Y. (2014), Generative Adversarial Nets, in Ghahramani,
Z., Welling, M., Cortes, C., Lawrence, N.D., and Weinberger, K.Q. (eds), Advances in
Neural Information Processing Systems, 27, pp. 2672-2680.
Gozlan, N., Roberto, C., Samson, P.M., and Tetali, P. (2017), Kantorovich Duality
for General Transport Costs and Applications, Journal of Functional Analysis, 273(11),
pp. 3327-3405.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A.C. (2017),
Improved Training of Wasserstein GANs, in Guyon, I, Luxburg, U.V., Bengio, S., Wallach,
H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds), Advances in Neural Information
Processing Systems, 30, pp. 5767-5777.
Hinton, G.E. (2002), Training Products of Experts by Minimizing Contrastive Divergence,
Neural Computation, 14(8), pp. 1771-1800.
Hinton, G.E. (2012), A Practical Guide to Training Restricted Boltzmann Machines, in
Montavon, G., Orr, G.B., Müller, K-R. (eds), Neural Networks: Tricks of The Trade, pp.
599-619, Second edition, Springer.
Hinton, G.E., and Sejnowski, T.J. (1986), Learning and Relearning in Boltzmann Ma-
chines, Chapter 7 in Rumelhart, D.E., and McClelland, J.L. (eds), Parallel Distributed
Processing: Explorations in the Microstructure of Cognition, 1, pp. 282-317, MIT Press.
Hyland, S.L., Esteban, C., and Rätsch, G. (2017), Real-valued (Medical) Time Series
Generation with Recurrent Conditional GANs, arXiv, 1706.02633.
Isola, P., Zhu, J.Y., Zhou, T., and Efros, A.A. (2017), Image-to-image Translation
with Conditional Adversarial Networks, Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 1125-1134.
Jebara, T. (2004), Machine Learning: Discriminative and Generative, Springer Interna-
tional Series in Engineering and Computer Science, 755, Springer.
Kaminski, K.M., and Lo, A.W. (2014), When Do Stop-loss Rules Stop Losses?, Journal of
Financial Markets, 18, pp. 234-254.
Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018), Progressive Growing of GANs
for Improved Quality, Stability, and Variation, International Conference on Learning Rep-
resentations (ICLR 2018), 7, available on arXiv, 1710.10196.
Keziou, A. (2003), Dual Representation of φ-divergences and Applications, Comptes Rendus
de l’Académie des Sciences – Series I – Mathematics, 336(10), pp. 857-862.
Kindermann, R. and Snell, J.L. (1980), Markov Random Fields and Their Applications,
Contemporary Mathematics, American Mathematical Society.
Koller, D. and Friedman, N. (2009), Probabilistic Graphical Models: Principles and
Techniques, MIT Press.
Kondratyev, A., and Schwarz, C. (2020), The Market Generator, SSRN, https://www.
ssrn.com/abstract=3384948.
Koshiyama, A., Firoozye, N., and Treleaven, P. (2019), Generative Adversarial Net-
works for Financial Trading Strategies Fine-Tuning and Combination, arXiv, 1901.01751.
Krizhevsky, A. (2009), Learning Multiple Layers of Features from Tiny Images, University
of Toronto, Technical Report.
Laschos, V., Obermayer, K., Shen, Y., and Stannat, W. (2019), A Fenchel-Moreau-
Rockafellar Type Theorem on the Kantorovich-Wasserstein Space with Applications in
Partially Observable Markov Decision Processes, Journal of Mathematical Analysis and
Applications, 477(2), pp. 1133-1156.
LeCun, Y., and Bengio, Y. (1995), Convolutional Networks for Images, Speech, and Time
Series, in Arbib, M.A. (ed.), The Handbook of Brain Theory and Neural Networks, MIT
Press.
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M.A., and Huang, F.J. (2007), A
Tutorial on Energy-based Learning, Chapter 10 in Bakır, G., Hofmann, T., Schölkopf, B.,
Smola, A.J., Taskar, B., and Vishwanathan, S.V.N. (eds), Predicting Structured Data, pp.
191-246, MIT Press.
Liu, R., Lehman, J., Molino, P., Such, F.P., Frank, E., Sergeev, A., and Yosinski,
J. (2018), An Intriguing Failing of Convolutional Neural Networks and the Coordconv
Solution, in Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and
Garnett, R. (eds), Advances in Neural Information Processing Systems, 31, pp. 9605-9616.
Mao, X., Li, Q., Xie, H., Lau, R.Y.K., Wang, Z., and Smolley, P.S. (2017), Least Squares
Generative Adversarial Networks, Proceedings of the IEEE International Conference on
Computer Vision (ICCV), pp. 2794-2802.
Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. (2017), Unrolled Generative Ad-
versarial Networks, International Conference on Learning Representations (ICLR 2018),
7, available on arXiv, 1611.02163.
Mariani, G., Zhu, Y., Li, J., Scheidegger, F., Istrate, R., Bekas, C., Cristiano,
A., and Malossi, I. (2019), PAGAN: Portfolio Analysis with Generative Adversarial
Networks, arXiv, 1909.10578.
Mirza, M., and Osindero, S. (2014), Conditional Generative Adversarial Nets, arXiv,
1411.1784.
Mroueh, Y., and Sercu, T. (2017), Fisher GAN, in Guyon, I, Luxburg, U.V., Bengio, S.,
Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds), Advances in Neural
Information Processing Systems, 30, pp. 2513-2523.
Mroueh, Y., Li, C.L., Sercu, T., Raj, A., and Cheng, Y. (2017), Sobolev GAN, In-
ternational Conference on Learning Representations (ICLR 2018), 7, available on arXiv,
1711.04894.
Müller, A. (1997), Integral Probability Metrics and Their Generating Classes of Functions,
Advances in Applied Probability, 29(2), pp. 429-443.
Nguyen, X., Wainwright, M.J., and Jordan, M.I. (2010), Estimating Divergence Functionals
and the Likelihood Ratio by Convex Risk Minimization, IEEE Transactions on
Information Theory, 56(11), pp. 5847-5861.
Nowozin, S., Cseke, B., and Tomioka, R. (2016), f-GAN: Training Generative Neural
Samplers using Variational Divergence Minimization, in Lee, D.D., Sugiyama, M.,
Luxburg, U.V., Guyon, I., and Garnett, R. (eds), Advances in Neural Information Processing
Systems, 29, pp. 271-279.
Pearl, J. (1985), Bayesian Networks: A Model of Self-activated Memory for Evidential
Reasoning, in Proceedings of the 7th Conference of the Cognitive Science Society, pp.
329-334.
Peyré, G., and Cuturi, M. (2019), Computational Optimal Transport, Foundations and
Trends® in Machine Learning, 11(5-6), pp. 355-607.
Rachev, S.T. (1985), The Monge-Kantorovich Mass Transference Problem and its Stochas-
tic Applications, Theory of Probability & Its Applications, 29(4), pp. 647-676.
Rachev, S.T., and Rüschendorf, L. (1998), Mass Transportation Problems: Theory (Vol-
ume 1), Springer.
Radford, A., Metz, L., and Chintala, S. (2016), Unsupervised Representation Learning
with Deep Convolutional Generative Adversarial Networks, International Conference on
Learning Representations (ICLR 2016), available on arXiv, 1511.06434.
Rubner, Y., Tomasi, C., and Guibas, L.J. (2000), The Earth Mover’s Distance as a Metric
for Image Retrieval, International Journal of Computer Vision, 40(2), pp. 99-121.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X.
(2016), Improved Techniques for Training GANs, in Lee, D.D., Sugiyama, M., Luxburg,
U.V., Guyon, I., and Garnett, R. (eds), Advances in Neural Information Processing Sys-
tems, 29, pp. 2234-2242.
Seguy, V., Damodaran, B.B., Flamary, R., Courty, N., Rolet, A., and Blondel, M.
(2017), Large-scale Optimal Transport and Mapping Estimation, arXiv, 1711.02283.
Smolensky, P. (1986), Information Processing in Dynamical Systems: Foundations of Har-
mony Theory, Chapter 6 in Rumelhart, D.E., and McClelland, J.L. (eds), Parallel Dis-
tributed Processing: Explorations in the Microstructure of Cognition, 1, pp. 194-281, MIT
Press.
Taylor, G.W., Hinton, G.E., and Roweis, S.T. (2011), Two Distributed-state Models
For Generating High-Dimensional Time Series, Journal of Machine Learning Research,
12, pp. 1025-1068.
Villani, C. (2008), Optimal Transport: Old and New, Grundlehren der mathematischen
Wissenschaften, 338, Springer.
Wiese, M., Knobloch, R., Korn, R., and Kretschmer, P. (2020), Quant GANs: Deep
Generation of Financial Time Series, Quantitative Finance, forthcoming.
Xia, Q. (2008), Numerical Simulation of Optimal Transport Paths, arXiv, 0807.3723.
Xia, Q. (2009), The Geodesic Problem in Quasimetric Spaces, Journal of Geometric Anal-
ysis, 19(2), pp. 452-479.
Appendix
A Mathematical results
A.1 Fundamental concepts of undirected graph model
Probabilistic graphical models use graphs to describe interactions between random variables.
Each random variable is represented by a node (or vertex) and each direct interaction
between random variables is represented by an edge (or link). According to the directionality
of the edges in the graph, probabilistic graphical models can be divided into two categories:
directed graphical models and undirected graphical models. For instance, the Bayesian
networks introduced by Pearl (1985) are a type of directed graphical model, whereas
Markov random fields (or Markov networks) use undirected graphs (Kindermann and Snell,
1980).
In other words, each node belonging to Cn is fully connected with the other nodes of Cn. A
clique of a graph G is called maximal if no node of G can be added to it such that the
resulting set is still a clique.
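The two definitions can be checked mechanically on a small graph. The sketch below is illustrative only: the four-node graph and the helper names `is_clique` and `is_maximal_clique` are hypothetical and do not come from the paper.

```python
from itertools import combinations

def is_clique(nodes, edges):
    """True if every pair of distinct nodes is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(nodes, 2))

def is_maximal_clique(nodes, all_nodes, edges):
    """True if `nodes` is a clique and no outside node can be added to it."""
    nodes = set(nodes)
    if not is_clique(nodes, edges):
        return False
    return not any(is_clique(nodes | {extra}, edges) for extra in all_nodes - nodes)

# Hypothetical undirected graph: a triangle A-B-C plus a pendant edge C-D.
all_nodes = {"A", "B", "C", "D"}
edges = {frozenset(e) for e in [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]}

print(is_clique({"A", "B"}, edges))                          # a clique ...
print(is_maximal_clique({"A", "B"}, all_nodes, edges))       # ... but C can be added
print(is_maximal_clique({"A", "B", "C"}, all_nodes, edges))  # maximal
```

In this example {A, B} is a clique that is not maximal, whereas {A, B, C} and {C, D} are maximal.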
In other words, X is said to be a Markov random field if the joint probability distribution
satisfies the local Markov property.
(2009).
over G. This means that there exists a set of strictly positive functions {ψC , C ∈ C}, such
that the joint probability distribution is given by a product of factors:
$$P(x) = P\left(x(v_1), \ldots, x(v_n)\right) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C\left(x(v_C)\right) \tag{21}$$
where ψC is the potential function for the clique C, vC are all the nodes belonging to the
clique C and Z is the partition function given by:
$$Z = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C\left(x(v_C)\right)$$
The partition function Z is the normalization constant that ensures that the overall
distribution sums to 1.
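and all connections between visible layer and hidden layer are edges of G. In the case of
RBMs, we know that there are only cliques of size 1 (one visible unit or one hidden unit)
and cliques of size 2 (a pair of one visible unit and one hidden unit) in the graph G. Only
the cliques of size 2 are maximal, but keeping the potentials of the cliques of size 1 as
separate factors does not change the form of the factorization, since they can always be
absorbed into the pairwise potentials. Let C1 and C2 be respectively
the set of all the cliques of size 1 and the set of all the cliques of size 2. We obtain:
$$\mathcal{C}_1 = \left\{\{V_1\}, \ldots, \{V_m\}, \{H_1\}, \ldots, \{H_n\}\right\}$$
and:
$$\mathcal{C}_2 = \left\{\{V_1, H_1\}, \ldots, \{V_i, H_j\}, \ldots, \{V_m, H_n\}\right\}$$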
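According to the Hammersley-Clifford theorem, the probability distribution of an RBM is
given by:
$$\begin{aligned}
P(v, h) &= \frac{1}{Z} \prod_{C \in \mathcal{C}_1 \cup \mathcal{C}_2} \psi_C \\
&= \frac{1}{Z} \prod_{i=1}^{m} \psi_{V_i}(v_i) \prod_{j=1}^{n} \psi_{H_j}(h_j) \prod_{i=1}^{m} \prod_{j=1}^{n} \psi_{V_i,H_j}(v_i, h_j) \\
&= \frac{1}{Z} e^{-E(v,h)}
\end{aligned}$$
where:
$$\begin{aligned}
E(v, h) &= -\log \left( \prod_{i=1}^{m} \psi_{V_i}(v_i) \prod_{j=1}^{n} \psi_{H_j}(h_j) \prod_{i=1}^{m} \prod_{j=1}^{n} \psi_{V_i,H_j}(v_i, h_j) \right) \\
&= -\sum_{i=1}^{m} \log \psi_{V_i}(v_i) - \sum_{j=1}^{n} \log \psi_{H_j}(h_j) - \sum_{i=1}^{m} \sum_{j=1}^{n} \log \psi_{V_i,H_j}(v_i, h_j) \\
&= \sum_{i=1}^{m} E_i(v_i) + \sum_{j=1}^{n} E_j(h_j) + \sum_{i=1}^{m} \sum_{j=1}^{n} E_{i,j}(v_i, h_j)
\end{aligned}$$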
In the case of Bernoulli RBMs introduced in Section 2.1.1 on page 9, we defined Ei (vi ) =
−ai vi , Ej (hj ) = −bj hj and Ei,j (vi , hj ) = −wi,j vi hj . It follows that the energy function of
a Bernoulli RBM is equal to:
$$E(v, h) = -\sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} w_{i,j} v_i h_j$$
where ai and bj are bias terms associated with the visible and hidden variables Vi and Hj ,
and wi,j is the weight associated with the edge between Vi and Hj .
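For a very small Bernoulli RBM, the energy function and the partition function can be evaluated by brute force, which makes the role of Z as a normalization constant explicit. The sketch below is only an illustration: the parameter values and function names are hypothetical, and the enumeration over all 2^(m+n) states is feasible only for toy dimensions.

```python
import math
from itertools import product

def energy(v, h, a, b, w):
    """Energy of a Bernoulli RBM; w[i][j] links visible unit i and hidden unit j."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    interaction = sum(w[i][j] * v[i] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - interaction

def partition_function(a, b, w):
    """Sum of exp(-E) over every joint state (v, h)."""
    return sum(math.exp(-energy(v, h, a, b, w))
               for v in product((0, 1), repeat=len(a))
               for h in product((0, 1), repeat=len(b)))

def joint_probability(v, h, a, b, w, z):
    """P(v, h) = exp(-E(v, h)) / Z."""
    return math.exp(-energy(v, h, a, b, w)) / z

# Hypothetical parameters: m = 3 visible units, n = 2 hidden units.
a = [0.1, -0.2, 0.3]
b = [0.05, -0.1]
w = [[0.5, -0.3], [0.2, 0.4], [-0.6, 0.1]]

z = partition_function(a, b, w)
total = sum(joint_probability(v, h, a, b, w, z)
            for v in product((0, 1), repeat=3)
            for h in product((0, 1), repeat=2))
print(z, total)
```

The second printed value should be 1 up to rounding, since dividing by Z normalizes the Gibbs weights.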
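We have:
$$\begin{aligned}
\sum_{h} P(h \mid v)\, h_j &= \sum_{h} \prod_{k=1}^{n} P(h_k \mid v)\, h_j \\
&= \sum_{h} P(h_{-j} \mid v)\, P(h_j \mid v)\, h_j \\
&= \sum_{h_j \in \{0,1\}} \sum_{h_{-j}} P(h_{-j} \mid v)\, P(h_j \mid v)\, h_j \\
&= \sum_{h_j \in \{0,1\}} P(h_j \mid v)\, h_j \cdot \underbrace{\sum_{h_{-j}} P(h_{-j} \mid v)}_{=1} \\
&= P(h_j = 1 \mid v)
\end{aligned} \tag{23}$$
where h−j = (h1 , h2 , . . . , hj−1 , hj+1 , . . . , hn ) denotes the state of all hidden units except the
jth one.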
Following Fischer and Igel (2014), we divide the energy function E (v, h) into two parts: one
collecting all terms involving vi and one collecting all the other terms v−i :
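$$\begin{aligned}
E(v, h) &= -\sum_{k=1}^{m} a_k v_k - \sum_{j=1}^{n} b_j h_j - \sum_{k=1}^{m} \sum_{j=1}^{n} w_{k,j} v_k h_j \\
&= -a_i v_i - \sum_{j=1}^{n} w_{i,j} v_i h_j - \sum_{k=1, k \neq i}^{m} a_k v_k - \sum_{j=1}^{n} b_j h_j - \sum_{k=1, k \neq i}^{m} \sum_{j=1}^{n} w_{k,j} v_k h_j \\
&= v_i\, \alpha_i(h) + \beta(v_{-i}, h)
\end{aligned}$$
where:
$$\alpha_i(h) = -a_i - \sum_{j=1}^{n} w_{i,j} h_j$$
and:
$$\beta(v_{-i}, h) = -\sum_{k=1, k \neq i}^{m} a_k v_k - \sum_{j=1}^{n} b_j h_j - \sum_{k=1, k \neq i}^{m} \sum_{j=1}^{n} w_{k,j} v_k h_j$$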
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Similarly, we can divide the energy function E (v, h) into two parts: one collecting all terms
involving hj and one collecting all the other terms h−j :
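$$E(v, h) = h_j\, \gamma_j(v) + \delta(v, h_{-j})$$
where:
$$\gamma_j(v) = -b_j - \sum_{i=1}^{m} w_{i,j} v_i$$
and:
$$\delta(v, h_{-j}) = -\sum_{i=1}^{m} a_i v_i - \sum_{k=1, k \neq j}^{n} b_k h_k - \sum_{i=1}^{m} \sum_{k=1, k \neq j}^{n} w_{i,k} v_i h_k$$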
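Splitting the energy in this way leads to the standard closed form P(hj = 1 | v) = σ(−γj(v)) for a Bernoulli RBM. The sketch below compares this closed form with a brute-force evaluation over all hidden states. The parameter values and function names are hypothetical and only serve as a numerical check.

```python
import math
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def energy(v, h, a, b, w):
    """Bernoulli RBM energy; w[i][j] links visible unit i and hidden unit j."""
    return (-sum(a_i * v_i for a_i, v_i in zip(a, v))
            - sum(b_j * h_j for b_j, h_j in zip(b, h))
            - sum(w[i][j] * v[i] * h[j] for i in range(len(v)) for j in range(len(h))))

def conditional_hidden_bruteforce(j, v, a, b, w):
    """P(h_j = 1 | v) obtained by summing exp(-E) over all hidden states."""
    numerator = denominator = 0.0
    for h in product((0, 1), repeat=len(b)):
        weight = math.exp(-energy(v, h, a, b, w))
        denominator += weight
        if h[j] == 1:
            numerator += weight
    return numerator / denominator

def conditional_hidden_closed_form(j, v, b, w):
    """P(h_j = 1 | v) = sigmoid(-gamma_j(v)), gamma_j(v) = -b_j - sum_i w_ij v_i."""
    gamma_j = -b[j] - sum(w[i][j] * v[i] for i in range(len(v)))
    return sigmoid(-gamma_j)

# Hypothetical parameters: m = 3 visible units, n = 2 hidden units.
a = [0.1, -0.2, 0.3]
b = [0.05, -0.1]
w = [[0.5, -0.3], [0.2, 0.4], [-0.6, 0.1]]

max_gap = max(abs(conditional_hidden_bruteforce(j, v, a, b, w)
                  - conditional_hidden_closed_form(j, v, b, w))
              for v in product((0, 1), repeat=3) for j in range(2))
print(max_gap)
```

The largest gap over all visible states and hidden units should be at the level of floating-point rounding.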
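We have:
$$\ell(\theta \mid v) = \log \left( \sum_{h} e^{-E(v,h)} \right) - \log \left( \sum_{v',h} e^{-E(v',h)} \right)$$
Fischer and Igel (2014) computed the log-likelihood gradient ∇θ (v) = ∂θ ℓ (θ | v):
$$\begin{aligned}
\nabla_\theta(v) &= \frac{\partial}{\partial \theta} \log \left( \sum_{h} e^{-E(v,h)} \right) - \frac{\partial}{\partial \theta} \log \left( \sum_{v',h} e^{-E(v',h)} \right) \\
&= \nabla_\theta^{(1)}(v) + \nabla_\theta^{(2)}(v')
\end{aligned} \tag{24}$$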
We have26 :
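$$\begin{aligned}
\nabla_\theta^{(1)}(v) &= -\frac{\sum_{h} e^{-E(v,h)} \cdot \partial_\theta E(v,h)}{\sum_{h} e^{-E(v,h)}} \\
&= -\sum_{h} \frac{e^{-E(v,h)}}{\sum_{h'} e^{-E(v,h')}}\, \partial_\theta E(v,h) \\
&= -\sum_{h} P(h \mid v)\, \frac{\partial E(v,h)}{\partial \theta}
\end{aligned}$$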
and:
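$$\begin{aligned}
\nabla_\theta^{(2)}(v') &= \frac{\sum_{v',h} e^{-E(v',h)} \cdot \partial_\theta E(v',h)}{\sum_{v',h} e^{-E(v',h)}} \\
&= \sum_{v',h} \frac{e^{-E(v',h)}}{\sum_{v'',h'} e^{-E(v'',h')}}\, \frac{\partial E(v',h)}{\partial \theta} \\
&= \sum_{v',h} P(v', h)\, \frac{\partial E(v',h)}{\partial \theta} \\
&= \sum_{v'} \sum_{h} P(v')\, P(h \mid v')\, \frac{\partial E(v',h)}{\partial \theta} \\
&= \sum_{v'} P(v') \sum_{h} P(h \mid v')\, \frac{\partial E(v',h)}{\partial \theta}
\end{aligned}$$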
We recall that ∂ai E (v, h) = −vi and ∂bj E (v, h) = −hj . It follows that:
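$$\begin{aligned}
\frac{\partial \ell(\theta \mid v)}{\partial a_i} &= -\sum_{h} P(h \mid v)\, \frac{\partial E(v,h)}{\partial a_i} + \sum_{v'} P(v') \sum_{h} P(h \mid v')\, \frac{\partial E(v',h)}{\partial a_i} \\
&= \sum_{h} P(h \mid v)\, v_i - \sum_{v'} P(v') \sum_{h} P(h \mid v')\, v'_i \\
&= v_i - \sum_{v'} P(v')\, v'_i
\end{aligned}$$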
and27 :
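$$\begin{aligned}
\frac{\partial \ell(\theta \mid v)}{\partial b_j} &= \sum_{h} P(h \mid v)\, h_j - \sum_{v'} P(v') \sum_{h} P(h \mid v')\, h_j \\
&= P(h_j = 1 \mid v) - \sum_{v'} P(v')\, P(h_j = 1 \mid v')
\end{aligned}$$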
For the gradient with respect to W , we have ∂wi,j E (v, h) = −vi hj and:
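$$\begin{aligned}
\frac{\partial \ell(\theta \mid v)}{\partial w_{i,j}} &= \sum_{h} P(h \mid v)\, v_i h_j - \sum_{v'} P(v') \sum_{h} P(h \mid v')\, v'_i h_j \\
&= P(h_j = 1 \mid v)\, v_i - \sum_{v'} P(v')\, P(h_j = 1 \mid v')\, v'_i
\end{aligned}$$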
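As a sanity check, the closed-form gradient with respect to wi,j can be compared with a finite-difference derivative of the exact log-likelihood on a toy Bernoulli RBM. The sketch below is only illustrative: the parameter values, the step size and the function names are hypothetical, and the exact sums are tractable only because m and n are tiny.

```python
import math
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def energy(v, h, a, b, w):
    return (-sum(a_i * v_i for a_i, v_i in zip(a, v))
            - sum(b_j * h_j for b_j, h_j in zip(b, h))
            - sum(w[i][j] * v[i] * h[j] for i in range(len(v)) for j in range(len(h))))

def log_likelihood(v, a, b, w):
    """Exact log-likelihood of a visible vector, computed by enumeration."""
    m, n = len(a), len(b)
    free = sum(math.exp(-energy(v, h, a, b, w)) for h in product((0, 1), repeat=n))
    z = sum(math.exp(-energy(u, h, a, b, w))
            for u in product((0, 1), repeat=m) for h in product((0, 1), repeat=n))
    return math.log(free) - math.log(z)

def prob_hidden_on(j, v, b, w):
    """P(h_j = 1 | v) for a Bernoulli RBM."""
    return sigmoid(b[j] + sum(w[i][j] * v[i] for i in range(len(v))))

def grad_w_closed_form(i, j, v, a, b, w):
    """Data term minus model term, as in the last equation above."""
    data_term = prob_hidden_on(j, v, b, w) * v[i]
    model_term = sum(math.exp(log_likelihood(u, a, b, w)) * prob_hidden_on(j, u, b, w) * u[i]
                     for u in product((0, 1), repeat=len(a)))
    return data_term - model_term

def grad_w_numeric(i, j, v, a, b, w, eps=1e-6):
    """Central finite difference of the log-likelihood with respect to w[i][j]."""
    w_up = [row[:] for row in w]
    w_down = [row[:] for row in w]
    w_up[i][j] += eps
    w_down[i][j] -= eps
    return (log_likelihood(v, a, b, w_up) - log_likelihood(v, a, b, w_down)) / (2.0 * eps)

# Hypothetical parameters: m = 2 visible units, n = 2 hidden units.
a, b = [0.1, -0.2], [0.3, 0.0]
w = [[0.5, -0.4], [0.2, 0.7]]
v = (1, 0)

exact = grad_w_closed_form(0, 1, v, a, b, w)
numeric = grad_w_numeric(0, 1, v, a, b, w)
print(exact, numeric)
```

The model term is the expensive part: it requires the sum over all visible configurations, which is what sampling-based approximations try to avoid.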
26 Using the Bayes theorem, the conditional probability distribution P (h | v) is equal to:
$$P(h \mid v) = \frac{P(v, h)}{P(v)} = \frac{e^{-E(v,h)}}{\sum_{h'} e^{-E(v,h')}}$$
27 We use Equation (23) on page 56.
We notice that we have to sum over 2^m possible combinations of the visible variables when
calculating the second term ∇θ^(2)(v′). Therefore, we generally approximate the expectation
with respect to P(v′) by sampling from the model distribution.
where:
$$\begin{aligned}
\mathrm{KL}(P \,\|\, Q) &= \sum_{v} P(v) \log \frac{P(v)}{Q(v)} \\
&= \sum_{v} P(v) \log P(v) - \sum_{v} P(v) \log Q(v)
\end{aligned}$$
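For discrete distributions, the two expressions of the Kullback-Leibler divergence can be evaluated directly. The following sketch uses two hypothetical three-point distributions and hypothetical function names to check that both forms agree.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_v P(v) log(P(v) / Q(v)) on a common finite support."""
    return sum(p_v * math.log(p_v / q_v) for p_v, q_v in zip(p, q) if p_v > 0.0)

def kl_two_sums(p, q):
    """Same quantity written as sum P log P minus sum P log Q."""
    entropy_term = sum(p_v * math.log(p_v) for p_v in p if p_v > 0.0)
    cross_term = sum(p_v * math.log(q_v) for p_v, q_v in zip(p, q) if p_v > 0.0)
    return entropy_term - cross_term

# Hypothetical distributions on three states.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q), kl_two_sums(p, q))
```

The divergence is zero when both arguments coincide and strictly positive otherwise, and it is not symmetric in P and Q.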
We have:
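$$\frac{\partial\, \mathrm{KL}(P \,\|\, Q)}{\partial \theta} = \sum_{v} \frac{\partial P(v)}{\partial \theta} \log P(v) + \sum_{v} P(v)\, \frac{\partial \log P(v)}{\partial \theta} - \sum_{v} \frac{\partial P(v)}{\partial \theta} \log Q(v) - \sum_{v} P(v)\, \frac{\partial \log Q(v)}{\partial \theta} \tag{25}$$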
When P does not depend on the parameter set θ, we have ∂θ P (v) = 0 and the derivative
reduces to28 :
$$\frac{\partial\, \mathrm{KL}(P \,\|\, Q)}{\partial \theta} = -\sum_{v} P(v)\, \frac{\partial \log Q(v)}{\partial \theta}$$
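Since we have:
$$\sum_{v} P(v)\, \frac{\partial \log P(v)}{\partial \theta} = \sum_{v} \frac{\partial P(v)}{\partial \theta}$$
Equation (25) becomes:
$$\frac{\partial\, \mathrm{KL}(P \,\|\, Q)}{\partial \theta} = \sum_{v} \frac{\partial P(v)}{\partial \theta} \left( \log \frac{P(v)}{Q(v)} + 1 \right) - \sum_{v} P(v)\, \frac{\partial \log Q(v)}{\partial \theta}$$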
28 This is the case when P is equal to P^(0).
In Equation (26), it is often possible to compute the exact values for the first two terms, but
not for the third term. However, Hinton (2002) showed experimentally that this last term
is so small compared with the other two terms that it can be ignored. Thus, we can consider the
following approximation:
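$$\begin{aligned}
\frac{\partial\, \mathrm{CD}^{(k)}}{\partial \theta} &\approx \sum_{v} P^{(k)}(v)\, \frac{\partial \log P^{(\infty)}(v)}{\partial \theta} - \sum_{v} P^{(0)}(v)\, \frac{\partial \log P^{(\infty)}(v)}{\partial \theta} \\
&= \sum_{v} P^{(k)}(v)\, \frac{\partial \log P_\theta(v)}{\partial \theta} - \sum_{v} P^{(0)}(v)\, \frac{\partial \log P_\theta(v)}{\partial \theta} \\
&= \frac{1}{N} \sum_{s} \left( \frac{\partial \log P_\theta\left(v_{(s)}^{(k)}\right)}{\partial \theta} - \frac{\partial \log P_\theta\left(v_{(s)}^{(0)}\right)}{\partial \theta} \right)
\end{aligned}$$
where $v_{(s)}^{(0)}$ is the sth sample of the training set and $v_{(s)}^{(k)}$ is the associated sample after running
k steps of Gibbs sampling²⁹. By noticing that log Pθ (v) = ℓ (θ | v) and using Equation (24),
we obtain: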
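$$\begin{aligned}
\frac{\partial\, \mathrm{CD}^{(k)}}{\partial \theta} &= \frac{1}{N} \sum_{s=1}^{N} \left( \frac{\partial \ell\left(\theta \mid v_{(s)}^{(k)}\right)}{\partial \theta} - \frac{\partial \ell\left(\theta \mid v_{(s)}^{(0)}\right)}{\partial \theta} \right) \\
&= \frac{1}{N} \sum_{s=1}^{N} \left( \nabla_\theta\left(v_{(s)}^{(k)}\right) - \nabla_\theta\left(v_{(s)}^{(0)}\right) \right) \\
&= \frac{1}{N} \sum_{s=1}^{N} \left( \nabla_\theta^{(1)}\left(v_{(s)}^{(k)}\right) - \nabla_\theta^{(1)}\left(v_{(s)}^{(0)}\right) \right) \\
&= \frac{1}{N} \sum_{s=1}^{N} \sum_{h} \left( P\left(h \mid v_{(s)}^{(0)}\right) \frac{\partial E\left(v_{(s)}^{(0)}, h\right)}{\partial \theta} - P\left(h \mid v_{(s)}^{(k)}\right) \frac{\partial E\left(v_{(s)}^{(k)}, h\right)}{\partial \theta} \right)
\end{aligned}$$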
Therefore, we can compute the derivatives with respect to the parameters ai , bj and wi,j :
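$$\begin{aligned}
\frac{\partial\, \mathrm{CD}^{(k)}}{\partial a_i} &= \frac{1}{N} \sum_{s=1}^{N} \left( v_{(s),i}^{(k)} - v_{(s),i}^{(0)} \right) \\
\frac{\partial\, \mathrm{CD}^{(k)}}{\partial b_j} &= \frac{1}{N} \sum_{s=1}^{N} \left( P\left(h_j = 1 \mid v_{(s)}^{(k)}\right) - P\left(h_j = 1 \mid v_{(s)}^{(0)}\right) \right) \\
\frac{\partial\, \mathrm{CD}^{(k)}}{\partial w_{i,j}} &= \frac{1}{N} \sum_{s=1}^{N} \left( P\left(h_j = 1 \mid v_{(s)}^{(k)}\right) \cdot v_{(s),i}^{(k)} - P\left(h_j = 1 \mid v_{(s)}^{(0)}\right) \cdot v_{(s),i}^{(0)} \right)
\end{aligned}$$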
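The three derivatives translate almost line by line into code for a Bernoulli RBM. The sketch below is a minimal illustration rather than the implementation used in the study: the toy batch, the parameter values, the random seed and all function names are hypothetical, and the conditional probabilities use the standard sigmoid forms of the Bernoulli RBM.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def prob_hidden(v, b, w):
    """Vector of P(h_j = 1 | v)."""
    return [sigmoid(b[j] + sum(w[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b))]

def prob_visible(h, a, w):
    """Vector of P(v_i = 1 | h)."""
    return [sigmoid(a[i] + sum(w[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def sample_bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def gibbs_chain(v0, a, b, w, k, rng):
    """Run k steps of block Gibbs sampling starting from the training sample v0."""
    v = list(v0)
    for _ in range(k):
        h = sample_bernoulli(prob_hidden(v, b, w), rng)
        v = sample_bernoulli(prob_visible(h, a, w), rng)
    return v

def cd_gradients(batch, a, b, w, k, rng):
    """Derivatives of CD(k) with respect to a, b and w, averaged over the batch."""
    m, n, size = len(a), len(b), len(batch)
    grad_a = [0.0] * m
    grad_b = [0.0] * n
    grad_w = [[0.0] * n for _ in range(m)]
    for v0 in batch:
        vk = gibbs_chain(v0, a, b, w, k, rng)
        p0, pk = prob_hidden(v0, b, w), prob_hidden(vk, b, w)
        for j in range(n):
            grad_b[j] += (pk[j] - p0[j]) / size
        for i in range(m):
            grad_a[i] += (vk[i] - v0[i]) / size
            for j in range(n):
                grad_w[i][j] += (pk[j] * vk[i] - p0[j] * v0[i]) / size
    return grad_a, grad_b, grad_w

# Hypothetical toy problem: m = 3 visible units, n = 2 hidden units.
rng = random.Random(42)
batch = [(1, 0, 1), (0, 1, 1), (1, 1, 0), (0, 0, 1)]
a, b = [0.0, 0.1, -0.1], [0.2, -0.2]
w = [[0.3, -0.1], [0.0, 0.2], [-0.4, 0.5]]

grad_a, grad_b, grad_w = cd_gradients(batch, a, b, w, k=1, rng=rng)
print(grad_a, grad_b, grad_w)
```

A descent step on the CD objective would then move each parameter against its gradient, for instance wi,j ← wi,j − η ∂CD(k)/∂wi,j with a learning rate η chosen by the user.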
where:
$$E_g(v, h) = \sum_{i=1}^{m} \frac{(v_i - a_i)^2}{2 \sigma_i^2} - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} w_{i,j}\, \frac{v_i h_j}{\sigma_i^2}$$
29 In this case, $v_{(s)}^{(0)}$ is the starting value of the Gibbs sampler.
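We have:
$$\begin{aligned}
\frac{\partial E_g(v, h)}{\partial a_i} &= -\frac{v_i - a_i}{\sigma_i^2} \\
\frac{\partial E_g(v, h)}{\partial b_j} &= -h_j \\
\frac{\partial E_g(v, h)}{\partial w_{i,j}} &= -\frac{v_i h_j}{\sigma_i^2} \\
\frac{\partial E_g(v, h)}{\partial \sigma_i} &= -\frac{(v_i - a_i)^2}{\sigma_i^3} + \frac{2 v_i}{\sigma_i^3} \sum_{j} w_{i,j} h_j
\end{aligned}$$
We deduce that the derivative of ℓ (θ | v) with respect to the weight wi,j is given by:
$$\begin{aligned}
\frac{\partial \ell(\theta \mid v)}{\partial w_{i,j}} &= \sum_{h} P(h \mid v)\, \frac{v_i h_j}{\sigma_i^2} - \int_{v'} p(v') \sum_{h} P(h \mid v')\, \frac{v'_i h_j}{\sigma_i^2}\, \mathrm{d}v' \\
&= P(h_j = 1 \mid v)\, \frac{v_i}{\sigma_i^2} - \int_{v'} p(v')\, P(h_j = 1 \mid v')\, \frac{v'_i}{\sigma_i^2}\, \mathrm{d}v'
\end{aligned}$$
Similarly, we can compute the other derivatives. For ai and bj , we obtain:
$$\frac{\partial \ell(\theta \mid v)}{\partial a_i} = \frac{v_i}{\sigma_i^2} - \int_{v'} p(v')\, \frac{v'_i}{\sigma_i^2}\, \mathrm{d}v'$$
and:
$$\frac{\partial \ell(\theta \mid v)}{\partial b_j} = P(h_j = 1 \mid v) - \int_{v'} p(v')\, P(h_j = 1 \mid v')\, \mathrm{d}v'$$
Finally, we obtain for the parameter σi :
$$\begin{aligned}
\frac{\partial \ell(\theta \mid v)}{\partial \sigma_i} = {} & \frac{(v_i - a_i)^2}{\sigma_i^3} - \frac{2 v_i}{\sigma_i^3} \sum_{j} w_{i,j}\, P(h_j = 1 \mid v) \; - \\
& \int_{v'} p(v') \left( \frac{(v'_i - a_i)^2}{\sigma_i^3} - \frac{2 v'_i}{\sigma_i^3} \sum_{j} w_{i,j}\, P(h_j = 1 \mid v') \right) \mathrm{d}v'
\end{aligned}$$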
$$\ell(\theta \mid v_t) = \log p_\theta(v_t \mid c_t)$$
where the set of parameters becomes θ = (a, b, W, P, Q). We can then compute the partial
derivative of the energy function by using the chain rule:
Similarly, we have:
and:
$$\frac{\partial \ell(\theta \mid v_t)}{\partial p_{k,j}} = P(h_{t,j} = 1 \mid v_t, c_t)\, c_{t,k} - \int_{v'_t} p(v'_t \mid c_t)\, P(h_{t,j} = 1 \mid v'_t, c_t)\, c_{t,k}\, \mathrm{d}v'_t$$
The calculations for the other derivatives remain unchanged and are the same as those
obtained for the Gaussian-Bernoulli RBM.
Remark 7. Different divergence measures can be used for modeling the function φ:
In order to establish the link with GAN optimization problems, Nguyen et al. (2010)
proposed to compute a variational characterization of these φ-divergence measures by looking
at the convex dual. For that, we need to introduce the Fenchel conjugate, which is a
30 This means that if we consider a σ-field A ⊆ X such that Q (A) = 0, then P (A) = 0.
31 This last condition ensures that Dφ (P k Q) = 0 if P = Q.
32 The condition that imposes that P must be absolutely continuous with respect to Q is related to the
fundamental tool in convex analysis (Barbu and Precupanu, 2012). Let us consider a function
f : X → R ∪ {+∞}. According to the Riesz representation theorem, it is possible to identify
the dual space X ∗ of the Banach space X . Therefore, we can work on the product space
X ∗ × X associated with the scalar product ⟨x∗ , x⟩. The Fenchel transform is then defined
on the dual space such that:
$$f^* : \mathcal{X}^* \to \mathbb{R}, \qquad x^* \mapsto \sup_{x \in \operatorname{dom} f} \left\{ \langle x^*, x \rangle - f(x) \right\}$$
for all x ∈ X . If we consider the function φ associated with a given φ-divergence, the space
X is R and the dual space is also R. In order to express the φ-divergence in terms of loss,
Nguyen et al. (2010) simply expressed φ in terms of its conjugate:
To find the optimal function such that equality in the supremum is obtained, we introduce
the notation x̃ = ϕ (x) and the quantity C (x̃) defined by:
$$\begin{aligned}
C(\tilde{x}) &= C(\phi(x)) \\
&= \mathbb{E}\left[\phi(X) \mid X \sim P\right] - \mathbb{E}\left[\varphi^*(\phi(X)) \mid X \sim Q\right]
\end{aligned} \tag{30}$$
By computing the derivative of C (x̃), Nowozin et al. (2016) found that the optimal function
ϕ⋆ is³³:
$$\phi^\star(x) = \varphi'\left( \frac{\mathrm{d}P(x)}{\mathrm{d}Q(x)} \right) \tag{31}$$
33 See Broniatowski and Keziou (2006, Theorem 4.4) for a formal proof.
However, evaluating this quantity is impossible because the distribution function P is
unknown. Therefore, ϕ should be flexible enough to approximate the derivative φ′ everywhere.
This is why GAN models use deep neural networks to estimate it. This leads us to introduce
the parameter θd that will be optimized during the training process. In this context, we
write the parameterized function D (x, θd ), which aims to estimate the function ϕ⋆ (x):
Consequently, Nowozin et al. (2016) proposed to use the resulting lower bound in order
to train GANs. G (z, θg ) represents the generative model that is also a neural network
and allows us to build the probability distribution Pmodel , which should estimate the given
probability distribution Pdata . Thus, we obtain the saddle point problem:
where:
We deduce that:
$$\varphi'(x) = \log \frac{x}{x + 1}$$
and³⁴:
$$\varphi^*(x) = -\log\left(1 - e^{x}\right)$$
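34 We have:
$$\varphi^*(t) = \sup_{x \in \operatorname{dom} \varphi} \left\{ \langle x, t \rangle - \varphi(x) \right\}$$
It follows that h (x) = xt − x log x + (x + 1) log (x + 1) and:
$$h'(x) = t - \log \frac{x}{x + 1}$$
The supremum is reached at the point x⋆ such that h′ (x⋆ ) = 0, implying that:
$$x^\star = \frac{e^{t}}{1 - e^{t}}$$
Finally, we obtain:
$$\begin{aligned}
\varphi^*(t) &= \frac{e^{t}}{1 - e^{t}}\, t - \frac{e^{t}}{1 - e^{t}} \log \frac{e^{t}}{1 - e^{t}} + \frac{1}{1 - e^{t}} \log \frac{1}{1 - e^{t}} \\
&= \frac{e^{t}}{1 - e^{t}}\, t - \frac{e^{t}}{1 - e^{t}} \left( t - \log\left(1 - e^{t}\right) \right) - \frac{1}{1 - e^{t}} \log\left(1 - e^{t}\right) \\
&= \frac{e^{t}}{1 - e^{t}} \log\left(1 - e^{t}\right) - \frac{1}{1 - e^{t}} \log\left(1 - e^{t}\right) \\
&= -\log\left(1 - e^{t}\right)
\end{aligned}$$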
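The closed form of the conjugate can be checked numerically by maximizing xt − φ(x) over a grid. In the sketch below, φ(x) = x log x − (x + 1) log(x + 1) is read off the function h(x) used in the footnote. The grid bounds, the grid size and the function names are hypothetical choices, and the check is restricted to t < 0, where the supremum is finite.

```python
import math

def phi(x):
    """Generator function implied by h(x) = x t - phi(x) in the footnote."""
    return x * math.log(x) - (x + 1.0) * math.log(x + 1.0)

def conjugate_closed_form(t):
    """phi*(t) = -log(1 - exp(t)), defined for t < 0."""
    return -math.log(1.0 - math.exp(t))

def conjugate_numeric(t, grid_size=100000, x_max=20.0):
    """Approximate sup_x {x t - phi(x)} by a grid search over (0, x_max]."""
    best = -math.inf
    for k in range(1, grid_size + 1):
        x = x_max * k / grid_size
        best = max(best, x * t - phi(x))
    return best

for t in (-2.0, -1.0, -0.5):
    print(t, conjugate_numeric(t), conjugate_closed_form(t))
```

The grid value can only underestimate the supremum, so it should approach the closed form from below as the grid is refined.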
However, contrary to the previous case, probability functionals take values in probability
spaces, where functional derivatives need to be defined. Moreover, the dimension of the
space is potentially infinite. Therefore, the difficulty is to transform the functional optimization
problem into a convex optimization problem that can be solved using traditional numerical
algorithms such as gradient descent.
In order to define the derivative in B (X ), Chu et al. (2019) used the Gâteaux derivative
of the functional JP at Q in the direction of δ = P − Q:
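$$\begin{aligned}
J_P'(P - Q) &= \mathrm{d}J_P(Q, \delta) \\
&= \left. \frac{\mathrm{d}}{\mathrm{d}\epsilon} J_P(Q + \epsilon \delta) \right|_{\epsilon = 0} \\
&= \lim_{\epsilon \to 0} \frac{J_P(Q + \epsilon \delta) - J_P(Q)}{\epsilon}
\end{aligned} \tag{35}$$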
This definition initially comes from Von Mises calculus, and the Gâteaux derivative is also
called the ‘Volterra’ derivative (Fernholz, 2012). Chu et al. (2019) recalled that the Gâteaux
derivative has an integral representation J′P (δ) = ∫X g (x) δ (dx), where the function g : X →
R is called the ‘influence function’ or the influence curve³⁵. It follows that:
$$\begin{aligned}
J_P'(P - Q) &= \int_{\mathcal{X}} g(x)\, (P - Q)(\mathrm{d}x) \\
&= \mathbb{E}\left[g(X) \mid X \sim P\right] - \mathbb{E}\left[g(X) \mid X \sim Q\right]
\end{aligned} \tag{36}$$
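We are now able to compute the derivative in order to recover the optimal function that
achieves the minimum of the φ-divergence:
$$\begin{aligned}
J_P'(P - Q) &= \int_{\mathcal{X}} \left. \frac{\mathrm{d}}{\mathrm{d}\epsilon}\, \varphi\left( \frac{\mathrm{d}Q(x) + \epsilon \left( \mathrm{d}P(x) - \mathrm{d}Q(x) \right)}{\mathrm{d}P(x)} \right) \right|_{\epsilon = 0} \mathrm{d}P(x) \\
&= \int_{\mathcal{X}} \left. \varphi'\left( \frac{\mathrm{d}Q(x) + \epsilon \left( \mathrm{d}P(x) - \mathrm{d}Q(x) \right)}{\mathrm{d}P(x)} \right) \right|_{\epsilon = 0} \frac{\mathrm{d}P(x) - \mathrm{d}Q(x)}{\mathrm{d}P(x)}\, \mathrm{d}P(x) \\
&= \int_{\mathcal{X}} \varphi'\left( \frac{\mathrm{d}Q(x)}{\mathrm{d}P(x)} \right) (P - Q)(\mathrm{d}x)
\end{aligned} \tag{37}$$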
35 The influence function is not unique. Indeed, for any function g that describes the Gâteaux differential
at Q, g + c also works. Thus, the influence function is uniquely defined up to an arbitrary additive constant.
We deduce that the influence function for the φ-divergence is equal to:
$$g_\varphi : \mathcal{X} \to \mathbb{R}, \qquad x \mapsto \varphi'\left( \frac{\mathrm{d}Q(x)}{\mathrm{d}P(x)} \right) \tag{38}$$
The influence function can then be associated with the optimal function ϕ⋆ introduced before
because we have gφ = ϕ⋆ ∈ F. We conclude that the GAN discriminator will try to estimate
the influence function. While Nguyen et al. (2010) focused on the dual form of the function
φ, Chu et al. (2019) used a more general approach by considering the dual form of the
probability functional in order to recover Goodfellow’s saddle point problem.
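On a finite state space, the Gâteaux derivative in Equation (37) and the influence function in Equation (38) can be compared numerically. The sketch below takes the functional J(Q) = Σ p(x) φ(q(x)/p(x)), uses the illustrative choice φ(x) = x log x, and checks a finite-difference derivative in the direction δ = P − Q against the sum of φ′(q/p)(p − q). The distributions, the generator φ and the function names are hypothetical.

```python
import math

def phi(x):
    """Illustrative generator function, chosen here as x log x."""
    return x * math.log(x)

def phi_prime(x):
    return math.log(x) + 1.0

def divergence(p, q):
    """Discrete analogue of the integral of phi(dQ/dP) with respect to P."""
    return sum(p_x * phi(q_x / p_x) for p_x, q_x in zip(p, q))

def gateaux_numeric(p, q, eps=1e-6):
    """Central finite difference of J(Q + eps * delta) at eps = 0, with delta = P - Q."""
    q_up = [q_x + eps * (p_x - q_x) for p_x, q_x in zip(p, q)]
    q_down = [q_x - eps * (p_x - q_x) for p_x, q_x in zip(p, q)]
    return (divergence(p, q_up) - divergence(p, q_down)) / (2.0 * eps)

def gateaux_influence(p, q, shift=0.0):
    """Sum of g(x) (P - Q)(x), where g = phi'(dQ/dP) plus an optional constant."""
    return sum((phi_prime(q_x / p_x) + shift) * (p_x - q_x) for p_x, q_x in zip(p, q))

# Hypothetical distributions on three states.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

numeric = gateaux_numeric(p, q)
analytic = gateaux_influence(p, q)
shifted = gateaux_influence(p, q, shift=5.0)
print(numeric, analytic, shifted)
```

Adding a constant to the influence function leaves the result unchanged because P − Q sums to zero, which is the non-uniqueness mentioned in the footnote on the influence function.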
In the case of probability functionals which take values in B (X ), the dual space can
be defined as the space L¹ (X ) of all real-valued Lipschitz functions defined on
X (Laschos et al., 2019). According to the Riesz representation theorem, it is possible to
identify the space B (X ) with its dual. We consider the product space B (X ) × L¹ (X ) with the
scalar product defined as:
$$\begin{aligned}
\langle \phi, Q \rangle &= \int_{\mathcal{X}} \phi(x)\, Q(\mathrm{d}x) \\
&= \int_{\mathcal{X}} \phi(x)\, \mathrm{d}Q(x)\, \mathrm{d}x
\end{aligned}$$
where ϕ ∈ L¹ (X ) and Q ∈ B (X ) × R. Thus, we can rewrite JP in terms of its convex
transform:
To bridge the gap between the two approaches, we have to demonstrate that J∗P (ϕ) =
E [φ∗ (ϕ (X)) | X ∼ P] for a given estimate of the function ϕ ∈ F. For that, Chu et al.
(2019) considered the Fenchel conjugate of the Jensen-Shannon divergence³⁶:
$$\begin{aligned}
J_P^*(\phi) &= J_{JS}^*(\phi) \\
&= -\frac{1}{2}\, \mathbb{E}\left[ \log\left( 1 - e^{2 \phi(X) - \log 2} \right) \mid X \sim P \right] - \frac{1}{2} \log 2
\end{aligned} \tag{40}$$
Using ϕ (x) = ½ log (1 − D (x, θd )) + ½ log 2, we obtain:
$$\mathbb{E}\left[\phi(X) \mid X \sim Q\right] - J_P^*(\phi) = \frac{1}{2}\, \mathbb{E}\left[ \log\left(1 - D(x, \theta_d)\right) \mid X \sim Q \right] + \frac{1}{2}\, \mathbb{E}\left[ \log D(x, \theta_d) \mid X \sim P \right] + \log 2 \tag{41}$$
Therefore, the descent algorithm applied to probability functionals is equivalent to
Goodfellow’s saddle point problem:
IPMs are looking for a critic function that maximizes the average discrepancy between the
two distributions P and Q. Contrary to φ-GAN models, where we have to find the variational
form of the divergence φ (x), the definition of an IPM directly gives the discrepancy operator:
for all ϕ ∈ F. The Wasserstein GAN proposed by Arjovsky et al. (2017a,b) considers
the function class F such that ϕ (x) is a 1-Lipschitz function. In order to present
different choices for the function class F, we consider the Lebesgue norm on the
measurable space Ω = (X , P):
$$\|f\|_2^2 = \int_{\mathcal{X}} f^2(x)\, P(\mathrm{d}x)$$
Let us denote the normed space by
L² (Ω) = {f : X → R | ‖f‖₂ < +∞} and the unit ball by B₁ = {f ∈ L² (Ω) | ‖f‖₂ ≤ 1}.
Choosing the function class F = B₁ leads to the Fisher GAN of Mroueh
and Sercu (2017). In a similar way, Mroueh et al. (2017) proposed to define Sobolev GAN
models by considering the following class of functions: F = {f ∈ L² (Ω) | ‖∇x f‖₂ ≤ 1}.
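In order to find the convex conjugate of J_JS(P), we consider the Fenchel-Moreau theorem:
$$
\begin{aligned}
J_{\mathrm{JS}}^{*}(\phi) &= \sup_{\mathbb{P} \in \mathcal{B}(\mathcal{X})} \left\{ \langle \phi, \mathbb{P} \rangle - J_{\mathrm{JS}}(\mathbb{P}) \right\} \\
&= \sup_{\mathbb{P} \in \mathcal{B}(\mathcal{X})} \left\{ \int_{\mathcal{X}} \phi(x)\, \mathbb{P}(\mathrm{d}x) - J_{\mathrm{JS}}(\mathbb{P}) \right\}
\end{aligned}
$$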
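We have:
$$
\begin{aligned}
f(\mathbb{P}) &= \int_{\mathcal{X}} \phi(x)\, \mathbb{P}(\mathrm{d}x) - J_{\mathrm{JS}}(\mathbb{P}) \\
&= \int_{\mathcal{X}} \phi(x)\, p(x)\, \mathrm{d}x - \frac{1}{2} \int_{\mathcal{X}} \left( p(x) \log \frac{2 p(x)}{p(x) + q(x)} + q(x) \log \frac{2 q(x)}{p(x) + q(x)} \right) \mathrm{d}x \\
&= \int_{\mathcal{X}} h(x)\, \mathrm{d}x
\end{aligned}
$$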
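where:
$$
\begin{aligned}
h(x) ={}& \phi(x)\, p(x) - \frac{1}{2} p(x) \log 2 - \frac{1}{2} p(x) \log p(x) + \frac{1}{2} p(x) \log\left( p(x) + q(x) \right) \\
& - \frac{1}{2} q(x) \log 2 - \frac{1}{2} q(x) \log q(x) + \frac{1}{2} q(x) \log\left( p(x) + q(x) \right)
\end{aligned}
$$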
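and:
$$
\begin{aligned}
\frac{\partial h(x)}{\partial p(x)} &= \phi(x) - \frac{1}{2} \log 2 - \frac{1}{2} \log p(x) - \frac{1}{2} + \frac{1}{2} \log\left( p(x) + q(x) \right) + \frac{1}{2} \frac{p(x)}{p(x) + q(x)} + \frac{1}{2} \frac{q(x)}{p(x) + q(x)} \\
&= \phi(x) - \frac{1}{2} \log 2 - \frac{1}{2} \log p(x) + \frac{1}{2} \log\left( p(x) + q(x) \right)
\end{aligned}
$$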
Since the first-order condition is ∂h(x)/∂p(x) = 0, we deduce that the optimal solution is:
$$
\begin{aligned}
\phi(x) &= \frac{1}{2} \log 2 + \frac{1}{2} \log \frac{p(x)}{p(x) + q(x)} \\
&= \frac{1}{2} \log \frac{2 p(x)}{p(x) + q(x)}
\end{aligned}
$$
Chu et al. (2019) noticed that:
$$
\begin{aligned}
\frac{q(x)}{p(x) + q(x)} &= 1 - \frac{p(x)}{p(x) + q(x)} \\
&= 1 - \frac{1}{2} e^{2\phi(x)}
\end{aligned}
$$
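In this case, we obtain:
$$
\begin{aligned}
h(x) &= \phi(x)\, p(x) - \frac{1}{2} p(x) \log \frac{2 p(x)}{p(x) + q(x)} - \frac{1}{2} q(x) \log \frac{2 q(x)}{p(x) + q(x)} \\
&= \phi(x)\, p(x) - \phi(x)\, p(x) - \frac{1}{2} q(x) \log\left( 2 - e^{2\phi(x)} \right)
\end{aligned}
$$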
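The first-order condition and the resulting closed form of h(x) can be verified on a small discrete example. The following sketch is an illustration only: the values of q and ϕ are hypothetical, ϕ must satisfy exp(2ϕ) < 2 for the optimal p to be positive, and the function names are not taken from the paper.

```python
import math

def h(phi_x, p_x, q_x):
    """Integrand h(x) = phi p - 1/2 p log(2p/(p+q)) - 1/2 q log(2q/(p+q))."""
    s = p_x + q_x
    return (phi_x * p_x
            - 0.5 * p_x * math.log(2.0 * p_x / s)
            - 0.5 * q_x * math.log(2.0 * q_x / s))

def optimal_p(phi_x, q_x):
    """Solve phi = 1/2 log(2p/(p+q)) for p, which requires exp(2 phi) < 2."""
    e = math.exp(2.0 * phi_x)
    return q_x * e / (2.0 - e)

def closed_form(phi_x, q_x):
    """Value of h at the optimum: -1/2 q log(2 - exp(2 phi))."""
    return -0.5 * q_x * math.log(2.0 - math.exp(2.0 * phi_x))

# Hypothetical inputs (assumed for illustration)
q = [0.4, 0.4, 0.2]
phi = [-0.3, 0.1, 0.2]

p_star = [optimal_p(f, q_x) for f, q_x in zip(phi, q)]
f_star = sum(h(f, p_x, q_x) for f, p_x, q_x in zip(phi, p_star, q))
j_star = sum(closed_form(f, q_x) for f, q_x in zip(phi, q))

# h is concave in p, so scaling p_star up or down should lower the objective
f_perturbed = [
    sum(h(f, p_x * scale, q_x) for f, p_x, q_x in zip(phi, p_star, q))
    for scale in (0.9, 1.1)
]
```

Here `f_star` and `j_star` coincide up to rounding, and both perturbed values are smaller than `f_star`, which is consistent with the stationary point being a maximum.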
In a binary classification problem, for a given observation xi, we have p = yi ∈ {0, 1}, which is the true label, and q = ŷi ∈ [0, 1], which is the predicted probability of the current model. We can use binary cross entropy to get a measure of dissimilarity between yi and ŷi:
Under the GAN framework, we have m samples of x0 and m samples of x1, which serve as the input data of the discriminator model. We denote by x = {x0, x1} the set of the two samples. Since ŷi = D(xi; θd), the formula above can be written as:
$$
\mathcal{L}(\theta_g, \theta_d) = -\frac{1}{2m} \sum_{i=1}^{2m} \left( y_i \log D(x_i; \theta_d) + (1 - y_i) \log\left( 1 - D(x_i; \theta_d) \right) \right)
$$
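If xi comes from the sample x0, yi takes the value 0, otherwise it takes the value 1. It follows that:
$$
\begin{aligned}
\mathcal{L}(\theta_g, \theta_d) &= -\frac{1}{2m} \left( \sum_{i=1}^{m} \log D(x_{1,i}; \theta_d) + \sum_{i=1}^{m} \log\left( 1 - D(x_{0,i}; \theta_d) \right) \right) \\
&= -\frac{1}{2} \left( \frac{1}{m} \sum_{i=1}^{m} \log D(x_{1,i}; \theta_d) + \frac{1}{m} \sum_{i=1}^{m} \log\left( 1 - D(G(z_i; \theta_g); \theta_d) \right) \right)
\end{aligned}
$$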
Therefore, the loss function L(θg, θd) corresponds to the average of the cross-entropy when considering several observations:
$$
\mathcal{L}(\theta_g, \theta_d) = -\frac{1}{2}\, \mathbb{E}\left[ \log D(x_1; \theta_d) \right] - \frac{1}{2}\, \mathbb{E}\left[ \log\left( 1 - D(G(z; \theta_g); \theta_d) \right) \right]
$$
We notice that C(θg, θd) is equal to −2 · L(θg, θd). Minimizing the loss function is then equivalent to maximizing C(θg, θd) with respect to θd.
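The equivalence between the pooled binary cross-entropy over 2m labelled observations and the split form over real and generated samples can be illustrated as follows. This is a minimal sketch: the discriminator outputs are simulated with hypothetical uniform random numbers rather than produced by a trained model, and the helper names are assumptions.

```python
import math
import random

def binary_cross_entropy(labels, probs):
    """Average binary cross-entropy over all labelled observations."""
    total = sum(
        y * math.log(d) + (1 - y) * math.log(1.0 - d)
        for y, d in zip(labels, probs)
    )
    return -total / len(labels)

def split_loss(d_real, d_fake):
    """-1/2 (mean log D(x1) + mean log(1 - D(G(z)))) with m samples each."""
    m = len(d_real)
    real_term = sum(math.log(d) for d in d_real) / m
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / m
    return -0.5 * (real_term + fake_term)

rng = random.Random(0)
m = 5
# Hypothetical discriminator outputs, kept away from 0 and 1
d_real = [rng.uniform(0.05, 0.95) for _ in range(m)]  # D(x_{1,i}; theta_d)
d_fake = [rng.uniform(0.05, 0.95) for _ in range(m)]  # D(G(z_i; theta_g); theta_d)

labels = [1] * m + [0] * m          # y = 1 for real data, y = 0 for generated data
loss_pooled = binary_cross_entropy(labels, d_real + d_fake)
loss_split = split_loss(d_real, d_fake)
c_value = -2.0 * loss_split         # C(theta_g, theta_d) = -2 L(theta_g, theta_d)
```

The two losses agree up to rounding, and `c_value` is negative since it is a sum of averaged log-probabilities.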
where F(P, Q) is the Fréchet class³⁷. The infimum is then obtained by considering all joint probability measures F on X × Y such that P and Q are the marginals. Such joint probability measures are called transportation plans. The Kantorovich transportation problem is now convex and becomes a linear programming problem, which is easier to solve than the Monge transportation problem. However, searching among all joint probability measures F ∈ F(P, Q) can also be computationally intractable. For instance, Seguy et al. (2017) recall that solving the linear program takes O(n³ log n) when n is the size of the support in the case of discrete probability distributions.
Remark 10. Let us consider the particular case where P and Q are continuous Lebesgue measures and the cost function c corresponds to a p-Euclidean distance d(x, y). The solution to the Monge-Kantorovich problem is then defined as the p-Wasserstein distance³⁸:
$$
W_p(\mathbb{P}, \mathbb{Q}) = \left( \inf_{\mathbb{F} \in \mathcal{F}(\mathbb{P}, \mathbb{Q})} \int_{\mathcal{X} \times \mathcal{Y}} d(x, y)^p\, \mathrm{d}\mathbb{F}(x, y) \right)^{1/p}
$$
38 It is also known as the earth mover’s distance (EMD), which has been used by Rubner et al. (2000) for
Remark 11. In some particular cases, optimal transport problems can be easily solved. Let us consider the case where X = Y is a one-dimensional space, and c(x, y) is a convex function that satisfies the following condition: if x1 < x2 and y1 < y2, then c(x2, y2) − c(x1, y2) − c(x2, y1) + c(x1, y1) < 0. In this case, the optimal transport plan respects the ordering of the elements. Consequently, the solution corresponds to a monotone rearrangement of P into Q, and solving the problem amounts to sorting the elements of a list (Brenier, 1991).
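The sorting argument of Remark 11 can be illustrated on two small one-dimensional samples. The sketch below assumes two empirical distributions with the same number of equally weighted points and the cost |x − y|^p; in that setting an optimal plan can be taken to be a permutation, so the sorted matching is compared with a brute-force search over all permutations. The function names and sample values are hypothetical.

```python
import itertools

def wasserstein_1d(xs, ys, p=2):
    """p-Wasserstein distance between two equally weighted 1D samples via sorting."""
    if len(xs) != len(ys):
        raise ValueError("both samples must have the same size")
    total = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys)))
    return (total / len(xs)) ** (1.0 / p)

def wasserstein_bruteforce(xs, ys, p=2):
    """Same quantity obtained by enumerating every one-to-one assignment."""
    n = len(xs)
    best = min(
        sum(abs(x - ys[j]) ** p for x, j in zip(xs, perm))
        for perm in itertools.permutations(range(n))
    )
    return (best / n) ** (1.0 / p)

# Hypothetical samples (assumed for illustration)
xs = [0.3, -1.2, 2.5, 0.9, -0.4]
ys = [1.1, 0.0, -0.7, 3.2, 0.5]

w1_sorted, w1_brute = wasserstein_1d(xs, ys, p=1), wasserstein_bruteforce(xs, ys, p=1)
w2_sorted, w2_brute = wasserstein_1d(xs, ys, p=2), wasserstein_bruteforce(xs, ys, p=2)
```

The brute-force search costs n! evaluations and is only usable for tiny samples, whereas the sorted matching costs O(n log n).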
According to Villani (2008), ϕ and ψ are integrable such that (ϕ, ψ) ∈ L¹(X, P) × L¹(Y, Q). The functions ϕ and ψ are often called Kantorovich potentials³⁹.
Remark 12. Considering the viewpoint of the manufacturer who cares about the cost is equivalent to looking at the solution of the Monge-Kantorovich primal problem. Considering the viewpoint of the transportation company that cares about optimizing the profit is equivalent to looking at the solution of the Monge-Kantorovich dual problem.
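A rigorous proof of the dual formulation⁴⁰ is given by Villani (2008). We recall that the condition F ∈ F(P, Q) implies:
$$
\begin{aligned}
\int_{\mathcal{X} \times \mathcal{Y}} \left( \psi(y) - \phi(x) \right) \mathrm{d}\mathbb{F}(x, y) &= \int_{\mathcal{X} \times \mathcal{Y}} \psi(y)\, \mathrm{d}\mathbb{F}(x, y) - \int_{\mathcal{X} \times \mathcal{Y}} \phi(x)\, \mathrm{d}\mathbb{F}(x, y) \\
&= \int_{\mathcal{Y}} \int_{\mathcal{X}} \psi(y)\, \mathrm{d}\mathbb{F}(x, y) - \int_{\mathcal{X}} \int_{\mathcal{Y}} \phi(x)\, \mathrm{d}\mathbb{F}(x, y) \\
&= \int_{\mathcal{Y}} \psi(y)\, \mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}} \phi(x)\, \mathrm{d}\mathbb{P}(x)
\end{aligned} \tag{45}
$$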
39 Brenier (1991, Theorem 1.3) showed the link between Kantorovich potentials and mapping functions in
If F ∉ F(P, Q), we assume by convention that the difference between the two members is infinite and denote by 1_{F(P,Q)}(F) the convex indicator function of F(P, Q):
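$$
\begin{aligned}
(\ast) &= \inf_{\mathbb{F} \in \mathcal{F}(\mathbb{P}, \mathbb{Q})} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, \mathrm{d}\mathbb{F}(x, y) \\
&= \inf_{\mathbb{F} \in \mathcal{F}(\mathbb{P}, \mathbb{Q})} \left\{ \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, \mathrm{d}\mathbb{F}(x, y) + \mathbb{1}_{\mathcal{F}(\mathbb{P}, \mathbb{Q})}(\mathbb{F}) \right\} \\
&= \inf_{\mathbb{F} \in \mathcal{F}(\mathbb{P}, \mathbb{Q})} \left\{ \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, \mathrm{d}\mathbb{F}(x, y) + \sup_{(\phi, \psi)} \left\{ \int_{\mathcal{Y}} \psi(y)\, \mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}} \phi(x)\, \mathrm{d}\mathbb{P}(x) - \int_{\mathcal{X} \times \mathcal{Y}} \left( \psi(y) - \phi(x) \right) \mathrm{d}\mathbb{F}(x, y) \right\} \right\}
\end{aligned}
$$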
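where:
$$
\Gamma(\mathbb{P}, \mathbb{Q}, \mathbb{F}, \phi, \psi) = \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, \mathrm{d}\mathbb{F}(x, y) + \int_{\mathcal{Y}} \psi(y)\, \mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}} \phi(x)\, \mathrm{d}\mathbb{P}(x) - \int_{\mathcal{X} \times \mathcal{Y}} \left( \psi(y) - \phi(x) \right) \mathrm{d}\mathbb{F}(x, y)
$$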
Since we have:
it follows that:
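$$
\inf_{\mathbb{F} \in \mathcal{F}(\mathbb{P}, \mathbb{Q})} \Gamma(\mathbb{P}, \mathbb{Q}, \mathbb{F}, \phi, \psi) = \int_{\mathcal{Y}} \psi(y)\, \mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}} \phi(x)\, \mathrm{d}\mathbb{P}(x) + \inf_{\mathbb{F} \in \mathcal{F}(\mathbb{P}, \mathbb{Q})} \int_{\mathcal{X} \times \mathcal{Y}} \left( c(x, y) - \left( \psi(y) - \phi(x) \right) \right) \mathrm{d}\mathbb{F}(x, y)
$$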
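because:
$$
\inf_{\mathbb{F} \in \mathcal{F}(\mathbb{P}, \mathbb{Q})} \int_{\mathcal{X} \times \mathcal{Y}} \left( c(x, y) - \left( \psi(y) - \phi(x) \right) \right) \mathrm{d}\mathbb{F}(x, y) = 0
$$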
for all y ∈ Y. In the particular case where the cost function is a distance, a c-convex function
is simply a 1-Lipschitz function, and is equal to its c-transform. Indeed, let us consider that
ϕ is 1-Lipschitz such that ϕ(x) − ϕ(y) ≤ c(x, y). We have ϕ(x) ≤ ϕ(y) + c(x, y) and:
Again, it is possible to understand the c-transform through the economic point of view. Let
us recall that the transportation company needs to satisfy the condition ψ (y) − ϕ (x) ≤
c (x, y) in order to remain competitive. It follows that ψ (y) ≤ ϕ (x) + c (x, y) and ϕ (x) ≥
ψ (y) − c(x, y). To maximize its profits, the company will choose the pair (ϕ, ψ) such that:
$$
\begin{aligned}
\psi(y) &= \inf_{x} \left\{ \phi(x) + c(x, y) \right\} \\
\phi(x) &= \sup_{y} \left\{ \psi(y) - c(x, y) \right\}
\end{aligned}
$$
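The pair of formulas above can be explored on a finite grid. The following sketch is illustrative only: it assumes X = Y = [0, 2] discretized on 21 points and the distance cost c(x, y) = |x − y|, and it replaces the infimum and the supremum by a minimum and a maximum over the grid. It compares a 1-Lipschitz function (sin) with a function that is not 1-Lipschitz (3x). All names are hypothetical.

```python
import math

def c_transform(phi_vals, grid, cost):
    """psi(y) = min over grid points x of phi(x) + c(x, y)."""
    return [min(f + cost(x, y) for x, f in zip(grid, phi_vals)) for y in grid]

def conjugate_back(psi_vals, grid, cost):
    """phi(x) = max over grid points y of psi(y) - c(x, y)."""
    return [max(g - cost(x, y) for y, g in zip(grid, psi_vals)) for x in grid]

def distance(x, y):
    return abs(x - y)

grid = [0.1 * i for i in range(21)]

# Case 1: sin is 1-Lipschitz, so it should coincide with its c-transform
phi_lip = [math.sin(x) for x in grid]
psi_lip = c_transform(phi_lip, grid, distance)
gap_lip = max(abs(a - b) for a, b in zip(phi_lip, psi_lip))

# Case 2: 3x is not 1-Lipschitz; on this grid its c-transform is psi(y) = y
phi_steep = [3.0 * x for x in grid]
psi_steep = c_transform(phi_steep, grid, distance)
gap_steep = max(abs(g - y) for g, y in zip(psi_steep, grid))
phi_back = conjugate_back(psi_steep, grid, distance)

# The constraint psi(y) - phi(x) <= c(x, y) holds for every pair of grid points
feasible = all(
    psi_steep[j] - phi_steep[i] <= distance(grid[i], grid[j]) + 1e-12
    for i in range(len(grid)) for j in range(len(grid))
)
```

In the second case, applying the two transforms in turn gives back the function x rather than 3x, which shows how the c-transform produces a 1-Lipschitz potential.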
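Therefore, it becomes useful to write ψ in terms of ϕ. If we consider that the cost function is a distance and X = Y, Villani (2008) showed that:
$$
\begin{aligned}
(\ast) &= \inf_{\mathbb{F} \in \mathcal{F}(\mathbb{P}, \mathbb{Q})} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, \mathrm{d}\mathbb{F}(x, y) \\
&= \sup_{(\phi, \psi)} \left\{ \int_{\mathcal{Y}} \psi(y)\, \mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}} \phi(x)\, \mathrm{d}\mathbb{P}(x) : \psi(y) - \phi(x) \leq c(x, y) \right\} \\
&= \sup_{\phi} \left\{ \int_{\mathcal{Y}} \phi^{c}(y)\, \mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}} \phi(x)\, \mathrm{d}\mathbb{P}(x) \right\} \\
&= \sup_{\phi} \left\{ \int_{\mathcal{Y}} \phi(y)\, \mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}} \phi(x)\, \mathrm{d}\mathbb{P}(x) \right\} \\
&= \sup_{\phi} \left\{ \mathbb{E}\left[ \phi(Y) \mid Y \sim \mathbb{Q} \right] - \mathbb{E}\left[ \phi(X) \mid X \sim \mathbb{P} \right] \right\}
\end{aligned}
$$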
This is the semi-dual formulation of the problem also called the Kantorovich-Rubinstein
duality. This formulation is used to train the Wasserstein GAN that estimates the optimal
function ϕ.
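The inequality underlying this duality can be illustrated in one dimension. The sketch below assumes two equally weighted empirical samples of the same size, where Q is a hypothetical shift of P by 0.7, and uses the sorted matching to compute the 1-Wasserstein distance. It then evaluates E[ϕ(Y) | Y ∼ Q] − E[ϕ(X) | X ∼ P] for a few hand-picked 1-Lipschitz critics. These critics are chosen for illustration and are not the neural critic of a Wasserstein GAN.

```python
import math

def w1_sorted(xs, ys):
    """1-Wasserstein distance between equally weighted 1D samples of equal size."""
    if len(xs) != len(ys):
        raise ValueError("both samples must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def critic_gap(phi, xs, ys):
    """Empirical E[phi(Y) | Y ~ Q] - E[phi(X) | X ~ P]."""
    return sum(map(phi, ys)) / len(ys) - sum(map(phi, xs)) / len(xs)

# A few 1-Lipschitz candidate critics (assumed for illustration)
critics = {
    "identity": lambda x: x,
    "negated": lambda x: -x,
    "sine": math.sin,
    "tanh": math.tanh,
    "abs": abs,
}

xs = [-1.0, -0.2, 0.4, 1.5]        # sample from P
ys = [x + 0.7 for x in xs]         # sample from Q, a pure shift of P

w1 = w1_sorted(xs, ys)
gaps = {name: critic_gap(phi, xs, ys) for name, phi in critics.items()}
best_critic = max(gaps, key=gaps.get)
```

Every critic gives a value below the Wasserstein distance, and the identity function attains it in this shift example. A Wasserstein GAN searches for such a maximizing critic over a parametrized family.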
A.6.4 An example
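If P ∼ N(µ1, Σ1) and Q ∼ N(µ2, Σ2), Givens and Shortt (1984) showed that the 2-Wasserstein distance is equal to:
$$
W_2(\mathbb{P}, \mathbb{Q}) = \sqrt{ \left\| \mu_1 - \mu_2 \right\|^2 + \operatorname{tr}\left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \right)^{1/2} \right) } \tag{46}
$$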
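Equation (46) can be evaluated directly in the bivariate case. The sketch below is restricted to 2 × 2 covariance matrices and, as an assumption of this illustration rather than something taken from the paper, computes matrix square roots with the closed form √A = (A + √(det A) I) / √(tr A + 2 √(det A)), which holds for symmetric positive definite 2 × 2 matrices. All helper names are hypothetical.

```python
import math

def mat_mul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def sqrtm_2x2(a):
    """Principal square root of a symmetric positive definite 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    s = math.sqrt(max(det, 0.0))
    t = math.sqrt(a[0][0] + a[1][1] + 2.0 * s)
    return [[(a[0][0] + s) / t, a[0][1] / t],
            [a[1][0] / t, (a[1][1] + s) / t]]

def gaussian_w2(mu1, sigma1, mu2, sigma2):
    """2-Wasserstein distance between two bivariate Gaussians, Equation (46)."""
    root1 = sqrtm_2x2(sigma1)
    cross = sqrtm_2x2(mat_mul(mat_mul(root1, sigma2), root1))
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    trace_term = sum(sigma1[i][i] + sigma2[i][i] - 2.0 * cross[i][i]
                     for i in range(2))
    # Guard against tiny negative values caused by rounding
    return math.sqrt(max(mean_term + trace_term, 0.0))

# Diagonal example: the squared distance reduces to
# ||mu1 - mu2||^2 + sum_i (sigma_{1,i} - sigma_{2,i})^2 = 25 + 4 + 2.25
w2_diag = gaussian_w2([0.0, 0.0], [[1.0, 0.0], [0.0, 4.0]],
                      [3.0, 4.0], [[9.0, 0.0], [0.0, 0.25]])

# Identical correlated Gaussians: the distance should vanish
sigma = [[2.0, 0.5], [0.5, 1.0]]
w2_same = gaussian_w2([1.0, -1.0], sigma, [1.0, -1.0], sigma)
```

For higher dimensions, a general matrix square root routine from a numerical library would be needed in place of the 2 × 2 closed form.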
Algorithm (5) performs the inverse transformation. Similarly, in the case of n-dimensional data, we transform each group of 16 binary numbers into a real value and concatenate the resulting values to form an n-dimensional real-valued vector.
Chief Editors
Pascal BLANQUÉ
Chief Investment Officer
Philippe ITHURBIDE
Senior Economic Advisor
Working Paper
July 2020
Document issued by Amundi, “société par actions simplifiée”- SAS with a capital of €1,086,262,605 - Portfolio manager
regulated by the AMF under number GP04000036 – Head office: 90 boulevard Pasteur – 75015 Paris – France – 437 574
452 RCS Paris - www.amundi.com