
Working Paper 98-2020 | July 2020

Improving the Robustness of Trading Strategy Backtesting with Boltzmann Machines and Generative Adversarial Networks

Document for the exclusive attention of professional clients, investment services providers and any other professional of the financial industry
Improving the Robustness of Trading Strategy
Backtesting with Boltzmann Machines
and Generative Adversarial Networks

Edmond Lezmi
Quantitative Research
[email protected]

Jules Roche
Ecole Nationale des Ponts et Chaussées
[email protected]

Thierry Roncalli
Quantitative Research
[email protected]

Jiali Xu
Quantitative Research
[email protected]

Abstract

In this article, we explore generative models in order to build a market generator. The underlying idea is to simulate artificial multi-dimensional financial time series, whose statistical properties are the same as those observed in the financial markets. In particular, these synthetic data must preserve the first four statistical moments (mean, standard deviation, skewness and kurtosis), the stochastic dependence between the different dimensions (copula structure) and across time (autocorrelation function). The first part of the article reviews the most relevant generative models, which are restricted Boltzmann machines, generative adversarial networks and convolutional Wasserstein models. The second part of the article is dedicated to financial applications, by considering the simulation of multi-dimensional time series and estimating the probability distribution of backtest statistics. The final objective is to develop a framework for improving the risk management of quantitative investment strategies.

Keywords: Machine learning, generative approach, discriminative approach, restricted Boltzmann machine, generative adversarial network, Wasserstein distance, market generator, quantitative asset management, backtesting, trading strategy.

JEL classification: C53, G11


About the authors
Edmond Lezmi
Edmond Lezmi joined Amundi in 2002. He is currently
Head of Multi-Asset Quantitative Research. Prior to
that, he was Head of Quantitative Research at Amundi
Alternative Investments (2008-2012), a derivatives and
fund structurer at Amundi IS (2005-2008), and Head
of Market Risk (2002-2005). Before joining Amundi, he
was Head of Market Risk at Natixis, and an exotic FX
derivatives quantitative developer at Société Générale.
He started his working career with Thales in 1987 as a
research engineer in signal processing. He holds an MSc in
Stochastic processes from the University of Orsay.

Jules Roche
Jules Roche joined Amundi in July 2019 in the Quantitative
Research department as an intern. He worked on financial
applications of generative models such as GAN and
Boltzmann Machines. He is currently a graduate student
at Ecole Nationale des Ponts et Chaussées (ENPC) in Paris
where he studied applied mathematics and computer
science. He will be joining the Master of Finance at MIT
in September 2020.

Thierry Roncalli
Thierry Roncalli joined Amundi as Head of Quantitative
Research in November 2016. Prior to that, he was Head of
Research and Development at Lyxor Asset Management
(2009-2016), Head of Investment Products and Strategies
at SGAM AI, Société Générale (2005-2009), and Head
of Risk Analytics at the Operational Research Group
of Crédit Agricole SA (2004-2005). From 2001 to 2003,
he was also Member of the Industry Technical Working
Group on Operational Risk (ITWGOR). Thierry began
his professional career at Crédit Lyonnais in 1999 as a
financial engineer. Before that, Thierry was a researcher
at the University of Bordeaux and then a Research
Fellow at the Financial Econometrics Research Centre of
Cass Business School. During his five years of academic
career, he also served as a consultant on option pricing
models for different banks.
Since February 2017, he has been a member of the Scientific Advisory Board of the AMF, the French Securities & Financial Markets Regulator, while he was a member of the Group of Economic Advisers (GEA) of CEMA, the Committee for Economic and Market Analysis of ESMA, the European Securities and Markets Authority, from 2014 to 2018. Thierry is also Adjunct Professor
of Economics at the University of Evry, Department of
Economics. He holds a PhD in Economics from the University
of Bordeaux, France. He is the author of numerous academic
articles in scientific reviews and has published several
books on risk and asset management. His last two books
are “Introduction to Risk Parity and Budgeting” published
in 2013 by Chapman & Hall and translated into Chinese in
2016 by China Financial Publishing House (410 pages), and
“Handbook of Financial Risk Management” published in
2020 by Chapman & Hall (1142 pages).

Jiali Xu
Jiali XU joined Amundi in 2018 as a quantitative research
analyst within the Multi-Asset Quantitative Research team.
Prior to that, he was a quantitative analyst in the Risk
Analytics and Solutions team at Société Générale between
2014 and 2018. He graduated from Ecole des Ponts ParisTech and also holds a master's degree in Financial Mathematics from the University of Paris-Est Marne-la-Vallée.
Improving the Robustness of Backtesting with Generative Models

1 Introduction
In machine learning, we generally distinguish two types of statistical modeling (Jebara,
2004). The generative approach models the unconditional probability distribution given a
set of observable variables. The discriminative approach models the conditional probability
distribution given a set of observable variables and a target variable. In the first case,
generative models can be used to simulate samples that reproduce the statistical properties
of a training dataset. In the second case, discriminative models can be used to predict the
target variable for new examples. More specifically, generative models can be used to learn the underlying probability distributions over data manifolds. The objective of these models is to estimate the statistical and correlation structure of real data and to simulate synthetic data with a probability distribution that is close to the real one.
In finance, we generally observe only one sample path of market prices. For instance, if
we would like to build a trading strategy on the S&P 500 index, we can backtest the strategy
using the historical values of the S&P 500 index. We can then measure the performance and
the risk of this investment strategy by computing annualized return, volatility, Sharpe ratio,
maximum drawdown, etc. In this case, it is extremely difficult to assess the robustness of
the strategy, since we can consider that the historical sample of the S&P 500 index is one
realization of the unknown stochastic process. Therefore, portfolio managers generally split
the study period into two subperiods: the ‘in-sample’ period and the ‘out-of-sample’ period.
The objective is to calibrate the parameters of the trading strategy with one subperiod and
measure the financial performance with the other period in order to reduce the overfitting
bias. However, while the out-of-sample approach is appealing, it is limited for two main reasons. First, by splitting the study period into two subperiods, the calibration procedure is performed with fewer observations, and does not generally take into account the most recent period. Second, the validation step is done using only one sample path. Again, we observe
only one realization of the risk/return statistics. Of course, we could use different splitting methods, but we know that these bootstrap techniques are not well-adapted to time series and must be reserved for modeling random variables. For stochastic processes, statisticians prefer to consider Monte Carlo methods. Nevertheless, financial time series are difficult to model, because they exhibit non-linear autocorrelations, fat tails, heteroscedasticity, regime switching and non-stationary properties (Cont, 2001).
In this article, we are interested in generative models in order to obtain several training/validation sets. The underlying idea is then to generate artificial but realistic financial market prices. The choice of generative model and market data leads naturally to the concept of market generator introduced by Kondratyev and Schwarz (2019). If the market generator is able to replicate the probability distribution of the original market data, we can then backtest quantitative investment strategies on several financial markets. The backtesting procedure is then largely improved, since we obtain a probability distribution of performance and risk statistics, and not only one value. Therefore, we can reduce the in-sample property of the backtest procedure and the overfitting bias of the parameters that define the trading strategy. However, the challenge is not simple, since the market generator must be sufficiently robust and flexible in order to preserve not only the uni-dimensional statistical properties of the original financial time series, but also the multi-dimensional dependence structure.
This paper is organized as follows. Section Two reviews the most promising generative models that may be useful in finance. In particular, we focus on restricted Boltzmann machines, generative adversarial networks and Wasserstein distance models. In Section Three, we apply these models in the context of trading strategies. We first consider the joint simulation of the S&P 500 and VIX indices. We then build an equity/bond risk parity


strategy with an investment universe of six futures contracts. Finally, Section Four offers
some concluding remarks.

2 Generative models
In this section, we consider two main approaches. The first one is based on a restricted Boltzmann machine (RBM), while the second one uses a conditional generative adversarial network (GAN). Both are stochastic artificial neural networks that learn the probability distribution of a real data sample. However, the objective functions strongly differ. Indeed, RBMs consider a log-likelihood maximization problem, while the framework of GANs corresponds to a minimax two-player game. In the first case, the difficulty lies in the gradient approximation of the log-likelihood function. In the second case, the hard task is to find a learning algorithm that solves the two-player game. This section also presents an extension of generative adversarial networks, which is called a convolutional Wasserstein model.

2.1 Restricted Boltzmann machines


Restricted Boltzmann machines were initially invented under the name 'Harmonium' by Smolensky (1986). Under the framework of undirected graph models¹, an RBM is a Markov random field (MRF) associated with a bipartite undirected graph². RBMs are made of only two layers, as shown in Figure 1. We distinguish visible units belonging to the visible layer from hidden units belonging to the hidden layer. Each unit of the visible and hidden layers is respectively associated with visible and hidden random variables. The term 'restricted' comes from the fact that connections exist only between the visible units and the hidden units, and that there is no connection between two different units in the same layer.

Figure 1: The schema of an RBM with m visible units and n hidden units
[Bipartite graph: hidden layer $H_1, H_2, \ldots, H_n$ fully connected to visible layer $V_1, V_2, \ldots, V_m$]

We would like to model the distribution of $m$ visible variables $V = (V_1, V_2, \ldots, V_m)$ representing the observable data, whose elements $V_i$ are highly dependent. A first way to directly model these dependencies is to introduce a Markov chain or a Bayesian network. In this case, networks are no longer restricted and those methods are computationally expensive, particularly when $V$ is a high-dimensional vector. The RBM approach consists in introducing hidden variables $H = (H_1, H_2, \ldots, H_n)$ as latent variables, which will indirectly capture the dependencies. Therefore, the hidden layer can be considered as an alternative representation of the visible layer.
1 Fundamental concepts of undirected graph models are explained in Appendix A.1 on page 53.
2 A bipartite graph is a graph whose nodes can be divided into two disjoint and independent sets U and V such that every edge connects a node in U to one in V.


2.1.1 Bernoulli RBMs


Definition A Bernoulli RBM is the standard type of RBM and has binary-valued visible and hidden variables. Let us denote by $v = (v_1, v_2, \ldots, v_m)$ and $h = (h_1, h_2, \ldots, h_n)$ the configurations of visible variables $V = (V_1, V_2, \ldots, V_m)$ and hidden variables $H = (H_1, H_2, \ldots, H_n)$, where $v_i$ and $h_j$ are the binary states of the $i$th visible variable $V_i$ and the $j$th hidden variable $H_j$ such that $(v, h) \in \{0, 1\}^{m+n}$. The joint probability distribution of a Bernoulli RBM is given by the Boltzmann distribution:

$$P(v, h) = \frac{1}{Z} e^{-E(v,h)}$$

where the energy function $E(v, h)$ is defined as:

$$E(v, h) = -\sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} w_{i,j} v_i h_j = -a^\top v - b^\top h - v^\top W h \quad (1)$$

where $a = (a_i)$ and $b = (b_j)$ are the two vectors of bias terms associated with $V$ and $H$, and $W = (w_{i,j})$ is the matrix of weights associated with the edges between $V$ and $H$. The normalizing constant $Z$ is the partition function and ensures that the overall distribution sums to one³. It follows that the marginal probability distributions of the visible and hidden unit states are:

$$P(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}$$

and:

$$P(h) = \frac{1}{Z} \sum_{v} e^{-E(v,h)}$$

The underlying idea of a Bernoulli RBM is to learn the unconditional probability distribution $P(v)$ of the observable data.
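For intuition, the Boltzmann distribution above can be evaluated exactly on a toy RBM that is small enough for the partition function $Z$ to be computed by brute force. The following numpy sketch uses illustrative dimensions and random parameter values (they are not taken from the paper):

```python
import itertools
import numpy as np

# Toy Bernoulli RBM with m = 3 visible and n = 2 hidden units
rng = np.random.default_rng(0)
m, n = 3, 2
a = rng.normal(0, 0.1, size=m)       # visible biases
b = rng.normal(0, 0.1, size=n)       # hidden biases
W = rng.normal(0, 0.1, size=(m, n))  # weights on the bipartite edges

def energy(v, h):
    # E(v, h) = -a'v - b'h - v'Wh, Equation (1)
    return -(a @ v) - (b @ h) - v @ W @ h

# Brute-force partition function: sum over all 2^(m+n) configurations
configs = [(np.array(v), np.array(h))
           for v in itertools.product([0, 1], repeat=m)
           for h in itertools.product([0, 1], repeat=n)]
Z = sum(np.exp(-energy(v, h)) for v, h in configs)

def p_joint(v, h):
    # Boltzmann distribution P(v, h) = exp(-E(v, h)) / Z
    return np.exp(-energy(v, h)) / Z

def p_visible(v):
    # Marginal P(v) obtained by summing the joint over all hidden states
    return sum(p_joint(v, np.array(h))
               for h in itertools.product([0, 1], repeat=n))

total = sum(p_joint(v, h) for v, h in configs)  # should equal 1
```

For realistic dimensions this enumeration is of course infeasible, which is precisely why the conditional factorizations and sampling schemes discussed next are needed.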

Conditional Distributions According to Long and Servedio (2010), the partition function $Z$ is intractable in the case of Bernoulli RBMs, since its computation requires summing $2^{m+n}$ elements. Therefore, the probability distribution $P(v)$ is also intractable when $m + n$ increases. However, we can take advantage of the bipartite graph structure of the RBM. Indeed, there are no connections between two different units in the same layer. The probabilities $P(v_i \mid h)$ and $P(h_j \mid v)$ are then independent for all $i$ and $j$. It follows that:

$$P(v \mid h) = P(v_1, v_2, \ldots, v_m \mid h) = \prod_{i=1}^{m} P(v_i \mid h)$$

and:

$$P(h \mid v) = P(h_1, h_2, \ldots, h_n \mid v) = \prod_{j=1}^{n} P(h_j \mid v)$$

With these properties, we can find some useful results that help when computing the gradient of the log-likelihood function on page 10. For instance, we can show that⁴:

$$\sum_{h} P(h \mid v)\, h_j = P(h_j = 1 \mid v)$$

3 We have $Z = \sum_{v,h} e^{-E(v,h)}$.
4 See Appendix A.2.1 on page 56.


A neural network perspective of RBMs In Appendix A.2.2 on page 56, we show that:

$$P(v_i = 1 \mid h) = \sigma\left(a_i + \sum_{j=1}^{n} w_{i,j} h_j\right) \quad (2)$$

and:

$$P(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i=1}^{m} w_{i,j} v_i\right) \quad (3)$$

where $\sigma(x)$ is the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Thus, a Bernoulli RBM can be considered as a stochastic artificial neural network, meaning that the nodes and edges correspond to neurons and synaptic connections. For a given vector $v$, $h$ is obtained as follows:

$$h_j = f\left(\sum_{i=1}^{m} w_{i,j} v_i + b_j\right)$$

where $f = \varphi \circ \sigma$, $\varphi : x \in [0, 1] \mapsto X \sim \mathcal{B}(x)$ is a binarizer function and $\sigma$ is the sigmoid activation function. In addition, we may also go backward in the neural network as follows:

$$v_i = f\left(\sum_{j=1}^{n} w_{i,j} h_j + a_i\right)$$

According to Fischer and Igel (2014), "an RBM can [then] be reinterpreted as a standard feed-forward neural network with one layer of nonlinear processing units".
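The stochastic forward and backward passes defined by Equations (2) and (3) can be sketched in a few lines of numpy. The dimensions and parameter values below are illustrative, not calibrated:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative RBM dimensions and parameters
m, n = 5, 3
a = np.zeros(m)                      # visible biases
b = np.zeros(n)                      # hidden biases
W = rng.normal(0, 0.1, size=(m, n))  # weights

def forward(v):
    """Sample h given v: h_j ~ B(sigmoid(b_j + sum_i w_ij v_i)), Equation (3)."""
    p_h = sigmoid(b + v @ W)
    return (rng.random(n) < p_h).astype(int), p_h

def backward(h):
    """Sample v given h: v_i ~ B(sigmoid(a_i + sum_j w_ij h_j)), Equation (2)."""
    p_v = sigmoid(a + W @ h)
    return (rng.random(m) < p_v).astype(int), p_v

v0 = rng.integers(0, 2, size=m)  # random binary visible state
h0, p_h0 = forward(v0)
v1, p_v1 = backward(h0)
```

The comparison with a Bernoulli draw implements the binarizer function $\varphi$, which turns the deterministic sigmoid activations into stochastic binary unit states.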

Training process Let $\theta = (a, b, W)$ be the set of parameters to estimate. The objective is to find a value of $\theta$ such that $P_\theta(v) \approx P_{\mathrm{data}}(v)$. Since the log-likelihood function $\ell(\theta \mid v)$ of the input vector $v$ is defined as $\ell(\theta \mid v) = \log P_\theta(v)$, the Bernoulli RBM model is trained in order to maximize the log-likelihood function of a training set of $N$ samples $\{v^{(1)}, \ldots, v^{(N)}\}$:

$$\theta^\star = \arg\max_\theta \sum_{s=1}^{N} \ell\left(\theta \mid v^{(s)}\right) \quad (4)$$

where:

$$\ell(\theta \mid v) = \log\left(\sum_{h} \frac{e^{-E(v,h)}}{Z}\right) = \log\left(\sum_{h} e^{-E(v,h)}\right) - \log\left(\sum_{v',h} e^{-E(v',h)}\right) \quad (5)$$

Hinton (2002) proposed to use a gradient ascent method with the following update rule between iteration steps $t$ and $t+1$:

$$\theta^{(t+1)} = \theta^{(t)} + \eta^{(t)} \frac{\partial}{\partial \theta}\left(\sum_{s=1}^{N} \ell\left(\theta^{(t)} \mid v^{(s)}\right)\right) = \theta^{(t)} + \eta^{(t)} \Delta\theta^{(t)}$$

where $\eta^{(t)}$ is the learning rate parameter, $\Delta\theta^{(t)} = \sum_{s=1}^{N} \nabla_{\theta^{(t)}}\left(v^{(s)}\right)$ and $\nabla_\theta(v)$ is the gradient vector given in Appendix A.2.3 on page 57.


Gibbs sampling Ackley et al. (1985) and Hinton and Sejnowski (1986) showed that the
expectation over P (v) can be approximated by Gibbs sampling, which belongs to the family
of MCMC algorithms. The goal of Gibbs sampling is to simulate correlated random variables
by using a Markov chain. Usually, we initialize the Gibbs sampling with a random vector and
the algorithm updates one variable iteratively, based on its conditional distribution given
the state of the remaining variables. After a sufficiently large number of sampling steps, we obtain unbiased samples from the joint probability distribution. Formally, Gibbs sampling of the joint probability distribution of $n$ random variables $X = (X_1, X_2, \ldots, X_n)$ consists in sampling $x_i \sim P(X_i \mid X_{-i} = x_{-i})$ iteratively.

Algorithm 1 Gibbs sampling

initialization: $x^{(0)} = \left(x_1^{(0)}, x_2^{(0)}, \ldots, x_n^{(0)}\right)$
for step $k = 1, 2, 3, \ldots$ do
    Sample $x^{(k)}$ as follows:
        $x_1^{(k)} \sim P\left(X_1 \mid x_2^{(k-1)}, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\right)$
        $x_2^{(k)} \sim P\left(X_2 \mid x_1^{(k)}, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\right)$
        $\vdots$
        $x_i^{(k)} \sim P\left(X_i \mid x_1^{(k)}, \ldots, x_{i-1}^{(k)}, x_{i+1}^{(k-1)}, \ldots, x_n^{(k-1)}\right)$
        $\vdots$
        $x_n^{(k)} \sim P\left(X_n \mid x_1^{(k)}, x_2^{(k)}, \ldots, x_{n-1}^{(k)}\right)$
end for
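Algorithm 1 can be made concrete on a target whose conditionals are known in closed form. The sketch below assumes a bivariate standard normal target with correlation $\rho$, for which $X_1 \mid X_2 = x_2 \sim \mathcal{N}(\rho x_2, 1 - \rho^2)$ and symmetrically; the chain length and burn-in are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)
rho = 0.8  # correlation of the target bivariate standard normal

x1, x2 = 0.0, 0.0                 # arbitrary initialization x^(0)
burn_in, n_samples = 1_000, 20_000
draws = np.empty((n_samples, 2))
for k in range(burn_in + n_samples):
    # Sample each coordinate from its conditional given the current state
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()
    if k >= burn_in:
        draws[k - burn_in] = (x1, x2)

# After burn-in, the draws approximate the joint distribution
empirical_rho = np.corrcoef(draws.T)[0, 1]
```

Successive draws are autocorrelated, which is why a burn-in period is discarded and why, in the RBM context, shortcuts such as contrastive divergence are used instead of long chains.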

Let us consider a Bernoulli RBM as a Markov random field defined by the random variables $V = (V_1, V_2, \ldots, V_m)$ and $H = (H_1, H_2, \ldots, H_n)$. Since an RBM is a bipartite undirected graph, we can take advantage of the conditional independence properties between the variables in the same layer. At each step, we jointly sample the states of all variables in one layer, as shown in Figure 2. Thus, Gibbs sampling for an RBM consists in alternating⁵ between sampling a new state $h$ for all hidden units based on $P(h \mid v)$ and sampling a new state $v$ for all visible units based on $P(v \mid h)$. Let $v^{(k)} = \left(v_1^{(k)}, v_2^{(k)}, \ldots, v_m^{(k)}\right)$ and $h^{(k)} = \left(h_1^{(k)}, h_2^{(k)}, \ldots, h_n^{(k)}\right)$ denote the states of the visible layer and the hidden layer at time step $k$. For each unit, we rely on the fact that the conditional probabilities $P\left(v_i^{(k)} \mid h^{(k)}\right)$ and $P\left(h_j^{(k)} \mid v^{(k-1)}\right)$ are easily tractable according to Equations (2) and (3). We start by initializing the state of the visible units, and we can choose a binary random vector for the first time step. At time step $k$, here are the steps of the algorithm:

1. We go forward in the network by computing simultaneously for each hidden unit $j$ the probability $p\left(h_j^{(k)}\right) = P\left(h_j^{(k)} \mid v^{(k-1)}\right)$. The state of the $j$th hidden unit is then simulated according to the Bernoulli distribution $\mathcal{B}\left(p\left(h_j^{(k)}\right)\right)$.

2. We go backward in the network by computing simultaneously for each visible unit $i$

5 This method is also known as block Gibbs sampling.


   
the probability $p\left(v_i^{(k)}\right) = P\left(v_i^{(k)} \mid h^{(k)}\right)$. The state of the $i$th visible unit is then simulated according to the Bernoulli distribution $\mathcal{B}\left(p\left(v_i^{(k)}\right)\right)$.

Figure 2: The schema of block Gibbs sampling for an RBM
[Alternating chain: $h^{(1)} \sim P\left(h^{(1)} \mid v^{(0)}\right)$, $v^{(1)} \sim P\left(v^{(1)} \mid h^{(1)}\right)$, $h^{(2)} \sim P\left(h^{(2)} \mid v^{(1)}\right)$, $v^{(2)} \sim P\left(v^{(2)} \mid h^{(2)}\right)$, $\ldots$]
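The two alternating steps above can be sketched as a short numpy routine. The RBM parameters and chain length below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Illustrative RBM parameters
m, n, k_steps = 4, 3, 50
a, b = np.zeros(m), np.zeros(n)
W = rng.normal(0, 0.5, size=(m, n))

def block_gibbs(v, k):
    """Run k alternations of block Gibbs sampling; return the visible-state chain."""
    chain = [v.copy()]
    for _ in range(k):
        # Step 1: sample all hidden units simultaneously from P(h | v)
        h = (rng.random(n) < sigmoid(b + v @ W)).astype(int)
        # Step 2: sample all visible units simultaneously from P(v | h)
        v = (rng.random(m) < sigmoid(a + W @ h)).astype(int)
        chain.append(v.copy())
    return np.array(chain)

v0 = rng.integers(0, 2, size=m)  # random binary initialization of the visible layer
chain = block_gibbs(v0, k_steps)
```

Because the units within a layer are conditionally independent, each layer is refreshed in a single vectorized draw, which is what makes block Gibbs sampling efficient for RBMs.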

Contrastive divergence algorithm As shown in the previous paragraph, it is possible to use Gibbs sampling to obtain samples from the joint distribution $P(v, h)$. However, the computational effort is still too large, since the sampling chain needs many sampling steps to obtain unbiased samples. To address this issue, Hinton (2002) initialized the Gibbs sampling with a sample from the real data instead of using a random vector, and suggested using only a few sampling steps to obtain a sample that produces a sufficiently good approximation of the log-likelihood gradient. This faster method is called the contrastive divergence algorithm.

Given a training set of $N$ samples $\{v^{(1)}, \ldots, v^{(N)}\}$, we obtain:

$$\frac{1}{N} \sum_{s=1}^{N} \ell\left(\theta \mid v^{(s)}\right) = \frac{1}{N} \sum_{s=1}^{N} \log P_\theta\left(v^{(s)}\right) = \mathbb{E}_{P_{\mathrm{data}}}\left[\log P_\theta(v)\right] \quad (6)$$

We note $P_{\mathrm{model}}(v) = P_\theta(v)$. Since $P_{\mathrm{data}}$ is independent of $\theta$, maximizing the log-likelihood function (5) is equivalent to minimizing the Kullback-Leibler divergence between $P_{\mathrm{data}}$ and $P_{\mathrm{model}}$:

$$\begin{aligned}
\theta^\star &= \arg\max_\theta \mathbb{E}_{P_{\mathrm{data}}}\left[\log P_\theta(v)\right] - \mathbb{E}_{P_{\mathrm{data}}}\left[\log P_{\mathrm{data}}(v)\right] \\
&= \arg\max_\theta \sum_{v} P_{\mathrm{data}}(v) \log P_{\mathrm{model}}(v) - \sum_{v} P_{\mathrm{data}}(v) \log P_{\mathrm{data}}(v) \\
&= \arg\max_\theta \sum_{v} P_{\mathrm{data}}(v) \log\left(\frac{P_{\mathrm{model}}(v)}{P_{\mathrm{data}}(v)}\right) \\
&= \arg\max_\theta - \mathrm{KL}\left(P_{\mathrm{data}} \,\|\, P_{\mathrm{model}}\right) \\
&= \arg\min_\theta \mathrm{KL}\left(P_{\mathrm{data}} \,\|\, P_{\mathrm{model}}\right)
\end{aligned} \quad (7)$$
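The equivalence in Equation (7) between maximizing the expected log-likelihood and minimizing the KL divergence can be checked numerically on a toy discrete distribution. The probability vectors below are illustrative:

```python
import numpy as np

# Toy discrete distribution over three states
p_data = np.array([0.5, 0.3, 0.2])

def expected_loglik(p_model):
    # E_{P_data}[log P_model(v)]
    return np.sum(p_data * np.log(p_model))

def kl(p, q):
    # KL(p || q) = sum_v p(v) log(p(v) / q(v))
    return np.sum(p * np.log(p / q))

# A few candidate models, one of which equals p_data
candidates = [np.array([0.6, 0.3, 0.1]),
              np.array([0.5, 0.3, 0.2]),
              np.array([0.2, 0.4, 0.4])]

best_by_loglik = max(candidates, key=expected_loglik)
best_by_kl = min(candidates, key=lambda q: kl(p_data, q))
```

Both criteria select the same candidate, and the winner is the one equal to $P_{\mathrm{data}}$, for which the KL divergence vanishes.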

In contrastive divergence algorithms, we denote the distribution of starting values by $P^{(0)}$, which is also the distribution of the real data in the training set: $P^{(0)} = P_{\mathrm{data}}$. Let $P^{(k)}$ and $P^{(\infty)}$ be the distribution after running $k$ steps of Gibbs sampling and the equilibrium distribution. Compared to $P^{(0)}$, $P^{(k)}$ is $k$ steps closer to the equilibrium distribution $P^{(\infty)}$, so the divergence measure $\mathrm{KL}\left(P^{(0)} \,\|\, P^{(\infty)}\right)$ should be greater than or equal to $\mathrm{KL}\left(P^{(k)} \,\|\, P^{(\infty)}\right)$. In particular, when the model is well-trained, $P^{(0)}$, $P^{(k)}$ and $P^{(\infty)}$


should have the same distribution as $P_{\mathrm{data}}$. In this case, $\mathrm{KL}\left(P^{(0)} \,\|\, P^{(\infty)}\right)$ is equal to 0 and the difference $\mathrm{KL}\left(P^{(0)} \,\|\, P^{(\infty)}\right) - \mathrm{KL}\left(P^{(k)} \,\|\, P^{(\infty)}\right)$ is also equal to 0. Thus, according to Hinton (2002), we can find the optimal parameter $\theta^\star$ by minimizing the difference $\mathrm{KL}\left(P^{(0)} \,\|\, P^{(\infty)}\right) - \mathrm{KL}\left(P^{(k)} \,\|\, P^{(\infty)}\right)$ instead of directly minimizing $\mathrm{KL}\left(P^{(0)} \,\|\, P^{(\infty)}\right)$ in Equation (7). This quantity is called a contrastive divergence, since we compare two KL divergence measures. Therefore, we can rewrite the objective function in Equation (7) as follows:

$$\theta^\star = \arg\min_\theta \mathrm{CD}^{(k)} = \arg\min_\theta \mathrm{KL}\left(P^{(0)} \,\|\, P^{(\infty)}\right) - \mathrm{KL}\left(P^{(k)} \,\|\, P^{(\infty)}\right) \quad (8)$$

Again, we can use the gradient descent method to find the minimum of the objective function. In this case, the update rule between iteration steps $t$ and $t+1$ is:

$$\theta^{(t+1)} = \theta^{(t)} - \eta^{(t)} \frac{\partial\, \mathrm{CD}^{(k)}\left(\theta^{(t)}\right)}{\partial\, \theta}$$

where the gradient vector $\partial_\theta \mathrm{CD}^{(k)}\left(\theta^{(t)}\right)$ is given in Appendix A.2.4 on page 59. Thus, Algorithm 2 summarizes the $k$-step contrastive divergence algorithm for calibrating $\theta^\star$.

2.1.2 Gaussian-Bernoulli RBMs


A Bernoulli RBM is limited to modelling the probability distribution $P(v)$ where $v$ is a binary vector. To address this issue, given an RBM with $m$ visible units and $n$ hidden units, we can associate a normally distributed variable to each visible unit and a binary variable to each hidden unit. This type of RBM is called a Gaussian-Bernoulli RBM. We are free to choose the expression of the energy function as long as it satisfies the Hammersley-Clifford theorem⁶ and its partition function is well defined. For instance, Cho et al. (2011) defined the energy function of this RBM as follows:

$$E_g(v, h) = \sum_{i=1}^{m} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} w_{i,j} \frac{v_i h_j}{\sigma_i^2}$$

where $a_i$ and $b_j$ are bias terms associated with visible variables $V_i$ and hidden variables $H_j$, $w_{i,j}$ is the weight associated with the edge between $V_i$ and $H_j$, and $\sigma_i$ is a new parameter associated with $V_i$. Following the same calculus as in Equation (3), we can show that the conditional probability is equal to:

$$P(h_j = 1 \mid v) = \mathrm{sigmoid}\left(b_j + \sum_{i=1}^{m} w_{i,j} \frac{v_i}{\sigma_i^2}\right) \quad (9)$$

Moreover, according to Krizhevsky (2009), the conditional distribution of $V_i$ given $h$ is Gaussian and we obtain:

$$V_i \mid h \sim \mathcal{N}\left(a_i + \sum_{j=1}^{n} w_{i,j} h_j,\; \sigma_i^2\right) \quad (10)$$

As we have previously seen, we can maximize the log-likelihood function in order to train a Gaussian-Bernoulli RBM with $m$ visible units and $n$ hidden units. As all visible units are

6 See Appendix A.1.3 on page 53.


Algorithm 2 $k$-step contrastive divergence algorithm

input: $\{v^{(1)}, \ldots, v^{(N)}\}$
initialization: We note $\Delta a_i = \partial_{a_i} \mathrm{CD}^{(k)}$, $\Delta b_j = \partial_{b_j} \mathrm{CD}^{(k)}$ and $\Delta w_{i,j} = \partial_{w_{i,j}} \mathrm{CD}^{(k)}$, and we set $\Delta a_i = \Delta b_j = \Delta w_{i,j} = 0$
for $v$ from $v^{(1)}$ to $v^{(N)}$ do
    $v^{(0)} \leftarrow v$
    {Gibbs sampling}
    for $t = 1$ to $k$ do
        for $j = 1$ to $n$ do
            compute $p\left(h_j^{(t)}\right) = P\left(h_j^{(t)} \mid v^{(t-1)}\right)$
            sample the $j$th hidden unit state: $h_j^{(t)} \sim \mathcal{B}\left(p\left(h_j^{(t)}\right)\right)$
        end for
        for $i = 1$ to $m$ do
            compute $p\left(v_i^{(t)}\right) = P\left(v_i^{(t)} \mid h^{(t)}\right)$
            sample the $i$th visible unit state: $v_i^{(t)} \sim \mathcal{B}\left(p\left(v_i^{(t)}\right)\right)$
        end for
    end for
    {Gradient approximation}
    for $i = 1$ to $m$ do
        $\Delta a_i \leftarrow \Delta a_i + \frac{1}{N}\left(v_i^{(k)} - v_i^{(0)}\right)$
        for $j = 1$ to $n$ do
            $\Delta b_j \leftarrow \Delta b_j + \frac{1}{N}\left(P\left(h_j = 1 \mid v^{(k)}\right) - P\left(h_j = 1 \mid v^{(0)}\right)\right)$
            $\Delta w_{i,j} \leftarrow \Delta w_{i,j} + \frac{1}{N}\left(P\left(h_j = 1 \mid v^{(k)}\right) \cdot v_i^{(k)} - P\left(h_j = 1 \mid v^{(0)}\right) \cdot v_i^{(0)}\right)$
        end for
    end for
end for
output: gradient estimation $\Delta a_i$, $\Delta b_j$ and $\Delta w_{i,j}$
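The gradient estimates of Algorithm 2 can be sketched in numpy. The example below runs a single CD-1 pass over a toy binary dataset with illustrative parameters and random data, followed by one descent update as in the text:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

m, n, N, k = 4, 3, 10, 1                # layer sizes, sample count, CD steps
a, b = np.zeros(m), np.zeros(n)
W = rng.normal(0, 0.1, size=(m, n))
data = rng.integers(0, 2, size=(N, m))  # toy binary training samples

d_a, d_b, d_W = np.zeros(m), np.zeros(n), np.zeros((m, n))
for v in data:
    v0 = v.astype(float)
    vk = v0
    for _ in range(k):                  # k-step Gibbs chain started at the data
        h = (rng.random(n) < sigmoid(b + vk @ W)).astype(float)
        vk = (rng.random(m) < sigmoid(a + W @ h)).astype(float)
    p_h0 = sigmoid(b + v0 @ W)          # P(h_j = 1 | v^(0))
    p_hk = sigmoid(b + vk @ W)          # P(h_j = 1 | v^(k))
    # Accumulate the CD gradient estimates of Algorithm 2
    d_a += (vk - v0) / N
    d_b += (p_hk - p_h0) / N
    d_W += (np.outer(vk, p_hk) - np.outer(v0, p_h0)) / N

eta = 0.1                               # learning rate
# Descent step on CD^(k): theta <- theta - eta * gradient
a, b, W = a - eta * d_a, b - eta * d_b, W - eta * d_W
```

Starting the chain at a data sample rather than a random vector is what distinguishes contrastive divergence from plain Gibbs sampling, and it is why one or a few steps usually suffice in practice.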

associated with a continuous probability distribution, the log-likelihood function $\ell(\theta \mid v)$ of an input vector $v = (v_1, v_2, \ldots, v_m)$ is equal to $\log p_\theta(v)$, where $p_\theta(v)$ is the probability density function of $V = (V_1, V_2, \ldots, V_m)$. In Appendix A.2.5 on page 60, we give the expression of the gradient vector. Therefore, there is no difficulty in using the gradient ascent method or the $k$-step contrastive divergence algorithm to train a Gaussian-Bernoulli RBM.

Remark 1. If we normalize the data of the training set using a z-score function, we can set the standard deviation $\sigma_i$ to 1 during the training process. This reduces the number of parameters and accelerates the convergence of the training process.
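The conditional distributions (9) and (10) of a Gaussian-Bernoulli RBM can be sampled directly. The sketch below sets $\sigma_i = 1$ as in Remark 1 and uses illustrative random parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

m, n = 4, 2
a, b = np.zeros(m), np.zeros(n)
W = rng.normal(0, 0.2, size=(m, n))
sigma = np.ones(m)  # sigma_i = 1 after z-score normalization (Remark 1)

def sample_h_given_v(v):
    # Equation (9): P(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i / sigma_i^2)
    p = sigmoid(b + (v / sigma**2) @ W)
    return (rng.random(n) < p).astype(float)

def sample_v_given_h(h):
    # Equation (10): V_i | h ~ N(a_i + sum_j w_ij h_j, sigma_i^2)
    mean = a + W @ h
    return mean + sigma * rng.standard_normal(m)

v = rng.standard_normal(m)  # continuous visible state (e.g. normalized returns)
h = sample_h_given_v(v)
v_new = sample_v_given_h(h)
```

The hidden layer remains binary while the visible layer becomes Gaussian, which is what allows this variant to model continuous inputs such as normalized financial returns.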

2.1.3 Conditional RBM structure


Traditional Bernoulli and Gaussian-Bernoulli RBMs can only model the static dependence of variables. However, in the practice of financial data modeling, we also want to capture the temporal dependencies between variables. To address this issue, Taylor et al. (2011) introduced a conditional RBM structure by adding a new layer to the Gaussian-Bernoulli RBM. Thus, this conditional RBM is made of three parts, as shown in Figure 3:

1. a hidden layer with $n$ binary units;


2. a visible layer with $m$ units;

3. a conditional layer with $d$ conditional meta-units $\{C_1, \ldots, C_d\}$. Since we model the temporal structure, we note the observation at time $t$ as $v_t = (v_{t,1}, v_{t,2}, \ldots, v_{t,m})$. The conditional layer is fed with the $d$ past values $(v_{t-1}, \ldots, v_{t-d})$, which are concatenated into a $(d \times m)$-dimensional vector⁷ noted $c_t$. The conditional layer is fully linked to the visible and hidden layers with directed connections. Let us denote by $Q$ the weight matrix connecting the conditional layer to the visible layer and by $P$ the weight matrix connecting the conditional layer to the hidden layer.
Figure 3: The schema of a conditional RBM with m visible layer units, n hidden layer units and d × m conditional layer units
[Three layers: hidden units $H_1, \ldots, H_n$; conditional meta-units $C_1, \ldots, C_d$ with directed connections to both other layers; visible units $V_1, \ldots, V_m$]

A conditional RBM contains both undirected and directed connections in the graph. Thus, it cannot be defined as an MRF or a Bayesian network. However, conditionally on $c_t$, we can consider both visible and hidden layers as an undirected graph, and we can compute $P(v_t, h_t \mid c_t)$ instead of computing $P(v_t, h_t, c_t)$ in order to still take advantage of undirected graph properties. Thus, according to the Hammersley-Clifford theorem, the conditional probability distribution has the following form:

$$P(v_t, h_t \mid c_t) = \frac{1}{Z} e^{-\tilde{E}_g(v_t, h_t, c_t)} = \frac{1}{Z} e^{-\sum_{i=1}^{m} \tilde{E}_i(v_{t,i}, c_t) - \sum_{j=1}^{n} \tilde{E}_j(h_{t,j}, c_t) - \sum_{i=1}^{m} \sum_{j=1}^{n} \tilde{E}_{i,j}(v_{t,i}, h_{t,j}, c_t)}$$

7 A conditional meta-unit is then composed of $m$ values: $C_k = (V_{t-k,1}, \ldots, V_{t-k,m})$.


where $Z$ is the partition function. Taylor et al. (2011) proposed a form of energy function that is an extension of the Gaussian-Bernoulli RBM energy:

$$\tilde{E}_i(v_{t,i}, c_t) = \frac{(v_{t,i} - \tilde{a}_{t,i})^2}{2\sigma_i^2}$$

$$\tilde{E}_j(h_{t,j}, c_t) = -\tilde{b}_{t,j} h_{t,j}$$

$$\tilde{E}_{i,j}(v_{t,i}, h_{t,j}, c_t) = -w_{i,j} \frac{v_{t,i} h_{t,j}}{\sigma_i^2}$$

where $\tilde{a}_t = a + c_t Q^\top$ and $\tilde{b}_t = b + c_t P^\top$ are dynamic bias terms with respect to $a$ and $b$. Thus, the energy function $\tilde{E}_g$ corresponds to a Gaussian-Bernoulli RBM energy function $E_g$ in which the constant biases $a$ and $b$ are replaced by the dynamic biases $\tilde{a}_t$ and $\tilde{b}_t$. By updating these terms in Equations (9) and (10), we obtain:

$$P(h_j = 1 \mid v_t, c_t) = \mathrm{sigmoid}\left(\tilde{b}_{t,j} + \sum_{i=1}^{m} w_{i,j} \frac{v_{t,i}}{\sigma_i^2}\right) \quad (11)$$

and:

$$V_{t,i} \mid h_t, c_t \sim \mathcal{N}\left(\tilde{a}_{t,i} + \sum_{j=1}^{n} w_{i,j} h_{t,j},\; \sigma_i^2\right) \quad (12)$$

Again, the partition function $Z$ is still intractable:

$$Z = \sum_{h_t} \int_{v_t} e^{-\tilde{E}_g(v_t, h_t, c_t)}\, \mathrm{d}v_t$$

Remark 2. As for Gaussian-Bernoulli RBMs, we can use the $k$-step contrastive divergence algorithm for the training process, with some modifications of the gradient vector⁸.
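The dynamic biases $\tilde{a}_t = a + c_t Q^\top$ and $\tilde{b}_t = b + c_t P^\top$ can be computed directly from the lagged observations. The sketch below assumes $Q$ of shape $m \times (d \cdot m)$ and $P$ of shape $n \times (d \cdot m)$ so that the products conform; all dimensions and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(11)

m, n, d = 3, 2, 4                        # visible units, hidden units, number of lags
a, b = np.zeros(m), np.zeros(n)
Q = rng.normal(0, 0.1, size=(m, d * m))  # conditional-to-visible weights (assumed shape)
P = rng.normal(0, 0.1, size=(n, d * m))  # conditional-to-hidden weights (assumed shape)

# The d past observations v_{t-1}, ..., v_{t-d}, concatenated into c_t
past = rng.standard_normal((d, m))
c_t = past.reshape(d * m)

# Dynamic biases replace the constant biases of the Gaussian-Bernoulli RBM
a_tilde = a + Q @ c_t  # row-vector form: a + c_t Q'
b_tilde = b + P @ c_t  # row-vector form: b + c_t P'
```

At each time step the same $Q$ and $P$ are applied to the sliding window of lags, so the conditional RBM keeps the Gaussian-Bernoulli sampling machinery while letting the biases depend on recent history.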

2.2 Generative adversarial networks


Generative models are an important part of machine learning algorithms that learn the underlying probability distributions of the real data sample. In other words, given a finite data sample with a distribution $P_{\mathrm{data}}(x)$, can we build a model such that $P_{\mathrm{model}}(x; \theta) \approx P_{\mathrm{data}}(x)$? The goal is to learn to sample a complex distribution given a real sample. Generative adversarial networks (GANs) belong to the class of generative models and move away from the classical likelihood maximization approach, whose objective is to estimate the parameter $\theta$. GAN models enable the estimation of the high-dimensional underlying statistical structure of real data and the simulation of synthetic data, whose probability distribution is close to the real one. To assess the difference between real and simulated data, GANs are trained using a discrepancy measure. Two widely used classes of discrepancy measures are information-theoretic divergences and integral probability metrics. The choice of the GAN objective function associated with the selected discrepancy measure explains the multitude of GAN models appearing in the machine learning literature. However, the different GAN models share a common framework. Indeed, Appendix A.3 on page 63 shows how the original formulation of GAN models is a particular case of the theory of $\varphi$-divergence and probability functional descent. Moreover, the link between the influence function used in robust statistics and the formulation of the discriminator and generator is derived.
8 The calculations are given in Appendix A.2.6 on page 61.


According to Goodfellow et al. (2014), “a generative adversarial process trains two mod-
els: a generative model G that captures the data distribution, and a discriminative model
D that estimates the probability that a sample came from the training data rather than G.
The training procedure for G is to maximize the probability of D making a mistake. This
framework corresponds to a minimax two-player game. In the space of arbitrary functions
G and D, a unique solution exists, with G recovering the training data distribution and D
equal to 1/2 everywhere”. Therefore, a generative adversarial model consists of two neural
networks. The first neural network simulates a new data sample, and is interpreted as a
data generation process. The second neural network is a classifier. The input data are both
the real and simulated samples. If the generative model has done a good job, the discrimi-
native model is unable to know whether an observation comes from the real dataset or the
simulated dataset.

2.2.1 Adversarial training problem


We use the framework of Goodfellow et al. (2014) and Wiese et al. (2020). Let Z be a
random noise space. z ∈ Z is sampled from the prior distribution Pnoise (z). The generative
model G is then specified as follows:

$$G : (\mathcal{Z}, \Theta_G) \longrightarrow \mathcal{X}, \qquad (z, \theta_g) \longmapsto x_0 = G(z; \theta_g) \quad (13)$$

where X denotes the data space and ΘG defines the parameter space, including the generator
weights and biases that will be optimized. The generator G is used to simulate data x0. The
discriminative model is defined as follows:

$$D : (\mathcal{X}, \Theta_D) \longrightarrow [0, 1], \qquad (x, \theta_d) \longmapsto p = D(x; \theta_d) \quad (14)$$

where x corresponds to the set of simulated data x0 and training (or real) data x1. In this
approach, the statistical model of X corresponds to:

$$P_{model}(x; \theta) \sim G(P_{noise}(z); \theta_g) \quad (15)$$

The probability that the observation comes from the real (or true) data is equal to p1 =
D (x1 ; θd ), whereas the probability that the observation is simulated is given by p0 =
D (x0 ; θd ). If the model (15) is wrong and does not reproduce the statistical properties
of the real data, the classifier has no difficulty in separating the simulated data from the
real data, and we obtain p0 ≈ 0 and p1 ≈ 1. Otherwise, if the model is valid, we must verify
that:

$$p_0 \approx p_1 \approx \frac{1}{2}$$

The main issue of GANs is the specification of the two functions G and D and the
estimation of the parameters θg and θd associated with G and D. For the first step, Goodfellow
et al. (2014) proposed to use two multi-layer neural networks G and D, whereas they consider
the following cost function for the second step:

$$C(\theta_g, \theta_d) = \mathbb{E}\left[\log D(x_1; \theta_d) \mid x_1 \sim P_{data}(x)\right] + \mathbb{E}\left[\log\left(1 - D(x_0; \theta_d)\right) \mid x_0 \sim G(z; \theta_g),\, z \sim P_{noise}(z)\right]$$

The optimization problem becomes:

$$\left\{\hat{\theta}_g, \hat{\theta}_d\right\} = \arg \min_{\theta_g \in \Theta_G} \max_{\theta_d \in \Theta_D} C(\theta_g, \theta_d)$$


In other words, the discriminator is trained in order to maximize the probability of cor-
rectly distinguishing historical samples from simulated samples. The objective is to obtain a good
classification model, since the maximum value of C (θg , θd ) with respect to θd is reached when:

$$D(x_1; \theta_d) = 1 \quad \text{and} \quad D(x_0; \theta_d) = 0$$

In the meantime, the generator is trained in order to minimize the probability that the
discriminator is able to perform a correct classification or, equivalently, to maximize the
probability of fooling the discriminator, since the minimum value of C (θg , θd ) with respect to θg
is reached when:

$$D(x_0; \theta_d) = 1 \quad \text{where} \quad x_0 \sim G(z; \theta_g)$$

Remark 3. In Appendix A.5 on page 72, we show that the cost function is related to the
binary cross-entropy measure or the opposite of the log-likelihood function of the logit model.
Moreover, the cost function can be interpreted as a φ-divergence measure as explained in
Appendix A.3.1 on page 63.

2.2.2 Solving the optimization problem


The minimax optimization problem is difficult to solve directly, because the gradient vector
∇θ C (θg , θd ) is not very informative if the discriminative model is poor. Therefore, the
traditional way to solve this problem is to use a two-stage approach:

1. In a first stage, the vector of parameters θg is considered to be constant, whereas the
   vector of parameters θd is unknown. This implies that the minimax problem reduces
   to a maximization step:

   $$\hat{\theta}_d^{(max)} = \arg \max_{\theta_d \in \Theta_D} C^{(max)}(\theta_d \mid \theta_g)$$

   where C^(max) (θd | θg) corresponds to the cost function by assuming that θg is given.

2. In a second stage, the vector of parameters θd is considered to be constant, whereas the
   vector of parameters θg is unknown. This implies that the minimax problem reduces
   to a minimization step:

   $$\hat{\theta}_g^{(min)} = \arg \min_{\theta_g \in \Theta_G} C^{(min)}(\theta_g \mid \theta_d)$$

   where C^(min) (θg | θd) corresponds to the cost function by assuming that θd is given.

The two-stage approach is repeated until convergence by setting θg to the value θ̂g^(min) calcu-
lated at the minimization step and θd to the value θ̂d^(max) calculated at the maximization step.
The cycle sequence is given in Figure 4. A drawback of this approach is the computational
time, since it implies solving two optimization problems at each iteration.
Therefore, Goodfellow et al. (2014) proposed another two-stage approach, which con-
verges more rapidly. The underlying idea is not to estimate the optimal generator at each
iteration. The objective is, rather, to improve the generative model at each iteration, such
that the new estimated model is better than the previous. The convergence is only needed
for the discriminator step (Goodfellow et al., 2014, Proposition 2). Moreover, the authors

18
Improving the Robustness of Backtesting with Generative Models

Figure 4: Two-stage minimax algorithm (cycle between the maximization step producing θ̂d^(max) and the minimization step producing θ̂g^(min))

applied a mini-batch sampling in order to reduce the computational time. It follows that
the cost functions become:

$$C^{(max)}(\theta_d \mid \theta_g) \approx \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x_1^{(i)}; \theta_d\right) + \log\left(1 - D\left(x_0^{(i)}; \theta_d\right)\right) \right]$$

and:

$$C^{(min)}(\theta_g \mid \theta_d) \approx c + \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}; \theta_g\right); \theta_d\right)\right)$$

where m is the size of the mini-batch sample and c is a constant that does not depend on
the generator parameters θg:

$$c = \frac{1}{m} \sum_{i=1}^{m} \log D\left(x_1^{(i)}; \theta_d\right)$$
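As a numerical illustration (not taken from the paper), the mini-batch cost can be computed directly from the discriminator outputs on real and simulated observations; at the equilibrium D(x) ≈ 1/2 everywhere, it equals −2 log 2:

```python
import numpy as np

def gan_cost(d_real, d_fake):
    """Mini-batch estimate of C(theta_g, theta_d):
    mean(log D(x1)) + mean(log(1 - D(x0)))."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A perfect discriminator (D = 1 on real data, D = 0 on fakes) maximizes the cost,
# while a perfectly fooled one (D = 1/2 everywhere) gives -2 log 2.
cost_at_equilibrium = gan_cost(np.full(100, 0.5), np.full(100, 0.5))
```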

In Algorithm 3, we describe the stochastic gradient optimization method proposed by
Goodfellow et al. (2014) when the learning rule corresponds to the steepest descent method,
but other learning rules can be used, such as the momentum method or the adaptive learning
method. In practice, these algorithms may not converge9. Another big issue, known as 'mode
collapse', concerns the diversity of generated samples. In this case, the simulated data drawn
by the generator exhibit small and limited differences, meaning that all generated samples are
almost identical. This is why researchers have proposed alternatives to Algorithm 3 in
order to solve these two problems (Mao et al., 2017; Karras et al., 2018; Metz et al., 2017).

2.2.3 Time series modeling with GANs


It is possible to generate fake time series from a single random noise vector z without
any labels. However, this implies the use of complex structures for both generator and
9 For instance, we may observe a cycle because the parameters oscillate.


Algorithm 3 Stochastic gradient optimization for GAN training

The goal is to compute the optimal parameters θ̂g and θ̂d
We note nmax the number of iterations to apply to the maximization problem (discriminator step) and nmin the number of iterations to apply to the minimization problem (generator step)
m is the size of the mini-batch sample
The starting values are denoted by θg^(0) and θd^(0)
for j = 1 to nmin do
    θd^(j,0) ← θd^(j−1)
    for k = 1 to nmax do
        Sample m random noise vectors z^(1), ..., z^(m) from the prior distribution Pnoise (z)
        Compute the simulated data x0^(1), ..., x0^(m):
            x0^(i) ∼ G(z^(i); θg^(j−1))
        Sample m examples x1^(1), ..., x1^(m) from the data distribution Pdata (x)
        Compute the gradient vector of the maximization problem with respect to the parameter θd:
            Δθd^(j,k) ← ∇θd ( (1/m) Σ_{i=1}^m [ log D(x1^(i); θd) + log(1 − D(x0^(i); θd)) ] ) evaluated at θd = θd^(j,k−1)
        Update the discriminator parameters θd using a backpropagation learning rule. For instance, in the case of the steepest descent method, we obtain:
            θd^(j,k) ← θd^(j,k−1) + ηd · Δθd^(j,k)
    end for
    θd^(j) ← θd^(j,nmax)
    Sample m random noise vectors z^(1), ..., z^(m) from the prior distribution Pnoise (z)
    Compute the gradient vector of the minimization problem with respect to the parameter θg:
        Δθg^(j) ← ∇θg ( (1/m) Σ_{i=1}^m log(1 − D(G(z^(i); θg); θd^(j))) ) evaluated at θg = θg^(j−1)
    Update the generator parameters θg using a backpropagation learning rule. For instance, in the case of the steepest descent method, we obtain:
        θg^(j) ← θg^(j−1) − ηg · Δθg^(j)
end for
return θ̂g ← θg^(nmin) and θ̂d ← θd^(nmin)

Source: Goodfellow et al. (2014, Algorithm 1).
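To make Algorithm 3 concrete, the following toy sketch trains a one-dimensional GAN with a logistic discriminator D(x) = σ(wx + c) and a location-only generator G(z) = z + b on Gaussian data. The gradients are written out analytically for this simple parameterization, and all numerical settings (sample sizes, learning rates, iteration counts) are illustrative choices, not the paper's settings:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_toy_gan(data_mean=2.0, n_min=400, n_max=5, m=128,
                  eta_d=0.1, eta_g=0.1, seed=0):
    """Alternating ascent/descent of Algorithm 3 on a location-only generator."""
    rng = np.random.default_rng(seed)
    w, c, b = 0.0, 0.0, 0.0                          # theta_d = (w, c), theta_g = b
    for _ in range(n_min):
        for _ in range(n_max):                       # discriminator (ascent) steps
            x1 = data_mean + rng.standard_normal(m)  # real mini-batch
            x0 = rng.standard_normal(m) + b          # simulated mini-batch G(z) = z + b
            d1, d0 = sigmoid(w * x1 + c), sigmoid(w * x0 + c)
            grad_w = np.mean((1 - d1) * x1) - np.mean(d0 * x0)
            grad_c = np.mean(1 - d1) - np.mean(d0)
            w, c = w + eta_d * grad_w, c + eta_d * grad_c
        z = rng.standard_normal(m)                   # generator (descent) step
        d_fake = sigmoid(w * (z + b) + c)
        grad_b = np.mean(-d_fake * w)                # d/db of mean log(1 - D(G(z)))
        b = b - eta_g * grad_b
    return w, c, b

w_hat, c_hat, b_hat = train_toy_gan()
```

With these settings the generator offset b drifts toward the data mean, after which the discriminator weight w decays toward zero, illustrating the equilibrium D ≈ 1/2 discussed above.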


discriminator. These structures can cause the models to be computationally expensive,
particularly when using convolutional networks. Moreover, such structures do not address the
issue of the lack of data. Let us remember that we only have one single historical
scenario. A less expensive alternative is to use labels combined with simple multi-layer
perceptrons for the generator and the discriminator. This approach has been introduced
by Mirza and Osindero (2014) and is called conditional generative adversarial networks
(cGANs).
Before training a GAN, the data need to be labelled. An additional vector that encodes
structural information about the real historical data must then be defined. It will help the GAN
to generate specific scenarios depending on the labels defined beforehand. Therefore, the learning
process will be supervised. The previous framework remains valid, but the generative and
discriminative models are written as G (z | v; θg) and D (x | v; θd), where v ∈ V is the label
vector. Therefore, the cost function becomes:

$$C(\theta_g, \theta_d) = \mathbb{E}\left[\log D(x_1 \mid v; \theta_d) \mid x_1 \sim P_{data}(x \mid v)\right] + \mathbb{E}\left[\log\left(1 - D(x_0 \mid v; \theta_d)\right) \mid x_0 \sim P_{model}(x \mid v)\right]$$

where Pmodel (x | v) = G (Pnoise (z) | v; θg).


The label vector may encode various types of information, and can be categorical or
continuous. For example, if we consider the S&P 500 index, we can specify V = {−1, 0, +1}
depending on its short-term trend. Let yt be the value of the S&P 500 index and mt the corre-
sponding 10-day moving average. vt is equal to −1 if the S&P 500 index exhibits a negative
trend, 0 if it has no trend, and +1 otherwise10. An example of continuous labels is a vector
composed of the last p values (Koshiyama et al., 2019). This type of label acts as a time
memory and helps to reproduce the autocorrelation patterns of the stochastic process.
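A sketch of such a categorical labelling (the window length and threshold value below are illustrative choices, not taken from the paper):

```python
import numpy as np

def trend_labels(y, window=10, threshold=0.5):
    """Labels v_t in {-1, 0, +1} from epsilon_t = y_t - m_t, where m_t is the
    `window`-day moving average and `threshold` is the no-trend cutoff."""
    m = np.convolve(y, np.ones(window) / window, mode="valid")  # moving average m_t
    eps = y[window - 1:] - m                                    # epsilon_t = y_t - m_t
    return np.where(eps > threshold, 1, np.where(eps < -threshold, -1, 0))

labels = trend_labels(np.arange(100.0))   # a steadily rising series is labelled +1
```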

2.3 Wasserstein GAN models


In Section 3, we will see that the basic GAN model, which uses the cross-entropy as the loss
function for the discriminator, suffers from three main problems. First, the visualization of
the training process is not obvious. Traditionally, we focus on the loss error curve to decide
whether or not the network is trained correctly. In the case of financial applications, looking
at the binary cross-entropy is not relevant. While computer vision algorithms can rely on a visual
inspection of generated images to evaluate the GAN, financial time series are extremely noisy,
so a visual evaluation is much more questionable. Second, the basic GAN model is not suitable for
generating multi-dimensional time series. A possible alternative is to modify the GAN's
structure. For instance, we can replace the binary cross-entropy function by the Wasserstein
distance for the training error (Arjovsky et al., 2017a,b), or the temporal information of multi-
dimensional time series can be encoded using a more complex structure such as convolutional
neural networks (Radford et al., 2016) or recurrent neural networks (Hyland et al., 2017).
Third, the mode collapse phenomenon must be addressed in order to manage the poor
diversity of generated samples, because we would like to have several simulated time series,
as we have when performing Monte Carlo methods.

10 For instance, we can define the labels in the following way: vt = 1 {εt > c} − 1 {εt < −c}, where εt = yt − mt and c > 0 is a threshold.


2.3.1 Optimization problem


Let P and Q be two univariate probability distributions. The Wasserstein (or Kantorovich)
distance between P and Q is defined as:

$$\mathcal{W}_p(\mathbb{P}, \mathbb{Q}) = \left( \inf_{\mathbb{F} \in \mathcal{F}(\mathbb{P}, \mathbb{Q})} \int \|x - y\|^p \, \mathrm{d}\mathbb{F}(x, y) \right)^{1/p} \quad (16)$$

where F (P, Q) denotes the Fréchet class of all joint distributions F (x, y) whose marginals
are equal to P and Q. In the case p = 1, the Wasserstein distance represents the cost of the
optimal transport problem11 (Villani, 2008):

$$\mathcal{W}(\mathbb{P}, \mathbb{Q}) = \inf_{\mathbb{F} \in \mathcal{F}(\mathbb{P}, \mathbb{Q})} \mathbb{E}\left[\|X - Y\| \mid (X, Y) \sim \mathbb{F}\right] \quad (17)$$

because F (x, y) describes how much mass must be moved in order to transport one distribution
to another (Arjovsky et al., 2017a,b). The Kantorovich-Rubinstein duality presented in
Villani (2008) allows us to rewrite Equation (17) as follows:

$$\mathcal{W}(\mathbb{P}, \mathbb{Q}) = \sup_{\varphi} \mathbb{E}\left[\varphi(X) \mid X \sim \mathbb{P}\right] - \mathbb{E}\left[\varphi(Y) \mid Y \sim \mathbb{Q}\right] \quad (18)$$

where ϕ is a 1-Lipschitz function, such that |ϕ (x) − ϕ (y)| ≤ ‖x − y‖ for all (x, y).
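In one dimension, Equation (17) has a simple empirical counterpart: the optimal transport plan matches order statistics, so W1 between two equal-size samples is the mean absolute difference of the sorted values. A numpy sketch:

```python
import numpy as np

def wasserstein_1d(x, y):
    """Empirical W1 distance between two equal-size univariate samples:
    sorting both samples realizes the optimal coupling in dimension one."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    return float(np.mean(np.abs(x - y)))

# Shifting a sample by a constant costs exactly that constant in transport.
d = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 4.0, 5.0])  # = 3.0
```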
In the case of a generative model, we would like to check whether the sample and gen-
erated data follow the same distribution. Therefore, we obtain P = Pdata and Q =
Pmodel. In the special case of a GAN model, Pmodel (x; θg) is given by G (Pnoise (z); θg).
Arjovsky et al. (2017a,b) demonstrated that if G (z; θg) is continuous with respect to θg,
then W (Pdata, Pmodel) is continuous everywhere and differentiable almost everywhere. This
result implies that the GAN can be trained until it has converged to its optimal solution, con-
trary to basic GANs, where the training loss was bounded and not continuous12. Moreover,
Gulrajani et al. (2017) adapted Equation (18) in order to recover the two-player min-max
game:

$$C(\theta_g, \theta_d) = \mathbb{E}\left[D(x_1; \theta_d) \mid x_1 \sim P_{data}(x)\right] - \mathbb{E}\left[D(x_0; \theta_d) \mid x_0 \sim G(z; \theta_g),\, z \sim P_{noise}(z)\right]$$

The Wasserstein GAN (WGAN) solves the same min-max optimization problem as previously:

$$\hat{\theta}_g^{(min)} = \arg \min_{\theta_g \in \Theta_G} C^{(min)}(\theta_g \mid \theta_d)$$

This implies that the discriminator needs to be a 1-Lipschitz function.


Remark 4. The discriminator function is not necessarily a classifier, since the output of
the function D is no longer restricted to a value in [0, 1]. It can then be a general scoring method.
Remark 5. The Kantorovich-Rubinstein duality makes the link between Wasserstein dis-
tance and integral probability metrics presented in Appendix A.3.2 on page 68.

2.3.2 Properties of the optimal solution


In Wasserstein GAN models, the cost function becomes:

$$C(\theta_g, \theta_d) = \mathbb{E}\left[\varphi(x_1; \theta_d) \mid x_1 \sim P_{data}\right] - \mathbb{E}\left[\varphi(x_0; \theta_d) \mid x_0 \sim G(z; \theta_g),\, z \sim P_{noise}(z)\right]$$


11 See Appendix A.6 on page 73 for an introduction to Monge-Kantorovich problems, and the relationship between optimal transport and the Wasserstein distance.


12 There is also no problem in computing the gradient.


and we have:

$$\nabla_{\theta_g} C(\theta_g, \theta_d) = -\mathbb{E}\left[\nabla_{\theta_g} \varphi\left(G(z; \theta_g); \theta_d\right)\right]$$

We recall that the discriminator ϕ needs to satisfy the Lipschitz property, otherwise the loss
gradient will explode13. A first alternative proposed by Arjovsky et al. (2017a,b) is to clip
the weights into a closed space [−c, c]. However, this method is limited because the gradient
can vanish and the weights can saturate. A second alternative proposed by Gulrajani et al.
(2017) is to focus on the properties of the optimal discriminator gradient. They showed that
the optimal solution ϕ⋆ has a gradient norm equal to 1 almost everywhere under P and Q. Therefore,
they proposed to add a regularization term to the cost function in order to constrain the
gradient norm to converge to 1. This gradient penalty leads to a new cost function:

$$C_{regularized}(\theta_g, \theta_d; \lambda) = C(\theta_g, \theta_d) + \lambda \mathbb{E}\left[\left(\|\nabla_x \varphi(x_2; \theta_d)\|_2 - 1\right)^2 \mid x_2 \sim P_{mixture}\right]$$

where Pmixture is a mixture distribution of P and Q. More precisely, Gulrajani et al. (2017)
proposed to sample x2 as follows: x2 = αx1 + (1 − α) x0 where α ∼ U[0,1], x0 ∼ Q and
x1 ∼ P.
Remark 6. Gulrajani et al. (2017) found that a good value of the coefficient λ is 10.
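A sketch of the penalty term (the critic gradient is supplied analytically here for illustration; in practice it is obtained by automatic differentiation):

```python
import numpy as np

def gradient_penalty(x1, x0, grad_phi, lam=10.0, rng=None):
    """WGAN-GP term lam * E[(||grad_x phi(x2)||_2 - 1)^2], with
    x2 = alpha * x1 + (1 - alpha) * x0 and alpha ~ U[0, 1]."""
    rng = np.random.default_rng(0) if rng is None else rng
    alpha = rng.uniform(size=(x1.shape[0], 1))        # one alpha per sample
    x2 = alpha * x1 + (1.0 - alpha) * x0              # mixture points
    norms = np.linalg.norm(grad_phi(x2), axis=1)      # ||grad_x phi(x2)||_2
    return lam * float(np.mean((norms - 1.0) ** 2))

# For a linear critic phi(x) = <w, x> with ||w||_2 = 1, the penalty vanishes.
w = np.array([0.6, 0.8])
pen = gradient_penalty(np.random.default_rng(1).normal(size=(32, 2)),
                       np.random.default_rng(2).normal(size=(32, 2)),
                       lambda x: np.tile(w, (x.shape[0], 1)))
```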

2.4 Convolutional neural networks


A convolutional neural network (CNN) is a class of deep neural networks, where the inputs
are transformed using convolution and filtering operators. For instance, this type of neural
network is extensively used in computer vision. Recently, Radford et al. (2016) implemented
a new version of GANs using convolutional networks as generator and discriminator in
order to improve image generation14. In this approach, the underlying idea is to extract
pertinent features. In finance, the inputs are different and correspond to nx-dimensional time
series representing asset prices over nt days. Therefore, the inputs are represented by a matrix
belonging to Mnt,nx (R), and are characterized by the time axis, where order matters, and
the asset axis, where order does not matter. Therefore, the input transformation refers
to 1-dimensional convolution (1D-CNN).

2.4.1 Extracting features using convolution


Convolution is no more than a linear transformation of a given input. But, contrary to simple
multilayer perceptrons, convolutional operations preserve the notion of ordering along
a specific axis. To achieve this, a weight matrix of a given length nk, called a kernel,
slides across the input data. At each location, a contiguous part of the input data is
selected, and the kernel receives this vector as input in order to produce a single output
by matrix product. This process is repeated with nf different filters of similar dimension.
Consequently, we obtain nf down-sampled versions of the original input. The way the
kernel slides across the input data is controlled by the stride ns, defined as the
distance between two consecutive positions of the kernel (Dumoulin and Visin, 2016). The
higher the stride value, the more important the sub-sampling. If ns is equal to one, all the
data are considered and there is no sub-sampling. Finally, the notion of padding allows us
to address the situation when the kernel arrives at the end or the beginning of an axis.
Padding is thus defined as the number of zeros concatenated at the beginning or at the end of a
given axis.
13 As said previously, ϕ is not necessarily a discriminator function, but we continue to use this term to name ϕ.
14 This class of CNNs is called deep convolutional generative adversarial networks (DCGANs).


Figure 5: Input architecture of convolutional neural networks (raw inputs are transformed into several filtered inputs)

In finance, we want to down-sample a given multi-dimensional time series belonging
to Mnt,nx (R) into a subspace Mn′t,nx (R) with the condition that n′t < nt. With a 1-
dimensional convolution, the output time axis is automatically defined with respect to the di-
mension of the kernel. The first dimension of the kernel is free, whereas the second dimension
is set to the number of time series nx. Thus, the kernel is defined by a weight matrix be-
longing to Mnk,nx (R). In order to sample the input data, the stride is chosen such that
ns ≥ 1. The padding is set such that np rows of zeros are padded at the (bottom and top)
limits of the input data. Finally, the output will belong to Mn′t,nx (R), such that:

$$n_t' = \frac{n_t - n_k + 2n_p}{n_s} + 1 \quad (19)$$

This type of convolution will be used to build the discriminator, which should output a single
scalar. In this case, down-sampling real or fake time series becomes essential.
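Equation (19) can be checked with a one-line helper (the integer division assumes the stride divides the padded length evenly, as in the usual implementation conventions):

```python
def conv_output_length(nt, nk, n_pad, ns):
    """Output time dimension of a 1-D convolution, Equation (19)."""
    return (nt - nk + 2 * n_pad) // ns + 1

# Example: 250 daily observations, kernel of length 5, no padding, stride 1.
n_out = conv_output_length(250, 5, 0, 1)   # 246
```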

2.4.2 Up-sampling a feature using transpose convolution


This transformation, also called 'deconvolution', is useful to build the generator when we
would like to up-sample a random noise in order to produce a fake time series belonging
to Mnt,nx (R). More generally, it could be used as the decoding layer in all forms of auto-
encoders. Transpose convolution reverses the shape transformation of the convolution
previously defined. Considering an input belonging to Mnt,nx (R), we obtain an output
belonging to Mn′t,nx (R) such that:

$$n_t' = n_s (n_t - 1) + n_k - 2n_p \quad (20)$$

The term transpose comes from the fact that convolution is in fact a matrix operation.
When we perform a convolution, the input matrix is flattened into an nt nx-dimensional vector. All
the parameters (stride, kernel settings or padding) are encoded into the weight convolution
matrix belonging to Mnt nx,n′t (R). Therefore, the output data is obtained by computing
the product between the input vector and the convolution matrix. Performing a transpose
convolution is equivalent to taking the transpose of this convolution matrix.
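Equation (20) similarly gives the up-sampled length; with matching settings it inverts the size reduction of Equation (19). A sketch:

```python
def transpose_conv_output_length(nt, nk, n_pad, ns):
    """Output time dimension of a 1-D transpose convolution, Equation (20)."""
    return ns * (nt - 1) + nk - 2 * n_pad

# With kernel 4, padding 1 and stride 2, a 50-step input is up-sampled to 100 steps,
# exactly undoing the convolution (100 - 4 + 2) / 2 + 1 = 50 of Equation (19).
n_up = transpose_conv_output_length(50, 4, 1, 2)   # 100
```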


3 Financial applications
3.1 Application of RBMs to generate synthetic financial time series
A typical financial time series of length T may be described by a real-valued vector y =
(y1 , ..., yT ). As we have introduced in Sections 2.1.1 and 2.1.2 on page 9, Bernoulli RBMs
take binary vectors as input for training and Gaussian RBMs take the data with unit variance
as input in order to ignore the parameter σ. Therefore, data preprocessing is necessary
before any training process of RBMs. For a Bernoulli RBM, each value of yt needs to be
converted to an N -digit binary vector using the algorithm proposed by Kondratyev and
Schwarz (2019). The underlying idea consists in discretizing the distribution of the training
dataset and representing the discretized values with binary numbers. For instance, N may be set to 16,
and a Bernoulli RBM that takes a one-dimensional time series as its training dataset will have
16 visible layer units. In the same way, the visible layer will have 16 × n neurons in the case
of an n-dimensional time series. Moreover, samples generated by a Bernoulli RBM after
Gibbs sampling are also binary vectors and we need another algorithm to transform binary
vectors into real values. These transformation algorithms between real values and binary
vectors are detailed in Appendix A.7 on page 77. For a Gaussian RBM, we need only to
normalize data to unit variance and scale generated samples after Gibbs sampling.
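The exact discretization scheme of Kondratyev and Schwarz (2019) is detailed in Appendix A.7; the following is a simplified stand-in that maps a value onto a uniform 16-bit grid and back:

```python
import numpy as np

def encode_binary(y, lo, hi, n_bits=16):
    """Map a real value in [lo, hi] to an n_bits binary vector (uniform grid).
    Simplified sketch, not the exact algorithm of Appendix A.7."""
    levels = (1 << n_bits) - 1
    k = int(round((y - lo) / (hi - lo) * levels))
    return np.array([(k >> i) & 1 for i in reversed(range(n_bits))])

def decode_binary(bits, lo, hi):
    """Inverse map: binary vector back to a real value on the same grid."""
    levels = (1 << len(bits)) - 1
    k = int("".join(str(int(b)) for b in bits), 2)
    return lo + k / levels * (hi - lo)

bits = encode_binary(0.123, lo=-1.0, hi=1.0)    # 16-digit binary vector
y_back = decode_binary(bits, lo=-1.0, hi=1.0)   # recovers 0.123 up to the grid spacing
```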
For the training process of RBMs, we use the contrastive divergence algorithm CD(1) to
estimate the log-likelihood gradient and the learning rate η (t) is set to 0.01. All models are
trained using mini-batch gradient descent with a batch size of 500, and we apply 100 000 epochs
to ensure that the models are well-trained. Using a larger mini-batch size yields a more stable
gradient estimate at each step, but it also uses more memory and takes longer to
train the model.

After having trained the RBMs, we may use these models to generate synthetic (or
simulated) samples that match the training samples. Theoretically, a well-trained RBM should be
able to transform random noise into samples that have the same distribution as the training
dataset after doing Gibbs sampling for a sufficiently long time. Therefore, the trained RBM
is fed by samples drawn from the Gaussian distribution N (0, 1) as input data. We then
perform a large number of forward and backward passes through the network to ensure
convergence between the probability distribution of generated samples and the probability
distribution of the training sample. In the particular case of Bernoulli RBMs, after the last
backward pass from the hidden layer to the visible layer, we need to transform generated
binary vectors into real-valued vectors.
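The forward and backward passes described above correspond to alternating Gibbs sampling. A minimal numpy sketch for a Bernoulli RBM (random parameters and illustrative layer sizes, not the trained models of this section):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gibbs_sample(v0, W, a, b, n_steps=1000, rng=None):
    """Alternate v -> h (forward) and h -> v (backward) passes in a Bernoulli RBM.
    W is (n_visible, n_hidden); a and b are the visible and hidden biases."""
    rng = np.random.default_rng(0) if rng is None else rng
    v = v0.copy()
    for _ in range(n_steps):
        p_h = sigmoid(b + v @ W)                                # forward pass
        h = (rng.uniform(size=p_h.shape) < p_h).astype(float)
        p_v = sigmoid(a + h @ W.T)                              # backward pass
        v = (rng.uniform(size=p_v.shape) < p_v).astype(float)
    return v

rng = np.random.default_rng(42)
W = 0.1 * rng.standard_normal((16, 8))            # illustrative 16 visible, 8 hidden units
v0 = (rng.uniform(size=16) < 0.5).astype(float)   # thresholded random noise as input
v = gibbs_sample(v0, W, np.zeros(16), np.zeros(8), n_steps=100)
```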

3.1.1 Simulating multi-dimensional datasets


In this first study, we test Bernoulli and Gaussian RBMs on a simulated multi-dimensional
dataset to check whether RBMs can learn the joint distribution of training samples that
involves marginal distributions and a correlation structure. We simulate 10 000 observations
of 4-dimensional data with different marginal distributions, as shown in Figure 6. The first
dimension consists of samples drawn from a Gaussian mixture model15 . The samples in
the second dimension are drawn from a Student’s t distribution with 4 degrees of freedom.
For the third and fourth dimensions, we draw samples from the Gaussian dis-
tribution N (0, 2). The empirical probability distributions of these 4-dimensional data are
reported in Figure 6. In addition, we use a Gaussian copula to construct a simple correla-
tion structure of the simulated samples with the correlation matrix of the Gaussian copula
15 The mixture probability is equal to 50%, whereas the two underlying probability distributions correspond

to two Gaussian random variables N (−1.5, 2) and N (2, 1).

25
Improving the Robustness of Backtesting with Generative Models

Figure 6: Histograms of simulated training samples: (a) Gaussian mixture model, (b) Student's t-distribution with ν = 4, (c) Gaussian distribution N (0, 2), (d) Gaussian distribution N (0, 2)

Figure 7: The theoretical correlation matrix of the Gaussian copula

                Dimension 1   Dimension 2   Dimension 3   Dimension 4
Dimension 1         100%          -60%           0%            30%
Dimension 2         -60%          100%           0%           -20%
Dimension 3           0%            0%         100%            50%
Dimension 4          30%          -20%          50%           100%

given in Figure 7. We set a strong negative correlation of −60% between the Gaussian mix-
ture model and the Student's t distribution, and a strong positive correlation of 50% between
the two Gaussian distributions. Moreover, the Gaussian distribution in the fourth dimen-
sion has a more complicated correlation structure16 than the Gaussian distribution in the
third dimension, which is independent of the Gaussian mixture model and the Student's
t distribution, i.e. the first and second dimensions. Here, the challenge is to learn both the
marginal distributions and the copula function.
16 The correlations are equal to 30% and −20% with respect to the first and second dimensions.
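The construction above can be sketched as follows for a two-dimensional version coupling a Student's t and a Gaussian marginal at −60% (this uses scipy for the quantile functions; sizes and the restriction to two dimensions are illustrative):

```python
import numpy as np
from scipy import stats

def gaussian_copula_uniforms(corr, n, rng):
    """Uniform margins whose dependence follows a Gaussian copula with matrix `corr`."""
    L = np.linalg.cholesky(corr)                    # correlate standard normals
    z = rng.standard_normal((n, corr.shape[0])) @ L.T
    return stats.norm.cdf(z)                        # probability integral transform

rng = np.random.default_rng(42)
corr = np.array([[1.0, -0.6], [-0.6, 1.0]])         # the -60% entry of Figure 7
u = gaussian_copula_uniforms(corr, 10_000, rng)
x_t = stats.t.ppf(u[:, 0], df=4)                    # Student's t marginal
x_g = stats.norm.ppf(u[:, 1], loc=0.0, scale=2.0)   # N(0, 2) marginal
```

Inverting each marginal cumulative distribution function on the copula uniforms is what separates the dependence structure from the marginal distributions, which is exactly what the RBMs are asked to learn jointly.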


Before implementing the training process, we need to adapt the parameters of the
RBMs to the multi-dimensional input. In the case of the Bernoulli RBM, each value
should be transformed into a 16-digit binary vector, and we need to concatenate them to
form a 64-digit binary vector. So, the visible layer of the Bernoulli RBM has 64 neurons.
For the Gaussian RBM, we simply have 4 visible layer units. In our experience,
we need to set a large number of hidden layer units in order to learn the marginal distributions
and the correlation structure at the same time. Therefore, we choose 256 hidden layer units
for the Bernoulli RBM and 128 hidden layer units for the Gaussian RBM. We also recall
that the number of epochs is set to 100 000 in order to ensure that the models are well-trained.
After the training process, we perform 1 000 steps of Gibbs sampling on a 4-dimensional
random noise to generate 10 000 samples with the same size as the training dataset. These
simulated samples are expected to have not only the same marginal distributions but also
the same correlation structure as the training dataset.

Bernoulli RBM Figures 8 and 9 compare the histograms and QQ-plots between training
samples and generated samples after 1 000 Gibbs sampling steps in the case of the Bernoulli
RBM. According to these figures, we observe that the Bernoulli RBM can learn pretty well
each marginal distribution of training samples. However, we also find that the Bernoulli
RBM focuses on extreme values in the training dataset instead of the whole tail of the dis-
tribution and this phenomenon is more evident for heavy-tailed distributions. For example,
in the case of the Student’s t distribution (Panel (b) in Figure 9), the Bernoulli RBM tries
to learn the behavior of extreme values, but ignores the other part of the distribution tails.
Comparing the results for the two Gaussian distributions (Panels (c) and (d) in Figure 9),
we notice that the learning of the Gaussian distribution in the fourth dimension, which has
a more complicated correlation structure with the other dimensions, is not as good as the
result for the Gaussian distribution in the third dimension.
We have replicated the previous simulation 50 times17 and we compare the average of
the mean, the standard deviation, the 1st percentile and the 99th percentile between training
and simulated samples in Table 1. Moreover, in the case of simulated samples, we indicate
the confidence interval by ± one standard deviation. These statistics demonstrate that
the Bernoulli RBM overestimates the standard deviation, the 1st percentile and
the 99th percentile, especially in the case of the Student's t distribution. This means that the
Bernoulli RBM is sensitive to extreme values and the learned probability distribution is
heavier-tailed than the empirical probability distribution of the training dataset.
In Figure 10, we show the comparison between the empirical correlation matrix18 of the
training dataset and the average of the correlation matrices computed from the 50 Monte
Carlo replications. We notice that the Bernoulli RBM does not capture the correlation
structure perfectly, since the correlation coefficients are less significant than in
the empirical correlation matrix. For instance, the correlation between the first and second
dimensions is equal to −57% for the training data, but only −31% for the simulated data
on average19.

17 Each replication corresponds to 10 000 generated samples of the four random variables, and differs

because they use different random noise series as input data.


18 This empirical correlation matrix is not exactly equal to the correlation matrix of the copula function,

because the marginals are not all Gaussian.


19 Similarly, the correlation between third and fourth dimensions is equal to 50% for the training data, but

only 31% for the simulated data on average.


Figure 8: Histogram of training and Bernoulli RBM simulated samples: (a) Gaussian mixture model, (b) Student's t-distribution with ν = 4, (c) Gaussian distribution N (0, 2), (d) Gaussian distribution N (0, 2)

Figure 9: QQ-plot of training and Bernoulli RBM simulated samples: (a) Gaussian mixture model, (b) Student's t-distribution with ν = 4, (c) Gaussian distribution N (0, 2), (d) Gaussian distribution N (0, 2)


Table 1: Comparison between training and Bernoulli RBM simulated samples

                              Dimension 1                        Dimension 2
Statistic              Training    Simulated             Training    Simulated
Mean                     0.303     0.319 (± 0.039)        −0.006     −0.054 (± 0.029)
Standard deviation       2.336     2.385 (± 0.124)         1.409      1.540 (± 0.044)
1st percentile          −5.512    −5.380 (± 0.121)        −3.590     −5.008 (± 0.905)
99th percentile          4.071     4.651 (± 0.143)         3.862      4.255 (± 0.374)

                              Dimension 3                        Dimension 4
Statistic              Training    Simulated             Training    Simulated
Mean                    −0.002     0.071 (± 0.039)        −0.063      0.033 (± 0.043)
Standard deviation       1.988     2.044 (± 0.028)         1.982      2.037 (± 0.030)
1st percentile          −4.691    −4.987 (± 0.167)        −4.895     −5.227 (± 0.244)
99th percentile          4.677     4.918 (± 0.190)         4.431      4.938 (± 0.322)

Figure 10: Comparison between the empirical correlation matrix and the average correlation
matrix of Bernoulli RBM simulated data

Training samples:
                 Dim 1   Dim 2   Dim 3   Dim 4
Dimension 1       100%    −57%      1%     29%
Dimension 2       −57%    100%     −1%    −19%
Dimension 3         1%     −1%    100%     50%
Dimension 4        29%    −19%     50%    100%

Generated samples:
                 Dim 1   Dim 2   Dim 3   Dim 4
Dimension 1       100%    −31%      3%     20%
Dimension 2       −31%    100%      0%    −11%
Dimension 3         3%      0%    100%     31%
Dimension 4        20%    −11%     31%    100%

Gaussian RBM   Let us now consider a Gaussian RBM with 4 visible layer units and 128 hidden layer units, using the same simulated training dataset. After the training process, synthetic samples are again generated using Gibbs sampling with 1 000 steps between the visible and hidden layers. Histograms and QQ-plots of training and generated samples are reported in Figures 11 and 12. We notice that the Gaussian RBM works well for the Gaussian mixture model and the two Gaussian distributions (Panels (a), (c) and (d) in Figure 12). Again, we observe that the Gaussian distribution with the simplest correlation structure is easier to learn than the Gaussian distribution with a more complicated correlation structure. However, the Gaussian RBM fails to learn heavy-tailed distributions such as the Student's t distribution, since the model tends to ignore all values in the distribution tails.
Comparing Table 1 on page 29 for the Bernoulli RBM and Table 2 on page 31 for the Gaussian RBM, we notice that the Gaussian RBM generally underestimates the standard deviation, the 1st percentile and the 99th percentile. This means that it is challenging to generate leptokurtic probability distributions with Gaussian RBMs.
In Figure 13, we also compare the empirical correlation matrix of the training dataset


Figure 11: Histogram of training and Gaussian RBM simulated samples
[Four panels comparing training and generated samples: (a) Gaussian mixture model; (b) Student's t distribution with ν = 4; (c) Gaussian distribution N(0, 2); (d) Gaussian distribution N(0, 2)]

Figure 12: QQ-plot of training and Gaussian RBM simulated samples
[Four panels plotting generated against training quantiles: (a) Gaussian mixture model; (b) Student's t-distribution with ν = 4; (c) Gaussian distribution N(0, 2); (d) Gaussian distribution N(0, 2)]


Table 2: Comparison between training and Gaussian RBM simulated samples

                          Dimension 1                          Dimension 2
Statistic            Training    Simulated             Training    Simulated
Mean                   0.303     0.325 (± 0.037)        −0.006    −0.006 (± 0.023)
Standard deviation     2.336     2.232 (± 0.021)         1.409     1.355 (± 0.020)
1st percentile        −5.512    −4.992 (± 0.121)        −3.590    −3.054 (± 0.078)
99th percentile        4.071     4.167 (± 0.078)         3.862     3.240 (± 0.105)

                          Dimension 3                          Dimension 4
Statistic            Training    Simulated             Training    Simulated
Mean                  −0.002     0.064 (± 0.029)        −0.063     0.046 (± 0.031)
Standard deviation     1.988     1.942 (± 0.023)         1.982     1.910 (± 0.026)
1st percentile        −4.691    −4.382 (± 0.109)        −4.895    −4.512 (± 0.122)
99th percentile        4.677     4.572 (± 0.109)         4.431     4.303 (± 0.139)

and the average of the correlation matrices computed with 50 Monte Carlo replications. Compared to the Bernoulli RBM, the Gaussian RBM captures the correlation structure of the training dataset much better. This is particularly true for the largest correlation values. For instance, the correlation between the first and second dimensions is equal to −57% for the training data, −31% for the Bernoulli RBM simulated data and −48% for the Gaussian RBM simulated data.

Figure 13: Comparison between the empirical correlation matrix and the average correlation
matrix of Gaussian RBM simulated data

Training samples:
                 Dim 1   Dim 2   Dim 3   Dim 4
Dimension 1       100%    −57%      1%     29%
Dimension 2       −57%    100%     −1%    −19%
Dimension 3         1%     −1%    100%     50%
Dimension 4        29%    −19%     50%    100%

Generated samples:
                 Dim 1   Dim 2   Dim 3   Dim 4
Dimension 1       100%    −48%     −3%     22%
Dimension 2       −48%    100%      1%    −13%
Dimension 3        −3%      1%    100%     44%
Dimension 4        22%    −13%     44%    100%

Summary of the results   According to our tests, both well-trained Bernoulli and Gaussian RBMs can transform random noise series into samples from the joint distribution of a training dataset, but they have their own characteristics:

• A Bernoulli RBM is very sensitive to extreme values of the training dataset and tends to overestimate the tails of the distribution. On the contrary, a Gaussian RBM tends to underestimate the tails of the probability distribution and faces difficulties in learning leptokurtic probability distributions.


• Gaussian RBMs may capture the correlation structure of the training dataset more accurately than Bernoulli RBMs.

In practice, we may apply the normal score transformation to the training dataset, such that each marginal has a standard normal distribution. For that, we rank the values of each dimension from lowest to highest, map these ranks to a uniform distribution and apply the inverse Gaussian cumulative distribution function. We then train the RBM with these transformed values. After the training process, the marginals of samples generated by well-trained RBMs will follow a standard normal distribution. By applying the inverse transformation, we can generate synthetic samples with the same marginal distributions as those of the training dataset, and a correlation structure that is close to the empirical correlation matrix of the training data. In this case, we may consider Gaussian RBMs as an alternative to the bootstrap sampling approach.
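This rank-based transformation and its inverse can be sketched as follows with NumPy and SciPy (the function names are ours, not the authors'):

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_score_transform(x):
    """Map each column of x to a standard normal marginal.

    Ranks are mapped to (0, 1) with the r / (n + 1) convention, then passed
    through the inverse Gaussian cumulative distribution function.
    """
    n = x.shape[0]
    u = rankdata(x, axis=0) / (n + 1)  # uniform scores in (0, 1)
    return norm.ppf(u)

def inverse_normal_score(z, x_train):
    """Map standard normal samples back to the empirical marginals of x_train."""
    u = norm.cdf(z)
    out = np.empty_like(u)
    for j in range(u.shape[1]):
        # empirical quantile function of the j-th training marginal
        out[:, j] = np.quantile(x_train[:, j], u[:, j])
    return out
```

Training the RBM on `normal_score_transform(x)` and applying `inverse_normal_score` to its output recovers the empirical marginals, as described above.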

3.1.2 Application to financial time series


According to the results in the previous section, Gaussian RBMs capture the correlation structure of the training dataset more accurately, and we can avoid their drawbacks by applying the normal score transformation. In this paragraph, we test and compare a Gaussian RBM and a conditional RBM on a multi-dimensional autocorrelated time series, in order to check whether conditional RBMs can capture both the correlation structure and the time dependence of the training dataset. The RBM is trained on a real financial dataset consisting of 5 354 historical daily returns of the S&P 500 index and the VIX index from December 1998 to May 2018 (see Figure 14). During this period, the S&P 500 index and the VIX index have a strong negative correlation equal to −71%. In order to reinforce the autocorrelation level in the training samples, we apply a 3-day exponentially weighted moving average to historical prices before calculating daily returns. Figure 15 shows the autocorrelation of the training data. As the daily returns of both the S&P 500 and VIX indices are leptokurtic and have some extreme values, we need to apply the normal score transformation to the input data before the training process.

Figure 14: Historical prices of the S&P 500 and VIX indices
[Two panels: S&P 500 index (left) and VIX index (right), daily prices from December 1998 to May 2018]

Normal score transformation In order to verify the learning quality of Gaussian RBMs
using the normal score transformation, we run 50 Monte Carlo simulations after the training
process and for each simulation, we generate 5 354 simulated observations from different


Figure 15: Autocorrelation function of the training dataset
[Autocorrelation versus lag (1 to 19) for the SPX and VIX return series]

random noise series. We then calculate the statistics of these samples as we did in the previous section. According to Table 3, we find that although the Gaussian RBM still slightly underestimates the standard deviation, the 1st percentile and the 99th percentile, the learning quality is much improved. In addition, the average correlation coefficient between the daily returns of the S&P 500 and VIX indices over the 50 Monte Carlo simulations is equal to −76%, which is close to the empirical correlation of −71%. Therefore, we consider that training Gaussian RBMs with the normal score transformation may be used as an alternative to bootstrap sampling.

Table 3: Comparison between the training sample and Gaussian RBM simulated samples
using the normal score transformation

                          S&P 500 index                          VIX index
Statistic            Training    Simulated             Training    Simulated
Mean                   0.02%    −0.01% (± 0.01%)         0.06%     0.09% (± 0.06%)
Standard deviation     0.64%     0.62% (± 0.01%)         3.84%     3.65% (± 0.12%)
1st percentile        −1.93%    −1.88% (± 0.12%)        −7.31%    −6.97% (± 0.30%)
99th percentile        1.61%     1.49% (± 0.06%)        11.51%    10.96% (± 0.63%)

Learning time dependence   As the conditional RBM is an extension of the Gaussian RBM, applying the normal score transformation should still work. Moreover, a conditional RBM should be able to learn time-dependent relationships by design. We therefore train a conditional RBM on the same training dataset, in order to check whether this model can capture the autocorrelation patterns previously shown in Figure 15.
In practice, we choose a long memory for the conditional layer and feed the values of the last 20 days to the model. The conditional RBM therefore has 2 visible layer units, 128 hidden layer units and 40 conditional layer units20. We then apply the normal score transformation to the training dataset as mentioned before. After a training process of 100 000 epochs, we generate samples of consecutive time steps to verify whether the conditional RBM can learn the joint distribution and also capture the autocorrelation of the training dataset. We run 3 Monte Carlo simulations and, for each simulation, we consecutively generate samples of 250 observations. In Figures 16 and 17, we compare the autocorrelation function of generated
20 Corresponding to two time series of 20 lags.


samples generated by the Gaussian and conditional RBMs. These results clearly show that a conditional RBM can capture the autocorrelation of the training dataset well, which is not the case for a traditional Gaussian RBM.
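The sample autocorrelation functions compared here can be computed with a short NumPy routine (a minimal sketch; the function name is ours):

```python
import numpy as np

def acf(x, max_lag=19):
    """Sample autocorrelation of a one-dimensional series for lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])
```

Applying `acf` to each simulated return series and to the training returns reproduces the kind of comparison shown in Figures 16 and 17.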

Figure 16: Autocorrelation function of synthetic samples generated by a Gaussian RBM
[Two panels: S&P 500 index (left) and VIX index (right); each compares the autocorrelation of Simulations 1–3 with that of the training data]

Figure 17: Autocorrelation function of synthetic samples generated by a conditional RBM
[Two panels: S&P 500 index (left) and VIX index (right); each compares the autocorrelation of Simulations 1–3 with that of the training data]

Market generator   Based on the above results, we consider that conditional RBMs can be used as a market generator (Kondratyev and Schwarz, 2019). In the example of the S&P 500 and VIX indices, we have trained the models with all historical observations and, for each date, we have used the values of the last 20 days as a memory vector for the conditional layer. In this approach, the normal score transformation ensures the learning of the marginal distributions, whereas the conditional RBM ensures the learning of the correlation structure and the time dependence. After the training process has been calibrated, we may choose a date as the starting point. The values of this date and its last 20 days are then passed to the trained RBM model and, after sufficient steps of Gibbs sampling, we get a sample for the next day. We then feed this new sample and the updated memory vector to the model in order to generate a sample for the following day. Iteratively, the conditional RBM can generate a multi-dimensional time series that has the same patterns as the training data in terms of marginal distribution, correlation structure and time dependence.
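This iterative scheme can be sketched as follows, where `sample_next` stands in for one round of Gibbs sampling from the trained conditional RBM (the function name and array shapes are our assumptions):

```python
import numpy as np

def generate_path(sample_next, memory, n_steps=250):
    """Roll a conditional generator forward one day at a time.

    sample_next : callable mapping the (n_lags, n_assets) memory block to the
                  next day's (n_assets,) sample -- a stand-in for Gibbs
                  sampling from the trained conditional RBM.
    memory      : initial (n_lags, n_assets) array of past observations.
    """
    memory = np.array(memory, dtype=float)
    path = []
    for _ in range(n_steps):
        x_next = np.asarray(sample_next(memory), dtype=float)
        path.append(x_next)
        # slide the memory window: drop the oldest day, append the new sample
        memory = np.vstack([memory[1:], x_next])
    return np.array(path)
```

With a 20-day memory and two assets, `memory` has shape (20, 2) and the returned path has shape (n_steps, 2).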


For instance, Figure 18 shows three time series of 250 trading days starting on 22/11/2000, generated by the trained conditional RBM. We notice that these three simulated time series have the same characteristics as the historical prices represented by the dashed line. The negative correlation between the S&P 500 and VIX indices is also learned, and we can clearly see that there is a positive autocorrelation in each simulated one-dimensional time series.

Figure 18: Comparison between scaled historical prices of the S&P 500 and VIX indices and
synthetic time series generated by the conditional RBM
[Two panels: S&P 500 index (left) and VIX index (right); each shows Simulations 1–3 and the real series over 250 time steps]

3.2 Application of GANs to generate fake financial time series


In this section, we perform the same tests for GAN models as those we carried out for RBMs. The objective is to compare the results of these two types of generative models on the same training dataset. We recall that the first test consists of learning the joint distribution of a simulated multi-dimensional dataset with different marginal distributions and a simple correlation structure driven by a Gaussian copula. For this test, we use a Wasserstein GAN model with only simple dense layers in the generator and the discriminator. In the second test, we train a conditional Wasserstein GAN model on the historical daily returns of the S&P 500 index and the VIX index from December 1998 to May 2018. As in the previous tests for RBMs, we apply a three-day exponentially weighted moving average to historical prices before calculating daily returns. We know that this transformation reinforces the autocorrelation level in the training samples. As a result, we need to choose a more complex structure for the generator and the discriminator in order to capture the complex features of the training dataset. For instance, we construct the generator and the discriminator with convolutional neural networks, as introduced in Section 2.4 on page 23. Based on the results of these two tests, we check whether the family of GAN models can also learn the marginal distributions, the correlation structure and the time dependence of the training dataset, as RBMs can.

3.2.1 Simulating multi-dimensional datasets


As introduced in Section 2.3 on page 21, Wasserstein GAN models, which use the Wasserstein distance as the loss function, have several advantages compared with

traditional GAN models using the cross-entropy loss measure. Therefore, in this first study, we test Wasserstein GAN models on the same simulated multi-dimensional dataset as used for the Bernoulli and Gaussian RBMs. We recall that we simulate 10 000 samples of 4-dimensional data with different marginal distributions (a Gaussian mixture model, a Student's t distribution and two Gaussian distributions), as shown in Figure 6 on page 26, and that the training dataset has a correlation structure simulated by a Gaussian copula with the correlation matrix given in Figure 7 on page 26. The objective of this test is to check whether a well-trained Wasserstein GAN model can learn the marginal distributions and the copula function of the training samples.

Data preprocessing   When implementing GAN models, we need to match the dimension and the range of the output of the generator with the input of the discriminator. To address this issue, there are two possible approaches:

• We can use a complex activation function for the last layer of the generator in order
to ensure that the generated samples have the same range as the training samples for
the discriminator (Clevert et al., 2015).

• We can modify the range of the training samples by applying a preprocessing function
and choose a usual activation function (Wiese et al., 2020).

In our study, we consider the second approach by applying the MinMax scaling function21 to the training samples for the discriminator. As a result, the scaled input data for the discriminator take their values in [0, 1], and we may choose the sigmoid activation function for the last layer of the generator to ensure that its outputs also take their values in [0, 1]. In addition, we need to simulate the random noise used as input for the generator. As the random noise plays the role of latent variables, we are free to choose its distribution. In this study, we use the Gaussian distribution z ∼ N (0, 1).
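A minimal sketch of this preprocessing step (the helper names are ours), with the MinMax scaling of footnote 21 and the Gaussian noise fed to the generator:

```python
import numpy as np

def minmax_scale(x):
    """Scale each column to [0, 1]: f(x) = (x - min x) / (max x - min x)."""
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return (x - x_min) / (x_max - x_min), (x_min, x_max)

def minmax_inverse(y, params):
    """Map generator outputs in [0, 1] back to the original data range."""
    x_min, x_max = params
    return y * (x_max - x_min) + x_min

def sample_noise(batch_size, dim=100, rng=None):
    """Random noise z ~ N(0, 1) used as input for the generator."""
    rng = rng or np.random.default_rng()
    return rng.standard_normal((batch_size, dim))
```

After training, `minmax_inverse` maps the sigmoid outputs of the generator back to the scale of the original data.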

Structure and training of Wasserstein GAN models   As mentioned above, we need to match the output of the generator and the input of the discriminator. Therefore, the last layer of the generator should have 4 neurons for our simulated training dataset, and we may use a sigmoid activation function for each neuron since we have applied the MinMax scaling function to the training samples. The discriminator corresponds to a traditional classification problem, so its last layer should have only one neuron. We may choose the sigmoid activation function for this neuron if the training samples are labelled by {0, 1}, or the hyperbolic tangent activation function if the training samples are labelled by {−1, 1}.
In our study, we want to construct a Wasserstein GAN model that can convert a 100-dimensional random noise vector into data with the same joint distribution as the training dataset. According to our tests, a simple structure of multi-layer perceptrons for the generator and the discriminator is sufficient for learning the distribution of these simulated training samples. More precisely, the generator is fed by 100-dimensional vectors and has 5 layers with the structure SG = {200, 100, 50, 25, 4}, where si ∈ SG represents the number of neurons of the ith layer. The discriminator has 4 layers with SD = {100, 50, 10, 1} and takes 4-dimensional vectors as input data. As explained in the previous paragraph, we choose the sigmoid activation function for the last layer of the generator and the hyperbolic tangent activation function for the last layer of the discriminator. For all other layers
21 We have f (x) = (x − min x) / (max x − min x).


of the discriminator and the generator, the activation function corresponds to a leaky rectified linear unit (leaky ReLU) function with α = 0.5:

    fα(x) = αx if x < 0, and fα(x) = x otherwise

According to Arjovsky et al. (2017a), the Wasserstein GAN model should be trained with the RMSProp optimizer using a small learning rate. Therefore, we set the learning rate to η = 1 × 10−4 and use mini-batch gradient descent with a batch size of 500. Table 4 summarizes the setting of the Wasserstein GAN model implementation using Python and the TensorFlow library. After the training process, we generate 10 000 samples in order to compare the marginal distributions and the correlation structure with the training dataset.

Table 4: The setting of the Wasserstein GAN model

Training data dimension          4
Input feature scaling function   MinMax
Random noise vector dimension    100
Random noise distribution        N (0, 1)
Generator structure              SG = {200, 100, 50, 25, 4}
Discriminator structure          SD = {100, 50, 10, 1}
Loss function                    Wasserstein distance
Learning optimizer               RMSProp
Learning rate                    1 × 10−4
Batch size                       500
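The Table 4 configuration can be sketched in Keras as follows (a sketch, not the authors' code; the Wasserstein training loop with its critic updates and weight clipping is omitted):

```python
import tensorflow as tf
from tensorflow.keras import layers

LEAKY_ALPHA = 0.5  # leaky ReLU slope used in the paper

def build_generator(noise_dim=100, out_dim=4):
    """Generator SG = {200, 100, 50, 25, 4}; sigmoid output matches MinMax-scaled data."""
    z = tf.keras.Input(shape=(noise_dim,))
    h = z
    for units in (200, 100, 50, 25):
        h = layers.Dense(units)(h)
        h = layers.LeakyReLU(LEAKY_ALPHA)(h)
    out = layers.Dense(out_dim, activation="sigmoid")(h)
    return tf.keras.Model(z, out)

def build_discriminator(in_dim=4):
    """Critic SD = {100, 50, 10, 1}; tanh output for samples labelled {-1, 1}."""
    x = tf.keras.Input(shape=(in_dim,))
    h = x
    for units in (100, 50, 10):
        h = layers.Dense(units)(h)
        h = layers.LeakyReLU(LEAKY_ALPHA)(h)
    out = layers.Dense(1, activation="tanh")(h)
    return tf.keras.Model(x, out)

# RMSProp with the small learning rate recommended for Wasserstein GANs
optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-4)
```

Both networks would then be trained adversarially on the MinMax-scaled samples with the Wasserstein loss.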

Results   Figures 19 and 20 compare the histograms and QQ-plots of training and generated samples. According to these figures, we observe that the Wasserstein GAN model learns each marginal distribution of the training samples very well, even better than the Bernoulli and Gaussian RBMs. In particular, the Wasserstein GAN model fits heavy-tailed distributions more accurately than RBMs, as in the case of the Student's t distribution (Panel (b) in Figure 20).
As we did for RBMs, we replicate the previous simulation 50 times and compare the averages of the mean, the standard deviation, the 1st percentile and the 99th percentile between training and simulated samples in Table 5. Moreover, in the case of simulated samples, we indicate the confidence interval by ± one standard deviation. These statistics show that the Wasserstein GAN model estimates the standard deviation, the 1st percentile and the 99th percentile very well for each marginal distribution, especially for the Gaussian mixture model and the Student's t distribution in the first and second dimensions. One shortcoming is that the Wasserstein GAN model slightly overestimates the 99th percentile for the two Gaussian distributions. Compared with the results shown in Table 1 on page 29 for the Bernoulli RBM and Table 2 on page 31 for the Gaussian RBM, we notice that the Wasserstein GAN model learns the marginal distributions of the training dataset better and does not systematically underestimate or overestimate the tails of the probability distribution as RBMs do. However, there is a small bias between the empirical mean computed with 50 Monte Carlo replications and the mean of the training dataset.
In Figure 21, we also compare the empirical correlation matrix of the training dataset and the average of the correlation matrices computed with 50 Monte Carlo replications. We notice that the Wasserstein GAN model captures the correlation structure very well.


Figure 19: Histogram of training and WGAN simulated samples
[Four panels comparing training and generated samples: (a) Gaussian mixture model; (b) Student's t-distribution with ν = 4; (c) Gaussian distribution N(0, 2); (d) Gaussian distribution N(0, 2)]

Figure 20: QQ-plot of training and WGAN simulated samples
[Four panels plotting generated against training quantiles: (a) Gaussian mixture model; (b) Student's t-distribution with ν = 4; (c) Gaussian distribution N(0, 2); (d) Gaussian distribution N(0, 2)]


Table 5: Comparison between training and WGAN simulated samples

                          Dimension 1                          Dimension 2
Statistic            Training    Simulated             Training    Simulated
Mean                   0.303     0.427 (± 0.027)        −0.006    −0.026 (± 0.013)
Standard deviation     2.336     2.361 (± 0.014)         1.409     1.379 (± 0.019)
1st percentile        −5.512    −5.539 (± 0.099)        −3.590    −3.638 (± 0.116)
99th percentile        4.071     4.103 (± 0.030)         3.862     4.073 (± 0.181)

                          Dimension 3                          Dimension 4
Statistic            Training    Simulated             Training    Simulated
Mean                  −0.002     0.199 (± 0.021)        −0.063     0.062 (± 0.020)
Standard deviation     1.988     2.002 (± 0.013)         1.982     1.958 (± 0.016)
1st percentile        −4.691    −4.717 (± 0.080)        −4.895    −4.821 (± 0.106)
99th percentile        4.677     4.954 (± 0.066)         4.431     4.865 (± 0.082)

For each value in the correlation matrix, the difference is less than 5%. Compared with the Bernoulli RBM and the Gaussian RBM, the Wasserstein GAN model captures the correlation structure of the training dataset much better. For instance, the correlation between the first and second dimensions is equal to −57% for the training data, −53% for the Wasserstein GAN simulated data, −48% for the Gaussian RBM simulated data and −31% for the Bernoulli RBM simulated data. If we consider the first and fourth dimensions, these figures become 29% for the training data, 24% for the Wasserstein GAN simulated data, 22% for the Gaussian RBM simulated data and 20% for the Bernoulli RBM simulated data.

Figure 21: Comparison between the empirical correlation matrix and the average correlation
matrix of WGAN simulated data

Training samples:
                 Dim 1   Dim 2   Dim 3   Dim 4
Dimension 1       100%    −57%      1%     29%
Dimension 2       −57%    100%     −1%    −19%
Dimension 3         1%     −1%    100%     50%
Dimension 4        29%    −19%     50%    100%

Generated samples:
                 Dim 1   Dim 2   Dim 3   Dim 4
Dimension 1       100%    −53%     −1%     24%
Dimension 2       −53%    100%     −1%    −17%
Dimension 3        −1%     −1%    100%     51%
Dimension 4        24%    −17%     51%    100%

Summary of the results   According to our tests, the Wasserstein GAN model performs very well in the task of learning the joint distribution of our simulated training dataset. Compared with the results of the Bernoulli and Gaussian RBMs, the Wasserstein GAN model has several advantages:

1. The Wasserstein GAN model is less sensitive to extreme values of the training dataset


and doesn’t have the tendency to underestimate or overestimate the tail of probability
distribution.

2. We don’t need to apply complex transformation to input data for Wasserstein GAN
model in data preprocessing. In our study, we just use a MinMax scaling function and
we recall that we need to use the binary transformation for Bernoulli RBM and the
normal score transformation for Gaussian RBM.

3. The Wasserstein GAN model may capture more accurately the correlation structure
of the training dataset than Bernoulli and Gaussian RBMs.

3.2.2 Application to financial time series


According to the results in the previous section, the Wasserstein GAN model performs very well in the case of a multi-dimensional simulated dataset without time dependence. In this paragraph, we construct a more complex Wasserstein GAN model with convolutional layers in the generator and the discriminator, in order to extract more features from the real financial dataset consisting of 5 354 historical daily returns of the S&P 500 index and the VIX index from December 1998 to May 2018. To capture the time dependence of the multi-dimensional autocorrelated time series, we also need to associate the training samples with conditional labels, as we did in the case of the conditional RBM. Mariani et al. (2019) proposed a method to train the model with historical returns over nt consecutive days, conditioned by the historical returns of the last nh days. As a result, the generator of this model should produce a matrix belonging to Mnt,nx(R), which means that the generator directly simulates the daily returns of the S&P 500 index and the VIX index for nt days. For simplicity, this model is referred to in this paper as the conditional deep convolutional Wasserstein GAN model (or CDCWGAN). In the data preprocessing step, we apply the MinMax scaling function to the input data before the training process to ensure that historical returns are scaled into the range [0, 1].

Structure and training of CDCWGAN models   In this study on financial time series, the training samples for the discriminator correspond to an nx-dimensional time series representing historical returns over nt days. Therefore, inputs are represented by a matrix belonging to Mnt,nx(R). In addition, each training sample is conditioned by the past values of returns over nh days, which are represented by a matrix belonging to Mnh,nx(R). In practice, we concatenate these two matrices to form a matrix belonging to Mnh+nt,nx(R), and this matrix is fed to the discriminator as input. In this study, the inputs of the discriminator are downsampled using 4 convolutional layers with {16, 32, 64, 128} filters. We set the kernel length to 3 and the stride to 2, and we use the leaky ReLU function with α = 0.5 for each convolutional layer. For the last layer of the discriminator, we again choose a 1-neuron dense layer with the hyperbolic tangent activation function, as in the case of the traditional Wasserstein GAN model.
For the generator, we modify the original method of Mariani et al. (2019) and borrow the idea of the conditional layer from the conditional RBM. The generator of our CDCWGAN model takes two inputs: a 100-dimensional random noise vector and the nh past values concatenated into an nh × nx-dimensional vector. We feed these two vectors to the first layer of the generator, which is a dense layer, and reshape the output into a two-dimensional nt × 256 matrix before passing it to the second layer. We construct the rest of the generator using 3 convolutional layers with {256, 64, 2} filters. The kernel length is set to 3 and the stride is set to 2. We also use a leaky ReLU function with α = 0.5 for each convolutional layer and, since the last convolutional layer directly gives


the simulated scenarios, we choose the sigmoid activation function in order to obtain values in the range [0, 1].
For this conditional deep convolutional Wasserstein GAN model, we again use the RMSProp optimizer with a small learning rate. The learning rate is set to η = 1 × 10−4 and the batch size is set to 500. Table 6 summarizes the setting of the model implementation using Python and the TensorFlow library.

Table 6: The setting of the CDCWGAN model

Training data dimension nx       2
Training data window nt          5
Training data label length nh    20
Input feature scaling function   MinMax
Random noise vector dimension    100
Random noise distribution        N (0, 1)
Generator structure              {Dense(5, 256), Conv1D(256), Conv1D(64), Conv1D(2)}
Discriminator structure          {Conv1D(16), Conv1D(32), Conv1D(64), Conv1D(128), Dense(1)}
Loss function                    Wasserstein distance
Learning optimizer               RMSProp
Learning rate                    1 × 10−4
Batch size                       500
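The discriminator side of Table 6 can be sketched as follows (a sketch under our assumptions; in particular, the paper does not specify the convolution padding, for which we use "same", and the helper names are ours):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cdc_discriminator(nh=20, nt=5, nx=2):
    """Critic taking the concatenated condition (nh days) and sample (nt days)."""
    inp = tf.keras.Input(shape=(nh + nt, nx))
    h = inp
    # Conv1D stack {16, 32, 64, 128}, kernel length 3, stride 2, leaky ReLU
    for filters in (16, 32, 64, 128):
        h = layers.Conv1D(filters, kernel_size=3, strides=2, padding="same")(h)
        h = layers.LeakyReLU(0.5)(h)
    h = layers.Flatten()(h)
    out = layers.Dense(1, activation="tanh")(h)
    return tf.keras.Model(inp, out)

def discriminator_input(condition, sample):
    """Concatenate a (batch, nh, nx) condition and a (batch, nt, nx) sample along time."""
    return tf.concat([condition, sample], axis=1)
```

A real or generated 5-day block is concatenated with its 20-day condition before being scored by the critic.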

Learning joint distribution and time dependence   In this study, we set the length of the training data window nt to 5, which means that the generator produces scenarios for the next 5 days at each time step. In addition, we use the values of the last 20 days as a long memory for the input data of the model, in order to compare with the results of the conditional RBM. After the training process, we generate samples of consecutive time steps to verify the quality of learning of the joint distribution of the daily returns of the S&P 500 index and the VIX index. We run 50 Monte Carlo simulations and, for each simulation, we generate 5 354 simulated observations from different random noise series. According to Table 7, we find that the CDCWGAN model also learns the joint distribution of the S&P 500 index and the VIX index quite well. In addition, the average correlation coefficient between the daily returns of the two indices over the 50 Monte Carlo replications is equal to −71%, which exactly matches the empirical correlation. In Figure 22, we clearly observe that a CDCWGAN model can capture the autocorrelation of the training dataset well, as in the case of the conditional RBM shown in Figure 17 on page 34.

Table 7: Comparison between the training sample and simulated samples using the condi-
tional deep convolutional Wasserstein GAN model

                          S&P 500 index                          VIX index
Statistic            Training    Simulated             Training    Simulated
Mean                   0.02%     0.08% (± 0.01%)         0.06%    −0.33% (± 0.02%)
Standard deviation     0.64%     0.65% (± 0.01%)         3.84%     3.58% (± 0.02%)
1st percentile        −1.93%    −1.57% (± 0.02%)        −7.31%    −7.39% (± 0.06%)
99th percentile        1.61%     1.56% (± 0.02%)        11.51%     9.74% (± 0.13%)

41
Improving the Robustness of Backtesting with Generative Models

Figure 22: Autocorrelation function of synthetic samples generated by the CDCWGAN model

[Two panels (S&P 500 index, VIX index) plotting the autocorrelation against the lag (1 to 15) for three simulated samples and the training sample]

Market generator Based on the above results, we consider that the CDCWGAN model
can also be used as a market generator. In the example of the S&P 500 and VIX indices, we
trained the model on all historical observations using the method proposed by Mariani et
al. (2019), whose generator produces scenarios over several consecutive days. Once the
training process has been calibrated, we may choose a date as the starting point and
generate new samples for several days. We then update the memory vector and generate
samples for the following days. Iteratively, the CDCWGAN model can generate a multi-
dimensional time series that has the same patterns as the training data in terms of marginal
distributions, correlation structure and time dependence. Compared with the conditional
RBM studied earlier, we do not need the normal score transformation to ensure the learning
of the marginal distributions, and the convolutional layers in the generator and the
discriminator may help extract more features from the training dataset.
As we did in Figure 18 on page 35 for the conditional RBM, Figure 23 shows three time
series of 250 trading days starting on 22/11/2000, which are generated by the trained
CDCWGAN model. These three simulated time series clearly exhibit positive autocorrelation
in each dimension and capture the negative correlation between the S&P 500 and VIX indices
very well. Compared with the historical prices, which are represented by the dashed line,
the three simulated time series share the same characteristics.
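The iterative sampling scheme described above can be sketched as follows. Here `generator` is a hypothetical stand-in for the trained CDCWGAN generator, which in practice maps the noise vector and the 20-day memory to a 5-day block of returns; for illustration it simply draws random returns.

```python
import numpy as np

rng = np.random.default_rng(42)
n_assets, memory_len, block_len = 2, 20, 5

def generator(noise, memory):
    # Hypothetical stand-in for the trained CDCWGAN generator: maps a
    # noise vector and the memory of the last 20 days to a 5-day block
    # of returns. Here we just draw random returns for illustration.
    return rng.normal(0.0, 0.01, size=(block_len, n_assets))

# Seed the memory with the last 20 observed days, then generate a
# 250-day path block by block, sliding the memory window forward.
memory = rng.normal(0.0, 0.01, size=(memory_len, n_assets))
path = []
while len(path) * block_len < 250:
    block = generator(rng.standard_normal(100), memory)
    path.append(block)
    memory = np.vstack([memory, block])[-memory_len:]

returns = np.vstack(path)[:250]
prices = np.cumprod(1.0 + returns, axis=0)  # scaled price paths as in Figure 23
```

The sliding memory window is what transmits the time dependence from one generated block to the next.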

3.3 Managing the out-of-sample robustness of backtesting methods


We now consider an application of generative models in the context of quantitative asset
management. Traditionally, we use the historical daily returns of assets to backtest an
investment strategy, meaning that only one real time series is available. Therefore, we obtain
only one estimated value for each performance and risk statistic of the strategy, such as the
annualized return, the volatility, the Sharpe ratio, the maximum drawdown, etc. Generative
models can be used here to manage the out-of-sample robustness of backtesting methods. For
instance, we may use a generative model such as an RBM or a GAN to learn the joint
distribution and the time dependence of the historical daily returns of assets. We may then
use these models to generate new time series. Finally, we may backtest our investment
strategy on these new simulated time series in order to construct the probability distribution
of the different statistics. In this approach, the estimated value of the statistic obtained with the


Figure 23: Comparison between scaled historical prices of S&P 500 and VIX indices and
synthetic time series generated by the CDCWGAN model

[Two panels (S&P 500 index, VIX index) plotting three simulated price paths and the real path against the time step]

real time series becomes one realization of its probability distribution. Suppose for
example that the maximum drawdown of the backtest is equal to 10%, whereas the live
investment strategy experiences a maximum drawdown of 20%. Does this mean that the
investment strategy has been overfitted? Certainly yes if there is zero probability of
experiencing a maximum drawdown of 20% when the strategy is backtested with generative
models. Definitely not if some samples from the generative models have produced a maximum
drawdown larger than 20%.
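The backtest statistics involved here (annualized return, volatility, Sharpe ratio, maximum drawdown and their ratios) can be computed from a daily return series as in the following sketch; the 500 simulated series are hypothetical Gaussian draws standing in for the output of a generative model.

```python
import numpy as np

def backtest_stats(returns, freq=252):
    """Annualized return, volatility, Sharpe ratio, maximum drawdown
    and the ratio of maximum drawdown to volatility for daily returns."""
    nav = np.cumprod(1.0 + returns)
    n = len(returns)
    mu = nav[-1] ** (freq / n) - 1.0
    sigma = returns.std(ddof=1) * np.sqrt(freq)
    mdd = np.max(1.0 - nav / np.maximum.accumulate(nav))
    return {"mu": mu, "sigma": sigma, "sharpe": mu / sigma,
            "mdd": mdd, "xi": mdd / sigma}

# Backtesting each simulated series gives one draw of each statistic;
# collecting the draws over simulations builds its probability distribution.
rng = np.random.default_rng(1)
mdds = [backtest_stats(rng.normal(2e-4, 2e-3, 500))["mdd"] for _ in range(500)]
```

The realized statistic of the live strategy can then be located within the empirical distribution of `mdds`.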

3.3.1 Backtest of the risk parity strategy


Our dataset consists of the daily returns of six futures contracts from January 2007 to
December 2019: three on world-wide equity indices (S&P 500, Eurostoxx 50 and Nikkei 225)
and three on 10Y sovereign bonds (US, Germany and Australia). In this study, we build a risk
parity strategy on this multi-asset investment universe. Moreover, the strategy is unfunded
and we calibrate its leverage in order to obtain a volatility of around 3%. In Figure 24, we
report the cumulative performance of the risk parity strategy. Moreover, the descriptive
statistics of performance and risk are given in Table 8, and correspond to the annualized
performance µ (x), the volatility σ (x), the Sharpe ratio SR (x), the maximum drawdown22
MDD (x) and the skew measure ξ (x), which is the ratio between the maximum drawdown
and the volatility23 .

Table 8: Descriptive statistics of the risk parity backtest

    Period        µ (x)    σ (x)    SR (x)   MDD (x)   ξ (x)
    2007 – 2019   5.30%    3.54%    1.50     9.40%     2.65
    2018 – 2019   6.97%    3.48%    2.00     3.69%     1.05

22 The maximum drawdown corresponds to a loss. For example, if the maximum drawdown is equal to
10%, the investor may face a maximum loss of 10%. This is why the maximum drawdown is expressed as a
positive value.
23 If ξ (x) is greater than 3, this indicates that the strategy has a high skewness risk.
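The construction of the backtested portfolio can be sketched with a naive inverse-volatility allocation; this is a simplifying assumption (a full equal risk contribution solution requires a numerical solver), with the 3% volatility target applied through a global leverage, as for an unfunded strategy.

```python
import numpy as np

def risk_parity_weights(returns, target_vol=0.03, freq=252):
    """Naive risk parity: inverse-volatility weights, then a global
    leverage chosen so that the ex-ante portfolio volatility matches
    the target (a simplification of a full ERC allocation)."""
    vols = returns.std(axis=0, ddof=1) * np.sqrt(freq)
    w = (1.0 / vols) / np.sum(1.0 / vols)
    cov = np.cov(returns, rowvar=False) * freq
    port_vol = np.sqrt(w @ cov @ w)
    return w * (target_vol / port_vol)  # unfunded: weights need not sum to 1

rng = np.random.default_rng(2)
# Hypothetical daily returns of three assets with different volatilities
sample = rng.normal(0.0, [0.01, 0.005, 0.002], size=(1000, 3))
w = risk_parity_weights(sample)
```

By construction the ex-ante volatility of the leveraged weights equals the target, although the realized volatility of the backtest will only be close to it.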


Figure 24: Cumulative performance of the risk parity strategy

[Line chart of the cumulative performance (base 100) from 2008 to 2020]

3.3.2 Market generator of the risk parity strategy


We first train the conditional RBM introduced in Section 3.1.2 on page 32 on the daily
returns of the six futures contracts. In this study, the training of the model is done on all
historical observations of futures contract returns from January 2007 to December 2019.
Moreover, for each date, we use the values of the last 20 days as a memory vector for
the conditional layer. We also choose a large number of hidden layer units in order to
capture the complex correlation structure of financial time series. The conditional RBM
therefore has 6 visible layer units, 256 hidden layer units and 120 conditional layer units.
In addition, we use the normal score transformation to avoid the problem of outliers in
the training dataset and improve the learning quality. Once the training process has been
calibrated, we choose 1st January 2018 as the starting point and generate new futures
contract returns iteratively for the next two years. We run 500 Monte Carlo simulations
with different random noise series and, for each simulation, we backtest our risk parity
strategy on the synthetic time series and calculate the descriptive statistics of the strategy.
Finally, we construct probability distributions for these descriptive statistics and compare
them with the results obtained by the traditional method of bootstrap sampling, which
consists of randomly sampling the historical returns with replacement.
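The bootstrap benchmark can be sketched as follows; resampling whole daily return vectors preserves the cross-sectional dependence within a day but, by construction, destroys the time dependence across days.

```python
import numpy as np

def bootstrap_paths(returns, n_paths, horizon, seed=0):
    """Classical i.i.d. bootstrap: resample historical daily return
    vectors (whole rows, so cross-asset dependence within a day is
    kept) with replacement. Time dependence across days is lost."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(returns), size=(n_paths, horizon))
    return returns[idx]  # shape (n_paths, horizon, n_assets)

# Hypothetical history of six futures contracts; 504 days covers the
# two-year simulation horizon used in the text.
hist = np.random.default_rng(3).normal(0.0, 0.01, size=(3000, 6))
paths = bootstrap_paths(hist, n_paths=500, horizon=504)
```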

Figure 25: Comparison between cumulative performance of the risk parity strategy using
synthetic time series generated by the bootstrap sampling and conditional RBM models

[Two panels (Bootstrap sampling, Conditional RBM) plotting simulated backtests and the real backtest from January 2018 to December 2019]

Figure 25 shows the cumulative performance of ten risk parity backtests using synthetic
time series generated by the two methods. Among these simulations, we notice that the
cumulative performance of the risk parity strategy using synthetic time series generated by
the bootstrap sampling method is more centered around the real backtest. This phenomenon
illustrates the drawback of the traditional bootstrap sampling method: it cannot capture
the time dependence of the risk parity strategy. As shown in Figure 26, our risk parity
strategy has a positive first-lag autocorrelation of 20%, and we compare the average
autocorrelation function over the 500 Monte Carlo simulations generated by the two
approaches. We notice that the bootstrap sampling method cannot replicate this positive
autocorrelation, whereas the conditional RBM can. This property is very important when we
backtest meta-strategies, that is strategies built on top of an existing strategy. For instance, if
we want to design a stop-loss overlay for our risk parity strategy, we have to tune several
parameters for implementing this stop-loss strategy. If we use the real historical time series
to calibrate the stop-loss parameters, we might fall into the trap of overfitting, because we
only have one real historical time series. And if we use the time series generated by
the bootstrap sampling method, we will not find the appropriate values of the stop-loss
parameters, since autocorrelation plays a key role in stop-loss strategies, as explained
by Kaminski and Lo (2014). Using the conditional RBM as a market generator to generate
time series is then a better way to find the appropriate parameters for the meta-strategy
and manage the out-of-sample robustness.
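The autocorrelation diagnostic of Figure 26 can be reproduced with a simple sample estimator; the AR(1) series below is a hypothetical stand-in with the same 20% first-lag autocorrelation as the risk parity strategy.

```python
import numpy as np

def acf(x, max_lag=20):
    """Sample autocorrelation function of a one-dimensional series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.sum(x * x)
    return np.array([np.sum(x[k:] * x[:-k]) / denom
                     for k in range(1, max_lag + 1)])

# An AR(1) series with coefficient 0.2 mimics the roughly 20%
# first-lag autocorrelation of the strategy's daily returns.
rng = np.random.default_rng(4)
eps = rng.standard_normal(10000)
x = np.zeros(10000)
for t in range(1, 10000):
    x[t] = 0.2 * x[t - 1] + eps[t]
rho = acf(x)
```

Applying `acf` to each simulated backtest and averaging over simulations gives the generated curves compared with the real one in Figure 26.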

Figure 26: Autocorrelation function of the risk parity strategy using synthetic time series
generated by the bootstrap sampling and conditional RBM methods

[Two panels (Bootstrap sampling, Conditional RBM) plotting the generated and real autocorrelation functions for lags 1 to 19]

3.3.3 Building the probability distribution of backtest statistics


In our study, we are interested in building the probability distribution of the maximum
drawdown MDD (x) and the skew measure ξ (x) of the risk parity strategy. Figure 27
shows the distributions of the maximum drawdown generated by the bootstrap sampling
and conditional RBM methods. We notice that the distribution of the maximum drawdown
generated by the conditional RBM is more concentrated and contains fewer severe scenarios.
For instance, the maximum drawdown of the real risk parity strategy from January 2018 to
December 2019 is 3.69%, which corresponds respectively to the 61.5% quantile of the
probability distribution generated by the bootstrap sampling method and the 69.1% quantile
of the probability distribution generated by the conditional RBM. We consider that the
difference comes from the quality of learning the time dependence of the dataset.
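The quantiles quoted above can be computed as simple empirical quantiles of the realized statistic within the simulated distribution; the lognormal draws below are illustrative stand-ins for the simulated maximum drawdowns.

```python
import numpy as np

def empirical_quantile(value, simulated):
    """Fraction of simulated backtest statistics below the realized
    value, i.e. the quantile of the real backtest in the simulated
    distribution."""
    simulated = np.asarray(simulated)
    return np.mean(simulated <= value)

# e.g. locating the realized 3.69% maximum drawdown within 500
# simulated maximum drawdowns (illustrative values only).
rng = np.random.default_rng(5)
sim_mdd = rng.lognormal(mean=np.log(0.035), sigma=0.3, size=500)
q = empirical_quantile(0.0369, sim_mdd)
```

A realized statistic falling at an extreme quantile (near 0% or 100%) would be evidence of overfitting or of a poorly calibrated generator.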
Although we use a volatility targeting method to control the volatility of the risk parity
strategy, we cannot ensure that the strategy volatility is exactly equal to 3%. To avoid the


Figure 27: Histogram of the maximum drawdown of the risk parity strategy using synthetic
time series generated by the bootstrap sampling and conditional RBM methods

[Two histograms (Bootstrap sampling, Conditional RBM) of the maximum drawdown between 2% and 10%, with the real value marked]

influence of differences in realized volatility, we plot in Figure 28 the probability
distribution of the skew measure. In this case, we notice that the tail of the probability
distribution generated by the conditional RBM is thinner than that generated by the bootstrap
sampling method, but contains several extreme scenarios. The skew measure of the real
risk parity strategy from January 2018 to December 2019 is equal to 1.05, which corresponds
respectively to the 66.3% quantile of the probability distribution generated by the
bootstrap sampling method and the 70.1% quantile of the probability distribution generated
by the conditional RBM.

Figure 28: Histogram of the skew measure of the risk parity strategy using synthetic time
series generated by the bootstrap sampling and the conditional RBM methods

[Two histograms (Bootstrap sampling, Conditional RBM) of the skew measure, with the real value marked]

3.3.4 Comparison with Wasserstein GAN models


For the purpose of illustration, we also train a conditional Wasserstein GAN model with
a simple multi-layer perceptron structure for the generator and the discriminator. In
order to make a comparison with the results of the conditional RBM, we again use the values
of the last 20 days as a long memory input of the model. More precisely, the
generator takes two inputs: a 100-dimensional random noise vector and the 20 past values,
which are concatenated into a 20 × 6-dimensional vector. We then construct the generator
using 4 dense layers with the structure SG = {100, 50, 10, 6}. We use a leaky ReLU function
with α = 0.5 for each dense layer and a sigmoid activation function for the output layer.
The discriminator also takes two inputs: a 6-dimensional vector of current daily returns
of the futures contracts and a 20 × 6-dimensional vector of past values. We then construct
the discriminator using 4 dense layers with the structure SD = {100, 50, 10, 1}. For the
first three dense layers, the activation function is a leaky ReLU function with α = 0.5,
whereas the activation function of the output layer is a tanh function. To avoid the
problem of outliers, we also use the normal score transformation, as in the case of the
conditional RBM, before applying the MinMax scaling function to the input data. In
Figure 29, we compare the distributions of the skew measure of the risk parity strategy using
synthetic time series generated by the conditional RBM and conditional Wasserstein GAN
models. We notice that the two distributions are similar, except that we have several
extreme scenarios in the case of the conditional RBM. We recall that the skew measure of
the real risk parity strategy from January 2018 to December 2019 is equal to 1.05, which
corresponds respectively to the 70.1% quantile of the probability distribution generated by
the conditional RBM and the 72.9% quantile of the probability distribution generated by
the conditional Wasserstein GAN.
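As an illustration of the generator architecture described above, its forward pass can be sketched in plain NumPy with randomly initialized weights; this is a sketch only, since the actual model is built and trained with TensorFlow.

```python
import numpy as np

def leaky_relu(z, alpha=0.5):
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_generator(noise, memory, weights):
    """Forward pass of the conditional generator: a 100-d noise vector
    concatenated with the flattened 20 x 6 memory, pushed through dense
    layers of sizes SG = {100, 50, 10, 6} with leaky ReLU (alpha = 0.5)
    and a sigmoid output layer."""
    h = np.concatenate([noise, memory.ravel()])
    for i, (W, b) in enumerate(weights):
        h = W @ h + b
        h = sigmoid(h) if i == len(weights) - 1 else leaky_relu(h)
    return h

rng = np.random.default_rng(6)
sizes = [100 + 20 * 6, 100, 50, 10, 6]  # input dimension then SG
weights = [(rng.normal(0, 0.1, (m, n)), np.zeros(m))
           for n, m in zip(sizes[:-1], sizes[1:])]
out = mlp_generator(rng.standard_normal(100),
                    rng.standard_normal((20, 6)), weights)
```

The sigmoid output matches the MinMax scaling of the training data, so the generated vector lives in the same [0, 1] range before being rescaled to returns.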

Figure 29: Histogram of the skew measure of the risk parity strategy using synthetic time
series generated by the conditional RBM and Wasserstein GAN models

[Two histograms (Conditional RBM, Conditional Wasserstein GAN) of the skew measure, with the real value marked]

3.3.5 Augmenting the investment universe with market regime indicators


In the real world of finance, the correlation structure is very complex, and learning the joint
distribution of the daily returns of futures contracts from the time series themselves is not
sufficient. For instance, if we train the generative models by including the time series of the
VIX index, we will certainly obtain different results. Indeed, since the daily returns of
the VIX index have a strongly negative correlation with equity futures contracts and a positive
correlation with bond futures contracts, and since their distribution is very leptokurtic, the
generative models trained on the historical returns of the futures contracts and the VIX index
will generate more distinct and more severe scenarios. Figure 30 shows the histograms of
the skew measure of the risk parity strategy using synthetic time series generated by two
conditional RBMs. The first one is trained with only the historical daily returns of the futures
contracts, while the other is trained with the historical daily returns of the futures contracts
and the VIX index. In this figure, we notice that the probability distribution generated by
the model trained with the VIX index has a fatter tail and more extreme scenarios. Therefore,
we consider that this model may generate more realistic data than the model trained using
only the time series of the futures contracts. In order to obtain a high-quality market
generator, we believe that we should train a conditional RBM or Wasserstein GAN model
not only with the time series of the assets that compose the investment portfolio, but also
with those of several market regime indicators such as the VIX index.

Figure 30: Histogram of the skew measure of the risk parity strategy using synthetic time
series generated by two conditional RBMs

[Two histograms (Trained with returns of assets, Trained with returns of assets and VIX index) of the skew measure, with the real value marked]

4 Conclusion
In this article, we explore the use of generative models for improving the robustness of trading
strategies. We consider two approaches for simulating financial time series. The first one is
based on restricted Boltzmann machines, whereas the second approach corresponds to the
family of generative adversarial networks, including Wasserstein distance models. Given an
historical sample of financial data, we show how to generate new samples of financial data
using these techniques. These new samples of financial data are called synthetic or fake
financial time series, and the objective is to preserve the statistical properties of the original
sample. By statistical properties, we mean the statistical moments of each univariate time
series, the stochastic dependence between the different variables that compose the multi-
dimensional random vector, and also the time dependence between the observations. If we
consider the financial time series as a multi-dimensional data matrix, the challenge is then
to model both the row and column stochastic structures.
There are few satisfactory methods for simulating non-Gaussian multi-dimensional financial
time series. For instance, the bootstrap method does not preserve the cross-correlation
between the different variables. The copula method is better, but it must be combined with
conditional data augmentation techniques in order to reproduce the autocorrelation functions.
Restricted Boltzmann machines and generative adversarial networks have been successful
for generating complex data with non-linear dependence structures. By applying them to
financial market data, our first results are encouraging and show that these new alternative
techniques may help simulate non-Gaussian multi-dimensional financial time series.
This is particularly true when we consider the backtesting of trading strategies. In this
case, RBMs and GANs may be used for estimating the probability distribution of perfor-
mance and risk statistics of the backtest. This opens the door to a new field of research for
improving the risk management of quantitative investment strategies.


References
Ackley, D., Hinton, G.E., and Sejnowski, T.J. (1985), A Learning Algorithm for Boltz-
mann Machines, Cognitive Science, 9(1), pp. 147-169.

Arjovsky, M., Chintala, S., and Bottou, L. (2017a), Wasserstein GAN, arXiv,
1701.07875.

Arjovsky, M., Chintala, S., and Bottou, L. (2017b), Wasserstein Generative Adversarial
Networks, in Precup, D., and Teh, Y.W. (eds), Proceedings of the 34th International
Conference on Machine Learning, 70, pp. 214-223.

Barbu, V., and Precupanu, T. (2012), Convexity and Optimization in Banach spaces,
Fourth edition, Springer Monographs in Mathematics, Springer.

Bengio, Y., and Delalleau, O. (2009), Justifying and Generalizing Contrastive Diver-
gence, Neural Computation, 21(6), pp. 1601-1621.

Brenier, Y. (1991), Polar Factorization and Monotone Rearrangement of Vector-valued
Functions, Communications on Pure and Applied Mathematics, 44(4), pp. 375-417.

Broniatowski, M., and Keziou, A. (2006), Minimization of φ-divergences on Sets of
Signed Measures, Studia Scientiarum Mathematicarum Hungarica, Akadémiai Kiadó,
43(4), pp. 403-442.

Cho, K., Ilin, A., and Raiko, T. (2011), Improved Learning of Gaussian-Bernoulli Restricted
Boltzmann Machines, in Proceedings of the Twentieth International Conference on
Artificial Neural Networks (ICANN) 2011, pp. 10-17.

Chu, C., Blanchet, J., and Glynn, P. (2019), Probability Functional Descent: A Uni-
fying Perspective on GANs, Variational Inference, and Reinforcement Learning, arXiv,
1901.10691.

Clevert, D., Unterthiner, T., and Hochreiter, S. (2015), Fast and Accurate Deep
Network Learning by Exponential Linear Units (ELUs), arXiv, 1511.07289.

Cont, R. (2001), Empirical Properties of Asset Returns: Stylized Facts and Statistical
Issues, Quantitative Finance, 1(2), pp. 223-236.

Cui, Z., Chen, W., and Chen, Y. (2016), Multi-scale Convolutional Neural Networks for
Time Series Classification, arXiv, 1603.06995.

Denton, E.L., Chintala, S., Szlam, A., and Fergus, R. (2015), Deep Generative Image
Models Using a Laplacian Pyramid of Adversarial Networks, in Cortes, C., Lawrence,
N.D., Lee, D.D., Sugiyama, M., and Garnett, R. (eds), Advances in Neural Information
Processing Systems, 28, pp. 1486-1494.

Dumoulin, V., and Visin, F. (2016), A Guide to Convolution Arithmetic for Deep Learning,
arXiv, 1603.07285.

Fernholz, L.T. (2012), Von Mises Calculus for Statistical Functionals, Lecture Notes in
Statistics, 19, Springer.

Fischer, A., and Igel, C. (2014), Training Restricted Boltzmann Machines: An Introduction,
Pattern Recognition, 47(1), pp. 25-39.


Givens, C.R., and Shortt, R.M. (1984), A Class of Wasserstein Metrics for Probability
Distributions, Michigan Mathematical Journal, 31(2), pp. 231-240.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., and Bengio, Y. (2014), Generative Adversarial Nets, in Ghahramani,
Z., Welling, M., Cortes, C., Lawrence, N.D., and Weinberger, K.Q. (eds), Advances in
Neural Information Processing Systems, 27, pp. 2672-2680.
Gozlan, N., Roberto, C., Samson, P.M., and Tetali, P. (2017), Kantorovich Duality
for General Transport Costs and Applications, Journal of Functional Analysis, 273(11),
pp. 3327-3405.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A.C. (2017),
Improved Training of Wasserstein GANs, in Guyon, I, Luxburg, U.V., Bengio, S., Wallach,
H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds), Advances in Neural Information
Processing Systems, 30, pp. 5767-5777.
Hinton, G.E. (2002), Training Products of Experts by Minimizing Contrastive Divergence,
Neural Computation, 14(8), pp. 1771-1800.
Hinton, G.E. (2012), A Practical Guide to Training Restricted Boltzmann Machines, in
Montavon, G., Orr, G.B., Müller, K-R. (eds), Neural Networks: Tricks of The Trade, pp.
599-619, Second edition, Springer.
Hinton, G.E., and Sejnowski, T.J. (1986), Learning and Relearning in Boltzmann Ma-
chines, Chapter 7 in Rumelhart, D.E., and McClelland, J.L. (eds), Parallel Distributed
Processing: Explorations in the Microstructure of Cognition, 1, pp. 282-317, MIT Press.
Hyland, S.L., Esteban, C., and Rätsch, G. (2017), Real-valued (Medical) Time Series
Generation with Recurrent Conditional GANs, arXiv, 1706.02633.
Isola, P., Zhu, J.Y., Zhou, T., and Efros, A.A. (2017), Image-to-image Translation
with Conditional Adversarial Networks, Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 1125-1134.
Jebara, T. (2004), Machine Learning: Discriminative and Generative, Springer Interna-
tional Series in Engineering and Computer Science, 755, Springer.
Kaminski, K.M., and Lo, A.W. (2014), When Do Stop-loss Rules Stop Losses?, Journal of
Financial Markets, 18, pp. 234-254.
Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018), Progressive Growing of GANs
for Improved Quality, Stability, and Variation, International Conference on Learning Rep-
resentations (ICLR 2018), 7, available on arXiv, 1710.10196.
Keziou, A. (2003), Dual Representation of φ-divergences and Applications, Comptes Rendus
de l’Académie des Sciences – Series I – Mathematics, 336(10), pp. 857-862.
Kindermann, R. and Snell, J.L. (1980), Markov Random Fields and Their Applications,
Contemporary Mathematics, American Mathematical Society.
Koller, D. and Friedman, N. (2009), Probabilistic Graphical Models: Principles and
Techniques, MIT Press.
Kondratyev, A., and Schwarz, C. (2020), The Market Generator, SSRN, https://www.ssrn.com/abstract=3384948.


Koshiyama, A., Firoozye, N., and Treleaven, P. (2019), Generative Adversarial Net-
works for Financial Trading Strategies Fine-Tuning and Combination, arXiv, 1901.01751.

Krizhevsky, A. (2009), Learning Multiple Layers of Features from Tiny Images, University
of Toronto, Technical Report.

Laschos, V., Obermayer, K., Shen, Y., and Stannat, W. (2019), A Fenchel-Moreau-
Rockafellar Type Theorem on the Kantorovich-Wasserstein Space with Applications in
Partially Observable Markov Decision Processes, Journal of Mathematical Analysis and
Applications, 477(2), pp. 1133-1156.

LeCun, Y., and Bengio, Y. (1995), Convolutional Networks for Images, Speech, and Time
Series, in Arbib, M.A. (ed.), The Handbook of Brain Theory and Neural Networks, MIT
Press.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M.A., and Huang, F.J. (2007), A
Tutorial on Energy-based Learning, Chapter 10 in Bakır, G., Hofmann, T., Schölkopf, B.,
Smola, A.J., Taskar, B., and Vishwanathan, S.V.N. (eds), Predicting Structured Data, pp.
191-246, MIT Press.

Liu, R., Lehman, J., Molino, P., Such, F.P., Frank, E., Sergeev, A., and Yosinski,
J. (2018), An Intriguing Failing of Convolutional Neural Networks and the Coordconv
Solution, in Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and
Garnett, R. (eds), Advances in Neural Information Processing Systems, 31, pp. 9605-9616.

Long, P.M., and Servedio, R.A. (2010), Restricted Boltzmann Machines are Hard to
Approximately Evaluate or Simulate, in Fürnkranz, J., and Joachims, T. (eds), Proceedings
of the 27th International Conference on Machine Learning (ICML'10), pp. 703-710,
Omnipress.

Mao, X., Li, Q., Xie, H., Lau, R.Y.K., Wang, Z., and Smolley, P.S. (2017), Least Squares
Generative Adversarial Networks, Proceedings of the IEEE International Conference on
Computer Vision (ICCV), pp. 2794-2802.

Mariani, G., Zhu, Y., Li, J., Scheidegger, F., Istrate, R., Bekas, C., Cristiano,
A., and Malossi, I. (2019), PAGAN: Portfolio Analysis with Generative Adversarial
Networks, arXiv, 1909.10578.

Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. (2017), Unrolled Generative Ad-
versarial Networks, International Conference on Learning Representations (ICLR 2018),
7, available on arXiv, 1611.02163.

Mirza, M., and Osindero, S. (2014), Conditional Generative Adversarial Nets, arXiv,
1411.1784.

Mroueh, Y., and Sercu, T. (2017), Fisher GAN, in Guyon, I, Luxburg, U.V., Bengio, S.,
Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds), Advances in Neural
Information Processing Systems, 30, pp. 2513-2523.

Mroueh, Y., Li, C.L., Sercu, T., Raj, A., and Cheng, Y. (2017), Sobolev GAN, In-
ternational Conference on Learning Representations (ICLR 2018), 7, available on arXiv,
1711.04894.

Müller, A. (1997), Integral Probability Metrics and Their Generating Classes of Functions,
Advances in Applied Probability, 29(2), pp. 429-443.


Nguyen, X., Wainwright, M.J., and Jordan, M.I. (2010), Estimating Divergence Func-
tionals and the Likelihood Ratio by Convex Risk Minimization, IEEE Transactions on
Information Theory, 56(11), pp. 5847-5861.
Nowozin, S., Cseke, B., and Tomioka, R. (2016), f-GAN: Training Generative Neural
Samplers using Variational Divergence Minimization, in Lee, D.D., Sugiyama, M.,
Luxburg, U.V., Guyon, I., and Garnett, R. (eds), Advances in Neural Information Pro-
cessing Systems, 29, pp. 271-279.
Pearl, J. (1985), Bayesian Networks: A Model of Self-activated Memory for Evidential
Reasoning, in Proceedings of the 7th Conference of the Cognitive Science Society, pp.
329-334.
Peyré, G., and Cuturi, M. (2019), Computational Optimal Transport, Foundations and
Trends in Machine Learning, 11(5-6), pp. 355-607.
Rachev, S.T. (1985), The Monge-Kantorovich Mass Transference Problem and its Stochas-
tic Applications, Theory of Probability & Its Applications, 29(4), pp. 647-676.
Rachev, S.T., and Rüschendorf, L. (1998), Mass Transportation Problems: Theory (Vol-
ume 1), Springer.
Radford, A., Metz, L., and Chintala, S. (2016), Unsupervised Representation Learning
with Deep Convolutional Generative Adversarial Networks, International Conference on
Learning Representations (ICLR 2016), available on arXiv, 1511.06434.
Rubner, Y., Tomasi, C., and Guibas, L.J. (2000), The Earth Mover’s Distance as a Metric
for Image Retrieval, International Journal of Computer Vision, 40(2), pp. 99-121.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X.
(2016), Improved Techniques for Training GANs, in Lee, D.D., Sugiyama, M., Luxburg,
U.V., Guyon, I., and Garnett, R. (eds), Advances in Neural Information Processing Sys-
tems, 29, pp. 2234-2242.
Seguy, V., Damodaran, B.B., Flamary, R., Courty, N., Rolet, A., and Blondel, M.
(2017), Large-scale Optimal Transport and Mapping Estimation, arXiv, 1711.02283.
Smolensky, P. (1986), Information Processing in Dynamical Systems: Foundations of Har-
mony Theory, Chapter 6 in Rumelhart, D.E., and McClelland, J.L. (eds), Parallel Dis-
tributed Processing: Explorations in the Microstructure of Cognition, 1, pp. 194-281, MIT
Press.
Taylor, G.W., Hinton, G.E., and Roweis, S.T. (2011), Two Distributed-state Models
For Generating High-Dimensional Time Series, Journal of Machine Learning Research,
12, pp. 1025-1068.
Villani, C. (2008), Optimal Transport: Old and New, Grundlehren der mathematischen
Wissenschaften, 338, Springer.
Wiese, M., Knobloch, R., Korn, R., and Kretschmer, P. (2020), Quant GANs: Deep
Generation of Financial Time Series, Quantitative Finance, forthcoming.
Xia, Q. (2008), Numerical Simulation of Optimal Transport Paths, arXiv, 0807.3723.
Xia, Q. (2009), The Geodesic Problem in Quasimetric Spaces, Journal of Geometric Anal-
ysis, 19(2), pp. 452-479.


Appendix
A Mathematical results
A.1 Fundamental concepts of undirected graph model
Probabilistic graphical models use graphs to describe interactions between random variables.
Each random variable is represented by a node (or vertex) and each direct interaction
between random variables is represented by an edge (or link). According to the directionality
of the edges in the graph, probabilistic graphical models can be divided into two categories:
directed graphical models and undirected graphical models. For instance, Bayesian networks,
introduced by Pearl (1985), are a type of directed graphical model, whereas Markov random
fields (or Markov networks) use undirected graphs (Kindermann and Snell, 1980).

A.1.1 Undirected graph


An undirected graph G = (V, E) is defined by its finite set of nodes V and its set of
undirected edges E. Each edge ei,j is defined by a pair of connected nodes vi and
vj from V. We define the neighborhood N (vi ) of a given node vi as the set of all nodes
connected to vi :
N (vi ) = {vj ∈ V | ei,j ∈ E}
A clique of size n, named Cn , is a subset of V containing n nodes (v1 , v2 , . . . , vn ) such that:

∀i, j ∈ {1, 2, . . . , n} , i ≠ j : ei,j ∈ E

In other words, each node belonging to Cn is fully connected with the other nodes of Cn . A
clique of a graph G is called maximal if we can’t create a bigger clique by adding another
node of G, meaning that no node in G can be added such that the resulting set is still a
clique.
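As a small illustration of these definitions, the neighborhood and clique tests can be coded directly on adjacency sets (the toy graph below is hypothetical, not taken from the paper):

```python
# Minimal illustration of neighborhoods and cliques in an undirected graph,
# represented by adjacency sets.

def neighborhood(adj, v):
    """N(v): all nodes sharing an edge with v."""
    return adj[v]

def is_clique(adj, nodes):
    """True if every pair of distinct nodes in `nodes` is connected."""
    nodes = list(nodes)
    return all(nodes[j] in adj[nodes[i]]
               for i in range(len(nodes))
               for j in range(i + 1, len(nodes)))

def is_maximal_clique(adj, nodes):
    """A clique is maximal if no outside node is connected to all its members."""
    outside = set(adj) - set(nodes)
    return is_clique(adj, nodes) and not any(
        all(u in adj[w] for u in nodes) for w in outside)

# A 4-node graph: the triangle {0, 1, 2} plus a pendant edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

Here {0, 1} is a clique but not a maximal one (node 2 can be added), whereas {0, 1, 2} and {2, 3} are maximal cliques.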

A.1.2 Markov random field


Let X = (X(v_1), X(v_2), ..., X(v_n)) be a set of random variables associated with the undirected graph G = (V, E) such that each random variable X(v_i) is linked to the i-th node v_i ∈ V. X is said to be a Markov random field (MRF) if, for all i ∈ {1, 2, ..., n}, X(v_i) is conditionally independent from all other variables X(v_j) whose nodes v_j do not belong to the neighborhood N(v_i):
$$P(x(v_i) \mid x(v_j), v_j \in \mathcal{V} \setminus v_i) = P(x(v_i) \mid x(v_j), v_j \in \mathcal{N}(v_i))$$
In other words, X is a Markov random field if its joint probability distribution verifies this local Markov property.

A.1.3 Hammersley-Clifford theorem


Let X = (X(v_1), X(v_2), ..., X(v_n)) be a Markov random field and C the set of all maximal cliques of the undirected graph G = (V, E). According to Fischer and Igel (2014), a simple version24 of the Hammersley-Clifford theorem states that a strictly positive distribution P satisfies the Markov property with respect to the undirected graph G if and only if P factorizes over G. This means that there exists a set of strictly positive functions {ψ_C, C ∈ C} such that the joint probability distribution is given by a product of factors:
$$P(x) = P(x(v_1), \ldots, x(v_n)) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x(v_C)) \tag{21}$$
where ψ_C is the potential function for the clique C, v_C denotes the nodes belonging to the clique C, and Z is the partition function given by:
$$Z = \sum_x \prod_{C \in \mathcal{C}} \psi_C(x(v_C))$$
The partition function Z is the normalization constant that ensures the overall distribution sums to 1.
24 The rigorous formulation of the Hammersley-Clifford theorem can be found in Koller and Friedman (2009).

A.1.4 Energy function and Boltzmann distribution


The Hammersley-Clifford theorem is valid only if each potential function ψ_C is strictly positive. We can therefore introduce a new function E in order to rewrite the probability distribution P(x):
$$P(x) = \frac{1}{Z} \exp\left(\sum_{C \in \mathcal{C}} \log \psi_C(x(v_C))\right) = \frac{1}{Z}\, e^{-E(x)} \tag{22}$$
where $E(x) = -\sum_{C \in \mathcal{C}} \log \psi_C(x(v_C))$ is called the energy function. Because the exponential function is always positive, the energy function yields a positive probability for any state; moreover, a large value of the energy indicates a low probability of the state. According to LeCun et al. (2007), models of this form are called energy-based models.
Using the energy function described above, the strictly positive probability distribution of a Markov random field can be expressed in the form $P(x) = Z^{-1} e^{-E(x)}$. This form of distribution is also called the Boltzmann (or Gibbs) distribution for a system in statistical physics. It is defined as:
$$p_i = \frac{1}{Z_T} \exp\left(-\frac{E_i}{kT}\right)$$
where p_i is the probability of state i of the system, E_i is the energy of state i, k is the Boltzmann constant, and T is the temperature of the system. If N is the number of states accessible to the system, the normalization constant is $Z_T = \sum_{i=1}^N e^{-E_i/(kT)}$. If we set kT to 1, the Boltzmann distribution and the probability distribution of a Markov random field have the same formula. For this reason, many energy-based models are called Boltzmann machines.
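The Boltzmann distribution with kT = 1 can be computed directly from a vector of state energies (a minimal sketch; the energy values below are arbitrary):

```python
import numpy as np

# Boltzmann probabilities p_i proportional to exp(-E_i / (kT)) over a finite
# set of states. With kT = 1 this is exactly the MRF distribution
# P(x) = exp(-E(x)) / Z.

def boltzmann(energies, kT=1.0):
    w = np.exp(-np.asarray(energies, dtype=float) / kT)
    return w / w.sum()   # dividing by the partition function Z_T

p = boltzmann([1.0, 2.0, 3.0])
```

As expected, lower energy means higher probability, and the probabilities sum to one.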

A.1.5 The example of restricted Boltzmann machines


The restricted Boltzmann machine introduced in Section 2.1 on page 8 is a Markov random field associated with a bipartite undirected graph G = (V, E). In other words, all visible layer units and hidden layer units can be considered as nodes in an undirected graph25:
$$\mathcal{V} = \{V_1, V_2, \ldots, V_m, H_1, H_2, \ldots, H_n\}$$
and all connections between the visible layer and the hidden layer are edges of G. In the case of RBMs, the graph G contains only cliques of size 1 (one visible unit or one hidden unit) and cliques of size 2 (a pair of one visible unit and one hidden unit). In addition, it is easy to show that all these cliques are maximal. Let C1 and C2 be respectively the set of all cliques of size 1 and the set of all cliques of size 2. We obtain:
$$\mathcal{C}_1 = \{\{V_1\}, \{V_2\}, \ldots, \{V_m\}, \{H_1\}, \{H_2\}, \ldots, \{H_n\}\}$$
and:
$$\mathcal{C}_2 = \{\{V_1, H_1\}, \ldots, \{V_i, H_j\}, \ldots, \{V_m, H_n\}\}$$
25 For simplicity, we slightly abuse notation here by using V_i and H_j to represent not only the nodes of the graph but also the random variables associated with these nodes, and we use v_i and h_j to denote the possible values of the variables associated with the i-th visible unit and the j-th hidden unit, respectively.
According to the Hammersley-Clifford theorem, the probability distribution of an RBM is given by:
$$P(v, h) = \frac{1}{Z} \prod_{C \in \mathcal{C}_1 \cup \mathcal{C}_2} \psi_C = \frac{1}{Z} \prod_{i=1}^m \psi_{V_i}(v_i) \prod_{j=1}^n \psi_{H_j}(h_j) \prod_{i=1}^m \prod_{j=1}^n \psi_{V_i, H_j}(v_i, h_j) = \frac{1}{Z}\, e^{-E(v,h)}$$
where:
$$\begin{aligned}
E(v, h) &= -\log\left(\prod_{i=1}^m \psi_{V_i}(v_i) \prod_{j=1}^n \psi_{H_j}(h_j) \prod_{i=1}^m \prod_{j=1}^n \psi_{V_i, H_j}(v_i, h_j)\right) \\
&= -\sum_{i=1}^m \log \psi_{V_i}(v_i) - \sum_{j=1}^n \log \psi_{H_j}(h_j) - \sum_{i=1}^m \sum_{j=1}^n \log \psi_{V_i, H_j}(v_i, h_j) \\
&= \sum_{i=1}^m E_i(v_i) + \sum_{j=1}^n E_j(h_j) + \sum_{i=1}^m \sum_{j=1}^n E_{i,j}(v_i, h_j)
\end{aligned}$$
In the case of the Bernoulli RBMs introduced in Section 2.1.1 on page 9, we defined $E_i(v_i) = -a_i v_i$, $E_j(h_j) = -b_j h_j$ and $E_{i,j}(v_i, h_j) = -w_{i,j} v_i h_j$. It follows that the energy function of a Bernoulli RBM is equal to:
$$E(v, h) = -\sum_{i=1}^m a_i v_i - \sum_{j=1}^n b_j h_j - \sum_{i=1}^m \sum_{j=1}^n w_{i,j} v_i h_j$$
where a_i and b_j are the bias terms associated with the visible and hidden variables V_i and H_j, and w_{i,j} is the weight associated with the edge between V_i and H_j.
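The Bernoulli RBM energy function translates line by line into NumPy (the layer sizes and parameter values below are illustrative only):

```python
import numpy as np

# Energy of a Bernoulli RBM: E(v, h) = -a'v - b'h - v'Wh

def rbm_energy(v, h, a, b, W):
    return -a @ v - b @ h - v @ W @ h

rng = np.random.default_rng(0)
m, n = 4, 3                                  # visible and hidden units
a, b = rng.normal(size=m), rng.normal(size=n)
W = rng.normal(size=(m, n))
v = np.array([1.0, 0.0, 1.0, 1.0])           # one visible configuration
h = np.array([0.0, 1.0, 1.0])                # one hidden configuration

E = rbm_energy(v, h, a, b, W)
# The unnormalized probability exp(-E) is strictly positive, as the
# Hammersley-Clifford factorization requires.
```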


A.2 Calculus formulas for restricted Boltzmann machines


A.2.1 Conditional probability P (hj = 1 | v)

We have:
$$\begin{aligned}
\sum_h P(h \mid v)\, h_j &= \sum_h \prod_{k=1}^n P(h_k \mid v)\, h_j \\
&= \sum_h P(h_{-j} \mid v)\, P(h_j \mid v)\, h_j \\
&= \sum_{h_j \in \{0,1\}} \sum_{h_{-j}} P(h_{-j} \mid v)\, P(h_j \mid v)\, h_j \\
&= \sum_{h_j \in \{0,1\}} P(h_j \mid v)\, h_j \cdot \underbrace{\sum_{h_{-j}} P(h_{-j} \mid v)}_{=1} \\
&= P(h_j = 1 \mid v)
\end{aligned} \tag{23}$$
where $h_{-j} = (h_1, h_2, \ldots, h_{j-1}, h_{j+1}, \ldots, h_n)$ denotes the state of all hidden units except the j-th one.

A.2.2 Bernoulli RBMs and neural networks

Following Fischer and Igel (2014), we divide the energy function E(v, h) into two parts: one collecting all terms involving v_i and one collecting all the other terms:
$$\begin{aligned}
E(v, h) &= -\sum_{k=1}^m a_k v_k - \sum_{j=1}^n b_j h_j - \sum_{k=1}^m \sum_{j=1}^n w_{k,j} v_k h_j \\
&= -a_i v_i - \sum_{j=1}^n w_{i,j} v_i h_j - \sum_{k \neq i} a_k v_k - \sum_{j=1}^n b_j h_j - \sum_{k \neq i} \sum_{j=1}^n w_{k,j} v_k h_j \\
&= v_i\, \alpha_i(h) + \beta(v_{-i}, h)
\end{aligned}$$
where:
$$\alpha_i(h) = -a_i - \sum_{j=1}^n w_{i,j} h_j$$
and:
$$\beta(v_{-i}, h) = -\sum_{k \neq i} a_k v_k - \sum_{j=1}^n b_j h_j - \sum_{k \neq i} \sum_{j=1}^n w_{k,j} v_k h_j$$

The Bayes theorem gives:
$$\begin{aligned}
P(v_i = 1 \mid h) &= P(v_i = 1 \mid v_{-i}, h) \\
&= \frac{P(v_i = 1, v_{-i}, h)}{P(v_{-i}, h)} \\
&= \frac{P(v_i = 1, v_{-i}, h)}{P(v_i = 0, v_{-i}, h) + P(v_i = 1, v_{-i}, h)}
\end{aligned}$$
We deduce that the conditional probability $P(v_i = 1 \mid h)$ is equal to:
$$\begin{aligned}
P(v_i = 1 \mid h) &= \frac{e^{-E(v_i = 1, v_{-i}, h)}}{e^{-E(v_i = 0, v_{-i}, h)} + e^{-E(v_i = 1, v_{-i}, h)}} \\
&= \frac{e^{-1 \cdot \alpha_i(h) - \beta(v_{-i}, h)}}{e^{-0 \cdot \alpha_i(h) - \beta(v_{-i}, h)} + e^{-1 \cdot \alpha_i(h) - \beta(v_{-i}, h)}} \\
&= \frac{e^{-\alpha_i(h)}}{1 + e^{-\alpha_i(h)}} \\
&= \frac{1}{1 + e^{\alpha_i(h)}} \\
&= \sigma(-\alpha_i(h))
\end{aligned}$$
where $\sigma(x)$ is the sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Similarly, we can divide the energy function E(v, h) into two parts: one collecting all terms involving h_j and one collecting all the other terms:
$$E(v, h) = h_j\, \gamma_j(v) + \delta(v, h_{-j})$$
where:
$$\gamma_j(v) = -b_j - \sum_{i=1}^m w_{i,j} v_i$$
and:
$$\delta(v, h_{-j}) = -\sum_{i=1}^m a_i v_i - \sum_{k \neq j} b_k h_k - \sum_{i=1}^m \sum_{k \neq j} w_{i,k} v_i h_k$$
Using the same approach as previously, we can show that:
$$P(h_j = 1 \mid v) = \sigma(-\gamma_j(v))$$
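Both conditional distributions therefore reduce to a sigmoid of an affine function of the other layer; a minimal NumPy sketch (the sizes and parameter values are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# P(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i) = sigmoid(-gamma_j(v))
def p_h_given_v(v, b, W):
    return sigmoid(b + v @ W)

# P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j) = sigmoid(-alpha_i(h))
def p_v_given_h(h, a, W):
    return sigmoid(a + W @ h)

rng = np.random.default_rng(1)
m, n = 4, 3
a, b, W = rng.normal(size=m), rng.normal(size=n), rng.normal(size=(m, n))
v = np.array([1.0, 0.0, 1.0, 0.0])
ph = p_h_given_v(v, b, W)        # vector of n conditional probabilities
```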

A.2.3 Gradient of the Bernoulli RBM log-likelihood function

We have:
$$\ell(\theta \mid v) = \log \sum_h e^{-E(v,h)} - \log \sum_{v',h} e^{-E(v',h)}$$
Fischer and Igel (2014) computed the log-likelihood gradient $\nabla_\theta(v) = \partial_\theta\, \ell(\theta \mid v)$:
$$\nabla_\theta(v) = \frac{\partial}{\partial \theta} \log \sum_h e^{-E(v,h)} - \frac{\partial}{\partial \theta} \log \sum_{v',h} e^{-E(v',h)} = \nabla_\theta^{(1)}(v) + \nabla_\theta^{(2)}(v') \tag{24}$$
We have26:
$$\begin{aligned}
\nabla_\theta^{(1)}(v) &= -\frac{\sum_h e^{-E(v,h)}\, \partial_\theta E(v,h)}{\sum_h e^{-E(v,h)}} \\
&= -\sum_h \frac{e^{-E(v,h)}}{\sum_{h'} e^{-E(v,h')}}\, \partial_\theta E(v,h) \\
&= -\sum_h P(h \mid v)\, \frac{\partial E(v,h)}{\partial \theta}
\end{aligned}$$
and:
$$\begin{aligned}
\nabla_\theta^{(2)}(v') &= \frac{\sum_{v',h} e^{-E(v',h)}\, \partial_\theta E(v',h)}{\sum_{v',h} e^{-E(v',h)}} \\
&= \sum_{v',h} \frac{e^{-E(v',h)}}{\sum_{v'',h'} e^{-E(v'',h')}}\, \frac{\partial E(v',h)}{\partial \theta} \\
&= \sum_{v',h} P(v',h)\, \frac{\partial E(v',h)}{\partial \theta} \\
&= \sum_{v'} P(v') \sum_h P(h \mid v')\, \frac{\partial E(v',h)}{\partial \theta}
\end{aligned}$$

We recall that $\partial_{a_i} E(v,h) = -v_i$ and $\partial_{b_j} E(v,h) = -h_j$. It follows that:
$$\begin{aligned}
\frac{\partial\, \ell(\theta \mid v)}{\partial a_i} &= -\sum_h P(h \mid v)\, \frac{\partial E(v,h)}{\partial a_i} + \sum_{v'} P(v') \sum_h P(h \mid v')\, \frac{\partial E(v',h)}{\partial a_i} \\
&= \sum_h P(h \mid v)\, v_i - \sum_{v'} P(v') \sum_h P(h \mid v')\, v_i' \\
&= v_i - \sum_{v'} P(v')\, v_i'
\end{aligned}$$
and27:
$$\begin{aligned}
\frac{\partial\, \ell(\theta \mid v)}{\partial b_j} &= \sum_h P(h \mid v)\, h_j - \sum_{v'} P(v') \sum_h P(h \mid v')\, h_j \\
&= P(h_j = 1 \mid v) - \sum_{v'} P(v')\, P(h_j = 1 \mid v')
\end{aligned}$$
For the gradient with respect to W, we have $\partial_{w_{i,j}} E(v,h) = -v_i h_j$ and:
$$\begin{aligned}
\frac{\partial\, \ell(\theta \mid v)}{\partial w_{i,j}} &= \sum_h P(h \mid v)\, v_i h_j - \sum_{v'} P(v') \sum_h P(h \mid v')\, v_i' h_j \\
&= P(h_j = 1 \mid v)\, v_i - \sum_{v'} P(v')\, P(h_j = 1 \mid v')\, v_i'
\end{aligned}$$
26 Using the Bayes theorem, the conditional probability distribution P(h | v) is equal to:
$$P(h \mid v) = \frac{P(v, h)}{P(v)} = \frac{e^{-E(v,h)}}{\sum_{h'} e^{-E(v,h')}}$$
27 We use Equation (23) on page 56.

We notice that we have to sum over $2^m$ possible combinations of the visible variables when calculating the second term $\nabla_\theta^{(2)}(v)$. Therefore, we generally approximate the expectation over $P(v')$ by sampling from the model distribution.

A.2.4 Gradient of the contrastive divergence


The contrastive divergence function is equal to:
$$\mathrm{CD}(k) = \mathrm{KL}\left(P^{(0)} \,\|\, P^{(\infty)}\right) - \mathrm{KL}\left(P^{(k)} \,\|\, P^{(\infty)}\right)$$
where:
$$\mathrm{KL}(P \,\|\, Q) = \sum_v P(v) \log \frac{P(v)}{Q(v)} = \sum_v P(v) \log P(v) - \sum_v P(v) \log Q(v)$$
We have:
$$\frac{\partial\, \mathrm{KL}(P \,\|\, Q)}{\partial \theta} = \sum_v \frac{\partial P(v)}{\partial \theta} \log P(v) + \sum_v P(v)\, \frac{\partial \log P(v)}{\partial \theta} - \sum_v \frac{\partial P(v)}{\partial \theta} \log Q(v) - \sum_v P(v)\, \frac{\partial \log Q(v)}{\partial \theta} \tag{25}$$

When P does not depend on the parameter set θ, we have $\partial_\theta P(v) = 0$ and the derivative reduces to28:
$$\frac{\partial\, \mathrm{KL}(P \,\|\, Q)}{\partial \theta} = -\sum_v P(v)\, \frac{\partial \log Q(v)}{\partial \theta}$$
Since we have:
$$\sum_v P(v)\, \frac{\partial \log P(v)}{\partial \theta} = \sum_v \frac{\partial P(v)}{\partial \theta}$$
we also notice that another expression of Equation (25) is:
$$\frac{\partial\, \mathrm{KL}(P \,\|\, Q)}{\partial \theta} = \sum_v \frac{\partial P(v)}{\partial \theta} \left(\log \frac{P(v)}{Q(v)} + 1\right) - \sum_v P(v)\, \frac{\partial \log Q(v)}{\partial \theta}$$
Finally, we deduce that:
$$\begin{aligned}
\frac{\partial\, \mathrm{CD}(k)}{\partial \theta} &= \frac{\partial}{\partial \theta} \left(\sum_v P^{(0)}(v) \log \frac{P^{(0)}(v)}{P^{(\infty)}(v)}\right) - \frac{\partial}{\partial \theta} \left(\sum_v P^{(k)}(v) \log \frac{P^{(k)}(v)}{P^{(\infty)}(v)}\right) \\
&= -\sum_v P^{(0)}(v)\, \frac{\partial \log P^{(\infty)}(v)}{\partial \theta} + \sum_v P^{(k)}(v)\, \frac{\partial \log P^{(\infty)}(v)}{\partial \theta} - \sum_v \frac{\partial P^{(k)}(v)}{\partial \theta} \left(\log \frac{P^{(k)}(v)}{P^{(\infty)}(v)} + 1\right)
\end{aligned} \tag{26}$$

In Equation (26), it is often possible to compute the exact values of the first two terms, but not of the third term. However, Hinton (2002) showed experimentally that this last term is so small compared to the other two terms that it can be ignored. Thus, we can consider the following approximation:
$$\begin{aligned}
\frac{\partial\, \mathrm{CD}(k)}{\partial \theta} &\approx \sum_v P^{(k)}(v)\, \frac{\partial \log P^{(\infty)}(v)}{\partial \theta} - \sum_v P^{(0)}(v)\, \frac{\partial \log P^{(\infty)}(v)}{\partial \theta} \\
&= \sum_v P^{(k)}(v)\, \frac{\partial \log P_\theta(v)}{\partial \theta} - \sum_v P^{(0)}(v)\, \frac{\partial \log P_\theta(v)}{\partial \theta} \\
&= \frac{1}{N} \sum_s \left(\frac{\partial \log P_\theta\big(v_{(s)}^{(k)}\big)}{\partial \theta} - \frac{\partial \log P_\theta\big(v_{(s)}^{(0)}\big)}{\partial \theta}\right)
\end{aligned}$$
where $v_{(s)}^{(0)}$ is the s-th sample of the training set and $v_{(s)}^{(k)}$ is the associated sample after running k steps of Gibbs sampling29. By noticing that $\log P_\theta(v) = \ell(\theta \mid v)$ and using Equation (24), we obtain:
$$\begin{aligned}
\frac{\partial\, \mathrm{CD}(k)}{\partial \theta} &= \frac{1}{N} \sum_{s=1}^N \left(\frac{\partial\, \ell\big(\theta \mid v_{(s)}^{(k)}\big)}{\partial \theta} - \frac{\partial\, \ell\big(\theta \mid v_{(s)}^{(0)}\big)}{\partial \theta}\right) \\
&= \frac{1}{N} \sum_{s=1}^N \left(\nabla_\theta\big(v_{(s)}^{(k)}\big) - \nabla_\theta\big(v_{(s)}^{(0)}\big)\right) \\
&= \frac{1}{N} \sum_{s=1}^N \left(\nabla_\theta^{(1)}\big(v_{(s)}^{(k)}\big) - \nabla_\theta^{(1)}\big(v_{(s)}^{(0)}\big)\right) \\
&= \frac{1}{N} \sum_{s=1}^N \left(\sum_h P\big(h \mid v_{(s)}^{(0)}\big)\, \frac{\partial E\big(v_{(s)}^{(0)}, h\big)}{\partial \theta} - \sum_h P\big(h \mid v_{(s)}^{(k)}\big)\, \frac{\partial E\big(v_{(s)}^{(k)}, h\big)}{\partial \theta}\right)
\end{aligned}$$
Therefore, we can compute the derivatives with respect to the parameters a_i, b_j and w_{i,j}:
$$\begin{aligned}
\frac{\partial\, \mathrm{CD}(k)}{\partial a_i} &= \frac{1}{N} \sum_{s=1}^N \left(v_{(s),i}^{(k)} - v_{(s),i}^{(0)}\right) \\
\frac{\partial\, \mathrm{CD}(k)}{\partial b_j} &= \frac{1}{N} \sum_{s=1}^N \left(P\big(h_j = 1 \mid v_{(s)}^{(k)}\big) - P\big(h_j = 1 \mid v_{(s)}^{(0)}\big)\right) \\
\frac{\partial\, \mathrm{CD}(k)}{\partial w_{i,j}} &= \frac{1}{N} \sum_{s=1}^N \left(P\big(h_j = 1 \mid v_{(s)}^{(k)}\big)\, v_{(s),i}^{(k)} - P\big(h_j = 1 \mid v_{(s)}^{(0)}\big)\, v_{(s),i}^{(0)}\right)
\end{aligned}$$
28 This is the case when P is equal to $P^{(0)}$.
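These three derivatives are what a CD-k training step estimates from a mini-batch. The following is a sketch of CD-1 for a Bernoulli RBM (the batch size, layer sizes and initialization are illustrative; the parameters would then be moved in the direction of $-\partial\,\mathrm{CD}(1)/\partial\theta$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradients(V0, a, b, W, rng):
    """CD-1 gradient estimates for a batch V0 of shape (N, m)."""
    # Positive phase: P(h = 1 | v^(0)) and a sample h ~ P(h | v^(0))
    PH0 = sigmoid(b + V0 @ W)
    H0 = (rng.random(PH0.shape) < PH0).astype(float)
    # Negative phase: one Gibbs step gives v^(1), then P(h = 1 | v^(1))
    PV1 = sigmoid(a + H0 @ W.T)
    V1 = (rng.random(PV1.shape) < PV1).astype(float)
    PH1 = sigmoid(b + V1 @ W)
    # The three CD(1) derivatives, averaged over the batch
    grad_a = (V1 - V0).mean(axis=0)
    grad_b = (PH1 - PH0).mean(axis=0)
    grad_W = (V1[:, :, None] * PH1[:, None, :] -
              V0[:, :, None] * PH0[:, None, :]).mean(axis=0)
    return grad_a, grad_b, grad_W

rng = np.random.default_rng(42)
N, m, n = 8, 5, 3
V0 = (rng.random((N, m)) < 0.5).astype(float)
a, b, W = np.zeros(m), np.zeros(n), 0.01 * rng.normal(size=(m, n))
ga, gb, gW = cd1_gradients(V0, a, b, W, rng)
```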

A.2.5 Gradient of the Gaussian-Bernoulli RBM log-likelihood function


By updating Equation (24), we can easily show that the gradient of the log-likelihood function is equal to:
$$\frac{\partial\, \ell(\theta \mid v)}{\partial \theta} = -\sum_h P(h \mid v)\, \frac{\partial E_g(v,h)}{\partial \theta} + \int_{v'} p(v') \sum_h P(h \mid v')\, \frac{\partial E_g(v',h)}{\partial \theta}\, dv' \tag{27}$$
where:
$$E_g(v,h) = \sum_{i=1}^m \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j=1}^n b_j h_j - \sum_{i=1}^m \sum_{j=1}^n w_{i,j}\, \frac{v_i h_j}{\sigma_i^2}$$
29 In this case, $v_{(s)}^{(0)}$ is the starting value of the Gibbs sampler.
We have:
$$\begin{aligned}
\frac{\partial E_g(v,h)}{\partial a_i} &= -\frac{v_i - a_i}{\sigma_i^2} \\
\frac{\partial E_g(v,h)}{\partial b_j} &= -h_j \\
\frac{\partial E_g(v,h)}{\partial w_{i,j}} &= -\frac{v_i h_j}{\sigma_i^2} \\
\frac{\partial E_g(v,h)}{\partial \sigma_i} &= -\frac{(v_i - a_i)^2}{\sigma_i^3} + \frac{2 v_i}{\sigma_i^3} \sum_j w_{i,j} h_j
\end{aligned}$$

We deduce that the derivative of $\ell(\theta \mid v)$ with respect to the weight $w_{i,j}$ is given by:
$$\frac{\partial\, \ell(\theta \mid v)}{\partial w_{i,j}} = P(h_j = 1 \mid v)\, \frac{v_i}{\sigma_i^2} - \int_{v'} p(v')\, P(h_j = 1 \mid v')\, \frac{v_i'}{\sigma_i^2}\, dv'$$
Similarly, we can compute the other derivatives. For $a_i$ and $b_j$, we obtain:
$$\frac{\partial\, \ell(\theta \mid v)}{\partial a_i} = \frac{v_i}{\sigma_i^2} - \int_{v'} p(v')\, \frac{v_i'}{\sigma_i^2}\, dv'$$
and:
$$\frac{\partial\, \ell(\theta \mid v)}{\partial b_j} = P(h_j = 1 \mid v) - \int_{v'} p(v')\, P(h_j = 1 \mid v')\, dv'$$
Finally, we obtain for the parameter $\sigma_i$:
$$\frac{\partial\, \ell(\theta \mid v)}{\partial \sigma_i} = \frac{(v_i - a_i)^2}{\sigma_i^3} - \frac{2 v_i}{\sigma_i^3} \sum_j w_{i,j}\, P(h_j = 1 \mid v) + \int_{v'} p(v') \left(\frac{2 v_i'}{\sigma_i^3} \sum_j w_{i,j}\, P(h_j = 1 \mid v') - \frac{(v_i' - a_i)^2}{\sigma_i^3}\right) dv'$$

A.2.6 Gradient of the conditional RBM log-likelihood function


The log-likelihood function for the conditional RBM is:
$$\ell(\theta \mid v_t) = \log p_\theta(v_t \mid c_t)$$
where the set of parameters becomes θ = (a, b, W, P, Q). We can then compute the partial derivative of the energy function by using the chain rule:
$$\frac{\partial \tilde{E}_g(v_t, h_t, c_t)}{\partial q_{k,i}} = \frac{\partial \tilde{E}_g(v_t, h_t, c_t)}{\partial \tilde{a}_{t,i}} \cdot \frac{\partial \tilde{a}_{t,i}}{\partial q_{k,i}} = -\left(\frac{v_{t,i} - \tilde{a}_{t,i}}{\sigma_i^2}\right) c_{t,k}$$
and we obtain:
$$\frac{\partial\, \ell(\theta \mid v_t)}{\partial q_{k,i}} = \left(\frac{v_{t,i} - \tilde{a}_{t,i}}{\sigma_i^2}\right) c_{t,k} - \int_{v'} p(v' \mid c_t) \left(\frac{v_{t,i}' - \tilde{a}_{t,i}}{\sigma_i^2}\right) c_{t,k}\, dv'$$
Similarly, we have:
$$\frac{\partial \tilde{E}_g(v_t, h_t, c_t)}{\partial p_{k,j}} = \frac{\partial \tilde{E}_g(v_t, h_t, c_t)}{\partial \tilde{b}_{t,j}} \cdot \frac{\partial \tilde{b}_{t,j}}{\partial p_{k,j}} = -h_{t,j}\, c_{t,k}$$
and:
$$\frac{\partial\, \ell(\theta \mid v_t)}{\partial p_{k,j}} = P(h_{t,j} = 1 \mid v_t, c_t)\, c_{t,k} - \int_{v'} p(v_t' \mid c_t)\, P(h_j = 1 \mid v_t', c_t)\, c_{t,k}\, dv'$$
The calculation of the other derivatives remains unchanged: they are the same as those obtained for the Gaussian-Bernoulli RBM.


A.3 A unified approach of GAN models


Generative models are trained to perform a mapping from a latent space to some specified data manifold, which is generally represented by the empirical distribution of real data. The problem then consists in finding a mapping function that best matches the target data in the sense of a certain discrepancy measure. For comparing the theoretical distribution with the empirical distribution, the two important metrics used in the context of generative modeling are divergence measures (φ-divergences) and integral probability metrics (IPMs), which give two different families of GANs. In this section, we show that all these models share many significant common points. Let X be a topological space that is compact, complete and separable, and let P and Q be two probability measures defined on X. GAN optimization relies on the fact that both φ-divergences and IPMs can be written as:
$$d_{\mathcal{F}}(P, Q) = \sup_{\varphi \in \mathcal{F}} |\Delta_{P,Q}(\varphi)|$$
where F is a class of functions φ defined on X and $\Delta_{P,Q} : \mathcal{F} \to \mathbb{R}$ is a discrepancy operator. In the following, we show that different choices of the class F and of the discrepancy operator Δ lead to different GAN models.

A.3.1 φ-GAN models


Variational estimation of φ-divergences
The very first GAN model is often presented as a minimax optimization problem (Goodfellow et al., 2014). However, it is possible to find a direct correspondence between Goodfellow's saddle point problem and divergence minimization. Let us recall the definition of a divergence measure. We assume that there are two probability measures P and Q defined on X. Moreover, P must be absolutely continuous30 with respect to Q, which is denoted P ≪ Q. The φ-divergence $D_\phi$ is defined as:
$$D_\phi(P \,\|\, Q) = \int_{\mathcal{X}} \phi\left(\frac{dP(x)}{dQ(x)}\right) Q(dx) \tag{28}$$
where $\phi : \mathbb{R}_+ \to \mathbb{R} \cup \{+\infty\}$ is a convex, lower semi-continuous function such that31 φ(1) = 0. Looking at the closed-form expression of the φ-divergence, we note that it can be interpreted as a functional of the likelihood ratio between the two probability distributions32.

Remark 7. Different divergence measures can be used for modeling the function φ:

• the Kullback-Leibler divergence $D_{KL}$ corresponds to $\phi(t) = t \log t$;

• the Jensen-Shannon divergence $D_{JS}$ is obtained by setting $\phi(t) = -(t+1)\log\frac{1+t}{2} + t\log t$;

• the total variation (or energy-based) divergence $D_{TV}$ is defined by taking $\phi(t) = \frac{1}{2}|t-1|$.
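Each choice of φ can be checked numerically on discrete supports, where Equation (28) becomes $D_\phi(P \| Q) = \sum_x q(x)\, \phi(p(x)/q(x))$ (the two distributions below are arbitrary):

```python
import numpy as np

# phi-divergence between two discrete distributions with full support.

def phi_divergence(p, q, phi):
    t = p / q
    return float(np.sum(q * phi(t)))

phi_kl = lambda t: t * np.log(t)                                   # Kullback-Leibler
phi_js = lambda t: -(t + 1) * np.log((1 + t) / 2) + t * np.log(t)  # Jensen-Shannon
phi_tv = lambda t: 0.5 * np.abs(t - 1)                             # total variation

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
kl = phi_divergence(p, q, phi_kl)
js = phi_divergence(p, q, phi_js)
```

With φ(t) = t log t, the formula recovers the usual expression $\sum_x p(x)\log(p(x)/q(x))$, and every φ-divergence vanishes when P = Q since φ(1) = 0.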

In order to establish the link with GAN optimization problems, Nguyen et al. (2010) proposed to compute a variational characterization of these φ-divergence measures by looking at the convex dual. For that, we need to introduce the Fenchel conjugate, which is a fundamental tool in convex analysis (Barbu and Precupanu, 2012). Let us consider a function $f : \mathcal{X} \to \mathbb{R} \cup \{+\infty\}$. According to the Riesz representation theorem, it is possible to identify the dual space $\mathcal{X}^*$ of the Banach space X. Therefore, we can work on the product space $\mathcal{X}^* \times \mathcal{X}$ associated with the scalar product $\langle x^*, x\rangle$. The Fenchel transform is then defined on the dual space such that:
$$f^* : \mathcal{X}^* \to \mathbb{R}, \qquad x^* \mapsto \sup_{x \in \mathrm{dom}\, f} \{\langle x^*, x\rangle - f(x)\}$$
where $\mathrm{dom}\, f = \{x \in \mathcal{X} \mid f(x) < +\infty\}$. According to the Fenchel-Moreau theorem, if f is convex and continuous, then $f^{**} = (f^*)^* = f$ and we obtain:
$$f(x) = \sup_{x^* \in \mathrm{dom}\, f^*} \{\langle x, x^*\rangle - f^*(x^*)\}$$
for all x ∈ X. If we consider the function φ associated with a given φ-divergence, the space X is R and the dual space is also R. In order to express the φ-divergence in terms of a loss, Nguyen et al. (2010) simply expressed φ in terms of its conjugate:
$$\phi(x) = \sup_{x^* \in \mathrm{dom}\, \phi^*} \{\langle x, x^*\rangle - \phi^*(x^*)\}$$
30 This means that if we consider a measurable set A ⊆ X such that Q(A) = 0, then P(A) = 0.
31 This last condition ensures that $D_\phi(P \,\|\, Q) = 0$ if P = Q.
32 The condition that P must be absolutely continuous with respect to Q is related to the Radon-Nikodym theorem. It states that if P ≪ Q, then there is a function $f : \mathcal{X} \to \mathbb{R}$ that satisfies $P(A) = \int_A f(x)\, Q(dx)$ for every measurable set A. The function f is often denoted by $\frac{dP}{dQ}$.
Using the Jensen inequality and considering that φ* is convex, it follows that:
$$\begin{aligned}
D_\phi(P \,\|\, Q) &= \int_{\mathcal{X}} \phi\left(\frac{dP(x)}{dQ(x)}\right) Q(dx) \\
&= \int_{\mathcal{X}} \sup_{x^* \in \mathrm{dom}\, \phi^*} \left\{\frac{dP(x)}{dQ(x)}\, x^* - \phi^*(x^*)\right\} dQ(x) \\
&\geq \sup_{x^* \in \mathrm{dom}\, \phi^*} \int_{\mathcal{X}} \left(x^*\, dP(x) - \phi^*(x^*)\, dQ(x)\right)
\end{aligned}$$
We then introduce a class of functions F that map X to dom φ*:
$$\begin{aligned}
D_\phi(P \,\|\, Q) &\geq \sup_{\varphi \in \mathcal{F}} \left\{\int_{\mathcal{X}} \varphi(x)\, P(dx) - \int_{\mathcal{X}} \phi^*(\varphi(x))\, Q(dx)\right\} \\
&= \sup_{\varphi \in \mathcal{F}} \left\{\mathbb{E}[\varphi(X) \mid X \sim P] - \mathbb{E}[\phi^*(\varphi(X)) \mid X \sim Q]\right\}
\end{aligned}$$
where $\mathcal{F} = \{\varphi : \mathcal{X} \to \mathbb{R} \mid \varphi(\mathcal{X}) \subseteq \mathrm{dom}\, \phi^*\}$. We define the discrepancy operator for this specific model as follows:
$$\Delta_{P,Q}(\varphi) = \mathbb{E}[\varphi(X) \mid X \sim P] - \mathbb{E}[\phi^*(\varphi(X)) \mid X \sim Q] \tag{29}$$

To find the optimal function such that equality in the supremum is attained, we introduce the notation $\tilde{x} = \varphi(x)$ and the quantity $C(\tilde{x})$ defined by:
$$C(\tilde{x}) = C(\varphi(x)) = \mathbb{E}[\varphi(X) \mid X \sim P] - \mathbb{E}[\phi^*(\varphi(X)) \mid X \sim Q] \tag{30}$$
By computing the derivative of $C(\tilde{x})$, Nowozin et al. (2016) found that the optimal function $\varphi^\star$ is33:
$$\varphi^\star(x) = \phi'\left(\frac{dP(x)}{dQ(x)}\right) \tag{31}$$
However, evaluating this quantity is impossible because the distribution function P is unknown. Therefore, φ should be flexible enough to approximate the derivative φ' everywhere. This is why GAN models use deep neural networks to estimate it. This leads us to introduce the parameter θ_d that will be optimized during the training process. In this context, we write the parameterized function $D(x, \theta_d)$, which aims to estimate the function $\varphi^\star(x)$:
$$\hat{\varphi}^\star(x) = D(x, \theta_d)$$
33 See Broniatowski and Keziou (2006, Theorem 4.4) for a formal proof.

Consequently, Nowozin et al. (2016) proposed to use the resulting lower bound in order to train GANs. $G(z, \theta_g)$ represents the generative model, which is also a neural network and allows us to build the probability distribution $P_{model}$ that should estimate the given probability distribution $P_{data}$. Thus, we obtain the saddle point problem:
$$\min_{\theta_g} \Delta_{P_{data}, P_{model}}(\varphi^\star) \approx \min_{\theta_g} \max_{\theta_d} C(D(X, \theta_d)) \tag{32}$$
where:
$$C(D(X, \theta_d)) = \mathbb{E}[D(X, \theta_d) \mid X \sim P_{data}] - \mathbb{E}[\phi^*(D(X, \theta_d)) \mid X \sim G(z, \theta_g)] \tag{33}$$
$D(x, \theta_d) \in \mathcal{F}$ and $P_{model}$ is the distribution of $G(z, \theta_g)$. Therefore, we have constrained the discriminator to be in a smaller class of functions belonging to F, even if the neural network can approximate any function.
Remark 8. In order to obtain the minimax problem of Goodfellow et al. (2014), Nowozin et al. (2016) considered the function:
$$\phi(x) = x \log x - (x+1) \log(x+1)$$
We deduce that:
$$\phi'(x) = \log\left(\frac{x}{x+1}\right)$$
and34:
$$\phi^*(x) = -\log(1 - e^x)$$
Using Equation (30), we deduce that:
$$C(\tilde{x}) = \mathbb{E}[\varphi(X) \mid X \sim P] + \mathbb{E}[\log(1 - \exp(\varphi(X))) \mid X \sim Q]$$
If we set $\varphi(X) = \log D(X, \theta_d)$, we find the minimax function of Goodfellow et al. (2014):
$$C(D(X, \theta_d)) = \mathbb{E}[\log D(X, \theta_d) \mid X \sim P_{data}] + \mathbb{E}[\log(1 - D(X, \theta_d)) \mid X \sim G(z, \theta_g)]$$
34 We have:
$$\phi^*(t) = \sup_{x \in \mathrm{dom}\, \phi} \{xt - \phi(x)\}$$
Let $h(x) = xt - x\log x + (x+1)\log(x+1)$. Then:
$$h'(x) = t - \log\left(\frac{x}{x+1}\right)$$
The supremum is reached at the point $x^\star$ such that $h'(x^\star) = 0$, implying that:
$$x^\star = \frac{e^t}{1 - e^t}$$
Since $\log x^\star = t - \log(1 - e^t)$ and $\log(x^\star + 1) = -\log(1 - e^t)$, we finally obtain:
$$\phi^*(t) = x^\star t - x^\star \log x^\star + (x^\star + 1)\log(x^\star + 1) = x^\star \log(1 - e^t) - (x^\star + 1)\log(1 - e^t) = -\log(1 - e^t)$$

Another representation of φ-divergences
GAN training can be viewed as a process that successively estimates the optimal function φ and minimizes the φ-divergence. Chu et al. (2019) proposed a more general view of this process. Previously, the analysis was carried out on the space R thanks to the dual form of the function φ associated with the φ-divergence. Alternatively, Chu et al. (2019) proposed to work directly on the probability space X. This imposes defining a probability functional $J_P(Q) : \mathcal{B}(\mathcal{X}) \to \mathbb{R}$, where B(X) is the space of Borel probability measures on X. In the context of GAN optimization problems, we would like to show that:
$$\min_{Q \in \mathcal{B}(\mathcal{X})} D_\phi(P \,\|\, Q) = \min_{Q \in \mathcal{B}(\mathcal{X})} J_P(Q) \tag{34}$$
However, contrary to the previous case, probability functionals are defined on a space of probability measures, where functional derivatives need to be defined. Moreover, the dimension of the space is potentially infinite. Therefore, the difficulty is to transform the functional optimization into a convex optimization problem that can be solved using traditional numerical algorithms such as gradient descent.
In order to define the derivative in B(X), Chu et al. (2019) used the Gâteaux derivative of the functional $J_P$ at Q in the direction $\delta = P - Q$:
$$J_P'(P - Q) = dJ_P(Q, \delta) = \left.\frac{d}{d\varepsilon}\, J_P(Q + \varepsilon\delta)\right|_{\varepsilon=0} = \lim_{\varepsilon \to 0} \frac{J_P(Q + \varepsilon\delta) - J_P(Q)}{\varepsilon} \tag{35}$$
This definition originally comes from Von Mises calculus, and the Gâteaux derivative is also called the Volterra derivative (Fernholz, 2012). Chu et al. (2019) recalled that the Gâteaux derivative has an integral representation $J_P'(\delta) = \int_{\mathcal{X}} g(x)\, \delta(dx)$, where the function $g : \mathcal{X} \to \mathbb{R}$ is called the influence function (or the influence curve)35. It follows that:
$$J_P'(P - Q) = \int_{\mathcal{X}} g(x)\, (P - Q)(dx) = \mathbb{E}[g(X) \mid X \sim P] - \mathbb{E}[g(X) \mid X \sim Q] \tag{36}$$
We are now able to compute the derivative in order to recover the optimal function that satisfies the minimum of the φ-divergence:
$$\begin{aligned}
J_P'(P - Q) &= \int_{\mathcal{X}} \left.\frac{d}{d\varepsilon}\, \phi\left(\frac{dQ(x) + \varepsilon(dP(x) - dQ(x))}{dP(x)}\right)\right|_{\varepsilon=0} dP(x) \\
&= \int_{\mathcal{X}} \left.\phi'\left(\frac{dQ(x) + \varepsilon(dP(x) - dQ(x))}{dP(x)}\right)\right|_{\varepsilon=0} \frac{dP(x) - dQ(x)}{dP(x)}\, dP(x) \\
&= \int_{\mathcal{X}} \phi'\left(\frac{dQ(x)}{dP(x)}\right) (P - Q)(dx)
\end{aligned} \tag{37}$$
We deduce that the influence function for the φ-divergence is equal to:
$$g_\phi : \mathcal{X} \to \mathbb{R}, \qquad x \mapsto \phi'\left(\frac{dQ(x)}{dP(x)}\right) \tag{38}$$
The influence function can then be associated with the optimal function introduced previously, because we have $g_\phi = \varphi^\star \in \mathcal{F}$. We conclude that the GAN discriminator will try to estimate the influence function. While Nguyen et al. (2010) focused on the dual form of the function φ, Chu et al. (2019) used a more general approach by considering the dual form of the probability functional in order to recover Goodfellow's saddle point problem.
35 The influence function is not unique. Indeed, for any function g that describes the Gâteaux differential at Q, g + c also works. Thus, the influence function is uniquely defined up to an arbitrary additive constant.
In the case of functionals defined on B(X), the dual space can be identified with the space $L^1(\mathcal{X})$ of real-valued Lipschitz functions defined on X (Laschos et al., 2019). According to the Riesz representation theorem, it is possible to identify the space B(X) with its dual. We consider the product space $\mathcal{B}(\mathcal{X}) \times L^1(\mathcal{X})$ with the scalar product defined as:
$$\langle \varphi, Q\rangle = \int_{\mathcal{X}} \varphi(x)\, Q(dx)$$
where $\varphi \in L^1(\mathcal{X})$ and $Q \in \mathcal{B}(\mathcal{X})$. Thus, we can rewrite $J_P$ in terms of its convex conjugate:
$$J_P(Q) = \sup_{\varphi \in L^1(\mathcal{X})} \{\langle \varphi, Q\rangle - J_P^*(\varphi)\} = \sup_{\varphi \in L^1(\mathcal{X})} \{\mathbb{E}[\varphi(X) \mid X \sim Q] - J_P^*(\varphi)\} \tag{39}$$

To bridge the gap between the two approaches, we have to demonstrate that $J_P^*(\varphi) = \mathbb{E}[\phi^*(\varphi(X)) \mid X \sim P]$ for a given estimate of the function φ ∈ F. For that, Chu et al. (2019) considered the Fenchel conjugate of the Jensen-Shannon divergence36:
$$J_P^*(\varphi) = J_{JS}^*(\varphi) = -\frac{1}{2}\, \mathbb{E}\left[\log\left(1 - e^{2\varphi(X) - \log 2}\right) \,\Big|\, X \sim P\right] - \frac{1}{2}\log 2 \tag{40}$$
Using $\varphi(x) = \frac{1}{2}\log(1 - D(x, \theta_d)) + \frac{1}{2}\log 2$, we obtain:
$$\mathbb{E}[\varphi(X) \mid X \sim Q] - J_P^*(\varphi) = \frac{1}{2}\, \mathbb{E}[\log(1 - D(x, \theta_d)) \mid X \sim Q] + \frac{1}{2}\, \mathbb{E}[\log D(x, \theta_d) \mid X \sim P] + \log 2 \tag{41}$$
Therefore, the descent algorithm applied to probability functionals is equivalent to Goodfellow's saddle point problem:
$$\min_{\theta_g} \max_{\theta_d}\; \mathbb{E}[\log D(X, \theta_d) \mid X \sim P_{data}] + \mathbb{E}[\log(1 - D(X, \theta_d)) \mid X \sim G(z, \theta_g)]$$
36 The derivation of this result is given in Appendix A.4 on page 69.


A.3.2 IPM-GAN models


In the case of the Wasserstein generative adversarial networks introduced by Arjovsky et al. (2017a,b), φ-divergences are replaced by integral probability metrics (IPMs) in order to compare two probability distributions. Let F be a class of functions defined on X. Müller (1997) defined the IPM $I_{\mathcal{F}}$ between P and Q in the following way:
$$I_{\mathcal{F}}(P, Q) = \sup_{\varphi \in \mathcal{F}} \left|\int_{\mathcal{X}} \varphi(x)\, P(dx) - \int_{\mathcal{X}} \varphi(x)\, Q(dx)\right| = \sup_{\varphi \in \mathcal{F}} |\mathbb{E}[\varphi(X) \mid X \sim P] - \mathbb{E}[\varphi(X) \mid X \sim Q]|$$
IPMs look for a critic function that maximizes the average discrepancy between the two distributions P and Q. Contrary to φ-GAN models, where we have to find the variational form of the divergence, the definition of an IPM directly gives the discrepancy operator:
$$\Delta_{P,Q}(\varphi) = \mathbb{E}[\varphi(X) \mid X \sim P] - \mathbb{E}[\varphi(X) \mid X \sim Q]$$
for all φ ∈ F. The Wasserstein GAN proposed by Arjovsky et al. (2017a,b) considers the function class F such that φ(x) is a 1-Lipschitz function. In order to present different choices for the function class F, we consider the Lebesgue norm on the measurable space Ω = (X, P):
$$\|f\|_2^2 = \int_{\mathcal{X}} f^2(x)\, P(dx)$$
Let us denote the normed space by $L^2(\Omega) = \{f : \mathcal{X} \to \mathbb{R} \mid \|f\|_2 < +\infty\}$ and the unit ball by $B_1 = \{f \in L^2(\Omega) \mid \|f\|_2 \leq 1\}$. Choosing the function class F = B_1 gives the Fisher GAN of Mroueh and Sercu (2017). In a similar way, Mroueh et al. (2018) proposed to define Sobolev GAN models by considering the class of functions $\mathcal{F} = \{f \in L^2(\Omega) \mid \|\nabla_x f\|_2 \leq 1\}$.
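As a toy member of the IPM family (not one of the GAN critics above), taking F to be the linear functions $x \mapsto \langle w, x\rangle$ with $\|w\|_2 \leq 1$ makes the supremum available in closed form: by the Cauchy-Schwarz inequality, it equals the Euclidean distance between the means of the two samples:

```python
import numpy as np

# IPM over the class F = { x -> <w, x> : ||w||_2 <= 1 }:
#   sup_{||w||<=1} | E_P[<w, X>] - E_Q[<w, X>] | = || mean(P) - mean(Q) ||_2

def linear_ipm(X, Y):
    return float(np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0)))

rng = np.random.default_rng(7)
X = rng.normal(loc=0.0, size=(500, 2))   # sample from P
Y = rng.normal(loc=1.0, size=(500, 2))   # sample from Q
d = linear_ipm(X, Y)
```

Any particular unit-norm critic gives a lower bound on this supremum, which is one way to see why richer classes F (Lipschitz, Fisher, Sobolev) yield stronger discrepancy measures.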


A.4 The Jensen-Shannon divergence function


To derive the convex conjugate of $J_{JS}$, we follow Appendix A of Chu et al. (2019). Let P and Q be two probability measures. We denote by p(x) and q(x) the associated density functions dP(x) and dQ(x). The Jensen-Shannon divergence is defined by:
$$D_{JS}(P \,\|\, Q) = \frac{1}{2} D_{KL}\left(P \,\Big\|\, \frac{1}{2}P + \frac{1}{2}Q\right) + \frac{1}{2} D_{KL}\left(Q \,\Big\|\, \frac{1}{2}P + \frac{1}{2}Q\right) = \frac{1}{2} \int \left(p(x) \log \frac{2p(x)}{p(x) + q(x)} + q(x) \log \frac{2q(x)}{p(x) + q(x)}\right) dx$$
where $D_{KL}(P \,\|\, Q)$ is the Kullback-Leibler divergence. In the case where $J_{JS}(P) = D_{JS}(P \,\|\, Q)$, we obtain:
$$J_{JS}(P + \varepsilon\delta) = \frac{1}{2} \int (p(x) + \varepsilon\delta(x)) \log \frac{2(p(x) + \varepsilon\delta(x))}{p(x) + q(x) + \varepsilon\delta(x)}\, dx + \frac{1}{2} \int q(x) \log \frac{2q(x)}{p(x) + q(x) + \varepsilon\delta(x)}\, dx$$
Since we have:
$$\log \frac{2(p(x) + \varepsilon\delta(x))}{p(x) + q(x) + \varepsilon\delta(x)} = \log 2 + \log(p(x) + \varepsilon\delta(x)) - \log(p(x) + q(x) + \varepsilon\delta(x))$$
and:
$$\frac{d}{d\varepsilon} \log \frac{2(p(x) + \varepsilon\delta(x))}{p(x) + q(x) + \varepsilon\delta(x)} = \frac{\delta(x)}{p(x) + \varepsilon\delta(x)} - \frac{\delta(x)}{p(x) + q(x) + \varepsilon\delta(x)}$$
it follows that:
$$\begin{aligned}
\frac{d}{d\varepsilon} J_{JS}(P + \varepsilon\delta) &= \frac{1}{2} \int \delta(x) \log \frac{2(p(x) + \varepsilon\delta(x))}{p(x) + q(x) + \varepsilon\delta(x)}\, dx + \frac{1}{2} \int (p(x) + \varepsilon\delta(x)) \left(\frac{\delta(x)}{p(x) + \varepsilon\delta(x)} - \frac{\delta(x)}{p(x) + q(x) + \varepsilon\delta(x)}\right) dx \\
&\quad - \frac{1}{2} \int q(x)\, \frac{\delta(x)}{p(x) + q(x) + \varepsilon\delta(x)}\, dx
\end{aligned}$$
We deduce that:
$$\left.\frac{d}{d\varepsilon} J_{JS}(P + \varepsilon\delta)\right|_{\varepsilon=0} = \frac{1}{2} \int \delta(x) \log \frac{2p(x)}{p(x) + q(x)}\, dx = \frac{1}{2} \int \left(\log \frac{p(x)}{p(x) + q(x)} + \log 2\right) \delta(x)\, dx$$
Chu et al. (2019) concluded that the influence function of $J_{JS}(P)$ is equal to:
$$g_{JS}(x) = \frac{1}{2} \log \frac{dP(x)}{dP(x) + dQ(x)} + \frac{1}{2} \log 2$$

In order to find the convex conjugate of $J_{JS}(P)$, we consider the Fenchel-Moreau theorem:
$$J_{JS}^*(\varphi) = \sup_{P \in \mathcal{B}(\mathcal{X})} \{\langle \varphi, P\rangle - J_{JS}(P)\} = \sup_{P \in \mathcal{B}(\mathcal{X})} \left\{\int_{\mathcal{X}} \varphi(x)\, P(dx) - J_{JS}(P)\right\}$$
We have:
$$f(P) = \int_{\mathcal{X}} \varphi(x)\, P(dx) - J_{JS}(P) = \int_{\mathcal{X}} \varphi(x)\, p(x)\, dx - \frac{1}{2} \int_{\mathcal{X}} \left(p(x) \log \frac{2p(x)}{p(x) + q(x)} + q(x) \log \frac{2q(x)}{p(x) + q(x)}\right) dx = \int_{\mathcal{X}} h(x)\, dx$$
where:
$$h(x) = \varphi(x)\, p(x) - \frac{1}{2} p(x) \log 2 - \frac{1}{2} p(x) \log p(x) + \frac{1}{2} p(x) \log(p(x) + q(x)) - \frac{1}{2} q(x) \log 2 - \frac{1}{2} q(x) \log q(x) + \frac{1}{2} q(x) \log(p(x) + q(x))$$
and:
$$\frac{\partial h(x)}{\partial p(x)} = \varphi(x) - \frac{1}{2}\log 2 - \frac{1}{2}\log p(x) - \frac{1}{2} + \frac{1}{2}\log(p(x) + q(x)) + \frac{1}{2}\frac{p(x)}{p(x) + q(x)} + \frac{1}{2}\frac{q(x)}{p(x) + q(x)} = \varphi(x) - \frac{1}{2}\log 2 - \frac{1}{2}\log p(x) + \frac{1}{2}\log(p(x) + q(x))$$
From the first-order condition $\partial h(x)/\partial p(x) = 0$, we deduce that the optimal solution is:
$$\varphi(x) = \frac{1}{2}\log 2 + \frac{1}{2}\log \frac{p(x)}{p(x) + q(x)} = \frac{1}{2}\log \frac{2p(x)}{p(x) + q(x)}$$
Chu et al. (2019) noticed that:
$$\frac{q(x)}{p(x) + q(x)} = 1 - \frac{p(x)}{p(x) + q(x)} = 1 - \frac{1}{2}\, e^{2\varphi(x)}$$
In this case, we obtain:
$$h(x) = \varphi(x)\, p(x) - \frac{1}{2} p(x) \log \frac{2p(x)}{p(x) + q(x)} - \frac{1}{2} q(x) \log \frac{2q(x)}{p(x) + q(x)} = \varphi(x)\, p(x) - \varphi(x)\, p(x) - \frac{1}{2} q(x) \log\left(2 - e^{2\varphi(x)}\right)$$
The convex conjugate of $J_{JS}$ is then:
$$J_{JS}^*(\varphi) = -\frac{1}{2} \int_{\mathcal{X}} q(x) \log\left(2 - e^{2\varphi(x)}\right) dx = -\frac{1}{2} \int_{\mathcal{X}} q(x) \log\left(1 - e^{2\varphi(x) - \log 2}\right) dx - \frac{1}{2}\log 2 = -\frac{1}{2}\, \mathbb{E}\left[\log\left(1 - e^{2\varphi(X) - \log 2}\right) \,\Big|\, X \sim Q\right] - \frac{1}{2}\log 2$$
Remark 9. In order to retrieve Equation (40), we have to interchange P and Q, because we need to compute $J_P(Q)$ and not $J_Q(P)$.


A.5 Derivation of the minimax cost function


The cost function can be viewed as a binary cross-entropy measure. Let Y and Ŷ be two random variables with probability mass functions p and q. The cross-entropy function is equal to:
$$H(p, q) = \mathbb{E}[-\log q(x) \mid x \sim p(x)]$$
For discrete probability distributions, we obtain:
$$H(p, q) = -\sum_x p(x) \log q(x)$$
In a binary classification problem, for a given observation $x_i$, we have $p = y_i \in \{0, 1\}$, which is the true label, and $q = \hat{y}_i \in [0, 1]$, which is the predicted probability of the current model. We can use the binary cross-entropy to get a measure of dissimilarity between $y_i$ and $\hat{y}_i$:
$$H(p, q) = -y_i \log \hat{y}_i - (1 - y_i) \log(1 - \hat{y}_i)$$
In the case of m samples, the loss function is then given by:
$$L = -\frac{1}{m} \sum_{i=1}^m \left(y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\right)$$

Under the GAN framework, we have $m$ samples of $x_0$ and $m$ samples of $x_1$, which serve as the input data of the discriminator model. We note $x = \{x_0, x_1\}$ the set of the two samples. Since $\hat{y}_i = D(x_i; \theta_d)$, the formula above can be written as:

$$
\mathcal{L}(\theta_g, \theta_d) = -\frac{1}{2m}\sum_{i=1}^{2m} \left(y_i \log D(x_i; \theta_d) + (1 - y_i) \log\left(1 - D(x_i; \theta_d)\right)\right)
$$
If $x_i$ comes from the sample $x_0$, $y_i$ takes the value $0$; otherwise, it takes the value $1$. It follows that:

$$
\begin{aligned}
\mathcal{L}(\theta_g, \theta_d) &= -\frac{1}{2m}\left(\sum_{i=1}^{m} \log D(x_{1,i}; \theta_d) + \sum_{i=1}^{m} \log\left(1 - D(x_{0,i}; \theta_d)\right)\right) \\
&= -\frac{1}{2}\left(\frac{1}{m}\sum_{i=1}^{m} \log D(x_{1,i}; \theta_d) + \frac{1}{m}\sum_{i=1}^{m} \log\left(1 - D(G(z_i; \theta_g); \theta_d)\right)\right)
\end{aligned}
$$
Therefore, the loss function $\mathcal{L}(\theta_g, \theta_d)$ corresponds to the average of the cross-entropy when considering several observations:

$$
\mathcal{L}(\theta_g, \theta_d) = -\frac{1}{2}\,\mathbb{E}\left[\log D(x_1; \theta_d)\right] - \frac{1}{2}\,\mathbb{E}\left[\log\left(1 - D(G(z; \theta_g); \theta_d)\right)\right]
$$

We notice that $\mathcal{C}(\theta_g, \theta_d)$ is equal to $-2 \cdot \mathcal{L}(\theta_g, \theta_d)$. Minimizing the loss function is then equivalent to maximizing $\mathcal{C}(\theta_g, \theta_d)$ with respect to $\theta_d$.
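This average cross-entropy is immediate to implement. A minimal sketch (function name and probability values are ours), where `d_real` contains the discriminator outputs $D(x_{1,i}; \theta_d)$ on true samples and `d_fake` the outputs $D(G(z_i; \theta_g); \theta_d)$ on generated samples:

```python
import numpy as np

def gan_discriminator_loss(d_real, d_fake):
    # L(theta_g, theta_d): average binary cross-entropy of the discriminator,
    # with label 1 for true samples and label 0 for generated samples
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return -0.5 * (np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

loss_good = gan_discriminator_loss([0.99, 0.98], [0.01, 0.02])
loss_bad = gan_discriminator_loss([0.5, 0.5], [0.5, 0.5])  # undecided discriminator: log 2
```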
A.6 An introduction to Monge-Kantorovich problems
A.6.1 Primal formulation of optimal transport
Optimal transport can be very powerful when comparing two probability distributions. The geometric approach that has been proposed will allow us to solve complex optimization problems. The OT problem relies on two probability spaces $(\mathcal{X}, \mathbb{P})$ and $(\mathcal{Y}, \mathbb{Q})$, and a cost function $c : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^{+}$. For instance, we would like to transform a pile of sand, where particles are distributed according to $\mathbb{P}$, into a structured sand castle, where particles are distributed according to $\mathbb{Q}$. In this case, $c$ measures the amount of effort required for such a move. Thus, we can find an optimal path that minimizes the cost of this transformation. Let us introduce a map function $T : \mathcal{X} \to \mathcal{Y}$ that allows us to describe how $x \in \mathcal{X}$ is transported to the target space $\mathcal{Y}$. The optimal transport map function is the solution of the so-called Monge problem:

$$
\inf_{T \in \mathcal{A}}\left\{\int_{\mathcal{X}} c\left(x, T(x)\right)\mathrm{d}\mathbb{P}(x) : T_{*}(\mathbb{P}) = \mathbb{Q}\right\} \tag{42}
$$
where $\mathcal{A} = \left\{T : \mathcal{X} \to \mathcal{Y} \mid \mathbb{Q}(C) = \mathbb{P}\left(T^{-1}(C)\right), C \subseteq \mathcal{Y}\right\}$. In other words, $T$ must push-forward the probability measure $\mathbb{P}$ toward $\mathbb{Q}$, meaning that $\mathbb{Q} = T_{*}(\mathbb{P})$.
However, the Monge problem may be difficult to solve in some cases because the mapping function may not necessarily exist. For instance, let us consider that the source distribution $\mathbb{P}$ is a Dirac measure such that $\mathrm{d}\mathbb{P}(x) = \delta(x)\,\mathrm{d}x$ and the target distribution $\mathbb{Q}$ is a continuous measure such as a normal distribution. In this particular case, there is no map function $T$ such that the condition $\mathbb{Q} = T_{*}(\mathbb{P})$ is satisfied. Moreover, this condition is non-convex. To illustrate this, let us take $\mathbb{P}$ and $\mathbb{Q}$ two continuous Lebesgue measures on $\mathcal{X}$ such that $\mathrm{d}\mathbb{P}(x) = p(x)\,\mathrm{d}x$ and $\mathrm{d}\mathbb{Q}(x) = q(x)\,\mathrm{d}x$ for all $x \in \mathcal{X}$. It can be shown that satisfying the condition $\mathbb{Q} = T_{*}(\mathbb{P})$ leads to the constraint $q\left(T(x)\right)\left|\det\left(\nabla_x T(x)\right)\right| = p(x)$ for all $x \in \mathcal{X}$. Since this constraint is non-convex, the uniqueness of the minimization problem is not necessarily guaranteed. This is why Kantorovich reformulated the problem as follows:
$$
\inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})}\left\{\int_{\mathcal{X}\times\mathcal{Y}} c(x, y)\,\mathrm{d}\mathbb{F}(x, y)\right\} \tag{43}
$$
where $\mathcal{F}(\mathbb{P},\mathbb{Q})$ is the Fréchet class$^{37}$. The infimum is then obtained by considering all joint probability measures $\mathbb{F}$ on $\mathcal{X}\times\mathcal{Y}$ such that $\mathbb{P}$ and $\mathbb{Q}$ are the marginals. Such joint probability measures are called transportation plans. The Kantorovich transportation problem is now convex and becomes a linear programming problem that is easier to solve than the Monge transportation problem. However, searching among all joint probability measures $\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})$ can also be computationally intractable. For instance, Seguy et al. (2017) recall that solving the linear program takes $\mathcal{O}\left(n^3 \log n\right)$ operations when $n$ is the size of the support in the case of discrete probability distributions.
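For small discrete distributions, the Kantorovich problem can nevertheless be solved directly as a linear program over the transportation plan. A minimal sketch (the function name is ours), using scipy's `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(p, q, cost):
    # Minimize <C, F> over transportation plans F >= 0
    # subject to the marginal constraints: row sums = p, column sums = q
    n, m = len(p), len(q)
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0  # i-th row of F sums to p_i
    for j in range(m):
        a_eq[n + j, j::m] = 1.0  # j-th column of F sums to q_j
    res = linprog(np.ravel(cost), A_eq=a_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None))
    return res.fun

# The cheapest plan moves 0.3 of mass from the first point to the second one
w = kantorovich_lp([0.7, 0.3], [0.4, 0.6], [[0.0, 1.0], [1.0, 0.0]])
```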
Remark 10. Let us consider the particular case where $\mathbb{P}$ and $\mathbb{Q}$ are continuous Lebesgue measures and the cost function $c$ is the $p$-th power of a Euclidean distance $d(x, y)$. The solution to the Monge-Kantorovich problem is then defined as the $p$-Wasserstein distance$^{38}$:

$$
\mathcal{W}_p(\mathbb{P}, \mathbb{Q}) = \left(\inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})} \int_{\mathcal{X}\times\mathcal{Y}} d(x, y)^{p}\,\mathrm{d}\mathbb{F}(x, y)\right)^{1/p}
$$

Brenier (1991) showed that there is a unique solution when $p > 1$.
$^{37}$ This means that $\mathcal{F}(\mathbb{P},\mathbb{Q})$ collects all multivariate joint distributions whose marginals are exactly equal to $\mathbb{P}$ and $\mathbb{Q}$.
$^{38}$ It is also known as the earth mover's distance (EMD), which has been used by Rubner et al. (2000) for content-based image retrieval in computer vision.
Remark 11. In some particular cases, optimal transport problems can be easily solved. Let us consider the case where $\mathcal{X} = \mathcal{Y}$ is a one-dimensional space, and $c(x, y)$ is a convex function that satisfies the following condition: if $x_1 < x_2$ and $y_1 < y_2$, then $c(x_2, y_2) - c(x_1, y_2) - c(x_2, y_1) + c(x_1, y_1) < 0$. In this case, the optimal transport plan respects the ordering of the elements. Consequently, the solution corresponds to a monotone rearrangement of $\mathbb{P}$ into $\mathbb{Q}$. Solving this problem is no more than sorting elements in a list (Brenier, 1991).
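Remark 11 can be verified numerically: for two equally weighted one-dimensional samples, the optimal coupling pairs the sorted elements. A minimal sketch (the function name is ours), assuming the cost $c(x, y) = |x - y|$:

```python
import numpy as np

def w1_distance_1d(x, y):
    # Monotone rearrangement: sort both samples and pair them in order
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    return np.mean(np.abs(x - y))

# The input ordering is irrelevant: only the sorted pairing matters
d = w1_distance_1d([3.0, 1.0, 2.0], [2.5, 0.5, 3.5])
```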
A.6.2 Dual formulation of optimal transport
Another face of the optimal transport problem is the Kantorovich duality, which can be easily understood from an economic point of view. Following closely the example given by Villani (2008), we consider a manufacturer that produces goods on a production site located at $x$ and sells them in a store located at $y$. $c(x, y)$ is the cost of transporting goods from $x$ to $y$. Minimizing his transportation cost for the entire production is equivalent to solving the Monge-Kantorovich problem. Now, let us assume that he does not care about transportation. This is why he wants to hire a company specialized in goods transportation. They offer to buy each product from him at the price $\varphi(x)$. They will then transport it to $y$ and sell the product back to him at the price $\psi(y)$. The manufacturer will accept the deal only if there is a financial interest, that is if $\psi(y) - \varphi(x) \le c(x, y)$. Despite this constraint, the transportation company will try to maximize its profits. The new problem corresponds to the dual formulation of the Monge-Kantorovich problem and can be written as follows:

$$
\sup_{(\varphi,\psi)}\left\{\int_{\mathcal{Y}} \psi(y)\,\mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}} \varphi(x)\,\mathrm{d}\mathbb{P}(x) : \psi(y) - \varphi(x) \le c(x, y)\right\} \tag{44}
$$
According to Villani (2008), $\varphi$ and $\psi$ are integrable such that $(\varphi, \psi) \in L^1(\mathcal{X}, \mathbb{P}) \times L^1(\mathcal{Y}, \mathbb{Q})$. The functions $\varphi$ and $\psi$ are often called Kantorovich potentials$^{39}$.
Remark 12. Considering the viewpoint of the manufacturer, who cares about the cost, is equivalent to looking at the solution of the Monge-Kantorovich primal problem. Considering the viewpoint of the transportation company, which cares about optimizing the profit, is equivalent to looking at the solution of the Monge-Kantorovich dual problem.
A rigorous proof of the dual formulation$^{40}$ is given by Villani (2008). We recall that the condition $\mathbb{F} \in \mathcal{F}(\mathbb{P},\mathbb{Q})$ implies:

$$
\begin{aligned}
\int_{\mathcal{X}\times\mathcal{Y}} \left(\psi(y) - \varphi(x)\right)\mathrm{d}\mathbb{F}(x,y) &= \int_{\mathcal{X}\times\mathcal{Y}} \psi(y)\,\mathrm{d}\mathbb{F}(x,y) - \int_{\mathcal{X}\times\mathcal{Y}} \varphi(x)\,\mathrm{d}\mathbb{F}(x,y) \\
&= \int_{\mathcal{Y}}\int_{\mathcal{X}} \psi(y)\,\mathrm{d}\mathbb{F}(x,y) - \int_{\mathcal{X}}\int_{\mathcal{Y}} \varphi(x)\,\mathrm{d}\mathbb{F}(x,y) \\
&= \int_{\mathcal{Y}} \psi(y)\,\mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}} \varphi(x)\,\mathrm{d}\mathbb{P}(x)
\end{aligned} \tag{45}
$$
If $\mathbb{F} \notin \mathcal{F}(\mathbb{P},\mathbb{Q})$, we assume by convention that the difference between the two members is
$^{39}$ Brenier (1991, Theorem 1.3) showed the link between Kantorovich potentials and mapping functions in the particular case where the cost function is a 2-Euclidean distance:

$$
T(x) = x - \nabla_x \varphi(x) = \nabla_x \left(\frac{1}{2}\left\|x\right\|^2 - \varphi(x)\right)
$$

$^{40}$ See also Xia (2008, 2009) for a geometric interpretation.
infinite, and we note $\mathbb{1}_{\mathcal{F}(\mathbb{P},\mathbb{Q})}(\mathbb{F})$ the convex indicator function of $\mathcal{F}(\mathbb{P},\mathbb{Q})$:

$$
\begin{aligned}
(*) &= \inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})}\left\{\int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,\mathrm{d}\mathbb{F}(x,y)\right\} \\
&= \inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})}\left\{\int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,\mathrm{d}\mathbb{F}(x,y) + \mathbb{1}_{\mathcal{F}(\mathbb{P},\mathbb{Q})}(\mathbb{F})\right\} \\
&= \inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})}\left\{\int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,\mathrm{d}\mathbb{F}(x,y) + \sup_{(\varphi,\psi)}\left\{\int_{\mathcal{Y}}\psi(y)\,\mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}}\varphi(x)\,\mathrm{d}\mathbb{P}(x) - \int_{\mathcal{X}\times\mathcal{Y}}\left(\psi(y)-\varphi(x)\right)\mathrm{d}\mathbb{F}(x,y)\right\}\right\} \\
&= \inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})}\sup_{(\varphi,\psi)} \Gamma(\mathbb{P},\mathbb{Q},\mathbb{F},\varphi,\psi)
\end{aligned}
$$
where:

$$
\Gamma(\mathbb{P},\mathbb{Q},\mathbb{F},\varphi,\psi) = \int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,\mathrm{d}\mathbb{F}(x,y) + \int_{\mathcal{Y}}\psi(y)\,\mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}}\varphi(x)\,\mathrm{d}\mathbb{P}(x) - \int_{\mathcal{X}\times\mathcal{Y}}\left(\psi(y)-\varphi(x)\right)\mathrm{d}\mathbb{F}(x,y)
$$
Since we have:

$$
\inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})}\sup_{(\varphi,\psi)} \Gamma(\mathbb{P},\mathbb{Q},\mathbb{F},\varphi,\psi) = \sup_{(\varphi,\psi)}\inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})} \Gamma(\mathbb{P},\mathbb{Q},\mathbb{F},\varphi,\psi)
$$
it follows that:

$$
\inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})} \Gamma(\mathbb{P},\mathbb{Q},\mathbb{F},\varphi,\psi) = \int_{\mathcal{Y}}\psi(y)\,\mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}}\varphi(x)\,\mathrm{d}\mathbb{P}(x) + \inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})}\left\{\int_{\mathcal{X}\times\mathcal{Y}}\left(c(x,y) - \left(\psi(y)-\varphi(x)\right)\right)\mathrm{d}\mathbb{F}(x,y)\right\}
$$
Finally, we conclude that$^{41}$:

$$
\inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})}\left\{\int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,\mathrm{d}\mathbb{F}(x,y)\right\} = \sup_{(\varphi,\psi)}\left\{\int_{\mathcal{Y}}\psi(y)\,\mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}}\varphi(x)\,\mathrm{d}\mathbb{P}(x)\right\}
$$

because:

$$
\inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})}\left\{\int_{\mathcal{X}\times\mathcal{Y}}\left(c(x,y) - \left(\psi(y)-\varphi(x)\right)\right)\mathrm{d}\mathbb{F}(x,y)\right\} = 0
$$
A.6.3 Semi-dual formulation of optimal transport
It is possible to go deeper in the proof by introducing the notion of c-convexity (Villani, 2008). The function $\varphi : \mathcal{X} \to \mathbb{R} \cup \{+\infty\}$ is said to be c-convex if there exists a function $\zeta : \mathcal{Y} \to \mathbb{R} \cup \{\pm\infty\}$ such that:

$$
\varphi(x) = \sup_{y \in \mathcal{Y}}\left\{\zeta(y) - c(x,y)\right\}
$$
for all $x \in \mathcal{X}$. With this definition, it is possible to define its c-transform $\varphi^{c}$:

$$
\varphi^{c}(y) = \inf_{x \in \mathcal{X}}\left\{\varphi(x) + c(x,y)\right\}
$$
$^{41}$ Of course, the constraint $\psi(y) - \varphi(x) \le c(x,y)$ must be satisfied.
for all $y \in \mathcal{Y}$. In the particular case where the cost function is a distance, a c-convex function is simply a 1-Lipschitz function and is equal to its c-transform. Indeed, let us consider that $\varphi$ is 1-Lipschitz such that $\varphi(x) - \varphi(y) \le c(x,y)$. We have $\varphi(x) \le \varphi(y) + c(x,y)$ and:

$$
\varphi(x) = \inf_{y}\left\{\varphi(y) + c(x,y)\right\} = \varphi^{c}(x)
$$
Again, it is possible to understand the c-transform from the economic point of view. Let us recall that the transportation company needs to satisfy the condition $\psi(y) - \varphi(x) \le c(x,y)$ in order to remain competitive. It follows that $\psi(y) \le \varphi(x) + c(x,y)$ and $\varphi(x) \ge \psi(y) - c(x,y)$. To maximize its profits, the company will choose the pair $(\varphi, \psi)$ such that:

$$
\begin{aligned}
\psi(y) &= \inf_{x}\left\{\varphi(x) + c(x,y)\right\} \\
\varphi(x) &= \sup_{y}\left\{\psi(y) - c(x,y)\right\}
\end{aligned}
$$
Therefore, it becomes useful to write $\psi$ in terms of $\varphi$. If we consider that the cost function is a distance and $\mathcal{X} = \mathcal{Y}$, Villani (2008) showed that:

$$
\begin{aligned}
(*) &= \inf_{\mathbb{F}\in\mathcal{F}(\mathbb{P},\mathbb{Q})}\left\{\int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,\mathrm{d}\mathbb{F}(x,y)\right\} \\
&= \sup_{(\varphi,\psi)}\left\{\int_{\mathcal{Y}} \psi(y)\,\mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}} \varphi(x)\,\mathrm{d}\mathbb{P}(x) : \psi(y) - \varphi(x) \le c(x,y)\right\} \\
&= \sup_{\varphi}\left\{\int_{\mathcal{Y}} \varphi^{c}(y)\,\mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}} \varphi(x)\,\mathrm{d}\mathbb{P}(x)\right\} \\
&= \sup_{\varphi}\left\{\int_{\mathcal{Y}} \varphi(y)\,\mathrm{d}\mathbb{Q}(y) - \int_{\mathcal{X}} \varphi(x)\,\mathrm{d}\mathbb{P}(x)\right\} \\
&= \sup_{\varphi}\left\{\mathbb{E}\left[\varphi(Y) \mid Y \sim \mathbb{Q}\right] - \mathbb{E}\left[\varphi(X) \mid X \sim \mathbb{P}\right]\right\}
\end{aligned}
$$
This is the semi-dual formulation of the problem, also called the Kantorovich-Rubinstein duality. This formulation is used to train the Wasserstein GAN, which estimates the optimal function $\varphi$.
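The last expression can be estimated by Monte Carlo for any candidate potential $\varphi$, which is what the Wasserstein GAN critic maximizes over a family of 1-Lipschitz functions. A minimal sketch (function name and distributions are ours), where the identity map is an optimal potential for a pure location shift:

```python
import numpy as np

def kr_objective(phi, y_q, x_p):
    # Monte Carlo estimate of E[phi(Y) | Y ~ Q] - E[phi(X) | X ~ P]
    return np.mean(phi(np.asarray(y_q))) - np.mean(phi(np.asarray(x_p)))

rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, 100_000)  # X ~ P = N(0, 1)
y_q = rng.normal(2.0, 1.0, 100_000)  # Y ~ Q = N(2, 1)

# phi(x) = x is 1-Lipschitz and attains the supremum here, so the
# estimate converges to the 1-Wasserstein distance |mu| = 2
est = kr_objective(lambda x: x, y_q, x_p)
```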
A.6.4 An example

If $\mathbb{P} \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $\mathbb{Q} \sim \mathcal{N}(\mu_2, \Sigma_2)$, Givens and Shortt (1984) showed that the 2-Wasserstein distance is equal to:

$$
\mathcal{W}_2(\mathbb{P}, \mathbb{Q}) = \sqrt{\left\|\mu_1 - \mu_2\right\|^2 + \operatorname{tr}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\right)^{1/2}\right)} \tag{46}
$$

where $A^{1/2}$ is the square root of $A$.
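Equation (46) only requires a matrix square root. A minimal sketch (the function name is ours), using scipy's `sqrtm`:

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, sigma1, mu2, sigma2):
    # Closed-form 2-Wasserstein distance between N(mu1, Sigma1) and N(mu2, Sigma2)
    s1_half = sqrtm(sigma1)
    cross = sqrtm(s1_half @ sigma2 @ s1_half)
    trace_term = np.real(np.trace(sigma1 + sigma2 - 2.0 * cross))
    return np.sqrt(np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2) + trace_term)

# Univariate case: W2 reduces to sqrt((mu1 - mu2)^2 + (sigma1 - sigma2)^2)
dist = gaussian_w2([0.0], np.array([[1.0]]), [3.0], np.array([[4.0]]))
```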
A.7 Converting real-valued samples into binary features

These transformation methods were introduced by Kondratyev and Schwarz (2019). Algorithm (4) describes how to transform real-valued data into binary features. Each one-dimensional data sample is represented by a 16-digit binary number; in the case of $n$-dimensional data, we respectively transform each single value into a 16-digit binary vector and concatenate the results to form a $16 \times n$-digit binary vector.
Algorithm 4 Real-valued to integer to binary transformation

Result: conversion of a real-valued dataset into binary vectors
Input: a real-valued dataset $X_{\mathrm{real}}$ with $N$ samples, a margin $\epsilon \geq 0$
  $X_{\min} \leftarrow \min(X_{\mathrm{real}}) - \epsilon$
  $X_{\max} \leftarrow \max(X_{\mathrm{real}}) + \epsilon$
  for $l = 1, \ldots, N$ do
    $X_{\mathrm{integer}}^{(l)} \leftarrow \mathrm{int}\left(65535 \times \left(X_{\mathrm{real}}^{(l)} - X_{\min}\right) / \left(X_{\max} - X_{\min}\right)\right)$
    $X_{\mathrm{binary}}^{(l)} \leftarrow \mathrm{binarize}\left(X_{\mathrm{integer}}^{(l)}\right)$
  end for
Algorithm (5) performs the inverse transformation. Similarly, in the case of $n$-dimensional data, we respectively transform each group of 16 binary digits into a real value and concatenate the results to form an $n$-dimensional real-valued vector.
Algorithm 5 Binary to integer to real-valued transformation

Result: conversion of a binary vector into a real-valued sample
Input: a 16-digit binary vector $\hat{X} = (\hat{X}_1, \ldots, \hat{X}_{16})$
  $\hat{X}_{\mathrm{integer}} \leftarrow 0$
  for $i = 1, \ldots, 16$ do
    $\hat{X}_{\mathrm{integer}} \leftarrow \hat{X}_{\mathrm{integer}} + 2^{i-1} \times \hat{X}_{17-i}$
  end for
  $\hat{X}_{\mathrm{real}} \leftarrow X_{\min} + \hat{X}_{\mathrm{integer}} \times (X_{\max} - X_{\min}) / 65535$
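Algorithms (4) and (5) can be sketched in a few lines (function names are ours; bit shifts replace the explicit binarize routine, and the binary vector is stored most significant bit first):

```python
def real_to_binary(x, x_min, x_max):
    # Algorithm 4: map a real value in [x_min, x_max] to a 16-digit binary
    # vector through a 16-bit integer grid (0, ..., 65535)
    x_int = int(65535 * (x - x_min) / (x_max - x_min))
    return [(x_int >> (16 - i)) & 1 for i in range(1, 17)]

def binary_to_real(bits, x_min, x_max):
    # Algorithm 5: rebuild the integer from the digits (the least significant
    # bit is the last element) and map it back to the real line
    x_int = sum(2 ** (i - 1) * bits[16 - i] for i in range(1, 17))
    return x_min + x_int * (x_max - x_min) / 65535

b = real_to_binary(0.25, 0.0, 1.0)
x = binary_to_real(b, 0.0, 1.0)  # recovers 0.25 up to the 1/65535 grid
```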
Chief Editors
Pascal BLANQUÉ
Chief Investment Officer

Philippe ITHURBIDE
Senior Economic Advisor
Working Paper
July 2020

DISCLAIMER

In the European Union, this document is only for the attention of “Professional” investors as defined in Directive 2004/39/
EC dated 21 April 2004 on markets in financial instruments (“MIFID”), to investment services providers and any other
professional of the financial industry, and as the case may be in each local regulations and, as far as the offering in
Switzerland is concerned, a “Qualified Investor” within the meaning of the provisions of the Swiss Collective Investment
Schemes Act of 23 June 2006 (CISA), the Swiss Collective Investment Schemes Ordinance of 22 November 2006 (CISO)
and the FINMA’s Circular 08/8 on Public Advertising under the Collective Investment Schemes legislation of 20 November
2008. In no event may this material be distributed in the European Union to non “Professional” investors as defined in
the MIFID or in each local regulation, or in Switzerland to investors who do not comply with the definition of “qualified
investors” as defined in the applicable legislation and regulation. This document is not intended for citizens or residents of
the United States of America or to any «U.S. Person» , as this term is defined in SEC Regulation S under the U.S. Securities
Act of 1933.

This document neither constitutes an offer to buy nor a solicitation to sell a product, and shall not be considered as an
unlawful solicitation or an investment advice.

Amundi accepts no liability whatsoever, whether direct or indirect, that may arise from the use of information contained in
this material. Amundi can in no way be held responsible for any decision or investment made on the basis of information
contained in this material. The information contained in this document is disclosed to you on a confidential basis and
shall not be copied, reproduced, modified, translated or distributed without the prior written approval of Amundi, to any
third person or entity in any country or jurisdiction which would subject Amundi or any of “the Funds”, to any registration
requirements within these jurisdictions or where it might be considered as unlawful. Accordingly, this material is for
distribution solely in jurisdictions where permitted and to persons who may receive it without breaching applicable legal
or regulatory requirements.

The information contained in this document is deemed accurate as at the date of publication set out on the first page of
this document. Data, opinions and estimates may be changed without notice.

You have the right to receive information about the personal information we hold on you. You can obtain a copy of the
information we hold on you by sending an email to [email protected]. If you are concerned that any of the information
we hold on you is incorrect, please contact us at [email protected]

Document issued by Amundi, “société par actions simplifiée”- SAS with a capital of €1,086,262,605 - Portfolio manager
regulated by the AMF under number GP04000036 – Head office: 90 boulevard Pasteur – 75015 Paris – France – 437 574
452 RCS Paris - www.amundi.com

Photo credit: iStock by Getty Images - monsitj/Sam Edwards

Find out more about


Amundi Publications
research-center.amundi.com
