Machine Learning for Factor Investing: Python Version
Machine learning (ML) is progressively reshaping the fields of quantitative finance and algorith-
mic trading. ML tools are increasingly adopted by hedge funds and asset managers, notably for
alpha signal generation and stock selection. The technicality of the subject can make it hard for
non-specialists to join the bandwagon, as the jargon and coding requirements may seem out-of-
reach. Machine learning for factor investing: Python version bridges this gap. It provides a
comprehensive tour of modern ML-based investment strategies that rely on firm characteristics.
The book covers a wide array of subjects which range from economic rationales to rigorous
portfolio back-testing and encompass both data processing and model interpretability. Common
supervised learning algorithms such as tree models and neural networks are explained in the
context of style investing and the reader can also dig into more complex techniques like autoen-
coder asset returns, Bayesian additive trees and causal models.
All topics are illustrated with self-contained Python code samples and snippets that are applied
to a large public dataset that contains over 90 predictors. The material, along with the content
of the book, is available online so that readers can reproduce and enhance the examples at their
convenience. If you have even a basic knowledge of quantitative finance, this combination of
theoretical concepts and practical illustrations will help you learn quickly and deepen your fi-
nancial and technical expertise.
Chapman & Hall/CRC Financial Mathematics Series
Reasonable efforts have been made to publish reliable data and information, but the author and publisher
cannot assume responsibility for the validity of all materials or the consequences of their use. The authors
and publishers have attempted to trace the copyright holders of all material reproduced in this publica-
tion and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any future
reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, trans-
mitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or
contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-
8400. For works that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used
only for identification and explanation without intent to infringe.
DOI: 10.1201/9781003121596
Typeset in SFRM1000 font
by KnowledgeWorks Global Ltd.
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
To Eva and Leslie.
Contents

Preface

I Introduction

1 Notations and data
   1.1 Notations
   1.2 Dataset

2 Introduction
   2.1 Context
   2.2 Portfolio construction: the workflow
   2.3 Machine learning is no magic wand

4 Data preprocessing
   4.1 Know your data
   4.2 Missing data
   4.3 Outlier detection
   4.4 Feature engineering
      4.4.1 Feature selection
      4.4.2 Scaling the predictors
   4.5 Labelling
      4.5.1 Simple labels
      4.5.2 Categorical labels

6 Tree-based methods
   6.1 Simple trees
      6.1.1 Principle
      6.1.2 Further details on classification
      6.1.3 Pruning criteria
      6.1.4 Code and interpretation
   6.2 Random forests
      6.2.1 Principle
      6.2.2 Code and results
   6.3 Boosted trees: Adaboost
      6.3.1 Methodology
      6.3.2 Illustration
   6.4 Boosted trees: extreme gradient boosting
      6.4.1 Managing loss
      6.4.2 Penalization
      6.4.3 Aggregation
      6.4.4 Tree structure
      6.4.5 Extensions
      6.4.6 Code and results
      6.4.7 Instance weighting
   6.5 Discussion
   6.6 Coding exercises

7 Neural networks
   7.1 The original perceptron
   7.2 Multilayer perceptron
      7.2.1 Introduction and notations
      7.2.2 Universal approximation
      7.2.3 Learning via back-propagation
      7.2.4 Further details on classification
   7.3 How deep we should go and other practical issues
      7.3.1 Architectural choices
      7.3.2 Frequency of weight updates and learning duration
      7.3.3 Penalizations and dropout
   7.4 Code samples and comments for vanilla MLP
      7.4.1 Regression example
      7.4.2 Classification example
      7.4.3 Custom losses
   7.5 Recurrent networks
      7.5.1 Presentation
      7.5.2 Code and results
   7.6 Other common architectures
      7.6.1 Generative adversarial networks
      7.6.2 Autoencoders
      7.6.3 A word on convolutional networks
      7.6.4 Advanced architectures
   7.7 Coding exercise

V Appendix

17 Data description

Bibliography

Index
Preface
This book is intended to cover some advanced modelling techniques applied to equity in-
vestment strategies that are built on firm characteristics. The content is threefold.
First, we try to simply explain the ideas behind most mainstream machine learning algo-
rithms that are used in equity asset allocation. Second, we mention a wide range of academic
references for the readers who wish to push a little further. Finally, we provide hands-on
Python code samples that show how to apply the concepts and tools on a realistic dataset
which we share to encourage reproducibility.
• Use cases of alternative datasets that show how to leverage textual data from
social media, satellite imagery, or credit card logs to predict sales, earning reports, and,
ultimately, future returns. The literature on this topic is still emerging (see, e.g., Blank
et al. (2019), Jha (2019) and Ke et al. (2019)) but will likely blossom in the near future.
• The book is not an introduction to the theory of machine learning; we refer to
Mohri et al. (2018) for a general treatment on the subject. Moreover, Du and
Swamy (2013) and Goodfellow et al. (2016) are solid monographs on neural networks
particularly, and Sutton and Barto (2018) provide a self-contained and comprehensive
tour in reinforcement learning.
• Finally, the book does not cover methods of natural language processing (NLP)
that can be used to evaluate sentiment which can in turn be translated into investment
decisions. This topic has nonetheless been trending lately and we refer to Loughran and
McDonald (2016), Cong et al. (2019a), Cong et al. (2019b) and Gentzkow et al. (2019)
for recent advances on the matter.
Part II of the book is dedicated to predictive algorithms in supervised learning. Those are
the most common tools that are used to forecast financial quantities (returns, volatilities,
Sharpe ratios, etc.). They range from penalized regressions (Chapter 5), to tree methods
(Chapter 6), encompassing neural networks (Chapter 7), support vector machines (Chapter
8) and Bayesian approaches (Chapter 9).
Part III of the book bridges the gap between these tools and their applications in finance.
Chapter 10 details how to assess and improve the ML engines defined beforehand. Chapter
11 explains how models can be combined, and often why that may not be a good idea. Fi-
nally, one of the most important chapters (Chapter 12) reviews the critical steps of portfolio
backtesting and mentions the frequent mistakes that are often encountered at this stage.
Part IV of the book covers a range of advanced topics connected to machine learning more
specifically. The first one is interpretability. ML models are often considered to be black
boxes and this raises trust issues: how and why should one trust ML-based predictions?
Chapter 13 is intended to present methods that help understand what is happening under
the hood. Chapter 14 is focused on causality, which is both a much more powerful concept
than correlation and also at the heart of many recent discussions in Artificial Intelligence
(AI). Most ML tools rely on correlation-like patterns and it is important to underline the
benefits of techniques related to causality. Finally, Chapters 15 and 16 are dedicated to
non-supervised methods. The latter can be useful, but their financial applications should
be wisely and cautiously motivated.
Companion website
This book is entirely available at http://www.pymlfactor.com. It is important that not only
the content of the book be accessible, but also the data and code that are used throughout
the chapters. They can be found at https://github.com/shokru. The online version of the
book will be updated beyond the publication of the printed version.
Coding instructions
One of the purposes of the book is to propose a large-scale tutorial of ML applications
in financial predictions and portfolio selection. Thus, one keyword is reproducibility! In
order to duplicate our results (up to possible randomness in some learning algorithms), you
will need running versions of Python and Anaconda on your computer.
As much as we could, we created short code chunks and commented each line whenever we
felt it was useful. Comments are displayed at the end of a row and preceded with a single
hashtag #.
The book is constructed as a very big notebook, thus results are often presented below
code chunks. They can be graphs or tables. Sometimes, they are simple numbers and are
preceded with two hashtags ##. The example below illustrates this formatting.
1+2 # Example
## 3
The book can be viewed as a very big tutorial. Therefore, most of the chunks depend on
previously defined variables. When replicating parts of the code (via online code), please
make sure that the environment includes all relevant variables. One best practice
is to always start by running all code chunks from Chapter 1. For the exercises, we often
resort to variables created in the corresponding chapters.
Acknowledgments
The core of the book was prepared for a series of lectures given by one of the authors to
students of master’s degrees in finance at EMLYON Business School and at the Imperial
College Business School in the Spring of 2019. We are grateful to those students who asked
fruitful questions and thereby contributed to improve the content of the book.
We are grateful to Bertrand Tavin and Gautier Marti for their thorough screening of
the book. We also thank Eric André, Aurélie Brossard, Alban Cousin, Frédérique Girod,
Philippe Huber, Jean-Michel Maeso and Javier Nogales for friendly reviews; Christophe
Dervieux for his help with bookdown; Mislav Sagovac and Vu Tran for their early feedback;
Lara Spieker and John Kimmel for making this book happen and Jonathan Regenstein
for his availability, no matter the topic. Lastly, we are grateful for the anonymous reviews
collected by John, our original editor.
Future developments
Machine learning and factor investing are two immense research domains and the overlap
between the two is also quite substantial and developing at a fast pace. The content of
this book will always constitute a solid background, but it is naturally destined to obso-
lescence. Moreover, by construction, some subtopics and many references will have escaped
our scrutiny. Our intent is to progressively improve the content of the book and update it
with the latest ongoing research. We will be grateful to any comment that helps correct or
update the monograph. Thank you for sending your feedback directly (via pull requests) on
the book’s website which is hosted at https://github.com/shokru.
Part I
Introduction
1 Notations and data
1.1 Notations
This section aims at providing the formal mathematical conventions that will be used
throughout the book.
Bold notations indicate vectors and matrices. We use capital letters for matrices and lowercase letters for vectors. $\mathbf{v}'$ and $\mathbf{M}'$ denote the transposes of $\mathbf{v}$ and $\mathbf{M}$. $\mathbf{M} = [m]_{i,j}$, where $i$ is the row index and $j$ the column index.
We will work with two notations in parallel. The first one is the pure machine learning notation in which the labels (also called output, dependent variables or predicted variables) $\mathbf{y} = (y_i)$ are approximated by functions of features $\mathbf{X}_i = (x_{i,1},\dots,x_{i,K})$. The dimension of the features matrix $\mathbf{X}$ is $I \times K$: there are $I$ instances, records, or observations and each one of them has $K$ attributes, features, inputs, or predictors which will serve as independent and explanatory variables (all these terms will be used interchangeably). Sometimes, to ease notations, we will write $\mathbf{x}_i$ for one instance (one row) of $\mathbf{X}$ or $\mathbf{x}_k$ for one (feature) column vector of $\mathbf{X}$.
The second notation type pertains to finance and will directly relate to the first. We will often work with discrete returns $r_{t,n} = p_{t,n}/p_{t-1,n} - 1$ computed from price data. Here $t$ is the time index and $n$ the asset index. Unless specified otherwise, the return is always computed over one period, though this period can sometimes be one month or one year. Whenever confusion might occur, we will specify other notations for returns.
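As a quick illustration of this convention, the snippet below computes such one-period returns from a small table of prices (the toy numbers are purely illustrative):

import pandas as pd

prices = pd.DataFrame({"asset_1": [100.0, 102.0, 99.96],   # toy prices, purely illustrative
                       "asset_2": [50.0, 51.5, 51.0]})
returns = prices / prices.shift(1) - 1                      # r_{t,n} = p_{t,n}/p_{t-1,n} - 1
# equivalently: returns = prices.pct_change()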
In line with our previous conventions, the number of return dates will be $T$ and the number of assets, $N$. The features or characteristics of assets will be denoted with $x_{t,n}^{(k)}$: it is the time-$t$ value of the $k^{th}$ attribute of firm or asset $n$. In stacked notation, $\mathbf{x}_{t,n}$ will stand for the vector of characteristics of asset $n$ at time $t$. Moreover, $\mathbf{r}_t$ stands for all returns at time $t$, while $\mathbf{r}_n$ stands for all returns of asset $n$. Often, returns will play the role of the dependent variable, or label (in ML terms). For the riskless asset, we will use the notation $r_{t,f}$.
The link between the two notations will most of the time be the following. One instance
(or observation) i will consist of one couple (t, n) of one particular date and one particular
firm (if the data is perfectly rectangular with no missing field, $I = T \times N$). The label will
usually be some performance measure of the firm computed over some future period, while
the features will consist of the firm attributes at time-t. Hence, the purpose of the machine
learning engine in factor investing will be to determine the model that maps the time-t
characteristics of firms to their future performance.
In terms of canonical matrices: $I_N$ will denote the $(N \times N)$ identity matrix.
From the probabilistic literature, we employ the expectation operator $E[\cdot]$ and the conditional expectation $E_t[\cdot]$, where the corresponding filtration $\mathcal{F}_t$ corresponds to all information available at time $t$. More precisely, $E_t[\cdot] = E[\cdot \,|\, \mathcal{F}_t]$. $V[\cdot]$ will denote the variance operator. Depending on the context, probabilities will be written simply $P$, but sometimes we will use the heavier notation $\mathbb{P}$. Probability density functions (pdfs) will be denoted with lowercase letters ($f$) and cumulative distribution functions (cdfs) with uppercase letters ($F$). We will write equality in distribution as $X \overset{d}{=} Y$, which is equivalent to $F_X(z) = F_Y(z)$ for all $z$ on the support of the variables. For a random process $X_t$, we say that it is stationary if the law of $X_t$ is constant through time, i.e., $X_t \overset{d}{=} X_s$, where $\overset{d}{=}$ means equality in distribution.
Sometimes, asymptotic behaviors will be characterized with the usual Landau notation $o(\cdot)$ and $O(\cdot)$. The symbol $\propto$ refers to proportionality: $x \propto y$ means that $x$ is proportional to $y$. With respect to derivatives, we use the standard notation $\frac{\partial}{\partial x}$ when differentiating with respect to $x$. We resort to the compact symbol $\nabla$ when all derivatives are computed (gradient vector).
In equations, the left-hand side and right-hand side can be written more compactly as l.h.s.
and r.h.s., respectively.
Finally, we turn to functions. We list a few below:
- $1_{\{x\}}$: the indicator function of the condition $x$, which is equal to one if $x$ is true and to zero otherwise.
- $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard Gaussian pdf and cdf.
- card$(\cdot) = \#(\cdot)$ are two notations for the cardinal function which evaluates the number of elements in a given set (provided as argument of the function).
- $\lfloor \cdot \rfloor$ is the integer part function.
- For a real number $x$, $[x]^+$ is the positive part of $x$, that is $\max(0, x)$.
- $\tanh(\cdot)$ is the hyperbolic tangent: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$.
- ReLU$(\cdot)$ is the rectified linear unit: ReLU$(x) = \max(0, x)$.
- $s(\cdot)$ will be the softmax function: $s(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^J e^{x_j}}$, where the subscript $i$ refers to the $i^{th}$ element of the vector.
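For concreteness, the small numpy sketch below implements two of these functions; it is only an illustration and is not part of the book's toolchain:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)                # ReLU(x) = max(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))                # shifting by the max improves numerical stability
    return e / e.sum()                       # s(x)_i = exp(x_i) / sum_j exp(x_j)

softmax(np.array([1.0, 2.0, 3.0])).sum()     # the softmax outputs sum to one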
1.2 Dataset
Throughout the book, and for the sake of reproducibility, we will illustrate the concepts
we present with examples of implementation based on a single financial dataset available
at https://github.com/shokru/mlfactor.github.io/tree/master/material. This dataset com-
prises information on 1,207 stocks listed in the US (possibly originating from Canada or
Mexico). The time range starts in November 1998 and ends in March 2019. For each point
in time, 93 characteristics describe the firms in the sample. These attributes cover a wide
range of topics: valuation (earning yields, accounting ratios); profitability and quality (re-
turn on equity); momentum and technical analysis (past returns, relative strength index);
risk (volatilities); estimates (earnings-per-share); volume and liquidity (share turnover).
The sample is not perfectly rectangular: there are no missing points, but the number of
firms and their attributes is not constant through time. This makes the computations in
the backtest more tricky, but also more realistic.
The data has 99 columns and 268336 rows. The first two columns indicate the stock identifier
and the date. The next 93 columns are the features (see Table 17.1 in the Appendix for
details). The last four columns are the labels. The points are sampled at the monthly
frequency. As is always the case in practice, the number of assets changes with time, as is
shown in Figure 1.1.
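A minimal loading sketch is shown below; the file name data_ml.csv is an assumption, so adapt the path and format to the version of the dataset you downloaded from the companion repository:

import pandas as pd

data_ml = pd.read_csv("data_ml.csv", parse_dates=["date"])  # hypothetical file name/path
data_ml.shape                                               # expected: (268336, 99)
data_ml.columns[:2]                                         # stock identifier and date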
There are four immediate labels in the dataset: R1M_Usd, R3M_Usd, R6M_Usd and
R12M_Usd, which correspond to the 1-month, 3-month, 6-month and 12-month fu-
ture/forward returns of the stocks. The returns are total returns, that is, they incorporate
potential dividend payments over the considered periods. This is a better proxy of financial
gain compared to price returns only. We refer to the analysis of Hartzmark and Solomon
(2019) for a study on the impact of decoupling price returns and dividends. These labels
are located in the last four columns of the dataset. We provide their descriptive statistics
below.
## Label mean sd min max
## 1 R12M_Usd 0.137 0.738 -0.991 96.0
## 2 R1M_Usd 0.0127 0.176 -0.922 30.2
## 3 R3M_Usd 0.0369 0.328 -0.929 39.4
## 4 R6M_Usd 0.0723 0.527 -0.98 107.
In anticipation for future models, we keep the name of the predictors in memory. In addition,
we also keep a much shorter list of predictors.
features=list(data_ml.iloc[:,3:95].columns)
# Keep the feature's column names (hard-coded, beware!)
features_short =["Div_Yld", "Eps", "Mkt_Cap_12M_Usd",
"Mom_11M_Usd", "Ocf", "Pb", "Vol1Y_Usd"]
The predictors have been uniformized, that is, for any given feature and time point, the
distribution is uniform. Given 1,207 stocks, the graph below cannot display a perfect rect-
angle.
col_feat_Div_Yld=data_ml.columns.get_loc('Div_Yld')
# finding the location of the column/feature Div_Yld
is_custom_date =data_ml['date']=='2000-02-29'
# creating a Boolean index to filter on
data_ml[is_custom_date].iloc[:,[col_feat_Div_Yld]].hist(bins=100)
# using the hist
plt.ylabel('count')
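The uniformization itself was performed upstream of the dataset, but a possible way to obtain such features from raw characteristics is to map each value to its cross-sectional rank, as in the sketch below (the toy column names and values are illustrative):

import pandas as pd

def uniformize(x):
    return x.rank(pct=True)                  # cross-sectional quantile in (0, 1]

toy = pd.DataFrame({"date": ["2000-01-31"] * 3 + ["2000-02-29"] * 3,
                    "raw_feature": [3.0, 1.0, 2.0, 10.0, 4.0, 8.0]})  # illustrative values
toy["feature_u"] = toy.groupby("date")["raw_feature"].transform(uniformize)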
The original labels (future returns) are numerical and will be used for regression exercises,
that is, when the objective is to predict a scalar real number. Sometimes, the exercises can
be different and the purpose may be to forecast categories (also called classes), like “buy”,
“hold” or “sell”. In order to be able to perform this type of classification analysis, we create
additional labels that are categorical.
The new labels are binary: they are equal to 1 (true) if the original return is above that of
the median return over the considered period and to 0 (false) if not. Hence, at each point
in time, half of the sample has a label equal to 0 and the other half to 1: some stocks
overperform and others underperform.
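A possible construction of these binary labels is sketched below; the "_C" suffix for the categorical columns is an assumption, not necessarily the naming used in the companion code:

for horizon in ["R1M_Usd", "R3M_Usd", "R6M_Usd", "R12M_Usd"]:
    median_by_date = data_ml.groupby("date")[horizon].transform("median")  # median return at each date
    data_ml[horizon + "_C"] = (data_ml[horizon] > median_by_date).astype(int)  # 1 if above the median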
In machine learning, models are estimated on one portion of data (training set) and then
tested on another portion of the data (testing set) to assess their quality. We split our
sample accordingly.
separation_date = "2014-01-15"
idx_train=data_ml.index[(data_ml['date']< separation_date)].tolist()
idx_test=data_ml.index[(data_ml['date']>= separation_date)].tolist()
We also keep in memory a few key variables, like the list of asset identifiers and a rectangular
version of returns. For simplicity, in the computation of the latter, we shrink the investment
universe to keep only the stocks for which we have the maximum number of points.
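One possible implementation is sketched below; the variable names (stock_ids, returns) are reused in later chunks, but the exact construction in the companion code may differ:

stock_ids = data_ml["stock_id"].unique()                      # all asset identifiers
n_dates = data_ml.groupby("stock_id")["date"].nunique()       # number of dates per stock
full_ids = n_dates[n_dates == n_dates.max()].index            # stocks with the full history
returns = (data_ml[data_ml["stock_id"].isin(full_ids)]
           .pivot(index="date", columns="stock_id", values="R1M_Usd"))  # rectangular return matrix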
2 Introduction

Conclusions often echo introductions. This chapter was completed at the very end of the
writing of the book. It outlines principles and ideas that are probably more relevant than
the sum of technical details covered subsequently. When stuck with disappointing results,
we advise the reader to take a step away from the algorithm and come back to this section
to get a broader perspective of some of the issues in predictive modelling.
2.1 Context
The blossoming of machine learning in factor investing has its source at the confluence
of three favorable developments: data availability, computational capacity, and economic
groundings.
First, the data. Nowadays, classical providers, such as Bloomberg and Reuters have seen
their playing field invaded by niche players and aggregation platforms.1 In addition, high-
frequency data and derivative quotes have become mainstream. Hence, firm-specific at-
tributes are easy and often cheap to compile. This means that the size of X in (2.1) is now
sufficiently large to be plugged into ML algorithms. The order of magnitude (in 2019) that
can be reached is the following: a few hundred monthly observations over several thousand
stocks (US listed at least) covering a few hundred attributes. This makes a dataset of dozens
of millions of points. While it is a reasonably high figure, we highlight that the chronological
depth is probably the weak point and will remain so for decades to come because accounting
figures are only released on a quarterly basis. Needless to say that this drawback does not
hold for high-frequency strategies.
Second, computational power, both through hardware and software. Storage and pro-
cessing speed are not technical hurdles anymore and models can even be run on the cloud
thanks to services hosted by major actors (Amazon, Microsoft, IBM and Google) and by
smaller players (Rackspace, Techila). On the software side, open source has become the
norm, funded by corporations (TensorFlow & Keras by Google, Pytorch by Facebook, h2o,
etc.), universities (Scikit-Learn by INRIA, NLPCore by Stanford, NLTK by UPenn) and
small groups of researchers (caret, xgboost, tidymodels, to list but a pair of frameworks).
Consequently, ML is no longer the private turf of a handful of expert computer scientists,
but is on the contrary accessible to anyone willing to learn and code.
Finally, economic framing. Machine learning applications in finance were initially in-
troduced by computer scientists and information system experts (e.g., Braun and Chan-
dler (1987), White (1988)) and exploited shortly after by academics in financial economics
1 We refer to https://alternativedata.org/data-providers/ for a list of alternative data providers. Moreover, we recall that Quandl, an alt-data hub, was acquired by Nasdaq in December 2018. As large players acquire newcomers, the field may consolidate.
(Bansal and Viswanathan (1993)), and hedge funds (see, e.g., Zuckerman (2019)). Non-
linear relationships then became more mainstream in asset pricing (Freeman and Tse (1992),
Bansal et al. (1993)). These contributions started to pave the way for the more brute-force
approaches that have blossomed since the 2010 decade and which are mentioned throughout
the book.
In the synthetic proposal of Arnott et al. (2019b), the first piece of advice is to rely on a
model that makes sense economically. We agree with this stance, and the only assumption
that we make in this book is that future returns depend on firm characteristics. The relation-
ship between these features and performance is largely unknown and probably time-varying.
This is why ML can be useful: to detect some hidden patterns beyond the documented asset
pricing anomalies. Moreover, dynamic training allows the models to adapt to changing market conditions.
2.2 Portfolio construction: the workflow

At the heart of the process lies a model that links the features to the labels,

$$\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\epsilon}, \qquad (2.1)$$

where the error term $\boldsymbol{\epsilon}$ gathers everything that the features fail to explain.
Even if the overall process, depicted in Figure 2.1, seems very sequential, it is more judicious
to conceive it as integrated. All steps are intertwined and each part should not be dealt
with independently from the others.2 The global framing of the problem is essential, from
the choice of predictors, to the family of algorithms, not to mention the portfolio weighting
schemes (see Chapter 12 for the latter).
FIGURE 2.1: Simplified workflow of ML-based portfolio construction: raw inputs (price, accounting, sentiment and alternative data) are cleaned (missing points, outliers) and turned into features and labels; models (trees, neural networks, SVMs, ensembles) are trained and validated (tuning); predictions (signals) are finally converted into portfolio weights (an important step).
• Thus, researchers most of the time have to make do with simple correlation patterns, which are far less informative and robust.
2 Other approaches are nonetheless possible, as is advocated in de Prado and Fabozzi (2020).
• The no-free lunch theorem of Wolpert (1992a) imposes that the analyst formulates
views on the model. This is why economic or econometric framing is key. The
assumptions and choices that are made regarding both the dependent variables and the
explanatory features are decisive. As a corollary, data is key. The inputs given to the
models are probably much more important than the choice of the model itself.
• Everybody makes mistakes. Errors in loops or variable indexing are part of the journey.
What matters is to learn from those lapses.
To conclude, we remind the reader of this obvious truth: nothing will ever replace prac-
tice. Gathering and cleaning data, coding backtests, tuning ML models, testing weighting
schemes, debugging, starting all over again: these are all absolutely indispensable steps and
tasks that must be repeated indefinitely. There is no substitute for experience.
3 Factor investing and asset pricing anomalies
Asset pricing anomalies are the foundations of factor investing. In this chapter our aim
is twofold:
• present simple ideas and concepts: basic factor models and common empirical facts
(time-varying nature of returns and risk premia);
• provide the reader with lists of articles that go much deeper to stimulate and satisfy
curiosity.
The purpose of this chapter is not to provide a full treatment of the many topics related
to factor investing. Rather, it is intended to give a broad overview and cover the essential
themes so that the reader is guided towards the relevant references. As such, it can serve as
a short, non-exhaustive, review of the literature. The subject of factor modelling in finance
is incredibly vast and the number of papers dedicated to it is substantial and still rapidly
increasing.
The universe of peer-reviewed financial journals can be split in two. The first kind is the
academic journals. Their articles are mostly written by professors, and the audience
consists mostly of scholars. The articles are long and often technical. Prominent examples
are the Journal of Finance, the Review of Financial Studies and the Journal of Financial
Economics. The second type is more for practitioners. The papers are shorter, easier
to read, and target finance professionals predominantly. Two emblematic examples are the
Journal of Portfolio Management and the Financial Analysts Journal. This chapter reviews
and mentions articles published essentially in the first family of journals.
Beyond academic articles, several monographs are already dedicated to the topic of style
allocation (a synonym of factor investing used for instance in theoretical articles (Barberis
and Shleifer (2003)) or practitioner papers (Asness et al. (2015))). To cite but a few, we
mention:
• Ilmanen (2011): an exhaustive excursion into risk premia, across many asset classes,
with a large spectrum of descriptive statistics (across factors and periods),
• Ang (2014): covers factor investing with a strong focus on the money management
industry,
• Bali et al. (2016): very complete book on the cross-section of signals with statistical
analyses (univariate metrics, correlations, persistence, etc.),
• Jurczenko (2017): a tour on various topics given by field experts (factor purity, pre-
dictability, selection versus weighting, factor timing, etc.).
Finally, we mention a few wide-scope papers on this topic: Goyal (2012), Cazalet and Ron-
calli (2014) and Baz et al. (2015).
3.1 Introduction
The topic of factor investing, though a decades-old academic theme, has gained traction
concurrently with the rise of exchange traded funds (ETFs) as vectors of investment. Both
have gathered momentum in the 2010 decade. Not so surprisingly, the feedback loop be-
tween practical financial engineering and academic research has stimulated both sides in
a mutually beneficial manner. Practitioners rely on key scholarly findings (e.g., asset pric-
ing anomalies), while researchers dig deeper into pragmatic topics (e.g., factor exposure or
transaction costs). Recently, researchers have also tried to quantify and qualify the impact
of factor indices on financial markets. For instance, Krkoska and Schenk-Hoppé (2019) ana-
lyze herding behaviors, while Cong and Xu (2019) show that the introduction of composite
securities increases volatility and cross-asset correlations.
The core aim of factor models is to understand the drivers of asset prices. Broadly speak-
ing, the rationale behind factor investing is that the financial performance of firms depends
on factors, whether they be latent and unobservable, or related to intrinsic characteristics
(like accounting ratios for instance). Indeed, as Cochrane (2011) frames it, the first essential
question is, which characteristics really provide independent information about average re-
turns? Answering this question helps understand the cross-section of returns and may open
the door to their prediction.
Theoretically, linear factor models can be viewed as special cases of the arbitrage pricing
theory (APT) of Ross (1976), which assumes that the return of an asset n can be modelled
as a linear combination of underlying factors $f_{t,k}$:

$$r_{t,n} = \alpha_n + \sum_{k=1}^{K} \beta_{n,k} f_{t,k} + \epsilon_{t,n}, \qquad (3.1)$$
where the usual econometric constraints on linear models hold: $E[\epsilon_{t,n}] = 0$, $\text{cov}(\epsilon_{t,n}, \epsilon_{t,m}) = 0$ for $n \neq m$ and $\text{cov}(f_{t,k}, \epsilon_{t,n}) = 0$. If such factors do exist, then they are in contradiction with
the cornerstone model in asset pricing: the capital asset pricing model (CAPM) of Sharpe
(1964), Lintner (1965) and Mossin (1966). Indeed, according to the CAPM, the only driver
of returns is the market portfolio. This explains why factors are also called ‘anomalies’.
Empirical evidence of asset pricing anomalies has accumulated since the dual publication
of Fama and French (1992) and Fama and French (1993). This seminal work has paved the
way for a blossoming stream of literature that has its meta-studies (e.g., Green et al. (2013),
Harvey et al. (2016) and McLean and Pontiff (2016)). The regression (3.1) can be evaluated
once (unconditionally) or sequentially over different time frames. In the latter case, the
parameters (coefficient estimates) change and the models are thus called conditional (we
refer to Ang and Kristensen (2012) and to Cooper and Maio (2019) for recent results on
this topic as well as for a detailed review on the related research). Conditional models are
more flexible because they acknowledge that the drivers of asset prices may not be constant,
which seems like a reasonable postulate.
3.2 Detecting anomalies
3. the weight of stocks inside the portfolio is either uniform (equal weights), or propor-
tional to market capitalization;
4. at a future date (usually one month), report the returns of the portfolios.
Then, iterate the procedure until the chronological end of the sample is reached.
The outcome is a time series of portfolio returns $r_t^j$ for each grouping $j$. An anomaly is identified if the $t$-test between the first ($j = 1$) and the last group ($j = J$) unveils a
significant difference in average returns. More robust tests are described in Cattaneo et al.
(2020). A strong limitation of this approach is that the sorting criterion could have a non-
monotonic impact on returns and a test based on the two extreme portfolios would not detect
it. Several articles address this concern: Patton and Timmermann (2010) and Romano and
Wolf (2013) for instance. Another concern is that these sorted portfolios may capture not
only the priced risk associated to the characteristic, but also some unpriced risk. Daniel
et al. (2020b) show that it is possible to disentangle the two and make the most of altered
sorted portfolios.
Instead of focusing on only one criterion, it is possible to group assets according to more
characteristics. The original paper Fama and French (1992) also combines market capitaliza-
tion with book-to-market ratios. Each characteristic is divided into 10 buckets, which makes
100 portfolios in total. Beyond data availability, there is no upper bound on the number of
features that can be included in the sorting process. In fact, some authors investigate more
complex sorting algorithms that can manage a potentially large number of characteristics
(see e.g., Feng et al. (2019) and Bryzgalova et al. (2019b)).
Finally, we refer to Ledoit et al. (2020) for refinements that take into account the covariance
structure of asset returns and to Cattaneo et al. (2020) for a theoretical study on the
statistical properties of the sorting procedure (including theoretical links with regression-
based approaches). Notably, the latter paper discusses the optimal number of portfolios and
suggests that it is probably larger than the usual 10 often used in the literature.
In the code and Figure 3.1 below, we compute size portfolios (equally weighted: above
versus below the median capitalization). According to the size anomaly, the firms with
below median market cap should earn higher returns on average. This is verified whenever
the orange bar in the plot is above the blue one (it happens most of the time).
plt.xlabel('year')
df_median=[] #removing the temp dataframe to keep it light!
df=[] #removing the temp dataframe to keep it light!
FIGURE 3.1: The size factor: average returns of small versus large firms.
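A minimal sketch of how such median-capitalization portfolios can be formed is given below; it relies on data_ml and on the matplotlib import (plt) from earlier chunks and uses the 1-month return as the performance measure:

temp = data_ml[["date", "Mkt_Cap_12M_Usd", "R1M_Usd"]].copy()
temp["year"] = temp["date"].astype(str).str[:4]
temp["small"] = temp.groupby("date")["Mkt_Cap_12M_Usd"].transform(
    lambda x: x < x.median())                                 # below-median capitalization flag
df_size = (temp.groupby(["year", "small"])["R1M_Usd"]
           .mean()                                            # equally weighted average return
           .unstack())
df_size.columns = ["large", "small"]                          # False -> large caps, True -> small caps
df_size.plot(kind="bar", figsize=(12, 6))
plt.ylabel("Average return")
plt.xlabel("year")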
3.2.3 Factors
The construction of so-called factors follows the same lines as above. Portfolios are based
on one characteristic and the factor is a long-short ensemble of one extreme portfolio minus
the opposite extreme (small minus large for the size factor or high book-to-market ratio
minus low book-to-market ratio for the value factor). Sometimes, subtleties include forming
bivariate sorts and aggregating several portfolios together, as in the original contribution
of Fama and French (1993). The most common factors are listed below, along with a few
references. We refer to the books listed at the beginning of the chapter for a more exhaustive
treatment of factor idiosyncrasies. For most anomalies, theoretical justifications have been
brought forward, whether risk-based or behavioral. We list the most frequently cited factors
below:
• Size (SMB = small firms minus large firms): Banz (1981), Fama and French (1992),
Fama and French (1993), Van Dijk (2011), Asness et al. (2018) and Astakhov et al.
(2019).
• Value (HML = high minus low: undervalued minus ‘growth’ firms): Fama and French
(1992), Fama and French (1993), and Asness et al. (2013).
• Momentum (WML = winners minus losers): Jegadeesh and Titman (1993), Carhart
(1997) and Asness et al. (2013). The winners are the assets that have experienced the
highest returns over the last year (sometimes the computation of the return is truncated
to omit the last month). Cross-sectional momentum is linked, but not equivalent,
18 3 Factor investing and asset pricing anomalies
to time series momentum (trend following), see, e.g., Moskowitz et al. (2012) and
Lempérière et al. (2014). Momentum is also related to contrarian movements that
occur both at higher and lower frequencies (short-term and long-term reversals), see
Luo et al. (2021).
• Profitability (RMW = robust minus weak profits): Fama and French (2015), Bouchaud
et al. (2019). In the former reference, profitability is measured as (revenues - (cost and
expenses))/equity.
• Investment (CMA = conservative minus aggressive): Fama and French (2015), Hou
et al. (2015). Investment is measured via the growth of total assets (divided by total
assets). Aggressive firms are those that experience the largest growth in assets.
• Low ‘risk’ (sometimes, BAB = betting against beta): Ang et al. (2006), Baker et al.
(2011), Frazzini and Pedersen (2014), Boloorforoosh et al. (2020), Baker et al. (2020)
and Asness et al. (2020). In this case, the computation of risk changes from one article
to the other (simple volatility, market beta, idiosyncratic volatility, etc.).
With the notable exception of the low risk premium, the most mainstream anomalies are
kept and updated in the data library of Kenneth French (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html). Of course, the computation of the factors follows a particular set of rules, but they are generally accepted in the academic sphere. Another source of data is the AQR repository: https://www.aqr.com/Insights/Datasets.
In the dataset we use for the book, we proxy the value anomaly not with the book-to-market
ratio but with the price-to-book ratio (the book value is located in the denominator). As is
shown in Asness and Frazzini (2013), the choice of the variable for value can have sizable
effects.
Below, we import data from Ken French’s data library. We will use it later on in the chapter.
import urllib.request
import zipfile
import pandas as pd

min_date = 196307
max_date = 202003
ff_url = "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp"
ff_url += "/F-F_Research_Data_5_Factors_2x3_CSV.zip"        # Create the download url
urllib.request.urlretrieve(ff_url, 'factors.zip')            # Download it
with zipfile.ZipFile('factors.zip') as z:                     # The csv file is zipped:
    z.extractall()                                            # extract it before reading
df_ff = pd.read_csv('F-F_Research_Data_5_Factors_2x3.csv',
                    header=3, sep=',', quotechar='"')
df_ff.rename(columns = {'Unnamed: 0':'date'},
inplace = True) # renaming for clarity
df_ff.rename(columns = {'Mkt-RF':'MKT_RF'},
inplace = True) # renaming for clarity
df_ff[['MKT_RF','SMB','HML','RMW','CMA','RF']]=df_ff[
['MKT_RF','SMB','HML','RMW','CMA','RF']].values/100.0 # Scale returns
idx_ff=df_ff.index[(df_ff['date']>=min_date)&(
df_ff['date']<=max_date)].tolist()
FF_factors=df_ff.iloc[idx_ff]
FF_factors['year']=FF_factors.date.astype(str).str[:4]
FF_factors.iloc[1:6,0:7].head()
Posterior to the discovery of these stylized facts, some contributions have aimed at building
theoretical models that capture these properties. We cite a handful below:
• size and value: Berk et al. (1999), Daniel et al. (2001b), Barberis and Shleifer (2003),
Gomes et al. (2003), Carlson et al. (2004), and Arnott et al. (2014);
• momentum: Johnson (2002), Grinblatt and Han (2005), Vayanos and Woolley (2013),
Choi and Kim (2014).
In addition, recent bridges have been built between risk-based factor representations and
behavioural theories. We refer essentially to Barberis et al. (2016) and Daniel et al. (2020a)
and the references therein.
While these factors (i.e., long-short portfolios) exhibit time-varying risk premia and are mag-
nified by corporate news and announcements (Engelberg et al. (2018)), it is well-documented
(and accepted) that they deliver positive returns over long horizons. We refer to Gagliardini
et al. (2016) and to the survey Gagliardini et al. (2019), as well as to the related bibliog-
raphy for technical details on estimation procedures of risk premia and the corresponding
empirical results. A large sample study that documents regime changes in factor premia
was also carried out by Ilmanen et al. (2019). Moreover, the predictability of returns is also
time-varying (as documented in Farmer et al. (2019), Tsiakas et al. (2020) and Liu et al.
(2021)), and estimation methods can be improved (Johnson (2019)).
In Figure 3.2, we plot the average monthly return aggregated over each calendar year for five common factors. The risk-free rate (which is not a factor per se) is the most stable, while the market factor (aggregate market returns minus the risk-free rate) is the most volatile. This makes sense because it is the only long equity factor among the five series.

FIGURE 3.2: Average returns of common anomalies (1963-2020). Source: Ken French library.
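A sketch of the corresponding computation is given below; it reuses the FF_factors dataframe and the year column created in the previous chunk, and assumes matplotlib.pyplot has been imported as plt, as in earlier chunks:

(FF_factors
    .groupby('year')[['MKT_RF', 'SMB', 'HML', 'RMW', 'CMA', 'RF']]
    .mean()                                   # average monthly return within each calendar year
    .plot(figsize=(12, 6)))
plt.ylabel('Average monthly return')
plt.show()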
The individual attributes of investors who allocate towards particular factors is a blossoming
topic. We list a few references below, even though they somewhat lie out of the scope of
this book. Betermier et al. (2017) show that value investors are older, wealthier and face
lower income risk compared to growth investors who are those in the best position to take
financial risks. The study Cronqvist et al. (2015b) leads to different conclusions: it finds
that the propensity to invest in value versus growth assets has roots in genetics and in life
events (the latter effect being confirmed in Cocco et al. (2020), and the former being further
detailed in a more general context in Cronqvist et al. (2015a)). Psychological traits can also
explain some factors: when agents extrapolate, they are likely to fuel momentum (this topic
is thoroughly reviewed in Barberis (2018)). Micro- and macro-economic consequences of
these preferences are detailed in Bhamra and Uppal (2019). To conclude this paragraph, we
mention that theoretical models have also been proposed that link agents’ preferences and
beliefs (via prospect theory) to market anomalies (see for instance Barberis et al. (2020)).
Finally, we highlight the need for replicability of factor premia and echo the recent editorial by Harvey (2020). As is shown by Linnainmaa and Roberts (2018) and Hou et al. (2020), many proclaimed factors are in fact very much data-dependent and often fail to deliver sustained profitability when the investment universe is altered or when the definition of the variable changes (Asness and Frazzini (2013)).
Campbell Harvey and his co-authors, in a series of papers, tried to synthesize the research
on factors in Harvey et al. (2016), Harvey and Liu (2019a) and Harvey and Liu (2019b).
His work underlines the need to set high bars for an anomaly to be called a ‘true’ factor.
Tightening significance thresholds (i.e., requiring lower p-values) is only a partial answer, as it is always possible to resort
to data snooping in order to find an optimized strategy that will fail out-of-sample but
that will deliver a t-statistic larger than three (or even four). Harvey (2017) recommends
to resort to a Bayesian approach which blends data-based significance with a prior into a
so-called Bayesianized p-value (see subsection below).
Following this work, researchers have continued to explore the richness of this zoo. Bryz-
galova et al. (2019a) propose a tractable Bayesian estimation of large-dimensional factor
3.2 Detecting anomalies 21
models and evaluate all possible combinations of more than 50 factors, yielding an incred-
ibly large number of coefficients. This combined with a Bayesianized Fama and MacBeth
(1973) procedure allows to distinguish between pervasive and superfluous factors. Chordia
et al. (2020) use simulations of 2 million trading strategies to estimate the rate of false
discoveries, that is, when a spurious factor is detected (type I error). They also advise to
use thresholds for t-statistics that are well above three. In a similar vein, Harvey and Liu
(2020) also underline that sometimes true anomalies may be missed because of a one-time
t-statistic that is too low (type II error).
The propensity of journals to publish positive results has led researchers to estimate the
difference between reported returns and true returns. A. Y. Chen and Zimmermann (2020)
call this difference the publication bias and estimate it as roughly 12%. That is, if a pub-
lished average return is 8%, the actual value may in fact be closer to (1-12%)*8%=7%.
Qualitatively, this estimation of 12% is smaller than the out-of-sample reduction in returns
found in McLean and Pontiff (2016).
For simplicity, we assume a simple form:
$$\mathbf{r} = a + b\mathbf{x} + \mathbf{e}, \qquad (3.2)$$
where the vector r stacks all returns of all stocks and x is a lagged variable so that the
regression is indeed predictive. If the estimated b̂ is significant given a specified threshold,
then it can be tempting to conclude that x does a good job at predicting returns. Hence,
long-short portfolios related to extreme values of x (mind the sign of b̂) are expected to
generate profits. This is unfortunately often false because $\hat{b}$ gives information on the past ability of $\mathbf{x}$ to forecast returns. What happens in the future may be another story.
Statistical tests are also used for portfolio sorts. Assume two extreme portfolios are expected
to yield very different average returns (like very small cap versus very large cap, or strong
winners versus bad losers). The portfolio returns are written $r_t^+$ and $r_t^-$. The simplest test for the mean is $t = \sqrt{T}\,\frac{m_{r^+}-m_{r^-}}{\sigma_{r^+-r^-}}$, where $T$ is the number of points, $m_{r^\pm}$ denotes the means of returns and $\sigma_{r^+-r^-}$ is the standard deviation of the difference between the two series, i.e., the volatility of the long-short portfolio. In short, the statistic can be viewed as a scaled Sharpe ratio (though usually these ratios are computed for long-only portfolios)
and can in turn be used to compute p-values to assess the robustness of an anomaly. As is
shown in Linnainmaa and Roberts (2018) and Hou et al. (2020), many factors discovered
by researchers fail to survive in out-of-sample tests.
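As an illustration, the helper below computes this statistic for two return series; the simulated inputs are purely illustrative:

import numpy as np

def long_short_tstat(r_plus, r_minus):
    diff = np.asarray(r_plus) - np.asarray(r_minus)           # long-short returns
    return np.sqrt(len(diff)) * diff.mean() / diff.std(ddof=1)

rng = np.random.default_rng(42)
r_small = 0.010 + 0.05 * rng.standard_normal(240)             # simulated small-cap portfolio returns
r_large = 0.005 + 0.05 * rng.standard_normal(240)             # simulated large-cap portfolio returns
long_short_tstat(r_small, r_large)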
One reason why people are overly optimistic about anomalies they detect is the widespread
reverse interpretation of the p-value. Often, it is thought of as the probability of one hypoth-
esis (e.g., my anomaly exists) given the data. In fact, it’s the opposite; it’s the likelihood of
your data sample, knowing that the anomaly holds.
$$\text{p-value} = P[D|H], \qquad \text{target prob.} = P[H|D] = \frac{P[D|H]}{P[D]} \times P[H],$$
where H stands for hypothesis and D for data. The equality in the second row is a plain
application of Bayes’ identity: the interesting probability is in fact a transform of the p-value.
Two articles (at least) discuss this idea. Harvey (2017) introduces Bayesianized p-values:
$$\text{Bayesianized p-value} = Bpv = \frac{e^{-t^2/2} \times \text{prior}}{1 + e^{-t^2/2} \times \text{prior}}, \qquad (3.3)$$
where t is the t-statistic obtained from the regression (i.e., the one that defines the p-value)
and prior is the analyst’s estimation of the odds that the hypothesis (anomaly) is true. The
prior is coded as follows. Suppose there is a p% chance that the null holds (i.e., (1-p)%
for the anomaly). The odds are coded as p/(1 − p). Thus, if the t-statistic is equal to 2
(corresponding to a p-value of 5% roughly) and the prior odds are equal to 6, then the Bpv
is equal to $e^{-2} \times 6 \times (1 + e^{-2} \times 6)^{-1} \approx 0.448$ and there is a 44.8% chance that the null is
true. This interpretation stands in sharp contrast with the original p-value which cannot be
viewed as a probability that the null holds. Of course, one drawback is that the level of the
prior is crucial and solely user-specified.
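The computation of Equation (3.3) is straightforward to code; the small function below reproduces the numerical example above:

import numpy as np

def bayesianized_pvalue(t_stat, prior_odds):
    w = np.exp(-t_stat ** 2 / 2) * prior_odds   # e^{-t^2/2} times the prior odds
    return w / (1 + w)

bayesianized_pvalue(2, 6)                        # roughly 0.448, as in the example above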
The work of Chinco et al. (2021) is very different but shares some key concepts, like the in-
troduction of Bayesian priors in regression outputs. They show that coercing the predictive
regression with an L2 constraint (see the ridge regression in Chapter 5) amounts to intro-
ducing views on what the true distribution of b is. The stronger the constraint, the more
the estimate b̂ will be shrunk towards zero. One key idea in their work is the assumption
of a distribution for the true b across many anomalies. It is assumed to be Gaussian and
centered. The interesting parameter is the standard deviation: the larger it is, the more fre-
quently significant anomalies are discovered. Notably, the authors show that this parameter
changes through time, and we refer to the original paper for more details on this subject.
In the second pass of the Fama and MacBeth (1973) procedure, the loadings estimated in the first (time-series) pass serve as regressors in cross-sectional regressions of the form

$$r_{t,n} = \gamma_{t,0} + \sum_{k=1}^{K} \gamma_{t,k} \hat{\beta}_{n,k} + \varepsilon_{t,n}, \qquad (3.4)$$

which are run date-by-date on the cross-section of assets.1 Theoretically, the betas would be known and the regression would be run on the $\beta_{n,k}$ instead of their estimated values. The $\hat{\gamma}_{t,k}$ estimate the premia of factor $k$ at time $t$. Under suitable distributional assumptions on the $\varepsilon_{t,n}$, statistical tests can be performed to determine whether these premia are significant or not. Typically, the statistic on the time-aggregated (average) premia $\hat{\gamma}_k = \frac{1}{T}\sum_{t=1}^T \hat{\gamma}_{t,k}$:

$$t_k = \frac{\hat{\gamma}_k}{\hat{\sigma}_k/\sqrt{T}}$$
is often used in pure Gaussian contexts to assess whether or not the factor is significant (σ̂k
is the standard deviation of the γ̂t,k ).
1 Originally, Fama and MacBeth (1973) work with the market beta only: $r_{t,n} = \alpha_n + \beta_n r_{t,M} + \epsilon_{t,n}$, and the second pass included non-linear terms: $r_{t,n} = \gamma_{t,0} + \gamma_{t,1}\hat{\beta}_n + \gamma_{t,2}\hat{\beta}_n^2 + \gamma_{t,3}\hat{s}_n + \eta_{t,n}$, where the $\hat{s}_n$ are risk estimates for the assets that are not related to the betas. It is then possible to perform asset pricing tests to infer some properties. For instance, test whether betas have a linear influence on returns or not ($E[\gamma_{t,2}] = 0$), or test the validity of the CAPM (which implies $E[\gamma_{t,0}] = 0$).
We refer to Jagannathan and Wang (1998) and Petersen (2009) for technical discussions on
the biases and losses in accuracy that can be induced by standard ordinary least squares
(OLS) estimations. Moreover, as the β̂i,k in the second-pass regression are estimates, a
second level of errors can arise (the so-called errors in variables). The interested reader will
find some extensions and solutions in Shanken (1992), Ang et al. (2018), and Jegadeesh
et al. (2019).
Below, we perform Fama and MacBeth (1973) regressions on our sample. We start by the
first pass: individual estimation of betas. We build a dedicated function below and use some
functional programming to automate the process.
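A possible sketch of this first pass is given below; it assumes that the rectangular returns matrix from Chapter 1 and a dataframe ff_monthly holding the five Ken French factors aligned on the same dates are available (these names and the date alignment are assumptions, and the companion code may proceed differently):

import pandas as pd
import statsmodels.api as sm

factor_names = ["MKT_RF", "SMB", "HML", "RMW", "CMA"]

def first_pass(stock_id):
    y = returns[stock_id].values                              # time series of one asset's returns
    X = sm.add_constant(ff_monthly[factor_names]).values      # factors aligned on the same dates (assumed)
    fit = sm.OLS(y, X, missing="drop").fit()
    out = pd.DataFrame({"betas": fit.params},
                       index=["const"] + factor_names)        # estimated loadings
    out["stock_id"] = stock_id
    out["factors_name"] = out.index
    return out

df_res_full = pd.concat([first_pass(s) for s in returns.columns]).reset_index(drop=True)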
df_res_full_mat=df_res_full.pivot(index='stock_id',
columns='factors_name',values='betas')
column_names_inverted = ["const", "MKT_RF", "SMB","HML","RMW","CMA"]
reg_result = df_res_full_mat.reindex(columns=column_names_inverted)
reg_result.head()
In the table, MKT_RF is the market return minus the risk free rate. The corresponding
coefficient is often referred to as the beta, especially in univariate regressions. We then
reformat these betas from Table 3.2 to prepare the second pass. Each line corresponds to
one asset: the first 5 columns are the estimated factor loadings and the remaining ones are
the asset returns (date by date).
TABLE 3.2: Sample of beta values (row numbers are stock IDs).
returns_trsp = returns.transpose()
df_2nd_pass = pd.concat([reg_result.iloc[:,1:6], returns.transpose()], axis=1)
df_2nd_pass.head()
We observe that the values of the first column (market betas) revolve around one, which is
what we would expect. Finally, we are ready for the second round of regressions.
betas = df_2nd_pass.iloc[:,0:5]                  # factor loadings from the first pass
date_list = list(returns_trsp.columns)
results_params = []
reg_result = []
df_res_full = []
for j in range(len(returns_trsp.columns)):       # one cross-sectional regression per date
    Y = returns_trsp.iloc[:,j]                   # returns of all assets at date j
    results = sm.OLS(endog=Y, exog=sm.add_constant(betas)).fit()
    results_params = results.params
    reg_result_tmp = pd.DataFrame(results_params)
    reg_result_tmp['date'] = date_list[j]
    df_res_full.append(reg_result_tmp)
df_res_full = pd.concat(df_res_full)
df_res_full.reset_index(inplace=True)
gammas = df_res_full
gammas.rename(columns={"index": "factors_name", 0: "betas"}, inplace=True)
gammas_mat = gammas.pivot(index='date', columns='factors_name', values='betas')
Visually, the estimated premia are also very volatile. We plot their estimated values for the
market, SMB and HML factors.
gammas_mat.iloc[:,1:4].plot(
figsize=(14,10), subplots=True,sharey=True, sharex=True)
# Take gammas:
plt.show() # Plot
The two spikes at the end of the sample signal potential collinearity issues; two factors seem
to compensate in an unclear aggregate effect. This underlines the usefulness of penalized
estimates (see Chapter 5).
A common way of assessing whether a factor brings incremental explanatory power is to regress its returns on those of all the other factors:

$$f_{t,k} = a_k + \sum_{j \neq k} \delta_{k,j} f_{t,j} + \epsilon_{t,k}. \qquad (3.5)$$
The interesting metric is then the test statistic associated with the estimation of $a_k$. If $a_k$ is significantly different from zero, then the cross-section of (other) factors fails to explain exhaustively the average return of factor $k$. Otherwise, the return of the factor can be captured by exposures to the other factors and it is thus redundant.
One mainstream application of this technique was performed in Fama and French (2015),
in which the authors show that the HML factor is redundant when taking into account four
other factors (Market, SMB, RMW, and CMA). Below, we reproduce their analysis on an
updated sample. We start our analysis directly with the database maintained by Kenneth
French.
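The data loading step is not reproduced here; one possible way to fetch the monthly five factors is the famafrench reader of the pandas_datareader package, as sketched below (the start date, the scaling and the renaming to the MKT_RF mnemonic are our own choices).
import pandas_datareader.data as web

FF_factors = web.DataReader("F-F_Research_Data_5_Factors_2x3",     # Monthly 5-factor file
                            "famafrench", start="1963-07-01")[0]
FF_factors = FF_factors / 100                                      # Percentages to decimals
FF_factors.columns = ["MKT_RF", "SMB", "HML", "RMW", "CMA", "RF"]  # Align names with the text
FF_factors.head()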
We can run the regressions that determine the redundancy of factors via the procedure
defined in Equation (3.5).
df_res_full = []
for i in range(5):                                          # One regression per factor
    factors_list_full = ["MKT_RF", "SMB", "HML", "RMW", "CMA"]
    factors_list_tmp = factors_list_full.copy()             # Work on a copy of the list
    Y = FF_factors[factors_list_full[i]]                    # Dependent variable: factor i
    factors_list_tmp.remove(factors_list_full[i])           # Remove factor i from the regressors
    data = FF_factors[factors_list_tmp]                     # Independent variables: the other factors
    results = sm.OLS(endog=Y, exog=sm.add_constant(data)).fit()
    reg_result_tmp = pd.DataFrame(results.params)
    reg_result_tmp['factor_mnemo'] = Y.name
    reg_result_tmp['pvalue'] = results.pvalues
    df_res_full.append(reg_result_tmp)
df_res_full = pd.concat(df_res_full)
df_res_full.reset_index(inplace=True)
df_res_full.rename(columns={0: "coeff"}, inplace=True)
We obtain the vector of $a_k$ values from Equation (3.5). Below, we format these figures along with p-value thresholds and export them in a summary table. The significance levels of the coefficients are coded as follows: 0 < (∗∗∗) < 0.001 < (∗∗) < 0.01 < (∗) < 0.05.
df_significance = df_res_full
conditions = [                                # Significance brackets; the completion of this
    df_significance['pvalue'] <= 0.001,       # truncated snippet is reconstructed from the thresholds above
    df_significance['pvalue'] <= 0.01,
    df_significance['pvalue'] <= 0.05,
    df_significance['pvalue'] > 0.05]
values = ['(***)', '(**)', '(*)', 'na']                 # Values assigned to each condition
df_significance['significance'] = np.select(conditions, values)
df_significance['cell'] = (df_significance['coeff'].round(3).astype(str)
                           + ' ' + df_significance['significance'])   # Reconstructed formatting step
new_index = ["MKT_RF", "SMB", "HML", "RMW", "CMA"]      # Row ordering (reconstructed)
df_significance_pivot = df_significance.pivot(
    index='factor_mnemo', columns='index', values='cell')
df_significance_pivot = df_significance_pivot.reindex(
    columns=column_names_inverted)
df_significance_pivot.reindex(new_index)
TABLE 3.5: Factor competition among the Fama and French (2015) five factors. The sample starts in 1963-07 and ends in 2020-03. The regressions are run on monthly returns.
factor    const         MKT_RF        SMB           HML          RMW           CMA
MKT_RF    0.008 (***)   NA            0.264 (***)   0.101        -0.345 (***)  -0.903 (***)
SMB       0.003 (*)     0.131 (***)   NA            0.077        -0.43 (***)   -0.126
HML       0             0.028         0.038         NA           0.148 (***)   1.02 (***)
RMW       0.004 (***)   -0.096 (***)  -0.219 (***)  0.143 (***)  NA            -0.287 (***)
CMA       0.002 (***)   -0.11 (***)   -0.03         0.455 (***)  -0.123 (***)  NA
We confirm that the HML factor remains redundant when the four others are present in the
asset pricing model. The figures we obtain are very close to the ones in the original paper
(Fama and French (2015)), which makes sense, since we only add 5 years to their initial
sample.
At a more macro level, researchers also try to figure out which models (i.e., combinations of factors) are the most likely, given the data empirically observed (and possibly given priors formulated by the econometrician). For instance, this stream of literature seeks to quantify the extent to which the 3-factor model of Fama and French (1993) outperforms the 5 factors of Fama and French (2015). In this direction, De Moor et al. (2015) introduce a novel computation for p-values that compares the relative likelihood that two models pass a zero-alpha test. More generally, the Bayesian method of Barillas and Shanken (2018) was subsequently improved by Chib et al. (2020).
Lastly, even the optimal number of factors is a subject of disagreement in recent work. While the traditional literature focuses on a limited number (3-5) of factors, more recent research by DeMiguel et al. (2020), He et al. (2020), Kozak et al. (2019) and Freyberger et al. (2020) advocates the use of 15 factors or more (in contrast, Kelly et al. (2019) argue that a small number of latent factors may suffice). Green et al. (2017) even find that the number of characteristics that help explain the cross-section of returns varies over time.
• Harvey and Liu (2019a) (in a similar vein) use a bootstrap on orthogonalized factors. They make the case that correlations among predictors are a major issue and their method aims at solving this problem. Their lengthy procedure seeks to test whether the maximal additional contribution of a candidate variable is significant;
• Fama and French (2018) compare asset pricing models through squared maximum
Sharpe ratios;
• Giglio and Xiu (2019) estimate factor risk premia using a three-pass method based on
principal component analysis;
• Pukthuanthong et al. (2018) disentangle priced and non-priced factors via a combina-
tion of principal component analysis and Fama and MacBeth (1973) regressions;
• Gospodinov et al. (2019) warn against factor misspecification (when spurious factors are
included in the list of regressors). Traded factors (resp. macro-economic factors) seem
more likely (resp. less likely) to yield robust identifications (see also Bryzgalova (2016)).
There is obviously no infallible method, but the number of contributions in the field high-
lights the need for robustness. This is evidently a major concern when crafting investment
decisions based on factor intuitions. One major hurdle for short-term strategies is the likely
time-varying feature of factors. We refer for instance to Ang and Kristensen (2012) and
Cooper and Maio (2019) for practical results and to Gagliardini et al. (2016) and Ma et al.
(2020) for more theoretical treatments (with additional empirical results).
3.3 Factors or characteristics?
• Chordia et al. (2019) find that characteristics explain a larger proportion of variation
in estimated expected returns than factor loadings;
• Han et al. (2019) show with penalized regressions that 20 to 30 characteristics (out of
94) are useful for the prediction of monthly returns of US stocks. Their methodology
is interesting: they regress returns against characteristics to build forecasts and then
regress the returns on the forecast to assess if they are reliable. The latter regression
uses a LASSO-type penalization (see Chapter 5) so that useless characteristics are
excluded from the model. The penalization is extended to elasticnet in Rapach and
Zhou (2019).
• Kelly et al. (2019) and Kim et al. (2019) both estimate models in which factors are
latent, but loadings (betas) and possibly alphas depend on characteristics. Kirby (2020)
generalizes the first approach by introducing regime-switching. In contrast, Lettau and
Pelger (2020a) and Lettau and Pelger (2020b) estimate latent factors without any link
to particular characteristics (and provide large sample asymptotic properties of their
methods).
• In the same vein as Hoechle et al. (2018), Gospodinov et al. (2019) and Bryzgalova
(2016) discuss potential errors that arise when working with portfolio sorts that yield
long-short returns. The authors show that in some cases, tests based on this procedure
may be deceitful. This happens when the characteristic chosen to perform the sort is
correlated with an external (unobservable) factor. They propose a novel regression-based
approach aimed at bypassing this problem.
More recently and in a separate stream of literature, Koijen and Yogo (2019) have intro-
duced a demand model in which investors form their portfolios according to their preferences
towards particular firm characteristics. They show that this allows them to mimic the port-
folios of large institutional investors. In their model, aggregate demands (and hence, prices)
are directly linked to characteristics, not to factors. In a follow-up paper, Koijen et al.
(2019) show that a few sets of characteristics suffice to predict future returns. They also
show that, based on institutional holdings from the UK and the US, the largest investors are those who are the most influential in the formation of prices. In a similar vein, Betermier et al. (2019) derive an elegant (theoretical) general equilibrium model that generates some well-documented anomalies (size, book-to-market). The models of Arnott et al. (2014) and Alti and Titman (2019) are also able to theoretically generate known anomalies. Finally, in Martin and Nagel (2019), characteristics influence returns via the role they play in the predictability of dividend growth. This paper discusses the asymptotic case when the number of assets and the number of characteristics are proportional and both increase to infinity.
Of the four chosen series, only the size factor is not significantly autocorrelated at the first order.
2 Autocorrelation in aggregate/portfolio returns is a widely documented effect since the seminal paper
overwhelmingly favorable results (Friede et al. (2015)), but of course, they could well stem from the publication bias towards positive results.
• mixed: ESG investing may be beneficial globally but not locally (Chakrabarti and Sen
(2020)). Portfolios relying on ESG screening do not significantly outperform those with
no screening but are subject to lower levels of volatility (Gibson et al. (2020) and Gougler
and Utz (2020)). As is often the case, the devil is in the details, and results depend on
whether to use E, S, or G (Bruder et al. (2019)).
On top of these contradicting results, several articles point towards complexities in the
measurement of ESG. Depending on the chosen criteria and on the data provider, results
can change drastically (see Galema et al. (2008), Berg et al. (2020), and Atta-Darkua et al.
(2020)).
We end this short section by noting that ESG criteria can of course be integrated directly into ML models, as is for instance done in de Franco et al. (2020).
$$\max_{\theta_T} E_T\left[u(r_{p,T+1})\right] = \max_{\theta_T} E_T\left[u\left((\bar{w}_T + x_T\theta_T)'r_{T+1}\right)\right],$$
where $u$ is some utility function and $r_{p,T+1} = (\bar{w}_T + x_T\theta_T)'r_{T+1}$ is the return of the portfolio, which is defined as a benchmark $\bar{w}_T$ plus some deviations from this benchmark that are a linear function of features, $x_T\theta_T$. The above program may be subject to some external constraints (e.g., to limit leverage).
In practice, the vector $\theta_T$ must be estimated using past data (from $T-\tau$ to $T-1$): the agent seeks the solution of
$$\max_{\theta_T} \frac{1}{\tau}\sum_{t=T-\tau}^{T-1} u\left(\sum_{i=1}^{N_T}\left(\bar{w}_{i,t} + \theta_T' x_{i,t}\right)r_{i,t+1}\right) \quad (3.6)$$
on a sample of size $\tau$, where $N_T$ is the number of assets in the universe. The above formulation can be viewed as a learning task in which the parameters are chosen such that the reward (average return) is maximized.
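To make this learning interpretation concrete, the following self-contained sketch solves a version of (3.6) on simulated data with a CRRA utility; the dimensions, the utility choice and the 1/N scaling of the deviations are illustrative assumptions, not the original specification.
import numpy as np
from scipy.optimize import minimize

tau, N, K = 60, 50, 3                              # Past dates, assets, characteristics (hypothetical)
rng = np.random.default_rng(0)
x = rng.normal(size=(tau, N, K))                   # Standardized characteristics x_{i,t}
r = rng.normal(0.01, 0.05, size=(tau, N))          # Next-period returns r_{i,t+1}
w_bench = np.full((tau, N), 1 / N)                 # Benchmark weights (equally weighted)

def neg_avg_utility(theta, gamma=5):
    w = w_bench + x @ theta / N                    # Parametric deviations from the benchmark
    rp = (w * r).sum(axis=1)                       # Portfolio returns over the sample
    u = (1 + rp) ** (1 - gamma) / (1 - gamma)      # CRRA utility
    return -u.mean()                               # Sample average to be maximized

theta_star = minimize(neg_avg_utility, x0=np.zeros(K), method="BFGS").x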
The interesting discussion lies in the differences between the above model and that of
Equation (3.1). The first obvious difference is the introduction of the non-linear function g;
indeed, there is no reason (beyond simplicity and interpretability) why we should restrict
the model to linear relationships. One early reference for non-linearities in asset pricing
kernels is Bansal and Viswanathan (1993).
More importantly, the second difference between (3.7) and (3.1) is the shift in the time
index. Indeed, from an investor’s perspective, the interest is to be able to predict some
information about the structure of the cross-section of assets. Explaining asset returns with
synchronous factors is not useful because the realization of factor values is not known in
advance. Hence, if one seeks to extract value from the model, there needs to be a time
interval between the observation of the state space (which we call xt,n ) and the occurrence
of the returns. Once the model ĝ is estimated, the time-t (measurable) value g(xt,n ) will
give a forecast for the (average) future returns. These predictions can then serve as signals
in the crafting of portfolio weights (see Chapter 12 for more on that topic).
While most studies do work with returns on the l.h.s. of (3.7), there is no reason why
other indicators should not be used. Returns are straightforward and simple to compute,
but they could very well be replaced by more sophisticated metrics, like the Sharpe ratio,
for instance. The firms’ features would then be used to predict a risk-adjusted performance
rather than simple returns.
Beyond the explicit form of Equation (3.7), several other ML-related tools can also be used
to estimate asset pricing models. This can be achieved in several ways, some of which we
list below.
First, one mainstream problem in asset pricing is to characterize the stochastic discount
factor (SDF) $M_t$, which satisfies $E_t[M_{t+1}(r_{t+1,n} - r_{t+1,f})] = 0$ for any asset $n$ (see Cochrane (2009)). This equation is a natural playing field for the generalized method of moments (Hansen (1982)): $M_t$ must be such that
$$E\left[M_{t+1}R_{t+1,n}V_t\right] = 0, \quad (3.8)$$
where the instrumental variables $V_t$ are $\mathcal{F}_t$-measurable (i.e., they are known at time $t$) and the capital $R_{t+1,n}$ denotes the excess return of asset $n$. In order to reduce and simplify the
estimation problem, it is customary to define the SDF as a portfolio of assets (see chapter
3 in Back (2010)). In Chen et al. (2020), the authors use a generative adversarial network
(GAN, see Section 7.6.1) to estimate the weights of the portfolios that are the closest to
satisfy (3.8) under a strongly penalizing form.
A second approach is to try to model asset returns as linear combinations of factors, just as in (3.1). In compact notation, we write
$$r_{t,n} = \alpha_n + \beta_{t,n}'f_t + \epsilon_{t,n},$$
and we allow the loadings $\beta_{t,n}$ to be time-dependent. The trick is then to introduce the
firm characteristics in the above equation. Traditionally, the characteristics are present in
the definition of factors (as in the seminal definition of Fama and French (1993)). The
decomposition of the return is made according to the exposure of the firm's return to these factors, which are constructed according to market size, accounting ratios, past performance, etc. Given the exposures, the performance of the stock is attributed to particular style profiles (e.g., small stock, or value stock, etc.).
Typically, the factors are heuristic portfolios constructed from simple rules like thresholding. For instance, firms below the 1/3 quantile in book-to-market are growth firms and those above the 2/3 quantile are value firms. A value factor can then be defined by the long-short portfolio of these two sets, with uniform weights. Note that Fama and French (1993) use a more complex approach which also takes market capitalization into account, both in the weighting scheme and in the composition of the portfolios. A minimal version of such a sort is sketched below.
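As an illustration of this thresholding logic, a minimal, equally weighted version of such a value factor could be computed as follows; the column names ('date', 'btm', 'ret') are hypothetical and the construction is deliberately simpler than the original one.
import pandas as pd

def value_factor(panel):
    """Equally weighted long-short (value minus growth) return, date by date."""
    def one_date(df):
        lo, hi = df['btm'].quantile([1/3, 2/3])          # Tercile breakpoints on book-to-market
        value  = df.loc[df['btm'] >= hi, 'ret'].mean()   # Average return of value stocks
        growth = df.loc[df['btm'] <= lo, 'ret'].mean()   # Average return of growth stocks
        return value - growth                            # Long value, short growth
    return panel.groupby('date').apply(one_date)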
One of the advances enabled by machine learning is to automate the construction of the
factors. It is for instance the approach of Feng et al. (2019). Instead of building the factors
heuristically, the authors optimize the construction to maximize the fit in the cross-section
of returns. The optimization is performed via a relatively deep feed-forward neural network
and the feature space is lagged so that the relationship is indeed predictive, as in Equation
(3.7). Theoretically, the resulting factors help explain a substantially larger proportion of
the in-sample variance in the returns. The prediction ability of the model depends on how
well it generalizes out-of-sample.
A third approach is that of Kelly et al. (2019) (though the statistical treatment is not machine learning per se).3 Their idea is the opposite: factors are latent (unobserved) and it is the betas (loadings) that depend on the characteristics. This allows many degrees of freedom because in $r_{t,n} = \alpha_n + \beta_{t,n}(x_{t-1,n})'f_t + \epsilon_{t,n}$, only the characteristics $x_{t-1,n}$ are known and both the factors $f_t$ and the functional forms $\beta_{t,n}(\cdot)$ must be estimated. In their article, Kelly et al. (2019) work with a linear form, which is naturally more tractable.
Lastly, a fourth approach (introduced in Gu et al. (2021)) goes even further and combines two neural network architectures. The first neural network takes characteristics $x_{t-1}$ as inputs and generates factor loadings $\beta_{t-1}(x_{t-1})$. The second network transforms returns $r_t$ into factor values $f_t(r_t)$ (as in Feng et al. (2019)). The aggregate model can then be written:
$$r_t = \beta_{t-1}(x_{t-1})'f_t(r_t) + \epsilon_t. \quad (3.9)$$
The above specification is quite special because the output (on the l.h.s.) is also present as
input (in the r.h.s.). In machine learning, autoencoders (see Section 7.6.2) share the same
property. Their aim, just like in principal component analysis, is to find a parsimonious non-
linear representation form for a dataset (in this case, returns). In Equation (3.9), the input
is rt and the output function is β t−1 (xt−1 )0 ft (rt ). The aim is to minimize the difference
between the two just as in any regression-like model.
Autoencoders are neural networks which have outputs as close as possible to the inputs
with an objective of dimensional reduction. The innovation in Gu et al. (2021) is that the
pure autoencoder part is merged with a vanilla perceptron used to model the loadings. The
structure of the neural network is summarized below.
returns (r_t)              --NN_1-->  factors  (f_t = NN_1(r_t))               \
                                                                                 }-->  returns (r_t)
characteristics (x_{t-1})  --NN_2-->  loadings (beta_{t-1} = NN_2(x_{t-1}))     /
A simple autoencoder would consist of only the first line of the model. This specification is discussed in more detail in Section 7.6.2.
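To fix ideas, here is a purely illustrative forward pass of this structure in numpy, with random weights and a single linear layer per network; the actual model of Gu et al. (2021) is trained by backpropagation and uses deeper architectures.
import numpy as np

N, P, K = 200, 30, 5                        # Assets, characteristics, latent factors (hypothetical)
rng = np.random.default_rng(42)
r_t = rng.normal(size=N)                    # Cross-section of returns at date t
x_tm1 = rng.normal(size=(N, P))             # Characteristics observed at t-1

W1 = 0.1 * rng.normal(size=(N, K))          # NN1: returns -> factor values
f_t = W1.T @ r_t                            # f_t = NN1(r_t), shape (K,)

W2 = 0.1 * rng.normal(size=(P, K))          # NN2: characteristics -> loadings
beta_tm1 = np.tanh(x_tm1 @ W2)              # beta_{t-1} = NN2(x_{t-1}), shape (N, K)

r_hat = beta_tm1 @ f_t                      # Reconstructed returns, as in (3.9)
mse = np.mean((r_t - r_hat) ** 2)           # Training would minimize this quantity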
As a conclusion of this chapter, it appears undeniable that the intersection between the
two fields of asset pricing and machine learning offers a rich variety of applications. The
literature is already abundant and it is often hard to disentangle the noise from the great ideas in the continuous flow of publications on these topics. Practice and implementation are the only way to extricate value from hype. This is especially true because agents
often tend to overestimate the role of factors in the allocation decision process of real-world
investors (see Chinco et al. (2019b) and Castaneda and Sabat (2019)).
3 In the same spirit, see also Lettau and Pelger (2020a) and Lettau and Pelger (2020b).
2. Same exercise, but compute the monthly returns and plot the value (through time) of
the corresponding portfolios.
4 Data preprocessing
The methods we describe in this chapter are driven by financial applications. For an introduction to non-financial data processing, we recommend two references: chapter 3 from the general purpose ML book by Boehmke and Greenwell (2019) and the monograph on this dedicated subject by Kuhn and Johnson (2019).
only predictor with positive median correlation (this particular example seems to refute the
low risk anomaly).
FIGURE 4.1: Swarmplot of correlations with the 1-month forward return (label).
More importantly, when seeking to work with supervised learning (as we will do most of the
time), the link of some features with the dependent variable can be further characterized
by the smoothed conditional average because it shows how the features impact the label.
The use of the conditional average has a deep theoretical grounding. Suppose there is only
one feature X and that we seek a model Y = f (X) + error, where variables are real-valued.
The function $f$ that minimizes the average squared error $E[(Y - f(X))^2]$ is the so-called regression function (see section 2.4 in Hastie et al. (2009)):
$$f(x) = E[Y|X = x].$$
In Figure 4.2, we plot two illustrations of this function when the dependent variable (Y ) is
the one month ahead return. The first one pertains to the average market capitalization over
the past year and the second to the volatility over the past year as well. Both predictors have
been uniformized (see Section 4.4.2 below) so that their values are uniformly distributed
in the cross-section of assets for any given time period. Thus, the range of features is [0, 1]
and is shown on the x-axis of the plot. The grey corridors around the lines show 95% level
confidence interval for the computation of the mean. Essentially, it is narrow when both (i)
many data points are available and (ii) these points are not too dispersed.
unpivoted_data_ml = pd.melt(                          # Select 2 features + the label, long format
    data_ml[['R1M_Usd', 'Mkt_Cap_12M_Usd', 'Vol1Y_Usd']], id_vars='R1M_Usd')
plt.figure(figsize=(13,8))
sns.lineplot(data=unpivoted_data_ml, y='R1M_Usd', x='value', hue='variable');
The two variables have a close to monotonic impact on future returns. Returns, on average,
decrease with market capitalization (thereby corroborating the so-called size effect). The
reverse pattern is less pronounced for volatility: the curve is rather flat for the first half
of volatility scores and progressively increases, especially over the last quintile of volatility
values (thereby contradicting the low-volatility anomaly).
One important empirical property of features is autocorrelation (or absence thereof). A
high level of autocorrelation for one predictor makes it plausible to use simple imputation
techniques when some data points are missing. But autocorrelation is also important when
moving towards prediction tasks, and we discuss this issue shortly below in Section 4.6.
In Figure 4.3, we build the histogram of autocorrelations, computed stock-by-stock and
feature-by-feature.
data_hist_acf=pd.melt(data_ml[cols], id_vars='stock_id').groupby(
['stock_id','variable']).apply(lambda x: x['value'].autocorr(lag=1))
plt.figure(figsize=(13,8))
data_hist_acf.hist(bins=50,range=[-0.1,1]); # Plot from pandas
Allison (2001), Enders (2010), Little and Rubin (2014), and Van Buuren (2018)). While
researchers continuously propose new methods to cope with absent points (Honaker and
King (2010) or Che et al. (2018), to cite a few), we believe that a simple, heuristic treatment
is usually sufficient as long as some basic cautious safeguards are enforced.
First of all, there are mainly two ways to deal with missing data: removal and imputation.
Removal is agnostic but costly, especially if one whole instance is eliminated because of only
one missing feature value. Imputation is often preferred but relies on some underlying and
potentially erroneous assumption.
A simplified classification of imputation is the following:
• A basic imputation choice is the median (or mean) of the feature for the stock over the
past available values. If there is a trend in the time series, this will nonetheless alter the
trend. Relatedly, this method can be forward-looking, unless the training and testing
sets are treated separately.
• In time series contexts with views towards backtesting, the most simple imputation
comes from previous values: if xt is missing, replace it with xt−1 . This makes sense
most of the time because past values are all that is available and are by definition
backward-looking. However, in some particular cases, this may be a very bad choice
(see words of caution below).
• Medians and means can also be computed over the cross-section of assets. This
roughly implies that the missing feature value will be relocated in the bulk of observed
values. When many values are missing, this creates an atom in the distribution of the
feature and alters the original distribution. One advantage is that this imputation is
not forward-looking.
• Many techniques rely on some modelling assumptions for the data generating process.
We refer to non-parametric approaches (Stekhoven and Bühlmann (2011) and Shah et al.
(2014), which rely on random forests, see Chapter 6), Bayesian imputation (Schafer
(1999)), maximum likelihood approaches (Enders (2001) and Enders (2010)), interpola-
tion or extrapolation and nearest neighbor algorithms (García-Laencina et al. (2009)).
More generally, the four books cited at the beginning of the subsection detail many such imputation processes. Advanced techniques are much more demanding computationally.
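For reference, the imputation from previous values described above boils down to a grouped forward fill in pandas; the function and column names below are our own.
def impute_from_past(panel, feature_cols):
    """Replace missing feature values by the latest past value, stock by stock."""
    panel = panel.sort_values(['stock_id', 'date'])
    panel[feature_cols] = panel.groupby('stock_id')[feature_cols].ffill()
    return panel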
A few words of caution:
• Interpolation should be avoided at all cost. Accounting values or ratios that are released
every quarter must never be linearly interpolated for the simple reason that this is
forward-looking. If numbers are disclosed in January and April, then interpolating
February and March requires the knowledge of the April figure, which, in live trading
will not be known. Resorting to past values is a better way to go.
• Nevertheless, there are some feature types for which imputation from past values should
be avoided. First of all, returns should not be replicated. By default, a superior choice
is to set missing return indicators to zero (which is often close to the average or the
median). A good indicator that can help the decision is the persistence of the feature
through time. If it is highly autocorrelated (and the time series plot create a smooth
curve, like for market capitalization), then imputation from the past can make sense. If
not, then it should be avoided.
• There are some cases that can require more attention. Let us consider the following
fictitious sample of dividend yield in Table 4.1:
In this case, the yield is released quarterly, in March, June, September, etc. But in June, the
value is missing. The problem is that we cannot know if it is missing because of a genuine
data glitch, or because the firm simply did not pay any dividends in June. Thus, imputation
from past value may be erroneous here. There is no perfect solution, but a decision must
nevertheless be made. For dividend data, three options are:
1. Keep the previous value.
2. Extrapolate from previous observations (this is very different from interpolation): for
instance, evaluate a trend on past data and pursue that trend.
3. Set the value to zero. This is tempting but may be sub-optimal due to dividend smooth-
ing practices from executives (see for instance Leary and Michaely (2011) and Chen
et al. (2012) for details on the subject). For persistent time series, the first two options
are probably better.
Tests can be performed to evaluate the relative performance of each option. It is also im-
portant to remember these design choices. There are so many of them that they are easy
to forget. Keeping track of them is obviously compulsory. In the ML pipeline, the scripts
pertaining to data preparation are often key because they do not serve only once!
• likewise, if the largest value is above m times the second-to-largest, then it can also be classified as an outlier (the same reasoning applies to the other side of the tail);
• finally, for a given small threshold q, any value outside the [q, 1 − q] quantile range can be considered an outlier.
This latter idea was popularized by winsorization. Winsorizing amounts to setting to $x^{(q)}$ all values below $x^{(q)}$ and to $x^{(1-q)}$ all values above $x^{(1-q)}$. The winsorized variable $\tilde{x}$ is:
$$\tilde{x}_i = \begin{cases} x_i & \text{if } x_i \in [x^{(q)}, x^{(1-q)}] \quad \text{(unchanged)} \\ x^{(q)} & \text{if } x_i < x^{(q)} \\ x^{(1-q)} & \text{if } x_i > x^{(1-q)} \end{cases}$$
The range for q is usually (0.5%, 5%) with 1% and 2% being the most often used.
The winsorization stage must be performed on a feature-by-feature and a date-by-date basis. However, keeping a time series perspective is also useful. For instance, an $800B market capitalization may seem out of range, except when looking at the history of Apple's capitalization.
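In pandas, winsorizing a cross-section amounts to clipping at the empirical quantiles, for instance with the small helper below (to be applied feature by feature and date by date).
def winsorize(x, q=0.01):
    """Clip the values of a pandas Series at its q and 1-q quantiles."""
    lower, upper = x.quantile([q, 1 - q])
    return x.clip(lower=lower, upper=upper)

# Example of a date-by-date application (column names are illustrative):
# data_ml['Mkt_Cap_12M_Usd'] = data_ml.groupby('date')['Mkt_Cap_12M_Usd'].transform(winsorize)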
We conclude this subsection by recalling that true outliers (i.e., extreme points that are
not due to data extraction errors) are valuable because they are likely to carry important
information.
4.5 Labelling
4.5.1 Simple labels
There are several ways to define labels when constructing portfolio policies. Of course, the end goal is the portfolio weight, but it is rarely considered as the best choice for the label.1
Usual labels in factor investing are the following:
• raw asset returns;
• future relative returns (versus some benchmark: market-wide index, or sector-based
portfolio, for instance). One simple choice is to take returns minus a cross-sectional
mean or median;
• the probability of positive return (or of return above a specified threshold);
• the probability of outperforming a benchmark (computed over a given time frame);
• the binary version of the above: YES (outperforming) versus NO (underperforming);
• risk-adjusted versions of the above: Sharpe ratios, information ratios, MAR or CALMAR
ratios (see Section 12.3).
When creating binary variables, it is often tempting to create a test that compares returns
to zero (profitable versus non-profitable). This is not optimal because it is very much time-
dependent. In good times, many assets will have positive returns, while in market crashes,
few will experience positive returns, thereby creating very unbalanced classes. It is a better
idea to split the returns in two by comparing them to their time-t median (or average). In
this case, the indicator is relative and the two classes are much more balanced.
As we will discuss later in this chapter, these choices still leave room for additional degrees
of freedom. Should the labels be rescaled, just like features are processed? What is the best
time horizon on which to compute performance metrics?
$$y_{t,i} = \begin{cases} -1 & \text{if } \hat{r}_{t,i} < r_- \\ 0 & \text{if } \hat{r}_{t,i} \in [r_-, r_+] \\ +1 & \text{if } \hat{r}_{t,i} > r_+ \end{cases} \quad (4.2)$$
where r̂t,i is the performance proxy (e.g., returns or Sharpe ratio), and r± are the decision
thresholds. When the predicted performance is below r− , the decision is -1 (e.g., sell), when
it is above r+ , the decision is +1 (e.g., buy) and when it is in the middle (the model is
neither very optimistic nor very pessimistic), then the decision is neutral (e.g., hold). The performance proxy can of course be relative to some benchmark so that the decision is directly related to this benchmark. It is often advised that the thresholds $r_\pm$ be chosen such that the three categories are relatively balanced, that is, so that they end up having a comparable number of instances.
1 Some methodologies do map firm attributes into final weights, e.g., Brandt et al. (2009) and Ammann et al. (2016), but these are outside the scope of the book.
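One simple way to obtain the labels of (4.2) with roughly balanced classes is to use cross-sectional terciles of the performance proxy as thresholds; in the sketch below, the proxy is the 1-month forward return and the column names are those of our dataset.
import numpy as np

def ternary_label(df, proxy='R1M_Usd'):
    """-1 / 0 / +1 labels based on date-by-date terciles of the proxy."""
    r_minus = df.groupby('date')[proxy].transform(lambda x: x.quantile(1/3))
    r_plus  = df.groupby('date')[proxy].transform(lambda x: x.quantile(2/3))
    return np.select([df[proxy] < r_minus, df[proxy] > r_plus], [-1, 1], default=0)

# data_ml['label'] = ternary_label(data_ml)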
In this case, the final output can be considered as categorical or numerical because it
belongs to an important subgroup of categorical variables: the ordered categorical (ordinal)
variables. If y is taken as a number, the usual regression tools apply.
When y is treated as a non-ordered (nominal) categorical variable, then a new layer of
processing is required because ML tools only work with numbers. Hence, the categories must
be recoded into digits. The mapping that is most often used is called ‘one-hot encoding’.
The vector of classes is split into a sparse matrix in which each column is dedicated to one class. The matrix is filled with zeros and ones; a one is allocated to the column corresponding to the class of the instance. We provide a simple illustration in the table below.
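In pandas, one-hot encoding is a one-liner; the label values below are hypothetical.
import pandas as pd

labels = pd.Series(['buy', 'hold', 'sell', 'buy'])
pd.get_dummies(labels)      # One column per class, filled with zeros and ones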
In classification tasks, the output has a larger dimension. For each instance, it gives the
probability of belonging to each class assigned by the model. As we will see in Chapters 6
and 7, this is easily handled via the softmax function.
From the standpoint of allocation, handling categorical predictions is not necessarily easy.
For long-short portfolios, plus or minus one signals can provide the sign of the position. For
long-only portfolios, there are two possible solutions: (i) work with binary classes (in versus
out of the portfolio) or (ii) adapt weights according to the prediction: zero weight for a −1
prediction, 0.5 weight for a 0 prediction, and full weight for a +1 prediction. Weights are
then of course normalized so as to comply with the budget constraint.
If the strategy hits the first (resp. second) barrier, the output is +1 (resp. −1), and if it
hits the last barrier, the output is equal to zero or to some linear interpolation (between
−1 and +1) that represents the position of the terminal value relative to the two horizontal
barriers. Computationally, this method is much more demanding, as it evaluates a whole
trajectory for each instance. Again, it is nonetheless considered as more realistic because
trading strategies are often accompanied with automatic triggers such as stop-loss, etc.
• the choice of splitting variables is (sometimes) pushed towards the features that have a
monotonic impact on the label.
These two properties are desirable. The first reduces the risk of fitting to small groups of
instances that may be spurious. The second gives more importance to features that appear
globally more relevant in explaining the returns. However, the filtering must not be too
intense. If, instead of retaining 20% of each tail of the predictor, we keep just 10%, then
the loss in signal becomes too severe and the performance deteriorates.
Theoretically, it is possible to understand why that may be the case. For simplicity, let
us assume a single feature x that explains returns r: rt+1 = f (xt ) + et+1 . If xt is highly
autocorrelated and the noise embedded in et+1 is not too large, then the two-period ahead
return (1+rt+1 )(1+rt+2 )−1 may carry more signal than rt+1 because the relationship with
xt has diffused and compounded through time. Consequently, it may also be beneficial to
embed memory considerations directly into the modelling function, as is done for instance in
Dixon (2020). We discuss some practicalities related to autocorrelations in the next section.
4.7 Extensions
4.7.1 Transforming features
The feature space can easily be augmented through simple operations. One of them is
lagging, that is, considering older values of features and assuming some memory effect
for their impact on the label. This is naturally useful mostly if the features are oscillating
(adding a layer of memory on persistent features can be somewhat redundant). New variables are defined by $\breve{x}_{t,n}^{(k)} = x_{t-1,n}^{(k)}$.
In some cases (e.g., insufficient number of features), it is possible to consider ratios or
products between features. Accounting ratios like price-to-book, book-to-market, debt-to-
equity are examples of functions of raw features that make sense. The gains brought by a
larger spectrum of features are not obvious. The risk of overfitting increases, just like in a
simple linear regression adding variables mechanically increases the R2 . The choices must
make sense, economically.
Another way to increase the feature space (mentioned above) is to consider variations.
Variations in sentiment, variations in book-to-market ratio, etc., can be relevant predictors
because sometimes, the change is more important than the level. In this case, a new predictor is $\breve{x}_{t,n}^{(k)} = x_{t,n}^{(k)} - x_{t-1,n}^{(k)}$.
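Both transformations (lags and variations) are easily generated with grouped pandas operations, as in the sketch below; the feature name is an illustrative choice.
feature = 'Mkt_Cap_12M_Usd'                              # Hypothetical feature
grouped = data_ml.sort_values('date').groupby('stock_id')[feature]
data_ml[feature + '_lag']  = grouped.shift(1)            # Lagged value:  x_{t-1}
data_ml[feature + '_diff'] = grouped.diff(1)             # Variation:     x_t - x_{t-1}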
This technique is used by Gu et al. (2020) who use eight economic indicators (plus the
original predictors (zt = 1)). This increases the feature space ninefold.
Another route that integrates shifting economic environments is conditional engineering.
Suppose that labels are coded via formula (4.2). The thresholds can be made dependent on
some exogenous variable. In times of turbulence, it might be a good idea to increase both
r+ (buy threshold) and r− (sell threshold) so that the labels become more conservative: it
takes a higher return to make it to the buy category, while short positions are favored. One
such example of dynamic thresholding could be
$$r_{t,\pm} = r_\pm \times e^{\pm\delta(VIX_t - \overline{VIX})}, \quad (4.4)$$
where $VIX_t$ is the time-$t$ value of the VIX, while $\overline{VIX}$ is some average or median value. When the VIX is above its average and risk seems to be increasing, the thresholds also increase. The parameter $\delta$ tunes the magnitude of the correction. In the above example, we assume $r_- < 0 < r_+$.
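A small numerical illustration of (4.4) is given below; the baseline thresholds, the value of delta and the VIX levels are arbitrary.
import numpy as np

def dynamic_thresholds(vix_t, vix_bar, r_minus=-0.02, r_plus=0.02, delta=0.02):
    """Time-varying thresholds of Equation (4.4)."""
    r_t_plus  = r_plus  * np.exp( delta * (vix_t - vix_bar))
    r_t_minus = r_minus * np.exp(-delta * (vix_t - vix_bar))
    return r_t_minus, r_t_plus

dynamic_thresholds(vix_t=35, vix_bar=20)   # In turbulent times, both thresholds increase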
where the notation $\hat{f}(x; D)$ is used to highlight the dependence between the model $\hat{f}$ and the dataset $D$: the model has been trained on $D$. The first term is irreducible, as it does not depend on $\hat{f}$. Thus, only the second term is of interest. If we take the average of this quantity over all possible values of $D$:
$$E_D\left[\left(\hat{f}(x;D) - E[y|x]\right)^2\right] = \underbrace{\left(E_D\left[\hat{f}(x;D)\right] - E[y|x]\right)^2}_{\text{squared bias}} + \underbrace{E_D\left[\left(\hat{f}(x;D) - E_D[\hat{f}(x;D)]\right)^2\right]}_{\text{variance}}$$
If this expression is not too complicated to compute, the learner can query the x that
minimizes the tradeoff. Thus, on average, this new instance will be the one that yields the
best learning angle (as measured by the L2 error). Beyond this approach (which is limited
because it requires the oracle to label a possibly irrelevant instance), many other criteria
exist for querying, and we refer to section 3 from Settles (2009) for an exhaustive list.
One final question is: is active learning applicable to factor investing? One straightforward answer is that data cannot be annotated by human intervention. Thus, the learners cannot simulate their own instances and ask for corresponding labels. One possible option is to
provide the learner with X but not y and keep only a queried subset of observations with
the corresponding labels. In spirit, this is close to what is done in Coqueret and Guida
(2020) except that the query is not performed by a machine but by the human user. Indeed,
it is shown in this paper that not all observations carry the same amount of signal. Instances
with ‘average’ label values seem to be on average less informative compared to those with
extreme label values.
length = 100
x = np.exp(np.sin(np.linspace(1, length, length)))     # Synthetic, oscillating series
data = pd.DataFrame(data=x, columns=['X'])
data.reset_index(inplace=True)
plt.figure(figsize=(13,5))                             # Resizing figure
sns.barplot(y="X", data=data, x="index", color='black');   # Plot from Seaborn
With respect to shape, the blue and orange distributions are close to the original one.
It is only the support that changes: the min/max rescaling ensures all values lie in the
[0, 1] interval. In both cases, the smallest values (on the left) display a spike in distribu-
tion. By construction, this spike disappears under the uniformization: the points are evenly
distributed over the unit interval.
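For completeness, minimal implementations of the two rescalings compared here are given below; they are meant to be applied cross-sectionally (date by date) and the function names are ours.
def minmax_scale(x):
    """Rescale a pandas Series to the [0, 1] interval."""
    return (x - x.min()) / (x.max() - x.min())

def uniformize(x):
    """Map a pandas Series to evenly spaced values in (0, 1] via its ranks."""
    return x.rank() / x.count()

# Example of a date-by-date application:
# data_ml['Mkt_Cap_u'] = data_ml.groupby('date')['Mkt_Cap_12M_Usd'].transform(uniformize)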
Let’s briefly comment on this synthetic data. We assume that dates are ordered chrono-
logically and far away: each date stands for a year or the beginning of a decade, but the
(forward) returns are computed on a monthly basis. The first firm is hugely successful and
multiplies its cap ten times over the periods. The second firm remains stable cap-wise, while
the third one plummets. If we look at ‘local’ future returns, they are strongly negatively
related to size for the first and third firms. For the second one, there is no clear pattern.
Date-by-date, the analysis is fairly similar, though slightly nuanced.
• On date 1, the smallest firm has the largest return, and the two others have negative
returns.
• On date 2, the biggest firm has a negative return, while the two smaller firms do not.
3 For a more thorough technical discussion on the impact of feature engineering, we refer to Galili and
Meilijson (2016).
• On date 3, returns are decreasing with size. While the relationship is not always perfectly monotonic, there seems to be a link between size and return and, typically, investing in the smallest firm would be a very good strategy with this sample.
Now let us look at the output of simple regressions.
In terms of p-value (last column), the first estimation for the cap coefficient is above 5%
(Table 4.4) while the second is below 1% (Table 4.5). One possible explanation for this
discrepancy is the standard deviation of the variables. The deviations are equal to 0.47 and
0.29 for cap_0 and cap_u, respectively. Values like market capitalizations can have very
large ranges and are thus subject to substantial deviations (even after scaling). Working
with uniformized variables reduces dispersion and can help solve this problem.
Note that this is a double-edged sword: while it can help avoid false negatives, it can
also lead to false positives.
TABLE 4.4: Regression output when the independent variable comes from min-max rescal-
ing
TABLE 4.5: Regression output when the independent variable comes from uniformization
2. Create a new categorical label based on formulae (4.4) and (4.2). The time
series of the VIX can also be retrieved from the Federal Reserve’s website:
https://fred.stlouisfed.org/series/VIXCLS.
3. Plot the histogram of the R12M_Usd variable. Clearly, some outliers are present. Iden-
tify the stock with highest value for this variable and determine if the value can be
correct or not.
Part II
5 Penalized regressions and sparse hedging for minimum variance portfolios
In this chapter, we introduce the widespread concept of regularization for linear models.
There are in fact several possible applications for these models. The first one is straight-
forward: resort to penalizations to improve the robustness of factor-based predictive re-
gressions. The outcome can then be used to fuel an allocation scheme. For instance, Han
et al. (2019) and Rapach and Zhou (2019) use penalized regressions to improve stock return
prediction when combining forecasts that emanate from individual characteristics.
Similar ideas can be developed for macroeconomic predictions for instance, as in Uematsu
and Tanaka (2019). The second application stems from a less known result which originates
from Stevens (1998). It links the weights of optimal mean-variance portfolios to particular
cross-sectional regressions. The idea is then different and the purpose is to improve the
quality of mean-variance driven portfolio weights. We present the two approaches below
after an introduction on regularization techniques for linear models.
Other examples of financial applications of penalization can be found in d’Aspremont (2011),
Ban et al. (2016) and Kremer et al. (2019). In any case, the idea is the same as in the seminal
paper Tibshirani (1996): standard (unconstrained) optimization programs may lead to noisy
estimates, thus adding a structuring constraint helps remove some noise (at the cost of a
possible bias). For instance, Kremer et al. (2019) use this concept to build more robust
mean-variance (Markowitz (1952)) portfolios and Freyberger et al. (2020) use it to single
out the characteristics that really help explain the cross-section of equity returns.
$$\nabla_\beta L = \frac{\partial}{\partial \beta}(y - X\beta)'(y - X\beta) = \frac{\partial}{\partial \beta}\left(\beta'X'X\beta - 2y'X\beta\right) = 2X'X\beta - 2X'y.$$
Setting this gradient to zero yields
$$\beta^* = (X'X)^{-1}X'y, \quad (5.1)$$
which is known as the standard ordinary least squares (OLS) solution of the linear model. If the matrix $X$ has dimensions $I \times K$, then $X'X$ can only be inverted if the number of rows $I$ is strictly superior to the number of columns $K$. In some cases, that may not hold: there are more predictors than instances and there is no unique value of $\beta$ that minimizes the loss. If $X'X$ is non-singular (or positive definite), then the second order condition ensures that $\beta^*$ yields a global minimum for the loss $L$ (the second order derivative of $L$ with respect to $\beta$, the Hessian matrix, is $2X'X$).
Up to now, we have made no distributional assumption on any of the above quantities.
Standard assumptions are the following:
- $E[y|X] = X\beta$: linear shape for the regression function;
- $E[\epsilon|X] = 0$: errors are independent of predictors;
- $E[\epsilon\epsilon'|X] = \sigma^2 I$: homoscedasticity - errors are uncorrelated and have identical variance;
- the $\epsilon_i$ are normally distributed.
Under these hypotheses, it is possible to perform statistical tests related to the β̂ coefficients.
We refer to chapters 2 to 4 in Greene (2018) for a thorough treatment on linear models as
well as to chapter 5 of the same book for details on the corresponding tests.
for some strictly positive constant $\delta$. Under least square minimization, this amounts to solving the Lagrangian formulation:
$$\min_\beta \sum_{i=1}^I \left(y_i - \sum_{j=1}^J \beta_j x_{i,j}\right)^2 + \lambda\sum_{j=1}^J |\beta_j|, \quad (5.3)$$
for some value $\lambda > 0$ which naturally depends on $\delta$ (the lower the $\delta$, the higher the $\lambda$: the constraint is more binding). This specification seems close to the ridge regression (L2 penalization),
but the outcome is in fact quite different, which justifies a separate treatment. Mechanically,
as λ, the penalization intensity, increases (or as δ in (5.5) decreases), the coefficients of the
ridge regression all slowly decrease in magnitude towards zero. In the case of the LASSO, the
convergence is somewhat more brutal as some coefficients shrink to zero very quickly. For
λ sufficiently large, only one coefficient will remain non-zero, while in the ridge regression,
the zero value is only reached asymptotically for all coefficients. We invite the interested reader to have a look at the survey in Hastie (2020) about all applications of ridge regressions in data science, with links to other topics like cross-validation and dropout regularization, among others.
To depict the difference between the LASSO and the ridge regression, let us consider the case of $K = 2$ predictors, which is shown in Figure 5.1. The optimal unconstrained solution $\beta^*$ is pictured in red in the middle of the space. The problem is naturally that it does not satisfy the imposed conditions. These constraints are shown in light grey: they take the shape of a square $|\beta_1| + |\beta_2| \leq \delta$ in the case of the LASSO and a circle $\beta_1^2 + \beta_2^2 \leq \delta$ for the ridge regression. In order to satisfy these constraints, the optimization needs to look in the vicinity of $\beta^*$ by allowing for larger error levels. These error levels are shown as orange ellipsoids in the figure. When the requirement on the error is loose enough, one ellipsoid touches the acceptable boundary (in grey), and this is where the constrained solution is located.
FIGURE 5.1: Schematic view of Lasso (left) versus ridge (right) regressions.
Both methods work when the number of exogenous variables surpasses that of observations, i.e., in the case where classical regressions are ill-defined. This is easy to see in the case of the ridge regression, for which the closed-form solution is simply
$$\hat{\beta}_\lambda = (X'X + \lambda I_N)^{-1}X'y.$$
The additional term $\lambda I_N$ compared to Equation (5.1) ensures that the inverse matrix is well-defined whenever $\lambda > 0$. As $\lambda$ increases, the magnitudes of the $\hat{\beta}_i$ decrease, which explains why penalizations are sometimes referred to as shrinkage methods (the estimated coefficients see their values shrink).
Zou and Hastie (2005) propose to benefit from the best of both worlds when combining
both penalizations in a convex manner (which they call the elasticnet):
$$y_i = \sum_{j=1}^J \beta_j x_{i,j} + \epsilon_i, \quad \text{s.t.} \quad \alpha\sum_{j=1}^J |\beta_j| + (1-\alpha)\sum_{j=1}^J \beta_j^2 < \delta, \quad i = 1, \dots, N, \quad (5.6)$$
The main advantage of the LASSO compared to the ridge regression is its selection ca-
pability. Indeed, given a very large number of variables (or predictors), the LASSO will
progressively rule out those that are the least relevant. The elasticnet preserves this selec-
tion ability, and Zou and Hastie (2005) argue that in some cases, it is even more effective
than the LASSO. The parameter α ∈ [0, 1] tunes the smoothness of convergence (of the
coefficients) towards zero. The closer α is to zero, the smoother the convergence.
5.1.3 Illustrations
We begin with simple illustrations of penalized regressions. We start with the LASSO. The original implementation by the authors is in R; the syntax is slightly different compared to usual linear models. The illustrations are run on the whole dataset. First, we estimate the coefficients over a large array of penalization values so that the results for different penalization intensities (λ) can be shown immediately.
Once the coefficients are computed, they require some wrangling before plotting. Also, there
are too many of them, so we only plot a subset of them.
from sklearn.linear_model import Lasso

lasso_res = {}                                        # One entry of coefficients per penalization level
for alpha in np.arange(0.001, 0.2, 0.002):            # Grid of penalization intensities (illustrative)
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_penalized, y_penalized)
    lasso_res[alpha] = lasso.coef_                    # Extract LASSO coefs
df_lasso_res = pd.DataFrame.from_dict(lasso_res).T    # Transpose the dataframe for plotting
df_lasso_res.columns = features                       # Adding the names of the factors
predictors = (df_lasso_res.abs().sum() > 0.05)        # Selecting the most relevant
df_lasso_res.loc[:, predictors].plot(
    xlabel='Lambda', ylabel='Beta', figsize=(12,8));  # Plot!
FIGURE 5.2: LASSO model. The dependent variable is the 1 month ahead return.
Figure 5.2 shows the evolution of the coefficients as the penalization intensity, λ, increases. For some characteristics, like Ebit_Ta (in orange), the convergence to zero
is rapid. Other variables resist the penalization longer, like Mkt_Cap_3M_Usd, which is
the last one to vanish. Essentially, this means that at the first order, this variable is an
important driver of future 1-month returns in our sample. Moreover, the negative sign of its
coefficient is a confirmation (again, in this sample) of the size anomaly, according to which
small firms experience higher future returns compared to their larger counterparts.
Next, we turn to ridge regressions.
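The corresponding code is analogous to the LASSO loop; the sketch below uses scikit-learn's Ridge estimator on the same X_penalized, y_penalized and features objects, with an illustrative log-spaced grid of penalization values.
from sklearn.linear_model import Ridge

ridge_res = {}
for alpha in np.logspace(-2, 4, 50):                   # Log-spaced penalization intensities
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_penalized, y_penalized)
    ridge_res[alpha] = ridge.coef_                     # Extract ridge coefs
df_ridge_res = pd.DataFrame.from_dict(ridge_res).T
df_ridge_res.columns = features
df_ridge_res.plot(xlabel='Lambda', ylabel='Beta', logx=True, figsize=(12,8));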
FIGURE 5.3: Ridge regression. The dependent variable is the 1 month ahead return.
In Figure 5.3, the convergence to zero is much smoother. We underline that the x-axis (penalization intensities) has a log scale. This allows us to see the early patterns (close to zero, to the left) more clearly. As in the previous figure, the Mkt_Cap_3M_Usd predictor clearly dominates, again with large negative coefficients. Nonetheless, as λ increases, its domination over the other predictors fades.
By definition, the elasticnet will produce curves that behave like a blend of the two above
approaches. Nonetheless, as long as α > 0, the selective property of the LASSO will be
preserved: some features will see their coefficients shrink rapidly to zero. In fact, the strength
of the LASSO is such that a balanced mix of the two penalizations is not reached at α = 1/2,
but rather at a much smaller value (possibly below 0.1).
$$w_{MSR} = \frac{\Sigma^{-1}\mu}{\mathbf{1}'\Sigma^{-1}\mu}, \quad (5.8)$$
where $\mu$ is the vector of expected (excess) returns. Taking $\mu = \mathbf{1}$ yields the minimum variance portfolio, which is agnostic in terms of the first moment of expected returns (and, as such, usually more robust than most alternatives which try to estimate $\mu$ and often fail).
Usually, the traditional way is to estimate Σ and to invert it to get the MSR weights.
However, several approaches aim at estimating Σ−1 directly and we present one of them
below. We proceed one asset at a time, that is, one line of Σ−1 at a time.
If we decompose the matrix $\Sigma$ into
$$\Sigma = \begin{bmatrix} \sigma^2 & c' \\ c & C \end{bmatrix},$$
classical partitioning results (e.g., Schur complements) imply
$$\Sigma^{-1} = \begin{bmatrix} (\sigma^2 - c'C^{-1}c)^{-1} & -(\sigma^2 - c'C^{-1}c)^{-1}c'C^{-1} \\ -(\sigma^2 - c'C^{-1}c)^{-1}C^{-1}c & C^{-1} + (\sigma^2 - c'C^{-1}c)^{-1}C^{-1}cc'C^{-1} \end{bmatrix}.$$
We are interested in the first line, which has two components: the factor $(\sigma^2 - c'C^{-1}c)^{-1}$ and the line vector $c'C^{-1}$. $C$ is the covariance matrix of assets 2 to $N$ and $c$ is the covariance between the first asset and all other assets. The first line of $\Sigma^{-1}$ is
$$(\sigma^2 - c'C^{-1}c)^{-1}\begin{bmatrix} 1 & -c'C^{-1}\end{bmatrix}.$$
We now consider an alternative setting. We regress the returns of the first asset on those of all other assets:
$$r_{1,t} = a_1 + \sum_{n=2}^N \beta_{1|n} r_{n,t} + \epsilon_t, \quad \text{i.e.,} \quad r_1 = a_1\mathbf{1}_T + R_{-1}\beta_1 + \epsilon_1, \quad (5.10)$$
where $R_{-1}$ gathers the returns of all assets except the first one. The OLS estimator for $\beta_1$ is
$$\hat{\beta}_1 = C^{-1}c, \quad (5.11)$$
and this is the partitioned form (when a constant is included in the regression) stemming
from the Frisch-Waugh-Lovell theorem (see chapter 3 in Greene (2018)).
In addition,
$$(1 - R^2)\sigma_{r_1}^2 = \sigma_{r_1}^2 - c'C^{-1}c = \sigma_{\epsilon_1}^2. \quad (5.12)$$
The proof of this last fact is given below. With $X$ being the concatenation of $\mathbf{1}_T$ with the returns $R_{-1}$ and with $y = r_1$, the classical expression of the $R^2$ is
$$R^2 = 1 - \frac{\hat{\epsilon}'\hat{\epsilon}}{T\sigma_Y^2} = 1 - \frac{y'y - \hat{\beta}'X'X\hat{\beta}}{T\sigma_Y^2} = 1 - \frac{y'y - y'X\hat{\beta}}{T\sigma_Y^2}.$$
Given the first line of Σ−1 , it suffices to multiply by µ to get the portfolio weight in the
first asset (up to a scaling constant).
There is a nice economic intuition behind the above results which justifies the term “sparse
hedging”. We take the case of the minimum variance portfolio, for which µ = 1. In Equation
(5.10), we try to explain the return of asset 1 with that of all other assets. In the above
equation, up to a scaling constant, the portfolio has a unit position in the first asset and
−β̂ 1 positions in all other assets. Hence, the purpose of all other assets is clearly to hedge
the return of the first one. In fact, these positions are aimed at minimizing the squared
errors of the aggregate portfolio for the first asset (these errors are exactly 1 ). Moreover,
the scaling factor σ−2
1
is also simple to interpret: the more we trust the regression output
(because of a small σ21 ), the more we invest in the hedging portfolio of the asset.
This reasoning is easily generalized for any line of Σ−1 , which can be obtained by regressing
the returns of asset i on the returns of all other assets. If the allocation scheme has the form
(5.8) for given values of µ, then the pseudo-code for the sparse portfolio strategy is the
following.
At each date (which we omit for notational convenience),
For all stocks i,
where we recall that the vectors βi| = [βi|1 , . . . , βi|i−1 , βi|i+1 , . . . , βi|N ] are the coefficients
from regressing the returns of asset i against the returns of all other assets.
The introduction of the penalization norms is the new ingredient, compared to the orig-
inal approach of Stevens (1998). The benefits are twofold: first, introducing constraints
yields weights that are more robust and less subject to errors in the estimates of µ; sec-
ond, because of sparsity, weights are more stable, less leveraged and thus the strategy is
less impacted by transaction costs. Before we turn to numerical applications, we mention a
more direct route to the estimation of a robust inverse covariance matrix: the Graph-
ical LASSO. The GLASSO estimates the precision matrix (inverse covariance matrix) via
maximum likelihood while imposing constraints/penalizations on the weights of the matrix.
When the penalization is strong enough, this yields a sparse matrix, i.e., a matrix in which
some and possibly many coefficients are zero. We refer to the original article by Friedman
et al. (2008) for more details on this subject.
5.2.2 Example
The interest of sparse hedging portfolios is to propose a robust approach to the estimation
of minimum variance policies. Indeed, since the vector of expected returns µ is usually very
noisy, a simple solution is to adopt an agnostic view by setting µ = 1. In order to test the
added value of the sparsity constraint, we must resort to a full backtest. In doing so, we
anticipate the content of Chapter 12.
We first prepare the variables. Sparse portfolios are based on returns only; we thus base
our analysis on the dedicated variable in matrix/rectangular format (returns) which were
created at the end of Chapter 1.
Then, we initialize the output variables: portfolio weights and portfolio returns. We want to
compare three strategies: an equally weighted (EW) benchmark of all stocks, the classical
global minimum variance portfolio (GMV), and the sparse-hedging approach to minimum
variance.
t_oos=returns.index[returns.index>separation_date].values
# Out-of-sample data
Tt = len(t_oos) # Nb of dates
nb_port = 3 # Nb of portfolios/strats
port_weights = {} # Initial portf. weights in dict
port_returns = {} # Initial portf. returns in dict
Next, because it is the purpose of this section, we isolate the computation of the weights
of sparse-hedging portfolios. In the case of minimum variance portfolios, when µ = 1, the
weight in asset 1 will simply be the sum of all terms in Equation (5.13) and the other
weights have similar forms.
from sklearn.linear_model import ElasticNet

def sparse_weights(returns, alpha, Lambda):            # Function name and signature are illustrative
    weights = []
    lr = ElasticNet(alpha=alpha, l1_ratio=Lambda)      # Penalized (elasticnet) regression
    for col in returns.columns:                        # Loop on the assets
        y = returns[col].values                        # Dependent variable: returns of asset i
        X = returns.drop(col, axis=1).values           # Independent variables: all other assets
        lr.fit(X, y)
        err = y - lr.predict(X)                        # Prediction errors
        w = (1 - np.sum(lr.coef_)) / np.var(err)       # Weight of asset i (mu = 1 case)
        weights.append(w)
    return weights / np.sum(weights)                   # Normalisation of weights
In order to benchmark our strategy, we define a meta-weighting function that embeds three
strategies: (1) the EW benchmark, (2) the classical GMV and (3) the sparse-hedging mini-
mum variance. For the GMV, since there are much more assets than dates, the covariance
matrix is singular. Thus, we have a small heuristic shrinkage term. For a more rigorous
treatment of this technique, we refer to the original article Ledoit and Wolf (2004) and to
the recent improvements mentioned in Ledoit and Wolf (2017). In short, we use Σ̂ = ΣS +δI
for some small constant δ (equal to 0.01 in the code below).
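The chunk that defines this meta-weighting function is not reproduced here. A minimal sketch consistent with the description above (the name portf_meta, as well as the reliance on the weights_sparsehedge function defined previously, are illustrative assumptions) could be:

def portf_meta(past_returns, j, alpha, Lambda, delta=0.01):
    # j = 0: equally weighted, j = 1: shrunk GMV, j = 2: sparse hedging
    N = past_returns.shape[1]
    if j == 0:                                               # EW benchmark
        return np.ones(N) / N
    elif j == 1:                                             # Global minimum variance
        sigma = np.cov(past_returns.values, rowvar=False) + delta * np.eye(N)  # Shrunk covariance
        w = np.linalg.solve(sigma, np.ones(N))               # Sigma^{-1} * 1
        return w / np.sum(w)                                 # Normalisation
    else:                                                    # Sparse-hedging minimum variance
        return weights_sparsehedge(past_returns, alpha, Lambda)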
Finally, we proceed to the backtesting loop. Given the number of assets, the execution of
the loop takes a few minutes. At the end of the loop, we compute the standard deviation of
portfolio returns (monthly volatility). This is the key indicator as minimum variance seeks
to minimize this particular metric.
# Backtesting loop (the outer structure is reconstructed; portf_meta is the hypothetical
# meta-weighting function sketched above, alpha/Lambda are the elasticnet parameters)
for date in t_oos:                                      # Loop on out-of-sample dates
    returns_temp = {}
    for j in range(nb_port):                            # Loop on the three strategies
        w = portf_meta(returns[returns.index < date], j, alpha, Lambda)   # Weights
        rets = np.sum(w * returns.loc[date].values)     # Realized portfolio return
        returns_temp[j] = rets
    port_returns[date] = returns_temp
port_returns_final = pd.concat(
    {k: pd.DataFrame.from_dict(v, 'index') for k, v in port_returns.items()},
    axis=0).reset_index()
# Dict comprehension approach -- https://www.python.org/dev/peps/pep-0274/
strategy        EW        MV    Sparse
return    0.041804  0.033504  0.034736
The aim of the sparse hedging restrictions is to provide a better estimate of the covariance
structure of assets so that the estimation of minimum variance portfolio weights is more
accurate. From the above exercise, we see that the monthly volatility is indeed reduced when
building covariance matrices based on sparse hedging relationships. This is not the case if
we use the shrunk sample covariance matrix because there is probably too much noise in the
estimates of correlations between assets. Working with daily returns would likely improve
the quality of the estimates. But the above backtest shows that the penalized methodology
performs well even when the number of observations (dates) is small compared to the number
of assets.
coefficients of predictive regressions is further documented by Henkel et al. (2011) for short
term returns. Lastly, Farmer et al. (2019) introduce the concept of pockets of predictability:
assets or markets experience different phases; in some stages, they are predictable and in
some others, they aren’t. Pockets are measured both by the number of days that a t-statistic
is above a particular threshold and by the magnitude of the R2 over the considered period.
Formal statistical tests are developed by Demetrescu et al. (2022).
The introduction of penalization within predictive regressions goes back at least to Rapach
et al. (2013), where they are used to assess lead-lag relationships between US markets
and other international stock exchanges. More recently, Chinco et al. (2019a) use LASSO
regressions to forecast high frequency returns based on past returns (in the cross-section) at
various horizons. They report statistically significant gains. Han et al. (2019) and Rapach
and Zhou (2019) use LASSO and elasticnet regressions (respectively) to improve forecast
combinations and single out the characteristics that matter when explaining stock returns.
These contributions underline the relevance of the overlap between predictive regressions
and penalized regressions. In simple machine-learning based asset pricing, we often seek to
build models such as that of Equation (3.7). If we stick to a linear relationship and add
penalization terms, then the model becomes:
$$r_{t+1,n} = \alpha_n + \sum_{k=1}^{K} \beta_n^k f_{t,n}^k + \epsilon_{t+1,n}, \quad \text{s.t.} \quad (1-\alpha)\sum_{j=1}^{J}|\beta_j| + \alpha\sum_{j=1}^{J}\beta_j^2 < \theta$$
We then report two key performance measures: the mean squared error and the hit ratio,
which is the proportion of times that the prediction guesses the sign of the return correctly.
A detailed account of metrics is given later in the book (Chapter 12).
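The chunk that trains the penalized model and produces the output below is not shown. Under the assumption that fit_pen_pred is an elasticnet model fitted on training data (the names X_penalized_train and y_penalized_train and the penalty intensities are illustrative), a sketch is:

from sklearn.linear_model import ElasticNet

fit_pen_pred = ElasticNet(alpha=0.1, l1_ratio=0.1)       # Illustrative penalty intensities
fit_pen_pred.fit(X_penalized_train, y_penalized_train)   # Training
mse = np.mean((fit_pen_pred.predict(X_penalized_test) - y_penalized_test)**2)
print(f'MSE: {mse}')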
MSE: 0.03699695809185004
hitratio = np.mean(fit_pen_pred.predict(X_penalized_test) * y_penalized_test > 0)
Classification and regression trees are simple yet powerful clustering algorithms popularized
by the monograph of Breiman et al. (1984). Decision trees and their extensions are known
to be quite efficient forecasting tools when working on tabular data. A large proportion of
winning solutions in ML contests (especially on the Kaggle website1 ) resort to improvements
of simple trees. For instance, the meta-study in bioinformatics by Olson et al. (2018) finds
that boosted trees and random forests are the top 2 algorithms from a group of 13, excluding
neural networks.
Recently, the surge in Machine Learning applications in Finance has led to multiple pub-
lications that use trees in portfolio allocation problems. A long, though not exhaustive,
list includes: Ballings et al. (2015), Patel et al. (2015a), Patel et al. (2015b), Moritz and
Zimmermann (2016), Krauss et al. (2017), Gu et al. (2020), Guida and Coqueret (2018a),
Coqueret and Guida (2020) and Simonian et al. (2019). One notable contribution is Bryz-
galova et al. (2019b) in which the authors create factors from trees by sorting portfolios via
simple trees, which they call Asset Pricing Trees.
In this chapter, we review the methodologies associated to trees and their applications in
portfolio choice.
[Figure: a stylized two-level tree; the first split (depth one) separates complicated from simple stars, the second splits (depth two) are based on size.]
The dependent variable is the color (let’s consider the wavelength associated to the color
for simplicity). The first split is made according to size or complexity. Clearly, complexity
is the better choice: complicated stars are blue and green, while simple stars are yellow,
orange and red. Splitting according to size would have mixed blue and yellow stars (small
ones) and green and orange stars (large ones).
The second step is to split the two clusters one level further. Since only one variable (size)
is relevant, the secondary splits are straightforward. In the end, our stylized tree has four
consistent clusters. The analogy with factor investing is simple: the color represents perfor-
mance: red for high performance and blue for mediocre performance. The features (size and
complexity of stars) are replaced by firm-specific attributes, such as capitalization, account-
ing ratios, etc. Hence, the purpose of the exercise is to find the characteristics that allow us to
split firms into those that will perform well and those likely to fare more poorly.
We now turn to the technical construction of regression trees (splitting process). We follow
the standard literature as exposed in Breiman et al. (1984) or in chapter 9 of Hastie et al.
(2009). Given a sample of (yi ,xi ) of size I, a regression tree seeks the splitting points that
minimize the total variation of the yi inside the two child clusters. These two clusters need
not have the same size. In order to do that, it proceeds in two steps. First, it finds, for each
(k)
feature xi , the best splitting point (so that the clusters are homogeneous in Y). Second,
it selects the feature that achieves the highest level of homogeneity.
Homogeneity in regression trees is closely linked to variance. Since we want the yi inside
each cluster to be similar, we seek to minimize their variability (or dispersion) inside
each cluster and then sum the two figures. We cannot sum the variances because this would
not take into account the relative sizes of clusters. Hence, we work with total variation,
which is the variance times the number of elements in the clusters.
Below, the notation is a bit heavy because we resort to superscripts k (index of the feature),
but it is largely possible to ignore these superscripts to ease understanding. The first step
is to find the best split for each feature, that is, to solve $\underset{c^{(k)}}{\text{argmin}}\, V_I^{(k)}(c^{(k)})$ with

$$V_I^{(k)}(c^{(k)}) = \underbrace{\sum_{x_i^{(k)} < c^{(k)}} \left(y_i - m_I^{k,-}(c^{(k)})\right)^2}_{\text{Total dispersion of first cluster}} + \underbrace{\sum_{x_i^{(k)} > c^{(k)}} \left(y_i - m_I^{k,+}(c^{(k)})\right)^2}_{\text{Total dispersion of second cluster}}, \qquad (6.1)$$
where
$$m_I^{k,-}(c^{(k)}) = \frac{1}{\#\{i,\, x_i^{(k)} < c^{(k)}\}} \sum_{\{x_i^{(k)} < c^{(k)}\}} y_i \quad \text{and} \quad m_I^{k,+}(c^{(k)}) = \frac{1}{\#\{i,\, x_i^{(k)} > c^{(k)}\}} \sum_{\{x_i^{(k)} > c^{(k)}\}} y_i$$
are the average values of $Y$, conditional on $X^{(k)}$ being smaller or larger than $c^{(k)}$. The cardinal
function $\#\{\cdot\}$ counts the number of instances of its argument. For feature $k$, the optimal
split $c^{k,*}$ is thus the one for which the total dispersion over the two subgroups is the smallest:
the optimal splits satisfy $c^{k,*} = \underset{c^{(k)}}{\text{argmin}}\, V_I^{(k)}(c^{(k)})$. Of all the possible splitting variables,
the tree will choose the one that minimizes the total dispersion not only over all splits, but
also over all variables: $k^* = \underset{k}{\text{argmin}}\, V_I^{(k)}(c^{k,*})$.
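As a minimal illustration of this criterion (a sketch only, not the implementation used by tree libraries), an exhaustive search of the best split for one feature can be coded as follows:

import numpy as np

def best_split_one_feature(x, y):
    # x: values of one feature, y: labels; returns the split that minimizes Equation (6.1)
    best_c, best_v = None, np.inf
    for c in np.unique(x)[:-1]:                   # Candidate split points
        left, right = y[x <= c], y[x > c]         # The two child clusters
        v = np.sum((left - left.mean())**2) + np.sum((right - right.mean())**2)
        if v < best_v:                            # Keep the split with smallest total dispersion
            best_c, best_v = c, v
    return best_c, best_v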
After one split is performed, the procedure continues on the two newly formed clusters.
There are several criteria that can determine when to stop the splitting process (see Section
6.1.3). One simple criterion is to fix a maximum number of levels (depth) for the tree. A
usual condition is to impose a minimum gain that is expected for each split. If the reduction
in dispersion after the split is only marginal and below a specified threshold, then the split
is not executed. For further technical discussions on decision trees, we refer for instance to
section 9.2.4 of Hastie et al. (2009).
When the tree is built (trained), a prediction for new instances is easy to make. Given its
feature values, the instance ends up in one leaf of the tree. Each leaf has an average value
for the label: this is the predicted outcome. Of course, this only works when the label is
numerical. We discuss below the changes that occur when it is categorical.
dominant classes. There are several metrics proposed by the literature and all are based on
the proportions generated by the output. If there are J classes, we denote these proportions
with $p_j$. For each leaf, the usual loss functions are:
• the Gini impurity index: $1 - \sum_{j=1}^{J} p_j^2$;
• the misclassification error: $1 - \max_j p_j$;
• the entropy: $-\sum_{j=1}^{J} \log(p_j)\, p_j$.
The Gini index is nothing but one minus the Herfindahl index which measures the diver-
sification of a portfolio. Trees seek partitions that are the least diversified. The minimum
value of the Gini index is zero and reached when one pj = 1 and all others are equal to
zero. The maximum value is equal to 1 − 1/J and is reached when all pj = 1/J. Similar
relationships hold for the other two losses. One drawback of the misclassification error is its
lack of differentiability which explains why the other two options are often favored.
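For illustration purposes only (this is not part of any tree library), the three measures can be computed from a vector of class proportions as follows:

import numpy as np

def impurities(p):
    p = np.asarray(p)                               # Class proportions of one leaf (sum to one)
    gini = 1 - np.sum(p**2)                         # Gini impurity index
    misclass = 1 - np.max(p)                        # Misclassification error
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))  # Entropy (0*log(0) counted as zero)
    return gini, misclass, entropy

For instance, impurities([0.5, 0.5]) yields (0.5, 0.5, 0.693), while a pure leaf, impurities([1.0, 0.0]), yields (0.0, 0.0, 0.0).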
Once the tree is grown, new instances automatically belong to one final leaf. This leaf is
associated to the proportions of classes it nests. Usually, to make a prediction, the class
with highest proportion (or probability) is chosen when a new instance is associated with
the leaf.
• all leaves comprise instances that cannot be further segregated based on the current set
of features.
At this stage, the splitting process cannot be pursued.
Obviously, fully grown trees often lead to almost perfect fits when the predictors are relevant,
numerous and numerical. Nonetheless, the fine grained idiosyncrasies of the training sample
are of little interest for out-of-sample predictions. For instance, being able to perfectly match
the patterns of 2000 to 2006 will probably not be very interesting in the period from 2007
to 2009. The most reliable sections of the trees are those closest to the root because they
embed large portions of the data: the average values in the early clusters are trustworthy
because they are computed on a large number of observations. The first splits are those that
matter the most because they highlight the most general patterns. The deepest splits only
deal with the peculiarities of the sample.
Thus, it is imperative to limit the size of the tree. There are several ways to prune the tree
and all depend on some particular criteria. We list a few of them below:
• Impose a minimum number of instances for each terminal node (leaf). This ensures that
each final cluster is composed of a sufficient number of observations. Hence, the average
value of the label will be reliable because it is calculated on a large amount of data.
• Similarly, it can be imposed that a cluster has a minimal size before even considering
any further split. This criterion is of course related to the one above.
• Require a certain threshold of improvement in the fit. If a split does not sufficiently
reduce the loss, then it can be deemed unnecessary. The user specifies a small number
$\epsilon > 0$ and a split is only validated if the loss obtained post-split is smaller than $1 - \epsilon$
times the loss before the split.
• Limit the depth of the tree. The depth is defined as the overall maximum number of
splits between the root and any leaf of the tree.
In the example below, we implement all of these criteria at the same time, but usually, two
of them at most should suffice.
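The training chunk itself is not reproduced above. A minimal scikit-learn sketch that imposes comparable restrictions (X and y denote the features and the 1-month-ahead label of the sample) could look as follows; note that the parameter names referenced later in the text follow rpart-style conventions and the mapping below is only approximate: min_samples_leaf plays the role of minbucket, min_samples_split that of minsplit, and min_impurity_decrease only loosely mimics the relative cp criterion.

from sklearn.tree import DecisionTreeRegressor, plot_tree

fit_tree = DecisionTreeRegressor(
    min_samples_leaf=3500,          # Minimum size of terminal leaves ("minbucket")
    min_samples_split=8000,         # Minimum cluster size to consider a split ("minsplit")
    min_impurity_decrease=0.0001,   # Minimum improvement required by a split (role of "cp")
    max_depth=3)                    # Maximum number of levels
fit_tree.fit(X, y)                  # Training on features X and future returns y
plot_tree(fit_tree);                # Visualization of the tree (cf. Figure 6.2)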
There usually exists a convention in the representation of trees. At each node, a condition
describes the split with a Boolean expression. If the expression is true, then the instance
goes to the left cluster; if not, it goes to the right cluster. Given the whole sample, the
initial split in this tree (Figure 6.2) is performed according to the price-to-book ratio. If the
Pb score (or value) of the instance is above 0.025, then the instance is placed in the left
bucket; otherwise, it goes in the right bucket.
At each node, there are two important metrics. The first one is the average value of the label
in the cluster, and the second one is the proportion of instances in the cluster. At the top
of the tree, all instances (100%) are present and the average 1-month future return is 1.3%.
One level below, the left cluster is by far the most crowded, with roughly 98% of observations
averaging a 1.2% return. The right cluster is much smaller (2%) but concentrates instances
with a much higher average return (5.9%). This is possibly an idiosyncrasy of the sample.
The splitting process continues similarly at each node until some condition is satisfied
(typically here: the maximum depth is reached). A color codes the average return: from
white (low return) to blue (high return). The leftmost cluster with the lowest average
return consists of firms that satisfy all the following criteria:
• have a Pb score above 0.025;
• have a score of average daily volume over the past 3 months above 0.085.
FIGURE 6.2: Simple characteristics-based tree. The dependent variable is the 1 month
future return.
Notice that one peculiarity of trees is their possible heterogeneity in cluster sizes. Sometimes,
a few clusters gather almost all of the observations, while a few small groups embed some
outliers. This is not a favorable property of trees, as small groups are more likely to be
flukes and may fail to generalize out-of-sample.
This is why we imposed restrictions during the construction of the tree. The first one
(minbucket = 3500 in the code) imposes that each cluster consists of at least 3500 instances.
The second one (minsplit) further imposes that a cluster comprises at least 8000 observations
in order to pursue the splitting process. These values logically depend on the size of the
training sample. The cp = 0.0001 parameter in the code requires any split to reduce the
loss below 0.9999 times its original value before the split. Finally, the maximum depth of
three essentially means that there are at most three splits between the root of the tree and
any terminal leaf.
The complexity of the tree (measured by the number of terminal leaves) is a decreasing
function of minbucket, minsplit, and cp as well as an increasing function of maximum
depth.
Once the model has been trained (i.e., the tree is grown), a prediction for any instance is
the average value of the label within the cluster where the instance should land.
y_pred = fit_tree.predict(X.iloc[0:6, :])   # Test (prediction) on the first six instances of the sample
print(f'y_pred: {y_pred}')
Given the figure, we immediately conclude that these first six instances all belong to the
second cluster (starting from the left).
As a verification of the first splits, we plot the smoothed average of future returns, condi-
tionally on market capitalization, price-to-book ratio, and trading volume.
import seaborn as sns
import matplotlib.pyplot as plt

unpivoted_data_ml = pd.melt(
    data_ml[['R1M_Usd', 'Mkt_Cap_12M_Usd', 'Pb', 'Advt_3M_Usd']],
    id_vars='R1M_Usd')                      # Select the columns and pivot to long format
plt.figure(figsize=(15, 5), dpi=1200)       # Create the figure before plotting
sns.lineplot(data=unpivoted_data_ml, y='R1M_Usd',
             x='value', hue='variable');    # Smoothed conditional averages (seaborn)
The graph shows the relevance of clusters based on market capitalizations and price-to-book
ratios. For low score values of these two features, the average return is high (close to +4%
on a monthly basis on the left of the curves). The pattern is more pronounced compared to
volume for instance.
Finally, we assess the predictive quality of a single tree on the testing set (the tree is grown
on the training set). We use a deeper tree, with a maximum depth of five.
y_train = training_sample['R1M_Usd'].values   # Label/dependent variable, training sample
X_train = training_sample[features].values    # Features/predictors, training sample
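The chunk that grows and evaluates this deeper tree is not shown; a sketch consistent with the text (fit_tree2, X_test and y_test are referenced below) is:

from sklearn.tree import DecisionTreeRegressor

fit_tree2 = DecisionTreeRegressor(max_depth=5)    # Deeper tree, with a maximum depth of five
fit_tree2.fit(X_train, y_train)                   # Training on the training sample
y_test = testing_sample['R1M_Usd'].values         # Test label
X_test = testing_sample[features].values          # Test features
mse = np.mean((fit_tree2.predict(X_test) - y_test)**2)
print(f'MSE: {mse}')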
MSE: 0.03699695809185004
We then transform these results into a hit ratio.
hitratio = np.mean(fit_tree2.predict(X_test)*y_test>0)
print(f'Hit Ratio: {hitratio}')
6.2.1 Principle
Most of the time, when having several modelling options at hand, it is not obvious upfront
which individual model is the best, hence a combination seems a reasonable path towards
the diversification of prediction errors (when they are not too correlated). Some theoretical
foundations of model diversification were laid out in Schapire (1990).
More practical considerations were proposed later in Ho (1995) and more importantly in
Breiman (2001) which is the major reference for random forests. There are two ways to
create multiple predictors from simple trees, and random forests combine both:
• first, the model can be trained on similar yet different datasets. One way to achieve
this is via bootstrap: the instances are resampled with or without replacement (for each
individual tree), yielding new training data each time a new tree is built.
6.2 Random forests 83
• second, the data can be altered by curtailing the number of predictors. Alternative
models are built based on different sets of features. The user chooses how many features
to retain, and then the algorithm selects these features randomly at each try.
Hence, it becomes simple to grow many different trees and the ensemble is simply a
weighted combination of all trees. Usually, equal weights are used, which is an agnostic
and robust choice. We illustrate the idea of simple combinations (also referred to as bag-
ging) in Figure 6.4 below. The terminal prediction is simply the mean of all intermediate
predictions.
Random forests, because they are built on the idea of bootstrapping, are more efficient
than simple trees. They are used by Ballings et al. (2015), Patel et al. (2015a), Krauss et al.
(2017), and Huck (2019) and they are shown to perform very well in these papers. The
original theoretical properties of random forests are demonstrated in Breiman (2001) for
classification trees. In classification exercises, the decision is taken by a vote: each tree votes
for a particular class and the class with the most votes wins (with possible random picks in
case of ties). Breiman (2001) defines the margin function as
$$mg = M^{-1}\sum_{m=1}^{M} 1_{\{h_m(\mathbf{x})=y\}} - \max_{j\neq y}\left(M^{-1}\sum_{m=1}^{M} 1_{\{h_m(\mathbf{x})=j\}}\right),$$
where the left part is the average number of votes of the M trees $h_m$ for the correct
class (i.e., the cases when the prediction $h_m(\mathbf{x})$ matches the true value $y$). The right part is the maximum
average for any other class. The margin reflects the confidence that the aggregate forest will
classify properly. The generalization error is the probability that mg is strictly negative.
Breiman (2001) shows that the inaccuracy of the aggregation (as measured by generalization
error) is bounded by $\bar{\rho}(1-s^2)/s^2$, where
• $s$ is the strength (average quality²) of the individual classifiers and
• $\bar{\rho}$ is the average correlation between the learners.
Notably, Breiman (2001) also shows that as the number of trees grows to infinity, the
inaccuracy converges to some finite number which explains why random forests are not
prone to overfitting.
While the original paper of Breiman (2001) is dedicated to classification models, many
articles have since then tackled the problem of regression trees. We refer the interested reader
to Biau (2012) and Scornet et al. (2015). Finally, further results on classifying ensembles
can be obtained in Biau et al. (2008), and we mention the short survey paper by Denil et al.
(2014) which sums up recent results in this field.
2 The strength is measured as the average margin, i.e., the average of mg when there is only one tree.
One first comment is that each instance has its own prediction, which contrasts with the
clustered outputs of simple trees. Combining many trees leads to tailored forecasts.
Note that the second line of the chunk freezes the random number generation. Indeed,
random forests are by construction contingent on the arbitrary combinations of instances
and features that are chosen to build the individual learners.
In the above example, each individual learner (tree) is built on 10,000 randomly chosen
instances (without replacement), and each terminal leaf (cluster) must comprise at least
240 elements (observations). In total, 40 trees are aggregated, and each tree is constructed
based on 30 randomly chosen predictors (out of the whole set of features).
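Since the training chunk did not survive the layout, here is a minimal scikit-learn sketch consistent with that description (the name fit_RF and the random_state value are assumptions; note also that scikit-learn bootstraps instances with replacement, so the sampling without replacement mentioned above is only approximated):

from sklearn.ensemble import RandomForestRegressor

fit_RF = RandomForestRegressor(
    n_estimators=40,         # 40 aggregated trees
    min_samples_leaf=240,    # At least 240 observations in each terminal leaf
    max_features=30,         # 30 randomly chosen predictors for each tree
    max_samples=10000,       # 10,000 instances drawn for each tree
    bootstrap=True,          # Required for max_samples to apply
    random_state=42)         # Freeze the random number generation
fit_RF.fit(X_train, y_train)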
Unlike for simple trees, it is not possible to simply illustrate the outcome of the learning
process (though solutions exist, see Section 13.1.1). It could be possible to extract all 40
trees, but a synthetic visualization is out-of-reach. A simplified view can be obtained via
variable importance, as is discussed in Section 13.1.2.
Finally, we can assess the accuracy of the model.
MSE: 0.03686227217696956
The MSE is smaller than 4%, and the hit ratio is higher than 53%, which is reasonably
above both 50% and 52% thresholds.
Let’s see if we can improve the hit ratio by resorting to a classification exercise. We start
by training the model on a new formula (the label is R1M_Usd_C).
6.3.1 Methodology
The origins of Adaboost go back to Freund and Schapire (1997) and Freund and Schapire
(1996), and for the sake of completeness, we also mention the book dedicated to boosting by
Schapire and Freund (2012). Extensions of these ideas are proposed in Friedman et al. (2000)
(the so-called real Adaboost algorithm) and in Drucker (1997) (for regression analysis).
Theoretical treatments were derived by Breiman et al. (2004).
We start by directly stating the general structure of the algorithm:
• set equal weights $w_i = I^{-1}$;
• for $m = 1, \dots, M$ do:
1. Find a learner $l_m$ that minimizes the weighted loss $\sum_{i=1}^{I} w_i L(l_m(\mathbf{x}_i), y_i)$;
2. Compute a learner weight
$$a_m = f_a(\mathbf{w}, l_m(\mathbf{x}), \mathbf{y}); \qquad (6.2)$$
3. Update the instance weights via a function $f_w$ (specified below).
Let us comment on the steps of the algorithm. The formulation holds for many variations
of Adaboost and we will specify the functions fa and fw below.
1. The first step seeks to find a learner (tree) lm that minimizes a weighted loss. Here
the base loss function L essentially depends on the task (regression versus classification).
2. The second and third steps are the most interesting because they are the heart of
Adaboost: they define the way the algorithm adapts sequentially. Because the purpose
is to aggregate models, a more sophisticated approach compared to uniform weights for
learners is a tailored weight for each learner. A natural property (for fa ) should be that a
learner that yields a smaller error should have a larger weight because it is more accurate.
3. The third step is to change the weights of observations. In this case, because the model
aims at improving the learning process, fw is constructed to give more weight on
observations for which the current model does not do a good job (i.e., generates the
largest errors). Hence, the next learner will be incentivized to pay more attention to
these pathological cases.
Let us comment on the original Adaboost specification. The basic error term $\epsilon_i = 1_{\{y_i \neq l_m(\mathbf{x}_i)\}}$
is a dummy number indicating if the prediction is correct (we recall that only two
values are possible, +1 and −1). The average error $\epsilon \in [0,1]$ is simply a weighted average
of individual errors, and the weight of the $m^{th}$ learner defined in Equation (6.2) is given
by $a_m = \log\left(\frac{1-\epsilon}{\epsilon}\right)$. The function $x \mapsto \log((1-x)x^{-1})$ decreases on $[0,1]$ and switches sign
(from positive to negative) at $x = 1/2$. Hence, when the average error is small, the learner
has a large positive weight, but when the error becomes large, the learner can even obtain
a negative weight. Indeed, the threshold $\epsilon > 1/2$ indicates that the learner is wrong more
than 50% of the time. Obviously, this indicates a problem and the learner should even be
discarded.
The change in instance weights follows a similar logic. The new weight is proportional to
$w_i\left(\frac{1-\epsilon}{\epsilon}\right)^{\epsilon_i}$. If the prediction is right and $\epsilon_i = 0$, the weight is unchanged. If the prediction
is wrong and $\epsilon_i = 1$, the weight is adjusted depending on the aggregate error $\epsilon$. If the
error is small and the learner efficient ($\epsilon < 1/2$), then $(1-\epsilon)/\epsilon > 1$ and the weight of the
instance increases. This means that for the next round, the learner will have to focus more
on instance i.
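As a toy numerical illustration of these two updates (the values below are made up for the example):

import numpy as np

w = np.array([0.25, 0.25, 0.25, 0.25])      # Current instance weights
errors = np.array([0, 1, 0, 0])             # eps_i: 1 when the learner is wrong
eps = np.sum(w * errors)                    # Weighted average error: 0.25
a_m = np.log((1 - eps) / eps)               # Learner weight: log(3) > 0
w = w * ((1 - eps) / eps)**errors           # Only the misclassified instance is upweighted
w = w / w.sum()                             # Re-normalization of the weights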
Lastly, the final prediction of the model corresponds to the sign of the weighted sums of
individual predictions: if the sum is positive, the model will predict +1 and it will yield
−1 otherwise.3 The odds of a zero sum are negligible. In the case of numerical labels, the
process is slightly more complicated and we refer to section 3, step 8 of Drucker (1997) for
more details on how to proceed.
We end this presentation with one word on instance weighting. There are two ways to deal
with this topic. The first one works at the level of the loss functions. For regression trees,
Equation (6.1) would naturally generalize to
$$V_N^{(k)}(c^{(k)}, \mathbf{w}) = \sum_{x_i^{(k)} < c^{(k)}} w_i\left(y_i - m_N^{k,-}(c^{(k)})\right)^2 + \sum_{x_i^{(k)} > c^{(k)}} w_i\left(y_i - m_N^{k,+}(c^{(k)})\right)^2,$$
and hence an instance with a large weight wi would contribute more to the dispersion of its
cluster. For classification objectives, the alteration is more complex, and we refer to Ting
(2002) for one example of an instance-weighted tree-growing algorithm. The idea is closely
3 The Real Adaboost of Friedman et al. (2000) has a different output: the probability of belonging to a
particular class.
linked to the alteration of the misclassification risk via a loss matrix (see section 9.2.4 in
Hastie et al. (2009)).
The second way to enforce instance weighting is via random sampling. If instances have
weights wi , then the training of learners can be performed over a sample that is randomly
extracted with distribution equal to wi . In this case, an instance with a larger weight
will have more chances to be represented in the training sample. The original Adaboost
algorithm relies on this method.
6.3.2 Illustration
Below, we test an implementation of the original Adaboost classifier. As such, we work
with the R1M_Usd_C variable and change the model formula. The computational cost of
Adaboost is high on large datasets, thus we work with a smaller sample and we only impose
three iterations.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

fit_adaboost_C = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),   # Base learner: tree of depth 3
    n_estimators=3)                        # Number of trees (boosting iterations)
fit_adaboost_C.fit(X_train, y_c_train)     # Fitting the model
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
n_estimators=3)
their implementation that we use in our empirical section. The other popular alternative is
lightgbm (see Ke et al. (2017)). What XGBoost seeks to minimize is the objective
$$O = \underbrace{\sum_{i=1}^{I} \text{loss}(y_i, \tilde{y}_i)}_{\text{error term}} + \underbrace{\sum_{j=1}^{J} \Omega(T_j)}_{\text{regularization term}}.$$
The first term (over all instances) measures the distance between the true label and the
output from the model. The second term (over all trees) penalizes models that are too
complex.
For simplicity, we propose the full derivation with the simplest loss function $\text{loss}(y, \tilde{y}) = (y - \tilde{y})^2$, so that:
$$O = \sum_{i=1}^{I}\left(y_i - m_{J-1}(\mathbf{x}_i) - T_J(\mathbf{x}_i)\right)^2 + \sum_{j=1}^{J}\Omega(T_j).$$
All terms known at step J (i.e., indexed by J − 1) vanish because they do not enter the
optimization scheme. They are embedded in the constant c.
Things are fairly simple with quadratic loss. For more complicated loss functions, Taylor
expansions are used (see the original paper).
6.4.2 Penalization
In order to go any further, we need to specify the way the penalization works. For a given
tree $T$, we specify its structure by $T(\mathbf{x}) = w_{q(\mathbf{x})}$, where $w$ is the output value of some leaf
and $q(\cdot)$ is the function that maps an input to its final leaf. This encoding is illustrated in
Figure 6.5. The function $q$ indicates the path, while the vector $\mathbf{w} = (w_l)$ codes the terminal
leaf values.
[Tree diagram: each split (e.g., Mkt_Cap_6M_Usd ≥ 0.025) is encoded by the path function q(·), and the terminal leaves carry the values w1, ..., w6.]
FIGURE 6.5: Coding a decision tree: decomposition between structure and node and leaf
values.
The penalization of one tree takes the form $\Omega(T) = \gamma L + \frac{\lambda}{2}\sum_{l=1}^{L} w_l^2$, where $L$ is the number of leaves (this expression reappears in the aggregation below):
• the first term penalises the number of leaves;
• the second term penalises the magnitude of output values (this helps reduce variance).
The first penalization term reduces the depth of the tree, while the second shrinks the size
of the adjustments that will come from the latest tree.
6.4.3 Aggregation
We aggregate both sections of the objective (loss and penalization). We write Il for the set
of the indices of the instances belonging to leaf l. Then,
$$O = 2\sum_{i=1}^{I}\left(-y_i T_J(\mathbf{x}_i) + m_{J-1}(\mathbf{x}_i)T_J(\mathbf{x}_i) + \frac{T_J(\mathbf{x}_i)^2}{2}\right) + \gamma L + \frac{\lambda}{2}\sum_{l=1}^{L} w_l^2$$
$$\quad = 2\sum_{i=1}^{I}\left(-y_i w_{q(\mathbf{x}_i)} + m_{J-1}(\mathbf{x}_i)w_{q(\mathbf{x}_i)} + \frac{w_{q(\mathbf{x}_i)}^2}{2}\right) + \gamma L + \frac{\lambda}{2}\sum_{l=1}^{L} w_l^2$$
$$\quad = 2\sum_{l=1}^{L}\left(w_l \sum_{i\in I_l}\left(-y_i + m_{J-1}(\mathbf{x}_i)\right) + \frac{w_l^2}{2}\left(\sum_{i\in I_l} 1 + \frac{\lambda}{2}\right)\right) + \gamma L$$
The function is of the form $aw_l + \frac{b}{2}w_l^2$, which has minimum value $-\frac{a^2}{2b}$ at point $w_l = -a/b$.
Thus, writing $\#\{\cdot\}$ for the cardinal function that counts the number of items in a set,
$$w_l^* = \frac{\sum_{i\in I_l}\left(y_i - m_{J-1}(\mathbf{x}_i)\right)}{1 + \frac{\lambda}{2}\#\{i \in I_l\}}, \quad \text{so that} \qquad (6.5)$$
$$O_L(q) = -\frac{1}{2}\sum_{l=1}^{L}\frac{\left(\sum_{i\in I_l}(y_i - m_{J-1}(\mathbf{x}_i))\right)^2}{1 + \frac{\lambda}{2}\#\{i\in I_l\}} + \gamma L,$$
where we added the dependence of the objective both in q (structure of tree) and L (number
of leaves). Indeed, the meta-shape of the tree remains to be determined.
• for each node, look at whether a split is useful (in terms of objective) or not:
$$\text{Gain} = \frac{1}{2}\left(\text{Gain}_L + \text{Gain}_R - \text{Gain}_O\right) - \gamma;$$
• each gain is computed with respect to the instances in each bucket (cluster):
$$\text{Gain}_\mathcal{X} = \frac{\left(\sum_{i\in I_\mathcal{X}}(y_i - m_{J-1}(\mathbf{x}_i))\right)^2}{\left(1 + \frac{\lambda}{2}\right)\#\{i\in I_\mathcal{X}\}}, \quad \text{where } I_\mathcal{X} \text{ is the set of instances within cluster } \mathcal{X}.$$
$\text{Gain}_O$ is the original gain (no split) and $\text{Gain}_L$ and $\text{Gain}_R$ are the gains of the left and
right clusters, respectively. One word about the $-\gamma$ adjustment in the above formula: there
is one unit of new leaves (two new minus one old)! This makes a one-leaf difference; hence,
$\Delta L = 1$ and the penalization intensity for each new leaf is equal to $\gamma$.
Lastly, we underline the fact that XGBoost also applies a learning rate: each new tree is
scaled by a factor η, with η ∈ (0, 1]. After each step of boosting the new tree TJ sees its
values discounted by multiplying them by η. This is very useful because a pure aggregation
of 100 optimized trees is the best way to overfit the training sample.
6.4.5 Extensions
Several additional features are available to further prevent boosted trees to overfit. Indeed,
given a sufficiently large number of trees, the aggregation is able to match the training
sample very well, but may fail to generalize well out-of-sample.
Following the pioneering work of Srivastava et al. (2014), the DART (Dropout for Additive
Regression Trees) model was proposed by Rashmi and Gilad-Bachrach (2015). The idea
is to omit a specified number of trees during training. The trees that are removed from
the model are chosen randomly. The full specifications can be found at https://xgboost.
readthedocs.io/en/latest/tutorials/dart.html and we use a 10% dropout in the first
example below.
Monotonicity constraints are another element that is featured both in XGB and lightgbm.
Sometimes, it is expected that one particular feature has a monotonic impact on the la-
bel. For instance, if one deeply believes in momentum, then past returns should have an
increasing impact on future returns (in the cross-section of stocks).
Given the recursive nature of the splitting algorithm, it is possible to choose when to perform
a split (according to a particular variable) and when not to. In Figure 6.6, we show how
the algorithm proceeds. All splits are performed according to the same feature. For the
first split, things are easy because it suffices to verify that the averages of each cluster are
ranked in the right direction. Things are more complicated for the splits that occur below.
Indeed, the average values set by all above splits matter, as they give bounds for acceptable
values for the future average values in lower splits. If a split violates these bounds, then it
is overlooked and another variable will be chosen instead.
[Diagram: a split on feature < a versus feature > a; the aggregate mean is 0 at the root and −1 / +1 in the two children.]
FIGURE 6.6: Imposing monotonic constraints. The constraints are shown in bold blue in
the bottom leaves.
The second (optional) step is to determine the monotonicity constraints that we want to
impose. For simplicity, we will only enforce three constraints on
1. market capitalization (negative, because large firms have smaller returns under the size
anomaly);
2. price-to-book ratio (negative, because overvalued firms also have smaller returns under
the value anomaly);
3. past annual returns (positive, because winners outperform losers under the momentum
anomaly).
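The chunk that builds the corresponding constraint vector mono_const (used in the parameters below) is not shown. A possible sketch, in which the three feature names are assumptions based on the dataset's naming conventions, is:

# -1: decreasing impact, +1: increasing impact, 0: no constraint
constraints = {'Mkt_Cap_6M_Usd': -1,    # Size anomaly (name assumed)
               'Pb': -1,                # Value anomaly (name assumed)
               'Mom_11M_Usd': 1}        # Momentum anomaly (name assumed)
mono_const = '(' + ','.join(str(constraints.get(f, 0)) for f in features_short) + ')'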
The third step is to train the model on the formatted training data. We include the mono-
tonicity constraints and the DART feature (via rate_drop). Just like random forests, boosted
trees can grow individual trees on subsets of the data: both row-wise (by selecting random
instances) and column-wise (by keeping a smaller portion of predictors). These options
are implemented below with the subsample and colsample_bytree in the arguments of the
function.
params = {'booster' : 'dart',                  # DART booster (needed for rate_drop)
          'eta' : 0.3,                         # Learning rate
          'objective' : 'reg:squarederror',    # Objective function
          'max_depth' : 4,                     # Maximum depth of trees
          'subsample' : 0.6,                   # Train on random 60% of sample
          'colsample_bytree' : 0.7,            # Train on random 70% of predictors
          'lambda' : 1,                        # Penalisation of leaf values
          'gamma' : 0.1,                       # Penalisation of number of leaves
          'monotone_constraints' : mono_const, # Monotonicity constraints
          'rate_drop' : 0.1,                   # Drop rate for DART
          'verbosity' : 0}                     # No comment from the algo
fit_xgb = xgb.train(params, train_matrix_xgb,
                    num_boost_round=30)        # Number of trees used
Finally, we evaluate the performance of the model. Note that before that, a proper format-
ting of the testing sample is required.
test_features_xgb = testing_sample[features_short]               # Test sample => XGB format
test_matrix_xgb = xgb.DMatrix(test_features_xgb, label=y_test)   # XGB format!
pred_xgb = fit_xgb.predict(test_matrix_xgb)   # Predictions on the test sample
mse = np.mean((pred_xgb - y_test)**2)         # Mean squared error
print(f'MSE: {mse}')
MSE: 0.03781719994386558
hitratio = np.mean(fit_xgb.predict(test_matrix_xgb)*y_test>0)
print(f'Hit Ratio: {hitratio}')
train_label_xgb_C = training_sample.loc[boolean_quantile, 'R1M_Usd_C']   # Dependent variable
train_matrix_xgb_C = xgb.DMatrix(train_features_xgb,
                                 label=train_label_xgb_C)                # XGB format!
When working with categories, the loss function is usually the softmax function (see Section
1.1).
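The chunk that trains the classification model fit_xgb_C is not shown; a minimal sketch (the number of classes and the number of boosting rounds are assumptions) could be:

params_C = {'eta' : 0.3,                      # Learning rate
            'objective' : 'multi:softmax',    # Softmax loss for categorical labels
            'num_class' : 3,                  # Number of categories (assumed)
            'max_depth' : 4}                  # Maximum depth of trees
fit_xgb_C = xgb.train(params_C, train_matrix_xgb_C,
                      num_boost_round=10)     # Number of trees (assumed)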
We can then proceed to the assessment of the quality of the model. We adjust the prediction
to the value of the true label and count the proportion of accurate forecasts.
hitratio = np.mean(fit_xgb_C.predict(test_matrix_xgb)==y_c_test)
print(f'Hit Ratio: {hitratio}')
In factor investing, these weights can very well depend on the feature values ($W_i = W_i(\mathbf{x}_i)$).
For instance, for one particular characteristic $x^k$, weights can be increasing, thereby giving
more importance to assets with high values of this characteristic (e.g., value stocks are
favored compared to growth stocks). One other option is to increase weights when the
values of the characteristic become more extreme (deep value and deep growth stocks have
larger weights). If the features are uniform, the weights can simply be $W_i(x_i^k) \propto |x_i^k - 0.5|$:
firms with median value 0.5 have zero weight, and as the feature value shifts towards 0 or
1, the weight increases. Specifying weights on instances biases the learning process just like
views introduced à la Black and Litterman (1992) influence the asset allocation process.
The difference is that the nudge is performed well ahead of the portfolio choice problem.
In XGB, the implementation of instance weighting is done very early, in the definition of the
xgb.DMatrix:
inst_weights = np.random.uniform(0, 1, (train_features_xgb.shape[0], 1))   # Random weights
train_matrix_xgb = xgb.DMatrix(train_features_xgb,
                               label=train_label_xgb,   # XGB format!
                               weight=inst_weights)     # Weights!
Then, in the subsequent stages, the optimization will be performed with these hard-coded
weights. The splitting points can be altered (via the total weighted loss in clusters) and the
terminal weight values (6.5) are also impacted.
6.5 Discussion
We end this chapter by a discussion on the choice of predictive engine with a view towards
portfolio construction. As recalled in Chapter 2, the ML signal is just one building stage of
construction of the investment strategy. At some point, this signal must be translated into
portfolio weights.
From this perspective, simple trees appear suboptimal. Tree depths are usually set between 3
and 6. This implies between 8 and 64 terminal leaves at most, with possibly very unbalanced
clusters. The likelihood of having one cluster with 20% to 30% of the sample is high. This
means that when it comes to predictions, roughly 20% to 30% of the instances will be given
the same value.
On the other side of the process, portfolio policies commonly have a fixed number of assets.
Thus, having assets with equal signals does not make it possible to discriminate among them and select a subset
to be included in the portfolio. For instance, if the policy requires exactly 100 stocks, and
96 6 Tree-based methods
105 stocks have the same signal, the signal cannot be used for selection purposes. It would
have to be combined with exogenous information such as the covariance matrix in a mean-
variance type allocation.
Overall, this is one reason to prefer aggregate models. When the number of learners is
sufficiently large (5 is almost enough), the predictions for assets will be unique and tailored
to these assets. It then becomes possible to discriminate via the signal and select only those
assets that have the most favorable signal. In practice, random forests and boosted trees
are probably the best choices.
Neural networks (NNs) are an immensely rich and complicated topic. In this chapter, we
introduce the simple ideas and concepts behind the most simple architectures of NNs. For
more exhaustive treatments on NN idiosyncrasies, we refer to the monographs by Haykin
(2009), Du and Swamy (2013) and Goodfellow et al. (2016). The latter is available freely
online: www.deeplearningbook.org. For a practical introduction, we recommend the great
book of Chollet (2017).
For starters, we briefly comment on the qualification “neural network”. Most experts agree
that the term is not very well chosen, as NNs have little to do with how the human brain
works (of which we know not that much). This explains why they are often referred to as
“artificial neural networks” - we do not use the adjective for notational simplicity. Because
we consider it more appropriate, we recall the definition of NNs given by François Chollet:
“chains of differentiable, parameterised geometric functions, trained with gradient descent
(with gradients obtained via the chain rule)”.
Early references of neural networks in finance are Bansal and Viswanathan (1993) and
Eakins et al. (1998). Both have very different goals. In the first one, the authors aim to
estimate a non-linear form for the pricing kernel. In the second one, the purpose is
to identify and quantify relationships between institutional investments in stocks and the
attributes of the firms (an early contribution towards factor investing). An early review
(Burrell and Folarin (1997)) lists financial applications of NNs during the 1990s. More
recently, Sezer et al. (2019), Jiang (2020), and Lim and Zohren (2020) survey the attempts
to forecast financial time series with deep-learning models, mainly by computer science
scholars.
The pure predictive ability of NNs in financial markets is a popular subject and we further
cite for example Kimoto et al. (1990), Enke and Thawornwong (2005), Zhang and Wu
(2009), Guresen et al. (2011), Krauss et al. (2017), Fischer and Krauss (2018), Aldridge and
Avellaneda (2019), and Soleymani and Paquet (2020).1 The last reference even combines
several types of NNs embedded inside an overarching reinforcement learning structure. This
list is very far from exhaustive. In the field of financial economics, recent research on neural
networks includes:
• Feng et al. (2019) use neural networks to find factors that are the best at explaining
the cross-section of stock returns.
• Gu et al. (2020) map firm attributes and macro-economic variables into future returns.
This creates a strong predictive tool that is able to forecast future returns very accurately.
1 Neural networks have also been recently applied to derivatives pricing and hedging, see for instance the work of Buehler et al. (2019) and Andersson and Oosterlee (2020) and the survey by Ruf and Wang (2019). Limit order book modelling is also an expanding field for neural network applications (Sirignano and Cont (2019), Wallbridge (2020)).
• Chen et al. (2020) estimate the pricing kernel with a complex neural network struc-
ture including a generative adversarial network. This again gives crucial information on
the structure of expected stock returns and can be used for portfolio construction (by
building an accurate maximum Sharpe ratio policy).
$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x}'\mathbf{w} + b > 0 \\ 0 & \text{otherwise} \end{cases}$$
The vector of weights $\mathbf{w}$ scales the variables, and the bias $b$ shifts the decision barrier. Given
values for $b$ and the $w_j$, the error is $\epsilon_i = y_i - 1_{\{\sum_{j=1}^{J} x_{i,j}w_j + w_0 > 0\}}$. As is customary, we set $b = w_0$
and add an initial constant column to $\mathbf{x}$: $x_{i,0} = 1$, so that $\epsilon_i = y_i - 1_{\{\sum_{j=0}^{J} x_{i,j}w_j > 0\}}$. In
contrast to regressions, perceptrons do not have closed-form solutions. The optimal weights
can only be approximated. Just like for regression, one way to derive good weights is to
minimize the sum of squared errors. To this purpose, the simplest way to proceed is to
1. compute the current model value at point $\mathbf{x}_i$: $\tilde{y}_i = 1_{\{\sum_{j=0}^{J} w_j x_{i,j} > 0\}}$, and
2. adjust the weight vector: $w_j \leftarrow w_j + \eta(y_i - \tilde{y}_i)x_{i,j}$,
which amounts to shifting the weights in the right direction. Just like for tree methods, the
scaling factor η is the learning rate. A large η will imply large shifts: learning will be rapid,
but convergence may be slow or may even not occur. A small η is usually preferable, as it
helps reduce the risk of overfitting.
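As a minimal illustration of this updating rule (a sketch only, with 0/1 labels and made-up defaults):

import numpy as np

def train_perceptron(X, y, eta=0.1, n_epochs=10):
    X = np.column_stack([np.ones(len(X)), X])    # Add the constant column (bias = w_0)
    w = np.zeros(X.shape[1])                     # Initial weights
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_tilde = float(x_i @ w > 0)         # Current model value
            w += eta * (y_i - y_tilde) * x_i     # Weight adjustment
    return w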
In Figure 7.1, we illustrate this mechanism. The initial model (dashed grey line) was trained
on 7 points (3 red and 4 blue). A new black point comes in.
• if the point is red, there is no need for adjustment: it is labelled correctly as it lies on
the right side of the border.
• if the point is blue, then the model needs to be updated appropriately. Given the rule
mentioned above, this means adjusting the slope of the line downwards. Depending on
η, the shift will be sufficient to change the classification of the new point - or not.
At the time of its inception, the perceptron was an immense breakthrough which received an
intense media coverage (see Olazaran (1996) and Anderson and Rosenfeld (2000)). Its rather
simple structure was progressively generalized to networks (combinations) of perceptrons.
Each one of them is a simple unit, and units are gathered into layers. The next section
describes the organization of simple multilayer perceptrons (MLPs).
FIGURE 7.1: Scheme of a perceptron.
[Figure: scheme of a multilayer perceptron. Each unit k of layer l computes a linear mapping $v_{i,k}^{(l)} = (\mathbf{o}_i^{(l-1)})'\mathbf{w}_k^{(l)} + b_k^{(l)}$ (with $\mathbf{o}_i^{(0)} = \mathbf{x}_i$), applies a nonlinear transform $o_{i,k}^{(l)} = f^{(l)}(v_{i,k}^{(l)})$, and the final outputs are aggregated via a linear mapping with a possible posterior activation.]
Before we proceed with comments, we introduce some notation that will be used throughout
the chapter.
• The data is separated into a matrix $\mathbf{X} = x_{i,j}$ of features and a vector of output values
$\mathbf{y} = y_i$. $\mathbf{x}$ or $\mathbf{x}_i$ denotes one line of $\mathbf{X}$.
• A neural network will have $L \ge 1$ layers and for each layer $l$, the number of units is
$U_l \ge 1$.
• The weights for unit $k$ located in layer $l$ are denoted $\mathbf{w}_k^{(l)} = w_{k,j}^{(l)}$ and the corresponding
biases $b_k^{(l)}$. The length of $\mathbf{w}_k^{(l)}$ is equal to $U_{l-1}$. $k$ refers to the location of the unit in
layer $l$ while $j$ refers to the unit in layer $l-1$.
• Outputs (post-activation) are denoted $o_{i,k}^{(l)}$ for instance $i$, layer $l$ and unit $k$.
The process is the following. When entering the network, the data goes through the initial
linear mapping
$$v_{i,k}^{(1)} = \mathbf{x}_i'\mathbf{w}_k^{(1)} + b_k^{(1)},$$
which is then transformed by a non-linear function $f^{(1)}$. The result of this alteration is then
given as input of the next layer and so on. The linear forms will be repeated (with different
weights) for each layer of the network:
$$v_{i,k}^{(l)} = (\mathbf{o}_i^{(l-1)})'\mathbf{w}_k^{(l)} + b_k^{(l)}.$$
The connections between the layers are the so-called outputs, which are basically the linear
mappings to which the activation functions f (l) have been applied. The output of layer l is
the input of layer l + 1.
$$o_{i,k}^{(l)} = f^{(l)}\left(v_{i,k}^{(l)}\right).$$
Finally, the terminal stage aggregates the outputs from the last layer:
$$\tilde{y}_i = f^{(L+1)}\left((\mathbf{o}_i^{(L)})'\mathbf{w}^{(L+1)} + b^{(L+1)}\right).$$
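As a minimal illustration (not a library implementation), the chain of linear mappings and activations can be coded as follows; the weight matrices, biases and activation functions are assumed to be given:

import numpy as np

def forward_pass(x, weights, biases, activations):
    # weights[l] has shape (U_{l-1}, U_l); activations[l] is the function f^{(l)}
    o = x
    for W, b, f in zip(weights, biases, activations):
        o = f(o @ W + b)        # Linear mapping followed by the activation
    return o                    # Output of the last (terminal) stage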
In the forward-propagation of the input, the activation function naturally plays an important
role. In Figure 7.4, we plot the most usual activation functions used by neural network
libraries.
[Figure 7.4: the usual activation functions (Heaviside, identity, ReLU, sigmoid, tanh) plotted over $x \in [-1, 1]$.]
Let us rephrase the process through the lens of factor investing. The inputs $\mathbf{x}$ are the charac-
teristics of the firms. The first step is to multiply their values by weights and add a bias. This
is performed for all the units of the first layer. The output, which is a linear combination of
the input is then transformed by the activation function. Each unit provides one value, and
all of these values are fed to the second layer following the same process. This is iterated
until the end of the network. The purpose of the last layer is to yield an output shape that
corresponds to the label: if the label is numerical, the output is a single number; if it is
categorical, then usually it is a vector with length equal to the number of categories. This
vector indicates the probability that the value belongs to one particular category.
It is possible to use a final activation function after the output. This can have a huge
importance on the result. Indeed, if the labels are returns, applying a sigmoid function at
the very end will be disastrous because the sigmoid is always positive.
$$f_n(\mathbf{x}) = \sum_{l=1}^{n} c_l\,\phi(\mathbf{x}\mathbf{w}_l + b_l) + c_0,$$
where $\phi$ is a (non-constant) bounded continuous function. Then, for any $\epsilon > 0$, it is possible
to find one $n$ such that for any continuous function $f$ on the unit hypercube $[0,1]^d$,
$$|f(\mathbf{x}) - f_n(\mathbf{x})| < \epsilon.$$
This result is rather intuitive: it suffices to add units to the layer to improve the fit. The
process is more or less analogous to polynomial approximation, though some subtleties
arise depending on the properties of the activation functions (boundedness, smoothness,
convexity, etc.). We refer to Costarelli et al. (2016) for a survey on this topic.
The raw results on universal approximation imply that any well-behaved function f can be
approached sufficiently closely by a simple neural network, as long as the number of units
can be arbitrarily large. Now, they do not directly relate to the learning phase, i.e., when
the model is optimized with respect to a particular dataset. In a series of papers (Barron
(1993) and Barron (1994), notably), Barron gives a much more precise characterization of
what neural networks can achieve. Barron (1993), for instance, proves a more precise
version of universal approximation: for particular neural networks (with sigmoid activation),
$\mathbb{E}[(f(\mathbf{x}) - f_n(\mathbf{x}))^2] \le c_f/n$, which gives a speed of convergence related to the size of
the network. In the expectation, the random term is x: this corresponds to the case where
the data is considered to be a sample of i.i.d. observations of a fixed distribution (this
is the most common assumption in machine learning).
Below, we state one important result that is easy to interpret; it is taken from Barron
(1994).
In the sequel, fn corresponds to a possibly penalized neural network with only one inter-
mediate layer with n units and sigmoid activation function. Moreover, both supports of the
predictors and the label are assumed to be bounded (which is not a major constraint). The
most important metric in a regression exercise is the mean squared error (MSE) and the
main result is a bound (in order of magnitude) on this quantity. For N randomly sampled
i.i.d. points $y_i = f(\mathbf{x}_i) + \epsilon_i$ on which $f_n$ is trained, the best possible empirical MSE behaves
like
$$\mathbb{E}\left[(f(\mathbf{x}) - f_n(\mathbf{x}))^2\right] = \underbrace{O\left(\frac{c_f}{n}\right)}_{\text{size of network}} + \underbrace{O\left(\frac{nK\log(N)}{N}\right)}_{\text{size of sample}}, \qquad (7.1)$$
where K is the dimension of the input (number of columns) and cf is a constant that
depends on the generator function f . The above quantity provides a bound on the error
that can be achieved by the best possible neural network given a dataset of size N .
There are clearly two components in the decomposition of this bound. The first one pertains
to the complexity of the network. Just as in the original universal approximation theorem,
the error decreases with the number of units in the network. But this is not enough! Indeed,
7.2 Multilayer perceptron 103
the sample size is of course a key driver in the quality of learning (of i.i.d. observations).
The second component of the bound indicates that the error decreases at a slightly slower
pace with respect to the number of observations (log(N )/N ) and is linear in the number of
units and the size of the input. This clearly underlines the link (trade-off?) between sample
size and model complexity: having a very complex model is useless if the sample is small,
just like a simple model will not catch the fine relationships in a large dataset.
Overall, a neural network is a possibly very complicated function with a lot of parameters. In
linear regressions, it is possible to increase the fit by spuriously adding exogenous variables.
In neural networks, it suffices to increase the number of parameters by arbitrarily adding
units to the layer(s). This is of course a very bad idea because high-dimensional networks
will mostly capture the particularities of the sample they are trained on.
The objective aggregates a loss over all training instances, $O = \sum_{i=1}^{I} \text{loss}(y_i, \tilde{y}_i)$, where $\tilde{y}_i$ are the values obtained by the model and $y_i$ are the true values of the instances.
A simple requirement that eases computation is that the loss function be differentiable.
The most common choices are the squared error for regression tasks and cross-entropy for
classification tasks. We discuss the technicalities of classification in the next subsection.
The training of a neural network amounts to alter the weights (and biases) of all units in
all layers so that O defined above is the smallest possible. To ease the notation and given
that the yi are fixed, let us write D(ỹi (W)) = loss(yi , ỹi ), where W denotes the entirety
of weights and biases in the network. The updating of the weights will be performed via
gradient descent, i.e., via
$$\mathbf{W} \leftarrow \mathbf{W} - \eta\,\frac{\partial D(\tilde{y}_i)}{\partial \mathbf{W}}. \qquad (7.2)$$
This mechanism is the most classical in the optimization literature, and we illustrate it in
Figure 7.5. We highlight the possible suboptimality of large learning rates. In the diagram,
the descent associated with the high η will oscillate around the optimal point, whereas the
one related to the small $\eta$ will converge more directly.
The complicated task in the above equation is to compute the gradient (derivative) which
tells in which direction the adjustment should be done. The problem is that the succes-
sive nested layers and associated activations require many iterations of the chain rule for
differentiation.
[Figure 7.5: gradient descent of the loss with respect to a weight: starting from a point with negative derivative, a large learning rate oscillates around the minimum while a small learning rate converges more directly.]
The most common way to approximate a derivative is probably the finite difference method.
Under the usual assumptions (the loss is twice differentiable), the centered difference satisfies
$$\frac{\partial D(w)}{\partial w} \approx \frac{D(w+\epsilon) - D(w-\epsilon)}{2\epsilon} \quad \text{(with an error of order } \epsilon^2\text{),}$$
so that if we differentiate with the most immediate weights and biases, we get:
$$\frac{\partial D(\tilde{y}_i)}{\partial w_k^{(L+1)}} = D'(\tilde{y}_i)\left(f^{(L+1)}\right)'\left(b^{(L+1)} + \sum_{k=1}^{U_L} w_k^{(L+1)} o_{i,k}^{(L)}\right) o_{i,k}^{(L)} \qquad (7.3)$$
$$\qquad\qquad = D'(\tilde{y}_i)\left(f^{(L+1)}\right)'\left(v_{i,k}^{(L+1)}\right) o_{i,k}^{(L)} \qquad (7.4)$$
$$\frac{\partial D(\tilde{y}_i)}{\partial b^{(L+1)}} = D'(\tilde{y}_i)\left(f^{(L+1)}\right)'\left(b^{(L+1)} + \sum_{k=1}^{U_L} w_k^{(L+1)} o_{i,k}^{(L)}\right). \qquad (7.5)$$
This is the easiest part. We must now go back one layer, and this can only be done via
the chain rule. To access layer $L$, we recall the identity $v_{i,k}^{(L)} = (\mathbf{o}_i^{(L-1)})'\mathbf{w}_k^{(L)} + b_k^{(L)} = b_k^{(L)} + \sum_{j=1}^{U_{L-1}} o_{i,j}^{(L-1)} w_{k,j}^{(L)}$. We can then proceed:
$$\frac{\partial D(\tilde{y}_i)}{\partial w_{k,j}^{(L)}} = \frac{\partial D(\tilde{y}_i)}{\partial v_{i,k}^{(L)}}\frac{\partial v_{i,k}^{(L)}}{\partial w_{k,j}^{(L)}} = \frac{\partial D(\tilde{y}_i)}{\partial v_{i,k}^{(L)}}\, o_{i,j}^{(L-1)} \qquad (7.6)$$
$$\qquad = \frac{\partial D(\tilde{y}_i)}{\partial o_{i,k}^{(L)}}\frac{\partial o_{i,k}^{(L)}}{\partial v_{i,k}^{(L)}}\, o_{i,j}^{(L-1)} = \frac{\partial D(\tilde{y}_i)}{\partial o_{i,k}^{(L)}}\,(f^{(L)})'(v_{i,k}^{(L)})\, o_{i,j}^{(L-1)} \qquad (7.7)$$
$$\qquad = \underbrace{D'(\tilde{y}_i)\left(f^{(L+1)}\right)'\left(v_{i,k}^{(L+1)}\right) w_k^{(L+1)}}_{\text{computed above!}}\,(f^{(L)})'(v_{i,k}^{(L)})\, o_{i,j}^{(L-1)}, \qquad (7.8)$$
where, as we show in the last line, one part of the derivative was already computed in the
previous step (Equation (7.4)). Hence, we can recycle this number and only focus on the
right part of the expression.
The magic of the so-called back-propagation is that this will hold true for each step of the
differentiation. When computing the gradient for weights and biases in layer l, there will
be two parts: one that can be recycled from previous layers and another, local part, that
depends only on the values and activation function of the current layer. A nice illustration
of this process is given by the Google developer team: playground.tensorflow.org.
When the data is formatted using tensors, it is possible to resort to vectorization so that
the number of calls is limited to an order of the magnitude of the number of nodes (units)
in the network.
The back-propagation algorithm can be summarized as follows. Given a sample of points
(possibly just one):
1. the data flows from left as is described in Figure 7.6. The blue arrows show the forward
pass;
2. this allows the computation of the error or loss function;
3. all derivatives of this function (w.r.t. weights and biases) are computed, starting from
the last layer and diffusing to the left (hence the term back-propagation) - the green
arrows show the backward pass;
4. all weights and biases can be updated to take the sample points into account (the model
is adjusted to reduce the loss/error stemming from these points).
[Figure 7.6: the data flows forward through the model and yields a prediction $\tilde{y}_i$; from the prediction we get the error $e_i = y_i - \tilde{y}_i$; from the error we compute the weight updates $\partial D(\tilde{y}_i)/\partial \mathbf{W}$, which propagate backward through the layers.]
This operation can be performed any number of times with different sample sizes. We discuss
this issue in Section 7.3.
The learning rate η can be refined. One option to reduce overfitting is to impose that
after each epoch, the intensity of the update decreases. One possible parametric form is
$\eta = \alpha e^{-\beta t}$, where $t$ is the epoch and $\alpha, \beta > 0$. One further sophistication is to resort to
so-called momentum (which originates from Polyak (1964)):
$$\mathbf{W}_{t+1} \leftarrow \mathbf{W}_t - \mathbf{m}_t \quad \text{with} \quad \mathbf{m}_t \leftarrow \eta\,\frac{\partial D(\tilde{y}_i)}{\partial \mathbf{W}_t} + \gamma\,\mathbf{m}_{t-1}, \qquad (7.9)$$
where t is the index of the weight update. The idea of momentum is to speed up the
convergence by including a memory term of the last adjustment (mt−1 ) and going in the
same direction in the current update. The parameter γ is often taken to be 0.9.
More complex and enhanced methods have progressively been developed:
• Nesterov (1983) improves the momentum term by forecasting the future shift in
parameters;
• Adagrad (Duchi et al. (2011)) uses a different learning rate for each parameter;
• Adadelta (Zeiler (2012)) and Adam (Kingma and Ba (2014)) combine the ideas of Ada-
grad and momentum.
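In Keras, these weight-updating schemes are available off the shelf; the snippet below simply instantiates them for illustration (the learning rates and decay values are arbitrary, not recommendations from the book).

from tensorflow import keras

opt_momentum = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)                 # Polyak momentum
opt_nesterov = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)  # Nesterov variant
opt_adagrad  = keras.optimizers.Adagrad(learning_rate=0.01)                           # one rate per parameter
opt_adam     = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)   # combines both ideas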
Lastly, in some degenerate case, some gradients may explode and push weights far from their
optimal values. In order to avoid this phenomenon, learning libraries implement gradient
clipping. The user specifies a maximum magnitude for gradients, usually expressed as a
norm. Whenever the gradient surpasses this magnitude, it is rescaled to reach the authorized
threshold. Thus, the direction remains the same, but the adjustment is smaller.
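Gradient clipping is also a simple optimizer option in Keras; the thresholds below (a maximum gradient norm of 1, or a maximum absolute value of 0.5 per component) are purely illustrative.

from tensorflow import keras

opt_clip_norm  = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)    # rescale when the gradient norm exceeds 1
opt_clip_value = keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)   # clip each gradient component at +/- 0.5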
$$\tilde{y}_i = s(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^{J} e^{x_j}}.$$
The justification of this choice is straightforward: the inputs can take any value (over the real line), and the (finite-valued) outputs are positive and sum to one. Similarly as for trees, this yields a ‘probability’ vector over the classes. Often, the chosen loss is a generalization of the
entropy used for trees. Given the target label yi = (yi,1 , . . . , yi,L ) = (0, 0, . . . , 0, 1, 0, . . . , 0)
, the cross-entropy loss is defined as
$$CE(\mathbf{y}_i, \tilde{\mathbf{y}}_i) = -\sum_{j=1}^{J} \log(\tilde{y}_{i,j})\, y_{i,j}. \qquad (7.10)$$
Basically, it is a proxy of the dissimilarity between its two arguments. One simple interpretation is the following. For the non-zero label value, the loss is $-\log(\tilde{y}_{i,l})$, while for all others, it is zero. Because of the logarithm, the loss is minimal when $\tilde{y}_{i,l} = 1$, which is exactly what we seek (i.e., $y_{i,l} = \tilde{y}_{i,l}$). In applications, this best-case scenario will not happen, and the loss simply increases as $\tilde{y}_{i,l}$ drifts downwards, away from one.
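A quick numerical illustration of the softmax output and of the cross-entropy loss (the values below are made up):

import numpy as np

logits = np.array([1.2, -0.3, 0.4])               # raw outputs of the last layer
y_tilde = np.exp(logits) / np.exp(logits).sum()   # softmax: positive values that sum to one
y = np.array([1.0, 0.0, 0.0])                     # one-hot label (the true class is the first one)
ce = -np.sum(y * np.log(y_tilde))                 # cross-entropy: only the true class contributes
print(y_tilde.round(3), ce.round(3))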
For a regression task (one output unit), the total number of parameters of the network is
$$N = \sum_{l=1}^{L} (U_{l-1} + 1)U_l + U_L + 1.$$
As in any model, the number of parameters should be much smaller than the number of
instances. There is no fixed ratio, but it is preferable if the sample size is at least ten times
larger than the number of parameters. Below a ratio of 5, the risk of overfitting is high.
Given the amount of data readily available, this constraint is seldom an issue, unless one
wishes to work with a very large network.
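The formula is easy to verify in code. The small helper below is a sketch (not from the book); the layer sizes are those of the model trained later in this chapter (93 features and two hidden layers of 16 and 8 units).

def nb_params(n_features, hidden_units):
    """Number of parameters of a perceptron with a single output unit."""
    sizes = [n_features] + list(hidden_units)
    n = sum((sizes[l - 1] + 1) * sizes[l] for l in range(1, len(sizes)))   # hidden layers
    return n + sizes[-1] + 1                                               # output layer

print(nb_params(93, [16, 8]))   # 1504 + 136 + 9 = 1649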
The number of hidden layers in current financial applications rarely exceeds three or four.
The number of units per layer (Uk ) is often chosen to follow the geometric pyramid rule
(see, e.g., Masters (1993)). If there are L hidden layers, with I features in the input and
O dimensions in the output (for regression tasks, O = 1), then, for the k th layer, a rule of
thumb for the number of units is
$$U_k \approx \left\lfloor O\left(\frac{I}{O}\right)^{\frac{L+1-k}{L+1}} \right\rfloor.$$
If there is only one intermediate layer, the recommended proxy is the integer part of $\sqrt{IO}$. If not, the network starts with many units, and the number of units decreases exponentially towards the output size. Often, the numbers of units are powers of two because, in high dimensions, networks are trained on Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs). Both pieces of hardware can be used optimally when the inputs have sizes equal to powers of two.
Several studies have shown that very large architectures do not always perform better than shallower ones (e.g., Gu et al. (2020), and Orimoloye et al. (2019) for high-frequency data, i.e., not factor-based). As a rule of thumb, a maximum of three hidden layers seems to be sufficient for prediction purposes.
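As an illustration of the geometric pyramid rule, the sketch below (not from the book) computes the suggested number of units for three hidden layers with 93 input features and a one-dimensional output.

import numpy as np

def pyramid_units(I, O, L):
    """Geometric pyramid rule: suggested units for hidden layers k = 1, ..., L."""
    return [int(np.floor(O * (I / O) ** ((L + 1 - k) / (L + 1)))) for k in range(1, L + 1)]

print(pyramid_units(I=93, O=1, L=3))   # [29, 9, 3]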
different values to evaluate the learning speed. In the examples below, we keep the number
of epochs low for computational purposes.
$$O = \sum_{i=1}^{I} \text{loss}(y_i, \tilde{y}_i) + \sum_k \lambda_k \|\mathbf{W}_k\|_1 + \sum_j \delta_j \|\mathbf{W}_j\|_2^2,$$
where the subscripts k and j pertain to the weights to which the L1 and (or) L2 penalization
is applied.
In addition, specific constraints can be enforced on the weights directly during the training.
Typically, two types of constraints are used:
• norm constraints: a maximum norm is fixed for the weight vectors or matrices;
• non-negativity constraint: all weights must be positive or zero.
Lastly, another (somewhat exotic) way to reduce the risk of overfitting is simply to reduce
the size (number of parameters) of the model. Srivastava et al. (2014) propose to omit units
during training (hence the term ‘dropout’). The weights of randomly chosen units are set
to zero during training. All links from and to the unit are ignored, which mechanically
shrinks the network. In the testing phase, all units are back, but the values (weights) must
be scaled to account for the missing activations during the training phase.
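In Keras, penalization, weight constraints, and dropout are all specified at the layer level. The snippet below is a generic illustration (the penalization intensities, the maximum norm, and the dropout rate are arbitrary values); a concrete use within a full model appears in the classification example later in this chapter.

from tensorflow.keras import layers, regularizers, constraints

penalized = layers.Dense(8, activation="tanh",
                         kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01),  # L1 and L2 penalties
                         kernel_constraint=constraints.MaxNorm(3))                 # norm constraint on the weights
non_negative = layers.Dense(8, activation="tanh",
                            kernel_constraint=constraints.NonNeg())                # non-negative weights
dropout = layers.Dropout(0.25)   # randomly omits 25% of the previous layer's units during training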
The interested reader can check the advice compiled in Bengio (2012) and Smith (2018)
for further tips on how to configure neural networks. A paper dedicated to hyperparameter
tuning for stock return prediction is Lee (2020).
model = keras.Sequential()   # (first two lines reconstructed; the activation of the first layer is assumed)
model.add(layers.Dense(16, activation="relu", input_shape=(len(features),)))
model.add(layers.Dense(8, activation="tanh"))
model.add(layers.Dense(1))
The definition of the structure is very intuitive and uses the sequential syntax in which one
input is iteratively transformed by a layer until the last iteration which gives the output.
Each layer depends on two parameters: the number of units and the activation function that
is applied to the output of the layer. One important point is the input_shape parameter for
the first layer. It is required for the first layer and is equal to the number of features. For
the subsequent layers, the input_shape is dictated by the number of units of the previous
layer; hence it is not required. The activations that are currently available are listed at https://keras.io/activations/. We use the hyperbolic tangent in the second-to-last
layer because it yields both positive and negative outputs. Of course, the last layer can
generate negative values as well, but it’s preferable to satisfy this property one step ahead
of the final output.
model.compile(optimizer='RMSprop',
loss='mse',
metrics=['MeanAbsoluteError'])
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_3 (Dense) (None, 16) 1504
7.4 Code samples and comments for vanilla MLP 111
=================================================================
Total params: 1,649
Trainable params: 1,649
Non-trainable params: 0
_________________________________________________________________
The summary of the model lists the layers in their order from input to output (forward
pass). Because we are working with 93 features, the number of parameters for the first layer
(16 units) is 93 plus one (for the bias) multiplied by 16, which makes 1504. For the second
layer, the number of inputs is equal to the size of the output from the previous layer (16).
Hence, given the fact that the second layer has 8 units, the total number of parameters is
(16+1)*8 = 136.
We set the loss function to the standard mean squared error. Other losses are listed at https://keras.io/losses/; some of them work only for regressions (MSE, MAE) and others only for classification (categorical cross-entropy, see Equation (7.10)). The RMS propagation optimizer is the classical mini-batch back-propagation implementation. For other weight updating algorithms, we refer to https://keras.io/optimizers/. The metric is the function used to assess the quality of the model. It can be different from the loss: for instance, using entropy for training and accuracy as the performance metric.
The final stage fits the model to the data and requires some additional training parameters:
fit_NN = model.fit(
NN_train_features,
NN_train_labels,
batch_size=256,
epochs = 10,
validation_data=(NN_test_features,NN_test_labels),
verbose = True
)
show_history(fit_NN) # Plot, evidently!
The batch size is quite arbitrary. For technical reasons pertaining to training on GPUs,
these sizes are often powers of two.
In Keras, the plot of the trained model shows four different curves (shown here in Figure
7.7). The top graph displays the improvement (or lack thereof) in loss as the number of
epochs increases. Usually, the algorithm starts by learning rapidly and then converges to a
point where any additional epoch does not improve the fit. In the example above, this point
arrives rather quickly because it is hard to notice any gain beyond the fourth epoch. The
two colors show the performance on the two samples: the training sample and the testing
sample. By construction, the loss will always improve (even marginally) on the training
sample. When the impact is negligible on the testing sample (the curve is flat, as is the
case here), the model fails to generalize out-of-sample: the gains obtained by training on
the original sample do not translate to gains on previously unseen data; thus, the model
seems to be learning noise.
112 7 Neural networks
The second graph shows the same behavior but is computed using the metric function. The
correlation (in absolute terms) between the two curves (loss and metric) is usually high. If
one of them is flat, the other should be as well.
In order to obtain the parameters of the model, the user can call model.get_weights(). We do not execute the code here because the size of the output is much too large, as there are thousands of weights.
Finally, from a practical point of view, the prediction is obtained via the usual predict()
function. We use this function below on the testing sample to calculate the hit ratio.
hitratio=np.mean(model.predict(NN_test_features)*NN_test_labels>0)
print(f'Hit Ratio: {hitratio}')
NN_train_labels_C=to_categorical(training_sample['R1M_Usd_C'].values)
NN_test_labels_C=to_categorical(testing_sample['R1M_Usd_C'].values)
# One-hot encoding of the labels
The labels NN_train_labels_C and NN_test_labels_C have two columns: the first flags
the instances with above median returns and the second flags those with below median
returns. Note that we do not alter the feature variables: they remain unchanged. Below, we
set the structure of the networks with many additional features compared to the first one.
First, the options used above and below were chosen as illustrative examples and do not
serve to particularly improve the quality of the model. The first change compared to Section
7.4.1 is the activation functions. The first two are simply new cases, while the third one (for
the output layer) is imperative. Indeed, since the goal is classification, the dimension of the
output must be equal to the number of categories of the labels. The activation that yields a multivariate output is the softmax function. Note that we must also specify the number of classes
(categories) in the terminal layer.
The second major innovation is options pertaining to parameters. One family of options
deals with the initialization of weights and biases. In Keras, weights are referred to as the
‘kernel’. The list of initializers is quite long, and we suggest the interested reader has a look
at the Keras reference (https://keras.io/initializers/). Most of them are random, but some
of them are constant.
Another family of options is the constraints and norm penalization that are applied on the
weights and biases during training. In the above example, the weights of the first layer
are coerced to be non-negative, while the weights of the second layer see their magnitude
penalized by a factor (0.01) times their L2 norm.
Lastly, the final novelty is the dropout layer (see Section 7.3.3) between the first and second
layers. According to this layer, one-fourth of the units in the first layer will be (randomly)
omitted during training.
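The code that defines this second network is not reproduced here; the sketch below is a plausible reconstruction based on the description above and on the model summary that follows (16 units with non-negative weights, a 25% dropout layer, 8 units with an L2 penalty of 0.01, and a 2-unit softmax output). The activations and the initializer are assumptions.

from tensorflow import keras
from tensorflow.keras import layers, regularizers, constraints, initializers

model_C = keras.Sequential()
model_C.add(layers.Dense(16, activation="sigmoid",              # activation assumed
    input_shape=(len(features),),
    kernel_initializer=initializers.RandomNormal(stddev=0.05),  # initializer assumed
    kernel_constraint=constraints.NonNeg()))                    # non-negative weights (first layer)
model_C.add(layers.Dropout(0.25))                               # one-fourth of units dropped during training
model_C.add(layers.Dense(8, activation="elu",                   # activation assumed
    kernel_regularizer=regularizers.l2(0.01)))                  # L2 penalty (factor 0.01) on the weights
model_C.add(layers.Dense(2, activation="softmax"))              # softmax output: one unit per class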
The specification of the training is outlined below.
Model: "sequential_2"
=================================================================
Layer (type) Output Shape Param #
=================================================================
dense_6 (Dense) (None, 16) 1504
dropout (Dropout) (None, 16) 0
dense_7 (Dense) (None, 8) 136
dense_8 (Dense) (None, 2) 18
=================================================================
Total params:1,658, Trainable params:1,658, Non-trainable params:0
=================================================================
Here again, many changes have been made: all levels have been revised. The loss is now the
cross-entropy. Because we work with two categories, we resort to a specific choice (binary
cross-entropy), but the more general form is the option categorical_crossentropy and works
for any number of classes (strictly above 1). The optimizer is also different and allows for several parameters; we refer to Kingma and Ba (2014) for details. Simply put, the two beta
parameters control decay rates for exponentially weighted moving averages used in the
update of weights. The two averages are estimates for the first and second moment of the
gradient and can be exploited to increase the speed of learning. The performance metric
in the above chunk is the categorical accuracy. In multiclass classification, the accuracy is
defined as the average accuracy over all classes and all predictions. Since a prediction for one
instance is a vector of weights, the ‘terminal’ prediction is the class that is associated with
the largest weight. The accuracy then measures the proportion of times when the prediction
is equal to the realized value (i.e., when the class is correctly guessed by the model).
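The compilation call is not shown in this excerpt either; continuing the sketch above, a version consistent with the description (binary cross-entropy, Adam with explicit beta parameters, categorical accuracy) could read as follows — the numerical values of the betas are the Keras defaults, not necessarily the book's exact choices.

model_C.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),  # decay rates of the moment estimates
    loss='binary_crossentropy',        # cross-entropy for two classes
    metrics=['categorical_accuracy'])  # proportion of correctly guessed classes
model_C.summary()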
Finally, we proceed with the training of the model.
callback=tf.keras.callbacks.EarlyStopping(monitor="val_loss",
# Early stopping:
min_delta = 0.001,
# Improvement threshold
patience = 4,
# Nb epochs without improvement before halting
verbose = 0)
# Quiet mode (the last two arguments are reconstructed from the text below)
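The corresponding training call is also missing from this excerpt; a sketch in line with the text (20 epochs, the early-stopping callback defined above, one-hot encoded labels) could be the following, where the batch size is an assumption:

fit_NN_C = model_C.fit(
    NN_train_features, NN_train_labels_C,                  # training features and one-hot labels
    batch_size=512, epochs=20,                             # batch size assumed
    validation_data=(NN_test_features, NN_test_labels_C),
    callbacks=[callback],                                  # early stopping defined above
    verbose=True)
show_history(fit_NN_C)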
FIGURE 7.8: Output from a trained neural network (classification task) with early stopping.
There is only one major difference here compared to the previous training call. In Keras,
callbacks are functions that can be used at given stages of the learning process. In the above
example, we use one such function to stop the algorithm when no progress has been made
for some time.
When datasets are large, the training can be long, especially when batch sizes are small
and/or the number of epochs is high. It is not guaranteed that going to the full number of
epochs is useful, as the loss or metric functions may be plateauing much sooner. Hence, it
can be very convenient to stop the process if no improvement is achieved during a specified
time frame. We set the number of epochs to 20, but the process will likely stop before that.
In the above code, the improvement is monitored on the validation loss (“val_loss”; one alternative is the validation accuracy, “val_acc”). The min_delta value sets the minimum improvement that needs to be attained for the algorithm to continue. Therefore, unless the validation loss improves by 0.001 at each epoch, the training will stop. Nevertheless, some flexibility is introduced
via the patience parameter, which in our case asserts that the halting decision is made only after four consecutive epochs with no improvement. Lastly, the verbose parameter
dictates the amount of comments that is made by the function. For simplicity, we do not
want any comments, hence this value is set to zero.
In Figure 7.8, the two graphs yield very different curves. One reason for that is the scale
of the second graph. The range of accuracies is very narrow. Any change in this range does
not represent much variation overall. The pattern is relatively clear on the training sample:
the loss decreases, while the accuracy improves. Unfortunately, this does not translate to
the testing sample which indicates that the model does not generalize well out-of-sample.
model_custom = keras.Sequential()
# this defines the structure of the network, how layers are organised
model_custom.add(layers.Dense(16,activation="relu",input_shape=(len(features),)))
model_custom.add(layers.Dense(8, activation="sigmoid"))
model_custom.add(layers.Dense(1))
# No activation means linear activation: f(x) = x
Then we code the loss function and integrate it to the model. The important trick is to resort to functions that are specific to the backend library (here, TensorFlow operations such as tf.reduce_mean). We code the variance of predicted values minus the scaled covariance between realized and predicted values. Below we use a scale of five.
def custom_loss(y_true, y_pred):   # Defines the loss, we use gamma = 5
    loss = tf.reduce_mean(
        tf.square(y_pred - tf.reduce_mean(y_pred))) - 5*tf.reduce_mean(
        (y_true - tf.reduce_mean(y_true))*(y_pred - tf.reduce_mean(y_pred)))
    return loss
model_custom.compile( # Model specification
optimizer='RMSprop', # Optim method
loss=custom_loss, # New loss function
metrics=['MeanAbsoluteError'])
Finally, we are ready to train and briefly evaluate the performance of the model.
fit_NN_cust = model_custom.fit(
NN_train_features, # training features
NN_train_labels, # Training labels
batch_size=512, epochs = 10, # Training parameters
    validation_data=(NN_test_features,NN_test_labels),   # Test data
    verbose = False)                                     # No warnings
show_history(fit_NN_cust)
hitratio=np.mean(model_custom.predict(NN_test_features)*NN_test_labels>0)
# Hit ratio
print(f'Hit Ratio: {hitratio}')
$$\tilde{y}_i = f^{(y)}\left(\sum_{j=1}^{U_1} h_{i,j}\, w_j^{(y)} + b^{(2)}\right)$$
$$\mathbf{h}_i = f^{(h)}\left(\sum_{k=1}^{U_0} x_{i,k}\, w_k^{(h,1)} + b^{(1)} + \underbrace{\sum_{k=1}^{U_1} w_k^{(h,2)}\, h_{i-1,k}}_{\text{memory part}}\right),$$
These kinds of models are often referred to as Elman (1990) models, or as Jordan (1997) models if, in the latter case, $\mathbf{h}_{i-1}$ is replaced by $y_{i-1}$ in the computation of $\mathbf{h}_i$. Both types
of models fall under the overarching umbrella of Recurrent Neural Networks (RNNs).
The hi is usually called the state or the hidden layer. The training of this model is com-
plicated and must be done by unfolding the network over all instances to obtain a simple
feed-forward network and train it regularly. We illustrate the unfolding principle in Figure
7.9. It shows a very deep network. The first input impacts the first layer, and then the
second one via h1 and all following layers in the same fashion. Likewise, the second input
impacts all layers except the first and each instance i − 1 is going to impact the output ỹi
and all outputs ỹj for j ≥ i. In Figure 7.9, the parameters that are trained are shown in
blue. They appear many times, in fact, at each level of the unfolded network.
[Figure 7.9: unfolded recurrent network. At each step t, the state is h_t = f^(h)(W^(h,1) x_t + b^(h) + W^(h,2) h_{t−1}) and the output is ỹ_t = f^(y)(W^(y) h_t + b^(y)); the trained parameters (shown in blue in the original figure) are shared across all steps.]
The main problem with the above architecture is the loss of memory induced by vanishing
gradients. Because of the depth of the model, the chain rule used in the back-propagation
will imply a large number of products of derivatives of activation functions. Now, as is shown
in Figure 7.4, these functions are very smooth and their derivatives are most of the time
smaller than one (in absolute value). Hence, multiplying many numbers smaller than one
leads to very small figures: beyond some layers, the learning does not propagate because
the adjustments are too small.
One way to prevent this progressive discounting of the memory was introduced in Hochre-
iter and Schmidhuber (1997) (the Long Short-Term Memory, LSTM, model). This model was subsequently simplified in Chung et al. (2015), and we present this more parsimonious model below. The Gated Recurrent Unit (GRU) is a slightly more complicated
version of the vanilla recurrent network defined above. It has the following representation:
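The book's exact equations are not reproduced in this excerpt; in standard notation (which may differ slightly from the book's), the GRU computes
$$\begin{aligned}
z_i &= \sigma\!\left(\mathbf{W}^{(z)} \mathbf{x}_i + \mathbf{U}^{(z)} \mathbf{h}_{i-1} + \mathbf{b}^{(z)}\right) && \text{(update gate)}\\
r_i &= \sigma\!\left(\mathbf{W}^{(r)} \mathbf{x}_i + \mathbf{U}^{(r)} \mathbf{h}_{i-1} + \mathbf{b}^{(r)}\right) && \text{(reset gate)}\\
\tilde{\mathbf{h}}_i &= \tanh\!\left(\mathbf{W}^{(h)} \mathbf{x}_i + \mathbf{U}^{(h)} (r_i \odot \mathbf{h}_{i-1}) + \mathbf{b}^{(h)}\right) && \text{(candidate value)}\\
\mathbf{h}_i &= z_i \odot \mathbf{h}_{i-1} + (1 - z_i) \odot \tilde{\mathbf{h}}_i, && \text{(mix of past and current)}
\end{aligned}$$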
where the zi decides the optimal mix between the current and past values. For the candidate
value, $r_i$ decides how much of the past/memory to retain. $r_i$ is commonly referred to as the ‘reset gate’ and $z_i$ as the ‘update gate’.
There are some subtleties in the training of a recurrent network. Indeed, because of the
chaining between the instances, each batch must correspond to a coherent time series. A
logical choice is thus one batch per asset with instances (logically) chronologically ordered.
Lastly, one option in some frameworks is to keep some memory between the batches by
passing the final value of ỹi to the next batch (for which it will be ỹ0 ). This is often referred
to as the stateful mode and should be considered meticulously. It does not seem desirable
in a portfolio prediction setting if the batch size corresponds to all observations for each
asset: there is no particular link between assets. If the dataset is divided into several parts
for each given asset, then the training must be handled very cautiously.
Recurrent networks, and LSTMs especially, have been found to be good forecasting tools in financial contexts (see, e.g., Fischer and Krauss (2018) and Wang et al. (2020)).
2. The time steps: in our case, it will simply be the number of dates.
3. The number of features: in our case, there is only one possible figure which is the number
of predictors.
For simplicity and in order to reduce computation times, we will use the same subset of
stocks as that from Section 5.2.2. This yields a perfectly rectangular dataset in which all
dates have the same number of observations.
First, we create some new, intermediate variables.
data_rnn=data_ml[data_ml['stock_id'].isin(stock_ids_short)]
# Dedicated dataset
training_sample_rnn=data_rnn[data_rnn['date']<separation_date]
# Training set
testing_sample_rnn=data_rnn[data_rnn['date']>separation_date]
120 7 Neural networks
# Test set
nb_stocks=len(stock_ids_short)
# Nb stocks
nb_feats=len(features)
# Nb features
nb_dates_train=training_sample_rnn.shape[0] // nb_stocks
# Nb training dates
nb_dates_test = testing_sample_rnn.shape[0] // nb_stocks
# Nb testing dates
nn_train_features = training_sample_rnn[features].values
# Train features in array format
nn_test_features = testing_sample_rnn[features].values
# Test features in array format
nn_train_labels = training_sample_rnn['R1M_Usd'].values
# Train label in array format
nn_test_labels = testing_sample_rnn['R1M_Usd'].values
# Test label in array format
Then, we construct the variables we will pass as arguments. We recall that the data file was
ordered first by stocks and then by date (see Section 1.2).
train_features_rnn = np.reshape(nn_train_features,
# Formats the training data into tricky ordered array
(nb_stocks, nb_dates_train, nb_feats))
# The order is: stock, date, feature
test_features_rnn = np.reshape(nn_test_features,
# Formats the training data into tricky ordered array
(nb_stocks, nb_dates_test, nb_feats))
# The order is: stock, date, feature
train_labels_rnn=np.reshape(nn_train_labels,(nb_stocks,nb_dates_train,1))
test_labels_rnn=np.reshape(nn_test_labels,(nb_stocks,nb_dates_test,1))
Finally, we move towards the training part. For simplicity, we only consider a simple RNN
with only one layer. The structure is outlined below. In terms of recurrence structure, we
pick a GRU.
model_RNN = keras.Sequential()
model_RNN.add(layers.GRU(16,                 # Nb units in hidden layer
    batch_input_shape=(nb_stocks,nb_dates_train,nb_feats),   # Dimensions of the input
    activation='tanh',                       # Activation function
    return_sequences=True))                  # Return the full sequence
model_RNN.add(layers.Dense(1))               # Output dimension (the last lines mirror the duplicated model below)
There are many options available for recurrent layers. For GRUs, we refer to the Keras
documentation at https://keras.io. We comment briefly on the option return_sequences, which
we activate. In many cases, the output is simply the terminal value of the sequence. If we
do not require the entirety of the sequence to be returned, we will face a problem in the
dimensionality because the label is indeed a full sequence. Once the structure is determined,
we can move forward to the training stage.
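The compilation and fitting instructions are not shown in this excerpt; a minimal sketch consistent with the surrounding code could be the following (the optimizer, loss, and number of epochs are assumptions):

model_RNN.compile(optimizer='RMSprop', loss='mse', metrics=['MeanAbsoluteError'])
fit_RNN = model_RNN.fit(train_features_rnn, train_labels_rnn,   # 3D arrays built above
                        epochs=10,                              # number of epochs assumed
                        batch_size=nb_stocks,                   # must match batch_input_shape
                        verbose=True)
show_history(fit_RNN)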
FIGURE 7.10: Output from a trained recurrent neural network (regression task).
Compared to our previous models, the major difference, both in the output (Figure 7.10) and in the input (code), is the absence of validation (or testing) data. One reason for this is that Keras is very restrictive with RNNs and imposes that both the training and testing samples share the same dimensions. In our situation, this is obviously not the case, hence we must bypass this obstacle by duplicating the model.
new_model = keras.Sequential()
new_model.add(layers.GRU(16,
batch_input_shape=(nb_stocks,nb_dates_test,nb_feats),# New dimensions
activation='tanh', # Activation function
return_sequences=True)) # Return the full sequence
new_model.add(layers.Dense(1)) # Output dimension
new_model.set_weights(model_RNN.get_weights())
Finally, once the new model is ready, and with the matching dimensions, we can push
forward to predicting the test values. We resort to the predict() function and immediately
compute the hit ratio obtained by the model.
122 7 Neural networks
pred_rnn = new_model.predict(test_features_rnn,batch_size=nb_stocks)
# Predictions
hitratio = np.mean(np.multiply(pred_rnn,test_labels_rnn)>0)# Hit ratio
print(f'Hit Ratio: {hitratio}')
First, let us decompose this expression in its two parts (the optimizers). The first part
(i.e., the first max) is the classical one: the algorithm seeks to maximize the probability
of assigning the correct label to all examples it seeks to classify. As is done in economics
and finance, the program does not maximize D(x) itself on average, but rather a functional
form (like a utility function).
On the left side, since the expectation is driven by x, the objective must be increasing in
the output. On the right side, where the expectation is evaluated over the fake instances,
the right classification is the opposite, i.e., 1 − D(G(z)).
The second, overarching, part seeks to minimize the performance of the algorithm on the
simulated data: it aims at shrinking the odds that D finds out that the data is indeed
corrupt. A summarized version of the structure of the network is provided below in Figure
(7.13).
$$\left.\begin{array}{l}\text{training sample} = \mathbf{x} = \text{true data} \\ \text{noise} = \mathbf{z} \;\overset{G}{\longrightarrow}\; \text{fake data}\end{array}\right\} \;\overset{D}{\longrightarrow}\; \text{output} = \text{probability for label} \qquad (7.13)$$
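The expression that is decomposed above is not reproduced in this excerpt; for reference, the canonical objective of the original GAN formulation reads
$$\min_G \max_D \ \mathbb{E}_{\mathbf{x}}\left[\log D(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z}}\left[\log\left(1 - D(G(\mathbf{z}))\right)\right],$$
where D returns the probability that its input comes from the true data and G maps noise to fake samples.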
In ML-based asset pricing, the most notable application of GANs was introduced in Chen
et al. (2020). Their aim is to make use of the method of moment expression
which is an application of Equation (3.8) where the instrumental variables It,n are firm-
dependent (e.g., characteristics and attributes), while the It are macro-economic vari-
ables (aggregate dividend yield, volatility level, credit spread, term spread, etc.). The
function g yields a d-dimensional output, so that the above equation leads to d moment conditions. The trick is to model the SDF as an unknown combination of assets, $M_{t+1} = 1 - \sum_{n=1}^{N} w(I_t, I_{t,n})\, r_{t+1,n}$. The primary discriminatory network (D) is the one that approximates the SDF via the weights $w(I_t, I_{t,n})$. The secondary generative network is the one that creates the moment condition through $g(I_t, I_{t,n})$ in the above equation.
The full specification of the network is given by the program:
$$\min_{w}\, \max_{g}\ \sum_{j=1}^{N} \left\| \mathbb{E}\left[\left(1 - \sum_{n=1}^{N} w(I_t, I_{t,n})\, r_{t+1,n}\right) r_{t+1,j}\, g(I_t, I_{t,j})\right] \right\|^2,$$
where the L2 norm applies on the d values generated via g. The asset pricing equations (mo-
ments) are not treated as equalities but as a relationship that is approximated. The network
defined by w is the asset pricing modeler and tries to determine the best possible model,
while the network defined by g seeks to find the worst possible conditions so that the model
performs badly. We refer to the original article for the full specification of both networks.
In their empirical section, Chen et al. (2020) report that adopting a strong structure driven by asset pricing imperatives adds value compared to a pure predictive ‘vanilla’ approach such as the one detailed in Gu et al. (2020). The out-of-sample behavior of decile-sorted portfolios (based on the model’s predictions) displays a monotonic pattern with respect to the order of the deciles.
GANs can also be used to generate artificial financial data (see Efimov and Xu (2019),
Marti (2019), and Wiese et al. (2020)), but this topic is outside the scope of the book.
7.6.2 Autoencoders
In the recent literature, autoencoders (AEs) are used in Huck (2019) (portfolio manage-
ment), and Gu et al. (2021) (asset pricing).
AEs are a strange family of neural networks because they are classified among non-
supervised algorithms. In the supervised jargon, their label is equal to the input. Like
GANs, autoencoders consist of two networks, though the structure is very different: the
first network encodes the input into some intermediary output (usually called the code),
and the second network decodes the code into a modified version of the input.
$$\underset{\text{input}}{\mathbf{x}} \;\overset{E\ \text{(encoder)}}{\longrightarrow}\; \underset{\text{code}}{\mathbf{z}} \;\overset{D\ \text{(decoder)}}{\longrightarrow}\; \underset{\text{modified input}}{\mathbf{x}'}$$
Because autoencoders do not belong to the large family of supervised algorithms, we post-
pone their presentation to Section 15.2.3.
The article Gu et al. (2021) resorts to the idea of AEs while at the same time augmenting
the complexity of their asset pricing model. From the simple specification $\mathbf{r}_t = \boldsymbol{\beta}_{t-1} \mathbf{f}_t + \mathbf{e}_t$ (we omit asset dependence for notational simplicity), they add the assumptions that the betas depend on firm characteristics, while the factors are possibly non-linear functions of the returns themselves. The model takes the following form:
$$\mathbf{r}_t = NN_{\text{beta}}(\mathbf{x}_{t-1})\, NN_{\text{factor}}(\mathbf{r}_t) + \mathbf{e}_t,$$
where $NN_{\text{beta}}$ and $NN_{\text{factor}}$ are two neural networks. The above equation looks like an
autoencoder because the returns are both inputs and outputs. However, the additional
complexity comes from the second neural network NNbeta . Modern neural network libraries
such as Keras allow for customized models like the one above. The coding of this structure
is left as exercise (see below).
FIGURE 7.11: Scheme of a convolutional unit. Note: the dimensions are general and do not
correspond to the number of squares.
Iteratively reducing the dimension of the output via sequences of convolutional layers like
the one presented above would be costly in computation and could give rise to overfitting
because the number of weights would be incredibly large. In order to efficiently reduce the
size of outputs, pooling layers are often used. The job of pooling units is to simplify matrices by reducing each sub-matrix to a simple metric f, where f is the minimum, maximum, or average value of that sub-matrix. We show examples of pooling in Figure 7.12 below. In order to increase the speed of compression, it is possible to add a stride to
omit cells. A stride value of v will perform the operation only every v cells and hence bypass intermediate steps. In Figure 7.12, the two cases on the left do not resort to strides, hence the reduction in dimension is exactly equal to the pooling size. When the stride is in action (right pane), the reduction is more marked. From a 1,000-by-1,000 input, a 2-by-2 pooling layer with stride 2 will yield a 500-by-500 output: the dimension is shrunk fourfold, as in the right scheme of Figure 7.12.
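As a small illustration (not from the book), the following Keras snippet checks the output shape of a 2-by-2 max-pooling layer with stride 2 applied to a 1,000-by-1,000 single-channel input:

from tensorflow import keras
from tensorflow.keras import layers

pool = layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2))
x = keras.Input(shape=(1000, 1000, 1))   # height, width, channels
print(pool(x).shape)                     # (None, 500, 500, 1): the dimension is shrunk fourfold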
With these tools in hand, it is possible to build new predictive tools. In Hoseinzade and Hara-
tizadeh (2019), predictors such as price quotes, technical indicators and macro-economic
data are fed to a complex neural network with six layers in order to predict the sign of price
variations. While this is clearly an interesting computer science exercise, the deep economic
motivation behind this choice of architecture remains unclear.
[Figure: scheme of the autoencoder asset pricing model. The left network maps the input characteristics (dimension N × K) to the betas (dimension N × J); the right network maps the input returns (dimension N × 1) to the factors (dimension J × 1).]
While the origins of support vector machines (SVMs) are old (and go back to Vapnik and
Lerner (1963)), their modern treatment was initiated in Boser et al. (1992), Cortes and
Vapnik (1995) (binary classification), and Drucker et al. (1997) (regression). We refer to
http://www.kernel-machines.org/books for an exhaustive bibliography on their theoretical and empirical properties. SVMs have been very popular since their creation among the machine learning community. Nonetheless, other tools (neural networks especially) have gained popularity and have progressively replaced SVMs in many applications, notably computer vision.
[Figure 8.1: diagram of binary classification with SVM (linearly separable data), showing the separating line w*·x − b = 0, the two margin lines w*·x − b = ±1, and the classes y = 1 and y = −1 on each side of the margin.]
A model consists of two weights w = (w1 , w2 ) that load on the variables and create a natural
linear separation in the plane. In the example above, we show three separations. The red
one is not a good classifier because there are circles and squares above and beneath it. The
blue line is a good classifier: all circles are to its left and all squares to its right. Likewise, the
green line achieves a perfect classification score. Yet, there is a notable difference between
the two.
The grey star at the top of the graph is a mystery point and, given its location, if the data pattern holds, it should be a circle. The blue model fails to recognize it as such, while
the green one succeeds. The interesting features of the scheme are those that we have not
mentioned yet, that is, the grey dotted lines. These lines represent the no-man’s land in
which no observation falls when the green model is enforced. In this area, each strip above
and below the green line can be viewed as a margin of error for the model. Typically, the
grey star is located inside this margin.
The two margins are computed as the parallel lines that maximize the distance between the
model and the closest points that are correctly classified (on both sides). These points are
called support vectors, which justifies the name of the technique. Obviously, the green
model has a greater margin than the blue one. The core idea of SVMs is to maximize the
margin, under the constraint that the classifier does not make any mistake. Said differently,
SVMs try to pick the most robust model among all those that yield a correct classification.
More formally, if we numerically define circles as +1 and squares as −1, any ‘good’ linear
model is expected to satisfy:
$$\begin{cases} \sum_{k=1}^{K} w_k x_{i,k} + b \ge +1 & \text{when } y_i = +1 \\ \sum_{k=1}^{K} w_k x_{i,k} + b \le -1 & \text{when } y_i = -1, \end{cases} \qquad (8.1)$$
which can be summarized in compact form: $y_i \times \left(\sum_{k=1}^{K} w_k x_{i,k} + b\right) \ge 1$. Now, the margin between the green model and a support vector on the dashed grey line is equal to $\|\mathbf{w}\|^{-1} = \left(\sum_{k=1}^{K} w_k^2\right)^{-1/2}$. This value comes from the fact that the distance between a point $(x_0, y_0)$ and a line parametrized by $ax + by + c = 0$ is equal to $d = \frac{|ax_0 + by_0 + c|}{\sqrt{a^2 + b^2}}$. In the case of the model defined above (8.1), the numerator is equal to 1 and the norm is that of $\mathbf{w}$. Thus,
the final problem is the following:
$$\underset{\mathbf{w}, b}{\text{argmin}}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i \left(\sum_{k=1}^{K} w_k x_{i,k} + b\right) \ge 1. \qquad (8.2)$$
The dual form of this program (see chapter 5 in Boyd and Vandenberghe (2004)) is
$$L(\mathbf{w}, b, \boldsymbol{\lambda}) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{I} \lambda_i \left( y_i \left(\sum_{k=1}^{K} w_k x_{i,k} + b\right) - 1 \right), \qquad (8.3)$$
where either $\lambda_i = 0$ or $y_i \left(\sum_{k=1}^{K} w_k x_{i,k} + b\right) = 1$. Thus, only some points will matter in
the solution (the so-called support vectors). The first order conditions impose that the
derivatives of this Lagrangian be null:
$$\frac{\partial L}{\partial \mathbf{w}}(\mathbf{w}, b, \boldsymbol{\lambda}) = 0, \qquad \frac{\partial L}{\partial b}(\mathbf{w}, b, \boldsymbol{\lambda}) = 0,$$
where the first condition leads to
$$\mathbf{w}^* = \sum_{i=1}^{I} \lambda_i y_i \mathbf{x}_i.$$
This solution is indeed a linear form of the features, but only some points are taken into
account. They are those for which the inequalities (8.1) are equalities.
Naturally, this problem becomes infeasible whenever the condition cannot be satisfied, that is, when a simple line cannot perfectly separate the labels, no matter the choice of coefficients. This is the most common configuration, and such datasets are then said to be not linearly separable. This complicates the process, but it is possible to resort to a trick. The idea is to introduce some flexibility in (8.1) by adding correction variables that allow the conditions to be met:
$$\begin{cases} \sum_{k=1}^{K} w_k x_{i,k} + b \ge +1 - \xi_i & \text{when } y_i = +1 \\ \sum_{k=1}^{K} w_k x_{i,k} + b \le -1 + \xi_i & \text{when } y_i = -1, \end{cases} \qquad (8.4)$$
where the novelties, the $\xi_i$, are positive so-called ‘slack’ variables that make the conditions feasible. They are illustrated in Figure 8.2. In this new configuration, there is no simple linear model that can perfectly discriminate between the two classes.
FIGURE 8.2: Diagram of binary classification with SVM - linearly inseparable data.
$$\underset{\mathbf{w}, b, \boldsymbol{\xi}}{\text{argmin}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{I} \xi_i \quad \text{s.t.} \quad \left\{ y_i \left(\sum_{k=1}^{K} w_k \phi(x_{i,k}) + b\right) \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0, \ \forall i \right\}, \qquad (8.5)$$
where the parameter C > 0 tunes the cost of mis-classification: as C increases, errors become
more penalizing.
In addition, the program can be generalized to non-linear models, via the kernel φ which is
applied to the input points xi,k . Non-linear kernels can help cope with patterns that are more
complex than straight lines (see Figure 8.3). Common kernels can be polynomial, radial,
or sigmoid. The solution is found using more or less standard techniques for constrained
quadratic programs. Once the weights w and bias PK b are set via training, a prediction for
a new vector xj is simply made by computing k=1 wk φ(xj,k ) + b and choosing the class
based on the sign of the expression.
$$\underset{\mathbf{w}, b, \boldsymbol{\xi}}{\text{argmin}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{I} (\xi_i + \xi_i^*) \qquad (8.6)$$
$$\text{s.t.} \quad \sum_{k=1}^{K} w_k \phi(x_{i,k}) + b - y_i \le \epsilon + \xi_i \qquad (8.7)$$
$$\qquad\ \ y_i - \sum_{k=1}^{K} w_k \phi(x_{i,k}) - b \le \epsilon + \xi_i^* \qquad (8.8)$$
$$\qquad\ \ \xi_i, \xi_i^* \ge 0, \ \forall i, \qquad (8.9)$$
and it is illustrated in Figure 8.4. The user specifies a margin $\epsilon$, and the model will try to find the linear (up to a kernel transformation) relationship between the labels $y_i$ and the inputs $\mathbf{x}_i$. Just as in the classification task, if the data points are inside the strip, the slack variables $\xi_i$ and $\xi_i^*$ are set to zero. When the points violate the threshold, the objective function (first line of the program) is penalized. Note that setting a large $\epsilon$ leaves room for more error. Once the model has been trained, a prediction for $\mathbf{x}_j$ is simply $\sum_{k=1}^{K} w_k \phi(x_{j,k}) + b$.
FIGURE 8.4: Diagram of regression SVM.
Let us take a step back and simplify what the algorithm does, that is: minimize the sum of squared weights $\|\mathbf{w}\|^2$ subject to the error being small enough (modulo a slack variable). In spirit, this is somewhat the opposite of penalized linear regressions, which seek to minimize the error subject to the weights being small enough.
The models laid out in this section are a preview of the universe of SVM engines and several
other formulations have been developed. One reference library that is coded in C and C++
is LIBSVM and it is widely used by many other programming languages. The interested
reader can have a look at the corresponding article Chang and Lin (2011) for more details
on the SVM zoo (a more recent November 2019 version is also available online).
8.3 Practice
For the sake of consistency, we will use scikit-learn’s implementation of SVMs in the following code snippets. As in the LIBSVM implementation, the package requires the label and the features to be specified separately. For this reason, we recycle the variables used for the boosted trees. Moreover, since the training is slow, we perform it on a subsample of these sets (the first thousand instances).
model_svm=svm.SVR(
    kernel='rbf',    # SVM kernel (or: linear, polynomial, sigmoid)
    C=0.1,           # Slack variable penalisation
    epsilon=0.1      # Width of the margin (value assumed; remaining arguments not shown in this excerpt)
)
fit_svm=model_svm.fit(x,y)   # Fitting the model (variable names assumed from the classification snippet below)

MSE: 0.04226507027049866
hitratio = np.mean(fit_svm.predict(test_feat_short)*y_test>0)
print(f'Hit Ratio: {hitratio}')
model_svm_c=svm.SVC(
kernel='sigmoid',
C=0.2, # Slack variable penalisation
gamma=0.5, # Parameter in the sigmoid kernel
coef0=0.3 # Parameter in the sigmoid kernel
)
fit_svm_c=model_svm_c.fit(x,y_c)# Fitting the model
hitratio=np.mean(fit_svm_c.predict(test_feat_short)==y_c_test)
print(f'Hit Ratio: {hitratio}')
This section is dedicated to the subset of machine learning that makes prior assumptions on
parameters. Before we explain how Bayes’ theorem can be applied to simple building blocks
in machine learning, we introduce some notations and concepts in the subsection below.
Good references for Bayesian analysis are Gelman et al. (2013) and Kruschke (2014). The
latter illustrates the concepts with many lines of code.
$$P[A|B] = \frac{P[A \cap B]}{P[B]},$$
that is, the probability of the intersection between the two sets divided by the probability of $B$. Likewise, the probability that both events occur is equal to $P[A \cap B] = P[A]P[B|A]$. Given $n$ disjoint events $A_i$, $i = 1, \dots, n$, such that $\sum_{i=1}^{n} P(A_i) = 1$, then for any event $B$, the law of total probabilities is (or implies)
$$P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i).$$
Endowed with this result, we can move forward to the core topic of this section, which is the
estimation of some parameter θ (possibly a vector) given a dataset, which we denote with
y thereby following the conventions from Gelman et al. (2013). This notation is suboptimal
in this book nonetheless because in all other chapters, y stands for the label of a dataset.
In Bayesian analysis, one sophistication (compared to a frequentist approach) comes from
the fact that the data is not almighty. The distribution of the parameter θ will be a mix
between some prior distribution set by the statistician (user or analyst) and the empirical
distribution from the data. More precisely, a simple application of Bayes’ formula yields
$$p(\theta|\mathbf{y}) = \frac{p(\theta)\,p(\mathbf{y}|\theta)}{p(\mathbf{y})} \propto p(\theta)\,p(\mathbf{y}|\theta). \qquad (9.2)$$
$$p(\mathbf{y}|\theta, \lambda) = \prod_{i=1}^{I} f_\lambda(y_i; \theta), \qquad (9.3)$$
but in this case the problem becomes slightly more complex because adding new parameters
changes the posterior distribution to p(θ, λ|y). The user must find out the joint distribu-
tion of θ and λ, given y. Because of their nested structure, these models are often called
hierarchical models.
Bayesian methods are widely used for portfolio choice. The rationale is that the distribution
of asset returns depends on some parameter and the main issue is to determine the posterior
distribution. We very briefly review a vast literature below. Bayesian asset allocation is
investigated in Lai et al. (2011) (via stochastic optimization), Guidolin and Liu (2016) and
Dangl and Weissensteiner (2020). Shrinkage techniques (of means and covariance matrices)
are tested in Frost and Savarino (1986), Kan and Zhou (2007), and DeMiguel et al. (2015).
In a similar vein, Tu and Zhou (2010) build priors that are coherent with asset pricing
theories. Finally, Bauder et al. (2020) sample portfolio returns which allows to derive a
Bayesian optimal frontier. We invite the interested reader to also delve into the references
that are cited within these few articles.
$$x_1^{m+1} \sim p(X_1 | X_2 = x_2^m, \dots, X_J = x_J^m);$$
$$x_2^{m+1} \sim p(X_2 | X_1 = x_1^{m+1}, X_3 = x_3^m, \dots, X_J = x_J^m);$$
$$\dots$$
$$x_J^{m+1} \sim p(X_J | X_1 = x_1^{m+1}, X_2 = x_2^{m+1}, \dots, X_{J-1} = x_{J-1}^{m+1}).$$
The important detail is that after each line, the value of the variable is updated. Hence, in
the second line X2 is sampled with the knowledge of X1 = xm+1 1 and in the last line, all
variables except XJ have been updated to their (m + 1)th state. The above algorithm is
called Gibbs sampling. It relates to Markov chains because each new iteration depends only
on the previous one.
Under some technical assumptions, as T increases, the distribution of xT converges to that
of p. The conditions under which the convergence occurs have been widely discussed in a
series of articles in the 1990s. The interested reader can have a look for instance at Tierney
(1994), Roberts and Smith (1994), as well as at section 11.7 of Gelman et al. (2013).
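A minimal illustration of Gibbs sampling (not taken from the book) is the bivariate Gaussian case, for which both conditional laws are known in closed form; the correlation below is an arbitrary value.

import numpy as np

rho, n_iter = 0.8, 10_000            # correlation of the bivariate normal, number of draws
rng = np.random.default_rng(0)
x1, x2 = 0.0, 0.0                    # initial values
draws = np.zeros((n_iter, 2))
for m in range(n_iter):
    # Conditional laws of a standard bivariate normal with correlation rho
    x1 = rng.normal(loc=rho * x2, scale=np.sqrt(1 - rho ** 2))
    x2 = rng.normal(loc=rho * x1, scale=np.sqrt(1 - rho ** 2))   # uses the updated x1
    draws[m] = [x1, x2]
print(np.corrcoef(draws.T)[0, 1])    # should be close to rho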
Sometimes, the full distribution is complex and the conditional laws are hard to determine
and to sample. Then, a more general method, called Metropolis-Hastings, can be used that
relies on the rejection method for the simulation of random variables.
Once an initial value for x has been sampled (x0 ), each new iteration (m) of the simulation
takes place in three stages:
1. generate a candidate value $x'_{m+1}$ from $p(x|x_m)$;
2. compute the acceptance ratio $\alpha = \min\left(\frac{p(x'_{m+1})\,p(x_m|x'_{m+1})}{p(x_m)\,p(x'_{m+1}|x_m)},\ 1\right)$;
3. pick $x_{m+1} = x'_{m+1}$ with probability $\alpha$, or stick with the previous value ($x_{m+1} = x_m$) with probability $1 - \alpha$.
The interpretation of the acceptance ratio is not straightforward in the general case. When
the sampling generator is symmetric ($p(x|y) = p(y|x)$), the candidate is always chosen whenever $p(x'_{m+1}) \ge p(x_m)$. If the reverse condition holds ($p(x'_{m+1}) < p(x_m)$), then the candidate is retained with odds equal to $p(x'_{m+1})/p(x_m)$, which is the ratio of likelihoods. The more
Often, the first simulations are discarded in order to leave time to the chain to converge
to a high probability region. This procedure (often called ‘burn in’) ensures that the first
retained samples are located in a zone that is likely, i.e., that they are more representative
of the law we are trying to simulate.
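As an illustration (not from the book), the sketch below runs a random-walk Metropolis-Hastings sampler for a univariate target; because the Gaussian proposal is symmetric, the acceptance ratio reduces to the ratio of target densities. The target (a mixture of two normals), the proposal width, and the burn-in length are arbitrary choices.

import numpy as np

def target(x):   # unnormalized target density: mixture of two Gaussians
    return 0.3 * np.exp(-0.5 * (x - 2) ** 2) + 0.7 * np.exp(-0.5 * (x + 1) ** 2)

rng = np.random.default_rng(42)
n_iter, burn_in, x = 10_000, 1_000, 0.0
samples = []
for m in range(n_iter):
    candidate = x + rng.normal(scale=1.0)              # symmetric random-walk proposal
    alpha = min(target(candidate) / target(x), 1.0)    # acceptance ratio
    if rng.uniform() < alpha:                          # accept with probability alpha
        x = candidate
    samples.append(x)
samples = np.array(samples[burn_in:])                  # discard the burn-in draws
print(samples.mean(), samples.std())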
For the sake of brevity, we stick to a succinct presentation here, but some additional details
are outlined in section 11.2 of Gelman et al. (2013) and in chapter 7 of Kruschke (2014).
$$p(\boldsymbol{\epsilon}|\mathbf{b}, \sigma) = \prod_{i=1}^{I} \frac{e^{-\frac{\epsilon_i^2}{2\sigma^2}}}{\sigma\sqrt{2\pi}} = (\sigma\sqrt{2\pi})^{-I}\, e^{-\sum_{i=1}^{I}\frac{\epsilon_i^2}{2\sigma^2}}.$$
In a regression analysis, the data is given both by $\mathbf{y}$ and by $\mathbf{X}$, hence both are reported in the notations. Simply acknowledging that $\boldsymbol{\epsilon} = \mathbf{y} - \mathbf{X}\mathbf{b}$, we get
$$p(\mathbf{y}, \mathbf{X}|\mathbf{b}, \sigma) = \prod_{i=1}^{I} \frac{e^{-\frac{\epsilon_i^2}{2\sigma^2}}}{\sigma\sqrt{2\pi}} \qquad (9.4)$$
$$= (\sigma\sqrt{2\pi})^{-I}\, e^{-\sum_{i=1}^{I}\frac{(y_i - \mathbf{x}_i'\mathbf{b})^2}{2\sigma^2}} = (\sigma\sqrt{2\pi})^{-I}\, e^{-\frac{(\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b})}{2\sigma^2}}$$
$$= \underbrace{(\sigma\sqrt{2\pi})^{-I}\, e^{-\frac{(\mathbf{y} - \mathbf{X}\hat{\mathbf{b}})'(\mathbf{y} - \mathbf{X}\hat{\mathbf{b}})}{2\sigma^2}}}_{\text{depends on } \sigma \text{, not } \mathbf{b}} \times \underbrace{e^{-\frac{(\mathbf{b} - \hat{\mathbf{b}})'\mathbf{X}'\mathbf{X}(\mathbf{b} - \hat{\mathbf{b}})}{2\sigma^2}}}_{\text{depends on both } \sigma \text{ and } \mathbf{b}}. \qquad (9.5)$$
In the last line, the second term is a function of the difference $\mathbf{b} - \hat{\mathbf{b}}$, where $\hat{\mathbf{b}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. This is not surprising: $\hat{\mathbf{b}}$ is a natural benchmark for the mean of $\mathbf{b}$. Moreover, introducing $\hat{\mathbf{b}}$ yields a relatively simple form for the probability.
The above expression is the frequentist (data-based) block of the posterior: the likelihood. If
we want to obtain a tractable expression for the posterior, we need to find a prior component
that has a form that will combine well with this likelihood. These forms are called conjugate
priors. A natural candidate for the right part (that depends on both b and σ) is the
multivariate Gaussian density:
$$p[\mathbf{b}|\sigma] = \sigma^{-k}\, e^{-\frac{(\mathbf{b} - \mathbf{b}_0)'\boldsymbol{\Lambda}_0(\mathbf{b} - \mathbf{b}_0)}{2\sigma^2}}, \qquad (9.6)$$
where we are obliged to condition with respect to $\sigma$. The density has prior mean $\mathbf{b}_0$ and prior covariance matrix $\boldsymbol{\Lambda}_0^{-1}$. This prior gets us one step closer to the posterior because
In order to fully specify the cascade of probabilities, we need to take care of σ and set a
density of the form
$$p[\sigma^2] \propto (\sigma^2)^{-1-a_0}\, e^{-\frac{b_0}{2\sigma^2}}, \qquad (9.8)$$
which is close to that of the left part of (9.5). This corresponds to an inverse gamma
distribution for the variance with prior parameters a0 and b0 (this scalar notation is not
optimal because it can be confused with the prior mean b0 so we must pay extra attention).
Now, we can simplify $p[\mathbf{b}, \sigma|\mathbf{y}, \mathbf{X}]$ with (9.5), (9.6), and (9.8):
$$p[\mathbf{b}, \sigma|\mathbf{y}, \mathbf{X}] \propto (\sigma\sqrt{2\pi})^{-I}\, \sigma^{-2(1+a_0)}\, e^{-\frac{(\mathbf{y}-\mathbf{X}\hat{\mathbf{b}})'(\mathbf{y}-\mathbf{X}\hat{\mathbf{b}})}{2\sigma^2}} \times \cdots$$
The above expression is simply a quadratic form in b and it can be rewritten after burden-
some algebra in a much more compact manner:
$$p(\mathbf{b}|\mathbf{y}, \mathbf{X}, \sigma) \propto \sigma^{-k}\, e^{-\frac{(\mathbf{b}-\mathbf{b}_*)'\boldsymbol{\Lambda}_*(\mathbf{b}-\mathbf{b}_*)}{2\sigma^2}} \times (\sigma^2)^{-1-a_*}\, e^{-\frac{b_*}{2\sigma^2}}, \qquad (9.9)$$
where
$$\begin{aligned}
\boldsymbol{\Lambda}_* &= \mathbf{X}'\mathbf{X} + \boldsymbol{\Lambda}_0 \\
\mathbf{b}_* &= \boldsymbol{\Lambda}_*^{-1}(\boldsymbol{\Lambda}_0\mathbf{b}_0 + \mathbf{X}'\mathbf{X}\hat{\mathbf{b}}) \\
a_* &= a_0 + I/2 \\
b_* &= b_0 + \frac{1}{2}\left(\mathbf{y}'\mathbf{y} + \mathbf{b}_0'\boldsymbol{\Lambda}_0\mathbf{b}_0 - \mathbf{b}_*'\boldsymbol{\Lambda}_*\mathbf{b}_*\right).
\end{aligned}$$
This expression has two parts: the Gaussian component which relates mostly to b, and the
inverse gamma component, entirely dedicated to σ. The mix between the prior and the data
is clear. The posterior covariance matrix of the Gaussian part (Λ∗ ) is the sum between the
prior and a quadratic form from the data. The posterior mean b∗ is a weighted average of
the prior b0 and the sample estimator b̂. Such blends of quantities estimated from data and
a user-supplied version are often called shrinkages. For instance, the original matrix of
cross-terms X0 X is shrunk towards the prior Λ0 . This can be viewed as a regularization
procedure: the pure fit originating from the data is mixed with some ‘external’ ingredient
to give some structure to the final estimation.
The interested reader can also have a look at section 16.3 of Greene (2018) (the case of
conjugate priors is treated in subsection 16.3.2).
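To make the shrinkage interpretation tangible, the NumPy sketch below (not from the book) computes the posterior mean b_* as a blend of a prior mean and the OLS estimator; the simulated data and the prior precision are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
I, K = 200, 3
X = rng.normal(size=(I, K))
b_true = np.array([0.5, -0.2, 0.1])
y = X @ b_true + rng.normal(scale=0.5, size=I)

b_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimator
b_0 = np.zeros(K)                           # prior mean (shrinkage target)
Lambda_0 = 50 * np.eye(K)                   # prior precision: larger values imply stronger shrinkage

Lambda_star = X.T @ X + Lambda_0                                          # posterior precision
b_star = np.linalg.solve(Lambda_star, Lambda_0 @ b_0 + X.T @ X @ b_hat)   # posterior mean
print(np.round(b_hat, 3), np.round(b_star, 3))   # b_star lies between b_0 and b_hat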
$$P[\mathbf{y}|\mathbf{X}] = \frac{P[\mathbf{X}|\mathbf{y}]\,P[\mathbf{y}]}{P[\mathbf{X}]} \propto P[\mathbf{X}|\mathbf{y}]\,P[\mathbf{y}], \qquad (9.10)$$
and then split the input matrix into its column vectors $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_K)$. This yields
$$P[\mathbf{y}|\mathbf{x}_1, \dots, \mathbf{x}_K] \propto P[\mathbf{x}_1, \dots, \mathbf{x}_K|\mathbf{y}]\,P[\mathbf{y}]. \qquad (9.11)$$
The ‘naïve’ qualification of the method comes from a simplifying assumption on the features.1 If they are all mutually independent, then the likelihood in the above expression can be expanded into
$$P[\mathbf{y}|\mathbf{x}_1, \dots, \mathbf{x}_K] \propto P[\mathbf{y}] \prod_{k=1}^{K} P[\mathbf{x}_k|\mathbf{y}]. \qquad (9.12)$$
The next step is to be more specific about the likelihood. This can be done non-
parametrically (via kernel estimation) or with common distributions (Gaussian for con-
tinuous data, Bernoulli for binary data). In factor investing, the features are continuous,
thus the Gaussian law is more adequate:
$$P[x_{i,k} = z \,|\, y_i = c] = \frac{e^{-\frac{(z - m_c)^2}{2\sigma_c^2}}}{\sigma_c\sqrt{2\pi}},$$
where c is the value of the classes taken by y and σc and mc are the standard error and mean
of xi,k , conditional on yi being equal to c. In practice, each class is spanned, the training set
is filtered accordingly and σc and mc are taken to be the sample statistics. This Gaussian
parametrization is probably ill-suited to our dataset because the features are uniformly distributed. Even after conditioning, it is unlikely that the distribution will be even remotely close to Gaussian. Technically, this can be overcome via a double transformation method.

1 This assumption can be relaxed, but the algorithms then become more complex and are out of the scope of the current book. One such example that generalizes the naïve Bayes approach is Friedman et al. (1997).
Given a vector of features $\mathbf{x}_k$ with empirical cdf $F_{\mathbf{x}_k}$, the variable
$$\tilde{\mathbf{x}}_k = \Phi^{-1}\left(F_{\mathbf{x}_k}(\mathbf{x}_k)\right), \qquad (9.13)$$
where $\Phi$ is the cdf of the standard normal distribution, will have a standard normal law whenever $F_{\mathbf{x}_k}$ is not pathological. Non-pathological cases
are when the cdf is continuous and strictly increasing and when observations lie in the
open interval (0,1). If all features are independent, the transformation should not have any
impact on the correlation structure. Otherwise, we refer to the literature on the NORmal-
To-Anything (NORTA) method (see, e.g., Chen (2001) and Coqueret (2017)).
Lastly, the prior P [y] in Equation (9.12) is often either taken to be uniform across the
classes (1/K for all k) or equal to the sample distribution.
We illustrate the naïve Bayes classification tool with a simple example. Below, since the features are uniformly distributed, the transformation in (9.13) amounts to applying the Gaussian quantile function (inverse cdf).
For visual clarity, we only use the small set of features.
quantile = QuantileTransformer(output_distribution='normal')
gauss_features_train = quantile.fit_transform(# for data normalization
training_sample[features_short]*0.999 + 0.0001)
# Train Features smaller than 1 and larger than 0
gauss_features_test = quantile.fit_transform(
testing_sample[features_short]*0.999 + 0.0001)
# Test Features smaller than 1 and larger than 0
fit_NB_gauss = GaussianNB() # Classifiers
fit_NB_gauss.fit(gauss_features_train, y_c_train)# Fit the model
data_GNB=pd.DataFrame(fit_NB_gauss.predict(
gauss_features_test),columns=['proba']) # Prediction from the model
data_GNB_cond=pd.concat([data_GNB,pd.DataFrame(
gauss_features_test,columns=features_short)],axis=1)
df_TRUE=data_GNB_cond.loc[data_GNB_cond['proba']==1,features_short]
# dataframe for class TRUE
df_FALSE=data_GNB_cond.loc[data_GNB_cond['proba']==0,features_short]
# dataframe for class FALSE
fig = plt.figure()
ax1 = fig.add_subplot(121) # Preparing the axis for subplots
ax2 = fig.add_subplot(122) # Preparing the axis for subplots
df_TRUE.plot.kde(bw_method=3,title='TRUE',ax=ax1,legend=False)
df_FALSE.plot.kde(bw_method=3,title='FALSE',ax=ax2,legend=False)
handles, labels = ax2.get_legend_handles_labels()
fig.legend(handles, labels, loc="upper left", bbox_to_anchor=(0.8,0.8))
plt.figure(figsize=(15,6))
plt.show();
FIGURE 9.1: Distributions of predictor variables, conditional on the class of the label.
TRUE is when the instance corresponds to an above median return and FALSE to a below
median return.
The plots in Figure 9.1 show the distributions of the features, conditionally on each value of
the label. Essentially, those are the densities P [xk |y]. For each feature, both distributions
are very similar.
As usual, once the model has been trained, the accuracy of predictions can be evaluated.
hitratio=np.mean(fit_NB_gauss.predict(
gauss_features_test)==testing_sample['R1M_Usd_C'].values)# Hit ratio
print(f'Hit Ratio: {hitratio}')
where $\epsilon$ is a Gaussian noise with variance $\sigma^2$, and the $T_m = T_m(q_m, \mathbf{w}_m, \mathbf{x})$ are decision trees with structure $q_m$ and weight vectors $\mathbf{w}_m$. This decomposition of the tree is the one we used for boosted trees and is illustrated in Figure 6.5. $q_m$ codes all splits (variables chosen for the splits and levels of the splits) and the vectors $\mathbf{w}_m$ correspond to the leaf values (at the terminal nodes).
At the macro-level, BARTs can be viewed as traditional Bayesian objects, where the parameters $\theta$ are all of the unknowns coded through $q_m$, $\mathbf{w}_m$ and $\sigma^2$, and where the focus is set on determining the posterior distribution of these parameters, given the data. Given particular forms of priors for $q_m$, $\mathbf{w}_m$, $\sigma^2$, the algorithm draws the parameters using a combination of Metropolis-Hastings and Gibbs sampling, which we outline below.
9.5.2 Priors
The definition of priors in tree models is delicate and intricate. The first important assump-
tion is independence: independence between σ 2 and all other parameters and independence
between trees, that is, between couples $(q_m, \mathbf{w}_m)$ and $(q_n, \mathbf{w}_n)$ for $m \neq n$. This assumption makes BARTs closer to random forests in spirit and further from boosted trees. This independence entails
$$P\left((q_1, \mathbf{w}_1), \dots, (q_M, \mathbf{w}_M), \sigma^2\right) = P(\sigma^2) \prod_{m=1}^{M} P(q_m, \mathbf{w}_m).$$
Moreover, it is customary (for simplicity) to separate the structure of the tree (qm ) and the
terminal weights (wm ), so that by a Bayesian conditioning
$$P\left((q_1, \mathbf{w}_1), \dots, (q_M, \mathbf{w}_M), \sigma^2\right) = \underbrace{P(\sigma^2)}_{\text{noise term}} \prod_{m=1}^{M} \underbrace{P(\mathbf{w}_m|q_m)}_{\text{tree weights}}\ \underbrace{P(q_m)}_{\text{tree struct.}}. \qquad (9.16)$$
values. The variance σµ2 is chosen such that µµ plus or minus two times σµ2 covers 95% of
the range observed in the training dataset. Those are default values and can be altered by
the user.
Lastly, for computational purposes similar to those of linear regressions, the parameter $\sigma^2$ (the variance of $\epsilon$ in (9.14)) is assumed to follow an inverse Gamma law $IG(\nu/2, \lambda\nu/2)$ akin to that used in Bayesian regressions. The parameters are by default computed from the data
so that the distribution of σ 2 is realistic and prevents overfitting. We refer to the original
article, section 2.2.4, for more details on this topic.
In sum, in addition to M (number of trees), the prior depends on a small number of
parameters: α and β (for the tree structure), µµ and σµ2 (for the tree weights) and ν and λ
(for the noise term).
The complex part is naturally to generate the simulations. Each tree is sampled using the
Metropolis-Hastings method: a tree is proposed, but it replaces the existing one only under
some (possibly random) criterion. This procedure is then repeated in a Gibbs-like fashion.
Let us start with the MH building block. We seek to simulate the conditional distribution
$$(q_m, \mathbf{w}_m) \,|\, (q_{-m}, \mathbf{w}_{-m}, \sigma^2, \mathbf{y}),$$
where $q_{-m}$ and $\mathbf{w}_{-m}$ collect the structures and weights of all trees except for tree number $m$. One tour de force in BART is to simplify the above Gibbs draws to
$$(q_m, \mathbf{w}_m) \,|\, (R_m, \sigma^2),$$
where $R_m$ denotes the partial residual, i.e., $\mathbf{y}$ minus the predictions of all trees except tree $m$.
For simplicity, the third option is often excluded. Once the tree structure is defined (i.e.,
sampled), the terminal weights are independently drawn according to a Gaussian distribu-
tion N (µµ , σµ2 ).
After the tree is sampled, the MH principle requires that it be accepted or rejected based
on some probability. This probability increases with the odds that the new tree increases
the likelihood of the model. Its detailed computation is cumbersome, and we refer to section
2.2 in Sparapani et al. (2019) for details on the matter.
Now, we must outline the overarching Gibbs procedure. First, the algorithm starts with trees
that are simple nodes. Then, a specified number of loops include the following sequential
steps:
Step 1: sample (q₁, w₁) | (R₁, σ²);
Step 2: sample (q₂, w₂) | (R₂, σ²);
...
Step m: sample (q_m, w_m) | (R_m, σ²);
...
Step M: sample (q_M, w_M) | (R_M, σ²); (last tree)
Step M+1: sample σ² given the full residual R = y − Σ_{l=1}^{M} T_l(q_l, w_l, x).
At each step m, the residual Rm is updated with the values from step m − 1. We illustrate
this process in Figure 9.2 in which M = 3. At step 1, a partition is proposed for the first
tree, which is a simple node. In this particular case, the tree is accepted. In this scheme,
the terminal weights are omitted for simplicity. At step 2, another partition is proposed for
the tree, but it is rejected. In the third step, the proposition for the third is accepted.
After the third step, a new value for σ 2 is drawn, and a new round of Gibbs sampling can
commence.
FIGURE 9.2: Diagram of the MH/Gibbs sampling of BARTs. At step 2, the proposed tree
is not validated.
9.5.4 Code
In the following code snippet we resort to only a few parameters, like the power and base,
which are the β and α defined in (9.17). The program is a bit verbose and delivers a few
parametric details.
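A minimal sketch of the training step, assuming the bartpy package (an assumption on our part: the original snippet may rely on a different BART implementation, in which the power and base arguments map to β and α), could look as follows:
from bartpy.sklearnmodel import SklearnModel  # assumed BART implementation

fit_bart = SklearnModel(n_trees=20,    # number of trees M (illustrative value)
                        alpha=0.95,    # 'base' of the tree-structure prior
                        beta=2.0,      # 'power' of the tree-structure prior
                        n_burn=50,     # burn-in posterior draws
                        n_samples=50)  # posterior draws that are kept
fit_bart.fit(training_sample[features_short].values,   # features
             training_sample['R1M_Usd'].values)        # labels (one-month returns)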
Once the model is trained,2 we evaluate its performance by simply computing the hit ratio. The predictions are embedded within the fit variable, under the name ‘yhat.test’.
hitratio = np.mean(fit_bart.predict(
testing_sample[features_short]) * y_test > 0)
print(f'Hit Ratio: {hitratio}')
2 In the case of BARTs, the training consists exactly in the drawing of posterior samples.
10
Validating and tuning
As is shown in Chapters 5 to 11, ML models require user-specified choices before they can
be trained. These choices encompass parameter values (learning rate, penalization intensity,
etc.) or architectural choices (e.g., the structure of a network). Alternative designs in ML
engines can lead to different predictions, hence selecting a good one can be critical. We refer
to the work of Probst et al. (2018) for a study on the impact of hyperparameter tuning on
model performance. For some models (neural networks and boosted trees), the number of
degrees of freedom is so large that finding the right parameters can become challenging. This chapter addresses these issues, but the reader must be aware that there
is no shortcut to building good models. Crafting an effective model is time-consuming and
often the result of many iterations.
and the RMSE is simply the square root of the MSE. It is always possible to generalize these
formulae by adding weights wi to produce heterogeneity in the importance of instances. Let
us briefly comment on the MSE. It is by far the most common loss function in machine
learning, but it is not necessarily the best choice for return prediction in a portfolio
allocation task. If we decompose the loss into its three terms, we get the sum of squared
realized returns, the sum of squared predicted returns and the product between the two
(roughly speaking, a covariance term if we assume zero means). The first term does not
matter. The second controls the dispersion around zero of the predictions. The third term
is the most interesting from the allocator’s standpoint. The negativity of the cross-product
−2yi ỹi is always to the investor’s benefit: either both terms are positive and the model has
recognized a profitable asset, or they are negative and it has identified a bad opportunity.
It is when yi and ỹi don’t have the same sign that problems arise. Thus, compared to the
ỹi2 , the cross-term is more important. Nonetheless, algorithms do not optimize with respect
to this indicator.1
These metrics (MSE and RMSE) are widely used outside ML to assess forecasting errors.
Below, we present other indicators that are also sometimes used to quantify the quality
of a model. In line with the linear regressions, the R2 can be computed in any predictive
exercise.
$$ R^2(y, \tilde{y}) = 1 - \frac{\sum_{i=1}^{I} (y_i - \tilde{y}_i)^2}{\sum_{i=1}^{I} (y_i - \bar{y})^2}, \tag{10.3} $$
where ȳ is the sample average of the label. One important difference with the classical
R2 is that the above quantity can be computed on the testing sample and not on the
training sample. In this case, the R2 can be negative when the mean squared error in
the numerator is larger than the (biased) variance of the testing sample. Sometimes, the
average value ȳ is omitted in the denominator (as in Gu et al. (2020) for instance). The
benefit of removing the average value is that it compares the predictions of the model to a
zero prediction. This is particularly relevant with returns because the simplest prediction
of all is the constant zero value, and the R2 can then measure if the model beats this naïve
benchmark. A zero prediction is always preferable to a sample average because the latter
can be very much period-dependent. Also, removing ȳ in the denominator makes the metric
more conservative as it mechanically reduces the R2 .
Beyond the simple indicators detailed above, several exotic extensions exist and they all
consist in altering the error before taking the averages. Two notable examples are the
Mean Absolute Percentage Error (MAPE) and the Mean Square Percentage Error (MSPE).
Instead of looking at the raw error, they compute the error relative to the original value
(to be predicted). Hence, the error is expressed in a percentage score and the averages are
simply equal to:
$$ \text{MAPE}(y, \tilde{y}) = \frac{1}{I}\sum_{i=1}^{I} \left|\frac{y_i - \tilde{y}_i}{y_i}\right|, \tag{10.4} $$
$$ \text{MSPE}(y, \tilde{y}) = \frac{1}{I}\sum_{i=1}^{I} \left(\frac{y_i - \tilde{y}_i}{y_i}\right)^2, \tag{10.5} $$
1 There are some exceptions, like attempts to optimize more exotic criteria, such as the Spearman rho,
which is based on rankings and is close in spirit to maximizing the correlation between the output and the
prediction. Because this rho cannot be differentiated, this causes numerical issues. These problems can be
partially alleviated when resorting to complex architectures, as in Engilberge et al. (2019).
where the latter can be scaled by a square root if need be. When the label is positive and can take large values, it is possible to rescale the magnitude of errors.
One way to do this is to resort to the Root Mean Squared Logarithmic Error (RMSLE),
defined below:
$$ \text{RMSLE}(y, \tilde{y}) = \sqrt{\frac{1}{I}\sum_{i=1}^{I} \log^2\left(\frac{1 + y_i}{1 + \tilde{y}_i}\right)}, \tag{10.6} $$
where it is obvious that when yi = ỹi , the error metric is equal to zero.
Before we move on to categorical losses, we briefly comment on one shortcoming of the
MSE, which is by far the most widespread metric and objective in regression tasks. A
simple decomposition yields:
$$ \text{MSE}(y, \tilde{y}) = \frac{1}{I}\sum_{i=1}^{I}\left(y_i^2 + \tilde{y}_i^2 - 2 y_i \tilde{y}_i\right). $$
In the sum, the first term is given, there is nothing to be done about it, hence models focus
on the minimization of the other two. The second term is the dispersion of model values.
The third term is a cross-product. While variations in ỹi do matter, the third term is by
far the most important, especially in the cross-section. It is more valuable to reduce the
MSE by increasing yi ỹi . This product is indeed positive when the two terms have the same
sign, which is exactly what an investor is looking for: correct directions for the bets. For
some algorithms (like neural networks), it is possible to manually specify custom losses.
Maximizing the sum of yi ỹi may be a good alternative to vanilla quadratic optimization
(see Section 7.4.3 for an example of implementation).
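As an illustration, a minimal Keras/TensorFlow sketch of such a custom loss (an assumption on our part; the implementation of Section 7.4.3 may differ) is:
import tensorflow as tf

def cross_product_loss(y_true, y_pred):
    # Minimizing minus the average of y * y_hat amounts to maximizing the sum
    # of y_i * y~_i, i.e., rewarding predictions that share the sign of returns.
    return -tf.reduce_mean(y_true * y_pred)

# model.compile(optimizer='RMSprop', loss=cross_product_loss)  # hypothetical usage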
Among the two types of errors, type I is the most daunting for investors because it has
a direct effect on the portfolio. The type II error is simply a missed opportunity and
is somewhat less impactful. Finally, true negatives are those assets which are correctly
excluded from the portfolio.
From the four baseline rates, it is possible to derive other interesting metrics:
• Accuracy = TP + TN is the percentage of correct forecasts;
• Recall = TP/(TP + FN) measures the ability to detect a winning strategy/asset (left column analysis). Also known as sensitivity or true positive rate (TPR);
• Precision = TP/(TP + FP) computes the probability of good investments (top row analysis);
• Specificity = TN/(FP + TN) measures the proportion of actual negatives that are correctly identified as such. Also known as true negative rate (TNR);
• Fallout = FP/(FP + TN) = 1 − Specificity is the proportion of actual negatives that are wrongly classified as positives. Also known as false positive rate (FPR);
• F-score, F1 = 2 × (recall × precision)/(recall + precision), is the harmonic average of recall and precision.
All of these items lie in the unit interval and a model is deemed to perform better when
they increase (except for fallout for which it is the opposite). Many other indicators also
exist, like the false discovery rate or the false omission rate, but they are less mainstream and less frequently cited. Moreover, they are often simple functions of the ones mentioned above.
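In Python, these indicators are readily available in scikit-learn; a small self-contained sketch (with toy vectors) is:
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # toy realized directions
y_hat  = np.array([1, 0, 0, 1, 1, 0, 1, 0])   # toy predicted directions
print(accuracy_score(y_true, y_hat),          # proportion of correct forecasts
      recall_score(y_true, y_hat),            # TP / (TP + FN)
      precision_score(y_true, y_hat),         # TP / (TP + FP)
      f1_score(y_true, y_hat))                # harmonic mean of recall and precision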
A metric that is popular but more complex is the Area Under the (ROC) Curve, often
referred to as AUC. The complicated part is the ROC curve where ROC stands for Receiver
Operating Characteristic; the name comes from signal theory. We explain how it is built
below.
As seen in Chapters 6 and 7, classifiers generate outputs that are probabilities that one
instance belongs to one class. These probabilities are then translated into a class by choosing
the class that has the highest value. In binary classification, the class with a score above
0.5 basically wins.
In practice, this 0.5 threshold may not be optimal and the model could very well correctly
predict false instances when the probability is below 0.4 and true ones otherwise. Hence, it
is a natural idea to test what happens when the decision threshold changes. The ROC curve does just that and plots the recall as a function of the fallout as the threshold varies between zero and one.
When the threshold is equal to one, the model never forecasts positive values, so true positives and false positives are both zero: recall and fallout are equal to zero. When the threshold is equal to zero, every instance is classified as positive, so false negatives and true negatives shrink to zero: recall and fallout are equal to one. The behavior of their relationship between these two extremes is called the ROC curve. We provide stylized examples below in Figure 10.2. A random classifier would fare equally well for recall and fallout and thus the ROC curve would be a straight line
from the point (0,0) to (1,1). To prove this, imagine a sample with a p ∈ (0, 1) proportion
of true instances and a classifier that predicts true randomly with a probability p0 ∈ (0, 1).
Then because the sample and predictions are independent, T P = p0 p, F P = p0 (1 − p),
T N = (1 − p0 )(1 − p) and F N = (1 − p0 )p. Given the above definition, this yields that both
recall and fallout are equal to p0 .
FIGURE 10.2: Stylized ROC curves (recall versus fallout) for a random classifier (the 45° diagonal), a good classifier, and an optimal classifier.
An algorithm with a ROC curve above the 45° angle is performing better than an average
classifier. Indeed, the curve can be seen as a tradeoff between benefits (probability of de-
tecting good strategies on the y axis) minus costs (odds of selecting the wrong assets on the
x axis). Hence, being above the 45° line is paramount. The best possible classifier has a ROC
curve that goes from point (0,0) to point (0,1) to point (1,1). At point (0,1), fallout is null,
hence there are no false positives, and recall is equal to one so that there are also no false
negatives: the model is always right. The opposite is true: at point (1,0), the model is always
wrong.
Below, we compute a ROC curve for a given set of predictions on the testing sample.
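A sketch of this computation with scikit-learn (the names prob_test and y_test_binary, i.e., the predicted probabilities and the realized binary labels, are assumptions) is:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test_binary, prob_test)  # fallout & recall for all thresholds
roc_auc = auc(fpr, tpr)                                     # area under the ROC curve
plt.plot(fpr, tpr)                                          # the ROC curve (Figure 10.3)
plt.plot([0, 1], [0, 1], linestyle='--')                    # 45° benchmark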
print(f'AUC: {roc_auc}')
AUC: 0.5021678143170378
In Figure 10.3, the curve is very close to the 45° angle and the model seems as good (or,
rather, as bad) as a random classifier.
Finally, having one entire curve is not practical for comparison purposes, hence the infor-
mation of the whole curve is synthesized into the area below the curve, i.e., the integral of
the corresponding function. The 45° angle (quadrant bisector) has an area of 0.5 (it is half
the unit square which has a unit area). Thus, any good model is expected to have an area
under the curve (AUC) above 0.5. A perfect model has an AUC of one.
We end this subsection with a word on multiclass data. When the output (i.e., the label)
has more than two categories, things become more complex. It is still possible to compute a
confusion matrix, but the dimension is larger and harder to interpret. The simple indicators
like T P , T N , etc., must be generalized in a non-standard way. The simplest metric in this
case is the cross-entropy defined in Equation (7.10). We refer to Section 6.1.2 for more
details on losses related to categorical labels.
10.2 Validation
Validation is the stage at which a model is tested and tuned before it starts to be deployed
on real or live data (e.g., for trading purposes). Needless to say, it is critical.
$$ y_i = \hat{f}(x_i) + \hat{\epsilon}_i. $$
In the above derivation, f(x) is not random, but f̂(x) is. Also, in the second line, we assumed E[ε̂(f(x) − f̂(x))] = 0, which may not always hold (though it is a very common assumption).
The average squared error thus has three components:
• the variance of the model (over its predictions),
• the squared bias of the model (the average gap between its predictions and the true values, squared), and
• one irreducible error (independent from the choice of a particular model).
The last one is immune to changes in models, so the challenge is to minimize the sum of
the first two. This is known as the variance-bias tradeoff because reducing one often leads
to increasing the other. The goal is thus to assess when a small increase in either one can
lead to a larger decrease in the other.
There are several ways to represent this tradeoff and we display two of them. The first
one relates to archery (see Figure 10.4) below. The best case (top left) is when all shots
are concentrated in the middle: on average, the archer aims correctly and all the arrows
are very close to one another. The worst case (bottom right) is the exact opposite: the
average arrow is above the center of the target (the bias is nonzero) and the dispersion of
arrows is large.
FIGURE 10.4: The archery analogy of the variance-bias tradeoff: the bias is the distance between the average arrow and the center of the target, and the variance is the dispersion of the arrows.
The most often encountered cases in ML are the other two configurations: either the arrows
(predictions) are concentrated in a small perimeter, but the perimeter is not the center of
the target; or the arrows are on average well distributed around the center, but they are,
on average, far from it.
The second way the variance-bias tradeoff is often depicted is via the notion of model complexity. The simplest model of all is a constant one: the prediction is always the
same, for instance equal to the average value of the label in the training set. Of course, this
prediction will often be far from the realized values of the testing set (its bias will be large),
but at least its variance is zero. On the other side of the spectrum, a decision tree with as
many leaves as there are instances has a very complex structure. It will probably have a
smaller bias, but it is not obvious that this will compensate for the increase in variance incurred by the intricacy of the model.
This facet of the tradeoff is depicted in Figure 10.5 below. To the left of the graph, a simple
model has a small variance but a large bias, while to the right it is the opposite for a complex
model. Good models often lie somewhere in the middle, but the best mix is hard to find.
FIGURE 10.5: The variance-bias tradeoff as a function of model complexity: the variance increases with complexity while the squared bias decreases, and the optimal trade-off minimizes the total error.
The most tractable theoretical form of the variance-bias tradeoff is the ridge regression.2
The coefficient estimates in this type of regression are given by b̂_λ = (X'X + λI_N)^{-1} X'Y (see Section 5.1.1), where λ is the penalization intensity. Assuming a true linear form for the data generating process (y = Xb + ε, where b is unknown and σ² is the variance of the errors, which have an identity correlation matrix), this yields
$$ \mathbb{E}[\hat{b}_\lambda] - b = -\lambda (X'X + \lambda I_N)^{-1} b. $$
Basically, this means that the bias of the estimator is equal to −λ(X'X + λI_N)^{-1} b, which is zero in the absence of penalization (classical regression) and converges to a finite limit when λ → ∞, i.e., when the model becomes constant. Note that if the estimator has a zero bias, then predictions will too: E[X(b − b̂)] = 0.
The variance (of estimates) in the case of an unconstrained regression is equal to V[b̂] = σ²(X'X)^{-1}. With a ridge penalty, it becomes
$$ \mathbb{V}[\hat{b}_\lambda] = \sigma^2 (X'X + \lambda I_N)^{-1} X'X (X'X + \lambda I_N)^{-1}, \tag{10.9} $$
and in Equation (10.9), the λ reduces the magnitude of the terms inside the inverse matrices. The overall effect is that as λ increases, the variance decreases; in the limit λ → ∞, the variance is zero and the model is constant. The variance of predictions, V[Xb̂_λ] = X V[b̂_λ] X', shrinks accordingly.
All in all, ridge regressions are very handy because with a single parameter, they are able
to provide a cursor that directly tunes the variance-bias tradeoff.
It is easy to display the tradeoff with ridge regressions. In the example below, we recycle the ridge model trained in Chapter 5.
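A minimal sketch of such an illustration, which refits a scikit-learn Ridge model over a grid of penalization intensities (X_train, y_train, X_test, and y_test are assumed to be the samples used in Chapter 5), is:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

alphas = np.logspace(-2, 4, 20)                    # grid of penalization intensities
bias, var = [], []
for a in alphas:
    pred = Ridge(alpha=a).fit(X_train, y_train).predict(X_test)
    bias.append(np.mean(pred - y_test))            # average error (bias)
    var.append(np.var(pred))                       # dispersion of predictions (variance)
plt.plot(alphas, np.square(bias), label='squared bias')
plt.plot(alphas, var, label='variance')
plt.xscale('log')
plt.legend()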
2 Another angle, critical of neural networks, is provided in Geman et al. (1992).
In Figure 10.6, the pattern is different from the one depicted in Figure 10.5. In the graph,
when the penalization intensity λ increases, the magnitude of the parameters shrinks and the model becomes simpler. Hence, the simplest model seems like the best choice: adding complexity increases variance but does not improve the bias! One possible reason for that is that
features don’t actually carry much predictive value and hence a constant model is just as
good as more sophisticated ones based on irrelevant variables.
The model depicted in Figure 10.7 only has four clusters, which means that the predictions
can only take four values. The smallest one is 0.011 and encompasses a large portion of
the sample (85%), and the largest one is 0.062 and corresponds to only 4% of the training
sample. We are then able to compute the bias and the variance of the predictions on the
testing set.
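A sketch of the bias computation (assuming, as in the variance chunk below, that the fitted tree is stored in fit_tree_simple and that X_test and y_test hold the testing features and labels) is:
bias_tree = np.mean(fit_tree_simple.predict(X_test) - y_test)  # average prediction error
print(f'bias: {bias_tree}')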
bias: 0.004973916538330352
var_tree = np.var(fit_tree_simple.predict(X_test))
print(f'var: {var_tree}')
var: 0.0001397982854475224
On average, the error is slightly positive, with an overall overestimation of 0.005. As ex-
pected, the variance is very small (of the order of 10^{-4}).
For the complex model, we take the boosted tree that was obtained in Section 6.4.6
(fit_xgb). The model aggregates 40 trees with a maximum depth of 4, it is thus undoubtedly
more complex.
bias: 0.019378203027941212
var_xgb = np.var(fit_xgb.predict(test_matrix_xgb))
print(f'var: {var_xgb}')
var: 0.0011795820901170373
The bias is indeed smaller compared to that of the simple model, but in exchange, the
variance increases substantially. The net effect (via the squared bias) is in favor of the
simpler model.
FIGURE 10.8: Illustration of overfitting: a model closely matching training data is rarely a
good idea.
In addition to these options, random forests allow the user to control the number of trees in the forest. Theoretically (see Breiman (2001)), this parameter is not supposed to impact the
risk of overfitting because new trees only help reduce the total error via diversification. In
practice, and for the sake of computation times, it is not recommended to go beyond 1,000
trees. Two other hyperparameters are the subsample size (on which each learner is trained)
and the number of features retained for learning. They do not have a straightforward impact on the bias-variance tradeoff, but rather on raw performance. For instance, if subsamples are too small, the trees will not learn enough. The same problem arises if the number of features is too low. On the
other hand, choosing a large number of predictors (i.e., close to the total number) may lead
to high correlations between each learner’s prediction because the overlap in information
contained in the training samples may be high.
Boosted trees have other options that can help alleviate the risk of overfitting. The most
obvious one is the learning rate, which discounts the impact of each new tree by η ∈ (0, 1).
When the learning rate is high, the algorithm learns too quickly and is prone to sticking
close to the training data. When it’s low, the model learns very progressively, which can
be efficient if there are sufficiently many trees in the ensemble. Indeed, the learning rate
and the number of trees must be chosen synchronously: if both are low, the ensemble will
learn nothing, and if both are large, it will overfit. The arsenal of boosted tree parameters
does not stop there. The penalizations, both of score values and of the number of leaves, are
naturally a tool to prevent the model from going too deep into the particularities of the training
sample. Finally, constraints of monotonicity like those mentioned in Section 6.4.5 are also
an efficient way to impose some structure on the model and force it to detect particular
patterns.
Lastly, neural networks also have many options aimed at protecting them against overfitting. Just like for boosted trees, some of them are the learning rate and the penalization of
weights and biases (via their norm). Constraints, like non-negative constraints, can also help
when the model theoretically requires positive inputs. Finally, dropout is always a direct
way to reduce the dimension (number of parameters) of a network.
It is nonetheless shown in Bergstra and Bengio (2012) that random exploration is preferable
to grid search.
Both grid and random searches are suboptimal because they are likely to spend time in
zones of the parameter space that are irrelevant, thereby wasting computation time. Given
a number of parameter points that have been tested, it is preferable to focus the search in
areas where the best points are the most likely. This is possible via an iterative process
that adapts the search after each new point has been tested. In the large field of finance, a
few papers dedicated to tuning are Lee (2020) and Nystrup, Lindstrom, and Madsen (2020).
One other popular approach in this direction is Bayesian optimization (BO). The central
object is the objective function of the learning process. We call this function O and it can
be widely seen as a loss function possibly combined with penalization and constraints. For
simplicity here, we will not mention the training/testing samples and they are considered
to be fixed. The variable of interest is the vector p = (p1 , . . . , pl ) which synthesizes the
hyperparameters (learning rate, penalization intensities, number of models, etc.) that have
an impact on O. The program we are interested in is
$$ p^* = \underset{p}{\text{argmin}} \ O(p). $$
The main problem with this optimization is that the computation of O(p) is very costly.
Therefore, it is critical to choose each trial for p wisely. One key assumption of BO is that
the distribution of O is Gaussian and that O can be proxied by a linear combination of the
pl . Said differently, the aim is to build a Bayesian linear regression between the input p and
the output (dependent variable) O. Once a model has been estimated, the information that
is concentrated in the posterior density of O is used to make an educated guess at where to
look for new values of p.
This educated guess is made based on a so-called acquisition function. Suppose we have tested m values for p, which we write p^(1), ..., p^(m). The current best parameter is written p*_m = argmin_{1≤k≤m} O(p^(k)). If we test a new point p, then it will lead to an improvement only if O(p) < O(p*_m), that is, if the new objective improves the minimum value that we already know. The average value of this improvement is
$$ EI_m(p) = \mathbb{E}_m\left[\left[O(p_m^*) - O(p)\right]_+\right], $$
where the positive part [·]_+ emphasizes that when O(p) ≥ O(p*_m), the gain is zero. The expectation is indexed by m because it is computed with respect to the posterior distribution of O(p) based on the m samples p^(1), ..., p^(m). The best choice for the next sample p^(m+1) is then
$$ p^{(m+1)} = \underset{p}{\text{argmax}} \ EI_m(p), \tag{10.12} $$
which corresponds to the maximum location of the expected improvement. Instead of the EI,
the optimization can be performed on other measures, like the probability of improvement,
which is Pm [O(p) < O(p∗m )].
In compact form, the iterative process can be outlined as follows:
• step 1: compute O(p(m) ) for m = 1, . . . , M0 values of parameters.
• step 2a: compute sequentially the posterior density of O on all available points.
• step 2b: compute the optimal new point to test, p^(m+1), given in Equation (10.12).
• step 2c: evaluate the objective O(p^(m+1)) at this new point and add it to the set of tested values.
• step 3: repeat steps 2a to 2c as long as deemed reasonable and return the p^(m) that yields the smallest objective value.
The interested reader can have a look at Snoek et al. (2012) and Frazier (2018) for more
details on the numerical facets of this method.
Finally, for the sake of completeness, we mention a last way to tune hyperparameters. Since
the optimization scheme is argmin_p O(p), a natural way to proceed would be to use the
sensitivity of O with respect to p. Indeed, if the gradient ∂O/∂p_l is known, then a gradient
descent will always improve the objective value. The problem is that it is hard to compute
a reliable gradient (finite differences can become costly). Nonetheless, some methods (e.g.,
Maclaurin et al. (2015)) have been applied successfully to optimize over large dimensional
parameter spaces.
We conclude by mentioning the survey Bouthillier and Varoquaux (2020), which spans two
major AI conferences that took place in 2019. It shows that most papers resort to hyperpa-
rameter tuning. The two most often cited methods are manual tuning (hand-picking) and
grid search.
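In the example below, a boosted tree is tuned over a small grid of hyperparameter values. The grid can be defined as follows (the values match those printed just after):
params = {'learning_rate': [0.1, 0.3, 0.5, 0.7, 0.9],   # shrinkage of each new tree
          'n_estimators': [10, 50, 100],                # number of trees
          'reg_lambda': [0.01, 0.1, 1, 10, 100]}        # penalization intensity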
print(params)
{'learning_rate': [0.1, 0.3, 0.5, 0.7, 0.9], 'n_estimators': [10, 50, 100],
'reg_lambda': [0.01, 0.1, 1, 10, 100]}
We choose the mean squared error to evaluate the impact of hyperparameter values.
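The search itself can be run with scikit-learn's GridSearchCV; the sketch below is an assumption (the model, scoring rule, and number of folds are illustrative) and simply produces the cv_results object used afterwards:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

grid = GridSearchCV(xgb.XGBRegressor(),        # model to be tuned (assumed)
                    params,                    # grid defined above
                    scoring='neg_mean_squared_error',
                    cv=3)                      # number of folds (assumed)
grid.fit(X_train, y_train)                     # assumed training matrices
cv_results = grid.cv_results_                  # dictionary of results used below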
res_df = pd.DataFrame(cv_results,
columns=["param_n_estimators","param_learning_rate",
"param_reg_lambda","mean_test_score"])
#Note, MAE is made negative in scikit-learn so that it can be maximized.
#As such, we can ignore the sign and assume all errors are positive.
res_df['mean_test_score']=-res_df['mean_test_score'].values
fig, axes = plt.subplots(figsize=(16, 9),nrows=3, ncols=5)
ax_all = plt.gca()
cnt = 0
for param, tmp in res_df.groupby(["param_n_estimators", "param_reg_lambda"]):
    ax = axes[cnt//5][cnt%5]                         # get the ax
    np.round(tmp[["param_learning_rate", "mean_test_score"]], 2).plot.bar(
        ax=ax, x="param_learning_rate", y="mean_test_score",
        alpha=0.5, legend=None)
    ax.set_title(f"num_trees={param[0]},\nreg_lambda={param[1]}", fontsize=10)
    cnt = cnt + 1
FIGURE 10.9: Plot of error metrics (MSEs) for many parameter values. Each row of graphs corresponds to a value of n_estimators and each column to a value of reg_lambda.
In Figure 10.9, the main information is that a small learning rate (η = 0.1) is detrimental to the quality of the forecasts. This remains true even when the number of trees is large (n_estimators=100), which means that the algorithm does not learn enough.
Grid search can be performed in two stages: the first stage helps locate the zones that are of
interest (with the lowest loss/objective values) and then zoom in on these zones with refined
values for the parameter on the grid. With the results above, this would mean considering
many learners (more than 50, possibly more than 100), and avoiding large learning rates
such as η = 0.9 or η = 0.8.
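The opt object used below is the Bayesian optimizer; one possible definition (a sketch with the scikit-optimize package and illustrative parameter ranges, not necessarily those of the original chunk) is:
from skopt import BayesSearchCV
from skopt.space import Real, Integer
import xgboost as xgb

opt = BayesSearchCV(
    xgb.XGBRegressor(),                                  # model to tune (assumed)
    {'learning_rate': Real(0.1, 0.9),                    # assumed search ranges
     'n_estimators': Integer(10, 100),
     'reg_lambda': Real(0.01, 100, prior='log-uniform')},
    n_iter=25,                                           # number of BO evaluations
    scoring='neg_mean_absolute_error',                   # consistent with the comment below
    cv=3)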
opt.fit(X_train,y_train)
cv_results_opt=pd.DataFrame(opt.cv_results_)
res_df = pd.DataFrame(cv_results_opt,
columns =["param_n_estimators",
"param_learning_rate",
"param_reg_lambda","mean_test_score"])
# Note, MAE is made negative in scikit-learn so that it can be maximized.
# As such, we can ignore the sign and assume all errors are positive.
res_df['mean_test_score']=-res_df['mean_test_score'].values
fig, axes = plt.subplots(figsize=(16, 9),nrows=3, ncols=5)
ax_all = plt.gca()
cnt = 0
for param, tmp in res_df.groupby(["param_n_estimators", "param_reg_lambda"]):
    ax = axes[cnt//5][cnt%5]                         # get the ax
    np.round(tmp[["param_learning_rate", "mean_test_score"]], 2).plot.bar(
        ax=ax, x="param_learning_rate", y="mean_test_score",
        alpha=0.5, legend=None)
    ax.set_xlabel("")                                # no xlabel
    ax.set_ylim(0, 0.1)                              # set y range
    if cnt//5 < 2:
        ax.xaxis.set_ticklabels("")
    else:
        for label in ax.get_xticklabels():
            label.set_rotation(0)
    if cnt%5 > 0:
        ax.yaxis.set_ticklabels("")
    ax.set_title(f"num_trees={param[0]},\nreg_lambda={param[1]}", fontsize=10)  # set title
    cnt = cnt + 1                                    # update
FIGURE 10.10: Relationship between (minus) the loss and hyperparameter values.
The second major option is when the model is updated (retrained) at each rebalancing. The
underlying idea here is that the structure of returns evolves through time and a dynamic
model will capture the most recent trends. The drawback is that validation must (should?)
be rerun at each rebalancing date.
Let us recall the dimensions of backtests: number of strategies: possibly dozens or hun-
dreds, or even more; number of trading dates: hundreds for monthly rebalancing; number
of assets: hundreds or thousands; number of features: dozens or hundreds.
Even with a lot of computational power (GPUs, etc.), training many models over many dates
is time-consuming, especially when it comes to hyperparameter tuning when the parameter
space is large. Thus, validating models at each trading date of the out-of-sample period is
not realistic.
One solution is to keep an early portion of the training data and to perform a smaller
scale validation on this subsample. Hyperparameters are tested on a limited number of
dates and most of the time, they exhibit stability: satisfactory parameters for one date are
usually acceptable for the next one and the following one as well. Thus, the full backtest can
be carried out with these values when updating the models at each period. The backtest
nonetheless remains compute-intensive because the model has to be retrained with the most
recent data for each rebalancing date.
11
Ensemble models
Let us be honest. When facing a prediction task, it is not obvious how to determine the best choice among ML tools: penalized regressions, tree methods, neural networks, SVMs, etc.
A natural and tempting alternative is to combine several algorithms (or the predictions
that result from them) to try to extract value out of each engine (or learner). This intention
is not new and contributions towards this goal go back at least to Bates and Granger (1969)
(for the purpose of passenger flow forecasting).
Below, we outline a few books on the topic of ensembles. The latter have many names and
synonyms, such as forecast aggregation, model averaging, mixture of experts or
prediction combination. The first four references below are monographs, while the last
two are compilations of contributions:
• Zhou (2012): a very didactic book that covers the main ideas of ensembles;
• Schapire and Freund (2012): the main reference for boosting (and hence, ensembling)
with many theoretical results and thus strong mathematical groundings;
• Claeskens and Hjort (2008): an overview of model selection techniques with a few
chapters focused on model averaging;
matrix E. A linear combination of models has sample errors equal to Ew, where the w_m are the weights assigned to each model and we assume w'1_M = 1. Minimizing the total (squared) error is thus a simple quadratic program with a unique constraint. The Lagrange function is L(w) = w'E'Ew − λ(w'1_M − 1) and hence
$$ \frac{\partial}{\partial w} L(w) = E'Ew - \lambda \mathbf{1}_M = 0 \quad \Leftrightarrow \quad w = \lambda (E'E)^{-1}\mathbf{1}_M, $$
and the constraint imposes
$$ w^* = \frac{(E'E)^{-1}\mathbf{1}_M}{\mathbf{1}_M' (E'E)^{-1}\mathbf{1}_M}. $$
This form is similar to that of minimum variance portfolios. If errors are unbiased (1'_I E = 0'_M), then E'E is the covariance matrix of the errors.
This expression shows an important feature of optimized linear ensembles: they can only
add value if the models tell different stories. If two models are redundant, E0 E will be
close to singular and w∗ will arbitrage one against the other in a spurious fashion. This
is the exact same problem as when mean-variance portfolios are constituted with highly
correlated assets: in this case, diversification fails because when things go wrong, all assets
go down. Another problem arises when the number of observations is too small compared
to the number of assets so that the covariance matrix of returns is singular. This is not an
issue for ensembles because the number of observations will usually be much larger than
the number of models (I >> M ).
In the limit when correlations increase to one, the above formulation becomes highly un-
stable and ensembles cannot be trusted. One heuristic way to see this is when M = 2
and
$$ E'E = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \quad \Leftrightarrow \quad (E'E)^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix} \sigma_1^{-2} & -\rho(\sigma_1\sigma_2)^{-1} \\ -\rho(\sigma_1\sigma_2)^{-1} & \sigma_2^{-2} \end{pmatrix}, $$
so that when ρ → 1, the model with the smallest errors (minimum σi2 ) will see its weight
increasing towards infinity, while the other model will have a similarly large negative
weight: the model arbitrages between two highly correlated variables. This seems like a
very bad idea.
There is another illustration of the issues caused by correlations. Let’s assume we face M
correlated errors ε_m with pairwise correlation ρ, zero mean and variance σ². The variance of the average error is
" #
M M
1 X 2 1 X 2 X
E = 2 m + n m
M m=1 m M m=1 m6=n
2
σ 1 X 2
= + 2 ρσ
M M
n6=m
σ 2 (1 − ρ)
= ρσ 2 +
M
where the second term converges to zero as M increases, while the first term remains and is
linearly increasing with ρ . In passing, because variances are always positive, this result
implies that the common pairwise correlation between M variables is bounded below by
−(M − 1)−1 . This result is interesting but rarely found in textbooks.
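The result can be checked numerically; the short simulation below (with arbitrary values for M, ρ, and σ) compares the sample variance of the average error with the formula above:
import numpy as np

M, rho, sigma, n_sims = 10, 0.8, 0.05, 100000
cov = sigma**2 * ((1 - rho) * np.eye(M) + rho * np.ones((M, M)))   # equicorrelation matrix
eps = np.random.multivariate_normal(np.zeros(M), cov, size=n_sims)
print(eps.mean(axis=1).var())                       # simulated variance of the average error
print(rho * sigma**2 + sigma**2 * (1 - rho) / M)    # theoretical value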
One way to mitigate these issues is to impose constraints on the weights:
$$ \underset{w}{\text{argmin}} \ w'E'Ew, \quad \text{s.t.} \quad \left\{\begin{array}{l} w'\mathbf{1}_M = 1, \\ w_m \ge 0 \ \ \forall m. \end{array}\right. $$
Mechanically, if several models are highly correlated, the constraint will impose that only
one of them will have a non-zero weight. If there are many models, then just a few of them
will be selected by the minimization program. In the context of portfolio optimization,
Jagannathan and Ma (2003) have shown the benefits of constraints in the construction of mean-variance allocations. In our setting, the constraint will similarly help discriminate
wisely among the ‘best’ models.
In the literature, forecast combination and model averaging (which are synonyms of ensem-
bles) have been tested on stock markets as early as in Von Holstein (1972). Surprisingly,
the articles were not published in Finance journals but rather in fields such as Management
(Virtanen and Yli-Olli (1987), Wang et al. (2012)), Economics and Econometrics (Donald-
son and Kamstra (1996), Clark and McCracken (2009) and Mascio et al. (2021)), Operations
Research (Huang et al. (2005), Leung et al. (2001), and Bonaccolto and Paterlini (2019)),
and Computer Science (Harrald and Kamstra (1997) and Hassan et al. (2007)).
In the general forecasting literature, many alternative (refined) methods for combining fore-
casts have been studied. Trimmed opinion pools (Grushka-Cockayne et al. (2016)) compute
averages over the predictions that are not too extreme. We refer to Gaba et al. (2017) for
a more exhaustive list of combinations as well as for an empirical study of their respective
efficiency. Overall, findings are mixed and the heuristic simple average is, as usual, hard to
beat (see, e.g., Genre et al. (2013)).
11.1.2 Example
In order to build an ensemble, we must gather the predictions and the corresponding errors
into the E matrix. We will work with five models that were trained in the previous chapters:
penalized regression, simple tree, random forest, xgboost, and feed-forward neural network.
The training errors have zero means, hence E0 E is the covariance matrix of errors between
models.
err_pen_train = fit_pen_pred.predict(
X_penalized_train)-training_sample['R1M_Usd'] # Reg.
err_tree_train = fit_tree.predict(
training_sample[features])-training_sample['R1M_Usd'] # Tree
err_RF_train = fit_RF.predict(
training_sample[features])-training_sample['R1M_Usd'] # RF
err_XGB_train = fit_xgb.predict(
train_matrix_xgb)-training_sample['R1M_Usd'] # XGBoost
err_NN_train = model_NN.predict(
    training_sample[features_short])-training_sample['R1M_Usd'].values.reshape((-1,1)) # NN
E = pd.concat(
    [err_pen_train, err_tree_train, err_RF_train, err_XGB_train,
     pd.DataFrame(err_NN_train)], axis=1)                            # E matrix
E.set_axis(['Pen_reg','Tree','RF','XGB','NN'], axis=1, inplace=True) # Names
E.corr().mean()
Pen_reg 0.993649
Tree 0.994361
RF 0.989791
XGB 0.985826
NN 0.994391
dtype: float64
As is shown by the correlation matrix, the models fail to generate heterogeneity in their
predictions. The minimum correlation (though above 95%!) is obtained by the boosted
tree models. Below, we compare the training accuracy of models by computing the average
absolute value of errors.
abs(E).mean() # Mean absolute error of the columns of E
Pen_reg 0.083459
Tree 0.083621
RF 0.074806
XGB 0.084048
NN 0.083627
dtype: float64
The best performing ML engine is the random forest. The boosted tree model is the worst,
by far. Below, we compute the optimal (non-constrained) weights for the combination of
models.
w_ensemble = np.linalg.inv(E.values.T @ E.values) @ np.ones(5) # Optimal weights
w_ensemble /= np.sum(w_ensemble)
w_ensemble
Because of the high correlations, the optimal weights are not balanced and diversified: they
load heavily on the random forest learner (best in sample model) and ‘short’ a few models
in order to compensate. As one could expect, the model with the largest negative weights
(Pen_reg) has a very high correlation with the random forest algorithm (0.997).
Note that the weights are of course computed with training errors. The optimal combina-
tion is then tested on the testing sample. Below, we compute out-of-sample (testing) errors
and their average absolute value.
err_pen_test=fit_pen_pred.predict(
X_penalized_test)-testing_sample['R1M_Usd'] # Reg.
err_tree_test = fit_tree.predict(
testing_sample[features])-testing_sample['R1M_Usd'] # Tree
err_RF_test = fit_RF.predict(
testing_sample[features])-testing_sample['R1M_Usd'] # RF
err_XGB_test = fit_xgb.predict(
test_matrix_xgb)-testing_sample['R1M_Usd'] # XGBoost
err_NN_test = model_NN.predict(
    testing_sample[features_short])-testing_sample['R1M_Usd'].values.reshape((-1,1)) # NN
E_test= pd.concat(
[err_pen_test,err_tree_test,err_RF_test,err_XGB_test,
pd.DataFrame(err_NN_test,index=testing_sample.index)],axis=1)
# E_test matrix
E_test.set_axis(['Pen_reg','Tree','RF','XGB','NN'],axis=1,inplace=True)
# Names
abs(E_test).mean() # Mean absolute error of the columns of E_test
Pen_reg 0.066182
Tree 0.066535
RF 0.067986
XGB 0.068569
NN 0.066613
dtype: float64
The boosted tree model is still the worst performing algorithm, while the simple models
(regression and simple tree) are the ones that fare the best. The most naïve combination is
the simple average of the models' predictions.
err_EW_test = np.mean(np.abs(E_test.mean(axis=1)))
# equally weight combination
print(f'equally weight combination: {err_EW_test}')
err_opt_test =np.mean(np.abs(E_test.values@w_ensemble))
# Optimal unconstrained combination
print(f'Optimal unconstrained combination: {err_opt_test}')
Again, the result is disappointing because of the lack of diversification across models. The
correlations between errors are high not only on the training sample, but also on the testing
sample, as shown below.
The leverage from the optimal solution only exacerbates the problem and underperforms
the heuristic uniform combination. We end this section with the constrained formulation of
Breiman (1996) for the quadratic optimisation. If we write Σ for the covariance matrix of
errors, we seek
$$ w^* = \underset{w}{\text{argmin}} \ w'\Sigma w, \quad \text{s.t.} \quad \mathbf{1}'w = 1, \ w_i \ge 0. $$
In matrix form (here written with three models for the sake of illustration), the constraints compare
$$ Aw = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} w \quad \text{to} \quad b = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, $$
where the first line will be an equality (weights sum to one), and the last three will be
inequalities (weights are all positive).
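A sketch of this constrained optimization with scipy (an assumption on our part; a dedicated quadratic programming solver would work just as well) is:
import numpy as np
from scipy.optimize import minimize

Sigma = E.values.T @ E.values                              # covariance-like matrix of training errors
n = Sigma.shape[0]
cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1})      # weights sum to one
bnds = [(0, None)] * n                                     # non-negative weights
res = minimize(lambda w: w @ Sigma @ w, np.ones(n) / n,    # start from equal weights
               bounds=bnds, constraints=cons)
w_constrained = res.x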
Compared to the unconstrained solution, the weights are sparse and concentrated in one
model, usually the one with small training sample errors.
model_stack = keras.Sequential()
# This defines the structure of the network, i.e., how layers are organized
model_stack.add(layers.Dense(8, activation="relu", input_shape=(nb_mods,)))
model_stack.add(layers.Dense(4, activation="tanh"))
model_stack.add(layers.Dense(1))
The configuration is very simple. We do not include any optional arguments and hence the
model is likely to overfit. As we seek to predict returns, the loss function is the standard L2
norm.
model_stack.compile(optimizer='RMSprop',
# Optimisation method (weight updating)
loss='mse',# Loss function
metrics=['MeanAbsoluteError']) # Output metric
model_stack.summary() # Model architecture
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_3 (Dense) (None, 8) 48
dense_4 (Dense) (None, 4) 36
dense_5 (Dense) (None, 1) 5
=================================================================
Total params: 89
Trainable params: 89
Non-trainable params: 0
y_tilde = E.values + np.tile(
    training_sample['R1M_Usd'].values.reshape(-1, 1), nb_mods)  # Train preds
y_test = E_test.values + np.tile(
    testing_sample['R1M_Usd'].values.reshape(-1, 1), nb_mods)   # Testing preds
fit_NN_stack = model_stack.fit(y_tilde,# Train features
NN_train_labels,# Train labels
batch_size=512,# Train parameters
epochs=12,# Train parameters
verbose=1,# Show messages
validation_data=(y_test,NN_test_labels))
# Test features & labels
show_history(fit_NN_stack )
# Show training plot
The performance of the ensemble is again disappointing: the learning curve is flat in Figure
11.2, hence the rounds of back-propagation are useless. The training adds little value which
means that the new overarching layer of ML does not enhance the original predictions.
Again, this is because all ML engines seem to be capturing the same patterns, and both
their linear and non-linear combinations fail to improve their performance.
11.3 Extensions
11.3.1 Exogenous variables
In a financial context, macro-economic indicators could add value to the process. It is pos-
sible that some models perform better under certain conditions, and exogenous predictors
can help introduce a flavor of economic-driven conditionality in the predictions.
Adding macro-variables to the set of predictors (here, predictions) ỹi,m could seem like one
way to achieve this. However, this would amount to mixing predicted values with (possibly
scaled) economic indicators, and that would not make much sense.
One alternative outside the perimeter of ensembles is to train simple trees on a set of
macro-economic indicators. If the labels are the (possibly absolute) errors stemming from
the original predictions, then the trees will create clusters of homogeneous error values. This
will hint towards which conditions lead to the best and worst forecasts. We test this idea
below, using aggregate data from the Federal Reserve Bank of St. Louis. We download and
format the data in the next chunk.
macro_cond = pd.read_csv("macro_cond.csv")
# Term spread, inflation and Consumer Price Index
macro_cond["Index"] = pd.to_datetime(macro_cond["date"]) + pd.offsets.MonthBegin(-1)
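The merged dataset ens_data used below combines the out-of-sample errors with the macro variables; a sketch of its construction (the column names inflation and termspread are those used in the chunk that follows, the rest is an assumption) is:
ens_data = pd.DataFrame({'err_NN_test': err_NN_test.flatten()},    # errors from previous section
                        index=testing_sample.index)
ens_data['date'] = testing_sample['date'].values
ens_data = ens_data.merge(macro_cond[['Index', 'inflation', 'termspread']],
                          left_on='date', right_on='Index', how='left')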
We can now build a tree that tries to explain the accuracy of models as a function of
macro-variables.
X_ens = ens_data[['inflation','termspread']] # Training macro features
y_ens = abs(ens_data['err_NN_test']) # Label, error from previous section
fit_ens = tree.DecisionTreeRegressor( # Defining the model
max_depth = 2, # Maximum depth (i.e. tree levels)
ccp_alpha=0.00001 # complexity parameters
)
fit_ens.fit(X_ens, y_ens) # Fitting the model
fig, ax = plt.subplots(figsize=(13, 8)) # resizing
tree.plot_tree(fit_ens ,feature_names=X_ens.columns.values, ax=ax)
# Plot the tree
plt.show()
The tree creates clusters which have homogeneous values of absolute errors. One big cluster
gathers 92% of predictions (the left one) and is the one with the smallest average. It corre-
sponds to the periods when the term spread is above 0.29 (in percentage points). The other
two groups (when the term spread is below 0.29%) are determined according to the level
of inflation. If the latter is positive, then the average absolute error is 7%, if not, it is 12%.
This last number, the highest of the three clusters, indicates that when the term spread is
low and the inflation negative, the model’s predictions are not trustworthy because their er-
rors have a magnitude twice as large as in other periods. Under these circumstances (which
seem to be linked to a dire economic environment), it may be wiser not to use ML-based
forecasts.
A split in dates requires other decisions: is the data split in large blocks (like years) and
each model gets a block, which may stand for one particular kind of market condition? Or,
are the training dates divided more regularly? For instance, if there are 12 models in the
ensemble, each model can be trained on data from a given month (e.g., January for the first
model, February for the second one, etc.).
Below, we train four models on four different years to see if this helps reduce the inter-model
correlations. This process is a bit lengthy because the samples and models need to be all
redefined. We start by creating the four training samples. The third model works on the
small subset of features, hence the sample is smaller.
training_sample_2007 = training_sample.loc[training_sample.index[(
training_sample['date']>'2006-12-31')&(
training_sample['date']<'2008-01-01')].tolist()]
training_sample_2009=training_sample.loc[training_sample.index[(
training_sample['date']>'2008-12-31')&(
training_sample['date']<'2010-01-01')].tolist()]
training_sample_2011 = training_sample.loc[training_sample.index[(
training_sample['date']>'2010-12-31')&(
training_sample['date']<'2012-01-01')].tolist()]
training_sample_2013 = training_sample.loc[training_sample.index[(
training_sample['date']>'2012-12-31')&(
training_sample['date']<'2014-01-01')].tolist()]
Then, we proceed to the training of the models. The syntaxes are those used in the previous
chapters, nothing new here. We start with a penalized regression. In all predictions below,
the original testing sample is used for all models.
model = keras.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(len(features),)))
model.add(layers.Dense(8, activation="tanh"))
model.add(layers.Dense(1))
model.compile(optimizer='RMSprop',
loss='mse',
metrics=['MeanAbsoluteError'])
model.summary()
fit_ens_2013 = model.fit(
training_sample_2013[features].values,# Training features
training_sample_2013['R1M_Usd'].values,# Training labels
batch_size=128, # Training parameters
epochs = 9, # Training parameters
verbose = True # Show messages
)
err_ens_2013=model.predict(
X_penalized_test)-testing_sample['R1M_Usd'].values.reshape((-1,1))
# Pred. errs
Endowed with the errors of the four models, we can compute their correlation matrix.
E_subtraining = pd.concat(
[err_ens_2007,err_ens_2009,err_ens_2011,pd.DataFrame(
err_ens_2013,index=testing_sample.index)], axis=1)
# E_subtraining matrix
E_subtraining.set_axis(
['err_ens_2007','err_ens_2009','err_ens_2011','err_ens_2013'],
axis=1, inplace=True)# Names
E_subtraining.corr()
E_subtraining.corr().mean()
err_ens_2007 0.955186
err_ens_2009 0.937980
err_ens_2011 0.894568
err_ens_2013 0.955742
dtype: float64
The results are overall disappointing. Only one model manages to extract patterns that are
somewhat different from the other ones, resulting in an 89% correlation across the board.
Neural networks (on 2013 data) and penalized regressions (2007) remain highly correlated.
One possible explanation could be that the models capture mainly noise and little signal.
Working with long-term labels like annual returns could help improve diversification across
models.
11.4 Exercise
Build an integrated ensemble on top of three neural networks trained entirely with Keras.
Each network obtains one-third of predictors as input. The three networks yield a classifi-
cation (yes/no or buy/sell). The overarching network aggregates the three outputs into a
final decision. Evaluate its performance on the testing sample. Use the functional API.
12
Portfolio backtesting
In this section, we introduce the notations and framework that will be used when analyzing
and comparing investment strategies. Portfolio backtesting is often conceived and perceived
as a quest to find the best strategy - or at least a solidly profitable one. When carried out
thoroughly, this possibly long endeavor may entice the layman to mistake a fluke for a robust
policy. Two papers published back-to-back warn against the perils of data snooping,
which is related to p-hacking. In both cases, the researcher will torture the data until the
sought result is found.
Fabozzi and de Prado (2018) acknowledge that only strategies that work make it to the
public, while thousands (at least) have been tested. Picking the pleasing outlier (the only
strategy that seemed to work) is likely to generate disappointment when switching to real
trading. In a similar vein, Arnott et al. (2019b) provide a list of principles and safeguards
that any analyst should follow to avoid any type of error when backtesting strategies. The
worst type is arguably false positives whereby strategies are found (often by cherrypicking)
to outperform in one very particular setting, but will likely fail in live implementation.
In addition to these recommendations on portfolio constructions, Arnott et al. (2019a) also
warn against the hazards of blindly investing in smart beta products related to academic fac-
tors. Plainly, expectations should not be set too high or face the risk of being disappointed.
Another takeaway from their article is that economic cycles have a strong impact on
factor returns: correlations change quickly, and drawdowns can be magnified in times of
major downturns.
Backtesting is more complicated than it seems, and it is easy to make small mistakes that
lead to apparently good portfolio policies. This chapter lays out a rigorous approach to this
exercise, discusses a few caveats, and proposes a lengthy example.
(usually equal to 2 to 10 years) and expanding. In the first case, the training sample will
roll over time, taking into account only the most recent data. In the second case, models are
built on all of the available data, the size of which increases with time. This last option can
create problems because the first dates of the backtest are based on much smaller amounts
of information compared to the last dates. Moreover, there is an ongoing debate on whether
including the full history of returns and characteristics is advantageous or not. Proponents
argue that this allows models to see many different market conditions. Opponents make
the case that old data is by definition outdated and thus useless and possibly misleading
because it won’t reflect current or future short-term fluctuations.
Henceforth, we choose the rolling period option for the training sample, as depicted in
Figure 12.1.
FIGURE 12.1: Backtesting with rolling windows. The training set of the first period is
simply the buffer period.
Two crucial design choices are the rebalancing frequency, and the horizon at which the
label is computed. It is not obvious that they should be equal, but their choice should make
sense. It can seem right to train on a 12-month forward label (which captures longer trends)
and invest monthly or quarterly. However, it seems odd to do the opposite and train on
short-term movements (monthly) and invest at a long horizon.
These choices have a direct impact on how the backtest is carried out. If we note:
• ∆h for the holding period between two rebalancing dates (in days or months),
• ∆s for the size of the desired training sample (in days or months - not taking the number
of assets into consideration),
• ∆l for the horizon at which the label is computed (in days or months),
then the total length of the training sample should be ∆s + ∆l . Indeed, at any moment t,
the training sample should stop at t − ∆l so that the last point corresponds to a label that
is calculated until time t. This is highlighted in Figure 12.2 in the form of the red danger
zone. We call it the red zone because any observation which has a time index s inside the
interval (t − ∆l , t] will engender a forward looking bias. Indeed if a feature is indexed by
s ∈ (t − ∆l , t], then by definition, the label covers the period [s, s + ∆l ] with s + ∆l > t. At
time t, this requires knowledge of the future and is naturally not realistic.
$$ \underset{w}{\min} \ \frac{\lambda}{2} w'\Sigma w - w'\mu, \quad \text{s.t.} \quad \left\{\begin{array}{l} w'\mathbf{1}_N = 1, \\ (w - w_-)'\Lambda(w - w_-) \le \delta_R, \\ w'w \le \delta_D, \end{array}\right. \tag{12.1} $$
The associated Lagrange function is
$$ L(w) = \frac{\lambda}{2} w'\Sigma w - w'\mu - \eta(w'\mathbf{1}_N - 1) + \kappa_R\left((w - w_-)'\Lambda(w - w_-) - \delta_R\right) + \kappa_D\left(w'w - \delta_D\right), \tag{12.2} $$
and setting its gradient with respect to w to zero yields the optimal weights
$$ w^* = \left(\lambda\Sigma + 2\kappa_R\Lambda + 2\kappa_D I_N\right)^{-1}\left(\mu + \eta\mathbf{1}_N + 2\kappa_R\Lambda w_-\right). \tag{12.3} $$
The parameter η ensures that the budget constraint is satisfied. The optimal weights in (12.3)
depend on three tuning parameters: λ, κR and κD .
- When λ is large, the focus is set more on risk reduction than on profit maximization (which
is often a good idea given that risk is easier to predict);
- When κR is large, the importance of transaction costs in (12.2) is high and thus, in the
limit when κR → ∞, the optimal weights are equal to the old ones w− (for finite values of
the other parameters).
- When κD is large, the portfolio is more diversified and (all other things equal) when
κD → ∞, the weights are all equal (to 1/N ).
- When κR = κD = 0, we recover the classical mean-variance weights which are a mix
between the maximum Sharpe ratio portfolio proportional to (Σ)−1 µ and the minimum
variance portfolio proportional to (Σ)−1 1N .
1 Constraints often have beneficial effects on portfolio composition, see Jagannathan and Ma (2003) and
This seemingly complex formula is in fact very flexible and tractable. It requires some tests
and adjustments before finding realistic values for λ, κR and κD (see exercise at the end of
the chapter). In Pedersen et al. (2020), the authors recommend a similar form, except that
the covariance matrix is shrunk towards the diagonal matrix of sample variances and the
expected returns are a mix between a signal and an anchor portfolio. The authors argue that
their general formulation has links with robust optimization (see also Kim et al. (2014)),
Bayesian inference (Lai et al. (2011)), matrix denoising via random matrix theory, and,
naturally, shrinkage. In fact, shrunk expected returns have been around for quite some
time (Jorion (1985), Kan and Zhou (2007), and Bodnar et al. (2013)) and simply seek to
diversify and reduce estimation risk.
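As an illustration, a minimal Python sketch of Equation (12.3) (a standalone function with assumed inputs, not the formulation of Pedersen et al. (2020)) is:
import numpy as np

def mv_tc_weights(Sigma, mu, w_old, lam=1.0, kappa_R=0.1, kappa_D=0.1, Lambda=None):
    # Weights of Equation (12.3); eta is pinned down by the budget constraint w'1 = 1
    N = len(mu)
    if Lambda is None:
        Lambda = np.eye(N)                           # uniform transaction-cost matrix (assumption)
    M = lam * Sigma + 2 * kappa_R * Lambda + 2 * kappa_D * np.eye(N)
    M_inv = np.linalg.inv(M)
    base = M_inv @ (mu + 2 * kappa_R * Lambda @ w_old)
    eta = (1 - np.sum(base)) / np.sum(M_inv @ np.ones(N))
    return M_inv @ (mu + eta * np.ones(N) + 2 * kappa_R * Lambda @ w_old)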
12.3.1 Discussion
While the evaluation of the accuracy of ML tools (See Section 10.1) is of course valuable
(and imperative!), the portfolio returns are the ultimate yardstick during a backtest. One
essential element in such an exercise is a benchmark because raw and absolute metrics
don’t mean much on their own.
This is not only true at the portfolio level, but also at the ML engine level. In most of
the trials of the previous chapters, the MSE of the models on the testing set revolves
around 0.037. An interesting figure is the variance of one-month returns on this set, which
corresponds to the error made by a constant prediction of 0 all the time. This figure is equal
to 0.037, which means that the sophisticated algorithms don’t really improve on a naive
heuristic. This benchmark is the one used in the out-of-sample R2 of Gu et al. (2020).
In portfolio choice, the most elementary allocation is the uniform one, whereby each asset
receives the same weight. This seemingly simplistic solution is in fact a formidable bench-
mark, one that is hard to beat consistently (see DeMiguel et al. (2009b) and Plyakha et al.
(2016)). Theoretically, uniform portfolios are optimal when uncertainty, ambiguity, or esti-
mation risk is high (Pflug et al. (2012), Maillet et al. (2015)), and empirically, it cannot be
outperformed even at the factor level (Dichtl et al. (2021b)). Below, we will pick an equally
weighted (EW) portfolio of all stocks as our benchmark.
$$ \bar{r}_P = \mu_P = \mathbb{E}[r^P] \approx \frac{1}{T}\sum_{t=1}^{T} r_t^P, \qquad \bar{r}_B = \mu_B = \mathbb{E}[r^B] \approx \frac{1}{T}\sum_{t=1}^{T} r_t^B, $$
T t=1 t T t=1 t
where, obviously, the portfolio is noteworthy if E[rP ] > E[rB ]. Note that we use the arith-
metic average above, but the geometric one is also an option, in which case:
$$ \tilde{\mu}_P \approx \left(\prod_{t=1}^{T}(1 + r_t^P)\right)^{1/T} - 1, \qquad \tilde{\mu}_B \approx \left(\prod_{t=1}^{T}(1 + r_t^B)\right)^{1/T} - 1. $$
The benefit of this second definition is that it takes the compounding of returns into account
and hence compensates for volatility pumping. To see this, consider a very simple two-period model with returns −r and +r. The arithmetic average is zero, but the geometric one, √(1 − r²) − 1, is negative.
Akin to accuracy, hit ratios evaluate the proportion of times when the position is in the
right direction (long when the realized return is positive and short when it is negative).
Hence, hit ratios evaluate the propensity to make good guesses. This can be computed at
the asset level (the proportion of positions in the correct direction2 ) or at the portfolio
level. In all cases, the computation can be performed on raw returns or on relative returns
(e.g., compared to a benchmark). A meaningful hit ratio is the proportion of times that a
strategy beats its benchmark. This is of course not sufficient, as many small gains can be
offset by a few large losses.
Lastly, here is one important precision. In all examples of supervised learning tools in the
book, we compared the hit ratios to 0.5. This is in fact wrong because if an investor is bullish,
he or she may always bet on upward moves. In this case, the hit ratio is the percentage
of time that returns are positive. Over the long run, this probability is above 0.5. In our
sample, it is equal to 0.556, which is well above 0.5. This could be viewed as a benchmark
to be surpassed.
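A quick way to obtain this figure (a one-liner, assuming the full dataset is stored in the data_ml DataFrame with one-month returns in R1M_Usd) is:
np.mean(data_ml['R1M_Usd'] > 0)   # proportion of positive one-month returns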
Pure performance measures are almost always accompanied by risk measures. The second
moment of returns is usually used to quantify the magnitude of fluctuations of the portfolio.
A large variance implies sizable movements in returns, and hence in portfolio values. This
is why the standard deviation of returns is called the volatility of the portfolio.
$$ \sigma_P^2 = \mathbb{V}[r^P] \approx \frac{1}{T-1}\sum_{t=1}^{T}(r_t^P - \mu_P)^2, \qquad \sigma_B^2 = \mathbb{V}[r^B] \approx \frac{1}{T-1}\sum_{t=1}^{T}(r_t^B - \mu_B)^2. $$
In this case, the portfolio can be preferred if it is less risky compared to the benchmark, i.e., when $\sigma_P^2 < \sigma_B^2$ and when average returns are equal (or comparable).
Higher order moments of returns are sometimes used (skewness and kurtosis), but they are
far less common. We refer for instance to Harvey et al. (2010) for one method that takes
them into account in the portfolio construction process.
2 A long position in an asset with positive return or a short position in an asset with negative return.
For some people, the volatility is an incomplete measure of risk. It can be argued that it should be decomposed into ‘good’ volatility (when prices go up) versus ‘bad’ volatility (when they go down). The downward semi-variance is computed as the variance taken over the negative returns:
$$\sigma_-^2 \approx \frac{1}{\text{card}(r_t < 0)}\sum_{t=1}^T (r_t - \mu_P)^2\, 1_{\{r_t < 0\}}.$$
The average return and the volatility are the typical moment-based metrics used by practi-
tioners. Other indicators rely on different aspects of the distribution of returns with a focus
on tails and extreme events. The Value-at-Risk (VaR) is one such example. If $F_r$ is the empirical cdf of returns, the VaR at a level of confidence α (often taken to be 95%) is the corresponding quantile of the return distribution, $\text{VaR}_\alpha(r) = F_r^{-1}(1-\alpha)$. It is equal to the realization of a bad scenario (of return) that is expected to happen a proportion (1 − α) of the time on average (e.g., 5% of the time when α = 95%). An even more conservative measure is the so-called Conditional Value at Risk (CVaR), also known as expected shortfall, which computes the average loss of the worst (1 − α) scenarios. Its empirical evaluation is
$$\text{CVaR}_\alpha(r) = \frac{1}{\text{Card}(r_t < \text{VaR}_\alpha(r))}\sum_{r_t < \text{VaR}_\alpha(r)} r_t.$$
Going crescendo in the severity of risk measures, the ultimate evaluation of loss is the
maximum drawdown. It is equal to the maximum loss suffered from the peak value of
the strategy. If we write $P_t$ for the time-t value of a portfolio, the drawdown at time T is
$$D_T^P = \max_{0\le t\le T} P_t - P_T,$$
and the maximum drawdown is
$$MD_T^P = \max_{0\le s\le T}\left(\max_{0\le t\le s} P_t - P_s\right).$$
When the returns of a strategy are regressed on a set of common factors, the estimated $\hat{\alpha}_n$ is the performance that cannot be explained by these factors.
When returns are excess returns (over the risk-free rate) and when there is only one factor,
the market factor, then this quantity is called Jensen’s alpha (Jensen (1968)). Often, it is
simply referred to as alpha. The other estimate, β̂t,M,n (M for market), is the market beta.
Because of the rise of factor investing, it has become customary to also report the alpha of
more exhaustive regressions. Adding the size and value premium (as in Fama and French
(1993)) and even momentum (Carhart (1997)) helps understand if a strategy generates
value beyond that which can be obtained through the usual factors.
Two other classical indicators are the MAR ratio and the Treynor ratio:
$$MAR_P = \frac{\tilde{\mu}_P}{MD_P}, \qquad \text{Treynor}_P = \frac{\mu_P}{\hat{\beta}_M},$$
i.e., the (excess) return divided by the market beta (see Treynor (1965)). This definition
was generalized to multifactor exposures by Hübner (2005) into the generalized Treynor
ratio:
$$GT = \mu_P \frac{\sum_{k=1}^K \bar{f}_k}{\sum_{k=1}^K \hat{\beta}_k \bar{f}_k},$$
where the $\bar{f}_k$ are the sample averages of the factors $f_{t,k}$. We refer to the original article for
a detailed account of the analytical properties of this ratio.
A common way to measure trading intensity is the turnover of the portfolio, for instance computed as
$$\text{Turnover}_P = \frac{1}{T-1}\sum_{t=2}^T\sum_{n=1}^N |w_{t,n} - w_{t^-,n}|,$$
where $w_{t,n}$ are the desired t-time weights in the portfolio and $w_{t^-,n}$ are the weights just before the rebalancing. The positions of the first period (launching weights) are excluded from the computation by convention. Transaction costs can then be proxied as a multiple of turnover (times some average or median cost in the cross-section of firms). This is a first order estimate of realized costs that does not take into consideration the evolution of the scale of the portfolio. Nonetheless, a rough figure is much better than none at all.
Once transaction costs (TCs) have been annualized, they can be deducted from average
returns to yield a more realistic picture of profitability. In the same vein, the transaction
cost-adjusted Sharpe ratio of a portfolio P is given by
$$SR_{TC} = \frac{\mu_P - TC}{\sigma_P}. \qquad (12.4)$$
Transaction costs are often overlooked in academic articles but can have a sizable impact in
real life trading (see, e.g., Novy-Marx and Velikov (2015)). DeMiguel et al. (2020) show how
to use factor investing (and exposures) to combine and offset positions and reduce overall
fees.
At any given moment, a backtest depends on only one particular dataset. Often, the result of the first backtest will not be satisfactory, for many possible reasons. Hence, it is tempting to have another try after altering some parameters that were probably not optimal. This second test may be better, but not quite good enough yet. Thus, in a third trial, a new weighting scheme can be tested, along with a new (more sophisticated) forecasting engine. Iteratively, the backtester will inevitably end up with a strategy that performs well enough; it is just a matter of time and trials.
One consequence of backtest overfitting is that it is illusory to hope for the same Sharpe
ratios in live trading as those obtained in the backtest. Reasonable professionals divide the
Sharpe ratio by two at least (Harvey and Liu (2015), Suhonen et al. (2017)). In Bailey and
de Prado (2014), the authors even propose a statistical test for Sharpe ratios, provided that
some metrics of all tested strategies are stored in memory. The formula for deflated Sharpe
ratios is:
$$t = \phi\left((SR - SR^*)\sqrt{\frac{T-1}{1 - \gamma_3 SR + \frac{\gamma_4 - 1}{4}SR^2}}\right), \qquad (12.5)$$
where SR is the Sharpe Ratio obtained by the best strategy among all that were tested,
and
$$SR^* = E[SR] + \sqrt{V[SR]}\left((1-\gamma)\,\phi^{-1}\!\left(1 - \frac{1}{N}\right) + \gamma\,\phi^{-1}\!\left(1 - \frac{1}{Ne}\right)\right),$$
is the theoretical average maximum SR. Moreover,
• T is the number of trading dates;
• γ3 and γ4 are the skewness and kurtosis of the returns of the chosen (best) strategy;
• φ is the cdf of the standard Gaussian law, e is Euler's number, and γ ≈ 0.577 is the Euler-Mascheroni constant;
• N refers to the number of strategy trials.
If t defined above is below a certain threshold (e.g., 0.95), then the SR cannot be deemed
significant: the best strategy is not outstanding compared to all of those that were
tested. Most of the time, sadly, that is the case. In Equation (12.5), the realized SR must
be above the theoretical maximum SR∗ and the scaling factor must be sufficiently large to
push the argument inside φ close enough to two, so that t surpasses 0.95.
In the scientific community, test overfitting is also known as p-hacking. It is rather common
in financial economics, and the reading of Harvey (2017) is strongly advised to grasp the
magnitude of the phenomenon. p-hacking is also present in most fields that use statistical
tests (see, e.g., Head et al. (2015), to cite one reference). There are several ways to cope
with p-hacking:
1. don’t rely on p-values (Amrhein et al. (2019)),
3. or, finally, use advanced methods that process arrays of statistics (e.g., the Bayesianized
versions of p-values to include some prior assessment from Harvey (2017), or other tests
such as those proposed in Romano and Wolf (2005) and Simonsohn et al. (2014)).
The first option is wise, but the drawback is that the decision process is then left to another
arbitrary yardstick.
In fact, this is one major difference with many fields for which ML has made huge advances.
In image recognition, numbers will always have the same shape, and so will cats, buses,
etc. Likewise, a verb will always be a verb, and syntaxes in languages do not change. This
invariance, though sometimes hard to grasp,3 is nonetheless key to the great improvement
both in computer vision and natural language processing.
In factor investing, there does not seem to be such invariance (see Cornell (2020)). There is
no factor and no (possibly non-linear) combination of factors that can explain and accurately
forecast returns over long periods of several decades.4 The academic literature has yet to
find such a model; but even if it did, a simple arbitrage reasoning would logically invalidate
its conclusions in future datasets.
$$\delta(x,y) = \begin{cases} 0 & \text{if } x \neq y \\ 1 & \text{if } x = y\end{cases}. \qquad (12.7)$$
One of the no free lunch theorems states that E1 (S) = E2 (S), that is, that with the
sole knowledge of S, there can be no superior algorithm, on average. In order to build a
performing algorithm, the analyst or econometrician must have prior views on the structure
of the relationship between y and x and integrate these views in the construction of the
model. Unfortunately, this can also yield underperforming models if the views are incorrect.
3 We invite the reader to have a look at the thoughtful albeit theoretical paper by Arjovsky et al. (2019).
4 In the thread https://twitter.com/fchollet/status/1177633367472259072, François Chollet, the creator of Keras, argues that ML predictions based on price data cannot be profitable in the long term. Given
the wide access to financial data, it is likely that the statement holds for predictions stemming from factor-
related data as well.
4. performance indicators.
Accordingly, we start with initializations.
import datetime as dt
from datetime import datetime

sep_oos = "2007-01-01"                                        # Starting point for backtest
ticks = list(data_ml['stock_id'].unique())                    # List of all asset ids
N = len(ticks)                                                # Max number of assets
t_oos = list(returns.index[returns.index > sep_oos].values)   # Out-of-sample dates
t_as = list(returns.index.values)                             # All dates
Tt = len(t_oos)                                               # Nb of dates
nb_port = 2                                                   # Nb of portfolios/strategies
portf_weights = np.zeros(shape=(Tt, nb_port, max(ticks)+1))   # Initialize portfolio weights
portf_returns = np.zeros(shape=(Tt, nb_port))                 # Initialize portfolio returns
This first step is crucial; it lays the groundwork for the core of the backtest. We consider
only two strategies: one ML-based and the EW (1/N) benchmark. The main (weighting)
function will consist of these two components, but we define the sophisticated one in a
dedicated wrapper. The ML-based weights are derived from XGBoost predictions with 80
trees, a learning rate of 0.3 and a maximum tree depth of 4. This makes the model complex
but not exceedingly so. Once the predictions are obtained, the weighting scheme is simple:
it is an EW portfolio over the best half of the stocks (those with above median prediction).
In the function below, all parameters (e.g., the learning rate, eta, or the number of trees
nrounds) are hard-coded. They can easily be passed in arguments next to the data inputs.
One very important detail is that, in contrast to the rest of the book, the label is the 12-
month future return. The main reason for this is rooted in the discussion from Section 4.6.
Also, to speed up the computations, we remove the bulk of the distribution of the labels
and keep only the top 20% and bottom 20%, as is advised in Coqueret and Guida (2020).
The filtering levels could also be passed as arguments.
Compared to the structure proposed in Section 6.4.6, the differences are that the label is
not only based on long-term returns, but it also relies on a volatility component. Even
though the denominator in the label is the exponential quantile of the volatility, it seems
fair to say that it is inspired by the Sharpe ratio and that the model seeks to explain and
forecast a risk-adjusted return instead of a raw return. A stock with very low volatility will
have its return unchanged in the label, while a stock with very high volatility will see its
return divided by a factor close to three (exp(1)=2.718).
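The book's exact weighting function is not reproduced in this excerpt. The sketch below captures its logic under a few assumptions: the 12-month label is taken directly from a column assumed to be named R12M_Usd (the actual label described above further divides this return by an exponential volatility quantile, which we omit here), and the xgboost parameters follow the values quoted above.

import numpy as np
import xgboost as xgb

def weights_xgb(train_data, test_data, features):
    train_label = train_data["R12M_Usd"]                        # 12-month forward return (assumed column name)
    low_q = train_label.quantile(0.2)                           # 20% quantile of the label
    high_q = train_label.quantile(0.8)                          # 80% quantile of the label
    keep = (train_label <= low_q) | (train_label >= high_q)     # Keep only the tails of the distribution
    train_matrix = xgb.DMatrix(train_data.loc[keep, features],  # XGBoost data format
                               label=train_label[keep])
    params = {"eta": 0.3,                                       # Learning rate
              "objective": "reg:squarederror",                  # Regression objective
              "max_depth": 4}                                   # Maximum tree depth
    fit = xgb.train(params, train_matrix, num_boost_round=80)   # 80 trees
    pred = fit.predict(xgb.DMatrix(test_data[features]))        # Predictions for the current date
    w = pred > np.median(pred)                                  # Long the best half of the stocks...
    return w / np.sum(w)                                        # ...with equal weights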
This function is then embedded in the global weighting function which only wraps two
schemes: the EW benchmark and the ML-based policy.
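Again, the original wrapper is not shown here; a minimal sketch consistent with the two schemes just mentioned could be:

def portf_compose(train_data, test_data, features, j):
    N_assets = test_data.shape[0]                      # Number of assets at the current date
    if j == 0:                                         # Strategy 0: equally weighted benchmark
        return np.repeat(1 / N_assets, N_assets)
    else:                                              # Strategy 1: ML-based policy
        return weights_xgb(train_data, test_data, features)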
Equipped with this function, we can turn to the main backtesting loop. Given the fact that
we use a large-scale model, the computation time for the loop is large (possibly a few hours
on a slow machine with CPU). Resorting to functional programming can speed up the loop.
Also, a simple benchmark equally weighted portfolio can be coded with functions only.
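The loop itself is not reproduced in this excerpt. The sketch below shows one possible structure, assuming (as earlier in the book) that data_ml holds the firm characteristics with a date column and that returns is a DataFrame of one-month returns indexed by dates, with one column per stock_id.

m_offset = 12          # Buffer period (in months), because of the 12-month label
train_size = 5         # Size of the training sample, in years

dates_all = pd.to_datetime(data_ml["date"])
for t, date in enumerate(t_oos):
    d = pd.to_datetime(date)
    past = (dates_all < d - pd.DateOffset(months=m_offset)) & \
           (dates_all >= d - pd.DateOffset(months=m_offset + 12 * train_size))
    train_data = data_ml.loc[past]                             # Rolling (lagged) training sample
    test_data = data_ml.loc[dates_all == d]                    # Current cross-section
    realized = returns.loc[date]                               # Realized returns at date t
    for j in range(nb_port):                                   # Loop over the two strategies
        w = portf_compose(train_data, test_data, features, j)  # Weights of strategy j
        ids = test_data["stock_id"].values                     # Asset ids held at date t
        portf_weights[t, j, ids] = w                           # Store the weights
        portf_returns[t, j] = np.sum(w * realized[ids].values) # Store the portfolio return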
There are two important comments to be made on the above code. The first comment
pertains to the two parameters that are defined in the first lines. They refer to the size
of the training sample (5 years) and the length of the buffer period shown in Figure 12.2.
This buffer period is imperative because the label is based on a long-term (12-month)
return. This lag is compulsory to avoid any forward-looking bias in the backtest.
Below, we create a function that computes the turnover (variation in weights). It requires
both the weight values as well as the returns of all assets because the weights just before a
rebalancing depend on the weights assigned in the previous period, as well as on the returns
of the assets that have altered these original weights during the holding period.
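The book's turnover function is not shown in this excerpt; a rough sketch, assuming the columns of asset_returns are aligned with the asset dimension of the weight matrix, is the following:

def turnover(weights, asset_returns, t_oos):
    turn = 0
    for t in range(1, len(t_oos)):                        # Loop on dates, skipping the first one
        real = asset_returns.loc[t_oos[t]].values         # Asset returns over the holding period
        prior = weights[t-1, :] * (1 + real)              # Previous weights drift with asset returns
        prior = prior / np.sum(prior)                     # Renormalized pre-rebalancing weights
        turn += np.sum(np.abs(weights[t, :] - prior))     # One-way turnover at date t
    return turn / (len(t_oos) - 1)                        # Average turnover per rebalancing date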
Once turnover is defined, we embed it into a function that computes several key indicators.
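The single-strategy helper perf_met called inside the wrapper below is not reproduced in this excerpt; a minimal sketch that matches the column names used afterwards could be:

def perf_met(portf_returns, weights, asset_returns, t_oos):
    avg_ret = np.mean(portf_returns)                      # Average return
    vol = np.std(portf_returns, ddof=1)                   # Volatility
    sharpe_ratio = avg_ret / vol                          # Sharpe ratio (no risk-free rate)
    var_5 = np.quantile(portf_returns, 0.05)              # 5% Value-at-Risk
    turn = turnover(weights, asset_returns, t_oos)        # Average turnover
    return [avg_ret, vol, sharpe_ratio, var_5, turn]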
def perf_met_multi(portf_returns, weights, asset_returns, t_oos, strat_name):
    J = weights.shape[1]                                  # Number of strategies
    met = []                                              # Initialization of metrics
    for j in range(J):                                    # Loop over the strategies
        temp_met = perf_met(portf_returns[:, j], weights[:, j, :],
                            asset_returns, t_oos)
        met.append(temp_met)
    return pd.DataFrame(met, index=strat_name,            # Stores the name of the strat
                        columns=['avg_ret', 'vol', 'Sharpe_ratio', 'VaR_5', 'turn'])
Given the weights and returns of the portfolios, it remains to compute the returns of the
assets to plug them in the aggregate metrics function.
The ML-based strategy finally performs well! The gain comes mostly from the average return, while the volatility is higher than that of the benchmark. The net effect is that the Sharpe ratio is improved compared to the benchmark. The improvement is not breathtaking, but (perhaps for that very reason) it seems credible. It is noteworthy to underline that turnover is substantially higher for the sophisticated strategy. Removing costs in the numerator (say, 0.005 times the turnover, as in Goto and Xu (2015), which is a conservative figure) only mildly reduces the superiority in Sharpe ratio of the ML-based strategy.
Finally, it is always tempting to plot the corresponding portfolio values, and we display two
related graphs in Figure 12.3.
g1 = pd.DataFrame([t_oos,
                   np.cumprod(1 + portf_returns[:, 0]),
                   np.cumprod(1 + portf_returns[:, 1])],
                  index=["date", "benchmark", "ml_based"]).T   # Creating cumulated time series
g1.reset_index(inplace=True)                                   # Data wrangling
g1['date_month'] = pd.to_datetime(g1['date']).dt.month         # New column used to select the partition for the second plot (yearly perf)
g1.set_index('date', inplace=True)                             # Setting date index for plots
g2 = g1[g1['date_month'] == 12]                                # Selecting pseudo end-of-year NAV
g2 = g2.append(g1.iloc[[0]])                                   # Adding the first date of Jan 2007
g2.sort_index(inplace=True)                                    # Sorting dates
g1[["benchmark", "ml_based"]].plot(figsize=[16, 6], ylabel='Cumulated value')                       # Plot!
g2[["benchmark", "ml_based"]].pct_change(1).plot.bar(figsize=[16, 6], ylabel='Yearly performance')  # Plot!
Out of the 12 years of the backtest, the advanced strategy outperforms the benchmark
during 10 years. It also loses less than the benchmark in two of the four years of aggregate losses (2015 and
2018). This is a satisfactory improvement because the EW benchmark is tough to beat!
Then, we test the function on a triplet of arguments. We pick the price-to-book (Pb) ratio.
The position is positive and the threshold is 0.3, which means that the strategy buys the
stocks that have a Pb value above the 0.3 quantile of the distribution.
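The function itself is not reproduced in this excerpt; as an illustration, a hypothetical single-characteristic backtest that returns the three quantities needed later (Sharpe ratio, skewness and kurtosis of strategy returns) could look like this:

import numpy as np
from scipy import stats

def strat(feature, thresh, direction):
    rets = []
    for date in t_oos:                                               # Loop over out-of-sample dates
        cross_section = data_ml[data_ml["date"] == date]             # Current cross-section
        q = cross_section[feature].quantile(thresh)                  # Threshold quantile
        if direction == 1:
            held = cross_section[cross_section[feature] > q]         # Long the upper part of the distribution
        else:
            held = cross_section[cross_section[feature] < q]         # Long the lower part of the distribution
        rets.append(held["R1M_Usd"].mean())                          # Equally weighted one-month return
    rets = np.array(rets)
    sr = np.mean(rets) / np.std(rets, ddof=1)                        # Sharpe ratio
    return sr, stats.skew(rets), stats.kurtosis(rets, fisher=False)  # SR, skewness, (raw) kurtosis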
The output keeps three quantities that will be useful to compute the statistic (12.5). We
must now generate these indicators for many strategies. We start by creating the grid of
parameters.
import itertools
feature = ["Div_Yld","Ebit_Bv","Mkt_Cap_6M_Usd",
"Mom_11M_Usd","Pb","Vol1Y_Usd"]
thresh = np.arange(0.2, 0.9, 0.1) # Threshold
direction = np.array([1,-1]) # Decision direction
This makes 84 strategies in total. We can proceed to see how they fare. We plot the corre-
sponding Sharpe ratios below in Figure 12.4. The top plot shows the strategies that invest
in the bottoms of the distributions of characteristics, while the bottom plot pertains to the portfolios that are long in the upper parts of these distributions.
grd[grd['direction']==-1].pivot(index='thresh',
columns='feature',values='SR').plot(
figsize=[16,6],ylabel='Direction = -1') # Plot!
grd[grd['direction']==1].pivot(index='thresh',
columns='feature',values='SR').plot(
figsize=[16,6],ylabel='Direction = 1') # Plot!
from scipy import special, stats

def DSR(SR, Tt, M, g3, g4, SR_m, SR_v):              # First, we build the function
    gamma = -special.digamma(1)                      # Euler-Mascheroni constant
    SR_star = SR_m + np.sqrt(SR_v)*(
        (1-gamma)*stats.norm.ppf(1-1/M) +
        gamma*stats.norm.ppf(1-1/M/np.exp(1)))       # SR*
    num = (SR-SR_star) * np.sqrt(Tt-1)               # Numerator
    den = np.sqrt(1 - g3*SR + (g4-1)/4*SR**2)        # Denominator
    return round(stats.norm.cdf(num/den), 4)
All that remains to do is to evaluate the arguments of the function. The “best” strategy is
the one on the top left corner of Figure 12.4, and it is based on market capitalization.
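The evaluation call is not reproduced in this excerpt; assuming the grid grd stores, for each strategy, its Sharpe ratio (SR) together with the skewness and kurtosis of its returns (column names skew and kurt are assumptions), the computation could look like:

best = grd.loc[grd['SR'].idxmax()]                    # Best strategy = highest Sharpe ratio
DSR(SR=best['SR'], Tt=Tt, M=grd.shape[0],             # Realized SR, nb of dates, nb of trials (84)
    g3=best['skew'], g4=best['kurt'],                 # Skewness & kurtosis of the best strategy
    SR_m=grd['SR'].mean(), SR_v=grd['SR'].var())      # Mean and variance of SRs across all trials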
0.6657
The value 0.6657 is not high enough (it does not reach the 90% or 95% threshold) to make
the strategy significantly superior to the other ones that were considered in the batch of
tests.
13
Interpretability
This chapter is dedicated to the techniques that help understand the way models pro-
cess inputs into outputs. A recent book (Molnar (2019), available at https://christophm.github.io/interpretable-ml-book/) is entirely devoted to this topic, and we highly recommend having a look at it. Another more introductory and less technical reference is Hall
and Gill (2019). Obviously, in this chapter, we will adopt a factor-investing tone and discuss
examples related to ML models trained on a financial dataset.
Quantitative tools that aim for interpretability of ML models are required to satisfy two
simple conditions:
1. Provide information about the model.
2. Be highly comprehensible.
Often, these tools generate graphical outputs which are easy to read and yield immediate
conclusions.
In attempts to white-box complex machine learning models, one dichotomy stands out:
• Global models seek to determine the relative role of features in the construction of
the predictions once the model has been trained. This is done at the global level, so
that the patterns that are shown in the interpretation hold on average over the whole
training set.
• Local models aim to characterize how the model behaves around one particular in-
stance by considering small variations around this instance. The way these variations
are processed by the original model allows to simplify it by approximating it, e.g., in a
linear fashion. This approximation can for example determine the sign and magnitude
of the impact of each relevant feature in the vicinity of the original instance.
Molnar (2019) proposes another classification of interpretability solutions by splitting in-
terpretations that depend on one particular model (e.g., linear regression or decision tree)
versus the interpretations that can be obtained for any kind of model. In the sequel, we
present the methods according to the global versus local dichotomy.
In the case of a linear model of the form
$$y_i = \alpha + \sum_{k=1}^K \beta_k x_i^k + \epsilon_i,$$
the following elements are usually extracted from the estimation of the βk :
• the R2 , which appreciates the global fit of the model (possibly penalized to prevent
overfitting with many regressors). The R2 is usually computed in-sample;
• the sign of the estimates β̂k , which indicates the direction of the impact of each feature
xk on y;
• the t-statistics $t_{\hat{\beta}_k}$, which evaluate the magnitude of this impact: regardless of its direction, large statistics in absolute value reveal prominent variables. Often, the t-statistics are translated into p-values which are computed under some suitable distributional assumptions.
The last two indicators are useful because they inform the user on which features matter
the most and on the sign of the effect of each predictor. This gives a simplified view of how
the model processes the features into the output. Most tools that aim to explain black boxes
follow the same principles.
Decision trees, because they are easy to picture, are also great models for interpretability.
Thanks to this favorable feature, they are target benchmarks for simple models. Recently,
Vidal et al. (2020) propose a method to reduce an ensemble of trees into a unique tree. The
aim is to propose a simpler model that behaves exactly like the complex one.
More generally, it is an intuitive idea to resort to simple models to proxy more complex
algorithms. One simple way to do so is to build so-called surrogate models. The process
is simple:
1. train the original model f on features X and labels y;
2. train a simpler model g to explain the predictions of the trained model fˆ given the
features X:
fˆ(X) = g(X) + error
The estimated model ĝ explains how the initial model fˆ maps the features into the labels.
The simpler model is a shallow tree (with a depth of three in the code below).
from sklearn import tree
import matplotlib.pyplot as plt

new_target = fit_RF.predict(X_short)                            # Predictions of the random forest become the new target
decision_tree_model = tree.DecisionTreeRegressor(max_depth=3)   # Defining the global interpretable tree surrogate model
TreeSurrogate = decision_tree_model.fit(X_short, new_target)    # Fitting the surrogate
fig, ax = plt.subplots(figsize=(13, 8))                         # Setting the chart parameters
tree.plot_tree(TreeSurrogate, feature_names=features_short, ax=ax)
plt.show()                                                      # Plot!
The representation of the tree is quite different, compared to those seen in Chapter 6, but it managed to do a proper job in capturing the main complexity of the model which it mimics.
For tree-based models, the importance of feature k aggregates the gains (reductions of the loss) $G(n)$ obtained at all the nodes $n \in \mathcal{N}_k$ that split on feature k:
$$I(k) = \sum_{n \in \mathcal{N}_k} G(n),$$
and it is often rescaled so that the sum of I(k) across all k is equal to one. In this case, I(k) measures the relative contribution of feature k in the reduction of loss during the training.
A variable with high importance will have a greater impact on predictions. Generally, these
variables are those that are located close to the root of the tree.
Below, we take a look at the results obtained from the tree-based models trained in Chapter
6. We start by recycling the output from the three regression models we used. Notice that
each fitted output has its own structure, and importance vectors have different names.
tree_VI = pd.DataFrame(data=fit_tree.feature_importances_,
                       index=features_short, columns=['Tree'])
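The analogous extraction for the other two models is not shown in this excerpt. A possible sketch (the names fit_RF and fit_xgb, and the use of sklearn-style feature_importances_ attributes, are assumptions) is:

RF_VI = pd.DataFrame(data=fit_RF.feature_importances_,
                     index=features_short, columns=['RF'])    # Random forest importances
XGB_VI = pd.DataFrame(data=fit_xgb.feature_importances_,
                      index=features_short, columns=['XGB'])  # Boosted trees importances
VI = pd.concat([tree_VI, RF_VI, XGB_VI], axis=1)              # Combine the three importance vectors
(VI / VI.sum()).plot.bar(figsize=[10, 6])                     # Normalize each column and plot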
Given the way the graph is coded, Figure 13.2 is in fact misleading. Indeed, by construction,
the simple tree model only has a small number of features with non-zero importance: in the
above graph, there are only three: capitalization, price-to-book, and volatility. In contrast,
because random forest and boosted trees are much more complex, they give some importance
to many predictors. The graph shows the variables related to the simple tree model only.
For scale reasons, the normalization is performed after the subset of features is chosen.
We preferred to limit the number of features shown on the graph for obvious readability
concerns.
There are differences in the way the models rely on the features. For instance, the most
important feature changes from a model to the other: the simple tree model gives the most
importance to the price-to-book ratio, while the random forest bets more on volatility and
boosted trees give more weight to capitalization.
One defining property of random forests is that they give a chance to all features. Indeed,
by randomizing the choice of predictors, each individual exogenous variable has a shot
at explaining the label. Along with boosted trees, the allocation of importance is more
balanced across predictors, compared to the simple tree which puts most of its eggs in just
a few baskets.
import random
from sklearn.linear_model import Ridge

y_penalized = data_ml['R1M_Usd'].values      # Dependent variable
X_penalized = data_ml[features].values       # Predictors
fit_ridge_0 = Ridge(alpha=0.01)              # Model specification
fit_ridge_0.fit(X_penalized, y_penalized)    # Fit model
l_star = np.mean(np.square(
    fit_ridge_0.predict(X_penalized) - y_penalized))   # Baseline (unshuffled) loss
Next, we evaluate the loss when each of the predictors has been sequentially shuffled. To
reduce computation time, we only make one round of shuffling.
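The shuffling loop itself is not reproduced in this excerpt; a minimal sketch that produces the res object plotted below (one shuffle per predictor, reusing fit_ridge_0 and l_star from above) could be:

import numpy as np

res = []
for col, feat in enumerate(features):                               # Loop over all predictors
    X_shuffled = X_penalized.copy()                                 # Fresh copy of the feature matrix
    X_shuffled[:, col] = np.random.permutation(X_shuffled[:, col])  # Shuffle one column only
    new_loss = np.mean(np.square(
        fit_ridge_0.predict(X_shuffled) - y_penalized))             # Loss with the shuffled feature
    res.append(new_loss - l_star)                                   # Importance = increase in loss
res = pd.Series(res, index=features).sort_values(ascending=False)   # Sort from most to least important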
res.plot.bar(figsize=[10,6])
The resulting importances are in line with those of the tree-based models: the most prominent variables are volatility-based, market capitalization-based, and the price-to-book ratio; these closely match the variables from Figure 13.2. Note that in some cases (e.g., the share
turnover), the score can even be negative, which means that the predictions are more accu-
rate than the baseline model when the values of the predictor are shuffled!
Partial dependence plots (PDPs) show the average prediction of the model as a function of one feature, the others being integrated out:
$$\bar{f}_k(x_k) = E\left[\hat{f}(x_k, \mathbf{x}_{-k})\right] = \int \hat{f}(x_k, \mathbf{x}_{-k})\, d\mathbb{P}_{-k}(\mathbf{x}_{-k}), \qquad (13.1)$$
where $d\mathbb{P}_{-k}(\cdot)$ is the (multivariate) distribution of the non-k features $\mathbf{x}_{-k}$. The above function takes the feature values $x_k$ as argument and keeps all other features frozen via their sample distributions: this shows the impact of feature k solely. In practice, the average is evaluated using Monte-Carlo simulations:
$$\bar{f}_k(x_k) \approx \frac{1}{M}\sum_{m=1}^M \hat{f}\left(x_k, \mathbf{x}^{(m)}_{-k}\right), \qquad (13.2)$$
where the $\mathbf{x}^{(m)}_{-k}$ are independent samples of the non-k features.
Theoretically, PDPs could be computed for more than one feature at a time. In practice,
this is only possible for two features (yielding a 3D surface) and is more computationally
intense.
The model we seek to explain is the random forest built in Section 6.2. We recycle some
variables used therein. We choose to test the impact of the price-to-book ratio on the
outcome of the model.
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(
fit_RF,training_sample[features_short], ['Pb'],kind='average')
FIGURE 13.4: Partial dependence plot for the price-to-book ratio on the random forest model.

The average impact of the price-to-book ratio on the predictions is decreasing. This was somewhat expected, given the conditional average of the dependent variable given the price-to-book ratio. This latter function is depicted in Figure 6.3 and shows a behavior comparable to the above curve: strongly decreasing for small values of P/B and then relatively flat. When the price-to-book ratio is low, firms are undervalued. Hence, their higher returns are in line with the value premium.
Finally, we refer to Zhao and Hastie (2021) for a theoretical discussion on the causality
property of PDPs. Indeed, a deep look at the construction of the PDPs suggests that they
could be interpreted as a causal representation of the feature on the model’s output.
13.2.1 LIME
LIME (Local Interpretable Model-Agnostic Explanations) is a methodology originally pro-
posed by Ribeiro et al. (2016). Their aim is to provide a faithful account of the model under
two constraints:
• simple interpretability, which implies a limited number of variables with visual or
textual representation. This is to make sure any human can easily understand the out-
come of the tool;
• local faithfulness: the explanation holds for the vicinity of the instance.
1 For instance, we do not mention the work of Horel and Giesecke (2019), but the interested reader can
have a look at their work on neural networks (and also at the references cited in the paper).
The original (black-box) model is f, and we assume we want to approximate its behavior around instance x with the interpretable model g.2 The simple function g belongs to a larger class G. The vicinity of x is denoted $\pi_x$ and the complexity of g is written $\Omega(g)$. LIME seeks an interpretation of the form
$$\xi(x) = \underset{g \in G}{\text{argmin}}\ \mathcal{L}(f, g, \pi_x) + \Omega(g),$$
where the first term is a loss that measures how poorly g approximates f in the vicinity $\pi_x$, and the errors are weighted according to their distance from the initial instance x: the closest points get the largest weights. In its most basic implementation, the set of models G consists of all linear models.
In Figure 13.5, we provide a simplified diagram of how LIME works.
For expositional clarity, we work with only one dependent variable. The original training
sample is shown with the black points. The fitted (trained) model is represented with the
blue line (smoothed conditional average), and we want to approximate how the model works
around one particular instance which is highlighted by the red square around it. In order to
build the approximation, we sample five new points around the instance (five red triangles).
Each triangle lies on the blue line (they are model predictions) and has a weight proportional
to its size: the triangle closest to the instance has a bigger weight. Using weighted least-
squares, we build a linear model that fits to these five points (dashed grey line). This is the
outcome of the approximation. It gives the two parameters of the model: the intercept and
the slope. Both can be evaluated with standard statistical tests.
The sign of the slope is important. It is fairly clear that if the instance had been taken
closer to x = 0, the slope would have probably been almost flat and hence the predictor
could be locally discarded. Another important detail is the number of sample points. In our
explanation, we take only five, but in practice, a robust estimation usually requires around
1000 points or more. Indeed, when too few neighbors are sampled, the estimation risk is
high and the approximation may be rough.
2 In the original paper, the authors dig deeper into the notion of interpretable representations. In complex
machine learning settings (image recognition or natural language processing), the original features given to
the model can be hard to interpret. Hence, this requires an additional translation layer because the outcome
of LIME must be expressed in terms of easily understood quantities. In factor investing, the features are
elementary, hence we do not need to deal with this issue.
import xgboost as xgb

# Boosting model used for the LIME illustration (the constructor call below is an
# assumption; only the hyperparameters visible in this excerpt are reproduced)
xgb_model = xgb.XGBRegressor(
    gamma=0.1,              # Penalization
    n_estimators=10,        # Number of trees
    min_child_weight=10)    # Min number of instances in each node
xgb_model.fit(train_features_xgb, train_label_xgb.values.ravel())   # Fit on the training sample
import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(
    train_features_xgb.values,                 # Values in tabular (i.e., matrix) format
    mode='regression',                         # "classification" or "regression"
    feature_names=train_features_xgb.columns,
    verbose=1)                                 # If true, print local prediction values from the linear model
exp = explainer.explain_instance(
    train_features_xgb.iloc[0, :].values,      # First instance in train_sample
    predict_fn=xgb_model.predict,              # Prediction function
    labels=train_label_xgb.iloc[0].values,     # Iterable with labels to be explained
    distance_metric='euclidean',               # Distance function ("gower" is one alternative)
    num_samples=900)                           # Nb of perturbed samples used for the local approximation
exp.show_in_notebook(show_table=True)          # Visual display
In each graph (one graph corresponds to the explanation around one instance), there are
two types of information: the sign of the impact and the magnitude of the impact. The sign
is revealed with the color (positive in orange, negative in blue), and the magnitude is shown
with the size of the rectangles.
The values to the left of the graphs show the ranges of the features with which the local
approximations were computed.
Lastly, we briefly discuss the choice of distance function chosen in the code. It is used to
evaluate the discrepancy between the true instance and a simulated one to give more or
less weight to the prediction of the sampled instance. Our dataset comprises only numerical
data; hence, the Euclidean distance is a natural choice:
$$\text{Euclidean}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{n=1}^N (x_n - y_n)^2}.$$
Another frequent option is the Manhattan distance:
$$\text{Manhattan}(\mathbf{x}, \mathbf{y}) = \sum_{n=1}^N |x_n - y_n|.$$
The problem with these two distances is that they fail to handle categorical variables. This
is where the Gower distance steps in (Gower (1971)). The distance imposes a different treat-
ment on features of different types (classes versus numbers essentially, but it can also handle
missing data!). For categorical features, the Gower distance applies a binary treatment: the
value is equal to 1 if the features are equal, and to zero if not (i.e., $1_{\{x_n = y_n\}}$). For numerical features, the spread is quantified as $1 - \frac{|x_n - y_n|}{R_n}$, where $R_n$ is the maximum absolute value the feature can take. All similarity measurements are then aggregated to yield the final
score. Note that in this case, the logic is reversed: x and y are very close if the Gower
distance is close to one, and they are far away if the distance is close to zero.
S is any subset of the coalition that doesn’t include feature k, and its size is Card(S).
In the equation above, the model f must be altered because it’s impossible to evaluate f
when features are missing. In this case, there are several possible options:
• set the missing value to its average or median value (in the whole sample) so that its
effect is some ‘average’ effect;
• directly compute an average value $\int f(x_1, \dots, x_k, \dots, x_K)\, d\mathbb{P}_{x_k}$, where $d\mathbb{P}_{x_k}$ is the empirical distribution of $x_k$.
from sklearn.ensemble import RandomForestRegressor

fit_RF_short = RandomForestRegressor(
    n_estimators=40,                # Nb of random trees
    criterion='squared_error',      # Function to measure the quality of a split
    min_samples_leaf=250,           # Minimum size of terminal cluster
    max_features=4,                 # Nb of predictive variables for each tree
    bootstrap=True,                 # Trees are built on samples drawn with replacement
    max_samples=10000)              # Size of (random) sample for each tree
fit_RF_short.fit(training_sample[features_short],
                 training_sample['R1M_Usd'].values)   # Fitting the model
We can then analyze the behavior of the model around the first instance of the training
sample.
import shap
explainer = shap.explainers.Exact(fit_RF_short.predict,
# Compute the Shapley values...
training_sample[features_short].values,
# Training data
feature_names=features_short)
# features names, could be passed by the predictor fn as well
shap_values = explainer(training_sample[features_short].values[:1,])
# On the first instance
shap.plots.bar(shap_values[0])
# Visual display
In the output shown in Figure 13.6, we again obtain the two crucial insights: sign of the
impact of the feature and relative importance (compared to other features).
13.2.3 Breakdown
Breakdown (see, e.g., Staniak and Biecek (2018)) is a mixture of ideas from PDPs and
Shapley values. The core of breakdown is the so-called relaxed model prediction defined
in Equation (13.4). It is close in spirit to Equation (13.1). The difference is that we are
working at the local level, i.e., on one particular observation, say x∗ . We want to measure
the impact of a set of predictors on the prediction associated to x∗ ; hence, we fix two sets
k (fixed features) and −k (free features) and evaluate a proxy for the average prediction
of the estimated model fˆ when the set k of predictors is fixed at the values of x∗ , that is,
equal to x∗k in the expression below:
$$\tilde{f}_k(x^*) = \frac{1}{M}\sum_{m=1}^M \hat{f}\left(\mathbf{x}^{(m)}_{-k}, x_k^*\right). \qquad (13.4)$$
The x(m) in the above expression are either simulated values of instances or simply sampled
values from the dataset. The notation implies that the instance has some values replaced by
those of x∗ , namely those that correspond to the indices k. When k consists of all features,
then f˜k (x∗ ) is equal to the raw model prediction fˆ(x∗ ) and when k is empty, it is equal to
the average sample value of the label (constant prediction).
The contribution of a feature j, given the fixed set k, can then be measured as the change in the relaxed prediction when j is added to the fixed set: $\phi_k^j(x^*) = \tilde{f}_{k \cup \{j\}}(x^*) - \tilde{f}_k(x^*)$. Just as for Shapley values, this indicator computes an average impact when augmenting the set of predictors with feature j. By definition, it depends on the set k, so this is one notable difference with Shapley values (that span all permutations). In Staniak and
Biecek (2018), the authors devise a procedure that incrementally increases or decreases the
set k. This greedy idea helps alleviate the burden of computing all possible combinations
of features. Moreover, a very convenient property of their algorithm is that the sum of all
contributions is equal to the predicted value:
$$\sum_j \phi_k^j(x^*) = \hat{f}(x^*).$$
The visualization makes that very easy to see (as in Figure 13.7 below).
In order to illustrate one implementation of breakdown, we train a random forest on a
limited number of features, as shown below. This will increase the readability of the output
of the breakdown.
fit_RF_short = RandomForestRegressor(
    n_estimators=12,                # Nb of random trees
    criterion='squared_error',      # Function to measure the quality of a split
    min_samples_leaf=250,           # Minimum size of terminal cluster
    max_features=4,                 # Nb of predictive variables for each tree
    bootstrap=True,                 # Trees are built on samples drawn with replacement
    max_samples=10000)              # Size of (random) sample for each tree
fit_RF_short.fit(
    training_sample[features_short], training_sample['R1M_Usd'].values)   # Fitting the model
RandomForestRegressor(max_features=4, max_samples=10000,
min_samples_leaf=250,n_estimators=12)
Once the model is trained, the syntax for the breakdown of predictions is very simple.
import dalex as dx

ex = dx.Explainer(fit_RF_short,                         # Trained model to explain (constructor call is an assumption)
                  training_sample[features_short],      # Features
                  y=training_sample['R1M_Usd'].values,  # Labels
                  label='fit_RF_short')
instance = pd.DataFrame(training_sample.loc[0, features_short]).T   # Transpose to adapt to the required format
pp = ex.predict_parts(instance, type='break_down')                   # Compute the breakdown
pp.plot()                                                            # Visual display
The graphical output is intuitively interpreted. The grey bar is the prediction of the model
at the chosen instance. Green bars signal a positive contribution, and the red rectangles
show the variables with negative impact. The relative sizes indicate the importance of each
feature.
14
Two key concepts: causality and non-stationarity
predictions to be based on the most recent trends. In Section 14.2 below, we introduce
other theoretical and practical options.
14.1 Causality
Traditional machine learning models aim to uncover relationships between variables but
do not usually specify directions for these relationships. One typical example is the linear
regression. If we write y = a + bx + , then it is also true that x = b−1 (y − a − ), which
is of course also a linear relationship (with respect to y). These equations do not define
causation whereby x would be a clear determinant of y (x → y, but the opposite could be
false).
that is, when the distribution of future values of Yt , conditionally on the knowledge of both
processes is not the same as the distribution with the sole knowledge of the filtration FY,t .
Hence X does have an impact on Y because its trajectory alters that of Y .
Now, this formulation is too vague and impossible to handle numerically, thus we simplify
the setting via a linear formulation. We keep the same notations as section 5 of the original
paper by Granger (1969). The test consists of two regressions:
$$X_t = \sum_{j=1}^m a_j X_{t-j} + \sum_{j=1}^m b_j Y_{t-j} + \epsilon_t,$$
$$Y_t = \sum_{j=1}^m c_j X_{t-j} + \sum_{j=1}^m d_j Y_{t-j} + \nu_t,$$
where, for simplicity, it is assumed that both processes have zero mean. The usual assumptions apply: the Gaussian noises $\epsilon_t$ and $\nu_t$ are uncorrelated in every possible way (mutually and through time). The test is the following: if one bj is non-zero, then it is said that Y
Granger-causes X, and if one cj is non-zero, X Granger-causes Y . The two are not mutually
exclusive and it is widely accepted that feedback loops can very well occur.
Statistically, under the null hypothesis, $b_1 = \dots = b_m = 0$ (resp. $c_1 = \dots = c_m = 0$), which can be tested using the usual Fisher distribution. Obviously, the linear restriction can be
dismissed, but the tests are then much more complex. The main financial article in this
direction is Hiemstra and Jones (1994).
We test if market capitalization averaged over the past 6 months Granger-causes 1-month
ahead returns for one particular stock (the first in the sample).
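The corresponding call is not reproduced in this excerpt; with statsmodels, it could be written along the following lines (column order matters: the series in the second column is tested as the Granger cause of the first):

from statsmodels.tsa.stattools import grangercausalitytests

stock1 = data_ml[data_ml["stock_id"] == 1]             # First stock in the sample
granger_data = stock1[["R1M_Usd", "Mkt_Cap_6M_Usd"]]   # Effect first, candidate cause second
grangercausalitytests(granger_data, maxlag=[6])        # Only the specification with 6 lags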
Granger Causality
number of lags (no zero) 6
ssr based F test: F=4.1110 , p=0.0008 , df_denom=149, df_num=6
ssr based chi2 test: chi2=26.8179 , p=0.0002 , df=6
likelihood ratio test: chi2=24.8162 , p=0.0004 , df=6
parameter F test: F=4.1110 , p=0.0008 , df_denom=149, df_num=6
The test is directional and only tests if X Granger-causes Y . In order to test the reverse
effect, it is required to inverse the arguments in the function. In the output above, the p-
value is very low, hence the probability of observing samples similar to ours knowing that H0
holds is negligible. Thus it seems that market capitalization does Granger-cause one-month
returns. We nonetheless underline that Granger causality is arguably weaker than the one
defined in the next subsection. A process that Granger-causes another one simply contains
useful predictive information, which is not proof of causality in a strict sense. Moreover,
our test is limited to a linear model and including non-linearities may alter the conclusion.
Lastly, including other regressors (possibly omitted variables) could also change the results
(see, e.g., Chow et al. (2002)).
because hacking the barometer will have no impact on the weather. In short notation, when
there is an intervention on the barometer, P [weather|do(barometer)] = P [weather]. This is
an interesting example related to causality. The overarching variable is pressure. Pressure
impacts both the weather and the barometer, and this joint effect is called confounding.
However, it may not be true that the barometer impacts the weather. The interested reader
who wants to dive deeper into these concepts should have a closer look at the work of Judea
Pearl. Do-calculus is a very powerful theoretical framework, but it is not easy to apply it
to any situation or dataset (see for instance the book review by Aronow and Sävje (2019)).
While we do not formally present an exhaustive tour of the theory behind causal inference,
we wish to show some practical implementations because they are easy to interpret. It
is always hard to single out one type of model in particular, so we choose one that can
be explained with simple mathematical tools. We start with the simplest definition of a
structural causal model (SCM), where we follow here chapter 3 of Peters et al. (2017). The
idea behind these models is to introduce some hierarchy (i.e., some additional structure) in
the model. Formally, this gives
$$X = \epsilon_X, \qquad Y = f(X, \epsilon_Y),$$
where $\epsilon_X$ and $\epsilon_Y$ are independent noise variables. Plainly, a realization of X is drawn randomly and then has an impact on the realization of Y via f. Now this scheme could
be more complex if the number of observed variables was larger. Imagine a third variable
comes in so that
$$X = \epsilon_X, \qquad Y = f(X, \epsilon_Y), \qquad Z = g(Y, \epsilon_Z).$$
In this case, X has a causation effect on Y , and then Y has a causation effect on Z. We
thus have the following connections:
$$X \rightarrow Y \rightarrow Z,$$
where, in addition, the noise terms $\epsilon_X$, $\epsilon_Y$ and $\epsilon_Z$ feed into X, Y and Z, respectively.
The above representation is called a graph, and graph theory has its own nomenclature,
which we very briefly summarize. The variables are often referred to as vertices (or nodes)
and the arrows as edges. Because arrows have a direction, they are called directed edges.
When two vertices are connected via an edge, they are called adjacent. A sequence of
adjacent vertices is called a path, and it is directed if all edges are arrows. Within a directed
path, a vertex that comes first is a parent node and the one just after is a child node.
Graphs can be summarized by adjacency matrices. An adjacency matrix A = Aij is a
matrix filled with zeros and ones. Aij = 1 whenever there is an edge from vertex i to vertex
j. Usually, self-loops (X → X) are prohibited so that adjacency matrices have zeros on
the diagonal. If we consider a simplified version of the above graph like X → Y → Z, the
corresponding adjacency matrix is
$$A = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix},$$
where letters X, Y , and Z are naturally ordered alphabetically. There are only two arrows:
from X to Y (first row, second column) and from Y to Z (second row, third column).
A cycle is a particular type of path that creates a loop, i.e., when the first vertex is also
the last. The sequence X → Y → Z → X is a cycle. Technically, cycles pose problems.
To illustrate this, consider the simple sequence X → Y → X. This would imply that a realization of X causes Y, which in turn would cause the realization of X. While Granger
causality can be viewed as allowing this kind of connection, general causal models usually
avoid cycles and work with directed acyclic graphs (DAGs).
Equipped with these tools, we can write a very general form of models:
$$X_j = f_j\left(X_{pa_D(j)}, \epsilon_j\right),$$
where the noise variables $\epsilon_j$ are mutually independent. The notation $pa_D(j)$ refers to the set of parent nodes of vertex j within the graph structure D. Hence, $X_j$ is a function of all of its parents and some noise term $\epsilon_j$. An additive causal model is a mild simplification of the above specification:
$$X_j = \sum_{k \in pa_D(j)} f_{j,k}(X_k) + \epsilon_j, \qquad (14.2)$$
where the non-linear effect of each variable is cumulative, hence the term ‘additive’. Note
that there is no time index there. In contrast to Granger causality, there is no natural
ordering. Such models are very complex and hard to estimate. The details can be found in
Bühlmann et al. (2014).
Below, we build the adjacency matrix pertaining to the small set of predictor variables plus the one-month ahead return (on the training sample), using the ICPy package.
The matrix is not too sparse, which means that the model has uncovered many relationships
between the variables within the sample. Sadly, none are in the direction that is of interest
for the prediction task that we seek. Indeed, the first variable is the one we want to predict
and its column is empty. However, its row is full, which indicates the reverse effect: future
returns cause the predictor values, which may seem rather counter-intuitive, given the nature
of features.
For the sake of completeness, we also provide an implementation based on the Python version of the pcalg package (Kalisch et al. (2012)). Below, an estimation via the so-called PC algorithm (named after
its authors Peter Spirtes and Clark Glymour) is performed. The details of the algorithm
are out of the scope of the book, and the interested reader can have a look at section 5.4
of Spirtes et al. (2000) or section 2 from Kalisch et al. (2012) for more information on this
subject.
import cdt
import networkx as nx

data_caus = training_sample[features_short+["R1M_Usd"]]
dm = np.array(data_caus)
cm = np.corrcoef(dm.T)                         # Compute correlations
df = pd.DataFrame(cm)
glasso = cdt.independence.graph.Glasso()       # Initialize graph lasso
skeleton = glasso.predict(df)                  # Apply graph lasso to the dataset
model_pc = cdt.causality.graph.PC()            # PC algo. from the pcalg R library
graph_pc = model_pc.predict(df, skeleton)      # Estimate the model
fig = plt.figure(figsize=[10, 6])
nx.draw_networkx(graph_pc)                     # Plot the estimated graph
A bidirectional arrow is shown when the model was unable to determine the edge orientation.
While the adjacency matrix is different compared to the first model, there are still no
predictors that seem to have a clear causal effect on the dependent variable (first circle).
$$y_t = \mathbf{Z}_t'\boldsymbol{\alpha}_t + \boldsymbol{\epsilon}_t, \qquad \boldsymbol{\alpha}_{t+1} = \mathbf{T}_t\boldsymbol{\alpha}_t + \mathbf{R}_t\boldsymbol{\eta}_t.$$
The dependent variable is expressed as a linear function of state variables αt plus an error
term. These variables are in turn linear functions of their past values plus another error
term which can have a complex structure (it’s a product of a matrix Rt with a centered
Gaussian term η t ). This specification nests many models as special cases, like ARIMA for
instance.
The goal of Brodersen et al. (2015) is to detect causal impacts via regime changes. They
estimate the above model over a given training period and then predict the model’s response
on some test set. If the aggregate (summed/integrated) error between the realized versus
predicted values is significant (based on some statistical test), then the authors conclude
that the breaking point is relevant. Originally, the aim of the approach is to quantify the
effect of an intervention by looking at how a model trained before the intervention behaves
after the intervention.
Below, we test if the 100th date point in the sample (April 2008) is a turning point. Arguably,
this date belongs to the time span of the subprime financial crisis. We use the CausalImpact
module, Python version.
The time series associated with the model are shown in Figure 14.2.
from causalimpact import CausalImpact

stock1_data = data_ml.loc[data_ml["stock_id"]==1, :]           # Data of first stock
struct_data = stock1_data[["Advt_3M_Usd"]+features_short]      # Combine label and features
struct_data.index = pd.RangeIndex(start=0, stop=228, step=1)   # Setting index as int
pre_period = [0, 99]                                           # Pre-break period (pre-2008)
post_period = [100, 199]                                       # Post-break period
impact = CausalImpact(struct_data, pre_period, post_period)    # Causal model created
impact.run()                                                   # Run!
print(impact.summary())                                        # Summary analysis
impact.plot()                                                  # Plot!
1 See for instance the papers on herding in factor investing: Krkoska and Schenk-Hoppé (2019) and Santi
1. covariate shift: PX changes but PY |X does not: the features have a fluctuating
distribution, but their relationship with Y holds still;
2. concept drift: PY |X changes but PX does not: feature distributions are stable, but
their relation to Y is altered.
Obviously, we omit the case when both items change, as it is too complex to handle. In factor
investing, the feature engineering process (see Section 4.4) is partly designed to bypass the
risk of covariate shift. Uniformization guarantees that the marginals stay the same, but
correlations between features may of course change. The main issue is probably concept
drift when the way features explain the label changes through time. In Cornuejols et al.
(2018),2 the authors distinguish four types of drifts, which we reproduce in Figure 14.3. In
factor models, changes are presumably a combination of all four types: they can be abrupt
during crashes, but most of the time they are progressive (gradual or incremental) and
never-ending (continuously recurring).
2 This reference is written in French.
rapidly). For simple regressions, this idea is known as weighted least squares, wherein errors are weighted inside the loss:
$$L = \sum_{i=1}^I w_i (y_i - \mathbf{x}_i\mathbf{b})^2.$$
In matrix terms, $L = (\mathbf{y} - \mathbf{Xb})'\mathbf{W}(\mathbf{y}-\mathbf{Xb})$, where $\mathbf{W}$ is a diagonal matrix of weights. The gradient with respect to $\mathbf{b}$ is equal to $2\mathbf{X}'\mathbf{W}\mathbf{X}\mathbf{b} - 2\mathbf{X}'\mathbf{W}\mathbf{y}$, so that the loss is minimized for $\mathbf{b}^* = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{y}$. The standard least-squares solution is recovered for $\mathbf{W} = \mathbf{I}$. In order to fine-tune the reactiveness of the model, the weights must be a function that decreases as instances become older in the sample.
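As a toy illustration (simulated data, not from the book's sample), the closed-form solution with exponentially decaying weights can be coded as follows:

import numpy as np

rng = np.random.default_rng(42)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=T)])   # Constant + one predictor
y = 0.5 + 2.0 * X[:, 1] + rng.normal(size=T)            # Simulated labels
w = 0.97 ** np.arange(T - 1, -1, -1)                    # Older observations get smaller weights
W = np.diag(w)                                          # Diagonal weight matrix
b_star = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)      # b* = (X'WX)^{-1} X'Wy
print(b_star)                                           # Close to the true (0.5, 2.0)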
There is of course no perfect solution to changing financial environments. Below, we men-
tion two routes that are taken in the ML literature to overcome the problem of non-
stationarity in the data generating process. But first, we propose yet another clear veri-
fication that markets do experience time-varying distributions.
data_ml["year"] = pd.to_datetime(data_ml['date']).dt.year
# Adding a year column for later groupby
data_ml.groupby("year")["R1M_Usd"].mean().plot.bar(figsize=[16,6])
# Aggregating and plotting
These changes in the mean are also accompanied by variations in the second moment (vari-
ance/volatility). This effect, known as volatility clustering, has been widely documented
ever since the theoretical breakthrough of Engle (1982) (and even well before). We refer for
instance to Cont (2007) for more details on this topic.
In terms of machine learning models, this is also true. Below, we estimate a pure characteristic regression with one predictor, the market capitalization averaged over the past 6 months ($r_{t+1,n} = \alpha + \beta x^{cap}_{t,n} + \epsilon_{t+1,n}$). The label is the 6-month forward return and the estimation is performed over every calendar year.
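The estimation code behind Figure 14.5 is not reproduced in this excerpt; a sketch with statsmodels (assuming the 6-month label is stored in a column named R6M_Usd) could be:

import statsmodels.api as sm

betas = {}
for year, group in data_ml.groupby("year"):                       # One regression per calendar year
    X_year = sm.add_constant(group["Mkt_Cap_6M_Usd"])             # Constant + capitalization
    fit = sm.OLS(group["R6M_Usd"], X_year, missing='drop').fit()  # OLS estimation
    betas[year] = fit.params["Mkt_Cap_6M_Usd"]                    # Store the slope coefficient
pd.Series(betas).plot.bar(figsize=[16, 6])                        # Bar plot of the yearly betas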
The bars in Figure 14.5 highlight the concept drift: overall, the relationship between capital-
ization and returns is negative (the size effect again). Sometimes it is markedly negative,
sometimes, not so much. The ability of capitalization to explain returns is time-varying,
and models must adapt accordingly.
The problem is that if a 2019 model is trained on data from 2010 to 2019, the (dynamic)
2020 model will have to be retrained with the whole dataset, including the latest points
from 2020. This can be heavy, and including just the latest points in the learning process
would substantially decrease its computational cost. In neural networks, the sequential
batch updating of weights can allow a progressive change in the model. Nonetheless, this is
typically impossible for decision trees because the splits are decided once and for all. One
notable exception is Basak (2004), but, in that case, the construction of the trees differs
strongly from the original algorithm.
The simplest example of online learning is the Widrow-Hoff algorithm (originally from
Widrow and Hoff (1960)). The idea comes from the so-called ADALINE (ADAp-
tive LInear NEuron) model which is a neural network with one hidden layer with linear
activation function (i.e., like a perceptron, but with a different activation).
Suppose the model is linear, that is y = Xb + e (a constant can be added to the list of
predictors) and that the amount of data is both massive and coming in at a high frequency so
that updating the model on the full sample is proscribed because it is technically intractable.
A simple and heuristic way to update the values of $\mathbf{b}$ is to compute
$$\mathbf{b}_{t+1} = \mathbf{b}_t - \nu(\mathbf{x}_t\mathbf{b}_t - y_t)\mathbf{x}_t',$$
where $\mathbf{x}_t$ is the row vector of instance t. The justification is simple. The quadratic error $(\mathbf{x}_t\mathbf{b} - y_t)^2$ has a gradient with respect to $\mathbf{b}$ equal to $2(\mathbf{x}_t\mathbf{b} - y_t)\mathbf{x}_t'$; therefore, the above update is a simple example of gradient descent. $\nu$ must of course be quite small: if not, each new point will considerably alter $\mathbf{b}$, thereby resulting in a volatile model.
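A toy illustration of this update (simulated data, one pass over the sample) is sketched below:

import numpy as np

rng = np.random.default_rng(0)
T, K = 500, 3
X = rng.normal(size=(T, K))                  # Streaming features
true_b = np.array([0.5, -1.0, 2.0])
y = X @ true_b + 0.1 * rng.normal(size=T)    # Streaming labels

nu = 0.05                                    # Small learning rate
b = np.zeros(K)                              # Initial coefficients
for t in range(T):                           # One update per incoming instance
    x_t, y_t = X[t, :], y[t]
    b = b - nu * (x_t @ b - y_t) * x_t       # Widrow-Hoff (online gradient) step
print(b)                                     # Should be close to true_b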
An exhaustive review of techniques pertaining to online learning is presented in Hoi et al.
(2018) (section 4.11 is even dedicated to portfolio selection). The book Hazan et al. (2016)
covers online convex optimization, which is a very close domain with a large overlap with
online learning. The presentation below is adapted from the second and third parts of the
first survey.
Datasets are indexed by time: we write Xt and yt for features and labels (the usual column
index (k) and row index (i) will not be used in this section). Time has a bounded horizon
T . The machine learning model depends on some parameters θ and we denote it with fθ . At
time t (when dataset (Xt ,yt ) is gathered), the loss function L of the trained model naturally
depends on the data (Xt ,yt ) and on the model via θt which are the parameter values fitted
to the time-t data. For notational simplicity, we henceforth write Lt (θ t ) = L(Xt , yt , θ t ).
The key quantity in online learning is the regret over the whole time sequence:
$$R_T = \sum_{t=1}^T L_t(\boldsymbol{\theta}_t) - \inf_{\boldsymbol{\theta}^* \in \Theta} \sum_{t=1}^T L_t(\boldsymbol{\theta}^*). \qquad (14.3)$$
The regret is the total loss incurred by the models θ t minus the minimal loss that could
have been obtained with full knowledge of the data sequence (hence computed in hindsight).
The basic methods in online learning are in fact quite similar to the batch-training of neural
networks. The updating of the parameter is based on
$$\mathbf{z}_{t+1} = \boldsymbol{\theta}_t - \eta_t \nabla L_t(\boldsymbol{\theta}_t), \qquad (14.4)$$
where $\nabla L_t(\boldsymbol{\theta}_t)$ denotes the gradient of the current loss $L_t$. One problem that can arise is when $\mathbf{z}_{t+1}$ falls out of the bounds that are prescribed for $\boldsymbol{\theta}_t$. Thus, the candidate vector for the new parameters, $\mathbf{z}_{t+1}$, is projected onto the feasible domain, which we call S here:
$$\boldsymbol{\theta}_{t+1} = \Pi_S(\mathbf{z}_{t+1}), \quad \text{with} \quad \Pi_S(\mathbf{z}) = \underset{\boldsymbol{\theta} \in S}{\text{argmin}}\ \|\boldsymbol{\theta} - \mathbf{z}\|_2. \qquad (14.5)$$
Hence $\boldsymbol{\theta}_{t+1}$ is as close as possible to the intermediate choice $\mathbf{z}_{t+1}$. In Hazan et al. (2007), it is shown that under suitable assumptions (e.g., $L_t$ being strictly convex with bounded gradient $\sup_{\boldsymbol{\theta}} \|\nabla L_t(\boldsymbol{\theta})\| \le G$), the regret $R_T$ satisfies
$$R_T \le \frac{G^2}{2H}(1 + \log(T)),$$
where H is a scaling factor for the learning rates (also called step sizes): $\eta_t = (Ht)^{-1}$.
More sophisticated online algorithms generalize (14.4) and (14.5) by integrating the Hessian matrix $\nabla^2 L_t(\boldsymbol{\theta}) := [\nabla^2 L_t]_{i,j} = \frac{\partial^2}{\partial\theta_i \partial\theta_j} L_t(\boldsymbol{\theta})$ and/or by including penalizations to reduce instability in $\boldsymbol{\theta}_t$. We refer to section 2 in Hoi et al. (2018) for more details on these extensions.
An interesting stream of parameter updating is that of the passive-aggressive algorithms
(PAAs) formalized in Crammer et al. (2006). The base case involves classification tasks, but
we stick to the regression setting below (section 5 in Crammer et al. (2006)). One strong limitation with PAAs is that they rely on the set of parameters where the loss is either zero or negligible: $\Theta^* = \{\theta,\; L_t(\theta) < \epsilon\}$. For general loss functions and learner f, this set is largely inaccessible. Thus, the algorithms in Crammer et al. (2006) are restricted to a particular case, namely linear f and $\epsilon$-insensitive hinge loss:
$$L_t(\theta) = \max\left(0,\; |\theta' x_t - y_t| - \epsilon\right),$$
for some parameter $\epsilon > 0$. If the weight $\theta$ is such that the model is close enough to the true value, then the loss is zero; if not, it is equal to the absolute value of the error minus $\epsilon$. In PAA, the update of the parameter is given by
$$\theta_{t+1} = \underset{\theta}{\text{argmin}}\; \|\theta - \theta_t\|^2, \quad \text{subject to} \quad L_t(\theta) = 0,$$
hence the new parameter values are chosen such that two conditions are satisfied:
1. the loss is zero (by the definition of the loss, this means that the model is close enough to the true value), and
2. the new parameters are as close as possible to the previous ones: the update is passive when the loss is already zero and aggressive otherwise (see the sketch below).
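A minimal sketch of this update for regression is given below (the value of $\epsilon$ and the data are illustrative; see Crammer et al. (2006) for the exact formulation and its variants with slack).

import numpy as np

def pa_update(theta, x_t, y_t, epsilon=0.01):
    # Passive-aggressive step: do nothing if the epsilon-insensitive loss is zero,
    # otherwise make the smallest change to theta that brings the loss back to zero
    loss = max(0.0, abs(theta @ x_t - y_t) - epsilon)
    if loss == 0.0:                                   # Passive: the model is close enough
        return theta
    tau = loss / (x_t @ x_t)                          # Aggressive: size of the correction
    return theta + tau * np.sign(y_t - theta @ x_t) * x_t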
A different stream of online learning is dedicated to the allocation of portfolios, following Cover (1991) in particular. The setting is the following. The function f is assumed to be linear, $f(x_t) = \theta' x_t$, and the data $x_t$ consists of asset returns; thus, the values are portfolio returns as long as $\theta' \mathbf{1}_N = 1$ (budget constraint). The loss functions $L_t$ correspond to a concave utility function (e.g., logarithmic) and the regret is reversed:
$$R_T = \sup_{\theta^* \in \Theta} \sum_{t=1}^{T} L_t(r_t'\theta^*) - \sum_{t=1}^{T} L_t(r_t'\theta_t),$$
where $r_t$ are the returns. Thus, the program is transformed to maximize a concave function.
Several articles (often from the Computer Science or ML communities) have proposed solutions to this type of problem: Blum and Kalai (1999), Agarwal et al. (2006), and Hazan
et al. (2007). Most contributions work with price data only, with the notable exception
of Cover and Ordentlich (1996), which mentions external data (‘side information’). In the
latter article, it is proven that constantly rebalanced portfolios distributed according to two
random distributions achieve growth rates that are close to the unattainable optimal rates.
The two distributions are the uniform law (equally weighting, once again) and the Dirichlet
distribution with constant parameters equal to 1/2. Under this universal distribution, Cover
and Ordentlich (1996) show that the wealth obtained is bounded below by:
$$\text{wealth}_{\text{universal}} \ge \frac{\text{wealth from optimal strategy}}{2(n+1)^{(m-1)/2}},$$
where m is the number of assets and n is the number of periods.
The literature on online portfolio allocation is reviewed in Li and Hoi (2014) and outlined in more detail in Li and Hoi (2018). Online learning, combined with early stopping for neural networks, is applied to factor investing in Wong et al. (2020). Finally, online learning is associated with clustering methods for portfolio choice in Khedmati and Azin (2020).
In the literature, domain adaptation is sometimes used as a synonym for transfer learning. Because of a data shift, we must adapt the model to increase its accuracy. These topics are reviewed in a series of chapters in the collection by Quionero-Candela et al. (2009).
An important and elegant result in the theory was proven by Ben-David et al. (2010) in
the case of binary classification. We state it below. We consider f and h two classifiers with
values in {0, 1}. The average error between the two over the domain S is defined by
$$\epsilon_S(f, h) = E_S\left[|f(x) - h(x)|\right].$$
Then,
$$\epsilon_T(f_T, h) \le \epsilon_S(f_S, h) + \underbrace{2\sup_B |P_S(B) - P_T(B)|}_{\text{difference between domains}} + \underbrace{\min\left(E_S[|f_S(x) - f_T(x)|],\; E_T[|f_S(x) - f_T(x)|]\right)}_{\text{difference between the two learning tasks}},$$
where $P_S$ and $P_T$ denote the distributions of the two domains. The above inequality is a bound on the generalization performance of h. If we take $f_S$ to be the best possible classifier for S and $f_T$ the best for T, then the error generated by h in T is smaller than the sum of three components: the error in the S space, the distance between the two domains (by how much the data space has shifted), and the distance between the two best models (generators).
One solution that is often mentioned in transfer learning is instance weighting. We present
it here in a general setting. In machine learning, we seek to minimize
$$\epsilon_T(f) = E_T\left[L(y, f(X))\right],$$
where L is some loss function that depends on the task (regression versus classification). This can be rearranged as
$$\epsilon_T(f) = E_T\left[\frac{P_S(y, X)}{P_S(y, X)} L(y, f(X))\right] = \sum_{y, X} P_T(y, X)\, \frac{P_S(y, X)}{P_S(y, X)}\, L(y, f(X)) = E_S\left[\frac{P_T(y, X)}{P_S(y, X)} L(y, f(X))\right],$$
so that the loss can be evaluated on the source domain S, provided each instance is weighted by the density ratio $P_T(y, X)/P_S(y, X)$. Instances are thus over-weighted whenever the user believes they carry more relevant information. Naturally, it then always remains to minimize this loss.
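In practice, this amounts to passing importance weights to the loss minimization. A hedged sketch with scikit-learn follows (the weights w are assumed to have been estimated beforehand, e.g., with a domain classifier; the data is purely illustrative).

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)                        # Illustrative source-domain data
X_source = rng.normal(size=(500, 10))
y_source = X_source[:, 0] + 0.1 * rng.normal(size=500)
w = rng.uniform(0.5, 2.0, size=500)                   # Placeholder for P_T(y,X)/P_S(y,X)

model = Ridge(alpha=1.0)
model.fit(X_source, y_source, sample_weight=w)        # Weighted loss minimization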
We close this topic by mentioning a practical application of transfer learning developed in Koshiyama et al. (2020). The authors propose a neural network architecture that allows the learning process of trading strategies to be shared across several markets. This method is, among other things, aimed at alleviating the backtest overfitting problem.
15
Unsupervised learning
All algorithms presented in Chapters 5 to 9 belong to the larger class of supervised learning
tools. Such tools seek to unveil a mapping between predictors X and a label Z. The supervision comes from the fact that the model is explicitly asked to explain this particular variable Z. Another important part of machine learning consists of unsupervised tasks, that is, when Z is not specified and the algorithm tries to make sense of X on its own. Often,
relationships between the components of X are identified. This field is much too vast to be
summarized in one book, let alone one chapter. The purpose here is to briefly explain in
what ways unsupervised learning can be used, especially in the data pre-processing phase.
$$\Sigma = X'X = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad \Sigma^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}.$$
When the covariance/correlation $\rho$ increases towards 1 (the two variables are co-linear), the scaling denominator $1-\rho^2$ in $\Sigma^{-1}$ goes to zero and the formula $\hat{\beta} = \Sigma^{-1}X'Z$ implies that one coefficient will be highly positive and one highly negative. The regression creates a spurious arbitrage between the two variables. Of course, this is very inefficient and yields disastrous results out-of-sample.
We illustrate what happens when many variables are used in the regression below (Table 15.1). One illustration of the aforementioned phenomenon comes from the variables Mkt_Cap_12M_Usd and Mkt_Cap_6M_Usd, which have a correlation of 99.6% in the training sample. Both are singled out as highly significant, but their signs are contradictory.
Moreover, the magnitude of their coefficients is very close (0.21 versus 0.18), so that their
net effect cancels out. Naturally, providing the regression with only one of these two inputs
would have been wiser.
import statsmodels.api as sm
stat=sm.OLS(training_sample['R1M_Usd'],
            sm.add_constant(training_sample[features])).fit()
# Model: predict R1M_Usd with all features
reg_thrhld=3
# Threshold for the t-statistics
significant_regressors=stat.tvalues[abs(stat.tvalues)>reg_thrhld]
# Keep significant predictors only (minimal reconstruction of the original filter)
significant_regressors
In fact, there are several indicators for the market capitalization and maybe only one would
suffice, but it is not obvious to tell which one is the best choice.
To further depict correlation issues, we compute the correlation matrix of the predictors
below (on the training sample). Because of its dimension, we show it graphically.
sns.set(rc={'figure.figsize':(16,16)})
# Setting the figsize in seaborn
sns.heatmap(training_sample[features].corr())
# Correlation matrix and plot
The graph of Figure 15.1 reveals several light squares around the diagonal. For instance,
the biggest square around the first third of features relates to all accounting ratios based on
free cash flows. Because of this common term in their calculation, the features are naturally
highly correlated. These local correlation patterns occur several times in the dataset and
explain why it is not a good idea to use simple regression with this set of features.
In full disclosure, multicollinearity (when predictors are correlated) can be much less of a problem for ML tools than it is for pure statistical inference. In statistics, one central goal is to study the properties of the $\beta$ coefficients. Collinearity perturbs this kind of analysis. In
machine learning, the aim is to maximize out-of-sample accuracy. If having many predictors
can be helpful, then so be it. One simple example can help clarify this matter. When
building a regression tree, having many predictors will give more options for the splits. If
the features make sense, then they can be useful. The same reasoning applies to random
forests and boosted trees. What does matter is that the large spectrum of features helps
improve the generalization ability of the model. Their collinearity is irrelevant.
In the remainder of the chapter, we present two approaches that help reduce the number of
predictors:
• the first one aims at creating new variables that are uncorrelated with each other. Low
correlation is favorable from an algorithmic point of view, but the new variables lack
interpretability;
• the second one gathers predictors into homogeneous clusters, from which only one feature should be chosen. Here the rationale is reversed: interpretability is favored over statistical properties because the resulting set of features may still include high correlations, albeit to a lesser extent compared to the original one.
15.2 Principal component analysis and autoencoders

A symmetric matrix (in this short paragraph, X denotes a generic symmetric matrix, e.g., a covariance matrix) can be decomposed as
$$X = QDQ', \qquad (15.2)$$
where Q is an orthogonal matrix whose columns are the eigenvectors of X and D is the diagonal matrix of the corresponding eigenvalues. This process is called diagonalization (see chapter 7 in Meyer (2000)) and conveniently applies to covariance matrices.
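As a quick numerical check (with purely synthetic data), NumPy's eigendecomposition recovers exactly this factorization.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                         # Synthetic sample (illustrative)
Sigma = np.cov(X, rowvar=False)                       # Covariance matrix (symmetric)
D, Q = np.linalg.eigh(Sigma)                          # Eigenvalues (D) and eigenvectors (Q)
print(np.allclose(Q @ np.diag(D) @ Q.T, Sigma))       # Sigma = Q D Q' (up to numerical precision)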
15.2.2 PCA
The goal of PCA is to build a dataset X̃ that has fewer columns, but that keeps as much
information as possible when compressing the original one, X. The key notion is the change
of basis, which is a linear transformation of X into Z, a matrix with identical dimensions, via
$$Z = XP, \qquad (15.3)$$
where P is a K × K matrix. There are of course an infinite number of ways to transform X
into Z, but two fundamental constraints help reduce the possibilities. The first constraint
is that the columns of Z be uncorrelated. Having uncorrelated features is desirable because
they then all tell different stories and have zero redundancy. The second constraint is that
the variance of the columns of Z is highly concentrated. This means that a few factors
(columns) will capture most of the explanatory power (signal), while most (the others) will
consist predominantly of noise. All of this is coded in the covariance matrix of Z:
• the first condition imposes that the covariance matrix be diagonal;
• the second condition imposes that the diagonal elements, when ranked in decreasing magnitude, see their value decline (sharply if possible).
The covariance matrix of Z is
$$\Sigma_Z = \frac{1}{I-1} Z'Z = \frac{1}{I-1} P'X'XP = P'\Sigma_X P. \qquad (15.4)$$
from sklearn import decomposition
# Module that hosts PCA
pca = decomposition.PCA(n_components=7)
# we impose the number of components
pca.fit(training_sample[features_short])
# Performs PCA on smaller number of predictors
print(pca.explained_variance_ratio_)
# Checking the variance explained per component
P=pd.DataFrame(pca.components_,columns=features_short).T
# Rotation (n x k) = (7 x 7)
P.columns = ['P' + str(col) for col in P.columns]
# tidying up columns names
P
P0 P1 P2 P3 P4 P5 P6
Div_Yld -0.2715 0.5790 0.0457 -0.5289 0.2266 0.5065 0.0320
Eps -0.4204 0.1500 -0.0247 0.3373 -0.7713 0.3018 0.0119
Mkt_Cap_12M_Usd -0.5238 -0.3432 0.1722 0.0624 0.2527 0.0029 0.7143
Mom_11M_Usd -0.0472 -0.0577 -0.8971 0.2410 0.2505 0.2584 0.0431
Ocf -0.5329 -0.1958 0.1850 0.2343 0.3575 0.0490 -0.6768
Pb -0.1524 -0.5808 -0.2210 -0.6821 -0.3086 0.0386 -0.1687
Vol1Y_Usd 0.4068 -0.3811 0.2821 0.1554 0.0615 0.7625 0.0086
The rotation gives the matrix P: it is the tool that changes the basis. The first line of the output indicates the proportion of variance explained by each new factor (column). Each factor is indicated via a principal component (PC) index. Often, the first PC (first column P0 in the output)
loads negatively on all initial features: a convex weighted average of all predictors is expected
to carry a lot of information. In the above example, it is almost the case, with the exception
of volatility, which has a positive coefficient in the first PC. The second PC is an arbitrage
between price-to-book (short) and dividend yield (long). The third PC is contrarian, as
it loads heavily and negatively on momentum. Not all principal components are easy to
interpret.
Sometimes, it can be useful to visualize the way the principal components are built. In
Figure 15.2, we show one popular representation that is used for two factors (usually the
first two).
The numbers indicated along the axes are the proportion of explained variance of each PC, i.e., the values reported in the first line of the output (explained_variance_ratio_).
Once the rotation is known, it is possible to select a subsample of the transformed data.
From the original 7 features, it is easy to pick just 4.
pd.DataFrame(pca.transform(training_sample[features_short])[:,:4],
             # Project the data onto the first 4 components (reconstructed call)
             columns=['PC1','PC2','PC3','PC4']
             # Change column names
             ).head()
# Show first 5 lines
These four factors can then be used as orthogonal features in any ML engine. The fact that
the features are uncorrelated is undoubtedly an asset. But the price of this convenience
is high: the features are no longer immediately interpretable. De-correlating the predictors
adds yet another layer of “blackbox-ing” in the algorithm.
PCA can also be used to estimate factor models. In Equation (15.3), it suffices to replace
Z with returns, X with factor values, and P with factor loadings (see, e.g., Connor and
Korajczyk (1988) for an early reference). More recently, Lettau and Pelger (2020a) and
Lettau and Pelger (2020b) propose a thorough analysis of PCA estimation techniques.
They notably argue that first moments of returns are important and should be included in
the objective function, alongside the optimization on the second moments.
We end this subsection with a technical note. Usually, PCA is performed on the covariance
matrix of returns. Sometimes, it may be preferable to decompose the correlation matrix.
The result may change substantially if the variables have very different variances (which is
not really the case in the equity space). If the investment universe encompasses several asset
classes, then a correlation-based PCA will reduce the importance of the most volatile class.
In this case, it is as if all returns are scaled by their respective volatilities.
15.2.3 Autoencoders
In a PCA, the coding from X to Z is straightforward, linear, and it works both ways:
$$Z = XP \quad \text{and} \quad X = ZP',$$
since P is orthogonal. If we take the truncated version and seek a smaller output (with only $k < K$ columns), this gives:
$$\tilde{Z}_{(I\times k)} = X_{(I\times K)}\, P_{k\,(K\times k)}, \qquad \breve{X}_{(I\times K)} = \tilde{Z}_{(I\times k)}\, P'_{k\,(k\times K)},$$
where $P_k$ is the restriction of P to the k columns that correspond to the factors with the largest variances. The dimensions of the matrices are indicated inside the brackets. In this case, the decoding cannot recover X exactly but only an approximation, which we write $\breve{X}$. This approximation is coded with less information, hence this new data $\breve{X}$ is compressed and provides a parsimonious representation of the original sample X.
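To make the compression and the approximate reconstruction explicit, here is a short sketch based on the pca object fitted above (keeping k = 4 components, as in the earlier example; note that scikit-learn centers the data before projecting).

k = 4
X_short = training_sample[features_short].values
P_k = pca.components_[:k, :].T                        # Restriction of P to the first k columns
Z_tilde = (X_short - pca.mean_) @ P_k                 # Compressed data (I x k)
X_breve = Z_tilde @ P_k.T + pca.mean_                 # Approximate reconstruction (I x K)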
15.2.4 Application
Autoencoders are easy to code in Keras (see Chapter 7 for more details on Keras). To
underline the power of the framework, we resort to another way of coding a NN: the so-
called functional API. For simplicity, we work with the small number of predictors (7). The
structure of the network consists of two symmetric networks with only one intermediate
layer containing 32 units. The activation function is sigmoid; this makes sense since the
input has values in the unit interval.
import tensorflow as tf
from tensorflow.keras.layers import Input

input_layer = Input(shape=(7,))
# features_short has 7 columns
encoder = tf.keras.layers.Dense(units=32, activation="sigmoid")(input_layer)
# First, encode
encoder = tf.keras.layers.Dense(units=4)(encoder)
# 4 dimensions for the output layer (same as PCA example)
decoder = tf.keras.layers.Dense(units=32, activation="sigmoid")(encoder)
# Then, from encoder, decode
decoder = tf.keras.layers.Dense(units=7)(decoder)
# the original sample has 7 features
In the training part, we optimize the MSE and use an Adam update of the weights (see
Section 7.2.3).
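A minimal sketch of this step, consistent with the training call below (the model definition simply wires the layers created above):

ae_model = tf.keras.Model(inputs=input_layer, outputs=decoder)    # Wrap the two symmetric networks into one model
ae_model.compile(optimizer="adam", loss="mse")                    # Adam updates of the weights, MSE loss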
Finally, we are ready to train the data onto itself! The evolution of loss on the training and
testing samples is depicted in Figure 15.3. The decreasing pattern shows the progress of the
quality in compression.
history=ae_model.fit(NN_train_features, # Input
NN_train_features, # Output
epochs=15,
batch_size=512,
validation_data=(NN_test_features, NN_test_features))
plot_history(history)
In order to get the details of all weights and biases, the syntax is the following.
ae_weights=ae_model.get_weights()
Retrieving the encoder and processing the data into the compressed format is just a matter
of matrix manipulation. In practice, it is possible to build a submodel by loading the weights
from the encoder (see exercise below).
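For illustration, a truncated model that maps the input to the 4-dimensional code can also be built directly from the layers defined above (one possible approach among others):

encoder_model = tf.keras.Model(inputs=input_layer, outputs=encoder)   # Truncated network: input -> 4-dim code
compressed_data = encoder_model.predict(NN_train_features)            # Training data in compressed format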
15.3 Clustering via k-means

The principle is simple: among a group of variables (the reasoning would be the same for observations in the other dimension) $x_{\{1\le j \le J\}}$, find the combination of $k < J$ groups that minimizes
$$\sum_{i=1}^{k} \sum_{x \in S_i} \|x - m_i\|^2, \qquad (15.8)$$
where $\|\cdot\|$ is some norm which is usually taken to be the Euclidean $l^2$-norm. The $S_i$ are the groups and the minimization is run on the whole set of groups S. The $m_i$ are the group means (also called centroids or barycenters): $m_i = (\text{card}(S_i))^{-1}\sum_{x\in S_i} x$.
In order to ensure optimality, all possible arrangements must be tested, which is pro-
hibitively long when k and J are large. Therefore, the problem is usually solved with greedy
algorithms that seek (and find) solutions that are not optimal but ‘good enough’.
One heuristic way to proceed is the following:
0. Start with a (possibly random) partition of k clusters.
1. For each cluster, compute the optimal mean values $m_i^*$ that minimize expression (15.8). This is a simple quadratic program.
2. Given the optimal centers m∗i , reassign the points xi so that they are all the closest to
their center.
3. Repeat steps 1. and 2. until the points do not change cluster at step 2.
Below, we illustrate this process with an example. From all 93 features, we build 10 clusters.
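A minimal sketch of such a clustering with scikit-learn (the features are clustered by transposing the training matrix; cluster labels are arbitrary and may not match the table below exactly):

from sklearn.cluster import KMeans
import pandas as pd

k_means = KMeans(n_clusters=10, n_init=10, random_state=42)       # 10 clusters of features
k_means.fit(training_sample[features].T)                          # Cluster the (transposed) predictor matrix
clusters = pd.DataFrame({'factor': features,                      # Map each feature to its cluster
                         'cluster': k_means.labels_})
clusters[clusters['cluster'] == 4]                                # Features belonging to one given cluster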
factor cluster
6 Capex_Ps_Cf 4
19 Eps 4
20 Eps_Basic 4
21 Eps_Basic_Gr 4
22 Eps_Contin_Oper 4
23 Eps_Dil 4
68 Op_Prt_Margin 4
69 Oper_Ps_Net_Cf 4
80 Sales_Ps 4
We single out the fourth cluster which is composed mainly of accounting ratios related to
the profitability of firms. Given these 10 clusters, we can build a much smaller group of
features that can then be fed to the predictive engines described in Chapters 5 to 9. The
representative of a cluster can be the member that is closest to the center, or simply the
center itself. This pre-processing step can nonetheless cause problems in the forecasting
phase. Typically, it requires that the training data be also clustered. The extension to the
testing data is not straightforward (the clusters may not be the same).
15.4 Nearest neighbors

The idea behind nearest neighbors is simple: in order to form a prediction for a given instance $x_i$, find the instances $x_j$ that are closest to it according to some distance measure, for example
$$D(x_j, x_i) = \sum_{k=1}^{K} c_k\, d_k(x_{j,k}, x_{i,k}), \qquad (15.9)$$
where the distance functions dk can operate on various data types (numerical, categorical,
etc.). For numerical values, dk (xj,k , xi,k ) = (xj,k − xi,k )2 or dk (xj,k , xi,k ) = |xj,k − xi,k |.
For categorical values, we refer to the exhaustive survey by Boriah et al. (2008) which lists
14 possible measures. Finally the $c_k$ in Equation (15.9) allow some flexibility by weighting features. This is useful because both raw values ($x_{i,k}$ versus $x_{i,k'}$) or measure outputs ($d_k$ versus $d_{k'}$) can have different scales.
Once the distances are computed over the whole sample, they are ranked using indices $l_1^i, \dots, l_I^i$:
$$D(x_{l_1^i}, x_i) \le D(x_{l_2^i}, x_i) \le \dots \le D(x_{l_I^i}, x_i).$$
Ties are not dealt with here, for the sake of simplicity and because they rarely occur in practice as long as there are sufficiently many numerical predictors.
Given these neighbors, it is now possible to build a prediction for the label side yi . The
rationale is straightforward: if xi is close to other instances xj , then the label value yi
should also be close to yj (under the assumption that the features carry some predictive
information over the label y).
One baseline prediction is the weighted average of the labels of the other instances:
$$\hat{y}_i = \frac{\sum_{j\neq i} h(D(x_j, x_i))\, y_j}{\sum_{j\neq i} h(D(x_j, x_i))},$$
where h is a decreasing function. Thus, the further $x_j$ is from $x_i$, the smaller the weight in the average. A typical choice for h is $h(z) = e^{-az}$ for some parameter $a > 0$ that determines how penalizing the distance $D(x_j, x_i)$ is. Of course, the average can be taken in the set of k nearest neighbors, in which case h is equal to zero beyond a particular distance threshold:
$$\hat{y}_i = \frac{\sum_{j\ \text{neighbor}} h(D(x_j, x_i))\, y_j}{\sum_{j\ \text{neighbor}} h(D(x_j, x_i))}.$$
A more agnostic rule is to take h := 1 over the set of neighbors and in this case, all
neighbors have the same weight (see the old discussion by Bailey and Jain (1978) in the
case of classification). For classification tasks, the procedure involves a voting rule whereby
the class with the most votes wins the contest, with possible tie-breaking methods. The
interested reader can have a look at the short survey in Bhatia et al. (2010).
For the choice of the optimal k, several complicated techniques and criteria exist (see, e.g., Ghosh (2006) and Hall et al. (2008)). Heuristic values often do the job pretty well. A rule of thumb is that $k = \sqrt{I}$ (I being the total number of instances) is not too far from the optimal value, unless I is exceedingly large.
Below, we illustrate this concept. We pick one date (31st of December 2006) and single out one asset (with stock_id equal to 13). We then seek to find the k = 30 stocks that are the closest to this asset at this particular date.
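A minimal sketch of the neighbor search with scikit-learn (the date filter, the use of all features and the variable names are illustrative choices consistent with the description):

from sklearn.neighbors import NearestNeighbors

knn_sample = data_ml[data_ml['date'] == '2006-12-31'].copy()          # Cross-section of one date
knn_target = knn_sample.loc[knn_sample['stock_id'] == 13, features]   # Features of the target stock
knn_model = NearestNeighbors(n_neighbors=30)                          # Number of neighbors
knn_model.fit(knn_sample[features])                                   # Fit on the features of that date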
NearestNeighbors(n_neighbors=30)
Once the neighbors and distances are known, we can compute a prediction for the return
of the target stock. We use the function h(z) = e−z for the weighting of instances (via the
distances).
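The prediction itself is a weighted average of the neighbors' returns; a sketch using the fitted model above and $h(z) = e^{-z}$:

distances, indices = knn_model.kneighbors(knn_target)                 # Neighbors' indices and distances
weights = np.exp(-distances.flatten())                                # h(z) = exp(-z)
neighbor_returns = knn_sample['R1M_Usd'].values[indices.flatten()]    # Returns of the 30 neighbors
print(np.sum(weights * neighbor_returns) / np.sum(weights))           # Weighted average = prediction
print(knn_sample.loc[knn_sample['stock_id'] == 13, 'R1M_Usd'])        # Realized return, for comparison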
0.03092438258317905
96734 0.089
Name: R1M_Usd, dtype: float64
The prediction is neither very good, nor very bad (the sign is correct!). However, note that
this example cannot be used for predictive purposes because we use data from 2006-12-
31 to predict a return at the same date. In order to avoid the forward-looking bias, the
knn_sample variable should be chosen from a prior point in time.
The above computations are fast (a handful of seconds at most), but they hold for only one asset. In a k-NN exercise, each stock gets a customized prediction, and the set of neighbors must be re-assessed each time. For N assets, N(N − 1)/2 distances must be evaluated. This is particularly costly in a backtest, especially when several parameters can be tested (the number of neighbors k, or a in the weighting function $h(z) = e^{-az}$). When the investment universe is small (when trading indices for instance), k-NN methods become computationally attractive (see for instance Chen and Hao (2017)).
16
Reinforcement learning

Due to its increasing popularity within the machine learning community, we dedicate a chapter to reinforcement learning (RL). In 2019 alone, more than 25 papers dedicated to RL were submitted to (or updated on) arXiv under the q-fin (quantitative finance) classification. Moreover, an early survey of RL-based portfolios is compiled in Sato (2019) (see also Zhang et al. (2020)), and general financial applications are discussed in Kolm and Ritter (2019b), Meng and Khushi (2019), Charpentier et al. (2020), and Mosavi et al. (2020). This shows that RL has recently gained traction among the quantitative finance community.1 While RL is a framework much more than a particular algorithm, its efficient application in portfolio management is not straightforward, as we will show.
FIGURE 16.1: Scheme of Markov Decision Process. R, S, and A stand for reward, state, and action, respectively. (The diagram depicts the loop: initialization of state and reward, the agent performs an action, the action generates a reward and alters the state, and the agent performs a new action, and so on.)
Given initialized values for the state of the environment (S0 ) and reward (usually R0 = 0),
the agent performs an action (e.g., invests in some assets). This generates a reward R1 (e.g.,
1 Like neural networks, reinforcement learning methods have also been recently developed for derivatives
pricing and hedging, see for instance Kolm and Ritter (2019a).
returns, profits, Sharpe ratio) and also a future state of the environment (S1 ). Based on
that, the agent performs a new action and the sequence continues. When the sets of states,
actions, and rewards are finite, the MDP is logically called finite. In a financial framework,
this is somewhat unrealistic, and we discuss this issue later on. It nevertheless is not hard
to think of simplified and discretized financial problems. For instance, the reward can be
binary: win money versus lose money. In the case of only one asset, the action can also
be dual: investing versus not investing. When the number of assets is sufficiently small, it
is possible to set fixed proportions that lead to a reasonable number of combinations of
portfolio choices, etc.
We pursue our exposé with finite MDPs; they are the most common in the literature, and
their formal treatment is simpler. The relative simplicity of MDPs helps in grasping the
concepts that are common to other RL techniques. As is often the case with Markovian
objects, the key notion is that of transition probability:
$$p(s', r\,|\,s, a) = P\left[S_t = s', R_t = r \,|\, S_{t-1} = s, A_{t-1} = a\right], \qquad (16.1)$$
which is the probability of reaching state $s'$ and reward r at time t, conditionally on being in state s and performing action a at time t − 1. The finite sets of states and actions will
be denoted with S and A henceforth. Sometimes, this probability is averaged over the set
of rewards which gives the following decomposition:
$$\sum_r r\, p(s', r|s, a) = P_{ss'}^a R_{ss'}^a, \quad \text{where} \qquad (16.2)$$
$$P_{ss'}^a = P\left[S_t = s' \,|\, S_{t-1} = s, A_{t-1} = a\right], \quad \text{and} \quad R_{ss'}^a = E\left[R_t \,|\, S_{t-1} = s, S_t = s', A_{t-1} = a\right].$$
The goal of the agent is to maximize some function of the stream of rewards. This gain is
usually defined as
$$G_t = \sum_{k=0}^{T} \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1}, \qquad (16.3)$$
i.e., it is a discounted version of the reward, where the discount factor is γ ∈ (0, 1]. The
horizon T may be infinite, which is why γ was originally introduced. Assuming the rewards
are bounded, the infinite sum may diverge for γ = 1. That is the case if rewards don’t
decrease with time, and there is no reason why they should. When γ < 1 and rewards
are bounded, convergence is assured. When T is finite, the task is called episodic, and,
otherwise, it is said to be continuous.
In RL, the focal unknown to be optimized or learned is the policy π, which drives the
actions of the agent. More precisely, π(a, s) = P[At = a|St = s], that is, π equals the
probability of taking action a if the state of the environment is s. This means that actions
are subject to randomness, just like for mixed strategies in game theory. While this may
seem disappointing because an investor would want to be sure to take the best action, it is
also a good reminder that the best way to face random outcomes may well be to randomize
actions as well.
Finally, in order to try to determine the best policy, one key indicator is the so-called value
function:
vπ (s) = Eπ [Gt |St = s] , (16.4)
where the time index t is not very relevant and omitted in the notation of the function. The
index π under the expectation operator E[·] simply indicates that the average is taken when
the policy π is enforced. The value function is simply equal to the average gain conditionally
on the state being equal to s. In financial terms, this is equivalent to the average profit if
the agent takes actions driven by π when the market environment is s. More generally, it
is also possible to condition not only on the state, but also on the action taken. We thus
introduce the $q_\pi$ action-value function:
$$q_\pi(s, a) = E_\pi\left[G_t \,|\, S_t = s, A_t = a\right]. \qquad (16.5)$$
The qπ function is highly important because it gives the average gain when the state and
action are fixed. Hence, if the current state is known, then one obvious choice is to select
the action for which $q_\pi(s, \cdot)$ is the highest. Of course, this is the best solution if the optimal value of $q_\pi$ is known, which is not always the case in practice. The value function can easily be accessed via $q_\pi$: $v_\pi(s) = \sum_a \pi(a, s) q_\pi(s, a)$.
The optimal $v_\pi$ and $q_\pi$ are straightforwardly defined as
$$v_*(s) = \max_\pi v_\pi(s), \;\; \forall s \in S, \qquad \text{and} \qquad q_*(s, a) = \max_\pi q_\pi(s, a), \;\; \forall (s, a) \in S \times A.$$
If only $v_*(s)$ is known, then the agent must span the set of actions and find those that yield
the maximum value for any given state s.
Finding these optimal values is a very complicated task and many articles are dedicated
to solving this challenge. One reason why finding the best qπ (s, a) is difficult is because it
depends on two elements (s and a) on one side and π on the other. Usually, for a fixed
policy π, it can be time consuming to evaluate qπ (s, a) for a given stream of actions, states
and rewards. Once qπ (s, a) is estimated, then a new policy π 0 must be tested and evaluated
to determine if it is better than the original one. Thus, this iterative search for a good policy can take a long time. For more details on policy improvement and value function updating, we recommend chapter 4 of Sutton and Barto (2018), which is dedicated to dynamic programming.
16.1.2 Q-learning
An interesting shortcut to the problem of finding v∗ (s) and q∗ (s, a) is to remove the depen-
dence on the policy. Consequently, there is then of course no need to iteratively improve it.
The central relationship that is required to do this is the so-called Bellman equation that
is satisfied by $q_\pi(s, a)$. We detail its derivation below. First of all, we recall that
$$q_\pi(s, a) = E_\pi\left[G_t \,|\, S_t = s, A_t = a\right] = E_\pi\left[R_{t+1} + \gamma G_{t+1} \,|\, S_t = s, A_t = a\right],$$
where the second equality stems from (16.3). The expression $E_\pi[R_{t+1} \,|\, S_t = s, A_t = a]$ can be further decomposed. Since the expectation runs over $\pi$, we need to sum over all possible actions $a'$ and states $s'$ and resort to $\pi(a', s')$. In addition, the sum on the $s'$ and r arguments of the probability $p(s', r|s, a) = P[S_{t+1} = s', R_{t+1} = r \,|\, S_t = s, A_t = a]$ gives access to the distribution of the random couple $(S_{t+1}, R_{t+1})$ so that in the end $E_\pi[R_{t+1} \,|\, S_t = s, A_t = a] = \sum_{a', r, s'} \pi(a', s')\, p(s', r|s, a)\, r$. A similar reasoning applies to the second portion of $q_\pi$
and:
$$q_\pi(s, a) = \sum_{a', r, s'} \pi(a', s')\, p(s', r|s, a) \left[r + \gamma E_\pi[G_{t+1} \,|\, S_{t+1} = s', A_{t+1} = a']\right] = \sum_{a', r, s'} \pi(a', s')\, p(s', r|s, a) \left[r + \gamma q_\pi(s', a')\right]. \qquad (16.6)$$
This equation links qπ (s, a) to the future qπ (s0 , a0 ) from the states and actions (s0 , a0 ) that
are accessible from (s, a).
Notably, Equation (16.6) is also true for the optimal action-value function $q_* = \max_\pi q_\pi(s, a)$:
$$q_*(s, a) = \max_{a'} \sum_{r, s'} p(s', r|s, a)\left[r + \gamma q_*(s', a')\right] = E_{\pi_*}[r \,|\, s, a] + \gamma \sum_{r, s'} p(s', r|s, a) \max_{a'} q_*(s', a'), \qquad (16.7)$$
because one optimal policy is one that maximizes qπ (s, a), for a given state s and over all
possible actions a. This expression is central to a cornerstone algorithm in reinforcement
learning called Q-learning (the formal proof of convergence is outlined in Watkins and Dayan
(1992)). In Q-learning, the state-action function no longer depends on policy and is written
with capital Q. The process is the following:
Initialize values Q(s, a) for all states s and actions a. For each episode:
0. Initialize state S0 and for each iteration i until the end of the episode;
1. observe state si ;
(QL) 2. perform action ai (depending on Q);
3. receive reward ri+1 and observe state si+1 ;
4. update Q as follows:
$$Q_{i+1}(s_i, a_i) \longleftarrow Q_i(s_i, a_i) + \eta\Big(\underbrace{r_{i+1} + \gamma \max_a Q_i(s_{i+1}, a)}_{\text{echo of (16.7)}} - Q_i(s_i, a_i)\Big). \qquad (16.8)$$
The underlying reason this update rule works can be linked to fixed point theorems for contraction mappings. If a function f satisfies $|f(x) - f(y)| < \delta|x - y|$ with $\delta < 1$ (Lipschitz continuity with constant smaller than one), then a fixed point z satisfying f(z) = z can be iteratively obtained via $z \leftarrow f(z)$. This updating rule converges to the fixed point. Equation (16.7) can be solved using a similar principle, except that the learning rate $\eta$ slows the updating down but also ensures convergence under suitable assumptions.
More generally, (16.8) has a form that is widespread in reinforcement learning that is sum-
marized in Equation (2.4) of Sutton and Barto (2018):
New estimate ← Old estimate + Step size (i.e., learning rate) × (Target − Old estimate),
(16.9)
where the last part can be viewed as an error term. Starting from the old estimate, the new
estimate therefore goes in the ‘right’ (or sought) direction, modulo a discount term that
makes sure that the magnitude of this direction is not too large. The update rule in (16.8)
is often referred to as ‘temporal difference’ learning because it is driven by the improvement
yielded by estimates that are known at time t + 1 (target) versus those known at time t.
One important step of the Q-learning sequence (QL) is the second one where the action ai
is picked. In RL, the best algorithms combine two features: exploitation and exploration.
Exploitation is when the machine uses the current information at its disposal to choose the
next action. In this case, for a given state si , it chooses the action ai that maximizes the
expected reward Qi (si , ai ). While obvious, this choice is not optimal if the current function
Qi is relatively far from the true Q. Repeating the locally optimal strategy is likely to favor
a limited number of actions, which will narrowly improve the accuracy of the Q function.
In order to gather new information stemming from actions that have not been tested much
(but that can potentially generate higher rewards), exploration is needed. This is when an
action $a_i$ is chosen randomly. The most common way to combine these two concepts is called $\epsilon$-greedy exploration. The action $a_i$ is assigned according to:
$$a_i = \begin{cases} \underset{a}{\text{argmax}}\; Q_i(s_i, a) & \text{with probability } 1 - \epsilon, \\ \text{an action sampled uniformly over } A & \text{with probability } \epsilon. \end{cases} \qquad (16.10)$$
Thus, with probability $\epsilon$, the algorithm explores and with probability $1 - \epsilon$, it exploits the current knowledge of the expected reward and picks the best action. Because all actions have a non-zero probability of being chosen, the policy is called “soft”. Indeed, the best action has a probability of selection equal to $1 - \epsilon(1 - \text{card}(A)^{-1})$, while all other actions are picked with probability $\epsilon/\text{card}(A)$.
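A minimal sketch of this selection rule (Q_row stands for one row of the Q table; names are illustrative):

import numpy as np

def epsilon_greedy(Q_row, epsilon=0.1, rng=np.random.default_rng()):
    # Pick an action index from one row of the Q table
    if rng.uniform() < epsilon:                       # Explore with probability epsilon
        return int(rng.integers(len(Q_row)))          # Random action
    return int(np.argmax(Q_row))                      # Exploit: best known action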
16.1.3 SARSA
In Q-learning, the algorithm seeks to find the action-value function of the optimal policy.
Thus, the policy that is followed to pick actions is different from the one that is learned
(via Q). Such algorithms are called off-policy. On-policy algorithms seek to improve the
estimation of the action-value function qπ by continuously acting according to the policy
π. One canonical example of on-policy learning is the SARSA method, which requires two consecutive states and actions (State-Action-Reward-State-Action, hence the acronym). The way the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ is processed is presented below.
The main difference between Q-learning and SARSA is the update rule. In SARSA, it is given by
$$Q_{i+1}(s_i, a_i) \longleftarrow Q_i(s_i, a_i) + \eta\left(r_{i+1} + \gamma Q_i(s_{i+1}, a_{i+1}) - Q_i(s_i, a_i)\right). \qquad (16.11)$$
The improvement comes only from the local point $Q_i(s_{i+1}, a_{i+1})$ that is based on the new state and action $(s_{i+1}, a_{i+1})$, whereas in Q-learning, it comes from all possible actions, of which only the best is retained: $\max_a Q_i(s_{i+1}, a)$.
A more robust but also more computationally demanding version of SARSA is expected
SARSA in which the target Q function is averaged over all actions:
$$Q_{i+1}(s_i, a_i) \longleftarrow Q_i(s_i, a_i) + \eta\left(r_{i+1} + \gamma \sum_a \pi(a, s_{i+1})\, Q_i(s_{i+1}, a) - Q_i(s_i, a_i)\right). \qquad (16.12)$$
Expected SARSA is less volatile than SARSA because the latter is strongly impacted by
the random choice of ai+1 . In expected SARSA, the average smoothes the learning process.
In a portfolio context, the state space naturally includes the firm characteristics; there are NK of them (K features for each of the N assets), and, assuming that all features have been uniformized, their space is $[0, 1]^{NK}$. Needless to say, the dimensions of both spaces (actions and states) are numerically impractical.
A simple solution to this problem is discretization: each space is divided into a small number
of categories. Some authors do take this route. In Yang et al. (2018), the state space is
discretized into three values depending on volatility, and actions are also split into three
categories. Bertoluzzo and Corazza (2012) and Xiong et al. (2018) also choose three possible
actions (buy, hold, sell). In Almahdi and Yang (2019), the learner is expected to yield binary
signals for buying or shorting. García-Galicia et al. (2019) consider a larger state space (eight
elements) but restrict the action set to three options.3 In terms of the state space, all articles
assume that the state of the economy is determined by prices (or returns).
One strong limitation of these approaches is the marked simplification they imply. Realistic discretizations are numerically intractable when investing in multiple assets. Indeed, splitting the unit interval into h points yields $h^{NK}$ possibilities for feature values. The number of options for weight combinations also increases exponentially with N. As an example: just 10 possible values for 10 features of 10 stocks yield $10^{100}$ combinations.
The problems mentioned above are of course not restricted to portfolio construction. Many
solutions have been proposed to solve Markov Decision Processes in continuous spaces. We
refer for instance to Section 4 in Powell and Ma (2011) for a review of early methods (outside
finance).
This curse of dimensionality is accompanied by the fundamental question of training data.
2 For example, Sharpe ratio which is for instance used in Moody et al. (1998), Bertoluzzo and Corazza
(2012), and Aboussalah and Lee (2020) or drawdown-based ratios, as in Almahdi and Yang (2017).
3 Some recent papers consider arbitrary weights (e.g., Jiang et al. (2017) and Yu et al. (2019)) for a
Two options are conceivable: market data versus simulations. Under a given controlled
generator of samples, it is hard to imagine that the algorithm will beat the solution that
maximizes a given utility function. If anything, it should converge towards the static optimal
solution under a stationary data generating process (see, e.g., Chaouki et al. (2020) for
trading tasks), which is by the way a very strong modelling assumption.
This leaves market data as a preferred solution but even with large datasets, there is little
chance to cover all the (actions, states) combinations mentioned above. Characteristics-
based datasets have depths that run through a few decades of monthly data, which means
several hundred time-stamps at most. This is far too limited to allow for a reliable learning process. It is always possible to generate synthetic data (as in Yu et al. (2019)), but it is unclear that this would substantially improve the performance of the algorithm.
A common choice for parametric policies is a softmax form,
$$\pi_\theta(a, s) = \frac{e^{\theta' h(a, s)}}{\sum_{b} e^{\theta' h(b, s)}}, \qquad (16.14)$$
where the output of the function h(a, s), which has the same dimension as $\theta$, is called a feature vector representing the pair (a, s). Typically, h can very well be a simple neural network with two input units and an output dimension equal to the length of $\theta$.
One desired property for πθ is that it be differentiable with respect to θ so that θ can be
improved via some gradient method. The most simple and intuitive results about policy
gradients are known in the case of episodic tasks (finite horizon) for which it is sought
to maximize the average gain Eθ [Gt ] where the gain is defined in Equation (16.3). The
expectation is computed according to a particular policy that depends on θ, this is why
we use a simple subscript. One central result is the so-called policy gradient theorem which
states that
$$\nabla E_\theta[G_t] = E_\theta\left[G_t \frac{\nabla \pi_\theta}{\pi_\theta}\right]. \qquad (16.15)$$
This result can then be used for gradient ascent: when seeking to maximize a quantity, the parameter change must go in the upward direction:
$$\theta \leftarrow \theta + \eta\, G_t \frac{\nabla \pi_\theta}{\pi_\theta}. \qquad (16.16)$$
This simple update rule is known as the Reinforce algorithm. One improvement of this
simple idea is to add a baseline, and we refer to section 13.4 of Sutton and Barto (2018) for
a detailed account on this topic.
16.3.2 Extensions
A popular extension of Reinforce is the so-called actor-critic (AC) method which combines
policy gradient with Q- or v-learning. The AC algorithm can be viewed as some kind of mix
between policy gradient and SARSA. A central requirement is that the state-value function
v(·) be a differentiable function of some parameter vector w (it is often taken to be a neural
network). The update rule is then
$$\theta \leftarrow \theta + \eta \left(R_{t+1} + \gamma v(S_{t+1}, w) - v(S_t, w)\right) \frac{\nabla \pi_\theta}{\pi_\theta}, \qquad (16.17)$$
but the trick is that the vector w must also be updated. The actor is the policy side which is
what drives decision making. The critic side is the value function that evaluates the actor’s
performance. As learning progresses (each time both sets of parameters are updated), both
sides improve. The exact algorithmic formulation is a bit long, and we refer to section 13.5
in Sutton and Barto (2018) for the precise sequence of steps of AC.
Another interesting application of parametric policies is outlined in Aboussalah and Lee
(2020). In their article, the authors define a trading policy that is based on a recurrent
neural network. Thus, the parameter θ in this case encompasses all weights and biases in
the network.
Another favorable feature of parametric policies is that they are compatible with continuous
sets of actions. Beyond the form (16.14), there are other ways to shape πθ . If A is a subset
of R, and fΩ is a density function with parameters Ω, then a candidate form for πθ is
If we set π = πα = fα , the link with factors or characteristics can be coded through α via
a linear form:
$$\text{(F1)} \qquad \alpha_{n,t} = \theta_{0,t} + \sum_{k=1}^{K} \theta_t^{(k)} x_{t,n}^{(k)}, \qquad (16.19)$$
which is highly tractable, but may violate the condition that αn,t > 0 for some values of
θk,t . Indeed, during the learning process, an update in θ might yield values that are out of
the feasible set of αt . In this case, it is possible to resort to a trick that is widely used in
online learning (see, e.g., section 2.3.1 in Hoi et al. (2018)). The idea is simply to find the
acceptable solution that is closest to the suggestion from the algorithm. If we call θ ∗ the
result of an update rule from a given algorithm, then the closest feasible vector is
$$\theta = \underset{z \in \Theta(x_t)}{\text{argmin}}\; \|\theta^* - z\|^2, \qquad (16.20)$$
where $\|\cdot\|$ is the Euclidean norm and $\Theta(x_t)$ is the feasible set, that is, the set of vectors $\theta$ such that the $\alpha_{n,t} = \theta_{0,t} + \sum_{k=1}^{K}\theta_t^{(k)} x_{t,n}^{(k)}$ are all non-negative.
A second option for the form of the policy, πθ2 t , is slightly more complex but remains always
valid (i.e., has positive αn,t values):
$$\text{(F2)} \qquad \alpha_{n,t} = \exp\left(\theta_{0,t} + \sum_{k=1}^{K} \theta_t^{(k)} x_{t,n}^{(k)}\right), \qquad (16.21)$$
which is simply the exponential of the first version. With some algebra, it is possible to derive the policy gradients. The policies $\pi^j_{\theta_t}$ are defined by the equations (Fj) above. Let z denote the digamma function and let $\mathbf{1}$ denote the $\mathbb{R}^N$ vector of all ones. We have
$$\frac{\nabla_{\theta_t}\pi^1_{\theta_t}}{\pi^1_{\theta_t}} = \sum_{n=1}^{N} \left(z(\mathbf{1}'X_t\theta_t) - z(x_{t,n}\theta_t) + \ln w_n\right) x_{t,n}'$$
$$\frac{\nabla_{\theta_t}\pi^2_{\theta_t}}{\pi^2_{\theta_t}} = \sum_{n=1}^{N} \left(z(\mathbf{1}'e^{X_t\theta_t}) - z(e^{x_{t,n}\theta_t}) + \ln w_n\right) e^{x_{t,n}\theta_t}\, x_{t,n}'$$
16.4 Simple examples

As a first illustration, we test the learning process in a simplified framework. We consider two assets: one risky and one riskless, with return equal to zero. The returns for the risky process follow an autoregressive model of order one (AR(1)): $r_{t+1} = a + \rho r_t + \epsilon_{t+1}$ with $|\rho| < 1$ and $\epsilon_t$ following a standard white noise with variance $\sigma^2$. In practice, individual (monthly) returns are seldom autocorrelated, but adjusting the autocorrelation helps understand if the algorithm learns correctly (see exercise below).
The environment consists only in observing the past return rt . Since we seek to estimate
the Q function, we need to discretize this state variable. The simplest choice is to resort
to a binary variable: equal to −1 (negative) if rt < 0, and to +1 (positive) if rt ≥ 0. The
actions are summarized by the quantity invested in the risky asset. It can take five values:
0 (risk-free portfolio), 0.25, 0.5, 0.75, and 1 (fully invested in the risky asset). This is for
instance the same choice as in Pendharkar and Cusatis (2018).
For the sake of understanding, we code an intuitive implementation of Q-learning ourselves. It requires a dataset with the usual inputs: state, action, reward, and subsequent state. We start by simulating the returns: they drive the states and the rewards (portfolio returns). The actions are sampled randomly. The data is built in the chunk below.
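The sketch below is one possible way to build such a dataset (the AR(1) parameter values, the sample size, and the variable names are illustrative assumptions).

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_sample = 10**5                                      # Number of simulated periods (illustrative)
a_0, rho, sigma = 0.0, 0.8, 0.02                      # AR(1) parameters (illustrative)

returns = np.zeros(n_sample)
for t in range(1, n_sample):                          # r_{t+1} = a + rho * r_t + eps_{t+1}
    returns[t] = a_0 + rho * returns[t-1] + sigma * rng.normal()

pos = rng.choice([0, 0.25, 0.5, 0.75, 1.0], size=n_sample)             # Random actions (weight of risky asset)
data_RL = pd.DataFrame({'state': np.where(returns < 0, 'neg', 'pos'),  # Discretized state
                        'action': pos,
                        'reward': pos * np.roll(returns, -1)})          # Portfolio return over the next period
data_RL['new_state'] = data_RL['state'].shift(-1)                       # Subsequent state
data_RL = data_RL.iloc[:-1]                                             # Drop last row (no next state/reward)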
The learning routine relies on some hyperparameters:
• γ, the discounting rate for the rewards (shown in Equation (16.8));
• ε, which controls the rate of exploration versus exploitation (see Equation (16.10)).
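A compact tabular Q-learning pass over this dataset can be sketched as follows (the dictionaries s and a map states and actions to indices, and the learning and discount rates are illustrative). Since the actions were sampled randomly when building the data, exploration is already taken care of and ε does not appear explicitly in this simplified sketch.

import numpy as np

s = {'neg': 0, 'pos': 1}                                               # State -> row index
a = {p: i for i, p in enumerate([0, 0.25, 0.5, 0.75, 1.0])}            # Action -> column index
eta, gamma = 0.1, 0.7                                                  # Learning and discount rates (illustrative)
fit_RL = np.zeros((len(s), len(a)))                                    # Q table

for row in data_RL.itertuples():
    i, j = s[row.state], a[row.action]
    target = row.reward + gamma * fit_RL[s[row.new_state], :].max()    # Echo of (16.7)
    fit_RL[i, j] += eta * (target - fit_RL[i, j])                      # Update rule (16.8)
r_final = row.reward                                                   # Reward of the last iteration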
fit_RL=pd.DataFrame(fit_RL,index=s.keys(),
columns=a.keys()).sort_index(axis=1)
print(fit_RL)
print(f'Reward (last iteration): {r_final}')
The output shows the Q function, which depends naturally both on states and actions. When
the state is negative, large risky positions (action equal to 0.75 or 1.00) are associated with
the smallest average rewards, whereas small positions yield the highest average rewards.
When the state is positive, the average rewards are the highest for the largest allocations.
The rewards in both cases are almost a monotonic function of the proportion invested in the
risky asset. Thus, the recommendation of the algorithm (i.e., policy) is to be fully invested
in a positive state and to refrain from investing in a negative state. Given the positive
autocorrelation of the underlying process, this does make sense.
Basically, the algorithm has simply learned that positive (resp. negative) returns are more
likely to follow positive (resp. negative) returns. While this is somewhat reassuring, it is by
no means impressive, and much simpler tools would yield similar conclusions and guidance.
return_3=pd.Series(data_ml.loc[data_ml['stock_id']==3, 'R1M_Usd'].values)
# Return of asset 3
return_4=pd.Series(data_ml.loc[data_ml['stock_id']==4, 'R1M_Usd'].values)
# Return of asset 4
pb_3 = pd.Series(data_ml.loc[data_ml['stock_id']==3, 'Pb'].values)
# P/B ratio of asset 3
pb_4 = pd.Series(data_ml.loc[data_ml['stock_id']==4, 'Pb'].values)
# P/B ratio of asset 4
action_3 = pd.Series(np.floor(np.random.uniform(size=len(pb_3))*3) - 1)
# Action for asset 3 (random)
action_4 = pd.Series(np.floor(np.random.uniform(size=len(pb_4))*3) - 1)
# Action for asset 4 (random)
RL_data = pd.concat([return_3,return_4,pb_3,
pb_4,action_3,action_4],axis=1)
# Building the dataset
RL_data.columns=['return_3','return_4',
'Pb_3','Pb_4','action_3','action_4']
# Adding columns names
RL_data['action']=RL_data.action_3.astype(int).apply(
str)+" "+RL_data.action_4.astype(int).apply(str) # Uniting actions
RL_data['Pb_3'] = np.round(5*RL_data['Pb_3'])
# Simplifying states (P/B)
RL_data['Pb_4'] = np.round(5*RL_data['Pb_4'])
# Simplifying states (P/B)
RL_data['state'] = RL_data.Pb_3.astype(int).apply(
str)+" "+RL_data.Pb_4.astype(int).apply(str) # Uniting states
RL_data['new_state'] = RL_data['state'].shift(-1)
# Infer new state
RL_data['reward']=RL_data.action_3*RL_data.return_3 \
+RL_data.action_4*RL_data.return_4
# Computing rewards
RL_data = RL_data[['action','state','reward','new_state']].dropna(
axis=0).reset_index(drop=True)
# Remove one missing new state, last row
RL_data.head()
# Show first lines
Actions and states have to be merged to yield all possible combinations. To simplify the states, we multiply the price-to-book ratios by five and round them. We keep the same hyperparameters as in the previous example. Columns below stand for actions: the first (resp. second) number notes the position in the first (resp. second) asset. The rows correspond to states. The two scaled P/B ratios are separated by a space (e.g., “2 3” means that the first (resp. second) asset has a scaled P/B of 2 (resp. 3)).
The output shows that there are many combinations of states and actions that are not spanned by the data: basically, the Q function is equal to zero there, and it is likely that the combination has not been explored. Some states seem to be more often represented (“1 1”, “1 2”, and “2 1”), others less (“3 1” and “3 2”). It is hard to make any sense of the recommendations. Some states are close (e.g., “0 1” and “1 1”), but the outcomes related to them are very different (buy and short versus hold and buy). Moreover, there is no coherence and no monotonicity in actions with respect to individual state values: low values of states can be associated with very different actions.
One reason why these conclusions do not appear trustworthy pertains to the data size. With
only 200+ time points and 99 state-action pairs (11 times 9), this yields on average only two
data points to compute the Q function. This could be improved by testing more random
actions, but the limits of the sample size would eventually (rapidly) be reached anyway.
This is left as an exercise (see below).
16.6 Exercises
1. Test what happens if the process for generating returns has a negative autocorrelation. What is the impact on the Q function and the policy?
2. Keeping the same two assets as in Section 16.4.2, increase the size of RL_data by testing all possible action combinations for each original data point. Re-run the Q-learning function and see what happens.
Part V
Appendix
17
Data description
TABLE 17.1: List of all variables (features and labels) in the dataset
18
Solutions to exercises

18.1 Chapter 3
For annual values, see Figure 18.1:
df_median=[]
#creating empty placeholder for temporary dataframe
df=[]
df_median=data_ml[['date','Pb']].groupby(['date']).median().reset_index()
# computing median
df_median.rename(columns = {'Pb': 'Pb_median'}, inplace = True)
# renaming for clarity
df = pd.merge(
data_ml[["date",'Pb','R1M_Usd']],df_median,how='left',on=['date'])
df=df.groupby(
[pd.to_datetime(
df['date']).dt.year,np.where(
df['Pb'] > df['Pb_median'],
"Growth", "Value")])['R1M_Usd'].mean().reset_index()
# groupby and defining "year" and cap logic
df.rename(columns = {'level_1': 'val_sort'},inplace = True)
df.pivot(
index='date',columns='val_sort',values='R1M_Usd').plot.bar(
figsize=(10,6)) # Plot!
plt.ylabel('Average returns')
plt.xlabel('year')
df_median=[]
#creating empty placeholder for temporary dataframe
df=[]
df_median=data_ml[["date","Pb"]].groupby(["date"]).median().reset_index()
# Computing median
df_median.rename(columns = {'Pb': 'Pb_median'}, inplace=True)
# renaming for clarity
df = pd.merge(data_ml[["date", "R1M_Usd", "Pb"]], df_median,on=["date"])
# Joining the dataframes for selecting on median
df["growth"] = np.where(df["Pb"] > df["Pb_median"],"Growth","Value")
# Creating new columns from condition
df = df.groupby(["date", "growth"])["R1M_Usd"].mean().unstack()
# Computing average returns
(1+df.loc[:, ["Value", "Growth"]]).cumprod().plot(figsize = (10, 6));
# Plot!
plt.ylabel('Average returns')
plt.xlabel('year')
Portfolios’ performances are based on quartile sorting. We rely heavily on the fact that features are uniformized, i.e., that their distribution is uniform for each given date. Overall, small firms outperform heavily (see Figure 18.3).
df=[]
values = ["small", "medium", "large", "xl"]
conditions = [data_ml["Mkt_Cap_6M_Usd"] <= 0.25, # Small firms...
(data_ml["Mkt_Cap_6M_Usd"] > 0.25) &
(data_ml["Mkt_Cap_6M_Usd"] <= 0.5),
(data_ml["Mkt_Cap_6M_Usd"] > 0.5) &
(data_ml["Mkt_Cap_6M_Usd"] <= 0.75),
data_ml["Mkt_Cap_6M_Usd"] > 0.75] # ...Xlarge firms
df = data_ml[["date", "R1M_Usd", "Mkt_Cap_6M_Usd"]].copy()
df["Mkt_cap_quartile"] = np.select(conditions, values)
df["year"] = pd.to_datetime(df['date']).dt.year
df = df.groupby(["year", "Mkt_cap_quartile"])["R1M_Usd"].mean().unstack()
# Computing average returns
df.loc[:, ["large","medium","small","xl"]].plot.bar(figsize = (10, 6));
# Plot!
plt.ylabel('Average returns')
plt.xlabel('year')
18.2 Chapter 4
Below, we import a credit spread supplied by Bank of America. Its symbol/ticker is
“BAMLC0A0CM”. We apply the data expansion on the small number of predictors to save
memory space. One important trick that should not be overlooked is the uniformization
step after the product (4.3) is computed. Indeed, we want the new features to have the
same properties as the old ones. If we skip this step, distributions will be altered, as we
show in one example below.
We start with the data extraction and joining. It’s important to join early so as to keep the
highest data frequency (daily) in order to replace missing points with close values. Joining
with monthly data before replacing creates unnecessary lags.
cred_spread=pd.read_csv("BAMLC0A0CM.csv",index_col=0).reset_index()
# Transform to dataframe
The creation of the augmented dataset requires some manipulation. Features are no longer
uniform as is shown in Figure 18.4.
data_tmp = data_cond.groupby(
# From new dataset Group by date and...
["date"]).apply(lambda df: norm_0_1(df))
# Uniformize the new features
data_tmp[["Eps_cred_spread"]].plot.hist(bins=100,figsize=[10,5])
# Verification
The second question naturally requires the downloading of VIX series first and the joining
with the original data.
vix=pd.read_csv("VIXCLS.csv",index_col=0).reset_index()
# Transform to dataframe
We can then proceed with the categorization. We create the vector label in a new (smaller)
dataset but not attached to the large data_ml variable. Also, we check the balance of labels
and its evolution through time (see Figure 18.6).
delta = 0.5
# Magnitude of vix correction
vix_bar = np.median(vix["vix"])
# Median of vix
data_vix = pd.merge(
data_ml[["stock_id","date","R1M_Usd"]],vix,how="inner",on="date")
# Smaller dataset
data_vix["r_minus"]=(-0.02) * np.exp(-delta*(data_vix["vix"]-vix_bar))
# r_-
data_vix["r_plus"] = 0.02 * np.exp(delta*(data_vix["vix"]-vix_bar))
# r_+
rules=[data_vix["R1M_Usd"]>data_vix["r_plus"],
(data_vix["R1M_Usd"]>=data_vix["r_minus"]) &
(data_vix["R1M_Usd"]<=data_vix["r_plus"]),
data_vix["R1M_Usd"]<data_vix["r_minus"]]
data_ml[["R12M_Usd"]].hist(figsize=[10,5]);
The largest return comes from stock #683. Let’s have a look at the stream of monthly
returns in 2009.
data_tmp = data_ml.loc[data_ml["stock_id"] == 683,:].copy()
data_tmp["year"] = pd.to_datetime(data_tmp["date"]).dt.year
data_tmp.loc[data_tmp["year"]==2009,["date","R1M_Usd"]].sort_values(['date'])
date R1M_Usd
126827 2009-01-31 -0.625
128020 2009-02-28 0.472
128950 2009-03-31 1.440
130144 2009-04-30 0.139
131338 2009-05-31 0.086
132533 2009-06-30 0.185
133727 2009-07-31 0.363
134921 2009-08-31 0.103
136115 2009-09-30 9.914
137308 2009-10-31 0.101
138501 2009-11-30 0.202
139692 2009-12-31 -0.251
The returns are all very high. The annual value is plausible. In addition, a quick glance at
the Vol1Y values shows that the stock is the most volatile of the dataset.
18.3 Chapter 5
We recycle the training and testing data variables created in the chapter (coding section
notably).
y_penalized_train = training_sample['R1M_Usd'].values
# Dependent variable
X_penalized_train = training_sample[features].values
# Predictors
y_penalized_test = testing_sample['R1M_Usd'].values
# Dependent variable
X_penalized_test = testing_sample[features].values
# Predictors
from sklearn.linear_model import ElasticNet
lasso_sens=[]
alpha_seq=list(np.round(np.arange(0.1,1.1,0.2),2))   # Sequence of alpha values (l1 ratios)
lambda_seq=[1e-5,1e-4,1e-3,1e-2,1e-1,1]              # Sequence of lambda values (penalization intensities)
for i in alpha_seq:                                  # (reconstructed loop over both hyperparameters)
    for j in lambda_seq:
        fit=ElasticNet(alpha=j,l1_ratio=i).fit(X_penalized_train,y_penalized_train)
        rmse=np.sqrt(np.mean((fit.predict(X_penalized_test)-y_penalized_test)**2))
        lasso_sens.append([rmse,i,j])
lasso_sens=pd.DataFrame(lasso_sens,columns=['rmse','alpha','lambda'])
rmse_elas=lasso_sens.pivot(index='alpha',columns='lambda',values='rmse')
# matrix format for plots
new_col_names= list(map(lambda x: str(x)+str(" Lambda"),lambda_seq))
# New column names
rmse_elas.columns=new_col_names
rmse_elas.plot(
figsize=(14,12),
subplots=True,sharey=True,sharex=True,kind='bar',ylabel='rmse')
plt.show() # Plot!
As is outlined in Figure 18.8, the parameters have a very marginal impact. Maybe the model
is not a good fit for the task.
18.4 Chapter 6
fit1 = tree.DecisionTreeRegressor( # Defining the model
max_depth = 5, # Maximum depth (i.e. tree levels)
ccp_alpha=0.00001, # Precision: smaller = more leaves
)
fit1.fit(X, y) # Fitting the model
mse = np.mean((fit1.predict(X_test) - y_test)**2)
print(f'MSE: {mse}')
MSE: 0.014679220247642136
MSE: 0.03698756837337339
The first model (Figure 18.9) is too precise: going into the details of the training sample
does not translate to good performance out-of-sample. The second, simpler model, yields
better results.
from sklearn.ensemble import RandomForestRegressor
n_trees=[10,20,40,80,160]
mse_rf=[]
for i in range(len(n_trees)):                                    # No need for functional programming here...
    fit_RF = RandomForestRegressor(n_estimators=n_trees[i],      # Nb of random trees
                                   criterion='squared_error',    # Function to measure the quality of a split
                                   bootstrap=True,               # Sampling with replacement
                                   max_depth=5,                  # Maximum depth of each tree
                                   max_samples=30000)            # Size of (random) sample for each tree
    fit_RF.fit(X_train, y_train)                                 # Fitting the model
    mse=np.mean((fit_RF.predict(X_test) - y_test)**2)
    mse_rf.append([mse])
mse_rf
[[0.03864985167856587],
[0.03819487650392408],
[0.037083813469538304],
[0.03709626863784633],
[0.03642950985370603]]
Trees are by definition random, so results can vary from test to test. Overall, large numbers
of trees are preferable, and the reason is that each new tree tells a new story and diversifies
the risk of the whole forest. Some more technical details of why that may be the case are
outlined in the original paper by Breiman (2001).
For the last exercises, we recycle the formula used in Chapter 6.
training_sample_2008 = training_sample.loc[
    (training_sample['date'] > '2007-12-31') &
    (training_sample['date'] < '2009-01-01')]
training_sample_2009 = training_sample.loc[
    (training_sample['date'] > '2008-12-31') &
    (training_sample['date'] < '2010-01-01')]
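A minimal sketch of the tree behind Figure 18.10 (the shallow depth and the use of the full feature
list are assumptions; the original fitting and plotting code is not reproduced here):
fit_2008 = tree.DecisionTreeRegressor(max_depth=2)   # shallow tree, as in the figure
fit_2008.fit(training_sample_2008[features],         # features: predictor list recycled from Chapter 6
             training_sample_2008['R1M_Usd'])
plt.figure(figsize=(12, 6))
tree.plot_tree(fit_2008, feature_names=features, filled=True)  # Display the splits
plt.show()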
The first splitting criterion in Figure 18.10 is enterprise value (EV). EV adjusts market
capitalization by adding debt and subtracting cash, and is often viewed as a more faithful
account of the true value of a company. In 2008, the companies that fared the least poorly
were those with the highest EV (i.e., large, robust firms).
In 2009 (Figure 18.11), the firms that recovered the fastest were those that experienced
high volatility in the past (likely, downwards volatility). Momentum is also very important:
the firms with the lowest past returns are those that rebound the fastest. This is a typical
example of the momentum crash phenomenon studied in Barroso and Santa-Clara (2015)
and Daniel and Moskowitz (2016). The rationale is the following: after a market downturn,
the stocks with the most potential for growth are those that have suffered the largest losses.
Consequently, the negative (short) leg of the momentum factor performs very well, often
better than the long leg. And indeed, being long in the momentum factor in 2009 would
have generated negative profits.
18.5 Chapter 7: the autoencoder model and universal approximation
data_short = data_ml.loc[
    data_ml.stock_id.isin(stock_ids_short),
    ["stock_id", "date"] + features_short + ["R1M_Usd"]]
# Shorter dataset
dates = data_short["date"].unique()
N = len(stock_ids_short) # Dimension for assets
Tt = len(dates) # Dimension for dates
K = len(features_short) # Dimension for features
factor_data=data_short[["stock_id", "date", "R1M_Usd"]].pivot(
index="date",columns="stock_id",values="R1M_Usd").values
# Factor side data
beta_data=np.swapaxes(
data_short[features_short].values.reshape(N, Tt, K), 0, 1)
# Beta side data: beware the permutation below!
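As a quick (illustrative) sanity check, the two arrays should have the shapes expected by the network:
print(factor_data.shape)   # (Tt, N): one return per date and per stock
print(beta_data.shape)     # (Tt, N, K): one vector of K characteristics per date and per stock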
Next, we turn to the specification of the network, using a functional API form.
main_input = Input(shape=(N,))
# Main input: returns
factor_network = tf.keras.layers.Dense(
8, activation="relu", name="layer_1_r")(main_input)
# Def of factor side network
factor_network= tf.keras.layers.Dense(
4, activation="tanh", name="layer_2_r")(factor_network)
aux_input = Input(shape=(N,K))
# Aux input: characteristics
beta_network =tf.keras.layers.Dense(
units=8, activation="relu",name="layer_1_1")(aux_input)
beta_network=tf.keras.layers.Dense(
units=4, activation="tanh",name="layer_2_1")(beta_network)
beta_network= tf.keras.layers.Permute((2, 1))(beta_network)
# Permutation!
main_output = tf.keras.layers.Dot(axes=[1,1])([beta_network, factor_network])
# Product of 2 networks
model_ae = keras.Model([main_input,aux_input], main_output)
# AE Model specs
Finally, we ask for the structure of the model, and train it.
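The structure below is obtained with the usual summary call:
model_ae.summary()   # Display the architecture of the model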
Model: "model_1"
Layer (type) Output Shape Param # Connected to
=========================================================================
input_3 (InputLayer) [(None, 793, 7)] 0 []
layer_1_1 (Dense) (None, 793, 8) 64
['input_3[0][0]']
input_2 (InputLayer) [(None, 793)] 0 []
layer_2_1 (Dense) (None, 793, 4) 36
['layer_1_1[0][0]']
layer_1_r (Dense) (None, 8) 6352
['input_2[0][0]']
permute (Permute) (None, 4, 793) 0
['layer_2_1[0][0]']
layer_2_r (Dense) (None, 4) 36
['layer_1_r[0][0]']
dot (Dot) (None, 793) 0
['permute[0][0]',
'layer_2_r[0][0]']
==========================================================================
Total params: 6,488
Trainable params: 6,488
Non-trainable params: 0
model_ae.compile(
optimizer="rmsprop",loss='mean_squared_error',
metrics='mean_squared_error')
# Learning parameters
fit_model_ae = model_ae.fit(
    (factor_data, beta_data), y=factor_data,
    epochs=20, batch_size=49, verbose=0)
For the second exercise, we use a simple architecture. The activation function, number of
epochs and batch size may matter...
raw_data = np.arange(0, 10, 0.001)
# Grid of points for the sine function
df_sin = pd.DataFrame([raw_data, np.sin(raw_data)], index=["x", "sinx"]).T
# Sin data
model_ua = keras.Sequential()
model_ua.add(layers.Dense(16, activation="sigmoid", input_shape=(1,)))
model_ua.add(layers.Dense(1))
model_ua.summary() # A simple model!
Model: "sequential_3"
Layer (type) Output Shape Param #
=================================================================
dense_10 (Dense) (None, 16) 32
dense_11 (Dense) (None, 1) 17
=================================================================
Total params: 49
Trainable params: 49
Non-trainable params: 0
model_ua.compile(optimizer='RMSprop', loss='mse', metrics=['MeanAbsoluteError'])
fit_ua = model_ua.fit(
raw_data,df_sin['sinx'].values,batch_size=64,epochs = 30,verbose=0)
In full disclosure, to improve the fit, we also increase the size of the network (number of units)
and the number of epochs. We show the improvement in the figure below.
model_ua2 = keras.Sequential()
model_ua2.add(layers.Dense(128, activation="sigmoid", input_shape=(1,)))
model_ua2.add(layers.Dense(1))
model_ua2.summary() # A simple model!
Model: "sequential_4"
Layer (type) Output Shape Param #
=================================================================
dense_12 (Dense) (None, 128) 256
dense_13 (Dense) (None, 1) 129
=================================================================
Total params: 385
Trainable params: 385
Non-trainable params: 0
model_ua2.compile(
optimizer='RMSprop',loss='mse',metrics=['MeanAbsoluteError'])
fit_ua2 = model_ua2.fit(
raw_data,df_sin['sinx'].values,batch_size=64,epochs = 60,verbose=0)
df_sin['Small Model']=model_ua.predict(raw_data)
df_sin['Large Model']=model_ua2.predict(raw_data)
df_sin.set_index('x',inplace=True)
df_sin.plot(figsize=[10,4])
18.6 Chapter 8
Since we are going to reproduce a similar analysis several times, let’s simplify the task with
two tips: First, by using default parameter values that will be passed as common arguments
to the svm function. Second, by creating a custom function that computes the MSE. Below,
we recycle datasets created in Chapter 6.
y = train_label_xgb.iloc[1:1000]
# Train label
x = train_features_xgb.iloc[1:1000,]
# Training features
test_feat_short=testing_sample[features_short]
def svm_func(_kernel,_C,_gamma,_coef0):
model_svm=svm.SVR(kernel=_kernel,C=_C,gamma=_gamma,coef0=_coef0)
fit_svm=model_svm.fit(x,y) # Fitting the model
mse = np.mean((fit_svm.predict(test_feat_short)-y_test)**2)
print(f'MSE: {mse}')
kernels=['linear', 'rbf', 'poly', 'sigmoid']
for i in range(0,len(kernels)):
svm_func(kernels[i],0.2,0.5,0.3)
MSE: 0.041511597922526976
MSE: 0.04297549469926879
MSE: 0.04405276977547405
MSE: 0.6195078379395168
The first two kernels yield the best fit, while the last one should be avoided. Note that apart
from the linear kernel, all other options require parameters. We have used the default ones,
which may explain the poor performance of some non-linear kernels.
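To go beyond the default values, one option (illustrative only, not the original solution) is a small
grid search over the parameters of the rbf kernel:
from sklearn.model_selection import GridSearchCV
from sklearn import svm
grid = GridSearchCV(svm.SVR(kernel="rbf"),                     # tune the rbf kernel instead of defaults
                    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    scoring="neg_mean_squared_error", cv=3)
grid.fit(x, y)                                                 # small training sample from above
print(grid.best_params_)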
Below, we train an SVM model on a much larger training sample, but one that is limited to
the seven major predictors. Even with a smaller number of features, the training is time
consuming.
y = train_label_xgb.iloc[1:50000]
# Train label
x = train_features_xgb.iloc[1:50000,]
# Training features
test_feat_short=testing_sample[features_short]
model_svm_full = svm.SVR(
    kernel='linear',
    # SVM kernel (alternatives: rbf, poly, sigmoid)
    C=0.1,
    # Slack variable penalisation
    epsilon=0.1,
    # Width of the strip for errors
    gamma=0.5
    # Constant in the radial kernel (unused with the linear kernel)
)
fit_svm_full=model_svm_full.fit(x, y) # Fitting the model
hitratio = np.mean(fit_svm_full.predict(test_feat_short)*y_test>0)
print(f'Hit Ratio: {hitratio}')
Then, we specify the network structure. First, the three independent networks, then the
aggregation.
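For completeness, a minimal sketch of the three sub-networks, consistent with the model summary
printed below (variable names, the relu activation, and the absence of activation on the small layers
are assumptions), is:
first_input = Input(shape=(31,), name="first_input")    # 31 predictors per branch (see summary)
second_input = Input(shape=(31,), name="second_input")
third_input = Input(shape=(31,), name="third_input")
branch_1 = tf.keras.layers.Dense(8, activation="relu", name="layer_1")(first_input)
branch_1 = tf.keras.layers.Dense(2)(branch_1)
branch_2 = tf.keras.layers.Dense(8, activation="relu", name="layer_2")(second_input)
branch_2 = tf.keras.layers.Dense(2)(branch_2)
branch_3 = tf.keras.layers.Dense(8, activation="relu", name="layer_3")(third_input)
branch_3 = tf.keras.layers.Dense(2)(branch_3)
main_output = tf.keras.layers.concatenate([branch_1, branch_2, branch_3])  # Aggregation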
main_output = tf.keras.layers.Dense(units=2, activation='softmax')(main_output)
# Combination
model_ens=keras.Model([first_input,second_input,third_input],main_output)
# Agg. Model specs
Model: "model_2"
Layer (type) Output Shape Param # Connected to
================================================================================
first_input (InputLayer) [(None, 31)] 0 []
second_input (InputLayer) [(None, 31)] 0 []
third_input (InputLayer) [(None, 31)] 0 []
layer_1 (Dense) (None, 8) 256
['first_input[0][0]']
layer_2 (Dense) (None, 8) 256
['second_input[0][0]']
layer_3 (Dense) (None, 8) 256
['third_input[0][0]']
dense_14 (Dense) (None, 2) 18
['layer_1[0][0]']
dense_15 (Dense) (None, 2) 18
['layer_2[0][0]']
dense_16 (Dense) (None, 2) 18
['layer_3[0][0]']
concatenate (Concatenate) (None, 6) 0
['dense_14[0][0]',
'dense_15[0][0]',
'dense_16[0][0]']
dense_17 (Dense) (None, 2) 14
['concatenate[0][0]']
================================================================================
Total params: 836
Trainable params: 836
Non-trainable params: 0
NN_train_labels_C = to_categorical(training_sample['R1M_Usd_C'].values)
# One-hot encoding of the label
NN_test_labels_C = to_categorical(testing_sample['R1M_Usd_C'].values)
# One-hot encoding of the label
model_ens.compile( # Learning parameters
optimizer = "adam",
loss='binary_crossentropy',
metrics='categorical_accuracy')
fit_NN_ens = model_ens.fit(
    (feat_train_1, feat_train_2, feat_train_3),
    y=NN_train_labels_C, verbose=0,
    epochs=12)  # Nb rounds
18.8 Chapter 12
18.8.1 EW portfolios
This one is incredibly easy; it’s simpler and more compact but close in spirit to the code
that generates Figure 3.1. The returns are plotted in Figure 18.14.
data_ml.groupby("date").mean()['R1M_Usd'].plot(
figsize=[16,6],ylabel='Return')
Second, we test it on some data. We use the returns created at the end of Chapter
1 and used for the Lasso allocation in Section 5.2.2. For µ, we use the sample average, which
is rarely a good idea in practice; it serves as an illustration only.
Sigma = returns.cov().values
# Covariance matrix
mu = returns.mean(axis=0).values
# Vector of exp. returns
Lambda = np.eye(Sigma.shape[0])
# Trans. Cost matrix
lamda = 1
# Risk aversion
k_D = 1
k_R = 1
w_old = np.ones(Sigma.shape[0]) / Sigma.shape[0]
# Prev. weights: EW
weights(Sigma, mu, Lambda, lamda, k_D, k_R, w_old)[:5]
# First 5 weights, as an example
Some weights can of course be negative. Finally, we test some sensitivity. We examine three
key indicators: diversification, which we measure via the inverse of the sum of squared
weights (inverse Herfindahl-Hirschman index); leverage, which we assess via the absolute
sum of negative weights; and in-sample volatility, which we compute as w'Σw.
To do so, we create a dedicated function below.
def sensi(lamda, k_D):  # Function header and first indicator reconstructed (signature is an assumption)
    w = weights(Sigma, mu, Lambda, lamda, k_D, k_R, w_old)
    out = []
    out.append(1/np.sum(w**2))
    # Diversification (inverse Herfindahl-Hirschman index)
    out.append(np.sum(np.abs(w[w<0])))
    # Leverage
    out.append(w.T@Sigma@w)
    # In-sample vol
    return out
Instead of using the baseline map2 function, we rely on a version thereof that concatenates
results into a dataframe directly.
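As a sketch of that step (the parameter grids below are hypothetical and the sensi signature is the
one reconstructed above):
import itertools
lamda_seq = [0.1, 0.3, 1, 3, 10]   # hypothetical grid for risk aversion
k_D_seq = [0.1, 0.3, 1, 3, 10]     # hypothetical grid for the diversification parameter
sensi_res = pd.DataFrame(
    [[l, k] + sensi(l, k) for l, k in itertools.product(lamda_seq, k_D_seq)],
    columns=["lambda", "k_D", "diversification", "leverage", "in_sample_vol"])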
In Figure 18.15, each panel displays an indicator. In the first panel, we see that diversification
increases with kD : indeed, as this number increases, the portfolio converges to uniform (EW)
values. The parameter λ has a minor impact. The second panel naturally shows the inverse
effect for leverage: as diversification increases with kD, leverage (i.e., the sum of negative positions,
or short sales) decreases. Finally, the last panel shows that in-sample volatility is however
largely driven by the risk aversion parameter. As λ increases, volatility logically decreases.
For small values of λ, kD is negatively related to volatility but the pattern reverses for large
values of λ. This is because the equally weighted portfolio is less risky than very leveraged
mean-variance policies, but more risky than the minimum-variance portfolio.
18.9 Chapter 15
We recycle the AE model trained in Chapter 15. Strangely, building smaller models (the
encoder) from larger ones (the AE) requires saving and then reloading the weights. This creates
an external file, which we call “ae_weights”. We can check that the output does have four
columns (the compressed representation) instead of seven (the original data).
# First, encode. The input layer and the 32-unit hidden layer below are reconstructed
# from the model summary; the original lines (and the reloading of "ae_weights") are assumptions.
input_layer = Input(shape=(7,))                           # 7 original predictors
encoder2 = tf.keras.layers.Dense(units=32)(input_layer)   # Hidden layer (32 units)
encoder2 = tf.keras.layers.Dense(units=4)(encoder2)
# 4 dimensions for the output layer (same as PCA example)
encoder_model = keras.Model(input_layer, encoder2)
# Builds the model
Model: "model_3"
Layer (type) Output Shape Param #
=================================================================
input_4 (InputLayer) [(None, 7)] 0
dense_18 (Dense) (None, 32) 256
dense_19 (Dense) (None, 4) 132
=================================================================
Total params: 388
Trainable params: 388
Non-trainable params: 0
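As a quick check of the claim above (the predictors passed to the encoder are an assumption), the
compressed output should indeed have four columns:
compressed = encoder_model.predict(data_ml[features_short].values)
print(compressed.shape)   # second dimension should be 4 (vs. 7 original predictors)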
18.10 Chapter 16
All we need to do is change the rho coefficient in the code of Chapter 16.
n_sample = 10**5
# Number of samples to be generated
rho=(-0.8)
# Autoregressive parameter
sd=0.4
# Std. dev. of noise
a=0.06*(rho)
# Scaled mean of returns
ar1 = np.array([1, -rho])
# template for ar param, note that you need to inverse the sign of rho
AR_object1 = ArmaProcess(ar1)
# Creating the AR object
simulated_data_AR1 = AR_object1.generate_sample(nsample=n_sample, scale=sd)
def looping_w_counters(obj_array):
    # Utility function: maps each unique value to an integer counter
    _dict = {z: i for i, z in enumerate(obj_array)}  # Dictionary comprehension
    return _dict
s =looping_w_counters(data_RL['state'].unique())
# Dict for states
a =looping_w_counters(data_RL['action'].unique())
# Dict for actions
fit_RL3 = np.zeros(shape=(len(s),len(a)))
# Placeholder for Q matrix
r_final = 0
for z, row in data_RL.iterrows():  # loop for Q-learning
    act = a[row.action]
    r = row.reward
    s_current = s[row.state]
    s_new = s[row.new_state]
    if np.random.uniform(size=1) < epsilon:
        best_new = a[np.random.choice(list(a.keys()))]  # Explore action space
    else:
        best_new = np.argmax(fit_RL3[s_new,])           # Exploit learned values
    r_final += r
    fit_RL3[s_current, act] += alpha*(
        r + gamma*fit_RL3[s_new, best_new] - fit_RL3[s_current, act])
fit_RL3 = pd.DataFrame(fit_RL3, index=s.keys(), columns=a.keys()).sort_index(axis=1)
print(fit_RL3)
print(f'Reward (last iteration): {r_final}')
return_3=pd.Series(data_ml.loc[data_ml['stock_id']==3,'R1M_Usd'].values)
# Return of asset 3
return_4=pd.Series(data_ml.loc[data_ml['stock_id']==4,'R1M_Usd'].values)
# Return of asset 4
pb_3 = pd.Series(data_ml.loc[data_ml['stock_id']==3, 'Pb'].values)
# P/B ratio of asset 3
for z, row in data_RL.iterrows():  # loop for Q-learning (header as in the previous snippet)
    act = a[row.action]
    r = row.reward
    s_current = s[row.state]
    s_new = s[row.new_state]
    if np.random.uniform(size=1) < epsilon:
        # Explore action space
        best_new = a[np.random.choice(list(a.keys()))]
    else:
        best_new = np.argmax(fit_RL4[s_new,])
        # Exploit learned values
    r_final += r
    fit_RL4[s_current, act] += alpha*(
        r + gamma*fit_RL4[s_new, best_new] - fit_RL4[s_current, act])
fit_RL4 = pd.DataFrame(
    fit_RL4, index=s.keys(), columns=a.keys()).sort_index(axis=1)
print(f'State-Action function Q: {fit_RL4}')
print(f'Reward (last iteration): {r_final}')
The matrix is less sparse compared to the one of Chapter 16; we have covered much more
ground! Some policy recommendations have not changed compared to the smaller sample,
but some have! The change occurs for the states where only a few points were available in
the first trial. With more data, the decision is altered.
Bibliography
Abbasi, A., Albrecht, C., Vance, A., and Hansen, J. (2012). Metafraud: a meta-learning
framework for detecting financial fraud. MIS Quarterly, pages 1293–1327.
Aboussalah, A. M. and Lee, C.-G. (2020). Continuous control with stacked deep dynamic
recurrent reinforcement learning for portfolio optimization. Expert Systems with Appli-
cations, 140:112891.
Adler, T. and Kritzman, M. (2008). The cost of socially responsible investing. Journal of
Portfolio Management, 35(1):52–56.
Agarwal, A., Hazan, E., Kale, S., and Schapire, R. E. (2006). Algorithms for portfolio
management based on the newton method. In Proceedings of the 23rd international
conference on Machine learning, pages 9–16. ACM.
Aggarwal, C. C. (2013). Outlier analysis. Springer.
Aldridge, I. and Avellaneda, M. (2019). Neural networks in finance: Design and performance.
Journal of Financial Data Science, 1(4):39–62.
Alessandrini, F. and Jondeau, E. (2020). Optimal strategies for ESG portfolios. SSRN
Working Paper, 3578830.
Allison, P. D. (2001). Missing data, volume 136. Sage publications.
Almahdi, S. and Yang, S. Y. (2017). An adaptive portfolio trading system: A risk-return
portfolio optimization using recurrent reinforcement learning with expected maximum
drawdown. Expert Systems with Applications, 87:267–279.
Almahdi, S. and Yang, S. Y. (2019). A constrained portfolio trading system using particle
swarm algorithm and recurrent reinforcement learning. Expert Systems with Applications,
130:145–156.
Alti, A. and Titman, S. (2019). A dynamic model of characteristic-based return predictabil-
ity. Journal of Finance, 74(6):3187–3216.
Ammann, M., Coqueret, G., and Schade, J.-P. (2016). Characteristics-based portfolio choice
with leverage constraints. Journal of Banking & Finance, 70:23–37.
Amrhein, V., Greenland, S., and McShane, B. (2019). Scientists rise up against statistical
significance. Nature, 567:305–307.
Anderson, J. A. and Rosenfeld, E. (2000). Talking nets: An oral history of neural networks.
MIT Press.
Andersson, K. and Oosterlee, C. (2020). A deep learning approach for computations of
exposure profiles for high-dimensional bermudan options. arXiv Preprint, (2003.01977).
Ang, A. (2014). Asset management: A systematic approach to factor investing. Oxford
University Press.
Ang, A., Hodrick, R. J., Xing, Y., and Zhang, X. (2006). The cross-section of volatility and
expected returns. Journal of Finance, 61(1):259–299.
Ang, A. and Kristensen, D. (2012). Testing conditional factor models. Journal of Financial
Economics, 106(1):132–156.
Ang, A., Liu, J., and Schwarz, K. (2018). Using individual stocks or portfolios in tests of
factor models. SSRN Working Paper, 1106463.
Arik, S. O. and Pfister, T. (2019). Tabnet: Attentive interpretable tabular learning. arXiv
Preprint, (1908.07442).
Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). Invariant risk minimiza-
tion. arXiv Preprint, (1907.02893).
Arnott, R., Harvey, C. R., Kalesnik, V., and Linnainmaa, J. (2019a). Alice’s adventures in
factorland: Three blunders that plague factor investing. Journal of Portfolio Management,
45(4):18–36.
Arnott, R., Harvey, C. R., and Markowitz, H. (2019b). A backtesting protocol in the era of
machine learning. Journal of Financial Data Science, 1(1):64–74.
Arnott, R. D., Clements, M., Kalesnik, V., and Linnainmaa, J. T. (2020). Factor momentum.
Journal of the American Statistical Association, 3116974.
Arnott, R. D., Hsu, J. C., Liu, J., and Markowitz, H. (2014). Can noise create the size and
value effects? Management Science, 61(11):2569–2579.
Aronow, P. M. and Sävje, F. (2019). Book review. The book of Why: The new science of
cause and effect. Journal of the American Statistical Association, 115(529):482–485.
Asness, C., Chandra, S., Ilmanen, A., and Israel, R. (2017). Contrarian factor timing is
deceptively difficult. Journal of Portfolio Management, 43(5):72–87.
Asness, C. and Frazzini, A. (2013). The devil in HML's details. Journal of Portfolio Man-
agement, 39(4):49–68.
Asness, C., Frazzini, A., Gormsen, N. J., and Pedersen, L. H. (2020). Betting against corre-
lation: Testing theories of the low-risk effect. Journal of Financial Economics, 135(3):629–
652.
Asness, C., Frazzini, A., Israel, R., Moskowitz, T. J., and Pedersen, L. H. (2018). Size
matters, if you control your junk. Journal of Financial Economics, 129(3):479–509.
Asness, C., Ilmanen, A., Israel, R., and Moskowitz, T. (2015). Investing with style. Journal
of Investment Management, 13(1):27–63.
Asness, C. S., Moskowitz, T. J., and Pedersen, L. H. (2013). Value and momentum every-
where. Journal of Finance, 68(3):929–985.
Astakhov, A., Havranek, T., and Novak, J. (2019). Firm size and stock returns: A quanti-
tative survey. Journal of Economic Surveys, 33(5):1463–1492.
Atta-Darkua, V., Chambers, D., Dimson, E., Ran, Z., and Yu, T. (2020). Strategies for
responsible investing: Emerging academic evidence. Journal of Portfolio Management,
46(3):26–35.
Back, K. (2010). Asset pricing and portfolio choice theory. Oxford University Press.
Baesens, B., Van Vlasselaer, V., and Verbeke, W. (2015). Fraud analytics using descriptive,
predictive, and social network techniques: a guide to data science for fraud detection. John
Wiley & Sons.
Bailey, D. H. and de Prado, M. L. (2014). The deflated sharpe ratio: correcting for selection
bias, backtest overfitting, and non-normality. Journal of Portfolio Management, 40(5):94–
107.
Bailey, T. and Jain, A. (1978). A note on distance-weighted k-nearest neighbor rules. IEEE
Trans. on Systems, Man, Cybernetics, 8(4):311–313.
Bajgrowicz, P. and Scaillet, O. (2012). Technical trading revisited: False discoveries, per-
sistence tests, and transaction costs. Journal of Financial Economics, 106(3):473–491.
Baker, M., Bradley, B., and Wurgler, J. (2011). Benchmarks as limits to arbitrage: Under-
standing the low-volatility anomaly. Financial Analysts Journal, 67(1):40–54.
Baker, M., Hoeyer, M. F., and Wurgler, J. (2020). Leverage and the beta anomaly. Journal
of Financial and Quantitative Analysis, 55(5):1491–1514.
Baker, M., Luo, P., and Taliaferro, R. (2017). Detecting anomalies: The relevance and power
of standard asset pricing tests. SSRN Working Paper.
Bali, T. G., Engle, R. F., and Murray, S. (2016). Empirical asset pricing: the cross section
of stock returns. John Wiley & Sons.
Ballings, M., Van den Poel, D., Hespeels, N., and Gryp, R. (2015). Evaluating multi-
ple classifiers for stock price direction prediction. Expert Systems with Applications,
42(20):7046–7056.
Ban, G.-Y., El Karoui, N., and Lim, A. E. (2016). Machine learning and portfolio optimiza-
tion. Management Science, 64(3):1136–1154.
Bansal, R., Hsieh, D. A., and Viswanathan, S. (1993). A new approach to international
arbitrage pricing. Journal of Finance, 48(5):1719–1747.
Bansal, R. and Viswanathan, S. (1993). No arbitrage and arbitrage pricing: A new approach.
Journal of Finance, 48(4):1231–1262.
Banz, R. W. (1981). The relationship between return and market value of common stocks.
Journal of Financial Economics, 9(1):3–18.
Barberis, N. (2018). Psychology-based models of asset prices and trading volume. In
Handbook of behavioral economics: applications and foundations 1, volume 1, pages 79–
175. Elsevier.
Barberis, N., Greenwood, R., Jin, L., and Shleifer, A. (2015). X-CAPM: An extrapolative
capital asset pricing model. Journal of Financial Economics, 115(1):1–24.
Barberis, N., Jin, L. J., and Wang, B. (2020). Prospect theory and stock market anomalies.
SSRN Working Paper, 3477463.
Barberis, N., Mukherjee, A., and Wang, B. (2016). Prospect theory and stock returns: An
empirical test. Review of Financial Studies, 29(11):3068–3107.
Barberis, N. and Shleifer, A. (2003). Style investing. Journal of Financial Economics,
68(2):161–199.
Barillas, F. and Shanken, J. (2018). Comparing asset pricing models. Journal of Finance,
73(2):715–754.
Bhatia, N. et al. (2010). Survey of nearest neighbor techniques. arXiv Preprint, (1007.0085).
Bhattacharyya, S., Jha, S., Tharakunnel, K., and Westland, J. C. (2011). Data mining for
credit card fraud: A comparative study. Decision Support Systems, 50(3):602–613.
Biau, G. (2012). Analysis of a random forests model. Journal of Machine Learning Research,
13(Apr):1063–1095.
Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other
averaging classifiers. Journal of Machine Learning Research, 9(Sep):2015–2033.
Black, F. and Litterman, R. (1992). Global portfolio optimization. Financial Analysts
Journal, 48(5):28–43.
Blank, H., Davis, R., and Greene, S. (2019). Using alternative research data in real-world
portfolios. Journal of Investing, 28(4):95–103.
Blitz, D. and Swinkels, L. (2020). Is exclusion effective? Journal of Portfolio Management,
46(3):42–48.
Blum, A. and Kalai, A. (1999). Universal portfolios with and without transaction costs.
Machine Learning, 35(3):193–205.
Bodnar, T., Parolya, N., and Schmid, W. (2013). On the equivalence of quadratic opti-
mization problems commonly used in portfolio theory. European Journal of Operational
Research, 229(3):637–644.
Boehmke, B. and Greenwell, B. (2019). Hands-on Machine Learning with R. Chapman &
Hall / CRC.
Boloorforoosh, A., Christoffersen, P., Gourieroux, C., and Fournier, M. (2020). Beta risk in
the cross-section of equities. The Review of Financial Studies, 33(9):4318–4366.
Bonaccolto, G. and Paterlini, S. (2019). Developing new portfolio strategies by aggregation.
Annals of Operations Research, pages 1–39.
Boriah, S., Chandola, V., and Kumar, V. (2008). Similarity measures for categorical data:
A comparative evaluation. In Proceedings of the 2008 SIAM international conference on
data mining, pages 243–254.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal
margin classifiers. In Proceedings of the fifth annual workshop on Computational learning
theory, pages 144–152. ACM.
Bouchaud, J.-p., Krueger, P., Landier, A., and Thesmar, D. (2019). Sticky expectations
and the profitability anomaly. Journal of Finance, 74(2):639–674.
Bouthillier, X. and Varoquaux, G. (2020). Survey of machine-learning experimental methods
at neurips2019 and iclr2020. Research report, Inria Saclay Ile de France.
Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Branch, B. and Cai, L. (2012). Do socially responsible index investors incur an opportunity
cost? Financial Review, 47(3):617–630.
Brandt, M. W., Santa-Clara, P., and Valkanov, R. (2009). Parametric portfolio policies:
Exploiting characteristics in the cross-section of equity returns. Review of Financial
Studies, 22(9):3411–3447.
Braun, H. and Chandler, J. S. (1987). Predicting stock market behavior through rule induc-
tion: an application of the learning-from-example approach. Decision Sciences, 18(3):415–
429.
Breiman, L. (1996). Stacked regressions. Machine Learning, 24(1):49–64.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Breiman, L. et al. (2004). Population theory for boosting ensembles. Annals of Statistics,
32(1):1–11.
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. (1984). Classification And Regression
Trees. Chapman & Hall.
Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., Scott, S. L., et al. (2015). Inferring
causal impact using bayesian structural time-series models. Annals of Applied Statistics,
9(1):247–274.
Brodie, J., Daubechies, I., De Mol, C., Giannone, D., and Loris, I. (2009). Sparse and stable
markowitz portfolios. Proceedings of the National Academy of Sciences, 106(30):12267–
12272.
Brown, I. and Mues, C. (2012). An experimental comparison of classification algorithms for
imbalanced credit scoring data sets. Expert Systems with Applications, 39(3):3446–3453.
Bruder, B., Cheikh, Y., Deixonne, F., and Zheng, B. (2019). Integration of ESG in asset
allocation. SSRN Working Paper, 3473874.
Bryzgalova, S. (2016). Spurious factors in linear asset pricing models. Unpublished
Manuscript, Stanford University.
Bryzgalova, S., Huang, J., and Julliard, C. (2019a). Bayesian solutions for the factor zoo:
We just ran two quadrillion models. SSRN Working Paper, 3481736.
Bryzgalova, S., Pelger, M., and Zhu, J. (2019b). Forest through the trees: Building cross-
sections of stock returns. SSRN Working Paper, 3493458.
Buehler, H., Gonon, L., Teichmann, J., and Wood, B. (2019). Deep hedging. Quantitative
Finance, 19(8):1271–1291.
Bühlmann, P., Peters, J., Ernest, J., et al. (2014). Cam: Causal additive models, high-
dimensional order search and penalized regression. Annals of Statistics, 42(6):2526–2556.
Burrell, P. R. and Folarin, B. O. (1997). The impact of neural networks in finance. Neural
Computing & Applications, 6(4):193–200.
Bustos, O. and Pomares-Quimbaya, A. (2020). Stock market movement forecast: A system-
atic review. Expert Systems with Applications, 156:113464.
Camilleri, M. A. (2021). The market for socially responsible investing: A review of the
developments. Social Responsibility Journal, 17(3):412–428.
Campbell, J. Y. and Yogo, M. (2006). Efficient tests of stock return predictability. Journal
of Financial Economics, 81(1):27–60.
Cao, L.-J. and Tay, F. E. H. (2003). Support vector machine with adaptive parameters
in financial time series forecasting. IEEE Transactions on Neural Networks, 14(6):1506–
1518.
Chen, L., Da, Z., and Priestley, R. (2012). Dividend smoothing and predictability. Man-
agement Science, 58(10):1834–1853.
Chen, L., Pelger, M., and Zhu, J. (2020). Deep learning in asset pricing. SSRN Working
Paper, 3350138.
Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings
of the 22nd ACM SIGKDD International conference on knowledge discovery and data
mining, pages 785–794. ACM.
Chen, Y. and Hao, Y. (2017). A feature weighted support vector machine and k-nearest
neighbor algorithm for stock market indices prediction. Expert Systems with Applications,
80:340–355.
Chib, S., Zeng, X., and Zhao, L. (2020). On comparing asset pricing models. Journal of
Finance, 75(1):551–577.
Chinco, A., Clark-Joseph, A. D., and Ye, M. (2019a). Sparse signals in the cross-section of
returns. Journal of Finance, 74(1):449–492.
Chinco, A., Hartzmark, S. M., and Sussman, A. B. (2019b). Necessary evidence for a risk
factor’s relevance. SSRN Working Paper, 3487624.
Chinco, A., Neuhierl, A., and Weber, M. (2021). Estimating the anomaly base rate. Journal
of financial economics, 140(1):101–126.
Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive
regression trees. Annals of Applied Statistics, 4(1):266–298.
Choi, S. M. and Kim, H. (2014). Momentum effect as part of a market equilibrium. Journal
of Financial and Quantitative Analysis, 49(1):107–130.
Chollet, F. (2017). Deep learning with Python. Manning Publications Company.
Chordia, T., Goyal, A., and Saretto, A. (2020). Anomalies and false rejections. Review of
Financial Studies, 33(5):2134–2179.
Chordia, T., Goyal, A., and Shanken, J. (2019). Cross-sectional asset pricing with individual
stocks: betas versus characteristics. SSRN Working Paper, 2549578.
Chow, Y.-F., Cotsomitis, J. A., and Kwan, A. C. (2002). Multivariate cointegration
and causality tests of wagner’s hypothesis: evidence from the UK. Applied Economics,
34(13):1671–1677.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2015). Gated feedback recurrent neural
networks. In International Conference on Machine Learning, pages 2067–2075.
Claeskens, G. and Hjort, N. L. (2008). Model selection and model averaging. Cambridge
University Press.
Clark, T. E. and McCracken, M. W. (2009). Improving forecast accuracy by combining
recursive and rolling forecasts. International Economic Review, 50(2):363–395.
Cocco, J. F., Gomes, F., and Lopes, P. (2020). Evidence on expectations of household
finances. SSRN Working Paper, 3362495.
Cochrane, J. H. (2009). Asset pricing: Revised edition. Princeton University Press.
Cochrane, J. H. (2011). Presidential address: Discount rates. Journal of Finance,
66(4):1047–1108.
Cong, L. W., Liang, T., and Zhang, X. (2019a). Analyzing textual information at scale.
SSRN Working Paper, 3449822.
Cong, L. W., Liang, T., and Zhang, X. (2019b). Textual factors: A scalable, interpretable,
and data-driven approach to analyzing unstructured information. SSRN Working Paper,
3307057.
Cong, L. W. and Xu, D. (2019). Rise of factor investing: asset prices, informational efficiency,
and security design. SSRN Working Paper, 2800590.
Connor, G. and Korajczyk, R. A. (1988). Risk and return in an equilibrium apt: Application
of a new test methodology. Journal of Financial Economics, 21(2):255–289.
Cont, R. (2007). Volatility clustering in financial markets: empirical facts and agent-based
models. In Long memory in economics, pages 289–309. Springer.
Cooper, I. and Maio, P. F. (2019). New evidence on conditional factor models. Journal of
Financial and Quantitative Analysis, 54(5):1975–2016.
Coqueret, G. (2015). Diversified minimum-variance portfolios. Annals of Finance,
11(2):221–241.
Coqueret, G. (2017). Approximate NORTA simulations for virtual sample generation. Ex-
pert Systems with Applications, 73:69–81.
Coqueret, G. (2020). Stock-specific sentiment and return predictability. Quantitative Fi-
nance, 20(9):1531–1551.
Coqueret, G. and Guida, T. (2020). Training trees on tails with applications to portfolio
choice. Annals of Operations Research, 288:181–221.
Cornell, B. (2020). Stock characteristics and stock returns: A skeptic’s look at the cross
section of expected returns. Journal of Portfolio Management.
Cornuejols, A., Miclet, L., and Barra, V. (2018). Apprentissage artificiel: Deep learning,
concepts et algorithmes. Eyrolles.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–
297.
Costarelli, D., Spigler, R., and Vinti, G. (2016). A survey on approximation by means of
neural network operators. Journal of NeuroTechnology, 1(1).
Cover, T. M. (1991). Universal portfolios. Mathematical Finance, 1(1):1–29.
Cover, T. M. and Ordentlich, E. (1996). Universal portfolios with side information. IEEE
Transactions on Information Theory, 42(2):348–363.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. (2006). Online
passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar):551–585.
Cronqvist, H., Previtero, A., Siegel, S., and White, R. E. (2015a). The fetal origins hypoth-
esis in finance: Prenatal environment, the gender gap, and investor behavior. Review of
Financial Studies, 29(3):739–786.
Cronqvist, H., Siegel, S., and Yu, F. (2015b). Value versus growth investing: Why do
different investors have different styles? Journal of Financial Economics, 117(2):333–349.
Cuchiero, C., Klein, I., and Teichmann, J. (2016). A new perspective on the fundamental
theorem of asset pricing for large financial markets. Theory of Probability & Its Applica-
tions, 60(4):561–579.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics
of Control, Signals and Systems, 2(4):303–314.
Dangl, T. and Halling, M. (2012). Predictive regressions with time-varying coefficients.
Journal of Financial Economics, 106(1):157–181.
Dangl, T. and Weissensteiner, A. (2020). Optimal portfolios under time-varying investment
opportunities, parameter uncertainty, and ambiguity aversion. Journal of Financial and
Quantitative Analysis, 55(4):1163–1198.
Daniel, K., Hirshleifer, D., and Sun, L. (2020a). Short and long horizon behavioral factors.
Review of Financial Studies, 33(4):1673–1736.
Daniel, K. and Moskowitz, T. J. (2016). Momentum crashes. Journal of Financial Eco-
nomics, 122(2):221–247.
Daniel, K., Mota, L., Rottke, S., and Santos, T. (2020b). The cross-section of risk and
return. Review of Financial Studies, 33(5):1927–1979.
Daniel, K. and Titman, S. (1997). Evidence on the characteristics of cross sectional variation
in stock returns. Journal of Finance, 52(1):1–33.
Daniel, K. and Titman, S. (2012). Testing factor-model explanations of market anomalies.
Critical Finance Review, 1(1):103–139.
Daniel, K., Titman, S., and Wei, K. J. (2001a). Explaining the cross-section of stock returns
in Japan: Factors or characteristics? Journal of Finance, 56(2):743–766.
Daniel, K. D., Hirshleifer, D., and Subrahmanyam, A. (2001b). Overconfidence, arbitrage,
and equilibrium asset pricing. Journal of Finance, 56(3):921–965.
d’Aspremont, A. (2011). Identifying small mean-reverting portfolios. Quantitative Finance,
11(3):351–364.
de Franco, C., Geissler, C., Margot, V., and Monnier, B. (2020). ESG investments: Filtering
versus machine learning approaches. arXiv Preprint, (2002.07477).
De Moor, L., Dhaene, G., and Sercu, P. (2015). On comparing zero-alpha tests across
multifactor asset pricing models. Journal of Banking & Finance, 61:S235–S240.
De Prado, M. L. (2018). Advances in Financial Machine Learning. John Wiley & Sons.
de Prado, M. L. and Fabozzi, F. J. (2020). Crowdsourced investment research through
tournaments. Journal of Financial Data Science, 2(1):86–93.
Delbaen, F. and Schachermayer, W. (1994). A general version of the fundamental theorem
of asset pricing. Mathematische Annalen, 300(1):463–520.
Demetrescu, M., Georgiev, I., Rodrigues, P. M., and Taylor, A. R. (2022). Testing for
episodic predictability in stock returns. Journal of Econometrics, 227(1):85–113.
DeMiguel, V., Garlappi, L., Nogales, F. J., and Uppal, R. (2009a). A generalized approach
to portfolio optimization: Improving performance by constraining portfolio norms. Man-
agement Science, 55(5):798–812.
DeMiguel, V., Garlappi, L., and Uppal, R. (2009b). Optimal versus naive diversification:
How inefficient is the 1/N portfolio strategy? Review of Financial Studies, 22(5):1915–
1953.
DeMiguel, V., Martín-Utrera, A., and Nogales, F. J. (2015). Parameter uncertainty in mul-
tiperiod portfolio optimization with transaction costs. Journal of Financial and Quanti-
tative Analysis, 50(6):1443–1471.
DeMiguel, V., Martin Utrera, A., and Uppal, R. (2019). What alleviates crowding in factor
investing? SSRN Working Paper, 3392875.
DeMiguel, V., Martin Utrera, A., Uppal, R., and Nogales, F. J. (2020). A transaction-
cost perspective on the multitude of firm characteristics. Review of Financial Studies,
33(5):2180–2222.
Denil, M., Matheson, D., and De Freitas, N. (2014). Narrowing the gap: Random forests in
theory and in practice. In International Conference on Machine Learning, pages 665–673.
Dichtl, H., Drobetz, W., Lohre, H., Rother, C., and Vosskamp, P. (2019). Optimal timing
and tilting of equity factors. Financial Analysts Journal, 75(4):84–102.
Dichtl, H., Drobetz, W., Neuhierl, A., and Wendt, V.-S. (2021a). Data snooping in equity
premium prediction. International Journal of Forecasting, 37(1):72–94.
Dichtl, H., Drobetz, W., and Wendt, V.-S. (2021b). How to build a factor portfolio: Does
the allocation strategy matter? European Financial Management, 27(1):20–58.
Dingli, A. and Fournier, K. S. (2017). Financial time series forecasting–a deep learning
approach. International Journal of Machine Learning and Computing, 7(5):118–122.
Dixon, M. F. (2020). Industrial forecasting with exponentially smoothed recurrent neural
networks. SSRN Working Paper, (3572181).
Dixon, M. F., Halperin, I., and Bilokon, P. (2020). Machine Learning in Finance: From
Theory to Practice. Springer.
Donaldson, R. G. and Kamstra, M. (1996). Forecast combining with neural networks.
Journal of Forecasting, 15(1):49–61.
Drucker, H. (1997). Improving regressors using boosting techniques. In International Con-
ference on Machine Learning, volume 97, pages 107–115.
Drucker, H., Burges, C. J., Kaufman, L., Smola, A. J., and Vapnik, V. (1997). Support
vector regression machines. In Advances in Neural Information Processing Systems, pages
155–161.
Du, K.-L. and Swamy, M. N. (2013). Neural networks and statistical learning. Springer
Science & Business Media.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learn-
ing and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–
2159.
Dunis, C. L., Likothanassis, S. D., Karathanasopoulos, A. S., Sermpinis, G. S., and Theofi-
latos, K. A. (2013). A hybrid genetic algorithm–support vector machine approach in the
task of forecasting and trading. Journal of Asset Management, 14(1):52–71.
Eakins, S. G., Stansell, S. R., and Buck, J. F. (1998). Analyzing the nature of institutional
demand for common stocks. Quarterly Journal of Business and Economics, pages 33–48.
Efimov, D. and Xu, D. (2019). Using generative adversarial networks to synthesize artifi-
cial financial datasets. Proceedings of the Conference on Neural Information Processing
Systems.
Ehsani, S. and Linnainmaa, J. T. (2019). Factor momentum and the momentum factor.
SSRN Working Paper, 3014521.
Elliott, G., Kudrin, N., and Wuthrich, K. (2019). Detecting p-hacking. arXiv Preprint,
(1906.06711).
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2):179–211.
Enders, C. K. (2001). A primer on maximum likelihood algorithms available for use with
missing data. Structural Equation Modeling, 8(1):128–141.
Enders, C. K. (2010). Applied missing data analysis. Guilford Press.
Engelberg, J., McLean, R. D., and Pontiff, J. (2018). Anomalies and news. Journal of
Finance, 73(5):1971–2001.
Engilberge, M., Chevallier, L., Pérez, P., and Cord, M. (2019). Sodeep: a sorting deep net
to learn ranking loss surrogates. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 10792–10801.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the
variance of united kingdom inflation. Econometrica, pages 987–1007.
Enke, D. and Thawornwong, S. (2005). The use of data mining and neural networks for
forecasting stock market returns. Expert Systems with Applications, 29(4):927–940.
Fabozzi, F. J. (2020). Introduction: Special issue on ethical investing. Journal of Portfolio
Management, 46(3):1–4.
Fabozzi, F. J. and de Prado, M. L. (2018). Being honest in backtest reporting: A template
for disclosing multiple tests. Journal of Portfolio Management, 45(1):141–147.
Fama, E. F. and French, K. R. (1992). The cross-section of expected stock returns. Journal
of Finance, 47(2):427–465.
Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on stocks and
bonds. Journal of Financial Economics, 33(1):3–56.
Fama, E. F. and French, K. R. (2015). A five-factor asset pricing model. Journal of Financial
Economics, 116(1):1–22.
Fama, E. F. and French, K. R. (2018). Choosing factors. Journal of Financial Economics,
128(2):234–252.
Fama, E. F. and MacBeth, J. D. (1973). Risk, return, and equilibrium: Empirical tests.
Journal of Political Economy, 81(3):607–636.
Farmer, L., Schmidt, L., and Timmermann, A. (2019). Pockets of predictability. SSRN
Working Paper, 3152386.
Fastrich, B., Paterlini, S., and Winker, P. (2015). Constructing optimal sparse portfolios
using regularization methods. Computational Management Science, 12(3):417–434.
Feng, G., Giglio, S., and Xiu, D. (2020). Taming the factor zoo: A test of new factors.
Journal of Finance, 75(3):1327–1370.
Feng, G., Polson, N. G., and Xu, J. (2019). Deep learning in characteristics-sorted factor
models. SSRN Working Paper, 3243683.
Fischer, T. and Krauss, C. (2018). Deep learning with long short-term memory networks for
financial market predictions. European Journal of Operational Research, 270(2):654–669.
Fisher, A., Rudin, C., and Dominici, F. (2019). All models are wrong, but many are
useful: Learning a variable’s importance by studying an entire class of prediction models
simultaneously. Journal of Machine Learning Research, 20(177):1–81.
Frazier, P. I. (2018). A tutorial on Bayesian optimization. arXiv Preprint, (1807.02811).
Frazzini, A. and Pedersen, L. H. (2014). Betting against beta. Journal of Financial Eco-
nomics, 111(1):1–25.
Freeman, R. N. and Tse, S. Y. (1992). A nonlinear model of security price responses to
unexpected earnings. Journal of Accounting Research, pages 185–209.
Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In
Machine Learning: Proceedings of the Thirteenth International Conference, volume 96,
pages 148–156.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.
Freyberger, J., Neuhierl, A., and Weber, M. (2020). Dissecting characteristics nonparamet-
rically. Review of Financial Studies, 33(5):2326–2377.
Friede, G., Busch, T., and Bassen, A. (2015). ESG and financial performance: aggregated
evidence from more than 2000 empirical studies. Journal of Sustainable Finance & In-
vestment, 5(4):210–233.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation
with the graphical lasso. Biostatistics, 9(3):432–441.
Friedman, J., Hastie, T., Tibshirani, R., et al. (2000). Additive logistic regression: a statisti-
cal view of boosting (with discussion and a rejoinder by the authors). Annals of Statistics,
28(2):337–407.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine.
Annals of Statistics, pages 1189–1232.
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data
Analysis, 38(4):367–378.
Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian network classifiers. Machine
Learning, 29(2-3):131–163.
Frost, P. A. and Savarino, J. E. (1986). An empirical bayes approach to efficient portfolio
selection. Journal of Financial and Quantitative Analysis, 21(3):293–305.
Fu, X., Du, J., Guo, Y., Liu, M., Dong, T., and Duan, X. (2018). A machine learning
framework for stock selection. arXiv Preprint, (1806.01743).
Gaba, A., Tsetlin, I., and Winkler, R. L. (2017). Combining interval forecasts. Decision
Analysis, 14(1):1–20.
Gagliardini, P., Ossola, E., and Scaillet, O. (2016). Time-varying risk premium in large
cross-sectional equity data sets. Econometrica, 84(3):985–1046.
Gagliardini, P., Ossola, E., and Scaillet, O. (2019). Estimation of large dimensional condi-
tional factor models in finance. SSRN Working Paper, 3443426.
Galema, R., Plantinga, A., and Scholtens, B. (2008). The stocks at stake: Return and risk
in socially responsible investment. Journal of Banking & Finance, 32(12):2646–2654.
Galili, T. and Meilijson, I. (2016). Splitting matters: how monotone transformation of
predictor variables may improve the predictions of decision tree models. arXiv Preprint,
(1611.04561).
García-Galicia, M., Carsteanu, A. A., and Clempner, J. B. (2019). Continuous-time rein-
forcement learning approach for portfolio management with time penalization. Expert
Systems with Applications, 129:27–36.
García-Laencina, P. J., Sancho-Gómez, J.-L., Figueiras-Vidal, A. R., and Verleysen, M.
(2009). K nearest neighbours with mutual information for simultaneous classification
and missing data imputation. Neurocomputing, 72(7-9):1483–1493.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013).
Bayesian Data Analysis, 3rd Edition. Chapman & Hall / CRC.
Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance
dilemma. Neural Computation, 4(1):1–58.
Genre, V., Kenny, G., Meyler, A., and Timmermann, A. (2013). Combining expert forecasts:
Can anything beat the simple average? International Journal of Forecasting, 29(1):108–
121.
Gentzkow, M., Kelly, B., and Taddy, M. (2019). Text as data. Journal of Economic Liter-
ature, 57(3):535–74.
Ghosh, A. K. (2006). On optimum choice of k in nearest neighbor classification. Computa-
tional Statistics & Data Analysis, 50(11):3113–3123.
Gibson, R., Glossner, S., Krueger, P., Matos, P., and Steffen, T. (2020). Responsible insti-
tutional investing around the world. SSRN Working Paper, 3525530.
Giglio, S. and Xiu, D. (2019). Asset pricing with omitted factors. SSRN Working Paper,
2865922.
Gomes, J., Kogan, L., and Zhang, L. (2003). Equilibrium cross section of returns. Journal
of Political Economy, 111(4):693–732.
Gong, Q., Liu, M., and Liu, Q. (2015). Momentum is really short-term momentum. Journal
of Banking & Finance, 50:169–182.
Gonzalo, J. and Pitarakis, J.-Y. (2019). Predictive regressions. In Oxford Research Ency-
clopedia of Economics and Finance.
Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep learning. MIT Press
Cambridge.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,
A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information
Processing Systems, pages 2672–2680.
Gospodinov, N., Kan, R., and Robotti, C. (2019). Too good to be true? Fallacies in evalu-
ating risk factor models. Journal of Financial Economics, 132(2):451–471.
Goto, S. and Xu, Y. (2015). Improving mean variance optimization through sparse hedging
restrictions. Journal of Financial and Quantitative Analysis, 50(6):1415–1441.
Gougler, A. and Utz, S. (2020). Factor exposures and diversification: Are sustainably
screened portfolios any different? Financial Markets and Portfolio Management, 34:221–
249.
Goyal, A. (2012). Empirical cross-sectional asset pricing: a survey. Financial Markets and
Portfolio Management, 26(1):3–38.
Goyal, A. and Wahal, S. (2015). Is momentum an echo? Journal of Financial and Quanti-
tative Analysis, 50(6):1237–1267.
Granger, C. W. (1969). Investigating causal relations by econometric models and cross-
spectral methods. Econometrica, pages 424–438.
Green, J., Hand, J. R., and Zhang, X. F. (2013). The supraview of return predictive signals.
Review of Accounting Studies, 18(3):692–730.
Green, J., Hand, J. R., and Zhang, X. F. (2017). The characteristics that provide indepen-
dent information about average us monthly stock returns. Review of Financial Studies,
30(12):4389–4436.
Greene, W. H. (2018). Econometric analysis, Eighth Edition. Pearson Education.
Greenwood, R. and Hanson, S. G. (2012). Share issuance and factor timing. Journal of
Finance, 67(2):761–798.
Grinblatt, M. and Han, B. (2005). Prospect theory, mental accounting, and momentum.
Journal of Financial Economics, 78(2):311–339.
Grushka-Cockayne, Y., Jose, V. R. R., and Lichtendahl Jr, K. C. (2016). Ensembles of
overfit and overconfident forecasts. Management Science, 63(4):1110–1130.
Gu, S., Kelly, B., and Xiu, D. (2021). Autoencoder asset pricing models. Journal of
Econometrics, 222(1):429–450.
Gu, S., Kelly, B. T., and Xiu, D. (2020). Empirical asset pricing via machine learning.
Review of Financial Studies, 33(5):2223–2273.
Guida, T. and Coqueret, G. (2018a). Ensemble learning applied to quant equity: gradient
boosting in a multifactor framework. In Big Data and Machine Learning in Quantitative
Investment, pages 129–148. Wiley.
Guida, T. and Coqueret, G. (2018b). Machine learning in systematic equity allocation: A
model comparison. Wilmott, 2018(98):24–33.
Guidolin, M. and Liu, H. (2016). Ambiguity aversion and underdiversification. Journal of
Financial and Quantitative Analysis, 51(4):1297–1323.
Guliyev, N. J. and Ismailov, V. E. (2018). On the approximation by single hidden layer
feedforward neural networks with fixed weights. Neural Networks, 98:296–304.
Gupta, M., Gao, J., Aggarwal, C., and Han, J. (2014). Outlier detection for temporal data.
IEEE Transactions on Knowledge and Data Engineering, 26(9):2250 – 2267.
Gupta, T. and Kelly, B. (2019). Factor momentum everywhere. Journal of Portfolio Man-
agement, 45(3):13–36.
Guresen, E., Kayakutlu, G., and Daim, T. U. (2011). Using artificial neural network models
in stock market index prediction. Expert Systems with Applications, 38(8):10389–10397.
Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal
of Machine Learning Research, 3(Mar):1157–1182.
Haddad, V., Kozak, S., and Santosh, S. (2020). Factor timing. Review of Financial Studies,
33(5):1980–2018.
Hahn, P. R., Murray, J. S., and Carvalho, C. (2019). Bayesian regression tree models for
causal inference: regularization, confounding, and heterogeneous effects. arXiv Preprint,
(1706.09523).
Hall, P. and Gill, N. (2019). An Introduction to Machine Learning Interpretability - Second
Edition. O’Reilly.
Hall, P., Park, B. U., Samworth, R. J., et al. (2008). Choice of neighbor order in nearest-
neighbor classification. Annals of Statistics, 36(5):2135–2152.
Halperin, I. and Feldshteyn, I. (2018). Market self-learning of signals, impact and optimal
trading: Invisible hand inference with free energy. arXiv Preprint, (1805.06126).
Han, Y., He, A., Rapach, D., and Zhou, G. (2019). Firm characteristics and expected stock
returns. SSRN Working Paper, 3185335.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators.
Econometrica, pages 1029–1054.
Harrald, P. G. and Kamstra, M. (1997). Evolving artificial neural networks to combine
financial forecasts. IEEE Transactions on Evolutionary Computation, 1(1):40–52.
Hartzmark, S. M. and Solomon, D. H. (2019). The dividend disconnect. Journal of Finance,
74(5):2153–2199.
Harvey, C. and Liu, Y. (2019a). Lucky factors. SSRN Working Paper, 2528780.
Harvey, C. R. (2017). Presidential address: the scientific outlook in financial economics.
Journal of Finance, 72(4):1399–1440.
Harvey, C. R. (2020). Replication in financial economics. Critical Finance Review, pages
1–9.
Harvey, C. R., Liechty, J. C., Liechty, M. W., and Müller, P. (2010). Portfolio selection with
higher moments. Quantitative Finance, 10(5):469–485.
Harvey, C. R. and Liu, Y. (2015). Backtesting. Journal of Portfolio Management, 42(1):13–
28.
Harvey, C. R. and Liu, Y. (2019b). A census of the factor zoo. SSRN Working Paper,
3341728.
Harvey, C. R. and Liu, Y. (2020). False (and missed) discoveries in financial economics.
The Journal of Finance, 75(5):2503–2553.
Harvey, C. R., Liu, Y., and Saretto, A. (2020). An evaluation of alternative multiple testing
methods for finance applications. Review of Asset Pricing Studies, 10(2):199–248.
Harvey, C. R., Liu, Y., and Zhu, H. (2016). . . . and the cross-section of expected returns.
Review of Financial Studies, 29(1):5–68.
Hasler, M., Khapko, M., and Marfe, R. (2019). Should investors learn about the timing of
equity risk? Journal of Financial Economics, 132(3):182–204.
Hassan, M. R., Nath, B., and Kirley, M. (2007). A fusion model of hmm, ann and ga for
stock market forecasting. Expert Systems with Applications, 33(1):171–180.
Hastie, T. (2020). Ridge regression: an essential concept in data science. arXiv Preprint,
(2006.00371).
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning.
Springer.
Haykin, S. S. (2009). Neural networks and learning machines. Prentice Hall.
Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online
convex optimization. Machine Learning, 69(2-3):169–192.
Hazan, E. et al. (2016). Introduction to online convex optimization. Foundations and
Trends® in Optimization, 2(3-4):157–325.
He, A., Huang, D., and Zhou, G. (2020). New factors wanted: Evidence from a simple
specification test. SSRN Working Paper, 3143752.
Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., and Jennions, M. D. (2015). The extent
and consequences of p-hacking in science. PLoS biology, 13(3):e1002106.
Heinze-Deml, C., Peters, J., and Meinshausen, N. (2018). Invariant causal prediction for
nonlinear models. Journal of Causal Inference, 6(2).
Henkel, S. J., Martin, J. S., and Nardari, F. (2011). Time-varying short-horizon predictabil-
ity. Journal of Financial Economics, 99(3):560–580.
Henrique, B. M., Sobreiro, V. A., and Kimura, H. (2019). Literature review: Machine learn-
ing techniques applied to financial market prediction. Expert Systems with Applications,
124:226–251.
Hiemstra, C. and Jones, J. D. (1994). Testing for linear and nonlinear granger causality in
the stock price-volume relation. Journal of Finance, 49(5):1639–1664.
Hill, R. P., Ainscough, T., Shank, T., and Manullang, D. (2007). Corporate social responsi-
bility and socially responsible investing: A global perspective. Journal of Business Ethics,
70(2):165–174.
Hjalmarsson, E. (2011). New methods for inference in long-horizon regressions. Journal of
Financial and Quantitative Analysis, 46(3):815–839.
Hjalmarsson, E. and Manchev, P. (2012). Characteristic-based mean-variance portfolio
choice. Journal of Banking & Finance, 36(5):1392–1401.
Ho, T. K. (1995). Random decision forests. In Proceedings of 3rd International Conference
on Document Analysis and Recognition, volume 1, pages 278–282. IEEE.
Ho, Y.-C. and Pepyne, D. L. (2002). Simple explanation of the no-free-lunch theorem and
its implications. Journal of Optimization Theory and Applications, 115(3):549–570.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9(8):1735–1780.
Hodge, V. and Austin, J. (2004). A survey of outlier detection methodologies. Artificial
Intelligence Review, 22(2):85–126.
Hodges, P., Hogan, K., Peterson, J. R., and Ang, A. (2017). Factor timing with cross-
sectional and time-series predictors. Journal of Portfolio Management, 44(1):30–43.
Hoechle, D., Schmid, M., and Zimmermann, H. (2018). Correcting alpha misattribution in
portfolio sorts. SSRN Working Paper, 3190310.
Hoi, S. C., Sahoo, D., Lu, J., and Zhao, P. (2018). Online learning: A comprehensive survey.
arXiv Preprint, (1802.02871).
Honaker, J. and King, G. (2010). What to do about missing values in time-series cross-
section data. American Journal of Political Science, 54(2):561–581.
Hong, H., Karolyi, G. A., and Scheinkman, J. A. (2020). Climate finance. Review of
Financial Studies, 33(3):1011–1023.
Hong, H., Li, F. W., and Xu, J. (2019). Climate risks and market efficiency. Journal of
Econometrics, 208(1):265–281.
Horel, E. and Giesecke, K. (2019). Towards explainable AI: Significance tests for neural
networks. arXiv Preprint, (1902.06021).
Hoseinzade, E. and Haratizadeh, S. (2019). CNNpred: CNN-based stock market prediction
using a diverse set of variables. Expert Systems with Applications, 129:273–285.
Hou, K., Xue, C., and Zhang, L. (2015). Digesting anomalies: An investment approach.
Review of Financial Studies, 28(3):650–705.
Hou, K., Xue, C., and Zhang, L. (2020). Replicating anomalies. Review of Financial Studies,
33(5):2019–2133.
Hsu, P.-H., Han, Q., Wu, W., and Cao, Z. (2018). Asset allocation strategies, data snooping,
and the 1/N rule. Journal of Banking & Finance, 97:257–269.
Huang, W., Nakamori, Y., and Wang, S.-Y. (2005). Forecasting stock market movement
direction with support vector machine. Computers & Operations Research, 32(10):2513–
2522.
Huck, N. (2019). Large data sets and machine learning: Applications to statistical arbitrage.
European Journal of Operational Research, 278(1):330–342.
Hünermund, P. and Bareinboim, E. (2019). Causal inference and data-fusion in economet-
rics. arXiv Preprint, (1912.09104).
Ilmanen, A. (2011). Expected returns: An investor’s guide to harvesting market rewards.
John Wiley & Sons.
Ilmanen, A., Israel, R., Moskowitz, T. J., Thapar, A. K., and Wang, F. (2019). Factor
premia and factor timing: A century of evidence. SSRN Working Paper, 3400998.
Jacobs, H. and Müller, S. (2020). Anomalies across the globe: Once public, no longer
existent? Journal of Financial Economics, 135(1):213–230.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., Hinton, G. E., et al. (1991). Adaptive mixtures
of local experts. Neural Computation, 3(1):79–87.
Jagannathan, R. and Ma, T. (2003). Risk reduction in large portfolios: Why imposing the
wrong constraints helps. Journal of Finance, 58(4):1651–1683.
Jagannathan, R. and Wang, Z. (1998). An asymptotic theory for estimating beta-pricing
models using cross-sectional regression. Journal of Finance, 53(4):1285–1309.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to statistical
learning, volume 112. Springer.
Jegadeesh, N., Noh, J., Pukthuanthong, K., Roll, R., and Wang, J. L. (2019). Empirical
tests of asset pricing models with individual assets: Resolving the errors-in-variables bias
in risk premium estimation. Journal of Financial Economics, 133(2):273–298.
Jegadeesh, N. and Titman, S. (1993). Returns to buying winners and selling losers: Impli-
cations for stock market efficiency. Journal of Finance, 48(1):65–91.
Jensen, M. C. (1968). The performance of mutual funds in the period 1945–1964. Journal
of Finance, 23(2):389–416.
Jha, V. (2019). Implementing alternative data in an investment process. In Big Data and
Machine Learning in Quantitative Investment, pages 51–74. Wiley.
Jiang, W. (2020). Applications of deep learning in stock market prediction: recent progress.
arXiv Preprint, (2003.01859).
Jiang, Z., Xu, D., and Liang, J. (2017). A deep reinforcement learning framework for the
financial portfolio management problem. arXiv Preprint, (1706.10059).
Jin, D. (2019). The drivers and inhibitors of factor investing. SSRN Working Paper,
3492142.
Johnson, T. C. (2002). Rational momentum effects. Journal of Finance, 57(2):585–608.
Johnson, T. L. (2019). A fresh look at return predictability using a more efficient estimator.
Review of Asset Pricing Studies, 9(1):1–46.
Jordan, M. I. (1997). Serial order: A parallel distributed processing approach. In Advances
in Psychology, volume 121, pages 471–495.
Jorion, P. (1985). International portfolio diversification with estimation risk. Journal of
Business, pages 259–278.
Jurczenko, E. (2017). Factor Investing: From Traditional to Alternative Risk Premia. El-
sevier.
Kalisch, M., Mächler, M., Colombo, D., Maathuis, M. H., Bühlmann, P., et al. (2012).
Causal inference using graphical models with the R package pcalg. Journal of Statistical
Software, 47(11):1–26.
Kan, R. and Zhou, G. (2007). Optimal portfolio choice with parameter uncertainty. Journal
of Financial and Quantitative Analysis, 42(3):621–656.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017).
LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural
Information Processing Systems, pages 3146–3154.
Ke, Z. T., Kelly, B. T., and Xiu, D. (2019). Predicting returns with text data. SSRN
Working Paper, 3388293.
Kearns, M. and Nevmyvaka, Y. (2013). Machine learning for market microstructure and
high frequency trading. High Frequency Trading: New Realities for Traders, Markets, and
Regulators.
Kelly, B. T., Pruitt, S., and Su, Y. (2019). Characteristics are covariances: A unified model
of risk and return. Journal of Financial Economics, 134(3):501–524.
Kempf, A. and Osthoff, P. (2007). The effect of socially responsible investing on portfolio
performance. European Financial Management, 13(5):908–922.
Khedmati, M. and Azin, P. (2020). An online portfolio selection algorithm using clus-
tering approaches and considering transaction costs. Expert Systems with Applications,
159:113546.
Kim, K.-j. (2003). Financial time series forecasting using support vector machines. Neuro-
computing, 55(1-2):307–319.
Kim, S., Korajczyk, R. A., and Neuhierl, A. (2019). Arbitrage portfolios. SSRN Working
Paper, 3263001.
Kim, W. C., Kim, J. H., and Fabozzi, F. J. (2014). Deciphering robust portfolios. Journal
of Banking & Finance, 45:1–8.
Kimoto, T., Asakawa, K., Yoda, M., and Takeoka, M. (1990). Stock market prediction
system with modular neural networks. In 1990 IJCNN International Joint Conference on
Neural Networks, pages 1–6. IEEE.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv
Preprint, (1412.6980).
Kirby, C. (2020). Firm characteristics, stock market regimes, and the cross-section of ex-
pected returns. SSRN Working Paper, 3520131.
Koijen, R. S., Richmond, R. J., and Yogo, M. (2019). Which investors matter for global
equity valuations and expected returns? SSRN Working Paper, 3378340.
Koijen, R. S. and Yogo, M. (2019). A demand system approach to asset pricing. Journal
of Political Economy, 127(4):1475–1515.
Kolm, P. N. and Ritter, G. (2019a). Dynamic replication and hedging: A reinforcement
learning approach. Journal of Financial Data Science, 1(1):159–171.
Kolm, P. N. and Ritter, G. (2019b). Modern perspectives on reinforcement learning in
finance. Journal of Machine Learning in Finance, 1(1).
Kong, W., Liaw, C., Mehta, A., and Sivakumar, D. (2019). A new dog learns old tricks: RL
finds classic optimization algorithms. Proceedings of the ICLR Conference, pages 1–25.
Koshiyama, A., Flennerhag, S., Blumberg, S. B., Firoozye, N., and Treleaven, P. (2020).
QuantNet: Transferring learning across systematic trading strategies. arXiv Preprint,
(2004.03445).
Kozak, S., Nagel, S., and Santosh, S. (2018). Interpreting factor models. Journal of Finance,
73(3):1183–1223.
Kozak, S., Nagel, S., and Santosh, S. (2019). Shrinking the cross-section. Journal of Finan-
cial Economics, 135:271–292.
Krauss, C., Do, X. A., and Huck, N. (2017). Deep neural networks, gradient-boosted trees,
random forests: Statistical arbitrage on the S&P 500. European Journal of Operational
Research, 259(2):689–702.
Kremer, P. J., Lee, S., Bogdan, M., and Paterlini, S. (2019). Sparse portfolio selection via
the sorted ℓ1-norm. Journal of Banking & Finance, page 105687.
Krkoska, E. and Schenk-Hoppé, K. R. (2019). Herding in smart-beta investment products.
Journal of Risk and Financial Management, 12(1):47.
Kruschke, J. (2014). Doing Bayesian Data Analysis: A tutorial with R, JAGS, and Stan
(2nd Ed.). Academic Press.
Kuhn, M. and Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach
for Predictive Models. CRC Press.
Kurtz, L. (2020). Three pillars of modern responsible investment. Journal of Investing,
29(2):21–32.
Lai, T. L., Xing, H., Chen, Z., et al. (2011). Mean–variance portfolio optimization when
means and covariances are unknown. Annals of Applied Statistics, 5(2A):798–823.
Lakonishok, J., Shleifer, A., and Vishny, R. W. (1994). Contrarian investment, extrapola-
tion, and risk. Journal of Finance, 49(5):1541–1578.
Leary, M. T. and Michaely, R. (2011). Determinants of dividend smoothing: Empirical
evidence. Review of Financial Studies, 24(10):3197–3249.
Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covari-
ance matrices. Journal of Multivariate Analysis, 88(2):365–411.
Ledoit, O. and Wolf, M. (2008). Robust performance hypothesis testing with the Sharpe
ratio. Journal of Empirical Finance, 15(5):850–859.
Ledoit, O. and Wolf, M. (2017). Nonlinear shrinkage of the covariance matrix for portfolio
selection: Markowitz meets Goldilocks. Review of Financial Studies, 30(12):4349–4388.
Ledoit, O., Wolf, M., and Zhao, Z. (2020). Efficient sorting: A more powerful test for
cross-sectional anomalies. Journal of Financial Econometrics, 17(4):645–686.
Lee, S. I. (2020). Hyperparameter optimization for forecasting stock returns. arXiv Preprint,
(2001.10278).
Legendre, A. M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes.
F. Didot.
Lempérière, Y., Deremble, C., Seager, P., Potters, M., and Bouchaud, J.-P. (2014). Two
centuries of trend following. arXiv Preprint, (1404.3274).
Lettau, M. and Pelger, M. (2020a). Estimating latent asset-pricing factors. Journal of
Econometrics, 218(1):1–31.
Lettau, M. and Pelger, M. (2020b). Factors that fit the time series and cross-section of
stock returns. Review of Financial Studies, 33(5):2274–2325.
Leung, M. T., Daouk, H., and Chen, A.-S. (2001). Using investment portfolio return to
combine forecasts: A multiobjective approach. European Journal of Operational Research,
134(1):84–102.
Levy, G. and Razin, R. (2021). A maximum likelihood approach to combining forecasts.
Theoretical Economics, 16(1):49–71.
Li, B. and Hoi, S. C. (2014). Online portfolio selection: A survey. ACM Computing Surveys
(CSUR), 46(3):35.
Li, B. and Hoi, S. C. H. (2018). Online portfolio selection: principles and algorithms. CRC
Press.
Li, J., Liao, Z., and Quaedvlieg, R. (2020). Conditional superior predictive ability. SSRN
Working Paper, 3536461.
Lim, B. and Zohren, S. (2020). Time series forecasting with deep learning: A survey. arXiv
Preprint, (2004.13408).
Linnainmaa, J. T. and Roberts, M. R. (2018). The history of the cross-section of stock
returns. Review of Financial Studies, 31(7):2606–2649.
Lintner, J. (1965). The valuation of risk assets and the selection of risky investments in
stock portfolios and capital budgets. Review of Economics and Statistics, 47(1):13–37.
Lioui, A. (2018). ESG factor investing: Myth or reality? SSRN Working Paper, 3272090.
Lioui, A. and Tarelli, A. (2020). Factor investing for the long run. SSRN Working Paper,
3531946.
Little, R. J. and Rubin, D. B. (2014). Statistical analysis with missing data, volume 333.
John Wiley & Sons.
Liu, L., Pan, Z., and Wang, Y. (2021). What can we learn from the return predictability
over the business cycle? Journal of Forecasting, 40(1):108–131.
Lo, A. W. and MacKinlay, A. C. (1990). When are contrarian profits due to stock market
overreaction? Review of Financial Studies, 3(2):175–205.
Loreggia, A., Malitsky, Y., Samulowitz, H., and Saraswat, V. (2016). Deep learning for
algorithm portfolios. In Proceedings of the Thirtieth AAAI Conference on Artificial In-
telligence, pages 1280–1286. AAAI Press.
Loughran, T. and McDonald, B. (2016). Textual analysis in accounting and finance: A
survey. Journal of Accounting Research, 54(4):1187–1230.
Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions.
In Advances in Neural Information Processing Systems, pages 4765–4774.
Luo, J., Subrahmanyam, A., and Titman, S. (2021). Momentum and reversals when over-
confident investors underestimate their competition. The Review of Financial Studies,
34(1):351–393.
Ma, S., Lan, W., Su, L., and Tsai, C.-L. (2020). Testing alphas in conditional time-varying
factor models with high dimensional assets. Journal of Business & Economic Statistics,
38(1):214–227.
Maathuis, M., Drton, M., Lauritzen, S., and Wainwright, M. (2018). Handbook of Graphical
Models. CRC Press.
Maclaurin, D., Duvenaud, D., and Adams, R. (2015). Gradient-based hyperparameter opti-
mization through reversible learning. In International Conference on Machine Learning,
pages 2113–2122.
Maillard, S., Roncalli, T., and Teiletche, J. (2010). The properties of equally weighted risk
contribution portfolios. Journal of Portfolio Management, 36(4):60–70.
Maillet, B., Tokpavi, S., and Vaucher, B. (2015). Global minimum variance portfolio opti-
misation under some model risk: A robust regression-based approach. European Journal
of Operational Research, 244(1):289–299.
Markowitz, H. (1952). Portfolio selection. Journal of Finance, 7(1):77–91.
Marti, G. (2019). CorrGAN: Sampling realistic financial correlation matrices using generative
adversarial networks. arXiv Preprint, (1910.09504).
Martin, I. and Nagel, S. (2019). Market efficiency in the age of big data. SSRN Working
Paper, 3511296.
Mascio, D. A., Fabozzi, F. J., and Zumwalt, J. K. (2021). Market timing using combined
forecasts and machine learning. Journal of Forecasting, 40(1):1–16.
Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. R. (2000). Boosting algorithms as
gradient descent. In Advances in Neural Information Processing Systems, pages 512–518.
Matías, J. M. and Reboredo, J. C. (2012). Forecasting performance of nonlinear models for
intraday stock returns. Journal of Forecasting, 31(2):172–188.
McLean, R. D. and Pontiff, J. (2016). Does academic research destroy stock return pre-
dictability? Journal of Finance, 71(1):5–32.
Meng, T. L. and Khushi, M. (2019). Reinforcement learning in financial markets. Data,
4(3):110.
Metropolis, N. and Ulam, S. (1949). The Monte Carlo method. Journal of the American
Statistical Association, 44(247):335–341.
Meyer, C. D. (2000). Matrix analysis and applied linear algebra, volume 71. SIAM.
Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of machine learning.
MIT Press.
Molnar, C. (2019). Interpretable Machine Learning: A Guide for Making Black Box Models
Explainable. LeanPub / Lulu.
Moody, J. and Wu, L. (1997). Optimization of trading systems and portfolios. In Proceedings
of the IEEE/IAFE 1997 Computational Intelligence for Financial Engineering (CIFEr),
pages 300–307. IEEE.
Moody, J., Wu, L., Liao, Y., and Saffell, M. (1998). Performance functions and reinforcement
learning for trading systems and portfolios. Journal of Forecasting, 17(5-6):441–470.
Moritz, B. and Zimmermann, T. (2016). Tree-based conditional portfolio sorts: The relation
between past and future stock returns. SSRN Working Paper, 2740751.
Mosavi, A., Ghamisi, P., Faghan, Y., Duan, P., and Shamshirband, S. (2020). Comprehen-
sive review of deep reinforcement learning methods and applications in economics. arXiv
Preprint, (2004.01509).
Moskowitz, T. J. and Grinblatt, M. (1999). Do industries explain momentum? Journal of
Finance, 54(4):1249–1290.
Moskowitz, T. J., Ooi, Y. H., and Pedersen, L. H. (2012). Time series momentum. Journal
of Financial Economics, 104(2):228–250.
Mossin, J. (1966). Equilibrium in a capital asset market. Econometrica: Journal of the
Econometric Society, 34(4):768–783.
Nagy, Z., Kassam, A., and Lee, L.-E. (2016). Can ESG add alpha? An analysis of ESG tilt
and momentum strategies. The Journal of Investing, 25(2):113–124.
Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the
rate of convergence O(1/k^2). In Doklady AN USSR, volume 269, pages 543–547.
Neuneier, R. (1996). Optimal asset allocation using adaptive dynamic programming. In
Advances in Neural Information Processing Systems, pages 952–958.
Perrin, S. and Roncalli, T. (2019). Machine learning optimization algorithms & portfolio
allocation. SSRN Working Paper, 3425827.
Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of causal inference: foundations
and learning algorithms. MIT Press.
Petersen, M. A. (2009). Estimating standard errors in finance panel data sets: Comparing
approaches. Review of Financial Studies, 22(1):435–480.
Pflug, G. C., Pichler, A., and Wozabal, D. (2012). The 1/n investment strategy is optimal
under high model ambiguity. Journal of Banking & Finance, 36(2):410–417.
Plyakha, Y., Uppal, R., and Vilkov, G. (2016). Equal or value weighting? Implications for
asset-pricing tests. SSRN Working Paper, 1787045.
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 4(5):1–17.
Popov, S., Morozov, S., and Babenko, A. (2019). Neural oblivious decision ensembles for
deep learning on tabular data. arXiv Preprint, (1909.06312).
Powell, W. B. and Ma, J. (2011). A review of stochastic algorithms with continuous
value function approximation and some new approximate policy iteration algorithms for
multidimensional continuous applications. Journal of Control Theory and Applications,
9(3):336–352.
Probst, P., Bischl, B., and Boulesteix, A.-L. (2018). Tunability: Importance of hyperparam-
eters of machine learning algorithms. arXiv Preprint, (1802.09596).
Pukthuanthong, K., Roll, R., and Subrahmanyam, A. (2018). A protocol for factor identi-
fication. Review of Financial Studies, 32(4):1573–1607.
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (2009). Dataset
shift in machine learning. MIT Press.
Rapach, D. and Zhou, G. (2019). Time-series and cross-sectional stock return forecasting:
New machine learning methods. SSRN Working Paper, 3428095.
Rapach, D. E., Strauss, J. K., and Zhou, G. (2013). International stock return predictability:
What is the role of the United States? Journal of Finance, 68(4):1633–1662.
Rashmi, K. V. and Gilad-Bachrach, R. (2015). DART: Dropouts meet multiple additive
regression trees. In AISTATS, pages 489–497.
Ravisankar, P., Ravi, V., Rao, G. R., and Bose, I. (2011). Detection of financial statement
fraud and feature selection using data mining techniques. Decision Support Systems,
50(2):491–500.
Reboredo, J. C., Matías, J. M., and Garcia-Rubio, R. (2012). Nonlinearity in forecasting of
high-frequency stock returns. Computational Economics, 40(3):245–264.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). Why should I trust you?: Explaining
the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.
Ridgeway, G., Madigan, D., and Richardson, T. S. (1999). Boosting methodology for regres-
sion problems. In Seventh International Workshop on Artificial Intelligence and Statistics.
PMLR.
Ripley, B. D. (2007). Pattern recognition and neural networks. Cambridge University Press.
Roberts, G. O. and Smith, A. F. (1994). Simple conditions for the convergence of the Gibbs
sampler and Metropolis-Hastings algorithms. Stochastic Processes and their Applications,
49(2):207–216.
Romano, J. P. and Wolf, M. (2005). Stepwise multiple testing as formalized data snooping.
Econometrica, 73(4):1237–1282.
Romano, J. P. and Wolf, M. (2013). Testing for monotonicity in expected asset returns.
Journal of Empirical Finance, 23:93–116.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and
organization in the brain. Psychological Review, 65(6):386.
Ross, S. A. (1976). The arbitrage theory of capital asset pricing. Journal of Economic
Theory, 13(3):341–360.
Rousseeuw, P. J. and Leroy, A. M. (2005). Robust regression and outlier detection, volume
589. Wiley.
Ruf, J. and Wang, W. (2019). Neural networks for option pricing and hedging: a literature
review. arXiv Preprint, (1911.05620).
Santi, C. and Zwinkels, R. C. (2018). Exploring style herding by mutual funds. SSRN
Working Paper, 2986059.
Sato, Y. (2019). Model-free reinforcement learning for financial portfolios: A brief survey.
arXiv Preprint, (1904.04973).
Schafer, J. L. (1999). Multiple imputation: a primer. Statistical Methods in Medical Re-
search, 8(1):3–15.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2):197–227.
Schapire, R. E. (2003). The boosting approach to machine learning: An overview. In
Nonlinear estimation and classification, pages 149–171. Springer.
Schapire, R. E. and Freund, Y. (2012). Boosting: Foundations and algorithms. MIT Press.
Schnaubelt, M. (2019). A comparison of machine learning model validation schemes for
non-stationary time series data. Technical report, FAU Discussion Papers in Economics.
Schueth, S. (2003). Socially responsible investing in the United States. Journal of Business
Ethics, 43(3):189–194.
Scornet, E., Biau, G., Vert, J.-P., et al. (2015). Consistency of random forests. Annals of
Statistics, 43(4):1716–1741.
Seni, G. and Elder, J. F. (2010). Ensemble methods in data mining: improving accuracy
through combining predictions. Synthesis Lectures on Data Mining and Knowledge Dis-
covery, 2(1):1–126.
Settles, B. (2009). Active learning literature survey. Technical report, University of
Wisconsin-Madison Department of Computer Sciences.
Settles, B. (2012). Active learning. Synthesis Lectures on Artificial Intelligence and Machine
Learning, 6(1):1–114.
Sezer, O. B., Gudelek, M. U., and Ozbayoglu, A. M. (2019). Financial time series fore-
casting with deep learning: A systematic literature review: 2005-2019. arXiv Preprint,
(1911.13288).
Shah, A. D., Bartlett, J. W., Carpenter, J., Nicholas, O., and Hemingway, H. (2014). Com-
parison of random forest and parametric imputation models for imputing missing data
using MICE: a CALIBER study. American Journal of Epidemiology, 179(6):764–774.
Shanken, J. (1992). On the estimation of beta-pricing models. Review of Financial Studies,
5(1):1–33.
Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games,
2(28):307–317.
Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under conditions
of risk. Journal of Finance, 19(3):425–442.
Sharpe, W. F. (1966). Mutual fund performance. Journal of Business, 39(1):119–138.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrit-
twieser, J., Antonoglou, I., Panneershelvam, V., and Lanctot, M. (2016). Mastering the
game of go with deep neural networks and tree search. Nature, 529:484–489.
Simonian, J., Wu, C., Itano, D., and Narayanam, V. (2019). A machine learning approach
to risk factors: A case study using the Fama-French-Carhart model. Journal of Financial
Data Science, 1(1):32–44.
Simonsohn, U., Nelson, L. D., and Simmons, J. P. (2014). P-curve: a key to the file-drawer.
Journal of Experimental Psychology: General, 143(2):534.
Sirignano, J. and Cont, R. (2019). Universal features of price formation in financial markets:
perspectives from deep learning. Quantitative Finance, 19(9):1449–1459.
Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part
1–learning rate, batch size, momentum, and weight decay. arXiv Preprint, (1803.09820).
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of
machine learning algorithms. In Advances in Neural Information Processing Systems,
pages 2951–2959.
Snow, D. (2020). Machine learning in asset management—part 2: Portfolio construc-
tion—weight optimization. The Journal of Financial Data Science, 2(2):17–24.
Soleymani, F. and Paquet, E. (2020). Financial portfolio optimization with online deep
reinforcement learning and restricted stacked autoencoder—DeepBreath. Expert Systems
with Applications, 156:113456.
Sparapani, R., Spanbauer, C., and McCulloch, R. (2019). The BART R package. Technical
report, Comprehensive R Archive Network.
Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. (2000). Causation, prediction,
and search. MIT Press.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine
Learning Research, 15(1):1929–1958.
Stambaugh, R. F. (1999). Predictive regressions. Journal of Financial Economics,
54(3):375–421.
Staniak, M. and Biecek, P. (2018). Explanations of model predictions with live and break-
down packages. arXiv Preprint, (1804.01955).
Wallbridge, J. (2020). Transformers for limit order books. arXiv Preprint, (2003.00130).
Wang, G., Hao, J., Ma, J., and Jiang, H. (2011). A comparative assessment of ensemble
learning for credit scoring. Expert Systems with Applications, 38(1):223–230.
Wang, H. and Zhou, X. Y. (2019). Continuous-time mean-variance portfolio selection: A
reinforcement learning framework. SSRN Working Paper, 3382932.
Wang, J.-J., Wang, J.-Z., Zhang, Z.-G., and Guo, S.-P. (2012). Stock index forecasting
based on a hybrid model. Omega, 40(6):758–766.
Wang, W., Li, W., Zhang, N., and Liu, K. (2020). Portfolio formation with preselection
using deep learning from long-term financial data. Expert Systems with Applications,
143:113042.
Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279–292.
Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. Journal
of Big Data, 3(1):9.
White, H. (1988). Economic prediction using neural networks: The case of IBM daily stock
returns. In ICNN, volume 2, pages 451–458.
White, H. (2000). A reality check for data snooping. Econometrica, 68(5):1097–1126.
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In IRE WESCON Con-
vention Record, volume 4, pages 96–104.
Wiese, M., Knobloch, R., Korn, R., and Kretschmer, P. (2020). Quant GANs: deep generation
of financial time series. Quantitative Finance, 20(9):1419–1440.
Wolpert, D. H. (1992a). On the connection between in-sample testing and generalization
error. Complex Systems, 6(1):47.
Wolpert, D. H. (1992b). Stacked generalization. Neural Networks, 5(2):241–259.
Wolpert, D. H. and Macready, W. G. (1997). No free lunch theorems for optimization.
IEEE Transactions on Evolutionary Computation, 1(1):67–82.
Wong, S. Y., Chan, J., Azizi, L., and Xu, R. Y. (2020). Time-varying neural network for
stock return prediction. arXiv Preprint, (2003.02515).
Xiong, Z., Liu, X.-Y., Zhong, S., Yang, H., and Walid, A. (2018). Practical deep reinforce-
ment learning approach for stock trading. arXiv Preprint, (1811.07522).
Xu, K.-L. (2020). Testing for multiple-horizon predictability: Direct regression based versus
implication based. The Review of Financial Studies, 33(9):4403–4443.
Yang, S. Y., Yu, Y., and Almahdi, S. (2018). An investor sentiment reward-based trading
system using Gaussian inverse reinforcement learning algorithm. Expert Systems with
Applications, 114:388–401.
Yu, P., Lee, J. S., Kulyatin, I., Shi, Z., and Dasgupta, S. (2019). Model-based deep rein-
forcement learning for dynamic portfolio optimization. arXiv Preprint, (1901.08740).
Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv Preprint,
(1212.5701).
Zhang, C. and Ma, Y. (2012). Ensemble machine learning: methods and applications.
Springer.
Zhang, Y. and Wu, L. (2009). Stock market prediction of S&P 500 via combination
of improved BCO approach and BP neural network. Expert Systems with Applications,
36(5):8849–8854.
Zhang, Z., Zohren, S., and Roberts, S. (2020). Deep reinforcement learning for trading.
Journal of Financial Data Science, 2(2):25–40.
Zhao, Q. and Hastie, T. (2021). Causal interpretations of black-box models. Journal of
Business & Economic Statistics, 39(1):272–281.
Zhou, Z.-H. (2012). Ensemble methods: foundations and algorithms. Chapman & Hall /
CRC.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–
320.
Zuckerman, G. (2019). The Man Who Solved the Market: How Jim Simons Launched the
Quant Revolution. Penguin Random House.