
ROBUST MACHINE LEARNING PIPELINES FOR TRADING

MARKET-NEUTRAL STOCK PORTFOLIOS ∗


THOMAS WONG† AND MAURICIO BARAHONA‡

Abstract. The application of deep learning algorithms to financial data is difficult due to heavy
non-stationarities which can lead to over-fitted models that underperform under regime changes.
Using the Numerai tournament data set as a motivating example, we propose a machine learning
pipeline for trading market-neutral stock portfolios based on tabular data which is robust under
changes in market conditions. We evaluate various machine-learning models, including Gradient
Boosting Decision Trees (GBDTs) and Neural Networks with and without simple feature engineer-
ing, as the building blocks for the pipeline. We find that GBDT models with dropout display high
performance, robustness and generalisability with relatively low complexity and reduced computa-
tional cost. We then show that online learning techniques can be used in post-prediction processing
to enhance the results. In particular, dynamic feature neutralisation, an efficient procedure that
requires no retraining of models and can be applied post-prediction to any machine learning model,
improves robustness by reducing drawdown in volatile market conditions. Furthermore, we demon-
strate that the creation of model ensembles through dynamic model selection based on recent model
performance improves upon the baseline, as reflected in higher Sharpe and Calmar ra-
tios. We also evaluate the robustness of our pipeline across different data splits and random seeds
with good reproducibility of results.

Key words. Robust Machine Learning, Online Learning, Gradient Boosting Decision Trees,
Deep Learning, Stock Trading, Tabular Data

1. Introduction. As investors explore new ways to generate profit, machine


learning (ML) models are increasingly used as part of trading strategies, e.g., to pre-
dict the future return of stocks or stock portfolios. In particular, deep learning models
for time-series data, such as Recurrent Neural Networks (RNNs) and Convolutional
Neural Networks (CNNs), have been applied to the prediction of future stock re-
turns [1–3]. However, a major challenge for such methods is the highly stochastic,
non-stationary and non-ergodic nature of financial data, which violates the assump-
tions of many algorithms. Furthermore, deep learning models are over-parameterised,
with numbers of parameters orders of magnitude larger than typical sizes of time series
data. Therefore, deep models can be easily over-fitted to specific patterns in historical
market data not present in future market data, and the over-fitting worsens with the
more complicated neural network architectures, such as Long Short Term Memory
(LSTM) or Transformer networks. In addition, the continuous influx of data, coupled
with possible regime changes, requires costly updating and retraining of such models.
Therefore, such methods can lack reproducibility and robustness for the prediction of
future market data.
As pointed out in recent reviews [4, 5], replication of ML studies is often difficult
due to several issues, including data leakage [5], program bugs [6], data and code
usability [7], and model representation and evaluation [4]. These problems are
currently hindering the usage of ML in high-risk decision processes, such as healthcare
and finance. For trading applications in particular, these issues can have critical effects
on the validity of results. Data leakage, in the form of look-ahead bias or overlap

∗ Funding: This work is supported by the Wellcome Trust under Grant 108908/B/15/Z and by

the EPSRC under grant EP/N014529/1.


† Department of Mathematics, Imperial College London, London SW7 2AZ, U.K. (ming-[email protected]).
‡ Department of Mathematics, Imperial College London, London SW7 2AZ, U.K. ([email protected]).

of training/test sets [8], can inflate in-sample performance while leading to poor performance
when deployed live. Furthermore, black-box ML models, such as neural networks,
can lack robustness as they are highly sensitive to small changes in parameters and
data, thus resulting in volatile predictions. The non-stationary data and the presence
of regime changes also mean that ML models need to be re-trained with the latest
financial data, a task that is not only computationally costly but also introduces
further uncertainty to the trading models. Yet most studies do not consider model
performance when trained on different segments of historical market data [1–3, 9, 10].
Although reinforcement learning (RL) in online learning settings allows ML models
to adapt to changing environments, deep reinforcement learning models are complex
and require large computational resources [11]. Indeed, applying RL to stock trading
is difficult since the complexity of the action space increases exponentially with the
number of stocks in the portfolio.
The above issues suggest the need to further develop robust ML pipelines for
trading applications possibly based on simpler models that can still operate on non-
stationary, highly stochastic data under regime changes. Here we consider such a
pipeline based on tabular data, which allows the use of traditional ML models, such
as Gradient Boosting Decision Trees (GBDT) and other ensemble methods, to predict
trading stocks and stock indices [12, 13]. This approach also allows the integration
of additional sources of data, such as sentiment analysis of news articles to improve
the prediction accuracy of the direction of stock returns [14]. In particular, we find
that Gradient Boosting models, which are known to be robust to data perturbations,
outperform neural network models. Finally, we show that improved robustness of ML
models and adaptation to regime changes can be attained without the use of deep
reinforcement learning by employing: (i) dynamic feature neutralisation, a simple
approach that reduces the linear correlation to a subset of features evolving in time,
and (ii) dynamic model selection of optimal models from an ensemble based on recent
performance. These approaches robustly improve trading performances by reducing
volatility and drawdown during adversarial market regimes.
To exemplify the above issues, we consider a benchmark financial data platform
that is continuously updated and curated under the Numerai tournament of stock
portfolio prediction [15]. Numerai is a hedge fund that organises a data science com-
petition (as of Oct 2022) and provides free, open-source, high quality standardised
financial data to all participants. As discussed below in more detail, the data set is
given in the form of pre-processed temporal tabular data and the task is the predic-
tion of the relative performances of stocks within an evolving trading universe without
access to the identity of individual stocks. Unlike other financial research papers that
use proprietary data sets which can be difficult to validate [9, 10], this open financial
data competition allows researchers to replicate findings transparently and allows us
to focus on establishing ML end-to-end pipelines to achieve consistent profits trad-
ing a market-neutral portfolio under changing market regimes. Our pipeline, shown
in Figure 1, is built upon simple, yet robust methodologies that avoid some of the
problems of over-fitting and high computational cost inherent to deep methods. The
robustness of the pipeline is enhanced since each step is implemented independently,
avoiding data leakage, which is common in other methods such as neural networks,
where the pre-processing and the actual model often share data. Key ingredients
are the post-prediction processing and feature engineering steps, which allow easy
adaptation of models towards regime changes without expensive retraining.
The paper is organised as follows. Section 2 introduces the Numerai datasets
used in this paper. Section 3 describes and discusses the different computational

Fig. 1: Schematic of the Machine Learning pipeline. Starting with the Numerai
data set, we consider feature engineering methods to augment the dataset and train an
ML model (several are evaluated, including neural networks, but we settle for gradient
boosting trees) to obtain the raw predictions. These then go through post-prediction
processing (e.g., dynamic feature neutralisation) to provide normalised predictions,
which are then combined through model ensembling and dynamic model selection
methods to output the predictions that are submitted to the Numerai tournament.

methods, including online cross-validation, feature engineering and the different ML


models considered and evaluated for the pipeline. Section 4 presents the results from
our ML pipeline, including the impact of different design choices on the robustness of
trading performance. Performances of ML models under different market regimes are
discussed in Section 5. In Section 6, we introduce adaptations to our ML models
based on online learning approaches, which can work well under regime changes,
noting that these adaptations are generic and not limited to specific families of ML
models. Lastly, in Section 7 we discuss the results of the method, open directions and alternatives,
and provide a study of the robustness of our ML pipeline in Section 9.4 of the Supplementary Information.

2. Numerai dataset and prediction task. Financial data are often available
in the form of time series. These time series can be treated directly using classic meth-
ods such as ARIMA models [16] and more recently through deep learning methods
such as Temporal Fusion Transformers [17]. However, such methods are easily over-
fitted and lead to expensive retraining for financial data, which are inherently affected
by regime changes and high stochasticity. Alternatively, one can use various feature
engineering methods to transform these time series into tabular form through a pro-
cess sometimes called ‘de-trending’ in the financial industry, where the characteristics
of a financial asset at a particular time point, including features from its history, are
represented by a single dimensional data row (i.e., a vector). In this representation,
the time dimension is not considered explicitly, as the state of the system is captured
through transformed features at each time point and the continuity of the temporal
dimension is not used. For example, we can summarise the time series of the return
of a stock with the mean and standard deviation over different look-back periods.
Grouping these data rows for different financial assets into a table at a given time
point we obtain a tabular dataset. If the features are informative, this representation
can be used for prediction tasks at each time point, and allow us to employ robust
and widely tested ML algorithms that are applicable to tabular data. The Numerai
competition is based on a curated tabular data set with high-quality features made
available to the research community.
Description of the dataset:. The Numerai dataset is a temporal tabular dataset.
A temporal tabular dataset is a collection of matrices {Xi }1≤i≤T collected over time
eras 1 to T . Each matrix Xi represents data available at era i with shape Ni × M ,

where Ni is the number of data samples (number of stocks here) in era i and M
is the number of features describing the samples. Note that the features are fixed
throughout the eras, in the sense that the same computational formula is used to
compute the features in each week. On the other hand, the number of data samples
(stocks) Ni does not have to be constant across time.
In the Numerai dataset, the matrices Xi contain M obfuscated global stock mar-
ket features (computed by Numerai) for Ni stocks, which are updated weekly (i.e., the
eras are in our case weeks). It is important to remark that the dataset is obfuscated,
i.e., it is not possible for participants to know the identity of stocks or even which
stocks are present each week. Each data row is indexed by a hash index, known only
to Numerai, that maps the data rows to the stocks. As a result, it is not possible
for participants to concatenate different data rows to create a continuous history of
a stock. The matrix Xi thus provides a snapshot of the market at week i presented
as an unknown set of stocks described by a common set of features, such that the
features are computed consistently across all stocks in the week and also computed
consistently across different weeks.
The Numerai dataset starts on 2003-01-03 (Era 1). The tabular set has 1191
features, which are already normalised into 5 equal-sized integer bins, from 0 to 4.
There are 28 target labels which are derived from stock returns using 14 proprietary
normalisation methods (nomi, jerome, janet, ben, alan, paul, george, william, arthur,
thomas, ralph, tyler, victor, waldo) over 2 forward-looking periods (20 trading days,
60 trading days). The main target label used to evaluate performance is target-nomi-v4-
20, i.e., the forward 20-trading-day return obtained with the nomi normalisation method.
Other targets are named similarly. The target labels are all scaled between 0 and 1,
where a smaller value represents a lower forward return, and are also grouped into
bins. The number of bins can differ between normalisation methods: 5 to 7 bins
are created for each target, with bin sizes following a Gaussian-like distribution,
so that most stocks fall within the central bin of 0.5 while only a small number of
stocks are in the tail bins of 0 and 1. We transform the features and labels so that
both become zero-mean. (For features, we subtract 2 from the integer bins so that
the transformed bins are -2,-1,0,1,2. For the target labels, we subtract 0.5 so that the
new targets are in the range -0.5 to 0.5).
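As a minimal sketch (assuming the data is loaded into a pandas DataFrame, with hypothetical feature_cols and target_cols column lists), this re-centering amounts to:

```python
import pandas as pd

def recenter(df: pd.DataFrame, feature_cols, target_cols) -> pd.DataFrame:
    """Shift Numerai features and targets so that both become zero-mean.

    Features arrive as integer bins 0..4 and become -2..2;
    targets arrive in [0, 1] and become [-0.5, 0.5].
    """
    out = df.copy()
    out[feature_cols] = out[feature_cols] - 2
    out[target_cols] = out[target_cols] - 0.5
    return out
```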
Prediction task:. The tournament task is to predict the stock rankings each week,
ordered from lowest to highest expected return. The scoring is based on Spearman’s
rank correlation of the predicted rankings with the main target label (target-nomi-v4-
20). Hence there is a single overall score each week regardless of the number of stocks
to predict each week. Participants are not scored on the accuracy of the ranking of
each stock individually. Numerai uses the predicted rankings to construct a market-
neutral portfolio which is traded every week (as of Sep 2022), i.e., the hedge fund
buys and short-sells the same dollar amount of stocks. The relative return of stocks
is therefore more relevant than the absolute return, and the prediction task is a
ranking problem rather than a forecasting problem.
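A minimal sketch of the era-wise scoring follows. Numerai applies its own rank normalisation before computing the correlation, so this per-era Spearman correlation is only illustrative, and the column names era, prediction and target_nomi_v4_20 are assumptions:

```python
import pandas as pd

def era_scores(df: pd.DataFrame, pred_col: str = "prediction",
               target_col: str = "target_nomi_v4_20") -> pd.Series:
    """One Spearman rank correlation between predictions and the target per era."""
    return df.groupby("era").apply(
        lambda g: g[pred_col].corr(g[target_col], method="spearman")
    )
```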
3. Methods.
3.1. Robustness in Machine Learning pipelines. In this paper, we aim to
design an ML pipeline focusing on its robustness. Table 1 details issues related to
robustness and reproducibility, as listed in a recent review [5], and how they are
addressed in this paper. By preventing look-ahead bias and other data leakage issues,
our pipeline can be robustly applied to live trading setups.
In addition to avoiding data leakage, the following design choices are used to im-

Issues affecting robustness of ML algorithms, and how each issue is addressed here:
‘No test set’: A robust cross-validation scheme is used.
‘Pre-processing on training and test set’: Numerai features are already standardised; hence minimal pre-processing.
‘Feature selection on training and test set’: Feature Engineering is applied to each data row independently.
‘Duplicates in datasets’: A unique id for each data row reduces the chance of duplicates in the dataset.
‘Model uses features that are not legitimate’: Only data provided by Numerai is used to train ML models; no extra features from other resources, and no cherry-picking of features.
‘Temporal leakage’: We use Grouped Time-Series Cross-Validation with no overlap between training/validation/test (Fig. 2). Feature Engineering is applied to each data row independently, i.e., no data leakage between eras.
‘Non-independence between training and test’: Training and test samples are market data at different periods without overlap.
‘Sampling bias in test distribution’: The stocks trading each week are decided by Numerai based on operational and risk considerations.

Table 1: Data analysis design. Some common issues regarding data leakage in
machine learning research [4, 5] and how these issues are dealt with in this study.

prove the robustness and reliability of the results. Firstly, the impact of random seeds
is reduced by reporting results from average predictions over 10 different random seeds
for each machine learning method. Secondly, the metrics used for model evaluation
are the same as in the Numerai tournament to avoid researcher bias in discounting
unfavourable results. Finally, cross-validation is independent of the effects of random
seeds and other human selection, thus reducing the chance of overfitting models to a
particular data split.
For datasets that involve time, standard cross-validation schemes cannot be used
directly, as a random split of data eras could lead to the training set including data that
appears later in time than the validation and test sets, hence introducing look-ahead
bias. To avoid this problem, we use grouped time-series cross-validation, which splits
data eras according to their chronological order (Figure 2). Note that for financial
datasets, the target labels often involve future asset returns and are reported with a
lag. Therefore, we add a gap between the training and validation sets and similarly
between the validation and test sets.
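A minimal sketch of such a split, assuming integer era labels (the era boundaries in the usage comment correspond to the data split described in Section 4):

```python
import numpy as np

def era_split(eras, train_range, val_range, test_range):
    """Boolean masks for a grouped time-series split over era labels.

    Each *_range is an inclusive (first_era, last_era) tuple; rows sharing an
    era always fall on the same side of a split, and leaving unused eras
    between the ranges creates the gaps that prevent temporal leakage from
    lagged, forward-looking targets.
    """
    eras = np.asarray(eras)
    names = ("train", "val", "test")
    return {name: (eras >= lo) & (eras <= hi)
            for name, (lo, hi) in zip(names, (train_range, val_range, test_range))}

# e.g. masks = era_split(df["era"], (1, 500), (521, 620), (646, 1030))
```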

3.2. Feature Engineering methods. Feature engineering is a crucial step in


enhancing the power of tabular methods for the analysis of time series data. Therefore,
we evaluate different feature engineering methods that can be applied to temporal
tabular data sets with numerical features, as the Numerai data set only contains
normalised numerical features.
New features can be created by applying polynomial transformations such as
multiplication and addition to the original features. Here we create new features by
multiplying two features that can be thought of as modelling the joint distribution

[Training Data | gap | Validation Data | gap | Test Data, ordered along the time axis]

Fig. 2: Illustration of data split using grouped time-series cross-validation

of feature pairs. When the number of features is large, we draw a random subset of
feature pairs to create new features to alleviate the computational cost. Note that
the computation of these features can be done in parallel for data from each era.
A simple way of data augmentation is to add randomness to the feature matrix
with different dropout methods, which are used extensively to reduce over-fitting of
neural network models [18]. Here we apply dropout by multiplying the original data
with a Boolean mask so that some numerical features are set to zero. The dropout is
characterised by its sparsity level (how many features are set to zero) and its sparsity
structure (how to choose the features set to zero). Since our tabular dataset has no
local spatial structure, we use a random Boolean matrix with uniform probability.
This encourages the machine learning methods to learn multiple feature relationships
and reduces reliance on a small set of important features.
For our dataset, we first augment the feature matrix by creating additional fea-
tures obtained by multiplying feature pairs, and then apply dropout with a random
Boolean mask on the augmented feature matrix. A grid search is used to find optimal
hyper-parameters for the feature engineering methods, in particular the number of
feature products and the sparsity level of dropout.
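A minimal sketch of this feature engineering step for one era matrix (the number of feature pairs and the dropout rate below are hypothetical placeholders for the values found by the grid search):

```python
import numpy as np

rng = np.random.default_rng(0)

def engineer_features(X: np.ndarray, n_pairs: int = 200, dropout: float = 0.05) -> np.ndarray:
    """Augment a (stocks x features) era matrix with random pairwise feature
    products, then zero out a random fraction of entries (feature dropout)."""
    n_features = X.shape[1]
    # append element-wise products of randomly drawn feature pairs
    i = rng.integers(0, n_features, size=n_pairs)
    j = rng.integers(0, n_features, size=n_pairs)
    X_aug = np.hstack([X, X[:, i] * X[:, j]])
    # random Boolean mask with uniform probability sets some entries to zero
    mask = rng.random(X_aug.shape) >= dropout
    return X_aug * mask
```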

3.3. Machine Learning algorithms for tabular datasets. Numerous ma-


chine learning models have been proposed for tabular datasets, and different bench-
marking studies have shown conflicting views on their performance [18, 19]. The
biggest disagreement in the literature is whether gradient-boosting decision trees or
neural networks are superior in regression and classification tasks of tabular datasets.
Whereas one paper claims gradient boosting models (XGBoost) outperformed deep
learning models in 8 out of 11 datasets and none of the deep learning models consis-
tently outperform others [19], another paper suggests that well-tuned multi-layer per-
ceptron (MLP) models with regularisation can outperform different gradient boosting
models such as XGBoost and CatBoost [18]. Both these studies, however, share the
same view that neural networks with complicated designs, such as attention layers and
other transformer layers, tend to generalise poorly with a strong drop in performance
when applied to data sets beyond their original study. Importantly, the Numerai
data set is different from the data sets in the above benchmarking studies in that it
is growing instead of fixed. Hence the data distribution varies across time periods
due to market regime effects, and we do not have a homogeneous distribution across
cross-validation splits. With such a different problem setup, it is thus not possible to
use the above benchmarking studies to guide our choice of ML method.
In this study, we benchmark a wide range of machine learning models, including
different variants of gradient-boosting decision tree models and different neural net-
work models. The choice of ML models is based on the popularity of usage in data

science competitions and code quality, as one of our aims is the replicability of results.
We train all machine learning models with a single GPU, the standard setup for most
participants in data science competitions. Some brief details of the ML models used
are provided in the following.
Gradient Boosting Decision Trees. Boosting can be seen as a generalisation of
generalized additive models (GAM) where the additive components of smooth para-
metric functions can be replaced by any weak learners such as decision trees [20].
Historically, various boosting algorithms have been proposed for different loss func-
tions. For example, AdaBoost [21] was proposed for binary classification problems
with exponential loss, whereas Gradient Boosting was first proposed by Friedman in
2001 [22] for any smooth loss functions. Algorithm 9.1 in the SI outlines the iterative
update equations of gradient boosting.
Of the various implementations of gradient boosting decision (GBDT) trees in
Python, we use LightGBM [23] in this paper. CatBoost [24] is not used here as the
Numerai dataset has no categorical features. XGBoost [25] is not used due to slower
computation and more memory consumption. Algorithm 9.2 in the SI shows how the
gradient boosting algorithm is implemented with decision trees being the weak learners
in LightGBM. LightGBM implements GBDT models with several computational and
numerical improvements from XGBoost and other implementations. In addition to
traditional gradient boosting decision trees (LightGBM-gbdt), we consider two other
implementations of GBDT models:
• Dropouts meet Multiple Additive Regression Trees (LightGBM-dart) ignores a
portion of trees when computing the gradient for subsequent trees [26], thus
avoiding over-specialisation where the later learned trees can only affect a
few data instances. This reduces the sensitivity of models towards decisions
made by the first few trees.
• Gradient-based One-Side Sampling (LightGBM-goss) reduces the number of
data instances used to build each tree: it keeps data instances with large
absolute gradients and randomly samples a subset of data with small absolute
gradients. The approximation error of the gradient using LightGBM-goss
converges to the standard method when the number of data is large, and it
outperforms other data sampling (e.g., uniform sampling) in most cases.
For all LightGBM models, we use mean squared error (L2 loss) as the loss function
for the regression problems. The number of gradient boosting trees and the learning rate
are optimised by hyper-parameter searches. To prevent the over-fitting of trees, the
maximum depth and number of leaves in each tree and the minimal number of data
samples in the leaves are tuned for each model. L1 and L2 regularisation are also
applied. Data and feature sub-sampling are used to reduce similarities between trees:
before building each tree, a random part of data is selected without re-sampling and
a random subset of features is chosen to build the tree. For LightGBM-dart models,
both the probability to apply dropout during the tree-building process and the portion
of trees to be dropped out are tuned. Early stopping is applied using the validation
dataset for LightGBM-gbdt models to further prevent the over-fitting of models.
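For illustration, a LightGBM parameter set of the kind described above might look as follows; the numerical values are placeholders, not the tuned hyper-parameters used in this study:

```python
import lightgbm as lgb

params = {
    "objective": "regression_l2",   # mean squared error loss
    "boosting": "dart",             # "gbdt" or "goss" for the other variants
    "learning_rate": 0.01,
    "num_leaves": 64,               # limits tree complexity
    "max_depth": 8,
    "min_data_in_leaf": 1000,       # minimal number of samples per leaf
    "lambda_l1": 0.1,               # L1 regularisation
    "lambda_l2": 0.1,               # L2 regularisation
    "bagging_fraction": 0.8,        # data sub-sampling before each tree
    "bagging_freq": 1,
    "feature_fraction": 0.5,        # feature sub-sampling per tree
    "drop_rate": 0.1,               # dart: fraction of earlier trees dropped
    "skip_drop": 0.5,               # dart: probability of skipping the dropout step
}
# train_set = lgb.Dataset(X_train, label=y_train)
# model = lgb.train(params, train_set, num_boost_round=2000)
```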
Neural Networks. The most basic architecture of neural networks, multi-layer
perceptron (MLP), failed to outperform gradient boosting models in many benchmark
studies of tabular datasets [19].
Recently, more complex network architectures have been proposed for tabular
data sets, as surveyed in [27]. These new architectures can be classified into two
major groups:
• Hybrid models that combine neural networks with other traditional ML meth-

ods, e.g., decision trees. Neural Oblivious Decision Ensembles (NODE) [28] is
a generalisation of gradient boosting models into differentiable deep decision
trees allowing end-to-end training with gradient descent optimisers such as
PyTorch [29]. DeepGBM [30] combines two neural networks, CatNN to han-
dle sparse categorical features and GBDT2NN to distil tree structures from
a pre-trained GBDT model to handle numerical features. A major limitation
of these models is the large memory consumption, which makes them run out
of memory on the NVIDIA 3080ti GPU. Therefore we do not use them in our
benchmark analysis.
• Transformer-based models that use deep attention mechanisms to model com-
plex feature relationships. TabNet [31] uses sequential attention to perform
instance-wise feature selection at each decision step, enabling interpretability
and better learning. AutoInt [32] maps and models feature interactions in a
low-dimensional space with a multi-head self-attentive neural network with
residual connections. AutoInt runs out of memory on a single GPU and is
thus not used in our benchmark. TabNet also had similar memory issues;
hence we down-sampled the data by keeping every fifth week of data (i.e.,
20% of the original data) for the training/validation periods, so that Tab-
net could be trained on the single GPU used in this study. Our aim is to
compare performance under modest computational resources attainable by a
wide class of users.
In summary, our benchmark analysis includes two NN models: MLP and TabNet
implemented in PyTorch. We use Adam [33] as the gradient optimiser, with the
learning rate automatically determined by PyTorch. We use mean squared error (L2
loss) for the regression problems.

4. Evaluation of Machine Learning methods for the Numerai temporal


tabular data set. In this section, we study different ML methods applied to the
Numerai temporal tabular data set for the prediction of stock rankings aimed at
market-neutral stock portfolios.
Data Split. We use the latest version (v4) of the Numerai dataset. The training
period is fixed between 2003-01-03 (Era 1) to 2012-07-27 (Era 500), and the validation
period is fixed between 2012-12-21 (Era 521) and 2014-11-14 (Era 620). The test
period starts on 2015-05-15 (Era 646) and ends on 2022-09-23 (Era 1030). We apply a
1-year gap between training and validation periods to reduce the effect of recency bias
so that the performance of the validation period will better reflect future performance.
The gap between the validation period and test period is set to 26 weeks to allow for
sufficient time to deploy trained machine learning models.
Evaluation of performance. For each configuration of each ML method, we aver-
age over the predictions of different targets before scoring. The predictions are scored
in each era by calculating the correlation (Corr) between the rank-normalised pre-
dictions and the actual (binned) stock ranking. The mean and standard deviation
(volatility) of Corr are reported for both the validation and test periods. To measure
the downside risk of the model, we also compute the Maximum Drawdown, defined
as the largest drop suffered by an investor starting at any time during the valida-
tion/test period. As summary measures, we compute two standard ratios: (i) the
Sharpe ratio, defined as the ratio of the mean and standard deviation of Corr; and
(ii) the Calmar ratio, defined as the ratio of mean Corr over Maximum Drawdown.
Good performance is characterised by large values of both of these ratios.
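A minimal sketch of these summary statistics, assuming era_corr is the weekly Corr series described above and taking the maximum drawdown as the largest peak-to-trough drop of the cumulative (non-compounded) score:

```python
import pandas as pd

def summarise(era_corr: pd.Series) -> dict:
    """Mean, volatility, maximum drawdown, Sharpe and Calmar of a weekly Corr series."""
    mean, vol = era_corr.mean(), era_corr.std()
    cumulative = era_corr.cumsum()
    max_drawdown = (cumulative.cummax() - cumulative).max()
    return {"mean": mean, "volatility": vol, "max_drawdown": max_drawdown,
            "sharpe": mean / vol, "calmar": mean / max_drawdown}
```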

Model Training. We use Optuna [34] to perform the hyper-parameter search (see
section 9.3 in Supplementary Information) and select the hyper-parameters with the
highest Sharpe ratio for the main target (target-nomi-v4-20) in the validation period.
The optimised hyper-parameters for each ML method are so fixed, and we then train
10 models, starting the algorithms from 10 different random seeds. We report the
average prediction from these 10 models for evaluation.
Baseline Model. As a baseline, we consider a factor momentum model which is
created by linear combinations of signed features, where the sign of each feature is
determined by the sign of the 52-week moving average of correlations of that feature
with the target. This simple baseline linear model is then compared with the ML
models, which can capture non-linearity in the data.
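A minimal sketch of this baseline for a single era; corr_history is an assumed precomputed table holding, for each past era, the correlation of every feature with the target:

```python
import numpy as np
import pandas as pd

def factor_momentum_prediction(df: pd.DataFrame, feature_cols,
                               corr_history: pd.DataFrame) -> pd.Series:
    """Signed sum of features, with each feature's sign set by the sign of its
    52-week moving average correlation with the target."""
    signs = np.sign(corr_history.tail(52).mean())      # one sign per feature
    return (df[feature_cols] * signs[feature_cols]).sum(axis=1)
```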
Comparative results of the ML algorithms and Feature Engineering. Table 2 shows
the performance on the validation and test sets for the different algorithms. We
concentrate on methods that achieve the highest mean Corr, and Sharpe and Calmar
ratios.

(a) Performance over the validation period (2012-12-21 to 2014-11-14)

Method Mean Volatility Max Draw Sharpe Calmar


Factor Momentum (baseline) 0.0229 0.0170 0.0691 1.3495 0.3314
MLP with FE 0.0423 0.0208 0.0241 2.0338 1.7552
MLP without FE 0.0443 0.0201 0.0065 2.2058 6.8154
TabNet without FE 0.0362 0.0189 0.0199 1.9125 1.8191
LightGBM-gbdt with FE 0.0483 0.0229 0.0307 2.1144 1.5733
LightGBM-gbdt without FE 0.0500 0.0224 0.0235 2.2335 2.1277
LightGBM-dart with FE 0.0496 0.0223 0.0215 2.2274 2.3070
LightGBM-dart without FE 0.0475 0.0199 0.0079 2.3883 6.0127
LightGBM-goss with FE 0.0288 0.0219 0.0687 1.3136 0.4192
LightGBM-goss without FE 0.0302 0.0234 0.0877 1.2877 0.3444

(b) Performance over the test period (2015-05-15 to 2022-09-23)

Method Mean Volatility Max Draw Sharpe Calmar


Factor Momentum (baseline) 0.0080 0.0275 0.7877 0.2923 0.0102
MLP with FE 0.0237 0.0330 0.2912 0.7189 0.0814
MLP without FE 0.0258 0.0289 0.1668 0.8931 0.1547
TabNet without FE 0.0161 0.0296 0.5811 0.5431 0.0277
LightGBM-gbdt with FE 0.0253 0.0327 0.3064 0.7731 0.0826
LightGBM-gbdt without FE 0.0262 0.0321 0.2378 0.8140 0.1102
LightGBM-dart with FE 0.0265 0.0319 0.2151 0.8313 0.1232
LightGBM-dart without FE 0.0278 0.0284 0.1622 0.9791 0.1714
LightGBM-goss with FE 0.0169 0.0297 0.5539 0.5695 0.0305
LightGBM-goss without FE 0.0156 0.0318 0.7528 0.4896 0.0207

Table 2: Performance of different machine learning methods with and without feature
engineering on the Numerai dataset for (a) validation period and (b) test period.
The three top methods according to Sharpe ratio and Maximum Drawdown over
the validation period are shown in italics in (a). The top method according to the
Sharpe ratio and Maximum Drawdown over the test period is shown in boldface in
(b). For TabNet, the pipeline with feature engineering cannot be run due to memory
constraints.

Firstly, we see that almost all ML models performed substantially better than the
factor momentum model (baseline), in both validation and test periods. Whereas the
factor momentum model relies on linear relationships, the capability of ML models
to learn non-linear relationships, in addition to linear ones, adds to their robustness
and improved performance under different, often volatile, market regimes.
Secondly, we observe that Feature Engineering does not improve the performance
of ML models. Although, in principle, Feature Engineering allows GBDT-based meth-
ods to model feature interactions more easily, our results suggest that these interac-
tions are over-fitted during the training process. For neural network-based models,
feature engineering is not strictly necessary, as dropout is already embedded in net-
work architectures.
Thirdly, we note that all ML models scored better in the validation period than
the test period. This is expected, as it is well known that the performance of trading
models deteriorates over time due to overcrowding and regime changes (a phenomenon
known as alpha decay). Models that are over-fitted to recent training data will ex-
perience greater alpha decay than properly regularised models. To select the ML
method, we consider the top models according to the Sharpe and Calmar ratios over
the validation period: a high Sharpe ratio ensures the model has good overall perfor-
mance, whereas a high Calmar ratio ensures good performance against the worst-case
scenario, thus capturing the tail risks of the trading model. Indeed, we find that
LightGBM-dart without feature engineering generalises well to the test period further
into the future.
Finally, we note that LightGBM-gbdt has better generalisation to the test period
than neural network-based models (TabNet), suggesting over-fitting in these complex
deep NN models. This indicates that although over-parameterised models can learn
non-linear relationships in temporal tabular data sets, these relationships may be
difficult to generalise under non-stationary data environments. On the other hand, our
results suggest that, despite their relative simplicity, gradient Boosting models capture
non-linearity in a more robust and controlled manner, with early trees capturing linear
relationships and non-linear relationships captured by the later trees, thus reducing
the risk of catastrophic forgetting [35].
In summary, we find that the best performing model in our set is LightGBM-
dart without feature engineering. In the rest of the paper, we will use this model to
illustrate how the pipeline can be further modified with online learning to account
for regime effects. To demonstrate the robustness of our pipeline, and how it can
be applied to improve the performance of any ML model, we will also report the
performance of two other models: a similar GBDT model (LightGBM-gbdt without
feature engineering) and a neural network model (MLP without feature engineering).

5. Dealing with regime effects in the ML pipeline. Financial data are


heavily influenced by regime changes. Growth (‘Bull’) markets are characterised
by low volatility and positive expected return, whereas high volatility and negative
expected returns are characteristic of adverse (‘bear’) markets. Switches between
regimes can be triggered by externalities, such as pandemics, economic recessions,
etc. From the perspective of the Numerai data set, such regime effects affect model
performance. Volatility is detrimental to long-term performance due to the negative
compounding of investment losses, a phenomenon known as ‘volatility tax’. Given
that hedge funds are leveraged, we consider consistent models with reasonably good
performance under different market regimes, rather than models that have excellent
performance in one market regime but fail in others.

In this section, we focus on how to deal with regime effects when using ML
models for financial tabular temporal data sets. Specifically, we consider feature
neutralisation, and reducing the dependence on the initial trees in gradient boosting
models.
Classification into high and low volatility regimes. To classify the financial market
into regimes, we consider an intrinsic measure derived directly from the Numerai
dataset. In particular, we first compute the Numerai Market Index (NMI), i.e., the
weekly performance of the baseline (linear) factor momentum portfolio, and we then
calculate the Numerai Realised Volatility Index (NRVIX), defined as the standard
deviation of NMI rolling over 52 weeks (Fig. 3). The eras are then classified into high
and low volatility, based on a threshold of NRVIX=0.025, the mean over the first 7
years of data (2003-01-03 to 2010-02-26). According to this intrinsic characterisation,
low volatility regimes have stable linear relationships of features to stock returns, often
associated with good performance by ML models. On the other hand, high volatility
regimes correspond to unstable linear relationships of features to stock returns leading
to poor model performance. Figure 3 shows that high/low NRVIX regimes are well
aligned with macroeconomic events: high volatility regimes include the financial crisis
(2007-2009), the Euro crisis (2011-2012), and the Covid pandemic (2020), whereas low
volatility regimes correspond to benign market conditions with no significant macro
event risks, during which the factor momentum baseline portfolio had good returns.
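A minimal sketch of this regime classification, assuming nmi is the weekly NMI series:

```python
import numpy as np
import pandas as pd

def classify_regimes(nmi: pd.Series, threshold: float = 0.025) -> pd.Series:
    """Label each week 'high' or 'low' volatility from the Numerai Market Index.

    NRVIX is the 52-week rolling standard deviation of the weekly NMI; weeks
    above the threshold (here the 2003-2010 mean, 0.025) are labelled 'high'.
    """
    nrvix = nmi.rolling(window=52).std()
    return pd.Series(np.where(nrvix > threshold, "high", "low"), index=nmi.index)
```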

(a) NMI (b) NRVIX

Fig. 3: High and low volatility regimes in the Numerai data. (a) Numerai
Market Index (NMI) for the period between 2005-01-28 (Era 109) and 2022-09-23
(Era 1016); (b) the computed Numerai Realised Volatility Index (NRVIX) used to
identify the high and low volatility regimes. The high volatility regime refers to weeks
where NRVIX is higher than 0.025 and the low volatility regime refers to weeks where
NRVIX is lower than 0.025.

5.1. Feature Neutralisation. Feature neutralisation is the general term to


denote the elimination of the effect of particular features in the model, thus reducing
the risk of over-relying on certain individual features. Because the predictive ability
of individual features is highly dependent on market regimes, this can lead to long
periods of drawdown when there is a regime change. It is therefore undesirable to
have ML models with a heavy (linear) dependence on certain features.
We start by evaluating here the feature neutralisation suggested by the Numerai
tournament. Numerai recommends that participants reduce model exposure to 420

‘risky features’ (out of the 1191 features). This list of risky features can be used
for feature neutralisation by subtracting the linear correlation using the formula for
Feature Neutral Correlation (FNC). Specifically, given a week of data with n stocks,
let X ∈ Rn×420 be the matrix of risky features and y ∈ Rn the predicted rankings ob-
tained from a model. For a given neutralisation strength β, 0 ≤ β ≤ 1, the neutralised
predicted ranking ŷ is calculated as ŷ = y − β XX† y, where X† is the pseudo-inverse
of X. FNC is then calculated as the correlation of the neutralised predicted rankings.
Using this procedure, we reduce the linear dependencies of predictions on features.
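A minimal sketch of this neutralisation for one week of data, with y and X as defined above:

```python
import numpy as np

def neutralise(y: np.ndarray, X: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Remove the linear projection of the predicted rankings onto the risky features.

    Implements y_hat = y - beta * X X^+ y, with X^+ the Moore-Penrose pseudo-inverse.
    """
    return y - beta * X @ (np.linalg.pinv(X) @ y)
```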

(a) LightGBM-dart without FE

Regime    Feature Neutral   Mean    Volatility   Max Draw   Sharpe   Calmar
All       Yes               0.0215  0.0182       0.1153     1.1806   0.1865
All       No                0.0278  0.0284       0.1622     0.9791   0.1714
High Vol  Yes               0.0227  0.0163       0.0223     1.3888   1.0179
High Vol  No                0.0314  0.0251       0.0657     1.2510   0.4779
Low Vol   Yes               0.0206  0.0195       0.1153     1.0576   0.1787
Low Vol   No                0.0252  0.0305       0.1622     0.8257   0.1554

(b) LightGBM-gbdt without FE

Regime    Feature Neutral   Mean    Volatility   Max Draw   Sharpe   Calmar
All       Yes               0.0204  0.0211       0.1998     0.9665   0.1021
All       No                0.0262  0.0321       0.2378     0.8140   0.1102
High Vol  Yes               0.0217  0.0198       0.0364     1.0953   0.5962
High Vol  No                0.0308  0.0293       0.1123     1.0497   0.2743
Low Vol   Yes               0.0194  0.0220       0.1998     0.8820   0.0971
Low Vol   No                0.0227  0.0338       0.2378     0.6727   0.0955

(c) MLP without FE

Regime    Feature Neutral   Mean    Volatility   Max Draw   Sharpe   Calmar
All       Yes               0.0179  0.0203       0.2606     0.8798   0.0687
All       No                0.0258  0.0289       0.1668     0.8931   0.1547
High Vol  Yes               0.0196  0.0193       0.0326     1.0191   0.6012
High Vol  No                0.0298  0.0276       0.1247     1.0802   0.2390
Low Vol   Yes               0.0165  0.0210       0.2606     0.7875   0.0633
Low Vol   No                0.0228  0.0296       0.1668     0.7721   0.1367

Table 3: The effect of feature neutralisation. Performance of different ML meth-


ods on the Numerai v4 dataset over the test period (2014-06-27 to 2022-09-23) with
and without feature neutralisation under different market regimes: the whole test
period (all), high volatility regime (high-vol), and low volatility regime (low-vol).

In Table 3, we compare the performance of the LightGBM-dart, LightGBM-gbdt


and MLP with and without feature neutralisation under different market regimes (all,
high volatility, low volatility). The neutralisation strength β is set to 1 throughout.
We find that the variance of models is consistently reduced by feature neutralisa-
tion, suggesting an overall reduction of risk. Further, feature neutralisation improves
the Sharpe and Calmar ratios of LightGBM-dart and LightGBM-gbdt under different
market regimes, but does not improve the performance of MLP models.
Importantly, this default feature neutralisation procedure suggested by Numerai

is not optimal, and we will show in Section 6 how online learning approaches can be
used to improve the procedure.
5.2. Pruning initial trees in Gradient Boosting models. For gradient-
boosting tree models, we also consider a specific procedure consisting of pruning
initial trees during prediction to reduce feature dependencies. Specifically, we perform
a grid search over the number of initial trees to be pruned off in the trained LightGBM
models, and we cap the number of trees to be pruned to not more than half of the
trees to ensure our models do not degenerate.
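Because a gradient boosting prediction is just the sum of the individual tree outputs, the pruning can be done at prediction time without retraining. A minimal sketch, assuming a trained lightgbm.Booster called model, a feature matrix X_era, and a sufficiently recent LightGBM version whose predict method accepts start_iteration:

```python
n_prune = 250  # illustrative value from the grid 0/100/250/500

raw_pred = model.predict(X_era)                              # contribution of all trees
pruned_pred = model.predict(X_era, start_iteration=n_prune)  # skip the first n_prune trees
```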
Table 4 compares the performance of LightGBM-dart and LightGBM-gbdt models
pruning different numbers of initial trees before feature neutralisation. Pruning ini-
tial trees during prediction improves the Sharpe and Calmar ratios of both LightGBM
models, but LightGBM-gbdt models see a bigger improvement than LightGBM-dart
models. This is expected as LightGBM-dart models already employ a similar fun-
damental idea during training, i.e., the trained trees in LightGBM-dart models are
already optimised. Our numerics also suggest that there is a limit of trees to be pruned
such that there is little improvement in model performance once over a threshold of
around 100-250 trees.

(a) LightGBM-dart without FE

Prune Trees Mean Volatility Max Draw Sharpe Calmar


0 0.0278 0.0284 0.1622 0.9791 0.1714
100 0.0272 0.0264 0.1384 1.0293 0.1965
250 0.0264 0.0255 0.1299 1.0336 0.2032
500 0.0249 0.0238 0.1166 1.0459 0.2136

(b) LightGBM-gbdt without FE

Prune Trees Mean Volatility Max Draw Sharpe Calmar


0 0.0262 0.0321 0.2378 0.8140 0.1102
100 0.0265 0.0291 0.1835 0.9106 0.1444
250 0.0253 0.0259 0.1490 0.9769 0.1698
500 0.0253 0.0259 0.1490 0.9765 0.1698

Table 4: The effect of tree pruning. Strategy Performance of different LightGBM


models in the test period (2014-06-27 to 2022-09-23) when pruning different numbers
of initial trees.

5.3. Joint effect of feature neutralisation and tree pruning. We then


considered the joint effect of feature neutralisation and pruning initial trees. Table
5 compares the performance (FNC) of LightGBM-dart and LightGBM-gbdt models
pruning a different number of initial trees after feature neutralisation. The effect of
pruning on model performance for both LightGBM models after feature neutralisation
is at best modest. As FNC is a measure of the effect of non-linear relationships, this
suggests that in gradient boosting models, early weak learners (trees) mostly capture
linear relationships whereas most of the non-linear relationships are captured in the
later weak learners (trees). Therefore, pruning initial trees can be thought of as a
model-dependent feature neutralisation method.

(a) LightGBM-dart without FE with Feature Neutralisation

Prune Trees Mean Volatility Max Draw Sharpe Calmar


0 0.0215 0.0182 0.1153 1.1806 0.1865
100 0.0208 0.0174 0.1079 1.1998 0.1928
250 0.0200 0.0168 0.1103 1.1918 0.1813
500 0.0183 0.0156 0.1044 1.1748 0.1753

(b) LightGBM-gbdt without FE with Feature Neutralisation

Prune Trees Mean Volatility Max Draw Sharpe Calmar


0 0.0204 0.0211 0.1998 0.9665 0.1021
100 0.0206 0.0200 0.1912 1.0293 0.1077
250 0.0194 0.0188 0.2058 1.0307 0.0943
500 0.0193 0.0188 0.2063 1.0301 0.0936

Table 5: The joint effect of feature neutralisation and tree pruning. Perfor-
mance of different LightGBM models after neutralisation in the test period (2014-06-
27 to 2022-09-23) when pruning different numbers of initial trees.

6. Online Learning to improve post-prediction processing and model


ensembles. As a further improvement to the ML pipeline, we apply online learning
approaches to both feature neutralisation and model ensembles to produce improved
versions called dynamic feature neutralisation and dynamic model selection. Dynamic
feature neutralisation acts by applying statistical rules to determine subsets of fea-
tures to neutralise predictions in each era. Dynamic model selection acts by updating
regularly the choice of model(s) from a model ensemble based on recent model per-
formance.
The aim of online learning is to derive an optimal procedure to select ML mod-
els and parameters as data arrives continuously. In a continuous time setting, the
Hamilton-Jacobi-Bellman (HJB) equation is solved to find the optimal determinis-
tic control for the decision problem [36]. The discrete-time equivalent, the Bellman
equation, is used in reinforcement learning to derive optimal policies of agents [37].
For the Numerai tournament, we consider online learning in the discrete-time
setting, since data and predictions are required once per week. For each week t
(1 ≤ t ≤ T ), we have a state (data) process Xt , which contains all the infor-
mation we know about the environment (Numerai datasets and trained ML model
parameters) up to week t. Our task is then to derive a deterministic decision pro-
cess Dt(βt) described by parameters βt := βt(Xt), subject to the objective function
VT = maxDt Σt=1..T q(Xt, Dt), where q(Xt, Dt) represents the utility at time instant t
given the data and decision process.
(Deep) Reinforcement learning algorithms are commonly used to solve online
learning problems. However, they are not used here due to the following reasons:
1. Limited data: Available data is not enough to train reinforcement learning
models, such as Deep Q Networks (DQN) [38], Proximal Policy Optimisation
(PPO) [39] and Soft Actor-Critic (SAC) [40]). Generating a large number of
samples is difficult here since we must avoid look-ahead bias.
2. Expanding action space: Most implementations of reinforcement learning al-

gorithms, as found in Ray-RLlib [11], cannot adapt naturally to an expanding


action space. For the dynamic model selection problem, the number of po-
tential models is unbounded, as newer models can be trained with the latest
data available and added to the candidate list. Rule-based models, on the
other hand, can handle the issue of expanding action space easily.
3. Actions have negligible impact on environment: Highly successful reinforce-
ment learning algorithms are usually targeted at robotics and Atari games
[41], where agent actions can modify the environment. However, for the trad-
ing models considered here, the trading activities are assumed to have zero or
negligible market impact, and reinforcement learning algorithms thus reduce
to an online learning prediction problem.
4. Large, correlated feature sets for neutralisation: To improve feature neutral-
isation, we use a different subset of features to neutralise predictions in each
era. Yet the size of the set of risky features (420 features) makes it computa-
tionally infeasible to learn feature subsets through supervised ML methods or
reinforcement learning, as it is difficult to construct a robust reward function
for correlated features. Heuristic methods thus provide suitable alternatives
to learn interpretable and robust feature neutralisation schemes.
5. Model ensembling can be simplified in the Numerai problem: The model en-
semble step of the pipeline assigns portfolio weightings to different ML mod-
els. Although similar to a multi-armed bandit problem, in our problem ex-
ploration is not needed for the agent to learn the distribution of rewards from
different choices since the performance of all ML models up to the decision
time are known to the Numerai tournament participant. Hence there is less
need to employ trial-and-error as in multi-armed bandit algorithms.
As a consequence, instead of reinforcement learning algorithms, we use heuristics
which are shown to be effective in improving the robustness of the ML pipeline. These
heuristics can be interpreted as strong priors in Bayesian learning that greatly simplify
our problem.

6.1. Dynamic Feature neutralisation. In Section 5, the subset of ‘risky fea-


tures’ that are used to neutralise ML models is fixed throughout the whole validation
and test periods. As market conditions are variable, we suggest choosing a different
set of features to neutralise in each era to adapt our ML models without the need for
expensive re-training of models. Specifically, each week we update the set of features
to neutralise based on rolling statistical properties of features, as follows. For each
feature in the dataset, we calculate the correlation of the feature with the target (fea-
ture Corr) and then compute lagged moving average statistics, with a lag of 6 weeks
to account for the lagged reporting of future performance. The look-back window to
compute statistical properties of feature Corr is 52 weeks. We consider 5 different
criteria to select the subset of features to be neutralised:
1. ‘Fixed’: 420 features provided by the portfolio optimiser in Numerai, as in
Section 5 above
2. ‘Low Mean’: 420 features that are least correlated to the target recently
3. ‘High Mean’: 420 features that are most correlated to the target recently
4. ‘Low Volatility’: 420 features that have correlations least volatile recently
5. ‘High Volatility’: 420 features that have correlations most volatile recently
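A minimal sketch of how the feature subsets for the last four criteria above could be selected each week; feature_corr is an assumed table holding, per past era, the correlation of each feature with the target:

```python
import pandas as pd

def select_neutralisation_features(feature_corr: pd.DataFrame, criterion: str,
                                   n: int = 420, lag: int = 6, window: int = 52):
    """Return the feature subset to neutralise against in the coming era.

    Statistics use a 52-week look-back window, lagged by 6 weeks to respect
    the delayed reporting of the forward-looking target.
    """
    hist = feature_corr.iloc[:-lag].tail(window)
    mean, vol = hist.mean(), hist.std()
    ranked = {"low_mean": mean.nsmallest(n), "high_mean": mean.nlargest(n),
              "low_vol": vol.nsmallest(n), "high_vol": vol.nlargest(n)}[criterion]
    return list(ranked.index)
```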
Table 6 compares the performance obtained by the different dynamic feature
neutralisation schemes on LightGBM-dart, LightGBM-gbdt and MLP models. All
Dynamic Feature Neutralisation methods perform better than using a fixed set of

features, but the ‘Low Mean’ neutralisation method has the best Sharpe and Calmar
ratios for all ML models, followed by neutralisation of ‘High Volatility’ features. The
worse performance of the ‘High Mean’ and ‘Low Volatility’ neutralisations suggests that a
large part of the model risk can be attributed to recently underperforming and highly
volatile features.

(a) LightGBM-dart without FE

Dynamic Feature Neutral. Mean Volatility Max Draw Sharpe Calmar


Fixed 0.0215 0.0182 0.1153 1.1806 0.1865
Low Mean 0.0240 0.0164 0.0350 1.4595 0.6857
High Mean 0.0218 0.0185 0.0986 1.1783 0.2211
Low Vol 0.0244 0.0200 0.0538 1.2220 0.4535
High Vol 0.0226 0.0169 0.0341 1.3411 0.6628

(b) LightGBM-gbdt without FE

Dynamic Feature Neutral. Mean Volatility Max Draw Sharpe Calmar


Fixed 0.0204 0.0211 0.1998 0.9665 0.1021
Low Mean 0.0234 0.0184 0.0495 1.2737 0.4727
High Mean 0.0199 0.0212 0.1469 0.9381 0.1355
Low Vol 0.0224 0.0228 0.1852 0.9797 0.1210
High Vol 0.0182 0.1633 0.0487 1.1986 0.4476

(c) MLP without FE

Dynamic Feature Neutral. Mean Volatility Max Draw Sharpe Calmar


Fixed 0.0179 0.0203 0.2606 0.8798 0.0687
Low Mean 0.0211 0.0185 0.0806 1.1387 0.2618
High Mean 0.0186 0.0201 0.1283 0.9256 0.1450
Low Vol 0.0206 0.0215 0.0878 0.9598 0.2346
High Vol 0.0191 0.0172 0.0730 1.1150 0.2616

Table 6: The effect of Dynamic Feature Neutralisation. Performance of differ-


ent ML models in the test period (2014-06-27 to 2022-09-23) with different dynamic
feature neutralisation methods

Next we compared the performance obtained by different dynamic feature neu-


tralisations under different market regimes, as defined in Section 5. The results
can be found in Tables 9 and 8 in the Supplementary Information. Neutralisation
by ‘Low Mean’ performs better than Neutralisation by ‘High Mean’ in low volatility
regimes, but not in high volatility regimes. Under high volatility regimes, neutralisa-
tion by ‘Low Volatility’ features performs better than neutralisation by
‘Low Mean’. Under a low volatility regime, neutralisation by ‘Low Mean’ performs
significantly better than others.
Based on the above, we make the following observations: In a low volatility
regime, factors that are performing well recently continue to do so in the near future
as the feature correlation structure is more stable in low volatility regimes. This
works until there is a regime change. In a high volatility regime, the ML models
after neutralisation of ‘Low Volatility’ features have a much higher Mean Corr than
models obtained by other neutralisation methods. ‘Low Volatility’ represents features

that have low variance and stable performance in the last 52 weeks. During volatile
regimes, these features will underperform. Models that neutralise these features can
then outperform when there is market stress.
6.2. Dynamic model selection. In practice, it is not possible to know the best
dynamic feature neutralisation method in advance. Therefore, we propose an online
learning procedure, consisting of two steps, to select the dynamic feature neutralisation
method during the test period. The first step is to have a warm-up period to collect
data on model performances, during which all 5 feature neutralisation methods (fixed,
low mean, high mean, low vol, high vol) have equal weighting. The second step is to
allocate weights to the optimal model based on recent performance according to the
following criteria:
• ‘Average’: Using all five feature neutralisation methods with equal weighting
• ‘Momentum’: Using the feature neutralisation method with the highest Mean
Corr in the last 52 weeks
• ‘Sharpe’: Using the feature neutralisation method with the highest Sharpe
Ratio in the last 52 weeks
• ‘Calmar’: Using the feature neutralisation method with the highest Calmar
ratio in the last 52 weeks
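A minimal sketch of these selection rules; scores is an assumed table holding the weekly Corr of each neutralisation variant (columns) over past eras (rows), and the ‘Average’ criterion simply keeps all variants with equal weight:

```python
import pandas as pd

def select_variant(scores: pd.DataFrame, criterion: str = "momentum",
                   lag: int = 6, window: int = 52) -> str:
    """Pick the feature neutralisation variant to use for the coming era,
    based on its performance over a lagged 52-week window."""
    hist = scores.iloc[:-lag].tail(window)
    mean = hist.mean()
    if criterion == "momentum":
        return mean.idxmax()
    if criterion == "sharpe":
        return (mean / hist.std()).idxmax()
    if criterion == "calmar":
        drawdown = (hist.cumsum().cummax() - hist.cumsum()).max()
        return (mean / drawdown).idxmax()
    raise ValueError(f"unknown criterion: {criterion}")
```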
In Table 7, we use these criteria to select the optimal dynamic feature neutralisation
method based on recent performance. As above, a lag of 6 weeks is applied to account
for data delays.
The online learning procedure can thus select the optimal dynamic feature neutralisation
method, outperforming the ‘Average’ selection in most cases. For all three ML
models (LightGBM-dart/LightGBM-gbdt/MLP), the ‘Momentum’ selection method
has a higher mean Corr and Calmar ratio than the ‘Average’ (baseline) and ‘Sharpe’
methods. This shows that the ‘Momentum’ method, a very simple model selection
method that chooses the recent best-performing model, can adapt a trained ML model
towards different market regimes efficiently. For LightGBM-dart and LightGBM-gbdt
models, the ‘Calmar’ selection method gives a higher Calmar ratio than the ‘Momen-
tum’ method but with a lower mean Corr. For MLP models, the ‘Calmar’ selection
method significantly under-performs other model selection methods, with a much
higher Max Drawdown. This suggests that selection based on historical drawdown is
not robust, especially under situations with regime changes.
In summary, the proposed online learning procedure to select optimal dynamic
feature engineering methods can significantly reduce trading risks and improve the
robustness of trading models, outperforming the baseline selection method that takes
a simple average of all available models.
7. Discussion. Motivated by the Numerai tournament, we have designed here an ML pipeline that can be applied to tabular temporal data of stock prices to underpin strategies for trading market-neutral stock portfolios. The various steps in the ML pipeline are carefully designed for robustness against regime changes and to avoid information leakage through time. We thus aim to obtain models with relatively low complexity, so as to reduce the danger of over-fitting, and with high robustness to changes in hyper-parameters and other choices in the algorithms.
Regarding the choice of ML models, we find that gradient-boosting decision tree
models are both more robust and interpretable than neural network-based models,
and they allow more consistent performance under different market regimes.
We also find that post-prediction processing, which is model-agnostic, is an effec-
tive means of adapting trained ML models towards new situations without the need

(a) LightGBM-dart without FE

Model Selection Mean Volatility Max Draw Sharpe Calmar


Average 0.0229 0.0160 0.0619 1.4323 0.3700
Momentum 0.0246 0.0180 0.0533 1.3654 0.4615
Sharpe 0.0234 0.0165 0.0533 1.4148 0.4390
Calmar 0.0225 0.0171 0.0350 1.3122 0.6429

(b) LightGBM-gbdt without FE

Model Selection Mean Volatility Max Draw Sharpe Calmar


Average 0.0216 0.0177 0.0710 1.2165 0.3042
Momentum 0.0228 0.0201 0.0729 1.1342 0.3128
Sharpe 0.0224 0.0187 0.0729 1.1966 0.3073
Calmar 0.0216 0.0195 0.0508 1.1102 0.4252

(c) MLP without FE

Model Selection Mean Volatility Max Draw Sharpe Calmar


Average 0.0195 0.0175 0.0918 1.1149 0.2124
Momentum 0.0212 0.0191 0.0878 1.1124 0.2415
Sharpe 0.0207 0.0186 0.0878 1.1110 0.2358
Calmar 0.0187 0.0201 0.1973 0.9309 0.0948

Table 7: The effect of dynamic model selection. Performance of different ML


models in the test period (2014-06-27 to 2022-09-23) with different online learning
procedures selecting the optimal dynamic feature neutralisation method.

to re-train the ML models, which would introduce additional model uncertainty. Using dynamic feature neutralisation produces models with different flavours in an interpretable way, and these models also have better risk-adjusted performance than models with fixed feature neutralisation.
Stacking is commonly used in ML competitions to improve the robustness of models. The method suggested in this study, dynamic model selection, can be applied to online ML problems to guide the selection of optimal models from a growing model ensemble. We find that a simple design, such as equal-weighted models, has robust performance under different market regimes, but selecting the best model based on recent performance improves on this baseline, as it switches to a lower-risk model during more volatile market regimes. It remains an open research question how reinforcement learning or other online learning methods could be used to learn optimal stacking weights between different ML models, given their historical performance and correlations.
We also studied the robustness of our ML pipeline under different random seeds
and changes in data splits for cross-validation. The results are presented in Section
9.4 in the Supplementary Information, where we show that LightGBM-dart models are robust against these changes. The statistical rules used in dynamic feature neutralisation are also shown to perform better than selecting the features to neutralise at random.
In the following, we discuss some ideas for further work to improve the ML pipeline we designed.
The diversity of models within a model ensemble is a key ingredient for dynamic model selection and other ensemble/stacking methods. A new metric could be designed to quantify the impact of a new ML model on an existing model ensemble; this metric could then be used to train new ML models that are uncorrelated with existing ones.
The simple feature engineering methods used in our present study did not improve the performance of the ML models. Identifying robust relationships between features across different market regimes is difficult, but generative models, such as Variational Autoencoders [42], could be used to create new features that summarise non-linear relationships among existing features.
The Gradient Boosting models used in our pipeline are suitable for distributed learning, where large datasets are split into smaller batches to be trained on different machines, often under varying computational resource constraints. Data science competitions like the Numerai tournament rely on the community efforts of individual data scientists to create a meta-model. This approach to crowdsourcing rests on the assumption that a complicated ML model that would require advanced hardware to train can be approximated by combining a number of ML models, each trained with less data or fewer features. Studying the convergence of model performance would therefore be important for organising such competitions, as it determines how many participants are needed to maintain a sufficiently diverse pool of models for the meta-model.
Overall, our results suggest using simple, well-established ML models, such as gradient-boosting decision trees, instead of specialised neural network models for this task. Rather than using a single neural network to perform feature engineering, model training/inference and post-prediction transformations, the modularised design of the ML pipeline in this study offers increased robustness and transparency: researchers can add, modify or delete a component without affecting the rest of the pipeline. Creating model ensembles improves model performance by reducing the idiosyncratic variance of individual ML models. Simple model selection rules based on recent performance provide a baseline that works well under different market regimes, and portfolio metrics such as the Sharpe and Calmar ratios are further improved by selecting the recently best-performing models.

8. Data and Code Availability. The data and code used in this paper are available at https://github.com/ThomasWong2022/numerai-benchmark.

REFERENCES

[1] J. Arosemena, N. Perez, D. Benitez, D. Riofrio, and R. Flores-Moyano, “Stock price analy-
sis with deep-learning models,” in 2021 IEEE Colombian Conference on Applications of
Computational Intelligence (ColCACI), 2021, pp. 1–6.
[2] S. Selvin, R. Vinayakumar, E. A. Gopalakrishnan, V. K. Menon, and K. P. Soman, “Stock price
prediction using lstm, rnn and cnn-sliding window model,” in 2017 International Confer-
ence on Advances in Computing, Communications and Informatics (ICACCI), 2017, pp.
1643–1647.
[3] K. A. Althelaya, E.-S. M. El-Alfy, and S. Mohammed, “Evaluation of bidirectional lstm for
short-and long-term stock market prediction,” in 2018 9th International Conference on
Information and Communication Systems (ICICS), 2018, pp. 151–156.
[4] J. Hullman, S. Kapoor, P. Nanayakkara, A. Gelman, and A. Narayanan, “The worst of both
worlds: A comparative analysis of errors in learning from data in psychology and machine
learning,” 2022. [Online]. Available: https://arxiv.org/abs/2203.06498
[5] S. Kapoor and A. Narayanan, “Leakage and the reproducibility crisis in ml-based science,”
2022. [Online]. Available: https://arxiv.org/abs/2207.07048

[6] E. Rivera-Landos, F. Khomh, and A. Nikanjam, “The challenge of reproducible ml: an empirical
study on the impact of bugs,” 2021. [Online]. Available: https://arxiv.org/abs/2109.03991
[7] [Online]. Available: https://arxiv.org/abs/2006.12117
[8] J. Bollen, H. Mao, and X. Zeng, “Twitter mood predicts the stock market,” Journal
of Computational Science, vol. 2, no. 1, pp. 1–8, 2011. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S187775031100007X
[9] J. P. N and B. Vasudevan, “Effective implementation of neural network model with tune pa-
rameter for stock market predictions,” in 2021 2nd International Conference on Smart
Electronics and Communication (ICOSEC), 2021, pp. 1038–1042.
[10] S. Singh and S. Sharma, “Forecasting stock price using partial least squares regression,” in
2018 8th International Conference on Cloud Computing, Data Science & Engineering
(Confluence), 2018, pp. 587–591.
[11] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang,
W. Paul, M. I. Jordan, and I. Stoica, “Ray: A distributed framework for emerging ai
applications,” 2017. [Online]. Available: https://arxiv.org/abs/1712.05889
[12] F. Zhou, Q. Zhang, D. Sornette, and L. Jiang, “Cascading logistic regression
onto gradient boosted decision trees for forecasting and trading stock indices,”
Applied Soft Computing, vol. 84, p. 105747, 2019. [Online]. Available: https:
//www.sciencedirect.com/science/article/pii/S1568494619305289
[13] C. Bockel-Rickermann, “Predicting day-ahead stock returns using search engine query
volumes: An application of gradient boosted decision trees to the s&p 100,” 2022.
[Online]. Available: https://arxiv.org/abs/2205.15853
[14] R. Akita, A. Yoshihara, T. Matsubara, and K. Uehara, “Deep learning for stock prediction using
numerical and textual information,” in 2016 IEEE/ACIS 15th International Conference
on Computer and Information Science (ICIS), 2016, pp. 1–6.
[15] Numerai. Numerai Hedge Fund. (2022, Apr 12). [Online]. Available: https://numerai.fund/
[16] D. B. Percival and A. T. Walden, Spectral Analysis for Univariate Time Series, ser. Cambridge
Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2020.
[17] B. Lim, S. O. Arik, N. Loeff, and T. Pfister, “Temporal fusion transformers
for interpretable multi-horizon time series forecasting,” 2019. [Online]. Available:
https://arxiv.org/abs/1912.09363
[18] A. Kadra, M. Lindauer, F. Hutter, and J. Grabocka, “Well-tuned simple nets excel on tabular
datasets,” 2021. [Online]. Available: https://arxiv.org/abs/2106.11189
[19] R. Shwartz-Ziv and A. Armon, “Tabular data: Deep learning is not all you need,” 2021.
[Online]. Available: https://arxiv.org/abs/2106.03253
[20] S. B. Kotsiantis, “Decision trees: a recent overview,” Artificial Intelligence Review, vol. 39, pp.
261–283, 2011.
[21] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and
an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1,
pp. 119–139, 1997. [Online]. Available: https://www.sciencedirect.com/science/article/
pii/S002200009791504X
[22] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” The Annals
of statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
[23] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y.
Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” in Advances
in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran
Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/
file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
[24] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, “Catboost: unbiased
boosting with categorical features,” in Advances in Neural Information Processing
Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and
R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018. [Online]. Available: https:
//proceedings.neurips.cc/paper/2018/file/14491b756b3a51daac41c24863285549-Paper.pdf
[25] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, ser. KDD ’16. New York, NY, USA: Association for Computing Machinery,
2016, p. 785-794. [Online]. Available: https://doi.org/10.1145/2939672.2939785
[26] R. Korlakai Vinayak and R. Gilad-Bachrach, “DART: Dropouts meet Multiple Additive
Regression Trees,” in Proceedings of the Eighteenth International Conference on Artificial
Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Lebanon
and S. V. N. Vishwanathan, Eds., vol. 38. San Diego, California, USA: PMLR,

09–12 May 2015, pp. 489–497. [Online]. Available: https://proceedings.mlr.press/v38/


korlakaivinayak15.html
[27] V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci,
“Deep neural networks and tabular data: A survey,” 2021. [Online]. Available:
https://arxiv.org/abs/2110.01889
[28] S. Popov, S. Morozov, and A. Babenko, “Neural oblivious decision ensembles for deep learning
on tabular data,” 2019. [Online]. Available: https://arxiv.org/abs/1909.06312
[29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, et al., “PyTorch: An imperative style, high-performance deep learning library,” 2019. [Online]. Available: https://arxiv.org/abs/1912.01703
[30] G. Ke, Z. Xu, J. Zhang, J. Bian, and T.-Y. Liu, “Deepgbm: A deep learning framework
distilled by gbdt for online prediction tasks,” in Proceedings of the 25th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, ser. KDD ’19. New
York, NY, USA: Association for Computing Machinery, 2019, p. 384-394. [Online].
Available: https://doi.org/10.1145/3292500.3330858
[31] S. Ö. Arik and T. Pfister, “TabNet: Attentive interpretable tabular learning,” Proceedings of
the AAAI Conference on Artificial Intelligence, vol. 35, no. 8, pp. 6679–6687, May 2021.
[Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/16826
[32] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, and J. Tang, “AutoInt: Automatic feature interaction learning via self-attentive neural networks,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management. ACM, nov 2019. [Online]. Available: https://doi.org/10.1145/3357384.3357925
[33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014. [Online].
Available: https://arxiv.org/abs/1412.6980
[34] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-
generation hyperparameter optimization framework,” 2019. [Online]. Available: https:
//arxiv.org/abs/1907.10902
[35] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan,
J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and
R. Hadsell, “Overcoming catastrophic forgetting in neural networks,” Proceedings of the
National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017. [Online]. Available:
https://www.pnas.org/doi/abs/10.1073/pnas.1611835114
[36] D. E. Kirk, Optimal control theory: an introduction. Mineola, N.Y.: Dover Publications, 2004 (originally published 1970).
[37] R. S. Sutton, Reinforcement learning : an introduction, second edition. ed., ser. Adaptive
computation and machine learning series. Cambridge, Massachusetts: The MIT Press,
2018.
[38] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik,
I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis,
“Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540,
pp. 529–533, Feb 2015. [Online]. Available: https://doi.org/10.1038/nature14236
[39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy
optimization algorithms,” 2017. [Online]. Available: https://arxiv.org/abs/1707.06347
[40] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum
entropy deep reinforcement learning with a stochastic actor,” in Proceedings of the 35th
International Conference on Machine Learning, ser. Proceedings of Machine Learning
Research, J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018, pp. 1861–1870.
[Online]. Available: https://proceedings.mlr.press/v80/haarnoja18b.html
[41] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional
continuous control using generalized advantage estimation,” 2015. [Online]. Available:
https://arxiv.org/abs/1506.02438
[42] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in 2nd International
Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16,
2014, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2014. [Online].
Available: http://arxiv.org/abs/1312.6114
[43] P. Bühlmann and T. Hothorn, “Boosting algorithms: Regularization, prediction and model fitting,” Statistical Science, vol. 22, no. 4, nov 2007. [Online]. Available: https://doi.org/10.1214/07-STS242

9. Supplementary Information.
9.1. Additional results for dynamic feature neutralisation. Here we show
the performance of dynamic feature neutralisation for low and high volatility regimes.

(a) LightGBM-dart without FE

Feature Neutralisation Mean Volatility Max Draw Sharpe Calmar


Fixed 0.0206 0.0195 0.1153 1.0576 0.1787
Low Mean 0.0255 0.0175 0.0350 1.4578 0.7286
High Mean 0.0207 0.0206 0.0986 1.0033 0.2099
Low Vol 0.0238 0.0221 0.0538 1.0793 0.4424
High Vol 0.0235 0.0180 0.0341 1.3069 0.6891

(b) LightGBM-gbdt without FE

Feature Neutralisation Mean Volatility Max Draw Sharpe Calmar


Fixed 0.0194 0.0220 0.1998 0.8820 0.0971
Low Mean 0.0251 0.0188 0.0495 1.3328 0.5071
High Mean 0.0184 0.0228 0.1469 0.8053 0.1253
Low Vol 0.0214 0.0247 0.1852 0.8657 0.115
High Vol 0.0225 0.0188 0.0487 1.1939 0.4620

(c) MLP without FE

Feature Neutralisation Mean Volatility Max Draw Sharpe Calmar


Fixed 0.0165 0.0210 0.2606 0.7875 0.0633
Low Mean 0.0215 0.0187 0.0496 1.1496 0.4335
High Mean 0.0170 0.0210 0.1283 0.8118 0.1325
Low Vol 0.0194 0.0229 0.0878 0.8487 0.2210
High Vol 0.0194 0.0177 0.0730 1.0990 0.2658

Table 8: Performance of ML models in the test period (2014-06-27 to 2022-09-23) with different dynamic feature neutralisation methods in the low volatility regime.

(a) LightGBM-dart without FE

Feature Neutralisation Mean Volatility Max Draw Sharpe Calmar


Fixed 0.0227 0.0163 0.0223 1.3888 1.0179
Low Mean 0.0220 0.0148 0.0199 1.4907 1.1055
High Mean 0.0233 0.0151 0.0206 1.5372 1.1311
Low Vol 0.0252 0.0168 0.0330 1.4980 0.7636
High Vol 0.0215 0.0152 0.0143 1.4077 1.5035

(b) LightGBM-gbdt without FE

Feature Neutralisation Mean Volatility Max Draw Sharpe Calmar


Fixed 0.0217 0.0198 0.0364 1.0953 0.5962
Low Mean 0.0212 0.0176 0.0380 1.2039 0.5579
High Mean 0.0218 0.0186 0.0334 1.1728 0.6527
Low Vol 0.0237 0.0201 0.0306 1.1792 0.7745
High Vol 0.0209 0.0173 0.0308 1.2068 0.6786

(c) MLP without FE

Feature Neutralisation Mean Volatility Max Draw Sharpe Calmar


Fixed 0.0196 0.0193 0.0326 1.0191 0.6012
Low Mean 0.0205 0.0183 0.0806 1.1212 0.2543
High Mean 0.0170 0.0210 0.1283 0.8118 0.1325
Low Vol 0.0222 0.0194 0.0397 1.1442 0.5592
High Vol 0.0187 0.0165 0.0336 1.1368 0.5565

Table 9: Performance of ML models in the test period (2014-06-27 to 2022-09-23) with different dynamic feature neutralisation methods in the high volatility regime.

9.2. Pseudocode for algorithms in the text. For completeness, we present


here brief pseudocode for some of the main methods in the paper with the appropriate
references.

Algorithm 9.1 Gradient boosting algorithm [22, 43]

Given $N$ data samples $(x_i, y_i)$, $1 \le i \le N$, the aim is to find an increasingly better estimate $\hat{f}(x)$ of the minimising function $f(x)$ which minimises the loss $L(f) = \sum_{i} l(y_i, f(x_i))$ between targets and predicted values, where $l$ is a given loss function such as the mean square loss for regression problems. The function $f$ is restricted to the class of additive models $f(x) = \sum_{k=1}^{K} w_k \, h(x, \alpha_k)$, where $h(\cdot, \alpha)$ is a weak learner with parameters $\alpha$ and the $w_k$ are weights.

Initialise $f_0(x) = \arg\min_{\alpha_0} \sum_{i=1}^{N} l(y_i, h(x_i, \alpha_0))$
For $k = 1:K$:
  Compute the gradient residual $g_{ik} = -\partial l(y_i, f_{k-1}(x_i)) / \partial f_{k-1}(x_i)$
  Use the weak learner to compute the $\alpha_k$ which minimises $\sum_{i=1}^{N} \left( g_{ik} - h(x_i, \alpha_k) \right)^2$
  Update with learning rate $\lambda$: $f_k(x) = f_{k-1}(x) + \lambda \, h(x, \alpha_k)$
Return $f(x) = f_K(x)$

Algorithm 9.2 Gradient boosting tree algorithm implemented in LightGBM [22, 23, 43]

Initialise $f_0(x) = \arg\min_{\alpha_0} \sum_{i=1}^{N} l(y_i, h(x_i, \alpha_0))$
For $k = 1:K$:
  For $i = 1, 2, \ldots, N$, compute the gradient residual $g_{ik} = -\partial l(y_i, f_{k-1}(x_i)) / \partial f_{k-1}(x_i)$
  Fit a decision tree to the targets $g_{ik}$, giving terminal leaves $R_{kj}$, $j = 1, 2, \ldots, J_k$, where $J_k$ is the number of terminal leaves.
  For $j = 1, 2, \ldots, J_k$, compute $\alpha_{kj} = \arg\min_{\alpha} \sum_{x_i \in R_{kj}} l(y_i, f_{k-1}(x_i) + \alpha)$
  Update the boosted trees with learning rate $\lambda$: $f_k(x) = f_{k-1}(x) + \lambda \sum_{j=1}^{J_k} \alpha_{kj} \, I(x \in R_{kj})$
Return $f_K(x)$
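For illustration, a minimal Python sketch of the boosting loop in Algorithms 9.1–9.2 is given below for the squared loss, for which the negative gradient is simply the residual and the leaf means of the fitted tree already solve the per-leaf minimisation. This is only a sketch: LightGBM's actual implementation adds histogram-based split finding, leaf-wise growth and the regularisation discussed in the main text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Minimal sketch of Algorithms 9.1/9.2 for the squared loss l(y, f) = (y - f)^2 / 2,
    whose negative gradient is the residual y - f.  Returns a prediction function f_K."""
    y = np.asarray(y, dtype=float)
    f = np.full(len(y), y.mean())                 # f_0: best constant under squared loss
    trees = []
    for _ in range(n_rounds):
        residual = y - f                          # g_ik = -(dl/df) evaluated at f_{k-1}
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        f = f + learning_rate * tree.predict(X)   # f_k = f_{k-1} + lambda * h(x, alpha_k)
    base = y.mean()

    def predict(X_new):
        pred = np.full(len(X_new), base)
        for tree in trees:
            pred = pred + learning_rate * tree.predict(X_new)
        return pred

    return predict
```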

9.3. Hyper-parameter search space for different ML models. We ran all experiments on a GPU cluster, each node of which contains an NVIDIA GeForce RTX 2080 Ti GPU with 4352 CUDA cores and 11GB of memory. Hyper-parameter search is performed using Optuna [34] with its default TPE sampler. For each Feature Engineering/ML pipeline, the hyper-parameter search is run for at most 8 hours or at most 100 configurations, whichever comes first. In Figures 4 and 5, we list the hyper-parameter search spaces defined in Optuna [34] for the different ML models used in the main text.
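As an illustration of how such a search can be launched, the sketch below wires the LightGBM-gbdt ranges of Figure 4 into an Optuna study with the default TPE sampler and the same stopping rule (100 trials or 8 hours). The small random arrays stand in for the Numerai training and validation eras, and the mean-squared-error objective is an illustrative assumption rather than the exact objective used in our experiments.

```python
import numpy as np
import optuna
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

# Random stand-ins for the Numerai training/validation eras (features and targets)
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2000, 20)), rng.normal(size=2000)
X_valid, y_valid = rng.normal(size=(500, 20)), rng.normal(size=500)

def objective(trial: optuna.Trial) -> float:
    # Search ranges mirror the LightGBM-gbdt space listed in Figure 4
    params = {
        "boosting_type": "gbdt",
        "n_estimators": trial.suggest_int("n_estimators", 50, 1000, step=50),
        "learning_rate": trial.suggest_float("learning_rate", 0.005, 0.1, log=True),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 2500, 40000, step=2500),
        "lambda_l1": trial.suggest_float("lambda_l1", 0.01, 1.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 0.01, 1.0, log=True),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.1, 1.0, step=0.05),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.5, 1.0, step=0.05),
        "bagging_freq": trial.suggest_int("bagging_freq", 10, 50, step=10),
    }
    model = lgb.LGBMRegressor(**params).fit(X_train, y_train)
    return mean_squared_error(y_valid, model.predict(X_valid))

study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=100, timeout=8 * 3600)   # at most 100 configurations or 8 hours
```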

• Feature Engineering
  – Numerai Basic Feature Engineering
    ∗ dropout_pct: low:0.05, high:0.25, step:0.05
    ∗ no_product_features: low:50, high:1000, step:50
• ML Models
  – LightGBM-gbdt
    ∗ n_estimators: low:50, high:1000, step:50
    ∗ learning_rate: low:0.005, high:0.1, log:True
    ∗ min_data_in_leaf: low:2500, high:40000, step:2500
    ∗ lambda_l1: low:0.01, high:1, log:True
    ∗ lambda_l2: low:0.01, high:1, log:True
    ∗ feature_fraction: low:0.1, high:1, step:0.05
    ∗ bagging_fraction: low:0.5, high:1, step:0.05
    ∗ bagging_freq: low:10, high:50, step:10
  – LightGBM-dart
    ∗ n_estimators: low:50, high:1000, step:50
    ∗ learning_rate: low:0.005, high:0.1, log:True
    ∗ min_data_in_leaf: low:2500, high:40000, step:2500
    ∗ lambda_l1: low:0.01, high:1, log:True
    ∗ lambda_l2: low:0.01, high:1, log:True
    ∗ feature_fraction: low:0.1, high:1, step:0.05
    ∗ bagging_fraction: low:0.5, high:1, step:0.05
    ∗ bagging_freq: low:10, high:50, step:10
    ∗ drop_rate: low:0.1, high:0.5, step:0.1
    ∗ skip_drop: low:0.1, high:0.8, step:0.1
  – LightGBM-goss
    ∗ n_estimators: low:50, high:1000, step:50
    ∗ learning_rate: low:0.005, high:0.1, log:True
    ∗ min_data_in_leaf: low:2500, high:40000, step:2500
    ∗ lambda_l1: low:0.01, high:1, log:True
    ∗ lambda_l2: low:0.01, high:1, log:True
    ∗ feature_fraction: low:0.1, high:1, step:0.05
    ∗ bagging_fraction: low:0.5, high:1, step:0.05
    ∗ bagging_freq: low:10, high:50, step:10
    ∗ top_rate: low:0.1, high:0.4, step:0.05
    ∗ other_rate: low:0.05, high:0.2, step:0.05

Fig. 4: Hyper-parameter Space for ML models



• Machine Learning
  – MLP
    ∗ max_epochs: low:10, high:100, step:5
    ∗ patience: low:5, high:20, step:5
    ∗ num_layers: low:2, high:7, step:1
    ∗ neurons: low:64, high:1024, step:64
    ∗ neuron_scale: low:0.3, high:1, log:True
    ∗ dropout: low:0.1, high:0.9, log:True
    ∗ batch_size: low:10240, high:40960, step:10240
  – TabNet
    ∗ max_epochs: low:10, high:100, step:5
    ∗ patience: low:5, high:20, step:5
    ∗ batch_size: low:1024, high:4096, step:1024
    ∗ num_d: low:4, high:16, step:4
    ∗ num_a: low:4, high:16, step:4
    ∗ num_steps: low:1, high:3, step:1
    ∗ num_shared: low:1, high:3, step:1
    ∗ num_independent: low:1, high:3, step:1
    ∗ gamma: low:1, high:2, step:0.1
    ∗ momentum: low:0.01, high:0.4, step:0.01
    ∗ lambda_sparse: low:0.0001, high:0.01, log:True
Fig. 5: Hyper-parameter Space for ML models



9.4. Robustness of ML pipeline. One of the aims in this work was to provide a robust pipeline for tabular temporal data under regime changes. Here we present additional results on the robustness of the method under different scenarios and sources of variability.
Robustness under changes of random seeds in the learning algorithms. In Table 10, we report the variability of the performance of the LightGBM-dart, LightGBM-gbdt and MLP models trained starting from 10 different initial random seeds. The performance is generally robust to changes in the random seed, with small variances in the mean Corr and volatility, and a moderate variance in the Maximum Drawdown.

Model Mean Volatility Max Draw Sharpe Calmar


mean 0.0254 0.0266 0.1567 0.9593 0.1639
LightGBM-dart without FE
sd 0.0006 0.0007 0.0158 0.0365 0.0175
mean 0.0253 0.0312 0.2338 0.8104 0.1100
LightGBM-gbdt without FE
sd 0.0006 0.0006 0.0296 0.0278 0.0153
mean 0.0233 0.0271 0.1643 0.8600 0.1446
MLP without FE
sd 0.0009 0.0011 0.0248 0.0365 0.0219

Table 10: Variability of the performance of ML models in the test period (2014-06-27 to 2022-09-23). The mean and standard deviation of each portfolio metric are calculated over models trained with 10 different random seeds for each method.

A general strategy to reduce the variance is to combine different ML models.


There are two ways to do so: (i) averaging over models, by calculating the average
performance of different models, and (ii) averaging over predictions, by calculating the
average predictions from each model and then scoring the average predictions against
the target. Table 11 shows that averaging over predictions gives higher mean Corr
and Sharpe/Calmar ratios than averaging over models. Therefore, this averaging
method is used to compute model performances in Table 2 in the main text.
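The difference between the two averaging schemes can be sketched as follows, using a per-era rank correlation as a stand-in for the tournament Corr score; the helper names and the column layout are illustrative assumptions.

```python
import pandas as pd

def corr_per_era(pred: pd.Series, target: pd.Series, era: pd.Series) -> pd.Series:
    """Per-era correlation between rank-transformed predictions and the target,
    used here as a stand-in for the tournament Corr score."""
    df = pd.DataFrame({"pred": pred, "target": target, "era": era})
    return df.groupby("era").apply(lambda d: d["pred"].rank(pct=True).corr(d["target"]))

def average_over_models(preds, target, era) -> pd.Series:
    """(i) Score each model separately, then average the per-era scores."""
    scores = [corr_per_era(p, target, era) for p in preds]
    return pd.concat(scores, axis=1).mean(axis=1)

def average_over_predictions(preds, target, era) -> pd.Series:
    """(ii) Average the predictions first, then score the ensemble prediction once."""
    mean_pred = pd.concat(preds, axis=1).mean(axis=1)
    return corr_per_era(mean_pred, target, era)
```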

Model Average Mean Volatility Max Draw Sharpe Calmar


Over models 0.0254 0.0266 0.1567 0.9593 0.1639
LightGBM-dart without FE
Over predictions 0.0278 0.0284 0.1622 0.9791 0.1714
Over models 0.0253 0.0312 0.2338 0.8104 0.1100
LightGBM-gbdt without FE
Over predictions 0.0262 0.0321 0.2378 0.8140 0.1102
Over models 0.0233 0.0271 0.1643 0.8600 0.1446
MLP without FE
Over predictions 0.0258 0.0289 0.1668 0.8931 0.1547

Table 11: Performance of different ML methods on Numerai v4 dataset in the test


period (2014-06-27 to 2022-09-23) with different averaging methods

Robustness under different cross-validation data splits. As financial data are regime dependent, an important measure of model robustness is to train ML models using different cross-validation splits of the data and to compute how much model performance changes over the different test periods. To ascertain the robustness to data splits, we have carried out 3 cross-validation splits (CV 1, CV 2, CV 3), as shown in Table 12. The hyper-parameters are optimised under CV 1, which is the cross-validation used to generate the model performances in the main text. These hyper-parameters are fixed for the models trained under the CV 2 and CV 3 splits. For ML methods that require early stopping, the data in the validation period (different for each split) are used to regularise the models. Therefore, by reusing the optimised hyper-parameters across all splits, we evaluate the robustness of the model performance to the optimisation of hyper-parameters. We then compute the performance when applying the models to the shifted cross-validation datasets in the walk-forward CV 2 and CV 3 data splits. Our results show good consistency in performance across CV 2 and CV 3, with only a small deterioration of the results as compared to CV 1 (over which the hyper-parameters were optimised). We also find that LightGBM-dart with FE, the ML method that has the highest mean Corr in CV 1, has the greatest return and best Sharpe and Calmar ratios also in the other cross-validations, as seen in Table 13.
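A minimal sketch of these walk-forward splits, with the dates of Table 12 hard-coded, is given below; the era-date column name is an illustrative assumption.

```python
import pandas as pd

# Walk-forward splits from Table 12 (dates are era dates in the Numerai v4 data)
CV_SPLITS = {
    "CV1": {"train": ("2003-01-03", "2012-07-27"),
            "valid": ("2012-12-21", "2014-11-14"), "enter": "2015-05-15"},
    "CV2": {"train": ("2003-01-03", "2014-06-27"),
            "valid": ("2014-11-21", "2016-10-14"), "enter": "2017-04-14"},
    "CV3": {"train": ("2003-01-03", "2016-05-27"),
            "valid": ("2016-10-21", "2018-09-14"), "enter": "2019-03-15"},
}

def walk_forward_split(df: pd.DataFrame, name: str, date_col: str = "era_date"):
    """Slice an era-indexed DataFrame into (train, valid, test) for one scheme.
    Hyper-parameters tuned on CV1 are reused unchanged for CV2 and CV3, with the
    validation slice used only for early stopping."""
    s = CV_SPLITS[name]
    dates = pd.to_datetime(df[date_col])
    train = df[(dates >= s["train"][0]) & (dates <= s["train"][1])]
    valid = df[(dates >= s["valid"][0]) & (dates <= s["valid"][1])]
    test = df[dates >= s["enter"]]
    return train, valid, test
```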

Train Start Train End Validation Start Validation End Enter Ensemble
CV 1 2003-01-03 2012-07-27 2012-12-21 2014-11-14 2015-05-15
CV 2 2003-01-03 2014-06-27 2014-11-21 2016-10-14 2017-04-14
CV 3 2003-01-03 2016-05-27 2016-10-21 2018-09-14 2019-03-15

Table 12: Various cross-validation schemes to train ML models on different parts of


the data. CV 1 is the cross-validation used for hyper-parameter optimisation and
training ML models in the main text.

(a) CV 1 (2015-05-15 to 2022-09-23)

Method Mean Volatility Max Draw Sharpe Calmar


LightGBM-dart without FE 0.0278 0.0284 0.1622 0.9791 0.1714
LightGBM-gbdt without FE 0.0262 0.0321 0.2378 0.8140 0.1102
MLP without FE 0.0258 0.0289 0.1668 0.8931 0.1547

(b) CV 2 (2017-04-14 to 2022-09-23)

Method Mean Volatility Max Draw Sharpe Calmar


LightGBM-dart without FE 0.0250 0.0278 0.1817 0.8990 0.1376
LightGBM-gbdt without FE 0.0231 0.0324 0.3227 0.7104 0.0716
MLP without FE 0.0215 0.0289 0.2307 0.7446 0.0932

(c) CV 3 (2019-03-15 to 2022-09-23)

Method Mean Volatility Max Draw Sharpe Calmar


LightGBM-dart without FE 0.0264 0.0297 0.1380 0.8140 0.1913
LightGBM-gbdt without FE 0.0261 0.0336 0.1584 0.7772 0.1648
MLP without FE 0.0224 0.0240 0.1171 0.9339 0.1913

Table 13: Performance of selected machine learning methods on the Numerai dataset
in the test period for various walk-forward cross-validation schemes, (a) CV 1, (b)
CV 2 and (c) CV 3

Robustness under feature selection for dynamic feature neutralisation. A fixed


set of 420 features to be neutralised was given by the Numerai organisers based on
internal evaluations of parameters. In Section 6, we introduce several statistical rules
that allow us to select a varying subset of features to be neutralised in each era based
on empirical heuristic criteria motivated by financial modelling.

To evaluate the robustness of the proposed statistical rules, we draw 100 subsets of 420 features selected at random, and use each subset to neutralise the raw predictions from the ML models. We then evaluate the performance of the ML models based on each of the random subsets. Using the procedure described in Section 6.2, we then select the optimal dynamic feature neutralisation method and compute the performance of the top 10 models with the highest mean Corr, Sharpe and Calmar ratios over the test period. The results are reported in Table 14 and should be compared to the performance of the same models in Table 7, which were obtained with dynamic feature neutralisation using the statistical rules defined in Section 6.2.
The mean Corr of models obtained with random feature neutralisation for each rule (Momentum/Sharpe/Calmar) is lower than that obtained using the statistical rules in Table 7. On the other hand, the Sharpe ratio of the models with random feature neutralisation is slightly higher, as expected due to the variance reduction effect of averaging over 10 different models. For models selected based on the Calmar rule, the models obtained with the statistical rules have a much higher Calmar ratio than those with random feature neutralisation. This suggests that the statistical rules defined can effectively reduce model risk by reducing the linear exposure to undesirable features.
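A minimal sketch of how the random baselines can be drawn is given below; the subset size and count follow the text, and each subset is assumed to be fed to the same per-era neutralisation routine sketched in Section 6.

```python
import numpy as np

def random_neutralisation_subsets(all_features, n_subsets=100, subset_size=420, seed=0):
    """Draw the random baselines used in this robustness check: 100 subsets of 420
    features, sampled without replacement from the full feature list."""
    rng = np.random.default_rng(seed)
    return [list(rng.choice(all_features, size=subset_size, replace=False))
            for _ in range(n_subsets)]
```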

(a) LightGBM-dart without FE

Model Selection Mean Volatility Max Draw Sharpe Calmar


Average 0.0214 0.0147 0.0482 1.4547 0.4440
Momentum 0.0216 0.0149 0.0472 1.4522 0.4576
Sharpe 0.0213 0.0147 0.0459 1.4474 0.4641
Calmar 0.0214 0.0148 0.0453 1.4504 0.4724

(b) LightGBM-gbdt without FE

Model Selection Mean Volatility Max Draw Sharpe Calmar


Average 0.0203 0.0167 0.0664 1.2140 0.3057
Momentum 0.0208 0.0167 0.0641 1.2457 0.3245
Sharpe 0.0206 0.0168 0.0618 1.2267 0.3333
Calmar 0.0216 0.0195 0.0508 1.1102 0.2743

(c) MLP without FE

Model Selection Mean Volatility Max Draw Sharpe Calmar


Average 0.0176 0.0165 0.0831 1.0658 0.2118
Momentum 0.0179 0.0165 0.0790 1.0842 0.2266
Sharpe 0.0177 0.0164 0.0762 1.0751 0.2323
Calmar 0.0175 0.0167 0.0825 1.0511 0.2121

Table 14: Performance of different ML models in the test period (2015-05-15 to 2022-
09-23) obtained with random feature neutralisation. These are averages obtained by
selecting the top 10 models under the different online learning procedures over the
test period.
