Robust Machine Learning Pipelines For Trading Mark
arXiv:2301.00790v1 [q-fin.CP] 30 Dec 2022
Abstract. The application of deep learning algorithms to financial data is difficult due to heavy
non-stationarities which can lead to over-fitted models that underperform under regime changes.
Using the Numerai tournament data set as a motivating example, we propose a machine learning
pipeline for trading market-neutral stock portfolios based on tabular data which is robust under
changes in market conditions. We evaluate various machine-learning models, including Gradient
Boosting Decision Trees (GBDTs) and Neural Networks with and without simple feature engineering,
as the building blocks for the pipeline. We find that GBDT models with dropout display high
performance, robustness and generalisability with relatively low complexity and reduced computa-
tional cost. We then show that online learning techniques can be used in post-prediction processing
to enhance the results. In particular, dynamic feature neutralisation, an efficient procedure that
requires no retraining of models and can be applied post-prediction to any machine learning model,
improves robustness by reducing drawdown in volatile market conditions. Furthermore, we demon-
strate that the creation of model ensembles through dynamic model selection based on recent model
performance leads to improved performance over baseline by improving the Sharpe and Calmar ra-
tios. We also evaluate the robustness of our pipeline across different data splits and random seeds
with good reproducibility of results.
Key words. Robust Machine Learning, Online Learning, Gradient Boosting Decision Trees,
Deep Learning, Stock Trading, Tabular Data
∗ Funding: This work is supported by the Wellcome Trust under Grant 108908/B/15/Z and by
‡ Department of Mathematics, Imperial College London, London SW7 2AZ, U.K.
2 THOMAS WONG AND MAURICIO BARAHONA
of training/test sets [8], can inflate in-sample performance with poor performance
when deployed live. Furthermore, black-box ML models, such as neural networks,
can lack robustness as they are highly sensitive to small changes in parameters and
data, thus resulting in volatile predictions. The non-stationary data and the presence
of regime changes also mean that ML models need to be re-trained with the latest
financial data, a task that is not only computationally costly but also introduces
further uncertainty to the trading models. Yet most studies do not consider model
performance when trained on different segments of historical market data [1–3, 9, 10].
Although reinforcement learning (RL) in online learning settings allows ML models
to adapt to changing environments, deep reinforcement learning models are complex
and require large computational resources [11]. Indeed, applying RL to stock trading
is difficult since the complexity of the action space increases exponentially with the
number of stocks in the portfolio.
The above issues suggest the need to further develop robust ML pipelines for
trading applications possibly based on simpler models that can still operate on non-
stationary, highly stochastic data under regime changes. Here we consider such a
pipeline based on tabular data, which allows the use of traditional ML models, such
as Gradient Boosting Decision Trees (GBDT) and other ensemble methods, to predict
trading stocks and stock indices [12, 13]. This approach also allows the integration
of additional sources of data, such as sentiment analysis of news articles to improve
the prediction accuracy of the direction of stock returns [14]. In particular, we find
that Gradient Boosting models, which are known to be robust to data perturbations,
outperform neural network models. Finally, we show that improved robustness of ML
models and adaptation to regime changes can be attained without the use of deep
reinforcement learning by employing: (i) dynamic feature neutralisation, a simple
approach that reduces the linear correlation to a subset of features evolving in time,
and (ii) dynamic model selection of optimal models from an ensemble based on recent
performance. These approaches robustly improve trading performances by reducing
volatility and drawdown during adversarial market regimes.
To exemplify the above issues, we consider a benchmark financial data platform
that is continuously updated and curated under the Numerai tournament of stock
portfolio prediction [15]. Numerai is a hedge fund that organises a data science com-
petition (as of Oct 2022) and provides free, open-source, high quality standardised
financial data to all participants. As discussed below in more detail, the data set is
given in the form of pre-processed temporal tabular data and the task is the predic-
tion of the relative performances of stocks within an evolving trading universe without
access to the identity of individual stocks. Unlike other financial research papers that
use proprietary data sets which can be difficult to validate [9, 10], this open financial
data competition allows researchers to replicate findings transparently and allows us
to focus on establishing ML end-to-end pipelines to achieve consistent profits trad-
ing a market-neutral portfolio under changing market regimes. Our pipeline, shown
in Figure 1, is built upon simple, yet robust methodologies that avoid some of the
problems of over-fitting and high computational cost inherent to deep methods. The
robustness of the pipeline is enhanced since each step is implemented independently
avoiding data leakage, which is common in other methods such as neural networks,
where the pre-processing and the actual model often share data. Key ingredients
are the post-prediction processing and feature engineering steps, which allow easy
adaptation of models towards regime changes without expensive retraining.
The paper is organised as follows. Section 2 introduces the Numerai datasets
used in this paper. Section 3 describes and discusses the different computational
ROBUST ML MODELS IN FINANCE 3
Fig. 1: Schematic of the Machine Learning pipeline. Starting with the Numerai
data set, we consider feature engineering methods to augment the dataset and train an
ML model (several are evaluated, including neural networks, but we settle for gradient
boosting trees) to obtain the raw predictions. These then go through post-prediction
processing (e.g., dynamic feature neutralisation) to provide normalised predictions,
which are then combined through model ensembling and dynamic model selection
methods to output the predictions that are submitted to the Numerai tournament.
2. Numerai dataset and prediction task. Financial data are often available
in the form of time series. These time series can be treated directly using classic meth-
ods such as ARIMA models [16] and more recently through deep learning methods
such as Temporal Fusion Transformers [17]. However, such methods are easily over-
fitted and lead to expensive retraining for financial data, which are inherently affected
by regime changes and high stochasticity. Alternatively, one can use various feature
engineering methods to transform these time series into tabular form through a pro-
cess sometimes called ‘de-trending’ in the financial industry, where the characteristics
of a financial asset at a particular time point, including features from its history, are
represented by a single dimensional data row (i.e., a vector). In this representation,
the time dimension is not considered explicitly, as the state of the system is captured
through transformed features at each time point and the continuity of the temporal
dimension is not used. For example, we can summarise the time series of the return
of a stock with the mean and standard deviation over different look-back periods.
Grouping these data rows for different financial assets into a table at a given time
point we obtain a tabular dataset. If the features are informative, this representation
can be used for prediction tasks at each time point, and allow us to employ robust
and widely tested ML algorithms that are applicable to tabular data. The Numerai
competition is based on a curated tabular data set with high-quality features made
available to the research community.
Description of the dataset. The Numerai dataset is a temporal tabular dataset.
A temporal tabular dataset is a collection of matrices $\{X_i\}_{1 \le i \le T}$ collected over time
eras 1 to $T$. Each matrix $X_i$ represents data available at era $i$ with shape $N_i \times M$,
where $N_i$ is the number of data samples (number of stocks here) in era $i$ and $M$
is the number of features describing the samples. Note that the features are fixed
throughout the eras, in the sense that the same computational formula is used to
compute the features in each week. On the other hand, the number of data samples
(stocks) $N_i$ does not have to be constant across time.
In the Numerai dataset, the matrices $X_i$ contain $M$ obfuscated global stock market
features (computed by Numerai) for $N_i$ stocks, which are updated weekly (i.e., the
eras are in our case weeks). It is important to remark that the dataset is obfuscated,
i.e., it is not possible for participants to know the identity of stocks or even which
stocks are present each week. Each data row is indexed by a hash index, known only
to Numerai, that maps the data rows to the stocks. As a result, it is not possible
for participants to concatenate different data rows to create a continuous history of
a stock. The matrix $X_i$ thus provides a snapshot of the market at week $i$ presented
as an unknown set of stocks described by a common set of features, such that the
features are computed consistently across all stocks in the week and also computed
consistently across different weeks.
The Numerai dataset starts on 2003-01-03 (Era 1). The tabular set has 1191
features, which are already normalised into 5 equal-sized integer bins, from 0 to 4.
There are 28 target labels which are derived from stock returns using 14 proprietary
normalisation methods (nomi, jerome, janet, ben, alan, paul, george, william, arthur,
thomas, ralph, tyler, victor, waldo) over 2 forward-looking periods (20 trading days,
60 trading days). The main target label to evaluate performance is target-nomi-v4-20,
i.e., the forward 20 trading days return obtained by the nomi normalisation method.
Other targets are named similarly. The target labels are all scaled between 0 and 1,
where a smaller value represents a lower forward return, and are also grouped into
bins. The number of bins differs between normalisation methods: 5 to 7 bins
are created for each target, with the bin sizes following a Gaussian-like distribution,
so that most stocks are within the central bin of 0.5 while only a small number of
stocks are in the tail bins of 0 and 1. We transform the features and labels so that
both become zero-mean: for features, we subtract 2 from the integer bins so that
the transformed bins are -2, -1, 0, 1, 2; for the target labels, we subtract 0.5 so that the
new targets are in the range -0.5 to 0.5.
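The centring transformation described above can be sketched in a few lines of NumPy. This is only an illustration: the function names and the toy arrays are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def centre_features(binned_features):
    """Shift integer feature bins 0..4 to -2..2 (assumed 5 equal-sized bins)."""
    return np.asarray(binned_features, dtype=np.int8) - 2

def centre_targets(targets):
    """Shift targets from the range [0, 1] to [-0.5, 0.5]."""
    return np.asarray(targets, dtype=np.float32) - 0.5

# Toy data standing in for one era (hypothetical values).
features = np.array([[0, 2, 4], [1, 3, 2]])
targets = np.array([0.0, 0.5, 1.0])

centred_features = centre_features(features)
centred_targets = centre_targets(targets)
```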
Prediction task. The tournament task is to predict the stock rankings each week,
ordered from lowest to highest expected return. The scoring is based on Spearman’s
rank correlation of the predicted rankings with the main target label (target-nomi-v4-20).
Hence there is a single overall score each week regardless of the number of stocks
to predict each week. Participants are not scored on the accuracy of the ranking of
each stock individually. Numerai uses the predicted rankings to construct a market-neutral
portfolio which is traded every week (as of Sep 2022), i.e., the hedge fund
buys and short-sells the same dollar amount of stocks. Therefore the relative return
of stocks is more relevant than the absolute return, and the prediction task is a
ranking problem rather than a forecasting problem.
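As an illustration of the weekly score, Spearman's rank correlation can be computed as the Pearson correlation of the ranks of predictions and targets. The sketch below assumes that ties (frequent in the binned targets) receive average ranks; the helper names are hypothetical and the exact tournament scoring code may differ.

```python
import numpy as np

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float).ravel()
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)       # last sorted position of each distinct value
    first = last - counts + 1      # first sorted position of each distinct value
    return ((first + last) / 2.0)[inverse]

def era_spearman(predictions, targets):
    """Spearman correlation for one era: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(predictions), average_ranks(targets))[0, 1])

# One toy era: the predictions order the stocks exactly as the targets do.
predictions = np.array([0.1, 0.4, 0.3, 0.9])
targets = np.array([0.0, 0.5, 0.25, 1.0])
score = era_spearman(predictions, targets)
```

Only the ordering of the predictions matters for this score, which is why the task is a ranking problem.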
3. Methods.
3.1. Robustness in Machine Learning pipelines. In this paper, we aim to
design an ML pipeline focusing on its robustness. Table 1 details issues related to
robustness and reproducibility, as listed in a recent review [5], and how they are
addressed in this paper. By preventing look-ahead bias and other data leakage issues,
our pipeline can be robustly applied to live trading setups.
Table 1: Data analysis design. Some common issues regarding data leakage in
machine learning research [4, 5] and how these issues are dealt with in this study.
In addition to avoiding data leakage, the following design choices are used to improve
the robustness and reliability of the results. Firstly, the impact of random seeds
is reduced by reporting results from average predictions over 10 different random seeds
for each machine learning method. Secondly, the metrics used for model evaluation
are the same as in the Numerai tournament to avoid researcher bias in discounting
unfavourable results. Finally, cross-validation is independent of the effects of random
seeds and other human selection, thus reducing the chance of overfitting models to a
particular data split.
For datasets that involve time, standard cross-validation schemes cannot be used
directly, as a random split of data eras could lead to the training set including data that
appears later in time than the validation and test sets, hence introducing look-ahead
bias. To avoid this problem, we use grouped time-series cross-validation, which splits
data eras according to their chronological order (Figure 2). Note that for financial
datasets, the target labels often involve future asset returns and are reported with a
lag. Therefore, we add a gap between the training and validation sets and similarly
between the validation and test sets.
[Figure 2 (schematic): Training Data, gap, Validation Data, gap, Test Data, arranged in chronological order along the Time axis.]
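A single chronological split with gaps, as described for the grouped time-series cross-validation, might look as follows. This is a minimal sketch: the function name, the era counts and the gap length are hypothetical, and the full scheme used in the paper may involve several such splits.

```python
import numpy as np

def chronological_split(eras, n_train, n_valid, n_test, gap):
    """Split eras in time order, leaving `gap` unused eras between blocks."""
    unique_eras = np.unique(eras)  # np.unique returns sorted values
    needed = n_train + n_valid + n_test + 2 * gap
    if len(unique_eras) < needed:
        raise ValueError("not enough eras for the requested split")
    train = unique_eras[:n_train]
    valid_start = n_train + gap
    valid = unique_eras[valid_start:valid_start + n_valid]
    test_start = valid_start + n_valid + gap
    test = unique_eras[test_start:test_start + n_test]
    return train, valid, test

# 20 toy eras with 3 rows (stocks) each.
eras = np.repeat(np.arange(1, 21), 3)
train_eras, valid_eras, test_eras = chronological_split(
    eras, n_train=8, n_valid=4, n_test=4, gap=2
)
train_mask = np.isin(eras, train_eras)  # row mask selecting the training data
```

All rows of an era stay in the same block, and the gap eras are dropped so that lagged targets in one block do not overlap the next.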
of feature pairs. When the number of features is large, we draw a random subset of
feature pairs to create new features to alleviate the computational cost. Note that
the computation of these features can be done in parallel for data from each era.
A simple way of data augmentation is to add randomness to the feature matrix
with different dropout methods, which are used extensively to reduce over-fitting of
neural network models [18]. Here we apply dropout by multiplying the original data
with a Boolean mask so that some numerical features are set to zero. The dropout is
characterised by its sparsity level (how many features are set to zero) and its sparsity
structure (how to choose the features set to zero). Since our tabular dataset has no
local spatial structure, we use a random Boolean matrix with uniform probability.
This encourages the machine learning methods to learn multiple feature relationships
and reduces reliance on a small set of important features.
For our dataset, we first augment the feature matrix by creating additional fea-
tures obtained by multiplying feature pairs, and then apply dropout with a random
Boolean mask on the augmented feature matrix. A grid search is used to find optimal
hyper-parameters for the feature engineering methods, in particular the number of
feature products and the sparsity level of dropout.
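The two feature engineering steps can be sketched as below, assuming that feature pairs are drawn uniformly at random and that the dropout mask is drawn independently for every entry. The function names and the parameter values are illustrative only.

```python
import numpy as np

def augment_with_products(X, n_products, rng):
    """Append products of randomly drawn pairs of distinct features."""
    n_features = X.shape[1]
    first = rng.integers(0, n_features, size=n_products)
    # A non-zero offset guarantees that the second feature differs from the first.
    second = (first + rng.integers(1, n_features, size=n_products)) % n_features
    return np.hstack([X, X[:, first] * X[:, second]])

def random_dropout(X, sparsity, rng):
    """Set each entry to zero with probability `sparsity` (uniform Boolean mask)."""
    keep = rng.random(X.shape) >= sparsity
    return X * keep

rng = np.random.default_rng(0)
X = rng.integers(-2, 3, size=(6, 4)).astype(np.float32)  # toy centred features
X_aug = augment_with_products(X, n_products=3, rng=rng)
X_drop = random_dropout(X_aug, sparsity=0.2, rng=rng)
```

Both functions act on one era at a time, so they can be applied to the eras in parallel.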
science competitions and code quality, as one of our aims is the replicability of results.
We train all machine learning models with a single GPU, the standard setup for most
participants in data science competitions. Some brief details of the ML models used
are provided in the following.
Gradient Boosting Decision Trees. Boosting can be seen as a generalisation of
generalised additive models (GAMs) in which the additive components of smooth parametric
functions can be replaced by any weak learner, such as a decision tree [20].
Historically, various boosting algorithms have been proposed for different loss functions.
For example, AdaBoost [21] was proposed for binary classification problems
with exponential loss, whereas Gradient Boosting was first proposed by Friedman in
2001 [22] for any smooth loss function. Algorithm 9.1 in the SI outlines the iterative
update equations of gradient boosting.
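For orientation, the textbook form of the gradient boosting iteration can be written as follows. The notation is an assumption made here and need not match Algorithm 9.1 in the SI: $L$ is the loss, $F_m$ the model after $m$ rounds, $h_m$ the weak learner fitted at round $m$, $\nu$ the learning rate and $K$ the number of rounds.

```latex
\begin{align}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c), \\
r_{i,m} &= -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \quad i = 1, \dots, n, \\
h_m &= \arg\min_{h} \sum_{i=1}^{n} \left( r_{i,m} - h(x_i) \right)^2, \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x), \quad m = 1, \dots, K.
\end{align}
```

For the squared loss $L(y, F) = \tfrac{1}{2}(y - F)^2$, the pseudo-residuals reduce to $r_{i,m} = y_i - F_{m-1}(x_i)$, so each new learner is fitted to the residuals of the current model.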
Of the various implementations of gradient boosting decision trees (GBDT) in
Python, we use LightGBM [23] in this paper. CatBoost [24] is not used here as the
Numerai dataset has no categorical features. XGBoost [25] is not used due to its slower
computation and higher memory consumption. Algorithm 9.2 in the SI shows how the
gradient boosting algorithm is implemented with decision trees as the weak learners
in LightGBM. LightGBM implements GBDT models with several computational and
numerical improvements over XGBoost and other implementations. In addition to
traditional gradient boosting decision trees (LightGBM-gbdt), we consider two other
implementations of GBDT models:
• Dropouts meet Multiple Additive Regression Trees (LightGBM-dart) ignores a
portion of trees when computing the gradient for subsequent trees [26], thus
avoiding over-specialisation where the later learned trees can only affect a
few data instances. This reduces the sensitivity of models towards decisions
made by the first few trees.
• Gradient-based One-Side Sampling (LightGBM-goss) reduces the number of
data instances used to build each tree: it keeps data instances with large
absolute gradients and randomly samples a subset of data with small absolute
gradients. The gradient approximation of LightGBM-goss
converges to that of the standard method when the number of data instances is large, and it
outperforms other data sampling methods (e.g., uniform sampling) in most cases.
For all LightGBM models, we use mean squared error (L2 loss) as the loss function
for the regression problems. The number of gradient boosting trees and the learning rate
are optimised by hyper-parameter searches. To prevent the over-fitting of trees, the
maximum depth and number of leaves in each tree and the minimal number of data
samples in the leaves are tuned for each model. L1 and L2 regularisation are also
applied. Data and feature sub-sampling are used to reduce similarities between trees:
before building each tree, a random part of data is selected without re-sampling and
a random subset of features is chosen to build the tree. For LightGBM-dart models,
both the probability to apply dropout during the tree-building process and the portion
of trees to be dropped out are tuned. Early stopping is applied using the validation
dataset for LightGBM-gbdt models to further prevent the over-fitting of models.
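The hyper-parameters listed above map onto LightGBM's scikit-learn style parameter names roughly as in the sketch below. All numerical values are placeholders chosen for illustration, not the tuned values of this study, and the dictionaries are only meant to be passed on to something like `lightgbm.LGBMRegressor(**params)`. Two points are assumptions to verify against the installed LightGBM version: GOSS is selected here through `boosting_type`, whereas newer releases may prefer a separate sampling-strategy option, and data sub-sampling is left out for GOSS because it uses its own sampling scheme.

```python
# Shared settings (illustrative values only).
base_params = {
    "objective": "regression",   # mean squared error (L2 loss)
    "n_estimators": 1000,        # number of boosting trees
    "learning_rate": 0.01,
    "max_depth": 6,              # maximum tree depth
    "num_leaves": 32,            # maximum number of leaves per tree
    "min_child_samples": 100,    # minimal number of samples in a leaf
    "reg_alpha": 0.1,            # L1 regularisation
    "reg_lambda": 0.1,           # L2 regularisation
    "colsample_bytree": 0.5,     # feature sub-sampling per tree
}

# Data sub-sampling before building each tree.
bagging = {"subsample": 0.75, "subsample_freq": 1}

variants = {
    "LightGBM-gbdt": {**base_params, **bagging, "boosting_type": "gbdt"},
    "LightGBM-dart": {
        **base_params,
        **bagging,
        "boosting_type": "dart",
        "drop_rate": 0.1,   # portion of trees dropped when dropout is applied
        "skip_drop": 0.5,   # probability of skipping dropout in an iteration
    },
    "LightGBM-goss": {**base_params, "boosting_type": "goss"},
}
```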
Neural Networks. The most basic architecture of neural networks, multi-layer
perceptron (MLP), failed to outperform gradient boosting models in many benchmark
studies of tabular datasets [19].
Recently, more complex network architectures have been proposed for tabular
data sets, as surveyed in [27]. These new architectures can be classified into two
major groups:
• Hybrid models that combine neural networks with other traditional ML methods,
e.g., decision trees. Neural Oblivious Decision Ensembles (NODE) [28] is
a generalisation of gradient boosting models into differentiable deep decision
trees, allowing end-to-end training with gradient descent optimisers in frameworks such as
PyTorch [29]. DeepGBM [30] combines two neural networks, CatNN to handle
sparse categorical features and GBDT2NN to distil tree structures from
a pre-trained GBDT model to handle numerical features. A major limitation
of these models is their large memory consumption, which makes them run out
of memory on the NVIDIA 3080ti GPU. Therefore we do not use them in our
benchmark analysis.
• Transformer-based models that use deep attention mechanisms to model com-
plex feature relationships. TabNet [31] uses sequential attention to perform
instance-wise feature selection at each decision step, enabling interpretability
and better learning. AutoInt [32] maps and models feature interactions in a
low-dimensional space with a multi-head self-attentive neural network with
residual connections. AutoInt runs out of memory on a single GPU and is
thus not used in our benchmark. TabNet also had similar memory issues,
hence we down-sampled the data by keeping every fifth week of data (i.e.,
20% of the original data) for the training/validation periods, so that TabNet
could be trained on the single GPU used in this study. Our aim is to
compare performance under modest computational resources attainable by a
wide class of users.
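The down-sampling used for TabNet, i.e., keeping every fifth week, can be illustrated as follows. The function name is hypothetical, and the choice of which week to start from is an assumption.

```python
import numpy as np

def keep_every_kth_era(eras, k=5):
    """Boolean row mask that keeps every k-th era in chronological order."""
    kept_eras = np.unique(eras)[::k]  # np.unique returns the eras sorted
    return np.isin(eras, kept_eras)

# 10 toy eras with 2 rows each; eras 1 and 6 are kept.
eras = np.repeat(np.arange(1, 11), 2)
mask = keep_every_kth_era(eras, k=5)
kept_fraction = mask.mean()
```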
In summary, our benchmark analysis includes two NN models: MLP and TabNet
implemented in PyTorch. We use Adam [33] as the gradient optimiser, with the
learning rate automatically determined by PyTorch. We use mean squared error (L2
loss) for the regression problems.
Model Training. We use Optuna [34] to perform the hyper-parameter search (see
section 9.3 in Supplementary Information) and select the hyper-parameters with the
highest Sharpe ratio for the main target (target-nomi-v4-20) in the validation period.
The optimised hyper-parameters for each ML method are then fixed, and we train
10 models, starting the algorithms from 10 different random seeds. We report the
average prediction from these 10 models for evaluation.
Baseline Model. As a baseline, we consider a factor momentum model which is
created by linear combinations of signed features, where the sign of each feature is
determined by the sign of the 52-week moving average of correlations of that feature
with the target. This simple baseline linear model is then compared with the ML
models, which can capture non-linearity in the data.
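A minimal sketch of such a factor momentum baseline is given below. It assumes that the per-era correlations of each feature with the target are already available, that all signed features are combined with equal weights, and that an optional lag excludes the most recent eras whose targets are not yet known. These choices and the function names are assumptions and are not taken from the paper's code.

```python
import numpy as np

def factor_momentum_signs(feature_corrs, era_index, window=52, lag=0):
    """Sign of the moving average of past feature-target correlations.

    feature_corrs has shape (n_eras, n_features); only eras strictly
    before `era_index - lag` are used.
    """
    end = era_index - lag
    start = max(0, end - window)
    if end <= start:
        raise ValueError("no past eras available for the moving average")
    return np.sign(feature_corrs[start:end].mean(axis=0))

def factor_momentum_predict(X, signs):
    """Equal-weighted linear combination of signed features."""
    return X @ signs

# Toy history: three features with constant past correlations.
feature_corrs = np.tile([0.01, -0.02, 0.03], (60, 1))
signs = factor_momentum_signs(feature_corrs, era_index=55)
X = np.array([[1.0, 2.0, 0.0], [-1.0, 0.0, 2.0]])
baseline_predictions = factor_momentum_predict(X, signs)
```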
Comparative results of the ML algorithms and Feature Engineering. Table 2 shows
the performance on the validation and test sets for the different algorithms. We
concentrate on methods that achieve the highest mean Corr, and Sharpe and Calmar
ratios.
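For concreteness, the three summary metrics can be computed from the sequence of per-era correlations as sketched below. The definitions are assumptions made for this illustration: the Sharpe ratio is taken as the mean divided by the standard deviation of the per-era Corr without annualisation, the Maximum Drawdown is measured on the cumulative sum of Corr, and the Calmar ratio is the mean Corr divided by the Maximum Drawdown.

```python
import numpy as np

def mean_corr(era_corrs):
    return float(np.mean(era_corrs))

def sharpe_ratio(era_corrs):
    """Mean over standard deviation of per-era Corr (not annualised)."""
    return float(np.mean(era_corrs) / np.std(era_corrs))

def max_drawdown(era_corrs):
    """Largest peak-to-trough fall of the cumulative sum of Corr."""
    cumulative = np.cumsum(era_corrs)
    running_peak = np.maximum.accumulate(cumulative)
    return float(np.max(running_peak - cumulative))

def calmar_ratio(era_corrs):
    """Mean Corr divided by the Maximum Drawdown (NaN if there is no drawdown)."""
    drawdown = max_drawdown(era_corrs)
    return float(np.mean(era_corrs) / drawdown) if drawdown > 0 else float("nan")

era_corrs = np.array([0.02, -0.01, 0.03, -0.04, 0.01])  # toy per-era scores
```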
Table 2: Performance of different machine learning methods with and without feature
engineering on the Numerai dataset for (a) validation period and (b) test period.
The three top methods according to Sharpe ratio and Maximum Drawdown over
the validation period are shown in italics in (a). The top method according to the
Sharpe ratio and Maximum Drawdown over the test period is shown in boldface in
(b). For TabNet, the pipeline with feature engineering cannot be run due to memory
constraints.
Firstly, we see that almost all ML models performed substantially better than the
factor momentum model (baseline), in both validation and test periods. Whereas the
factor momentum model relies on linear relationships, the capability of ML models
to learn non-linear relationships, in addition to linear ones, adds to their robustness
and improved performance under different, often volatile, market regimes.
Secondly, we observe that Feature Engineering does not improve the performance
of ML models. Although, in principle, Feature Engineering allows GBDT-based meth-
ods to model feature interactions more easily, our results suggest that these interac-
tions are over-fitted during the training process. For neural network-based models,
feature engineering is not strictly necessary, as dropout is already embedded in net-
work architectures.
Thirdly, we note that all ML models scored better in the validation period than
the test period. This is expected, as it is well known that the performance of trading
models deteriorates over time due to overcrowding and regime changes (a phenomenon
known as alpha decay). Models that are over-fitted to recent training data will ex-
perience greater alpha decay than properly regularised models. To select the ML
method, we consider the top models according to the Sharpe and Calmar ratios over
the validation period: a high Sharpe ratio ensures the model has good overall perfor-
mance, whereas a high Calmar ratio ensures good performance against the worst-case
scenario, thus capturing the tail risks of the trading model. Indeed, we find that
LightGBM-dart without feature engineering generalises well to the test period further
into the future.
Finally, we note that LightGBM-gbdt has better generalisation to the test period
than neural network-based models (TabNet), suggesting over-fitting in these complex
deep NN models. This indicates that although over-parameterised models can learn
non-linear relationships in temporal tabular data sets, these relationships may be
difficult to generalise under non-stationary data environments. On the other hand, our
results suggest that, despite their relative simplicity, gradient boosting models capture
non-linearity in a more robust and controlled manner, with early trees capturing linear
relationships and non-linear relationships captured by the later trees, thus reducing
the risk of catastrophic forgetting [35].
In summary, we find that the best performing model in our set is LightGBM-
dart without feature engineering. In the rest of the paper, we will use this model to
illustrate how the pipeline can be further modified with online learning to account
for regime effects. To demonstrate the robustness of our pipeline, and how it can
be applied to improve the performance of any ML model, we will also report the
performance of two other models: a similar GBDT model (LightGBM-gbdt without
feature engineering) and a neural network model (MLP without feature engineering).
In this section, we focus on how to deal with regime effects when using ML
models for financial tabular temporal data sets. Specifically, we consider feature
neutralisation, and reducing the dependence on the initial trees in gradient boosting
models.
Classification into high and low volatility regimes. To classify the financial market
into regimes, we consider an intrinsic measure derived directly from the Numerai
dataset. In particular, we first compute the Numerai Market Index (NMI), i.e., the
weekly performance of the baseline (linear) factor momentum portfolio, and we then
calculate the Numerai Realised Volatility Index (NRVIX), defined as the standard
deviation of NMI rolling over 52 weeks (Fig. 3). The eras are then classified into high
and low volatility, based on a threshold of NRVIX=0.025, the mean over the first 7
years of data (2003-01-03 to 2010-02-26). According to this intrinsic characterisation,
low volatility regimes have stable linear relationships of features to stock returns, often
associated with a good performance by ML models. On the other hand, high volatility
regimes correspond to unstable linear relationships of features to stock returns leading
to poor model performance. Figure 3 shows that high/low NRVIX regimes are well
aligned with macroeconomic events: high volatility regimes include the financial crisis
(2007-2009), the Euro crisis (2011-2012), and the Covid pandemic (2020), whereas low
volatility regimes correspond to benign market conditions with no significant macro
event risks, during which the factor momentum baseline portfolio had good returns.
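The regime classification can be sketched as a rolling standard deviation followed by a threshold. The sketch assumes a population standard deviation over a trailing 52-week window and treats the first weeks, where the window is incomplete, as low volatility; both choices and the function names are assumptions.

```python
import numpy as np

def rolling_std(values, window=52):
    """Trailing rolling standard deviation; NaN until the window is full."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    for end in range(window, len(values) + 1):
        out[end - 1] = values[end - window:end].std()
    return out

def classify_regimes(nmi, threshold=0.025, window=52):
    """Return NRVIX and a Boolean array marking high volatility weeks."""
    nrvix = rolling_std(nmi, window)
    high_vol = np.zeros(nrvix.shape, dtype=bool)
    valid = ~np.isnan(nrvix)
    high_vol[valid] = nrvix[valid] > threshold
    return nrvix, high_vol

# Toy NMI series: 60 calm weeks followed by 60 strongly oscillating weeks.
calm = np.full(60, 0.01)
stressed = np.tile([0.05, -0.05], 30)
nmi = np.concatenate([calm, stressed])
nrvix, high_vol = classify_regimes(nmi)
```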
Fig. 3: High and low volatility regimes in the Numerai data. (a) Numerai
Market Index (NMI) for the period between 2005-01-28 (Era 109) and 2022-09-23
(Era 1016); (b) the computed Numerai Realised Volatility Index (NRVIX) used to
identify the high and low volatility regimes. The high volatility regime refers to weeks
where NRVIX is higher than 0.025 and the low volatility regime refers to weeks where
NRVIX is lower than 0.025.
‘risky features’ (out of the 1191 features). This list of risky features can be used
for feature neutralisation by subtracting the linear correlation using the formula for
Feature Neutral Correlation (FNC). Specifically, given a week of data with n stocks,
let $X \in \mathbb{R}^{n \times 420}$ be the matrix of risky features and $y \in \mathbb{R}^{n}$ the predicted rankings obtained
from a model. For a given neutralisation strength $\beta$, $0 \le \beta \le 1$, the neutralised
predicted ranking $\hat{y}$ is calculated as $\hat{y} = y - \beta X X^{\dagger} y$, where $X^{\dagger}$ is the pseudo-inverse
of $X$. FNC is then calculated as the correlation of the neutralised predicted rankings with the target.
Using this procedure, we reduce the linear dependencies of predictions on features.
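The neutralisation formula translates directly into NumPy using the Moore-Penrose pseudo-inverse. The function name and the toy data below are hypothetical; the computation is applied to one week of data at a time.

```python
import numpy as np

def neutralise(predictions, risky_features, strength=1.0):
    """Remove a fraction `strength` of the linear exposure to the features."""
    X = np.asarray(risky_features, dtype=float)
    y = np.asarray(predictions, dtype=float)
    exposure = X @ (np.linalg.pinv(X) @ y)  # projection of y onto the columns of X
    return y - strength * exposure

rng = np.random.default_rng(0)
X = rng.integers(-2, 3, size=(50, 5)).astype(float)  # toy week: 50 stocks, 5 features
y = rng.random(50)                                    # toy raw predictions
y_neutral = neutralise(y, X, strength=1.0)
```

With full strength the neutralised predictions are orthogonal to every feature column, while a strength of 0 leaves the predictions unchanged.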
is not optimal, and we will show in Section 6 how online learning approaches can be
used to improve the procedure.
5.2. Pruning initial trees in Gradient Boosting models. For gradient-
boosting tree models, we also consider a specific procedure consisting of pruning
initial trees during prediction to reduce feature dependencies. Specifically, we perform
a grid search over the number of initial trees to be pruned off in the trained LightGBM
models, and we cap the number of trees to be pruned to not more than half of the
trees to ensure our models do not degenerate.
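Since a boosted model predicts by summing the contributions of its trees, pruning the initial trees amounts to leaving the first terms out of that sum. The sketch below works on a hypothetical array of per-tree contributions rather than on a trained LightGBM model; recent LightGBM versions expose a `start_iteration` argument in their prediction methods that may serve the same purpose, which should be checked against the installed version.

```python
import numpy as np

def predict_without_initial_trees(tree_outputs, n_pruned):
    """Sum per-tree contributions while skipping the first `n_pruned` trees.

    tree_outputs has shape (n_trees, n_samples) and is assumed to be
    already scaled by the learning rate.
    """
    n_trees = tree_outputs.shape[0]
    if n_pruned > n_trees // 2:
        raise ValueError("at most half of the trees may be pruned")
    return tree_outputs[n_pruned:].sum(axis=0)

# Toy model: 4 trees, 3 samples.
tree_outputs = np.array([
    [0.5, -0.5, 0.0],
    [0.2, 0.1, -0.1],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.1],
])
full_prediction = predict_without_initial_trees(tree_outputs, n_pruned=0)
pruned_prediction = predict_without_initial_trees(tree_outputs, n_pruned=2)
```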
Table 4 compares the performance of LightGBM-dart and LightGBM-gbdt models
pruning different numbers of initial trees before feature neutralisation. Pruning ini-
tial trees during prediction improves the Sharpe and Calmar ratios of both LightGBM
models, but LightGBM-gbdt models see a bigger improvement than LightGBM-dart
models. This is expected as LightGBM-dart models already employ a similar fun-
damental idea during training, i.e., the trained trees in LightGBM-dart models are
already optimised. Our numerics also suggest that there is a limit of trees to be pruned
such that there is little improvement in model performance once over a threshold of
around 100-250 trees.
Table 5: The joint effect of feature neutralisation and tree pruning. Perfor-
mance of different LightGBM models after neutralisation in the test period (2014-06-
27 to 2022-09-23) when pruning different numbers of initial trees.
features but the ‘Low Mean’ neutralisation method has the best Sharpe and Calmar
ratios for all ML models, followed by neutralisation of ‘High Volatility’ features. The
worse performance of ‘High Mean’ and ‘Low Volatility’ neutralisations suggests that a
large part of the model risks can be attributed to recently underperforming and high
volatility features.
that have a low variance, and stable performance in the last 52 weeks. During volatile
regimes, these features will underperform. Models that neutralise these features can
then outperform when there is market stress.
6.2. Dynamic model selection. In practice, it is not possible to know the best
dynamic feature engineering methods in advance. Therefore, we propose an online
learning procedure to select the dynamic feature engineering method during the test
period consisting of two steps. The first step is to have a warm-up period to collect
data on model performances, during which all 5 feature neutralisation methods (fixed,
low mean, high mean, low vol, high vol) have equal weighting. The second step is to
allocate weights to the optimal model based on recent performance according to the
following criteria:
• ‘Average’: Using all five feature neutralisation methods with equal weighting
• ‘Momentum’: Using the feature neutralisation method with the highest Mean
Corr in the last 52 weeks
• ‘Sharpe’: Using the feature neutralisation method with the highest Sharpe
Ratio in the last 52 weeks
• ‘Calmar’: Using the feature neutralisation method with the highest Calmar
ratio in the last 52 weeks
In Table 7, we use these criteria to select the optimal dynamic feature engineering
method based on recent performance. As above, a lag of 6 weeks is applied to account
for data delays.
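The two-step selection procedure can be sketched as follows. The sketch assumes that the per-era Corr of each candidate method is stored in an array, that equal weights are used whenever fewer than 52 lagged eras are available (the warm-up period), and that the Sharpe and Calmar criteria use the same definitions as assumed earlier for evaluation. All names are hypothetical.

```python
import numpy as np

def select_weights(era_corrs, era_index, criterion="momentum", window=52, lag=6):
    """Weights over candidate methods for the predictions at `era_index`.

    era_corrs has shape (n_eras, n_methods); only eras in the interval
    [era_index - lag - window, era_index - lag) are used.
    """
    n_methods = era_corrs.shape[1]
    end = era_index - lag
    start = end - window
    if criterion == "average" or start < 0:
        return np.full(n_methods, 1.0 / n_methods)  # equal weights / warm-up
    recent = era_corrs[start:end]
    mean = recent.mean(axis=0)
    if criterion == "momentum":
        score = mean
    elif criterion == "sharpe":
        score = mean / np.maximum(recent.std(axis=0), 1e-12)
    elif criterion == "calmar":
        cumulative = recent.cumsum(axis=0)
        drawdown = (np.maximum.accumulate(cumulative, axis=0) - cumulative).max(axis=0)
        score = mean / np.maximum(drawdown, 1e-12)
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    weights = np.zeros(n_methods)
    weights[np.argmax(score)] = 1.0  # all weight on the best recent method
    return weights

# Toy history of 100 eras for three candidate methods.
era_corrs = np.zeros((100, 3))
era_corrs[:, 0] = np.tile([0.01, 0.00], 50)
era_corrs[:, 1] = np.tile([0.03, 0.01], 50)
era_corrs[:, 2] = -0.01
warmup_weights = select_weights(era_corrs, era_index=10)
momentum_weights = select_weights(era_corrs, era_index=80, criterion="momentum")
```

The final prediction for an era is then the weighted combination of the candidate predictions, which requires no retraining of the underlying models.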
The online learning procedure can thus select the optimal dynamic feature engi-
neering method to outperform the ‘Average’ selection in most cases. For all three ML
models (LightGBM-dart/LightGBM-gbdt/MLP), the ‘Momentum’ selection method
has higher mean Corr and Calmar ratio than the ‘Average’ (baseline) and ‘Sharpe’
methods. This shows that the ‘Momentum’ method, a very simple model selection
method that chooses the recent best-performing model, can adapt a trained ML model
towards different market regimes efficiently. For LightGBM-dart and LightGBM-gbdt
models, the ‘Calmar’ selection method gives a higher Calmar ratio than the ‘Momen-
tum’ method but with a lower mean Corr. For MLP models, the ‘Calmar’ selection
method significantly under-performs other model selection methods, with a much
higher Max Drawdown. This suggests that selection based on historical drawdown is
not robust, especially under situations with regime changes.
In summary, the proposed online learning procedure to select optimal dynamic
feature engineering methods can significantly reduce trading risks and improve the
robustness of trading models, outperforming the baseline selection method that takes
a simple average of all available models.
7. Discussion. Motivated by the Numerai tournament, we have designed here
an ML pipeline that can be applied to tabular temporal data of stock prices to under-
pin strategies for trading of market-neutral stock portfolios. The various steps in the
ML pipeline are carefully designed for robustness against regime changes and to avoid
information leakage through time. We thus aim to obtain models with relatively low
complexity, so as to reduce the danger of over-fitting, and with high robustness to
changes in hyper-parameters and other choices in the algorithms. Another aim is to
Regarding the choice of ML models, we find that gradient-boosting decision tree
models are both more robust and interpretable than neural network-based models,
and they allow more consistent performance under different market regimes.
We also find that post-prediction processing, which is model-agnostic, is an effec-
tive means of adapting trained ML models towards new situations without the need
8. Data and Code Availability. The data and code used in this paper are
available at https://github.com/ThomasWong2022/numerai-benchmark.
REFERENCES
[1] J. Arosemena, N. Perez, D. Benitez, D. Riofrio, and R. Flores-Moyano, “Stock price analy-
sis with deep-learning models,” in 2021 IEEE Colombian Conference on Applications of
Computational Intelligence (ColCACI), 2021, pp. 1–6.
[2] S. Selvin, R. Vinayakumar, E. A. Gopalakrishnan, V. K. Menon, and K. P. Soman, “Stock price
prediction using lstm, rnn and cnn-sliding window model,” in 2017 International Confer-
ence on Advances in Computing, Communications and Informatics (ICACCI), 2017, pp.
1643–1647.
[3] K. A. Althelaya, E.-S. M. El-Alfy, and S. Mohammed, “Evaluation of bidirectional lstm for
short-and long-term stock market prediction,” in 2018 9th International Conference on
Information and Communication Systems (ICICS), 2018, pp. 151–156.
[4] J. Hullman, S. Kapoor, P. Nanayakkara, A. Gelman, and A. Narayanan, “The worst of both
worlds: A comparative analysis of errors in learning from data in psychology and machine
learning,” 2022. [Online]. Available: https://arxiv.org/abs/2203.06498
[5] S. Kapoor and A. Narayanan, “Leakage and the reproducibility crisis in ml-based science,”
2022. [Online]. Available: https://arxiv.org/abs/2207.07048
[6] E. Rivera-Landos, F. Khomh, and A. Nikanjam, “The challenge of reproducible ml: an empirical
study on the impact of bugs,” 2021. [Online]. Available: https://arxiv.org/abs/2109.03991
[7] [Online]. Available: https://arxiv.org/abs/2006.12117
[8] J. Bollen, H. Mao, and X. Zeng, “Twitter mood predicts the stock market,” Journal
of Computational Science, vol. 2, no. 1, pp. 1–8, 2011. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S187775031100007X
[9] J. P. N and B. Vasudevan, “Effective implementation of neural network model with tune pa-
rameter for stock market predictions,” in 2021 2nd International Conference on Smart
Electronics and Communication (ICOSEC), 2021, pp. 1038–1042.
[10] S. Singh and S. Sharma, “Forecasting stock price using partial least squares regression,” in
2018 8th International Conference on Cloud Computing, Data Science & Engineering
(Confluence), 2018, pp. 587–591.
[11] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang,
W. Paul, M. I. Jordan, and I. Stoica, “Ray: A distributed framework for emerging ai
applications,” 2017. [Online]. Available: https://arxiv.org/abs/1712.05889
[12] F. Zhou, Q. Zhang, D. Sornette, and L. Jiang, “Cascading logistic regression
onto gradient boosted decision trees for forecasting and trading stock indices,”
Applied Soft Computing, vol. 84, p. 105747, 2019. [Online]. Available: https:
//www.sciencedirect.com/science/article/pii/S1568494619305289
[13] C. Bockel-Rickermann, “Predicting day-ahead stock returns using search engine query
volumes: An application of gradient boosted decision trees to the s&p 100,” 2022.
[Online]. Available: https://arxiv.org/abs/2205.15853
[14] R. Akita, A. Yoshihara, T. Matsubara, and K. Uehara, “Deep learning for stock prediction using
numerical and textual information,” in 2016 IEEE/ACIS 15th International Conference
on Computer and Information Science (ICIS), 2016, pp. 1–6.
[15] Numerai. Numerai Hedge Fund. (2022, Apr 12). [Online]. Available: https://numerai.fund/
[16] D. B. Percival and A. T. Walden, Spectral Analysis for Univariate Time Series, ser. Cambridge
Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2020.
[17] B. Lim, S. O. Arik, N. Loeff, and T. Pfister, “Temporal fusion transformers
for interpretable multi-horizon time series forecasting,” 2019. [Online]. Available:
https://arxiv.org/abs/1912.09363
[18] A. Kadra, M. Lindauer, F. Hutter, and J. Grabocka, “Well-tuned simple nets excel on tabular
datasets,” 2021. [Online]. Available: https://arxiv.org/abs/2106.11189
[19] R. Shwartz-Ziv and A. Armon, “Tabular data: Deep learning is not all you need,” 2021.
[Online]. Available: https://arxiv.org/abs/2106.03253
[20] S. B. Kotsiantis, “Decision trees: a recent overview,” Artificial Intelligence Review, vol. 39, pp.
261–283, 2011.
[21] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and
an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1,
pp. 119–139, 1997. [Online]. Available: https://www.sciencedirect.com/science/article/
pii/S002200009791504X
[22] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” The Annals
of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
[23] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y.
Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” in Advances
in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran
Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/
file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
[24] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, “Catboost: unbiased
boosting with categorical features,” in Advances in Neural Information Processing
Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and
R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018. [Online]. Available: https:
//proceedings.neurips.cc/paper/2018/file/14491b756b3a51daac41c24863285549-Paper.pdf
[25] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, ser. KDD ’16. New York, NY, USA: Association for Computing Machinery,
2016, pp. 785–794. [Online]. Available: https://doi.org/10.1145/2939672.2939785
[26] R. Korlakai Vinayak and R. Gilad-Bachrach, “DART: Dropouts meet Multiple Additive
Regression Trees,” in Proceedings of the Eighteenth International Conference on Artificial
Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Lebanon
and S. V. N. Vishwanathan, Eds., vol. 38. San Diego, California, USA: PMLR,
9. Supplementary Information.
9.1. Additional results for dynamic feature neutralisation. Here we show
the performance of dynamic feature neutralisation for low and high volatility regimes.
Algorithm 9.2 Gradient boosting tree algorithm implemented in LightGBM [22, 23, 43]
Initialise $f_0(x) = \arg\min_{\alpha_0} \sum_{i=1}^{N} l(y_i, \alpha_0)$
For $k = 1 : K$
    For $i = 1, 2, \ldots, N$, compute the gradient residual using
    $g_{ik} = -\frac{\partial l(y_i, f_{k-1}(x_i))}{\partial f_{k-1}(x_i)}$
    Fit a decision tree to the targets $g_{ik}$, giving terminal leaves $R_{kj}$,
    $j = 1, 2, \ldots, J_k$, where $J_k$ is the number of terminal leaves.
    For $j = 1, 2, \ldots, J_k$, compute
    $\alpha_{kj} = \arg\min_{\alpha} \sum_{x_i \in R_{kj}} l(y_i, f_{k-1}(x_i) + \alpha)$
    Update boosting trees with learning rate $\lambda$:
    $f_k(x) = f_{k-1}(x) + \lambda \sum_{j=1}^{J_k} \alpha_{kj} I(x \in R_{kj})$
Return $f_K(x)$
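To make the steps of the algorithm concrete, the following is a minimal pure-Python sketch under simplifying assumptions: a squared-error loss, one-dimensional inputs, and two-leaf trees (stumps). With this loss the gradient residual is y_i minus the current prediction, and the optimal leaf value is the mean residual in the leaf. The helper names `fit_stump` and `boost` are hypothetical, and the sketch does not reproduce the histogram-based tree construction used in LightGBM.

```python
def mean(values):
    return sum(values) / len(values)


def fit_stump(x, g):
    """Return the threshold of the best two-leaf tree for targets g, or None."""
    best_sse, best_t = None, None
    cuts = sorted(set(x))
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2
        left = [gi for xi, gi in zip(x, g) if xi <= t]
        right = [gi for xi, gi in zip(x, g) if xi > t]
        ml, mr = mean(left), mean(right)
        sse = sum((gi - ml) ** 2 for gi in left) + sum((gi - mr) ** 2 for gi in right)
        if best_sse is None or sse < best_sse:
            best_sse, best_t = sse, t
    return best_t


def boost(x, y, n_trees=50, learning_rate=0.1):
    """Gradient boosting with stumps for the loss l(y, f) = (y - f)^2 / 2."""
    f0 = mean(y)                      # initialise with the best constant
    pred = [f0] * len(y)
    trees = []                        # list of (threshold, (alpha_left, alpha_right))
    for _ in range(n_trees):
        g = [yi - pi for yi, pi in zip(y, pred)]   # negative gradient of the loss
        t = fit_stump(x, g)
        if t is None:                 # no split possible (all inputs identical)
            break
        alpha = (
            mean([gi for xi, gi in zip(x, g) if xi <= t]),  # leaf value, left
            mean([gi for xi, gi in zip(x, g) if xi > t]),   # leaf value, right
        )
        trees.append((t, alpha))
        pred = [
            pi + learning_rate * (alpha[0] if xi <= t else alpha[1])
            for xi, pi in zip(x, pred)
        ]

    def predict(x_new):
        return f0 + learning_rate * sum(a[0] if x_new <= t else a[1] for t, a in trees)

    return predict


# Toy usage: a step function is recovered by repeatedly fitting the residuals.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
model = boost(x, y)
print(model(1.0), model(6.0))
```

Each round shrinks the remaining residual by a factor of (1 minus the learning rate) on this toy problem, which shows how a small learning rate trades more trees for smoother convergence.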
• Feature Engineering
– Numerai Basic Feature Engineering
∗ dropout_pct: low:0.05, high:0.25, step:0.05
∗ no_product_features: low:50, high:1000, step:50
• ML Models
– LightGBM-gbdt
∗ n_estimators: low:50, high:1000, step:50
∗ learning_rate: low:0.005, high:0.1, log:True
∗ min_data_in_leaf: low:2500, high:40000, step:2500
∗ lambda_l1: low:0.01, high:1, log:True
∗ lambda_l2: low:0.01, high:1, log:True
∗ feature_fraction: low:0.1, high:1, step:0.05
∗ bagging_fraction: low:0.5, high:1, step:0.05
∗ bagging_freq: low:10, high:50, step:10
– LightGBM-dart
∗ n_estimators: low:50, high:1000, step:50
∗ learning_rate: low:0.005, high:0.1, log:True
∗ min_data_in_leaf: low:2500, high:40000, step:2500
∗ lambda_l1: low:0.01, high:1, log:True
∗ lambda_l2: low:0.01, high:1, log:True
∗ feature_fraction: low:0.1, high:1, step:0.05
∗ bagging_fraction: low:0.5, high:1, step:0.05
∗ bagging_freq: low:10, high:50, step:10
∗ drop_rate: low:0.1, high:0.5, step:0.1
∗ skip_drop: low:0.1, high:0.8, step:0.1
– LightGBM-goss
∗ n_estimators: low:50, high:1000, step:50
∗ learning_rate: low:0.005, high:0.1, log:True
∗ min_data_in_leaf: low:2500, high:40000, step:2500
∗ lambda_l1: low:0.01, high:1, log:True
∗ lambda_l2: low:0.01, high:1, log:True
∗ feature_fraction: low:0.1, high:1, step:0.05
∗ bagging_fraction: low:0.5, high:1, step:0.05
∗ bagging_freq: low:10, high:50, step:10
∗ top_rate: low:0.1, high:0.4, step:0.05
∗ other_rate: low:0.05, high:0.2, step:0.05
• Machine Learning
– MLP
∗ max_epochs: low:10, high:100, step:5
∗ patience: low:5, high:20, step:5
∗ num_layers: low:2, high:7, step:1
∗ neurons: low:64, high:1024, step:64
∗ neuron_scale: low:0.3, high:1, log:True
∗ dropout: low:0.1, high:0.9, log:True
∗ batch_size: low:10240, high:40960, step:10240
– TabNet
∗ max_epochs: low:10, high:100, step:5
∗ patience: low:5, high:20, step:5
∗ batch_size: low:1024, high:4096, step:1024
∗ num_d: low:4, high:16, step:4
∗ num_a: low:4, high:16, step:4
∗ num_steps: low:1, high:3, step:1
∗ num_shared: low:1, high:3, step:1
∗ num_independent: low:1, high:3, step:1
∗ gamma: low:1, high:2, step:0.1
∗ momentum: low:0.01, high:0.4, step:0.01
∗ lambda_sparse: low:0.0001, high:0.01, log:True
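As an illustration of how a search space written in this low/high/step/log form can be sampled, the sketch below draws one configuration for a subset of the LightGBM parameters listed above. It uses plain random search from the Python standard library as a stand-in; it is not the optimiser used by the authors, and the dictionary layout and the `sample_param` helper are assumptions made for the example.

```python
import math
import random

# A subset of the ranges listed above, written as plain dictionaries.
SEARCH_SPACE = {
    "n_estimators": {"low": 50, "high": 1000, "step": 50},
    "learning_rate": {"low": 0.005, "high": 0.1, "log": True},
    "min_data_in_leaf": {"low": 2500, "high": 40000, "step": 2500},
    "feature_fraction": {"low": 0.1, "high": 1, "step": 0.05},
}


def sample_param(spec, rng):
    """Draw one value: log-uniform if `log` is set, otherwise on a grid of `step`."""
    low, high = spec["low"], spec["high"]
    if spec.get("log"):
        return math.exp(rng.uniform(math.log(low), math.log(high)))
    step = spec["step"]
    n_steps = round((high - low) / step)
    value = low + step * rng.randint(0, n_steps)
    if all(isinstance(v, int) for v in (low, high, step)):
        return value                  # keep integer parameters as integers
    return round(value, 10)           # remove floating-point noise on the grid


rng = random.Random(0)
params = {name: sample_param(spec, rng) for name, spec in SEARCH_SPACE.items()}
print(params)
```

Sampling the learning rate and regularisation strengths on a log scale gives equal weight to each order of magnitude, which is the usual reason for marking such parameters with log:True.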
9.4. Robustness of ML pipeline. One of the aims in this work was to provide
a robust pipeline for tabular temporal data under regime changes. Here we present
additional results of the robustness of the method under different scenarios and sources
of variability.
Robustness under changes of random seeds in the learning algorithms. In Ta-
ble 10, we report the variability of the performance of the LightGBM-dart, LightGBM-
gbdt and MLP models trained starting from 10 different initial random seeds. The
performance is generally robust to changes in the random seed, with small variance
across seeds in the mean Corr and volatility, and moderate variance in the Maximum
Drawdown.
Table 10: Variability of the performance of ML models in the test period (2014-06-27
to 2022-09-23). The mean and standard deviation of each portfolio metric are
calculated over models with 10 different random seeds for each method.
Robustness under different cross-validation data splits. As financial data are regime
dependent, an important test of model robustness is to train ML models using
different cross-validation splits of the data and to quantify how much the model
performance changes over the corresponding test periods.
To ascertain the robustness to the choice of data split, we have carried out 3
cross-validation splits (CV 1, CV 2, CV 3) as shown in Table 12. The hyper-parameters
are optimised
under CV 1, which is the cross-validation used to generate the model performances
in the main text. These hyper-parameters are fixed for the models trained under
the CV 2 and CV 3 splits. For ML methods that require early stopping, the data
in the validation period (different for each split) are used to regularise the models.
Therefore, by reusing the optimised hyper-parameters across all splits, we evaluate
the robustness of the model performance to the optimisation of hyper-parameters. We
then compute the performance when applying the models to shifted cross-validation
datasets in the walk-forward CV 2 and CV 3 data splits. Our results show good
consistency in performance across CV 2 and CV 3, with only a small deterioration of
the results as compared to CV 1 (over which the hyper-parameters were optimised).
We also find that LightGBM-dart with FE, the ML method that has the highest
mean Corr in CV 1, also has the greatest return and the best Sharpe and Calmar
ratios in the other cross-validation splits, as seen in Table 13.
       Train Start   Train End    Validation Start   Validation End   Enter Ensemble
CV 1   2003-01-03    2012-07-27   2012-12-21         2014-11-14       2015-05-15
CV 2   2003-01-03    2014-06-27   2014-11-21         2016-10-14       2017-04-14
CV 3   2003-01-03    2016-05-27   2016-10-21         2018-09-14       2019-03-15
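The dates in the table are consistent with an expanding-window scheme in which the training start is fixed and every other boundary moves forward by 100 weeks from one split to the next. The sketch below reproduces CV 2 and CV 3 from CV 1 under that reading. The 100-week step is inferred from the tabulated dates rather than stated in the text, and the field names and the `shift_split` helper are hypothetical.

```python
from datetime import date, timedelta

FIELDS = ("train_start", "train_end", "validation_start",
          "validation_end", "enter_ensemble")

# CV 1 boundaries, copied from the table.
CV1 = dict(zip(FIELDS, (date(2003, 1, 3), date(2012, 7, 27), date(2012, 12, 21),
                        date(2014, 11, 14), date(2015, 5, 15))))

STEP_WEEKS = 100  # inferred from the table, not stated in the text


def shift_split(split, weeks):
    """Move every boundary except the fixed training start forward by `weeks`."""
    delta = timedelta(weeks=weeks)
    return {k: (v if k == "train_start" else v + delta) for k, v in split.items()}


cv2 = shift_split(CV1, STEP_WEEKS)
cv3 = shift_split(cv2, STEP_WEEKS)

for name, split in (("CV 1", CV1), ("CV 2", cv2), ("CV 3", cv3)):
    gap_weeks = (split["validation_start"] - split["train_end"]).days // 7
    print(name, *(split[f].isoformat() for f in FIELDS), f"gap={gap_weeks}w")
```

In all three splits the validation period starts 21 weeks after the end of training, so the gap between training and validation data is preserved when the window is shifted.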
Table 13: Performance of selected machine learning methods on the Numerai dataset
in the test period for various walk-forward cross-validation schemes, (a) CV 1, (b)
CV 2 and (c) CV 3
To evaluate the robustness of the proposed statistical rules, we draw 100 subsets
of 420 features selected at random, and use each subset to neutralise the raw
predictions from ML models. We then evaluate the performance of ML models based on
each of the random subsets. Using the procedure described in section 6.2 we then
select the optimal dynamic feature neutralisation method and compute the performance
of the 10 models with the highest mean Corr, Sharpe and Calmar ratio over the test
period.
The results are reported in Table 14 and should be compared to the performance of
the same models in Table 7, which were obtained with dynamic feature neutralisation
using the statistical rules defined in section 6.2.
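The random-subset experiment can be sketched as follows, under stated assumptions. Neutralisation is implemented here as removing the linear projection of the predictions onto the selected feature columns (computed with a Gram-Schmidt orthogonalisation), applied to a single cross-section without centring; the exact neutralisation formula of the pipeline is defined elsewhere in the paper and may differ in detail. The toy sizes (8 features, subsets of 3, 5 draws) stand in for the 420-feature subsets and 100 draws of the text, and the helper names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neutralise(pred, feature_columns, proportion=1.0):
    """Subtract `proportion` of the projection of pred onto the feature columns."""
    basis = []
    for col in feature_columns:
        q = list(col)
        for b in basis:                       # modified Gram-Schmidt step
            c = dot(q, b)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        norm = dot(q, q) ** 0.5
        if norm > 1e-10:                      # skip (near-)collinear columns
            basis.append([qi / norm for qi in q])
    out = list(pred)
    for b in basis:
        c = dot(pred, b)
        out = [oi - proportion * c * bi for oi, bi in zip(out, b)]
    return out


def random_subsets(n_features, subset_size, n_subsets, seed=0):
    """Draw feature-index subsets uniformly at random, without replacement."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_subsets)]


# Toy data: synthetic features (stored column-wise) and raw predictions.
rng = random.Random(0)
n_rows, n_features = 50, 8
features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]
raw_pred = [rng.gauss(0, 1) for _ in range(n_rows)]

subsets = random_subsets(n_features, subset_size=3, n_subsets=5)
neutralised = [neutralise(raw_pred, [features[j] for j in s]) for s in subsets]
print(subsets[0], max(abs(dot(neutralised[0], features[j])) for j in subsets[0]))
```

After full neutralisation the predictions have zero linear exposure to every feature in the chosen subset, so each random subset yields a different candidate model that can then be ranked with the selection rules.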
The mean Corr of models obtained with random feature neutralisation for each
rule (Momentum/Sharpe/Calmar) is lower than that obtained using the statistical
rules in Table 7. On the other hand, the Sharpe ratio of models with random
feature neutralisation is slightly higher, as expected from the variance reduction
achieved by averaging over 10 different models. For models selected based on the
Calmar rule, the models obtained with statistical rules have a much higher Calmar
ratio than those obtained with random feature neutralisation. This suggests that the
statistical rules can effectively reduce model risks by reducing linear exposure to
undesirable features.
Table 14: Performance of different ML models in the test period (2015-05-15 to 2022-
09-23) obtained with random feature neutralisation. These are averages obtained by
selecting the top 10 models under the different online learning procedures over the
test period.