Combining Forecasts of Time Series with Complex Seasonality Using LSTM-Based Meta-Learning †
Grzegorz Dudek
Electrical Engineering Faculty, Czestochowa University of Technology, Al. AK 17, 42-200 Częstochowa, Poland;
[email protected]
† Presented at the 9th International Conference on Time Series and Forecasting, Gran Canaria, Spain,
12–14 July 2023.
Abstract: In this paper, we propose a method for combining forecasts generated by different models
based on long short-term memory (LSTM) ensemble learning. While typical approaches for combining
forecasts involve simple averaging or linear combinations of individual forecasts, machine learning
techniques enable more sophisticated methods of combining forecasts through meta-learning, leading
to improved forecasting accuracy. LSTM’s recurrent architecture and internal states offer enhanced
possibilities for combining forecasts by incorporating additional information from the recent past. We
define various meta-learning variants for seasonal time series and evaluate the LSTM meta-learner
on multiple forecasting problems, demonstrating its superior performance compared to simple
averaging and linear regression.
Keywords: ensemble forecasting; LSTM; machine learning; multiple seasonal patterns; short-term
load forecasting
1. Introduction
Real-world time series can exhibit various complex properties such as time-varying
trends, multiple seasonal patterns, random fluctuations, and structural breaks. Given this
complexity, it can be challenging to identify a single best model to accurately approximate
the underlying data-generating process [1]. To address this issue, a common approach is to
combine multiple forecasting models to capture the multiple drivers of the data-generating
process and mitigate uncertainties regarding model form and parameter specification [2].
This approach, known as ensemble forecasting or combining forecasts, has been shown to
be effective in improving the accuracy and reliability of time series forecasts. By combining
forecasts, the aim is to take advantage of the strengths of multiple models and reduce the
impact of their individual weaknesses.
There are several potential explanations for the strong performance of forecast combinations.
Firstly, by combining forecasts, the resulting ensemble can capture a broader range
of information and better handle the forecasting problem complexity. It can leverage the
strengths of individual models, as each model may capture different aspects of the underlying
data-generating process. Therefore, the resulting ensemble can incorporate partial
and incompletely overlapping information, leading to improved accuracy and robustness.
Secondly, in the presence of structural breaks and other instabilities, combining forecasts
from models with different degrees of misspecification and adaptability can mitigate the
problem. This is because individual models may perform well under certain conditions but
poorly under others, and by combining them, the ensemble can better handle a range of
potential scenarios [3]. Finally, forecast combinations can improve stability compared to
using a single model, as the ensemble is less sensitive to the idiosyncrasies of individual
models. This means that the resulting forecasts are less likely to be influenced by outliers
or errors in individual models, leading to more reliable predictions.
In the classical view, the ensemble prediction obtained by combining the predictions from
multiple models can be thought of as an average of the individual predictions. The variance
of an average of multiple independent random variables is typically lower than the
variance of a single variable, provided the individual predictions are diverse.
Therefore, a key issue in ensemble learning is ensuring diversity among the individual
models being combined. If the models are too similar, the ensemble may not be able to
capture the full range of possible outcomes and may not improve predictive performance.
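To make the variance argument concrete, assume the errors e_1, ..., e_n of n unbiased individual forecasts are uncorrelated with common variance σ² (an idealization of full diversity). Then

$$
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right)
= \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(e_i)
= \frac{\sigma^{2}}{n}.
$$

If instead the errors share a common pairwise correlation ρ, the variance of the averaged error becomes σ²(ρ + (1 − ρ)/n), so the benefit of averaging shrinks as the models become more alike (ρ → 1); this is precisely why diversity matters.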
In this work, we ensure high diversity among models by using non-interfering models
with different operating principles and architectures, including statistical, machine learning
(ML), and hybrid models (see Section 3.2).
A simple arithmetic average of forecasts based on equal weights is a popular and
surprisingly robust combination rule, outperforming more complicated weighting schemes
in many cases [4,5]. Other strategies, such as using the median, mode, trimmed means,
and winsorized means, are also applied [6]. To differentiate weights assigned to individual
models, linear regression can be used, where the vector of past observations is the response
variable and the matrix of past individual forecasts is the predictor variable. Combination
weights can be estimated using ordinary least squares. The weights can reflect individual
models’ performance on historical data [7]. Time-varying weights can be used to improve
forecasting ability in the presence of instabilities, and principal components regression can
be used as a solution for multicollinearity [8]. Weights can also be derived from information
criteria such as AIC [9].
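A minimal sketch of this regression-based combination is given below; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def ols_combination_weights(F, y):
    """Estimate combination weights by ordinary least squares.

    F : (T, M) matrix of past forecasts from M base models (predictors)
    y : (T,)   vector of past observations (response)

    Returns w of shape (M+1,) such that y ≈ w[0] + F @ w[1:].
    """
    X = np.column_stack([np.ones(len(y)), F])   # prepend an intercept column
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS solution
    return w

def combine(new_forecasts, w):
    """Combine the M base forecasts for a new time point."""
    return w[0] + new_forecasts @ w[1:]
```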
Linear combination approaches assume a linear dependence between constituent
forecasts and the variable of interest, and may not result in the best forecast, especially if
the individual forecasts come from nonlinear models or if the true relationship between
base forecasts and the target has a nonlinear form [10]. In contrast, ML models can combine
the base forecasts nonlinearly using a stacking procedure.
Stacking is an ensemble ML algorithm that learns how to best combine predictions
from multiple models, using the concept of meta-learning to boost forecasting accuracy
beyond that achieved by the individual models. Neural networks (NNs) are often used
in stacking to estimate the nonlinear mapping between the target value and its forecasts
produced by multiple models [11]. The power of ensemble learning for forecasting was
demonstrated in [12], where several meta-learning approaches were evaluated on a large
and diverse set of time series data. Ensemble methods were found to provide a benefit
in overall forecasting accuracy, with simple ensemble methods leading to good results on
average. However, there was no single meta-learning method that was suitable for all
time series.
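Schematically, stacking proceeds in two stages: the base models are fitted first, and a meta-learner is then trained on their out-of-sample forecasts. Below is a sketch under the simplifying assumption of a single held-out meta-training window (cross-validated variants are common); the names and the MLP meta-learner are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def stacked_forecast(base_models, X_train, y_train, X_meta, y_meta, X_test):
    """Two-stage stacking with a neural-network meta-learner."""
    # Stage 1: fit each base model; collect its forecasts on a window
    # that was not used for its own training.
    for model in base_models:
        model.fit(X_train, y_train)
    F_meta = np.column_stack([m.predict(X_meta) for m in base_models])
    F_test = np.column_stack([m.predict(X_test) for m in base_models])

    # Stage 2: learn the (possibly nonlinear) mapping from the base
    # forecasts to the target.
    meta = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000)
    meta.fit(F_meta, y_meta)
    return meta.predict(F_test)
```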
The main contributions of this study can be summarized in the following three aspects:
1. A meta-learning approach based on LSTM is proposed for combining forecasts. This
approach incorporates past information accumulated in the internal states, improving
accuracy, especially in cases where there is a temporal relationship between base
forecasts for successive time points.
2. Various meta-learning variants for time series with multiple seasonal patterns are
proposed, such as the use of the full training set, including base forecasts for successive
time points, and the use of selected training points that reflect the seasonal structure
of the data.
3. Extensive experiments are conducted on 35 time series with triple seasonality us-
ing 16 base models to validate the efficacy of the proposed approach. The exper-
imental results demonstrate the high performance of the LSTM meta-learner and
its potential to combine forecasts more accurately than simple averaging and linear
regression methods.
The remainder of this work is structured as follows. Section 2 presents the pro-
posed LSTM meta-model and introduces both the global and local meta-learning variants.
Section 3 provides application examples for time series with complex seasonality and
discusses the results obtained from the conducted experiments. Finally, in Section 4, we
conclude our work by summarizing the key findings and contributions.
[Figure: Architecture of the LSTM meta-model; the LSTM block is followed by a linear output layer.]
The number of nodes in each gate, m, is the most critical hyperparameter. It determines
the amount of information stored in the states. For more intricate temporal relationships, a
higher number of nodes is necessary.
In contrast to non-recurrent ML models such as feed-forward NNs, tree-based models,
and support vector regression, to calculate the output ỹ_t, LSTM uses not only the information
included in the base forecasts for time t, ŷ_t, but also that in the base forecasts for previous time
steps, t−1, t−2, .... This is achieved through the states c_{t−1} and h_{t−1}, which accumulate
information from past steps.
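A minimal PyTorch sketch of such a recurrent meta-learner is shown below; the input at step t is the vector of M base forecasts ŷ_t, m is the hidden size discussed above, and all names are illustrative rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LSTMMetaLearner(nn.Module):
    """Combines M base forecasts per time step into one forecast,
    carrying information from past steps in the LSTM states (c, h)."""

    def __init__(self, n_models: int, m: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_models, hidden_size=m,
                            batch_first=True)
        self.linear = nn.Linear(m, 1)  # linear output layer

    def forward(self, base_forecasts: torch.Tensor) -> torch.Tensor:
        # base_forecasts: (batch, seq_len, n_models), i.e. ŷ_1, ..., ŷ_T
        h, _ = self.lstm(base_forecasts)   # states accumulate the past
        return self.linear(h).squeeze(-1)  # combined forecasts ỹ_1, ..., ỹ_T
```

Training would minimize, for instance, the MSE between ỹ_t and the observed values over the training sequence.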
Note that approaches v2 and v3 remove the training points that are not in the same
phase as the forecasted point. This simplifies the relationship between the new training
points and the forecasted point, making it easier to model. However, this simplification
comes at the cost of potentially losing some of the information related to the seasonal
patterns that occur outside of the selected phase. Therefore, it is important to carefully
consider which approach to use depending on the specific characteristics of the data.
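The same-phase filtering underlying these variants can be illustrated as follows; the precise selection rules of v2 and v3 differ in detail, so this sketch only conveys the idea of keeping training points in the forecasted point's seasonal phase (period = 168 corresponds to hourly data with weekly seasonality).

```python
import numpy as np

def same_phase_indices(t_forecast: int, period: int) -> np.ndarray:
    """Indices of past points in the same seasonal phase as t_forecast.

    For hourly data with weekly seasonality (period = 168), a forecast
    for 3 p.m. on a Monday keeps only past Mondays at 3 p.m.
    """
    phase = t_forecast % period
    idx = np.arange(t_forecast)          # all points before the forecast
    return idx[idx % period == phase]

# A v1-style training set uses all past points; a v2/v3-style variant
# would restrict the training sequence to, e.g.:
train_idx = same_phase_indices(t_forecast=1000, period=168)
```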
3. Experimental Study
We evaluate the performance of our proposed approach, combining forecasts gen-
erated by 16 forecasting models described in Section 3.2. The forecasting problem is
short-term load forecasting for 35 European countries.
The lowest errors were achieved by the v1 variant of LSTM for k = 168. This variant, which
involves meta-learning on the full sequence restricted to the last 168 points, provided the
most accurate results as measured by the MAPE, MdAPE, and MSE errors. Note the significant
gap between this variant and the second most accurate ensembling method, LinReg:
about 5% in MAPE and 35% in MSE.
Note that the simplest methods of combining forecasts, Mean and Median, resulted
in significantly larger errors than LSTM v1. Variants v2 and v3, which excluded
seasonality from the training sequence, turned out to be inaccurate. This suggests that
removing seasonality from the training sequence can discard important information about
the seasonal patterns in the data, deteriorating forecasting performance.
Figure 3 displays the MAPE boxplots for LSTM in the three variants with varying lengths
of the training sequence k, alongside boxplots for the baseline methods, namely Mean,
Median, and LinReg. As the figure shows, LSTM variants v2 and v3 are highly sensitive
to the length of the training sequence. They achieved the lowest errors when trained on
all available data points, and extending the training sequence may reduce their errors further.
In contrast, for LSTM v1, training sequences of length 168 hours (one week) provided the
lowest errors.
The MPE in Table 2 provides information about the forecast bias, which is lowest
for LinReg; LSTM v1, with MPE = 0.0247, is in second place. It is worth noting
that Mean and Median produce more biased forecasts. The lowest value of StdPE for
LSTM v1 indicates the least dispersed predictions among the compared approaches for
combining forecasts.
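For reference, the percentage-error metrics reported in Table 2 can be computed as below; the sign convention of PE is an assumption, as the paper does not state it explicitly.

```python
import numpy as np

def pe(y_true, y_pred):
    """Percentage errors: PE_t = 100 * (y_t - ŷ_t) / y_t."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * (y_true - y_pred) / y_true

def mape(y_true, y_pred):   # mean absolute percentage error
    return np.mean(np.abs(pe(y_true, y_pred)))

def mdape(y_true, y_pred):  # median of absolute percentage error
    return np.median(np.abs(pe(y_true, y_pred)))

def mpe(y_true, y_pred):    # mean percentage error (bias)
    return np.mean(pe(y_true, y_pred))

def stdpe(y_true, y_pred):  # std of percentage error (dispersion)
    return np.std(pe(y_true, y_pred))
```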
Table 2. Forecasting quality metrics for different ensemble approaches (best results in bold).
Figure 4 depicts examples of forecasts for selected countries and test points. It is worth
noting that LSTM v1 was able to produce forecasts close to target values that lay outside
the interval spanned by the base models' forecasts (let us denote this interval for the i-th
test point by Z_i), even though no base model came close to these targets (see test point
no. 94 for FR and no. 99 for GB in Figure 4). One possible explanation for this ability of
LSTM is the incorporation of additional information from the immediate past through the
internal states c and h (see (2)). LinReg, having no internal states, cannot use such
information. The Mean and Median approaches cannot go beyond the interval Z_i at all.
To test the ability of LSTM v1 and LinReg to produce forecasts outside the interval Z_i,
we counted the number of such cases among the 3500 forecasts produced by each model.
The results are shown in column N1 of Table 3. Column N2 counts how many of these N1
cases concern the situation where the target value also lay outside the interval Z_i, on the
same side as the meta-model forecast. Column N3 counts the number of cases out of N1 for
which the meta-model produced more accurate predictions than the Median approach. It is
evident from Table 3 that LSTM generates many more forecasts outside of Z_i than LinReg.
This may indicate better extrapolation properties of LSTM, but, on the other hand, it may
also suggest an increased susceptibility to overfitting.
Table 3. Number of forecasts outside the interval Z_i (N1), cases where the target also lay outside Z_i on the same side (N2), and cases where the meta-model beat the Median approach (N3).

Model      N1    N2    N3
LinReg      48    13    27
LSTM v1    447   192   244
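Assuming Z_i is the min-max envelope of the base models' forecasts at each test point (the natural reading of "interval of the base models' forecasts"), the three counts could be computed as follows; all names are illustrative.

```python
import numpy as np

def table3_counts(meta, base, target, median_fc):
    """Counts N1, N2, N3 for one meta-model.

    meta      : (T,)   meta-model forecasts
    base      : (T, M) base-model forecasts, defining Z_i per test point
    target    : (T,)   observed values
    median_fc : (T,)   Median-ensemble forecasts (accuracy benchmark)
    """
    lo, hi = base.min(axis=1), base.max(axis=1)      # interval Z_i
    outside = (meta < lo) | (meta > hi)              # N1: forecast leaves Z_i
    same_side = ((meta < lo) & (target < lo)) | ((meta > hi) & (target > hi))
    better = np.abs(meta - target) < np.abs(median_fc - target)
    n1 = int(outside.sum())
    n2 = int((outside & same_side).sum())  # target also outside, same side
    n3 = int((outside & better).sum())     # beats the Median approach
    return n1, n2, n3
```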
[Figure 4: Examples of forecasts for selected countries and test points. Recoverable elements: panels labeled GB and DE, y-axis in units of 10^4, x-axis: test point.]
4. Conclusions
This study proposes a meta-learning approach for combining forecasts based on LSTM,
which has the potential to improve accuracy, particularly in cases where there is a temporal
relationship between base forecasts. The study also proposes different variants of the
approach for time series with multiple seasonal patterns.
The experimental results clearly demonstrate that the LSTM meta-learner outperforms
simple averaging, median, and linear regression methods in terms of forecasting accuracy.
In addition, LSTM has distinct advantages over non-recurrent ML models as it is capable of
leveraging its internal states to model dependencies between base forecasts for consecutive
time points and capture patterns in the sequential data.
Further studies could compare LSTM with other meta-learning approaches, such
as feed-forward and randomized NNs, random forests, and boosted trees, to determine
which approach is best suited for a given forecasting problem. Moreover, selecting a
pool of base models and controlling their diversity is an interesting topic that requires
further investigation.
Funding: This research was supported by grant 020/RID/2018/19 "Regional Initiative of Excellence"
from the Polish Minister of Science and Higher Education, 2019–2023.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: We use real-world data collected from www.entsoe.eu (accessed on 6
April 2016).
Acknowledgments: The author thanks Slawek Smyl and Paweł Pełka for providing forecasts from
the base models.
Conflicts of Interest: The author declares no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
ANFIS Adaptive Neuro-Fuzzy Inference System
ARIMA Auto-Regressive Integrated Moving Average
cES-adRNN contextually enhanced hybrid and hierarchical model combining ETS and dilated
RNN with attention mechanism
DE Germany
DeepAR auto-regressive deep recurrent NN model for probabilistic forecasting
ES Spain
ETS Exponential Smoothing
FR France
GB Great Britain
GRNN General Regression Neural Network
LinReg Linear Regression
LGBM Light Gradient-Boosting Machine
LSTM Long Short-Term Memory Neural Network
MAPE Mean Absolute Percentage Error
MdAPE Median of Absolute Percentage Error
ML Machine Learning
MLP Multilayer Perceptron
MPE Mean Percentage Error
MSE Mean Square Error
MTGNN Graph Neural Network for Multivariate Time Series forecasting
N-BEATS deep NN with hierarchical doubly residual topology
N-WE Nadaraya-Watson Estimator
NN Neural Network
PE Percentage Error
PL Poland
RNN Recurrent Neural Network
StdPE Standard Deviation of Percentage Error
SVM Support Vector Machine
STLF Short-Term Load Forecasting
WaveNet auto-regressive deep NN model combining causal filters with dilated convolutions
XGB eXtreme Gradient Boosting
References
1. Clements, M.; Hendry, D. Forecasting Economic Time Series; Cambridge University Press: Cambridge, UK, 1998.
2. Wang, X.; Hyndman, R.; Li, F.; Kang, Y. Forecast combinations: An over 50-year review. Int. J. Forecast. 2022, in press. [CrossRef]
3. Rossi, B. Forecasting in the presence of instabilities: How we know whether models predict well and how to improve them.
J. Econ. Lit. 2021, 59, 1135–1190. [CrossRef]
4. Blanc, S.; Setzer, T. When to choose the simple average in forecast combination. J. Bus. Res. 2016, 69, 3951–3962. [CrossRef]
5. Genre, V.; Kenny, G.; Meyler, A.; Timmermann, A. Combining expert forecasts: Can anything beat the simple average? Int. J.
Forecast. 2013, 29, 108–121. [CrossRef]
6. Jose, V.; Winkler, R. Simple robust averages of forecasts: Some empirical results. Int. J. Forecast. 2008, 24, 163–169. [CrossRef]
7. Pawlikowski, M.; Chorowska, A. Weighted ensemble of statistical models. Int. J. Forecast. 2020, 36, 93–97. [CrossRef]
8. Poncela, P.; Rodriguez, J.; Sanchez-Mangas, R.; Senra, E. Forecast combination through dimension reduction techniques. Int. J.
Forecast. 2011, 27, 224–237. [CrossRef]
9. Kolassa, S. Combining exponential smoothing forecasts using Akaike weights. Int. J. Forecast. 2011, 27, 238–251. [CrossRef]
10. Babikir, A.; Mwambi, H. Evaluating the combined forecasts of the dynamic factor model and the artificial neural network model
using linear and nonlinear combining methods. Empir. Econ. 2016, 51, 1541–1556. [CrossRef]
11. Zhao, S.; Feng, Y. For2For: Learning to forecast from forecasts. arXiv 2020, arXiv:2001.04601.
12. Gastinger, J.; Nicolas, S.; Stepić, D.; Schmidt, M.; Schülke, A. A study on ensemble learning for time series forecasting and the
need for meta-learning. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China,
18–22 July 2021; pp. 1–8.
13. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
14. Hewamalage, H.; Bergmeir, C.; Bandara, K. Recurrent neural networks for time series forecasting: Current status and future directions. Int. J. Forecast. 2021, 37, 388–427. [CrossRef]
15. Smyl, S.; Dudek, G.; Pełka, P. Contextually enhanced ES-dRNN with dynamic attention for short-term load forecasting. arXiv
2022, arXiv:2212.09030.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.