
Review of Deep Learning for Time Series

This document reviews deep learning models for time series forecasting, highlighting their advancements and applications across various fields such as finance, energy, and urban traffic. It discusses the challenges of modeling time series data, presents commonly used datasets and evaluation metrics, and compares the performance of different deep learning models, notably finding the SCINet model to be the most accurate. The paper concludes with future research directions in the field of time series forecasting.


Received 31 May 2024, accepted 24 June 2024, date of publication 3 July 2024, date of current version 12 July 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3422528

Deep Learning Models for Time Series Forecasting: A Review
WENXIANG LI AND K. L. EDDIE LAW
Faculty of Applied Sciences, Macao Polytechnic University, Macau, SAR, China
Corresponding author: WenXiang Li ([Link]@[Link])

ABSTRACT Time series forecasting involves making scientifically grounded assertions about potential states, or predicting future trends, of an event based on historical data recorded at various time intervals. Supported by diverse deep learning models, the field of time series forecasting has made significant advancements, rendering it a prominent research area. The broad spectrum of available time series datasets serves as a valuable resource for conducting extensive studies in time series analysis with varied objectives. However, the complexity and scale of time series data present challenges in constructing reliable prediction models. In this paper, our objectives are to introduce and review methodologies for modeling time series data, and to outline the commonly used time series forecasting datasets and the different evaluation metrics. We delve into the essential architectures for capturing trends in an input dataset and offer a comprehensive assessment of recently developed deep learning prediction models. In general, different models serve different design goals. We examine the performance of these models on the same time series input dataset with an identical hardware computing system. The measured performance may reflect the design trade-offs among the ranked models. Through our experiments, the SCINet model performs best in accuracy on the ETT energy input dataset. The results we obtain give a glimpse into the relationship between model design and performance. To conclude the paper, we provide further discussion on future deep learning research directions in the realm of time series forecasting.

INDEX TERMS Dataset, deep learning, evaluation metrics, neural network models, time series forecasting,
Transformer models.

I. INTRODUCTION
Time series data typically consists of an orderly sequence of observed or measured outcomes from a process at fixed sampling time intervals. A dataset is designed to capture information and activities within its subject matter. In time series applications, the primary tasks revolve around forecasting future states or data by uncovering underlying patterns based on historical time series data. Indeed, time series forecasting (TSF) finds widespread uses in various fields such as stock market prediction, weather forecasting, and traffic congestion anticipation. Through forecasting, decision-makers obtain the ability to identify and mitigate risks, and to make informed decisions.

Research on time series forecasting started long ago due to its wide applications across different industries, e.g., the energy sector [1], transportation [2] and meteorology [3]. Conventional statistical methods are often used effectively in prediction tasks. Traditional models such as autoregressive (AR) [4], moving average (MA) [5], autoregressive moving average (ARMA) [6] and autoregressive integrated moving average (ARIMA) [7] are popular for univariate predictions on smoothed time series data. However, assumptions have to be made when using these models, including data stationarity, linear correlation, and data independence. Hence, it is challenging to use these approaches in practice.

More sophisticated forecasting techniques have been developed, for example, hierarchical time series forecasting [8], sparse multivariate time series forecasting [9], and asynchronous time series forecasting [10]. Indeed, different machine learning and deep learning (DL) models

The associate editor coordinating the review of this manuscript and approving it for publication was Seifedine Kadry.
2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
For more information, see [Link]
92306 VOLUME 12, 2024
W. Li, K. L. E. Law: Deep Learning Models for Time Series Forecasting: A Review

are explored to tackle specific issues related to time series forecasting.

Deep learning models, in particular, exhibit promising results following the exceptional performance in the fields of natural language processing (NLP) and computer vision (CV). Each model presents a potential solution for addressing time series forecasting challenges. Certain frameworks have been leveraged for multi-objective and multi-granularity prediction [11], as well as multi-modal time series forecasting [12].

In deep learning, we shall anticipate continuous advancements and accomplishments in developing sophisticated techniques for time series data prediction. Many existing review articles on time series forecasting have focused mostly on classic parametric models and traditional machine learning algorithms, but they lack an exposition on the latest advancements in Transformer-based models and empirical comparative analyses on commonly used datasets across different industries. In this paper, we attempt to categorize existing time series prediction methods, delineate datasets, and scrutinize the limitations of current methodologies. Our contributions are as follows:
1) The classical time series prediction approaches are discussed.
2) Typical deep learning models related to time series forecasting are thoroughly investigated, categorized, and summarized.
3) The existing issues in time series forecasting are discussed, and some of the latest problems that should be prioritized are highlighted.
4) The prediction performance of the latest models is compared.
The rest of the paper is organized as follows. Section II provides an overview of the popular publicly available datasets and some commonly used evaluation metrics in time series forecasting applications. The classical models for time series forecasting can be found in Section III. In Section IV, the designs and implications of deep learning models for TSF are discussed. Section V presents experimental comparisons of predictive performance for different models. Section VI summarizes future research directions in time series methods and the unresolved issues. Finally, the research content of this paper is summarized in Section VII.

II. MODEL SETUP, DATASETS, METRICS
In this section, before delving into the discussions of various state-of-the-art designs, we review the available time series datasets that can be used for training and testing different potential models. But we shall start by characterizing an algebraic foundation of a forecast process. The process should predict future values or trends based on past time series data for applications in diverse fields such as finance, economics, meteorology, and sales.

In time series analysis, data is arranged in chronological order, and each data point is associated with a specific timepoint. There are many factors that may cause the fluctuations of variables in time, such as the general movement, (non-)stationarity, trend, periodicity, seasonal variation, volatility, and noise of the data. Upon understanding the patterns of these factors, we may formulate and make proper predictions and decisions.

A time series forecasting model predicts the future values of a given entity over time. In its simplest form, a one-step-ahead forecast model takes the form:

ŷ_{i,t+1} = f(y_{i,t−k:t}, x_{i,t−k:t}, s_i),   (1)

where ŷ_{i,t+1} denotes the prediction of the model, y_{i,t−k:t} = {y_{i,t−k}, …, y_{i,t}} and x_{i,t−k:t} = {x_{i,t−k}, …, x_{i,t}} are the observed values of the target and exogenous inputs within a backcasting window of k periods, respectively. The s_i represents the static metadata associated with the entity. The f(·) signifies the learned prediction function of the model, and i represents a given entity. Each entity represents a logical grouping of temporal information – such as the measurements from different weather stations in climatology, or vital signs from different patients in a medical study – that can be observed at the same time.

The prediction functions f(·) can be realized through domain-specific model designs elaborated in subsequent sections. A function and its corresponding model can be trained and validated using application-specific datasets. In the ensuing discussion, various types of datasets will be examined, and the commonly employed evaluation metrics for time series forecasting algorithms will be outlined.

A. DATASETS
A collection of observations measured chronologically in a time series is needed to formulate and validate a time series model. The model to be developed should be able to characterize a relation between data points in a given dataset. Currently, there are publicly available datasets for different applications, e.g., finance, industrial energy consumption, urban transportation, medicine, and meteorology. The commonly used datasets, associated with an array of deep learning models, are extracted and listed together in Table 1. The basic information of the majority of datasets in different domains is presented below for reference.

1) FINANCIAL SECTOR
In finance, forecasting is frequently used to predict business cycles and long-term stock market movements. It helps financial firms and traders alike find opportunities and make responsible decisions about revenue and expenditure based on estimates. For instance, forecasting can assist investors in the stock market with developing well-rounded investing strategies, forecasting future trends and volatility in stock prices, and selecting more advantageous investments. Additionally, forecasting can help financial institutions assess loan risk [28] by anticipating future borrower repayment capability and credit hazards, or by projecting future interest rate patterns to improve monetary and interest rate policies. Curated datasets include:


TABLE 1. Datasets used in some deep learning models.

Gold Prices [29]: The dataset documents the daily gold prices (in US dollars) from January 2014 to April 2018, with a granularity of days. It uses the minimum, mean, maximum, median, standard deviation, skewness, and kurtosis to characterize the distribution.

Exchange-Rate [30]: It aggregates daily exchange rates of eight currencies spanning from 1990 to 2016: Australia, the UK, Canada, Switzerland, China, Japan, New Zealand, and Singapore.

Stock Opening Prices [31]: From Yahoo Finance, the dataset compiles the daily opening prices of 50 equities across 10 industrial sectors between 2007 and 2016, with the 5 top businesses in each sector. There are a total of 2,518 data points per stock.

2) ENERGY SECTOR
Forecasting is often used in the industrial energy sector to support long-term strategic resource planning. It helps businesses and governments forecast future energy needs, such as those for electricity, oil, and natural gas, and enables more efficient energy production and supply planning. The following open-source datasets are relevant for forecasting industrial energy:

Electricity Transformer Temperature (ETT) [32]: The curated data captures the load and oil temperature of power transformers at 15-minute intervals from July 2016 to July 2018. It is composed of hourly granularity data (ETTh1, ETTh2) and 15-minute granularity data (ETTm1).

Solar Energy [33]: With a 5-minute sampling interval, it documents the highest solar energy generation capacity of 137 photovoltaic power stations in Alabama in 2006.

Electricity [33]: The dataset tracks the kilowatt-hours of power used by 321 customers between 2011 and 2014. The intervals between data points are 15 minutes.

3) URBAN TRAFFIC
In order to predict future road traffic flow patterns and congestion, forecasting assists city traffic management departments, streamlining traffic planning and management. Additionally, long-term forecasting helps identify traffic safety issues, predict future traffic accident risks, and enhance the management of traffic safety and accident prevention. The following open-source datasets are relevant to predicting urban traffic:

Paris Metro Line [34]: The dataset, consisting of 1,742 observations with a granularity of 1 hour, tracks the passenger movements on Lines #3 and #13 of the Paris Metro.

PeMS03, PeMS04, PeMS07, PeMS08 [17]: The PeMS system is the source of these datasets. The technology keeps track of traffic flow data that is gathered every 30 seconds and averaged every five minutes in places such as the California Bay Area. Traffic flow, rate of occupancy, and speed are provided with timestamps for each sensor on the highway.

Birmingham Parking [35]: The dataset encompasses the operational data of 30 parking lots managed by Birmingham National Parking, including parking lot identities, capacities, occupancy rates, and update times. The data spans from October 4, 2016, to December 19, 2016, from 8:00 to 16:30 daily, with a granularity of 30 minutes.

4) ENVIRONMENTAL ISSUES
Long-term climatic trend forecasting is one important application of time series forecasting. Both meteorology and pollution issues have significant impacts on industries such as agriculture and marine transportation, and on natural meteorological catastrophe warning systems. In the following, freely available datasets related to pollution and weather forecasting include:

Weather Pollutants [36]: The dataset encompasses pollution data (PM2.5, PM10, NO2, SO2, O3, CO) gleaned every hour from 76 stations in the Beijing-Tianjin-Hebei region between January 1, 2015, and April 1, 2016. It also includes hourly meteorological observations during the same period, incorporating variables such as wind speed, wind direction, temperature, air pressure, and relative humidity.

Beijing PM2.5 [37]: For the city of Beijing in China, it includes hourly PM2.5 data and related meteorological data. The dataset includes other variables such as dew


point, temperature, air pressure, combined wind direction, cumulative wind speed, snowfall hours, and rainfall hours. In total, there are 43,824 multivariate sequences. The curated PM2.5 data serve as the target of this series.

Hangzhou Temperature [38]: This dataset records the daily average temperature in Hangzhou from January 2011 to January 2017, totaling 2,820 data points.

Weather [23]: The dataset contains local climatological data from nearly 1,600 locations in the United States, with more than 35,000 records collected from January 1, 2010 to December 31, 2013, at 1-hour intervals. Each record consists of a target "wet-bulb" temperature and 11 other climatic features.

5) MEDICAL DATA
Forecasting finds applications at various stages of pharmaceutical development. It aids in predicting drug toxicity, pharmacokinetics, pharmacodynamics, and other parameters, facilitating optimization of the design and screening processes [39]. Furthermore, long-term forecasting contributes to predicting market prospects and sales of specific medications, thereby boosting the formulation of effective marketing strategies. The following open-source datasets are relevant to medical-related studies:

Influenza-Like Illness (ILI) [40]: This dataset contains weekly records of influenza-like illness patients from 2002 to 2021, as documented by the Centers for Disease Control and Prevention (CDC). It provides insights into the proportion of ILI patients relative to the total patient population.

COVID-19 [41]: The dataset includes daily information on confirmed and recovered cases gathered from six nations (Italy, Spain, Italy, China, the United States, and Australia) between January 22, 2020, and June 17, 2020.

Medical Information Mart for Intensive Care (MIMIC)-III [9]: The dataset collected more than 58,000 hospital admission records from Beth Israel Deaconess Medical Center (BIDMC) in the United States from 2001 to 2012, including a variety of clinical features. Chief among them are the ICU admission data, blood sugar levels, and heart rates.

MIMIC-IV [42]: This dataset, which incorporates several improvements over MIMIC-III such as data updates and partial table reconstruction, collects clinical data from more than 190,000 patients and 450,000 hospital admissions at BIDMC from 2008 to 2019.

B. EVALUATION METRICS
First-order and second-order evaluation metrics are commonly used in predictive analysis. These indices offer the advantage of simplicity in computation, and enable comparison across different methods or models on the same dataset. They can be calculated quickly even if the sizes of the datasets are large. Typically, the popular indicators are:

MAE = (1/m) Σ_{i=1}^{m} |y_i − ŷ_i|   (2)

MSE = (1/m) Σ_{i=1}^{m} (y_i − ŷ_i)²   (3)

RMSE = √[ (1/m) Σ_{i=1}^{m} (y_i − ŷ_i)² ]   (4)

MSLE = (1/m) Σ_{i=1}^{m} (log(y_i + 1) − log(ŷ_i + 1))²   (5)

RMSLE = √[ (1/m) Σ_{i=1}^{m} (log(y_i) − log(ŷ_i))² ]   (6)

where ŷ_i represents the predicted values, y_i the true values, and m the sample size.

Among them, the most commonly used metrics are the MAE (mean absolute error), the RMSE (root mean squared error), and the MSE (mean squared error). The main difference between the RMSE and the MSLE (mean squared logarithmic error) lies in how sensitive each is to the outliers in a particular dataset. The MSLE exhibits greater robustness when assessing datasets containing outliers, while the RMSE demonstrates relatively higher susceptibility to outlying data points. In contrast, the MAE reflects the actual error situation better than the RMSE, whereas the RMSLE (root mean squared logarithmic error) proves valuable in cases where underestimation is penalized more severely than overestimation.

Although these measures in general can be evaluated quickly regardless of the size of a dataset, they do suffer certain limitations: 1) some of them do not reflect the scenarios where the true values approach zero, and 2) some indices may exhibit skewed distributions if the true values are close to zero. There are some other useful evaluation metrics, i.e.,

MAPE = (100/m) Σ_{i=1}^{m} |(y_i − ŷ_i) / y_i|   (7)

sMAPE = (1/m) Σ_{i=1}^{m} |y_i − ŷ_i| / ((|y_i| + |ŷ_i|) / 2)   (8)

IA = 1 − Σ_{i=1}^{m} (y_i − ŷ_i)² / Σ_{i=1}^{m} (|ŷ_i − ȳ| + |y_i − ȳ|)²   (9)

R² = 1 − Σ_{i=1}^{m} (y_i − ŷ_i)² / Σ_{i=1}^{m} (y_i − ȳ)²   (10)

where ȳ is the mean of the true values.

The MAPE (Mean Absolute Percentage Error) is a commonly used metric that calculates the absolute percentage of the prediction error for each observation. It is intuitive, easy to understand, and works well for datasets with a small number of outliers. It provides a measure of the average magnitude of the percentage error.

On the other hand, the sMAPE (Symmetric Mean Absolute Percentage Error) is a symmetric percentage error indicator that takes into account the proportional relationship between the actual and predicted values. It provides a balanced


measure of the error between the predicted and actual values. The sMAPE is particularly useful for datasets that require symmetric treatment of errors between the actual and predicted values.

Another important metric is the IA (Interval Accuracy), which measures the accuracy of the confidence interval of a time series prediction. It is suitable for forecasting problems that require consideration of uncertainty, such as meteorological prediction and financial risk estimation.

Lastly, the R² (coefficient of determination) is a measure of how well a prediction model fits the observational data. It represents the proportion of the variance in the observed data that can be explained by the model. R² is commonly used in linear models and provides a comprehensive evaluation of prediction performance.

III. CLASSICAL APPROACHES FOR TSF
A. PREDICTION METHODS FOR STATIONARY DATA
The traditional time series forecasting methodologies, such as autoregressive (AR), moving average (MA), and autoregressive moving average (ARMA) models, are essentially based on statistical techniques. For stationary time series forecasting, Yule [4] first proposed a new symbolic system, which played an important role in multivariate time analysis. He later introduced the autoregressive (AR) model, which was the first to treat randomness in time series analysis. Every time series was considered a stochastic process in its own right. One of the key techniques for forecasting stationary time series data is the ARMA model [43], which was popularized by Box and Jenkins in 1970.

The ARMA(p, q) is a linear combination of two linear components AR(p) and MA(q), where AR(p) is an autoregressive model of order p, and MA(q) is a moving average model of order q. The ARMA is frequently used to simulate stationary time series. The time series data is treated as a random sequence, with the temporal continuity of the original data being reflected in the correlation between random variables. For the analysis and forecasting of shifting trends in stationary time series, these models offer a potent tool.

An adaptive ARMA model [44] was later introduced to handle the short-term power load forecasting of a power generation system. Through experimental findings, the adaptive ARMA model performed more accurately than the conventional ARMA model for forecasts made one week and 24 hours in advance. To increase the precision of prediction intervals, a prediction interval algorithm [45] was proposed based on a bootstrap distribution for setting (p, q).

Furthermore, univariate time series models can be mapped to multivariate time series models by drawing inspiration from Granger causality¹ [46]. The vector autoregressive moving average (VARMA) model is based on the ARMA model, and it creatively combines the vector autoregressive (VAR) model with the vector moving average (VMA) model. However, stationary time series data are a necessary condition for building a VARMA model. In order to achieve stationary data when a given series is non-stationary, a differencing operation is required. The Vector Error Correction Model (VECM) [47], which takes into account the cointegration relationship between time series, is a solution to this problem. These research endeavors have driven further advancements in time series forecasting and provided new approaches and methods to tackle complex time series analysis issues.

On balance, the ARMA time series forecasting model has made notable progress in addressing stationary time series forecasting. However, purely stationary time series data is rarely found in practice. This arguably limits the applicability of the ARMA model. Hence, for better handling of non-stationary time series data, more adaptive and malleable forecasting models and methods are needed.

B. PREDICTION METHODS FOR NON-STATIONARY DATA
Non-stationary sequences refer to ordered lists that exhibit characteristics such as trends, seasonality, or periodicity. These sequences demonstrate consistency in local levels or trends, where some data points closely resemble others. To transform non-stationary sequences into stationary ones, differencing operations can be employed for adjustment.

The ARIMA(p, d, q) model comprises three components: Autoregressive (AR), Integrated (I), and Moving Average (MA), where p and q are the orders of the AR and MA processes, respectively, and d is the number of differencing operations. The AR component uses past observations to predict the current value. The MA component captures noise and random fluctuations within the sequence. The I component makes a non-stationary time series stationary by performing differencing operations on the non-stationary sequence.

The ARIMA model, for non-stationary time series, is capable of capturing variations in different data patterns, and its parameter estimations are relatively intuitive. Consequently, the ARIMA model has been widely adopted in practical applications. The regular AR(p) model can be expressed as:

y_t = c + ε_t + Σ_{i=1}^{p} φ_i y_{t−i}   (11)

where y_t represents the current value, c a constant term, p the order of autoregression, φ_i the i-th autoregressive coefficient, and ε_t the white noise error term. It is more convenient to use the backward shift (or lag) operator B to represent the ARIMA model. The AR(p) can be represented, ignoring the error term at this moment, as

(1 − Σ_{i=1}^{p} φ_i B^i) y_t = c.   (12)

¹ The definition of Granger causality is that a series x_i is deemed not to be "causal" of another series x_j if leveraging the history of series x_i does not reduce the variance of the prediction of series x_j.


The regular MA(q) expression characterizes the noise error term, and we have:

y_t = μ + ε_t + Σ_{i=1}^{q} θ_i ε_{t−i}   (13)

where y_t indicates the current value, μ a constant term, q the order of the moving average, θ_i the i-th moving average coefficient, and ε_t the white noise error term. With the backward shift operator, we have

y_t = μ + (1 + Σ_{i=1}^{q} θ_i B^i) ε_t.   (14)

The I(d) model differences the non-stationary sequence d times, i.e.,

y′_t = (1 − B)^d y_t   (15)

where y′_t signifies the differenced sequence, and d is the order of differencing. Combining the AR, I, and MA expressions, we have the ARIMA(p, d, q) model:

(1 − Σ_{i=1}^{p} φ_i B^i)(1 − B)^d y_t = c + (1 + Σ_{i=1}^{q} θ_i B^i) ε_t   (16)

where c is the aggregate constant term. Eqn. (16) represents the ARIMA process. By choosing the parameters p, d, and q appropriately, the ARIMA models are suitable for different time series predictions and analyses.

The ARIMA model has found widespread use in time series forecasting. For instance, Valipour [7] showed that the ARIMA model predicted precipitation well in important areas. In [48], it was used to forecast and analyze trends in the PM2.5 concentration in Fuzhou, China. The ARIMA model was used by Malki et al. [49] to forecast the COVID-19 outbreak's second wave and potential end times in 2020. Shi et al. [50] proposed a novel model based on tensor decomposition, BHT-ARIMA, to forecast multiple short time series. The experiments demonstrated good results on classical datasets, such as electricity and traffic. In this design, a time series is split into multiple blocks, and the model utilizes Hankel tensor properties to construct the proper tensor representation. Then, with a tensor decomposition method, the decomposed tensors are used for feature extraction in the time series.

Conclusively, the ARIMA model is applied to time series data by making it a stable series through differencing. However, this particular limitation may also restrict its ability to capture nonlinear relationships. Therefore, to better handle nonlinear relationships and complex time series patterns, alternative models and methods merit consideration, such as those based on machine learning and deep learning, with a view to ensuring increased accuracy and adaptability of time series forecasting.

IV. DEEP LEARNING AND TSF
In this section, we will explore various cutting-edge deep learning architectures for time series forecasting. These models are typically categorized into four main groups:
1) Convolutional Neural Networks (CNNs),
2) Recurrent Neural Networks (RNNs),
3) Graph Neural Networks (GNNs), and
4) Transformers.
Each of these deep learning models possesses distinct advantages and suitability for time series forecasting. Choosing the appropriate model depends on the characteristics of the data, the complexity of the problem, and the performance requirements. Furthermore, effective hyperparameter tuning and data preprocessing are critical for achieving accurate predictions. This section will provide an overview of models belonging to the classes of CNN-, RNN-, GNN-, and Transformer-based time series forecasts. The design characteristics of these models will be summarized in tables later on.

A. CNN-BASED MODELS
Convolutional Neural Networks (CNNs) gained significant attention with the pioneering work done by LeCun in 1989. The models he developed focused on processing grid-like topological data such as images and time series data. The CNN has emerged as one of the most powerful techniques for understanding image content and has demonstrated exceptional performance in image recognition, segmentation, detection, and retrieval tasks. Over the years, CNN architectures have witnessed numerous advancements, leveraging depth and spatial aspects to achieve innovative designs.

In general, based on specific architectural modifications, CNNs can be broadly classified into seven categories: spatial utilization, depth, multi-path, width, channel boosting, feature map utilization, and attention-based CNN. Fig. 1 illustrates the classification of different deep CNN models. The CNN is widely adopted in the field of computer vision due to its ability to effectively handle dimensionality reduction and process long sequences in parallel. Though originally designed for image datasets to extract local associations across spatial dimensions [51], many recent works have explored applications in time series forecasting. A CNN can be modified for time series datasets by introducing additional layers of causal convolutions [52]. The filters of these convolutions ensure that only historical data is used to make predictions. Each causal convolution filter has the following structure for extracting intermediate features contained in the hidden layer l:

h^{l+1}_t = A((W ∗ h)(l, t))   (17)

(W ∗ h)(l, t) = Σ_{τ=0}^{k} W(l, τ) h^l_{t−τ}   (18)

where h^l_t ∈ R^{H_in} represents the intermediate state at time t in layer l, and W(l, τ) ∈ R^{H_out×H_in} signifies the fixed filter


FIGURE 1. Types of CNN models.

weights in layer l, and A(·) denotes the activation function. H_out and H_in represent the numbers of output and input channels in the convolutional layer, respectively.

1) VARIANTS OF CNN MODELS
Temporal Convolutional Networks (TCNs) [53] are widely used models for analyzing and modeling time series data. They employ causal convolutions to prevent information leakage. Fig. 2a shows the architectural elements of a TCN. A TCN overcomes the challenges of gradient vanishing and explosion by incorporating a different backpropagation path in the temporal direction.
HyDCNN (Hybrid Dilated CNN) [13] is another effective approach that captures nonlinear relationships between sequences using dilated causal convolutions and linear dependencies through autoregressive modules. Fig. 2b shows the architecture of HyDCNN. This hybrid framework addresses the limitations of traditional CNNs in capturing long-term correlations. The WaveNet-CNN [54] model was then introduced by combining the WaveNet speech sequence generation model with CNNs. The model utilizes the ReLU activation function and exhibits superior performance in financial analysis tasks.
To address the limitations of CNNs in handling large datasets, the Kmeans-CNN [55] model was introduced, which fuses a CNN with the K-means clustering algorithm, enabling the learning of more informative features. By clustering similar samples and segmenting large datasets, Kmeans-CNN achieves excellent performance in processing large-scale power load data.
In 2023, Wu et al. [56] proposed a new framework, SACLSTM, that combines CNN and LSTM for more accurate stock price prediction. An array of sequences of historical data and its leading indicators (options and futures) is constructed and used as an input image to the CNN part. Feature vectors are extracted through the convolutional and pooling layers and serve as input vectors to the LSTM.

2) CNN AND ATTENTION MECHANISM
To address the challenge of capturing long-term dependencies in multivariate long-term forecasting, Huang et al. [57] developed the Dual Self-Attention Network (DSANet), which is suitable for dynamic periodic or aperiodic sequences and simultaneously considers both global and local temporal patterns. Fig. 2c shows the architecture of DSANet. Unlike traditional recurrent neural networks (RNNs) with recursive structures, DSANet employs parallel convolutional layers to capture complex mixed patterns of global and local time. Additionally, the model leverages self-attention mechanisms to learn the correlations among multiple time series. This approach allows DSANet to effectively capture long-term dependencies in time series data and enhance the performance of multivariate long-term forecasting.

3) CNN AND SEQ2SEQ
In recent years, there has been interest in integrating CNN and Sequence-to-Sequence (Seq2Seq) models to optimize long-term time series forecasting. SCINet, a network model developed by Liu et al. [14], is a technique for extracting useful and distinguishing temporal properties from downsampled subsequences or features. Fig. 2d shows the architecture of SCINet. SCINet efficiently models complicated multi-resolution time dynamics and captures local temporal dynamics, trends, and seasonal features by aggregating rich data from several resolutions. This allows for precise time series forecasting. This comprehensive approach has resulted in notable progress in long-term time series forecasting. Table 2 summarizes recent applications of CNN-based models in the context of TSF.

B. RNN-BASED MODELS
Elman [58] was the first to introduce recurrent neural networks (RNNs). RNNs belong to a type of neural network architecture designed for handling sequential data, such as time series or text written in natural language. Unlike


FIGURE 2. Examples of CNNs for TSF.

traditional feedforward neural networks, RNNs incorporate recurrent connections within the network, enabling information to be passed and persisted over time. The core idea of an RNN is to use the output of the previous time step as the input at the current time step. This establishes dependencies between sequential data, enables RNNs to handle variable-length sequences, and captures temporal correlations and contextual information within the sequences.
The hidden state in an RNN serves as the memory unit of the network for each time step. Each time step updates the hidden state, which is then passed to the following layer or time step of the network. RNNs can keep track of previous data via this memory propagation technique, and they can use that data to adjust current outputs. Fig. 3 shows the internal organization of a typical RNN.
In the diagram, x_t represents the input vector at time step t, and h_t the hidden vector at time step t. It can be observed that the traditional RNN neuron receives the previous hidden state h_{t-1} and the current input x_t. The core of an RNN lies in its RNN unit, which includes an internal memory state that summarizes past information. The calculation formula for the internal hidden state of an RNN


TABLE 2. CNN-based models.

FIGURE 3. Internal structure of RNN.

is as follows:

h_t = tanh(w_h h_{t-1} + w_x x_t + b)                       (19)

where h_t represents the hidden state at time step t, w_h the weight matrix for the hidden state, w_x the weight matrix for the input, b the bias vector, and tanh the hyperbolic tangent activation function. The output of an RNN is:

y_t = w_o h_t + b_o                                         (20)

where y_t represents the output at time step t, w_o is the weight matrix for the output, and b_o is the bias vector.
Hochreiter and Schmidhuber [59] introduced the Long Short-Term Memory (LSTM) network, which effectively addresses the challenges of vanishing and exploding gradients in conventional RNNs when handling long sequences by incorporating gated units and memory mechanisms. Fig. 4 shows the internal structure of the LSTM.

FIGURE 4. Internal structure of LSTM.

The key component of LSTM is the cell state, which acts as a conveyor belt running throughout the chain with minimal linear interactions. LSTM employs carefully designed structures called ''gates'' to selectively add or remove information from the cell state. Gates serve as a method to allow information to filter through. LSTM utilizes a series of gates for modulation, including the input gate, output gate, and forget gate.
In the analysis of flight data, Wang et al. [60] proposed a univariate fault time series forecasting algorithm based on LSTM. They compared multiple baseline models and observed that the LSTM model exhibited superior performance. Additionally, Graves and Schmidhuber [61] introduced the bidirectional Long Short-Term Memory (Bi-LSTM) network, which consists of two independent LSTMs that are concatenated. Bi-LSTM showcases stronger capabilities in handling data with longer time delays by leveraging additional contextual information without the necessity of retaining previous inputs.
The Gated Recurrent Unit (GRU) [62] model was introduced as a more efficient variation of LSTM, addressing the high computational complexity associated with LSTM models. This design simplifies the model structure, reduces computational load, and facilitates faster convergence during training.
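A minimal sketch of one step of the vanilla RNN in Eqs. (19)-(20), assuming NumPy; the sizes and names are illustrative. LSTM and GRU cells replace this single update with gated versions of the same recurrent loop:

```python
import numpy as np

def rnn_step(x_t, h_prev, w_h, w_x, b, w_o, b_o):
    """One step of a vanilla RNN."""
    h_t = np.tanh(w_h @ h_prev + w_x @ x_t + b)    # Eq. (19): update hidden state
    y_t = w_o @ h_t + b_o                          # Eq. (20): read out the output
    return h_t, y_t

H, D, O = 4, 3, 2                                  # hidden, input, output sizes
rng = np.random.default_rng(0)
w_h, w_x = rng.normal(size=(H, H)), rng.normal(size=(H, D))
w_o = rng.normal(size=(O, H))
b, b_o = np.zeros(H), np.zeros(O)

h = np.zeros(H)                                    # initial hidden state
for x_t in rng.normal(size=(5, D)):                # unroll over 5 time steps
    h, y = rnn_step(x_t, h, w_h, w_x, b, w_o, b_o)
```

In practice one would use library cells (e.g., `nn.RNN`, `nn.LSTM`, `nn.GRU` in PyTorch) rather than hand-rolling the loop; the sketch only makes the recurrence of Eqs. (19)-(20) concrete.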


1) VARIANTS OF RNN
Given the notable effectiveness of LSTM in capturing long-term dependencies, a model introduced by Jung et al. [63] integrates LSTM into an RNN framework for predicting photovoltaic solar energy generation in new locations. This model employs LSTM layers to learn the temporal and topographical variations in solar radiation and weather conditions, enabling the capture of temporal patterns observed across diverse locations.
Another approach addressing the challenges of training RNNs on extended time series data, including intricate dependencies, vanishing and exploding gradients, and effective parallelization, is the DilatedRNN [64]. This model combines recursive expansion layers with hierarchical expansions to address these issues inherent in the RNN network structure for sequence time forecasting. Fig. 5a shows an example of a three-layer DilatedRNN with dilations 1, 2, and 4.
To improve the accuracy of traffic flow prediction, Bharti et al. [65] proposed in 2023 the PSO-Bi-LSTM short-term traffic flow prediction model, which combines Particle Swarm Optimization (PSO) with Bidirectional Long Short-Term Memory (Bi-LSTM) neural networks.

2) RNN AND ATTENTION MECHANISM
To extract more pertinent correlations from data and enable predictions further into the future, some recent work has integrated attention mechanisms into RNNs. The DA-RNN model [15], a two-stage RNN model with attention processes, was proposed to enhance the capacity to recognize long-term correlations in data. Fig. 5b shows a graphical illustration of DA-RNN. The RETAIN model, proposed by Choi et al. [66], works on predicting electronic health record (EHR) data. This model utilizes two sets of attention mechanisms, visit-level and variable-level. The visit-level attention assigns higher weights to patient visits that provide more helpful information for disease prediction, while the variable-level attention assigns weights to variables that have a greater impact on disease prediction within a patient visit. The STAM (spatio-temporal attention mechanism) [67] is designed to combine LSTM with attention processes. Fig. 5c shows the design of STAM. STAM uses LSTM layers in the encoder to find the temporal correlations in the input data. The most important time step and variable are chosen by the model for each output time step, and predictions are then made using the derived spatial and temporal context vectors.

3) RNN AND SEQ2SEQ
Because Seq2Seq models considerably improve the capacity to handle large time series in multi-step predictions compared to traditional RNNs, Seq2Seq and LSTM have been combined [68] in an encoder-decoder framework. The model makes use of a bidirectional LSTM decoder, which effectively takes into account dynamic future information by propagating information in both forward and backward directions.
Another model is the Multi-Quantile RNN (MQRNN) [16], which is capable of making predictions for several future time steps at once. Fig. 5d shows the architecture of MQRNN. MQRNN uses an encoder-decoder architecture and substitutes a number of fully connected layers for the RNN in the decoder. By directly concatenating the context from the last time step of the encoder with all future features through fully connected layers, the model generates context at each time step and a global context for the entire sequence, thereby enhancing the accuracy of predicting distant future values.
The MTSMFF model [69] incorporates attention mechanisms, an encoder-decoder structure, and LSTM. For the encoding network, a BiLSTM with a temporal attention mechanism is used to adaptively capture implicit features and long-term correlations in multivariate time series data. Fig. 5e shows a graphical illustration of MTSMFF. This model improves the ability to handle lengthy sequences and capture long-term dependencies by using layered vanilla LSTMs for decoding. Recent research based on RNN models for forecasting is summarized in Table 3.

C. GNN-BASED MODELS
Graph Neural Networks (GNNs) are deep learning models for handling graph-structured data. In recent years, GNNs have gained significant attention in the field of time series forecasting. Time series data often exhibit complex correlations and temporal dependencies, which are difficult to capture using traditional sequence-based models. By modeling time series data as graphs, GNNs can better handle the relationships and dynamic changes between nodes. In GNNs, each node represents a time series data sample while edges represent the relationships or temporal dependencies between nodes. By encoding and propagating features on nodes and edges, GNNs can learn the interactions between nodes and the temporal evolution patterns, thus facilitating time series forecasting tasks.
As the prediction horizon increases, the mutual influences between nodes often vary with the fluctuation of the flow, making the dynamic nature of graph structures particularly important in long-term traffic prediction. To address this challenge, researchers have proposed innovative approaches. One design is the DMSTGCN model based on Dynamic Graph Convolutional Networks (DGNNs), introduced by Han et al. [19]. It utilizes dynamic graph convolutional modules to predict traffic speed while considering the periodicity and dynamic features of traffic. Fig. 6a shows the architecture of DMSTGCN. This model captures the spatial correlations between different road segments and applies them to traffic prediction.
AutoSTG [18] is a novel spatio-temporal graph automatic prediction framework. Fig. 6b shows the architecture of AutoSTG. To capture intricate spatio-temporal relationships, it uses spatial graph convolution and temporal


FIGURE 5. Different RNN-based models for TSF.


TABLE 3. RNN-based models.

convolution. Through this approach, the model automatically learns the associations between spatio-temporal data and utilizes them for prediction tasks.
Additionally, MTGNN [17] is a general graph neural network structure that does not require a predefined explicit graph structure. Fig. 6c shows the architecture of MTGNN. The model combines graph convolution modules with temporal convolution modules to capture the spatio-temporal correlations in the data. It employs a graph learning module to adaptively extract sparse graph adjacency matrices from input data, allowing for flexible and efficient modeling of complex relationships.
REST [70], which automatically mines inferred multimodal directed weighted graphs, is another approach. Fig. 6d shows the architecture of REST. This model combines Edge Inference Networks (EINs) and Graph Convolutional Neural Networks (GCNs). By projecting the spatial correlations between time series nodes from the temporal side to the spatial side, EINs build multimodal directed weighted graphs for GCNs, enabling the model to effectively capture and utilize multimodal information for predictions.
Addressing dynamic graph problems, Liu et al. [20] proposed TPGNN. Fig. 6e shows the architecture of TPGNN. This network uses time matrix polynomials to express the correlations between dynamic variables as a two-step process. First, the overall correlations are captured using a static matrix basis. Then, matrix polynomials are built for each time step using a set of time-varying coefficients and matrix bases.
To overcome the limitation that temporal patterns can only be learned through dependencies between individual variables, Chen et al. [21] proposed a multiscale adaptive graph neural network (MAGNN) for MTS prediction. By utilizing a multiscale pyramidal network to model temporal hierarchies, an adaptive graph learning module to automatically infer dependencies between variables, a multiscale temporal graph neural network to model intra- and inter-variable dependencies, and a scale fusion module to facilitate collaboration across different temporal scales, MAGNN outperforms state-of-the-art methods on six datasets.
These methods share the common goal of considering dynamic graph structures in time series forecasting and utilizing GNN models to capture the spatio-temporal relationships between nodes. These innovative approaches contribute to improving the accuracy and reliability of long-term prediction and drive research development in the field of time series forecasting. Table 4 summarizes recent works on forecasting based on GNN models.

D. TRANSFORMER-BASED MODELS
The original Transformer [71] was introduced in 2017 by Vaswani et al. This model completely departs from traditional
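The graph learning module mentioned above for MTGNN, which infers a sparse adjacency matrix from learnable node embeddings, can be sketched roughly as follows (a loose simplification of the idea in [17], not the paper's exact formulation; NumPy, with illustrative names):

```python
import numpy as np

def adaptive_adjacency(E1, E2, k=2):
    """Infer a directed adjacency from two node-embedding matrices:
    score node pairs, keep only the top-k neighbours per node for sparsity."""
    scores = np.maximum(E1 @ E2.T, 0.0)            # ReLU keeps one direction per pair
    A = np.zeros_like(scores)
    for i in range(scores.shape[0]):
        top = np.argsort(scores[i])[-k:]           # indices of the k strongest links
        A[i, top] = scores[i, top]
    row_sum = A.sum(axis=1, keepdims=True)
    return A / np.where(row_sum == 0, 1.0, row_sum)  # row-normalize nonzero rows

N, d = 6, 4                                        # 6 series, embedding size 4
rng = np.random.default_rng(1)
A = adaptive_adjacency(rng.normal(size=(N, d)), rng.normal(size=(N, d)))
```

In MTGNN the embeddings E1 and E2 are trained end-to-end with the forecasting loss, so the learned adjacency replaces a predefined graph.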


FIGURE 6. Different GNN-based models for TSF.

CNN and RNN architectures for sequence tasks. In contrast to RNNs, the Transformer has superior parallelism and allows interaction with global information, thereby effectively capturing correlations within sequence data through its self-attention mechanism. The self-attention mechanism of the Transformer is more interpretable and signifies a significant advancement over many conventional deep learning models. The Transformer relies entirely on attention mechanisms to characterize the global dependencies between the inputs and outputs of the model, as illustrated in Fig. 7.
The core of the Transformer is the self-attention mechanism. The input is represented as (Query, Key) pairs, and the computation mechanism is:

A(Q, K, V) = softmax(QK^T / sqrt(d)) V                      (21)

where Q ∈ R^{L_Q×d}, K ∈ R^{L_K×d}, V ∈ R^{L_V×d}, and d represents the input dimension. L_Q, L_K, and L_V are the lengths of Q, K, and V, respectively. The probability formula for the i-th query's attention is:

A(q_i, K, V) = sum_j [k(q_i, k_j) / sum_l k(q_i, k_l)] v_j = E_{p(k_j|q_i)}[v_j]   (22)


TABLE 4. GNN-based models.

for p(k_j|q_i) = k(q_i, k_j) / sum_l k(q_i, k_l), where k(q_i, k_j) selects the asymmetric exponential kernel exp(q_i k_j^T / sqrt(d)).

FIGURE 7. Structure of a Transformer.

1) VARIANTS OF TRANSFORMER MODELS
Wu et al. [40] introduced the vanilla Transformer to the field of influenza time prediction. The Transformer analyzes the complete input sequence and, using a self-attention process, learns the dependencies in the temporal data. The overall structure directly transfers the Transformer encoder-decoder, uses the self-attention mechanism to learn complex patterns and dynamics from time series data, and takes influenza-like illness (ILI) prediction as a case study, achieving good results. However, the model often fails to adequately encode long input sequences and loses long-term dependencies. Subsequently, to solve the Transformer's inability to adequately encode long sequences and other problems, further Transformer models have emerged one after another, which can be roughly divided into the following three categories:

a: TRANSFORMER AND ATTENTION MECHANISM
Lin et al. [72] proposed the Transformer-based SpringNet, which uses a Dynamic Time Warping (DTW) attention layer to gather local correlations in time series data in order to predict solar energy. Autoformer [24] is another upgraded version of the Transformer that optimizes the original Transformer for time series problems. Fig. 8b shows the architecture of Autoformer. Its core includes the Series Decomposition Block module and an upgraded Auto-Correlation mechanism for multi-head attention.
To improve the training speed of the Transformer, Lee-Thorp et al. [73] proposed the FNet model based on an improved Transformer, in which the self-attention sublayer is replaced by the unparameterized Fourier transform, and


the training speed of the model is greatly improved. Pyraformer [25] overcomes the computational complexity limitations of the Transformer by using a pyramid attention mechanism to form a multi-resolution representation of the time series, effectively capturing the interdependencies in TSF. Fig. 8c shows the architecture of Pyraformer.
Liu et al. [27] introduced Non-stationary Transformers, which improve the attention mechanism within the Transformer by incorporating non-stationary information, thereby enhancing data predictability and unleashing the excellent time-series modeling capabilities of attention mechanisms. Fig. 8d shows the architecture of Non-stationary Transformers.
In 2023, Du et al. [74] proposed Preformer, a Transformer-based prediction model that introduces a novel and efficient multiscale segment correlation mechanism, which divides a time series into multiple segments and utilizes segment-correlation-based attention instead of point correlation. A multiscale structure is established to aggregate dependencies on different time scales, which facilitates the selection of segment lengths.
Informer [23] represents a significant contribution to long-period time series forecasting. It improves on the traditional Transformer model from the perspective of efficiency. For long-period time series forecasting, the time complexity of the regular Transformer model increases quadratically with the length of the input series. Informer creates sparse attention with the keys and the important queries. Testing shows that the attention scores follow a long-tailed distribution: most of the scores are small, and only a few of them are large.
Informer therefore focuses on modeling the pairs with important attention scores, and the rest are ignored. This design creates a structure of sparse attention and greatly improves the computational efficiency. Informer further introduces self-attention distillation between every two Transformer layers: a convolution operation is used to reduce the length of the sequence by half, which further reduces the training overhead.
At the decoder stage, Informer employs a method of predicting the results of multiple time steps at once, which is able to alleviate the problem of cumulative error. Overall, Informer effectively improves the operational efficiency and performance over the Transformer in long-period time series forecasting tasks. With its sparse attention and self-attention distillation techniques, Informer provides an effective solution for handling large-scale long-sequence data in real applications. Fig. 8a shows the model design of an Informer.

b: TRANSFORMER COMBINED MODELS
Lim et al. [22] proposed a method combining LSTM and Transformer, called the Temporal Fusion Transformer (TFT). Fig. 8e shows the architecture of TFT. The model uses the sequence modeling capability of LSTM to preprocess input sequences first, so that representations considering context and timing information can be generated at different times. Next, the bottom-layer representation is fed into the upper-layer Transformer to exploit attention's capability of extracting ultra-long-period information, compensating for the information loss of sequence models. Madhusudhanan et al. [75] proposed a U-Net-inspired Transformer architecture called Yformer, which is based on a unique Y-shaped encoder-decoder architecture that combines coupled scaling mechanisms with sparse attention modules to capture long-term effects across scale levels.

c: TRANSFORMER AND TIMING ANALYSIS
Zhou et al. [26] introduced the FEDformer model, which incorporates seasonal-trend decomposition within the Transformer. Another contribution is the introduction of the Fourier transform, using the Transformer in the frequency domain to help the model better learn global information. Fig. 8f shows the FEDformer model. The core modules of FEDformer are the Fourier transform module and the timing decomposition module. The Fourier transform module converts the input time series from the time domain to the frequency domain, then replaces Q, K, and V in the Transformer with the frequency-domain information after the Fourier transform, and carries out the Transformer operation in the frequency domain.
Autoformer [24] also used the decomposition of trend and seasonal items. To extract seasonal items, the paper adopted the moving average method. By calculating the average value of each window in the original input time series, the trend item for each window is derived, and the trend item of the whole series is then obtained. At the same time, the seasonal term can be obtained by subtracting the trend term from the original input sequence according to the additive model. It is worth noting that, due to the success of Autoformer and FEDformer, exploring self-attention mechanisms in the frequency domain for time series modeling has attracted more and more attention from the research community. Table 5 shows the recent works based on Transformer models for TSF.

2) COMPLEXITY ANALYSIS OF TRANSFORMER-BASED MODELS
The time complexity of the Transformer [40] model depends mainly on the length of the sequence and the number of hidden layers in the model. The Transformer has O(L^2) time and memory complexity, so for longer input sequences and deeper models its time and memory costs can become quite high. To solve this problem, subsequent researchers have proposed a number of effective Transformer-class models to reduce complexity, using, for example, sparse attention and hierarchical attention.
In the self-attention model, Informer introduced the concept of sparse bias and adopted the LogSparse mask
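The moving-average seasonal-trend decomposition described above for Autoformer and FEDformer can be sketched as follows (a simplified, edge-padded stand-alone version; inside the models this block is applied repeatedly within the network):

```python
import numpy as np

def series_decomp(x, window=25):
    """Split a series into trend (moving average) and seasonal (residual) parts,
    following the additive model x = trend + seasonal."""
    pad = window // 2
    padded = np.concatenate([np.repeat(x[0], pad), x, np.repeat(x[-1], pad)])
    kernel = np.ones(window) / window              # moving-average kernel
    trend = np.convolve(padded, kernel, mode="valid")[: len(x)]
    seasonal = x - trend                           # subtract trend to get seasonality
    return trend, seasonal

t = np.arange(200)
x = 0.05 * t + np.sin(2 * np.pi * t / 24)          # linear trend + daily seasonality
trend, seasonal = series_decomp(x)
```

A window close to the dominant period (here 25 vs. a period of 24) averages the seasonal component away, leaving a smooth trend; the residual then carries the periodic structure.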


FIGURE 8. Different Transformer-based models for TSF.


TABLE 5. Transformer-based models.

technique. These innovations significantly reduce the computational complexity of the traditional Transformer model, originally O(L^2), to O(L log L). It is worth noting that Informer [23] did not explicitly introduce sparse bias when implementing this improvement. Instead, it dynamically selects queries and keys that have significant similarities, effectively prioritizing the dominant queries to achieve O(L log L) complexity. This method greatly improves the computational efficiency.
Autoformer [24] inherited the Transformer's encoder-decoder design. By employing Autoformer's unique internal operators, it is able to effectively separate the overall trend of a variable from the predicted hidden variables. This approach utilizes a unique autocorrelation mechanism to achieve O(L log L) complexity.
A pyramidal attention module was designed in Pyraformer [25] to transmit information between and within different scales. As the length of the input sequence L increases, Pyraformer is able to achieve lower, O(L), computational complexity by adjusting one parameter C and keeping the others fixed. This method offers a higher benefit in terms of computation time and memory cost.
The attention mechanism used in the traditional Transformer has quadratic complexity, while O(L) computational complexity is reached in FEDformer [26] due to the use of low-rank approximation. Table 6 summarizes the complexity analysis of some Transformer models applied to time series forecasting.

TABLE 6. Complexity analysis of transformer-based models.

3) EFFECTIVENESS OF TRANSFORMERS IN FORECASTING
As the role of the Transformer becomes increasingly prominent in the field of deep learning, more publications have begun to debate its effectiveness in time series


forecasting. In [76], a simple linear model was demonstrated to outperform complex Transformer models. This raised doubts about the necessity of using the Transformer in time series forecasting.
Later on, a mixer structure appeared in the MTS-Mixers model [77] design. The design was adopted from the computer vision (CV) field to work in the time series prediction scenario. The function of the mixer is to create a mix based on the factorization of the time dimension and the channel dimension. The source time series is divided into multiple sub-sequences, each sub-sequence is learned from its temporal information respectively, and the results are then pieced together in the original order. By utilizing full connections within and across channels, the MTS-Mixers model replaces the complex Transformer model and achieves superior outcomes.
Recently, the TiDE model [78] was proposed, which is completely composed of fully connected MLPs without resorting to any attention mechanism, RNNs, or CNNs. The paper reported that the design achieved state-of-the-art results on multiple datasets and beat the Transformer once again.

V. EXPERIMENTS
Although the deep learning models covered in this paper all take time series input data, each of them may be designed to deal with different primary forecasting objectives. We would still like to examine the prediction performances among these models when they are tested on the same input datasets running on an identical computing platform. Although the ranked results may not reflect the excellence of individual architectural designs, due to their different design goals, we can at least gain insight into the relationship between design ideas and potential forecast accuracy.

A. THE SETUP
For the experiments, we selected the datasets ETT-small-m1 and ETT-small-h1 for multi-step predictions. The ETT dataset [32] consists of information gathered from electricity transformers, including the oil temperature and six power loading parameters. The dataset spans two years and records one data point per minute or per hour, as denoted by m and h, respectively. That is, the datasets ETT-small-m1 and ETT-small-h1 were used in our experiments.
Regarding the setup, the historical horizon length and prediction length for the dataset were both set to 96. The learning rate (Lr) was 0.0001, with a total of 4 epochs and a batch size of 32. The best results obtained are shown in bold in Table 7.

FIGURE 9. Histogram plots of performance of different deep learning models.

B. RESULT ANALYSIS
In Figs. 9a and 9b, the predictive performance in terms of MAE and MSE among the nine different models under test is plotted in histograms. In each graph, the error measures against the two datasets, ETT-small-m1 and ETT-small-h1, are displayed. Among the nine selected models, SCINet demonstrates the best overall accuracy on both the MAE and MSE measures. We attribute the results to the fact that SCINet can better capture the time dependency in both short-term (local temporal dynamics) and long-term (trend, seasonality) scenarios. Moreover, this design is especially accurate in long-term forecasting.
The MTS-Mixers model performs second best on both the MAE and MSE measures. Likely, MTS-Mixers realizes information exchanges between the channel and token dimensions through the MLP. Although the mixer structure was originally designed for computer vision, the MTS-Mixers model successfully transplanted the design into an effective component for forecasting. Besides, the trivial MLP design proves useful and effective in time series forecasting.
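For reference, the two error measures used here are MAE = (1/n) Σ |y − ŷ| and MSE = (1/n) Σ (y − ŷ)²; a minimal implementation:

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))        # mean absolute error

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)         # mean squared error

y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.0, 1.5, 2.0, 2.0])
print(mae(y_true, y_pred))   # 0.375
print(mse(y_true, y_pred))   # 0.3125
```

MSE squares the residuals and therefore penalizes large forecast errors more heavily than MAE, which is why the two rankings in Figs. 9a and 9b can differ slightly.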


TABLE 7. Forecasting performance of different types of models based on the ETT-small-m1 and ETT-small-h1 datasets (on an NVidia RTX 3080 video card).

Through the experiments, the performance of FEDformer is close to that of the MTS-Mixers. Season and trend decomposition methods are adopted in the FEDformer model, and Fourier analysis is combined with a Transformer-based method. The experimental results show that the predictive performance of the FEDformer model is much better than that of methods based on ordinary Transformers. The MTGNN model performs poorly on both MSE and MAE; its predictive power on this dataset is relatively weak, requiring further tuning of model parameters or the use of more complex structures to improve performance.

VI. DISCUSSION AND FUTURE RESEARCH
In this paper, we have systematically outlined various approaches for carrying out time series forecasting, from fundamental principles to the latest deep learning designs. We have categorized these deep learning models into Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Graph Neural Network (GNN), and Transformer-variant classes. Indeed, upon understanding the model architectures and design ideas, each model may pursue its own goals in its time-series application fields. The respective application fields of the reviewed deep learning models are listed in Table 1.

Since time-series forecasting covers a broad range of applications, the availability of datasets in the respective areas is crucial for training the deep learning models. For example, in finance, some models are designed to pursue accurate stock price predictions. Models that predict market trends accurately can assist investors in making informed decisions, which may further help earn investment profits. One reviewed model designed specifically for finance is the DA-RNN, which can be applied to stock market predictions. In fact, other time-series models, e.g., SCINet, MTGNN, Autoformer, and FEDformer, may also work well in financial markets. Datasets and models in other sectors, such as energy, transportation, weather forecasting, and healthcare, are also discussed.

A. FUTURE RESEARCH
A well-crafted time series model should address a spectrum of challenges, such as forecasting accuracy, long-term dependency, noise, nonlinearity, and imbalanced and non-smooth data issues. While our study offers insights by comparing deep learning models for time series predictions, their differences in performance warrant further exploration.

Conceptually advanced deep learning models (e.g., the Mamba model) and research paradigms (e.g., stable diffusion) have recently been emerging. They potentially usher in new platforms for constructing novel model designs. Lately, the Mamba design has proven innovative in the fields of natural language processing, genomics, and audio analysis. It uses a linear-time sequence modeling architecture, combined with a selective state space, to offer superior performance across different modalities. The selective mechanism in Mamba accelerates inference significantly. It addresses the computational challenges faced by traditional Transformers when dealing with long sequences; its inference speed is almost five times faster than that of the regular Transformer model. Our expectation is that novel deep learning models based on Mamba will be developed in the near future. Another issue relates to the available time series data, especially the move from data-rich samples to small-sized, irregularly spaced time series data for training backbone models. Transformer-based models are prone to overfitting on small datasets, which requires in-depth thinking and appropriate solutions for time series data. One potential research topic is to use, for example, a stable diffusion mechanism to create more statistically consistent data samples. This can be an important direction for future research, particularly for small datasets with irregularly sampled datapoints. New model architectures and techniques should be investigated to overcome the limitations of the current Transformer model and achieve better overall performance and generalization capabilities.

Another associated research topic is ‘‘decomposition learning’’ in time series forecasting. Its goal is to decompose complex time series data into vectors of various dimensions, which may improve the effectiveness of subsequent downstream tasks. In general, forecasting involves multiple influencing factors and variables, and complex interactions among these factors may arise. Thus, the goal of decomposition learning is to separate these complex factors and obtain appropriate vector representations of each dimension, such that each vector better captures a specific aspect of the characteristics of the data. Our expectation for decomposition learning is the generation of more accurate, interpretable, and dynamic time series forecasting models for real-world applications.

In summary, despite its challenges and complexity, deep learning holds great promise for model development. Emphasis should be placed on enhancing model interpretability, generalization, integration of external factors, real-time prediction, uncertainty estimation, and other pertinent aspects to advance the field of time series forecasting.
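The season-and-trend decomposition adopted in models such as FEDformer, and the broader ‘‘decomposition learning’’ direction discussed above, can be illustrated with a minimal moving-average split. The kernel length and the sample series below are illustrative assumptions, not settings from our experiments:

```python
import numpy as np

def series_decomp(x, kernel=25):
    """Split a 1-D series into a trend (moving average) and a seasonal part.

    Edges are padded by replicating the boundary values so the trend keeps
    the original length, mirroring the padding used in Autoformer-style
    decomposition blocks.
    """
    pad_left = (kernel - 1) // 2
    pad_right = kernel - 1 - pad_left
    padded = np.concatenate(
        [np.repeat(x[0], pad_left), x, np.repeat(x[-1], pad_right)]
    )
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    seasonal = x - trend
    return seasonal, trend

# Illustrative series: a slow ramp (trend) plus a daily-like sine (seasonality).
t = np.arange(96)
x = 0.05 * t + np.sin(2 * np.pi * t / 24)
seasonal, trend = series_decomp(x)
assert seasonal.shape == trend.shape == x.shape
assert np.allclose(seasonal + trend, x)  # the split is exact by construction
```

Decomposition-based deep models such as Autoformer apply this kind of split repeatedly inside the network and model the seasonal and trend components with separate branches.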
VII. CONCLUSION
Time series forecasting is an excellent tool that provides predictive insights, offers decision-making strategies, and spans diverse applications through time series data analysis. In this paper, we have undertaken a comprehensive review of publicly available time series datasets and explored various approaches for designing effective time series models. We start with an overview of classical statistical methods, such as ARIMA, and then delve into an extensive exploration of different deep learning models, such as Transformer-based architectures. Each architectural paradigm embodies distinct design principles and advantages. It is crucial to recognize that different datasets may necessitate tailored models, and that the computational setup can influence model performance. Our goal is to ascertain the optimal deep learning model for a given dataset and computational environment. Through our experiments with different deep learning models, the SCINet demonstrates the best performance, with mean squared errors (MSEs) of 0.21 and 0.37 and mean absolute errors (MAEs) of 0.3 and 0.4 on the ETT-small-m1 and ETT-small-h1 datasets, respectively. Although our testing configuration does not imply that SCINet performs best across all time-series applications, it represents a well-crafted model whose architectural design is worth understanding. For rigorous investigations, novel conceptual designs such as Mamba should also be considered for future research in time-series forecasting.
WENXIANG LI is currently pursuing the Ph.D. degree in applied computer technology with the Faculty of Applied Sciences, Macao Polytechnic University, Macau, China. Her current research interests include deep learning modeling and forecast analysis of time series data.

K. L. EDDIE LAW received the [Link]. (Eng.) degree in electrical and electronic engineering from The University of Hong Kong, the M.S. degree in electrical engineering from the Tandon School of Engineering (Polytechnic University, New York), New York University, and the Ph.D. degree in electrical and computer engineering from the University of Toronto. From 1995 to 1999, he was a Member of Scientific Staff with Nortel Networks, Ottawa, Canada, carrying out research on Passport switches, protocol designs, and a scalable proxy web server system in the Computing Technology Laboratory. From 1999 to 2003, he was an Assistant Professor with The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto. From 2004 to 2010, he was an Associate Professor with Toronto Metropolitan University (Ryerson University), Canada. Since 2019, he has been with the Faculty of Applied Sciences, Macao Polytechnic University, Macau. He holds U.S. patents and has published refereed articles in journals, conferences, magazines, and books. Among his latest research interests, he works on data lake designs, distributed computing, computer networking, deep learning model designs, blockchains, and IoT systems.