Review of Deep Learning for Time Series
ABSTRACT Time series forecasting involves making scientifically grounded assertions about potential states or future trends of an event based on historical data recorded at various time intervals. The field of time series forecasting, supported by diverse deep learning models, has made significant advancements, rendering it a prominent research area. The broad spectrum of available time series datasets serves as a valuable resource for conducting extensive studies in time series analysis with varied objectives. However, the complexity and scale of time series data present challenges in constructing reliable prediction models. In this paper, our objectives are to introduce and review methodologies for modeling time series data, and to outline the commonly used time series forecasting datasets and evaluation metrics. We delve into the essential architectures for modeling an input dataset and offer a comprehensive assessment of recently developed deep learning prediction models. Because different models serve different design goals, we examine the performance of these models on the same time series input dataset with an identical hardware computing system. The measured performance may reflect the design trade-offs among the ranked models. In our experiments, the SCINet model performs best in accuracy on the ETT energy input dataset. The results we obtain offer a glimpse into the relationship between model design and performance. We conclude the paper with a discussion of future deep learning research directions in the realm of time series forecasting.
INDEX TERMS Dataset, deep learning, evaluation metrics, neural network models, time series forecasting,
Transformer models.
are explored to tackle specific issues related to time series forecasting.

Deep learning models, in particular, exhibit promising results following their exceptional performance in the fields of natural language processing (NLP) and computer vision (CV). Each model presents a potential solution for addressing time series forecasting challenges. Certain frameworks have been leveraged for multi-objective and multi-granularity prediction [11], as well as multi-modal time series forecasting [12].

In deep learning, we anticipate continuous advancements and accomplishments in developing sophisticated techniques for time series data prediction. Many existing review articles on time series forecasting have focused mostly on classic parametric models and traditional machine learning algorithms, but they lack an exposition of the latest advancements in Transformer-based models and empirical comparative analyses on commonly used datasets across different industries. In this paper, we attempt to categorize existing time series prediction methods, delineate datasets, and scrutinize the limitations of current methodologies. Our contributions are as follows:
1) The classical time series prediction approaches are discussed.
2) Typical deep learning models related to time series forecasting are thoroughly investigated, categorized, and summarized.
3) The existing issues in time series forecasting are discussed, and the latest problems that should be prioritized are identified.
4) The prediction performance of different recent models is compared experimentally.

The rest of the paper is organized as follows. Section II provides an overview of the popular publicly available datasets and some commonly used evaluation metrics in time series forecasting applications. The classical models for time series forecasting can be found in Section III. In Section IV, the designs and implications of deep learning models for TSF are discussed. Section V presents experimental comparisons of predictive performance for different models. Section VI summarizes future research directions in time series methods and the unresolved issues. Finally, the content of this paper is summarized in Section VII.

II. MODEL SETUP, DATASETS, METRICS
In this section, before delving into the discussions of various state-of-the-art designs, we review the available time series datasets that can be used for training and testing different potential models. We start by characterizing an algebraic foundation of a forecast process, which predicts future values or trends based on past time series data for applications in diverse fields such as finance, economics, meteorology, and sales.

In time series analysis, data is arranged in chronological order, and each data point is associated with a specific timepoint. Many factors may cause the fluctuations of variables in time, such as the general movement, (non-)stationarity, trend, periodicity, seasonal variation, volatility, and noise of the data. Upon understanding the patterns of these factors, we may formulate proper predictions and decisions.

A time series forecasting model predicts the future values of a given entity over time. In its simplest form, a one-step-ahead forecast model takes the form:

ŷ_{i,t+1} = f(y_{i,t−k:t}, x_{i,t−k:t}, s_i),    (1)

where ŷ_{i,t+1} denotes the prediction of a model, y_{i,t−k:t} = {y_{i,t−k}, . . . , y_{i,t}} and x_{i,t−k:t} = {x_{i,t−k}, . . . , x_{i,t}} are the observed values of the target and exogenous inputs within a backcasting window of k periods, respectively. The s_i represents the static metadata associated with the entity, f(·) signifies the learned prediction function of the model, and i indexes a given entity. Each entity represents a logical grouping of temporal information – such as the measurements from different weather stations in climatology, or vital signs from different patients in a medical study – and can be observed at the same time.

The prediction functions f(·) can be realized through domain-specific model designs elaborated in subsequent sections. A function and its corresponding model can be trained and validated using application-specific datasets. In the ensuing discussion, various types of datasets will be examined, and the commonly employed evaluation metrics for time series forecasting algorithms will be outlined.

A. DATASETS
A collection of observations measured chronologically is needed to formulate and validate a time series model. The developed model should be able to characterize a relation between data points in a given dataset. Currently, there are publicly available datasets for different applications, e.g., finance, industrial energy consumption, urban transportation, medicine, and meteorology. The commonly used datasets, associated with an array of deep learning models, are extracted and listed together in Table 1. The basic information of the datasets in different domains is presented below for reference.

1) FINANCIAL SECTOR
In finance, forecasting is frequently used to predict business cycles and long-term stock market movements. It helps financial firms and traders alike find opportunities and make responsible decisions about revenue and expenditure based on estimates. For instance, forecasting can assist investors in the stock market with developing well-rounded investing strategies, forecasting future trends and volatility in stock prices, and selecting more advantageous investments. Additionally, forecasting can help financial institutions assess loan risk [28] by anticipating future borrower repayment capability and credit hazards, or by projecting future interest rate patterns to improve monetary and interest rate policies. Curated datasets include:
Gold Prices [29]: The dataset documents the daily gold prices (in US dollars) from January 2014 to April 2018, with a granularity of days. It uses values for minimum, mean, maximum, median, standard deviation, skewness, and kurtosis to characterize the distribution.

Exchange-Rate [30]: It aggregates daily exchange rates across eight currencies spanning from 1990 to 2016: Australia, UK, Canada, Switzerland, China, Japan, New Zealand, and Singapore.

Stock Opening Prices [31]: From Yahoo Finance, the dataset compiles the daily opening prices of 50 equities across 10 industrial sectors between 2007 and 2016, with 5 top businesses in each sector. There are a total of 2,518 data points per stock.

2) ENERGY SECTOR
Forecasting is often used in the industrial energy sector to support long-term strategic resource planning. It helps businesses and governments forecast future energy needs, such as those for electricity, oil, and natural gas, and enables more efficient energy production and supply planning. The following open-source datasets are relevant for forecasting industrial energy:

Electricity Transformer Temperature (ETT) [32]: The curated data captures load and oil temperature data of power transformers at 15-minute intervals from July 2016 to July 2018. It is composed of hourly granularity data (ETTh1, ETTh2) and 15-minute granularity data (ETTm1).

Solar Energy [33]: With a 5-minute sampling interval, it documents the highest solar energy generation capacity of 137 photovoltaic power stations in Alabama in 2006.

Electricity [33]: The dataset tracks the kilowatt-hours of power used by 321 customers between 2011 and 2014. The intervals between data points are 15 minutes.

3) URBAN TRAFFIC
In order to predict future road traffic flow patterns and congestion, forecasting assists city traffic management departments, streamlining traffic planning and management. Additionally, long-term forecasting helps identify traffic safety issues, predict future traffic accident risks, and enhance management of traffic safety and accident prevention. The following open-source datasets are relevant to predicting urban traffic:

Paris Metro Line [34]: The dataset, consisting of 1,742 observations with a granularity of 1 hour, tracks the passenger movements on Lines #3 and #13 of the Paris Metro.

PeMS03, PeMS04, PeMS07, PeMS08 [17]: The PeMS system is the source of these datasets. The technology keeps track of traffic flow data that is averaged every five minutes and gathered every 30 seconds in places such as the California Bay Area. Traffic flow, rate of occupancy, and speed are provided with timestamps for each sensor on the highway.

Birmingham Parking [35]: The dataset encompasses the operational data of 30 parking lots managed by Birmingham National Parking, including parking lot identities, capacity, occupancy rates, and update time. The data spans from October 4, 2016, to December 19, 2016, from 8:00 to 16:30 daily, with a granularity of 30 minutes.

4) ENVIRONMENTAL ISSUES
Long-term climatic trend forecasting is one important application of time series forecasting. Both meteorology and pollution issues have significant impacts on industries such as agriculture, marine transportation, and natural meteorological catastrophe warning systems. Freely available datasets related to pollution and weather forecasting include:

Weather Pollutants [36]: The dataset encompasses pollution data (PM2.5, PM10, NO2, SO2, O3, CO) gleaned every hour from 76 stations in the Beijing-Tianjin-Hebei region between January 1, 2015, and April 1, 2016. It also includes hourly meteorological observations during the same period, incorporating variables such as wind speed, wind direction, temperature, air pressure, and relative humidity.

Beijing PM2.5 [37]: For the city of Beijing in China, it includes hourly PM2.5 data and related meteorological data. The dataset includes other variables such as dew
point, temperature, air pressure, combined wind direction, cumulative wind speed, snowfall hours, and rainfall hours. In total, there are 43,824 multivariate sequences. The curated PM2.5 data serve as the target of this series.

Hangzhou Temperature [38]: This dataset records the daily average temperature in Hangzhou from January 2011 to January 2017, totaling 2,820 data points.

Weather [23]: The dataset contains United States climate data from nearly 1,600 locations, with more than 35,000 records collected from January 1, 2010 to December 31, 2013, at 1-hour intervals. Each record consists of a target "wet-bulb" temperature and 11 other climatic features.

5) MEDICAL DATA
Forecasting finds applications at various stages of pharmaceutical development. It aids in predicting drug toxicity, pharmacokinetics, pharmacodynamics, and other parameters, facilitating optimization of the design and screening processes [39]. Furthermore, long-term forecasting contributes to predicting market prospects and sales of specific medications, thereby boosting the formulation of effective marketing strategies. The following open-source datasets are relevant to medical-related studies:

Influenza-Like Illness (ILI) [40]: This dataset contains weekly records of influenza-like illness patients from 2002 to 2021, as documented by the Centers for Disease Control and Prevention (CDC). It provides insights into the proportion of ILI patients relative to the total patient population.

COVID-19 [41]: The dataset includes daily information on confirmed and recovered cases gathered from six nations (including Italy, Spain, China, the United States, and Australia) between January 22, 2020, and June 17, 2020.

Medical Information Mart for Intensive Care (MIMIC)-III [9]: The dataset collected more than 58,000 hospital admission records from Beth Israel Deaconess Medical Center (BIDMC) in the United States from 2001 to 2012, including a variety of clinical features. Chief among them are the ICU admission data, blood sugar levels, and heart rates.

MIMIC-IV [42]: This dataset, which incorporates several improvements over MIMIC-III, includes data updates and partial table reconstruction. It collected clinical data from more than 190,000 patients and 450,000 hospital admissions at BIDMC from 2008 to 2019.

B. EVALUATION METRICS
First-order and second-order evaluation metrics are commonly used in predictive analysis. These indices offer the advantage of simplicity in computation, and enable comparison across different methods or models on the same dataset. They can be calculated quickly even if the sizes of the datasets are large. Typically, the popular indicators are:

MAE = (1/m) Σ_{i=1}^{m} |y_i − ŷ_i|    (2)

MSE = (1/m) Σ_{i=1}^{m} (y_i − ŷ_i)^2    (3)

RMSE = sqrt( (1/m) Σ_{i=1}^{m} (y_i − ŷ_i)^2 )    (4)

MSLE = (1/m) Σ_{i=1}^{m} (log(y_i + 1) − log(ŷ_i + 1))^2    (5)

RMSLE = sqrt( (1/m) Σ_{i=1}^{m} (log(y_i + 1) − log(ŷ_i + 1))^2 )    (6)

where ŷ_i represents the predicted values, y_i the true values, and m the sample size.

Among them, the most used metrics would be the MAE (mean absolute error), the RMSE (root mean squared error), and the MSE (mean squared error). The main difference between RMSE and MSLE (mean squared logarithmic error) lies in how sensitive each is to the outliers in a particular dataset. The MSLE exhibits greater robustness when assessing datasets containing outliers, while the RMSE demonstrates relatively higher susceptibility to outlying data points. In contrast, the MAE reflects the actual error situation better than RMSE, whereas the RMSLE (root mean squared logarithmic error) proves valuable in cases where underestimation is penalized more severely than overestimation.

Although these measures in general can be evaluated quickly regardless of the size of a dataset, they do suffer certain limitations: 1) some of them do not reflect the scenarios where the true values approach zero, and 2) some indices may exhibit skewed distributions if the true values are close to zero. There are some other useful evaluation metrics, i.e.,

MAPE = (100/m) Σ_{i=1}^{m} |(y_i − ŷ_i) / y_i|    (7)

sMAPE = (1/m) Σ_{i=1}^{m} |y_i − ŷ_i| / ((|y_i| + |ŷ_i|)/2)    (8)

IA = Σ_{i=1}^{m} (y_i − ŷ_i)^2 / Σ_{i=1}^{m} (|ŷ_i − ȳ| + |y_i − ȳ|)^2    (9)

R^2 = 1 − Σ_{i=1}^{m} (y_i − ŷ_i)^2 / Σ_{i=1}^{m} (y_i − ȳ)^2    (10)

where ȳ denotes the mean of the true values.

The MAPE (Mean Absolute Percentage Error) is a commonly used metric that calculates the absolute percentage of the prediction error for each observation. It is intuitive, easy to understand, and works well for datasets with a small number of outliers. It provides a measure of the average magnitude of the percentage error.

On the other hand, the sMAPE (Symmetric Mean Absolute Percentage Error) is a symmetric percentage error indicator that takes into account the proportional relationship between the actual and predicted values. It provides a balanced
measure of the error between the predicted and actual values. The sMAPE is particularly useful for datasets that require symmetric treatment of errors between the actual and predicted values.

Another important metric is the IA (Interval Accuracy), which measures the accuracy of the confidence interval of a time series prediction. It is suitable for forecasting problems that require consideration of uncertainty, such as meteorological prediction and financial risk estimation.

Lastly, the R^2 (coefficient of determination) is a measure of how well a prediction model fits the observational data. It represents the proportion of the variance in the observed data that can be explained by the model. R^2 is commonly used in linear models and provides a comprehensive evaluation of prediction performance.

III. CLASSICAL APPROACHES FOR TSF
A. PREDICTION METHODS FOR STATIONARY DATA
The traditional time series forecasting methodologies, such as autoregressive (AR), moving average (MA), and autoregressive moving average (ARMA), are essentially based on statistical techniques. For stationary time series forecasting, Yule [4] proposed a new symbolic system, which played an important role in multivariate time series analysis. He later introduced the autoregressive (AR) model, which was the first to treat randomness in time series analysis: every time series was considered a stochastic process in its own right. One of the key techniques for forecasting stationary time series data would be the ARMA model [43], which was popularized by Box and Jenkins in 1970.

The ARMA(p, q) is a linear combination of two linear AR(p) and MA(q) components, where AR(p) is an autoregressive model of order p, and MA(q) is a moving average model of order q. The ARMA is frequently used to simulate stationary time series. The time series data is treated as random sequences, with the temporal continuity of the original data being reflected in the correlation between random variables. For the analysis and forecasting of shifting trends in stationary time series, these models offer a potent tool.

Then an adaptive ARMA model [44] was introduced to handle the short-term power load forecasting of a power generation system. Through experimental findings, the adaptive ARMA model performed more accurately than the conventional ARMA model for forecasts made one week and 24 hours in advance. To increase the precision of prediction intervals, a prediction interval algorithm [45] was proposed based on a bootstrap distribution for setting (p, q).

Furthermore, the univariate time series models can be mapped to multivariate time series models by drawing inspiration from Granger causality¹ [46]. The vector autoregressive moving average (VARMA) model is based on the ARMA model, and it creatively combines the vector autoregressive (VAR) model with the vector moving average (VMA) model. However, stationary time series data is a necessary condition for building a VARMA model. If a given series is non-stationary, a differencing operation is required to achieve stationarity. The Vector Error Correction Model (VECM) [47], which takes into account the cointegration relationship between time series, is a solution to this problem. These research endeavors have driven further advancements in time series forecasting and provided new approaches and methods to tackle complex time series analysis issues.

On balance, the ARMA time series forecasting model has made notable progress in addressing stationary time series forecasting. However, purely stationary time series data is rarely found in practice, which arguably limits the applicability of the ARMA model. Hence, for better handling of non-stationary time series data, more adaptive and malleable forecasting models and methods are needed.

B. PREDICTION METHODS FOR NON-STATIONARY DATA
Non-stationary sequences refer to ordered lists that exhibit characteristics such as trends, seasonality, or periodicity. These sequences demonstrate consistency in local levels or trends, where some data points closely resemble others. To transform non-stationary sequences into stationary ones, differencing operations can be employed for adjustment.

The ARIMA(p, d, q) model comprises three components: Autoregressive (AR), Integrated (I), and Moving Average (MA), where p and q are the orders of the AR and MA processes, respectively, and d is the number of differencing operations. The AR component uses past observations to predict the current value. The MA captures noise and random fluctuations within the sequence. The I makes a non-stationary time series stationary by performing differencing operations on the non-stationary sequence.

The ARIMA model, for non-stationary time series, is capable of capturing variations in different data patterns, and its parameter estimations are relatively intuitive. Consequently, the ARIMA model has been widely adopted in practical applications. The regular AR(p) model can be expressed as:

y_t = c′ + ε_t + Σ_{i=1}^{p} ϕ_i y_{t−i}    (11)

where y_t represents the current value, c′ a constant term, p the order of autoregression, ϕ_i the i-th autoregressive coefficient, and ε_t the white noise error term. It is more convenient to use the backward shift (or lag) operator B to represent the ARIMA model. Ignoring the error term for the moment, the AR(p) can be represented as

(1 − Σ_{i=1}^{p} ϕ_i B^i) y_t = c′.    (12)

¹ The definition of Granger causality is that a series x_i is deemed not to be "causal" of another series x_j if leveraging the history of series x_i does not reduce the variance of the prediction of series x_j.
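As a quick numerical illustration of the AR(p) form in Eqn. (11), the following self-contained Python sketch simulates an AR(2) process and recovers its coefficients by ordinary least squares; the coefficients, noise scale, and series length are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: y_t = c' + phi1*y_{t-1} + phi2*y_{t-2} + eps_t
c, phi1, phi2 = 0.5, 0.6, -0.3   # illustrative (stationary) parameters
n = 5000
y = np.zeros(n)
eps = rng.normal(0.0, 1.0, n)
for t in range(2, n):
    y[t] = c + phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]

# Estimate (c', phi1, phi2) by least squares on the lagged regressors
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])  # [1, y_{t-1}, y_{t-2}]
coef, *_ = np.linalg.lstsq(X, y[2:], rcond=None)
print(coef)  # close to [0.5, 0.6, -0.3]
```

With a long enough series, the least-squares estimates converge to the true parameters, mirroring how the AR component of an ARIMA model is identified in practice.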
The regular MA(q) expression characterizes the noise error term, and we have:

y_t = µ + ε_t + Σ_{i=1}^{q} θ_i ε_{t−i}    (13)

where y_t indicates the current value, µ a constant term, q the order of the moving average, θ_i the i-th moving average coefficient, and ε_t the white noise error term. With the backward shift operator, we have

y_t = µ + (1 + Σ_{i=1}^{q} θ_i B^i) ε_t.    (14)

The I(d) model differences the non-stationary sequence d times, i.e.,

y′_t = (1 − B)^d y_t    (15)

where y′_t signifies the differenced sequence, and d is the order of differencing. Combining the AR, I, and MA expressions, we have the ARIMA(p, d, q) model:

(1 − Σ_{i=1}^{p} ϕ_i B^i)(1 − B)^d y_t = c + (1 + Σ_{i=1}^{q} θ_i B^i) ε_t    (16)

where c is the aggregate constant term. Eqn. (16) represents the ARIMA process. By choosing the parameters p, d, and q appropriately, the ARIMA models are suitable for different time series predictions and analyses.

The ARIMA model has found widespread use in time series forecasting. For instance, Valipour [7] showed that the ARIMA model predicted precipitation well in important areas. In [48], it was used to forecast and analyze trends in the PM2.5 concentration in Fuzhou, China. The ARIMA model was used by Malki et al. [49] to forecast the COVID-19 outbreak's second wave and potential end times in 2020. Shi et al. [50] proposed a novel model based on tensor decomposition, BHT-ARIMA, to forecast multiple short time series; the experiments demonstrated good results on classical datasets, such as electricity and traffic. In this design, a time series is split into multiple blocks and the model utilizes Hankel tensor properties to construct the proper tensor representation. Then, with a tensor decomposition method, the decomposed tensors are used for feature extraction in the time series.

In conclusion, the ARIMA model is applied to time series data by making it a stable series through differencing. However, this particular treatment may also restrict its ability to capture nonlinear relationships. Therefore, to better handle nonlinear relationships and complex time series patterns, alternative models and methods merit consideration, such as those based on machine learning and deep learning, with a view to ensuring increased accuracy and adaptability of time series forecasting.

IV. DEEP LEARNING AND TSF
In this section, we will explore various cutting-edge deep learning architectures for time series forecasting. These models are typically categorized into four main groups:
1) Convolutional Neural Networks (CNNs),
2) Recurrent Neural Networks (RNNs),
3) Graph Neural Networks (GNNs), and
4) Transformers.
Each of these deep learning models possesses distinct advantages and suitability for time series forecasting. Choosing the appropriate model depends on the characteristics of the data, the complexity of the problem, and the performance requirements. Furthermore, effective hyperparameter tuning and data preprocessing are critical for achieving accurate predictions. This section provides an overview of models belonging to the classes of CNN-, RNN-, GNN-, and Transformer-based time series forecasters. The design characteristics of these models will be summarized in tables later on.

A. CNN-BASED MODELS
Convolutional Neural Networks (CNNs) gained significant attention with the pioneering work done by LeCun in 1989. The models he developed focused on processing grid-like topological data such as images and time series data. CNN has emerged as one of the most powerful techniques for understanding image content and has demonstrated exceptional performance in image recognition, segmentation, detection, and retrieval tasks. Over the years, CNN architectures have witnessed numerous advancements, leveraging depth and spatial aspects to achieve innovative designs.

In general, based on specific architectural modifications, CNNs can be broadly classified into seven categories: spatial utilization, depth, multi-path, width, channel boosting, feature map utilization, and attention-based CNN. Fig. 1 illustrates the classification of different deep CNN models. CNN is widely adopted in the field of computer vision due to its ability to effectively handle dimensionality reduction and process long sequences in parallel. Though originally designed for picture datasets to extract local associations across spatial dimensions [51], many recent works have explored applications in time series forecasting. A CNN can be modified for time series datasets by introducing additional layers of causal convolutions [52]. The filters that realize these convolutions ensure that only historical data is used to make predictions. Each causal convolution filter has the following structure for intermediate feature extraction in hidden layer l:

h^{l+1}_t = A((W ∗ h)(l, t))    (17)

(W ∗ h)(l, t) = Σ_{τ=0}^{k} W(l, τ) h^{l}_{t−τ}    (18)

where h^{l}_t ∈ R^{H_in} represents the intermediate state at time t in layer l, and W(l, τ) ∈ R^{H_out×H_in} signifies the fixed filter
weights in layer l, and A(·) denotes the activation function. H_out and H_in represent the numbers of output and input channels in the convolutional layer, respectively.

1) VARIANTS OF CNN MODELS
Temporal Convolutional Networks (TCNs) [53] are widely used models for analyzing and modeling time series data. They employ causal convolutions to prevent information leakage. Fig. 2a shows the architectural elements in a TCN. A TCN overcomes the challenges of gradient vanishing and explosion by incorporating a different backpropagation path in the temporal direction.

HyDCNN (Hybrid Dilated CNN) [13] is another effective approach that captures nonlinear relationships between sequences using dilated causal convolutions, and linear dependencies through autoregressive modules. Fig. 2b shows the architecture of HyDCNN. This hybrid framework addresses the limitations of traditional CNNs in capturing long-term correlations. The WaveNet-CNN [54] model is introduced by combining the WaveNet speech sequence generation model with CNNs. The model utilizes the ReLU activation function and exhibits superior performance in handling financial analysis tasks.

To address the limitations of CNNs in handling large datasets, a Kmeans-CNN [55] model is introduced, which fuses CNN with the K-means clustering algorithm, enabling the learning of more informative features. By clustering similar samples and segmenting large datasets, Kmeans-CNN achieves excellent performance in processing large-scale power load data.

In 2023, Wu et al. [56] proposed a new framework, called SACLSTM, that combines CNN and LSTM for more accurate stock price prediction. An array of sequences of historical data and its leading indicators (options and futures) is constructed and used as an input image to the CNN framework. Certain feature vectors are extracted through the convolutional and pooling layers and are used as input vectors to the LSTM.

2) CNN AND ATTENTION MECHANISM
To address the challenge of capturing long-term dependencies in multivariate long-term forecasting, Huang et al. [57] developed the Dual Self-Attention Network (DSANet), which is suitable for dynamic periodic or aperiodic sequences and simultaneously considers both global and local temporal patterns. Fig. 2c shows the architecture of DSANet. Unlike traditional recurrent neural networks (RNNs) with recursive structures, DSANet employs parallel convolutional layers to capture complex mixed patterns of global and local time. Additionally, the model leverages self-attention mechanisms to learn the correlations among multiple time series. This approach allows DSANet to effectively capture long-term dependencies in time series data and enhance the performance of multivariate long-term forecasting.

3) CNN AND SEQ2SEQ
In recent years, there has been interest in integrating both CNN and Sequence-to-Sequence (Seq2Seq) models for optimizing long-term time series forecasting. SCINet, a network model developed by Liu et al. [14], is a technique for extracting useful and distinguishing temporal properties from downsampled subsequences or features. Fig. 2d shows the architecture of SCINet. SCINet efficiently models complicated multi-resolution time dynamics and captures local temporal dynamics, trends, and seasonal features by aggregating rich data from several resolutions. This allows for precise time series forecasting. The application of this comprehensive approach has resulted in notable progress in the field of long-term time series forecasting. Table 2 summarizes the recent applications of CNN-based models in the context of TSF.

B. RNN-BASED MODELS
Elman [58] was the first to introduce recurrent neural networks (RNNs). RNNs belong to a type of neural network architecture designed for handling sequential data, such as time series or text written in natural language. Unlike
traditional feedforward neural networks, RNNs incorporate of the network. RNNs can keep track of previous data via this
recurrent connections within the network, enabling informa- memory propagation technique, and they can use that data to
tion to be passed and persisted over time. The core idea adjust current outputs. Fig. 3 shows the internal organization
of RNN is to use the output of previous time step as the of a typical RNN.
input at current time step. This establishes dependencies In the diagram, the xt represents the input vector at time
between sequential data, enables RNNs to handle variable- step t, and ht the hidden vector at time step t. It can
length sequences, and captures temporal correlations and be observed that the traditional RNN neuron receives the
contextual information within the sequences. previous hidden state ht−1 and the current input xt . The
The hidden state in RNN serves as the memory unit in core of an RNN lies in its RNN unit, which includes an
network for each time step. Each time step updates the hidden internal memory state that summarizes past information. The
state, which is then passed to the following layer or time step calculation formula for the internal hidden state of an RNN
is as follows:
ht = tanh(wh ht−1 + wx xt + b) (19) FIGURE 4. Internal structure of LSTM.
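The recurrent update in (19) can be sketched in a few lines of NumPy. The dimensions, random weights, and toy series below are illustrative assumptions, not parameters of any model in this survey:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One recurrent update, h_t = tanh(W_h h_{t-1} + W_x x_t + b), as in (19)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Hypothetical dimensions and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
hidden_size, input_size = 4, 1
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

# Unroll over a toy univariate series: the same weights are reused at every
# step, and the hidden state carries information forward in time.
h = np.zeros(hidden_size)
for x in [0.1, 0.3, -0.2, 0.5]:
    h = rnn_step(h, np.array([x]), W_h, W_x, b)

print(h.shape)  # (4,): the final hidden state summarizes the whole sequence
```

The loop makes the weight sharing explicit: every time step applies the same W_h and W_x, which is what lets an RNN process sequences of arbitrary length.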
convolution. Through this approach, the model automatically learns the associations between spatio-temporal data and utilizes them for prediction tasks.

Additionally, MTGNN [17] is a general graph neural network structure that does not require a predefined explicit graph structure. Fig. 6c shows the architecture of MTGNN. The model combines graph convolution modules with temporal convolution modules to capture the spatio-temporal correlations in the data. It employs a graph learning module to adaptively extract sparse graph adjacency matrices from the input data, allowing for flexible and efficient modeling of complex relationships.

Another approach is REST [70], which automatically mines inferred multimodal directed weighted graphs. Fig. 6d shows the architecture of REST. This model combines Edge Inference Networks (EINs) and Graph Convolutional Neural Networks (GCNs). By projecting the spatial correlations between time series nodes from the temporal side to the spatial side, EINs build multimodal directed weighted graphs for the GCNs, enabling the model to effectively capture and utilize multimodal information for predictions.

Addressing dynamic graph problems, Liu et al. [20] proposed TPGNN. Fig. 6e shows the architecture of TPGNN. This network uses time matrix polynomials to express the correlations between dynamic variables in a two-step process. First, the overall correlations are captured using a static matrix basis. Then, matrix polynomials are built for each time step using a set of time-varying coefficients and matrix bases.

To solve the problem that, in the field of time-series prediction, temporal patterns can only be learned through dependencies between individual variables, Chen et al. [21] proposed a multiscale adaptive graph neural network (MAGNN) for MTS prediction. By utilizing a multiscale pyramidal network to model temporal hierarchies, an adaptive graph learning module to automatically infer dependencies between variables, a multiscale temporal graph neural network to model intra- and inter-variable dependencies, and a scale fusion module to facilitate collaboration across different temporal scales, MAGNN outperforms state-of-the-art methods on six datasets.

These methods share the common goal of considering dynamic graph structures in time series forecasting and of utilizing GNN models to capture the spatio-temporal relationships between nodes. These innovative approaches contribute to improving the accuracy and reliability of long-term prediction and drive research development in the field of time series forecasting. Table 4 summarizes recent works on forecasting based on GNN models.

D. TRANSFORMER-BASED MODELS
The original Transformer [71] was introduced in 2017 by Vaswani et al. This model completely departs from traditional
CNN and RNN architectures for sequence tasks. In contrast to RNNs, the Transformer has superior parallelism and allows interaction with global information, thereby effectively capturing correlations within sequence data through its self-attention mechanism. The self-attention mechanism of the Transformer is more interpretable and signifies a significant advancement over many conventional deep learning models. The Transformer relies entirely on attention mechanisms to characterize the global dependencies between the inputs and outputs of the model, as illustrated in Fig. 7.

The core of the Transformer is the self-attention mechanism. The input is represented as a query and a set of key-value pairs, and the computation mechanism is:

A(Q, K, V) = softmax(QK^T / √d) V    (21)

where Q ∈ R^(L_Q × d), K ∈ R^(L_K × d), V ∈ R^(L_V × d), and d represents the input dimension. L_Q, L_K, and L_V are the lengths of Q, K, and V, respectively. The probability formula for the i-th query's attention is:

A(q_i, K, V) = Σ_j [k(q_i, k_j) / Σ_l k(q_l, k_l)] v_j = E_p(k_j|q_i)[v_j]    (22)
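The scaled dot-product attention of (21)-(22) can be sketched as follows. The single-head setting, the choice L_Q = L_K = L_V = L, and the random inputs are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention of Eq. (21): softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # row i holds p(k_j | q_i) of Eq. (22)
    return weights @ V, weights

# Toy single-head example with L_Q = L_K = L_V = 6 and d = 4.
rng = np.random.default_rng(1)
L, d = 6, 4
Q, K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d)), rng.normal(size=(L, d))
out, w = attention(Q, K, V)
print(out.shape)      # (6, 4): one context vector per query
print(w.sum(axis=1))  # each row sums to 1, so it is a probability distribution
```

The L x L weight matrix makes the quadratic cost discussed later in this section visible: every query attends to every key, which is exactly what the sparse-attention variants below try to avoid.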
Informer [23] represents a significant contribution to long-period time series forecasting. It improves on the traditional Transformer model from the perspective of efficiency. For long-period forecasting, the time complexity of the regular Transformer model grows quadratically with the length of the input series. Informer instead creates sparse attention between the keys and the important queries. Testing shows that the attention scores follow a long-tailed distribution between keys and important queries: most of the scores are small, and only a few of them are large. Informer therefore focuses on modeling the pairs with important attention scores, and the rest are ignored. This design creates a sparse attention structure and greatly improves computational efficiency. Informer further introduces self-attention distillation between every two Transformer layers: a convolution operation reduces the length of the sequence by half, which further reduces the training overhead. At the decoder stage, Informer predicts the results of multiple time steps at once, which alleviates the problem of cumulative error; the training speed of the model is also greatly improved. Overall, Informer improves both operational efficiency and performance over the Transformer in long-period time series forecasting tasks. Through its sparse attention and self-attention distillation techniques, Informer provides an effective solution for handling large-scale long-sequence data in real applications. Fig. 8a shows the model design of Informer.

Pyraformer [25] overcomes the computational complexity limitations of the Transformer by using a pyramid attention mechanism to form a multi-resolution representation of the time series, effectively capturing the interdependencies in TSF. Fig. 8c shows the architecture of Pyraformer.

Liu et al. [27] introduced Non-stationary Transformers, which improve the attention mechanism within the Transformer by incorporating non-stationary information, thereby enhancing data predictability and unleashing the excellent time-series modeling capabilities of attention mechanisms. Fig. 8d shows the architecture of Non-stationary Transformers.

In 2023, Du et al. [74] proposed Preformer, a Transformer-based prediction model that introduces a novel and efficient multiscale segment correlation mechanism, which divides a time series into multiple segments and utilizes segment-correlation-based attention instead of point correlation. A multiscale structure is established to aggregate dependencies on different time scales, which facilitates the selection of segment lengths.

b: TRANSFORMER COMBINED MODELS
Lim et al. [22] proposed a method combining LSTM and the Transformer, called the Temporal Fusion Transformer (TFT). Fig. 8e shows the architecture of TFT. The model uses the sequence modeling capability of LSTM to preprocess the input sequences first, so that representations considering context and timing information can be generated at different times. Next, the bottom-layer representation is fed into the upper-layer Transformer to make use of attention's ultra-long-period information extraction capability, compensating for the information loss of sequence models.

Madhusudhanan et al. [75] propose a U-Net-inspired Transformer architecture called Yformer, which is based on a unique Y-shaped encoder-decoder architecture that combines coupled scaling mechanisms with sparse attention modules to capture long-term effects across scale levels.

c: TRANSFORMER AND TIMING ANALYSIS
Zhou et al. [26] introduced the FEDformer model, which incorporates seasonal-trend decomposition within the Transformer. Another of its ideas is to introduce the Fourier transform and use the Transformer in the frequency domain to help the model better learn global information. Fig. 8f shows the FEDformer model. The core modules of FEDformer are the Fourier transform module and the timing decomposition module. The Fourier transform module converts the input time series from the time domain to the frequency domain, replaces Q, K, and V in the Transformer with the frequency-domain information after the Fourier transform, and carries out the Transformer operation in the frequency domain.

Autoformer [24] also uses the decomposition into trend items and seasonal items. To extract the seasonal items, the paper adopts the moving average method. By calculating the average value of each window in the original input time series, the trend item for each window is derived, and the trend item of the whole series is then obtained. The seasonal item, in turn, is obtained by subtracting the trend item from the original input sequence according to the additive model. It is worth noting that, due to the success of Autoformer and FEDformer, exploring self-attention mechanisms in the frequency domain for time series modeling has attracted more and more attention from the community. Table 5 shows the recent works based on Transformer models for TSF.

2) COMPLEXITY ANALYSIS OF TRANSFORMER-BASED MODELS
The time complexity of the Transformer [40] model depends mainly on the length of the sequence and the number of hidden layers in the model. The Transformer has O(L^2) time and memory complexity, so for longer input sequences and deeper models, the time and memory costs of the Transformer can become quite high. To solve this problem, subsequent researchers have proposed a number of effective Transformer-class models to reduce the complexity, using, for example, sparse attention, hierarchical attention, and other methods.

In the self-attention model, Informer introduced the concept of sparse bias and adopted the LogSparse mask technique. These innovations significantly reduce the computational complexity of the traditional Transformer model from the original O(L^2) to O(L log L). It is worth noting that Informer [23] did not explicitly introduce sparse bias when implementing this improvement. Instead, it dynamically selects queries and keys that have significant similarities, effectively prioritizing the dominant queries at O(L log L) cost. This method greatly improves the computational efficiency.

Autoformer [24] inherited the Transformer's design idea of an encoder-decoder structure. By employing Autoformer's unique internal operators, it is able to effectively separate the overall trend of a variable from the predicted hidden variables. This approach utilizes a unique autocorrelation mechanism to achieve O(L log L) complexity.

A pyramidal attention module was designed in Pyraformer [25] to transmit information between and within different scales. As the length of the input sequence L increases, Pyraformer is able to achieve a lower, O(L), computational complexity by adjusting one parameter C and keeping the others fixed. This method brings substantial benefits in calculation time and memory cost.

The attention mechanism used in the traditional Transformer has quadratic complexity, whereas FEDformer [26] reaches O(L) computational complexity thanks to its use of a low-rank approximation. Table 6 summarizes the algorithm complexity analysis of some Transformer models applied to time series forecasting.

TABLE 6. Complexity analysis of transformer-based models.

3) EFFECTIVENESS OF TRANSFORMERS IN FORECASTING
As the role of the Transformer becomes increasingly prominent in the field of deep learning, more publications have begun the debate about its effectiveness in time series
V. EXPERIMENTS
Although the deep learning models covered in this paper all take time series data as input, each of them may be designed for different primary forecasting objectives. We would still like to examine the prediction performance of these models when they are tested on the same input datasets and run on an identical computing platform. Although the ranked measured results may not reflect the excellence of each individual architectural design, given their different design goals, they offer at least some insight into the relationship between design ideas and potential forecast accuracy.
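The accuracy metrics used for this comparison, MSE and MAE, can be computed as in the following sketch; the toy arrays are hypothetical values for illustration, not our experimental data:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: penalizes large deviations quadratically."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: less sensitive to occasional large misses."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical ground truth and forecasts, for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.5, 2.5, 3.5])
print(mse(y_true, y_pred))  # 0.25
print(mae(y_true, y_pred))  # 0.5
```

Because MSE squares each error, one large miss can dominate the score, while MAE weights all errors linearly; reporting both, as we do, gives a more balanced picture of model accuracy.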
prediction, uncertainty estimation, and other pertinent aspects to advance the field of time series forecasting.

VII. CONCLUSION
Time series forecasting is an excellent tool that provides predictive insights, offers decision-making strategies, and spans diverse applications through time series data analysis. In this paper, we have undertaken a comprehensive review of publicly available time series datasets and explored various approaches for designing effective time series models. We start with an overview of classical statistical methods, such as ARIMA, and then delve into an extensive exploration of different deep learning models, such as Transformer-based architectures. Each architectural paradigm embodies distinct design principles and advantages. It is crucial to recognize that different datasets may necessitate tailored models, and that the computational setup can influence model performance. Our goal is to ascertain the optimal deep learning model for a given dataset and computational environment. Through our experiments among different deep learning models, SCINet demonstrates the best performance, with mean squared errors (MSEs) of 0.21 and 0.37 and mean absolute errors (MAEs) of 0.3 and 0.4 on the ETT-small-m1 and ETT-small-h1 datasets, respectively. Although our testing configuration may not show that SCINet performs the best across all time-series applications, it represents a well-crafted model whose architectural design deserves to be understood. For rigorous investigations, novel conceptual designs such as Mamba should also be considered in future time-series forecasting research.

REFERENCES
[1] W. Liao and S. Lin, ``Prediction of photovoltaic output power based on match degree and entropy weight method,'' Int. J. Pattern Recognit. Artif. Intell., vol. 37, no. 7, Jun. 2023, Art. no. 2350018.
[2] L. Qu, W. Li, W. Li, D. Ma, and Y. Wang, ``Daily long-term traffic flow forecasting based on a deep neural network,'' Exp. Syst. Appl., vol. 121, pp. 304-312, May 2019.
[3] S. N. Ward, ``Area-based tests of long-term seismic hazard predictions,'' Bull. Seismological Soc. Amer., vol. 85, no. 5, pp. 1285-1298, Oct. 1995.
[4] G. U. Yule, ``On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers,'' Stat. Papers George Udny Yule, vol. 226, pp. 389-420, Jan. 1971.
[5] G. T. Walker, ``On periodicity in series of related terms,'' Proc. Roy. Soc. A, Math., Phys. Eng. Sci., vol. 131, no. 818, pp. 518-532, 1931.
[6] I. Rojas, O. Valenzuela, F. Rojas, A. Guillen, L. J. Herrera, H. Pomares, L. Marquez, and M. Pasadas, ``Soft-computing techniques and ARMA model for time series prediction,'' Neurocomputing, vol. 71, nos. 4-6, pp. 519-537, Jan. 2008.
[7] M. Valipour, ``Critical areas of Iran for agriculture water management according to the annual rainfall,'' Eur. J. Sci. Res., vol. 84, no. 4, pp. 600-608, 2012.
[8] Z. Liu, Y. Yan, and M. Hauskrecht, ``A flexible forecasting framework for hierarchical time series with seasonal patterns: A case study of web traffic,'' in Proc. 41st Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., Jun. 2018, pp. 889-892.
[9] Y. Wu, J. Ni, W. Cheng, B. Zong, D. Song, Z. Chen, Y. Liu, X. Zhang, H. Chen, and S. B. Davidson, ``Dynamic Gaussian mixture based deep generative model for robust forecasting on sparse multivariate time series,'' in Proc. AAAI Conf. Artif. Intell., May 2021, vol. 35, no. 1, pp. 651-659.
[10] L. Li, J. Yan, X. Yang, and Y. Jin, ``Learning interpretable deep state space model for probabilistic time series forecasting,'' in Proc. 28th Int. Joint Conf. Artif. Intell., Aug. 2019, pp. 2901-2908.
[11] Z. Chen, Q. Ma, and Z. Lin, ``Time-aware multi-scale RNNs for time series modeling,'' in Proc. 13th Int. Joint Conf. Artif. Intell., Aug. 2021, pp. 2285-2291.
[12] L. Yang, T. L. J. Ng, B. Smyth, and R. Dong, ``HTML: Hierarchical transformer-based multi-task learning for volatility prediction,'' in Proc. Web Conf., Apr. 2020, pp. 441-451.
[13] Y. Li, K. Li, C. Chen, X. Zhou, Z. Zeng, and K. Li, ``Modeling temporal patterns with dilated convolutions for time-series forecasting,'' ACM Trans. Knowl. Discovery from Data, vol. 16, no. 1, pp. 1-22, Feb. 2022.
[14] M. Liu, A. Zeng, M. Chen, Z. Xu, Q. Lai, L. Ma, and Q. Xu, ``SCINet: Time series modeling and forecasting with sample convolution and interaction,'' in Proc. 36th Conf. Neural Inf. Process. Syst., 2022, pp. 5816-5828.
[15] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. W. Cottrell, ``A dual-stage attention-based recurrent neural network for time series prediction,'' in Proc. 16th Int. Joint Conf. Artif. Intell., Melbourne, VIC, Australia, Aug. 2017, pp. 2627-2633.
[16] R. Wen, K. Torkkola, B. Narayanaswamy, and D. Madeka, ``A multi-horizon quantile recurrent forecaster,'' in Proc. 31st Conf. Neural Inf. Process. Syst. (NIPS), Long Beach, CA, USA, 2017, pp. 1-9.
[17] Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, and C. Zhang, ``Connecting the dots: Multivariate time series forecasting with graph neural networks,'' in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2020, pp. 753-763.
[18] Z. Pan, S. Ke, X. Yang, Y. Liang, Y. Yu, J. Zhang, and Y. Zheng, ``AutoSTG: Neural architecture search for predictions of spatio-temporal graph,'' in Proc. Web Conf., 2021, pp. 1846-1855.
[19] L. Han, B. Du, L. Sun, Y. Fu, Y. Lv, and H. Xiong, ``Dynamic and multi-faceted spatio-temporal deep learning for traffic speed forecasting,'' in Proc. 27th ACM SIGKDD Conf. Knowl. Discovery Data Mining, Singapore, Aug. 2021, pp. 547-555.
[20] Y. Liu, Q. Liu, J. W. Zhang, H. Feng, Z. Wang, Z. Zhou, and W. Chen, ``Multivariate time-series forecasting with temporal polynomial graph neural networks,'' in Proc. 36th Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 19414-19426.
[21] L. Chen, D. Chen, Z. Shang, B. Wu, C. Zheng, B. Wen, and W. Zhang, ``Multi-scale adaptive graph neural network for multivariate time series forecasting,'' IEEE Trans. Knowl. Data Eng., vol. 35, no. 10, pp. 10748-10761, Oct. 2023.
[22] B. Lim, S. Ö. Arik, N. Loeff, and T. Pfister, ``Temporal fusion transformers for interpretable multi-horizon time series forecasting,'' Int. J. Forecasting, vol. 37, no. 4, pp. 1748-1764, Oct. 2021.
[23] H. Y. Zhou, S. H. Zhang, J. Q. Peng, S. Zhang, J. X. Li, H. Xiong, and W. C. Zhang, ``Informer: Beyond efficient transformer for long sequence time-series forecasting,'' in Proc. AAAI Conf. Artif. Intell., 2021, vol. 35, no. 12, pp. 11106-11115.
[24] H. Wu, J. Xu, J. Wang, and M. Long, ``Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting,'' in Proc. NIPS, vol. 34, Dec. 2021, pp. 22419-22430.
[25] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, and S. Dustdar, ``Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting,'' in Proc. Int. Conf. Learn. Represent., 2021, pp. 1-20.
[26] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin, ``FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting,'' in Proc. Int. Conf. Mach. Learn., 2022, pp. 27268-27286.
[27] Y. Liu, H. X. Wu, J. M. Wang, and M. Long, ``Non-stationary transformers: Exploring the stationarity in time series forecasting,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 1-13.
[28] B. Yang, L. X. Li, H. Ji, and J. Xu, ``An early warning system for loan risk assessment using artificial neural networks,'' Knowl.-Based Syst., vol. 14, nos. 5-6, pp. 303-306, Aug. 2001.
[29] I. E. Livieris, E. Pintelas, and P. Pintelas, ``A CNN-LSTM model for gold price time-series forecasting,'' Neural Comput. Appl., vol. 32, no. 23, pp. 17351-17360, Dec. 2020.
[30] J. Yoo and U. Kang, ``Attention-based autoregression for accurate and efficient multivariate time series forecasting,'' in Proc. SIAM Int. Conf. Data Mining (SDM), 2021, pp. 531-539.
[31] Y. Zhao, Y. Shen, Y. Zhu, and J. Yao, ``Forecasting wavelet transformed time series with attentive neural networks,'' in Proc. IEEE Int. Conf. Data Mining (ICDM), Singapore, Nov. 2018, pp. 1452-1457.
[32] Z. H. Yue, Y. J. Wang, J. Y. Duan, T. M. Yang, C. R. Huang, Y. H. Tong, and B. X. Xu, ``TS2Vec: Towards universal representation of time series,'' in Proc. AAAI Conf. Artif. Intell., 2022, vol. 36, no. 8, pp. 8980-8987.
[33] Y. Li, H. Wang, J. Li, C. Liu, and J. Tan, ``ACT: Adversarial convolutional transformer for time series forecasting,'' in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Vancouver, BC, Canada, Jul. 2022, pp. 17105-17115.
[34] Q. Wang, L. Chen, J. Zhao, and W. Wang, ``A deep granular network with adaptive unequal-length granulation strategy for long-term time series forecasting and its industrial applications,'' Artif. Intell. Rev., vol. 53, no. 7, pp. 5353-5381, Oct. 2020.
[35] Z. Yang, W. Yan, X. Huang, and L. Mei, ``Adaptive temporal-frequency network for time-series forecasting,'' IEEE Trans. Knowl. Data Eng., vol. 34, no. 4, pp. 1576-1587, Apr. 2022.
[36] Y. Qi, Q. Li, H. Karimian, and D. Liu, ``A hybrid model for spatiotemporal forecasting of PM2.5 based on graph convolutional neural network and long short-term memory,'' Sci. Total Environ., vol. 664, pp. 1-10, May 2019.
[37] Y.-Y. Chang, F.-Y. Sun, Y.-H. Wu, and S.-D. Lin, ``A memory-network based solution for multivariate time-series forecasting,'' 2018, arXiv:1809.02105.
[38] Z. Shen, Y. Zhang, J. Lu, J. Xu, and G. Xiao, ``A novel time series forecasting model with deep learning,'' Neurocomputing, vol. 396, pp. 302-313, Jul. 2020.
[39] J. C. Lauffenburger, J. M. Franklin, A. A. Krumme, W. H. Shrank, O. S. Matlin, C. M. Spettell, G. Brill, and N. K. Choudhry, ``Predicting adherence to chronic disease medications in patients with long-term initial medication fills using indicators of clinical events and health behaviors,'' J. Managed Care Specialty Pharmacy, vol. 24, no. 5, pp. 469-477, May 2018.
[40] N. Wu, B. Green, X. Ben, and S. O'Banion, ``Deep transformer models for time series forecasting: The influenza prevalence case,'' 2020, arXiv:2001.08317.
[41] A. Zeroual, F. Harrou, A. Dairi, and Y. Sun, ``Deep learning methods for forecasting COVID-19 time-series data: A comparative study,'' Chaos, Solitons Fractals, vol. 140, Nov. 2020, Art. no. 110121.
[42] K. H. Lee, J. Won, H. J. Hyun, S. C. Hahn, E. Choi, and J. H. Lee, ``Self-supervised predictive coding with multimodal fusion for patient deterioration prediction in fine-grained time resolution,'' in Proc. Int. Conf. Learn. Represent., May 2023, pp. 41-50.
[43] G. Janacek, ``Time series analysis forecasting and control,'' J. Time Ser. Anal., vol. 31, no. 4, p. 303, Jul. 2010.
[44] J.-F. Chen, W.-M. Wang, and C.-M. Huang, ``Analysis of an adaptive time-series autoregressive moving-average (ARMA) model for short-term load forecasting,'' Electric Power Syst. Res., vol. 34, no. 3, pp. 187-196, Sep. 1995.
[45] X. Y. Lu and L. H. Wang, ``Bootstrap prediction interval for ARMA models with unknown orders,'' Revstat-Stat. J., vol. 18, no. 3, pp. 375-396, 2020.
[46] C. W. J. Granger, ``Investigating causal relations by econometric models and cross-spectral methods,'' Econometrica, vol. 37, no. 3, pp. 424-438, Aug. 1969.
[47] R. F. Engle and C. W. J. Granger, ``Co-integration and error correction: Representation, estimation, and testing,'' Econometrica, vol. 55, no. 2, pp. 251-276, Mar. 1987.
[48] L. Zhang, J. Lin, R. Qiu, X. Hu, H. Zhang, Q. Chen, H. Tan, D. Lin, and J. Wang, ``Trend analysis and forecast of PM2.5 in Fuzhou, China using the ARIMA model,'' Ecological Indicators, vol. 95, pp. 702-710, Dec. 2018.
[49] Z. Malki, E.-S. Atlam, A. Ewis, G. Dagnew, A. R. Alzighaibi, G. Elmarhomy, M. A. Elhosseini, A. E. Hassanien, and I. Gad, ``ARIMA models for predicting the end of COVID-19 pandemic and the risk of second rebound,'' Neural Comput. Appl., vol. 33, no. 7, pp. 2929-2948, Apr. 2021.
[50] Q. Q. Shi, J. M. Yin, J. J. Cai, A. Cichocki, T. Yokota, L. Chen, M. X. Yuan, and J. Zeng, ``Block Hankel tensor ARIMA for multiple short time series forecasting,'' 2020, arXiv:2002.12135.
[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ``ImageNet classification with deep convolutional neural networks,'' in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 25, 2012, pp. 1097-1105.
[52] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, ``WaveNet: A generative model for raw audio,'' 2016, arXiv:1609.03499.
[53] S. Bai, J. Z. Kolter, and V. Koltun, ``An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,'' 2018, arXiv:1803.01271.
[54] A. Borovykh, S. Bohte, and C. W. Oosterlee, ``Conditional time series forecasting with convolutional neural networks,'' 2017, arXiv:1703.04691.
[55] X. Dong, L. Qian, and L. Huang, ``Short-term load forecasting in smart grid: A combined CNN and K-means clustering approach,'' in Proc. IEEE Int. Conf. Big Data Smart Comput. (BigComp), Feb. 2017, pp. 119-125.
[56] J. M.-T. Wu, Z. Li, N. Herencsar, B. Vo, and J. C.-W. Lin, ``A graph-based CNN-LSTM stock price prediction algorithm with leading indicators,'' Multimedia Syst., vol. 29, no. 3, pp. 1751-1770, Jun. 2023.
[57] S. Huang, D. Wang, X. Wu, and A. Tang, ``DSANet: Dual self-attention network for multivariate time series forecasting,'' in Proc. 28th ACM Int. Conf. Inf. Knowl. Manag., Nov. 2019, pp. 2129-2132.
[58] J. Elman, ``Finding structure in time,'' Cognit. Sci., vol. 14, no. 2, pp. 179-211, Jun. 1990.
[59] S. Hochreiter and J. Schmidhuber, ``Long short-term memory,'' Neural Comput., vol. 9, no. 8, pp. 1735-1780, Nov. 1997.
[60] X. Wang, J. Wu, and C. Liu, ``Fault time series prediction based on LSTM recurrent neural network,'' J. Beijing Univ. Aeronaut. Astronaut., vol. 44, no. 4, pp. 772-784, 2018.
[61] A. Graves and J. Schmidhuber, ``Framewise phoneme classification with bidirectional LSTM and other neural network architectures,'' Neural Netw., vol. 18, nos. 5-6, pp. 602-610, Jul. 2005.
[62] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, ``Learning phrase representations using RNN encoder-decoder for statistical machine translation,'' in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Doha, Qatar, 2014, pp. 1724-1734.
[63] Y. Jung, J. Jung, B. Kim, and S. Han, ``Long short-term memory recurrent neural network for modeling temporal patterns in long-term power forecasting for solar PV facilities: Case study of South Korea,'' J. Cleaner Prod., vol. 250, Mar. 2020, Art. no. 119476.
[64] S. Chang, Y. Zhang, W. Han, M. Yu, X. Guo, W. Tan, X. Cui, M. Witbrock, M. A. Hasegawa-Johnson, and T. S. Huang, ``Dilated recurrent neural networks,'' in Proc. 31st Int. Conf. Neural Inf. Process. Syst. (NIPS), Dec. 2017, pp. 76-86.
[65] P. Redhu and K. Kumar, ``Short-term traffic flow prediction based on optimized deep learning neural network: PSO-Bi-LSTM,'' Phys. A, Stat. Mech. Appl., vol. 625, Sep. 2023, Art. no. 129001.
[66] E. Choi, M. T. Bahadori, J. A. Kulas, A. Schuetz, W. F. Stewart, and J. Sun, ``RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism,'' in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 3512-3520.
[67] T. Gangopadhyay, S. Y. Tan, Z. Jiang, R. Meng, and S. Sarkar, ``Spatiotemporal attention for multivariate time series prediction and interpretation,'' in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Toronto, ON, Canada, Jun. 2021, pp. 3560-3564.
[68] C. Fan, Y. Zhang, Y. Pan, X. Li, C. Zhang, R. Yuan, D. Wu, W. Wang, J. Pei, and H. Huang, ``Multi-horizon time series forecasting with temporal attention learning,'' in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2019, pp. 2527-2535.
[69] S. Du, T. Li, Y. Yang, and S.-J. Horng, ``Multivariate time series forecasting via attention-based encoder-decoder framework,'' Neurocomputing, vol. 388, pp. 269-279, May 2020.
[70] H. Lin, Y. Fan, J. Zhang, and B. Bai, ``REST: Reciprocal framework for spatiotemporal-coupled predictions,'' in Proc. Web Conf., Apr. 2021, pp. 3136-3145.
[71] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ``Attention is all you need,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 30, Dec. 2017, pp. 6000-6010.
[72] Y. Lin, I. Koprinska, and M. Rana, ``SpringNet: Transformer and spring DTW for time series forecasting,'' in Proc. Int. Conf. Neural Inf. Process. (ICONIP), Bangkok, Thailand, 2020, pp. 616-628.
[73] J. Lee-Thorp, J. Ainslie, I. Eckstein, and S. Ontanon, ``FNet: Mixing tokens with Fourier transforms,'' 2021, arXiv:2105.03824.
[74] D. Du, B. Su, and Z. Wei, ``Preformer: Predictive transformer with multi-scale segment-wise correlations for long-term time series forecasting,'' in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Rhodes Island, Greece, Jun. 2023, pp. 1-5.
[75] K. Madhusudhanan, J. Burchert, N. Duong-Trung, S. Born, and L. Schmidt-Thieme, ``U-Net inspired transformer architecture for far horizon time series forecasting,'' in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases, 2022, pp. 36-52.
[76] A. L. Zeng, M. X. Chen, L. Zhang, and Q. Xu, ``Are transformers effective for time series forecasting?'' in Proc. 37th AAAI Conf. Artif. Intell., 2023, vol. 37, no. 9, pp. 11121-11128.
[77] Z. Li, Z. Rao, L. Pan, and Z. Xu, ``MTS-mixers: Multivariate time series forecasting via factorized temporal and channel mixing,'' 2023, arXiv:2302.04501.
[78] A. Das, W. Kong, A. Leach, S. Mathur, R. Sen, and R. Yu, ``Long-term forecasting with TiDE: Time-series dense encoder,'' 2023, arXiv:2304.08424.

WENXIANG LI is currently pursuing the Ph.D. degree in applied computer technology with the Faculty of Applied Sciences, Macao Polytechnic University, Macau, China. Her current research interests include deep learning modeling and forecast analysis of time series data.

K. L. EDDIE LAW received the [Link]. (Eng.) degree in electrical and electronic engineering from The University of Hong Kong, the M.S. degree in electrical engineering from the Tandon School of Engineering (Polytechnic University, New York), New York University, and the Ph.D. degree in electrical and computer engineering from the University of Toronto. From 1995 to 1999, he was a Member of Scientific Staff with Nortel Networks, Ottawa, Canada, carrying out research on Passport switches, protocol designs, and a scalable proxy web server system in the Computing Technology Laboratory. From 1999 to 2003, he was an Assistant Professor with The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto. From 2004 to 2010, he was an Associate Professor with Toronto Metropolitan University (Ryerson University), Canada. Since 2019, he has been with the Faculty of Applied Sciences, Macao Polytechnic University, Macau. He holds U.S. patents and has published refereed articles in journals, conferences, magazines, and books. Among his latest research interests, he works on data lake designs, distributed computing, computer networking, deep learning model designs, blockchains, and IoT systems.