PREPRINT 1

A Comprehensive Survey of Time Series Forecasting: Concepts, Challenges, and Future Directions

Mingyue Cheng, Zhiding Liu, Xiaoyu Tao, Qi Liu*, Jintao Zhang, Tingyue Pan, Shilong Zhang, Panjing He, Xiaohan Zhang, Daoyu Wang, Jiahao Wang, Enhong Chen

Abstract—Time series forecasting (TSF) has become an increasingly vital tool in various decision-making applications, including business intelligence and scientific discovery, in today's rapidly evolving digital landscape. Over the years, a wide range of methods for TSF has been proposed, spanning from traditional statistics-based models to more recent machine learning-driven, data-intensive approaches. Despite the extensive body of research, there is still no universally accepted, unified problem statement or systematic elaboration of the core challenges and characteristics of TSF. The extent to which deep TSF models can address fundamental issues—such as data sparsity and non-stationarity—remains unclear, and the broader TSF research landscape continues to evolve, shaped by diverse methodological trends. This comprehensive survey aims to address these gaps by examining the key entities in TSF (e.g., covariates) and their characteristics (e.g., frequency, length, missing values). We introduce a general problem formulation and challenge analysis for TSF, propose a taxonomy that classifies representative methodologies from the preprocessing and forecasting perspectives, and highlight emerging topics such as transfer learning and trustworthy forecasting. Finally, we discuss promising research directions that are poised to drive innovation in this dynamic and rapidly advancing field. The related paper list is available at
[Link]

Index Terms—Time Series, Time Series Forecasting, Statistical Models, Deep Learning

1 INTRODUCTION

Time Series Forecasting (TSF) has become a pivotal tool for making informed predictions and decisions across various fields such as healthcare [1], finance [2], manufacturing [3], and the environment [4], [5]. In an era dominated by vast data generation, the need to predict future trends from time series data becomes increasingly critical. To this end, numerous efforts have been devoted to capturing both temporal and cross-channel correlations for more accurate time series forecasting.

Traditional time series forecasting models, such as Autoregressive Integrated Moving Average (ARIMA) [6] and Exponential Smoothing methods [7], have played a key role in time series analysis. However, as the complexity and scale of real-world time series data continue to grow, these models often find it challenging to capture underlying patterns effectively. The advent of machine learning (ML) and deep learning (DL) techniques has significantly transformed the landscape of time series forecasting [8], [9]. These data-driven approaches can learn from large, complex datasets and automatically discover patterns, offering substantial improvements over traditional methods. However, while ML and DL methods deliver promising results in many domains, challenges persist in addressing specific issues such as capturing long-term and multivariate dependencies, as well as modeling exogenous variables. Furthermore, time series data is often characterized by noise, irregularity, and non-stationarity, making preprocessing and feature extraction critical steps for achieving reliable predictions. Methods such as imputation [10], denoising [11], and normalization [12] have been developed to handle these data challenges, yet further research is needed to refine these techniques for more robust time series forecasting.

With the development of various forecasting approaches, several surveys have been published over the years, providing valuable insights into different methodologies. For example, [13], [14], [15] review the progress of specific deep-learning architectures (Transformers and Graph Neural Networks) or learning strategies (Self-Supervised Learning) for time series analysis tasks, and [16] further provides a systematic overview of deep models of diverse architectures for different downstream tasks, together with a benchmark analysis. More recently, with the advancement and success of foundation models, [17], [18] present surveys of Large-Language-Model-based and time-series-foundation-model-based approaches for time series analysis. However, these works primarily focus on deep-learning-based models and explain how different architectures are well suited to time series modeling; they therefore lack a unified perspective that covers the diverse research directions and bridges the historical development with the latest trends for the forecasting task.

• All authors of this paper are affiliated with the University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence.
• Email: {mycheng, qiliuql}@[Link]. The corresponding author is Qi Liu.

To address the above issues, this paper provides a comprehensive overview focusing on time series forecasting, encompassing both traditional statistical models and advanced data-driven approaches. We delve into the fundamental concepts of time series data, investigate various modeling techniques, and analyze the challenges

presented by real-world time series data. We also highlight the importance of time series preprocessing, offering a detailed examination of techniques such as tokenization and graph transformations, which have gained attention in recent literature for their effectiveness in improving model performance. Additionally, we discuss the progress in time series forecasting, including statistical models, data-driven approaches, transfer learning methods, and trustworthy forecasting techniques. By synthesizing diverse TSF methods under a single unifying framework, our survey aims to clarify the rich methodological landscape while exposing unresolved challenges and open questions.

The remainder of this article is organized as follows. Section 2 formally defines the fundamental concepts of TSF and introduces the widely used evaluation protocols. Following this, Section 3 provides an analysis of key data characteristics and challenges. In Section 4, we describe several mainstream time series preprocessing strategies. We then present the classical statistical approaches in Section 5, before transitioning to data-driven methods in Section 6, which covers both machine-learning-based and deep-learning-based approaches. Section 7 then focuses on recent advances in transfer learning models, and Section 8 discusses trustworthy TSF. Additionally, we introduce some prevalent benchmark datasets and representative applications of TSF in various domains in Section 9, and highlight open research directions in Section 10. Finally, we conclude the survey in Section 11.

2 FOUNDATION CONCEPT DESCRIPTION

2.1 Introduction to Time Series Data

A time series is commonly defined as a sequence of data points indexed in chronological order, where each observation is obtained at a specific (often uniformly spaced) time interval. Although the data are frequently recorded at discrete intervals (e.g., hourly, daily, monthly), many real-world phenomena underlying these observations can be treated as continuous and, in principle, unbounded in both time and value. As a result, time series modeling must address both the discrete nature of the measurements and the potential continuity of the process itself.

A typical time series can often be decomposed into several key components:

• Trend: A long-term progression of increasing or decreasing values in the series over extended periods.
• Seasonality: Regular, repeating patterns that recur at fixed intervals (e.g., daily, weekly, annually), driven by predictable factors such as calendar effects or environmental cycles.
• Cyclicity: Fluctuations that are not strictly tied to a fixed calendar period but exhibit recognizable rises and falls over time, often linked to economic or other systemic influences.
• Irregular or Random Variations: Unpredictable changes in the series that are not explained by the trend, seasonality, or cyclical behavior, often treated as residual noise.

Because time series data are recorded in chronological order, statistical dependencies exist among observations, making it crucial to consider temporal structures when designing forecasting or analysis techniques. These properties—including continuous, unbounded states and typically uniform sampling intervals—distinguish time series data from cross-sectional data, where observations lack inherent temporal ordering.

2.2 Basic Concepts of Time Series Forecasting

Time series forecasting aims to predict future values of a target series based on past observations and, potentially, additional explanatory variables. The core formulation typically involves two key components: the look-back window (i.e., a set of recent past observations) as input, and the predicted window (i.e., one or more future time steps) as output. Below, we outline several fundamental concepts:

• Look-back Window of the Target Series: This is a sequence of consecutive time steps (e.g., the previous L observations) from the target series itself. It acts as the primary source of historical context, capturing trends, seasonal patterns, and other temporal dependencies.
• Covariates (Exogenous Variables): Beyond the target series, many applications leverage auxiliary factors such as weather conditions, economic indicators, or demographic information. These additional inputs, known as covariates or exogenous variables, can offer valuable insights when the target series is influenced by external drivers.
• Predicted Window of the Target Series: In forecasting tasks, the output is typically a set of future time steps to be estimated. This window could be as short as a single time step (e.g., predicting the next hour) or span multiple future periods (e.g., predicting the next week or month).
• One-step vs. Multi-step Prediction: One-step prediction aims to forecast a single future time step at a time (e.g., t + 1 given observations up to t). It is often simpler to implement, but may require repeated application to forecast multiple steps ahead. Multi-step prediction directly generates forecasts for several consecutive future steps (e.g., t + 1 to t + H) in one shot. This approach can capture long-range dependencies but often entails greater complexity and potential error accumulation.
• Univariate vs. Multivariate Forecasting: Univariate forecasting considers only a single target series without additional external information. Multivariate forecasting involves multiple interrelated series or auxiliary inputs to exploit cross-dependencies and potentially improve accuracy.
• Iterative vs. Direct Strategies: Iterative forecasting predicts one step ahead repeatedly, feeding each predicted value back as input for the next step. Direct forecasting trains separate models (or a single model with multiple outputs) to predict each future step or window of interest directly, avoiding error propagation at the cost of additional modeling complexity.

In summary, the essential building blocks of time series forecasting encompass which segments of past data (i.e., the look-back window) and what external factors (i.e.,

Fig. 1: An illustration of time series forecasting. (The figure depicts a forecasting model that maps the look-back window of historical target values, together with past dynamic, static, endogenous, and known dynamic exogenous input variables, to point or interval forecasts of future target values over the predicted window.)

covariates) are used to model or learn patterns that best predict a specified time horizon (i.e., the predicted window). Understanding these basic concepts is pivotal for designing and implementing effective forecasting solutions.

2.3 Problem Definition of Time Series Forecasting

Time series forecasting aims to predict future values based on historical data. Depending on the nature of the output, forecasting problems can be categorized as non-probabilistic (point-based) or probabilistic.

2.3.1 Point-based Time Series Forecasting

Point-based forecasting focuses on generating a single deterministic prediction for each time step. These methods aim to minimize a specific error metric, such as Mean Absolute Error (MAE) or Root Mean Square Error (RMSE), without explicitly accounting for uncertainty in the predictions. Formally, given a time series y = {y1, y2, . . . , yT}, the objective is to predict the future values ŷ = {ŷT+1, ŷT+2, . . . , ŷT+τ}, where ŷt represents a point estimate, such as the conditional expectation:

ŷt = E[yt | y],

assuming that the model aims to minimize the squared error. Common methods include ARIMA, Support Vector Regression (SVR), and Neural Networks.

2.3.2 Probabilistic Time Series Forecasting

In contrast, probabilistic forecasting generates a predictive distribution for future values, capturing the inherent uncertainty in the data. Instead of producing a single point estimate, the model provides a conditional probability density function (PDF) p(yT+1:T+τ | y). This allows for more nuanced predictions, such as the ability to compute prediction intervals or quantify risk. For example, the predictive distribution can be represented as:

yt ∼ p(yt | y),

where p(yt | y) is typically parameterized by the model (e.g., a Gaussian distribution N(µt, σt²) with mean µt and variance σt²) or modeled using advanced techniques like DeepAR [19], Variational Autoencoders [20], or Flow-based generative models [21].

2.4 Evaluation Protocols

Given the strong temporal dependency and sequential nature of time series data, evaluation protocols have been developed to objectively and fairly assess model performance in various time series forecasting tasks. The primary aim of these protocols is to accurately evaluate the effectiveness of different time series analysis techniques with diverse architectures. Specifically, evaluation protocols standardize methods for dividing data into training, validation, and test sets and define appropriate metrics to assess both short-term and long-term predictive performance. These protocols also reduce the risk of data leakage and ensure accurate evaluation of model generalization capabilities in time series contexts. By establishing a standardized framework, evaluation protocols promote fair model comparisons and optimization, ensuring that prediction results are interpretable and reliable, ultimately making model evaluation more rigorous and robust [16].

For time series data, a [Link] ratio is commonly applied to divide the data into training, validation, and test sets. This split enables the model to effectively learn features, adjust parameters, and evaluate its generalization performance. Keeping the split in chronological order also helps prevent data leakage, ensuring more reliable forecast outcomes.

Time series forecasting methods are typically assessed using two broad categories of metrics: point-based and probabilistic. Point-based metrics measure the accuracy of single-valued forecasts, while probabilistic metrics quantify the quality of an entire predictive distribution. Table 1 summarizes the calculation formulas, advantages, and disadvantages of these indicators.

2.4.1 Point-Based Metrics

Point-based metrics compare the forecast ŷt to the observed ground truth yt at each time step. There are many commonly used point prediction indicators (e.g., MAE and MSE). Table 1 presents a compilation of commonly used metrics for point prediction, including MAE, MSE, SMAPE, MAPE, MASE, and OWA, along with a discussion of their respective advantages and limitations. Briefly speaking, MAE and MSE are the most commonly used point-based metrics, comparing the error between the true values and predictions in the original space, while SMAPE and MAPE assess model performance based on percentage errors. In contrast, MASE and OWA evaluate the model through scaling and weighting approaches.

2.4.2 Probabilistic Metrics

In settings where models generate a predictive distribution instead of a single point estimate, probabilistic metrics evaluate distributional coverage, which measures the proportion of times the observed value falls within a predicted interval [22]. The choice of metric should reflect the practical objectives of the forecasting task, whether that involves minimizing average error, capturing uncertainty, or targeting specific quantiles. Table 1 provides an overview of widely used indicators for probabilistic forecasting, namely CRPS, ρ-Quantile Loss, NLL, and VG, accompanied by a discussion of their respective strengths and weaknesses.
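Two of these probabilistic quantities are easy to make concrete in code: the quantile (pinball) loss and the empirical coverage of a prediction interval. The sketch below is only an illustration (it uses the standard pinball convention, and all toy values and function names are hypothetical, not taken from the survey):

```python
import numpy as np

def quantile_loss(y, y_hat, rho):
    """Average pinball loss at quantile level rho (standard convention:
    under-prediction is weighted by rho, over-prediction by 1 - rho)."""
    diff = y - y_hat
    return np.mean(np.maximum(rho * diff, (rho - 1) * diff))

def interval_coverage(y, lower, upper):
    """Fraction of observations that fall inside the predicted interval."""
    return np.mean((y >= lower) & (y <= upper))

# Hypothetical ground truth and 10%/90% quantile forecasts.
y   = np.array([10.0, 12.0, 11.0, 13.0, 12.5])
q10 = np.array([ 9.0, 10.5, 10.0, 11.5, 11.0])
q90 = np.array([11.5, 13.0, 12.5, 14.0, 14.0])

loss_10  = quantile_loss(y, q10, 0.1)       # low-quantile forecasts should sit below y
coverage = interval_coverage(y, q10, q90)   # ideally near 0.8 for an 80% interval
```

A well-calibrated 80% interval should cover roughly 80% of observations over a long evaluation window; persistent over- or under-coverage signals miscalibration rather than mere inaccuracy.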

TABLE 1: Overview of the metrics widely evaluated in point-based forecasting and probabilistic forecasting tasks.

Point-based Forecasting
| Metric | Formula                                                   | Advantages                                         | Disadvantages                                 |
| MAE    | (1/n) Σ_t |y_t − ŷ_t|                                     | Intuitive and robust to outliers                   | Insensitive to large errors                   |
| MSE    | (1/n) Σ_t (y_t − ŷ_t)²                                    | Penalizes large deviations in predictions          | Sensitive to outliers                         |
| SMAPE  | (1/n) Σ_t 2|y_t − ŷ_t| / (|y_t| + |ŷ_t|) × 100%           | Standardized and works across scales               | Sensitive to zeros and large ranges           |
| MAPE   | (1/n) Σ_t |y_t − ŷ_t| / |y_t| × 100%                      | Intuitive and easy to compute                      | Sensitive to small values                     |
| MASE   | [(1/n) Σ_t |y_t − ŷ_t|] / [(1/n) Σ_t |y_t − y_{t−1}|]     | Scaled and comparable across datasets              | Seasonality-sensitive and reference-dependent |
| OWA    | Σ_t w_t · |y_t − ŷ_t|                                     | Flexible, weighted error handling                  | Requires careful weight selection; complex    |

Probabilistic Forecasting
| Metric | Formula                                                   | Advantages                                         | Disadvantages                                 |
| CRPS   | ∫ [F(z) − 1(z ≥ y)]² dz                                   | Interpretable and widely applicable                | Single-variable only; sensitive to outliers   |
| ρ-QL   | 2(ŷ − y)(ρ I_{ŷ>y} − (1 − ρ) I_{ŷ≤y})                     | Ideal for quantile forecasting                     | Overfits to asymmetric data                   |
| NLL    | − log p_{D_f}(y)                                          | Rich gradient information; solid theoretical basis | Lacks intuitive interpretation                |
| VG     | Σ_{a,b} (|y_a − y_b|^p − E_{x∼D_f}[|x_a − x_b|^p])²       | Describes spatial correlation                      | High computational complexity                 |
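The point-based formulas in Table 1 translate directly into code. The following sketch (an illustrative implementation on hypothetical toy arrays, not tied to any benchmark in the survey) computes MAE, MSE, SMAPE, and MASE with NumPy:

```python
import numpy as np

def mae(y, y_hat):
    # Mean Absolute Error.
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    # Mean Squared Error.
    return np.mean((y - y_hat) ** 2)

def smape(y, y_hat):
    # Symmetric Mean Absolute Percentage Error.
    return np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))) * 100

def mase(y, y_hat):
    # Scale the forecast error by the error of a naive one-step-lag forecast.
    naive_error = np.mean(np.abs(y[1:] - y[:-1]))
    return mae(y, y_hat) / naive_error

y     = np.array([3.0, 5.0, 4.0, 6.0])  # hypothetical ground truth
y_hat = np.array([2.5, 5.5, 4.0, 5.0])  # hypothetical forecasts
```

Note that MAE and MSE are scale-dependent, so they are only comparable across series of similar magnitude; scaled metrics such as SMAPE and MASE are the usual choices when aggregating results over heterogeneous datasets.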

Continuous Ranked Probability Score (CRPS) is used to evaluate the accuracy of probabilistic predictions, where F(z) represents the cumulative distribution function and y is the value to be predicted. The ρ-Quantile Loss, with ŷ as the predicted value for the quantile level ρ, and Iŷ>y and Iŷ≤y as indicator functions, computes the quantile-level prediction accuracy. In the context of Negative Log-Likelihood (NLL), Df is related to the data distribution or prediction model, and p_{Df}(y) is the predictive probability density function; NLL mainly aims to assess the model's goodness of fit to the data, where a smaller value implies a better fit. Besides, the Variogram (VG) describes the spatial variability of the data, helping to understand its spatial structure and correlation. It involves observed values ya and yb, potentially related to xa and xb, a given parameter p, and the data distribution Df.

3 CHALLENGES ANALYSIS

Sequence signals collected in chronological order can be classified as time series data, a distinct data modality derived from various sensors or real-world observations. Time series data captures the dynamic evolution of systems over time, reflecting both short-term fluctuations and long-term trends. While rich in information, time series data exhibits several key characteristics that pose specific challenges for accurate forecasting. In this survey, we provide a comprehensive discussion of these characteristics and challenges as follows:

3.1 Noise, Anomalies, and Outliers

Time series data, which reflects a system through multiple sensors or evaluations, inevitably raises concerns related to data quality. This can result in noise, anomalies, and outliers within the collected series, ultimately undermining the performance of downstream forecasting tasks. Despite significant efforts focused on individual imputation and anomaly detection [23], [24], mainstream research still often assumes that forecasting datasets are well-filtered and of high quality. As a result, there remains a challenge in exploring effective methods to mitigate the impact of low-quality data on forecasting models.

3.2 Irregular Time Series

Although continuous formulations are widely adopted in the field of time series forecasting, many real-world datasets are irregularly sampled due to technical constraints or limitations inherent in the data collection process. As a result, observations are recorded at variable time intervals. The time gaps between these observations themselves contain crucial information about the underlying time series, thereby presenting significant challenges in handling the irregular sampling intervals and effectively capturing the evolving latent dynamics [25], [26].

3.3 Long-term Sequence Forecasting

Time series data essentially consists of a sequence of continuous numerical values and can thus be considered an intermediary modality between the extensively studied fields of image and language data, with further inherent characteristics. Modeling the temporal dependencies within time series data remains a prominent research focus in time series forecasting [27]. From a sequence perspective, time series data often spans longer time intervals, resulting in extended sequences. Additionally, from a numerical standpoint, time series data typically lacks well-defined upper and lower bounds, further complicating accurate forecasting. More specifically, many real-world applications are framed as long-term time series forecasting tasks [28], [29], requiring forecasts over extended horizons and windows, which presents substantial challenges in capturing long-term dependencies within the series. Furthermore, a critical

challenge lies in mitigating the impact of error accumulation series data can be decomposed into several independent
during the numerical sequence modeling process [30]. components, and considerable efforts have been dedicated
to developing advanced decomposition techniques over the
years [40], [41]. In forecasting tasks, a common approach
3.4 Multivariate Dependence Modeling
involves a simple neural decomposition layer that separates
Collected time series datasets typically exhibit multivariate the original series into trend and seasonal components.
characteristics, primarily due to the complex cross-channel This decomposition captures the long-term progression and
dependencies embedded within the data. On the one hand, seasonal patterns inherent in the data, thereby facilitating
causal or leading effects may exist in certain scenarios, more accurate forecasting [42]. However, precise modeling
including traffic load prediction and temperature analysis of structural characteristics, multi-periodicity, and the ef-
[31]. However, clear prior assumptions or experiential in- fects of holidays in time series data remains a challenge that
sights about these dependencies are often lacking, especially requires further exploration.
in complex and high-dimensional systems. This absence
of guidance presents a significant challenge in accurately
identifying and modeling cross-channel dependencies. On 3.8 Multi-scale Representations
the other hand, in many complex systems, such as the hu-
The multi-scale characteristics of time series data are pri-
man body or weather systems, pre-existing knowledge often
marily exhibited from two perspectives: multi-granularity
makes it difficult to identify the relevant variables. This
and hierarchical [43]. On one hand, time series data consists
necessitates the collection of data from multiple perspectives
of discrete records sampled from a continuous time-space.
to enable a precise and holistic analysis of the entire system
Depending on the sampling frequency, it inherently exhibits
[4]. Consequently, sophisticated approaches are required to
multi-granularity characteristics, where high-frequency se-
effectively capture the hidden patterns and dependencies
ries tend to be more informative but noisier, while low-
across multiple time series. The current channel-dependent
frequency series offer smoother trends but less detail. On
[32] and channel-independent [33] solutions still warrant
the other hand, both local disturbances and global trends
further exploration [34].
significantly influence the time series data. Consequently,
determining an appropriate modeling granularity, fully ex-
3.5 Exogenous Variables Modeling ploiting multi-scale features, and achieving effective feature
Real-world systems often exhibit a partially observed nature fusion present key research challenges [44], [45].
due to incomplete prior knowledge, which can lead to
suboptimal forecasting outcomes when relying solely on
endogenous variables [35], [36], [37]. In particular, certain 3.9 Computational Efficiency
scenarios necessitate the incorporation of external variables, In the forecasting of time series, data often consists of series
such as weather, policies, holidays, and macroeconomic from multiple channels, and both the forecast horizon and
indicators. The impact of these external variables on time prediction lengths are typically long. This easily lead to
series objectives may be nonlinear and time-varying. How- a significant computational efficiency challenge for apply-
ever, it remains unclear how to effectively identify the key ing traditional neural networks for time series forecasting,
exogenous variables and integrate this multimodal informa- particularly problematic in tasks requiring real-time perfor-
tion into a unified forecasting framework. mance, such as stock price forecasting [46] and human phys-
iological signal prediction [47]. Consequently, developing
3.6 Distribution Shift Modeling prediction methods that effectively balance computational
efficiency with model representation capability remains a
Time series data typically exhibit non-stationary character-
key research challenge.
istics, meaning that the distribution of the data may change
rapidly over time. This results in discrepancies between
different time spans, ultimately hindering the generalization 3.10 Generalizability and Transferability
capabilities of deep learning models, as the distribution shift
contradicts their fundamental assumption of independent Unlike the pixels used in computer vision (CV) and word
and identically distributed (I.I.D.) data [38], [39]. Further- tokens in natural language processing (NLP), there is no
more, traditional methods for addressing distribution shift, standardized definition of unified semantic units in time
such as domain adaptation and domain generalization, may series analysis. Although these series share a similar data
not be well-suited to this task, as defining a domain for time format, the values within them may have inconsistent mean-
series data is not straightforward [12]. As a result, it remains ings across different contexts due to the distinct physical
an open challenge to develop models or frameworks that significance in various scenarios. As a result, most forecast-
can effectively mitigate the impact of non-stationarity. ing methodologies are typically trained and evaluated on
a single dataset, limiting their generalizability and transfer-
ability. Therefore, the development of a robust and scalable
3.7 Trend-Seasonal Pattern Recognition forecasting foundation model, built on extensive multi-
Time series data are usually numerical responses to natural source pretraining data, represents a significant challenge
indicators or human behavior, making them highly influ- [48], [49]. Furthermore, the operational mechanisms, such as
enced by natural rhythms and exhibiting distinct, struc- scaling laws and the influence of data distribution on model
tured numerical patterns. It is widely accepted that time performance, remain unclear [50].

4 TIME SERIES PREPROCESSING

Effective preprocessing of time series data is a critical step in building robust forecasting models. This section highlights key preprocessing techniques designed to address data quality issues and transform raw time series into more informative and structured forms. These techniques are essential for preparing time series data for analysis and ensuring its readiness for forecasting tasks.

4.1 Time Series Imputation

Time series data, often collected from complex sensor systems in real-world environments, is frequently affected by quality issues. First and foremost is the problem of missing data caused by factors such as localized sensor failures, which may lead to the loss of critical temporal and cross-channel dependency information. To this end, time series imputation aims to address missing values in the data by restoring them based on the observed segments. Considerable research has been dedicated to this area, with methods typically falling into two categories: predictive and generative approaches [10]. The predictive approach relies on deterministic prediction techniques, using traditional neural networks such as RNNs and attention-based models [23], [51], as well as reconstruction-based learning objectives, often without accounting for the uncertainty of the imputed values. In contrast, the generative approach employs methods such as GANs and diffusion models [52], [53], which facilitate the quantification of imputation uncertainty.

4.2 Time Series Denoising

More generally, time series data often suffers from significant noise contamination, in addition to the issue of missing data. Notably, noise in the data not only impacts the accuracy of model predictions but also affects the evaluation process of the models. Consequently, denoising is also not negligible for accurate forecasting results. Denoising approaches primarily aim to mitigate noise in time series data, thereby enhancing its quality and suitability for downstream tasks. Prominent techniques in this area include traditional decomposition-based methods [54] and filtering-based approaches [55], alongside more recent learning-based models [11], [56].

4.3 Standardization and Stationarization

Due to the dynamic nature of time series data, it inherently exhibits diverse data scales and complex distributions. On the other hand, stationarization approaches are proposed to address non-stationarity due to the distribution shifts between time series instances. DAIN [57] introduces a non-linear network that adaptively learns how to normalize each input instance, while RevIN [12] proposes a symmetric normalization method, first normalizing the input sequences and then denormalizing the model output sequences via instance normalization. Dish-TS [58] identifies both intra- and inter-space distribution shifts in time series and alleviates these issues by learning distribution coefficients. SAN [39] extends the normalization concept from the instance level to the time series slice level to enhance performance. Lastly, SIN [59] identifies key statistics for normalization and automatically learns the corresponding normalization transformations.

4.4 Time Series Decomposition Techniques

Time series decomposition is a key technique in time series analysis and forecasting. As shown in Figure 2, the decomposition process typically involves separating the original time series into three components: the trend (Tt), which captures long-term changes; the seasonal component (Ct), which represents periodic patterns; and the residual (St), which accounts for random noise.

TABLE 2: Method types and features of decomposition approaches.

| Type                   | Method                           | Feature                                                                                                         |
| Multiplicative Methods | VMD Methods                      | Decomposes into specific frequency components; suitable for economic and financial time series forecasting, with strong stability. |
| Multiplicative Methods | Wavelet and Filter-Based Methods | Captures frequency characteristics at different levels, demonstrating strong adaptability when combined.        |
| Multiplicative Methods | SSA Methods                      | Suitable for nonlinear and non-stationary time series, effectively avoiding mode mixing.                        |
| Additive Methods       | EMD Methods                      | Suitable for non-stationary and nonlinear time series.                                                          |
| Additive Methods       | STL Methods                      | Flexible adjustment of trend and smoothness; suitable for long-term series analysis.                            |
| Additive Methods       | LMD Methods                      | Uses average mutual information; particularly effective for capturing market data volatility.                   |
address this, standardization and stationarization are critical As shown in table 2, decomposition can be performed
preprocessing steps, ensuring that time series data is well- using two common models: the additive methods, where
prepared for reliable forecasting. the additive methods are expressed as:
While time series data may share a similar format, the
values within them often have inconsistent meanings across Yt = Tt + Ct + St , (1)
different contexts, given the distinct physical significance
in various scenarios. As a result, collected time series data and the multiplicative methods, where these methods are
may exhibit significant scale differences, sometimes span- represented as:
ning orders of magnitude. To handle this, standardization is Yt = Tt × Ct × St . (2)
employed to convert the data into a standard normal dis-
tribution from a global dataset perspective. This eliminates By isolating these components, decomposition provides
the impact of different channel units and ensures a stabilized insights into the underlying structure of time series, improv-
learning process of the forecasting model. ing the accuracy and interpretability of forecasting models.
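A minimal additive decomposition in the spirit of Eq. (1) can be sketched with a centered moving-average trend and per-position seasonal means; this is a simplified stand-in for STL, and the period and toy series below are illustrative assumptions:

```python
import numpy as np

def additive_decompose(y: np.ndarray, period: int):
    """Split y into trend + seasonal + residual (Yt = Tt + Ct + St)."""
    # Trend: moving average over one full period.
    kernel = np.ones(period) / period
    trend = np.convolve(y, kernel, mode="same")
    # Seasonal: average detrended value at each position within the period.
    detrended = y - trend
    profile = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(profile, len(y) // period + 1)[: len(y)]
    # Residual: whatever trend and seasonality do not explain.
    residual = y - trend - seasonal
    return trend, seasonal, residual

# Toy series: linear trend plus a period-12 cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(240)
y = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(240)
trend, seasonal, residual = additive_decompose(y, period=12)
print(np.allclose(trend + seasonal + residual, y))  # True
```

Because the residual is defined as what remains after removing trend and seasonality, the three components always sum back exactly to the original series under the additive model.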
Fig. 2: An illustrative example of time series decomposition into its components: trend, seasonality, and residual. (Panels: Original Series, Trend Series, Seasonal Series, Residual Series.)

4.4.1 VMD Methods

Variational Mode Decomposition (VMD) methods decompose time series into band-limited modes with specific central frequencies, using variational principles to identify the optimal modal bases. These methods allow for simultaneous estimation of both the modes and their corresponding frequencies, making them particularly effective for signals with multiple frequency components. VMD is ideal for financial forecasting, where accurate decomposition enhances prediction stability [60]. MVMD extends VMD to decompose multiple related time series, resolving frequency mismatches and capturing inter-variable relationships, which is useful for forecasting in meteorology and wind power [61]. SVMD integrates VMD with Sample Entropy (SampEn) to decompose industrial electricity load data into trend and fluctuation components, improving performance by addressing non-stationarity [62].

4.4.2 Wavelet and Filter-Based Methods

Wavelet and filter-based decomposition methods apply multi-scale analysis to extract frequency components at various resolutions. By isolating different frequency bands, these methods are particularly effective in modeling complex time series. Empirical Wavelet Transform (EWT) decomposes time series into sub-levels, extracting frequency-specific features and enhancing adaptability when combined with models like LSTM and RELM for error correction [63]. Time-Varying Filtered Empirical Mode Decomposition (TVFEMD) effectively handles high-noise data, such as bridge monitoring signals, by decomposing them into stable IMFs, improving interpretability and accuracy [64].

4.4.3 SSA Methods

Singular Spectrum Analysis (SSA) methods leverage embedding dimension and principal component selection to achieve adaptive decomposition of time series, ensuring the extraction of physically meaningful components while effectively suppressing noise. SSA performs exceptionally well when handling non-linear and non-stationary time series, avoiding mode-mixing issues and achieving high precision in complex time series forecasts [65].

4.4.4 EMD Methods

Empirical Mode Decomposition (EMD) methods decompose non-stationary and nonlinear time series into Intrinsic Mode Functions (IMFs), each representing oscillatory modes at different scales. This decomposition reveals the signal's multi-scale characteristics and improves forecasting by capturing distinct frequency components. EMD is effective in hydrological analysis due to its adaptability to complex nonlinear features [66]. Thresholding irrelevant IMFs refines decomposition, enhancing drought forecasting [67]. EEMD, by adding noise, mitigates mode-mixing in EMD, improving decomposition of annual runoff data [68]. CEEMD further reduces noise and reconstruction error with paired opposite white noise, strengthening financial data analysis by better capturing trends and fluctuations.

4.4.5 STL Methods

Seasonal and Trend decomposition using Loess (STL) methods utilize LOESS to smoothly estimate the trend, seasonal, and residual components. Their flexibility in adjusting trend and smoothing levels makes them ideal for long-term series analysis and handling missing data. STL enhances long-term forecasting accuracy and shows consistent performance across various time series characteristics [41], [69].

4.4.6 LMD Methods

Local Mean Decomposition (LMD) methods involve removing the local mean of a signal to extract its IMFs. This approach enhances forecasting accuracy by using average mutual information to capture underlying patterns. LMD is particularly effective for capturing price volatility in market data, such as oil prices, enhancing short- and long-term forecasting by mitigating endpoint effects [70].

4.5 Tokenization

Time series tokenization is a technique that converts time series data into a format that models can process. It primarily comes in two types: continuous tokenization and discrete tokenization.

4.5.1 Continuous Tokenization

Continuous tokenization methods directly process raw time series data by breaking it down into continuous segments or points, primarily categorized into point-wise and patch-wise approaches.

For point-wise methods, such as Informer [30] and NHits [44], data at each time point is processed independently. This approach is relatively simple to implement, but it can lead to significant computational overhead when dealing with long time series [71] and does not provide sufficient semantic information [72].

Subsequently, many methods opt to group a specific number of adjacent time points into one block for unified modeling at the block level. This approach, called patch-wise tokenization, enhances the model's ability to understand local patterns and dynamic changes within the time series. The earliest work to apply the patch concept within the time series domain is PatchTST [73], which divides the time series into equal-sized segments and feeds them into a Transformer Encoder, achieving outstanding forecasting results. As the field of time series analysis evolves, researchers have discovered that the patching technique not only naturally adapts to transformer architectures [48], [74] but also plays an indispensable role in other architectures such as CNNs [75], [76] and MLPs [77].
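The patch-wise tokenization described above can be sketched in a few lines of NumPy; the patch length and stride here are illustrative values rather than the settings of any particular model:

```python
import numpy as np

def patchify(series: np.ndarray, patch_len: int, stride: int) -> np.ndarray:
    """Slice a 1-D series into (possibly overlapping) fixed-length patches."""
    n_patches = (len(series) - patch_len) // stride + 1
    # Build an index matrix so each row selects one patch from the series.
    idx = np.arange(patch_len)[None, :] + stride * np.arange(n_patches)[:, None]
    return series[idx]  # shape: (n_patches, patch_len)

x = np.arange(96, dtype=float)            # a length-96 input window
tokens = patchify(x, patch_len=16, stride=8)
print(tokens.shape)                        # (11, 16): 11 patch tokens of length 16
```

With the stride equal to the patch length the patches are non-overlapping; a smaller stride yields overlapping patches, which patch-based models often use to soften boundary effects.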
However, a rigid patching operation may lead to the loss of temporal information, such as complete peaks and cycles. Recently, some studies have focused on improving the patching process, aiming to enable individual patches to retain more complete and rich temporal segment information. Examples of such advancements include deformable patchify [78], [79] and multi-scale patchify [80], [81]. The former allows the model to automatically find the optimal partition boundaries for each patch in a data-driven manner, while the latter enables the model to handle patch features of different scales, thereby mitigating the issue of local feature damage. In addition, some studies like Lag-Llama [82] have also explored using lag variables instead of continuous-time historical variables to construct a single patch, which allows the single patch to carry more comprehensive periodic information.

4.5.2 Discrete Tokenization

As discrete tokenization has demonstrated excellent results in image and audio generation [83], [84], researchers have begun to explore how to represent time series data as discrete tokens. Let X = {x1, x2, . . . , xT} be a continuous time series, where xt ∈ R for t = 1, 2, . . . , T. Discretization involves mapping X to a sequence of tokens S = {s1, s2, . . . , sN}, where si ∈ V and V is a finite vocabulary.

Although tokenizing time series into a discrete space may lead to some information loss [85], [86], this approach also has significant advantages. Compared to continuous tokenization, discrete tokenization can better capture the structure and patterns of data through discretized representations, while also exhibiting greater robustness to noise [87]. Additionally, discretized time series features can integrate more effectively with other discrete data (such as text), making them suitable for multimodal learning tasks, thereby providing improved performance and efficiency in specific application scenarios [84], [88].

Currently, the most common discretization method in time series forecasting approaches is the vector quantization network (VQN) [87]. The VQN-based method is a data-driven discretization technique, which utilizes deep learning to learn discrete representations of time series [86], [89], [90]. This scheme typically includes a vector quantization network that, through a learning paradigm focused on reconstruction, can automatically learn how to map time series data into a discrete symbolic space, and is capable of obtaining a high-density codebook that serves as a dedicated embedding vector table for the time series. The advantage of this method is its ability to handle complex data and provide rich information while reducing the length of time series [91], but it comes with higher computational costs and requires sufficient data and training time.

In addition, there are some statistics-based time series discretization methods that are used in models related to time series. These methods (such as those employed in Chronos [22] and WaveNet [88]) typically utilize statistical characteristics of the data, such as mean, variance, and quantiles, as bucketing thresholds, assigning individual time series data points to the corresponding buckets. The advantage of this approach is its simplicity and intuitiveness, as it requires no training; however, it does not reduce the length of the series [90].

Fig. 3: An example of the outcomes of applying four time-frequency transformation methods, namely DFT, DCT, DWT, and STFT, to a time series. (Panels show amplitude and phase for DFT, DCT, and STFT, and approximation and detail components for DWT.)

4.6 Frequency Transformation

Frequency transformation offers distinct advantages over traditional time-domain methods by providing localized time-frequency representations. It effectively captures instantaneous frequency, enabling precise tracking of frequency variations in time series with abrupt changes. It is essential for analyzing non-stationary time series, where frequency dynamics evolve over time, yielding deeper insights. Proper selection of transformations enhances the representational power of deep learning models, improving predictive accuracy.

Time-frequency transformation methods in time series analysis include DFT, DCT, DWT, and STFT, which are briefly summarized in Table 3. DFT decomposes a time series into sine and cosine waves, representing it in the frequency domain [92]. DCT expresses the series as a weighted sum of cosine functions with varying frequencies [93]. DWT applies multi-scale analysis and wavelet functions to capture both global trends and local details [94]. STFT segments the series and applies the Fourier Transform to each segment, extracting time-frequency information [95].

DFT and DCT are effective for stationary time series, revealing frequency components and periodic characteristics [96]. However, strictly stationary series are rare, as most data are influenced by external factors, with non-stationary series like financial and meteorological data being prevalent. STFT is ideal for time series with localized frequency stability but variability across the series, extracting time-frequency information via a sliding window approach [97]. In contrast, DWT excels in analyzing non-stationary series with instantaneous frequency changes, such as EEG and financial data, using multi-scale analysis to capture frequency variations [98]. Moreover, time series often contain noise from measurement errors, system randomness, or external interference. DWT and DCT are commonly used for denoising, with DWT effectively removing high-frequency noise and DCT primarily used for compression and feature extraction [99].

4.7 Graph Transformation

One of the most significant characteristics and challenges in multivariate time series forecasting is the complex, often non-linear correlations between variables. A viable solution is to construct a spatial-temporal graph from the original time series data, where each node represents a variable and the adjacency matrix may evolve over time. These graphs are processed using Graph Neural Networks (GNNs) to
TABLE 3: Time-frequency transformation methods.

DFT. Formulation: X(k) = Σ_{n=0}^{N−1} x(n) e^{−j2πkn/N}. Working mechanism: decomposes a signal into sine and cosine components to analyze frequency. Application scenarios: best for stationary signals with constant properties, such as periodic behaviors.

DCT. Formulation: X(k) = Σ_{n=0}^{N−1} x(n) cos(π(n + 1/2)k/N). Working mechanism: uses cosine functions to represent the signal, focusing on low frequencies. Application scenarios: used for compression and feature extraction in stationary signals, like images or videos.

DWT. Formulation: W_{j,k} = (1/√(2^j)) Σ_{n=0}^{N−1} x(n) ψ((n − k·2^j)/2^j). Working mechanism: analyzes the signal at different scales to capture both low and high frequencies. Application scenarios: suitable for non-stationary signals with changing frequencies, such as financial or biomedical data.

STFT. Formulation: X(t, f) = ∫_{−∞}^{∞} x(τ) w(t − τ) e^{−j2πfτ} dτ. Working mechanism: applies the Fourier Transform to signal segments for localized frequency analysis. Application scenarios: ideal for analyzing signals with localized frequency changes, like speech or audio.
effectively capture cross-channel relationships. Generally, spatial-temporal graphs can be generated through either heuristic-based or learning-based approaches [14], as illustrated in Figure 4.

Fig. 4: Illustration of the process of constructing spatial-temporal graphs. (A time series is passed through adjacency matrix construction, using heuristic metrics or learned metrics, to produce spatial-temporal graphs.)

Heuristic-based methods construct the spatial-temporal graph based on the inherent characteristics of the original time series data. For instance, spatial distance and connectivity metrics between variables can provide valuable guidance in typical scenarios such as traffic load forecasting and weather prediction [100]. In addition, for forecasting tasks where prior knowledge of spatial correlations between variables is lacking, compositional methods based on data similarity metrics are widely used. These methods construct an approximate adjacency matrix based on the Pearson Correlation Coefficient (PCC) [101] or Dynamic Time Warping (DTW) distance [102] between the multivariate time series.

Learning-based approaches focus on learning the graph structure in an end-to-end manner, aligning the learning process with the forecasting task itself. This allows for the data-driven discovery of less obvious graph structures. Typically, these methods employ additional neural networks alongside the main forecasting model to generate the adjacency matrix by leveraging interactions between learned variable embeddings [103], [104] or the parametric distribution [105] based on the observed time series data.

5 STATISTICAL MODELS

In time series forecasting, statistical models based on traditional methodologies occupy a prominent position due to their theoretical foundation and structured approach to model design. These models leverage inherent patterns within time series data, such as lagged values and error terms, to identify and predict future series, effectively capturing the underlying dynamics of the data. Classical time series models typically assume a linear relationship, relying on established statistical principles to produce accurate forecasts. In this section, we primarily introduce ARIMA [6] and its extensions, Exponential Smoothing methods, the State-Space Model (SSM), and the Gaussian Mixture Model (GMM), discussing their key characteristics.

5.1 Auto-Regressive Moving Average and Extensions

Autoregressive (AR) and moving average (MA) components are fundamental techniques in time series analysis. Several variants of the moving average (MA) model, such as the Simple Moving Average (SMA) and Exponential Moving Average (EMA), have been developed to improve forecast accuracy. SMA smooths short-term fluctuations by averaging over a fixed window [106], while EMA assigns exponentially decreasing weights to recent observations, making it particularly responsive to trends and seasonal changes [107]. These variants enhance flexibility in capturing different series patterns.

The Auto-Regressive Moving Average (ARMA) model combines AR and MA components and is effective for stationary time series, where statistical properties like mean, variance, and autocorrelation remain constant [27]. However, for non-stationary data, ARMA is insufficient. The Auto-Regressive Integrated Moving Average (ARIMA) model addresses this by introducing differencing (I) to stabilize the mean and make the series stationary, making it suitable for a broader range of data [6]. For data with seasonal fluctuations, the Seasonal ARIMA (SARIMA) model extends ARIMA by incorporating seasonal components to capture periodic patterns, though its complexity in model specification and parameter tuning can be challenging [6].

5.2 Exponential Smoothing Methods

Exponential smoothing methods are widely used in time series forecasting due to their simplicity, efficiency, and adaptability. The development of these methods focuses on advancements in model selection, parameter optimization, and techniques for adapting to structural changes to improve

Auto-Regressive Moving Average and Extensions: [6], [27], [106], [107];


Statistical Models Exponential Smoothing Methods: [7], [108], [109], [110];
Section 5 State-Space Models: [111], [112];
Gaussian Mixture Models: [113].

Support Vector Regression: [114], [115], [116], [117], [118];


Machine Learning (6.1) Regression Trees: [119], [120], [121], [122];
KNN: [123], [124], [125], [126].

RNNs: [19], [25], [127], [128], [129], [130], [131], [132];


CNNs: [75], [78], [96], [133], [134], [135], [136], [137];
Deep Learning (6.2)
Attention: [30], [32], [42], [73], [138], [139], [140];
Data-driven Approaches MLPs: [33], [44], [141], [142], [143], [144], [145].
Time Series Forecasting Approaches

Section 6
SSMs: [146], [147], [148], [149];
Emerging Architectures (6.3)
KANs: [150], [151], [152].

Autoregressive Models: [48], [91], [153], [154];


VAEs: TimeVAE [20], HyVAE [155];
Generative Models (6.4) GANs: TimeGAN [156], [157], Curb-GAN [158];
Flow-based: CFMTS [159], TFM [160], FM-TS [161];
Diffusion Models: [53], [162], [163], [164], [165], [166].

Contrastive Learning: [137], [167], [168], [169];


Self-supervised Pre-training (7.1) Masked Autoencoder: [73], [86], [170], [171], [172];
Autoregressive Pre-training: [48], [153], [173].

Domain Adaptation (7.2) DAF [174], STONE [175], FOIL [176].


Transfer Learning Methods
Section 7 Tuning-Based: [86], [177], [178], [179], [180];
LLM-based Models (7.3)
Tuning-Free: [154], [181], [182], [183].

Moment [184], Moirai [49].


Foundation Models (7.4)
TimesFM [185], Chronos [22], Timer [153].

Causal Discovery: [186], [187], [188];


Interpretability (8.1)
PINNs: [189], [190], [191], [192], [193].
Trustworthy Forecasting
Section 8 Robustness (8.2) Robust TSF: [194], [195], [196], [197].

Privacy (8.3) Federated TSF: [198], [199], [200].

Fig. 5: A brief summary of the time series forecasting approaches.


forecasting accuracy [108]. Simple Exponential Smoothing (SES), which assigns weights to observations using a smoothing constant α, enables precise predictions [109]. Several studies compare different exponential smoothing models, highlighting the robustness and accuracy of state-space and dynamic nonlinear approaches, as well as strategies for managing uncertainties [110]. For seasonal data, the Holt-Winters method decomposes the time series into trend, seasonal, and random components, supporting both additive and multiplicative forms. This makes it particularly suitable for industries with predictable seasonal demand, such as retail and energy [7].

5.3 State-Space Models

The SSM, derived from control theory, provides a flexible framework for modeling dynamic systems. By incorporating state and observation equations, the State-Space Model (SSM) effectively captures both linear and nonlinear behaviors, as well as stationary and non-stationary time series. The Kalman Filter (KF), a core SSM algorithm, is widely used in structural and nonlinear time series modeling [111]. Moreover, the Hidden Markov Model (HMM), a special case of SSM with discrete hidden states, is particularly well-suited for sequential data, offering a powerful tool for modeling systems where the underlying states evolve in a probabilistic manner. To address non-stationarity and nonlinearity in financial time series, an extended HMM integrates exponentially weighted and dual-weighted Expectation-Maximization (EM) algorithms, enhancing complex financial dynamics modeling and advancing high-dimensional nonlinear time series analysis [112].

5.4 Gaussian Mixture Models

The GMM assumes that the data distribution can be represented as a linear combination of multiple Gaussian components and employs the Expectation-Maximization (EM) algorithm for parameter estimation. Due to its flexibility in modeling complex distributions and capturing multimodal characteristics, GMM has been widely applied in density estimation, clustering, and anomaly detection. In time series forecasting, GMM is particularly effective in modeling non-stationary characteristics. The GMM estimator of [113] proposes a Generalized Method of Moments (GMM)-based approach to estimate MSM parameters. This method effectively captures structural changes in time series and improves volatility forecasting performance.
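For intuition, the EM loop that fits a GMM can be sketched in the one-dimensional, two-component case as follows; the synthetic data and all settings are illustrative and are not the estimator of [113]:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two well-separated Gaussians (assumed example).
x = np.concatenate([rng.normal(-5.0, 1.0, 400), rng.normal(5.0, 1.0, 600)])

# Initial guesses for the two components.
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

for _ in range(100):
    # E-step: posterior responsibility of each component for each point.
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
        / (sigma * np.sqrt(2 * np.pi))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing weights, means, and standard deviations.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2) .sum(axis=0) / nk)

print(sorted(np.round(mu, 1)))  # component means recovered near -5 and 5
```

In practice, library implementations add log-space computations and multiple restarts for numerical stability, but the alternation between responsibilities (E-step) and weighted moment updates (M-step) is exactly this loop.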
6 DATA-DRIVEN APPROACHES

In this section, we mainly cover the progress in data-driven time series forecasting approaches, which learn the temporal patterns directly from the time series data. We include methods ranging from traditional machine learning methods to recent advancements in deep neural networks, with an emphasis on the latest generative models.

6.1 Machine Learning Approaches

Machine learning methods are important tools for time series forecasting, offering advantages for capturing complex patterns. In this section, we will discuss three prominent approaches: Support Vector Regression (SVR), Regression Trees, and K-Nearest Neighbor (KNN) regression. Each of these approaches provides unique strengths in addressing forecasting challenges.

6.1.1 Support Vector Regression

Support Vector Regression (SVR) [201] is a regression form of Support Vector Machines (SVM) [202] designed to predict continuous values. The objective of SVR is to ensure that most of the data points (xi, yi) satisfy the condition that the prediction f(xi) lies within an ϵ range of the actual value yi. SVR can be solved by transforming the original optimization problem into its dual form with Lagrange multipliers αi and αi* [203]. The final regression function is given by:

f(x) = Σ_{i=1}^{n} (αi − αi*) K(xi, x) + b.    (3)

The function K(xi, x) is the kernel function. Common kernel functions include the linear kernel, polynomial kernel, Gaussian (RBF) kernel, and so on.

In time series forecasting, SVR is widely used across various fields [114]. For example, Cao et al. [115] use SVR to predict the relative price change trends of futures contracts, Lu et al. [116] use SVR to forecast air quality parameters, and Pai et al. [117] propose a Seasonal Support Vector Regression (SSVR) model to address the challenges of seasonal time series forecasting. Zhang et al. [118] propose a multiple support vector regression (MSVR) model to reduce error accumulation.

6.1.2 Regression Trees

The regression tree is a regression method based on the decision tree. Its core concept is to iteratively partition the feature space into mutually exclusive regions, with each region associated with a predicted value.

The Classification and Regression Trees (CART) algorithm [204] is one of the most widely used methods. CART recursively splits data based on selected features and corresponding split points to minimize the sum of mean squared errors (MSE) over the resulting subsets. This process continues until a predefined stopping criterion is met, and the target value is typically predicted by averaging the target values within the leaf nodes. Moreover, to reduce overfitting and improve prediction accuracy, ensemble learning methods like Random Forest [205], Gradient Boosted Decision Trees (GBDT) [206], Extreme Gradient Boosting (XGBoost) [207], and LightGBM [208] are often used.

Tree-based methods have played an important role in time series forecasting competitions [119] and in many fields such as traffic [120], healthcare [121], energy [122], and so on.

6.1.3 K-Nearest Neighbor Regression

K-Nearest Neighbor (KNN) regression is a non-parametric method and does not require explicit model construction [209]. It works by calculating the distance (e.g., Euclidean distance or Dynamic Time Warping) between the sample to be predicted and each sample in the training set, identifying the k most similar neighbors, and then producing the prediction by taking an average of their target values. Given a sample x, the prediction ŷ can be formulated as:

ŷ = (1/K) Σ_{i∈N(x)} yi,    (4)

where N(x) denotes the K nearest neighbors to the sample x, and yi are their corresponding target values in the training set. In early research, KNN regression was used to deal with properties like repetitive patterns [123] or seasonality [124]. Researchers have also explored its application in multivariate time series forecasting [125], [126].

6.2 Neural Networks and Deep Learning Models

With the development of deep learning, numerous neural forecasting models have been proposed that achieve superior forecasting accuracy, owing to their powerful capability to capture both temporal and cross-channel dependencies. In this section, we present recent advancements in deep learning-based forecasting approaches, focusing on mainstream architectures, as illustrated in Figure 6.

6.2.1 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) have gained a lot of attention due to their unique advantage in modeling sequential data. The basic recursive structure can be written as follows:

ht = σ(U h_{t−1} + V xt + bh),
ŷt = W ht + by,    (5)

where xt, ŷt, and ht are the input, output, and hidden state at time step t, respectively, and h_{t−1} is the hidden state at the previous time step t − 1. U, V, and W are the weight matrices, bh and by are the bias terms, and σ is a nonlinear activation function.

In recent years, many studies have used RNNs as backbone networks for time series forecasting [210]. DeepAR [19], MQRNN [211] and DF-Model [130] are probabilistic forecasting models designed for uncertainty quantification. DeepAR generates the probability distribution for future time steps by jointly learning historical patterns and seasonal features across multiple sequences. MQRNN employs an Encoder-Decoder structure, with LSTM as the Encoder and two MLP branches as the Decoder, to simultaneously predict quantiles for multiple future time steps. DF-Model decomposes time series into global and local parts, using an RNN to extract complex non-linear patterns globally and capturing individual random effects for each time series locally with probabilistic models like Gaussian processes (GP).
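Equation (5) corresponds to a single recurrent step; unrolled over a sequence it can be sketched as follows, with tanh as an assumed choice of the nonlinearity σ and arbitrary illustrative dimensions:

```python
import numpy as np

def rnn_forward(x_seq, U, V, W, b_h, b_y):
    """Run Eq. (5): h_t = tanh(U h_{t-1} + V x_t + b_h), y_t = W h_t + b_y."""
    h = np.zeros(U.shape[0])                 # initial hidden state h_0 = 0
    outputs = []
    for x_t in x_seq:
        h = np.tanh(U @ h + V @ x_t + b_h)   # hidden-state recurrence
        outputs.append(W @ h + b_y)          # per-step prediction
    return np.array(outputs), h

rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 3, 8, 1, 24          # illustrative sizes
U = rng.normal(0, 0.1, (d_hid, d_hid))
V = rng.normal(0, 0.1, (d_hid, d_in))
W = rng.normal(0, 0.1, (d_out, d_hid))
b_h, b_y = np.zeros(d_hid), np.zeros(d_out)

y_hat, h_T = rnn_forward(rng.normal(size=(T, d_in)), U, V, W, b_h, b_y)
print(y_hat.shape, h_T.shape)  # (24, 1) (8,)
```

The same weight matrices U, V, and W are reused at every time step, which is what lets an RNN process sequences of arbitrary length; the repeated multiplication by U is also the source of the vanishing/exploding-gradient issues discussed below.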
Fig. 6: Illustration of several types of deep neural networks for time series forecasting. (Panels: RNN-based models with recurrent hidden states h0 ... ht, CNN-based models with convolution layers, attention-based models with projection and attention layers, and MLP-based models with fully-connected layers; each maps inputs to outputs.)

In the point-based forecasting area, DA-RNN [128] and MH-TAL [129] both apply the attention mechanism to enhance context capture and prediction accuracy in forecasting tasks. DA-RNN introduces input attention at the encoder stage and temporal attention at the decoder stage, while MH-TAL uses the attention mechanism to establish a connection between future and historical time step features. Besides, the CRU model [25] introduces a new RNN variant for modeling irregular time series, and SegRNN [132] proposes a GRU-based model that employs a channel-independent strategy and replaces the original time point-wise iterations with sequence segment-wise iterations. Additionally, hybrid methods have also been studied: LSTNet [127] combines CNN and RNN architectures to capture both short-term and long-term dependencies, with an autoregressive component enhancing stability in multi-step forecasting; its recurrent-skip component helps alleviate gradient vanishing in long-sequence modeling. ESLSTM [131] combines the exponential smoothing model with LSTM to create a hybrid hierarchical forecasting model.

In summary, RNNs have several strengths, including the ability to process time series of arbitrary length and to model temporal dependencies. But they also have certain limitations [212], [213], such as difficulty in capturing long-term dependencies due to vanishing or exploding gradients, and challenges with parallelization due to their sequential nature. Recent advancements, such as RWKV [214] and xLSTM [215], have provided new insights into RNNs. The integration of these advancements into time series forecasting is still an open question and requires further exploration.

6.2.2 Convolutional Neural Networks

The Convolutional Neural Network (CNN) is a prevalent architecture in deep learning, demonstrating exceptional performance in fields such as image processing, video analysis, and natural language processing. In the field of time series analysis, the convolution operation, a fundamental operation in the CNN, can be defined as follows:

Y(t) = \sum_{c=1}^{C} \sum_{k=0}^{K-1} X_c(t + k) \cdot h_c(k),    (6)

where X is the input time series of length L with C channels, h is the convolutional kernel of size K, and Y is the output of the convolution operation at time t for a single output channel.

Convolutional Neural Networks (CNNs) exhibit three key characteristics when analyzing time series data: local connectivity, weight sharing, and translation invariance [216]. Local connectivity ensures that each neuron focuses on a local segment of the input data, weight sharing uses the same weights across different segments of the input, and translation invariance allows the network to recognize features learned at any position in the input. These characteristics are crucial for capturing local patterns in time series that may shift over time, while significantly reducing the number of parameters and enhancing the network's ability to efficiently analyze time series data. Given CNNs' strong performance in feature extraction and computational efficiency, researchers suggest considering convolutional networks as one of the primary candidate models for modeling sequence data [217].

The initial convolutional neural network designed specifically for time series data was the Temporal Convolutional Network (TCN) [217]. TCN employs dilated convolutions, allowing it to achieve a wider receptive field with the same number of model layers. Subsequently, researchers have continued to explore the potential of convolutional neural networks in time series analysis. MLCNN [133] utilizes a multi-layer CNN (more than 10 layers) to learn deep abstract features of time series, while DSANet [218] and MICN [134] employ convolutional kernels of two different scales, extracting both local and global features of time series simultaneously. Unlike the traditional organization of convolutional kernels, SCINet [135] uses sequence downsampling segmentation and organizes the convolutional networks according to binary trees, alleviating the issues of limited receptive fields in lower layers of the TCN and the inability of a single convolutional filter to capture complex features. To fully leverage the frequency characteristics of time series, FTMixer [136] employs convolutional neural networks to extract both frequency-domain and time-domain features of time series simultaneously.

As exploration in the field of time series continues to advance, researchers have started to focus on the study of generic frameworks for time series analysis. CNNs, with their powerful data understanding capabilities and high computational efficiency, have been widely used in the design of such generic frameworks. TimesNet [96] folds the time series according to its primary cycles, treats the folded time series as images, and uses 2D convolution to extract abstract features. ModernTCN [75] employs depth-wise separable convolutions instead of traditional convolutions, achieving high effectiveness and computational efficiency in time series analysis; it also introduces the concept of reparameterization, which stabilizes the learning of large convolutional kernels. At the same time, ConvTimeNet [78] adopts a multi-scale deep convolutional neural network with large kernels to simultaneously learn global representations and deep representations. TS2Vec [137], which utilizes the TCN architecture to extract features, employs contrastive learning on the feature views at each layer, providing strong contextual support for each timestamp.

In summary, convolutional neural networks excel in computational efficiency and performance in time series forecasting. However, due to their parameter-sharing and local-perception characteristics, these networks struggle to capture dependencies between distant time points in long sequences.

6.2.3 Attention-based Neural Networks

The Transformer [219] is one of the most successful architectures of the deep learning era and has brought significant advancements to various research areas. The most outstanding design of the Transformer is its attention mechanism, which is expressed as:

Attention(Q, K, V) = Softmax(Q K^T / \sqrt{d}) V,    (7)

where Q, K, V are the query, key, and value vectors with dimension d, respectively. Based on this, the Transformer is stacked from multiple blocks which function as:

H'_l = LayerNorm(SelfAttn(H_l) + H_l),
H_{l+1} = LayerNorm(FFN(H'_l) + H'_l),    (8)

where SelfAttn is a special form of the attention mechanism in which all three vectors are derived from the same input, H_l is the input of the l-th block, FFN is the feed-forward network made up of multi-layer perceptrons, and LayerNorm refers to the layer normalization operation. Through this iterative modeling approach, the Transformer has shown great ability to model long-range dependencies, and recent works have proposed many variants of Transformers tailored for time series forecasting tasks [13].

Early Transformer-based forecasting models generally adopt the traditional method of projecting multi-channel data at a single time step into a hidden state [30]. However, due to the inherent noise in time series data, a single time step lacks semantic meaning comparable to that of a word in a sentence. To address this limitation, PatchTST [73] introduces a patching technique that enhances locality and captures richer semantic information, a method now widely adopted by recent studies. Additionally, iTransformer [140] refines this approach by projecting each variate's whole series into a variate-level token, enabling a more effective representation of multivariate correlations.

On the other hand, regarding the core attention mechanism, the original attention module operates on each time step, exhibiting quadratic time and memory complexity of O(L^2), where L is the sequence length. Given that time series often form long sequences, the conventional attention mechanism may incur a significant computational burden and be susceptible to disturbances from noise or outliers. To this end, LogTrans [138] suggests a sparse convolutional self-attention mechanism that generates queries and keys using causal convolution to lower computational complexity. Besides, Autoformer [42] has developed the Auto-Correlation mechanism, which discovers and represents dependencies at the sub-series level through the use of the Fast Fourier Transform (FFT). The aforementioned representative methods achieve a computational complexity of O(L log L), while some more recent methods have pushed the complexity to O(L): Pyraformer [220] introduces the pyramidal attention module (PAM), in which an inter-scale tree structure summarizes features at different resolutions and intra-scale neighboring connections model temporal dependencies of different ranges. FEDformer [139] proposes frequency-enhanced blocks to capture important structures in time series through frequency-domain mapping, which also achieves linear complexity by randomly selecting a fixed number of Fourier components. Moreover, from the non-stationarity perspective, the Non-stationary Transformer [221] exploits the De-stationary Attention mechanism to boost the forecasting performance of mainstream Transformers.

In terms of the overall structure, early research on Transformer variants for forecasting predominantly adopted the traditional encoder-decoder structure [30], [32], [42], [139], [221], [222]: the encoders process the long-horizon series into hidden states, which are later decoded by the decoders to generate future forecasting results in one forward procedure. Later works point out that complex decoders are not necessary and adopt an encoder-only structure, replacing the decoder with a linear prediction layer; experimental results show that encoder-only Transformers achieve more accurate forecasting results on the benchmark datasets [45], [73], [140], [223], [224]. Moreover, as recent studies have focused on training time series foundation models, the auto-regressive decoder-only Transformer structure is widely utilized due to its ability to process and generate series of arbitrary length [48], [49], [153].

In summary, the Transformer architecture has been extensively studied for forecasting tasks due to its ability to effectively capture long-term dependencies and its scalability as a foundation model. However, when applied to small-scale, domain-specific time series data, the substantial data requirements for training Transformers may result in overfitting issues.

6.2.4 Multi-layer Perceptrons

The time series forecasting task basically regresses future values on the observations within a lookback window. Multi-Layer Perceptron (MLP) based approaches assume that the major correlations exist in a linear form, which linear models with high computational efficiency and interpretability can capture well.

From the model-architecture perspective, N-BEATS [141] is a groundbreaking work that constructs a deep neural architecture based on backward and forward residual links and an exceptionally deep stack of fully connected layers. Building on this stacking approach, Nhits [44] integrates innovative hierarchical interpolation and multi-rate data sampling techniques to reduce computational complexity and forecast volatility, and Koopa [142] disentangles time-variant and time-invariant components from intricate non-stationary series with a Fourier Filter and designs MLP-based Koopman Predictors to advance the respective dynamics forward. Additionally, TimeMixer [145] notes that time series display unique patterns at various sampling scales. Consequently, it constructs a fully MLP-based architecture

to fully exploit the disentangled multiscale series during both the past-extraction and future-prediction phases.

Besides, recent work further leverages the advantages of MLP-based models from the perspective of reducing the number of model parameters for better computational efficiency. Initially, DLinear [33] questions the necessity of complex deep neural networks for time series forecasting and proposes that combining the simplest linear transformation with a decomposition technique may achieve superior performance on the benchmark datasets. FITS [143] offers a pioneering approach to time series analysis by employing a complex-valued neural network to capture both amplitude and phase information simultaneously, further reducing the parameters to about 10k. Moreover, SparseTSF [144] proposes the Cross-Period Sparse Forecasting technique, simplifying the forecasting task by decoupling the periodicity and trend in time series data and utilizing only 1k parameters to conduct accurate forecasting.

In summary, MLP-based forecasting models have demonstrated their effectiveness on benchmark datasets, offering advantages in interpretability and computational efficiency. However, due to their inherently simple structure, their scalability to large-scale and complex scenarios requires further investigation.

6.3 Emerging Network Architectures

6.3.1 Deep State Space Models

State Space Models (SSMs) are a class of mathematical models that represent dynamic systems using a set of linear or nonlinear equations describing the system's state evolution over time; they have emerged as promising alternatives for sequence modeling. Rangapuram et al. [146] originally proposed the deep state space model for probabilistic time series forecasting by parametrizing a per-time-series linear state space model with a jointly learned recurrent neural network, combining the data efficiency and interpretability of SSMs with the strong representation learning capability of deep learning. SpaceTime [147] further introduces a new SSM parameterization that is more suitable for autoregressive time series processes and performs better. Moreover, since the proposal of the advanced SSM structure Mamba [225], many works have attempted to apply it to the field of time series forecasting and have achieved satisfactory results [148], [149].

6.3.2 Kolmogorov-Arnold Networks

The Kolmogorov-Arnold Network (KAN) [150] is a recently proposed architecture in which activation functions are applied on the connections between nodes; it is believed to have the potential to serve as an alternative to MLPs. Some pioneering works have been devoted to validating its effectiveness in many time series tasks, including general time series analysis [151], forecasting [152], and so on.

6.4 Deep Time Series Generative Models

Generative models have emerged as powerful tools in time series forecasting due to their ability to capture the underlying data distribution and generate diverse samples. Discriminative models focus on estimating the conditional probability P(Y|X) to predict an output Y given an input X, whereas generative models aim to learn the joint probability distribution P(X, Y); the two types of models thus differ in their objectives and learning approaches. This enables generative models not only to predict outcomes but also to synthesize new data that align with the observed distribution. For better illustration, we present a schematic of common generative models in Figure 7.

Fig. 7: Illustration of generative models: autoregressive models, variational autoencoders, generative adversarial networks, flow-based models, and diffusion models.

6.4.1 Autoregressive Models

Autoregression is one of the fundamental concepts in both language modeling and time series forecasting. However, due to the non-stationary nature of time series data and the high demand for precision [12], autoregressive methods often face challenges such as inherent error accumulation. As a result, autoregressive models have not been fully explored in time series modeling. The most well-known model, ARIMA [27], combines the autoregressive (AR) and moving average (MA) models by introducing differencing. Recently, with the rise of pre-trained large language models [226], the potential of autoregressive models on time series data has begun to be explored [91], [153], leveraging their generalization and task versatility. These models have even been adapted as autoregressive predictors [154] to achieve arbitrary input-output mappings.

6.4.2 Variational Autoencoders (VAEs)

VAEs are probabilistic generative models that encode input data into a latent space and then decode it to reconstruct the original data [227]. This structure allows VAEs to model complex data distributions and generate new samples by sampling from the latent space. VAEs are particularly advantageous for handling uncertainty and learning structured latent representations of the data, making them well-suited for time series tasks.

One notable work in this area is TimeVAE [20]. TimeVAE introduces a novel VAE-based architecture tailored for multivariate time series generation. It emphasizes interpretability, allows for encoding domain-specific knowledge, and significantly reduces training time. Through experiments

on multiple datasets, TimeVAE demonstrates its ability to accurately represent temporal attributes, performing well in both similarity and next-step prediction tasks. Furthermore, it can integrate domain-specific patterns, such as polynomial trends and seasonalities, to generate interpretable outputs. This feature is especially beneficial for applications requiring transparency.

Another important contribution is HyVAE [155]. HyVAE combines the strengths of VAEs with diffusion models by employing a hybrid variational inference approach. This integration enhances the model's ability to capture temporal dependencies and uncertainty, leading to improved performance in time series forecasting tasks.

6.4.3 Generative Adversarial Networks (GANs)

GANs consist of two components: a generator, which learns to produce realistic samples, and a discriminator, which distinguishes between real and generated data [228]. This adversarial training framework enables GANs to generate highly realistic data, making them powerful tools for data synthesis. In the context of time series, GANs must address the challenge of modeling sequential dependencies.

A seminal work in this field is TimeGAN [156]. TimeGAN adapts the GAN framework specifically for time series data by incorporating a recurrent architecture into both the generator and the discriminator. This allows the model to capture temporal dependencies while maintaining statistical consistency with the original data. TimeGAN has been shown to effectively generate realistic time series data, which can be used for tasks such as simulation, data augmentation, and anomaly detection. Recent studies have extended this framework to specific domains. For example, Context-aware Traffic Flow Forecasting in New Roads [157] presents a GAN-based model that predicts traffic flow on new roads by considering contextual factors like weather and day type, demonstrating effectiveness in scenarios with limited data. Another approach, Curb-GAN [158], addresses the urban traffic estimation problem by using a conditional GAN with dynamic convolutional layers and self-attention mechanisms to capture both spatial and temporal dependencies, providing accurate estimations even under unprecedented travel demand patterns.

6.4.4 Flow-based Models

Normalizing-flow-based models transform a simple base distribution into a more complex target distribution through a series of invertible and differentiable transformations [229]. This method facilitates exact likelihood estimation, making it particularly effective for modeling high-dimensional data with complex dependencies. In the context of time series, MAF [21] utilizes normalizing flows to model multivariate time series by conditioning on past observations, capturing intricate temporal dependencies.

Recent developments have further extended this technique in various ways. For instance, Conditional Flow Matching for Time Series (CFMTS) [159] improves the training of neural ODEs by regressing vector fields of conditional probability paths, outperforming traditional methods on long-trajectory tasks. Trajectory Flow Matching (TFM) [160] introduces a simulation-free training method for neural SDEs, enhancing stability and scalability, with promising results on clinical time series. Additionally, FM-TS [161] simplifies time series generation through a Flow-Matching-based framework, offering efficient training and inference while outperforming diffusion models in both unconditional and conditional time series generation.

6.4.5 Diffusion Models

In time series forecasting, Denoising Diffusion Probabilistic Models (DDPMs) have gained prominence for capturing complex temporal patterns, achieving high predictive performance, and generating realistic data samples. DDPMs operate via a diffusion-denoising process: noise is incrementally added during forward diffusion and subsequently removed in reverse diffusion to recover the original data. This allows the model to learn complex data distributions and produce high-quality forecasts.

Early models like TimeGrad [162] introduced autoregressive denoising with Langevin sampling. TSDiff [163] improved short-term accuracy and data generation through self-guidance, while ScoreGrad [164] utilized stochastic differential equations (SDEs) for continuous-time forecasting, addressing irregularly sampled data. Conditional diffusion models, such as TimeDiff [165] and DiffLoad [230], incorporated external information to improve accuracy, with applications ranging from power load forecasting to sparse ICU and ECG data [231], [232].

Recent advancements include Latent Diffusion Models (LDMs), with examples such as Latent Diffusion Transformers (LDT) [233], which have demonstrated notable improvements in both precipitation forecasting and scalability. Innovations such as DSPD and CSPD [53] extend diffusion to function space for anomaly detection and interpolation. Models like FDF [166] address challenges in trend modeling by integrating linear layers and conditional modules to capture trend and seasonal components, improving long-term forecasting.

Diffusion models have also been applied to diverse domains, including flood forecasting [234], stock market prediction [235], and electric vehicle load forecasting [236]. While they show strong potential, limitations remain in effectively modeling trends and long-term dependencies. Future advancements in denoising techniques and trend-separation methods, combined with diffusion models' high parallelization and cross-domain applicability, offer promising opportunities for further enhancement and deployment across various fields.

7 TRANSFER LEARNING METHODS

In this section, we introduce transfer learning techniques for time series forecasting, including self-supervised pre-training, domain adaptation, and LLM-based methods.

7.1 Self-supervised Pre-training Methods

Self-supervised learning alleviates the dependence on large labeled datasets by enabling models to learn transferable representations through pre-training on unlabeled data. Prominent self-supervised pre-training techniques include contrastive learning, denoising masked autoencoders, and autoregressive pre-training models. The proposed taxonomy is illustrated in Figure 8, and the related works can be found in Table 4.
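The three pre-training objectives just listed differ mainly in how they build training pairs from an unlabeled series. A toy NumPy sketch of the data-construction side only (the masking ratio and the jitter/crop augmentations are illustrative assumptions, not taken from any specific method in Table 4):

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.standard_normal(64)   # one unlabeled univariate series

# 1) Denoising-masked objective: hide a random subset of time steps;
#    the model must reconstruct the original values at masked positions.
mask = rng.random(series.shape) < 0.25      # ~25% masking ratio (assumed)
masked_input = np.where(mask, 0.0, series)  # masked steps zeroed out
recon_target = series[mask]                 # loss is computed here only

# 2) Contrastive objective: two augmented "views" of the same window
#    (jittering and temporal cropping, as named in the text) form a
#    positive pair; other windows in the batch act as negatives.
view_a = series + 0.05 * rng.standard_normal(series.shape)  # jitter
view_b = series[8:56]                                       # temporal crop

# 3) Auto-regressive objective: predict each value from its history.
context, next_step = series[:-1], series[1:]

assert masked_input[mask].sum() == 0.0
assert recon_target.shape[0] == mask.sum()
assert context.shape == next_step.shape
```

Each construction is then paired with its loss: a reconstruction error on the masked positions, an InfoNCE-style loss pulling the two views together, or a next-step prediction error, respectively.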

TABLE 4: Systematic elucidation and analysis of representative work categories.

Optimization          Model Name         Space           Cross-domain   GitHub Link   Year
Contrastive Learning  TNC [167]          Raw Space       No             [Link]        2021
                      TS-TCC [168]       Raw Space       No             [Link]        2021
                      TS2Vec [137]       Raw Space       No             [Link]        2022
                      TF-C [170]         Raw Space       No             [Link]        2021
Denoising Masking     TST                Raw Space       No             [Link]        2022
                      PatchTST [73]      Raw Space       No             [Link]        2023
                      TimeMAE [171]      Discrete Space  No             [Link]        2024
                      CrossTimeNet [86]  Discrete Space  Yes            [Link]        2024
                      SimMTM [172]       Manifold Space  No             [Link]        2024
Auto-regressive       TimeDART [173]     Raw Space       No             [Link]        2024
                      GPHT [48]          Raw Space       Yes            [Link]        2024
                      Timer [153]        Raw Space       Yes            [Link]        2024

Fig. 8: Illustration of different pre-training paradigms: contrastive learning (positive and negative pairs), auto-regressive learning, and denoising masked learning.

7.1.1 Contrastive Learning Model

The fundamental principle of contrastive learning is to allow the model to differentiate between positive and negative samples. This is achieved by drawing positive samples closer together and pushing negative samples further apart, thereby enabling the model to develop a highly discriminative representation. For time series data, contrastive learning frequently employs data augmentation techniques such as jittering, temporal cropping, and masking to generate varied perspectives of the samples, thereby enhancing the model's generalization across diverse scenarios. TS2Vec [137] serves as a foundational contrastive learning framework for time series, introducing a universal approach to time series representation. Building upon this groundwork, TNC [167] and TS-TCC [168] further advance unsupervised representation techniques, refining the model's capacity for feature extraction in time series data. TF-C [169] extends this framework by incorporating frequency-domain information and employing contrastive learning to enforce consistency between the time and frequency domains, thereby enabling the model to capture stable and generalizable representations across diverse time series datasets.

7.1.2 Denoising Masked Autoencoder Model

The masked autoencoder model strategically masks portions of the input data, allowing the model to learn robust data representations by reconstructing the masked segments. In the raw data space, TST [170] establishes the masked autoencoder framework, while PatchTST [73] further advances it by introducing channel independence and partitioning the time series into patches, thereby enhancing model flexibility in time series representation. Within the discrete latent space, TimeMAE [171] and CrossTimeNet [86] project time series into a discrete representation space, unifying the input format and facilitating cross-domain self-supervised pre-training. In the manifold space, SimMTM [172] reconstructs masked time points by weighted aggregation of adjacent time slices.

7.1.3 Auto-regressive Pre-training Model

Auto-regressive models capture time-dependent characteristics by leveraging historical data to predict values at future time points. In time series data, auto-regressive pre-training models progressively learn sequence dynamics and temporal dependencies by modeling historical patterns. TimeDART [173] uses time series blocks as fundamental modeling units, employing a self-attention-based Transformer encoder to establish inter-block dependencies and incorporating diffusion and denoising mechanisms to capture fine-grained features within blocks, effectively applying diffusion models to time series data. GPHT [48] constructs a hybrid dataset for pre-training under the channel-independence assumption, significantly expanding the data scale. Additionally, it employs an autoregressive prediction method to model the temporal dependencies of the output sequence without a custom prediction head, allowing for adaptation to various prediction lengths and providing new insights for self-supervised pre-training. Timer [153] demonstrates strong scalability and generalization by unifying heterogeneous time series into a single sequence format (S3) and utilizing a GPT-style decoder structure.

7.2 Domain Adaptation

Distribution shift is a common issue in time series data, leading to the out-of-distribution (OOD) challenge that hampers the performance of forecasting models. Moreover, time series data from different domains often exhibit distinct mapping relationships, further complicating the development of transferable forecasting models. Therefore, to achieve transferable time series forecasting, the primary challenge lies in explicitly implementing domain adaptation.

SLARDA [237] introduces a self-supervised pretraining approach for the source domain, utilizing a contrastive predictive loss to improve representation learning and enhance the transferability of learned features. This method also accounts for the temporal dynamics of data during both feature learning and domain alignment. Besides, DAF [174] employs an attention-based shared module paired with a

domain discriminator to learn domain-invariant latent features, leveraging statistical strengths from the source domain to enhance target-domain performance. STONE [175] focuses on learning invariant node dependencies for OOD spatial-temporal data, addressing both structural and temporal shifts. Moreover, FOIL [176] introduces an innovative surrogate loss function designed to alleviate the influence of unobserved variables, coupled with a joint optimization strategy. This facilitates the acquisition of invariant representations across the inferred environments, thereby enhancing forecasting accuracy for OOD generalized time series data.

7.3 LM-based Models

The LM-based time series predictor is an innovative approach that utilizes advanced language models to forecast future values. These language models, pretrained on extensive textual data, possess the ability to capture rich semantic information and robust reasoning capabilities [238], [239]. This enables them to effectively analyze and reason about time series data, thereby enhancing their understanding and predictive abilities regarding time series patterns and trends [182]. Moreover, compared to traditional time series predictors, language models can further improve prediction accuracy by integrating relevant textual information, such as individual characteristics and sampling backgrounds, with the corresponding numerical features through specific prompts or quantification processes [178], [180], [240]. We categorize the LM-based time series predictors into two types: tuning-based and tuning-free.

7.3.1 Tuning-Based

The tuning-based approach makes precise adjustments to the backbone parameters to adapt to specific time series data. It usually employs pre-trained language models and involves additional training on specific time series data to adjust the model weights. The fine-tuning process helps the model to more accurately understand and predict the specific patterns and trends within the time series. To better achieve this goal, researchers have explored multiple tuning perspectives. First, regarding the forecasting paradigm, AutoTimes [154] utilizes an autoregressive generation approach, which is particularly suited for simulating left-to-right sequential relationships and incrementally constructing the target sequence [153], [173]. To avoid the potential error accumulation associated with autoregression, other models such as FPT [177] opt for a one-step prediction approach, generating the entire forecast target sequence in one go. Besides, for the training paradigm, models like CrossTimeNet [86] introduce a pre-training phase of self-supervised representation learning, which deepens the model's perception and understanding of time series data through extensive representation learning. Additionally, models such as TEMPO [180] incorporate related textual information as input, enabling the effective use of text features to assist in predictions. Moreover, regarding base-model parameter updates, some models like LLM4TS [179] employ a LoRA [241] fine-tuning strategy to make low-rank adjustments to critical components, while other models [86], [177] may directly update certain key components or the entire base parameters. Finally, for the processing of the input layer, models like Chronos [22] and UniTime [178] might directly input time series data processed through embedding into the model, whereas models like TimeLLM [182] adopt a feature-fusion approach, reprogramming temporal features with textual information to enhance the informativeness of the input data. The combined application of these diverse strategies and methods enables the models to exhibit exceptional performance in handling time series forecasting tasks. To better demonstrate the above content, we select several representative models and present their tuning perspectives in Table 5.

7.3.2 Tuning-Free

The tuning-free approach utilizes pre-trained language models for direct time series forecasting without additional fine-tuning. This method offers the advantage of rapid deployment, significantly reducing computational costs and time. By leveraging the general linguistic features already learned by the model, effective predictions can be made across various types of time series data. In this context, LLMTime [181] incorporates the sampling background of an instance as textual information, constructing it together with the input sequence into a prompt. Going further, LSTPrompt [242] introduces a TimeBreak design that simulates human thinking scenarios, allowing the model to take a break after every k predictions. When designing prompts for predicting patient survival probabilities [243], Zhu et al. make full use of the patients' Electronic Health Records (EHR) and adopt a context-learning strategy consistent with the clinical environment. Besides, TimeRAF [244] adopts the RAG concept, using a K-nearest-neighbors approach to retrieve the k closest time series samples from a database and constructing them into prompts for reference by the language model. Similarly, TimeRAG [245] integrates multiple retrieved neighboring samples with the original text into JSON-formatted strings and then feeds them into the language model. Meanwhile, TableTime [246] validates the capabilities of the large language model Llama-3.1-405b-instruct in understanding and classifying time series. From the agent perspective, TESSA [247] designs a time series annotation scheme that leverages both general and specific domain annotation agents to generate corresponding textual annotations for time series, significantly enhancing the understanding and reasoning capabilities of language models (such as GPT-4o) regarding time series data. It is noteworthy that Merrill et al. [248] have found that tuning-free forecasting models perform only marginally better than random guessing in time series reasoning tasks, and that introducing related context also offers only modest improvements in predictive ability. These weaknesses indicate that time series reasoning is an influential yet severely underdeveloped direction for tuning-free LM-based time series predictors.

7.4 Time Series Foundation Models

Recent years have witnessed the emergence of time series foundation models pretrained on large-scale temporal data, which drive innovations in time series forecasting tasks through their cross-domain transferable representations. Existing architectures of time series foundation models gener-
PREPRINT 18

TABLE 5: The summary of tuning perspectives from several representative methods, corresponding to the definitions in
section 7.3. In the column of ”Parameter Updating”, D and L represent ”Directly Updating Parameter” and ”Lora Fine-
tuning” respectively, with the specific tuning components indicated in parentheses.

Forecasting Training Textual Parameter Input with


Model Backbone
Paradigm Paradigm Information Updating Reprogramming
AutoTimes auto-regressive w/o pretrain ! D(projection) ! LLaMA-7B
FPT one-step w/o pretrain % D(add&norm) % GPT-2
TimeLLM one-step w/o pretrain ! D(projection) ! LLaMA-7B
CrossTimeNet one-step w/ pretrain ! D(full-tuning) % BERT
LLM4TS one-step w/ pretrain % L(attention Q&K) % GPT-2
Chronus auto-regressive w/o pretrain % D(full-tuning) % T5
UniTime one-step w/o pretrain ! D(full-tuning) % GPT-2
TEMPO one-step w/o pretrain ! D(full-tuning) % GPT-2

generally adopt two paradigms: encoder-based structures that extract universal temporal features, and decoder-based structures that focus on autoregressive generation capabilities. The encoder-based approaches include MOMENT [184], which integrates multi-domain data from transportation and healthcare to establish a multi-task pretraining system supporting classification, forecasting, and anomaly detection. Following similar principles, Moirai [49] introduces attention mechanisms for arbitrary variates and achieves generalizable forecasting through masked pretraining on the LOTSA dataset, demonstrating effectiveness in out-of-distribution scenarios such as electricity load forecasting. In contrast, decoder-based architectures exhibit distinct design philosophies: TimesFM [185] combines patch techniques with decoder structures and frequency-specific tokenization to enable zero-shot generalization, while Chronos [22] discretizes continuous time series into token buckets and leverages synthetic data with cross-entropy training for few-shot forecasting. Recently, Timer [153] unifies diverse downstream tasks through decoder-only architectures and autoregressive generative training strategies.

Despite architectural variations, these models share the fundamental principles of simplicity and generality, avoiding over-specialized designs. Current foundation models focus primarily on constructing large-scale temporal datasets, developing cross-domain modeling techniques, and establishing unified frameworks for multivariate analysis. Notably, the data scale surpasses previous end-to-end methods by orders of magnitude (e.g., 27B time steps in LOTSA). Furthermore, synthetic time series data is emerging as a critical pretraining resource, with synthetic data constituting 20% of TimesFM's pretraining corpus. These developments signal a paradigm shift toward data-centric methodologies in both foundation model development and downstream forecasting tasks.

8 TRUSTWORTHY TIME SERIES FORECASTING
Recently, the demand for reliable and trustworthy time series forecasting has grown exponentially, driven by its wide applications across various domains, including finance, healthcare, and energy management. As these forecasting models become increasingly integrated into critical decision-making processes, ensuring their trustworthiness becomes paramount. Next, we primarily discuss research advancements in trustworthy time series forecasting, focusing on interpretability, robustness, and privacy preservation.

8.1 Interpretability
Trustworthy time series forecasting involves not only delivering accurate predictions but also addressing key concerns such as model interpretability, which enables users to understand and trust the reasoning behind predictions. To achieve interpretability in forecasting models, mainstream research primarily focuses on two approaches: causal discovery and physics-informed neural networks (PINNs).

Causal discovery enhances time series forecasting by uncovering the underlying cause-and-effect relationships between variables [249]. This approach provides deeper insight into how variables influence one another over time, improving both accuracy and interpretability, particularly in complex systems where predictions rely on the causal structure [250], [251]. Statistical models such as VAR utilize Granger causality to enhance predictions [186], while dynamic Bayesian networks capture temporal dependencies and adapt to changing causal structures [187]. Deep learning models also benefit from integrating causal inference, which enhances their interpretability and their ability to manage complex, non-stationary patterns [188].

On the other hand, integrating PINNs into time series forecasting has recently emerged as a promising direction for improving both prediction accuracy and interpretability [189], [190]. PINNs guide data-driven approaches using physical principles, such as conservation laws and differential equations, ensuring that predictions are consistent with both the data and physical realities, particularly in scenarios with limited or noisy data. By embedding physical knowledge directly into time series models, PINNs ensure that predictions align with physical laws [191], [192], [193]. This not only enhances the learning process but also improves the robustness and reliability of the models, making them better suited for practical applications where physical consistency is crucial.

8.2 Robustness
Despite significant advancements in recent forecasting models, they remain vulnerable to adversarial attacks, which raises important concerns about their trustworthy deployment in critical applications. Understanding the robustness

of these models against malicious attacks and developing effective defense mechanisms is crucial. To address this, Dang-Nhu et al. [194] introduce an effective adversarial attack generation method through Monte Carlo estimation of statistics from the joint distribution of the target series, applied to probabilistic autoregressive forecasting. Liu et al. [195] identify a new attack pattern involving strategic, sparse modifications, and propose defense strategies using randomized smoothing techniques and adversarial training. Besides, RDAT [196] employs a reinforcement-learning-based adversarial training approach with self-knowledge distillation regularization to enhance the adversarial robustness of spatiotemporal traffic forecasting models. Another notable work, BACKTIME [197], investigates backdoor attacks on time series forecasting tasks; its findings show that mainstream forecasting models can be significantly degraded by stealthy, sparse, and highly effective GNN-based triggers.

8.3 Privacy
With the increasing demand for time series foundation models, a promising approach is to train on vast amounts of time series data from multiple sources. However, this raises significant concerns regarding data privacy. To address this issue, several efforts have focused on integrating the federated learning paradigm, enabling the use of multi-domain data while preventing the exposure of sensitive information. CNFGNN [198] explicitly encodes the graph structure under the constraints of cross-node federated learning to effectively model complex dependencies in spatio-temporal forecasting, ensuring that data privacy is maintained while still capturing intricate relationships across nodes. MetePFL [199] introduces a federated prompt-learning mechanism, combined with a graph-based approach, to mitigate the impact of data heterogeneity while preserving data security. Moreover, Time-FFM [200] proposes a prompt adaptation module and a personalized federated training strategy that learns global encoders and local prediction heads, constructing a federated foundation model for time series forecasting by leveraging pretrained language models.

9 BENCHMARK DATASETS AND APPLICATIONS
In this section, we introduce prevalent benchmark datasets and representative time series forecasting methods applied in various domains.

9.1 Overview of Benchmark Datasets
Time series benchmark datasets are foundational to time series forecasting tasks. Table 6 summarizes 36 publicly available real-world datasets frequently used for evaluating time series forecasting models, listing each dataset's domain, number of channels, sampling frequency, total number of observations, and source. By utilizing these datasets, we can assess model performance across diverse scenarios, thereby evaluating generalization ability and ensuring the robustness and reliability of a model in real-world applications.

9.2 Representative Applications
In this part, we briefly introduce recent advancements in time series forecasters applied across various domains.

9.2.1 Healthcare
Healthcare is committed to maintaining and improving the health of individuals and groups through prevention, treatment, and rehabilitation. Time series forecasting plays a crucial role in disease prevention, treatment plan formulation, and rehabilitation progress assessment. For example, forecasting medical bookings improves appointment scheduling and enhances hospital scheduling [271]. In chronic disease management, such as Type 1 diabetes, models like LSTMs and TCNs predict blood glucose levels to prevent hypo- or hyperglycemic events [272]. Additionally, forecasting intraoperative hypotension (IOH) from real-time biosignals with deep learning increases prediction accuracy [1], [273], [274], [275]. These applications demonstrate the essential role of time series forecasting in healthcare resource allocation and disease management.

9.2.2 Manufacturing
Manufacturing is a key driver of global economic growth, innovation, and job creation. Time series forecasting is essential for optimizing production processes, predicting equipment failures, and enhancing quality control. By analyzing historical data, these models uncover patterns that help manufacturers make informed decisions, boost efficiency, and cut costs. For instance, in aero-engine manufacturing, time series forecasting with the multi-resolution transformer (MRT) model [3] improves component modeling, fault diagnosis, and performance prediction. Additionally, in semiconductor manufacturing, time series fault detection and diagnosis with the multiple time-series convolutional neural network (MTS-CNN) model [276] enhances equipment monitoring, fault classification, and cause identification. These applications are crucial for developing smart manufacturing systems, increasing productivity and operational reliability across the sector.

9.2.3 Finance and Economics
The financial sector has been transformed by deep learning techniques for time series forecasting, particularly in stock price prediction and economic indicator forecasting. Advanced transformer-based models like the Market-Guided Stock Transformer [2] accurately predict stock trends, enabling data-driven investment strategies and portfolio management. Additionally, LSTM networks forecast energy prices and commodity trends, aiding financial institutions and investors in resource allocation [277]. These deep learning applications highlight their transformative impact on modern finance and the advancement of intelligent, adaptive financial systems.

9.2.4 Environment
Advancements in deep learning for spatio-temporal forecasting are transforming environmental conservation efforts. Models like DeepSTF [278] combine CNNs and attention mechanisms for accurate, efficient multi-site forecasts, essential for urban management and disaster warnings. Corrformer [279] captures global spatio-temporal patterns with a multi-correlation mechanism, enabling forecasts for thousands of stations. These models enhance renewable energy optimization, environmental monitoring, and smart

TABLE 6: The statistics of evaluation datasets.


Domain Dataset Channel Frequency Observations Source
Energy BDG-2-Panther 105 1 hour 919,800 [252]
Energy SMART 5 1 hour 95,709 [253]
Energy Low Carbon London 713 1 hour 9,543,348 [254]
Energy KDD Cup 2022 134 10 minutes 4,727,519 [255]
Energy Solar 137 10 minutes 7,200,720 [127]
Energy ETTh 7 1 hour 403,200 [30]
Energy ETTm 7 1 hour 100,800 [30]
Energy Electricity 321 1 hour 8,443,584 [42]
Transportation Uber TLC Daily 262 1 day 47,087 [256]
Transportation PEMS03 358 5 minutes 9,382,464 [257]
Transportation Kaggle Web Traffic 4 1 day 145,000 [258]
Transportation Traffic 862 1 hour 15,122,928 [42]
Transportation Beijing Subway 276 30 minutes 248,400 [257]
Transportation Los-Loop 207 5 minutes 7,094,304 [257]
Transportation SHMetro 288 15 minutes 1,934,208 [257]
Transportation HZMetro 80 15 minutes 146,000 [257]
Transportation METR-LA 207 5 minutes 7,094,304 [259]
Environment SubseasonalClimateUSA 862 1 day 14,097,148 [260]
Environment Weather 21 10 minutes 1,106,616 [42]
Environment AQWan 11 1 hour 385,704 [261]
Environment AQShunyi 11 1 hour 385,704 [261]
Environment China Air Quality 437 1 hour 5,739,234 [262]
Environment Beijing Air Quality 12 1 hour 420,768 [263]
Economic M1 Monthly 1 1 month 44,892 [264]
Economic M3 Monthly 1 1 month 141,858 [265]
Economic M4 Monthly 1 1 month 709,522 [266]
Economic M5 1,947 1 day 58,327,370 [267]
Economic NN5 Daily 111 1 day 81,585 [268]
Economic Tourism Yearly 1,311 1 month 11,198 [269]
Economic Exchange 8 1 day 60,704 [42]
Economic NASDAQ 5 1 day 6,220 [261]
Economic NYSE 5 1 day 6,215 [261]
Healthcare CDC Fluview ILINet 75 1 week 63,903 [49]
Healthcare CDC Fluview WHO NREVSS 74 1 week 41,760 [49]
Healthcare Project Tycho 1,258 1 week 1,377,707 [270]
Healthcare ILI 7 1 week 6,762 [42]
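Models evaluated on the datasets in Table 6 are commonly compared with point-forecast error metrics such as MSE, MAE, and sMAPE computed over a held-out horizon. The following is a minimal, self-contained sketch of that computation; the synthetic hourly series and the seasonal-naive baseline here are illustrative assumptions of ours, not part of any benchmark above.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over the forecast horizon."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error over the forecast horizon."""
    return float(np.mean(np.abs(y_true - y_pred)))

def smape(y_true, y_pred):
    """Symmetric MAPE in percent; a common scale-free metric (e.g., in the M-competitions)."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(100.0 * np.mean(np.abs(y_true - y_pred) / denom))

def seasonal_naive_forecast(history, horizon, season):
    """Repeat the last observed seasonal cycle across the forecast horizon."""
    cycle = history[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(cycle, reps)[:horizon]

if __name__ == "__main__":
    # Toy hourly series with a daily (24-step) cycle plus noise.
    rng = np.random.default_rng(0)
    season, horizon = 24, 48
    t = np.arange(24 * 30 + horizon)
    series = 10 + 3 * np.sin(2 * np.pi * t / season) + 0.1 * rng.standard_normal(t.size)
    history, target = series[:-horizon], series[-horizon:]
    pred = seasonal_naive_forecast(history, horizon, season)
    print(f"MSE={mse(target, pred):.3f}  MAE={mae(target, pred):.3f}  sMAPE={smape(target, pred):.2f}%")
```

A seasonal-naive baseline of this kind is often a useful sanity check when working with these benchmarks: a learned model that cannot beat it on a given dataset is adding little value.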

city development by improving energy scheduling, smart grid management, and air quality tracking, supporting intelligent and sustainable environmental management.

9.2.5 Transportation
Time series forecasting is crucial in transportation for accurate traffic flow prediction and smart system support. These models capture complex spatiotemporal dependencies [100], [280] to optimize route planning, congestion management, and resource allocation. PDFormer [223] enhances prediction accuracy by incorporating propagation-delay awareness and modeling long-range dependencies. Besides, RGDAN [281] integrates graph diffusion attention and temporal attention modules to better capture spatial and temporal dependencies. Together, these models advance smart transportation systems by improving route planning, congestion control, and resource allocation.

10 PROSPECTS AND FUTURE DIRECTIONS
In this section, we discuss future directions for time series forecasting from various perspectives.

10.1 Time Series Foundation Models
Drawing inspiration from the great progress of foundation models in related research areas, including computer vision and natural language processing, constructing a foundation model for time series analysis has become a promising direction, and many recent efforts have been devoted to pioneering explorations. Despite their effectiveness, multiple unresolved issues remain worth exploring.

On the one hand, constructing a context-aware foundation model for more accurate general time series forecasting is the principal direction. As mentioned in the sections above, existing foundation-model approaches often mix massive amounts of time series data from diverse sources to construct pretraining datasets. Through pretraining, these methods aim to enhance the model's generalization ability and investigate potential emergent behavior within the time series field. However, these approaches often overlook the contextual characteristics of time series data, which can lead to suboptimal forecasting results. First, at the data-point level, different time series datasets often have varying sampling frequencies, noise levels, and, more critically, distinct and well-defined physical meanings across domains; fully accounting for these properties is essential to establish a unified semantic representation for the time series domain. Second, at the data-instance level, even within similar domains, time series can exhibit significant differences due to the individuality of the observed subjects; for example, vital-sign data from different patient groups or traffic flow data from different regions may display unique patterns. Finally, at the task-objective level, different application scenarios impose diverse requirements on the forecasting outcomes, and models must adapt to specific instructions to optimally generate predictions that meet the demands of

distinct tasks. Furthermore, to better model the contextual characteristics mentioned above, an urgent research challenge lies in effectively integrating multiple modalities, such as text, images, and graphs, into the forecasting process.

On the other hand, it remains inconclusive whether training a universal foundation model is the optimal solution for time series forecasting. This uncertainty, however, opens the door to several potentially valuable research directions. First, it is worth exploring the operational mechanisms of general foundation models in more depth, such as scaling laws and emergent phenomena. Additionally, investigating the impact of different data ratios on model performance, to guide the construction of pretraining datasets, is a promising avenue. Lastly, training domain-specific time series forecasting foundation models that fully integrate the inductive biases and prior knowledge of the respective domain for more precise predictions is also a direction worth exploring.

10.2 Trustworthy Time Series Forecasting
Most current mainstream time series forecasting methods rely on black-box neural network models, which, while offering high accuracy, are difficult to apply in many sensitive real-world settings such as healthcare analysis and decision-making. Therefore, a promising research direction is to leverage advanced techniques such as explainable AI and causal inference, in combination with existing powerful forecasting models, to construct interpretable and trustworthy forecasting systems that can facilitate practical, real-world applications.

Moreover, with the increasing demand for training time series forecasting models on multi-source, cross-domain data, ensuring data privacy and model security has become an important research topic. Specifically, multi-source, cross-domain training often relies on federated learning frameworks. This raises critical issues, such as how to protect the privacy of client-side time series data from leakage using techniques like differential privacy, and how to improve the robustness of models against data-poisoning attacks from malicious clients. These are pressing research challenges that need to be addressed.

10.3 Emerging Modeling Paradigms
Time series forecasting has always been an open research field, encouraging the validation of diverse techniques, including different neural network architectures, modeling perspectives, and model training paradigms. These methods, in turn, provide valuable insights that propel the further development of the entire field. Therefore, combining emerging techniques from the broader machine learning domain with the unique characteristics of time series data is a promising direction for future research. For example, integrating automated machine learning methods to adjust network architectures and balance computational cost against forecasting accuracy, and incorporating physical knowledge of time series data through physics-informed neural networks (PINNs), are potential areas for future investigation.

10.4 Comprehensive Benchmark Evaluation
In addition to model design, performance evaluation and the design of appropriate benchmarks are also important future research directions in time series forecasting. Existing benchmarks often lack a sufficiently broad range of data distributions and fail to clearly differentiate the difficulty levels of tasks. Moreover, the evaluation metrics used are typically too simplistic, making it difficult to assess the strengths and weaknesses of each model from multiple perspectives. This limitation may even encourage an arms-race-style search for each model's hyperparameters to improve test performance. Therefore, there is a need to develop more general evaluation datasets and more diverse evaluation metrics to ensure the healthy and well-rounded development of the entire research field.

11 CONCLUSION
In this survey, we have presented a holistic and structured examination of recent advancements in time series forecasting, encompassing foundational concepts, methodological evolutions, and critical challenges. We established a unified perspective that bridges classical statistical modeling and cutting-edge deep learning paradigms, highlighting how shifts in data availability, computational power, and algorithmic sophistication are reshaping the field. Through careful attention to key hurdles, ranging from handling non-stationarity and uncertainty quantification to managing high dimensionality and interpretability, we have illustrated the complexity and dynamism that define modern time series forecasting tasks. We further surveyed prominent benchmark datasets and evaluation metrics, underscoring the importance of robust and fair performance comparisons to drive meaningful progress. Crucially, we identified emerging trends, such as leveraging generative models, explainable AI techniques, and integrative frameworks that combine domain knowledge with data-driven insights. The work presented here not only consolidates the state of the art but also illuminates avenues for future innovation, offering researchers and practitioners a coherent reference point as they navigate the ever-evolving landscape of time series forecasting research.

REFERENCES

[1] F. Hatib, Z. Jian, S. Buddi, C. Lee, J. Settels, K. Sibert, J. Rinehart, and M. Cannesson, "Machine-learning algorithm to predict hypotension based on high-fidelity arterial pressure waveform analysis," Anesthesiology, vol. 129, no. 4, pp. 663–674, 2018.
[2] T. Li, Z. Liu, Y. Shen, X. Wang, H. Chen, and S. Huang, "Master: Market-guided stock transformer for stock price forecasting," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 1, 2024, pp. 162–170.
[3] H.-J. Jin, Y.-P. Zhao, and M.-N. Pan, "A novel method for aero-engine time-series forecasting based on multi-resolution transformer," Expert Systems with Applications, vol. 255, p. 124597, 2024.
[4] K. Bi, L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, "Accurate medium-range global weather forecasting with 3d neural networks," Nature, vol. 619, no. 7970, pp. 533–538, 2023.
[5] R. Lam, A. Sanchez-Gonzalez, M. Willson, P. Wirnsberger, M. Fortunato, F. Alet, S. Ravuri, T. Ewalds, Z. Eaton-Rosen, W. Hu et al., "Learning skillful medium-range global weather forecasting," Science, vol. 382, no. 6677, pp. 1416–1421, 2023.
[6] R. J. Hyndman and Y. Khandakar, "Automatic time series forecasting: the forecast package for R," Journal of Statistical Software, vol. 27, pp. 1–22, 2008.
[7] P. S. Kalekar et al., "Time series forecasting using holt-winters exponential smoothing," Kanwal Rekhi School of Information Technology, vol. 4329008, no. 13, pp. 1–13, 2004.
[8] B. Lim and S. Zohren, "Time-series forecasting with deep learning: a survey," Philosophical Transactions of the Royal Society A, vol. 379, no. 2194, p. 20200209, 2021.
[9] K. Benidis, S. S. Rangapuram, V. Flunkert, Y. Wang, D. Maddix, C. Turkmen, J. Gasthaus, M. Bohlke-Schneider, D. Salinas, L. Stella et al., "Deep learning for time series forecasting: Tutorial and literature survey," ACM Computing Surveys, vol. 55, no. 6, pp. 1–36, 2022.
[10] J. Wang, W. Du, W. Cao, K. Zhang, W. Wang, Y. Liang, and Q. Wen, "Deep learning for multivariate time series imputation: A survey," arXiv preprint arXiv:2402.04059, 2024.
[11] S. Zhou, D. Zha, X. Shen, X. Huang, R. Zhang, and F.-L. Chung, "Denoising-aware contrastive learning for noisy time series," arXiv preprint arXiv:2406.04627, 2024.
[12] T. Kim, J. Kim, Y. Tae, C. Park, J.-H. Choi, and J. Choo, "Reversible instance normalization for accurate time-series forecasting against distribution shift," in International Conference on Learning Representations, 2022.
[13] Q. Wen, T. Zhou, C. Zhang, W. Chen, Z. Ma, J. Yan, and L. Sun, "Transformers in time series: a survey," in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, pp. 6778–6786.
[14] M. Jin, H. Y. Koh, Q. Wen, D. Zambon, C. Alippi, G. I. Webb, I. King, and S. Pan, "A survey on graph neural networks for time series: Forecasting, classification, imputation, and anomaly detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[15] K. Zhang, Q. Wen, C. Zhang, R. Cai, M. Jin, Y. Liu, J. Y. Zhang, Y. Liang, G. Pang, D. Song et al., "Self-supervised learning for time series analysis: Taxonomy, progress, and prospects," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[16] Y. Wang, H. Wu, J. Dong, Y. Liu, M. Long, and J. Wang, "Deep time series models: A comprehensive survey and benchmark," arXiv preprint arXiv:2407.13278, 2024.
[17] J. Su, C. Jiang, X. Jin, Y. Qiao, T. Xiao, H. Ma, R. Wei, Z. Jing, J. Xu, and J. Lin, "Large language models for forecasting and anomaly detection: A systematic literature review," arXiv preprint arXiv:2402.10350, 2024.
[18] Y. Liang, H. Wen, Y. Nie, Y. Jiang, M. Jin, D. Song, S. Pan, and Q. Wen, "Foundation models for time series analysis: A tutorial and survey," in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 6555–6565.
[19] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski, "DeepAR: Probabilistic forecasting with autoregressive recurrent networks," International Journal of Forecasting, vol. 36, no. 3, pp. 1181–1191, 2020.
[20] A. Desai, C. Freeman, Z. Wang, and I. Beaver, "TimeVAE: A variational auto-encoder for multivariate time series generation," arXiv preprint arXiv:2111.08095, 2021.
[21] K. Rasul, A.-S. Sheikh, I. Schuster, U. M. Bergmann, and R. Vollgraf, "Multivariate probabilistic time series forecasting via conditioned normalizing flows," in International Conference on Learning Representations, 2021.
[22] A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. P. Arango, S. Kapoor et al., "Chronos: Learning the language of time series," arXiv preprint arXiv:2403.07815, 2024.
[23] W. Cao, D. Wang, J. Li, H. Zhou, L. Li, and Y. Li, "BRITS: Bidirectional recurrent imputation for time series," Advances in Neural Information Processing Systems, vol. 31, 2018.
[24] A. Blázquez-García, A. Conde, U. Mori, and J. A. Lozano, "A review on outlier/anomaly detection in time series data," ACM Computing Surveys (CSUR), vol. 54, no. 3, pp. 1–33, 2021.
[25] M. Schirmer, M. Eltayeb, S. Lessmann, and M. Rudolph, "Modeling irregular time series with continuous recurrent units," in International Conference on Machine Learning. PMLR, 2022, pp. 19388–19405.
[26] Y. Chen, K. Ren, Y. Wang, Y. Fang, W. Sun, and D. Li, "ContiFormer: Continuous-time transformer for irregular time series modeling," Advances in Neural Information Processing Systems, vol. 36, 2024.
[27] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015.
[28] A. K. Singh, S. K. Ibraheem, M. Muazzam, and D. Chaturvedi, "An overview of electricity demand forecasting techniques," Network and Complex Systems, vol. 3, no. 3, pp. 38–48, 2013.
[29] S. Kaushik, A. Choudhury, P. K. Sheron, N. Dasgupta, S. Natarajan, L. A. Pickett, and V. Dutt, "AI in healthcare: time-series forecasting using statistical, neural, and ensemble architectures," Frontiers in Big Data, vol. 3, p. 4, 2020.
[30] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, "Informer: Beyond efficient transformer for long sequence time-series forecasting," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 11106–11115.
[31] L. Zhao and Y. Shen, "Rethinking channel dependence for multivariate time series forecasting: Learning from leading indicators," in The Twelfth International Conference on Learning Representations, 2024.
[32] Y. Zhang and J. Yan, "Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting," in The Eleventh International Conference on Learning Representations, 2023.
[33] A. Zeng, M. Chen, L. Zhang, and Q. Xu, "Are transformers effective for time series forecasting?" in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, 2023, pp. 11121–11128.
[34] L. Han, H.-J. Ye, and D.-C. Zhan, "The capacity and robustness trade-off: Revisiting the channel independent strategy for multivariate time series forecasting," IEEE Transactions on Knowledge and Data Engineering, 2024.
[35] K. G. Olivares, C. Challu, G. Marcjasz, R. Weron, and A. Dubrawski, "Neural basis expansion analysis with exogenous variables: Forecasting electricity prices with NBEATSx," International Journal of Forecasting, vol. 39, no. 2, pp. 884–900, 2023.
[36] A. Das, W. Kong, A. Leach, S. K. Mathur, R. Sen, and R. Yu, "Long-term forecasting with TiDE: Time-series dense encoder," Transactions on Machine Learning Research, 2023.
[37] Y. Wang, H. Wu, J. Dong, G. Qin, H. Zhang, Y. Liu, Y. Qiu, J. Wang, and M. Long, "TimeXer: Empowering transformers for time series forecasting with exogenous variables," arXiv preprint arXiv:2402.19072, 2024.
[38] Y. Du, J. Wang, W. Feng, S. Pan, T. Qin, R. Xu, and C. Wang, "AdaRNN: Adaptive learning and forecasting of time series," in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 402–411.
[39] Z. Liu, M. Cheng, Z. Li, Z. Huang, Q. Liu, Y. Xie, and E. Chen, "Adaptive normalization for non-stationary time series forecasting: A temporal slice perspective," Advances in Neural Information Processing Systems, vol. 36, 2024.
[40] A. C. Harvey and S. Peters, "Estimation procedures for structural time series models," Journal of Forecasting, vol. 9, no. 2, pp. 89–108, 1990.
[41] R. B. Cleveland, W. S. Cleveland, J. E. McRae, I. Terpenning et al., "STL: A seasonal-trend decomposition," Journal of Official Statistics, vol. 6, no. 1, pp. 3–73, 1990.
[42] H. Wu, J. Xu, J. Wang, and M. Long, "Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting," Advances in Neural Information Processing Systems, vol. 34, pp. 22419–22430, 2021.
[43] M. C. Mozer, "Induction of multiscale temporal structure," Advances in Neural Information Processing Systems, vol. 4, 1991.
[44] C. Challu, K. G. Olivares, B. N. Oreshkin, F. G. Ramirez, M. M. Canseco, and A. Dubrawski, "NHITS: Neural hierarchical interpolation for time series forecasting," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 6, 2023, pp. 6989–6997.
[45] M. A. Shabani, A. H. Abdi, L. Meng, and T. Sylvain, "Scaleformer: Iterative multi-scale refining transformers for time series forecasting," in The Eleventh International Conference on Learning Representations, 2023.
[46] M. Hou, C. Xu, Y. Liu, W. Liu, J. Bian, L. Wu, Z. Li, E. Chen, and T.-Y. Liu, "Stock trend prediction with multi-granularity data: A contrastive learning approach with adaptive fusion," in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 700–709.
[63] … ordinate computing and IEWT reconstruction," Energy Conversion and Management, vol. 167, pp. 203–219, 2018.
[64] J. Xin, C. Zhou, Y. Jiang, Q. Tang, X. Yang, and J. Zhou, "A signal recovery method for bridge monitoring system using TVFEMD and encoder-decoder aided LSTM," Measurement, vol. 214, p. 112797, 2023.
[65] P. Bonizzi, J. M. Karel, O. Meste, and R. L. Peeters, "Singular spectrum decomposition: A new method for time series decomposition," Advances in Adaptive Data Analysis, vol. 6, no. 04, p. 1450011, 2014.
[66] L. Karthikeyan and D. N. Kumar, "Predictability of nonstationary time series using wavelet and EMD based ARMA models," Journal of Hydrology, vol. 502, pp. 103–119, 2013.
[67] N. A. Agana and A. Homaifar, "EMD-based predictive deep belief network for time series prediction: An application to drought forecasting," Hydrology, vol. 5, no. 1, p. 18, 2018.
[68] W.-c. Wang, K.-w. Chau, D.-m. Xu, and X.-Y. Chen, "Improving forecasting accuracy of annual runoff time series using arima
[47] B. Rim, N.-J. Sung, S. Min, and M. Hong, “Deep learning in based on eemd decomposition,” Water Resources Management,
physiological signal data: A survey,” Sensors, vol. 20, no. 4, p. vol. 29, pp. 2655–2675, 2015.
969, 2020. [69] M. Theodosiou, “Forecasting monthly and quarterly time se-
[48] Z. Liu, J. Yang, M. Cheng, Y. Luo, and Z. Li, “Generative pre- ries using stl decomposition,” International Journal of Forecasting,
trained hierarchical transformer for time series forecasting,” in vol. 27, no. 4, pp. 1178–1195, 2011.
Proceedings of the 30th ACM SIGKDD Conference on Knowledge [70] J. Nasir, M. Aamir, Z. U. Haq, S. Khan, M. Y. Amin, and
Discovery and Data Mining, 2024, pp. 2003–2013. M. Naeem, “A new approach for forecasting crude oil prices
[49] G. Woo, C. Liu, A. Kumar, C. Xiong, S. Savarese, and D. Sahoo, based on stochastic and deterministic influences of lmd using
“Unified training of universal time series forecasting transform- arima and lstm models,” IEEE Access, vol. 11, pp. 14 322–14 339,
ers,” in Forty-first International Conference on Machine Learning, 2023.
2024.
[71] M. Wang, J. Yang, B. Yang, H. Li, T. Gong, B. Yang, and
[50] J. Shi, Q. Ma, H. Ma, and L. Li, “Scaling law for time series J. Cui, “Towards lightweight time series forecasting: a patch-
forecasting,” arXiv preprint arXiv:2405.15124, 2024. wise transformer with weak data enriching,” arXiv preprint
[51] P. Bansal, P. Deshpande, and S. Sarawagi, “Missing value impu- arXiv:2501.10448, 2025.
tation on multidimensional time series,” Proceedings of the VLDB
[72] C. Ying and J. Lu, “Tfeformer: Temporal feature enhanced trans-
Endowment, vol. 14, no. 11, pp. 2533–2545, 2021.
former for multivariate time series forecasting,” IEEE Access,
[52] Y. Luo, X. Cai, Y. Zhang, J. Xu et al., “Multivariate time series 2024.
imputation with generative adversarial networks,” Advances in
[73] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam, “A time
neural information processing systems, vol. 31, 2018.
series is worth 64 words: Long-term forecasting with transform-
[53] M. Biloš, K. Rasul, A. Schneider, Y. Nevmyvaka, and
ers,” in The Eleventh International Conference on Learning Represen-
S. Günnemann, “Modeling temporal data as continuous func-
tations, 2023.
tions with stochastic process diffusion,” in International Conference
on Machine Learning. PMLR, 2023, pp. 2452–2470. [74] P. Chen, Y. Zhang, Y. Cheng, Y. Shu, Y. Wang, Q. Wen,
B. Yang, and C. Guo, “Pathformer: Multi-scale transformers with
[54] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng,
adaptive pathways for time series forecasting,” arXiv preprint
N.-C. Yen, C. C. Tung, and H. H. Liu, “The empirical mode
arXiv:2402.05956, 2024.
decomposition and the hilbert spectrum for nonlinear and non-
stationary time series analysis,” Proceedings of the Royal Society [75] D. Luo and X. Wang, “Moderntcn: A modern pure convolution
of London. Series A: mathematical, physical and engineering sciences, structure for general time series analysis,” in The Twelfth Interna-
vol. 454, no. 1971, pp. 903–995, 1998. tional Conference on Learning Representations, 2024.
[55] P. Chaovalit, A. Gangopadhyay, G. Karabatis, and Z. Chen, “Dis- [76] Z. Gong, Y. Tang, and J. Liang, “Patchmixer: A patch-mixing
crete wavelet transform-based time series analysis and mining,” architecture for long-term time series forecasting,” arXiv preprint
ACM Computing Surveys (CSUR), vol. 43, no. 2, pp. 1–37, 2011. arXiv:2310.00655, 2023.
[56] T. Yoon, Y. Park, E. K. Ryu, and Y. Wang, “Robust probabilistic [77] V. Ekambaram, A. Jati, N. Nguyen, P. Sinthong, and
time series forecasting,” in International Conference on Artificial J. Kalagnanam, “Tsmixer: Lightweight mlp-mixer model for mul-
Intelligence and Statistics. PMLR, 2022, pp. 1336–1358. tivariate time series forecasting,” in Proceedings of the 29th ACM
[57] N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis, SIGKDD Conference on Knowledge Discovery and Data Mining, 2023,
“Deep adaptive input normalization for time series forecasting,” pp. 459–469.
IEEE transactions on neural networks and learning systems, vol. 31, [78] M. Cheng, J. Yang, T. Pan, Q. Liu, and Z. Li, “Convtimenet: A
no. 9, pp. 3760–3765, 2019. deep hierarchical fully convolutional model for multivariate time
[58] W. Fan, P. Wang, D. Wang, D. Wang, Y. Zhou, and Y. Fu, “Dish- series analysis,” arXiv preprint arXiv:2403.01493, 2024.
ts: a general paradigm for alleviating distribution shift in time [79] Q. Huang, L. Shen, R. Zhang, J. Cheng, S. Ding, Z. Zhou, and
series forecasting,” in Proceedings of the AAAI conference on artificial Y. Wang, “Hdmixer: Hierarchical dependency with extendable
intelligence, vol. 37, no. 6, 2023, pp. 7522–7529. patch for multivariate time series forecasting,” in Proceedings of
[59] L. Han, H.-J. Ye, and D.-C. Zhan, “SIN: Selective and interpretable the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, 2024,
normalization for long-term time series forecasting,” in Forty-first pp. 12 608–12 616.
International Conference on Machine Learning, 2024. [80] S. Zhong, S. Song, W. Zhuo, G. Li, Y. Liu, and S.-H. G. Chan, “A
[60] S. Lahmiri, “A variational mode decompoisition approach for multi-scale decomposition mlp-mixer for time series analysis,”
analysis and forecasting of economic and financial time series,” arXiv preprint arXiv:2310.11959, 2023.
Expert Systems with Applications, vol. 55, pp. 268–273, 2016. [81] Y. Zhang, L. Ma, S. Pal, Y. Zhang, and M. Coates, “Multi-
[61] E. Ghanbari and A. Avar, “Short-term wind power forecasting resolution time-series transformer for long-term forecasting,”
using the hybrid model of multivariate variational mode de- in International Conference on Artificial Intelligence and Statistics.
composition (mvmd) and long short-term memory (lstm) neural PMLR, 2024, pp. 4222–4230.
networks,” Electrical Engineering, pp. 1–31, 2024. [82] K. Rasul, A. Ashok, A. R. Williams, A. Khorasani, G. Adamopou-
[62] Y. Wang, S. Sun, X. Chen, X. Zeng, Y. Kong, J. Chen, Y. Guo, and los, R. Bhagwatkar, M. Biloš, H. Ghonia, N. Hassen, A. Schneider
T. Wang, “Short-term load forecasting of industrial customers et al., “Lag-llama: Towards foundation models for time series
based on svmd and xgboost,” International Journal of Electrical forecasting,” in R0-FoMo: Robustness of Few-shot and Zero-shot
Power & Energy Systems, vol. 129, p. 106830, 2021. Learning in Large Foundation Models, 2023.
[63] Y. Li, H. Wu, and H. Liu, “Multi-step wind speed forecasting [83] J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin,
using ewt decomposition, lstm principal computing, relm sub- Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou, “Show-o: One single
transformer to unify multimodal understanding and generation,” arXiv preprint arXiv:2408.12528, 2024.
[84] P. Anastassiou, J. Chen, J. Chen, Y. Chen, Z. Chen, Z. Chen, J. Cong, L. Deng, C. Ding, L. Gao et al., “Seed-tts: A family of high-quality versatile speech generation models,” arXiv preprint arXiv:2406.02430, 2024.
[85] P. Schäfer and M. Högqvist, “Sfa: a symbolic fourier approximation and index for similarity search in high dimensional datasets,” in Proceedings of the 15th international conference on extending database technology, 2012, pp. 516–527.
[86] M. Cheng, X. Tao, Q. Liu, H. Zhang, Y. Chen, and C. Lei, “Learning transferable time series classifier with cross-domain pre-training from language model,” arXiv preprint arXiv:2403.12372, 2024.
[87] A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017.
[88] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu et al., “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, vol. 12, 2016.
[89] M. Łajszczak, G. Cámbara, Y. Li, F. Beyhan, A. van Korlaar, F. Yang, A. Joly, Á. Martín-Cortinas, A. Abbas, A. Michalski et al., “Base tts: Lessons from building a billion-parameter text-to-speech model on 100k hours of data,” arXiv preprint arXiv:2402.08093, 2024.
[90] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.
[91] M. Cheng, Y. Chen, Q. Liu, Z. Liu, Y. Luo, and E. Chen, “Instructime: Advancing time series classification with multimodal language modeling,” in Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, 2025, pp. 792–800.
[92] S. Winograd, “On computing the discrete fourier transform,” Mathematics of computation, vol. 32, no. 141, pp. 175–199, 1978.
[93] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE transactions on Computers, vol. 100, no. 1, pp. 90–93, 1974.
[94] M. J. Shensa et al., “The discrete wavelet transform: wedding the a trous and mallat algorithms,” IEEE Transactions on signal processing, vol. 40, no. 10, pp. 2464–2482, 1992.
[95] A. V. Oppenheim, Discrete-time signal processing. Pearson Education India, 1999.
[96] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long, “Timesnet: Temporal 2d-variation modeling for general time series analysis,” in The Eleventh International Conference on Learning Representations, 2023.
[97] X. Zhu, D. Shen, H. Wang, and Y. Hao, “Fcnet: Fully complex network for time series forecasting,” IEEE Internet of Things Journal, 2024.
[98] P. Liu, B. Wu, N. Li, T. Dai, F. Lei, J. Bao, Y. Jiang, and S.-T. Xia, “Wftnet: Exploiting global and local periodicity in long-term time series forecasting,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 5960–5964.
[99] A. Ma, D. Luo, and M. Sha, “Mmfnet: Multi-scale frequency masking neural network for multivariate time series forecasting,” arXiv preprint arXiv:2410.02070, 2024.
[100] B. Yu, H. Yin, and Z. Zhu, “Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 3634–3640.
[101] X. Zhang, R. Cao, Z. Zhang, and Y. Xia, “Crowd flow forecasting with multi-graph neural networks,” in 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020, pp. 1–7.
[102] M. Li and Z. Zhu, “Spatial-temporal fusion graph neural networks for traffic flow forecasting,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 5, 2021, pp. 4189–4196.
[103] Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, and C. Zhang, “Connecting the dots: Multivariate time series forecasting with graph neural networks,” in Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, 2020, pp. 753–763.
[104] L. Bai, L. Yao, C. Li, X. Wang, and C. Wang, “Adaptive graph convolutional recurrent network for traffic forecasting,” Advances in neural information processing systems, vol. 33, pp. 17804–17815, 2020.
[105] C. Shang, J. Chen, and J. Bi, “Discrete graph structure learning for forecasting multiple time series,” in International Conference on Learning Representations, 2021.
[106] F. Johnston, J. E. Boyland, M. Meadows, and E. Shale, “Some properties of a simple moving average when applied to forecasting a time series,” Journal of the Operational Research Society, vol. 50, no. 12, pp. 1267–1271, 1999.
[107] C. C. Holt, “Forecasting seasonals and trends by exponentially weighted moving averages,” International journal of forecasting, vol. 20, no. 1, pp. 5–10, 2004.
[108] E. S. Gardner Jr, “Exponential smoothing: The state of the art,” Journal of forecasting, vol. 4, no. 1, pp. 1–28, 1985.
[109] E. Ostertagova and O. Ostertag, “Forecasting using simple exponential smoothing method,” Acta Electrotechnica et Informatica, vol. 12, no. 3, p. 62, 2012.
[110] C. Chatfield, A. B. Koehler, J. K. Ord, and R. D. Snyder, “A new look at models for exponential smoothing,” Journal of the Royal Statistical Society: Series D (The Statistician), vol. 50, no. 2, pp. 147–159, 2001.
[111] A. C. Harvey, “Forecasting, structural time series models and the kalman filter,” 1990.
[112] Y. Zhang, “Prediction of financial time series with hidden markov models,” 2004.
[113] T. Lux, “The markov-switching multifractal model of asset returns: Gmm estimation and linear forecasting of volatility,” Journal of business & economic statistics, vol. 26, no. 2, pp. 194–210, 2008.
[114] N. I. Sapankevych and R. Sankar, “Time series prediction using support vector machines: a survey,” IEEE computational intelligence magazine, vol. 4, no. 2, pp. 24–38, 2009.
[115] L.-J. Cao and F. E. H. Tay, “Support vector machine with adaptive parameters in financial time series forecasting,” IEEE Transactions on neural networks, vol. 14, no. 6, pp. 1506–1518, 2003.
[116] W. Lu, W. Wang, A. Y. Leung, S.-M. Lo, R. K. Yuen, Z. Xu, and H. Fan, “Air pollutant parameter forecasting using support vector machines,” in Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No. 02CH37290), vol. 1. IEEE, 2002, pp. 630–635.
[117] P.-F. Pai, K.-P. Lin, C.-S. Lin, and P.-T. Chang, “Time series forecasting by a seasonal support vector regression model,” Expert Systems with Applications, vol. 37, no. 6, pp. 4261–4265, 2010.
[118] L. Zhang, W.-D. Zhou, P.-C. Chang, J.-W. Yang, and F.-Z. Li, “Iterated time series prediction with multiple support vector regression models,” Neurocomputing, vol. 99, pp. 411–422, 2013.
[119] T. Januschowski, Y. Wang, K. Torkkola, T. Erkkilä, H. Hasson, and J. Gasthaus, “Forecasting with trees,” International Journal of Forecasting, vol. 38, no. 4, pp. 1473–1481, 2022.
[120] B. Wang, P. Wu, Q. Chen, and S. Ni, “Prediction and analysis of train passenger load factor of high-speed railway based on lightgbm algorithm,” Journal of Advanced Transportation, vol. 2021, no. 1, p. 9963394, 2021.
[121] H. Wu, Y. Cai, Y. Wu, R. Zhong, Q. Li, J. Zheng, D. Lin, and Y. Li, “Time series analysis of weekly influenza-like illness rate using a one-year period of factors in random forest regression,” Bioscience trends, vol. 11, no. 3, pp. 292–296, 2017.
[122] V. Mayrink and H. S. Hippert, “A hybrid method using exponential smoothing and gradient boosting for electrical short-term load forecasting,” in 2016 IEEE Latin American Conference on Computational Intelligence (LA-CCI). IEEE, 2016, pp. 1–6.
[123] F. Martínez, M. P. Frías, M. D. Pérez, and A. J. Rivera, “A methodology for applying k-nearest neighbor to time series forecasting,” Artificial Intelligence Review, vol. 52, no. 3, pp. 2019–2037, 2019.
[124] F. Martínez, M. P. Frías, M. D. Pérez-Godoy, and A. J. Rivera, “Dealing with seasonality by narrowing the training set in time series forecasting with knn,” Expert systems with applications, vol. 103, pp. 38–48, 2018.
[125] B. Rajagopalan and U. Lall, “A k-nearest-neighbor simulator for daily precipitation and other weather variables,” Water resources research, vol. 35, no. 10, pp. 3089–3101, 1999.
[126] F. H. Al-Qahtani and S. F. Crone, “Multivariate k-nearest neighbour regression for time series data—a novel algorithm for forecasting uk electricity demand,” in The 2013 international joint conference on neural networks (IJCNN). IEEE, 2013, pp. 1–8.
[127] G. Lai, W.-C. Chang, Y. Yang, and H. Liu, “Modeling long- and short-term temporal patterns with deep neural networks,” in The
41st international ACM SIGIR conference on research & development in information retrieval, 2018, pp. 95–104.
[128] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. Cottrell, “A dual-stage attention-based recurrent neural network for time series prediction,” arXiv preprint arXiv:1704.02971, 2017.
[129] C. Fan, Y. Zhang, Y. Pan, X. Li, C. Zhang, R. Yuan, D. Wu, W. Wang, J. Pei, and H. Huang, “Multi-horizon time series forecasting with temporal attention learning,” in Proceedings of the 25th ACM SIGKDD International conference on knowledge discovery & data mining, 2019, pp. 2527–2535.
[130] Y. Wang, A. Smola, D. Maddix, J. Gasthaus, D. Foster, and T. Januschowski, “Deep factors for forecasting,” in International conference on machine learning. PMLR, 2019, pp. 6607–6617.
[131] S. Smyl, “A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting,” International journal of forecasting, vol. 36, no. 1, pp. 75–85, 2020.
[132] S. Lin, W. Lin, W. Wu, F. Zhao, R. Mo, and H. Zhang, “Segrnn: Segment recurrent neural network for long-term time series forecasting,” arXiv preprint arXiv:2308.11200, 2023.
[133] J. Cheng, K. Huang, and Z. Zheng, “Towards better forecasting by fusing near and distant future visions,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 3593–3600.
[134] H. Wang, J. Peng, F. Huang, J. Wang, J. Chen, and Y. Xiao, “Micn: Multi-scale local and global context modeling for long-term series forecasting,” in The eleventh international conference on learning representations, 2023.
[135] M. Liu, A. Zeng, M. Chen, Z. Xu, Q. Lai, L. Ma, and Q. Xu, “Scinet: Time series modeling and forecasting with sample convolution and interaction,” Advances in Neural Information Processing Systems, vol. 35, pp. 5816–5828, 2022.
[136] Z. Li, Y. Qin, X. Cheng, and Y. Tan, “Ftmixer: Frequency and time domain representations fusion for time series modeling,” arXiv preprint arXiv:2405.15256, 2024.
[137] Z. Yue, Y. Wang, J. Duan, T. Yang, C. Huang, Y. Tong, and B. Xu, “Ts2vec: Towards universal representation of time series,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, pp. 8980–8987.
[138] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, and X. Yan, “Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting,” Advances in neural information processing systems, vol. 32, 2019.
[139] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin, “Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting,” in International conference on machine learning. PMLR, 2022, pp. 27268–27286.
[140] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long, “itransformer: Inverted transformers are effective for time series forecasting,” in The Twelfth International Conference on Learning Representations, 2024.
[141] B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio, “N-beats: Neural basis expansion analysis for interpretable time series forecasting,” in International Conference on Learning Representations, 2020.
[142] Y. Liu, C. Li, J. Wang, and M. Long, “Koopa: Learning non-stationary time series dynamics with koopman predictors,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[143] Z. Xu, A. Zeng, and Q. Xu, “FITS: Modeling time series with 10k parameters,” in The Twelfth International Conference on Learning Representations, 2024.
[144] S. Lin, W. Lin, W. Wu, H. Chen, and J. Yang, “SparseTSF: Modeling long-term time series forecasting with 1k parameters,” in Forty-first International Conference on Machine Learning, 2024.
[145] S. Wang, H. Wu, X. Shi, T. Hu, H. Luo, L. Ma, J. Y. Zhang, and J. Zhou, “Timemixer: Decomposable multiscale mixing for time series forecasting,” in The Twelfth International Conference on Learning Representations, 2024.
[146] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski, “Deep state space models for time series forecasting,” Advances in neural information processing systems, vol. 31, 2018.
[147] M. Zhang, K. K. Saab, M. Poli, T. Dao, K. Goel, and C. Re, “Effectively modeling time series with simple discrete state spaces,” in The Eleventh International Conference on Learning Representations, 2023.
[148] M. A. Ahamed and Q. Cheng, “Timemachine: A time series is worth 4 mambas for long-term forecasting,” arXiv preprint arXiv:2403.09898, 2024.
[149] Z. Wang, F. Kong, S. Feng, M. Wang, X. Yang, H. Zhao, D. Wang, and Y. Zhang, “Is mamba effective for time series forecasting?” arXiv preprint arXiv:2403.11144, 2024.
[150] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark, “Kan: Kolmogorov-arnold networks,” arXiv preprint arXiv:2404.19756, 2024.
[151] C. J. Vaca-Rubio, L. Blanco, R. Pereira, and M. Caus, “Kolmogorov-arnold networks (kans) for time series analysis,” arXiv preprint arXiv:2405.08790, 2024.
[152] R. Genet and H. Inzirillo, “A temporal kolmogorov-arnold transformer for time series forecasting,” arXiv preprint arXiv:2406.02486, 2024.
[153] Y. Liu, H. Zhang, C. Li, X. Huang, J. Wang, and M. Long, “Timer: generative pre-trained transformers are large time series models,” in Proceedings of the 41st International Conference on Machine Learning, 2024, pp. 32369–32399.
[154] Y. Liu, G. Qin, X. Huang, J. Wang, and M. Long, “Autotimes: Autoregressive time series forecasters via large language models,” Advances in Neural Information Processing Systems, vol. 37, pp. 122154–122184, 2025.
[155] B. Cai, S. Yang, L. Gao, and Y. Xiang, “Hybrid variational autoencoder for time series forecasting,” Knowledge-Based Systems, vol. 281, p. 111079, 2023.
[156] J. Yoon, D. Jarrett, and M. Van der Schaar, “Time-series generative adversarial networks,” Advances in neural information processing systems, vol. 32, 2019.
[157] N. Kim, D.-K. Chae, J. A. Shin, S.-W. Kim, D. H. Chau, and S. Park, “Context-aware traffic flow forecasting in new roads,” in Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 4133–4137.
[158] Y. Zhang, Y. Li, X. Zhou, X. Kong, and J. Luo, “Curb-gan: Conditional urban traffic estimation through spatio-temporal generative adversarial networks,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 842–852.
[159] E. Tamir, N. Laabid, M. Heinonen, V. Garg, and A. Solin, “Conditional flow matching for time series modelling,” ICML 2024 Workshop on Structured Probabilistic Inference and Generative Modeling, 2023.
[160] X. N. Zhang, Y. Pu, Y. Kawamura, A. Loza, Y. Bengio, D. Shung, and A. Tong, “Trajectory flow matching with applications to clinical time series modelling,” Advances in Neural Information Processing Systems, vol. 37, pp. 107198–107224, 2025.
[161] Y. Hu, X. Wang, L. Wu, H. Zhang, S. Z. Li, S. Wang, and T. Chen, “Fm-ts: Flow matching for time series generation,” arXiv preprint arXiv:2411.07506, 2024.
[162] K. Rasul, C. Seward, I. Schuster, and R. Vollgraf, “Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting,” in International Conference on Machine Learning. PMLR, 2021, pp. 8857–8868.
[163] M. Kollovieh, A. F. Ansari, M. Bohlke-Schneider, J. Zschiegner, H. Wang, and Y. B. Wang, “Predict, refine, synthesize: Self-guiding diffusion models for probabilistic time series forecasting,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[164] T. Yan, H. Zhang, T. Zhou, Y. Zhan, and Y. Xia, “Scoregrad: Multivariate probabilistic time series forecasting with continuous energy-based generative models,” arXiv preprint arXiv:2106.10121, 2021.
[165] L. Shen and J. T. Kwok, “Non-autoregressive conditional diffusion models for time series prediction,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023, pp. 31016–31029.
[166] J. Zhang, M. Cheng, X. Tao, Z. Liu, and D. Wang, “Fdf: Flexible decoupled framework for time series forecasting with conditional denoising and polynomial modeling,” arXiv preprint arXiv:2410.13253, 2024.
[167] S. Tonekaboni, D. Eytan, and A. Goldenberg, “Unsupervised representation learning for time series with temporal neighborhood coding,” in International Conference on Learning Representations, 2021.
[168] E. Eldele, M. Ragab, Z. Chen, M. Wu, C. K. Kwoh, X. Li, and C. Guan, “Time-series representation learning via temporal and contextual contrasting,” arXiv preprint arXiv:2106.14112, 2021.
[169] X. Zhang, Z. Zhao, T. Tsiligkaridis, and M. Zitnik, “Self-supervised contrastive pre-training for time series via time-frequency consistency,” Advances in Neural Information Processing Systems, vol. 35, pp. 3988–4003, 2022.
[170] G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and C. Eickhoff, “A transformer-based framework for multivariate time series representation learning,” in Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, 2021, pp. 2114–2124.
[171] M. Cheng, Q. Liu, Z. Liu, H. Zhang, R. Zhang, and E. Chen, “Timemae: Self-supervised representations of time series with decoupled masked autoencoders,” arXiv preprint arXiv:2303.00320, 2023.
[172] J. Dong, H. Wu, H. Zhang, L. Zhang, J. Wang, and M. Long, “Simmtm: A simple pre-training framework for masked time-series modeling,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[173] D. Wang, M. Cheng, Z. Liu, Q. Liu, and E. Chen, “Timedart: A diffusion autoregressive transformer for self-supervised time series representation,” arXiv preprint arXiv:2410.05711, 2024.
[174] X. Jin, Y. Park, D. Maddix, H. Wang, and Y. Wang, “Domain adaptation for time series forecasting via attention sharing,” in International Conference on Machine Learning. PMLR, 2022, pp. 10280–10297.
[175] B. Wang, J. Ma, P. Wang, X. Wang, Y. Zhang, Z. Zhou, and Y. Wang, “Stone: A spatio-temporal ood learning framework kills both spatial and temporal shifts,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 2948–2959.
[176] H. Liu, H. Kamarthi, L. Kong, Z. Zhao, C. Zhang, and B. A. Prakash, “Time-series forecasting for out-of-distribution generalization using invariant learning,” in Forty-first International Conference on Machine Learning, 2024.
[177] T. Zhou, P. Niu, L. Sun, R. Jin et al., “One fits all: Power general time series analysis by pretrained lm,” Advances in neural information processing systems, vol. 36, pp. 43322–43355, 2023.
[178] X. Liu, J. Hu, Y. Li, S. Diao, Y. Liang, B. Hooi, and R. Zimmermann, “Unitime: A language-empowered unified model for cross-domain time series forecasting,” in Proceedings of the ACM on Web Conference 2024, 2024, pp. 4095–4106.
[179] C. Chang, W.-C. Peng, and T.-F. Chen, “Llm4ts: Two-stage fine-tuning for time-series forecasting with pre-trained llms,” arXiv preprint arXiv:2308.08469, 2023.
[180] D. Cao, F. Jia, S. O. Arik, T. Pfister, Y. Zheng, W. Ye, and Y. Liu, “TEMPO: Prompt-based generative pre-trained transformer for time series forecasting,” in The Twelfth International Conference on Learning Representations, 2024.
[181] N. Gruver, M. Finzi, S. Qiu, and A. G. Wilson, “Large language models are zero-shot time series forecasters,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[182] M. Jin, S. Wang, L. Ma, Z. Chu, J. Y. Zhang, X. Shi, P.-Y. Chen, Y. Liang, Y.-F. Li, S. Pan et al., “Time-llm: Time series forecasting by reprogramming large language models,” arXiv preprint arXiv:2310.01728, 2023.
[183] C. Sun, H. Li, Y. Li, and S. Hong, “TEST: Text prototype aligned embedding to activate LLM’s ability for time series,” in The Twelfth International Conference on Learning Representations, 2024.
[184] M. Goswami, K. Szafer, A. Choudhry, Y. Cai, S. Li, and A. Dubrawski, “Moment: A family of open time-series foundation models,” arXiv preprint arXiv:2402.03885, 2024.
[185] A. Das, W. Kong, R. Sen, and Y. Zhou, “A decoder-only foundation model for time-series forecasting,” arXiv preprint arXiv:2310.10688, 2023.
[186] S. Johansen, “Estimation and hypothesis testing of cointegration vectors in gaussian vector autoregressive models,” Econometrica: journal of the Econometric Society, pp. 1551–1580, 1991.
[187] L. Song, M. Kolar, and E. Xing, “Time-varying dynamic bayesian
ries for cuffless blood pressure estimation,” npj Digital Medicine, vol. 6, no. 1, p. 110, 2023.
[190] F. M. Abushaqra, H. Xue, Y. Ren, and F. D. Salim, “Seqlink: A robust neural-ode architecture for modelling partially observed time series,” Transactions on Machine Learning Research, 2024.
[191] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” Journal of Computational physics, vol. 378, pp. 686–707, 2019.
[192] B. Huang and J. Wang, “Applications of physics-informed neural networks in power systems-a review,” IEEE Transactions on Power Systems, vol. 38, no. 1, pp. 572–588, 2022.
[193] A. Bracco, J. Brajard, H. A. Dijkstra, P. Hassanzadeh, C. Lessig, and C. Monteleoni, “Machine learning for the physics of climate,” Nature Reviews Physics, pp. 1–15, 2024.
[194] R. Dang-Nhu, G. Singh, P. Bielik, and M. Vechev, “Adversarial attacks on probabilistic autoregressive forecasting models,” in International Conference on Machine Learning. PMLR, 2020, pp. 2356–2365.
[195] L. Liu, Y. Park, T. N. Hoang, H. Hasson, and L. Huan, “Robust multivariate time-series forecasting: Adversarial attacks and defense mechanisms,” in The Eleventh International Conference on Learning Representations, 2023.
[196] F. Liu, W. Zhang, and H. Liu, “Robust spatiotemporal traffic forecasting with reinforced dynamic adversarial training,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 1417–1428.
[197] X. Lin, Z. Liu, D. Fu, R. Qiu, and H. Tong, “Backtime: Backdoor attacks on multivariate time series forecasting,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
[198] C. Meng, S. Rambhatla, and Y. Liu, “Cross-node federated graph neural network for spatio-temporal data modeling,” in Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, 2021, pp. 1202–1211.
[199] S. Chen, G. Long, T. Shen, and J. Jiang, “Prompt federated learning for weather forecasting: toward foundation models on meteorological data,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, pp. 3532–3540.
[200] Q. Liu, X. Liu, C. Liu, Q. Wen, and Y. Liang, “Time-ffm: Towards lm-empowered federated foundation model for time series forecasting,” arXiv preprint arXiv:2405.14252, 2024.
[201] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, “Support vector regression machines,” Advances in neural information processing systems, vol. 9, 1996.
[202] C. Cortes, “Support-vector networks,” Machine Learning, 1995.
[203] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and computing, vol. 14, pp. 199–222, 2004.
[204] L. Breiman, “Classification and regression trees,” Monterey, CA: Wadsworth and Brooks, 1984.
[205] L. Breiman, “Random forests,” Machine learning, vol. 45, pp. 5–32, 2001.
[206] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001.
[207] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794.
[208] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” Advances in neural information processing systems, vol. 30, 2017.
[209] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE transactions on information theory, vol. 13, no. 1, pp. 21–27, 1967.
[210] H. Hewamalage, C. Bergmeir, and K. Bandara, “Recurrent neural networks for time series forecasting: Current status and future directions,” International Journal of Forecasting, vol. 37, no. 1, pp. 388–427, 2021.
networks,” Advances in neural information processing systems, [211] R. Wen, K. Torkkola, B. Narayanaswamy, and D. Madeka,
vol. 22, 2009. “A multi-horizon quantile recurrent forecaster,” arXiv preprint
[188] P. Cui, Z. Shen, S. Li, L. Yao, Y. Li, Z. Chu, and J. Gao, “Causal arXiv:1711.11053, 2017.
inference meets machine learning,” in Proceedings of the 26th [212] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term de-
ACM SIGKDD international conference on knowledge discovery & pendencies with gradient descent is difficult,” IEEE transactions
data mining, 2020, pp. 3527–3528. on neural networks, vol. 5, no. 2, pp. 157–166, 1994.
[189] K. Sel, A. Mohammadi, R. I. Pettigrew, and R. Jafari, “Physics- [213] R. Pascanu, “On the difficulty of training recurrent neural net-
informed neural networks for modeling physiological time se- works,” arXiv preprint arXiv:1211.5063, 2013.
PREPRINT 27

[214] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, in ICASSP 2024-2024 IEEE International Conference on Acoustics,
S. Biderman, H. Cao, X. Cheng, M. Chung, M. Grella et al., Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 7335–7339.
“Rwkv: Reinventing rnns for the transformer era,” arXiv preprint [236] S. Li, H. Xiong, and Y. Chen, “Diffplf: A conditional diffusion
arXiv:2305.13048, 2023. model for probabilistic forecasting of ev charging load,” arXiv
[215] M. Beck, K. Pöppel, M. Spanring, A. Auer, O. Prudnikova, preprint arXiv:2402.13548, 2024.
M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter, [237] M. Ragab, E. Eldele, Z. Chen, M. Wu, C.-K. Kwoh, and X. Li,
“xlstm: Extended long short-term memory,” arXiv preprint “Self-supervised autoregressive domain adaptation for time se-
arXiv:2405.04517, 2024. ries data,” IEEE Transactions on Neural Networks and Learning
[216] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, Systems, vol. 35, no. 1, pp. 1341–1351, 2022.
speech, and time series,” The handbook of brain theory and neural [238] S. Mirchandani, F. Xia, P. Florence, B. Ichter, D. Driess, M. G.
networks, vol. 3361, no. 10, p. 1995, 1995. Arenas, K. Rao, D. Sadigh, and A. Zeng, “Large language models
[217] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation as general pattern machines,” arXiv preprint arXiv:2307.04721,
of generic convolutional and recurrent networks for sequence 2023.
modeling,” arXiv preprint arXiv:1803.01271, 2018. [239] Y. Wang, Z. Chu, X. Ouyang, S. Wang, H. Hao, Y. Shen, J. Gu,
[218] S. Huang, D. Wang, X. Wu, and A. Tang, “Dsanet: Dual self- S. Xue, J. Y. Zhang, Q. Cui et al., “Enhancing recommender
attention network for multivariate time series forecasting,” in systems with large language model reasoning graphs,” arXiv
Proceedings of the 28th ACM international conference on information preprint arXiv:2308.10835, 2023.
and knowledge management, 2019, pp. 2129–2132. [240] F. Jia, K. Wang, Y. Zheng, D. Cao, and Y. Liu, “Gpt4mts: Prompt-
[219] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. based large language model for multimodal time-series forecast-
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” ing,” in Proceedings of the AAAI Conference on Artificial Intelligence,
Advances in neural information processing systems, vol. 30, 2017. vol. 38, no. 21, 2024, pp. 23 343–23 351.
[220] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, and S. Dust- [241] E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang,
dar, “Pyraformer: Low-complexity pyramidal attention for long- L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large
range time series modeling and forecasting,” in International language models,” in International Conference on Learning Repre-
Conference on Learning Representations, 2022. sentations, 2022.
[221] Y. Liu, H. Wu, J. Wang, and M. Long, “Non-stationary trans- [242] H. Liu, Z. Zhao, J. Wang, H. Kamarthi, and B. A. Prakash,
formers: Exploring the stationarity in time series forecasting,” “Lstprompt: Large language models as zero-shot time se-
Advances in Neural Information Processing Systems, vol. 35, pp. ries forecasters by long-short-term prompting,” arXiv preprint
9881–9893, 2022. arXiv:2402.16132, 2024.
[222] G. Woo, C. Liu, D. Sahoo, A. Kumar, and S. Hoi, “Etsformer: [243] Y. Zhu, Z. Wang, J. Gao, Y. Tong, J. An, W. Liao, E. M. Harrison,
Exponential smoothing transformers for time-series forecasting,” L. Ma, and C. Pan, “Prompting large language models for zero-
arXiv preprint arXiv:2202.01381, 2022. shot clinical prediction with structured longitudinal electronic
[223] J. Jiang, C. Han, W. X. Zhao, and J. Wang, “Pdformer: Propa- health record data,” arXiv preprint arXiv:2402.01713, 2024.
gation delay-aware dynamic long-range transformer for traffic [244] H. Zhang, C. Xu, Y.-F. Zhang, Z. Zhang, L. Wang, J. Bian, and
flow prediction,” in Proceedings of the AAAI conference on artificial T. Tan, “Timeraf: Retrieval-augmented foundation model for
intelligence, vol. 37, no. 4, 2023, pp. 4365–4373. zero-shot time series forecasting,” arXiv preprint arXiv:2412.20810,
[224] R. Ilbert, A. Odonnat, V. Feofanov, A. Virmaux, G. Paolo, T. Pal- 2024.
panas, and I. Redko, “Samformer: unlocking the potential of [245] M. Xiao, Z. Jiang, Z. Chen, D. Li, S. Chen, S. Ananiadou, J. Huang,
transformers in time series forecasting with sharpness-aware M. Peng, and Q. Xie, “Timerag: It’s time for retrieval-augmented
minimization and channel-wise attention,” in Proceedings of the generation in time-series forecasting.”
41st International Conference on Machine Learning, 2024, pp. 20 924– [246] J. Wang, M. Cheng, Q. Mao, Q. Liu, F. Xu, X. Li, and E. Chen,
20 954. “Tabletime: Reformulating time series classification as zero-shot
[225] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with table understanding via large language models,” arXiv preprint
selective state spaces,” arXiv preprint arXiv:2312.00752, 2023. arXiv:2411.15737, 2024.
[226] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhari- [247] M. Lin, Z. Chen, Y. Liu, X. Zhao, Z. Wu, J. Wang, X. Zhang,
wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., S. Wang, and H. Chen, “Decoding time series with llms: A multi-
“Language models are few-shot learners,” Advances in neural agent framework for cross-domain annotation,” arXiv preprint
information processing systems, vol. 33, pp. 1877–1901, 2020. arXiv:2410.17462, 2024.
[227] D. P. Kingma, M. Welling et al., “Auto-encoding variational [248] M. A. Merrill, M. Tan, V. Gupta, T. Hartvigsen, and T. Althoff,
bayes,” 2013. “Language models still struggle to zero-shot reason about time
[228] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- series,” arXiv preprint arXiv:2404.11757, 2024.
Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative ad- [249] C. Glymour, K. Zhang, and P. Spirtes, “Review of causal discov-
versarial nets,” Advances in neural information processing systems, ery methods based on graphical models,” Frontiers in genetics,
vol. 27, 2014. vol. 10, p. 524, 2019.
[229] D. Rezende and S. Mohamed, “Variational inference with nor- [250] J. Tian and J. Pearl, “Causal discovery from changes,” arXiv
malizing flows,” in International conference on machine learning. preprint arXiv:1301.2312, 2013.
PMLR, 2015, pp. 1530–1538. [251] C. K. Assaad, E. Devijver, and E. Gaussier, “Survey and eval-
[230] Z. Wang, Q. Wen, C. Zhang, L. Sun, and Y. Wang, “Diffload: uation of causal discovery methods for time series,” Journal of
uncertainty quantification in load forecasting with diffusion Artificial Intelligence Research, vol. 73, pp. 767–819, 2022.
model,” arXiv preprint arXiv:2306.01001, 2023. [252] C. Miller, A. Kathirgamanathan, B. Picchetti, P. Arjunan, J. Y.
[231] P. Chang, H. Li, S. F. Quan, S. Lu, S.-F. Wung, J. Roveda, and Park, Z. Nagy, P. Raftery, B. W. Hobson, Z. Shi, and F. Meggers,
A. Li, “Tdstf: Transformer-based diffusion probabilistic model for “The building data genome project 2, energy meter data from
sparse time series forecasting,” arXiv preprint arXiv:2301.06625, the ashrae great energy predictor iii competition,” Scientific data,
2023. vol. 7, no. 1, p. 368, 2020.
[232] N. Neifar, A. Ben-Hamadou, A. Mdhaffar, and M. Jmaiel, “Dif- [253] S. Barker, A. Mishra, D. Irwin, E. Cecchet, P. Shenoy, J. Albrecht
fecg: A versatile probabilistic diffusion model for ecg signals et al., “Smart*: An open data set and tools for enabling research
synthesis,” arXiv preprint arXiv:2306.01875, 2023. in sustainable homes,” SustKDD, August, vol. 111, no. 112, p. 108,
[233] S. Feng, C. Miao, Z. Zhang, and P. Zhao, “Latent diffusion trans- 2012.
former for probabilistic time series forecasting,” in Proceedings of [254] D. A. Bashawyah and S. M. Qaisar, “Machine learning based
the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, 2024, short-term load forecasting for smart meter energy consumption
pp. 11 979–11 987. data in london households,” in 2021 IEEE 12th International
[234] P. Shao, J. Feng, J. Lu, P. Zhang, and C. Zou, “Data-driven and Conference on Electronics and Information Technologies (ELIT). IEEE,
knowledge-guided denoising diffusion model for flood forecast- 2021, pp. 99–102.
ing,” Expert Systems with Applications, vol. 244, p. 122908, 2024. [255] J. Zhou, X. Lu, Y. Xiao, J. Su, J. Lyu, Y. Ma, and D. Dou, “Sdwpf:
[235] D. Daiya, M. Yadav, and H. S. Rao, “Diffstock: Probabilistic A dataset for spatial dynamic wind power forecasting challenge
relational stock market predictions using diffusion models,” at kdd cup 2022,” arXiv preprint arXiv:2208.04360, 2022.
PREPRINT 28

[256] A. Alexandrov, K. Benidis, M. Bohlke-Schneider, V. Flunkert, [276] C.-Y. Hsu and W.-C. Liu, “Multiple time-series convolutional
J. Gasthaus, T. Januschowski, D. C. Maddix, S. Rangapuram, neural network for fault detection and diagnosis and empirical
D. Salinas, J. Schulz et al., “Gluonts: Probabilistic and neural time study in semiconductor manufacturing,” Journal of Intelligent
series modeling in python,” Journal of Machine Learning Research, Manufacturing, vol. 32, no. 3, pp. 823–836, 2021.
vol. 21, no. 116, pp. 1–6, 2020. [277] H. Ben Ameur, S. Boubaker, Z. Ftiti, W. Louhichi, and K. Tissaoui,
[257] J. Wang, J. Jiang, W. Jiang, C. Li, and W. X. Zhao, “Libcity: An “Forecasting commodity prices: empirical evidence using deep
open library for traffic prediction,” in Proceedings of the 29th in- learning tools,” Annals of Operations Research, vol. 339, no. 1, pp.
ternational conference on advances in geographic information systems, 349–367, 2024.
2021, pp. 145–148. [278] W. Kong, H. Li, C. Yu, J. Xia, Y. Kang, and P. Zhang, “A deep
[258] Google, “Web traffic time series forecasting,” [Link] spatio-temporal forecasting model for multi-site weather predic-
[Link]/c/web-traffic-time-series-forecasting, 2017. tion post-processing,” Communications in Computational Physics,
[259] H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, vol. 31, no. 1, pp. 131–153, 2022.
J. M. Patel, R. Ramakrishnan, and C. Shahabi, “Big data and its [279] H. Wu, H. Zhou, M. Long, and J. Wang, “Interpretable weather
technical challenges,” Communications of the ACM, vol. 57, no. 7, forecasting for worldwide stations with a unified deep model,”
pp. 86–94, 2014. Nature Machine Intelligence, vol. 5, no. 6, pp. 602–611, 2023.
[260] S. Mouatadid, P. Orenstein, G. Flaspohler, M. Oprescu, J. Cohen, [280] S. Guo, Y. Lin, N. Feng, C. Song, and H. Wan, “Attention based
F. Wang, S. Knight, M. Geogdzhayeva, S. Levang, E. Fraenkel spatial-temporal graph convolutional networks for traffic flow
et al., “Subseasonalclimateusa: a dataset for subseasonal forecast- forecasting,” in Proceedings of the AAAI conference on artificial
ing and benchmarking,” Advances in Neural Information Processing intelligence, vol. 33, no. 01, 2019, pp. 922–929.
Systems, vol. 36, 2024. [281] J. Fan, W. Weng, H. Tian, H. Wu, F. Zhu, and J. Wu, “Rgdan: A
random graph diffusion attention network for traffic prediction,”
[261] X. Qiu, J. Hu, L. Zhou, X. Wu, J. Du, B. Zhang, C. Guo, A. Zhou,
Neural networks, vol. 172, p. 106093, 2024.
C. S. Jensen, Z. Sheng et al., “Tfb: Towards comprehensive and
fair benchmarking of time series forecasting methods,” Proceed-
ings of the VLDB Endowment, vol. 17, no. 9, pp. 2363–2377, 2024.
[262] Y. Zheng, X. Yi, M. Li, R. Li, Z. Shan, E. Chang, and T. Li,
“Forecasting fine-grained air quality based on big data,” in
Proceedings of the 21th ACM SIGKDD international conference on
knowledge discovery and data mining, 2015, pp. 2267–2276.
[263] S. Chen, “Beijing Multi-Site Air Quality,” UCI Machine Learning
Repository, 2017, DOI: [Link]
[264] S. Makridakis, A. Andersen, R. Carbone, R. Fildes, M. Hibon,
R. Lewandowski, J. Newton, E. Parzen, and R. Winkler, “The
accuracy of extrapolation (time series) methods: Results of a
forecasting competition,” Journal of forecasting, vol. 1, no. 2, pp.
111–153, 1982.
[265] S. Makridakis and M. Hibon, “The m3-competition: results,
conclusions and implications,” International journal of forecasting,
vol. 16, no. 4, pp. 451–476, 2000.
[266] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, “The m4
competition: Results, findings, conclusion and way forward,”
International Journal of forecasting, vol. 34, no. 4, pp. 802–808, 2018.
[267] ——, “M5 accuracy competition: Results, findings, and conclu-
sions,” International Journal of Forecasting, vol. 38, no. 4, pp. 1346–
1364, 2022.
[268] S. B. Taieb, G. Bontempi, A. F. Atiya, and A. Sorjamaa, “A review
and comparison of strategies for multi-step ahead time series
forecasting based on the nn5 forecasting competition,” Expert
systems with applications, vol. 39, no. 8, pp. 7067–7083, 2012.
[269] G. Athanasopoulos, R. J. Hyndman, H. Song, and D. C. Wu,
“The tourism forecasting competition,” International Journal of
Forecasting, vol. 27, no. 3, pp. 822–844, 2011.
[270] W. G. van Panhuis, A. Cross, and D. S. Burke, “Project tycho 2.0: a
repository to improve the integration and reuse of data for global
population health,” Journal of the American Medical Informatics
Association, vol. 25, no. 12, pp. 1608–1617, 2018.
[271] F. Piccialli, F. Giampaolo, E. Prezioso, D. Camacho, and G. Acam-
pora, “Artificial intelligence and healthcare: Forecasting of med-
ical bookings through multi-source time-series fusion,” Informa-
tion Fusion, vol. 74, pp. 1–16, 2021.
[272] J. Xie and Q. Wang, “Benchmarking machine learning algorithms
on blood glucose prediction for type i diabetes in comparison
with classical time-series models,” IEEE Transactions on Biomedical
Engineering, vol. 67, no. 11, pp. 3101–3124, 2020.
[273] E. Hwang, Y.-S. Park, J.-Y. Kim, S.-H. Park, J. Kim, and S.-H.
Kim, “Intraoperative hypotension prediction based on features
automatically generated within an interpretable deep learning
model,” IEEE Transactions on Neural Networks and Learning Sys-
tems, 2023.
[274] F. Lu, W. Li, Z. Zhou, C. Song, Y. Sun, Y. Zhang, Y. Ren, X. Liao,
H. Jin, A. Luo et al., “A composite multi-attention framework for
intraoperative hypotension early warning,” in Proceedings of the
AAAI Conference on Artificial Intelligence, vol. 37, no. 12, 2023, pp.
14 374–14 381.
[275] M. Cheng, J. Zhang, Z. Liu, C. Liu, and Y. Xie, “Hmf: A hybrid
multi-factor framework for dynamic intraoperative hypotension
prediction,” arXiv preprint arXiv:2409.11064, 2024.

You might also like