Universal Time-Series Representation Learning: A Survey
PATARA TRIRAT, YOOJU SHIN, JUNHYEOK KANG, YOUNGEUN NAM, JIHYE NA, MINYOUNG BAE, JOEUN KIM, BYUNGHYUN KIM, and JAE-GIL LEE∗, KAIST, South Korea
Time-series data exists in every corner of real-world systems and services, ranging from satellites in the sky to
wearable devices on human bodies. Learning representations by extracting and inferring valuable information
from these time series is crucial for understanding the complex dynamics of particular phenomena and enabling
informed decisions. With the learned representations, we can perform numerous downstream analyses more
effectively. Among several approaches, deep learning has demonstrated remarkable performance in extracting
hidden patterns and features from time-series data without manual feature engineering. This survey first
presents a novel taxonomy based on three fundamental elements in designing state-of-the-art universal
representation learning methods for time series. According to the proposed taxonomy, we comprehensively
review existing studies and discuss their intuitions and insights into how these methods enhance the quality of
learned representations. Finally, as a guideline for future studies, we summarize commonly used experimental
setups and datasets and discuss several promising research directions. An up-to-date corresponding resource
is available at https://github.com/itouchz/awesome-deep-time-series-representations.
CCS Concepts: • Computing methodologies → Neural networks; Learning latent representations; •
Mathematics of computing → Time series analysis; • Information systems → Data mining.
Additional Key Words and Phrases: time series, representation learning, neural networks, temporal modeling
ACM Reference Format:
Patara Trirat, Yooju Shin, Junhyeok Kang, Youngeun Nam, Jihye Na, Minyoung Bae, Joeun Kim, Byunghyun
Kim, and Jae-Gil Lee. 2024. Universal Time-Series Representation Learning: A Survey. 1, 1 (January 2024),
39 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
1.1 Background
A time series is a sequence of data points recorded in chronological order, reflecting the complex
dynamics of particular variables or phenomena over time. Time-series data can represent various
meaningful information across application domains at different time points, enabling informed
decision-making and predictions, such as sensor readings in the Internet of Things [1, 2], measure-
ments in cyber-physical systems [3, 4], fluctuation in stock markets [5, 6], and human activity on
wearable devices [7, 8]. However, to extract and understand meaningful information from such
complicated observations, we need a mechanism to represent these time series, which leads to the
emergence of time-series representation research. Based on the new representations, we can effec-
tively perform various downstream time-series analyses [9], e.g., forecasting [10], classification [11],
regression [12], and anomaly detection [13]. Fig. 1 depicts the basic concept of representation
methods for time-series data.
∗ Jae-Gil Lee is the corresponding author.
Authors’ address: Patara Trirat, [email protected]; Yooju Shin, [email protected]; Junhyeok Kang, junhyeok.kang@
kaist.ac.kr; Youngeun Nam, [email protected]; Jihye Na, [email protected]; Minyoung Bae, [email protected];
Joeun Kim, [email protected]; Byunghyun Kim, [email protected]; Jae-Gil Lee, [email protected], KAIST, 291
Daehak-ro, Yuseong-gu, Daejeon, 34141, South Korea.
Fig. 1. The basic concept of representation methods for time-series data: a representation method maps the original time series into new representations (a representation space) that are then used for downstream tasks such as forecasting, classification, and anomaly detection.
Early attempts [14] represent time series using piecewise linear methods (e.g., piecewise aggregate
approximation), symbolic-based methods (e.g., symbolic aggregate approximation), feature-based
methods (e.g., shapelets), or transformation-based methods (e.g., discrete wavelet transform). These
traditional time-series representation methods are known to be time-consuming and less effective
because of their dependence on domain knowledge and the poor generality of predefined priors.
Since the quality of representations significantly affects the downstream task performance, many
studies propose to learn meaningful time-series representations automatically [15–17]. The main
goal of these studies is to obtain high-quality learned representations of time series that capture
valuable information within the data and unveil the underlying dynamics of the corresponding
systems or phenomena. Among several approaches, neural networks or deep learning (DL) have
demonstrated unprecedented performance in extracting hidden patterns and features from a wide
range of data, including time series, without the requirement of manual feature engineering.
Given the sequential nature of time series, recurrent neural networks (RNN) and their variants,
such as long short-term memory and gated recurrent unit, are considered a popular choice for
capturing temporal dependencies in time series [18, 19]. Nevertheless, recurrent-based networks
are complex and computationally expensive. Another line of work adopts one-dimensional convolu-
tional neural networks (CNN) to improve the computational efficiency with the parallel processing
of convolutional operations [20]. Even though RNN- and CNN-based models are shown to be good
at capturing temporal dependencies, they cannot explicitly model the relationship between different
variables within a multivariate time series. Many studies propose to use attention-based networks
or graph neural networks to jointly learn the temporal dependencies in each variable and the
correlations between different variables in multivariate time series using attention mechanisms or
graph structures [21, 22]. Despite the significant progress in architectural design, time series can be
collected irregularly or have missing values caused by sensor malfunctions in real-world scenarios,
making the commonly used neural networks inefficient due to the adversarial side-effect during the
imputation process. Consequently, recent research integrates neural ordinary differential equations
into existing networks such that the models can produce continuous hidden states, thereby being
robust to irregular time series [23, 24].
In addition, the reliability and efficacy of DL-based methods are generally contingent upon the
availability of sufficiently well-annotated data, commonly known as supervised learning. Time-
series data, however, is naturally continuous-valued, contains high levels of noise, and has less
intuitively discernible visual patterns. In contrast to human-recognizable patterns in images or texts,
time-series data can have inconsistent semantic meanings in real-world settings across application
domains. As a result, obtaining a well-annotated time series is inevitably inefficient and considerably
more challenging even for domain experts due to the convoluted dynamics of the time-evolving
observations collected from diverse sensors or wearable devices with different frequencies. For
example, we can collect a large set of sensor signals from a smart factory, while only a few of
them can be annotated by domain experts. To circumvent the laborious annotation process and
reduce the reliance on labeled instances, there has been a growing interest in unsupervised and
self-supervised learning approaches using self-generated labels from various pretext tasks without
relying on human annotation [25–28].
While unsupervised and self-supervised representation learning share the same objective of extracting latent representations from intricate raw time series without human-annotated labels, their underlying mechanisms differ. Unsupervised learning methods [28] usually adopt autoencoders
and sequence-to-sequence models to learn meaningful representations using reconstruction-based
learning objectives. However, accurately reconstructing the complex time-series data remains
challenging, especially with high-frequency signals. On the contrary, self-supervised learning
methods [26] leverage pretext tasks to autonomously generate labels by utilizing intrinsic informa-
tion derived from the unlabeled data. Lately, pretext tasks with contrasting loss (also known as
contrastive learning) have been proposed to enhance learning efficiency through discriminative
pre-training with self-generated supervised signals. Contrastive learning aims to bring similar
samples closer while pushing dissimilar samples apart in the feature space. These pretext tasks are
self-generated challenges the model learns to solve from the unlabeled data, thereby being able to
produce meaningful representations for multiple downstream tasks [29].
To further enhance the representation quality and alleviate the impact of limited training samples
in particular settings where collecting sufficiently large data is prohibited (e.g., human-related
services), several studies also employ data-related techniques, e.g., augmentation [30] and transfor-
mation [31], on top of the existing learning methods. Accordingly, we can effectively increase the
size and improve the quality of the training data. These techniques are also deemed essential in
generating pretext tasks. Different from other data types, working with time-series data requires considering its unique properties, such as temporal dependencies and multi-scale relationships [32].
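For concreteness, below is a minimal sketch of two commonly used time-series augmentations, jittering and scaling; the function names, noise levels, and shapes are illustrative assumptions rather than the recipe of any specific reviewed method.

```python
import numpy as np

def jitter(x: np.ndarray, sigma: float = 0.03) -> np.ndarray:
    """Add Gaussian noise to every observation of a (T, V) time series."""
    return x + np.random.normal(loc=0.0, scale=sigma, size=x.shape)

def scale(x: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Multiply each variable by a random factor drawn around 1.0."""
    factors = np.random.normal(loc=1.0, scale=sigma, size=(1, x.shape[1]))
    return x * factors

# Example: create two augmented views of the same series, e.g., for a pretext task.
x = np.random.randn(100, 3)          # T = 100 time steps, V = 3 variables
view_a, view_b = jitter(x), scale(x)
```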
Table 1. Comparison of the survey scope between this article and related papers.
An early survey reviews representation learning for time-series data as unsupervised feature learning algorithms. The survey focuses
particularly on neural architectures with a discussion on classical time-series applications. After
about a decade, a few survey papers review time-series representation learning methods by focusing
on learning objective aspects. For example, Zhang et al. [26] and Deldari et al. [34] review self-
supervised learning-based models, while, for a broader scope, Meng et al. [28] review unsupervised
learning-based methods. Similarly, Ma et al. [27] present a survey covering any learning objective, with a focus on analyzing the reviewed articles from transfer learning and pre-training perspectives.
With a narrower scope, another survey [25] reviews state-of-the-art studies that specifically tackle label
scarcity in time-series data. Unlike existing survey papers, this paper comprehensively reviews the
representation learning methods for time series by focusing on their universality with discussions on
their intuitions and insights into how these methods enhance the quality of learned representations
from all three design aspects. Specifically, we aim to review and identify research directions on how
recent state-of-the-art studies design the neural architecture, devise corresponding learning objectives,
and utilize training data to enhance the quality of the learned representations of time series for
downstream tasks. Table 1 summarizes the differences between our survey and the related work.
1.3.1 Keywords. “time series” AND “representation”, “time series” AND “embedding”, “time series”
AND “encoding”, “time series” AND “modeling”, “time series” AND “deep learning”, “temporal” AND
“representation”, “sequential” AND “representation”, “audio” AND “representation”, “sequence”
AND “representation”, and (“video” OR “action”) AND “representation”. We use these keywords
to search well-known repositories, including the ACM Digital Library, IEEE Xplore, Google Scholar, Semantic Scholar, and DBLP, for relevant papers.
1.3.2 Inclusion Criteria. The initial set of papers found with the above search queries is further filtered by the following criteria. Only papers meeting the criteria are included for review.
• Being written in the English language only
• Being a deep learning or neural networks-based approach
• Being published in or after 2018 at a top-tier conference or high-impact journal1
• Being evaluated on at least two downstream tasks using time-series, video, or audio datasets
1 Top-tier venues are evaluated based on CORE (https://portal.core.edu.au), KIISE (https://www.kiise.or.kr), or Google
Scholar (https://scholar.google.com). Only publications from venues rated at least A by CORE/KIISE or in the top 20 in at
least one subcategory by Google Scholar metrics are included for review. Recent arXiv papers are also included if their authors have publication records in the qualified venues.
Fig. 3. Quantitative summary of the selected papers: (a) focus design element (neural architectures, learning objectives, and training data), (b) publication venue, and (c) publication year.
1.3.3 Quantitative Summary of Selected Papers. Given the above keywords and inclusion criteria,
we selected 105 papers in total. Fig. 3 shows the quantitative summary of the papers selected for
review. We can notice from Fig. 3a that neural architectures and learning objectives are similarly
considered important in designing state-of-the-art methods. Most papers were published at NeurIPS,
followed by AAAI, CVPR, and ICLR (Fig. 3b). According to Fig. 3c, we expect more papers on this
topic will be published in the future.
1.4 Contributions
This paper aims to identify the essential elements in designing state-of-the-art representation learning methods for time series and to examine how these elements affect the quality of the learned representations. To the best of our knowledge, this is the first survey on universal time-series representation learning. We propose a novel taxonomy for learning universal representations of time series from the novelty perspective, i.e., which of the design elements discussed above constitutes a paper's main contribution, to summarize the selected studies. Table 2 summarizes
and compares the reviewed articles based on the proposed taxonomy. From the perspective of neural
architectures, we summarize the changes made at both the module level and the network level of neural architectures used for representation learning. From the learning perspective, we classify what
objectives are used to make the learned representations generalizable to various downstream tasks.
Last, we also categorize data-centric methods for the papers that focus particularly on improving
the quality of the training data. Overall, our main contributions are as follows.
• We conduct an extensive literature review of universal time-series representation learning based
on a novel and up-to-date taxonomy that categorizes the reviewed methods into three main
categories: neural architectural, learning-focused, and data-centric approaches.
• We provide a guideline on the experimental setup and benchmark datasets for assessing repre-
sentation learning methods for time series.
• We discuss several open research challenges and new insights to facilitate future work.
2 PRELIMINARIES
This section presents definitions and notations used throughout this paper, descriptions of down-
stream tasks, unique properties in time series, and basic building blocks of neural architectures.
2.1 Definitions
Definition 2.1 (Time Series). A time series X is a chronologically ordered sequence of 𝑉 -variate data points recorded at specific time intervals, X = (x1, . . . , x𝑡 , . . . , x𝑇 ), where x𝑡 ∈ R𝑉 is the observed value at the 𝑡-th time step, 𝑉 is the number of variables, and 𝑇 is the length of the time series. When 𝑉 = 1, it is a univariate time series; otherwise, it is a multivariate time series. Audio and video
data can be considered special cases of time series with more dimensions. The time intervals are
typically equally spaced, and the values can represent any measurable quantity, such as temperature,
sales figures, or any phenomenon that changes over time.
Definition 2.2 (Irregularly-Sampled Time Series). An irregularly-sampled time series is a time series in which the intervals between observations are not consistent or regularly spaced. Thus, the time intervals between (x1, x2 ) and (x2, x3 ) are unequal, as illustrated in Fig. 4. Such series are often encountered in situations where data is collected opportunistically or when events occur irregularly and sporadically, e.g., due to sensor malfunctions, leading to varying time gaps between observations.
Definition 2.3 (Time-Series Representation Learning). Given a raw time series X, time-series representation learning aims to learn an encoder 𝑓𝑒 , a nonlinear embedding function that maps X into an 𝑅-dimensional representation Z = (z1, . . . , z𝑅 ) in the latent space, where z𝑖 ∈ R𝐹 .
Z usually has either the same length as (𝑅 = 𝑇 ) or a shorter length than (𝑅 < 𝑇 ) the original time series. When 𝑅 = 𝑇 , Z is a timestamp-wise (or point-wise) representation that contains a representation vector z𝑡 with feature size 𝐹 for each time step 𝑡. In contrast, when 𝑅 < 𝑇 , Z is a compressed version of X with reduced dimension, and 𝐹 is usually 1, producing a series-wise (or instance-wise) representation.
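To make the shapes in Definition 2.3 concrete, the following is a minimal PyTorch sketch of an encoder 𝑓𝑒 (a generic one-dimensional CNN, not any specific reviewed model) that outputs either a timestamp-wise representation (𝑅 = 𝑇 ) or a single pooled vector summarizing the whole series.

```python
import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    """Illustrative encoder f_e mapping (batch, T, V) to representations."""
    def __init__(self, n_vars: int, feat_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_vars, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor, series_wise: bool = False) -> torch.Tensor:
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, T, F): timestamp-wise
        if series_wise:
            z = z.max(dim=1).values                       # pooled over time: one vector per series
        return z

x = torch.randn(8, 100, 3)               # 8 series, T = 100, V = 3
f_e = SimpleEncoder(n_vars=3)
z_point = f_e(x)                         # timestamp-wise (point-wise) representation
z_series = f_e(x, series_wise=True)      # series-wise (instance-wise) representation
```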
A crucial measure to assess the quality of a representation learning method, i.e., the encoder 𝑓𝑒 ,
is its ability to produce the representations Z that effectively facilitate downstream tasks, either
with or without fine-tuning (see Section 6). Once we obtain the latent representation Z, we can
use it as input for downstream tasks to evaluate its actual performance. Here, we define the most
common downstream tasks as follows.
Definition 2.4 (Forecasting). Time-series forecasting (TSF) aims to predict the future values of a
time series by explicitly modeling the dynamics and dependencies among historical observations.
It can be short-term or long-term forecasting depending on the prediction horizon 𝐻 . Formally,
given a time series X, TSF predicts the next 𝐻 values (x𝑇 +1, . . . , x𝑇 +𝐻 ) that are most likely to occur.
Definition 2.5 (Classification). Time-series classification (TSC) aims to assign predefined class
labels C = {𝑐1, . . . , 𝑐 |C| } to a set of time series. Let D = {(X𝑖 , y𝑖 )}𝑁𝑖=1 denote a time-series dataset with 𝑁 samples, where X𝑖 ∈ R𝑇 ×𝑉 is a time series and y𝑖 is the corresponding one-hot label vector
Fig. 4. Illustrations of (a) regularly and (b) irregularly sampled multivariate time series (𝑉 = 3).
of length |C|. The 𝑗-th element of the one-hot vector y𝑖 is equal to 1 if the class of X𝑖 is 𝑗, otherwise
0. Formally, TSC trains a classifier on the given dataset D by learning the discriminative features
to distinguish different classes from each other. Then, when an unseen dataset D ′ is input to the
trained classifier, it automatically determines to which class 𝑐𝑖 each time series belongs.
Definition 2.6 (Extrinsic Regression). Time-series extrinsic regression (TSER) shares a similar goal
to TSC with a key difference in label annotation. While TSC predicts a categorical value, TSER
predicts a continuous value for a variable external to the input time series. That is, 𝑦𝑖 ∈ R. Formally,
TSER trains a regression model to map a given time series X𝑖 to a numerical value 𝑦𝑖 .
Definition 2.7 (Clustering). Time-series clustering (TSCL) is the process of finding natural groups, called clusters, in a set of time series X = {X𝑖 }𝑁𝑖=1. It aims to partition X into a group of clusters G = {𝑔1, . . . , 𝑔 |G| } by maximizing the similarities between time series within the same cluster and the dissimilarities between time series of different clusters. Formally, given a similarity measure 𝑓𝑠 (·, ·), ∀𝑖1, 𝑖2, 𝑗 : 𝑓𝑠 (X𝑖1 , X𝑖2 ) ≫ 𝑓𝑠 (X𝑖1 , X𝑗 ) for X𝑖1 , X𝑖2 ∈ 𝑔𝑖 and X𝑗 ∈ 𝑔 𝑗 with 𝑔𝑖 ≠ 𝑔 𝑗 .
Definition 2.8 (Segmentation). Time-series segmentation (TSS) aims to assign a label to a sub-
sequence X𝑇𝑠 ,𝑇𝑒 of X, where 𝑇𝑠 is the start offset and 𝑇𝑒 is the end offset, consisting of contiguous
observations of X from time step 𝑇𝑠 to 𝑇𝑒 . That is, X𝑇𝑠 ,𝑇𝑒 = (x𝑇𝑠 , . . . , x𝑇𝑒 ) and 1 ≤ 𝑇𝑠 ≤ 𝑇𝑒 ≤ 𝑇 .
Let a change point (CP) be an offset 𝑖 ∈ [1, . . . ,𝑇 ] corresponding to a state transition in the time series. TSS finds a segmentation of X, i.e., the ordered sequence of CPs in X (x𝑖1 , . . . , x𝑖𝑆 ) with 1 < 𝑖1 < · · · < 𝑖𝑆 < 𝑇 at which the state of observations changes. After identifying the number and locations of all CPs, we can set the start offset 𝑇𝑠 and end offset 𝑇𝑒 for each segment in X.
Definition 2.9 (Anomaly Detection). Time-series anomaly detection (TSAD) aims to identify abnor-
mal time points that significantly deviate from the other observations in a time series. Commonly,
TSAD learns the representations of normal behavior from a time series X. Then, the trained model
computes anomaly scores A = (𝑎1, . . . , 𝑎 |X′ | ) for all values in an unseen time series X′ to determine
which time point in X′ is anomalous. The final decisions are obtained by comparing each 𝑎𝑖 with a
predefined threshold 𝛿: anomalous if 𝑎𝑖 > 𝛿 and normal otherwise.
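As a small illustration of the final decision rule in Definition 2.9, the sketch below simply thresholds precomputed anomaly scores; the score values and threshold are synthetic placeholders, and how each 𝑎𝑖 is computed depends on the specific TSAD model (e.g., per-step reconstruction error).

```python
import numpy as np

def detect_anomalies(scores: np.ndarray, delta: float) -> np.ndarray:
    """Flag time steps whose anomaly score exceeds the threshold delta."""
    return scores > delta

# Synthetic example; in practice, scores come from a model trained on normal behavior.
scores = np.array([0.10, 0.20, 0.15, 0.90, 0.18])
flags = detect_anomalies(scores, delta=0.5)   # [False, False, False, True, False]
```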
Definition 2.10 (Imputation of Missing Values). Time-series imputation (TSI) aims to fill missing
values in a time series with realistic values to facilitate subsequent analysis. Given a time series X and
a binary mask 𝑀 = (𝑚1, . . . , 𝑚𝑡 , . . . , 𝑚𝑇 ), x𝑡 is missing if 𝑚𝑡 = 0 and observed otherwise. Let X̂ denote the values predicted by a TSI method; the imputed time series is Ximputed = X ⊙ 𝑀 + X̂ ⊙ (1 − 𝑀).
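The combination rule in Definition 2.10 is a direct element-wise operation; a NumPy transcription with placeholder values for X and X̂ is shown below.

```python
import numpy as np

T, V = 5, 2
X = np.random.randn(T, V)                  # observed series (missing entries arbitrary)
X_hat = np.random.randn(T, V)              # values predicted by some TSI model
M = np.array([[1], [1], [0], [1], [0]])    # 1 = observed, 0 = missing (broadcast over V)

# Keep observed values and fill missing positions with the model's predictions.
X_imputed = X * M + X_hat * (1 - M)
```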
Definition 2.11 (Retrieval). Time-series retrieval (TSR) aims to obtain a set of time series that are most similar to a given query. Given a query time series X𝑞 and a similarity measure 𝑓𝑠 (·, ·), TSR finds an ordered list Q = {X𝑖 }𝐾𝑖=1 of time series in the given dataset or database, containing the 𝐾 time series most similar to X𝑞 according to 𝑓𝑠 .
Note that the above definitions are generally based on the raw time series X. Following Defini-
tion 2.3, we can use the corresponding representation Z = 𝑓𝑒 (X) to perform the above downstream
tasks instead of the raw data.
2.2.1 Temporal Dependency. Time series exhibit dependencies on the time variable, where a data point at a given time correlates with its previous values. Given an input x𝑡 at time 𝑡, the model predicts 𝑦𝑡 , but the same input at a later time could yield a different prediction. Therefore, windows
or subsequences of past observations are usually included as inputs to the model to learn such
temporal dependency. The length of windows for capturing the time dependencies could also be
unknown. In addition, there are local and global temporal dependencies. The former is usually
associated with abrupt changes or noises, while the latter is associated with collective trends or
recurrent patterns.
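A minimal sketch of the sliding-window construction described above; the window length and stride are illustrative choices.

```python
import numpy as np

def sliding_windows(x: np.ndarray, window: int, stride: int = 1) -> np.ndarray:
    """Turn a (T, V) series into overlapping (window, V) subsequences."""
    starts = range(0, len(x) - window + 1, stride)
    return np.stack([x[s:s + window] for s in starts])

x = np.random.randn(100, 3)                          # T = 100, V = 3
windows = sliding_windows(x, window=24, stride=12)   # shape (7, 24, 3)
```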
2.2.2 High Noise and Dimension. Time-series data, especially in real-world environments, often contains noise and has high dimensionality. Such noise usually arises from measurement errors or other sources of uncertainty. Dimensionality reduction techniques and wavelet transforms can address this issue by filtering out some of the noise and reducing the dimension of the raw time series.
Nevertheless, we may lose valuable information and need domain-specific knowledge to select the
suitable dimensionality reduction and filtering techniques.
2.2.4 Variability and Nonstationarity. Time-series data also possess variability and nonstationarity
properties, meaning that statistical characteristics, such as mean, variance, and frequency, change
over time. These changes usually reveal seasonal patterns, trends, and fluctuations. Here, seasonality
refers to repeating patterns that regularly appear, while trends describe long-term changes or shifts
over time. In some cases, the change in frequency is so relevant to the task that it is more beneficial
to work in the frequency domain than in the time domain.
2.2.5 Diverse Semantics. In contrast to image and text data, learning universal representations
of time series is challenging due to the lack of a large-scale unified semantic time-series dataset.
For instance, each word in a text dataset has similar semantics in different sentences with high
probability. Accordingly, the word embeddings learned by the model can transfer across different
scenarios. For time-series datasets, however, it is challenging to obtain subsequences (corresponding to words in text sequences) that have consistent semantics across scenarios and applications, making it difficult to transfer the knowledge learned by the model. This property also makes time-series
annotation tricky and challenging, even for domain experts.
map connects to a region of neighboring neurons in the previous layer called the receptive field.
The feature maps can be created by convolving the inputs with learned kernels and applying an
element-wise nonlinear activation to the convolved results. Here, all spatial locations of the input
share the kernel for each feature map, and several kernels are used to obtain the entire feature map.
Many improvements have been made to CNN, such as using deeper networks, applying smaller and
more efficient convolutional filters, adding pooling layers to reduce the resolution of the feature
maps, and utilizing batch normalization to improve the training stability. As standard CNNs are
designed for processing images, widely used CNN architectures for time series are one-dimensional
CNN and temporal convolutional networks [25].
Temporal Convolutional Networks (TCN). Different from standard CNNs, TCN [137] uses a fully convolutional network so that all layers have the same length and employs causal convolution operations to avoid information leakage from future time steps to the past. Compared to RNN-based models, TCN has recently been shown to be more accurate, simpler, and more efficient across diverse downstream tasks [27].
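A minimal PyTorch sketch of the causal convolution idea (left-padding so that the output at time 𝑡 depends only on inputs up to 𝑡); the dilation factor is arbitrary, and the residual blocks of a full TCN are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that never looks at future time steps."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, T)
        x = F.pad(x, (self.left_pad, 0))   # pad only on the left (the past)
        return self.conv(x)

x = torch.randn(8, 3, 100)                                # 8 series, V = 3, T = 100
y = CausalConv1d(3, 64, kernel_size=3, dilation=2)(x)     # (8, 64, 100)
```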
2.3.4 Graph Neural Networks (GNN). GNN [22] aims to learn directly from graph representations
of data. A graph consists of nodes and edges, with each edge connecting two nodes. Both nodes and
edges can have associated features. Edges can be directed or undirected and can be weighted.
Graphs better represent data that is not easily represented in Euclidean space, such as spatio-temporal data from electroencephalograms or traffic monitoring networks. GNNs receive the graph
structure and any associated node and edge attributes as input. Typically, the core operation in
GNNs is graph convolution, which involves exchanging information across neighboring nodes.
This operation enables the GNN-based model to explicitly rely on the inter-variable dependencies
represented by the graph edges for processing multivariate time-series data. While both RNNs and
CNNs perform well on Euclidean data, time series are often more naturally represented as graphs
in many scenarios. Consider a network of traffic sensors where the sensors are not uniformly
spaced, deviating from a regular grid. Representing this data as a graph captures its irregularity
more precisely than a Euclidean space. However, using standard deep learning algorithms to learn
from graph structures is challenging as nodes may have varying numbers of neighboring nodes,
making it difficult to apply the convolution operation. Thus, GNNs are more suitable for graph data.
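A bare-bones sketch of the graph convolution (neighborhood aggregation) operation mentioned above, using a symmetrically normalized adjacency matrix with self-loops; this is one common formulation, not the operator of any particular reviewed method.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One round of neighborhood aggregation: H' = ReLU(A_norm @ H @ W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Add self-loops and symmetrically normalize the adjacency matrix.
        a = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a.sum(dim=1).pow(-0.5)
        a_norm = d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)
        return torch.relu(a_norm @ self.linear(h))

# Example: 5 sensors (nodes), each with a 16-dimensional feature vector.
adj = (torch.rand(5, 5) > 0.5).float()
h = torch.randn(5, 16)
h_next = GraphConv(16, 32)(h, adj)   # (5, 32)
```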
2.3.5 Attention-based Networks. The attention mechanism was introduced by Bahdanau et al. [138]
to improve the performance of encoder-decoder models in machine translation. The encoder encodes
a source sentence into a vector in latent space, and the decoder then decodes the latent vector
into a target language sentence. The attention mechanism enables the decoder to pay attention
to the segments of the source for each target through a context vector. Attention-based neural
networks are designed to capture long-range dependencies with broader receptive fields, usually
lacking in CNNs and RNNs. Thus, attention-based models provide more contextual information
to enhance the models’ learning and representation capability. The underlying attention mechanism makes the model focus on essential features in the input while suppressing unnecessary ones. For instance, it can be used to enhance LSTM performance in many applications by assigning attention scores to LSTM hidden states to determine the importance of each state in the final prediction. Moreover, the attention mechanism can improve
the interpretability of the model. However, it can be more computationally expensive due to the
large number of parameters, making it prone to overfitting when the training data is limited.
Self-Attention Module. Self-attention has been demonstrated to be effective in various natural
language processing tasks due to its ability to capture long-term dependencies in text [139]. The
self-attention module is usually embedded in encoder-decoder models to improve the model
performance and leveraged in many studies to replace the RNN-based models to improve learning
efficiency due to its fast parallel processing.
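A compact sketch of single-head scaled dot-product self-attention over a length-𝑇 sequence of feature vectors; the multi-head split and feed-forward parts of a full Transformer block are omitted.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention: (batch, T, d) -> (batch, T, d)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))   # (batch, T, T)
        weights = torch.softmax(scores, dim=-1)    # how much each step attends to the others
        return weights @ v

x = torch.randn(8, 100, 64)          # 8 series, T = 100, d_model = 64
out = SelfAttention(64)(x)           # (8, 100, 64)
```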
Transformers. The unprecedented performance of stacking multi-headed attention, called Trans-
formers [140], has led to numerous endeavors to adapt multi-headed attention to time-series data.
Transformers for time series usually contain a simple encoder structure consisting of multi-headed
self-attention and feed-forward layers. They integrate information from data points in the time
series by dynamically computing the associations between representations with self-attention.
Thanks to the practical capability to model long-range dependencies, Transformers have shown
remarkable performance on sequence data. Many recent studies use Transformers as the backbone architecture for time-series analysis [21].
2.3.6 Neural Ordinary Differential Equations (Neural ODE). Let 𝑓𝜃 be a function specifying the continuous dynamics of a hidden state h(𝑡) with parameters 𝜃 . Neural ODEs are continuous-time models that define the hidden state h(𝑡) as the solution to the ODE initial-value problem 𝑑h(𝑡)/𝑑𝑡 = 𝑓𝜃 (h(𝑡), 𝑡) with h(𝑡0 ) = h0 . The hidden state h(𝑡) is defined at all time steps and can be evaluated at any desired time steps using a numerical ODE solver. Formally, h0, . . . , h𝑇 = ODESolver(𝑓𝜃 , h0, (𝑡0, . . . , 𝑡𝑇 )).
For training ODE-based deep learning models using black-box ODE solvers, we can use the adjoint
sensitivity method to compute memory-efficient gradients w.r.t. the neural network parameters 𝜃 ,
as described by Rubanova et al. [23]. Neural ODEs are usually combined with RNN or its variants to
sequentially update the hidden state at observation times [141]. These models provide an alternative
recurrence-based solution with better properties than traditional RNNs in terms of their ability to
handle irregularly sampled time series.
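A minimal sketch of this idea using the torchdiffeq library's odeint solver; the dynamics network and the evaluation times are illustrative, and a latent-ODE encoder such as ODE-RNN [23] would additionally combine this with a recurrent update at observation times.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint   # pip install torchdiffeq

class ODEFunc(nn.Module):
    """f_theta(h, t): parameterizes the continuous dynamics dh/dt."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

h0 = torch.randn(8, 16)                    # initial hidden states for a batch of series
t = torch.tensor([0.0, 0.3, 1.1, 2.7])     # irregular observation times
h_t = odeint(ODEFunc(16), h0, t)           # (4, 8, 16): hidden state evaluated at each time
```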
learning for classification and unsupervised learning for retrieval. MTRL jointly optimizes the
two downstream tasks via a combination of deep wavelet decomposition networks to extract
multi-scale subseries and 1D-CNN residual networks to learn time-domain features, thus improving
the performance of downstream tasks. Unlike earlier studies using CNNs, TST [15] is the first
Transformer-based framework for representation learning of multivariate time series. TST combines
a base Transformer network with an additional input encoder and a learnable positional encoding to
make the multivariate time series work seamlessly with the Transformer encoder. Specifically, the
paper argues that the multi-headed attention mechanism makes the Transformer models suitable
for time-series data by concurrently representing each input sequence element using its future-
past contexts, while multiple attention heads can consider different representation subspaces, i.e.,
multiple aspects of relevance between input elements.
More recently, MARINA [43] combines an MLP network with residual connections and a graph
attention network (an attention-based variant of GNNs) to form a temporal module and a spatial
module, respectively. The temporal module efficiently captures temporal correlations in the time
series; at the same time, the spatial module learns the similarities among time series to enhance the
downstream performance. To enhance the representation learning process even further, Yang and
Hong [17] propose an unsupervised representation learning framework for time series, named BTSF.
It enhances the representation quality through the more reasonable construction of contrastive pairs
and the adequate integration of temporal and spectral information. BTSF constitutes an iterative
application of a novel bi-linear temporal-spectral fusion, explicitly encoding affinities between
time and frequency pairs. To adequately use the informative affinities, BTSF further uses a cross-
domain interaction with spectrum-to-time and time-to-spectrum aggregation modules to iteratively
refine temporal and spectral features for cycle update, proven effective by empirical and theoretical
analysis. Kim et al. [40] introduce another MLP-based feature-wise encoder together with an element-wise gating layer built on top of TS2Vec [16], i.e., a feature-agnostic temporal representation using a TCN. Choi and Kang [45] also extend the TS2Vec encoder [16] with a multi-task self-supervised learning framework by combining contextual, temporal, and transformation consistencies into a single network. Based on the existing TCN architecture, Fraikin et al. [51] attach a time-embedding
module to explicitly learn time-related features, such as trend, periodicity, and distribution shifts.
Liang et al. [50] design UniTS, a universal self-supervised pre-training framework that uses a pre-training module consisting of templates from various self-supervised learning methods.
Subsequently, the pre-trained representations are fused, and the results of feature fusion are applied
to task-specific output models.
Another line of studies use Neural ODEs to handle irregular time series. ODE-RNN [41] is an
early attempt applying Neural ODEs for time series. ODE-RNN serves as an encoder in the latent
ODE model, facilitating interpolation, extrapolation, and classification tasks for irregular time
series. In recent work, ANCDEs [37] and EXIT [39] propose neural controlled differential equation
(Neural CDE)-based approaches for time-series classification and forecasting with irregular time
series. ANCDEs leverages two Neural CDEs, one for learning attention from the input time series
and another for creating representations for downstream tasks. Likewise, EXIT utilizes Neural
CDEs as part of the encoder-decoder network, enabling interpolation and extrapolation of irregular
time-series data. Additionally, CrossPyramid [38] addresses a limitation of ODE-RNN, namely its high dependence on the initial observations, by using pyramid attention and cross-level ODE-RNN.
For video learning, T-C3D [36] incorporates a residual 3D-CNN that captures both appearance
information from single frames and motion information between consecutive frames. This multi-
granularity learning allows for a more comprehensive understanding of video actions. MemDPC [44]
is a framework of video representation using memory-augmented dense predictive coding. It is
trained on a predictive attention mechanism over the set of compressed memories. Rahman et al.
[46] introduce a tri-modal VilBERT-inspired model by integrating separate encoders for vision
modality, pose modality, and audio modality into a single network.
3.1.2 Module-Level Combination. Audio Word2vec [54] extends vector representations to consider the sequential phonetic structures of audio segments, trained with a speaker-content disentanglement-based segmental sequence-to-sequence autoencoder. Guo et al. [57] design SSAN with a
separable self-attention module to capture spatial and temporal correlations in videos separately.
Spatial self-attention is applied independently to input frames. The spatial contextual information
is then aggregated over the temporal dimension and sent to a temporal attention module.
Instead of learning representations from raw time series, DelTa [52] uses 2D images of time series such that existing models pre-trained on large image datasets can be easily used. DelTa proposes two versions of using pre-trained vision models: a layout-aligned version and a layout-independent version. UniTTab [56] is a Transformer-based framework for time-dependent heterogeneous tabular
data. UniTTab uses row-type dependent encoders and different feature representation methods for
categorical and numerical data, respectively. Wu et al. [31] introduce TimesNet, which reshapes the
1D time series into a 2D tensor using a newly proposed neural network module (called TimesBlock).
TimesBlock combines a Fast Fourier Transform layer with a parameter-efficient Inception block. In
particular, it initially extracts periods using the Fourier transform and reshapes the data into 2D.
Subsequently, TimesNet utilizes the parameter-efficient Inception network to extract representations
from the 2D image-like data, which are later restored to 1D time series. Instead of training the
Transformer models from scratch, One Fits All [55] uses a pre-trained language model (e.g., GPT-2)
by freezing self-attention and feed-forward layers and fine-tuning the remaining layers. Input
embedding and normalization layers are also modified for time-series data. By doing so, it can
benefit from the universality of the Transformer models on time-series data. Nguyen et al. [53]
propose CoInception by integrating the dilated CNN into the Inception block to build a scalable
and robust neural architecture with a wide receptive field. CoInception uses multi-scale filters to
capture the temporal dependency at multiple scales and resolutions. Specifically, it incorporates a
novel convolution-based aggregator and extra skip connections in the Inception block to enhance
the capability for capturing long-term dependencies in the input time series.
Remark. The evolving landscape of neural architecture design for time-series data showcases a
blend of creative combinations at both the network and module levels. Researchers have integrated
various techniques, from wavelet decomposition to attention-based networks, towards enhanc-
ing efficiency, scalability, and performance. These endeavors expand the horizons of time-series
representation and underscore the importance of adaptability in deep neural networks.
representation learning, focusing on sound and image data. This memory system allows for the
association of features between different modalities, even when data pairs are weakly paired or
unpaired. DTS [69] is a disentangled representation learning framework for semantic meanings
and interpretability through two disentangled components. The individual factor disentanglement
extracts different semantically independent factors. The group segment disentanglement operates on a batch of segments to enhance group-level semantics.
Unlike others, HyperTime [67] introduces an implicit neural representation for time series, used for imputation and reconstruction, by taking timestamps as input and outputting the original time series. It consists of two networks, a Set Encoder and a HyperNet Decoder. From another
perspective, Zhang et al. [59] redesign standard multi-layer encoder-decoder sequence models
to learn time-series processes as state-space models via the companion matrix and a new closed-
loop view of the state-space models, resulting in a new module named SpaceTime layer, which
consists of 1D-CNN and feed-forward networks. Multiple SpaceTime layers are stacked to form the
final encoder-decoder architecture. HierCorrPool [68] is a multivariate time-series representation
learning framework that captures both hierarchical correlations and dynamic properties by using
a novel hierarchical correlation pooling scheme and sequential graphs. Liang et al. [58] propose
CSL, a representation learning framework based on a unified shapelet-based encoder with multi-
scale alignment that transforms raw multivariate time series into a set of shapelets and learns the representations using them. COMET [72] consists of observation-level, sample-level, trial-level, and
patient-level contrastive blocks to represent multiple levels of medical time series. These blocks help
to capture cross-sample features, making the learned representations more robust. To better model
multi-scale temporal patterns and inter-channel dependencies, MSD-Mixer [75] employs MLPs
along different dimensions to mix intra- and inter-patch variations (generated from a novel multi-
scale temporal patching approach) and channel-wise correlations. It learns to explicitly decompose
the input time series into different components by generating the component representations in
different layers, and accomplishes the analysis task based on such representations.
To handle irregular time series, the temporal kernelized autoencoder [60] is proposed to learn
representations aligned to a kernel function designed for handling missing values. Continuous
recurrent units [61] update hidden states based on linear stochastic differential equations, which
are solved by the continuous-discrete Kalman filter. Based on continuous-discrete filtering theory,
Ansari et al. [63] introduce a neural continuous-discrete state space model for continuous-time modeling of irregularly-sampled time series. Biloš et al. [62] suggest a representation learning
approach based on denoising diffusion models adapted for irregularly-sampled time series with
complex dynamics. By gradually adding noise to the entire function, the model can effectively
capture underlying continuous processes.
3.2.2 Module-Level Design. LIME-RNN [77] introduces a weighted linear memory vector into
RNNs for time-series imputation and prediction. Following TST [15], Chowdhury et al. [82] propose
TARNet, a new representation learning model based on Transformers. TARNet aims to recon-
struct important timestamps using a newly designed masking layer to improve downstream task
performance. It decouples data reconstruction from the downstream task and, instead of random masking, uses a data-driven masking strategy: the self-attention score distribution generated by the Transformer encoder during downstream task training determines the set of important timestamps to mask. As a result, the reconstruction process becomes task-aware. Another recent
method using the wavelet transform, called WHEN [78], newly designs two types of attention
modules, i.e., WaveAtt and DTWAtt modules. In the WaveAtt module, the study proposes a novel
data-dependent wavelet function to analyze dynamic frequency components in non-stationary
time series. In the DTWAtt module, WHEN transforms the dynamic time warping (DTW) technique
into the form of the DTW attention. Here, all input sequences are synchronized with a universal
parameter sequence to overcome the time distortion problem in multiple time series. Then, the
outputs from the new modules are further combined with task-dependent neural networks to
perform the downstream tasks, such as classification and forecasting. Due to the proliferation of
edge devices, another recent study [79] proposes a novel model compression technique to make
lightweight Transformers for multivariate time-series problems using network pruning, weight
binarization, and task-specific modification of attention modules that can substantially reduce
both model size (i.e., # of parameters) and computational complexity (i.e., # of multiply-adds or
FLOPs). These compressed neural networks have the potential to enable DL-based models across
new applications and smaller computational environments. The paper also demonstrates that the
compressed Transformers using the proposed technique achieve comparable accuracy to their
original counterparts (i.e., Dense Transformers [15]) despite the substantial reduction in model size
and computational complexity. In addition, it claims that this is the first work applying sparse and
binary-weighted Transformers to multivariate time-series analysis, including forecasting, classifi-
cation, and anomaly detection. NuTime [85] introduces a window-wise data embedding for time-series pre-training. This embedding module consists of a normalized shape embedding and a numerically multi-scaled embedding based on the window mean and window standard deviation.
A few studies in this group also propose methods to handle irregularly-sampled time series.
mTAN [80] learns the representation of continuous time values by applying the attention mechanism
to irregularly observed time series. TE-ESN [65] employs a requisite time encoding mechanism to
acquire knowledge from irregular time series data, with the representation being learned within
echo state networks. Naour et al. [84] introduce TimeFlow to deal with modeling issues such as
irregular time steps. With a hyper-network, the framework modulates an implicit neural representation, i.e., a parameterized continuous function, over multiple time series.
Unlike methods for standard time-series data, Sener et al. [81] suggest a flexible multi-granular temporal aggregation framework with non-local blocks, a coupling block, and a temporal aggregation block to reason over current and past observations in long-range videos. TCGL [83]
is a self-supervised learning framework to consider multi-scale temporal dependencies within
videos. TCGL is trained on a temporal contrastive graph by jointly modeling the inter-snippet and
intra-snippet temporal dependencies through a hybrid graph contrastive learning strategy.
Remark. The methods discussed in this subsection represent a diverse range of innovative ap-
proaches to designing neural architectures for time-series representation learning, with a focus on
either network-level or module-level design. These approaches encompass novel techniques such
as random warping series, hierarchical correlation pooling, disentangled representation learning,
and continuous-discrete state space models. Additionally, the application of Transformers in rep-
resentation learning showcases the adaptability of attention mechanisms to time-series analysis.
The exploration of model compression techniques for lightweight Transformers addresses the chal-
lenges posed by edge devices. Overall, these advancements contribute significantly to enhancing
the interpretability, efficiency, and performance of representation learning for diverse time series.
4 LEARNING-FOCUSED APPROACHES
Studies in this category focus on devising novel objective functions or pretext tasks used for the
representation learning process, i.e., model training. The learning objectives can be categorized into
supervised, unsupervised, or self-supervised learning, depending on the use of labeled instances. In
our survey, the difference between unsupervised and self-supervised learning is the presence of
pseudo labels. Specifically, unsupervised learning is based on the reconstruction of its input, while
self-supervised learning uses pseudo labels from pretext tasks as self-supervision signals.
More recently, Tonekaboni et al. [88] present a generative method by employing variational
approximation to decouple local and global representations of time-series data with counterfactual
regularization that minimizes the mutual information between the local and global variables. Ti-
MAE [49], a masked autoencoder framework, addresses the distribution shift problem by learning
strong representations with less inductive bias or hierarchical trick. Its masking strategy creates
different views for the encoder in each iteration, fully leveraging the whole input time series during
training. Similarly, Dong et al. [90] propose SimMTM, another masked autoencoder. SimMTM
reconstructs the original time series from multiple masked time series with series-wise similarity
learning and point-wise aggregation to implicitly reveal the local structure of the manifold. To do so,
it introduces a neighborhood aggregation design for reconstruction by aggregating the point-wise
representations of time series based on the similarities learned in the series-wise representation
space. Thus, the masked time points are recovered by weighted aggregation of multiple neighbors
outside the manifold, allowing SimMTM to assemble complementary temporal variations from
multiple masked time series and improve the quality of the reconstructed time series. In addition
to the reconstruction loss, a constraint loss is proposed to guide the series-wise representation
learning based on the neighborhood assumption of the time series manifold.
4.2.2 Masked Prediction. Masked prediction primarily focuses on predicting the masked segment of
the input rather than reconstructing the original one. Therefore, encoder/decoder-only architectures
are effective in addressing masked prediction.
To achieve unsupervised learning, TST [15] adopts an encoder-only Transformer network with a masked prediction (denoising) objective. The losses of TST are computed only from the masked parts or timestamps. Specifically, TST leverages unlabeled time series by training the
Transformer encoder to extract dense vector representations of multivariate time series using the
denoising objective on randomly masked input time series. Similar to TST, TARNet [82] improves
downstream task performance by learning to reconstruct important timestamps using a data-driven
masking strategy. The masked timestamps are determined by self-attention scores during the
downstream task training. This reconstruction process is trained alternately with the downstream
task at every epoch by sharing parameters in a single network. It also enables the model to learn
the representations via task-specific reconstruction, which results in improved downstream task
performance. For a masked token pretext task, UniTTab [56] uses row and timestamp masking
with neighborhood label smoothing on the time-dependent heterogeneous tabular data during the
pre-training stage to address the heterogeneity problems.
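As a small illustration of the denoising objective shared by these methods, the sketch below computes a reconstruction loss only at randomly masked time steps; the encoder and head are placeholders for any backbone, and the masking ratio is an arbitrary choice (TARNet, for instance, would replace the random mask with one driven by self-attention scores).

```python
import torch
import torch.nn as nn

def masked_prediction_loss(encoder: nn.Module, head: nn.Module,
                           x: torch.Tensor, mask_ratio: float = 0.15) -> torch.Tensor:
    """Mask random time steps, predict them, and score only the masked positions."""
    mask = torch.rand(x.shape[:2]) < mask_ratio          # (batch, T), True = masked
    x_masked = x.masked_fill(mask.unsqueeze(-1), 0.0)    # zero out masked time steps
    pred = head(encoder(x_masked))                       # (batch, T, V) reconstruction
    return ((pred - x) ** 2)[mask].mean()                # loss over masked steps only

# Example with placeholder modules (batch = 8, T = 100, V = 3).
encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64))
head = nn.Linear(64, 3)
loss = masked_prediction_loss(encoder, head, torch.randn(8, 100, 3))
```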
Remark. Unsupervised representation learning, unlike its supervised counterpart, trains an encoder
without relying on labeled data, making it more practical for various datasets and applications. A
range of methods employ techniques like reconstruction loss and masked prediction to extract
meaningful features from data. These approaches underscore the versatility and potential of
unsupervised learning in capturing intricate patterns and structures from diverse datasets.
labeling cost, there is a surge in the popularity of self-supervised learning algorithms. Following
the impressive performance of contrastive learning in computer vision, contrasting loss has gained
attention in time-series analysis, demonstrating excellent performance. Here, we review both
non-contrastive and contrastive self-supervised learning methods.
4.3.1 Non-Contrasting Loss. Non-contrasting loss exploits the inherent structures, relationships,
or patterns in the data as effective supervision signals for training, for example, making the model predict whether a given segment comes before or after a reference time-series segment. Haresh et al. [92] leverage a soft dynamic time warping loss between two videos to learn representations in an unsupervised manner for temporal alignment. Wang et al. [95] propose a
novel pretext task to address the self-supervised video representation learning using motion-aware
curriculum learning. Several spatial partitioning patterns are employed to encode rough spatial
locations instead of exact spatial Cartesian coordinates. In recent work, Liang et al. [93] introduce
three novel pretext tasks, including (1) clip continuity prediction, (2) discontinuity localization,
and (3) missing section approximation, based on video continuity by cutting a clip into three
consecutive short clips. CACL [91] aims to capture temporal details of a video by learning to predict
an approximate Edit distance between a video and its temporal shuffle. In addition, contrastive
learning is performed by generating four positive samples from two different encoders, 3D CNN
and video transformer, as well as two different augmentations. Duan et al. [96] introduce TransRank
with a pretext task aiming at assessing the relative magnitude of transformations. It allows the
model to capture inherent characteristics of the video, such as speed, even when the challenge of
matching the transformations across videos varies. Fang et al. [97] propose a multivariate time-
series modeling method to capture the spatial relation shared across all instances. A prior graph is learned to minimize the distance to the dynamic graphs. On the other hand, to capture the instance-specific spatial relation, a dynamic graph is learned to maximize the distance to the prior graph. Another recent work, T-Rep [51], is claimed to be the first self-supervised framework for time series to leverage a time-embedding module in its pretext tasks, which enables the model
to learn fine-grained temporal dependencies, giving the latent space a more coherent temporal
structure than existing methods.
4.3.2 Contrasting Loss. Contrastive learning involves learning to differentiate between positive and
negative samples, assuming that the positive samples should have similar representations among
each other while being different from the negative ones. Therefore, the design of the contrastive
loss and the selection of positive and negative samples are crucial for contrasting-based methods.
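A minimal sketch of one common instantiation of this idea, an InfoNCE-style loss over two augmented views of each series in a batch; the temperature and the way the views are produced are assumptions rather than the loss of any specific reviewed method.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss: z_a[i] and z_b[i] are positives, all other pairs are negatives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature       # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Example: representations of two augmented views of the same 32 series.
z_view1, z_view2 = torch.randn(32, 128), torch.randn(32, 128)
loss = info_nce(z_view1, z_view2)
```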
Franceschi et al. [47] employ the triplet loss (i.e., T-Loss) inspired by Word2Vec [143] for learning
scalable representations of multivariate time series. It considers a sub-segment belonging to the
input time segment as a positive sample to explore sub-series consistency. From another perspective,
Eldele et al. [123] introduce a self-supervised time-series representation learning framework via
temporal and contextual contrasting (TS-TCC). Time series are first transformed into two different
yet correlated views by weak and strong augmentations. Then, TS-TCC uses a novel temporal
contrasting module to learn temporal dependencies by designing a hard cross-view prediction
task that utilizes the past latent features of one augmentation to predict the future of another
augmentation for a certain time step. This operation forces the model to learn robust representation
by a harder prediction task against any perturbations introduced by different time steps and
augmentations. Last, it proposes a contextual contrasting module built upon the contexts from the
temporal contrasting module to further learn discriminative representations by maximizing the
similarity among different contexts of the same sample while minimizing similarity among contexts
of different samples. TS-TCC shows high efficiency in few-labeled data and transfer learning
scenarios. TNC [118] introduces a contrastive learning framework for complex multivariate non-
stationary time series. This approach aims to capture the progression of the underlying temporal
dynamics. It takes advantage of the local smoothness of a signal’s generative process to define
neighborhoods in time with stationary properties. The neighborhood boundaries are determined
automatically using the properties of the signal and augmented Dickey-Fuller statistical testing.
Additionally, TNC learns the representations by ensuring that the distribution of signals from
within a neighborhood is distinguishable from the distribution of non-neighboring signals in the
latent space using a debiased contrastive loss. This loss incorporates positive unlabeled learning for
sample weight adjustment to account for potential bias introduced in sampling negative examples.
Yue et al. [16] propose TS2Vec, a universal framework for learning representations of time series at an arbitrary semantic level. TS2Vec is based on novel contextual consistency learning
using two contrastive policies over two augmented time segments with different contexts from
randomly overlapped subsequences. Unlike other methods, TS2Vec performs contrastive learning in
a hierarchical way over the augmented context views, making a robust contextual representation for
each timestamp. Its overall representation can be obtained by max pooling over the corresponding
timestamps. By using multi-scale contextual information, this approach can capture multi-resolution contextual information with both temporal and instance-wise contrastive losses for the given time series and generate fine-grained representations for any
granularity. Similar to BTSF [17], Zhang et al. [109] introduce a strategy for self-supervised pre-
training in time series by modeling time-frequency consistency (TF-C). This study argues that
time-based and frequency-based representations, learned from the same time series, should be
closer to each other in the time-frequency latent space than representations of different time
series. TF-C aims to minimize the distance between time-based and frequency-based embeddings
using a novel consistency loss. TimeCLR [116] is a self-supervised contrastive learning framework,
enabling the feature extractor to learn invariant representations by maximizing the similarity between two augmented views of the same sample. Hajimoradlou et al. [110] introduce a self-
supervised framework with similarity distillation along the temporal and instance dimensions for
pre-training universal representations. For robot sensor data, Somaiya et al. [117] propose a simple
representation learning framework, TS-Rep, based on T-Loss [47]. TS-Rep accommodates real-world robotics scenarios by handling varying-length time series without any padding techniques.
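As an illustration of contrasting at multiple temporal scales, the following sketch max-pools per-timestamp representations of two augmented views and accumulates a per-timestamp contrastive term at each pooled resolution. It is loosely inspired by the hierarchical contrasting idea of TS2Vec [16] and is not the authors' implementation; the shapes and pooling schedule are assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_contrast(z1, z2, temperature=0.1):
    """Per-timestamp contrast between two views z1, z2 of shape (B, T, D):
    the same timestamp across views is the positive, other timestamps are negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = torch.einsum('btd,bsd->bts', z1, z2) / temperature  # (B, T, T)
    labels = torch.arange(sim.size(1), device=sim.device).expand(sim.size(0), -1)
    return F.cross_entropy(sim.reshape(-1, sim.size(-1)), labels.reshape(-1))

def hierarchical_contrast(z1, z2, temperature=0.1):
    """Apply the temporal contrast at progressively coarser scales by max pooling
    the per-timestamp representations along the time axis."""
    loss, depth = 0.0, 0
    while z1.size(1) > 1:
        loss = loss + temporal_contrast(z1, z2, temperature)
        depth += 1
        # Halve the temporal resolution (channels-last -> channels-first for pooling).
        z1 = F.max_pool1d(z1.transpose(1, 2), kernel_size=2).transpose(1, 2)
        z2 = F.max_pool1d(z2.transpose(1, 2), kernel_size=2).transpose(1, 2)
    return loss / max(depth, 1)

loss = hierarchical_contrast(torch.randn(4, 64, 32), torch.randn(4, 64, 32))
```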
CSL [58] proposes to learn shapelet-based time-series representations with multi-grained contrasting and multi-scale alignment to capture information in various time ranges. Nguyen et al.
[53] design a novel contrastive loss function by combining ideas from hierarchical loss [16] and
triplet loss. FEAT [40] uses a combination of contrastive losses to jointly learn both feature-based
consistency and temporal consistency by using hierarchical temporal contrasting loss, feature
contrasting loss, and reconstruction loss. Choi and Kang [45] introduce an uncertainty-weighting approach that weighs multiple contrastive losses by considering the homoscedastic uncertainty of multiple tasks, including contextual, temporal, and transformation consistencies. FOCAL [120] enforces
the modality consistency to learn the features shared across modalities and the transformation
consistency to learn modality-specific features. Additionally, to accommodate sporadic deviations from locality due to periodic patterns, temporally close and distant sample pairs are constrained by a loose ranking loss. COMET [72] composes contrastive losses at four levels for medical time series and balances these multiple losses with a hyperparameter coefficient for each term. TS-CoT [121] integrates contrastive learning into a multi-view
time series representation learning framework, emphasizing the importance of global consistency
and complementary information to enhance robustness against noise. TEST [122], an embedding method grounded in contrastive learning, enables large language models (LLMs) to handle time-series analysis tasks effectively by leveraging the intrinsic information of time series and avoiding the need for extensive pre-defined knowledge or annotations.
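One common way to realize the uncertainty weighting used by Choi and Kang [45] is a homoscedastic-uncertainty formulation with learnable log-variance terms, sketched below. This is an illustrative, simplified variant under that assumption, not the authors' exact code.

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Weigh multiple losses with learnable homoscedastic-uncertainty terms:
    total = sum_i exp(-s_i) * L_i + s_i, where s_i = log(sigma_i^2)."""

    def __init__(self, num_losses):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_losses))

    def forward(self, losses):
        losses = torch.stack(losses)
        return torch.sum(torch.exp(-self.log_vars) * losses + self.log_vars)

# Toy usage: contextual, temporal, and transformation-consistency losses.
weighting = UncertaintyWeighting(num_losses=3)
total = weighting([torch.tensor(0.8), torch.tensor(1.2), torch.tensor(0.5)])
```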
Many studies also employ contrastive learning to learn representations of irregular time-series
data. For instance, TimeAutoML [115] adopts the AutoML framework, enabling automated configu-
ration and hyperparameter optimization. This approach utilizes a contrastive learning framework
where negative samples are created by introducing random noise within the range defined by
the minimum and maximum values of the given instances. Another work named PrimeNet [107]
adopts a contrastive learning approach by generating augmented samples based on the observation
density and employs a reconstruction task to facilitate the learning of irregular patterns.
For video representations, Morgado et al. [104] leverage 360° video data with strong spatial cues and conduct audio-visual spatial alignment as a pretext task. This pretext task involves
spatially misaligned audio and video clips, treated as negative examples for contrastive learning.
MemDPC [44] constitutes a self-supervised learning approach, specifically emphasizing represen-
tations for action recognition. Within this framework, training MemDPC involves a predictive
attention mechanism applied to a collection of compressed memories. This training paradigm
ensures that any subsequent states can be synthesized consistently through a convex combination
of the condensed representations. Wang et al. [94] suggest a pretext task that classifies the pace of
a clip into five categories: super slow, slow, normal, fast, and super fast. Contrastive learning is
adopted by selecting positive pairs from clips of the same video and negative pairs from clips of different videos. CCL [99] uses the inclusion relation between a video and its frames as a contrastive learning strategy: a video and its constituent frames are learned to be close to each other in the video and frame embedding spaces. In addition to the non-contrastive loss,
Haresh et al. [92] jointly use a temporal regularization term (i.e., Contrastive-IDM) to encourage
different frames to be mapped to different points in the embedding space. Jenni and Jin [114]
propose a time-equivariant contrastive learning model in which a pair of clips is the unit of contrastive learning: if two pairs share the same temporal transformation within each pair, the concatenated clip outputs of the two pairs are encouraged to be similar. Auxiliary tasks are also suggested, such as classifying the temporal difference between two clips (e.g., predicting that clip A is played at 2x speed and clip B at a slower speed). CVRL [111] is a contrastive learning framework that uses temporally consistent spatial augmentation and a clip selection strategy, where each frame is spatially augmented. Here, two clips from the same video are positives, while two clips from different videos are negatives. RSPNet [108] solves the relative speed perception task
and the appearance-focused video instance discrimination task with self-supervised representation
learning by using a triplet loss and the InfoNCE loss.
Concerning other modalities, Lee et al. [71] employ contrastive learning to effectively associate the sound and image modalities with limited paired data, enhancing representation learning by encouraging similarity in positive pairs and dissimilarity in negative pairs. CSTP [98] intro-
duces a pretext task for video representation called spatio-temporal overlap rate prediction, which considers the intermediate states between positive and negative pairs in contrastive learning. With a joint optimization combining pretext
tasks with contrastive learning, CSTP enhances the spatio-temporal representation learning for
downstream tasks. DCLR [100] presents a dual contrastive formulation for video representation by
decoupling the input RGB video sequence into static scene and dynamic motion. TCLR [112] is a
contrastive learning framework for video understanding tasks. It is trained with novel local–local
and global–local temporal contrastive losses. Chen et al. [101] propose sequence contrastive loss to
sample two subsequences with an overlap for each video. The overlapped timestamps are consid-
ered positives, while the clips from other videos are negatives. Two timestamps neighboring each
other also become positive pairs, with a Gaussian weight that decays with their temporal distance. Qian et al. [106] propose a windowing-based contrastive learning approach by sampling a long clip from a video and a short clip that lies within the duration of the long clip. These long and short clips become a positive pair for contrastive learning, and other long clips become negative instances.
For contrastive learning, final vectors are produced in two different embedding spaces: a fine-grained space, where an embedding is produced at each timestamp, and a persistent embedding space, where the timestamp embeddings are global-average pooled. Qing et al. [103] conduct representation learning on untrimmed
video to reduce the amount of labor required for manual trimming and utilize the rich semantics of
untrimmed video. Their hierarchical contrastive learning encourages clips that are close in time and topic to have similar representations. Zhang and Crandall [102] introduce a self-supervised video representation learning method whose training objective is decoupled into two contrastive sub-tasks, hierarchical spatial contrast and temporal contrast. With graph learning, TCGL [83] uses a spatial-temporal knowledge
discovering module for motion-enhanced spatial-temporal representations. TCGL introduces intra-
and inter-snippet temporal contrastive graphs to explicitly model multi-scale temporal depen-
dencies, employing a hybrid graph contrastive learning strategy. A recent study [113] proposes
TempCLR, a contrastive learning framework for exploring temporal dynamics in video-paragraph
alignment, leveraging a novel negative sampling strategy based on temporal granularity. By focus-
ing on sequence-level comparison and using dynamic time warping, TempCLR captures temporal
dynamics more effectively. Zhang et al. [105] model videos as stochastic processes via a novel
process-based contrastive learning framework by enforcing an arbitrary frame to agree with a
time-variant Gaussian distribution conditioned on the start and end frames.
Remark. Self-supervised representation learning has emerged as a powerful approach in the
realm of time-series analysis, particularly due to its ability to mitigate the high labeling cost. The
contrastive and non-contrastive self-supervised learning methods discussed showcase a diverse
range of strategies, such as temporal contrasting, combination of contrastive consistencies, and
dynamic graph modeling, among others, for capturing intricate temporal dependencies and spatial
relations in various applications. Whether addressing multivariate time series, irregular patterns,
or video representations, the contrastive-based approaches emphasize the importance of selecting
meaningful positive and negative samples. The variety of techniques presented underscores the
versatility of self-supervised learning in enhancing the efficiency and robustness of representation
learning, paving the way for advancements in time-series analysis across different domains.
5 DATA-CENTRIC APPROACHES
In this group, we categorize the methods that focus on finding a new way to enhance the usefulness
of the training data at hand. To capture the underlying patterns, trends, and relevant features within
the time series, data-centric approaches prioritize engineering the data itself rather than focusing
on model architecture and loss function design. We categorize data-centric approaches into three
techniques: data augmentation, decomposition and transformation, and sample selection.
for a wide range of downstream tasks. By considering frequency information, TF-C [109] aims to
use both time and frequency features of time series to improve the representation quality. TF-C
is claimed to be the first study to develop frequency-based contrastive augmentation exploiting
spectral information and to explore time-frequency consistency in time series. It consists of a set
of novel augmentations based on the characteristic of the frequency spectrum and produces a
frequency-based embedding through contrastive instance discrimination. More specifically, TF-C
introduces frequency-domain augmentation by randomly adding or removing frequency com-
ponents, thereby exposing the model to a range of frequency variations. Recently, TS-CoT [121] has facilitated the creation of diverse views for contrastive learning, enhancing robustness to noisy time series and thereby contributing to the overall effectiveness of representation learning.
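A minimal NumPy sketch of TF-C-style frequency-domain augmentation, randomly removing some frequency components and adding others, is shown below; the perturbation rates and amplitudes are illustrative assumptions rather than the published hyperparameters.

```python
import numpy as np

def frequency_augment(x, remove_rate=0.1, add_rate=0.1, add_scale=0.5, seed=None):
    """Perturb a univariate series x in the frequency domain by randomly
    zeroing some components (removal) and boosting others (addition)."""
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(x)

    # Randomly remove frequency components.
    remove_mask = rng.random(spec.shape) < remove_rate
    spec[remove_mask] = 0.0

    # Randomly add frequency components with random phases.
    add_mask = rng.random(spec.shape) < add_rate
    boost = add_scale * np.max(np.abs(spec))
    spec[add_mask] += boost * np.exp(1j * rng.uniform(0, 2 * np.pi, add_mask.sum()))

    return np.fft.irfft(spec, n=len(x))

t = np.linspace(0, 1, 256, endpoint=False)
augmented = frequency_augment(np.sin(2 * np.pi * 5 * t), seed=0)
```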
video to help the latent representation better capture meaningful features from the remaining data.
Kim et al. [124] propose DynaAugment, which dynamically changes the augmentation magnitude over time to learn the temporal variations of a video. By Fourier-sampling the magnitudes, it diversifies the augmentations while maintaining temporal consistency.
Remark. Recent advancements in time-series representation learning have emphasized the impor-
tance of augmentation strategies tailored to capture both temporal and frequency characteristics
effectively. While random augmentation methods leverage augmented context views for robust rep-
resentation, policy-based approaches employ specific criteria to generate diverse and high-fidelity
positive samples for contrastive learning. These augmentation techniques not only enhance the
model’s ability to handle noisy and diverse time-series data but also ensure the preservation of
essential temporal and spectral attributes.
series into non-overlapping patches along the temporal dimension in each layer and treats the time series as a sequence of patches. Different layers have different patch sizes so that they can focus on different time scales, making the representations more expressive.
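Such temporal patching can be realized with a simple reshape; the sketch below is a generic illustration of splitting a series into non-overlapping patches of a layer-specific size, not the specific model's code.

```python
import torch

def patchify(x, patch_len):
    """Split x of shape (B, T, C) into non-overlapping patches along time.

    Returns a tensor of shape (B, num_patches, patch_len * C); trailing
    timestamps that do not fill a complete patch are dropped.
    """
    B, T, C = x.shape
    num_patches = T // patch_len
    x = x[:, : num_patches * patch_len, :]
    # (B, num_patches, patch_len, C) -> flatten each patch into one token.
    return x.reshape(B, num_patches, patch_len, C).reshape(B, num_patches, patch_len * C)

x = torch.randn(8, 96, 7)
tokens_fine = patchify(x, patch_len=8)     # finer scale: 12 patches
tokens_coarse = patchify(x, patch_len=24)  # coarser scale: 4 patches
```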
For temporal graph networks, Chanpuriya et al. [133] construct a time-decayed line graph to train edge representations by creating a sparse matrix of proximities between temporal edges.
Remark. Time-series decomposition and input space transformation have significantly enhanced
the capabilities of deep models for analyzing and processing time-series and video data. Notably,
innovations in transforming 1D time series into 2D images or tensors, as well as methods addressing
irregular time series, have expanded the applicability and performance of vision backbones.
5.3.1 Generative Methods. Sample generation is a popular technique that increases the size and
diversity of the training data when data is scarce. This line of work explicitly generates new samples through transformations or generative models. To enhance noise resilience, Nguyen et al.
[53] propose a novel noise-resilient sampling strategy by exploiting a parameter-free discrete
wavelet transform low-pass filter to generate a perturbed version of the original time series. By
leveraging a large language model (LLM), LAVILA [134] learns better video-language embeddings when only a few text annotations are available. Using the available video-text data, it first fine-tunes an LLM to generate text narrations given visual input; the videos densely annotated by this fine-tuned narrator are then used for video-text contrastive learning.
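As a concrete example of the wavelet-based perturbation used by Nguyen et al. [53], the sketch below builds a low-pass view of a series with PyWavelets by zeroing the detail coefficients; the wavelet family and decomposition level here are illustrative choices, not the paper's settings.

```python
import numpy as np
import pywt

def dwt_lowpass_view(x, wavelet="db4", level=2):
    """Return a low-pass (smoothed) version of a 1D series by zeroing the
    detail coefficients of its discrete wavelet decomposition."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    # coeffs = [approximation, detail_level, ..., detail_1]; keep only the approximation.
    coeffs = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(x)]

x = np.sin(np.linspace(0, 8 * np.pi, 512)) + 0.3 * np.random.randn(512)
smooth_view = dwt_lowpass_view(x)
```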
5.3.2 (Re-)Sampling Strategies. Sampling strategies aim to effectively select the best samples
from available training data for a particular learning scenario. An early study adopting sample
selection for time-series representation learning is proposed by Franceschi et al. [47]. This work
uses time-based negative sampling with the T-Loss mentioned above. It determines several negative samples by independently choosing sub-series from other time series at random, whereas a sub-series within the reference time series is considered a positive sample. This technique encourages
representations of the input time segment and its sampled sub-series to be close to each other.
MTRL [35] utilizes discriminative samples to design a distance-weighted sampling strategy for
achieving high convergence speed and accuracy.
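The time-based sampling used with T-Loss can be summarized as drawing a positive sub-series from within the reference segment and negatives from randomly chosen other series; a simplified sketch (window lengths and counts are illustrative) is given below.

```python
import numpy as np

def sample_triplet(dataset, ref_idx, ref_len=64, pos_len=32, num_neg=5, seed=None):
    """dataset: list of 1D arrays. Returns (reference, positive, negatives), where the
    positive is a sub-series of the reference segment and each negative is a random
    sub-series drawn from another series."""
    rng = np.random.default_rng(seed)
    series = dataset[ref_idx]

    start = rng.integers(0, len(series) - ref_len + 1)
    reference = series[start : start + ref_len]
    pos_start = rng.integers(0, ref_len - pos_len + 1)
    positive = reference[pos_start : pos_start + pos_len]

    negatives = []
    for _ in range(num_neg):
        other_idx = rng.choice([i for i in range(len(dataset)) if i != ref_idx])
        other = dataset[other_idx]
        neg_start = rng.integers(0, len(other) - pos_len + 1)
        negatives.append(other[neg_start : neg_start + pos_len])
    return reference, positive, negatives

data = [np.random.randn(200) for _ in range(10)]
ref, pos, negs = sample_triplet(data, ref_idx=0, seed=0)
```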
Remark. Both generative methods and sampling strategies play pivotal roles in enhancing the
efficacy and diversity of training data. Generative techniques offer methods to augment data and
improve resilience against noise, especially when data is limited. On the other hand, sampling
strategies underscore the importance of judiciously selecting samples to optimize representation
learning, emphasizing both convergence speed and accuracy in training. As most of the time-
series representation learning methods focus on extracting features useful for the downstream
tasks, sample generation may be less popular. Furthermore, compared to other data types such as
image and natural language, there is a lack of generative foundation models that can be utilized
off-the-shelf for sample generation.
6 EXPERIMENTAL DESIGN
This section describes the experimental design typically used for comparing universal representation learning methods for time series. We describe the widely used evaluation protocol and introduce publicly available benchmark datasets and evaluation metrics for each downstream task.
Given a set of $N$ time series $\{(\mathbf{X}_i, \mathbf{y}_i)\}_{i=1}^{N}$ and $J$ pre-trained representation learning models $\{f_{e,j}\}_{j=1}^{J}$, this section describes how we evaluate each model to determine the best one. As discussed in Section 1, representations of time series play a vital role in solving time-series analysis tasks. We expect the representations learned by $f_e$ to generalize to unseen downstream tasks. Accordingly, the most common evaluation method measures how well the learned representations help solve downstream tasks. Additionally, we need a function $g_d$ that maps the representation (feature) space to a label space, e.g., $g_d(f_e(\mathbf{X})): \mathbb{R}^{R \times F} \rightarrow \mathbb{R}^{|C|}$ for classification or $g_d(f_e(\mathbf{X})): \mathbb{R}^{R \times F} \rightarrow \mathbb{R}^{H}$ for forecasting. This is because $f_e$ is designed to extract feature representations, not to solve the downstream task. Commonly, $g_d$ is implemented as a simple function, such as linear regression, support vector machines, or a shallow neural network, because this suffices to solve the downstream task if the learned representations already capture meaningful and discriminative features.
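In practice, this protocol amounts to a linear probe on frozen representations. The sketch below assumes a pre-trained encoder exposed as a hypothetical encode function and fits a simple scikit-learn classifier on its outputs; the toy "encoder" averages over time purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(encode, X_train, y_train, X_test, y_test):
    """encode: callable mapping raw series (N, T, F) to representations (N, D).
    Trains a simple classifier g_d on frozen representations and reports accuracy."""
    Z_train, Z_test = encode(X_train), encode(X_test)
    clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
    return accuracy_score(y_test, clf.predict(Z_test))

# Toy usage with a stand-in "encoder" that averages over time.
encode = lambda X: X.mean(axis=1)
X_tr, y_tr = np.random.randn(100, 50, 3), np.random.randint(0, 2, 100)
X_te, y_te = np.random.randn(40, 50, 3), np.random.randint(0, 2, 40)
print(linear_probe(encode, X_tr, y_tr, X_te, y_te))
```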
control systems (e.g., MuJoCo [39]). To facilitate the evaluation of forecasting models, Godahewa
et al. [147] also introduce a publicly accessible archive for time-series forecasting. Given the numer-
ical nature of the predicted results, the most commonly used metrics are mean squared error (MSE)
and mean absolute error (MAE).
6.2.2 Classification and Clustering. As the classification and clustering tasks both aim to identify
the real category to which a time-series sample belongs, existing studies usually use the same
set of benchmark datasets to evaluate the model performance. Curated benchmarks comprising
heterogeneous time series from various application domains, such as UCR [148] and UEA [149],
are the most widely used because they can provide a comprehensive evaluation regarding the
generalization of the model being evaluated. Besides, many researchers also use human activity (e.g.,
HAR [150]) and health (e.g., PhysioNet Sepsis [151]) related datasets due to their practicality for
real-world applications. We recommend referring to Table 3 for the exhaustive list (including audio
and video modalities) of datasets for the classification and clustering tasks.
Regarding the evaluation metrics, while we can evaluate the classification task with accuracy, precision, recall, and F1 score, we usually assess the clustering task with the Silhouette score, adjusted Rand index (ARI), and normalized mutual information (NMI) to evaluate the inherent clusterability in the absence of labeled instances. For classification, we may also use the area under the precision-recall curve (AUPRC) to handle class-imbalance cases.
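All of these metrics are available off the shelf in scikit-learn; the snippet below computes them on toy predictions, cluster assignments, and representations.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, average_precision_score,
                             adjusted_rand_score, normalized_mutual_info_score,
                             silhouette_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 100)       # ground-truth class labels
y_score = rng.random(100)              # predicted scores for the positive class
y_pred = (y_score > 0.5).astype(int)   # hard predictions
cluster_ids = rng.integers(0, 3, 100)  # cluster assignments
Z = rng.normal(size=(100, 16))         # learned representations

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("AUPRC:", average_precision_score(y_true, y_score))
print("ARI:", adjusted_rand_score(y_true, cluster_ids))
print("NMI:", normalized_mutual_info_score(y_true, cluster_ids))
print("Silhouette:", silhouette_score(Z, cluster_ids))
```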
6.2.3 Regression. Compared to the forecasting and classification tasks, time-series regression,
particularly with deep learning, remains relatively underexplored. Only a handful of public bench-
mark datasets (e.g., heart rate monitoring data [127] and air quality [74]) exist. The TSER archive,
introduced by Tan et al. [152], is the most comprehensive benchmark for time-series regression.
Like forecasting and imputation tasks, the metrics for regression are MSE and MAE. Additional
metrics, such as root mean squared error (RMSE) and R-squared (𝑅 2 ), are also commonly used.
6.2.4 Segmentation. Likewise, time-series segmentation with deep learning is also relatively un-
derexplored. There are two standard curated benchmarks: UTSA [153] and TSSB [154]. To assess
the segmentation performance, F1 and covering scores are typically used. The F1 score emphasizes
the importance of detecting the correct timestamps at which the underlying process changes. In
contrast, the covering score focuses on splitting a time series into homogeneous segments and
reports a measure for the overlaps of predicted versus labeled segments.
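The covering score can be computed from segment boundaries as a length-weighted best Jaccard overlap between labeled and predicted segments; the sketch below follows this common definition and is an illustrative implementation rather than the official benchmark code.

```python
def segments_from_breakpoints(breakpoints, n):
    """Convert sorted breakpoint indices into (start, end) segments covering [0, n)."""
    bounds = [0] + sorted(breakpoints) + [n]
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]

def covering_score(true_bkps, pred_bkps, n):
    """Length-weighted best Jaccard overlap of labeled vs. predicted segments."""
    def jaccard(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union else 0.0

    true_segs = segments_from_breakpoints(true_bkps, n)
    pred_segs = segments_from_breakpoints(pred_bkps, n)
    return sum((e - s) * max(jaccard((s, e), p) for p in pred_segs)
               for s, e in true_segs) / n

print(covering_score(true_bkps=[300, 700], pred_bkps=[320, 650], n=1000))
```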
6.2.5 Anomaly Detection. Anomaly detection is one of the most popular research topics in time
series. There are several benchmarks publicly available, as listed in Table 3. However, as argued
by recent studies [155, 156], most existing benchmarks are deeply flawed and cannot provide a
meaningful assessment of the anomaly detection models. Therefore, we recommend using newly
proposed datasets, such as ASD [157], TimeSeAD [156], and TSB-UAD [158].
Concerning the evaluation metrics, point-adjust F1 score [159] is the most widely used metric
for time-series anomaly detection. Nevertheless, this metric is also found to suffer from an overestimation problem and thus cannot provide a reliable performance evaluation. Accordingly, recent studies [160–162]
have started to adopt more robust evaluation metrics, e.g., VUS [163], PA%K [164], and eTaPR [165].
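The point-adjust protocol [159] marks an entire ground-truth anomaly segment as detected if at least one of its points is flagged and then computes the F1 score on the adjusted predictions, which is exactly why it tends to overestimate performance. A minimal sketch:

```python
import numpy as np
from sklearn.metrics import f1_score

def point_adjust(y_true, y_pred):
    """If any point inside a ground-truth anomaly segment is flagged,
    mark the whole segment as detected in the adjusted predictions."""
    y_true, adjusted = np.asarray(y_true), np.asarray(y_pred).copy()
    start = None
    for i, label in enumerate(list(y_true) + [0]):  # sentinel closes a trailing segment
        if label == 1 and start is None:
            start = i
        elif label == 0 and start is not None:
            if adjusted[start:i].any():
                adjusted[start:i] = 1
            start = None
    return adjusted

y_true = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 0])
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
print("point-adjust F1:", f1_score(y_true, point_adjust(y_true, y_pred)))
```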
6.2.6 Retrieval. Although only a few of the reviewed studies use dedicated datasets for the retrieval task (e.g., EK-100 [166], HowTo100M [167], and MUSIC [168]), the task itself can be evaluated with any benchmark dataset because it is based on arbitrary query time series.
For time-series retrieval [35], benchmark datasets for classification (e.g., UCR) are commonly used.
For the evaluation, the top-k recall rate (higher is better) is the standard metric, which examines the overlap percentage between the top-k results and the ground truth; k is usually set to 5, 10, or 20.
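A simplified sketch of this overlap-based top-k recall, treating the ground truth as a ranked list of relevant items (an assumption for illustration), is given below.

```python
def top_k_recall(retrieved_ids, ground_truth_ids, k=10):
    """Overlap percentage between the top-k retrieved items and the top-k ground truth."""
    return len(set(retrieved_ids[:k]) & set(ground_truth_ids[:k])) / k

# Toy usage: 3 of the top-5 retrieved items also appear in the top-5 ground truth.
print(top_k_recall([3, 7, 1, 9, 4, 6], [7, 4, 8, 3, 2, 5], k=5))
```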
Table 3. Summary of public datasets widely used for time-series representation learning. T and V indicate the varying time-series length (or # frames) and number of variables (or video resolution) per sample, respectively. Rows are grouped by downstream task.

Dataset Name | Size | Dimension | Domain | Modality | Reference Source

Forecasting & Imputation:
ETTh | 14,307 | 7 | Electric Power | time series | [145]
ETTm | 57,507 | 7 | Electric Power | time series | [145]
Electricity | 26,304 | 321 | Electricity Consumption | time series | [31]
Traffic | 17,451 | 862 | Transportation | time series | [31]
PEMS-BAY | 16,937,179 | 325 | Transportation | spatiotemporal | [169]
METR-LA | 6,519,002 | 207 | Transportation | spatiotemporal | [169]
Weather | 52,603 | 21 | Climatological Data | time series | [31]
Exchange | 7,588 | 8 | Daily Exchange Rate | time series | [146]
ILI | 861 | 7 | Illness | time series | [31]
Google Stock | 3,773 | 6 | Stock Prices | time series | [170]
Monash TSF | 30 × T | V | Multiple | time series | [147]
Solar | 52,560 | 137 | Solar Power Production | time series | [146]
MuJoCo | 10,000 × 100 | 14 | Control Tasks | time series | [39]
USHCN-Climate | 386,068 | 5 | Climatological Data | time series | [61]

Classification & Clustering:
UCR | 128 × T | 1 | Multiple | time series | [148]
UEA | 30 × T | V | Multiple | time series | [149]
PhysioNet Sepsis | 40,336 × T | 34 | Medical Data | time series | [151]
PhysioNet ICU | 12,000 × T | 36 | Medical Data | time series | [171]
PhysioNet ECG | 12,186 × T | 1 | Medical Data | time series | [172]
HAR | 10,299 | 9 | Human Activity | time series | [150]
EMG | 163 | 1 | Medical Data | time series | [173]
Epilepsy | 11,500 | 1 | Brain Activity | time series | [174]
Waveform | 76,567 | 2 | Medical Data | time series | [121]
Gesture | 440 | 3 | Hand Gestures | time series | [175]
MOD | 39,609 | 2 | Moving Object | time series | [120]
PAMAP2 | 9,611 | 10 | Human Activity | time series | [120]
Sleep-EEG | 371,005 | 1 | Sleep Stages | time series | [90]
RealWorld-HAR | 12,887 | 9 | Human Activity | time series | [120]
Speech Commands | 5,630,975 | 20 | Spoken Words | audio | [39]
LRW | 13,050,000 | 64 × 64 | Lip Reading | video | [176]
ESC50 | 10,000 | 1 | Environmental Sound | audio | [177]
UCF101 | 333,000 | 320 × 240 | Human Activity | video | [178]
HMDB51 | 6,849 × T | V_width × V_height | Human Activity | video | [179]
Kinetics-400 | 7,656,125 | V_width × V_height | Human Activity | video | [180]
AD | 1,527,552 | 16 | Medical Data | time series | [72]
PTB | 18,711,000 | 15 | Medical Data | time series | [72]
TDBrain | 3,035,136 | 33 | Brain Activity | time series | [72]
MMAct | 36,764 (only number of instances) | N/A | Human Activity | multi-modality | [181]
PennAction | 2,326 × T | 640 × 480 | Human Activity | video | [101]
FineGym | 4,883 × T | V_width × V_height | Human Activity | video | [101]
Pouring | 84 × T | V_width × V_height | Human Activity | video | [101]
Something-Something | 2,650,164 | 84 × 84 | Human Activity | video | [124]

Regression:
TSER Archive | 19 × T | V | Multiple | time series | [152]
Neonate | 79 × T | 18 | Neonatal EEG Recordings | time series | [182]
IEEE SPC | 22 × T | 5 | Heart Rate Monitoring | time series | [127]
DaLia | 15 × T | 11 | Heart Rate Monitoring | time series | [127]
IHEPC | 2,075,259 | 1 | Electricity Consumption | time series | [47]
AEPD | 19,735 | 29 | Appliances Energy | time series | [74]
BMAD | 420,768 | 6 | Air Quality | time series | [74]
SML2010 | 4,137 | 18 | Smart Home | time series | [74]

Segmentation:
TSSB | 75 × T | 1 | Multiple | time series | [154]
UTSA | 32 × T | 1 | Multiple | time series | [153]

Anomaly Detection:
FD-A | 8,184 | 1 | Mechanical System | time series | [183]
FD-B | 13,640 | 1 | Mechanical System | time series | [183]
KPI | 5,922,913 | 1 | Server Machine | time series | [16]
TODS | T | V | Synthetic Data | time series | [184]
SMD | 1,416,825 | 38 | Server Machine | time series | [185]
ASD | 154,171 | 19 | Server Machine | time series | [157]
PSM | 220,322 | 26 | Server Machine | time series | [186]
MSL | 130,046 | 55 | Spacecraft | time series | [187]
SMAP | 562,800 | 25 | Spacecraft | time series | [187]
SWaT | 944,919 | 51 | Infrastructure | time series | [188]
WADI | 1,221,372 | 103 | Infrastructure | time series | [189]
Yahoo | 572,966 | 1 | Multiple | time series | [16]
TimeSeAD | 21 × T | V | Multiple | time series | [156]
TSB-UAD | 1,980 × T | 1 | Multiple | time series | [158]
UCR-TSAD | 250 × T | 1 | Multiple | time series | [155]
UCFCrime | 1,900 × T | V_width × V_height | Surveillance | video | [190]
Oops! | 20,338 × T | V_width × V_height | Human Activity | video | [191]
DFDC | 128,154 × T | 256 × 256 | Deepfake | video | [192]

Retrieval:
EK-100 | 89,977 | V_width × V_height | Human Activity | video | [166]
HowTo100M | 136,600,000 | V_width × V_height | Human Activity | video | [167]
MUSIC | 714 | V_width × V_height | Musical Instrument | multi-modality | [168]
Given that distribution shifts resulting from concept drift and domain shift are the factors that
degrade model performance, previous studies have focused on concept drift adaptation and domain
adaptation to address these shifts in specific downstream tasks, such as classification, anomaly
detection, and forecasting [197–199]. Addressing distribution shifts in the test phase is also crucial
for learning representations for various downstream environments. Therefore, a promising future direction is to develop distribution-shift adaptation methods for universal representation learning, for example, with discrepancy-based or adversarial approaches.
few-shot learning, and fine-tuning, have demonstrated performance comparable to existing deep
learning methods. We note ongoing attempts to leverage LLMs for time-series analysis, yet mostly
limited to forecasting tasks [210]. Thus, using LLMs in time-series representation learning is ex-
pected to enhance the embedding quality by accurately capturing time-dependent patterns in the
time series. We also expect future research on aligning time-series representations with language
embeddings to provide valuable insights, not only in the context of single-modal time series but
also in the realm of multi-modal or multivariate time series.
8 CONCLUSIONS
This article introduces universal time-series representation learning research and its importance
for downstream time-series analysis. We present a comprehensive and up-to-date literature review
of universal representation learning for time series by categorizing the recent advancements from
design perspectives. Our main goal is to answer how each fundamental design element—neural
architectures, learning objectives, and training data—of state-of-the-art time-series representation
learning methods contributes to the improvement of the learned representation quality, resulting
in a novel structured taxonomy with fifteen subcategories. Although most state-of-the-art studies consider all design elements in their methods, typically only one or two of these elements are newly proposed. Given
the current review of the selected studies, we find that decomposition and transformation methods
and sample selection techniques in the data-centric approaches are still underexplored. In addition,
we provide a practical guideline about standard experimental setups and widely used time-series
datasets for particular downstream tasks, together with discussions on various open challenges
and future research directions related to time-series representation learning. Ultimately, we expect
this survey to be a valuable resource for practitioners and researchers interested in a multi-faceted
understanding of the universal representation learning methods for time series.
ACKNOWLEDGMENTS
This work was supported by Institute of Information & Communications Technology Planning &
Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00157, Robust, Fair,
Extensible Data-Centric Continual Learning, 50% and No. 2020-0-00862, DB4DL: High-Usability
and Performance In-Memory Distributed DBMS for Deep Learning, 50%).
REFERENCES
[1] Yasmin Fathy, Payam Barnaghi, and Rahim Tafazolli. Large-scale indexing, discovery, and ranking for the internet of
things (iot). ACM CSUR, 51(2):1–53, 2018.
[2] Andrew A Cook, Göksel Mısırlı, and Zhong Fan. Anomaly detection for iot time-series data: A survey. IEEE Internet
of Things Journal, 7(7):6481–6494, 2019.
[3] Jairo Giraldo, David Urbina, Alvaro Cardenas, Junia Valente, Mustafa Faisal, Justin Ruths, Nils Ole Tippenhauer,
Henrik Sandberg, and Richard Candell. A survey of physics-based attack detection in cyber-physical systems. ACM
CSUR, 51(4):1–36, 2018.
[4] Yuan Luo, Ya Xiao, Long Cheng, Guojun Peng, and Danfeng Yao. Deep learning-based anomaly detection in cyber-
physical systems: Progress and opportunities. ACM CSUR, 54(5):1–36, 2021.
[5] Wenbo Ge, Pooia Lalbakhsh, Leigh Isai, Artem Lenskiy, and Hanna Suominen. Neural network–based financial
volatility forecasting: A systematic review. ACM CSUR, 55(1):1–30, 2022.
[6] Longbing Cao. Ai in finance: Challenges, techniques, and opportunities. ACM CSUR, 55(3):1–38, 2022.
[7] Kaixuan Chen, Dalin Zhang, Lina Yao, Bin Guo, Zhiwen Yu, and Yunhao Liu. Deep learning for sensor-based human
activity recognition: Overview, challenges, and opportunities. ACM CSUR, 54(4):1–40, 2021.
[8] Fuqiang Gu, Mu-Huan Chung, Mark Chignell, Shahrokh Valaee, Baoding Zhou, and Xue Liu. A survey on deep
learning for human activity recognition. ACM CSUR, 54(8):1–34, 2021.
[9] Philippe Esling and Carlos Agon. Time-series data mining. ACM CSUR, 45(1):1–34, 2012.
[10] Bryan Lim and Stefan Zohren. Time-series forecasting with deep learning: A survey. Philos. Trans. Royal Soc., 379
(2194):20200209, 2021.
[11] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. Deep
learning for time series classification: A review. Data Min. Knowl. Discov., 33(4):917–963, 2019.
[12] Chang Wei Tan, Christoph Bergmeir, François Petitjean, and Geoffrey I Webb. Time series extrinsic regression:
Predicting numeric values from time series data. Data Min. Knowl. Discov., 35:1032–1060, 2021.
[13] Kukjin Choi, Jihun Yi, Changhwa Park, and Sungroh Yoon. Deep learning for anomaly detection in time-series data:
Review, analysis, and guidelines. IEEE Access, 2021.
[14] Anshul Sharma, Abhinav Kumar, Anil Kumar Pandey, and Rishav Singh. Time Series Data Representation and
Dimensionality Reduction Techniques, pages 267–284. Springer, 2020.
[15] George Zerveas, Srideepika Jayaraman, Dhaval Patel, Anuradha Bhamidipaty, and Carsten Eickhoff. A transformer-
based framework for multivariate time series representation learning. In KDD, pages 2114–2124, 2021.
[16] Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and Bixiong Xu. TS2Vec:
Towards universal representation of time series. In AAAI, volume 36, pages 8980–8987, 2022.
[17] Ling Yang and Shenda Hong. Unsupervised time-series representation learning with iterative bilinear temporal-
spectral fusion. In ICML, pages 25038–25054, 2022.
[18] Varsha S Lalapura, J Amudha, and Hariramn Selvamuruga Satheesh. Recurrent neural networks for edge intelligence:
A survey. ACM CSUR, 54(4):1–38, 2021.
[19] Shitong Mao and Ervin Sejdić. A review of recurrent neural network-based methods in computational physiology.
IEEE Trans. Neural Netw. Learn. Syst., 0(0):1–21, 2022.
[20] Brian Kenji Iwana and Seiichi Uchida. An empirical survey of data augmentation for time series classification with
neural networks. PLOS ONE, 16(7):e0254841, 2021.
[21] Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. Transformers in time
series: A survey. In IJCAI, 2023.
[22] Ming Jin, Huan Yee Koh, Qingsong Wen, Daniele Zambon, Cesare Alippi, Geoffrey I Webb, Irwin King, and Shirui Pan.
A survey on graph neural networks for time series: Forecasting, classification, imputation, and anomaly detection.
arXiv:2307.03759, 2023.
[23] Yulia Rubanova, Ricky TQ Chen, and David Duvenaud. Latent odes for irregularly-sampled time series. In NeurIPS,
volume 33, pages 5320–5330, 2019.
[24] Chenxi Sun, Shenda Hong, Moxian Song, and Hongyan Li. A review of deep learning methods for irregularly sampled
medical time series data. arXiv:2010.12493, 2020.
[25] Emadeldeen Eldele, Mohamed Ragab, Zhenghua Chen, Min Wu, Chee-Keong Kwoh, and Xiaoli Li. Label-efficient
time series representation learning: A review. arXiv:2302.06433, 2023.
[26] Kexin Zhang, Qingsong Wen, Chaoli Zhang, Rongyao Cai, Ming Jin, Yong Liu, James Zhang, Yuxuan Liang, Guansong
Pang, Dongjin Song, et al. Self-supervised learning for time series analysis: Taxonomy, progress, and prospects.
arXiv:2306.10125, 2023.
[27] Qianli Ma, Zhen Liu, Zhenjing Zheng, Ziyang Huang, Siying Zhu, Zhongzhong Yu, and James T Kwok. A survey on
time-series pre-trained models. arXiv:2305.10716, 2023.
[28] Qianwen Meng, Hangwei Qian, Yong Liu, Yonghui Xu, Zhiqi Shen, and Lizhen Cui. Unsupervised representation
learning for time series: A review. arXiv:2308.01578, 2023.
[29] Linus Ericsson, Henry Gouk, Chen Change Loy, and Timothy M Hospedales. Self-supervised representation learning:
Introduction, advances, and challenges. IEEE Signal Process. Mag., 39(3):42–62, 2022.
[30] Amine Mohamed Aboussalah, Minjae Kwon, Raj G Patel, Cheng Chi, and Chi-Guhn Lee. Recursive time series data
augmentation. In ICLR, 2023.
[31] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. TimesNet: Temporal 2d-variation
modeling for general time series analysis. In ICLR, 2023.
[32] Qingsong Wen, Liang Sun, Fan Yang, Xiaomin Song, Jingkun Gao, Xue Wang, and Huan Xu. Time series data
augmentation for deep learning: A survey. In IJCAI, pages 4653–4660, 2021.
[33] Martin Längkvist, Lars Karlsson, and Amy Loutfi. A review of unsupervised feature learning and deep learning for
time-series modeling. Pattern Recognit. Lett., 42:11–24, 2014.
[34] Shohreh Deldari, Hao Xue, Aaqib Saeed, Jiayuan He, Daniel V Smith, and Flora D Salim. Beyond just vision: A review
on self-supervised representation learning on multimodal and temporal data. arXiv:2206.02353, 2022.
[35] Ling Chen, Donghui Chen, Fan Yang, and Jianling Sun. A deep multi-task representation learning method for time
series classification and retrieval. Inf. Sci., 555:17–32, 2021.
[36] Kun Liu, Wu Liu, Huadong Ma, Mingkui Tan, and Chuang Gan. A real-time action representation with temporal
encoding and deep compression. IEEE Trans. Circuits Syst. Video Technol., 31(2):647–660, 2020.
[37] Sheo Yon Jhin, Heejoo Shin, Seoyoung Hong, Minju Jo, Solhee Park, Noseong Park, Seungbeom Lee, Hwiyoung
Maeng, and Seungmin Jeon. Attentive neural controlled differential equations for time-series classification and
forecasting. In ICDM, pages 250–259, 2021.
[38] Futoon M Abushaqra, Hao Xue, Yongli Ren, and Flora D Salim. Crosspyramid: Neural ordinary differential equations
architecture for partially-observed time-series. arXiv:2212.03560, 2022.
[39] Sheo Yon Jhin, Jaehoon Lee, Minju Jo, Seungji Kook, Jinsung Jeon, Jihyeon Hyeong, Jayoung Kim, and Noseong Park.
EXIT: Extrapolation and interpolation-based neural controlled differential equations for time-series classification and
forecasting. In WebConf, pages 3102–3112, 2022.
[40] Subin Kim, Euisuk Chung, and Pilsung Kang. FEAT: A general framework for feature-aware multivariate time-series
representation learning. Knowl.-Based Syst., 277:110790, 2023.
[41] Yulia Rubanova, Tian Qi Chen, and David Duvenaud. Latent ordinary differential equations for irregularly-sampled
time series. In NeurIPS, pages 5321–5331, 2019.
[42] Eduardo H Sanchez, Mathieu Serrurier, and Mathias Ortner. Learning disentangled representations of satellite image
time series. In ECML PKDD, pages 306–321, 2019.
[43] Jiandong Xie, Yue Cui, Feiteng Huang, Chao Liu, and Kai Zheng. MARINA: An mlp-attention model for multivariate
time-series analysis. In CIKM, pages 2230–2239, 2022.
[44] Tengda Han, Weidi Xie, and Andrew Zisserman. Memory-augmented dense predictive coding for video representation
learning. In ECCV, pages 312–329, 2020.
[45] Heejeong Choi and Pilsung Kang. Multi-task self-supervised time-series representation learning. arXiv:2303.01034,
2023.
[46] Tanzila Rahman, Mengyu Yang, and Leonid Sigal. TriBERT: Human-centric audio-visual representation learning. In
NeurIPS, volume 34, pages 9774–9787, 2021.
[47] Jean-Yves Franceschi, Aymeric Dieuleveut, and Martin Jaggi. Unsupervised scalable representation learning for
multivariate time series. In NeurIPS, volume 32, 2019.
[48] Jingyuan Wang, Ze Wang, Jianfeng Li, and Junjie Wu. Multilevel wavelet decomposition network for interpretable
time series analysis. In KDD, pages 2437–2446, 2018.
[49] Zhe Li, Zhongwen Rao, Lujia Pan, Pengyun Wang, and Zenglin Xu. Ti-MAE: Self-supervised masked time series
autoencoders. arXiv:2301.08871, 2023.
[50] Zhiyu Liang, Chen Liang, Zheng Liang, and Hongzhi Wang. UniTS: A universal time series analysis framework with
self-supervised representation learning. arXiv:2303.13804, 2023.
[51] Archibald Fraikin, Adrien Bennetot, and Stéphanie Allassonnière. T-Rep: Representation learning for time series
using time-embeddings. arXiv:2310.04486, 2023.
[52] Gaurangi Anand and Richi Nayak. Delta: deep local pattern representation for time-series clustering and classification
using visual perception. Knowl.-Based Syst., 212:106551, 2021.
[53] Anh Duy Nguyen, Trang H Tran, Hieu H Pham, Phi Le Nguyen, and Lam M Nguyen. Learning robust and consistent
time series representations: A dilated inception-based approach. arXiv:2306.06579, 2023.
[54] Yi-Chen Chen, Sung-Feng Huang, Hung-yi Lee, Yu-Hsuan Wang, and Chia-Hao Shen. Audio word2vec: Sequence-to-
sequence autoencoding for unsupervised learning of audio segmentation and representation. IEEE/ACM Trans. Audio
Speech Lang. Process., 27(9):1481–1493, 2019.
[55] Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One Fits All: Power general time series analysis by
pretrained lm. arXiv:2302.11939, 2023.
[56] Simone Luetto, Fabrizio Garuti, Enver Sangineto, Lorenzo Forni, and Rita Cucchiara. One Transformer for All Time
Series: Representing and training with time-dependent heterogeneous tabular data. arXiv:2302.06375, 2023.
[57] Xudong Guo, Xun Guo, and Yan Lu. SSAN: Separable self-attention network for video representation learning. In
CVPR, pages 12618–12627, 2021.
[58] Zhiyu Liang, Jianfeng Zhang, Chen Liang, Hongzhi Wang, Zheng Liang, and Lujia Pan. Contrastive shapelet learning
for unsupervised multivariate time series representation learning. arXiv:2305.18888, 2023.
[59] Michael Zhang, Khaled Kamal Saab, Michael Poli, Tri Dao, Karan Goel, and Christopher Re. Effectively modeling
time series with simple discrete state spaces. In ICLR, 2023.
[60] Filippo Maria Bianchi, Lorenzo Livi, Karl Øyvind Mikalsen, Michael Kampffmeyer, and Robert Jenssen. Learning
representations of multivariate time series with missing data. Pattern Recognition, 96:106973, 2019.
[61] Mona Schirmer, Mazin Eltayeb, Stefan Lessmann, and Maja Rudolph. Modeling irregular time series with continuous
recurrent units. In ICML, pages 19388–19405, 2022.
[62] Marin Biloš, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, and Stephan Günnemann. Modeling temporal
data as continuous functions with stochastic process diffusion. In ICML, pages 2452–2470, 2023.
[63] Abdul Fatir Ansari, Alvin Heng, Andre Lim, and Harold Soh. Neural continuous-discrete state space models for
irregularly-sampled time series. In ICML, pages 926–951, 2023.
[64] Lingfei Wu, Ian En-Hsu Yen, Jinfeng Yi, Fangli Xu, Qi Lei, and Michael Witbrock. Random warping series: A random
features method for time-series embedding. In AISTATS, pages 793–802, 2018.
[65] Chenxi Sun, Shenda Hong, Moxian Song, Yen-hsiu Chou, Yongyue Sun, Derun Cai, and Hongyan Li. TE-ESN: time
encoding echo state network for prediction based on irregularly sampled time series data. In IJCAI, pages 3010–3016,
2021.
[66] Zhining Liu, Dawei Zhou, and Jingrui He. Towards explainable representation of time-evolving graphs via spatial-
temporal graph attention networks. In CIKM, pages 2137–2140, 2019.
[67] Elizabeth Fons, Alejandro Sztrajman, Yousef El-Laham, Alexandros Iosifidis, and Svitlana Vyetrenko. Hypertime:
Implicit neural representations for time series. In NeurIPS Workshop on Synthetic Data for Empowering ML Research,
2022.
[68] Yucheng Wang, Min Wu, Xiaoli Li, Lihua Xie, and Zhenghua Chen. Multivariate time series representation learning
via hierarchical correlation pooling boosted graph neural network. IEEE Trans. Artif. Intell., 2023.
[69] Yuening Li, Zhengzhang Chen, Daochen Zha, Mengnan Du, Jingchao Ni, Denghui Zhang, Haifeng Chen, and Xia Hu.
Towards learning disentangled representations for time series. In KDD, pages 3270–3278, 2022.
[70] Ruichu Cai, Jiawei Chen, Zijian Li, Wei Chen, Keli Zhang, Junjian Ye, Zhuozhang Li, Xiaoyan Yang, and Zhenjie
Zhang. Time series domain adaptation via sparse associative structure alignment. In AAAI, pages 6859–6867, 2021.
[71] Sangmin Lee, Hyung-Il Kim, and Yong Man Ro. Weakly paired associative learning for sound and image representations
via bimodal associative memory. In CVPR, pages 10534–10543, 2022.
[72] Yihe Wang, Yu Han, Haishuai Wang, and Xiang Zhang. Contrast everything: A hierarchical contrastive framework
for medical time-series. In NeurIPS, 2023.
[73] Yuqi Chen, Kan Ren, Yansen Wang, Yuchen Fang, Weiwei Sun, and Dongsheng Li. ContiFormer: Continuous-time
transformer for irregular time series modeling. In NeurIPS, 2023.
[74] Kai Zhang, Chao Li, and Qinmin Yang. TriD-MAE: A generic pre-trained model for multivariate time series with
missing values. In CIKM, pages 3164–3173, 2023.
[75] Shuhan Zhong, Sizhe Song, Guanyao Li, Weipeng Zhuo, Yang Liu, and S-H Gary Chan. A multi-scale decomposition
mlp-mixer for time series analysis. arXiv:2310.11959, 2023.
[76] Zhijian Xu, Ailing Zeng, and Qiang Xu. FITS: Modeling time series with 10𝑘 parameters. arXiv:2307.03756, 2023.
[77] Qianli Ma, Sen Li, Lifeng Shen, Jiabing Wang, Jia Wei, Zhiwen Yu, and Garrison W Cottrell. End-to-end incomplete
time-series modeling from linear memory of latent variables. IEEE Trans. Cybern., 50(12):4908–4920, 2019.
[78] Jingyuan Wang, Chen Yang, Xiaohan Jiang, and Junjie Wu. WHEN: A wavelet-dtw hybrid attention network for
heterogeneous time series analysis. In KDD, page 2361–2373, 2023.
[79] Matt Gorbett, Hossein Shirazi, and Indrakshi Ray. Sparse binary transformers for multivariate time series modeling.
In KDD, page 544–556, 2023.
[80] Satya Narayan Shukla and Benjamin Marlin. Multi-time attention networks for irregularly sampled time series. In
ICLR, 2021.
[81] Fadime Sener, Dipika Singhania, and Angela Yao. Temporal aggregate representations for long-range video under-
standing. In ECCV, pages 154–171, 2020.
[82] Ranak Roy Chowdhury, Xiyuan Zhang, Jingbo Shang, Rajesh K Gupta, and Dezhi Hong. TARNet: Task-aware
reconstruction for time-series transformer. In KDD, pages 212–220, 2022.
[83] Yang Liu, Keze Wang, Lingbo Liu, Haoyuan Lan, and Liang Lin. TCGL: Temporal contrastive graph for self-supervised
video representation learning. IEEE Trans. Image Process., 31:1978–1993, 2022.
[84] Etienne Le Naour, Louis Serrano, Léon Migus, Yuan Yin, Ghislain Agoua, Nicolas Baskiotis, Vincent Guigue, et al. Time
series continuous modeling for imputation and forecasting with implicit neural representations. arXiv:2306.05880,
2023.
[85] Chenguo Lin, Xumeng Wen, Wei Cao, Congrui Huang, Jiang Bian, Stephen Lin, and Zhirong Wu. NuTime: Numerically
multi-scaled embedding for large-scale time series pretraining. arXiv:2310.07402, 2023.
[86] Isma Hadji, Konstantinos G Derpanis, and Allan D Jepson. Representation learning via global temporal alignment
and cycle-consistency. In CVPR, pages 11068–11077, 2021.
[87] Ye Yuan, Guangxu Xun, Qiuling Suo, Kebin Jia, and Aidong Zhang. Wave2vec: Deep representation learning for
clinical temporal data. Neurocomputing, 324:31–42, 2019.
[88] Sana Tonekaboni, Chun-Liang Li, Sercan O Arik, Anna Goldenberg, and Tomas Pfister. Decoupling local and global
representations of time series. In AISTATS, pages 8700–8714, 2022.
[89] Dang Nguyen, Wei Luo, Tu Dinh Nguyen, Svetha Venkatesh, and Dinh Phung. Sqn2Vec: learning sequence represen-
tation via sequential patterns with a gap constraint. In ECML PKDD, pages 569–584, 2018.
[90] Jiaxiang Dong, Haixu Wu, Haoran Zhang, Li Zhang, Jianmin Wang, and Mingsheng Long. SimMTM: A simple
pre-training framework for masked time-series modeling. arXiv:2302.00861, 2023.
[91] Sheng Guo, Zihua Xiong, Yujie Zhong, Limin Wang, Xiaobo Guo, Bing Han, and Weilin Huang. Cross-architecture
self-supervised video representation learning. In CVPR, pages 19270–19279, 2022.
[92] Sanjay Haresh, Sateesh Kumar, Huseyin Coskun, Shahram N Syed, Andrey Konin, Zeeshan Zia, and Quoc-Huy Tran.
Learning by aligning videos in time. In CVPR, pages 5548–5558, 2021.
[93] Hanwen Liang, Niamul Quader, Zhixiang Chi, Lizhe Chen, Peng Dai, Juwei Lu, and Yang Wang. Self-supervised
spatiotemporal representation learning by exploiting video continuity. In AAAI, pages 1564–1573, 2022.
[94] Jiangliu Wang, Jianbo Jiao, and Yun-Hui Liu. Self-supervised video representation learning by pace prediction. In
ECCV, pages 504–521, 2020.
[95] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Wei Liu, and Yun-Hui Liu. Self-supervised video representation
learning by uncovering spatio-temporal statistics. IEEE Trans. Pattern Anal. Mach. Intell., 44(7):3791–3806, 2021.
[96] Haodong Duan, Nanxuan Zhao, Kai Chen, and Dahua Lin. TransRank: Self-supervised video representation learning
via ranking-based transformation recognition. In CVPR, pages 3000–3010, 2022.
[97] Yuchen Fang, Kan Ren, Caihua Shan, Yifei Shen, You Li, Weinan Zhang, Yong Yu, and Dongsheng Li. Learning
decomposed spatial relations for multi-variate time-series modeling. In AAAI, pages 7530–7538, 2023.
[98] Yujia Zhang, Lai-Man Po, Xuyuan Xu, Mengyang Liu, Yexin Wang, Weifeng Ou, Yuzhi Zhao, and Wing-Yin Yu.
Contrastive spatio-temporal pretext learning for self-supervised video representation. In AAAI, pages 3380–3389,
2022.
[99] Quan Kong, Wenpeng Wei, Ziwei Deng, Tomoaki Yoshinaga, and Tomokazu Murakami. Cycle-contrast for self-
supervised video representation learning. In NeurIPS, volume 33, pages 8089–8100, 2020.
[100] Shuangrui Ding, Rui Qian, and Hongkai Xiong. Dual contrastive learning for spatio-temporal representation. In MM,
pages 5649–5658, 2022.
[101] Minghao Chen, Fangyun Wei, Chong Li, and Deng Cai. Frame-wise action representations for long videos via
sequence contrastive learning. In CVPR, pages 13801–13810, 2022.
[102] Zehua Zhang and David Crandall. Hierarchically decoupled spatial-temporal contrast for self-supervised video
representation learning. In WACV, pages 3235–3245, 2022.
[103] Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Mingqian Tang, Changxin Gao, Rong Jin, and Nong
Sang. Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency.
In CVPR, pages 13821–13831, 2022.
[104] Pedro Morgado, Yi Li, and Nuno Nvasconcelos. Learning representations from audio-visual spatial alignment. In
NeurIPS, volume 33, pages 4733–4744, 2020.
[105] Heng Zhang, Daqing Liu, Qi Zheng, and Bing Su. Modeling video as stochastic processes for fine-grained video
representation learning. In CVPR, pages 2225–2234, 2023.
[106] Rui Qian, Yeqing Li, Liangzhe Yuan, Boqing Gong, Ting Liu, Matthew Brown, Serge J Belongie, Ming-Hsuan Yang,
Hartwig Adam, and Yin Cui. On temporal granularity in self-supervised video representation learning. In BMVC,
page 541, 2022.
[107] Ranak Roy Chowdhury, Jiacheng Li, Xiyuan Zhang, Dezhi Hong, Rajesh Gupta, and Jingbo Shang. PrimeNet:
Pre-training for irregular multivariate time series. In AAAI, 2023.
[108] Peihao Chen, Deng Huang, Dongliang He, Xiang Long, Runhao Zeng, Shilei Wen, Mingkui Tan, and Chuang Gan.
RSPNet: Relative speed perception for unsupervised video representation learning. In AAAI, pages 1045–1053, 2021.
[109] Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. Self-supervised contrastive pre-training for
time series via time-frequency consistency. In NeurIPS, volume 35, pages 3988–4003, 2022.
[110] Ainaz Hajimoradlou, Leila Pishdad, Frederick Tung, and Maryna Karpusha. Self-supervised time series representation
learning with temporal-instance similarity distillation. In ICML Workshop on Pre-training: Perspectives, Pitfalls, and
Paths Forward, 2022.
[111] Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. Spatiotem-
poral contrastive video representation learning. In CVPR, pages 6964–6974, 2021.
[112] Ishan Dave, Rohit Gupta, Mamshad Nayeem Rizve, and Mubarak Shah. TCLR: Temporal contrastive learning for
video representation. Comput. Vis. Image Underst., 219:103406, 2022.
[113] Yuncong Yang, Jiawei Ma, Shiyuan Huang, Long Chen, Xudong Lin, Guangxing Han, and Shih-Fu Chang. TempCLR:
Temporal alignment representation with contrastive learning. In ICLR, 2023.
[114] Simon Jenni and Hailin Jin. Time-equivariant contrastive video representation learning. In ICCV, pages 9970–9980,
2021.
[115] Yang Jiao, Kai Yang, Shaoyu Dou, Pan Luo, Sijia Liu, and Dongjin Song. TimeAutoML: Autonomous representation
learning for multivariate irregularly sampled time series. arXiv:2010.01596, 2020.
[116] Xinyu Yang, Zhenguo Zhang, and Rongyi Cui. TimeCLR: A self-supervised contrastive learning framework for
univariate time series representation. Knowl.-Based Syst., 245:108606, 2022.
[117] Pratik Somaiya, Harit Pandya, Riccardo Polvara, Marc Hanheide, and Grzegorz Cielniak. TS-Rep: Self-supervised
time series representation learning from robot sensor data. In NeurIPS Workshop on Self-Supervised Learning - Theory
and Practice, 2022.
[118] Sana Tonekaboni, Danny Eytan, and Anna Goldenberg. Unsupervised representation learning for time series with
temporal neighborhood coding. In ICLR, 2021.
[119] Yooju Shin, Susik Yoon, Hwanjun Song, Dongmin Park, Byunghyun Kim, Jae-Gil Lee, and Byung Suk Lee. Context
consistency regularization for label sparsity in time series. In ICML, pages 31579–31595, 2023.
[120] Shengzhong Liu, Tomoyoshi Kimura, Dongxin Liu, Ruijie Wang, Jinyang Li, Suhas Diggavi, Mani Srivastava, and
Tarek Abdelzaher. FOCAL: Contrastive learning for multimodal time-series sensing signals in factorized orthogonal
latent space. In NeurIPS, 2023.
[121] Weiqi Zhang, Jianfeng Zhang, Jia Li, and Fugee Tsung. A co-training approach for noisy time series learning. In
CIKM, pages 3308–3318, 2023.
[122] Chenxi Sun, Yaliang Li, Hongyan Li, and Shenda Hong. TEST: Text prototype aligned embedding to activate llm’s
ability for time series. arXiv:2308.08241, 2023.
[123] Emadeldeen Eldele, Mohamed Ragab, Zhenghua Chen, Min Wu, Chee Keong Kwoh, Xiaoli Li, and Cuntai Guan.
Time-series representation learning via temporal and contextual contrasting. In IJCAI, pages 2352–2359, 2021.
[124] Taeoh Kim, Jinhyung Kim, Minho Shim, Sangdoo Yun, Myunggu Kang, Dongyoon Wee, and Sangyoun Lee. Exploring
temporally dynamic data augmentation for video recognition. In ICLR, 2023.
[125] Dongsheng Luo, Wei Cheng, Yingheng Wang, Dongkuan Xu, Jingchao Ni, Wenchao Yu, Xuchao Zhang, Yanchi Liu,
Yuncong Chen, Haifeng Chen, et al. Time series contrastive learning with information-aware augmentations. In
[153] Shaghayegh Gharghabi, Yifei Ding, Chin-Chia Michael Yeh, Kaveh Kamgar, Liudmila Ulanova, and Eamonn Keogh.
Matrix profile viii: domain agnostic online semantic segmentation at superhuman performance levels. In ICDM, pages
117–126, 2017.
[154] Arik Ermshaus, Patrick Schäfer, and Ulf Leser. Clasp: parameter-free time series segmentation. Data Min. Knowl.
Discov., 37(3):1262–1300, 2023.
[155] Renjie Wu and Eamonn Keogh. Current time series anomaly detection benchmarks are flawed and are creating the
illusion of progress. IEEE Trans. Knowl. Data Eng., 2021.
[156] Dennis Wagner, Tobias Michels, Florian CF Schulz, Arjun Nair, Maja Rudolph, and Marius Kloft. Timesead: Bench-
marking deep multivariate time-series anomaly detection. TMLR, 2023.
[157] Zhihan Li, Youjian Zhao, Jiaqi Han, Ya Su, Rui Jiao, Xidao Wen, and Dan Pei. Multivariate time series anomaly
detection and interpretation using hierarchical inter-metric and temporal embedding. In KDD, pages 3220–3230, 2021.
[158] John Paparrizos, Yuhao Kang, Paul Boniol, Ruey S Tsay, Themis Palpanas, and Michael J Franklin. TSB-UAD: an
end-to-end benchmark suite for univariate time-series anomaly detection. VLDB, 15(8):1697–1711, 2022.
[159] Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, Yang
Feng, et al. Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In
WWW, pages 187–196, 2018.
[160] Yiyuan Yang, Chaoli Zhang, Tian Zhou, Qingsong Wen, and Liang Sun. DCdetector: Dual attention contrastive
representation learning for time series anomaly detection. In KDD, page 3033–3045, 2023.
[161] Youngeun Nam, Patara Trirat, Taeyoon Kim, Youngseop Lee, and Jae-Gil Lee. Context-aware deep time-series
decomposition for anomaly detection in businesses. In ECML PKDD, pages 330–345, 2023.
[162] Emmanouil Sylligardos, Paul Boniol, John Paparrizos, Panos Trahanias, and Themis Palpanas. Choose wisely: An
extensive evaluation of model selection for anomaly detection in time series. VLDB, 16(11):3418–3432, 2023.
[163] John Paparrizos, Paul Boniol, Themis Palpanas, Ruey S Tsay, Aaron Elmore, and Michael J Franklin. Volume under
the surface: a new accuracy evaluation measure for time-series anomaly detection. VLDB, 15(11):2774–2787, 2022.
[164] Siwon Kim, Kukjin Choi, Hyun-Soo Choi, Byunghan Lee, and Sungroh Yoon. Towards a rigorous evaluation of
time-series anomaly detection. In AAAI, pages 7194–7201, 2022.
[165] Won-Seok Hwang, Jeong-Han Yun, Jonguk Kim, and Byung Gil Min. Do you know existing accuracy metrics overrate
time-series anomaly detections? In SAC, pages 403–412, 2022.
[166] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide
Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and
challenges for epic-kitchens-100. IJCV, pages 1–23, 2022.
[167] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M:
Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV, 2019.
[168] Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound
of pixels. In ECCV, pages 570–586, 2018.
[169] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven
traffic forecasting. In ICLR, 2018.
[170] Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. Time-series generative adversarial networks. NeurIPS, 32,
2019.
[171] Ikaro Silva, George Moody, Daniel J Scott, Leo A Celi, and Roger G Mark. Predicting in-hospital mortality of icu
patients: The physionet/computing in cardiology challenge 2012. In CinC, pages 245–248, 2012.
[172] Gari D Clifford, Chengyu Liu, Benjamin Moody, H Lehman Li-wei, Ikaro Silva, Qiao Li, AE Johnson, and Roger G
Mark. Af classification from a short single lead ecg recording: The physionet/computing in cardiology challenge
2017. In CinC, pages 1–4, 2017.
[173] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000.
[174] Ralph G Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian E Elger. Indications
of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on
recording region and brain state. Physical Review E, 64(6):061907, 2001.
[175] Jiayang Liu, Lin Zhong, Jehan Wickramasuriya, and Venu Vasudevan. uwave: Accelerometer-based personalized
gesture recognition and its applications. PMC, 5(6):657–675, 2009.
[176] Joon Son Chung and Andrew Zisserman. Lip reading in the wild. In ACCV, pages 87–103, 2017.
[177] Karol J Piczak. ESC: Dataset for environmental sound classification. In MM, pages 1015–1018, 2015.
[178] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from
videos in the wild. arXiv:1212.0402, 2012.
[179] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: A large video
database for human motion recognition. In ICCV, pages 2556–2563, 2011.
[180] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola,
Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv:1705.06950, 2017.
[181] Quan Kong, Ziming Wu, Ziwei Deng, Martin Klinkigt, Bin Tong, and Tomokazu Murakami. MMAct: A large-scale
dataset for cross-modal human action understanding. In ICCV, pages 8658–8667, 2019.
[182] Nathan J Stevenson, Karoliina Tapani, Leena Lauronen, and Sampsa Vanhatalo. A dataset of neonatal EEG recordings
with seizure annotations. Sci. Data, 6(1):1–8, 2019.
[183] Christian Lessmeier, James Kuria Kimotho, Detmar Zimmer, and Walter Sextro. Condition monitoring of bearing
damage in electromechanical drive systems by using motor current signals of electric motors: A benchmark data set
for data-driven classification. In PHME, 2016.
[184] Kwei-Herng Lai, Daochen Zha, Junjie Xu, Yue Zhao, Guanchu Wang, and Xia Hu. Revisiting time series outlier
detection: Definitions and benchmarks. In NeurIPS, 2021.
[185] Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. Robust anomaly detection for multivariate time
series through stochastic recurrent neural network. In KDD, pages 2828–2837, 2019.
[186] Ahmed Abdulaal, Zhuanghua Liu, and Tomer Lancewicki. Practical approach to asynchronous multivariate time
series anomaly detection and localization. In KDD, pages 2485–2494, 2021.
[187] Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom. Detecting spacecraft
anomalies using LSTMs and nonparametric dynamic thresholding. In KDD, pages 387–395, 2018.
[188] Aditya P Mathur and Nils Ole Tippenhauer. SWaT: A water treatment testbed for research and training on ICS security.
In CySWater, pages 31–36, 2016.
[189] Chuadhry Mujeeb Ahmed, Venkata Reddy Palleti, and Aditya P Mathur. WADI: a water distribution testbed for
research in the design of secure cyber physical systems. In CySWater, pages 25–28, 2017.
[190] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In CVPR, pages
6479–6488, 2018.
[191] Dave Epstein, Boyuan Chen, and Carl Vondrick. Oops! Predicting unintentional action in video. In CVPR, pages
919–929, 2020.
[192] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The
deepfake detection challenge (DFDC) dataset. arXiv:2006.07397, 2020.
[193] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E
Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. PhysioBank, PhysioToolkit, and PhysioNet:
components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000.
[194] Yooju Shin, Susik Yoon, Sundong Kim, Hwanjun Song, Jae-Gil Lee, and Byung Suk Lee. Coherence-based label
propagation over time series for accelerated active learning. In ICLR, 2022.
[195] Aayush Rana and Yogesh Rawat. Are all frames equal? active sparse labeling for video action detection. NeurIPS, 35:
14358–14373, 2022.
[196] Supriya Agrahari and Anil Kumar Singh. Concept drift detection in data stream mining: A literature review. J. King
Saud Univ. - Comput. Inf. Sci., 34(10):9523–9540, 2022.
[197] Liheng Yuan, Heng Li, Beihao Xia, Cuiying Gao, Mingyue Liu, Wei Yuan, and Xinge You. Recent advances in concept
drift adaptation methods for deep learning. In IJCAI, pages 5654–5661, 2022.
[198] Mohamed Ragab, Emadeldeen Eldele, Wee Ling Tan, Chuan-Sheng Foo, Zhenghua Chen, Min Wu, Chee-Keong Kwoh,
and Xiaoli Li. AdaTime: A benchmarking suite for domain adaptation on time series data. TKDD, 17(8):1–18, 2023.
[199] Xiaoyong Jin, Youngsuk Park, Danielle Maddix, Hao Wang, and Yuyang Wang. Domain adaptation for time series
forecasting via attention sharing. In ICML, pages 10280–10297, 2022.
[200] Zizhao Zhang, Xin Wang, Chaoyu Guan, Ziwei Zhang, Haoyang Li, and Wenwu Zhu. AutoGT: Automated graph
transformer architecture search. In ICLR, 2023.
[201] Colin White, Mahmoud Safari, Rhea Sukthanker, Binxin Ru, Thomas Elsken, Arber Zela, Debadeepta Dey, and Frank
Hutter. Neural architecture search: Insights from 1000 papers. arXiv:2301.08727, 2023.
[202] Syed Yousaf Shah, Dhaval Patel, Long Vu, Xuan-Hong Dang, Bei Chen, Peter Kirchner, Horst Samulowitz, David
Wood, Gregory Bramble, Wesley M Gifford, et al. AutoAI-TS: AutoAI for time series forecasting. In SIGMOD, 2021.
[203] Renlong Jie and Junbin Gao. Differentiable neural architecture search for high-dimensional time series forecasting.
IEEE Access, 9, 2021.
[204] Chunnan Wang, Xingyu Chen, Chengyue Wu, and Hongzhi Wang. AutoTS: Automatic time series forecasting model
design based on two-stage pruning. arXiv:2203.14169, 2022.
[205] Difan Deng, Florian Karl, Frank Hutter, Bernd Bischl, and Marius Lindauer. Efficient automated deep learning for
time series forecasting. In ECML PKDD, pages 664–680, 2022.
[206] Zhichen Lai, Dalin Zhang, Huan Li, Christian S Jensen, Hua Lu, and Yan Zhao. LightCTS: A lightweight framework
for correlated time series forecasting. In SIGMOD, 2023.
[207] Hojjat Rakhshani, Hassan Ismail Fawaz, Lhassane Idoumghar, Germain Forestier, Julien Lepagnot, Jonathan Weber,
Mathieu Brévilliers, and Pierre-Alain Muller. Neural architecture search for time series classification. In IJCNN, 2020.
[208] Zhiwen Xiao, Xin Xu, Huanlai Xing, Rong Qu, Fuhong Song, and Bowen Zhao. RNTS: Robust neural temporal search
for time series classification. In IJCNN, 2021.
[209] Yankun Ren, Longfei Li, Xinxing Yang, and Jun Zhou. AutoTransformer: Automatic transformer architecture design
for time series classification. In PAKDD, 2022.
[210] Ming Jin, Qingsong Wen, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen,
Xiaoli Li, et al. Large models for time series and spatio-temporal data: A survey and outlook. arXiv:2310.10196, 2023.
[211] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for
multivariate time series with missing values. Sci. Rep., 8(1):6085, 2018.
[212] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations.
In NeurIPS, volume 31, 2018.
[213] Satya Narayan Shukla and Benjamin M Marlin. Interpolation-prediction networks for irregularly sampled time series.
arXiv:1909.07782, 2019.
[214] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda
Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In
ICML, pages 8748–8763, 2021.
[215] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and
Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML,
pages 4904–4916, 2021.