
Learning Deep Time-index Models for Time Series Forecasting

Gerald Woo 1,2, Chenghao Liu 1, Doyen Sahoo 1, Akshat Kumar 2, Steven Hoi 1

Abstract

Deep learning has been actively applied to time series forecasting, leading to a deluge of new methods belonging to the class of historical-value models. Yet, despite the attractive properties of time-index models, such as being able to model the continuous nature of underlying time series dynamics, little attention has been given to them. Indeed, while naive deep time-index models are far more expressive than the manually predefined function representations of classical time-index models, they are inadequate for forecasting, being unable to generalize to unseen time steps due to the lack of inductive bias. In this paper, we propose DeepTime, a meta-optimization framework to learn deep time-index models which overcome these limitations, yielding an efficient and accurate forecasting model. Extensive experiments on real world datasets in the long sequence time-series forecasting setting demonstrate that our approach achieves competitive results with state-of-the-art methods, and is highly efficient. Code is available at https://github.com/salesforce/DeepTime.

[Figure 1: two schematic panels, (a) Historical-value Models and (b) Time-index Models, each showing the train set, test set, input, and output over time.] Figure 1. Visual comparison between the two paradigms of historical-value and time-index models. After the models are trained on the training set, forecasts are made over the test set, which may be arbitrarily long. Historical-value models make predictions conditioned on an input lookback window, whereas time-index models do not make use of new incoming information during the test phase.

1. Introduction

Time series forecasting has important applications across business and scientific domains, such as demand forecasting (Carbonneau et al., 2008), capacity planning and management (Kim, 2003), and anomaly detection (Laptev et al., 2017). There are two typical approaches to time series forecasting – historical-value and time-index models. Historical-value models predict future time step(s) as a function of past observations, ŷt+1 = f(yt, yt−1, . . .) (Benidis et al., 2020), while time-index models, a less studied approach, are defined to be models whose predictions are purely functions of the corresponding time-index features at future time step(s), ŷt+1 = f(τt+1) (see Figure 1 for a visual comparison). Historical-value models have been widely used due to their simplicity. However, they can only model temporal relationships at the data sampling frequency. This is an issue since the time series observations available to us tend to have a much lower resolution (sampled at a lower frequency) than the underlying dynamics (Gong et al., 2015; 2017), leading historical-value models to be prone to capturing spurious correlations. On the other hand, time-index models intrinsically avoid this problem, directly modeling the mapping from time-index features to predictions in continuous space and learning signal representations which change smoothly and correlate with each other in continuous space.

Classical time-index models (Taylor & Letham, 2018; Hyndman & Athanasopoulos, 2018; Ord et al., 2017) rely on predefined function forms to generate predictions. They optionally follow the structural time series formulation (Harvey & Shephard, 1993), yt = g(τt) + s(τt) + h(τt), where g, s, h represent the trend, periodic, and holiday components respectively. For example, g could be predefined as a linear, polynomial, or even a piece-wise linear function. While these functions are simple and easy to learn, they have limited capacity, are unable to fit more complex time series, and such predefined function forms may be too stringent an assumption which may not hold in practice. While it is possible to perform model selection across various function forms, this requires either strong domain expertise, or computationally heavy cross-validation across a large set of parameters.

1 Salesforce Research Asia. 2 School of Computing and Information Systems, Singapore Management University. Correspondence to: Gerald Woo <[email protected]>, Chenghao Liu <[email protected]>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).


[Figure 2: plot of a time series over roughly 700 time steps, showing the ground truth, the model's reconstruction of the training interval, and its forecast.] Figure 2. Ground truth time series and predictions from a vanilla deep time-index model. The reconstruction of historical training data as well as the forecast over the horizon is visualized. Observe that it overfits to historical data and is unable to extrapolate. This model corresponds to "+ Local" in Table 4 of our ablations.

Deep learning gives a seemingly natural solution to this problem faced by classical time-index models – parameterize f(τt) as a neural network, and learn the function form in a purely data-driven manner. While neural networks are extremely expressive with a strong capability to approximate complex functions, we argue that when trained via standard supervised learning, they face the debilitating problem of an inability to extrapolate across the forecast horizon, i.e. they are unable to generalize to future time step(s). This is visualized in Figure 2, where a deep time-index model is able to fit the training data well in the interval [0, T′], but is unable to produce meaningful forecasts for the interval [T′, T]. This arises due to the lack of extrapolation capability of neural networks when used in time-index models. While classical time-index models have limited expressivity, their strong inductive biases (e.g. linear trend, periodicity) instruct their predictions over unseen time steps. On the other hand, vanilla deep time-index models do not have such inductive biases, and thus, while they do learn an appropriate function form over the training data, achieving an extremely good reconstruction loss, they have little to no generalization capability over unseen time steps.

We now raise the question – how do we achieve the best of both worlds, retaining the flexibility and expressiveness of the neural network, while also learning an inductive bias from time series data that guarantees extrapolation capability over unseen time steps? We present a solution, DeepTime, a meta-optimization framework to learn deep time-index models for time series forecasting. Our framework splits the learning process of deep time-index models into two stages, the inner and outer learning processes. The inner learning process acts as the standard supervised learning process, fitting parameters to recent time steps. The outer learning process enables the deep time-index model to learn a strong inductive bias for extrapolation from data. We summarize our contributions as follows:

• We introduce the use of deep time-index models for time series forecasting, proposing a meta-optimization framework to address their shortcomings, making them viable for forecasting.

• We introduce a specific function form for deep time-index models by leveraging implicit neural representations (Sitzmann et al., 2020b), and a novel concatenated Fourier features module to efficiently learn high frequency patterns in time series. We also specify an efficient instantiation of the meta-optimization procedure via a closed-form ridge regressor (Bertinetto et al., 2019).

• We conduct extensive experiments on the long sequence time series forecasting (LSTF) setting, demonstrating DeepTime to be extremely competitive with state-of-the-art baselines. At the same time, DeepTime is highly efficient in terms of runtime and memory.

2. Related Work

Neural Forecasting. Neural forecasting (Benidis et al., 2020) methods have seen great success in recent times. One related line of research is Transformer-based methods for LSTF (Li et al., 2019; Zhou et al., 2021; Xu et al., 2021; Woo et al., 2022; Zhou et al., 2022), which aim not only to achieve high accuracy, but also to overcome the vanilla attention's quadratic complexity. Fully connected methods (Oreshkin et al., 2020; Challu et al., 2022) have also shown success, with Challu et al. (2022) introducing hierarchical interpolation and multi-rate data sampling for the LSTF task. Bi-level optimization in the form of meta-learning and the use of differentiable closed-form solvers has been explored in time series forecasting (Grazzi et al., 2021), for the purpose of adapting to new time series datasets, where tasks are defined to be the entire time series.

Time-index Models. Time-index models take as input time-index features such as datetime features to predict the value of the time series at that time step. They have been well explored as a special case of regression analysis (Hyndman & Athanasopoulos, 2018; Ord et al., 2017), and many different predictors have been proposed for the classical setting, including linear, polynomial, and piecewise linear trends, and dummy variables indicating holidays. Of note, Fourier terms have been used to model periodicity, or seasonal patterns, also known as harmonic regression (Young et al., 1999). Prophet (Taylor & Letham, 2018) is a popular classical approach which uses a structural time series formulation, specialized for business forecasting. Another classical approach of note is Gaussian Processes (Rasmussen, 2003; Corani et al., 2021), which are non-parametric models, often requiring complex kernel engineering. Godfrey & Gashler (2017) introduced an initial attempt at using time-index based neural networks to fit a time series for forecasting. Yet, their work is more reminiscent of classical methods, manually specifying periodic and non-periodic activation functions, analogous to the representation functions.


Implicit Neural Representations. INRs have recently gained popularity in the area of neural rendering (Tewari et al., 2021). They parameterize a signal as a continuous function, mapping a coordinate to the value at that coordinate. A key finding was that positional encodings (Mildenhall et al., 2020; Tancik et al., 2020) are critical for ReLU MLPs to learn high frequency details, while another line of work introduced periodic activations (Sitzmann et al., 2020b). Meta-learning via INRs has been explored for various data modalities, typically over images or for neural rendering tasks (Sitzmann et al., 2020a; Tancik et al., 2021; Dupont et al., 2021), using both hypernetworks and optimization-based approaches. Yüce et al. (2021) show that meta-learning on INRs is analogous to dictionary learning. In time series, Jeong & Shin (2022) explored using INRs for anomaly detection, opting to make use of periodic activations and temporal positional encodings.

3. DeepTime

In this section, we first formally describe the time series forecasting problem, and how to use time-index models for this problem setting. Next, we introduce the model architecture specifics for deep time-index models. Finally, we introduce the generic form of our meta-optimization framework, and then, specifically, how to use a differentiable closed-form ridge regression module to perform the meta-optimization efficiently. Pseudocode of DeepTime is available in Appendix A.

Problem Formulation. In time series forecasting, we consider a time series dataset (y1, y2, . . . , yT), where yt ∈ R^m is the m-dimensional observation at time t. Consider a train-test split such that the range (1, . . . , T′) is considered to be the training set and the range (T′ + 1, . . . , T) is the test set, where T − T′ ≥ H. The goal is to construct point forecasts over a horizon of length H, over the test set, Ŷt:t+H = [ŷt; . . . ; ŷt+H−1], ∀t = T′ + 1, T′ + 2, . . . , T − H + 1. A time-index model, f : R → R^m, f : τt ↦ ŷt, achieves this by minimizing a reconstruction loss L : R^m × R^m → R over the training set, where τt is a time-index feature whose values are known for all time steps. Then, we can query it over the test set to obtain forecasts, Ŷt:t+H = f(τt:t+H).

3.1. Deep Time-index Model Architecture

INRs (Sitzmann et al., 2020b) are a class of coordinate-based models, mapping coordinates to values, which have been extensively studied. Time-index models are a case of 1d coordinate-based models; thus, we leverage this existing class of models, which are essentially a stack of multi-layer perceptrons, as our proposed deep time-index model architecture. Visualized in Figure 3a, a K-layered ReLU (Nair & Hinton, 2010) INR is a function fθ : R^c → R^m where:

z^(0) = τ
z^(k+1) = max(0, W^(k) z^(k) + b^(k)),  k = 0, . . . , K − 1
fθ(τ) = W^(K) z^(K) + b^(K)    (1)

where τ ∈ R^c are time-index features. MLPs have been shown to experience difficulty in learning high frequency functions, a problem known as "spectral bias" (Rahaman et al., 2019; Basri et al., 2020). Coordinate-based methods suffer from this issue in particular when trying to represent high frequency content present in the signal. Tancik et al. (2020) introduced a random Fourier features layer which allows INRs to fit high frequency functions, by modifying z^(0) = γ(τ) = [sin(2πBτ), cos(2πBτ)]^T, where each entry in B ∈ R^{d/2×c} is sampled from N(0, σ²), d is the hidden dimension size of the INR, and σ² is the scale hyperparameter. [·, ·] is a row-wise stacking operation.

Concatenated Fourier Features. While the random Fourier features layer endows INRs with the ability to learn high frequency patterns, one major drawback is the need to perform a hyperparameter sweep for each task and dataset to avoid over- or underfitting. We overcome this limitation with a simple scheme of concatenating multiple Fourier basis functions with diverse scale parameters, i.e. γ(τ) = [sin(2πB1τ), cos(2πB1τ), . . . , sin(2πBSτ), cos(2πBSτ)]^T, where elements in Bs ∈ R^{d/2×c} are sampled from N(0, σs²), and W^(0) ∈ R^{d×Sd}. We perform an analysis in Section 5.3 and show that the performance of our proposed concatenated Fourier features (CFF) does not significantly deviate from the setting with the optimal scale parameter obtained from a hyperparameter sweep.
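As a concrete illustration, the following is a minimal PyTorch sketch of a deep time-index backbone with concatenated Fourier features, following Equation (1) and the γ(τ) definition above. The hidden size, layer count, and the particular set of scales σs are illustrative assumptions rather than the paper's tuned hyperparameters.

import math
import torch
import torch.nn as nn

class ConcatFourierFeatures(nn.Module):
    # gamma(tau) = [sin(2*pi*B_1 tau), cos(2*pi*B_1 tau), ..., sin(2*pi*B_S tau), cos(2*pi*B_S tau)]
    def __init__(self, in_dim: int = 1, hidden_dim: int = 256,
                 scales=(0.01, 0.1, 1.0, 5.0, 10.0, 20.0, 50.0, 100.0)):
        super().__init__()
        # One fixed random matrix B_s per scale, entries drawn from N(0, sigma_s^2), shape (hidden_dim // 2, in_dim).
        self.register_buffer(
            "B", torch.stack([s * torch.randn(hidden_dim // 2, in_dim) for s in scales])
        )

    def forward(self, tau: torch.Tensor) -> torch.Tensor:
        # tau: (T, in_dim) -> proj: (S, T, hidden_dim // 2) via a batched matmul over the S scales.
        proj = 2 * math.pi * tau @ self.B.transpose(-1, -2)
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)  # (S, T, hidden_dim)
        return torch.cat(feats.unbind(0), dim=-1)                      # (T, S * hidden_dim)

class INRBackbone(nn.Module):
    # Concatenated Fourier features followed by ReLU layers; outputs z^(K), the representation
    # that the closed-form ridge regressor of Section 3.2 maps to the target values.
    def __init__(self, in_dim: int = 1, hidden_dim: int = 256, n_layers: int = 5):
        super().__init__()
        cff = ConcatFourierFeatures(in_dim, hidden_dim)
        n_scales = cff.B.shape[0]
        layers = [cff, nn.Linear(n_scales * hidden_dim, hidden_dim), nn.ReLU()]
        for _ in range(n_layers - 2):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, tau: torch.Tensor) -> torch.Tensor:
        return self.net(tau)

# Example: representations for a [0, 1]-normalized time-index covering a lookback of 96 and horizon of 96.
reprs = INRBackbone()(torch.linspace(0, 1, 192).unsqueeze(-1))  # shape: (192, 256)

Note that the final layer W^(K) of Equation (1) is deliberately not part of this backbone; in DeepTime it is obtained in closed form by the ridge regressor described in Section 3.2.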


3.2. Meta-optimization Framework

As explained in Section 1, vanilla deep time-index models are unable to perform forecasting due to their failure to extrapolate beyond observed time-indices. Formally, the original hypothesis class of a time-index model is denoted H_INR = {f(τ; θ) | θ ∈ Θ}, where Θ is the parameter space. The original hypothesis class is too expressive, providing no guarantees that training on the lookback window leads to good extrapolation. To solve this, our meta-optimization framework learns an inductive bias for the deep time-index model from data. Firstly, rather than using the entire training set in a naive supervised learning setting, whereby older training points provide no additional benefit in learning a time-index model for extrapolation, we split the time series into lookback window and forecast horizon pairs. Let the L time steps preceding the forecast horizon at time step t be the lookback window, Yt−L:t = [yt−L; . . . ; yt−1]^T ∈ R^{L×m}. Next, consider the case where we split the model parameters into two, possibly overlapping, subsets, ϕ ∈ Φ and θ ∈ Θ, known as the meta and base parameters, respectively, where Φ is the parameter space of the meta parameters. The meta parameters are responsible for learning the inductive bias from multiple lookback window and forecast horizon pairs from the training data, while the base parameters aim to learn and adapt quickly to the lookback window at test time. Thus, we aim to encode an inductive bias in ϕ, learned to enable extrapolation across the forecast horizon when the base parameters adapt to the corresponding lookback window, resulting in H_Meta = {f(τ; θ, ϕ*) | θ ∈ Θ}. This is achieved by formulating the bi-level optimization problem:

ϕ* = arg min_ϕ Σ_{t=L+1}^{T−H+1} Σ_{i=0}^{H−1} L(f(τt+i; θt*, ϕ), yt+i)    (2)

s.t. θt* = arg min_θ Σ_{j=−L}^{−1} L(f(τt+j; θ, ϕ), yt+j)    (3)

[Figure 3: (a) the deep time-index model architecture, a concatenated Fourier features module followed by MLP layers mapping τt to ŷt; (b) the meta-optimization framework, with the inner loop fitting θt* over the lookback window Yt−L:t (j ∈ [t − L, t − 1]) and the outer loop updating ϕ over the forecast horizon Yt:t+H (i ∈ [t, t + H − 1]).] Figure 3. Left: Our proposed deep time-index model has a simple overall architecture, comprising a concatenated Fourier features module and several layers of MLPs. Right: The meta-optimization framework aims to learn the meta parameters of deep time-index models by following a bi-level optimization problem formulation (Equations (2) and (3)). The inner loop is optimized over the lookback window, yielding the optimal base parameters for a locally stationary distribution, θt*. Each lookback window has its own optimal base parameters, which are then used to predict the immediately following forecast horizon. This composes the outer loop, which optimizes the meta parameters over the entire time series, yielding an inductive bias which enables extrapolation over forecast horizons, given the optimal base parameters.

Illustrated in Figure 3b, Equation (2) represents the outer loop, and Equation (3) the inner loop. The first summation in the outer loop, over index t, represents iterating over the time steps of the dataset, and the second summation, over index i, represents each time step of the forecast horizon. The summation in the inner loop, over index j, represents each time step of the lookback window.

Fast and Efficient Meta-optimization. Following the above generic formulation of the meta-optimization framework to learn deep time-index models, we now describe a specific instantiation of the framework which enables both training and forecasting at test time to be fast and efficient. Similar bi-level optimization problems have been explored (Ravi & Larochelle, 2017; Finn et al., 2017), and one naive approach is to directly backpropagate through inner gradient steps. However, such methods are highly inefficient, have many additional hyperparameters, and are unstable during training (Antoniou et al., 2019). Instead, to achieve speed and efficiency, we specify that the base parameters consist of only the last layer of the INR, while the rest of the INR are meta parameters. Thus, the inner loop optimization only applies to this last layer. This transforms the inner loop optimization problem into a simple ridge regression problem for the case of the mean squared error loss, having a simple analytic solution to replace the otherwise complicated non-linear optimization problem (Bertinetto et al., 2019).

Formally, for a K-layered model, ϕ = {W^(0), b^(0), . . . , W^(K−1), b^(K−1), λ} are the meta parameters and θ = {W^(K)} are the base parameters, following the notation from Equation (1). Then let gϕ : R → R^d be the meta learner, where gϕ(τ) = z^(K). For a lookback-horizon pair, (Yt−L:t, Yt:t+H), the features of the lookback window obtained from the meta learner are denoted Zt−L:t = [gϕ(τt−L); . . . ; gϕ(τt−1)]^T ∈ R^{L×d}, where [·; ·] is a column-wise concatenation operation. The inner loop thus solves the optimization problem:

Wt^(K)* = arg min_W ||Zt−L:t W − Yt−L:t||² + λ||W||² = (Zt−L:t^T Zt−L:t + λI)^{−1} Zt−L:t^T Yt−L:t    (4)

Now, let Zt:t+H = [gϕ(τt); . . . ; gϕ(τt+H−1)]^T ∈ R^{H×d} be the features of the forecast horizon. Then, our predictions are Ŷt:t+H = Zt:t+H Wt^(K)*. This closed-form solution is differentiable, which enables gradient updates on the parameters of the meta learner, ϕ. A bias term can be included for the closed-form ridge regressor by appending a scalar 1 to the feature vector gϕ(τ). The end result of training DeepTime on a dataset is the restricted hypothesis class H_DeepTime = {gϕ*(τ)^T W^(K) | W^(K) ∈ R^{d×m}}. Finally, we propose to use relative time-index features, τt+i = (i + L)/(L + H − 1) for i = −L, −L + 1, . . . , H − 1, i.e. a [0, 1]-normalized time-index.


4. Theoretical Analysis

In the following, we derive a generalization bound of DeepTime under the PAC-Bayes framework (Amit & Meir, 2018). Given a long time series, suppose we can split it into n instances (each with a length-L lookback window and a length-H forecast horizon) for training. Then the k-th instance is denoted Sk = {zk−L, . . . , zk, . . . , zk+H−1}, where zt = (τt, yt). The PAC-Bayes framework for our proposed meta-optimization framework considers a Bayesian setting of DeepTime, having prior and posterior distributions of the meta and base parameters. The generalization bound is presented below, with the detailed proof in Appendix D.

Theorem 4.1. (Generalization Bound) Let Q, Q be arbitrary distributions of ϕ, θ, which are defined in Equation (2) and Equation (3), and P, P be the prior distributions of ϕ, θ. Then for any c1, c2 > 0 and any δ ∈ (0, 1], with probability at least 1 − δ, the following inequality holds uniformly for all hyper-posterior distributions Q,

er(Q) ≤ [c1 c2 / ((1 − e^{−c1})(1 − e^{−c2}))] · (1/n) Σ_{k=1}^{n} êr(Q, Sk)
      + [c1 / (1 − e^{−c1})] · (KL(Q||P) + log(2/δ)) / (n c1)
      + [c1 c2 / ((1 − e^{−c2})(1 − e^{−c1}))] · (KL(Q||P) + log(2n/δ)) / ((H + L) c2)    (5)

where er(Q) and êr(Q, Sk) are the generalization error and training error of DeepTime, respectively.

Theorem 4.1 states that the expected generalization error of DeepTime is bounded by the empirical error plus two complexity terms. The first term represents the complexity between the meta distributions, Q, P, as well as between instances, converging to zero if we observe an infinitely long time series (n → ∞). The second term represents the complexity between the base distributions, Q, P, and of each instance, or equivalently, the lookback window and forecast horizon. This term converges to zero when there are a sufficient number of time steps in each instance (H + L → ∞).

5. Experiments

We evaluate DeepTime on both synthetic and real-world datasets. We ask the following questions: (i) Is DeepTime, trained on a family of functions following the same parametric form, able to perform extrapolation on unseen functions? (ii) How does DeepTime compare to other forecasting models on real-world data? (iii) What are the key contributing factors to the good performance of DeepTime?

5.1. Experiments on Synthetic Data

We first consider DeepTime's ability to extrapolate on the following functions specified by some parametric form: (i) the family of linear functions, y = ax + b; (ii) the family of cubic functions, y = ax³ + bx² + cx + d; and (iii) sums of sinusoids, Σ_j Aj sin(ωj x + φj). Parameters of the functions (i.e. a, b, c, d, Aj, ωj, φj) are sampled randomly (further details in Appendix E) to construct distinct tasks. A total of 400 time steps are sampled, with a lookback window length of 200 and a forecast horizon of 200. Figure 4 demonstrates that DeepTime is able to perform extrapolation on unseen test functions/tasks after being trained via our meta-optimization formulation. It demonstrates an ability to approximate and adapt, based on the lookback window, to linear and cubic polynomials, and even sums of sinusoids. Next, we evaluate DeepTime on real-world datasets, against state-of-the-art baselines.

5.2. Experiments on Real-world Data

Experiments are performed on 6 real-world datasets – Electricity Transformer Temperature (ETT), Electricity Consuming Load (ECL), Exchange, Traffic, Weather, and Influenza-like Illness (ILI) – with full details in Appendix F. We evaluate the performance of our proposed approach using two metrics, the mean squared error (MSE) and mean absolute error (MAE). The datasets are split into train, validation, and test sets chronologically, following a 70/10/20 split for all datasets except for ETTm2, which follows a 60/20/20 split, as per convention. The univariate benchmark selects the last index of the multivariate dataset as the target variable, following previous work (Xu et al., 2021). Preprocessing of the data is performed by standardization based on train set statistics. Hyperparameter selection is performed on only one value, the lookback length multiplier, L = µ ∗ H, which decides the length of the lookback window. We search through the values µ = [1, 3, 5, 7, 9], and select the best value based on the validation loss. Further implementation details on DeepTime are reported in Appendix G, and detailed hyperparameters are reported in Appendix H. Reported results for DeepTime are averaged over three runs, and standard deviations are reported in Appendix J.


[Figure 4: three panels, (a) Linear, (b) Cubic, and (c) Sum of Sinusoids, each showing the ground truth and DeepTime's forecast over 400 time steps.] Figure 4. Predictions of DeepTime on three unseen functions for each function class. The orange line represents the split between lookback window and forecast horizon.

Table 1. Multivariate forecasting benchmark on long sequence time-series forecasting. Best results are highlighted in bold, and second
best results are underlined.
Methods DeepTime NS Transformer N-HiTS ETSformer FEDformer Autoformer Informer LogTrans GP
Metrics MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE

ETTm2
96 0.166 0.257 0.192 0.274 0.176 0.255 0.189 0.280 0.203 0.287 0.255 0.339 0.365 0.453 0.768 0.642 0.442 0.422
192 0.225 0.302 0.280 0.339 0.245 0.305 0.253 0.319 0.269 0.328 0.281 0.340 0.533 0.563 0.989 0.757 0.605 0.505
336 0.277 0.336 0.334 0.361 0.295 0.346 0.314 0.357 0.325 0.366 0.339 0.372 1.363 0.887 1.334 0.872 0.731 0.569
720 0.383 0.409 0.417 0.413 0.401 0.426 0.414 0.413 0.421 0.415 0.422 0.419 3.379 1.388 3.048 1.328 0.959 0.669

ECL
96 0.137 0.238 0.169 0.273 0.147 0.249 0.187 0.304 0.183 0.297 0.201 0.317 0.274 0.368 0.258 0.357 0.503 0.538
192 0.152 0.252 0.182 0.286 0.167 0.269 0.199 0.315 0.195 0.308 0.222 0.334 0.296 0.386 0.266 0.368 0.505 0.543
336 0.166 0.268 0.200 0.304 0.186 0.290 0.212 0.329 0.212 0.313 0.231 0.338 0.300 0.394 0.280 0.380 0.612 0.614
720 0.201 0.302 0.222 0.321 0.243 0.340 0.233 0.345 0.231 0.343 0.254 0.361 0.373 0.439 0.283 0.376 0.652 0.635

Exchange
96 0.081 0.205 0.111 0.237 0.092 0.211 0.085 0.204 0.139 0.276 0.197 0.323 0.847 0.752 0.968 0.812 0.136 0.267
192 0.151 0.284 0.219 0.335 0.208 0.322 0.182 0.303 0.256 0.369 0.300 0.369 1.204 0.895 1.040 0.851 0.229 0.348
336 0.314 0.412 0.421 0.476 0.371 0.443 0.348 0.428 0.426 0.464 0.509 0.524 1.672 1.036 1.659 1.081 0.372 0.447
720 0.856 0.663 1.092 0.769 0.888 0.723 1.025 0.774 1.090 0.800 1.447 0.941 2.478 1.310 1.941 1.127 1.135 0.810

Traffic
96 0.390 0.275 0.612 0.338 0.402 0.282 0.607 0.392 0.562 0.349 0.613 0.388 0.719 0.391 0.684 0.384 1.112 0.665
192 0.402 0.278 0.613 0.340 0.420 0.297 0.621 0.399 0.562 0.346 0.616 0.382 0.696 0.379 0.685 0.390 1.133 0.671
336 0.415 0.288 0.618 0.328 0.448 0.313 0.622 0.396 0.570 0.323 0.622 0.337 0.777 0.420 0.733 0.408 1.274 0.723
720 0.449 0.307 0.653 0.355 0.539 0.353 0.632 0.396 0.596 0.368 0.660 0.408 0.864 0.472 0.717 0.396 1.280 0.719

Weather
96 0.166 0.221 0.173 0.223 0.158 0.195 0.197 0.281 0.217 0.296 0.266 0.336 0.300 0.384 0.458 0.490 0.395 0.356
192 0.207 0.261 0.245 0.285 0.211 0.247 0.237 0.312 0.276 0.336 0.307 0.367 0.598 0.544 0.658 0.589 0.450 0.398
336 0.251 0.298 0.321 0.338 0.274 0.300 0.298 0.353 0.339 0.380 0.359 0.359 0.578 0.523 0.797 0.652 0.508 0.440
720 0.301 0.338 0.414 0.410 0.351 0.353 0.352 0.388 0.403 0.428 0.419 0.419 1.059 0.741 0.869 0.675 0.498 0.450

ILI
24 2.425 1.086 2.294 0.945 1.862 0.869 2.527 1.020 2.203 0.963 3.483 1.287 5.764 1.677 4.480 1.444 2.331 1.036
36 2.231 1.008 1.825 0.848 2.071 0.969 2.615 1.007 2.272 0.976 3.103 1.148 4.755 1.467 4.799 1.467 2.167 1.002
48 2.230 1.016 2.010 0.900 2.346 1.042 2.359 0.972 2.209 0.981 2.669 1.085 4.763 1.469 4.800 1.468 2.961 1.180
60 2.143 0.985 2.178 0.963 2.560 1.073 2.487 1.016 2.545 1.061 2.770 1.125 5.264 1.564 5.278 1.560 3.108 1.214

Results. We compare DeepTime to the following baselines for the multivariate setting: N-HiTS (Challu et al., 2022), ETSformer (Woo et al., 2022), FEDformer (Zhou et al., 2022) (we report the best score for each setting from the two variants they present), Autoformer (Xu et al., 2021), Informer (Zhou et al., 2021), LogTrans (Li et al., 2019), Non-stationary (NS) Transformer (Liu et al., 2022), and Gaussian Process (GP) (Rasmussen, 2003). For the univariate setting, we include additional univariate forecasting models: N-BEATS (Oreshkin et al., 2020), DeepAR (Salinas et al., 2020), Prophet (Taylor & Letham, 2018), and ARIMA. Baseline results are obtained from the respective papers. Table 1 and Table 7 (in Appendix I for space) summarize the multivariate and univariate forecasting results respectively. DeepTime achieves state-of-the-art performance on 20 out of 24 settings in MSE, and 17 out of 24 settings in MAE, on the multivariate benchmark, and also achieves competitive results on the univariate benchmark despite its simple architecture compared to the baselines comprising complex fully connected architectures and computationally intensive Transformer architectures.

5.3. Empirical Analysis and Ablation Studies

We first perform empirical analyses informed by the insights from our theoretical analysis. Theorem 4.1 states that generalization error is bounded by training error and two complexity terms. Figure 5 and Table 2 analyse the test error (which approximates generalization error) as the number of instances, n, and the lookback window length, L, vary. For the first complexity term, controlled by the denominator n, we observe in Figure 5 that test error decreases as n increases.


For the second complexity term, controlled by the length of the lookback and horizon, since H is an experimental setting, we perform a sensitivity analysis on the lookback length, setting L = µ ∗ H. Similarly, test error decreases as L increases, plateauing and even increasing slightly as L grows extremely large. We expect test error to plateau as the associated term goes to zero. As the number of instances available for training decreases as L grows large, the increase in test error can be attributed to a decrease in n. Additional sensitivity analysis on our proposed concatenated Fourier features can be found in Appendix K, showing that it performs no worse than an extensive hyperparameter sweep on a standard random Fourier features layer.

[Figure 5: MSE against the number of training instances for horizon lengths H = 96, 192, 336, 720.] Figure 5. Analysis on the number of instances, n. MSE is measured as the number of instances increases for multiple horizon lengths. Analysis is performed based on the ETTm2 dataset.

Table 2. Analysis on the lookback window length, based on a multiplier on horizon length, L = µ ∗ H. Results presented on the ETTm2 dataset. Best results are highlighted in bold.
Horizon 96 192 336 720
µ MSE MAE MSE MAE MSE MAE MSE MAE
1 0.192 0.287 0.255 0.332 0.294 0.354 0.383 0.409
3 0.172 0.264 0.228 0.304 0.277 0.336 0.371 0.403
5 0.168 0.259 0.225 0.302 0.275 0.337 0.389 0.420
7 0.166 0.257 0.223 0.300 0.279 0.343 0.440 0.451
9 0.165 0.258 0.223 0.301 0.285 0.350 0.409 0.434

In Table 3 we perform an ablation study on various backbone architectures, while retaining the differentiable closed-form ridge regressor. We observe a degradation when the random Fourier features layer is removed, due to the spectral bias problem which neural networks face (Rahaman et al., 2019; Tancik et al., 2020). DeepTime outperforms the SIREN variant of INRs, which is consistent with observations in the INR literature. DeepTime also outperforms the RNN variant, which is the model proposed in (Grazzi et al., 2021). This is a direct comparison between IMS historical-value models and time-index models, and highlights the benefits of time-index models.

Table 3. Ablation study on backbone models. DeepTime refers to our proposed approach, an INR with random Fourier features sampled from a range of scales. MLP refers to replacing the random Fourier features with a linear map from input dimension to hidden dimension. SIREN refers to an INR with periodic activations as proposed by Sitzmann et al. (2020b). RNN refers to an autoregressive recurrent neural network (inputs are the time-series values, yt). All approaches include the differentiable closed-form ridge regressor. Further model details can be found in Appendix L.2.
Methods DeepTime MLP SIREN RNN
Metrics MSE MAE MSE MAE MSE MAE MSE MAE
ETTm2
96 0.166 0.257 0.186 0.284 0.236 0.325 0.233 0.324
192 0.225 0.302 0.265 0.338 0.295 0.361 0.275 0.337
336 0.277 0.336 0.316 0.372 0.327 0.386 0.344 0.383
720 0.383 0.409 0.401 0.417 0.438 0.453 0.431 0.432

We perform an ablation study to understand how various training schemes and input features affect the performance of DeepTime. Table 4 presents these results. First, we observe that our meta-optimization formulation is a critical component to the success of DeepTime. We note that DeepTime without meta-optimization may not be a meaningful baseline since the model outputs are always the same regardless of the input lookback window. Including datetime features helps alleviate this issue, yet we observe that the inclusion of datetime features generally leads to a degradation in performance. In the case of DeepTime, we observed that the inclusion of datetime features leads to a much lower training loss, but a degradation in test performance – this is a case of meta-learning memorization (Yin et al., 2020) due to the tasks becoming non-mutually exclusive (Rajendran et al., 2020). We also observe that the meta-optimization formulation is indeed superior to training a model from scratch for each lookback window. Finally, while we expect full MAML to always outperform the fast and efficient meta-optimization, in reality there are many complications in such gradient-based bi-level optimization methods – they are difficult to optimize, and unstable during training. Restricting the base parameters to only the last layer of the INR provides a useful prior which enables stable optimization and high generalization without facing these problems.

5.4. Computational Efficiency

Finally, we analyse DeepTime's efficiency in both runtime and memory usage, with respect to both lookback window and forecast horizon lengths. The main bottleneck in computation for DeepTime is the matrix inversion operation in the ridge regressor, canonically of O(n³) complexity. This is a major concern for DeepTime as n is linked to the length of the lookback window. As mentioned in (Bertinetto et al., 2019), the Woodbury formulation,

W* = Z^T (ZZ^T + λI)^{−1} Y,

is used to alleviate the problem, leading to an O(d³) complexity, where d is the hidden size hyperparameter, fixed to some value (see Appendix H). Figure 6 demonstrates that DeepTime is highly efficient, even when compared to efficient Transformer models recently proposed for the long sequence time series forecasting task, as well as fully connected models.


Table 4. Ablation study on variants of DeepTime. Starting from the original version, we add (+) or remove (-) some component from
DeepTime. Datetime refers to datetime features. RR stands for the differentiable closed-form ridge regressor, removing it refers to
replacing this module with a simple linear layer trained via gradient descent across all training samples (i.e. without meta-optimization
formulation). Local refers to training an INR from scratch via gradient descent for each lookback window (RR is not used here, and there
is no training phase). + Finetune refers to training an INR via gradient descent for each lookback window on top of having a training
phase. Full MAML refers to performing gradient steps for the inner loop and backpropagating through them for the outer loop as in (Finn
et al., 2017). Further details on the variants can be found in Appendix L.1.
Methods DeepTime | + Datetime | − RR | − RR + Datetime | + Local | + Local + Datetime | + Finetune | + Finetune + Datetime | Full MAML
Metrics MSE MAE for each method, in that order
ETTm2
96 0.166 0.257 0.226 0.303 3.072 1.345 3.393 1.400 0.251 0.331 0.250 0.327 3.028 1.328 3.242 1.365 0.235 0.326
192 0.225 0.302 0.309 0.362 3.064 1.343 3.269 1.381 0.322 0.371 0.323 0.366 3.043 1.341 3.385 1.391 0.295 0.361
336 0.277 0.336 0.341 0.381 2.920 1.309 3.442 1.401 0.370 0.412 0.367 0.396 2.950 1.331 3.367 1.387 0.348 0.392
720 0.383 0.409 0.453 0.447 2.773 1.273 3.400 1.399 0.443 0.449 0.455 0.461 2.721 1.253 3.476 1.407 0.491 0.484

[Figure 6: (a) runtime analysis and (b) memory analysis, comparing DeepTime, ETSformer (K=1), Autoformer, Informer, N-BEATS, and N-HiTS as the lookback window and forecast horizon lengths vary from 48 to 1440.] Figure 6. Computational efficiency benchmark on the ETTm2 multivariate dataset, on a batch size of 32. Runtime is measured for one iteration (forward + backward pass). Left: Runtime/Memory usage as lookback length varies, horizon is fixed to 48. Right: Runtime/Memory usage as horizon length varies, lookback length is fixed to 48. Further model details can be found in Appendix M.

6. Discussion

In this paper, we proposed DeepTime, a deep time-index model learned via a meta-optimization framework, to automatically learn a function form from time series data, rather than manually defining the representation function as per classical methods. DeepTime resolves the issues arising for vanilla deep time-index models by splitting the learning process into inner and outer learning processes, where the outer learning process enables the deep time-index model to learn a strong inductive bias for extrapolation from data. We propose a fast and efficient instantiation of the meta-optimization framework, using a closed-form ridge regressor. We also enhance deep time-index models with a novel concatenated Fourier features module to efficiently learn high frequency patterns in time series. Our extensive empirical analysis shows that DeepTime, while being a much simpler model architecture compared to prevailing state-of-the-art methods, achieves competitive performance across forecasting benchmarks on real world datasets. We perform substantial ablation studies to identify the key components contributing to the success of DeepTime, and also show that it is highly efficient.

Limitations & Future Work. Despite having verified DeepTime's effectiveness, we expect some under-performance in cases where the lookback window contains significant anomalies, or an abrupt change point. Next, while out of scope for our current work, a limitation that DeepTime faces is that it does not consider holidays and events. We leave the consideration of such features as a potential future direction, along with the incorporation of exogenous covariates and datetime features, whilst avoiding the incursion of the meta-learning memorization problem. Finally, time-index models are a natural fit for missing value imputation, as well as other time series intelligence tasks for irregular time series – this is another interesting future direction to extend deep time-index models towards.


References

Amit, R. and Meir, R. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, pp. 205–214. PMLR, 2018.

Antoniou, A., Edwards, H., and Storkey, A. How to train your MAML. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HJGven05Y7.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization, 2016. URL https://arxiv.org/abs/1607.06450.

Basri, R., Galun, M., Geifman, A., Jacobs, D., Kasten, Y., and Kritchman, S. Frequency bias in neural networks for input of non-uniform density. In International Conference on Machine Learning, pp. 685–694. PMLR, 2020.

Benidis, K., Rangapuram, S. S., Flunkert, V., Wang, B., Maddix, D., Turkmen, C., Gasthaus, J., Bohlke-Schneider, M., Salinas, D., Stella, L., et al. Neural forecasting: Introduction and literature overview. arXiv preprint arXiv:2004.10240, 2020.

Bertinetto, L., Henriques, J. F., Torr, P., and Vedaldi, A. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyxnZh0ct7.

Carbonneau, R., Laframboise, K., and Vahidov, R. Application of machine learning techniques for supply chain demand forecasting. European Journal of Operational Research, 184(3):1140–1154, 2008.

Catoni, O. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. arXiv preprint arXiv:0712.0248, 2007.

Challu, C., Olivares, K. G., Oreshkin, B. N., Garza, F., Mergenthaler, M., and Dubrawski, A. N-HiTS: Neural hierarchical interpolation for time series forecasting. arXiv preprint arXiv:2201.12886, 2022.

Chevillon, G. Direct multi-step estimation and forecasting. Journal of Economic Surveys, 21(4):746–785, 2007.

Corani, G., Benavoli, A., and Zaffalon, M. Time series forecasting with Gaussian processes needs priors. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 103–117. Springer, 2021.

Dupont, E., Teh, Y. W., and Doucet, A. Generative models as distributions of functions. arXiv preprint arXiv:2102.04776, 2021.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1126–1135. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/finn17a.html.

Godfrey, L. B. and Gashler, M. S. Neural decomposition of time-series data for effective generalization. IEEE Transactions on Neural Networks and Learning Systems, 29(7):2973–2985, 2017.

Gong, M., Zhang, K., Schoelkopf, B., Tao, D., and Geiger, P. Discovering temporal causal relations from subsampled data. In International Conference on Machine Learning, pp. 1898–1906. PMLR, 2015.

Gong, M., Zhang, K., Schölkopf, B., Glymour, C., and Tao, D. Causal discovery from temporally aggregated time series. In Uncertainty in Artificial Intelligence, volume 2017. NIH Public Access, 2017.

Grazzi, R., Flunkert, V., Salinas, D., Januschowski, T., Seeger, M., and Archambeau, C. Meta-forecasting by combining global deep representations with local adaptation. arXiv preprint arXiv:2111.03418, 2021.

Harvey, A. C. and Shephard, N. 10 Structural time series models. In Econometrics, volume 11 of Handbook of Statistics, pp. 261–302. Elsevier, 1993. doi: https://doi.org/10.1016/S0169-7161(05)80045-8. URL https://www.sciencedirect.com/science/article/pii/S0169716105800458.

Hyndman, R. J. and Athanasopoulos, G. Forecasting: principles and practice. OTexts, 2018.

Jeong, K.-J. and Shin, Y.-M. Time-series anomaly detection with implicit neural representation. arXiv preprint arXiv:2201.11950, 2022.

Kim, K.-j. Financial time series forecasting using support vector machines. Neurocomputing, 55(1-2):307–319, 2003.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Lai, G., Chang, W.-C., Yang, Y., and Liu, H. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104, 2018.

Laptev, N., Yosinski, J., Li, L. E., and Smyl, S. Time-series extreme event forecasting with neural networks at Uber. In International Conference on Machine Learning, volume 34, pp. 1–5, 2017.

Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.-X., and Yan, X. Enhancing the locality and breaking the memory bottleneck of Transformer on time series forecasting. ArXiv, abs/1907.00235, 2019.

Liu, Y., Wu, H., Wang, J., and Long, M. Non-stationary Transformers: Exploring the stationarity in time series forecasting. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=ucNDIDRNjjv.

Marcellino, M., Stock, J. H., and Watson, M. W. A comparison of direct and iterated multistep AR methods for forecasting macroeconomic time series. Journal of Econometrics, 135(1-2):499–526, 2006.

McAllester, D. A. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pp. 164–170, 1999.

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pp. 405–421. Springer, 2020.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

Ord, K., Fildes, R. A., and Kourentzes, N. Principles of business forecasting. Wessex Press Publishing Co., 2017.

Oreshkin, B. N., Carpov, D., Chapados, N., and Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1ecqn4YwB.

Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., and Courville, A. On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310. PMLR, 2019.

Rajendran, J., Irpan, A., and Jang, E. Meta-learning requires meta-augmentation. Advances in Neural Information Processing Systems, 33:5705–5715, 2020.

Rasmussen, C. E. Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Springer, 2003.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In ICLR, 2017.

Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.

Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

Sitzmann, V., Chan, E., Tucker, R., Snavely, N., and Wetzstein, G. MetaSDF: Meta-learning signed distance functions. Advances in Neural Information Processing Systems, 33:10136–10147, 2020a.

Sitzmann, V., Martel, J., Bergman, A., Lindell, D., and Wetzstein, G. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33:7462–7473, 2020b.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Taieb, S. B., Hyndman, R. J., et al. Recursive and direct multi-step forecasting: the best of both worlds, volume 19. Citeseer, 2012.

Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., and Ng, R. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems, 33:7537–7547, 2020.

Tancik, M., Mildenhall, B., Wang, T., Schmidt, D., Srinivasan, P. P., Barron, J. T., and Ng, R. Learned initializations for optimizing coordinate-based neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2846–2855, 2021.

Taylor, S. J. and Letham, B. Forecasting at scale. The American Statistician, 72(1):37–45, 2018.

Tewari, A., Thies, J., Mildenhall, B., Srinivasan, P., Tretschk, E., Wang, Y., Lassner, C., Sitzmann, V., Martin-Brualla, R., Lombardi, S., et al. Advances in neural rendering. arXiv preprint arXiv:2111.05849, 2021.

Woo, G., Liu, C., Sahoo, D., Kumar, A., and Hoi, S. ETSformer: Exponential smoothing transformers for time-series forecasting. arXiv preprint arXiv:2202.01381, 2022.

Xu, J., Wang, J., Long, M., et al. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34, 2021.

Yin, M., Tucker, G., Zhou, M., Levine, S., and Finn, C. Meta-learning without memorization. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BklEFpEYwS.

Young, P. C., Pedregal, D. J., and Tych, W. Dynamic harmonic regression. Journal of Forecasting, 18(6):369–394, 1999.

Yüce, G., Ortiz-Jiménez, G., Besbinar, B., and Frossard, P. A structured dictionary perspective on implicit neural representations. arXiv preprint arXiv:2112.01917, 2021.

Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., and Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of AAAI, 2021.

Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., and Jin, R. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. arXiv preprint arXiv:2201.12740, 2022.


A. DeepTime Pseudocode

Algorithm 1 PyTorch-Style Pseudocode of Closed-Form Ridge Regressor

# mm: matrix multiplication, diagonal: returns the diagonal elements of a matrix, add_: in-place addition
# linalg.solve computes the solution of a square system of linear equations with a unique solution.
# X: inputs, shape: (n_samples, n_dim)
# Y: targets, shape: (n_samples, n_out)
# lambd: scalar value representing the regularization coefficient

n_samples, n_dim = X.shape

# add a bias term by concatenating an all-ones vector
ones = torch.ones(n_samples, 1)
X = cat([X, ones], dim=-1)

if n_samples >= n_dim:
    # standard formulation
    A = mm(X.T, X)
    A.diagonal().add_(softplus(lambd))
    B = mm(X.T, Y)
    weights = linalg.solve(A, B)
else:
    # Woodbury formulation
    A = mm(X, X.T)
    A.diagonal().add_(softplus(lambd))
    weights = mm(X.T, linalg.solve(A, Y))
w, b = weights[:-1], weights[-1:]
return w, b

Algorithm 2 PyTorch-Style Pseudocode of DeepTime

# rearrange: einops-style tensor operations
# mm: matrix multiplication
# x: input time-series, shape: (lookback_len, multivariate_dim)
# lookback_len: scalar value representing the length of the lookback window
# horizon_len: scalar value representing the length of the forecast horizon
# inr: implicit neural representation

time_index = linspace(0, 1, lookback_len + horizon_len)  # shape: (lookback_len + horizon_len)
time_index = rearrange(time_index, 't -> t 1')           # shape: (lookback_len + horizon_len, 1)
time_reprs = inr(time_index)                             # shape: (lookback_len + horizon_len, hidden_dim)

lookback_reprs = time_reprs[:lookback_len]
horizon_reprs = time_reprs[-horizon_len:]
w, b = ridge_regressor(lookback_reprs, x)
# w.shape = (hidden_dim, multivariate_dim), b.shape = (1, multivariate_dim)
preds = mm(horizon_reprs, w) + b
return preds


B. Categorization of Forecasting Methods

Table 5. Categorization of time-series forecasting methods over the dimensions of time-index vs historical-value methods, and DMS vs IMS methods.

DMS, Time-index: DeepTime, Prophet, Gaussian process, Time-series regression
DMS, Historical-value: N-HiTS, FEDformer, ETSformer, Autoformer, Informer, N-BEATS
IMS, Time-index: -
IMS, Historical-value: DeepAR, LogTrans, ARIMA, ETS

Multi-step Forecasts. Forecasting over a horizon (multiple time steps) can be achieved via two strategies, direct multi-step or iterative multi-step (Marcellino et al., 2006; Chevillon, 2007; Taieb et al., 2012), or even a mixture of both, though mixtures have been less explored. A short sketch contrasting the two strategies follows the list below:

• Direct Multi-step (DMS): A DMS forecaster directly predicts forecasts for the entire horizon. For example, to achieve
a multi-step forecast of H time steps, a DMS forecaster simply outputs H values in a single forward pass.

• Iterative Multi-step (IMS): An IMS forecaster iteratively predicts one step ahead, and consumes this forecast to make
a subsequent prediction. This is performed iteratively, until the desired length is achieved.
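A minimal sketch contrasting the two strategies, using hypothetical `model_one_step` and `model_multi_step` callables (illustrative placeholders, not part of DeepTime's codebase):

import torch

def ims_forecast(model_one_step, lookback: torch.Tensor, horizon: int) -> torch.Tensor:
    # IMS: iteratively predict one step ahead and feed the prediction back as input.
    window = lookback.clone()                                          # (L, m)
    preds = []
    for _ in range(horizon):
        next_step = model_one_step(window)                             # (m,): one step ahead
        preds.append(next_step)
        window = torch.cat([window[1:], next_step.unsqueeze(0)])       # slide the window forward
    return torch.stack(preds)                                          # (H, m)

def dms_forecast(model_multi_step, lookback: torch.Tensor, horizon: int) -> torch.Tensor:
    # DMS: a single forward pass directly outputs all H future values.
    return model_multi_step(lookback, horizon)                         # (H, m)

DeepTime is a DMS method: as in Algorithm 2, the entire horizon is produced in a single forward pass, without feeding predictions back as inputs.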

C. Further Discussion on DeepTime as a Time-index Model


We first reiterate our definitions of time-index and historical-value models from Section 1. Time-index models are models
whose predictions are purely functions of current time-index features. To perform forecasting (i.e. make predictions over
some forecast horizon), time-index models make the predictions ŷt+h = f (τt+h ) for h = 0, . . . , H − 1. Historical-value
models predict the time-series value of future time step(s) as a function of past observations, and optionally, covariates.
Time-index models: ŷt = f(τt). Historical-value models: ŷt+1 = f(yt, yt−1, . . . , zt+1, zt, . . .).


Thus, forecasts are of the form, ŷt+h = A(f, Yt−L:t )(τt+h ), and as can be seen, while the inner loop optimization step is a
function of past observations, the adapted time-index model it yields is purely a function of time-index features.
Next, we further discuss some subtleties of how time-index models interact with past observations. Some confusion
regarding DeepTime’s categorization as a time-index model may arise from the above simplified equation for predictions,
since forecasts are now a function of the lookback window due to the closed-form solution of Wt^(K)*. In particular, Equations (1) and (4) indicate that forecasts from DeepTime are in fact linear in the lookback window. However, we highlight
that this is not in contradiction with our definition of historical-value and time-index models. Here, we differentiate between
the model, f ∈ H, and the learning algorithm, A, which is specified in Equation (3) (the inner loop optimization). The
learning algorithm A : H ×RL×m → H takes as input a model from the hypothesis class H and, past observations, returning
a model minimizing the loss function L. A time-index model is thus, still only a function of time-index features, while the
learning algorithm is a function of past observations, i.e. f, f0 ∈ H, f : Rc → Rm , f = A(f0 , Yt−L:t ). DeepTime as a
forecaster, is a deep time-index model endowed with a meta-optimization framework. In order to perform forecasting, it
has to perform an inner loop optimization defined by the learning algorithm, as highlighted in Equation (3). For the special
case where we use the closed-form ridge regressor, the inner loop learning algorithm reduces to a form which is linear in the
lookback window. Still, the deep time-index model is only a function of time-index features.
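As a small sanity check of the point above, the following sketch (with randomly generated representations standing in for gϕ(τ)) verifies that, for a fixed meta learner, the forecast produced by the closed-form inner loop is an explicit linear map of the lookback values; the dimensions below are illustrative assumptions.

import torch

L, H, d, m, lam = 96, 48, 16, 3, 1.0
Z = torch.randn(L + H, d)                          # stand-in for g_phi(tau) over lookback + horizon
Y = torch.randn(L, m)                              # lookback window values
A = Z[:L].T @ Z[:L] + lam * torch.eye(d)
forecast = Z[L:] @ torch.linalg.solve(A, Z[:L].T @ Y)      # Equation (4) followed by the horizon features

# The same forecast via an explicit (H, L) matrix M that does not depend on Y:
M = Z[L:] @ torch.linalg.solve(A, Z[:L].T)
assert torch.allclose(forecast, M @ Y, atol=1e-4)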


D. Generalization Bound for our Meta-optimization Framework


In this section, we derive a generalization bound for DeepTime under the PAC-Bayes framework (McAllester, 1999; Shalev-Shwartz & Ben-David, 2014). Our formulation follows Amit & Meir (2018), which introduces a meta-learning generalization bound. We assume that all instances share the same hypothesis space $\mathcal{H}$, sample space $\mathcal{Z}$, and loss function $\ell : \mathcal{H} \times \mathcal{Z} \to [0, 1]$. We observe $n$ instances in the form of sample sets $S_1, \ldots, S_n$. The number of samples in each instance is $H + L$. Each instance $S_k$ is assumed to be generated i.i.d. from an unknown sample distribution $D_k^{H+L}$, and each instance's sample distribution $D_k$ is itself generated i.i.d. from an unknown meta distribution, $\mathcal{E}$. In particular, we have $S_k = (z_{k-L}, \ldots, z_k, \ldots, z_{k+H-1})$, where $z_t = (\tau_t, y_t)$. Here, $\tau_t$ is the time coordinate, and $y_t$ is the time-series value. For any forecaster $h(\cdot)$ parameterized by $\theta$, we define the loss function $\ell(h_\theta, z_t)$. We also define $P$ as the prior distribution over $\mathcal{H}$ and $Q$ as the posterior over $\mathcal{H}$ for each instance. In the meta-learning setting, we assume a hyper-prior $\mathcal{P}$, which is a prior distribution over priors; the meta-learner observes a sequence of training instances and outputs a distribution over priors, called the hyper-posterior $\mathcal{Q}$. We restate Theorem 4.1 in the following:
Theorem D.1. (Generalization Bound) Let $\mathcal{Q}, Q$ be arbitrary distributions over $\phi, \theta$, which are defined in Equation (2) and Equation (3), and let $\mathcal{P}, P$ be the prior distributions over $\phi, \theta$. Then for any $c_1, c_2 > 0$ and any $\delta \in (0, 1]$, with probability at least $1 - \delta$, the following inequality holds uniformly for all hyper-posterior distributions $\mathcal{Q}$,

\[
er(\mathcal{Q}) \le \frac{c_1 c_2}{(1 - e^{-c_1})(1 - e^{-c_2})} \frac{1}{n} \sum_{k=1}^{n} \hat{er}(Q, S_k)
+ \frac{c_1}{1 - e^{-c_1}} \cdot \frac{KL(\mathcal{Q} \,\|\, \mathcal{P}) + \log \frac{2}{\delta}}{n c_1}
+ \frac{c_1 c_2}{(1 - e^{-c_2})(1 - e^{-c_1})} \cdot \frac{KL(Q \,\|\, P) + \log \frac{2n}{\delta}}{(H + L) c_2}
\tag{6}
\]

where $er(\mathcal{Q})$ and $\hat{er}(Q, S_k)$ are the generalization error and training error of DeepTime, respectively.

Proof. Our proof contains two steps. First, we bound the error within observed instances, which arises from observing a limited number of samples. Then we bound the error at the instance-environment level, which arises from observing a finite number of instances. Both steps utilize Catoni's classical PAC-Bayes bound (Catoni, 2007) to measure the error, which we state first.
Theorem D.2. (Catoni's bound (Catoni, 2007)) Let $\mathcal{X}$ be a sample space, $P(X)$ a distribution over $\mathcal{X}$, and $\Theta$ a hypothesis space. Given a loss function $\ell(\theta, X) : \Theta \times \mathcal{X} \to [0, 1]$ and a collection of $M$ i.i.d. random variables $(X_1, \ldots, X_M)$ sampled from $P(X)$, let $\pi$ be a prior distribution over the hypothesis space. Then, for any $\delta \in (0, 1]$ and any real number $c > 0$, the following bound holds uniformly for all posterior distributions $\rho$ over the hypothesis space,

\[
P\left( \mathop{\mathbb{E}}_{X_i \sim P(X),\, \theta \sim \rho} [\ell(\theta, X_i)] \le \frac{c}{1 - e^{-c}} \left[ \frac{1}{M} \sum_{m=1}^{M} \mathop{\mathbb{E}}_{\theta \sim \rho} [\ell(\theta, X_m)] + \frac{KL(\rho \,\|\, \pi) + \log \frac{1}{\delta}}{M c} \right],\ \forall \rho \right) \ge 1 - \delta.
\]

We first utilize Theorem D.2 to bound the generalization error in each of the observed instances. Let $k$ be the index of an instance; we define the expected error and empirical error as follows,

\[
er(Q, D_k) = \mathop{\mathbb{E}}_{P \sim \mathcal{Q}} \mathop{\mathbb{E}}_{h \sim Q(S_k, P)} \mathop{\mathbb{E}}_{z \sim D_k} \ell(h, z),
\tag{7}
\]

\[
\hat{er}(Q, S_k) = \mathop{\mathbb{E}}_{P \sim \mathcal{Q}} \mathop{\mathbb{E}}_{h \sim Q(S_k, P)} \frac{1}{H + L} \sum_{j=k-L}^{k+H-1} \ell(h, z_j).
\tag{8}
\]

Then, according to Theorem D.2, for any $\delta_k \in (0, 1]$ and $c_2 > 0$, we have

\[
P\left( er(Q, D_k) \le \frac{c_2}{1 - e^{-c_2}} \hat{er}(Q, S_k) + \frac{c_2}{1 - e^{-c_2}} \cdot \frac{KL(Q \,\|\, P) + \log \frac{1}{\delta_k}}{(H + L) c_2} \right) \ge 1 - \delta_k.
\tag{9}
\]


Next, we bound the error due to observing a limited number of instances from the environment. Similarly, we define the expected instance error as

\[
er(\mathcal{Q}) = \mathop{\mathbb{E}}_{D \sim \mathcal{E}} \mathop{\mathbb{E}}_{S \sim D^{H+L}} \mathop{\mathbb{E}}_{P \sim \mathcal{Q}} \mathop{\mathbb{E}}_{h \sim Q(S, P)} \mathop{\mathbb{E}}_{z \sim D} \ell(h, z)
= \mathop{\mathbb{E}}_{D \sim \mathcal{E}} \mathop{\mathbb{E}}_{S \sim D^{H+L}} er(Q, D).
\tag{10}
\]

We also define the error across the $n$ observed instances,

\[
\frac{1}{n} \sum_{k=1}^{n} \mathop{\mathbb{E}}_{P \sim \mathcal{Q}} \mathop{\mathbb{E}}_{h \sim Q(S_k, P)} \mathop{\mathbb{E}}_{z \sim D_k} \ell(h, z)
= \frac{1}{n} \sum_{k=1}^{n} er(Q, D_k).
\tag{11}
\]

Then Theorem D.2 states that the following holds for any $\delta_0 \in (0, 1]$ and $c_1 > 0$,

\[
P\left( er(\mathcal{Q}) \le \frac{c_1}{1 - e^{-c_1}} \frac{1}{n} \sum_{k=1}^{n} er(Q, D_k) + \frac{c_1}{1 - e^{-c_1}} \cdot \frac{KL(\mathcal{Q} \,\|\, \mathcal{P}) + \log \frac{1}{\delta_0}}{n c_1} \right) \ge 1 - \delta_0.
\tag{12}
\]

Finally, by employing a union bound argument (Lemma 1, Amit & Meir (2018)), we can bound the probability of the intersection of the events in Equation (12) and Equation (9). For any $\delta > 0$, setting $\delta_0 = \frac{\delta}{2}$ and $\delta_k = \frac{\delta}{2n}$ for $k = 1, \ldots, n$,

\[
P\Bigg( er(\mathcal{Q}) \le \frac{c_1 c_2}{(1 - e^{-c_1})(1 - e^{-c_2})} \frac{1}{n} \sum_{k=1}^{n} \hat{er}(Q, S_k)
+ \frac{c_1}{1 - e^{-c_1}} \cdot \frac{KL(\mathcal{Q} \,\|\, \mathcal{P}) + \log \frac{2}{\delta}}{n c_1}
+ \frac{c_1 c_2}{(1 - e^{-c_2})(1 - e^{-c_1})} \cdot \frac{KL(Q \,\|\, P) + \log \frac{2n}{\delta}}{(H + L) c_2} \Bigg) \ge 1 - \delta,
\tag{13}
\]

which is exactly the statement of Theorem D.1, completing the proof.


E. Synthetic Data
The training set for each synthetic data experiment consists of 1000 functions/tasks, while the test set contains 100 functions/tasks. We ensure that there is no overlap between the train and test sets.

Linear Samples are generated from the function $y = ax + b$ for $x \in [-1, 1]$; each function/task consists of 400 evenly spaced points between −1 and 1. The parameters of each function/task (i.e. $a, b$) are sampled from a normal distribution with mean 0 and standard deviation 50, i.e. $a, b \sim \mathcal{N}(0, 50^2)$.

Cubic Samples are generated from the function $y = ax^3 + bx^2 + cx + d$ for $x \in [-1, 1]$, again over 400 evenly spaced points. Parameters of each task are sampled from a continuous uniform distribution with minimum value −50 and maximum value 50, i.e. $a, b, c, d \sim \mathcal{U}(-50, 50)$.

Sums of sinusoids Sinusoids are drawn from a fixed set of frequencies, generated by sampling $\omega \sim \mathcal{U}(0, 12\pi)$. We fix the size of this set to five, i.e. $\Omega = \{\omega_1, \ldots, \omega_5\}$. Each function is then a sum of $J$ sinusoids, where $J \in \{1, 2, 3, 4, 5\}$ is randomly assigned. The function is thus $y = \sum_{j=1}^{J} A_j \sin(\omega_{r_j} x + \phi_j)$ for $x \in [0, 1]$, where the amplitudes and phase shifts are freely chosen via $A_j \sim \mathcal{U}(0.1, 5)$ and $\phi_j \sim \mathcal{U}(0, \pi)$, while the index $r_j \in \{1, 2, 3, 4, 5\}$ randomly selects a frequency from the set $\Omega$.
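As an illustration, the sums-of-sinusoids tasks could be generated along the following lines (a sketch with our own function names, not taken from the released code):

```python
import numpy as np

def make_sinusoid_tasks(n_tasks: int, n_points: int = 400, seed: int = 0):
    rng = np.random.default_rng(seed)
    omegas = rng.uniform(0, 12 * np.pi, size=5)   # fixed frequency set, Omega
    x = np.linspace(0, 1, n_points)
    tasks = []
    for _ in range(n_tasks):
        n_summands = rng.integers(1, 6)           # J is randomly assigned from {1, ..., 5}
        y = np.zeros_like(x)
        for _ in range(n_summands):
            amplitude = rng.uniform(0.1, 5.0)
            phase = rng.uniform(0, np.pi)
            omega = rng.choice(omegas)            # frequency drawn from the fixed set
            y += amplitude * np.sin(omega * x + phase)
        tasks.append((x, y))
    return tasks
```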
The predictions from DeepTime in Figure 4 demonstrate some noise, likely stemming from the model's capability to learn high frequency features due to the use of implicit neural representations with random Fourier features. Since the synthetic data are all low frequency, smoothly changing functions, the noise is likely to be an artifact of the concatenated Fourier features layer, which should go away if the scale parameter of the Fourier features is carefully fine-tuned. However, the power of our proposed concatenated Fourier features layer is that the model is able to fit both high and low frequency features without tuning, though at the expense of some noise, as seen in the figure.

F. Datasets
ETT [1] (Zhou et al., 2021) - Electricity Transformer Temperature provides measurements from an electricity transformer, such as load and oil temperature. We use the ETTm2 subset, consisting of measurements at a 15-minute frequency.
ECL [2] - Electricity Consuming Load provides measurements of electricity consumption for 321 households from 2012 to 2014. The data was collected at the 15-minute level, but is aggregated hourly.
Exchange [3] (Lai et al., 2018) - a collection of daily exchange rates against the USD for eight countries (Australia, United Kingdom, Canada, Switzerland, China, Japan, New Zealand, and Singapore) from 1990 to 2016.
Traffic [4] - a dataset from the California Department of Transportation providing hourly road occupancy rates from 862 sensors on San Francisco Bay Area freeways.
Weather [5] - provides measurements of 21 meteorological indicators, such as air temperature and humidity, every 10 minutes for the year 2020, from the weather station of the Max Planck Institute for Biogeochemistry in Jena, Germany.
ILI [6] - Influenza-like Illness measures the weekly ratio of patients seen with ILI to the total number of patients, as reported by the Centers for Disease Control and Prevention of the United States between 2002 and 2021.

[1] https://github.com/zhouhaoyi/ETDataset
[2] https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
[3] https://github.com/laiguokun/multivariate-time-series-data
[4] https://pems.dot.ca.gov/
[5] https://www.bgc-jena.mpg.de/wetter/
[6] https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html


G. DeepTime Implementation Details


Optimization We train DeepTime with the Adam optimizer (Kingma & Ba, 2014), using a learning rate scheduler that follows a linear warm up and cosine annealing scheme. Gradient clipping by norm is applied. The ridge regressor regularization coefficient, λ, is trained with a different, higher learning rate than the rest of the meta parameters. We use early stopping based on the validation loss, with a fixed patience hyperparameter (the number of epochs for which the validation loss may deteriorate before stopping). All experiments are performed on an Nvidia A100 GPU.
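As an illustration, such a warm-up-then-cosine schedule and the separate learning rate for λ could look roughly as follows (a sketch under our own naming; the released code may implement these details differently):

```python
import math
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine_scheduler(optimizer, warmup_epochs: int, total_epochs: int):
    # Linear warm up to the base learning rate, followed by cosine annealing towards zero.
    def lr_lambda(epoch: int) -> float:
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)

# Usage sketch: lambda sits in its own parameter group with a higher learning rate,
# matching the values in Table 6 (`backbone_params` and `raw_lambda` are placeholders).
# optimizer = torch.optim.Adam([
#     {"params": backbone_params, "lr": 1e-3},
#     {"params": [raw_lambda], "lr": 1.0},
# ])
# scheduler = warmup_cosine_scheduler(optimizer, warmup_epochs=5, total_epochs=50)
# Call scheduler.step() once per epoch, after the optimizer updates.
```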

Model The ridge regression regularization coefficient is a learnable parameter constrained to positive values via a softplus function. We apply Dropout (Srivastava et al., 2014), then LayerNorm (Ba et al., 2016), after the ReLU activation function in each INR layer. The size of the random Fourier features layer is set independently of the hidden layer size: we specify the total size of the random Fourier features layer, and the number of dimensions allocated to each scale is obtained by dividing this total equally among the scales.
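A sketch of a concatenated Fourier features layer with the per-scale dimension splitting described above is given below (a simplified illustration consistent with this description, not the exact released module):

```python
import torch
import torch.nn as nn

class ConcatFourierFeatures(nn.Module):
    """Map time-index features (..., in_dim) to (..., total_dim) by concatenating
    random Fourier features computed at several scales."""

    def __init__(self, in_dim: int, total_dim: int,
                 scales=(0.01, 0.1, 1, 5, 10, 20, 50, 100)):
        super().__init__()
        assert total_dim % (2 * len(scales)) == 0, "total size is split equally across scales"
        dim_per_scale = total_dim // (2 * len(scales))
        # One fixed random projection per scale, with entries drawn from N(0, scale^2).
        self.register_buffer(
            "B",
            torch.stack([s * torch.randn(in_dim, dim_per_scale) for s in scales]),
        )  # (n_scales, in_dim, dim_per_scale)

    def forward(self, tau: torch.Tensor) -> torch.Tensor:
        # tau: (..., in_dim); project onto every scale, then concatenate sin/cos features.
        proj = 2 * torch.pi * torch.einsum("...c,scd->...sd", tau, self.B)
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)  # (..., n_scales, 2 * dim_per_scale)
        return feats.flatten(-2)                                       # (..., total_dim)
```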

H. DeepTime Hyperparameters

Table 6. Hyperparameters used in DeepTime.


Hyperparameter                          Value

Optimization
  Epochs                                50
  Learning rate                         1e-3
  λ learning rate                       1.0
  Warm up epochs                        5
  Batch size                            256
  Early stopping patience               7
  Max gradient norm                     10.0

Model
  Layers                                5
  Layer size                            256
  λ initialization                      0.0
  Scales                                [0.01, 0.1, 1, 5, 10, 20, 50, 100]
  Fourier features size                 4096
  Dropout                               0.1
  Lookback length multiplier, µ         µ ∈ {1, 3, 5, 7, 9}

I. Univariate Forecasting Benchmark

Table 7. Univariate forecasting benchmark on long sequence time-series forecasting. Best results are highlighted in bold, and second best
results are underlined.
Methods (MSE, MAE per method): DeepTime | N-HiTS | ETSformer | Fedformer | Autoformer | Informer | N-BEATS | DeepAR | Prophet | ARIMA | GP

ETTm2
  96:  0.065 0.186 | 0.066 0.185 | 0.080 0.212 | 0.063 0.189 | 0.065 0.189 | 0.088 0.225 | 0.082 0.219 | 0.099 0.237 | 0.287 0.456 | 0.211 0.362 | 0.125 0.273
  192: 0.096 0.234 | 0.087 0.223 | 0.150 0.302 | 0.102 0.245 | 0.118 0.256 | 0.132 0.283 | 0.120 0.268 | 0.154 0.310 | 0.312 0.483 | 0.261 0.406 | 0.154 0.307
  336: 0.138 0.285 | 0.106 0.251 | 0.175 0.334 | 0.130 0.279 | 0.154 0.305 | 0.180 0.336 | 0.226 0.370 | 0.277 0.428 | 0.331 0.474 | 0.317 0.448 | 0.189 0.338
  720: 0.186 0.338 | 0.157 0.312 | 0.224 0.379 | 0.178 0.325 | 0.182 0.335 | 0.300 0.435 | 0.188 0.338 | 0.332 0.468 | 0.534 0.593 | 0.366 0.487 | 0.318 0.421

Exchange
  96:  0.086 0.226 | 0.093 0.223 | 0.099 0.230 | 0.131 0.284 | 0.241 0.299 | 0.591 0.615 | 0.156 0.299 | 0.417 0.515 | 0.828 0.762 | 0.112 0.245 | 0.165 0.311
  192: 0.173 0.330 | 0.230 0.313 | 0.223 0.353 | 0.277 0.420 | 0.273 0.665 | 1.183 0.912 | 0.669 0.665 | 0.813 0.735 | 0.909 0.974 | 0.304 0.404 | 0.649 0.617
  336: 0.539 0.575 | 0.370 0.486 | 0.421 0.497 | 0.426 0.511 | 0.508 0.605 | 1.367 0.984 | 0.611 0.605 | 1.331 0.962 | 1.304 0.988 | 0.736 0.598 | 0.596 0.592
  720: 0.936 0.763 | 0.728 0.569 | 1.114 0.807 | 1.162 0.832 | 0.991 0.860 | 1.872 1.072 | 1.111 0.860 | 1.890 1.181 | 3.238 1.566 | 1.871 0.935 | 1.002 0.786


J. DeepTime Standard Deviation

Table 8. DeepTime main benchmark results with standard deviation. Experiments are performed over three runs.

(a) Multivariate benchmark.

Dataset    Horizon    MSE (SD)         MAE (SD)
ETTm2      96         0.166 (0.000)    0.257 (0.001)
           192        0.225 (0.001)    0.302 (0.003)
           336        0.277 (0.002)    0.336 (0.002)
           720        0.383 (0.007)    0.409 (0.006)
ECL        96         0.137 (0.000)    0.238 (0.000)
           192        0.152 (0.000)    0.252 (0.000)
           336        0.166 (0.000)    0.268 (0.000)
           720        0.201 (0.000)    0.302 (0.000)
Exchange   96         0.081 (0.001)    0.205 (0.002)
           192        0.151 (0.002)    0.284 (0.003)
           336        0.314 (0.033)    0.412 (0.020)
           720        0.856 (0.202)    0.663 (0.082)
Traffic    96         0.390 (0.001)    0.275 (0.001)
           192        0.402 (0.000)    0.278 (0.000)
           336        0.415 (0.000)    0.288 (0.001)
           720        0.449 (0.000)    0.307 (0.000)
Weather    96         0.166 (0.001)    0.221 (0.002)
           192        0.207 (0.000)    0.261 (0.000)
           336        0.251 (0.000)    0.298 (0.001)
           720        0.301 (0.001)    0.338 (0.001)
ILI        24         2.425 (0.058)    1.086 (0.027)
           36         2.231 (0.087)    1.008 (0.011)
           48         2.230 (0.144)    1.016 (0.037)
           60         2.143 (0.032)    0.985 (0.016)

(b) Univariate benchmark.

Dataset    Horizon    MSE (SD)         MAE (SD)
ETTm2      96         0.065 (0.000)    0.186 (0.000)
           192        0.096 (0.002)    0.234 (0.003)
           336        0.138 (0.001)    0.285 (0.001)
           720        0.186 (0.002)    0.338 (0.002)
Exchange   96         0.086 (0.000)    0.226 (0.000)
           192        0.173 (0.004)    0.330 (0.003)
           336        0.539 (0.066)    0.575 (0.027)
           720        0.936 (0.222)    0.763 (0.075)

K. Random Fourier Features Scale Hyperparameter Sensitivity Analysis

Table 9. Comparison of CFF against the optimal and pessimal scales as obtained from the hyperparameter sweep. We also calculate the change in performance between CFF and the optimal and pessimal scales, where a positive percentage means CFF underperforms and a negative percentage means CFF outperforms, calculated as % change = (MSE_CFF − MSE_Scale) / MSE_Scale.

                 CFF               Optimal Scale (% change)          Pessimal Scale (% change)
                 MSE     MAE       MSE              MAE              MSE               MAE
ETTm2    96      0.166   0.257     0.164 (1.20%)    0.257 (-0.05%)   0.216 (-23.22%)   0.300 (-14.22%)
         192     0.225   0.302     0.220 (1.87%)    0.301 (0.25%)    0.275 (-18.36%)   0.340 (-11.25%)
         336     0.277   0.336     0.275 (0.70%)    0.336 (-0.22%)   0.340 (-18.68%)   0.375 (-10.57%)
         720     0.383   0.409     0.364 (5.29%)    0.392 (4.48%)    0.424 (-9.67%)    0.430 (-4.95%)

Table 10. Results from hyperparameter sweep on the scale hyperparameter. Best scores are highlighted in bold, and worst scores are
highlighted in bold red.
Scale hyperparameter:   0.01          0.1           1             5             10            20            50            100
Metrics:                MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE
ETTm2    96             0.216  0.300  0.189  0.285  0.173  0.268  0.168  0.262  0.166  0.260  0.165  0.258  0.165  0.259  0.164  0.257
         192            0.275  0.340  0.264  0.333  0.239  0.317  0.225  0.301  0.225  0.303  0.224  0.302  0.224  0.304  0.220  0.301
         336            0.340  0.375  0.319  0.371  0.292  0.351  0.275  0.337  0.277  0.336  0.282  0.345  0.278  0.342  0.280  0.344
         720            0.424  0.430  0.405  0.420  0.381  0.412  0.364  0.392  0.375  0.408  0.410  0.430  0.396  0.423  0.406  0.429

We perform a comparison between the optimal and pessimal scale hyperparameters for the vanilla random Fourier features layer and our proposed CFF. We first report the results for each scale hyperparameter of the vanilla random Fourier features layer in Table 10. As with the other ablation studies, the results reported in Table 10 are based on performing a hyperparameter sweep across the lookback length multiplier, selecting the optimal setting based on the validation set, and reporting the test set results. The optimal and pessimal scales are then simply the best and worst results in Table 10. Table 9 shows that CFF achieves extremely low deviation from the optimal scale across all settings, yet retains the upside of avoiding this expensive hyperparameter tuning phase. We also observe that tuning the scale hyperparameter is extremely important, as CFF obtains up to a 23.22% improvement in MSE over the pessimal scale hyperparameter.

L. Ablation Studies Details


In this section, we list further details on the models compared against in the ablation studies section. Unless otherwise stated, we perform the same hyperparameter tuning for all models in the ablation studies, and use the same standard hyperparameters, such as the number of layers, layer size, etc.

L.1. Ablation study on variants of DeepTime


Datetime Features As each dataset comes with a timestamp for each observation, we are able to construct datetime features from these timestamps. We construct the following features:

1. Quarter-of-year

2. Month-of-year

3. Week-of-year

4. Day-of-year

5. Day-of-month

6. Day-of-week

7. Hour-of-day


8. Minute-of-hour

9. Second-of-minute

Each feature is initially an integer value, e.g. month-of-year can take on values in {0, 1, . . . , 11}, which we subsequently normalize to a [0, 1] range. Depending on the data sampling frequency, the appropriate features can be chosen. For the ETTm2 dataset, we used all features except second-of-minute, since the data is sampled at a 15-minute frequency.
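For instance, such normalized features could be constructed from a pandas DatetimeIndex roughly as follows (a sketch; the normalization constants and exact feature code in the repository may differ):

```python
import pandas as pd

def datetime_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    # Each feature starts as an integer (e.g. month-of-year in {0, ..., 11}) and is
    # normalized to the [0, 1] range by dividing by its maximum value.
    return pd.DataFrame({
        "quarter_of_year": (index.quarter - 1) / 3.0,
        "month_of_year": (index.month - 1) / 11.0,
        "week_of_year": (index.isocalendar().week.to_numpy() - 1) / 52.0,
        "day_of_year": (index.dayofyear - 1) / 365.0,
        "day_of_month": (index.day - 1) / 30.0,
        "day_of_week": index.dayofweek / 6.0,
        "hour_of_day": index.hour / 23.0,
        "minute_of_hour": index.minute / 59.0,
        "second_of_minute": index.second / 59.0,
    }, index=index)
```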

RR Removing the ridge regressor module refers to replacing it with a simple linear layer, Linear : $\mathbb{R}^d \to \mathbb{R}^m$, where $\mathrm{Linear}(x) = Wx + b$, $x \in \mathbb{R}^d$, $W \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$. This corresponds to a straightforward INR, which is trained across all lookback-horizon pairs in the dataset.

Local For models marked “Local”, we similarly remove the ridge regressor module and replace it with a linear layer. However, the model is not trained across all lookback-horizon pairs in the dataset. Instead, for each lookback-horizon pair in the validation/test set, we fit the model to the lookback window via gradient descent, and then perform prediction on the horizon to obtain the forecasts. A new model is trained from scratch for each lookback-horizon window. We tune one extra hyperparameter, the number of epochs of gradient descent, for which we search through {10, 20, 30, 40, 50}.
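Concretely, the “Local” procedure for a single lookback-horizon pair could be sketched as follows (illustrative only; `make_model` is a placeholder that constructs a fresh INR with a linear head):

```python
import torch
import torch.nn.functional as F

def fit_local_and_forecast(make_model, tau_lookback, y_lookback, tau_horizon,
                           epochs: int = 30, lr: float = 1e-3):
    # Train a new model from scratch on the lookback window only,
    # then predict the horizon from its time-index features.
    model = make_model()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.mse_loss(model(tau_lookback), y_lookback)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return model(tau_horizon)
```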

Finetune Models marked “Finetune” are similar to “Local”, except that they have been trained on the training set first,
and for each lookback-horizon pair in the test set, they are “finetuned” on the lookback window.

Full MAML “Full MAML” indicates the setting for which MAML is performed on the entire deep time-index model,
by backpropagating through inner loop gradient steps as per Finn et al. (2017), rather than our proposed fast and efficient
meta-optimization framework. Inner loop optimization is performed using the Adam optimizer, and is tuned over lookback
length multiplier values of {1, 3, 5, 7, 9}, and inner loop iterations of {1, 5, 10}.

L.2. Ablation study on backbone models


For all models in this section, we retain the differentiable closed-form ridge regressor, to identify the effects of the backbone
model used.

MLP The random Fourier features layer is a mapping from coordinate space to latent space, $\gamma : \mathbb{R}^c \to \mathbb{R}^d$. To remove the effects of the random Fourier features layer, we simply replace it with a linear map, Linear : $\mathbb{R}^c \to \mathbb{R}^d$.

SIREN We replace the random Fourier features backbone with the SIREN model introduced by Sitzmann et al. (2020b). In this model, periodic activation functions, i.e. sin(x), are used, along with a specified weight initialization scheme.

RNN We use a 2-layer LSTM with a hidden size of 256. Inputs are the observations, $y_t$, and the model predicts the next time step, $y_{t+1}$, in an IMS fashion.

M. Computational Efficiency Experiments Details


Trans/In/Auto/ETS-former We use a model with 2 encoder and 2 decoder layers with a hidden size of 512, as specified
in their original papers.

N-BEATS We use an N-BEATS model with 3 stacks and 3 layers (relatively small compared to the 30 stacks and 4 layers used in their original paper [7]), with a hidden size of 512. Note that N-BEATS is a univariate model, and the values presented here are multiplied by a factor of m to account for the multivariate data. Another dimension of comparison is the number of parameters used in each model. As demonstrated in Table 11, for fully connected models like N-BEATS, the number of parameters scales linearly with the lookback window and forecast horizon lengths, while for Transformer-based models and DeepTime, the number of parameters remains constant.
[7] https://github.com/ElementAI/N-BEATS/blob/master/experiments/electricity/generic.gin


N-HiTS We use an N-HiTS model with hyperparameters as suggested in their original paper (3 stacks, 1 block in each stack, 2 MLP layers, 512 hidden size). For the following hyperparameters, which were not specified (subject to hyperparameter tuning), we set the pooling kernel sizes to [2, 2, 2], and the numbers of stack coefficients to [24, 12, 1]. Similar to N-BEATS, N-HiTS is a univariate model, and values were multiplied by a factor of m to account for the multivariate data.

Table 11. Number of parameters in each model across various lookback window and forecast horizon lengths. The models were instantiated
for the ETTm2 multivariate dataset (this affects the embedding and projection layers in Autoformer). Values for N-HiTS in this table are
not multiplied by m since it is a global model (i.e. a single univariate model is used for all dimensions of the time-series).
Methods                 Autoformer      N-HiTS          DeepTime
Lookback    48          10,535,943      927,942         1,314,561
            96          10,535,943      1,038,678       1,314,561
            168         10,535,943      1,204,782       1,314,561
            336         10,535,943      1,592,358       1,314,561
            720         10,535,943      2,478,246       1,314,561
            1440        10,535,943      4,139,286       1,314,561
            2880        10,535,943      7,461,366       1,314,561
            5760        10,535,943      14,105,526      1,314,561
Horizon     48          10,535,943      927,942         1,314,561
            96          10,535,943      955,644         1,314,561
            168         10,535,943      997,197         1,314,561
            336         10,535,943      1,094,154       1,314,561
            720         10,535,943      1,315,770       1,314,561
            1440        10,535,943      1,731,300       1,314,561
            2880        10,535,943      2,562,360       1,314,561
            5760        10,535,943      4,224,480       1,314,561

