Learning Deep Time-Index Models For Time Series Forecasting
Gerald Woo 1 2 Chenghao Liu 1 Doyen Sahoo 1 Akshat Kumar 2 Steven Hoi 1
structural time series formulation, specialized for business forecasting. Another classical approach of note is Gaussian Processes (Rasmussen, 2003; Corani et al., 2021), which are non-parametric models that often require complex kernel engineering. Godfrey & Gashler (2017) introduced an initial attempt at using time-index based neural networks to fit a time series for forecasting. Yet, their work is more reminiscent of classical methods, manually specifying periodic and non-periodic activation functions, analogous to the representation functions.

Implicit Neural Representations INRs have recently gained popularity in the area of neural rendering (Tewari et al., 2021). They parameterize a signal as a continuous function, mapping a coordinate to the value at that coordinate. A key finding was that positional encodings (Mildenhall et al., 2020; Tancik et al., 2020) are critical for ReLU MLPs to learn high frequency details, while another line of work introduced periodic activations (Sitzmann et al., 2020b). Meta-learning via INRs has been explored for various data modalities, typically over images or for neural rendering tasks (Sitzmann et al., 2020a; Tancik et al., 2021; Dupont et al., 2021), using both hypernetworks and optimization-based approaches. Yüce et al. (2021) show that meta-learning on INRs is analogous to dictionary learning. In time series, Jeong & Shin (2022) explored using INRs for anomaly detection, opting to make use of periodic activations and temporal positional encodings.

3. DeepTime

In this section, we first formally describe the time series forecasting problem and how to use time-index models for this problem setting. Next, we introduce the model architecture specifics for deep time-index models. Finally, we introduce the generic form of our meta-optimization framework, and then, specifically, how to use a differentiable closed-form ridge regression module to perform the meta-optimization efficiently. Pseudocode of DeepTime is available in Appendix A.
Problem Formulation In time series forecasting, we consider a time series dataset (y1, y2, . . . , yT), where yt ∈ R^m is the m-dimensional observation at time t. Consider a train-test split such that the range (1, . . . , T′) is the training set and the range (T′ + 1, . . . , T) is the test set, where T − T′ ≥ H. The goal is to construct point forecasts over a horizon of length H over the test set, Ŷt:t+H = [ŷt; . . . ; ŷt+H−1], ∀t = T′ + 1, T′ + 2, . . . , T − H + 1. A time-index model, f : R → R^m, f : τt ↦ ŷt, achieves this by minimizing a reconstruction loss L : R^m × R^m → R over the training set, where τt is a time-index feature whose values are known for all time steps. Then, we can query it over the test set to obtain forecasts, Ŷt:t+H = f(τt:t+H).

3.1. Deep Time-index Model Architecture

INRs (Sitzmann et al., 2020b) are a class of coordinate-based models, mapping coordinates to values, which have been extensively studied. Time-index models are a case of 1d coordinate-based models; thus, we leverage this existing class of models, which are essentially a stack of multi-layered perceptrons, as our proposed deep time-index model architecture. Visualized in Figure 3a, a K-layered, ReLU (Nair & Hinton, 2010) INR is a function fθ : R^c → R^m where:

    z^(0) = τ
    z^(k+1) = max(0, W^(k) z^(k) + b^(k)),    k = 0, . . . , K − 1
    fθ(τ) = W^(K) z^(K) + b^(K)    (1)

where τ ∈ R^c are time-index features. MLPs have been shown to experience difficulty in learning high frequency functions, a problem known as "spectral bias" (Rahaman et al., 2019; Basri et al., 2020). Coordinate-based methods suffer from this issue in particular when trying to represent the high frequency content present in the signal. Tancik et al. (2020) introduced a random Fourier features layer which allows INRs to fit high frequency functions, by modifying z^(0) = γ(τ) = [sin(2πBτ), cos(2πBτ)]^T, where each entry in B ∈ R^{d/2×c} is sampled from N(0, σ²), d is the hidden dimension size of the INR, and σ² is the scale hyperparameter. [·, ·] is a row-wise stacking operation.
Concatenated Fourier Features While the random Fourier features layer endows INRs with the ability to learn high frequency patterns, one major drawback is the need to perform a hyperparameter sweep for each task and dataset to avoid over- or underfitting. We overcome this limitation with a simple scheme of concatenating multiple Fourier basis functions with diverse scale parameters, i.e. γ(τ) = [sin(2πB1τ), cos(2πB1τ), . . . , sin(2πBSτ), cos(2πBSτ)]^T, where elements in Bs ∈ R^{d/2×c} are sampled from N(0, σs²), and W^(0) ∈ R^{d×Sd}. We perform an analysis in Section 5.3 and show that the performance of our proposed concatenated Fourier features (CFF) does not significantly deviate from the setting with the optimal scale parameter obtained from a hyperparameter sweep.
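To make the architecture concrete, the following is a minimal PyTorch-style sketch of a ReLU INR whose first layer is the concatenated Fourier features mapping described above. The module names, dimensions, and the scale set (which simply reuses the sweep values from Table 10) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ConcatFourierFeatures(nn.Module):
    """gamma(tau): random Fourier features concatenated over S scales."""
    def __init__(self, in_dim, hidden_dim, scales=(0.01, 0.1, 1, 5, 10, 20, 50, 100)):
        super().__init__()
        # One (hidden_dim/2 x in_dim) matrix per scale, entries ~ N(0, scale^2), kept fixed.
        self.register_buffer(
            "B", torch.stack([torch.randn(hidden_dim // 2, in_dim) * s for s in scales])
        )  # shape: (S, hidden_dim/2, in_dim)

    def forward(self, tau):                                      # tau: (..., in_dim)
        proj = 2 * torch.pi * torch.einsum("shc,...c->...sh", self.B, tau)
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)  # (..., S, hidden_dim)
        return feats.flatten(-2)                                 # (..., S * hidden_dim)

class DeepTimeINR(nn.Module):
    """K-layer ReLU INR: z(0) = gamma(tau), z(k+1) = ReLU(W z + b), f(tau) = W z + b."""
    def __init__(self, in_dim=1, hidden_dim=256, out_dim=7, n_layers=5,
                 scales=(0.01, 0.1, 1, 5, 10, 20, 50, 100)):
        super().__init__()
        self.cff = ConcatFourierFeatures(in_dim, hidden_dim, scales)
        # First weight maps the S*d concatenated features back to d, matching W(0) in R^{d x Sd}.
        layers = [nn.Linear(len(scales) * hidden_dim, hidden_dim), nn.ReLU()]
        for _ in range(n_layers - 1):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        self.mlp = nn.Sequential(*layers)
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, tau):
        return self.head(self.mlp(self.cff(tau)))

# tau: normalized time-indices covering lookback + horizon, shape (L + H, 1)
tau = torch.linspace(0, 1, 96 + 24).unsqueeze(-1)
print(DeepTimeINR()(tau).shape)  # torch.Size([120, 7])
```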
(Figure 3: the deep time-index model architecture (a) and the meta-optimization framework (b), showing the inner loop, the outer loop, and the ϕ update.)
3.2. Meta-optimization Framework

As explained in Section 1, vanilla deep time-index models are unable to perform forecasting due to their failure to extrapolate beyond observed time-indices. Formally, the original hypothesis class of time-index models is denoted H_INR = {f(τ; θ) | θ ∈ Θ}, where Θ is the parameter space. This hypothesis class is too expressive, providing no guarantees that training on the lookback window leads to good extrapolation. To solve this, our meta-optimization framework learns an inductive bias for the deep time-index model from data. Firstly, rather than using the entire training set in a naive supervised learning setting, whereby older training points provide no additional benefit in learning a time-index model for extrapolation, we split the time series into lookback window and forecast horizon pairs. Let the L time steps preceding the forecast horizon at time step t be the lookback window, Yt−L:t = [yt−L; . . . ; yt−1]^T ∈ R^{L×m}. Next, consider the case where we split the model parameters into two, possibly overlapping, subsets, ϕ ∈ Φ and θ ∈ Θ, known as the meta and base parameters, respectively, where Φ is the parameter space of the meta parameters. The meta parameters are responsible for learning the inductive bias from multiple lookback window and forecast horizon pairs from the training data, while the base parameters aim to learn and adapt quickly to the lookback window at test time. Thus, we aim to encode an inductive bias in ϕ, learned to enable extrapolation across the forecast horizon when the base parameters adapt to the corresponding lookback window, resulting in H_Meta = {f(τ; θ, ϕ*) | θ ∈ Θ}. This is achieved by formulating the bi-level optimization problem:

    ϕ* = arg min_ϕ  Σ_{t=L+1}^{T−H+1}  Σ_{i=0}^{H−1}  L(f(τt+i; θt*, ϕ), yt+i)    (2)

    s.t.  θt* = arg min_θ  Σ_{j=−L}^{−1}  L(f(τt+j; θ, ϕ), yt+j)    (3)

Illustrated in Figure 3b, Equation (2) represents the outer loop, and Equation (3) the inner loop. The first summation in the outer loop, over index t, iterates over the time steps of the dataset, and the second summation, over index i, covers each time step of the forecast horizon. The summation in the inner loop, over index j, covers each time step of the lookback window.
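A minimal sketch of how this bi-level structure translates into a training loop follows. Here `inner_solve` is a placeholder for whatever procedure adapts the base parameters on a lookback window (the closed-form solver described below being the instantiation DeepTime actually uses), and the sliding-window pair construction is an assumption about the data pipeline rather than the authors' exact code.

```python
import torch

def lookback_horizon_pairs(y, L, H):
    """Slice a series y of shape (T, m) into (lookback, horizon) training pairs."""
    return [(y[t - L:t], y[t:t + H]) for t in range(L, len(y) - H + 1)]

def meta_train_step(model, inner_solve, pairs, outer_opt,
                    loss_fn=torch.nn.functional.mse_loss):
    """One outer-loop update over a batch of (lookback, horizon) pairs (Eq. 2-3)."""
    outer_opt.zero_grad()
    outer_loss = 0.0
    for y_lookback, y_horizon in pairs:
        L, H = len(y_lookback), len(y_horizon)
        tau = torch.linspace(0, 1, L + H).unsqueeze(-1)   # time-indices for the pair
        # Inner loop: adapt the base parameters on the lookback window only,
        # returning a forecaster that uses the adapted parameters.
        adapted_forecast = inner_solve(model, tau[:L], y_lookback)
        # Outer loop: evaluate the adapted forecaster on the horizon time-indices.
        outer_loss = outer_loss + loss_fn(adapted_forecast(tau[L:]), y_horizon)
    outer_loss.backward()   # gradients flow into the meta parameters phi
    outer_opt.step()
    return outer_loss.item()
```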
Fast and Efficient Meta-optimization Following the above generic formulation of the meta-optimization framework to learn deep time-index models, we now describe a specific instantiation of the framework which enables both training and forecasting at test time to be fast and efficient. Similar bi-level optimization problems have been explored before (Ravi & Larochelle, 2017; Finn et al., 2017), and one naive approach is to directly backpropagate through inner gradient steps. However, such methods are highly inefficient, have many additional hyperparameters, and are unstable during training (Antoniou et al., 2019). Instead, to achieve speed and efficiency, we specify that the base parameters consist of only the last layer of the INR, while the rest of the INR constitutes the meta parameters. Thus, the inner loop optimization only applies to this last layer. This transforms the inner loop optimization problem into a simple ridge regression problem for the case of mean squared error loss, having a simple analytic solution that replaces the otherwise complicated non-linear optimization problem (Bertinetto et al., 2019).

Formally, for a K-layered model, ϕ = {W^(0), b^(0), . . . , W^(K−1), b^(K−1), λ} are the meta parameters and θ = {W^(K)} are the base parameters, following notation from Equation (1). Then let gϕ : R^c → R^d be the meta learner, where gϕ(τ) = z^(K). For a lookback-horizon pair, (Yt−L:t, Yt:t+H), the features of the lookback window obtained from the meta learner are denoted Zt−L:t = [gϕ(τt−L); . . . ; gϕ(τt−1)]^T ∈ R^{L×d}, where [·; ·] is a column-wise concatenation operation. The inner loop thus solves the optimization problem:

    Wt^(K)* = arg min_W ||Zt−L:t W − Yt−L:t||² + λ||W||²
            = (Zt−L:t^T Zt−L:t + λI)^{−1} Zt−L:t^T Yt−L:t    (4)

Now, let Zt:t+H = [gϕ(τt); . . . ; gϕ(τt+H−1)]^T ∈ R^{H×d} be the corresponding features over the forecast horizon; the forecasts are then obtained as Ŷt:t+H = Zt:t+H Wt^(K)*.
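A hedged sketch of this closed-form inner loop follows, assuming a `meta_learner` that maps time-indices to d-dimensional features (the INR without its last layer). Shapes follow Equation (4), but the function and variable names are illustrative.

```python
import torch

def ridge_forecast(meta_learner, tau_lookback, y_lookback, tau_horizon, lam=1.0):
    """Closed-form inner loop (Eq. 4): fit the last layer on the lookback, then forecast."""
    Z_lookback = meta_learner(tau_lookback)             # (L, d)
    Z_horizon = meta_learner(tau_horizon)                # (H, d)
    d = Z_lookback.shape[-1]
    # W* = (Z^T Z + lambda I)^{-1} Z^T Y -- differentiable, so gradients reach phi.
    A = Z_lookback.T @ Z_lookback + lam * torch.eye(d)
    W_star = torch.linalg.solve(A, Z_lookback.T @ y_lookback)   # (d, m)
    return Z_horizon @ W_star                            # (H, m) forecasts

# Example usage with the DeepTimeINR sketch above, reusing its cff + mlp as the meta learner:
# meta_learner = lambda tau: model.mlp(model.cff(tau))
# forecasts = ridge_forecast(meta_learner, tau[:L], y[:L], tau[L:])
```

Because the linear solve is differentiable, training reduces to ordinary backpropagation of the outer-loop loss into the meta parameters.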
Figure 4. Predictions of DeepTime on three unseen functions for each function class. The orange line represents the split between
lookback window and forecast horizon.
Table 1. Multivariate forecasting benchmark on long sequence time-series forecasting. Best results are highlighted in bold, and second
best results are underlined.
Methods DeepTime NS Transformer N-HiTS ETSformer FEDformer Autoformer Informer LogTrans GP
Metrics MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTm2
96 0.166 0.257 0.192 0.274 0.176 0.255 0.189 0.280 0.203 0.287 0.255 0.339 0.365 0.453 0.768 0.642 0.442 0.422
192 0.225 0.302 0.280 0.339 0.245 0.305 0.253 0.319 0.269 0.328 0.281 0.340 0.533 0.563 0.989 0.757 0.605 0.505
336 0.277 0.336 0.334 0.361 0.295 0.346 0.314 0.357 0.325 0.366 0.339 0.372 1.363 0.887 1.334 0.872 0.731 0.569
720 0.383 0.409 0.417 0.413 0.401 0.426 0.414 0.413 0.421 0.415 0.422 0.419 3.379 1.388 3.048 1.328 0.959 0.669
ECL
96 0.137 0.238 0.169 0.273 0.147 0.249 0.187 0.304 0.183 0.297 0.201 0.317 0.274 0.368 0.258 0.357 0.503 0.538
192 0.152 0.252 0.182 0.286 0.167 0.269 0.199 0.315 0.195 0.308 0.222 0.334 0.296 0.386 0.266 0.368 0.505 0.543
336 0.166 0.268 0.200 0.304 0.186 0.290 0.212 0.329 0.212 0.313 0.231 0.338 0.300 0.394 0.280 0.380 0.612 0.614
720 0.201 0.302 0.222 0.321 0.243 0.340 0.233 0.345 0.231 0.343 0.254 0.361 0.373 0.439 0.283 0.376 0.652 0.635
Exchange
96 0.081 0.205 0.111 0.237 0.092 0.211 0.085 0.204 0.139 0.276 0.197 0.323 0.847 0.752 0.968 0.812 0.136 0.267
192 0.151 0.284 0.219 0.335 0.208 0.322 0.182 0.303 0.256 0.369 0.300 0.369 1.204 0.895 1.040 0.851 0.229 0.348
336 0.314 0.412 0.421 0.476 0.371 0.443 0.348 0.428 0.426 0.464 0.509 0.524 1.672 1.036 1.659 1.081 0.372 0.447
720 0.856 0.663 1.092 0.769 0.888 0.723 1.025 0.774 1.090 0.800 1.447 0.941 2.478 1.310 1.941 1.127 1.135 0.810
Traffic
96 0.390 0.275 0.612 0.338 0.402 0.282 0.607 0.392 0.562 0.349 0.613 0.388 0.719 0.391 0.684 0.384 1.112 0.665
192 0.402 0.278 0.613 0.340 0.420 0.297 0.621 0.399 0.562 0.346 0.616 0.382 0.696 0.379 0.685 0.390 1.133 0.671
336 0.415 0.288 0.618 0.328 0.448 0.313 0.622 0.396 0.570 0.323 0.622 0.337 0.777 0.420 0.733 0.408 1.274 0.723
720 0.449 0.307 0.653 0.355 0.539 0.353 0.632 0.396 0.596 0.368 0.660 0.408 0.864 0.472 0.717 0.396 1.280 0.719
Weather
96 0.166 0.221 0.173 0.223 0.158 0.195 0.197 0.281 0.217 0.296 0.266 0.336 0.300 0.384 0.458 0.490 0.395 0.356
192 0.207 0.261 0.245 0.285 0.211 0.247 0.237 0.312 0.276 0.336 0.307 0.367 0.598 0.544 0.658 0.589 0.450 0.398
336 0.251 0.298 0.321 0.338 0.274 0.300 0.298 0.353 0.339 0.380 0.359 0.359 0.578 0.523 0.797 0.652 0.508 0.440
720 0.301 0.338 0.414 0.410 0.351 0.353 0.352 0.388 0.403 0.428 0.419 0.419 1.059 0.741 0.869 0.675 0.498 0.450
ILI
24 2.425 1.086 2.294 0.945 1.862 0.869 2.527 1.020 2.203 0.963 3.483 1.287 5.764 1.677 4.480 1.444 2.331 1.036
36 2.231 1.008 1.825 0.848 2.071 0.969 2.615 1.007 2.272 0.976 3.103 1.148 4.755 1.467 4.799 1.467 2.167 1.002
48 2.230 1.016 2.010 0.900 2.346 1.042 2.359 0.972 2.209 0.981 2.669 1.085 4.763 1.469 4.800 1.468 2.961 1.180
60 2.143 0.985 2.178 0.963 2.560 1.073 2.487 1.016 2.545 1.061 2.770 1.125 5.264 1.564 5.278 1.560 3.108 1.214
Results We compare DeepTime to the following baselines for the multivariate setting: N-HiTS (Challu et al., 2022), ETSformer (Woo et al., 2022), FEDformer (Zhou et al., 2022) (we report the best score for each setting from the two variants they present), Autoformer (Xu et al., 2021), Informer (Zhou et al., 2021), LogTrans (Li et al., 2019), Non-stationary (NS) Transformer (Liu et al., 2022), and Gaussian Process (GP) (Rasmussen, 2003). For the univariate setting, we include additional univariate forecasting models: N-BEATS (Oreshkin et al., 2020), DeepAR (Salinas et al., 2020), Prophet (Taylor & Letham, 2018), and ARIMA. Baseline results are obtained from the respective papers. Table 1 and Table 7 (in Appendix I, due to space constraints) summarize the multivariate and univariate forecasting results respectively. DeepTime achieves state-of-the-art performance on 20 out of 24 settings in MSE, and 17 out of 24 settings in MAE, on the multivariate benchmark, and also achieves competitive results on the univariate benchmark despite its simple architecture compared to the baselines, which comprise complex fully connected architectures and computationally intensive Transformer architectures.

5.3. Empirical Analysis and Ablation Studies

We first perform empirical analyses informed by the insights from our theoretical analysis. Theorem 4.1 states that the generalization error is bounded by the training error and two complexity terms. Figure 5 and Table 2 analyse the test error (which approximates the generalization error) as the number of instances, n, and the lookback window length, L, vary. For the first complexity term, controlled by the denominator n, we observe in Figure 5 that the test error decreases as n increases.
For the second complexity term, controlled by the length of lookback and horizon, since H is an experimental setting, we perform a sensitivity analysis on the lookback length, setting L = µ ∗ H. Similarly, the test error decreases as L increases, plateauing and even increasing slightly as L grows extremely large. We expect the test error to plateau as the associated term goes to zero. As the number of instances available for training decreases as L grows large, the increase in test error can be attributed to a decrease in n. Additional sensitivity analysis on our proposed concatenated Fourier features can be found in Appendix K, showing that it performs no worse than an extensive hyperparameter sweep on a standard random Fourier features layer.

Figure 5. Analysis on the number of instances, n. MSE is measured as the number of instances increases, for multiple horizon lengths. Analysis is performed on the ETTm2 dataset.

Table 2. Analysis on the lookback window length, based on a multiplier on the horizon length, L = µ ∗ H. Results presented on the ETTm2 dataset. Best results are highlighted in bold.
Horizon 96 192 336 720
µ MSE MAE MSE MAE MSE MAE MSE MAE
1 0.192 0.287 0.255 0.332 0.294 0.354 0.383 0.409
3 0.172 0.264 0.228 0.304 0.277 0.336 0.371 0.403
5 0.168 0.259 0.225 0.302 0.275 0.337 0.389 0.420
7 0.166 0.257 0.223 0.300 0.279 0.343 0.440 0.451
9 0.165 0.258 0.223 0.301 0.285 0.350 0.409 0.434

In Table 3 we perform an ablation study on various backbone architectures, while retaining the differentiable closed-form ridge regressor. We observe a degradation when the random Fourier features layer is removed, due to the spectral bias problem which neural networks face (Rahaman et al., 2019; Tancik et al., 2020). DeepTime outperforms the SIREN variant of INRs, which is consistent with observations in the INR literature. DeepTime also outperforms the RNN variant, which is the model proposed in (Grazzi et al., 2021). This is a direct comparison between IMS historical-value models and time-index models, and highlights the benefits of time-index models.

Table 3. Ablation study on backbone architectures. DeepTime uses the concatenated Fourier features layer, with scales sampled from a range of scales. MLP refers to replacing the random Fourier features with a linear map from input dimension to hidden dimension. SIREN refers to an INR with periodic activations as proposed by Sitzmann et al. (2020b). RNN refers to an autoregressive recurrent neural network (inputs are the time-series values, yt). All approaches include the differentiable closed-form ridge regressor. Further model details can be found in Appendix L.2.
Methods DeepTime MLP SIREN RNN
Metrics MSE MAE MSE MAE MSE MAE MSE MAE
ETTm2
96 0.166 0.257 0.186 0.284 0.236 0.325 0.233 0.324
192 0.225 0.302 0.265 0.338 0.295 0.361 0.275 0.337
336 0.277 0.336 0.316 0.372 0.327 0.386 0.344 0.383
720 0.383 0.409 0.401 0.417 0.438 0.453 0.431 0.432

We perform an ablation study to understand how various training schemes and input features affect the performance of DeepTime. Table 4 presents these results. First, we observe that our meta-optimization formulation is a critical component to the success of DeepTime. We note that DeepTime without meta-optimization may not be a meaningful baseline, since the model outputs are always the same regardless of the input lookback window. Including datetime features helps alleviate this issue, yet we observe that the inclusion of datetime features generally leads to a degradation in performance. In the case of DeepTime, we observed that the inclusion of datetime features leads to a much lower training loss but a degradation in test performance – this is a case of meta-learning memorization (Yin et al., 2020) due to the tasks becoming non-mutually exclusive (Rajendran et al., 2020). We also observe that the meta-optimization formulation is indeed superior to training a model from scratch for each lookback window. Finally, while we expect full MAML to always outperform the fast and efficient meta-optimization, in reality there are many complications in such gradient-based bi-level optimization methods – they are difficult to optimize and unstable during training. Restricting the base parameters to only the last layer of the INR provides a useful prior which enables stable optimization and high generalization without facing these problems.

5.4. Computational Efficiency

Finally, we analyse DeepTime's efficiency in both runtime and memory usage, with respect to both lookback window and forecast horizon lengths. The main bottleneck in computation for DeepTime is the matrix inversion operation in the ridge regressor, canonically of O(n³) complexity. This is a major concern for DeepTime as n is linked to the length of the lookback window. As mentioned in (Bertinetto et al., 2019), the Woodbury formulation,

    W* = Z^T (Z Z^T + λI)^{−1} Y

is used to alleviate the problem, leading to an O(d³) complexity, where d is the hidden size hyperparameter, fixed to some value (see Appendix H). Figure 6 demonstrates that DeepTime is highly efficient, even when compared to efficient Transformer models recently proposed for the long sequence time series forecasting task, as well as fully connected models.
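To illustrate the cost trade-off, the sketch below contrasts the two algebraically equivalent ridge regression solutions for features Z of shape (L, d): one solves a d × d system, whose cost is independent of the lookback length, and the other an L × L system; picking the smaller dimension yields the efficiency discussed above. The function names and the selection heuristic are illustrative, not taken from the released code.

```python
import torch

def ridge_primal(Z, Y, lam):
    """Solve the d x d system (Z^T Z + lam*I_d) W = Z^T Y -- cost O(d^3)."""
    d = Z.shape[1]
    return torch.linalg.solve(Z.T @ Z + lam * torch.eye(d, dtype=Z.dtype), Z.T @ Y)

def ridge_dual(Z, Y, lam):
    """Equivalent form W = Z^T (Z Z^T + lam*I_L)^{-1} Y -- solves an L x L system, cost O(L^3)."""
    L = Z.shape[0]
    return Z.T @ torch.linalg.solve(Z @ Z.T + lam * torch.eye(L, dtype=Z.dtype), Y)

def ridge(Z, Y, lam=1.0):
    """Pick whichever form solves the smaller system."""
    return ridge_primal(Z, Y, lam) if Z.shape[1] <= Z.shape[0] else ridge_dual(Z, Y, lam)

Z = torch.randn(1440, 256, dtype=torch.double)   # long lookback window, fixed hidden size d
Y = torch.randn(1440, 7, dtype=torch.double)
W = ridge(Z, Y)                                   # solves the 256 x 256 system
print(torch.allclose(W, ridge_dual(Z, Y, 1.0)))   # the two forms agree
```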
Table 4. Ablation study on variants of DeepTime. Starting from the original version, we add (+) or remove (-) some component from
DeepTime. Datetime refers to datetime features. RR stands for the differentiable closed-form ridge regressor, removing it refers to
replacing this module with a simple linear layer trained via gradient descent across all training samples (i.e. without meta-optimization
formulation). Local refers to training an INR from scratch via gradient descent for each lookback window (RR is not used here, and there
is no training phase). + Finetune refers to training an INR via gradient descent for each lookback window on top of having a training
phase. Full MAML refers to performing gradient steps for the inner loop and backpropagating through them for the outer loop as in (Finn
et al., 2017). Further details on the variants can be found in Appendix L.1.
Methods: DeepTime, + Datetime, - RR, - RR + Datetime, + Local, + Local + Datetime, + Finetune, + Finetune + Datetime, Full MAML
Metrics MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTm2
96 0.166 0.257 0.226 0.303 3.072 1.345 3.393 1.400 0.251 0.331 0.250 0.327 3.028 1.328 3.242 1.365 0.235 0.326
192 0.225 0.302 0.309 0.362 3.064 1.343 3.269 1.381 0.322 0.371 0.323 0.366 3.043 1.341 3.385 1.391 0.295 0.361
336 0.277 0.336 0.341 0.381 2.920 1.309 3.442 1.401 0.370 0.412 0.367 0.396 2.950 1.331 3.367 1.387 0.348 0.392
720 0.383 0.409 0.453 0.447 2.773 1.273 3.400 1.399 0.443 0.449 0.455 0.461 2.721 1.253 3.476 1.407 0.491 0.484
(Figure 6: (a) runtime analysis and (b) memory analysis, plotted against lookback window and forecast horizon lengths, comparing DeepTime with ETSformer (K=1), Autoformer, Informer, N-BEATS, and N-HiTS.)
Figure 6. Computational efficiency benchmark on the ETTm2 multivariate dataset, on a batch size of 32. Runtime is measured for
one iteration (forward + backward pass). Left: Runtime/Memory usage as lookback length varies, horizon is fixed to 48. Right:
Runtime/Memory usage as horizon length varies, lookback length is fixed to 48. Further model details can be found in Appendix M.
Chevillon, G. Direct multi-step estimation and forecasting. Journal of Economic Surveys, 21(4):746–785, 2007.

Corani, G., Benavoli, A., and Zaffalon, M. Time series forecasting with gaussian processes needs priors. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 103–117. Springer, 2021.

Dupont, E., Teh, Y. W., and Doucet, A. Generative models as distributions of functions. arXiv preprint arXiv:2102.04776, 2021.

Kim, K.-j. Financial time series forecasting using support vector machines. Neurocomputing, 55(1-2):307–319, 2003.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Lai, G., Chang, W.-C., Yang, Y., and Liu, H. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104, 2018.
Laptev, N., Yosinski, J., Li, L. E., and Smyl, S. Time-series extreme event forecasting with neural networks at uber. In International conference on machine learning, volume 34, pp. 1–5, 2017.

Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.-X., and Yan, X. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. ArXiv, abs/1907.00235, 2019.

Liu, Y., Wu, H., Wang, J., and Long, M. Non-stationary transformers: Exploring the stationarity in time series forecasting. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=ucNDIDRNjjv.

Marcellino, M., Stock, J. H., and Watson, M. W. A comparison of direct and iterated multistep ar methods for forecasting macroeconomic time series. Journal of econometrics, 135(1-2):499–526, 2006.

McAllester, D. A. Pac-bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pp. 164–170, 1999.

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pp. 405–421. Springer, 2020.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.

Ord, K., Fildes, R. A., and Kourentzes, N. Principles of business forecasting. Wessex Press Publishing Co., 2017.

Oreshkin, B. N., Carpov, D., Chapados, N., and Bengio, Y. N-beats: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1ecqn4YwB.

Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., and Courville, A. On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310. PMLR, 2019.

Rajendran, J., Irpan, A., and Jang, E. Meta-learning requires meta-augmentation. Advances in Neural Information Processing Systems, 33:5705–5715, 2020.

Rasmussen, C. E. Gaussian processes in machine learning. In Summer school on machine learning, pp. 63–71. Springer, 2003.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In ICLR, 2017.

Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.

Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

Sitzmann, V., Chan, E., Tucker, R., Snavely, N., and Wetzstein, G. Metasdf: Meta-learning signed distance functions. Advances in Neural Information Processing Systems, 33:10136–10147, 2020a.

Sitzmann, V., Martel, J., Bergman, A., Lindell, D., and Wetzstein, G. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33:7462–7473, 2020b.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.

Taieb, S. B., Hyndman, R. J., et al. Recursive and direct multi-step forecasting: the best of both worlds, volume 19. Citeseer, 2012.

Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., and Ng, R. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems, 33:7537–7547, 2020.

Tancik, M., Mildenhall, B., Wang, T., Schmidt, D., Srinivasan, P. P., Barron, J. T., and Ng, R. Learned initializations for optimizing coordinate-based neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2846–2855, 2021.

Taylor, S. J. and Letham, B. Forecasting at scale. The American Statistician, 72(1):37–45, 2018.

Tewari, A., Thies, J., Mildenhall, B., Srinivasan, P., Tretschk, E., Wang, Y., Lassner, C., Sitzmann, V., Martin-Brualla, R., Lombardi, S., et al. Advances in neural rendering. arXiv preprint arXiv:2111.05849, 2021.

Woo, G., Liu, C., Sahoo, D., Kumar, A., and Hoi, S. Etsformer: Exponential smoothing transformers for time-series forecasting. arXiv preprint arXiv:2202.01381, 2022.

Xu, J., Wang, J., Long, M., et al. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34, 2021.
Yin, M., Tucker, G., Zhou, M., Levine, S., and Finn,
C. Meta-learning without memorization. In In-
ternational Conference on Learning Representations,
2020. URL https://openreview.net/forum?
id=BklEFpEYwS.
Young, P. C., Pedregal, D. J., and Tych, W. Dynamic har-
monic regression. Journal of forecasting, 18(6):369–394,
1999.
Yüce, G., Ortiz-Jiménez, G., Besbinar, B., and Frossard,
P. A structured dictionary perspective on implicit neural
representations. arXiv preprint arXiv:2112.01917, 2021.
Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H.,
and Zhang, W. Informer: Beyond efficient transformer
for long sequence time-series forecasting. In Proceedings
of AAAI, 2021.
Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., and Jin,
R. Fedformer: Frequency enhanced decomposed trans-
former for long-term series forecasting. arXiv preprint
arXiv:2201.12740, 2022.
A. DeepTime Pseudocode
time_index = linspace(0, 1, lookback_len + horizon_len)  # shape: (lookback_len + horizon_len)
time_index = rearrange(time_index, 't -> t 1')  # shape: (lookback_len + horizon_len, 1)
time_reprs = inr(time_index)  # shape: (lookback_len + horizon_len, hidden_dim)
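Only the first three lines of the listing survived extraction. A sketch of the remaining steps, written in the same style and consistent with Section 3.2 (the variable names and the ridge_regressor helper are assumptions, not the authors' original listing), would be:

lookback_reprs = time_reprs[:lookback_len]  # shape: (lookback_len, hidden_dim)
horizon_reprs = time_reprs[lookback_len:]  # shape: (horizon_len, hidden_dim)
W = ridge_regressor(lookback_reprs, y_lookback, lam)  # closed-form solve of Eq. 4, shape: (hidden_dim, m)
forecast = horizon_reprs @ W  # shape: (horizon_len, m)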
Table 5. Categorization of time-series forecasting methods over the dimensions of time-index vs historical-value methods, and DMS vs IMS methods.

DMS — Time-index: DeepTime, Prophet, Gaussian process, Time-series regression; Historical-value: N-HiTS, FEDformer, ETSformer, Autoformer, Informer, N-BEATS
IMS — Time-index: –; Historical-value: DeepAR, LogTrans, ARIMA, ETS
Multi-step Forecasts Forecasting over a horizon (multiple time steps) can be achieved via two strategies, direct multi-step or iterative multi-step (Marcellino et al., 2006; Chevillon, 2007; Taieb et al., 2012), or even a mixture of both, though the latter has been less explored (a minimal sketch contrasting the two follows the list below):
• Direct Multi-step (DMS): A DMS forecaster directly predicts forecasts for the entire horizon. For example, to achieve
a multi-step forecast of H time steps, a DMS forecaster simply outputs H values in a single forward pass.
• Iterative Multi-step (IMS): An IMS forecaster iteratively predicts one step ahead, and consumes this forecast to make
a subsequent prediction. This is performed iteratively, until the desired length is achieved.
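A minimal sketch contrasting the two strategies, assuming a generic one-step model for IMS and a direct multi-horizon model for DMS (both model interfaces are illustrative):

```python
import torch

def forecast_dms(dms_model, y_lookback, H):
    """Direct multi-step: one forward pass outputs all H future values."""
    return dms_model(y_lookback)                                # shape: (H, m)

def forecast_ims(one_step_model, y_lookback, H):
    """Iterative multi-step: predict one step, feed it back, repeat H times."""
    history = y_lookback.clone()
    preds = []
    for _ in range(H):
        y_next = one_step_model(history)                        # shape: (m,)
        preds.append(y_next)
        history = torch.cat([history[1:], y_next.unsqueeze(0)])  # slide the window
    return torch.stack(preds)                                   # shape: (H, m)
```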
    er(Q) ≤ [c1 c2 / ((1 − e^{−c1})(1 − e^{−c2}))] · (1/n) Σ_{k=1}^{n} êr(Q, Sk)
          + [c1 / (1 − e^{−c1})] · [KL(Q||P) + log(2/δ)] / (n c1)
          + [c1 c2 / ((1 − e^{−c2})(1 − e^{−c1}))] · [KL(Q||P) + log(2n/δ)] / ((H + L) c2)    (6)
Proof. Our proof contains two steps. First, we bound the error within observed instances due to observing a limited number
of samples. Then we bound the error on the instance environment level due to observing a finite number of instances. Both
of the two steps utilize Catoni’s classical PAC-Bayes bound (Catoni, 2007) to measure the error. Here, we give Catoni’s
classical PAC-Bayes bound.
Theorem D.2. (Catoni’s bound (Catoni, 2007)) Let X be a sample space, P (X) a distribution over X , Θ a hypothesis
space. Given a loss function ℓ(θ, X) : Θ × X → [0, 1] and a collection of M i.i.d random variables (X1 , . . . , XM ) sampled
from P (X). Let π be a prior distribution over hypothesis space. Then, for any δ ∈ (0, 1] and any real number c > 0, the
following bound holds uniformly for all posterior distributions ρ over hypothesis space,
    P( E_{Xi∼P(X), θ∼ρ}[ℓ(θ, Xi)] ≤ [c / (1 − e^{−c})] · [ (1/M) Σ_{m=1}^{M} E_{θ∼ρ}[ℓ(θ, Xm)] + (KL(ρ||π) + log(1/δ)) / (M c) ],  ∀ρ ) ≥ 1 − δ.
We first utilize Theorem D.2 to bound the generalization error in each of the observed instances. Let k be the index of an instance, with expected error er(Q, Dk) and empirical error êr(Q, Sk). Then, according to Theorem D.2, for any δk ∈ (0, 1] and c2 > 0, we have
    P( er(Q, Dk) ≤ [c2 / (1 − e^{−c2})] · êr(Q, Sk) + [c2 / (1 − e^{−c2})] · [KL(Q||P) + log(1/δk)] / ((H + L) c2) ) ≥ 1 − δk.    (9)
Next, we bound the error due to observing a limited number of instances from the environment. Similarly, we have the definition of the expected instance error as follows:

    er(Q) = E_{D∼E} E_{S∼D^{H+L}} E_{P∼Q} E_{h∼Q(S,P)} E_{z∼D} [ℓ(h, z)]
Then, by Theorem D.2, the following holds for any δ0 ∈ (0, 1] and c1 > 0:

    P( er(Q) ≤ [c1 / (1 − e^{−c1})] · (1/n) Σ_{k=1}^{n} er(Q, Dk) + [c1 / (1 − e^{−c1})] · [KL(Q||P) + log(1/δ0)] / (n c1) ) ≥ 1 − δ0.    (12)
Finally, by employing a union bound argument (Lemma 1, Amit & Meir (2018)), we can bound the probability of the intersection of the events in Equation (12) and Equation (9). For any δ > 0, set δ0 = δ/2 and δk = δ/(2n) for k = 1, . . . , n, giving

    P( er(Q) ≤ [c1 c2 / ((1 − e^{−c1})(1 − e^{−c2}))] · (1/n) Σ_{k=1}^{n} êr(Q, Sk)
             + [c1 / (1 − e^{−c1})] · [KL(Q||P) + log(2/δ)] / (n c1)
             + [c1 c2 / ((1 − e^{−c2})(1 − e^{−c1}))] · [KL(Q||P) + log(2n/δ)] / ((H + L) c2) ) ≥ 1 − δ.    (13)
E. Synthetic Data
The training set for each synthetic data experiment consists of 1000 functions/tasks, while the test set contains 100 functions/tasks. We ensure that there is no overlap between the train and test sets.
Linear Samples are generated from the function y = ax + b for x ∈ [−1, 1]. This means that each function/task consists of 400 evenly spaced points between −1 and 1. The parameters of each function/task (i.e. a, b) are sampled from a normal distribution with mean 0 and standard deviation 50, i.e. a, b ∼ N(0, 50²).
Cubic Samples are generated from the function y = ax3 + bx2 + cx + d for x ∈ [−1, 1] for 400 points. Parameters of
each task are sampled from a continuous uniform distribution with minimum value of -50 and maximum value of 50, i.e.
a, b, c, d ∼ U(−50, 50).
Sums of sinusoids Sinusoids come from a fixed set of frequencies, generated by sampling ω ∼ U(0, 12π). We fix the size of this set to be five, i.e. Ω = {ω1, . . . , ω5}. Each function is then a sum of J sinusoids, where J ∈ {1, 2, 3, 4, 5} is randomly assigned. The function is thus y = Σ_{j=1}^{J} Aj sin(ω_{rj} x + φj) for x ∈ [0, 1], where the amplitudes and phase shifts are freely chosen via Aj ∼ U(0.1, 5) and φj ∼ U(0, π), while the frequency is decided by rj ∈ {1, 2, 3, 4, 5}, which randomly selects a frequency from the set Ω.
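A small sketch of this generation process (NumPy, with the 400-point grid and the sampling ranges taken from the descriptions above; the function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
omegas = rng.uniform(0, 12 * np.pi, size=5)         # fixed frequency set, |Omega| = 5

def sample_sum_of_sinusoids(n_points=400):
    x = np.linspace(0, 1, n_points)
    J = rng.integers(1, 6)                           # number of summed sinusoids, 1..5
    y = np.zeros(n_points)
    for _ in range(J):
        A = rng.uniform(0.1, 5)                      # amplitude
        phi = rng.uniform(0, np.pi)                  # phase shift
        omega = rng.choice(omegas)                   # frequency drawn from the fixed set
        y += A * np.sin(omega * x + phi)
    return x, y

x, y = sample_sum_of_sinusoids()
```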
The predictions from DeepTime in Figure 4 demonstrate some noise, likely stemming from the model’s capability to learn
high frequency features due to the use of implicit neural representations with random Fourier features. Since the synthetic
data are all low frequency, smoothly changing functions, the noise is likely to be artifacts from the concatenated Fourier
features layer, which should go away if the scale parameter of the Fourier features is carefully fine-tuned. However, the
power of our proposed concatenated Fourier features layer is that the model is able to fit to both high and low frequency
features without tuning, though at the expense of some noise as seen in the figure.
F. Datasets
ETT1 (Zhou et al., 2021) - Electricity Transformer Temperature provides measurements from an electricity transformer, such as load and oil temperature. We use the ETTm2 subset, consisting of measurements at a 15-minute frequency.
ECL2 - Electricity Consuming Load provides measurements of electricity consumption for 321 households from 2012 to 2014. The data was collected at the 15-minute level, but is aggregated hourly.
Exchange3 (Lai et al., 2018) - a collection of daily exchange rates with USD of eight countries (Australia, United Kingdom,
Canada, Switzerland, China, Japan, New Zealand, and Singapore) from 1990 to 2016.
Traffic4 - dataset from the California Department of Transportation providing the hourly road occupancy rates from 862
sensors in San Francisco Bay area freeways.
Weather5 - provides measurements of 21 meteorological indicators such as air temperature, humidity, etc., every 10 minutes
for the year of 2020 from the Weather Station of the Max Planck Biogeochemistry Institute in Jena, Germany.
ILI6 - Influenza-like Illness measures the weekly ratio of patients seen with ILI to the total number of patients, obtained by the Centers for Disease Control and Prevention of the United States between 2002 and 2021.
1. https://github.com/zhouhaoyi/ETDataset
2. https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
3. https://github.com/laiguokun/multivariate-time-series-data
4. https://pems.dot.ca.gov/
5. https://www.bgc-jena.mpg.de/wetter/
6. https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html
Model The ridge regression regularization coefficient is a learnable parameter constrained to positive values via a softplus function. We apply Dropout (Srivastava et al., 2014) and then LayerNorm (Ba et al., 2016) after the ReLU activation function in each INR layer. The size of the random Fourier features layer is set independently of the layer size: we define the total size of the random Fourier features layer, and the number of dimensions allotted to each scale is divided equally.
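A short sketch of how these details might be wired up (the softplus-constrained coefficient and the Dropout-then-LayerNorm ordering follow the description above; the module names and the dropout rate are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class INRLayer(nn.Module):
    """Linear -> ReLU -> Dropout -> LayerNorm, as used inside each INR layer."""
    def __init__(self, in_dim, out_dim, dropout=0.1):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):
        return self.norm(self.dropout(F.relu(self.linear(x))))

# Learnable, positivity-constrained ridge regularization coefficient.
raw_lam = nn.Parameter(torch.zeros(1))
lam = F.softplus(raw_lam)   # always > 0, trained jointly with the meta parameters
```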
H. DeepTime Hyperparameters
Table 7. Univariate forecasting benchmark on long sequence time-series forecasting. Best results are highlighted in bold, and second best
results are underlined.
Methods DeepTime N-HiTS ETSformer Fedformer Autoformer Informer N-BEATS DeepAR Prophet ARIMA GP
Metrics MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTm2
96 0.065 0.186 0.066 0.185 0.080 0.212 0.063 0.189 0.065 0.189 0.088 0.225 0.082 0.219 0.099 0.237 0.287 0.456 0.211 0.362 0.125 0.273
192 0.096 0.234 0.087 0.223 0.150 0.302 0.102 0.245 0.118 0.256 0.132 0.283 0.120 0.268 0.154 0.310 0.312 0.483 0.261 0.406 0.154 0.307
336 0.138 0.285 0.106 0.251 0.175 0.334 0.130 0.279 0.154 0.305 0.180 0.336 0.226 0.370 0.277 0.428 0.331 0.474 0.317 0.448 0.189 0.338
720 0.186 0.338 0.157 0.312 0.224 0.379 0.178 0.325 0.182 0.335 0.300 0.435 0.188 0.338 0.332 0.468 0.534 0.593 0.366 0.487 0.318 0.421
Exchange
96 0.086 0.226 0.093 0.223 0.099 0.230 0.131 0.284 0.241 0.299 0.591 0.615 0.156 0.299 0.417 0.515 0.828 0.762 0.112 0.245 0.165 0.311
192 0.173 0.330 0.230 0.313 0.223 0.353 0.277 0.420 0.273 0.665 1.183 0.912 0.669 0.665 0.813 0.735 0.909 0.974 0.304 0.404 0.649 0.617
336 0.539 0.575 0.370 0.486 0.421 0.497 0.426 0.511 0.508 0.605 1.367 0.984 0.611 0.605 1.331 0.962 1.304 0.988 0.736 0.598 0.596 0.592
720 0.936 0.763 0.728 0.569 1.114 0.807 1.162 0.832 0.991 0.860 1.872 1.072 1.111 0.860 1.890 1.181 3.238 1.566 1.871 0.935 1.002 0.786
Table 8. DeepTime main benchmark results with standard deviation. Experiments are performed over three runs.
Multivariate benchmark:
Dataset Horizon MSE (SD) MAE (SD)
ETTm2 96 0.166 (0.000) 0.257 (0.001)
ETTm2 192 0.225 (0.001) 0.302 (0.003)
ETTm2 336 0.277 (0.002) 0.336 (0.002)
ETTm2 720 0.383 (0.007) 0.409 (0.006)
ECL 96 0.137 (0.000) 0.238 (0.000)
ECL 192 0.152 (0.000) 0.252 (0.000)
ECL 336 0.166 (0.000) 0.268 (0.000)
ECL 720 0.201 (0.000) 0.302 (0.000)
Exchange 96 0.081 (0.001) 0.205 (0.002)

Univariate benchmark:
Dataset Horizon MSE (SD) MAE (SD)
ETTm2 96 0.065 (0.000) 0.186 (0.000)
ETTm2 192 0.096 (0.002) 0.234 (0.003)
ETTm2 336 0.138 (0.001) 0.285 (0.001)
ETTm2 720 0.186 (0.002) 0.338 (0.002)
Exchange 96 0.086 (0.000) 0.226 (0.000)
Exchange 192 0.173 (0.004) 0.330 (0.003)
Exchange 336 0.539 (0.066) 0.575 (0.027)
Exchange 720 0.936 (0.222) 0.763 (0.075)
Table 9. Comparison of CFF against the optimal and pessimal scales obtained from the hyperparameter sweep. We also calculate the change in performance between CFF and the optimal and pessimal scales, where a positive percentage means CFF underperforms and a negative percentage means CFF outperforms, calculated as % change = (MSE_CFF − MSE_Scale) / MSE_Scale.
CFF Optimal Scale (% change) Pessimal Scale (% change)
Metrics MSE MAE MSE MAE MSE MAE
ETTm2
96 0.166 0.257 0.164 (1.20%) 0.257 (-0.05%) 0.216 (-23.22%) 0.300 (-14.22%)
192 0.225 0.302 0.220 (1.87%) 0.301 (0.25%) 0.275 (-18.36%) 0.340 (-11.25%)
336 0.277 0.336 0.275 (0.70%) 0.336 (-0.22%) 0.340 (-18.68%) 0.375 (-10.57%)
720 0.383 0.409 0.364 (5.29%) 0.392 (4.48%) 0.424 (-9.67%) 0.430 (-4.95%)
Table 10. Results from hyperparameter sweep on the scale hyperparameter. Best scores are highlighted in bold, and worst scores are
highlighted in bold red.
Scale Hyperparam 0.01 0.1 1 5 10 20 50 100
Metrics MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTm2
96 0.216 0.300 0.189 0.285 0.173 0.268 0.168 0.262 0.166 0.260 0.165 0.258 0.165 0.259 0.164 0.257
192 0.275 0.340 0.264 0.333 0.239 0.317 0.225 0.301 0.225 0.303 0.224 0.302 0.224 0.304 0.220 0.301
336 0.340 0.375 0.319 0.371 0.292 0.351 0.275 0.337 0.277 0.336 0.282 0.345 0.278 0.342 0.280 0.344
720 0.424 0.430 0.405 0.420 0.381 0.412 0.364 0.392 0.375 0.408 0.410 0.430 0.396 0.423 0.406 0.429
We perform a comparison between the optimal and pessimal scale hyperparameter for the vanilla random Fourier features
layer, against our proposed CFF. We first report the results on each scale hyperparameter for the vanilla random Fourier
features layer in Table 10. As with the other ablation studies, the results reported in Table 10 are based on performing a
hyperparameter sweep across lookback length multiplier, and selecting the optimal settings based on the validation set, and
reporting the test set results. Then, the optimal and pessimal scales are simply the best and worst results based on Table 10.
Table 9 shows that CFF achieves extremely low deviation from the optimal scale across all settings, yet retains the upside of
avoiding this expensive hyperparameter tuning phase. We also observe that tuning the scale hyperparameter is extremely
important, as CFF obtains up to a 23.22% improvement in MSE over the pessimal scale hyperparameter.
1. Quarter-of-year
2. Month-of-year
3. Week-of-year
4. Day-of-year
5. Day-of-month
6. Day-of-week
7. Hour-of-day
8. Minute-of-hour
9. Second-of-minute
Each feature is initially an integer value, e.g. month-of-year can take on values in {0, 1, . . . , 11}, which we subsequently
normalize to a [0, 1] range. Depending on the data sampling frequency, the appropriate features can be chosen. For the
ETTm2 dataset, we used all features except second-of-minute since it is sampled at a 15-minute frequency.
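A small sketch of this normalization for a few of the listed features, assuming pandas timestamps (the helper name and the exact divisor convention are illustrative):

```python
import pandas as pd

def datetime_features(ts: pd.Timestamp) -> dict:
    """Map calendar fields to [0, 1] by dividing by the maximum attainable value."""
    return {
        "month_of_year": (ts.month - 1) / 11,   # {0,...,11} -> [0, 1]
        "day_of_week": ts.dayofweek / 6,        # {0,...,6}  -> [0, 1]
        "hour_of_day": ts.hour / 23,            # {0,...,23} -> [0, 1]
        "minute_of_hour": ts.minute / 59,       # {0,...,59} -> [0, 1]
    }

print(datetime_features(pd.Timestamp("2020-07-01 12:30")))
```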
RR Removing the ridge regressor module refers to replacing it with a simple linear layer, Linear : Rd → Rm , where
Linear(x) = W x + b, x ∈ R^d, W ∈ R^{m×d}, b ∈ R^m. This corresponds to a straightforward INR, which is trained across
all lookback-horizon pairs in the dataset.
Local For models marked “Local”, we similarly remove the ridge regressor module and replace it with a linear layer. Yet,
the model is not trained across all lookback-horizon pairs in the dataset. Instead, for each lookback-horizon pair in the
validation/test set, we fit the model to the lookback window via gradient descent, and then perform prediction on the horizon
to obtain the forecasts. A new model is trained from scratch for each lookback-horizon window. We perform tuning on an
extra hyperparameter, the number of epochs to perform gradient descent, for which we search through {10, 20, 30, 40, 50}.
Finetune Models marked “Finetune” are similar to “Local”, except that they have been trained on the training set first,
and for each lookback-horizon pair in the test set, they are “finetuned” on the lookback window.
Full MAML “Full MAML” indicates the setting for which MAML is performed on the entire deep time-index model,
by backpropagating through inner loop gradient steps as per Finn et al. (2017), rather than our proposed fast and efficient
meta-optimization framework. Inner loop optimization is performed using the Adam optimizer, and is tuned over lookback
length multiplier values of {1, 3, 5, 7, 9}, and inner loop iterations of {1, 5, 10}.
MLP The random Fourier features layer is a mapping from coordinate space to latent space, γ : R^c → R^d. To remove the effects of the random Fourier features layer, we simply replace it with a linear map, Linear : R^c → R^d.
SIREN We replace the random Fourier features backbone with the SIREN model introduced by Sitzmann et al. (2020b). In this model, periodic activation functions, i.e. sin(x), are used, along with a specified weight initialization scheme.
RNN We use a 2 layer LSTM with hidden size of 256. Inputs are observations, yt , in an IMS fashion, predicting the next
time step, yt+1 .
N-BEATS We use an N-BEATS model with 3 stacks and 3 layers (relatively small compared to the 30 stacks and 4 layers used in their original paper7), with a hidden size of 512. Note that N-BEATS is a univariate model, and the values presented here are multiplied by a factor of m to account for the multivariate data. Another dimension of comparison is the number of parameters used in each model. As demonstrated in Table 11, for fully connected models like N-BEATS, the number of parameters scales linearly with the lookback window and forecast horizon lengths, while for Transformer-based models and DeepTime, the number of parameters remains constant.
7. https://github.com/ElementAI/N-BEATS/blob/master/experiments/electricity/generic.gin
N-HiTS We use an N-HiTS model with hyperparameters as suggested in their original paper (3 stacks, 1 block in each stack, 2 MLP layers, 512 hidden size). For the following hyperparameters which were not specified (subject to hyperparameter tuning), we set the pooling kernel size to [2, 2, 2], and the number of stack coefficients to [24, 12, 1]. Similar to N-BEATS, N-HiTS is a univariate model, and values were multiplied by a factor of m to account for the multivariate data.
Table 11. Number of parameters in each model across various lookback window and forecast horizon lengths. The models were instantiated
for the ETTm2 multivariate dataset (this affects the embedding and projection layers in Autoformer). Values for N-HiTS in this table are
not multiplied by m since it is a global model (i.e. a single univariate model is used for all dimensions of the time-series).
Methods Autoformer N-HiTS DeepTime
Lookback
48 10,535,943 927,942 1,314,561
96 10,535,943 1,038,678 1,314,561
168 10,535,943 1,204,782 1,314,561