
Temporal modelling and time series analysis

Darren Wilkinson

10 February, 2025
Table of contents

Preface
  Part 1: Introduction to time series and linear filters
  Part 2: Linear stochastic systems
  Part 3: State space modelling
  Part 4: Spatio-temporal modelling
  Reading list and other resources
  About these notes

1 Introduction to time series
  1.1 Time series data
    1.1.1 What is a time series?
    1.1.2 An example
    1.1.3 Detrending
  1.2 Filtering time series
    1.2.1 Smoothing with convolutional linear filters
    1.2.2 Seasonality
    1.2.3 Exponential smoothing and auto-regressive linear filters
  1.3 Multivariate time series
  1.4 Time series analysis

2 Linear systems
  2.1 Introduction
  2.2 Vector random quantities
    2.2.1 Moments
    2.2.2 The multivariate normal distribution
  2.3 First order systems
    2.3.1 Deterministic
    2.3.2 Stochastic
  2.4 Second order systems
    2.4.1 Deterministic
    2.4.2 Stochastic

3 ARMA models
  3.1 Introduction
    3.1.1 Stationarity
  3.2 AR(p)
    3.2.1 The Yule-Walker equations
    3.2.2 The backshift operator
    3.2.3 Example: AR(2)
    3.2.4 Partial auto-covariance and correlation
  3.3 MA(q)
    3.3.1 Backshift notation
    3.3.2 Special case: the MA(1)
    3.3.3 Invertibility
  3.4 ARMA(p,q)
    3.4.1 Parameter redundancy
    3.4.2 Example: ARMA(1,1)

4 Estimation and forecasting
  4.1 Fitting ARMA models to data
    4.1.1 Moment matching
    4.1.2 Least squares
    4.1.3 Maximum likelihood
    4.1.4 Bayesian inference
  4.2 Forecasting an ARMA model
    4.2.1 Forecasting
    4.2.2 Forecasting an AR(p) model
    4.2.3 Forecasting an ARMA(p,q)

5 Spectral analysis
  5.1 Fourier analysis
    5.1.1 Fourier series
    5.1.2 Discrete time Fourier transform
  5.2 Spectral representation
    5.2.1 Spectral densities
  5.3 Finite time series
    5.3.1 The discrete Fourier transform
    5.3.2 The periodogram
    5.3.3 Smoothing with the DFT

6 Hidden Markov models (HMMs)
  6.1 Introduction
    6.1.1 Example
  6.2 Filtering
    6.2.1 Example
  6.3 Marginal likelihood
    6.3.1 Example
  6.4 Smoothing
    6.4.1 Example
  6.5 Sampling
    6.5.1 Example
  6.6 Parameter estimation
    6.6.1 Example
  6.7 R package for HMMs

7 Dynamic linear models (DLMs)
  7.1 Introduction
  7.2 Filtering
    7.2.1 The Kalman filter
    7.2.2 Example
  7.3 Forecasting
  7.4 Marginal likelihood
    7.4.1 Example
  7.5 Smoothing
    7.5.1 The RTS smoother
  7.6 Sampling
  7.7 Parameter estimation
    7.7.1 Example
  7.8 R package for DLMs

8 State space modelling
  8.1 Introduction
  8.2 Polynomial trend
    8.2.1 Locally constant model
    8.2.2 Locally linear model
    8.2.3 Higher-order polynomial models
  8.3 Seasonal models
    8.3.1 Seasonal effects
    8.3.2 Fourier components
  8.4 Model superposition
    8.4.1 Example: monthly births
  8.5 ARMA models in state space form
    8.5.1 AR models
    8.5.2 ARMA models
    8.5.3 Example: SOI

9 Spatio-temporal models and data
  9.1 Introduction
  9.2 Exploring spatio-temporal data
  9.3 Spatio-temporal modelling
    9.3.1 Spatial models
    9.3.2 Dynamic models for spatio-temporal data
    9.3.3 Dynamic latent process models
    9.3.4 R software
  9.4 Example: German air quality data
  9.5 Wrap-up

References
Preface

These notes correspond to the second half of the module MATH4341: Spatio-temporal statistics IV. The
first half of the module was primarily concerned with spatial modelling and spatial statistics. This half will
concentrate on temporal modelling and the analysis of time series data. We will look briefly at modelling
fully spatio-temporal data at the end of the course.

• These notes: https://darrenjw.github.io/time-series/


• Also available as a PDF document (better for printing and annotating). There is also an experimental
EPUB version (for e-readers), but the formatting is suboptimal.

Part 1: Introduction to time series and linear filters

• Chapter 1 Introduction

Part 2: Linear stochastic systems

• Chapter 2 Linear systems


• Chapter 3 ARMA models
• Chapter 4 Forecasting and estimation
• Chapter 5 Spectral analysis

Part 3: State space modelling

• Chapter 6 HMMs
• Chapter 7 DLMs
• Chapter 8 State space modelling

Part 4: Spatio-temporal modelling

• Chapter 9 Spatio-temporal models and data

Reading list and other resources

The main recommended text for this part of the course is Shumway and Stoffer (2017). This is an excellent
reference, but has a different style to this course and covers more/different material, in a different order, and
typically in more depth. Wikle, Zammit-Mangion, and Cressie (2019) will also be useful for the latter part
of the course. After that, there are many excellent texts on time series analysis, including some classics.
Chatfield and Xing (2019) is a good introductory text. Priestley (1989) is the classic text for spectral analysis.
West and Harrison (2013) is the classic reference for DLMs, but Petris, Petrone, and Campagnoli (2009)
and Särkkä and Svensson (2023) are interesting additions/alternatives. Cressie and Wikle (2015) is a more
substantial reference for the latter part of the course.

Of course, in addition to traditional textbooks, there is a lot of good material available freely online (including
wikipedia). I’ll attempt to link to some of this from appropriate points in the notes.
We will use R for all of the computational examples. Some CRAN packages will be used. I will attempt
to keep a full list of required installs here (but we will also use some of the required dependencies of these
packages):

install.packages(c(
"astsa", "signal", "netcontrol", "dlm", "mvtnorm", "spTimer"
))

All of the examples, figures and simulations in the notes are fully reproducible in R. The code blocks for
each chapter are intended to be run sequentially from the start of the chapter. These blocks are easy to
copy-and-paste from the web version of the notes. Illustrative implementations of many important algorithms
from time series analysis are provided, including simulation and fitting of ARMA models, the Kalman filter,
and forward-backward algorithms for HMMs. Despite this, there is not a single explicit for, while or
repeat loop to be found anywhere in the code examples. So the code examples also serve to illustrate a
more functional approach to R programming than is typically adopted.
Note that the CRAN task view for time series analysis gives an overview of a large number of R packages
relevant to this course. Similarly, the task view for spatio-temporal data gives an overview of packages
relevant to the end of the course.

About these notes

This is a Quarto book. To learn more about Quarto books visit https://quarto.org/docs/books.
Copyright (C) 2024-2025 Darren J Wilkinson, all rights reserved.

1 Introduction to time series

1.1 Time series data

1.1.1 What is a time series?

A time series is simply a collection of observations indexed by time. Since we measure time in a strictly
ordered fashion, from the past, through the present, to the future, we assume that our time index is some
subset of the real numbers, and in many cases, this subset will be a subset of the integers. So, for example,
we could denote the observation at time t by xt . Then our time series could be written

x = (x1 , x2 , . . . , xn )

if we have n observations corresponding to time indices t = 1, 2, . . . , n.


When we use (consecutive) integers for the time index, we are implicitly assuming that the observations are
equispaced in time.
Crucially, in time series analysis, the order of the observations matters. If we permute our observations in
some way, the time series is different. Statistical models for time series are not invariant to permutations of the
data. This is obviously different to many statistical models you have studied previously, where observations
(within some subgroup) are iid (or exchangeable).
In the first half of this module you studied spatial statistical models and the analysis of spatial data. Spatial
data is indexed by a position in space, and the spatial index is often assumed to be a subset of R2 or R3 .
Spatial models also have the property that they are not invariant to a permutation of the observations. Indeed,
there is a strong analogy to be drawn between time series data and 1-d spatial data, where the spatial index is
a subset of R. This analogy can sometimes be helpful, and there are situations where it makes sense to treat
time series data as 1-d spatial data.
However, the very special nature of the 1-d case (in particular, the complete ordering property of the reals),
and the strong emphasis in classical time series analysis on the case of equispaced observations, means that
models for time series and methods for the analysis of time series data typically look very different to the
models used in spatial statistics.
We will therefore temporarily put aside our knowledge of spatial modelling and spatial statistics, and develop
models and methods for time series from the ground up. We will reconcile our understanding much later,
especially in the context of spatio-temporal modelling of observations indexed in both space and time.

1.1.2 An example

Consider the price of chicken in the US (measured in US cents per pound in weight of chicken). This is a
dataset in the R CRAN package astsa.

library(astsa)
plot(chicken)

[Figure: time series plot of the chicken price data.]

Note that observations close together in time are more highly correlated than observations at disparate times.
This uses the R base function for plotting time series (plot.ts, since the chicken data is a ts object).
We could also plot using tsplot from the astsa package.

tsplot(chicken)
[Figure: tsplot of the chicken price data.]

If we wanted, we could make it look a bit nicer.

tsplot(chicken, col=4, lwd=2,


main="Price of chicken", ylab="US cents per pound")

[Figure: "Price of chicken", in US cents per pound.]

Remember that you can get help on the astsa package with help(package="astsa"), and
help on particular functions with (eg.) ?tsplot. You can get a list of datasets in the package with
data(package="astsa"). Note that there is also an online guide to this package: fun with astsa.

1.1.3 Detrending

It is clear that observations close together in time are more highly correlated than observations far apart, but is
that simply because of the slowly increasing trend? We can see by detrending the data in some way. There are
many ways one can detrend a time series, but here, since there looks to be a roughly linear increase in price
with time, we could detrend by fitting a linear regression model using time as a covariate, and subtracting off
the linear fit, in order to leave us with some detrended residuals.
We could do this with base R functions.

tc = time(chicken)
res = residuals(lm(chicken ~ tc))
dc = ts(res, start=start(chicken), freq=frequency(chicken))
plot(dc, lwd=2, col=4, main="Detrended price of chicken")

[Figure: "Detrended price of chicken", the detrended series dc.]

But the astsa package has a built-in detrend function which makes this much easier.

tsplot(detrend(chicken), lwd=2, col=4)


[Figure: tsplot of detrend(chicken).]

Note that despite having removed the obvious trend from the data it is still the case that observations close
together in time are more highly correlated than observations far apart. We can visualise this nearby correlation
more clearly by producing a scatter plot showing each observation against the previous observation.

lag1.plot(dc, pch=19, col=4)

[Figure: scatter plot of dc(t) against dc(t−1), with the lag 1 correlation of 0.97 displayed.]

For iid observations, there would be no correlation between adjacent values. Here we see that the correlation
at lag 1 (observations one time index apart) is around 0.97. This correlation decreases as the lag increases.

acf1(dc)

[Figure: ACF of the series dc, with the lag axis in years (LAG ÷ 12).]

[1] 0.97 0.91 0.83 0.75 0.68 0.61 0.56 0.51 0.48 0.46 0.43 0.39
[13] 0.33 0.26 0.20 0.14 0.08 0.03 0.00 -0.03 -0.04 -0.05 -0.07 -0.10
[25] -0.13 -0.18 -0.21 -0.24 -0.25 -0.25 -0.23 -0.20 -0.16 -0.13 -0.11 -0.10
[37] -0.11 -0.13 -0.14 -0.16 -0.17 -0.16 -0.15 -0.13 -0.10 -0.08 -0.05 -0.04

This plot of the auto-correlation against lag is known as the auto-correlation function (ACF), or sometimes
correlogram, and is an important way to understand the dependency structure of a (stationary) time series.
From this plot we see that observations have to be more than one year apart for the auto-correlations to
become statistically insignificant.

1.2 Filtering time series

1.2.1 Smoothing with convolutional linear filters

Sometimes short-term fluctuations in a time series can obscure longer-term trends. We know that the sample
mean has much lower variance than an individual observation, so some kind of (weighted) average might be
a good way of getting rid of noise in the data. However, for time series models, it only really makes sense to
perform an average locally. This is the idea behind linear filters.
By way of example, consider the chicken price data, and consider forming a new time series where each new
observation is the sample mean of the 25 observations closest to that time point.

tsplot(chicken, col=4,
main="Chicken price smoothed with a moving average filter")
lines(filter(chicken, rep(1/25, 25)), col=2, lwd=2)

[Figure: "Chicken price smoothed with a moving average filter", the chicken series with the 25-point moving average overlaid.]

Note that the new time series is shorter than the original time series, and some convention is required for
aligning the two time series. Here, R’s filter function has done something sensible. We see that the filter
has done exactly what we wanted - it has smoothed out short-term fluctuations, better revealing the long-term
trend in the data.
A convolutional linear filter defines a new time series y = {yt } from an input time series x = {xt } via
$$y_t = \sum_{i=q_1}^{q_2} w_i x_{t-i},$$

for q1 ≤ 0 ≤ q2 and pre-specified weights, wq1 , . . . , wq2 (so q2 − q1 + 1 weights in total). If the weights are
non-negative and sum to one, then the filter represents a moving average, but this is not required. Note that if
the input series is defined for time indices 1, 2, . . . , n, then the output series will be defined for time indices
q2 + 1, q2 + 2, . . . , n + q1 , and hence will be of length n + q1 − q2 .
In the context of digital signal processing (DSP), this would be called a finite impulse response (FIR) filter,
and the convolution operation is sometimes summarised with the shorthand notation

y = w ∗ x.

Very commonly, a symmetric filter will be used, where q1 = −q and q2 = q, for some q > 0, and wi = w−i .
Although the weight vector is of length 2q + 1, there are only q + 1 free parameters to be specified. In this
case, if the input time indices are 1, 2, . . . , n, then the output indices will be q + 1, q + 2, . . . , n − q and
hence the output will be of length n − 2q.
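For example (a minimal check, with q = 2 chosen arbitrarily), R's stats::filter implements exactly this kind of symmetric convolutional filter; rather than shortening the output, it returns a series of the same length with q missing values at each end, consistent with the effective length n − 2q described above.

q = 2
w = rep(1/(2*q + 1), 2*q + 1)    # symmetric weights w_{-q}, ..., w_q
sm = stats::filter(chicken, w)   # centred moving average of the chicken series
length(sm) == length(chicken)    # TRUE: R pads with NA rather than truncating
sum(is.na(sm))                   # 2q missing values, q at each end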

1.2.2 Seasonality

Let’s look now at the quarterly share earnings for the company Johnson and Johnson.

tsplot(jj, col=4, lwd=2, main="Quarterly J&J earnings")

[Figure: "Quarterly J&J earnings", the jj series.]

We can see an exponential increase in the overall trend. We can also see that the deviations from the overall
trend seem to exhibit a regular pattern with a period of 4. There seem to be particular quarters where earnings
are lower than the overall trend, and others where they seem to be higher. This kind of seasonality in time
series data is very common. Seasonality can happen on many time scales, but yearly, weekly and daily is
most common. On closer inspection, we see that the seasonal effects seem to be increasing exponentially, in
line with the overall trend. We could attempt to develop a multiplicative model for this data, but it might be
simpler to just take logs.

tsplot(log(jj), col=4, lwd=2, main="Log of J&J earnings")

[Figure: "Log of J&J earnings", log(jj).]

To progress further, we probably want to detrend in some way. We could take out a linear fit, but we could
also estimate the trend by smoothing out the seasonal effects with a moving average that averages over 4
consecutive quarters to give a yearly average. That is, we could apply a convolutional linear filter with weight
vector
w = (0.25, 0.25, 0.25, 0.25).
The problem with this is that it is of even length, so it represents an average that lies between two quarters,
and so doesn’t align nicely with our existing series. We can fix this by using a symmetric moving average
centred on a given quarter by instead using a weight vector of length 5.

w = (0.125, 0.25, 0.25, 0.25, 0.125).

Now each quarter still receives a total weight of 0.25, but the weight for the quarter furthest away is split
across two different years.

sjj = filter(log(jj), c(0.125,0.25,0.25,0.25,0.125))


tsplot(sjj, col=4, lwd=2, main="Smoothed log of J&J earnings")

[Figure: "Smoothed log of J&J earnings", the series sjj.]

We can see that this has nicely smoothed out the seasonal fluctuations giving just the long term trend.
So, in general, when trying to deseasonalise a time series with seasonal fluctuations with period m, if m is
odd (which is quite rare), then just use a filter with a weight vector of length m, with all weights set to 1/m.
In the more typical case where m is even, use a weight vector of length m + 1 where the end weights are set
to 1/(2m) and all other weights are set to 1/m.
This deseasonalisation process is an example of what DSP people would call a low-pass filter. It has filtered
out the high-frequency oscillations, allowing the long term (low frequency) behaviour to pass through.
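This rule is easy to encode as a small helper function (seasonalWeights is just an illustrative name, not a function from any package); for m = 4 it reproduces the length-5 weight vector used above for the quarterly J&J data.

seasonalWeights = function(m)
  if (m %% 2 == 1) rep(1/m, m) else c(1/(2*m), rep(1/m, m-1), 1/(2*m))

seasonalWeights(4)   # 0.125 0.250 0.250 0.250 0.125, as used above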
We can now detrend the log data by subtracting off this trend to leave just the seasonal effects.

dljj = log(jj) - sjj


tsplot(dljj, col=4, lwd=2, main="Detrended log of J&J earnings")

[Figure: "Detrended log of J&J earnings", the series dljj.]
Note that we could have directly computed the detrended data by applying a convolutional linear filter with
weight vector
w = (−0.125, −0.25, 0.75, −0.25, −0.125)

dljj = filter(log(jj), c(-0.125,-0.25,0.75,-0.25,-0.125))


tsplot(dljj, col=4, lwd=2, main="Detrended log of J&J earnings")

[Figure: "Detrended log of J&J earnings", identical to the previous plot, now computed in a single filtering step.]

Note that this filter has weights summing to zero. This detrending process is an example of what DSP people
would call a high-pass filter. It has allowed the high frequency oscillations to pass through, while removing
the long term (low frequency) behaviour.
We could use this detrended data to get a simple estimate of the seasonal effects associated with each of the
four quarters.

seff = rowMeans(matrix(dljj, nrow=4), na.rm=TRUE)


seff

[1] -0.0001916332 0.0403896607 0.1121470367 -0.1492881658

We can now strip out these seasonal effects.

dsdtljj = dljj - seff


tsplot(dsdtljj, col=4, lwd=2,
main="Detrended and deseasonalised log of J&J earnings")

[Figure: "Detrended and deseasonalised log of J&J earnings", the series dsdtljj.]

Not perfect (as the seasonal effects aren’t perfectly additive), and definitely lots of auto-correlation left, but
it’s getting closer to some kind of correlated noise process.

1.2.3 Exponential smoothing and auto-regressive linear filters

We previously smoothed the chicken price data by applying a moving average filter to it. That is by no means
the only way we could smooth the data. One drawback of the moving average filter is that you lose values at
each end of the series. Losing values at the right-hand end is particularly problematic if you are using historic
data up to the present in order to better understand the present (and possibly forecast into the future). In
the on-line context it makes sense to keep a current estimate of the underlying level (smoothed value), and
update this estimate with each new observation. Exponential smoothing is probably the simplest variant of
this approach. We can define a smoothed time series s = {st } using the input time series x = {xt } via the
update equation
st = αxt + (1 − α)st−1 ,
for some fixed α ∈ [0, 1]. It is clear from this definition that the new smoothed value is a weighted average of
the current observation and the previous smoothed value. High values of α put more weight on the current
observation (and hence, less smoothing), whereas low values of α put less weight on the current observation,
leading to more smoothing. It is called exponential smoothing because recursive application of the update
formula makes it clear that the weights associated with observations from the past decay away exponentially
with time. It is worth noting that the update relation can be re-written
st = st−1 + α(xt − st−1 ),
which can sometimes be convenient.
Note that some strategy is required for initialising the smoothed state. eg. if the first observation is x1 , then
to compute s1 , some initial s0 is required. Sometimes s0 = 0 is used, but more often s0 = x1 is used, which
obviously involves looking ahead at the data, but ensures that s1 = x1 , which is often sensible.
The smoothing value, α can be specified subjectively, noting that this is no more subjective than choosing
the size of the smoothing window for a moving average filter. Alternatively, some kind of cross-validation
strategy can be used, based on, say, one-step ahead prediction errors.
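Before turning to a filtering function, note that the update recursion is simple enough to implement directly. Here is a minimal sketch using Reduce, in keeping with the loop-free style of these notes; expSmooth and the default initialisation s0 = x1 are our illustrative choices, and the result should agree with the signal::filter output computed below.

expSmooth = function(x, alpha, s0 = x[1])
  Reduce(function(s, xt) alpha*xt + (1 - alpha)*s,
         x, init = s0, accumulate = TRUE)[-1]   # drop s0, keep s_1, ..., s_n

sm0 = ts(expSmooth(as.numeric(chicken), 0.1),
         start = start(chicken), freq = frequency(chicken))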
The stats::filter function is not flexible enough to allow us to implement this kind of linear filter, so we instead use the signal::filter function.

alpha = 0.1
sm = signal::filter(alpha, c(1, alpha-1), chicken, init.y=chicken[1])
tsp(sm) = tsp(chicken)
tsplot(chicken, col=4)
lines(sm, col=2, lwd=2)

[Figure: the chicken series with the exponentially smoothed series overlaid.]

We will figure out the details of the signal::filter function shortly. We see that the filtered output is
much smoother than the input, but also that it somewhat lags the input series. This can be tuned by adjusting
the smoothing parameter. Less smoothing will lead to less lag.
Exponential smoothing is a special case of an auto-regressive moving average (ARMA) linear filter, which
DSP people would call an infinite impulse response (IIR) filter. The ARMA filter relates an input time series
x = {xt } to an output time series y = {yt } via the relation
$$\sum_{i=0}^{p} a_i y_{t-i} = \sum_{j=0}^{q} b_j x_{t-j},$$

for some vectors a and b, of length p + 1 and q + 1, respectively. This can obviously be rearranged as

$$y_t = \frac{1}{a_0}\left( \sum_{j=0}^{q} b_j x_{t-j} - \sum_{i=1}^{p} a_i y_{t-i} \right).$$

These are more powerful than the convolution filters considered earlier, since they allow the propagation of
persistent state information.
It is clear that if we choose a = (1, α − 1), b = (α) we get the exponential smoothing model previously
described. The signal::filter function takes b and a as its first two arguments, which should explain
the above code block.

1.3 Multivariate time series

The climhyd dataset records monthly observations on several variables relating to Lake Shasta in California
(see ?climhyd for further details). The monthly observations are stored as rows of a data frame. We can
get some basic information about the dataset as follows.

str(climhyd)

'data.frame': 454 obs. of 6 variables:


$ Temp : num 5.94 8.61 12.28 11.61 20.28 ...
$ DewPt : num 1.436 -0.285 0.857 2.696 5.7 ...
$ CldCvr: num 0.58 0.47 0.49 0.65 0.33 0.28 0.11 0.08 0.15 0.27 ...
$ WndSpd: num 1.22 1.15 1.34 1.15 1.26 ...
$ Precip: num 160.53 65.79 24.13 178.82 2.29 ...
$ Inflow: num 156 168 173 273 233 ...

head(climhyd)

Temp DewPt CldCvr WndSpd Precip Inflow


1 5.94 1.436366 0.58 1.219485 160.528 156.1173
2 8.61 -0.284660 0.47 1.148620 65.786 167.7455
3 12.28 0.856728 0.49 1.338430 24.130 173.1567
4 11.61 2.696482 0.65 1.147778 178.816 273.1516
5 20.28 5.699536 0.33 1.256730 2.286 233.4852
6 23.83 8.275339 0.28 1.104325 0.508 128.4859

We can then plot the observations.

tsplot(climhyd, ncol=2, col=2:7)


[Figure: six-panel time series plot of the climhyd variables: Temp, DewPt, CldCvr, WndSpd, Precip, Inflow.]

The data can be regarded as a collection of six univariate time series. We could just treat them completely
separately, applying univariate time series analysis techniques to each variable in turn. However, this wouldn’t
necessarily be a good idea unless the series are all independent of one another. But typically, the reason
we study a collection of time series together is because we suspect that they are not all independent of one
another. Here, we could look at a simple pairs plot of the variables.

pairs(climhyd, col=4, pch=19, cex=0.5)

[Figure: pairs plot of the six climhyd variables.]

We see that there do appear to be cross-correlations between the series, strongly suggesting that the component
series are not independent of one another. It is then an interesting question as to the best way to analyse a
collection of dependent time series, with correlations across variables as well as over time. We will come
back to this question later.

1.4 Time series analysis

We’ve looked at some basic exploratory tools for visualising and decomposing time series into trend, seasonal
and residual components. But we know from previous courses that to do any meaningful statistical inference
for the underlying features, or probabilistic prediction (or forecasting), then it is very helpful to have a
statistical model of the data generating process. We typically approach this by building a model with some
systematic components (such as a trend and seasonal effects), and then using an appropriate probability model
for the residuals. But in most simple statistical models, this residual process is assumed to be iid. However,
we have seen that for time series, it is often implausible that the residuals will be iid. We have seen that it is
likely that we should be able to account for systematic effects, leaving residuals that are mean zero. In many
cases, it may even be plausible that the residuals have constant variance. But we have seen that typically
the residuals will be auto-correlated, with correlations tailing off as the lag increases. It would therefore be
useful to have a family of discrete time stochastic processes that can model this kind of behaviour. It will turn
out that ARMA models are a good solution to this problem, and we will study these properly in Chapter 3.
But first we will set the scene by examining a variety of (first and second order) linear Markovian dynamical
systems. This will help to provide some context for ARMA models, and hopefully make them seem a bit less
mysterious.

2 Linear systems

2.1 Introduction

It is likely that you will be familiar with many parts of this chapter, having seen different bits in different
courses, in isolation. It might seem a bit strange to look at deterministic linear dynamical systems in a
stats course, but it is arguably helpful to see them alongside their stochastic counterparts, in one place, in a
consistent notation, to be able to compare and contrast. It could also be argued that this material would belong
better in a course on stochastic processes. But time series models are stochastic processes, so we really do
need to understand something about stochastic processes to be able to model time series appropriately. This
chapter will hopefully make clear how everything links together in a consistent way, and how understanding
deterministic dynamical systems helps to understand stochastic counterparts. In particular, it should help to
motivate the study of ARMA models in Chapter 3.

2.2 Vector random quantities

Although this course will be mainly concerned with univariate time series, there are many instances in the
analysis of time series where vector random quantities and the multivariate normal distribution crop up. It
will be useful to recap some essential formulae that you have probably seen before.

2.2.1 Moments

For a random vector X with ith element Xi , the expectation is the vector E[X], of the same length, with ith
element E[Xi ].
The covariance matrix between vectors X and Y is the matrix with (i, j)th element Cov[Xi , Yj ], and hence given by

$$\operatorname{Cov}[X, Y] = \operatorname{E}\left\{(X - \operatorname{E}[X])(Y - \operatorname{E}[Y])^\top\right\} = \operatorname{E}[XY^\top] - \operatorname{E}[X]\operatorname{E}[Y]^\top.$$
It is easy to show that
Cov[AX + b, CY + d] = ACov[X, Y]C⊤ ,
Cov[X + Y, Z] = Cov[X, Z] + Cov[Y, Z], and Cov[X, Y]⊤ = Cov[Y, X].
We define the variance matrix
Var[X] = Cov[X, X],
from which it follows that
Var[AX + b] = AVar[X]A⊤ .

2.2.2 The multivariate normal distribution

The p-dimensional multivariate normal distribution can be defined as an affine transformation of a p-vector of iid standard normal random scalars, and is characterised by its (p-dimensional) expectation vector, µ, and (p × p) variance matrix, Σ, and written N (µ, Σ). From this definition it is clear that affine transformations of multivariate normal random quantities will be multivariate normal. It is also relatively easy to deduce that the probability density function (PDF) must take the form

$$N(x;\, \mu, \Sigma) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu) \right\}.$$
From this, relatively routine computations show that if X and Y are jointly multivariate normal, then both X
and Y are marginally multivariate normal, and further, that the conditional random quantity, (X|Y = y) is
multivariate normal, and characterised by its conditional mean and variance,

E[X|Y = y] = E[X] + Cov[X, Y]Var[Y]−1 (y − E[Y]),

Var[X|Y = y] = Var[X] − Cov[X, Y]Var[Y]−1 Cov[Y, X].


It is noteworthy that the conditional variance of X does not depend on the observed value of Y (this is a
special property of the multivariate normal distribution).
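These conditioning formulae translate directly into code. Below is a minimal sketch (mvnCondition is an illustrative name, not a library function), where ix and iy index the components of X and Y within a joint mean vector and variance matrix, with arbitrary example values at the end.

mvnCondition = function(mu, Sigma, ix, iy, y) {
  Sxy = Sigma[ix, iy, drop=FALSE]               # Cov[X, Y]
  Syy = Sigma[iy, iy, drop=FALSE]               # Var[Y]
  list(mean = drop(mu[ix] + Sxy %*% solve(Syy, y - mu[iy])),
       var  = Sigma[ix, ix, drop=FALSE] - Sxy %*% solve(Syy, t(Sxy)))
}

mu = c(0, 0, 0)
Sigma = matrix(c(2, 1, 0.5,  1, 1.5, 0.3,  0.5, 0.3, 1), 3, 3)
mvnCondition(mu, Sigma, ix=1, iy=2:3, y=c(1, -1))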

2.3 First order systems

2.3.1 Deterministic

2.3.1.1 Discrete time

We will start with the simplest linear Markovian model: the linear recurrence relation. We start with the
special case
yt = ϕyt−1 , t ∈ {1, 2, . . .},
for some fixed ϕ ∈ R, initialised with some y0 ∈ R. Using substitution we get y1 = ϕy0 , y2 = ϕy1 = ϕ2 y0 ,
etc., leading to
yt = ϕt y0 , t ∈ {1, 2, . . .}.

library(astsa)
tsplot(0:50, sapply(0:50, function(t) 10*(0.9)^t),
type="p", pch=19, cex=0.5, col=4, ylab="y_t",
main="y_0 = 10, phi = 0.9")

[Figure: "y_0 = 10, phi = 0.9", geometric decay of y_t towards 0.]

This will be stable for |ϕ| < 1, leading to y∞ = 0, and will diverge geometrically for |ϕ| > 1, y0 ≠ 0 (other
cases require special consideration). Next consider the shifted system xt = yt + µ, so that

xt = µ + ϕ(xt−1 − µ).

By construction, this system will be stable for |ϕ| < 1 with x∞ = µ. Note that we can re-write this as

xt = ϕxt−1 + (1 − ϕ)µ,

so for a system of the form


xt = ϕxt−1 + k,
we know that it will be stable for |ϕ| < 1 with x∞ = µ = k/(1 − ϕ). Further, since the general solution in
terms of µ is
xt = µ + ϕt (x0 − µ) = ϕt x0 + (1 − ϕt )µ,
the general solution in terms of k is
$$x_t = \phi^t x_0 + \frac{1-\phi^t}{1-\phi}\,k.$$

2.3.1.2 Vector discrete time

We can generalise the above system to vectors in the obvious way. Start with

yt = Φyt−1 , t ∈ {1, 2, . . .},

for some fixed matrix Φ ∈ Mn×n , initialised with vector y0 ∈ Rn . Exactly as above we get

yt = Φt y0 ,

and this will be stable provided that all of the eigenvalues (λi ) of Φ lie inside the unit circle in the complex
plane (|λi | < 1), since then Φt → 0, giving y∞ = 0.
Defining xt = yt + µ gives

xt = µ + Φ(xt−1 − µ) = Φxt−1 + (I − Φ)µ,

and so a system of the form
xt = Φxt−1 + k,
in the stable case will have x∞ = µ = (I − Φ)−1 k. Note that in the stable case, I − Φ will be invertible.
Further, since the general solution in terms of µ is

xt = µ + Φt (x0 − µ) = Φt x0 + (I − Φt )µ,

the general solution in terms of k is

xt = Φt x0 + (I − Φt )(I − Φ)−1 k.
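A quick numerical sanity check of this closed form, for an arbitrary stable Φ and k chosen purely for illustration:

Phi = matrix(c(0.5, 0.2, -0.1, 0.3), 2, 2)  # eigenvalues well inside the unit circle
k = c(1, 2); x0 = c(10, -5); I = diag(2)
xIter = Reduce(function(x, t) Phi %*% x + k, 1:20, init = x0)  # x_20 by iteration
Phi20 = Reduce(`%*%`, rep(list(Phi), 20))                      # Phi^20
xClosed = Phi20 %*% x0 + (I - Phi20) %*% solve(I - Phi, k)     # closed-form x_20
cbind(iter = drop(xIter), closed = drop(xClosed), limit = solve(I - Phi, k))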

2.3.1.3 Continuous time

Let’s revert to the scalar case, but switch from discrete time to cts time. The analogous system is the scalar
linear ODE
$$\frac{dy}{dt} = -\lambda y,$$
for some fixed λ ∈ R, initialised with y0 ∈ R. By separating variables (or guessing) we deduce the solution

y(t) = y0 e−λt .

tsplot(0:50, sapply(0:50, function(t) 10*exp(-0.1*t)),
       cex=0.5, col=4, ylab="y(t)", main="y_0 = 10, lambda = 0.1")

[Figure: "y_0 = 10, lambda = 0.1", exponential decay of y(t) towards 0.]

This will be stable for λ > 0, giving y(∞) = 0, otherwise likely to diverge (exponentially, but various cases
to consider). Note that y(t) = ϕt y0 for ϕ = e−λ , so the continuous time model evaluated at integer times is
just a discrete time model.
Defining x(t) = y(t) + µ gives
$$\frac{dx}{dt} = -\lambda(x - \mu),$$

and so for λ > 0 we will get x(∞) = µ. So for a linear ODE of the form

$$\frac{dx}{dt} + \lambda x = k,$$

we have x(∞) = µ = k/λ in the stable case (λ > 0). Further, since the general solution in terms of µ is
x(t) = µ + (x0 − µ)e−λt = x0 e−λt + µ(1 − e−λt ),
the general solution in terms of k is
$$x(t) = x_0 e^{-\lambda t} + \frac{k}{\lambda}\left(1 - e^{-\lambda t}\right).$$

2.3.1.4 Vector continuous time

The vector equivalent to the previous model is


$$\frac{dy}{dt} = -\Lambda y,$$
for Λ ∈ Mn×n , initialised with y0 ∈ Rn . The solution is
y(t) = exp{−Λt}y0 ,
where exp{·} denotes the matrix exponential function. For stability, we need the eigenvalues of exp{−Λ} to
be inside the unit circle. For that we need the real parts of the eigenvalues of −Λ to be negative. In other
words we need the real parts of the eigenvalues of Λ to be positive. Otherwise the process is likely to diverge,
but again, there are various cases to consider. In the stable case we have y(∞) = 0.
Note that for integer times, t, we have y(t) = Φt y0 , where Φ = exp{−Λ}, so again the solution of the
continuous time model at integer times is just a discrete time recurrence relation.
Defining x(t) = y(t) + µ gives
$$\frac{dx}{dt} = -\Lambda(x - \mu),$$

with stable limit x(∞) = µ. So, given a linear ODE of the form

$$\frac{dx}{dt} + \Lambda x = k,$$
we deduce the stable limit x(∞) = µ = Λ−1 k. Note that in the stable case, Λ will be invertible. Further,
since the general solution in terms of µ is
x(t) = µ + exp{−Λt}(x0 − µ) = exp{−Λt}x0 + (I − exp{−Λt})µ,
the general solution in terms of k is
x(t) = exp{−Λt}x0 + (I − exp{−Λt})Λ−1 k.
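The matrix exponential needed here is available in R, for example via Matrix::expm (the Matrix package is distributed with R). A small sketch, with an illustrative Λ whose eigenvalues have positive real part:

library(Matrix)
Lambda = matrix(c(0.5, -0.2, 0.1, 0.4), 2, 2)  # eigenvalues have positive real part
y0 = c(10, -5)
yt = function(t) drop(as.matrix(expm(Matrix(-Lambda * t))) %*% y0)
t(sapply(c(1, 2, 5, 20), yt))   # the solution decays towards 0 in the stable case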

2.3.2 Stochastic

We now turn our attention to the natural linear Gaussian stochastic counterparts of the deterministic systems
we have briefly reviewed.

2.3.2.1 Discrete time: AR(1)

We start by looking at a first order Markovian scalar linear Gaussian auto-regressive model, commonly
denoted AR(1). The basic form of the model can be written
Yt = ϕYt−1 + εt , εt ∼ N (0, σ 2 ), (2.1)
for some fixed ϕ ∈ R, noise variance σ > 0, and for now we will assume that it is initialised at some fixed
Y0 = y0 ∈ R. It is just a linear recurrence relation with some iid Gaussian noise added at each time step. It’s
also an appropriate auto-regressive linear filter applied to iid noise.

tsplot(filter(rnorm(80, 0, 0.5), 0.9, "rec", init=10),
type="p", col=4, pch=19, cex=0.5, ylab="Y_t",
main="y_0=10, alpha=0.9, sigma=0.5")
abline(h=0, col=3, lwd=2)

[Figure: simulated AR(1) path, "y_0=10, alpha=0.9, sigma=0.5".]

Taking the expectation of Equation 2.1 gives

E[Yt ] = ϕE[Yt−1 ]

In other words, the expectation satisfies a linear recurrence relation with solution

E[Yt ] = ϕt y0 .

In the stable case (|ϕ| < 1) we will have limiting expectation 0. But this is a random process, so just knowing
about its expectation isn’t enough. Next take the variance of Equation 2.1

Var[Yt ] = ϕ2 Var[Yt−1 ] + σ 2 ,

and note that the variance also satisfies a linear recurrence relation with solution
$$\operatorname{Var}[Y_t] = \frac{1-\phi^{2t}}{1-\phi^2}\,\sigma^2,$$
since we are assuming here that Var[Y0 ] = 0. So, since each update is a linear transformation of a Gaussian
random variable, we know that the marginal distribution at each time must be Gaussian, and hence
$$Y_t \sim N\left( \phi^t y_0,\ \frac{1-\phi^{2t}}{1-\phi^2}\,\sigma^2 \right).$$

The limiting distribution (in the stable case, |ϕ| < 1) is therefore

$$Y_\infty \sim N\left( 0,\ \frac{\sigma^2}{1-\phi^2} \right).$$

So, in the case of a stochastic model, stability implies that the process converges to some limiting probability
distribution, and not some fixed point. But the moments of that distribution (such as the mean and variance),
are converging to a fixed point.

If we now consider the shifted system Xt = Yt + µ, for some fixed µ ∈ R, we have

Xt = µ + ϕ(Xt−1 − µ) + εt = ϕXt−1 + (1 − ϕ)µ + εt .

Now, Var[Xt ] = Var[Yt ], but

E[Xt ] = E[Yt ] + µ = ϕt y0 + µ = ϕt (x0 − µ) + µ = ϕt x0 + (1 − ϕt )µ,

so

$$X_t \sim N\left( \phi^t x_0 + (1-\phi^t)\mu,\ \frac{1-\phi^{2t}}{1-\phi^2}\,\sigma^2 \right).$$

Consequently, if presented with an auto-regression in the form

Xt = ϕXt−1 + k + εt ,

for some fixed k ∈ R, just put k = (1 − ϕ)µ to get

$$X_t \sim N\left( \phi^t x_0 + \frac{1-\phi^t}{1-\phi}\,k,\ \frac{1-\phi^{2t}}{1-\phi^2}\,\sigma^2 \right),$$

which obviously has limiting distribution

$$X_\infty \sim N\left( \frac{k}{1-\phi},\ \frac{\sigma^2}{1-\phi^2} \right).$$

2.3.2.1.1 Stationarity and further properties


Let’s go back to the centred process Yt (to keep the algebra a bit simpler) and think more carefully about
what we have been doing. Everything was conditioned on the initial value of the process, Y0 = y0 , so when
we deduced the “marginal” distribution for Yt , this should more correctly have been considered to be the
conditional distribution of Yt given Y0 = y0 ,
$$(Y_t \mid Y_0 = y_0) \sim N\left( \phi^t y_0,\ \frac{1-\phi^{2t}}{1-\phi^2}\,\sigma^2 \right).$$

This conditional distribution can be thought of as a transition kernel of the stochastic process. Being clear
about this then allows us to consider other initialisation strategies, including random initialisation. In
particular, suppose now that Y0 is drawn from a normal distribution (independent of the process), Y0 ∼
N (µ0 , σ02 ). Since Y0 is normal and Yt |Y0 is normal with linear dependence on Y0 , then Y0 and Yt are jointly
normal, and so is the (true) marginal for Yt . The mean of this marginal can be computed with the law of total
expectation,
E[Yt ] = E(E[Yt |Y0 ]) = E(ϕt Y0 ) = ϕt E(Y0 ) = ϕt µ0 .
Similarly, the variance can be computed with the law of total variance,

$$\begin{aligned}
\operatorname{Var}[Y_t] &= \operatorname{E}(\operatorname{Var}[Y_t \mid Y_0]) + \operatorname{Var}(\operatorname{E}[Y_t \mid Y_0]) \\
 &= \operatorname{E}\left( \frac{1-\phi^{2t}}{1-\phi^2}\,\sigma^2 \right) + \operatorname{Var}(\phi^t Y_0) \\
 &= \frac{1-\phi^{2t}}{1-\phi^2}\,\sigma^2 + \phi^{2t}\sigma_0^2.
\end{aligned}$$

Consequently, the marginal distribution is

$$Y_t \sim N\left( \phi^t \mu_0,\ \frac{1-\phi^{2t}}{1-\phi^2}\,\sigma^2 + \phi^{2t}\sigma_0^2 \right).$$

What happens if we choose the limiting distribution as the initial distribution? ie. if we choose µ0 = 0 and
σ02 = σ 2 /(1 − ϕ2 )? Subbing these in to the above and simplifying gives
$$Y_t \sim N\left( 0,\ \frac{\sigma^2}{1-\phi^2} \right).$$

So, the limiting distribution is stationary in the sense that if the process is initialised with this distribution, it
will keep exactly this marginal distribution for all future times. In this case, since the dynamics of the process
do not depend explicitly on time and the marginal distribution is the same at each time, it is clear that the full
joint distribution of the process is invariant under a time shift. There are various short-cuts one can deploy to
deduce the properties of such stationary stochastic processes.
The covariance Cov(Ys , Ys+t ), t ≥ 0 will only depend on the value of t, and will be independent of the
value of s. We can call this the auto-covariance function, γt = Cov(Ys , Ys+t ). This can be exploited in order
to deduce properties of the process. eg. if we start with the dynamics

Ys = ϕYs−1 + εs ,

multiply by Ys−t ,
Ys−t Ys = ϕYs−t Ys−1 + Ys−t εs ,
and take expectations
E[Ys−t Ys ] = ϕE[Ys−t Ys−1 ] + E[Ys−t εs ].
For t > 0 the final expectation is 0, leading to the recurrence

γt = ϕγt−1 ,

with solution
$$\gamma_t = \gamma_0 \phi^t = \frac{\phi^t \sigma^2}{1-\phi^2}.$$
The corresponding auto-correlation function, ρt = Corr(Ys , Ys+t ) = γt /γ0 is

ρt = ϕt .

So, the AR(1) process, in the stable case, admits a stationary distribution, but is not iid, having correlations
that die away geometrically as values become further apart in time.
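We can verify this empirically by comparing the sample ACF of a long simulated AR(1) path (simulated with the recursive filter, as earlier, so only approximately stationary after burn-in) against ϕ raised to the power of the lag:

set.seed(1)
phi = 0.9; sigma = 0.5
y = filter(rnorm(10000, 0, sigma), phi, "rec")   # (approximately) stationary AR(1)
rbind(sample = acf(y, lag.max = 5, plot = FALSE)$acf[2:6],
      theory = phi^(1:5))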

2.3.2.2 Vector discrete time: VAR(1)

The obvious multivariate generalisation of the AR(1) process is the first-order vector auto-regressive process,
denoted VAR(1):
Yt = ΦYt−1 + ε t , ε t ∼ N (0, Σ), (2.2)
for some fixed Φ ∈ Mn×n , Σ an n × n symmetric positive semi-definite (PSD) variance matrix, and again
assuming a given fixed initial condition Y0 = y0 ∈ Rn .
Taking the expectation of Equation 2.2 we get

E[Yt ] = ΦE[Yt−1 ],

which is a linear recurrence with solution

E[Yt ] = Φt y0 .

In the stable case (eigenvalues of Φ inside the unit circle), this will have a limit of 0. If we now take the
variance of Equation 2.2 we get
Var[Yt ] = ΦVar[Yt−1 ]Φ⊤ + Σ.

This is actually a kind of linear recurrence for matrices, but it’s more complicated than the examples we
considered earlier. However, if we let V be the solution of the discrete Lyapunov equation

V = ΦVΦ⊤ + Σ,

we can check by direct substitution that



$$\operatorname{Var}[Y_t] = V - \Phi^t V (\Phi^t)^\top$$

solves the recurrence relation. There are standard direct methods for solving Lyapunov equations (eg.
netcontrol::dlyap). It is then clear that in the stable case the limiting variance is V. Since updates are
linear transformations of multivariate normals, every marginal is normal, and so we have

 
$$Y_t \sim N\left( \Phi^t y_0,\ V - \Phi^t V (\Phi^t)^\top \right),$$

with limit
Y∞ ∼ N (0, V).
If we now consider the shifted system Xt = Yt + µ, we have E[Xt ] = Φt x0 + (I − Φt )µ and Var[Xt ] = Var[Yt ], following very similar arguments to those we have already seen, and so

$$X_t \sim N\left( \Phi^t x_0 + (I - \Phi^t)\mu,\ V - \Phi^t V (\Phi^t)^\top \right).$$

Consequently, if presented with an auto-regressive model in the form

Xt = ΦXt−1 + k + ε t ,

just put k = (I − Φ)µ to get

$$X_t \sim N\left( \Phi^t x_0 + (I - \Phi^t)(I - \Phi)^{-1} k,\ V - \Phi^t V (\Phi^t)^\top \right),$$

with limit

$$X_\infty \sim N\left( (I - \Phi)^{-1} k,\ V \right).$$
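As an alternative to a dedicated solver such as netcontrol::dlyap, the discrete Lyapunov equation V = ΦVΦ⊤ + Σ can be solved by vectorising, since vec(ΦVΦ⊤) = (Φ⊗Φ)vec(V), so (I − Φ⊗Φ)vec(V) = vec(Σ). A sketch with illustrative parameter values:

Phi = matrix(c(0.8, 0.1, -0.2, 0.5), 2, 2)   # stable: eigenvalues 0.7 and 0.6
Sigma = diag(c(1, 0.5))
V = matrix(solve(diag(4) - Phi %x% Phi, as.vector(Sigma)), 2, 2)
V
Phi %*% V %*% t(Phi) + Sigma - V             # residual: numerically zero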

2.3.2.2.1 Stationarity and further properties


As for the scalar case, what we have deduced above is actually the transition kernel of the process conditional
on an initial condition Y0 = y0 . We can check that in the stable case if we initialise the process with the
limiting distribution it is stationary. So again, in the stable case we can consider the stationary process implied
by these dynamics, and use stationarity in order to deduce properties of the process.
Here we will look at the covariance function Γt = Cov(Ys , Ys+t ) (noting that Γ−t = Γt⊤), using a similar approach as before.

$$\begin{aligned}
Y_s &= \Phi Y_{s-1} + \varepsilon_s \\
\Rightarrow\quad Y_s Y_{s-t}^\top &= \Phi Y_{s-1} Y_{s-t}^\top + \varepsilon_s Y_{s-t}^\top \\
\Rightarrow\quad \operatorname{E}[Y_s Y_{s-t}^\top] &= \Phi \operatorname{E}[Y_{s-1} Y_{s-t}^\top] + \operatorname{E}[\varepsilon_s Y_{s-t}^\top] \\
\Rightarrow\quad \Gamma_{-t} &= \Phi \Gamma_{1-t} \quad \text{for } t > 0 \\
\Rightarrow\quad \Gamma_t^\top &= \Phi \Gamma_{t-1}^\top \\
\Rightarrow\quad \Gamma_t &= \Gamma_{t-1} \Phi^\top \\
\Rightarrow\quad \Gamma_t &= \Gamma_0 (\Phi^\top)^t \\
\Rightarrow\quad \Gamma_t &= V (\Phi^\top)^t
\end{aligned}$$

2.3.2.3 Continuous time: OU

Switching back to the scalar case but moving to cts time we must describe our linear Markov process using a
stochastic differential equation (SDE),

dY (t) = −λY (t) dt + σ dW (t), Y (0) = y0 , (2.3)

for fixed λ, σ > 0, known as the Ornstein–Uhlenbeck (OU) process. Here, dW (t) should be interpreted as
increments of a Wiener process (often referred to as Brownian motion), and hence, informally, dW (t) ∼
N (0, dt). We are going to be very informal in our treatment of SDEs - they are a little bit tangential to
the main thrust of the course, and careful treatment gets very technical very quickly. It is helpful to write
Equation 2.3 in the form

Y (t + dt) = Y (t) − λY (t) dt + σ dW (t)


= (1 − λ dt)Y (t) + σ dW (t),

so then taking expectations gives

E[Y (t + dt)] = (1 − λ dt)E[Y (t)].

T=50; dt=0.1; lambda=0.2; sigma=0.5; y0=10


times = seq(0, T, dt)
y = filter(rnorm(length(times), 0, sigma*sqrt(dt)),
1 - lambda*dt, "rec", init=y0)
tsplot(times, y, col=4, ylab="Y(t)",
main=paste("lambda =",lambda,"sigma =",sigma,"y0 =",y0))
abline(h=0, col=3, lwd=2)

[Figure: simulated OU path, "lambda = 0.2 sigma = 0.5 y0 = 10".]

Defining m(t) = E[Y (t)] leads to

m(t + dt) = (1 − λ dt)m(t),

and re-arranging gives


$$\frac{m(t+dt) - m(t)}{dt} = -\lambda\, m(t).$$

In other words,
$$\frac{dm}{dt} = -\lambda m.$$
So as you might have guessed, the expectation satisfies the obvious simple linear ODE, with solution

m(t) = E[Y (t)] = y0 e−λt .

Now instead if we evaluate the variance we get

Var[Y (t + dt)] = (1 − λ dt)2 Var[Y (t)] + σ 2 dt


= (1 − 2λ dt)Var[Y (t)] + σ 2 dt,

since dt2 ≃ 0. Again, things might be clearer if we define v(t) = Var[Y (t)] to get

v(t + dt) = (1 − 2λ dt)v(t) + σ 2 dt,

and re-arranging gives


$$\frac{v(t+dt) - v(t)}{dt} = -2\lambda\, v(t) + \sigma^2.$$

In other words

$$\frac{dv}{dt} = -2\lambda\, v + \sigma^2.$$
So the variance also satisfies a simple linear ODE, with solution

$$v(t) = \frac{\sigma^2}{2\lambda}\left(1 - e^{-2\lambda t}\right).$$

Again, since everything is linear Gaussian, so will be the solution, so

$$Y(t) \sim N\left( e^{-\lambda t} y_0,\ \frac{\sigma^2}{2\lambda}\left(1 - e^{-2\lambda t}\right) \right),$$

with limit
Y (∞) ∼ N (0, σ 2 /(2λ)).
It should be noted that when considered at integer times, this process is exactly an AR(1) with ϕ = e−λ ,
though the role of the parameter σ in the two models is slightly different.
Again, we can consider the shifted system, X(t) = Y (t) + µ. We get E[X(t)] = e−λt x0 + (1 − e−λt )µ,
Var[X(t)] = Var[Y (t)], so
$$X(t) \sim N\left( e^{-\lambda t} x_0 + (1 - e^{-\lambda t})\mu,\ \frac{\sigma^2}{2\lambda}\left(1 - e^{-2\lambda t}\right) \right)$$

for the SDE


dX(t) = −λ[X(t) − µ] dt + σ dW (t).

2.3.2.3.1 Stationarity and further properties


Again, what we have considered above is really the transition kernel of the continuous time process. But
in the stable case, we get a limiting distribution, and we can easily check that this limiting distribution is
stationary. So in the stable case, the OU model determines a stationary stochastic process in continuous time.
Since everything is Gaussian, it is a one-dimensional stationary Gaussian process (GP). In fact, it turns out
to be the only stationary GP with the (first-order) Markov property. We know its mean and variance, so to
fully characterise the process we just need to know its auto-covariance or auto-correlation function. We can

calculate this analogously to the discrete time case. Start by considering the dynamics of the process at time
s.

$$\begin{aligned}
Y(s+dt) &= (1 - \lambda\, dt)Y(s) + \sigma\, dW(s) \\
\Rightarrow\quad Y(s-t)Y(s+dt) &= (1 - \lambda\, dt)Y(s-t)Y(s) + \sigma Y(s-t)\, dW(s) \\
\Rightarrow\quad \gamma(t+dt) &= (1 - \lambda\, dt)\gamma(t) \quad \text{for } t > 0 \\
\Rightarrow\quad \frac{d\gamma}{dt} &= -\lambda\gamma \\
\Rightarrow\quad \gamma(t) &= \gamma(0)e^{-\lambda t}
\end{aligned}$$

So ρ(t) = e−λt , ∀t > 0. Since ρ(t) = ρ(−t) for scalar stationary processes, in general we have

ρ(t) = e−λ|t| ,

and this correlation function completes the characterisation of an OU process as a GP.

2.3.2.4 Vector continuous time: VOU

We can now consider the obvious multivariate generalisation of the OU process, the vector OU (VOU)
process,
dY(t) = −ΛY(t) dt + Ω dW(t), Y(0) = y0 ,
where now fixed Λ, Ω ∈ Mn×n and dW(t) represents increments of a vector Wiener process, informally
dW(t) ∼ N (0, Idt). Again, it is helpful to re-write the SDE, informally, as

Y(t + dt) = (I − Λ dt)Y(t) + Ω dW(t).

Taking expectations and defining m(t) = E[Y(t)] leads, unsurprisingly, to the ODE

$$\frac{dm}{dt} = -\Lambda m,$$
with solution
m(t) = E[Y(t)] = exp{−Λt}y0 .
Stability of the fixed point of 0 will require the real parts of the eigenvalues of Λ to be positive.
Taking the variance gives

Var[Y(t + dt)] = (I − Λ dt)Var[Y(t)](I − Λ dt)⊤ + ΩΩ⊤ dt.

Putting V(t) = Var[Y(t)] and Σ = ΩΩ⊤ leads to the linear matrix ODE

$$\frac{dV}{dt} = -\Lambda V - V\Lambda^\top + \Sigma.$$
Even without explicitly solving, it is clear that at the fixed point we will have the continuous Lyapunov
equation
ΛV∞ + V∞ Λ⊤ = Σ.
Again, there are standard direct methods for solving this for V∞ (eg. maotai::lyapunov). Analogously
with the discrete time case, we can verify by direct substitution that the solution to the variance ODE is given
by
V(t) = V∞ − exp{−Λt}V∞ exp{−Λt}⊤ .
Everything is linear and multivariate normal, so the marginal is too, giving
 
Y(t) ∼ N exp{−Λt}y0 , V∞ − exp{−Λt}V∞ exp{−Λt}⊤ ,

with limiting distribution
Y(∞) ∼ N (0, V∞ )
in the stable case. We can analyse the shifted system X(t) = Y(t) + µ as we have seen previously to obtain

$$X(t) \sim N\!\left(\exp\{-\Lambda t\}x_0 + (I - \exp\{-\Lambda t\})\mu,\ V_\infty - \exp\{-\Lambda t\}V_\infty\exp\{-\Lambda t\}^\top\right)$$

as the marginal for the SDE

$$dX(t) = -\Lambda[X(t) - \mu]\,dt + \Omega\,dW(t).$$
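As a small numerical sketch (the matrices Λ and Ω below are my own illustrative choices, not from the notes), the stationary variance V∞ can be obtained by vectorising the continuous Lyapunov equation, using vec(AXB) = (B⊤ ⊗ A)vec(X).

## Sketch: solve Lambda V + V Lambda^T = Sigma by vectorisation.
Lambda = matrix(c(1, 0.5, -0.5, 1), ncol=2, byrow=TRUE)  # eigenvalues 1 +/- 0.5i, so stable
Omega = diag(c(1, 0.5)); Sigma = Omega %*% t(Omega)
I2 = diag(2)
Vinf = matrix(solve(kronecker(I2, Lambda) + kronecker(Lambda, I2),
                    as.vector(Sigma)), ncol=2)
Vinf
Lambda %*% Vinf + Vinf %*% t(Lambda) - Sigma   # numerically zero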

2.3.2.4.1 Stationarity and further properties


Again, we have so far characterised the transition kernel of the VOU process. But in the stable case, we can
again check that the limiting distribution is stationary, and so the stable VOU process admits a stationary
solution. This stationary solution is a vector-valued one-dimensional GP, and since we know its mean and
variance, we just need the auto-covariance function to complete its GP characterisation. Again, we start with
the dynamics at time s.

$$\begin{aligned}
Y(s+dt) &= (I - \Lambda\,dt)Y(s) + \Omega\,dW(s) \\
\Rightarrow\ Y(s+dt)Y(s-t)^\top &= (I - \Lambda\,dt)Y(s)Y(s-t)^\top + \Omega\,dW(s)Y(s-t)^\top \\
\Rightarrow\ \Gamma(-t-dt) &= (I - \Lambda\,dt)\Gamma(-t), \quad t > 0, \text{ on taking expectations} \\
\Rightarrow\ \Gamma(t+dt)^\top &= (I - \Lambda\,dt)\Gamma(t)^\top \\
\Rightarrow\ \Gamma(t+dt) &= \Gamma(t)(I - \Lambda^\top dt) \\
\Rightarrow\ \frac{d\Gamma}{dt} &= -\Gamma\Lambda^\top \\
\Rightarrow\ \Gamma(t) &= \Gamma(0)\exp\{-\Lambda^\top t\} = V_\infty\exp\{-\Lambda t\}^\top.
\end{aligned}$$

2.4 Second order systems

We will just very briefly consider second-order generalisations of the first-order models that we have analysed.
As we shall see, we can directly analyse a second-order model, but we can also convert a second-order
model to a first-order model with extended state space, so understanding the first-order case is in some sense
sufficient.

2.4.1 Deterministic

2.4.1.1 Discrete time

Consider the linear recursion


yt = ϕ1 yt−1 + ϕ2 yt−2 , t = 2, 3, . . .
initialised with y0 , y1 ∈ R.

tsplot(filter(rep(0,50), c(1.5, -0.75), "rec", init=c(10,10)),


type="p", pch=19, col=4, cex=0.5, ylab="Y_t",
main="phi_1 = 1.5, phi_2 = -0.75")
abline(h=0, col=3, lwd=2)

[Figure: Y_t against Time (0 to 50) for the recursion with phi_1 = 1.5, phi_2 = -0.75, showing damped oscillations about the zero line.]

We could try solving this directly by looking for solutions of the form yt = Aλt . Substituting in leads to the
quadratic
λ2 − ϕ1 λ − ϕ2 = 0.
This will have two solutions, λ1 , λ2 , leading to a general solution of the form

yt = A1 λt1 + A2 λt2 ,

(assuming that λ1 ̸= λ2 ), and the two initial values y0 , y1 can be used to deduce the values of A1 , A2 . It is
clear that this solution will be stable provided that |λi | < 1, i = 1, 2.
Which combinations of ϕ1 and ϕ2 will lead to stable systems? Since we have a condition on λi and the λi are
just a function of ϕ1 and ϕ2 , we can figure out the region of ϕ1 –ϕ2 space corresponding to stable solutions.
Careful analysis of the relevant inequalities reveals that the stable region is the triangular region defined by

ϕ2 < 1 − ϕ1 , ϕ2 < 1 + ϕ1 , ϕ2 > −1.

Within this triangular region, the λi will be complex if $\phi_1^2 + 4\phi_2 < 0$, and this will correspond to the oscillatory stable region.
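As a quick sketch (not from the notes: the helper function and test values are my own), these conditions are easy to check numerically from the roots of the quadratic.

## Sketch: classify a (phi1, phi2) pair via the roots of lambda^2 - phi1*lambda - phi2 = 0.
classify = function(phi1, phi2) {
  lam = polyroot(c(-phi2, -phi1, 1))   # coefficients in increasing powers
  c(stable = all(Mod(lam) < 1), oscillatory = (phi1^2 + 4*phi2 < 0))
}
classify(1.5, -0.75)  # stable and oscillatory, as in the plot above
classify(0.5, 0.6)    # unstable, since phi2 > 1 - phi1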
We can consider a shifted version (xt = yt + µ), as before.
This is all fine, but an alternative is to re-write the system as a first order vector recurrence
$$\begin{pmatrix} y_t \\ y_{t-1} \end{pmatrix} = \begin{pmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} y_{t-1} \\ y_{t-2} \end{pmatrix}.$$

We know that this system will be stable provided that the eigenvalues of the update matrix lie inside the
unit circle. We can check that the eigenvalues of this matrix are given by the solutions of the quadratic in λ
considered above. This technique of taking a higher-order linear dynamical system and expressing it as a first
order system with a higher-dimensional state space can be applied to all of the models that we have been
considering: deterministic or stochastic, discrete or continuous time.

2.4.1.2 Vector discrete time

The second-order vector discrete time linear recursion

yt = Φ1 yt−1 + Φ2 yt−2

can be re-written as the first order system


$$\begin{pmatrix} y_t \\ y_{t-1} \end{pmatrix} = \begin{pmatrix} \Phi_1 & \Phi_2 \\ I & 0 \end{pmatrix}\begin{pmatrix} y_{t-1} \\ y_{t-2} \end{pmatrix}.$$

2.4.1.3 Continuous time

The scalar second-order linear system in continuous time can be written

ÿ + ν1 ẏ + ν2 y = 0,

and initialised with y(0) and ẏ(0). Here, when compared to the first-order equivalent, there has been a switch
of parameter from λ to ν1 , ν2 in order to avoid confusion with eigenvalues. As with the discrete time system,
we could directly analyse this system by looking for solutions of the form y = Aeλt . Substituting in leads to
the quadratic
λ2 + ν1 λ + ν2 = 0.
This can be solved for the two solutions λ1 , λ2 to get a general solution of the form

y(t) = A1 eλ1 t + A2 eλ2 t

(assuming that λ1 ̸= λ2 ), and the initial conditions can be used to solve for A1 , A2 . It is clear that this system
will be stable provided that the real parts of λ1 , λ2 are both negative.
Alternatively, by defining v ≡ ẏ, this second-order system could be re-written as

v̇ + ν1 v + ν2 y = 0,

and hence as the first order vector system,


$$\begin{pmatrix} \dot{v} \\ \dot{y} \end{pmatrix} = \begin{pmatrix} -\nu_1 & -\nu_2 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} v \\ y \end{pmatrix}.$$

We know that this system will be stable provided that the generator matrix has eigenvalues with negative real
parts. Solving for the eigenvalues leads to the same quadratic in λ we deduced above.
We can consider the stable region in terms of ν1 and ν2 . Again, careful analysis of the relevant inequalities
reveals that the stable region corresponds to the positive quadrant ν1 , ν2 > 0, and that the roots will be
complex if $4\nu_2 > \nu_1^2$.

2.4.1.4 Vector continuous time

The second-order vector linear system

ÿ + Λ1 ẏ + Λ2 y = 0,

can be re-written as the first-order system


$$\begin{pmatrix} \dot{v} \\ \dot{y} \end{pmatrix} = \begin{pmatrix} -\Lambda_1 & -\Lambda_2 \\ I & 0 \end{pmatrix}\begin{pmatrix} v \\ y \end{pmatrix},$$

where v ≡ ẏ.

2.4.2 Stochastic

2.4.2.1 Discrete time: AR(2)

The second-order Markovian scalar linear Gaussian auto-regressive model, AR(2), takes the form

Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + εt , εt ∼ N (0, σ 2 ).

tsplot(filter(rnorm(80), c(1.5, -0.75), "rec", init=c(20,20)),


type="p", pch=19, col=4, cex=0.5, ylab="Y_t",
main="phi_1 = 1.5, phi_2 = -0.75")
abline(h=0, col=3, lwd=2)

[Figure: simulated AR(2) trajectory Y_t against Time (0 to 80) with phi_1 = 1.5, phi_2 = -0.75, showing noisy quasi-periodic oscillations about zero.]

Taking expectations gives the second-order linear recurrence

E[Yt ] = ϕ1 E[Yt−1 ] + ϕ2 E[Yt−2 ].

Looking for solutions of the form E[Yt ] = Aλt gives the quadratic

λ2 − ϕ1 λ − ϕ2 = 0,

so the system will be stable if the solutions, λi , are inside the unit circle (|λi | < 1). We have already described
this stable region in ϕ1 –ϕ2 space, while examining the deterministic second-order linear recurrence.
Alternatively, we could re-write this second-order system as the VAR(1):
$$\begin{pmatrix} Y_t \\ Y_{t-1} \end{pmatrix} = \begin{pmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} Y_{t-1} \\ Y_{t-2} \end{pmatrix} + \begin{pmatrix} \varepsilon_t \\ 0 \end{pmatrix}.$$

We know that this system will be stable if the eigenvalues of the auto-regressive matrix are inside the unit circle. But solving for the eigenvalues leads to the same quadratic that we deduced above. Also note that here the error variance matrix is $\begin{pmatrix} \sigma^2 & 0 \\ 0 & 0 \end{pmatrix}$.

2.4.2.2 Vector discrete time: VAR(2)

The VAR(2) model


$$Y_t = \Phi_1 Y_{t-1} + \Phi_2 Y_{t-2} + \varepsilon_t, \quad \varepsilon_t \sim N(0, \Sigma)$$

can be re-written as the VAR(1) system

$$\begin{pmatrix} Y_t \\ Y_{t-1} \end{pmatrix} = \begin{pmatrix} \Phi_1 & \Phi_2 \\ I & 0 \end{pmatrix}\begin{pmatrix} Y_{t-1} \\ Y_{t-2} \end{pmatrix} + \begin{pmatrix} \varepsilon_t \\ 0 \end{pmatrix}.$$

It will be stable when the eigenvalues of the auto-regressive matrix lie inside the unit circle.

2.4.2.3 Continuous time

We wrote the first-order continuous time linear system in the form

dY (t) = −λY (t) dt + σ dW (t),

where W (t) is a Wiener process, but a physicist might just write it as a Langevin equation:

Ẏ (t) + λY (t) = ση(t),

where η(t) is a “white noise” process. This requires some care in interpreting, since Y (t) is not actually
differentiable, and η(t) is not a regular function. However, this way of writing it makes the second-order
generalisation more obvious:
Ÿ (t) + ν1 Ẏ (t) + ν2 Y (t) = ση(t),
where, again, we’ve switched from λ to νi to avoid confusion with eigenvalues. If we throw caution to the
wind, take expectations, and put m(t) = E[Y (t)] we get

$$\ddot{m} + \nu_1\dot{m} + \nu_2 m = 0.$$

We know that seeking solutions of the form m(t) = Aeλt will lead to the quadratic

λ2 + ν1 λ + ν2 = 0

and that the system will be stable provided that the solutions, λi , both have negative real part.
If we now try to be a bit more careful, and define V (t) = Ẏ (t), we can re-write the Langevin equation as

V̇ (t) + ν1 V (t) + ν2 Y (t) = ση(t),

and then it is reasonably clear that we can write the system as a pair of coupled (S)DEs,

dV (t) = −(ν1 V (t) + ν2 Y (t))dt + σ dW (t)


dY (t) = V (t) dt,

and hence as a first-order vector OU process


$$\begin{pmatrix} dV(t) \\ dY(t) \end{pmatrix} = \begin{pmatrix} -\nu_1 & -\nu_2 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} V(t) \\ Y(t) \end{pmatrix} dt + \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} dW_{1,t} \\ dW_{2,t} \end{pmatrix}.$$

We know that this VOU process will be stable if the eigenvalues of the generator matrix have negative real
parts, but of course, solving for the eigenvalues leads to the quadratic we deduced above.

2.4.2.4 Vector continuous time

Using “physics notation”, we could informally write the second-order system as

Ÿ(t) + Λ1 Ẏ(t) + Λ2 Y(t) = Ω η (t),

understanding that what this really represents is the VOU process


$$\begin{pmatrix} dV(t) \\ dY(t) \end{pmatrix} = \begin{pmatrix} -\Lambda_1 & -\Lambda_2 \\ I & 0 \end{pmatrix}\begin{pmatrix} V(t) \\ Y(t) \end{pmatrix} dt + \begin{pmatrix} \Omega & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} dW_{1,t} \\ dW_{2,t} \end{pmatrix},$$

where V(t) = Ẏ(t). Stability can be determined by checking if the eigenvalues of the generator matrix all
have negative real part.

3 ARMA models

3.1 Introduction

In this chapter we will consider a method for building a flexible class of stationary Gaussian processes for
discrete-time real-valued time series. Such processes will rarely be appropriate for directly fitting to real
observed time series data, but will be extremely useful for modelling the “residuals” of a time series after
taking out some systematic, and/or seasonal effect, and/or detrending in some way. We have seen that such
residuals can often be assumed to be mean zero and stationary, but with a non-trivial correlation structure.
Autoregressive–moving-average (ARMA) models are a good solution to this problem.
An ARMA model is simply the stochastic process obtained by applying an ARMA (or IIR) filter to a Gaussian
white noise process. That is, the stochastic process

. . . , X−2 , X−1 , X0 , X1 , X2 , . . .

is ARMA(p, q) if it is determined by the filter


$$X_t = \sum_{i=1}^{p}\phi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}, \tag{3.1}$$

applied to iid Gaussian white noise εt ∼ N (0, σ 2 ), for some fixed p-vector ϕ and q-vector θ . Since the filter
is linear, it follows that the ARMA model is a discrete-time Gaussian process. It should be immediately
clear that this process generalises the AR(1) and AR(2) processes considered in Chapter 2. However, this
connection with the AR(1) and AR(2) should immediately raise questions about stationarity. We have seen
that even for these simple special cases, parameters must be carefully chosen in order to ensure that the
process is stable and will converge to a stationary distribution. But we have also seen that even in this stable
case, if the process is not initialised with draws from the stationary distribution, it will not be stationary, at
least until convergence is asymptotically reached.

3.1.1 Stationarity

In the context of time series analysis, we typically want to use ARMA models as a convenient way of
describing a stationary discrete-time Gaussian process. We therefore want to restrict our attention to
parameter regimes where the process is stable and converges to a stationary distribution. We will need to be
careful to check that this is the case. We also assume that the process has been running from t = −∞, or that
it is carefully initialised with draws from the (joint) stationary distribution, so that there is no transient phase
of the process that needs to be considered.
At this point it might be helpful to more carefully define stationarity in the context of time series.
A discrete-time stochastic process is strictly stationary iff the joint distribution of

Xt , Xt+1 , . . . Xt+k

is identical to the joint distribution of

Xt+l , Xt+l+1 , . . . Xt+l+k , ∀t, k, l ∈ Z.

In particular, but not only, it implies that the marginal distribution of Xt is independent of t. This condition is
quite strong, and hard to verify in practice, so there is also a weaker notion of stationarity that is useful.
A discrete-time stochastic process is weakly stationary (AKA second-order stationary) if:

• E[Xt ] = E[Xs ] = µ, ∀s, t ∈ Z


• Var[Xt ] = Var[Xs ] = v, ∀s, t ∈ Z
• Cov[Xt , Xt+k ] = Cov[Xs , Xs+k ] = γk , ∀s, t, k ∈ Z.

In other words, the mean and variance are constant, and the covariance between two values depends only
on the lag. It is clear that (in the case of a process with finite mean and variance) the property of weak
stationarity follows from strict stationarity, hence the name. However, a Gaussian process is determined by its
first two moments, and so for Gaussian processes the concepts of strict and weak stationarity coincide, and
we will refer to such processes as being stationary, or not, without needing to qualify.
The sequence γk , k ∈ Z is known as the auto-covariance function of a weakly stationary stochastic process.
Note that γ−k = γk . It is often also convenient to work with the corresponding auto-correlation function,

ρk ≡ Corr[Xt , Xt+k ] = γk /γ0 , ∀k ∈ Z.

Let us now get a better feel for stationary ARMA models by looking at some important special cases.

3.2 AR(p)

The autoregressive model of order p, denoted AR(p) or ARMA(p, 0), is defined by

$$X_t = \sum_{i=1}^{p}\phi_i X_{t-i} + \varepsilon_t, \quad \varepsilon_t \sim N(0, \sigma^2), \quad \forall t \in \mathbb{Z}. \tag{3.2}$$

The process is determined by fixed ϕ ∈ Rp , σ ∈ R+ . We briefly examined the special cases AR(1) and
AR(2) in Chapter 2, but we now tackle the general case. It is crucial to notice that, by construction, εt will be
independent of Xs for t > s, but not otherwise (since then εt will have been involved in some way in the
construction of Xs ). This property also holds in the more general case, and will be repeatedly exploited in
the analysis of ARMA models.
The first thing to consider is whether the particular choice of ϕ corresponds to a stable, and hence stationary,
model. There are many ways one could try to understand this, but using the same approach as we used in
Chapter 2, we can write the model as the p-dimensional VAR(1) model
      
$$\begin{pmatrix} X_t \\ X_{t-1} \\ \vdots \\ X_{t-p+1} \end{pmatrix} = \begin{pmatrix} \phi_1 & \cdots & \phi_{p-1} & \phi_p \\ 1 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 1 & 0 \end{pmatrix}\begin{pmatrix} X_{t-1} \\ X_{t-2} \\ \vdots \\ X_{t-p} \end{pmatrix} + \begin{pmatrix} \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$

If we call the p × p autoregressive matrix Φ, then we know from our analysis of the stability of VAR(1)
models that we will get a (stable) stationary model provided that all of the eigenvalues of Φ lie inside the unit
circle in the complex plane. If we attempt to find the eigenvalues by carrying out Laplace expansion on the
equation |Φ − λI| = 0, we arrive at the polynomial equation
p
X
λp − ϕk λp−k = 0.
k=1

So, we want all of the solutions in λ to have modulus less than one.
Aside: Note that we have established a connection between the roots of an arbitrary polynomial and the
eigenvalues of a specially constructed matrix. So, if we have a numerical eigenvalue solver at our disposal,

we could use it to numerically determine all of the roots of an arbitrary polynomial in a very straightforward
way.
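For instance (a small sketch of my own, not from the notes), we can build the companion-form matrix for a cubic and compare its eigenvalues with the output of polyroot.

## Sketch: roots of lambda^3 - 0.5*lambda^2 - 0.3*lambda - 0.1 via the companion matrix.
phi = c(0.5, 0.3, 0.1)
Phi = rbind(phi, cbind(diag(2), 0))   # companion-form matrix as described above
eigen(Phi)$values                     # roots of the polynomial
polyroot(rev(c(1, -phi)))             # same roots from R's built-in root finder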
If we now make the substitution $u = \lambda^{-1}$ we get the polynomial equation

$$1 - \sum_{k=1}^{p}\phi_k u^k = 0,$$

and so we want all of the solutions to this equation to have modulus greater than one. The polynomial on the
LHS of this equation crops up frequently in the analysis of AR models, so it has a name. Our model will be
stationary provided that all roots of the characteristic polynomial
$$\phi(u) = 1 - \sum_{k=1}^{p}\phi_k u^k$$

have modulus greater than one, and hence lie outside of the unit circle in the complex plane.
We will now assume that we are considering only the stationary case, and try to understand other characteristics
of this model. Since we know that it is a stationary Gaussian process, we know that it is characterised by its
mean and auto-covariance function. Let’s start with the mean.
It should be obvious, either by symmetry, or by analogy with the AR(1) and AR(2), that the only possible
stationary mean for this process is zero. However, it is instructive to explicitly compute it. Taking the
expectation of Equation 3.2 gives
p
X
E[Xt ] = ϕk E[Xt−k ]
k=1
Xp
⇒µ= ϕk µ
k=1
p
X
!
⇒ 1− ϕk µ = 0
k=1
⇒ ϕ(1)µ = 0

Now, since we are assuming a stationary process, the roots of ϕ(u) are outside of the unit circle, and hence
ϕ(1) ̸= 0, giving µ = 0 as anticipated (in the stationary case).
Next we consider the auto-covariance function.

3.2.1 The Yule-Walker equations

If we multiply Equation 3.2 by Xt−k for k > 0 and take expectations we get
$$E[X_{t-k}X_t] = \sum_{i=1}^{p}\phi_i E[X_{t-k}X_{t-i}],$$

using the fact that $E[X_{t-k}\varepsilon_t] = E[X_{t-k}]E[\varepsilon_t] = 0$. In other words,

$$\gamma_k = \sum_{i=1}^{p}\phi_i\gamma_{k-i}, \quad k > 0. \tag{3.3}$$

So, the auto-covariance function satisfies a linear recurrence. We can either use this linear recurrence directly
in order to generate auto-covariances, or seek a general solution. But either way, we will need to know
γ0 , γ1 , . . . , γp−1 in order to initialise the recursion or fix a particular solution. If we consider the above
equations for k = 1, 2, . . . , p, and remember that γ−k = γk , we have p equations in the p + 1 unknowns
γ0 , γ1 , . . . , γp . So, we need one more linear constraint. There are two commonly adopted ways to proceed

from here. Either way, it is useful to consider the case k = 0. That is, multiply Equation 3.2 by Xt and take
expectations. First we need to compute

$$E[X_t\varepsilon_t] = \mathrm{Cov}[X_t, \varepsilon_t] = \mathrm{Cov}\left[\sum_i\phi_i X_{t-i} + \varepsilon_t,\ \varepsilon_t\right] = \mathrm{Cov}[\varepsilon_t, \varepsilon_t] = \mathrm{Var}[\varepsilon_t] = \sigma^2.$$

Then we get

$$\gamma_0 = \sum_{i=1}^{p}\phi_i\gamma_i + \sigma^2.$$

This gives us the (p + 1)th equation that we need to solve for γ0 , γ1 , . . . , γp . There is a related but slightly
different approach, where we divide Equation 3.3 by γ0 to get
$$\rho_k = \sum_{i=1}^{p}\phi_i\rho_{k-i}, \quad k = 1, 2, \ldots, p,$$

which we regard as p equations in p unknowns, since we know that ρ0 = 1. These are the equations most
commonly referred to as the Yule-Walker equations, though conventions vary. But if we adopt this approach
we still need to know the stationary variance v = γ0 in order to complete the characterisation of the process.
For that we just divide the above equation for γ0 through by γ0 and re-arrange to get

$$v = \gamma_0 = \frac{\sigma^2}{1 - \sum_{i=1}^{p}\phi_i\rho_i}.$$

The above approach to solving for the particular solution of the auto-covariance function using the Yule-
Walker equations is very convenient when solving by hand in the case of an AR(1), AR(2) or AR(3), say, but
not especially convenient when solving on a computer in the general case. For this it is more convenient to
revert to our formulation in terms of a VAR(1) model, with the auto-regressive matrix Φ already described,
and innovation variance matrix

$$\Sigma = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix}.$$
We then know from our analysis in Chapter 2 that the stationary variance matrix is given by the solution, V,
to
V = ΦVΦ⊤ + Σ,
and that there are standard direct methods to solve this discrete Lyapunov equation (eg. netcontrol::dlyap).
The first row (or column) of V is γ0 , γ1 , . . . , γp−1 , as required.
Once we have the first p auto-covariances, we can use Equation 3.3 to generate more auto-covariances, as
required. Alternatively, we can seek an explicit solution for γk in terms of k (only), using standard techniques
for solving linear recursions. If we seek a solution of the form $\gamma_k = Au^{-k}$ and substitute this in to Equation 3.3 we find that the u we require are roots of the characteristic polynomial ϕ(u). If we call the roots u1, . . . , up, then our general solution will be of the form

$$\gamma_k = \sum_{i=1}^{p} A_i u_i^{-k},$$

and we can use our first p auto-covariances to determine A1 , . . . , Ap .

3.2.2 The backshift operator

Although it is not strictly necessary, it is often quite convenient to analyse and manipulate ARMA (and other
discrete time) models using the backshift operator (AKA the lag operator).

The backshift operator, B, has the property

Bxt = xt−1 .

It follows that B 2 xt = BBxt = Bxt−1 = xt−2 , and hence B k xt = xt−k . Then, for consistency, we have
B −1 xt = xt+1 , the forward shift operation, and B 0 xt = xt , the identity. Since it is a linear operator, it can
be manipulated algebraically like other operators. It is very much analogous to the differential operator,
D, often used in the analysis of (second-order) ODEs. We can use the backshift operator to simplify the
description of an AR(p) model as follows.
$$\begin{aligned}
X_t &= \sum_{i=1}^{p}\phi_i X_{t-i} + \varepsilon_t \\
\Rightarrow\ X_t - \sum_{i=1}^{p}\phi_i X_{t-i} &= \varepsilon_t \\
\Rightarrow\ X_t - \sum_{i=1}^{p}\phi_i B^i X_t &= \varepsilon_t \\
\Rightarrow\ \left(1 - \sum_{i=1}^{p}\phi_i B^i\right)X_t &= \varepsilon_t \\
\Rightarrow\ \phi(B)X_t &= \varepsilon_t,
\end{aligned}$$

where ϕ(B) is the characteristic polynomial evaluated with B. The linear operator ϕ(B) is known as the
autoregressive operator.
We can understand some of the algebraic properties of the backshift operator by first expanding an AR(1)
with successive substitution.

$$\begin{aligned}
X_t &= \phi X_{t-1} + \varepsilon_t \\
&= \phi(\phi X_{t-2} + \varepsilon_{t-1}) + \varepsilon_t \\
&= \cdots \\
&= \sum_{k=0}^{\infty}\phi^k\varepsilon_{t-k} \\
&= \sum_{k=0}^{\infty}\phi^k B^k\varepsilon_t \\
&= \left(\sum_{k=0}^{\infty}(\phi B)^k\right)\varepsilon_t.
\end{aligned}$$

In other words, the model


(1 − ϕB)Xt = εt
is exactly equivalent to the model
Xt = (1 − ϕB)−1 εt , (3.4)
since

$$(1 - \phi B)^{-1} = \sum_{k=0}^{\infty}(\phi B)^k = 1 + \phi B + \phi^2 B^2 + \cdots.$$

For reasons to become clear, Equation 3.4 is known as the MA(∞) representation of the AR(1) process.

3.2.3 Example: AR(2)

Let’s go back to the AR(2) example considered very briefly in Chapter 2.

library(astsa)
set.seed(42)
phi = c(1.5, -0.75); sigma = 1; n = 2000; burn = 100
x = filter(rnorm(n+burn, 0, sigma), phi, "rec")[-(1:burn)]
tsplot(x, col=4, cex=0.5, ylab="X_t",
main=paste0("phi_1=", phi[1], ", phi_2=", phi[2]))
abline(h=0, col=3)

[Figure: simulated stationary AR(2) series X_t (n = 2000) with phi_1 = 1.5, phi_2 = -0.75, plotted against Time.]

Although in Chapter 2 we briefly switched to plotting discrete-time series with points (to emphasise the
distinction between discrete and cts time), we now revert to the more usual convention of joining successive
observations with straight line segments.
This model has ϕ1 = 3/2, ϕ2 = −3/4, σ = 1. In principle this completely characterises the process, but on
their own, these parameters are not especially illuminating. The plot suggests that this model is probably
stationary, but isn’t especially conclusive, so we should certainly check.
To check by hand, we want to find the roots of the characteristic polynomial by solving the quadratic

1 − ϕ1 u − ϕ2 u2 = 0.

If we substitute in for ϕ1 and ϕ2 and use the quadratic formula we get



$$u_{1,2} = 1 \pm i/\sqrt{3},$$

and these obviously have modulus greater than one, so the model is stationary.
To check with a computer, it’s arguably more convenient to see if the eigenvalues of the generator of the
equivalent VAR(1) model have modulus less than one.

Phi = matrix(c(1.5,-0.75,1,0), ncol=2, byrow=TRUE)


evals = eigen(Phi)$values

abs(evals)

[1] 0.8660254 0.8660254

They both have modulus less than one, so again, we are in the stationary region. We can also check that the
roots that we calculated by hand are correct.

1/evals

[1] 1-0.5773503i 1+0.5773503i

In fact, R has a built-in polynomial root finder, polyroot, so the simplest approach is to use this function
to directly find the roots of the characteristic polynomial.

polyroot(c(1, -1.5, 0.75))

[1] 1+0.57735i 1-0.57735i

However, having the VAR(1) generator matrix at our disposal will turn out to be useful, anyway, as we will
see shortly.
Now that we have established that the process is stationary, attention can focus on the auto-covariance
structure. We can start by looking at the empirical ACF of the simulated data.

acf1(x)

[Figure: sample ACF of x against LAG (0 to 50), showing decaying oscillations.]

[1] 0.86 0.55 0.18 -0.14 -0.35 -0.41 -0.36 -0.23 -0.08 0.06 0.15 0.18
[13] 0.17 0.12 0.05 -0.02 -0.07 -0.09 -0.09 -0.07 -0.04 0.00 0.03 0.05
[25] 0.05 0.03 0.01 0.00 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.01 0.00
[37] 0.02 0.03 0.03 0.01 0.00 -0.02 -0.03 -0.04 -0.04 -0.04 -0.02 0.00
[49] 0.02 0.03 0.02 0.00 -0.02 -0.04 -0.04

Given the complex roots of the characteristic polynomial, this oscillatory behaviour in the ACF is to be
expected. But what is the true ACF of the model? We know that however we want to compute the auto-
covariances, we will need to know the first two, γ0 = v and γ1 , in order to initialise the recursion or find a
particular solution to the explicit form.
If we want to compute these by hand, we will use some form of the Yule-Walker equations. It is usually
easiest to use the auto-correlation form of the Yule-Walker equations. We will look at that approach shortly,
but to begin with, it is instructive to examine the approach based on the auto-covariance form. Start with

γ0 = ϕ1 γ1 + ϕ2 γ2 + σ 2
γ1 = ϕ1 γ0 + ϕ2 γ1
γ2 = ϕ1 γ1 + ϕ2 γ0 ,

to get our three equations in three unknowns. To solve these by Gaussian elimination, we can write in the
form of an augmented matrix,  
1 −ϕ1 −ϕ2 σ 2
 −ϕ1 1 − ϕ2 0 0 .
 
−ϕ2 −ϕ1 1 0
Substituting in our values gives  
1 −3/2 3/4 1
 −3/2 7/4 0 0 .
 
3/4 −3/2 1 0
Row reduction will lead to the first three auto-covariances. I will cheat.

m = matrix(c(1,-3/2,3/4, -3/2,7/4,0, 3/4,-3/2,1), ncol=3, byrow=TRUE)


solve(m, c(1,0,0))

[1] 8.615385 7.384615 4.615385

So v = γ0 = 8.62, γ1 = 7.38, γ2 = 4.62.


On a computer, the above approach is not the simplest. Here it is simpler to use our VAR(1) representation
and solve for the stationary variance matrix.

Sig = matrix(c(1,0,0,0), ncol=2)


V = netcontrol::dlyap(t(Phi), Sig) # note transpose of Phi
V

[,1] [,2]
[1,] 8.615385 7.384615
[2,] 7.384615 8.615385

The first row (or column) of V gives the first two auto-covariances. Note that they are the same as we
computed above.
Note: According to the documentation for dlyap, it should not be necessary to transpose Phi, but it gives
the wrong answer if I don’t. I think this is a bug in either the function or its documentation.
We can check that we do have a good solution by checking that

V - (Phi %*% V %*% t(Phi))

[,1] [,2]
[1,] 1.000000e+00 1.776357e-15
[2,] 2.664535e-15 3.552714e-15

is close to Sig.
Now that we have the initial auto-covariances we can generate more. If we are on a computer, it may be
simplest to just recursively generate them.

initAC = V[1,]
acvf = filter(rep(0,49), c(1.5, -0.75), "rec", init=initAC)
acvf = c(initAC[1], acvf)
acvf[1:20]

[1] 8.6153846 7.3846154 4.6153846 1.3846154 -1.3846154 -3.1153846


[7] -3.6346154 -3.1153846 -1.9471154 -0.5841346 0.5841346 1.3143029
[13] 1.5333534 1.3143029 0.8214393 0.2464318 -0.2464318 -0.5544715
[19] -0.6468835 -0.5544715

We can turn these into auto-correlations and overlay them on the empirical ACF of the simulated data set.

acrf = acvf[-1]/acvf[1]
print(acrf[1:20])

[1] 0.85714286 0.53571429 0.16071429 -0.16071429 -0.36160714 -0.42187500


[7] -0.36160714 -0.22600446 -0.06780134 0.06780134 0.15255301 0.17797852
[13] 0.15255301 0.09534563 0.02860369 -0.02860369 -0.06435830 -0.07508469
[19] -0.06435830 -0.04022394

acf1(x)

[1] 0.86 0.55 0.18 -0.14 -0.35 -0.41 -0.36 -0.23 -0.08 0.06 0.15 0.18
[13] 0.17 0.12 0.05 -0.02 -0.07 -0.09 -0.09 -0.07 -0.04 0.00 0.03 0.05
[25] 0.05 0.03 0.01 0.00 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.01 0.00
[37] 0.02 0.03 0.03 0.01 0.00 -0.02 -0.03 -0.04 -0.04 -0.04 -0.02 0.00
[49] 0.02 0.03 0.02 0.00 -0.02 -0.04 -0.04

points(1:length(acrf), acrf, col=4, pch=19, cex=0.8)

[Figure: sample ACF of x with the true AR(2) auto-correlations overlaid as points.]

We see that the empirical ACF closely matches the true ACF of the AR(2) model, at least for small lags. In
fact, if we just want the auto-correlation function, there is a built-in function to compute it.

ARMAacf(ar=c(1.5, -0.75), lag.max=10)

0 1 2 3 4 5
1.00000000 0.85714286 0.53571429 0.16071429 -0.16071429 -0.36160714
6 7 8 9 10
-0.42187500 -0.36160714 -0.22600446 -0.06780134 0.06780134

If working by hand, it may be preferable to find an explicit solution for γk in terms of k. We know that the
solution is of the form
$$\gamma_k = A_1 u_1^{-k} + A_2 u_2^{-k},$$

where ui are the roots of the characteristic equation. We can use γ0 and γ1 to fix A1 and A2 , since

$$\gamma_0 = A_1 + A_2, \qquad \gamma_1 = A_1 u_1^{-1} + A_2 u_2^{-1},$$

and we could write this as the augmented matrix

$$\left(\begin{array}{cc|c} 1 & 1 & \gamma_0 \\ u_1^{-1} & u_2^{-1} & \gamma_1 \end{array}\right).$$

Substituting in our values gives

$$\left(\begin{array}{cc|c} 1 & 1 & 8.615 \\ 1/(1 - i/\sqrt{3}) & 1/(1 + i/\sqrt{3}) & 7.385 \end{array}\right),$$

which we can row-reduce to compute A1 and A2 . Again, I will cheat.

m = matrix(c(1,1, 1/complex(r=1,i=-1/sqrt(3)), 1/complex(r=1,i=1/sqrt(3))),


ncol=2, byrow=TRUE)
A = solve(m, c(8.615, 7.385))

A

[1] 4.3075-1.066655i 4.3075+1.066655i

So A1 = 4.31 − 1.07i, A2 = 4.31 + 1.07i are complex conjugates, just as the roots of the characteristic
polynomial are. This is to be expected, since

γk = Au−k + Āū−k

clearly has the property that γk = γ̄k , and so γk is real, as we would hope. To get rid of the imaginary parts,
it is convenient to switch to modulus/argument form, putting $u = re^{i\theta}$ ($r > 1$), $A = se^{i\psi}$, since then

$$\gamma_k = sr^{-k}\left[e^{i(\psi - k\theta)} + e^{-i(\psi - k\theta)}\right] = 2sr^{-k}\cos(\psi - k\theta).$$

Writing it this way makes it clear that the frequency of the oscillations in the ACF are determined by θ, the
argument of u.

u = polyroot(c(1, -1.5, 0.75))[1]


f = Arg(u)/(2*pi)
print(f)

[1] 0.08333333

1/f

[1] 12

For our example, the frequency is 1/12, and hence the period of the oscillations is 12. We now have everything
we need to express our auto-covariance function explicitly. Here is a quick check using R.

sapply(0:10, function(k)
    2*abs(A[2])*abs(u)^(-k)*cos(Arg(A[2])-k*Arg(u)))

[1] 8.6150000 7.3850000 4.6162500 1.3856250 -1.3837500 -3.1148438


[7] -3.6344531 -3.1155469 -1.9474805 -0.5845605 0.5837695

Finally, note that we have been simulating data from an AR model by applying a recursive filter to white
noise. This is pedagogically instructive, but there is a function in base R, arima.sim that can simulate data
from any ARMA model directly (and from a generalisation of ARMA, known as ARIMA).

y = arima.sim(n=(n+burn), list(ar=phi), sd=sigma)[-(1:burn)]


tsplot(y, col=4)

[Figure: AR(2) series y simulated with arima.sim (n = 2000), plotted against Time.]

3.2.3.1 The auto-correlation approach

We have now completely solved for the auto-covariance function of our example process. However, when
working by hand it turns out to be significantly simpler to solve directly for the auto-correlation function of
the process using the auto-correlation form of the Yule-Walker equations. This isn’t limiting, since if we
want the auto-covariance function, we can just multiply through by the stationary variance. Let’s work this
through for our current example. Begin with the auto-correlation form of the first Yule-Walker equation:

ρ1 = ϕ1 ρ0 + ϕ2 ρ1 = ϕ1 + ϕ2 ρ1 ,

since ρ0 = 1, and so
$$\rho_1 = \frac{\phi_1}{1 - \phi_2},$$
and this is true for any AR(2) model. Substituting in for our ϕ1 and ϕ2 gives ρ1 = 6/7. But now we know ρ0
and ρ1 we have enough information to initialise the Yule-Walker equations, without having to solve a bunch
of linear equations (we have, but the system is smaller and simpler, which is generally the case). If we want
to know the stationary variance of the process, we will also need to know ρ2 , which we can calculate directly
from the 2nd Yule-Walker equation
$$\rho_2 = \phi_1\rho_1 + \phi_2\rho_0 = \frac{3}{2}\times\frac{6}{7} + \frac{-3}{4}\times 1 = \frac{15}{28}.$$

We can then compute the stationary variance,

$$\gamma_0 = \frac{\sigma^2}{1 - \phi_1\rho_1 - \phi_2\rho_2} = 8\tfrac{8}{13}.$$
We can now look for a general solution to either the auto-correlation or auto-covariance function. We will
look at the auto-correlation function. There are many ways to proceed here, but from our above analysis, we
know that in the case of an AR(2) with complex roots our solution will end up being of the form

ρk = Ar−k cos kθ + Br−k sin kθ,

where u = reiθ and A, B are real constants that can be determined by ρ0 and ρ1 . Clearly for any AR(2),
ρ0 = 1 implies A = 1. Then the equation for ρ1 ,

ρ1 = r−1 cos θ + Br−1 sin θ,

can be re-arranged for B as
$$B = \frac{\rho_1 r - \cos\theta}{\sin\theta}.$$

For our example, we have $r = 2/\sqrt{3}$, $\theta = \pi/6$ and $\rho_1 = 6/7$, so substituting in gives

$$B = \frac{\frac{6}{7}\cdot\frac{2}{\sqrt{3}} - \frac{\sqrt{3}}{2}}{\frac{1}{2}} = \frac{\sqrt{3}}{7},$$

leading to the full auto-correlation function

$$\rho_k = \left(\frac{\sqrt{3}}{2}\right)^k\cos(k\pi/6) + \frac{\sqrt{3}}{7}\left(\frac{\sqrt{3}}{2}\right)^k\sin(k\pi/6).$$

A quick check with R confirms that we haven’t messed up.

sapply(0:10, function(k)
    (sqrt(3)/2)^k*cos(k*pi/6) + (sqrt(3)/7)*(sqrt(3)/2)^k*sin(k*pi/6))

[1] 1.00000000 0.85714286 0.53571429 0.16071429 -0.16071429 -0.36160714


[7] -0.42187500 -0.36160714 -0.22600446 -0.06780134 0.06780134

3.2.4 Partial auto-covariance and correlation

The ACF is a useful characterisation of the way in which correlations decay away with increasing lag. But the
ACF of an AR model does not sharply truncate in general, since observations from the distant past do have
some (small) influence on the present. However, that influence is “carried” via the intermediate observations.
If we knew the intermediate observations, there would be no way for the observations from the distant past
to give us any more information about the present. So, for an AR model, it could be useful to understand
the correlation between observations that is “left over” after adjusting for the effect of any intermediate
observations. This is the idea behind the partial auto-correlation function (PACF).
In the context of stationary Gaussian time series, the partial auto-covariance at lag k is just the conditional
covariance
γk⋆ = Cov[Xt−k , Xt |X(t−k+1):(t−1) ].
This is well-defined, since we know from multivariate normal theory that the conditional covariance does
not actually depend on the observed value of the conditioning variable. Nevertheless, it can sometimes be
convenient to explicitly include it in the notation, in which case we could write

γk⋆ = Cov[Xt−k , Xt |X(t−k+1):(t−1) = x(t−k+1):(t−1) ].

We can then define the partial auto-correlation at lag k by

$$\rho^\star_k = \frac{\mathrm{Cov}[X_{t-k}, X_t \mid X_{(t-k+1):(t-1)}]}{\mathrm{Var}[X_t \mid X_{(t-k+1):(t-1)}]} = \frac{\gamma^\star_k}{\mathrm{Var}[X_t \mid X_{(t-k+1):(t-1)}]}.$$

Intuitively, it should be clear that for an AR(p), we will have γk⋆ = ρ⋆k = 0, ∀k > p, since then there will be
no remaining influence of the observation from lag k after conditioning on the intervening observations. We

can confirm this, since for k > p,

$$\begin{aligned}
\gamma^\star_k &= \mathrm{Cov}[X_{t-k}, X_t \mid X_{(t-k+1):(t-1)} = x_{(t-k+1):(t-1)}] \\
&= \mathrm{Cov}\left[X_{t-k},\ \sum_{i=1}^{p}\phi_i X_{t-i} + \varepsilon_t \,\Big|\, X_{(t-k+1):(t-1)} = x_{(t-k+1):(t-1)}\right] \\
&= \mathrm{Cov}\left[X_{t-k},\ \sum_{i=1}^{p}\phi_i x_{t-i} + \varepsilon_t \,\Big|\, X_{(t-k+1):(t-1)} = x_{(t-k+1):(t-1)}\right] \\
&= \mathrm{Cov}\left[X_{t-k},\ \varepsilon_t \mid X_{(t-k+1):(t-1)} = x_{(t-k+1):(t-1)}\right] \\
&= 0.
\end{aligned}$$

Another interesting case is the case k = p, since then

$$\begin{aligned}
\gamma^\star_p &= \mathrm{Cov}[X_{t-p}, X_t \mid X_{(t-p+1):(t-1)} = x_{(t-p+1):(t-1)}] \\
&= \mathrm{Cov}\left[X_{t-p},\ \sum_{i=1}^{p}\phi_i X_{t-i} + \varepsilon_t \,\Big|\, X_{(t-p+1):(t-1)} = x_{(t-p+1):(t-1)}\right] \\
&= \mathrm{Cov}\left[X_{t-p},\ \sum_{i=1}^{p-1}\phi_i x_{t-i} + \phi_p X_{t-p} + \varepsilon_t \,\Big|\, X_{(t-p+1):(t-1)} = x_{(t-p+1):(t-1)}\right] \\
&= \mathrm{Cov}\left[X_{t-p},\ \phi_p X_{t-p} \mid X_{(t-p+1):(t-1)} = x_{(t-p+1):(t-1)}\right] \\
&= \phi_p\,\mathrm{Var}\left[X_{t-p} \mid X_{(t-p+1):(t-1)} = x_{(t-p+1):(t-1)}\right],
\end{aligned}$$

and hence

$$\rho^\star_p = \phi_p.$$
So, importantly, ρ⋆p ̸= 0 for ϕp ̸= 0. It is also clear that γ1⋆ = γ1 and ρ⋆1 = ρ1 , since then there are no
intervening observations.
Other partial auto-correlations (ρ⋆2 , . . . , ρ⋆p−1 ) are less obvious, but can be computed using standard multivari-
ate normal theory:

Cov[X, Y |Z] = Cov[X, Y ] − Cov[X, Z]Var[Z]−1 Cov[Z, Y ].

However, in practice they are typically computed by a more efficient method, known as the Durbin-Levinson
Algorithm (an application of Levinson recursion), but the details of this are not important for this course.
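As a small check (my own sketch, not from the notes), the conditional covariance formula reproduces ρ⋆₂ = ϕ₂ for the AR(2) with ϕ = (1.5, -0.75), using the auto-covariances computed earlier.

## Sketch: lag-2 partial auto-correlation of the AR(2) via the MVN conditional
## covariance formula, using gamma_0, gamma_1, gamma_2 found earlier.
g = c(8.615385, 7.384615, 4.615385)            # gamma_0, gamma_1, gamma_2
G = matrix(c(g[1], g[2], g[3],
             g[2], g[1], g[2],
             g[3], g[2], g[1]), ncol=3)        # Var of (X_{t-2}, X_{t-1}, X_t)
ccov = G[1,3] - G[1,2] * (1/G[2,2]) * G[2,3]   # Cov[X_{t-2}, X_t | X_{t-1}]
cvar = G[3,3] - G[3,2] * (1/G[2,2]) * G[2,3]   # Var[X_t | X_{t-1}]
ccov/cvar                                       # equals phi_2 = -0.75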

3.2.4.1 Example: AR(2)

For the AR(2) model that we have been studying, we can use ARMAacf to compute the PACF as well as the
ACF.

ARMAacf(ar=c(1.5, -0.75), lag.max=4)

0 1 2 3 4
1.0000000 0.8571429 0.5357143 0.1607143 -0.1607143

ARMAacf(ar=c(1.5, -0.75), lag.max=4, pacf=TRUE)

[1] 8.571429e-01 -7.500000e-01 1.913000e-15 -7.357691e-16

The important things to note are that ρ⋆1 = ρ1 , ρ⋆2 = ϕ2 and ρ⋆k = 0, ∀k > 2. Crucially, the PACF truncates
at the order of the AR model.
Numerous methods exist to estimate partial auto-correlations empirically, from data. The details need not
concern us. We can look at both the ACF and PACF for a given dataset using astsa::acf2. eg. we can
apply it to our simulated dataset.

acf2(x, max.lag=10)

[Figure: sample ACF and PACF of x for lags 1 to 10.]

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
ACF 0.86 0.55 0.18 -0.14 -0.35 -0.41 -0.36 -0.23 -0.08 0.06
PACF 0.86 -0.74 -0.01 -0.01 0.00 0.02 0.01 -0.01 -0.01 0.01

3.3 MA(q)

An MA(q) (or ARMA(0, q)) model is determined by


$$X_t = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}, \tag{3.5}$$

for white noise εt ∼ N (0, σ 2 ). In other words, Xt is the result of applying a FIR convolutional linear filter
to white noise. It is immediately clear that (in contrast to the AR model) since white noise is stationary,
and the convolutional linear filter being applied is time invariant, that the process will always be stationary,
irrespective of the parameters θ , σ.
Note that many formulae relating to MA processes can be simplified by defining θ0 ≡ 1, so that

$$X_t = \sum_{j=0}^{q}\theta_j\varepsilon_{t-j}.$$

It is immediately clear via linearity of expectation that the stationary mean is E[Xt ] = 0. The stationary
variance is also very straightforwardly computed as
$$v = \gamma_0 = \mathrm{Var}[X_t] = \mathrm{Var}\left[\sum_{j=0}^{q}\theta_j\varepsilon_{t-j}\right] = \sum_{j=0}^{q}\theta_j^2\sigma^2 = \sigma^2\sum_{j=0}^{q}\theta_j^2.$$

The auto-covariance function is not much more difficult. For k > 0 we have

$$\begin{aligned}
\gamma_k &= \mathrm{Cov}[X_{t-k}, X_t] \\
&= \mathrm{Cov}\left[\sum_{i=0}^{q}\theta_i\varepsilon_{t-k-i},\ \sum_{j=0}^{q}\theta_j\varepsilon_{t-j}\right] \\
&= \sum_{i=0}^{q}\sum_{j=0}^{q}\theta_i\theta_j\,\mathrm{Cov}[\varepsilon_{t-k-i}, \varepsilon_{t-j}] \\
&= \sigma^2\sum_{i=0}^{q}\sum_{j=0}^{q}\theta_i\theta_j\,\delta_{k+i,j} \\
&= \sigma^2\sum_{j=k}^{q}\theta_{j-k}\theta_j, \quad k \le q \\
&= \sigma^2\sum_{j=0}^{q-k}\theta_j\theta_{j+k}.
\end{aligned}$$

In other words,

$$\gamma_k = \begin{cases} \displaystyle\sigma^2\sum_{j=0}^{q-k}\theta_j\theta_{j+k}, & 0 \le k \le q, \\ 0, & k > q. \end{cases}$$

Note, in particular, that $\gamma_q = \sigma^2\theta_q \ne 0$ for $\sigma, \theta_q \ne 0$. Similarly,

$$\rho_k = \begin{cases} \displaystyle\sum_{j=0}^{q-k}\theta_j\theta_{j+k}\Bigg/\sum_{j=0}^{q}\theta_j^2, & 0 \le k \le q, \\ 0, & k > q. \end{cases}$$

So, the ACF truncates after lag q.
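As a quick numerical check (my own sketch; the θ values are arbitrary), we can compare this formula with ARMAacf for an MA(2).

## Sketch: check the MA(q) ACF formula against ARMAacf for an MA(2), with theta_0 = 1.
theta = c(1, 0.5, -0.4); q = 2
rho = sapply(1:4, function(k)
  if (k > q) 0 else sum(theta[1:(q-k+1)] * theta[(1+k):(q+1)]) / sum(theta^2))
rbind(formula = rho, ARMAacf = ARMAacf(ma=theta[-1], lag.max=4)[-1])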

3.3.1 Backshift notation

It can sometimes be useful to write the MA(q) process using backshift notation

Xt = θ(B)εt ,

where
$$\theta(B) = 1 + \sum_{j=1}^{q}\theta_j B^j$$

is known as the moving average operator.

3.3.2 Special case: the MA(1)

It will be instructive to consider in detail the special case of the MA(1) process

Xt = εt + θεt−1 = (1 + θB)εt .

The ACF

$$\rho_k = \begin{cases} \dfrac{\theta}{1+\theta^2}, & k = 1, \\ 0, & k > 1, \end{cases}$$
has the interesting property that replacing θ by θ−1 leads to exactly the same ACF (try it!). So there are two
different values of θ that lead to exactly the same ACF. By adjusting the noise variance, σ 2 appropriately,
we can also ensure that the full auto-covariance function matches up as well. But a mean zero stationary
Gaussian process is entirely determined by its auto-covariance function. So there are two different MA(1)
models that define exactly the same Gaussian process. This is potentially problematic, especially when it
comes to estimating an MA model from data. For reasons to become clear, we should prefer the model with
|θ| < 1, and only one of the two models will have that property, so we can use this to rescue us from the
potential non-uniqueness problem. We will see how this generalises to general MA(q) process once we better
understand why the |θ| < 1 solution is preferred.
We know that the ACF truncates after lag one. Can we say anything about the PACF? Just as we could use
successive substitution to turn an AR(1) into an MA(∞), we can do the same to turn an MA(1) into an
AR(∞).

$$\begin{aligned}
\varepsilon_t &= X_t - \theta\varepsilon_{t-1} \\
&= X_t - \theta(X_{t-1} - \theta\varepsilon_{t-2}) \\
&= \cdots \\
&= X_t + \sum_{j=1}^{\infty}(-\theta)^j X_{t-j}.
\end{aligned}$$

This will converge to an AR(∞) representation of an MA(1) process provided that |θ| < 1. We say that the
process is invertible, in the sense that the MA process can be inverted to an AR process. It may be helpful to
write this using backshift notation as
 

$$\left(1 + \sum_{j=1}^{\infty}(-\theta)^j B^j\right)X_t = \varepsilon_t.$$

This makes perfect sense, since

Xt = (1 + θB)εt ⇒ (1 + θB)−1 Xt = εt ,

and

$$(1 + \theta B)^{-1} = 1 + \sum_{j=1}^{\infty}(-\theta)^j B^j.$$

So, since (in the invertible case) the MA(1) has a representation as an AR(∞), and we know that the PACF of an AR(p) truncates after lag p, it is clear that the PACF of the MA(1) will not truncate. You should also be
starting to see some duality between AR and MA processes.
We can use R to compute the ACF and PACF of an MA process. eg. For an MA(1) with θ = 0.8:

ARMAacf(ma=c(0.8), lag.max=6)

0 1 2 3 4 5 6
1.0000000 0.4878049 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000

ARMAacf(ma=c(0.8), lag.max=6, pacf=TRUE)

[1] 0.4878049 -0.3122560 0.2214778 -0.1651935 0.1266695 -0.0987133

We can also simulate data, either by applying a convolutional filter to noise, or by using arima.sim.

x = filter(rnorm(2000), c(1, 0.8))


x = arima.sim(n=2000, list(ma=c(0.8)), sd=1)
## either of the above approaches is fine
tsplot(x, col=4)
[Figure: simulated MA(1) series x (n = 2000, theta = 0.8) plotted against Time.]

acf2(x, max.lag=6)

[Figure: sample ACF and PACF of x for lags 1 to 6.]
[,1] [,2] [,3] [,4] [,5] [,6]
ACF 0.5 0.02 0.04 0.05 0.06 0.06
PACF 0.5 -0.30 0.25 -0.14 0.16 -0.07

3.3.3 Invertibility

For an MA(1), we saw that it would be invertible if |θ| < 1. That’s the same as saying that the root of the
moving average characteristic polynomial

θ(u) = 1 + θu

satisfies |u| > 1. This isn’t a coincidence.


For an MA(q), the characteristic polynomial θ(u) is of degree q. So, by the fundamental theorem of algebra,
we can factor this polynomial in the form
$$\theta(u) = \theta_q\prod_{j=1}^{q}(u - u_j),$$

where u1 , . . . , uq are the roots of θ(u). Consequently, we can write


q
θ(u)−1 = θq−1 (u − uj )−1 ,
Y

j=1

so it is reasonably clear that the MA(q) will be invertible if every term of the form (u−uj )−1 has a convergent
series expansion. But
$$(u - u_j)^{-1} = -u_j^{-1}\left(1 - \frac{u}{u_j}\right)^{-1} = -u_j^{-1}\left(1 + \frac{u}{u_j} + \frac{u^2}{u_j^2} + \cdots\right)$$

will converge for |uj | > 1. Consequently, θ(u)−1 will have a convergent expansion when all roots of
θ(u) are outside of the unit circle. This is the condition we will require for the very desirable property of
invertibility.
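As a quick sketch (not from the notes; the θ values are arbitrary), this condition is easy to check numerically.

## Sketch: check invertibility of an MA(2) via the roots of theta(u) = 1 + theta_1 u + theta_2 u^2.
theta = c(0.5, -0.4)
Mod(polyroot(c(1, theta)))   # all > 1, so this MA(2) is invertible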

3.4 ARMA(p,q)

We have seen that a AR(p) process has the property that its PACF will truncate after lag p, but that its ACF
does not sharply truncate, but rather decays away geometrically. We have also seen that the MA(q) process
has the property that its ACF will truncate after lag q, but that its PACF does not (again, it decays away
geometrically). We require the roots of the auto-regressive operator to lie outside of the unit circle in order to
ensure stationarity. In fact, this also ensures that the AR process is invertible to an MA process. We also
know that although an MA process is always stationary, we nevertheless restrict attention to the case where
the roots of the moving average operator lie outside of the unit circle in order to ensure invertibility (and
uniqueness). We now turn our attention to the general ARMA process, with both AR and MA components.
We consider the model
$$X_t = \sum_{i=1}^{p}\phi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j},$$

for p, q > 0. We could write this with backshift notation as

ϕ(B)Xt = θ(B)εt ,

for the appropriately defined auto-regressive and moving-average operators ϕ(B) and θ(B), respectively. We
assume that the roots of both ϕ(u) and θ(u) lie outside of the unit circle, so that the process is stationary and
invertible. This means that we can express the ARMA(p, q) model either as the MA(∞) process

$$X_t = \frac{\theta(B)}{\phi(B)}\varepsilon_t,$$

or as the AR(∞) process

$$X_t = \frac{\phi(B)}{\theta(B)}\varepsilon_t.$$
It is then clear that for p, q > 0, neither the ACF nor the PACF will sharply truncate.

3.4.1 Parameter redundancy

There is an additional issue that arises for mixed ARMA processes that did not arise in the pure AR or MA
case. Start with a pure white noise process
Xt = εt
and apply (1 − λB) to both sides (for some fixed |λ| < 1) to get

Xt = λXt−1 + εt − λεt−1

So this now looks like an ARMA(1,1) process. But it is actually the same white noise process that we started
off with, as we would find if we tried to compute its ACF, etc. But this is a problem, because there are clearly
infinitely many apparently different ways to represent exactly the same stationary Gaussian process. So
we are back to having a non-uniqueness/parameter redundancy problem. Again, this will be problematic,
especially when we come to think about inferring ARMA models from data. This issue arises whenever ϕ(u)
and θ(u) contain an identical root. So, in addition to assuming that the roots of ϕ(u) and θ(u) all lie outside
of the unit circle, we further restrict attention to the case where ϕ(u) and θ(u) have no roots in common. If
we are presented with an ARMA(p, q) model where ϕ(u) and θ(u) do have a root in common, we can just
divide out the corresponding linear factor from both in order to obtain an ARMA(p − 1, q − 1) model that is
equivalent (and more parsimonious).
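A quick numerical illustration (my own sketch): an ARMA(1,1) whose operators share a root behaves exactly like white noise.

## ARMA(1,1) with phi = lambda and theta = -lambda is just white noise.
ARMAacf(ar=0.5, ma=-0.5, lag.max=4)   # all auto-correlations beyond lag 0 are zero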

3.4.2 Example: ARMA(1,1)

Consider the ARMA(1,1) process


Xt = ϕXt−1 + εt + θεt−1 ,
for |ϕ|, |θ| < 1, εt ∼ N (0, σ 2 ). For k ≥ 0 we can multiply through by Xt−k and take expectations to get

γk = ϕγk−1 + Cov[Xt−k , εt ] + θCov[Xt−k , εt−1 ].

To proceed further it will be useful to know that

Cov[Xt , εt ] = Cov[ϕXt−1 + εt + θεt−1 , εt ] = σ 2

and

Cov[Xt , εt−1 ] = Cov[ϕXt−1 + εt + θεt−1 , εt−1 ] = ϕCov[Xt−1 , εt−1 ] + θσ 2 = (ϕ + θ)σ 2 .

So now, for k = 0 we have


γ0 = ϕγ1 + (1 + θϕ + θ2 )σ 2
and for k = 1 we have
γ1 = ϕγ0 + θσ 2 .

Solving these gives

$$v = \gamma_0 = \frac{1 + 2\theta\phi + \theta^2}{1 - \phi^2}\sigma^2$$

and

$$\gamma_1 = \frac{(1 + \theta\phi)(\phi + \theta)}{1 - \phi^2}\sigma^2.$$

For k > 1 we have

$$\gamma_k = \phi\gamma_{k-1},$$

and hence

$$\gamma_k = \begin{cases} \dfrac{1 + 2\theta\phi + \theta^2}{1 - \phi^2}\sigma^2, & k = 0, \\[1.5ex] \dfrac{(1 + \theta\phi)(\phi + \theta)}{1 - \phi^2}\phi^{k-1}\sigma^2, & k > 0. \end{cases}$$

We can also write down the auto-correlation function

$$\rho_k = \frac{(1 + \theta\phi)(\phi + \theta)}{1 + 2\theta\phi + \theta^2}\phi^{k-1}, \quad k > 0.$$
Note that the ACF degenerates to that of a white noise process when θ = −ϕ. This corresponds to the
common root redundancy problem discussed above. Note also that the same would happen if ϕθ = −1, but
that our invertibility criterion rules out that possibility.
For concreteness, let’s now study the particular case ϕ = 0.8, θ = 0.6 using R. We can get R to explicitly
compute the ACF and PACF of this model.

ARMAacf(ar=c(0.8), ma=c(0.6), lag.max=6)

0 1 2 3 4 5 6
1.0000000 0.8931034 0.7144828 0.5715862 0.4572690 0.3658152 0.2926521

ARMAacf(ar=c(0.8), ma=c(0.6), lag.max=6, pacf=TRUE)

[1] 0.89310345 -0.41089371 0.22744126 -0.13276292 0.07888737 -0.04716820
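We can also check the closed-form ACF derived above against this output (a quick sketch of my own).

## Closed-form ARMA(1,1) ACF with phi = 0.8, theta = 0.6.
phi = 0.8; theta = 0.6
sapply(1:6, function(k)
  (1 + theta*phi)*(phi + theta)/(1 + 2*theta*phi + theta^2) * phi^(k-1))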

We can also simulate data from this model and look at the empirical statistics.

x = signal::filter(c(1, 0.6), c(1, -0.8), rnorm(2000))


x = arima.sim(n=2000, list(ar=c(0.8), ma=c(0.6)), sd=1)
## either of the above approaches is fine
tsplot(x, col=4)

[Figure: simulated ARMA(1,1) series x (n = 2000, phi = 0.8, theta = 0.6) plotted against Time.]

acf2(x, max.lag=6)

[Figure: sample ACF and PACF of x for lags 1 to 6.]

[,1] [,2] [,3] [,4] [,5] [,6]


ACF 0.88 0.68 0.52 0.39 0.29 0.21
PACF 0.88 -0.41 0.21 -0.14 0.07 -0.07

4 Estimation and forecasting

In this chapter we consider the problem of fitting time series models to observed time series data, and using
the fitted models to forecast the unobserved future observations of the time series.

4.1 Fitting ARMA models to data

Since we mainly know about ARMA models, we concentrate here on the problem of fitting ARMA models
to time series data. We assume that the time series has been detrended and has zero mean, so that fitting a
mean zero stationary model to the data makes sense. We will also assume that the order of the process (the
values of p and q) are known. This is unlikely to be true in practice, but the idea is that appropriate values can
be identified by looking at ACF and PACF plots for the (detrended) time series.

4.1.1 Moment matching

Since we have seen how to compute the ACF of ARMA models, and we know how to compute the empirical
ACF of an observed time series, it is natural to consider estimating the parameters of an ARMA model by
finding parameters that match up well with the empirical ACF. Since the ACF encodes the second moments
of the time series, this is moment matching, or the method of moments in the context of stationary time series
analysis. This approach works best in the context of AR(p) models, so we focus mainly on that case.

4.1.1.1 AR(p)

Recall that the Yule-Walker equations from Section 3.2.1 could be written in either auto-covariance or auto-correlation form. We start with the auto-covariance form. If we define $\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_p)^\top$, $\boldsymbol{\phi} = (\phi_1, \ldots, \phi_p)^\top$ and let $\Gamma$ be the $p \times p$ symmetric matrix

$$\Gamma = \{\gamma_{i-j} \mid i, j = 1, \ldots, p\},$$

then the Yule-Walker equations can be written in matrix form as

$$\Gamma\boldsymbol{\phi} = \boldsymbol{\gamma}, \qquad \sigma^2 = \gamma_0 - \boldsymbol{\phi}^\top\boldsymbol{\gamma}.$$

Then given an empirical estimate of the auto-covariance function computed from time series data, we can construct estimates $\hat{\gamma}_0, \hat{\boldsymbol{\gamma}}$, and hence $\hat{\Gamma}$, and use them to estimate the parameters of the AR(p) model as

$$\hat{\boldsymbol{\phi}} = \hat{\Gamma}^{-1}\hat{\boldsymbol{\gamma}}, \qquad \hat{\sigma}^2 = \hat{\gamma}_0 - \hat{\boldsymbol{\gamma}}^\top\hat{\Gamma}^{-1}\hat{\boldsymbol{\gamma}}.$$

These estimates are known as the Yule-Walker estimators of the AR(p) model parameters.

If we prefer, we can re-write all of the above in auto-correlation form. Define $\boldsymbol{\rho} = (\rho_1, \ldots, \rho_p)^\top$ and let $P$ be the $p \times p$ symmetric matrix

$$P = \{\rho_{i-j} \mid i, j = 1, \ldots, p\}.$$

Then the Yule-Walker equations can be written

$$P\boldsymbol{\phi} = \boldsymbol{\rho}, \qquad \sigma^2 = \gamma_0(1 - \boldsymbol{\phi}^\top\boldsymbol{\rho}).$$

So, given empirical ACF estimates $\hat{\boldsymbol{\rho}}, \hat{P}$, along with an estimate of the stationary variance, $\hat{v} = \hat{\gamma}_0$, we can compute parameter estimates

$$\hat{\boldsymbol{\phi}} = \hat{P}^{-1}\hat{\boldsymbol{\rho}}, \qquad \hat{\sigma}^2 = \hat{\gamma}_0(1 - \hat{\boldsymbol{\rho}}^\top\hat{P}^{-1}\hat{\boldsymbol{\rho}}).$$

4.1.1.1.1 Example: AR(2)
Let’s look at some of the sunspot data built in to R (?sunspot.year).

library(astsa)
tsplot(sunspot.year, col=4, main="Annual sunspot data")

[Figure: "Annual sunspot data", the sunspot.year series plotted against Time.]

Since it is count data, it is a bit right-skewed, so the square root is a bit more Gaussian.

tsplot(sqrt(sunspot.year), col=4)
[Figure: sqrt(sunspot.year) plotted against Time.]

and we should detrend before fitting an ARMA model.

x = detrend(sqrt(sunspot.year))
tsplot(x, col=4)

[Figure: the detrended series x plotted against Time, oscillating about zero.]

We can see some periodicity to this data, further revealed by looking at the ACF

acf2(x)

[Figure: sample ACF and PACF of the detrended series x, lags 0 to 27.]

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
ACF 0.82 0.44 0.03 -0.29 -0.47 -0.45 -0.25 0.06 0.39 0.61 0.64 0.49 0.22
PACF 0.82 -0.67 -0.16 -0.01 -0.08 0.19 0.18 0.18 0.26 0.00 0.00 0.01 -0.06
[,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
ACF -0.08 -0.30 -0.41 -0.38 -0.23 -0.01 0.21 0.37 0.40 0.29 0.08 -0.15

PACF 0.11 -0.06 -0.07 -0.08 -0.10 0.02 0.00 0.05 -0.06 -0.10 -0.06 -0.02
[,26] [,27]
ACF -0.33 -0.41
PACF -0.04 0.06

The ACF suggests a period of between 10 and 11 years. The PACF suggests that something like an AR(2)
may be appropriate for this data, so that is what we will fit.

rho = acf(x, plot=FALSE, lag.max=2)$acf[2:3]


P = matrix(c(1, rho[1], rho[1], 1), ncol=2)
phi = solve(P, rho)
phi

[1] 1.3602493 -0.6671228

These coefficients correspond to complex roots of the characteristic polynomial, as we might expect given
the oscillatory nature of the data. We can confirm the suggested period of the oscillations.

u = polyroot(c(1, -phi))[1]
f = Arg(u)/(2*pi)
1/f

[1] 10.7068

So the period of the oscillations is a little under 11 years.

4.1.1.2 MA and ARMA models

We can imagine doing a similar moment matching approach for an MA(q) process (or a more general ARMA
model), but it doesn’t work out so nicely, due to the ACF being non-linear in the parameters. We could
investigate this in more detail, but moment matching isn’t such a great parameter estimation strategy in
general, so let’s just move on.

4.1.2 Least squares

As an alternative to moment matching, we can approach parameter estimation as a least squares problem,
where we try to find the parameters that minimise the sum of squares of the noise terms, εt . Again this is
simpler in the case of an AR(p) model, so we begin with that.

4.1.2.1 AR(p)

The model

$$x_t = \sum_{k=1}^{p}\phi_k x_{t-k} + \varepsilon_t, \quad t = 1, 2, \ldots, n$$

is actually in the form of a multiple linear regression model. We can therefore treat it as a linear least squares problem, choosing parameters $\boldsymbol{\phi}$ to minimise $L(\boldsymbol{\phi}) = \sum_{t=1}^{n}\varepsilon_t^2$. We can simplify the problem by conditioning on the first p observations and ignoring their contribution to the loss

function. This gives us a so-called conditional least squares problem. We can write the model in matrix form as

$$\begin{pmatrix} x_{p+1} \\ x_{p+2} \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} x_p & x_{p-1} & \cdots & x_1 \\ x_{p+1} & x_p & \cdots & x_2 \\ \vdots & \vdots & \ddots & \vdots \\ x_{n-1} & x_{n-2} & \cdots & x_{n-p} \end{pmatrix}\begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{pmatrix} + \begin{pmatrix} \varepsilon_{p+1} \\ \varepsilon_{p+2} \\ \vdots \\ \varepsilon_n \end{pmatrix},$$

or

$$\mathbf{x} = \mathbf{X}\boldsymbol{\phi} + \boldsymbol{\varepsilon},$$

for appropriately defined (n - p)-vectors $\mathbf{x}$ and $\boldsymbol{\varepsilon}$ and (n - p) × p matrix, $\mathbf{X}$. We then know from basic least squares theory that

$$L(\boldsymbol{\phi}) = \sum_{t=p+1}^{n}\varepsilon_t^2 = \|\boldsymbol{\varepsilon}\|^2 = \boldsymbol{\varepsilon}^\top\boldsymbol{\varepsilon}$$

is minimised by choosing $\boldsymbol{\phi}$ to be the solution of the normal equations,

$$\mathbf{X}^\top\mathbf{X}\boldsymbol{\phi} = \mathbf{X}^\top\mathbf{x},$$

in other words

$$\hat{\boldsymbol{\phi}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{x}.$$

We can then estimate σ² as the empirical variance of the residuals.

4.1.2.1.1 Example: AR(2)


Let us now fit the sunspot data example again, but this time using least squares.

n = length(x)
X = cbind(x[2:(n-1)], x[1:(n-2)])
xp = x[3:n]
lm(xp ~ 0 + X)

Call:
lm(formula = xp ~ 0 + X)

Coefficients:
X1 X2
1.4032 -0.7086

We use R’s lm function to solve the least squares problem. Note that the coefficients are similar, but not
identical, to those obtained earlier.

4.1.2.1.2 Connection between moment matching and least squares


It is clear from the definition of the matrix $\mathbf{X}$ that its associated sample covariance matrix, $\mathbf{X}^\top\mathbf{X}/(n-p)$, will be (unbiased and) consistent for $\Gamma$ (recall that there are n - p rows in $\mathbf{X}$). Then by considering a single column of the covariance matrix, similar reasoning reveals that $\mathbf{X}^\top\mathbf{x}/(n-p)$ will be consistent for $\boldsymbol{\gamma}$. So if we wanted, we could use the finite sample estimates:

$$\hat{\Gamma} = \frac{\mathbf{X}^\top\mathbf{X}}{n-p}, \qquad \hat{\boldsymbol{\gamma}} = \frac{\mathbf{X}^\top\mathbf{x}}{n-p}.$$

But if we do this, we find that our moment matching estimator,

$$\hat{\boldsymbol{\phi}} = \hat{\Gamma}^{-1}\hat{\boldsymbol{\gamma}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{x},$$

is just the least squares estimator. So the two approaches are asymptotically equivalent. This is the real reason why moment matching actually works quite well for AR models.

4.1.2.2 ARMA(p,q)

The approach for a general ARMA model is similar to that for an AR model, but with an extra twist. Again,
the problem simplifies if we ignore the contribution associated with the first p observations. But now we
also have an issue with initialising the errors. Again, we can simplify by conditioning on the first q required
errors. But here we don’t actually know what to condition on, so we just condition on them all being zero.
This conditional least squares approach is again approximate, but has negligible impact when n is large. We
can then use the least squares loss function
$$L(\boldsymbol{\phi}, \boldsymbol{\theta}) = \sum_{t=p+1}^{n}\varepsilon_t^2 \quad\text{where}\quad \varepsilon_t = x_t - \sum_{j=1}^{p}\phi_j x_{t-j} - \sum_{k=1}^{q}\theta_k\varepsilon_{t-k},$$

and we assume εt = 0, ∀t ≤ p. For q > 0 this is a nonlinear least squares problem, so we need to minimise
it numerically. There are efficient ways to do this, but the details are not especially relevant to this course.

4.1.2.2.1 Example: ARMA(2, 1)


Just for fun, let’s fit an ARMA(2, 1) to the sunspot data.

loss = function(param) {
phi = param[1:2]; theta = param[3]
eps = signal::filter(c(1, -phi), c(1, theta),
x[3:length(x)], init.x=x[1:2], init.y=c(0))
sum(eps*eps)
}

optim(rep(0.1,3), loss)$par

[1] 1.4840176 -0.7749327 -0.1623121

The first two elements of the parameter vector correspond to ϕ , and the last corresponds to θ. In practice, we
will typically use the built-in function arima to fit ARMA models to data.

arima(x, c(2, 0, 1), include.mean=FALSE)

Call:
arima(x = x, order = c(2, 0, 1), include.mean = FALSE)

Coefficients:
ar1 ar2 ma1
1.4828 -0.7733 -0.1631
s.e. 0.0516 0.0465 0.0785

sigma^2 estimated as 1.331: log likelihood = -452.69, aic = 913.39

The arima function fits using the method of maximum likelihood, considered next.

4.1.3 Maximum likelihood

Neither moment matching nor least squares are particularly principled estimation methods, so we next turn
our attention to the method of maximum likelihood. As usual, the AR case is simpler, so we begin with
that.

4.1.3.1 AR(p)

We can always write the likelihood function of any model as a recursive factorisation of the joint distribution
of the data. So, for an AR model we could write

$$L(\boldsymbol{\phi}, \sigma; \mathbf{x}) = f(\mathbf{x}) = f(x_1)f(x_2|x_1)\cdots f(x_n|x_1, \ldots, x_{n-1}).$$

In other words,

$$L(\boldsymbol{\phi}, \sigma; \mathbf{x}) = f(x_1)\prod_{t=2}^{n} f(x_t|x_1, \ldots, x_{t-1}). \tag{4.1}$$

This holds for any model, but for an AR(p), Xt depends only on the previous p values, so we can write this
as
$$L(\boldsymbol{\phi}, \sigma; \mathbf{x}) = f(x_1, \ldots, x_p)\prod_{t=p+1}^{n} f(x_t|x_{t-1}, \ldots, x_{t-p}),$$

where f(x1, . . . , xp) is the joint (MVN) distribution of p consecutive observations, and

$$f(x_t|x_{t-1}, \ldots, x_{t-p}) = N\!\left(x_t;\ \sum_{k=1}^{p}\phi_k x_{t-k},\ \sigma^2\right),$$

the normal density function. We could directly maximise this likelihood using iterative numerical meth-
ods. However, the likelihood obviously simplifies significantly if we ignore the contribution of the first p
observations, leading to the conditional log-likelihood
$$\begin{aligned}
\ell(\boldsymbol{\phi}, \sigma; \mathbf{x}) &= -(n-p)\log\sigma - \frac{1}{2\sigma^2}\sum_{t=p+1}^{n}\left(x_t - \sum_{k=1}^{p}\phi_k x_{t-k}\right)^2 \\
&= -(n-p)\log\sigma - \frac{1}{2\sigma^2}\sum_{t=p+1}^{n}\varepsilon_t^2.
\end{aligned}$$

We can immediately see that as a function of ϕ , maximising this leads to exactly the least squares problem
that we have already considered, and so our MLE can be obtained by solving the normal equations. Again,
the effect of conditioning on the first p observations will be negligible for long time series.
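As a small sketch (my own, not from the notes), arima can fit by conditional sum-of-squares (method="CSS"), which for an AR(2) essentially reproduces the least squares fit obtained with lm earlier.

arima(x, c(2, 0, 0), include.mean=FALSE, method="CSS")$coef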

4.1.3.2 ARMA(p,q)

For a general ARMA model, we can start from the general factorisation Equation 4.1, and note again that the
problem would simplify enormously if we not only conditioned on the first p observations, but also on the
first q required errors. In this case there is a deterministic relationship between Xt and εt , and the required
likelihood terms are just the densities of the new error εt at each time. So again, in the conditional case, we
get the simplified log-likelihood
$$\ell(\boldsymbol{\phi},\boldsymbol{\theta},\sigma;\mathbf{x}) = -(n-p)\log\sigma - \frac{1}{2\sigma^2}\sum_{t=p+1}^{n}\varepsilon_t^2,$$

where now $\varepsilon_t = x_t - \sum_{j=1}^{p}\phi_j x_{t-j} - \sum_{k=1}^{q}\theta_k\varepsilon_{t-k}$. This is a nonlinear least squares problem that we need to solve with iterative numerical methods, as we have already seen.
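Note that maximising this log-likelihood over σ gives σ̂² = (1/(n − p)) Σ ε̂t². As a sketch (re-using the loss function and data from the ARMA(2, 1) example above, where p = 2; the variable name is made up):

pHat = optim(rep(0.1, 3), loss)$par  # conditional least squares estimates of phi and theta
loss(pHat) / (length(x) - 2)         # sigma-hat^2; should be close to the sigma^2 reported by arima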

4.1.4 Bayesian inference

Bayesian inference arguably provides a more principled approach to parameter estimation than any we have
yet considered. However, the computational details are somewhat tangential to the main focus of this course.
The key ingredients are the likelihood functions that we have already considered, and a prior on the model
parameters. In general we can use sampling methods such as MCMC to explore the resulting posterior
distribution. Note that a Gaussian prior for ϕ is (conditionally) conjugate for ARMA models, so significant
simplifications arise, especially in the AR case, or in conjunction with (block) Gibbs sampling approaches.
The precise details are non-examinable in the context of this course.

4.2 Forecasting an ARMA model

4.2.1 Forecasting

Forecasting is the problem of making predictions about possible future values that a time series might take.
Suppose we have time series observations x1 , x2 , . . . , xn , and that we are currently at time n, and wish to
make predictions about future, currently unobserved, values of the time series, Xn+1 , Xn+2 , . . .. We will
use the notation x̂n(k) for the forecast made at time n for the observation Xn+k (k time points into the future). At time n, x̂n(k) will be deterministic (otherwise we could just choose x̂n(k) = Xn+k!), but will be informed by the observed time series up to time n.
How should we choose x̂n (k)? We know that we are unlikely to be able to predict Xn+k exactly, but we
would like to be as close as possible. So we would like to try and minimise some sort of penalty for how
wrong we are. So, suppose that at time n we declare x̂n(k) = x, and that at time n + k we see the observed value of Xn+k = xn+k and are subject to the penalty

p(x) = c(xn+k − x)2 ,

for some constant c in units of “penalty” (eg. British pounds). If our prediction is perfect, we will not be
penalised, and the penalty increases depending on how wrong we are. At time n, we don’t know what penalty
we will receive, so it is random,
P (x) = c(Xn+k − x)2 .
We might therefore like to minimise the penalty that we expect to receive, given all of the information we
have at time n. That is, we want to minimise the loss function

$$\begin{aligned} L(x) &= \mathrm{E}[P(x)\,|\,X_{1:n}=x_{1:n}] \\ &= \mathrm{E}[c(X_{n+k}-x)^2\,|\,X_{1:n}=x_{1:n}] \\ &= c\,\mathrm{E}[X_{n+k}^2\,|\,X_{1:n}=x_{1:n}] - 2cx\,\mathrm{E}[X_{n+k}\,|\,X_{1:n}=x_{1:n}] + cx^2. \end{aligned}$$

This is quadratic in x, so we can minimise wrt x, either by completing the square, or by differentiating and
equating to zero to get
x̂n (k) = E[Xn+k |X1:n = x1:n ].
So, somewhat unsurprisingly, our forecast for Xn+k made at time n should be just its conditional expectation
given the observations up to time n. Similarly, our uncertainty regarding the future observation can be well
summarised by the corresponding conditional variance

Var[Xn+k |X1:n = x1:n ],

noting that c times this is the expected penalty we will receive. For stationary Gaussian processes such
as ARMA models, it is straightforward, in principle, to compute these conditional distributions using
standard normal theory. However, for large n this involves large matrix computations. So typically forecast
distributions are computed sequentially, exploiting the causal structure of the process.

4.2.2 Forecasting an AR(p) model

As usual, the AR(p) case is slightly simpler than that of a general ARMA model, so we start with this. We
assume that we know x1:n , and want to sequentially compute x̂n (1), x̂n (2), . . .. First,

$$\begin{aligned} \hat{x}_n(1) &= \mathrm{E}[X_{n+1}\,|\,X_{1:n}=x_{1:n}] \\ &= \mathrm{E}\biggl[\,\sum_{j=1}^{p}\phi_j X_{n+1-j} + \varepsilon_{n+1}\,\bigg|\,X_{1:n}=x_{1:n}\biggr] \\ &= \mathrm{E}\biggl[\,\sum_{j=1}^{p}\phi_j x_{n+1-j}\,\bigg|\,X_{1:n}=x_{1:n}\biggr] \\ &= \sum_{j=1}^{p}\phi_j x_{n+1-j}. \end{aligned}$$

Similarly,

$$\begin{aligned} \hat{x}_n(2) &= \mathrm{E}[X_{n+2}\,|\,X_{1:n}=x_{1:n}] \\ &= \mathrm{E}\biggl[\,\sum_{j=1}^{p}\phi_j X_{n+2-j} + \varepsilon_{n+2}\,\bigg|\,X_{1:n}=x_{1:n}\biggr] \\ &= \mathrm{E}\biggl[\,\phi_1 X_{n+1} + \sum_{j=2}^{p}\phi_j x_{n+2-j}\,\bigg|\,X_{1:n}=x_{1:n}\biggr] \\ &= \phi_1\,\mathrm{E}[X_{n+1}\,|\,X_{1:n}=x_{1:n}] + \sum_{j=2}^{p}\phi_j x_{n+2-j} \\ &= \phi_1\,\hat{x}_n(1) + \sum_{j=2}^{p}\phi_j x_{n+2-j}. \end{aligned}$$

By now it should be clear that we will have


$$\hat{x}_n(3) = \sum_{j=1}^{2}\phi_j\hat{x}_n(3-j) + \sum_{j=3}^{p}\phi_j x_{n+3-j}.$$

The notation will be greatly simplified if we drop the distinction between forecasts and observations by
defining x̂n (−t) = xn−t , for t = 0, 1, . . . , p. Then we have
$$\hat{x}_n(3) = \sum_{j=1}^{p}\phi_j\hat{x}_n(3-j),$$

and more generally,


$$\hat{x}_n(k) = \sum_{j=1}^{p}\phi_j\hat{x}_n(k-j),\qquad k=1,2,\ldots$$

That is, the forecasts satisfy the obvious pth order linear recurrence relation, initialised with the last p
observed values of the time series.

4.2.2.1 Example: AR(2)

Let’s compute the forecast function for an AR(2) model fit to the sunspot data.

n = length(x); k=50
phi = arima(x, c(2,0,0))$coef[1:2]
fore = filter(rep(0,k), phi, "rec", init=x[n:(n-1)])
tsplot(x, xlim=c(tsp(x)[1], tsp(x)[2]+k), col=4)
lines(seq(tsp(x)[2]+1, tsp(x)[2]+k, frequency(x)), fore,
col=2, lwd=2)
[Figure: sunspot series (blue) with the AR(2) forecast function (red) appended for the next 50 years.]

So, the short term forecasts oscillate in line with the recent observations, but the longer term forecasts quickly
decay away to the stationary mean of zero as we become increasingly uncertain about the phase of the
signal.

4.2.2.2 Forecast variance

Let us now turn attention to the forecast variances. Again, let’s compute them sequentially, starting with the
one-step ahead variance.
 
$$\begin{aligned} \mathrm{Var}[X_{n+1}\,|\,X_{1:n}=x_{1:n}] &= \mathrm{Var}\biggl[\,\sum_{j=1}^{p}\phi_j X_{n+1-j} + \varepsilon_{n+1}\,\bigg|\,X_{1:n}=x_{1:n}\biggr] \\ &= \mathrm{Var}\biggl[\,\sum_{j=1}^{p}\phi_j x_{n+1-j} + \varepsilon_{n+1}\,\bigg|\,X_{1:n}=x_{1:n}\biggr] \\ &= \mathrm{Var}[\varepsilon_{n+1}\,|\,X_{1:n}=x_{1:n}] \\ &= \sigma^2. \end{aligned}$$

Next, let’s consider the 2-step ahead forecast variance,
 
$$\begin{aligned} \mathrm{Var}[X_{n+2}\,|\,X_{1:n}=x_{1:n}] &= \mathrm{Var}\biggl[\,\sum_{j=1}^{p}\phi_j X_{n+2-j} + \varepsilon_{n+2}\,\bigg|\,X_{1:n}=x_{1:n}\biggr] \\ &= \mathrm{Var}\biggl[\,\phi_1 X_{n+1} + \sum_{j=2}^{p}\phi_j x_{n+2-j} + \varepsilon_{n+2}\,\bigg|\,X_{1:n}=x_{1:n}\biggr] \\ &= \phi_1^2\,\mathrm{Var}[X_{n+1}\,|\,X_{1:n}=x_{1:n}] + \sigma^2 \\ &= (1+\phi_1^2)\sigma^2. \end{aligned}$$

Beyond this, things start to get a bit more complicated, due to the covariances between observations.
Nevertheless, we can continue to slog out the variances using successive substitution.
 
$$\begin{aligned} \mathrm{Var}[X_{n+3}\,|\,X_{1:n}=x_{1:n}] &= \mathrm{Var}\biggl[\,\sum_{j=1}^{p}\phi_j X_{n+3-j} + \varepsilon_{n+3}\,\bigg|\,X_{1:n}=x_{1:n}\biggr] \\ &= \mathrm{Var}\biggl[\,\phi_1 X_{n+2} + \phi_2 X_{n+1} + \sum_{j=3}^{p}\phi_j x_{n+3-j} + \varepsilon_{n+3}\,\bigg|\,X_{1:n}=x_{1:n}\biggr] \\ &= \mathrm{Var}[\phi_1 X_{n+2} + \phi_2 X_{n+1}\,|\,X_{1:n}=x_{1:n}] + \sigma^2 \\ &= \mathrm{Var}\biggl[\,\phi_1\Bigl(\phi_1 X_{n+1} + \sum_{j=2}^{p}\phi_j x_{n+2-j} + \varepsilon_{n+2}\Bigr) + \phi_2 X_{n+1}\,\bigg|\,X_{1:n}=x_{1:n}\biggr] + \sigma^2 \\ &= \mathrm{Var}\bigl[(\phi_1^2+\phi_2)X_{n+1} + \phi_1\varepsilon_{n+2}\,\big|\,X_{1:n}=x_{1:n}\bigr] + \sigma^2 \\ &= (\phi_1^2+\phi_2)^2\,\mathrm{Var}[X_{n+1}\,|\,X_{1:n}=x_{1:n}] + (1+\phi_1^2)\sigma^2 \\ &= \bigl(1 + \phi_1^2 + [\phi_1^2+\phi_2]^2\bigr)\sigma^2. \end{aligned}$$

It is possible to derive recursions that give the forecast variances, but it is probably not worth it. Note that the
case of the AR(1) was dealt with in Chapter 2, where explicit forecast distributions were derived. The initial
condition there becomes the final observation, xn , here. Similarly, we can write an AR(p) as a VAR(1), and
use the explicit expressions for the forecast variance for the VAR(1) from Chapter 2 to deduce the forecast
variance for the AR(p).
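As a sketch of the VAR(1) approach for our AR(2) example (the helper name is made up), the k-step forecast standard deviations can be accumulated via the recursion Ck = G Ck−1 G⊤ + W, starting from C0 = 0, and read off from the (1,1) element:

ar2ForecastSd = function(phi, sigma2, k) {
  G = matrix(c(phi[1], phi[2], 1, 0), ncol=2, byrow=TRUE)  # companion (VAR(1)) matrix
  W = matrix(c(sigma2, 0, 0, 0), ncol=2)                   # innovation covariance
  C = matrix(0, 2, 2); out = numeric(k)                    # C_0 = 0, since x_n is known
  for (i in 1:k) {
    C = G %*% C %*% t(G) + W
    out[i] = sqrt(C[1, 1])
  }
  out
}

fit = arima(x, c(2, 0, 0))
ar2ForecastSd(fit$coef[1:2], fit$sigma2, 5)  # compare with predict(fit, n.ahead=5)$se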
In practice, we just use the arima function to compute forecasts and associated uncertainties.

mod = arima(x, c(2,0,0))


fore = predict(mod, n.ahead=k)
pred = fore$pred
sds = fore$se
tsplot(x, xlim=c(tsp(x)[1], tsp(x)[2]+k),
ylim=c(-10, 10), col=4)
ftimes = seq(tsp(x)[2]+1, tsp(x)[2]+k, tsp(x)[3])
lines(ftimes, pred, col=2, lwd=2)
lines(ftimes, pred+2*sds, col=2)
lines(ftimes, pred-2*sds, col=2)

[Figure: sunspot series with AR(2) mean forecasts (red) and approximate 95% prediction intervals (pred ± 2 sds).]

4.2.3 Forecasting an ARMA(p,q)

As usual, the case of an ARMA model is similar to that of an AR model, but a little bit more complicated due
to the extra error terms. Recall from our discussion of conditional least squares and conditional likelihood
approaches to model fitting, that for long time series, there is a deterministic relationship between xt and εt .
Just as we can simulate an ARMA model by applying an ARMA filter to white noise, we can also recover
the error process by applying an ARMA filter to the time series. In the context of forecasting, this means
that if we know the observed values of the time series up to time n, we also know the observed values of the
errors up to time n. We will denote these ε̃1 , ε̃2 , . . . , ε̃n in order to emphasise that these are observed and not
random. Given this, we can just proceed as before. It is clear that the one-step ahead predictive variance for
any ARMA model will be σ 2 , but variance forecasts further ahead get quite cumbersome quite quickly. The
(mean) forecast function, x̂n (k), is more straightforward.
For the ARMA model
$$X_t = \sum_{j=1}^{p}\phi_j X_{t-j} + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j} + \varepsilon_t,$$

we have

$$\begin{aligned} \hat{x}_n(k) &= \mathrm{E}[X_{n+k}\,|\,X_{1:n}=x_{1:n},\,\varepsilon_{1:n}=\tilde\varepsilon_{1:n}] \\ &= \mathrm{E}\biggl[\,\sum_{j=1}^{p}\phi_j X_{n+k-j} + \sum_{j=1}^{q}\theta_j\varepsilon_{n+k-j} + \varepsilon_{n+k}\,\bigg|\,X_{1:n}=x_{1:n},\,\varepsilon_{1:n}=\tilde\varepsilon_{1:n}\biggr] \\ &= \sum_{j=1}^{p}\phi_j\,\mathrm{E}[X_{n+k-j}\,|\,X_{1:n}=x_{1:n}] + \sum_{j=1}^{q}\theta_j\,\mathrm{E}[\varepsilon_{n+k-j}\,|\,\varepsilon_{1:n}=\tilde\varepsilon_{1:n}]. \end{aligned}$$

Now we define x̂n (−t) = xn−t for t ≥ 0, as before, and also now define ε̃t = 0 for t > n, to get
$$\hat{x}_n(k) = \sum_{j=1}^{p}\phi_j\hat{x}_n(k-j) + \sum_{j=1}^{q}\theta_j\tilde\varepsilon_{n+k-j}. \tag{4.2}$$

So for k = 1, 2, . . . , q we explicitly enumerate x̂n (k) using Equation 4.2, and then for k > q we have the
linear recurrence
$$\hat{x}_n(k) = \sum_{j=1}^{p}\phi_j\hat{x}_n(k-j),$$

as for the AR(p).
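A direct (if naive) implementation of Equation 4.2 might look like the following sketch. The helper name is made up, and eps is assumed to be a vector of recovered in-sample errors aligned with x (for example, obtained with the ARMA filter used in the conditional least squares fit, padded with zeros at the start).

armaForecast = function(x, eps, phi, theta, k) {
  p = length(phi); q = length(theta); n = length(x)
  xx = c(x, numeric(k))    # observations, then forecasts to be filled in
  ee = c(eps, numeric(k))  # in-sample errors, with future errors set to zero
  for (t in (n+1):(n+k))
    xx[t] = sum(phi * xx[t - (1:p)]) + sum(theta * ee[t - (1:q)])
  xx[(n+1):(n+k)]
}

Once k > q, the θ terms multiply zero errors and the recursion reduces to the pure AR recurrence above.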

5 Spectral analysis

5.1 Fourier analysis

Many time series exhibit repetitive pseudo-periodic behaviour. It is therefore tempting to try to use (mixtures
of) trigonometric functions in order to capture aspects of their behaviour. The weights and frequencies
associated with the trig functions will give us insight into the nature of the process. This is the idea behind
spectral analysis.
In fact, we know from Fourier analysis that any reasonable function can be described as a mixture of trig
functions, so we begin with a very brief recap of Fourier series before thinking about how to use these ideas in the context of time series models.

5.1.1 Fourier series

Starting from any “reasonable” function, f : [−1/2, 1/2] → R (or, equivalently, any periodic function with period 1), we can write
$$f(x) = a_0 + \sum_{k=1}^{\infty}\bigl(a_k\cos 2\pi kx + b_k\sin 2\pi kx\bigr)$$

for coefficients a0 , a1 , . . . , b1 , b2 , . . . ∈ R to be determined. However, since trig is tricky, it is often more


convenient to write this as

$$f(x) = \sum_{k=-\infty}^{\infty} c_k e^{2\pi i kx},$$

for . . . , c−1 , c0 , c1 , . . . ∈ C. The sum is now doubly-infinite, since we can ensure we get back to a real-valued
Fourier series by choosing c−k = c̄k ∀ k ≥ 0 (and hence c0 ∈ R). Now, using the nice orthogonality
property
$$\int_{-1/2}^{1/2} e^{2\pi i(j-k)x}\,dx = \delta_{jk} = \begin{cases} 1 & j=k \\ 0 & \text{otherwise}, \end{cases}$$
we can multiply through our series by e−2πijx and integrate to determine the coefficients,
$$c_k = \int_{-1/2}^{1/2} f(x)\,e^{-2\pi i kx}\,dx.$$

It is a standard (but non-trivial) result of Fourier analysis that this series will converge to f (x) (in, say, an L2
sense), for reasonably well-behaved functions, f .
The main point of Fourier series is to represent a periodic (or finite range) function with respect to a countable orthonormal set of (trigonometric) basis functions. But since the mapping is invertible, it just provides us with a mapping back and forth between a function defined on [−1/2, 1/2] and a countable set of coefficients. So, we could alternatively view this as a way of representing a countable collection of numbers using a function defined on [−1/2, 1/2]. This latter perspective is helpful for the analysis of time series in discrete time, and in this context, the Fourier mapping is known as the discrete time Fourier transform (DTFT).

5.1.2 Discrete time Fourier transform

An infinite time series . . . , x−1, x0, x1, . . . can be represented by a function x̂ : [−1/2, 1/2] → C via the relation
$$x_t = \int_{-1/2}^{1/2}\hat{x}(\nu)\,e^{2\pi i t\nu}\,d\nu,\qquad t\in\mathbb{Z},$$

where

$$\hat{x}(\nu) = \sum_{t=-\infty}^{\infty} x_t\,e^{-2\pi i t\nu}.$$
x̂(ν) is the weight put on oscillations of frequency ν. More precisely, |x̂(ν)| determines the weight, and
Arg[x̂(ν)] determines the phase of the oscillation at that frequency. Note that the sign of the continuous
variable (ν) has been switched. This doesn’t change anything - it just maintains the usual convention that the
forward transform has a minus in the complex exponential and the inverse transform does not. Also note
that if the time series is real-valued, then the function x̂ will be Hermitian. That is, it will have the property $\hat{x}(-\nu) = \overline{\hat{x}(\nu)}$, ∀ν ∈ [0, 1/2] (and x̂(0) will be real). Note that no fundamentally new maths is required for this representation - it is just Fourier series turned on its head. That said, there are lots of technical convergence
conditions that we are ignoring.

5.2 Spectral representation

Now we have this bijection between discrete-time time series and functions on a unit interval, we can
think about what it means for random time series. Clearly a random time series Xt will induce a random
continuous-time process X̂(ν), and vice-versa. It will turn out to be very instructive to think about the class
of discrete time models induced by a weighted “white noise” process. That is, X̂(ν) = σ(ν)η(ν), where
σ : [−1/2, 1/2] → C is a deterministic function specifying the weight to be placed on the frequency ν, and η(ν)
is the stationary Gaussian “white noise” process with E[η(ν)] = 0, γ(ν) = δ(ν) (Dirac’s delta). We can
therefore represent our random time series as
$$X_t = \int_{-1/2}^{1/2}\sigma(\nu)\,e^{2\pi i t\nu}\,\eta(\nu)\,d\nu,\qquad t\in\mathbb{Z}. \tag{5.1}$$

Now, since η is not a very nice function, it is arguably better to write this as a stochastic integral wrt a Wiener
process, W , as
$$X_t = \int_{-1/2}^{1/2}\sigma(\nu)\,e^{2\pi i t\nu}\,dW(\nu),\qquad t\in\mathbb{Z},$$

interpreted using Itô calculus. We will largely gloss over this technicality, but writing it this way arguably
makes it easier to see that the induced time series will be a Gaussian process. Taking the expectation of
Equation 5.1 gives E[Xt ] = 0, so then
$$\begin{aligned} \mathrm{Cov}[X_{t+k},X_t] &= \mathrm{E}[X_{t+k}\bar{X}_t] \\ &= \mathrm{E}\left[\int_{-1/2}^{1/2}\sigma(\nu)e^{2\pi i(t+k)\nu}\eta(\nu)\,d\nu \int_{-1/2}^{1/2}\bar\sigma(\nu')e^{-2\pi i t\nu'}\bar\eta(\nu')\,d\nu'\right] \\ &= \mathrm{E}\left[\int_{-1/2}^{1/2}\!\!\int_{-1/2}^{1/2} d\nu\,d\nu'\,\sigma(\nu)\bar\sigma(\nu')\,e^{2\pi i[(t+k)\nu - t\nu']}\,\eta(\nu)\bar\eta(\nu')\right] \\ &= \int_{-1/2}^{1/2}\!\!\int_{-1/2}^{1/2} d\nu\,d\nu'\,\sigma(\nu)\bar\sigma(\nu')\,e^{2\pi i[t(\nu-\nu')+k\nu]}\,\mathrm{E}[\eta(\nu)\bar\eta(\nu')] \\ &= \int_{-1/2}^{1/2}\!\!\int_{-1/2}^{1/2} d\nu\,d\nu'\,\sigma(\nu)\bar\sigma(\nu')\,e^{2\pi i[t(\nu-\nu')+k\nu]}\,\delta(\nu-\nu') \\ &= \int_{-1/2}^{1/2} d\nu\,|\sigma(\nu)|^2 e^{2\pi i k\nu}. \end{aligned}$$

Since this is independent of t, the induced time series is weakly stationary, with auto-covariance function
$$\gamma_k = \int_{-1/2}^{1/2}|\sigma(\nu)|^2 e^{2\pi i k\nu}\,d\nu.$$

If we define the function


S(ν) ≡ |σ(ν)|2 ,
then we can re-write this as
$$\gamma_k = \int_{-1/2}^{1/2} S(\nu)\,e^{2\pi i k\nu}\,d\nu. \tag{5.2}$$

This is just a discrete time Fourier transform, so we can invert it as



$$S(\nu) = \sum_{k=-\infty}^{\infty}\gamma_k\,e^{-2\pi i k\nu}.$$

The non-negative real-valued function S(ν) is known as the spectral density of the discrete time series
process. So, glossing over (quite!) a few technical details, essentially any choice of spectral density function
leads to a stationary Gaussian discrete time process, and any stationary Gaussian discrete time process has an
associated spectral density function. The spectral density function and auto-covariance function are different
but equivalent ways of representing the same information. Technical mathematical results relating to this
equivalence include the Wiener-Khinchin theorem and Bochner’s theorem. Note that if S(ν) is chosen to be
an even function, and it typically will, then
$$\gamma_k = 2\int_{0}^{1/2} S(\nu)\cos(2\pi k\nu)\,d\nu,$$

and will be real. Similarly, if γk is real (and hence even), as it typically will be, then

$$S(\nu) = \gamma_0 + 2\sum_{k=1}^{\infty}\gamma_k\cos(2\pi k\nu),$$

and hence is even.

5.2.1 Spectral densities

5.2.1.1 White noise

What discrete time process is represented by a flat spectral density, with S(ν) = σ², ∀ν? Substituting in to Equation 5.2 gives
$$\gamma_k = \begin{cases}\sigma^2 & k=0\\ 0 & k>0.\end{cases}$$
In other words, it induces the discrete time white noise process, εt , with noise variance σ 2 .

5.2.1.2 Some properties of the DTFT

5.2.1.2.1 Linearity

The DTFT is a linear transformation. Suppose that zt = axt + byt for some time series xt and yt . Then,

$$\begin{aligned} \hat{z}(\nu) &= \sum_{t=-\infty}^{\infty} z_t e^{-2\pi i t\nu} \\ &= \sum_{t=-\infty}^{\infty}(ax_t+by_t)e^{-2\pi i t\nu} \\ &= a\sum_{t=-\infty}^{\infty}x_t e^{-2\pi i t\nu} + b\sum_{t=-\infty}^{\infty}y_t e^{-2\pi i t\nu} \\ &= a\hat{x}(\nu) + b\hat{y}(\nu). \end{aligned}$$

5.2.1.2.2 Time shift


Suppose that yt = Bxt = xt−1 , for some time series xt . Then,

$$\begin{aligned} \hat{y}(\nu) &= \sum_{t=-\infty}^{\infty} y_t e^{-2\pi i t\nu} \\ &= \sum_{t=-\infty}^{\infty} x_{t-1} e^{-2\pi i t\nu} \\ &= e^{-2\pi i\nu}\sum_{t=-\infty}^{\infty} x_{t-1} e^{-2\pi i(t-1)\nu} \\ &= e^{-2\pi i\nu}\hat{x}(\nu), \end{aligned}$$
and more generally, it easily follows that
$$\widehat{B^k x}(\nu) = e^{-2\pi i k\nu}\,\hat{x}(\nu).$$

5.2.1.3 AR(p)

Consider the AR(p) model,


$$X_t - \sum_{k=1}^{p}\phi_k X_{t-k} = \varepsilon_t,$$
and DTFT both sides.
$$\begin{aligned} \hat{X}(\nu) - \sum_{k=1}^{p}\phi_k e^{-2\pi i k\nu}\hat{X}(\nu) &= \hat\varepsilon(\nu) \\ \Rightarrow\ \left(1-\sum_{k=1}^{p}\phi_k e^{-2\pi i k\nu}\right)\hat{X}(\nu) &= \hat\varepsilon(\nu) \\ \Rightarrow\ \phi(e^{-2\pi i\nu})\,\hat{X}(\nu) &= \hat\varepsilon(\nu) \\ \Rightarrow\ \hat{X}(\nu) &= \frac{\hat\varepsilon(\nu)}{\phi(e^{-2\pi i\nu})}. \end{aligned}$$
Now, since εt is a discrete time white noise process, it has a flat spectrum, and we can write ε̂(ν) = ση(ν),
so that
$$\hat{X}(\nu) = \frac{\sigma}{\phi(e^{-2\pi i\nu})}\,\eta(\nu),$$
and from this it is clear that the spectral density of Xt is given by
$$S(\nu) = \frac{\sigma^2}{|\phi(e^{-2\pi i\nu})|^2}.$$
Recall that a stationary AR(p) model has roots of ϕ(z) outside of the unit circle. Here, in the denominator,
we are traversing around the unit circle, so the spectral density will remain finite for stationary processes.

5.2.1.3.1 Example: AR(2)
Consider an AR(2) model with ϕ1 = 3/2, ϕ2 = −3/4, σ = 1. We can plot the spectral density as follows.

library(astsa)
nu = seq(-0.5, 0.5, 0.001)
specd = function(nu) {
z = complex(1, cos(2*pi*nu), -sin(2*pi*nu))
phi = 1 - 1.5*z + 0.75*z*z
1/abs(phi)^2
}
spec = sapply(nu, specd)
tsplot(ts(spec, start=-0.5, deltat=0.001),
col=3, lwd=1.5, xlab="nu", ylab="spectral density")
[Figure: spectral density of the AR(2) model plotted against nu over (−0.5, 0.5).]

However, since the function is even, we typically only plot the function on [0, 1/2].

nu = seq(0, 0.5, 0.002)


spec = sapply(nu, specd)
tsplot(ts(spec, start=0, deltat=0.002),
col=3, lwd=1.5, xlab="nu", ylab="spectral density")

[Figure: spectral density of the AR(2) model plotted against nu over (0, 0.5).]

It is quite common to see spectral densities plotted on a half-log scale.

tsplot(ts(log(spec), start=0, deltat=0.002),


col=3, lwd=1.5, xlab="frequency (nu)", ylab="log(spectral density)")
[Figure: log spectral density of the AR(2) model against frequency nu.]

Either way, we see that there is a single strong peak in this spectral density, occurring at around ν = 0.08. In
fact, we expect this, since we’ve looked at this AR(2) model previously.

peak = which.max(spec)*0.002
peak

[1] 0.082

1/peak

[1] 12.19512

So this model has oscillations with period around 12. This all makes sense, since the spectral density will be
largest at points on the unit circle that are closest to the roots of the characteristic polynomial. But this will
happen when the argument of the point of the unit circle matches that of a root.
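We can check this directly for our example: the peak frequency corresponds to the argument of the complex roots of ϕ(z), divided by 2π.

roots = polyroot(c(1, -1.5, 0.75))  # roots of phi(z) = 1 - 1.5z + 0.75z^2
Arg(roots) / (2*pi)                 # approximately +/- 0.083, i.e. period around 12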

5.2.1.4 ARMA(p,q)

A very similar argument to that used for AR(p) models confirms that the spectral density of an ARMA(p,q)
process is given by
$$S(\nu) = \sigma^2\,\frac{|\theta(e^{-2\pi i\nu})|^2}{|\phi(e^{-2\pi i\nu})|^2}.$$
Note that when calculating spectral densities by hand it can be useful to write this in the form

$$S(\nu) = \sigma^2\,\frac{\theta(e^{2\pi i\nu})\,\theta(e^{-2\pi i\nu})}{\phi(e^{2\pi i\nu})\,\phi(e^{-2\pi i\nu})},$$

and to simplify the numerator and denominator separately, recognising the resulting real-valued trig functions
that must arise, since both the numerator and denominator are real.
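For example, with some made-up parameters, the spectral density of an ARMA(1, 1) with ϕ1 = 0.8, θ1 = 0.5 and σ = 1 can be evaluated numerically in the same style as the AR(2) example above (the function name is just illustrative).

specARMA11 = function(nu, phi=0.8, theta=0.5, sigma=1) {
    z = exp(-2i*pi*nu)  # e^{-2 pi i nu}
    sigma^2 * Mod(1 + theta*z)^2 / Mod(1 - phi*z)^2
}
nu = seq(0, 0.5, 0.002)
tsplot(ts(specARMA11(nu), start=0, deltat=0.002),
    col=3, lwd=1.5, xlab="nu", ylab="spectral density")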

5.3 Finite time series

Of course, in practice we do not work with time series of infinite length. We instead work with time series of
finite length, n. In the context of spectral analysis, it is more convenient to work with zero-based indexing, so
we write our time series as
x0 , x1 , . . . , xn−1 .
Since we only have n degrees of freedom, we do not need a continuous function x̂(ν) in order to fix them.
Since everything is linear, knowing x̂(ν) at just n points would be sufficient, since that would give us n linear
equations in n unknowns. Knowing it at n equally spaced points around the unit circle is natural, and turns
out to give nice orthogonality properties that make everything very neat. This is the idea behind the discrete
Fourier transform (DFT).

5.3.1 The discrete Fourier transform

The (forward) DFT takes the form


$$\hat{x}_k = \sum_{t=0}^{n-1}x_t\,e^{-2\pi i kt/n},\qquad k=0,1,\ldots,n-1.$$

It can be inverted with


$$x_t = \frac{1}{n}\sum_{k=0}^{n-1}\hat{x}_k\,e^{2\pi i kt/n},\qquad t=0,1,\ldots,n-1.$$
For the DTFT, the integral traversed the unit circle clockwise starting from minus one. The DFT moves
around the unit circle clockwise starting from one, but this is just a commonly used convention, and doesn’t
change anything fundamental about the transform. These transforms are simple finite linear sums mapping
back and forth between n values in the time and frequency domains, and hence are very amenable to computer
implementation.

To understand how it works, it is helpful to define ω = e^{2πi/n}, the nth root of unity, so that ω^n = 1 and $\bar\omega^k = \omega^{-k} = \omega^{n-k}$. We can then write the transforms slightly more neatly as
$$x_t = \frac{1}{n}\sum_{k=0}^{n-1}\hat{x}_k\,\omega^{kt} \qquad\text{and}\qquad \hat{x}_k = \sum_{t=0}^{n-1}x_t\,\omega^{-kt}.$$

That the inversion works follows from the nice orthogonality property,
$$\sum_{t=0}^{n-1}\omega^{jt}\omega^{-kt} = \sum_{t=0}^{n-1}\omega^{(j-k)t} = n\,\delta_{jk},$$

which follows from the fact that the roots of unity sum to zero. If we start from the definition of xt and sub in
for the definition of x̂k we find
$$\frac{1}{n}\sum_{k=0}^{n-1}\hat{x}_k\,\omega^{kt} = \frac{1}{n}\sum_{k=0}^{n-1}\sum_{s=0}^{n-1}x_s\,\omega^{-ks}\omega^{kt} = \frac{1}{n}\sum_{s=0}^{n-1}x_s\sum_{k=0}^{n-1}\omega^{-ks}\omega^{kt} = \frac{1}{n}\sum_{s=0}^{n-1}x_s\,n\delta_{st} = x_t,$$
as required. The factor of n in the inverse is a bit annoying, but has to go somewhere. The convention that we
have adopted is the most common, but there are others.
The finite, discrete and linear nature of the DFT (and its inverse) make it extremely suitable for computer
implementation. Naive implementation would involve O(n) operations to compute each coefficient, and
hence a full transform would involve O(n2 ) operations. However, it turns out that a divide-and-conquer
approach can be used to reduce this to O(n log n) operations. The algorithm that implements the DFT with
reduced computational complexity is known as the fast Fourier transform (FFT). For large time series, this
difference in complexity is transformational, and is one of the reasons that the FFT (and the very closely
related discrete cosine transform, DCT) now underpin much of modern life.

5.3.1.1 The FFT in R

R has a built-in fft function that can efficiently compute a DFT. It can also invert the DFT, but it returns an
unnormalised inverse (without the factor of n), so it is convenient to define our own function to invert the
DFT.

ifft = function(x) fft(x, inverse=TRUE) / length(x)

With this iDFT function in place, we can test it to make sure that it correctly inverts a DFT.

x = 1:5
x

[1] 1 2 3 4 5

fx = fft(x)
fx

[1] 15.0+0.000000i -2.5+3.440955i -2.5+0.812299i -2.5-0.812299i -2.5-3.440955i

ifft(fx)

[1] 1+0i 2+0i 3+0i 4+0i 5+0i

Re(ifft(fx))

[1] 1 2 3 4 5

5.3.1.2 Matrix formulation

It can also be instructive to consider the DFT in matrix-vector form. Starting with the inverse DFT, we can write this as
$$\mathbf{x} = \frac{1}{n}F_n\hat{\mathbf{x}},$$
where $F_n$ is the n × n Fourier matrix with (i, j)th element $\omega^{ij}$. Clearly $F_n$ is symmetric ($F_n^\top = F_n$). Now notice that the forward DFT is just
$$\hat{\mathbf{x}} = \bar{F}_n\mathbf{x}.$$
Further, our orthogonality property implies that $F_n\bar{F}_n = nI$ (so $F_n$ is unitary, modulo a factor of n), and $F_n^{-1} = \bar{F}_n/n$. So inversion is now clear, since if we start with the definition of $\mathbf{x}$ and substitute in for $\hat{\mathbf{x}}$, we get
$$\frac{1}{n}F_n\hat{\mathbf{x}} = \frac{1}{n}F_n\bar{F}_n\mathbf{x} = \frac{1}{n}nI\mathbf{x} = \mathbf{x},$$
as required.
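As a quick numerical check (not something we would do in practice), we can construct the Fourier matrix for a small n and confirm that multiplying by its conjugate reproduces R's fft; the variable names are just illustrative.

nn = 5
Fn = outer(0:(nn-1), 0:(nn-1), function(j, k) exp(2i*pi*j*k/nn))  # (j,k) element is omega^{jk}
xs = 1:5
max(Mod(as.vector(Conj(Fn) %*% xs) - fft(xs)))  # essentially zero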

5.3.2 The periodogram

For a time series x0 , x1 , . . . , xn−1 with DFT x̂0 , x̂1 , . . . , x̂n−1 , the periodogram is just the sequence of
non-negative real numbers
Ik = |x̂k |2 , k = 0, 1, . . . , n − 1.
They represent a (very bad) estimate of the spectral density function.

5.3.2.1 Example: AR(2)

Let’s see how this works in the context of our favourite AR(2) model. First, simulate some data from the
model.

set.seed(42)
n = 500
x = arima.sim(n=n, list(ar = c(1.5, -0.75)))
tsplot(x, col=4)

[Figure: simulated AR(2) time series of length 500.]

We can plot a very basic periodogram as follows.

tsplot(abs(fft(x))^2,
col=3, lwd=1.5, xlab="k", ylab="Power")
[Figure: raw periodogram of the simulated series, plotted against index k.]

Since the second half of the periodogram is just a mirror-image of the first, typically only the first half is
plotted.

tsplot(abs(fft(x)[2:(n/2+1)])^2,
col=3, lwd=1.5, xlab="k", ylab="Power")

[Figure: first half of the periodogram (k = 1, ..., n/2).]

The periodogram is typically plotted on a half-log scale.

tsplot(log(abs(fft(x)[2:(n/2+1)])^2),
col=3, lwd=1.5, xlab="k", ylab="log(Power)")
[Figure: log periodogram of the simulated series.]

This is now exactly like R’s built-in periodogram function, spec.pgram, with all “tweaks” disabled. But
the default periodogram does a little bit of pre-processing.

spec.pgram(x, taper=0, detrend=FALSE,


col=3, lwd=1.5, main="Raw periodogram")

[Figure: “Raw periodogram” from spec.pgram (taper=0, detrend=FALSE), log scale; bandwidth = 0.000577.]

spec.pgram(x, col=3, lwd=1.5,


main="Default periodogram")

[Figure: “Default periodogram” from spec.pgram; bandwidth = 0.000577.]

spectrum(x, spans=c(5, 7),


col=3, lwd=2, main="Smoothed periodogram")

[Figure: “Smoothed periodogram” (spans=c(5, 7)); bandwidth = 0.00436.]

In practice, various smoothing methods are used to get a better estimate of the spectral density. This is
the subject of spectral density estimation, but the details of this do not concern us. In the final plot, a peak at
a frequency of around 0.08 can be seen, as we would hope.

5.3.3 Smoothing with the DFT

The potential applications of the DFT (and of spectral analysis, more generally) are many and varied, but we
don’t have time to explore them in detail here. But essentially, the DFT decomposes your original signal into
different frequency components. These different frequency components can be separated, manipulated and
combined in various ways. One obvious application is smoothing, where high frequency components are
simply wiped out. This is best illustrated by example.

5.3.3.1 Example: monthly sunspot data

We previously looked at yearly sunspot data, and fitted an AR(2) model to it. There is also a monthly sunspot
dataset that is in many ways more interesting.

tsplot(sunspot.month, col=4)

[Figure: monthly sunspot data.]

We can see strong oscillations with period just over a decade. However, fitting an AR(2) to this data fails
miserably due to the presence of strong high frequency oscillations. A quick look at the periodogram

spectrum(sunspot.month, spans=c(5, 7), col=3)

[Figure: smoothed periodogram of the monthly sunspot data; bandwidth = 0.00817.]

reveals that although there is a strong peak at low frequency oscillation, there is also a lot of higher frequency
stuff going on. So, let’s look more carefully at the DFT of the data.

ft = fft(sunspot.month)
tsplot(log(abs(ft)), col=3,
xlab="Frequency", ylab="log(DFT)")

[Figure: log modulus of the DFT of the monthly sunspot data.]

The peaks close to the two ends correspond to the low frequency oscillations that we are interested in.
Everything in the middle is high frequency noise that we are not interested in, so let’s just get rid of it.
Suppose that we want to keep the first (and last) 150 frequency components. We can zero out the rest as
follows.

ft[152:(length(ft)-152+2)] = 0
tsplot(log(abs(ft)), col=3,
xlab="Frequency", ylab="log(DFT)")
[Figure: log modulus of the DFT after zeroing the high frequency components.]

We can now switch back to the time domain to see the data with the high frequency components removed.

sm = Re(ifft(ft))
tsplot(sm, col=4, lwd=1.5,
ylab="Smoothed monthly sunspot data")

[Figure: smoothed monthly sunspot data obtained by inverting the truncated DFT.]

This is quite a nice approach to smoothing the data.

6 Hidden Markov models (HMMs)

6.1 Introduction

So far we have been assuming that we perfectly observe our time series model. That is, if we have a model
for Xt , we observe xt . However, we are very often in a situation where we have a good model for a (Markov)
process Xt , but we are only able to observe some partial or noisy aspects. We call this observation process Yt
(and to be useful, this observation must depend on Xt in some way), and hence we only ever observe yt . The
Xt process remains “hidden”, but we are often able to learn a lot about it from the yt . This is the idea behind
state space modelling. Very often, our hidden process will be a linear Gaussian process like those we have
mainly been considering. We will study this case in Chapter 7. It is slightly simpler to start with the case
where the hidden process is a finite-state Markov chain. In this case the combined model is referred to as a
hidden Markov model.
Our hidden process X0 , X1 , . . . , Xn is a Markov chain with state space X = {1, 2, . . . , p} for some known
integer p > 1. At time t the probability that the chain is in each state is described by a (row) vector

π (t) = (P[Xt = 1], P[Xt = 2], . . . , P[Xt = p]).

The chain is initialised with the vector π (0). The dynamics of the chain are governed by a transition
matrix P, and we use the (usual, but bad, Western) convention that the (i, j)th element corresponds to
P[Xt+1 = j|Xt = i]. The rows of P must then sum to one, and P is known as a (right) stochastic matrix.
The law of total probability leads to the update rule

π (t + 1) = π (t)P. (6.1)

It is clear from Equation 6.1 that a distribution π satisfying

π = πP

will be stationary. Choosing π (0) = π will ensure that the Markov chain is stationary. This is a common
choice, but not required. A discrete uniform distribution on the initial states is also commonly adopted, and
can sometimes be simpler.
The observation process is Y1 , Y2 , . . . , Yn , and will ultimately lead to n observations y1 , y2 , . . . , yn . Yt
represents some kind of noisy or partial observation of Xt , and so, crucially, is conditionally independent of
all other Xs and Ys (s ̸= t) given Xt . In other words, the distribution of Yt will depend (directly) only on Xt .
The state space, Y, is largely irrelevant to the analysis. It could be discrete (finite or countable) or continuous.
We just need to specify a probability (mass or density) function for the observation which can then be used as
a likelihood. We define
li (y) = P[Yt = y|Xt = i],
but note that this can be replaced with a probability density in the continuous case. Then l(y) is a (row)
p-vector of probabilities (or densities) associated with the likelihood of observing some y ∈ Y.
For much of this chapter we consider the case where π (0), P and l(·) are known (though we will consider
parameter estimation briefly, later), and we want to use the observations y1 , y2 , . . . , yn to tell us about the
hidden Markov chain (X0 ), X1 , X2 , . . . , Xn .

6.1.1 Example

As our running example for this chapter, we will look at a financial time series, specifically the daily returns
of the Bank of America from 2005 to 2017.

library(astsa)
y = BCJ[,"boa"]
length(y)

[1] 3243

tsplot(y, col=4, main="BoA returns")

[Figure: “BoA returns” — daily Bank of America returns, 2006–2016.]

We see that the returns have mean close to zero, as we would expect, but that there are at least two periods of
higher volatility - one in 2008/09, and another in 2011. We will use a HMM to automatically segment our
time series into regions of high and low volatility.

hist(y[2000:3000], 30, col=4, freq=FALSE,


main="Region of low volatility")
curve(dnorm(x, 0, 0.015), add=TRUE, col=3, lwd=2)

[Figure: “Region of low volatility” — histogram of returns with a N(0, 0.015^2) density overlaid.]

hist(y[1000:1500], 30, col=4, freq=FALSE,


main="Region of high volatility")
curve(dcauchy(x, 0, 0.025), add=TRUE, col=3, lwd=2)

[Figure: “Region of high volatility” — histogram of returns with a Cauchy(0, 0.025) density overlaid.]

Very informal analysis suggests that we could reasonably model the periods of low volatility as mean zero
Gaussian with standard deviation 0.015, and regions of high volatility as Cauchy with scale parameter
0.025.
For our hidden Markov chain, we will suppose that periods of low volatility (i = 1) last for 1,000 days on
average, and that periods of high volatility (i = 2) last for 200 days on average, leading to the transition
matrix,
$$\mathsf{P} = \begin{pmatrix} 0.999 & 0.001 \\ 0.005 & 0.995 \end{pmatrix}.$$

If we further assume π (0) = (0.5, 0.5), our model is fully specified.
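As an aside, the stationary distribution π = πP of this transition matrix can be computed from the leading left eigenvector of P (a quick sketch; variable names are illustrative), and puts about 5/6 of the long-run probability on the low volatility state.

P = matrix(c(0.999, 0.001, 0.005, 0.995), ncol=2, byrow=TRUE)
v = Re(eigen(t(P))$vectors[, 1])  # left eigenvector of P for eigenvalue 1
v / sum(v)                        # approximately (0.833, 0.167)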

6.2 Filtering

In the context of state space modelling, the filtering problem is the problem of (sequentially) computing, at
each time t, the distribution of the hidden state, Xt , given all data up to time t, y1:t . That is, we want to
compute
fi (t) = P[Xt = i|Y1:t = y1:t ], ∀i ∈ X , t = 0, 1, . . . , n,
so f (t) is the (row) p-vector of filtered probabilities at time t. We could imagine computing this on-line, as
each new observation becomes available. We know that f(0) = π(0), so assume that we know f(t − 1) for
some t > 0 and want to compute f (t). It is instructive to do this in two steps.
Predict step
We begin by computing

f˜i (t) = P[Xt = i|Y1:(t−1) = y1:(t−1) ], i = 1, 2, . . . , p.

The law of total probability along with conditional independence gives

$$\begin{aligned} \tilde{f}_i(t) &= \mathrm{P}[X_t=i\,|\,Y_{1:(t-1)}=y_{1:(t-1)}] \\ &= \sum_{j=1}^{p}\mathrm{P}[X_t=i\,|\,X_{t-1}=j, Y_{1:(t-1)}=y_{1:(t-1)}]\,\mathrm{P}[X_{t-1}=j\,|\,Y_{1:(t-1)}=y_{1:(t-1)}] \\ &= \sum_{j=1}^{p}\mathrm{P}[X_t=i\,|\,X_{t-1}=j]\,\mathrm{P}[X_{t-1}=j\,|\,Y_{1:(t-1)}=y_{1:(t-1)}] \\ &= \sum_{j=1}^{p}\mathsf{P}_{ji}\,f_j(t-1). \end{aligned}$$

In other words,
f̃ (t) = f (t − 1)P.
This is obviously reminiscent of our fundamental Markov chain update property, Equation 6.1, but does rely
on the conditional independence of Xt and Y1:(t−1) given Xt−1 .
Update step
Now we have pushed our probabilities forward in time, we can condition on our new observation.

$$\begin{aligned} f_i(t) &= \mathrm{P}[X_t=i\,|\,Y_{1:t}=y_{1:t}] \\ &= \mathrm{P}[X_t=i\,|\,Y_{1:(t-1)}=y_{1:(t-1)}, Y_t=y_t] \\ &= \frac{\mathrm{P}[X_t=i\,|\,Y_{1:(t-1)}=y_{1:(t-1)}]\,\mathrm{P}[Y_t=y_t\,|\,X_t=i, Y_{1:(t-1)}=y_{1:(t-1)}]}{\mathrm{P}[Y_t=y_t\,|\,Y_{1:(t-1)}=y_{1:(t-1)}]} \\ &= \frac{\mathrm{P}[X_t=i\,|\,Y_{1:(t-1)}=y_{1:(t-1)}]\,\mathrm{P}[Y_t=y_t\,|\,X_t=i]}{\mathrm{P}[Y_t=y_t\,|\,Y_{1:(t-1)}=y_{1:(t-1)}]} \\ &= \frac{\tilde{f}_i(t)\,l_i(y_t)}{\mathrm{P}[Y_t=y_t\,|\,Y_{1:(t-1)}=y_{1:(t-1)}]}. \end{aligned}$$

We could therefore write
$$f(t) = \frac{\tilde{f}(t)\circ l(y_t)}{\mathrm{P}[Y_t=y_t\,|\,Y_{1:(t-1)}=y_{1:(t-1)}]},$$
where ◦ is the Hadamard (element-wise) product, and the denominator is a scalar normalising constant, and so we could just write
$$f(t) \propto \tilde{f}(t)\circ l(y_t),$$

which makes it clearer that we are just re-weighting the results of the predict step according to the likelihood
of the observation. We can just compute the RHS and then normalise to get the LHS. Note that we may
nevertheless want to keep track of the normalising constant (explained in the next section). If we prefer, we
can combine the predict and update steps as

f (t) ∝ [f (t − 1)P] ◦ l(yt ).

Starting this filtering process off at t = 1 and then running it forward to t = n is the forward part of the
forward-backward algorithm.

6.2.1 Example

We can implement the filter by creating a function that advances the algorithm by one step.

hmmFilter = function(P, l)
function(f, y) {
fNew = (f %*% P) * l(y)
fNew / sum(fNew)
}

We can use this to create the advancement function for our running example as follows.

advance = hmmFilter(
matrix(c(0.999, 0.001, 0.005, 0.995), ncol=2, byrow=TRUE),
function(y) c(dnorm(y, 0, 0.015), dcauchy(y, 0, 0.025))
)

So advance is now a function that takes the current set of filtered probabilities and an observation and
returns the next set of (normalised) filtered probabilities. We can apply this function sequentially to our time
series using the Reduce function, which implements a functional fold. If we just want the final set of filtered
probabilities, we can call it as follows.

Reduce(advance, y, c(0.5, 0.5))

[,1] [,2]
[1,] 0.9989384 0.001061576

This tells us that we end in a period of low volatility (with very high probability). But more likely, we will
want the full set of filtered probabilities, and to plot them over the data in some way.

fpList = Reduce(advance, y, c(0.5, 0.5), acc=TRUE)


fpMat = sapply(fpList, cbind)
fp2Ts = ts(fpMat[2, -1], start=start(y), freq=frequency(y))
tsplot(y, col=4, ylim=c(-0.3, 1))
lines(fp2Ts, col=2, lwd=1.5)

[Figure: BoA returns with the filtered probability of the high volatility state overlaid (red).]

Here we show the filtered probability of being in the high volatility state.

6.3 Marginal likelihood

For many reasons (including parameter estimation, to be discussed later), it is often useful to be able to
compute the marginal probability (density) of the data, P[Y1:n = y1:n ], not conditioned on the (unknown)
hidden states. If we factorise this as
$$\mathrm{P}[Y_{1:n}=y_{1:n}] = \prod_{t=1}^{n}\mathrm{P}[Y_t=y_t\,|\,Y_{1:(t-1)}=y_{1:(t-1)}],$$

we see that the required terms correspond precisely to the normalising constants calculated during filtering.
So we can compute the marginal likelihood easily as a by-product of filtering with essentially no additional
computation.
In practice, for reasons of numerical stability (in particular, avoiding numerical underflow), we compute the
log of the marginal likelihood as
$$\log\mathrm{P}[Y_{1:n}=y_{1:n}] = \sum_{t=1}^{n}\log\mathrm{P}[Y_t=y_t\,|\,Y_{1:(t-1)}=y_{1:(t-1)}].$$

6.3.1 Example

We modify our previous function to update the marginal likelihood of the data so far, in addition to the filtered
probabilities.

hmmFilterML = function(P, l)
function(fl, y) {
fNew = (fl$f %*% P) * l(y)
ml = sum(fNew)
list(f=fNew/ml, ll=(fl$ll + log(ml)))
}

advance = hmmFilterML(
matrix(c(0.999, 0.001, 0.005, 0.995), ncol=2, byrow=TRUE),
function(y) c(dnorm(y, 0, 0.015), dcauchy(y, 0, 0.025))
)

Reduce(advance, y, list(f=c(0.5, 0.5), ll=0))

$f
[,1] [,2]
[1,] 0.9989384 0.001061576

$ll
[1] 7971.837

This returns the final set of filtered probabilities, as before, but now also the marginal likelihood of the data.

6.4 Smoothing

Filtering is great in the on-line context, but off-line, with a static dataset of size n, we are probably more
interested in the smoothed probabilities,
si (t) = P[Xt = i|Y1:n = y1:n ], i ∈ X , t = 0, 1, . . . , n.
Now, from the forward filter, we already know the final
s(n) = f (n).
So now (for a backward recursion), suppose we already know s(t + 1) for some t < n, and want to know
s(t).
$$\begin{aligned} s_i(t) &= \mathrm{P}[X_t=i\,|\,Y_{1:n}=y_{1:n}] \\ &= \sum_{j=1}^{p}\mathrm{P}[X_t=i, X_{t+1}=j\,|\,Y_{1:n}=y_{1:n}] \\ &= \sum_{j=1}^{p}\mathrm{P}[X_{t+1}=j\,|\,Y_{1:n}=y_{1:n}]\,\mathrm{P}[X_t=i\,|\,X_{t+1}=j, Y_{1:n}=y_{1:n}] \\ &= \sum_{j=1}^{p}s_j(t+1)\,\mathrm{P}[X_t=i\,|\,X_{t+1}=j, Y_{1:t}=y_{1:t}] \\ &= \sum_{j=1}^{p}s_j(t+1)\,\frac{\mathrm{P}[X_t=i\,|\,Y_{1:t}=y_{1:t}]\,\mathrm{P}[X_{t+1}=j\,|\,X_t=i, Y_{1:t}=y_{1:t}]}{\mathrm{P}[X_{t+1}=j\,|\,Y_{1:t}=y_{1:t}]} \\ &= \sum_{j=1}^{p}s_j(t+1)\,\frac{f_i(t)\,\mathsf{P}_{ij}}{\tilde{f}_j(t+1)} \\ &= f_i(t)\sum_{j=1}^{p}\frac{s_j(t+1)}{\tilde{f}_j(t+1)}\,\mathsf{P}_{ij}. \end{aligned}$$

We could write this as
$$s(t) = f(t)\circ\left[\Bigl(s(t+1)\circ\tilde{f}(t+1)^{-1}\Bigr)\mathsf{P}^\top\right] = f(t)\circ\left[\Bigl(s(t+1)\circ\{f(t)\mathsf{P}\}^{-1}\Bigr)\mathsf{P}^\top\right].$$

So, we can smooth by running backwards using the filtered probabilities from the forward pass.

6.4.1 Example

We can create a function to carry out one step of the backward pass as follows.

hmmSmoother = function(P)
function(fp, sp) {
fp * ((sp / (fp %*% P)) %*% t(P))
}

We can then create a backward stepping function for our running example with:

backStep = hmmSmoother(
matrix(c(0.999, 0.001, 0.005, 0.995), ncol=2, byrow=TRUE)
)

We can then apply this to the list of filtered probabilities by requesting that Reduce folds from the right (ie.
starts with the final element of the list and works backwards).

spList = Reduce(backStep, fpList, right=TRUE, acc=TRUE)


spMat = sapply(spList, cbind)
sp2Ts = ts(spMat[2, -1], start=start(y), freq=frequency(y))
tsplot(y, col=4, ylim=c(-0.3, 1))
lines(sp2Ts, col=2, lwd=1.5)
[Figure: BoA returns with the smoothed probability of the high volatility state overlaid (red).]

We see that the smoothed probabilities are, indeed, smoother than the filtered probabilities, since the whole
time series is used for the marginal classification of each time point.

6.5 Sampling

The smoothed probabilities tell us marginally what is happening at each time point t, but don’t tell us
anything about the joint distribution of the hidden states. A good way to get insight into the joint distribution
is to generate samples from it. This forms a useful part of many Monte Carlo algorithms for HMMs,

including MCMC methods for parameter estimation. So, we want to generate samples from the probability
distribution
P[X1:n |Y1:n = y1:n ].
As with the other problems we have considered in this chapter, naive approaches won’t work. Here, the
state space that we are simulating on has size pn , so we can’t simply enumerate probabilities of possible
trajectories. However, we can nevertheless generate exact samples from this distribution by first computing
filtered probabilities with a forward sweep, as previously described, but now using a different backward
pass.
It is convenient to factorise the joint distribution in a “backwards” manner as
$$\begin{aligned} \mathrm{P}[X_{1:n}=x_{1:n}\,|\,Y_{1:n}=y_{1:n}] &= \prod_{t=1}^{n}\mathrm{P}[X_t=x_t\,|\,X_{(t+1):n}=x_{(t+1):n}, Y_{1:n}=y_{1:n}] \\ &= \left(\prod_{t=1}^{n-1}\mathrm{P}[X_t=x_t\,|\,X_{t+1}=x_{t+1}, Y_{1:t}=y_{1:t}]\right)\mathrm{P}[X_n=x_n\,|\,Y_{1:n}=y_{1:n}]. \end{aligned}$$

Now, we know the final term P[Xn |Y1:n = y1:n ] = f (n), so we can sample from this to obtain xn . So now
suppose that we know xt+1 for some t < n and that we want to sample from

P[Xt |Xt+1 = xt+1 , Y1:t = y1:t ].

We have
$$\mathrm{P}[X_t=i\,|\,X_{t+1}=x_{t+1}, Y_{1:t}=y_{1:t}] = \frac{\mathrm{P}[X_t=i\,|\,Y_{1:t}=y_{1:t}]\,\mathrm{P}[X_{t+1}=x_{t+1}\,|\,X_t=i, Y_{1:t}=y_{1:t}]}{\mathrm{P}[X_{t+1}=x_{t+1}\,|\,Y_{1:t}=y_{1:t}]} \propto f_i(t)\,\mathsf{P}_{i,x_{t+1}}.$$

In other words, the probabilities are proportional to f (t) ◦ P(xt+1 ) , the relevant column of P. So, we proceed
backwards from t = n to t = 1, sampling xt at each step.

6.5.1 Example

We can create a function to carry out one step of backward sampling as follows.

hmmSampler = function(P) {
p = nrow(P)
function(fp, x) {
sample(1:p, 1, prob=fp*P[,x])
}
}

We can then create a backward stepping function for our running example with:

backSample = hmmSampler(
matrix(c(0.999, 0.001, 0.005, 0.995), ncol=2, byrow=TRUE)
)

We can then apply this to the list of filtered probabilities.

set.seed(42)
xList = Reduce(backSample, head(fpList, -1),
init=sample(1:2, 1, prob=tail(fpList, 1)[[1]]),
right=TRUE, acc=TRUE)
xVec = unlist(xList)

xTs = ts(xVec[-1], start=start(y), freq=frequency(y))
tsplot(y, col=4, ylim=c(-0.3, 1))
lines(xTs-1, col=2, lwd=1.5)

[Figure: BoA returns with one sampled trajectory of the hidden state overlaid (red).]

This gives us one sample from the joint conditional distribution of hidden states. Of course, every time we
repeat this process we will get a different sample. Studying many such samples gives us a Monte Carlo
method for understanding the joint distribution. We do not have time to explore this in detail here.

6.6 Parameter estimation

We have so far assumed that the HMM is fully specified, and that there are no unknown parameters (other
than the hidden states). Of course, in practice, this is rarely the case. There could be unknown parameters in
either or both of the transition matrix or observation likelihoods, for example. There are many interesting
Bayesian approaches to this problem, but for this course we focus on maximum likelihood estimation.
We have seen how we can use the forward filtering algorithm to compute the marginal model likelihood. But
evaluation requires a fully specified model, and therefore depends on any unknown parameters. If we call the
unknown parameters θ , then the marginal likelihood is actually the likelihood function, ℓ(θθ ; y). Maximum
likelihood approaches to inference seek to maximise this as a function of θ . There is a very efficient iterative
algorithm for this based on the EM algorithm, which in the context of HMMs, is known as the Baum-Welch
algorithm. The details are not relevant to this course, however, so we will just look at generic numerical
maximisation approaches.

6.6.1 Example

For our running example, we will assume that we are happy with our specification of the transition matrix for
the hidden states, but that we are not so sure about the appropriate scale parameters to use in our observation
model. We will carry out ML inference for these two parameters. First, define a function that returns the
marginal likelihood for a given vector of scale parameters.

mll = function(theta) {
advance = hmmFilterML(
matrix(c(0.999, 0.001, 0.005, 0.995), ncol=2, byrow=TRUE),
function(yt)
c(dnorm(yt, 0, theta[1]), dcauchy(yt, 0, theta[2]))
)
Reduce(advance, y, list(f=c(0.5, 0.5), ll=0))$ll
}

mll(c(0.015, 0.025))

[1] 7971.837

This works, and returns the likelihood we previously obtained when evaluated at our original guess. Can we
do better than this? Let’s see if optim can find anything better.

optim(c(0.015, 0.025), mll, control=list(fnscale=-1))

$par
[1] 0.01268440 0.02074005

$value
[1] 7992.119

$counts
function gradient
51 NA

$convergence
[1] 0

$message
NULL

It can find parameters that are slightly better than our original guess, so we should probably re-do all of our
previous analyses using these. This is left as an exercise!

6.7 R package for HMMs

We have seen some nice simple illustrative implementations of the key algorithms underpinning the analysis
of HMMs. But they are not very efficient or numerically stable. Many better implementations exist, and
some are available in R via packages on CRAN. The CRAN package HiddenMarkov is of particular note.
If you want to do serious work based on HMMs, it is worth finding out how this package works.

7 Dynamic linear models (DLMs)

7.1 Introduction

The set up here is similar to the previous chapter. We have a “hidden” Markov chain model, Xt , and a
conditionally independent observation process Yt leading to an observed time series y1 , . . . , yn . Again, we
want to use the observations to learn about this hidden process. But this time, instead of a finite state Markov
chain, the hidden process is a linear Gaussian model. Then, analytic tractability requires that the observation
process is also linear and Gaussian. In that case, we can use standard properties of the multivariate normal
distribution to do filtering, smoothing, etc., using straightforward numerical linear algebra.
For a hidden state of dimension p and an observation process of dimension m, the model can be written in
conditional form as

Xt |Xt−1 ∼ N (Gt Xt−1 , Wt ),


Yt |Xt ∼ N (Ft Xt , Vt ), t = 1, 2, . . . , n,

for (possibly time dependent) matrices Gt , Ft , Wt , Vt of appropriate dimensions, all (currently) assumed
known. Note that these matrices are often chosen to be independent of time, in which case we drop the time
subscript and write them G, F, W, V. The model is initialised with

X0 ∼ N (m0 , C0 ),

for prior parameters m0 (a p-vector), C0 (a p × p matrix), also assumed known. We can also write the model
in state-space form as

Xt = Gt Xt−1 + ω t , ω t ∼ N (0, Wt ),
Yt = Ft Xt + ν t , ν t ∼ N (0, Vt ),

from which it is possibly more clear that in the case of time independent G, W, the hidden process evolves as
a VAR(1) model. The observation process is also quite flexible, allowing noisy observation of some linear
transformation of the hidden state. Note that for a univariate observation process (m = 1), corresponding to
a univariate time series, Ft will be row vector representing a linear functional of the (vector) hidden state.

7.2 Filtering

The filtering problem is the computation of (Xt |Y1:t = y1:t ), for t = 1, 2, . . . , n. Since everything in the
problem is linear and Gaussian, the filtering distributions will be too, so we can write

(Xt |Y1:t = y1:t ) ∼ N (mt , Ct ),

for some mt , Ct to be determined. The forward recursions we use to compute these are known as the Kalman
filter.
Since we know m0 , C0 , we assume that we know mt−1 , Ct−1 for some t > 0, and proceed to compute mt ,
Ct . Just as for HMMs, this is conceptually simpler to break down into two steps.
Predict step

Since Xt = Gt Xt−1 + ω t and Xt is conditionally independent of Y1:(t−1) given Xt−1 , we will have
$$\begin{aligned} \mathrm{E}[X_t\,|\,Y_{1:(t-1)}=y_{1:(t-1)}] &= \mathrm{E}\bigl\{\mathrm{E}[X_t\,|\,X_{t-1}]\,\big|\,Y_{1:(t-1)}=y_{1:(t-1)}\bigr\} \\ &= \mathrm{E}\bigl\{G_t X_{t-1}\,\big|\,Y_{1:(t-1)}=y_{1:(t-1)}\bigr\} \\ &= G_t\,\mathrm{E}\bigl\{X_{t-1}\,\big|\,Y_{1:(t-1)}=y_{1:(t-1)}\bigr\} \\ &= G_t m_{t-1}. \end{aligned}$$

Similarly,
$$\begin{aligned} \mathrm{Var}[X_t\,|\,Y_{1:(t-1)}=y_{1:(t-1)}] &= \mathrm{Var}\bigl\{\mathrm{E}[X_t\,|\,X_{t-1}]\,\big|\,Y_{1:(t-1)}=y_{1:(t-1)}\bigr\} + \mathrm{E}\bigl\{\mathrm{Var}[X_t\,|\,X_{t-1}]\,\big|\,Y_{1:(t-1)}=y_{1:(t-1)}\bigr\} \\ &= \mathrm{Var}\bigl\{G_t X_{t-1}\,\big|\,Y_{1:(t-1)}=y_{1:(t-1)}\bigr\} + \mathrm{E}\bigl\{W_t\,\big|\,Y_{1:(t-1)}=y_{1:(t-1)}\bigr\} \\ &= G_t C_{t-1} G_t^\top + W_t. \end{aligned}$$

We can therefore write
$$(X_t\,|\,Y_{1:(t-1)}=y_{1:(t-1)}) \sim N(\tilde{m}_t,\ \tilde{C}_t),$$
where we define
$$\tilde{m}_t = G_t m_{t-1},\qquad \tilde{C}_t = G_t C_{t-1} G_t^\top + W_t.$$

Update step
Using Yt = Ft Xt + νt and some basic linear normal properties, we deduce
$$\left.\begin{pmatrix}X_t\\ Y_t\end{pmatrix}\,\right|\,Y_{1:(t-1)}=y_{1:(t-1)} \sim N\!\left(\begin{pmatrix}\tilde{m}_t\\ F_t\tilde{m}_t\end{pmatrix},\ \begin{pmatrix}\tilde{C}_t & \tilde{C}_t F_t^\top\\ F_t\tilde{C}_t & F_t\tilde{C}_t F_t^\top + V_t\end{pmatrix}\right).$$

If we define $f_t = F_t\tilde{m}_t$ and $Q_t = F_t\tilde{C}_t F_t^\top + V_t$, this simplifies to
$$\left.\begin{pmatrix}X_t\\ Y_t\end{pmatrix}\,\right|\,Y_{1:(t-1)}=y_{1:(t-1)} \sim N\!\left(\begin{pmatrix}\tilde{m}_t\\ f_t\end{pmatrix},\ \begin{pmatrix}\tilde{C}_t & \tilde{C}_t F_t^\top\\ F_t\tilde{C}_t & Q_t\end{pmatrix}\right).$$

We can now use the multivariate normal conditioning formula to obtain
$$m_t = \tilde{m}_t + \tilde{C}_t F_t^\top Q_t^{-1}[y_t - f_t],\qquad C_t = \tilde{C}_t - \tilde{C}_t F_t^\top Q_t^{-1} F_t\tilde{C}_t.$$

The weight that is applied to the discrepancy between the new observation and its forecast is often referred to
as the (optimal) Kalman gain, and denoted Kt . So if we define
$$K_t = \tilde{C}_t F_t^\top Q_t^{-1}$$
we get
$$m_t = \tilde{m}_t + K_t[y_t - f_t],\qquad C_t = \tilde{C}_t - K_t Q_t K_t^\top.$$

7.2.1 The Kalman filter

Let us now summarise the steps of the Kalman filter. Our input at time t is mt−1 , Ct−1 and yt . We then
compute:

• $\tilde{m}_t = G_t m_{t-1}$, $\tilde{C}_t = G_t C_{t-1} G_t^\top + W_t$
• $f_t = F_t\tilde{m}_t$, $Q_t = F_t\tilde{C}_t F_t^\top + V_t$
• $K_t = \tilde{C}_t F_t^\top Q_t^{-1}$
• $m_t = \tilde{m}_t + K_t[y_t - f_t]$, $C_t = \tilde{C}_t - K_t Q_t K_t^\top$.

Note that the only inversion involves the m × m matrix Qt (there are no p × p inversions). So, in the case of
a univariate time series, no matrix inversions are required.
Also note that it is easy to handle missing data within the Kalman filter by carrying out the predict step and
skipping the update step. If the observation at time t is missing, simply compute m̃t and C̃t as usual, but then
skip this update step, since there is no observation to condition on, returning m̃t and C̃t as mt and Ct .
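A sketch of how this might look in code, anticipating the time-invariant implementation in the example below (the helper name is made up, and a missing observation is flagged by NA):

kFilterNA = function(G, W, F, V)
  function(mC, y) {
    m = G %*% mC$m                             # predict step
    C = (G %*% mC$C %*% t(G)) + W
    if (any(is.na(y))) return(list(m=m, C=C))  # missing observation: skip the update
    f = F %*% m                                # update step
    Q = (F %*% C %*% t(F)) + V
    K = t(solve(Q, F %*% C))
    list(m = m + (K %*% (y - f)), C = C - (K %*% Q %*% t(K)))
  }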

7.2.2 Example

To keep things simple, we will just implement the Kalman filter for time-invariant G, W, F, V.

kFilter = function(G, W, F, V)
function(mC, y) {
m = G %*% mC$m
C = (G %*% mC$C %*% t(G)) + W
f = F %*% m
Q = (F %*% C %*% t(F)) + V
K = t(solve(Q, F %*% C))
m = m + (K %*% (y - f))
C = C - (K %*% Q %*% t(K))
list(m=m, C=C)
}

So, if we provide G, W, F, V to the kFilter function, we will get back a function for carrying out one step
of the Kalman filter. As a simple illustration, suppose that we are interested in the long-term trend in the
Southern Oscillation Index (SOI).

library(astsa)
tsplot(soi, col=4)
[Figure: Southern Oscillation Index (SOI) series.]

We suppose that a simple random walk model is appropriate for the (hidden) long term trend (so G = 1, and
note that the implied VAR(1) model is not stationary), and that the observations are simply noisy observations
of that hidden trend (so F = 1). To complete the model, we suppose that the monthly change in the long term
trend has standard deviation 0.01, and that the noise variance associated with the observations has standard
deviation 0.5. We can now create the filter and apply it, using the Reduce function and initialising with
m0 = 0 and C0 = 100.

advance = kFilter(1, 0.01^2, 1, 0.5^2)


Reduce(advance, soi, list(m=0, C=100))

$m
[,1]
[1,] -0.03453493

$C
[,1]
[1,] 0.00495025

This gives us mn and Cn , which might be what we want if we are simply interested in forecasting the future
(see next section). However, more likely we want to keep the full set of filtered states and plot them over the
data.

fs = Reduce(advance, soi, list(m=0, C=100), acc=TRUE)


fsm = sapply(fs, function(s) s$m)
fsTs = ts(fsm[-1], start=start(soi), freq=frequency(soi))
tsplot(soi, col=4)
lines(fsTs, col=2, lwd=2)
[Figure: SOI series with the filtered state estimates overlaid (red).]

We will look at more interesting DLM models in the next chapter.

7.3 Forecasting

Making forecasts from a DLM is straightforward, since it corresponds to the computation of predictions, and
we have already examined this problem in the context of the predict step of a Kalman filter. Suppose that
we have a time series of length n, and have run a Kalman filter over the whole data set, culminating in the
computation of mn and Cn . We can now define our k-step ahead forecast distributions as

(Xn+k |Y1:n = y1:n ) ∼ N (m̂n (k), Cn (k))


(Yn+k |Y1:n = y1:n ) ∼ N (f̂n (k), Qn (k)),

for moments m̂n (k), Cn (k), f̂n (k), Qn (k) to be determined. Starting from m̂n (0) = mn and Cn (0) = Cn ,
we can recursively compute subsequent forecasts for the hidden states via

$$\hat{m}_n(k) = G_{n+k}\hat{m}_n(k-1),\qquad C_n(k) = G_{n+k}C_n(k-1)G_{n+k}^\top + W_{n+k}.$$

Once we have a forecast for the hidden state, we can compute the moments of the forecast distribution via

$$\hat{f}_n(k) = F_{n+k}\hat{m}_n(k),\qquad Q_n(k) = F_{n+k}C_n(k)F_{n+k}^\top + V_{n+k}.$$
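A minimal sketch of these recursions for the time-invariant case is given below (the helper name is made up; m and C are the filtered moments mn and Cn from the forward pass, and a univariate observation process is assumed). For the SOI example it could be applied to the final filtered moments computed earlier.

kForecast = function(G, W, F, V, m, C, k) {
  f = numeric(k); Q = numeric(k)
  for (i in 1:k) {
    m = G %*% m; C = (G %*% C %*% t(G)) + W  # forecast the hidden state
    f[i] = F %*% m                           # forecast mean for the observation
    Q[i] = (F %*% C %*% t(F)) + V            # forecast variance for the observation
  }
  list(f=f, Q=Q)
}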

7.4 Marginal likelihood

For many applications (including parameter estimation) it is useful to know the marginal likelihood of the
data. We can factorise this as
$$L(y_{1:n}) = \prod_{t=1}^{n}\pi(y_t\,|\,y_{1:(t-1)}),$$
but as part of the Kalman filter we compute the component distributions as

π(yt |y1:(t−1) ) = N (yt ; ft , Qt ),

so we can compute the marginal likelihood as part of the Kalman filter at little extra cost. In practice we
compute the log-likelihood
$$\ell(y_{1:n}) = \sum_{t=1}^{n}\log N(y_t;\, f_t, Q_t).$$
We might use a function (such as mvtnorm::dmvnorm) to evaluate the log density of a multivariate
normal, but if not, we can write it explicitly in the form
$$\ell(y_{1:n}) = -\frac{mn}{2}\log 2\pi - \frac{1}{2}\sum_{t=1}^{n}\Bigl[\log|Q_t| + (y_t-f_t)^\top Q_t^{-1}(y_t-f_t)\Bigr].$$

Note that this is very straightforward in the case of a univariate time series (scalar Qt ), but more generally,
there is a good way to evaluate this using the Cholesky decomposition of Qt . The details are not important
for this course.

7.4.1 Example

We can easily modify our Kalman filter function to compute the marginal likelihood, and apply it to our
example problem as follows.

kFilterML = function(G, W, F, V)
function(mCl, y) {
m = G %*% mCl$m

C = (G %*% mCl$C %*% t(G)) + W
f = F %*% m
Q = (F %*% C %*% t(F)) + V
ll = mvtnorm::dmvnorm(y, f, Q, log=TRUE)
K = t(solve(Q, F %*% C))
m = m + (K %*% (y - f))
C = C - (K %*% Q %*% t(K))
list(m=m, C=C, ll=mCl$ll+ll)
}

advance = kFilterML(1, 0.01^2, 1, 0.5^2)


Reduce(advance, soi, list(m=0, C=100, ll=0))

$m
[,1]
[1,] -0.03453493

$C
[,1]
[1,] 0.00495025

$ll
[1] -237.2907

7.5 Smoothing

For the smoothing problem, we want to calculate the marginal at each time given all of the data,

Xt |Y1:n = y1:n , t = 1, 2, . . . , n.

We will use a backward recursion to achieve this, known in this context as the Rauch-Tung-Striebel (RTS)
smoother. Again, since everything is linear and Gaussian, the marginals will be, so we have

(Xt |Y1:n = y1:n ) ∼ N (st , St ),

for st , St to be determined. We know from the forward filter that sn = mn and Sn = Cn , so we assume that
we know st+1 , St+1 for some t < n and that we want to know st , St .
Since Xt+1 = Gt+1 Xt + ωt+1, we deduce that
$$\left.\begin{pmatrix}X_t\\ X_{t+1}\end{pmatrix}\,\right|\,Y_{1:t}=y_{1:t} \sim N\!\left(\begin{pmatrix}m_t\\ \tilde{m}_{t+1}\end{pmatrix},\ \begin{pmatrix}C_t & C_t G_{t+1}^\top\\ G_{t+1}C_t & \tilde{C}_{t+1}\end{pmatrix}\right).$$

We can use the normal conditioning formula to deduce that
$$(X_t\,|\,X_{t+1}, Y_{1:t}=y_{1:t}) \sim N\!\left(m_t + C_t G_{t+1}^\top\tilde{C}_{t+1}^{-1}[X_{t+1}-\tilde{m}_{t+1}],\ \ C_t - C_t G_{t+1}^\top\tilde{C}_{t+1}^{-1}G_{t+1}C_t\right).$$

If we define the smoothing gain $L_t = C_t G_{t+1}^\top\tilde{C}_{t+1}^{-1}$, then this becomes
$$(X_t\,|\,X_{t+1}, Y_{1:t}=y_{1:t}) \sim N\!\left(m_t + L_t[X_{t+1}-\tilde{m}_{t+1}],\ \ C_t - L_t\tilde{C}_{t+1}L_t^\top\right).$$

But now, since Y(t+1):n is conditionally independent of Xt given Xt+1 , this distribution is also the distri-
bution of (Xt |Xt+1 , Y1:n = y1:n ). So now we can compute (Xt |Y1:n = y1:n ) by marginalising out Xt+1 .
First,

$$\begin{aligned} s_t &= \mathrm{E}[X_t\,|\,Y_{1:n}=y_{1:n}] \\ &= \mathrm{E}\bigl\{\mathrm{E}[X_t\,|\,X_{t+1}, Y_{1:n}=y_{1:n}]\,\big|\,Y_{1:n}=y_{1:n}\bigr\} \\ &= \mathrm{E}\bigl\{m_t + L_t[X_{t+1}-\tilde{m}_{t+1}]\,\big|\,Y_{1:n}=y_{1:n}\bigr\} \\ &= m_t + L_t[s_{t+1}-\tilde{m}_{t+1}]. \end{aligned}$$

Now,

$$\begin{aligned} S_t &= \mathrm{Var}[X_t\,|\,Y_{1:n}=y_{1:n}] \\ &= \mathrm{E}\bigl\{\mathrm{Var}[X_t\,|\,X_{t+1}, Y_{1:n}=y_{1:n}]\,\big|\,Y_{1:n}=y_{1:n}\bigr\} + \mathrm{Var}\bigl\{\mathrm{E}[X_t\,|\,X_{t+1}, Y_{1:n}=y_{1:n}]\,\big|\,Y_{1:n}=y_{1:n}\bigr\} \\ &= \mathrm{E}\bigl\{C_t - L_t\tilde{C}_{t+1}L_t^\top\,\big|\,Y_{1:n}=y_{1:n}\bigr\} + \mathrm{Var}\bigl\{m_t + L_t[X_{t+1}-\tilde{m}_{t+1}]\,\big|\,Y_{1:n}=y_{1:n}\bigr\} \\ &= C_t - L_t\tilde{C}_{t+1}L_t^\top + L_t S_{t+1}L_t^\top \\ &= C_t + L_t[S_{t+1}-\tilde{C}_{t+1}]L_t^\top. \end{aligned}$$

7.5.1 The RTS smoother

To summarise, we receive as input st+1 and St+1 , and compute


$$\begin{aligned} L_t &= C_t G_{t+1}^\top\tilde{C}_{t+1}^{-1} \\ s_t &= m_t + L_t[s_{t+1}-\tilde{m}_{t+1}] \\ S_t &= C_t + L_t[S_{t+1}-\tilde{C}_{t+1}]L_t^\top, \end{aligned}$$

re-using computations from the forward Kalman filter as appropriate. Note that the RTS smoother does
involve the inversion of a p × p matrix.
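One backward step could be implemented along the following lines (a sketch with made-up names, assuming time-invariant G; mPred1 and CPred1 denote m̃t+1 and C̃t+1 stored during the forward pass):

rtsStep = function(G, m, C, mPred1, CPred1, s1, S1) {
  L = C %*% t(G) %*% solve(CPred1)      # smoothing gain L_t
  s = m + L %*% (s1 - mPred1)           # smoothed mean s_t
  S = C + L %*% (S1 - CPred1) %*% t(L)  # smoothed variance S_t
  list(s=s, S=S)
}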

7.6 Sampling

If we want to understand the joint distribution of the hidden states given the data, generating samples from it
will be useful. So, we want to generate exact samples from

(X1:n |Y1:n = y1:n ).

Just as for HMMs, it is convenient to do this backwards, starting from the final hidden state. We assume that
we have run a Kalman filter forward, but now do sampling on the backward pass. This strategy is known
in this context as forward filtering, backward sampling (FFBS). We know that the distribution of the final
hidden state is
(Xn |Y1:n = y1:n ) ∼ N (mn , Cn ),
so we can sample from this to obtain a realisation of xn . So now assume that we are given xt+1 for some
t < n, and want to simulate an appropriate xt . From the smoothing analysis, we know that
 
$$(X_t\,|\,X_{t+1}=x_{t+1}, Y_{1:n}=y_{1:n}) \sim N\!\left(m_t + L_t[x_{t+1}-\tilde{m}_{t+1}],\ \ C_t - L_t\tilde{C}_{t+1}L_t^\top\right).$$

But since Xt is conditionally independent of all future Xs , s > t + 1, given Xt+1 , this is also the distribution
of (Xt |X(t+1):n = x(t+1):n , Y1:n = y1:n ) and so we just sample from this, and so on, in a backward pass.
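A single backward sampling step might then look like the following sketch (made-up names, time-invariant G, with mPred1 and CPred1 again denoting m̃t+1 and C̃t+1 from the forward pass):

ffbsStep = function(G, m, C, mPred1, CPred1, x1) {
  L = C %*% t(G) %*% solve(CPred1)         # smoothing gain L_t
  mu = as.vector(m + L %*% (x1 - mPred1))  # conditional mean given x_{t+1}
  Sig = C - L %*% CPred1 %*% t(L)          # conditional variance
  as.vector(mvtnorm::rmvnorm(1, mu, Sig))  # draw x_t
}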

7.7 Parameter estimation

Just as for HMMs, there are a wide variety of Bayesian and likelihood-based approaches to parameter
estimation. Here, again, we will keep things simple and illustrate basic numerical optimisation of the
marginal log-likelihood of the data.

7.7.1 Example

For our SOI running example, suppose that we are happy with the model structure, but uncertain about the
variance parameters V and W. We can do maximum likelihood estimation for those two parameters by first
creating a function to evaluate the likelihood for a given parameter combination.

mll = function(wv) {
    advance = kFilterML(1, wv[1], 1, wv[2])
    Reduce(advance, soi, list(m=0, C=100, ll=0))$ll
}

mll(c(0.01^2, 0.5^2))

[1] -237.2907

We can see that it reproduces the same log-likelihood as we got before when we evaluate it at our previous
guess for V and W. But can we do better than this? Let’s see what optim can do.

optim(c(0.01^2, 0.5^2), mll, control=list(fnscale=-1))

$par
[1] 0.05696905 0.03029240

$value
[1] -144.0333

$counts
function gradient
91 NA

$convergence
[1] 0

$message
NULL

So optim has found parameters with a much higher likelihood, corresponding to a much better fitting model.
Note that these optimised parameters apportion the noise more equally between the hidden states and the
observation process, and so lead to smoothing the hidden state much less. This may correspond to a better fit,
but perhaps not to the fit that we really want. In the context of Bayesian inference, a strong prior on V and W
could help to regularise the problem.

7.8 R package for DLMs

We have seen that it is quite simple to implement the Kalman filter and related computations such as the
marginal log-likelihood of the data. However, a very naive implementation such as we have given may not be
the most efficient, nor the most numerically stable. Many different approaches to implementing the Kalman
filter have been developed over the years, which are mathematically equivalent to the basic Kalman filter
recursions, but are more efficient and/or more numerically stable than a naive approach. The dlm R package is
an implementation based on the singular value decomposition (SVD), which is significantly more numerically
stable than a naive implementation. We will use this package for the study of more interesting DLMs in
Chapter 8. More information about this package can be obtained with help(package="dlm") and
vignette("dlm", package="dlm"). It is described more fully in Petris, Petrone, and Campagnoli
(2009). We can fit our simple DLM to the SOI data as follows.

library(dlm)
mod = dlm(FF=1, GG=1, V=0.5^2, W=0.01^2, m0=0, C0=100)
fs = dlmFilter(soi, mod)
tsplot(soi, col=4, main="Filtered states")
lines(fs$m, col=2, lwd=2)

[Figure: "Filtered states": the soi series with the filtered state means overlaid, plotted against Time.]

We can also compute the smoothed states.

ss = dlmSmooth(soi, mod)
tsplot(soi, col=4, main="Smoothed states")
lines(ss$s, col=2, lwd=2)

[Figure: "Smoothed states": the soi series with the smoothed state means overlaid, plotted against Time.]

These are very smooth. We can also optimise the parameters, similar to the approach we have already seen.
But note that the dlmMLE optimiser requires that the vector of parameters being optimised is unconstrained,
so we use exp and log to map back and forth between a constrained and unconstrained space.

buildMod = function(lwv)
    dlm(FF=1, GG=1, V=exp(lwv[2]), W=exp(lwv[1]),
        m0=0, C0=100)

opt = dlmMLE(soi, parm=c(log(0.5^2), log(0.01^2)),
             build=buildMod)
opt

$par
[1] -2.865240 -3.496717

$value
[1] -272.2459

$counts
function gradient
35 35

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

exp(opt$par)

[1] 0.05696943 0.03029668

After mapping back to the constrained space, we see that we get the same optimised values as before (the
log-likelihood is different, but that is just due to dropping of unimportant constant terms). We can check the
smoothed states for this optimised model as follows.

mod = buildMod(opt$par)
ss = dlmSmooth(soi, mod)
tsplot(soi, col=4, main="Smoothed states")
lines(ss$s, col=2, lwd=1.5)

[Figure: "Smoothed states": the soi series with the smoothed state means for the optimised model overlaid.]

This confirms that the smoothed states for the optimised model are not very smooth. This suggests that our
simple random walk model for the data is perhaps not very good.

8 State space modelling

8.1 Introduction

In Chapter 7 we introduced the DLM and showed how a model of DLM form could be used for filtering,
smoothing, forecasting, etc. However, we haven’t yet discussed how to go about “building” an appropriate
DLM model (in particular, specifying Gt , Ft , Wt , Vt ) for a given time series. That is the topic of this chapter.
We will use the dlm R package to illustrate the concepts. We will start by looking at some simple models for
long-term trends, then consider seasonal effects, then the problem of combining trends and seasonal effects,
before going on to think about the incorporation of ARMA components. We will focus mainly on (time
invariant models for) univariate time series, but the principles are quite similar for multivariate time series.
Our main running example will be (the log of) J&J earnings, considered briefly in Chapter 1.

library(astsa)
library(dlm)
y = log(jj)
tsplot(y, col=4, lwd=2, main="Log of J&J earnings")

[Figure: "Log of J&J earnings": the log earnings series y, plotted against Time.]

8.2 Polynomial trend

8.2.1 Locally constant model

In the previous chapter we saw that the choice G = 1, F = 1 corresponds to noisy observation of a Gaussian
random walk. This random walk model for the underlying (hidden) state is known as the locally constant
model in the context of DLMs, and sometimes (for reasons to become clear), as the first order polynomial
trend model. Such a model is clearly inappropriate for the J&J example, since the trend appears to be (at
least, locally) linear.

8.2.2 Locally linear model

We can model a time series with a locally linear trend by choosing F = (1, 0),
$$G = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad W = \begin{pmatrix} w_1 & 0 \\ 0 & w_2 \end{pmatrix}.$$

This is known as the locally linear model, or sometimes as the second order polynomial trend model. To see
why this corresponds to a linear trend, it is perhaps helpful to write the hidden state vector as Xt = (µt , τt )⊤ .
So then yt is clearly just a noisy observation of µt (the current level), and µt = µt−1 + τt−1 + ω1,t , so the
current level increases by a systematic amount τt−1 (the trend), and is also corrupted by noise. On the other
hand, τt = τt−1 + ω2,t is just a random walk (typically with very small variance, so that the trend changes
very slowly).
We can create such a model using the dlm::dlmModPoly function, by providing the order required (2 for
locally linear), and the diagonal of V and W (off-diagonal elements are assumed to be zero).

mod = dlmModPoly(2, dV=0.1^2, dW=c(0.01^2, 0.01^2))


mod

$FF
[,1] [,2]
[1,] 1 0

$V
[,1]
[1,] 0.01

$GG
[,1] [,2]
[1,] 1 1
[2,] 0 1

$W
[,1] [,2]
[1,] 1e-04 0e+00
[2,] 0e+00 1e-04

$m0
[1] 0 0

$C0
[,1] [,2]
[1,] 1e+07 0e+00
[2,] 0e+00 1e+07

ss = dlmSmooth(y, mod)
tsplot(y, col=4, lwd=2, main="Log of J&J earnings")
lines(ss$s[,1], col=2, lwd=2)

[Figure: "Log of J&J earnings": the series with the smoothed level overlaid.]

This captures the trend quite well, but not the strong seasonal effect. We will return to this issue shortly.

8.2.3 Higher-order polynomial models

This strategy extends straightforwardly to higher order polynomials. For example, a locally quadratic (order
3) model can be constructed using F = (1, 0, 0) and
 
$$G = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}.$$

A diagonal W is typically adopted, sometimes with some of the diagonal elements set to zero. Models beyond
order 3 are rarely used in practice.

8.3 Seasonal models

Many time series exhibit periodic behaviour with known period. There are two commonly used approaches to modelling such seasonal behaviour using DLMs. Both involve using a cyclic matrix, G, with period s (e.g. s = 12 for monthly data), so that G^s = I. We will look briefly at both.

8.3.1 Seasonal effects

Arguably the most straightforward way to model seasonality is to explicitly model s separate seasonal
effects, and rotate and evolve them, as required. So, for a purely seasonal model, call the s seasonal effects
α1 , α2 , . . . , αs , and make these the components of the state vector, Xt . Then choosing the 1 × s matrix
F = (1, 0, . . . , 0) and G the s × s cyclic permutation matrix
$$G = \begin{pmatrix} 0^\top & 1 \\ I & 0 \end{pmatrix}$$

gives the basic evolution structure. For example, in the case s = 4 (quarterly data), we have

$$G = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix},$$

and so

$$G \begin{pmatrix} \alpha_4 \\ \alpha_3 \\ \alpha_2 \\ \alpha_1 \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \alpha_4 \\ \alpha_3 \\ \alpha_2 \end{pmatrix},$$
and the seasonal effects get rotated through the state vector each time, repeating after s time points. If we
want the seasonal effects to be able to evolve gradually over time, we can allow this by choosing (say)
W = diag{w, 0, . . . , 0},
for some w > 0.
This structure would be fine for a purely seasonal model, but in practice a seasonal component is often used as
part of a larger model that also models the overall mean of the process. In this case there is an identifiability
issue, exactly analogous to the problem that arises in linear models with a categorical covariate. Adding a
constant value to each of the seasonal effects and subtracting the same value from the overall mean leads to
exactly the same model. So we need to constrain the effects to maintain identifiability of the overall mean.
This is typically done by ensuring that the effects sum to zero. There are many ways that such a constraint
could be encoded with a DLM, but one common approach is to notice that in this case there are only s − 1
degrees of freedom for the effects (eg. the final effect is just minus the sum of the others), and so we can
model the effects with an s − 1 dimensional state vector. We can then update this with the (s − 1) × (s − 1)
evolution matrix
$$G = \begin{pmatrix} -1^\top & -1 \\ I & 0 \end{pmatrix}.$$
We will illustrate for s = 4. Suppose that our hidden state is Xt = (α4 , α3 , α2 )⊤ . We know that we can
reconstruct the final effect, α1 = −(α4 + α3 + α2 ), so this vector captures all of our seasonal effects. But
now

$$\begin{pmatrix} -1 & -1 & -1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} \alpha_4 \\ \alpha_3 \\ \alpha_2 \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \alpha_4 \\ \alpha_3 \end{pmatrix},$$
and again, the final effect (now α2 ) can be reconstructed from the rest. This approach also rotates through
our s seasonal effects, but now constrains the sum of the effects to be fixed at zero. This is how the function
dlm::dlmModSeas constructs a seasonal model.
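As an illustration (this check is not part of the notes' running example), we can apply the constrained evolution matrix for s = 4 to an arbitrary vector of effects and confirm that it cycles back to its starting value after four steps; the GG matrix built by dlm::dlmModSeas should match.

G = rbind(c(-1, -1, -1), c(1, 0, 0), c(0, 1, 0))
x = c(3, -1, 2)                          # (alpha4, alpha3, alpha2); alpha1 = -4 implicitly
Reduce(function(x, i) G %*% x, 1:4, x)   # rotates the effects, returning to the start after s = 4 steps
dlmModSeas(4, dV=0, dW=c(1, 0, 0))$GG    # should be the same matrix as G above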

8.3.2 Fourier components

An alternative approach to modelling seasonal effects, which can be more parsimonious, is to use Fourier
components. Start by assuming that we want to model our seasonal effects with a single sine wave of period
s. Putting ω = 2π/s, we can use the rotation matrix
$$G = \begin{pmatrix} \cos\omega & \sin\omega \\ -\sin\omega & \cos\omega \end{pmatrix}$$

and the observation matrix F = (1, 0). Applying G repeatedly to a 2-vector Xt will give an oscillation of
period s, with phase and amplitude determined by the components of Xt . This gives a very parsimonious way
of modelling seasonal effects, but may be too simple. We can add in other Fourier components by defining
$$H_k = \begin{pmatrix} \cos k\omega & \sin k\omega \\ -\sin k\omega & \cos k\omega \end{pmatrix}, \quad k = 1, 2, \ldots,$$

(so Hk = G^k ) and setting F = (1, 0, 1, 0, . . .),

$$G = \text{blockdiag}\{H_1, H_2, \ldots\}.$$

In principle we can go up to k = s/2. Then we will have s degrees of freedom, and we will essentially
be just representing our seasonal effects by their discrete Fourier transform. Typically we will choose k
somewhat smaller than this, omitting higher frequencies in order to obtain a smoother, more parsimonious
representation. We can use a diagonal W matrix in order to allow the phase and amplitude of each harmonic
to slowly evolve. This is how the dlm::dlmModTrig function constructs a seasonal model.
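To make the block structure concrete, here is a small illustrative sketch of the Fourier-form evolution matrix for monthly data (s = 12) with two harmonics; dlm::dlmModTrig(12, 2, ...) should build an equivalent model, possibly with different internal conventions.

s = 12; omega = 2*pi/s
H = function(k)                  # rotation block for the kth harmonic
    matrix(c(cos(k*omega), -sin(k*omega),
             sin(k*omega),  cos(k*omega)), 2, 2)
G = rbind(cbind(H(1), matrix(0, 2, 2)),
          cbind(matrix(0, 2, 2), H(2)))    # blockdiag{H1, H2}
FF = matrix(c(1, 0, 1, 0), nrow=1)         # picks out the first component of each block
round(Reduce(`%*%`, rep(list(G), s)), 10)  # G^12 = I, so the implied seasonal pattern has period 12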

8.4 Model superposition

We can combine DLM models to make more complex DLMs in various ways. One useful approach is
typically known as model superposition. Suppose that we have two DLM models with common observation
dimension, m, and state dimensions p1 , p2 , respectively. If model i is {Git , Fit , Wit , Vit }, then the “sum” of
the two models can be defined by
$$\big\{ \text{blockdiag}\{G_{1t}, G_{2t}\},\; (F_{1t}, F_{2t}),\; \text{blockdiag}\{W_{1t}, W_{2t}\},\; V_{1t} + V_{2t} \big\},$$

having observation dimension m and state dimension p1 + p2 . It is clear that this “sum” operation can be
applied to more than two models with observation dimension m, and that the “sum” operation is associative,
so DLM models of this form comprise a semigroup wrt this operation. Combining models in this way
provides a convenient way of building models including both overall trends as well as seasonal effects. This
sum operation can be used to combine DLM models in the dlm package by using the + operator.
For our running example, we can fit a model including a locally linear trend and a quarterly seasonal effect as
follows.

mod = dlmModPoly(2, dV=0.1^2, dW=c(0.01^2, 0.01^2)) +
      dlmModSeas(4, dV=0, dW=c(0.02^2, 0, 0))
mod

$FF
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 0 0

$V
[,1]
[1,] 0.01

$GG
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 0 0 0
[2,] 0 1 0 0 0
[3,] 0 0 -1 -1 -1
[4,] 0 0 1 0 0
[5,] 0 0 0 1 0

$W
[,1] [,2] [,3] [,4] [,5]
[1,] 1e-04 0e+00 0e+00 0 0
[2,] 0e+00 1e-04 0e+00 0 0
[3,] 0e+00 0e+00 4e-04 0 0

[4,] 0e+00 0e+00 0e+00 0 0
[5,] 0e+00 0e+00 0e+00 0 0

$m0
[1] 0 0 0 0 0

$C0
[,1] [,2] [,3] [,4] [,5]
[1,] 1e+07 0e+00 0e+00 0e+00 0e+00
[2,] 0e+00 1e+07 0e+00 0e+00 0e+00
[3,] 0e+00 0e+00 1e+07 0e+00 0e+00
[4,] 0e+00 0e+00 0e+00 1e+07 0e+00
[5,] 0e+00 0e+00 0e+00 0e+00 1e+07

ss = dlmSmooth(y, mod) # smoothed states


so = ss$s[-1,] %*% t(mod$FF) # smoothed observations
so = ts(so, start=start(y), freq=frequency(y))
tsplot(y, col=4, lwd=2, main="Log of J&J earnings")
lines(ss$s[,1], col=3, lwd=1.5)
lines(so, col=2, lwd=1.2)

[Figure: "Log of J&J earnings": the series with the smoothed level and smoothed observations overlaid.]

Once we are happy that the model is a good fit to the data, we can use it for forecasting.

fit = dlmFilter(y, mod)
fore = dlmForecast(fit, 16) # forecast 16 time points
pred = ts(c(tail(y, 1), fore$f),
          start=end(y), frequency=frequency(y))
upper = ts(c(tail(y, 1), fore$f + 2*sqrt(unlist(fore$Q))),
           start=end(y), frequency=frequency(y))
lower = ts(c(tail(y, 1), fore$f - 2*sqrt(unlist(fore$Q))),
           start=end(y), frequency=frequency(y))
all = ts(c(y, upper[-1]), start=start(y),
         frequency=frequency(y))
tsplot(all, ylab="log(earnings)",
       main="Forecasts with 2SD intervals")
lines(y, col=4, lwd=1.5)
lines(pred, col=2, lwd=2)
lines(upper, col=2)
lines(lower, col=2)

[Figure: "Forecasts with 2SD intervals": log(earnings) with the forecasts and 2SD bands appended to the data.]

8.4.1 Example: monthly births

For an additional example, we will look at monthly live birth data for the US.

y = birth
tsplot(y, col=4, lwd=1.5, main="Monthly births")

[Figure: "Monthly births": the monthly births series y, plotted against Time.]

This seems to have a random walk nature, but with a strong seasonal component. We will assume that we
do not want to use a full seasonal model, and instead use a seasonal model based on Fourier components
with two harmonics. We may be unsure as to what variances to assume, and so want to estimate those using
maximum likelihood. We could fit this using the dlm package as follows.

buildMod = function(lpar)
    dlmModPoly(1, dV=exp(lpar[1]), dW=exp(lpar[2])) +
    dlmModTrig(12, 2, dV=0, dW=rep(exp(lpar[3]), 4))
opt = dlmMLE(y, parm=log(c(100, 1, 1)),
             build=buildMod)
opt

$par
[1] 4.482990 1.925763 -3.228793

$value
[1] 1116.91

$counts
function gradient
14 14

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

mod = buildMod(opt$par)
ss = dlmSmooth(y, mod)
so = ss$s[-1,] %*% t(mod$FF) # smoothed observations
so = ts(so, start(y), frequency = frequency(y))

tsplot(y, col=4, lwd=1.5, main="Monthly births")
lines(ss$s[,1], col=3, lwd=1.5)
lines(so, col=2, lwd=1.2)

[Figure: "Monthly births": the series with the smoothed level and smoothed observations overlaid.]

The optimised parameters seem to correspond to a good fit to the data.

8.5 ARMA models in state space form

We have seen in previous chapters how ARMA models define a useful class of discrete time Gaussian
processes that can effectively capture the kinds of auto-correlation structure that we often see in time series.
They can be useful building blocks in DLM models. To use them in this context, we need to know how to
represent them in state space form.

8.5.1 AR models

In fact, we have already seen a popular way to represent an AR(p) model as a VAR(1), and this is exactly
what we need for state space modelling. So, to build a DLM representing noisy observation of a hidden
AR(p) model we specify F = (1, 0, . . . , 0),
 
$$G = \begin{pmatrix} \phi_1 & \cdots & \phi_{p-1} & \phi_p \\ 1 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 1 & 0 \end{pmatrix},$$
W = diag(σ², 0, . . . , 0), V = (v). As a reminder of why this works, we see that if we define Xt = (Xt , Xt−1 , . . . , Xt−p+1 )⊤ , where Xt is our AR(p) process, then

$$G \mathbf{X}_{t-1} + \begin{pmatrix} \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \mathbf{X}_t,$$

and so the VAR(1) update preserves the stationary distribution of our AR(p) model. However, this VAR(1)
representation of an AR(p) model is by no means unique. In fact, it turns out that replacing the above G
with G⊤ gives a VAR(1) model which also has the required AR(p) model as the marginal distribution of
the first component. This is not so intuitive, but generalises more easily to the ARMA case, so it is worth
understanding. For this case we instead define the state vector to be
 
$$X^\star_t = \begin{pmatrix} X_t \\ \phi_2 X_{t-1} + \phi_3 X_{t-2} + \cdots + \phi_p X_{t-p+1} \\ \vdots \\ \phi_{p-1} X_{t-1} + \phi_p X_{t-2} \\ \phi_p X_{t-1} \end{pmatrix},$$
where Xt is our AR(p) process of interest, and note that only the first component of this state vector has the
distribution we require. But with this definition we have
 
$$G^\top X^\star_{t-1} + \begin{pmatrix} \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{pmatrix} = X^\star_t,$$
and so this VAR(1) update preserves the distribution of this state vector. Consequently, the first component of
the VAR(1) model defined this way will be marginally our AR(p) model of interest. This representation is
commonly encountered in practice, since the same approach works for ARMA models, where it is necessary
to carry information about “old” errors in the state vector.
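As a quick numerical illustration (not part of the notes' examples), we can simulate the VAR(1) recursion based on G⊤ for an AR(2) and check that the first component of the state has the auto-correlation structure of the corresponding AR(2) model. The coefficient values and simulation length are arbitrary.

set.seed(1)
phi = c(0.5, 0.3); p = 2; n = 10000
G = rbind(phi, c(1, 0))       # companion matrix: first row phi, subdiagonal 1
x = matrix(0, n, p)
for (t in 2:n)                # VAR(1) recursion using t(G), with noise only in the first slot
    x[t, ] = t(G) %*% x[t-1, ] + c(rnorm(1), 0)
cbind(sample = acf(x[, 1], 5, plot=FALSE)$acf[-1],
      theory = ARMAacf(ar=phi, lag.max=5)[-1])   # should agree closely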

8.5.2 ARMA models

Representing an arbitrary ARMA model as a VAR(1) is also possible, and again the representation is not
unique. However, the second approach that we used for AR(p) models can be adapted very easily to the
ARMA case. Consider first an ARMA(p, p − 1) model, so that we have one fewer MA coefficients than AR
coefficients. Using the p × p auto-regressive matrix
 
$$G = \begin{pmatrix} \phi_1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{p-1} & 0 & \cdots & 1 \\ \phi_p & 0 & \cdots & 0 \end{pmatrix},$$
together with the error vector

$$\boldsymbol{\varepsilon}_t = \begin{pmatrix} 1 \\ \theta_1 \\ \vdots \\ \theta_{p-1} \end{pmatrix} \varepsilon_t,$$
leads to a hidden state vector with first component marginally the ARMA(p, p − 1) process of interest. Note
that the error covariance matrix is

$$W = \begin{pmatrix} 1 \\ \theta_1 \\ \vdots \\ \theta_{p-1} \end{pmatrix} \begin{pmatrix} 1 \\ \theta_1 \\ \vdots \\ \theta_{p-1} \end{pmatrix}^{\!\top} \sigma^2.$$
To understand why this works, define our state vector to be

$$\mathbf{X}_t = \begin{pmatrix} X_t \\ \phi_2 X_{t-1} + \phi_3 X_{t-2} + \cdots + \phi_p X_{t-p+1} + \theta_1 \varepsilon_t + \cdots + \theta_{p-1}\varepsilon_{t-p+2} \\ \vdots \\ \phi_{p-1} X_{t-1} + \phi_p X_{t-2} + \theta_{p-2}\varepsilon_t + \theta_{p-1}\varepsilon_{t-1} \\ \phi_p X_{t-1} + \theta_{p-1}\varepsilon_t \end{pmatrix}.$$

We can then directly verify that

$$G \mathbf{X}_{t-1} + \boldsymbol{\varepsilon}_t = \mathbf{X}_t,$$
and so this VAR(1) update preserves the distribution of our state vector. Consequently, the first component is
marginally ARMA, and so using F = (1, 0, . . . , 0) will pick out this component for incorporation into the observation model.
We have considered the case of ARMA(p, p − 1), but note that any ARMA(p, q) model can be represented
as an ARMA(p⋆ , p⋆ − 1) model by choosing p⋆ = max{p, q + 1} and padding coefficient vectors with zeros
as required. This is exactly the approach used by the function dlm::dlmModARMA.
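As a small illustrative check of the padding, an ARMA(1, 1) model has p⋆ = max{1, 1 + 1} = 2, so its state space form pads the AR coefficient vector to (ϕ1 , 0). Inspecting the model returned by dlm::dlmModARMA (with arbitrary coefficient values) should show the structure described above.

ma = dlmModARMA(ar=0.8, ma=0.4, sigma2=1, dV=0)
ma$FF   # expect (1, 0)
ma$GG   # expect first column (0.8, 0), with a 1 on the superdiagonal
ma$W    # expect sigma2 * c(1, 0.4) %*% t(c(1, 0.4))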

8.5.3 Example: SOI

Let us return to the SOI example we looked at at the end of Chapter 7. We can model this using a locally
constant model, but also with a seasonal effect. Further, we can investigate any residual correlation structure
by including an AR(2) component. We can fit this as follows.

y = soi
buildMod = function(param)
    dlmModPoly(1, dV=exp(param[1]), dW=exp(param[2])) +
    dlmModARMA(ar=c(param[3], param[4]),
               sigma2=exp(param[5]), dV=0) +
    dlmModTrig(12, 2, dV=0, dW=rep(exp(param[6]), 4))
opt = dlmMLE(y, parm=c(log(0.1^2), log(0.01^2),
                       0.2, 0.1, log(0.1^2), log(0.01^2)), build=buildMod)
opt

$par
[1] -3.100868e+00 -9.242014e+00 8.792923e-01 -7.119263e-06 -4.572246e+00
[6] -1.010190e+01

$value
[1] -310.9818

$counts
function gradient
51 51

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

mod = buildMod(opt$par)
ss = dlmSmooth(y, mod)
so = ss$s[-1,] %*% t(mod$FF) # smoothed observations
so = ts(so, start(y), freq=frequency(y))
tsplot(y, col=4, lwd=1.5, main="SOI")
lines(ss$s[,1], col=3, lwd=1.5)
lines(so, col=2)

[Figure: "SOI": the series with the smoothed level and smoothed observations overlaid.]

This looks to be a good fit, suggesting a gradual slow decreasing trend, but inspection of the optimised
parameters (in particular param[5], the log of σ 2 in the ARMA component), suggests that the AR(2)
component is not necessary.

9 Spatio-temporal models and data

9.1 Introduction

A spatio-temporal data set is just a collection of observations labelled in both time and space. So x(s; t) is an
observation at location s ∈ D at time t ∈ R. The spatial domain, D, is usually a subset of R2 or R3 . You have
seen in the first term that there are a huge range of different kinds of spatial data, and that different models
and methods are appropriate for different situations. The range of different kinds of spatio-temporal models
and data is even greater. We do not have time to explore these in detail now. Here we need to concentrate on
the most commonly encountered form of spatio-temporal data. That is, data consisting of scalar-valued time
series on a regular time grid of length n being observed at a fixed collection of irregularly distributed sites, of
which there are m. We could write x(s) for the time series at site s ∈ D, D = {s1 , . . . , sm }. We assume for
now that we have temporally aligned time series across the sites, leading to a “full grid” of spatio-temporal
data. That is, we have nm scalar observations,

{x(si ; t) | i = 1, . . . , m, t = 1, . . . , n}.

For data of this form on a regular time grid, we often use the notation xt (s) for x(s; t), and sometimes write
xt = (xt (s1 ), . . . , xt (sm ))⊤ for the realisation of the spatial process at time t. We could also write X for
the n × m matrix with (i, j)th element xi (sj ). When the data matrix is arranged this way it is said to be in
“space-wide” format. X⊤ is said to be in “time-wide” format. In practice, it is rare to actually have all nm
observations of this form, but we can often represent our data in this form provided that we are allowed to
have missing data. In general, strategies are needed to deal with missing spatio-temporal observations.
We immediately see that there are two different ways of “slicing” data of this form, and these correspond to
different modelling perspectives. If we adopt the spatial perspective, we regard the data as spatial, but with a
multivariate observation at each site that happens to be a time series. We can then adopt spatial approaches to
model the cross-correlation between the time series at different sites. This spatial perspective underpins many
classical approaches to spatio-temporal modelling, but has limitations that we don’t have time to fully explore
in this module. The alternative dynamic or temporal perspective views the data as a time series, where the
multivariate observation at each time happens to be the realisation of a spatial process. This latter approach is
in many ways more satisfactory, and underpins many modern approaches to spatio-temporal modelling.

9.2 Exploring spatio-temporal data

Before proceeding further, it will be useful to familiarise ourselves with some spatio-temporal data. Our main
running example for this chapter will be the dataset spTimer::NYdata, some air quality data, measured
over time, at a collection of locations across New York. You can find out more about the spTimer package
with help(package="spTimer"). Let’s start by trying to understand the basic structure of the data.

library(astsa)
library(spTimer)
dim(NYdata)

[1] 1736 10

head(NYdata)

s.index Longitude Latitude Year Month Day o8hrmax cMAXTMP WDSP RH


1 1 -73.757 42.681 2006 7 1 53.88 27.85772 5.459953 2.766221
2 1 -73.757 42.681 2006 7 2 57.13 30.11563 8.211767 3.197750
3 1 -73.757 42.681 2006 7 3 72.00 30.00001 4.459581 3.225186
4 1 -73.757 42.681 2006 7 4 36.63 27.89656 3.692225 4.362334
5 1 -73.757 42.681 2006 7 5 42.63 25.65698 4.374314 3.950320
6 1 -73.757 42.681 2006 7 6 30.88 24.61968 4.178086 3.420533

We can see straight away that the data is currently in “long format”, where observations (of several different
variables) are recorded with both a position in space and time. You can find out more about the data with
?NYdata. Let us proceed by finding out more about the sites.

sites = unique(NYdata[,1:3])
dim(sites)

[1] 28 3

numSites = dim(sites)[1]
head(sites)

s.index Longitude Latitude


1 1 -73.757 42.681
63 2 -73.881 40.866
125 3 -79.587 42.291
187 4 -76.802 42.111
249 5 -73.743 41.782
311 6 -78.771 42.993

plot(sites[,2:3], pch=19, col=2, ylim=c(40, 45),
     main="Location of sites across New York")
text(sites[,2:3], labels=sites[,1], pos=1, cex=0.5)

[Figure: "Location of sites across New York": site locations plotted by Longitude and Latitude, labelled by site index.]

So we see that there are 28 sites, scattered irregularly across New York. Let’s just look at the first site.

site1 = NYdata[NYdata$s.index == 1, 7:10]


tsplot(site1, col=2:5, lwd=1.5, main="Site 1")

[Figure: "Site 1": time series panels of o8hrmax, cMAXTMP, WDSP and RH at site 1.]

So the observations in space and time are actually multivariate, with several different variables being measured simultaneously. To keep things simpler, we will focus on just one variable, ozone.

ozone = NYdata[,c("o8hrmax", "s.index")]


ozone = unstack(ozone)
dim(ozone) # "space-wide" format

[1] 62 28

## "full grid" representation - "STF" in "spacetime" terminology
image(as.matrix(ozone),
      main="Ozone data at 28 sites for 62 times",
      xlab="Time", ylab="Site")

[Figure: "Ozone data at 28 sites for 62 times": image of the data matrix, Time by Site.]

We can visualise the data matrix as an image, which shows a few missing observations (in white), but not too
many. This is a fairly complete full-grid spatio-temporal dataset. We can do time series plots for a very small
number of sites at once.

tsplot(ozone[,1:5], col=2:6, lwd=1.5,
       main="Ozone at sites 1 to 5")

[Figure: "Ozone at sites 1 to 5": time series panels X1 to X5.]
But to look at the time series for all sites simultaneously, we need to overlay, and use transparency to stop the
plot from looking too messy.

tsplot(ozone, spaghetti=TRUE, col=rgb(0, 0, 1, 0.2),
       ylab="ozone", main="Time series for all sites")
lines(rowMeans(ozone, na.rm=TRUE), col=2, lwd=1.5)

[Figure: "Time series for all sites": the ozone series overlaid, with the cross-site mean.]

Similarly, we can look in detail at cross-correlations for a small number of time series, but need to revert to
an image to see the full correlation structure.

pairs(ozone[,1:5], pch=19, col=4, cex=0.5)

[Figure: pairwise scatterplots of ozone at sites X1 to X5.]
image(cor(ozone, use="pair")[,numSites:1])

[Image: correlation matrix of ozone across the 28 sites.]

It is clear from this that there is a lot of interesting correlation structure present, but we are not currently
doing anything with the highly relevant context of spatial proximity.

9.3 Spatio-temporal modelling

9.3.1 Spatial models

We begin by taking a more spatial view of spatio-temporal data. So, now we have a time series at a collection
of sites. We could completely ignore the temporal dependence and regard the observations at each time as
being iid realisations of a spatial process, or we could regard time as an additional dimension. We begin with
the former view.

9.3.1.1 Purely spatial models

We begin by ignoring temporal dependence completely, so that observations at each time are independent
realisations of some multivariate distribution representing the spatial process.

9.3.1.1.1 Unconstrained multivariate data


If we know nothing about the spatial context, we can just regard the observations at each time point as being
multivariate normal,
Xt ∼ N (0, Σ),
for unconstrained Σ, after stripping out any mean. Then the sample covariance matrix of the data estimates
Σ.

Sigma = cov(ozone, use="pair")


image(Sigma[,numSites:1])

[Image: sample covariance matrix of ozone across the 28 sites.]

9.3.1.1.2 Spatial covariance


Unconstrained estimation of the covariance matrix is potentially problematic, since it has a very large number of degrees of freedom, and we are not exploiting our known spatial context. So we probably prefer to regard the observations as being iid from a Gaussian process (GP) with a parameterised covariance function. For simplicity, and for illustrative purposes, we will use an isotropic squared exponential covariance function, depending only on the geographical distance between the sites,

$$C(d) = \sigma^2 \exp\{-d^2/a^2\}, \quad \sigma, a > 0,$$

with σ² representing the stationary variance of the GP, and a the length scale. We can encode this in R with

cf = function(param) {
    sig = param[1]; a = param[2]
    function(d)
        sig^2 * exp(-(d/a)^2)
}

The following R command computes the full matrix of geographical distances between the sites, in km, using
WGS84 (an ellipsoidal refinement of the Haversine formula).

distMat = sp::spDists(as.matrix(sites[,2:3]), longlat=TRUE)

We can use this to construct a covariance matrix over the observations with

cm = function(param)
    cf(param)(distMat)

Then we can define a log-likelihood function for the data with

## centre the data
cOzone = sweep(ozone, 2, colMeans(ozone, na.rm=TRUE))

ll = function(param)
    sum(apply(cOzone, 1,
              function(x)
                  mvtnorm::dmvnorm(x, sigma=cm(param), log=TRUE)),
        na.rm=TRUE)

and find the MLE for the covariance function parameters with

opt = optim(c(20, 50), ll, control=list(fnscale=-1))


opt$par

[1] 12.06888 30.16920

image(cm(opt$par)[,numSites:1])
[Image: fitted squared exponential covariance matrix across the sites.]

suggesting a length scale of around 30km for ozone. We also see that the optimal covariance matrix looks
quite different to the sample covariance matrix, and that there is very little correlation between sites that
are not very close together (such as sites 7 and 8). This is good and bad. It is good that it properly spatially
smooths, but the assumption of a common variance across sites is probably not realistic.

9.3.1.2 Space-time Gaussian process models

Ignoring the temporal dependence structure in the data is obviously unsatisfactory for various reasons. If we
persist with our spatial-first perspective, we consider the presence of a time series at each spatial location.
This time series could potentially also be modelled as a GP. If we think that a GP is appropriate for both
spatial and temporal variation, and that the same GP model is appropriate for every time series irrespective of
spatial location, then we are led to modelling the data in space and time as arising jointly from a single GP
with a separable space-time covariance function of the form

C(s, t) = Cs (s)Ct (t),

for given spatial covariance function Cs (·) and temporal covariance function Ct (·). Adopting a separable
covariance function is convenient for multiple reasons. First, if Cs is a valid spatial covariance function and
Ct is a valid temporal covariance function, then their product is guaranteed to be a valid space-time covariance
function. Thus, the imposition of separability greatly simplifies the specification of a valid spatio-temporal
covariance function. Second, the assumption greatly simplifies computation, since then the joint covariance
matrix over all observations can be represented as a Kronecker product of the spatial and temporal covariance
matrices, allowing the avoidance of the construction or inversion of very large matrices.
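As a small sketch of the computational convenience, the covariance matrix of the full space-time data vector under a separable model is a Kronecker product; here the spatial part re-uses the cf covariance function and distance matrix from above, and the (squared exponential) temporal covariance and the number of time points are arbitrary illustrative choices.

Cs = cf(c(12, 30))(distMat)      # spatial covariance matrix (m x m)
tLag = as.matrix(dist(1:5))      # temporal lags for n = 5 time points
Ct = exp(-(tLag/2)^2)            # an illustrative temporal covariance matrix (n x n)
bigC = kronecker(Cs, Ct)         # covariance of the stacked data (time within site)
dim(bigC)                        # (nm) x (nm) = 140 x 140; in practice never formed explicitly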
Much of classical spatio-temporal modelling was built on separable GP models. However, it turns out that
the assumption of separability is very strong, and quite unrealistic for most spatio-temporal data sets. So,
given our limited time, we will abandon this approach, and adopt a more dynamical perspective.

9.3.2 Dynamic models for spatio-temporal data

This term we have studied models for time series, and in particular, ARMA models, and DLMs. Both of
these families of models lead to linear Gaussian systems. They therefore determine Gaussian process models,
and implicitly determine a (not necessarily stationary) covariance structure. However, we don’t specify these
models via their covariance structure. We specify their dynamics, and the dynamics implicitly determines the
covariance structure. There are many advantages to this more dynamical perspective.

9.3.2.1 Random walk model

We will begin with a model that is typically over-simplistic, but we will gradually refine and improve it.
Rather than assuming that our observations are iid, we will assume that they form a random walk

Xt = Xt−1 + ε t , ε t ∼ N (0, Σ).

This non-stationary model will sometimes be appropriate when there is a high degree of persistence in
changes to the levels observed. This model is very simple to fit, since the one-step differences are iid, so we
can put
ε t = Xt − Xt−1 ,
and estimate the parameters of iid ε t as for the iid model. Again, we could have an unconstrained Σ, which
we can estimate using the sample covariance matrix of the one-step differences, or a constrained matrix, with
an explicitly spatial covariance structure, which we can estimate via maximum likelihood, as we have already
seen.

9.3.2.2 VAR(1) models

We can greatly improve on the simple random walk model by allowing VAR(1) dynamics, assuming temporal
evolution of the form
Xt = GXt−1 + ε t , ε t ∼ N (0, Σ),
perhaps following some mean-centring. This model is specified by m × m matrices G and Σ. This model
is very flexible, allowing a range of different stationary and non-stationary dynamics. But the utility of the
model depends crucially on having an appropriate structure for the propagator matrix, G. Clearly, choosing
G = I gives the random walk model we have already considered, so this model class includes the random
walk model as a special case.

9.3.2.2.1 Unconstrained models
Obviously, one possibility is to consider a completely unconstrained G, with elements to be estimated by least
squares (or maximum likelihood). We can use the same least squares approach that we adopted in Chapter 4
by writing our model in the form
X2:n = X1:(n−1) G⊤ + E,
to get the least squares solution

$$\hat{G}^\top = \left( X_{1:(n-1)}^\top X_{1:(n-1)} \right)^{-1} X_{1:(n-1)}^\top X_{2:n}.$$

ozone2 = as.matrix(cOzone[2:62,])
ozone1 = as.matrix(cOzone[1:61,])
mod = lm(ozone2 ~ 0 + ozone1)
G = t(mod$coeff)
image(t(G)[,numSites:1], main="Estimated propagator matrix, G")

[Image: "Estimated propagator matrix, G".]

Once we know G we can estimate Σ conditional on G as discussed for the random walk model. Note that
there is no reason to expect (or want) G to be symmetric, and there is also no reason why all of the elements
should be non-negative. Clearly this approach could be used for any multivariate time series, so we are not
properly exploiting the spatial context. Also, for a large number of sites, it can be problematic to estimate all
m² elements of G given at most nm data points.

9.3.2.2.2 Diagonal propagation


As a compromise between a simple random walk model and a completely unconstrained propagation matrix,
a diagonal G can be assumed. This also ignores the spatial context, but nevertheless fixes a number of issues
with the simple random walk model. Since G is diagonal, from a least squares perspective the problem
reduces to m independent AR(1) models that can be fit separately.

apply(cOzone, 2, function(x)
    arima(x, c(1,0,0), include.mean=FALSE)$coef)

X1 X2 X3 X4 X5 X6 X7 X8
0.3179119 0.5368396 0.4338366 0.4206251 0.3077022 0.4162096 0.4779711 0.4828871
X9 X10 X11 X12 X13 X14 X15 X16
0.4572942 0.4952505 0.3255831 0.3944543 0.4224548 0.3950086 0.4009679 0.3972285
X17 X18 X19 X20 X21 X22 X23 X24
0.3785257 0.4059217 0.4876205 0.5324465 0.3282226 0.2242859 0.4144865 0.1876176
X25 X26 X27 X28
0.5343495 0.3515989 0.5994477 0.4730342

Alternatively, the problem can be set up as a least squares problem for the m-vector g, where G = diag{g}.
This is tractable, with solution
$$\hat{g} = \left( \sum_{t=2}^{n} x_{t-1} \circ x_t \right) \circ \left( \sum_{t=2}^{n} x_{t-1} \circ x_{t-1} \right)^{-1},$$

where ◦ denotes the elementwise product and the final inverse is also taken elementwise.

colSums(ozone1*ozone2, na.rm=TRUE)/colSums(ozone1*ozone1, na.rm=TRUE)

X1 X2 X3 X4 X5 X6 X7 X8
0.3204579 0.5396044 0.4251620 0.4041003 0.3066428 0.4025162 0.4253797 0.4802306
X9 X10 X11 X12 X13 X14 X15 X16
0.4468311 0.4804979 0.3204789 0.3849037 0.3993497 0.3830393 0.3997480 0.3953805
X17 X18 X19 X20 X21 X22 X23 X24
0.3769048 0.4037909 0.4834982 0.5340675 0.3212627 0.2200922 0.4058193 0.1763395
X25 X26 X27 X28
0.5320828 0.3368777 0.5524844 0.4517766

Here, the solution using arima is preferred, since it has more intelligent handling of missing data.

9.3.2.2.3 Spatial mixing kernels


Ideally, we would like to use a propagator matrix, G, which takes into account the spatial context. This is
relatively simple when the sites lie on a regular lattice (see the later discussion of STAR models), but more
challenging for irregularly distributed spatial locations. Note that the ith row of G corresponds to the weights
applied to the sites at the previous time point when making a prediction for site i at the current time point.
The diagonal model puts all weight on just site i. It might be better to distribute the weights across the k
nearest neighbours of site i for some reasonably small k > 1. Choosing a small k will ensure that most of the
elements of G are zero, and hence G will be a sparse matrix, which has significant computational advantages.
However, for fairly small m we could assume a dense G, with weights varying as a function of distance. It is
common to assume some sort of kernel distance function, and there are many possible choices with various
advantages and disadvantages. For example, we could use the squared exponential kernel (irrespective of any
covariance kernel assumed for Σ),

$$g_{ij} = \sigma_i \exp\{-\|s_j - s_i\|^2 / a_i^2\}.$$

It is quite common (but not required) to assume that the length scale is common across all sites, ai = a.
However, there are often very good reasons to allow σi to vary across sites, and in this case G will not be
symmetric (which is fine).
Given this parameterisation of G, and most likely also a low-dimensional parameterisation of Σ, it is
straightforward to evaluate the Gaussian likelihood, and hence optimise the log-likelihood to find the optimal
parameters, similar to what we have seen many times previously.
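A minimal sketch of such a parameterisation, re-using the distance matrix distMat computed earlier; the function name mixG and the particular values of σ and a are illustrative, and a common σi is used here purely for simplicity.

mixG = function(sigma, a)
    diag(sigma) %*% exp(-(distMat/a)^2)   # g_ij = sigma_i exp(-d_ij^2 / a^2)

G = mixG(rep(0.4, numSites), 30)          # common sigma_i and a 30km length scale, for illustration
image(t(G)[,numSites:1], main="Kernel propagator matrix, G")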

9.3.3 Dynamic latent process models

The spatio-temporal models that we have examined so far have all been special cases of the VAR(1) model.
However, when we studied DLMs in Chapter 7, we saw that it can often make sense to assume a latent
process of auto-regressive form, but to then model our observations as some noisy linear transformation of
this hidden latent process. This remains the case in the spatio-temporal context. So it is natural to consider
models of DLM form for spatio-temporal processes,

Yt = FXt + ν t , ν t ∼ N (0, V),
Xt = GXt−1 + ω t , ω t ∼ N (0, W),

where Yt is our spatial observation at time t, and Xt is some kind of latent process representation of the
underlying system state at time t. The precise nature of the latent process can vary according to the modelling
approach adopted. We briefly consider three possibilities.

9.3.3.1 Spatial mixing kernels

If we choose F = I, then our observations are just random corruptions of the hidden latent state. So the latent
state has a very similar interpretation as the models we have been considering so far. In particular, there
is a one-to-one correspondence between sites and elements of the latent process state vector, and G has an
interpretation as a spatial mixing kernel for the hidden latent process. It can be parameterised in the manner
previously discussed. A spatial covariance kernel is often used to parameterise W, and V is often assumed to
be diagonal. We have seen how to evaluate the log-likelihood of a DLM, so we can optimise the parameters
of the kernel functions using maximum likelihood in the usual way.

9.3.3.2 Example: NY ozone data

Let’s see how we could implement a model like this using the dlm R package. To keep things simple we
will just assume a propagator matrix of the form G = αI. We will also start off with an evolution covariance
matrix of the form W = σw²I, but we will relax this assumption soon. We can fit this as follows.

library(dlm)

buildMod = function(param) {
    alpha = exp(param[1]); sigW = exp(param[2])
    sigV = exp(param[3])
    dlm(FF=diag(numSites), GG=alpha*diag(numSites),
        V = (sigV^2)*diag(numSites), W = (sigW^2)*diag(numSites),
        m0 = rep(0, numSites), C0 = (1e07)*diag(numSites))
}

opt = dlmMLE(as.matrix(cOzone), parm=log(c(0.8, 1, 3.0)),
             build=buildMod)
opt

$par
[1] -0.9494696 2.4860296 -4.7633070

$value
[1] 5228.299

$counts

function gradient
47 47

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

mod = buildMod(opt$par)

ss = dlmSmooth(cOzone, mod)
image(ss$s)
[Image: smoothed states at all sites over time.]

This works, and gives smoothed states that are visibly smoother than the raw data. Importantly, it also sensibly
smooths over the missing data. But this model isn’t really explicitly spatial. Apart from assuming common
parameters across sites, the DLMs at each site are independent. For spatial smoothing we really want W to
be a spatial covariance matrix. We can fit a model using a spatial variance matrix of the form previously
discussed (using our function cm) as follows.

buildMod = function(param) {
    alpha = exp(param[1]); sigW = exp(param[2])
    a = exp(param[3]); sigV = exp(param[4])
    dlm(FF=diag(numSites), GG=alpha*diag(numSites),
        V = (sigV^2)*diag(numSites), W = cm(c(sigW, a)),
        m0 = rep(0, numSites), C0 = (1e07)*diag(numSites))
}

opt = dlmMLE(as.matrix(cOzone), parm=c(log(0.8), log(1), log(10), log(3.0)),
             build=buildMod, lower=-4, upper=4)
opt

$par
[1] -1.1417530 2.2887486 4.0000000 0.8926005

$value
[1] 4537.371

$counts
function gradient
28 28

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

mod = buildMod(opt$par)

ss = dlmSmooth(cOzone, mod)
image(ss$s)
[Image: smoothed states at all sites over time, using the spatial evolution covariance.]

This also works, after a fashion, but note that the optimal length scale is just the upper bound I imposed
on the optimiser. If an upper bound is not imposed, the length scale just keeps increasing until the model
crashes. Inferring length scales is notoriously difficult in spatial statistics, and is even more challenging in the
spatio-temporal setting. So, given that in the context of purely spatial modelling we previously inferred a
length scale of around 30km, we could just use that length scale in the context of the dynamic model and not
try to optimise it.

buildMod = function(param) {
    alpha = exp(param[1]); sigW = exp(param[2])
    sigV = exp(param[3])
    dlm(FF=diag(numSites), GG=alpha*diag(numSites),
        V = (sigV^2)*diag(numSites), W = cm(c(sigW, 30)),
        m0 = rep(0, numSites), C0 = (1e07)*diag(numSites))
}

opt = dlmMLE(as.matrix(cOzone), parm=c(log(0.8), log(1), log(3.0)),
             build=buildMod)
opt

$par
[1] -1.0119162 2.3656191 0.6758664

$value
[1] 4848.624

$counts
function gradient
25 25

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

mod = buildMod(opt$par)

ss = dlmSmooth(cOzone, mod)
image(ss$s)
[Image: smoothed states at all sites over time, with the length scale fixed at 30km.]

This is better: we now have a dynamic model that can smooth jointly in both time and space.

9.3.3.3 STAR models

It has previously been mentioned that choosing the form of the propagator matrix, G, would be more
straightforward if the spatial locations all lay on a regular lattice (typically, in 2d or 3d). Even if our actual
observations Yt do not, we could nevertheless model the latent state, Xt , as a lattice process. For example, in the
2d case, we could then let Xt,i,j be the value of the latent state at time t at position (i, j) on the lattice. A
nearest-neighbour model for the time evolution of the latent state might then take the form

Xt,i,j = αXt−1,i,j + β(Xt−1,i−1,j + Xt−1,i+1,j + Xt−1,i,j−1 + Xt−1,i,j+1 ) + ωt,i,j ,

for some fixed α, β > 0. For stability, we might require α + 4β < 1. This then determines the sparse
structure of G. There are many possible variations on this approach. W could be diagonal or parameterised via
a spatial covariance kernel. V is often assumed to be diagonal. Models of this form are known as space-time
auto-regressive models of order one, or STAR(1), and there are generalisations to higher order, STAR(p).
The matrix F maps the actual observation sites onto the lattice. It will have m rows, and the number of
columns will match the total number of sites on the lattice. The ith row of F will have a 1 in the position
corresponding to the lattice site closest to si , and zeros elsewhere.
DLM smoothing with this model allows interpolation of the observed data onto a regular space-time lattice.
However, if the size of the lattice is large, this is computationally very demanding (notwithstanding the
sparsity of G). There are other possible approaches to spatio-temporal smoothing and interpolation which
may be less computationally demanding.
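As an illustrative sketch (the lattice size, α and β are arbitrary), the sparse STAR(1) propagator for an r × c lattice can be built as G = αI + βA, where A is the adjacency matrix of the lattice, constructed here as a Kronecker sum of path-graph adjacency matrices using the Matrix package.

library(Matrix)
pathAdj = function(n)                        # adjacency matrix of a path graph with n nodes
    bandSparse(n, k=c(-1, 1), diagonals=list(rep(1, n-1), rep(1, n-1)))

starG = function(nr, nc, alpha, beta) {
    A = kronecker(Diagonal(nr), pathAdj(nc)) +
        kronecker(pathAdj(nr), Diagonal(nc)) # grid adjacency: horizontal plus vertical neighbours
    alpha * Diagonal(nr*nc) + beta * A
}

G = starG(10, 15, 0.5, 0.1)                  # alpha + 4*beta < 1
dim(G); nnzero(G)                            # 150 x 150, but only around 700 non-zero entries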

9.3.3.4 Basis models (spectral approaches)

STAR models provide one approach to interpolate the hidden spatio-temporal process to spatial locations
other than those that have been directly observed. STAR models are typically used in order to interpolate onto
a regular lattice. However, there is no reason why we can’t construct a DLM that allows interpolation onto a
continuous space. We just need some basis functions, for example, 2d or 3d Fourier basis functions. So, let

ϕj : R2 → R, j = 1, 2, . . .

be basis functions defined on the whole of R2 (or R3 for 3d data). The idea is that we can represent an
arbitrary function θ : R2 → R with an appropriate linear combination of basis functions

$$\theta(s) = \sum_{j=1}^{\infty} \phi_j(s) x_j,$$

for some collection of coefficients x1 , x2 , . . .. We will seek a reduced dimension representation of a function
by using just p basis functions. We will typically choose p < m, the number of sites with observations. Then,
every p-vector x = (x1 , . . . , xp )⊤ determines a function
$$\theta(s) = \sum_{j=1}^{p} \phi_j(s) x_j.$$

If we allow x to evolve in time, then the function θ(·) that it represents will also evolve in time. So, we can
let the coefficient vector evolve according to

Xt = GXt−1 + ω t , ω t ∼ N (0, W),

for some very simple propagator matrix such as G = I or G = αI for α ∈ (0, 1). Since we know from
Chapter 5 that Fourier transforms decorrelate GPs, we can reasonably assume diagonal W.

We probably want an observation model of the form

$$Y_{t,i} = \theta_t(s_i) + \nu_{t,i} = \sum_{j=1}^{p} \phi_j(s_i) X_{t,j} + \nu_{t,i},$$

and so F is the m × p matrix with (i, j)th element ϕj (si ). Again, V is often taken to be diagonal, but doesn’t
have to be. This is now just a DLM with a small number of parameters that can be fit in the usual ways. If we
compute the smoothed coefficients of the latent coefficient vectors, these can be used in conjunction with the
basis functions to interpolate the hidden spatial process over continuous space.
There are many possible choices of basis functions that can be used. 2d or 3d Fourier basis functions are
the most obvious choice, but cosine basis functions, or wavelet basis functions, or some kind of empirical
eigenfunctions can all be used.
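A minimal sketch of constructing the m × p observation matrix F for the New York sites, using a small set of 2d cosine basis functions; the choice of basis, the rescaling of the coordinates to the unit square and the number of basis functions are all arbitrary illustrative choices.

coords = as.matrix(sites[,2:3])
u = apply(coords, 2, function(x) (x - min(x)) / diff(range(x)))  # map sites to [0,1]^2
basis = function(j, k)                   # a single 2d cosine basis function
    function(s) cos(pi*j*s[1]) * cos(pi*k*s[2])
jk = expand.grid(j=0:2, k=0:2)           # p = 9 basis functions
FF = sapply(1:nrow(jk), function(i)
    apply(u, 1, basis(jk$j[i], jk$k[i])))
dim(FF)                                  # 28 x 9: one row per site, one column per basis function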

9.3.4 R software

Fitting dynamic spatio-temporal models to data gets quite complicated and computationally intensive quite
quickly. Relevant R software packages are summarised in the spatio-temporal task view. The IDE package
will fit an integro-difference equation model, which we have not explicitly discussed, but is closely related to
the STAR and spectral approaches. This package, in addition to a number of other approaches to the analysis
of spatio-temporal data using R, is discussed in Wikle, Zammit-Mangion, and Cressie (2019). Otherwise,
packages such as spBayes and spTimer will fit spatio-temporal models from a Bayesian perspective, using
MCMC. Unfortunately we do not have time to explore these packages properly in this course.

9.4 Example: German air quality data

For further spatio-temporal investigation, it will be useful to have a different dataset to explore. We will
now look briefly at a larger dataset than the one we have considered so far, but detailed analysis is left as an
exercise.
The data is air quality data (specifically, PM10 concentration), for 70 different sites in Germany, over a 10
year period. We can load and inspect the data as follows.

library(spacetime)
data(air)
dim(air) # "time-wide" format

space time
70 4383

It is actually in “time-wide” format, which we don’t necessarily want, and can be logged to be more
Gaussian.

lair = t(log(air)) # space-wide format (and logged)


image(lair, xlab="Time", ylab="Site") # lots of missing data

[Image: the log PM10 data matrix, Time by Site, showing substantial missing data.]

We see that there is a lot of missing data in this dataset, so a proper analysis will need to carefully handle
missing data issues. We can attempt to overlay the time series at the different sites.

tsplot(lair, spaghetti=TRUE, col=rgb(0,0,1,0.15), lwd=0.5)


lines(rowMeans(lair, na.rm=TRUE), col=2, lwd=0.5)
[Figure: log PM10 series for all 70 sites overlaid, with the cross-site mean.]

This is less satisfactory than for the New York data, but does give a reasonable overview of the dataset. The
spacetime package has a special data structure for “full grid” spatio-temporal data like this, and we can
create such a STF object as follows.

head(stations)

SpatialPoints:

coords.x1 coords.x2
DESH001 9.585911 53.67057
DENI063 9.685030 53.52418
DEUB038 9.791584 54.07312
DEBE056 13.647013 52.44775
DEBE062 13.296353 52.65315
DEBE032 13.225856 52.47309
Coordinate Reference System (CRS) arguments: +proj=longlat +datum=WGS84
+no_defs

head(dates)

[1] "1998-01-01" "1998-01-02" "1998-01-03" "1998-01-04" "1998-01-05"


[6] "1998-01-06"

rural = STFDF(stations, dates, data.frame(PM10 = as.vector(air)))

See help(package="spacetime") for further details about these kinds of data structures.
We are now in a position to think about how to apply our newly acquired spatio-temporal modelling skills to
this dataset. Doing so is left as an exercise.

9.5 Wrap-up

This has been the briefest of introductions to spatio-temporal modelling. However, many of the most
important concepts and issues have been touched upon, so this material will hopefully form a useful starting
point for further study.
More generally, in this half of the module we have concentrated mainly on a dynamical approach to the
modelling and analysis of temporal data, using model families such as ARMA, HMM and DLM. This
dynamical view has many advantages over some more classical approaches to describing temporal data.
Further, we have seen how this dynamical view often has computational advantages, typically leading to
algorithms that have complexity that is linear in the number of time points. We have skimmed over many
technical issues and details, and there is obviously a lot more to know. But again, I hope to have provided an
intuitive introduction to many of the most important problems and concepts, and that you now have a better
appreciation for random processes and data that evolve in (space and) time.

References
Chatfield, C., and H. Xing. 2019. The Analysis of Time Series: An Introduction with R. CRC Press.
Cressie, N., and C. K. Wikle. 2015. Statistics for Spatio-Temporal Data. Wiley.
Petris, G., S. Petrone, and P. Campagnoli. 2009. Dynamic Linear Models with R. Use R! New York:
Springer.
Priestley, M. B. 1989. Spectral Analysis and Time Series. Academic Press.
Särkkä, S., and L. Svensson. 2023. Bayesian Filtering and Smoothing. Cambridge University Press.
Shumway, R. H., and D. S. Stoffer. 2017. Time Series Analysis and Its Applications, with R Examples, Fourth
Edition. Springer.
West, M., and J. Harrison. 2013. Bayesian Forecasting and Dynamic Models. Springer.
Wikle, C. K., A. Zammit-Mangion, and N. Cressie. 2019. Spatio-Temporal Statistics with R. CRC Press.

