DLBDSTSA01
TIME SERIES ANALYSIS
MASTHEAD
Publisher:
IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152
D-99084 Erfurt
Mailing address:
Albert-Proeller-Straße 15-19
D-86675 Buchdorf
[email protected]
www.iu.de
DLBDSTSA01
Version No.: 001-2023-1016
N.N.
TABLE OF CONTENTS
TIME SERIES ANALYSIS
Introduction
Signposts Throughout the Course Book
Suggested Readings
Learning Objectives

Unit 1
Introduction to Time Series Analysis

Unit 2
Time Series Components
2.1 Trend
2.2 Seasonality
2.3 Residuals

Unit 3
Simple Models

Unit 4
ARMA Models

Unit 5
Holt-Winters Models

Unit 6
Advanced Topics

Appendix
List of References
List of Tables and Figures
INTRODUCTION
WELCOME
SIGNPOSTS THROUGHOUT THE COURSE BOOK
This course book contains the core content for this course. Additional learning materials
can be found on the learning platform, but this course book should form the basis for your
learning.
The content of this course book is divided into units, which are divided further into sec-
tions. Each section contains only one new key concept to allow you to quickly and effi-
ciently add new learning material to your existing knowledge.
At the end of each section of the digital course book, you will find self-check questions.
These questions are designed to help you check whether you have understood the con-
cepts in each section.
For all modules with a final exam, you must complete the knowledge tests on the learning
platform. You will pass the knowledge test for each unit when you answer at least 80% of
the questions correctly.
When you have passed the knowledge tests for all the units, the course is considered fin-
ished and you will be able to register for the final assessment. Please ensure that you com-
plete the evaluation prior to registering for the assessment.
Good luck!
SUGGESTED READINGS
GENERAL SUGGESTIONS
Brockwell, P. J., & Davis, R. A. (2016). Introduction to time series and forecasting (3rd ed.). Springer. http://search.ebscohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=cat05114a&AN=ihb.49647&site=eds-live&scope=site

Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and practice (3rd ed.). OTexts. https://otexts.com/fpp3/

Nielsen, A. (2019). Practical time series analysis: Prediction with statistics & machine learning. O’Reilly. http://search.ebscohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=cat05114a&AN=ihb.49657&site=eds-live&scope=site

Shumway, R. H., & Stoffer, D. S. (2017). Time series analysis and its applications—With R examples (4th ed.). Springer. http://search.ebscohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=cat05114a&AN=ihb.49658&site=eds-live&scope=site
UNIT 1
UNIT 2
Brockwell, P. J., & Davis, R. A. (2016). Introduction to time series and forecasting (3rd ed.). Springer. Chapters 1.5 & 1.6. http://search.ebscohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=cat05114a&AN=ihb.49647&site=eds-live&scope=site

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). An introduction to statistical learning. Springer.

UNIT 3

Svetunkov, I., & Petropoulos, F. (2018). Old dog, new tricks: A modelling view of simple moving averages. International Journal of Production Research, 56(18), 6034–6047. http://search.ebscohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=bsu&AN=132000665&site=eds-live&scope=site
UNIT 4
Fattah, J., Ezzine, L., Aman, Z., El Moussami, H., & Lachhab, A. (2018). Forecasting of demand using ARIMA model. International Journal of Engineering Business Management, 10, 1–9. https://eds-p-ebscohost-com.pxz.iubh.de:8443/eds/detail/detail?vid=2&sid=fd516990-8288-4cf0-a32f-5c2a89f76654%40redis&bdata=JnNpdGU9ZWRzLWxpdmUmc2NvcGU9c2l0ZQ%3d%3d#AN=edsbas.70F2BCA8&db=edsbas

Poleneni, V., Rao, J. K., & Hidayathulla, S. A. (2021). COVID-19 prediction using ARIMA model. Proceedings of the Confluence 2021 11th International Conference on Cloud Computing, Data Science & Engineering (pp. 860–865). IEEE. https://eds-p-ebscohost-com.pxz.iubh.de:8443/eds/detail/detail?vid=4&sid=fd516990-8288-4cf0-a32f-5c2a89f76654%40redis&bdata=JnNpdGU9ZWRzLWxpdmUmc2NvcGU9c2l0ZQ%3d%3d#AN=edseee.9377038&db=edseee
UNIT 5
Da Veiga, C. P., Da Veiga, C. R. P., Catapan, A., Tortato, U., & Da Silva, W. V. (2014). Demand forecasting in food retail: A comparison between the Holt-Winters and ARIMA models. WSEAS Transactions on Business and Economics, 11, 608–614. Available online.
UNIT 6
Nielsen, A. (2020). Practical time series analysis: Prediction with statistics and machine learning. O'Reilly. Chapters 9, 10, 16, and 17. https://eds-p-ebscohost-com.pxz.iubh.de:8443/eds/detail/detail?vid=2&sid=0ea6556b-f27b-4bdd-b1e8-9af9791ace6e%40redis&bdata=JnNpdGU9ZWRzLWxpdmUmc2NvcGU9c2l0ZQ%3d%3d#AN=ihb.49657&db=cat05114a

Vishwas, B. V., & Patel, A. (2020). Hands-on time series analysis with Python: From basics to bleeding edge techniques. Apress. Chapters 5, 6, 7, and 8. https://eds-p-ebscohost-com.pxz.iubh.de:8443/eds/detail/detail?vid=4&sid=0ea6556b-f27b-4bdd-b1e8-9af9791ace6e%40redis&bdata=JnNpdGU9ZWRzLWxpdmUmc2NvcGU9c2l0ZQ%3d%3d#AN=edshbz.DE.605.HBZ01.036761534&db=edshbz
LEARNING OBJECTIVES
This course, Time Series Analysis, will begin by explaining what time series analysis is
and its possible applications. The distinguishing feature of these data is their dependence
across time. This dependence can take, in principle, any form. The course presents the
most widely used measure of interdependence: the linear autocorrelation. In terms of
structure, a time series can be seen as a superposition of three components: trend, sea-
sonality, and residuals. The reader will obtain the skills to conduct methods to represent
each of these aspects.
Once an understanding of the basic elements of time series has been established, the
basic time series models will be introduced, starting with the Simple Average model.
These models correspond to averages of windows of data. Unfortunately, their scope is
rather limited, namely, they can only perform time series trend extraction. Arguably the
most widely used family of models is the ARIMA. This course will explain how to use these
models to capture the autocorrelation present in a sequence and how they can be used to
produce out-of-sample forecasts. Simple Average models can be extended to the Holt-
Winters models, allowing not only predictions of the level to be generated, but also predic-
tions of the trend and seasonality of a time series. The resulting method can easily be
used to generate out-of-sample forecasts, and you will learn to do this in practice.
Finally, some modern time series techniques will be introduced. First, ensembles will be
explored, whereby different models and methods are combined to synergistically produce
models that integrate the strengths of the constituent parts. The necessary tools include
BATS, TBATS, and Facebook Prophet. Second, in order to take advantage of the enormous
potential of Machine Learning (ML), the ability to convert a time series forecasting prob-
lem into a supervised learning problem (under certain circumstances) will be explored,
enabling the reader to utilize the whole arsenal of ML methods within the context of time
series.
UNIT 1
INTRODUCTION TO TIME SERIES ANALYSIS
STUDY GOALS
Introduction
When studying statistics for the first time, little attention is paid to questions such as
where and when the data were recorded. On that level, one is primarily concerned with
describing data using histograms or frequency tables, or in cases where the data is rich
enough, calculating the mean, variance, and other statistics. This is quite static in some
sense. When analyzing, for instance, daily temperature data, the seemingly unimportant
fact that the observations were recorded at different points in time might be overlooked,
with averages and standard deviations being calculated, and conclusions reached. How-
ever, by neglecting the underlying time structure, relevant information might be missed.
An example of this would be the Fibonacci sequence, wherein each number is the sum of
the two preceding numbers. Starting with 0 and 1, the next term will be 1, because
0 + 1 = 1. The fourth term will be 1 + 1 = 2 and so on. Thus, the first nine terms of the
sequence are 0, 1, 1, 2, 3, 5, 8, 13, and 21. If the sequence is randomized as 8, 21, 0, 1, 5, 13,
1, 3, and 2, the sequence can no longer be recognized, because the “structure” has been
lost. The conclusion is simple—the order is relevant.
Analogously, observations recorded in time will have a certain structure in many situa-
tions. The observations will not be independent from each other. Daily temperature obser-
vations over a long enough time span, for example, three months, will reveal certain pat-
terns. If this time span occurs during autumn, for instance, a plot of time versus
temperature will show a decreasing tendency.
In this unit, the concept of time series will be introduced, along with a basic tool for ana-
lyzing the time structure of a time series—the autocorrelation function. Using this func-
tion, the correlation between elements of the same sequence can be measured.
In regression analysis, one basic assumption is the lack of correlation of the residuals. The key feature of a time series is the dependence between past and future data points. It is therefore natural to exploit such relationships by using past observations to predict the future.

Correlation: The phenomenon called correlation is the linear interdependence between two sets of data. It is a number in the range [-1, +1]. A correlation equal or close to zero indicates an absence of interdependency.

In notation, a time series is represented by symbols such as X_t or Y_t, with t being an integer number. The actual monthly vehicle sales in one year, for instance, can be denoted as x_1, x_2, …, x_12. The use of upper- and lowercase letters to denote a time series is relevant. Unless otherwise stated, time series processes are denoted with uppercase letters,
and time series realizations are denoted in lowercase. The difference can be explained
with an analogy. The experiment of tossing a coin is described mathematically by the ran-
dom variable X, mapping results from the sample space composed of {head, tail} to a
subset of the real numbers (in this case, {0, 1}). Therefore, the experiment of tossing N
coins is represented by a set of N equally distributed random variables, X1, X2, …, XN .
This is an abstract setting, built to model the randomness of this experiment. On the other
hand, if real coins are tossed, and the results are recorded, then the outcome, usually
called a realization, will be represented by x1, …, xN . In this example, a particular realiza-
tion might be (H, T, T, …, H, H). Therefore, a time series process can be seen as a collec-
tion of random variables indexed by time, and a realization, as an instantiation of it.
In the literature, a time series is a type of random (stochastic) process, that is, a non-deter-
ministic process that evolves in time and/or space. In time series modeling, the series is considered to be a superposition of three components, although not all three are always present. These
components are trend, seasonality, and residuals (Brockwell & Davis, 2009).
Trend
The first characteristic of time series is the trend. In the following example, 141 global sur-
face temperatures relative to the 1951—1980 average temperature (National Aeronautics
and Space Administration, 2021) have been plotted. These data show an increment of the
world surface temperature of around 1.3°C along the time span in question. The tempera-
tures were relatively stable until 1920, at which point they started increasing. An increase
in the rate of increase can be identified as starting in the 1960s.
Figure 1: Average Global Surface Temperatures Relative to the 1951—1980 Mean
Source: Danilo Pezo, based on National Aeronautics and Space Administration (2021).
This effect is typical—time series are normally described by periods of increments and
decrements (commonly referred to as “bull & bear markets.”) Several techniques are used
to model trends: linear regression (full range and piecewise), smoothing techniques (such
as moving averages), and differencing.
Seasonality
To understand seasonality, it is best to look at an example. Consider the data (Campbell &
Walker, 1977) in the following graphic, which are comprised of a sequence of peaks and
valleys. The distance between two consecutive peaks is called a period. When the peak-
and-valley pattern is clear, we can say the data are seasonal, periodic, or cyclic. In this
case, the period length is approximately ten years. Although the distance between each
peak is quite regular, the amplitudes vary significantly.
Cycles are important features in time series. If the period can be reliably estimated, and if
the current position in the cycle is known, the future peaks and valleys can be predicted.
Unfortunately, not all time series exhibit cyclic patterns as clearly as in this graphic. In
many cases, the data consist of a superposition of several cycles with different periods.

Superposition: A superposition is the addition of two or more time series. The time series do not necessarily have equal weights.
Figure 2: Annual Lynx Trappings in a Region of North-West Canada
The following graphic presents an additional example using data from the American
energy retailer, American Electric Power (known as AEP) (Mulla, n.d.). The original data set
represents the hourly energy consumption in megawatts over a span of ten years. For this
example, the customer consumption data from September 2012 has been chosen. It is
possible to distinguish both an intraday cycle and a weekly cycle. Excluding weekends,
daily peaks are concentrated in the afternoons, which then decrease smoothly to reach a
minimum around 5:00 a.m. There is also a weekly pattern stressed by the weekend effect
(consumption normally drops on Saturdays and Sundays).
This last example shows that a time series can comprise more than one cyclic pattern. The
challenge is how to decompose the sequence into its underlying cycles or seasonality.
Many authors make a distinction between the concepts of cycle and seasonality. In eco-
nomics literature, seasonality refers to a systematic annual pattern (e.g., unemployment
or retail sales). In turn, a cycle usually corresponds to a long-term movement over several
years or even decades (e.g., the business cycle: expansion, crisis, recession, and recovery).
Cycles may also occur with random and unpredictable lengths; this is what makes them
much more difficult to model than seasonality. Unless otherwise stated, these two con-
cepts (cycle and seasonality) will be used interchangeably throughout this text.
Residuals
Residuals or errors (occasionally called innovations) are also extremely important compo-
nents of time series. Detrending and removing seasonality will yield data that have no
clear visual structure. An example of this would be a simulation of 500 uncorrelated Gaus-
sian observations with zero-mean and unit variance, in which no clear pattern (trend
and/or seasonality) can be distinguished. Given its importance in statistics, this type of process receives a name: white noise. The name refers to a concept within the field of physics, whereby sounds of all frequencies are combined with equal power. According to Fan and Yao's (2003) formal definition, a random process {X_t}, t ∈ ℤ, is called white noise if E(X_t) = 0, Var(X_t) = σ² (constant), and Cov(X_t, X_s) = 0 for all t ≠ s.

White noise: A random process {X_t}, t ∈ ℤ, is called white noise if E(X_t) = 0, Var(X_t) = σ² (constant), and Cov(X_t, X_s) = 0 for all t ≠ s (Fan & Yao, 2003).

Independence: Two random variables are said to be independent if the distribution of one variable does not affect the distribution of the other variable.

Some authors define white noise as imposing independence, a stronger version of white noise. As previously mentioned, independence implies zero covariance, but zero covariance does not necessarily imply independence, unless normality is assumed. Unless otherwise stated, zero-covariance will remain the default assumption for white noise.
Figure 4: Gaussian White Noise
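A series like the one in the figure can be reproduced with a short simulation; the following is a minimal sketch using NumPy and Matplotlib (the seed is arbitrary).

Code
# import modules
import numpy as np
import matplotlib.pyplot as plt

# simulate 500 uncorrelated Gaussian observations
# with zero mean and unit variance
rng = np.random.default_rng(seed=42)
white_noise = rng.normal(loc=0.0, scale=1.0, size=500)

# plot the simulated series against its time index
plt.plot(white_noise, color='k')
plt.gca().set(xlabel='t', ylabel='X_t')
plt.grid()
plt.show()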
When a time series model is given, the resulting residuals are not necessarily white noise.
In financial statistics, for instance, it is common to observe time series returns with chang-
ing variance (called heteroscedasticity), after removing trends or possible seasonality. But
even more common is the presence of autocorrelation, i.e., correlation of the residuals
with a lagged (delayed) version of itself. As mentioned at the beginning of this unit, one of
the main goals of time series analysis is to exploit the natural relationships between
observations in a sequence. When the time order is known, it is reasonable to suspect
some sort of dependence between past and future observations.
Removing trend and seasonality from the initial time series often yields residuals with
non-zero correlation or changing variance. Of course, this violates the white noise defini-
tion, making stationarity a far more realistic assumption for the residual process. Fan and Yao (2003) define a time series {X_t}, t ∈ ℤ, as having weak stationarity if its expectation is constant, E(X_t) = μ for all t; its variance is finite and constant; and the covariance Cov(X_t, X_{t+k}) depends only on the lag k, not on the time t.
Stationarity can be understood using the following analogy: If one drives along a one-way
street, looking only through the rear view mirror, a crash is almost inevitable.
However, this changes when the driver is informed that what can be seen in the mirror
repeats itself all the way to the destination. If enough attention is paid to what is visible in
the mirror, the chances of arriving safely at the destination are higher, because the route
becomes more predictable.
By the same token, stationarity means that some of the statistical properties of a time ser-
ies, namely expectation, variance, and covariance or correlation, remain constant at a
given interval, increasing the time series’ predictability. This assumption does not elimi-
nate the randomness but reduces it to a more manageable level.
There are many versions of stationarity: strong stationarity, where the distribution func-
tion of the whole process is assumed to be known; trend and cyclic stationarity, where the
process is stationary after removing trend and seasonal components; and local stationar-
ity, whereby some statistical features of the process change so slowly that they can be
approximated by a stationary process on a given range. All versions guarantee some
degree of constancy of the underlying law of probability. The most widely used version is weak stationarity (hereafter simply “stationarity”), which is a fair assumption for the residuals of a model but not generally for a raw time series.

Law of probability: The law of probability describes the random variable's behavior. If it is known, figures such as expectation, variance, covariance, and probabilities are also known.

To summarize, a number of time series models have been presented, for which the process can be reduced to a mean function plus a residual term. Mathematically, this can be expressed as
X_t = μ_t + Y_t,  t = 1, …, N
where N is the number of observations and μ_t is a mean function containing trend and seasonality, i.e.,

μ_t = T_t + S_t

with T_t denoting the trend component and S_t the seasonal component.
Stationarity is by no means a magic word, and assuming its presence will not make residu-
als stationary. When time series data are provided, the goal is to model trend and season-
ality as well as possible, removing them from the original data so that the resulting residu-
als satisfy the assumed stationarity. In such a case, one can continue extracting
information from the residuals not contained in the trend or seasonality.
Trend and seasonality identification is not an easy task but there are many methods to
assist with the process, some of which will be examined subsequently.
As a first example of measuring dependence between two variables, consider the following two data sets:

1. Income per person, measured as the Gross Domestic Product (GDP) per capita, with
Purchasing Power Parity (PPP), in international dollars, fixed at 2011 prices (inflation
adjusted) (Gapminder, n.d.-a.)
2. Life expectancy at birth, measured in years (Gapminder, n.d.-b.)
The original data sets are ordered by country and year. Only countries for which both data
sets were available for the year 2018 are included.
The variability of income is controlled by using the logarithm (one of the Box-Cox transfor-
mations) to stabilize the variance (Shumway & Stoffer, 2017). A scatterplot of income ver-
sus life expectancy versus the log income is depicted below.
Figure 5: Income versus Life Expectancy by Country (2018)
The conclusion is clear: Countries with higher average incomes tend to have higher life
expectancies. Higher income usually entails a better standard of living which, in turn,
impacts positively the access to and quality of goods and services, such as food, health-
care, and sanitation.
Generally speaking, a correlation is a statistical tool used to measure the degree of simul-
taneous movement of two random variables. In some cases, the researcher is only interes-
ted in determining the likelihood of increments of one variable resulting in increments of
the second variable, with the magnitude of the increment being largely irrelevant. To ach-
ieve this goal, rank correlation measures can be used, with Kendall and Spearman correla-
tions being two of the best-known.
The abstract formula for the correlation ρ between two random variables X and Y is given by

ρ(X, Y) = E[(X − E(X))(Y − E(Y))] / √(Var(X) · Var(Y))

The symbol E stands for expectation and Var for variance, which is the squared standard deviation. The numerator on the right-hand side is known as the covariance γ between X and Y, i.e.,

γ(X, Y) = E[(X − E(X))(Y − E(Y))]
For any values of X and Y, the expression (X − E(X))(Y − E(Y)) is positive if X and Y lie simultaneously above or below their corresponding expectations E(X) and E(Y) (the averages of X and Y), and negative otherwise. When averaging (X − E(X))(Y − E(Y)), i.e., taking the expectation over all possible values of X and Y, the sign of the resulting covariance provides information about the average simultaneous behavior of X and Y. If it is positive, it means that, on average, X and Y tend to be above or below their means at the same time. The interpretation for negative values follows analogously. Additionally, the covariance is zero when X or Y is deterministic, i.e., X = E(X) or Y = E(Y), or when the average of (X − E(X))(Y − E(Y)) is zero. In the latter case, it might indicate that increments of one variable, for example, X, provide no clue about the behavior of Y, and hence, by averaging over all possible values of (X − E(X))(Y − E(Y)), the result is zero.
The above formula is described as “abstract,” because it deals with random variables.
When data are collected, e.g., N pairs (x_1, y_1), …, (x_N, y_N), one proceeds with the following formulae for correlation and covariance, respectively:

ρ̂(X, Y) = Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{N} (x_i − x̄)² · Σ_{i=1}^{N} (y_i − ȳ)² )

γ̂(X, Y) = (1/N) Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ)
Here, x̄ and ȳ correspond to the means of the X and Y data, respectively. The “hats” (^) indicate the use of real data, not abstract random variables. The covariance can take any real number. In practice, correlation is used, because it is a standardized version of the covariance, and therefore easier to interpret. For any two non-deterministic random variables, it holds that

−1 ≤ ρ(X, Y) ≤ +1
The proof of this inequality lies beyond the scope of this text, but suffice it to say that the
main argument lies in the so-called Cauchy-Schwarz inequality (Casella & Berger, 2002).
The closer the correlation is to ±1, the more the scatterplot resembles a straight line.
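For concreteness, the sample correlation can be computed directly in Python; the sketch below uses small made-up arrays (not the Gapminder data) purely to illustrate the formula.

Code
# import modules
import numpy as np

# small made-up arrays (illustrative only)
income = np.array([7.1, 8.3, 9.0, 9.8, 10.5, 11.2])
life_exp = np.array([58.0, 64.0, 69.0, 73.0, 78.0, 82.0])

# sample correlation following the formula above
x_dev = income - income.mean()
y_dev = life_exp - life_exp.mean()
rho = (x_dev * y_dev).sum() / np.sqrt((x_dev**2).sum() * (y_dev**2).sum())
print(rho)

# NumPy's built-in estimator returns the same value
print(np.corrcoef(income, life_exp)[0, 1])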
The result is positive, as expected, with higher income resulting in higher life expectancy.
Whether this value should be considered high or moderate depends on the context. As a
rule of thumb, anything over 80 percent is considered high. For more clarity on the differ-
ent situations one might encounter, consider the following figure built on simulated data.
Figure 6: Correlation Examples
This plot shows us that the sign of the correlation will be linked to the sign of the slope of
an imaginary line fitted to the data. In the case of zero or close-to-zero correlation, no
clear tendency should be detected. In cases of perfect correlation, ±1, all plotted points
lie exactly on a line with a finite, non-zero slope.
Zero correlation, however, does not imply independence. Consider, for example, a standard normal variable X and the (clearly dependent) variable Y = X². Their covariance is

γ(X, Y) = E[(X − E(X))(Y − E(Y))] = E[(X − 0)(X² − 1)] = E(X³) − E(X) = 0

Here, the first, second, and third moments of a standard normal variable are 0, 1, and 0, respectively (remember that the kth moment is defined as E(X^k)). The trick is that the correlation only measures linear dependence. When the relationship has a different functional form, nothing can be guaranteed.
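A quick simulation makes this concrete: for a standard normal X and Y = X², the sample correlation is close to zero even though Y is a function of X (a minimal sketch).

Code
# import modules
import numpy as np

# standard normal sample and its square
rng = np.random.default_rng(seed=0)
x = rng.normal(size=100_000)
y = x**2

# the sample correlation is close to zero despite the dependence
print(np.corrcoef(x, y)[0, 1])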
In the context of time series analysis, we will use similar ideas of correlation and cova-
riance. The main difference lies in the fact that the correlation will be calculated over a
single set of data X_t with t = 1, …, N, and not over two sets, X and Y, as before. This can
be seen in the following time series example.
This example uses the GDP at a constant PPP in dollars for Saudi Arabia from 1950—2013
(Gapminder, n.d.-a.)
As in the first example, the (natural) logarithm is applied to control for changes in volatil-
ity. To avoid cumbersome notation, we will drop any reference to the logarithm in the fol-
lowing discussion, but do not forget this is log GDP.
Note that the GDP increases across time. To remove trends from GDP, one can generally
apply differencing, i.e., subtracting the previous observation from each member of the
sequence. In doing so, the implicit assumption is that the subsequent observations are
mainly given by the current observation plus a random term, with the effect of the stabili-
zation of the mean level (Brockwell & Davis, 2016). As the first element has no predecessor,
it is generally removed or replaced by the series average. Symbolically, when given a time
series x_t, the first difference is denoted by

∇x_t = x_t − x_{t−1}
Another related term frequently used is the “lag.” Given a time series x_1, …, x_N, the 1-lagged time series will be x_0, …, x_{N−1}. In general, the k-lagged time series is x_{1−k}, …, x_{N−k}. A lag is, therefore, a delay of the original sequence by a given number of time steps.
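In Python, differencing and lagging are available directly in pandas; the following minimal sketch uses a small made-up series to show diff() and shift().

Code
# import modules
import pandas as pd

# small made-up series
x = pd.Series([10.0, 12.0, 11.5, 13.0, 14.2])

# first difference: x_t - x_{t-1} (the first element becomes NaN)
print(x.diff())

# 1-lagged series: the value of x_{t-1} aligned with time index t
print(x.shift(1))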
Coming back to the example, differencing generates a time series, which does not show
trends but looks more stationary. The corresponding plot is depicted below.
This scatterplot resembles that depicted in Example 1 in some ways. It is evident that a
mild, yet non-zero correlation is present. This is crucial information, because it tells us
that a past value contains valuable information to predict the next future value.
We can now repeat the exercise, plotting ∇GDP_{t−2} versus ∇GDP_t, and do a visual (for the moment) check for the presence of linear dependence or correlation.
Again, some degree of positive correlation can be observed, although this is lower than in
the first case. Continuing with additional lags 3, 4, and so on, it is possible to continue detect-
ing potential levels of correlation between lagged observations. That is the idea behind
the “autocorrelation/autocovariance” (also known as serial correlation).
The abstract form of autocorrelation can be defined as follows. Let {X_t}, t ∈ ℤ, be a random process. The autocorrelation function (ACF) is defined as

ρ(t, s) = E[(X_t − E(X_t))(X_s − E(X_s))] / √(Var(X_t) · Var(X_s)) = γ(t, s) / √(Var(X_t) · Var(X_s)),  for each t, s ∈ ℤ

where the autocovariance function is

γ(t, s) = E[(X_t − E(X_t))(X_s − E(X_s))],  for all t, s ∈ ℤ
The Greek word “auto” means “self.” The name is fitting because it stresses the fact that
the correlation/covariance is calculated over points of the same random process.
Notice the similarity between the formulae for classical correlation and autocorrelation.
There is, however, one important difference: in the previous case, we had two random var-
iables X and Y; now we have variables X_t and X_s belonging to the same random process.
An explanation of this, as well as an overview of the consequences, follows.
A time series should not be understood as a single random variable generating all the ele-
ments that form the sequence. A time series is a collection of random variables indexed by
time. A realization of each one of these random variables forms one time series realization.
This is important because two random variables are required to calculate autocorrelation.
A time series is made up of many variables; therefore, it would be sufficient to pick any
two random variables from X1, …, XN and apply the conventional definition to redefine
correlation in the context of time series.
Unfortunately, in real world applications, one rarely has more than a single time series
realization, so in practice, only two observations would be available to calculate
ρ Xt, Xs , for a given pair t, s . This limitation makes it impossible to estimate the
expectations involved in the definition of autocorrelation/covariance—an expectation is
an average (roughly speaking), and a minimum of two observations are required to calcu-
late an average. To define a practical autocorrelation formula, we need one further
assumption. This additional assumption is included in the concept of (weak) stationarity
discussed in the previous section.
The formulae for covariance and correlation simplify under stationarity as follows. Firstly, the expectation no longer depends on the chosen s, t, i.e., E(X_t) = E(X_s) = μ for all s, t ∈ ℤ.

This means that all expected values of lagged versions of the time series will be equal to the mean of the series.

Secondly, the covariance depends only on the lag k = s − t:

γ(t, s) = E[(X_t − E(X_t))(X_s − E(X_s))] = E[(X_t − μ)(X_{t+k} − μ)] = γ(k)

Therefore, the covariance simplifies to a function of a single variable k, the lag. It should be noted that, once stationarity is assumed, the autocovariance evaluated at lag k = 0 yields

γ(0) = E[(X_t − μ)(X_t − μ)] = E[(X_t − μ)²] = Var(X_t),  for all t
The formulae used so far are theoretical. Now we need a formula that can be used with real data.

Assume a stationary time series x_1, …, x_N, with sample mean x̄. The sample autocovariance function is given by

γ̂(k) = (1/N) Σ_{t=1}^{N−k} (x_t − x̄)(x_{t+k} − x̄)

for k = 0, 1, …, N − 1. This formula is analogous to that used for the covariance, except that the summation runs only until N − k. If we summed until N, given the indices of x, the last terms would be out of range (e.g., x_{N+k} does not exist). The example below illustrates this.
Let us suppose the following N = 7 observations: 2, 4, -4, 3, -2, -3, 0, with zero-mean. We
want to calculate the 1-lag autocovariance, so k = 1.
t               1     2     3     4     5     6     7
X_t             2     4    -4     3    -2    -3     0
X_{t+1}         4    -4     3    -2    -3     0     -
X_t · X_{t+1}   8   -16   -12    -6     6     0     -
The first row of this table corresponds to the time index—we have seven observations and
thus the time index ranges from 1 to 7. The second row contains these observations. The
third row contains 1-lag ahead shifted data. Notice the first point (x1 = 2) is lost. Finally,
the fourth row corresponds to the products involved in the autocovariance formula. It
should be noted that only six terms can be calculated. We obtain

γ̂(1) = (8 − 16 − 12 − 6 + 6 + 0) / 7 = −20/7 ≈ −2.86
If, instead of shifting 1-lag ahead, we shift 1-lag back, the following table is produced.
t               1     2     3     4     5     6     7
X_t             2     4    -4     3    -2    -3     0
X_{t−1}         -     2     4    -4     3    -2    -3
X_t · X_{t−1}   -     8   -16   -12    -6     6     0
γ̂(−1) = (8 − 16 − 12 − 6 + 6 + 0) / 7 = −20/7 ≈ −2.86
This is not a coincidence, but rather a mathematical property of the autocovariance func-
tion called symmetry. In mathematical terms this means
γ(k) = γ(−k)
Symmetry simplifies the calculations, as we can restrict the estimation to just one half,
usually for k ≥ 0.
We can easily check that, if we set k = 0, we obtain the formula for the sample variance, i.e.,

S² = γ̂(0)

The sample autocorrelation function is then defined as the ratio

ρ̂(k) = γ̂(k) / γ̂(0)

and, by the symmetry of the autocovariance, ρ̂(k) = ρ̂(−k).
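The worked example above can be checked with a few lines of Python; the sketch below implements the sample autocovariance formula directly with NumPy (dividing by N, as above).

Code
# import modules
import numpy as np

def sample_acovf(x, k):
    """Sample autocovariance at lag k (division by N, as in the formula above)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dev = x - x.mean()
    return (dev[:n - k] * dev[k:]).sum() / n

x = [2, 4, -4, 3, -2, -3, 0]
print(sample_acovf(x, 1))                       # approximately -2.86
print(sample_acovf(x, 1) / sample_acovf(x, 0))  # sample ACF at lag 1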
As discussed, the ACF shows the level of correlation of a time series with lagged versions
of itself. However, the ACF is not controlled by other lags, i.e., the ACF does not account for
indirect effects of long-past observations acting through more recent observations. This is bet-
ter illustrated in the following diagram.
In this diagram, the current observation X_t is influenced by its immediate past X_{t−1} through arrow (c), by X_{t−2} directly through arrow (a), and by X_{t−2} indirectly through arrows (b) and (c). Therefore, one can wonder what influence is being calculated when computing ρ(X_t, X_{t−2}). The answer is (a) and (b + c).
In this sense, autocorrelation does not remove the indirect effects produced by other lags.
The statistical tool doing this job is the so-called partial autocorrelation function (PACF).
The formal definition is given below (Fan & Yao, 2003). The basic concept is to conduct a
partial regression analysis, in which the residuals are calculated for each specific lag. This
isolates the influence of these specific lags and allows the partial autocorrelation to be
calculated, with indirect influences already having been accounted for.
Partial autocorrelation can be defined as follows. Assume that X_t is a stationary time series with E(X_t) = 0. Any time series can reach this expected value by subtracting its mean. The partial autocorrelation function (PACF), denoted by π(k), is defined by π(1) = ρ(1) and, for k ≥ 2,

π(k) = ρ(R_{1|2,…,k}, R_{k+1|2,…,k})

where R_{j|2,…,k} = X_j − (α_{j2}X_2 + … + α_{jk}X_k) denotes the residual of the best linear predictor of X_j in terms of X_2, …, X_k.
Making use of the diagram example, if we want to remove the effects of X_{t−1} in the estimation of ρ(X_t, X_{t−2}), we can proceed in four steps:

1. Regress X_t on X_{t−1}.
2. Compute the residuals of this regression.
3. Repeat the procedure for X_{t−2}, i.e., regress X_{t−2} on X_{t−1} and compute the residuals.
4. Calculate the correlation between the two sets of residuals; the result is the partial autocorrelation at lag 2, as sketched in the code below.
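The following sketch carries out these four steps with NumPy on a simulated AR(1)-type series (the series and its coefficient are illustrative only); each regression is a simple one-variable least squares fit.

Code
# import modules
import numpy as np

# simulate an illustrative series x_t = 0.7 x_{t-1} + eps_t
rng = np.random.default_rng(seed=1)
eps = rng.normal(size=200)
x = np.zeros(200)
for t in range(1, 200):
    x[t] = 0.7 * x[t - 1] + eps[t]

# aligned copies of x_t, x_{t-1}, and x_{t-2}
x_t, x_t1, x_t2 = x[2:], x[1:-1], x[:-2]

# steps 1 and 2: regress x_t on x_{t-1} and keep the residuals
b1, a1 = np.polyfit(x_t1, x_t, 1)
res_t = x_t - (a1 + b1 * x_t1)

# step 3: regress x_{t-2} on x_{t-1} and keep the residuals
b2, a2 = np.polyfit(x_t1, x_t2, 1)
res_t2 = x_t2 - (a2 + b2 * x_t1)

# step 4: the correlation between the two residual series
# is the partial autocorrelation at lag 2
print(np.corrcoef(res_t, res_t2)[0, 1])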
Coming back to the earlier example of the Saudi Arabian GDP, both the ACF and the PACF
are estimated in Python below.
Code
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# gdp_country: pandas Series with the (log) GDP of Saudi Arabia,
# indexed by year (loaded beforehand)

# differencing the data
gdp_country_diff = gdp_country.diff().dropna()
gdp_country_diff.index = gdp_country_diff.index.astype(int)
gdp_country.index = gdp_country.index.astype(int)

# figure with two panels
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# ACF
plot_acf(gdp_country_diff, alpha=0.05, ax=axes[0],
         title='Autocorrelation function (ACF)')
axes[0].set_xlabel('Lag')
axes[0].set_ylabel('Correlation')

# PACF
plot_pacf(gdp_country_diff, alpha=0.05, ax=axes[1],
          title='Partial Autocorrelation function (PACF)')
axes[1].set_xlabel('Lag')
axes[1].set_ylabel('Correlation')

plt.show()
Most statistical software estimates the ACF and PACF not just for one lag but for a (default) number of lags, usually around the first 20, which reduces the need to call the function repeatedly. The first lag depicted is lag zero, whose value is always one, since the correlation of a set of data with itself is one. Additionally, the results are depicted in a figure, called an autocorrelogram, where the x-axis corresponds to the lags and the y-axis to the level of correlation (a number between -1 and +1).
In the plots below, both the ACF and PACF are depicted, including a 95% confidence inter-
val (indicated by the light blue bands). The standard deviation needed to estimate this
band is based on Bartlett’s formula, which is an estimator of the covariance matrix for the
asymptotic distribution of the correlation estimator (Brockwell & Davis, 2016). Python completes these steps automatically.

Asymptotic: In the present context, “asymptotic” refers to the manner in which the correlation's distribution function approaches a given distribution as the sample size increases.

The following points about the ACF and PACF plots should be noted:
• The ACF decreases to zero. As we move back in time, more distant observations have
less influence on present observations.
• The ACF confidence band grows as we move along the lag-axis, because it uses less data
(more and more data are left out when the ACF is evaluated on higher lags). This is not
the case for the PACF.
• The first three lags are outside of the confidence bands, implying, with a 5% signifi-
cance, that they are statistically different from zero and not attributable to random
error. This indicates that there is a considerable influence of values lagged 1, 2, and 3
times. However, this holds only for the first and third lag for the PACF.
• Based on the PACF, which removes additional intermediate lag effects, we can see that
the first lag, amongst others, plays a statistically significant role. This tells us that an
observation of the series contains considerable information about the observation
which directly follows.
Figure 12: ACF and PACF Plots of the Differenced Data on Saudi Arabia’s GDP (1951—
2013)
SUMMARY
A time series is a collection of observations recorded over time. These
observations can be categories, integers, or real numbers, and are gen-
erally recorded at equally spaced points in time. Time series applica-
tions range from social sciences to mathematics. As in classical statis-
tics, the main goals are description, inference, prediction, and control.
A time series can be represented as

X_t = T_t + S_t + Y_t
i.e., the original process can be broken down into three components:
trend, seasonality, and a zero-mean stationary process. Trends are usu-
ally captured using tools, such as regression analysis, smoothing techni-
ques, and differencing. Other additional techniques allow seasonality
modeling.
UNIT 2
TIME SERIES COMPONENTS
STUDY GOALS
Introduction
A time series is a set of observations collected over time. The elements within it are num-
bers indexed by time. Monthly inflation rates in the US over the last five years and daily
COVID-19 cases in Germany in 2020 are examples of time series. A table displaying this
type of data can be uninformative when it comes to extracting insights. When observa-
tions are plotted against time, however, it becomes possible to detect patterns.
At the beginning of the COVID-19 pandemic, it was common to see graphs displaying the
evolution of case and death counts in the media on a daily basis. We felt some relief when
the numbers from the past days or weeks showed a decreasing “trend” or tendency; our
brains automatically conducted an out-of-sample forecast to predict the arrival of that
hoped-for moment when the total number of cases would be zero, bringing the pandemic
to an end. Epidemiologists and authorities alerted the population to the incremental
increase of cases as the winter got closer (Molteni, 2021). They were concerned about a
possible worsening of the pandemic conditions, associated with a “seasonal” effect.
Lower temperatures, less sunlight, and drier air could increase the virus’ ability to spread
(Liu et al., 2021; Terry, 2021).
As is the case for almost every phenomenon in life, one particular observation can be the
result of many factors. For example, a decreasing number of COVID-19 cases at the start of
summer may be partially attributable to increasing vaccination rates but also partially
attributable to the warmer, drier weather that is typical in the summer months. Neverthe-
less, there are always factors that cannot be observed, either because they are not known
or because they cannot be measured. Accounting for what is known of the time series, i.e.,
trend and seasonality, what cannot be explained corresponds to a “residual” part. These
residuals are treated as random, not because they are necessarily random, but because it
is a parsimonious way to represent our ignorance of the factors that are not modeled.
This unit presents a basic model structure for time series that will guide its analysis. Time
series will be conceptualized as a sum of three components: trend, seasonality, and resid-
uals (Chatfield, 2004; Brockwell & Davis, 2016). Methods to identify each component and
to conduct inference and prediction will be discussed.
2.1 Trend
The concept of trend can be understood by looking at data on global warming. A total of
141 global temperature observations from 1880 to 2020 (Goddard Institute for Space Stud-
ies, 2021) are plotted in the figure below. Because increments, not absolute temperatures,
are of interest here, the data are presented relative to the average temperature from 1951
—1980. For example, if the year 2000 had an average global temperature of 15.5°C and the
average between 1951 and 1980 was 15°C, then in the data, the year 2000 would have a
value of 0.5°C. Thus, every point represents a jump in temperature or the degree of anom-
aly with respect to the 1951—1980 average level. Unless said otherwise, this variable will
be referred to as “temperature.” Bear in mind, however, that this is with respect to the
1951—1980 average.
Source: Danilo Pezo, based on Goddard Institute for Space Studies (2021); Lenssen et al. (2019).
A trend is a long-term tendency that a data set follows. The meaning of “long term” is rela-
tive to its context. A time series can be composed of more than one single trend (called a
piecewise trend). Later in this unit, you will learn how to apply statistical testing to trends
like the one above, but for now, simply consider the visual representation. The tempera-
ture trend in the second half of the twentieth century is clear—the world is getting hotter.
The task is, then, to make a reasonable time series model that represents this trend. Such
a model would show where temperatures are heading and at which speed.
Three different methods will be introduced in this unit: linear regression, smoothing tech-
niques, and differencing. With the application of these methods, the trend can be
removed from the original data. At that stage, further techniques can be applied, which
assume that there is no trend in the data.
Linear Regression
Linear regression is a fundamental statistical method, tracing its roots to Carl F. Gauss in
the nineteenth century (Kopf, 2015). It has inspired many new models and applications in
a plethora of fields. The concept is as follows: Suppose that there are two variables, Z and
X, that we shall call the “independent” and “dependent” variable, respectively. As an
example, Z might be “CO2 emissions” and X “average temperatures.” Gathering data, we
might build the following table.
Table 3: Toy Example: Regression Data
Z X
2 -0.22
4 -0.03
6 0.25
8 0.54
10 0.78
We want to build a model that “explains” X in terms of Z . A model here is simply a for-
mula that relates these two variables. But which formula? In mathematics, the word “lin-
ear” should recall the line and its analytic expression X = a + bZ. This is perhaps the sim-
plest formula relating Z and X. Now the question is: which values do a and b take? In
principle, one might guess. Below are some attempts.
The left guess is too far off. The reason is that the slope (parameter b) has the wrong sign.
The figure on the right has the correct orientation, but its intercept (parameter a) is too
high. How do we find the best values of these parameters? First, we need a definition for
what is considered “best.” In other words, what are the optimality criteria? In this context,
the most common estimation method (though not the only one) is “ordinary least
squares.” In this method, the squared vertical distance between each point and the candi-
date line is calculated, added, and minimized by changing the values of a and b. In other
words, for each (Z, X) pair in the data, we build the following distance function:

Error(a, b) = (−0.2 − a − b·2)² + (0 − a − b·4)² + … + (0.8 − a − b·10)²
and we move a and b, gradually achieving a smaller total error. This error function is also
called the “residual sum of squares.”
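The residual sum of squares can be written as a small Python function, which makes the idea of “moving a and b to reduce the total error” concrete; this sketch uses the data from the table above, and the first parameter pair tried is arbitrary.

Code
# import modules
import numpy as np

# data from the table above
Z = np.array([2, 4, 6, 8, 10])
X = np.array([-0.22, -0.03, 0.25, 0.54, 0.78])

def error(a, b):
    """Residual sum of squares of the candidate line X = a + b*Z."""
    return np.sum((X - a - b * Z) ** 2)

print(error(0.5, -0.1))       # an arbitrary (poor) guess
print(error(-0.507, 0.1285))  # the least squares estimates found below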
The square of the error formula is taken to remove possible negative signs, which would
corrupt the notion of distance (this is always non-negative). Of course, the absolute value,
or any even power, could be taken to achieve the same goal. This would yield a different
set of parameters, a and b; however, as the use of the power of two is ubiquitous in statis-
tics, it is logical to use it here as well. The outcome is that the resulting properties of the
estimates are widely known in the literature (for instance, the bias-variance tradeoff is
based on the related concept of mean squared error.)
The values of a and b could be found by hand using derivatives, but the calculations are
tedious. Fortunately, all statistical software, including Excel and Python, comes with
standard libraries that perform the calculations for us. Below is the Python code used to
estimate the parameters of our example using the scikit learn library.
Code
# import modules
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# data from the table above
Z = np.array([2, 4, 6, 8, 10]).reshape(-1, 1)
X = np.array([-0.22, -0.03, 0.25, 0.54, 0.78]).reshape(-1, 1)

# fit the linear regression model
reg = LinearRegression().fit(Z, X)

# display parameters
print(reg.intercept_)
print(reg.coef_)
# console output:
# [-0.507]
# [[0.1285]]

# plot the data and the fitted line
plt.scatter(Z, X, color='k', label='Data')
plt.plot(Z, reg.predict(Z), color='red', label='Least squares fit')
plt.gca().set(xlabel='Z', ylabel='X')
plt.grid()
plt.show()
The figure below depicts the least squares estimated linear model.
Figure 15: Toy Example: Least Squares Estimated Linear Regression Model
â = −0.507
b̂ = 0.1285
The hat “^” in statistics stands for “estimation.” Based on these data points and the
assumed error function, these are the best guesses for the values of a and b. Their true
values will likely never be known.
Notice that even though the fit is close to the points, it is not exact; there is an error. Such
an error is called the residual, and it is represented mathematically as an additional term
on the right-hand side of the formula
X = a + bZ + η
The interpretation of η is all that Z cannot explain of X through a linear function. Its mean
is zero and its variance is 0 < σ² < ∞. Any pattern, aside from pure randomness in the
residuals, might be an indication that the model is incomplete, and a more sophisticated
model is needed. This will be discussed in more detail towards the end of this unit.
The independent variables receive the names “predictors,” “regressors,” and “features.” A
model with more than one regressor is called a multiple linear regression model. It takes
the form
X = α_0 + α_1 Z_1 + … + α_m Z_m + η

where Z_1, …, Z_m are the m regressors and η represents the residual term.
In the previous example, no assumption was made as to the time dependence of Z and X.
This is because regression analysis can also be used with variables not indexed by time.
For example, the pairs Z, X could be data on CO2 levels and temperatures recorded at
the same time in different regions, or sequential recordings from the same region, or even
both. Whenever we need to make the time dependence of a variable explicit, a subscript t
is added, e.g., Xt.
Now we return to the global warming example in order to use linear regression on real
data. The temperatures correspond, in our notation, to Xt. The time index can be used as
a regressor. Thus, a table of the form can be built.
X_t      t
X_1      1
X_2      2
⋮        ⋮
X_141    141
Instead of “year,” the time regressor used is the time index, i.e., the position of every year
in the year list. This is done for simplicity’s sake—using time instead of a time index will
yield the same results qualitatively but the handling is usually more cumbersome.
Although the parameters will change, the predictions will not.
Below, the original temperature series is plotted in black, together with two regression models in red and blue. The red line corresponds to a simple line fit, with the model taking the form

X_t = α_0 + α_1 t,  t = 1, …, 141

while the blue dashed line corresponds to the quadratic model X_t = α_0 + α_1 t + α_2 t².
Figure 16: Global Surface Temperatures (1880—2020): Original Series and Fits Using a
Linear and Quadratic Regression Model
Source: Danilo Pezo, based on Goddard Institute for Space Studies (2021); Lenssen et al. (2019).
The temperature curve is certainly not a straight line; hence, the simple model is inappro-
priate. A quadratic curve does a better job. In this case, the appropriateness of one model
over the other can be visually assessed, but in practice, the presence of many predictors
makes this impossible. Such assessment is based on residuals and a variety of statistics.
Some of these methods will be examined here.
Python offers several libraries for statistical model estimation. In the toy example on
regression, we used scikit learn. To demonstrate some of the variety of Python libra-
ries, we will now use statsmodels. With the command model.fit().summary() a
table can be generated summarizing important statistics needed to assess the accuracy of
the parameter estimates as well as the accuracy of the model. Below are the summary
tables of both the linear and quadratic models.
Figure 17: Summary Table: Linear Model
Source: Danilo Pezo, based on Goddard Institute for Space Studies (2021); Lenssen et al. (2019).
Figure 18: Summary Table: Quadratic Model
Source: Danilo Pezo, based on Goddard Institute for Space Studies (2021); Lenssen et al. (2019).
Not every element of these tables will be considered; rather, a brief account of the most relevant statistics and their corresponding interpretation is given below (for further details, see James et al., 2013).
• Coefficient of determination (R²). Consider a model with an intercept but no regressors, the simplest model we can think of. Its residual sum of squares serves as an upper bound for any other model; in that specific case, it is called the TSS (total sum of squares). Since any linear regression model including an intercept (and at least one regressor) will have a smaller RSS, we can calculate the difference TSS − RSS and divide the result by the TSS. This ratio is called the coefficient of determination, or R². It lies between zero and one; the closer to one, the better the fit. In our example, the result is R² = 0.892. Judging whether this figure is high or not depends on the context. The R² is interpreted as the amount of variability explained by the model in comparison with the intercept-only model.
• F-Statistic/Prob (F-statistic). This statistic is used to test the significance of the regression, i.e., to evaluate whether there is at least one regressor whose coefficient can be said, with statistical certainty, to be different from zero. In symbols, the hypothesis to be rejected, called the “null hypothesis,” is written as H_0: α_1 = … = α_m = 0. We reject this hypothesis whenever the F-statistic is significantly larger than one. Assuming H_0 to be true and the residuals to be normally distributed, the test statistic has an F-distribution, making the hypothesis testing procedure easier. The p-value can be computed and conclusions drawn from the result. In this case, the F-statistic is large. Considering the p-value, the Prob (F-statistic) is almost zero, which is sound statistical evidence that at least one regressor parameter, either α_1 or α_2, is different from zero.
• Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). The AIC and BIC are two widely used estimators of the prediction error. The “bias-variance trade-off” states that whenever we increase the complexity of a model (for example, by adding more parameters), the bias involved is reduced, while the variance increases. On the other hand, if we eliminate parameters of the model, the bias increases and the variance drops. Since the total error, the mean squared error, is basically the sum of the variance and bias terms, one is compelled to find the sweet spot. To do this, an estimator of the prediction error is needed. The logic is as follows: If the model is too simple, the predictions are bad due to underfitting. If the model is too complex, the predictions are also bad, due to overfitting. Hence, up to some constants, the AIC and BIC provide us with this information. In conclusion, a model with a lower AIC or BIC is preferred. The AIC of the 1-predictor model is -83.66, while the quadratic model has an AIC of -198.8. Even though the quadratic model is more complex than the 1-predictor model, this additional complexity is compensated by a substantial decrease in prediction error. We choose the quadratic model.
• Adjusted R-squared. This is based, as its name suggests, on the coefficient of determination. The difference is that it controls for the effect of additional parameters. As mentioned above, the coefficient of determination is a proportion based on residual sums of squares. The R² of the naïve model (only an intercept) will be zero. As we add more and more predictors, the amount of variability explained by the model increases and, as a result, the RSS drops. The limit case is when our model fits every single point of the sample. In such a case, R² = 1. As we can imagine, the predictive power of the limit model is small. To account for the additional predictors, the coefficient of determination has been modified by including a penalization depending on the number of regressors the model has. Comparing the 1-predictor and the quadratic models, the adjusted R² is 0.749 and 0.890, respectively. This again confirms the results from the AIC: we choose the quadratic model.
• Parameter 95% confidence interval. This is the classical t-Student 95% confidence interval. It is built upon the assumption of normality of the residuals. In that case, the parameter estimator follows a t-Student distribution with n − m degrees of freedom (n: number of observations, m: number of regressors; in our case 141 − 2 = 139). If the confidence interval of a parameter covers zero, then the parameter is not statistically significant at that level (in this example, 100% − 95% = 5%), and it should be excluded from the model, unless an argument of hierarchy forces us to keep it. In our case, all parameter estimators have confidence intervals not covering zero.
• Normality check. The normality assumption on the residuals is useful to conduct classical hypothesis tests or to build confidence intervals. Time spent checking normality is time well spent if we need to conduct inference. Normality is characterized by skewness (equal to zero for a standard normal distribution) and kurtosis (equal to three in the same case). Recall that these measures describe the shape of the distribution in terms of where the probability mass is more concentrated, to the left or right of the mean (skewness), and how fat the tails of the distribution are (kurtosis). “Jarque-Bera” is a statistical test to assess the normality assumption by combining skewness and kurtosis. Based on the p-value of Jarque-Bera, Prob(JB), we cannot reject the normality hypothesis of the residuals of the quadratic model. A short code sketch after this list shows how these statistics can be read from a fitted model.

Hierarchy: Inclusion of all terms of lower polynomial order in a linear regression estimation is considered a hierarchy. In our case, a quadratic model should include the linear term α_1 t even if the parameter α_1 turns out to be not significant.
Taking into account all of the above considerations, we choose the quadratic model. This
model suggests the temperatures are increasing with quadratic rates. Using the Python
function predict() we can perform an out-of-sample forecast. According to this model,
the average temperature will increase by 0.17°C by 2030 and by 2.13°C by the end of this
century.
Finally, the Python code used to generate the plots and the statistics summary table is
presented below.
Code
# import modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# df_temp: DataFrame with the yearly temperature anomalies in a
# column 'Global_Temp' (loaded beforehand)
T = len(df_temp)
time = np.arange(1, T+1)/100               # Line
time2 = np.column_stack((time, time**2))   # Quadratic

# fit both models (an intercept is added explicitly)
results_01a = sm.OLS(df_temp['Global_Temp'], sm.add_constant(time)).fit()
results_01b = sm.OLS(df_temp['Global_Temp'], sm.add_constant(time2)).fit()

# original data
plt.plot(df_temp.index, df_temp['Global_Temp'], color='k',
         label='Original data (Delta Temperature)')
# fitted values 1
plt.plot(df_temp.index, results_01a.fittedvalues,
         color='red', label=r'$X_{t}=a+b\cdot t$')
# fitted values 2
plt.plot(df_temp.index, results_01b.fittedvalues,
         '--', color='blue',
         label=r'$X_{t}=a+b\cdot t+c\cdot t^{2}$')

plt.legend()
plt.show()
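The out-of-sample forecast mentioned above can then be obtained from the fitted quadratic model with predict(); the sketch below only illustrates the mechanics, assuming the years 1880—2020 map to time indices 1—141 on the same scale as in the code above.

Code
# time index for the year 2030 on the scale used above
t_future = np.array([(2030 - 1880 + 1) / 100])

# exog row (constant, t, t^2) for the quadratic model
exog_future = np.column_stack((np.ones_like(t_future), t_future, t_future**2))

# point forecast of the temperature anomaly for 2030
print(results_01b.predict(exog_future))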
Smoothing Techniques
Regression methods are used when we have a model structure in mind. An example would
be a linear function, where the regressor effects are additive. In regression analysis, the
functional form (the formula) is given; thus, the researcher must only calculate the param-
eter estimation. Smoothing techniques are part of the “nonparametric methods” where
the absence of any preconceived model structure is key. This lets the data speak for them-
selves.
As its name suggests, the moving average (MA) method is built on averaging the time series. Consider
the data on global warming (Goddard Institute for Space Studies, 2021).
t    Year    Temperature (x_t)
1    1880    −0.16
2    1881    −0.07
3    1882    −0.10
4    1883    −0.16
5    1884    −0.28
⋮    ⋮       ⋮
Source: Danilo Pezo, based on Goddard Institute for Space Studies (2021); Lenssen et al. (2019).
A naïve model would just be the average of the whole data set. Even though the resulting
smoothed curve (constant and equal to 0.050°C) is no longer jumpy, its use is very limited,
because we know the temperatures are increasing.
Table: averaged temperature values (contents omitted). Source: Danilo Pezo, based on Goddard Institute for Space Studies (2021); Lenssen et al. (2019).
Another possibility might be to increase the number of averages across the sample. For
example, we can average every 20 observations (the final calculation will average the last
21 observations). In such a way, seven averages are obtained. This seems to improve the
model insofar as more of the variability is captured. However, we have built a piecewise constant
model, while the data show a continuous upward trend rather than an increase with plateaus in
between. This is still not entirely satisfactory.
The MA model tries to circumvent the issue of piecewise constant models by calculating
the averages on rolling (overlapping) windows of data instead of non-intersecting intervals. For example,
consider a window of length three. The window can be aligned in different ways: it can cover the
current value and the two preceding ones ("lagged"), one value on each side of the current one
("centered"), or the current value and the two following ones. In the last case, we cannot estimate
T_140 and T_141, because we would need additional data points x_142 and x_143, which do not exist.
In the centered case, to approximate x_1 we would need x_0, which also does not exist. In general,
the three possibilities for an MA(3) are
T_t^L = (x_{t−2} + x_{t−1} + x_t)/3,   T_t^C = (x_{t−1} + x_t + x_{t+1})/3,   T_t^R = (x_t + x_{t+1} + x_{t+2})/3
The set of smoothed points in each case has the length 139 = 141 − 2. If we choose a
window size of five units, then the number of smoothed points is 137 = 141 − 4. The
moral is that working with windows of length q allows us to estimate only N − q + 1
smoothed points, where N is the length of the series. Some data points at the extremes
are excluded, unless it is specified that those points should be treated differently. The
table displaying the calculations for all the possibilities discussed is depicted below.
Table: MA(3) calculations for the lagged, centered, and right-aligned windows (contents omitted). Source: Danilo Pezo, based on Goddard Institute for Space Studies (2021); Lenssen et al. (2019).
The next concept to explore is weighted averages. In the global temperature example,
every observation is equally weighted with 1/3. But if we are computing a left (trailing) moving
average, it might be reasonable to attribute less weight to the older observations than to
the newer observations.
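As a minimal sketch of this idea, the first temperature values from the table above can be smoothed with a left moving average of order three whose weights increase toward the newest observation; the weights 1/6, 2/6, and 3/6 below are purely illustrative.
Code
# left (trailing) weighted moving average of order 3
import numpy as np
import pandas as pd
x = pd.Series([-0.16, -0.07, -0.10, -0.16, -0.28])  # first temperature values
weights = np.array([1, 2, 3]) / 6                    # oldest ... newest
wma = x.rolling(3).apply(lambda w: np.sum(weights * w))
print(wma.values)  # the first two values are NaN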
In the figure below, the original global temperature data and two centered moving aver-
age filters with windows equal to five and 15 are depicted, with equal weights of 1/5 and
1/15, respectively. To help the reader appreciate the difference in smoothness between
MA(5) and MA(15), a subsample is also plotted.
Figure 19: Global Surface Temperatures (1880—2020): Original and Smoothed Series
Using MA Smoothing
Source: Danilo Pezo, based on Goddard Institute for Space Studies (2021); Lenssen et al. (2019).
In Python, this can be generated using the command rolling() from the Pandas library,
and averaging the result, as displayed below.
Code
# import modules
import pandas as pd
# df is assumed to be the temperature DataFrame loaded earlier
# perform rolling window averaging
# window = window length, center = centered averaging
roll_temp_mean = df.rolling(window=5, center=True).mean()
• As we increase the window size, the resulting curve turns out to be less “jumpy,” i.e.,
smoother (hence the name). This is because two consecutive outcome points from the
MA differ only on the extremes; due to averaging, they lose importance as the window
size increases.
• As a result of the formulation of the MA discussed above, the extremes of the time series
are not smoothed, and more data is left out as the window size increases.
The idea behind the kernel smoothing (KS) method is essentially the same as for the MA. Consider a centered
MA(5). In the simplest case, the weights are 1/5 for every term, thus the T t term of an
MA(5) is generated by multiplying each xt term by 1/5 and then adding. This is represented
graphically in the following figure.
The weights of a kernel smoother are given in turn by a function called “kernel.” A com-
mon choice for a kernel is the Gaussian density. A kernel is centered around xt, weighting
closer points higher. It depends on a parameter b, called bandwidth, that controls how
many neighboring observations are averaged. This is represented graphically in the fol-
lowing figure.
Figure 21: Kernel Smoothing Representation
In the plot below, several smoothed curves of the global average temperatures are depic-
ted. They differ in the degree of smoothness controlled by the bandwidth b. This parame-
ter plays an important role in the bias-variance trade-off. If it is too large, the resulting
curve is oversmoothed, increasing the bias. On the contrary, if it is too small, then the total
variance is higher. Which one should be used? As discussed above, cross validation and
AIC are methods to estimate the prediction error, and in this context, can be used to
decide between two or more competing models, which differ on the bandwidth b.
Figure 22: Global Surface Temperatures (1880—2020): Original and Kernel Smoothed
Series
Source: Danilo Pezo, based on Goddard Institute for Space Studies (2021); Lenssen et al. (2019).
The statsmodels library for Python provides the function KernelReg() to conduct kernel smoothing.
Code
# import modules
from statsmodels.nonparametric.kernel_regression import KernelReg
from matplotlib import pyplot as plt
# df is assumed to be the temperature DataFrame loaded earlier
# conduct kernel smoothing
# base the smoothing on the data index
# set the variable type to 'continuous'
# use a bandwidth of 5
kr = KernelReg(df['Global_Temp'].values, \
    df.index, var_type='c', bw=[5])
# evaluate the smoother at the observed points
x_pred, _ = kr.fit(df.index)
# original data
plt.plot(df.index, df['Global_Temp'], color='k', label='Original data')
# smoothed values
plt.plot(df.index, x_pred, ':', \
    color='blue', label='$b = 5$')
# add a legend
plt.legend()
Instead of user-specified values for the bandwidth bw, cross validation can be used by
entering bw='cv_ls' when using the least squares method, or bw='aic' when using the
AIC.
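For example, assuming the same DataFrame df as above, the bandwidth selected by least-squares cross validation could be inspected as follows.
Code
# let cross validation choose the bandwidth
kr_cv = KernelReg(df['Global_Temp'].values, df.index, \
    var_type='c', bw='cv_ls')
print(kr_cv.bw)  # bandwidth selected by least-squares cross validation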
Differencing
Another popular way to extract or remove a trend from a time series is differencing, i.e.,
the predecessor (first lag) of each number is subtracted from the number itself. For
instance, the difference of the data {3, 0, -2, 4, 1, 1, 2} is {NaN, -3, -2, 6, -3, 0, 1}. A “NotA-
Number” (NaN) is set at the beginning of the differenced sequence, because the first ele-
ment of the original series does not have a predecessor. Dropping the “NaN,” the first lag
difference is {-3, -2, 6, -3, 0 ,1}.
Formally, the first difference operator is defined as
∇x_t = x_t − x_{t−1}
We can also subtract the second lag instead,
∇_2 x_t = x_t − x_{t−2}
in which case the first two differenced elements do not exist. The notation for subtracting
additional lags is analogous.
Often, the difference of a differenced time series needs to be calculated. This is called the
second difference, ∇²x_t = ∇(∇x_t) = x_t − 2x_{t−1} + x_{t−2}. Using the data of the previous
example, the second difference is {1, 8, −9, 3, 1} once the two undefined initial elements are dropped.
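In Python, both differences can be obtained with the Pandas method diff(), as the following sketch using the example data shows.
Code
# first and second differences of the example data
import pandas as pd
x = pd.Series([3, 0, -2, 4, 1, 1, 2])
print(x.diff().tolist())         # [nan, -3.0, -2.0, 6.0, -3.0, 0.0, 1.0]
print(x.diff().diff().tolist())  # [nan, nan, 1.0, 8.0, -9.0, 3.0, 1.0]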
Let us see how to use this in practice. The Deutsche Bank (stock code DBK.DE) daily clos-
ing stock prices (upper plot) and its first lag difference (lower plot), from January 1, 2019
until December 31, 2020, are depicted below. The data can be accessed through the library
yfinance in Python.
Figure 23: Deutsche Bank Stock Prices—Raw Series (Top) and Differenced Series
(Bottom)
Briefly recall the definition of (weak) stationarity (Fan & Yao, 2003). A time series (X_t)_{t∈Z}
is weakly stationary if
• its mean is constant over time,
• its variance is finite, and
• the covariance between two different random variables of the process depends only on
the time lag between them.
Consider now a time series with a linear trend,
X_t = α_0 + α_1 t + ε_t
where t is the time index, and ε_t is the residual component assumed to have zero mean,
constant variance, and zero autocorrelation. This is called "white noise." Notice that it is
stationary. Differencing this expression, we get
∇X_t = X_t − X_{t−1} = (α_0 + α_1 t + ε_t) − (α_0 + α_1(t − 1) + ε_{t−1}) = α_1 + ε_t − ε_{t−1}
In other words, we obtain only a constant and the difference of residuals, which is also a
residual (although not necessarily with the same properties as the residual of the original
sequence, because, for example, its variance might be different). Hence, by differencing
we are able to remove the trend, with only a residual component remaining.
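A small simulation illustrates this point; the intercept 2, slope 0.5, and unit noise variance below are chosen only for illustration. After differencing, the simulated series fluctuates around the slope with roughly twice the noise variance.
Code
# differencing removes a linear trend
import numpy as np
np.random.seed(0)
t = np.arange(200)
x = 2 + 0.5*t + np.random.normal(0, 1, 200)  # linear trend plus white noise
dx = np.diff(x)
print(dx.mean())  # close to the slope 0.5
print(dx.var())   # close to 2, the variance of eps_t - eps_{t-1}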
Back to the Deutsche Bank example. The price series has upward and downward trends,
sudden drops, and patterns with some cyclic resemblance; the series is by no means sta-
tionary. In turn, the differenced series is much more stable; it moves around zero, and
there is no clear pattern making it predictable. At first glance, stationarity is a reasonable
assumption. Thus, the effect obtained by differencing is roughly the same as that seen in
the example with the linear trend. In case a quadratic trend is present, we can try to
remove such trends by taking the second difference. For higher order polynomial trends,
higher order differencing should be attempted.
When differencing, the resulting residual component is not the same as that of the original series,
as we saw in the linear trend example. This is a warning: in some cases, differencing might introduce
dynamics not present in the original series. Thus, when removing a trend, we might want
to try another method first, or to focus our attention on the behavior of the new residuals,
identifying possible patterns and modeling them. One common effect on these new resid-
uals is the change of the ACF (autocorrelation function). Later, we will study methods to
model sequences looking at features extracted from the ACF.
2.2 Seasonality
Seasonality arises in a variety of fields. Economics, earth sciences, and medicine are just a
few examples. Identifying periodicities in a time series is, in general, not an easy task. Real
world time series are often composed of several underlying cyclic components combined
in different ways. This means that a single time series might be a combination of several
seasonal time series, each with different frequencies and amplitudes.
Consider the following data on hourly energy consumption in megawatts from AEP Retail
Energy Partners (AEP), an American energy distribution retailer (Mulla, n.d.).
Each peak-valley cycle represents a 24-hour period. Overlaid onto this daily cycle is a
weekly cycle, which is characterized by an increment from Monday to Thursday and a
decline from Friday to Sunday (weekend effect). This is basically what the naked eye can
detect. If additional hidden periodicities exist, special tools will be needed to detect them.
Two regression-based methods to model seasonality will be introduced in this unit: har-
monic regression and seasonal dummy variables.
Harmonic Regression
In the AEP data, the clearest seasonality takes place on a daily basis. There is a peak in
consumption in the afternoon and a valley around 5:00 a.m. This intraday pattern repeats
every twenty four hours. A second seasonality occurs on a weekly basis. Consumption
remains high throughout the work week, then drops over the weekend.
This simple example shows three things. Firstly, some time series are seasonal; they con-
tain patterns that repeat with a given frequency. Secondly, a series might be a “superposi-
tion” of different seasonal series, meaning it is composed of two or more seasonalities.
Thirdly, most of the variation of the sequence occurs intraday—the daily variability is
larger than the weekly variability, or more specifically, the maximum distance between a
peak and a valley within a day is generally higher than the maximum distance between
peaks within a week.
Another example showing how these patterns are built is depicted in the figure below.
The top plot depicts three different waves. A wave is described primarily by two numbers.
The first number is the amplitude, which tells us how high or low the curve reaches. The
blue line, for instance, has an amplitude equal to one. The second number is the period,
which is the horizontal distance between two consecutive peaks. The blue curve has, for
example, a period equal to 100 (the distance between peaks is the same for valleys or any
two points with the same level, in particular, two points in the middle.) The bottom red
curve represents the superposition of these three waves. The clear seasonality of the first
three waves is no longer obvious; the different curves overlap, making it difficult to visu-
ally distinguish each component.
Mathematically, a wave can be described by sinusoid functions (recall sine and cosine
functions from earlier math courses). The three waves of the last example are
y = 1 · sin(2πt/100 + 0)   (blue)
y = 0.3 · sin(2πt/20 + 0)   (orange)
y = 0.08 · sin(2πt/5 + 0)   (green)
The addition of “0” within the argument of the sine function is purposeful. It was said
before that a wave is primarily described by two numbers, and this is true; the shape of
the wave is determined by the amplitude (the number multiplied by the sine) and the
period (the number by which 2πt is divided). Notice, though, that the three sine functions start
at t = 0 (assuming the x-axis to be time t). If we need to shift the green curve to make it
start at t = 10, then instead of 0 we must specify −10 · 2π/5 within the sine function.
That number is called a “phase.” In general, a sinusoid is given by the formula
y = A · sin(2πt/T + φ)
where A is the amplitude, T is the period, φ is the phase, and t is time. A quantity equivalent
to the period, used in practice to describe sinusoids, is the frequency, defined mathematically as
ω = 2π/T
For example, the blue wave with period T = 100 has frequency ω = 2π/100 ≈ 0.063.
This gives a measure of the number of cycles over a fixed time interval. As T increases, ω
drops, which means that it takes the curve longer to complete one single cycle. In physics
and engineering, it is common to speak about low and high frequencies (for example,
when describing a light spectrum or oscillator movements). A sinusoid comprised of a sin-
gle frequency is called a “harmonic.”
As mentioned before, the red line on the above graph corresponds to the superposition of
the other three sinusoids, i.e.,
y = 1 · sin(2πt/100) + 0.3 · sin(2πt/20) + 0.08 · sin(2πt/5)
Code
# import modules
import numpy as np
import matplotlib.pyplot as plt
# time index (range chosen for illustration)
t = np.arange(500)
# create the three harmonic time series
A1, T1 = 1, 100
x1 = A1*np.sin(2*np.pi*t/T1)
x2 = 0.3*np.sin(2*np.pi*t/20)
x3 = 0.08*np.sin(2*np.pi*t/5)
# superposition of the three harmonics (red curve)
y = x1 + x2 + x3
In this example, we have followed a bottom-up approach; having been given a set of har-
monics, we combined them to produce a new curve. In turn, if a time series is being ana-
lyzed, a top-down approach must be followed in most cases. Since the exact frequencies
are rarely known, the harmonic components of the time series must be “disentangled,” if
they exist.
Assuming that the frequency ω_0 of a time series can be estimated with some certainty, the
formula of a sinusoid can be written as
y_t = A sin(ω_0 t + φ) = A cos(φ) · sin(ω_0 t) + A sin(φ) · cos(ω_0 t)
This formula follows from a basic trigonometric identity—the sine of a sum of two angles.
Here, we can consider the terms sin ω0t and cos ω0t as two predictors and regress the
variable y_t in terms of those predictors. The quantities A cos(φ) and A sin(φ) can be wrap-
ped up into two constants, called α_1 and β_1, to be estimated. In the following regression
formula, we also add an intercept α_0:
y_t = α_0 + α_1 sin(ω_0 t) + β_1 cos(ω_0 t) + ε_t
This is called a harmonic regression model. We can work out an example using the data on
energy consumption (Mulla, n.d.). It is clear that the data have a relevant period of twenty
four hours. Depicted in the figure below are the original data and a regression model con-
sidering just one frequency, i.e., three parameters: α0, α1, and β1.
We have correctly captured the daily frequency; nevertheless, the sequence consists of
additional harmonics, explaining why the fit does not look good.
In general, if we have P guesses on frequencies, say ω_1, ω_2, …, ω_P, the regression model to
estimate is
y_t = α_0 + Σ_{j=1}^{P} [α_j sin(ω_j t) + β_j cos(ω_j t)] + ε_t,   t = 1, …, N
A second guess could be weekly periods. The data were recorded on an hourly basis, pro-
ducing 168 data points. In the code, the second period must be added and the regressors
recalculated using sine and cosine.
Figure 27: Hourly Energy Consumption with 24- and 168-Hour Harmonic Regression
This achieves a much better fit, but investigating additional frequencies could further
increase the fit of the model.
There are many ways to approach estimating the model (in this case, fitting harmonic
regression models to the energy consumption); one such option is the Python code dis-
played below, for which the sm library has been used.
Code
# import modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
# load data
df = pd.read_csv('AEP_hourly_Sep2012.csv', sep=',')
df.index = pd.to_datetime(df.datetime, \
    dayfirst=True, infer_datetime_format=True)
T = len(df['energy'])
time = np.arange(1, T+1)
# harmonic regressors for the 24- and 168-hour periods
# (the regressor construction and the OLS fit reconstruct
#  the estimation step omitted from the printed listing)
w1, w2 = 2*np.pi/24, 2*np.pi/168
X = sm.add_constant(np.column_stack((
    np.sin(w1*time), np.cos(w1*time),
    np.sin(w2*time), np.cos(w2*time))))
results = sm.OLS(df['energy'].values, X).fit()
# original data
plt.plot(df.index, df['energy'], color='k', \
    label='Energy consumption')
# fitted values
plt.plot(df.index, results.fittedvalues, color='red', \
    label=r'$\omega_{1} = 2\pi/24$ and $\omega_{2} = 2\pi/168$')
plt.legend()
The second model can capture the weekly dynamics along with the daily changes. In this
example, the data from the first week have more variability compared to the data from the
other three weeks. Furthermore, the consumption maxima seem to decline as the month
proceeds. If additional data relating to other months were available, it could be identified
whether this effect corresponds to a seasonality or whether it was simply a local trend in
September. Are there additional seasonalities? It is hard to say just by looking at the plot.
A tool called a periodogram can help to answer this question.
Periodogram
The idea behind the periodogram goes back to the French mathematician Joseph Fourier.
Studying the properties of heat transfer, he realized that, under rather mild assumptions,
a function can be written as an infinite sum of sine and cosine functions with different
amplitudes and frequencies (Dominguez, 2016). Take sunlight, for example. Considering
light as an electromagnetic wave, sunlight is the superposition of many wavelengths of
light with different frequencies and amplitudes.
If sunlight could be modeled as a mathematical function, then the tool developed by Four-
ier, namely, the Fourier transform, would help to disentangle its different components
(visible, UV, radio, X-ray, etc.) by determining the frequencies and “weights” of each of
them. The word “weight” here means amplitude. In the previous example of the three
waves, the amplitudes were 1, 0.3, and 0.08. Clearly, the “heaviest” is the first wave with
an amplitude equal to 1. The mathematical details of Fourier analysis and its applications
in statistics and signal processing are very involved and will not be given here. It suffices to
say that we have a similar situation at hand, namely, a signal or sequence of data, which
we believe is composed of underlying harmonics. The frequencies and amplitudes of the
harmonics need to be identified.
In the context of statistics, we will use a function called a periodogram (also sometimes
called a power spectral density) to assess the frequency decomposition of a time series.
The periodogram is defined as the squared modulus (absolute value, length) of the (dis-
crete) Fourier transform of a sequence xt. It can be interpreted as a projection of the time
series xt onto the space generated by the harmonics with a frequency 2πj/N. Thus, we
will look for the frequencies that yield the highest value in the periodogram among its
neighbors; in other words, we look for the peaks in the periodogram as a function of the
frequencies.
Using the data on energy consumption, we have calculated the periodogram depicted in
the figure below. The x-axis ranges from 0 to 0.5. For example, a period of twenty four
hours appears as the frequency of f = 1/24 ≅ 0.042 in the periodogram. A weekly period
is shown as the frequency f = 1/(24 · 7) ≅ 0.006 in the periodogram. The frequency 0.5 is
known as the Nyquist frequency; it corresponds to one half of the sampling frequency,
f_Nyquist = 0.5 · f_sample = 0.5 (the sampling frequency here is one observation per hour).
This sets a limit for the frequencies that can be analyzed: periodicities shorter than two
sampling intervals (here, two hours) go unnoticed; we cannot know what happens, for
example, every ten minutes. The y-axis corresponds to the periodogram outcome in log
scale, the "power spectral density" (PSD).
Figure 28: Periodogram of Energy Consumption Data
We look for the dominant frequencies, namely, where the peaks occur. As with any other
statistical methods, the results should be accompanied by an expert assessment of the
validity of the results. Recall that a time series is usually not only a superposition of har-
monics, but also of trends and noise, which can bias our findings in relation to particular
relevant frequencies.
The Python code and the results generated thereby are depicted below.
Code
# import modules
import pandas as pd
from scipy import signal
import matplotlib.pyplot as plt
# load data
df = pd.read_csv('AEP_hourly_Sep2012.csv', sep=',')
df['energy'] = df['energy']
df.index = pd.to_datetime(df.datetime, \
dayfirst=True, infer_datetime_format = True)
# generate periodogram
freq, Pxx_spec = signal.periodogram( \
df['energy'].values, scaling='spectrum')
# create graphic
plt.figure(figsize=(16,5), dpi=100)
plt.plot(freq[1:], Pxx_spec[1:]) #f[0]~0, dropped
plt.xlabel('freq (Hz)')
plt.ylabel('PSD')
plt.yscale('log')
plt.grid()
We use the following Python function to find maxima of the periodogram within a fre-
quency interval (f1, f2).
Code
# find the periodogram maximum within the interval (f1, f2)
def getMaxPeriodogram(freq, psd, f1, f2):
    mask = (freq >= f1) & (freq <= f2)
    maxPsd = psd[mask].max()
    freqMax = freq[mask][psd[mask] == maxPsd]
    # convert to periods
    periodMax = 1/freqMax
    return({'Maximum PSD': maxPsd, \
        'Frequency': freqMax[0], \
        'Period': periodMax[0]})
where the freq and Pxx_spec are frequencies and the periodogram outcome, respec-
tively. Applying the function in the vicinities of the peaks, we can find possible hidden rel-
evant frequencies. Applying
Code
# print the results for given bounds
print(getMaxPeriodogram(freq, Pxx_spec, 0.002, 0.02))
# console output:
# {'Maximum PSD': 477550.8444001595, \
#  'Frequency': 0.005747126436781609, \
#  'Period': 174.0}
# second call, around the daily peak
# (interval bounds chosen for illustration)
print(getMaxPeriodogram(freq, Pxx_spec, 0.02, 0.05))
# console output:
# {'Maximum PSD': 2246538.2768236673, \
#  'Frequency': 0.041666666666666664, \
#  'Period': 24.0}
we obtain 174 and 24, respectively. The second period was already known to us. The first is
suspiciously close to 168 (=24 · 7). It is here that the expert has to make sense of the
results.
Seasonal Dummy Variables
If we know about the existence of seasonality in the data, say on an annual cycle, we
might create a set of 12 dummy variables, one per month, and use them as regressors to
extract the annual seasonal component of the data. For example, if an observation took
place in January, it would obtain the value “1” for the first dummy variable and “0” for the
remaining 11 dummy variables.
M_{t,j} = 1 if m(t) = j, and M_{t,j} = 0 if m(t) ≠ j
where j represents each month of the year and m(t) determines the month corresponding
to the index t. The dummies are then used as regressors in a model of the form
y_t = γ_1 M_{t,1} + γ_2 M_{t,2} + … + γ_{12} M_{t,12} + ε_t
The constant term has been excluded. If included, the parameters cannot be estimated,
since the intercept and the dummies would contain the same information to predict yt,
making them indistinguishable. This entails numerical problems, because the situation is
analogous to having a first-degree equation with two variables. One cannot have a unique
solution. Alternatively, one of the dummy variables could be removed, if it were known
that this month would not play any role in the use case.
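A minimal sketch of how such dummy variables can be generated in Python with Pandas is shown below; the three-year date range is purely illustrative.
Code
# build monthly dummy variables M_{t,j} from a datetime index
import pandas as pd
dates = pd.date_range('2020-01-01', periods=36, freq='MS')
M = pd.get_dummies(pd.Series(dates.month)).astype(float)
print(M.shape)  # (36, 12): one column per month, no intercept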
Consider the data on monthly automobile sales in the US (Bureau of Transportation Statis-
tics, n.d.) depicted in the figure below. We are interested in capturing the annual seasonal-
ity; however, the presence of local trends would bias our dummy estimation. For this rea-
son, we must first remove them. We already know how to remove trends from a time
series from the previous section. A smoothing technique will now be applied.
Figure 29: Monthly Automobile Sales (United States)
Using a Gaussian kernel and bandwidth b = 7, we obtain the trend of sales data (red dotted
line in the figure above.) Subtracting the trend from the original data, we obtain the
detrended sales data depicted in the figure below (black line). Finally, we build a regres-
sion model using monthly dummy variables to explain the detrended time series (red line
in the figure below).
Figure 30: Monthly Automobile Sales (United States) with Dummy Variable Regression
Using seasonal dummy variables is a good strategy to deseasonalize a time series but it
certainly does not perform miracles. Some peaks cannot be captured but for good rea-
sons. For example, neither the US financial crisis of 2007 nor the pandemic lockdown in
2020 are seasonal effects.
The steps described above can be conducted using the following Python code:
Code
# import modules
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
from statsmodels.nonparametric.kernel_regression \
    import KernelReg
# load data
url = "https://data.bts.gov/resource/crem-w557.json"
df = pd.read_json(url)
sales = df.auto_sales.values.astype(float)
# extract the trend with a Gaussian kernel smoother (b = 7)
# (the smoothing, detrending, and dummy regression steps
#  reconstruct parts omitted from the printed listing)
time = np.arange(len(df))
kr = KernelReg(sales, time, var_type='c', bw=[7])
y_pred, _ = kr.fit(time)
# create a plot
plt.figure(figsize=(16,5), dpi=100)
# original data
plt.plot(df.date, sales, \
    color='k', label='Actual vehicle sales')
# fitted data
plt.plot(df.date, y_pred, '--', \
    color='red', label='KS, $b = 7$')
plt.legend()
# detrend the data
detrended_data = sales - y_pred
# monthly dummy variables as regressors (no intercept)
dummies = pd.get_dummies(pd.to_datetime(df.date).dt.month).astype(float)
results = sm.OLS(detrended_data, dummies.values).fit()
# create a plot
plt.figure(figsize=(16,5), dpi=100)
# detrended data
plt.plot(df.date, detrended_data, color='k', \
    label='Detrended actual vehicle sales')
# fitted values
plt.plot(df.date, results.fittedvalues, \
    color='red', label='Regression seasonal dummy variables')
plt.legend()
Monthly seasonalities can be removed from the original data to obtain a deseasonalized
time series. This is useful for the comparison of two different years, without mixing up
effects attributable to seasons. The same procedure holds in case one needs to remove
quarterly, weekly, or daily effects, among others. The generated dummies are used as
regressors, provided the data has been detrended.
2.3 Residuals
This discussion on time series components started by considering the model
Xt = T t + St + Y t, t ∈ Z
where T t is a trend, St is the seasonal component, and Y t is the residual. This corresponds
to a zero-mean stationary time series that does not have a visually apparent, “clear” struc-
ture. In regression analysis, once we control by a set of regressors, the analogous error
component is expected to be independent. In the context of time series analysis, that
assumption is unrealistic. As observations are ordered in time, it is expected that they will
have some degree of autocorrelation or changing variance that also needs to be modeled.
The first inspection conducted to assess the stationarity of residuals is visual. The follow-
ing questions should be considered. Can trends or piecewise trends be observed? Is there
any functional behavior? Is the variance increasing without any bound? Are the autocovar-
iance and/or the autocorrelation decreasing too slowly as the lag increases? If any of these
questions is answered with a "yes," then the time series at hand might not be stationary.
Figure 31: Three Time Series Simulations
Notice that all sequences start at 0 but that their subsequent behavior differs. Sequence A
fluctuates randomly around 0, Sequence B increases while having subintervals of decre-
ment, and Sequence C increases smoothly and dramatically. Surprising as it might be,
these series are all generated from the same process with just one parameter having been
modified for each case. The process is the “autoregressive of order 1,” or AR(1). The for-
mula is
X_t = αX_{t−1} + ε_t,   ε_t ~ WN(0, σ²)
In simple words, given an initial value X_0, the next element is obtained by multiplying the
previous element by α and adding a "white noise" term, i.e., a zero-mean, constant-variance,
and uncorrelated random variable. The three sequences have length N = 100, and each simu-
lation depends on three parameters: α, σ_ε² (the variance of ε_t), and X_0 (the starting value of
the series). In all cases, σ_ε² = 1 and X_0 = 0. The values for α are
Seq . A: α = 0.8
Seq . B: α = 1
Seq . C : α = 1.1
Code
# import modules
import matplotlib.pyplot as plt
import numpy as np
# helper to simulate an AR(1) path
# (function name, defaults, and loop are a reconstruction
#  of the body omitted from the printed listing)
def simulateAR1(alpha, N=100, sigma2=1, x0=0):
    x = np.zeros(N)
    x[0] = x0
    eps = np.random.normal(0, np.sqrt(sigma2), N)
    for t in range(1, N):
        x[t] = alpha*x[t-1] + eps[t]
    return x
# simulate the three sequences
x_stat = simulateAR1(alpha=0.8)        # Sequence A
x_unit_root = simulateAR1(alpha=1.0)   # Sequence B
x_non_stat = simulateAR1(alpha=1.1)    # Sequence C
fig, axs = plt.subplots(3, figsize=(16,18), dpi=100)
plt.subplots_adjust(hspace = 0.3)
# Stationary
axs[0].plot(x_stat,color='k')
axs[0].set_title('Sequence A')
axs[0].set(xlabel='Time')
axs[0].grid()
# Unit root
axs[1].plot(x_unit_root,color='k')
axs[1].set_title('Sequence B')
axs[1].set(xlabel='Time')
axs[1].grid()
# Non-stationary
axs[2].plot(x_non_stat,color='k')
axs[2].set_title('Sequence C')
axs[2].set(xlabel='Time')
axs[2].grid()
The stationarity of this process depends exclusively on the parameter α. The stationarity
condition demands a constant mean over time and finite variance; therefore, depending
on the value of α, we can decide whether an AR(1) is stationary or not.
To see what this means for the three sequences with (A) α = 0.8, (B) α = 1, and (C) α = 1.1,
note that the AR(1) equation X_t = αX_{t−1} + ε_t can be rewritten, by repeated substitution, as
X_t = α^t X_0 + Σ_{k=0}^{t−1} α^k ε_{t−k}
Deriving from this, and knowing that the error term has an expected value of 0, we can
formulate the expected value of any point in the series as
E[X_t] = α^t X_0
Likewise, we can derive the formula for the variance at any given point in time for this ser-
ies. The variance only depends on non-constant terms which, in this case, are the ε and α
terms. The former is known to have σ² variance. Including the latter, the variance of Xt,
when t is large, for example, can be shown to be
Var(X_t) = σ² / (1 − α²)
only if |α| < 1 (see Shumway and Stoffer, 2017, for elaboration on the mathematical deri-
vation). The variance can be written as a geometric series. Accordingly, for a finite t, the
variance of the example time series is
Var(X_t) = σ²(α⁰ + α² + α⁴ + … + α^{2(t−1)})
which converges to σ²/(1 − α²) as t grows, provided |α| < 1.
In practice, a process like Sequence C is clearly non-stationary. For this series, α > 1, so
both E[X_t] and Var(X_t) "explode" for larger and larger values of t.
A series like Sequence A, in turn, appears to be stationary and actually is. For this
sequence, α < 1, so E[X_t] goes to 0 and Var(X_t) goes to σ²/(1 − α²) (which is con-
stant) for increasingly high values of t.
The problem is sequence B. At least in the interval 0,100 , it does not seem to explode.
Perhaps with more data, we would see a decline and finally something that fluctuates
around zero. We do not know. These sorts of processes are called “unit root” (or in this
case, a “random walk”). “Unit” comes from α being equal to one. The reason for the word
"root" is more involved and beyond the scope of this unit. In this case, E[X_t] = X_0 for all
values of t, which is constant and thus in line with the stationarity requirements. For this
reason, these processes may look like they are stationary. However, Var(X_t) = tσ², which
grows to infinity for increasingly large values of t. Accordingly, this process is, by definition,
not stationary. Do not be mistaken: a unit root process is a non-stationary process. The
problem is that, in practice, one has only a finite and usually small set of samples, and a
unit root sequence might be confused with a stationary sequence.
Now, the question is how to decide whether a given sequence is a unit root process or not.
To answer this question, the Dickey-Fuller test can be used. This tests the null hypothesis
of a unit root, i.e., α = 1 versus the alternative α < 1. Recall that the test of a hypothe-
sis is basically an objective rule of decision. To have a taste of the logic behind the test,
notice that the AR(1) model
X_t = αX_{t−1} + ε_t
can be rewritten, by subtracting X_{t−1} from both sides, as
∇X_t = (α − 1)X_{t−1} + ε_t
Recall that ∇ stands for the first lag difference. We also introduce a new convenience sym-
bol, δ = α − 1, so that
∇X_t = δX_{t−1} + ε_t
This way, we can re-write the null hypothesis H0: α = 1 as H0: δ = 0 and the alternative
hypothesis H1: α < 1 as H1: δ < 0. This equation can be seen as a regression equation,
and hence, we can apply the methods of regression analysis to obtain the least squares
estimates for δ (being α − 1) and σ_ε². With these estimations and the observations X_t,
we can build the test statistic
t_N = δ̂ / SE(δ̂),   where   SE(δ̂) = σ̂_ε / (Σ_{t=2}^{N} X_{t−1}²)^{1/2}
There is an extension of this test called the augmented Dickey-Fuller (ADF) test. It
addresses cases where one or both of the following situations occur:
a) the residuals ε_t are autocorrelated, and/or
b) the series contains a deterministic trend.
The test equation then becomes
∇X_t = c + μt + (α − 1)X_{t−1} + Σ_{i=1}^{p} β_i ∇X_{t−i} + ε_t
Situation “a” is addressed by the summation term. By including lagged differences weigh-
ted by a parameter βi (to be estimated), the autocorrelation effect can be removed if the
parameter p is correctly chosen. On the other hand, situation “b” is addressed explicitly by
the term c + μt, which is a line trend.
The parameter p (the order of the autocorrelation in the errors that needs to be removed) is usu-
ally set by cross validation and the AIC criterion, meaning this regression equation is estimated
for several values of p, and the value with the lowest AIC is chosen. The
test statistic and critical values are the same as for the simple Dickey-Fuller test.
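In the statsmodels implementation used later in this section, this automatic choice of p corresponds to the autolag argument of adfuller(); the series x below is a placeholder for any one-dimensional sequence.
Code
# ADF test with the lag order p chosen by the AIC
from statsmodels.tsa.stattools import adfuller
resultADF = adfuller(x, regression='ct', autolag='AIC')  # 'ct': constant and trend
print(resultADF[0], resultADF[1])  # test statistic and p-value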
Finally, consider an example using the Deutsche Bank closing stock prices (Bureau of
Transportation Statistics, n.d.). We want to see how the ADF test performs on this data set
for both closing prices and differenced prices. To strengthen our results, we will first exam-
ine the behavior of the autocorrelation function (ACF) of the price series (upper plot) and
of the differenced price series (lower plot).
Figure 32: Autocorrelograms: Deutsche Bank Closing Prices and Differenced Prices
The slowly decaying ACF of the stocks is a feature observed in many stock price time ser-
ies, leading some researchers to believe in the “random walk hypothesis” of the stock
market, thereby denying its predictability. Why? Recall the autocovariance formula
γ_h = σ²α^h / (1 − α²)
for the AR(1) process defined above. Furthermore, we know that the ACF is defined as
ρ_h = γ_h / γ_0 = α^h
Suppose that α is smaller than one but very close to it, for example, 0.99. In such a case, ρ_h
for the first seven values of h would be (starting at zero): 1, 0.99, 0.98, 0.97, 0.96, 0.95, and 0.94.
The ACF decreases very slowly, resembling the stock prices' ACF shown above. Assum-
ing there are not too many data points, statistically we would say that the AR(1) process
has a unit root, i.e., it is a random walk, because, in that case, α = 0.99 is indistinguishable
from α = 1. In other words, any difference between 0.99 and 1 is due to random error
and not due to a structural difference. A process consistent with the random walk
hypothesis has a variance that increases with time, making accurate predictions
impossible.
Whether the market is predictable or not, it can be seen that the ACF of the Deutsche Bank
stock price (Bureau of Transportation Statistics, n.d.) decays slowly as the lag increases,
which is consistent with the unit root hypothesis (random walk), as explained above. This
is a piece of evidence in favor of such a hypothesis based on the description of the ACF.
Now, let us evaluate the unit root using the ADF test. The Deutsche Bank stock prices
(Bureau of Transportation Statistics, n.d.) were downloaded with the library yfinance.
The ACF and ADF tests can be conducted with the following Python code:
Code
# import modules
import yfinance as yf
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import adfuller
# stock selection
stock = ['DBK.DE'] #Deutsche Bank stock code
stock_df = yf.download(stock,
    start='2019-01-01',
    end='2020-12-31',
    progress=False)
# plot ACF's
fig, axes = plt.subplots(2, 1, \
    figsize=(16,10), dpi=100)
# ACF of the closing prices
plot_acf(stock_df.Close, alpha=0.05, ax=axes[0], \
    title='Autocorrelation - Closing Prices')
axes[0].set_xlabel('Lag')
axes[0].set_ylabel('Correlation')
# ACF of the differenced prices
plot_acf(stock_df.Close.diff().dropna(), \
    alpha=0.05, ax=axes[1], \
    title='Autocorrelation - Differenced Prices')
axes[1].set_xlabel('Lag')
axes[1].set_ylabel('Correlation')
# ADF test on the closing prices
# (regression='n': no constant and no trend)
resultADF = adfuller(stock_df.Close.dropna(), regression='n')
# print results (the critical values are stored in resultADF[4])
print('ADF Statistic: %f' % resultADF[0])
print('p-value: %f' % resultADF[1])
for key, value in resultADF[4].items():
    print('\t%s: %.3f' % (key, value))
The results of the augmented Dickey-Fuller test (not including constant and trend) are
Code
ADF Statistic: 0.008459
p-value: 0.687086
1%: -2.570
5%: -1.942
10%: -1.616
Given the high p-value, we do not have the necessary statistical evidence to reject the
hypothesis of a unit root. If the same exercise is conducted for the differenced price series,
the p-value obtained is of order 10⁻⁹. In this case, the hypothesis of a unit root is rejected,
and stationarity is accepted. As previously stated, the first inspection of stationarity is vis-
ual, and thus quite simple. Tools like the Dickey-Fuller test and its augmented version are
necessary to evaluate stationarity in a more objective way.
SUMMARY
In time series analysis, a sequence Xt is usually decomposed into three
components: trend T t, seasonality St, and a residual term Y t. Mathemat-
ically, this can be expressed as
Xt = T t + St + Y t, t ∈ Z
A periodogram can be used to identify the dominant frequencies of a seasonal series;
these frequencies then serve as estimates for the harmonic regression. On the other hand, a seasonal
dummy variable regression is used when the period is known and con-
stant. This method is typically used to deseasonalize time series data.
UNIT 3
SIMPLE MODELS
STUDY GOALS
Introduction
Almost every event in life has multiple causes. Phenomena, from rainstorms to fluctuating
stock prices, are caused by innumerable factors that are weighted differently and com-
bined nonlinearly. To make sense of these sorts of phenomena, researchers try to extract
patterns from observations. As in the case of predicting rainfall, a physical model can
sometimes aid the search for causality. A physical law may account for the main body of
the data, and this can serve many practical purposes. However, if we increase the level of
accuracy in our measurements, the effects of unknown factors will eventually become evi-
dent as the data cease to perfectly match the law. Yet, in many situations, there is no
known law that can help us make predictions. In these cases, we must rely on statistical
methods to extract the main trends and patterns from the data while ignoring irrelevant
factors that, depending on the context, can be safely considered “noise.” This concept is
analogous to listening to a recording of a piece of music that has interference. We wish to
hear the music, a pattern of sounds, and tune out the noises that interfere with those
sounds.
To extract the trend of a set of observations, this unit presents moving average meth-
ods. As the name suggests, we will use simple averages (in general the mean, but
theoretically the mode or median may also be used) over rolling windows of data.
Three methods will be covered: the simple, centered, and weighted moving averages.
Given a point in time within a sequence of values, the first method averages a predefined
number of observations that occur just before that point. The second case will average
elements on both sides of a chosen point, including the observations made at that specific
time. The third method weights the observations differently. In this method, the observa-
tions in the direct vicinity of the relevant point in time have a stronger emphasis than the
observations that are further removed in time.
Figure 33: Average Precipitation in Germany
The average (in this case, the mean) of the 120 annual observations is 699.7 mm. However,
the average for the first ten years is 673.9 mm and for the last ten years it is 676.2 mm. This sug-
gests that there should be periods along the time span that have averages above 700 mm.
Thus, a raw average cannot effectively describe the evolution of the data.
The SMA method calculates the average over rolling windows of data of a predefined
length. The outcome is a smoothed version of the original data, in which the “noise” is
filtered out and the main trend is clearly represented.
Let x1, x2, …, xN be a sequence of data. Then, the SMA of order q is given by
x̂_t = (x_{t−1} + x_{t−2} + … + x_{t−q}) / q = (1/q) Σ_{i=1}^{q} x_{t−i}
Thus, the smoothed observation in t is given by the average of the previous q observations
(a one-sided average). Using the data on precipitation, the table below shows how to calculate
the SMA of order three (SMA(3)) schematically.
Figure 34: Calculating Precipitation Values Using an SMA(3) Model
To generate the first three SMA observations, the data points for t = 0, −1, and −2 are
required. Of course, those are not defined, and typically a convention is used. Here, for the
third smoothed observation, the average of the first two observations has been used. For
the second smoothed observation, the first value has been copied, and the first smoothed
observation is simply left as null.
Sometimes, practitioners apply the SMA by taking xt as the first observation instead of
xt − 1. This is a valid approach if the goal is trend estimation and not forecasting. The spec-
ification described above has been chosen as it has an interesting connection to the autor-
egressive processes.
Coming back to the precipitation example, the next figure depicts two plots. The first plot
consists of three curves: the original precipitation data, the SMA(5), and the SMA(10). The
second plot zooms in on the period 1960—1985.
Figure 35: Observed and Fitted Precipitation Values Using an SMA(5) and an SMA(10)
Model
• As more terms are included in the average (i.e., as the SMA order increases), the result-
ing curve becomes smoother. The blue line fluctuates more than the red line.
• The SMA does not portray fluctuations as they occur. For example, starting in 1965, a
peak is reached for two consecutive years followed by troughs. However, the blue line
reaches a peak only years later, and not at the same level as the precipitation data. The
red line increases only weakly.
A forecast method is usually judged, at least partially, based on how accurately it fits the
actual results. The SMA performs poorly according to this criterion. Why then should we
use it? In some cases, we need to be conservative about new trends. For example, when a
decision based on trend changes is expensive, one would like to be sure the new trend is
definitely occurring before initiating a different course of action. The SMA will show a clear
trend pattern only when there is enough past evidence that the data are following a trend.
The same holds when the data exhibit a short-term break in a trend. It might be the case
that a trend is interrupted by a few observations that move in a different direction, but
that it soon resumes its prior course. In this case, an SMA can prove useful.
Code
# Rainfall example
# import modules
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 18})
import numpy as np
# df is assumed to contain the precipitation data, indexed by year
# rename index
df.index = df.index.rename('Year')
# SMA(5) and SMA(10): averages of the previous q observations
# (this construction reconstructs a step omitted from the listing)
df['SMA5'] = df.Precipitation.shift(1).rolling(5, min_periods=1).mean()
df['SMA10'] = df.Precipitation.shift(1).rolling(10, min_periods=1).mean()
# define colors
colors = ['k','red','blue']
# plot the whole period and a subsection
fig, axs = plt.subplots(2, figsize=(16,12), dpi=100)
df.plot(ax=axs[0], color=colors, linewidth=3)
axs[0].grid()
# show a subsection
df.loc[1960:1985].plot(ax=axs[1], color=colors,\
    linewidth=3)
axs[1].grid()
How large should the rolling window be? To answer this question, we can use methods of
model selection, such as the Akaike Information Criterion (AIC) (Nielsen, 2019). The SMA
model is
X_t = (1/q) Σ_{i=1}^{q} X_{t−i} + ε_t
The underlying assumption is that the next value is an average of the last q observations
plus an error term with some distribution centered on zero. If we assume Gaussianity of
the residuals, then the log-likelihood function l(θ|X) (where θ is the set of parameters)
can be derived, which will depend only on the variance of the ε's, since their mean is zero.
(The log-likelihood function is the logarithm of the likelihood function. Under the i.i.d.
(independent and identically distributed) assumption of the Gaussian residuals, the
likelihood function is the product of the Gaussian densities of each data point; it is a
function of the parameters, where the data X is considered given.)
Now, the AIC formula is
AIC = 2 · No. Parameters − 2 · l(θ|X)
Interestingly, there are only two parameters in an SMA model, namely, the variance of the
εs and the order q (if the order is given, the weights are completely determined, meaning
these parameters do not need to be estimated). Thus, the number of parameters is always
two. This leads the AIC to depend only on the variance or the mean squared error (MSE) of
the residuals (for further details, see Svetunkov and Petropoulos (2018)). Therefore, for an
automatic order selection procedure, we can estimate several SMAs with different orders
and retain the one with the lowest MSE.
The AIC of the SMA(5) and SMA(10) of the previous example are given by the following
code:
Code
# define a function to calculate the AIC
# y: original data
# yhat: SMA data
# npar: number of parameters (by default equal to 2)
def getAIC(y, yhat, npar=2):
    error = y-yhat
    error = error[~np.isnan(error)]
    N = len(error)
    likelihood = (N/2)*np.log(2*np.pi) + \
        (N/2)*np.log(error.var()) + (N/2)
    result = 2*npar - 2*likelihood
    return result
print('SMA(5) - AIC:', getAIC(df.Precipitation, df.SMA5))
print('SMA(10) - AIC:', getAIC(df.Precipitation, df.SMA10))
# console output:
# SMA(5) - AIC: -1414.0229514348625
# SMA(10) - AIC: -1402.6497628190984
AIC_SMA(5) = −1414.02
AIC_SMA(10) = −1402.65
The SMA(5) model is chosen because it has the smallest AIC. To find the minimum
AIC, each SMA(q) would need to be estimated for q = 1, …, Q (with Q big enough), the AIC for
each model calculated, and the model with the minimum AIC selected.
Out-of-sample prediction is straightforward with this method. The SMA is calculated using
the last q observations to generate the first out-of-sample forecast. To generate the second
out-of-sample forecast, the first forecast plus the last q − 1 elements of the series are
averaged. We proceed iteratively to produce as many out-of-sample forecasts as we
choose. Consider the following diagram.
The black numbers depicted in the diagram are actual data. To generate the first out-of-
sample forecast point, the last five actual data points are averaged, which produces
669.702. Now, to generate the second out-of-sample forecast point, the last four actual
data points must be averaged along with the forecasted data point of the previous step.
Notice that after several iterations, we end up calculating averages over averages, so the
forecasted values converge to a given level. The forecasting capacity of methods such as
the SMA (and the upcoming weighted moving average) is sometimes misunderstood,
and the methods can be used for the wrong purposes. To forecast within acceptable error
bands, one must understand the underlying random process. In this case, we are implicitly
assuming that the next observation in the process is the average of the previous q data
points plus an error term (for instance, white noise) and that we can therefore safely pre-
dict by applying the SMA method. However, this might not necessarily be the case. In this
unit, emphasis has been placed on trend calculation rather than on any model assump-
tion. We can use the SMA method to estimate the trend, but for such forecasts based on
those estimations to be reliable, there must be concrete evidence that the data truly
behave like an SMA process. There is a whole family of models, and the features of each
model depend on a set of parameters that can be tuned to match specific characteristics
observed in the data.
Calculating the forecast with a horizon of 20 years for the precipitation data produces the
following graph.
Figure 37: Precipitation Forecast for an SMA(5) Model with a 20-Year Horizon
Code
# define a function for SMA out-of-sample forecasting
# data: original data
# q: SMA order
# h: forecast horizon
# (the function name and loop are a reconstruction of the omitted body)
def smaForecast(data, q, h):
    # vector holding the values to be forecasted plus the
    # last q observed values (size of the window)
    fc = np.zeros(q+h)
    fc[:q] = data[-q:]
    for i in range(q, q+h):
        # average of the previous q values
        fc[i] = fc[i-q:i].mean()
    return fc
fc5 = smaForecast(df.Precipitation.values, q=5, h=20)
# reset index
df = df.set_index([pd.Index(df.index.values+1901)])
# observations
plt.plot(df.index, df.Precipitation, color='k')
# fitted SMA(5) values (in the figure, the 20 out-of-sample
# forecasts in fc5 are appended to the right of this curve)
plt.plot(df.index, df['SMA5'], color='red')
3.2 Moving Average
Moving average methods are clever ways of smoothing data to capture trends. The goal
can be forecasting or it can be detrending to analyze other underlying dynamics. Many
different procedures can be termed moving average methods, including SMA. However,
what is usually understood as a moving average method is an average of the observations
on both sides of a given data point.
We will refer to this method as a centered moving average of order q, or CMA(q). In the
literature, the word “centered” is dropped, and the method is simply called a moving aver-
age. We add this word to differentiate this method from the widely known moving average
autoregression model of order q. Despite their similar names, a CMA(q) method and an
MA(q) model have little in common.
The table below displays the concept of a CMA(3) model using precipitation data. The col-
ored values are given by the CMA(3) model.
When q is an odd number, the CMA(q) averages d points on the left, d points on the right, and
one point at the center, where d = (q − 1)/2.
x̂_t = (x_{t−d} + … + x_t + … + x_{t+d}) / q = (1/q) Σ_{i=−d}^{d} x_{t+i}
In the case of an even order q, the formula is slightly different, and is expressed as
x̂_t = (0.5·x_{t−d} + x_{t−d+1} + … + x_t + … + x_{t+d−1} + 0.5·x_{t+d}) / q
where d = q/2.
The figure below depicts two plots. In the upper plot, one sees the precipitation sequence,
the CMA(3), and the CMA(7) for the whole period. The lower plot zooms in on the period
1960—1985.
Figure 39: Observed and Fitted Precipitation Values Using a CMA(3) and a CMA(7) Model
Compared to the SMA method, the CMA method performs better when the goal is to cap-
ture some features of the original sequence in real time, i.e., without delays. As mentioned
before, this is useful when the emphasis is placed on time series detrending.
Code
# Rainfall example
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 18})
import numpy as np
# df is assumed to contain the precipitation data, indexed by year
# rename index
df.index = df.index.rename('Year')
# moving average
df['CMA3'] = df.Precipitation.rolling(3, \
    center=True, min_periods=1).mean()
df['CMA7'] = df.Precipitation.rolling(7, \
    center=True, min_periods=1).mean()
# define colors
colors = ['k','red','blue']
# create the two panels and show the whole period
# (the subplot creation was omitted in the printed listing)
fig, axs = plt.subplots(2, figsize=(16,12), dpi=100)
df.plot(ax=axs[0], color=colors, linewidth=3)
axs[0].set(xlabel='Year', ylabel='Rainfall [mm]')
axs[0].grid()
# show a subsection
df.loc[1960:1985].plot(ax=axs[1], \
    color=colors, linewidth=3)
axs[1].set(xlabel='Year', ylabel='Rainfall [mm]')
axs[1].legend(labels=['Precipitations',\
    '3-year CMA','7-year CMA'])
axs[1].grid()
The Pandas method rolling(q, center, min_periods) has been used here to
calculate the moving average. If center=True, the function calculates a CMA with d obser-
vations on each side of the centered observation (d = (q − 1)/2 or d = q/2). The option
min_periods indicates the minimum number of observations needed to calculate the
moving average, particularly at the ends of the sequence. For instance, if this number is
set to two, only two observations are required to calculate the first CMA value. The func-
tion detects that the zero observation does not exist but that the t = 1 and t = 2 observa-
tions are available, and it thus calculates the average using only these values.
This method cannot be used for prediction, because it involves future observations to esti-
mate each value. Its importance lies in being a simple method for data detrending.
Smoothing methods depend on a smoothing parameter. In the case of the CMA method,
that parameter is the order q. For the SMA method, we used the AIC to select the order,
which in that specific case is reduced to calculate the variance. In the case of a CMA, mini-
mizing the variance would result in a reproduction of the original series (i.e., a CMA(1)
model), because all the residuals would be zero. In practice, the order is defined by trial
and error. Usually, the researcher has a level of smoothness in mind, given the context of
the problem. Furthermore, the CMA method discussed here is a particular case of the “lin-
ear filter,” the weighting of which is defined by the user. The purpose is to capture trends
of polynomial order while filtering out possible noise components (for more details, see
Brockwell & Davis (2009)).
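As a minimal sketch of such a user-defined linear filter, the precipitation series from the example above could be smoothed with a centered window of length five and hand-picked weights; the weights below are illustrative and sum to one.
Code
# user-defined centered linear filter of length 5
import numpy as np
weights = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # symmetric, sum to 1
smoothed = np.convolve(df.Precipitation.values, weights, mode='valid')
# 'valid' keeps only the N - q + 1 fully overlapping points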
Two questions arise. How many weights should be considered, and what should the val-
ues of those weights be?
The WMA(q) method is widely used in technical analysis, a methodology that utilizes
charts, statistics, and, more recently, AI (artificial intelligence) and ML (machine learning).
In this context, the analyst must fix only the order q, since the weights are usually assumed
to decrease following an arithmetic progression. It is believed that the most recent prices
carry more information about future values than older prices. A weighted average is still
an average; therefore, the weights must add up to 1.
Consider Pfizer's closing stock prices between March 1, 2020 and July 31, 2020 (Yahoo
Finance, n.d.-a) (the Python library yfinance was used to download the data). A few val-
ues have been taken from this sequence to illustrate the calculations in the table below.
Recall that an arithmetic progression is a sequence of numbers such that the difference
between two consecutive terms is constant, for example, 3, 6, 9, 12, …
Figure 40: Calculating Pfizer Stock Prices Using a WMA(3) Model
In this example, only three weights have been defined: one, two, and three. Why are the
weights then divided by six? Recall that the sum must be equal to one. In general,
having specified the weights as 1, 2, …, q, we divide by the sum of all the weights, i.e.,
1 + 2 + 3 + … + q = q(q + 1)/2 (in our example, 3 · 4/2 = 6). Thus, the highest weight is
assigned to the last observation, decreasing arithmetically as we move back in time.
x̂_t = (1·x_{t−q} + … + q·x_{t−1}) / (q(q + 1)/2) = (1/(q(q + 1)/2)) Σ_{k=1}^{q} (q − k + 1)·x_{t−k}
The figure below depicts the closing prices of Pfizer’s stock with a WMA(10) with arithmeti-
cally decaying weights.
Figure 41: Closing Prices of the Pfizer Stock and WMA(10) Model Fits with Arithmetically
Decaying Weights
• As in the case of the SMA method, there are some observations at the beginning that
cannot be calculated, because the window involves data that do not exist. A possibility
(the one used here) would be to shorten the length of the windows of the WMA as we
move closer to the left boundary.
• The WMA curve seems to be lagged, or delayed, compared to the black curve. This
reveals why this method is used in finance. When the red line starts to rise, that indi-
cates the actual price is experiencing a sustained period of growth and not just a couple
of lucky days. This could be a good indication to buy. On the contrary, if a stock’s WMA
curve begins to drop, that could be a signal to sell.
Code
# import modules
import yfinance as yf
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 18})
import numpy as np
# download the Pfizer closing prices (ticker PFE)
stock_df = yf.download(['PFE'], start='2020-03-01',
                       end='2020-07-31', progress=False)
# define a function for
# arithmetically decreasing weights
def getArithmWeights(q):
    weights = np.arange(1, q+1)/(q*(q+1)/2)
    return weights
# WMA(10) with arithmetically decaying weights
# (in this sketch the first q-1 values are left as NaN)
wma10 = stock_df.Close.rolling(10).apply(
    lambda x: np.sum(getArithmWeights(10)*x))
# plot observations
plt.plot(stock_df.index, stock_df.Close, color='k')
# plot the WMA(10)
plt.plot(stock_df.index, wma10, color='red')
The following Python code can be used for WMA(q) model selection.
Code
# define a function to calculate the AIC
# (in this case just the variance of residuals)
def getAIC(y, yhat, npar=2):
    error = y-yhat
    error = error[~np.isnan(error)]
    res_var = error.var()
    return res_var
# try several orders and keep the one with the lowest AIC
# (the search range 2, ..., 20 is chosen for illustration)
aic_values = []
for q in range(2, 21):
    wma = stock_df.Close.rolling(q).apply(
        lambda x: np.sum(getArithmWeights(q)*x))
    aic_values.append(getAIC(stock_df.Close, wma))
best_q = int(np.argmin(aic_values)) + 2
print('Best value for q:', best_q)
print('Corresponding AIC:', min(aic_values))
# console output:
# Best value for q: 2
# Corresponding AIC: 0.82552768533932
Two questions were posed at the beginning of this unit.
It was stated that if we impose the number of weights, say five for instance, then we can
assume, like in a technical analysis, that the weights decrease following an arithmetic pro-
gression: 5/15, 4/15, 3/15, 2/15, and 1/15. This is a valid approach but certainly not the
only one. Alternatively, we can assume that the weights decrease exponentially. This
technique is called an exponential moving average.
Letting the data speak for themselves is another possibility. For instance, assume the
order q is given and equals three but the values of the weights are unknown. In this case,
we would need to estimate the weights as the parameters of the following regression equation:
x_t = β_1 x_{t−1} + β_2 x_{t−2} + β_3 x_{t−3} + ε_t
This is done under the condition β_1 + β_2 + β_3 = 1 and provided that all weights are non-
negative; the terms x_{t−1}, x_{t−2}, and x_{t−3} are lagged copies of the original x_t. The
Python code used in this case is depicted below.
Code
# import modules
from scipy import optimize as opt
from scipy.optimize import Bounds
# lagged copies of the closing prices; the columns are ordered
# x_{t-3}, x_{t-2}, x_{t-1}, matching the printed output below
X = np.column_stack([stock_df.Close.shift(k) for k in (3, 2, 1)])[3:]
y = stock_df.Close.values[3:]
# starting values for the weights
# (repeating 1/3 for 3 times)
start_weights = np.tile(1/3, 3)
# minimize the squared error with weights in [0, 1] that sum to one
res = opt.minimize(lambda w: np.sum((y - X @ w)**2),
                   start_weights, bounds=Bounds(0, 1),
                   constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1})
res.x
# console output:
# array([0.06571588, 0.15782975, 0.77645437])
# plot observations
plt.plot(stock_df.index, stock_df.Close, color='k')
Figure 42: Closing Prices of the Pfizer Stock and Model Fits of an Optimal WMA(3) Model
The function optimize from the library scipy has been used in this example. Essentially,
the usual linear regression estimation has been replicated using a minimizer with a set of
constraints: The coefficients add up to one, and they must lie between zero and one.
β1 = 0.776
β2 = 0.158
β3 = 0.066
Interestingly, this method assigns the weights in decreasing order, following the logic of
the arithmetic decaying weights.
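The optimization call itself is not shown in the truncated listing above. A minimal sketch of the constrained estimation step is given below; it assumes, as in the earlier listings, that the closing prices are stored in stock_df.Close, and the variable names and solver settings are illustrative rather than the exact code used to produce the figure.

Code
import numpy as np
from scipy.optimize import minimize, Bounds

x = stock_df.Close.values              # closing prices (assumption)
y = x[3:]                              # target values x_t
# lagged copies x_{t-1}, x_{t-2}, x_{t-3}, aligned with y
X_lags = np.column_stack([x[2:-1], x[1:-2], x[:-3]])

def sse(beta):
    # sum of squared residuals of the weighted-lag fit
    return np.sum((y - X_lags @ beta) ** 2)

start_weights = np.tile(1/3, 3)
res = minimize(sse, start_weights, method='SLSQP',
               bounds=Bounds(0, 1),
               constraints={'type': 'eq', 'fun': lambda b: b.sum() - 1})
res.x                                  # the three estimated weights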
SUMMARY
There are three variants of moving average methods: simple, centered,
and weighted. All methods depend on a parameter q, called order. This
parameter determines the degree of smoothness of the resulting curve,
and the weights are interpreted as the level of importance of averaged
values.
The SMA is a basic tool used to extract trend patterns based on the aver-
age of the previous q observations. Depending on the underlying ran-
dom process, the SMA can be used to forecast. The CMA corresponds to
a smoothing technique and cannot be used to forecast, because it uses
past and future observations to estimate the smoothed values. As a
curve smoother, it performs better than the SMA method at extracting
the trend, since smoothed estimations are centered at the respective
point in time of the sequence. The last method discussed is the WMA.
The WMA follows the same form as the SMA and CMA, but as its name
suggests, observations are weighted differently. Provided the order is
known, the weights can be determined in two ways. Motivated by appli-
cations in technical analysis, the weights are frequently chosen as an
arithmetic progression. Additionally, the weights can be estimated
directly from the data without the need to impose rules of decay. Like
the SMA, the WMA can also be used to forecast, provided the underlying
model follows such a structure.
In the case of the SMA method and the examples of the WMA method
given in this unit, the AIC model selection procedure simplifies the proc-
ess by choosing the order that yields the lowest variance. In the case of
the CMA method, the order must be determined depending on the con-
text of the problem.
UNIT 4
ARMA MODELS
STUDY GOALS
Introduction
Whenever we need to make predictions about the future, we try to establish causal rela-
tionships between what is being predicted and the information we already know. By
repeatedly observing the same or similar patterns, we can establish what will happen with
some degree of certainty. The specific way factors are combined is not always clear, but
we humans have the ability to organize complex dynamics and derive meaningful conclu-
sions from them.
This method of drawing conclusions about the future permeates the sciences. We observe
a phenomenon and try to establish causal relationships between its constituent parts. We
typically begin by creating models in which the factors have simple, uncomplicated relationships. In the field of time series analysis, the autoregressive moving average, or
ARMA models, also follow this path. These models “learn” from the past, combine factors
in a simple, “linear” way, and are parsimonious in the sense that better models tend to be
less complex.
Arguably, the so-called ARMA models are the most widely used in the field of time series
analysis, and their prestige is well deserved. Introduced in the 1950s by Peter Whittle
(1951), they did not become popular until the 1970s through the work of George Box and
Gwilym Jenkins (1970). Nowadays, numerous extensions with very specific applications
exist.
There are three different ARMA models: the autoregressive model (AR), the moving aver-
age model (MA), and the combination of these two, the autoregressive moving average
model (ARMA). There are also extensions of this model: the integrated model, the seasonal
integrated model, and the seasonal integrated model with exogenous variables. The inte-
grated version covers situations in which the data have some trend (stochastic or deter-
ministic), while the seasonal approach attempts to model time series with strong seasonal
components, as seen in sales and meteorological data. Finally, the seasonal version with
exogenous variables deals with situations in which there are additional variables that
might convey information on the period under study but which may also have serial corre-
lations that can be useful for prediction.
ARMA models are probably the most widely used tool in the practice of time series analy-
sis, and are used to make predictions in various fields. Some examples of such predictions
include sales (Arunraj et al., 2016), COVID-19 incidence rates (Poleneni et al., 2021), and
electricity consumption (Elsaraiti et al., 2021), which makes this topic of particular impor-
tance for practical purposes. Nevertheless, on a more theoretical level, ARMA models also
involve deep mathematical concepts. Understanding them well helps avoid the misuse of
these models. This unit takes an intermediate approach. The general characteristics of
these models are illustrated through examples with data. Whatever the case, general
statements will be derived from naturally occurring results, such as when developing the-
orems.
Let us first consider the following example on global temperatures. This data set contains
141 global temperatures measured annually between 1880 and 2020 (National Aeronau-
tics and Space Administration, 2021). These data consist of deviations from the average
temperature (the delta temperature) in the years 1951—1980. To remove the trend, we
have fitted two simple regression models with ordinary least squares (OLS).
Source: Danilo Pezo, based on National Aeronautics and Space Administration (2021).
Visually, the quadratic model (blue-dotted line) fits better than the red line. The underly-
ing assumption of these models is that the temperature data xt is equal to a trend func-
tion T t, given either by a quadratic or linear function, plus an error term εt, called a resid-
ual. Mathematically, this is written as
xt = T t + εt
Thus, subtracting T t from xt, we obtain the residuals. The sets of residuals for both models
are plotted in the figure below. The linear model does not capture the curvature of the
data. The residuals reflect that lack of fit in the sense that they seem to describe a curved pattern of their own rather than varying stably over time. Analyzing the series of residuals is
common practice when evaluating a model fit. They should exhibit no clear pattern, which
is not the case for the first model. The quadratic model performs better in this regard, so
we will choose it for now.
Figure 44: Residuals of a Linear (Top) and Quadratic (Bottom) Model for Global Average
Temperatures (1880—2020)
Source: Danilo Pezo, based on National Aeronautics and Space Administration (2021).
Is there any additional information contained in the residuals that might help us to
enhance the forecasts?
What distinguishes time series from cross sectional data is the time dependence between
observations. Theoretically, such dependence might take any functional form. However,
following the principle of Occam’s razor, we should investigate, at least initially, the sim-
plest possible relationship. In this case, this would be a linear relationship, and the ACF
and PACF provide a number interpreted as the “degree of linear dependence” between the time sequence and a lagged version of itself. That is the information we want to exploit, because it links the future and past values of a time series. Put simply, we seek to predict a value of the series based on the previous value, and the ACF and PACF provide insights into how one value is connected to the previous values.

Functional form: The mathematical relationship between two or more variables given by a function (polynomial, rational, trigonometric, etc.) is called a functional form.

Occam’s razor: The principle of Occam’s razor is also known as the principle of parsimony. Among two or more competing explanations, we choose the one with fewer assumptions.

The ACF and PACF of the residuals of the quadratic model are plotted in the figure below.

Figure 45: ACF and PACF of the Residuals: Quadratic Model
A plot of the autocorrelation is called a correlogram. The “pin” length at lag 0 is always 1,
because it represents the correlation of the series with itself. These plots were created in
Python’s library statsmodels, which displays a 95 percent confidence interval by default.
The pins that reach outside of the confidence bands (the blue areas) are deemed significantly different from zero at the 5 percent level. Plainly speaking, the PACF plot, for instance, tells
us that one prior observation (time series lagged by 1) contains useful information for
forecasting the next data point.
What does this have to do with ARMA models? Many different autocorrelation and partial
autocorrelation patterns can be generated using these models. Thus, it is possible to set
the parameters in a way such that the resulting ACF and PACF match the sample versions
using the sequence under analysis.
Autoregressive Models
The basic idea behind an autoregressive model is that the current observation of a time
series is explained by its past values. The way the values relate to each other is linear, i.e.,
Xt = ϕ1Xt − 1 + … + ϕpXt − p, where the ϕs are constants, which is also a characteristic
shared by the general ARMA models. The simplest autoregressive model is
Xt = ϕXt − 1 + εt
εt ~ WN(0, σ²)
where ϕ is a parameter (estimated during the analysis) and the εt is a white noise process
of zero mean and finite variance σ2 < ∞. This model is referred to as an autoregressive
model of order 1, or AR(1). The idea is rather simple: The next value is given by the current
value, multiplied by a constant plus an error term, usually assumed to be a zero-mean
white noise. To make things clearer, suppose we define the first value of the Xt process as 0, while ϕ = 0.8, and the white noise variance equals 1. Were a series to follow this pattern, we would obtain the series values depicted in the table below.

White noise: A zero-mean uncorrelated random process with constant variance is described as being white noise.
t | ε | Xt (operation) | Xt (value)
0 | — | — | 0
In the second column, ε, independently and identically distributed random numbers are
displayed. They follow a standard Gaussian distribution with a mean of 0 and a variance of
1. The third column shows the operation being conducted (the calculation) and the fourth
column displays the resulting values. Using this approach, further data points can be
simulated and are depicted in the figure below.
Figure 46: AR(1) Simulation: 100 Points
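The simulation itself takes only a few lines of Python; the sketch below reproduces the scheme just described (the random seed is arbitrary).

Code
import numpy as np

# simulate 100 values of X_t = 0.8 X_{t-1} + eps_t with X_0 = 0
np.random.seed(42)                     # arbitrary seed for reproducibility
n, phi = 100, 0.8
eps = np.random.normal(0, 1, n)        # Gaussian white noise, variance 1
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t-1] + eps[t]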
The general autoregressive model of order p ≥ 1 (with integer p), the AR(p) model, has the form

Xt = ϕ1Xt−1 + … + ϕpXt−p + εt
εt ~ WN(0, σ²)
The moving average model (MA) is similar to the AR models, but instead of predicting val-
ues based on previous values, they are predicted based on previous errors. An MA of order
q (MA(q)), is a linear combination of past white noise terms. Its mathematical formulation
is like the AR(p), but instead of regressing Xt in terms of lagged versions of itself, the for-
mulation is like a regression in terms of a white noise sequence. Consider the example of
an MA(1) model
Xt = εt + θεt−1
εt ~ WN(0, σ²)

where θ = 0.8, σ² = 1, and the white noise is Gaussian. Using this formula and parameters,
values are simulated based on previous errors, as depicted in the table below.
t | ε | Xt (operation) | Xt (value)
0 | 0.805 | — | 0
We do not have to assume an initial value for Xt but we do for εt. It is not obvious why
such a process would be suitable to represent the dynamics of the process Xt. In the end,
the right-hand side of the last formula depends entirely on εt and its lagged versions,
which is just white noise. This is correct for all Xt, not for only one. All Xt are generated in
the same way and thus share the same underlying dynamic.
The general MA model of order q ≥ 1 (with integer q), the MA(q) model, has the form

Xt = εt + θ1εt−1 + … + θqεt−q
εt ~ WN(0, σ²)
where the θs and σ2 are parameters to be estimated. The way this process is defined
allows its autocorrelation to be calculated easily. This will be addressed in more detail
later on. One might wonder why this model should be preferred over an AR(p). It can be
shown, for instance, that the MA(1) used in the last example can be written as
Xt = −Σ_{k=1}^{∞} (−θ)^k Xt−k + εt = 0.8Xt−1 − 0.64Xt−2 + 0.512Xt−3 − … + εt

where the numbers 0.8, 0.64, 0.512, and so on, correspond (in absolute value) to the powers of the parameter θ, which is equal to 0.8. In other words, an MA(1) model can be represented as an infinite
AR process, AR(∞). Of course, in practice we cannot work with an infinite autoregression
(time had a beginning, as far as we know). What we do have in practice are processes that
might depend on many lags, say p > 20. The number of parameters to estimate is high
and comparisons among them would not be easy. However, when the parameters
decrease exponentially, as in this example, the MA(1) model provides a parsimonious rep-
resentation, for which only two parameters need to be estimated: θ and σ2.
Simulating an MA(1) is very simple. A sequence of white noise terms (i.i.d. standard Gaussian numbers, for instance) must be generated, and its first lag multiplied by θ. A whole Xt time series is obtained by adding these two sequences. A further simulation of the MA(1) process (of which the first few values are depicted in the table above) is plotted in the figure below.

i.i.d.: The abbreviation i.i.d. stands for variables that are independent and identically distributed.
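A sketch of this recipe in Python could look as follows (the seed is arbitrary; θ = 0.8 as in the table above).

Code
import numpy as np

# simulate 100 values of X_t = eps_t + 0.8 eps_{t-1}
np.random.seed(7)                          # arbitrary seed
n, theta = 100, 0.8
eps = np.random.normal(0, 1, n + 1)        # one extra value for the initial lag
x = eps[1:] + theta * eps[:-1]             # add the noise and its first lag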
The autoregressive moving average model, or ARMA model, is a combination of the two
models, AR and MA. By combining them, one profits from the conciseness and parsimony
of the final representation. While some model specifications would require several param-
eters to be estimated when using an AR or MA model, an equivalent ARMA model requires
significantly fewer estimations. Additionally, different aspects of the series and perspec-
tives on the underlying patterns can be combined.
Xt = ϕ1Xt − 1 + εt + θ1εt − 1
εt ~ WN(0, σ²)
This model is a combination of the AR(1) and MA(1), called an ARMA model of order (1,1),
or ARMA(1,1). In the table below, some values are calculated to illustrate how the Xt val-
ues are generated.
t | ε | Xt (operation) | Xt (value)
0 | 0.221 | — | 0
2 | 0.52 | 0.3·2.44 + 0.52 + 0.4·2.353 | 2.19
The general autoregressive moving average model of order (p,q), or ARMA(p,q) with both
integers p, q ≥ 0, has the form

Xt = ϕ1Xt−1 + … + ϕpXt−p + εt + θ1εt−1 + … + θqεt−q
εt ~ WN(0, σ²)

where the ϕs, θs, and σ² are parameters to be estimated. In the table above, a raw idea is
presented to simulate the ARMA(1,1). This idea can be extended to simulate a general
ARMA(p,q) model, p, q ≥ 1, provided initial values for Xt − 1, …, Xt − p and a white noise
sequence. It should be clear that the AR(p) and MA(q) models are particular cases of the
general ARMA(p,q) model, where AR(p) = ARMA(p, 0) and MA(q) = ARMA(0, q).
A short piece of Python code illustrates the simulation of the ARMA(1,1). The logic of the
AR(1) and MA(1) follows analogously.
Code
# import modules
import matplotlib.pyplot as plt
import numpy as np

# simulate n values of an ARMA(1,1) process with Gaussian white noise
def simulate_arma11(n, phi, theta, sigma2=1.0):
    eps = np.random.normal(0, np.sqrt(sigma2), n)
    x = np.array([0.0])                # initial value X_0 = 0
    for t in range(1, n):
        xt = phi * x[t-1] + eps[t] + theta * eps[t-1]
        x = np.append(x, xt)
    return x
Using this code, we have simulated 100 values of an ARMA(1,1) with Gaussian standard
errors, with parameters ϕ=0.3, θ = 0.4, and σ2 = 1.
No pattern is apparent, although this series was created using a very simple model that
should have generated a structured pattern. Do more data points need to be added in
order to detect a pattern? The answer is no. In reality, it will become apparent that this
seemingly noisy sequence has a very clear ACF and PACF. The moral of this story is that
while some time series encountered in applications might look noisy and void of struc-
ture, their ACF and PACF might tell a different story, and we might find an adequate model
representation of the data through an ARMA model.
As complexity is added to an ARMA model by adding more lags and parameters, the for-
mulae involved become cumbersome to handle. To maintain a good overview, a new
notation is presented that will prove useful in future manipulations.
A New Notation: The Backshift Operator
The backshift operator is a convenience notation to shorten long time series formulae that
contain many lags. It is defined by
B^k Xt = Xt−k
where k can be any integer. Applied to a series, the backshift operator creates a new series that is lagged by k units. If k is positive, the lag occurs backwards (increasing steps into
the past), hence the name “backshift.” Using this operator, the AR(p) can be written as
Xt = ϕ1Xt − 1 + … + ϕpXt − p + εt
Xt − ϕ1Xt − 1 − … − ϕpXt − p = εt
Xt − ϕ1B1Xt − … − ϕpBpXt = εt
(1 − ϕ1B1 − … − ϕpBp)Xt = εt

Note that the left-hand side of the last equation contains a polynomial in B of degree p. This polynomial is called an AR polynomial. It is denoted with an uppercase Φ (Phi) as

Φ(B) = 1 − ϕ1B1 − … − ϕpBp
Φ(B)Xt = εt

Analogously, the MA(q) model can be rewritten with the backshift operator:

Xt = εt + θ1εt−1 + … + θqεt−q
Xt = εt + θ1B1εt + … + θqBqεt
Xt = (1 + θ1B1 + … + θqBq)εt

The MA polynomial is denoted with an uppercase Θ (Theta) as

Θ(B) = 1 + θ1B1 + … + θqBq
Xt = Θ(B)εt
Putting all this together, the ARMA(p,q) model notation can be simplified to
Φ(B)Xt = Θ(B)εt
Note that the backshift operator is sometimes also called the lag operator and denoted
with an uppercase L. Accordingly, the ARMA(p,q) model is sometimes denoted in the liter-
ature as
Φ(L)Xt = Θ(L)εt
This notation will be used when formulae become too large to easily scan; however, before
doing so, two concepts fundamental to ARMA models need to be explored.
So far, we have defined ARMA models in general. These models depend on parameters
that can take any number. Nevertheless, not all sets of parameters yield applicable models
in practice. For that purpose, we need models that are “causal” and/or “invertible.”
Causality
Consider again the simplest autoregressive model, the AR(1):

Xt = ϕXt−1 + εt
where the εt is a white noise process of mean zero and variance σ2. In the figure below,
three simulations of this process are depicted, varying the parameter ϕ and assuming
Gaussian noise with σ2 = 1. Sequence A was simulated with ϕ=0.8. It appears stable with
values fluctuating between -3 and +3. Sequence B shows a downward trend whose value
for ϕ is 1 (you may have noticed that this series has a unit root; if you do not know what a
unit root is, now is a good time to look up the concept). Finally, sequence C basically
explodes. Its value for ϕ is 1.1.
Figure 49: Three AR(1) Process Simulations
The behavior seems to be very sensitive to the values of the parameter ϕ. Why is that? To
explain this, some mathematical elaboration is required.
Since all observations are given by the expression Xt = ϕXt − 1 + εt, the element Xt − 1
satisfies
Xt − 1 = ϕXt − 2 + εt − 1
Substituting this expression into the equation for Xt yields

Xt = ϕ(ϕXt−2 + εt−1) + εt = ϕ²Xt−2 + εt + ϕεt−1

After a bit of algebraic manipulation and replacing the expressions for Xt−2, Xt−3, and so on until Xt−N, the resulting formula is

Xt = ϕ^N Xt−N + εt + ϕεt−1 + … + ϕ^(N−1) εt−N+1 = ϕ^N Xt−N + Σ_{k=0}^{N−1} ϕ^k εt−k
• As N increases, this expression will blow up unless we impose |ϕ| < 1. In such cases, the powers of ϕ decrease in absolute value so the result does not explode (recall, for instance, that 0.8^1000 ≈ 0, but 1.1^1000 ≈ ∞). It can be proven that this is the necessary and sufficient
condition for stationarity of the AR(1) model defined above. In the case of ϕ = 1, it cor-
responds to the random walk case. Its variance increases proportionally with time;
therefore, it is not stationary.
• If |ϕ| > 1, we can still achieve a stationary process. The model formula for the AR(1)
case is
Xt = ϕXt − 1 + εt
Xt−1 = ϕ^(−1)Xt − ϕ^(−1)εt

Now, this model has a parameter, ϕ^(−1), which is less than one in absolute value (i.e.,
the new model is stationary). The only drawback of this model is that it depends on
future values (the right-hand side of the last formula depends on the one-step ahead
observation and error.) If we repeat the substitution procedure explained above, we will
end up with observations Xt written in terms of future errors εT for T > t. This is clearly useless for practical applications, because the dependence must flow from the past to the future. This correct direction of dependence is called causality. Thus, a
model is referred to as being causal when it can be written as the weighted sum of an
arbitrarily large number of “past” errors and when the sum of all weights in absolute
value is finite, i.e.,
Σ_{k=0}^{∞} |ϕ^k| < ∞

In our example, under the constraint of causality, the convergence is guaranteed if, and only if, |ϕ| < 1 (a quick numerical check of this condition is given after this list).
• In the case of an MA(q) model with q < ∞, the model is causal by definition. Notice
that Xt is represented by the sum of q weighted past error terms,
Xt = εt + θ1εt − 1 + … + θqεt − q
No matter how large the weights are, their sum will always converge, since there are
only q terms that are, by definition, less than infinity. In mathematical terms, this is
described as
1 + |θ1| + |θ2| + … + |θq| < ∞
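As announced above, a quick numerical illustration of the AR(1) condition: the weights ϕ^k are absolutely summable only when |ϕ| < 1 (the truncation at 1,000 terms is arbitrary).

Code
import numpy as np

phi = 0.8
np.sum(np.abs(phi) ** np.arange(1000))   # ~ 5.0, i.e., 1 / (1 - 0.8)

phi = 1.1
np.sum(np.abs(phi) ** np.arange(1000))   # huge; the series diverges as terms are added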
As for AR models, the discussion until now has revolved around the AR(1) model. However,
given a set of parameters for an AR(p) model, can it be determined whether such a model
is causal or not? To answer this question, recall the notation introduced to represent ARMA
models using the backshift operator B. The AR(p) model is
Φ(B)Xt = εt
Φ(B) = 1 − ϕ1B1 − … − ϕpBp

For the AR(1) model, the AR polynomial reduces to

Φ(B) = 1 − ϕB

To check causality, we replace the operator B by a complex variable z and look at the roots of Φ(z) = 0. For the three simulated sequences above (ϕ = 0.8, 1, and 1.1), the solutions of these three simple equations are 1.25, 1, and 0.91, respectively; recall that only the first model is stationary under the constraint of causality. The fact that the causal model has a root larger than 1 is not a coincidence. This can be written as a general property, even for ARMA(p,q) models.

In terms of the property causality, an ARMA(p,q) model is causal if, and only if, all the zeros of Φ(z) are larger than one in absolute value.
The mathematical derivation lies beyond the scope of this text. The interested reader can
verify the mathematical details in Shumway and Stoffer (2017, pp. 87—88, 495—497.)
For example, to analyze the causality of the AR(2) model given by the equation
Xt = (7/6)Xt−1 − (1/3)Xt−2 + εt
Xt − (7/6)Xt−1 + (1/3)Xt−2 = εt
(1 − (7/6)B + (1/3)B²)Xt = εt

Then

Φ(z) = 1 − (7/6)z + (1/3)z²
NumPy’s polynomial module can be used to find the roots of this quadratic polynomial.
Code
# import modules
import numpy as np

# AR polynomial 1 - (7/6)z + (1/3)z^2
p = np.polynomial.Polynomial([1, -7/6, 1/3])
p.roots()
# console output
# array([1.5, 2. ])
The roots 1.5 and 2.0 are both larger than one, and thus the example AR(2) model is
causal.
Finally, it is worth noting how the terms causality and stationarity are used. While a time
series process described by Xt can be stationary, an ARMA model describing the relationship between Xt and the white noise εt can be causal.
Invertibility
The second desirable property of an ARMA model is invertibility. Consider the following
two ACFs depicted in the figure below, which are derived from two different MA(1) models
with θ = 5 and θ = 1/5, respectively.
Figure 50: Two ACFs of Two Different MA(1) Processes
Both ACFs look similar. In fact, given the underlying models, they are theoretically the same. The ACFs at lags 0 and 1 are ρ(0) = 1 and ρ(1) = 5/26 ≈ 0.192, respectively. We take these values as given, but the interested reader is directed to Shumway and Stoffer (2017) for a review of the ACF and PACF of the general MA(q) model. The correlation at any lag h ≥ 2 is ρ(h) = 0. The tiny differences in the plot are consequences of these being two
simulated data sets.
(1) Xt = εt + 5εt−1
(2) Xt = εt + (1/5)εt−1
The question is, given a set of data with exactly this ACF, which model should be chosen?
To answer this question, we will apply the same causality principle explained in relation to
the AR model, i.e., for a stationary MA(q) process Xt to be meaningful, it should be possi-
ble to write the errors εt in terms of a weighted sum of past observations of Xt where the
weights do not explode. This property, analogous to causality, is called “invertibility.”
Thus, the only difference in the case of an MA is that the εt needs to be written in terms of
past Xt’s; for the AR model, it is the other way around.
The logic is analogous to causality, and therefore the same argumentation holds. A num-
ber of points are worth noting.
• The AR process is invertible by definition, because we can always express the error εt in terms of past observations of Xt, i.e.,

εt = Xt − ϕ1Xt−1 − … − ϕpXt−p
If we back substitute the MA(1), as in the case for the AR(1), we will also end up with an infinite summation, whereby the invertibility property is guaranteed by the condition |θ| < 1. We can do this by finding an expression for εt in terms of εt−1 and Xt; as a second step, we find an expression for εt−1 in terms of εt−2 and Xt−1. Replacing the second expression in the first one, we get εt = θ²εt−2 − θXt−1 + Xt. After N replacements, we obtain εt = (−θ)^N εt−N + Σ_{k=0}^{N−1} (−θ)^k Xt−k.
In the general case of an MA(q) or ARMA(p,q), the invertibility property will be given by the
roots of the MA polynomial Θ(z).
In terms of the property invertibility, an ARMA(p,q) model is invertible if, and only if, all the
zeros or roots of Θ(z) are larger than one in absolute value (Shumway & Stoffer, 2017).
The mathematical derivation is analogous to causality, the notation for which is given
above.
For example, to analyze the invertibility of the ARMA(1,2) model given by the equation
Xt = 3Xt−1 + εt − (5/2)εt−1 + εt−2
(1 − 3B)Xt = (1 − (5/2)B + B²)εt
Θ(z) = 1 − (5/2)z + z²
Code
import numpy as np
p = np.polynomial.Polynomial([1, -5/2, 1])
p.roots()
# console output
# array([0.5, 2. ])
One of the roots is less than one, so this ARMA(1,2) model is not invertible.
This discussion of invertibility started with a question about two MA(1) models that each
yield the same ACF. To decide which model should be considered, both models can be
examined in terms of the invertibility property. The MA polynomials for both models are
Θ1(z) = 1 + 5z
Θ2(z) = 1 + (1/5)z

The roots are −1/5 and −5, i.e., 0.2 and 5 in absolute value, respectively. Model (2) is chosen, because its root is larger than one in absolute value, and it thus satisfies the invertibility property.
By imposing causality and invertibility, we ensure that no two different models yield the
same ACF; therefore, these properties work as exclusion rules.
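These checks can also be delegated to statsmodels: the ArmaProcess class exposes the polynomial roots as well as stationarity and invertibility flags. A short sketch for the ARMA(1,2) example above (note the sign convention for the AR coefficients):

Code
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

# (1 - 3B) X_t = (1 - 5/2 B + B^2) eps_t
proc = ArmaProcess(ar=np.array([1, -3]), ma=np.array([1, -5/2, 1]))
print(proc.arroots, proc.maroots)            # roots of the AR and MA polynomials
print(proc.isstationary, proc.isinvertible)  # False, False for this model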
Let us start with an example. Consider the data on temperature (National Aeronautics and
Space Administration, 2021) analyzed at the beginning of this section. We adjusted two lin-
ear regression models, a straight line and a quadratic model, and chose the latter. Mathematically, this can be expressed as

xt = β0 + β1t + β2t² + Xt

where t is the time index, the βs are parameters, and Xt denotes the residuals. This last notation
deserves an explanation. Residuals are typically represented by ε, or η, or another lower-
case Greek letter. However, after detrending (i.e., removing the quadratic component from
the data) what remains are residuals which can be modeled by an ARMA, and until this
point, ARMA models have been represented with Xt.
The real relationship between the residuals Xt might be impossible to discover. For sim-
plicity, we evaluate its linear dependence through the ACF and PACF. The plots are depic-
ted below.
Figure 51: ACF and PACF of the Residuals: Quadratic Model
Recall that the light blue bands represent a 95 percent confidence interval of the zero cor-
relation. In other words, it can be said that every correlation large enough to protrude
from the band is significantly different from zero.
The ACF has four significant correlations, lags one to four, while the only significant corre-
lation for the PACF is the first lag. The key question now is, which ARMA model has an ACF
and PACF close to those depicted above? To answer this question, we need to know what
patterns are followed by the theoretical ARMA models, if any.
In this subsection, two key functions are presented, linking the theory of ARMA models
with practice. The following three important points should be taken into consideration:
• In general, finding explicit ACF and PACF formulae for an arbitrary AR, MA, or ARMA
model can be difficult, even for the simplest models. Our aim here will be more modest.
• The unit will discuss in some detail the ACF for the AR(1) and MA(1) cases, the formulae
of which are well known, and aim to develop insights about the general cases.
• The ARMA models generate known patterns of the ACF and PACF, and given a set of data
and its corresponding sample ACF and PACF, we look for the theoretical ARMA model
that best mimics those sample patterns.
AR(p): ACF and PACF
Consider the simulated AR(1) process with ϕ = 0.8, the ACF correlogram of which is depic-
ted in the figure below.
Notice the significant differences between the two plots. The ACF decreases exponentially
towards zero, while the PACF cuts off after the first lag. As discussed elsewhere, the PACF at lag 1, π(1), is equal to the ACF at lag 1, i.e., π(1) = ρ(1) (Shumway & Stoffer, 2017).
The formulae for the expectation, variance, autocovariance γ, and autocorrelation ρ of the
AR(1) causal model are
E[Xt] = 0;   Var(Xt) = σ²/(1 − ϕ²)
γ(h) = σ²ϕ^|h|/(1 − ϕ²);   ρ(h) = ϕ^|h|
A complete derivation makes use of limit theorems and will not be given here (for more
information see Chatfield [2004, pp. 41—42]). The h corresponds to a lag (integer number)
and σ2 is the variance of the white noise.
The mean of the process is assumed to be zero; if this is not the case, we can subtract the
mean to center it. The variance of the process Xt depends on the variance of the white
noise involved; if the model is correct, this level of variability cannot be reduced. Interest-
ingly, the variance also depends on the parameter ϕ. As ϕ tends asymptotically towards 1,
the variance increases arbitrarily. This should not be surprising, because when ϕ is exactly
one, the process turns out to be a random walk and the variance of a random walk increa-
ses proportionally with time. Finally, the autocorrelation ρ(h) results from dividing γ(h) by γ(0) = Var(Xt).
The shape of the ACF depicted above should be clear by looking at the expression for ρ(h). In the figure below, ρ(h) is plotted for three different values of ϕ.
This is a number less than one in absolute value raised to the power of an integer; it gets smaller and smaller as the lag h increases. Moreover, as ϕ gets closer to one, the autocorrelation needs more lags to get close to zero. We mentioned that the stationarity/causality condition of an AR(1) is |ϕ| < 1. Thus, negative values are also allowed. If −1 < ϕ < 0, then the function ρ(h) still tends asymptotically towards zero but with alternating signs between positive and negative (e.g., (−0.8)^h = 1, −0.8, 0.64, −0.512, … for h = 0, 1, 2, 3, …).
Figure 53: Theoretical ACF Plots for an AR(1) Model with Different Values of ϕ
In summary, the condition |ϕ| < 1 implies causality (and stationarity), and in this context, it
guarantees that the ACF tends asymptotically to zero as the values of the lag increase. The
practical implication is that the influence of past observations on recent observations
decreases, the further removed in time they are. The AR(1) case exemplifies the behavior
of the general AR(p) process. It can be shown that the ACF of a causal AR(p) tails off as the
lag increases (Shumway & Stoffer, 2017). Thus, in practice, an ACF that decreases towards
zero as the lags increase should make us consider the possibility of using an AR(p) model.
However, it is rare to see an ACF as perfect as that depicted above. Remember that it was
generated with simulated data and that the real world is far from perfect. Even though the
data were simulated, notice that for large lags, the correlations are still not exactly zero.
This is expected given the random nature of the data. In practice, one might encounter data for which the ACF at large lags is not exactly zero, even though there is no practical interpretation for it. Here, the confidence bands become relevant. In applications, the researcher must set a significance level and treat as zero all autocorrelations that lie within the confidence band.
The code below generated the simulated data following an AR(1) process with ϕ = 0.8:
Code
# import modules
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# set AR parameters
# for zero-lag, add 1
# for lag 1, e.g. AR(1) model, set phi to 0.8
# (given as negative parameter, see polynomial formula)
ar = np.array([1, -0.8])
ma = np.array([1])

# simulate a long AR(1) sample
arma_model = ArmaProcess(ar=ar, ma=ma)
simulated_ar = arma_model.generate_sample(nsample=10000)
# figure preparation
plt.rcParams.update({'font.size': 18})
fig, axes = plt.subplots(2, 1, figsize=(16,10), dpi=100)
plt.subplots_adjust(hspace=0.3)
# plot ACF
plot_acf(simulated_ar, alpha=0.05, ax=axes[0])
axes[0].set_xlabel('Lag')
axes[0].set_ylabel('Correlation')
axes[0].set_title(r'ACF AR(1) Model with $\phi=0.8$')
# plot PACF
plot_pacf(simulated_ar, alpha=0.05, ax=axes[1])
axes[1].set_xlabel('Lag')
axes[1].set_ylabel('Correlation')
axes[1].set_title(r'PACF AR(1) Model with $\phi=0.8$')
• The AR parameters are specified following the AR polynomial (i.e., the AR(1) process Xt = 0.8Xt−1 + εt has parameters 1 and −0.8, because its AR polynomial is Φ(z) = 1 − 0.8z).
• A large number of simulated points (N = 10000) is used to obtain a better estimate of the ACF/PACF, i.e., one closer to the theoretical values.
Let us move on to examine the PACF of the AR(1) model. The PACF π(h) at lag h corresponds to the autocorrelation at lag h, having removed the linear influence of the observations in between. By definition, π(1) = ρ(1) = ϕ. What is the value of π(2)? Removing linear interdependency means, in general, calculating a linear regression between an observation Xt and lagged values with different lags Xt−1, …, Xt−h+1. In the case of finding the direct effect of Xt−2 on Xt, we need to remove the linear interdependency through Xt−1, i.e., we regress both Xt and Xt−2 on Xt−1:

X̂t = αXt−1
X̂t−2 = βXt−1

The coefficient α minimizes the mean squared error

E[(Xt − αXt−1)²] = E[Xt²] − 2αE[XtXt−1] + α²E[Xt−1²] = γ(0) − 2αγ(1) + α²γ(0)

Setting the derivative with respect to α equal to zero gives

α = −2γ(1)/(−2γ(0)) = ρ(1) = ϕ

The argument for β is analogous, and therefore α = β = ϕ. The PACF at the second lag is the correlation between the two regression residuals,

π(2) = corr(Xt − ϕXt−1, Xt−2 − ϕXt−1)

But Xt − ϕXt−1 = εt, which is uncorrelated with anything observed before time t. Hence, π(2) = 0. With a bit more effort, it can be proved that π(h) = 0 for all h > 2.
In general, it can be shown that for every causal AR(p) model, π(h) = 0 for all h > p. (Fur-
ther elaboration is outside of the scope of this text. The interested reader is referred to
Shumway and Stoffer (2017, p. 108)). This fact will play an important role in determining
the correct order p of an AR model. The simulated data illustrate this clearly: all PACF values at lags greater than 1 are cut off, i.e., they are not statistically different from zero.
The findings about the ACF and PACF of an AR(p) process are summarized in the table below.

Model | Parameter to be specified | ACF pattern | PACF pattern
AR(p) | p | Tails off to zero | Cuts off after lag p
Let us continue with an examination of the ACF and PACF of an MA(q) process. Again, to
keep it simple, an MA(1) example is considered. An MA(1) model is given by
Xt = εt + θεt − 1
where εt is a zero-mean white noise with variance σ2. Like the AR(1) case, the expectation,
variance, autocovariance, and autocorrelation for the general MA(1) model are given by
E[Xt] = 0;   Var(Xt) = (1 + θ²)σ²

γ(h) = (1 + θ²)σ² for h = 0;   θσ² for h = ±1;   0 otherwise

ρ(h) = 1 for h = 0;   θ/(1 + θ²) for h = ±1;   0 otherwise
Depicted in the figure below are the ACF and PACF for a simulated MA(1) model with parameters σ² = 1 and θ = −0.8.
Figure 54: MA(1) Simulation: ACF and PACF
Code
from statsmodels.tsa.arima_process import ArmaProcess
ma = np.array([1,-0.8])
arma_model = ArmaProcess(ar=None, ma=ma)
simulated_ar = \
arma_model.generate_sample(nsample=10000)
The ACF of the MA(1) model cuts off after the first lag. Dismissing the sign, it is apparent
that this is the same behavior exhibited by the AR(1) PACF, which also cuts off after the first
lag. The PACF, in turn, tails off as the lag increases, which is also the same pattern
observed in the AR(1) ACF.
In general, the ACF of an MA(q) model cuts off for every lag ℎ > q, while the PACF of an
MA(q) tails off. In the latter case, the speed with which the PACF tends toward zero
depends on the MA parameters.
The ACF and PACF of the AR and MA models show very similar behavior, but they are not identical. Using parameters of the same magnitude as in the examples above (ϕ = 0.8 and θ = −0.8) yields correlograms with different signs. Additionally, the degrees of correlation in the ACF plot differ. In our example, the first-lag correlation for the AR(1) model is π(1) = 0.8, while for the MA(1) model it is

ρ(1) = −0.8/(1 + 0.8²) ≈ −0.488
In a way, the AR and MA models mirror each other. If we forget, for a moment, the distinc-
tion between noise and observation, the polynomial operator notation offers the expres-
sions
Φ(B)Xt = εt
Θ(B)εt = Xt
In theory, the parameters of the AR and MA polynomial can be tuned to yield almost indis-
tinguishable models. This conveys the following idea. What we observe, for example, on
the ACF of an AR model is similar to what we observe from the PACF of an MA model, pro-
vided equal orders (i.e., p = q). Rewriting the model of the example discussed above, we
obtain
εt = −θεt−1 + Xt

If we treat εt as the series of interest and Xt as noise, then the resulting process would be similar to an AR(1) model. The ACF of this modified model is ρ(h) = (−θ)^h. This is an exponentially decreasing function, which is exactly what we observe in the PACF plot of the MA(1)
above. Obviously, many more mathematical details should be considered before conclud-
ing this, but the general relationship between AR and MA models follows this idea. The
mathematical details involve, in general, difference equations, which are beyond the
scope of this text. The interested reader is directed to Shumway and Stoffer (2017, pp. 96—
101) and Brockwell and Davis (2016, pp. 88—99).
The behavior of the ACF and PACF of the MA(q) process are summarized in the table below.
Table 11: ACF and PACF Patterns of an AR(p) and MA(q) Process

Model | Parameter to be specified | ACF pattern | PACF pattern
AR(p) | p | Tails off to zero | Cuts off for lags > p
MA(q) | q | Cuts off for lags > q | Tails off to zero with an exponential pattern
Finally, we consider the ACF and PACF behavior of the general ARMA model. Consider the
simulated ARMA(1,1) model, the ACF and PACF of which are depicted in the figure below.
Figure 55: ARMA(1,1) Simulation: ACF and PACF
Code
from statsmodels.tsa.arima_process import ArmaProcess
ar = np.array([1,-0.4])
ma = np.array([1,0.8])
arma_model = ArmaProcess(ar=ar, ma=ma)
simulated_ar = \
arma_model.generate_sample(nsample=10000)
Xt = 0.4Xt−1 + εt + 0.8εt−1

A quick inspection of the AR and MA polynomials confirms this model is causal and invertible, and thus stationary (the roots are 2.5 and −1.25, respectively, both larger than one in absolute value). The ACF and PACF tail
off as the lag increases. Let us examine this concept more closely.
Recall that the general ARMA(p,q) model with p, q ≥ 1 is given by the equation
Φ(B)Xt = Θ(B)εt
Given that it includes both an AR component, given by the polynomial Φ(B), and an MA component, given by the polynomial Θ(B), what should the ACF and PACF look like? Since
the AR polynomial has the order p, and in that case the PACF cuts off after p lags, it might
be assumed that the ARMA(p,q) also cuts off for all lags greater than p, but that assump-
tion would be incorrect. The reason is that we also have an MA polynomial of order q,
and it is known that the PACF tails off in that case. When these two effects are considered,
the tailing off effect of the MA dominates. We might reverse the argument and think that
the ACF of an ARMA(p,q) should cut off for all lags greater than q, given the effect of the MA
polynomial of order q, but again, this would be incorrect. This is because the tailing off
effect of the AR supersedes the cutting off effect of the MA. In summary, in the case of an
ARMA(p,q) model, with p, q ≥ 1, both the ACF and the PACF tail off as the lag increases.
These findings are summarized in the table below (see Shumway & Stoffer (2017, p. 101)).
Table 12: ACF and PACF Behavior for the AR, MA, and ARMA Models

Model | Parameter to be specified | ACF pattern | PACF pattern
AR(p) | p | Tails off | Cuts off after lag p
MA(q) | q | Cuts off after lag q | Tails off
ARMA(p,q) | p and q | Tails off | Tails off
The previous subsection (ACF and PACF of ARMA models) described the process of choos-
ing the order of an AR, MA, or ARMA model for the given sample ACF/PACF of a data set.
This procedure is the first step of the Box-Jenkins formalism, a methodology used to choose the ARMA model that “best” fits a set of data. The four steps are as follows:

1. order selection based on the sample ACF and PACF
2. parameter estimation
3. model diagnostics
4. forecasting
In the subsection “ACF and PACF of Residuals: Quadratic Model,” we detrended the series
of temperatures and then calculated the ACF and PACF of the residuals that we called Xt.
Here, we continue with this example and whenever possible, we will make general state-
ments from this case study.
Step one: Order selection
The ACF and PACF of the residuals Xt are depicted in the figure below.
In this case, the PACF seems to cut off after the first lag, while the ACF decreases. This sug-
gests an underlying AR(1) process. The reader might question the rate of decay because it
does not look exponential, but again, we are approximating a model. Thus, the first candi-
date is an AR(1) model.
Notice that the first four lags of the ACF turn out to be statistically significant. Afterwards,
the ACF tails off. To complement the AR(1) case, an MA(4) process will be considered as a
second candidate.
The ARMA(1,4) model will be considered as a third candidate; however, because the PACF
cuts off (rather than tailing off), this option should be rejected.
Once the candidate models have been selected, including the orders, the next step is
parameter estimation. This is a very technical topic, and the interested reader is directed
to the literature for further elaboration (Shumway & Stoffer, 2017). Some of the most com-
mon techniques include
• the maximum likelihood method. The residuals of the candidate ARMA model are
assumed to have a given distribution, typically Gaussian. Then, the log-likelihood is
expressed as a product of conditional distribution functions to capture the autoregres-
sion. This expression depends on the parameters ϕ and θ. Their values can be found by
maximizing the log-likelihood.
• least squares. In the case of an AR model, this coincides with the ordinary least square
used in linear regression. For example, in the AR(1) model, the series of square residuals
is built in terms of Xt, Xt − 1, and ϕ.
Σ_{t=2}^{N} εt² = Σ_{t=2}^{N} (Xt − ϕXt−1)²

This summation is minimized in terms of ϕ (a closed-form sketch for the AR(1) case is given after this list). In the case of the MA model, one can follow
an iterative approach. For example, given the MA(1) model Xt = εt + θεt − 1, one can
start by assuming ε1 = 0 and estimating a value for θ, say 0.5. These values generate ε2.
With the same value for θ, the value for ε3 is updated and so on. Ultimately, the total
sum of squares residual is calculated and stored. The process is repeated many times
for different values of θ. The parameter associated with the lowest sum of the square
residuals is chosen.
• method of moments. The method of moments is known from basic courses on statis-
tics. It consists of equating sample moments with population moments. The case of an
MA model is relatively simple, but the AR model estimation is much more involved and
is beyond the scope of this course. For a discussion of this topic the interested reader is
directed to Shumway and Stoffer (2017, pp. 115—118).
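As announced in the least squares item above, the AR(1) case even admits a closed-form solution. A minimal sketch (the function name is illustrative; the centering step assumes the mean has not yet been removed):

Code
import numpy as np

# conditional least squares for an AR(1):
# minimizing sum_t (x_t - phi * x_{t-1})^2 gives
# phi_hat = sum(x_t * x_{t-1}) / sum(x_{t-1}^2)
def ar1_least_squares(x):
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                 # center the series
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)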
All of these methods can be implemented in the Python library statsmodels, along with
some more advanced techniques. The data on global temperatures can be accessed via
the NASA website (National Aeronautics and Space Administration, 2021).
Code
# import modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
# load data
df_temp = pd.read_csv('Global_Temperature.csv', \
                      sep=';', index_col='Year')

# 'data' is assumed to hold the residuals of the quadratic trend model
# prepare a figure with two panels for the ACF and PACF
fig, axes = plt.subplots(2, 1, figsize=(16, 10), dpi=100)

# ACF
plot_acf(data, alpha=0.05, ax=axes[0],\
title='Autocorrelation function (ACF)')
axes[0].set_xlabel('Lag')
axes[0].set_ylabel('Correlation')
# PACF
plot_pacf(data, alpha=0.05, ax=axes[1],\
title='Partial Autocorrelation function (PACF)')
axes[1].set_xlabel('Lag')
axes[1].set_ylabel('Correlation')
order = (1, 0, 0) # (p, d, q) - d will later be discussed
model = ARIMA(endog=X, exog=None, \
trend='n', order=order, \
enforce_stationarity=True, \
enforce_invertibility=True)
m_ar1 = model.fit()
The statsmodels library uses a state-space model as its default estimation method. However, the results are the same as those produced by the maximum likelihood method (not shown). Not all supported
methods are available for all models. For instance, the Yule-Walker method, a method of
moment estimation procedure, is only applicable to AR processes.
The results produced by this Python code will be discussed in steps three and four.
The third step is model diagnostics. Three numbers are important here, each of which is
identified through a different method—the Akaike Information Criterion (AIC), the Ljung-
Box test, and the Jarque-Bera test. These metrics can be investigated by calling the
summary() method of the trained models, as depicted below.
Code
# print model summary for AR(1)
print(m_ar1.summary())
Figure 57: AR(1) Estimation Results
Figure 59: ARMA(1,4) Estimation Results
The AIC helps to compare two or more competing models based on a formula that
depends on the likelihood and the number of parameters of the model. Adding more
parameters will certainly improve the goodness of fit, and the likelihood will improve;
however, it might damage the total variance and thereby lower the forecasting power of
the model. To choose a parsimonious model, we select the model with the lowest AIC,
which in this case is the AR(1). In general, we should use more than one information crite-
rion, for example, we could apply the Bayesian Information Criterion (BIC) or the Hannan-
Quinn Criterion. These have different levels of penalization for adding more parameters
but the goal is the same. Interestingly, the three criteria yield the same result, as seen in
the table below.
Akaike Information Criterion (AIC) | Bayesian Information Criterion (BIC) | Hannan-Quinn Information Criterion (HQIC)
Having chosen the AR(1) model, what can be said about its residuals? First, the normality
test, also known as Jarque-Bera, has a p-value of 0.25, which is not enough to reject the
null hypothesis of normality on the residuals. This test compares the combined effects of
skewness (third moment) and kurtosis (fourth moment) of the residuals with those of a
standard normal distribution. The skewness and kurtosis of a normal distribution are 0
and 3, respectively.
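The same test statistic can also be computed directly from the residuals; a brief sketch, assuming the fitted results object m_ar1 from the listing above:

Code
from statsmodels.stats.stattools import jarque_bera

# Jarque-Bera normality test on the AR(1) residuals
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(m_ar1.resid)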
Once an ARMA model has been fitted, one way to verify that all important autocorrelation
components have been captured is to estimate the ACF and PACF of the residuals of the
ARMA model. If the relevant autocorrelation lags have been correctly specified, the ACF
and PACF should have no significant remaining lag. The residual correlograms of the esti-
mated AR(1), MA(4), and ARMA(1,4) models are depicted below. The ACF and PACF plots for
all models can be created with the Python function for plotting the ACF and PACF (defined
above).
Code
# plot ACF & PACF residuals of the AR(1) model
plot_acf_pacf(m_ar1.resid)
Figure 60: ACF and PACF Plots of the Residuals of an AR(1) Model
Figure 61: ACF and PACF Plots of the Residuals of an MA(4) Model
Figure 62: ACF and PACF Plots of the Residuals of an ARMA(1,4) Model
Given the chosen level of 5 percent, it is expected under the assumption of normality that
1 in 20 lags will fall outside of the confidence band simply by chance. Thus, the small but
significant correlation in lag 4 should not be a major concern.
However, it might be the case that the ACF or PACF shows several lags with values just below the predefined significance threshold. Considering each lag in isolation, no remaining
correlation would be observed; nevertheless, the overall degree of correlation might still
be high. The Ljung-Box test (the interested reader is directed to Shumway and Stoffer
(2017, p. 141)) can be used to test such a possibility. In this case, the null hypothesis of no
correlation of the residuals cannot be rejected given the high value of the Ljung-Box p-
value, namely, 0.97. A similar pattern can be observed for the ACF and PACF plots for the
ARMA(1,4) model, but remember that this model might not be a good choice, considering
the ACF and PACF plots showing the residuals of the quadratic regression model. For the
MA(4) model, there are no remaining correlations in the residuals.
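The Ljung-Box statistic reported in the summary tables can also be computed directly from the residuals; a brief sketch (the choice of ten lags is arbitrary):

Code
from statsmodels.stats.diagnostic import acorr_ljungbox

# Ljung-Box test on the residuals of the fitted AR(1) model;
# a large p-value means the null of no remaining autocorrelation is not rejected
acorr_ljungbox(m_ar1.resid, lags=[10])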
Accurate forecasting means that the prediction, usually about the future, will be close to
the actual value. To measure how close, a distance notion is required. In statistics and
other fields, a common metric is the quadratic distance between predicted and observed
values. This is the same measure applied as a target function for linear regressions. In gen-
eral, we search for a function f Z , dependent on the predictors Z , such that its distance
to the data X (dependent variable) be minimal in terms of the mean square error, i.e.,
E[(X − f(Z))²]

It can be proven that the minimizer f of this expectation is the conditional expectation

f(Z) = E[X | Z]

The symbol “|” reads as “conditional on.” Thus, the last expression can be interpreted as the expected value of X, given, or conditional on, the information contained in Z. Which form should f take? In principle, any; however, in most applications, researchers work with a linear function of the predictors and parameters, given its ease of interpretation and estimation.

A similar process is followed in the case of the AR(p) model. The conditional expectation E[Xt | Xt−1, …, Xt−p] is precisely the AR polynomial, namely, ϕ1Xt−1 + … + ϕpXt−p.
For example, if we have modeled a time series of length N with an AR(2), the parameters
of which are ϕ1 and ϕ2, and we want to forecast two steps ahead, x̂N+1 and x̂N+2, then we can use

x̂N+1 = ϕ1xN + ϕ2xN−1
x̂N+2 = ϕ1x̂N+1 + ϕ2xN
Recall that, in this context, the hat “^” means estimation. We might forecast more terms
following this iterative procedure.
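A sketch of this recursion in Python, for given parameter estimates and the last two observations (all names are illustrative):

Code
# iterative AR(2) forecasts given the last two observations x_N and x_{N-1}
def ar2_forecast(x_N, x_Nm1, phi1, phi2, steps=2):
    history = [x_Nm1, x_N]
    for _ in range(steps):
        history.append(phi1 * history[-1] + phi2 * history[-2])
    return history[2:]               # the forecasted values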
The forecast for the MA and, therefore, ARMA models is technically more involved and is
beyond the scope of this book. The interested reader is directed to Shumway and Stoffer
(2017, p. 102—114). Put simply, the invertibility of the MA process is needed to bring the
process to an AR(∞) form, and the procedure described for the AR model should thus be
applied.
Recall from our example that we fitted the AR(1) process to the residuals of the OLS esti-
mated model of temperatures. Consequently, if we want to predict the temperatures, we
need to generate forecasts from the regression model and from the AR(1) process. To do
that for 20 future steps, we can use the following Python code:
Code
# define time steps to be forecasted into the future
steps=20
reg_component = results_reg.predict(exog=regressors_oos)
# observed data
plt.plot(df_temp.index, df_temp['Global_Temp'], \
color='k', label='Original data (Delta Temperature)')
# forecasted data
label = '$Temperature_{t}=a+b\cdot '
label += 't+c\cdot t^{2}+AR(1)-Process$'
plt.plot(all_index, all_forecast, '--', color='blue', \
label=label)
# add artists
plt.gca().set(xlabel='Year', ylabel='Temperature (C°)')
plt.legend()
plt.grid()
plt.show()
The results of this out-of-sample forecast with 20 steps are depicted below.
Figure 63: Temperature Forecasts Using a Combination of a Regression and an AR(1)
Model
Since the AR(1) model is stationary (ϕ = 0.56 < 1), the effect of the autocorrelation
decays for future forecast steps. With 20 steps ahead, the influence of the autoregression
component is negligible and the regression dominates.
It is worth noting at this point that, although it is not classically part of Box-Jenkins for-
malism, a rolling forecast origin is nowadays applied instead of out-of-sample forecasting.
This method iteratively predicts only the next value based on models that consider all pre-
viously available data. For example, to predict a value two time steps into the future
(xt + 2) using a rolling forecast origin, we would train a new model based on the observed
data and the predicted value one time step into the future (xt + 1). We would then use this
model to predict the value in question. This method is computationally far more expen-
sive, as the number of models to be trained is the same as the number of time steps to be
forecasted. In practice, however, this method usually shows higher predictive accuracy
than an out-of-sample approach. Furthermore, forecasting only a couple of time steps into
the future is usually of practical relevance.
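A minimal sketch of such a rolling one-step forecast, assuming the observations are held in a one-dimensional NumPy array and that an ARIMA specification has already been chosen (the function name and defaults are illustrative):

Code
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# refit the model on all data seen so far and predict only the next value
def rolling_one_step(series, order=(1, 0, 0), n_test=20):
    preds = []
    for i in range(len(series) - n_test, len(series)):
        fit = ARIMA(series[:i], order=order, trend='n').fit()
        preds.append(np.asarray(fit.forecast(steps=1))[0])
    return np.array(preds)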
In this section, we will remove the stationarity assumption by including a specific non-sta-
tionary case, namely, processes whose differences yield a stationary ARMA process. The
mathematical inverse operation of differencing is called integration. The idea is that we
have integrated data, which are non-stationary, but once they are differenced the result is
stationary data that can be modeled by the usual ARMA processes. The “I,” between the AR
and MA in ARIMA stands for “integrated.” Thus, the topic at hand is autoregressive integra-
ted moving average models (ARIMA).
This topic is considered in the context of an example using the gross domestic product
(GDP) of Spain between 1950 and 2013, adjusted by purchasing power parity (PPP) and
inflation. There are 64 observations in total (Gapminder, n.d.-b), and the data are depicted
in the figure below.
The sequence has a clear upward trend, with the exception of a brief period in the early
2000s. From the plot, it is clear that the Spanish GDP time series is not stationary. This is to
be expected as countries usually strive to improve their wealth levels. As the reader may
notice, the curve is rather smooth. Except for when a crash occurs, the incremental, year-
to-year changes are strongly linked. Economists typically express these increments in terms of percentage changes, such as GDPt = (1 + p)·GDPt−1. To control for heteroscedasticity and possible outliers, it is common to see economists use log-variables in their analysis. We will do the same.

Heteroscedasticity: The property of data whereby the variance varies in time is called heteroscedasticity.
Figure 65: ACF and PACF Plots of the Logged Data on the GDP of Spain (1950—2013)
The Spanish GDP increases over time and is therefore not stationary. Judging by the corre-
lograms, the evidence for non-stationarity is not very strong; nonetheless, the ACF and
PACF, which slowly decay as the lag increases, indicate a possible unit root process (ran-
dom walk). What is clear from the figures is that the ARMA is not appropriate to model the
data as it is. In case a more statistically sound confirmation is required, this hypothesis
could be tested by conducting an augmented Dickey-Fuller (ADF) test (Franke et al., 2011).
The Python code with which to visualize the series and to conduct an ADF test is depicted
below. Within the code, data preprocessing is handled in an external function called “pre-
pare_gdp_data” within the “gdp_data_preparation.py” file. The function yields the trans-
formed data log. To recover the raw data, the np.exp() function needs to be applied. You
will find this file in the code repository for this course.
Code
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from statsmodels.tsa.stattools import adfuller
from gdp_data_preparation import prepare_gdp_data
# load data
df_gdp = pd.read_csv('total_gdp_ppp_inflation_adjusted.csv', \
                     index_col='country')

# data preprocessing (returns the log-transformed GDP series for Spain)
gdp_country = prepare_gdp_data(df_gdp, "Spain")

# recover the raw GDP values for plotting
df_gdp_raw = np.exp(gdp_country)
plt.plot(df_gdp_raw)
plt.gca().set(xlabel='Time (years)', \
ylabel = 'GDP PPP Inflation adjusted (USD)')
plt.xticks(np.arange(min(gdp_country.index), \
max(gdp_country.index)+1, 10.0))
plt.grid()
plt.show()
Code
adf_stat, p_value = adfuller(gdp_country)[:2]   # ADF test on the log GDP series
# console output:
# ADF Statistic: 2.715548
# p-value: 0.999228
These numbers strongly indicate non-stationarity of the data. Therefore, the hypothesis of
a unit root cannot be rejected.
This section will discuss a modification of the ARMA models that will allow for the capture
of a particular type of non-stationarity, namely, trends (unit roots, linear trends, polyno-
mials, etc.). For instance, to remove a unit root, differencing is applied (Brockwell & Davis,
2016). Mathematically, this can be expressed as
∇Xt := Xt − Xt−1
It is sometimes necessary to consider higher order differencing. For instance, the second
order differencing is
∇²Xt = ∇(∇Xt) = ∇(Xt − Xt−1) = Xt − 2Xt−1 + Xt−2
Using the backshift operator B and the identity operator I, the second order differencing can be written as

∇²Xt = (I − B)²Xt = (I − 2B + B²)Xt = Xt − 2Xt−1 + Xt−2

and the general differencing of order d is

∇^d Xt = (I − B)^d Xt
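As a quick numerical check of the second-order difference (the values are arbitrary):

Code
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0, 25.0])   # arbitrary example values
np.diff(x, n=2)                              # second-order difference
# array([2., 2., 2.])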
The basic idea is to difference the sequence as many times as necessary to obtain station-
ary data that can be represented by an ARMA model. In most cases, the differencing order
d is 1 or 2. In the figure below, this process is depicted schematically.
The resulting ARIMA(p,d,q) model can then be written compactly as

Φ(B)(1 − B)^d Xt = Θ(B)εt
In this mathematical setting, the original series is transformed into its differenced version and an ARMA model is fitted to it. When forecasting, the differencing must be reversed. For example, in the case of a random walk X_t = X_{t−1} + ε_t, the difference is ∇X_t = ε_t. In this case, we model ε_t with an ARMA model, and a forecast can then be written as

X̂_{N+1} = X_N + ε̂^{ARMA}_{N+1}
Following up with the example of Spanish GDP, we can take the first difference and recal-
culate the ADF test on the differenced data.
Code
# difference the data and drop the resulting missing value
gdp_country_diff = gdp_country.diff().dropna()
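# assumed completion of the omitted lines: rerun the ADF test on the
# differenced series to produce the output below
adf_diff = adfuller(gdp_country_diff)
print('ADF Statistic: %f' % adf_diff[0])
print('p-value: %f' % adf_diff[1])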
Code
ADF Statistic: -2.609921
p-value: 0.008779
The p-value, 0.008779, is small and far less than the significance level of 0.05. Thus, there
is statistical evidence to reject the null hypothesis of a unit root. Consequently, differenc-
ing seems to be a suitable method to achieve stationarity and therefore meet the assump-
tion of an ARMA model. To find good orders for such a model, we can examine the ACF and PACF.
The following Python code can be used to generate the correlograms of the differenced
data.
Code
# import additional methods
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# prepare a figure with two panels (assumed setup for the plots below)
fig, axes = plt.subplots(2, 1, figsize=(10, 6))
# ACF
plot_acf(gdp_country_diff, alpha=0.05, ax=axes[0], \
title='ACF of differenced Log GDP of Spain')
axes[0].set_xlabel('Lag')
axes[0].set_ylabel('Correlation')
# PACF
plot_pacf(gdp_country_diff, alpha=0.05, ax=axes[1], \
title='PACF of differenced Log GDP of Spain')
axes[1].set_xlabel('Lag')
axes[1].set_ylabel('Correlation')
plt.show()
Figure 67: ACF Plot of the Differenced Data on the GDP of Spain (1950—2013)
By removing the effects of a trend through differencing, we can investigate the ACF and
PACF of the resulting data for significant autocorrelation, which might suggest the use of
an ARMA model. In this case, the ACF seems to decay without any clear cutoff point, while
the PACF cuts off after the first lag. These are indications for a suitable model with p = 1
and q = 0. You might want to refresh your memory of this by revisiting the summary
tables in section 4.1.
Considering the behavior of the ACF and PACF and the fact that we have differenced one
time (d = 1), we are able to suggest an ARIMA(1,1,0) model. As the differencing is integra-
ted into the model, the model will be fed with the original, undifferenced data. The Python
code is depicted below.
Code
# import additional method
from statsmodels.tsa.arima.model import ARIMA
# specify the order suggested by the correlograms and build the model
order = (1, 1, 0)  # (p, d, q)
model = ARIMA(gdp_country, \
trend='t', order=order)
m_ar1 = model.fit()
Figure 68: Summary Table for an ARIMA(1,1,0) Model for the Logarithm of Spain’s GDP
One particularly important statistic from this table is the Ljung-Box statistic. In our case, its p-value is rather large, 0.57, which indicates that the relevant correlation in the data has been captured by the model.
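If one wishes to compute the Ljung-Box statistic directly from the residuals rather than reading it from the summary table, a minimal sketch could look as follows; it assumes the fitted model m_ar1 from the code above, and the choice of ten lags is purely illustrative.

Code

# import the Ljung-Box test
from statsmodels.stats.diagnostic import acorr_ljungbox
# test the residuals of the fitted ARIMA(1,1,0) model for autocorrelation
# (lags=[10] is an illustrative choice)
lb_test = acorr_ljungbox(m_ar1.resid, lags=[10])
print(lb_test)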
A key question that arises when modeling with ARIMA is how to define the differencing order d. Without knowing the exact mathematical specification of the underlying process (for example, that it is a random walk), d is difficult to determine; however, the following heuristics can be considered:
• if the series exhibits slowly decaying correlograms, it might be non-stationary and therefore might need to be differenced to make it stationary. If the correlograms of the differenced series still decay slowly, higher order differencing should be considered.
• if the first lag shows no correlation, no further differencing is required. Over-differenc-
ing might introduce negative autocorrelation in the first lag. For example, a random
walk needs one lag difference, but if we difference twice, then the first lag autocorrela-
tion will be -0.5.
• the resulting process after over-differencing has a higher variance. A possible check is to compare the variance of the residuals after each difference. If the variance increases, then over-differencing might have occurred (see the sketch after this list).
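To illustrate the last point of the list above, the variances of successive differences can be compared directly. The following is a minimal sketch, assuming that gdp_country holds the logged GDP series prepared earlier.

Code

# compare the variance of the series after each differencing step
d1 = gdp_country.diff().dropna()
d2 = d1.diff().dropna()
d3 = d2.diff().dropna()
print('Variance after 1 difference:', d1.var())
print('Variance after 2 differences:', d2.var())
print('Variance after 3 differences:', d3.var())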
The following table summarizes the behavior of the ACF and PACF in relation to the vari-
ous ARIMA models.
Table 14: ACF and PACF Behavior for the ARIMA Models
The figure below depicts two ACFs. The upper ACF depicts the second difference of the log
GDP data of Spain, and the lower shows the third difference.
While in the second-difference plot there is a negative autocorrelation at the first lag, this
autocorrelation increases (in terms of the absolute value) when the third difference is
taken. This exemplifies the possible consequence of over-differencing.
Figure 69: ACF Plots for the Differenced Series of the Logarithm of Spain’s GDP
This section explains how to modify the ARIMA model to include seasonal components,
such as those described above. A SARIMA model is comprised of two models. The first part
models the non-seasonal component, and the second part describes the seasonal compo-
nent. Both parts are modeled by ARIMA models, and thus the order identification follows
the same rules based on the theoretical properties of the ACF and PACF.
The steps of the SARIMA model will be explored using an example, namely data on auto-
mobile sales in the United States (Bureau of Transportation Statistics, n.d.). These data
represent the monthly sales (in thousands) of passenger cars and station wagons between
January 2001 and April 2021, with a total of 244 observations.
Two plots are depicted in the figure below—the raw data on vehicle sales and their ACF.
The data show evidence of non-stationarity, although there are some periods during
which the behavior is constant. The data also show a seasonal pattern of 12 months,
whereby people seem to buy more vehicles in the summer months and fewer in winter
(this might not be the only seasonality). The range of the ACF has been amplified to
include as many as 36 lags in order to see the seasonal pattern more clearly. The ACF
exhibits a dual pattern. On the one hand, it seems to decrease slowly as the lag increases;
on the other hand, it has a periodicity. The former feature might be an indication of the
non-stationarity, while the latter might be due to the seasonality of the data.
To remove the trend, we take the first difference. The resulting ACF and PACF are depicted
in the figure below. The clarity and strength of the 12-month pattern is striking, and its
repetition (beginning at lag 24) is clearly visible. This shows that a relevant seasonal com-
ponent is missing, and that is what the SARIMA model aims to capture.
Figure 71: ACF and PACF Plots for the Differenced Data on Automobile Sales in the
United States (2001—2021)
At this point, a note of caution is appropriate. Until now, we have only described the
sequence in terms of the correlograms. This is sensible, because SARIMA models combine
two ARIMA models into one, a non-seasonal (ARIMA) model and an ARIMA model with a
seasonal component, and there is no a priori knowledge of which parameters are involved
in either of those two parts. However, after one or more candidate models are selected,
the estimation of both components is conducted for the raw data. The differencing, when
needed, is taken automatically as part of the estimation procedure.
If the data have no non-seasonal component, or it has been removed, the dependency on
multiples of s lags, s = 12 in our example, can be modeled by the pure SARIMA models,
which closely follow the structure of an ARIMA model. The word “pure” is sometimes
added to stress that the model has no non-seasonal component. In general, however, a
SARIMA model is a mix of ARIMAs with seasonal and non-seasonal components.
The notation to be used is as follows. In this case, we define a new backshift operator
(Shumway & Stoffer, 2017). This is no different from the ordinary backshift operator,
except that it shifts the series by a given lag s, which should match the seasonal pattern.
The operators
Φ_P(B^s) = 1 − Φ_1B^s − Φ_2B^{2s} − … − Φ_P B^{Ps}

and

Θ_Q(B^s) = 1 + Θ_1B^s + Θ_2B^{2s} + … + Θ_Q B^{Qs}
are called the seasonal autoregressive and the seasonal moving average operators of
orders P and Q, respectively, with the seasonal period s.
For example, a seasonal autoregressive operator of order 1 and seasonal period 12 would be Φ_1(B^{12}) = I − Φ_1B^{12}. Applied to the series X_t, we obtain

Φ_1(B^{12})X_t = ε_t

or, equivalently,

X_t = Φ_1X_{t−12} + ε_t
A model expressed uniquely through these two operators is called a pure seasonal ARMA
model, or SARMA model.
It may be that the seasonal dynamics (in the example, those occurring every 12 months)
exhibit a non-stationary pattern. As in the non-seasonal ARIMA case, such a pattern is
reflected in an ACF whose values at lags divisible by s decay very slowly. In that case, we difference the s-lagged sequence D times or, in terms of operators, we apply ∇_s^D = (I − B^s)^D to the sequence. The parameter D corresponds to the minimum number of times that differencing must be conducted to make the seasonal component stationary. The discussion on the determination of the order d in section 4.2 still holds for determining D, but at the seasonal level, i.e., for lags that are multiples of s.
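In pandas, applying the seasonal difference simply corresponds to differencing with a lag of s. The following minimal sketch, using a small illustrative series y, shows the operation for s = 12.

Code

import pandas as pd

# illustrative monthly series
y = pd.Series(range(1, 37))
# seasonal difference with period s = 12: (I - B^12) X_t = X_t - X_{t-12}
y_seasonal_diff = y.diff(12).dropna()
print(y_seasonal_diff.head())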
The schema depicted in the figure below shows how to model a purely seasonal process.
Figure 72: Pure SARIMA Schema
Given that we are modeling seasonal dynamics with ARIMA models, the rules that apply to
the ACF and PACF to determine the corresponding orders P and Q hold at multiples of s.
In this case, the behavior of the ACF and PACF for the seasonal AR, MA, and ARMA models
is the same as in a non-seasonal case. The ACF and PACF patterns of the pure SARIMA
models are depicted in the table below (Shumway & Stoffer, 2017).
Table 15: ACF and PACF Behavior for the Non-Seasonal and Pure Seasonal ARMA Models
(Table columns: model, parameter to be specified, ACF pattern, PACF pattern, differenced series)
In practice, when we deal with seasonal data, we need to consider mixed models, i.e.,
models with a non-seasonal and a seasonal component, not just pure models as in the
table above. Unfortunately, in those cases the correct identification of the orders p, d, q, P,
D, Q, and s is rather difficult. One strategy commonly used is to sequentially remove the observed dynamics: first, any non-stationarity is removed; then, possible seasonalities are analyzed with a pure SARIMA; and finally, any remaining non-seasonal correlation is captured with an ARMA model.
Thus far, the non-seasonal ARIMA models have been described and the main elements of
the pure SARIMA model presented. A definition of the full SARIMA model, which combines
both non-seasonal and seasonal models, follows.
According to Shumway and Stoffer (2017), the seasonal (or multiplicative) autoregressive
integrated moving average model (SARIMA model) is defined as
Φ_P(B^s)Φ(B)∇_s^D∇^d X_t = Θ_Q(B^s)Θ(B)ε_t
where ε_t is the usual white noise. Thus, a general SARIMA model will be written as SARIMA(p, d, q)×(P, D, Q)_s, which makes clear which two ARIMA components are involved, namely, a non-seasonal and a seasonal component. For the non-seasonal components, AR(p) and MA(q), the corresponding polynomials are Φ(B) and Θ(B), respectively. For the seasonal components, AR(P) and MA(Q), the corresponding polynomials are Φ_P(B^s) and Θ_Q(B^s), respectively. The differencing operators of the non-seasonal and seasonal components are given by ∇^d = (I − B)^d and ∇_s^D = (I − B^s)^D, respectively. In total, six orders (p, d, q, P, D, Q) and one seasonal parameter (s) need to be defined.
For example, a SARIMA(0,1,1)×(1,0,0)12 model can be written as

(I − Φ_1B^{12})(I − B)X_t = (I + θB)ε_t

Notice that, when the left-hand side is expanded, a backshift operator of order 13, B^{13}, appears from the consecutive application of the operators B^1 and B^{12}, analogous to the algebraic power properties.
ing with our example, notice how the PACF cuts off at s = 12 and the ACF decreases slowly
at multiples of 12. As a result, we choose a pure SARIMA(1,0,0) to incorporate the 12-
month seasonality over the differenced data into the model. Recall that we applied differ-
encing to remove the non-stationarity component, so that D is set to 0. Alternatively, we
could use the non-differenced data and set the parameter D to 1. The correlograms of the
residuals of the SARIMA(0,1,0)·(1,0,0)12 are depicted in the figure below.
Figure 73: ACF and PACF Plots of the Residuals of a SARIMA(0,1,0)×(1,0,0)12 Model for
Automobile Sales in the United States (2001—2021)
In these correlograms, only the first lag stands out, and its tailing off and cutting off components are indistinguishable. We can fit both candidate models and decide which model is preferable based on an information criterion. After removing the unit root component with a first, non-seasonal differencing operation (d = 1) and capturing the first-lag autocorrelation of the seasonal component, there is no discernible pattern in the resulting ACF and PACF, except for the first lag. This is evidence against additional differencing, such as D > 0 or d > 1, and against higher order models.
Figure 75: SARIMA(0,1,1)×(1,0,0)12 Model—Automobile Sales in the United States (2001
—2021)
The SARIMA(0,1,1)×(1,0,0)12 model exhibits lower information criteria for all three options, AIC, BIC, and HQIC. The variance of the residuals of this model is also lower than for the SARIMA(1,1,0)×(1,0,0)12 model. Accordingly, the SARIMA(0,1,1)×(1,0,0)12 model is chosen as
the better model.
The resulting ACF and PACF for the residuals of the final model are depicted in the figure
below.
Figure 76: ACF and PACF Plots of the Residuals of a SARIMA(0,1,1)×(1,0,0)12 Model for
Automobile Sales in the United States (2001—2021)
Some ACF and PACF values lie at the edge of the confidence interval; however, the Ljung-
Box test yields a p-value of 0.37, meaning the hypothesis of uncorrelated residuals has not
been rejected. This evidence shows that the main components of the vehicle sales data
have been captured. Further analysis of the residuals would then be conducted, including splitting the data into training and test sets to evaluate out-of-sample forecasting performance. For the sake of brevity, that analysis is not presented here, but it is mandatory in a real project.
Code
# import modules
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
# load data
df = pd.read_csv('Auto_sales.csv', sep=';', \
decimal=',', parse_dates=['Date'], \
index_col='Date')
df = df.dropna()
# helper function to plot the correlograms
# (assumed wrapper, since the function is called below)
def plot_acf_pacf(data):
    fig, axes = plt.subplots(2, 1, figsize=(10, 6))
    # ACF
    plot_acf(data, alpha=0.05, ax=axes[0], \
             title='Autocorrelation function (ACF)')
    axes[0].set_xlabel('Lag')
    axes[0].set_ylabel('Correlation')
    # PACF
    plot_pacf(data, alpha=0.05, ax=axes[1], \
              title='Partial Autocorrelation function (PACF)')
    axes[1].set_xlabel('Lag')
    axes[1].set_ylabel('Correlation')
    plt.show()
# first (non-seasonal) difference of the sales data
car_diff = df.Auto_sales.diff().dropna()
plot_acf_pacf(car_diff)
# SARIMA modeling
X = df.Auto_sales
order = (0, 1, 1) # (p, d, q)
seasonal_order = (1, 0, 0, 12) # (P, D, Q, s)
model=SARIMAX(endog=X, exog=None, \
trend='n', order=order, \
seasonal_order=seasonal_order, \
enforce_stationarity=True, \
enforce_invertibility=True)
m_sarima6=model.fit()
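The comparison between the two candidate models via information criteria can be sketched as follows; the name m_sarima7 for the competing fit is an assumption introduced here for illustration.

Code

# fit the competing SARIMA(1,1,0)x(1,0,0)12 candidate
model_alt = SARIMAX(endog=X, exog=None, \
trend='n', order=(1, 1, 0), \
seasonal_order=(1, 0, 0, 12), \
enforce_stationarity=True, \
enforce_invertibility=True)
m_sarima7 = model_alt.fit()
# compare the information criteria of both fitted models
print('SARIMA(0,1,1)x(1,0,0)12:', m_sarima6.aic, m_sarima6.bic, m_sarima6.hqic)
print('SARIMA(1,1,0)x(1,0,0)12:', m_sarima7.aic, m_sarima7.bic, m_sarima7.hqic)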
The SARIMAX model works as follows. Consider a dependent variable X_t and a set of independent variables Z_{1,t}, Z_{2,t}, …, Z_{n,t}, where the Z's are the set of predictors or exogenous variables. Then, a linear regression model has the following form:

X_t = β_1Z_{1,t} + β_2Z_{2,t} + … + β_nZ_{n,t} + ε_t
Once the parameters of this regression have been estimated, the weighted variables Z can
be subtracted from Xt, leaving the term εt alone. The residuals εt might still contain valua-
ble information about the process Xt that would enhance prediction. It has already been
noted that ARMA captures the remaining information contained in εt. In the present con-
text, if the residual term of this linear regression follows a SARIMA model, it can be said
that Xt follows a SARIMAX model. The “X” stands for exogenous. The central idea is to
combine exogenous variables into a linear regression, whereby the error term is described by the most general model within the family of ARMA models (i.e., SARIMA).
This model specification can be seen as a two-stage estimation; the first stage is the
regression in terms of the predictors Z , and the second is the SARIMA modeling. In gen-
eral, the problem with the two-stage estimation is the loss of efficiency of the resulting
estimators. An efficient estimator has a lower mean squared error. In simple words, a loss
of efficiency yields less accurate and/or less precise parameters.
The reason for the loss of efficiency lies in the assumptions and the use of information in
each stage. Recall that one of the basic assumptions in linear regression is that the residu-
als are correlated neither with themselves (autocorrelation) nor with the predictors. If the
errors εt of the regression can be described by a SARIMA model, then the zero-correlation
assumption of the linear regression is not fulfilled; therefore, some undesirable bias might
slip into the estimated parameters. In many practical situations this is considered negligi-
ble. However, dismissing these effects could yield less accurate fits, forecasts, and, ulti-
mately, conclusions.
The following analogy helps to explain the idea behind the two-stage estimation. Suppose
a company that sells computers has a finance department and a sales department. For
mysterious reasons, the people from finance do not talk to the people from sales, and vice
versa. So, whenever finance seeks to increase margins by eliminating discounts or increas-
ing prices, they simply inform the customers without asking the sales department. On the
other hand, whenever the sales department looks to increase sales volumes, they simply
offer discounts without asking finance. Naturally, the board of the company is not happy.
The problem is that if the sales department offers too many discounts, they sell a greater
quantity but with tiny profit margins. However, if the finance department increases prices
and does not allow discounts, the company will achieve higher profit margins but lower
sales. A compromise between these two positions must be found. The board realizes that
the key issue is information sharing. The board thus arranges for the employees of both
departments to have a party, exchange information, and thereby maximize the company’s
profit.
Fitting the parameters of the linear regression first makes them biased, given the presence
of correlation (it is assumed that the residuals are SARIMA), and therefore less efficient.
The estimation of the linear regression and SARIMA parameters must be conducted simul-
taneously. Just like with the sales and finance departments, both models must “share
information” during the parameter estimation. In other words, the estimation must be
done in only one stage. In this way, the model is optimized.
For example, consider the data on automobile sales in the US between 2001 and 2021
(Bureau of Transportation Statistics, n.d.). In that example, the existing trend was cap-
tured using differencing; however, in this case we want to model the trend using exogenous variables. For simplicity, we will use the time index 0, 1, 2, 3, …, 123 and its square, 0, 1, 4, 9, …, 123², as exogenous variables. In real-world applications, the researcher evaluates several variables believed to explain part of the variability.
Figure 77: Automobile Sales in the United States (2011—2021) and ACF Plot
Figure 78: Automobile Sales in the United States (2011—2021) and Linear Regression Fit
The exercise of fitting the regression curve without paying much attention to the residuals
has a descriptive goal only. If there is evidence that these variables (time index and its
square) have predictive power concerning the automobile sales, these are added in a later
SARIMAX estimation procedure to perform a one-stage estimation. Again, in a real-world
project, we would likely use additional data sets to improve our model. However, in this
simple example, we only consider the time index and its square as exogenous variables to learn the basic concept and technique.
The fit is visually good, although there are some features that deserve further comment. Firstly, starting in 2016, the amplitude of the seasonality seems to decrease. This coincides with the decreasing trend observed in automobile sales from 2015 onward. Secondly, at the end of the series, there seems to be an additional increment in sales, possibly related to a pent-up demand effect (the level of demand can build up due to various factors, for example, insufficient supply or customers' unwillingness to make purchases during a specific time period).

The residuals of this regression, displayed in the figure below, present a seasonal pattern similar to the difference-detrended case of the previous section. The ACF, in its 12-lag seasonal component, seems to decline to zero, but the levels of the PACF at the 12-lagged seasonal component are lower than in the ACF. To capture these dynamics, and to keep the model parsimonious, we fit a SARIMAX(0,0,0)×(1,0,0)12. We might fit a SARIMAX(0,0,0)×(1,0,1)12 as well, because the PACF also seems to tail off at multiples of 12. We keep the model parsimonious by fitting just a pure SARIMA(1,0,0)12 model.
Figure 79: ACF and PACF Plots of the Residuals of a Linear Regression Fit for Automobile
Sales in the United States (2001—2021)
The ACF and PACF of the residuals of the SARIMAX are shown in the figure below. The ACF
and PACF behavior suggest additional, non-seasonal components, either AR(1) or MA(1).
Having experimented with both (not shown), the AR(1) model has been selected, because
it has the lower information criteria.
Figure 80: ACF and PACF Plots of the Residuals of a SARIMAX(0,0,0)×(1,0,0)12 Model
Finally, the selected model corresponds to a SARIMAX(1,0,0)×(1,0,0)12 with the time index and its square as exogenous variables.
Figure 81: SARIMAX(1,0,0)×(1,0,0)12 Model with the Time Index as an Exogenous Variable—
Automobile Sales in the United States (2001—2021)
The Ljung-Box p-value (0.64) shows evidence that the correlation present in the original
series has been accounted for by this model. However, the kurtosis is high, which is an
indication of higher residuals later in the series. A closer look at these residuals reveals
that they are located close to the boundaries of the series, which might be evidence of
changes in the dynamics. This is clear for 2020, given the effects of the pandemic, and
therefore a new set of predictors might be needed for this time period. Despite these
model imperfections, this constitutes the most informed and best performing model we
have built for this data set.
For the sake of brevity, some of the steps of the Box-Jenkins formalism are not detailed here. Yet, in a real time series analysis, each step must be fully backed up.
The Python code to build the SARIMAX model described in this section is depicted below.
Code
# import modules and set seed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
import statsmodels.api as sm
np.random.seed(97)
df = pd.read_csv('Auto_sales.csv', sep=';', \
decimal=',', parse_dates=['Date'], \
index_col='Date')
df = df.iloc[120:] # consider data after 2011
df = df.dropna()
# prepare a figure with two panels (assumed setup)
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
# raw data
axes[0].plot(df.index, df.Auto_sales, \
color='k', label='Actual vehicle sales')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Sales')
axes[0].grid()
# ACF
plot_acf(df.Auto_sales, alpha=0.05, ax=axes[1], \
title='Autocorrelation function (ACF)', lags=36)
axes[1].set_xlabel('Lag')
axes[1].set_ylabel('Correlation')
axes[1].set_xticks(np.arange(0,37,6))
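# assumed completion of the omitted lines: display the first figure,
# construct the exogenous regressors (time index and its square),
# and fit the descriptive linear regression with OLS
plt.show()
t = np.arange(len(df))
exog = pd.DataFrame({'t': t, 't_squared': t**2}, index=df.index)
exog = sm.add_constant(exog)
X = df.Auto_sales
reg_fit = sm.OLS(X, exog).fit()
# prepare a new figure for the regression fit
plt.figure(figsize=(16, 5), dpi=100)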
# original data
plt.plot(X, color='k', label='U.S. Automobile Data')
# regression curve
plt.plot(df.index, reg_fit.fittedvalues, \
color='red', label='Linear Regression')
# add artists
plt.gca().set(xlabel='Year', ylabel='Sales')
plt.legend()
plt.grid()
plt.show()
# helper function to plot the correlograms with 36 lags
# (assumed wrapper, since the function is called below)
def plot_acf_pacf(data):
    fig, axes = plt.subplots(2, 1, figsize=(10, 6))
    # ACF
    plot_acf(data, alpha=0.05, ax=axes[0], lags=36, \
             title='Autocorrelation function (ACF)')
    axes[0].set_xlabel('Lag')
    axes[0].set_ylabel('Correlation')
    # PACF
    plot_pacf(data, alpha=0.05, ax=axes[1], lags=36, \
              title='Partial Autocorrelation function (PACF)')
    axes[1].set_xlabel('Lag')
    axes[1].set_ylabel('Correlation')
    axes[1].set_xticks(np.arange(0,37,6))
    plt.show()
plot_acf_pacf(reg_fit.resid)
# build the selected SARIMAX(1,0,0)x(1,0,0)12 model with the exogenous
# regressors constructed above (assumed reconstruction of the missing lines)
order = (1, 0, 0)  # (p, d, q)
seasonal_order = (1, 0, 0, 12)  # (P, D, Q, s)
model = SARIMAX(endog=X, exog=exog, \
trend='n', order=order, \
seasonal_order=seasonal_order, \
enforce_stationarity=True, \
enforce_invertibility=True)
m_sarimax = model.fit()
SUMMARY
Autoregressive moving average (ARMA) models refer to three models
under one term: the autoregressive (AR), the moving average (MA), and
the combined ARMA model. These models are linear and can be used to
represent stationary processes. In the context of ARMA models, the sta-
tionarity is intimately related to the concepts of causality and invertibil-
ity. Essentially, these concepts are conditions that ensure that the proc-
ess depends on information from the past and not from the future.
We also discussed the three model variations. If the differenced data fol-
low an ARMA model, then the original data follow an ARIMA model. The
difference order can be any positive integer. However, care is advised
when differencing with higher orders. Seasonal autocorrelation can be captured by SARIMA models. In some sense, these models are two ARIMA models in one. The "first" model is the ordinary
ARIMA, and the second models the data shifted by periods of seasonal-
ity. Finally, SARIMA models that include exogenous variables as predic-
tors are called SARIMAX.
Time series modeling can be conducted in Python using the library
statsmodels. Much of the content presented here has its correspond-
ing functionality available in this library, which makes the application of
ARMA models user friendly.
UNIT 5
HOLT-WINTERS MODELS
STUDY GOALS
Introduction
“Prediction is very difficult, especially if it is about the future.” This quote, often attributed
to the Danish physicist Niels Bohr, warns about the inherent uncertainty of forecasts. We
all make decisions about future events, and to do so optimally, one needs to make predic-
tions about the future. Our daily clothing is to some extent a function of weather condi-
tions, and therefore an accurate weather forecast can aid in making decisions about what
to wear.
The question that remains is exactly how to create a forecast. In the context of time series
analysis, one could feasibly begin by looking for any distinguishable patterns within a data
plot. For example, if the data tend to move around a straight line, then a forecast may be
obtained by extending such a trend beyond the available time span. Analogously, in the
presence of seasonality, the distance between peaks and when these occur could repre-
sent valuable information for a proper forecast.
However, how does one make predictions when the data do not exhibit discernible pat-
terns, such as trends or seasonality? Common sense suggests that, in the absence of any
obvious pattern, and assuming that the past information can provide information about
the future, the most recent data points contain the most relevant information for making
predictions. Therefore, it makes sense to build forecasting strategies that give recent infor-
mation more weight than older information.
This is the principle upon which the Holt-Winters methods are based, originally intro-
duced for inventory level forecasting. The “one step ahead” prediction is a weighted aver-
age of all past data points, with the weights decaying exponentially. Interestingly, there is
no need to calculate the average of all past data points for each prediction; most of these
calculations can be avoided by applying a set of forecast updating rules. Additionally, the
same principle can be applied to series with trends and/or seasonality, resulting in a tool
with a wide scope of applications.
Motivation of the Simple Exponential Smoothing Method
This unit applies the tools discussed to the total monthly automobile sales in the US from
January 2001 until April 2021 (244 observations) (Bureau of Transportation Statistics,
n.d.), as depicted in the figure below.
Figure 82: Total Monthly Automobile Sales in the United States (2001—2021)
The data above exhibit piecewise trends and annual seasonality (a piecewise trend represents different trends over different time intervals of the sequence). In the absence of trends and seasonality, a reasonable alternative assumption, in terms of forecasting, would be to use a weighted average of past observations. If one names this process X_t, the formula is

X̂_{N+1} = ω_0X_N + ω_1X_{N−1} + ω_2X_{N−2} + …
In the equation above, ω refers to the weights, and the hat “^” symbol indicates an esti-
mator. Therefore, the observations X_N, X_{N−1}, … are elements of the process, but X̂_{N+1} is a prediction for one time step into the future, using the actual data and their respective weights.
Weights cannot be just any number. Firstly, it seems reasonable to give the most recent
observations the most weight, under the assumption that these carry more information
about the near future, compared to those observations that are further removed in time.
Secondly, for technical reasons, these weights need to decay quickly enough to average
the whole past of the time series. In theory, the sequence could be infinite, and therefore
the weights for the oldest observations must be close to zero. Finally, to avoid artificially
amplifying the sequence by going beyond the bounds of the series’ historical range, a con-
straint must be imposed: The total sum of the weights must be equal to one.
A set of weights that satisfies these requirements is

ω_i = α(1 − α)^i,  i = 0, 1, …
where α is a number between zero and one. For example, if α = 0.5, then the sequence of
weights would be ω0 = 0.5, ω1 = 0.25, ω2 = 0.125, ω3 = 0.0625, and … . A sequence like
this is referred to as geometric, and its infinite sum can be easily proven to be equal to one
(Weisstein, n.d.). Geometric is another word for exponential, which explains the name of
the method. If α is chosen to be close to one, then the most recent observation has a
higher weight, decaying very quickly to zero; if α is chosen as close to 0 then the first
weights are relatively similar but still decay to zero. The number α, usually called the
smoothing level, is a relevant parameter that can be set up by the analyst or estimated
from the data.
Substituting these weights into the forecast formula yields

X̂_{N+1} = αX_N + α(1 − α)X_{N−1} + α(1 − α)²X_{N−2} + …
In real world applications there is no infinite time series, but this representation can still
be used by implementing the following trick. X̂_N can be written as

X̂_N = αX_{N−1} + α(1 − α)X_{N−2} + …

so that the forecast formula can be rewritten recursively as

X̂_{N+1} = αX_N + (1 − α)X̂_N

Thus, the prediction X̂_{N+1} is a weighted average of the prediction at N, i.e., X̂_N, and the actual value X_N. Defining the first prediction as X̂_1 = X_1, we can use the formula above as an updating rule for each forecasted term X̂.
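To make the updating rule concrete, the following minimal sketch implements it by hand; the function name ses_forecast and the example values are assumptions made for illustration only.

Code

def ses_forecast(x, alpha):
    """One-step-ahead SES forecasts via the updating rule."""
    x_hat = [x[0]]  # define the first prediction as the first observation
    for t in range(1, len(x)):
        # weighted average of the last observation and the last prediction
        x_hat.append(alpha * x[t - 1] + (1 - alpha) * x_hat[t - 1])
    return x_hat

# small illustrative example
series = [586.9, 600.0, 580.0, 610.0]
print(ses_forecast(series, alpha=0.9))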
Returning to the example, let us now suppose α_1 = 0.9, α_2 = 0.2, and X̂_1 = X_1 = 586.9. The
models can be built and fitted to the observed data, except for the last five time steps.
These final 20 observations are left for a later visual evaluation of the forecasts. The fol-
lowing Python code can be used to build and fit the models and to plot the results.
Code
# import modules
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing as ES
# load data
df = pd.read_csv('Auto_sales.csv', sep=';', \
decimal=',', parse_dates=['Date'], \
index_col='Date')
df = df.dropna()
# define initial level, X_hat_1 and
# smoothing parameter alpha
Xhat1 = df.Auto_sales[0]
alpha1 = 0.9
# observed data
plt.plot(df.index, df.Auto_sales, \
color='k', label='Automobile data')
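# assumed completion of the omitted lines: the second smoothing level,
# the two SES fits with fixed parameters, and their fitted curves
alpha2 = 0.2
ses1 = ES(df.Auto_sales, initialization_method='known', \
initial_level=Xhat1).fit(smoothing_level=alpha1, optimized=False)
ses2 = ES(df.Auto_sales, initialization_method='known', \
initial_level=Xhat1).fit(smoothing_level=alpha2, optimized=False)
plt.plot(df.index, ses1.fittedvalues, \
color='r', label='SES, alpha = 0.9')
plt.plot(df.index, ses2.fittedvalues, \
color='b', label='SES, alpha = 0.2')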
# set artists
plt.gca().set(xlabel='Date', \
ylabel='Number of automobiles (k)')
plt.grid()
plt.legend()
plt.show()
Figure 83: Automobile Sales in the United States (2001—2021) and Exponential
Smoothing Curves
• The higher the smoothing level α, the more accurately the resulting curve mimics the
actual sequence. However, given the use of lagged data, the resulting curve is delayed.
This effect can be clearly seen when comparing the black and red curves in the figure
above.
• With a lower smoothing level α, the resulting curve tends to follow the trend of the
process (slow changing movements).
With a slight change to the last equation, the method can be formulated as follows:
X̂_{N+1} = αX_N + (1 − α)X̂_N
        = α(X_N − X̂_N) + X̂_N
        = αε_N + X̂_N
This reveals a second method of interpreting the SES: The (N + 1)-prediction is equal to
the N -prediction, plus a weighted forecast error εN .
These formulae simplify the process of making out-of-sample predictions. Using the SES,
with α being equal to 0.9 and 0.2, a forecast for five periods ahead can be created for
which the following Python code can be used.
Code
# set the forecast horizon
h = 5
# last 20 observations
plt.plot(df.Auto_sales[-20:], \
color='k',label='Automobile data')
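# assumed completion of the omitted lines: five-step-ahead forecasts of
# both SES fits, plotted against the months following the sample
# (a month-start frequency is assumed for the forecast index)
fcast_index = pd.date_range(df.index[-1], periods=h + 1, freq='MS')[1:]
plt.plot(fcast_index, list(ses1.forecast(steps=h)), \
color='r', label='SES forecast, alpha = 0.9')
plt.plot(fcast_index, list(ses2.forecast(steps=h)), \
color='b', label='SES forecast, alpha = 0.2')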
# set artists
plt.gca().set(xlabel='Date', \
ylabel='Number of automobiles (k)')
plt.grid()
plt.legend()
plt.show()
The results are depicted in the figure below, restricted to the last 20 observations of the
data range.
Figure 84: Observed Automobile Sales and Forecasts Using the SES Method
The dotted line in the graphic above corresponds to the out-of-sample starting point,
namely, May 2021. All the predictions are equal to 349.11 in the case of α_1 = 0.9, and 310.53 for α_2 = 0.2. Since this method includes neither a trend nor a seasonality compo-
nent, the out-of-sample forecast corresponds to a level prediction, which can be corrobo-
rated by observing the last formula. For the subsequent steps, the one-step-ahead prediction takes the place of the actual observation, making the implied forecast error zero and resulting in the same forecast.
An intriguing link exists between the SES and ARIMA models (for an overview of ARIMA
models, see Shumway and Stoffer (2017, Chapter 3)).
An ARIMA(0,1,1) with the parameter 0 < θ < 1 is represented by the following equation:
X_t − X_{t−1} = ε_t + θε_{t−1}
After some algebraic manipulation (Shumway & Stoffer, 2017), this model can be written
as follows:
X_t = ∑_{j=1}^{∞} (1 − θ)θ^{j−1}X_{t−j} + ε_t

The corresponding one-step-ahead forecast is

X̂_{N+1} = ∑_{j=1}^{∞} (1 − θ)θ^{j−1}X_{N+1−j}
        = (1 − θ)X_N + θ ∑_{j=1}^{∞} (1 − θ)θ^{j−1}X_{N−j}
        = (1 − θ)X_N + θX̂_N
Making α = 1 − θ, we can obtain the first representation of the SES method. Beyond
being an interesting “coincidence,” this finding also provides the underlying process of the
SES method. In principle, the SES method should be used to forecast when there is evidence of the data following an ARIMA(0,1,1) process with the parameter 0 < θ < 1. However, Chatfield et al. (2001) suggest that the family of processes that can be optimally represented by an SES is larger than only the ARIMA(0,1,1) (an optimal representation refers to an underlying process that produces the same forecast as the SES method). This would explain the robustness of the SES method when applied to data sets with other behaviors.
Estimation
The SES method is completely determined by the parameter α and the initial level X1. The
smoothing level α can be determined similarly to the parameter of an MA(1) model (the
SES can be seen as an integrated moving average). The procedure is as follows: For a given
smoothing level and initial level, the whole forecast series is estimated, and the sum
squared error is calculated and then stored. The process is repeated for a grid of values α
and X1, and the one with the lowest total error is selected. This is the least squares esti-
mation method for the SES, and the following Python code can be used to estimate the
parameters.
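The grid-search idea described above can be sketched in a few lines. This is a simplified illustration only (a coarse grid over α, with the initial level fixed to the first observation), not the exact routine used by statsmodels.

Code

import numpy as np

# sum of squared one-step-ahead errors for a given smoothing level
def ses_sse(x, alpha):
    x_hat = x[0]  # initial level fixed to the first observation
    sse = 0.0
    for t in range(1, len(x)):
        sse += (x[t] - x_hat) ** 2                   # one-step-ahead error
        x_hat = alpha * x[t] + (1 - alpha) * x_hat   # level update
    return sse

# coarse grid search over the smoothing level
y = df.Auto_sales.values
grid = np.arange(0.05, 1.0, 0.05)
best_alpha = min(grid, key=lambda a: ses_sse(y, a))
print('Best smoothing level on the grid:', round(best_alpha, 2))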
Code
# build and fit the model using least squares
ses_ols = ES(df.Auto_sales).fit(method='ls')
As visible in the console output of the last Python command, the corresponding smooth-
ing and initial levels are α = 0.541 and X1 = 649.97, respectively.
The code below depicts the least squares fit of the SES method, in relation to the automo-
bile data.
Code
plt.figure(figsize=(16, 5), dpi=100)
# observed data
plt.plot(df.index, df.Auto_sales, color='k')
# fitted data
plt.plot(df.index, ses_ols.fittedvalues, color='r')
# artists
plt.gca().set(xlabel='Date', \
ylabel='Number of automobiles (k)')
plt.grid()
plt.show()
To briefly review level updates in the context of the SES method, the SES method formula
is
X̂_{N+1} = αX_N + (1 − α)X̂_N
In the absence of trends and seasonality, XN can be seen as a prediction of a “level.” The
formula for level is below, in which L is used as an aid for the discussion that follows.
Note: This is not a representation of the lag operator (also known as the backshift opera-
tor), which is also sometimes denoted with an uppercase L.
L_t = αX_t + (1 − α)L_{t−1}

In this formula, the updated level L_t, made for time t + 1 (or conditional on the information up to time t), is a weighted average of the actual value at t, X_t, and the previous prediction level L_{t−1} made at time t − 1. Given an initial level L_1 and a smoothing level α, all levels can be generated up to time t (beyond that point, all levels are equal). The figure below presents a schema showing how the SES method works. If α tends to 1, the level L_t will be closer to X_t. If α tends to 0, the updated level L_t comes closer to the previous level L_{t−1}.
The level update L_t always lies between X_t and L_{t−1}; as a result, it cannot correctly anticipate the observation X_{t+1} if the data have a trend, and the problem worsens when forecasts are based on forecast values multiple time steps into the future.
If a trend is assumed, it will be updated analogously to the level Lt, as we consider a new
observation Xt with the estimated trend denoted as T t. The logic is as follows: In the level
equation, the term Lt − 1 is not enough to capture increments or decreases driven by a
trend, so the first change to be made is to add that contribution. Denoting the previously
estimated trend as T t − 1, we obtain
L_t = αX_t + (1 − α)(L_{t−1} + T_{t−1})
The way the trend is updated is similar to updating the level. A trend is defined as a
roughly constant change in the vertical axis per time step, meaning that a weighted aver-
age between the difference of two consecutive levels (the update in the vertical direction)
and the trend value of the last step is calculated. Therefore, the update formula for the
trend is
T_t = γ(L_t − L_{t−1}) + (1 − γ)T_{t−1}
where the parameter ɣ, called the smoothing trend, lies between 0 and 1. In the SES
method, the forecast was the same as the level update, but now, the forecast needs to also
account for the trend. Therefore, the forecast formula for a horizon of ℎ time steps is
(Chatfield, 2004, p. 78)
X̂_{t+h} = L_t + hT_t,  h = 1, 2, 3, …
This set of two update equations is known as the double exponential smoothing (DES)
method, with both being based on the concept of exponentially decaying weights. These
equations are depicted in the following figure and are analogous to the level update for-
mula.
This new method can be applied to the automobile sales data (Bureau of Transportation
Statistics, n.d.). As the DES method updates both the level and trend, initial values for
both terms are needed. Within the Python library, statsmodels, the user is able to apply
predetermined rules for automatic selection of initial values or to define their own values.
The model specification needs four parameters: two smoothing parameters α and ɣ, and
two initial values L1 and T 1. The following Python code can be used to estimate these
parameters.
Code
# set the forecast horizon
h = 5
# build and fit the model
des_auto = ES(df.iloc[:-h, :].Auto_sales, \
trend='add').fit()
From the console output of the last Python command, one can see that the estimated
smoothing level and trend parameters are 0.636 and 0.110, respectively, while the initial
level and trend values are 585.47 and 7.35, respectively. Using these parameters, the fitted
values for the series can be extracted and plotted.
Code
# add fitted values to data frame
df['des_fit'] = des_auto.fittedvalues
# observed data
plt.plot(df.index, df.Auto_sales, \
color='k', label='Observed data')
# DES model
plt.plot(df.index, df.des_fit, \
color='r', label='Fitted values')
# set artists
plt.gca().set(xlabel='Date', \
ylabel='Number of automobiles (k)')
plt.legend()
plt.grid()
plt.show()
Figure 88: DES Method Estimation—Automobile Sales Data
Analogously to the SES method, sample forecasts with a horizon of ℎ = 5 can be created
and the results plotted.
Code
# forecast the values and save them
# to the data frame
df.des_fit.iloc[-h:] = des_auto.forecast(steps=h)
# last 20 observations
plt.plot(df.Auto_sales[-20:], \
color='k', label='Observed')
# DES forecasts
plt.plot(df.des_fit[-20:], \
color='r', label='Forecast')
# set artists
plt.gca().set(xlabel='Date', \
ylabel='Number of automobiles (k)')
plt.grid()
plt.legend()
plt.show()
Figure 89: Observed Automobile Sales and Forecasts Using the DES Method with h=5
The out-of-sample points draw a mild positive trend, which reflects the tendency of the
last in-sample months. The effect of trends further removed in time dies out, as a result of
the exponentially decaying weights underlying the DES method.
Analogously to the SES method, one might ask if there is an underlying model that could
provide a theoretical back up for the DES as a forecasting procedure. The answer is yes.
The DES method has been proven to yield the same forecast as the ARIMA(0,2,2); however,
the details of this topic go beyond the scope of this text. The interested reader is directed
to Abraham and Ledolter (1986).
In sales applications, it is common to see a trend break after several months of a positive
trend (no company sells ad infinitum). Independent of whether this is a plateau or a
change point in the trend toward a negative slope, the issue remains that this effect can-
not be captured by the DES method in its current form.
Damped Trend
Gardner and McKenzie (1985) proposed to damp the trend by adding a new parameter ϕ in
the forecast, level, and trend update equations. It is important to note that this parameter
is different from the parameter ϕ in AR models. The set of equations is as follows:
L_t = αX_t + (1 − α)(L_{t−1} + ϕT_{t−1})

T_t = γ(L_t − L_{t−1}) + (1 − γ)ϕT_{t−1}

X̂_{t+h} = L_t + (ϕ + ϕ² + … + ϕ^h)T_t
There are basically two changes. Firstly, T_{t−1} changes to ϕT_{t−1}, and secondly, in the forecast equation, the factor multiplying T_t changes from

h = 1 + 1 + … + 1  (h times)

to

ϕ + ϕ² + … + ϕ^h
The effect of 0 < ϕ < 1 is that T t becomes more damped over time (as we move along the
forecast horizon). In applications, this might be more realistic when it comes to informing
forecasts.
If the parameter ϕ = 0, there is no trend, and we can return to the SES method. If ϕ = 1,
then we get the DES method.
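The damping effect is easy to see numerically: the multiplier of T_t in the forecast converges to ϕ/(1 − ϕ) instead of growing linearly with the horizon h. The following small computation, using ϕ = 0.7 as in the example below, illustrates this.

Code

phi = 0.7
for h in (1, 5, 10, 50):
    # damped trend multiplier phi + phi^2 + ... + phi^h
    multiplier = sum(phi**i for i in range(1, h + 1))
    print(h, round(multiplier, 3))
# for comparison: the undamped DES multiplier would simply be h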
Gardner and McKenzie (1985) state that the damped DES method has the ARIMA(1,1,2) as
an equivalent forecasting process; although, in such a case, some parameters must be
constrained to make this coincide with the DES method. A value of ϕ > 1 is possible in the context of the DES method, but dangerous, given the exponential growth of the predictions (exponential growth refers to a trend being multiplied by the powers of a number bigger than one, meaning that with each step the trend is amplified more than in the previous step). In this case, there is no longer an ARIMA(1,1,2) representation.

We use the following code to build and fit a damped DES model and estimate the parameters using the least squares method. For the sake of this example, we used the damped DES method with a fixed value of ϕ (= 0.7), because the value estimated from the data was 0.99 and the difference between it and the undamped DES was not clearly visible.
Code
# build and fit the model
des_damp = ES(df.iloc[:-h, :].Auto_sales, \
trend='add', damped_trend=True).\
fit(damping_trend=0.7)
The console output provides the estimated parameters. The fitted values from this model
can be extracted and forecast for the last five values of the series not used for training the
model.
Code
# add fitted values to data frame
df['des_damp_fit'] = des_damp.fittedvalues
# last 20 observations
plt.plot(df.Auto_sales[-20:], \
color='k', label='Observed')
# DES forecasts
plt.plot(df.des_fit[-20:], \
color='r', label='DES')
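# assumed completion of the omitted lines: forecasts and fitted curve of
# the damped DES model for comparison with the undamped DES
df.des_damp_fit.iloc[-h:] = des_damp.forecast(steps=h)
plt.plot(df.des_damp_fit[-20:], \
color='b', label='Damped DES')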
# set artists
plt.gca().set(xlabel='Date', \
ylabel='Number of automobiles (k)')
plt.grid()
plt.legend()
plt.show()
Figure 90: Automobile Sales and Fitted Values Using the DES Method with and without
Damping
5.3 Dealing with Seasonality: Triple
Exponential Smoothing
The key concept of exponential smoothing methods is the calculation of a weighted aver-
age of the entire past of the sequence, with weights that decrease exponentially towards
zero. As we now know, the average of every forecast does not need to be calculated, but
the levels and trends in each step must be updated. This idea is further extended by intro-
ducing a third component: checking for seasonality in series with known periods.
Types of Seasonality
We can distinguish two basic types of seasonalities: additive and multiplicative. For this
example, a series Xt with level Lt is supposed with the seasonality St. Under the assump-
tion of additive seasonality, Xt can be written as
Xt = Lt + St + εt
where εt is a residual component. This type of formulation is used when the difference
produced by the seasonality is measured in absolute terms. Let us consider this concept in
terms of an example. A store reports a monthly sales average of 500 units, which increases
by an additional 150 units in December, due to holiday season purchasing.
Under the assumption of multiplicative seasonality, X_t is instead written as

X_t = L_tS_t + ε_t
This formulation is typically used when the differences produced by the seasonality of a
given level are measured in relative terms. Therefore, it can be said that the average
monthly sales level is 500 units on average, but the seasonal effect of holiday purchasing
generates an increment of 30 percent in December.
These two types of seasonality, additive and multiplicative, will imply two different sets of
formulae, but the logic remains the same. Let p be the period of the seasonality, St the
seasonal component, and δ the smoothing seasonality parameter. The set of update and
forecast formulae for the additive case (Gardner, 1985) are
L_t = α(X_t − S_{t−p}) + (1 − α)(L_{t−1} + T_{t−1})

T_t = γ(L_t − L_{t−1}) + (1 − γ)T_{t−1}

S_t = δ(X_t − L_t) + (1 − δ)S_{t−p}

X̂_{t+h} = L_t + hT_t + S_{t−p+h}
Given the additional terms, the formulae can be best understood by comparing them with those of the DES method. There are three basic differences: the level update uses the deseasonalized observation X_t − S_{t−p} instead of X_t, a third update equation for the seasonal component S_t is added, and the forecast is corrected by the seasonal term S_{t−p+h}.
This method, which involves adding a third smoothing parameter, the seasonality
smoothing δ, is called Triple Exponential Smoothing (TES).
This new method can now be applied to the automobile sales data (Bureau of Transporta-
tion Statistics, n.d.). The model can be fitted to the whole series, with the exception of the
last five observed data, and a 12-month, multiplicative seasonality can be assumed. Three
smoothing parameters must be estimated for this model: trend, level, and seasonality, all
three of which are estimated using the maximum likelihood. A complete period involves p
time steps, meaning it is necessary to specify p initial values for the seasonality, which are
selected automatically in statsmodels.
Code
# set the forecast horizon
h = 5
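# assumed completion of the omitted lines: build and fit the TES model
# with additive trend and a 12-month multiplicative seasonality, and
# store the fitted values for plotting
tes_auto = ES(df.iloc[:-h, :].Auto_sales, \
trend='add', seasonal='mul', \
seasonal_periods=12).fit()
df['tes_fit'] = tes_auto.fittedvalues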
This model summary provides several pieces of information. For example, the estimated
smoothing trend γ = 0.04 is considerably lower than the smoothing level α = 0.42 and
seasonality δ = 0.35, indicating that the trend is of little importance in this forecast.
Next, the series with the fitted values from the TES model can be plotted.
Figure 91: TES Method Estimation—Automobile Sales Data
The seasonality is approximately, but not precisely, 12 months. The maximum occurs in
the summer, but not always in the same month. The TES model is able to capture this pat-
tern by allocating more volume between May and September.
Finally, the five remaining observations can be forecasted using the TES model.
Code
# forecast the values and save them
# to the data frame
df.tes_fit.iloc[-h:] = tes_auto.forecast(steps=h)
# last 20 observations
plt.plot(df.Auto_sales[-20:], \
color='k', label='Observed')
# TES forecasts
plt.plot(df.tes_fit[-20:], \
color='r', label='Forecast')
# set artists
plt.gca().set(xlabel='Date', \
ylabel='Number of automobiles (k)')
plt.grid()
plt.legend()
plt.show()
Figure 92: Observed Automobile Sales and Forecasts Using the TES Method with h=12,
p=12, and Assuming Additive Seasonality
For the multiplicative case, the corresponding update and forecast formulae are

L_t = α(X_t/S_{t−p}) + (1 − α)(L_{t−1} + T_{t−1})

T_t = γ(L_t − L_{t−1}) + (1 − γ)T_{t−1}

S_t = δ(X_t/L_t) + (1 − δ)S_{t−p}

X̂_{t+h} = (L_t + hT_t)S_{t−p+h}
The logic is entirely analogous to the additive case. However, St corresponds to a ratio
between the current observation Xt and the baseline level Lt, making this a percentual
change. If these ratios are approximately one, there is no need to include a seasonality
component.
Unlike the SES and DES models, there is generally no optimal ARIMA model underlying the
TES method; however, if there were, its formulation would be so complex that identifica-
tion via Box-Jenkins formalism would be impossible (Gardner, 1985).
Analogous to the DES method, damping the trend is also possible in the context of the TES
method. The formulae are analogous to the DES case and will not be detailed here, but the
interested reader is directed to Gardner (2006, p. 640). Below are the results of the TES
method with trend damping, assuming ϕ = 0.7 (with additive trend and multiplicative
seasonality) and a 12-month period.
Code
# build and fit the model
tes_damp = ES(df.iloc[:-h, :].Auto_sales, \
trend='add', seasonal='mul', \
damped_trend=True, \
seasonal_periods=12).fit(damping_trend=0.7)
# last 20 observations
plt.plot(df.Auto_sales[-20:], \
color='k', label='Observed')
# TES forecasts
plt.plot(df.tes_fit[-20:], \
color='r', label='TES')
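# assumed completion of the omitted lines: fitted values and forecasts of
# the damped TES model, plotted for comparison
df['tes_damp_fit'] = tes_damp.fittedvalues
df.tes_damp_fit.iloc[-h:] = tes_damp.forecast(steps=h)
plt.plot(df.tes_damp_fit[-20:], \
color='b', label='Damped TES')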
# set artists
plt.gca().set(xlabel='Date', \
ylabel='Number of automobiles (k)')
plt.grid()
plt.legend()
plt.show()
Figure 93: Automobile Sales and Fitted Values Using the TES Method Assuming
Multiplicative Seasonality with and without Damping
Trends can also be set up in a multiplicative way. In this case, the implicit assumption was
an additive trend. A full discussion of this matter is beyond the scope of this unit, but the
interested reader is directed to Gardner (1985; 2006) for thorough summaries of exponen-
tial smoothing methods.
Model Selection
Several models have been examined and now it is necessary to decide which is the most
suitable. To do so, metrics able to be used for a performance comparison between models
need to be calculated. The implementation of these methods in statsmodels creates
estimations using the maximum likelihood, providing the correct input (the likelihood) to
calculate the information criteria AIC, BIC, and the AIC corrected for small sample bias.
Information criteria are useful when choosing between different methods, i.e., when these
methods have different numbers of parameters (otherwise, the comparison is only on the
likelihood level). Another classic tool is cross validation (CV). The sequence is split into a
training and a test set, following which the smoothing parameters are estimated using the
training set, and the prediction error is estimated using the test set, for example, by calcu-
lating the mean squared error (MSE). In this example, the following procedure is used:
• Fit each model (SES, DES, and TES) to the data X_1, …, X_t, compute the out-of-sample forecast X̂_{t+1}, and calculate the error ε_{t+1} = X̂_{t+1} − X_{t+1}.
• Repeat the first step for t = k, …, N − 1, where k = 44. As N = 244, 200 errors ε are generated.
• Calculate the MSE from the ε series.
This simple procedure can be implemented in Python using the following code:
Code
# import modules needed for the rolling-origin evaluation
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
# prepare data splitter
n_splits = 200
tscv = TimeSeriesSplit(n_splits=n_splits, test_size=1)
# helper used inside the evaluation loop:
# squared one-step-ahead forecast error of a single split
def squared_error(y_hat, y):
    # calculate MSE contribution
    mse = (y_hat - y)**2
    return mse
print("SES MSE:", np.sqrt(mse_ses.mean()), '\n',\
"DES MSE:", np.sqrt(mse_des.mean()), '\n',\
"TES MSE:", np.sqrt(mse_tes.mean()))
The estimated prediction errors for each method are given in the table below. As expec-
ted, the TES model performed the best.
Table 16: Cross Validation Prediction Error Estimation—SES, DES, and TES Models
(Table columns: Model | MSE)
SUMMARY
The Holt-Winters models, introduced in the late 1950s and early 1960s,
are essentially forecasting methods. Originally created for inventory
level forecasting, they were further developed following the intuitive
ideas that a time series, without trend and seasonality, can be forecas-
ted using the information contained in its own past, and that not all
observations contribute equally to the correct prediction of the next
time-step observation.
These ideas eventually took the following forms. The simplest method,
the Simple Exponential Smoothing (SES) method, consists of a weighted
average of the whole time series, with more weight assigned to the most
recent observations. The weights are positive, exponentially decreasing,
and add up to one. The method is computationally efficient, as one does
not need to compute the weighted average of the whole series again to
compute the next step prediction. The original (infinite) weighted aver-
age can be written as a set of two update and forecast formulae, and a
prediction can then be calculated with only four operations.
However, the SES method only predicts reasonably well for series with-
out trend or seasonality. These limitations are avoided with the intro-
duction of the Double Exponential Smoothing (DES) and the Triple Expo-
nential Smoothing (TES) methods. In DES, the update and forecast
formulae of the SES method are complemented with an additional
parameter, trend smoothing. These two smoothing parameters are the
reason for the method’s name (one for the level, the other for the trend).
Finally, the TES method further extends the same idea, using a third
smoothing parameter to capture the effects of seasonality.
UNIT 6
ADVANCED TOPICS
STUDY GOALS
Introduction
It can be said that 99.99 percent of all scientific production has been based on previous
discoveries or developments. Sir Isaac Newton once wrote, “If I have seen further, it is by
standing on the shoulders of Giants” (Chen, 2003, p. 135). If this is true for him, it is most
likely true for everyone else. The point is simple: the number of groundbreaking ideas is
low in comparison with everything else built upon them.
Some fundamental concepts exist in the field of time series analysis, for example, ARMA
models and Holt-Winters methods, which date back to the 1960s. These methods were
groundbreaking at the time of their introduction, and they underlie newer methods. This
unit will discuss some modern (developed within the last fifteen years) models and meth-
ods in time series analysis. Interestingly, most of the tools are based on ARMA models
and/or Holt-Winters methods, showing how crucial they are, even today.
This unit will first discuss the ensemble models. An ensemble model is a combination of
different models, whereby each component captures specific dynamics of the data. The
result is a model with much more flexibility. Specifically, this unit will examine the BATS
and TBATS models described by De Livera et al. (2011), which are ensembles of ARMA
models, Holt-Winters models, and harmonic regression. Subsequently, this unit will briefly
discuss a popular ensemble known as Facebook Prophet. This method is also a combina-
tion of different techniques, including the generalized additive models and Holt-Winters
models.
Finally, this unit will examine how a time series forecast problem can be rewritten as a
supervised learning problem. This will prove to be key in order to take advantage of the
great variety of techniques available in machine learning.
A variety of methods exist with which seasonality can be modeled, for example, harmonic
regression, Holt-Winters methods, and SARIMA models, just to name a few. These are well
established methods with several extensions.
BATS and TBATS Models
In BATS and TBATS models, a range of different models are combined into a single frame-
work to solve the problem of multiple seasonality. These models include Exponential
Smoothing, ARMA models, Box-Cox transformation and, depending on the version, Fourier
terms. Moreover, the estimation can be fully automated, providing better usability for
those practitioners from fields outside of statistics and data science.
The following example illustrates the type of data we want to model. The data comprise a
sample data set on energy consumption from American Electric Power (AEP), recorded between
October 2004 and August 2018 (Mulla, n.d.). The data set consists of 121,273 hourly
observations, so subsamples will be considered for the sake of simplicity. Displayed in the
figure below are the observations from January 1, 2011 to June 30, 2011 (upper plot) and
the observations from the first three weeks of February 2011 (lower plot). At different
levels, different cycles can be distinguished: there is a seasonal pattern of 24 hours and
also a weekly pattern, which exhibits a clear reduction in usage on the weekends. The annual
data have not been considered, but the first six months of the year appear to show a decline
in consumption, most likely related to the advent of summer.
Figure 94: American Electric Power's Hourly Energy Consumption in Megawatts
The method proposed by De Livera et al. (2011) conducts forecasts based on a modified
Holt-Winters method (exponential smoothing). The original Holt-Winters method accounts
for only one seasonal component, i.e., only one cycle can be represented by the triple
exponential smoothing method. For this reason, Holt-Winters methods cannot capture all
of the dynamics of the series in the figure above. As will be seen, this drawback is
addressed in the new framework. In addition, the type of seasonality that can be captured
using Holt-Winters methods must have an integer period. In order to correct for the leap
year effect, for example, a period of 365.25 days would need to be considered, but a
non-integer period is not possible in the Holt-Winters case. However, in the De Livera et
al. (2011) proposal (TBATS), non-integer periods are indeed possible, because trigonometric
terms are used to represent seasonality.
To be more precise, the framework proposed by De Livera et al. (2011) is comprised of two
methods, the BATS and TBATS methods. These are acronyms that stand for the techniques
involved. BATS stands for Box-Cox transformation, ARMA models, trend, and seasonal
components (Holt-Winters). The “T” in TBATS stands for trigonometric terms.
Box-Cox transformation
This transformation method aims to achieve normal distribution and homoscedasticity of the
data, as this is an assumption applied in many statistical methods and techniques. The
constant variance property is called homoscedasticity. Supposing a set of data X1, X2, …, Xn,
the Box-Cox transformation applies the logarithm to each data point if the parameter λ is
set to 0. If λ takes a value other than 0, the transformation subtracts 1 from the data
raised to the power of λ and divides the result by λ. Mathematically, this can be expressed as
$$
X_t^{(\lambda)} =
\begin{cases}
\dfrac{X_t^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[2mm]
\log X_t, & \lambda = 0
\end{cases}
$$
This expression only has a meaning for positive data. For instance, if λ = 0.5 and Xt is
negative, the Box-Cox transformation cannot be calculated. For more details, the interested
reader is directed to Box and Cox (1964).
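As an illustration, the transformation is available in scipy, which can also estimate λ by maximum likelihood when it is not fixed. The following minimal sketch uses synthetic, positive-valued data for demonstration only.
Code
# minimal Box-Cox sketch with scipy (synthetic, positive-valued data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=0.5, size=500)

x_bc, lam = stats.boxcox(x)          # lambda estimated by maximum likelihood
x_log = stats.boxcox(x, lmbda=0.0)   # fixing lambda = 0 gives the log transform
print(round(lam, 3))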
ARMA models
ARMA models are a family of models used to represent autocorrelated, stationary time series.
These models are composed of an autoregressive component (AR), i.e., a term dependent on the
last p observations, Xt−1, Xt−2, …, Xt−p; a moving average component (MA) dependent on the
last q error terms, εt−1, εt−2, …, εt−q; and an error term εt. The orders p and q are to be
determined. For more details, see Shumway and Stoffer (2017).
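As a quick reminder of how such a process looks in practice, the following sketch simulates an ARMA(1, 1) series and re-estimates its coefficients with statsmodels; the orders and coefficient values are arbitrary choices for illustration.
Code
# simulate and re-estimate an ARMA(1,1) process (illustrative values)
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA

ar = np.array([1, -0.6])   # AR polynomial 1 - 0.6B
ma = np.array([1, 0.4])    # MA polynomial 1 + 0.4B
x = ArmaProcess(ar, ma).generate_sample(nsample=500)

fit = ARIMA(x, order=(1, 0, 1)).fit()   # p = 1, d = 0, q = 1
print(fit.params)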
Trend and seasonal (Holt-Winters) components
The Holt-Winters methods are time series forecasting methods based on the weighted
averages of the entire past of the series. The out-of-sample predictions are built upon the
forecasts of a level, which are then corrected by trend and seasonality predictions. To
avoid the trend growing indefinitely, a damping parameter, often called ϕ (not to be con-
fused with the parameter ϕ in ARMA models), can be applied (Gardner, 1985; 2006).
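For reference, a damped additive trend can be requested directly in the Holt-Winters implementation of statsmodels; the series below is synthetic and serves only to show where the damping parameter appears.
Code
# Holt-Winters (TES) with an additive damped trend on a synthetic monthly series
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

idx = pd.date_range('2015-01-31', periods=72, freq='M')
rng = np.random.default_rng(0)
y = pd.Series(100 + 0.5 * np.arange(72)
              + 10 * np.sin(2 * np.pi * np.arange(72) / 12)
              + rng.normal(0, 1, 72), index=idx)

fit = ExponentialSmoothing(
    y, trend='add', damped_trend=True, seasonal='add', seasonal_periods=12
).fit()
print(fit.params.get('damping_trend'))   # the estimated damping parameter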
Trigonometric terms
In the context of TBATS, "trigonometric terms" refer to the harmonics needed to represent a
given seasonality. A harmonic is a wave with a given frequency and is typically represented
by a combination of sine and cosine functions. While the BATS method is able to capture
seasonality, it uses many parameters to do so and is only able to model integer periods. As
will be seen, the TBATS technique removes this limitation by introducing sine and cosine
terms (harmonics) to represent seasonality. Here, the use of harmonics is not exactly the
same as in harmonic regression, but it is still very much related. The interested reader is
directed to Brockwell and Davis (2016) for a review of harmonic regression.
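To make the idea of harmonics concrete, the sketch below builds k sine/cosine regressors for a seasonal period m; each harmonic contributes two columns, so 2k parameters can replace the many parameters of a full seasonal component, and m does not need to be an integer.
Code
# construct k harmonics (Fourier terms) for a seasonal period m
import numpy as np

def fourier_terms(n_obs, m, k):
    """Return an (n_obs, 2k) matrix of sine/cosine regressors for period m."""
    t = np.arange(n_obs)
    cols = []
    for j in range(1, k + 1):
        cols.append(np.sin(2 * np.pi * j * t / m))
        cols.append(np.cos(2 * np.pi * j * t / m))
    return np.column_stack(cols)

X_seasonal = fourier_terms(n_obs=500, m=365.25, k=3)   # non-integer period allowed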
A BATS model combines the components briefly described above into a composite model.
This model is able to capture the dynamics of a time series by combining the advantages
of several modeling perspectives. In the following equations describing the BATS model-
ing approach, the reader might notice elements of the ARMA models and the Holt-Winters
methods. Given a time series Xt , the BATS method formulation is given by the set of
equations depicted in the figure below. To give hints, colors have been used in the follow-
ing ways.
• Green: The original series Xt. This series is Box-Cox transformed using a parameter λ to
be determined in the estimation procedure.
• Yellow: These terms correspond to a modified version of the Holt-Winters Triple Exponential
Smoothing (TES), including the level Lt, the trend Tt, and the seasonal components St(i).
This version distinguishes between a short-run and a long-run trend τ, and the possibility
of including T > 1 seasonality terms with periods mi.
• Purple: These terms correspond to the error component of the TES yt. Some authors
have reported that after fitting TES models, the residuals exhibit autocorrelation. This
problem has been addressed in BATS/TBATS by modeling these errors with ARMA mod-
els. As we may notice, the last equation is the formulation of the ARMA model.
Figure 95: BATS Method
where Xt(λ) represents the Box-Cox transformed data with parameter λ, and St(i) represents
the ith seasonal component at time t.
As with each of the individual models and methods, different parameters in BATS models
must be estimated. Analogously to ARMA models, a BATS model and its parameters can be
denoted as
BATS(λ, ϕ, p, q, m1, …, mT)
For example, the additive, single-seasonal Holt-Winters method without damping corresponds
to a BATS(1, 1, 0, 0, m1) model.
We will now return to the energy consumption example (Mulla, n.d.). For the sake of this
example, the period of estimation (training) has been restricted to February 1, 2011
through April 30, 2011. The 20 days (20 · 24 hours) that follow are considered an
out-of-sample forecast (testing).
Code
# import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tbats import TBATS, BATS
import pickle

## BATS
# define the BATS estimator with a 24-hour and an approximately weekly
# (169-hour) seasonal period, as described in the text
bats_estimator = BATS(
    seasonal_periods=[24, 169],
    use_arma_errors=True,       # consider ARMA errors
    use_box_cox=True,           # consider Box-Cox Trans.
    use_damped_trend=False,     # no damping
    use_trend=False,            # no trend
    n_jobs=1                    # number of parallel jobs
)
# fit on the training data (X_train: hourly values from Feb 1 to Apr 30, 2011)
bats_model = bats_estimator.fit(X_train)
Due to the sheer number of parameters to be estimated, this process may take some time
(about 20 minutes as one job on an Intel i9-10900K with 64 GB RAM). For this reason, the
fitted model is saved to a pickle file, meaning that it does not have to be retrained in
order to be used in another session. The model results below demonstrate the best fit for
all the required parameters in accordance with the defined model specification.
Code
Use Box-Cox: True
Use trend: False
Use damped trend: False
Seasonal periods: [ 24 169]
ARMA errors (p, q): (0, 0)
Box-Cox Lambda 0.548482
Smoothing (Alpha): 1.319544
Seasonal Parameters (Gamma): [ 0.14216858 -0.00888503]
AR coefficients []
MA coefficients []
Seed vector [ 3.61735431e+02 -3.42428362e+00 …
…
1.28933009e+01 1.28062408e+01]
AIC 39392.094312
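The save-and-reload step mentioned above is not shown in the original listing; a minimal sketch (the file name is chosen here only for illustration) could look as follows.
Code
# persist the fitted model so it can be reused in another session
with open('bats_model.pkl', 'wb') as f:
    pickle.dump(bats_model, f)

# ...and reload it later without refitting
with open('bats_model.pkl', 'rb') as f:
    bats_model = pickle.load(f)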
Once the model has been fitted, the fitted values can be extracted and the model can be
used for forecasting.
Code
# extract fitted values and add to Data Frame
bats_fit = pd.DataFrame(bats_model.y_hat, \
    columns=['bats_fit'], index=df.index[:len(X_train)])
df['bats_fit'] = bats_fit

# out-of-sample forecast for the test period and a ten-day zoom window
# (these two steps are assumed here; they are not shown in the original listing)
df['bats_forecast'] = np.nan
df.loc[df.index[len(X_train):], 'bats_forecast'] = \
    bats_model.forecast(steps=len(df) - len(X_train))
df_zoom = df.loc['2011-04-25':'2011-05-05']

# figure with two panels (assumed setup)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

# observed data
ax1.plot(df.index, df.energy, \
    color='k', label='Actual')
# fitted values
ax1.plot(df.index, df.bats_fit, \
    color='b', label='Fitted')
# forecasted values
ax1.plot(df.index, df.bats_forecast, \
    color='r', label='Forecast')
# set artists
ax1.set_xlabel('Date')
ax1.set_ylabel('Consumption (MW)')
ax1.grid()
ax1.legend()

# observed data
ax2.plot(df_zoom.index, df_zoom.energy, \
    color='k', label='Actual')
# fitted values
ax2.plot(df_zoom.index, df_zoom.bats_fit, \
    color='b', label='Fitted')
# forecasted values
ax2.plot(df_zoom.index, df_zoom.bats_forecast, \
    color='r', label='Forecast')
# set artists
ax2.set_xlabel('Date')
ax2.set_ylabel('Consumption (MW)')
ax2.grid()
plt.show()
The upper plot corresponds to the fit and prediction for the whole period, while the lower
plot displays a ten-day sample of the upper plot. Two periods of 24 and 169 hours (almost
7 days) have been chosen. Since 168 is a multiple of 24, the weekly period has been set to
169 rather than 168 to force the method to calculate a whole different set of initial values
for the seasonality and thus improve the fit. Choosing exactly 168 would cause the model to
assume that the 24-hour seasonality is part of the 168-hour seasonality, reducing the number
of parameters involved. The trend estimation was excluded to speed up calculations. A
summary of the results is depicted above. The initial values (or seeds) for the seasonality
and level are not shown, because they are too numerous (194).
Unfortunately, the BATS method has some drawbacks. Besides not being able to handle
non-integer period seasonality, it needs a large number of parameters related to the initial
values. Given that this framework involves the automatic estimation of many models, the
number of calculations increases considerably if more parameters are added. This intro-
duces the danger of overfitting and makes the estimation a computationally expensive
procedure. We may notice this when we reproduce the code example, even though it
allows for parallel processing.
Introducing trigonometry—TBATS
Given the high number of initial values needed to estimate the seasonal component, De
Livera et al. (2011) introduced TBATS as a parsimonious way to represent seasonality. The
idea is as follows: instead of modeling each seasonality with as many parameters as there
are time steps in a period, the seasonal components are represented using harmonics (also
called Fourier terms). Each harmonic is described by only two parameters, so most of the
seasonal pattern can be captured with far fewer terms. The interested reader is directed to
De Livera et al. (2011) for more details about this representation. Analogously to BATS, a
TBATS model and its parameters can be denoted as
TBATS(λ, ϕ, p, q, {m1, k1}, …, {mT, kT})
where the pairs (mi, ki) correspond to the period of the ith seasonality and the number of
harmonics needed to represent it, respectively.
Because the seasonality is modeled by harmonics, the limitation of having integer periods
is removed. Recall that a frequency is defined as 2π/mi and this ratio can be any number,
so mi can be, for instance, 365.25 days to represent a more exact annual period.
To determine the number of harmonics, the model starts with only one harmonic and
progressively adds more, testing the significance at each step with an F-test (p < 0.001).
The F-test is a statistical test used to compare different models based on the proportion
of variability that each one explains. Regarding the ARMA selection, the algorithm starts
with the assumption that no ARMA model is needed. In a second step, an automated algorithm
of model selection for ARMA (Hyndman & Khandakar, 2008) is applied to the residuals to
select the orders p and q. Then, the full model (including harmonics, Holt-Winters, and
Box-Cox) is re-estimated with the residuals assumed to follow an ARMA(p, q) process. The
ARMA component is only kept if its AIC (Akaike information criterion) is lower than that of
the model without ARMA(p, q) residuals.
A working TBATS example
In the following code, the TBATS model is applied to the energy consumption data with the
same setup used for the BATS model. This process runs considerably faster than the BATS
training (approximately two minutes as one job on an Intel i9-10900K with 64 GB RAM).
Code
# fit a TBATS model
tbats_estimator = TBATS(
    seasonal_periods=[24, 169],
    use_arma_errors=True,       # consider ARMA
    use_box_cox=True,           # consider Box-Cox transformation
    use_damped_trend=False,     # no damping
    use_trend=False,            # no trend
    n_jobs=8                    # number of parallel jobs
)
tbats_model = tbats_estimator.fit(X_train)
A summary with the parameter estimation is presented below. Those involved in the esti-
mation of the seasonality are not shown here. In the example, the number of parameters
has decreased from 194 to only 37. This is at the cost of the fit (a phenomenon that can be
seen in the subsequent plots displaying the fitted and forecasted values). However, the
method estimates several models until selecting the definitive one, as determined by the
AIC. Hence, the resulting model has, by design, a fair balance between complexity and
goodness of fit. In fact, the TBATS model has a lower AIC (39,291) compared to BATS
(39,392), mainly due to its much lower complexity in terms of the number of parameters.
Code
Use Box-Cox: True
Use trend: False
Use damped trend: False
Seasonal periods: [ 24. 169.]
Seasonal harmonics [11 7]
ARMA errors (p, q): (0, 0)
Box-Cox Lambda 0.546183
Smoothing (Alpha): 1.473954
Seasonal Parameters (Gamma): [-4.38139779e-05
3.76087534e-05 -2.20168900e-05 4.57034128e-06]
AR coefficients []
MA coefficients []
Seed vector [ 3.75786567e+02 -9.96440619e+00
…
9.47017235e-01]
AIC 39291.026040
Code
# extract fitted values and add to Data Frame
tbats_fit = pd.DataFrame(tbats_model.y_hat, \
    columns=['tbats_fit'], index=df.index[:len(X_train)])
df['tbats_fit'] = tbats_fit

# out-of-sample forecast and zoom window (assumed steps, analogous to the
# BATS example above)
df['tbats_forecast'] = np.nan
df.loc[df.index[len(X_train):], 'tbats_forecast'] = \
    tbats_model.forecast(steps=len(df) - len(X_train))
df_zoom = df.loc['2011-04-25':'2011-05-05']

# figure with two panels (assumed setup)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

# observed data
ax1.plot(df.index, df.energy, \
    color='k', label='Actual')
# fitted values
ax1.plot(df.index, df.tbats_fit, \
    color='b', label='Fitted')
# forecasted values
ax1.plot(df.index, df.tbats_forecast, \
    color='r', label='Forecast')
# set artists
ax1.set_xlabel('Date')
ax1.set_ylabel('Consumption (MW)')
ax1.grid()
ax1.legend()

# observed data
ax2.plot(df_zoom.index, df_zoom.energy, \
    color='k', label='Actual')
# fitted values
ax2.plot(df_zoom.index, df_zoom.tbats_fit, \
    color='b', label='Fitted')
# forecasted values
ax2.plot(df_zoom.index, df_zoom.tbats_forecast, \
    color='r', label='Forecast')
# set artists
ax2.set_xlabel('Date')
ax2.set_ylabel('Consumption (MW)')
ax2.grid()
plt.show()
The figure below shows the fit and out-of-sample forecast for the training and test periods
defined above.
Figure 97: TBATS In-Sample and Out-Sample Forecast—Energy Consumption Data
Compared to the BATS forecast, one might expect the more complex model to show a better fit
to the data. However, the TBATS model shows comparable results with much lower complexity,
lower computational training costs, and a reduced risk of overfitting. Furthermore, this
model has a lower AIC value. For these reasons, TBATS is the model of choice.
BATS and TBATS are not the only attempts to create composite forecasting models rich
enough to include trends and multiple seasonality. Another example of this approach is
the Facebook (FB) Prophet.
The open-source forecasting tool FB Prophet is a time series forecasting model developed
by Facebook and motivated by series with piecewise trends, multiple seasonality, and
floating holidays, i.e., public holidays that do not take place on the same day every year
(Facebook, n.d.). The existence of floating holidays usually breaks the pattern defined by a
seasonality, which biases the estimates and predictions of models like BATS and TBATS. FB
Prophet allows us to capture those days (and the days immediately before and after)
separately. FB Prophet is open to a broad user community, as it has an open-source license
and is accessible via GitHub. The model was developed with intuitive parameters, aiming to
avoid the need to be familiar with its precise inner workings.
The model is available for R and Python (Facebook, n.d.). Its mathematical structure is
Xt = Tt + St + Ht + εt
where Tt is the trend component, St the seasonal component, Ht the holiday effect, and εt
the error term. FB Prophet was built primarily as a curve estimation procedure, similar to
splines. Thus, this model is oriented less towards statistical inference, because the
emphasis is put on the fit and not on the statistical properties of the residuals. On the
other hand, the model has several practical advantages.
The following example uses the same data on energy consumption (Mulla, n.d.) used for BATS
and TBATS. The training data are taken from the period February 1, 2011 to April 30, 2011.
When building the model, we include a linear trend (there is also the possibility to
consider S-shaped, saturating growth), daily and weekly seasonality, no added holidays, and
no trend changepoints, i.e., turning points for the trend (these can also be modeled).
Code
# import modules
import numpy as np
from prophet import Prophet
import pandas as pd
import matplotlib.pyplot as plt

# load the hourly AEP consumption data (file name assumed; the loading step
# is not shown in the original listing)
df = pd.read_csv('AEP_hourly.csv')

# restrict to the period of interest and rename to Prophet's ds/y convention
df = df[(df.Datetime >= '2011-02-01 00:00:00') & \
    (df.Datetime < '2011-05-21 00:00:00')]
df = df.rename(columns={'AEP_MW': 'y'})
df = df.rename(columns={'Datetime': 'ds'})
df.index = pd.to_datetime(df.ds, dayfirst=True, \
    infer_datetime_format=True)
df = df.sort_index()
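The model-definition step itself is not part of the listing above. A minimal sketch of the specification described earlier (linear trend, daily and weekly seasonality, no holidays, no trend changepoints) is given below; the split date that separates the training data from the 20-day test period is an assumption.
Code
# define and fit the model (sketch; split date assumed)
X_train = df[df.ds < '2011-05-01 00:00:00']

prophet_mod = Prophet(
    growth='linear',           # linear trend
    n_changepoints=0,          # no trend changepoints
    daily_seasonality=True,    # 24-hour cycle
    weekly_seasonality=True,   # weekly cycle
    yearly_seasonality=False,
    holidays=None              # no holiday effects
)
prophet_mod.fit(X_train)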
As with the BATS and TBATS example, the trained model is used to extract the fitted values
and a forecast for 20 days. The results and the observed data are compared visually. The
Prophet library also provides the confidence intervals for the model predictions, but for
the sake of this simple example, they are not considered here.
Code
# extract fitted values and add to Data Frame
prophet_fit = prophet_mod.\
    predict(X_train).\
    yhat.\
    set_axis(df.index[:len(X_train)])
df['y_fit'] = prophet_fit

# out-of-sample forecast for the remaining 20 days and a zoom window
# (these steps are assumed here; they are not shown in the original listing)
X_test = df[df.ds >= '2011-05-01 00:00:00']
df['y_fc'] = prophet_mod.predict(X_test).\
    yhat.\
    set_axis(df.index[len(X_train):])
df_zoom = df.loc['2011-04-25':'2011-05-05']

# figure with two panels (assumed setup)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

# observed data
ax1.plot(df.index, df.y, \
    color='k', label='Actual')
# fitted values
ax1.plot(df.index, df.y_fit, \
    color='b', label='Fitted')
# forecasted values
ax1.plot(df.index, df.y_fc, \
    color='r', label='Forecast')
# set artists
ax1.set_xlabel('Date')
ax1.set_ylabel('Consumption (MW)')
ax1.grid()
ax1.legend()

# observed data
ax2.plot(df_zoom.index, df_zoom.y, \
    color='k', label='Actual')
# fitted values
ax2.plot(df_zoom.index, df_zoom.y_fit, \
    color='b', label='Fitted')
# forecasted values
ax2.plot(df_zoom.index, df_zoom.y_fc, \
    color='r', label='Forecast')
# set artists
ax2.set_xlabel('Date')
ax2.set_ylabel('Consumption (MW)')
ax2.grid()
plt.show()
Figure 98: Facebook Prophet—In-Sample and Out-Sample Forecast—Energy
Consumption Data
Not only has computational power increased in the course of the almost 80 years since the
advent of the first digital computer, but the sizes of the data sets to be analyzed have
also grown. Considering their usually enormous levels of complexity, the challenge has been
to determine how to extract usable knowledge from the data. What exactly is to be understood
by "knowledge" is, by its nature, a philosophical question. For the current purposes, it
suffices to say that the analyst should restrict themselves to finding hidden patterns
embedded in the data that allow a better prediction of the values of a variable of interest.
Thus, to uncover those hidden patterns, and taking into account that humans are great at
interpreting context but weak at repetitive tasks and prone to overlooking complex relations,
scientists have introduced concepts like "machine learning." As humans, we know that we have
learned something if we can solve a problem "B" after completing some (training) experiences
that show us how to solve problems similar to "B." This is not entirely precise, but in this
context and at this level, it is sufficient. Therefore, the machine "learns" in the following
way: the algorithm (machine) receives data, estimates or modifies some of its own parameters
(training), and, ideally, will be able to solve problems for new data sets (test and
prediction) similar to the one used for training.
Machine learning is commonly divided into two categories: supervised and unsupervised
learning. Here, we are concerned with supervised learning, which, in simple words, means
that the algorithm compares its predicted outcome with the known (labeled) outcome and uses
the difference to update its own parameters.
In symbols, suppose inputs X1,…,XN and an output Y , which represent observed data.
We want to find a mapping f that links the Xs with the Y .
Another useful subdivision of the supervised learning methods is related to the nature of
the variable Y . Depending on whether the prediction is quantitative or qualitative, it will
be called a regression or classification problem, respectively.
General Model
The time series forecasting problem can be expressed as a supervised learning problem,
and thus, it can be modeled by applying many of the tools from the supervised machine
learning framework to conduct forecasts.
The most common time series models are based, in one way or another, on an expression of
the form
Xt+h = f(Xt−1, Xt−2, …, Xt−d) + εt
with h ≥ 0. This means that future values of the series are assumed to be a function of its
past values plus an error term representing all ignored information. For example, if f is
just a linear function and the errors are white noise, we recover the classical
autoregressive model. This expression is quite general and, hence, useful at a conceptual
level. In most real problems, the practitioner will have, at most, an approximate idea of
this function (it will likely never be known exactly). However, independently of the shape
of f, what is always present is the relationship between future and past values. Thus, once
a certain amount of data has been gathered, a one-step forecast can theoretically be
generated, provided that information about f has been given.
Regarding data, the key issue is whether we can convert them to the input/output format
needed in the supervised learning context. Excluding cases like missing values or sampling
at irregular time points, this is possible and rather trivial: each window of d consecutive
past values becomes an input vector, and the value to be predicted becomes the output. This
schema of taking a set of lagged variables as individual inputs and the next variable as the
output is called sliding windows; a minimal sketch is given below. It should now be clear
how a classical time series forecast problem can be transformed into a supervised learning
problem.
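The following minimal sketch shows one way to build such a sliding-window data set from a one-dimensional series; the window length d and the forecast horizon h are arbitrary illustrative choices.
Code
# turn a series into a supervised learning data set via sliding windows
import numpy as np

def sliding_windows(series, d=5, h=1):
    """Inputs: the last d observations; output: the value h steps ahead."""
    X, y = [], []
    for t in range(d, len(series) - h + 1):
        X.append(series[t - d:t])      # lagged values as the input vector
        y.append(series[t + h - 1])    # target value h steps ahead
    return np.array(X), np.array(y)

series = np.sin(np.arange(200) / 5.0)  # toy series
X, y = sliding_windows(series, d=5, h=1)
print(X.shape, y.shape)                # (195, 5) (195,)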
Notice that the supervised learning setting is flexible enough to include the multiple-step
prediction problem by taking h ≥ 1.
What can be said about f ? The data have a context. They can be physical, economic,
social, etc. In each context, there is some accumulated expert knowledge that might
enhance the estimation of f . Leaving the knowledge component aside, the value of f can
be derived using a variety of different models. From regression to artificial neural net-
works, all of these techniques try to correctly model the function f .
This section will discuss two examples with time series using methods of supervised learn-
ing, the first as a classification problem and the second as a regression problem.
The code below creates the subsequent figure, which shows the adjusted closing prices of
the company Amazon.com, Inc. (AMZN) from the period January 1, 2015 to December 31,
2020. The Yahoo Finance data (Yahoo Finance, n.d.-a) were extracted with the Python
library yfinance.
Code
# import modules
import yfinance as yf
import numpy as np
import matplotlib.pyplot as plt
import cesium.featurize as ft
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# download and plot the adjusted closing prices (assumed step; ticker and
# date range are taken from the text, the original listing is truncated)
amzn = yf.download('AMZN', start='2015-01-01', end='2020-12-31', auto_adjust=False)
prices = amzn['Adj Close'].squeeze()
prices.plot(xlabel='Date', ylabel='Adjusted closing price (USD)')
plt.show()
Figure 101: Adjusted Closing Prices of Amazon Stocks
Here, the objective is to build a classifier. Given a five-day series, i.e., five stock prices, Pt-1,
Pt-2, …, Pt-5, corresponding to five consecutive days (excluding weekends and holidays dur-
ing which the stock markets are closed), the classifier must predict the one-day-ahead
position with respect to the fifth day. In other words, it must indicate whether the sixth
day price will be higher or lower than the fifth day price. As inputs, the classifier will take
features of the price series, including the mean, the maximum, and the standard devia-
tion.
At first glance, this problem looks more like a typical time series prediction problem, and
classification seems to be out of place. However, the problem formulation states that we
want to predict whether the price will increase (the goal is not to predict the price
increment with respect to the last day of the series); in other words, the output variable
here is binary. We use the financial terms "bull" for increases and "bear" for decreases. In
this way, the problem can be treated as a binary classification problem with the two classes
"bull" and "bear."
As mentioned above, the first thing to do is to carefully frame the time series within the
supervised learning setup. To do this, the input and output variables must be defined. Given
a starting day, say Thursday, we count four additional days and record their prices. Several
metrics are calculated using these five values, including the mean, minimum, and standard
deviation. In the following practical example, for instance, 14 different statistics will be
used to describe each series obtained by sliding windows. These statistics are then
considered the features for the classification task. Thus, the input vector is of length 14
(K = 14). The output label is "bull" if the price Pt is greater than the previous price
Pt−1, and "bear" if Pt is lower than Pt−1. The trained classifier then assigns a new sample
to whichever of the two classes receives the higher predicted probability. The figure below
depicts this concept.
Figure 102: Supervised Learning Schema (Classification)
The features 1 to K are the Xs, and the Y is a categorical variable with two values, “bull”
and “bear.”
The Python library cesium can be used to calculate the set of features. Having defined the
set of features with which we want to describe each series, this library will create a data
frame in which each column corresponds to a feature and each row to a different time ser-
ies. Additional Python libraries, such as TSFEL and tsfresh, are also extremely useful for
time series feature extraction.
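The construction of the five-day windows and their labels is not shown in the listings that follow. A minimal sketch, assuming the adjusted closing prices downloaded above are available as prices, could look like this; the variable names data and classes match those used in the subsequent listing.
Code
# build five-day windows and bull/bear labels from the price series (sketch)
window = 5
values = prices.to_numpy()
data, classes = [], []
for i in range(len(values) - window):
    data.append(values[i:i + window])                   # five consecutive prices
    nxt, last = values[i + window], values[i + window - 1]
    classes.append('bull' if nxt > last else 'bear')    # label of the sixth day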
There are several techniques with which to conduct classification. For the purpose of dem-
onstrating how a time series problem can be understood as a machine learning task, the
random forest method has been chosen, but we might have chosen logistic regression,
decision trees, support vector machines, or one of several other options.
The above description can be implemented with the following code. The features are extracted
from the series, and the feature vector is split into training and test sets. Next, the model is
trained and its performance evaluated.
Code
## Feature Extraction
# convert array of lists to list of arrays
data = list(map(np.asarray, data))
# compute a set of summary features per window with cesium; the text uses 14
# features, whose exact list is not shown, so the selection here is an assumption
times = [np.arange(len(w)) for w in data]
features_to_use = ['amplitude', 'maximum', 'minimum', 'median', 'skew', 'std_err', 'weighted_average']
fset_cesium = ft.featurize_time_series(times=times, values=data, errors=None, features_to_use=features_to_use)
## Classification
# split into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split( \
    fset_cesium.values, classes, random_state=21)
# fit a random forest classifier (assumed settings)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=21)
rf_clf.fit(X_train, y_train)
# model evaluation
acc = rf_clf.score(X_test, y_test)
print('Mean Accuracy:', round(acc, 4))
# console output:
# Mean Accuracy: 0.5013
The evaluation of the classifier on the test set showed a mean accuracy of 0.5013. This
indicates a model that does not produce noticeably better results than those achieved by
making random guesses. A better approach is required.
Time series as a regression problem: Artificial neural network example
Analogous to the classification case, we split the data into two sets, a training set and a
test set. Once the model has been estimated with the training data set, its performance on
the test set is evaluated. To evaluate the performance, we use R2. Recall that this measure
quantifies the proportion of the variability explained by the model, taking the constant
model as a benchmark.
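Explicitly, for observed values yi, model predictions ŷi, and the sample mean ȳ,
$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$
so the benchmark constant model ŷi = ȳ yields R² = 0, while a perfect fit yields R² = 1.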
Code
# import module
from sklearn.neural_network import MLPClassifier, MLPRegressor
# assumed steps: the regression target is the sixth-day price, and the
# regressor definition and fit are not shown in the original listing
targets = [values[i + window] for i in range(len(values) - window)]
X_train, X_test, y_train, y_test = train_test_split( \
    fset_cesium.values, targets, random_state=21)
rgf = MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000, random_state=21)
rgf.fit(X_train, y_train)
# model evaluation
acc = rgf.score(X_test, y_test)
print('R²:', round(acc, 4))
# console output:
# R²: 0.998
Evaluating the test set, we obtain R² = 0.998. Although the number appears to be very good,
it must be read with care. There are only a few points where the price series seems to
"jump." In general, after five rising stock prices in a row, the sixth price is likely to be
similar to the most recent prices, i.e., it is clear that the price is generally increasing.
Thus, the high R² is not so surprising after all and should not be taken as evidence of an
extraordinarily useful model. The following example explores this concept further.
If an investor buys a stock today for $100 with tomorrow’s forecasted price of $101 in
mind, but the actual price tomorrow is $99, the error will only be 2 percent. However, from
the investor’s perspective, they lost money and the low error is no consolation. Accord-
ingly, the model is slightly modified to perform classification rather than regression.
Code
# split into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split( \
    fset_cesium.values, classes, random_state=21)
# define and fit a neural network classifier (assumed settings; the
# classifier definition is not shown in the original listing)
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=21)
clf.fit(X_train, y_train)
# model evaluation
acc = clf.score(X_test, y_test)
print('Mean Accuracy:', round(acc, 4))
# console output:
# Mean Accuracy: 0.5119
The score on the test set turned out to be 0.5119. This is higher than for the random
forest, but not enough to consistently beat the market.
With these two simple examples, we are now able to formulate and solve a time series
forecasting problem as a supervised learning problem. We can now use techniques from machine
learning, such as random forests and neural networks, for time series forecasting purposes.
SUMMARY
The models and methods in time series forecasting do not end with the
classical ARIMA models and Holt-Winters methods. The impact of these
tools has been enormous beyond any doubt. Nevertheless, they are sel-
dom applied in isolation, because very few sequences behave, for
instance, as a pure ARMA process. Real series are combinations of sea-
sonality (sometimes more than one), piecewise trends, correlated resid-
uals, calendar effects, and other factors. Therefore, it is common to see
ensembles of models in more recent research. An ensemble is an ad-hoc
combination of models. Examples are BATS, TBATS, and FB Prophet.
BATS uses Box-Cox transformation and ARMA residuals, and includes
Holt-Winters methods to capture trends and seasonality. TBATS adds the
possibility of modeling seasonality with more flexibility and parsimony
by means of harmonics. The FB Prophet uses generalized additive mod-
els and Holt-Winters methods and allows holiday calendar effects to be
considered.
Model ensembles are good for prediction, but with the rise of machine
learning (ML) methods, it is natural to ask if time series problems can be
redefined to be treated as ML problems, and this is indeed the case. The
scope widens further, because with ML we can not only make time series
predictions but also conduct time series classification. Techniques such
as random forests and artificial neural networks can be used with
autocorrelated data for prediction purposes.
BACKMATTER
LIST OF REFERENCES
Abraham, B., & Ledolter, J. (1986). Forecast functions implied by autoregressive integrated
moving average models and other related forecast procedures. International Statisti-
cal Review, 54(1), 51—66. https://doi.org/10.2307/1403258
Arunraj, N. S., Ahrens, D., & Fernandes, M. (2016). Application of SARIMAX model to fore-
cast daily sales in food retail industry. International Journal of Operations Research
and Information Systems, 7(2), 1—21. http://doi.org/10.4018/IJORIS.2016040101
Board of Governors of the Federal Reserve. (n.d.). Finance rate on consumer installment
loans at commercial banks, new autos 48 month loan: Not seasonally adjusted. https://
www.federalreserve.gov/datadownload/Output.aspx?rel=G19&series=aeb2bfc737046
6afcb39c36558233ecd&lastobs=&from=01/01/2001&to=05/31/2021&filetype=csv&labe
l=include&layout=seriescolumn
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of Royal Statistical
Society. Series B (Methodological), 26(2), 211—252. https://www.jstor.org/stable/29844
18
Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. Holden-
Day.
Brockwell, P. J., & Davis, R. A. (2009). Time series: Theory and methods (2nd ed.). Springer.
Brockwell, P. J., & Davis, R. A. (2016). Introduction to time series and forecasting (3rd ed.).
Springer.
Campbell, M. J., & Walker, A. M. (1977). A survey of statistical work on the Mackenzie River
series of annual Canadian lynx trappings for the years 1821—1934 and a new analysis.
Journal of the Royal Statistical Society. Series A (General), 140(4), 411—431. https://doi.
org/10.2307/2345277
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Duxbury.
Chatfield, C. (2004). The analysis of time series: An introduction (6th ed.). Chapman & Hall/
CRC. https://doi.org/10.4324/9780203491683
Chatfield, C., Koehler, A. B., Ord, J. K., & Snyder, R. D. (2001). A new look at models for
exponential smoothing. The Statistician, 50(2), 147—159. http://www.jstor.org/stable/
2681090
Chen, C. (2003). Mapping scientific frontiers: The quest for knowledge visualization.
Springer.
De Livera, A. M., Hyndman, R. J., & Snyder, R. D. (2011). Forecasting time series with com-
plex seasonal patterns using exponential smoothing. Journal of the American Statisti-
cal Association, 106(496), 1513—1527. http://www.jstor.org/stable/23239555
Elsaraiti, M., Ali, G., Musbah, H., Merabet, A., & Little, T. (2021). Time series analysis of elec-
tricity consumption forecasting using ARIMA models. In 13th Annual IEEE Green Tech-
nologies Conference (pp. 259—262). IEEE. https://doi.org/10.1109/GreenTech48523.20
21.00049
Falk, M., Marohn, F., Michel, R., Hofmann, D., Macke, M., Spachmann, C., & Englert, S.
(Eds.). (2011). A first course on time series analysis: Examples with SAS. Chair of Statis-
tics, University of Würzburg.
Fan, J., & Yao, Q. (2003). Nonlinear time series: Nonparametric and parametric methods.
Springer.
Franke, J., Härdle, W. K., & Hafner, C. M. (2011). Statistics of financial markets: An introduc-
tion (3rd ed.). Springer.
Gapminder. (n.d.-a). GD001 GDP per capita, constant PPP dollars (Version 27) [Data set]. htt
ps://www.gapminder.org/data/documentation/gd001/
Gapminder. (n.d.-b). GD004 Life expectancy at birth (Version 11) [Data set]. https://www.ga
pminder.org/data/documentation/gd004/
Gardner, E. S. (1985). Exponential smoothing: The state of the art. Journal of Forecasting,
4(1), 1—28. https://doi.org/10.1002/for.3980040103
Gardner, E. S. (2006). Exponential smoothing: The state of the art—Part II. International
Journal of Forecasting, 22, 637—666. https://doi.org/10.1016/j.ijforecast.2006.03.005
Gardner, E. S., & McKenzie, E. (1985). Forecasting trends in time series. Management Sci-
ence, 31(10), 1237—1246. http://www.jstor.org/stable/2631713
Goddard Institute for Space Studies. (2021). GISS surface temperature analysis (GISTEMP
v4) [Data set]. National Aeronautics and Space Administration. https://data.giss.nasa.
gov/gistemp/
Holt, C. C. (1957). Forecasting seasonals and trends by exponentially weighted moving aver-
ages [Research memorandum no. 52]. Carnegie Institute of Technology.
Hyndman, R., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast pack-
age for R. Journal of Statistical Software, 27(3), 1—22. https://doi.org/10.18637/jss.v02
7.i03
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learn-
ing: With applications in R. Springer.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learn-
ing: With applications in R (2nd ed.) Springer.
Kopf, D. (2015, November 6). The discovery of statistical regression. Priceonomics. https://
priceonomics.com/the-discovery-of-statistical-regression/
Lenssen, N. J. L., Schmidt, G. A., Hansen, J. E., Menne, M. J., Persin, A., Ruedy, R., & Zyss, D.
(2019). Improvements in the GISTEMP uncertainty model. Journal of Geophysical
Research: Atmospheres, 124(12), 6307—6326. https://doi.org/10.1029/2018JD029522
Liu, X., Huang, J., Li, C., Zhao, Y., Wang, D., Huang, Z., & Yang, K. (2021). The role of season-
ality in the spread of COVID-19 pandemic. Environmental Research, 195. https://doi-or
g.pxz.iubh.de:8443/10.1016/j.envres.2021.110874
Molteni, M. (2021, December 27). Forecasting the Omicron winter: Experts envision various
scenarios, from bad to worse. Stat. https://www.statnews.com/2021/12/27/forecastin
g-the-omicron-winter-experts-envision-various-scenarios-from-bad-to-worse/
Mulla, R. (n.d.). Hourly energy consumption (Version 3) [Data set]. Kaggle. https://www.kag
gle.com/robikscube/hourly-energy-consumption
National Aeronautics and Space Administration. (2021, December 13). Global temperature.
https://climate.nasa.gov/vital-signs/global-temperature/
Nielsen, A. (2019). Practical time series analysis: Prediction with statistics and machine
learning. O’Reilly.
Poleneni, V., Rao, J. K., & Hidayathulla, S. A. (2021). COVID-19 prediction using ARIMA
model. In Proceedings of the Confluence 2021: 11th International Conference on Cloud
Computing, Data Science and Engineering (pp. 860—865). https://doi.org/10.1109/Conf
luence51648.2021.9377038
Shumway, R. H., & Stoffer, D. S. (2017). Time series analysis and its applications – With R
examples (4th ed.). Springer.
Svetunkov, I., & Petropoulos, F. (2018). Old dog, new tricks: A modelling view of simple
moving averages. International Journal of Production Research, 56(18), 6034—6047. htt
ps://doi.org/10.1080/00207543.2017.1380326
Taylor, S. J., & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37—
45. https://doi.org/10.1080/00031305.2017.1380080
Terry, K. (2021, May 20). What experts predict from COVID this fall and winter. WebMD. http
s://www.webmd.com/lung/news/20210520/what-experts-predict-from-covid-this-fall-
and-winter
Whittle, P. (1951). Hypothesis testing in time series analysis. Almqvist & Wiksells.
LIST OF TABLES AND
FIGURES
Figure 1: Average Global Surface Temperatures (1951—1980) . . . . . . . . . . . . . . . . . . . . . . . . 14
Figure 12: ACF and PACF Plots of the Differenced Data on Saudi Arabia’s GDP (1951—2013)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Figure 15: Toy Example: Least Squares Estimated Linear Regression Model . . . . . . . . . . . . 38
Figure 16: Global Surface Temperatures (1880—2020): Original Series and Fits Using a Lin-
ear and Quadratic Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Figure 17: Summary Table: Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Figure 19: Global Surface Temperatures (1880—2020): Original and Smoothed Series Using
MA Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Figure 22: Global Surface Temperatures (1880—2020): Original and Kernel Smoothed Ser-
ies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Figure 23: Deutsche Bank Stock Prices—Raw Series (Top) and Differenced Series (Bottom)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Figure 27: Hourly Energy Consumption with 24- and 168-Hour Harmonic Regression . . . 59
Figure 30: Monthly Automobile Sales (United States) with Dummy Variable Regression . 65
Figure 32: Autocorrelograms: Deutsche Bank Closing Prices and Differenced Prices . . . . 74
Figure 35: Observed and Fitted Precipitation Values Using an SMA(5) and an SMA(10)
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Figure 37: Precipitation Forecast for an SMA(5) Model with a 20-Year Horizon . . . . . . . . . . 87
Figure 39: Observed and Fitted Precipitation Values Using a CMA(3) and a CMA(7) Model 90
Figure 41: Closing Prices of the Pfizer Stock and WMA(10) Model Fits with Arithmetically
Decaying Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Figure 42: Closing Prices of the Pfizer Stock and Model Fits of an Optimal WMA(3) Model 99
Figure 44: Residuals of a Linear (Top) and Quadratic (Bottom) Model for Global Average
Temperatures (1880—2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Figure 45: ACF and PACF of the Residuals: Quadratic Model . . . . . . . . . . . . . . . . . . . . . . . . . 105
Figure 51: ACF and PACF of the Residuals: Quadratic Model . . . . . . . . . . . . . . . . . . . . . . . . . 121
Figure 53: Theoretical ACF Plots for an AR(1) Model with Different Values of . . . . . . . . . . 123
Table 10: ACF and PACF Patterns of an AR(p) Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Table 11: ACF and PACF Patterns of an AR(p) and MA(q) Process . . . . . . . . . . . . . . . . . . . . . 128
Table 12: ACF and PACF Behavior for the AR, MA, and ARMA Models . . . . . . . . . . . . . . . . . . 130
Figure 56: ACF and PACF of the Residuals: Quadratic Model . . . . . . . . . . . . . . . . . . . . . . . . . 131
Table 13: Information Criteria Comparison: AR(1), MA(4), and ARMA(1,4) . . . . . . . . . . . . . 136
Figure 60: ACF and PACF Plots of the Residuals of an AR(1) Model . . . . . . . . . . . . . . . . . . . . 138
Figure 61: ACF and PACF Plots of the Residuals of an MA(4) Model . . . . . . . . . . . . . . . . . . . 139
Figure 62: ACF and PACF Plots of the Residuals of an ARMA(1,4) Model . . . . . . . . . . . . . . . 140
Figure 63: Temperature Forecasts Using a Combination of a Regression and an AR(1) Model
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Figure 65: ACF and PACF Plots of the Logged Data on the GDP of Spain (1950—2013) . . 145
Figure 67: ACF Plot of the Differenced Data on the GDP of Spain (1950—2013) . . . . . . . . . 149
Figure 68: Summary Table for an ARIMA(1,1,0) Model for the Logarithm of Spain’s GDP 150
Table 14: ACF and PACF Behavior for the ARIMA Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Figure 69: ACF Plots for the Differenced Series of the Logarithm of Spain’s GDP . . . . . . . 152
Figure 70: Monthly Automobile Sales in the United States (2001—2021) . . . . . . . . . . . . . . 153
Figure 71: ACF and PACF Plots for the Differenced Data on Automobile Sales in the United
States (2001—2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Table 15: ACF and PACF Behavior for the Non-Seasonal and Pure Seasonal ARMA Models
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Figure 73: ACF and PACF Plots of the Residuals of a SARIMA(0,1,0)x(1,0,0),, Model for Auto-
mobile Sales in the United States (2001—2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Figure 76: ACF and PACF Plots of the Residuals of a SARIMA(0,1,1)x(1,0,0),, Model for Auto-
mobile Sales in the United States (2001—2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Figure 77: Automobile Sales in the United States (2011—2021) and ACF Plot . . . . . . . . . . 165
Figure 78: Automobile Sales in the United States (2011—2021) and Linear Regression Fit
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Figure 79: ACF and PACF Plots of the Residuals of a Linear Regression Fit for Automobile
Sales in the United States (2001—2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Figure 80: ACF and PACF Plots of the Residuals of a SARIMAX(0,0,0)x(1,0,0),, Model . . . . 168
Figure 82: Total Monthly Automobile Sales in the United States (2001—2021) . . . . . . . . . 177
Figure 83: Automobile Sales in the United States (2001—2021) and Exponential Smoothing
Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Figure 84: Observed Automobile Sales and Forecasts Using the SES Method . . . . . . . . . . 182
Figure 88: DES Method Estimation—Automobile Sales Data . . . . . . . . . . . . . . . . . . . . . . . . . 188
Figure 89: Observed Automobile Sales and Forecasts Using the DES Method with h=5 . 189
Figure 90: Automobile Sales and Fitted Values Using the DES Method with and without
Damping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Figure 92: Observed Automobile Sales and Forecasts Using the TES Method with h=12,
p=12, and Assuming Additive Seasonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Figure 93: Automobile Sales and Fitted Values Using the TES Method Assuming Multiplica-
tive Seasonality with and without Damping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Table 16: Cross Validation Prediction Error Estimation—SES, DES, and TES Models . . . . 199
Figure 94: American Electric Power’s Hourly Energy Consumption in Megawatts . . . . . . 203
Figure 96: BATS In-Sample and Out-Sample Forecast—Energy Consumption Data . . . . . 210
Figure 97: TBATS In-Sample and Out-Sample Forecast—Energy Consumption Data . . . . 215
IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152
D-99084 Erfurt
Mailing Address
Albert-Proeller-Straße 15-19
D-86675 Buchdorf
[email protected]
www.iu.org