
IE5202: Applied Forecasting Methods

Chen Nan

August 10, 2020

Contents

1 Syllabus
  1.1 Module Information
  1.2 Schedule
  1.3 Grading

2 Introduction
  2.1 Definitions
  2.2 Deterministic vs Stochastic Relations
  2.3 Errors in Forecasting

PART I: FORECASTING METHODS FOR CROSS-SECTIONAL DATA

3 Linear Regression
  3.1 Overview
  3.2 Simple linear regression
    3.2.1 Least square estimation
    3.2.2 Accuracy of the estimation
    3.2.3 Point forecasting
    3.2.4 Interval Forecasting
    3.2.5 Other notable notes
  3.3 Multiple linear regression
    3.3.1 Why multiple regression?
    3.3.2 Interactions between variables
    3.3.3 Least square estimation
    3.3.4 Confidence & prediction intervals

4 Model Checking and Diagnosis
  4.1 Residuals
  4.2 Diagnostics using residuals
    4.2.1 Major graphical tools
    4.2.2 Tests for certain property
  4.3 Outliers, leverage points, influential points, collinearity
    4.3.1 Identifying outliers
    4.3.2 High leverage points
    4.3.3 Identifying influential observations
    4.3.4 Collinearity

5 Model Evaluation and Selection
  5.1 Evaluating the regression model
  5.2 Selecting the regression model

6 Hypothesis Testing in Regression Models

7 Methods Beyond Linear Regression

PART II: FORECASTING METHODS FOR TIME SERIES DATA

8 Regression on Time
  8.1 Time Series Regression
  8.2 Detecting Autocorrelation
  8.3 Seasonal Variation
  8.4 Growth Curve Models

9 Exponential Smoothing
  9.1 Simple Exponential Smoothing
  9.2 Holt's Trend Corrected Smoothing
  9.3 Holt-Winters Method

10 ARMA Time Series Model
  10.1 Stationary
  10.2 ACF and PACF
  10.3 ARMA model
  10.4 ARMA model
    10.4.1 Link to other models
    10.4.2 Model Constraints
    10.4.3 Model Prediction
  10.5 Seasonal ARMA model
  10.6 Model Estimation
    10.6.1 Matching TAC with SAC (Moment Method)
    10.6.2 Least square and MLE
    10.6.3 Model Diagnostics

PART III: SPATIAL AND SPATIAL-TEMPORAL DATA

11 Spatial Data Forecasting
  11.1 Spatial Data
  11.2 Lattice Data Analysis
    11.2.1 Moran's I to Test Dependency
    11.2.2 Spatial Autoregression
    11.2.3 Spatial Linear Regression with Exogenous Variables
    11.2.4 Generalizations
  11.3 Geostatistical Interpolation
    11.3.1 Spatial Dependencies: Covariance and Semivariance
    11.3.2 Kriging as an interpolation

12 Spatial Temporal Data and Models
  12.1 Spatial-temporal lattice data analysis
  12.2 Spatial-Temporal Kriging

13 References
1 Syllabus
1.1 Module Information
Instructor: Dr Chen Nan, E1-05-20
Contact: Phone: 65167914
Email: [email protected]
TA: Xie Jiaohong ([email protected])
Office Hours: By appointment
Textbook: Forecasting, Time Series, and Regression, by Bowerman, O’Connell, and
Koehler
References: Linear Regression Analysis, by George A. F. Seber, Alan J. Lee
Time Series Analysis, by George E. P. Box, Gwilym M. Jenkins, Gregory
C. Reinsel
Prerequisites: IE 5002, IE 6002, programming
Description: This module focuses on the theory and practice of forecasting methods. It
discusses two major categories of forecasting problems and the corresponding
techniques. Extensive hands-on projects will be provided to solve real-life
problems.

1.2 Schedule
Aug 12, 2020 Module logistics, introduction, and reviews
Aug 19, 2020 Regression analysis
Aug 26, 2020 Model checking and diagnosis
Sep 02, 2020 Model evaluation & selection
Sep 09, 2020 Machine learning approaches
Sep 16, 2020 Case Studies
Sep 23, 2020 Recess Week
Sep 30, 2020 Seasonality, regression on time
Oct 07, 2020 Exponential smoothing
Oct 14, 2020 Autocorrelation and ARMA
Oct 21, 2020 Seasonal ARIMA Model
Oct 28, 2020 Neural networks for time series
Nov 04, 2020 Forecasting spatial data
Nov 11, 2020 TBD

1.3 Grading
Grading: Project 0: 10% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Due Aug 30, 2020
Project 1: 30% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Due Sep 27, 2020
Project 2: 30% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Due Nov 15, 2020
Final Exam: 30% . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nov 27, 2020, 14:30-16:30

The projects are based on real problems and real data. The dataset and background information
of the project will be provided. For each project, the submission should include the following items

• A report of not more than 10 pages with 1.5 spacing (soft copies and hard copies), which
documents the methods used, main findings, and interpretations. Codes and software printouts
should NOT be included in the report.
• Complete codes used for the analysis, with reasonable details of comments (Soft copies only)
• Forecasting results on the test dataset in a “csv” file with a single column, as shown in the
following example

A0001124H
10.31
8.5
20.1
...
11.5

2 Introduction
2.1 Definitions
Definition 2.1 Predictions of future events and conditions are called forecasts, and the act of
making such predictions is called forecasting.

Forecasting is very important in many types of organizations since predictions of future events
must be incorporated into the decision-making process. Examples include

• Government needs to forecast such things as air quality, water quality, unemployment rate,
inflation rate, and welfare payment, etc.
• A university needs to forecast its enrollment, temperature, and broken assets.
• Business firms need to forecast demands to plan sales and production strategy, to forecast
interest rate for financial planning, to forecast number of workers required for human resource
planning, and to forecast the quality of the product for process improvement and quality
control.

To forecast events that will occur in the future, one must rely on information concerning events
that have occurred in the past. Based on the type of information used, there are two categories of
forecasting methods.

1. Qualitative forecasting methods: use the opinions of experts to subjectively predict
future events. They are often required when historical data are either unavailable or scarce,
or when changes in the data pattern cannot be predicted on the basis of historical data.
Commonly used qualitative methods include

(a) Delphi Method: Use a panel of experts to produce predictions concerning a specific
question such as when a new development will occur in a particular field. The panel
members are kept physically separated. After the first questionnaire has been completed
and sent, subsequent questionnaires are accompanied by information concerning the
opinions of the group as a whole.
(b) Technological comparisons: are used in predicting technological change. It determines a
pattern of change in one area, called a primary trend, which the forecaster believes will
result in new developments being made in some other area. A forecast of developments
in the second area can then be made by monitoring developments in the first area.
(c) Subjective curve fitting: The forecaster subjectively determines the form of the curve to
be used, and a great deal of the expertise and judgment is required.

2. Quantitative forecasting methods: involve the analysis of historical data in an attempt
to predict future values of a variable of interest. The methods often depend on the types of
data available.

Definition 2.2 Cross-sectional data are values observed at one point in time; a time
series is a chronological sequence of observations on a particular variable.

As a result, quantitative methods can be roughly classified as

(a) Causal methods: involve the identification of other variables that are related to the
variable to be predicted. They develop a statistical model that describes the relationship
between these variables and the variable to be forecasted. For example, the sales of
a product might be related to the price of the product, competitors' prices for similar
products, and advertising expenditures to promote the product.
(b) Time series methods: make predictions of future values of a time series solely on the
basis of its past values. They try to identify a pattern in the historical
data, which is extrapolated in order to make a forecast. It is assumed that the pattern
will continue in the future. For example, one predicts the temperature tomorrow based
solely on the temperatures in the past days.

2.2 Deterministic vs Stochastic Relations


In this module, we only focus on quantitative methods for forecasting. Before proceeding, we
must clarify the types of relationships we do not study in this module, namely, deterministic (or
functional) relationships. Here are examples of a deterministic relationship.
• As you may know, the relationship between degrees Fahrenheit and degrees Celsius is known
to be Fahr = 9 × Cels/5 + 32. If you know the temperature in degrees Celsius, you can use
this equation to determine the temperature in degrees Fahrenheit exactly.
• Circumference = π × diameter
• Hooke’s Law: Y = α + βX, where Y is amount of stretch in a spring, and X is the applied
weight.
• Ohm’s Law: I = V /r, where V is the voltage applied, r is the resistance, and I is the current.
• Boyle’s Law: For a constant temperature, P = α/V , where P is pressure, α is a constant for
each gas, and V is the volume of the gas.
For each of these deterministic relationships, the equation exactly describes the relationship between
the two variables. This course does not examine deterministic relationships. Instead, we are
interested in statistical or stochastic relationships, in which the relationship between the variables
is not perfect. Some examples of statistical relationships might include:

• Height and weight: as height increases, you’d expect weight to increase, but not perfectly.
• Alcohol consumed and blood alcohol content: as alcohol consumption increases, you’d expect
one’s blood alcohol content to increase, but not perfectly.
• Vital lung capacity and pack-years of smoking: as amount of smoking increases (as quantified
by the number of pack-years of smoking), you’d expect lung function (as quantified by vital
lung capacity) to decrease, but not perfectly.
• Driving speed and gas mileage: as driving speed increases, you’d expect gas mileage to
decrease, but not perfectly.

It is also noted that the boundary between deterministic and stochastic relationships might not
be clear in some scenarios. For example, depending on the accuracy and precision requirements,
Newton's laws in physics can be viewed as deterministic relations in some cases, but can only serve
as an approximation to the theory of relativity in some other cases. It is also possible that the
stochastic or random elements observed are simply due to some unknown variables in a deterministic relation.

2.3 Errors in Forecasting


Unfortunately, all (stochastic) forecasting involves some degree of uncertainty. We recognize this fact
by including an irregular component in the description of the model. The presence of this irregular
component, which represents unexplained or unpredictable fluctuations in the data, means that
some error in forecasting must be expected.
The fact that forecasting techniques often produce predictions that are somewhat in error has
a bearing on the form of the forecasts we require. Two types of forecasts are common in practice.

• Point forecast: a single number representing our “best” prediction of the actual value
• Prediction interval forecast: an interval of numbers that will contain the actual value
with a certain confidence level (e.g., 95%)

To evaluate the performance or accuracy of the forecasting methods, certain criteria shall be
used. For point forecast, a natural way is to calculate the forecast error.

Definition 2.3 The forecast error for a particular forecast ŷ of a quantity of interest y is

e = y − ŷ.

If the forecast is accurate, the error is small stochastically (small mean and small variance). In general, e cannot be zero, and
can be large even for a good forecasting method in “unlucky” cases. Therefore, it is important to
measure the magnitude of the errors over time or over different samples to evaluate the forecasting
method.

Definition 2.4 The mean absolute deviation (MAD) of the forecasting is defined as
$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|.$$
The mean squared error (MSE) of the forecasting is defined as
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

Intuitively, MSE is more sensitive to large forecast errors.

To compare the forecast on quantities of different scales, relative errors can be adopted by
normalizing the error by the value to be forecasted.

Definition 2.5 The mean absolute percentage error (MAPE) of a method is defined as
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|.$$
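As a small illustration of these error measures, the following Python sketch computes MAD, MSE, and MAPE with numpy; the actual values and forecasts are hypothetical numbers made up for illustration only.

import numpy as np

# Hypothetical actual values and forecasts, for illustration only
y = np.array([10.2, 11.5, 9.8, 12.1, 10.9])
y_hat = np.array([10.0, 11.0, 10.5, 12.4, 10.1])

e = y - y_hat                    # forecast errors e_i = y_i - yhat_i
mad = np.mean(np.abs(e))         # mean absolute deviation
mse = np.mean(e ** 2)            # mean squared error
mape = np.mean(np.abs(e / y))    # mean absolute percentage error
print(mad, mse, mape)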

Different from a point forecast, a prediction interval forecast is an interval of values. Its
performance depends on two factors.

Definition 2.6 Coverage probability is the proportion of the time that a confidence interval
contains the true value of interest. Length of the interval is simply the difference in the two
endpoints.

Ideally, for a 95% prediction interval, the interval should have coverage probability 0.95, i.e., con-
taining the true value 95% of the time. On the other hand, the length of the interval indicates
the precision of the forecast. Given the same coverage probability, the shorter the interval, the
better.

3 Linear Regression
3.1 Overview
Galton was a pioneer in the application of statistical methods. In studying data on relative sizes
of parents and their offspring in various species of plants and animals, he observed the following
phenomenon: a larger-than-average parent tends to produce a larger-than-average child, but the
child is likely to be less large than the parent in terms of its relative position within its own
generation. Galton termed this phenomenon a regression toward mediocrity, which in modern
terms is a regression to the mean.
Regression to the mean is an inescapable fact of life. Your children can be expected to be less
exceptional (for better or worse) than you are. Your score on a final exam in a course can be
expected to be less good (or bad) than your score on the midterm exam, relative to the rest of
the class. The key word here is “expected”. This does not mean it’s certain that regression to the
mean will occur, but it is more likely than not. For a detailed account, please refer to
http://www.socialresearchmethods.net/kb/regrmean.php.
Linear regression analysis is the most widely used of all statistical techniques: it is the study of
linear, additive relationships between variables. Even though the linear condition seems restrictive,
it has some practical justifications:

• linear relationships are the simplest non-trivial relationships;

• the “true” relationships between variables are often approximately linear, at least over a range
of values;
• Even for some nonlinear relationships, we can often transform the variables in a way to
linearize the relationships.

3.2 Simple linear regression


Simple linear regression is a statistical method that allows us to summarize and study relationships
between two continuous (quantitative) variables. One variable, denoted x, is regarded as the
predictor, explanatory, or independent variable. The other variable, denoted y, is regarded as the
response, outcome, or dependent variable. We will use the predictor and response terms to refer
to the variables encountered in this module.
Linear regression attempts to describe the nature of the association by constructing a “best-
fitting” mathematical model. When using linear regression we assume that the variables are asso-
ciated in a linear fashion, and we attempt to find the line that best explains the association. Mathe-
matically, it describes the relation between response Y and predictor X as

Y = β0 + β1 X + ε, (3.1)

where Eε = 0 and σ² = var(ε) < ∞. ε denotes the observation error, noise, or uncertainty that
cannot be accounted for by the linear relation β0 + β1 X. Here "linear" refers to linearity in the
parameters: the partial derivative of the mean function with respect to each β is free of the
parameters, even though the relation between Y and X itself may be nonlinear in X.
3.2.1 Least square estimation
The linear relation (linear model) (3.1) has three parameters β0 , β1 , σ 2 . They have clear physical
interpretations. However, in practice their values might not be available, and need to be estimated
from observations.
Assume that we have n observations of (xi , yi ). We want to find a straight line that “best”
forecasts (approximates) these n points. For any given values β0 = a, β1 = b, a natural point forecast
of the response given predictor X is simply a + bX, the conditional expectation E(Y |X). Recall
the commonly used forecasting error MSE is defined as
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - a - b x_i)^2.$$

The least square estimates of the parameters are defined as the values of β0, β1 that minimize
the MSE:
$$\hat\beta_0, \hat\beta_1 = \arg\min_{a,b}\; \frac{1}{n}\sum_{i=1}^{n} (y_i - a - b x_i)^2. \qquad (3.2)$$

By taking the derivatives with respect to a, b, we can get the analytical expressions for β̂0, β̂1 as
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}, \qquad (3.3)$$
where $\bar x = \sum_{i=1}^{n} x_i/n$ and $\bar y = \sum_{i=1}^{n} y_i/n$ are the sample averages of xi, yi respectively. It is easy
to show that Eβ̂0 = β0 and Eβ̂1 = β1, meaning they are unbiased estimators of the regression
coefficients.
A natural way to estimate the variance σ² is
$$\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2, \qquad (3.4)$$
where the term $\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$ is often called the sum of squared errors (SSE). The
divisor (n − 2) makes the estimator unbiased, Eσ̂² = σ². σ̂² quantifies the uncertainty around the
regression line, and is related to goodness-of-fit as well.
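As a sketch of how (3.3) and (3.4) can be computed directly, the following Python snippet estimates β̂0, β̂1, and σ̂² with numpy; the x and y arrays are hypothetical.

import numpy as np

# Hypothetical observations, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
n = len(x)

# Least square estimates, following (3.3)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Unbiased variance estimate, following (3.4)
sse = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
sigma2_hat = sse / (n - 2)
print(beta0_hat, beta1_hat, sigma2_hat)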
Remark: There are a few interesting notes on the least square estimation in simple linear regression.

• The estimated regression line Y = β̂0 + β̂1 X passes through the point (x̄, ȳ).
• The estimated slope β̂1 is closely related to the correlation coefficient between X and Y.
Recall that the sample correlation is defined as
$$\hat\rho = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^{n}(x_i - \bar x)^2 \sum_{i=1}^{n}(y_i - \bar y)^2}}.$$
Comparing it with β̂1, we find that
$$\hat\beta_1 = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar y)^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}}\;\hat\rho = \frac{S_Y}{S_X}\,\hat\rho,$$
where $S_Y^2, S_X^2$ are the sample variances of Y, X respectively.

• The denominator n − 2 in (3.4) is used to make σ̂² unbiased, i.e., Eσ̂² = σ².

3.2.2 Accuracy of the estimation

The aforementioned results do not require the specific form of the error distribution. However, to
assess the accuracy of the estimation, and to construct the confidence interval, it is necessary to
know the distribution of ε. A commonly used assumption is that the εi independently follow a normal
distribution, εi ∼ N(0, σ²).

Based on the linear regression model Y = β0 + β1 X + ε, we can also conclude that the conditional
distribution of Y given X is N(β0 + β1 X, σ²). It is clear that the regression part β0 + β1 X models the
mean (expectation) of the response, and the error term quantifies the uncertainty in the prediction
(modeling).
If the yi are independent and the true relationship follows the model (3.1), it can be derived that
the parameter estimates in (3.3) follow normal distributions:
$$\hat\beta_1 \sim N\!\left(\beta_1,\; \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}\right), \qquad \hat\beta_0 \sim N\!\left(\beta_0,\; \left(\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}\right)\sigma^2\right). \qquad (3.5)$$

It shows the least square method can estimate the true parameters “on average”. However, given
finite number of samples, there exists uncertainty in estimating the true parameters. The magnitude
of uncertainty depends on two factors:

• The sample size n used in the estimation: a smaller sample implies a larger sampling variance.
• The spread of the xi: the more dispersed the xi, the better the estimation.

Similarly, we can find the distribution of σ̂² under the normal assumption. Using the formula in
(3.4), it can be derived that
$$\hat\sigma^2 \sim \frac{\sigma^2}{n-2}\,\chi^2_{n-2}. \qquad (3.6)$$
As a result, we have Eσ̂² = σ².
The distributions of the estimated parameters also allow us to construct confidence intervals for
such estimates. From (3.5) and (3.6), we can find that
$$\frac{\hat\beta_1 - \beta_1}{\sqrt{\mathrm{var}(\hat\beta_1)}} \sim t_{n-2}, \qquad \frac{\hat\beta_0 - \beta_0}{\sqrt{\mathrm{var}(\hat\beta_0)}} \sim t_{n-2}.$$

As a result, the 1 − α level confidence intervals for both parameters are
$$\hat\beta_1 \pm t_{n-2,\alpha/2}\,\hat\sigma\sqrt{\frac{1}{\sum_{i=1}^{n}(x_i-\bar x)^2}}, \qquad \hat\beta_0 \pm t_{n-2,\alpha/2}\,\hat\sigma\sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}}, \qquad (3.7)$$
where $t_{n-2,\alpha/2}$ is the upper α/2 quantile of the $t_{n-2}$ distribution. A larger sample size gives a
narrower confidence interval, i.e., better accuracy.
Remark: Some additional notes on the estimation accuracy.
• We can construct the confidence interval for σ² as well, based on (3.6).
• Since β̂0 and β̂1 are estimated from the same set of data, they are not independent. The
covariance between them is crucial for interval forecasting:
$$\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = \frac{-\bar x}{\sum_{i=1}^{n}(x_i - \bar x)^2}\,\sigma^2. \qquad (3.8)$$
• When the number of samples is large enough (n → ∞), we can expect β̂0, β̂1, σ̂² to all converge
to the true values.
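To illustrate (3.7), the following sketch constructs 95% confidence intervals for β̂1 and β̂0 using scipy's t quantiles; it re-uses the hypothetical data and the least square formulas from the earlier sketch.

import numpy as np
from scipy import stats

# Same hypothetical data as in the earlier sketch
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()
sigma_hat = np.sqrt(np.sum((y - beta0_hat - beta1_hat * x) ** 2) / (n - 2))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # upper alpha/2 quantile of t_{n-2}

# 95% confidence intervals, following (3.7)
ci_beta1 = beta1_hat + np.array([-1, 1]) * t_crit * sigma_hat * np.sqrt(1 / sxx)
ci_beta0 = beta0_hat + np.array([-1, 1]) * t_crit * sigma_hat * np.sqrt(1 / n + x.mean() ** 2 / sxx)
print(ci_beta1, ci_beta0)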

3.2.3 Point forecasting

Given the parameters estimated by the least square method, we can make a point forecast given
any value of the predictor X. In fact, the “best” prediction is the conditional mean. Given X = x∗ ,
the point forecast y∗ is simply
y ∗ = β̂0 + β̂1 x∗ .

Substituting the formulas in (3.3), we have

y ∗ = ȳ − β̂1 x̄ + β̂1 x∗ = ȳ + β̂1 (x∗ − x̄).

Essentially, the forecasted value of the response, in terms of deviation from the mean, is proportional
to the deviation of the predictor value from its mean.
Using the relationship between β̂1 and the correlation ρ̂, we can also write the forecast as
$$\frac{y^* - \bar y}{S_Y} = \hat\rho\,\frac{x^* - \bar x}{S_X}, \qquad (3.9)$$

where (y∗ − ȳ)/SY and (x∗ − x̄)/SX can be viewed as "standardized" values. Since |ρ̂| is always
smaller than 1 (unless the fit is perfect), this also explains "regression to the mean" technically. In particular, our prediction
for standardized y ∗ is typically smaller in absolute value than our observed value for standardized
x∗ . That is, the prediction for Y is always closer to its own mean, in units of its own standard
deviation, than X was observed to be, which is Galton’s phenomenon of regression to the mean.
The perfect positive correlation (ρ = 1) or perfect negative correlation (ρ = −1) is only obtained
if one variable is an exact linear function of the other, without error, i.e., Y = β0 + β1 X. In this
case, the relationship between X and Y becomes deterministic rather than stochastic.

3.2.4 Interval Forecasting

When we use the estimated parameters to make forecasts, we need to consider the uncertainty in
the estimated parameters. In particular, when the predictor has value X = x∗, the conditional mean
of the response is y∗ = β̂0 + β̂1 x∗, as shown in Section 3.2.3. Using the distributional information on
β̂, we can also find the distribution of y∗. Because both β̂0 and β̂1 are normally distributed (3.5),
we can derive
$$y^* \mid x^* \sim N\!\left(\beta_0 + \beta_1 x^*,\; \left(\frac{1}{n} + \frac{(x^*-\bar x)^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}\right)\sigma^2\right). \qquad (3.10)$$

It is clear that the parameter uncertainty propagates to the forecast uncertainty. The forecast still has the
true mean "on average", but with additional variation given by the variance component in the normal
distribution (3.10). The form of the variance also suggests that the forecasting accuracy (or equiv-
alently, the magnitude of the variance) depends on the following factors: (a) the sample size n used
in estimation; (b) the spread of the observations xi; (c) the distance between the forecast point
x∗ and the data center x̄.
From (3.10), we can construct the interval forecast at the 1 − α confidence level:
$$\hat\beta_0 + \hat\beta_1 x^* \pm t_{n-2,\alpha/2}\,\hat\sigma\sqrt{\frac{1}{n} + \frac{(x^*-\bar x)^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}}. \qquad (3.11)$$

It is noted that the interval in (3.11) is a confidence interval for the mean value of the response at x∗. It is different
from the prediction interval for the response, which includes another error term ε. Consequently,
the prediction interval for the response is
$$\hat\beta_0 + \hat\beta_1 x^* \pm t_{n-2,\alpha/2}\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}}. \qquad (3.12)$$
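In practice, both intervals (3.11) and (3.12) can be read off a fitted model; the sketch below uses statsmodels' get_prediction with hypothetical data and a hypothetical forecast point x∗ = 4.5. The mean_ci_* columns correspond to the interval for the mean response and the obs_ci_* columns to the prediction interval for a new observation.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data, for illustration only
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   "y": [2.1, 2.9, 3.8, 5.2, 5.9, 7.1]})
fit = smf.ols("y ~ x", data=df).fit()

new = pd.DataFrame({"x": [4.5]})                      # hypothetical forecast point x*
pred = fit.get_prediction(new).summary_frame(alpha=0.05)
# interval for the mean response, cf. (3.11); prediction interval for the response, cf. (3.12)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])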

3.2.5 Other notable notes

1. Model assumptions: most results discussed above rely on some important assumptions of
the data. In addition to the linear form of the mean β0 + β1 X, the error terms εi must be
independent and normally distributed, with the same variance. These assumptions must be
checked after the model estimation (discussed later). In many problems, these assumptions
might be severely violated. In these cases, the model needs to be revised before reaching
meaningful results.

2. Transformation: The linearity constraint does not imply the relationship can only be a straight
line. In fact, different transformations of x can be used (e.g., x², ln(x), √x). The results
discussed above still hold with transformed variables. The transformation can be inspired
by data features, or guided by first principles.

3. Outliers: The least square method minimizes the sum of squared errors; as a result, it is
sensitive to outliers. A single outlier can drive the estimated parameters far from their true
values. Therefore, it is important to recognize outliers, and especially to distinguish between
outliers and legitimately large values.

• Leverage: is a measure of how far away the predictor values of an observation are from
those of the other observations
• Outliers are values that cause surprise in relation to the majority
• Influential observations have a relatively large effect on the regression model’s predic-
tions

4. Interpretation: When making forecasts or interpreting the results, it is important to


understand the limits of the model. For example, a linear relationship might be satisfactory
in a range of temperature, but loses its validity when extended to the cases outside the range.
Special attention is required when making forecasts far from the center x̄ of historical data.

3.3 Multiple linear regression


Multiple linear regression attempts to model the relationship between two or more explanatory
variables and a response variable by fitting a linear equation to observed data. If there are p
predictors (or explanatory variables) x1, x2, · · ·, xp, every combination of predictor values
xi1, xi2, · · ·, xip is associated with a value of the response variable yi. Similar to simple
regression, multiple linear regression defines the regression model as

Y = β0 + β1 x1 + β2 x2 + · · · + βp xp + ε.

This model describes how the mean response E(Y ) changes with the explanatory variables. The
observed values for Yi vary about their means and are assumed to have the same standard deviation
σ. The parameters of the model include β0 , β1 , · · · , βp , σ 2 .
For concise representation, we often write the model in matrix/vector form. Define β =
[β0, β1, · · ·, βp], and xi = [1, xi1, · · ·, xip], then we have

Yi = xi · β + εi, (3.13)

with E(εi) = 0, var(εi) = σ² again. Throughout the handout, we will use bold symbols to represent
vectors or matrices. Combining all observations together, we have
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \cdot \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix},$$
or,
Y = Xβ + ε. (3.14)

3.3.1 Why multiple regression?

Simple linear regression only allows a single predictor in modeling the response. This might be too
restrictive in many cases. The following examples illustrate the need for multiple linear regression.

I. Complex relation with a single predictor variable. Even when the response is related to a
single predictor variable, the relation might be more complex than a straight line. Consider a
commonly used model in practice, the polynomial regression

Yi = β0 + β1 xi + β2 xi² + εi.

The response changes with x in a quadratic way. Only in special cases (β1 = 0 or β2 = 0),
can we use simple linear regression to estimate the parameters.

II. Relation with qualitative predictor. The simple linear regression implies the predictor is a
quantitative (continuous) variable, so that the multiplication and addition have meaning.
However, we often encounter qualitative variables, such as gender, color, race, etc. They are
often called attribute variables or factors. Since they do not have a natural ordering, numerical
operations on the variable lose their validity.
Take Race with four values (Chinese, Malay, Indian, Caucasian) as an example. Instead of
doing regression directly on this variable, some dummy variables can be created.

Original Transformed
X Chinese(R1) Malay(R2) Indian(R3) Caucasian(R4)
Chinese 1 0 0 0
Caucasian 0 0 0 1
Indian 0 0 1 0
Indian 0 0 1 0
Chinese 1 0 0 0
Malay 0 1 0 0

As a result, instead of having Y = β0 + β1 X, which is not meaningful here, we can have
Y = β1 R1 + β2 R2 + β3 R3 + β4 R4 using the dummy variables. The coefficients have a clear
interpretation in this case. Again, even with a single predictor (Race in this example), we still
need multiple linear regression; a small pandas sketch of this encoding is given after this list.
Another example is salary in the IT industry (many years ago) versus the education level.

S X E M X1 X2 X3
0 13876 1 1 1 1. 0. 0.
1 11608 1 3 0 1. 0. 1.
2 18701 1 3 1 1. 0. 1.
3 11283 1 2 0 1. 1. 0.
4 11767 1 3 0 1. 0. 1.

Here "S" denotes salary, "E" denotes education level, "M" denotes management or
non-management, and "X" denotes experience.

III. Multiple variables with non-separable effects. More commonly, a response variable is often
influenced by multiple predictors. It is generally not sensible to quantify their influence one
by one using simple linear regression. In addition, sometimes explicit interactions between two
variables are required. The interactions will be discussed in more detail later. An example
here is the joint effect of TV, Radio, and Newspaper advertising on sales: Y = 2.94 + 0.046 · TV + 0.19 · Radio −
0.001 · Newspaper.
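The dummy-variable encoding in the Race example (item II) can be produced mechanically; the following pandas sketch, with hypothetical rows, mirrors the table above. In a formula interface such as statsmodels, writing C(Race) performs a similar encoding and drops one level as the baseline.

import pandas as pd

# Hypothetical rows mirroring the Race table above
df = pd.DataFrame({"Race": ["Chinese", "Caucasian", "Indian", "Indian", "Chinese", "Malay"]})

dummies = pd.get_dummies(df["Race"])   # one 0/1 indicator column per category (R1-R4)
print(dummies)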

3.3.2 Interactions between variables

There are two implicit assumptions when formulating the multiple linear regression: (1) Effects of
different predictors are additive. (2) If x1 changes ∆x1 , the mean response always changes β1 ∆x1 ,
regardless of the other predictors. In statistics, an interaction may arise when considering the relationship
among three or more variables, and describes a situation in which the simultaneous influence of
two variables on a third is not additive.
The presence of interactions can have important implications for the interpretation of statisti-
cal models. If two variables of interest interact, the relationship between each of the interacting
variables and the response variable depends on the value of the other interacting variable. In prac-
tice, this makes it more difficult to predict the consequences of changing the value of a variable,
particularly if the variables it interacts with are hard to measure or difficult to control.
Real-world examples of interaction include:

• Interaction between adding sugar to coffee and stirring the coffee. Neither of the two indi-
vidual variables has much effect on sweetness but a combination of the two does.

• Interaction between adding carbon to steel and quenching. Neither of the two individually
has much effect on strength but a combination of the two has a dramatic effect.

• Interaction between smoking and inhaling asbestos fibres: Both raise lung carcinoma risk,
but exposure to asbestos multiplies the cancer risk in smokers and non-smokers. Here, the
joint effect of inhaling asbestos and smoking is higher than the sum of both effects.

• Interaction between genetic risk factors for type 2 diabetes and diet (specifically, a “western”
dietary pattern). The western dietary pattern was shown to increase diabetes risk for subjects
with a high “genetic risk score”, but not for other subjects.

To recognize the possible interactions between two variables, we can explore their relation
graphically. There are three major types of interactions.

1. Interaction between two continuous variables


If there is no interaction, the mean response is a plane (linear in both variables). However,
when interaction exists, the mean response becomes a curved surface. Alternatively, the
contour plots are parallel lines without interaction, and are curves with interactions.

2. Interaction between a continuous variable and an attribute variable


If there exists interactions between attribute variable and continuous variables, it means the
effect of the continuous variable depends on the value of the attribute. In other words, the
coefficients (or slope on the graph) are different when the attribute taking different values, as
shown below (left figure). On the other hand, if the slopes do not change with the attribute
value, there is no significant interaction (the right two figures).

In the left picture, Experience (X) and education level (E) seem to interact because the
slope of experience appears to depend on the value of education level. However, from the right two figures
we can see that the real interaction exists between education level and management (M);
the slope of experience doesn't depend on the value of management or education level.

3. Interaction between two attribute variables


When the effect of one variable depends on the value of another variable, an interaction
might exist. Graphically, the pattern of the box plot changes as the other variable takes
different values. In both cases below, the pattern of the box plots of the same color changes
across the different colors.

In conclusion, a reasonable model here is Y = β0 + β1 · E + β2 · M + β3 · X + β12 · E ∗ M, fitted below.

salaryfit = smf.ols(formula="S~C(E)*C(M)+X", data=salary).fit()

===============================================================
coef std err t P>|t|
---------------------------------------------------------------
Intercept 9472.6854 80.344 117.902 0.000
C(E)[T.2] 1381.6706 77.319 17.870 0.000
C(E)[T.3] 1730.7483 105.334 16.431 0.000
C(M)[T.1] 3981.3769 101.175 39.351 0.000
C(E)[T.2]:C(M)[T.1] 4902.5231 131.359 37.322 0.000
C(E)[T.3]:C(M)[T.1] 3066.0351 149.330 20.532 0.000
X 496.9870 5.566 89.283 0.000
===============================================================

3.3.3 Least square estimation

When n observations (xi, yi) are collected, we can estimate the model parameters to identify
influential variables or to make forecasts. Following the same criterion of minimizing the MSE, we
can estimate the parameters by
$$\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta}\; \frac{1}{n}\sum_{i=1}^{n} (y_i - \mathbf{x}_i \cdot \boldsymbol\beta)^2.$$

The multiple linear regression model is

Y = Xβ + ε,

and the least square criterion reduces to minimizing the vector norm of the difference
$$\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta}\; \|\mathbf{Y} - \mathbf{X}\boldsymbol\beta\|^2, \qquad (3.15)$$
where $\|\mathbf{Y}\| = \sqrt{y_1^2 + y_2^2 + \cdots + y_n^2}$ is the 2-norm of the vector. Using matrix calculus (https://
en.wikipedia.org/wiki/Matrix_calculus), we can show that β̂ again has the analytical expression
$$\hat{\boldsymbol\beta} = (\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{X}^{\mathsf T}\mathbf{Y}, \qquad (3.16)$$
where Xᵀ and X⁻¹ represent the transpose and inverse of a matrix, respectively. β̂ is unique as
long as XT X is full rank (or invertible).
Example: In simple linear regression, we can express them in matrix form by
$$\mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \mathbf{X}^{\mathsf T}\mathbf{X} = \begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix}, \quad (\mathbf{X}^{\mathsf T}\mathbf{X})^{-1} = \frac{1}{n\sum_{i=1}^n x_i^2 - (n\bar x)^2}\begin{bmatrix} \sum_{i=1}^n x_i^2 & -n\bar x \\ -n\bar x & n \end{bmatrix}.$$
With some algebraic manipulation, we can get results consistent with those in Section 3.2.
The natural way to estimate the variance σ² is
$$\hat\sigma^2 = \frac{\|\mathbf{Y} - \mathbf{X}\hat{\boldsymbol\beta}\|^2}{n-p-1}.$$
Again, the term $\|\mathbf{Y} - \mathbf{X}\hat{\boldsymbol\beta}\|^2$, or equivalently $\sum_{i=1}^n (y_i - \mathbf{x}_i \cdot \hat{\boldsymbol\beta})^2$, is called the sum of squared errors (SSE).
In fact, β̂ is generally a good way to estimate the model parameters as stated in the following
theorem.

Theorem 3.1 (Gauss-Markov) In a linear regression model in which the εi have expectation zero,
equal variances, and are uncorrelated, the ordinary least square estimator β̂ is the best linear un-
biased estimator (BLUE). Furthermore, if the εi are normally distributed, β̂ is the best among all
unbiased estimators.

Similar to the case of simple linear regression, when the data are assumed normally distributed,
we can get the distribution of β̂. In more detail, β̂ follows a multivariate normal distribution

β̂ ∼ MVN(β, (XT X)−1 σ 2 ). (3.17)

Note that this result not only gives the marginal distribution of each component of β̂, it also
provides the covariance among different components. The results in the simple linear regression
are in fact its special case.
There are some remarks about the estimation of β̂: (1) The accuracy of β̂ depends on the
sample size and also the spread of the xi. (2) Sometimes XᵀX is singular; then β is not estimable and
the coefficients are not interpretable (collinearity). (3) A linear transformation of β̂ is still normal:
Aβ̂ ∼ MVN(Aβ, A(XᵀX)⁻¹Aᵀσ²).
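A minimal numpy sketch of the least square computation (3.16) is given below, using a simulated design matrix; solving the normal equations with np.linalg.solve is numerically preferable to explicitly forming the inverse.

import numpy as np

# Simulated data with an intercept and two predictors, for illustration only
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # least square estimate (3.16)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)      # estimated covariance of beta_hat, cf. (3.17)
print(beta_hat, sigma2_hat)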

3.3.4 Confidence & prediction intervals

With the estimated β̂ and its covariance (XT X)−1 σ 2 , and the σ̂ 2 , we can construct the confidence
interval for the parameters. For each component β̂j in β̂, its 1 − α confidence interval can be
expressed as
$$\hat\beta_j \pm t_{n-p-1,\alpha/2}\,\hat\sigma\sqrt{d_{jj}}, \qquad (3.18)$$
where $d_{jj}$ is the (j, j)th diagonal element of $(\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}$. In other words, the standard deviation of β̂j
is simply $\sigma\sqrt{d_{jj}}$.
More importantly, with the covariance matrix of β̂, we can construct the confidence interval (or
region) for multiple components of β̂ together, or some linear combinations of the components.
• What is the 95% joint confidence region for (β0 , β1 )?
• What is the confidence interval for the difference β1 − β2 ?
A general way can be found using the property of multivariate normal distribution. For any matrix
A with rank q, we know that

Aβ̂ ∼ MVN(Aβ, A(XT X)−1 AT σ 2 ).

Consequently, we can further derive

$$\frac{(\mathbf{A}\hat{\boldsymbol\beta} - \mathbf{A}\boldsymbol\beta)^{\mathsf T}[\mathbf{A}(\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{A}^{\mathsf T}]^{-1}(\mathbf{A}\hat{\boldsymbol\beta} - \mathbf{A}\boldsymbol\beta)}{q\hat\sigma^2} \sim F_{q,n-p-1},$$

where Fq,n−p−1 is the F distribution with degrees of freedom q and n − p − 1. The 1 − α level
confidence region for Aβ is thus defined as the set of q-dimensional points satisfying
$$\left\{\mathbf{b} \in \mathbb{R}^q : (\mathbf{A}\hat{\boldsymbol\beta} - \mathbf{b})^{\mathsf T}[\mathbf{A}(\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{A}^{\mathsf T}]^{-1}(\mathbf{A}\hat{\boldsymbol\beta} - \mathbf{b}) \le q\hat\sigma^2 \cdot F_{q,n-p-1,\alpha}\right\}, \qquad (3.19)$$

where Fq,n−p−1,α is the upper α quantile of the F distribution.


Example: The confidence region (3.19) is general enough to include many useful cases as special cases.
We use the following examples to demonstrate.

I. The confidence region for all the parameters β. In this case, A = I_{p+1} with rank p + 1, and
the confidence region becomes
$$\left\{\mathbf{b} \in \mathbb{R}^{p+1} : (\hat{\boldsymbol\beta} - \mathbf{b})^{\mathsf T}(\mathbf{X}^{\mathsf T}\mathbf{X})(\hat{\boldsymbol\beta} - \mathbf{b}) \le (p+1)\hat\sigma^2 \cdot F_{p+1,n-p-1,\alpha}\right\}.$$
When β̂ has two dimensions, this region is an ellipse in the plane.

II. The confidence interval for mean response at new predictor value x∗ . The point forecast
is straightforward, with mean response y ∗ = x∗ β̂ based on (3.13). To further obtain the
confidence interval for the values, we can use (3.19) with A = x∗ of rank q = 1. Hence, the
confidence interval of the mean response satisfies
$$\left\{y \in \mathbb{R} : (y^* - y)^{\mathsf T}[\mathbf{x}^*(\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{x}^{*\mathsf T}]^{-1}(y^* - y) \le \hat\sigma^2 \cdot F_{1,n-p-1,\alpha}\right\}.$$
Equivalently, using the relation between $F_{1,n-p-1}$ and $t_{n-p-1}$, we can get the more explicit form
$$y^* \pm t_{n-p-1,\alpha/2}\,\hat\sigma\sqrt{\mathbf{x}^*(\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{x}^{*\mathsf T}}.$$

III. Other useful comparisons. For example, if β1 and β2 represent the coefficients of two predictor
variables, then β1 − β2 is a measure of their relative effects on the response. To get the confidence
region for (β1 − β2, β2 − β3), we can define A with rank q = 2 as
$$\mathbf{A} = \begin{bmatrix} 0 & 1 & -1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & -1 & \cdots & 0 \end{bmatrix},$$
and then follow the formula in (3.19).
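For a single linear combination such as β1 − β2, the general region (3.19) with q = 1 reduces to a t-based interval, which statsmodels exposes through t_test on a string constraint. The sketch below uses simulated data with hypothetical predictor names x1, x2, x3.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with three predictors, for illustration only
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(60, 3)), columns=["x1", "x2", "x3"])
df["y"] = 1.0 + 2.0 * df["x1"] + 1.5 * df["x2"] + rng.normal(size=60)

fit = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
contrast = fit.t_test("x1 - x2 = 0")   # inference on the difference beta_1 - beta_2
print(contrast)
print(contrast.conf_int())             # confidence interval for the contrast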

4 Model Checking and Diagnosis


Major assumptions of linear regression include

1. The relationship between the outcomes and the predictors is (approximately) linear.
2. The error term ε has zero mean.
3. The error term ε has constant variance.
4. The errors are uncorrelated.
5. The errors are normally distributed or we have an adequate sample size to rely on large
sample theory.

We should always check the fitted models to make sure that these assumptions have not been
violated.

4.1 Residuals
The diagnostic methods we’ll be exploring are based primarily on the residuals. Recall, the residual
is defined as
ei = yi − ŷi , i = 1, ..., n

where ŷi = xi β̂. If the model is appropriate, it is reasonable to expect the residuals to exhibit
properties that agree with the stated assumptions.
According to the definition of the residuals, it is easy to show that the mean of the residuals is 0,
$$\bar e = \frac{1}{n}\sum_{i=1}^{n} e_i = 0,$$

and it yields the estimate of the population variance
$$\hat\sigma^2 = \frac{1}{n-p-1}\sum_{i=1}^{n} e_i^2.$$

Precisely speaking, the ei, i = 1, · · ·, n are not independent random variables. In general, if the
number of residuals (n) is large relative to the number of predictor variables (p), the dependency
can be ignored for all practical purposes in an analysis of residuals.
To analyze the residuals in different contexts, it is also common to "standardize" the residuals
by dividing by their standard deviations. In matrix form, we can write the residual vector as

e = Y − Xβ̂ = (I − X(XT X)−1 XT )Y.

The term H = X(XᵀX)⁻¹Xᵀ is often called the hat matrix, and plays a crucial role in linear regression
analysis and model diagnostics. Using this notation, the covariance matrix of e is simply
(I − H)σ². As a result, the studentized residual is defined as
$$r_i = \frac{e_i}{\hat\sigma\sqrt{1-h_{ii}}}, \qquad (4.1)$$
where hii is the ith diagonal element of the hat matrix H. When the assumptions of the linear
model hold, ri follows a t distribution with n − p − 1 degrees of freedom. Consequently, ri is free
from measurement scales in different contexts.
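In statsmodels, the hat diagonal h_ii and the studentized residuals in (4.1) are available through the influence object of a fitted model; a minimal sketch on simulated data follows.

import numpy as np
import statsmodels.api as sm

# Simulated data, for illustration only
rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(40, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.4, size=40)

fit = sm.OLS(y, X).fit()
influence = fit.get_influence()
h = influence.hat_matrix_diag                # diagonal elements h_ii of the hat matrix
r = influence.resid_studentized_internal     # studentized residuals, cf. (4.1)
print(h[:5], r[:5])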

4.2 Diagnostics using residuals


The model assumptions can be checked against the properties of the residuals. There are two kinds
of residual analysis:

1. Major graphical tools for model checking

• QQ plot to check normality
• Scatter plot to check linearity and variance
• Autocorrelation plot to check independence

2. Specialized hypothesis testing on residuals

4.2.1 Major graphical tools

Residual analysis is usually done graphically. We describe the major plots as follows; a short plotting sketch is given after this list.

I. Normal probability plot (or quantile-quantile plot).


Using quantile-quantile (QQ) plot, we can compare quantiles of a sample to the expected
quantiles if the sample came from some distribution for a visual assessment. To construct
a quantile-quantile plot for the residuals, we plot the quantiles of the residuals against the
theoretical quantiles of a normal distribution. If the residuals follow a normal distribution, the
QQ plot should resemble a straight line. A straight line connecting the 1st and 3rd quartiles
is often added to the plot to aid in visual assessment.

[Figure: example normal QQ plots, including cases labeled "not normal" and "model is insufficient"; departures in the tails are particularly important to examine.]

II. Scatter plots


Another useful aid for inspection is a scatter plot of the residuals against the fitted values
and/or the predictors. These plots can help us identify: non-constant variance, violation of
the assumption of linearity, and potential outliers.

[Figure: residual-vs-fitted scatter plots, one with constant variance and zero mean and one with non-constant variance (e.g., Poisson-type data where mean = λ and variance = λ), motivating a variance-stabilizing transformation.]
Non-constant variance can often be remedied using appropriate transformations. Ideally,
we would choose the transformation based on some prior scientific knowledge, but this might
not always be available. Some typical choices are listed below

Relation of σ² to E(Y|x)      Transformation      Comment
σ² ∝ constant                 y′ = y              no transformation
σ² ∝ E(Y)                     y′ = √y             Poisson data
σ² ∝ E(Y)²                    y′ = ln(y)          y > 0

In general, for Y > 0, a transformation can be suggested automatically by the Box-Cox
transformation, of the form
$$y' = \begin{cases} (y^\lambda - 1)/\lambda, & \lambda \neq 0 \\ \ln(y), & \lambda = 0. \end{cases} \qquad (4.2)$$

The best choice of λ can be determined based on the data.

III. Independence check/test


If the samples are independent, the residuals should not have visible patterns when plotted
against time or observation index. Autocorrelation plot or partial autocorrelation plot (will
be discussed later) can graphically illustrate the degree of violation. Some formal statistical
tests have also been developed to test the independence of the data.
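A minimal plotting sketch for the three graphical checks above (normality, linearity/constant variance, and independence) is shown below, assuming a fitted OLS model; the data are simulated for illustration.

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# Simulated data and fitted model, for illustration only
rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(60, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.4, size=60)
fit = sm.OLS(y, X).fit()

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
sm.qqplot(fit.resid, line="q", ax=axes[0])          # QQ plot: normality check
axes[1].scatter(fit.fittedvalues, fit.resid)        # residuals vs fitted: linearity/variance
axes[1].axhline(0.0)
sm.graphics.tsa.plot_acf(fit.resid, ax=axes[2])     # autocorrelation plot: independence check
plt.tight_layout()
plt.show()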

4.2.2 Tests for certain property

I. Shapiro-Wilk test for normality


$$W = \frac{\left(\sum_{i=1}^{n} a_i e_{(i)}\right)^2}{\sum_{i=1}^{n} (e_i - \bar e)^2},$$

where e(i) is the ith smallest value, ai are some constants.

II. Modified Levene Test for constant variance

III. Durbin-Watson statistic In statistics, the Durbin-Watson statistic is a test statistic used to
detect the presence of autocorrelation in the residuals from a regression analysis. If et is the
residual associated with the observation at time t, then the test statistic is
$$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2},$$

where T is the number of observations. Since d is approximately equal to 2(1 − r), where r
is the sample autocorrelation of the residuals, d = 2 indicates no autocorrelation. The value
of d always lies between 0 and 4. If the Durbin-Watson statistic is substantially less than 2,
there is evidence of positive serial correlation. As a rough rule of thumb, if Durbin-Watson is
less than 1.0, there may be cause for alarm.
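Both of these tests are available in standard Python libraries; a short sketch on a hypothetical residual vector is given below (scipy's shapiro for the Shapiro-Wilk statistic and statsmodels' durbin_watson for d).

import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Hypothetical residual vector, for illustration only
rng = np.random.default_rng(4)
resid = rng.normal(size=80)

w_stat, w_pvalue = stats.shapiro(resid)   # Shapiro-Wilk test for normality
dw = durbin_watson(resid)                 # values near 2 suggest no autocorrelation
print(w_stat, w_pvalue, dw)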

4.3 Outliers, leverage points, influential points, collinearity


4.3.1 Identifying outliers

An outlier is an extreme observation. Depending on their location in the predictor space, outliers
can have severe effects on the regression model. We can use jackknife residuals to identify potential
outliers. Any points that are greater than 3 or 4 standard deviations away from 0 may be considered
potential outliers.
There are several scenarios for outliers.
• "Bad" data that result from unusual but explainable events, e.g., malfunction of a measuring
instrument or incorrect recording of data. In this case we should try to retrieve the correct
value, but if that's not possible we may need to discard the data point.

• Inadequacies in the model. The model may fail to fit the data well for certain values of the
predictor. In this case it could be disastrous to simply discard outliers.

• Poor sampling of observations in the tail of the distribution. This may be especially true if
the outcome arises from a heavy-tailed distribution.
With a sample size of 60, we might expect 2 or 3 residuals to be further than 2 standard
deviations from 0 and none to be more than 3 standard deviations.

4.3.2 High leverage points

Leverage is a measure of how strongly the data for observation i determine the fitted value Ŷi. If hii is
close to 1, the fitted line will usually pass close to (xi , Yi ).
The hat matrix,
H = X(XT X)−1 XT

plays an important role in identifying influential observations. The diagonal elements hii =
xi (XT X)−1 xTi , where xi is the ith row of the X matrix, play an especially important role. hii
is a standardized measure of the distance between the covariate values for the ith observation and the means
of the X values for all n observations.
Also,
$$0 \le h_{ii} \le 1, \qquad \sum_{i=1}^{n} h_{ii} = p + 1,$$
where p is the number of predictors. Therefore the average size of a hat diagonal is h̄ = (p + 1)/n. Leverage values greater than 2h̄
are considered to be high leverage with regard to their xi values and we would consider them high
leverage points. The left two pictures below show the leverage values in a simple linear regression.
The third picture shows leverage values in a multiple linear regression.

[Figure: leverage values hii in simple and multiple linear regression; leverage is higher for observations toward the ends of the x range, far from x̄.]
4.3.3 Identifying influential observations

Points that are remote in the predictor space may not influence the estimate of the regression
coefficients but may influence other summary statistics, such as R2 and the standard errors of the
coefficients. These points are called leverage points. Points that have a noticeable effect on
the regression coefficients are called influential points. In other words, influence measures the
degree to which deletion of an observation changes the fitted model. A high leverage point has
the potential to be influential, but is not always influential.
Influence can be measured by Cook’s distance. Cook’s Distance measures the influence of the
ith observation on all n fitted values and is given by

$$D_i = \frac{(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(-i)})^{\mathsf T}(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(-i)})}{(p+1)\hat\sigma^2},$$

where Ŷ is the vector of fitted values when all n observations are included and Ŷ(−i) is the
vector of fitted values when the ith observation is deleted. Cook’s D can also be expressed as

$$D_i = \frac{e_i^2}{(p+1)\hat\sigma^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2}.$$

From this expression we see that Di depends on both the size of the residual ei , and the leverage,
hii .
The magnitude of Di is usually assessed by comparing it to Fp+1,n−p−1 . If the percentile value
is less than 10 or 20 %, then the ith observation has little apparent influence on the fitted values. If
the percentile value is greater than 50%, we conclude that the ith observation has a significant effect
on the fitted values.
As a general rule, Di values from 0.5 to 1 are high, and values greater than 1 are considered to
be a possible problem.
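Leverage values and Cook's distances can be pulled from the same influence object used earlier; the sketch below applies the rules of thumb hii > 2(p + 1)/n and Di > 1 to simulated data.

import numpy as np
import statsmodels.api as sm

# Simulated data and fitted model, for illustration only
rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(60, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.4, size=60)
fit = sm.OLS(y, X).fit()

influence = fit.get_influence()
h = influence.hat_matrix_diag             # leverage values h_ii
cooks_d, _ = influence.cooks_distance     # Cook's distances D_i

n, p_plus_1 = X.shape
print(np.where(h > 2 * p_plus_1 / n)[0])  # candidate high leverage points
print(np.where(cooks_d > 1.0)[0])         # candidate influential points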

4.3.4 Collinearity

In statistics, multicollinearity (also collinearity) is a phenomenon in which two or more predictor


variables in a multiple regression model are highly correlated, meaning that one can be linearly
predicted from the others with a substantial degree of accuracy. In this situation the coefficient
estimates of the multiple regression may change erratically in response to small changes in the
model or the data.
Indicators that multicollinearity may be present in a model:

• Large changes in the estimated regression coefficients when a predictor variable is added or
deleted
• Insignificant regression coefficients for the affected variables in the multiple regression, but a
rejection of the joint hypothesis that those coefficients are all zero (using an F-test)

• If a multiple regression finds an insignificant coefficient of a particular explanatory variable,
yet a simple linear regression of the explained variable on this explanatory variable shows its
coefficient to be significantly different from zero, this situation indicates multicollinearity in
the multiple regression.
• Some authors have suggested a formal variance inflation factor (VIF) for multicollinearity:
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$
where $R_j^2$ is the coefficient of determination of a regression of explanatory variable j on all
the other explanatory variables:
$$X_j = \alpha_0 + \alpha_1 X_1 + \cdots + \alpha_{j-1} X_{j-1} + \alpha_{j+1} X_{j+1} + \cdots + \alpha_p X_p + \varepsilon.$$
The better the fit, the more severe the collinearity. A VIF of 5 or 10 and above indicates a
multicollinearity problem (a small VIF computation sketch is given at the end of this section).
Added variable plots are also called partial regression plots or adjusted variable plots. They
allow us to study the marginal relationship between the response and a regressor, given the
other variables in the model. For the variable Xj, fit
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{j-1} X_{j-1} + \beta_{j+1} X_{j+1} + \cdots + \beta_p X_p + \epsilon$$
and
$$X_j = \alpha_0 + \alpha_1 X_1 + \cdots + \alpha_{j-1} X_{j-1} + \alpha_{j+1} X_{j+1} + \cdots + \alpha_p X_p + \varepsilon,$$
and plot Y − Ŷ versus Xj − X̂j.

Some comments on using the plots:

– They only suggest possible relationships between the predictor and the response.
– In general, they will not detect interactions between regressors.
– The presence of strong multicollinearity can cause partial regression plots to give incorrect information.
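The VIF sketch referenced above can be written with statsmodels' variance_inflation_factor; the predictors here are simulated, with x2 deliberately built to be nearly collinear with x1.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated predictors with built-in collinearity, for illustration only
rng = np.random.default_rng(6)
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)      # nearly collinear with x1
x3 = rng.normal(size=100)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)    # values above 5-10 flag a multicollinearity problem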

5 Model Evaluation and Selection


Regression models have two major objectives: i) quantifying the effect of each predictor, consider-
ing the influences of the other predictors; ii) predicting the (mean) response at unobserved predictor
values. It is important to evaluate the regression models, and select the “best model” among all
candidates to make the forecasting more accurate. Here are a few reasons why we want to select
the “best model”.
1. We want to explain the data in the simplest way. Redundant predictors should be removed.
The principle of Occam’s Razor states that among several plausible explanations for a phe-
nomenon, the simplest is best. Applied to regression analysis, this implies that the smallest
model that fits the data is best.
2. Unnecessary predictors will add noise to the estimation of other quantities that we are inter-
ested in. Degrees of freedom will be wasted.
3. Collinearity is caused by having too many variables trying to do the same job.
4. Cost: if the model is to be used for prediction, we can save time and/or money by not
measuring redundant predictors.

5.1 Evaluating the regression model


In the introduction, we discussed the criteria to evaluate the forecasting performance, including
MSE and MAD. In the regression context, there are several evaluation criteria developed based on
them. We will go through them as follows.

I The “notorious” R2 :
R2 , also called coefficient of determination, evaluates the percentage of total variation (uncer-
tainty) explained by the regression model. Mathematically, it is defined as:
  R² = 1 − SSE/TSS = 1 − Σ_{i=1}^n (y_i − ŷ_i)² / Σ_{i=1}^n (y_i − ȳ)².   (5.1)

Compared with MSE definition, we can see that MSE = SSE/n. In other words, the smaller
the MSE, the closer R2 to 1. It appears R2 is a good measure of the forecasting performance.
Unfortunately, there is an inherent problem: the forecasting errors are calculated on the same
dataset as that used for model estimation. As a result, it often under-estimates the forecasting
errors when used in future predictions. In fact, R² always increases as predictors (relevant or not) are added. As a result, it should not be used as a criterion to select the “best” model, because the largest model always has the largest R².

II Adjusted R2
Since R2 always increases as the model size increases, an adjusted R2 is proposed, often denoted
by Ra2 . It is defined by

  R_a² = 1 − [SSE/(n − p − 1)] / [TSS/(n − 1)] = 1 − (n − 1)/(n − p − 1) · (1 − R²) = 1 − σ̂²_model / σ̂²_null.   (5.2)

Because of the adjustment, increasing the model size will increase R², but not necessarily R_a². Adding a predictor will only increase R_a² if it has some value in prediction. From another angle, minimizing the standard error for prediction means maximizing R_a². Compared with R², it “penalizes” bigger models.

III Cross-validated forecast errors:


While the MSE and MAD are intuitive criteria to evaluate the forecasting performance, the
difficulty lies in how to obtain them accurately. Using the same dataset to estimate the model
and to calculate the MSE (e.g., R²) cannot provide a reliable assessment. A natural solution is to use two independent datasets: a training set for model estimation and a testing set for model evaluation.
Cross validation is one such strategy to evaluate the forecasting errors more reliably. For k-fold cross validation, it often consists of the following steps:

(a) Randomly divide the data into k non-overlapping subsets, of (roughly) equal size.
(b) Select one subset as the testing data, and the remaining k − 1 subsets combined as the training data. Estimate the model using the training data, and compute the prediction error (e.g., MSE) on the testing data, denoted by MSE_j.
(c) Repeat this procedure k times, with each of the k subsets serving once as the testing data.
(d) Average the k prediction error estimates to get the cross-validated error MSE_CV = Σ_{j=1}^k MSE_j / k.

Compared with other criteria, the cross-validated forecast error is more intuitive and often more effective. However, it requires much more computational effort (k times), except in certain special cases. Common choices of k include k = 5 and k = 10. When k = n, it is more commonly known as leave-one-out cross validation.
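A minimal sketch of k-fold cross-validated MSE for a linear regression is given below. It uses only numpy and assumes X is an n × p array of predictors and y a length-n response array; the function name cv_mse is ours.

import numpy as np

def cv_mse(X, y, k=5, seed=0):
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    mses = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        Xtr = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        Xte = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train_idx], rcond=None)  # fit on training folds
        resid = y[test_idx] - Xte @ beta
        mses.append(np.mean(resid ** 2))   # MSE_j on the held-out fold
    return np.mean(mses)                   # MSE_CV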

IV Akaike’s Information Criterion (AIC), Schwarz’s BIC

The information criteria, including Akaike’s Information Criterion (AIC) and Schwarz’s Bayesian
Information Criterion (BIC), are commonly used for model comparison or selection. For linear
regression models, they can be reduced to

  AIC = n ln(SSE/n) + 2(p + 1),   (5.3)
  BIC = n ln(SSE/n) + (p + 1) ln(n).   (5.4)

We want to minimize AIC or BIC to select the “best” model. Larger models will fit better and
so have smaller SSE. But they also use more parameters. Thus the “best” model will balance
the goodness-of-fit with model size. BIC penalizes larger models more heavily and so tends to
prefer smaller models compared with AIC. AIC and BIC can be used as selection criteria for other types of models (not limited to regression models).
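A minimal sketch of computing AIC and BIC from the SSE of a fitted linear model, following (5.3)–(5.4), is given below; the function name aic_bic is ours, and additive constants that differ across software packages are ignored.

import numpy as np

def aic_bic(sse, n, p):
    aic = n * np.log(sse / n) + 2 * (p + 1)
    bic = n * np.log(sse / n) + (p + 1) * np.log(n)
    return aic, bic

Note that statsmodels reports AIC/BIC computed from the log-likelihood, which differs from (5.3)–(5.4) by a constant and therefore leads to the same model ranking.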

V Mallows' Cp (only valid for regression models)


The criterion is developed based on the intuition that a good model should predict well, so the scaled average MSE of the predictions

  (1/σ²) Σ_{i=1}^n E(ŷ_i − E y_i)²

should be small.

This quantity can be estimated by the Cp statistic

  Cp = SSE_p / σ̂² + 2p − n,   (5.5)

where σ̂² is obtained from the full model with all P predictors and SSE_p is the sum of squared errors from a model with p parameters (including the intercept). In a sense, Cp balances the model errors (in terms of SSE) and the number of parameters used (in terms of p). Cp has the following properties in model selection:

• Cp is easy to compute
• It is closely related to Ra2 and the AIC.
• For the full model Cp = P + 1 exactly.
• If a model with p parameters fits the data, then E(Cp) ≈ p. A model with a bad fit will have Cp much larger than p.

It is usual to plot Cp against p. We desire models with small p and Cp around or less than p.

5.2 Selecting the regression model
When we have many predictors (with many possible interactions), it can be difficult to find a good model. It can be challenging to decide which main effects and which interactions to include. Model selection tries to “simplify” this task. However, this is still an “unsolved” problem in statistics. There are no magic procedures to get you the “best model.”

I All subset selection


When the number of predictors is not large, it is possible to enumerate all possible models with different numbers and different sets of predictors. When there are m candidate predictors, the total number of distinct models is 2^m, without considering interactions and transformations. Given any criterion (adjusted R², Mallows' Cp, BIC, AIC, or cross-validation error), we can find the model with the optimal value. Among models with similar performance, the smaller model is preferred.

II Greedy search
When the number of predictors becomes large, it is not feasible to conduct all subset selection, especially when interactions and transformations of variables should be considered. In this case, some greedy search (or other heuristic method) should be used to find the “best” model according to a chosen evaluation criterion (a backward elimination sketch is given after the list below).

(a) Backward Elimination


This is the simplest of all variable selection procedures and can be easily implemented
without special software. In situations where there is a complex hierarchy, backward
elimination can be run while taking account of what variables are eligible for removal.
• Start with all the predictors in the model;
• Remove the predictor whose removal leads to the largest improvement in performance;
• Refit the model and return to the previous step;
• Stop when no further improvement can be made by removing predictors.
(b) Forward Selection
Forward selection reverses the backward method.
• Start with no variables in the model;
• For all predictors not in the model, check the model performance if they are added
to the model;
• Choose the one leading to largest improvement, and include it in the model;
• Continue until no new predictors can be added.
(c) Stepwise Regression
This is a combination of backward elimination and forward selection. This addresses the
situation where variables are added or removed early in the process and we want to change
our mind about them later. At each stage a variable may be added or removed and there
are several variations on exactly how this is done.
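The following is a minimal Python sketch of backward elimination using the AIC of (5.3) as the criterion. It assumes X is a pandas DataFrame of candidate predictors and y the response; the helper names aic and backward_eliminate are ours.

import numpy as np
import statsmodels.api as sm

def aic(model, n, p):
    return n * np.log(model.ssr / n) + 2 * (p + 1)

def backward_eliminate(X, y):
    active = list(X.columns)
    n = len(y)
    current = aic(sm.OLS(y, sm.add_constant(X[active])).fit(), n, len(active))
    while active:
        # evaluate the criterion for every model obtained by dropping one predictor
        scores = {}
        for col in active:
            trial = [c for c in active if c != col]
            exog = sm.add_constant(X[trial]) if trial else np.ones((n, 1))
            scores[col] = aic(sm.OLS(y, exog).fit(), n, len(trial))
        drop = min(scores, key=scores.get)
        if scores[drop] >= current:     # no single removal improves the criterion
            break
        current = scores[drop]
        active.remove(drop)
    return active

Forward selection and stepwise regression follow the same pattern, with candidate predictors added (or added and removed) instead of only removed.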

Greedy procedures are relatively cheap computationally but they do have some drawbacks.

• Because of the “one-at-a-time” nature of adding/dropping variables, it’s possible to miss


the “best” model.
• The procedures are not directly linked to final objectives of prediction or explanation
and so may not really help solve the problem of interest. With any variable selection
method, it is important to keep in mind that model selection cannot be divorced from the
underlying purpose of the investigation. Variable selection tends to amplify the statistical
significance of the variables that stay in the model. Variables that are dropped can still
be correlated with the response. It would be wrong to say these variables are unrelated
to the response; it's just that they provide no additional explanatory effect beyond those
variables already included in the model.
• Stepwise variable selection tends to pick models that are smaller than desirable for pre-
diction purposes. To give a simple example, consider the simple regression with just one
predictor variable. Suppose that the slope for this predictor is not quite statistically sig-
nificant. We might not have enough evidence to say that it is related to y but it still
might be better to use it for predictive purposes.

6 Hypothesis Testing in Regression Models


A statistical hypothesis is a hypothesis that is testable on the basis of observing a process that
is modeled via a set of random variables. A statistical hypothesis test is a method of statistical

inference.
Hypothesis testing can be used to formally test whether a predictor (or its transformation and
interaction with other predictors) is statistically significant in predicting the mean response. In the
general framework, it includes a few commonly used special cases. Some of the results have been
summarized before in different places.
Hypothesis testing allows us to carry out inferences about population parameters using data
from a sample. In order to test a hypothesis in statistics, we must perform the following steps: 1)
Formulate a null hypothesis and an alternative hypothesis on population parameters; 2) Build a
statistic to test the hypothesis made; 3) Define a decision rule to reject or not to reject the null
hypothesis.
It is very important to remark that hypothesis testing is always about population parameters.
Hypothesis testing implies making a decision, on the basis of sample data, on whether to reject
that certain restrictions are satisfied by the basic assumed model. The restrictions we are going
to test are known as the null hypothesis, denoted by H0. Thus, the null hypothesis is a statement about population parameters.
The details of the testing process are shown below.

1. State the relevant null and alternative hypotheses.

2. Consider the statistical assumptions being made about the sample in doing the test;

3. Decide and state the relevant test statistic T .

4. Derive the distribution of the test statistic under the null hypothesis

5. Select a significance level α.

6. Compute from the observations the observed value of the test statistic T .

7. Decide to either reject the null hypothesis in favor of the alternative or not reject it.

I Testing a single βj = 0
Using the results in multiple linear regression in Chapter 3, we know that the least square esti-
mation β̂j follows normal distribution with mean βj and corresponding variance. In addition,
we have
  (β̂_j − β_j) / SE(β̂_j) ∼ t_{n−p−1},

when there are p predictors in the model. Therefore, the natural way to test H0: β_j = 0 versus H1: β_j ≠ 0 is to use the statistic T = β̂_j / SE(β̂_j), with the decision rule

  |T| > t_{n−p−1,α/2}: reject the null hypothesis,
  |T| ≤ t_{n−p−1,α/2}: do not reject the null hypothesis,

where α is the significance level. For most regression outputs, the test values besides each
predictor indicate such test results

==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 1.5925 1.389 1.146 0.252 -1.130 4.315
x1 -0.0016 0.001 -1.072 0.284 -0.004 0.001
x2 0.0006 0.008 0.073 0.942 -0.015 0.016
x3 9.349e-05 4.02e-05 2.323 0.020 1.46e-05 0.000
x4 -0.0003 0.008 -0.035 0.972 -0.016 0.016
x5 -0.0001 4.51e-05 -2.694 0.007 -0.000 -3.31e-05
x6 -0.0001 9.08e-05 -1.184 0.236 -0.000 7.04e-05
x7 6.249e-08 5.81e-08 1.076 0.282 -5.14e-08 1.76e-07
x8 1.96e-06 2.53e-06 0.774 0.439 -3e-06 6.92e-06
x9 -0.0006 0.000 -2.958 0.003 -0.001 -0.000
==============================================================================

It is also noted that the test significance should be interpreted one predictor at a time, because the test is equivalent to the following comparison:

H0: Y = β_0 + β_1 x_1 + β_2 x_2 + · · · + β_{j−1} x_{j−1} + β_{j+1} x_{j+1} + · · · + β_p x_p + ε
H1: Y = β_0 + β_1 x_1 + β_2 x_2 + · · · + β_{j−1} x_{j−1} + β_j x_j + β_{j+1} x_{j+1} + · · · + β_p x_p + ε
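A minimal sketch of this t-test computed from a fitted statsmodels OLS result is given below. It assumes the model was fitted with named predictors (e.g., via a formula), and "x3" is only an example name; the function t_test_coef is ours.

from scipy import stats

def t_test_coef(results, name, alpha=0.05):
    T = results.params[name] / results.bse[name]          # beta_hat_j / SE(beta_hat_j)
    crit = stats.t.ppf(1 - alpha / 2, results.df_resid)   # t_{n-p-1, alpha/2}
    return T, crit, abs(T) > crit                         # reject H0 if |T| > crit

# T, crit, reject = t_test_coef(results, "x3")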

II Testing all predictors simultaneously


Testing model significance, or overall significance, is a particular case. Model significance means global significance of the model. The test is formulated as

H0: Y = β_0 + ε
H1: Y = β_0 + β_1 x_1 + β_2 x_2 + · · · + β_p x_p + ε.

In other words, under H0 none of the predictors need to be included in the model. We can use the sum of squared errors (SSE), or equivalently R², to express the test statistic

  T = (R²/p) / [(1 − R²)/(n − p − 1)] ∼ F_{p,n−p−1}.

As a result, if T > F_{p,n−p−1,α}, we reject the null hypothesis. Usually, the test result is also provided in the software output.

OLS Regression Results


==============================================================================
Dep. Variable: y R-squared: 0.001
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 2.676
Date: Mon, 12 Sep 2016 Prob (F-statistic): 0.00417
Time: 22:12:18 Log-Likelihood: -28217.
No. Observations: 41407 AIC: 5.645e+04
Df Residuals: 41397 BIC: 5.654e+04
Df Model: 9
Covariance Type: nonrobust
==============================================================================
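A minimal sketch of computing this overall F-test directly from R² is given below; n is the number of observations and p the number of predictors, and the function name overall_f_test is ours. Under reasonable assumptions it should reproduce the F-statistic and Prob (F-statistic) shown in the summary above.

from scipy import stats

def overall_f_test(r2, n, p, alpha=0.05):
    T = (r2 / p) / ((1 - r2) / (n - p - 1))
    crit = stats.f.ppf(1 - alpha, p, n - p - 1)   # F_{p, n-p-1, alpha}
    p_value = stats.f.sf(T, p, n - p - 1)
    return T, crit, p_value

# overall_f_test(results.rsquared, n, p)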

III Testing a sub-model nested in the full model


More often than not, we are interested in testing hypotheses lying between the above two cases. They are identical in mathematical formulation, but may imply different computational cost. The following are some examples of such tests on the regression parameters:

  H0: β_2 = β_5 = β_p = 0
  H0: β_1 + β_2 = 1
  H0: β_1 = 0, β_2 = 1, β_{p−1} + β_p = 0

Note that the second case involves only one constraint on the regression parameters, while the first and the third each impose three constraints. For this group of tests, we can still use an F test with the corresponding degrees of freedom.
In more detail, we can express the null hypothesis in a general form as Aβ = c, with the row rank of A being m. Then we can solve the constrained least square estimation

  β̃ = arg min_β ||Y − Xβ||²,  s.t. Aβ = c.   (6.1)

If the null hypothesis is correct, we would expect the constrained and unconstrained models to have similar performance in terms of the estimation of β or the approximation error Y − Xβ. As a result, the F test can be constructed as

  T = [(||Y − Xβ̃||² − ||Y − Xβ̂||²)/m] / [||Y − Xβ̂||²/(n − p − 1)] ∼ F_{m,n−p−1}.

If T > F_{m,n−p−1,α}, we reject the null hypothesis, meaning that the constrained model is not sufficient in explaining the variation of the response. This test can be done by fitting the two models and using ANOVA to obtain the test results, as in the output and sketch below.

df_resid ssr df_diff ss_diff F Pr(>F)


0 765.0 13038.806074 0.0 NaN NaN NaN
1 762.0 6617.423890 3.0 6421.382184 246.47523 9.268399e-112
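A minimal sketch of producing such a table is given below. It fits a restricted and a full model with hypothetical formulas and column names, and compares them with ANOVA.

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

restricted = smf.ols("y ~ x1", data=df).fit()            # model under H0
full = smf.ols("y ~ x1 + x2 + x3 + x4", data=df).fit()   # unconstrained model
print(anova_lm(restricted, full))   # F and Pr(>F) for the nested comparison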

Example of meaningful linear constraints on parameters

To examine whether there are constant returns to scale in the chemical sector, we are going to use
the Cobb-Douglas production function, given by

ln(output) = β1 + β2 ln(labor) + β3 ln(capital) + ε

In the above model, the parameters β2 and β3 are the elasticities of output with respect to labor and capital.
Before making inferences, remember that returns to scale refers to a technical property of the
production function examining changes in output subsequent to a change of the same proportion in
all inputs, which are labor and capital in this case. If output increases by that same proportional
change then there are constant returns to scale. Constant returns to scale imply that if the factors
labor and capital increase at a certain rate (say 10%), output will increase at the same rate (e.g.,
10%). If output increases by more than that proportion, there are increasing returns to scale. If
output increases by less than that proportional change, there are decreasing returns to scale. In
the above model, the following occurs
• If β2 + β3 = 1, there are constant returns to scale
• If β2 + β3 > 1, there are increasing returns to scale
• If β2 + β3 < 1, there are decreasing returns to scale.
To answer the question posed in this example, we must test

H0: β2 + β3 = 1  versus  H1: β2 + β3 ≠ 1.
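A minimal sketch of this test in Python is given below. The arrays output, labor, and capital are hypothetical data vectors, and the column names are ours; the single linear constraint is tested with an F test on the fitted model.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"lny": np.log(output),
                   "lnlab": np.log(labor),
                   "lncap": np.log(capital)})
model = smf.ols("lny ~ lnlab + lncap", data=df).fit()
print(model.f_test("lnlab + lncap = 1"))   # H0: constant returns to scale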

Reference

In economics, elasticity is the measurement of how responsive an economic variable is to a change


in another. Elasticity can be quantified as the ratio of the percentage change in one variable to the percentage change in another variable, when the latter variable has a causal influence on the former. The elasticity of the response with respect to predictor x_i can be calculated easily from the marginal effect

  e_i ≡ d(ln Y) / d(ln x_i).

7 Methods Beyond Linear Regression


Beyond classical regression models, there are also some extensions developed. An incomplete list
is provided below:

• Generalized least squares is used when the error variance is not constant, also referred to as regression with heterogeneous variance.
• Robust regression is designed for non-normal errors ε with unknown distributions. It is robust to outliers or extreme values.
• Nonparametric regression is used to model unknown relation between covariates and re-
sponses, which goes beyond linear assumptions.
• Quantile regression investigates the relation between covariates and quantiles of response (not
the mean of the response). It has wide application in economics and social science studies.

In addition to regression, there are many methods developed to classify an entity into certain
category based on its covariate values.

Definition 7.1 In machine learning and statistics, classification is the problem of identifying to
which of a set of categories (sub-populations) a new observation belongs, on the basis of a training
set of data containing observations (or instances) whose category membership is known.

It has widespread applications in different domains, and has become especially prominent in the current AI boom. Typical applications include medical diagnosis and precision medicine; spam email filtering; face recognition; virtual and augmented reality; handwriting and voice recognition; recommendation and job screening. In this module, we are not going to cover these methods in detail. Instead, we just provide some keywords and examples, and point you to further references should you need them in the future.

• Logistic regression
• Fisher’s linear discriminant analysis (LDA)
• Naive Bayes classifier
• Support vector machines
• k-nearest neighbor

• Boosting
• Decision trees, random forests
• (Deep) neural networks

8 Regression on Time
8.1 Time Series Regression
The dependent variable y is a function of time t. It can be modeled as a trend model

  y_t = TR_t + ε_t,

where y_t is the value of the time series in period t, TR_t is the trend in period t, and ε_t is the error term in period t. Compared with cross-sectional data, there are no covariates other than time in the model. Depending on the complexity of the trend, we can use

• No trend: TR_t = β_0
• Linear trend: TR_t = β_0 + β_1 t
• Quadratic trend: TR_t = β_0 + β_1 t + β_2 t²

Aside from the conceptual differences, the model estimation, and prediction methods are the same
as in multiple linear regression.

8.2 Detecting Autocorrelation


Compared with cross-sectional data, it is more likely to encounter autocorrelation in time series regression. By the model assumption, ε_t needs to be independently and identically distributed. However, we may need to test the validity of this assumption.
If the relationship between ε_t and ε_{t−1} can be modeled as

  ε_t = φ ε_{t−1} + a_t,

it is called first-order autocorrelation when the a_t are i.i.d. In particular, if φ > 0, ε_t has positive correlation; if φ < 0, ε_t has negative correlation; and if φ = 0, there is no correlation. We can use the residual plot and other diagnostic plots to check.

[Figure: error term patterns under (a) positive autocorrelation and (b) negative autocorrelation.]

The residual plot is intuitive, but subjective. The Durbin-Watson test is a rigorous statistical test that can help identify autocorrelation in the residuals. The test statistic is defined as

  d = Σ_{t=2}^n (e_t − e_{t−1})² / Σ_{t=1}^n e_t².

If the e_t are positively correlated, d is small; if the e_t are negatively correlated, d is large; if there is no correlation, d is in the middle. In particular, cutoffs for the different hypothesis tests are provided in Durbin-Watson tables.

(I) H0: The error terms are not autocorrelated
    H1: The error terms are positively autocorrelated

    • If d < d_{L,α}, reject H0
    • If d > d_{U,α}, do not reject H0
    • If d_{L,α} ≤ d ≤ d_{U,α}, inconclusive

(II) H0 : The error terms are not autocorrelated


H1 : The error terms are negatively autocorrelated

• If 4 − d < dL,α , reject H0


• If 4 − d > dU,α , do not reject H0
• If dL,α ≤ (4 − d) ≤ dU,α , inconclusive

(III) H0 : The error terms are not autocorrelated


H1 : The error terms are autocorrelated

• If d < dL,α/2 or 4 − d < dL,α/2 , reject H0
• If d > dU,α/2 and 4 − d > dU,α/2 , do not reject H0
• Otherwise, inconclusive
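A minimal sketch of computing the Durbin-Watson statistic on regression residuals is given below; the manual function dw is ours, and statsmodels provides the same statistic directly. The resulting d (or 4 − d) is then compared against the d_L, d_U bounds for the chosen α.

import numpy as np
from statsmodels.stats.stattools import durbin_watson

def dw(residuals):
    e = np.asarray(residuals)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# dw(results.resid) and durbin_watson(results.resid) give the same value of d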

8.3 Seasonal Variation


Time series data is data collected at regular intervals. When there are patterns that repeat over
known, fixed periods of time within the data set, such patterns are known as seasonality, seasonal
variation, periodic variation, or periodic fluctuations. This variation can be either regular or semi-
regular.
Seasonality may be caused by various factors, such as weather, vacation, and holidays and
usually consists of periodic, repetitive, and generally regular and predictable patterns in the levels
of a time series. Seasonality can repeat on a weekly, monthly, or quarterly basis; these periods of time are structured and occur within a length of time less than a year.
Generally, there are constant (additive) seasonal variation, where the magnitude of sea-
sonal swing does not depend on the level of the time series. In contrast, for multiplicative
seasonal variation, the magnitude of seasonal swing is proportional to the average level deter-
mined by the trend. When a time series displays multiplicative seasonal variation, we may apply a
transformation to the data to produce a transformed series that displays constant seasonal variation.
For a time series that exhibits constant seasonal variation, we can use a model of the following form

  y_t = TR_t + SN_t + ε_t,

where the seasonal factor SN_t can be expressed using dummy variables:

  SN_t = β_{s1} x_{s1,t} + β_{s2} x_{s2,t} + · · · + β_{s(L−1)} x_{s(L−1),t},

and L is the period of the season. The dummy variables are defined as

  x_{s1,t} = 1 if time period t is season 1, and 0 otherwise;
  x_{s2,t} = 1 if time period t is season 2, and 0 otherwise;
  ...
  x_{s(L−1),t} = 1 if time period t is season L − 1, and 0 otherwise.
Another way to model the seasonal pattern is to use trigonometric functions

  SN_t = β_2 sin(2πt/L) + β_3 cos(2πt/L),

or, with more frequencies,

  SN_t = β_2 sin(2πt/L) + β_3 cos(2πt/L) + β_4 sin(4πt/L) + β_5 cos(4πt/L).
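Either form can be fitted by ordinary least squares. A minimal sketch of the trend-plus-seasonal-dummies regression for monthly data (L = 12) is given below; y is a hypothetical 1-D array of observations and the column names are ours.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

n, L = len(y), 12
df = pd.DataFrame({"y": y,
                   "t": np.arange(1, n + 1),
                   "season": (np.arange(n) % L) + 1})
# C(season) creates L - 1 dummy variables; one season is absorbed into the intercept
fit = smf.ols("y ~ t + C(season)", data=df).fit()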

8.4 Growth Curve Models


The regression models we use to describe the trend and seasonal effects are functions of time that are linear in the parameters. There are also useful models that are not linear in the parameters.
The growth curve model is an example of this case. It is used for long-term or technological forecasting. Several types of growth curve models are available to model different kinds of time series data.
Consider a growth curve model

  y_t = β_0 · β_1^t · ε_t.

This model is not linear in the parameters. However, with a proper transformation,

  x_t = ln y_t = ln(β_0) + ln(β_1) · t + ln(ε_t),

we obtain a linear regression form.
Based on the model,

  y_t = β_0 · β_1^t · ε_t = β_1 [β_0 · β_1^{t−1}] · ε_t ≈ β_1 y_{t−1} ε_t,

so that

• β_1 indicates the growth rate of the response
• Equivalent form: y_t = β_0 · exp(β_1 t) · ε_t (with β_1 redefined as ln β_1)
• It is a typical solution to some differential equations governing the underlying dynamics

Some other useful growth models include

• Modified exponential curve: y_t = s + α e^{βt}
• Gompertz curve: y_t = s exp(α e^{βt})
• Logistic curve: y_t = s / (1 + α e^{ct})

9 Exponential Smoothing
In time series regression, the functions have constant parameters, i.e.,

• TR_t, SN_t, etc. are fixed functions of t
• The variance of ε_t does not change over t

This assumption might be valid over a short time span, but is questionable in the long run. We might need to update the model (parameters) to account for unknown changes. Exponential smoothing is used in such scenarios. It weights the observed time series values unequally (it is also called the exponentially weighted moving average (EWMA)). It is most effective when the trend (and seasonal factors) of the time series change over time.

9.1 Simple Exponential Smoothing


If the observations follow a constant trend model

  Y_t = β_0 + ε_t,

a natural way to estimate β_0 is to take the average

  β̂_0 = Σ_{i=1}^n Y_i / n.

If β_0 is not constant, but slowly changing, then recent observations are more relevant. A simple solution is to take the moving average

  β̂_0^{(n)} = Σ_{i=n−w+1}^n Y_i / w.

[Figure: a simulated time series y with a slowly changing level.]
A more popular approach is to use exponential smoothing

  L_n = α Y_n + (1 − α) L_{n−1} = Σ_{i=1}^n α (1 − α)^{n−i} Y_i + (1 − α)^n L_0.

[Figure: the simulated series with exponential smoothing applied.]
The smoothing constant α is very important. A small α gives smaller weight to the current Y_n, leading to a smoother curve and slower response to changes. In contrast, a large α gives higher weight to the current Y_n, leading to a rougher curve and faster response to changes. To select a good α, we can find the value that minimizes the forecast error. Recall that the forecast error at time n can be computed as e_n = Y_n − L_{n−1}. Combining the errors, we have the sum of squared errors (SSE):

  SSE = Σ_{i=1}^n (Y_i − L_{i−1})².
Note that SSE depends on α, and as a result, we find the “best” α that can minimize SSE.
Based on the exponential smoothing model, we can forecast a future Y_{n+τ}, for τ ≥ 1, based on the information up to time n. Since no trend is assumed, the point forecast equals L_n. Naturally, the larger τ, the less accurate the prediction. We can construct the prediction interval

  L_n ± z_{0.025} · s · √(1 + (τ − 1)α²),   s = √(SSE/(n − 1)).

[Figure: a time series (data$y vs. data$t) illustrating forecasts from simple exponential smoothing.]

As a summary, we list a few common forms of the exponential smoothing.

• Standard form: L_n = α Y_n + (1 − α) L_{n−1}
• Weighted moving average: L_n = α Y_n + α(1 − α) Y_{n−1} + · · · + α(1 − α)^{n−1} Y_1 + (1 − α)^n L_0
• Correction form: L_n = L_{n−1} + α(Y_n − L_{n−1})
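A minimal sketch of simple exponential smoothing, with α chosen by minimizing the one-step-ahead SSE over a grid, is given below; y is a 1-D numpy array, L_0 is initialized with the first observation, and the function names are ours.

import numpy as np

def ses(y, alpha, level0=None):
    level = y[0] if level0 is None else level0   # initialize L_0
    errors = []
    for obs in y:
        errors.append(obs - level)               # e_n = Y_n - L_{n-1}
        level = level + alpha * (obs - level)    # correction form
    return level, np.sum(np.square(errors))      # final level and SSE

def best_alpha(y, grid=np.linspace(0.01, 0.99, 99)):
    return min(grid, key=lambda a: ses(y, a)[1])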

9.2 Holt’s Trend Corrected Smoothing


The application of simple exponential smoothing is limited, as it allows no trend in the model.
Even when linear trend exists in the time series, the simple exponential smoothing might not work.
To apply the smoothing in the cases with linear trend, Holt’s trend corrected smoothing shall be
used.
In more detail, consider a time series regression model

  Y_t = β_0 + β_1 t + ε_t.

If both β0 , β1 are slowly changing, we need to consider the smoothing for both β0 and β1 , the
intercept and the slope. We can use two smoothings for each parameter, respectively

• Level smoothing
Ln = αYn + (1 − α)(Ln−1 + Bn−1 ),

where Bn−1 is the estimate of β1 at step n − 1

• Growth rate smoothing
Bn = γ(Ln − Ln−1 ) + (1 − γ)Bn−1

The rationale for the second smoothing is due to the observation that from n to n+1, the increment
in trend function is in fact

Ln+1 − Ln = β0 + β1 (n + 1) − [β0 + β1 n] = β1 ,

which provides new information on the rate of the trend.


To make τ -step ahead forecast at time n, we compute

Ŷn+τ = Ln + τ · Bn

Given the values of L_n and B_n, the prediction is a linear function of τ. Its prediction interval can be calculated as

  L_n + τ B_n ± z_{0.025} · s · √(1 + Σ_{j=1}^{τ−1} α²(1 + jγ)²),   s = √(SSE/(n − 2)),

based on the normality assumptions.

[Figure: a time series with linear trend (data$y vs. data$t), illustrating Holt's trend corrected smoothing.]
Similarly, for different purposes, other forms have been used for trend corrected smoothing.

• Standard form
Ln = αYn + (1 − α)(Ln−1 + Bn−1 ),

Bn = γ(Ln − Ln−1 ) + (1 − γ)Bn−1

• Correction form:
  L_n = L_{n−1} + B_{n−1} + α(Y_n − L_{n−1} − B_{n−1})
  B_n = B_{n−1} + αγ(Y_n − L_{n−1} − B_{n−1})

9.3 Holt-Winters Method


If both seasonal trend and linear trend are present, we also need to consider the impact from the
seasonality. Consider the model with additive seasonal variation,

Y_t = β_0 + β_1 t + SN_t + ε_t

All parameters β0 , β1 , SNt need to be updated.

Ln =α(Yn − SNn−L ) + (1 − α)(Ln−1 + Bn−1 )


Bn =γ(Ln − Ln−1 ) + (1 − γ)Bn−1
SNn =δ(Yn − Ln ) + (1 − δ)SNn−L

It can be noted that all three smoothings are based on the same error term E_n. The following forms are simpler to implement in practice.

En =Yn − (Ln−1 + Bn−1 + SNn−L )


Ln =Ln−1 + Bn−1 + αEn
Bn =Bn−1 + αγEn
SNn =SNn−L + (1 − α)δEn

A point forecast τ steps ahead at time n is Ŷ_{n+τ} = L_n + τ · B_n + SN_{n+τ−kL}. The 95% prediction interval is Ŷ_{n+τ} ± z_{0.025} · s · √(c_τ), where

  c_τ = 1,   τ = 1,
  c_τ = 1 + Σ_{j=1}^{τ−1} α²(1 + jγ)²,   2 ≤ τ ≤ L,
  c_τ = 1 + Σ_{j=1}^{τ−1} [α(1 + jγ) + d_{j,L}(1 − α)δ]²,   L ≤ τ.

For models with multiplicative seasonal variation,

  Y_t = (β_0 + β_1 t) · SN_t · ε_t,
Changes can be made in the smoothing

Ln = α(Yn /SNn−L ) + (1 − α)(Ln−1 + Bn−1 )


Bn = γ(Ln − Ln−1 ) + (1 − γ)Bn−1
SNn = δ(Yn /Ln ) + (1 − δ)SNn−L

A point forecast τ steps ahead at time n is Ŷ_{n+τ} = (L_n + τ · B_n) · SN_{n+τ−kL}. The 95% prediction interval is Ŷ_{n+τ} ± z_{0.025} · s_r · √(c_τ) · SN_{n+τ−L}, where

  s_r = √( Σ_{i=1}^n [(Y_i − Ŷ_i(i−1)) / Ŷ_i(i−1)]² / (n − 3) ),

and Ŷ_i(i−1) denotes the one-step-ahead forecast of Y_i made at time i − 1.
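In practice, the Holt-Winters recursions do not need to be coded by hand. A minimal sketch using statsmodels is given below; y is a hypothetical series with seasonal period L = 12, and by default the smoothing constants are chosen by minimizing the in-sample sum of squared one-step errors.

from statsmodels.tsa.holtwinters import ExponentialSmoothing

additive = ExponentialSmoothing(y, trend="add", seasonal="add",
                                seasonal_periods=12).fit()
multiplicative = ExponentialSmoothing(y, trend="add", seasonal="mul",
                                      seasonal_periods=12).fit()
forecasts = additive.forecast(6)   # point forecasts for the next 6 periods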

10 ARMA Time Series Model


In this chapter, we are discussing a forecasting methodology for stationary time series. It finds the
best fit of a time series to past values in order to make forecasts. The methodology is named after
George Box and Gwilym Jenkins, called Box-Jenkins methodology. They include two major
types of time-series models: i) Autoregressive (AR) Models; ii) Moving Average (MA) Models.
Combining them together, Box-Jenkins methods are also known as Autoregressive Moving Average
(ARMA) models or Autoregressive Integrated Moving Average (ARIMA) models. Details will be
discussed later.

10.1 Stationarity
Definition 10.1 A time series is stationary if its statistical properties, e.g. mean and variance, are essentially constant through time.

In particular, if a series ε_t has zero mean and constant variance σ², and ε_i and ε_j are uncorrelated for any i ≠ j, the sequence is called a white noise sequence.
When data is not stationary, as in many examples, transformation might be needed. Typical
transformation includes differencing the time series in different degree, e.g.,

• First order difference


zt = yt − yt−1 , t = 2, 3, · · · , n

• Second order difference

zt = (yt − yt−1 ) − (yt−1 − yt−2 ) = yt − 2yt−1 + yt−2 , t = 3, · · · , n

• Seasonal adjustment

10.2 ACF and PACF
Recall that the correlation between two random variables X and Y is used to measure the strength of their linear relationship:

  r = Cov(X, Y) / √(var(X) · var(Y)) ≈ [n Σ_{i=1}^n x_i y_i − Σ_{i=1}^n x_i Σ_{i=1}^n y_i] / √([n Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)²][n Σ_{i=1}^n y_i² − (Σ_{i=1}^n y_i)²]),

when n pairs of observations of them are given.


Similarly, to measure the dependence between past variables and current variables, we can
compute the correlation between Zt and Zt−k . Since the correlation is defined for the same random
variable at different times, it is called autocorrelation. The lag k measures the correlation between
observations apart from each other in k steps

  ρ_k = Cov(Z_t, Z_{t+k}) / var(Z_t).

The definition is only meaningful when the time series is stationary, such that Zt and Zt−k have
the same mean and variance. It can be estimated from the data
  ρ̂_k = [Σ_{t=b}^{n−k} (z_t − z̄)(z_{t+k} − z̄)/(n − k − b + 1)] / [Σ_{t=b}^{n} (z_t − z̄)²/(n − b + 1)],

where z̄ = Σ_{t=b}^{n} z_t/(n − b + 1) is the sample mean of the series.
Like other statistics based on the data, ρ̂k is random, and has corresponding standard error.
This standard error can be used to assess whether the autocorrelation is statistically significant
from 0:

  SE(ρ̂_k) = √( (1 + 2 Σ_{j=1}^{k−1} ρ̂_j²) / (n − b + 1) ).

In particular, for ρ̂_1, we have SE(ρ̂_1) = √(1/(n − b + 1)). Similar to the test of regression coefficients,
if |ρ̂k /SE(ρ̂k )| > tn−p−1,α/2 , we can claim that the autocorrelation is significant at level α at lag k.
Similar to autocorrelation, a closed related concept is Partial Autocorrelation Function. The
partial autocorrelation at lag k may be viewed as the autocorrelation of time series observations
separated by a lag of k time units with the effect of the intervening observations eliminated. It can
be computed based on the autocorrelation function. In particular,

  r_{11} = ρ_1,
  r_{kk} = [ρ_k − Σ_{j=1}^{k−1} r_{k−1,j} · ρ_{k−j}] / [1 − Σ_{j=1}^{k−1} r_{k−1,j} · ρ_j],   k = 2, 3, · · ·

where r_{k,j} = r_{k−1,j} − r_{kk} · r_{k−1,k−j} for j = 1, 2, · · · , k − 1.
Similar to the sample autocorrelation, we can obtain the sample PAC based on the time series
observations. The standard error of the SPAC can be obtained

  SE(r̂_{kk}) = 1/√(n − b + 1),

which is constant regardless of the choice of k.


Both the ACF and the PACF are characteristics of the dependence among samples at different lags. They are often used jointly to determine the dependence structure of the sequence.
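A minimal sketch of computing and plotting the sample ACF and PACF with statsmodels is given below; z is a (stationary) 1-D series.

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf

r = acf(z, nlags=25)     # sample autocorrelations rho_hat_k
phi = pacf(z, nlags=25)  # sample partial autocorrelations r_hat_kk

fig, axes = plt.subplots(2, 1)
plot_acf(z, lags=25, ax=axes[0])
plot_pacf(z, lags=25, ax=axes[1])
plt.show()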

10.3 ARMA model


AR and MA are two fundamental building blocks of the Box-Jenkins methods.
(I) Moving Average (MA) models
The MA model assumes the time series are generated by the moving average of white noise.
As a result, the autocorrelation of the data is caused by the overlapping in computing the
moving average. In general, a MA model with order q is specified as

  z_t = δ + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + · · · + θ_q ε_{t−q},   (10.1)

where the ε_t are white noise (or i.i.d. normal) and cannot be directly observed. The process has

  E(z_t) = δ,   var(z_t) = σ² (1 + Σ_{j=1}^q θ_j²).
By the construction, we can observe that the autocorrelation at lag k > q should be zero as
the moving average windows do not overlap any more. More specifically, it has the following
(theoretical) autocorrelation function

  ρ_k = σ² Σ_{j=0}^{q−k} θ_j θ_{j+k} / var(z_t),   k ≤ q,
  ρ_k = 0,   k > q,   (10.2)

where for notation simplicity, θ0 is defined to be 1.


Example:

(a) MA(1) model:

      z_t = δ + ε_t + θ_1 ε_{t−1},

    • ρ_1 = θ_1/(1 + θ_1²), ρ_k = 0 for k ≥ 2
    • The AC cuts off after lag 1

(b) MA(2) model:

      z_t = δ + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2},

    • The autocorrelation function is

        ρ_1 = θ_1(1 + θ_2)/(1 + θ_1² + θ_2²),   ρ_2 = θ_2/(1 + θ_1² + θ_2²),   ρ_k = 0 for k ≥ 3.

    • The AC cuts off after lag 2
(II) Autoregressive (AR) models
The AR model assumes the time series are generated by explicitly regress on its previous
values. As a result, the autocorrelation of the data is caused by the direct dependence on
previous data. In general, a AR model with order p is specified as

  z_t = δ + φ_1 z_{t−1} + φ_2 z_{t−2} + · · · + φ_p z_{t−p} + ε_t,   (10.3)

where ε_t is white noise (or i.i.d. normal). Since z_t can be directly observed, the AR model can
be obtained by multiple linear regression, setting zt as the response, and zt−1 , zt−2 , · · · , zt−p
as covariates.
Because of the explicit dependence on the past data, the autocorrelation function has a
recursive formula
ρk = φ1 ρk−1 + φ2 ρk−2 + · · · + φp ρk−p

In general, ρk has an exponential behavior and cyclical patterns. In contrast, the PACF of
AR(p) model cuts down to zero after lag k = p.

Example: Consider the AR(1) model z_t = δ + φ_1 z_{t−1} + ε_t; its PACF is

  r_{11} = φ_1,   r_{kk} = 0 for k ≥ 2.

(III) ARMA(p, q) model


The ARMA(p, q) model is a combination of both components. It has the general form

  z_t = δ + φ_1 z_{t−1} + φ_2 z_{t−2} + · · · + φ_p z_{t−p} + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + · · · + θ_q ε_{t−q}.   (10.4)

Sometimes it is more organized to shift the z_t and ε_t terms to the two sides of the equation:

  z_t − φ_1 z_{t−1} − φ_2 z_{t−2} − · · · − φ_p z_{t−p} = δ + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + · · · + θ_q ε_{t−q}.

Introducing the backshift operator B, which has the effect B z_t = z_{t−1} and B^k z_t = z_{t−k}, we have

  (1 − φ_1 B − φ_2 B² − · · · − φ_p B^p) z_t = δ + (1 + θ_1 B + θ_2 B² + · · · + θ_q B^q) ε_t.

It can be viewed as polynomial of B of the coefficient process. In fact, the polynomials of B


play a central role in determining the properties of the ARMA process.

10.4 Model Identification


As a preliminary identification of the model types and orders, we can use the following observations:

• AR(p) model: the autocorrelation function typically dies down, and the partial autocorrelation function cuts off at lag p.

  [Figure: sample ACF (dying down) and sample PACF (cutting off) of a simulated AR series.]

• MA(q) model: the autocorrelation function cuts off at lag q, and partial autocorrelation dies
down.
  [Figure: sample ACF (cutting off) and sample PACF (dying down) of a simulated MA series.]

• ARMA(p, q): both the autocorrelation and the partial autocorrelation die down.

  [Figure: sample ACF and sample PACF of a simulated ARMA series, both dying down.]
10.4.1 Link to other models

ARMA model has close link to other time series analysis techniques.

1. Simple smoothing: The forecasting with simple exponential smoothing is equivalent to


forecasting with
z_t = ε_t − θ_1 ε_{t−1},  where z_t = y_t − y_{t−1}.

In this case, the smoothing constant α = 1 − θ1 . The previous forecast errors are used to
adjust current forecast.

2. Holt’s trend corrected smoothing: The forecasting with trend corrected exponential
smoothing is equivalent to forecasting with

z_t = ε_t − θ_1 ε_{t−1} − θ_2 ε_{t−2},  where z_t = y_t − 2y_{t−1} + y_{t−2}.

In this case, the smoothing constant

θ_1 = 2 − α − αγ,   θ_2 = α − 1.

3. Regression on time: In time series regression, we often assume yt = T Rt + SNt + ξt


and ξt are independent and normally distributed. In many applications, we may find ξt are
correlated. In such cases, we can use Box-Jenkins model to model correlated ξt , and combine
them together
y_t = TR_t + SN_t + ξ_t,   ξ_t ∼ ARMA(p, q).

For notation simplicity, the Box-Jenkins methods often use the following notation

yt ∼ ARIMA(p, d, q)

where

• p: is the order of the AR terms


• q: is the order of the MA terms
• d: is the order of differencing

The notation implies

  z_t = δ + φ_1 z_{t−1} + φ_2 z_{t−2} + · · · + φ_p z_{t−p} + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + · · · + θ_q ε_{t−q},

where z_t = (1 − B)^d y_t is the dth order difference of y_t.


Examples:

• ARIMA(1,0,0) becomes AR(1): y_t = µ + φ_1 y_{t−1} + ε_t
• ARIMA(0,0,2) becomes MA(2): y_t = µ + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2}
• ARIMA(0,1,1) becomes IMA(1,1): y_t − y_{t−1} = µ + ε_t + θ_1 ε_{t−1}
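A minimal sketch of fitting an ARIMA(p, d, q) model and forecasting with statsmodels is given below; the order shown is only an example, and y is a hypothetical series.

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(y, order=(1, 1, 1)).fit()    # AR(1) + first differencing + MA(1)
print(model.summary())
point_forecasts = model.forecast(steps=4)  # tau = 1, ..., 4 step-ahead forecasts
# A seasonal fit would additionally pass, e.g., seasonal_order=(0, 1, 1, 12)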

10.4.2 Model Constraints

The parameters of the ARIMA models need to satisfy a few constraints to make the model mean-
ingful and easy to interpret. Among them, the following two are most crucial.

• Stationary (causal) condition: the roots of the following equation must satisfy |z| > 1

1 − φ1 z − φ2 z 2 − · · · − φp z p = 0

• Invertible condition: the roots of the following equation must satisfy |z| > 1

1 + θ1 z + θ2 z 2 + · · · + θq z q = 0

These constraints are in place for both true model parameters and estimated model parameters.

10.4.3 Model Prediction

In a general ARMA model, the τ-step forecast at time t can be made as

  ŷ_{t+τ} = δ + φ_1 ŷ_{t+τ−1} + · · · + φ_p ŷ_{t+τ−p} + ε̂_{t+τ} + θ_1 ε̂_{t+τ−1} + · · · + θ_q ε̂_{t+τ−q}.

If y_{t+τ−i} is observed, we use the observed value (ŷ_{t+τ−i} = y_{t+τ−i} for τ ≤ i); otherwise, the forecasted values at previous steps are used. Similarly, if ε_{t+τ−i} is beyond the current time step t, it is set to 0 (ε̂_{t+τ−i} = 0 for τ > i); otherwise, ε̂_{t+τ−i} is estimated by the corresponding one-step prediction error.

1. AR(p) model: Given the data up to time t − 1, the forecast at t is naturally

ŷt = δ + φ1 yt−1 + φ2 yt−2 + · · · + φp yt−p


with forecast error e_t = y_t − ŷ_t and the sum of squared errors Σ_{t=p+1}^n (y_t − ŷ_t)².

2. MA(1) model: y_t = δ + ε_t − θ_1 ε_{t−1}. The values of ε_t are not directly observable. Using the previous forecast error to estimate ε_{t−1}, i.e., ε̂_{t−1} = y_{t−1} − ŷ_{t−1}, the forecast values can be computed as ŷ_t = δ − θ_1 ε̂_{t−1}, with the sum of squared forecast errors SSE = Σ_{t=2}^n (y_t − ŷ_t)² = Σ_{t=2}^n ε̂_t².

10.5 Seasonal ARMA model


When the time series has seasonal effects, the autocorrelation often shows the seasonality, as demon-
strated below

[Figure: sample ACF of a seasonal time series, showing the seasonal pattern.]

This is also a sign of non-stationarity. In general, we need to check the SAC at two different levels.
At nonseasonal level, the SAC at lags ranging from 1 to L − 3, is used to indicate the stationarity
(whether trend exists), similar to non-seasonal ARMA models. At seasonal level, the SAC at lags
around L, 2L, 3L, · · · indicate the correlation between the same season in different periods. Both
levels should die down or cut off quickly to indicate stationarity.
If the time series is nonstationary, we can use the first order or higher order differencing to
make them stationary.

1. First regular differencing: zt = yt − yt−1


2. First seasonal differencing: zt = yt − yt−L
3. First regular and seasonal differencing: zt = yt − yt−1 − (yt−L − yt−L−1 )

Back to the picture above, we can try these three differencing ways to make the series stationary.
After regular differencing, the SAC is

  [Figure: sample ACF after first regular differencing.]

After seasonal differencing, the SAC is

  [Figure: sample ACF after first seasonal differencing.]

After regular and seasonal differencing, the SAC is

  [Figure: sample ACF after both regular and seasonal differencing.]
Similar to ARMA model, we can define the model at seasonal level. For seasonal models with
period L, their counter part can be defined as

1. Seasonal moving average model of order Q: z_t = δ + ε_t + θ_1 ε_{t−L} + θ_2 ε_{t−2L} + · · · + θ_Q ε_{t−QL}

   [Figure: sample ACF and PACF of the seasonal MA example z_t = ε_t − 0.5 ε_{t−4} − 0.3 ε_{t−8}.]

2. Seasonal autoregressive model of order P: z_t = δ + φ_1 z_{t−L} + φ_2 z_{t−2L} + · · · + φ_P z_{t−PL} + ε_t

   [Figure: sample ACF and PACF of the seasonal AR example z_t = 0.5 z_{t−4} + 0.3 z_{t−8} + ε_t.]

To tentatively specify the seasonal Box-Jenkins model

• Use the behavior of SAC and SPAC at nonseasonal level to identify a nonseasonal model

• Use the behavior of SAC and SPAC at seasonal level to identify a seasonal model

• Combine the model identified in the previous two steps

Example:
After first seasonal differencing, the data is stationary.

[Figure: sample ACF and PACF of the seasonally differenced series.]

The SPAC cuts off after lag 5 with spikes at lags 1, 3, 5, which indicates z_t = δ + φ_1 z_{t−1} + φ_3 z_{t−3} + φ_5 z_{t−5} + ε_t. The SAC cuts off after lag 1 at the seasonal level, which indicates z_t = δ + ε_t − θ_1 ε_{t−12}. Combining the two, we obtain the final model: z_t = δ + φ_1 z_{t−1} + φ_3 z_{t−3} + φ_5 z_{t−5} + ε_t − θ_1 ε_{t−12}.
To make point forecasts, we first forecast z_t, and then y_t. First, use the Box-Jenkins model to obtain the point and interval forecasts for z_{t+τ}: ẑ_{t+τ} = δ + φ_1 z_{t+τ−1} + φ_3 z_{t+τ−3} + φ_5 z_{t+τ−5} − θ_1 ε̂_{t+τ−12}. Then, since first order seasonal differencing is used, y_{t+τ} = y_{t+τ−12} + ẑ_{t+τ}.
When the seasonal and non-seasonal parts involve the same type of model (both AR or both MA), some multiplicative terms are needed:

  (1 − a_1 B − a_2 B² − · · · − a_p B^p)(1 − φ_1 B^L − · · · − φ_P B^{PL}) y_t = (1 − b_1 B − b_2 B² − · · · − b_q B^q)(1 − θ_1 B^L − · · · − θ_Q B^{QL}) ε_t.   (10.5)

There is no problem in specifying the model, but we need to be careful in parameter estimation and forecasting.
An ARIMA model with seasonal effects is abbreviated as SARIMA:

  y_t ∼ SARIMA(p, d, q) × (L, P, D, Q),

• L: the period of the seasonal effect

• P : the order of AR part at seasonal level

• Q: the order of MA part at seasonal level

• D: the order of differencing at seasonal level

Example: Consider the SARIMA(0, 1, 1) × (12, 0, 1, 1) model. It has period L = 12. Using the lag operator, we have

  (1 − B)(1 − B^{12}) y_t = (1 + θ_1 B)(1 + a_1 B^{12}) ε_t.

It is equivalent to

  (1 − B − B^{12} + B^{13}) y_t = (1 + θ_1 B + a_1 B^{12} + θ_1 a_1 B^{13}) ε_t.

When multiplicative seasonal effects are present, it can be first transformed to additive effects,
and then use the SARIMA model to model the variability over time.

10.6 Model Estimation


Given the model structure and order, there are generally three types of estimation methods. The
first one matches the sample ACF and sample PACF with the theoretical ACF and PACF (in the
function of parameters), and get their estimated values. The second category is the least square
method, and the third category is maximum likelihood estimation.

10.6.1 Matching TAC with SAC (Method of Moments)

Given the tentatively determined model, its TAC is often available. For example, the MA(1) model y_t = δ + ε_t + θ_1 ε_{t−1} has TAC

  ρ_1 = θ_1/(1 + θ_1²),   ρ_k = 0 for all k > 1.

By equating the TAC with the SAC (ρ_1 = r_1), we get

  θ̂_1 = (1 ± √(1 − 4r_1²)) / (2r_1).

Similar idea can be applied to other models with relatively low orders, as long as the TAC is
analytically available.

(a) MA(2) model:

      ρ_1 = θ_1(1 + θ_2)/(1 + θ_1² + θ_2²),   ρ_2 = θ_2/(1 + θ_1² + θ_2²),   ρ_k = 0 for k > 2

(b) AR(1) model:

      ρ_k = φ_1^k,   k ≥ 1

(c) AR(2) model:

      ρ_1 = φ_1/(1 − φ_2),   ρ_2 = φ_1²/(1 − φ_2) + φ_2,   ρ_k = φ_1 ρ_{k−1} + φ_2 ρ_{k−2} for k ≥ 3

(d) ARMA(1,1) model:

      ρ_1 = (1 + φ_1 θ_1)(φ_1 + θ_1)/(1 + θ_1² + 2φ_1 θ_1),   ρ_k = φ_1 ρ_{k−1} for k ≥ 2

By matching the theoretical AC with the sample AC, we can calculate the parameters of interest for the corresponding model. This method can be applied to more complex models as well, and is known in general as Yule-Walker estimation. It is simple, and does not require the original raw data. However, the estimation accuracy is not the best. When multiple solutions exist, it is important to check whether the solutions satisfy the stationary and invertible conditions of the model.

10.6.2 Least square and MLE

These two methods are not elaborated, and can take full advantage of the entire dataset. The
least square method tries to find out the parameters such that the sum of squared forecast error
is minimized. It is similar to the estimation in conventional regression analysis. In the context
of ARMA models, calculating the forecast errors is straightforward for AR(p) model. However, it
needs more care when MA component is involved, e.g., MA(q) and ARMA(p, q) model.
The maximum likelihood estimation requires the joint distribution of all observations (typically
multivariate normal). It generally provides more accurate estimation, but also is more complex.
Software can be used to obtain such estimates.

10.6.3 Model Diagnostics

Similar to regression models, a good way to check the adequacy of an Box-Jenkins model is to
analyze the residuals
et = yt − ŷt .

In particular, we can plot the SAC and SPAC for the residuals to check whether the model is
adequate. These plots are often named as RSAC and RSPAC, respectively, for short. If the model
is adequate, the error should be uncorrelated, and the RSAC be small. Detailed plot of RSAC or
RSPAC can be used to improve the model as well. In addition, we can also use some statistic to
quantify the dependence of the residuals.
One such statistic is the Ljung-Box statistic, computed as

  Q* = (n − d)(n − d + 2) Σ_{l=1}^K (n − d − l)^{−1} r_l²(ε̂),

where n is the sample size, d is the order of differencing, r_l(ε̂) is the SAC of the residuals at lag l, and K is a number indicating the range of interest. If Q* > χ²_{α, K−n_C}, where n_C is the number of estimated parameters, the residuals are correlated, i.e., the model is inadequate. In practice, multiple values of K (e.g., 6, 12, 18, 24) can be used to check the correlation of the residuals.
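A minimal sketch of running the Ljung-Box check on the residuals of a fitted Box-Jenkins model is given below; in recent versions of statsmodels the function returns a table of the statistic and p-value for each requested K.

from statsmodels.stats.diagnostic import acorr_ljungbox

lb = acorr_ljungbox(model.resid, lags=[6, 12, 18, 24])
print(lb)   # lb_stat (Q*) and lb_pvalue per K; small p-values indicate
            # correlated residuals, i.e., an inadequate model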

11 Spatial Data Forecasting
11.1 Spatial Data
Spatial data comes from a myriad of fields, which lead to various spatial data types. A general and
useful classification of spatial data is provided by Cressie (1993, pp. 8-13) based on the nature of
the spatial domain under study.
Following Cressie (1993), let s ∈ R^d be a generic location in a d-dimensional Euclidean space and {Z(s) : s ∈ R^d} be a spatial random function, where Z denotes the attribute we are interested in. The three spatial data types are: lattice data, geostatistical data, and point processes.
• Lattice (Areal/Regional) data: the domain D under study is discrete. Data can be exhaus-
tively observed at fixed locations that can be enumerated. Locations can be ZIP codes,
neighborhoods, provinces, etc. Data in most cases are spatially aggregated. E.g., the unemployment rate by state, crime data by county, average housing prices by province.

• Geostatistical data: the domain under study is a fixed continuous set D. Eg: the level of a
pollutant in a city, the precipitation or air temperature values in a country. They can have
value at any point in D.

• Point processes (Point patterns): the attribute under study is the location of events (obser-
vations). Therefore, the domain D is random. The observation is not necessarily labeled and
the interest lies mainly in where the events occur. Eg: the location of trees in a forest, the
location of nests in a breeding colony of birds.
Figure 1 shows some examples of the spatial data in each category. The main goal of different
spatial data types can be different:
• Lattice data analysis: smoothing and clustering acquire special importance. It is of interest
to describe how the value of interest at one location depends on nearby values, and whether
this dependence is direction dependent.

• Geostatistical data: to predict value of interest at unobserved locations across the entire
domain of interest. An exhaustive observation of the spatial process is not possible. Obser-
vations are only made at a small subset of locations.

• Point processes analysis: to determine if the location of events tends to exhibit a systematic
pattern over the area under study or, on the contrary, they are randomly distributed.

11.2 Lattice Data Analysis


As indicated earlier, for lattice data, the key interest is to identify their spatial dependency pattern.
Before analyzing spatial dependency, it is essential to test the existence of spatial dependency.

Figure 1: Three spatial data types

11.2.1 Moran’s I to Test Dependency

Moran’s I statistic is a widely used measure for spatial correlation (dependencies):


  I = (N / S_0) · Σ_i Σ_j w_{ij}(x_i − x̄)(x_j − x̄) / Σ_i (x_i − x̄)².

Here N is the number of spatial units indexed by i or j, x is the variable of interest, x̄ is the mean of x_i, i = 1, · · · , N, w_{ij} is the (i, j)th element of a spatial weight matrix W with w_{ii} = 0, and S_0 = Σ_{i,j} w_{ij}. The design of the spatial weight matrix W can be:

• wij = 1 if zone i and zone j are neighbors. wij = 0 otherwise.


• wij = 1 if zone j is one of the k nearest neighbors of zone i. wij = 0 otherwise.
• wij is set based on a decay function of distance between zone i and zone j. An example can
be Gaussian kernel:
  w_{ij} = (1/(√(2π) h)) exp[−d_{ij}²/(2h²)],

with d_{ij} being the distance between zone i and zone j, and h being the bandwidth parameter to be tuned.

To test the spatial correlation, we can formulate the following hypothesis test:
H0 : no spatial correlation; H1 : spatial correlation exists.

The H0 indicates the spatial randomness. The expected value of Moran’s I under the null hypothesis
that there is no spatial correlation is E(I) = −1/(N −1). With large sample sizes, the expected value
approaches zero. I usually ranges from −1 to +1. A positive I indicates positive spatial correlation, while a negative I indicates negative spatial correlation. Values that deviate significantly from −1/(N − 1) indicate spatial correlation. The variance of the statistic under the null (assuming each value is
equally likely to occur at any location) is:

  Var(I) = [N S_4 − S_3 S_5] / [(N − 1)(N − 2)(N − 3) S_0²] − (E(I))²,

where S_1 = Σ_i Σ_j (w_{ij} + w_{ji})²/2;   S_2 = Σ_i (Σ_j w_{ij} + Σ_j w_{ji})²;

  S_3 = [N^{−1} Σ_i (x_i − x̄)^4] / [N^{−1} Σ_i (x_i − x̄)²]²,

  S_4 = (N² − 3N + 3) S_1 − N S_2 + 3 S_0²;   S_5 = (N² − N) S_1 − 2N S_2 + 6 S_0².


Alternatively, using permutation, we can obtain a reference distribution for the statistic under the null hypothesis. In each random permutation of the data across locations, we compute the statistic I. Denote by M the total number of permutations and by R the number of permutations for which the computed Moran's I is equal to or more extreme than the original statistic I_0. From the reference distribution of the M permutations, we can calculate the p-value

  p = (R + 1)/(M + 1).

A small p-value indicates the existence of significant spatial correlation.
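A minimal sketch of Moran's I and its one-sided permutation p-value (against positive spatial correlation) is given below; x is a 1-D array of attribute values, W an N × N weight matrix with zero diagonal, and the function names are ours.

import numpy as np

def morans_i(x, W):
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    s0 = W.sum()
    return len(x) / s0 * (z @ W @ z) / (z @ z)

def permutation_pvalue(x, W, M=999, seed=0):
    rng = np.random.default_rng(seed)
    i0 = morans_i(x, W)
    perms = np.array([morans_i(rng.permutation(x), W) for _ in range(M)])
    R = np.sum(perms >= i0)          # count of equal-or-more-extreme permuted values
    return i0, (R + 1) / (M + 1)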


If significant spatial dependence exists, it is beneficial to characterize the dependence structure,
so that this information can be used for forecasting or anomaly detection. The following few models
are different approaches to characterize the spatial dependence.

11.2.2 Spatial Autoregression

One common tool to account for spatial dependency is linear regression. The idea is to have models
analogous to time series models but with spatial lags. As the simplest case, when only the spatially
lagged variable is considered,
  y = λ W y + ε,   |λ| < 1,   (11.1)

where ε ∼ i.i.d. N(0, σ_ε² I_n) and W is a non-stochastic standardized spatial weight matrix. Compared to the spatial weight matrix explained above, it is standardized in the sense that the elements of any row sum to one, i.e., w^s_{ij} = w_{ij} / Σ_j w_{ij}. When the W matrix is row-standardized and |λ| < 1, the matrix (I − λW) is invertible. From equation (11.1) we have y = (I − λW)^{−1} ε, and:

  E(y) = 0,   (11.2)
  E(y y^T) = σ_ε² (I − λW)^{−1}(I − λW^T)^{−1} = σ_ε² Ω.   (11.3)

With the normality assumption on ε_i, this model can also be estimated via a maximum likelihood procedure. The log-likelihood can be expressed as:

  l(λ, σ_ε²) = const − (n/2) ln(σ_ε²) − (1/2) ln |(I − λW)^{−1}(I − λW)^{−T}| − (1/(2σ_ε²)) y^T [(I − λW)^{−1}(I − λW)^{−T}]^{−1} y.

The λ̂, σ̂_ε² that maximize the likelihood function become the parameter estimates.
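A minimal sketch of this maximum likelihood fit is given below. It concentrates σ_ε² out of the log-likelihood and searches over λ numerically, assuming W is a row-standardized numpy array; the function name fit_sar is ours.

import numpy as np
from scipy.optimize import minimize_scalar

def fit_sar(y, W):
    y = np.asarray(y, dtype=float)
    n = len(y)
    I = np.eye(n)

    def neg_loglik(lam):
        A = I - lam * W
        resid = A @ y
        sigma2 = (resid @ resid) / n           # sigma_eps^2 concentrated out
        _, logdet = np.linalg.slogdet(A)       # ln |I - lambda W|
        return -(logdet - 0.5 * n * np.log(sigma2))

    res = minimize_scalar(neg_loglik, bounds=(-0.99, 0.99), method="bounded")
    lam = res.x
    resid = (I - lam * W) @ y
    return lam, (resid @ resid) / n            # (lambda_hat, sigma_hat^2)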

11.2.3 Spatial Linear Regression with Exogenous Variables

In addition to the spatially lagged variable as regressors, there are also some exogenous variables
that can influence the response. Define the matrix of all exogenous regressors, current and spatially
lagged, as Z = [X, W X] and the vector of regression parameters as β = [β(1) , β(2) ]. In presence of
explanatory variables, it is also possible to test the spatial dependencies, following the procedure
below:
1. Run the non-spatial regression yi = Xi β(1) + e, e ∼ N (0, σ 2 ) for every yi .
2. Test the regression residuals for spatial correlation, using Moran’s I.
3. If no significant spatial correlation exists, STOP.
4. Otherwise, use a special model which takes spatial dependencies into account.
Two commonly used models considering spatial dependecies are Spatial Lag Model (SLM) and
Spatial Error Model (SEM).
Spatial Lag Model (SLM):

  y = λ W y + Z β + u,   |λ| < 1,
  u | X ∼ i.i.d. N(0, σ_u² I_n).

In this case, a problem of endogeneity emerges in that the spatially lagged value of y is correlated with the stochastic disturbance, i.e., E[(Wy) u^T] ≠ 0. Therefore, least squares cannot be employed. The parameters can be estimated by maximum likelihood. Sometimes this model is also called the spatial autoregressive model (SAR).
Spatial Error Model (SEM):

  y = Z β + u,
  u = ρ W u + ε,   |ρ| < 1.

Compared to the SLM, the SEM contains spatial dependence in the noise. As before, the constraint on ρ holds for row-standardized W to make I − ρW invertible. Due to the endogeneity of the errors, i.e., E[(W u) ε^T] ≠ 0, the least squares procedure loses its optimal properties. The parameters can be estimated by maximum likelihood.

11.2.4 Generalizations

Last but not least, the models discussed above are all special cases of a general form:

y = λW1 y + Zβ + u, |λ| < 1

$$u = \rho W_2 u + \epsilon, \qquad |\rho| < 1$$

where W1 and W2 are not necessarily the same. This generalized model comes with several names,
e.g., the spatial autocorrelation model (SAC), the extended spatial Durbin model (SDM), or SARAR(1,1)
(an acronym for spatial autoregressive model with additional autoregressive error structure).

11.3 Geostatistical Interpolation


Geostatistics deals with spatially autocorrelated data.¹ The first law of geography states: “everything is
related to everything else, but near things are more related than distant things” (Tobler 1970).
Figure 2 shows the precipitation surface of Switzerland as an example. Blue dots are monitoring
stations, with size corresponding to the amount of rainfall. The heights of the surface and
their colors are associated with the amount of rainfall at each location. An essential problem is to
construct a continuous surface from observations at these stations.

Figure 2: A precipitation surface of Switzerland.

11.3.1 Spatial Dependencies: Covariance and Semivariance

Statistically, denote X(s) as the response of interest at location s ∈ D, and m(s) = E[X(s)] as the
mean response value. The spatial dependence between the responses at any two locations si , sj can be
characterized by the following two quantities.
¹ Critical reference for this section: Geographic Information Technology Training Alliance (GITTA),
http://www.gitta.info/website/en/html/index.html

• Covariance: C(si , sj ) = E{[X(si ) − m(si )] · [X(sj ) − m(sj )]},

• Semivariance: γ(si , sj ) = Var(X(si ) − X(sj ))/2.

Covariance is a measure of similarity: the larger the value, the more correlated the responses.
In contrast, semivariance is a measure of dissimilarity: a smaller value indicates higher
dependence. They play a pivotal role in the properties of geospatial models and their prediction
accuracies. To reduce model complexity, it is common to limit our attention to a class of
stationary models:
Intrinsic stationary (IS):

• E[X(s)] = µ for all s ∈ D

• 2γ(h) = Var(X(si ) − X(sj )), for all si , sj ∈ D, h = si − sj is the difference between si , sj .

Second-order stationary (SOS):

• E[X(s)] = µ for all s ∈ D

• Cov(X(si ), X(sj )) = E[(X(si ) − µ)(X(sj ) − µ)] = C(h), for all si , sj ∈ D, h = si − sj .

An SOS process implies IS, which means IS is a weaker assumption. Under SOS, the relationship
between semivariance and covariance is:

γ(h) = C(0) − C(h),

as demonstrated in Figure 3.

Figure 3: Relationship between covariance and semivariance under SOS.

Covariance is the more commonly seen concept in statistics, and it is common to formulate geostatistical
models in terms of the covariance function. Nevertheless, for estimation purposes, the semivariance offers
several advantages:

• To estimate the semivariance, no estimate of the mean is required, so the semivariance adapts more
easily to nonstationary cases. In contrast, the covariance estimator requires an estimate of the
mean; when the mean is unknown and must be estimated from the sample, the covariance estimator
incurs more bias.
• The semivariance can be applied under IS, meaning that the semivariance can be defined in some
cases where the covariance function cannot be defined. In particular, the semivariance may
keep increasing with increasing lag, rather than leveling off, corresponding to an infinite global
variance; in this case the covariance function is undefined. As a result, IS, rather than SOS, is the
fundamental assumption required for Kriging.

Figure 4 shows an example of how the semivariance works. Two datasets have similar summary
statistics: 15251 points with (1) average value 100; (2) standard deviation 100; (3) median 100; (4)
10th percentile 74; (5) 90th percentile 125. However, due to different semivariances they exhibit totally
different patterns (spatial structures).

Figure 4: Different semivariance

Note that in practice, we may further assume that the semivariance is isotropic, i.e., the
spatial correlation is the same in all directions. There are anisotropic cases which require a more
careful design of the semivariance formula, which we will not discuss in this note. For an isotropic semivariance,
the distance between si and sj completely determines their spatial correlation. As a result, we use
the scalar h instead of the directional vector h.
To estimate γ(h) from the data, we can use:

$$\hat{\gamma}(h) = \frac{1}{2N(h)} \sum_{i,j:\ \|s_i - s_j\| \approx h} \left[X(s_i) - X(s_j)\right]^2$$

where N (h) is the number of pairs whose distances are around h. By changing the value of h, we
can get the function γ̂(h), which we also refer to as semivariogram, as shown in Figure 5.

Figure 5: Left: Empirical estimation of semivariance; Right: Empirical semivariogram.
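The empirical estimator above translates directly into code. The sketch below bins all pairwise distances and averages the squared differences within each bin; the function name and the choice of bins are ours.

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Empirical semivariogram gamma-hat(h) from point observations.

    coords: (n, d) array of locations, values: (n,) responses,
    bin_edges: distance-bin edges defining the lags h.
    """
    n = len(values)
    # pairwise distances and squared differences for all pairs i < j
    i, j = np.triu_indices(n, k=1)
    d = np.linalg.norm(coords[i] - coords[j], axis=1)
    sqdiff = (values[i] - values[j]) ** 2

    h_centers, gammas = [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (d >= lo) & (d < hi)
        if mask.sum() > 0:                     # N(h) pairs in this bin
            h_centers.append(0.5 * (lo + hi))
            gammas.append(0.5 * sqdiff[mask].mean())
    return np.array(h_centers), np.array(gammas)
```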

The semivariogram plot often reveals a few important points:

• Sill: The semivariance at which the semivariogram levels off.

• Range: The lag distance at which the semivariogram reaches the sill value. Spatial correlation
is zero beyond the range.
• Nugget: In theory the semivariogram value at the origin should be zero. If it is significantly
different from zero for lags very close to zero, then this value is referred to as the nugget. It
captures variability at distances smaller than the typical sample spacing, including measurement
errors.

For modeling and prediction, we need to replace the empirical semivariogram with an acceptable
parametric semivariogram model, because we need semivariogram values at lag distances
other than the empirical ones. More importantly, the semivariogram model needs to be valid (conditionally
negative definite) so that all prediction variances are non-negative. Let a denote the range and c denote
the sill; the three most frequently used models are listed below, followed by a short code sketch.

• Spherical: $\gamma(h) = c\left[1.5(h/a) - 0.5(h/a)^3\right]$ if $h \le a$, and $c$ otherwise.

• Exponential: $\gamma(h) = c\left(1 - \exp(-3h/a)\right)$

• Gaussian: $\gamma(h) = c\left(1 - \exp(-3h^2/a^2)\right)$
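These three models are easy to code and to fit to the empirical semivariogram (e.g., with scipy.optimize.curve_fit); the sketch below is one possible parameterization with range a and sill c.

```python
import numpy as np

def spherical(h, a, c):
    """Spherical model: rises to the sill c at the range a, then stays flat."""
    h = np.asarray(h, dtype=float)
    return np.where(h <= a, c * (1.5 * h / a - 0.5 * (h / a) ** 3), c)

def exponential(h, a, c):
    """Exponential model: reaches about 95% of the sill at h = a."""
    return c * (1.0 - np.exp(-3.0 * np.asarray(h, dtype=float) / a))

def gaussian(h, a, c):
    """Gaussian model: very smooth behavior near the origin."""
    return c * (1.0 - np.exp(-3.0 * np.asarray(h, dtype=float) ** 2 / a ** 2))
```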

11.3.2 Kriging as an interpolation

Given the spatial covariance (semivariance) structure of the data, we are able to predict the
response at any location from the observations at a few locations. This process is also called
interpolation. Interpolation algorithms predict the value at a given location as a weighted sum of
data values at surrounding locations. Almost all weights are assigned according to functions that
give a decreasing weight as the distance increases. Kriging is the optimal interpolation based on
observed values of surrounding data points, weighted according to spatial covariance values. It also
has other advantages:

• It helps to compensate for the effects of data clustering, assigning individual points within a
cluster less weight than isolated data points.
• It estimates the standard error (kriging variance) along with the estimate of the mean, which
provides a basis for interval forecasting.

Figure 6: Left: Interpolation using inverse distance weighting; Middle: Kriging Interpolation;
Right: Kriging’s confidence interval.

Kriging assumes the response at location s, Z(s), follows Z(s) = m(s) + R(s) in the domain
D. m(s) is the mean response function at location s, R(s) is an intrinsically stationary (IS)
process. Kriging aims to predict the response value at unobserved location s0 given observations
Z(s1 ), Z(s2 ), · · · , Z(sn ). A basic form of the kriging estimator is

$$Z^*(s_0) - m(s_0) = \sum_{i=1}^{N(s_0)} \lambda_i \left[Z(s_i) - m(s_i)\right]$$

where N (s0 ) is the number of data points in the local neighborhood used for estimation of Z ∗ (s0 ).
The mathematical goal of kriging is to determine weights that minimize the variance of the estimator
under the unbiasedness constraint.

$$\begin{aligned}
\underset{\lambda}{\text{minimize}} \quad & \sigma_E^2(s_0) = \mathrm{Var}\{Z^*(s_0) - Z(s_0)\} \\
\text{subject to} \quad & E\{Z^*(s_0) - Z(s_0)\} = 0
\end{aligned}$$

Ordinary Kriging has the simplest structure for the underlying mean function, i.e., m(s) = µ.
In this case, the bias is:
$$E(Z^*(s_0) - Z(s_0)) = \left(\sum_{i=1}^{N(s_0)} \lambda_i - 1\right) m.$$

Unbiased estimation therefore requires $\sum_{i=1}^{N(s_0)} \lambda_i = 1$. The semivariance can be estimated from the
sample, denoted by $\hat{\gamma}(h)$. The semivariances between any two observations form a matrix $\Gamma$, where
$\Gamma_{i,j} = \gamma(\|s_i - s_j\|)$, $i, j = 1, \cdots, n$. The semivariances between $s_0$ and the existing observations can also
be summarized in a vector $\theta = [\theta_1, \theta_2, \cdots, \theta_n]$ with $\theta_i = \gamma(\|s_0 - s_i\|)$. Minimizing the variance
(uncertainty) of the prediction, we can have

$$\lambda^* = \Gamma^{-1}(\theta + \hat{\mu}\mathbf{1}), \quad \text{where } \hat{\mu} = \frac{1 - \mathbf{1}^T\Gamma^{-1}\theta}{\mathbf{1}^T\Gamma^{-1}\mathbf{1}},$$

and 1 is a vector of 1’s.
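A bare-bones ordinary kriging predictor only needs a semivariogram and the formula for λ* above. The sketch below uses a dense solve and a fitted parametric model such as the spherical sketch earlier; it ignores neighborhood selection (all n observations are used) and numerical safeguards.

```python
import numpy as np

def ordinary_kriging(coords, z, s0, gamma):
    """Ordinary kriging prediction of z at location s0.

    coords: (n, d) observation locations, z: (n,) observed values,
    gamma: callable semivariogram, e.g. lambda h: spherical(h, a=10.0, c=2.0).
    """
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    Gamma = gamma(d)                                     # semivariances among data
    theta = gamma(np.linalg.norm(coords - s0, axis=1))   # data vs. target location
    one = np.ones(len(z))

    Ginv = np.linalg.inv(Gamma)   # assumes Gamma is non-singular (distinct sites)
    mu_hat = (1.0 - one @ Ginv @ theta) / (one @ Ginv @ one)
    lam = Ginv @ (theta + mu_hat * one)                  # weights sum to one
    return lam @ z, lam
```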


It can be seen that the kriging weights are determined entirely by the data locations s and the
covariance model, not the actual data values of Z(s). Despite the simple structure and the assumption
of a constant mean, ordinary Kriging has been found to work well in many cases. It is
generally unnecessary to use a complex mean function unless there are sufficient reasons to do
so.
Figure 7 is an example of the result of ordinary kriging. Compare the weights of points 5 and
6 (similar covariance and distance): point 6 is effectively “screened” by the nearby data point 5.
Data points 5 and 6 are fairly strongly correlated with each other and point 5 has a stronger correlation
with the estimation point, so data point 6 is effectively ignored.
It is important to note that Kriging interpolation also comes with assumptions.

1. The underlying function is from a stationary process with a specified covariance (semivariance) function.
2. If the distribution of the data is skewed, the Kriging estimator is sensitive to a few
large data values.
3. Normality of the observations is not a requirement for Kriging. Nevertheless, under the Gaussian
assumption, Kriging is BLUE (“best linear unbiased estimator”). Kriging under the Gaussian
assumption is also equivalent to the famous “Gaussian process”.

Aside from ordinary Kriging, there are other variants as well.

• Simple kriging: assumes the mean response over the entire domain is a known constant:
$E\{Z(s)\} = m$. In this case, the constraint $\sum_{i=1}^{N(s_0)} \lambda_i = 1$ is no longer needed.
• Universal kriging: assumes the mean response is not a constant but a linear combination of
known functions: $E\{Z(s)\} = \sum_{k=0}^{p} \beta_k f_k(s)$.

Figure 7: Example for Ordinary Kriging when N (s) = 6.

• Cokriging: Kriging using information from one or more correlated variables, or multivariate
kriging in general.

12 Spatial Temporal Data and Models


Spatio-temporal data refers to data that are indexed in both space and time. Generalizing
from the purely spatial setting, we use {Z(s, t) : s ∈ Rd , t ∈ R} to denote the response as a
function of both space (location) and time. As a result, we expect that the responses are both
spatially correlated and temporally correlated. Figure 8² provides an illustration of such data.
The following list gives some common examples in practice.

• If we record the Per Capita Income of every state every year, we have a collection of spatio-
temporal lattice data. We can study how incomes of every state evolve over time.
• If we observe every hour the level of pollutant in a city at the points where the monitoring
stations are located, we have a spatio-temporal geostatistical dataset.
• If we observe the location of bird nests every year, we have a spatio-temporal point pattern
dataset. Now we can study whether there is complete spatio-temporal randomness or they
exhibit clustering/inhibition.

Figure 8: Spatio-Temporal Data Illustration

12.1 Spatial-temporal lattice data analysis


Spatial Markov Chain
² Figure cited from Fernández-Avilés, G., & Mateu, J. (2015). Spatial and spatio-temporal geostatistical modeling
and kriging (Vol. 998). John Wiley & Sons.

Since locations in lattice data are discrete and finite, a Markov chain can be adopted to study
regional dynamics. If we divide the data into k classes and T periods, we can use a vector
Pt = [P1,t , P2,t , · · · , Pk,t ] to represent the probability that the response in a region is a member of
a particular class at period t, i.e., P (Z(s, t) ∈ Ck ). To model the dynamics over time, we use the
transition probability matrix Mt , whose element mt,i,j denotes the probability that a response
currently in state i at time t ends up in state j in the next period, P (Z(s, t + 1) ∈ Cj |Z(s, t) ∈ Ci ).
An example of the transition matrix is shown in Figure 9. If the transition probabilities do not
change over time, we can drop the index t in the notation above, and we can easily get $P_{t+b} = P_t M^b$;
a short numerical sketch is given below.
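As a quick illustration of $P_{t+b} = P_t M^b$ (with a made-up 3-class transition matrix and class distribution):

```python
import numpy as np

M = np.array([[0.80, 0.15, 0.05],    # hypothetical 3-class transition matrix
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])
P_t = np.array([0.5, 0.3, 0.2])      # current class distribution

b = 4
P_tb = P_t @ np.linalg.matrix_power(M, b)   # distribution b periods ahead
print(P_tb, P_tb.sum())                     # the probabilities still sum to 1
```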

Figure 9: An example of transition matrix M .

To use a Markov chain to model the transition in the spatial distribution, we can design a spatial
Markov matrix. This matrix extends the traditional k × k transition matrix into a k × k × k tensor.
Conditioning on the response category of the spatial lag in the initial period, there are k
different transition probability matrices. The class of the neighbors is summarized by the spatial lag
$z_i^* = \sum_{j=1}^{N} w_{i,j} z_j$. The overall influence of spatial dependence would be reflected in the differences

Table 1: A Spatial Markov Matrix

Spatial Lag t0 t1 = a t1 = b t1 = c
a maa|a mab|a mac|a
a b mba|a mbb|a mbc|a
c mca|a mcb|a mcc|a
a maa|b mab|b mac|b
b b mba|b mbb|b mbc|b
c mca|b mcb|b mcc|b
a maa|c mab|c mac|c
c b mba|c mbb|c mbc|c
c mca|c mcb|c mcc|c
a maa mab mac
not considered b mba mbb mbc
c mca mcb mcc

between the marginal cell values and the corresponding values in the various conditional matrices.

For example, if mbc > mbc|a then the probability of an upward move for median class regions,
irrespective of their neighbors, is higher than the probability of an upward move for median class
regions with poor neighbors.
Multivariate Time Series
Another model for spatial-temporal lattice data is the multivariate time series. Let $Z_t = [Z(s_1, t), \cdots, Z(s_n, t)]$
denote the vector of n responses across all spatial locations at time t. A multivariate ARMA(p, q)
process is given by:

$$Z_t - A_1 Z_{t-1} - \cdots - A_p Z_{t-p} = \epsilon_t + B_1 \epsilon_{t-1} + \cdots + B_q \epsilon_{t-q}$$

where $A_1, \cdots, A_p$ and $B_1, \cdots, B_q$ are n × n matrices, and $\epsilon_t$ is a white noise process such that
$\mathrm{Var}(\epsilon_t) = \Sigma$ and $\mathrm{Cov}(\epsilon_t, \epsilon_\tau) = 0$ for $t \neq \tau$. This model can be considered when the number of
spatial locations is relatively small.
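For the pure autoregressive special case (q = 0), the model can be fitted with standard VAR routines. The sketch below assumes statsmodels' VAR interface and uses simulated placeholder data purely for illustration.

```python
import numpy as np
from statsmodels.tsa.api import VAR   # assumes statsmodels is installed

# Z[t, j] = response at location s_j and time t; placeholder data here
rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 4))

results = VAR(Z).fit(2)                       # fit a VAR(2) to the n = 4 series
forecast = results.forecast(Z[-2:], steps=5)  # 5-step-ahead forecasts, shape (5, 4)
```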

12.2 Spatial-Temporal Kriging


Kriging in space-time is essentially the same as Kriging in space, with an extra dimension t. Let
µ(s, t) denote the mean of Z(s, t); the predictor is now formed as:

$$Z^*(s, t) = \sum_{\alpha=1}^{N(s,t)} \lambda_\alpha Z(s_\alpha, t_\alpha).$$

Similar to spatial Kriging, the key in prediction is to model the covariance structure between any
two observations. By definition, covariance between two space–time variables is Cov[Z(s, u), Z(r, v)] =
E{[Z(s, u) − µ(s, u)][Z(r, v) − µ(r, v)]}. As before, we normally require the stationarity of the
spatial-temporal process. For example, the second-order stationarity of spatio-temporal data re-
quires that

1. E{[Z(s, t)]} = µ.

2. Cov[Z(s, u), Z(r, v)] = C(r − s, v − u).

Although the definition of second-order stationarity in the space–time context is a straightforward
extension, there are some fundamental differences between the spatial and spatio-temporal
settings. In particular, in the spatial setting, C(h) = C(−h) by definition of the covariance function.
However, this is not the case for the time dimension in the spatio-temporal setting.
Covariance functions for which C(h, u) = C(h, −u) (equivalently C(−h, u) = C(h, u)) holds, for all h
and u, are called fully symmetric. Among the class of fully symmetric covariances, a covariance
function is separable if C(h, u)/C(h, 0) = C(0, u)/C(0, 0), for all h and u. If this condition holds,
we see that the space–time covariance can be factored (separated) into the product of a purely

spatial covariance and a purely temporal covariance. It is straightforward to show that a separable
covariance must be fully symmetric, but full symmetry does not imply separability.
The following is an example of a separable covariance function:

$$C(h, u) = \sigma^2 \exp(-\nu_s \|h\|)\exp(-\nu_t |u|)$$

with σ² > 0, νs > 0, νt > 0. Nonseparable functions also find applications, taking forms similar
to the following example:

$$C(h, u, \theta) = \frac{\sigma^2}{(|u|^{2\gamma} + 1)^{\tau}} \exp\!\left(\frac{-c\|h\|^{2\gamma}}{(|u|^{2\gamma} + 1)^{\beta\gamma}}\right)$$

Here, τ determines the smoothness of the temporal correlation; γ ∈ (0, 1] determines the smoothness
of the spatial correlation; c determines the strength of the spatial correlation; β ∈ (0, 1] determines
the strength of space/time interaction. In this parameterization, γ = 1 corresponds to the Gaussian
covariance function, while γ = 1/2 corresponds to the exponential covariance function. Smaller
value of γ leads to less smoothness in the interpolation results.
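To make the separable example concrete, the sketch below evaluates C(h, u) = σ² exp(−νs‖h‖) exp(−νt|u|) and assembles the covariance matrix of a set of space-time observations; the parameter values are arbitrary illustrations.

```python
import numpy as np

def separable_cov(h, u, sigma2=1.0, nu_s=0.5, nu_t=0.2):
    """Separable covariance: spatial exponential term times temporal exponential term."""
    return sigma2 * np.exp(-nu_s * h) * np.exp(-nu_t * np.abs(u))

def space_time_cov_matrix(coords, times, **params):
    """Covariance matrix among observations at (coords[i], times[i])."""
    h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)  # ||s_i - s_j||
    u = times[:, None] - times[None, :]                                  # v - u
    return separable_cov(h, u, **params)
```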

13 References
• Sherman, M. (2011). Spatial statistics and spatio-temporal data: covariance functions and
directional properties. John Wiley & Sons.
• Geographic Information Technology Training Alliance (GITTA), http://www.gitta.info/website/en/html/index.html
• Rey, S. J. (2001). Spatial empirics for economic growth and convergence. Geographical
analysis, 33(3), 195-214.
• LeSage, J., & Pace, R. K. (2009). Introduction to spatial econometrics. Chapman and
Hall/CRC.
