Handout 2020 Part1 PDF
Chen Nan
Contents
1 Syllabus 4
1.1 Module Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Grading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Introduction 6
2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Deterministic vs Stochastic Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Errors in Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Linear Regression 10
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.1 Least square estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.2 Accuracy of the estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.3 Point forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.4 Interval Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.5 Other notable notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Multiple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.1 Why multiple regression? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.2 Interactions between variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.3 Least square estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.4 Confidence & prediction intervals . . . . . . . . . . . . . . . . . . . . . . . . . 23
4 Model Checking and Diagnosis 25
4.1 Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Diagnostics using residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.1 Major graphical tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.2 Tests for certain property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Outliers, leverage points, influential points, collinearity . . . . . . . . . . . . . . . . . 29
4.3.1 Identifying outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3.2 High leverage points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.3 Identifying influential observations . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.4 Collinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
8 Regression on Time 44
8.1 Time Series Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.2 Detecting Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.3 Seasonal Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
8.4 Growth Curve Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
9 Exponential Smoothing 48
9.1 Simple Exponential Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9.2 Holt’s Trend Corrected Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
9.3 Holt-Winters Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
10.4.3 Model Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
10.5 Seasonal ARMA model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
10.6 Model Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
10.6.1 Matching TAC with SAC(Moment Method) . . . . . . . . . . . . . . . . . . . 66
10.6.2 Least square and MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
10.6.3 Model Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
13 References 82
1 Syllabus
1.1 Module Information
Instructor: Dr Chen Nan, E1-05-20
Contact: Phone: 65167914
Email: [email protected]
TA: Xie Jiaohong ([email protected])
Office Hours: By appointment
Textbook: Forecasting, Time Series, and Regression, by Bowerman, O’Connell, and
Koehler
References: Linear Regression Analysis, by George A. F. Seber, Alan J. Lee
Time Series Analysis, by George E. P. Box, Gwilym M. Jenkins, Gregory
C. Reinsel
Prerequisites: IE 5002, IE 6002, programming
Description: This module focuses on the theory and practice of forecasting methods. It
discusses two major categories of forecasting problems, and corresponding
techniques. Extensive hands-on projects will be provided to solve real-life
problems.
1.2 Schedule
Aug 12, 2020 Module logistics, introduction, and reviews
Aug 19, 2020 Regression analysis
Aug 26, 2020 Model checking and diagnosis
Sep 02, 2020 Model evaluation & selection
Sep 09, 2020 Machine learning approaches
Sep 16, 2020 Case Studies
Sep 23, 2020 Recess Week
Sep 30, 2020 Seasonality, regression on time
Oct 07, 2020 Exponential smoothing
Oct 14, 2020 Autocorrelation and ARMA
Oct 21, 2020 Seasonal ARIMA Model
Oct 28, 2020 Neural networks for time series
Nov 04, 2020 Forecasting spatial data
Nov 11, 2020 TBD
1.3 Grading
Grading: Project 0: 10% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Due Aug 30, 2020
Project 1: 30% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Due Sep 27, 2020
Project 2: 30% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Due Nov 15, 2020
Final Exam: 30% . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nov 27, 2020, 14:30-16:30
The projects are based on real problems and real data. The dataset and background information
of the project will be provided. For each project, the submission should include the following items
• A report of not more than 10 pages with 1.5 spacing (soft copies and hard copies), which documents the methods used, main findings, and interpretations. Codes and software printouts should NOT be included in the report.
• Complete codes used for the analysis, with reasonable details of comments (Soft copies only)
• Forecasting results on the test dataset in a “csv” file with a single column, as shown in the
following example
A0001124H
10.31
8.5
20.1
...
11.5
2 Introduction
2.1 Definitions
Definition 2.1 Predictions of future events and conditions are called forecasts, and the act of
making such predictions is called forecasting.
Forecasting is very important in many types of organizations since predictions of future events
must be incorporated into the decision-making process. Examples include
• Government needs to forecast such things as air quality, water quality, unemployment rate, inflation rate, and welfare payments, etc.
• A university needs to forecast its enrollment, temperature, and broken assets.
• Business firms need to forecast demand to plan sales and production strategies, to forecast interest rates for financial planning, to forecast the number of workers required for human resource planning, and to forecast the quality of the product for process improvement and quality control.
To forecast events that will occur in the future, one must rely on information concerning events
that have occurred in the past. Based on the type of information used, there are two categories of
forecasting methods.
1. Qualitative forecasting methods: use the opinions of experts to subjectively predict future events. They are often required when historical data are not available or are scarce. Examples include
(a) Delphi Method: Use a panel of experts to produce predictions concerning a specific
question such as when a new development will occur in a particular field. The panel
members are kept physically separated. After the first questionnaire has been completed
and sent, subsequent questionnaires are accompanied by information concerning the
opinions of the group as a whole.
(b) Technological comparisons: are used in predicting technological change. It determines a
pattern of change in one area, called a primary trend, which the forecaster believes will
result in new developments being made in some other area. A forecast of developments
in the second area can then be made by monitoring developments in the first area.
(c) Subjective curve fitting: The forecaster subjectively determines the form of the curve to be used, and a great deal of expertise and judgment is required.
2. Quantitative forecasting methods: involve the analysis of historical data in an attempt to predict future values of a variable of interest. The methods often depend on the type of data available.
Definition 2.2 Cross-sectional data are values observed at one point in time; A time
series is a chronological sequence of observations on a particular variable.
(a) Causal methods: involve the identification of other variables that are related to the variable to be predicted. They develop a statistical model that describes the relationship between these variables and the variable to be forecasted. For example, the sales of a product might be related to the price of the product, competitors' prices for similar products, and advertising expenditures to promote the product.
(b) Time series methods: make predictions of future values of a time series solely on the basis of the past values of the time series. They try to identify a pattern in the historical data, which is extrapolated in order to make a forecast. It is assumed that the pattern will continue in the future. For example, one predicts the temperature tomorrow based solely on the temperatures in past days.
2.2 Deterministic vs Stochastic Relations
In a stochastic (statistical) relation, two variables are related, but not perfectly. Examples include:
• Height and weight: as height increases, you'd expect weight to increase, but not perfectly.
• Alcohol consumed and blood alcohol content: as alcohol consumption increases, you’d expect
one’s blood alcohol content to increase, but not perfectly.
• Vital lung capacity and pack-years of smoking: as amount of smoking increases (as quantified
by the number of pack-years of smoking), you’d expect lung function (as quantified by vital
lung capacity) to decrease, but not perfectly.
• Driving speed and gas mileage: as driving speed increases, you’d expect gas mileage to
decrease, but not perfectly.
It is also noted that the boundary between deterministic and stochastic relationships might not be clear in some scenarios. For example, depending on the accuracy and precision requirements, Newton's laws in physics can be viewed as deterministic relations in some cases, but can only serve as approximations to the theory of relativity in some other cases. It is also possible that the stochastic or random elements observed are simply due to some unknown variables in a deterministic relation.
2.3 Errors in Forecasting
A forecast can be reported in two forms:
• Point forecast: a single number representing our "best" prediction of the actual value
• Prediction interval forecast: an interval of numbers that will contain the actual value with a certain confidence (e.g., 95%)
To evaluate the performance or accuracy of the forecasting methods, certain criteria shall be
used. For point forecast, a natural way is to calculate the forecast error.
Definition 2.3 The forecast error for a particular forecast ŷ of a quantity of interest y is
e = y − ŷ
If the forecast is accurate, the error is small stochastically (small mean and small variance). In general, e cannot be zero, and can be large even for a good forecasting method in "unlucky" cases. Therefore, it is important to measure the magnitude of the errors over time or over different samples to evaluate the forecasting method.
Definition 2.4 The mean absolute deviation (MAD) of the forecasting is defined as
MAD = (1/n) Σ_{i=1}^n |yi − ŷi|.
To compare the forecast on quantities of different scales, relative errors can be adopted by
normalizing the error by the value to be forecasted.
Definition 2.5 The mean absolute percentage error (MAPE) of a method is defined as
MAPE = (1/n) Σ_{i=1}^n |(yi − ŷi)/yi|.
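As a quick illustration, the following is a minimal numpy sketch of Definitions 2.4 and 2.5; the toy observations and forecasts below are made up for the example.

import numpy as np

def mad(y, y_hat):
    """Mean absolute deviation of the forecasts."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    """Mean absolute percentage error; assumes y contains no zeros."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs((y - y_hat) / y))

# toy example
y_true = [10.0, 12.0, 9.5, 11.0]
y_pred = [9.0, 12.5, 10.0, 10.5]
print(mad(y_true, y_pred), mape(y_true, y_pred))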
Different from point forecast, prediction interval forecast is often an interval of values. Its
performance depends on two factors.
Definition 2.6 Coverage probability is the proportion of the time that a confidence interval
contains the true value of interest. Length of the interval is simply the difference in the two
endpoints.
Ideally, for a 95% prediction interval, the interval should have coverage probability 0.95, i.e., con-
taining the true value 95% of the time. On the other hand, the length of the interval indicates
the precision of the forecast. Given the same coverage probability, the shorter the interval length, the better.
3 Linear Regression
3.1 Overview
Galton was a pioneer in the application of statistical methods. In studying data on relative sizes
of parents and their offspring in various species of plants and animals, he observed the following
phenomenon: a larger-than-average parent tends to produce a larger-than-average child, but the
child is likely to be less large than the parent in terms of its relative position within its own
generation. Galton termed this phenomenon a regression toward mediocrity, which in modern
terms is a regression to the mean.
Regression to the mean is an inescapable fact of life. Your children can be expected to be less
exceptional (for better or worse) than you are. Your score on a final exam in a course can be
expected to be less good (or bad) than your score on the midterm exam, relative to the rest of
the class. The key word here is “expected”. This does not mean it’s certain that regression to the
mean will occur, but it has a higher chance than not. For a detailed account of this, please refer to
http://www.socialresearchmethods.net/kb/regrmean.php.
Linear regression analysis is the most widely used of all statistical techniques: it is the study of
linear, additive relationships between variables. Even though the linear condition seems restrictive,
it has some practical justifications:
• the “true” relationships between variables are often approximately linear, at least over a range
of values;
• Even for some nonlinear relationships, we can often transform the variables in a way to
linearize the relationships.
3.2 Simple linear regression
The simple linear regression model relates a response Y to a single predictor X through
Y = β0 + β1 X + ε, (3.1)
where Eε = 0 and σ² = var(ε) < ∞. ε indicates the observation error, noise, or uncertainties that cannot be accounted for by the linear relation β0 + β1 X. Here "linear" is to quantify the relationship in the parameters, which means the partial derivative w.r.t. β should be free of all parameters.
3.2.1 Least square estimation
The linear relation (linear model) (3.1) has three parameters β0 , β1 , σ 2 . They have clear physical
interpretations. However, in practice their values might not be available, and need to be estimated
from observations.
Assume that we have n observations (xi, yi). We want to find a straight line that "best" forecasts (approximates) these n points. For any given values β0 = a, β1 = b, a natural point forecast of the response given predictor X is simply a + bX, the conditional expectation E(Y|X) under these parameter values. Recall that the commonly used forecasting error MSE is defined as
MSE = (1/n) Σ_{i=1}^n (yi − ŷi)² = (1/n) Σ_{i=1}^n (yi − a − bxi)².
The least square estimates of the parameters are defined as the values of β0, β1 that minimize the MSE:
(β̂0, β̂1) = arg min_{a,b} (1/n) Σ_{i=1}^n (yi − a − bxi)². (3.2)
By taking the derivatives with respect to a, b, we can get the analytical expressions for β̂0, β̂1 as
β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²,  β̂0 = ȳ − β̂1 x̄. (3.3)
The error variance σ² is estimated by
σ̂² = (1/(n − 2)) Σ_{i=1}^n (yi − β̂0 − β̂1 xi)², (3.4)
where the term Σ_{i=1}^n (yi − β̂0 − β̂1 xi)² is often named the sum of squared errors (SSE). The term (n − 2) is to make the estimation unbiased, Eσ̂² = σ². σ̂² quantifies the uncertainty of the regression line, and is related to goodness-of-fit as well.
Remark: There are a few interesting notes for the least square estimation in simple linear regression.
• The estimated regression line Y = β̂0 + β̂1 X passes through the point (x̄, ȳ).
• The estimated slope β̂1 is closely related to the correlation coefficient between X and Y. Recall that the sample correlation is defined as
ρ̂ = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / √(Σ_{i=1}^n (xi − x̄)² · Σ_{i=1}^n (yi − ȳ)²),
and it can be shown that β̂1 = ρ̂ · SY/SX, where SY², SX² are the sample variances of Y, X respectively.
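The closed-form estimates in (3.3)-(3.4) are easy to compute directly. Below is a minimal numpy sketch on simulated data (the data-generating line and noise level are assumptions for illustration); it also checks numerically that β̂1 = ρ̂·SY/SX.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=50)   # simulated data

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# unbiased estimate of the error variance, SSE / (n - 2)
resid = y - beta0_hat - beta1_hat * x
sigma2_hat = np.sum(resid ** 2) / (len(x) - 2)

# the slope is the sample correlation rescaled by the sample standard deviations
rho_hat = np.corrcoef(x, y)[0, 1]
print(beta1_hat, rho_hat * y.std(ddof=1) / x.std(ddof=1))  # these two agree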
3.2.2 Accuracy of the estimation
The aforementioned results do not require a specific form of the error distribution. However, to assess the accuracy of the estimation, and to construct confidence intervals, it is necessary to know the distribution of ε. A commonly used assumption is that the εi follow the normal distribution, εi ∼ N(0, σ²), independently.
Based on the linear regression model Y = β0 + β1 X + ε, we can also conclude that the conditional distribution of Y given X is N(β0 + β1 X, σ²). It is clear that the regression part β0 + β1 X models the mean (expectation) of the response, and the error term quantifies the uncertainty in the prediction (modeling).
If the yi are independent, and the true relationship follows the model (3.1), it can be derived that the parameter estimates (3.3) follow normal distributions with
β̂1 ∼ N(β1, σ²/Σ_{i=1}^n (xi − x̄)²),  β̂0 ∼ N(β0, (1/n + x̄²/Σ_{i=1}^n (xi − x̄)²) σ²). (3.5)
This shows that the least square method can estimate the true parameters "on average". However, given a finite number of samples, there exists uncertainty in estimating the true parameters. The magnitude of the uncertainty depends on two factors: the number of samples n, and the scatter of the predictor values xi.
Similarly, we can find the distribution of σ̂² under the normal assumption. Using the formula in (3.4), it can be derived that
σ̂² ∼ (σ²/(n − 2)) · χ²_{n−2}. (3.6)
As a result, we have Eσ̂² = σ².
The distributions of the estimated parameters also allow us to construct confidence intervals for such estimates. From (3.5) and (3.6), we can find that
(β̂1 − β1)/√(var(β̂1)) ∼ t_{n−2},  (β̂0 − β0)/√(var(β̂0)) ∼ t_{n−2}, (3.7)
where the variances are estimated by plugging σ̂² into (3.5). In addition, the two estimates are correlated, with
Cov(β̂0, β̂1) = −x̄ σ² / Σ_{i=1}^n (xi − x̄)². (3.8)
• When the number of samples is large enough (n → ∞), we can expect β̂0 , β̂1 , σ̂ 2 all converge
to the true values.
3.2.3 Point forecasting
Given the estimated parameters based on the least square method, we can make a point forecast at any value of the predictor X. In fact, the "best" prediction is the conditional mean. Given X = x∗, the point forecast is simply
y∗ = β̂0 + β̂1 x∗.
Essentially, the forecasted value of the response, in terms of deviation from the mean, is proportional
to the deviation of the predictor value from its mean.
Using the relationship between β̂1 and the correlation ρ̂, we can also write the forecast as
(y∗ − ȳ)/SY = ρ̂ · (x∗ − x̄)/SX, (3.9)
where (y∗ − ȳ)/SY and (x∗ − x̄)/SX can be viewed as "standardized" values. Since |ρ̂| is at most 1, this also explains "regression to the mean" technically. In particular, our prediction
for standardized y ∗ is typically smaller in absolute value than our observed value for standardized
x∗ . That is, the prediction for Y is always closer to its own mean, in units of its own standard
deviation, than X was observed to be, which is Galton’s phenomenon of regression to the mean.
The perfect positive correlation (ρ = 1) or perfect negative correlation (ρ = −1) is only obtained
if one variable is an exact linear function of the other, without error, i.e., Y = β0 + β1 X. In this
case, the relationship between X and Y becomes deterministic rather than stochastic.
3.2.4 Interval Forecasting
When we use the estimated parameters to make forecasts, we need to consider the uncertainty in the estimated parameters. In particular, when the predictor has value X = x∗, the conditional mean of the response is forecast by y∗ = β̂0 + β̂1 · x∗, as shown in Section 3.2.3. Using the distributional information on β̂, we can also find the distribution of y∗. Because both β̂0 and β̂1 are normally distributed (3.5), we can derive
y∗ | x∗ ∼ N(β0 + β1 x∗, (1/n + (x∗ − x̄)²/Σ_{i=1}^n (xi − x̄)²) σ²). (3.10)
It is clear that the parameter uncertainty propagates to the forecast uncertainty. It still has the
true mean “on average”, but with additional variation as the variance component in the normal
distribution (3.10). The form of the variance also suggests that the forecasting accuracy (or equivalently the magnitude of the variance) depends on the following factors: (a) the sample size n used in estimation; (b) the scatter of the observations xi; (c) the distance between the forecast point x∗ and the data center x̄.
From (3.10), we can construct the prediction interval forecast at the 1 − α confidence level:
β̂0 + β̂1 x∗ ± t_{n−2,α/2} σ̂ √(1/n + (x∗ − x̄)²/Σ_{i=1}^n (xi − x̄)²). (3.11)
It is noted that the interval in (3.11) is for the mean value of the response at x∗. It is different from the prediction interval of the response, which includes another error term ε. Consequently, the prediction interval for the response is
β̂0 + β̂1 x∗ ± t_{n−2,α/2} σ̂ √(1 + 1/n + (x∗ − x̄)²/Σ_{i=1}^n (xi − x̄)²). (3.12)
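A small sketch of how (3.11) and (3.12) can be evaluated numerically with numpy and scipy; the simulated data, the query point x∗ = 7.5, and the level α = 0.05 are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 40)
y = 1.0 + 0.8 * x + rng.normal(0, 1, 40)

n = len(x)
x_bar = x.mean()
b1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
b0 = y.mean() - b1 * x_bar
sigma_hat = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

x_star, alpha = 7.5, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
center = b0 + b1 * x_star
lever = 1 / n + (x_star - x_bar) ** 2 / np.sum((x - x_bar) ** 2)

ci_mean = (center - t_crit * sigma_hat * np.sqrt(lever),
           center + t_crit * sigma_hat * np.sqrt(lever))        # eq. (3.11)
pi_resp = (center - t_crit * sigma_hat * np.sqrt(1 + lever),
           center + t_crit * sigma_hat * np.sqrt(1 + lever))    # eq. (3.12)
print(ci_mean, pi_resp)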
3.2.5 Other notable notes
1. Model assumptions: most results discussed above rely on some important assumptions on the data. In addition to the linear form of the mean β0 + β1 X, the error terms εi must be independent and normally distributed, with the same variance. These assumptions must be checked after the model estimation (discussed later). In many problems, these assumptions might be severely violated. In these cases, the model needs to be revised before reaching meaningful results.
2. Transformation: The linear constraint does not imply the relationship can only be a straight line. In fact, different transformations of x can be used (e.g., x², ln(x), √x). The results discussed above still hold with transformed variables. The transformation can be inspired by data features, or guided by first principles.
3. Outliers: The least square method minimizes the sum of squared errors; as a result it is sensitive to outliers. A single outlier can drive the estimated parameters far from their true values. Therefore, it is important to recognize outliers, and especially to distinguish between outliers and legitimately large values.
• Leverage: is a measure of how far away the predictor values of an observation are from
those of the other observations
• Outliers are values that cause surprise in relation to the majority
• Influential observations have a relatively large effect on the regression model’s predic-
tions
3.3 Multiple linear regression
The multiple linear regression model relates the response to p predictors:
Y = β0 + β1 x1 + β2 x2 + · · · + βp xp + ε.
This model describes how the mean response E(Y ) changes with the explanatory variables. The
observed values for Yi vary about their means and are assumed to have the same standard deviation
σ. The parameters of the model include β0 , β1 , · · · , βp , σ 2 .
For concise representation, we often write the model in matrix/vector form. Define β = [β0, β1, · · · , βp] and xi = [1, xi1, · · · , xip], then we have
Yi = xi · β + εi, (3.13)
with E(εi) = 0, var(εi) = σ² again. Throughout the handout, we will use bold symbols to represent vectors or matrices. Combining all observations together, we have
[Y1]   [1  x11  · · ·  x1p]   [β0]   [ε1]
[Y2]   [1  x21  · · ·  x2p]   [β1]   [ε2]
[⋮ ] = [⋮   ⋮    ⋱     ⋮ ] · [⋮ ] + [⋮ ]
[Yn]   [1  xn1  · · ·  xnp]   [βp]   [εn]
or,
Y = Xβ + ε. (3.14)
3.3.1 Why multiple regression?
Simple linear regression only allows a single predictor in modeling the response. This might be too restrictive in many cases. The following examples illustrate the need for multiple linear regression.
I. Complex relation with a single predictor variable. Even when the response is related to a single predictor variable, the relation might be more complex than a straight line. Consider a commonly used model in practice, the polynomial regression
Yi = β0 + β1 xi + β2 xi² + εi.
The response changes with x in a quadratic way. Only in special cases (β1 = 0 or β2 = 0) can we use simple linear regression to estimate the parameters.
II. Relation with qualitative predictors. Simple linear regression implies the predictor is a quantitative (continuous) variable, so that multiplication and addition have meaning. However, we often encounter qualitative variables, such as gender, color, race, etc. They are often called attribute variables or factors. Since they do not have a natural ordering, numerical operations on such variables lose their validity.
Take Race with four values (Chinese, Malay, Indian, Caucasian) as an example. Instead of doing regression directly on this variable, some dummy variables can be created.
Original Transformed
X Chinese(R1) Malay(R2) Indian(R3) Caucasian(R4)
Chinese 1 0 0 0
Caucasian 0 0 0 1
Indian 0 0 1 0
Indian 0 0 1 0
Chinese 1 0 0 0
Malay 0 1 0 0
S X E M X1 X2 X3
0 13876 1 1 1 1. 0. 0.
1 11608 1 3 0 1. 0. 1.
2 18701 1 3 1 1. 0. 1.
3 11283 1 2 0 1. 1. 0.
4 11767 1 3 0 1. 0. 1.
Here "S" denotes salary, "E" denotes education level, "M" denotes management or non-management, and "X" denotes experience.
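In practice, dummy variables of this kind are rarely typed by hand. A minimal pandas sketch (the small Race column below is made up for illustration):

import pandas as pd

df = pd.DataFrame({"Race": ["Chinese", "Caucasian", "Indian", "Indian", "Chinese", "Malay"]})

# one indicator column per level; drop_first=True would drop one level
# to avoid perfect collinearity with the intercept
dummies = pd.get_dummies(df["Race"], prefix="R")
print(pd.concat([df, dummies], axis=1))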
III. Multiple variables with non-separable effects. More commonly, a response variable is influenced by multiple predictors. It is generally not sensible to quantify their influence one by one using simple linear regression. In addition, sometimes explicit interactions between two variables are required. The interactions will be discussed in more detail later. An example here is the joint effects of TV and Radio (and Newspaper) advertising on sales: Y = 2.94 + 0.046·TV + 0.19·Radio − 0.001·Newspaper.
3.3.2 Interactions between variables
There are two implicit assumptions when formulating the multiple linear regression: (1) Effects of different predictors are additive. (2) If x1 changes by ∆x1, the mean response always changes by β1∆x1, regardless of the other predictors. In statistics, an interaction may arise when considering the relationship among three or more variables, and describes a situation in which the simultaneous influence of two variables on a third is not additive.
The presence of interactions can have important implications for the interpretation of statistical models. If two variables of interest interact, the relationship between each of the interacting
variables and the response variable depends on the value of the other interacting variable. In prac-
tice, this makes it more difficult to predict the consequences of changing the value of a variable,
particularly if the variables it interacts with are hard to measure or difficult to control.
Real-world examples of interaction include:
• Interaction between adding sugar to coffee and stirring the coffee. Neither of the two indi-
vidual variables has much effect on sweetness but a combination of the two does.
• Interaction between adding carbon to steel and quenching. Neither of the two individually
has much effect on strength but a combination of the two has a dramatic effect.
• Interaction between smoking and inhaling asbestos fibres: Both raise lung carcinoma risk,
but exposure to asbestos multiplies the cancer risk in smokers and non-smokers. Here, the
joint effect of inhaling asbestos and smoking is higher than the sum of both effects.
• Interaction between genetic risk factors for type 2 diabetes and diet (specifically, a “western”
dietary pattern). The western dietary pattern was shown to increase diabetes risk for subjects
with a high “genetic risk score”, but not for other subjects.
To recognize the possible interactions between two variables, we can explore their relation
graphically. There are three major types of interactions.
===============================================================
coef std err t P>|t|
---------------------------------------------------------------
Intercept 9472.6854 80.344 117.902 0.000
C(E)[T.2] 1381.6706 77.319 17.870 0.000
C(E)[T.3] 1730.7483 105.334 16.431 0.000
C(M)[T.1] 3981.3769 101.175 39.351 0.000
C(E)[T.2]:C(M)[T.1] 4902.5231 131.359 37.322 0.000
C(E)[T.3]:C(M)[T.1] 3066.0351 149.330 20.532 0.000
X 496.9870 5.566 89.283 0.000
===============================================================
3.3.3 Least square estimation
When n observations (xi, yi) are collected, we can estimate the model parameters to identify influential variables or to make forecasts. Following the same criterion of minimizing the MSE, we can estimate the parameters by
β̂ = arg min_β (1/n) Σ_{i=1}^n (yi − xi · β)².
In matrix form, the model is Y = Xβ + ε, and the least square criterion reduces to minimizing the vector norm of the difference,
β̂ = arg min_β ||Y − Xβ||²,
where ||Y|| = √(y1² + y2² + · · · + yn²) is the 2-norm of the vector. Using matrix calculus (https://en.wikipedia.org/wiki/Matrix_calculus), we can show that β̂ again has the analytical expression
β̂ = (XᵀX)⁻¹XᵀY,
where Xᵀ and X⁻¹ represent the transpose and inverse of a matrix, respectively. β̂ is unique as long as XᵀX is full rank (or invertible).
Example: In simple linear regression, we can express these quantities in matrix form by
X = [1  x1
     1  x2
     ⋮   ⋮
     1  xn],
XᵀX = [n              Σ_{i=1}^n xi
       Σ_{i=1}^n xi   Σ_{i=1}^n xi²],
(XᵀX)⁻¹ = 1/(n Σ_{i=1}^n xi² − (nx̄)²) · [Σ_{i=1}^n xi²   −nx̄
                                           −nx̄            n  ].
With some algebraic transformation, we can get consistent results as those in Section 3.2.
The natural way to estimate the variance σ² is
σ̂² = ||Y − Xβ̂||² / (n − p − 1).
Again the term ||Y − Xβ̂||², or equivalently Σ_{i=1}^n (yi − xi · β̂)², is called the sum of squared errors (SSE).
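A minimal numpy sketch of the normal-equation estimate β̂ = (XᵀX)⁻¹XᵀY and of σ̂², on simulated data (the true coefficients and noise level are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X0 = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), X0])             # design matrix with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(0, 0.5, n)

# closed form (X'X)^{-1} X'y; np.linalg.solve avoids forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)          # SSE / (n - p - 1)
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)    # estimated covariance of beta_hat
print(beta_hat, np.sqrt(np.diag(cov_beta)))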
In fact, β̂ is generally a good way to estimate the model parameters as stated in the following
theorem.
Theorem 3.1 (Gauss-Markov) In a linear regression model in which the εi have expectation zero, equal variances, and are uncorrelated, the ordinary least square estimator β̂ is the best linear unbiased estimator (BLUE). Furthermore, if the εi are normally distributed, β̂ is the best among all unbiased estimators.
Similar to the case of simple linear regression, when the data are assumed normally distributed, we can get the distribution of β̂. In more detail, β̂ follows the multivariate normal distribution
β̂ ∼ MVN(β, (XᵀX)⁻¹σ²).
Note that this result not only gives the marginal distribution of each component of β̂, it also provides the covariance among different components. The results in simple linear regression are in fact its special case.
There are some remarks about the estimation of β̂: (1) The accuracy of β̂ depends on the sample size, and also the scatter of the xi; (2) Sometimes XᵀX is singular, and then β is not estimable and the coefficients are not interpretable (collinearity); (3) A linear transformation of β̂ is still normal: Aβ̂ ∼ MVN(Aβ, A(XᵀX)⁻¹Aᵀσ²).
3.3.4 Confidence & prediction intervals
With the estimated β̂, its covariance (XᵀX)⁻¹σ², and σ̂², we can construct confidence intervals for the parameters. For each component β̂j of β̂, its 1 − α confidence interval can be expressed as
β̂j ± t_{n−p−1,α/2} σ̂ √d_jj, (3.18)
where d_jj is the (j, j)th diagonal element of (XᵀX)⁻¹. In other words, the standard deviation of β̂j is simply σ√d_jj.
More importantly, with the covariance matrix of β̂, we can construct the confidence interval (or
region) for multiple components of β̂ together, or some linear combinations of the components.
• What is the 95% joint confidence region for (β0 , β1 )?
• What is the confidence interval for the difference β1 − β2 ?
A general way can be found using the properties of the multivariate normal distribution. For any matrix A with rank q, we know that
(Aβ̂ − Aβ)ᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(Aβ̂ − Aβ) / (qσ̂²) ∼ F_{q,n−p−1},
where F_{q,n−p−1} is the F distribution with degrees of freedom q and n − p − 1. The 1 − α level confidence region for Aβ is thus defined as the set of q-dimensional points satisfying
{b ∈ R^q : (Aβ̂ − b)ᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(Aβ̂ − b) ≤ qσ̂² · F_{q,n−p−1,α}}. (3.19)
I. The confidence region for all the parameters β. In this case, A = I_{p+1} with rank p + 1, and the confidence region becomes
{b ∈ R^{p+1} : (β̂ − b)ᵀ(XᵀX)(β̂ − b) ≤ (p + 1)σ̂² · F_{p+1,n−p−1,α}}.
II. The confidence interval for the mean response at a new predictor value x∗. The point forecast is straightforward, with mean response y∗ = x∗β̂ based on (3.13). To further obtain a confidence interval for this value, we can use (3.19) with A = x∗ of rank q = 1. Equivalently, using the relation between F_{1,n−p−1} and t_{n−p−1}, we can get the more explicit form
y∗ ± t_{n−p−1,α/2} σ̂ √(x∗(XᵀX)⁻¹(x∗)ᵀ).
III. Other useful comparisons. For example, if β1 and β2 represent the coefficients of two predictor variables, β1 − β2 is a measure of their relative effects on the response. To get the confidence region for (β1 − β2, β2 − β3), we can define A with rank q = 2 as
A = [0  1  −1   0  · · ·  0
     0  0   1  −1  · · ·  0],
and then apply (3.19).
4 Model Checking and Diagnosis
The linear regression model relies on the following assumptions:
1. The relationship between the outcomes and the predictors is (approximately) linear.
2. The error term has zero mean.
3. The error term has constant variance.
4. The errors are uncorrelated.
5. The errors are normally distributed or we have an adequate sample size to rely on large
sample theory.
We should always check the fitted models to make sure that these assumptions have not been
violated.
4.1 Residuals
The diagnostic methods we'll be exploring are based primarily on the residuals. Recall that the residual is defined as
ei = yi − ŷi, i = 1, . . . , n,
where ŷi = xi β̂ (in vector form, Ŷ = Xβ̂). If the model is appropriate, it is reasonable to expect the residuals to exhibit properties that agree with the stated assumptions.
According to the definition of the residuals, it is easy to show that the mean of the residuals is 0,
ē = (1/n) Σ_{i=1}^n ei = 0.
Precisely speaking, the ei, i = 1, · · · , n are not independent random variables. In general, if the number of residuals (n) is large relative to the number of predictor variables (p), the dependency can be ignored for all practical purposes in an analysis of residuals.
To analyze the residuals in different contexts, it is also common to "standardize" the residuals by dividing by their standard deviations. Using matrix form, we can write the residual vector as
e = Y − Ŷ = Y − Xβ̂ = (I − X(XᵀX)⁻¹Xᵀ)Y = (I − H)Y.
The term H = X(XᵀX)⁻¹Xᵀ is often called the hat matrix, and plays a crucial role in linear regression analysis and model diagnostics. Using this notation, the covariance matrix of e is simply (I − H)σ². As a result, the studentized residual is defined as
ri = ei / (σ̂ √(1 − hii)), (4.1)
where hii is the ith diagonal element of the hat matrix H. When the assumptions of the linear model hold, ri follows a t distribution with n − p − 1 degrees of freedom. Consequently, ri is free from the measurement scales in different contexts.
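A short numpy sketch computing the hat matrix, the leverages hii, and the studentized residuals of (4.1) on simulated data (the design and coefficients are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(0, 1, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat                              # raw residuals, e = (I - H) y
sigma_hat = np.sqrt(e @ e / (n - p - 1))

h = np.diag(H)                                    # leverages h_ii
r = e / (sigma_hat * np.sqrt(1 - h))              # studentized residuals, eq. (4.1)
print(r[:5], h.max())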
4.2 Diagnostics using residuals
4.2.1 Major graphical tools
Residual analysis is usually done graphically. We describe the major plots as follows:
• QQ plot to check normality
• Scatter plot to check linearity and variance
• Autocorrelation plot to check independence
[Figures: a QQ plot whose departure from the reference line indicates non-normal residuals (the model is insufficient), and a residual scatter plot showing non-constant variance (e.g., Poisson-type data with mean = variance = λ), suggesting a variance-stabilizing transformation.]
Non-constant variance can often be remedied using appropriate transformations. Ideally,
we would choose the transformation based on some prior scientific knowledge, but this might
not always be available. Some typical choices are listed below
σ² ∝ constant:   y′ = y     (no transformation)
σ² ∝ E(Y):       y′ = √y    (Poisson data)
σ² ∝ E(Y)²:      y′ = ln(y),  y > 0
4.2.2 Tests for certain property
III. Durbin-Watson statistic. In statistics, the Durbin-Watson statistic is a test statistic used to detect the presence of autocorrelation in the residuals from a regression analysis. If et is the residual associated with the observation at time t, then the test statistic is
d = Σ_{t=2}^T (et − et−1)² / Σ_{t=1}^T et²,
where T is the number of observations. Since d is approximately equal to 2(1 − r), where r
is the sample autocorrelation of the residuals, d = 2 indicates no autocorrelation. The value
of d always lies between 0 and 4. If the Durbin-Watson statistic is substantially less than 2,
there is evidence of positive serial correlation. As a rough rule of thumb, if Durbin-Watson is
less than 1.0, there may be cause for alarm.
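A minimal sketch of the Durbin-Watson statistic on two simulated residual sequences, one white and one positively autocorrelated (the AR(1) coefficient 0.7 is an assumption for illustration):

import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic of a residual sequence; about 2 means no autocorrelation."""
    e = np.asarray(e, float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(4)
white = rng.normal(size=200)
# residuals with positive autocorrelation (AR(1) with phi = 0.7)
ar = np.zeros(200)
for t in range(1, 200):
    ar[t] = 0.7 * ar[t - 1] + white[t]
print(durbin_watson(white), durbin_watson(ar))    # ~2 vs. noticeably below 2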
4.3 Outliers, leverage points, influential points, collinearity
4.3.1 Identifying outliers
An outlier is an extreme observation. Depending on their location in the predictor space, outliers
can have severe effects on the regression model. We can use jackknife residuals to identify potential
outliers. Any points that are greater than 3 or 4 standard deviations away from 0 may be considered
potential outliers.
There are several scenarios for outliers.
• "Bad" data that result from unusual but explainable events, e.g., malfunction of a measuring instrument or incorrect recording of data. In this case we should try to retrieve the correct value, but if that's not possible we may need to discard the data point.
• Inadequacies in the model. The model may fail to fit the data well for certain values of the
predictor. In this case it could be disastrous to simply discard outliers.
• Poor sampling of observations in the tail of the distribution. This may be especially true if
the outcome arises from a heavy-tailed distribution.
With a sample size of 60, we might expect 2 or 3 residuals to be further than 2 standard deviations from 0 and none to be more than 3 standard deviations away.
4.3.2 High leverage points
Leverage is a measure of how strongly the data for observation i determine the fitted value Ŷi. If hii is close to 1, the fitted line will usually pass close to (xi, Yi).
The hat matrix,
H = X(XᵀX)⁻¹Xᵀ,
plays an important role in identifying influential observations. The diagonal elements hii = xi(XᵀX)⁻¹xiᵀ, where xi is the ith row of the X matrix, play an especially important role. hii is a standardized measure of the distance between the covariate values of the ith observation and the means of the X values for all n observations.
Also,
0 ≤ hii ≤ 1,  Σ_{i=1}^n hii = p + 1,
where p is the number of predictors. Therefore the average size of a hat diagonal is h̄ = (p + 1)/n. Leverage values greater than 2h̄ are considered high with regard to their xi values, and we would consider such observations high leverage points. The left two pictures below show the leverage values in a simple linear regression; the third picture shows leverage values in a multiple linear regression.
4.3.3 Identifying influential observations
Points that are remote in the predictor space may not influence the estimates of the regression coefficients but may influence other summary statistics, such as R² and the standard errors of the coefficients. These points are called leverage points. Points that have a noticeable effect on the regression coefficients are called influential points. In other words, influence measures the degree to which deletion of an observation changes the fitted model. A high leverage point has the potential to be influential, but is not always influential.
Influence can be measured by Cook's distance. Cook's distance measures the influence of the ith observation on all n fitted values and is given by
Di = (Ŷ − Ŷ(−i))ᵀ(Ŷ − Ŷ(−i)) / ((p + 1)σ̂²),
where Ŷ is the vector of fitted values when all n observations are included and Ŷ(−i) is the
vector of fitted values when the ith observation is deleted. Cook’s D can also be expressed as
Di = ei² hii / [(p + 1)σ̂²(1 − hii)²].
From this expression we see that Di depends on both the size of the residual ei , and the leverage,
hii .
The magnitude of Di is usually assessed by comparing it to Fp+1,n−p−1 . If the percentile value
is less than 10 or 20 %, then the ith observation has little apparent influence on the fitted values. If
the percentile value is greater than 50%, we conclude that the ith observation has significant effect
on the fitted values.
As a general rule, Di values from 0.5 to 1 are high, and values greater than 1 are considered to
be a possible problem.
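A small numpy sketch computing Cook's distance from the formula above; the simulated data set, with one artificially perturbed observation, is an assumption for illustration:

import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(0, 1, n)
y[0] += 8.0                                       # plant an unusual observation

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - p - 1)

# Cook's distance, D_i = e_i^2 h_ii / ((p+1) sigma^2 (1 - h_ii)^2)
D = e ** 2 * h / ((p + 1) * sigma2_hat * (1 - h) ** 2)
print(np.argmax(D), D.max())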
4.3.4 Collinearity
Collinearity (or multicollinearity) refers to strong linear relationships among the predictor variables. Typical indications include:
• Large changes in the estimated regression coefficients when a predictor variable is added or deleted
• Insignificant regression coefficients for the affected variables in the multiple regression, but a
rejection of the joint hypothesis that those coefficients are all zero (using an F-test)
• If a multiple regression finds an insignificant coefficient of a particular explanatory variable,
yet a simple linear regression of the explained variable on this explanatory variable shows its
coefficient to be significantly different from zero, this situation indicates multicollinearity in
the multiple regression
• Some authors have suggested a formal variance inflation factor (VIF) to detect multicollinearity:
VIF_j = 1 / (1 − R_j²),
where R_j² is the R² from regressing the jth predictor on the remaining predictors. The better that fit, the more severe the collinearity. A VIF of 5 or 10 and above indicates a multicollinearity problem.
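A minimal sketch computing VIFs directly from this definition (statsmodels also provides a variance_inflation_factor helper); the nearly collinear predictors below are simulated for illustration:

import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (no intercept column),
    computed from the R^2 of regressing that column on the remaining ones."""
    X = np.asarray(X, float)
    out = []
    for j in range(X.shape[1]):
        xj = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
        r2 = 1 - np.sum((xj - others @ coef) ** 2) / np.sum((xj - xj.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(6)
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)             # nearly collinear with x1
x3 = rng.normal(size=100)
print(vif(np.column_stack([x1, x2, x3])))         # large VIFs for x1 and x2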
Added variable plots are also called partial regression plots, or adjusted variable plots. They allow us to study the marginal relationship of a regression given the other variables that are in the model. For the variable Xj, fit
Y = β0 + β1 X1 + · · · + βj−1 Xj−1 + βj+1 Xj+1 + · · · + βp Xp + ε
and
Xj = α0 + α1 X1 + · · · + αj−1 Xj−1 + αj+1 Xj+1 + · · · + αp Xp + ε,
and then plot Y − Ŷ versus Xj − X̂j.
Some comments on using the plots
– They only suggest possible relationships between the predictor and the response.
– In general, they will not detect interactions between regressors.
– The presence of strong multicollinearity can cause partial regression plots to give incorrect information
1. We want to explain the data in the simplest way. Redundant predictors should be removed.
The principle of Occam’s Razor states that among several plausible explanations for a phe-
nomenon, the simplest is best. Applied to regression analysis, this implies that the smallest
model that fits the data is best.
2. Unnecessary predictors will add noise to the estimation of other quantities that we are inter-
ested in. Degrees of freedom will be wasted.
3. Collinearity is caused by having too many variables trying to do the same job.
4. Cost: if the model is to be used for prediction, we can save time and/or money by not
measuring redundant predictors.
I The "notorious" R²:
R², also called the coefficient of determination, evaluates the percentage of total variation (uncertainty) explained by the regression model. Mathematically, it is defined as
R² = 1 − SSE/TSS = 1 − Σ_{i=1}^n (yi − ŷi)² / Σ_{i=1}^n (yi − ȳ)². (5.1)
Compared with the MSE definition, we can see that MSE = SSE/n. In other words, the smaller the MSE, the closer R² is to 1. It appears R² is a good measure of the forecasting performance.
Unfortunately, there is an inherent problem: the forecasting errors are calculated on the same
dataset as that used for model estimation. As a result, it often under-estimates the forecasting
errors when used in future predictions. In fact, by increasing the number of predictors (relevant
or not), R2 always increases. As a result, it is never used as a criterion to select the “best”
model because only the largest model has the largest R2 .
II Adjusted R²
Since R² always increases as the model size increases, an adjusted R² is proposed, often denoted by R_a². It is defined by
R_a² = 1 − [SSE/(n − p − 1)] / [TSS/(n − 1)] = 1 − ((n − 1)/(n − p − 1))(1 − R²) = 1 − σ̂²_model / σ̂²_null. (5.2)
Because of the adjustment, increasing the model size will increase R², but not necessarily increase R_a². Adding a predictor will only increase R_a² if it has some value in prediction. From another angle, minimizing the standard error of prediction means maximizing R_a². Compared with R², it "penalizes" bigger models.
III Cross-validated forecast error
Cross validation evaluates the forecasting performance on data not used in the model estimation. The k-fold cross validation procedure is as follows.
(a) Randomly divide the data into k non-overlapping subsets of (roughly) equal size.
(b) Select one subset as the testing data, and the remaining k−1 subsets combined as training
data. Estimate the model using the training data, and compute the prediction error (e.g.,
MSE) on the testing data, denoted by MSEi .
(c) Repeat this procedure k times, with each of the k subsets being the testing data once.
(d) Average the k prediction error estimates to get the cross-validated error MSE_CV = (1/k) Σ_{j=1}^k MSE_j.
Compared with other criteria, the cross-validated forecast error is more intuitive and often more effective. However, it requires much more computational effort (k times), except in certain special cases. Common choices of k include k = 5 and k = 10. When k = n, it is more commonly known as leave-one-out cross validation.
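A minimal numpy sketch of k-fold cross validation for a least squares model, following steps (a)-(d); the simulated data and the choice k = 5 are assumptions for illustration:

import numpy as np

def kfold_cv_mse(X, y, k=5, seed=0):
    """k-fold cross-validated MSE for an ordinary least squares model."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((y[test] - X[test] @ coef) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(7)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(0, 1, n)
print(kfold_cv_mse(X, y, k=5))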
IV Information criteria
The information criteria, including Akaike's Information Criterion (AIC) and Schwarz's Bayesian Information Criterion (BIC), are commonly used for model comparison or selection. For linear regression models, they can be reduced to
AIC = n ln(SSE/n) + 2(p + 1), (5.3)
BIC = n ln(SSE/n) + (p + 1) ln(n). (5.4)
We want to minimize AIC or BIC to select the “best” model. Larger models will fit better and
so have smaller SSE. But they also use more parameters. Thus the “best” model will balance
the goodness-of-fit with the model size. BIC penalizes larger models more heavily and so tends to prefer smaller models compared with AIC. AIC and BIC can be used as selection criteria for
other types of model (not limited to regression models).
V Mallows' Cp statistic
Another commonly used criterion is Mallows' Cp,
Cp = SSE_p/σ̂² + 2(p + 1) − n, (5.5)
where σ̂² is from the model with all P predictors and SSE_p indicates the sum of squared errors from a model with p predictors (p + 1 parameters). In a sense, Cp balances the model errors (in terms of SSE) and the number of predictors used (in terms of p). Cp has the following properties in model selection:
• Cp is easy to compute
• It is closely related to Ra2 and the AIC.
• For the full model Cp = P + 1 exactly.
• If a model with p parameter fits the data, then E(Cp ) ≈ p. A model with a bad fit will
have Cp much larger than p.
It is usual to plot Cp against p. We desire models with small p and Cp around or less than p.
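The criteria (5.3)-(5.5) are straightforward to compute from the SSE of each candidate model. Below is a small numpy sketch on simulated data with nested candidate models (the data-generating coefficients, with two irrelevant predictors, are assumptions for illustration):

import numpy as np

def sse(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coef) ** 2)

rng = np.random.default_rng(8)
n = 120
Z = rng.normal(size=(n, 4))                        # 4 candidate predictors (last 2 irrelevant)
y = 1.0 + 2.0 * Z[:, 0] - 1.0 * Z[:, 1] + rng.normal(0, 1, n)

X_full = np.column_stack([np.ones(n), Z])
sigma2_full = sse(X_full, y) / (n - Z.shape[1] - 1)   # sigma^2 from the full model, for Cp

for p in range(1, Z.shape[1] + 1):                 # nested models using the first p predictors
    Xp = np.column_stack([np.ones(n), Z[:, :p]])
    s = sse(Xp, y)
    aic = n * np.log(s / n) + 2 * (p + 1)          # eq. (5.3)
    bic = n * np.log(s / n) + (p + 1) * np.log(n)  # eq. (5.4)
    cp = s / sigma2_full + 2 * (p + 1) - n         # eq. (5.5)
    print(p, round(aic, 1), round(bic, 1), round(cp, 1))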
5.2 Selecting the regression model
When we have many predictors (with many possible interactions), it can be difficult to find a good
model. It can be challenging to decide which main effects to include, and which interactions to include. Model selection tries to "simplify" this task. However, this is still an "unsolved" problem
in statistics. There are no magic procedures to get you the “best model.”
II Greedy search
When the number of predictors becomes large, it is not feasible to conduct all subset selection,
especially when interactions and transformations of variables should be considered. In this
case, some greedy search (or other heuristic methods) should be used to find the “best” model
by certain evaluation criterion.
(a) Backward Elimination
• Start with all the predictors in the model;
• For each predictor in the model, check the model performance if it is removed;
• Remove the predictor leading to the largest improvement in performance;
• Refit the model and goto Step 2;
• Stop when no more improvement can be made by removing predictors;
(b) Forward Selection
Forward selection reverses the backward method.
• Start with no variables in the model;
• For all predictors not in the model, check the model performance if they are added
to the model;
• Choose the one leading to largest improvement, and include it in the model;
• Continue until no new predictors can be added.
(c) Stepwise Regression
This is a combination of backward elimination and forward selection. This addresses the
situation where variables are added or removed early in the process and we want to change
our mind about them later. At each stage a variable may be added or removed and there
are several variations on exactly how this is done.
Greedy procedures are relatively cheap computationally but they do have some drawbacks.
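A minimal sketch of forward selection using BIC as the evaluation criterion; the simulated predictors and the stopping rule (stop when BIC no longer improves) are illustrative assumptions:

import numpy as np

def forward_selection(Z, y, names):
    """Greedy forward selection using BIC as the evaluation criterion."""
    n = len(y)

    def bic(cols):
        X = np.column_stack([np.ones(n)] + [Z[:, c] for c in cols]) if cols else np.ones((n, 1))
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        s = np.sum((y - X @ coef) ** 2)
        return n * np.log(s / n) + (len(cols) + 1) * np.log(n)

    selected, best = [], bic([])
    while True:
        candidates = [c for c in range(Z.shape[1]) if c not in selected]
        if not candidates:
            break
        scores = {c: bic(selected + [c]) for c in candidates}
        c_best = min(scores, key=scores.get)
        if scores[c_best] >= best:                 # stop when no predictor improves BIC
            break
        selected.append(c_best)
        best = scores[c_best]
    return [names[c] for c in selected]

rng = np.random.default_rng(9)
Z = rng.normal(size=(200, 5))
y = 2 * Z[:, 0] - Z[:, 2] + rng.normal(0, 1, 200)
print(forward_selection(Z, y, ["x1", "x2", "x3", "x4", "x5"]))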
Hypothesis testing can be used to formally test whether a predictor (or its transformation and
interaction with other predictors) is statistically significant in predicting the mean response. In the
general framework, it includes a few commonly used special cases. Some of the results have been
summarized before in different places.
Hypothesis testing allows us to carry out inferences about population parameters using data
from a sample. In order to test a hypothesis in statistics, we must perform the following steps: 1)
Formulate a null hypothesis and an alternative hypothesis on population parameters; 2) Build a
statistic to test the hypothesis made; 3) Define a decision rule to reject or not to reject the null
hypothesis.
It is very important to remark that hypothesis testing is always about population parameters.
Hypothesis testing implies making a decision, on the basis of sample data, on whether to reject
that certain restrictions are satisfied by the basic assumed model. The restrictions we are going
to test are known as the null hypothesis, denoted by H0 . Thus, null hypothesis is a statement on
population parameters.
The details of the testing process are shown below.
1. State the relevant null and alternative hypotheses;
2. Consider the statistical assumptions being made about the sample in doing the test;
3. Decide which test is appropriate, and state the relevant test statistic T;
4. Derive the distribution of the test statistic under the null hypothesis;
5. Select a significance level α;
6. Compute from the observations the observed value of the test statistic T;
7. Decide to either reject the null hypothesis in favor of the alternative or not reject it.
I Testing a single βj = 0
Using the results on multiple linear regression in Chapter 3, we know that the least square estimate β̂j follows a normal distribution with mean βj and corresponding variance. In addition, we have
(β̂j − βj)/SE(β̂j) ∼ t_{n−p−1},
when there are p predictors in the model. Therefore, the natural way to test H0: βj = 0 versus H1: βj ≠ 0 is to use the statistic T = β̂j/SE(β̂j), with decision rule
|T| > t_{n−p−1,α/2}: reject the null hypothesis;
|T| ≤ t_{n−p−1,α/2}: do not reject the null hypothesis,
where α is the significance level. For most regression outputs, the test values beside each predictor indicate such test results:
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 1.5925 1.389 1.146 0.252 -1.130 4.315
x1 -0.0016 0.001 -1.072 0.284 -0.004 0.001
x2 0.0006 0.008 0.073 0.942 -0.015 0.016
x3 9.349e-05 4.02e-05 2.323 0.020 1.46e-05 0.000
x4 -0.0003 0.008 -0.035 0.972 -0.016 0.016
x5 -0.0001 4.51e-05 -2.694 0.007 -0.000 -3.31e-05
x6 -0.0001 9.08e-05 -1.184 0.236 -0.000 7.04e-05
x7 6.249e-08 5.81e-08 1.076 0.282 -5.14e-08 1.76e-07
x8 1.96e-06 2.53e-06 0.774 0.439 -3e-06 6.92e-06
x9 -0.0006 0.000 -2.958 0.003 -0.001 -0.000
==============================================================================
It is also noted that the test significance shall be interpreted one by one: each t test assesses a single coefficient given that the other predictors remain in the model.
II Testing all the coefficients simultaneously
The hypothesis that none of the regression coefficients is needed, H0: β1 = β2 = · · · = βp = 0, is equivalent to the following test:
H0: Y = β0 + ε
H1: Y = β0 + β1 x1 + β2 x2 + · · · + βp xp + ε.
In other words, under H0 none of the predictors need to be included in the model. We can use the sum of squared errors (SSE), or equivalently R², to express the test statistic
T = (R²/p) / ((1 − R²)/(n − p − 1)) ∼ F_{p,n−p−1}.
As a result, if T > F_{p,n−p−1,α}, we should reject this hypothesis. Usually, the test result is also provided in the software output.
III Testing general linear constraints
More generally, we may wish to test hypotheses involving several coefficients, such as
H0: β2 = β5 = βp = 0
H0: β1 + β2 = 1
H0: β1 = 0, β2 = 1, βp−1 + βp = 0
Note that the first two cases only involve one constraint on the regression parameters. The
third example has 3 constraints on the regression models. For this group of tests, we can still
use F test with corresponding degree of freedom.
In more detail, we can express the null hypothesis in a general way as Aβ = c, with the row rank of A being m. Then we can solve the constrained least square estimation
β̃ = arg min_β ||Y − Xβ||²  subject to Aβ = c.
If the null hypothesis is correct, we would expect that the constrained model and the unconstrained model have similar performance in terms of the estimation of β or the approximation difference Y − Xβ. As a result, the F test can be constructed as
T = [(SSE_constrained − SSE_unconstrained)/m] / [SSE_unconstrained/(n − p − 1)] ∼ F_{m,n−p−1}.
If T > F_{m,n−p−1,α}, we reject the null hypothesis, meaning that the constrained model is not sufficient in explaining the variation of the response. This test can be done by fitting the two models and using ANOVA to get the test results.
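A minimal numpy/scipy sketch of this constrained-vs-unconstrained F test; the simulated data and the particular constraint H0: β1 = β2 (imposed by refitting with x1 + x2 as a single regressor) are assumptions for illustration:

import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n = 150
Z = rng.normal(size=(n, 3))
y = 1.0 + 0.8 * Z[:, 0] + 0.8 * Z[:, 1] + rng.normal(0, 1, n)

def fit_sse(X):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coef) ** 2)

# Unconstrained (full) model: intercept + 3 predictors.
X_full = np.column_stack([np.ones(n), Z])
# Constrained model under H0: beta_1 = beta_2 (one constraint, m = 1),
# imposed by replacing x1, x2 with their sum.
X_con = np.column_stack([np.ones(n), Z[:, 0] + Z[:, 1], Z[:, 2]])

p, m = 3, 1
sse_full, sse_con = fit_sse(X_full), fit_sse(X_con)
F = ((sse_con - sse_full) / m) / (sse_full / (n - p - 1))
p_value = 1 - stats.f.cdf(F, m, n - p - 1)
print(F, p_value)                                  # large p-value: do not reject H0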
To examine whether there are constant returns to scale in the chemical sector, we are going to use the Cobb-Douglas production function, given in log-linear form by
ln(output) = β1 + β2 ln(labor) + β3 ln(capital) + ε.
In the above model, the parameters β2 and β3 are elasticities (output/labor and output/capital).
Before making inferences, remember that returns to scale refers to a technical property of the
production function examining changes in output subsequent to a change of the same proportion in
all inputs, which are labor and capital in this case. If output increases by that same proportional
change then there are constant returns to scale. Constant returns to scale imply that if the factors
labor and capital increase at a certain rate (say 10%), output will increase at the same rate (e.g.,
10%). If output increases by more than that proportion, there are increasing returns to scale. If
output increases by less than that proportional change, there are decreasing returns to scale. In
the above model, the following occurs
• If β2 + β3 = 1, there are constant returns to scale
• If β2 + β3 > 1, there are increasing returns to scale
• If β2 + β3 < 1, there are decreasing returns to scale.
To answer the question posed in this example, we must test
H0: β2 + β3 = 1 versus H1: β2 + β3 ≠ 1.
Elasticity measures the percentage change in one variable in response to the percentage change in another variable, when the latter variable has a causal influence on the former. The elasticity of the response with respect to the predictor xi can be calculated easily from the marginal effect
ei ≡ d(ln Y)/d(ln xi).
• Generalized least squares method, used when the error variances are not constant; also referred to as regression with heterogeneous variance.
• Robust regression is designed for non-normal errors with unknown distributions. These methods are robust to outliers or extreme values.
• Nonparametric regression is used to model unknown relation between covariates and re-
sponses, which goes beyond linear assumptions.
• Quantile regression investigates the relation between covariates and quantiles of response (not
the mean of the response). It has wide application in economics and social science studies.
In addition to regression, there are many methods developed to classify an entity into certain
category based on its covariate values.
Definition 7.1 In machine learning and statistics, classification is the problem of identifying to
which of a set of categories (sub-populations) a new observation belongs, on the basis of a training
set of data containing observations (or instances) whose category membership is known.
It has widespread applications in different domains, and has become especially hot in the current AI buzz. Some typical applications include medical diagnosis, precision medicine; spam email filtering; face recognition; virtual reality, augmented reality; handwriting recognition, voice recognition; recommendation, job screening. In this module, we are not going to cover them in detail. Instead, we just provide some keywords and examples, and point you to further references should you need them in the future.
• Logistic regression
• Fisher’s linear discriminant analysis (LDA)
• Naive Bayes classifier
• Support vector machines
• k-nearest neighbor
• Boosting
• Decision trees, random forests
• (Deep) neural networks
8 Regression on Time
8.1 Time Series Regression
The dependent variable y is a function of time t. It can be modeled as a trend model
yt = TRt + εt,
where yt is the value of the time series in period t, TRt is the trend in period t, and εt is the error term in period t. Compared with cross-sectional data, there are no other covariates except time in the model. Depending on the complexity of the trend, one can use
• No trend: T Rt = β0
• Linear trend: T Rt = β0 + β1 t
• Quadratic trend: T Rt = β0 + β1 t + β2 t2
Aside from the conceptual differences, the model estimation and prediction methods are the same as in multiple linear regression.
8.2 Detecting Autocorrelation
If the error term follows
εt = φεt−1 + at,
where the at are i.i.d., it is called first-order autocorrelation. In particular, if φ > 0, εt has positive correlation; if φ < 0, εt has negative correlation; and if φ = 0, there is no correlation. We can use the residual plot and other diagnostic plots to check.
[Figure 6.8 (Bowerman et al.): error term patterns over time under positive and negative autocorrelation.]
Positive autocorrelation exists when positive error terms tend to be followed over time by positive error terms, and negative error terms by negative error terms. Negative autocorrelation means that greater than average values of yt tend to be followed by smaller than average values, and smaller than average values tend to be followed by greater than average values; an example might be a retailer's weekly stock orders, where a larger than average order one week tends to be followed by a smaller than average order the next.
The Durbin-Watson statistic
d = Σ_{t=2}^n (et − et−1)² / Σ_{t=1}^n et²
can be used to check for autocorrelation in the regression residuals. If the et are positively correlated, d is small; if the et are negatively correlated, d is large; if there is no correlation, d is in the middle (around 2). Cut-off values are provided for the corresponding hypothesis tests. For the two-sided test of
H0: the error terms are not autocorrelated, versus H1: the error terms are autocorrelated:
• If d < dL,α/2 or 4 − d < dL,α/2 , reject H0
• If d > dU,α/2 and 4 − d > dU,α/2 , do not reject H0
• Otherwise, inconclusive
8.3 Seasonal Variation
When the time series exhibits seasonal variation, the trend model can be extended to
yt = TRt + SNt + εt,
where the seasonal factor SNt can be expressed by using dummy variables (one indicator for each of L − 1 seasons, with L the number of seasons).
Another way to model the seasonal component is to use trigonometric functions,
SN_t = β_2 sin(2πt/L) + β_3 cos(2πt/L).
8.4 Growth Curve Models
Growth curve models are not linear in the parameters. However, with a proper transformation, they can often be brought into a form that is easier to estimate. Two common examples are:
• Gompertz curve: y_t = s · exp(α e^{βt})
• Logistic curve: y_t = s / (1 + α e^{ct})
9 Exponential Smoothing
In time series regression, the trend and seasonal functions have constant parameters. This assumption might be valid over a short time span, but it is questionable in the long run. We might need to update the model (parameters) to account for unknown changes. Exponential smoothing is used in such scenarios. It weights the observed time series values unequally (it is also called the exponentially weighted moving average, EWMA). It is most effective when the trend (and seasonal factors) of the time series change over time.
9.1 Simple Exponential Smoothing
Consider the no-trend model
Y_t = β_0 + ε_t.
If β_0 is not constant but slowly changing, then recent observations are more relevant. A simple solution is to take the moving average of the most recent w observations,
β̂_0(n) = ( Σ_{i=n−w+1}^{n} Y_i ) / w.
A more popular approach is to use exponential smoothing,
L_n = α Y_n + (1 − α) L_{n−1}
    = Σ_{i=1}^{n} α (1 − α)^{n−i} Y_i   (ignoring the contribution of the initial value L_0).
The smoothing constant α is very important. A small α gives less weight to the current Y_n, leading to a smoother curve and slower response to changes. In contrast, a large α gives more weight to the current Y_n, leading to a rougher curve and faster response to changes. To select a good α, we can find the value that minimizes the forecast error. Recall that the one-step forecast error at time n can be computed as e_n = Y_n − L_{n−1}. Combining the errors together, we have the sum of squared errors (SSE):
SSE = Σ_{i=1}^{n} (Y_i − L_{i−1})².
Note that SSE depends on α, and as a result, we find the “best” α that can minimize SSE.
Based on the exponential smoothing model, we can forecast a future value Y_{n+τ}, for τ ≥ 1, based on the information up to time n. Since no trend is assumed, the point forecast equals L_n for every τ. Naturally, the larger τ is, the less accurate the prediction. We can construct the prediction interval
L_n ± z_{0.025} · s · √( 1 + (τ − 1) α² ),  where s = √( SSE / (n − 1) ).
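A minimal sketch of simple exponential smoothing: the smoothing constant is chosen by a grid search minimizing the SSE, and a τ-step point forecast and interval are produced. The data and helper names below are illustrative only.

import numpy as np

def ses(y, alpha):
    """Return the smoothed levels L_1..L_n (L_0 initialized to the first observation)."""
    levels = np.empty(len(y))
    level = y[0]
    for i, yi in enumerate(y):
        level = alpha * yi + (1 - alpha) * level
        levels[i] = level
    return levels

def sse(y, alpha):
    """Sum of squared one-step forecast errors e_n = Y_n - L_{n-1}."""
    L = ses(y, alpha)
    e = y[1:] - L[:-1]
    return np.sum(e ** 2)

rng = np.random.default_rng(2)
y = 2 + np.cumsum(rng.normal(scale=0.1, size=150)) + rng.normal(scale=0.3, size=150)

# Grid search for the alpha that minimizes SSE.
grid = np.linspace(0.01, 0.99, 99)
alpha = grid[np.argmin([sse(y, a) for a in grid])]

L = ses(y, alpha)
n = len(y)
s = np.sqrt(sse(y, alpha) / (n - 1))
tau = 3
forecast = L[-1]                                    # no trend: the point forecast is L_n for any tau
half_width = 1.96 * s * np.sqrt(1 + (tau - 1) * alpha ** 2)
print(f"alpha = {alpha:.2f}, forecast = {forecast:.2f} +/- {half_width:.2f}")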
The level-updating equation can be written in two equivalent forms:
• Standard form: L_n = α Y_n + (1 − α) L_{n−1}
• Correction form: L_n = L_{n−1} + α (Y_n − L_{n−1})
9.2 Holt's Trend Corrected Smoothing
Consider the linear trend model
Y_t = β_0 + β_1 t + ε_t.
If both β_0 and β_1 are slowly changing, we need to smooth both the intercept and the slope. We use two smoothing equations, one for each parameter:
• Level smoothing: L_n = α Y_n + (1 − α)(L_{n−1} + B_{n−1})
• Growth rate smoothing: B_n = γ (L_n − L_{n−1}) + (1 − γ) B_{n−1}
The rationale for the second smoothing comes from the observation that, from period n to n + 1, the increment in the trend function is
(β_0 + β_1 (n + 1)) − (β_0 + β_1 n) = β_1,
so B_n tracks the slope. The τ-step-ahead point forecast is
Ŷ_{n+τ} = L_n + τ · B_n.
Given the values of L_n and B_n, the prediction is a linear function of τ. Its prediction interval can be calculated as
L_n + τ B_n ± z_{0.025} · s · √( 1 + Σ_{j=1}^{τ−1} α² (1 + jγ)² ),  where s = √( SSE / (n − 2) ).
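A rough sketch of Holt's trend corrected smoothing with fixed smoothing constants (initialization kept deliberately simple; the data and names are illustrative only).

import numpy as np

def holt(y, alpha, gamma):
    """Holt's trend corrected smoothing; returns the level and growth-rate sequences."""
    L = np.empty(len(y))
    B = np.empty(len(y))
    L[0], B[0] = y[0], y[1] - y[0]        # simple initialization
    for n in range(1, len(y)):
        L[n] = alpha * y[n] + (1 - alpha) * (L[n - 1] + B[n - 1])   # level smoothing
        B[n] = gamma * (L[n] - L[n - 1]) + (1 - gamma) * B[n - 1]   # growth rate smoothing
    return L, B

rng = np.random.default_rng(3)
y = 150 + 2.5 * np.arange(60) + rng.normal(scale=5, size=60)

L, B = holt(y, alpha=0.3, gamma=0.1)
tau = np.arange(1, 5)
print("point forecasts:", np.round(L[-1] + tau * B[-1], 1))   # Y_hat_{n+tau} = L_n + tau * B_n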
Similarly, for different purposes, other forms have been used for trend corrected smoothing.
• Standard form: L_n = α Y_n + (1 − α)(L_{n−1} + B_{n−1})
• Correction form: L_n = L_{n−1} + B_{n−1} + α (Y_n − L_{n−1} − B_{n−1}),
  B_n = B_{n−1} + αγ (Y_n − L_{n−1} − B_{n−1})
9.3 Holt-Winters Method
For a series with both trend and additive seasonal variation,
Y_t = β_0 + β_1 t + SN_t + ε_t,
the level, the growth rate, and the seasonal factors are all smoothed over time.
It can be noted that all three smoothing equations are based on the same error term E_n. The following forms are simpler to implement in practice.
A point forecast τ steps ahead at time n is Ŷ_{n+τ} = L_n + τ · B_n + SN_{n+τ−kL}. The 95% prediction interval is Ŷ_{n+τ} ± z_{0.025} · s · √(c_τ), where
c_τ = 1,  τ = 1,
c_τ = 1 + Σ_{j=1}^{τ−1} α² (1 + jγ)²,  2 ≤ τ ≤ L,
c_τ = 1 + Σ_{j=1}^{τ−1} [ α (1 + jγ) + d_{j,L} (1 − α) δ ]²,  τ > L.
For multiplicative seasonal variation, the model becomes
Y_t = (β_0 + β_1 t) · SN_t · ε_t,
and corresponding changes are made in the smoothing equations.
A point forecast τ steps ahead at time n is Ŷ_{n+τ} = (L_n + τ · B_n) · SN_{n+τ−kL}. The 95% prediction interval is Ŷ_{n+τ} ± z_{0.025} · s_r · √(c_τ) · SN_{n+τ−L}, where
s_r = √( Σ_{i=1}^{n} [ (Y_i − Ŷ_i(i−1)) / Ŷ_i(i−1) ]² / (n − 3) ).
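In practice the Holt-Winters recursions are rarely coded by hand. The sketch below uses the ExponentialSmoothing class from statsmodels (assuming a reasonably recent version); the synthetic monthly data and the additive-trend/multiplicative-seasonal choice mirror the two model forms above.

import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(4)

# Synthetic monthly data with a linear trend and multiplicative seasonality (L = 12).
n, L = 96, 12
t = np.arange(n)
seasonal = 1 + 0.2 * np.sin(2 * np.pi * t / L)
y = (100 + 1.5 * t) * seasonal * (1 + rng.normal(scale=0.02, size=n))

# Additive trend, multiplicative seasonality (the second Holt-Winters form above).
model = ExponentialSmoothing(y, trend="add", seasonal="mul", seasonal_periods=L)
fit = model.fit()
print(fit.params)        # estimated smoothing constants and initial states
print(fit.forecast(6))   # point forecasts for the next 6 periods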
10.1 Stationary
Definition 10.1 A time series is stationary if its statistical properties, e.g. mean and variance,
are essentially constant through time.
In particular, if a series ε_t has zero mean and constant variance σ², and ε_i and ε_j are uncorrelated for any i ≠ j, the sequence is called a white noise sequence.
When the data are not stationary, as in many examples, a transformation might be needed. Typical transformations include differencing the time series to different degrees, e.g.,
• Seasonal adjustment
10.2 ACF and PACF
Recall that the correlation between two random variables X and Y measures the strength of their linear relationship:
r = Cov(X, Y) / √( var(X) · var(Y) ) ≈ [ n Σ_{i=1}^{n} x_i y_i − Σ_{i=1}^{n} x_i Σ_{i=1}^{n} y_i ] / √( [ n Σ_{i=1}^{n} x_i² − (Σ_{i=1}^{n} x_i)² ] [ n Σ_{i=1}^{n} y_i² − (Σ_{i=1}^{n} y_i)² ] ).
Analogously, the autocorrelation function (ACF) of a stationary series z_t at lag k is defined as
ρ_k = Cov(z_t, z_{t+k}) / var(z_t).
The definition is only meaningful when the time series is stationary, so that z_t and z_{t+k} have the same mean and variance. The ACF can be estimated from the data as
ρ̂_k = [ Σ_{t=b}^{n−k} (z_t − z̄)(z_{t+k} − z̄) / (n − k − b + 1) ] / [ Σ_{t=b}^{n} (z_t − z̄)² / (n − b + 1) ],
where z̄ = Σ_{t=b}^{n} z_t / (n − b + 1) is the sample mean of the series and b is the first usable time index (b = 1 for the raw series).
Like other statistics computed from the data, ρ̂_k is random and has a corresponding standard error. This standard error can be used to assess whether the autocorrelation is significantly different from 0:
SE(ρ̂_k) = √( (1 + 2 Σ_{j=1}^{k−1} ρ̂_j²) / (n − b + 1) ).
In particular, for ρ̂_1 we have SE(ρ̂_1) = √( 1 / (n − b + 1) ). Similar to the test of regression coefficients, if |ρ̂_k / SE(ρ̂_k)| > t_{n−p−1, α/2}, we can claim that the autocorrelation at lag k is significant at level α.
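A small sketch computing the sample autocorrelations and their standard errors from the formulas above (here b = 1, and a rough 2-SE rule is used in place of the exact t critical value; the data are simulated for illustration).

import numpy as np

def sample_acf(z, max_lag):
    """Sample autocorrelations rho_hat_1..rho_hat_max_lag (full series, b = 1)."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    zbar = z.mean()
    denom = np.sum((z - zbar) ** 2) / n
    acf = []
    for k in range(1, max_lag + 1):
        num = np.sum((z[:n - k] - zbar) * (z[k:] - zbar)) / (n - k)
        acf.append(num / denom)
    return np.array(acf)

rng = np.random.default_rng(5)
# AR(1) series with phi = 0.7: its ACF should die down roughly as 0.7**k.
z = np.zeros(300)
for t in range(1, 300):
    z[t] = 0.7 * z[t - 1] + rng.normal()

rho = sample_acf(z, 10)
n = len(z)
se = np.sqrt((1 + 2 * np.cumsum(np.r_[0.0, rho[:-1] ** 2])) / n)  # SE(rho_hat_k) per the formula above
for k, (r, s) in enumerate(zip(rho, se), start=1):
    flag = "*" if abs(r / s) > 2 else " "          # rough 2-SE significance rule
    print(f"lag {k:2d}: rho_hat = {r:+.3f}  SE = {s:.3f} {flag}")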
Similar to autocorrelation, a closely related concept is the partial autocorrelation function (PACF). The partial autocorrelation at lag k may be viewed as the autocorrelation of time series observations separated by a lag of k time units, with the effect of the intervening observations eliminated. It can be computed from the autocorrelation function. In particular,
r_{11} = ρ_1,
r_{kk} = [ ρ_k − Σ_{j=1}^{k−1} r_{k−1,j} · ρ_{k−j} ] / [ 1 − Σ_{j=1}^{k−1} r_{k−1,j} · ρ_j ],  k = 2, 3, ...,
where r_{k,j} = r_{k−1,j} − r_{kk} · r_{k−1,k−j} for j = 1, 2, ..., k − 1.
Similar to the sample autocorrelation, we can obtain the sample PACF (SPAC) from the time series observations. The standard error of the SPAC is
SE(r̂_{kk}) = 1 / √( n − b + 1 ).
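The PACF recursion above translates directly into code. A minimal sketch (our own helper, with autocorrelations supplied as input) that converts ρ_1, ..., ρ_K into r_11, ..., r_KK:

import numpy as np

def pacf_from_acf(rho):
    """Compute r_11,...,r_KK from autocorrelations rho[0]=rho_1,...,rho[K-1]=rho_K
    using the recursion given in the text."""
    K = len(rho)
    rkk = np.zeros(K)
    prev = np.zeros(K)                 # prev[j-1] holds r_{k-1, j}
    rkk[0] = rho[0]
    prev[0] = rho[0]
    for k in range(2, K + 1):
        num = rho[k - 1] - sum(prev[j - 1] * rho[k - j - 1] for j in range(1, k))
        den = 1.0 - sum(prev[j - 1] * rho[j - 1] for j in range(1, k))
        rkk[k - 1] = num / den
        cur = prev.copy()
        for j in range(1, k):
            cur[j - 1] = prev[j - 1] - rkk[k - 1] * prev[k - j - 1]
        cur[k - 1] = rkk[k - 1]
        prev = cur
    return rkk

# For an AR(1) process with phi = 0.6, rho_k = 0.6**k; the PACF should be
# 0.6 at lag 1 and (numerically) zero afterwards.
rho = 0.6 ** np.arange(1, 6)
print(np.round(pacf_from_acf(rho), 4))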
(I) Moving average (MA) models
The general MA(q) model is specified as
z_t = δ + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + · · · + θ_q ε_{t−q},
where the ε_t are white noise (or i.i.d. normal) and cannot be directly observed. The process has
E(z_t) = µ = δ,  var(z_t) = σ² (1 + Σ_{j=1}^{q} θ_j²).
By construction, the autocorrelation at lag k > q is zero, since the moving-average windows no longer overlap. More specifically, the (theoretical) autocorrelation function is
ρ_k = σ² Σ_{j=0}^{q−k} θ_j θ_{j+k} / var(z_t),  k ≤ q (with θ_0 = 1),
ρ_k = 0,  k > q.   (10.2)
(a) MA(1) model
z_t = δ + ε_t + θ_1 ε_{t−1}
• ρ_1 = θ_1 / (1 + θ_1²), ρ_k = 0 for k ≥ 2
• The ACF cuts off after lag 1
(b) MA(2) model
z_t = δ + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2}
ρ_1 = θ_1 (1 + θ_2) / (1 + θ_1² + θ_2²),  ρ_2 = θ_2 / (1 + θ_1² + θ_2²),  ρ_k = 0 for k ≥ 3.
(II) Autoregressive (AR) models
The AR model assumes the time series is generated by explicitly regressing on its own previous values. As a result, the autocorrelation of the data is caused by the direct dependence on previous data. In general, an AR model of order p is specified as
z_t = δ + φ_1 z_{t−1} + φ_2 z_{t−2} + · · · + φ_p z_{t−p} + ε_t,
where the ε_t are white noise (or i.i.d. normal). Since z_t can be directly observed, the AR model can be fitted by multiple linear regression, with z_t as the response and z_{t−1}, z_{t−2}, ..., z_{t−p} as covariates.
Because of the explicit dependence on the past data, the autocorrelation function has a
recursive formula
ρk = φ1 ρk−1 + φ2 ρk−2 + · · · + φp ρk−p
In general, ρ_k dies down, with exponential decay and possibly cyclical patterns. In contrast, the PACF of an AR(p) model cuts off to zero after lag k = p.
Example: Consider the AR(1) model z_t = δ + φ_1 z_{t−1} + ε_t. Its PACF is
r_{11} = φ_1,  r_{kk} = 0 for k ≥ 2.
Sometimes it is more convenient to collect the z_t terms and the ε_t terms on the two sides of the equation. Introducing the backshift operator B, which has the effect B z_t = z_{t−1} and B^k z_t = z_{t−k}, the general ARMA(p, q) model can be written as
(1 − φ_1 B − φ_2 B² − · · · − φ_p B^p) z_t = δ + (1 + θ_1 B + θ_2 B² + · · · + θ_q B^q) ε_t.
• AR(p) model typically has autocorrelation function dying down, and the partial autocorrela-
tion function cuts off at lag p.
[Figure: sample ACF (top) and partial ACF (bottom) of a simulated series, lags 0-25; the ACF dies down while the PACF cuts off, the typical AR pattern.]
• MA(q) model: the autocorrelation function cuts off at lag q, and partial autocorrelation dies
down.
[Figures: sample ACF and partial ACF plots of simulated series illustrating the MA pattern (the ACF cuts off while the PACF dies down).]
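These identification patterns can be checked on simulated data. A sketch using statsmodels (assuming a reasonably recent version) that generates an AR(1) and an MA(1) series and compares their SAC and SPAC:

import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.stattools import acf, pacf

np.random.seed(6)

# ArmaProcess takes lag-polynomial coefficients: AR poly (1 - 0.7B), MA poly (1 + 0.6B).
ar1 = ArmaProcess(ar=[1, -0.7], ma=[1])
ma1 = ArmaProcess(ar=[1], ma=[1, 0.6])

for name, proc in [("AR(1)", ar1), ("MA(1)", ma1)]:
    z = proc.generate_sample(nsample=500)
    print(name)
    print("  SAC :", np.round(acf(z, nlags=5)[1:], 2))   # AR: dies down; MA: cuts off after lag 1
    print("  SPAC:", np.round(pacf(z, nlags=5)[1:], 2))  # AR: cuts off after lag 1; MA: dies down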
10.4.1 Link to other models
ARMA model has close link to other time series analysis techniques.
1. Simple exponential smoothing: forecasting with simple exponential smoothing is equivalent to forecasting with an ARIMA(0, 1, 1) model. In this case, the smoothing constant is α = 1 − θ_1. The previous forecast errors are used to adjust the current forecast.
2. Holt's trend corrected smoothing: forecasting with trend corrected exponential smoothing is equivalent to forecasting with an ARIMA(0, 2, 2) model, with
θ_1 = 2 − α − γ,  θ_2 = α − 1.
For notational simplicity, the Box-Jenkins methods often use the notation
y_t ∼ ARIMA(p, d, q),
where p is the order of the autoregressive part, d is the degree of differencing, and q is the order of the moving average part. For example:
• ARIMA(1, 0, 0) becomes AR(1): y_t = µ + φ_1 y_{t−1} + ε_t
• ARIMA(0, 0, 2) becomes MA(2): y_t = µ + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2}
• ARIMA(0, 1, 1): y_t − y_{t−1} = µ + ε_t + θ_1 ε_{t−1}
The parameters of the ARIMA models need to satisfy a few constraints to make the model mean-
ingful and easy to interpret. Among them, the following two are most crucial.
• Stationarity (causal) condition: the roots of 1 − φ_1 z − φ_2 z² − · · · − φ_p z^p = 0 must satisfy |z| > 1.
• Invertibility condition: the roots of 1 + θ_1 z + θ_2 z² + · · · + θ_q z^q = 0 must satisfy |z| > 1.
These constraints are in place for both true model parameters and estimated model parameters.
To forecast ŷ_{t+τ}, each term on the right-hand side of the model is handled as follows. If y_{t+τ−i} is observed, we use the observed value (ŷ_{t+τ−i} = y_{t+τ−i}, τ ≤ i); otherwise, the forecasted values from previous steps are used. Similarly, if ε_{t+τ−i} is beyond the current time step t, it is set to 0 (ε̂_{t+τ−i} = 0, τ ≥ i); otherwise, ε̂_{t+τ−i} is estimated by the corresponding one-step prediction error y_{t+τ−i} − ŷ_{t+τ−i}.
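In practice the estimation and forecasting recursions are handled by software. A brief sketch with statsmodels (assuming a reasonably recent version; the simulated series and the assumed order (1, 1, 1) are illustrative only):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)

# Simulate a series whose first difference follows an ARMA(1, 1).
n = 300
eps = rng.normal(size=n)
w = np.zeros(n)
for t in range(1, n):
    w[t] = 0.5 * w[t - 1] + eps[t] + 0.3 * eps[t - 1]
y = 50 + np.cumsum(w)                      # integrate once -> roughly ARIMA(1, 1, 1)

res = ARIMA(y, order=(1, 1, 1)).fit()
print(res.params)                          # estimated phi_1, theta_1, sigma^2
fc = res.get_forecast(steps=5)
print(fc.predicted_mean)                   # point forecasts y_hat_{n+1..n+5}
print(fc.conf_int(alpha=0.05))             # 95% prediction intervals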
2. MA(1) model: y_t = δ + ε_t − θ_1 ε_{t−1}. The values of ε_t are not directly observable. Using the previous forecast error to estimate ε_{t−1}, ε̂_{t−1} = y_{t−1} − ŷ_{t−1}, the forecast values can be computed as ŷ_t = δ − θ_1 ε̂_{t−1}, with the sum of squared forecast errors SSE = Σ_{t=2}^{n} (y_t − ŷ_t)² = Σ_{t=2}^{n} ε̂_t².
[Figure: sample ACF of the series (lags 0-40), dying down very slowly.]
This is also a sign of non-stationarity. In general, we need to check the SAC at two levels. At the nonseasonal level, the SAC at lags ranging from 1 to L − 3 is used to assess stationarity (whether a trend exists), similar to non-seasonal ARMA models. At the seasonal level, the SAC at lags around L, 2L, 3L, ... indicates the correlation between the same season in different periods. Both levels should die down or cut off quickly to indicate stationarity.
If the time series is nonstationary, we can use first-order or higher-order differencing to make it stationary.
Returning to the picture above, we can try these differencing operations (regular differencing, seasonal differencing, or both) to make the series stationary. After regular differencing, the SAC is:
[Figures: sample ACFs (lags 0-40) of the series after the different differencing operations.]
Similar to the ARMA model, we can define models at the seasonal level. For seasonal models with period L, the counterparts can be defined analogously, with lags at multiples of L.
Two simulated examples with period L = 4:
• Seasonal MA: z_t = ε_t − 0.5 ε_{t−4} − 0.3 ε_{t−8}
• Seasonal AR: z_t = 0.5 z_{t−4} + 0.3 z_{t−8} + ε_t
[Figures: sample ACF and PACF plots of the two series, lags 0-19.]
• Use the behavior of SAC and SPAC at nonseasonal level to identify a nonseasonal model
• Use the behavior of SAC and SPAC at seasonal level to identify a seasonal model
Example: After first-order seasonal differencing, the data become stationary.
[Figure: sample ACF and PACF of the seasonally differenced series, lags 0-40.]
The SPAC cuts off after lag 5, with spikes at lags 1, 3, and 5, which indicates z_t = δ + φ_1 z_{t−1} + φ_3 z_{t−3} + φ_5 z_{t−5} + ε_t. The SAC cuts off after lag 1 at the seasonal level, which indicates z_t = δ + ε_t − θ_1 ε_{t−12}. Combining the two, we have the final model: z_t = δ + φ_1 z_{t−1} + φ_3 z_{t−3} + φ_5 z_{t−5} + ε_t − θ_1 ε_{t−12}.
To make a point forecast, we first forecast z_t and then y_t. First, use the Box-Jenkins model to obtain the point and interval forecasts for z_{t+τ}: ẑ_{t+τ} = δ + φ_1 z_{t+τ−1} + φ_3 z_{t+τ−3} + φ_5 z_{t+τ−5} − θ_1 ε̂_{t+τ−12}. Then, since first-order seasonal differencing was used, ŷ_{t+τ} = y_{t+τ−12} + ẑ_{t+τ}.
When the seasonal and non-seasonal parts contain the same type of component (both AR or both MA), multiplicative terms are needed:
(1 − a_1 B − a_2 B² − · · · − a_p B^p)(1 − φ_1 B^L − · · · − φ_P B^{PL}) y_t = (1 − b_1 B − b_2 B² − · · · − b_q B^q)(1 − θ_1 B^L − · · · − θ_Q B^{QL}) ε_t.   (10.5)
There is no problem in specifying the model, but we need to be careful in parameter estimation
and forecasting.
An ARIMA model with seasonal effects is abbreviated as SARIMA:
y_t ∼ SARIMA(p, d, q) × (L, P, D, Q).
Example: Consider the SARIMA(0, 1, 1) × (12, 0, 1, 1) model, which has period L = 12. Using the lag operator, we have
(1 − B)(1 − B^{12}) y_t = (1 − b_1 B)(1 − θ_1 B^{12}) ε_t.
It is equivalent to
y_t = y_{t−1} + y_{t−12} − y_{t−13} + ε_t − b_1 ε_{t−1} − θ_1 ε_{t−12} + b_1 θ_1 ε_{t−13}.
When multiplicative seasonal effects are present, the series can first be transformed to have additive effects (e.g., by taking logarithms), and then the SARIMA model can be used to model the variability over time.
Given the tentatively identified model, its theoretical autocorrelation function (TAC) is often available. For example, the MA(1) model y_t = δ + ε_t + θ_1 ε_{t−1} has TAC
ρ_1 = θ_1 / (1 + θ_1²),  ρ_k = 0 for all k > 1.
By equating the TAC with the SAC (ρ_1 = r_1), we get
θ̂_1 = ( 1 ± √(1 − 4 r_1²) ) / (2 r_1).
A similar idea can be applied to other models of relatively low order, as long as the TAC is analytically available. For example, the AR(2) model has
ρ_1 = φ_1 / (1 − φ_2),  ρ_2 = φ_1² / (1 − φ_2) + φ_2,  ρ_k = φ_1 ρ_{k−1} + φ_2 ρ_{k−2} for k ≥ 3.
By matching the theoretical ACF with the sample ACF, we can calculate the parameters of interest for the corresponding model. This method can be applied to more complex models as well, and is known in general as Yule-Walker estimation. It is simple and does not require the original raw data. However, its estimation accuracy is not the best. When multiple solutions exist, it is important to check whether they satisfy the stationarity and invertibility conditions of the model.
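A small sketch of the method-of-moments idea for the MA(1) case: compute r_1, solve the quadratic for θ̂_1, and keep the invertible root (the helper below is our own illustration, not code from the handout).

import numpy as np

def ma1_moment_estimate(z):
    """Method-of-moments estimate of theta_1 for an MA(1) model, via rho_1 = theta_1/(1+theta_1^2)."""
    z = np.asarray(z, dtype=float)
    zbar = z.mean()
    r1 = np.sum((z[:-1] - zbar) * (z[1:] - zbar)) / np.sum((z - zbar) ** 2)
    if abs(r1) >= 0.5:
        raise ValueError("no real solution: |r1| must be < 0.5 for an MA(1) model")
    roots = np.roots([r1, -1.0, r1])        # r1*theta^2 - theta + r1 = 0
    return roots[np.abs(roots) < 1][0]      # keep the invertible root (|theta| < 1)

rng = np.random.default_rng(8)
eps = rng.normal(size=2000)
z = 1.0 + eps[1:] + 0.5 * eps[:-1]          # MA(1) with theta_1 = 0.5 (plus-sign convention)
print("theta_1 estimate:", round(float(ma1_moment_estimate(z)), 3))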
The least squares and maximum likelihood methods are not elaborated in detail here; both can take full advantage of the entire dataset. The least squares method finds the parameters that minimize the sum of squared forecast errors. It is similar to the estimation in conventional regression analysis. In the context of ARMA models, calculating the forecast errors is straightforward for an AR(p) model, but more care is needed when an MA component is involved, e.g., for MA(q) and ARMA(p, q) models.
The maximum likelihood estimation requires the joint distribution of all observations (typically
multivariate normal). It generally provides more accurate estimation, but also is more complex.
Software can be used to obtain such estimates.
Similar to regression models, a good way to check the adequacy of a Box-Jenkins model is to analyze the residuals
e_t = y_t − ŷ_t.
In particular, we can plot the SAC and SPAC of the residuals to check whether the model is adequate. These plots are often called RSAC and RSPAC for short. If the model is adequate, the errors should be uncorrelated and the RSAC should be small. Detailed inspection of the RSAC or RSPAC can also be used to improve the model. In addition, we can use statistics to quantify the dependence of the residuals.
One such statistic is the Ljung-Box statistic, computed as
Q* = (n − d)(n − d + 2) Σ_{l=1}^{K} (n − d − l)^{−1} r_l²(e),
where n is the sample size, d is the order of differencing, r_l(e) is the SAC of the residuals at lag l, and K is a number indicating the range of lags of interest. If Q* > χ²_{α, K−n_C}, where n_C is the number of estimated model parameters, the residuals are correlated, i.e., the model is inadequate. In practice, several values of K (e.g., K = 6, 12, 18, 24) can be used to check the correlation of the residuals.
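A sketch of the Ljung-Box computation on a set of residuals (our own helper; scipy supplies the chi-square critical value, and the residuals here are simulated white noise, so the model should appear adequate).

import numpy as np
from scipy.stats import chi2

def ljung_box(resid, K, d=0, n_params=0, alpha=0.05):
    """Ljung-Box Q* on residuals; d = order of differencing, n_params = number of fitted parameters."""
    e = np.asarray(resid, dtype=float)
    n = len(e) + d                          # n is the original sample size
    ebar = e.mean()
    denom = np.sum((e - ebar) ** 2)
    r = np.array([np.sum((e[:-l] - ebar) * (e[l:] - ebar)) / denom for l in range(1, K + 1)])
    q = (n - d) * (n - d + 2) * np.sum(r ** 2 / (n - d - np.arange(1, K + 1)))
    crit = chi2.ppf(1 - alpha, df=K - n_params)
    return q, crit

rng = np.random.default_rng(9)
resid = rng.normal(size=200)                # residuals from an (assumed) adequate model
for K in (6, 12, 18, 24):
    q, crit = ljung_box(resid, K, d=0, n_params=2)
    print(f"K={K:2d}: Q*={q:6.2f}  chi2 critical={crit:6.2f}  adequate={q <= crit}")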
11 Spatial Data Forecasting
11.1 Spatial Data
Spatial data come from a myriad of fields, which leads to various spatial data types. A general and useful classification of spatial data is provided by Cressie (1993, pp. 8-13), based on the nature of the spatial domain under study.
Following Cressie (1993), let s ∈ R^d be a generic location in d-dimensional Euclidean space and {Z(s) : s ∈ R^d} be a spatial random function, where Z denotes the attribute we are interested in. The three spatial data types are lattice data, geostatistical data, and point processes.
• Lattice (areal/regional) data: the domain D under study is discrete. Data can be exhaustively observed at fixed locations that can be enumerated, such as ZIP codes, neighborhoods, or provinces. In most cases the data are spatially aggregated, e.g., the unemployment rate by state, crime data by county, or average housing prices by province.
• Geostatistical data: the domain under study is a fixed continuous set D, e.g., the level of a pollutant in a city, or the precipitation or air temperature values in a country. The attribute has a value at every point in D.
• Point processes (point patterns): the attribute under study is the location of events (observations), so the domain D is random. The observations are not necessarily labeled, and the interest lies mainly in where the events occur, e.g., the locations of trees in a forest or of nests in a breeding colony of birds.
Figure 1 shows some examples of the spatial data in each category. The main goal of different
spatial data types can be different:
• Lattice data analysis: smoothing and clustering acquire special importance. It is of interest
to describe how the value of interest at one location depends on nearby values, and whether
this dependence is direction dependent.
• Geostatistical data: to predict value of interest at unobserved locations across the entire
domain of interest. An exhaustive observation of the spatial process is not possible. Obser-
vations are only made at a small subset of locations.
• Point process analysis: to determine whether the locations of events exhibit a systematic pattern over the area under study or, on the contrary, are randomly distributed.
Figure 1: Three spatial data types
For lattice data, a widely used measure of spatial correlation is Moran's I:
I = (N / S_0) · Σ_i Σ_j w_ij (x_i − x̄)(x_j − x̄) / Σ_i (x_i − x̄)².
Here N is the number of spatial units, indexed by i and j; x is the variable of interest; x̄ is the mean of x_i, i = 1, ..., N; w_ij is the (i, j)th element of a spatial weight matrix W with w_ii = 0; and S_0 = Σ_{i,j} w_ij. The design of the spatial weight matrix W can be based, for example, on contiguity between units or on the distance between them.
To test for spatial correlation, we can formulate the following hypothesis test:
H_0: no spatial correlation; H_1: spatial correlation exists.
H_0 corresponds to spatial randomness. The expected value of Moran's I under the null hypothesis of no spatial correlation is E(I) = −1/(N − 1). With large sample sizes, the expected value approaches zero. I usually ranges from −1 to +1. Positive I indicates positive spatial correlation, while negative I indicates negative spatial correlation. Values that deviate significantly from −1/(N − 1)
indicate spatial correlation. The variance of the statistic under the null (assuming each value is equally likely to occur at any location) is
Var(I) = ( N S_4 − S_3 S_5 ) / ( (N − 1)(N − 2)(N − 3) S_0² ) − (E(I))²,
where
S_1 = (1/2) Σ_i Σ_j (w_ij + w_ji)²,  S_2 = Σ_i ( Σ_j w_ij + Σ_j w_ji )²,
S_3 = [ N^{−1} Σ_i (x_i − x̄)^4 ] / [ N^{−1} Σ_i (x_i − x̄)² ]²,
S_4 = (N² − 3N + 3) S_1 − N S_2 + 3 S_0²,  S_5 = (N² − N) S_1 − 2N S_2 + 6 S_0².
Alternatively, a permutation test can be used: randomly permute the observed values over the locations M times, recompute I for each permutation, and let R be the number of permuted statistics at least as extreme as the observed one. The pseudo p-value is then
p = (R + 1) / (M + 1).
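A sketch of Moran's I together with a permutation-based pseudo p-value, on a hypothetical 6 x 6 lattice with rook-contiguity weights (all names and data below are illustrative only).

import numpy as np

def morans_i(x, W):
    """Moran's I = (N / S0) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    s0 = W.sum()
    return len(x) / s0 * (z @ W @ z) / (z @ z)

def permutation_pvalue(x, W, M=999, seed=0):
    """One-sided pseudo p-value p = (R + 1) / (M + 1) for positive spatial correlation."""
    rng = np.random.default_rng(seed)
    obs = morans_i(x, W)
    perm = np.array([morans_i(rng.permutation(x), W) for _ in range(M)])
    R = int(np.sum(perm >= obs))            # permuted statistics at least as extreme as observed
    return obs, (R + 1) / (M + 1)

# Rook-contiguity weights on a 6 x 6 regular grid.
side = 6
N = side * side
W = np.zeros((N, N))
for i in range(side):
    for j in range(side):
        k = i * side + j
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            ii, jj = i + di, j + dj
            if 0 <= ii < side and 0 <= jj < side:
                W[k, ii * side + jj] = 1.0

# A smoothly varying surface should show strong positive spatial correlation.
xx, yy = np.meshgrid(np.arange(side), np.arange(side))
x = (xx + yy).ravel() + np.random.default_rng(10).normal(scale=0.5, size=N)

I, p = permutation_pvalue(x, W)
print(f"Moran's I = {I:.3f} (E(I) = {-1 / (N - 1):.3f}), pseudo p-value = {p:.3f}")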
One common tool to account for spatial dependence is linear regression. The idea is to have models analogous to time series models but with spatial lags. In the simplest case, when only the spatially lagged variable is considered,
y = λ W y + ε,  |λ| < 1,   (11.1)
where ε ∼ i.i.d. N(0, σ_ε² I_n) and W is a non-stochastic standardized spatial weight matrix. Compared to the spatial weight matrix explained above, it is standardized in the sense that the elements of any row sum to one, i.e., w_{ij}^s = w_ij / Σ_j w_ij. When W is row-standardized and |λ| < 1, the matrix (I − λW) is invertible. From equation (11.1) we have y = (I − λW)^{−1} ε, and
E(y) = 0.   (11.2)
With the normality assumption on ε, this model can be estimated via the maximum likelihood procedure. The log-likelihood can be expressed as
l(λ, σ_ε²) = const − (n/2) ln(σ_ε²) − (1/2) ln |(I − λW)^{−1}(I − λW)^{−T}| − (1/(2σ_ε²)) · y^T [(I − λW)^{−1}(I − λW)^{−T}]^{−1} y.
The λ̂ and σ̂_ε² that maximize the likelihood function become the parameter estimates.
In addition to the spatially lagged variable as regressors, there are also some exogenous variables
that can influence the response. Define the matrix of all exogenous regressors, current and spatially
lagged, as Z = [X, W X] and the vector of regression parameters as β = [β(1) , β(2) ]. In presence of
explanatory variables, it is also possible to test the spatial dependencies, following the procedure
below:
1. Run the non-spatial regression yi = Xi β(1) + e, e ∼ N (0, σ 2 ) for every yi .
2. Test the regression residuals for spatial correlation, using Moran’s I.
3. If no significant spatial correlation exists, STOP.
4. Otherwise, use a special model which takes spatial dependencies into account.
Two commonly used models considering spatial dependencies are the Spatial Lag Model (SLM) and the Spatial Error Model (SEM).
Spatial Lag Model (SLM):
y = λW y + Zβ + u, |λ| < 1
In this case, a problem of endogeneity emerges in that the spatially lagged value of y is correlated with the stochastic disturbance, i.e., E[(W y) u^T] ≠ 0. Therefore, least squares cannot be employed. The parameters can be estimated by maximum likelihood. Sometimes this model is also called the spatial autoregressive model (SAR).
Spatial Error Model (SEM):
y = Z β + u,
u = ρ W u + ε,  |ρ| < 1.
Compared to the SLM, the SEM contains spatial dependence in the noise. As before, the constraint on ρ holds for row-standardized W to make I − ρW invertible. Due to the endogeneity of the errors, i.e., E[(W u) ε^T] ≠ 0, the least squares procedure loses its optimal properties. The parameters can be estimated by maximum likelihood.
11.2.4 Generalizations
Last but not least, the models discussed above are all special cases of a general form:
y = λ W_1 y + Z β + u,
u = ρ W_2 u + ε,
where W_1 and W_2 are not necessarily the same. This generalized model comes with several names, e.g., the spatial autocorrelation model (SAC), the extended spatial Durbin model (SDM), or SARAR(1,1) (an acronym for spatial autoregressive with additional autoregressive error structure).
Statistically, denote by X(s) the response of interest at location s ∈ D, and let m(s) = E X(s) be the mean response. The spatial dependence between the responses at any two locations s_i, s_j can be characterized by the following two quantities.
(A key reference for this part is the Geographic Information Technology Training Alliance (GITTA), http://www.gitta.info/website/en/html/index.html.)
• Covariance: C(s_i, s_j) = E{ [X(s_i) − m(s_i)] · [X(s_j) − m(s_j)] }
• Semivariance: γ(s_i, s_j) = (1/2) E{ [X(s_i) − X(s_j)]² }
Covariance is a measure of similarity: the larger the value, the more correlated the responses. In contrast, semivariance is a measure of dissimilarity: a smaller value indicates higher dependence. They play a pivotal role in the properties of geospatial models and their prediction accuracies. To limit model complexity, it is common to restrict attention to a class of stationary models:
Second-order stationary (SOS): the mean is constant, m(s) = µ, and the covariance depends only on the separation vector, C(s_i, s_j) = C(s_i − s_j) = C(h).
Intrinsic stationary (IS): the increments have mean zero, E[X(s + h) − X(s)] = 0, and their variance depends only on the separation vector, Var[X(s + h) − X(s)] = 2γ(h).
An SOS process implies IS, which means IS is a weaker assumption. Under SOS, the relationship between semivariance and covariance is
γ(h) = C(0) − C(h),
as demonstrated in Figure 3.
• To estimate the semivariance, no estimate of the mean is required, so the semivariance adapts more easily to nonstationary cases. In contrast, the covariance estimator requires an estimate of the mean; when the mean is unknown and must be estimated from the sample, the covariance estimator is more biased.
• The semivariance can be applied under IS, meaning that the semivariance can be defined in some cases where the covariance function cannot be defined. In particular, the semivariance may keep increasing with increasing lag, rather than leveling off, corresponding to an infinite global variance. In this case the covariance function is undefined. As a result, IS, rather than SOS, is the fundamental assumption required for kriging.
Figure 4 shows an example of how semivariance works. Two datasets have similar summary statistics: 15251 points with (1) average value 100; (2) standard deviation 100; (3) median 100; (4) 10th percentile 74; (5) 90th percentile 125. However, due to different semivariances, they exhibit totally different patterns (spatial structure).
Note that in practice we often further assume that the semivariance is isotropic, i.e., the spatial correlation is the same in all directions. There are anisotropic cases which require a further design of the semivariance formula, which we do not discuss in these notes. For an isotropic semivariance, the distance between s_i and s_j completely determines their spatial correlation. As a result, we use the scalar lag h instead of the directional vector h.
To estimate γ(h) from the data, we can use
γ̂(h) = ( 1 / (2 N(h)) ) Σ_{(i,j): ‖s_i − s_j‖ ≈ h} [ X(s_i) − X(s_j) ]²,
where N(h) is the number of pairs whose distances are around h. By changing the value of h, we obtain the function γ̂(h), which we also refer to as the (empirical) semivariogram, as shown in Figure 5.
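A sketch of the empirical semivariogram computed by binning pairwise distances; the data are simulated from a field with an exponential covariance so that γ̂(h) should rise toward the sill (everything below is an illustration, not code from the handout).

import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """gamma_hat(h) = (1 / (2 N(h))) * sum over pairs with ||s_i - s_j|| in the bin of (X_i - X_j)^2."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    sq = (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)          # each pair counted once
    d, sq = d[iu], sq[iu]
    centers, gammas = [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (d >= lo) & (d < hi)
        if mask.any():
            centers.append(0.5 * (lo + hi))
            gammas.append(0.5 * sq[mask].mean())
    return np.array(centers), np.array(gammas)

rng = np.random.default_rng(11)
coords = rng.uniform(0, 10, size=(200, 2))

# Simulate a spatially correlated field with exponential covariance C(h) = exp(-h / 2).
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
cov = np.exp(-dist / 2.0) + 1e-8 * np.eye(len(coords))
values = np.linalg.cholesky(cov) @ rng.normal(size=len(coords))

h, gamma = empirical_semivariogram(coords, values, bin_edges=np.linspace(0, 6, 13))
for hc, g in zip(h, gamma):
    print(f"h ~ {hc:4.2f}   gamma_hat = {g:.3f}")   # should rise toward the sill (~1) with distance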
For modeling and prediction, we need to replace the empirical semivariogram with an acceptable parametric semivariogram model, because we need semivariogram values at lag distances other than the empirical ones. More importantly, the semivariogram model needs to be valid (conditionally negative definite). Let a denote the range and c the sill; the three most frequently used models are the spherical, exponential, and Gaussian models.
11.3.2 Kriging as an interpolation
Given the spatial covariance (semivariance) structures of the data, we are able to predict the
response at any location given the observations from a few locations. This process is also called
interpolation. Interpolation algorithms predict the value at a given location as a weighted sum of
data values at surrounding locations. Almost all weights are assigned according to functions that
give a decreasing weight as the distance increases. Kriging is the optimal interpolation based on
observed values of surrounding data points, weighted according to spatial covariance values. It also
has other advantages:
• Helps to compensate for the effects of data clustering, assigning individual points within a
cluster less weight than isolated data points
• Estimates the standard error (kriging variance) along with the estimate of the mean, which provides a basis for interval forecasting.
Figure 6: Left: Interpolation using inverse distance weighting; Middle: Kriging Interpolation;
Right: Kriging’s confidence interval.
Kriging assumes the response at location s, Z(s), follows Z(s) = m(s) + R(s) in the domain
D. m(s) is the mean response function at location s, R(s) is an intrinsically stationary (IS)
process. Kriging aims to predict the response value at unobserved location s0 given observations
Z(s1 ), Z(s2 ), · · · , Z(sn ). A basic form of the kriging estimator is
Z*(s_0) − m(s_0) = Σ_{i=1}^{N(s_0)} λ_i [ Z(s_i) − m(s_i) ],
where N (s0 ) is the number of data points in the local neighborhood used for estimation of Z ∗ (s0 ).
The mathematical goal of kriging is to determine weights that minimize the variance of the estimator
under the unbiasedness constraint.
minimize over λ:  σ_E²(s_0) = Var{ Z*(s_0) − Z(s_0) }.
Ordinary Kriging has the simplest structure for the underlying mean function, i.e., m(s) = µ.
In this case, the bias is
E( Z*(s_0) − Z(s_0) ) = ( Σ_{i=1}^{N(s_0)} λ_i − 1 ) · m,
so unbiased estimation requires Σ_{i=1}^{N(s_0)} λ_i = 1. The semivariance can be estimated from the sample, denoted by γ(h). The semivariances between any two observations form a matrix Γ, where Γ_{i,j} = γ(‖s_i − s_j‖), i, j = 1, ..., n. The semivariances between s_0 and the existing observations can be summarized in a vector θ = [θ_1, θ_2, ..., θ_n] with θ_i = γ(‖s_0 − s_i‖). Minimizing the variance (uncertainty) of the prediction, we obtain
λ* = Γ^{−1}(θ + µ̂ 1),  where µ̂ = ( 1 − 1^T Γ^{−1} θ ) / ( 1^T Γ^{−1} 1 )
and 1 denotes a vector of ones. A few remarks on kriging:
1. The underlying function is assumed to come from a stationary process with a specified covariance (semivariance) structure.
2. If the distribution of the data is skewed, then the Kriging estimators are sensitive to a few
large data values.
3. Normality of the observations is not a requirement for kriging. Nevertheless, under the Gaussian assumption, kriging is BLUE ("best linear unbiased estimator"). Kriging under the Gaussian assumption is also equivalent to the famous "Gaussian process".
Aside from ordinary kriging, there are other variants as well.
• Simple kriging: assumes the mean response over the entire domain is a known constant, E{Z(s)} = m. In this case, the constraint Σ_{i=1}^{N(s_0)} λ_i = 1 is no longer needed.
• Universal kriging: assumes the mean response is not a constant but a linear combination of known functions, E{Z(s)} = Σ_{k=0}^{p} β_k f_k(s).
Figure 7: Example for Ordinary Kriging when N (s) = 6.
• Cokriging: Kriging using information from one or more correlated variables, or multivariate
kriging in general.
• If we record the Per Capita Income of every state every year, we have a collection of spatio-
temporal lattice data. We can study how incomes of every state evolve over time.
• If we observe every hour the level of pollutant in a city at the points where the monitoring
stations are located, we have a spatio-temporal geostatistical dataset.
• If we observe the location of bird nests every year, we have a spatio-temporal point pattern
dataset. Now we can study whether there is complete spatio-temporal randomness or they
exhibit clustering/inhibition.
Since locations in lattice data are discrete and finite, a Markov chain can be adopted to study regional dynamics. If we divide the data into k classes and T periods, we can use a vector P_t = [P_{1,t}, P_{2,t}, ..., P_{k,t}] to represent the probabilities that the response in a region belongs to each class at period t, i.e., P_{j,t} = P(Z(s, t) ∈ C_j). To model the dynamics over time, we use the transition probability matrix M_t, whose element m_{t,i,j} denotes the probability that a response currently in class i at time t ends up in class j in the next period, P(Z(s, t + 1) ∈ C_j | Z(s, t) ∈ C_i). An example of a transition matrix is shown in Figure 9. If the transition probabilities do not change over time, we can drop the index t in the notation above, and we easily get P_{t+b} = P_t M^b, as illustrated in the sketch below.
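A minimal sketch of propagating the class distribution with a time-homogeneous transition matrix (three hypothetical classes a, b, c; the numbers are made up for illustration).

import numpy as np

# Hypothetical 3-class transition matrix M (rows sum to 1): classes a, b, c.
M = np.array([
    [0.70, 0.25, 0.05],   # from a
    [0.20, 0.60, 0.20],   # from b
    [0.05, 0.30, 0.65],   # from c
])

P_t = np.array([0.5, 0.3, 0.2])            # current class distribution P_t

# P_{t+b} = P_t M^b
for b in (1, 5, 50):
    print(b, np.round(P_t @ np.linalg.matrix_power(M, b), 3))
# For large b the distribution approaches the stationary distribution of the chain.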
To use a Markov chain to model the transition of the spatial distribution, we can design a spatial Markov matrix. This matrix extends the traditional k × k transition matrix into a k × k × k tensor: conditioning on the response category of the spatial lag in the initial period, there are k different transition probability matrices. The class of the neighbors is summarized by the spatial lag z_i* = Σ_{j=1}^{N} w_{i,j} z_j. An example layout with three classes a, b, and c is shown below.
Spatial Lag       t0    t1 = a    t1 = b    t1 = c
a                 a     maa|a     mab|a     mac|a
                  b     mba|a     mbb|a     mbc|a
                  c     mca|a     mcb|a     mcc|a
b                 a     maa|b     mab|b     mac|b
                  b     mba|b     mbb|b     mbc|b
                  c     mca|b     mcb|b     mcc|b
c                 a     maa|c     mab|c     mac|c
                  b     mba|c     mbb|c     mbc|c
                  c     mca|c     mcb|c     mcc|c
not considered    a     maa       mab       mac
                  b     mba       mbb       mbc
                  c     mca       mcb       mcc
The overall influence of spatial dependence is reflected in the differences between the marginal cell values (the bottom block, where the spatial lag is not considered) and the corresponding values in the various conditional matrices.
For example, if mbc > mbc|a then the probability of an upward move for median class regions,
irrespective of their neighbors, is higher than the probability of an upward move for median class
regions with poor neighbors.
Multivariate Time Series
Another model for spatio-temporal lattice data is the multivariate time series. Let Z_t = [Z(s_1, t), ..., Z(s_n, t)]^T denote the vector of n responses across all spatial locations at time t. A multivariate ARMA(p, q) process is given by
Z_t = c + Φ_1 Z_{t−1} + · · · + Φ_p Z_{t−p} + ε_t + Θ_1 ε_{t−1} + · · · + Θ_q ε_{t−q},
where the Φ_i and Θ_j are n × n coefficient matrices and ε_t is a vector white noise process.
For spatio-temporal prediction (kriging), the predictor takes the form
Z*(s, t) = Σ_{α=1}^{N(s,t)} λ_α Z(s_α, t_α),
where the sum is over the N(s, t) space-time neighbors used for prediction.
Similar to spatial Kriging, the key in prediction is to model the covariance structure between any
two observations. By definition, covariance between two space–time variables is Cov[Z(s, u), Z(r, v)] =
E{[Z(s, u) − µ(s, u)][Z(r, v) − µ(r, v)]}. As before, we normally require the stationarity of the
spatial-temporal process. For example, the second-order stationarity of spatio-temporal data requires that
1. E{Z(s, t)} = µ, and
2. Cov[Z(s, u), Z(r, v)] = C(s − r, u − v), i.e., the covariance depends only on the spatial lag s − r and the temporal lag u − v.
A space-time covariance function is called separable if it can be written as the product of a purely spatial covariance and a purely temporal covariance. It is straightforward to show that a separable covariance must be fully symmetric, but full symmetry does not imply separability.
The following covariance is an example of a separable covariance function,
C(h, u) = σ² exp(−‖h‖/ν_s) · exp(−|u|/ν_t),
with σ² > 0, ν_s > 0, ν_t > 0. Nonseparable functions also find applications, taking a form similar to the following example.
C(h, u; θ) = [ σ² / (|u|^{2γ} + 1)^τ ] · exp( −c ‖h‖^{2γ} / (|u|^{2γ} + 1)^{βγ} ).
Here, τ determines the smoothness of the temporal correlation; γ ∈ (0, 1] determines the smoothness of the spatial correlation; c determines the strength of the spatial correlation; and β ∈ (0, 1] determines the strength of the space-time interaction. In this parameterization, γ = 1 corresponds to the Gaussian covariance function, while γ = 1/2 corresponds to the exponential covariance function. A smaller value of γ leads to less smoothness in the interpolation results.
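A small helper for evaluating this nonseparable covariance at given space and time lags (the parameter values below are arbitrary illustrations).

import numpy as np

def st_cov(h, u, sigma2=1.0, c=1.0, gamma=0.5, tau=1.0, beta=0.5):
    """Nonseparable space-time covariance:
    C(h, u) = sigma^2 / (|u|^(2*gamma) + 1)^tau * exp(-c * ||h||^(2*gamma) / (|u|^(2*gamma) + 1)^(beta*gamma))."""
    h = np.atleast_2d(h)
    hnorm = np.linalg.norm(h, axis=-1)
    tpart = np.abs(u) ** (2 * gamma) + 1.0
    return sigma2 / tpart ** tau * np.exp(-c * hnorm ** (2 * gamma) / tpart ** (beta * gamma))

# Covariance decays with both spatial distance and time lag; beta controls the interaction.
for u in (0.0, 1.0, 2.0):
    print(u, np.round(st_cov(np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]), u), 3))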
13 References
• Sherman, M. (2011). Spatial Statistics and Spatio-Temporal Data: Covariance Functions and Directional Properties. John Wiley & Sons.
• Geographic Information Technology Training Alliance (GITTA). http://www.gitta.info/website/en/html/index.html
• Rey, S. J. (2001). Spatial empirics for economic growth and convergence. Geographical Analysis, 33(3), 195-214.
• LeSage, J., & Pace, R. K. (2009). Introduction to Spatial Econometrics. Chapman and Hall/CRC.