Tutorial R LM
Contents

1) Reminder: Linear Regression Model
1.1) Multiple Linear Model in Matrix Notation
1.2) Least Squares for MLR Model
8) Predictions
8.1) predict() function
9) Residual Analysis
9.1) Plot of Residuals
9.2) Standardized Residuals
9.3) Q-Q Plots
1) Reminder: Linear Regression Model
The multiple linear regression model assumes a linear relationship
between the response and the predictors:
Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ε
where ε is a random error term, assumed uncorrelated from observation to observation, with
mean zero and constant variance σ².
As usual, we suppose that a data set of n ≥ p + 1 points has been collected. If we denote
by xij the i-th value of the variable Xj, then the model generates a system of equations, linear
in β0, β1, . . . , βp, of the form
y = (y1, y2, . . . , yi, . . . , yn)ᵀ
X = the n × (p + 1) matrix whose i-th row is (1, xi1, xi2, . . . , xij, . . . , xip)
β = (β0, β1, β2, . . . , βj, . . . , βp)ᵀ
ε = (ε1, ε2, . . . , εi, . . . , εn)ᵀ
and note that the system of linear equations can be expressed using matrix notation as
y = Xβ + ε
• the error vector ε is a random vector
• the response vector y is also a random vector
• the X matrix is of dimension n × (p + 1); its j-th column contains the regressor xj
measured with negligible error
• it is customary to represent the constant vector—first column of matrix X—as x0 = 1n
• X is referred to as the model matrix or the design matrix.
In this form, y and ε are each n × 1 random vectors. The vector β is a (p + 1) × 1 vector of
unknown parameters and X is an n × (p + 1) matrix of scalars.
The least squares estimates b, which minimize the sum of squared errors, are given in
matrix notation by:
b = (XᵀX)⁻¹ Xᵀ y
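As an illustration, the least squares solution can be computed directly in R from the normal equations. This is a sketch for pedagogical purposes only; internally lm() uses a QR decomposition, which is numerically more stable than forming XᵀX.

```r
# build the model matrix and response for mpg ~ disp
X <- model.matrix(~ disp, data = mtcars)   # n x (p+1): a column of 1's plus disp
y <- mtcars$mpg

# solve the normal equations (X'X) b = X'y
b <- solve(crossprod(X), crossprod(X, y))
b
```

The result agrees (up to numerical precision) with coef(lm(mpg ~ disp, data = mtcars)).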
Say we want to fit a simple linear model in which miles-per-gallon (mpg) is regressed on
displacement (disp), that is:
mpg = β0 + β1 disp + ε
We can take a look at the scatterplot of disp and mpg to get an idea of the direction
and form of their relationship:
# scatterplot
plot(mtcars$disp, mtcars$mpg, las = 1, pch = 19)
[Figure: scatterplot of mtcars$mpg versus mtcars$disp]
The function to fit a linear model is lm(), whose main arguments are:
• formula is the model formula (the only required argument)
• data is an optional data frame
• subset is an index vector specifying a subset of the data to be used (by default all
items are used)
• na.action is a function specifying how missing values are to be handled (by default
missing values are omitted)
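A quick sketch of these optional arguments in action. The subset condition am == 1 (manual-transmission cars) is just an illustrative choice, and na.action = na.omit is the default behavior, shown explicitly here:

```r
# regress mpg on disp using only the manual-transmission cars (am == 1);
# the subset expression is evaluated inside the data frame
reg_sub <- lm(mpg ~ disp, data = mtcars,
              subset = am == 1,
              na.action = na.omit)
coef(reg_sub)
```

Since mtcars has 13 manual cars and the model estimates 2 coefficients, the fit has 11 residual degrees of freedom.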
3.1) Example: Simple Linear Regression
Let’s bring back the simple linear model:
mpg = β0 + β1 disp + ε
How do you specify the formula for this model with lm()? When the predictor(s) and the
response variable are all in a single data frame, you can use lm() as follows:
# simple linear regression
reg <- lm(mpg ~ disp, data = mtcars)
The first argument of lm() consists of an R formula: mpg ~ disp. The tilde, ~, is the
formula operator used to indicate that mpg is predicted or described by disp.
The second argument, data = mtcars, is used to indicate the name of the data frame that
contains the variables mpg and disp, which in this case is the object mtcars. Working with
data frames and using this argument is strongly recommended.
Call:
lm(formula = mpg ~ disp, data = mtcars)
Coefficients:
(Intercept) disp
29.59985 -0.04122
Notice that the output contains two parts: Call: and Coefficients:.
The first part of the output, Call:, simply tells you the command used to run the analysis,
in this case: lm(formula = mpg ~ disp, data = mtcars).
The second part of the output, Coefficients:, shows information about the regression
coefficients. The intercept is 29.6, and the coefficient of disp is -0.0412. Observe the names
used by R to display the coefficients: the intercept is displayed as (Intercept), while the
non-intercept term is displayed with the name of the associated variable, disp.
The printed output of reg is very minimalist. However, reg contains more information. To
see a list of the different components in reg, use the function names():
# what's in an "lm" object?
names(reg)
As you can tell, reg contains many more things than just the coefficients. In fact, the
output of lm() is heavily focused on statistical inference, designed to provide results that
you can use to form confidence intervals and perform significance tests.
Here’s a short description of each of the output elements:
• coefficients: a named vector of coefficients.
• residuals: the residuals, that is, response minus fitted values.
• fitted.values: the fitted mean values.
• rank: the numeric rank of the fitted linear model.
• df.residual: the residual degrees of freedom.
• call: the matched call.
• terms: the terms object used.
• model: if requested (the default), the model frame used.
To inspect what’s in each returned component, type the name of the regression object, reg,
followed by the $ dollar operator, followed by the name of the desired component. For
example, to inspect the coefficients run this:
# regression coefficients
reg$coefficients
(Intercept) disp
29.59985476 -0.04121512
Alternatively, lm()—like other similar model-fitting functions—comes with generic helper
functions to extract the output elements, such as:
• coef() to extract the coefficients
• fitted() to extract the fitted values
• residuals() to extract the residuals
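For instance, the extractors return the same values as the corresponding components accessed with the dollar operator (reg is refit here so the snippet is self-contained):

```r
reg <- lm(mpg ~ disp, data = mtcars)

coef(reg)             # same as reg$coefficients
head(fitted(reg))     # same as head(reg$fitted.values)
head(residuals(reg))  # same as head(reg$residuals)
```

Using the extractor functions is generally preferred over $, since they also work for other model classes (e.g. "glm") where the internal components may differ.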
[Figure: scatterplot of mtcars$mpg versus mtcars$disp]
The function abline() allows you to add lines to a plot(). The good news is that abline()
recognizes objects of class "lm", and when invoked after a call to plot(), it will add the
regression line to the plotted chart.
You can tweak some of the abline() arguments to increase the width of the line (lwd) or
change its color (col):
# scatterplot with regression line
plot(mtcars$disp, mtcars$mpg, las = 1, pch = 19)
abline(reg, lwd = 2, col = "blue")
[Figure: scatterplot of mtcars$mpg versus mtcars$disp with the regression line in blue]
[Figure: scatterplot of Miles per gallon versus Displacement (cu.in.) with customized axes]
# scatterplot with regression line and customized axes
plot(mtcars$disp, mtcars$mpg, pch = 19, axes = FALSE,
     xlab = "Displacement (cu.in.)", ylab = "Miles per gallon")
abline(reg, col = "tomato", lwd = 2) # regression line
axis(side = 1, pos = 10, at = seq(50, 500, 50))
axis(side = 2, las = 1, pos = 50, at = seq(10, 40, 5))
[Figure: scatterplot of mpg versus disp with the regression line]
# scatterplot with regression line (requires the ggplot2 package)
library(ggplot2)
ggplot(data = mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm")
[Figure: ggplot scatterplot of mpg versus disp with regression line and gray confidence ribbon]
When using stat_smooth() we need to set the argument method = "lm". As you can
tell, by default stat_smooth() displays the regression line (in blue) along with a gray
"ribbon" surrounding it. This shaded region is the 95% confidence band around the fitted
line, computed from its standard error.
To prevent the confidence region from being displayed, we set the argument se = FALSE:
ggplot(data = mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
[Figure: ggplot scatterplot of mpg versus disp with regression line, no ribbon]
5) Example: Multiple Linear Regression
Say we want to fit a multiple linear model in which miles per gallon (mpg) is regressed
on horsepower (hp), quarter-mile time (qsec), and weight in 1000 lbs (wt), that is:
mpg = β0 + β1 hp + β2 qsec + β3 wt + ε
# multiple linear regression
lm(mpg ~ hp + qsec + wt, data = mtcars)
Call:
lm(formula = mpg ~ hp + qsec + wt, data = mtcars)
Coefficients:
(Intercept) hp qsec wt
27.61053 -0.01782 0.51083 -4.35880
response ∼ expression
where the left-hand side, response, may in some uses be absent and the right-hand side,
expression, is a collection of terms joined by operators usually resembling an arithmetical
expression. The meaning of the right-hand side is context dependent.
The formula is interpreted in the context of the argument data which must be a list, usually
a data frame; the objects named on either side of the formula are looked for first in data. If
no data frame is provided, then R will search for the specified objects (i.e. variables) in the
global environment. So, the following calls to lm() are equivalent:
# with argument 'data'
lm(mpg ~ disp + hp, data = mtcars)
# without argument 'data'
lm(mtcars$mpg ~ mtcars$disp + mtcars$hp)
6.1) The + (plus) operator
Notice that in these cases the + indicates inclusion, not addition. You can also use - which
indicates exclusion.
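A quick sketch of both operators (the model choices here are hypothetical, purely to illustrate the syntax):

```r
# '+' includes terms: a model with disp and hp
reg_inc <- lm(mpg ~ disp + hp, data = mtcars)

# '-' excludes terms: all other variables in mtcars except carb
reg_exc <- lm(mpg ~ . - carb, data = mtcars)
names(coef(reg_exc))
```

The coefficients of reg_exc contain every predictor in mtcars except carb.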
As mentioned above, the formula expression mpg ~ disp corresponds to the linear model:
mpg = β0 + β1 disp + ε
6.2) The . (dot) operator
The dot "." has a special meaning: in lm() it stands for all the other variables available in the
object data. This is very convenient when your model has several variables, and typing all
of them becomes tedious. The call lm(mpg ~ ., data = mtcars) is equivalent to:
# verbose: fit model with all available predictors
reg_all <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec +
vs + am + gear + carb, data = mtcars)
6.3) The : (colon) operator
You can use the ":" operator to specify an interaction term between two variables in a
formula object:
# fit model with an interaction term
reg_int1 <- lm(mpg ~ cyl:disp + hp, data = mtcars)
6.4) The * (product) operator
Related to the ":" interaction operator, R also provides the "*" operator to handle
interactions between variables. The difference is that "*" also produces the individual terms.
For example, suppose we are interested in modeling mpg in terms of cyl, disp, and hp,
suspecting an interaction between cyl and disp. This time, though, the model of interest
contains individual terms for both cyl and disp.
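A sketch of the expansion: cyl * disp is shorthand for cyl + disp + cyl:disp, so the two fits below (hypothetical object names) contain exactly the same terms:

```r
# '*' produces the main effects and the interaction
reg_star <- lm(mpg ~ cyl * disp + hp, data = mtcars)

# the expanded long form gives the same model terms
reg_long <- lm(mpg ~ cyl + disp + cyl:disp + hp, data = mtcars)

names(coef(reg_star))
```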
To avoid this confusion, the function I() can be used to bracket those portions of a model
formula where the operators are used in their arithmetic sense.
In a formula, a term like cyl*cyl indicates an interaction between cyl and cyl, which may
not necessarily be what we want. So, to tell R that * should be used as the arithmetic
product operator instead of the formula interaction operator, we use I():
reg_qua <- lm(mpg ~ cyl + I(cyl*cyl), data = mtcars)
or equivalently
reg_qua <- lm(mpg ~ cyl + I(cyl^2), data = mtcars)
7) summary() of "lm" objects
As with many objects in R, you can apply the function summary() to an object of class "lm".
This will provide, among other things, an extended display of the fitted model. Here’s what
the output of summary() looks like with our object reg:
# summary of an "lm" object
reg <- lm(mpg ~ disp + hp, data = mtcars)
reg_sum <- summary(reg)
reg_sum
Call:
lm(formula = mpg ~ disp + hp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.7945 -2.3036 -0.8246 1.8582 6.9363
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.735904 1.331566 23.083 < 2e-16 ***
disp -0.030346 0.007405 -4.098 0.000306 ***
hp -0.024840 0.013385 -1.856 0.073679 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There’s a lot going on in the output of summary(). So let’s examine all the returned pieces.
The Residuals: part of the output shows a quantile summary of the residuals:
• minimum,
• 1st quartile,
• median,
• 3rd quartile,
• and maximum.
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.7945 -2.3036 -0.8246 0.0000 1.8582 6.9363
In case you wonder, you can visualize these numeric summaries with a boxplot:
# boxplot of residuals
boxplot(residuals(reg), horizontal = TRUE, ylim = c(-6, 8), axes = FALSE)
axis(side = 1)
[Figure: boxplot of the residuals]
• The first column has the names of the model terms: (Intercept), disp, and hp.
• The second column contains the estimated values.
• The third column has the standard error of the estimates.
• The fourth column has the t-statistic values.
• The fifth column corresponds to the p-values associated to t.
Notice all those asterisks next to the p-values. The estimates of (Intercept) and disp are
marked with three stars, indicating p-values of less than 0.001.
These p-values correspond to tests of the (null) hypothesis:
H0 : βj = 0 j = 1, 2
under the assumptions of the classic linear model (i.e. linearity, normality, homoscedasticity,
independence). We'll say more about the table of coefficients shortly.
Keep in mind that most elements of this part of the output have to do with inferential
aspects of the classic linear model.
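As a sanity check, the t value and p-value of a coefficient can be reconstructed from the Estimate and Std. Error columns of the summary table; a sketch for disp:

```r
reg <- lm(mpg ~ disp + hp, data = mtcars)
ctab <- summary(reg)$coefficients

# t value = Estimate / Std. Error
t_disp <- ctab["disp", "Estimate"] / ctab["disp", "Std. Error"]

# two-sided p-value from the t distribution with n - (p+1) degrees of freedom
p_disp <- 2 * pt(-abs(t_disp), df = df.residual(reg))

c(t = t_disp, p = p_disp)
```

Both values match the "t value" and "Pr(>|t|)" columns reported by summary().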
8) Predictions
In addition to the summary.lm() and print.lm() functions, "lm" objects also have an
associated predict.lm() function.
Consider the following example of a multiple linear model in which miles per gallon (mpg) is
regressed on displacement (disp) and horse power (hp) which would correspond to:
mpg = β0 + β1 disp + β2 hp + ε
# multiple linear model
reg <- lm(mpg ~ disp + hp, data = mtcars)
reg
Call:
lm(formula = mpg ~ disp + hp, data = mtcars)
Coefficients:
(Intercept) disp hp
30.73590 -0.03035 -0.02484
# data frame with one new observation
new_data <- data.frame(disp = 200, hp = 150, row.names = "new car")
new_data
disp hp
new car 200 150
To obtain the predictions we call predict() and pass the "lm" object and the data frame
for the argument newdata:
# predicted mpg for the new observation
predict(reg, newdata = new_data)
new car
20.94064
The predict() method for "lm" objects works by attaching the estimated coefficients to a
new model matrix that it constructs using the formula and the new data.
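A sketch of that mechanism, computing the prediction by hand with model.matrix() (the new observation below, disp = 200 and hp = 150, matches the example above):

```r
reg <- lm(mpg ~ disp + hp, data = mtcars)
new_data <- data.frame(disp = 200, hp = 150)

# model matrix for the new data: a column of 1's plus the predictor columns
X_new <- model.matrix(~ disp + hp, data = new_data)

# prediction = X_new %*% b
manual_pred <- drop(X_new %*% coef(reg))
manual_pred
```

This agrees with predict(reg, newdata = new_data).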
Now let’s get predictions with more new observations:
# data frame with several new observations
new_data <- data.frame(
  disp = c(100, 125, 150, 175, 200),
  hp = c(80, 90, 100, 110, 120),
  row.names = c("car_a", "car_b", "car_c", "car_d", "car_e")
)
new_data
disp hp
car_a 100 80
car_b 125 90
car_c 150 100
car_d 175 110
car_e 200 120
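Beyond point predictions, predict() can return interval estimates via its interval argument; a quick sketch with both a confidence interval (for the mean response) and a prediction interval (for a new individual response), at the default 95% level:

```r
reg <- lm(mpg ~ disp + hp, data = mtcars)
new_data <- data.frame(disp = c(100, 150, 200), hp = c(80, 100, 120))

# confidence interval for the mean response
predict(reg, newdata = new_data, interval = "confidence")

# prediction interval for a new individual response (always wider)
pred_int <- predict(reg, newdata = new_data, interval = "prediction")
pred_int
```

The result is a matrix with columns fit, lwr, and upr.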
9) Residual Analysis
As an example, let's take the mtcars data and regress mpg on hp:
reg <- lm(mpg ~ hp, data = mtcars)
reg
Call:
lm(formula = mpg ~ hp, data = mtcars)
Coefficients:
(Intercept) hp
30.09886 -0.06823
We can plot the data together with the regression line:
plot(mtcars$hp, mtcars$mpg, las = 1)
abline(reg)
[Figure: scatterplot of mtcars$mpg versus mtcars$hp with the regression line]
9.1) Plot of Residuals
The plot below is the default residual plot provided in R when using the plot() method on
an object of class "lm", choosing the argument which = 1
# plot of ordinary residuals -vs- fitted values
plot(reg, which = 1, las = 1)
[Figure: Residuals vs Fitted plot for lm(mpg ~ hp)]
You can obtain the same plot by graphing residuals against fitted values.
hat_values <- hatvalues(reg)
fitted <- fitted(reg)
res_ord <- residuals(reg)
res_std <- rstandard(reg)
res_stu <- rstudent(reg)
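To see how these quantities relate, standardized residuals can be reconstructed by hand: each ordinary residual is divided by the residual standard error times the square root of one minus its leverage. A self-contained sketch:

```r
reg <- lm(mpg ~ hp, data = mtcars)
res_ord <- residuals(reg)
hat_values <- hatvalues(reg)           # leverages h_ii
sigma_hat <- summary(reg)$sigma        # residual standard error

# standardized residual: e_i / (sigma * sqrt(1 - h_ii))
manual_std <- res_ord / (sigma_hat * sqrt(1 - hat_values))

# agrees with rstandard(reg)
all.equal(manual_std, rstandard(reg))
```

Studentized residuals (rstudent()) follow the same idea but use a leave-one-out estimate of sigma instead of the overall one.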
[Figure: plot of res_ord versus fitted]
[Figure: Scale-Location plot for lm(mpg ~ hp)]
plot(fitted, sqrt(abs(res_std)), las = 1, ylim = c(0, 1.6))
[Figure: plot of sqrt(abs(res_std)) versus fitted]
9.3) Q-Q Plots
[Figure: Normal Q-Q plot of standardized residuals for lm(mpg ~ hp); labeled points: Maserati Bora, Lotus Europa, Toyota Corolla]
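This Q-Q plot can be produced with plot(reg, which = 2), or built manually with qqnorm() and qqline(); a sketch:

```r
reg <- lm(mpg ~ hp, data = mtcars)

# Q-Q plot of standardized residuals against theoretical normal quantiles
qqnorm(rstandard(reg), las = 1)
qqline(rstandard(reg))
```

If the residuals are approximately normal, the points should fall close to the reference line.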
[Figure: Normal Q-Q plot of studentized residuals]