Tutorial R LM
Contents

1) Reminder: Linear Regression Model
1.1) Multiple Linear Model in Matrix Notation
1.2) Least Squares for MLR Model
8) Predictions
8.1) predict() function
9) Residual Analysis
9.1) Plot of Residuals
9.2) Standardized Residuals
9.3) Q-Q Plots
1) Reminder: Linear Regression Model
The multiple linear regression model assumes a linear relationship
between the response and the predictors:
Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ε
where ε is a random error term, assumed uncorrelated from observation to observation, with
mean zero and constant variance σ².
As usual, we suppose that a data set of n ≥ p + 1 points has been collected. If we denote
by xij the i-th value of the variable Xj, then the model generates a system of equations, linear
in β0, β1, . . . , βp, of the form
y = (y1, y2, . . . , yi, . . . , yn)ᵀ
X = the n × (p + 1) matrix whose i-th row is (1, xi1, xi2, . . . , xij, . . . , xip)
β = (β0, β1, β2, . . . , βj, . . . , βp)ᵀ
ε = (ε1, ε2, . . . , εi, . . . , εn)ᵀ
and note that the system of linear equations can be expressed using matrix notation as
y = Xβ + ε
• the error vector ε is a random vector
• the response vector y is also a random vector
• the X matrix is of dimension n × (p + 1); its j-th column contains the regressor xj
measured with negligible error
• it is customary to represent the constant vector—first column of matrix X—as x0 = 1n
• X is referred to as the model matrix or the design matrix.
In this form, y and ε are each n × 1 random vectors. The vector β is a (p + 1) × 1 vector of
unknown parameters and X is an n × (p + 1) matrix of scalars.
The least squares estimates b, which minimize the sum of squared errors, are given in
matrix notation by:
b = (XᵀX)⁻¹ Xᵀ y
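As an illustration, the least squares solution can be computed directly in R from the normal equations. This is a sketch for pedagogical purposes only; internally lm() uses a QR decomposition, which is numerically more stable than forming XᵀX.

```r
# build the model matrix and response for mpg ~ disp
X <- model.matrix(~ disp, data = mtcars)   # n x (p+1): a column of 1's plus disp
y <- mtcars$mpg

# solve the normal equations (X'X) b = X'y
b <- solve(crossprod(X), crossprod(X, y))
b
```

The result agrees (up to numerical precision) with coef(lm(mpg ~ disp, data = mtcars)).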
Say we want to fit a simple linear model in which miles-per-gallon (mpg) is regressed on
displacement (disp), that is:
mpg = β0 + β1 disp + ε
We can take a look at the scatterplot of disp and mpg to get an idea of the direction
and form of their relationship:
# scatterplot
plot(mtcars$disp, mtcars$mpg, las = 1, pch = 19)
[Figure: scatterplot of mtcars$mpg versus mtcars$disp]
The function to fit a linear model is lm(), whose main arguments are:
• formula is the model formula (the only required argument)
• data is an optional data frame
• subset is an index vector specifying a subset of the data to be used (by default all
items are used)
• na.action is a function specifying how missing values are to be handled (by default
missing values are omitted)
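A quick sketch of these optional arguments in action. The subset condition am == 1 (manual-transmission cars) is just an illustrative choice, and na.action = na.omit is the default behavior, shown explicitly here:

```r
# regress mpg on disp using only the manual-transmission cars (am == 1);
# the subset expression is evaluated inside the data frame
reg_sub <- lm(mpg ~ disp, data = mtcars,
              subset = am == 1,
              na.action = na.omit)
coef(reg_sub)
```

Since mtcars has 13 manual cars and the model estimates 2 coefficients, the fit has 11 residual degrees of freedom.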
3.1) Example: Simple Linear Regression
Let’s bring back the simple linear model:
mpg = β0 + β1 disp + ε
How do you specify the formula for this model with lm()? When the predictor(s) and the
response variable are all in a single data frame, you can use lm() as follows:
# simple linear regression
reg <- lm(mpg ~ disp, data = mtcars)
The first argument of lm() consists of an R formula: mpg ~ disp. The tilde, ~, is the
formula operator used to indicate that mpg is predicted or described by disp.
The second argument, data = mtcars, is used to indicate the name of the data frame that
contains the variables mpg and disp, which in this case is the object mtcars. Working with
data frames and using this argument is strongly recommended.
Call:
lm(formula = mpg ~ disp, data = mtcars)
Coefficients:
(Intercept) disp
29.59985 -0.04122
Notice that the output contains two parts: Call: and Coefficients:.
The first part of the output, Call:, simply tells you the command used to run the analysis,
in this case: lm(formula = mpg ~ disp, data = mtcars).
The second part of the output, Coefficients:, shows information about the regression
coefficients. The intercept is 29.6, and the coefficient of disp is -0.0412. Observe the names
used by R to display the coefficients: the intercept is displayed as (Intercept), while the
non-intercept term is displayed with the name of the associated variable, disp.
The printed output of reg is very minimalist. However, reg contains more information. To
see a list of the different components in reg, use the function names():
# what's in an "lm" object?
names(reg)
As you can tell, reg contains many more things than just the coefficients. In fact, the
output of lm() is heavily focused on statistical inference, designed to provide results that
you can use to form confidence intervals and perform significance tests.
Here’s a short description of each of the output elements:
• coefficients: a named vector of coefficients.
• residuals: the residuals, that is, response minus fitted values.
• fitted.values: the fitted mean values.
• rank: the numeric rank of the fitted linear model.
• df.residual: the residual degrees of freedom.
• call: the matched call.
• terms: the terms object used.
• model: if requested (the default), the model frame used.
To inspect what’s in each returned component, type the name of the regression object, reg,
followed by the $ dollar operator, followed by the name of the desired component. For
example, to inspect the coefficients run this:
# regression coefficients
reg$coefficients
(Intercept) disp
29.59985476 -0.04121512
Alternatively, lm()—like other similar model-fitting functions—comes with generic helper
functions to extract the output elements, such as:
• coef() to extract the coefficients
• fitted() to extract the fitted values
• residuals() to extract the residuals
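For instance, the extractors return the same values as the corresponding components accessed with the dollar operator (reg is refit here so the snippet is self-contained):

```r
reg <- lm(mpg ~ disp, data = mtcars)

coef(reg)             # same as reg$coefficients
head(fitted(reg))     # same as head(reg$fitted.values)
head(residuals(reg))  # same as head(reg$residuals)
```

Using the extractor functions is generally preferred over $, since they also work for other model classes (e.g. "glm") where the internal components may differ.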
[Figure: scatterplot of mtcars$mpg versus mtcars$disp]
The function abline() allows you to add lines to a plot(). The good news is that abline()
recognizes objects of class "lm", and when invoked after a call to plot(), it will add the
regression line to the plotted chart.
You can tweak some of the abline() arguments to increase the width of the line (lwd) or
change its color (col):
# scatterplot with regression line
plot(mtcars$disp, mtcars$mpg, las = 1, pch = 19)
abline(reg, lwd = 2, col = "blue")
[Figure: scatterplot of mtcars$mpg versus mtcars$disp with the regression line in blue]
[Figure: scatterplot of Miles per gallon versus Displacement (cu.in.) with customized axes]
# scatterplot with regression line and customized axes
plot(mtcars$disp, mtcars$mpg, pch = 19, axes = FALSE,
     xlab = "Displacement (cu.in.)", ylab = "Miles per gallon")
abline(reg, col = "tomato", lwd = 2) # regression line
axis(side = 1, pos = 10, at = seq(50, 500, 50))
axis(side = 2, las = 1, pos = 50, at = seq(10, 40, 5))
[Figure: scatterplot of mpg versus disp with the regression line]
# scatterplot with regression line (requires the ggplot2 package)
library(ggplot2)
ggplot(data = mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm")
[Figure: ggplot scatterplot of mpg versus disp with regression line and gray confidence ribbon]
When using stat_smooth() we need to set the argument method = "lm". As you can
tell, by default stat_smooth() displays the regression line (in blue) along with a gray
"ribbon" surrounding it. This shaded region is the 95% confidence band around the fitted
line, computed from its standard error.
To prevent the confidence region from being displayed, we set the argument se = FALSE:
ggplot(data = mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
[Figure: ggplot scatterplot of mpg versus disp with regression line, no ribbon]
5) Example: Multiple Linear Regression
Say we want to fit a multiple linear model in which miles per gallon (mpg) is regressed
on horsepower (hp), quarter-mile time (qsec), and weight in 1000 lbs (wt), that is:
mpg = β0 + β1 hp + β2 qsec + β3 wt + ε
# multiple linear regression
lm(mpg ~ hp + qsec + wt, data = mtcars)
Call:
lm(formula = mpg ~ hp + qsec + wt, data = mtcars)
Coefficients:
(Intercept) hp qsec wt
27.61053 -0.01782 0.51083 -4.35880
response ∼ expression
where the left-hand side, response, may in some uses be absent and the right-hand side,
expression, is a collection of terms joined by operators usually resembling an arithmetical
expression. The meaning of the right-hand side is context dependent.
The formula is interpreted in the context of the argument data which must be a list, usually
a data frame; the objects named on either side of the formula are looked for first in data. If
no data frame is provided, then R will search for the specified objects (i.e. variables) in the
global environment. So, the following calls to lm() are equivalent:
# with argument 'data'
lm(mpg ~ disp + hp, data = mtcars)
# without argument 'data'
lm(mtcars$mpg ~ mtcars$disp + mtcars$hp)
6.1) The + (plus) operator
Notice that in these cases the + indicates inclusion, not addition. You can also use - which
indicates exclusion.
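A quick sketch of both operators (the model choices here are hypothetical, purely to illustrate the syntax):

```r
# '+' includes terms: a model with disp and hp
reg_inc <- lm(mpg ~ disp + hp, data = mtcars)

# '-' excludes terms: all other variables in mtcars except carb
reg_exc <- lm(mpg ~ . - carb, data = mtcars)
names(coef(reg_exc))
```

The coefficients of reg_exc contain every predictor in mtcars except carb.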
As mentioned above, the formula expression mpg ~ disp corresponds to the linear model:
mpg = β0 + β1 disp + ε
6.2) The . (dot) operator
The dot "." has a special meaning: in lm() it stands for all the other variables available in the
object data. This is very convenient when your model has several variables, and typing all
of them becomes tedious. The call lm(mpg ~ ., data = mtcars) is equivalent to:
# verbose: fit model with all available predictors
reg_all <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec +
vs + am + gear + carb, data = mtcars)
6.3) The : (colon) operator
You can use the ":" operator to specify an interaction term between two variables in a
formula object:
# fit model with an interaction term
reg_int1 <- lm(mpg ~ cyl:disp + hp, data = mtcars)
6.4) The * (product) operator
Related to the ":" interaction operator, R also provides the "*" operator to handle
interactions between variables. The difference is that "*" also produces the individual terms.
For example, suppose we are interested in modeling mpg in terms of cyl, disp, and hp,
suspecting an interaction between cyl and disp. This time, though, the model of interest
contains individual terms for both cyl and disp.
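A sketch of the expansion: cyl * disp is shorthand for cyl + disp + cyl:disp, so the two fits below (hypothetical object names) contain exactly the same terms:

```r
# '*' produces the main effects and the interaction
reg_star <- lm(mpg ~ cyl * disp + hp, data = mtcars)

# the expanded long form gives the same model terms
reg_long <- lm(mpg ~ cyl + disp + cyl:disp + hp, data = mtcars)

names(coef(reg_star))
```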
To avoid this confusion, the function I() can be used to bracket those portions of a model
formula where the operators are used in their arithmetic sense.
In a formula, a term like cyl*cyl indicates an interaction between cyl and cyl, which may
not necessarily be what we want. So, to tell R that * should be used as the arithmetic
product operator instead of the formula interaction operator, we use I():
reg_qua <- lm(mpg ~ cyl + I(cyl*cyl), data = mtcars)
or equivalently
reg_qua <- lm(mpg ~ cyl + I(cyl^2), data = mtcars)
7) summary() of "lm" objects
As with many objects in R, you can apply the function summary() to an object of class "lm".
This will provide, among other things, an extended display of the fitted model. Here’s what
the output of summary() looks like with our object reg:
# summary of an "lm" object
reg <- lm(mpg ~ disp + hp, data = mtcars)
reg_sum <- summary(reg)
reg_sum
Call:
lm(formula = mpg ~ disp + hp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.7945 -2.3036 -0.8246 1.8582 6.9363
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.735904 1.331566 23.083 < 2e-16 ***
disp -0.030346 0.007405 -4.098 0.000306 ***
hp -0.024840 0.013385 -1.856 0.073679 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There’s a lot going on in the output of summary(). So let’s examine all the returned pieces.
The Residuals: part of the output shows a quantile summary of the residuals:
• minimum,
• 1st quartile,
• median,
• 3rd quartile,
• and maximum.
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.7945 -2.3036 -0.8246 0.0000 1.8582 6.9363
In case you wonder, you can visualize these numeric summaries with a boxplot:
# boxplot of residuals
boxplot(residuals(reg), horizontal = TRUE, ylim = c(-6, 8), axes = FALSE)
axis(side = 1)
[Figure: boxplot of the residuals]
• The first column has the names of the model terms: (Intercept), disp, and hp.
• The second column contains the estimated values.
• The third column has the standard error of the estimates.
• The fourth column has the t-statistic values.
• The fifth column corresponds to the p-values associated to t.
Notice all those asterisks next to the p-values. The estimates of (Intercept) and disp are
marked with three stars, indicating p-values of less than 0.001.
These p-values correspond to tests of the (null) hypothesis:
H0 : βj = 0 j = 1, 2
under the assumptions of the classic linear model (i.e. linearity, normality, homoscedasticity,
independence). We'll say more about the table of coefficients shortly.
Keep in mind that most elements of this part of the output have to do with inferential
aspects of the classic linear model.
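As a sanity check, the t value and p-value of a coefficient can be reconstructed from the Estimate and Std. Error columns of the summary table; a sketch for disp:

```r
reg <- lm(mpg ~ disp + hp, data = mtcars)
ctab <- summary(reg)$coefficients

# t value = Estimate / Std. Error
t_disp <- ctab["disp", "Estimate"] / ctab["disp", "Std. Error"]

# two-sided p-value from the t distribution with n - (p+1) degrees of freedom
p_disp <- 2 * pt(-abs(t_disp), df = df.residual(reg))

c(t = t_disp, p = p_disp)
```

Both values match the "t value" and "Pr(>|t|)" columns reported by summary().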
8) Predictions
In addition to the summary.lm() and print.lm() functions, "lm" objects also have an
associated predict.lm() function.
Consider the following example of a multiple linear model in which miles per gallon (mpg) is
regressed on displacement (disp) and horse power (hp) which would correspond to:
mpg = β0 + β1 disp + β2 hp + ε
# multiple linear model
reg <- lm(mpg ~ disp + hp, data = mtcars)
reg
Call:
lm(formula = mpg ~ disp + hp, data = mtcars)
Coefficients:
(Intercept) disp hp
30.73590 -0.03035 -0.02484
# data frame with one new observation
new_data <- data.frame(disp = 200, hp = 150, row.names = "new car")
new_data
disp hp
new car 200 150
To obtain the predictions we call predict() and pass the "lm" object and the data frame
for the argument newdata:
# predicted mpg for the new observation
predict(reg, newdata = new_data)
new car
20.94064
The predict() method for "lm" objects works by attaching the estimated coefficients to a
new model matrix that it constructs using the formula and the new data.
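A sketch of that mechanism, computing the prediction by hand with model.matrix() (the new observation below, disp = 200 and hp = 150, matches the example above):

```r
reg <- lm(mpg ~ disp + hp, data = mtcars)
new_data <- data.frame(disp = 200, hp = 150)

# model matrix for the new data: a column of 1's plus the predictor columns
X_new <- model.matrix(~ disp + hp, data = new_data)

# prediction = X_new %*% b
manual_pred <- drop(X_new %*% coef(reg))
manual_pred
```

This agrees with predict(reg, newdata = new_data).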
Now let’s get predictions with more new observations:
# data frame with several new observations
new_data <- data.frame(
  disp = c(100, 125, 150, 175, 200),
  hp = c(80, 90, 100, 110, 120),
  row.names = c("car_a", "car_b", "car_c", "car_d", "car_e")
)
new_data
disp hp
car_a 100 80
car_b 125 90
car_c 150 100
car_d 175 110
car_e 200 120
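Beyond point predictions, predict() can return interval estimates via its interval argument; a quick sketch with both a confidence interval (for the mean response) and a prediction interval (for a new individual response), at the default 95% level:

```r
reg <- lm(mpg ~ disp + hp, data = mtcars)
new_data <- data.frame(disp = c(100, 150, 200), hp = c(80, 100, 120))

# confidence interval for the mean response
predict(reg, newdata = new_data, interval = "confidence")

# prediction interval for a new individual response (always wider)
pred_int <- predict(reg, newdata = new_data, interval = "prediction")
pred_int
```

The result is a matrix with columns fit, lwr, and upr.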
9) Residual Analysis
As an example, let's take the mtcars data and regress mpg on hp:
reg <- lm(mpg ~ hp, data = mtcars)
reg
Call:
lm(formula = mpg ~ hp, data = mtcars)
Coefficients:
(Intercept) hp
30.09886 -0.06823
We can plot the data together with the regression line:
plot(mtcars$hp, mtcars$mpg, las = 1)
abline(reg)
[Figure: scatterplot of mtcars$mpg versus mtcars$hp with the regression line]
9.1) Plot of Residuals
The plot below is the default residual plot provided in R when using the plot() method on
an object of class "lm", choosing the argument which = 1
# plot of ordinary residuals -vs- fitted values
plot(reg, which = 1, las = 1)
[Figure: Residuals vs Fitted plot for lm(mpg ~ hp)]
You can obtain the same plot by graphing residuals against fitted values.
hat_values <- hatvalues(reg)
fitted <- fitted(reg)
res_ord <- residuals(reg)
res_std <- rstandard(reg)
res_stu <- rstudent(reg)
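To see how these quantities relate, standardized residuals can be reconstructed by hand: each ordinary residual is divided by the residual standard error times the square root of one minus its leverage. A self-contained sketch:

```r
reg <- lm(mpg ~ hp, data = mtcars)
res_ord <- residuals(reg)
hat_values <- hatvalues(reg)           # leverages h_ii
sigma_hat <- summary(reg)$sigma        # residual standard error

# standardized residual: e_i / (sigma * sqrt(1 - h_ii))
manual_std <- res_ord / (sigma_hat * sqrt(1 - hat_values))

# agrees with rstandard(reg)
all.equal(manual_std, rstandard(reg))
```

Studentized residuals (rstudent()) follow the same idea but use a leave-one-out estimate of sigma instead of the overall one.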
[Figure: plot of res_ord versus fitted]
[Figure: Scale-Location plot for lm(mpg ~ hp)]
plot(fitted, sqrt(abs(res_std)), las = 1, ylim = c(0, 1.6))
[Figure: plot of sqrt(abs(res_std)) versus fitted]
9.3) Q-Q Plots
[Figure: Normal Q-Q plot of standardized residuals for lm(mpg ~ hp); labeled points: Maserati Bora, Lotus Europa, Toyota Corolla]
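This Q-Q plot can be produced with plot(reg, which = 2), or built manually with qqnorm() and qqline(); a sketch:

```r
reg <- lm(mpg ~ hp, data = mtcars)

# Q-Q plot of standardized residuals against theoretical normal quantiles
qqnorm(rstandard(reg), las = 1)
qqline(rstandard(reg))
```

If the residuals are approximately normal, the points should fall close to the reference line.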
[Figure: Normal Q-Q plot of studentized residuals]