
HW2_2024Fall_solution

Rui Zhou

2024-09-21

Load the libraries


Problem I. HighPeaks Continued (4 pts)
Interest is in studying the time it takes to complete a hike, Time. This time, we will examine four potential
predictors, Length, Elevation, Ascent, and Difficulty.

#install.packages(c('Stat2Data', 'car'))
library('Stat2Data')
library('car')  # provides Confint() and avPlots(), used below
data("HighPeaks")

a. Construct a multiple regression model for Time using the four quantitative predictors in the HighPeaks
dataset.

head(HighPeaks)

## Peak Elevation Difficulty Ascent Length Time


## 1 Mt. Marcy 5344 5 3166 14.8 10.0
## 2 Algonquin Peak 5114 5 2936 9.6 9.0
## 3 Mt. Haystack 4960 7 3570 17.8 12.0
## 4 Mt. Skylight 4926 7 4265 17.9 15.0
## 5 Whiteface Mtn. 4867 4 2535 10.4 8.5
## 6 Dix Mtn. 4857 5 2800 13.2 10.0

mod1 <- lm(Time ~ Ascent + Elevation + Length + Difficulty, data = HighPeaks)


summary(mod1)

##
## Call:
## lm(formula = Time ~ Ascent + Elevation + Length + Difficulty,
## data = HighPeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77942 -0.81216 -0.08647 0.68962 3.06736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9567864 2.2307630 2.670 0.01082 *

## Ascent 0.0006011 0.0003310 1.816 0.07669 .
## Elevation -0.0016703 0.0005183 -3.223 0.00249 **
## Length 0.4440084 0.0812523 5.465 2.49e-06 ***
## Difficulty 0.8654527 0.2285275 3.787 0.00049 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.171 on 41 degrees of freedom
## Multiple R-squared: 0.8401, Adjusted R-squared: 0.8245
## F-statistic: 53.84 on 4 and 41 DF, p-value: 8.738e-16

b. What is the F-statistic at the bottom of the multiple regression model output in R testing? State your
answer in terms of NH and AH.

Time = β0 + β1*Ascent + β2*Elevation + β3*Length + β4*Difficulty + e

NH: β1 = β2 = β3 = β4 = 0
AH: at least one βi ≠ 0 (i = 1, 2, 3, 4)
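As a sanity check, the overall F-statistic can be recovered from R^2 alone. A sketch using the values printed in the summary output (n = 46 observations, p = 4 predictors):

```r
# F = (R^2 / p) / ((1 - R^2) / (n - p - 1)), with R^2 = 0.8400656 from the output
r2 <- 0.8400656
n <- 46; p <- 4
(F.overall <- (r2 / p) / ((1 - r2) / (n - p - 1)))  # ~53.84, matching the summary
```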

c. Use t-tests to assess the contribution of each regressor to the model. Discuss your findings.

summary(mod1)$coefficients

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 5.9567864249 2.2307630212 2.670291 1.081504e-02
## Ascent 0.0006010924 0.0003310058 1.815958 7.669433e-02
## Elevation -0.0016703195 0.0005183219 -3.222552 2.491872e-03
## Length 0.4440083835 0.0812522535 5.464567 2.490364e-06
## Difficulty 0.8654527479 0.2285275407 3.787083 4.900769e-04

If our α level is 0.05, the contributions of all regressors, except for Ascent, are significant.
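One way to automate that reading (a sketch; assumes mod1 from part (a) is in the workspace):

```r
# Extract each regressor's p-value and flag those not significant at alpha = 0.05
alpha <- 0.05
pvals <- summary(mod1)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
names(pvals)[pvals > alpha]  # only "Ascent" exceeds alpha
```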

d. Calculate R^2 and R^2_adj for the multiple regression model. Check your work against the R output.

R^2 = 1 − RSS/SYY

RSS <- sum(mod1$residuals^2)
SYY <- sum((HighPeaks$Time - mean(HighPeaks$Time))^2)
(R.2 <- 1 - RSS/SYY)

## [1] 0.8400656

Check

#names(summary(mod1))
summary(mod1)$r.squared

## [1] 0.8400656

R^2_adj = 1 − (RSS/df) / (SYY/(n − 1)),  with df = n − 5

n <- nrow(HighPeaks)
df <- n - 5   # residual df: n minus 5 estimated coefficients
(R.2.adj <- 1 - (RSS/df)/(SYY/(n-1)))

## [1] 0.8244622

Check

summary(mod1)$adj.r.squared

## [1] 0.8244622
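The two quantities are linked by the identity R^2_adj = 1 − (1 − R^2)(n − 1)/(n − p − 1), which gives the same number directly from R^2. A sketch with the values above:

```r
# Adjusted R^2 from R^2 without recomputing RSS or SYY
r2 <- 0.8400656
n <- 46; p <- 4
1 - (1 - r2) * (n - 1) / (n - p - 1)  # ~0.8244622
```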

e. Does the multiple regression model fit the data better than the simple linear regression with just Length
as the predictor? Use at least two different statistical assessments to compare your models.

summary(lm(Time~Length, data = HighPeaks))

##
## Call:
## lm(formula = Time ~ Length, data = HighPeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4491 -0.6687 -0.0122 0.5590 4.0034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.04817 0.80371 2.548 0.0144 *
## Length 0.68427 0.06162 11.105 2.39e-14 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.449 on 44 degrees of freedom
## Multiple R-squared: 0.737, Adjusted R-squared: 0.7311
## F-statistic: 123.3 on 1 and 44 DF, p-value: 2.39e-14

summary(mod1)

##
## Call:
## lm(formula = Time ~ Ascent + Elevation + Length + Difficulty,
## data = HighPeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77942 -0.81216 -0.08647 0.68962 3.06736
##

3
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9567864 2.2307630 2.670 0.01082 *
## Ascent 0.0006011 0.0003310 1.816 0.07669 .
## Elevation -0.0016703 0.0005183 -3.223 0.00249 **
## Length 0.4440084 0.0812523 5.465 2.49e-06 ***
## Difficulty 0.8654527 0.2285275 3.787 0.00049 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.171 on 41 degrees of freedom
## Multiple R-squared: 0.8401, Adjusted R-squared: 0.8245
## F-statistic: 53.84 on 4 and 41 DF, p-value: 8.738e-16

Yes, including more predictors improves the model fit. The adjusted R-squared increases from 0.73 to 0.82,
and the residual standard error decreases from 1.449 to 1.171.

f. Find a 95% confidence interval for the regression coefficient for Length in both models (multiple linear
regression and simple linear regression). Discuss any differences.

Confint(mod1)

## Estimate 2.5 % 97.5 %


## (Intercept) 5.9567864249 1.4516691083 10.4619037415
## Ascent 0.0006010924 -0.0000673873 0.0012695721
## Elevation -0.0016703195 -0.0027170917 -0.0006235472
## Length 0.4440083835 0.2799161286 0.6081006385
## Difficulty 0.8654527479 0.4039320165 1.3269734793

Confint(lm(Time~Length, data = HighPeaks))

## Estimate 2.5 % 97.5 %


## (Intercept) 2.0481729 0.4284104 3.6679354
## Length 0.6842739 0.5600910 0.8084569

The 95% CI for Length is [0.56, 0.81] in the simple model and [0.28, 0.61] in the multiple regression model. The
interval is wider in the more complex model, and it is also shifted closer to 0, since the other predictors absorb part of the association between Length and Time.
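The intervals can also be computed by hand as estimate ± t(0.975, df) × SE, using the values from the coefficient table (a sketch; the multiple model has 41 residual df):

```r
# 95% CI for Length in mod1: estimate +/- t quantile times standard error
est <- 0.4440084; se <- 0.0812523          # from the coefficient table above
est + c(-1, 1) * qt(0.975, df = 41) * se   # ~[0.2799, 0.6081], matching Confint
```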

g. We hiked Algonquin Peak, the second tallest mountain in the Adirondacks. We are contemplating
hiking Mt. Marcy, the tallest mountain. Construct a 95% prediction interval for the time it could take
to complete the hike.

# Mt. Marcy's predictor values (row 1 of HighPeaks)
data.new <- subset(HighPeaks, Peak == HighPeaks[1, 1], select = Elevation:Length)

predict(mod1, interval = 'prediction', newdata = data.new, level = 0.95)

## fit lwr upr


## 1 9.832246 7.217136 12.44735

The 95% prediction interval for the time it could take to complete the Mt. Marcy hike is [7.22, 12.45] hours.
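Under the hood, predict() computes fit ± t(0.975, df) · s · sqrt(1 + x0'(X'X)^{-1} x0). A sketch reproducing this with the model matrix (assumes mod1 is in the workspace):

```r
X   <- model.matrix(mod1)
x0  <- X[1, ]                              # Mt. Marcy is observation 1
fit <- sum(coef(mod1) * x0)                # point prediction
s   <- summary(mod1)$sigma                 # residual standard error
me  <- qt(0.975, mod1$df.residual) * s *
  sqrt(1 + t(x0) %*% solve(t(X) %*% X) %*% x0)
c(fit - me, fit + me)                      # ~[7.22, 12.45]
```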

h. Assess the assumptions (linearity, normality, constant variance, independence) of the multiple linear
regression model. When applicable, generate plots and interpret your findings.

par(mfrow = c(1,2))
plot(mod1, c(1,2))

[Figure: residuals-vs-fitted plot and normal Q-Q plot for mod1; observations 24, 33, and 40 are labeled in both panels.]

The plots flag three potential outliers: observations 24, 33, and 40. Otherwise, the constant variance and
linearity assumptions hold based on the residuals-vs-fitted plot, and the normality assumption holds based
on the Q-Q plot. If each hike time was collected from a random, independent hiker, the independence
assumption holds. However, if multiple data points are linked to one hiker, the independence assumption
is violated.
Problem II. More High Peaks... (4 pts)
For this problem, we will focus on understanding the estimates of both predictors, Length and Difficulty, on
Time.

mod2<- lm(Time ~ Length + Difficulty, data = HighPeaks)


summary(mod2)

##
## Call:
## lm(formula = Time ~ Length + Difficulty, data = HighPeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5341 -0.7174 -0.1508 0.5402 3.3743

##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.15677 0.89369 0.175 0.861577
## Length 0.45784 0.08436 5.427 2.47e-06 ***
## Difficulty 0.88969 0.25176 3.534 0.000994 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.291 on 43 degrees of freedom
## Multiple R-squared: 0.7962, Adjusted R-squared: 0.7867
## F-statistic: 84.01 on 2 and 43 DF, p-value: 1.403e-15

a. Construct the added-variable plots for both predictors. Based on the added-variable plots, how much
variability in Time does Length account for after adjusting for Difficulty and similarly, how much
variability in Time does Difficulty account for after adjusting for Length?

avPlots(mod2, id=FALSE)

[Figure: added-variable plots for mod2 — Time | others vs. Length | others (left) and Time | others vs. Difficulty | others (right).]


mod2.1 <- lm(Time~ Length, data = HighPeaks)


mod2.2 <- lm(Time~ Difficulty, data = HighPeaks)

mod2.3 <- lm(Length~ Difficulty, data = HighPeaks)
mod2.4 <- lm(Difficulty~ Length, data = HighPeaks)

summary(lm(mod2.1$residuals ~ mod2.4$residuals))$r.squared

## [1] 0.2250591

After adjusting for Length, Difficulty accounts for 22.5% of the variability in Time.

summary(lm(mod2.2$residuals ~ mod2.3$residuals))$r.squared

## [1] 0.4065331

After adjusting for Difficulty, Length accounts for 40.65% of the variability in Time.

b. Show that the estimated coefficient for Length in the multiple linear regression model is the same as
in the added-variable plot after Difficulty has been added to the model.

summary(mod2)$coefficients

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 0.1567659 0.89368622 0.1754149 8.615771e-01
## Length 0.4578402 0.08435872 5.4273016 2.471843e-06
## Difficulty 0.8896898 0.25176213 3.5338507 9.935134e-04

summary(lm(mod2.2$residuals ~ mod2.3$residuals))$coefficients

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) -2.576235e-16 0.18810475 -1.369574e-15 1.000000e+00
## mod2.3$residuals 4.578402e-01 0.08339459 5.490047e+00 1.883878e-06

The slope estimate is identical in both fits (0.4578402); the standard error, t-value, and p-value differ only
slightly, because the residual regression uses a different number of degrees of freedom.

c. Use either added-variable plot to compute the R2 for the multiple linear regression model with both
predictors. Show computations.

(R.2.1 <-summary(mod2.2)$r.squared)

## [1] 0.6566249

(R.2.2 <-summary(lm(mod2.2$residuals ~ mod2.3$residuals))$r.squared)

## [1] 0.4065331

R.2.1 + (1-R.2.1)*R.2.2

## [1] 0.7962182

Check

summary(mod2)$r.squared

## [1] 0.7962182

Confirmed!
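The check works because of the identity 1 − R^2 = (1 − R^2_Difficulty)(1 − R^2_Length|Difficulty): each stage removes its share of the variance that remains. A numeric sketch with the values above:

```r
# Multiplying the unexplained fractions from each stage recovers the full R^2
1 - (1 - 0.6566249) * (1 - 0.4065331)   # ~0.7962182
```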
Problem III. Still More High Peaks... (2 pts)
This problem is designed to give less guidance than the previous problem and get you to think about how
to solve a statistical problem.
For a multiple linear regression model, R output provides tests of single predictors using t-tests and of all
predictors using an F-test. How would I construct a test of two predictors in a multiple linear regression
at the same time? You may use the HighPeaks data as an example to construct a test to illustrate your
solution. State NH and AH. Hint: think ANOVA table and partial F-test. Use α = 0.05.
Example:

Time = β0 + β1*Ascent + β2*Elevation + β3*Length + β4*Difficulty + e

NH: β1 = β2 = 0, with β0, β3, β4 arbitrary
AH: at least one of β1, β2 ≠ 0, with β0, β3, β4 arbitrary

The partial F-statistic compares the reduced model under NH with the full model under AH:

F = [(RSS.NH − RSS.AH) / (df.NH − df.AH)] / [RSS.AH / df.AH]

Here df.NH − df.AH = 2, since the full model has two extra parameters.

RSS.NH <- sum(mod2$residuals^2)   # reduced model: Time ~ Length + Difficulty
RSS.AH <- sum(mod1$residuals^2)   # full model with all four predictors
(F.stats <- ((RSS.NH - RSS.AH)/2) / (RSS.AH/mod1$df.residual))

## [1] 5.620243

(pf(F.stats, 2, mod1$df.residual,lower.tail = F))

## [1] 0.006965003

Alternatively

anova(mod2, mod1)

## Analysis of Variance Table


##
## Model 1: Time ~ Length + Difficulty
## Model 2: Time ~ Ascent + Elevation + Length + Difficulty
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 43 71.616
## 2 41 56.207 2 15.409 5.6202 0.006965 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Since the p-value = 0.006965 is less than 0.05, we reject the null hypothesis: Ascent and Elevation jointly add explanatory power beyond Length and Difficulty.
