HW2_solution_Fall2024
Rui Zhou
2024-09-21
#install.packages('Stat2Data')
library('Stat2Data')
library('car')   # for Confint() and avPlots(), used below
data("HighPeaks")
a. Construct a multiple regression model for Time using the four quantitative predictors in the HighPeaks
dataset.
head(HighPeaks)
mod1 <- lm(Time ~ Ascent + Elevation + Length + Difficulty, data = HighPeaks)
summary(mod1)
##
## Call:
## lm(formula = Time ~ Ascent + Elevation + Length + Difficulty,
## data = HighPeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77942 -0.81216 -0.08647 0.68962 3.06736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9567864 2.2307630 2.670 0.01082 *
## Ascent 0.0006011 0.0003310 1.816 0.07669 .
## Elevation -0.0016703 0.0005183 -3.223 0.00249 **
## Length 0.4440084 0.0812523 5.465 2.49e-06 ***
## Difficulty 0.8654527 0.2285275 3.787 0.00049 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.171 on 41 degrees of freedom
## Multiple R-squared: 0.8401, Adjusted R-squared: 0.8245
## F-statistic: 53.84 on 4 and 41 DF, p-value: 8.738e-16
b. What is the F-statistic at the bottom of the multiple regression model output in R testing? State your
answer in terms of NH and AH.
The F-statistic tests whether the model as a whole explains any variation in Time:
NH: βAscent = βElevation = βLength = βDifficulty = 0
AH: at least one of these coefficients is nonzero.
Since the p-value (8.738e-16) is far below 0.05, we reject NH.
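Equivalently, this overall F-test can be run as a nested-model comparison against the intercept-only model; a minimal sketch (the name mod.null is introduced here for illustration):
mod.null <- lm(Time ~ 1, data = HighPeaks)  # intercept-only model
anova(mod.null, mod1)                       # reproduces F = 53.84 on 4 and 41 df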
c. Use t-tests to assess the contribution of each regressor to the model. Discuss your findings.
summary(mod1)$coefficients
At the α = 0.05 level, every regressor except Ascent contributes significantly to the model; Ascent's
p-value (0.077) falls just above the cutoff.
d. Calculate the $R^2$ and $R^2_{adj}$ for the multiple regression model. Check your work against R output.
$$R^2 = 1 - \frac{RSS}{SYY}$$
RSS <- sum(mod1$residuals^2)                           # residual sum of squares
SYY <- sum((HighPeaks$Time - mean(HighPeaks$Time))^2)  # total sum of squares
(R.2 <- 1 - RSS/SYY)
## [1] 0.8400656
Check
#names(summary(mod1))
summary(mod1)$r.squared
## [1] 0.8400656
$$R^2_{adj} = 1 - \frac{RSS/df}{SYY/(n-1)}, \qquad df = n - 5$$
n <- nrow(HighPeaks)
df <- n - 5   # residual degrees of freedom: n minus 5 estimated coefficients
(R.2.adj <- 1 - (RSS/df)/(SYY/(n-1)))
## [1] 0.8244622
Check
summary(mod1)$adj.r.squared
## [1] 0.8244622
e. Does the multiple regression model fit the data better than the simple linear regression with just Length
as the predictor? Use at least two different statistical assessments to compare your models.
mod0 <- lm(Time ~ Length, data = HighPeaks)
summary(mod0)
##
## Call:
## lm(formula = Time ~ Length, data = HighPeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4491 -0.6687 -0.0122 0.5590 4.0034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.04817 0.80371 2.548 0.0144 *
## Length 0.68427 0.06162 11.105 2.39e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.449 on 44 degrees of freedom
## Multiple R-squared: 0.737, Adjusted R-squared: 0.7311
## F-statistic: 123.3 on 1 and 44 DF, p-value: 2.39e-14
summary(mod1)
##
## Call:
## lm(formula = Time ~ Ascent + Elevation + Length + Difficulty,
## data = HighPeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77942 -0.81216 -0.08647 0.68962 3.06736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9567864 2.2307630 2.670 0.01082 *
## Ascent 0.0006011 0.0003310 1.816 0.07669 .
## Elevation -0.0016703 0.0005183 -3.223 0.00249 **
## Length 0.4440084 0.0812523 5.465 2.49e-06 ***
## Difficulty 0.8654527 0.2285275 3.787 0.00049 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.171 on 41 degrees of freedom
## Multiple R-squared: 0.8401, Adjusted R-squared: 0.8245
## F-statistic: 53.84 on 4 and 41 DF, p-value: 8.738e-16
Yes, including more predictors improves the model fit. The adjusted R-squared increases from 0.73 to 0.82,
and the residual standard error decreases from 1.449 to 1.171.
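Beyond these two assessments, a nested-model comparison points the same way; a minimal sketch using the mod0 and mod1 fits from above:
anova(mod0, mod1)  # partial F-test: NH is that the three extra coefficients are all 0
AIC(mod0, mod1)    # the lower AIC favors the multiple regression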
f. Find a 95% confidence interval for the regression coefficient for Length in both models (multiple linear
regression and simple linear regression). Discuss any differences.
Confint(mod1)
Confint(mod0)
The 95% CI for Length is [0.56, 0.80] in the simple model and [0.28, 0.61] in the multiple regression. The
interval in the multiple regression is wider and shifted toward 0: once Ascent, Elevation, and Difficulty
are in the model, the Length coefficient measures only Length's partial effect, and correlation among the
predictors inflates its standard error.
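For reference, the interval from the multiple regression can be reproduced by hand from its coefficient table; a minimal sketch:
0.4440084 + c(-1, 1)*qt(0.975, df = 41)*0.0812523  # estimate ± t × SE, gives [0.28, 0.61]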
g. We hiked Algonquin Peak, the second tallest mountain in the Adirondacks. We are contemplating
hiking Mt. Marcy, the tallest mountain. Construct a 95% prediction interval for the time it could take
to complete the hike.
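A sketch of the computation with predict(); the row label "Mt. Marcy" in the Peak column is an assumption and should be checked against the data:
marcy <- HighPeaks[HighPeaks$Peak == "Mt. Marcy", ]  # assumed label; check HighPeaks$Peak
predict(mod1, newdata = marcy, interval = "prediction", level = 0.95)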
The 95% prediction interval for the time to complete the hike is [7.21, 12.44] hours.
h. Assess the assumptions (linearity, normality, constant variance, independence) of the multiple linear
regression model. When applicable, generate plots and interpret your findings.
par(mfrow = c(1,2))
plot(mod1, c(1,2))   # panel 1: Residuals vs Fitted; panel 2: Normal Q-Q
[Figure: Residuals vs Fitted and Normal Q-Q plots for mod1; observations 24, 40, and 33 are flagged.]
The plots flag three potential outliers: observations 24, 40, and 33, which are Seward Mtn., Mt. Emmons,
and Mt. Donaldson. Otherwise, the constant-variance and linearity assumptions hold based on the Residuals
vs Fitted plot, and the normality assumption holds based on the Normal Q-Q plot. Each observation is a
separate peak, so if the recorded times come from random, independent hikes, the independence assumption
holds; however, if multiple data points are linked to one hiker, the independence assumption is violated.
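For completeness, the remaining plot.lm() panels give a second look at constant variance and at influential points; a minimal sketch:
par(mfrow = c(1,2))
plot(mod1, c(3, 5))  # panel 3: Scale-Location; panel 5: Residuals vs Leverage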
Problem II. More High Peaks... (4 pts)
For this problem, we will focus on understanding the estimates of both predictors, Length and Difficulty,
on Time.
mod2 <- lm(Time ~ Length + Difficulty, data = HighPeaks)
summary(mod2)
##
## Call:
## lm(formula = Time ~ Length + Difficulty, data = HighPeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5341 -0.7174 -0.1508 0.5402 3.3743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.15677 0.89369 0.175 0.861577
## Length 0.45784 0.08436 5.427 2.47e-06 ***
## Difficulty 0.88969 0.25176 3.534 0.000994 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.291 on 43 degrees of freedom
## Multiple R-squared: 0.7962, Adjusted R-squared: 0.7867
## F-statistic: 84.01 on 2 and 43 DF, p-value: 1.403e-15
a. Construct the added-variable plots for both predictors. Based on the added-variable plots, how much
variability in Time does Length account for after adjusting for Difficulty and similarly, how much
variability in Time does Difficulty account for after adjusting for Length?
avPlots(mod2, id=FALSE)
[Figure: Added-Variable Plots from avPlots(mod2), one panel per predictor; vertical axis Time | others.]
mod2.1 <- lm(Time ~ Length, data = HighPeaks)       # Time on Length alone
mod2.2 <- lm(Time ~ Difficulty, data = HighPeaks)   # Time on Difficulty alone
mod2.3 <- lm(Length ~ Difficulty, data = HighPeaks)
mod2.4 <- lm(Difficulty ~ Length, data = HighPeaks)
summary(lm(mod2.1$residuals ~ mod2.4$residuals))$r.squared
## [1] 0.2250591
After adjusting for Length, Difficulty accounts for 22.5% of the variability in Time left unexplained by Length.
summary(lm(mod2.2$residuals ~ mod2.3$residuals))$r.squared
## [1] 0.4065331
After adjusting for Difficulty, Length accounts for 40.65% of the variability in Time left unexplained by Difficulty.
b. Show that the estimated coefficient for Length in the multiple linear regression model is the same as
the slope in the added-variable plot for Length after adjusting for Difficulty.
summary(mod2)$coefficients
summary(lm(mod2.2$residuals ~ mod2.3$residuals))$coefficients
The estimate for Length is identical in both fits (0.45784); the standard errors, t-values, and p-values are
nearly identical as well, differing only because the two fits have different residual degrees of freedom (43 vs. 44).
c. Use either added-variable plot to compute the R2 for the multiple linear regression model with both
predictors. Show computations.
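The computation below uses the decomposition

$$R^2 = R^2_D + (1 - R^2_D)\,R^2_{L|D},$$

where $R^2_D$ is the $R^2$ of Time on Difficulty alone and $R^2_{L|D}$ is the $R^2$ from the added-variable plot for Length.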
(R.2.1 <- summary(mod2.2)$r.squared)
## [1] 0.6566249
(R.2.2 <- summary(lm(mod2.2$residuals ~ mod2.3$residuals))$r.squared)
## [1] 0.4065331
R.2.1 + (1 - R.2.1)*R.2.2
## [1] 0.7962182
Check
summary(mod2)$r.squared
## [1] 0.7962182
Confirmed!
Problem III. Still More High Peaks Problem... (2 pts)
This problem is designed to give less guidance than the previous problem and get you to think about how
to solve a statistical problem.
For a multiple linear regression model, R output provides tests of single predictors using t-tests and of
all predictors at once using an F-test. How would I construct a test of two predictors in a multiple linear
regression at the same time? You may use the HighPeaks data as an example to construct a test to illustrate
your solution. State NH and AH. Hint: think ANOVA table and partial F-test. Use α = 0.05.
Example: we test whether Ascent and Elevation jointly contribute to the full model, given that Length and
Difficulty are already included.
NH: βAscent = βElevation = 0
AH: at least one of βAscent, βElevation is nonzero.
library(imager)
##
## Attaching package: 'imager'
im <- load.image("F_test.png")
plot(im)
[Figure: F_test.png — the worked partial F-test example, displayed with plot(im).]
# Partial F-statistic computed from the two fitted models:
(F.stat <- ((sum(mod2$residuals^2) - sum(mod1$residuals^2))/2) /
    (sum(mod1$residuals^2)/mod1$df.residual))
## [1] 5.620243
(p.value <- pf(F.stat, 2, mod1$df.residual, lower.tail = FALSE))
## [1] 0.006965003
Alternatively, anova() carries out the same partial F-test:
anova(mod2, mod1)
Since the p-value = 0.006965 is less than α = 0.05, we reject NH: at least one of Ascent and Elevation
contributes to the model once Length and Difficulty are included.