
Linear Regression Analysis with R

Seyoung Park

Objective
• Linear Regression Analysis with R

• Learn "Estimation", "Inference", and "Shrinkage methods" (Lasso regression, Ridge regression) in the
context of the linear regression model

Install “faraway” package


All datasets used in this topic are from the "faraway" package.
Reference book: "Linear Models with R" by Julian J. Faraway
install.packages("faraway")
library("faraway")

Regression Analysis
• Regression analysis: used for modeling the relationship between explanatory variables (X) and a
dependent variable (Y)

• Regression model: Y = f (X). Here f (·) explains the regression relationship

• Example 1: X: height, Y: weight

• Example 2: X: Math score, Y: total score

Simple Linear Regression


• Simple linear regression model is based on the following linear model: Y = β0 + β1 X

• Simple linear regression analysis: linear regression model with a single explanatory variable (X), i.e.
Y = β0 + β1 X + ϵ. Here ϵ represents a random error, e.g. ϵ ∼ N (0, σ 2 )

• Given n data pairs {(xi, yi), i = 1, · · · , n}, we can write

yi = β0 + β1 xi + ϵi,

where the ϵi are independent random errors.

• β0 and β1 are unknown parameters

• How to interpret β0 and β1? β1 is the effect of X on Y: the expected change in Y when X increases by one unit. β0 is the intercept, the expected value of Y when X = 0.
• Is the obtained β1 statistically significant?
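A minimal sketch with simulated data (the true values β0 = 1 and β1 = 2, the noise level, and the object names below are illustrative choices, not from the lecture): lm() recovers the coefficients, and the summary() table reports the t-test that addresses the significance question above.
# simulate y = 1 + 2*x + noise and fit a simple linear regression
set.seed(1)
n = 50
x = rnorm(n)
y = 1 + 2*x + rnorm(n, sd = 0.5)
fit0 = lm(y ~ x)
summary(fit0)$coefficients # estimate, std. error, t value and p-value for beta0 and beta1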

“stat500” Example
# use "stat500" data
# stat500 is 55 by 4 dimensional data
library("faraway")
data(stat500)
head(stat500)

## midterm final hw total


## 1 24.5 26.0 28.5 79.0
## 2 22.5 24.5 28.2 75.2
## 3 23.5 26.5 28.3 78.3
## 4 23.5 34.5 29.2 87.2
## 5 22.5 30.5 27.3 80.3
## 6 16.0 31.0 27.5 74.5
# We can specify column/row names using
# "colnames" and "rownames"

“stat500” Example (cont.)


# scatter plot
plot(final ~ midterm, data = stat500)
[Scatter plot of final versus midterm for the stat500 data]

“stat500” Example (cont.)


Fitted linear regression model is final = 15.05 + 0.56 midterm
# apply linear regression model
lm(final ~ midterm, data = stat500)

##

## Call:
## lm(formula = final ~ midterm, data = stat500)
##
## Coefficients:
## (Intercept) midterm
## 15.0462 0.5633

“stat500” Example (cont.)


plot(final ~ midterm, data = stat500) # scatter plot
abline(lm(final ~ midterm, data = stat500), col ="red")
[Scatter plot of final versus midterm with the fitted regression line added in red]

“stat500” Example (cont.)


# scatter plot and fitted regression line (version2)
plot(final ~ midterm, data = stat500) # scatter plot
fit = lm(final ~ midterm, data = stat500)
abline(coef(fit), col ="red", lty = 2)

[Scatter plot of final versus midterm with the fitted regression line added as a red dashed line]

“stat500” Example (cont.)


fit = lm(final ~ midterm, data = stat500)
# Check quantities in the "fit"
names(fit)

## [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"


## [9] "xlevels" "call" "terms" "model"
# detailed results
summary(fit)

##
## Call:
## lm(formula = final ~ midterm, data = stat500)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.932 -2.657 0.527 2.984 9.286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.0462 2.4822 6.062 1.44e-07 ***
## midterm 0.5633 0.1190 4.735 1.67e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.192 on 53 degrees of freedom
## Multiple R-squared: 0.2973, Adjusted R-squared: 0.284
## F-statistic: 22.42 on 1 and 53 DF, p-value: 1.675e-05

Multiple Linear Regression
• Multiple linear regression analysis is based on multiple explanatory variables X1, X2, · · · , Xp and

Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ϵ

• Given n data pairs {(xi1, · · · , xip, yi), i = 1, · · · , n}, we can write

yi = β0 + β1 xi1 + β2 xi2 + · · · + βp xip + ϵi,

where the ϵi are independent random errors.

• How to interpret β1 , · · · , βp ? βj is the effect of Xj on Y when the other p − 1 explanatory variables are
fixed.

• Is the obtained linear regression model statistically significant? using F-test

• Are the obtained βj s statistically significant? using t-test or F-test

• How to analyze when there are too many explanatory variables (i.e. p is too large) ?

“stat500” Example
Fitted linear regression model is final = 16.81 + 0.58 midterm − 0.08 hw
# use "stat500" data
# stat500 is 55 by 4 dimensional data
data(stat500)
lm(final ~ midterm + hw, data = stat500)

##
## Call:
## lm(formula = final ~ midterm + hw, data = stat500)
##
## Coefficients:
## (Intercept) midterm hw
## 16.81061 0.58179 -0.08157

“stat500” Example (cont.)


summary(lm(final ~ midterm + hw, data = stat500))

##
## Call:
## lm(formula = final ~ midterm + hw, data = stat500)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0388 -2.5964 0.3714 3.0063 9.3497
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.81061 4.08112 4.119 0.000137 ***
## midterm 0.58179 0.12445 4.675 2.12e-05 ***
## hw -0.08157 0.14916 -0.547 0.586836
## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.22 on 52 degrees of freedom
## Multiple R-squared: 0.3013, Adjusted R-squared: 0.2744
## F-statistic: 11.21 on 2 and 52 DF, p-value: 8.948e-05

How to estimate coefficients? Least squares


• Data: {(xi1 , · · · , xip , yi ), i = 1, · · · , n}

• Find β0, β1, · · · , βp that minimize the residual sum of squares:

minimize over β:  ∑_{i=1}^n ( yi − (β0 + β1 xi1 + · · · + βp xip) )²

In vector form,

minimize over β:  ∑_{i=1}^n ( yi − xi′β )²,

where β = (β0, β1, · · · , βp) and xi = (1, xi1, · · · , xip)′

• The obtained minimizer β̂ = (β̂0 , β̂1 , · · · , β̂p ) is called the "Ordinary Least Squares" estimator (OLS
estimator) for β.

OLS estimator
Let X be an n × (p + 1) matrix and y an n-dimensional vector such that the i-th row of X is xi′ = (1, xi1, · · · , xip) and y = (y1, y2, · · · , yn)′.

Note that the first column of X is all ones!

• Minimizing ∥y − Xβ∥² is equivalent to solving the normal equations

X′X β̂ = X′y  ⇒  β̂ = (X′X)⁻¹X′y

• Fitted (predicted) values are ŷ = Xβ̂ = X(X′X)⁻¹X′y

• H = X(X′X)⁻¹X′ is called the hat matrix (see the numerical check below)
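A minimal sketch (my own check, reusing the stat500 data; the object names H and yhat are mine) that verifies the hat-matrix formula numerically against fitted():
# hat matrix and fitted values for the stat500 model, computed by hand
library("faraway")
data(stat500)
X = as.matrix(cbind(1, stat500[, c("midterm", "hw")]))
y = stat500$final
H = X %*% solve(t(X) %*% X) %*% t(X) # n by n hat matrix
yhat = H %*% y # fitted values ŷ = H y
all.equal(as.numeric(yhat),
          as.numeric(fitted(lm(final ~ midterm + hw, data = stat500)))) # TRUE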

OLS estimator (“stat500” Example)


# still consider stat500
data(stat500)

# select "midterm" and "hw" for X


# select "final" for y
X = stat500[,c(1,3)]; y = stat500[,2]

# the first column of X must be all ones!

X = cbind(rep(1, nrow(X)), X)

# X and y must be matrix/vector for computation


X = as.matrix(X); y = as.matrix(y)

# OLS estimator
OLS = solve(t(X)%*%X, t(X)%*%y)

OLS estimator (“stat500” Example)


# confirm that two results are the same!
OLS

## [,1]
## rep(1, nrow(X)) 16.81060740
## midterm 0.58178957
## hw -0.08156661
lm(final ~ midterm + hw, data = stat500)$coefficients

## (Intercept) midterm hw
## 16.81060740 0.58178957 -0.08156661

Goodness of fit
• It is essential to measure how well the linear regression model fits the data

• R2 ("R-squared") is one popular measure. Sometimes called the "coefficient of determination" or


"percentage of variance explained"

• Total sum of squares:

SST = ∑_{i=1}^n (yi − ȳ)²,  where ȳ = ∑_{i=1}^n yi / n

• Regression sum of squares, also called the explained sum of squares:

SSR = ∑_{i=1}^n (ŷi − ȳ)²,  where ŷi = xi′β̂

• Sum of squares of residuals (related to the unexplained variance):

SSE = ∑_{i=1}^n (yi − ŷi)²

Goodness of fit
• It holds that SSR + SSE = SST. Why? Because the OLS residuals are orthogonal to the fitted values (and sum to zero when an intercept is included), so the cross term in the decomposition vanishes

• R2 (R-squared):

R2 = SSR/SST = 1 − SSE/SST

(a numerical check appears after this list)

• R2 ranges from 0 to 1. A value of 0 indicates that ŷi = ȳ for all i. On the other hand, 1 indicates ŷi = yi for all i, i.e. the linear regression predictions perfectly fit the data (perfectly explain the observed variation)

• A larger R2 indicates a better fit to the data

• For simple linear regression (i.e. p = 1), R2 is equal to r², the square of the sample correlation between X and Y
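A minimal sketch (reusing the stat500 multiple regression fit; the object names are mine) that computes SST, SSR, and SSE by hand, checks the decomposition, and recovers the R2 reported by summary():
# decompose the variation for the stat500 model by hand
fit = lm(final ~ midterm + hw, data = stat500)
y = stat500$final
SST = sum((y - mean(y))^2)
SSE = sum(residuals(fit)^2)
SSR = sum((fitted(fit) - mean(y))^2)
all.equal(SSR + SSE, SST) # TRUE: SSR + SSE = SST
c(R2 = SSR/SST, R2_from_summary = summary(fit)$r.squared) # both about 0.30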

Adjusted R2
• Note that R2, i.e.,

R2 = 1 − SSE/SST,

is based on biased estimates of the variances of the dependent variable and of the errors. Why? Because SSE/n and SST/n do not adjust for the number of estimated parameters (the p + 1 coefficients and the sample mean, respectively)

• Adjusted R2 is based on unbiased variance estimates:

Adjusted R2 = 1 − [SSE/(n − p − 1)] / [SST/(n − 1)] = 1 − (1 − R2)(n − 1)/(n − p − 1),

where n − 1 and n − p − 1 are the degrees of freedom of the estimate of the variance of the dependent
variable and of the estimate of the error variance, respectively

“Galapagos Islands” Example


# use "Galapagos" data
# "gala" is 30 by 7 dimensional data
library("faraway")
data(gala)
head(gala)

## Species Endemics Area Elevation Nearest Scruz Adjacent


## Baltra 58 23 25.09 346 0.6 0.6 1.84
## Bartolome 31 21 1.24 109 0.6 26.3 572.33
## Caldwell 3 3 0.21 114 2.8 58.7 0.78
## Champion 25 9 0.10 46 1.9 47.4 0.18
## Coamano 2 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 18 11 0.34 119 8.0 8.0 1.84

“Galapagos Islands” Example (cont.)


Fitted linear regression model is

Species = 7.08 − 0.02Area + 0.32Elevation − 0.23Scruz − 0.07Adjacent

# Fit a linear model


fit = lm(Species ~ Area+Elevation+Scruz+Adjacent, gala)
fit

##
## Call:
## lm(formula = Species ~ Area + Elevation + Scruz + Adjacent, data = gala)
##
## Coefficients:
## (Intercept) Area Elevation Scruz Adjacent
## 7.07538 -0.02398 0.31957 -0.23936 -0.07485

“Galapagos Islands” Example (cont.)
# compute R-squared
y = gala$Species
R2 = 1 - deviance(fit)/ sum((y-mean(y))^2)
R2

## [1] 0.7658462
# compute Adjusted R-squared
n=30; p=4
R2_adjusted = 1 - (1-R2)*(n-1)/(n-p-1)
R2_adjusted

## [1] 0.7283816
# "Multiple R-squared" is the R-squared value
summary(fit)

##
## Call:
## lm(formula = Species ~ Area + Elevation + Scruz + Adjacent, data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.637 -34.930 -7.864 33.432 182.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.07538 18.74982 0.377 0.709093
## Area -0.02398 0.02151 -1.115 0.275554
## Elevation 0.31957 0.05113 6.250 1.54e-06 ***
## Scruz -0.23936 0.16464 -1.454 0.158434
## Adjacent -0.07485 0.01663 -4.501 0.000136 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.74 on 25 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7284
## F-statistic: 20.44 on 4 and 25 DF, p-value: 1.39e-07

“Galapagos Islands” Example (cont.)


# Fit a linear model by adding one more variable
fit2 = lm(Species ~ Area+Elevation+Scruz+Adjacent+
Nearest, gala)
fit2

##
## Call:
## lm(formula = Species ~ Area + Elevation + Scruz + Adjacent +
## Nearest, data = gala)
##
## Coefficients:
## (Intercept) Area Elevation Scruz Adjacent Nearest
## 7.068221 -0.023938 0.319465 -0.240524 -0.074805 0.009144

“Galapagos Islands” Example (cont.)
# compute R-squared
y = gala$Species
R2 = 1 - deviance(fit2)/ sum((y-mean(y))^2)
R2

## [1] 0.7658469
# compute Adjusted R-squared
n=30; p=5
R2_adjusted = 1 - (1-R2)*(n-1)/(n-p-1)
R2_adjusted

## [1] 0.7170651
# "Multiple R-squared" is the R-squared value
summary(fit2)

##
## Call:
## lm(formula = Species ~ Area + Elevation + Scruz + Adjacent +
## Nearest, data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.369 0.715351
## Area -0.023938 0.022422 -1.068 0.296318
## Elevation 0.319465 0.053663 5.953 3.82e-06 ***
## Scruz -0.240524 0.215402 -1.117 0.275208
## Adjacent -0.074805 0.017700 -4.226 0.000297 ***
## Nearest 0.009144 1.054136 0.009 0.993151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
## F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07

Inference of model
• Are any of the p predictors X1 , · · · , Xp useful when predicting the dependent variable Y ?

• Consider the following hypothesis test:

H0 : β1 = β2 = · · · = βp = 0 versus Ha : βj ̸= 0 for some j

• The corresponding F-statistic is

F = (SSR/p) / (SSE/(n − p − 1)) = MSR/MSE,
where MSR and MSE are "regression mean square" and "mean square error", respectively

Table 1: Analysis of variance table

Source of variation    df           Sum of squares                 Mean of squares          F-statistic
Regression             p            SSR = ∑_{i=1}^n (ŷi − ȳ)²      MSR = SSR/p              F = MSR/MSE
Residual               n − p − 1    SSE = ∑_{i=1}^n (yi − ŷi)²     MSE = SSE/(n − p − 1)
Total                  n − 1        SST = ∑_{i=1}^n (yi − ȳ)²

F-statistic (Analysis of Variance)


• To compute the p-value, we refer to Fp,n−p−1, the F-distribution with degrees of freedom (p, n − p − 1)

• The null hypothesis H0 is rejected if the F value computed from the data is greater than the critical value.
More specifically, given a significance level α such as 0.01, 0.05, or 0.1, check whether F > F_{1−α}(p, n − p − 1),
where F_{1−α}(p, n − p − 1) is the 1 − α quantile of the Fp,n−p−1 distribution

F-statistic (Analysis of Variance)


• Equivalently, the null hypothesis H0 is rejected if the p-value computed from the data is less than the significance
level α. More specifically, check whether P(Fp,n−p−1 > F) < α, where F is the observed F value

• A larger F value gives stronger evidence against the null hypothesis H0

• The F value is related to R2:

F = [R2/(1 − R2)] × [(n − p − 1)/p]

(see the sketch after this list for a numerical check)

• What if we get a very small F statistic? We can try nonlinear transformations of the variables or apply other
models, e.g. xj ← log(xj + 1)
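A quick numerical check of the F–R2 identity above (a minimal sketch; the object names are mine, and the model is the four-predictor gala fit used on the next slides):
# verify F = (R2/(1 - R2)) * (n - p - 1)/p for the gala model
library("faraway")
data(gala)
fit = lm(Species ~ Area + Elevation + Scruz + Adjacent, gala)
n = nrow(gala); p = 4
R2 = summary(fit)$r.squared
c(from_R2 = (R2/(1 - R2)) * (n - p - 1)/p,
  from_summary = unname(summary(fit)$fstatistic["value"])) # both about 20.44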

“Galapagos Islands” Example


fit = lm(Species ~ Area+Elevation+Scruz+Adjacent, gala)
# Check quantities in the "fit"
names(fit)

## [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"


## [9] "xlevels" "call" "terms" "model"
# Compute F-Statistic
n=30; p=4
SST = sum((gala$Species - mean(gala$Species))^2)
SSE = deviance(fit)
MSE = SSE/fit$df.residual #fit$df.residual = n-p-1
SSR = SST - SSE
MSR = SSR/p
Fstat = MSR/MSE

“Galapagos Islands” Example (cont.)


# Compare with critical value when alpha = 0.05
crit_value = qf(0.95, p, n-p-1)
Fstat > crit_value

## [1] TRUE
# Fstat > crit_value! This means we reject H0

# Compute p-value
pvalue = 1-pf(Fstat, p, n-p-1)
pvalue < 0.05

## [1] TRUE
# pvalue < significance level!
# This means we reject H0

Testing single predictor


• Suppose that the previous hypothesis test indicates rejection of H0, i.e. some predictors are useful
for predicting Y under the linear regression model

• Now we are interested in whether one particular explanatory variable Xj (with coefficient βj) can be dropped from the
linear regression model, i.e. consider

H0 : βj = 0 versus Ha : βj ̸= 0

• This can be rewritten as


H0 : M1 versus Ha : M2 ,
where M1 = {x1 , · · · , xj−1 , xj+1 , · · · , xp } and M2 = {x1 , · · · , xp }

Comparing two nested models (general version)


• Consider two linear regression models M1 and M2 satisfying M1 ⊂ M2 , and |M1 | = p1 and |M2 | = p2

• Let SSE1 and SSE2 be the sum of squares of residuals of the models M1 and M2 , respectively:

• Then the F-statistic is

F = [(SSE1 − SSE2)/(p2 − p1)] / [SSE2/(n − p2 − 1)]

• The reference distribution is F_{p2−p1, n−p2−1}

Comparing two nested models (testing single variable version)


• Now revisit the following "testing single variable" problem:

H0 : βj = 0 versus Ha : βj ̸= 0

• |M1 | := p1 = p − 1 and |M2 | := p2 = p

• Recall that SSE1 and SSE2 are the sum of squares of residuals of the models M1 and M2 , respectively:

• Then the F-statistic is

F = (SSE1 − SSE2) / [SSE2/(n − p − 1)]

• The reference distribution is F_{1,n−p−1}, which is equal in distribution to [t(n − p − 1)]², where t(n − p − 1) denotes
Student's t-distribution with n − p − 1 degrees of freedom

“savings” Example
# use "savings" data
# savings is an old economic dataset on 50
# different countries (50 by 5 dimensional data)
library("faraway")
data(savings)
head(savings)

## sr pop15 pop75 dpi ddpi


## Australia 11.43 29.35 2.87 2329.68 2.87
## Austria 12.07 23.32 4.41 1507.99 3.93
## Belgium 13.17 23.80 4.43 2108.47 3.82
## Bolivia 5.75 41.89 1.67 189.13 0.22
## Brazil 12.88 42.19 0.83 728.47 4.56
## Canada 8.79 31.72 2.85 2982.88 2.43
# We can specify column/row names using
# "colnames" and "rownames"

“savings” Example (cont.)


Fitted linear regression model is sr = 28.57 − 0.46pop15 − 1.69pop75 − 0.0003dpi + 0.41ddpi

# apply linear regression model
fit2 = lm(sr ~ ., data = savings)

# Compute F-Statistic
n=nrow(savings); p=ncol(savings)-1
SST2 = sum((savings$sr - mean(savings$sr))^2)
SSE2 = deviance(fit2)
MSE2 = SSE2/fit2$df.residual #fit$df.residual = n-p-1
SSR2 = SST2 - SSE2
MSR2 = SSR2/p
Fstat2 = MSR2/MSE2

“savings” Example (cont.)


Is pop75 significant in the full model?
fit2 = lm(sr ~ ., data = savings)
fit1 = lm(sr ~ pop15 + dpi + ddpi, savings)
SSE1 = deviance(fit1)
Fstat = (SSE1-SSE2)/(SSE2/(n-p-1))
1-pf(Fstat,1,n-p-1) # this is the p-value

## [1] 0.1255298
# compute p-value using t-distribution
2*pt(-1.561,n-p-1)

## [1] 0.1255297

“savings” Example (cont.)


We can perform the hypothesis testing using “anova” function
# compare two nested models
anova(fit1, fit2)

## Analysis of Variance Table


##
## Model 1: sr ~ pop15 + dpi + ddpi
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 685.95
## 2 45 650.71 1 35.236 2.4367 0.1255

Questions
1. Perform the following hypothesis test (compare the model sr ~ pop15 + pop75 with the full model):

H0 : neither dpi nor ddpi is significant (βdpi = βddpi = 0)

versus Ha : the full model with all four explanatory variables

2. Perform the following hypothesis test (compare sr ~ pop15 + pop75 with sr ~ pop15 + pop75 + ddpi):

H0 : neither dpi nor ddpi is significant

versus Ha : the model with pop15, pop75, and ddpi

Questions (cont.)
# [1.] compare two nested models
fit2 = lm(sr ~ ., data = savings)
fit1 = lm(sr ~ pop15 + pop75, savings)
anova(fit1, fit2)

## Analysis of Variance Table


##
## Model 1: sr ~ pop15 + pop75
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 47 726.17
## 2 45 650.71 2 75.455 2.609 0.08471 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Questions (cont.)
# [2.] compare two nested models
fit2 = lm(sr ~ pop15 + pop75 + ddpi, data = savings)
fit1 = lm(sr ~ pop15 + pop75, savings)
anova(fit1, fit2)

## Analysis of Variance Table


##
## Model 1: sr ~ pop15 + pop75
## Model 2: sr ~ pop15 + pop75 + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 47 726.17
## 2 46 652.61 1 73.562 5.1851 0.02748 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Testing a subspace
One can consider the following hypothesis test:

H0 : βpop15 = βpop75 versus Ha : βpop15 ̸= βpop75

fit1 = lm(sr ~ I(pop15 + pop75) + dpi + ddpi, savings)


fit2 = lm(sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
anova(fit1,fit2)

## Analysis of Variance Table


##
## Model 1: sr ~ I(pop15 + pop75) + dpi + ddpi
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 673.63
## 2 45 650.71 1 22.915 1.5847 0.2146

Testing a subspace
One can consider the following hypothesis test:

H0 : βpop75 = 4βpop15 versus Ha : βpop75 ̸= 4βpop15

fit1 = lm(sr ~ I(1*pop15 + 4*pop75) + dpi + ddpi, savings)


anova(fit1,fit2)

## Analysis of Variance Table


##
## Model 1: sr ~ I(1 * pop15 + 4 * pop75) + dpi + ddpi
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 651.33
## 2 45 650.71 1 0.61849 0.0428 0.8371

Testing a subspace
One can consider the following hypothesis test:

H0 : βddpi = 0.5 versus Ha : βddpi ̸= 0.5

fit1 = lm(sr ~ pop15+pop75+dpi+offset(0.5*ddpi), savings)


anova(fit1,fit2)

## Analysis of Variance Table


##
## Model 1: sr ~ pop15 + pop75 + dpi + offset(0.5 * ddpi)
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 653.78
## 2 45 650.71 1 3.0635 0.2119 0.6475

Questions
Q1 . Perform the following hypothesis test:

H0 : βpop75 = 4βpop15 and βddpi = 0.5 versus Ha : full model
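One possible way to set up this test (a minimal sketch of my suggested solution, combining I() and offset() exactly as in the two preceding examples; "fit2" is the full model from above):
# H0: beta_pop75 = 4*beta_pop15 and beta_ddpi = 0.5, tested against the full model
fit2 = lm(sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
fit1 = lm(sr ~ I(pop15 + 4*pop75) + dpi + offset(0.5*ddpi), savings)
anova(fit1, fit2) # 2 degrees of freedom difference: both constraints are tested jointly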

Caution when using multiple constraints in the “lm” function


Consider the following hypothesis test:

H0 : βpop75 = 4βpop15 and βdpi = βpop15 versus Ha : full model

# which linear model is correct between the following two?


fit1 = lm(sr~I(pop15+4*pop75)+I(dpi+pop15)+ddpi,savings)
fit11 = lm(sr~I(pop15+4*pop75+dpi)+ddpi,savings)

Caution when using multiple constraints (cont.)


The model "fit1" is equivalent to the following linear model:

y = (Xpop15 + 4Xpop75)β1 + (Xpop15 + Xdpi)β2 + Xddpi β3 + ϵ
  = Xpop15(β1 + β2) + Xpop75(4β1) + Xdpi β2 + Xddpi β3 + ϵ,

which implies

βpop15 = β1 + β2, βpop75 = 4β1, βdpi = β2, βddpi = β3.

This can be rewritten in the following compact form:

βpop15 = βpop75/4 + βdpi,

which is not equivalent to the null hypothesis H0 : βpop75 = 4βpop15 = 4βdpi. The model "fit11", on the other hand, forces βpop15 = βdpi and βpop75 = 4βpop15, so "fit11" is the correct reduced model for this H0.

Penalization methods (Shrinkage methods)


• Recall that linear regression is based on minimizing the residual sum of squares:

minimize over β:  ∑_{i=1}^n (yi − xi′β)²

• The obtained minimizer β̂ (OLS estimator) is generally a good estimator of β.

• However, (1) when the number of explanatory variables p is much larger than the sample size n, or (2)
when the columns of the design matrix X are highly correlated, the obtained β̂ can be unstable and often less
interpretable

Penalization methods (Shrinkage methods)


• Shrinkage methods put a penalty on the coefficient vector β in the optimization problem so that the
obtained coefficient estimate β̂ cannot be too large!

• Shrinkage methods generally solve


minimize over β:  ∑_{i=1}^n (yi − β0 − ∑_{j=1}^p xij βj)² + Penalty(β),

where Penalty(·) is a function that penalizes β

• In this class, we will learn "Ridge penalty" and "Lasso penalty"

Ridge regression
• Ridge regression is based on limiting ∑_{j=1}^p βj²

• Suppose that X ∈ R^{n×p} is columnwise centered. Ridge regression solves

minimize over β:  ∑_{i=1}^n (yi − ∑_{j=1}^p xij βj)² + λ ∑_{j=1}^p βj²,

where λ > 0 is a user-determined tuning parameter that controls the tradeoff between fit and penalty

• Ridge regression has a closed form solution (see the sketch below):

β̂ridge(λ) = (X′X + λIp×p)⁻¹X′y,

where Ip×p is the p × p identity matrix
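A minimal sketch of the closed-form solution (my own code; note that lm.ridge, used below, centers and scales the variables internally, so its reported coefficients are not numerically identical to this raw-scale computation):
# ridge closed form (X'X + lambda*I)^{-1} X'y on centered predictors and response
library("faraway")
data(savings)
Xc = scale(as.matrix(savings[, c("pop15", "pop75", "dpi", "ddpi")]), center = TRUE, scale = FALSE)
yc = savings$sr - mean(savings$sr)
lambda = 1
beta_ridge = solve(t(Xc) %*% Xc + lambda * diag(ncol(Xc)), t(Xc) %*% yc)
beta_ridge # shrunken coefficients (no intercept: it is absorbed by the centering)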

library("MASS")
# ridge regression
lm.ridge(sr ~ ., data= savings, lambda = 1)

##                        pop15         pop75           dpi          ddpi
##  25.3262192821 -0.3971087502 -1.2595249673 -0.0003267073  0.4068673345

Ridge regression
[Plot: ridge coefficient paths for βpop15, βpop75, βdpi, and βddpi as functions of λ from 0 to 1000]
Appendix
# Ridge regression
lam_set = seq(0, 1000, 1)
result = lm.ridge(sr ~ .,data=savings,lambda=lam_set)
plot(lam_set, result$coef["pop15",],type = "l",
xlim=range(lam_set),ylim=range(result$coef),lwd=2,
xlab=expression(lambda),ylab="Coefficients",cex.lab=2)
lines(lam_set,result$coef["pop75",],col="blue",lty=2,lwd=2)
lines(lam_set,result$coef["dpi",],col="red",lty=3,lwd=2)
lines(lam_set,result$coef["ddpi",],col="green",lty=4,lwd=2)
abline(h = 0, lwd = 2)

# Add legend
legend(300,-1,legend=expression(beta[pop15],beta[pop75],
beta[dpi],beta[ddpi]),
col=c("black", "blue", "red", "green"), lty=1:4, cex=1)

Ridge regression with orthonormal design matrix


• In the case of an orthonormal design matrix X ∈ R^{n×p}, i.e., X′X = Ip×p,

β̂OLS = X′y and β̂ridge = X′y/(1 + λ) = β̂OLS/(1 + λ),

which clearly illustrates the shrinkage effect of Ridge regression (see the numerical check below)

• Ridge regression shrinks the estimates of β toward zero, which introduces bias but reduces the variance of
the estimator. Think about MSE(β̂) = Bias(β̂)² + Variance(β̂)!
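A small numerical check of the orthonormal case (my own sketch; qr.Q is used here only to manufacture a design matrix with orthonormal columns):
# build X with orthonormal columns so that X'X = I, then compare OLS and ridge
set.seed(1)
n = 100; p = 5; lambda = 2
X = qr.Q(qr(matrix(rnorm(n*p), n, p))) # t(X) %*% X is (numerically) the identity
y = X %*% rep(1, p) + rnorm(n, sd = 0.2)
beta_ols = t(X) %*% y # OLS estimate, since X'X = I
beta_ridge = solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
cbind(beta_ols, beta_ridge, beta_ols/(1 + lambda)) # last two columns agree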

Lasso regression
• Lasso regression is based on limiting ∑_{j=1}^p |βj|

• Lasso regression solves

minimize over β:  (1/2) ∑_{i=1}^n (yi − β0 − ∑_{j=1}^p xij βj)² + λ ∑_{j=1}^p |βj|,

where λ > 0 is a user-determined tuning parameter that controls the tradeoff between fit and penalty

• Lasso regression doesn’t have a closed form solution

• Compared to Ridge regression, Lasso regression provides a more sparse solution

Lasso regression with orthonormal design matrix


• In the case of an orthonormal design matrix X ∈ R^{n×p}, i.e., X′X = Ip×p,

β̂j^Lasso = β̂j^OLS − λ   if β̂j^OLS > λ
         = 0              if −λ ≤ β̂j^OLS ≤ λ
         = β̂j^OLS + λ   if β̂j^OLS < −λ,

which clearly illustrates the shrinkage (soft-thresholding) effect of Lasso regression

• Lasso regression sets some of the estimates β̂j exactly to zero, which introduces bias but reduces the variance of the estimator (see the sketch below).
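A minimal sketch of the soft-thresholding rule above (my own helper function, not part of glmnet or faraway):
# soft-thresholding operator: shrink toward zero and set small values exactly to zero
soft_threshold = function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
soft_threshold(c(-3, -0.5, 0.2, 1.5), lambda = 1) # returns -2.0 0.0 0.0 0.5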

Lasso and Ridge regression


library(glmnet)
# Lasso regression
X = as.matrix(savings[,2:5])
y = as.matrix(savings[,1])
lam_set1 = seq(0, 100, 1)
lam_set2 = seq(0, 10, 0.1)

# Lasso and Ridge regression


ridge=glmnet(X,y,family="gaussian",alpha=0,lambda=lam_set1)
lasso=glmnet(X,y,family="gaussian",alpha=1,lambda=lam_set2)

Ridge regression (not efficient)

[Plot: ridge (glmnet, alpha = 0) coefficient paths for βpop15, βpop75, βdpi, and βddpi over λ from 0 to 100]
Appendix (not efficient)
par(mfrow = c(1,2))
# ridge regression
plot(lam_set1, rev(ridge$beta[1,]),type = "l",
xlim=range(lam_set1),ylim=range(ridge$beta),lwd=2,
main = "Ridge",
xlab=expression(lambda),ylab="Coefficients",cex.lab=2)
lines(lam_set1,rev(ridge$beta[2,]),col="blue",lty=2,lwd=2)
lines(lam_set1,rev(ridge$beta[3,]),col="red",lty=3,lwd=2)
lines(lam_set1,rev(ridge$beta[4,]),col="green",lty=4,lwd=2)
abline(h = 0, lwd = 2)
# Add legend
legend(20,-0.5,legend=expression(beta[pop15],beta[pop75],
beta[dpi],beta[ddpi]),
col=c("black", "blue", "red", "green"), lty=1:4, cex=1)

Ridge/Lasso regression (efficient)

[Plots: ridge (left) and Lasso (right) coefficient paths for βpop15, βpop75, βdpi, and βddpi]
Ridge/Lasso regression (efficient)
par(mfrow = c(1,2))
names = c("Ridge", "Lasso");result = list(ridge, lasso)
lam = list(lam_set1,lam_set2)

for (i in 1:2){
plot(lam[[i]], rev(result[[i]]$beta[1,]),type = "l",
xlim=range(lam[[i]]),ylim=range(result[[i]]$beta),
main = names[i],
xlab=expression(lambda),ylab="Coefficients",cex.lab=2)
lines(lam[[i]],rev(result[[i]]$beta[2,]),col="blue",lty=2)
lines(lam[[i]],rev(result[[i]]$beta[3,]),col="red",lty=3)
lines(lam[[i]],rev(result[[i]]$beta[4,]),col="green",lty=4)
abline(h = 0, lwd = 2)
# Add legend
legend("bottomright",legend=expression(beta[pop15],
beta[pop75], beta[dpi],beta[ddpi]),
col=c("black", "blue", "red", "green"), lty=1:4, cex=1)}

Sparsity assumption on the coefficient when p>n


• When p > n, i.e. the high-dimensional case, β is not uniquely defined, which causes an identifiability issue.
Why? Because X has rank at most n < p, so infinitely many coefficient vectors β produce exactly the same fitted values Xβ (see the small demonstration below)

• With a sparsity condition ∥β∥0 ≤ s for some s < n, we could estimate β

• The "sparsity assumption" is essential in the high-dimensional model

Comparisons of regression methods via simulation models


# Generate X and y using normal distribution
set.seed(2000)
n = 100; p = 50
X = matrix(rnorm(n*p), ncol = p)
X = cbind(rep(1,n), X)

# column normalizing
norm_vec = sqrt(apply(X^2, 2, mean))
X = X / matrix(rep(norm_vec, each = n), nrow = n)

# Check the norm of columns


#sqrt(apply(X^2, 2, mean))

# Generate a dependent variables y


beta = runif(p+1)
#beta[6:(p+1)] = 0
y = X %*% beta + 0.1*rnorm(n)

Comparisons of regression methods via simulation models


# Apply Least squares
ls_beta = lm(y ~ X[,2:(p+1)])$coefficients

# Apply Ridge regression


lam_Ridge = seq(0, 20, 0.05)
ridge=glmnet(X[,2:(p+1)],y,family="gaussian", alpha=0,
lambda=lam_Ridge)
ridge_beta = rbind(ridge$a0,ridge$beta)

# Apply Lasso regression


lam_Lasso = seq(0, 0.5, 0.01)

lasso=glmnet(X[,2:(p+1)],y,family="gaussian", alpha=1,
lambda=lam_Lasso)
lasso_beta = rbind(lasso$a0,lasso$beta)

Comparisons of regression methods via simulation models


# Analyzing estimation errors
err_ls = sqrt(sum((ls_beta - beta)^2))

err_ridge = NULL
for (i in 1:length(lam_Ridge)){
err_ridge=c(err_ridge,sqrt(sum((ridge_beta[,i]-beta)^2)))}

err_ridge = rev(err_ridge)

err_lasso = NULL
for (i in 1:length(lam_Lasso)){
err_lasso=c(err_lasso,sqrt(sum((lasso_beta[,i]-beta)^2)))}

err_lasso = rev(err_lasso)

Comparisons of regression methods via simulation models


[Plots: estimation errors of ridge (left) and Lasso (right) versus λ, with the least squares error marked at λ = 0]
Appendix
# Drawing error plots

par(mfrow = c(1,2))
names = c("Errors plot (Ridge)", "Errors plot (Lasso)");
result = list(err_ridge, err_lasso);
lam = list(lam_Ridge,lam_Lasso)
for (i in 1:2){
# ridge regression
plot(lam[[i]], result[[i]],type = "l",
xlim=range(lam[[i]]), ylim=range(result[[i]]),lwd=2,
main = names[i],
xlab=expression(lambda),ylab="Estimation errors")
points(0,err_ls, col = "blue")

# Add legend
legend("bottomright",legend=c(names[[i]],"Least squares"),
col=c("black","blue"),lty=c(1,0),pch = c("","o"),cex=1)}

Make a function
# load the uploaded "generating_plot" function instead

generating_plot = function(X, y, beta, lam_Ridge, lam_Lasso){


# Apply Least squares
ls_beta = lm(y ~ X[,2:(p+1)])$coefficients

# Apply Ridge regression


ridge=glmnet(X[,2:(p+1)],y,family="gaussian", alpha=0, lambda=lam_Ridge)
ridge_beta = rbind(ridge$a0,ridge$beta)

# Apply Lasso regression

lasso=glmnet(X[,2:(p+1)],y,family="gaussian", alpha=1, lambda=lam_Lasso)


lasso_beta = rbind(lasso$a0,lasso$beta)

# Analyzing estimation errors


err_ls = sqrt(sum((ls_beta - beta)^2))

err_ridge = NULL
for (i in 1:length(lam_Ridge)){
err_ridge = c(err_ridge, sqrt(sum((ridge_beta[,i] - beta)^2)))}

err_ridge = rev(err_ridge)

err_lasso = NULL
for (i in 1:length(lam_Lasso)){
err_lasso = c(err_lasso, sqrt(sum((lasso_beta[,i] - beta)^2)))}

err_lasso = rev(err_lasso)

# Drawing error plots

par(mfrow = c(1,2))
names = c("Errors plot (Ridge)", "Errors plot (Lasso)"); result = list(err_ridge, err_lasso);
lam = list(lam_Ridge,lam_Lasso)
for (i in 1:2){
# ridge regression
plot(lam[[i]], result[[i]],type = "l",
xlim=range(lam[[i]]), ylim=range(result[[i]]),lwd=2,
main = names[i],
xlab=expression(lambda),ylab="Estimation errors",cex.lab=2)
#abline(h = 0, lwd = 2)
points(0, err_ls, col = "blue")

# Add legend
legend("bottomright", legend=c(names[[i]], "Least squares"),
col=c("black", "blue"), lty=c(1,0), pch = c("", "o") , cex=1)}

lasso_beta = lasso_beta[ , seq(length(lam_Lasso), 1,-1)]


ridge_beta = ridge_beta[ , seq(length(lam_Ridge), 1,-1)]

return(list(err_lasso=err_lasso, err_ridge=err_ridge,
lasso_beta=lasso_beta, ridge_beta=ridge_beta))}

Comparisons of regression methods (Sparse model case)


# Generate a dependent variables y with a sparse beta!
beta[6:p+1] = 0
y = X %*% beta + 0.1*rnorm(n)

lam_Ridge = seq(0, 20, 0.05)


lam_Lasso = seq(0, 0.5, 0.005)

results = generating_plot(X,y,beta,lam_Ridge,lam_Lasso)

# We can observe that Lasso gives more accurate solutions


# with some penalty parameter lambda when underlying beta
# is sparse!

# results$lasso_beta[,3] gives a sparse solution and


# provides the most accurate solution!

Comparisons of regression methods (Sparse model case)


[Plots: estimation errors of ridge (left) and Lasso (right) versus λ for the sparse model, with the least squares error marked at λ = 0]
Comparisons of regression methods (highly correlated matrix X)
## Now we consider correlated design matrix case
set.seed(3000)
n = 100; p = 70
# Generating covariance (correlation) matrix

sigma = 0.99
A = array(0,c(p,p))
for (i in 1:p){for (j in 1:p){A[i,j] = sigma^(abs(i-j))}}
Z = matrix(rnorm(n*p), ncol = p)
# The generated X has independent rows but dependent
# columns whose population covariance matrix is A
library("expm") # needed for sqrtm()
X = Z %*% sqrtm(A); X = cbind(rep(1,n), X)
beta = c(rep(1,5), rep(0, p-4))
y = X %*% beta + 0.1*rnorm(n)
lam_Ridge = seq(0, 20, 0.05); lam_Lasso = seq(0, 1, 0.005)
results = generating_plot(X, y, beta, lam_Ridge, lam_Lasso)

Comparisons of regression methods (highly correlated matrix X)


[Plots: estimation errors of ridge (left) and Lasso (right) versus λ for the highly correlated design, with the least squares error marked at λ = 0]
Model selection criterion
• Among many obtained linear models (by using different λ values), we could choose the best one based
on some criterion

• "R-squared" and "Adjusted R-squared" are one of such criteria, but not often used in the high-dimensional
model

• More popular criterion are "Akaike information criterion" (AIC) and "Bayesian information criterion"
(BIC)

AIC and BIC
• AIC/BIC consider trade-off between goodness of fit and simplicity of the model.
• AIC/BIC only measure the relative quality of a model, i.e. they do not provide a statistical inference
(i.e. a test) of a model
• Lower AIC/BIC indicates a better model! (see the sketch after this list)
• For the model M, let L be the maximum value of the likelihood function for the model M. Then, up to an additive constant that does not depend on the model,

AIC(M) = −2 log L + 2|M| = n log( ∑_{i=1}^n (yi − xi′β̂)² / n ) + 2|M|

BIC(M) = −2 log L + |M| log n = n log( ∑_{i=1}^n (yi − xi′β̂)² / n ) + |M| log n

• BIC penalizes larger models more aggressively, i.e. BIC prefers smaller models compared to AIC
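A minimal sketch using R's built-in AIC() and BIC() on two of the savings models (note: for lm fits, R keeps the additive constant and counts the error variance as a parameter, so the absolute values differ from the formula above, but the model ranking is what matters):
# lower AIC/BIC indicates the preferred model
fit_full = lm(sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
fit_small = lm(sr ~ pop15 + pop75, data = savings)
c(AIC_full = AIC(fit_full), AIC_small = AIC(fit_small))
c(BIC_full = BIC(fit_full), BIC_small = BIC(fit_small))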

AIC and BIC (cont.)


• The likelihood function is

L = (2πσ²)^{−n/2} exp( −(1/(2σ²)) ∑_{i=1}^n (yi − xi′β)² )

• The log-likelihood function is

log L = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) ∑_{i=1}^n (yi − xi′β)²

• Since σ̂² = (1/n) ∑_{i=1}^n (yi − xi′β̂)², the maximum value of the log-likelihood function is

−(n/2) log(σ̂²) + constant = −(n/2) log( (1/n) ∑_{i=1}^n (yi − xi′β̂)² ) + constant,

which gives the AIC/BIC formula

Selecting model using AIC/BIC via simulation model (p>n and sparse case)
## Apply AIC/BIC to the simulation model

# Generate X and y using normal distribution


set.seed(1000)
n = 100; p = 200
X = matrix(rnorm(n*p), ncol = p)
X = cbind(rep(1,n), X)

# column normalizing
norm_vec = sqrt(apply(X^2, 2, mean))
X = X / matrix(rep(norm_vec, each = n), nrow = n)

# Generate a dependent variables y


beta = runif(p+1)
beta[8:p+1] = 0
y = X %*% beta + 0.1*rnorm(n)

Selecting model using AIC/BIC via simulation model (p>n and sparse case)
# Apply Lasso regression
lam_Lasso = seq(0.05, 5, 0.01)*sqrt(log(p)/n)*0.1

lasso=glmnet(X[,2:(p+1)],y,family="gaussian", alpha=1,
lambda=lam_Lasso)
lasso_beta = rbind(lasso$a0,lasso$beta)

# Analyzing estimation errors


err_lasso = NULL
for (i in 1:length(lam_Lasso)){
err_lasso=c(err_lasso,sqrt(sum((lasso_beta[,i]-beta)^2)))}

err_lasso = rev(err_lasso)

Selecting model using AIC/BIC via simulation model (p>n and sparse case)
[Plot: Lasso estimation error versus λ]

[Output truncated in extraction: the estimated Lasso coefficients at the error-minimizing λ; the first eight coefficients (V1–V8) are clearly nonzero and essentially all of the remaining coefficients are zero or negligible]

Selecting model using AIC/BIC via simulation model (p>n and sparse case)
# Drawing error plots
names = "Errors plot (Lasso)" ; result = err_lasso;
lam = lam_Lasso

plot(lam, result,type = "l",


xlim=range(lam), ylim=range(result),lwd=2,
main = names,
xlab=expression(lambda),ylab="Estimation errors")

ind = which(err_lasso==min(err_lasso))
round(lasso_beta[,ncol(lasso_beta)-ind+1],3)

Selecting model using AIC/BIC via simulation model (p>n and sparse case)
[Plots: AIC (left) and BIC (right) over the sequence of Lasso fits]

[Output truncated in extraction: the Lasso coefficients selected by BIC; the first eight coefficients (V1–V8) are clearly nonzero and essentially all of the remaining coefficients are exactly zero]

Appendix
# Computing AIC/BIC
AIC = NULL; BIC = NULL; BIC_fit = NULL; BIC_pen = NULL
for (i in 1:length(lam_Lasso)){
AIC=c(AIC, n*log(sum((y - X%*% lasso_beta[,i])^2)/n)+
2*sum(abs(lasso_beta[2:(p+1),i])>0.00001))
BIC=c(BIC, n*log(sum((y - X%*% lasso_beta[,i])^2)/n)+
log(n)*sum(abs(lasso_beta[2:(p+1),i])>0.00001))
BIC_fit=c(BIC_fit,n*log(sum((y-X%*%lasso_beta[,i])^2)/n))
BIC_pen=c(BIC_pen, log(n)*
sum(abs(lasso_beta[2:(p+1),i])>0.00001))}
AIC = rev(AIC); BIC = rev(BIC)
BIC_fit = rev(BIC_fit); BIC_pen = rev(BIC_pen)
par(mfrow = c(1,2))
plot(AIC); plot(BIC)
ind_B = which(BIC==min(BIC))
round(lasso_beta[,ncol(lasso_beta)-ind_B+1],5)

Selecting model using AIC/BIC via simulation model (p>n and sparse case)
[Plot: BIC and its two components, the fit term −2 log L and the penalty term, versus λ]
Appendix
# Drawing BIC plot
names = "BIC plot" ;
lam = lam_Lasso

par(mfrow=c(1,1))
plot(lam, BIC_fit,type = "l",
xlim=range(lam),lwd=2,main=names,ylim=
c(min(BIC_fit), max(BIC_pen)),
xlab=expression(lambda),ylab="Value",cex.lab=2)

lines(lam,BIC_pen, col = "blue", lty = 2)


lines(lam,BIC,col="red",lty=3, lwd=2)
# Add legend
legend("bottomright",legend=c("-2L (fit)",
"Penalty(simplicity)", "BIC"),
col=c("black", "blue", "red"), lty=1:3, cex=1)

optim(par, fn, gr = NULL,...,


method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B",
"SANN", "Brent"),
lower = -Inf, upper = Inf,
control = list(), hessian = FALSE)
