
Linear Regression Analysis with R

Seyoung Park

Objective
• Linear Regression Analysis with R

• Learn "Estimation", "Inference", and "Shrinkage methods" (Lasso regression, Ridge regression) in the
context of the linear regression model

Install “faraway” package


All datasets used in this topic are from the "faraway" package.
Reference book: "Linear Models with R" by Julian J. Faraway
install.packages("faraway")
library("faraway")

Regression Analysis
• Regression analysis: used for modeling the relationship between explanatory variables (X) and a
dependent variable (Y)

• Regression model: Y = f (X). Here f (·) explains the regression relationship

• Example 1: X: height, Y: weight

• Example 2: X: Math score, Y: total score

Simple Linear Regression


• Simple linear regression model is based on the following linear model: Y = β0 + β1 X

• Simple linear regression analysis: linear regression model with a single explanatory variable (X), i.e.
Y = β0 + β1 X + ϵ. Here ϵ represents a random error, e.g. ϵ ∼ N (0, σ 2 )

• Given n data pairs {(xi, yi), i = 1, · · · , n}, we can write

yi = β0 + β1 xi + ϵi,

where the ϵi are independent random errors.

• β0 and β1 are unknown parameters

• How to interpret β0 and β1? β1 is the effect of X on Y: the expected change in Y when X increases by one unit. β0 is the intercept, the expected value of Y when X = 0.
• Is the obtained β1 statistically significant?
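A minimal sketch with simulated data (the true values β0 = 1 and β1 = 2, the noise level, and the object names below are illustrative choices, not from the lecture): lm() recovers the coefficients, and the summary() table reports the t-test that addresses the significance question above.
# simulate y = 1 + 2*x + noise and fit a simple linear regression
set.seed(1)
n = 50
x = rnorm(n)
y = 1 + 2*x + rnorm(n, sd = 0.5)
fit0 = lm(y ~ x)
summary(fit0)$coefficients # estimate, std. error, t value and p-value for beta0 and beta1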

“stat500” Example
# use "stat500" data
# stat500 is 55 by 4 dimensional data
library("faraway")
data(stat500)
head(stat500)

## midterm final hw total


## 1 24.5 26.0 28.5 79.0
## 2 22.5 24.5 28.2 75.2
## 3 23.5 26.5 28.3 78.3
## 4 23.5 34.5 29.2 87.2
## 5 22.5 30.5 27.3 80.3
## 6 16.0 31.0 27.5 74.5
# We can specify column/row names using
# "colnames" and "rownames"

“stat500” Example (cont.)


# scatter plot
plot(final ~ midterm, data = stat500)
[Scatter plot of final versus midterm for the stat500 data]

“stat500” Example (cont.)


Fitted linear regression model is final = 15.05 + 0.56 midterm
# apply linear regression model
lm(final ~ midterm, data = stat500)

##

## Call:
## lm(formula = final ~ midterm, data = stat500)
##
## Coefficients:
## (Intercept) midterm
## 15.0462 0.5633

“stat500” Example (cont.)


plot(final ~ midterm, data = stat500) # scatter plot
abline(lm(final ~ midterm, data = stat500), col ="red")
[Scatter plot of final versus midterm with the fitted regression line added in red]

“stat500” Example (cont.)


# scatter plot and fitted regression line (version2)
plot(final ~ midterm, data = stat500) # scatter plot
fit = lm(final ~ midterm, data = stat500)
abline(coef(fit), col ="red", lty = 2)

[Scatter plot of final versus midterm with the fitted regression line added as a red dashed line]

“stat500” Example (cont.)


fit = lm(final ~ midterm, data = stat500)
# Check quantities in the "fit"
names(fit)

## [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"


## [9] "xlevels" "call" "terms" "model"
# detailed results
summary(fit)

##
## Call:
## lm(formula = final ~ midterm, data = stat500)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.932 -2.657 0.527 2.984 9.286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.0462 2.4822 6.062 1.44e-07 ***
## midterm 0.5633 0.1190 4.735 1.67e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.192 on 53 degrees of freedom
## Multiple R-squared: 0.2973, Adjusted R-squared: 0.284
## F-statistic: 22.42 on 1 and 53 DF, p-value: 1.675e-05

Multiple Linear Regression
• Multiple linear regression analysis is based on multiple explanatory variables X1, X2, · · · , Xp and

Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ϵ

• Given n data pairs {(xi1, · · · , xip, yi), i = 1, · · · , n}, we can write

yi = β0 + β1 xi1 + β2 xi2 + · · · + βp xip + ϵi,

where the ϵi are independent random errors.

• How to interpret β1 , · · · , βp ? βj is the effect of Xj on Y when the other p − 1 explanatory variables are
fixed.

• Is the obtained linear regression model statistically significant? using F-test

• Are the obtained βj s statistically significant? using t-test or F-test

• How to analyze when there are too many explanatory variables (i.e. p is too large) ?

“stat500” Example
Fitted linear regression model is final = 16.81 + 0.58 midterm − 0.08 hw
# use "stat500" data
# stat500 is 55 by 4 dimensional data
data(stat500)
lm(final ~ midterm + hw, data = stat500)

##
## Call:
## lm(formula = final ~ midterm + hw, data = stat500)
##
## Coefficients:
## (Intercept) midterm hw
## 16.81061 0.58179 -0.08157

“stat500” Example (cont.)


summary(lm(final ~ midterm + hw, data = stat500))

##
## Call:
## lm(formula = final ~ midterm + hw, data = stat500)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0388 -2.5964 0.3714 3.0063 9.3497
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.81061 4.08112 4.119 0.000137 ***
## midterm 0.58179 0.12445 4.675 2.12e-05 ***
## hw -0.08157 0.14916 -0.547 0.586836
## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.22 on 52 degrees of freedom
## Multiple R-squared: 0.3013, Adjusted R-squared: 0.2744
## F-statistic: 11.21 on 2 and 52 DF, p-value: 8.948e-05

How to estimate coefficients? Least squares


• Data: {(xi1 , · · · , xip , yi ), i = 1, · · · , n}

• Find β0, β1, · · · , βp that minimize the residual sum of squares:

minimize over β:  ∑_{i=1}^n ( yi − (β0 + β1 xi1 + · · · + βp xip) )²

In vector form,

minimize over β:  ∑_{i=1}^n ( yi − xi′β )²,

where β = (β0, β1, · · · , βp) and xi = (1, xi1, · · · , xip)′

• The obtained minimizer β̂ = (β̂0 , β̂1 , · · · , β̂p ) is called the "Ordinary Least Squares" estimator (OLS
estimator) for β.

OLS estimator
Let X be an n × (p + 1) matrix and y an n-dimensional vector such that the i-th row of X is xi′ = (1, xi1, · · · , xip) and y = (y1, y2, · · · , yn)′.

Note that the first column of X is all ones!

• Minimizing ∥y − Xβ∥² is equivalent to solving the normal equations

X′X β̂ = X′y  ⇒  β̂ = (X′X)⁻¹X′y

• Fitted (predicted) values are ŷ = Xβ̂ = X(X′X)⁻¹X′y

• H = X(X′X)⁻¹X′ is called the hat matrix (see the numerical check below)
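A minimal sketch (my own check, reusing the stat500 data; the object names H and yhat are mine) that verifies the hat-matrix formula numerically against fitted():
# hat matrix and fitted values for the stat500 model, computed by hand
library("faraway")
data(stat500)
X = as.matrix(cbind(1, stat500[, c("midterm", "hw")]))
y = stat500$final
H = X %*% solve(t(X) %*% X) %*% t(X) # n by n hat matrix
yhat = H %*% y # fitted values ŷ = H y
all.equal(as.numeric(yhat),
          as.numeric(fitted(lm(final ~ midterm + hw, data = stat500)))) # TRUE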

OLS estimator (“stat500” Example)


# still consider stat500
data(stat500)

# select "midterm" and "hw" for X


# select "final" for y
X = stat500[,c(1,3)]; y = stat500[,2]

# the first column of X must be all ones!

X = cbind(rep(1, nrow(X)), X)

# X and y must be matrix/vector for computation


X = as.matrix(X); y = as.matrix(y)

# OLS estimator
OLS = solve(t(X)%*%X, t(X)%*%y)

OLS estimator (“stat500” Example)


# confirm that two results are the same!
OLS

## [,1]
## rep(1, nrow(X)) 16.81060740
## midterm 0.58178957
## hw -0.08156661
lm(final ~ midterm + hw, data = stat500)$coefficients

## (Intercept) midterm hw
## 16.81060740 0.58178957 -0.08156661

Goodness of fit
• It is essential to measure how well the linear regression model fits the data

• R2 ("R-squared") is one popular measure. Sometimes called the "coefficient of determination" or


"percentage of variance explained"

• Total sum of squares:

SST = ∑_{i=1}^n (yi − ȳ)²,  where ȳ = ∑_{i=1}^n yi / n

• Regression sum of squares, also called the explained sum of squares:

SSR = ∑_{i=1}^n (ŷi − ȳ)²,  where ŷi = xi′β̂

• Sum of squares of residuals (related to the unexplained variance):

SSE = ∑_{i=1}^n (yi − ŷi)²

Goodness of fit
• It holds that SSR + SSE = SST. Why? Because the OLS residuals are orthogonal to the fitted values (and sum to zero when an intercept is included), so the cross term in the decomposition vanishes

• R2 (R-squared):

R2 = SSR/SST = 1 − SSE/SST

(a numerical check appears after this list)

• R2 ranges from 0 to 1. A value of 0 indicates that ŷi = ȳ for all i. On the other hand, 1 indicates ŷi = yi for all i, i.e. the linear regression predictions perfectly fit the data (perfectly explain the observed variation)

• A larger R2 indicates a better fit to the data

• For simple linear regression (i.e. p = 1), R2 is equal to r², the square of the sample correlation between X and Y
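A minimal sketch (reusing the stat500 multiple regression fit; the object names are mine) that computes SST, SSR, and SSE by hand, checks the decomposition, and recovers the R2 reported by summary():
# decompose the variation for the stat500 model by hand
fit = lm(final ~ midterm + hw, data = stat500)
y = stat500$final
SST = sum((y - mean(y))^2)
SSE = sum(residuals(fit)^2)
SSR = sum((fitted(fit) - mean(y))^2)
all.equal(SSR + SSE, SST) # TRUE: SSR + SSE = SST
c(R2 = SSR/SST, R2_from_summary = summary(fit)$r.squared) # both about 0.30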

Adjusted R2
• Note that R2, i.e.,

R2 = 1 − SSE/SST,

is based on biased estimates of the variances of the dependent variable and of the errors. Why? Because SSE/n and SST/n do not adjust for the number of estimated parameters (the p + 1 coefficients and the sample mean, respectively)

• Adjusted R2 is based on unbiased variance estimates:

Adjusted R2 = 1 − [SSE/(n − p − 1)] / [SST/(n − 1)] = 1 − (1 − R2)(n − 1)/(n − p − 1),

where n − 1 and n − p − 1 are the degrees of freedom of the estimate of the variance of the dependent
variable and of the estimate of the error variance, respectively

“Galapagos Islands” Example


# use "Galapagos" data
# "gala" is 30 by 7 dimensional data
library("faraway")
data(gala)
head(gala)

## Species Endemics Area Elevation Nearest Scruz Adjacent


## Baltra 58 23 25.09 346 0.6 0.6 1.84
## Bartolome 31 21 1.24 109 0.6 26.3 572.33
## Caldwell 3 3 0.21 114 2.8 58.7 0.78
## Champion 25 9 0.10 46 1.9 47.4 0.18
## Coamano 2 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 18 11 0.34 119 8.0 8.0 1.84

“Galapagos Islands” Example (cont.)


Fitted linear regression model is

Species = 7.08 − 0.02Area + 0.32Elevation − 0.23Scruz − 0.07Adjacent

# Fit a linear model


fit = lm(Species ~ Area+Elevation+Scruz+Adjacent, gala)
fit

##
## Call:
## lm(formula = Species ~ Area + Elevation + Scruz + Adjacent, data = gala)
##
## Coefficients:
## (Intercept) Area Elevation Scruz Adjacent
## 7.07538 -0.02398 0.31957 -0.23936 -0.07485

“Galapagos Islands” Example (cont.)
# compute R-squared
y = gala$Species
R2 = 1 - deviance(fit)/ sum((y-mean(y))^2)
R2

## [1] 0.7658462
# compute Adjusted R-squared
n=30; p=4
R2_adjusted = 1 - (1-R2)*(n-1)/(n-p-1)
R2_adjusted

## [1] 0.7283816
# "Multiple R-squared" is the R-squared value
summary(fit)

##
## Call:
## lm(formula = Species ~ Area + Elevation + Scruz + Adjacent, data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.637 -34.930 -7.864 33.432 182.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.07538 18.74982 0.377 0.709093
## Area -0.02398 0.02151 -1.115 0.275554
## Elevation 0.31957 0.05113 6.250 1.54e-06 ***
## Scruz -0.23936 0.16464 -1.454 0.158434
## Adjacent -0.07485 0.01663 -4.501 0.000136 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.74 on 25 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7284
## F-statistic: 20.44 on 4 and 25 DF, p-value: 1.39e-07

“Galapagos Islands” Example (cont.)


# Fit a linear model by adding one more variable
fit2 = lm(Species ~ Area+Elevation+Scruz+Adjacent+
Nearest, gala)
fit2

##
## Call:
## lm(formula = Species ~ Area + Elevation + Scruz + Adjacent +
## Nearest, data = gala)
##
## Coefficients:
## (Intercept) Area Elevation Scruz Adjacent Nearest
## 7.068221 -0.023938 0.319465 -0.240524 -0.074805 0.009144

“Galapagos Islands” Example (cont.)
# compute R-squared
y = gala$Species
R2 = 1 - deviance(fit2)/ sum((y-mean(y))^2)
R2

## [1] 0.7658469
# compute Adjusted R-squared
n=30; p=5
R2_adjusted = 1 - (1-R2)*(n-1)/(n-p-1)
R2_adjusted

## [1] 0.7170651
# "Multiple R-squared" is the R-squared value
summary(fit2)

##
## Call:
## lm(formula = Species ~ Area + Elevation + Scruz + Adjacent +
## Nearest, data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.369 0.715351
## Area -0.023938 0.022422 -1.068 0.296318
## Elevation 0.319465 0.053663 5.953 3.82e-06 ***
## Scruz -0.240524 0.215402 -1.117 0.275208
## Adjacent -0.074805 0.017700 -4.226 0.000297 ***
## Nearest 0.009144 1.054136 0.009 0.993151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
## F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07

Inference of model
• Are any of the p predictors X1 , · · · , Xp useful when predicting the dependent variable Y ?

• Consider the following hypothesis test:

H0 : β1 = β2 = · · · = βp = 0 versus Ha : βj ̸= 0 for some j

• The corresponding F-statistic is

F = (SSR/p) / (SSE/(n − p − 1)) = MSR/MSE,
where MSR and MSE are "regression mean square" and "mean square error", respectively

Table 1: Analysis of variance table

Source of variation    df           Sum of squares                 Mean of squares          F-statistic
Regression             p            SSR = ∑_{i=1}^n (ŷi − ȳ)²      MSR = SSR/p              F = MSR/MSE
Residual               n − p − 1    SSE = ∑_{i=1}^n (yi − ŷi)²     MSE = SSE/(n − p − 1)
Total                  n − 1        SST = ∑_{i=1}^n (yi − ȳ)²

F-statistic (Analysis of Variance)


• To compute the p-value, we refer to Fp,n−p−1, the F-distribution with degrees of freedom (p, n − p − 1)

• The null hypothesis H0 is rejected if the F value computed from the data is greater than the critical value.
More specifically, given a significance level α such as 0.01, 0.05, or 0.1, check whether F > F_{1−α}(p, n − p − 1),
where F_{1−α}(p, n − p − 1) is the 1 − α quantile of the Fp,n−p−1 distribution

F-statistic (Analysis of Variance)


• Equivalently, the null hypothesis H0 is rejected if the p-value computed from the data is less than the significance
level α. More specifically, check whether P(Fp,n−p−1 > F) < α, where F is the observed F value

• A larger F value gives stronger evidence against the null hypothesis H0

• The F value is related to R2:

F = [R2/(1 − R2)] × [(n − p − 1)/p]

(see the sketch after this list for a numerical check)

• What if we get a very small F statistic? We can try nonlinear transformations of the variables or apply other
models, e.g. xj ← log(xj + 1)
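A quick numerical check of the F–R2 identity above (a minimal sketch; the object names are mine, and the model is the four-predictor gala fit used on the next slides):
# verify F = (R2/(1 - R2)) * (n - p - 1)/p for the gala model
library("faraway")
data(gala)
fit = lm(Species ~ Area + Elevation + Scruz + Adjacent, gala)
n = nrow(gala); p = 4
R2 = summary(fit)$r.squared
c(from_R2 = (R2/(1 - R2)) * (n - p - 1)/p,
  from_summary = unname(summary(fit)$fstatistic["value"])) # both about 20.44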

“Galapagos Islands” Example


fit = lm(Species ~ Area+Elevation+Scruz+Adjacent, gala)
# Check quantities in the "fit"
names(fit)

## [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"


## [9] "xlevels" "call" "terms" "model"
# Compute F-Statistic
n=30; p=4
SST = sum((gala$Species - mean(gala$Species))^2)
SSE = deviance(fit)
MSE = SSE/fit$df.residual #fit$df.residual = n-p-1
SSR = SST - SSE
MSR = SSR/p
Fstat = MSR/MSE

“Galapagos Islands” Example (cont.)


# Compare with critical value when alpha = 0.05
crit_value = qf(0.95, p, n-p-1)
Fstat > crit_value

## [1] TRUE
# Fstat > crit_value! This means we reject H0

# Compute p-value
pvalue = 1-pf(Fstat, p, n-p-1)
pvalue < 0.05

## [1] TRUE
# pvalue < significance level!
# This means we reject H0

Testing single predictor


• Suppose that the previous hypothesis test indicates rejection of H0, i.e. some predictors are useful
for predicting Y under the linear regression model

• Now we are interested in whether one particular explanatory variable Xj (with coefficient βj) can be dropped from the
linear regression model, i.e. consider

H0 : βj = 0 versus Ha : βj ̸= 0

• This can be rewritten as


H0 : M1 versus Ha : M2 ,
where M1 = {x1 , · · · , xj−1 , xj+1 , · · · , xp } and M2 = {x1 , · · · , xp }

Comparing two nested models (general version)


• Consider two linear regression models M1 and M2 satisfying M1 ⊂ M2 , and |M1 | = p1 and |M2 | = p2

• Let SSE1 and SSE2 be the sum of squares of residuals of the models M1 and M2 , respectively:

• Then the F-statistic is

F = [(SSE1 − SSE2)/(p2 − p1)] / [SSE2/(n − p2 − 1)]

• The reference distribution is F_{p2−p1, n−p2−1}

Comparing two nested models (testing single variable version)


• Now revisit the following "testing single variable" problem:

H0 : βj = 0 versus Ha : βj ̸= 0

• |M1 | := p1 = p − 1 and |M2 | := p2 = p

• Recall that SSE1 and SSE2 are the sum of squares of residuals of the models M1 and M2 , respectively:

• Then the F-statistic is

F = (SSE1 − SSE2) / [SSE2/(n − p − 1)]

• The reference distribution is F_{1,n−p−1}, which is equal in distribution to [t(n − p − 1)]², where t(n − p − 1) denotes
Student's t-distribution with n − p − 1 degrees of freedom

“savings” Example
# use "savings" data
# savings is an old economic dataset on 50
# different countries (50 by 5 dimensional data)
library("faraway")
data(savings)
head(savings)

## sr pop15 pop75 dpi ddpi


## Australia 11.43 29.35 2.87 2329.68 2.87
## Austria 12.07 23.32 4.41 1507.99 3.93
## Belgium 13.17 23.80 4.43 2108.47 3.82
## Bolivia 5.75 41.89 1.67 189.13 0.22
## Brazil 12.88 42.19 0.83 728.47 4.56
## Canada 8.79 31.72 2.85 2982.88 2.43
# We can specify column/row names using
# "colnames" and "rownames"

“savings” Example (cont.)


Fitted linear regression model is sr = 28.57 − 0.46pop15 − 1.69pop75 − 0.0003dpi + 0.41ddpi

# apply linear regression model
fit2 = lm(sr ~ ., data = savings)

# Compute F-Statistic
n=nrow(savings); p=ncol(savings)-1
SST2 = sum((savings$sr - mean(savings$sr))^2)
SSE2 = deviance(fit2)
MSE2 = SSE2/fit2$df.residual #fit$df.residual = n-p-1
SSR2 = SST2 - SSE2
MSR2 = SSR2/p
Fstat2 = MSR2/MSE2

“savings” Example (cont.)


Is pop75 significant in the full model?
fit2 = lm(sr ~ ., data = savings)
fit1 = lm(sr ~ pop15 + dpi + ddpi, savings)
SSE1 = deviance(fit1)
Fstat = (SSE1-SSE2)/(SSE2/(n-p-1))
1-pf(Fstat,1,n-p-1) # this is the p-value

## [1] 0.1255298
# compute p-value using t-distribution
2*pt(-1.561,n-p-1)

## [1] 0.1255297

“savings” Example (cont.)


We can perform the hypothesis testing using “anova” function
# compare two nested models
anova(fit1, fit2)

## Analysis of Variance Table


##
## Model 1: sr ~ pop15 + dpi + ddpi
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 685.95
## 2 45 650.71 1 35.236 2.4367 0.1255

Questions
1. Perform the following hypothesis test (compare the model sr ~ pop15 + pop75 with the full model):

H0 : neither dpi nor ddpi is significant (βdpi = βddpi = 0)

versus Ha : the full model with all four explanatory variables

2. Perform the following hypothesis test (compare sr ~ pop15 + pop75 with sr ~ pop15 + pop75 + ddpi):

H0 : neither dpi nor ddpi is significant

versus Ha : the model with pop15, pop75, and ddpi

Questions (cont.)
# [1.] compare two nested models
fit2 = lm(sr ~ ., data = savings)
fit1 = lm(sr ~ pop15 + pop75, savings)
anova(fit1, fit2)

## Analysis of Variance Table


##
## Model 1: sr ~ pop15 + pop75
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 47 726.17
## 2 45 650.71 2 75.455 2.609 0.08471 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Questions (cont.)
# [2.] compare two nested models
fit2 = lm(sr ~ pop15 + pop75 + ddpi, data = savings)
fit1 = lm(sr ~ pop15 + pop75, savings)
anova(fit1, fit2)

## Analysis of Variance Table


##
## Model 1: sr ~ pop15 + pop75
## Model 2: sr ~ pop15 + pop75 + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 47 726.17
## 2 46 652.61 1 73.562 5.1851 0.02748 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Testing a subspace
One can consider the following hypothesis test:

H0 : βpop15 = βpop75 versus Ha : βpop15 ̸= βpop75

fit1 = lm(sr ~ I(pop15 + pop75) + dpi + ddpi, savings)


fit2 = lm(sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
anova(fit1,fit2)

## Analysis of Variance Table


##
## Model 1: sr ~ I(pop15 + pop75) + dpi + ddpi
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 673.63
## 2 45 650.71 1 22.915 1.5847 0.2146

Testing a subspace
One can consider the following hypothesis test:

H0 : βpop75 = 4βpop15 versus Ha : βpop75 ̸= 4βpop15

fit1 = lm(sr ~ I(1*pop15 + 4*pop75) + dpi + ddpi, savings)


anova(fit1,fit2)

## Analysis of Variance Table


##
## Model 1: sr ~ I(1 * pop15 + 4 * pop75) + dpi + ddpi
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 651.33
## 2 45 650.71 1 0.61849 0.0428 0.8371

Testing a subspace
One can consider the following hypothesis test:

H0 : βddpi = 0.5 versus Ha : βddpi ̸= 0.5

fit1 = lm(sr ~ pop15+pop75+dpi+offset(0.5*ddpi), savings)


anova(fit1,fit2)

## Analysis of Variance Table


##
## Model 1: sr ~ pop15 + pop75 + dpi + offset(0.5 * ddpi)
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 653.78
## 2 45 650.71 1 3.0635 0.2119 0.6475

Questions
Q1 . Perform the following hypothesis test:

H0 : βpop75 = 4βpop15 and βddpi = 0.5 versus Ha : full model
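One possible way to set up this test (a minimal sketch of my suggested solution, combining I() and offset() exactly as in the two preceding examples; "fit2" is the full model from above):
# H0: beta_pop75 = 4*beta_pop15 and beta_ddpi = 0.5, tested against the full model
fit2 = lm(sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
fit1 = lm(sr ~ I(pop15 + 4*pop75) + dpi + offset(0.5*ddpi), savings)
anova(fit1, fit2) # 2 degrees of freedom difference: both constraints are tested jointly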

Caution when using multiple constraints in the “lm” function


Consider the following hypothesis test:

H0 : βpop75 = 4βpop15 and βdpi = βpop15 versus Ha : full model

# which linear model is correct between the following two?


fit1 = lm(sr~I(pop15+4*pop75)+I(dpi+pop15)+ddpi,savings)
fit11 = lm(sr~I(pop15+4*pop75+dpi)+ddpi,savings)

Caution when using multiple constraints (cont.)


The model "fit1" is equivalent to the following linear model:

y = (Xpop15 + 4Xpop75)β1 + (Xpop15 + Xdpi)β2 + Xddpi β3 + ϵ
  = Xpop15(β1 + β2) + Xpop75(4β1) + Xdpi β2 + Xddpi β3 + ϵ,

which implies

βpop15 = β1 + β2, βpop75 = 4β1, βdpi = β2, βddpi = β3.

This can be rewritten in the following compact form:

βpop15 = βpop75/4 + βdpi,

which is not equivalent to the null hypothesis H0 : βpop75 = 4βpop15 = 4βdpi. The model "fit11", on the other hand, forces βpop15 = βdpi and βpop75 = 4βpop15, so "fit11" is the correct reduced model for this H0.

Penalization methods (Shrinkage methods)


• Recall that linear regression is based on minimizing the residual sum of squares:

minimize over β:  ∑_{i=1}^n (yi − xi′β)²

• The obtained minimizer β̂ (OLS estimator) is generally a good estimator of β.

• However, (1) when the number of explanatory variables p is much larger than the sample size n, or (2)
when the columns of the design matrix X are highly correlated, the obtained β̂ can be unstable and often less
interpretable

Penalization methods (Shrinkage methods)


• Shrinkage methods put a penalty on the coefficient vector β in the optimization problem so that the
obtained coefficient estimate β̂ cannot be too large!

• Shrinkage methods generally solve


minimize over β:  ∑_{i=1}^n (yi − β0 − ∑_{j=1}^p xij βj)² + Penalty(β),

where Penalty(·) is a function that penalizes β

• In this class, we will learn "Ridge penalty" and "Lasso penalty"

Ridge regression
• Ridge regression is based on limiting ∑_{j=1}^p βj²

• Suppose that X ∈ R^{n×p} is columnwise centered. Ridge regression solves

minimize over β:  ∑_{i=1}^n (yi − ∑_{j=1}^p xij βj)² + λ ∑_{j=1}^p βj²,

where λ > 0 is a user-determined tuning parameter that controls the tradeoff between fit and penalty

• Ridge regression has a closed form solution (see the sketch below):

β̂ridge(λ) = (X′X + λIp×p)⁻¹X′y,

where Ip×p is the p × p identity matrix
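A minimal sketch of the closed-form solution (my own code; note that lm.ridge, used below, centers and scales the variables internally, so its reported coefficients are not numerically identical to this raw-scale computation):
# ridge closed form (X'X + lambda*I)^{-1} X'y on centered predictors and response
library("faraway")
data(savings)
Xc = scale(as.matrix(savings[, c("pop15", "pop75", "dpi", "ddpi")]), center = TRUE, scale = FALSE)
yc = savings$sr - mean(savings$sr)
lambda = 1
beta_ridge = solve(t(Xc) %*% Xc + lambda * diag(ncol(Xc)), t(Xc) %*% yc)
beta_ridge # shrunken coefficients (no intercept: it is absorbed by the centering)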

library("MASS")
# ridge regression
lm.ridge(sr ~ ., data= savings, lambda = 1)

##                        pop15         pop75           dpi          ddpi
##  25.3262192821 -0.3971087502 -1.2595249673 -0.0003267073  0.4068673345

Ridge regression
[Plot: ridge coefficient paths for βpop15, βpop75, βdpi, and βddpi as functions of λ from 0 to 1000]
Appendix
# Ridge regression
lam_set = seq(0, 1000, 1)
result = lm.ridge(sr ~ .,data=savings,lambda=lam_set)
plot(lam_set, result$coef["pop15",],type = "l",
xlim=range(lam_set),ylim=range(result$coef),lwd=2,
xlab=expression(lambda),ylab="Coefficients",cex.lab=2)
lines(lam_set,result$coef["pop75",],col="blue",lty=2,lwd=2)
lines(lam_set,result$coef["dpi",],col="red",lty=3,lwd=2)
lines(lam_set,result$coef["ddpi",],col="green",lty=4,lwd=2)
abline(h = 0, lwd = 2)

# Add legend
legend(300,-1,legend=expression(beta[pop15],beta[pop75],
beta[dpi],beta[ddpi]),
col=c("black", "blue", "red", "green"), lty=1:4, cex=1)

Ridge regression with orthonormal design matrix


• In the case of an orthonormal design matrix X ∈ R^{n×p}, i.e., X′X = Ip×p,

β̂OLS = X′y and β̂ridge = X′y/(1 + λ) = β̂OLS/(1 + λ),

which clearly illustrates the shrinkage effect of Ridge regression (see the numerical check below)

• Ridge regression shrinks the estimates of β toward zero, which introduces bias but reduces the variance of
the estimator. Think about MSE(β̂) = Bias(β̂)² + Variance(β̂)!
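A small numerical check of the orthonormal case (my own sketch; qr.Q is used here only to manufacture a design matrix with orthonormal columns):
# build X with orthonormal columns so that X'X = I, then compare OLS and ridge
set.seed(1)
n = 100; p = 5; lambda = 2
X = qr.Q(qr(matrix(rnorm(n*p), n, p))) # t(X) %*% X is (numerically) the identity
y = X %*% rep(1, p) + rnorm(n, sd = 0.2)
beta_ols = t(X) %*% y # OLS estimate, since X'X = I
beta_ridge = solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
cbind(beta_ols, beta_ridge, beta_ols/(1 + lambda)) # last two columns agree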

Lasso regression
• Lasso regression is based on limiting ∑_{j=1}^p |βj|

• Lasso regression solves

minimize over β:  (1/2) ∑_{i=1}^n (yi − β0 − ∑_{j=1}^p xij βj)² + λ ∑_{j=1}^p |βj|,

where λ > 0 is a user-determined tuning parameter that controls the tradeoff between fit and penalty

• Lasso regression doesn’t have a closed form solution

• Compared to Ridge regression, Lasso regression provides a more sparse solution

Lasso regression with orthonormal design matrix


• In the case of an orthonormal design matrix X ∈ R^{n×p}, i.e., X′X = Ip×p,

β̂j^Lasso = β̂j^OLS − λ   if β̂j^OLS > λ
         = 0              if −λ ≤ β̂j^OLS ≤ λ
         = β̂j^OLS + λ   if β̂j^OLS < −λ,

which clearly illustrates the shrinkage (soft-thresholding) effect of Lasso regression

• Lasso regression sets some of the estimates β̂j exactly to zero, which introduces bias but reduces the variance of the estimator (see the sketch below).
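A minimal sketch of the soft-thresholding rule above (my own helper function, not part of glmnet or faraway):
# soft-thresholding operator: shrink toward zero and set small values exactly to zero
soft_threshold = function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
soft_threshold(c(-3, -0.5, 0.2, 1.5), lambda = 1) # returns -2.0 0.0 0.0 0.5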

Lasso and Ridge regression


library(glmnet)
# Lasso regression
X = as.matrix(savings[,2:5])
y = as.matrix(savings[,1])
lam_set1 = seq(0, 100, 1)
lam_set2 = seq(0, 10, 0.1)

# Lasso and Ridge regression


ridge=glmnet(X,y,family="gaussian",alpha=0,lambda=lam_set1)
lasso=glmnet(X,y,family="gaussian",alpha=1,lambda=lam_set2)

Ridge regression (not efficient)

[Plot: ridge (glmnet, alpha = 0) coefficient paths for βpop15, βpop75, βdpi, and βddpi over λ from 0 to 100]
Appendix (not efficient)
par(mfrow = c(1,2))
# ridge regression
plot(lam_set1, rev(ridge$beta[1,]),type = "l",
xlim=range(lam_set1),ylim=range(ridge$beta),lwd=2,
main = "Ridge",
xlab=expression(lambda),ylab="Coefficients",cex.lab=2)
lines(lam_set1,rev(ridge$beta[2,]),col="blue",lty=2,lwd=2)
lines(lam_set1,rev(ridge$beta[3,]),col="red",lty=3,lwd=2)
lines(lam_set1,rev(ridge$beta[4,]),col="green",lty=4,lwd=2)
abline(h = 0, lwd = 2)
# Add legend
legend(20,-0.5,legend=expression(beta[pop15],beta[pop75],
beta[dpi],beta[ddpi]),
col=c("black", "blue", "red", "green"), lty=1:4, cex=1)

Ridge/Lasso regression (efficient)

[Plots: ridge (left) and Lasso (right) coefficient paths for βpop15, βpop75, βdpi, and βddpi]
Ridge/Lasso regression (efficient)
par(mfrow = c(1,2))
names = c("Ridge", "Lasso");result = list(ridge, lasso)
lam = list(lam_set1,lam_set2)

for (i in 1:2){
plot(lam[[i]], rev(result[[i]]$beta[1,]),type = "l",
xlim=range(lam[[i]]),ylim=range(result[[i]]$beta),
main = names[i],
xlab=expression(lambda),ylab="Coefficients",cex.lab=2)
lines(lam[[i]],rev(result[[i]]$beta[2,]),col="blue",lty=2)
lines(lam[[i]],rev(result[[i]]$beta[3,]),col="red",lty=3)
lines(lam[[i]],rev(result[[i]]$beta[4,]),col="green",lty=4)
abline(h = 0, lwd = 2)
# Add legend
legend("bottomright",legend=expression(beta[pop15],
beta[pop75], beta[dpi],beta[ddpi]),
col=c("black", "blue", "red", "green"), lty=1:4, cex=1)}

Sparsity assumption on the coefficient when p>n


• When p > n, i.e. the high-dimensional case, β is not uniquely defined, which causes an identifiability issue.
Why? Because X has rank at most n < p, so infinitely many coefficient vectors β produce exactly the same fitted values Xβ (see the small demonstration below)

• With a sparsity condition ∥β∥0 ≤ s for some s < n, we could estimate β

• The "sparsity assumption" is essential in the high-dimensional model

Comparisons of regression methods via simulation models


# Generate X and y using normal distribution
set.seed(2000)
n = 100; p = 50
X = matrix(rnorm(n*p), ncol = p)
X = cbind(rep(1,n), X)

# column normalizing
norm_vec = sqrt(apply(X^2, 2, mean))
X = X / matrix(rep(norm_vec, each = n), nrow = n)

# Check the norm of columns


#sqrt(apply(X^2, 2, mean))

# Generate a dependent variables y


beta = runif(p+1)
#beta[6:(p+1)] = 0
y = X %*% beta + 0.1*rnorm(n)

Comparisons of regression methods via simulation models


# Apply Least squares
ls_beta = lm(y ~ X[,2:(p+1)])$coefficients

# Apply Ridge regression


lam_Ridge = seq(0, 20, 0.05)
ridge=glmnet(X[,2:(p+1)],y,family="gaussian", alpha=0,
lambda=lam_Ridge)
ridge_beta = rbind(ridge$a0,ridge$beta)

# Apply Lasso regression


lam_Lasso = seq(0, 0.5, 0.01)

lasso=glmnet(X[,2:(p+1)],y,family="gaussian", alpha=1,
lambda=lam_Lasso)
lasso_beta = rbind(lasso$a0,lasso$beta)

Comparisons of regression methods via simulation models


# Analyzing estimation errors
err_ls = sqrt(sum((ls_beta - beta)^2))

err_ridge = NULL
for (i in 1:length(lam_Ridge)){
err_ridge=c(err_ridge,sqrt(sum((ridge_beta[,i]-beta)^2)))}

err_ridge = rev(err_ridge)

err_lasso = NULL
for (i in 1:length(lam_Lasso)){
err_lasso=c(err_lasso,sqrt(sum((lasso_beta[,i]-beta)^2)))}

err_lasso = rev(err_lasso)

Comparisons of regression methods via simulation models


[Plots: estimation errors of ridge (left) and Lasso (right) versus λ, with the least squares error marked at λ = 0]
Appendix
# Drawing error plots

par(mfrow = c(1,2))
names = c("Errors plot (Ridge)", "Errors plot (Lasso)");
result = list(err_ridge, err_lasso);
lam = list(lam_Ridge,lam_Lasso)
for (i in 1:2){
# ridge regression
plot(lam[[i]], result[[i]],type = "l",
xlim=range(lam[[i]]), ylim=range(result[[i]]),lwd=2,
main = names[i],
xlab=expression(lambda),ylab="Estimation errors")
points(0,err_ls, col = "blue")

# Add legend
legend("bottomright",legend=c(names[[i]],"Least squares"),
col=c("black","blue"),lty=c(1,0),pch = c("","o"),cex=1)}

Make a function
# load the uploaded "generating_plot" function instead

generating_plot = function(X, y, beta, lam_Ridge, lam_Lasso){


# Apply Least squares
ls_beta = lm(y ~ X[,2:(p+1)])$coefficients

# Apply Ridge regression


ridge=glmnet(X[,2:(p+1)],y,family="gaussian", alpha=0, lambda=lam_Ridge)
ridge_beta = rbind(ridge$a0,ridge$beta)

# Apply Lasso regression

lasso=glmnet(X[,2:(p+1)],y,family="gaussian", alpha=1, lambda=lam_Lasso)


lasso_beta = rbind(lasso$a0,lasso$beta)

# Analyzing estimation errors


err_ls = sqrt(sum((ls_beta - beta)^2))

err_ridge = NULL
for (i in 1:length(lam_Ridge)){
err_ridge = c(err_ridge, sqrt(sum((ridge_beta[,i] - beta)^2)))}

err_ridge = rev(err_ridge)

err_lasso = NULL
for (i in 1:length(lam_Lasso)){
err_lasso = c(err_lasso, sqrt(sum((lasso_beta[,i] - beta)^2)))}

err_lasso = rev(err_lasso)

# Drawing error plots

par(mfrow = c(1,2))
names = c("Errors plot (Ridge)", "Errors plot (Lasso)"); result = list(err_ridge, err_lasso);
lam = list(lam_Ridge,lam_Lasso)
for (i in 1:2){
# ridge regression
plot(lam[[i]], result[[i]],type = "l",
xlim=range(lam[[i]]), ylim=range(result[[i]]),lwd=2,
main = names[i],
xlab=expression(lambda),ylab="Estimation errors",cex.lab=2)
#abline(h = 0, lwd = 2)
points(0, err_ls, col = "blue")

# Add legend
legend("bottomright", legend=c(names[[i]], "Least squares"),
col=c("black", "blue"), lty=c(1,0), pch = c("", "o") , cex=1)}

lasso_beta = lasso_beta[ , seq(length(lam_Lasso), 1,-1)]


ridge_beta = ridge_beta[ , seq(length(lam_Ridge), 1,-1)]

return(list(err_lasso=err_lasso, err_ridge=err_ridge,
lasso_beta=lasso_beta, ridge_beta=ridge_beta))}

Comparisons of regression methods (Sparse model case)


# Generate a dependent variables y with a sparse beta!
beta[6:p+1] = 0
y = X %*% beta + 0.1*rnorm(n)

lam_Ridge = seq(0, 20, 0.05)


lam_Lasso = seq(0, 0.5, 0.005)

results = generating_plot(X,y,beta,lam_Ridge,lam_Lasso)

# We can observe that Lasso gives more accurate solutions


# with some penalty parameter lambda when underlying beta
# is sparse!

# results$lasso_beta[,3] gives a sparse solution and


# provides the most accurate solution!

Comparisons of regression methods (Sparse model case)


[Plots: estimation errors of ridge (left) and Lasso (right) versus λ for the sparse model, with the least squares error marked at λ = 0]
Comparisons of regression methods (highly correlated matrix X)
## Now we consider correlated design matrix case
set.seed(3000)
n = 100; p = 70
# Generating covariance (correlation) matrix

sigma = 0.99
A = array(0,c(p,p))
for (i in 1:p){for (j in 1:p){A[i,j] = sigma^(abs(i-j))}}
Z = matrix(rnorm(n*p), ncol = p)
# The generated X has independent rows but dependent
# columns whose population covariance matrix is A
library("expm") # needed for sqrtm()
X = Z %*% sqrtm(A); X = cbind(rep(1,n), X)
beta = c(rep(1,5), rep(0, p-4))
y = X %*% beta + 0.1*rnorm(n)
lam_Ridge = seq(0, 20, 0.05); lam_Lasso = seq(0, 1, 0.005)
results = generating_plot(X, y, beta, lam_Ridge, lam_Lasso)

Comparisons of regression methods (highly correlated matrix X)


[Plots: estimation errors of ridge (left) and Lasso (right) versus λ for the highly correlated design, with the least squares error marked at λ = 0]
Model selection criterion
• Among many obtained linear models (by using different λ values), we could choose the best one based
on some criterion

• "R-squared" and "Adjusted R-squared" are one of such criteria, but not often used in the high-dimensional
model

• More popular criterion are "Akaike information criterion" (AIC) and "Bayesian information criterion"
(BIC)

AIC and BIC
• AIC/BIC consider trade-off between goodness of fit and simplicity of the model.
• AIC/BIC only measure the relative quality of a model, i.e. they do not provide a statistical inference
(i.e. a test) of a model
• Lower AIC/BIC indicates a better model! (see the sketch after this list)
• For the model M, let L be the maximum value of the likelihood function for the model M. Then, up to an additive constant that does not depend on the model,

AIC(M) = −2 log L + 2|M| = n log( ∑_{i=1}^n (yi − xi′β̂)² / n ) + 2|M|

BIC(M) = −2 log L + |M| log n = n log( ∑_{i=1}^n (yi − xi′β̂)² / n ) + |M| log n

• BIC penalizes larger models more aggressively, i.e. BIC prefers smaller models compared to AIC
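A minimal sketch using R's built-in AIC() and BIC() on two of the savings models (note: for lm fits, R keeps the additive constant and counts the error variance as a parameter, so the absolute values differ from the formula above, but the model ranking is what matters):
# lower AIC/BIC indicates the preferred model
fit_full = lm(sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
fit_small = lm(sr ~ pop15 + pop75, data = savings)
c(AIC_full = AIC(fit_full), AIC_small = AIC(fit_small))
c(BIC_full = BIC(fit_full), BIC_small = BIC(fit_small))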

AIC and BIC (cont.)


• The likelihood function is

L = (2πσ²)^{−n/2} exp( −(1/(2σ²)) ∑_{i=1}^n (yi − xi′β)² )

• The log-likelihood function is

log L = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) ∑_{i=1}^n (yi − xi′β)²

• Since σ̂² = (1/n) ∑_{i=1}^n (yi − xi′β̂)², the maximum value of the log-likelihood function is

−(n/2) log(σ̂²) + constant = −(n/2) log( (1/n) ∑_{i=1}^n (yi − xi′β̂)² ) + constant,

which gives the AIC/BIC formula

Selecting model using AIC/BIC via simulation model (p>n and sparse case)
## Apply AIC/BIC to the simulation model

# Generate X and y using normal distribution


set.seed(1000)
n = 100; p = 200
X = matrix(rnorm(n*p), ncol = p)
X = cbind(rep(1,n), X)

# column normalizing
norm_vec = sqrt(apply(X^2, 2, mean))
X = X / matrix(rep(norm_vec, each = n), nrow = n)

# Generate a dependent variables y


beta = runif(p+1)
beta[8:p+1] = 0
y = X %*% beta + 0.1*rnorm(n)

Selecting model using AIC/BIC via simulation model (p>n and sparse case)
# Apply Lasso regression
lam_Lasso = seq(0.05, 5, 0.01)*sqrt(log(p)/n)*0.1

lasso=glmnet(X[,2:(p+1)],y,family="gaussian", alpha=1,
lambda=lam_Lasso)
lasso_beta = rbind(lasso$a0,lasso$beta)

# Analyzing estimation errors


err_lasso = NULL
for (i in 1:length(lam_Lasso)){
err_lasso=c(err_lasso,sqrt(sum((lasso_beta[,i]-beta)^2)))}

err_lasso = rev(err_lasso)

Selecting model using AIC/BIC via simulation model (p>n and sparse case)
[Plot: Lasso estimation error versus λ]

[Output truncated in extraction: the estimated Lasso coefficients at the error-minimizing λ; the first eight coefficients (V1–V8) are clearly nonzero and essentially all of the remaining coefficients are zero or negligible]

Selecting model using AIC/BIC via simulation model (p>n and sparse case)
# Drawing error plots
names = "Errors plot (Lasso)" ; result = err_lasso;
lam = lam_Lasso

plot(lam, result,type = "l",


xlim=range(lam), ylim=range(result),lwd=2,
main = names,
xlab=expression(lambda),ylab="Estimation errors")

ind = which(err_lasso==min(err_lasso))
round(lasso_beta[,ncol(lasso_beta)-ind+1],3)

Selecting model using AIC/BIC via simulation model (p>n and sparse case)
[Plots: AIC (left) and BIC (right) over the sequence of Lasso fits]

[Output truncated in extraction: the Lasso coefficients selected by BIC; the first eight coefficients (V1–V8) are clearly nonzero and essentially all of the remaining coefficients are exactly zero]

Appendix
# Computing AIC/BIC
AIC = NULL; BIC = NULL; BIC_fit = NULL; BIC_pen = NULL
for (i in 1:length(lam_Lasso)){
AIC=c(AIC, n*log(sum((y - X%*% lasso_beta[,i])^2)/n)+
2*sum(abs(lasso_beta[2:(p+1),i])>0.00001))
BIC=c(BIC, n*log(sum((y - X%*% lasso_beta[,i])^2)/n)+
log(n)*sum(abs(lasso_beta[2:(p+1),i])>0.00001))
BIC_fit=c(BIC_fit,n*log(sum((y-X%*%lasso_beta[,i])^2)/n))
BIC_pen=c(BIC_pen, log(n)*
sum(abs(lasso_beta[2:(p+1),i])>0.00001))}
AIC = rev(AIC); BIC = rev(BIC)
BIC_fit = rev(BIC_fit); BIC_pen = rev(BIC_pen)
par(mfrow = c(1,2))
plot(AIC); plot(BIC)
ind_B = which(BIC==min(BIC))
round(lasso_beta[,ncol(lasso_beta)-ind_B+1],5)

Selecting model using AIC/BIC via simulation model (p>n and sparse case)
[Plot: BIC and its two components, the fit term −2 log L and the penalty term, versus λ]
Appendix
# Drawing BIC plot
names = "BIC plot" ;
lam = lam_Lasso

par(mfrow=c(1,1))
plot(lam, BIC_fit,type = "l",
xlim=range(lam),lwd=2,main=names,ylim=
c(min(BIC_fit), max(BIC_pen)),
xlab=expression(lambda),ylab="Value",cex.lab=2)

lines(lam,BIC_pen, col = "blue", lty = 2)


lines(lam,BIC,col="red",lty=3, lwd=2)
# Add legend
legend("bottomright",legend=c("-2L (fit)",
"Penalty(simplicity)", "BIC"),
col=c("black", "blue", "red"), lty=1:3, cex=1)

optim(par, fn, gr = NULL,...,


method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B",
"SANN", "Brent"),
lower = -Inf, upper = Inf,
control = list(), hessian = FALSE)
