
SDS-SOLUTION

UNIT-I
(CHAPTER-2)

1. For each of parts (a) through (d), indicate whether we would generally expect the
performance of a flexible statistical learning method to be better or worse than an
inflexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p is small.

ANSWER:
better - a more flexible approach will fit the data more closely, and with a very large sample
size and few predictors it can do so without overfitting, giving a better fit than an inflexible
approach would obtain

(b) The number of predictors p is extremely large, and the number of observations
n is small.

ANSWER:
worse - a flexible method would overfit the small number of observations

(c) The relationship between the predictors and response is highly non-linear.

ANSWER:
better - with more degrees of freedom, a flexible model would obtain a better fit

(d) The variance of the error terms, i.e. σ² = Var(ε), is extremely high.

ANSWER:
worse - flexible methods fit to the noise in the error terms and increase variance

2. Explain whether each scenario is a classification or regression problem, and


indicate whether we are most interested in inference or prediction. Finally,
provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm we record
profit, number of employees, industry and the CEO salary. We are interested in
understanding which factors affect CEO salary.

ANSWER:
regression. inference. quantitative output of CEO salary based on the firm's features.
n = 500 (firms in the US)
p = 3 (profit, number of employees, industry)
(b) We are considering launching a new product and wish to know whether it will be
a success or a failure. We collect data on 20 similar products that were previously
launched. For each product we have recorded whether it was a success or failure,
price charged for the product, marketing budget, competition price, and ten other
variables.

ANSWER:
classification. prediction. predicting the new product's success or failure.
n = 20 (similar products previously launched)
p = 13 (price charged, marketing budget, competition price, and ten other variables)

(c) We are interested in predicting the % change in the US dollar in relation to the
weekly changes in the world stock markets. Hence we collect weekly data for all of
2012. For each week we record the % change in the dollar, the % change in the US
market, the % change in the British market, and the % change in the German
market.

ANSWER:
regression. prediction. quantitative output of the % change in the dollar.
n = 52 (weeks of 2012 weekly data)
p = 3 (% change in US market, % change in British market, % change in German market)

3. The following questions relate to the bias-variance decomposition.


(a) Provide a sketch of typical (squared) bias, variance, training error, test error,
and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible
statistical learning methods towards more flexible approaches. The x-axis should
represent the amount of flexibility in the method, and the y-axis should represent
the values for each curve. There should be five curves. Make sure to label each
one.

ANSWER:
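A schematic of the five curves can be generated with the short R sketch below. The functional forms are illustrative assumptions chosen only to reproduce the typical shapes described in part (b); they are not computed from any data.

flexibility <- seq(1, 10, length.out = 100)
bias_sq   <- 6 / flexibility                    # squared bias: decreases with flexibility
variance  <- 0.1 * flexibility^2                # variance: increases with flexibility
bayes_err <- rep(1, length(flexibility))        # Bayes (irreducible) error: constant
test_err  <- bias_sq + variance + bayes_err     # test error: U-shaped
train_err <- 6 / flexibility + 0.3              # training error: decreases monotonically

plot(flexibility, test_err, type = "l", col = "red", ylim = c(0, 12),
     xlab = "Flexibility", ylab = "Error")
lines(flexibility, train_err, col = "blue")
lines(flexibility, bias_sq,  col = "darkgreen")
lines(flexibility, variance, col = "orange")
lines(flexibility, bayes_err, col = "gray", lty = 2)
legend("top", c("test error", "training error", "squared bias", "variance", "Bayes error"),
       col = c("red", "blue", "darkgreen", "orange", "gray"),
       lty = c(1, 1, 1, 1, 2), cex = 0.8)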

(b) Explain why each of the five curves has the shape displayed in part (a).

ANSWER:
All five curves are non-negative (values >= 0).

(i) (squared) bias - decreases monotonically because increasing flexibility yields a closer
fit to the true f.
(ii) variance - increases monotonically because increasing flexibility makes the fit more
sensitive to the particular training set (overfitting).
(iii) training error - decreases monotonically because increasing flexibility yields a closer
fit to the training data.
(iv) test error - U-shaped (concave up) because increasing flexibility at first yields a closer
fit, and then overfitting sets in.
(v) Bayes (irreducible) error - a horizontal line that defines the lower limit: the test error is
bounded below by the irreducible error, which comes from the variance of the error term
ε in the output values (value >= 0). When the training error drops below the irreducible
error, overfitting has taken place.
For classification problems, the Bayes error rate is determined by the proportion of data
points that lie on the 'wrong' side of the Bayes decision boundary (0 <= value < 1).

4. You will now think of some real-life applications for statistical learning.
(a) Describe three real-life applications in which classification might be useful.
Describe the response, as well as the predictors. Is the goal of each application
inference or prediction? Explain your answer.

ANSWER:
(i) stock market price direction. prediction. response: up or down; predictors: yesterday's
price movement (% change), the two previous days' price movements (% change), etc.
(ii) illness classification. inference. response: ill or healthy; predictors: resting heart rate,
resting breathing rate, mile run time.
(iii) car part replacement. prediction. response: needs to be replaced or still good;
predictors: age of the part, mileage used, current amperage.

(b) Describe three real-life applications in which regression might be useful.


Describe the response, as well as the predictors. Is the goal of each application
inference or prediction? Explain your answer.

ANSWER:
(i) CEO salary. inference. response: salary; predictors: age, industry experience, industry,
years of education.
(ii) car part replacement. inference. response: remaining life of the car part; predictors:
age of the part, mileage used, current amperage.
(iii) life expectancy. prediction. response: age at death; predictors: current age, gender,
resting heart rate, resting breathing rate, mile run time.

(c) Describe three real-life applications in which cluster analysis might be useful.

ANSWER:
(i) cancer type clustering. diagnose cancer types more accurately.
(ii) Netflix movie recommendations. recommend movies based on users who have
watched and rated similar movies.
(iii) marketing survey. clustering of demographics for a product(s) to see which clusters
of consumers buy which products.
5. What are the advantages and disadvantages of a very flexible (versus a less
flexible) approach for regression or classification? Under what circumstances
might a more flexible approach be preferred to a less flexible approach? When
might a less flexible approach be preferred?

ANSWER:
The advantages of a very flexible approach for regression or classification are that it can
obtain a better fit for non-linear relationships and it decreases bias.
The disadvantages of a very flexible approach are that it requires estimating a greater
number of parameters, it can follow the noise too closely (overfit), and it increases
variance.
A more flexible approach would be preferred to a less flexible approach when we are
interested in prediction and not the interpretability of the results.
A less flexible approach would be preferred to a more flexible approach when we are
interested in inference and the interpretability of the results.

6. Describe the differences between a parametric and a non-parametric statistical


learning approach. What are the advantages of a parametric approach to regression
or classification (as opposed to a nonparametric approach)? What are its
disadvantages?

ANSWER:
A parametric approach reduces the problem of estimating f down to one of estimating a
set of parameters because it assumes a form for f.
A non-parametric approach does not assume a functional form for f and so requires a
very large number of observations to accurately estimate f.
The advantages of a parametric approach to regression or classification are that it
simplifies the problem of modeling f to estimating a few parameters, and that it does not
require as many observations as a non-parametric approach.
The disadvantages of a parametric approach to regression or classification are a
potential to inaccurately estimate f if the form of f assumed is wrong or to overfit the
observations if more flexible models are used.

7. The table below provides a training data set containing six observations, three
predictors, and one qualitative response variable.

Obs. X1 X2 X3 Y
1 0 3 0 Red
2 2 0 0 Red
3 0 1 3 Red
4 0 1 2 Green
5 -1 0 1 Green
6 1 1 1 Red

Suppose we wish to use this data set to make a prediction for Y when X1= X2= X3=
0 using K-nearest neighbors.

(a) Compute the Euclidean distance between each observation and the test point,
X1= X2= X3= 0.
ANSWER:
Obs. X1 X2 X3 Distance(0, 0, 0) Y
1 0 3 0 3 Red
2 2 0 0 2 Red
3 0 1 3 sqrt(10) ~ 3.2 Red
4 0 1 2 sqrt(5) ~ 2.2 Green
5 -1 0 1 sqrt(2) ~ 1.4 Green
6 1 1 1 sqrt(3) ~ 1.7 Red

(b) What is our prediction with K = 1? Why?

ANSWER:
Green. Observation #5 is the closest neighbor for K = 1.

(c) What is our prediction with K = 3? Why?

ANSWER:
Red. Observations #2, 5, 6 are the closest neighbors for K = 3. 2 is Red, 5 is Green, and
6 is Red.
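These distances and the resulting K = 1 and K = 3 votes can be verified with a short R sketch (the matrix below simply re-enters the training table from the question):

X <- matrix(c(0, 3, 0,
              2, 0, 0,
              0, 1, 3,
              0, 1, 2,
             -1, 0, 1,
              1, 1, 1), ncol = 3, byrow = TRUE)
Y <- c("Red", "Red", "Red", "Green", "Green", "Red")

d <- sqrt(rowSums(X^2))          # Euclidean distance to the test point (0, 0, 0)
round(d, 2)                      # 3.00 2.00 3.16 2.24 1.41 1.73

Y[order(d)[1]]                   # K = 1: nearest neighbor is obs. 5 -> "Green"
table(Y[order(d)[1:3]])          # K = 3: obs. 5, 6, 2 -> Green 1, Red 2 -> "Red"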

(d) If the Bayes decision boundary in this problem is highly non-linear, then would
we expect the best value for K to be large or small? Why?

ANSWER:
Small. A small K would be flexible for a non-linear decision boundary, whereas a large K
would try to fit a more linear boundary because it takes more points into consideration.
UNIT-II
(CHAPTER-3)

1. Describe the null hypotheses to which the p-values given in Table 1 correspond.
Explain what conclusions you can draw based on these p-values. Your explanation
should be phrased in terms of Sales, TV, Radio, and Newspaper, rather than in
terms of the coefficients of the linear model.

TABLE 1: For the Advertising data, least squares coefficient estimates of the multiple linear
regression of number of units sold on radio, TV, and newspaper advertising budgets.

Coefficient Std. error t-statistic p-value

Intercept 2.939 0.3119 9.42 < 0.0001

TV 0.046 0.0014 32.81 < 0.0001

Radio 0.189 0.0086 21.89 < 0.0001

Newspaper -0.001 0.0059 -0.18 0.8599

ANSWER:
In Table 1, the null hypothesis for "TV" is that in the presence of Radio ads and
Newspaper ads, TV ads have no effect on Sales.

Similarly, in Table 1, the null hypothesis for "Radio" is that in the presence of TV and
Newspaper ads, radio ads have no effect on Sales.

Similarly, in Table 1, the null hypothesis for "Newspaper" is that in the presence of TV
and Radio ads, Newspaper ads have no effect on Sales.

The low p-values of TV and Radio suggest that the null hypotheses are false for TV and
Radio.

The high p-value for Newspaper means we cannot reject its null hypothesis: there is no
evidence that newspaper advertising affects Sales once TV and Radio spending are
accounted for.

2. Carefully explain the differences between the KNN classifier and KNN regression
methods.

ANSWER:
The KNN classifier and KNN regression methods are closely related: both work with the K
training observations nearest to the point of interest. However, the final result of the KNN
classifier is a qualitative classification for Y (the most common class among the K
neighbors), whereas KNN regression predicts a quantitative value for f(X) (the average of
the K neighbors' responses).
3. Suppose we have a data set with five predictors,
X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male),
X4 = interaction between GPA and IQ, and X5 = interaction between GPA and Gender.
The response is starting salary after graduation (in thousands of dollars).
Suppose we use least squares to fit the model, and get β̂0 = 50, β̂1 = 20, β̂2 = 0.07,
β̂3 = 35, β̂4 = 0.01 and β̂5 = −10.
(a) Which answer is correct, and why?
(i) For a fixed value of IQ and GPA, males earn more on average than females.
(ii) For a fixed value of IQ and GPA, females earn more on average than males.
(iii) For a fixed value of IQ and GPA, males earn more on average than females
provided that the GPA is high enough.
(iv) For a fixed value of IQ and GPA, females earn more on average than males
provided that the GPA is high enough.

ANSWER:
Y = 50 + 20·GPA + 0.07·IQ + 35·Gender + 0.01·(GPA × IQ) − 10·(GPA × Gender)

Male (Gender = 0):   Y = 50 + 20·GPA + 0.07·IQ + 0.01·(GPA × IQ)

Female (Gender = 1): Y = 50 + 20·GPA + 0.07·IQ + 35 + 0.01·(GPA × IQ) − 10·GPA

The female prediction exceeds the male prediction by 35 − 10·GPA, which is positive only
when GPA < 3.5. Once the GPA is high enough (above 3.5), males earn more on average.
Therefore, option (iii) is correct.

(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.

ANSWER:
Y(Gender = 1, IQ = 110, GPA = 4.0)

= 50 + 20(4.0) + 0.07(110) + 35 + 0.01(4.0 × 110) − 10(4.0)

= 137.1, i.e. a predicted starting salary of about $137,100.
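A one-line R check of this arithmetic, using the fitted coefficients above:

gpa <- 4.0; iq <- 110; gender <- 1   # female
50 + 20 * gpa + 0.07 * iq + 35 * gender + 0.01 * gpa * iq - 10 * gpa * gender   # 137.1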

(c) True or false: Since the coefficient for the GPA/IQ interaction term is very
small, there is very little evidence of an interaction effect. Justify your answer.

ANSWER:
False.
We must examine the p-value of the regression coefficient to determine if the interaction
term is statistically significant or not.
4. I collect a set of data (n = 100 observations) containing a single predictor and a
quantitative response. I then fit a linear regression model to the data, as well as a
separate cubic regression, i.e. Y = β0 + β1·X + β2·X² + β3·X³ + ε.
(a) Suppose that the true relationship between X and Y is linear, i.e. Y = β0 + β1·X + ε.
Consider the training residual sum of squares (RSS) for the linear regression, and
also the training RSS for the cubic regression. Would we expect one to be lower
than the other, would we expect them to be the same, or is there not enough
information to tell? Justify your answer.

ANSWER:
I would expect the cubic regression to have a lower training RSS than the linear
regression because its extra flexibility lets it follow the training data, including the noise,
more closely, even though the true relationship is linear.

(b) Answer (a) using test rather than training RSS.

ANSWER:
Conversely to (a), I would expect the cubic regression to have a higher test RSS: since the
true relationship is linear, the extra polynomial terms mainly fit noise in the training data,
and that overfitting increases the error on new (test) data relative to the linear regression.
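A small simulation illustrates (a) and (b). The data-generating process below (a linear truth with Gaussian noise) and the sample sizes are illustrative assumptions, not part of the original problem:

set.seed(1)
n <- 100
x <- rnorm(n);      y <- 2 + 3 * x + rnorm(n)             # true relationship is linear
x_test <- rnorm(n); y_test <- 2 + 3 * x_test + rnorm(n)

lin <- lm(y ~ x)                 # linear fit
cub <- lm(y ~ poly(x, 3))        # cubic fit

rss <- function(fit, xnew, ynew) sum((ynew - predict(fit, data.frame(x = xnew)))^2)

c(train_linear = sum(resid(lin)^2), train_cubic = sum(resid(cub)^2))   # cubic lower
c(test_linear = rss(lin, x_test, y_test), test_cubic = rss(cub, x_test, y_test))
# the cubic fit typically (though not on every random draw) has the higher test RSS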

(c) Suppose that the true relationship between X and Y is not linear, but we don’t
know how far it is from linear. Consider the training RSS for the linear regression,
and also the training RSS for the cubic regression. Would we expect one to be
lower than the other, would we expect them to be the same, or is there not enough
information to tell? Justify your answer.

ANSWER:
The cubic regression has a lower training RSS than the linear fit because of its higher
flexibility: no matter what the underlying true relationship is, the more flexible model will
follow the training points more closely and so reduce the training RSS.
(An example of this behaviour is shown in Figure 2.9 of Chapter 2.)

(d) Answer (c) using test rather than training RSS.

ANSWER:
There is not enough information to tell which test RSS would be lower, because the
problem states only that we do not know "how far it is from linear". If the true relationship
is closer to linear than to cubic, the linear regression's test RSS could be lower than the
cubic regression's; if it is closer to cubic, the cubic regression's test RSS could be lower.
This is the bias-variance trade-off: it is not clear which level of flexibility will fit unseen
data better.
5. Consider the fitted values that result from performing linear regression without
an intercept. In this setting, the ith fitted value takes the form
ŷ_i = x_i β̂

where

β̂ = ( Σ_{i=1}^n x_i y_i ) / ( Σ_{i'=1}^n x_{i'}² )

Show that we can write

ŷ_i = Σ_{i'=1}^n a_{i'} y_{i'}

What is a_{i'}?

Note: We interpret this result by saying that the fitted values from linear
regression are linear combinations of the response values.

ANSWER:
Substituting β̂ = ( Σ_{i'=1}^n x_{i'} y_{i'} ) / ( Σ_{j=1}^n x_j² ) into ŷ_i = x_i β̂ gives

ŷ_i = x_i · ( Σ_{i'=1}^n x_{i'} y_{i'} ) / ( Σ_{j=1}^n x_j² )
    = Σ_{i'=1}^n ( x_i x_{i'} / Σ_{j=1}^n x_j² ) y_{i'}

Comparing this with ŷ_i = Σ_{i'=1}^n a_{i'} y_{i'} shows that

a_{i'} = x_i x_{i'} / Σ_{j=1}^n x_j²

so each fitted value is a linear combination of the response values.
6. Using the following equations, argue that in the case of simple linear regression,
the least squares line always passes through the point (x̄, ȳ).

β̂1 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²   and   β̂0 = ȳ − β̂1 x̄

ANSWER:
The fitted least squares line is

ŷ = β̂0 + β̂1 x

Evaluating it at x = x̄ and substituting β̂0 = ȳ − β̂1 x̄ from the given equations:

ŷ(x̄) = β̂0 + β̂1 x̄ = (ȳ − β̂1 x̄) + β̂1 x̄ = ȳ

So the point (x̄, ȳ) always satisfies the fitted equation, i.e. the least squares line always
passes through (x̄, ȳ).
UNIT-III
(CHAPTER-4 & CHAPTER-5)

CHAPTER 4

1. Using a little bit of algebra, prove that the given equation (1) is equivalent to
equation (2).
OR
Prove that the logistic function representation and logit representation for the
logistic regression model are equivalent.
e0 1X
p( X )  (1)
1  e0  1X
p( X )
 e0 1X (2)
1  p( X )

ANSWER:
e0 1X
p( X ) 
1  e0 1X

e 0  1 X e 0  1 X e 0  1X
p( X ) 0  1 X
1  e 0  1X 1  e 0  1 X  e 0  1 X
So,  1  e0  1 X  
1  p( X ) e 1  e 0  1 X e 0  1 X 1
1 
1 e 0  1 X
1 e 0  1 X
1 e 0  1 X
1  e 0  1 X

p( X )
Therefore,  e0 1X
1  p( X )

2. It was stated in the text that classifying an observation to the class for which
p_k(x) (as given in equation (1)) is largest is equivalent to classifying an observation
to the class for which δ_k(x) (as given in equation (2)) is largest. Prove that this is
the case. In other words, under the assumption that the observations in the kth
class are drawn from a N(μ_k, σ²) distribution, the Bayes classifier assigns an
observation to the class for which the discriminant function is maximized.

p_k(x) = π_k (1/(√(2π) σ)) exp( −(1/(2σ²)) (x − μ_k)² ) / Σ_{l=1}^K π_l (1/(√(2π) σ)) exp( −(1/(2σ²)) (x − μ_l)² )    (1)

δ_k(x) = x · μ_k/σ² − μ_k²/(2σ²) + log(π_k)    (2)
ANSWER:

Assuming the class densities f_k(x) are normal with a shared variance σ², the probability
that an observation x belongs to class k is

p_k(x) = π_k (1/(√(2π) σ)) exp( −(x − μ_k)²/(2σ²) ) / Σ_l π_l (1/(√(2π) σ)) exp( −(x − μ_l)²/(2σ²) )

while the discriminant function is

δ_k(x) = x μ_k/σ² − μ_k²/(2σ²) + log(π_k)

Claim: maximizing p_k(x) over k is equivalent to maximizing δ_k(x) over k.

Proof. Let x remain fixed and observe that we are maximizing over the class index k.
Suppose that δ_k(x) ≥ δ_i(x). We will show that p_k(x) ≥ p_i(x). From our assumption we have

x μ_k/σ² − μ_k²/(2σ²) + log(π_k) ≥ x μ_i/σ² − μ_i²/(2σ²) + log(π_i)

Exponentiation is a monotonically increasing function, so the following inequality holds:

π_k exp( x μ_k/σ² − μ_k²/(2σ²) ) ≥ π_i exp( x μ_i/σ² − μ_i²/(2σ²) )

Multiply this inequality by the positive constant

c = (1/(√(2π) σ)) exp( −x²/(2σ²) ) / Σ_l π_l (1/(√(2π) σ)) exp( −(x − μ_l)²/(2σ²) )

and note that exp( −x²/(2σ²) ) exp( x μ_k/σ² − μ_k²/(2σ²) ) = exp( −(x − μ_k)²/(2σ²) ). The
inequality then becomes

π_k (1/(√(2π) σ)) exp( −(x − μ_k)²/(2σ²) ) / Σ_l π_l (1/(√(2π) σ)) exp( −(x − μ_l)²/(2σ²) )
≥ π_i (1/(√(2π) σ)) exp( −(x − μ_i)²/(2σ²) ) / Σ_l π_l (1/(√(2π) σ)) exp( −(x − μ_l)²/(2σ²) )

or equivalently, p_k(x) ≥ p_i(x). Reversing these steps also holds, so we have that
maximizing δ_k is equivalent to maximizing p_k.
3. This problem relates to the QDA model, in which the observations within each
class are drawn from a normal distribution with a class-specific mean vector and a
class-specific covariance matrix. We consider the simple case where p = 1; i.e.
there is only one feature.
Suppose that we have K classes, and that if an observation belongs to the kth class
then X comes from a one-dimensional normal distribution, X ~ N(μ_k, σ_k²). Recall
that the density function for the one-dimensional normal distribution is as given
in equation (1). Prove that in this case, the Bayes classifier is not linear. Argue
that it is in fact quadratic.

f_k(x) = (1/(√(2π) σ_k)) exp( −(1/(2σ_k²)) (x − μ_k)² )    (1)

ANSWER:
By Bayes' theorem,

p_k(x) = π_k (1/(√(2π) σ_k)) exp( −(x − μ_k)²/(2σ_k²) ) / Σ_l π_l (1/(√(2π) σ_l)) exp( −(x − μ_l)²/(2σ_l²) )

Taking logs,

log(p_k(x)) = log(π_k) + log( 1/(√(2π) σ_k) ) − (x − μ_k)²/(2σ_k²)
              − log( Σ_l π_l (1/(√(2π) σ_l)) exp( −(x − μ_l)²/(2σ_l²) ) )

The last term is the same for every class, so the Bayes classifier maximizes the
discriminant

δ_k(x) = log(π_k) + log( 1/(√(2π) σ_k) ) − (x − μ_k)²/(2σ_k²)

Because each class has its own variance σ_k², the quadratic term x²/(2σ_k²) does not
cancel when comparing two classes. As you can see, δ_k(x) is a quadratic function of x,
so the Bayes decision boundary is quadratic rather than linear.

5. We now examine the differences between LDA and QDA.

(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform
better on the training set? On the test set?

ANSWER:
If the Bayes decision boundary is linear, we expect QDA to perform better on the training
set because its higher flexibility will yield a closer fit. On the test set, we expect LDA to
perform better than QDA because QDA could overfit the linearity of the Bayes decision
boundary.

(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to


perform better on the training set? On the test set?

ANSWER:
If the Bayes decision boundary is non-linear, we expect QDA to perform better both on
the training and test sets.

(c) In general, as the sample size n increases, do we expect the test prediction
accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

ANSWER:
We expect the test prediction accuracy of QDA relative to LDA to improve, in general, as
the sample size n increases, because a more flexible method can achieve a better fit once
the larger sample size offsets its higher variance.

(d) True or False: Even if the Bayes decision boundary for a given problem is linear,
we will probably achieve a superior test error rate using QDA rather than LDA
because QDA is flexible enough to model a linear decision boundary. Justify your
answer.

ANSWER:
False. With a small number of sample points, the extra variance from using a more
flexible method such as QDA would lead to overfitting, yielding a higher test error rate
than LDA.

6. Suppose we collect data for a group of students in a statistics class with


variables X1 = hours studied, X2 = undergrad GPA, and Y = receive an A. We fit a logistic
regression and produce estimated coefficients β̂0 = −6, β̂1 = 0.05, β̂2 = 1.
(a) Estimate the probability that a student who studies for 40 hours and has an
undergrad GPA of 3.5 gets an A in the class.
(b) How many hours would the student in part (a) need to study to have a 50%
chance of getting an A in the class?

ANSWER:

p(X) = exp(β0 + β1 X1 + β2 X2) / (1 + exp(β0 + β1 X1 + β2 X2)), where X1 = hours studied,
X2 = undergrad GPA, and β0 = −6, β1 = 0.05, β2 = 1.

(a)
X = (40 hours, 3.5 GPA)

p(X) = exp(−6 + 0.05(40) + 3.5) / (1 + exp(−6 + 0.05(40) + 3.5)) = exp(−0.5) / (1 + exp(−0.5)) ≈ 0.3775

So the student has roughly a 37.75% chance of getting an A.

(b)
X = (X1 hours, 3.5 GPA); we need p(X) = 0.50:

0.50 = exp(−6 + 0.05 X1 + 3.5) / (1 + exp(−6 + 0.05 X1 + 3.5))

0.50 (1 + exp(−2.5 + 0.05 X1)) = exp(−2.5 + 0.05 X1)

0.50 = 0.50 exp(−2.5 + 0.05 X1)

exp(−2.5 + 0.05 X1) = 1

Taking the log of both sides:

−2.5 + 0.05 X1 = 0

X1 = 2.5 / 0.05 = 50 hours
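Both calculations can be verified with a short R sketch, using the coefficients given in the question:

b0 <- -6; b1 <- 0.05; b2 <- 1
p <- function(hours, gpa) {              # logistic function
  eta <- b0 + b1 * hours + b2 * gpa
  exp(eta) / (1 + exp(eta))
}
p(40, 3.5)                               # (a) ~ 0.3775
(0 - b0 - b2 * 3.5) / b1                 # (b) hours for a 50% chance (log-odds = 0): 50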

7. Suppose that we wish to predict whether a given stock will issue a dividend this
year ("Yes" or "No") based on X, last year's percent profit. We examine a large
number of companies and discover that the mean value of X for companies that
issued a dividend was X̄ = 10, while the mean for those that didn't was X̄ = 0. In
addition, the variance of X for these two sets of companies was σ̂² = 36. Finally,
80% of companies issued dividends. Assuming that X follows a normal
distribution, predict the probability that a company will issue a dividend this year
given that its percentage profit was X = 4 last year.

ANSWER:

Using Bayes' theorem with normal densities and a common variance σ² = 36,

p_k(x) = π_k (1/(√(2π) σ)) exp( −(x − μ_k)²/(2σ²) ) / Σ_l π_l (1/(√(2π) σ)) exp( −(x − μ_l)²/(2σ²) )

The (1/(√(2π) σ)) factors cancel, so with π_yes = 0.80, π_no = 0.20, μ_yes = 10, μ_no = 0:

p_yes(x) = 0.80 exp( −(x − 10)²/(2·36) ) / [ 0.80 exp( −(x − 10)²/(2·36) ) + 0.20 exp( −x²/(2·36) ) ]

At x = 4:

p_yes(4) = 0.80 exp( −(4 − 10)²/72 ) / [ 0.80 exp( −(4 − 10)²/72 ) + 0.20 exp( −4²/72 ) ] ≈ 0.752

So the probability that the company issues a dividend this year is about 75.2%.
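A quick numerical check of this Bayes' theorem calculation in R (the 1/(√(2π)σ) factors are included via dnorm but cancel in the ratio):

num <- 0.80 * dnorm(4, mean = 10, sd = 6)    # dividend class: prior 0.80, mean 10, sd 6
den <- num + 0.20 * dnorm(4, mean = 0, sd = 6)
num / den                                    # ~ 0.752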

8. Suppose that we take a data set, divide it into equally-sized training and test
sets, and then try out two different classification procedures. First we use logistic
regression and get an error rate of 20 % on the training data and 30 % on the test
data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate
(averaged over both test and training data sets) of 18 %. Based on these results,
which method should we prefer to use for classification of new observations? Why?

ANSWER:
Given:

Logistic regression: 20% training error rate, 30% test error rate.
KNN (K = 1): average error rate of 18% over the training and test sets.

For KNN with K=1, the training error rate is 0% because for any training observation, its
nearest neighbor will be the response itself. So, KNN has a test error rate of 36%. I would
choose logistic regression because of its lower test error rate of 30%.
9. This problem has to do with odds.
(a) On average, what fraction of people with an odds of 0.37 of defaulting on their
credit card payment will in fact default?
(b) Suppose that an individual has a 16 % chance of defaulting on her credit card
payment. What are the odds that she will default?

ANSWER:

(a)

p(X) / (1 − p(X)) = 0.37  ⟹  p(X) = 0.37 (1 − p(X))  ⟹  1.37 p(X) = 0.37  ⟹  p(X) = 0.37 / 1.37 ≈ 0.27

On average, about 27% of people with an odds of 0.37 will in fact default.

(b)

odds = p(X) / (1 − p(X)) = 0.16 / (1 − 0.16) = 0.16 / 0.84 ≈ 0.19
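The two conversions can be written as one-line helpers in R:

odds_to_prob <- function(odds) odds / (1 + odds)   # (a): odds 0.37 -> probability ~0.27
prob_to_odds <- function(p) p / (1 - p)            # (b): probability 0.16 -> odds ~0.19
odds_to_prob(0.37)
prob_to_odds(0.16)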

CHAPTER 5

1. Using basic statistical properties of the variance, as well as single-variable
calculus, derive

α = (σ_Y² − σ_XY) / (σ_X² + σ_Y² − 2 σ_XY)

In other words, prove that this value of α does indeed minimize Var(α X + (1 − α) Y).

ANSWER:

Using the following rules:

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
Var(cX) = c² Var(X)
Cov(cX, Y) = Cov(X, cY) = c Cov(X, Y)

Minimizing the variance of the two-asset financial portfolio:

f(α) = Var(α X + (1 − α) Y)
     = Var(α X) + Var((1 − α) Y) + 2 Cov(α X, (1 − α) Y)
     = α² Var(X) + (1 − α)² Var(Y) + 2 α (1 − α) Cov(X, Y)
     = α² σ_X² + (1 − α)² σ_Y² + 2 α (1 − α) σ_XY

Take the first derivative and set it to zero to find the critical point:

0 = f'(α) = 2 α σ_X² − 2 (1 − α) σ_Y² + 2 (1 − 2α) σ_XY

0 = α (σ_X² + σ_Y² − 2 σ_XY) − σ_Y² + σ_XY

α = (σ_Y² − σ_XY) / (σ_X² + σ_Y² − 2 σ_XY)

Since f''(α) = 2 (σ_X² + σ_Y² − 2 σ_XY) = 2 Var(X − Y) ≥ 0, this critical point is indeed a
minimum.

2. We will now derive the probability that a given observation is part of a bootstrap
sample. Suppose that we obtain a bootstrap sample from a set of n observations.

(a) What is the probability that the first bootstrap observation is not the jth
observation from the original sample? Justify your answer.

ANSWER:
1 − 1/n. Each of the n observations is equally likely to be drawn, so the probability that
the first bootstrap observation is the jth observation is 1/n, and the probability that it is
not the jth observation is 1 − 1/n.

(b) What is the probability that the second bootstrap observation is not the jth
observation from the original sample?

ANSWER:
1 − 1/n. Bootstrap sampling is done with replacement, so the second draw is again
uniform over all n observations and the same argument applies.

(c) Argue that the probability that the jth observation is not in the bootstrap sample
is (1 − 1/n)^n.

ANSWER:
In the bootstrap, we sample with replacement, so each of the n independent draws has
the same probability 1 − 1/n of not equaling the jth observation. Applying the product
rule across the n draws gives (1 − 1/n)^n.

(d) When n = 5, what is the probability that the jth observation is in the bootstrap
sample?

ANSWER:
Pr(in) = 1 − Pr(out) = 1 − (1 − 1/5)^5 = 1 − (4/5)^5 ≈ 0.672, i.e. about 67.2%.

(e) When n = 100, what is the probability that the jth observation is in the bootstrap
sample?

ANSWER:
Pr(in) = 1 − Pr(out) = 1 − (1 − 1/100)^100 = 1 − (99/100)^100 ≈ 0.634, i.e. about 63.4%.
(f) When n = 10000, what is the probability that the jth observation is in the
bootstrap sample?

ANSWER:
1 − (1 − 1/10000)^10000 ≈ 0.632, i.e. about 63.2%.

(g) Create a plot that displays, for each integer value of n from 1 to 100, 000, the
probability that the jth observation is in the bootstrap sample. Comment on what
you observe.

ANSWER:
# Probability that the jth observation appears in a bootstrap sample of size n
pr = function(n) return(1 - (1 - 1/n)^n)
x = 1:1e+05
plot(x, pr(x))   # levels off quickly as n grows

The plot quickly reaches an asymptote of about 63.2% (the limit as n grows is 1 − 1/e ≈ 0.632).

3. Answer the following with respect to k-fold cross-validation.

(a) Explain how k-fold cross-validation is implemented.

ANSWER:

k-fold cross-validation is implemented by taking the set of n observations and randomly
splitting it into k non-overlapping groups of roughly equal size. Each group in turn acts as
a validation set, with the remaining k − 1 groups used as the training set. The test error is
estimated by averaging the k resulting MSE estimates.
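A minimal sketch of the procedure in R; the linear model lm(y ~ x) and the simulated data are illustrative assumptions about the learning method being evaluated:

kfold_cv <- function(data, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(data)))   # random non-overlapping groups
  mse <- numeric(k)
  for (i in 1:k) {
    train <- data[folds != i, ]
    valid <- data[folds == i, ]
    fit <- lm(y ~ x, data = train)                     # fit on the other k - 1 folds
    mse[i] <- mean((valid$y - predict(fit, newdata = valid))^2)
  }
  mean(mse)                                            # CV estimate of the test error
}

set.seed(1)
d <- data.frame(x = rnorm(100)); d$y <- 1 + 2 * d$x + rnorm(100)
kfold_cv(d, k = 5)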
(b) What are the advantages and disadvantages of k-fold cross-validation relative
to:
(i.) The validation set approach?
(ii.) LOOCV?

ANSWER:
(i.) The validation set approach is conceptually simple and easily implemented as you
are simply partitioning the existing training data into two sets.
However, there are two drawbacks: (1) the estimate of the test error rate can be
highly variable depending on which observations are included in the training and
validation sets. (2) the validation set error rate may tend to overestimate the test
error rate for the model fit on the entire data set.

(ii.) LOOCV is a special case of k-fold cross-validation with k = n. Thus, LOOCV is the
most computationally intense method since the model must be fit n times. Also,
LOOCV has higher variance, but lower bias, than k-fold CV.

4. Suppose that we use some statistical learning method to make a prediction for
the response Y for a particular value of the predictor X . Carefully describe how
we might estimate the standard deviation of our prediction.

ANSWER:
If we use some statistical learning method to make a prediction for the response Y at a
particular value of the predictor X, we can estimate the standard deviation of that
prediction with the bootstrap approach. The bootstrap works by repeatedly sampling
observations (with replacement) from the original data set B times, for some large value
of B, each time refitting the model and recording its prediction at the given value of X.
The sample standard deviation of these B predictions is our estimate of the standard
deviation of the prediction.
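A minimal bootstrap sketch of this idea in R; the linear model, the prediction point x0 = 0.5, and B = 1000 are illustrative assumptions:

set.seed(1)
d <- data.frame(x = rnorm(100)); d$y <- 1 + 2 * d$x + rnorm(100)
x0 <- 0.5
B <- 1000

preds <- replicate(B, {
  idx <- sample(nrow(d), replace = TRUE)            # resample rows with replacement
  fit <- lm(y ~ x, data = d[idx, ])                 # refit the model on the bootstrap sample
  predict(fit, newdata = data.frame(x = x0))        # prediction at the chosen X
})
sd(preds)   # bootstrap estimate of the standard deviation of the prediction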
UNIT-IV (CHAPTER 6)

1. We perform best subset, forward stepwise, and backward stepwise selection on a


single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, ..., p
predictors. Explain your answers:

(a) Which of the three models with k predictors has the smallest training RSS?

ANSWER:
Best subset selection has the smallest training RSS because it considers every possible
model with k predictors, whereas the other two methods are path dependent: the
k-predictor model they reach is constrained by which predictors they picked at earlier
steps.

(b) Which of the three models with k predictors has the smallest test RSS?

ANSWER:
Best subset selection may have the smallest test RSS because it considers more models
than the other methods. However, the other methods might, by chance, select a model
that happens to fit the test data better.

(c) True or False:


(i.) The predictors in the k-variable model identified by forward stepwise are a
subset of the predictors in the (k + 1)-variable model identified by forward
stepwise selection.

ANSWER: True

(ii.) The predictors in the k-variable model identified by backward stepwise are
a subset of the predictors in the (k + 1)-variable model identified by
backward stepwise selection.

ANSWER: True

(iii.) The predictors in the k-variable model identified by backward stepwise are
a subset of the predictors in the (k + 1)-variable model identified by forward
stepwise selection.

ANSWER: False

(iv.) The predictors in the k-variable model identified by forward stepwise are a
subset of the predictors in the (k + 1)-variable model identified by backward
stepwise selection.

ANSWER: False

(v.) The predictors in the k-variable model identified by best subset are a subset
of the predictors in the (k + 1)-variable model identified by best subset
selection.

ANSWER: False
2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your
answer.

(a) The lasso, relative to least squares, is:


(i.) More flexible and hence will give improved prediction accuracy when its
increase in bias is less than its decrease in variance.
(ii.) More flexible and hence will give improved prediction accuracy when its
increase in variance is less than its decrease in bias.
(iii.) Less flexible and hence will give improved prediction accuracy when its
increase in bias is less than its decrease in variance.
(iv.) Less flexible and hence will give improved prediction accuracy when its
increase in variance is less than its decrease in bias.

ANSWER:
iii. Less flexible and better predictions because of less variance, more bias

(b) Repeat (a) for ridge regression relative to least squares.

ANSWER:
iii. Less flexible and better predictions because of less variance, more bias

(c) Repeat (a) for non-linear methods relative to least squares.

ANSWER:
ii. More flexible, less bias, more variance

3. Suppose we estimate the regression coefficients in a linear regression model by


minimizing

Σ_{i=1}^n ( y_i − β0 − Σ_{j=1}^p β_j x_ij )²   subject to   Σ_{j=1}^p |β_j| ≤ s

for a particular value of s . For parts (a) through (e), indicate which of i. through v.
is correct. Justify your answer.

(a) As we increase s from 0 , the training RSS will:


(i.) Increase initially, and then eventually start decreasing in an inverted U
shape.
(ii.) Decrease initially, and then eventually start increasing in a U shape.
(iii.) Steadily increase.
(iv.) Steadily decrease.
(v.) Remain constant.

ANSWER:
(iv) Steadily decrease: as we increase s from 0, all the β's increase from 0 toward their
least squares estimates. The training RSS is largest when all β's are 0 and steadily
decreases to the ordinary least squares RSS.
(b) Repeat (a) for test RSS.

ANSWER:
(ii) Decrease initially, and then eventually start increasing in a U shape: when s = 0, all
the β's are 0, the model is extremely simple and has a high test RSS. As we increase s,
the β's take non-zero values and the model starts fitting the test data well, so the test
RSS decreases. Eventually, as the β's approach their full OLS values, the model starts
overfitting the training data and the test RSS increases.

(c) Repeat (a) for variance.

ANSWER:
(iii) Steadily increase: when s = 0, the model effectively predicts a constant and has
almost no variance. As we increase s, the model includes more β's and their values
increase; the estimates become increasingly dependent on the training data, so the
variance increases.

(d) Repeat (a) for (squared) bias.

ANSWER:
(iv) Steadily decrease: when s = 0, the model effectively predicts a constant, so the
prediction is far from the actual values and the bias is high. As s increases, more β's
become non-zero and the model fits the training data better, so the bias decreases.

(e) Repeat (a) for the irreducible error.

ANSWER:
(v) Remain constant: by definition, the irreducible error is independent of the model and
hence, irrespective of the choice of s, remains constant.

4. Suppose we estimate the regression coefficients in a linear regression model by


minimizing

Σ_{i=1}^n ( y_i − β0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p β_j²

for a particular value of λ. For parts (a) through (e), indicate which of i. through v.
is correct. Justify your answer.

(a) As we increase λ from 0, the training RSS will:


(i.) Increase initially, and then eventually start decreasing in an inverted U
shape.
(ii.) Decrease initially, and then eventually start increasing in a U shape.
(iii.) Steadily increase.
(iv.) Steadily decrease.
(v.) Remain constant.
ANSWER:
(iii) Steadily increase: As we increase  from 0 , all  's decrease from their least square
estimate values to 0 . Training error for full-blown-OLS  's is the minimum and it
steadily increases as  's are reduced to 0 .

(b) Repeat (a) for test RSS.

ANSWER:
(ii) Decrease initially, and then eventually start increasing in a U shape: when λ = 0, all
the β's have their least squares estimates; the model fits the training data very closely
and the test RSS is high. As we increase λ, the β's shrink toward zero and some of the
overfitting is removed, so the test RSS initially decreases. Eventually, as the β's approach
0, the model becomes too simple and the test RSS increases.

(c) Repeat (a) for variance.

ANSWER:
(iv) Steadily decrease: when λ = 0, the β's have their least squares values; the estimates
depend heavily on the training data, so the variance is high. As we increase λ, the β's
shrink and the model becomes simpler. In the limit as λ approaches infinity, all the β's
reduce to zero, the model predicts a constant, and there is no variance.

(d) Repeat (a) for (squared) bias.

ANSWER:
(iii) Steadily increase: when λ = 0, the β's have their least squares values and the bias is
smallest. As λ increases, the β's shrink toward zero, the model fits the training data less
accurately, and the bias increases. In the limit as λ approaches infinity, the model
predicts a constant and the bias is at its maximum.

(e) Repeat (a) for the irreducible error.

ANSWER:
(v) Remain constant: by definition, the irreducible error is independent of the model and
hence, irrespective of the choice of λ, remains constant.

5. It is well-known that ridge regression tends to give similar coefficient values to


correlated variables, whereas the lasso may give quite different coefficient values
to correlated variables. We will now explore this property in a very simple setting.

Suppose that n = 2, p = 2, x11 = x12, x21 = x22. Furthermore, suppose that y1 + y2 = 0
and x11 + x21 = 0 and x12 + x22 = 0, so that the estimate for the intercept in a least
squares, ridge regression, or lasso model is zero: β̂0 = 0.
(a) Write out the ridge regression optimization problem in this setting.

ANSWER:
A general form of the ridge regression optimization problem looks like

Minimize:  Σ_{i=1}^n ( y_i − β0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p β_j²

In this case, β̂0 = 0 and n = p = 2. So, the optimization problem becomes:

Minimize:  (y1 − β1 x11 − β2 x12)² + (y2 − β1 x21 − β2 x22)² + λ (β1² + β2²)

(b) Argue that in this setting, the ridge coefficient estimates satisfy β̂1 = β̂2.

ANSWER:
Now we are given that x11 = x12 and x21 = x22.

We take derivatives of the above expression with respect to both β1 and β2 and, setting
them equal to zero, find the two conditions

x11 ( y1 − (β̂1 + β̂2) x11 ) + x21 ( y2 − (β̂1 + β̂2) x21 ) = λ β̂1
x11 ( y1 − (β̂1 + β̂2) x11 ) + x21 ( y2 − (β̂1 + β̂2) x21 ) = λ β̂2

The left-hand sides are identical (they depend on β̂1 and β̂2 only through the sum
β̂1 + β̂2), so λ β̂1 = λ β̂2. The symmetry in these expressions forces β̂1 = β̂2.
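This can also be checked numerically in R. The data values below are made up to satisfy the stated conditions (x11 = x12, x21 = x22, x11 + x21 = 0, y1 + y2 = 0), and optim() minimizes the ridge objective from part (a) directly:

x1 <- c(1, -1)      # (x11, x21); note x12 = x11 and x22 = x21
y  <- c(2, -2)      # y1 + y2 = 0
lambda <- 0.7       # an arbitrary positive penalty

ridge_obj <- function(b) sum((y - b[1] * x1 - b[2] * x1)^2) + lambda * sum(b^2)
optim(c(0, 0), ridge_obj)$par    # the two coefficients come out (numerically) equal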

(c) Write out the lasso optimization problem in this setting.

ANSWER:

Like ridge regression, but with an L1 penalty:

Minimize:  (y1 − β1 x11 − β2 x12)² + (y2 − β1 x21 − β2 x22)² + λ (|β1| + |β2|)

(d) Argue that in this setting, the lasso coefficients β̂1 and β̂2 are not unique; in
other words, there are many possible solutions to the optimization problem in (c).
Describe these solutions.

ANSWER:

Here is a geometric interpretation of the solutions to the problem in (c). We use the
alternative (constraint) form of the lasso:

|β1| + |β2| ≤ s

When plotted, this constraint takes the familiar shape of a diamond centered at the
origin (0, 0).

Next consider the squared-error part of the objective,

(y1 − β1 x11 − β2 x12)² + (y2 − β1 x21 − β2 x22)²

We use the facts x11 = x12, x21 = x22, x11 + x21 = 0, x12 + x22 = 0 and y1 + y2 = 0 to
simplify it to

Minimize:  2 ( y1 − (β1 + β2) x11 )²

This optimization problem has a simple solution: β1 + β2 = y1 / x11. This is a line parallel
to the edge of the lasso diamond β1 + β2 = s.

Now the solutions to the original lasso optimization problem are the contours of
( y1 − (β1 + β2) x11 )² that touch the lasso diamond |β1| + |β2| = s. As β1 and β2 vary along
the line β1 + β2 = y1 / x11, these contours touch the lasso-diamond edge β1 + β2 = s at
different points.

As a result, the entire edge β1 + β2 = s is a potential solution to the lasso optimization
problem!

A similar argument can be made for the opposite lasso-diamond edge: β1 + β2 = −s.
Thus, the lasso problem does not have a unique solution.

The general form of the solution is given by two line segments:

β1 + β2 = s with β1 ≥ 0, β2 ≥ 0,   and   β1 + β2 = −s with β1 ≤ 0, β2 ≤ 0
