
SDS-SOLUTION

UNIT-I
(CHAPTER-2)

1. For each of parts (a) through (d), indicate whether we would generally expect the
performance of a flexible statistical learning method to be better or worse than an
inflexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p is small.

ANSWER:
better - a more flexible approach will fit the data more closely, and with a very large sample
size and few predictors it can do so without overfitting, giving a better fit than an inflexible
approach would obtain

(b) The number of predictors p is extremely large, and the number of observations
n is small.

ANSWER:
worse - a flexible method would overfit the small number of observations

(c) The relationship between the predictors and response is highly non-linear.

ANSWER:
better - with more degrees of freedom, a flexible model would obtain a better fit

(d) The variance of the error terms, i.e. σ² = Var(ε), is extremely high.

ANSWER:
worse - flexible methods fit to the noise in the error terms and increase variance

2. Explain whether each scenario is a classification or regression problem, and


indicate whether we are most interested in inference or prediction. Finally,
provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm we record
profit, number of employees, industry and the CEO salary. We are interested in
understanding which factors affect CEO salary.

ANSWER:
regression. inference. quantitative output of CEO salary based on the firm's features.
n = 500 (firms in the US)
p = 3 (profit, number of employees, industry)
(b) We are considering launching a new product and wish to know whether it will be
a success or a failure. We collect data on 20 similar products that were previously
launched. For each product we have recorded whether it was a success or failure,
price charged for the product, marketing budget, competition price, and ten other
variables.

ANSWER:
classification. prediction. predicting the new product's success or failure.
n = 20 (similar products previously launched)
p = 13 (price charged, marketing budget, competition price, and ten other variables)

(c) We are interested in predicting the % change in the US dollar in relation to the
weekly changes in the world stock markets. Hence we collect weekly data for all of
2012. For each week we record the % change in the dollar, the % change in the US
market, the % change in the British market, and the % change in the German
market.

ANSWER:
regression. prediction. quantitative output of the % change in the dollar.
n = 52 (weeks of 2012 weekly data)
p = 3 (% change in US market, % change in British market, % change in German market)

3. The following questions relate to the bias-variance decomposition.


(a) Provide a sketch of typical (squared) bias, variance, training error, test error,
and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible
statistical learning methods towards more flexible approaches. The x-axis should
represent the amount of flexibility in the method, and the y-axis should represent
the values for each curve. There should be five curves. Make sure to label each
one.

ANSWER:
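A schematic of the five curves can be generated with the short R sketch below. The functional forms are illustrative assumptions chosen only to reproduce the typical shapes described in part (b); they are not computed from any data.

flexibility <- seq(1, 10, length.out = 100)
bias_sq   <- 6 / flexibility                    # squared bias: decreases with flexibility
variance  <- 0.1 * flexibility^2                # variance: increases with flexibility
bayes_err <- rep(1, length(flexibility))        # Bayes (irreducible) error: constant
test_err  <- bias_sq + variance + bayes_err     # test error: U-shaped
train_err <- 6 / flexibility + 0.3              # training error: decreases monotonically

plot(flexibility, test_err, type = "l", col = "red", ylim = c(0, 12),
     xlab = "Flexibility", ylab = "Error")
lines(flexibility, train_err, col = "blue")
lines(flexibility, bias_sq,  col = "darkgreen")
lines(flexibility, variance, col = "orange")
lines(flexibility, bayes_err, col = "gray", lty = 2)
legend("top", c("test error", "training error", "squared bias", "variance", "Bayes error"),
       col = c("red", "blue", "darkgreen", "orange", "gray"),
       lty = c(1, 1, 1, 1, 2), cex = 0.8)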

(b) Explain why each of the five curves has the shape displayed in part (a).

ANSWER:
All five curves are non-negative (values >= 0).

(i) (squared) bias - decreases monotonically because increasing flexibility yields a closer
fit to the true f.
(ii) variance - increases monotonically because increasing flexibility makes the fit more
sensitive to the particular training set (overfitting).
(iii) training error - decreases monotonically because increasing flexibility yields a closer
fit to the training data.
(iv) test error - U-shaped (concave up) because increasing flexibility at first yields a closer
fit, and then overfitting sets in.
(v) Bayes (irreducible) error - a horizontal line that defines the lower limit: the test error is
bounded below by the irreducible error, which comes from the variance of the error term
ε in the output values (value >= 0). When the training error drops below the irreducible
error, overfitting has taken place.
For classification problems, the Bayes error rate is determined by the proportion of data
points that lie on the 'wrong' side of the Bayes decision boundary (0 <= value < 1).

4. You will now think of some real-life applications for statistical learning.
(a) Describe three real-life applications in which classification might be useful.
Describe the response, as well as the predictors. Is the goal of each application
inference or prediction? Explain your answer.

ANSWER:
(i) stock market price direction. prediction. response: up or down; predictors: yesterday's
price movement (% change), the two previous days' price movements (% change), etc.
(ii) illness classification. inference. response: ill or healthy; predictors: resting heart rate,
resting breathing rate, mile run time.
(iii) car part replacement. prediction. response: needs to be replaced or still good;
predictors: age of the part, mileage used, current amperage.

(b) Describe three real-life applications in which regression might be useful.


Describe the response, as well as the predictors. Is the goal of each application
inference or prediction? Explain your answer.

ANSWER:
(i) CEO salary. inference. response: salary; predictors: age, industry experience, industry,
years of education.
(ii) car part replacement. inference. response: remaining life of the car part; predictors:
age of the part, mileage used, current amperage.
(iii) life expectancy. prediction. response: age at death; predictors: current age, gender,
resting heart rate, resting breathing rate, mile run time.

(c) Describe three real-life applications in which cluster analysis might be useful.

ANSWER:
(i) cancer type clustering. diagnose cancer types more accurately.
(ii) Netflix movie recommendations. recommend movies based on users who have
watched and rated similar movies.
(iii) marketing survey. clustering of demographics for a product(s) to see which clusters
of consumers buy which products.
5. What are the advantages and disadvantages of a very flexible (versus a less
flexible) approach for regression or classification? Under what circumstances
might a more flexible approach be preferred to a less flexible approach? When
might a less flexible approach be preferred?

ANSWER:
The advantages of a very flexible approach for regression or classification are that it can
obtain a better fit for non-linear relationships and it decreases bias.
The disadvantages of a very flexible approach are that it requires estimating a greater
number of parameters, it can follow the noise too closely (overfit), and it increases
variance.
A more flexible approach would be preferred to a less flexible approach when we are
interested in prediction and not the interpretability of the results.
A less flexible approach would be preferred to a more flexible approach when we are
interested in inference and the interpretability of the results.

6. Describe the differences between a parametric and a non-parametric statistical


learning approach. What are the advantages of a parametric approach to regression
or classification (as opposed to a nonparametric approach)? What are its
disadvantages?

ANSWER:
A parametric approach reduces the problem of estimating f down to one of estimating a
set of parameters because it assumes a form for f.
A non-parametric approach does not assume a functional form for f and so requires a
very large number of observations to accurately estimate f.
The advantages of a parametric approach to regression or classification are that it
simplifies the problem of modeling f to estimating a few parameters, and that it does not
require as many observations as a non-parametric approach.
The disadvantages of a parametric approach to regression or classification are a
potential to inaccurately estimate f if the form of f assumed is wrong or to overfit the
observations if more flexible models are used.

7. The table below provides a training data set containing six observations, three
predictors, and one qualitative response variable.

Obs. X1 X2 X3 Y
1 0 3 0 Red
2 2 0 0 Red
3 0 1 3 Red
4 0 1 2 Green
5 -1 0 1 Green
6 1 1 1 Red

Suppose we wish to use this data set to make a prediction for Y when X1= X2= X3=
0 using K-nearest neighbors.

(a) Compute the Euclidean distance between each observation and the test point,
X1= X2= X3= 0.
ANSWER:
Obs. X1 X2 X3 Distance(0, 0, 0) Y
1 0 3 0 3 Red
2 2 0 0 2 Red
3 0 1 3 sqrt(10) ~ 3.2 Red
4 0 1 2 sqrt(5) ~ 2.2 Green
5 -1 0 1 sqrt(2) ~ 1.4 Green
6 1 1 1 sqrt(3) ~ 1.7 Red

(b) What is our prediction with K = 1? Why?

ANSWER:
Green. Observation #5 is the closest neighbor for K = 1.

(c) What is our prediction with K = 3? Why?

ANSWER:
Red. Observations #2, 5, 6 are the closest neighbors for K = 3. 2 is Red, 5 is Green, and
6 is Red.
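These distances and the resulting K = 1 and K = 3 votes can be verified with a short R sketch (the matrix below simply re-enters the training table from the question):

X <- matrix(c(0, 3, 0,
              2, 0, 0,
              0, 1, 3,
              0, 1, 2,
             -1, 0, 1,
              1, 1, 1), ncol = 3, byrow = TRUE)
Y <- c("Red", "Red", "Red", "Green", "Green", "Red")

d <- sqrt(rowSums(X^2))          # Euclidean distance to the test point (0, 0, 0)
round(d, 2)                      # 3.00 2.00 3.16 2.24 1.41 1.73

Y[order(d)[1]]                   # K = 1: nearest neighbor is obs. 5 -> "Green"
table(Y[order(d)[1:3]])          # K = 3: obs. 5, 6, 2 -> Green 1, Red 2 -> "Red"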

(d) If the Bayes decision boundary in this problem is highly non-linear, then would
we expect the best value for K to be large or small? Why?

ANSWER:
Small. A small K would be flexible for a non-linear decision boundary, whereas a large K
would try to fit a more linear boundary because it takes more points into consideration.
UNIT-II
(CHAPTER-3)

1. Describe the null hypotheses to which the p-values given in Table 1 correspond.
Explain what conclusions you can draw based on these p-values. Your explanation
should be phrased in terms of Sales, TV, Radio, and Newspaper, rather than in
terms of the coefficients of the linear model.

TABLE 1: For the Advertising data, least squares coefficient estimates of the multiple linear
regression of number of units sold on radio, TV, and newspaper advertising budgets.

Coefficient Std. error t-statistic p-value

Intercept 2.939 0.3119 9.42 < 0.0001

TV 0.046 0.0014 32.81 < 0.0001

Radio 0.189 0.0086 21.89 < 0.0001

Newspaper -0.001 0.0059 -0.18 0.8599

ANSWER:
In Table 1, the null hypothesis for "TV" is that in the presence of Radio ads and
Newspaper ads, TV ads have no effect on Sales.

Similarly, in Table 1, the null hypothesis for "Radio" is that in the presence of TV and
Newspaper ads, radio ads have no effect on Sales.

Similarly, in Table 1, the null hypothesis for "Newspaper" is that in the presence of TV
and Radio ads, Newspaper ads have no effect on Sales.

The low p-values of TV and Radio suggest that the null hypotheses are false for TV and
Radio.

The high p-value for Newspaper means we cannot reject its null hypothesis: there is no
evidence that newspaper advertising affects Sales once TV and Radio spending are
accounted for.

2. Carefully explain the differences between the KNN classifier and KNN regression
methods.

ANSWER:
The KNN classifier and KNN regression methods are closely related: both work with the K
training observations nearest to the point of interest. However, the final result of the KNN
classifier is a qualitative classification for Y (the most common class among the K
neighbors), whereas KNN regression predicts a quantitative value for f(X) (the average of
the K neighbors' responses).
3. Suppose we have a data set with five predictors,
X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male),
X4 = interaction between GPA and IQ, and X5 = interaction between GPA and Gender.
The response is starting salary after graduation (in thousands of dollars).
Suppose we use least squares to fit the model, and get β̂0 = 50, β̂1 = 20, β̂2 = 0.07,
β̂3 = 35, β̂4 = 0.01 and β̂5 = −10.
(a) Which answer is correct, and why?
(i) For a fixed value of IQ and GPA, males earn more on average than females.
(ii) For a fixed value of IQ and GPA, females earn more on average than males.
(iii) For a fixed value of IQ and GPA, males earn more on average than females
provided that the GPA is high enough.
(iv) For a fixed value of IQ and GPA, females earn more on average than males
provided that the GPA is high enough.

ANSWER:
Y = 50 + 20·GPA + 0.07·IQ + 35·Gender + 0.01·(GPA × IQ) − 10·(GPA × Gender)

Male (Gender = 0):   Y = 50 + 20·GPA + 0.07·IQ + 0.01·(GPA × IQ)

Female (Gender = 1): Y = 50 + 20·GPA + 0.07·IQ + 35 + 0.01·(GPA × IQ) − 10·GPA

The female prediction exceeds the male prediction by 35 − 10·GPA, which is positive only
when GPA < 3.5. Once the GPA is high enough (above 3.5), males earn more on average.
Therefore, option (iii) is correct.

(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.

ANSWER:
Y(Gender = 1, IQ = 110, GPA = 4.0)

= 50 + 20(4.0) + 0.07(110) + 35 + 0.01(4.0 × 110) − 10(4.0)

= 137.1, i.e. a predicted starting salary of about $137,100.
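A one-line R check of this arithmetic, using the fitted coefficients above:

gpa <- 4.0; iq <- 110; gender <- 1   # female
50 + 20 * gpa + 0.07 * iq + 35 * gender + 0.01 * gpa * iq - 10 * gpa * gender   # 137.1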

(c) True or false: Since the coefficient for the GPA/IQ interaction term is very
small, there is very little evidence of an interaction effect. Justify your answer.

ANSWER:
False.
We must examine the p-value of the regression coefficient to determine if the interaction
term is statistically significant or not.
4. I collect a set of data (n = 100 observations) containing a single predictor and a
quantitative response. I then fit a linear regression model to the data, as well as a
separate cubic regression, i.e. Y = β0 + β1·X + β2·X² + β3·X³ + ε.
(a) Suppose that the true relationship between X and Y is linear, i.e. Y = β0 + β1·X + ε.
Consider the training residual sum of squares (RSS) for the linear regression, and
also the training RSS for the cubic regression. Would we expect one to be lower
than the other, would we expect them to be the same, or is there not enough
information to tell? Justify your answer.

ANSWER:
I would expect the cubic regression to have a lower training RSS than the linear
regression because its extra flexibility lets it follow the training data, including the noise,
more closely, even though the true relationship is linear.

(b) Answer (a) using test rather than training RSS.

ANSWER:
Conversely to (a), I would expect the cubic regression to have a higher test RSS: since the
true relationship is linear, the extra polynomial terms mainly fit noise in the training data,
and that overfitting increases the error on new (test) data relative to the linear regression.
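A small simulation illustrates (a) and (b). The data-generating process below (a linear truth with Gaussian noise) and the sample sizes are illustrative assumptions, not part of the original problem:

set.seed(1)
n <- 100
x <- rnorm(n);      y <- 2 + 3 * x + rnorm(n)             # true relationship is linear
x_test <- rnorm(n); y_test <- 2 + 3 * x_test + rnorm(n)

lin <- lm(y ~ x)                 # linear fit
cub <- lm(y ~ poly(x, 3))        # cubic fit

rss <- function(fit, xnew, ynew) sum((ynew - predict(fit, data.frame(x = xnew)))^2)

c(train_linear = sum(resid(lin)^2), train_cubic = sum(resid(cub)^2))   # cubic lower
c(test_linear = rss(lin, x_test, y_test), test_cubic = rss(cub, x_test, y_test))
# the cubic fit typically (though not on every random draw) has the higher test RSS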

(c) Suppose that the true relationship between X and Y is not linear, but we don’t
know how far it is from linear. Consider the training RSS for the linear regression,
and also the training RSS for the cubic regression. Would we expect one to be
lower than the other, would we expect them to be the same, or is there not enough
information to tell? Justify your answer.

ANSWER:
The cubic regression has a lower training RSS than the linear fit because of its higher
flexibility: no matter what the underlying true relationship is, the more flexible model will
follow the training points more closely and so reduce the training RSS.
(An example of this behaviour is shown in Figure 2.9 of Chapter 2.)

(d) Answer (c) using test rather than training RSS.

ANSWER:
There is not enough information to tell which test RSS would be lower, because the
problem states only that we do not know "how far it is from linear". If the true relationship
is closer to linear than to cubic, the linear regression's test RSS could be lower than the
cubic regression's; if it is closer to cubic, the cubic regression's test RSS could be lower.
This is the bias-variance trade-off: it is not clear which level of flexibility will fit unseen
data better.
5. Consider the fitted values that result from performing linear regression without
an intercept. In this setting, the ith fitted value takes the form
ŷ_i = x_i β̂

where

β̂ = ( Σ_{i=1}^n x_i y_i ) / ( Σ_{i'=1}^n x_{i'}² )

Show that we can write

ŷ_i = Σ_{i'=1}^n a_{i'} y_{i'}

What is a_{i'}?

Note: We interpret this result by saying that the fitted values from linear
regression are linear combinations of the response values.

ANSWER:
Substituting β̂ = ( Σ_{i'=1}^n x_{i'} y_{i'} ) / ( Σ_{j=1}^n x_j² ) into ŷ_i = x_i β̂ gives

ŷ_i = x_i · ( Σ_{i'=1}^n x_{i'} y_{i'} ) / ( Σ_{j=1}^n x_j² )
    = Σ_{i'=1}^n ( x_i x_{i'} / Σ_{j=1}^n x_j² ) y_{i'}

Comparing this with ŷ_i = Σ_{i'=1}^n a_{i'} y_{i'} shows that

a_{i'} = x_i x_{i'} / Σ_{j=1}^n x_j²

so each fitted value is a linear combination of the response values.
6. Using the following equations, argue that in the case of simple linear regression,
the least squares line always passes through the point (x̄, ȳ).

β̂1 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²   and   β̂0 = ȳ − β̂1 x̄

ANSWER:
The fitted least squares line is

ŷ = β̂0 + β̂1 x

Evaluating it at x = x̄ and substituting β̂0 = ȳ − β̂1 x̄ from the given equations:

ŷ(x̄) = β̂0 + β̂1 x̄ = (ȳ − β̂1 x̄) + β̂1 x̄ = ȳ

So the point (x̄, ȳ) always satisfies the fitted equation, i.e. the least squares line always
passes through (x̄, ȳ).
UNIT-III
(CHAPTER-4 & CHAPTER-5)

CHAPTER 4

1. Using a little bit of algebra, prove that the given equation (1) is equivalent to
equation (2).
OR
Prove that the logistic function representation and logit representation for the
logistic regression model are equivalent.
e0 1X
p( X )  (1)
1  e0  1X
p( X )
 e0 1X (2)
1  p( X )

ANSWER:
e0 1X
p( X ) 
1  e0 1X

e 0  1 X e 0  1 X e 0  1X
p( X ) 0  1 X
1  e 0  1X 1  e 0  1 X  e 0  1 X
So,  1  e0  1 X  
1  p( X ) e 1  e 0  1 X e 0  1 X 1
1 
1 e 0  1 X
1 e 0  1 X
1 e 0  1 X
1  e 0  1 X

p( X )
Therefore,  e0 1X
1  p( X )

2. It was stated in the text that classifying an observation to the class for which
p_k(x) (as given in equation (1)) is largest is equivalent to classifying an observation
to the class for which δ_k(x) (as given in equation (2)) is largest. Prove that this is
the case. In other words, under the assumption that the observations in the kth
class are drawn from a N(μ_k, σ²) distribution, the Bayes classifier assigns an
observation to the class for which the discriminant function is maximized.

p_k(x) = π_k (1/(√(2π) σ)) exp( −(1/(2σ²)) (x − μ_k)² ) / Σ_{l=1}^K π_l (1/(√(2π) σ)) exp( −(1/(2σ²)) (x − μ_l)² )    (1)

δ_k(x) = x · μ_k/σ² − μ_k²/(2σ²) + log(π_k)    (2)
ANSWER:

Assuming the class densities f_k(x) are normal with a shared variance σ², the probability
that an observation x belongs to class k is

p_k(x) = π_k (1/(√(2π) σ)) exp( −(x − μ_k)²/(2σ²) ) / Σ_l π_l (1/(√(2π) σ)) exp( −(x − μ_l)²/(2σ²) )

while the discriminant function is

δ_k(x) = x μ_k/σ² − μ_k²/(2σ²) + log(π_k)

Claim: maximizing p_k(x) over k is equivalent to maximizing δ_k(x) over k.

Proof. Let x remain fixed and observe that we are maximizing over the class index k.
Suppose that δ_k(x) ≥ δ_i(x). We will show that p_k(x) ≥ p_i(x). From our assumption we have

x μ_k/σ² − μ_k²/(2σ²) + log(π_k) ≥ x μ_i/σ² − μ_i²/(2σ²) + log(π_i)

Exponentiation is a monotonically increasing function, so the following inequality holds:

π_k exp( x μ_k/σ² − μ_k²/(2σ²) ) ≥ π_i exp( x μ_i/σ² − μ_i²/(2σ²) )

Multiply this inequality by the positive constant

c = (1/(√(2π) σ)) exp( −x²/(2σ²) ) / Σ_l π_l (1/(√(2π) σ)) exp( −(x − μ_l)²/(2σ²) )

and note that exp( −x²/(2σ²) ) exp( x μ_k/σ² − μ_k²/(2σ²) ) = exp( −(x − μ_k)²/(2σ²) ). The
inequality then becomes

π_k (1/(√(2π) σ)) exp( −(x − μ_k)²/(2σ²) ) / Σ_l π_l (1/(√(2π) σ)) exp( −(x − μ_l)²/(2σ²) )
≥ π_i (1/(√(2π) σ)) exp( −(x − μ_i)²/(2σ²) ) / Σ_l π_l (1/(√(2π) σ)) exp( −(x − μ_l)²/(2σ²) )

or equivalently, p_k(x) ≥ p_i(x). Reversing these steps also holds, so we have that
maximizing δ_k is equivalent to maximizing p_k.
3. This problem relates to the QDA model, in which the observations within each
class are drawn from a normal distribution with a class-specific mean vector and a
class-specific covariance matrix. We consider the simple case where p = 1; i.e.
there is only one feature.
Suppose that we have K classes, and that if an observation belongs to the kth class
then X comes from a one-dimensional normal distribution, X ~ N(μ_k, σ_k²). Recall
that the density function for the one-dimensional normal distribution is as given
in equation (1). Prove that in this case, the Bayes classifier is not linear. Argue
that it is in fact quadratic.

f_k(x) = (1/(√(2π) σ_k)) exp( −(1/(2σ_k²)) (x − μ_k)² )    (1)

ANSWER:
By Bayes' theorem,

p_k(x) = π_k (1/(√(2π) σ_k)) exp( −(x − μ_k)²/(2σ_k²) ) / Σ_l π_l (1/(√(2π) σ_l)) exp( −(x − μ_l)²/(2σ_l²) )

Taking logs,

log(p_k(x)) = log(π_k) + log( 1/(√(2π) σ_k) ) − (x − μ_k)²/(2σ_k²)
              − log( Σ_l π_l (1/(√(2π) σ_l)) exp( −(x − μ_l)²/(2σ_l²) ) )

The last term is the same for every class, so the Bayes classifier maximizes the
discriminant

δ_k(x) = log(π_k) + log( 1/(√(2π) σ_k) ) − (x − μ_k)²/(2σ_k²)

Because each class has its own variance σ_k², the quadratic term x²/(2σ_k²) does not
cancel when comparing two classes. As you can see, δ_k(x) is a quadratic function of x,
so the Bayes decision boundary is quadratic rather than linear.

5. We now examine the differences between LDA and QDA.

(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform
better on the training set? On the test set?

ANSWER:
If the Bayes decision boundary is linear, we expect QDA to perform better on the training
set because its higher flexibility will yield a closer fit. On the test set, we expect LDA to
perform better than QDA because QDA could overfit the linearity of the Bayes decision
boundary.

(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to


perform better on the training set? On the test set?

ANSWER:
If the Bayes decision boundary is non-linear, we expect QDA to perform better both on
the training and test sets.

(c) In general, as the sample size n increases, do we expect the test prediction
accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

ANSWER:
We expect the test prediction accuracy of QDA relative to LDA to improve, in general, as
the sample size n increases, because a more flexible method can achieve a better fit once
the larger sample size offsets its higher variance.

(d) True or False: Even if the Bayes decision boundary for a given problem is linear,
we will probably achieve a superior test error rate using QDA rather than LDA
because QDA is flexible enough to model a linear decision boundary. Justify your
answer.

ANSWER:
False. With a small number of sample points, the extra variance from using a more
flexible method such as QDA would lead to overfitting, yielding a higher test error rate
than LDA.

6. Suppose we collect data for a group of students in a statistics class with


variables X1 = hours studied, X2 = undergrad GPA, and Y = receive an A. We fit a logistic
regression and produce estimated coefficients β̂0 = −6, β̂1 = 0.05, β̂2 = 1.
(a) Estimate the probability that a student who studies for 40 hours and has an
undergrad GPA of 3.5 gets an A in the class.
(b) How many hours would the student in part (a) need to study to have a 50%
chance of getting an A in the class?

ANSWER:

p(X) = exp(β0 + β1 X1 + β2 X2) / (1 + exp(β0 + β1 X1 + β2 X2)), where X1 = hours studied,
X2 = undergrad GPA, and β0 = −6, β1 = 0.05, β2 = 1.

(a)
X = (40 hours, 3.5 GPA)

p(X) = exp(−6 + 0.05(40) + 3.5) / (1 + exp(−6 + 0.05(40) + 3.5)) = exp(−0.5) / (1 + exp(−0.5)) ≈ 0.3775

So the student has roughly a 37.75% chance of getting an A.

(b)
X = (X1 hours, 3.5 GPA); we need p(X) = 0.50:

0.50 = exp(−6 + 0.05 X1 + 3.5) / (1 + exp(−6 + 0.05 X1 + 3.5))

0.50 (1 + exp(−2.5 + 0.05 X1)) = exp(−2.5 + 0.05 X1)

0.50 = 0.50 exp(−2.5 + 0.05 X1)

exp(−2.5 + 0.05 X1) = 1

Taking the log of both sides:

−2.5 + 0.05 X1 = 0

X1 = 2.5 / 0.05 = 50 hours
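Both calculations can be verified with a short R sketch, using the coefficients given in the question:

b0 <- -6; b1 <- 0.05; b2 <- 1
p <- function(hours, gpa) {              # logistic function
  eta <- b0 + b1 * hours + b2 * gpa
  exp(eta) / (1 + exp(eta))
}
p(40, 3.5)                               # (a) ~ 0.3775
(0 - b0 - b2 * 3.5) / b1                 # (b) hours for a 50% chance (log-odds = 0): 50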

7. Suppose that we wish to predict whether a given stock will issue a dividend this
year ("Yes" or "No") based on X, last year's percent profit. We examine a large
number of companies and discover that the mean value of X for companies that
issued a dividend was X̄ = 10, while the mean for those that didn't was X̄ = 0. In
addition, the variance of X for these two sets of companies was σ̂² = 36. Finally,
80% of companies issued dividends. Assuming that X follows a normal
distribution, predict the probability that a company will issue a dividend this year
given that its percentage profit was X = 4 last year.

ANSWER:

Using Bayes' theorem with normal densities and a common variance σ² = 36,

p_k(x) = π_k (1/(√(2π) σ)) exp( −(x − μ_k)²/(2σ²) ) / Σ_l π_l (1/(√(2π) σ)) exp( −(x − μ_l)²/(2σ²) )

The (1/(√(2π) σ)) factors cancel, so with π_yes = 0.80, π_no = 0.20, μ_yes = 10, μ_no = 0:

p_yes(x) = 0.80 exp( −(x − 10)²/(2·36) ) / [ 0.80 exp( −(x − 10)²/(2·36) ) + 0.20 exp( −x²/(2·36) ) ]

At x = 4:

p_yes(4) = 0.80 exp( −(4 − 10)²/72 ) / [ 0.80 exp( −(4 − 10)²/72 ) + 0.20 exp( −4²/72 ) ] ≈ 0.752

So the probability that the company issues a dividend this year is about 75.2%.
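A quick numerical check of this Bayes' theorem calculation in R (the 1/(√(2π)σ) factors are included via dnorm but cancel in the ratio):

num <- 0.80 * dnorm(4, mean = 10, sd = 6)    # dividend class: prior 0.80, mean 10, sd 6
den <- num + 0.20 * dnorm(4, mean = 0, sd = 6)
num / den                                    # ~ 0.752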

8. Suppose that we take a data set, divide it into equally-sized training and test
sets, and then try out two different classification procedures. First we use logistic
regression and get an error rate of 20 % on the training data and 30 % on the test
data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate
(averaged over both test and training data sets) of 18 %. Based on these results,
which method should we prefer to use for classification of new observations? Why?

ANSWER:
Given:

Logistic regression: 20% training error rate, 30% test error rate.
KNN (K = 1): average error rate of 18% over the training and test sets.

For KNN with K=1, the training error rate is 0% because for any training observation, its
nearest neighbor will be the response itself. So, KNN has a test error rate of 36%. I would
choose logistic regression because of its lower test error rate of 30%.
9. This problem has to do with odds.
(a) On average, what fraction of people with an odds of 0.37 of defaulting on their
credit card payment will in fact default?
(b) Suppose that an individual has a 16 % chance of defaulting on her credit card
payment. What are the odds that she will default?

ANSWER:

(a)

p(X) / (1 − p(X)) = 0.37  ⟹  p(X) = 0.37 (1 − p(X))  ⟹  1.37 p(X) = 0.37  ⟹  p(X) = 0.37 / 1.37 ≈ 0.27

On average, about 27% of people with an odds of 0.37 will in fact default.

(b)

odds = p(X) / (1 − p(X)) = 0.16 / (1 − 0.16) = 0.16 / 0.84 ≈ 0.19
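The two conversions can be written as one-line helpers in R:

odds_to_prob <- function(odds) odds / (1 + odds)   # (a): odds 0.37 -> probability ~0.27
prob_to_odds <- function(p) p / (1 - p)            # (b): probability 0.16 -> odds ~0.19
odds_to_prob(0.37)
prob_to_odds(0.16)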

CHAPTER 5

1. Using basic statistical properties of the variance, as well as single-variable
calculus, derive

α = (σ_Y² − σ_XY) / (σ_X² + σ_Y² − 2 σ_XY)

In other words, prove that this value of α does indeed minimize Var(α X + (1 − α) Y).

ANSWER:

Using the following rules:

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
Var(cX) = c² Var(X)
Cov(cX, Y) = Cov(X, cY) = c Cov(X, Y)

Minimizing the variance of the two-asset financial portfolio:

f(α) = Var(α X + (1 − α) Y)
     = Var(α X) + Var((1 − α) Y) + 2 Cov(α X, (1 − α) Y)
     = α² Var(X) + (1 − α)² Var(Y) + 2 α (1 − α) Cov(X, Y)
     = α² σ_X² + (1 − α)² σ_Y² + 2 α (1 − α) σ_XY

Take the first derivative and set it to zero to find the critical point:

0 = f'(α) = 2 α σ_X² − 2 (1 − α) σ_Y² + 2 (1 − 2α) σ_XY

0 = α (σ_X² + σ_Y² − 2 σ_XY) − σ_Y² + σ_XY

α = (σ_Y² − σ_XY) / (σ_X² + σ_Y² − 2 σ_XY)

Since f''(α) = 2 (σ_X² + σ_Y² − 2 σ_XY) = 2 Var(X − Y) ≥ 0, this critical point is indeed a
minimum.

2. We will now derive the probability that a given observation is part of a bootstrap
sample. Suppose that we obtain a bootstrap sample from a set of n observations.

(a) What is the probability that the first bootstrap observation is not the jth
observation from the original sample? Justify your answer.

ANSWER:
1 − 1/n. Each of the n observations is equally likely to be drawn, so the probability that
the first bootstrap observation is the jth observation is 1/n, and the probability that it is
not the jth observation is 1 − 1/n.

(b) What is the probability that the second bootstrap observation is not the jth
observation from the original sample?

ANSWER:
1 − 1/n. Bootstrap sampling is done with replacement, so the second draw is again
uniform over all n observations and the same argument applies.

(c) Argue that the probability that the jth observation is not in the bootstrap sample
is (1 − 1/n)^n.

ANSWER:
In the bootstrap, we sample with replacement, so each of the n independent draws has
the same probability 1 − 1/n of not equaling the jth observation. Applying the product
rule across the n draws gives (1 − 1/n)^n.

(d) When n = 5, what is the probability that the jth observation is in the bootstrap
sample?

ANSWER:
Pr(in) = 1 − Pr(out) = 1 − (1 − 1/5)^5 = 1 − (4/5)^5 ≈ 0.672, i.e. about 67.2%.

(e) When n = 100, what is the probability that the jth observation is in the bootstrap
sample?

ANSWER:
Pr(in) = 1 − Pr(out) = 1 − (1 − 1/100)^100 = 1 − (99/100)^100 ≈ 0.634, i.e. about 63.4%.
(f) When n = 10000, what is the probability that the jth observation is in the
bootstrap sample?

ANSWER:
1 − (1 − 1/10000)^10000 ≈ 0.632, i.e. about 63.2%.

(g) Create a plot that displays, for each integer value of n from 1 to 100, 000, the
probability that the jth observation is in the bootstrap sample. Comment on what
you observe.

ANSWER:
# Probability that the jth observation appears in a bootstrap sample of size n
pr = function(n) return(1 - (1 - 1/n)^n)
x = 1:1e+05
plot(x, pr(x))   # levels off quickly as n grows

The plot quickly reaches an asymptote of about 63.2% (the limit as n grows is 1 − 1/e ≈ 0.632).

3. Answer the following with respect to k-fold cross-validation.

(a) Explain how k-fold cross-validation is implemented.

ANSWER:

k-fold cross-validation is implemented by taking the set of n observations and randomly
splitting it into k non-overlapping groups of roughly equal size. Each group in turn acts as
a validation set, with the remaining k − 1 groups used as the training set. The test error is
estimated by averaging the k resulting MSE estimates.
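A minimal sketch of the procedure in R; the linear model lm(y ~ x) and the simulated data are illustrative assumptions about the learning method being evaluated:

kfold_cv <- function(data, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(data)))   # random non-overlapping groups
  mse <- numeric(k)
  for (i in 1:k) {
    train <- data[folds != i, ]
    valid <- data[folds == i, ]
    fit <- lm(y ~ x, data = train)                     # fit on the other k - 1 folds
    mse[i] <- mean((valid$y - predict(fit, newdata = valid))^2)
  }
  mean(mse)                                            # CV estimate of the test error
}

set.seed(1)
d <- data.frame(x = rnorm(100)); d$y <- 1 + 2 * d$x + rnorm(100)
kfold_cv(d, k = 5)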
(b) What are the advantages and disadvantages of k-fold cross-validation relative
to:
(i.) The validation set approach?
(ii.) LOOCV?

ANSWER:
(i.) The validation set approach is conceptually simple and easily implemented as you
are simply partitioning the existing training data into two sets.
However, there are two drawbacks: (1) the estimate of the test error rate can be
highly variable depending on which observations are included in the training and
validation sets. (2) the validation set error rate may tend to overestimate the test
error rate for the model fit on the entire data set.

(ii.) LOOCV is a special case of k-fold cross-validation with k = n. Thus, LOOCV is the
most computationally intense method since the model must be fit n times. Also,
LOOCV has higher variance, but lower bias, than k-fold CV.

4. Suppose that we use some statistical learning method to make a prediction for
the response Y for a particular value of the predictor X . Carefully describe how
we might estimate the standard deviation of our prediction.

ANSWER:
If we use some statistical learning method to make a prediction for the response Y at a
particular value of the predictor X, we can estimate the standard deviation of that
prediction with the bootstrap approach. The bootstrap works by repeatedly sampling
observations (with replacement) from the original data set B times, for some large value
of B, each time refitting the model and recording its prediction at the given value of X.
The sample standard deviation of these B predictions is our estimate of the standard
deviation of the prediction.
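A minimal bootstrap sketch of this idea in R; the linear model, the prediction point x0 = 0.5, and B = 1000 are illustrative assumptions:

set.seed(1)
d <- data.frame(x = rnorm(100)); d$y <- 1 + 2 * d$x + rnorm(100)
x0 <- 0.5
B <- 1000

preds <- replicate(B, {
  idx <- sample(nrow(d), replace = TRUE)            # resample rows with replacement
  fit <- lm(y ~ x, data = d[idx, ])                 # refit the model on the bootstrap sample
  predict(fit, newdata = data.frame(x = x0))        # prediction at the chosen X
})
sd(preds)   # bootstrap estimate of the standard deviation of the prediction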
UNIT-IV (CHAPTER 6)

1. We perform best subset, forward stepwise, and backward stepwise selection on a


single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, ..., p
predictors. Explain your answers:

(a) Which of the three models with k predictors has the smallest training RSS?

ANSWER:
Best subset selection has the smallest training RSS because it considers every possible
model with k predictors, whereas the other two methods are path dependent: the
k-predictor model they reach is constrained by which predictors they picked at earlier
steps.

(b) Which of the three models with k predictors has the smallest test RSS?

ANSWER:
Best subset selection may have the smallest test RSS because it considers more models
than the other methods. However, the other methods might, by chance, select a model
that happens to fit the test data better.

(c) True or False:


(i.) The predictors in the k-variable model identified by forward stepwise are a
subset of the predictors in the (k + 1)-variable model identified by forward
stepwise selection.

ANSWER: True

(ii.) The predictors in the k-variable model identified by backward stepwise are
a subset of the predictors in the (k + 1)-variable model identified by
backward stepwise selection.

ANSWER: True

(iii.) The predictors in the k-variable model identified by backward stepwise are
a subset of the predictors in the (k + 1)-variable model identified by forward
stepwise selection.

ANSWER: False

(iv.) The predictors in the k-variable model identified by forward stepwise are a
subset of the predictors in the (k + 1)-variable model identified by backward
stepwise selection.

ANSWER: False

(v.) The predictors in the k-variable model identified by best subset are a subset
of the predictors in the (k + 1)-variable model identified by best subset
selection.

ANSWER: False
2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your
answer.

(a) The lasso, relative to least squares, is:


(i.) More flexible and hence will give improved prediction accuracy when its
increase in bias is less than its decrease in variance.
(ii.) More flexible and hence will give improved prediction accuracy when its
increase in variance is less than its decrease in bias.
(iii.) Less flexible and hence will give improved prediction accuracy when its
increase in bias is less than its decrease in variance.
(iv.) Less flexible and hence will give improved prediction accuracy when its
increase in variance is less than its decrease in bias.

ANSWER:
iii. Less flexible and better predictions because of less variance, more bias

(b) Repeat (a) for ridge regression relative to least squares.

ANSWER:
iii. Less flexible and better predictions because of less variance, more bias

(c) Repeat (a) for non-linear methods relative to least squares.

ANSWER:
ii. More flexible, less bias, more variance

3. Suppose we estimate the regression coefficients in a linear regression model by


minimizing

Σ_{i=1}^n ( y_i − β0 − Σ_{j=1}^p β_j x_ij )²   subject to   Σ_{j=1}^p |β_j| ≤ s

for a particular value of s . For parts (a) through (e), indicate which of i. through v.
is correct. Justify your answer.

(a) As we increase s from 0 , the training RSS will:


(i.) Increase initially, and then eventually start decreasing in an inverted U
shape.
(ii.) Decrease initially, and then eventually start increasing in a U shape.
(iii.) Steadily increase.
(iv.) Steadily decrease.
(v.) Remain constant.

ANSWER:
(iv) Steadily decrease: as we increase s from 0, all the β's increase from 0 toward their
least squares estimates. The training RSS is largest when all β's are 0 and steadily
decreases to the ordinary least squares RSS.
(b) Repeat (a) for test RSS.

ANSWER:
(ii) Decrease initially, and then eventually start increasing in a U shape: when s = 0, all
the β's are 0, the model is extremely simple and has a high test RSS. As we increase s,
the β's take non-zero values and the model starts fitting the test data well, so the test
RSS decreases. Eventually, as the β's approach their full OLS values, the model starts
overfitting the training data and the test RSS increases.

(c) Repeat (a) for variance.

ANSWER:
(iii) Steadily increase: when s = 0, the model effectively predicts a constant and has
almost no variance. As we increase s, the model includes more β's and their values
increase; the estimates become increasingly dependent on the training data, so the
variance increases.

(d) Repeat (a) for (squared) bias.

ANSWER:
(iv) Steadily decrease: when s = 0, the model effectively predicts a constant, so the
prediction is far from the actual values and the bias is high. As s increases, more β's
become non-zero and the model fits the training data better, so the bias decreases.

(e) Repeat (a) for the irreducible error.

ANSWER:
(v) Remain constant: by definition, the irreducible error is independent of the model and
hence, irrespective of the choice of s, remains constant.

4. Suppose we estimate the regression coefficients in a linear regression model by


minimizing

Σ_{i=1}^n ( y_i − β0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p β_j²

for a particular value of λ. For parts (a) through (e), indicate which of i. through v.
is correct. Justify your answer.

(a) As we increase λ from 0, the training RSS will:


(i.) Increase initially, and then eventually start decreasing in an inverted U
shape.
(ii.) Decrease initially, and then eventually start increasing in a U shape.
(iii.) Steadily increase.
(iv.) Steadily decrease.
(v.) Remain constant.
ANSWER:
(iii) Steadily increase: As we increase  from 0 , all  's decrease from their least square
estimate values to 0 . Training error for full-blown-OLS  's is the minimum and it
steadily increases as  's are reduced to 0 .

(b) Repeat (a) for test RSS.

ANSWER:
(ii) Decrease initially, and then eventually start increasing in a U shape: when λ = 0, all
the β's have their least squares estimates; the model fits the training data very closely
and the test RSS is high. As we increase λ, the β's shrink toward zero and some of the
overfitting is removed, so the test RSS initially decreases. Eventually, as the β's approach
0, the model becomes too simple and the test RSS increases.

(c) Repeat (a) for variance.

ANSWER:
(iv) Steadily decrease: when λ = 0, the β's have their least squares values; the estimates
depend heavily on the training data, so the variance is high. As we increase λ, the β's
shrink and the model becomes simpler. In the limit as λ approaches infinity, all the β's
reduce to zero, the model predicts a constant, and there is no variance.

(d) Repeat (a) for (squared) bias.

ANSWER:
(iii) Steadily increase: when λ = 0, the β's have their least squares values and the bias is
smallest. As λ increases, the β's shrink toward zero, the model fits the training data less
accurately, and the bias increases. In the limit as λ approaches infinity, the model
predicts a constant and the bias is at its maximum.

(e) Repeat (a) for the irreducible error.

ANSWER:
(v) Remain constant: by definition, the irreducible error is independent of the model and
hence, irrespective of the choice of λ, remains constant.

5. It is well-known that ridge regression tends to give similar coefficient values to


correlated variables, whereas the lasso may give quite different coefficient values
to correlated variables. We will now explore this property in a very simple setting.

Suppose that n = 2, p = 2, x11 = x12, x21 = x22. Furthermore, suppose that y1 + y2 = 0
and x11 + x21 = 0 and x12 + x22 = 0, so that the estimate for the intercept in a least
squares, ridge regression, or lasso model is zero: β̂0 = 0.
(a) Write out the ridge regression optimization problem in this setting.

ANSWER:
A general form of the ridge regression optimization problem looks like

Minimize:  Σ_{i=1}^n ( y_i − β0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p β_j²

In this case, β̂0 = 0 and n = p = 2. So, the optimization problem becomes:

Minimize:  (y1 − β1 x11 − β2 x12)² + (y2 − β1 x21 − β2 x22)² + λ (β1² + β2²)

(b) Argue that in this setting, the ridge coefficient estimates satisfy β̂1 = β̂2.

ANSWER:
Now we are given that x11 = x12 and x21 = x22.

We take derivatives of the above expression with respect to both β1 and β2 and, setting
them equal to zero, find the two conditions

x11 ( y1 − (β̂1 + β̂2) x11 ) + x21 ( y2 − (β̂1 + β̂2) x21 ) = λ β̂1
x11 ( y1 − (β̂1 + β̂2) x11 ) + x21 ( y2 − (β̂1 + β̂2) x21 ) = λ β̂2

The left-hand sides are identical (they depend on β̂1 and β̂2 only through the sum
β̂1 + β̂2), so λ β̂1 = λ β̂2. The symmetry in these expressions forces β̂1 = β̂2.
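This can also be checked numerically in R. The data values below are made up to satisfy the stated conditions (x11 = x12, x21 = x22, x11 + x21 = 0, y1 + y2 = 0), and optim() minimizes the ridge objective from part (a) directly:

x1 <- c(1, -1)      # (x11, x21); note x12 = x11 and x22 = x21
y  <- c(2, -2)      # y1 + y2 = 0
lambda <- 0.7       # an arbitrary positive penalty

ridge_obj <- function(b) sum((y - b[1] * x1 - b[2] * x1)^2) + lambda * sum(b^2)
optim(c(0, 0), ridge_obj)$par    # the two coefficients come out (numerically) equal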

(c) Write out the lasso optimization problem in this setting.

ANSWER:

Like ridge regression, but with an L1 penalty:

Minimize:  (y1 − β1 x11 − β2 x12)² + (y2 − β1 x21 − β2 x22)² + λ (|β1| + |β2|)

(d) Argue that in this setting, the lasso coefficients β̂1 and β̂2 are not unique; in
other words, there are many possible solutions to the optimization problem in (c).
Describe these solutions.

ANSWER:

Here is a geometric interpretation of the solutions to the problem in (c). We use the
alternative (constraint) form of the lasso:

|β1| + |β2| ≤ s

When plotted, this constraint takes the familiar shape of a diamond centered at the
origin (0, 0).

Next consider the squared-error part of the objective,

(y1 − β1 x11 − β2 x12)² + (y2 − β1 x21 − β2 x22)²

We use the facts x11 = x12, x21 = x22, x11 + x21 = 0, x12 + x22 = 0 and y1 + y2 = 0 to
simplify it to

Minimize:  2 ( y1 − (β1 + β2) x11 )²

This optimization problem has a simple solution: β1 + β2 = y1 / x11. This is a line parallel
to the edge of the lasso diamond β1 + β2 = s.

Now the solutions to the original lasso optimization problem are the contours of
( y1 − (β1 + β2) x11 )² that touch the lasso diamond |β1| + |β2| = s. As β1 and β2 vary along
the line β1 + β2 = y1 / x11, these contours touch the lasso-diamond edge β1 + β2 = s at
different points.

As a result, the entire edge β1 + β2 = s is a potential solution to the lasso optimization
problem!

A similar argument can be made for the opposite lasso-diamond edge: β1 + β2 = −s.
Thus, the lasso problem does not have a unique solution.

The general form of the solution is given by two line segments:

β1 + β2 = s with β1 ≥ 0, β2 ≥ 0,   and   β1 + β2 = −s with β1 ≤ 0, β2 ≤ 0
