Machine Learning Assignment Solutions
4. Which of the following is an unsupervised learning task?
(a) Predicting if a new edible item is sweet or spicy based on the information of the ingredients, their quantities, and labels (sweet or spicy) for many other similar dishes.
(b) Predicting if a new image contains a cat or a dog based on the historical data of other images of cats and dogs, where you are supplied the information about which image contains a cat or a dog.
(c) Grouping related documents from an unannotated corpus.
(d) Grouping of hand-written digits from their images.
(e) Predicting the time (in days) a PhD student will take to complete his/her thesis to earn a
degree based on the historical data such as qualifications, department, institute, research
area, and time taken by other scholars to earn the degree.
(f) all of the above
Sol. (c), (d)
P(max(X, Y) > 3) = P(X > 3) + P(Y > 3) − P(X > 3 & Y > 3)
                 = 1/4 + 1/2 − 1/4 × 1/2
                 = 5/8
7. Let the trace and determinant of a matrix A = [[a, b], [c, d]] be 3 and 4 respectively. The eigenvalues of A are
(a) 1, 3
(b) (3 + ι√7)/2 and (3 − ι√7)/2, where ι = √−1
(c) (3 + ι√7)/4 and (3 − ι√7)/4, where ι = √−1
(d) 1/2, 3/2
(e) 3 + ι√7 and 3 − ι√7, where ι = √−1
(f) 2, 8
(g) None of the above
(h) Can be computed only if A is a symmetric matrix.
(i) Can not be computed as the entries of the matrix A are not given.
Sol. (b)
Use the facts that the trace and determinant of a matrix are equal to the sum and product of its eigenvalues respectively. Using this,
λ1 + λ2 = 3, λ1 λ2 = 4,
where λ1 and λ2 denote the eigenvalues. Solve the above two equations in two variables.
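As a quick sanity check, the eigenvalues are the roots of the characteristic polynomial λ² − (trace)λ + det = 0; a minimal NumPy sketch:

import numpy as np

trace, det = 3, 4
eigvals = np.roots([1, -trace, det])   # roots of lambda^2 - 3*lambda + 4 = 0
print(eigvals)                         # [1.5+1.3229j, 1.5-1.3229j], i.e. (3 ± ι√7)/2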
8. What happens when your model complexity increases? (multiple options may be correct)
(a) Model Bias decreases
(b) Model Bias increases
(c) Variance of the model decreases
(d) Variance of the model increases
Sol. (a) and (d)
9. A new phone, E-Corp X1 has been announced and it is what you’ve been waiting for, all along.
You decide to read the reviews before buying it. From past experiences, you’ve figured out
that good reviews mean that the product is good 90% of the time and bad reviews mean that
it is bad 70% of the time. Upon glancing through the reviews section, you find out that the X1
has been reviewed 1269 times and only 172 of them were bad reviews. What is the probability
that, if you order the X1, it is a bad phone? (Round off the answer up to 3 decimal digits)
(a) 0.136
(b) 0.160
(c) 0.360
(d) 0.840
(e) 0.773
(f) 0.573
(g) 0.181
(h) None of the above
Sol. (g)
For the solution, let’s use the following abbreviations.
• BP - Bad Phone
• GP - Good Phone
• GR - Good Review
• BR - Bad Review
From the given data, Pr(BP|BR) = 0.7 and Pr(GP|GR) = 0.9. Using this, Pr(BP|GR) = 1 − Pr(GP|GR) = 0.1. From the review counts, Pr(BR) = 172/1269 and Pr(GR) = 1097/1269.
Hence,
Pr(BP) = Pr(BP|BR) Pr(BR) + Pr(BP|GR) Pr(GR) = 0.7 × (172/1269) + 0.1 × (1097/1269) ≈ 0.181.
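A minimal sketch of the same calculation in Python:

n_reviews, n_bad = 1269, 172
p_br = n_bad / n_reviews                 # P(BR)
p_gr = 1 - p_br                          # P(GR)
p_bp = 0.7 * p_br + 0.1 * p_gr           # P(BP) = P(BP|BR)P(BR) + P(BP|GR)P(GR)
print(round(p_bp, 3))                    # 0.181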
10. Which of the following are false about bias and variance of overfitted and underfitted models?
(multiple options may be correct)
(a) Underfitted models have high bias.
(b) Underfitted models have low bias.
(c) Overfitted models have low variance.
(d) Overfitted models have high variance.
Sol. (b), (c)
Assignment 2
Introduction to Machine Learning
Prof. B. Ravindran
1. Given a training dataset, the following visualization shows the fit of three different models
(in blue line). Assume that the test data and training data come from the same distribution.
What can you conclude from the following visualizations? Multiple options can be correct.
(a) The training error in the first model is higher when compared to the second and third models.
(b) The best model for this regression problem is the last (third) model, because it has the minimum training error.
(c) The second model is more robust than the first and third because it will perform better on unseen data.
(d) The third model is overfitting the data as compared to the first and second models.
(e) All models will perform the same because we have not seen the test data.
Sol. (a),(c),(d)
2. Suppose you have fitted a complex regression model on a dataset. Now, you are using Ridge regression with tuning parameter lambda to reduce its complexity. Choose the option below which describes the relationship of bias and variance with lambda.
Sol. (c)
3. Given a training data set of 10,000 instances, with each input instance having 17 dimensions
and each output instance having 2 dimensions, the dimensions of the design matrix used in
applying linear regression to this data is
(a) 10000 × 17
(b) 10002 × 17
(c) 10000 × 18
(d) 10000 × 19
Sol. (c)
4. Suppose we want to add a regularizer to the linear regression loss function, to control the magnitudes of the weights β. We have a choice between Ω1(β) = Σ_{i=1}^p |βi| and Ω2(β) = Σ_{i=1}^p βi². Which one is more likely to result in sparse weights?
(a) Ω1
(b) Ω2
(c) Both Ω1 and Ω2 will result in sparse weights
(d) Neither of Ω1 or Ω2 can result in sparse weights
Sol. (a)
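To see the effect in practice, here is a minimal sketch on synthetic data (the data and the alpha value below are illustrative assumptions, not part of the assignment): the L1 penalty (Lasso) drives many coefficients exactly to zero, while the L2 penalty (Ridge) only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(100)   # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print((lasso.coef_ == 0).sum())   # many coefficients exactly zero (sparse)
print((ridge.coef_ == 0).sum())   # typically none exactly zero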
5. Consider forward selection, backward selection and best subset selection with respect to the
same data set. Which of the following is true?
(a) Best subset selection can be computationally more expensive than forward selection
(b) Forward selection and backward selection always lead to the same result
(c) Best subset selection can be computationally less expensive than backward selection
(d) Best subset selection and forward selection are computationally equally expensive
(e) Both (b) and (d)
Sol. (a)
Explanation: Best subset selection has to explore all possible subsets, which takes exponential time, so it can be computationally more expensive than forward selection. It is not guaranteed that forward selection and backward selection lead to the same result. Forward selection and backward selection are computationally much cheaper than best subset selection.
6. In the lecture on Multivariate Regression, you learn about using orthogonalization iteratively to obtain regression coefficients. This method is generally referred to as Multiple Regression using Successive Orthogonalization.
In the formulation of the method, we observe that in iteration k, we regress the entire dataset
on z0 , z1 , . . . zk−1 . It seems like a waste of computation to recompute the coefficients for z0 a
total of p times, z1 a total of p − 1 times and so on. Can we re-use the coefficients computed
in iteration j for iteration j + 1 for zj−1 ?
(a) No. Doing so will result in the wrong γ matrix and hence the wrong βi's.
(b) Yes. Since zj−1 is orthogonal to zl for all l < j − 1, the multiple regression in each iteration is essentially a univariate regression on each of the previous residuals. Since the regression coefficients for the previous residuals don't change over iterations, we can re-use the coefficients for further iterations.
Sol. (b)
The answer is self-explanatory. Please refer to the section on Multiple Regression using Suc-
cessive Orthogonalization in Elements of Statistical Learning, 2nd edition for the algorithm.
7. (2 marks) Consider the following five training examples
x y
2 9.8978
3 12.7586
4 16.3192
5 19.3129
6 21.1351
We want to learn a function f (x) of the form f (x) = ax + b which is parameterised by (a, b).
Using squared error as the loss function, which of the following parameters would you use to
model this function to get a solution with the minimum loss.
(a) (4, 3)
(b) (1, 4)
(c) (4, 1)
(d) (3, 4)
Sol. (d)
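A minimal sketch that checks this by computing the squared-error loss for each candidate (a, b):

x = [2, 3, 4, 5, 6]
y = [9.8978, 12.7586, 16.3192, 19.3129, 21.1351]
for a, b in [(4, 3), (1, 4), (4, 1), (3, 4)]:
    sse = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    print((a, b), round(sse, 4))
# (3, 4) gives the smallest squared error (about 1.02)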
8. (2 marks) Here is a data set of words in two languages.
Word Language
piano English
cat English
kepto Vinglish
shaito Vinglish
Let us build a nearest neighbours classifier that will predict which language a word belongs
to. Say we represent each word using the following features.
For example, the representation of the word ‘waffle’ would be [6, 4, 0]. For a distance function,
use the Manhattan distance.
d(a, b) = Σ_{i=1}^n |ai − bi|, where a, b ∈ Rⁿ
Take the input word ‘keto’. With k = 1, the predicted language for the word is?
(a) English
(b) Vinglish
(c) None of the above
Sol. (a)
Since its nearest neighbour is ‘piano’. The representations for the 4 words are [5, 2, 1], [3, 2,
0], [5, 3, 1] and [6, 3, 1] respectively, and the representation for the input word is [4, 2, 1]. The
distances are 1,2,2 and 3 respectively.
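A minimal sketch of the 1-NN computation using the representations above:

words = {'piano': [5, 2, 1], 'cat': [3, 2, 0], 'kepto': [5, 3, 1], 'shaito': [6, 3, 1]}
labels = {'piano': 'English', 'cat': 'English', 'kepto': 'Vinglish', 'shaito': 'Vinglish'}
query = [4, 2, 1]                                    # representation of 'keto'
dist = {w: sum(abs(a - b) for a, b in zip(v, query)) for w, v in words.items()}
nearest = min(dist, key=dist.get)
print(dist)                                          # {'piano': 1, 'cat': 2, 'kepto': 2, 'shaito': 3}
print(labels[nearest])                               # English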
Assignment 3
Introduction to Machine Learning
Prof. B. Ravindran
1. Consider the case where two classes follow Gaussian distribution which are centered at (4, 7)
and (−4, −1) and have identity covariance matrix. Which of the following is the separating
decision boundary using LDA assuming the priors to be equal?
(a) y − x = 3
(b) x + y = 3
(c) x + y = 6
(d) both (b) and (c)
(e) None of the above
(f) Can not be found from the given information
Sol. (b)
As the class distributions are Gaussian and have identity covariance matrices (which are equal), the separating boundary will be linear. The decision boundary will be orthogonal to the line joining the centers and will pass through the midpoint of the centers.
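For equal priors and a shared identity covariance, the boundary is the set of points equidistant from the two means: (µ1 − µ2)ᵀx = (‖µ1‖² − ‖µ2‖²)/2. A minimal sketch:

import numpy as np

mu1, mu2 = np.array([4.0, 7.0]), np.array([-4.0, -1.0])
w = mu1 - mu2                            # normal to the boundary: [8, 8]
c = (mu1 @ mu1 - mu2 @ mu2) / 2          # 24
print(w, c)                              # 8x + 8y = 24, i.e. x + y = 3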
2. Consider the following data with two classes. The colors indicate the different classes.
Which of the following models (with NO additional complexity) can achieve zero training error
for classification?
(a) LDA
(b) PCA
(c) Logistic regression
(d) None of these
Sol. (d)
All the models in options (a), (b) and (c) are linear. The training data is not linearly separable.
3. We discussed the use of MLE for the estimation of the parameters of the logistic regression model. Which of the following assumptions did we use to derive the likelihood function?
(a) independence among the class labels
(b) independence among each training sample
(c) independence among the parameters of the model
(d) None of these
Sol. (b)
4. Which of the following statements is true about LDA regarding outliers?
(a) LDA is not sensitive to outliers
(b) LDA is sensitive to outliers.
(c) Depends upon the data
(d) None of the above
Sol. (b) Since we use all of the data to calculate the mean and variance, outliers may have
an impact on the performance of LDA as they may adversely skew the estimated mean and
variances.
5. Consider the following distribution of training data:
Sol. (c)
The direction of maximum variance is along the direction X1 = X2 . The projected points in
this direction can be easily classified into two classes correctly.
LDA can find a linearly separable decision boundary as the data is linearly separable.
6. Suppose that we have two variables, X and Y (the dependent variable). We wish to find
the relation between them. An expert tells us that relation between the two has the form
Y = m log(X) + c. Available to us are samples of the variables X and Y. Is it possible to apply
linear regression to this data to estimate the values of m and c?
(a) no
(b) yes
(c) insufficient information
Sol. (b)
Instead of using the independent variable X directly, we can transform it by taking the logarithm of each value. Thus, on the X-axis, we can plot values of log(X) and on the Y-axis, we can plot values of Y. The relation between the dependent and the transformed independent variable is linear, and the values of the slope and intercept can be estimated using linear regression.
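A minimal sketch on synthetic data (the true m, c and the noise level are made-up values for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.uniform(1, 100, size=200)
y = 2.5 * np.log(x) + 1.0 + 0.05 * rng.randn(200)      # assumed m = 2.5, c = 1.0
reg = LinearRegression().fit(np.log(x).reshape(-1, 1), y)
print(reg.coef_[0], reg.intercept_)                    # close to 2.5 and 1.0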
7. In a binary classification scenario where x is the independent variable and y is the dependent
variable, logistic regression assumes that the conditional distribution y|x follows a
Sol. (b)
The dependent variable is binary, so a Bernoulli distribution is assumed.
8. Consider the following data:
Assuming that you apply LDA to this data, what is the estimated covariance matrix?
(a) [[1.875, 0.3125], [0.3125, 0.9375]]
(b) [[2.5, 0.4167], [0.4167, 1.25]]
(c) [[1.875, 0.3125], [0.3125, 1.2188]]
(d) [[2.5, 0.4167], [0.4167, 1.625]]
(e) [[8.25, 5.2917], [5.2917, 5.6250]]
(f) [[6.1875, 3.9688], [3.9688, 4.2188]]
(g) [[3.25, 1.1667], [1.1667, 2.375]]
(h) [[2.4375, 0.875], [0.875, 1.7812]]
(i) None of these
Sol. (e)
If you do the above calculation correctly, the answer would be [[8.2500, 5.2917], [5.2917, 5.6250]].
9. Given the following 3D input data, identify the principal component.
(Steps: center the data, calculate the sample covariance matrix, calculate the eigenvectors and
eigenvalues, identify the principal component)
(a) [−0.1022, 0.0018, 0.9948]
(b) [0.5742, −0.8164, 0.0605]
(c) [0.5742, 0.8164, 0.0605]
(d) [−0.5742, 0.8164, 0.0605]
(e) [0.8123, 0.5774, 0.0824]
(f) [0.8098, 0.5762, 0.1104]
(g) [0.0767, −0.0826, 0.9936]
(h) None of the above
Sol. (f)
Refer to the solution of practice assignment 3
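A minimal sketch of the listed steps with NumPy (the assignment's 3D data points are not reproduced here; X is assumed to be an (n, 3) array holding them):

import numpy as np

def principal_components(X):
    Xc = X - X.mean(axis=0)                   # center the data
    cov = np.cov(Xc, rowvar=False)            # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues/eigenvectors (ascending)
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    return eigvals[order], eigvecs[:, order]  # columns are the principal components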
10. For the data given in the previous question, find the transformed input along the first two
principal components.
(a)
−0.6025   0.2079
0.4420   −0.0339
1.2552   −0.1164
−1.3030  −0.2639
−0.5859   0.2521
1.0403    0.0869
1.2717   −0.0723
−1.5178  −0.0605
(b)
6.2787   −0.6025
4.3164    0.4420
3.7402    1.2552
1.8870   −1.3030
−2.3814  −0.5859
−3.5339   1.0403
−4.9199   1.2717
−5.3871  −1.5178
(c)
0.2541    0.8344
1.2987    0.5926
2.1118    0.5100
−0.4463   0.3626
0.2707    0.8785
1.8969    0.7134
2.1284    0.5542
−0.6612   0.5660
(d)
−1.4964   0.2541
−3.4586   1.2987
−4.0349   2.1118
−5.8881  −0.4463
−10.1565  0.2707
−11.3090  1.8969
−12.6950  2.1284
−13.1622 −0.6612
(e) None of the above
Sol. (b)
Refer to the solution of practice assignment 3
Assignment 4
Introduction to Machine Learning
Prof. B. Ravindran
1. Suppose we use a linear kernel SVM to build a classifier for a 2-class problem where the training
data points are linearly separable. In general, will the classifier trained in this manner produce
the same decision boundary as the classifier trained using the perceptron training algorithm
on the same training data?
(a) No
(b) Yes
Sol. (a)
2. Consider the data set given below. Claim: PLA (perceptron learning algorithm) can be used
to learn a classifier that achieves zero misclassification error on the training data. This claim
is:
(a) True
(b) False
(c) Depends on the initial weights
(d) True, only if we normalize the feature vectors before applying PLA.
Sol. (b)
The given data specifies the well-known XOR problem which cannot be separated by a linear
boundary.
3. For a support vector machine model, let xi be an input instance with label yi. If yi(β̂0 + xiᵀβ̂) > 1, where β̂0 and β̂ are the estimated parameters of the model, then
Sol. (a)
4. Suppose we use a linear kernel SVM to build a classifier for a 2-class problem where the
training data points are linearly separable. In general, will the classifier trained in this manner
be always the same as the classifier trained using the perceptron training algorithm on the
same training data?
(a) Yes
(b) No
Sol. (b) The hyperplane returned by the SVM approach will have a maximal margin, whereas
no such guarantee can be given for the hyperplane identified using the perceptron training
algorithm.
For Q5,6: Kindly download the synthetic dataset from the following link
https://bit.ly/2Y4SNTF
The dataset contains 1000 points and each input point contains 3 features.
5. (2 marks) Train a linear regression model (without regularization) on the above dataset. Re-
port the coefficients of the best fit model. Report the coefficients in the following format:
β0 , β1 , β2 , β3 .
(a) -1.2, 2.1, 2.2, 1
(b) 1, 1.2, 2.1, 2.2
(c) -1, 1.2, 2.1, 2.2
(d) 1, -1.2, 2.1, 2.2
(e) 1, 1.2, -2.1, -2.2
Sol. (d)
Follow the steps given on the sklearn page.
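A minimal sketch with scikit-learn (the file name and column layout below are assumptions; adjust them to the downloaded file):

import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv('dataset.csv')            # hypothetical file name
X, y = data.iloc[:, :3].values, data.iloc[:, 3].values
reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_)             # beta_0, then beta_1, beta_2, beta_3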
6. Train an l2 regularized linear regression model on the above dataset. Vary the regularization
parameter from 1 to 10. As you increase the regularization parameter, absolute value of the
coefficients (excluding the intercept) of the model:
(a) increase
(b) first increase then decrease
(c) decrease
(d) first decrease then increase
Sol. (c)
Follow the steps given on the sklearn page.
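A minimal sketch (same assumed file layout as above), varying the regularization parameter alpha from 1 to 10:

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

data = pd.read_csv('dataset.csv')            # hypothetical file name
X, y = data.iloc[:, :3].values, data.iloc[:, 3].values
for alpha in range(1, 11):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, np.abs(coef))               # absolute coefficients shrink as alpha grows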
For Q7,8: Kindly download the modified version of Iris dataset from this link.
Available at: (https://goo.gl/vchhsd)
The dataset contains 150 points and each input point contains 4 features and belongs to one
among three classes. Use the first 100 points as the training data and the remaining 50 as test data. In the following questions, report accuracy on the test dataset. You may round off the accuracy value to two decimal places. (Note: Do not change the order of the data points.)
7. (2 marks) Train an l2 regularized logistic regression classifier on the modified iris dataset. We
recommend using sklearn. Use only the first two features for your model. We encourage you
to explore the impact of varying different hyperparameters of the model. Kindly note that the
C parameter mentioned below is the inverse of the regularization parameter λ. As part of the
assignment train a model with the following hyperparameters:
Model: logistic regression with one-vs-rest classifier, C = 1e4
For the above set of hyperparameters, report the best classification accuracy
(a) 0.88
(b) 0.86
(c) 0.98
(d) 0.68
Sol. (b)
The following code will give the desired result (assuming X and Y hold the features and labels from the downloaded dataset):
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l2', C=1e4, multi_class='ovr').fit(X[0:100, 0:2], Y[0:100])
print(clf.score(X[100:, 0:2], Y[100:]))
8. Train an SVM classifier on the modified iris dataset. We recommend using sklearn. Use only
the first two features for your model. We encourage you to explore the impact of varying
different hyperparameters of the model. Specifically try different kernels and the associated
hyperparameters. As part of the assignment train models with the following set of hyperpa-
rameters
RBF-kernel, gamma = 0.5, one-vs-rest classifier, no-feature-normalization.
Try C = 0.01, 1, 10. For the above set of hyperparameters, report the best classification accu-
racy along with total number of support vectors on the test data.
(a) 0.92, 69
(b) 0.88, 40
(c) 0.88, 69
(d) 0.98, 41
Sol. (c)
The following code will give the desired result (assuming X and Y hold the features and labels; vary C over 0.01, 1, 10 and report the best):
from sklearn import svm
clf = svm.SVC(C=1.0, kernel='rbf', decision_function_shape='ovr', gamma=0.5).fit(X[0:100, 0:2], Y[0:100])
print(clf.score(X[100:, 0:2], Y[100:]))
print(clf.n_support_)
Assignment 5
Introduction to Machine Learning
Prof. B. Ravindran
1. You are given the N samples of input (x) and output (y) as shown in the figure below. What
will be the most appropriate model y = f (x)?
Sol. (c)
Sol. (b)
3. Consider the following function.
f(x) = e^x / (1 + e^x)
The derivative f′(x) will be:
(a) f(x) ln f(x) + (1 − f(x)) ln(1 − f(x))
(b) f(x) ln(1 − f(x))
(c) f(x)(1 − f(x))
(d) f(x)(1 + f(x))
Sol. (c)
4. Using the notations used in class, evaluate the value of the neural network with a 3-3-1 architecture (2-dimensional input with 1 node for the bias term in both the layers). The parameters are as follows:
α = [[1, 0.2, 0.4], [1, 0.8, 0.6]]
β = [0.8, 0.4, 0.5]
Using sigmoid function as the activation functions at both the layers, the output of the network
for an input of (0.8, 0.7) will be (up to 4 decimal places)
(a) 0.6710
(b) 0.9617
(c) 0.6948
(d) 0.7052
(e) 0.8273
(f) 0.2023
(g) 0.7977
(h) 0.2446
(i) 0.7991
(j) None of these
Solution (e)
This is a straightforward computation task. First pad x with 1 and make it the X vector,
X = [1, 0.8, 0.7]ᵀ
The output of the first layer can be written as
o1 = αX
Next apply the sigmoid function and compute
a1(i) = 1 / (1 + e^(−o1(i)))
Then pad the a1 vector also with 1 for the bias, and compute the output of the second layer:
o2 = β a1
a2 = 1 / (1 + e^(−o2)) = 0.8273
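A minimal NumPy sketch of the same forward pass:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
alpha = np.array([[1.0, 0.2, 0.4],
                  [1.0, 0.8, 0.6]])
beta = np.array([0.8, 0.4, 0.5])

X = np.array([1.0, 0.8, 0.7])                # input padded with 1 for the bias
a1 = sigmoid(alpha @ X)                      # hidden-layer activations
a1 = np.concatenate(([1.0], a1))             # pad with 1 for the bias
a2 = sigmoid(beta @ a1)
print(round(float(a2), 4))                   # 0.8273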
5. Which of the following statements are true:
(a) The chances of overfitting decreases with increasing the number of hidden nodes and
increasing the number of hidden layers.
(b) A neural network with one hidden layer can represent any Boolean function given sufficient
number of hidden units and appropriate activation functions.
(c) Two hidden layer neural networks can represent any continuous functions (within a tol-
erance) as long as the number of hidden units is sufficient and appropriate activation
functions used.
Sol. (b), (c)
By increasing the number of hidden nodes or hidden layers, we are increasing the number of parameters. A larger set of parameters is more capable of memorizing the training data, and hence may result in overfitting.
6. We have a function which takes a two-dimensional input x = (x1, x2) and has two parameters w = (w1, w2), given by f(x, w) = σ(σ(x1 w1)w2 + x2), where σ(x) = 1/(1 + e^(−x)). We use backpropagation to estimate the right parameter values. We start by setting both the parameters to 2. Assume that we are given a training point x2 = 1, x1 = 0, y = 3. Given this information, answer the next two questions. What is the value of ∂f/∂w2?
(a) 0.150
(b) -0.25
(c) 0.125
(d) 0.0525
(e) 0.098
(f) 0.0746
(g) 0.1604
(h) None of these
Solution: (d)
Write σ(x1 w1)w2 + x2 as o2 and x1 w1 as o1. Then
∂f/∂w2 = (∂f/∂o2)(∂o2/∂w2) = σ(o2)(1 − σ(o2)) × σ(o1)
With w1 = w2 = 2, x1 = 0 and x2 = 1: o1 = 0, σ(o1) = 0.5, o2 = 2, σ(o2) ≈ 0.8808, so ∂f/∂w2 ≈ 0.8808 × 0.1192 × 0.5 ≈ 0.0525.
7. If the learning rate is 0.5, what will be the value of w2 after one update using backpropagation
algorithm?
(a) 0.4197
(b) -0.4197
(c) 0.6881
(d) -0.6881
(e) 1.3119
(f) -1.3119
(g) 2.1113
(h) -2.1113
(i) 1.1113
(j) -1.1113
(k) 0.5625
(l) -0.5625
(m) None of these
Solution: (g)
The update equation would be
w2 ← w2 − λ ∂L/∂w2
where L is the loss function, here L = (y − f)². Hence
w2 ← w2 − λ × 2(y − f) × (−1) × ∂f/∂w2 = 2 − 0.5 × 2 × (3 − 0.8808) × (−1) × 0.0525 ≈ 2.1113
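A minimal sketch verifying both the gradient (Q6) and the update (Q7) numerically:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w1, w2 = 2.0, 2.0
x1, x2, y = 0.0, 1.0, 3.0
lr = 0.5

o1 = x1 * w1                                 # 0
o2 = sigmoid(o1) * w2 + x2                   # 2
f = sigmoid(o2)                              # 0.8808
df_dw2 = f * (1 - f) * sigmoid(o1)           # 0.0525  (Q6)
dL_dw2 = 2 * (y - f) * (-1) * df_dw2         # gradient of L = (y - f)^2
w2_new = w2 - lr * dL_dw2                    # 2.1113  (Q7)
print(round(float(df_dw2), 4), round(float(w2_new), 4))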
Sol. (b)
10. Which of the following are false?
(a) The number of weights to be trained in a neural network should be quite high (10-15 times the number of samples) for effective training of the neural network.
(b) XOR function can not be modelled by a single perceptron.
(c) In backpropagation algorithm, we should start with a relatively small learning parameter
(η) and slowly increase it during the learning process.
(d) None of these
Assignment 6
Introduction to Machine Learning
Prof. B. Ravindran
1. Decision trees can be used for ______.
(a) classification
(b) regression
(c) Both
(d) None of these
Sol. (c)
2. In building a decision tree model, to control the size of the tree, we need to control the number
of regions. One approach to do this would be to split tree nodes only if the resultant decrease
in the sum of squares error exceeds some threshold. For the described method, which among
the following are true?
(a) it would, in general, help restrict the size of the trees
(b) it has the potential to affect the performance of the resultant regression/classification
model
(c) it is computationally infeasible
(a) If we trained only a single step of the decision tree (only the root), the system is equivalent
to a perceptron.
(b) If we trained only a single step of the decision tree (only the root), the system is equivalent
to an SVM.
(c) The resulting system cannot solve the XOR problem (refer to the ’Perceptron’ lectures)
(d) The resulting system can theoretically reach 100% accuracy on the training data set.
Sol. (a),(d). Since a single step decision tree, in the general case, has a single hyperplane
separating the classes, it behaves like a Perceptron.
An SVM has an additional term to find the optimal separating hyperplane. Since this term
will not be present in the loss function, the single step variant will not be an SVM.
Given multiple levels, this augmented decision tree can definitely solve the XOR problem
by first splitting along one axis and then splitting perpendicularly in the two halves.
Since the new augmented tree is stronger than the regular decision tree, it can theoretically
achieve 100% accuracy (a regular decision tree can do this too).
4. (2 marks) Having built a decision tree, we are using reduced error pruning to reduce the size
of the tree. We select a node to collapse. For this particular node, on the left branch, there are
3 training data points with the following outputs: 5, 7, 9.6 and for the right branch, there are
four training data points with the following outputs: 8.7, 9.8, 10.5, 11. The average value of the
outputs of the data points denotes the response of a branch. The original responses for data points along the two branches (left, right respectively) were response_left and response_right, and the new response after collapsing the node is response_new. What are the values for response_left, response_right and response_new (numbers in the options are given in the same order)?
(a) 21.6, 40, 61.6
(b) 7.2, 10, 8.8
(c) 3, 4, 7
(d) depends on the tree height.
Sol. (b)
Original responses:
Left: (5 + 7 + 9.6)/3 = 7.2
Right: (8.7 + 9.8 + 10.5 + 11)/4 = 10
New response: 7.2 × 3/7 + 10 × 4/7 = 8.8
5. (2 marks) Consider the following dataset:
Which among the following split-points for the feature 1 would give the best split according to
the information gain measure?
(a) 14.6
(b) 16.05
(c) 16.85
(d) 17.35
Sol. (b)
info_feature1(14.6)(D) = 3/10 × (−3/3 log2 3/3 − 0/3 log2 0/3) + 7/10 × (−2/7 log2 2/7 − 5/7 log2 5/7) = 0.6042
info_feature1(16.05)(D) = 4/10 × (−4/4 log2 4/4 − 0/4 log2 0/4) + 6/10 × (−1/6 log2 1/6 − 5/6 log2 5/6) = 0.39
info_feature1(16.85)(D) = 5/10 × (−4/5 log2 4/5 − 1/5 log2 1/5) + 5/10 × (−1/5 log2 1/5 − 4/5 log2 4/5) = 0.7219
info_feature1(17.35)(D) = 7/10 × (−5/7 log2 5/7 − 2/7 log2 2/7) + 3/10 × (−0/3 log2 0/3 − 3/3 log2 3/3) = 0.6042
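A minimal sketch that reproduces these weighted-entropy values from the class counts on each side of the split:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_info(groups):                        # groups = class counts on each side of the split
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * entropy(g) for g in groups)

print(round(split_info([[3, 0], [2, 5]]), 4))  # 0.6042  (split at 14.6)
print(round(split_info([[4, 0], [1, 5]]), 4))  # 0.39    (split at 16.05)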
6. For the same dataset, which among the following split-points for feature2 would give the best
split according to the gini index measure?
(a) 172.6
(b) 176.35
(c) 178.45
(d) 185.4
Sol. (a)
gini_feature2(172.6)(D) = 7/10 × 2 × 5/7 × 2/7 + 3/10 × 2 × 0/3 × 3/3 = 0.2857
gini_feature2(176.35)(D) = 5/10 × 2 × 1/5 × 4/5 + 5/10 × 2 × 4/5 × 1/5 = 0.32
gini_feature2(178.45)(D) = 6/10 × 2 × 2/6 × 4/6 + 4/10 × 2 × 3/4 × 1/4 = 0.4167
gini_feature2(185.4)(D) = 2/10 × 2 × 2/2 × 0/2 + 8/10 × 2 × 3/8 × 5/8 = 0.375
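A similar sketch for the weighted Gini index (for a binary split, 2pq equals 1 − p² − q²):

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def split_gini(groups):
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)

print(round(split_gini([[5, 2], [0, 3]]), 4))  # 0.2857  (split at 172.6)
print(round(split_gini([[1, 4], [4, 1]]), 4))  # 0.32    (split at 176.35)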
7. In which of the following situations is it appropriate to introduce a new category ’Missing’ for
missing values? (multiple options may be correct)
(a) When values are missing because the 108 emergency operator is sometimes attending a
very urgent distress call.
(b) When values are missing because the attendant spilled coffee on the papers from which
the data was extracted.
(c) When values are missing because the warehouse storing the paper records went up in
flames and burnt parts of it.
(d) When values are missing because the nurse/doctor finds the patient’s situation too urgent.
Sol. (a),(d)
We typically introduce a ‘Missing’ value when the fact that a value is missing can itself be a relevant feature. In the case of (a), it can imply that the call was so urgent that the operator couldn't note the value down. This urgency could potentially be useful to determine the target.
But a coffee spill corrupting the records is likely to be completely random, and we glean no new information from it. In this case, a better method is to try to predict the missing data from the available data.
Assignment 7
Introduction to Machine Learning
Prof. B. Ravindran
1. For the given confusion matrix, compute the recall
(a) 0.73
(b) 0.7
(c) 0.6
(d) 0.67
(e) 0.78
(f) None of the above
Sol. (c)
4. Which method among bagging and stacking should be chosen in the case of limited training data, and what is the appropriate reason for your preference?
(a) Bagging, because we can combine as many classifiers as we want by training each on a different sample of the training data
(b) Bagging, because we use the same classification algorithms on all samples of the training
data
(c) Stacking, because we can use different classification algorithms on the training data
(d) Stacking, because each classifier is trained on all of the available data
Sol. (d)
5. (2 marks) Which of the following statements are false when comparing Committee Machines
and Stacking
(a) Committee Machines are, in general, special cases of 2-layer stacking where the second-
layer classifier provides uniform weightage.
(b) Both Committee Machines and Stacking have similar mechanisms, but Stacking uses
different classifiers while Committee Machines use similar classifiers.
(c) Committee Machines are more powerful than Stacking
(d) Committee Machines are less powerful than Stacking
Sol. (b), (c)
Both Committee Machines and Stacked Classifiers use sets of different classifiers. Assigning
constant weight to all first layer classifiers in a Stacked Classifier is simply the same as giving
each one a single vote (Committee Machines).
Since Committee Machines are a special case of Stacked Classifiers, they are less powerful than
Stacking, which can assign an adaptive weight depending on the region.
6. Which of the following measures best analyzes the performance of a classifier?
(a) Precision
(b) Recall
(c) Accuracy
(d) Time complexity
(e) Depends on the application
Sol. (e)
Explanation: Different applications might need to optimize different performance measures. Applications of machine learning span from playing games to very critical domains (such as health and security). Measures like accuracy, for instance, cannot be reliable when we have a dataset with significant class imbalance. So there cannot be a single measure to analyze the effectiveness of a classifier in all environments.
7. For the ROC curve of True positive rate vs False positive rate, which of the following are true?
(c) The curve may or may not be concave
Sol. (c)
Explanation: The nature of the ROC curve depends on the classifier. Classifiers better than a random classifier have a concave curve. Classifiers that perform worse than a random classifier have a convex curve.
8. Which of the following are true about using 5-fold cross validation with a data set of size n =
100 to select the value of k in the kNN algorithm. (More than one option may be correct)
(a) Will always result in the same k since it does not involve any randomness.
(b) Might give different answers depending on the splitting in 5 fold cross validation.
(c) Does not make sense since n is larger than the number of folds.
Sol. (b)
Assignment 8
Introduction to Machine Learning
Prof. B. Ravindran
1. In a given classification problem, there are 6 different classes. In building a classification model,
we want to penalise specific errors made by the model depending upon the actual and predicted
class label. For example, given a training data point belonging to class 1, if the model predicts
it as class 2, then the penalty for this will be different if for the same data point, the model
had predicted it as class 3. To build such a model, we need to select an appropriate
(a) ML model
(b) optimisation algorithm
(c) loss function
(d) evaluation measure
Sol. (c)
An appropriately specified 6×6 loss matrix.
2. The Naive Bayes classifier makes the assumption that the ______ are independent given the ______.
(a) features, class labels
(b) class labels, features
(c) features, data points
(d) there is no such assumption
Sol. (a)
3. Consider the problem of learning a function X → Y , where Y is Boolean. X is an input
vector (X1 , X2 ), where X1 is categorical and takes 3 values, and X2 is a continuous variable
(normally distributed). What would be the minimum number of parameters required to define
a Naive Bayes model for this function?
(a) 8
(b) 10
(c) 9
(d) 5
Sol. (c)
There are 3 possible values for X1 and 2 possible values for Y . We would have one parameter
for each P (X1 = x1 |Y = y), and there are 3 of these for each Y = y - however we would
only need 2, since the three probabilities have to sum to 1. Since there are 2 values for Y,
that gives us 4 parameters. For P (X2 = x2 |Y = y), which is continuous, we have the mean
and variance of a Gaussian for each Y = y - this gives 4 parameters. We also need the prior
probabilities P (Y = y); there are 2 of these since Y takes 2 values, but we only need one since
P (Y = 1) = 1 − P (Y = 0). The total is hence 4 + 4 + 1 = 9
4. In boosting, the weights of data points that were misclassified are ______ as training progresses.
(a) decreased
(b) increased
(c) first decreased and then increased
(d) kept unchanged
Sol. (b)
5. In a random forest model let m << p be the number of randomly selected features that are
used to identify the best split at any node of a tree. Which of the following are true? (p is the
original number of features)
(Multiple options may be correct)
(a) increasing m reduces the correlation between any two trees in the forest
(b) decreasing m reduces the correlation between any two trees in the forest
(c) increasing m increases the performance of individual trees in the forest
(d) decreasing m increases the performance of individual trees in the forest
Sol. (b) and (c)
6. (2 marks) Consider the following data for 500 instances of home, 600 instances of office and
700 instances of factory type buildings
Table 1
Suppose a building has a balcony and power-backup, but is not multi-storied. According to
the Naive Bayes algorithm, it is of type
(a) Home
(b) Office
(c) Factory
Sol. (c)
P(Home | Balcony, ¬Multi-storied, Power-backup) ∝ P(Balcony|Home) × P(¬Multi-storied|Home) × P(Power-backup|Home) × P(Home) = 4/5 × 2/5 × 1/5 × 5/18 ≈ 0.018. Computing the analogous quantities for Office and Factory (from Table 1) shows that Factory has the highest value.
7. (2 marks) Consider the following graphical model, which of the following are false about the
model? (multiple options may be correct)
Assignment 9
Introduction to Machine Learning
Prof. B. Ravindran
1. Consider the Bayesian network shown below.
Figure 1
Figure 2
The random variables have the following notation: d - Difficulty, i - Intelligence, g - Grade, s -
SAT, l - Letter. The random variables are modeled as discrete variables and the corresponding
CPDs are as below.
P(d):
    d0    d1
    0.6   0.4

P(i):
    i0    i1
    0.6   0.4

P(g | i, d):
            g1     g2     g3
    i0, d0  0.3    0.4    0.3
    i0, d1  0.05   0.25   0.7
    i1, d0  0.9    0.08   0.02
    i1, d1  0.5    0.3    0.2

P(s | i):
        s0     s1
    i0  0.95   0.05
    i1  0.2    0.8

P(l | g):
        l0     l1
    g1  0.2    0.8
    g2  0.4    0.6
    g3  0.99   0.01
What is the probability of P (i = 1, d = 0, g = 2, s = 1, l = 0)?
(a) 0.004608
(b) 0.006144
(c) 0.001536
(d) 0.003992
(e) 0.009216
(f) 0.007309
(g) None of these
Sol. (b)
P(i = 1, d = 0, g = 2, s = 1, l = 0) = P(i = 1) P(d = 0) P(g = 2|i = 1, d = 0) P(s = 1|i = 1) P(l = 0|g = 2) = 0.4 × 0.6 × 0.08 × 0.8 × 0.4 = 0.006144.
3. Using the data given in the previous question, compute the probability of following assignment,
P (i = 1, g = 1, s = 1, l = 0) irrespective of the difficulty of the course? (up to 3 decimal places)
(a) 0.160
(b) 0.371
(c) 0.662
(d) 0.047
(e) 0.037
(f) 0.066
(g) 0.189
Sol. (d)
P(i = 1, g = 1, s = 1, l = 0) = P(i = 1) P(s = 1|i = 1) P(l = 0|g = 1) × Σ_{d=0,1} P(d) P(g = 1|i = 1, d)
= 0.4 × 0.8 × 0.2 × (0.6 × 0.9 + 0.4 × 0.5) ≈ 0.047
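A minimal sketch that evaluates both joint probabilities directly from the CPDs above:

P_d = {0: 0.6, 1: 0.4}
P_i = {0: 0.6, 1: 0.4}
P_g = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],     # P(g = 1, 2, 3 | i, d)
       (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}
P_s = {0: [0.95, 0.05], 1: [0.2, 0.8]}                         # P(s = 0, 1 | i)
P_l = {1: [0.2, 0.8], 2: [0.4, 0.6], 3: [0.99, 0.01]}          # P(l = 0, 1 | g)

# P(i=1, d=0, g=2, s=1, l=0)
p1 = P_i[1] * P_d[0] * P_g[(1, 0)][1] * P_s[1][1] * P_l[2][0]
# P(i=1, g=1, s=1, l=0), summing over d
p2 = P_i[1] * P_s[1][1] * P_l[1][0] * sum(P_d[d] * P_g[(1, d)][0] for d in (0, 1))
print(round(p1, 6), round(p2, 3))                              # 0.006144 0.047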
Figure 3
Two students - Manish and Trisha make the following claims:
• Trisha claims P (H|{S, G, J}) = P (H|{G, J})
• Manish claims P (H|{S, C, J}) = P (H|{C, J})
Which of the following is true?
(a) Manish and Trisha are correct.
(b) Both are incorrect.
(c) Manish is incorrect and Trisha is correct.
(d) Manish is correct and Trisha is incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.
Sol. (c)
Figure 4
Which of the following variables are NOT in the Markov blanket of variable “4” shown in Figure 4 above? (multiple answers may be correct)
(a) 1
(b) 8
(c) 2
(d) 5
(e) 6
(f) 4
(g) 7
Sol. (d) and (g)
6. In the Markov network given in Figure 4, two students make the following claims:
• Manish claims variable “1” is dependent on variable “7” given variable “2”.
• Trina claims variable “2” is independent of variable “6” given variable “3”.
Which of the following is true?
(a) Both the students are correct.
(b) Trina is incorrect and Manish is correct.
(c) Trina is correct and Manish is incorrect.
(d) Both the students are incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.
Sol. (d)
(Question and options (a)-(e) given as figures.)
Sol. (c)
Figure 11
Which of the following nodes will have no effect on H given the Markov Blanket of H?
(a) A
(b) B
(c) C
(d) D
(e) E
(f) F
(g) G
(h) I
(i) J
Sol. (c), (e) and (f)
The question requires you to select the random variables not in the Markov blanket of H. We see that the Markov blanket of H contains A, B, I, J, G, D. The only other variables, other than H, are C, E and F. These three variables can have no effect on H once the Markov blanket is known/given.
9. Select the correct pairs of (Inference Algorithm, Graphical Model) (note: more than one option
may be correct)
(a) (Variable Elimination, Bayesian Networks)
(b) (Viterbi Algorithm, Markov Random Fields)
(c) (Viterbi Algorithm, Hidden Markov Models)
(d) (Belief Propagation, Markov Random Fields)
(e) (Variable Elimination, Markov Random Fields)
Sol. (a), (c), (d) and (e)
Viterbi Algorithm is for a sequence, while MRFs don’t have a concept of sequence.
10. Here is a popular toy graphical model. It models the grades obtained by a student in a course and its implications. Difficulty represents the difficulty of the course and Intelligence is an indicator of how intelligent the student is; SAT represents the SAT scores of the student and Letter represents the event of the student receiving a letter of recommendation from the faculty teaching the course.
Given this graphical model, which of the following statements are true?
(Note - More than one can be correct.)
(a) Given the grade, difficulty and letter are independent variables.
(b) Given grade, difficulty and intelligence are independent
(c) Without knowing any information, Difficulty and Intelligence are independent.
(d) Given the intelligence, SAT and grades are independent.
Assignment 10
Introduction to Machine Learning
Prof. B. Ravindran
1. (1 mark) Considering single-link and complete-link hierarchical clustering, is it possible for a
point to be closer to points in other clusters than to points in its own cluster? If so, in which
approach will this tend to be observed?
(a) No
(b) Yes, single-link clustering
(c) Yes, complete-link clustering
(d) Yes, both single-link and complete-link clustering
Sol. (d)
This is possible in both single-link and complete-link clustering. In the single-link case, an
example would be two parallel chains where many points are closer to points in the other
chain/cluster than to points in their own cluster. In the complete-link case, this notion is
more intuitive due to the clustering constraint (measuring distance between two clusters by
the distance between their farthest points).
2. (1 mark) Consider the following one dimensional data set: 12, 22, 2, 3, 33, 27, 5, 16, 6, 31, 20, 37, 8 and 18.
Given k = 3 and initial cluster centers to be 5, 6 and 31, what are the final cluster centres
obtained on applying the k-means algorithm?
(a) 5, 18, 30
(b) 5, 18, 32
(c) 6, 19, 32
(d) 4.8, 17.6, 32
(e) None of the above
Sol. (d)
3. (1 mark) For the previous question, in how many iterations will the k-means algorithm con-
verge?
(a) 2
(b) 3
(c) 4
(d) 6
(e) 7
Sol. (c)
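A minimal sketch with scikit-learn, using the given initial centres:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([12, 22, 2, 3, 33, 27, 5, 16, 6, 31, 20, 37, 8, 18], dtype=float).reshape(-1, 1)
init = np.array([[5.0], [6.0], [31.0]])
km = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
print(km.cluster_centers_.ravel())             # [4.8, 17.6, 32.0] (order may vary)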
4. (1 mark) In the lecture on the BIRCH algorithm, it is stated that using the number of points
N, sum of points SUM and sum of squared points SS, we can determine the centroid and
radius of the combination of any two clusters A and B. How do you determine the centroid of
the combined cluster? (In terms of N,SUM and SS of both the clusters)
(a) SUM_A + SUM_B
(b) SUM_A/N_A + SUM_B/N_B
(c) (SUM_A + SUM_B)/(N_A + N_B)
(d) (SS_A + SS_B)/(N_A + N_B)
Sol. (c)
Apply the centroid formula to the combined cluster points. It’s simply the sum of all points
divided by the total number of points.
5. (1 mark) What assumption does the CURE clustering algorithm make with regards to the
shape of the clusters?
(a) No assumption
(b) Spherical
(c) Elliptical
Sol. (a)
Explanation CURE does not make any assumption on the shape of the clusters.
6. (1 mark) What would be the effect of increasing MinPts in DBSCAN while retaining the same
Eps parameter? (Note that more than one statement may be correct)
(a) Increase in the sizes of individual clusters
(b) Decrease in the sizes of individual clusters
(c) Increase in the number of clusters
(d) Decrease in the number of clusters
Sol. (b), (c)
By increasing MinPts, we require a larger number of points in the neighborhood in order to include them in a cluster. In one sense, by increasing MinPts, we are looking for denser clusters. This can break not-so-dense clusters into more than one part, which can reduce the cluster sizes and increase the number of clusters.
For the next question, kindly download the dataset - DS1. The first two columns in the dataset
correspond to the co-ordinates of each data point. The third column corresponds to the actual
cluster label.
DS1: https://bit.ly/2Lm75Ly
7. (1 mark) Visualize the dataset DS1. Which of the following algorithms will be able to recover
the true clusters (first check by visual inspection and then write code to see if the result
matches to what you expected).
(a) K-means clustering
(b) Single link hierarchical clustering
(c) Complete link hierarchical clustering
(d) Average link hierarchical clustering
Sol. (b)
The dataset contains spiral clusters. Single link hierarchical clustering can recover spiral
clusters with appropriate parameter settings.
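A minimal sketch (the file name and parsing below are assumptions; adjust to the downloaded DS1 file, whose first two columns are coordinates and third column the true cluster label):

import pandas as pd
from sklearn.cluster import AgglomerativeClustering

ds1 = pd.read_csv('DS1.csv', header=None)      # hypothetical file name/format
X, labels = ds1.iloc[:, :2].values, ds1.iloc[:, 2]
pred = AgglomerativeClustering(n_clusters=labels.nunique(), linkage='single').fit_predict(X)
# compare pred with the true labels (up to a permutation of cluster ids)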
8. For two independent runs of K-Mean clustering is it guaranteed to get same clustering results?
Note: seed value is not preserved in independent runs.
(a) No
(b) Yes
(c) Only when the number of clusters are even
Sol. (a)
9. (1 mark) Consider the similarity matrix given below. Which of the following shows the hierarchy of clusters created by the single link clustering algorithm?
P1 P2 P3 P4 P5 P6
P1 1.0000 0.7895 0.1579 0.0100 0.5292 0.3542
P2 0.7895 1.0000 0.3684 0.2105 0.7023 0.5480
P3 0.1579 0.3684 1.0000 0.8421 0.5292 0.6870
P4 0.0100 0.2105 0.8421 1.0000 0.3840 0.5573
P5 0.5292 0.7023 0.5292 0.3840 1.0000 0.8105
P6 0.3542 0.5480 0.6870 0.5573 0.8105 1.0000
Sol. (b)
10. (1 mark) For the similarity matrix given in the previous question, which of the following shows the hierarchy of clusters created by the complete link clustering algorithm?
Sol. (d)
Assignment 11
Introduction to Machine Learning
Prof. B. Ravindran
1. During parameter estimation for a GMM model using data X, which of the following quantities
are you minimizing (directly or indirectly)?
(a) Log-likelihood
(b) Negative Log-likelihood
(c) Cross-entropy
(d) Residual Sum of Squares (RSS)
Sol. (b)
(a) µ_MAP = (σ² µp + σp² Σ_{i=1}^N xi) / (σ² + N σp²)
(b) µ_MAP = (σ² + σp² Σ_{i=1}^N xi) / (σ² + σp²)
(c) µ_MAP = (σ² + σp² Σ_{i=1}^N xi) / (σ² + N σp²)
(d) µ_MAP = (σ² µp + σp² Σ_{i=1}^N xi) / (N(σ² + σp²))
Sol. (a)
For a MAP estimate, we try to maximize f(µ)f(X|µ):
f(µ)f(X|µ) = (1/(σp√(2π))) exp(−(µ − µp)²/(2σp²)) × Π_i (1/(σ√(2π))) exp(−(xi − µ)²/(2σ²))
We maximize this with respect to µ after taking the logarithm. This yields the following equation:
(Σ_i xi)/σ² + µp/σp² − µ(N/σ² + 1/σp²) = 0
Thus the solution is (a).
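A minimal numerical check of the closed form on made-up numbers (σ, σp, µp and the data below are illustrative assumptions):

import numpy as np

rng = np.random.RandomState(0)
sigma, sigma_p, mu_p = 1.0, 2.0, 0.5
x = rng.normal(3.0, sigma, size=20)

closed_form = (sigma**2 * mu_p + sigma_p**2 * x.sum()) / (sigma**2 + len(x) * sigma_p**2)

# brute-force maximization of the log-posterior over a fine grid
grid = np.linspace(-10, 10, 200001)
log_post = -(grid - mu_p)**2 / (2 * sigma_p**2) - ((x[:, None] - grid)**2).sum(axis=0) / (2 * sigma**2)
print(closed_form, grid[log_post.argmax()])    # the two values agree (up to grid resolution)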
5. (2 marks) You are presented with a dataset that has hidden/missing variables that influences
your data. You are asked to use Expectation Maximization algorithm to best capture the data.
How would you define the E and M in Expectation Maximization?
(a) Estimate the Missing/Latent Variables in the Dataset, Maximize the likelihood over the
parameters in the model.
(b) Estimate the number of Missing/Latent Variables in the Dataset, Maximize the likelihood
over the parameters in the model.
(c) Estimate likelihood over the parameters in the model, Maximize the number of Miss-
ing/Latent Variables in the Dataset.
(d) Estimate the likelihood over the parameters in the model, Maximize the number of
parameters in the model.
Sol. (a)
6. During parameter estimation for a GMM model using data X, which of the following quantities
are you minimizing (directly or indirectly)?
(a) Log-likelihood
(b) Negative Log-likelihood
(c) Cross-entropy
(d) Residual Sum of Squares (RSS)
Sol. (b)
7. You are given n p-dimensional data points. The task is to learn a classifier to distinguish
between k classes. You come to know that the dataset has missing values. Can you use EM
algorithm to fill in the missing values ? (without making any further assumptions)
(a) Yes
(b) No
Sol. (b)