Machine Learning Assignment Solutions

Assignment 1

Introduction to Machine Learning


Prof. B. Ravindran
1. Which of the following is a supervised learning problem?
(a) Grouping related documents from an unannotated corpus.
(b) Predicting credit approval based on historical data.
(c) Predicting whether a new image contains a cat or a dog, based on historical images of cats
and dogs where you are told which images are of cats and which are of dogs.
(d) Recognizing the fingerprint of a particular person for biometric attendance, given labelled
fingerprint data of that person and various other people.
Sol. (b), (c), (d)
(a) does not have labels to indicate the groups.
2. Which of the following is NOT a classification problem?
(a) Predicting the temperature (in Celsius) of a room from other environmental features (such
as atmospheric pressure, humidity etc).
(b) Predicting if a cricket player is a batsman or bowler given his playing records.
(c) Predicting the price of a house (in INR) based on data consisting of the prices of other
houses (in INR) and their features such as area, number of rooms, location, etc.
(d) Filtering of spam messages
(e) Predicting the weather for tomorrow as “hot”, “cold”, or “rainy” based on the historical
data including wind speed, humidity, temperature, and precipitation.
(f) Predicting if a customer is going to return or keep a particular product he/she purchased
from e-commerce website based on the historical data about the customer purchases and
the particular product.
(g) Predicting the number of positive covid cases in upcoming days based on historical data.
Sol. (a), (c), (g)
These three predict continuous quantities (temperature, price, count), so they are regression problems rather than classification.

3. Which of the following is a regression task? (multiple options may be correct)


(a) Predicting the monthly sales of a cloth store in rupees.
(b) Predicting if a user would like to listen to a newly released song or not based on historical
data.
(c) Predicting the confirmation probability (in fraction) of your train ticket whose current
status is waiting list based on historical data.
(d) Predicting if a patient has diabetes or not based on historical medical records.
(e) Predicting if a customer is satisfied or unsatisfied with a product purchased from an
e-commerce website, using the reviews he/she wrote for the purchased product.
Sol. (a) and (c)

4. Which of the following is an unsupervised learning task?
(a) Predicting if a new edible item is sweet or spicy based on the information of the ingredi-
ents, their quantities, and labels (sweet or spicy) for many other similar dishes.
(b) Predicting if a new image has cat or dog based on the historical data of other images of
cats and dogs, where you are supplied the information about which image is cat or dog.
(c) Grouping related documents from an unannotated corpus.
(d) Grouping of hand-written digits from their images.
(e) Predicting the time (in days) a PhD student will take to complete his/her thesis to earn a
degree based on the historical data such as qualifications, department, institute, research
area, and time taken by other scholars to earn the degree.
(f) all of the above
Sol. (c), (d)

5. Which of the following is a categorical feature?


(a) Number of rooms in a hostel.
(b) Gender of a person
(c) Your weekly expenditure in rupees.
(d) Ethnicity of a person
(e) Area (in sq. centimeter) of your laptop screen.
(f) The color of the curtains in your room.
(g) Number of legs an animal.
(h) Minimum RAM requirement (in GB) of a system to play a game like FIFA, DOTA.
Sol. (b),(d) and (f)
6. Let X and Y be uniformly distributed random variables over the intervals [0, 4] and [0, 6],
respectively. If X and Y are independent, compute the probability
P(max(X, Y) > 3)
(a) 1/6
(b) 5/6
(c) 2/3
(d) 1/2
(e) 2/6
(f) 5/8
(g) None of the above
Sol. (f)

P(max(X, Y) > 3) = P(X > 3) + P(Y > 3) − P(X > 3 and Y > 3)
                 = 1/4 + 1/2 − (1/4 × 1/2)
                 = 5/8
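As a quick sanity check (not part of the original solution), the result can be reproduced with a short Monte Carlo simulation in Python:

import numpy as np

# Estimate P(max(X, Y) > 3) for independent X ~ U[0, 4], Y ~ U[0, 6].
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(0, 4, n)
y = rng.uniform(0, 6, n)
print((np.maximum(x, y) > 3).mean())  # ~0.625 = 5/8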

7. Let the trace and determinant of a matrix A = [a b; c d] be 3 and 4 respectively. The
eigenvalues of A are
(a) 1, 3
(b) (3 + i√7)/2 and (3 − i√7)/2, where i = √−1
(c) (3 + i√7)/4 and (3 − i√7)/4, where i = √−1
(d) 1/2, 3/2
(e) 3 + i√7 and 3 − i√7, where i = √−1
(f) 2, 8
(g) None of the above
(h) Can be computed only if A is a symmetric matrix.
(i) Cannot be computed as the entries of the matrix A are not given.

Sol. (b)
Use the facts that the trace and determinant of a matrix are equal to the sum and product
of its eigenvalues, respectively. This gives

λ1 + λ2 = 3,  λ1 λ2 = 4

where λ1 and λ2 denote the eigenvalues. Solving these two equations (equivalently, the
characteristic equation λ² − 3λ + 4 = 0) gives λ = (3 ± i√7)/2.
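To verify numerically, pick any matrix with trace 3 and determinant 4 (the matrix below is an illustrative choice, not taken from the question):

import numpy as np

# [[3, -4], [1, 0]] has trace 3 and determinant 3*0 - (-4)*1 = 4.
A = np.array([[3.0, -4.0],
              [1.0,  0.0]])
print(np.linalg.eigvals(A))  # [1.5+1.3229j, 1.5-1.3229j] = (3 ± i√7)/2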
8. What happens when your model complexity increases? (multiple options may be correct)
(a) Model Bias decreases
(b) Model Bias increases
(c) Variance of the model decreases
(d) Variance of the model increases
Sol. (a) and (d)
9. A new phone, E-Corp X1 has been announced and it is what you’ve been waiting for, all along.
You decide to read the reviews before buying it. From past experiences, you’ve figured out
that good reviews mean that the product is good 90% of the time and bad reviews mean that
it is bad 70% of the time. Upon glancing through the reviews section, you find out that the X1
has been reviewed 1269 times and only 172 of them were bad reviews. What is the probability
that, if you order the X1, it is a bad phone? (Round off the answer up to 3 decimal digits)

(a) 0.136
(b) 0.160
(c) 0.360
(d) 0.840
(e) 0.773
(f) 0.573
(g) 0.181

(h) None of the above

Sol. (g)
For the solution, let’s use the following abbreviations.
• BP - Bad Phone
• GP - Good Phone
• GR - Good Review
• BR - Bad Review
From the given data, Pr(BP|BR) = 0.7 and Pr(GP|GR) = 0.9. Using this, Pr(BP|GR) =
1 − Pr(GP|GR) = 0.1.
Hence,

Pr(BP) = Pr(BP|BR) · Pr(BR) + Pr(BP|GR) · Pr(GR)
       = 0.7 · (172/1269) + 0.1 · ((1269 − 172)/1269)
       = 0.1813
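The same arithmetic in Python (a direct transcription of the computation above):

# Marginalize Pr(bad phone) over the review type.
p_bp_given_br, p_gp_given_gr = 0.7, 0.9
bad, total = 172, 1269
p_br = bad / total
p_bp = p_bp_given_br * p_br + (1 - p_gp_given_gr) * (1 - p_br)
print(round(p_bp, 3))  # 0.181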

10. Which of the following are false about bias and variance of overfitted and underfitted models?
(multiple options may be correct)
(a) Underfitted models have high bias.
(b) Underfitted models have low bias.
(c) Overfitted models have low variance.
(d) Overfitted models have high variance.
Sol. (b), (c)

Assignment 2
Introduction to Machine Learning
Prof. B. Ravindran
1. Given a training dataset, the following visualization shows the fit of three different models
(in blue line). Assume that the test data and training data come from the same distribution.
What can you conclude from the following visualizations? Multiple options can be correct.

(a) The training error in first model is higher when compared to second and third model.
(b) The best model for this regression problem is the last (third) model, because it has
minimum training error.
(c) The second model is more robust than first and third because it will perform better on
unseen data.
(d) The third model is overfitting data as compared to first and second model.
(e) All models will perform the same because we have not seen the test data.

Sol. (a),(c),(d)

2. Suppose you have fitted a complex regression model on a dataset. Now you are using Ridge
regression with tuning parameter lambda to reduce its complexity. Choose the option below
which describes the relationship of bias and variance with lambda.

(a) In case of very large lambda; bias is low, variance is low.


(b) In case of very large lambda; bias is low, variance is high.
(c) In case of very large lambda; bias is high, variance is low.
(d) In case of very large lambda; bias is high, variance is high.

Sol. (c)
3. Given a training data set of 10,000 instances, with each input instance having 17 dimensions
and each output instance having 2 dimensions, the dimensions of the design matrix used in
applying linear regression to this data is
(a) 10000 × 17
(b) 10002 × 17

(c) 10000 × 18
(d) 10000 × 19
Sol. (c)
The design matrix has one row per training instance and one column per input dimension plus a column of ones for the intercept: 10000 × (17 + 1) = 10000 × 18.
4. Suppose we want to add a regularizer to the linear regression loss function, to control the
magnitudes of the weights β. We have a choice between Ω1(β) = Σ_{i=1}^{p} |βi| and
Ω2(β) = Σ_{i=1}^{p} βi². Which one is more likely to result in sparse weights?

(a) Ω1
(b) Ω2
(c) Both Ω1 and Ω2 will result in sparse weights
(d) Neither of Ω1 or Ω2 can result in sparse weights

Sol. (a)
The L1 penalty Ω1 has a non-differentiable corner at zero, so minimization tends to drive some weights exactly to zero (as in the lasso), whereas the L2 penalty only shrinks weights towards zero.
5. Consider forward selection, backward selection and best subset selection with respect to the
same data set. Which of the following is true?
(a) Best subset selection can be computationally more expensive than forward selection
(b) Forward selection and backward selection always lead to the same result
(c) Best subset selection can be computationally less expensive than backward selection
(d) Best subset selection and forward selection are computationally equally expensive
(e) Both (b) and (d)

Sol. (a)
Explanation: Best subset selection has to explore all possible subsets, which takes exponential
time. Forward selection and backward selection are not guaranteed to lead to the same result,
and both are computationally much cheaper than best subset selection.

6. In the lecture on Multivariate Regression, you learn about using orthogonalization iteratively
to obtain regression coefficients. This method is generally referred to as Multiple Regression
using Successive Orthogonalization.
In the formulation of the method, we observe that in iteration k, we regress the entire dataset
on z0, z1, . . . , z_{k−1}. It seems like a waste of computation to recompute the coefficients for z0 a
total of p times, z1 a total of p − 1 times and so on. Can we re-use the coefficients computed
in iteration j for iteration j + 1 for z_{j−1}?
(a) No. Doing so will result in the wrong γ matrix and hence the wrong βi's.
(b) Yes. Since z_{j−1} is orthogonal to all the earlier z's, the multiple regression in each iteration is
essentially a univariate regression on each of the previous residuals. Since the regression
coefficients for the previous residuals don't change over iterations, we can re-use the
coefficients for further iterations.
Sol. (b)
The answer is self-explanatory. Please refer to the section on Multiple Regression using
Successive Orthogonalization in Elements of Statistical Learning, 2nd edition, for the algorithm.

7. (2 marks) Consider the following five training examples

x y
2 9.8978
3 12.7586
4 16.3192
5 19.3129
6 21.1351

We want to learn a function f(x) of the form f(x) = ax + b, which is parameterised by (a, b).
Using squared error as the loss function, which of the following parameters would you use to
model this function to get a solution with the minimum loss?
(a) (4, 3)
(b) (1, 4)
(c) (4, 1)
(d) (3, 4)
Sol. (d)
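One way to see this (a sketch, not part of the original solution) is to evaluate the squared-error loss for each candidate (a, b):

import numpy as np

# Squared-error loss of f(x) = a*x + b for each candidate parameter pair.
x = np.array([2, 3, 4, 5, 6])
y = np.array([9.8978, 12.7586, 16.3192, 19.3129, 21.1351])
for a, b in [(4, 3), (1, 4), (4, 1), (3, 4)]:
    print((a, b), round(np.sum((y - (a * x + b)) ** 2), 4))
# (3, 4) gives by far the smallest loss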
8. (2 marks) Here is a data set of words in two languages.

Word Language
piano English
cat English
kepto Vinglish
shaito Vinglish

Let us build a nearest neighbours classifier that will predict which language a word belongs
to. Say we represent each word using the following features.

• Length of the word


• Number of consonants in the word
• Whether it ends with the letter ’o’ (1 if it does, 0 if it doesn’t)

For example, the representation of the word ‘waffle’ would be [6, 4, 0]. For a distance function,
use the Manhattan distance.

d(a, b) = Σ_{i=1}^{n} |ai − bi|, where a, b ∈ Rⁿ

Take the input word ‘keto’. With k = 1, the predicted language for the word is?

(a) English
(b) Vinglish
(c) None of the above

Sol. (a)
Its nearest neighbour is ‘piano’. The representations for the 4 words are [5, 2, 1], [3, 2, 0],
[5, 3, 1] and [6, 3, 1] respectively, and the representation for the input word is [4, 2, 1]. The
distances are 1, 2, 2 and 3 respectively.
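A minimal sketch of this 1-NN computation in Python (feature vectors as derived above):

# 1-nearest-neighbour with Manhattan distance over the word features.
train = {"piano": ([5, 2, 1], "English"), "cat": ([3, 2, 0], "English"),
         "kepto": ([5, 3, 1], "Vinglish"), "shaito": ([6, 3, 1], "Vinglish")}
query = [4, 2, 1]  # 'keto': length 4, 2 consonants, ends with 'o'
dist = lambda a, b: sum(abs(p - q) for p, q in zip(a, b))
word = min(train, key=lambda w: dist(train[w][0], query))
print(word, train[word][1])  # piano English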

Assignment 3
Introduction to Machine Learning
Prof. B. Ravindran
1. Consider the case where two classes follow Gaussian distributions which are centered at (4, 7)
and (−4, −1) and have identity covariance matrices. Which of the following is the separating
decision boundary using LDA, assuming the priors to be equal?
(a) y − x = 3
(b) x + y = 3
(c) x + y = 6
(d) both (b) and (c)
(e) None of the above
(f) Can not be found from the given information
Sol. (b)
As the distributions are Gaussian with identity (hence equal) covariance matrices, the separating
boundary is linear. The decision boundary is orthogonal to the line joining the centers and
passes through their midpoint. The midpoint is (0, 3) and the line joining the centers has
direction (1, 1), so the boundary is x + y = 3.
2. Consider the following data with two classes, where the color indicates the class.

Which of the following models (with NO additional complexity) can achieve zero training error
for classification?
(a) LDA
(b) PCA
(c) Logistic regression
(d) None of these
Sol. (d)
All the methods in options (a), (b) and (c) are linear, and the training data is not linearly
separable.

3. We discussed the use of MLE for the estimation of the parameters of the logistic regression
model. Which of the following assumptions did we use to derive the likelihood function?
(a) independence among the class labels
(b) independence among each training sample
(c) independence among the parameters of the model
(d) None of these
Sol. (b)
4. Which of the following statements is true about LDA regarding outliers?
(a) LDA is not sensitive to outliers
(b) LDA is sensitive to outliers.
(c) Depends upon the data
(d) None of the above
Sol. (b)
Since we use all of the data to calculate the means and variances, outliers may have an impact
on the performance of LDA, as they may adversely skew the estimated means and variances.
5. Consider the following distribution of training data:

Which method would you choose for dimensionality reduction?


(a) Linear Discriminant Analysis
(b) Principal Component Analysis
(c) (a) or (b) are equally good
(d) (a) and (b) perform very poorly, so have to choose Quadratic Discriminant Analysis
(e) None of these

Sol. (c)
The direction of maximum variance is along the direction X1 = X2 . The projected points in
this direction can be easily classified into two classes correctly.
LDA can find a linearly separable decision boundary as the data is linearly separable.
6. Suppose that we have two variables, X and Y (the dependent variable). We wish to find
the relation between them. An expert tells us that relation between the two has the form
Y = m log(X) + c. Available to us are samples of the variables X and Y. Is it possible to apply
linear regression to this data to estimate the values of m and c?

(a) no
(b) yes
(c) insufficient information

Sol. (b)

Instead of considering the independent variable directly, we can transform it by taking the
logarithm of each value. Thus, on the X-axis, we can plot values of log(X) and on the Y-axis,
we can plot values of Y. The relation between the dependent and the transformed independent
variable is linear, and the values of the slope and intercept can be estimated using linear
regression.
7. In a binary classification scenario where x is the independent variable and y is the dependent
variable, logistic regression assumes that the conditional distribution y|x follows a

(a) Binomial distribution


(b) Bernoulli distribution
(c) Normal distribution
(d) Exponential distribution

Sol. (b)
The dependent variable is binary, so a Bernoulli distribution is assumed.
8. Consider the following data:

Feature 1 Feature 2 Class


1 1 A
2 3 A
2 4 A
5 3 B
8 6 A
8 8 B
9 9 B
11 7 B

Assuming that you apply LDA to this data, what is the estimated covariance matrix?

(a) [1.875 0.3125; 0.3125 0.9375]
(b) [2.5 0.4167; 0.4167 1.25]
(c) [1.875 0.3125; 0.3125 1.2188]
(d) [2.5 0.4167; 0.4167 1.625]
(e) [8.25 5.2917; 5.2917 5.6250]
(f) [6.1875 3.9688; 3.9688 4.2188]
(g) [3.25 1.1667; 1.1667 2.375]
(h) [2.4375 0.875; 0.875 1.7812]
(i) None of these
Sol. (e)
LDA pools the within-class scatter: compute each class's scatter about its own class mean,
add the two scatter matrices, and divide by N − K = 8 − 2 = 6. Doing this calculation
correctly gives [8.2500 5.2917; 5.2917 5.6250].
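A sketch of the pooled-covariance computation in NumPy (assuming the N − K divisor, which reproduces the stated answer):

import numpy as np

# Pooled within-class covariance: add the class scatter matrices, divide by N - K.
A = np.array([[1, 1], [2, 3], [2, 4], [8, 6]], dtype=float)
B = np.array([[5, 3], [8, 8], [9, 9], [11, 7]], dtype=float)
scatter = sum((X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) for X in (A, B))
print(scatter / (len(A) + len(B) - 2))  # [[8.25 5.2917] [5.2917 5.625]]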
9. Given the following 3D input data, identify the principal component.

Feature 1 Feature 2 Feature 3


1 1 1
2 3 1
2 4 1
5 3 1
8 6 2
8 8 2
9 9 2
11 7 2

(Steps: center the data, calculate the sample covariance matrix, calculate the eigenvectors and
eigenvalues, identify the principal component)

(a) [−0.1022, 0.0018, 0.9948]ᵀ
(b) [0.5742, −0.8164, 0.0605]ᵀ
(c) [0.5742, 0.8164, 0.0605]ᵀ
(d) [−0.5742, 0.8164, 0.0605]ᵀ
(e) [0.8123, 0.5774, 0.0824]ᵀ
(f) [0.8098, 0.5762, 0.1104]ᵀ
(g) [0.0767, −0.0826, 0.9936]ᵀ
(h) None of the above
Sol. (f)
Refer to the solution of practice assignment 3
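For reference, a NumPy sketch of the stated steps (center the data, compute the sample covariance, eigendecompose):

import numpy as np

# First principal component of the 3-D data (the sign is arbitrary).
X = np.array([[1, 1, 1], [2, 3, 1], [2, 4, 1], [5, 3, 1],
              [8, 6, 2], [8, 8, 2], [9, 9, 2], [11, 7, 2]], dtype=float)
Xc = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
print(vecs[:, np.argmax(vals)])  # ±[0.8098, 0.5762, 0.1104]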

10. For the data given in the previous question, find the transformed input along the first two
principal components.
(a) [−0.6025  0.2079]
    [ 0.4420 −0.0339]
    [ 1.2552 −0.1164]
    [−1.3030 −0.2639]
    [−0.5859  0.2521]
    [ 1.0403  0.0869]
    [ 1.2717 −0.0723]
    [−1.5178 −0.0605]

(b) [ 6.2787 −0.6025]
    [ 4.3164  0.4420]
    [ 3.7402  1.2552]
    [ 1.8870 −1.3030]
    [−2.3814 −0.5859]
    [−3.5339  1.0403]
    [−4.9199  1.2717]
    [−5.3871 −1.5178]

(c) [ 0.2541  0.8344]
    [ 1.2987  0.5926]
    [ 2.1118  0.5100]
    [−0.4463  0.3626]
    [ 0.2707  0.8785]
    [ 1.8969  0.7134]
    [ 2.1284  0.5542]
    [−0.6612  0.5660]

(d) [ −1.4964  0.2541]
    [ −3.4586  1.2987]
    [ −4.0349  2.1118]
    [ −5.8881 −0.4463]
    [−10.1565  0.2707]
    [−11.3090  1.8969]
    [−12.6950  2.1284]
    [−13.1622 −0.6612]

(e) None of the above

Sol. (b)
Refer to the solution of practice assignment 3
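A sketch of the projection step, continuing the computation from the previous question:

import numpy as np

# Project the centered data onto the top-2 principal components.
X = np.array([[1, 1, 1], [2, 3, 1], [2, 4, 1], [5, 3, 1],
              [8, 6, 2], [8, 8, 2], [9, 9, 2], [11, 7, 2]], dtype=float)
Xc = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
W = vecs[:, np.argsort(vals)[::-1][:2]]  # top-2 eigenvectors as columns
print(Xc @ W)  # option (b), up to an arbitrary sign per component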

Assignment 4
Introduction to Machine Learning
Prof. B. Ravindran
1. Suppose we use a linear kernel SVM to build a classifier for a 2-class problem where the training
data points are linearly separable. In general, will the classifier trained in this manner produce
the same decision boundary as the classifier trained using the perceptron training algorithm
on the same training data?
(a) No
(b) Yes

Sol. (a)
2. Consider the data set given below. Claim: PLA (perceptron learning algorithm) can be used
to learn a classifier that achieves zero misclassification error on the training data. This claim
is:

(a) True
(b) False
(c) Depends on the initial weights
(d) True, only if we normalize the feature vectors before applying PLA.
Sol. (b)
The given data specifies the well-known XOR problem which cannot be separated by a linear
boundary.
3. For a support vector machine model, let xi be an input instance with label yi. If
yi(β̂0 + xiᵀβ̂) > 1, where β̂0 and β̂ are the estimated parameters of the model, then

(a) xi is not a support vector


(b) xi is a support vector
(c) xi is either an outlier or a support vector
(d) Depending upon other data points, xi may or may not be a support vector.

Sol. (a)

4. Suppose we use a linear kernel SVM to build a classifier for a 2-class problem where the
training data points are linearly separable. In general, will the classifier trained in this manner
be always the same as the classifier trained using the perceptron training algorithm on the
same training data?
(a) Yes
(b) No
Sol. (b) The hyperplane returned by the SVM approach will have a maximal margin, whereas
no such guarantee can be given for the hyperplane identified using the perceptron training
algorithm.
For Q5,6: Kindly download the synthetic dataset from the following link:
https://bit.ly/2Y4SNTF
The dataset contains 1000 points and each input point contains 3 features.
5. (2 marks) Train a linear regression model (without regularization) on the above dataset. Re-
port the coefficients of the best fit model. Report the coefficients in the following format:
β0 , β1 , β2 , β3 .
(a) -1.2, 2.1, 2.2, 1
(b) 1, 1.2, 2.1, 2.2
(c) -1, 1.2, 2.1, 2.2
(d) 1, -1.2, 2.1, 2.2
(e) 1, 1.2, -2.1, -2.2
Sol. (d)
Follow the steps given on the sklearn page.
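A sketch of the intended steps ('synthetic.csv' is a placeholder name, assuming the file holds the 3 features followed by the target in comma-separated columns):

import numpy as np
from sklearn.linear_model import LinearRegression

# Ordinary least squares on the synthetic dataset.
data = np.loadtxt("synthetic.csv", delimiter=",")  # placeholder file name/format
X, y = data[:, :3], data[:, 3]
reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_)  # ~1, [-1.2, 2.1, 2.2]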
6. Train an l2 regularized linear regression model on the above dataset. Vary the regularization
parameter from 1 to 10. As you increase the regularization parameter, absolute value of the
coefficients (excluding the intercept) of the model:

(a) increase
(b) first increase then decrease
(c) decrease
(d) first decrease then increase

Sol. (c)
Follow the steps given on the sklearn page.
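A sketch of the sweep, under the same file-format assumption as in the previous question:

import numpy as np
from sklearn.linear_model import Ridge

data = np.loadtxt("synthetic.csv", delimiter=",")  # placeholder file name/format
X, y = data[:, :3], data[:, 3]
for alpha in range(1, 11):  # regularization parameter from 1 to 10
    print(alpha, np.abs(Ridge(alpha=alpha).fit(X, y).coef_))
# the absolute coefficient values shrink as alpha increases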

For Q7,8: Kindly download the modified version of the Iris dataset from this link:
https://goo.gl/vchhsd
The dataset contains 150 points and each input point contains 4 features and belongs to one
among three classes. Use the first 100 points as the training data and the remaining 50 as
test data. In the following questions, to report accuracy, use test dataset. You can round-off
the accuracy value to the nearest 2-decimal point number. (Note: Do not change the order of
data points.)

7. (2 marks) Train an l2 regularized logistic regression classifier on the modified iris dataset. We
recommend using sklearn. Use only the first two features for your model. We encourage you
to explore the impact of varying different hyperparameters of the model. Kindly note that the
C parameter mentioned below is the inverse of the regularization parameter λ. As part of the
assignment train a model with the following hyperparameters:
Model: logistic regression with one-vs-rest classifier, C = 1e4
For the above set of hyperparameters, report the best classification accuracy

(a) 0.88
(b) 0.86
(c) 0.98
(d) 0.68

Sol. (b)
The following code gives the desired result (assuming X and Y hold the dataset's features and labels):
>>from sklearn.linear_model import LogisticRegression
>>clf = LogisticRegression(penalty='l2', C=1e4, multi_class='ovr').fit(X[0:100, 0:2], Y[0:100])
>>clf.score(X[100:, 0:2], Y[100:])

8. Train an SVM classifier on the modified iris dataset. We recommend using sklearn. Use only
the first two features for your model. We encourage you to explore the impact of varying
different hyperparameters of the model. Specifically try different kernels and the associated
hyperparameters. As part of the assignment train models with the following set of hyperpa-
rameters
RBF-kernel, gamma = 0.5, one-vs-rest classifier, no-feature-normalization.
Try C = 0.01, 1, 10. For the above set of hyperparameters, report the best classification accu-
racy along with total number of support vectors on the test data.
(a) 0.92, 69
(b) 0.88, 40
(c) 0.88, 69
(d) 0.98, 41
Sol. (c)
The following code gives the desired result (assuming X and Y hold the dataset's features and labels):
>>from sklearn import svm
>>clf = svm.SVC(C=1.0, kernel='rbf', decision_function_shape='ovr', gamma=0.5).fit(X[0:100, 0:2], Y[0:100])
>>clf.score(X[100:, 0:2], Y[100:])
>>clf.n_support_

Assignment 5
Introduction to Machine Learning
Prof. B. Ravindran
1. You are given the N samples of input (x) and output (y) as shown in the figure below. What
will be the most appropriate model y = f (x)?

(a) y = wx, with w > 0
(b) y = wx, with w < 0
(c) y = x^w, with w > 0
(d) y = x^w, with w < 0

Sol. (c)

2. Given N samples x1, x2, . . . , xN drawn independently from a Gaussian distribution with
variance σ² and unknown mean µ, find the MLE of the mean.
(a) µ_MLE = (Σ_{i=1}^{N} xi) / σ²
(b) µ_MLE = (Σ_{i=1}^{N} xi) / N
(c) µ_MLE = (Σ_{i=1}^{N} xi) / (2σ²N)
(d) µ_MLE = (Σ_{i=1}^{N} xi) / (N − 1)
(e) µ_MLE = (Σ_{i=1}^{N} xi) / (N − 2)

Sol. (b)
3. Consider the following function.

f(x) = e^x / (1 + e^x)

The derivative f′(x) will be:

(a) f(x) ln f(x) + (1 − f(x)) ln(1 − f(x))
(b) f(x) ln(1 − f(x))
(c) f(x)(1 − f(x))
(d) f(x)(1 + f(x))
Sol. (c)
4. Using the notations used in class, evaluate the value of the neural network with a 3-3-1
architecture (2-dimensional input with 1 node for the bias term in both the layers). The
parameters are as follows:

α = [1 0.2 0.4; 1 0.8 0.6]
β = [0.8 0.4 0.5]

Using the sigmoid function as the activation function at both layers, the output of the network
for an input of (0.8, 0.7) will be (up to 4 decimal places)
(a) 0.6710
(b) 0.9617
(c) 0.6948
(d) 0.7052
(e) 0.8273
(f) 0.2023
(g) 0.7977
(h) 0.2446
(i) 0.7991
(j) None of these
Solution (e)
This is a straightforward computation task. First pad x with 1 to make the vector
X = [1, 0.8, 0.7]ᵀ. The output of the first layer can be written as

o1 = αX

Next apply the sigmoid function and compute

a1(i) = 1 / (1 + e^(−o1(i)))

Then pad the a1 vector also with 1 for the bias, and compute the output of the second layer:

o2 = βa1
a2 = 1 / (1 + e^(−o2)) = 0.8273
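The same forward pass in NumPy (a sketch of the computation above):

import numpy as np

# Forward pass of the 3-3-1 network.
sigmoid = lambda z: 1 / (1 + np.exp(-z))
alpha = np.array([[1, 0.2, 0.4],
                  [1, 0.8, 0.6]])
beta = np.array([0.8, 0.4, 0.5])
X = np.array([1, 0.8, 0.7])         # input padded with 1 for the bias
a1 = sigmoid(alpha @ X)             # hidden-layer activations
a1 = np.concatenate(([1], a1))      # pad with 1 for the bias
print(round(float(sigmoid(beta @ a1)), 4))  # 0.8273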

5. Which of the following statements are true:
(a) The chances of overfitting decreases with increasing the number of hidden nodes and
increasing the number of hidden layers.
(b) A neural network with one hidden layer can represent any Boolean function given sufficient
number of hidden units and appropriate activation functions.
(c) Two hidden layer neural networks can represent any continuous functions (within a tol-
erance) as long as the number of hidden units is sufficient and appropriate activation
functions used.
Sol. (b), (c)
By increasing the number of hidden nodes or hidden layers, we increase the number of
parameters. A larger set of parameters is more capable of memorizing the training data, and
hence may result in overfitting.
6. We have a function which takes a two-dimensional input x = (x1, x2) and has two parameters
w = (w1, w2), given by f(x, w) = σ(σ(x1 w1) w2 + x2), where σ(x) = 1/(1 + e^(−x)). We use
backpropagation to estimate the right parameter values. We start by setting both the
parameters to 2. Assume that we are given a training point x2 = 1, x1 = 0, y = 3. Given this
information, answer the next two questions. What is the value of ∂f/∂w2?
(a) 0.150
(b) -0.25
(c) 0.125
(d) 0.0525
(e) 0.098
(f) 0.0746
(g) 0.1604
(h) None of these
Solution: (d)
Write σ(x1 w1) w2 + x2 as o2 and x1 w1 as o1. Then

∂f/∂w2 = (∂f/∂o2) (∂o2/∂w2) = σ(o2)(1 − σ(o2)) × σ(o1)
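The analytic value can be checked against a finite-difference approximation (a sketch):

import numpy as np

# Numerical estimate of ∂f/∂w2 at w1 = w2 = 2, x1 = 0, x2 = 1.
sigmoid = lambda z: 1 / (1 + np.exp(-z))
f = lambda w1, w2, x1, x2: sigmoid(sigmoid(x1 * w1) * w2 + x2)
eps = 1e-6
grad = (f(2, 2 + eps, 0, 1) - f(2, 2 - eps, 0, 1)) / (2 * eps)
print(round(float(grad), 4))  # 0.0525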
7. If the learning rate is 0.5, what will be the value of w2 after one update using backpropagation
algorithm?
(a) 0.4197
(b) -0.4197
(c) 0.6881
(d) -0.6881
(e) 1.3119

(f) -1.3119
(g) 2.1113
(h) -2.1113
(i) 1.1113
(j) -1.1113
(k) 0.5625
(l) -0.5625
(m) None of these
Solution: (g)
The update equation would be

w2 = w2 − λ (∂L/∂w2)

where L is the loss function, here L = (y − f)². So

w2 = w2 − λ × 2(y − f) × (−1) × (∂f/∂w2)

Now putting in the given values, we get the right answer.
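Plugging in the numbers (a sketch continuing the previous question, where f ≈ 0.8808 and ∂f/∂w2 ≈ 0.0525):

# One gradient-descent update of w2 with learning rate 0.5.
y, w2, lr = 3, 2, 0.5
f_val, df_dw2 = 0.8808, 0.0525   # values from the previous question
dL_dw2 = 2 * (y - f_val) * (-1) * df_dw2
print(round(w2 - lr * dL_dw2, 4))  # 2.1113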


8. Which of the following are true when comparing ANNs and SVMs?
(a) ANN error surface has multiple local minima while SVM error surface has only one minima
(b) After training, an ANN might land on a different minimum each time, when initialized
with random weights during each run.
(c) As shown for Perceptron, there are some classes of functions that cannot be learnt by an
ANN. An SVM can learn a hyperplane for any kind of distribution.
(d) In training, ANN’s error surface is navigated using a gradient descent technique while
SVM’s error surface is navigated using convex optimization solvers.

Sol. (a), (b) and (d)

By the universal approximation theorem, we can argue that option (c) is not true.
9. Which of the following are correct?
(a) A perceptron cannot learn the underlying linearly separable boundary in a finite number
of training steps.
(b) Backpropagation algorithm used while estimating parameters of neural networks actually
uses gradient descent algorithm.
(c) The backpropagation algorithm will always converge to global optimum, which is one of
the reasons for impressive performance of neural networks.
(d) None of these

Sol. (b)
10. Which of the following are false?

(a) The number of weights to be trained in a neural network should be quite high (10-15
times) the number of samples for effective training of the neural network.
(b) XOR function can not be modelled by a single perceptron.
(c) In backpropagation algorithm, we should start with a relatively small learning parameter
(η) and slowly increase it during the learning process.
(d) None of these

Sol. (a), (c)

Assignment 6
Introduction to Machine Learning
Prof. B. Ravindran
1. Decision trees can be used for ______.

(a) classification
(b) regression
(c) Both
(d) None of these

Sol. (c)
2. In building a decision tree model, to control the size of the tree, we need to control the number
of regions. One approach to do this would be to split tree nodes only if the resultant decrease
in the sum of squares error exceeds some threshold. For the described method, which among
the following are true?
(a) it would, in general, help restrict the size of the trees
(b) it has the potential to affect the performance of the resultant regression/classification
model
(c) it is computationally infeasible

Sol. (a), (b)


While this approach may restrict the eventual number of regions produced, the main problem
with this approach is that it is too restrictive and may result in poor performance. It is very
common for splits at one level, which themselves are not that good (i.e., they do not decrease
the error significantly), to lead to very good splits (i.e., where the error is significantly reduced)
down the line. Think about the XOR problem.
3. (2 marks) In a decision tree, if we decide to swap out the usual splits (of the form xi < k
or xi > k) and instead use a linear combination of features (like βᵀX + β0), where
the parameters of the hyperplane β, β0 are also simultaneously learnt, which of the following
statements would be true?

(a) If we trained only a single step of the decision tree (only the root), the system is equivalent
to a perceptron.
(b) If we trained only a single step of the decision tree (only the root), the system is equivalent
to an SVM.
(c) The resulting system cannot solve the XOR problem (refer to the ’Perceptron’ lectures)
(d) The resulting system can theoretically reach 100% accuracy on the training data set.
Sol. (a),(d). Since a single step decision tree, in the general case, has a single hyperplane
separating the classes, it behaves like a Perceptron.

An SVM has an additional term to find the optimal separating hyperplane. Since this term
will not be present in the loss function, the single step variant will not be an SVM.

Given multiple levels, this augmented decision tree can definitely solve the XOR problem
by first splitting along one axis and then splitting perpendicularly in the two halves.

Since the new augmented tree is stronger than the regular decision tree, it can theoretically
achieve 100% accuracy (a regular decision tree can do this too).
4. (2 marks) Having built a decision tree, we are using reduced error pruning to reduce the size
of the tree. We select a node to collapse. For this particular node, on the left branch, there are
3 training data points with the following outputs: 5, 7, 9.6 and for the right branch, there are
four training data points with the following outputs: 8.7, 9.8, 10.5, 11. The average value of the
outputs of the data points denotes the response of a branch. The original responses for the data
points along the two branches (left and right respectively) were response_left and response_right,
and the new response after collapsing the node is response_new. What are the values of
response_left, response_right and response_new (the numbers in each option are given in the
same order)?
(a) 21.6, 40, 61.6
(b) 7.2; 10; 8.8
(c) 3, 4, 7
(d) depends on the tree height.
Sol. (b)
Original responses:
Left: (5 + 7 + 9.6)/3 = 7.2
Right: (8.7 + 9.8 + 10.5 + 11)/4 = 10
New response: 7.2 × (3/7) + 10 × (4/7) = 8.8
5. (2 marks) Consider the following dataset:

feature1 feature2 output


11.7 183.2 a
12.8 187.6 a
15.3 177.4 a
13.9 198.6 a
17.2 175.3 a
16.8 151.1 b
17.5 171.4 b
23.6 162.8 b
16.9 179.5 b
19.1 173.8 b

Which among the following split-points for the feature 1 would give the best split according to
the information gain measure?
(a) 14.6
(b) 16.05
(c) 16.85
(d) 17.35

Sol. (b)
info_{feature1 = 14.6}(D) = (3/10)(−(3/3) log2(3/3) − (0/3) log2(0/3)) + (7/10)(−(2/7) log2(2/7) − (5/7) log2(5/7)) = 0.6042
info_{feature1 = 16.05}(D) = (4/10)(−(4/4) log2(4/4) − (0/4) log2(0/4)) + (6/10)(−(1/6) log2(1/6) − (5/6) log2(5/6)) = 0.39
info_{feature1 = 16.85}(D) = (5/10)(−(4/5) log2(4/5) − (1/5) log2(1/5)) + (5/10)(−(1/5) log2(1/5) − (4/5) log2(4/5)) = 0.7219
info_{feature1 = 17.35}(D) = (7/10)(−(5/7) log2(5/7) − (2/7) log2(2/7)) + (3/10)(−(0/3) log2(0/3) − (3/3) log2(3/3)) = 0.6042
The split at 16.05 has the lowest weighted entropy and hence the highest information gain.
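These weighted entropies can be reproduced with a short script (a sketch; data as in the table above):

import numpy as np

# Weighted entropy after splitting feature1 at each candidate threshold.
f1 = np.array([11.7, 12.8, 15.3, 13.9, 17.2, 16.8, 17.5, 23.6, 16.9, 19.1])
y = np.array(list("aaaaabbbbb"))

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

for t in [14.6, 16.05, 16.85, 17.35]:
    left, right = y[f1 < t], y[f1 >= t]
    print(t, round((len(left) * entropy(left) + len(right) * entropy(right)) / len(y), 4))
# 16.05 has the lowest weighted entropy, i.e. the highest information gain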

6. For the same dataset, which among the following split-points for feature2 would give the best
split according to the gini index measure?
(a) 172.6
(b) 176.35
(c) 178.45
(d) 185.4
Sol. (a)
gini_{feature2 = 172.6}(D) = (7/10) × 2 × (5/7) × (2/7) + (3/10) × 2 × (0/3) × (3/3) = 0.2857
gini_{feature2 = 176.35}(D) = (5/10) × 2 × (1/5) × (4/5) + (5/10) × 2 × (4/5) × (1/5) = 0.32
gini_{feature2 = 178.45}(D) = (6/10) × 2 × (2/6) × (4/6) + (4/10) × 2 × (3/4) × (1/4) = 0.4167
gini_{feature2 = 185.4}(D) = (2/10) × 2 × (2/2) × (0/2) + (8/10) × 2 × (3/8) × (5/8) = 0.375
The split at 172.6 gives the lowest weighted Gini index and is hence the best split.
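Similarly for the Gini index (a sketch; data as in the table above):

import numpy as np

# Weighted Gini index after splitting feature2 at each candidate threshold.
f2 = np.array([183.2, 187.6, 177.4, 198.6, 175.3, 151.1, 171.4, 162.8, 179.5, 173.8])
y = np.array(list("aaaaabbbbb"))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

for t in [172.6, 176.35, 178.45, 185.4]:
    left, right = y[f2 < t], y[f2 >= t]
    print(t, round((len(left) * gini(left) + len(right) * gini(right)) / len(y), 4))
# 172.6 gives the lowest weighted Gini index, hence the best split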

7. In which of the following situations is it appropriate to introduce a new category ’Missing’ for
missing values? (multiple options may be correct)

(a) When values are missing because the 108 emergency operator is sometimes attending a
very urgent distress call.
(b) When values are missing because the attendant spilled coffee on the papers from which
the data was extracted.
(c) When values are missing because the warehouse storing the paper records went up in
flames and burnt parts of it.
(d) When values are missing because the nurse/doctor finds the patient’s situation too urgent.

Sol. (a), (d)
We typically introduce a ‘Missing’ value when the fact that a value is missing can itself be a
relevant feature. In the case of (a), it can imply that the call was so urgent that the operator
couldn't note the value down. This urgency could potentially be useful for determining the target.
But a coffee spill corrupting the records is likely to be completely random, and we glean no
new information from it. In this case, a better method is to try to predict the missing data
from the available data.

Assignment 7
Introduction to Machine Learning
Prof. B. Ravindran
1. For the given confusion matrix, compute the recall

                     Actual Positive   Actual Negative
Predicted Positive         36                18
Predicted Negative         24                42

(a) 0.73
(b) 0.7
(c) 0.6
(d) 0.67
(e) 0.78
(f) None of the above

Sol. (c)
Recall = TP/(TP + FN) = 36/(36 + 24) = 0.6.

2. Which of the following are true?


TP - True Positive, TN - True Negative, FP - False Positive, FN - False Negative
(a) Precision = TP / (TP + FP)
(b) Precision = TP / (TP + FN)
(c) Recall = TP / (TP + FN)
(d) Accuracy = 2(TP + TN) / (TP + TN + FP + FN)
(e) Recall = FP / (TP + FP)

Sol. (a), (c)


3. (2 marks) How does bagging help in improving the classification performance?
(a) If the parameters of the resultant classifiers are fully uncorrelated (independent), then
bagging is inefficient.
(b) It helps reduce variance
(c) If the parameters of the resultant classifiers are fully correlated, then bagging is inefficient.
(d) It helps reduce bias
Sol. (b), (c)
The lecture clearly states that correlated weights generally means that all the classifiers learn
very similar functions. This means that bagging gives no extra stability.
Having a lot of uncorrelated classifiers helps to reduce variance, since the resultant ensemble is
more resistant to a single outlier (it is likely that the outlier affects only a small fraction of the
classifiers in the ensemble).

4. Which method among bagging and stacking should be chosen in case of limited training data,
and what is the appropriate reason for your preference?
(a) Bagging, because we can combine as many classifier as we want by training each on a
different sample of the training data
(b) Bagging, because we use the same classification algorithms on all samples of the training
data
(c) Stacking, because we can use different classification algorithms on the training data
(d) Stacking, because each classifier is trained on all of the available data
Sol. (d)
5. (2 marks) Which of the following statements are false when comparing Committee Machines
and Stacking
(a) Committee Machines are, in general, special cases of 2-layer stacking where the second-
layer classifier provides uniform weightage.
(b) Both Committee Machines and Stacking have similar mechanisms, but Stacking uses
different classifiers while Committee Machines use similar classifiers.
(c) Committee Machines are more powerful than Stacking
(d) Committee Machines are less powerful than Stacking
Sol. (b), (c)
Both Committee Machines and Stacked Classifiers use sets of different classifiers. Assigning
constant weight to all first layer classifiers in a Stacked Classifier is simply the same as giving
each one a single vote (Committee Machines).
Since Committee Machines are a special case of Stacked Classifiers, they are less powerful than
Stacking, which can assign an adaptive weight depending on the region.
6. Which of the following measures best analyzes the performance of a classifier?
(a) Precision
(b) Recall
(c) Accuracy
(d) Time complexity
(e) Depends on the application
Sol. (e)
Explanation: Different applications might need to optimize different performance measures.
Applications of machine learning span from playing games to very critical domains (such as
health and security). A measure like accuracy, for instance, cannot be reliable when we have a
dataset with significant class imbalance. So there cannot be a single measure to analyze the
effectiveness of a classifier in all environments.
7. For the ROC curve of True positive rate vs False positive rate, which of the following are true?

(a) The curve is always concave (negative convex).


(b) The curve is never concave.

(c) The curve may or may not be concave

Sol. (c)
Explanation: The nature of the ROC curve depends on the classifier. Classifiers better than
the random classifier have a concave curve. Classifiers that perform worse than the random
classifier have a convex curve.
8. Which of the following are true about using 5-fold cross validation with a data set of size n =
100 to select the value of k in the kNN algorithm. (More than one option may be correct)
(a) Will always result in the same k since it does not involve any randomness.
(b) Might give different answers depending on the splitting in 5 fold cross validation.
(c) Does not make sense since n is larger than the number of folds.
Sol. (b)

Assignment 8
Introduction to Machine Learning
Prof. B. Ravindran
1. In a given classification problem, there are 6 different classes. In building a classification model,
we want to penalise specific errors made by the model depending upon the actual and predicted
class label. For example, given a training data point belonging to class 1, if the model predicts
it as class 2, then the penalty for this will be different if for the same data point, the model
had predicted it as class 3. To build such a model, we need to select an appropriate
(a) ML model
(b) optimisation algorithm
(c) loss function
(d) evaluation measure
Sol. (c)
An appropriately specified 6×6 loss matrix.
2. The Naive Bayes classifier makes the assumption that the ______ are independent given
the ______.
(a) features, class labels
(b) class labels, features
(c) features, data points
(d) there is no such assumption
Sol. (a)
3. Consider the problem of learning a function X → Y , where Y is Boolean. X is an input
vector (X1 , X2 ), where X1 is categorical and takes 3 values, and X2 is a continuous variable
(normally distributed). What would be the minimum number of parameters required to define
a Naive Bayes model for this function?
(a) 8
(b) 10
(c) 9
(d) 5
Sol. (c)
There are 3 possible values for X1 and 2 possible values for Y . We would have one parameter
for each P (X1 = x1 |Y = y), and there are 3 of these for each Y = y - however we would
only need 2, since the three probabilities have to sum to 1. Since there are 2 values for Y,
that gives us 4 parameters. For P (X2 = x2 |Y = y), which is continuous, we have the mean
and variance of a Gaussian for each Y = y - this gives 4 parameters. We also need the prior
probabilities P (Y = y); there are 2 of these since Y takes 2 values, but we only need one since
P (Y = 1) = 1 − P (Y = 0). The total is hence 4 + 4 + 1 = 9
4. In boosting, the weights of data points that were misclassified are ______ as training progresses.

(a) decreased
(b) increased
(c) first decreased and then increased
(d) kept unchanged
Sol. (b)

5. In a random forest model let m << p be the number of randomly selected features that are
used to identify the best split at any node of a tree. Which of the following are true? (p is the
original number of features)
(Multiple options may be correct)
(a) increasing m reduces the correlation between any two trees in the forest
(b) decreasing m reduces the correlation between any two trees in the forest
(c) increasing m increases the performance of individual trees in the forest
(d) decreasing m increases the performance of individual trees in the forest
Sol. (b) and (c)

6. (2 marks) Consider the following data for 500 instances of home, 600 instances of office and
700 instances of factory type buildings

Building Balcony Multi-storied Power-backup Total


Home 400 200 100 500
Office 300 150 450 600
Factory 150 450 450 700
Total 850 800 1000 1800

Table 1

Suppose a building has a balcony and power-backup, but is not multi-storied. According to
the Naive Bayes algorithm, it is of type
(a) Home
(b) Office
(c) Factory

Sol. (c)
P(Home | Balcony & Multi-storied & Power-backup) ∝ P(Balcony|Home) · P(Multi-storied|Home) · P(Power-backup|Home) · P(Home) = 4/5 · 2/5 · 1/5 · 5/18 = 0.018
P(Office | Balcony & Multi-storied & Power-backup) ∝ P(Balcony|Office) · P(Multi-storied|Office) · P(Power-backup|Office) · P(Office) = 3/6 · 15/60 · 45/60 · 6/18 = 0.031
P(Factory | Balcony & Multi-storied & Power-backup) ∝ P(Balcony|Factory) · P(Multi-storied|Factory) · P(Power-backup|Factory) · P(Factory) = 15/70 · 45/70 · 45/70 · 7/18 = 0.034
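The three scores can be computed directly from Table 1 (a sketch of the arithmetic above):

# Naive Bayes scores for (Balcony, Multi-storied, Power-backup) per building type.
counts = {"Home":    (400, 200, 100, 500),
          "Office":  (300, 150, 450, 600),
          "Factory": (150, 450, 450, 700)}
total = 1800
for building, (balcony, multi, power, n) in counts.items():
    score = (balcony / n) * (multi / n) * (power / n) * (n / total)
    print(building, round(score, 3))
# Factory has the highest score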

7. (2 marks) Consider the following graphical model, which of the following are false about the
model? (multiple options may be correct)

(a) A is independent of B when C is known


(b) D is independent of A when C is known
(c) D is not independent of A when B is known
(d) D is not independent of A when C is known

Sol. (a), (b)


8. Consider the Bayesian network given in the previous question. Let ‘A’, ‘B’, ‘C’, ‘D’and
‘E’denote the random variables shown in the network. Which of the following can be inferred
from the network structure?
(a) ‘A’ causes ‘D’
(b) ‘E’ causes ‘D’
(c) ‘C’ causes ‘A’
(d) options (a) and (b) are correct
(e) none of the above can be inferred
Sol. (e)
As discussed in the lecture, in Bayesian Network, the edges do not imply any causality.

Assignment 9
Introduction to Machine Learning
Prof. B. Ravindran
1. Consider the Bayesian network shown below.

Figure 1

Two students - Manish and Trisha make the following claims:

• Manish claims P (D|{S, L, C}) = P (D|{L, C})


• Trisha claims P (D|{S, L}) = P (D|L)
where P (X|Y ) denotes probability of event X given Y . Please note that Y can be a set. Which
of the following is true?

(a) Manish and Trisha are correct.


(b) Manish is correct and Trisha is incorrect.
(c) Manish is incorrect and Trisha is correct.
(d) Both are incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.
Sol. (b)
D and S are independent given the two variables {L, C}, but not when only L is given.
2. Consider the Bayesian graph shown below in Figure 2.

Figure 2

The random variables have the following notation: d - Difficulty, i - Intelligence, g - Grade, s -
SAT, l - Letter. The random variables are modeled as discrete variables and the corresponding
CPDs are as below.
P(d):  d0 = 0.6, d1 = 0.4

P(i):  i0 = 0.6, i1 = 0.4

P(g | i, d):
          g1     g2     g3
  i0, d0  0.3    0.4    0.3
  i0, d1  0.05   0.25   0.7
  i1, d0  0.9    0.08   0.02
  i1, d1  0.5    0.3    0.2

P(s | i):
      s0     s1
  i0  0.95   0.05
  i1  0.2    0.8

P(l | g):
      l0     l1
  g1  0.2    0.8
  g2  0.4    0.6
  g3  0.99   0.01
What is the probability of P (i = 1, d = 0, g = 2, s = 1, l = 0)?

(a) 0.004608
(b) 0.006144
(c) 0.001536
(d) 0.003992

2
(e) 0.009216
(f) 0.007309
(g) None of these

Sol. (b)

P(i=1, d=0, g=2, s=1, l=0) = P(i=1) P(d=0) P(g=2 | i=1, d=0) P(s=1 | i=1) P(l=0 | g=2)
= 0.4 × 0.6 × 0.08 × 0.8 × 0.4 = 0.006144
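The factorization is a straight product of CPD entries (sketch):

# P(i=1, d=0, g=2, s=1, l=0) as a product of the CPD entries above.
p = 0.4 * 0.6 * 0.08 * 0.8 * 0.4  # P(i=1) P(d=0) P(g2|i1,d0) P(s1|i1) P(l0|g2)
print(p)  # 0.006144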

3. Using the data given in the previous question, compute the probability of the following
assignment, P(i = 1, g = 1, s = 1, l = 0), irrespective of the difficulty of the course (up to 3
decimal places).

(a) 0.160
(b) 0.371
(c) 0.662
(d) 0.047
(e) 0.037
(f) 0.066
(g) 0.189

Sol. (d)

P(i=1, g=1, s=1, l=0) = P(i=1) P(s=1 | i=1) P(l=0 | g=1) Σ_{d=0,1} P(d) P(g=1 | i=1, d)
= 0.4 × 0.8 × 0.2 × (0.9 × 0.6 + 0.5 × 0.4) = 0.04736
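The marginalization over d in code (sketch):

# Sum out the unobserved difficulty d.
p_g1_given_i1 = 0.9 * 0.6 + 0.5 * 0.4   # sum over d of P(g=1 | i=1, d) P(d)
print(round(0.4 * 0.8 * 0.2 * p_g1_given_i1, 3))  # 0.047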

4. Consider the Bayesian network shown below in Figure 3

Figure 3

Two students - Manish and Trisha make the following claims:
• Trisha claims P (H|{S, G, J}) = P (H|{G, J})
• Manish claims P (H|{S, C, J}) = P (H|{C, J})
Which of the following is true?
(a) Manish and Trisha are correct.
(b) Both are incorrect.
(c) Manish is incorrect and Trisha is correct.
(d) Manish is correct and Trisha is incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.
Sol. (c)

5. Consider the Markov network shown below in Figure 4

Figure 4

Which of the following variables are NOT in the Markov blanket of the variable “4” shown in
Figure 4 above? (multiple answers may be correct)
(a) 1
(b) 8
(c) 2
(d) 5
(e) 6
(f) 4
(g) 7
Sol. (d) and (g)

6. In the Markov network given in Figure 4, two students make the following claims:

• Manish claims variable “1” is dependent on variable “7” given variable “2”.
• Trina claims variable “2” is independent of variable “6” given variable “3”.
Which of the following is true?
(a) Both the students are correct.
(b) Trina is incorrect and Manish is correct.
(c) Trina is correct and Manish is incorrect.
(d) Both the students are incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.

Sol. (d)

7. Four random variables are known to follow the given factorization


P(A1 = a1, A2 = a2, A3 = a3, A4 = a4) = (1/Z) ψ1(a1, a2) ψ2(a1, a4) ψ3(a1, a3) ψ4(a2, a4) ψ5(a3, a4)
The corresponding Markov network would be

(a)

(b)

(c)

(d)

(e)

(f): None of the above

Sol. (c)
Each potential ψ corresponds to an edge of the Markov network, so the network must contain exactly the edges (1,2), (1,4), (1,3), (2,4) and (3,4).

8. Consider the following Markov Random Field.

Figure 11

Which of the following nodes will have no effect on H given the Markov Blanket of H?
(a) A
(b) B
(c) C
(d) D
(e) E
(f) F
(g) G
(h) I
(i) J
Sol. (c), (e) and (f)
The question requires you to select the random variables not in the Markov blanket of H. We
see that the Markov blanket of H contains A, B, D, G, I, J. The only variables other than H
outside the blanket are C, E, F. These three variables can have no effect on H once the Markov
blanket is known/given.
9. Select the correct pairs of (Inference Algorithm, Graphical Model) (note: more than one option
may be correct)
(a) (Variable Elimination, Bayesian Networks)
(b) (Viterbi Algorithm, Markov Random Fields)
(c) (Viterbi Algorithm, Hidden Markov Models)
(d) (Belief Propagation, Markov Random Fields)
(e) (Variable Elimination, Markov Random Fields)
Sol. (a), (c), (d) and (e)
Viterbi Algorithm is for a sequence, while MRFs don’t have a concept of sequence.
10. Here is a popular toy graphical model. It models the grades obtained by a student in a course
and its implications. Difficulty represents the difficulty of the course, Intelligence is an
indicator of how intelligent the student is, SAT represents the SAT scores of the student, and
Letter represents the event of the student receiving a letter of recommendation from the faculty
teaching the course.

Given this graphical model, which of the following statements are true?
(Note - More than one can be correct.)

(a) Given the grade, difficulty and letter are independent variables.
(b) Given grade, difficulty and intelligence are independent
(c) Without knowing any information, Difficulty and Intelligence are independent.
(d) Given the intelligence, SAT and grades are independent.

Sol. (a), (c) and (d)


To check independence between a pair of variables, first find all the paths between the pair of
nodes and ensure that every path between them is blocked. We call a path blocked in the
following cases:
• A node which occurs on the path in a head-to-tail or tail-to-tail configuration is known.
• A node which occurs on the path in a head-to-head configuration is not known.
Using this strategy we can evaluate each option. For option (a), there is only one path between
D and L, and it passes through G. Since G is a head-to-tail node on this path and G is known,
the path is blocked, which makes D and L independent. You can evaluate the remaining
options similarly and reach the given solution.

Assignment 10
Introduction to Machine Learning
Prof. B. Ravindran
1. (1 mark) Considering single-link and complete-link hierarchical clustering, is it possible for a
point to be closer to points in other clusters than to points in its own cluster? If so, in which
approach will this tend to be observed?
(a) No
(b) Yes, single-link clustering
(c) Yes, complete-link clustering
(d) Yes, both single-link and complete-link clustering
Sol. (d)
This is possible in both single-link and complete-link clustering. In the single-link case, an
example would be two parallel chains where many points are closer to points in the other
chain/cluster than to points in their own cluster. In the complete-link case, this notion is
more intuitive due to the clustering constraint (measuring distance between two clusters by
the distance between their farthest points).
2. (1 mark) Consider the following one-dimensional data set: 12, 22, 2, 3, 33, 27, 5, 16, 6, 31, 20, 37, 8 and 18.
Given k = 3 and the initial cluster centers 5, 6 and 31, what are the final cluster centers
obtained on applying the k-means algorithm?
(a) 5, 18, 30
(b) 5, 18, 32
(c) 6, 19, 32
(d) 4.8, 17.6, 32
(e) None of the above
Sol. (d)
3. (1 mark) For the previous question, in how many iterations will the k-means algorithm con-
verge?
(a) 2
(b) 3
(c) 4
(d) 6
(e) 7
Sol. (c)
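Both answers can be checked by running k-means by hand (a sketch covering questions 2 and 3):

import numpy as np

# 1-D k-means with the given initial centers; count iterations to convergence.
x = np.array([12, 22, 2, 3, 33, 27, 5, 16, 6, 31, 20, 37, 8, 18], dtype=float)
centers = np.array([5.0, 6.0, 31.0])
for it in range(1, 100):
    assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    new = np.array([x[assign == k].mean() for k in range(3)])
    if np.allclose(new, centers):
        break
    centers = new
print(centers, it)  # [4.8 17.6 32.0] after 4 iterations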
4. (1 mark) In the lecture on the BIRCH algorithm, it is stated that using the number of points
N, sum of points SUM and sum of squared points SS, we can determine the centroid and
radius of the combination of any two clusters A and B. How do you determine the centroid of
the combined cluster? (In terms of N,SUM and SS of both the clusters)

(a) SUM_A + SUM_B
(b) SUM_A / N_A + SUM_B / N_B
(c) (SUM_A + SUM_B) / (N_A + N_B)
(d) (SS_A + SS_B) / (N_A + N_B)

Sol. (c)
Apply the centroid formula to the combined cluster points. It’s simply the sum of all points
divided by the total number of points.
5. (1 mark) What assumption does the CURE clustering algorithm make with regards to the
shape of the clusters?
(a) No assumption
(b) Spherical
(c) Elliptical
Sol. (a)
Explanation CURE does not make any assumption on the shape of the clusters.
6. (1 mark) What would be the effect of increasing MinPts in DBSCAN while retaining the same
Eps parameter? (Note that more than one statement may be correct)
(a) Increase in the sizes of individual clusters
(b) Decrease in the sizes of individual clusters
(c) Increase in the number of clusters
(d) Decrease in the number of clusters
Sol. (b), (c)
By increasing MinPts, we require a larger number of points in the neighborhood in order to
include points in a cluster. In a sense, by increasing MinPts we are looking for denser clusters.
This can break not-so-dense clusters into more than one part, which can reduce the sizes of
individual clusters and increase the number of clusters.

For the next question, kindly download the dataset - DS1. The first two columns in the dataset
correspond to the co-ordinates of each data point. The third column corresponds to the actual
cluster label.
DS1: https://bit.ly/2Lm75Ly
7. (1 mark) Visualize the dataset DS1. Which of the following algorithms will be able to recover
the true clusters? (First check by visual inspection and then write code to see if the result
matches what you expected.)
(a) K-means clustering
(b) Single link hierarchical clustering

(c) Complete link hierarchical clustering
(d) Average link hierarchical clustering
Sol. (b)
The dataset contains spiral clusters. Single link hierarchical clustering can recover spiral
clusters with appropriate parameter settings.

8. For two independent runs of k-means clustering, is it guaranteed that we get the same
clustering results? Note: the seed value is not preserved in independent runs.
(a) No
(b) Yes
(c) Only when the number of clusters are even

Sol. (a)
k-means is sensitive to the random initialization of the cluster centers, so independent runs can converge to different clusterings.
9. (1 mark) Consider the similarity matrix given below. Which of the following shows the
hierarchy of clusters created by the single link clustering algorithm?

P1 P2 P3 P4 P5 P6
P1 1.0000 0.7895 0.1579 0.0100 0.5292 0.3542
P2 0.7895 1.0000 0.3684 0.2105 0.7023 0.5480
P3 0.1579 0.3684 1.0000 0.8421 0.5292 0.6870
P4 0.0100 0.2105 0.8421 1.0000 0.3840 0.5573
P5 0.5292 0.7023 0.5292 0.3840 1.0000 0.8105
P6 0.3542 0.5480 0.6870 0.5573 0.8105 1.0000

Sol. (b)

10. (1 mark) For the similarity matrix given in the previous question, which of the following shows
the hierarchy of clusters created by the complete link clustering algorithm?

Sol. (d)

Assignment 11
Introduction to Machine Learning
Prof. B. Ravindran
1. During parameter estimation for a GMM model using data X, which of the following quantities
are you minimizing (directly or indirectly)?
(a) Log-likelihood
(b) Negative Log-likelihood
(c) Cross-entropy
(d) Residual Sum of Squares (RSS)
Sol. (b)

2. (2 marks) When executing the Expectation Maximization algorithm, a common problem
is the sheer number of parameters to estimate. For a typical K-component Gaussian
Mixture Model in an n-dimensional space, how many independent parameters are being
estimated in total?
(a) 2Kn
(b) K n(n+1)/2 + K − 1
(c) K n(n−1)/2 + K
(d) K n(n+3)/2 + K − 1
(e) None of the above.
Sol. (d)
Each Gaussian component contributes n mean parameters and n(n+1)/2 parameters for its symmetric covariance matrix, i.e. n(n+3)/2 per component; the mixing weights contribute K − 1 more.
3. Which of the following is an assumption that reduces Gaussian Mixture Models to K-means?
(a) The multivariate Gaussians have infinite variance.
(b) The multivariate Gaussians have diagonal co-variance matrices.
(c) The multivariate Gaussians have 0 variance.
(d) The multivariate Gaussians all have the same co-variance matrix.
Sol. (c)
4. (2 marks) Given N samples x1 , x2 , . . . , xN drawn independently from a Gaussian distribution
with variance σ 2 and unknown mean µ. Assume that the prior distribution of the mean is
also a Gaussian distribution, but with parameters mean µp and variance σp2 . Find the MAP
estimate of the mean.
(a) µ_MAP = (σ² µp + σp² Σ_{i=1}^{N} xi) / (σ² + N σp²)
(b) µ_MAP = (σ² + σp² Σ_{i=1}^{N} xi) / (σ² + σp²)
(c) µ_MAP = (σ² + σp² Σ_{i=1}^{N} xi) / (σ² + N σp²)
(d) µ_MAP = (σ² µp + σp² Σ_{i=1}^{N} xi) / (N(σ² + σp²))
Sol. (a)
For a MAP estimate, we try to maximize f(µ) f(X|µ):

f(µ) f(X|µ) = (1/(σp √(2π))) e^(−(µ−µp)²/(2σp²)) × Π_i (1/(σ √(2π))) e^(−(xi−µ)²/(2σ²))

We maximize this with respect to µ after taking the logarithm. This yields the equation

Σ_i xi/σ² + µp/σp² − µ (N/σ² + 1/σp²) = 0

Solving for µ gives option (a).
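A numerical sanity check of the closed form on synthetic data (a sketch; all the numbers below are illustrative assumptions, not from the question):

import numpy as np

# Compare the closed-form MAP estimate with a grid search over the log-posterior.
rng = np.random.default_rng(0)
mu_true, sigma, mu_p, sigma_p = 5.0, 2.0, 0.0, 1.0
x = rng.normal(mu_true, sigma, size=50)
map_closed = (sigma**2 * mu_p + sigma_p**2 * x.sum()) / (sigma**2 + len(x) * sigma_p**2)
grid = np.linspace(-10, 10, 20001)
log_post = -(grid - mu_p)**2 / (2 * sigma_p**2) \
           - ((x[:, None] - grid[None, :])**2).sum(axis=0) / (2 * sigma**2)
print(map_closed, grid[np.argmax(log_post)])  # the two values agree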
5. (2 marks) You are presented with a dataset that has hidden/missing variables that influences
your data. You are asked to use Expectation Maximization algorithm to best capture the data.
How would you define the E and M in Expectation Maximization?
(a) Estimate the Missing/Latent Variables in the Dataset, Maximize the likelihood over the
parameters in the model.
(b) Estimate the number of Missing/Latent Variables in the Dataset, Maximize the likelihood
over the parameters in the model.
(c) Estimate likelihood over the parameters in the model, Maximize the number of Miss-
ing/Latent Variables in the Dataset.
(d) Estimate the likelihood over the parameters in the model, Maximize the number of
parameters in the model.
Sol. (a)

6. During parameter estimation for a GMM model using data X, which of the following quantities
are you minimizing (directly or indirectly)?
(a) Log-likelihood
(b) Negative Log-likelihood
(c) Cross-entropy
(d) Residual Sum of Squares (RSS)
Sol. (b)

7. You are given n p-dimensional data points. The task is to learn a classifier to distinguish
between k classes. You come to know that the dataset has missing values. Can you use the
EM algorithm to fill in the missing values (without making any further assumptions)?
(a) Yes
(b) No
Sol. (b)
