APS1070 – Foundations of Data Analytics and Machine Learning
Midterm Examination Fall 2019
Open book
Non-programmable & non-communicating calculators are allowed
Time allotted: 90 minutes
1. We discussed K-Nearest Neighbour Classification (k-NN) in class, a simple and
intuitive way of classifying data.
a) Here are data points plotted in 2D space:
What is the predicted class of a new data point at x = 5, y = 5, using a
k-NN classifier and Euclidean distance with k = 3? (“ ”, “ ” or “ ”) [2]
b) In the dataset above, what is the predicted class of a new data point at
x = 11, y = 7, using Manhattan distance, for k = 5? (“ ”, “ ” or “ ”) [2]
c) In general, if k is increased, which of the following statements is correct? [2]
i. The k-NN decision boundary is smoothed and the noise
sensitivity is increased.
ii. The k-NN decision boundary is jagged and the noise sensitivity is
increased.
iii. The k-NN decision boundary is smoothed and the noise
sensitivity is decreased.
iv. The k-NN decision boundary is jagged and the noise sensitivity is
decreased.
d) In general, if you build a k-NN classifier that achieves high accuracy on
training data, but gets poor accuracy on test data, which of the following
statements is most likely correct? [2]
i. The model is overfitting.
ii. The model is underfitting.
iii. The model is neither overfitting nor underfitting.
iv. The model is both overfitting and underfitting.
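For review, the k-NN procedure behind Question 1 can be sketched in a few lines of Python. This is a minimal illustration only: the points and labels below are hypothetical and are NOT the data from the exam figure, and the helper names (`euclidean`, `manhattan`, `knn_predict`) are made up for this sketch.

```python
from collections import Counter
import math


def euclidean(p, q):
    # Straight-line distance between two points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))


def knn_predict(points, labels, query, k, distance):
    # Rank training points by distance to the query, then take a majority
    # vote among the k closest ones
    order = sorted(range(len(points)), key=lambda i: distance(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data (assumption, not the exam's plotted points)
points = [(1, 1), (2, 1), (1, 2), (6, 6), (7, 6), (6, 7), (7, 7)]
labels = ["circle", "circle", "circle", "square", "square", "square", "square"]

pred_euclid = knn_predict(points, labels, (5, 5), k=3, distance=euclidean)
pred_manhattan = knn_predict(points, labels, (2, 2), k=5, distance=manhattan)
```

Changing `k` or the distance function changes which neighbours vote, which is the effect parts a) to c) ask about.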
2. Here are four scatterplots, each expressing the relation between two variables:
Rank the datasets A, B, C and D in terms of correlation coefficient, from lowest to
highest [2].
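As a reminder of what is being ranked, the Pearson correlation coefficient can be computed as in the sketch below. The three small datasets are hypothetical stand-ins, not the scatterplots A to D, and `pearson_r` is an illustrative helper name.

```python
import math


def pearson_r(xs, ys):
    # Covariance of x and y divided by the product of their spreads
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data (assumption): increasing, decreasing, and noisy relations
xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])
r_down = pearson_r(xs, [10, 8, 6, 4, 2])
r_weak = pearson_r(xs, [3, 1, 4, 1, 5])
```

The coefficient is signed, so a strong negative relation ranks below a weak positive one when ordering from lowest to highest.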
3. Here are two vectors x1 and x2: $x_1 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$, $x_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}$
a) Are x1 and x2 orthogonal? [2]
b) Calculate the norm of x1 and the norm of x2 [2]
c) Do x1 and x2 form an orthonormal basis for vector space R2? Why? [2]
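One way to check answers to Question 3 is sketched below, taking x1 = (2, 1) and x2 = (1, -2) as read from the question. The helper names `dot` and `norm` are illustrative; the sketch assumes the standard dot product and Euclidean norm.

```python
import math


def dot(u, v):
    # Standard dot product of two equal-length vectors
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    # Euclidean (L2) norm
    return math.sqrt(dot(u, u))


x1 = [2, 1]
x2 = [1, -2]

# Orthogonal: the dot product is zero
orthogonal = dot(x1, x2) == 0
# Orthonormal additionally requires every vector to have unit length
unit_length = all(math.isclose(norm(v), 1.0) for v in (x1, x2))
orthonormal = orthogonal and unit_length
```

Dividing each vector by its norm is the usual way to turn an orthogonal pair into an orthonormal one.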
4. Calculate the inverse of matrix A by Gaussian elimination. [2]
$$A = \begin{bmatrix} 1 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix}$$
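A hand calculation for Question 4 can be checked with a short Gauss-Jordan routine on the augmented matrix [A | I]. This is a minimal sketch, not a numerically robust implementation: it uses exact fractions, the function name `invert` is illustrative, and the entries of A are taken as read from the question.

```python
from fractions import Fraction


def invert(matrix):
    n = len(matrix)
    # Build the augmented matrix [A | I] with exact rational arithmetic
    aug = [
        [Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero pivot
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        # Eliminate the pivot column from every other row
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The right half of the augmented matrix is now the inverse
    return [row[n:] for row in aug]


A = [[1, 1, 0], [-1, 0, 0], [0, 1, 1]]
A_inv = invert(A)

# Verify the result by multiplying back: A times A_inv should be the identity
product = [
    [sum(A[i][k] * A_inv[k][j] for k in range(3)) for j in range(3)]
    for i in range(3)
]
```

Multiplying back, as in the last step, is a quick sanity check that also works on paper.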
5. You build a classification model for cancer detection using an imbalanced training
dataset and achieve an accuracy of 97% when testing on new data. Explain how
this performance can be deceiving, and what performance metric(s) might be more
appropriate. [2]
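The effect described in Question 5 can be illustrated with a hypothetical confusion matrix. The counts below are invented for illustration (1000 patients, 30 with cancer, and a classifier that predicts "healthy" for everyone); `metrics` is an illustrative helper, not part of any library.

```python
def metrics(tp, fp, fn, tn):
    # Standard definitions from confusion-matrix counts
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return accuracy, precision, recall, f1


# Hypothetical counts (assumption): every patient is predicted healthy
accuracy, precision, recall, f1 = metrics(tp=0, fp=0, fn=30, tn=970)
```

In this made-up case the accuracy looks strong while the recall shows that no cancer case was detected.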