Why is KNN a poor choice for a spam filter?
What is KNN?
KNN stands for K-Nearest Neighbors. It is a very simple algorithm used to solve classification problems: a new point is assigned the majority class among its K nearest training points, where K is the number of neighbors considered.
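As a minimal sketch of the idea (using scikit-learn; the toy data below is invented purely for illustration):

# Minimal KNN sketch (toy data invented for illustration).
from sklearn.neighbors import KNeighborsClassifier

# Two features per point; labels 0 and 1.
X_train = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y_train = [0, 0, 1, 1]

# K = 3 means each prediction is a majority vote over the 3 closest points.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[0.1, 0.2]]))  # near the class-0 points, so it prints [0]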
Why KNN is a poor choice as a spam filter
KNN classifiers work well whenever there is a really meaningful distance metric. In the spam case, a KNN classifier will label as spam the messages that are “close” to known spam, where “close” is defined by your distance metric (which, for e-mail text, will likely be poor).
Therefore, a KNN classifier will only filter spam that is really similar to spam you already know about; it won’t generalize properly.
Also, you have to train on non-spam examples too, and KNN suffers from the same problem there: it will only confidently say something is non-spam if it is written very similarly to a non-spam e-mail it was trained on.
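To make the “distance between e-mails” problem concrete, here is a hedged sketch of a KNN spam filter over bag-of-words counts (the tiny corpus and the scikit-learn pipeline are illustrative assumptions, not a production design):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: 1 = spam, 0 = non-spam.
emails = [
    "win free money now",           # spam
    "claim your free prize today",  # spam
    "meeting moved to friday",      # non-spam
    "please review the report",     # non-spam
]
labels = [1, 1, 0, 0]

# Bag-of-words counts + Euclidean KNN: "close" just means sharing
# many of the same word counts, nothing deeper.
clf = make_pipeline(CountVectorizer(), KNeighborsClassifier(n_neighbors=3))
clf.fit(emails, labels)

# Spam phrased with unseen words lands "far" from the training spam
# under this metric, so it can easily be labeled non-spam.
print(clf.predict(["urgent unclaimed reward waiting"]))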
Limitations of KNN as a spam filter
1. Doesn’t work well with a large dataset:
Since KNN is a distance-based algorithm, the cost of calculating the distance between a new point and every existing point is very high, which degrades the performance of the algorithm (see the brute-force sketch after this list).
2. Doesn’t work well with a high number of dimensions:
For the same reason as above: in a higher-dimensional space, calculating distances becomes even more expensive, and the distances themselves become less meaningful, which hurts performance.
[Figure: distribution of the e-mails data set]
3. Sensitive to outliers and missing values:
KNN is sensitive to outliers and missing values, so we first need to impute the missing values and get rid of the outliers before applying the KNN algorithm.
4. Needs feature scaling:
We need to do feature scaling (standardization or normalization) before applying the KNN algorithm to any dataset; if we don’t, features on larger scales dominate the distance and KNN may generate wrong predictions (see the second sketch after this list).
5. Predictions vary with the value of ‘k’, so accuracy may be poor:
For example, with respect to the given data, if k = 3 the query point belongs to class B, but if k = 7 it belongs to class A. So, for different values of k, the prediction may vary (the second sketch after this list shows such a flip).
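On points 1 and 2, a rough brute-force sketch (pure NumPy; the sizes are invented for illustration) shows why the cost grows with both the number of training points n and the number of dimensions d:

import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 100  # invented sizes; real mail corpora are far larger
X_train = rng.normal(size=(n, d))
query = rng.normal(size=d)

# Brute-force KNN computes a distance to every stored point for every
# query: O(n * d) work, with the whole training set kept in memory.
dists = np.linalg.norm(X_train - query, axis=1)
k = 5
nearest = np.argpartition(dists, k)[:k]  # indices of the 5 closest points
print(nearest, dists[nearest])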
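On points 4 and 5, a small scikit-learn sketch (toy data invented) shows both effects at once: without scaling, the large-valued feature dominates the distance, and the predicted class can flip as k changes:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Toy data: feature 0 is small-valued, feature 1 is large-valued.
X = np.array([[25, 40_000], [27, 42_000], [30, 95_000],
              [45, 44_000], [50, 46_000], [52, 98_000], [48, 99_000]])
y = np.array(["B", "B", "A", "B", "B", "A", "A"])
query = np.array([[26, 90_000]])

scaler = StandardScaler().fit(X)
for k in (3, 7):
    # Without scaling, feature 1 dominates the Euclidean distance.
    raw = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(query)
    # With standardization, both features contribute comparably.
    scaled = (KNeighborsClassifier(n_neighbors=k)
              .fit(scaler.transform(X), y)
              .predict(scaler.transform(query)))
    print(f"k={k}: raw -> {raw[0]}, scaled -> {scaled[0]}")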
Failure cases of KNN
Case 1
In this case, the data is grouped in clusters, but the query point lies far away from the actual grouping. We can still use the K nearest neighbors to assign a class, but it doesn’t make much sense, because the query point (the yellow point in the original figure) is really far from every data point, and hence we can’t be very confident in the prediction.
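One hedged way to detect this situation is to check the distance to the nearest neighbors and refuse to predict when the query is far from all training data (the threshold below is an invented illustration, not a standard rule):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
# Two tight, invented clusters of training data.
X_train = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
                     rng.normal(5, 0.3, size=(50, 2))])
nn = NearestNeighbors(n_neighbors=5).fit(X_train)

def confident(query, threshold=1.0):
    # Return False when the query is far from all of its neighbors.
    dists, _ = nn.kneighbors([query])
    return float(dists.mean()) < threshold  # invented threshold

print(confident([0.1, 0.2]))   # inside a cluster -> True
print(confident([2.5, 10.0]))  # far from both clusters -> False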
Case 2
In this case, the data is randomly spread, so no useful information can be obtained from it. Given a query point (again, the yellow point), the KNN algorithm will still find the k nearest neighbors, but since the data points are jumbled, the accuracy is questionable.