Lecture Notes 3 Perceptron
PERCEPTRON
Consider the problem of supervised learning. In this case, the data is given as a set of pairs
D = {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))}.
Goal: we want to learn the underlying relationship (hypothesis) between x and y, so that next
time we are given an instance of an input x, we can predict its corresponding output y.
For example, x might be a customer, patient, image, text, song, etc., and we need to
represent it numerically.
• In this section, we will study a binary classification problem where we are trying to separate
two classes of data instances (data points, examples, samples). We assume y ∈ {+1, −1},
denoting the labels for data instances.
(Figure source: https://www.v7labs.com/blog/semi-supervised-learning-guide)
• To create a learning algorithm, we need a hypothesis class H, which is a set of all possible
functions (hypotheses) x ↦ y. More precisely,
y = h(x; θ)
where θ denotes the parameters of the hypothesis.
• Next, we define what makes one hypothesis better than another. We define the loss (cost,
error) function for a single data instance (x^(i), y^(i)) that measures the error of predicting the
class label ŷ^(i) := h(x^(i)) when the actual true class label is y^(i). We denote the loss by
L(ŷ^(i), y^(i)). The learning algorithm looks for a hypothesis h ∈ H whose average loss on the
training data is small, and we hope that the loss on the test data (additional data that the
algorithm has not seen) {(x^(n+1), y^(n+1)), (x^(n+2), y^(n+2)), ..., (x^(n+n'), y^(n+n'))} will also be small:

E_test(h) = (1/n') Σ_{i=n+1}^{n+n'} L(h(x^(i)), y^(i))    (test error)
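For concreteness, a minimal NumPy sketch of this computation, assuming the zero-one loss
L(ŷ, y) = 1 if ŷ ≠ y and 0 otherwise (the function names here are illustrative, not fixed by the notes):

import numpy as np

def zero_one_loss(y_hat, y):
    # 1 for every misclassified instance, 0 for every correct one (assumed choice of loss)
    return np.where(y_hat == y, 0, 1)

def test_error(h, X_test, y_test):
    # average loss of the hypothesis h over data the algorithm has not seen
    y_hat = np.array([h(x) for x in X_test])
    return np.mean(zero_one_loss(y_hat, y_test))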
• A learning algorithm takes the data D and a hypothesis class H and outputs a hypothesis h:

D → Learning Algorithm(H) → h

To design such an algorithm, we can
– be a clever human
– use optimization
• In this section, we consider linear classifiers, i.e., we assume that H is a class of all possible
linear separators.
(Figure source: https://automaticaddison.com/linear-separability-and-the-xor-problem/)
h(x; w, b) = sign(w^T x + b) =   +1, if w^T x + b > 0
                                 −1, if w^T x + b < 0

where w ∈ R^d and b ∈ R.
Recall that

w^T x = [w1 w2 ... wd] [x1, x2, ..., xd]^T = w1 x1 + w2 x2 + ... + wd xd = Σ_{j=1}^{d} wj xj
• The algorithm computes the weighted sum of the inputs; if this weighted sum exceeds
some threshold (specifically, −b), the predicted label is positive, and if the weighted sum
is less than this threshold, the predicted label is negative.
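For illustration, a minimal NumPy sketch of this decision rule; the weights, bias, and the
function name predict below are arbitrary example choices, not part of the notes:

import numpy as np

def predict(x, w, b):
    weighted_sum = np.dot(w, x)              # weighted sum of the inputs
    # positive label when the weighted sum exceeds the threshold -b, negative otherwise
    return 1 if weighted_sum > -b else -1

w = np.array([1.0, -2.0])                    # example weights (d = 2)
b = 0.5                                      # example bias, so the threshold is -b = -0.5
print(predict(np.array([3.0, 1.0]), w, b))   # prints 1, since 1*3 + (-2)*1 = 1 > -0.5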
• For simplicity, from now on, we consider the case d = 2 with two inputs (x1, x2) ∈ R^2. We
can also visualize the above hypothesis as a single neuron.
Given an input (x1, x2), the predicted output ŷ is found using two steps:
1. the pre-activation value z is computed as the weighted sum of the inputs plus the bias,
z = w1 x1 + w2 x2 + b;
2. the predicted output ŷ is calculated by applying the activation function φ as
ŷ = φ(z).
Here φ is the sign function (see https://en.wikipedia.org/wiki/Sign_function).
Introducing the augmented vectors

x̄ = (x1, x2, 1)
w̄ = (w1, w2, b),

the pre-activation value can be written compactly as

z = w^T x + b = w1 x1 + w2 x2 + b = w̄^T x̄ = Σ_{i=1}^{3} w̄i x̄i
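A small NumPy check (with arbitrary example numbers) that the augmented form gives the
same pre-activation value as computing w^T x + b directly:

import numpy as np

x = np.array([3.0, 1.0])                 # inputs (x1, x2)
w = np.array([1.0, -2.0])                # weights (w1, w2)
b = 0.5                                  # bias

x_bar = np.array([3.0, 1.0, 1.0])        # augmented input x̄ = (x1, x2, 1)
w_bar = np.array([1.0, -2.0, 0.5])       # augmented weights w̄ = (w1, w2, b)

z_direct = np.dot(w, x) + b              # step 1 with w and b kept separate
z_augmented = np.dot(w_bar, x_bar)       # the same value from a single dot product
y_hat = 1 if z_augmented > 0 else -1     # step 2: ŷ = φ(z), with φ the sign function
print(z_direct, z_augmented, y_hat)      # 1.5 1.5 1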
• A single artificial neuron is a model of a biological neuron and is a building block for artificial
neural networks.
(Figure source: https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7)
• The decision boundary of this classifier is the set of points where the pre-activation value is zero,

w1 x1 + w2 x2 + b = 0,

which (assuming w2 ≠ 0) is the line

x2 = −(w1/w2) x1 − b/w2
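As an added illustration, the boundary line can be computed for plotting as follows (the
parameter values are hypothetical; w2 must be nonzero):

import numpy as np

w1, w2, b = 1.0, -2.0, 0.5                   # hypothetical learned parameters
x1_vals = np.linspace(-5, 5, 100)            # range of x1 values to draw the line over
x2_vals = -(w1 / w2) * x1_vals - b / w2      # corresponding x2 values on the boundary
# plotting (x1_vals, x2_vals) draws the separating line; instances on one side of it
# are labeled +1 and instances on the other side are labeled -1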
• Perceptron
perceptron(D, T, η):
    w̄ = 0̄                              (initialize the parameter w̄ = (w1, w2, b) to be the zero vector)
    for t = 1 to T:
        for i = 1 to n:
            ŷ^(i) = φ(w̄^T x̄^(i))
            w̄ := w̄ − η (ŷ^(i) − y^(i)) x̄^(i)
    return w̄
(A NumPy sketch of this pseudocode is given below, after the discussion of the update rule.)
– Here, T is the number of iterations and η is the learning rate of the ML model (typically,
a number between 0 and 1). Both T and η are hyperparameters.
– Let us try to understand the update rule for w̄ when we are at a data instance (x^(i), y^(i)).
Recall that y^(i) is the actual true class label, while ŷ^(i) := h(x^(i)) is the predicted class
label computed by our model.
∗ If our algorithm did not make a mistake, then y^(i) = ŷ^(i), so ŷ^(i) − y^(i) = 0 and w̄_new = w̄_old.
∗ If our algorithm misclassified x^(i), there are two possibilities:
  · If y^(i) = +1 but ŷ^(i) = −1, then w̄^T x̄^(i) < 0 and ŷ^(i) − y^(i) = −2, so the update is
    w̄ := w̄ + 2η x̄^(i).
    Note that since our algorithm claims w̄^T x̄^(i) = ‖w̄‖ ‖x̄^(i)‖ cos θ < 0, the angle θ between
    w̄ and x̄^(i) is more than 90°. The update moves w̄ toward x̄^(i): indeed,
    w̄_new^T x̄^(i) = w̄^T x̄^(i) + 2η ‖x̄^(i)‖², so repeated updates push the angle below 90°, as it
    should be for a positive instance.
  · If y^(i) = −1 but ŷ^(i) = +1, then w̄^T x̄^(i) > 0 and ŷ^(i) − y^(i) = +2, so the update is
    w̄ := w̄ − 2η x̄^(i).
    Note that since our algorithm claims w̄^T x̄^(i) = ‖w̄‖ ‖x̄^(i)‖ cos θ > 0, the angle θ between
    w̄ and x̄^(i) is less than 90°. The update moves w̄ away from x̄^(i):
    w̄_new^T x̄^(i) = w̄^T x̄^(i) − 2η ‖x̄^(i)‖², so repeated updates push the angle above 90°, as it
    should be for a negative instance.
– If D is linearly separable, the perceptron is guaranteed to converge and to produce a
classifier. If D is not linearly separable, the perceptron will not converge (see the proof
in [2], pages 18–20).
– Note that, written componentwise, the update rule states

[w1]   [w1]                        [x1^(i)]
[w2] = [w2] − η (ŷ^(i) − y^(i))    [x2^(i)]
[b ]   [b ]                        [  1   ]
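To make the pseudocode above concrete, here is a rough NumPy sketch of the same algorithm
for the 2-input case; the function name perceptron_fit and the default values of T and η are
illustrative choices only:

import numpy as np

def perceptron_fit(X, y, T=100, eta=0.1):
    # X: n-by-2 array of inputs (x1, x2); y: length-n array of labels in {+1, -1}
    n = X.shape[0]
    w_bar = np.zeros(3)                                   # w̄ = (w1, w2, b), initialized to zero
    for t in range(T):
        for i in range(n):
            x_bar = np.array([X[i, 0], X[i, 1], 1.0])     # augmented input x̄^(i)
            # ŷ^(i) = φ(w̄^T x̄^(i)); ties at 0 are assigned -1 here by convention
            y_hat = 1 if np.dot(w_bar, x_bar) > 0 else -1
            w_bar = w_bar - eta * (y_hat - y[i]) * x_bar  # update rule
    return w_bar

# tiny linearly separable example
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron_fit(X, y))                               # returns (w1, w2, b)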
Homework 1:
• We mentioned that the perceptron converges if the data is linearly separable. Try the sklearn
Perceptron model for versicolor and virginica, with sepal length and petal length. What do you
observe?
• We created the My Perceptron class for only 2 inputs. Extend this code to 3 inputs. Investigate
the iris data set and choose 3 features to classify setosa and versicolor using your code. Notice
that you cannot easily plot the decision boundary now since the data is 3-dimensional, but you
can still compare the actual and the predicted labels to see how your algorithm is performing.
• Try to generalize the My Perceptron code so it can be used for any number of inputs. (Hint:
Recall that for a list w we can use w[-1] and w[:-1] to access the last value in the list and
all the values except the very last one. Also, use np.dot, the NumPy dot product, to compute
the pre-activation value z.)
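For reference, a minimal illustration of the indexing mentioned in the hint (the numbers are
arbitrary; this is not the full homework solution):

import numpy as np

w = [0.4, -1.2, 0.7, 0.5]              # weights for 3 inputs followed by the bias b
x = np.array([1.0, 2.0, 3.0])          # one data instance with 3 inputs

b = w[-1]                              # the last value in the list is the bias
weights = np.array(w[:-1])             # all the values except the very last one
z = np.dot(weights, x) + b             # pre-activation value z
print(z)                               # 0.4*1 + (-1.2)*2 + 0.7*3 + 0.5 = 0.6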
[1] Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, Géron (pages 279–288)
[2] MIT Open Learning Library, 6.036 Introduction to Machine Learning: https://openlearninglibrary.mit.edu/courses/course-v1:MITx+6.036+1T2019/course/